NOVA: NOise-aware Verbal Confidence CAlibration for Robust Large Language Models in RAG Systems
Jiayu Liu, Rui Wang, Qing Zong, Yumeng Wang, Cheng Qian, Qingcheng Zeng, Tianshi Zheng, Haochen Shi, Dadi Guo, Baixuan Xu, Chunyang Li, Yangqiu Song
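AI summary: Proposes the NOVA framework, which uses rule-guided supervised fine-tuning to address overconfidence caused by noisy contexts in retrieval-augmented generation, improving ECE by 10.9% in-domain and 8.0% out-of-domain.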
Accurately assessing model confidence is essential for deploying large language models (LLMs) in mission-critical factual domains. While retrieval-augmented generation (RAG) is widely adopted to improve grounding, confidence calibration in RAG settings remains poorly understood. We conduct a systematic study across four benchmarks and find that LLMs are poorly calibrated, especially when noisy contexts are retrieved. Specifically, contradictory or irrelevant evidence tends to exacerbate the model's overconfidence. To address this, we propose NOVA Rules (NOise-Aware Verbal Confidence CAlibration Rules), which provide a principled foundation for resolving overconfidence under noise. We further design NOVA, a noise-aware calibration framework that synthesizes supervision from ~2K HotpotQA examples guided by these rules. By performing supervised fine-tuning (SFT) on this data, NOVA equips models with intrinsic noise awareness without relying on stronger teacher models. Empirical results show that NOVA yields substantial gains, improving expected calibration error (ECE) by 10.9% in-domain and 8.0% out-of-domain. By bridging the gap between retrieval noise and verbal calibration, NOVA paves the way for LLMs that are both accurate and epistemically reliable.
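
For readers unfamiliar with the ECE metric reported above, the following is a minimal sketch of the standard binned expected calibration error. It is not the authors' evaluation code: the function name `expected_calibration_error`, the choice of 10 equal-width bins, and the assumption that verbalized confidences are already parsed into floats in [0, 1] are all illustrative assumptions, since the abstract does not specify them.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    confidences: floats in [0, 1] (assumed already parsed from the model's
                 verbalized confidence; parsing is out of scope here).
    correct:     booleans, True where the model's answer was right.
    n_bins:      number of equal-width bins (10 is a common default,
                 assumed here, not taken from the paper).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Assign each prediction to an equal-width confidence bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx].append((conf, 1.0 if ok else 0.0))

    # Sum the per-bin calibration gaps, weighted by bin size.
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(a for _, a in items) / len(items)
        ece += (len(items) / n) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # An overconfident toy model: always says 0.9 but is right 3 times out of 4.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                     [True, True, True, False]))
```

Lower ECE means better calibration, so a model that states high confidence while answering incorrectly under noisy retrieved context would show a large gap in the high-confidence bins.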