FABSVer: Faster Training and Better Self-Verification for LLM Mathematical Reasoning
Haihui Pan, Junwei Bao, Hongfei Jiang, Yang Song
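AI summary: Proposes FABSVer, which fuses solution generation and self-verification into a single generation pass and introduces Dynamic Reference Model Update (DRMU) to break through the reward bottleneck, achieving better self-verification and reasoning performance across three model scales with only 51% to 71% of the training time of existing methods.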
While large language models have made significant progress in mathematical reasoning, they remain unreliable at judging the correctness of their own solutions. Existing approaches that equip models with self-verification typically treat solution generation and verification as two separate tasks, leading to substantially increased training time. In this paper, we propose FABSVer, which fuses these two tasks into a single generation pass, dramatically reducing training overhead while jointly optimizing both capabilities. We further identify a convergence bottleneck both theoretically and empirically: as training progresses, the reward reaches a plateau because the policy is constrained by a fixed reference model. To overcome this, we introduce Dynamic Reference Model Update (DRMU), which raises the reward ceiling and enables sustained reward growth. Extensive experiments on math benchmarks demonstrate that FABSVer achieves superior self-verification and reasoning performance across three model scales, while requiring only 51% to 71% of the training time of existing methods. Analysis further reveals distinct learning phases in how models acquire self-verification, and that the gap between verify and answer rewards shrinks noticeably as model size increases.
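The plateau under a fixed reference model can be illustrated with a toy example. The sketch below is not the paper's algorithm: it assumes a standard KL-regularized objective, E_pi[r] - beta * KL(pi || ref), over a small finite set of candidate answers, for which the maximizer has the well-known closed form pi proportional to ref * exp(r / beta). The function names, the reward vector, and the choice to refresh the reference every round are all hypothetical and chosen only for illustration.

```python
import math

def kl_optimal_policy(ref, rewards, beta):
    """Closed-form maximizer of E_pi[r] - beta * KL(pi || ref) on a finite set."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

def expected_reward(policy, rewards):
    """Expected reward of a discrete policy."""
    return sum(p * r for p, r in zip(policy, rewards))

def train(rewards, beta, rounds, update_reference):
    """Repeatedly solve the KL-regularized problem.

    If update_reference is True, the reference is replaced by the current
    policy after every round (an assumed, simplified stand-in for a dynamic
    reference update); otherwise the reference stays at its initial value.
    """
    ref = [1.0 / len(rewards)] * len(rewards)  # uniform initial reference
    history = []
    for _ in range(rounds):
        policy = kl_optimal_policy(ref, rewards, beta)
        history.append(expected_reward(policy, rewards))
        if update_reference:
            ref = policy
    return history

# Hypothetical setup: four candidate answers, only the last one is correct.
rewards = [0.0, 0.0, 0.0, 1.0]
fixed = train(rewards, beta=1.0, rounds=5, update_reference=False)
dynamic = train(rewards, beta=1.0, rounds=5, update_reference=True)

print("fixed reference:  ", [round(v, 3) for v in fixed])
print("dynamic reference:", [round(v, 3) for v in dynamic])
```

In this toy setting the fixed-reference run returns the same expected reward every round, because the best achievable policy is pinned to the initial reference, while the run that refreshes its reference keeps moving probability mass toward the correct answer. How FABSVer schedules and performs the update in practice is not specified in the abstract.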