大模型推理能力 - arXivDaily 专题

2606.18910 2026-06-18 cs.LG cs.CL 新提交 90%

REVES: REvision and VErification--Augmented Training for Test-Time Scaling

REVES：通过修订与验证增强的测试时扩展训练

Yuanxin Liu, Ruida Zhou, Xinyan Zhao, Amr Sharaf, Hongzhou Lin, Arijit Biswas, Mohammad Ghavamzadeh, Zhaoran Wang, Mingyi Hong

发表机构 * Northwestern University（西北大学）； Amazon AGI（亚马逊人工智能实验室）； Qualcomm AI Research（高通人工智能研究）； University of Minnesota（明尼苏达大学）

专题命中测试时计算：通过修订与验证增强测试时扩展推理

AI总结提出REVES框架，通过将中间步骤的“接近正确”答案转化为解耦的修订和验证提示，实现高效的离策略数据生成，提升大语言模型的多步推理能力，在LiveCodeBench上比强化学习基线高6.5分。

详情

AI中文摘要

通过顺序修订进行测试时扩展已成为增强大语言模型（LLM）推理能力的强大范式。然而，标准的后训练方法主要优化单次目标，与多步推理动态存在根本性不匹配。虽然最近的工作将其视为多轮强化学习（RL），但传统方法直接优化多步轨迹，未能进一步利用模型可以从纠正中学习的中间步骤中的高质量错误。我们提出了一个两阶段迭代框架，交替进行在线数据/提示增强和策略优化。通过将成功恢复轨迹中的中间步骤（“接近正确”答案）转化为解耦的修订和验证提示，我们的方法将训练集中在有效的答案转换和错误识别上。与标准的多轮RL相比，这种方法实现了高效的离策略数据生成，并减少了长程采样的计算开销。在LiveCodeBench上，使用公开可用的测试用例作为反馈，我们观察到比RL基线高6.5分，比标准多轮训练高4.0分。除了编码，我们的方法在圆填充问题上达到了先前报告的SOTA结果，同时使用了最小的基础模型（4B）和远少于更大进化搜索系统的采样次数。在真实验证下的数学结果进一步证实了改进的纠正能力。该方法还泛化到分布外的约束满足谜题，如n皇后和迷你数独，其中正确性完全由问题约束定义。代码可在该https URL获取。

英文摘要

Test-time scaling via sequential revision has emerged as a powerful paradigm for enhancing Large Language Model (LLM) reasoning. However, standard post-training methods primarily optimize single-shot objectives, creating a fundamental misalignment with multi-step inference dynamics. While recent work treats this as multi-turn reinforcement learning (RL), conventional approaches optimize over the multi-step trajectories directly, failing to further exploit the high-quality mistakes in intermediate steps that model can learn from correcting them. We propose a two-stage iterative framework that alternates between online data/prompt augmentation and policy optimization. By converting the intermediate steps (``near-miss'' answers) in the successful recovery trajectories into decoupled revision and verification prompts, our approach concentrates training on both effective answer transformation and error identification. This approach enables efficient off-policy data generation and reduces the computational overhead of long-horizon sampling compared to standard multi-turn RL. On LiveCodeBench, using publicly available test cases as feedback, we observe gains of +6.5 points over the RL baseline and +4.0 points over standard multi-turn training. Beyond coding, our approach matches the previously reported SOTA result on circle packing while using the smallest base model (4B) and far fewer rollouts than the much larger evolutionary search systems. Math results under ground-truth verification further confirm improved correction ability. It also generalizes to out-of-distribution constraint-satisfaction puzzles such as n\_queens and mini\_sudoku, where correctness is defined entirely by problem constraints. Code is available at https://github.com/yxliu02/REVES.git.

URL PDF HTML ☆

赞 0 踩 0