arXivDaily arXiv每日学术速递 周一至周五更新

AI 大模型

大模型推理能力

大模型数学、逻辑、规划、多步推理和测试时计算能力。

今日/当前日期收录 5 信号源:cs.CL, cs.AI, cs.LG
2606.19354 2026-06-19 cs.CL cs.LG 新提交 90%

Granularity-Regulated Adaptive Computational Efficiency for Optimal Verification in Test-Time Scaling

粒度调控的自适应计算效率:测试时扩展中的最优验证

Ardit Krasniqi, Luan Vejsiu, Elira Dervishi

发表机构 * European University of Tirana(欧洲地拉那大学)

专题命中 测试时计算 :测试时扩展中验证粒度自适应理论框架

AI总结 提出GRACE理论框架,将验证粒度建模为问题难度、验证器准确率和计算预算的函数,证明存在相变:细粒度验证在计算预算大或问题难时占优,粗粒度验证在低预算简单问题时更优,自适应策略可达到计算-性能帕累托前沿。

详情
AI中文摘要

测试时扩展(TTS)已成为一种强大的范式,通过在推理时投入额外计算来提升大语言模型(LLMs)的推理性能。TTS的核心组件是验证器,它选择或评分候选解以引导搜索过程。虽然先前工作已探索验证的益处,但一个基本问题仍未充分探索:在给定计算预算下,最优验证粒度是什么?粗粒度的结果奖励模型(ORMs)和细粒度的过程奖励模型(PRMs)代表两个极端,但两者单独均无法在所有场景下实现计算最优性。本文建立了一个统一的理论框架,称为GRACE(粒度调控的自适应计算效率),该框架将最优验证粒度刻画为问题难度、验证器准确率和计算预算的显式函数。我们证明存在一个相变:当计算预算大或问题难时,细粒度验证占优;而在低预算、简单问题场景下,粗粒度验证更受青睐。我们的理论将Best-of-N、束搜索和步骤级MCTS统一在一个帕累托最优框架内,并激发了一种自适应粒度策略,该策略可证明达到计算-性能帕累托前沿。在MATH-500、GSM8K和AIME基准上的实验结果证实了所有四个理论主张,在匹配计算量下,我们的自适应策略相比固定粒度基线准确率提升高达3.1%。

英文摘要

Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning performance of large language models (LLMs) by investing additional compute at inference time. A central component of TTS is the \emph{verifier}, which selects or scores candidate solutions to guide the search process. While prior work has explored the benefit of verification, a fundamental question remains underexplored: \emph{what is the optimal granularity of verification under a given compute budget?} Coarse-grained outcome reward models (ORMs) and fine-grained process reward models (PRMs) represent two extremes, yet neither alone achieves compute-optimality across all regimes. In this paper, we establish a unified theoretical framework, called \textbf{GRACE} (\underline{G}ranularity-\underline{R}egulated \underline{A}daptive \underline{C}omputational \underline{E}fficiency), that characterizes the optimal verification granularity as an explicit function of problem difficulty, verifier accuracy, and compute budget. We prove that there exists a phase transition: fine-grained verification dominates when either the compute budget is large or the problem is hard, whereas coarse-grained verification is preferred in the low-budget, easy-problem regime. Our theory unifies Best-of-$N$, beam search, and step-level MCTS within a single Pareto-optimality framework, and motivates an adaptive granularity strategy that provably achieves the compute-performance Pareto frontier. Empirical results on MATH-500, GSM8K, and AIME benchmarks corroborate all four theoretical claims, with our adaptive strategy outperforming fixed-granularity baselines by up to 3.1\% accuracy at matched compute.

2606.19919 2026-06-19 cs.LG 新提交 85%

ADaPT: Token-Level Decoupling for Efficient Large Reasoning Models

ADaPT:面向高效大推理模型的令牌级解耦

Tingyun Li, Zishang Jiang, Jinyi Han, Xinyi Wang, Sihang Jiang, Han Xia, Zhaoqian Dai, Shuguang Ma, Fei Yu, Jiaqing Liang, Yanghua Xiao

发表机构 * School of Data Science, Fudan University(复旦大学数据科学学院) Shanghai Institute of Artificial Intelligence for Education, East China Normal University(华东师范大学上海智能教育研究院) College of Computer Science and Artificial Intelligence, Fudan University(复旦大学计算机科学与人工智能学院) Ant Group(蚂蚁集团)

专题命中 测试时计算 :提出令牌级解耦框架ADaPT提升推理效率

AI总结 提出ADaPT,通过令牌级双过程框架解耦效率与正确性信号,引入模式选择令牌控制快慢推理,实现推理时效率-性能权衡的精确连续控制,在降低推理成本的同时保持强推理能力。

详情
AI中文摘要

大型推理模型依赖长思维链实现强性能,但统一应用此类推理会产生高计算成本。现有面向效率的方法试图缩短或混合推理策略,但往往会降低推理能力。我们将根本原因识别为效率激励与正确性优化之间的序列级耦合,这隐式惩罚了长但正确的推理轨迹。为解决此问题,我们提出自适应双过程思维(ADaPT),一种令牌级双过程框架,在训练期间显式解耦效率和正确性信号。ADaPT引入模式选择令牌来控制快速和慢速推理,将效率相关奖励仅应用于此令牌,以避免惩罚正确的长推理,同时在适当时鼓励效率。此外,ADaPT在推理时实现了对效率-性能权衡的精确连续控制:通过调整模式选择令牌的生成概率,单个训练好的模型可以平滑地沿效率-性能帕累托前沿移动。大量实验表明,ADaPT在多个基准测试中显著降低推理成本,同时保持强推理性能。

英文摘要

Large reasoning models rely on long chain-of-thought to achieve strong performance, but applying such reasoning uniformly incurs high computational cost. Existing efficiency-oriented methods attempt to shorten or mix reasoning strategies, yet often degrade reasoning capability. We identify the root cause as sequence-level coupling between efficiency incentives and correctness optimization, which implicitly penalizes long but correct reasoning trajectories. To address this issue, we propose Adaptive Dual-Process Thinking (ADaPT), a token-level dual-process framework that explicitly decouples efficiency and correctness signals during training. ADaPT introduces a mode-selection token to control fast and slow reasoning, applying efficiency-related rewards exclusively to this token to avoid penalizing correct long reasoning while encouraging efficiency when appropriate. Moreover, ADaPT enables precise and continuous control over the efficiency-performance trade-off at inference time: by adjusting the generation probability of the mode-selection token, a single trained model can smoothly move along the efficiency-performance Pareto frontier. Extensive experiments demonstrate that ADaPT significantly reduces inference cost while maintaining strong reasoning performance across multiple benchmarks.

2606.19808 2026-06-19 cs.AI cs.CL 新提交 85%

Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning

再思考还是更长时间思考?面向预算感知推理的选择性验证

Sajib Acharjee Dip, Dawei Zhou, Liqing Zhang

发表机构 * Department of Computer Science, Virginia Tech(弗吉尼亚理工大学计算机科学系) Fralin Biomedical Research Institute, Virginia Tech(弗吉尼亚理工大学弗拉林生物医学研究所) FBRI Cancer Research Center(FBRI癌症研究中心)

专题命中 测试时计算 :选择性验证框架用于预算感知推理,优化测试时计算。

AI总结 提出选择性验证框架SEVRA,通过服务层控制器决定是否对冻结求解器的初始答案进行验证,在Math500上以更少token达到更高准确率,并减少有害翻转。

详情
AI中文摘要

测试时推理越来越多地被用作服务时的控制旋钮,但额外的推理并非均匀有价值:它可以修复失败的尝试,在已经正确的答案上浪费计算,或引入有害的答案更改。我们将其视为一个部署分配问题,而非新验证器问题。我们引入SEVRA,即面向推理分配的选择性验证,这是一个服务层控制器,决定是保留冻结求解器的初始答案还是调用主动验证。使用冻结的Qwen3-4B求解器,我们记录干预结果并从服务可见的尝试状态训练可恢复性感知的门控。在Math500上,选择性验证达到76.3%的准确率,而始终验证为75.5%,同时将生成后token减少26.8%,有害翻转从2.2%降至1.0%。然而,8,192 token的初始求解达到76.0%的准确率,总模型token减少28%,表明选择性恢复有用但并非测试的最佳成本前沿。在冻结迁移到GSM时,选择性策略仅验证3.0%的样本,准确率从93.4%提升至94.5%,验证token相对于始终验证减少91.2%;同样,更长的初始求解以更少的实际token达到相同准确率。在CommonsenseQA上,始终开启的验证有害,而Self-Consistency@5以约五倍的实际token成本提升准确率。由此得出的部署规则是:首先调整初始预算,然后在需要显式检查、有限重试、可审计性或回归风险控制时使用选择性恢复。

英文摘要

Test-time reasoning is increasingly used as a serving-time control knob, but extra reasoning is not uniformly valuable: it can repair failed attempts, waste compute on already-correct answers, or introduce harmful answer changes. We study this as a deployment allocation problem rather than a new-verifier problem. We introduce \sevra, Selective Verification for Reasoning Allocation, a serving-layer controller that decides whether to preserve a frozen solver's initial answer or invoke active verification. Using a frozen Qwen3-4B solver, we log intervention outcomes and train recoverability-aware gates from serving-visible attempt state. On \mathfive, selective verification reaches 76.3\% accuracy, compared with 75.5\% for always verifying, while reducing post-generation tokens by 26.8\% and harmful flips from 2.2\% to 1.0\%. However, an 8,192-token initial solve reaches 76.0\% accuracy with 28\% fewer total model tokens, showing that selective recovery is useful but not the best tested cost frontier. In frozen transfer to \gsm, the selective policy verifies only 3.0\% of examples, improves accuracy from 93.4\% to 94.5\%, and reduces verification tokens by 91.2\% relative to always verifying; again, a longer initial solve matches its accuracy with fewer realized tokens. On CommonsenseQA, always-on verification hurts, while Self-Consistency@5 improves accuracy at about five times the realized token cost. The resulting deployment rule is: tune the initial budget first, then use selective recovery when explicit checks, bounded retries, auditability, or regression-risk control matter.

2606.19771 2026-06-19 cs.AI 新提交 85%

Beyond Entropy: Learning from Token-Level Distributional Deviations for LLM Reasoning

超越熵:从令牌级分布偏差中学习以增强LLM推理

Xuanzhi Feng, Zhengyang Li, Zeyu Liu, Haoxi Li, Yuming Jiang, Bing Guo, Jingcai Guo, Jie Zhang, Song Guo

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) Sichuan University(四川大学) The Hong Kong Polytechnic University(香港理工大学)

专题命中 测试时计算 :通过令牌级分布偏差学习增强LLM推理。

AI总结 针对RLVR中令牌更新导致的熵塌陷或爆炸问题,提出ICT框架,利用JS散度识别关键令牌,通过选择性更新平衡策略集中度,提升推理性能。

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)显著推进了大语言模型(LLM)推理;然而,它面临一个基本的优化不稳定性:均匀令牌更新会导致熵塌陷,从而过早收敛到次优策略,而过度的香农熵最大化可能导致熵爆炸,驱动盲目探索走向不连贯的推理链。为解决这一二分问题,我们引入了独立组合令牌(ICT)框架,该框架将优化焦点从标量不确定性转移到令牌logits的分布特性。通过利用令牌logits分布之间的詹森-香农(JS)散度,ICT将具有独特分布模式的令牌识别为引导LLM推理中有效探索的关键分支点。我们的理论分析基于香农熵和二阶Rényi熵,证明选择性地更新这些令牌可以调节策略集中度:它降低了由香农熵度量的整体分布不确定性,同时控制了由二阶Rényi熵捕获的概率集中度。这种双重效应防止了过度集中的令牌生成削弱探索,并有效稳定了训练景观。实验结果表明,在Qwen2.5(0.5B/1.5B/7B)模型上仅更新前10%的独特令牌,在涵盖数学、常识和奥林匹克级别问题的七个基准测试中,与GRPO、20-Entropy和STAPO基线相比,平均pass@4提升了4.58%,最大提升达14.9%。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced Large Language Model (LLM) reasoning; however, it faces a fundamental optimization instability: uniform token updates precipitate entropy collapse, leading to premature convergence to suboptimal strategies, whereas excessive Shannon Entropy maximization can cause entropy explosion, driving blind exploration toward incoherent reasoning chains. To resolve this dichotomy, we introduce the Independent Combinatorial Tokens (ICT) framework, which shifts the optimization focus from scalar uncertainty to the distributional properties of token logits. By leveraging the Jensen-Shannon (JS) divergence between token logits distributions, ICT identifies tokens with distinctive distributional patterns as critical branching points for guiding effective exploration in LLM reasoning. Our theoretical analysis, grounded in both Shannon and second-order Rényi entropy, proves that selectively updating on these tokens regulates policy concentration: it reduces the overall distribution uncertainty measured by Shannon entropy, while controlling probability concentration captured by second-order Rényi entropy. This dual effect prevents over-concentrated token generation from weakening exploration and effectively stabilizes the training landscape. Empirical results demonstrate that updating only the top 10% of unique tokens on Qwen2.5 (0.5B/1.5B/7B) models yields an average pass@4 improvement of 4.58%, with a maximum gain of 14.9%, over GRPO, 20-Entropy, and STAPO baselines across seven benchmarks spanning math, commonsense, and Olympiad-level problems.

2606.19750 2026-06-19 cs.LG cs.AI cs.CL 新提交 80%

Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models

流形赌博机:大语言模型潜在几何上的贝叶斯课程学习

Darrien McKenzie, Nicklas Hansen, Xiaolong Wang

发表机构 * University of California, San Diego(加州大学圣迭戈分校)

专题命中 测试时计算 :贝叶斯课程学习框架用于LLM推理的强化学习。

AI总结 提出贝叶斯流形课程(BMC)框架,将问题采样建模为流形结构赌博机问题,通过层次任务树和贝叶斯学习引导采样,平衡学习信号、多样性和实用性。

Comments Webpage: https://darrienmckenzie.com/manifold-bandits/

详情
AI中文摘要

强化学习(RL)是提高大语言模型(LLMs)推理能力的关键方法,其中训练效率关键取决于优化过程中问题的采样方式。现有的自适应课程学习方法通常优先考虑中等难度的提示,将问题选择视为具有独立臂的标准赌博机问题,忽略了任务空间的结构化和异质性。在这项工作中,我们将问题采样框架化为具有内生非平稳性的流形结构赌博机问题:问题通过模型的潜在表示空间相关联,采样决策可以影响学习信号在该空间中的演变方式。为了实现这一视角,我们引入了贝叶斯流形课程(BMC),这是一个结构感知框架,将问题组织成层次任务树,并应用贝叶斯学习来指导采样。实验发现,不同的采样策略在生产性(学习信号)、多样性(任务流形覆盖)和实用性(评估相关性)之间引入了非平凡的权衡。这些结果表明,仅优先考虑难度不足以获得强大的下游性能,突出了将结构和类型感知纳入问题采样中的重要性。

英文摘要

Reinforcement learning (RL) is a central approach for improving reasoning capabilities in large language models (LLMs), where training efficiency depends critically on how problems are sampled during optimization. Existing adaptive curriculum learning methods typically prioritize prompts of intermediate difficulty, treating problem selection as a standard bandit problem with independent arms and overlooking the structured, heterogeneous nature of the task space. In this work, we frame problem sampling as a manifold-structured bandit problem with endogenous non-stationarity: problems are related through the model's latent representation space, and sampling decisions can steer how learning signals evolve across that space. To operationalize this perspective, we introduce Bayesian Manifold Curriculum (BMC), a structure-aware framework that organizes problems into a hierarchical task tree and applies Bayesian learning to guide sampling. Empirically, we find that different sampling strategies induce non-trivial tradeoffs between productivity (learning signal), diversity (coverage of the task manifold), and utility (evaluation relevance). These results show that prioritizing difficulty alone is insufficient for strong downstream performance, highlighting the importance of incorporating structure and type-awareness into problem sampling.