arXivDaily arXiv每日学术速递 周一至周五更新

AI 大模型

大模型推理能力

大模型数学、逻辑、规划、多步推理和测试时计算能力。

今日/当前日期收录 7 信号源:cs.CL, cs.AI, cs.LG
2606.20068 2026-06-19 cs.AI 新提交 90%

Process-Verified Reinforcement Learning for Theorem Proving via Lean

基于Lean的过程验证强化学习用于定理证明

Minsu Kim, Se-Young Yun

发表机构 * KAIST AI(韩国科学技术院人工智能系)

专题命中 数学推理 :定理证明强化学习

AI总结 提出利用Lean证明助手提供过程级验证信号,结合GRPO风格强化学习目标,通过策略级监督提升定理证明性能。

详情
AI中文摘要

虽然基于可验证奖励的强化学习通常依赖于单一的二元验证信号,但形式推理中的符号证明助手提供了丰富、细粒度的结构化反馈。这种结构化过程与非结构化奖励之间的差距凸显了既密集又可靠的反馈的重要性。在这项工作中,我们证明Lean证明助手本身可以作为符号过程预言机,在训练期间提供结果级和细粒度的策略级验证反馈。证明尝试被解析为策略序列,Lean的细化标记出局部正确的步骤和最早失败的步骤,从而产生基于类型理论的密集、验证器基础的信用信号。我们将这些结构化奖励纳入GRPO风格的强化学习目标中,采用首次错误传播和首次令牌信用方法,平衡结果级和过程级优势。在STP-Lean和DeepSeek-Prover-V1.5上的实验表明,在大多数设置中,策略级监督优于仅结果基线,在MiniF2F和ProofNet等基准测试上取得了改进。除了经验上的提升,我们的研究还突出了一个更广阔的视角:符号证明助手不仅在评估时是验证器,而且在训练期间可以作为过程级奖励预言机。这为强化学习框架开辟了一条道路,该框架将语言模型的可扩展性与符号验证的可靠性相结合,用于形式推理。

英文摘要

While reinforcement learning from verifiable rewards (RLVR) typically has relied on a single binary verification signal, symbolic proof assistants in formal reasoning offer rich, fine-grained structured feedback. This gap between structured processes and unstructured rewards highlights the importance of feedback that is both dense and sound. In this work, we demonstrate that the Lean proof assistant itself can serve as a symbolic process oracle, supplying both outcome-level and fine-grained tactic-level verified feedback during training. Proof attempts are parsed into tactic sequences, and Lean's elaboration marks both locally sound steps and the earliest failing step, yielding dense, verifier-grounded credit signals rooted in type theory. We incorporate these structured rewards into a GRPO-style reinforcement learning objective with first-error propagation and first-token credit methods that balances outcome- and process-level advantages. Experiments with STP-Lean and DeepSeek-Prover-V1.5 show that tactic-level supervision outperforms outcome-only baselines in most settings, delivering improvements on benchmarks such as MiniF2F and ProofNet. Beyond empirical gains, our study highlights a broader perspective: symbolic proof assistants are not only verifiers at evaluation time, but can also act as process-level reward oracles during training. This opens a path toward reinforcement learning frameworks that combine the scalability of language models with the reliability of symbolic verification for formal reasoning.

2606.19788 2026-06-19 cs.AI cs.CL 新提交 90%

CombEval: A Framework for Evaluating Combinatorial Counting in Large Language Models

CombEval:评估大语言模型中组合计数的框架

Yuxu Zhou, Ondřej Kuželka, Yuyi Wang, Yuanhong Wang, Yi Chang

发表机构 * School of Artificial Intelligence, Jilin University(吉林大学人工智能学院) Czech Technical University in Prague(捷克布拉格理工大学) CRRC Zhuzhou Institute(中车株洲研究所) Tengen Intelligence Institute(天元智能研究院) International Center of Future Science, Jilin University(吉林大学未来科学国际合作中心) Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, MOE(教育部知识驱动人机智能工程研究中心)

专题命中 数学推理 :评估LLM在组合计数问题上的推理能力。

AI总结 提出CombEval动态基准,通过类型化Cofola规范生成组合计数问题,评估11个大语言模型在直接和代码增强设置下的表现,发现模型在有序对象、不可区分元素、相对位置约束和嵌套对象依赖上存在脆弱性。

Comments under review. Code: https://github.com/YuxuZhou-CN/combination-problem-generation

详情
AI中文摘要

我们提出了CombEval,一个用于评估大语言模型中组合计数的动态基准。CombEval将每个问题表示为关于实体、组合对象、对象依赖和约束的类型化Cofola规范,从而能够生成带有精确求解器验证答案的自然语言计数问题。与静态集合不同,CombEval支持对象类型、实体规模、约束数量和推理深度的系统变化。我们在直接和代码增强设置下评估了11个大语言模型,发现模型在有序对象、不可区分元素、相对位置约束和嵌套对象依赖上仍然脆弱。错误分析进一步识别出在约束解释和计数原则上的失败。CombEval为研究大语言模型何时以及为何在组合推理上失败提供了一个诊断测试平台。代码和生成的基准套件可在\url{this https URL}公开获取。

英文摘要

We present CombEval, a dynamic benchmark for evaluating combinatorial counting in large language models. CombEval represents each problem as a typed Cofola specification over entities, combinatorial objects, object dependencies, and constraints, enabling controlled generation of natural-language counting problems with exact solver-verified answers. Unlike static collections, CombEval supports systematic variation of object type, entity scale, constraint count, and reasoning depth. We evaluate 11 LLMs under direct and code-augmented settings and find that models remain brittle on ordered objects, indistinguishable elements, relatively positional constraints, and nested object dependencies. Error analysis further identifies failures in constraint interpretation and counting principles. CombEval provides a diagnostic testbed for studying when and why LLMs fail at combinatorial reasoning. The code and generated benchmark suites are publicly available at \url{https://github.com/YuxuZhou-CN/combination-problem-generation}.

2606.19399 2026-06-19 cs.LG cs.AI cs.LO cs.PL 新提交 90%

VERITAS: Verifier-Guided Proof Search for Zero-Shot Formal Theorem Proving

VERITAS:验证器引导的零样本形式定理证明搜索

Manish Acharya, Zhenyu Liao, Yueke Zhang, Kevin Leach, Yu Huang, Yifan Zhang

发表机构 * Department of Computer Science, Vanderbilt University(范德堡大学计算机科学系) Amazon(亚马逊)

专题命中 数学推理 :验证器引导的零样本形式定理证明。

AI总结 提出VERITAS框架,通过两阶段协议(Best-of-N采样+批评引导MCTS)利用验证器反馈进行零样本定理证明,在miniF2F上达40.6%准确率,并发布组合学基准VERITAS-CombiBench。

详情
AI中文摘要

基于LLM的形式化证明器通常将丰富的验证器信号(语法错误、类型不匹配、部分目标进展)压缩为二进制的通过/失败位。我们提出VERITAS,一个零样本框架,通过两阶段协议将每个验证器信号路由回证明搜索:首先进行Best-of-N采样,然后进行批评引导的MCTS遍历,该遍历将第一阶段失败作为显式负例吸收。该协议保留其第一阶段扫描解决的每个定理,因此第二阶段额外的解决可归因于反馈驱动的探索。VERITAS在miniF2F上达到40.6%(相比之下,独立运行的Best-of-5为36.9%,Portfolio为26.2%),在VERITAS-CombiBench上达到7.3%,这是一个我们发布的55个定理的组合学基准,在该基准上Best-of-5(1.8%)低于Portfolio(3.6%),暴露了当必须从验证器反馈中迭代恢复正确的引理名称时,无指导的采样会带来损害。工件可在GitHub上获取。

英文摘要

LLM-based formal provers often collapse rich verifier signals (syntax errors, type mismatches, partial goal progress) into a binary pass/fail bit. We present VERITAS, a zero-shot framework that routes every verifier signal back into proof search through a two-phase protocol: Best-of-N sampling first, then a critic-guided MCTS pass that ingests Phase 1 failures as explicit negative examples. The protocol preserves every theorem solved by its own Phase 1 sweep, so Phase 2's additional solves are attributable to feedback-driven exploration. VERITAS reaches 40.6% on miniF2F (vs. an independently run Best-of-5 at 36.9%, Portfolio 26.2%) and 7.3% on VERITAS-CombiBench, a 55-theorem combinatorics benchmark we release on which Best-of-5 (1.8%) falls below Portfolio (3.6%), exposing that unguided sampling hurts when correct lemma names must be recovered iteratively from verifier feedback. Artifacts are available on GitHub.

2606.19697 2026-06-19 cs.LG cs.AI cs.CL 新提交 85%

Efficiently Representing Algorithms With Chain-of-Thought Transformers

高效表示链式思维Transformer中的算法

Yanhong Li, Anej Svete, Ashish Sabharwal, William Merrill

发表机构 * Allen Institute for AI(艾伦人工智能研究所) ETH Zürich(苏黎世联邦理工学院)

专题命中 数学推理 :证明CoT Transformer高效模拟Word RAM算法,涉及排序等推理

AI总结 本文证明链式思维Transformer能以多对数开销高效模拟Word RAM算法,包括排序和Dijkstra算法,优于模拟图灵机的二次开销。

详情
AI中文摘要

推理模型(即在产生答案前输出一系列推理或思维token的语言模型)日益流行,部分原因在于理论结果表明链式思维(CoT)Transformer可以模拟图灵机,从而执行任意计算。然而,图灵机虽然适用于复杂性理论分析,但在讨论算法时并不方便、直观或高效。算法通常在更高的抽象层次上设计和分析,即具有随机访问存储器和单位成本操作(对$\bigO(\log n)$位字)的Word RAM模型。因此,Word RAM算法可能比其图灵机对应物更高效,这引出了一个问题:CoT Transformer能否高效模拟Word RAM算法?例如,它们能否在$\bigO(n \log n)$步内对n个元素排序,或在$\bigO(E + V \log V)$步内运行Dijkstra算法?我们给出肯定回答,开销不超过多对数。我们首先为具有多对数宽度和最右唯一硬注意力的有限精度Transformer建立这一结果,然后将结果推广到两个更实际的设置:有限宽度和对数精度:连续CoT(其中推理采用向量而非token形式)和混合架构(其中Transformer层位于循环(线性RNN)层之上)。在所有三种情况下,我们发现CoT可以高效模拟任何Word RAM算法,仅需在n上多对数开销。当Word RAM具有“平坦”指令集时,此开销降至对数平方,而对于无乘法平坦指令仅需对数开销——这与已知的CoT模拟图灵机(需要二次开销)形成鲜明对比。

英文摘要

The increasing popularity of \emph{reasoning} models -- language models that output a series of reasoning or thought tokens before producing an answer -- is justified, in part, by theoretical results showing that chain-of-thought (CoT) transformers can simulate Turing machines, and thus perform arbitrary computation. However, the Turing machine, while suitable for complexity-theoretic analysis, is not convenient, intuitive, or efficient for discussing algorithms. Algorithms are typically designed and analyzed at a higher level of abstraction, captured by the \emph{Word RAM} model with random-access memory and unit-cost operations on $\bigO(\log n)$-bit words. As a result, Word RAM algorithms can be substantially more efficient than their Turing machine counterparts, raising the question: \emph{Can CoT transformers efficiently simulate Word RAM algorithms?} For instance, can they sort $n$ items in $\bigO(n \log n)$ steps or run Dijkstra's algorithm in $\bigO(E + V \log V)$ steps? We answer affirmatively, up to poly-logarithmic overhead. We first establish this for finite-precision transformers with poly-logarithmic width and rightmost unique hard attention, then strengthen the result to two more practical settings with finite width and log-precision: \emph{continuous} CoT, where reasoning takes the form of vectors rather than tokens, and a \emph{hybrid} architecture in which transformer layers sit atop a recurrent (linear RNN) layer. In all three cases, we find that CoT \emph{can} efficiently simulate any Word RAM algorithm with only a poly-logarithmic overhead in $n$. This overhead reduces to log-square when the Word RAM has a ``flat'' instruction set, and only logarithmic for multiplication-free flat instructions -- in stark contrast to known CoT simulations of Turing machines, which require quadratic overhead over Word RAM.

2606.19636 2026-06-19 cs.LG cs.AI 新提交 85%

Hard or Just Unreached? Diagnosing the Sampling Blind Spot in Math-Reasoning Difficulty Estimation

困难还是未触及?诊断数学推理难度估计中的采样盲点

Luca Zhou, Sajel Shah, Emanuele Rodolà, Roberto Dessì

发表机构 * Sapienza University of Rome(罗马大学)

专题命中 数学推理 :诊断数学推理难度估计中的采样盲点,提出确定性采样。

AI总结 发现pass@k在数学推理难度估计中存在盲点,通过激活嫁接的确定性采样可恢复10.3-22.9%的零解样本,揭示结构可识别性。

Comments 9 pages of main paper, 4 figures and 5 tables in the main paper, with more in the appendix

详情
AI中文摘要

数学和科学推理基准依赖pass@k(达到正确结果的采样链比例)作为每个示例的典型难度信号。同样的信号驱动具有可验证奖励的强化学习、数学数据整理、合成课程和验证器训练。我们表明该代理在其最困难的层级上存在持续盲点:在我们测试的八个自由形式数学单元(GSM8K和MATH,跨四个开放权重模型)中,10.3-22.9%的示例在六次尝试中没有任何采样种子解决,但通过六链确定性机制在匹配计算量下被解决。这些是贪婪解码加上通过激活嫁接应用的五个廉价残差流扰动,而单独贪婪解码在这些数学单元上最多解决6%。恢复随额外预算扩展,跨扰动(其机制差异性我们通过所有十二个单元验证,每种设置下跨类型固定集Jaccard <= 0.47)。激活嫁接用作对内部表示的干预,而非解码方法;我们纯粹将其作为诊断和多样化工具,并且我们恢复的项目表明pass@k=0%层级在残差流中结构可识别,而非未修改模型在普通推理下达到它们。

英文摘要

Math and science reasoning benchmarks rely on pass@k, the fraction of sampled chains that reach gold, as the canonical per-example difficulty signal. The same signal drives RL with verifiable rewards, math data curation, synthetic curricula, and verifier training. We show this proxy has a persistent blind spot on its hardest stratum: on the eight free-form math cells we test (GSM8K and MATH across four open-weight models), 10.3-22.9% of the examples that no sampling seed solves in six tries are instead solved at matched compute by a six-chain deterministic regime. These are greedy decoding plus five cheap residual-stream perturbations applied via activation grafting, while greedy alone solves at most 6% on these math cells. Recovery scales with the additional budget, across perturbations whose mechanistic distinctness we verify across all twelve cells (cross-kind fix-set Jaccard <= 0.47 in every setup). Activation grafting is used as an intervention on internal representations, not a decoding method; we use it purely as a diagnostic and diversification tool, and our recovered items show that the pass@k= 0 % stratum is structurally identifiable in the residual stream rather than that the unmodified model reaches them under ordinary inference.

2606.20008 2026-06-19 cs.LG 新提交 80%

VIMPO: Value-Implicit Policy Optimization for LLMs

VIMPO: 值隐式策略优化用于大语言模型

Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song, Xuandong Zhao

发表机构 * UC Berkeley(加州大学伯克利分校) Yale University(耶鲁大学)

专题命中 数学推理 :方法在数学推理基准上优于GRPO。

AI总结 提出VIMPO方法,通过KL正则化强化学习的最优条件导出策略隐含值函数,无需训练评论家,实现细粒度信用分配,在数学推理基准上优于GRPO。

详情
AI中文摘要

基于可验证奖励的强化学习已成为提升大语言模型推理能力的核心工具,但当前方法在简单性与信用分配之间存在权衡。GRPO等群组相对方法避免了训练评论家,但通常为每个token分配轨迹级优势。Actor-critic方法提供更密集的学习信号,但需要学习值函数,其自身存在训练不稳定性。我们提出VIMPO,一种无需评论家的策略优化方法,从KL正则化强化学习的最优条件推导出策略隐含值函数。对于自回归生成,得到的值递归可以用策略-参考对数比率表示,并由轨迹结束时无未来奖励的终止条件锚定。这给出了一个简单的值损失,它结合了结果级可验证奖励,而无需训练评论家。相同的推导也产生了无需评论家的actor优势,使VIMPO能够通过值损失分离奖励合并,并通过PPO风格的actor更新进行策略改进。在数学RLVR基准上,VIMPO在MATH-500、AIME 2024、AIME 2025和OlympiadBench上均优于GRPO,尤其在竞赛式评估中提升更大。在噪声奖励下,VIMPO保持对GRPO的持续优势,表明策略隐含值优化可以在保持无评论家训练实用简单性的同时提供更精细的信用分配。

英文摘要

Reinforcement learning with verifiable rewards has become a central tool for improving the reasoning ability of large language models, but current methods face a trade-off between simplicity and credit assignment. Group-relative methods such as GRPO avoid training a critic, but typically assign a trajectory-level advantage to every token. Actor-critic methods provide denser learning signals, but require a learned value function with its own training instability. We introduce VIMPO, a critic-free policy optimization method that derives a policy-implied value function from the optimality conditions of KL-regularized reinforcement learning. For autoregressive generation, the resulting value recurrence can be written in terms of policy-reference log-ratios and anchored by the terminal condition that no future reward remains at the end of a trajectory. This gives a simple value loss that incorporates outcome-level verifiable rewards without training a critic. The same derivation also yields a critic-free actor advantage, allowing VIMPO to separate reward incorporation through the value loss from policy improvement through a PPO-style actor update. On mathematical RLVR benchmarks, VIMPO improves over GRPO across MATH-500, AIME 2024, AIME 2025, and OlympiadBench, with especially larger gains on competition-style evaluations. Under noisy rewards, VIMPO retains a consistent advantage over GRPO, suggesting that policy-implied value optimization can provide finer credit assignment while preserving the practical simplicity of critic-free training.

2606.11537 2026-06-19 cs.AI cs.CE 新提交 70%

MoCA-Agent: A Market-of-Claims Code Agent for Financial and Numerical Reasoning

MoCA-Agent: 一种用于金融和数值推理的声明市场代码智能体

Abdelrahman Abdallah, AbdelRahim A. Elmadany, Sameh Al Natour, Hasan Cavusoglu, Adam Jatowt, Muhammad Abdul-Mageed

发表机构 * University of Innsbruck(因斯布鲁克大学) University of British Columbia(不列颠哥伦比亚大学) Toronto Metropolitan University(多伦多都会大学)

专题命中 数学推理 :处理金融数值推理,涉及多步计算

AI总结 提出MoCA-Agent,通过声明级验证和代码生成解决金融表格问答中的数值推理错误,在十个基准上取得强性能。

详情
AI中文摘要

金融和表格问答不仅需要流畅的推理:答案必须基于支持它们的确切事实、公式、单位、符号和尺度。单个误读的单元格或错误操作可能会悄无声息地产生看似合理但错误的结果。我们引入了 \textsc{MOCA-Agent},一种声明市场代码智能体,它用声明级验证取代了自由形式的多智能体辩论。该系统将每个问题分解为类型化的原子声明,要求专业交易智能体买入或卖出这些声明,将其订单清算为置信度加权的接受/拒绝决策,并从市场支持的证据中合成可执行的Python程序。然后,一个代码感知验证器检查程序的执行、结构一致性和常见的金融推理错误,最多进行一次市场感知修复轮次。在涵盖金融数值推理、通用表格推理、ESG问答和多模态图表推理的十个公开基准上,\textsc{MOCA-Agent} 使用固定的 Qwen3.6-27B 骨干网络实现了强劲性能,包括在 FinQA 上达到 78.3%,在 FinanceMath 上达到 76.0%,在 MultiHiertt 上达到 71.2%,在 ESGenius 上达到 86.9%,以及在 FinChart-Bench 上平均达到 85.6%。这些结果表明,在原子声明级别聚合证据,而不是整个答案,提高了高风险数值推理的鲁棒性。\footnote{代码和数据可在以下网址获取:this https URL。}

英文摘要

Financial and tabular question answering requires more than fluent reasoning: answers must be grounded in the exact facts, formulas, units, signs, and scales that support them. A single misread cell or incorrect operation can silently produce a plausible but wrong result. We introduce \textsc{MOCA-Agent}, a market-of-claims code agent that replaces free-form multi-agent debate with claim-level verification. The system decomposes each question into typed atomic claims, asks specialist trader agents to buy or sell those claims, clears their orders into confidence-weighted accept/reject decisions, and synthesizes an executable Python program from market-supported evidence. A code-aware verifier then checks the program for execution, structural consistency, and common financial reasoning errors, with at most one market-aware repair round. Across ten public benchmarks spanning financial numerical reasoning, general tabular reasoning, ESG question answering, and multimodal chart reasoning, \textsc{MOCA-Agent} achieves strong performance using a fixed Qwen3.6-27B backbone, including $78.3\%$ on FinQA, $76.0\%$ on FinanceMath, $71.2\%$ on MultiHiertt, $86.9\%$ on ESGenius, and $85.6\%$ average on FinChart-Bench. These results show that aggregating evidence at the level of atomic claims, rather than whole answers, improves robustness in high-stakes numerical reasoning.\footnote{The code and data are available: https://github.com/UBC-NLP/MoCA-Agent.