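arXivDaily: arXiv daily academic digest, updated Monday through Friday

AI large models

Large model reasoning capabilities

Mathematical, logical, planning, and multi-step reasoning, and test-time compute capabilities of large models.

Collected today/current date: 11  Signal sources: cs.CL, cs.AI, cs.LG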

1. Planning reasoning (2 papers)

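2605.29649 2026-06-18 cs.AI Version update Topic 85

LLM-Evolved Domain-Independent Heuristics for Symbolic AI Planning

Elliot Gestrin, Jendrik Seipp

Topic match Planning reasoning: LLM-evolved domain-independent heuristics for symbolic planning

AI summary This paper uses evolutionary search to have a large language model generate domain-independent heuristic functions that outperform the best hand-engineered heuristics on unseen test domains, and it benchmarks heuristics on their informedness-speed tradeoff, which the authors say has not been done before.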

Comments Accepted at the LM4Plan workshop at ICAPS 2026

English abstract

Heuristic search is the dominant paradigm in symbolic AI planning, and the strongest heuristics are the result of decades of work by planning researchers. Recent work has shown that large language models (LLMs) can design heuristics for individual planning domains, but no LLM-generated heuristic has so far worked on arbitrary planning tasks. In this paper, we use evolutionary search to produce the first LLM-generated domain-independent heuristics that exceed the hand-engineered state of the art. We let an LLM mutate parent heuristics written in C++, store candidates in a MAP-Elites archive keyed on informedness and speed and calculate fitness scores by blending coverage with solving time. To place the evolved programs in context, we additionally benchmark a broad set of hand-engineered heuristics on their informedness-speed tradeoff, which to our knowledge has not been done before. On unseen testing domains, our best evolved heuristic solves more tasks than even the strongest baseline, with our full heuristic suite spanning the Pareto frontier of said tradeoff. We also find that seeding evolution from the trivial blind heuristic outperforms seeding from the strong FF heuristic, even when the resulting program is itself an FF variant, and that LLM reasoning effort affects how often candidates compile much more than the quality of those that do. Because the evolved programs are plain C++, they slot into existing planners as drop-in replacements and inherit the soundness and completeness guarantees of the underlying search.
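
The archive mechanics mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Candidate` fields, the normalization of both descriptors to [0, 1], the 4-bin grid, and the `time_weight` blend are all assumptions made for illustration. The abstract only states that the archive is keyed on informedness and speed and that fitness blends coverage with solving time.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Candidate:
    name: str
    informedness: float  # behaviour descriptor, assumed normalized to [0, 1]
    speed: float         # behaviour descriptor, assumed normalized to [0, 1]
    coverage: float      # fraction of training tasks solved, in [0, 1]
    solve_time: float    # mean solving time as a fraction of the limit, in [0, 1]


def fitness(c: Candidate, time_weight: float = 0.1) -> float:
    """Blend coverage with solving time (hypothetical weighting)."""
    return c.coverage + time_weight * (1.0 - c.solve_time)


def cell(c: Candidate, bins: int = 4) -> Tuple[int, int]:
    """Map the two behaviour descriptors to a grid cell of the archive."""
    def to_bin(x: float) -> int:
        return min(int(x * bins), bins - 1)
    return to_bin(c.informedness), to_bin(c.speed)


def insert(archive: Dict[Tuple[int, int], Candidate], c: Candidate) -> bool:
    """Keep the candidate only if its cell is empty or it beats the incumbent."""
    key = cell(c)
    incumbent = archive.get(key)
    if incumbent is None or fitness(c) > fitness(incumbent):
        archive[key] = c
        return True
    return False


archive: Dict[Tuple[int, int], Candidate] = {}
insert(archive, Candidate("blind", 0.0, 1.0, 0.30, 0.20))
insert(archive, Candidate("ff_like", 0.8, 0.3, 0.70, 0.50))
insert(archive, Candidate("ff_tuned", 0.8, 0.3, 0.75, 0.40))
```

Because each cell keeps only its best occupant, a fast but uninformed heuristic and a slow but informed one can both survive, which is what lets an archive of this kind cover different points of a tradeoff.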

2603.09344 2026-06-18 cs.AI stat.ML Version update Topic 70

Robust Regularized Policy Iteration under Transition Uncertainty

Hongqiang Lin, Zhenghui Fu, Weihao Tang, Pengfei Wang, Yiding Sun, Qixian Huang, Dongxu Zhang

Affiliations College of Computer Science and Technology, Zhejiang University, Hangzhou, China; School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi'an, China; School of Software Technology, Zhejiang University, Hangzhou, China; School of Software Engineering, Xi'an Jiaotong University, Xi'an, China; School of Systems Science and Engineering, Sun Yat-sen University, Guangzhou, China

Topic match Planning reasoning: Robust policy iteration for offline reinforcement learning

AI summary Proposes Robust Regularized Policy Iteration (RRPI), which formulates offline reinforcement learning as robust policy optimization, replaces the intractable bilevel objective with a KL-regularized surrogate, and derives an efficient policy iteration procedure from a robust regularized Bellman operator, with theoretical convergence guarantees and strong results on D4RL benchmarks.

English abstract

Offline reinforcement learning (RL) enables data-efficient and safe policy learning without online exploration, but its performance often degrades under distribution shift. The learned policy may visit out-of-distribution state-action pairs where value estimates and learned dynamics are unreliable. To address policy-induced extrapolation and transition uncertainty in a unified framework, we formulate offline RL as robust policy optimization, treating the transition kernel as a decision variable within an uncertainty set and optimizing the policy against the worst-case dynamics. We propose Robust Regularized Policy Iteration (RRPI), which replaces the intractable max-min bilevel objective with a tractable KL-regularized surrogate and derives an efficient policy iteration procedure based on a robust regularized Bellman operator. We provide theoretical guarantees by showing that the proposed operator is a $\gamma$-contraction and that iteratively updating the surrogate yields monotonic improvement of the original robust objective with convergence. Experiments on D4RL benchmarks demonstrate that RRPI achieves strong average performance, outperforming recent baselines including percentile-based methods on the majority of environments while remaining competitive on the rest. Moreover, RRPI exhibits robust performance by aligning lower $Q$-values with high epistemic uncertainty, which prevents the policy from executing unreliable out-of-distribution actions.
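
A standard identity helps show why a KL-regularized surrogate can be tractable: for a nominal next-state distribution P0 and a temperature tau > 0, the minimum over P of E_P[V] + tau * KL(P || P0) equals -tau * log E_P0[exp(-V / tau)]. The Python sketch below evaluates that closed form inside a one-step backup. Whether RRPI's operator takes exactly this form is not stated in the abstract, so the function names `soft_worst_case` and `robust_backup`, the temperature `tau`, and the toy numbers are all assumptions.

```python
import math


def soft_worst_case(values, nominal_probs, tau):
    """Closed form of min_P E_P[V] + tau * KL(P || P0), i.e.
    -tau * log E_P0[exp(-V / tau)], computed with a shift for stability."""
    m = min(values)
    s = sum(p * math.exp(-(v - m) / tau) for v, p in zip(values, nominal_probs))
    return m - tau * math.log(s)


def robust_backup(reward, gamma, next_values, nominal_probs, tau):
    """One-step backup that replaces the expected next value with the
    KL-regularized worst-case value (illustrative, not the paper's operator)."""
    return reward + gamma * soft_worst_case(next_values, nominal_probs, tau)


next_values = [0.0, 10.0]
nominal = [0.5, 0.5]

# A large tau trusts the nominal model, so the result approaches the mean.
near_mean = soft_worst_case(next_values, nominal, tau=1000.0)
# A small tau lets the adversary move freely, so the result approaches the minimum.
near_min = soft_worst_case(next_values, nominal, tau=0.01)

q = robust_backup(1.0, 0.9, next_values, nominal, tau=1.0)
```

The temperature interpolates between the nominal expectation and the hard worst case, which is the sense in which a regularized surrogate softens a max-min objective.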

2. Complex problem solving (5 papers)

2604.28076 2026-06-18 cs.CL cs.AI cs.LG Version update Topic 85

TopBench: A Benchmark for Implicit Predictive Reasoning in Tabular Question Answering

An-Yang Ji, Jun-Peng Jiang, De-Chuan Zhan, Han-Jia Ye

Affiliations School of Artificial Intelligence, Nanjing University, China; National Key Laboratory for Novel Software Technology, Nanjing University, China

Topic match Complex problem solving: A benchmark for implicit predictive reasoning in tabular question answering

AI summary Proposes the TopBench benchmark, comprising 779 samples across four sub-tasks, to evaluate how well large language models recognize implicit predictive intent in tabular question answering and reason reliably; finds that current models struggle with intent recognition.

English abstract

Large Language Models (LLMs) have advanced Table Question Answering, where most queries can be answered by extracting information or simple aggregation. However, a common class of real-world queries is implicitly predictive, requiring the inference of unobserved answers from historical patterns rather than mere retrieval. These queries introduce two challenges: recognizing latent intent and reliable predictive reasoning over massive tables. To assess LLMs in such Tabular questiOn answering with implicit Prediction tasks, we introduce TopBench, a benchmark consisting of 779 samples across four sub-tasks, ranging from single-point prediction to decision making, treatment effect analysis, and complex filtering, requiring models to generate outputs spanning reasoning text and structured tables. We evaluate diverse models under both text-based and agentic workflows. Experiments reveal that current models often struggle with intent recognition, defaulting to just lookups. Deeper analysis identifies that accurate intent disambiguation serves as the prerequisite for leading these predictive behaviors. Furthermore, elevating the upper bound of prediction precision requires the integration of more sophisticated modeling or reasoning capabilities.

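2509.22363 2026-06-18 cs.LG eess.AS Version update Topic 70

Investigating Faithfulness in Large Audio Language Models

Pooneh Mousavi, Lovenya Jain, Mirco Ravanelli, Cem Subakan

Affiliations Concordia University; Mila - Quebec AI Institute; Université Laval; Birla Institute of Technology and Science, Pilani

Topic match Complex problem solving: Studies the faithfulness of chain-of-thought reasoning; relates to reasoning evaluation

AI summary Proposes a systematic framework for evaluating the faithfulness of reasoning chains in large audio language models, defines three criteria for audio faithfulness, and finds through benchmarking that model reasoning can be disconnected from the audio input.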

Comments Accepted to Interspeech 2026

English abstract

Large Audio Language Models (LALMs) integrate audio encoders with pretrained Large Language Models to perform complex multimodal reasoning tasks. While these models can generate Chain-of-Thought (CoT) explanations, the faithfulness of these reasoning chains remains unclear. In this work, we propose a systematic framework to evaluate CoT faithfulness in LALMs with respect to both the input audio and the final model prediction. We define three criteria for audio faithfulness: hallucination-free, holistic, and attentive listening. We also introduce a benchmark based on both audio and CoT interventions to assess faithfulness\footnote{The benchmarking interface and evaluation results are available at https://poonehmousavi.github.io/faithfulness/.}. Experiments on Audio Flamingo 3 and Qwen2.5-Omni suggest a potential multimodal disconnect: reasoning often aligns with the final prediction but is not always strongly grounded in the audio and can be vulnerable to hallucinations or adversarial perturbations.

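2603.05128 2026-06-18 eess.AS cs.SD Version update Topic 70

PolyBench: A Benchmark for Compositional Reasoning in Polyphonic Audio

Yuanjian Chen, Yang Xiao, Han Yin, Xubo Liu, Jinjie Huang, Ting Dang

Affiliations Harbin University of Science and Technology; The University of Melbourne; KAIST; University of Surrey

Topic match Complex problem solving: Evaluates the compositional reasoning ability of large audio models

AI summary To address the lack of compositional-reasoning evaluation for polyphonic audio, proposes the PolyBench benchmark with five subsets (counting, classification, detection, concurrency, and duration estimation); the evaluation finds consistent performance degradation of existing large audio language models in polyphonic settings.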

Comments Accepted by INTERSPEECH 2026

English abstract

Large Audio Language Models (LALMs) are increasingly capable of reasoning over audio, yet existing benchmarks offer limited coverage of reasoning in polyphonic audio, where multiple sound events co-occur and induce compositional structure. To address this gap, we introduce PolyBench, a benchmark designed to evaluate compositional reasoning in polyphonic audio, comprising five evaluation subsets that cover counting, classification, detection, concurrency, and duration estimation, all of which require reasoning over multiple concurrent events and their relations. Our evaluation of state-of-the-art LALMs reveals consistent performance degradation in polyphonic settings, indicating a fundamental bottleneck in current LALMs.

2503.01805 2026-06-18 cs.LG cs.AI cs.CL Version update Topic 70

Depth-Width tradeoffs in Algorithmic Reasoning of Graph Tasks with Transformers

Gilad Yehudai, Clayton Sanford, Maya Bechler-Speicher, Orr Fischer, Ran Gilad-Bachrach, Amir Globerson

Affiliations Courant Institute of Mathematical Sciences, New York University; Google Research; Meta AI; Bar-Ilan University; Department of Bio-Medical Engineering, Edmond J. Safra Center for Bioinformatics, Tel-Aviv University; Tel Aviv University

Topic match Complex problem solving: Studies the reasoning ability of Transformers on graph algorithmic tasks.

AI summary Studies the tradeoff between depth and width for Transformers on graph algorithmic tasks, finding that constant depth suffices for many graph problems at linear width while some problems require quadratic width; experiments find tasks where wide models match the accuracy of deep models while training and running inference faster.

Comments Updated ISF grant number

English abstract

Transformers have revolutionized the field of machine learning. In particular, they can be used to solve complex algorithmic problems, including graph-based tasks. In such algorithmic tasks a key question is what is the minimal size of a transformer that can implement the task. Recent work has begun to explore this problem for graph-based tasks, showing that for sub-linear embedding dimension (i.e., model width) logarithmic depth suffices. However, an open question, which we address here, is what happens if width is allowed to grow linearly, while depth is kept fixed. Here we analyze this setting, and provide the surprising result that with linear width, constant depth suffices for solving a host of graph-based problems. This suggests that a moderate increase in width can allow much shallower models, which are advantageous in terms of inference and train time. For other problems, we show that quadratic width is required. Our results demonstrate the complex and intriguing landscape of transformer implementations of graph-based algorithms. We empirically investigate these trade-offs between the relative powers of depth and width and find tasks where wider models have the same accuracy as deep models, while having much faster train and inference time due to parallelizable hardware.

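2601.17226 2026-06-18 cs.CL cs.AI Version update Topic 70

Retell, Reward, Repeat: Reinforcement Learning for Narrative Theory-Informed Story Retelling

David Y. Liu, Xanthe Muston, Dipankar Srirag, Aditya Joshi, Sebastian Sequoiah-Grayson

Affiliations University of New South Wales

Topic match Complex problem solving: Improving the logic and rationality of story retelling

AI summary Proposes the RRR reinforcement learning framework, which combines Structuralist Narratology with scalar narrativity and uses d-RLAIF to derive training signals from textual features without reference outputs, improving the logic, rationality, and completeness of LLM story retelling.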

Comments 8 Pages, 7 figures

English abstract

Counterfactual story retelling exposes LLM shortcomings in constrained narrative solution spaces where they can no longer rely on recalling memorised training data. Ground-truth-based post-training, such as SFT, fails to teach LLMs how to generate logical and rational narrative events. In this paper, we introduce Retell, Reward, Repeat (RRR), an RL-based pipeline synthesising Structuralist Narratology with scalar narrativity to teach storytelling structure. We extend the TimeTravel dataset with human-annotated stages of narrative equilibrium to evaluate reward models. By using d-RLAIF, RRR derives training signals from the narrativity of textual features without the need for reference outputs. Evaluations demonstrate that RRR-trained LLMs outperform few-shot and SFT baselines in logic, rationality, and completeness, with output quality additionally validated by blind human preference. Relying on a small, query-only dataset, RRR provides a linguistically grounded, cost-effective post-training mechanism for storytelling--a domain currently lacking effective post-training methods. RRR highlights the continued relevance of integrating established linguistic theories into contemporary NLP.

3. Mathematical reasoning (3 papers)

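2603.01221 2026-06-18 cs.MA Version update Topic 85

Epistemic Gain, Aleatoric Cost: Uncertainty Decomposition in Multi-Agent Debate for Math Reasoning

Dan Qiao, Binbin Chen, Fengyu Cai, Jianlong Chen, Wenhao Li, Fuxin Jiang, Zuzhi Chen, Hongyuan Zha, Tieying Zhang, Baoxiang Wang

Topic match Mathematical reasoning: Uncertainty decomposition for math reasoning in multi-agent debate

AI summary This paper proposes a Bayesian uncertainty analysis framework that decomposes predictive uncertainty in multi-agent debate into epistemic and aleatoric uncertainty, and designs an uncertainty-guided multi-agent reinforcement learning algorithm that raises epistemic gain while controlling aleatoric cost, improving reasoning accuracy and debate efficiency.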

Comments ICML2026

English abstract

Multi-Agent Debate (MAD) has shown promise in improving reasoning and reducing hallucinations, yet it remains unclear how information exchange shapes individual reasoning behavior. Empirically, MAD exhibits paradoxical phenomena, including rising accuracy with increasing token entropy and marked differences between homogeneous and heterogeneous agent combinations. In this paper, we introduce a Bayesian uncertainty analysis framework for MAD, which decomposes answer-level predictive uncertainty into epistemic uncertainty and aleatoric uncertainty, corresponding to the potential gain and cost of debate. Across multiple agent configurations, we find that effective debate depends on achieving high epistemic gain under controlled aleatoric cost. Building on this insight, we design an uncertainty-guided multi-agent reinforcement learning algorithm that encourages lower aleatoric cost and more effective epistemic information utilization. Experiments show that our approach simultaneously enhances each agent's accuracy and promotes a more productive debate process, providing an operational Bayesian perspective for understanding and improving MAD.
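
One common way to split answer-level predictive uncertainty, borrowed from ensemble methods, is to take the entropy of the averaged answer distribution as the total, the average per-member entropy as the aleatoric part, and the remainder (a mutual information) as the epistemic part. The abstract does not say that the paper uses exactly this estimator, so the Python sketch below is an assumption-laden illustration: treating each agent's answer distribution as an ensemble member, and the helper names `entropy` and `decompose`, are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy in nats; zero-probability answers contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)


def decompose(dists):
    """Return (total, aleatoric, epistemic) for a list of answer distributions."""
    n = len(dists)
    k = len(dists[0])
    mean = [sum(d[j] for d in dists) / n for j in range(k)]
    total = entropy(mean)                          # entropy of the averaged prediction
    aleatoric = sum(entropy(d) for d in dists) / n  # average per-agent entropy
    return total, aleatoric, total - aleatoric      # remainder is the epistemic part


# Two agents that agree but are each unsure: all uncertainty is aleatoric.
unsure = decompose([[0.5, 0.5], [0.5, 0.5]])
# Two agents that are each confident but disagree: all uncertainty is epistemic.
disagree = decompose([[1.0, 0.0], [0.0, 1.0]])
```

Under this reading, disagreement between confident agents is the kind of uncertainty that exchanging information could resolve, while shared per-agent noise is not.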

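2505.23851 2026-06-18 cs.CL cs.AI cs.SC Version update Topic 85

ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark

Michael Shalyt, Rotem Elimelech, Ido Kaminer

Affiliations MIT; Technion - Israel Institute of Technology

Topic match Mathematical reasoning: A benchmark evaluating the robustness of large models' symbolic mathematical reasoning

AI summary Proposes the ASyMOB benchmark of 35,368 symbolic math problems, uses perturbation tests to expose insufficient robustness of large models in symbolic mathematical reasoning, and finds complementary potential between LLMs and CAS.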

Comments Published in ICML2026: https://icml.cc/virtual/2026/poster/63549 Code repository: https://github.com/RamanujanMachine/ASyMOB Complete benchmark dataset: https://huggingface.co/datasets/Shalyt/ASyMOB-Algebraic_Symbolic_Mathematical_Operations_Benchmark

English abstract

Large language models (LLMs) are increasingly applied to symbolic mathematics, yet existing evaluations often conflate pattern memorization with genuine reasoning. To address this gap, we present ASyMOB, a high-resolution dataset of 35,368 validated symbolic math problems spanning integration, limits, differential equations, series, and hypergeometrics. Unlike prior benchmarks, ASyMOB systematically perturbs each seed problem using symbolic, numeric, and equivalence-preserving transformations, enabling a fine-grained assessment of generalization. Our evaluation reveals three key findings: (1) most models' performance collapses under minor perturbations, while top systems exhibit an apparent regime shift in robustness; (2) integrated code tools stabilize performance, particularly for weaker models; and (3) we identify examples where Computer Algebra Systems (CAS) fail while LLMs succeed, as well as problems solved only via a hybrid LLM-CAS approach, highlighting a promising integration frontier. ASyMOB serves as a principled diagnostic tool for measuring and accelerating progress toward building verifiable, trustworthy AI for scientific discovery.
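
The difference between an equivalence-preserving perturbation and a numeric one can be shown with a toy Python sketch. It does not reproduce ASyMOB's actual transformations, which the abstract does not list: the seed expression, the identity sin^2 + cos^2 = 1, the coefficient `a`, and the helper `agree` are all assumptions chosen for illustration.

```python
import math


def seed(x):
    """Hypothetical seed expression: x * exp(-x)."""
    return x * math.exp(-x)


def equivalent_variant(x):
    """Same function written differently: multiplied by sin^2 + cos^2 = 1."""
    return seed(x) * (math.sin(x) ** 2 + math.cos(x) ** 2)


def numeric_variant(x, a=3.0):
    """Numeric perturbation: a new coefficient changes the function itself."""
    return a * x * math.exp(-x)


def agree(f, g, points, tol=1e-9):
    """Check that f and g match numerically at every sample point."""
    return all(abs(f(x) - g(x)) <= tol * max(1.0, abs(f(x))) for x in points)


points = [0.1, 0.5, 1.0, 2.0]
same_answer = agree(seed, equivalent_variant, points)
changed_answer = agree(seed, numeric_variant, points)
```

A model that has only memorized the seed form may fail on the first kind of variant even though the correct answer is unchanged, and may repeat the seed answer on the second kind even though the answer has changed.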

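2605.03460 2026-06-18 cs.AI cs.LG Version update Topic 80

FinSTaR: Towards Financial Reasoning with Time Series Reasoning Models

Seunghan Lee, Jun Seo, Jaehoon Lee, Sungdong Yoo, Minjae Kim, Tae Yoon Lim, Dongwan Kang, Hwanil Choi, Soonyoung Lee, Wonbin Ahn

Affiliations LG AI Research

Topic match Mathematical reasoning: Financial time series reasoning, involving mathematical reasoning and chain-of-thought.

AI summary To address the failure of time series reasoning models in the financial domain, proposes the FinSTaR model based on a 2 x 2 capability taxonomy, reaching 78.9% average accuracy on the FinTSR-Bench benchmark with its Compute-in-CoT and Scenario-Aware CoT strategies.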

Comments KDD Workshop on SciSoc Agents & LLMs 2026 (Oral Presentation)

English abstract

Time series (TS) reasoning models (TSRMs) have shown promising capabilities in general domains, yet they consistently fail in the financial domain, which exhibits unique characteristics. We propose a general 2 x 2 capability taxonomy for TSRMs by crossing 1) single-entity vs. multi-entity analysis with 2) assessment of the current state vs. prediction of future behavior. We instantiate this taxonomy in the financial domain (where the distinction between deterministic assessment and stochastic prediction is particularly critical) as ten financial reasoning tasks, forming the FinTSR-Bench benchmark based on S&P stocks. To this end, we propose FinSTaR (Financial Time Series Thinking and Reasoning), trained on FinTSR-Bench with distinct chain-of-thought (CoT) strategies tailored to each category. For assessment, which is deterministic (i.e., computable from observable data), we employ Compute-in-CoT, a programmatic CoT that enables models to derive answers directly from raw prices. For prediction, which is inherently stochastic (i.e., subject to unobservable factors), we adopt Scenario-Aware CoT, which generates diverse scenarios before making a judgment, mirroring how financial analysts reason under uncertainty. The proposed method achieves 78.9% average accuracy on FinTSR-Bench, substantially outperforming LLM and TSRM baselines. Furthermore, we show that the four capability categories are complementary and mutually reinforcing through joint training, and that Scenario-Aware CoT consistently improves prediction accuracy over standard CoT. Code is available at https://github.com/seunghan96/FinSTaR.
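
What "deterministic assessment, computable from observable data" might look like in practice can be sketched in a few lines of Python. The two quantities below (simple return and maximum drawdown) and the price series are assumptions picked for illustration; the abstract does not name the ten FinTSR-Bench tasks, and this is not the FinSTaR code.

```python
def simple_return(prices):
    """Return over the whole window, computed from the first and last price."""
    return prices[-1] / prices[0] - 1.0


def max_drawdown(prices):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = prices[0]
    worst = 0.0
    for p in prices:
        peak = max(peak, p)
        worst = max(worst, (peak - p) / peak)
    return worst


# Hypothetical closing prices for a single stock.
prices = [100.0, 110.0, 99.0, 120.0]
window_return = simple_return(prices)
drawdown = max_drawdown(prices)
```

Quantities like these have a single correct value given the prices, which is why a programmatic reasoning step can compute them exactly, whereas a forecast of the next price cannot be derived this way.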

4. Logical reasoning (1 paper)

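2505.12369 2026-06-18 cs.AI cs.LG cs.LO Version update Topic 70

Fully Geometric Multi-Hop Reasoning on Knowledge Graphs with Transitive Relations

Fernando Zhapa-Camacho, Robert Hoehndorf

Affiliations KAUST Center of Excellence for Smart Health (KCSH); KAUST Center of Excellence for Generative AI

Topic match Logical reasoning: Multi-hop logical reasoning on knowledge graphs; geometric embedding methods

AI summary Proposes GeometrE, which maps logical operations to purely geometric transformations and introduces a transitive loss function, improving multi-hop reasoning performance while preserving interpretability.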

Comments Accepted at ESWC 2026

Journal ref The Semantic Web. ESWC 2026. Lecture Notes in Computer Science, vol 16549. Springer, Cham (2026)

English abstract

Multi-hop logical reasoning on knowledge graphs requires faithfully mapping the logical semantics to latent space. Current geometric embedding methods have proven useful for this task by mapping entities to geometric regions and logical operations to latent transformations. While a geometric embedding can provide a direct interpretability framework for query answering, current methods have only leveraged the geometric construction of entities, failing to map logical operations to pure geometric transformations and, instead, using neural components to learn these operations. On the other hand, purely neural-based methods outperform geometric methods, but they lack interpretability in the latent space. We introduce GeometrE, a geometric embedding method for multi-hop reasoning that maps every logical operation to a purely geometric operation in the latent space. Additionally, we introduce a transitive loss function and show that, unlike existing methods, it can preserve the logical rule for all a, b, c: r(a,b) and r(b,c) -> r(a,c). Our experiments show that GeometrE outperforms current state-of-the-art geometric methods and remains competitive with existing neural-based methods on standard benchmark datasets.
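
Why a geometric constraint can enforce transitivity is easiest to see in one dimension. The toy Python sketch below is not GeometrE's construction or loss, which the abstract does not specify: it assumes a hypothetical scalar embedding per entity, reads r(a,b) as "a lies strictly left of b", and uses a hinge penalty with an assumed `margin`.

```python
def transitive_loss(emb, pairs, margin=0.1):
    """Hinge penalty that is zero once emb[a] + margin <= emb[b] for every r(a, b)."""
    return sum(max(0.0, emb[a] + margin - emb[b]) for a, b in pairs)


def holds(emb, a, b):
    """Geometric reading of the relation: a lies strictly to the left of b."""
    return emb[a] < emb[b]


facts = [("a", "b"), ("b", "c")]  # r(a,b) and r(b,c); r(a,c) is never trained on

good = {"a": 0.0, "b": 0.5, "c": 1.0}  # satisfies both facts with the margin
bad = {"a": 1.0, "b": 0.5, "c": 0.0}   # violates both facts

good_loss = transitive_loss(good, facts)
bad_loss = transitive_loss(bad, facts)
inferred = holds(good, "a", "c")
```

Once the loss on the two stated facts reaches zero, the ordering of the coordinates forces r(a,c) to hold as well, so the rule is satisfied by the geometry rather than learned from examples.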