Large Model Reasoning Capabilities - arXivDaily Topic

2606.20075 2026-06-19 cs.LG cs.CL New submission 80%

What Makes Effective Supervision in Latent Chain-of-Thought: An Information-Theoretic Analysis

Xinghao Chen, Chak Tou Leong, Wenjin Guo, Jian Wang, Wenjie Li, Xiaoyu Shen

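Affiliations: Ningbo Institute of Digital Twin, Eastern Institute of Technology; Department of Computing, The Hong Kong Polytechnic University

Topic match (complex problem solving): information-theoretic analysis of latent chain-of-thought supervision

AI summary: This paper analyzes supervision failure in latent chain-of-thought from an information-theoretic perspective, proposes two dimensions of supervision (trajectory and space), and introduces the Unified Latent Probe (ULP) to quantify information fidelity, revealing an information-performance binding.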

Abstract

Latent Chain-of-Thought (CoT) internalizes reasoning within continuous hidden states, offering a promising alternative to verbose discrete reasoning traces. However, robust latent reasoning remains difficult because outcome supervision provides weak learning signals and leaves latent trajectories prone to semantic drift. In this work, we analyze Latent CoT from an information-theoretic perspective and identify this failure as a dual collapse: gradient attenuation along the optimization path and representational drift in the latent space. We further decompose process supervision into two complementary dimensions: Trajectory Supervision, which injects dense stepwise reasoning signals, and Space Supervision, which preserves the semantic structure of the latent manifold. Our analysis shows that rigid geometric compression can collapse the reasoning space, whereas generative reconstruction provides a more flexible semantic anchor that better preserves information capacity. To measure these effects, we introduce the Unified Latent Probe (ULP), which quantifies the mutual information between latent trajectories and explicit reasoning steps. Experiments reveal a clear Information-Performance Binding: reasoning accuracy depends on the information fidelity preserved in the latent chain. These findings provide a principled framework for latent reasoning supervision and suggest shifting from geometric imitation toward mutual information maximization. Our code is available at https://github.com/EIT-NLP/Supervision-in-Latent-CoT.
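
For reference, the mutual information that a probe of this kind targets can be written in standard form. The symbols Z (latent trajectory), S (explicit reasoning steps), and the probe distribution q are notation assumed here, not taken from the paper. The inequality is the standard variational (Barber-Agakov) lower bound, shown as one common way such a quantity is estimated rather than as the authors' estimator.

```latex
I(Z;S) = \mathbb{E}_{p(z,s)}\left[\log \frac{p(z,s)}{p(z)\,p(s)}\right]
       = H(S) - H(S \mid Z)
       \ge H(S) + \mathbb{E}_{p(z,s)}\left[\log q(s \mid z)\right]
```

Under this reading, a probe q that decodes the reasoning steps from the latents more accurately gives a tighter bound, which is one way to interpret "information fidelity".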

2606.19427 2026-06-19 astro-ph.CO astro-ph.IM physics.comp-ph physics.data-an New submission 80%

Physics-guided discovery of dynamical dark-energy equations of state through iterative AI reasoning

Clecio R. Bom, Bernardo M. Fraga, Miguel A. Sabogal, Armando Bernui, Phelipe Darc, Gustavo Schwarz

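Topic match (complex problem solving): LLM iterative reasoning to discover dark-energy equations of state

AI summary: Proposes an iterative AI reasoning framework that uses a large language model to generate and refine dark-energy equations of state, combined with literature retrieval and automated evaluation. It finds two new parameterizations that outperform traditional models on supernova, baryon acoustic oscillation, and Planck data.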

Comments 6 figures, 45 pages, submitted. Code: https://iadev.cbpf.br/labia/cosmoai

Abstract

Phenomenological model building has traditionally relied on human reasoning: equations are proposed from theoretical intuition, analogy, or empirical convenience, and only then tested against data. Here we show that this cycle can be recast as an iterative AI reasoning process for dynamical dark energy. Our framework uses a large language model to propose equations of state together with cosmological rationales, grounded by retrieval from the dark-energy literature and refined through autonomous evaluation. Each candidate is embedded in a cosmological model, optimized against observations, and assessed using likelihood performance and theoretical consistency. An independent language-model critic scores the physical motivation, novelty, clarity, stability and implementation validity of both the equation and its rationale, allowing subsequent proposals to evolve jointly in mathematical structure and physical reasoning. Applied to cosmological data combinations including supernovae, baryon acoustic oscillations and Planck likelihoods, the framework identifies two parameterizations that, to the best of our knowledge, have not previously been explored and that are competitive with established forms. For Pantheon+ supernovae, DESI DR2 baryon acoustic oscillations and the full Planck 2018 temperature, polarization, and lensing likelihoods, the best AI-selected model attains larger Bayesian evidence than the traditional parameterizations considered here by more than one unit. These results show that AI-guided reasoning can complement physical model building by proposing and evaluating interpretable phenomenological parameterizations for dynamical dark energy.

2606.20401 2026-06-19 eess.SY cs.SY New submission 70%

PowerAgentBench-Dyn: A Benchmark for Agentic AI in Power System Dynamic Studies

Qian Zhang, Andrea Pomarico, Costas Mylonas, Magda Foti, Alberto Berizzi, Le Xie

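Topic match (complex problem solving): involves multi-step reasoning and engineering judgment

AI summary: Proposes the PowerAgentBench-Dyn benchmark for evaluating LLM-based agents on power system dynamic-analysis tasks, covering two tasks: model quality review and security risk screening.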

Abstract

Large Language Model (LLM)-based agents are increasingly being used to automate multi-step engineering workflows by interacting with software tools, interpreting intermediate results, and autonomously planning subsequent actions. Power system dynamic studies represent a particularly promising yet largely unexplored application domain for these agents. Unlike static computational tasks, dynamic studies often require more time on model parameter calibration, engineering judgment, and decision making under constrained action spaces. This paper introduces PowerAgentBench-Dyn, a benchmark designed to evaluate Agentic AI systems on power system dynamic-analysis tasks. The benchmark targets problems that cannot be reduced to a single optimization or coding task, but instead require the kind of reasoning, tool usage, and iterative experimentation routinely performed by experienced power system engineers. The proposed framework includes two initial benchmark tasks. The first, the Dynamic Model Quality Review Benchmark, evaluates agents' ability to validate and diagnose dynamic models based on model-quality compliance criteria specified by system operators. The second, the Dynamic Security Risk Screening Benchmark, assesses agents' capability to leverage semantic memory and a limited simulation budget to identify, rank, and analyze the most critical short-circuit contingencies from an unseen fault dataset, as well as to propose and evaluate possible mitigation measures. For each task, we define the simulation environment, observation and action spaces, and evaluation metrics. The benchmark is reproducible in a metric-based sense: released cases and simulator settings define a deterministic evaluator, while stochastic agent behavior is assessed over repeated runs using success rates and other metrics. The benchmark supports the development of future Agentic AI for power system operation and planning.

2606.19893 2026-06-19 cs.AI New submission 70%

MetaResearcher: Scaling Deep Research via Self-Reflective Reinforcement Learning in Adversarial Virtual Environments

Wei Yu, Suxing Liu, Minjie Yu, Jiahao Wang, Zhijian Zheng, Haocheng Deng, Bing Li

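Affiliations: School of Digital Arts, Jiangxi Arts & Ceramics Technology Institute; Universiti Sains Malaysia

Topic match (complex problem solving): discovery-oriented tasks that go beyond fact retrieval

AI summary: Proposes the MetaResearcher framework, which scales the training of deep research agents in adversarial environments through an evolving virtual world, discovery-oriented tasks, a self-reflective meta-reward, and a heterogeneous multi-agent architecture, aiming to improve benchmark performance and epistemic robustness.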

Abstract

Deep research agents have demonstrated remarkable capabilities in autonomous information gathering and synthesis, yet their training remains constrained by the static nature of simulated environments, the limits of fact-retrieval-only task designs, and the inefficiency of outcome-based reinforcement learning. In this work, we propose MetaResearcher, a novel framework that scales deep research agent training across four synergistic dimensions. First, we introduce an Evolving Virtual World that injects temporal dynamics and adversarial misinformation into the training environment, forcing agents to develop source credibility assessment and temporal conflict resolution skills. Second, we design Discovery-Oriented Tasks -- including hypothesis generation and contradiction resolution -- that transcend simple fact retrieval and push agents toward genuine research behaviors. Third, we propose a Self-Reflective Meta-Reward mechanism within the GRPO framework that jointly optimizes for answer correctness, search path efficiency, reflection depth, and tool call diversity, directly addressing the repetitive action loop problem observed in prior work. Fourth, we introduce a Heterogeneous Multi-Agent Swarm architecture comprising specialized Scout, Filter, and Synthesizer models that learn collaborative research strategies through coordinated reinforcement learning. Built upon the LiteResearcher infrastructure, MetaResearcher requires zero marginal API cost for training while targeting substantial improvements in both benchmark performance (GAIA, Xbench-DS) and epistemic robustness under adversarial conditions. We present the complete framework design, training methodology, and planned experimental validation.
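
To make the third component more concrete, here is a minimal sketch of how a composite reward could feed a GRPO-style update. The abstract names the four reward terms but not how they are combined, so the weighted sum, the weights, and the function names below are assumptions for illustration only. The group-relative advantage (each reward standardized within its group of rollouts) is the standard GRPO normalization.

```python
from statistics import mean, pstdev

# Hypothetical weights: the abstract lists the four terms but gives no weighting.
WEIGHTS = {"correctness": 1.0, "efficiency": 0.3, "reflection": 0.2, "diversity": 0.1}


def meta_reward(scores):
    """Weighted sum of per-rollout scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three illustrative rollouts for the same prompt (made-up scores).
rollouts = [
    {"correctness": 1.0, "efficiency": 0.8, "reflection": 0.5, "diversity": 0.6},
    {"correctness": 1.0, "efficiency": 0.2, "reflection": 0.1, "diversity": 0.1},
    {"correctness": 0.0, "efficiency": 0.9, "reflection": 0.4, "diversity": 0.7},
]
rewards = [meta_reward(s) for s in rollouts]
advantages = group_relative_advantages(rewards)
```

In this toy group, the two correct rollouts differ only in the auxiliary terms, so the one with the more efficient and varied search path receives the larger advantage. That is the kind of pressure against repetitive action loops the abstract describes.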

2606.19741 2026-06-19 cs.AI cs.LG New submission 65%

Interpreting Neural Combinatorial Optimization via Evolving Programmatic Bottlenecks

Haocheng Duan, Yuxin Guo, Jieyi Bi, Anqi Xie, Sirui Li, Yining Ma, Cathy Wu

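Affiliations: Carnegie Mellon University; Nanyang Technological University; Microsoft Research; Massachusetts Institute of Technology

Topic match (complex problem solving): concerns the interpretability of combinatorial optimization, which is related to reasoning

AI summary: Proposes the Evolving Programmatic Bottlenecks (EPB) framework, which distills black-box neural combinatorial optimization models into readable program portfolios, using an LLM and hybrid gradient descent for interpretability, and reveals how model behavior relates to variants of classic heuristics.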

Comments Under Review

Abstract

Neural Combinatorial Optimization (NCO) achieves strong performance, yet its black-box nature remains a key roadblock to deployment and scientific diagnosis. Standard interpretability tools, such as Concept Bottleneck Models (CBMs), are ill-equipped for NCO, whose decisions are dynamic and state-dependent and which lacks a well-defined concept vocabulary. To close this gap, we introduce Evolving Programmatic Bottlenecks (EPB), to our knowledge the first framework for interpreting NCO policies by distilling black-box NCO models into human-readable program portfolios. EPB employs an LLM to autonomously evolve a bank of programs, where each program's per-step action distribution serves as the bottleneck. EPB works through an iterative framework: Block I fixes program bank capacity and introduces a hybrid textual-numerical gradient descent scheme that couples numerical gradients for student router updates and textual gradients for LLM-based program revision; Block II dynamically adapts bank capacity via fault-targeted expansion and redundancy pruning. Extensive experiments demonstrate EPB's effectiveness and broad applicability, with the distilled program portfolios largely matching the original performance. EPB also reveals that NCO behavior shifts across optimization stages and can be approximated as a composition of classic heuristic variants. Our work advances interpretable NCO and establishes EPB as a promising tool for interpreting sequential decision-making models.

2606.20206 2026-06-19 stat.ML cs.LG New submission 60%

Off-Policy Evaluation for Missingness-Aware Policies in MDPs with Rewards Missing Not at Random

Ziheng Wei, Annie Qu, Rui Miao

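Affiliations: Department of Statistics, University of Michigan at Ann Arbor; Department of Statistics and Applied Probability, University of California at Santa Barbara; Department of Mathematical Sciences, University of Texas at Dallas

Topic match (complex problem solving): off-policy evaluation with missing rewards

AI summary: For offline reinforcement learning with rewards missing not at random, proposes an identification method that uses future states as shadow variables, and recovers the conditional mean reward with a bridge function and a min-max estimator, enabling off-policy evaluation of missingness-aware policies.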

Comments Accepted at ICML 2026. 31 pages, 6 figures

Abstract

In offline Reinforcement Learning, immediate rewards in logged batch data are often unobserved due to sparse or irregular record-keeping, or censored beyond certain reward values. This issue arises in practical settings, including health care and marketing. We investigate off-policy evaluation (OPE) in finite-horizon Markov decision processes when rewards are missing not at random (MNAR), which breaks ignorability and induces selection bias even after conditioning on states and actions. To address this, we formalize a reward-dependent propensity model and use future states as shadow variables to identify the full-data conditional mean reward. We further introduce a bridge function that recovers the conditional mean reward without explicitly modeling the MNAR mechanism, and estimate it via a min-max procedure to avoid double sampling. Building upon these identification results, we propose a Fitted-Q-Evaluation-style estimator that propagates the recovered rewards while allowing target policies to depend on past missingness indicators. Finally, we establish consistency and finite-sample error bounds for our OPE estimator, and show through experiments the strong performance of our method compared to existing methods on simulated and MIMIC-III Sepsis data.
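
As background for the estimator mentioned above, the following is a minimal sketch of plain tabular Fitted-Q-Evaluation in a finite-horizon MDP. It assumes every reward is observed, so it does not implement the paper's bridge-function recovery of MNAR rewards or missingness-aware policies. The function name, data layout, and toy data are hypothetical.

```python
from collections import defaultdict


def fitted_q_evaluation(transitions, policy, horizon):
    """Tabular finite-horizon FQE with fully observed rewards.

    transitions: list of (t, state, action, reward, next_state) tuples.
    policy: function (t, state) -> dict mapping action to probability.
    Returns a list q where q[t][(state, action)] is the estimated Q-value.
    """
    q = [dict() for _ in range(horizon + 1)]  # q[horizon] stays empty, i.e. zero
    for t in reversed(range(horizon)):
        targets = defaultdict(list)
        for step, s, a, r, s_next in transitions:
            if step != t:
                continue
            # Expected next-step value under the target policy.
            v_next = sum(
                prob * q[t + 1].get((s_next, a2), 0.0)
                for a2, prob in policy(t + 1, s_next).items()
            )
            targets[(s, a)].append(r + v_next)
        # In the tabular case the regression step is a per-cell mean.
        q[t] = {key: sum(vals) / len(vals) for key, vals in targets.items()}
    return q


def always_go(t, state):
    """Toy target policy that always picks the single action "go"."""
    return {"go": 1.0}


# Toy two-step chain: s0 -> s1 -> end.
logged = [
    (0, "s0", "go", 1.0, "s1"),
    (0, "s0", "go", 0.0, "s1"),
    (1, "s1", "go", 2.0, "end"),
]
q = fitted_q_evaluation(logged, always_go, horizon=2)
```

In the MNAR setting the abstract describes, the observed reward `r` in the regression target would be replaced by a recovered conditional mean reward; how that recovery works is specific to the paper and is not shown here.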

2606.19846 2026-06-19 econ.GN q-fin.EC New submission 55%

What Capital After Labor? Forecasting the Talent ROI Transition in the Human-AI Era

Kwan Soo Shin, In Seok Kang

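Topic match (complex problem solving): talent ROI forecasting framework for the AI era

AI summary: Because AI augmentation breaks the accounting link between labor time and contribution, this paper builds a forecasting framework for the shift from time-based to output-based talent ROI, with ROI inversion as its core theorem. It uses Korea's 52-hour workweek case to document an early pressure signal and forecasts that output-based firms will lead in TFP growth by 1.5-2.0 percentage points by 2032.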

Comments 90 pages, 6 figures

Abstract

AI augmentation breaks the accounting link between labor time and productive contribution, yet firms continue to evaluate talent through time-based overhead bundles. This paper develops a forecasting framework for the transition from time-based talent accounting to output-based talent ROI in the human-AI era. The framework centres on Theorem 3 (ROI Inversion at τ*) as the empirical spine, with four mechanism theorems: overhead non-additivity, augmentation-saved-time pathways, innovation-premium amplification, and human-AI dyad attribution uncertainty. Korea's staged 52-hour workweek mandate provides an empirical early-warning case. In a DART panel of 365 listed firms (2,281 firm-year observations), the SG&A-to-revenue ratio rose from 18.26 percent in 2018 to 20.06 percent in 2020, corrected mildly in 2021-2022, and peaked at 20.10 percent in 2024. Under the revenue-percentile cohort proxy, two-way fixed effects (+1.56 pp, p = 0.049), pooled event-study estimates (+4.21 pp at t = +3, p = 0.001), and Callaway-Sant'Anna doubly-robust staggered DiD estimates (+4.51 pp at t = +4) converge on a positive overhead-pressure signature. A 2015-2017 backward extension (224 firms, 601 observations) supplies pre-treatment data, providing evidence against pre-existing upward-trend confounds. We read the Korean evidence not as a direct τ* estimate or a point causal magnitude, but as, to our knowledge, the first empirically documented signature of the pre-τ* overhead-pressure regime, where time-based accounting still dominates while AI augmentation and labor-time compression jointly raise overhead. Output-based firms are forecast to outperform time-based peers by 1.5-2.0 percentage points in firm-level TFP growth by 2032. The contribution is a forecasting model and managerial planning tool for the shift to AI-augmented talent ROI accounting.
