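arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.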

2606.01168 2026-06-02 cs.CL

Thinking Economically: A Hierarchical Framework for Adaptive-Complexity Reasoning in LLMs

Yubo Gao, Haotian Wu, Hong Chen, Junquan Huang, Yibo Yan, Jungang Li, Zihao Dongfang, Sicheng Tao, Puay Siew Tan, Jie Zhang, Xuming Hu

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; Nanyang Technological University; Singapore Institute of Manufacturing Technology, A*STAR

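AI summary: To address "overthinking" in LLM reasoning, the paper proposes the Hierarchical Adaptive Budgeter (HAB) framework, which uses coarse-to-fine budget allocation to spend compute efficiently, improving accuracy while reducing token usage on GSM8K and MATH500.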

Comments: 11 pages, 4 figures, 3 tables

Abstract

Chain-of-Thought (CoT) has significantly enhanced LLM reasoning, yet often incurs substantial computational overhead due to "overthinking": generating excessively long rationales without commensurate accuracy gains. Existing efficiency methods typically apply uniform compression, which overlooks a critical observation: reasoning complexity is heterogeneous at two distinct granularities, across different problems and within individual reasoning steps. This motivates our principle of Thinking Economically: intelligently allocating computational resources based on intrinsic task and step demands rather than pursuing uniform brevity. We propose Hierarchical Adaptive Budgeter (HAB), a training framework that operationalizes this principle through coarse-to-fine budgeting. At the inter-step level, HAB predicts the optimal reasoning depth for each problem. At the intra-step level, HAB learns step-specific token budgeting signals from PPL-derived step comparisons and an adaptive Pareto optimization objective that captures the local quality-efficiency trade-off, while a Fisher Information-based pruner further provides fine-grained training-time guidance, thereby encouraging the generator to internalize more economical reasoning patterns. Experiments on GSM8K and MATH500 show that HAB not only surpasses standard CoT in accuracy but also reduces token usage, achieving a stronger performance-efficiency trade-off than the compared baselines.

2606.01164 2026-06-02 cs.CV

Towards Interactive Video World Modeling: Frontiers, Challenges, Benchmarks, and Future Trends

Jiuming Liu, Chaojun Ni, Mengmeng Liu, Chensheng Peng, Fangjinhua Wang, Sitian Shen, Marc Pollefeys, Masayoshi Tomizuka, Ayush Tewari, Per Ola Kristensson

Affiliations: Department of Engineering, University of Cambridge, U.K.; Peking University; University of Twente; Mechanical Systems Control Laboratory, University of California, Berkeley, USA; ETH Zurich; Microsoft; University of Oxford

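AI summary: This survey systematically reviews research trends, technical challenges, and evaluation benchmarks in interactive world modeling and proposes future directions, focusing on action-conditioned controllability, long-horizon interaction and memory, and real-time responsiveness.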

Comments: Under review. The GitHub repository is publicly available at: https://github.com/liujiuming123/Awesome-Interactive-World-Model

Abstract

With the rapid development of large language models and diffusion-based content generation, world modeling has attracted increasing research attention, benefiting downstream domains such as game engines, embodied AI, and autonomous driving. By explicitly incorporating user actions into world state transitions, recent literature gives world modeling interactivity within an action-conditioned video or 3D generation paradigm, further enhancing controllability over world evolution and letting users freely traverse, manipulate, navigate, and personalize the state evolution. In this paper, we systematically review recent research trends, technical developments, and evaluation benchmarks in interactive world modeling, and propose potential future directions. Specifically, we first summarize recent efforts and trends in terms of application scenarios, world state evolution, and scene modality. Afterwards, we delve into three crucial technical challenges: action-conditioned controllability, long-horizon interactions and memory, and action-following responsiveness for real-time interactivity. Furthermore, we thoroughly compare existing benchmarks and metrics in four application fields: open-world exploration, game engines, autonomous driving, and robotics. Finally, we discuss several promising future directions toward next-generation interactive world modeling. The corresponding repository is publicly available at: https://github.com/liujiuming123/Awesome-Interactive-World-Model.

2606.01160 2026-06-02 cs.AI

Expected Value Alignment for Generative Reward Modeling in Formal Mathematics Verification

Shihao Ji, Haotao Tan, Zihui Song, Mingyu Li

Affiliations: arXiv.org; GitHub

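AI summary: Proposes Expected Value Alignment (EVA), which extracts continuous scores from the model's token distribution so that a generative reward model keeps its discrete output while providing continuous scoring, applied to Lean 4 formal verification.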

Abstract

Large Language Models (LLMs) are increasingly used with formal interactive theorem provers such as Lean 4. Scaling these systems with reinforcement learning or search methods requires process reward models (PRMs) that can evaluate intermediate reasoning steps. Existing reward-model designs expose a practical trade-off. Value-head models provide continuous scores but modify the generative model interface, while generative reward models preserve textual rationales but are poorly matched to continuous floating-point regression because numeric values are split across tokens. We introduce Expected Value Alignment (EVA), a reward-modeling procedure that keeps the surface output discrete while extracting continuous scores from the model's token distribution. The model emits integer scores in a structured JSON format, and EVA computes a continuous score as the expectation over the logits of the corresponding anchor tokens. Training combines the causal language modeling objective with an auxiliary mean squared error loss on these expected values. We instantiate EVA in Leibniz, a reward model for Lean 4 formal verification, and evaluate it against zero-shot and reward-modeling baselines. The evaluation demonstrates that continuous logit-based scoring significantly reduces discretization artifacts while retaining the interpretability of generative critiques.
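
The expected-value scoring described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the expectation is taken under a softmax restricted to the integer anchor tokens, and the function names, the 1-to-5 score range, and the dictionary input are hypothetical.

```python
import math

def expected_score(anchor_logits):
    """Continuous score = sum_k k * softmax(logits)_k over integer anchor tokens.

    anchor_logits: dict mapping an integer score to the logit of its token
    (a hypothetical input format chosen for this sketch).
    """
    m = max(anchor_logits.values())  # subtract the max for numerical stability
    weights = {k: math.exp(v - m) for k, v in anchor_logits.items()}
    total = sum(weights.values())
    return sum(k * w / total for k, w in weights.items())

def eva_aux_loss(anchor_logits, target):
    """Squared error between the expected score and a real-valued target."""
    return (expected_score(anchor_logits) - target) ** 2

# Equal logits over the scores 1..5 put the expectation at the midpoint.
uniform = {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0}
score = expected_score(uniform)
```

The emitted token stays a discrete integer, while the score used for ranking or regression varies smoothly with the logits, which is the property the abstract attributes to EVA.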

2606.01159 2026-06-02 cs.LG cs.GT

Fairness in two-player zero-sum games with bandit feedback

S Akash, Pratik Gajane

Affiliations: LatentForce.ai; Laboratoire d’Informatique Fondamentale d’Orléans; University of Orléans

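AI summary: Studies two-player zero-sum games under a fairness constraint (every action is played with probability at least α/m), uses a reparametrization to reduce the fair game to a standard zero-sum game, and proposes the Fair-ETC-TPZSG algorithm with a proven regret bound.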

Abstract

We study two-player zero-sum games (TPZSGs) with bandit feedback under fairness constraints requiring every action to be played with probability at least $α/m$. Existing instance-dependent results target $\textit{pure}$ Nash equilibria, while fairness generically produces $\textit{mixed}$ equilibria, a harder learning target. Our key technical tool is a reparametrization: every fair strategy decomposes as $p = (α/m)\mathbf{1} + (1-α)\widetilde{p}$ with $\widetilde{p} \in Δ_m$, and substituting into the payoff form yields $p^{\top}Aq = \widetilde{p}^{\top}\widetilde{A} q$ for a fair payoff matrix $\widetilde{A} := (1-α)A + α\mathbf{1} c^{\top}$, where $c_j = \tfrac{1}{m}\sum_i A(i,j)$ is the column-mean vector. The fair game on $A$ is then equivalent to a standard zero-sum game on $\widetilde{A}$, so equilibrium existence, KKT structure, and LP basis stability reduce to classical results applied to $\widetilde{A}$. We derive the fair minimax value, fair Nash equilibrium, fair regret, and a clean dual representation showing the price of fairness is at most $α(1-1/m)$ and vanishes whenever the unconstrained equilibrium already has full support. Our main result is an $\widetilde{O}(T^{2/3})$ regret bound for an Explore-Then-Commit algorithm, $\texttt{Fair-ETC-TPZSG}$, applicable to general mixed fair equilibria, together with a discussion of why naive action elimination does not readily improve it. When the fair equilibrium has a single dominant action, equivalently when $\widetilde{p}^{\star}$ is a vertex of $Δ_m$, the bound sharpens to instance-dependent $\widetilde{O}(1/\widetildeΔ(α)^{2})$, where $\widetildeΔ(α)$ is the LP-margin gap.
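
The reparametrization identity stated in the abstract can be checked numerically. The sketch below is only a sanity check under illustrative values: the matrix `A`, the strategies, and the helper names are made up for the example. The identity follows because the entries of the reduced strategy sum to 1, so the uniform part of p contributes α times the column-mean payoff.

```python
def fair_payoff_matrix(A, alpha):
    """Return (1 - alpha) * A + alpha * 1 c^T, where c holds the column means of A."""
    m, n = len(A), len(A[0])
    c = [sum(A[i][j] for i in range(m)) / m for j in range(n)]
    return [[(1 - alpha) * A[i][j] + alpha * c[j] for j in range(n)]
            for i in range(m)]

def bilinear(p, M, q):
    """Compute p^T M q for list-based vectors and matrices."""
    return sum(p[i] * M[i][j] * q[j]
               for i in range(len(p)) for j in range(len(q)))

def to_fair(p_tilde, alpha):
    """Map a point of the simplex to a fair strategy p = (alpha/m) 1 + (1 - alpha) p_tilde."""
    m = len(p_tilde)
    return [alpha / m + (1 - alpha) * x for x in p_tilde]

# Illustrative values only.
A = [[1.0, -1.0], [-2.0, 3.0], [0.5, 0.0]]
alpha = 0.3
p_tilde = [0.2, 0.5, 0.3]
q = [0.6, 0.4]

p = to_fair(p_tilde, alpha)
lhs = bilinear(p, A, q)                                   # payoff of the fair strategy on A
rhs = bilinear(p_tilde, fair_payoff_matrix(A, alpha), q)  # payoff of p_tilde on the fair matrix
```

Here `lhs` and `rhs` agree up to floating-point error, and every entry of `p` is at least `alpha / 3`, which is the fairness floor for m = 3 actions.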

2606.01157 2026-06-02 cs.CV

HiTokSR: A Coarse-to-Fine Tokenizer with Hierarchical Codebooks for High-Fidelity Real-World Image Super-Resolution

Mingxi Li

Affiliations: arXiv.org

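AI summary: Proposes HiTokSR, a hierarchical token prediction framework that partitions the latent space along the channel dimension into frequency-aware groups and quantizes each independently, disentangling global structure from detail; combined with vision-foundation-model priors and an index-level perturbation strategy, it reports state-of-the-art perceptual quality and reconstruction fidelity for real-world image super-resolution.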

Abstract

Vector-quantized (VQ) generative models have shown promising results in real-world image super-resolution (Real-ISR). However, existing methods typically rely on a monolithic latent space that entangles low-frequency structures with high-frequency textures. This entanglement forces a single codebook to capture a combinatorially complex set of structure-texture pairings, which constrains representational capacity and limits codebook utilization. To address this issue, we present HiTokSR, a hierarchical token prediction framework. Instead of using a single codebook, HiTokSR partitions the latent space along the channel dimension into frequency-aware groups, quantizing each with an independent sub-codebook. This coarse-to-fine design disentangles global structures from fine details, enhancing combinatorial expressiveness while circumventing the optimization instability of high-dimensional nearest-neighbor lookups. To further improve semantic consistency, our generator integrates priors from a vision foundation model via adaptive feature modulation, multi-scale class tokens, and a representation alignment loss. Additionally, we introduce an index-level perturbation strategy during decoder fine-tuning to bridge the train-test discrepancy in discrete token prediction. Extensive experiments on real-world benchmarks demonstrate that HiTokSR achieves state-of-the-art performance in both perceptual quality and reconstruction fidelity.
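
The idea of quantizing channel groups with independent sub-codebooks can be illustrated with a toy sketch. It is not the HiTokSR tokenizer: the equal-size channel split, the tiny codebooks, and the function names are assumptions, and the frequency-aware grouping from the abstract is not modeled.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec in squared Euclidean distance."""
    def dist(code):
        return sum((a - b) ** 2 for a, b in zip(code, vec))
    return min(range(len(codebook)), key=lambda k: dist(codebook[k]))

def group_quantize(latent, codebooks):
    """Split a latent vector into equal channel groups and quantize each group
    with its own sub-codebook. Assumes len(latent) is divisible by the number
    of codebooks. Returns (indices, reconstruction)."""
    size = len(latent) // len(codebooks)
    indices, recon = [], []
    for i, codebook in enumerate(codebooks):
        chunk = latent[i * size:(i + 1) * size]
        k = nearest(codebook, chunk)
        indices.append(k)
        recon.extend(codebook[k])
    return indices, recon

codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],  # hypothetical sub-codebook for the first channel group
    [[0.0, 1.0], [1.0, 0.0]],  # hypothetical sub-codebook for the second channel group
]
indices, recon = group_quantize([0.9, 1.1, 0.1, 0.8], codebooks)
# indices == [1, 0]; recon == [1.0, 1.0, 0.0, 1.0]
```

Two sub-codebooks with 2 entries each represent 2 × 2 = 4 distinct 4-channel codes while each lookup runs in only 2 dimensions, which is the combinatorial-expressiveness argument the abstract makes.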

2606.01151 2026-06-02 cs.LG

Lagrangian Perturbation Diffusion Steering: Latent Reinforcement Learning for Generative Policies

Hikmet Simsir, Ozgur S. Oguz

Affiliations: University of Michigan

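AI summary: Proposes Lagrangian Perturbation Diffusion Steering (LP-DS), which adapts a frozen generative policy by learning a compact noise-space perturbation optimized with a Lagrangian trust-region objective, raising downstream value while constraining deviation from the latent prior and improving sample efficiency and return on several benchmarks.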

Comments: Accepted as a regular paper at ICML 2026

Abstract

Behavior cloning with high-capacity generative policies achieves strong imitation performance, but is often limited by demonstration coverage and distribution shift. Direct reinforcement learning fine-tuning can improve performance, but updating large action decoders is frequently unstable and sample inefficient. We propose Lagrangian Perturbation Diffusion Steering (LP-DS), a lightweight adaptation method that improves a frozen generative policy by learning a compact noise-space perturbation before decoding. LP-DS optimizes this perturbation with a Lagrangian trust-region objective, improving downstream value while constraining deviation from the latent prior. Across RoboMimic manipulation, OpenAI Gym locomotion, and Adroit dexterous manipulation benchmarks, LP-DS improves sample efficiency, success, and return while maintaining higher action-space entropy than unconstrained noise-space steering, with return improvements of up to 25% over prior baselines. Additional evaluations with flow-matching backbones, a large vision-language-action model, and physical Franka deployment show that LP-DS is not limited to compact diffusion policies or simulated benchmarks. Project page: https://sites.google.com/view/lp-ds/home.

2606.01149 2026-06-02 cs.CV

CoSTL: Comprehensive Spatial-Temporal Representation Learning for Moment Retrieval and Highlight Detection

Xin Dong, Wenjia Geng, Wenfeng Deng, Yansong Tang

Affiliations: Shenzhen International Graduate School, Tsinghua University; Pengcheng Laboratory

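AI summary: Proposes CoSTL, a comprehensive spatial-temporal representation learning framework that jointly learns spatial detail and temporal dynamics through a text-driven progressive fine-grained image encoder and a multi-scale temporal perception module, reporting state-of-the-art results on moment retrieval and highlight detection.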

Comments: 14 pages, 3 figures

Abstract

Video Moment Retrieval (MR) and Highlight Detection (HD) are crucial tasks in video analysis that aim to localize specific moments and estimate clip-wise relevance based on a given text query. Recent approaches treat them as similar video grounding tasks and use the same architecture to solve them. These tasks require both fine-grained comprehension at the image level and high-level temporal understanding across the entire video. Existing approaches have primarily focused on temporal modeling using frame-level features, often neglecting the rich visual information related to the text query within individual frames. This oversight leads to inaccurate grounding results. To address this limitation, we propose a Comprehensive Spatial-Temporal Representation Learning Framework (CoSTL), which captures both fine-grained image-level information and temporal dynamics. Specifically, CoSTL incorporates a text-driven progressive fine-grained image encoder, performing a two-step text-driven knowledge extraction process to learn fine-grained spatial representations. Furthermore, a multi-scale temporal perception module captures comprehensive spatial-temporal representations, enhancing the model's ability to process temporal dynamics. We demonstrate state-of-the-art performance on four public benchmarks: QVHighlights, Charades-STA, TACoS, and TVSum.

2606.01148 2026-06-02 cs.CL

Not All Explanations Simulate Equally: Comparing Verbalized Feature Attributions and Self-Generated Rationales

Pingjun Hong, Benjamin Roth

Affiliations: Faculty of Computer Science, University of Vienna; UniVie Doctoral School Computer Science, University of Vienna; Faculty of Philological and Cultural Studies, University of Vienna

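AI summary: Using a counterfactual simulation setup, this study compares two explanation sources, verbalized feature attributions and self-generated rationales, by how well they let one simulate a question-answering model's behavior, and finds that explanation format and granularity significantly affect simulatability.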

2606.01145 2026-06-02 cs.AI

Reasoning4Sciences: Bridging Reasoning Language Models to All Scientific Branches

Teddy Ferdinan, Bartłomiej Koptyra, Mikołaj Langner, Tomasz Adamczyk, Łukasz Radliński, Maciej Markiewicz, Aleksander Szczęsny, Stanisław Woźniak, Tymoteusz Romanowicz, Dzmitry Pihulski, Mateusz Zbrocki, Mateusz Śmigielski, Michał Rajkowski, Mateusz Biedka, Konrad Kiełczyński, Konrad Wojtasik, Jacek Duszenko, Jan Eliasz, Piotr Matys, Michał Bernacki-Janson, Maria Bellaniar Ismiati, Latius Hermawan, Wiktoria Mieleszczenko-Kowszewicz, Anna Kubicka-Sowinska, Grzegorz Chodak, Karol Postawa, Paweł Zyblewski, Tomasz Szandała, Łukasz Sterczewski, Adrian Chajec, Pawel Niewiadomski, Piotr Gruber, Marcin Wdowikowski, Sławomir Czarnecki, Bartłomiej Kryszak, Dominik Drabik, Tomasz Kajdanowicz, Kamil Mamak, Paweł Preś, Katarzyna Paczkowska, Joachim Sobczuk, Tomasz Zięba, Jan Kocoń, Maciej Piasecki, Przemysław Kazienko

Affiliations: Poznan University of Technology; National Cheng Kung University; Universitas Katolik Musi Charitas Palembang

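AI summary: This survey gives the first comprehensive analysis of reasoning language model adoption across 28 scientific disciplines, proposes a maturity assessment framework based on domain-specific resources, exposes gaps between disciplines, and outlines future directions.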

Abstract

While Reasoning Language Models (RLMs) are rapidly emerging as powerful tools for scientific research, their impact is primarily concentrated in "hard science" fields. The slow -- or lack of -- adoption of RLMs in other branches of science is causing a widening gap in research productivity. In this survey, we provide the first comprehensive analysis of RLM adoption across 28 scientific disciplines following the classification used by the European Research Council (ERC), spanning the Social Sciences and Humanities, Physical Sciences and Engineering, and Life Sciences. We examine how RLMs are developed, evaluated, and applied across disciplines. Furthermore, we introduce a maturity-oriented assessment framework based on available domain-specific development and evaluation resources, revealing substantial disparities in RLM maturity that become even more pronounced when only publicly available resources are considered. Finally, we highlight current implementation paradigms that are gaining popularity across disciplines, current challenges, and future directions in enabling RLM adoption across science.

2606.01132 2026-06-02 cs.CV

HakushoBench: A Japanese Chart and Table VQA Benchmark from Governmental White Papers

Issa Sugiura, Shuhei Kurita, Yusuke Oda, Naoaki Okazaki

Affiliations: Institute of Science Tokyo; NII; NII LLMC

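AI summary: Builds HakushoBench, a Japanese chart and table VQA benchmark, from governmental white papers; it contains 2,053 images with manually annotated QA pairs and evaluates how deeply vision-language models understand charts and tables.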

Comments: 16 pages, 17 figures

Abstract

Understanding chart and table images is essential for applying vision-language models (VLMs) to real-world document understanding. While English benchmarks have advanced rapidly, non-English counterparts remain scarce, leaving it unclear whether this progress generalizes across languages. A key obstacle is the difficulty of collecting realistic and diverse non-English chart and table images at scale. To address this, we leverage governmental white papers as a scalable source for benchmark construction beyond English, as they contain naturally occurring charts and tables across diverse formats and domains and are freely accessible in many countries. As a first instantiation, we introduce HakushoBench, a challenging Japanese chart and table VQA benchmark built from 33 governmental white papers. HakushoBench contains 2,053 images spanning over 10 image types, with manually annotated QA pairs, designed to assess deep and holistic understanding of charts and tables, rather than local visual cues alone. Experiments across a broad range of VLMs demonstrate that HakushoBench remains challenging for open-weight models: the best open-weight model achieves only 58.6% accuracy, and a 34.9-point gap between open-weight and proprietary models highlights substantial room for improvement in complex chart and table understanding. We release our dataset and code.

2606.01128 2026-06-02 cs.LG

Local MixVR: Breaking the Communication-Sample Dependence in Distributed Learning

Tehila Dahan, Bassel Hamoud, Roie Reshef, Martin Jaggi, Kfir Y. Levy

Affiliations: Technion Haifa, Israel; EPFL Lausanne, Switzerland

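AI summary: Proposes Local MixVR, a framework that combines local updates with variance reduction to remove the dependence of communication complexity on the total number of samples N, giving a communication complexity that depends only on the number of workers M and improving on the best existing methods when M < O(N^{1/4}).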

2606.01126 2026-06-02 cs.LG cs.AI cs.CV

STARFISH: faST Accuracy Recovery in pruned networks From Internal State Healing

Shir Maon, Odelia Melamed, Adi Shamir

Affiliations: Weizmann Institute of Science

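AI summary: Proposes STARFISH, which efficiently recovers accuracy by optimizing a pruned network to align with the original network's internal states on a small unlabeled calibration set, outperforming existing methods on ViT networks.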

Abstract

Pruning is a process designed to reduce the number of weights in a large neural network. This can substantially speed up inference but might cause a considerable reduction in the model's accuracy, and thus it is usually followed by a healing process that regains some of the lost accuracy. In this paper, we propose a new healing method, STARFISH, that can recover (most of) the accuracy of any pruned network efficiently. The main idea of STARFISH is to optimize the pruned network to align with the original network's internal state representations using a tiny calibration set of unlabeled examples. For the common case of removing 50% of the weights, STARFISH healing improves the recovered accuracy by up to 22% over the state-of-the-art methods on ViT-based networks. Its advantage is even more pronounced under aggressive pruning. For example, after eliminating 75% of the weights in a DeiT-B network for ImageNet, STARFISH uses only 0.4% of the number of training images as a calibration set and recovers 82% of the original dense accuracy, whereas competing recovery techniques reach only 40% of the dense model accuracy.

2606.01123 2026-06-02 cs.LG

From Reward-Free Representations to Preferences: Rethinking Offline Preference-Based Reinforcement Learning

Jun-Jie Yang, Chia-Heng Hsu, Kui-Yuan Chen, Ping-Chun Hsieh

Affiliations: arXiv.org; GitHub

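AI summary: Proposes an offline preference-based RL framework that combines reward-free representation learning with contrastive search and fine-tuning: it learns latent successor-measure representations from reward-free offline data, then uses preference data for contrastive search and fine-tuning, substantially improving preference efficiency.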

Comments: Published in ICML 2026

Abstract

Preference-based reinforcement learning (PbRL) avoids explicit reward engineering by learning from pairwise human preference feedback. Existing offline PbRL methods typically follow a two-stage pipeline, first learning a reward or preference model from labeled preferences and then performing offline RL on unlabeled data. We revisit offline PbRL through the lens of reward-free representation learning (RFRL) from the zero-shot RL literature, and propose a new training framework that first learns latent successor-measure representations from reward-free offline data, followed by contrastive search and fine-tuning using preference data. Through extensive experiments and ablations, we show that our method achieves superior preference efficiency over offline PbRL baselines. This work is the first to connect RFRL with PbRL, highlighting its potential as a feedback-efficient solution. Our code is publicly available at https://github.com/rl-bandits-lab/FB-PbRL.

2606.01122 2026-06-02 cs.LG q-fin.CP

A Per-Component Diagnostic Protocol for Neural HJB-PIDE Solvers under Control-Dependent Lévy Jumps

R. Drissi

Affiliations: GitHub

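AI summary: Proposes a five-step diagnostic protocol for detecting miscomputed operators in residual-trained neural HJB-PIDE solvers with control-dependent Lévy jumps, and demonstrates it on a CRRA-Merton-Variance-Gamma benchmark.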

Abstract

We propose a five-step diagnostic protocol for residual-trained neural HJB-PIDE solvers with control-dependent Lévy jumps, targeting a general failure mode of neural PDE methods: a learned solution can match headline scalar diagnostics while miscomputing an operator inside its training loss. The protocol pairs each neural solve with at least one from-scratch independent reference, decomposes the Hamiltonian into drift, diffusion, compensator, and nonlocal-integral components across a u-grid, and compares the value function and its low-order derivatives over a (t,x) grid before any argmax comparison. Applied to a standard CRRA-Merton-Variance-Gamma benchmark, it isolates a missing 1/2-mixture factor in the neural method's importance-proposal density that scaled the nonlocal integral by exactly half - a textbook signature of a constant proposal scale error, invisible to longer training, grid refinement, and truncation sweeps. With the bug corrected, four references - two finite-difference solvers with disjoint discretizations, the neural solver, and a semi-analytic scalar baseline obtained from CRRA homogeneity - agree on the optimal control to within ~2%. The constant-coefficient CRRA benchmark collapses by homogeneity to a scalar maximization, so the scalar baseline is the efficient method here; the contribution is the protocol, applicable in principle to non-homogeneous and higher-dimensional settings where neural HJB-PIDE solvers are genuinely needed. The episode is a concrete instance of a broader neural-PDE verification failure: pointwise agreement of a learned value or control can coexist with a systematically wrong nonlocal operator, so per-component and surface-level checks are needed before trusting the argmax policy.
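
The failure mode in the abstract, a mixture proposal density missing its 1/2 factor, can be reproduced in a toy importance-sampling estimate. This sketch is an assumption-laden stand-in: it uses a two-component Gaussian mixture and a Gaussian integrand with known integral 1 instead of the paper's Variance-Gamma jump measure, and all names and constants are hypothetical.

```python
import math
import random

def normal_pdf(z, mu):
    """Density of a unit-variance normal with mean mu."""
    return math.exp(-0.5 * (z - mu) ** 2) / math.sqrt(2.0 * math.pi)

def estimate(f, n, seed, include_half):
    """Importance-sampling estimate of the integral of f, drawing from an
    equal-weight mixture of N(-1, 1) and N(1, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        mu = -1.0 if rng.random() < 0.5 else 1.0  # pick a mixture component
        z = rng.gauss(mu, 1.0)
        q = normal_pdf(z, -1.0) + normal_pdf(z, 1.0)
        if include_half:
            q *= 0.5  # correct mixture density: 0.5 * q1 + 0.5 * q2
        total += f(z) / q
    return total / n

def integrand(z):
    """Toy integrand whose exact integral is 1."""
    return normal_pdf(z, 0.0)

correct = estimate(integrand, 20000, 0, include_half=True)
buggy = estimate(integrand, 20000, 0, include_half=False)
```

With the same samples, `buggy` is half of `correct` regardless of the sample count, so drawing more samples never exposes the error. Only a comparison against an independent reference (here the known integral of 1) does, which is the motivation for the per-component checks in the protocol.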

2606.01118 2026-06-02 cs.CV

Rank-Aware Quantile Activation for Motion-Robust Crop Segmentation in UAV Imagery

Abinav Kiran, Sravan Danda, Aditya Challa, Sougata Sen, Daya Sagar B S

Affiliations: Senior Member, IEEE

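AI summary: To counter the degradation of semantic segmentation caused by motion blur in high-speed UAV imagery, the paper proposes Dual Quantile Activation (QAct), a rank-aware block that replaces magnitude gating with instance-level rank normalization; it improves mIoU in both zero-shot and blur-supervised settings, most strongly on rare texture-dependent classes, and is complementary to blur-domain training.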

Abstract

Motion blur from high-speed UAV acquisition degrades semantic segmentation on rare texture-dependent classes with high agronomic value. Standard CNNs rely on high-frequency magnitude features that blur destroys, causing statistical erasure of minority signals. We propose Dual Quantile Activation (QAct), a rank-aware block replacing magnitude gating with instance-level rank normalization. Evaluated on Agriculture-Vision 2021 across zero-shot and blur-supervised regimes at multiple severities, QAct is the dominant architectural factor: it delivers consistent mIoU gains over ReLU across both regimes and all severities, with the strongest gains on rare structural and texture-dependent classes. Some dominant classes (water, planter skip) show mixed per-class performance under distillation. At moderate blur, zero-shot QAct outperforms distillation-trained ReLU; across all severities, Distill-QAct achieves the best performance, confirming that rank-aware activation and blur-domain training are complementary sources of robustness.
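
Why rank-based normalization is less sensitive to attenuated magnitudes than magnitude gating can be seen in a small sketch. This is a generic rank normalization, not the paper's QAct block; modeling blur as uniform scaling by 0.1 is a deliberately crude assumption, and the function names are hypothetical.

```python
def rank_normalize(values):
    """Map each value to rank / (n - 1) in [0, 1]; ties share the average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1  # extend the run of tied values
        average = (i + j) / 2.0
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return [r / (n - 1) if n > 1 else 0.0 for r in ranks]

def relu(values):
    """Magnitude gating baseline."""
    return [max(0.0, v) for v in values]

sharp = [-2.0, 0.5, 4.0, 1.5]
blurred = [0.1 * v for v in sharp]  # toy stand-in for blur: attenuated magnitudes

same_ranks = rank_normalize(sharp) == rank_normalize(blurred)
same_relu = relu(sharp) == relu(blurred)
```

In this toy case `same_ranks` is true while `same_relu` is false: the ordering of activations within an instance survives attenuation even though their magnitudes do not.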

2606.01117 2026-06-02 cs.LG cs.AI

HASTE: Hardware-Aware Dynamic Sparse Training for Large Output Spaces

Nasib Ullah, Jinbin Zhang, Jean Lucien Randrianantenaina, Erik Schultheis, Rohit Babbar

Affiliations: University of Waterloo

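AI summary: Proposes group-shared fixed fan-in sparsity, a semi-structured output-layer design combined with a long-tail decomposition, which delivers substantial speedups in extreme multi-label classification while maintaining accuracy.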

Comments: Accepted at ICML 2026 Regular

Abstract

Extreme multi-label classification (XMC) involves learning models over large output spaces with millions of labels, making the output layer a memory-compute bottleneck. While sparsity-based methods reduce arithmetic complexity, they often fail to yield proportional speedups due to irregular memory access, poor hardware utilization, or reliance on auxiliary architectural components in long-tailed regimes. We introduce group-shared fixed fan-in sparsity, a semi-structured output-layer design in which semantically related labels share a sparse input pattern while retaining independent weights. This grouping introduces a task-aligned inductive bias -- encouraging related labels to share feature subsets -- while reducing index memory overhead, increasing feature reuse across labels, and enabling efficient GPU execution via custom CUDA kernels that leverage modern accelerator primitives. As an alternative to auxiliary objectives, we exploit the long-tailed structure of XMC by decomposing the output layer into a small dense head over frequent labels and a group-shared sparse tail over the remainder, providing an informative gradient pathway while preserving the memory benefits of sparsity. Through kernel-level microbenchmarking, we show that group-shared fixed fan-in translates arithmetic reductions into practical wall-clock gains, achieving up to $4.4\times$ speedup in the forward pass and up to $25\times$ speedup in backward passes over standard fixed fan-in sparsity, while operating within a few percent of a FLOPs-matched dense bottleneck. Across large-scale XMC benchmarks, our approach matches or improves precision@k over prior sparse baselines, while narrowing the performance gap to dense.
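
The group-shared fixed fan-in layout can be sketched in plain Python to show where the savings come from. This is a CPU illustration of the data layout only, not the paper's CUDA kernels; the group structure, weights, and function name are made up for the example.

```python
def group_shared_forward(x, groups):
    """Compute logits for a group-shared sparse output layer.

    x: dense feature vector.
    groups: list of (indices, weights) pairs, where `indices` is the sparse
    input pattern shared by every label in the group and `weights` holds one
    independent weight row per label over those indices.
    """
    logits = []
    for indices, weights in groups:
        gathered = [x[i] for i in indices]  # one gather per group, reused by each label
        for row in weights:
            logits.append(sum(w * g for w, g in zip(row, gathered)))
    return logits

x = [1.0, 2.0, 3.0, 4.0]
groups = [
    ([0, 2], [[1.0, 1.0], [0.5, -1.0]]),  # two labels sharing inputs {0, 2}
    ([1, 3], [[2.0, 0.0]]),               # one label using inputs {1, 3}
]
logits = group_shared_forward(x, groups)
# logits == [4.0, -2.5, 4.0]
```

The index list is stored once per group rather than once per label, and the gathered features are reused across the group's labels, which corresponds to the reduced index memory and increased feature reuse the abstract describes.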

2606.01112 2026-06-02 cs.RO

Tether-Aware Dynamic Collision Avoidance for USV-HROV Systems

USV-HROV系统的系缆感知动态避碰

Yang Gu, Ziyang Hong, Xuanlin Chen, Hao Wei, Cheng Wang, Shujie Yang, Yulin Si

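Affiliations: Zhejiang University

AI summary: To address the risks of the submerged tether scraping passing vessels and becoming taut while a USV tracks an HROV, the paper proposes a tether-aware dynamic collision avoidance method. It introduces a tether safety-aware planar domain and a tether tautness-aware velocity obstacle method, achieving safe avoidance while reducing the likelihood of tether tautness.

Chinese abstract (AI-translated)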

由无人水面艇（USV）和混合遥控潜水器（HROV）组成的异构海洋机器人系统在海底电缆检测中展现出巨大潜力。在此类任务中，USV在水面跟踪HROV，同时通过脐带缆提供电力和通信。然而，USV在跟踪HROV时的动态避碰具有挑战性，因为水下系缆可能刮擦过往船只，而规避机动会增大USV-HROV间距，从而增加系缆绷紧的可能性并影响HROV操作。为解决这些挑战，本文提出了一种用于跟踪HROV的USV的系缆感知动态避碰方法。首先，引入系缆安全感知平面域，以表示系缆与障碍船之间的三维碰撞风险，无需显式系缆形状模型。其次，开发了系缆绷紧感知速度障碍法，以实现安全避碰并降低系缆绷紧的可能性。最后，该方法与视线制导集成，以协调HROV跟踪和避碰。基于Gazebo的仿真表明，所提方法能够避开动态障碍船，同时保持系缆安全并降低USV规避机动期间系缆绷紧的可能性。

English abstract

Heterogeneous marine robotic systems composed of an unmanned surface vehicle (USV) and a hybrid remotely operated vehicle (HROV) have shown great potential for subsea cable inspection. In such missions, the USV tracks the HROV at the surface while supplying power and communication through an umbilical tether. However, dynamic collision avoidance for the USV during HROV tracking is challenging because the submerged tether may scrape against passing vessels, while evasive maneuvers can enlarge the USV--HROV separation, thereby increasing the likelihood of tether tautness and compromising HROV operations. To address these challenges, this work proposes a tether-aware dynamic collision avoidance method for a USV tracking an HROV. First, a tether safety-aware planar domain is introduced to represent the three-dimensional collision risk between the tether and obstacle vessels without an explicit tether shape model. Second, a tether tautness-aware velocity obstacle method is developed to achieve safe avoidance while reducing the likelihood of tether tautness. Finally, the method is integrated with line-of-sight guidance to coordinate HROV tracking and collision avoidance. Gazebo-based simulations show that the proposed method avoids dynamic obstacle vessels while maintaining tether safety and reducing the likelihood of tether tautness during USV evasive maneuvers.
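
For readers unfamiliar with velocity obstacles, the following is a minimal sketch of the classical planar test that such methods build on. It does not include the tether tautness awareness described above, and the function and parameter names are assumptions.

```python
def in_velocity_obstacle(p_rel, v_rel, radius):
    """Return True if keeping v_rel leads to a collision.

    p_rel  : obstacle position minus own position (x, y)
    v_rel  : own velocity minus obstacle velocity (x, y)
    radius : combined safety radius of the two vessels
    """
    px, py = p_rel
    vx, vy = v_rel
    if px * px + py * py <= radius * radius:
        return True                      # already inside the safety disc
    speed_sq = vx * vx + vy * vy
    if speed_sq == 0.0:
        return False                     # no relative motion, distance stays constant
    # Time at which the relative motion passes closest to the obstacle.
    t_closest = (px * vx + py * vy) / speed_sq
    if t_closest <= 0.0:
        return False                     # closest approach lies in the past
    dx = px - vx * t_closest
    dy = py - vy * t_closest
    return dx * dx + dy * dy < radius * radius
```

A planner would typically reject candidate velocities for which this test is true and pick the admissible one closest to the tracking command.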

2606.01106 2026-06-02 cs.CV

Temporal Evidence Routing with Structured Visual Evidence for TimeLogicQA

基于结构化视觉证据的时间证据路由用于TimeLogicQA

Yuyang Sun, Yongliang Wu, Xingyu Zhu, Yuxia Chen, Zhenxiang Jiang, Yangguang Ji, Wenbo Zhu, Yanxi Shi, Jay Wu, Shuo Wang, Xu Yang

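Affiliations: Southeast University; National University of Singapore; Independent Researcher; Opus AI Research; University of Science and Technology of China

AI summary: Proposes a visual evidence routing pipeline that separates perception from symbolic temporal reasoning and, using structured visual evidence and deterministic temporal rules, reaches an AvgAcc of 81.8 on TimeLogicQA.

Chinese abstract (AI-translated)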

TimeLogicQA评估视频问答系统是否能推理事件存在、顺序、持续性、边界条件和重叠等时间关系。我们通过一个视觉证据路由流水线来处理此任务，该流水线将感知与符号时间推理分离。系统首先将每个问题解析为事件目标、答案模式、候选选项和时间算子。然后，根据持续时间和算子难度对视频进行路由，对短片段使用有序的全帧证据，对长视频使用以事件为中心的候选窗口。多模态大语言模型为相关事件生成结构化视觉证据，而程序化验证器恢复密集的动作区间，确定性归约器应用算子特定的时间规则产生最终答案。保守融合仅在视觉证据、时间程序和置信度检查一致时接受答案，减少噪声答案翻转。在官方测试评估中，我们的最终系统实现了81.8的平均准确率。

English abstract

TimeLogicQA evaluates whether video question answering systems can reason over temporal relations such as event existence, ordering, persistence, boundary conditions, and overlap. We address this task with a visual evidence routing pipeline that separates perception from symbolic temporal reasoning. The system first parses each question into event targets, answer mode, candidate options, and temporal operators. It then routes videos according to duration and operator difficulty, using ordered full-frame evidence for short clips and event-focused candidate windows for long videos. A multimodal large language model produces structured visual evidence for the relevant events, while programmatic verifiers recover dense action intervals and a deterministic reducer applies operator-specific temporal rules to produce the final answer. Conservative fusion accepts an answer only when the visual evidence, temporal program, and confidence checks agree, reducing noisy answer flips. On the official test evaluation, our final system achieves an AvgAcc of 81.8.
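
A deterministic reducer of the kind described above might look like the sketch below, which applies operator-specific rules to recovered `(start, end)` intervals. The operator names, the interval encoding, and the "any pair satisfies the rule" semantics are assumptions; the abstract does not specify them.

```python
def before(a, b):
    """Interval a ends no later than interval b starts."""
    return a[1] <= b[0]


def overlaps(a, b):
    """Intervals a and b share a span of positive length."""
    return a[0] < b[1] and b[0] < a[1]


# Hypothetical operator table; a real system would cover more operators.
RULES = {"before": before, "overlap": overlaps}


def reduce_answer(operator, intervals_a, intervals_b):
    """True if any pair of recovered intervals satisfies the operator."""
    rule = RULES[operator]
    return any(rule(a, b) for a in intervals_a for b in intervals_b)
```

Because the rules are plain functions over intervals, the same evidence always yields the same answer, which is the point of separating perception from temporal reasoning.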

2606.01104 2026-06-02 cs.CV

Adaptive Dense Evidence Refinement for Video Relational Reasoning for VRR-QA Challenge

自适应密集证据精炼用于视频关系推理：VRR-QA挑战

Yuyang Sun, Yongliang Wu, Xingyu Zhu, Yuxia Chen, Zhenxiang Jiang, Yangguang Ji, Wenbo Zhu, Yanxi Shi, Jay Wu, Shuo Wang, Xu Yang

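Affiliations: Southeast University; National University of Singapore; Independent Researcher; Opus AI Research; University of Science and Technology of China

AI summary: Proposes an adaptive test-time computation system that uses lightweight views to identify unstable questions and routes them to a high-budget dense evidence module, reaching 90.07% average accuracy on the VRR-QA test split.

Chinese abstract (AI-translated)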

VRR-QA评估视频语言系统能否推断空间、时间、视角、深度和可见性关系，这些关系通常无法通过单帧解决。我们提出一个仅推理的系统，基于自适应测试时计算。系统首先通过直接视频语言模型传递回答每个问题，然后使用多个轻量视图发现不稳定问题。只有这些困难问题被路由到高预算密集证据模块，该模块构建带时间戳的帧观察、关系特定探针、候选验证和保守的时间聚合。这种设计分离了视频问答中常混淆的两个问题：寻找合理的替代答案以及决定何时应更改当前答案。在测试集上，最终系统获得90.07平均准确率和87.81宏平均准确率。报告重点介绍最终测试系统和复现自适应密集验证器所需的实现设置。

English abstract

VRR-QA evaluates whether video-language systems can infer spatial, temporal, viewpoint, depth, and visibility relations that are not always resolved by a single frame. We present an inference-only system built around adaptive test-time computation. The system first answers each question with a direct video-language model pass, then uses multiple lightweight views to find unstable questions. Only these difficult questions are routed to a high-budget dense evidence module that constructs timestamped frame observations, relation-specific probes, candidate verification, and conservative temporal aggregation. This design separates two problems that are often confused in video question answering: finding plausible alternative answers and deciding when a current answer should actually be changed. On the test split, the final system obtains 90.07 average accuracy and 87.81 macro average accuracy. The report focuses on the final test system and the implementation settings required to reproduce the adaptive dense verifier.
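
The routing logic can be sketched in a few lines, assuming that a question counts as unstable whenever any lightweight view disagrees with the direct answer. That criterion and the `dense_verifier` callable are hypothetical stand-ins; the abstract does not give the exact rule.

```python
def answer_with_routing(direct_answer, view_answers, dense_verifier):
    """Keep the direct answer when stable, otherwise escalate.

    direct_answer  : answer from the single direct model pass
    view_answers   : answers from the cheap lightweight views
    dense_verifier : hypothetical callable (current, views) -> final answer
    """
    # Stable question: every lightweight view agrees with the direct pass.
    if all(v == direct_answer for v in view_answers):
        return direct_answer
    # Unstable question: spend the high budget only here.
    return dense_verifier(direct_answer, view_answers)
```

The expensive module runs only on the disagreeing subset, so total compute scales with the number of unstable questions rather than with the whole test set.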

2606.01101 2026-06-02 cs.LG cs.AI

Soft-NBCE: Entropy-Weighted Chunk Fusion for Long-Context

Soft-NBCE: 基于熵加权分块融合的长上下文处理

Shihao Ji, Mingyu Li, Zihui Song

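Affiliations: Beijing Normal University; Chunjiang Intelligence

AI summary: To address the semantic fragmentation caused by hard-selection strategies in long-context inference, the paper proposes Soft-NBCE, which uses soft entropy-weighted fusion and Consistency Distillation to improve multi-hop reasoning while maintaining retrieval accuracy.

Comments: 7 pages, 3 figures, 2 tables. Preprint

Chinese abstract (AI-translated)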

自注意力的二次复杂度仍然是大型语言模型（LLMs）处理超长上下文的瓶颈。朴素贝叶斯认知引擎（NBCE）通过将文档分块并在每个解码步骤路由到熵最低的分块，实现了长上下文推理的并行化。这种硬选择策略在跨分块推理时会导致语义碎片化，因为相邻token之间的突然路由变化破坏了模型的上下文基础。我们提出了Soft-NBCE，这是一种轻量级扩展，用软熵加权分块融合替代了离散的分块选择。通过预测熵上的温度缩放Softmax，为所有分块分配连续权重，实现了跨分块条件分布的log空间聚合。为了部分补偿分块引入的条件独立性假设，我们提出了一致性蒸馏，这是一种基于LoRA的自蒸馏方法，通过KL散度将分块logit分布约束为全上下文教师分布。在LongBench多跳基准测试中，带有一致性蒸馏的Soft-NBCE在NBCE风格基线（MuSiQue F1: 0.310 vs. 0.275（Vanilla NBCE）；HotpotQA F1: 0.479 vs. 0.427）上持续改进，同时在O(L^2/n)峰值内存下保持检索精度（NIAH-32K: 0.909）。

English abstract

The quadratic complexity of self-attention remains a bottleneck for Large Language Models (LLMs) processing ultra-long contexts. The Naive Bayes Cognitive Engine (NBCE) parallelizes long-context inference by chunking documents and routing to the lowest-entropy chunk at each decoding step. This hard-selection strategy causes semantic fragmentation during cross-chunk reasoning, as abrupt routing changes between adjacent tokens disrupt the model's contextual grounding. We present Soft-NBCE, a lightweight extension that replaces discrete chunk selection with soft entropy-weighted chunk fusion. A temperature-scaled Softmax over predictive entropies assigns continuous weights to all chunks, enabling log-space aggregation across chunk-conditioned distributions. To partially compensate for the conditional independence assumption introduced by chunking, we propose Consistency Distillation, a LoRA-based self-distillation method that constrains the chunked logit distribution toward a full-context teacher via KL-divergence. On LongBench multi-hop benchmarks, Soft-NBCE with Consistency Distillation improves consistently over NBCE-style baselines (MuSiQue F1: 0.310 vs. 0.275 for Vanilla NBCE; HotpotQA F1: 0.479 vs. 0.427) while maintaining retrieval accuracy (NIAH-32K: 0.909) at O(L^2/n) peak memory.
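
The fusion step can be sketched as follows. This is a minimal illustration, assuming that the Softmax is taken over negative entropies (so lower-entropy chunks get larger weights, consistent with NBCE's lowest-entropy routing) and that the weighted log-probabilities are renormalized afterwards. The function names and the toy distributions are not from the paper.

```python
import math


def entropy(p):
    """Shannon entropy (nats) of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def soft_fuse(chunk_probs, temperature=1.0):
    """Fuse per-chunk next-token distributions with entropy-based weights.

    chunk_probs : list of probability vectors, one per chunk,
                  assumed strictly positive (e.g. Softmax outputs)
    Returns (weights, fused_distribution).
    """
    # Temperature-scaled Softmax over negative entropies (assumed sign).
    logits = [-entropy(p) / temperature for p in chunk_probs]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Log-space aggregation: weighted sum of per-chunk log-probabilities.
    vocab = len(chunk_probs[0])
    fused_log = [
        sum(w * math.log(p[v]) for w, p in zip(weights, chunk_probs))
        for v in range(vocab)
    ]

    # Renormalize with a stable log-sum-exp.
    peak = max(fused_log)
    lse = peak + math.log(sum(math.exp(f - peak) for f in fused_log))
    return weights, [math.exp(f - lse) for f in fused_log]


confident = [0.90, 0.05, 0.05]
uncertain = [1 / 3, 1 / 3, 1 / 3]
weights, fused = soft_fuse([confident, uncertain], temperature=1.0)
```

As the temperature approaches zero the weights concentrate on the lowest-entropy chunk, which recovers hard selection as a limiting case.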

2606.01099 2026-06-02 cs.CL cs.AI

MiCU: End-to-End Smart Home Command Understanding with Large Language Model

MiCU: 基于大语言模型的端到端智能家居指令理解

Haowei Han, Kexin Hu, Weiwei Cai, Debiao Zhang, Bin Qin, Yuxiang Wang, Jiawei Jiang, Xiao Yan, Bo Du

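Affiliations: School of Computer Science, Wuhan University; Xiaomi Corporation; Institute for Math & AI, Wuhan University

AI summary: Proposes MiCU, a domain-specific LLM that uses curriculum learning, reinforcement learning, and token compression to understand ambiguous smart home commands, with an average accuracy gain of 20.01%.

DOI: 10.1145/3770855.3818446

Chinese abstract (AI-translated)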

智能家居生态系统中的指令理解系统可以自动化设备控制并显著改善用户体验。然而，尽管它们在精确表述（例如“打开卧室灯”）上表现良好，但在处理模糊或不一致的指令（例如“让卧室变得舒适”）时却存在困难。大语言模型（LLM）在各种领域都能很好地泛化，并且在此类任务上可以超越传统的基于规则的系统，但其有效性通常受到领域特定数据稀缺、任务特定适应性不足以及高计算成本的限制。在本文中，我们提出了一种利用用户日志和LLM的自动化训练数据合成工作流程；然后构建了MiCU，一个在指令理解方面表现出色的领域特定LLM。具体来说，我们采用课程学习将领域知识注入基础LLM，然后通过冷启动训练结合领域特定思维规则引导的强化学习（RL）来增强其推理能力。此外，我们引入了一种令牌压缩技术，将设备描述压缩为单个特殊令牌，从而显著降低推理开销，并实现了MiCU-fast，一种针对长输入优化的高效变体。大量实验表明，MiCU显著优于基线，在所有设备类别上平均准确率提升20.01%。我们已在米家（Xiaomi Home）应用中部署了MiCU，每天接收约170万页面浏览量。生产评估显示，MiCU将用户纠正率降低了1.57%，并将人工审核准确率提高了32.05%。我们的数据和代码可在https://github.com/xiaomi-research/iot_spec_llm获取。

English abstract

Command understanding systems in smart home ecosystems can automate device control and substantially improve user experience. However, while they perform well on precise utterances (e.g., "turn on the bedroom light"), they struggle with ambiguous or misaligned commands (e.g., "make the bedroom cozy"). Large language models (LLMs) generalize well across various domains and can outperform traditional rule-based systems on such tasks, but their effectiveness is often constrained by scarce domain-specific data, insufficient task-specific adaptation, and high computational costs. In this paper, we propose an automated training data synthesis workflow using user logs and LLMs; then we build MiCU, a domain-specific LLM that excels at command understanding. Specifically, we employ curriculum learning to inject domain knowledge into the base LLM, then we enhance its reasoning ability via cold-start training combined with reinforcement learning (RL) guided by domain-specific thinking rules. Additionally, we introduce a token compression technique that condenses a device description into a single special token, substantially reducing inference overhead and enabling MiCU-fast, an efficient variant optimized for long inputs. Extensive experiments show that MiCU significantly outperforms baselines, with an average accuracy gain of 20.01% across all device categories. We have deployed MiCU in the Xiaomi Home app, receiving approximately 1.7 million page views per day. Production evaluations show that MiCU reduces user correction rate by 1.57% and increases human audited accuracy by 32.05%. Our data and code are available at https://github.com/xiaomi-research/iot_spec_llm

2606.01098 2026-06-02 cs.RO cs.AI

Implicit Drifting Policy: One-Step Action Generation via Conditional Expert Geometry

隐式漂移策略：通过条件专家几何实现单步动作生成

Zemin Yang, Yaoyu He, Yiming Zhong, Yuhao Zhang, Xinge Zhu, Yao Mu, Qingqiu Huang, Yuexin Ma

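Affiliations: ShanghaiTech University; Shanghai Jiao Tong University; The Chinese University of Hong Kong; Morphi Robot

AI summary: Proposes Implicit Drifting Policy (IDP), a one-step imitation learning framework that implicitly introduces training-time drifting correction through conditional expert geometry, without explicit vector field estimation. Across 2D, 3D, and real-world manipulation tasks it maintains adherence to valid action manifolds, improves on explicit drifting methods, and is competitive with strong one-step baselines.

Chinese abstract (AI-translated)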

基于扩散或流匹配的生成动作策略在行为克隆中表现出色，但其迭代采样对于高频机器人控制来说过于耗时。尽管最近的单步公式缓解了这种延迟，但它们不可避免地丢弃了提供关键动作校正的中间轨迹演化。由于条件演示极端稀疏，通过显式估计训练时漂移场直接恢复这一机制在数学上是不适定的。我们提出了隐式漂移策略（IDP），一种单步模仿学习框架，无需显式向量场估计即可将训练时的漂移校正引入策略学习。IDP从观测相似专家动作的局部变化中提取条件专家几何，并将其与全局参考几何进行比较，以分离条件特定的约束。这种局部几何结构自适应地加权一个标量势目标。结合专家近端终端评估，IDP在训练期间直接对单步生成器施加流形约束。在2D、3D和真实世界操作任务上的广泛评估表明，IDP有效保持了对有效动作流形的遵循，优于显式漂移方法，并达到了与强单步基线相当的性能。

English abstract

Generative action policies based on diffusion or flow matching excel in behavior cloning, yet their iterative sampling is prohibitive for high-frequency robot control. While recent one-step formulations alleviate this latency, they inevitably discard the intermediate trajectory evolution that provides crucial action correction. Directly recovering this mechanism by explicitly estimating a training-time drifting field is mathematically ill-posed due to extreme conditional demonstration sparsity. We introduce Implicit Drifting Policy (IDP), a one-step imitation learning framework that brings the training-time correction of Drifting into policy learning without explicit vector field estimation. IDP extracts a conditional expert geometry from the local variation of observation-similar expert actions, and compares it against a global reference geometry to isolate condition-specific constraints. This local geometric structure adaptively weights a scalar potential objective. Combined with an expert-proximal terminal evaluation, IDP directly enforces manifold constraints on the one-step generator during training. Extensive evaluations across 2D, 3D, and real-world manipulation tasks show IDP effectively maintains adherence to valid action manifolds, improving upon explicit drifting methods and achieving competitive performance with strong one-step baselines.

2606.01097 2026-06-02 cs.CV

Dual-Route Top-K Retrieval with 1v1 VLM Reranking for the CoVR-R

双路Top-K检索与1v1 VLM重排序用于CoVR-R

Yuyang Sun, Yongliang Wu, Xingyu Zhu, Yuxia Chen, Zhenxiang Jiang, Yangguang Ji, Wenbo Zhu, Yanxi Shi, Jay Wu, Shuo Wang, Xu Yang

Affiliations: Southeast University; National University of Singapore; Independent Researcher; Opus AI Research; University of Science and Technology of China

AI summary: Proposes Dual-Route Top-K Retrieval with 1v1 VLM Reranking, which decouples recall from selection and reaches 95.28% R@1 in the CoVR-R challenge.

Chinese abstract (AI-translated)

我们描述了用于CoVR-R挑战的双路Top-K检索与1v1 VLM重排序方法。该方法将组合视频检索视为两个耦合问题：找到一个足够完整的Top-K候选集，然后安全地决定是否有任何候选应替换当前强Top-1。我们首先通过VLM槽选择器对现有候选进行推理/文本种子改进，而不引入DFN视觉检索。然后，我们使用DFN-H/DFN-L从缩略图拼版（contact sheet）嵌入中添加视觉路径。这些路径合并为一个Top-10候选集，之后VLM最终重排序器在当前Top-1和每个挑战者之间进行保守的1v1比较。在隐藏测试集上，最终系统达到95.28 R@1、97.47 R@5、98.48 R@10和99.66 R@50。主要经验是CoVR-R从召回-选择解耦中获益更多，而非广泛的文本重排序或直接的多候选VLM分类。

English abstract

We describe Dual-Route Top-K Retrieval with 1v1 VLM Reranking for the CoVR-R challenge. The method treats composed video retrieval as two coupled problems: finding a sufficiently complete top-k candidate set, and then safely deciding whether any candidate should replace a strong current top-1. We first improve the reasoning/text seed with a VLM slot selector over existing candidates, without introducing DFN visual retrieval. We then add a visual route from contact-sheet embeddings using DFN-H/DFN-L. The routes are merged into a top-10 candidate set, after which a VLM final reranker performs conservative 1v1 comparisons between the current top-1 and each challenger. On the hidden test split, the final system reaches 95.28 R@1, 97.47 R@5, 98.48 R@10, and 99.66 R@50. The main lesson is that CoVR-R benefits more from recall-selection decoupling than from broad text reranking or direct multi-candidate VLM classification.
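
The conservative 1v1 stage can be sketched as a champion/challenger loop. The comparator `prefers_challenger` stands in for the VLM judgment and is hypothetical, as is the choice to let a winning challenger become the new champion for later comparisons; the abstract does not state either detail.

```python
def rerank_1v1(candidates, prefers_challenger):
    """Return the final top-1 after pairwise challenges.

    candidates         : ranked list, candidates[0] is the current top-1
    prefers_challenger : hypothetical callable (champion, challenger) -> bool
    """
    champion = candidates[0]
    for challenger in candidates[1:]:
        # The champion is replaced only on an explicit win for the challenger.
        if prefers_challenger(champion, challenger):
            champion = challenger
    return champion


# Toy comparator: a challenger must beat the champion by a clear margin.
scores = {"v1": 0.70, "v2": 0.72, "v3": 0.90}


def margin_judge(champion, challenger, margin=0.1):
    return scores[challenger] - scores[champion] > margin


best = rerank_1v1(["v1", "v2", "v3"], margin_judge)
```

Requiring a margin is one simple way to make the comparison conservative: a near-tie leaves the existing top-1 in place.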

2606.01095 2026-06-02 cs.RO cs.AI

Beyond Task Success: Behavioral and Representational Diagnostics for WAM and VLA

超越任务成功：WAM 和 VLA 的行为与表征诊断

Hung Mai, Bin Zhu, Tuan Do

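Affiliations: National Economics University, Vietnam; Singapore Management University; Phenikaa University, Vietnam

AI summary: Proposes a model-agnostic diagnostic framework that uses behavioral analysis and sparse-autoencoder-based feature analysis to compare World-Action Models (WAMs) and vision-language-action (VLA) policies in robotic manipulation. WAMs often improve target selectivity and object-level behavior over VLAs but at higher inference cost, and different WAM architectures encode future information differently.

Chinese abstract (AI-translated)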

视觉-语言-动作（VLA）策略和世界动作模型（WAM）代表了机器人操作中两种日益重要的范式。然而，尚不清楚WAM中的未来预测是否在最终任务成功之外带来行为上有意义的改进。在本文中，我们探究WAM是否仅仅增加了未来预测，还是以对控制可操作的方式改变了机器人行为和内部表征。我们引入一个模型无关的诊断框架，通过两个互补的视角比较WAM和VLA：行为 rollout 分析和基于稀疏自编码器的特征分析。行为协议测量动作动态一致性、目标物体进展、干扰物干扰和运行时成本。特征空间协议将内部表征表征为记忆型、反应型或预测型，揭示模型是否编码了面向未来的结构。在LIBERO和RoboTwin2.0上，我们评估了7种策略，涵盖直接VLA以及联合、顺序和辅助WAM。我们的结果表明，仅凭成功隐藏了关键差异：WAM通常改善物体级行为和目标选择性，但其收益依赖于架构并导致更高的推理成本。顺序WAM显示出最清晰的预测结构，而辅助和联合WAM分别压缩或纠缠未来信息。这些发现为WAM设计提供了未来方向，以保留行为可操作的未来表征，实现高效操作。

English abstract

Vision-language-action (VLA) policies and World-Action Models (WAMs) represent two increasingly important paradigms for robotic manipulation. However, it remains unclear whether future prediction in WAMs leads to behaviorally meaningful improvements beyond final task success. In this paper, we ask whether WAMs merely add future prediction, or whether they change robot behavior and internal representations in ways that are actionable for control. We introduce a model-agnostic diagnostic framework that compares WAMs and VLAs through two complementary lenses: behavioral rollout analysis and sparse-autoencoder-based feature analysis. The behavioral protocol measures action dynamics consistency, target-object progress, distractor disturbance, and runtime cost. The feature-space protocol characterizes internal representations as memorized, reactive, or predictive, revealing whether models encode future-oriented structure. Across LIBERO and RoboTwin2.0, we evaluate 7 policies spanning direct VLAs and joint, sequential, and auxiliary WAMs. Our results show that success alone hides key differences: WAMs often improve object-level behavior and target selectivity, but their gains depend on architecture and incur higher inference cost. Sequential WAMs show the clearest predictive structure, while auxiliary and joint WAMs respectively compress or entangle future information. These findings suggest future directions for WAM design that preserve behaviorally actionable future representations for efficient manipulation.

2606.01094 2026-06-02 cs.AI

CAREAgent: Clinical Agent with Structured Reasoning and Tool-Integrated for Order Generation

CAREAgent: 具有结构化推理和工具集成的临床智能体用于医嘱生成

Ruihui Hou, Ziyue Huai, Chennuo Zhang, Ziyan Liu, Siran Zhao, Yao Yu, Jie Zhai, Tong Ruan

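Affiliations: East China University of Science and Technology, Shanghai, China; Zhongshan Hospital, Fudan University, Shanghai, China

AI summary: Proposes CAREAgent, which generates fine-grained clinical orders using two-stage reasoning data construction followed by supervised fine-tuning and reinforcement learning, improving F1 on ClinicalBench by 5.05% over single-agent methods.

Chinese abstract (AI-translated)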

临床医嘱生成是临床决策与实际实践之间的关键桥梁，将医疗决策转化为具体可执行的医嘱。现有智能体主要关注粗粒度决策，忽略了临床医嘱所需的细粒度可执行信息。为弥补这一差距，我们提出CAREAgent，一个用于临床医嘱生成的智能体。为支持其训练，我们引入了一种两阶段智能体推理数据构建方法。首先，我们设计了一个智能体框架，构建与真实临床工具使用一致的可验证推理轨迹。其次，我们根据格式合规性、医嘱有效性和临床合理性筛选推理轨迹。基于构建的数据，模型首先通过监督微调训练以获得基本的推理格式和医学知识，随后通过具有多维奖励函数的强化学习进行优化，以增强复杂的临床推理能力。在多个基准上的实验证明了CAREAgent的有效性。在ClinicalBench（训练中未见）上，CAREAgent的F1分数分别比单智能体、多智能体和智能体推理方法提高了5.05%、2.09%和0.86%。

English abstract

Clinical order generation serves as a critical bridge between clinical decision-making and real-world practice, translating medical decisions into concrete and executable orders. Existing agents mainly focus on coarse-grained decisions and overlook the fine-grained, executable information required for clinical orders. To address this gap, we propose CAREAgent, an agent for clinical order generation. To support its training, we introduce a two-stage agentic reasoning data construction method. First, we design an agent framework that constructs verifiable reasoning trajectories aligned with realistic clinical tool usage. Second, we filter reasoning trajectories by format compliance, order validity, and clinical plausibility. Building on the constructed data, the model is first trained via supervised fine-tuning to acquire fundamental reasoning formats and medical knowledge, and is subsequently optimized through reinforcement learning with multi-dimensional reward functions to enhance complex clinical reasoning capabilities. Experiments on multiple benchmarks demonstrate the effectiveness of CAREAgent. On ClinicalBench (unseen during training), CAREAgent improves the F1 score by 5.05%, 2.09%, and 0.86% over the single-agent, multi-agent, and agentic reasoning methods, respectively.

2606.01092 2026-06-02 cs.LG cs.AI

A Fiber Criterion for Representation Identifiability in Supervised Learning

监督学习中表示可辨识性的纤维准则

Vasileios Sevetlidis

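Affiliations: Athena Research Center, Kimmeria Campus, Xanthi, Greece; Democritus University of Thrace, Vas. Sofias Campus, Xanthi, Greece; International Hellenic University, Serres, Greece

AI summary: Proposes a fiber criterion that formalizes identifiability of the representation-head factorization in supervised learning via constancy on the fibers of the projection map, and shows that supervised predictive behavior alone cannot uniquely determine the representation.

Chinese abstract (AI-translated)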

监督学习通过输入-输出行为评估预测器。当预测器实现为复合函数 $f=c\circ h$ 时，监督证据约束了复合映射 $f$，但未必确定表示-头部因子分解 $(h,c)$。本文形式化了由此产生的表示级可辨识性问题：对于一类可接受的表示-头部对，当且仅当表示属性在投影 $(h,c)\mapsto c\circ h$ 的纤维上为常数时，它可从诱导的预测器中辨识，等价于它下降为预测器的良定义属性。保持预测器的增广给出了一个规范障碍：辅助信息可以附加到表示上而头部忽略它，保持预测器不变但改变诸如极小性、压缩、不变性、等变性、干扰信息或语义可访问性等属性。这种构造将表示可辨识性与优化和有限样本估计分离开来。有限样本诊断说明了而非证明了该准则：精确代数见证在改变表示诊断时保持预测器固定，而匹配性能的Waterbirds模型表明不同约束可以在相似的监督性能下选择不同的表示。结果阐明，表示级声明需要超越监督预测行为本身的假设、目标、测量或归纳偏置。

English abstract

Supervised learning evaluates predictors through their input-output behavior. When a predictor is implemented as a composition $f=c\circ h$, supervised evidence constrains the composite map $f$ but need not determine the representation-head factorization $(h,c)$. This paper formalizes the resulting representation-level identifiability problem: for a class of admissible representation-head pairs, a representation property is identifiable from the induced predictor exactly when it is constant on the fibers of the projection $(h,c)\mapsto c\circ h$, equivalently when it descends to a well-defined property of the predictor. Predictor-preserving augmentation gives a canonical obstruction: auxiliary information can be appended to a representation while the head ignores it, leaving the predictor unchanged but altering properties such as minimality, compression, invariance, equivariance, nuisance information, or semantic accessibility. This construction separates representation identifiability from optimization and finite-sample estimation. Finite-sample diagnostics illustrate, rather than prove, the criterion: exact algebraic witnesses hold the predictor fixed while changing representation diagnostics, and matched-performance Waterbirds models show that different constraints can select different representations at similar supervised performance. The results clarify that representation-level claims require assumptions, objectives, measurements, or inductive biases beyond supervised predictive behavior alone.
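
The predictor-preserving augmentation can be made concrete with a toy example. The functions below (a parity representation, an identity head, and an injectivity check used as the representation property) are illustrative assumptions, not constructions taken from the paper.

```python
def h(x):
    """Original representation: keeps only the parity of x."""
    return (x % 2,)


def c(z):
    """Head: reads the single representation coordinate."""
    return z[0]


def h_aug(x):
    """Augmented representation: appends the raw input as auxiliary info."""
    return h(x) + (x,)


def c_aug(z):
    """Augmented head: ignores the appended coordinate."""
    return c(z[:1])


inputs = range(6)

# Both pairs lie in the same fiber: they induce the same predictor.
same_predictor = all(c(h(x)) == c_aug(h_aug(x)) for x in inputs)


def is_injective(rep):
    """Representation property: does rep retain enough to recover the input?"""
    values = [rep(x) for x in inputs]
    return len(set(values)) == len(values)
```

Here `h` compresses the input while `h_aug` does not, yet the predictors coincide, so "the representation is compressive" is not constant on the fiber and cannot be identified from predictions alone.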

2606.01091 2026-06-02 cs.CL

Deep Research as Rubric for Reinforcement Learning

深度研究作为强化学习的评估准则

Wangyi Mei, Zhouhong Gu, Zhenhan Bai, Yin Cai, Lefan Zhang, Zhenxin Ding, Bo Chen, Yan Gao, Yi Wu, Yao Hu, Jiaqing Liang, Deqing Yang

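Affiliations: Fudan University; Xiaohongshu Inc.; Beijing University of Posts and Telecommunications

AI summary: Proposes the DR-rubric framework, a two-stage process (iterative multi-turn agentic search to build rubrics, then distillation into verifiable constraints for GRPO-based policy optimization) that produces fine-grained reward signals for open-ended reasoning and long-form generation; experiments show strong performance on multiple benchmarks.

Chinese abstract (AI-translated)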

开放式推理和长文本生成任务缺乏可靠的自动验证信号用于基于奖励的策略优化。评估准则提供了一种有前景的替代方案，但现有方法将其视为给定的产物——要么手工制作，要么由提示生成——并且常常忽略最关键的、任务特定的、知识密集的维度，从而扭曲了奖励信号。我们的关键观察是，准则构建本身就是一个研究问题：识别什么使回答正确或富有洞察力需要发现和综合外部知识。我们提出了深度研究作为评估准则（DR-rubric），一个用于构建此类准则的两阶段框架。第一阶段通过迭代多轮智能体搜索引出领域事实、结构约束和失败模式；第二阶段将这些证据蒸馏为原子化的、可独立验证的约束，用于基于GRPO的策略优化。由于正在训练的模型可以作为其自身的准则生成器，DR-rubric-8B支持无需前沿模型辅助的自举准则生成。我们在涵盖智能体研究和专家推理的6个基准上进行了评估。实验表明，DR-Rubric仅使用1K-3K训练实例就取得了强劲的竞争性能，其中GPT-5生成的准则特别有利于智能体任务的广度覆盖，Gemini生成的准则在智能体和专家推理任务上取得了最平衡的性能，而自举准则表现出从专业化到再平衡的演变，在第三次迭代时达到最佳整体性能。结果表明，将准则构建从静态评估模板重新定义为证据驱动的研究过程，可以为开放式任务产生更可扩展、更细粒度的奖励信号。

English abstract

Open-ended reasoning and long-form generation tasks lack reliable automatic verification signals for reward-based policy optimization. Rubrics offer a promising alternative, but existing approaches treat them as given artifacts (either hand-crafted or prompt-generated) and often miss the task-specific, knowledge-intensive dimensions that matter most, distorting the reward signal. Our key observation is that rubric construction is itself a research problem: identifying what makes a response correct or insightful requires discovering and synthesizing external knowledge. We propose Deep Research as Rubric (DR-rubric), a two-stage framework for constructing such rubrics. Stage I elicits domain facts, structural constraints, and failure modes through iterative multi-turn agentic search; Stage II distills this evidence into atomic, independently verifiable constraints for GRPO-based policy optimization. Because the model under training can serve as its own rubric generator, DR-rubric-8B supports bootstrap rubric generation without frontier-model assistance. We evaluate on 6 benchmarks spanning agentic research and expert reasoning. Experiments show that DR-rubric achieves strong competitive performance with only 1K-3K training instances, where GPT-5-generated rubrics particularly benefit breadth coverage on agentic tasks, Gemini-generated rubrics yield the most balanced performance across agentic and expert reasoning tasks, and bootstrap rubrics exhibit a specialization-to-rebalancing evolution, achieving the best overall performance at the third iteration. Results demonstrate that reframing rubric construction from static evaluation templates into an evidence-driven research process yields more scalable, fine-grained reward signals for open-ended tasks.

2606.01084 2026-06-02 cs.LG cs.AI

MViewRouter: Internalizing Geometric Equivariance via Multi-view Alternating Attention for Combinatorial Routing

MViewRouter：通过多视图交替注意力内化组合路由的几何等变性

Shiyan Liu, Bohan Tan, Yaoxin Wu, Yan Jin

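Affiliations: Huazhong University of Science and Technology; Eindhoven University of Technology

AI summary: Proposes MViewRouter, which uses Multi-view Alternating Attention to internalize geometric equivariance as a structural inductive bias and optimizes the policy with Collective Policy Gradient Aggregation, addressing symmetry in combinatorial routing and achieving competitive solution quality and strong zero-shot generalization on TSP and CVRP.

Chinese abstract (AI-translated)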

组合路由问题，如旅行商问题（TSP）和带容量约束的车辆路径问题（CVRP），是基础的NP难问题，具有广泛的现实应用。虽然最近的深度强化学习方法显示出有希望的性能，但它们通常仅通过数据增强处理几何对称性，导致决策不一致和泛化能力有限。为了解决这个问题，我们提出了MViewRouter，一个多视图框架，将几何等变性内化为结构归纳偏置，以实现跨路由问题变体的不变决策。我们的方法引入了一种多视图交替注意力（MAA）机制，能够在$D_4$对称群上进行并行处理，在视图内关系建模和视图间特征对齐之间交替进行。此外，我们通过集体策略梯度聚合（CPGA）优化策略，利用来自多个对称视图的共识梯度来稳定训练并加速收敛。在TSP和CVRP基准测试以及真实世界的TSPLIB实例上的实验表明，MViewRouter实现了竞争性的解质量和强大的零样本泛化能力。

English abstract

Combinatorial routing problems such as the Traveling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP) are fundamental NP-hard problems with broad real-world applications. While recent deep reinforcement learning methods have shown promising performance, they typically handle geometric symmetries only through data augmentation, resulting in inconsistent decisions and limited generalization. To address this issue, we propose MViewRouter, a multi-view framework that internalizes geometric equivariance as a structural inductive bias to achieve invariant decision-making across routing problem variants. Our approach introduces a Multi-view Alternating Attention (MAA) mechanism that enables parallel processing over the $D_4$ symmetry group, alternating between intra-view relational modeling and inter-view feature alignment. Furthermore, we optimize the policy via Collective Policy Gradient Aggregation (CPGA), leveraging consensus gradients from multiple symmetric views to stabilize training and accelerate convergence. Experiments on TSP and CVRP benchmarks, as well as real-world TSPLIB instances, demonstrate that MViewRouter achieves competitive solution quality and strong zero-shot generalization.
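
To make the $D_4$ symmetry group concrete, the sketch below generates the eight symmetric views of a routing instance and checks that tour length is unchanged across them. It assumes node coordinates normalized to the unit square; how MViewRouter consumes these views inside its attention layers is not shown, and the function names are illustrative.

```python
import math


def d4_views(points):
    """Return the 8 images of a point set under the symmetries of the unit square."""
    transforms = [
        lambda x, y: (x, y),            # identity
        lambda x, y: (y, x),            # reflection across the main diagonal
        lambda x, y: (x, 1 - y),        # vertical flip
        lambda x, y: (y, 1 - x),        # rotation by a quarter turn
        lambda x, y: (1 - x, y),        # horizontal flip
        lambda x, y: (1 - y, x),        # rotation by a quarter turn, other direction
        lambda x, y: (1 - x, 1 - y),    # rotation by a half turn
        lambda x, y: (1 - y, 1 - x),    # reflection across the anti-diagonal
    ]
    return [[t(x, y) for x, y in points] for t in transforms]


def tour_length(points, order):
    """Length of the closed tour visiting points in the given order."""
    total = 0.0
    for i in range(len(order)):
        ax, ay = points[order[i]]
        bx, by = points[order[(i + 1) % len(order)]]
        total += math.hypot(bx - ax, by - ay)
    return total


cities = [(0.1, 0.2), (0.8, 0.3), (0.5, 0.9)]
views = d4_views(cities)
lengths = [tour_length(v, [0, 1, 2]) for v in views]
```

Because every transform is an isometry, the optimal tour is the same in all eight views, which is why a policy that decides differently across views is being inconsistent.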

2606.01081 2026-06-02 cs.LG

Decision-Focused On-Policy Learning for Contextual Linear Optimization with Partial Feedback

面向决策的在线策略学习用于部分反馈下的上下文线性优化

Wyame Benslimane, Tinghan Ye, Pascal Van Hentenryck, Paul Grigas

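Affiliations: Department of Industrial Engineering and Operations Research, University of California, Berkeley; H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology

AI summary: Proposes a hybrid gradient estimator for on-policy learning in sequential contextual linear optimization under partial feedback, training predictive models for decision quality and achieving lower cumulative regret than contextual-bandit-style baselines on several benchmarks.

Chinese abstract (AI-translated)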

决策聚焦学习（DFL）通过优化下游决策质量而非单独预测准确性来训练预测模型。对于上下文线性优化，大多数现有DFL方法假设离线数据和目标成本向量的完全观测。我们开发了一种在线策略学习方法，用于部分反馈下的顺序上下文线性优化，推广了标准赌博机反馈设置。我们的方法学习一个随机预测-然后-优化策略，该策略从条件分布中采样成本向量预测，并求解由此产生的下游线性优化问题。为了更新这个分布模型，我们引入了一个双组分混合梯度估计器。第一个组分是得分函数估计器，它提供无偏但可能高方差的策略梯度估计。第二个是决策聚焦插件组分，它使用潜在成本向量的辅助干扰估计来利用下游优化结构，随着估计的改进而变得更具信息性。我们证明了平均平方策略梯度范数的$\mathcal{O}(T^{-1/2})$界，与标准非凸SGD速率相匹配。在top-$k$选择、最短路径、组合定价和真实数据能源调度基准上的实验表明，混合梯度方法在使用高斯和更丰富的条件生成模型时，在所有基准上实现了比上下文赌博机风格基线更低的累积遗憾。代码可在https://github.com/Joeyetinghan/on-policy-bandit-dfl获取。

English abstract

Decision-focused learning (DFL) trains predictive models by optimizing downstream decision quality rather than standalone prediction accuracy. For contextual linear optimization, most existing DFL methods assume offline data and full observations of the objective cost vector. We develop an on-policy learning method for sequential contextual linear optimization under partial feedback, generalizing the standard bandit feedback setting. Our method learns a stochastic predict-then-optimize policy that samples a cost-vector prediction from a conditional distribution and solves the resulting downstream linear optimization problem. To update this distributional model, we introduce a two-component hybrid gradient estimator. The first component is a score function estimator, which provides an unbiased but potentially high-variance policy gradient estimate. The second is a decision-focused plug-in component that uses an auxiliary nuisance estimate of the latent cost vector to exploit the downstream optimization structure, becoming more informative as the estimate improves. We prove an $\mathcal{O}(T^{-1/2})$ bound on the average squared policy-gradient norm, matching the standard non-convex SGD rate. Experiments on top-$k$ selection, shortest path, combinatorial pricing, and a real-data energy-scheduling benchmark show that the hybrid gradient approach achieves lower cumulative regret than contextual-bandit-style baselines across all benchmarks, using both Gaussian and richer conditional generative models. Code is available at https://github.com/Joeyetinghan/on-policy-bandit-dfl.
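
The first component, the score function estimator, can be sketched for a toy problem. The setup is an assumption made for illustration: a context-free Gaussian policy over a two-dimensional cost vector, a downstream problem that simply picks the index with the smallest sampled cost, and feedback limited to the true cost of the chosen index. The decision-focused plug-in component and the way the paper combines the two are not shown.

```python
import random


def score_function_gradient(mu, sigma, true_cost, n_samples=20000, seed=0):
    """Monte Carlo score-function estimate of d E[loss] / d mu.

    mu        : mean of the Gaussian cost-vector prediction
    sigma     : shared standard deviation of the prediction
    true_cost : latent costs; only the chosen entry is observed per round
    """
    rng = random.Random(seed)
    grad = [0.0] * len(mu)
    for _ in range(n_samples):
        # Sample a cost-vector prediction from the policy.
        c = [rng.gauss(m, sigma) for m in mu]
        # Downstream optimization: choose the index with the smallest sampled cost.
        choice = min(range(len(c)), key=c.__getitem__)
        # Partial feedback: only the cost of the chosen decision is revealed.
        loss = true_cost[choice]
        # Score of a Gaussian with respect to its mean is (c - mu) / sigma^2.
        for j in range(len(mu)):
            grad[j] += loss * (c[j] - mu[j]) / (sigma * sigma)
    return [g / n_samples for g in grad]


grad = score_function_gradient([0.0, 0.0], 1.0, true_cost=[1.0, 0.0])
```

A gradient descent step on `mu` with this estimate raises the predicted cost of the expensive option and lowers that of the cheap one. The estimate is unbiased but noisy, which is the motivation for pairing it with a lower-variance component.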

2606.01080 2026-06-02 cs.LG cs.AI

ThinkSwitch: Context Distillation with LoRA and Weight Interpolation for Specific-Purpose Reasoning Tasks

ThinkSwitch：基于LoRA和权重插值的上下文蒸馏用于特定目的推理任务

Dhruv Saini, Rohan Pandey

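Affiliations: Bellevue High School; DigitalOcean

AI summary: Proposes ThinkSwitch, which co-trains instruct and thinking models via QLoRA distillation and spherical weight interpolation. On AIME 2026 and PubMedQA it improves the instruct model from 10/30 to 20/30 and from 13/30 to 18/30, and the thinking model from 14/30 to 22/30 and from 18/30 to 25/30, using 15 training prompts per domain at a cost of $2.86.

Chinese abstract (AI-translated)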

大型语言模型通常通过在产生最终答案之前花费推理时间计算来改进困难任务。额外的计算可能有用，但也增加了延迟、令牌成本和部署复杂性。我们引入了ThinkSwitch，一种低计算量的程序，用于协同训练配对的指令和思考检查点。从兼容的Qwen3-4B指令和思考模型开始，每次迭代要求思考检查点生成答案，移除推理轨迹，通过QLoRA将仅答案对蒸馏到指令检查点，并通过球面权重插值重建思考检查点。唯一的人工输入是任务提示；标签由模型自身生成。在30个问题的AIME 2026评估中，ThinkSwitch将指令检查点从10/30提升到20/30，思考检查点从14/30提升到22/30。在30个问题的PubMedQA子集上，它将指令检查点从13/30提升到18/30，思考检查点从18/30提升到25/30。完整实验每个领域使用15个训练提示，在单个云RTX 3070上花费2.86美元。结果规模较小，但表明有针对性的蒸馏循环可以将显式推理的部分好处转移到权重中，同时保留独立的思考模式。

English abstract

Large language models often improve on difficult tasks by spending inference-time compute on a reasoning trace before producing the final answer. That extra computation can be useful, but it also raises latency, token cost, and deployment complexity. We introduce ThinkSwitch, a low-compute procedure for co-training paired instruct and thinking checkpoints. Starting from compatible Qwen3-4B instruct and thinking models, each iteration asks the thinking checkpoint to generate answers, removes the reasoning trace, distills the answer-only pairs into the instruct checkpoint with QLoRA, and reconstructs a thinking checkpoint with spherical weight interpolation. The only human-supplied inputs are task prompts; the labels are generated by the model itself. On a 30-question AIME 2026 evaluation, ThinkSwitch improves the instruct checkpoint from 10/30 to 20/30 and the thinking checkpoint from 14/30 to 22/30. On a 30-question PubMedQA subset, it improves the instruct checkpoint from 13/30 to 18/30 and the thinking checkpoint from 18/30 to 25/30. The complete experiment uses 15 training prompts per domain and costs $2.86 on a single cloud RTX 3070. The results are small-scale, but they indicate that targeted distillation loops can move part of the benefit of explicit reasoning into weights while preserving a separate thinking mode.
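
Spherical weight interpolation is commonly implemented as below. This is a generic sketch on flat weight vectors, assuming non-zero inputs and a linear fallback when the vectors are nearly parallel or anti-parallel. Which checkpoints ThinkSwitch interpolates between, and at what ratio, is not stated in the abstract.

```python
import math


def slerp(w0, w1, t):
    """Spherically interpolate between weight vectors w0 and w1 at ratio t."""
    dot = sum(a * b for a, b in zip(w0, w1))
    norm0 = math.sqrt(sum(a * a for a in w0))
    norm1 = math.sqrt(sum(b * b for b in w1))
    # Angle between the two vectors, clamped for numerical safety.
    cos_omega = max(-1.0, min(1.0, dot / (norm0 * norm1)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < 1e-8:
        # Degenerate angle: fall back to plain linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(w0, w1)]
    coef0 = math.sin((1 - t) * omega) / sin_omega
    coef1 = math.sin(t * omega) / sin_omega
    return [coef0 * a + coef1 * b for a, b in zip(w0, w1)]


midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

Unlike linear averaging, the midpoint of two unit-norm vectors stays at unit norm, which is the usual argument for preferring this scheme when merging checkpoints.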
