arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.29421 2026-05-29 cs.CL

Learning Design Skills as Memory Policies for Agentic Photonic Inverse Design

Shengchao Chen, Ting Shu, Sufen Ren

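AI summary Proposes SkillPCF, a closed-loop agent framework that addresses knowledge accumulation in photonic crystal fiber inverse design through a physics-guided memory skill bank, reinforcement-learned skill selection, and simulator-grounded skill evolution, achieving a better design-quality and efficiency trade-off on a real-world dataset.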

Comments AI4Physics@ICML 2026

Abstract

Photonic crystal fiber (PCF) inverse design remains challenging because candidate geometries must satisfy coupled optical targets under expensive electromagnetic simulation. Existing pipelines improve surrogate prediction or one-shot parameter recommendation, but they do not accumulate reusable design knowledge across iterative trials. We formulate PCF inverse design as a memory-policy learning problem and propose SkillPCF, a closed-loop agent framework that combines a physics-guided memory skill bank, reinforcement-learned skill selection, and simulator-grounded skill evolution. We further construct a real-world dataset with 479 expert interaction traces (2,507 spans) and 553 memory-dependent evaluation queries covering dispersion engineering, loss optimization, and multi-objective design. Experiments across multiple LLM backbones and classical baselines show that SkillPCF achieves stronger design-quality and efficiency trade-offs under practical simulation budgets, demonstrating the effectiveness of our proposed memory-skill learning paradigm for physics-aware PCF inverse design.

2605.29420 2026-05-29 cs.AI cs.LG

When Does Persona Prompting Actually Help? A Retrieval and Metric Analysis of Expert Role Injection in LLMs

Shuai Xiao, Su Liu, Weikai Zhou, Jialun Wu, Xinjie He, Zhiyuan Lin, Qiyang Xie

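AI summary Compares four prompting conditions on 1,140 open-ended questions and finds that role prompting systematically increases expertise depth while reducing clarity, that its effect depends strongly on question type and domain, and that hybrid retrieval outperforms embedding-only retrieval.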

Comments 6 pages, 2 figures. Submitted for peer review

Abstract

Persona prompting is widely used to steer large language models, yet its practical value remains unclear. Prior work often evaluates persona prompting using aggregate scores, making it difficult to determine whether expert-role prompting consistently improves response quality or instead changes responses along different quality dimensions. We study this question through a controlled comparison of four prompting conditions across 1,140 open-ended questions spanning 38 expert roles and six domains: no role prompt, a generic domain-expert prompt, embedding-based role retrieval, and a hybrid retrieval method combining embedding search with LLM-based role selection. Aggregate results show only small overall differences between conditions. However, metric-level analysis reveals a consistent tradeoff that aggregate averages obscure: role prompting systematically increases expertise depth while reducing clarity. These effects are highly conditional rather than universal. Role prompting performs best on advisory questions and in domains such as medicine and psychology, where structured expert framing and risk communication are intrinsically valuable. In contrast, baseline prompting performs better on conceptual and explanatory questions in finance, legal, science, and technology domains, where concise plain-language explanation is more important. We further show that hybrid retrieval significantly improves over embedding-only role selection, although better role retrieval does not eliminate the broader expertise-depth versus clarity tradeoff. Overall, our findings suggest that persona prompting primarily reshapes response characteristics rather than broadly improving capability, and that multi-metric evaluation is necessary for understanding its effects.

2605.29416 2026-05-29 cs.RO cs.CV

3DVLA: Enhancing Vision-Language-Action Models via 3D Spatial and Instance Understanding

Zhongyu Xia, Yousen Tang, Bingqing Wei, Yongtao Wang

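AI summary Proposes the 3DVLA framework, which addresses the lack of 3D scene understanding in VLA models through multi-view-consistent 3D feature encoding, an instance estimation module, and masked self-supervised 3D encoding, significantly improving manipulation performance on LIBERO-Plus and RoboTwin 2.0.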

Abstract

Vision-Language-Action models have achieved remarkable progress in robotic manipulation, yet they suffer from a critical limitation: a lack of 3D scene understanding. This deficiency manifests as three intertwined challenges: weak extraction of 3D spatial positions without enforcing multi-view consistency, inadequate 3D instance understanding, and fragile reasoning under occlusion. Although mature 3D perception methods exist, their direct integration into VLA pipelines is hindered by architectural incompatibility and by heavy reliance on costly instance-level annotations. To address the above challenges, we propose 3DVLA, a plug-and-play framework that injects robust 3D reasoning into pretrained VLAs without requiring extra manual labels or discarding VLM priors. Specifically, 3DVLA tackles the three challenges through: (1) pervasive 3D feature encoding with explicit multi-view consistency constraints across all modalities and a Spatially-Conditioned Geometry Aggregation method, (2) an instance estimation module with high-level instance tokens for 3D instance awareness, and (3) a masked self-supervised 3D encoding branch that retains its predictor for visual token completion to handle occlusions. We integrate 3DVLA with multiple VLA baselines and evaluate on LIBERO-Plus and RoboTwin 2.0. Results show consistent and significant gains in manipulation performance, validating both the effectiveness and plug-and-play compatibility of our approach.

2605.29414 2026-05-29 cs.CL cs.AI

Beyond Bilingual Transfer: Multilingual Code-Switching in Instruction Tuning

Shunta Asano, Jeonghun Baek, Toshihiko Yamasaki

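AI summary Using sentence-level multilingual code-switching instruction tuning across four languages, this study shows that multilingual code-switching improves the multilingual understanding of large language models beyond the conventional bilingual transfer setting.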

Abstract

Recent studies have shown that code-switching data (CSD), in which multiple languages are mixed within the same context, can improve cross-lingual transfer and multilingual alignment in large language models (LLMs). However, existing studies primarily focus on bilingual transfer between English and a target language, leaving multilingual settings involving three or more languages largely unexplored. In this work, we investigate multilingual code-switching instruction tuning across four languages: English, Japanese, Korean, and Chinese. We evaluate multilingual understanding on Belebele. Our experiments show that simple sentence-level multilingual CSD consistently improves average multilingual performance across all four languages, indicating that multilingual code-switching can be effective beyond bilingual transfer settings.
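
As a rough illustration of sentence-level code-switching, the sketch below builds one mixed-language passage from sentence-aligned translations. The function name `code_switch`, the toy sentences, and the round-robin choice of language are assumptions made for illustration; the abstract does not say how languages are assigned to sentences.

```python
from itertools import cycle

def code_switch(parallel, order):
    """Take sentence i from the i-th language in a repeating order.

    parallel: dict mapping language code -> list of sentence-aligned translations.
    order: language codes to cycle through (a hypothetical mixing policy).
    """
    lengths = {len(sents) for sents in parallel.values()}
    if len(lengths) != 1:
        raise ValueError("all languages must have the same number of sentences")
    n = lengths.pop()
    langs = cycle(order)
    return [parallel[next(langs)][i] for i in range(n)]

# Toy parallel data, not taken from the paper.
parallel = {
    "en": ["The cat sleeps.", "It is raining.", "We go home."],
    "ja": ["猫が寝ている。", "雨が降っている。", "家に帰ります。"],
    "ko": ["고양이가 잔다.", "비가 온다.", "우리는 집에 간다."],
    "zh": ["猫在睡觉。", "正在下雨。", "我们回家。"],
}
mixed = code_switch(parallel, ["en", "ja", "ko", "zh"])
```

Here `mixed` holds the English first sentence, the Japanese second sentence, and the Korean third sentence.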

2605.29411 2026-05-29 cs.LG cs.AI stat.ME stat.ML

The Good, the Bad, and the Ugly of Markov Boundary for Tabular Prediction

Shu Wan, Abhinav Gorantla, Huan Liu, K. Selçuk Candan

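AI summary Studies the practical utility of the Markov boundary for tabular prediction and finds that the theoretically optimal boundary improves prediction only conditionally in practice, while causal discovery methods struggle to realize its potential.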

Comments 11 pages, 9 figures, 2 tables. Preprint

Abstract

Under standard graphical assumptions, the Markov boundary of a target variable is the smallest set of features that renders every other feature redundant. Once the boundary is observed, the target is conditionally independent of the rest of the table. This is a tempting object for tabular prediction, since it names exactly the columns a model should need. Yet modern regressors are still trained on the full feature set. We ask whether the Markov boundary is genuinely useful for prediction on SCM3K, a 3,450-task synthetic SCM benchmark with feature counts from 40 to 1000 and six SCM families, evaluated with six regressors. The answer is more nuanced than the theory suggests. Restricting a regressor to the oracle boundary often improves prediction substantially, and the improvement grows as the feature space becomes larger and sparser. But the natural pipeline of recovering the boundary with causal discovery and training on the recovered mask does not deliver. Existing estimators exhaust the compute budget before reaching the regime where the boundary helps most, and even where they run they rarely beat the full feature set. We trace this to three causes. Discovery optimizes structural recovery rather than prediction. False negatives and false positives carry sharply asymmetric predictive cost. The exact boundary is only one of many feature sets that beat all features. We then develop what these facts imply for prediction-aligned feature selection and for tabular models that learn to use causal structure.
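
For readers unfamiliar with the term: in a directed acyclic graph satisfying the usual faithfulness assumption, the Markov boundary of a node consists of its parents, its children, and the other parents of its children. The sketch below computes that set for a toy graph; the graph and the function name `markov_boundary` are hypothetical and unrelated to the SCM3K benchmark.

```python
def markov_boundary(parents, target):
    """Return parents, children, and co-parents of `target` in a DAG.

    parents: dict mapping each node -> set of its parent nodes.
    """
    pa = set(parents.get(target, set()))
    children = {v for v, ps in parents.items() if target in ps}
    co_parents = set()
    for child in children:
        co_parents |= parents[child]
    return (pa | children | co_parents) - {target}

# Toy DAG: A -> Y -> C <- B, C -> D, and an unrelated feature E.
dag = {"A": set(), "B": set(), "Y": {"A"}, "C": {"Y", "B"}, "D": {"C"}, "E": set()}
mb = markov_boundary(dag, "Y")
```

In this toy graph the boundary of `Y` is `{A, B, C}`: `D` and `E` carry no further information about `Y` once those three columns are observed.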

2605.29410 2026-05-29 cs.RO

A Progress-Aware Leader-Follower Midair Docking System for Dual-Drone Aerial Manipulation

Yifan Cai, Jan Ming Kevin Tan, Xiangqi Li, Chenzhe Jin, Narsimlu Kemsaram, Valerio Modugno

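AI summary Proposes a progress-aware leader-follower midair docking platform for two quadrotors that achieves reliable docking through a passive magnetic latching module and a phase manager, evaluated in simulation and experiments using quantitative metrics.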

Comments This paper has been accepted for publication in the Proceedings of the 2026 IEEE 22nd International Conference on Automation Science and Engineering (CASE 2026), August 17-21, 2026, Shenyang, China

Abstract

Reliable midair docking between small unmanned aerial vehicles (UAVs) is essential for modular aerial cooperation and manipulation, but it requires precise relative-pose control and repeatable platform operation under tight thrust and payload constraints. We present a dual-drone docking platform where two quadrotors operate in a leader-follower formation and dock using a lightweight modular frame with passive magnetic latching. A progress-aware mission supervisor manages phase transitions: approach, alignment, capture, and settle. This platform integrates a complete hardware-software stack (ROS 2 with Crazyflie/PX4 interfaces) and synchronized logging for benchmark evaluation. We evaluate the platform in simulation and real-world experiments using quantitative metrics such as formation error, baseline and yaw consistency, docking success rate, time-to-dock, and failure-mode statistics. The platform enables statistically grounded comparison of docking supervision and synchronization strategies and provides a practical testbed for modular aerial cooperation and repeatable midair aerial manipulation.
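
The four phases named in the abstract suggest a small state machine. The sketch below is one possible reading, assuming that "progress-aware" means a phase advances only once a measured error falls below a per-phase threshold; the thresholds, the `latched` flag, and the final `docked` state are hypothetical and not taken from the paper.

```python
PHASES = ["approach", "alignment", "capture", "settle", "docked"]

# Hypothetical per-phase position-error thresholds in metres; not from the paper.
THRESHOLDS = {"approach": 0.50, "alignment": 0.05, "capture": 0.01, "settle": 0.01}

def step(phase, pos_error, latched=False):
    """Advance to the next phase only when the current phase's condition holds."""
    if phase == "docked":
        return phase
    if phase == "settle":
        # Settling additionally requires the (assumed) latch signal.
        return "docked" if latched and pos_error <= THRESHOLDS[phase] else phase
    if pos_error <= THRESHOLDS[phase]:
        return PHASES[PHASES.index(phase) + 1]
    return phase

phase = "approach"
trace = []
readings = [(1.2, False), (0.4, False), (0.03, False), (0.008, False), (0.008, True)]
for err, latched in readings:
    phase = step(phase, err, latched)
    trace.append(phase)
```

With these made-up readings the supervisor stays in `approach` for the first sample and then advances one phase per sample until it reaches `docked`.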

2605.29407 2026-05-29 cs.RO

Phase-Conditioned Imitation Learning with Autonomous Failure Recovery for Robust Deformable Object Manipulation

Dayuan Chen, Kai Tang, Yukuan Zhang, Kazuhiro Kosuge, Yasuhisa Hirata

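AI summary Proposes a phase-conditioned, force-aware closed-loop hierarchical framework that achieves autonomous failure recovery through a FiLM-modulated ACT encoder and a multi-modal phase predictor, significantly improving success rates in deformable object manipulation.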

Comments Accepted to IEEE/ASME Transactions on Mechatronics

Abstract

This paper presents a phase-conditioned, force-aware framework for robust deformable object manipulation. Standard imitation learning policies such as Action Chunking with Transformers (ACT) rely on a Markovian assumption at inference, causing state aliasing when visually similar observations require contradictory actions and preventing autonomous recovery from execution failures. We address this with a closed-loop hierarchical architecture. A FiLM-conditioned ACT encoder modulates feature extraction based on the current task phase, enabling a single unified policy to produce phase-specific behaviors while sharing action dynamics across phases. A multi-modal phase predictor fusing visual, force, and pose feedback estimates the phase in real time, detecting contact failures that are invisible to vision alone and autonomously triggering recovery trajectories. The system is completed by a hybrid impedance controller for compliant execution and a haptic teleoperation interface for force-aware data collection. Ablation studies show that FiLM-based modulation significantly outperforms both unconditioned and token-level conditioned baselines, and t-SNE analysis confirms that FiLM induces well-separated, phase-specific feature representations. Validated on hanging and removing a T-shirt with dual arms, the closed-loop system improves the hanging success rate from 56% to 87% through autonomous error recovery. Code and videos: https://leledeyuan00.github.io/phaser/
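
FiLM (feature-wise linear modulation) scales and shifts each feature channel using parameters that depend on a conditioning signal, here the task phase. The sketch below shows only that operation in NumPy. The per-phase lookup tables `gamma` and `beta` and the random initialisation are assumptions for illustration; in a real model these parameters are learned and the paper's encoder is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
num_phases, feat_dim = 3, 4

# One (gamma, beta) pair per task phase; randomly initialised here, learned in practice.
gamma = rng.normal(1.0, 0.1, size=(num_phases, feat_dim))
beta = rng.normal(0.0, 0.1, size=(num_phases, feat_dim))

def film(features, phase):
    """Scale and shift each feature channel by phase-specific parameters."""
    return gamma[phase] * features + beta[phase]

features = np.ones((2, feat_dim))  # a batch of two identical feature vectors
out0 = film(features, phase=0)
out1 = film(features, phase=1)
```

The same input features map to different outputs under different phases, which is how one shared network can behave differently per phase.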

2605.29405 2026-05-29 cs.LG

Information-Directed Offline-to-Online Reinforcement Learning

Keru Chen

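AI summary Proposes information-directed sampling (IDS) for offline-to-online reinforcement learning, quantifying the uncertainty that remains after offline data via conditional mutual information and balancing instantaneous regret against information gain; proves a Bayesian regret bound and shows the method's advantage when the residual uncertainty is biased.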

Abstract

Decision-making from offline datasets typically warm-starts a policy or score model from fixed offline data and then refines it with limited online interaction. Offline data reduces uncertainty, but it does not remove the need for exploration; it changes what remains to be explored. We formalise this residual uncertainty by the conditional mutual information $I(χ;τ_{1:T}\mid\mathcal{D}_N)$ between a learning target $χ$ and the online trajectories after conditioning on the offline dataset. This view leads naturally to information-directed sampling (IDS), a family parameterised by $η\ge 0$ that selects actions by trading off instantaneous regret against information gain. We prove a generic offline-to-online Bayesian regret bound for IDS through a ratio certificate: any information-ratio bound satisfied by a reference Thompson-sampling policy over the same randomised policy class is inherited by IDS. In a known-dynamics Bayesian linear-reward model, the conditional mutual information has a log-determinant form, and vanilla IDS ($η=0$) satisfies $\widetilde O\!\left(Hd\min\left\{\sqrt T,\,T\sqrt{C^\dagger_{β,\mathrm{IDS}_0}(N,T)/N}\right\}\right),$ where the coverage coefficient is tied to the visitation distribution induced by vanilla IDS itself. We also identify a warm-start regime with a dominated but informative probe in which vanilla IDS selects the probe while Thompson sampling never does, giving a constant-factor Bayesian regret separation. Controlled bandit experiments and D4RL offline-to-online RL experiments validate this mechanism: IDS is most beneficial when offline data is informative but leaves biased or low-probability residual uncertainty that targeted online actions can resolve, a regime shared by offline RL, offline black-box optimization, and Bayesian optimization.
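
To make the regret-versus-information trade-off concrete, the sketch below implements a simple deterministic variant of IDS for a bandit: it picks the single action that minimises squared expected regret divided by information gain. This is a simplification, since IDS in general optimises over randomised policies, and the parameter $η$ from the abstract is not modelled. All numbers are hypothetical.

```python
import numpy as np

def ids_action(expected_regret, info_gain, eps=1e-12):
    """Deterministic IDS: choose the action minimising regret^2 / information gain."""
    regret = np.asarray(expected_regret, dtype=float)
    gain = np.asarray(info_gain, dtype=float)
    ratio = regret**2 / np.maximum(gain, eps)  # guard against zero information gain
    return int(np.argmin(ratio)), ratio

# Hypothetical numbers: action 1 is a costly but highly informative probe.
action, ratio = ids_action([0.10, 0.30, 0.20], [0.01, 0.50, 0.02])
```

Action 1 has the highest expected regret of the three yet the lowest information ratio, so this rule selects it. That mirrors the "dominated but informative probe" regime described in the abstract.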

2605.29402 2026-05-29 cs.CV cs.AI

Semantic and Visual Evidence for Efficient Long-Video Reasoning: A Solution for the HD-EPIC VQA Challenge

Yinsong Xu, Wei Jing, Liuxin Zhang, Wanjun Lv, Hui Li

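AI summary Proposes a unified framework that decouples long-video reasoning into semantic evidence (coarse-to-fine extraction of global procedural structure) and visual evidence (object-centric fine-grained grounding) and uses query-conditioned evidence retrieval and integration, achieving competitive performance in the HD-EPIC VQA Challenge.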

Abstract

Understanding long-form egocentric videos remains challenging for multimodal large language models (MLLMs) due to limited context length and insufficient grounding of fine-grained visual details. The recently proposed HD-EPIC benchmark highlights these limitations: even strong long-context models achieve relatively low performance across diverse video question answering tasks. In this paper, we propose a unified framework that decouples long-video reasoning into two complementary forms of evidence: semantic evidence and visual evidence. Semantic evidence captures global procedural structure through a coarse-to-fine extraction pipeline, while object-centric visual evidence preserves fine-grained grounding through bounding boxes and visual embeddings. During inference, we formulate reasoning as a query-conditioned evidence retrieval and integration process, dynamically selecting relevant information from both sources. Our approach achieves competitive performance in the HD-EPIC-VQA Challenge across multiple task categories. More broadly, our results demonstrate that explicitly structuring, retrieving, and integrating semantic and visual evidence is critical for effective long-video understanding with MLLMs.

2605.29401 2026-05-29 cs.LG

Rethinking Post-Training Recipes for Multimodal Time-Series Forecasting

Haoxin Liu, Yichen Zhou, Rajat Sen, B. Aditya Prakash, Abhimanyu Das

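AI summary Proposes PostTime, a post-training recipe combining supervised fine-tuning with reinforcement learning with verifiable rewards, in which a large language model revises the forecasts of a numerical time-series foundation model based on multimodal context, significantly improving multimodal time-series forecasting.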

Abstract

Time-Series Foundation Models (TSFMs) excel at zero-shot unimodal forecasting using numerical data, but unlike LLMs they cannot consume multimodal, non-numerical context that often shapes real-world trajectories. In this work, we bridge this gap and argue for a multimodal time-series forecasting approach that post-trains LLMs to act as context-guided revisors over strong numerical TSFM priors. We introduce PostTime, a post-training recipe combining Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR), along with a methodology to generate automated reasoning traces for forecast revisions. PostTime teaches an LLM to generate context-conditioned forecast interventions -- decisions to revise, preserve, or ignore the TSFM prior based on the multimodal context. We evaluate this approach on the TimesX multimodal forecasting benchmark using a Gemma-3-4B LLM and TimesFM-2.5 TSFM, and show that it significantly outperforms standalone TSFMs, LLM-only baselines, and existing multimodal forecasting approaches.

2605.29400 2026-05-29 cs.AI cs.CL cs.HC

Architecture-Sensitive Supervised Fine-Tuning for Screen-Conditioned Action Prediction: A PiSAR Benchmark

Rahul Bissa, Abhishek Vyas, Yash Jain

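AI summary Uses the PiSAR benchmark to compare supervised fine-tuned models with frontier zero-shot models on screen-anchored behaviour prediction; a fine-tuned Qwen3-VL-8B-Instruct significantly outperforms the frontier baselines, whereas fine-tuning Gemma-4-26B-A4B-IT works poorly, revealing a mismatch between model and fine-tuning recipe.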

Comments 14 pages, 7 figures, 2 tables. PiSAR corpus and fine-tuned weights are proprietary to AprioriLabs; methodology and recipe released

Abstract

We benchmark three supervised fine-tuned models against frontier zero-shot baselines on a 661-row held-out slice of PiSAR (Persona, intent, Screen, Action, Rationale), a 12,929-tuple corpus of screen-anchored behavioural rationales curated from public app-store reviews, Pew American Trends Panel demographics, and the OPeRA shopper traces. Every model, frontier or fine-tuned, is evaluated on the same 661-row slice with the same scoring pipeline. Two findings. First, frontier zero-shot baselines (Claude Opus 4.7 and GPT-5.5) reach sem_sim 0.459 and 0.482 respectively; a fine-tuned Qwen3-VL-8B-Instruct reaches 0.783 and clears sem_sim >= 0.7 on 79% of rows, against 1-2% for either frontier baseline, a gap of 0.30 absolute on the same test set. Second, the same training data and recipe on Gemma-4-26B-A4B-IT scores only 0.441, in the same band as the frontier zero-shot baselines rather than the fine-tuned Qwen. We read this as a recipe-vs-model mismatch: the reasoning-tuned high-parameter model resists displacement and would likely need either more data or a stronger fine-tuning method.
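
The two headline statistics, a mean similarity score and the share of rows clearing a threshold, can be computed as in the sketch below. The per-row scores are made up for illustration and are not PiSAR data; how `sem_sim` itself is computed is not stated in the abstract. The last line checks the "0.30 absolute" gap against the reported means.

```python
def summarize(scores, threshold=0.7):
    """Return (mean score, fraction of rows with score >= threshold)."""
    mean = sum(scores) / len(scores)
    frac = sum(s >= threshold for s in scores) / len(scores)
    return mean, frac

# Hypothetical per-row scores, not PiSAR data.
mean, frac = summarize([0.9, 0.8, 0.75, 0.6, 0.5])

# Gap implied by the reported means: fine-tuned Qwen (0.783) vs. GPT-5.5 (0.482).
gap = round(0.783 - 0.482, 3)
```

The gap against the stronger of the two frontier baselines works out to 0.301, consistent with the rounded figure in the abstract.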

2605.29398 2026-05-29 cs.LG cs.AI

GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models

Xiaohang Tang, Keyue Jiang, Che Liu, Qifang Zhao, Xiaoxiao Xu, Sangwoong Yoon, Ilija Bogunovic

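AI summary Proposes Guided Denoiser Self-Distillation (GDSD), which distills the denoiser of a diffusion language model directly from an advantage-guided self-teacher derived from the closed-form optimum of reverse-KL regularized reinforcement learning, avoiding the training-inference mismatch bias introduced by ELBO likelihood surrogates and significantly outperforming existing methods on planning, math, and coding benchmarks.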

Comments Preprint

Abstract

Reinforcement learning (RL) can be used to improve the policy (denoiser) of diffusion large language models (dLLMs), but it is hindered by the intractability of the policy likelihood. A dominant and efficient family of methods replaces the likelihood in standard RL with its evidence lower bound (ELBO), estimated from randomly masked sequences. Despite being well aligned with pre-training, these approaches introduce bias through training-inference mismatch (TIM) by using the ELBO as a likelihood surrogate, which can degrade performance. In this work, we propose Guided Denoiser Self-Distillation (GDSD) to directly distill the denoiser of dLLMs from an advantage-guided self-teacher, derived from the closed-form optimum of reverse-KL regularized RL. GDSD matches the dLLM's denoiser logits to the teacher's via a normalization-free objective, which reduces RL to likelihood-free self-distillation and thus bypasses the TIM biases. Recent ELBO-based methods emerge as instances of applying different distillation divergences, but with diagnosable pathologies that GDSD avoids. On planning, math, and coding benchmarks with LLaDA-8B and Dream-7B, GDSD consistently outperforms prior state-of-the-art ELBO-based methods with more stable training reward dynamics, achieving test-accuracy improvements of up to +19.6%. These results suggest that direct denoiser self-distillation, without relying on an ELBO likelihood surrogate, can provide a more stable and effective RL procedure for dLLMs. Code is available at https://github.com/GaryBall/GDSD.
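
The "closed-form optimum of reverse-KL regularized RL" refers to a standard result: maximising expected advantage minus β times the KL divergence to a reference policy yields a policy proportional to the reference policy times exp(advantage / β). In logit space this is an additive shift, as the sketch below shows for a single categorical distribution. How GDSD defines advantages per token and its normalization-free loss are not given in the abstract and are not reproduced here; all values are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def teacher_logits(ref_logits, advantages, beta):
    """Logits of pi*(a) proportional to pi_ref(a) * exp(A(a) / beta), up to a constant."""
    ref = np.asarray(ref_logits, dtype=float)
    adv = np.asarray(advantages, dtype=float)
    return ref + adv / beta

ref = np.array([0.0, 0.0, 0.0])    # uniform reference policy
adv = np.array([1.0, 0.0, -1.0])   # hypothetical per-action advantages
teacher = softmax(teacher_logits(ref, adv, beta=1.0))
```

A smaller `beta` sharpens the teacher toward high-advantage actions, while a larger one keeps it close to the reference policy.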

2605.29397 2026-05-29 cs.CL

Revisiting Observation Reduction for Web Agents: Comprehensive Evaluation with a Lightweight Framework

Masafumi Enomoto, Ryoma Obara, Haochen Zhang, Masafumi Oyamada

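AI summary To address overly long HTML observations in LLM web agents, proposes a lightweight evaluation framework based on the Minimal Failure Set (MFS) whose coverage proxy metric greatly speeds up evaluation, and optimizes a pruning program that cuts latency by 2.2-3.1x while retaining 84-89% of the success rate.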

Comments 22 pages, 8 figures, 4 tables

Abstract

HTML observations in LLM-based web agents are extremely long, and while many reduction methods have been proposed, it remains unclear which methods reduce overall agent latency while maintaining performance. The main obstacle is the high cost of end-to-end evaluation: in our experiments, evaluating 11 methods across 32 configurations on 33 tasks of WorkArena L1 required 232.4 cumulative hours. To address this, we propose a lightweight evaluation framework based on the Minimal Failure Set (MFS), the minimal set of HTML elements whose removal causes task failure. We define coverage as the fraction of instances in which a reduction method fully retains the MFS, which serves as a proxy metric that requires neither web access nor LLM inference. We validate that coverage strongly correlates with end-to-end success rate, with over 100× speedup in cumulative evaluation time on both benchmarks. Using this framework, we find that extractive HTML reduction methods require either high computation cost or domain-specific optimization to reduce agent latency while maintaining performance. Building on this, we optimize a pruning program on MFS training data, achieving 2.2× faster per-step latency on WorkArena L1 while retaining 84% of the original success rate, and 3.1× faster on WebLinx while retaining 89%.
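
The coverage metric defined above is a set-containment check, which is why it needs neither a browser nor an LLM. The sketch below is a minimal reading of that definition, assuming each instance is a pair of (element ids in the observation, element ids in the MFS); the id scheme and the toy reduction `drop_hidden` are hypothetical.

```python
def coverage(instances, reduce_fn):
    """Fraction of instances whose Minimal Failure Set survives the reduction intact."""
    kept = 0
    for elements, mfs in instances:
        retained = set(reduce_fn(elements))
        if set(mfs) <= retained:  # every MFS element must be retained
            kept += 1
    return kept / len(instances)

def drop_hidden(elements):
    """Toy reduction (hypothetical): remove element ids marked as hidden."""
    return [e for e in elements if not e.startswith("hidden-")]

instances = [
    (["btn-1", "hidden-2", "input-3"], {"btn-1"}),
    (["btn-4", "hidden-5"], {"hidden-5"}),
    (["link-6", "input-7"], {"link-6", "input-7"}),
]
cov = coverage(instances, drop_hidden)
```

The second instance loses its only MFS element, so this toy reduction scores a coverage of two thirds.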

2605.29396 2026-05-29 cs.AI

Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

Zhihao Liu, Yifan Wu, Jian Lou, Di Wang, Yuxi Zhou, Yuke Hu

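AI summary To address the vulnerability of safety-aligned large language models to lightweight post-alignment manipulations (such as parameter noise, activation noise, or quantization), proposes a hybrid framework based on zeroth-order optimization that performs standard first-order safety alignment followed by zeroth-order refinement to improve robustness, and uses perturbation-based evaluations to estimate layer-wise robustness sensitivity so that updates focus efficiently on critical layers.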

Abstract

Safety alignment for large language models (LLMs) aims to reduce harmful or unsafe behavior while preserving general utility. However, recent findings reveal that alignment effects can be fragile: lightweight post-alignment manipulations, such as parameter noise, activation noise, or quantization, can easily weaken the intended safety behavior. Prior efforts to improve robustness have primarily focused on data curation, modified alignment objectives, and safety-critical parameter identification, leaving the role of the optimizer itself largely unexplored. In this paper, we are the first to study the robustness of safety alignment from the perspective of the base optimizer. This optimizer-centric view naturally points to zeroth-order optimization, which provides a robustness-oriented signal by evaluating safety alignment under perturbations. Based on this insight, we propose a hybrid framework that first performs standard first-order safety alignment and then applies zeroth-order refinement to improve robustness. Both theoretically and empirically, we show that only a few zeroth-order refinement steps can enhance robustness while preserving safety alignment. We further improve the efficiency of zeroth-order refinement by exploiting its inherent perturbation-based evaluations to estimate layer-wise robustness sensitivity, enabling the refinement process to concentrate updates on robustness-critical layers with modest training overhead.
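
Zeroth-order optimization estimates a gradient from loss values at perturbed parameters alone, which is why it naturally probes behaviour under perturbation. The sketch below shows a generic two-point estimator with Gaussian perturbations on a toy quadratic loss. It is not the paper's refinement procedure: the loss, step size, sample count, and perturbation scale are assumptions, and the layer-wise sensitivity weighting is omitted.

```python
import numpy as np

def zo_gradient(loss, theta, rng, eps=1e-3, num_samples=64):
    """Two-point zeroth-order gradient estimate from random Gaussian perturbations."""
    grad = np.zeros_like(theta)
    for _ in range(num_samples):
        z = rng.standard_normal(theta.shape)
        delta = loss(theta + eps * z) - loss(theta - eps * z)
        grad += (delta / (2 * eps)) * z
    return grad / num_samples

def toy_loss(t):
    """Toy stand-in for an alignment loss; the minimum is at the origin."""
    return float(np.sum(t**2))

rng = np.random.default_rng(0)
theta = np.array([1.0, -2.0])
for _ in range(200):
    theta = theta - 0.05 * zo_gradient(toy_loss, theta, rng)
```

Only forward evaluations of `toy_loss` are used, and each evaluation happens at a perturbed copy of the parameters.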

2605.29394 2026-05-29 cs.AI

EvoMD-LLM: Learning the Language of Species Evolution in Reactive Molecular Dynamics

Zhichen Tang, Zhengzheng Dang, Yulin Chen, Jixin Wu, Haiwen Li, Yanming Wang

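AI summary Proposes the EvoMD-LLM framework, which discretizes reactive molecular dynamics trajectories into symbolic temporal sequences and uses temporal scaffolding to let autoregressive large language models learn the evolution of species composition, outperforming baselines on several temporal prediction tasks and generating interpretable predictions.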

Comments 17 pages, ACL Findings

Abstract

While large language models (LLMs) excel at static scientific reasoning, they struggle to model the temporal structure of dynamic physical processes. We present EvoMD-LLM (Evolutionary Molecular Dynamics Large Language Model), a framework that reformulates species-level molecular dynamics as a symbolic temporal language modeling problem. Reactive MD trajectories are discretized into sequences of molecular events, where each token represents a chemical species augmented with its persistence duration, enabling standard autoregressive LLMs to learn compositional evolution over time through efficient fine-tuning. A key component of EvoMD-LLM is temporal scaffolding, which treats event duration as an explicit linguistic token and serves as a structured inductive bias, significantly reducing invalid or hallucinated molecular outputs compared to conventional sequence modeling approaches. We evaluate EvoMD-LLM on multiple temporal prediction tasks, achieving up to 66.14% accuracy and consistently outperforming sequential neural networks and language-based baselines. Beyond quantitative improvements, we qualitatively observe that the model is capable of generating interpretations for its own predictions by incorporating relevant chemical knowledge, even though it was not explicitly supervised with paired trajectory-explanation data. These results demonstrate that symbolic temporal language modeling provides an effective framework for grounding LLMs in dynamic physical simulations.
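
One simple way to turn a trajectory into "species plus persistence duration" tokens is run-length encoding of the per-frame species label, sketched below. The `species|duration` token format, the frame-based time unit, and the toy trajectory are assumptions for illustration; the paper's actual tokenisation is not described in the abstract.

```python
from itertools import groupby

def to_event_tokens(species_per_frame, dt=1):
    """Run-length encode a per-frame species sequence into 'species|duration' tokens."""
    return [
        f"{species}|{sum(1 for _ in run) * dt}"
        for species, run in groupby(species_per_frame)
    ]

# Toy trajectory: the tracked species label at each saved frame.
frames = ["H2O", "H2O", "H2O", "OH", "OH", "H2O2", "OH"]
tokens = to_event_tokens(frames)
```

Each token then carries both what the species is and how long it persisted, so duration becomes an explicit part of the sequence rather than something the model must infer from repetition.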

2605.29390 2026-05-29 cs.CV

Orthogonal Negative Guidance in Attention Feature Space for Text-to-Image Generation

Jungmin Ko, Jungwon Park, Jimyeong Kim, Changin Choi, Wonseok Lee, Wonjong Rhee

AI summary Proposes an orthogonal negative guidance method in attention feature space that orthogonalizes negative-prompt attention features against positive-prompt features and subtracts only the orthogonal component, suppressing unwanted concepts without training while preserving image quality and prompt alignment.

Comments Preprint

Abstract

Text-to-image (T2I) models have become increasingly capable of generating high-quality images. Yet, enforcing the explicit absence of a specified object or attribute remains a fundamentally challenging problem. Existing approaches, including prompt negation, post-hoc editing, and negative guidance, remain insufficient for explicit concept suppression, often failing to remove the target concept or degrading overall image quality. To this end, we propose Orthogonal Negative Guidance in attention feature space, a training-free method that operates in the attention output space of MM-DiT-based T2I transformers. Our method orthogonalizes negative-prompt attention features with respect to positive-prompt features and subtracts only the orthogonal component, suppressing unwanted concepts while preserving desired semantics. Experiments on FLUX-dev and FLUX-schnell show that our method achieves favorable trade-offs between concept suppression, prompt alignment, and image quality. In human evaluation, our method outperforms the second-best baseline by 18.78%. We further show that our method supports multi-concept suppression and adjustable concept suppression.
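
The orthogonalize-then-subtract step can be sketched with plain linear algebra. This is a minimal illustration, assuming a per-token projection and a scalar `scale`; both choices, and the function name, are hypothetical and may differ from how the method is applied inside MM-DiT attention layers.

```python
import numpy as np


def orthogonal_negative_guidance(pos, neg, scale=1.0, eps=1e-8):
    """Subtract only the part of `neg` that is orthogonal to `pos`.

    pos, neg: arrays of shape (tokens, dim) holding attention output features.
    Per-token projection and the `scale` knob are assumptions for illustration.
    """
    # Projection coefficient of each negative feature onto its positive feature.
    coef = np.sum(neg * pos, axis=-1, keepdims=True) / (
        np.sum(pos * pos, axis=-1, keepdims=True) + eps
    )
    neg_parallel = coef * pos          # component shared with the positive prompt
    neg_orth = neg - neg_parallel      # component unique to the negative prompt
    return pos - scale * neg_orth


pos = np.array([[1.0, 0.0]])
neg = np.array([[1.0, 1.0]])
out = orthogonal_negative_guidance(pos, neg)
print(out)
```

Because the subtracted term is orthogonal to `pos`, the projection of the result onto `pos` is unchanged, which is one way to read the claim that desired semantics are preserved.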

2605.29387 2026-05-29 cs.LG cs.AI stat.ML

On the Optimizer Dependence of Neural Scaling Laws

Vansh Ramani, Shourya Vir Jain

AI summary Through random-feature regression experiments, finds that the optimizer systematically affects the scaling exponent α of neural scaling laws, with preconditioned optimizers yielding steeper scaling, and provides a spectral diagnostic that predicts when advanced optimizers pay off.

Abstract

The scaling exponent $\alpha$ in neural scaling laws $L(N) \propto N^{-\alpha}$ is commonly treated as a fixed constant set by architecture and data. We present evidence that $\alpha$ depends systematically on the optimizer. In controlled random-feature regression experiments -- the canonical theoretical framework for neural scaling -- we measure $\alpha$ across five optimizer variants and six spectral conditions. Preconditioned optimizers consistently yield steeper scaling (larger $\alpha$), with the $\alpha$-shift increasing across most of the tested spectral range, peaking near $s = 1.5$, and remaining large at $s = 2.0$. At $s \approx 1.0$ (characteristic of natural language), the full natural gradient achieves $\alpha \approx 0.31$ versus $\alpha \approx 0.12$ for gradient descent -- a $2.6\times$ larger fitted exponent that, within the random-feature model, compounds with each model-size doubling. Whether and how this exponent shift transfers to large-scale LLM training -- where recent evidence suggests the advantage may attenuate with scale -- remains an important open question. Our results imply that scaling-law forecasts should account for optimizer choice, and we provide a spectral diagnostic predicting when advanced optimizers will pay off.
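
A fitted exponent of this kind is usually obtained by linear regression in log-log space, and the "compounds with each doubling" remark follows from the factor 2^(-α) per doubling of N. The sketch below uses synthetic losses generated from the two exponents quoted in the abstract; the prefactor 3.0 and the grid of model sizes are hypothetical.

```python
import numpy as np


def fit_scaling_exponent(n, loss):
    """Least-squares fit of log L = log c - alpha * log N; returns alpha."""
    slope, _intercept = np.polyfit(np.log(n), np.log(loss), 1)
    return -slope


# Synthetic power-law losses (hypothetical prefactor and sizes).
n = np.array([1e3, 2e3, 4e3, 8e3, 1.6e4])
alpha_gd = fit_scaling_exponent(n, 3.0 * n ** -0.12)
alpha_ng = fit_scaling_exponent(n, 3.0 * n ** -0.31)

# Multiplicative loss change for each doubling of N under L proportional to N^(-alpha).
per_doubling_gd = 2.0 ** -alpha_gd
per_doubling_ng = 2.0 ** -alpha_ng
print(alpha_ng / alpha_gd, per_doubling_gd, per_doubling_ng)
```

Under these exponents each doubling of N cuts the loss by roughly 8% for gradient descent and roughly 19% for the natural gradient, and the exponent ratio 0.31 / 0.12 is about 2.6.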

2605.29380 2026-05-29 cs.LG cs.AI cs.CV

TRACER: Persistent Regularization for Robust Multimodal Finetuning

Hesam Asadollahzadeh, Feng Liu, Christopher Leckie, Sarah M. Erfani

AI summary Proposes TRACER, which achieves persistent regularization through a weighted moving average teacher, addressing catastrophic forgetting and EMA collapse in multimodal contrastive finetuning and improving out-of-distribution robustness.

Comments ICML 2026

Abstract

Mainstream strategies for finetuning pretrained multimodal models often degrade out-of-distribution (OOD) robustness, a phenomenon known as catastrophic forgetting. In this paper, we develop a theoretical framework for multimodal contrastive finetuning, yielding closed-form solutions and a geometric decomposition for each strategy. This framework shows that self-distillation is more effective than other regularization approaches at retaining the knowledge of the pretrained model. Our analysis reveals a largely overlooked limitation: standard Exponential Moving Average (EMA) teachers, widely used in robust finetuning, suffer from collapse. To solve this, we prove that a Weighted Moving Average (WMA) teacher maintains a persistent regularizing force over finite horizons and yields bias-free convergence in the task subspace while preserving orthogonal knowledge. These insights motivate **TRACER** (**T**rajectory-**R**obust **A**nchoring for **C**ontrastive **E**ncoder **R**egularization), which combines contrastive learning with WMA-guided multi-perspective distillation. Extensive experiments on CLIP finetuning demonstrate consistent OOD accuracy and calibration gains across three backbone architectures, and comprehensive ablations confirm that TRACER is both principled and robust to hyperparameter choices. Code is available at https://github.com/HesamAsad/TRACER.

2605.29379 2026-05-29 cs.CL cs.LG

BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k_base

Rohan Shravan

AI summary Proposes BrahmicTokenizer-131K, a byte-level BPE tokenizer with a 131,072-token vocabulary that, through a two-stage retrofit, substantially improves compression of Brahmic scripts while preserving performance on non-Indic text.

Comments 24 pages, 15 tables, 3 code listings. Tokenizer artifact, verification scripts, and reproduction code at https://huggingface.co/theschoolofai/BrahmicTokenizer-131K and https://github.com/theschoolofai/BrahmicTokenizer-131K

Abstract

We present BrahmicTokenizer-131K, a 131,072-vocabulary byte-level BPE tokenizer that closes the Brahmic compression gap at the 131K-vocabulary class while preserving the English, EU-language, and code compression of OpenAI's o200k_base. We construct it through a two-stage retrofit: (1) a script-prune crop that reduces 200,019 tokens to 131,072 by removing nine out-of-scope writing systems, and (2) a surgical retrofit of 2,372 corpus-dead vocabulary slots determined by linear-programming allocation across nine Brahmic Unicode blocks. The pre-tokenizer, decoder, and inherited merge rules are unchanged from o200k_base, making BrahmicTokenizer-131K a drop-in replacement at the tokenizer interface. On 27 million documents of public Indic pretraining text (2.84 billion words, 46.21 GB), BrahmicTokenizer-131K produces 26.7% fewer tokens than Mistral-Nemo Tekken / Sarvam-m at the same vocabulary budget, with per-language savings of 15.79% (Tamil) to 76.79% (Odia, a 4.31x compression ratio). The Odia advantage is mechanistically explained by Tekken/Sarvam-m containing zero Oriya-block tokens; our surgery added 725. On non-Indic content, BrahmicTokenizer-131K matches o200k_base's English fertility (1.235 vs 1.232 tokens/word) and beats Tekken/Sarvam-m by 4.0-14.2% on HumanEval, MBPP, and GSM8K. Across our 14-tokenizer benchmark, it is the only tokenizer simultaneously competitive on Brahmic, English, EU, code, and math at the 131K budget. Specialist tokenizers at other vocab classes (Sarvam-30B, Sarvam-1, MUTANT-Indic) achieve better Indic compression at the cost of non-Indic performance: Sarvam-1's English fertility is 15.9% worse and its code/math compression 26-33% worse than ours. We release the artifact under Apache 2.0 at https://huggingface.co/theschoolofai/BrahmicTokenizer-131K.
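
The reported quantities are related by simple arithmetic, which the sketch below spells out: fertility is tokens per word, a token saving s relative to a baseline corresponds to a compression ratio of 1 / (1 - s), and the English fertility gap follows from the two quoted values. The helper names are hypothetical; only the numbers 76.79%, 1.235, and 1.232 come from the abstract.

```python
def fertility(n_tokens, n_words):
    """Average number of tokens emitted per word (lower is better)."""
    return n_tokens / n_words


def token_savings(tokens_ours, tokens_baseline):
    """Fraction of tokens saved relative to a baseline tokenizer."""
    return 1.0 - tokens_ours / tokens_baseline


def compression_ratio_from_savings(savings):
    """Baseline-to-ours token ratio implied by a fractional saving."""
    return 1.0 / (1.0 - savings)


# Odia: 76.79% fewer tokens implies roughly a 4.31x ratio.
odia_ratio = compression_ratio_from_savings(0.7679)

# English: relative fertility gap between 1.235 and 1.232 tokens/word.
english_gap = 1.235 / 1.232 - 1.0
print(round(odia_ratio, 2), round(100 * english_gap, 2))
```

The check confirms that the 76.79% saving and the 4.31x ratio quoted for Odia are the same statement in two forms, and that the English fertility difference is about a quarter of a percent.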

2605.29378 2026-05-29 cs.RO

Decentralized LLM-Driven Coordination of Acoustic Robots for Contactless Object Manipulation

Yingying Wang, Narsimlu Kemsaram, Sriram Subramanian

AI summary Proposes a decentralized framework that uses Whisper speech recognition and LLM semantic parsing to convert natural-language instructions into multi-robot task plans for contactless object manipulation by acoustic robots; experiments validate sequential, parallel, and synchronized cooperative tasks.

Comments This paper has been accepted for publication in the Proceedings of the 2026 IEEE 22nd International Conference on Automation Science and Engineering (CASE 2026), August 17-21, 2026, Shenyang, China

Abstract

Natural language interfaces can simplify interaction with multi-robot systems, especially when non-expert users need to issue high-level commands. Acoustic manipulation using ultrasonic phased arrays also enables contactless object handling for applications such as healthcare, laboratory automation, and precision transport. However, combining large language models (LLMs) with distributed acoustic mobile robots remains underexplored. This paper presents a decentralized framework for natural language-driven coordination of acoustic robots for contactless object manipulation. The system converts spoken instructions into executable multi-robot task plans using Whisper-based speech recognition, LLM-based semantic parsing, structured JSON task representation, and distributed scheduling. The JSON schema encodes robot assignments, temporal dependencies, spatial constraints, and synchronization requirements for sequential, parallel, and synchronized execution. The system is implemented on two TurtleBot3-based acoustic robots, each equipped with an ultrasonic phased array for contactless object transport. Experiments were conducted in three scenarios: sequential execution, parallel multi-robot transport, and synchronized cooperative manipulation. The system achieved task success rates of 96 percent for sequential tasks, 86 percent for parallel execution, and 70 percent for synchronized collaborative transport. These results show that natural language commands can be transformed into distributed robot actions for contactless manipulation, highlighting the potential of LLM-driven automation for human-robot interaction in distributed robotic systems.
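
To make the idea of a JSON task representation with robot assignments and temporal dependencies concrete, here is a small sketch. Every field name (`tasks`, `robot`, `depends_on`, `sync_with`) and the grouping into waves are hypothetical; the paper's actual schema and its distributed scheduler are not reproduced here.

```python
import json

# Hypothetical plan: two parallel transports followed by a synchronized handover.
PLAN_JSON = """
{
  "tasks": [
    {"id": "t1", "robot": "robot_a", "action": "transport", "target": [0.5, 0.0], "depends_on": []},
    {"id": "t2", "robot": "robot_b", "action": "transport", "target": [0.0, 0.5], "depends_on": []},
    {"id": "t3", "robot": "robot_a", "action": "handover", "target": [0.2, 0.2],
     "depends_on": ["t1", "t2"], "sync_with": "t4"},
    {"id": "t4", "robot": "robot_b", "action": "handover", "target": [0.2, 0.2],
     "depends_on": ["t1", "t2"], "sync_with": "t3"}
  ]
}
"""


def execution_waves(plan):
    """Group task ids into waves whose dependencies are already complete."""
    remaining = {t["id"]: set(t["depends_on"]) for t in plan["tasks"]}
    done, waves = set(), []
    while remaining:
        ready = sorted(tid for tid, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies")
        waves.append(ready)
        done.update(ready)
        for tid in ready:
            del remaining[tid]
    return waves


waves = execution_waves(json.loads(PLAN_JSON))
print(waves)
```

Tasks in the same wave can run in parallel, a single-task wave is sequential execution, and the `sync_with` field marks pairs that would need a start barrier for synchronized manipulation.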

2605.29368 2026-05-29 cs.CL cs.AI

SURGENT: A Surgical Multi-Agent Assistance System Across the Perioperative Workflow

Dongsheng Shi, Yue Li, Xin Yi, Yongyi Cui, Huawei Feng, Linlin Wang

AI summary Proposes SURGENT, a surgical multi-agent assistance system that combines a Tree-of-Thought planner, multi-department collaboration agents, and retrieval-augmented reasoning, with a novel memory design that manages long-term patient histories and short-term working summaries; it outperforms baseline LLMs and existing medical multi-agent frameworks on five perioperative tasks.

Comments preprint

Abstract

The intricate nature of modern surgical care necessitates intelligent systems that can synthesize extensive patient records, support collaborative decision-making, and provide transparent, auditable reasoning across the entire perioperative workflow. Although web-based Large Language Models (LLMs) possess advanced reasoning capabilities, they are ill-equipped for surgical applications due to critical limitations: input length constraints, incomplete memory management, and limited traceability. To address this issue, we present SURGENT, a surgical multi-agent assistance system that combines a Tree-of-Thought planner, multi-department collaboration agents, and retrieval-augmented reasoning with clinical guidelines and biomedical literature. SURGENT features a novel memory design that manages both long-term patient histories and short-term working summaries, enabling more complete, contextualized, and consistent reasoning. Experimental evaluations across five key perioperative tasks - case analysis, surgical plan simulation, safety monitoring, complication risk assessment, and rehabilitation guidance - show that SURGENT outperforms baseline LLMs and existing medical multi-agent frameworks, yielding recommendations more closely aligned with patient histories. Ablation studies further highlight the advantage of DeepSeek as a locally deployable backbone model, enabling privacy-preserving deployment without reliance on centralized services. These results position SURGENT as a practical and trustworthy advancement toward intelligent, equitable, and secure surgical assistance systems.

2605.29367 2026-05-29 cs.CL cs.CY cs.SI

Attention Asymmetry in AI Layoff Discourse on X: A Computational Analysis of Capital vs Labour Amplification

Joy Bose

AI summary Using tweets collected from X with an account-level collection method, finds that capital discourse is amplified 3.12x more than labour discourse, with a 2.69x asymmetry remaining after normalising for follower count, and introduces the Amplification Ratio and Amplification Normalisation Index as metrics of platform discourse inequality.

Comments 18 pages, 3 figures, 9 tables

Abstract

When workers lose jobs to AI-driven restructuring, two very different conversations happen on X (formerly Twitter) at the same time. Tech executives and AI researchers talk about productivity, transformation, and opportunity. Laid-off workers and labour critics talk about job loss, uncertainty, and fear. This paper asks a simple question: which conversation gets more reach? We report three studies using two collection methods and 763 tweets from 20 named public accounts. Study 1 used keyword-based collection (n=392) and found no significant difference between corpora (p=0.891), revealing that keyword search is too noisy for this task. Study 2 used account-based collection (n=96) and found a 3.12x mean amplification advantage for capital discourse over labour discourse (p=0.000003, Cohen's d=0.555). Study 3 combined both methods (n=763) and confirmed the finding at 4.18x mean and 10.77x median amplification ratio (p<0.000001). Critically, after normalising for follower count, the asymmetry persists at 2.69x (p=0.000009, Cohen's d=0.491), demonstrating that the effect is not simply a consequence of capital accounts having larger audiences. The finding is robust across all tested amplification metric weightings. We introduce the Amplification Ratio and Amplification Normalisation Index as simple metrics for measuring platform-level discourse inequality. A cross-platform replication on Reddit (n=647 posts) did not replicate the finding, suggesting the asymmetry may be specific to X's account-based amplification architecture. We discuss the methodological implications for cross-platform discourse analysis.
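
The two kinds of statistic quoted above, a ratio between group means and Cohen's d, can be computed as in the sketch below. It is an illustration only: the engagement values are toy numbers, not the paper's data, and reading the amplification ratio as a plain ratio of mean engagement, as well as using the pooled-standard-deviation form of Cohen's d, are assumptions.

```python
import math
from statistics import mean, variance


def ratio_of_means(group_a, group_b):
    """Mean engagement of group A divided by mean engagement of group B."""
    return mean(group_a) / mean(group_b)


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Toy per-tweet engagement scores (hypothetical, for illustration only).
capital = [100.0, 200.0, 300.0]
labour = [50.0, 100.0, 150.0]

ratio = ratio_of_means(capital, labour)
d = cohens_d(capital, labour)
print(ratio, round(d, 3))
```

Dividing each engagement score by the posting account's follower count before calling the same two functions gives a follower-normalised variant of the comparison.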

2605.29366 2026-05-29 cs.LG

Solving Integer Linear Programming with Parallel Tempering

Kyuil Sim, Sanghyeok Choi, Jinkyoo Park

AI summary Proposes a solver-free, sampling-based optimization framework for integer linear programming that uses a locally balanced proposal and parallel tempering to explore the discrete feasible region directly, matching or outperforming classical solvers on several benchmarks.

Comments Preprint. Code available at https://github.com/ski-sim/ILP-with-ParallelTempering

Abstract

Integer Linear Programming (ILP) serves as a versatile framework for modeling a wide range of combinatorial optimization problems, typically addressed by sophisticated exact solvers or heuristics. While learning-based approaches have recently shown their effectiveness, they suffer from poor generalization to out-of-distribution instances and inherent dependence on external solvers. In this work, we propose a solver-free, sampling-based optimization framework for ILP that directly explores discrete feasible regions without training or external solvers. Exploiting the linear structure of ILP, we employ a Locally-Balanced Proposal to construct a transition kernel, thereby avoiding the gradient approximation. To overcome the highly multimodal nature of ILP energy landscapes, we integrate Parallel Tempering. In addition to standard temperature tempering, we introduce penalty tempering, which modulates constraint barriers while preserving the objective landscape over feasible solutions. Empirically, our method consistently outperforms SCIP across all four benchmarks, matches or exceeds Gurobi on two of four tasks within a 200-second budget, and is substantially more robust to distribution shift than learning-based methods. Furthermore, on MIPLIB 2017 instances, our framework remains competitive with classical solvers without any problem-specific tuning.
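
Two ingredients mentioned above can be shown in a few lines: a penalized energy whose value on feasible points does not depend on the penalty weight, and the standard Metropolis rule for swapping replicas at different inverse temperatures. This is a minimal sketch, assuming a hinge-style penalty on `A x <= b`; the locally balanced proposal and the paper's exact penalty-tempering scheme are not implemented, and the tiny problem data are hypothetical.

```python
import math

import numpy as np


def energy(x, c, A, b, penalty):
    """Objective c.x plus a penalty on violated constraints A x <= b."""
    violation = np.maximum(A @ x - b, 0.0).sum()
    return float(c @ x + penalty * violation)


def swap_accept_prob(beta_i, beta_j, e_i, e_j):
    """Metropolis probability of exchanging states between two replicas."""
    exponent = (beta_i - beta_j) * (e_i - e_j)
    return 1.0 if exponent >= 0 else math.exp(exponent)


# Tiny hypothetical problem: minimize -x0 - 2*x1 subject to x0 + x1 <= 1.
c = np.array([-1.0, -2.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

feasible = np.array([0.0, 1.0])
infeasible = np.array([1.0, 1.0])

e_feas_low = energy(feasible, c, A, b, penalty=1.0)
e_feas_high = energy(feasible, c, A, b, penalty=100.0)
e_infeas = energy(infeasible, c, A, b, penalty=10.0)
p_swap = swap_accept_prob(1.0, 0.5, -3.0, -2.0)
print(e_feas_low, e_feas_high, e_infeas, p_swap)
```

The feasible point has the same energy under both penalty weights, which is the sense in which raising the penalty changes constraint barriers without reshaping the objective over feasible solutions.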

2605.29360 2026-05-29 cs.AI

MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models

Tianzhuo Yang, Zihan Shen, Zirui Mi, Zhaoyi Zhang, Jiayi Zhou, Jiaming Ji, Juntao Dai, Jiawei Chen, Boyuan Chen, Yaodong Yang

AI summary Proposes the MiraBench benchmark, which evaluates the action-conditioned reliability of robotic world models at three levels (physics adherence, action-following fidelity, and optimism bias detection), and finds that visual fidelity does not reflect action fidelity, that scaling up models does not guarantee better action following, and that optimism bias is pervasive.

Abstract

Action-conditioned world models are increasingly used as scalable simulators for robot learning, yet current evaluations provide limited evidence that their predictions are reliable under the actions they condition on. Existing benchmarks largely emphasize visual fidelity, leaving unclear whether predicted futures are physically plausible, faithful to commanded actions, and calibrated to failure when actions should not succeed. We introduce MiraBench, a hierarchical benchmark that defines action-conditioned reliability as a core evaluation target for robotic world models. MiraBench decomposes this target into three progressively demanding levels: Physics Adherence, which evaluates reference-free physical consistency; Action-Following Fidelity, which measures whether predictions respect task-relevant action inputs; and Optimism Bias Detection, which probes the tendency to predict successful outcomes under failure-inducing actions. To support this evaluation, we curate a human-annotated corpus with over 16,000 judgments across tasks, failure categories, and leading world models. We evaluate 12 representative model configurations spanning vector-conditioned robotic world models, text-conditioned generative world models, open-weight systems, closed-source systems, and multiple model scales. Across this broad model landscape, MiraBench reveals three central findings: visual fidelity is a poor proxy for action fidelity; increasing model scale does not reliably improve action following; and optimism bias is pervasive across current systems. By shifting evaluation from appearance to action-conditioned reliability, MiraBench provides a diagnostic foundation for assessing and improving robotic world models as faithful simulators.

2605.29358 2026-05-29 cs.AI

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, Tom Henighan

AI summary Uses sparse autoencoders to extract interpretable features from the production-scale language model Claude 3 Sonnet, showing that dictionary learning scales to large models, and analyzes the features' multilingual and multimodal properties and their causal influence on model behavior.

Abstract

We demonstrate that sparse autoencoders can extract interpretable features from Claude 3 Sonnet, a production-scale language model, addressing the open question of whether dictionary learning methods scale beyond small transformers. We trained sparse autoencoders with up to 34 million features on the model's middle layer residual stream, using scaling laws to guide hyperparameter selection. The resulting features are multilingual and multimodal (generalizing to images despite text-only training), respond to both concrete instances and abstract discussions of concepts, and can be used to steer model behavior in ways consistent with their interpretations. We find features corresponding to famous entities and locations, as well as more abstract concepts like sarcasm or errors in code. We also identify features relevant to ways in which language models might cause harm--including features representing deception, power-seeking, sycophancy, and bias--and show that these causally influence model outputs when manipulated. Additionally, we conduct analyses of feature interpretability, geometry, and computational function. However, significant limitations remain: our suite of features is incomplete, and we lack rigorous methods for evaluating whether our features faithfully capture model computations.
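
For readers unfamiliar with the technique, a sparse autoencoder on residual-stream activations can be sketched as a ReLU encoder, a linear decoder, and a reconstruction loss with an L1 sparsity penalty. The code below is a generic textbook form with random stand-in activations; the dimensions, the penalty coefficient, and the exact loss are assumptions and are not the configuration used in the work above.

```python
import numpy as np


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into non-negative features, then reconstruct."""
    features = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU keeps features sparse-friendly
    x_hat = features @ W_dec + b_dec
    return features, x_hat


def sae_loss(x, features, x_hat, l1_coeff=1e-3):
    """Mean squared reconstruction error plus an L1 penalty on feature activations."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(features), axis=-1))
    return recon + l1_coeff * sparsity


rng = np.random.default_rng(0)
d_model, n_features = 16, 64            # toy sizes, far smaller than the real setting
x = rng.normal(size=(8, d_model))       # random stand-in for residual-stream activations
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))

features, x_hat = sae_forward(x, W_enc, np.zeros(n_features), W_dec, np.zeros(d_model))
loss = sae_loss(x, features, x_hat)
print(features.shape, x_hat.shape)
```

Each row of `W_dec` plays the role of a feature direction; adding a scaled copy of one row to an activation is the simplest picture of the steering experiments mentioned in the abstract.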

2605.29357 2026-05-29 cs.AI cs.LG cs.PL

PassNet: Scaling Large Language Models for Graph Compiler Pass Generation

Yiqun Liu, Yingsheng Wu, Ruqi Yang, Enrong Zheng, Honglei Qiu, Sijun He, Tai Liang, Jingjing Wu, Yuhan Zhou, Yiwei Zhang, Dongyan Chen, Weihan Yi, Xinqi Li, Siqi Bao

AI summary To address the poor performance of default compiler optimizations on long-tail subgraphs, proposes the PassNet ecosystem, comprising a large-scale dataset and a benchmark; fine-tuning a small model on a small number of trajectories approaches frontier-model performance.

Comments Code and data available at https://github.com/PaddlePaddle/PassNet

Abstract

Modern tensor compilers such as TorchInductor deliver substantial speedups on mainstream models, yet face a systematic performance ceiling on long-tail workloads -- our profiling shows that 43% of real-world subgraphs experience end-to-end slowdowns under default compilation. While LLMs offer a path toward automated optimization, existing efforts focus on standalone kernel generation. We argue that pass generation -- where LLMs author structured graph transformations that integrate directly into compiler pipelines -- is the more appropriate abstraction. We propose PassNet, the first large-scale ecosystem for LLM-based compiler pass generation, comprising: (1) PassNet-Dataset, over 18K unique computational graphs from 100K real-world models; and (2) PassBench, 200 curated long-tail fusible tasks (comprising 2,060 subgraphs in total) evaluated under the Error-aware Speedup Score (ES_t) -- a metric unifying correctness, stability, and performance -- with layered integrity defenses against systematic LLM exploitation. Experiments reveal that PassBench is both highly discriminative and genuinely unsaturated: the best frontier model trails TorchInductor by 37% in aggregate, yet on individual subgraphs LLMs achieve up to 3x speedup over the same compiler -- indicating that the bottleneck is consistency, not capability. Fine-tuning a small model on merely ~4K PassNet trajectories yields a 2.67x improvement approaching frontier-model performance, demonstrating substantial headroom and validating PassNet as live training infrastructure for advancing LLM-driven compiler optimization. All data, benchmarks, and tooling are publicly available.

2605.29351 2026-05-29 cs.LG math.DS stat.ML

Attention as In-Context Empirical Bayes: A Two-Stage View via Particle Dynamics

Matthew Smart, Soumya Ganguly, Nilava Metya, Alexandre V. Morozov, Anirvan M. Sengupta

AI summary Interprets minimal attention-only transformers as a two-stage empirical Bayes procedure via particle dynamics, revealing the statistical roles of depth and attention residuals and showing that effective denoising is possible without an explicit noise schedule.

Comments 52 pages, 5 figures

Abstract

We study minimal attention-only transformers under all-token corruption and show they admit a two-stage empirical Bayes interpretation. A single attention step computes a kernel-weighted posterior mean with respect to the empirical distribution defined by the context. Depth refines this distribution through particle dynamics (Stage 1), while a long-range skip-connection carries the noisy input as a query for posterior inference (Stage 2), revealing distinct statistical roles for depth and attention residuals. The framework isolates a minimal setting in which the context itself induces a depth-dependent energy landscape governing in-context inference. We show that effective denoising can emerge without an explicit noise schedule: a fixed kernel bandwidth and finite integration horizon suffice, yielding a principled depth-noise relationship. We further establish a posterior-mean recovery guarantee for a class of well-behaved priors, where the empirical estimator converges to the Bayes-optimal predictor under asymptotic conditions. Connecting these dynamics to reverse-diffusion limits, our results provide a statistical interpretation of attention as in-context inference via sample-based posterior estimation, without explicit density modeling.
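
The statement that one attention step computes a kernel-weighted posterior mean can be checked numerically in the simplest case. Assuming a Gaussian kernel of bandwidth `sigma` and an empirical prior placed on the context points (both assumptions for illustration, as are the function names), the posterior mean of a noisy query equals softmax attention with a key-norm bias:

```python
import numpy as np


def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def posterior_mean_gaussian(y, context, sigma):
    """E[x | y] when x is uniform over the context points and y = x + Gaussian noise."""
    sq_dist = np.sum((context - y) ** 2, axis=-1)
    weights = softmax(-sq_dist / (2.0 * sigma**2))
    return weights @ context


def attention_form(y, context, sigma):
    """Same quantity written as dot-product attention with a key-norm bias."""
    logits = context @ y / sigma**2 - np.sum(context**2, axis=-1) / (2.0 * sigma**2)
    return softmax(logits) @ context


rng = np.random.default_rng(0)
context = rng.normal(size=(32, 4))                 # context tokens as samples from the prior
y = context[0] + 0.3 * rng.normal(size=4)          # noisy query
pm = posterior_mean_gaussian(y, context, sigma=0.3)
att = attention_form(y, context, sigma=0.3)
print(np.max(np.abs(pm - att)))
```

The two forms agree because the squared distance expands into a dot product, a key-norm term, and a query-norm term that is constant across context points and cancels inside the softmax.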

2605.29350 2026-05-29 cs.AI

ConMoE: Expert-Pool Consolidation via Prototype Reassignment for MoE Compression

Yilun Yao, Jiaming Pan, Elsie Dai, Peizhuang Cong, Yaoming Li, Tong Yang

AI summary Proposes ConMoE, a training-free MoE compression method that selects retained expert prototypes using calibration-based contribution and replaceability signals and deterministically remaps original expert calls, matching or outperforming strong baselines on several MoE language models.

Comments 12 pages, 3 figures, 5 tables

Abstract

Mixture-of-Experts (MoE) language models reduce per-token computation but still require storing and serving all experts, making deployment memory-intensive. Existing post-training compression methods mainly shrink this cost by pruning experts or merging their weights. We formulate post-training MoE compression as expert-pool consolidation: retaining a smaller set of pretrained experts as reusable prototypes and deterministically remapping each original expert reference to one selected prototype. This view separates the reduced expert pool from the reuse structure that represents the original expert slots, and allows prototype sharing within local layer scopes while preserving the original router interface. We propose ConMoE, a train-free prototype remapping framework that selects retained experts using calibration-based contribution and replaceability signals, then redirects original expert calls to the selected prototypes without weight updates or post-compression fine-tuning. Experiments on three pretrained MoE language models show that ConMoE matches or outperforms strong pruning and merging baselines in several settings, achieving the best average score on deepseek-moe-16b-base at both 25% and 50% routed-expert reduction, while remaining competitive on Qwen3-30B-A3B and OLMoE-1B-7B-0125. Ablations indicate that deterministic reassignment is the most stable component, whereas broader cross-layer sharing and post-hoc weight fusion are model-dependent.
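
The consolidation idea, keeping a subset of experts and redirecting every original expert slot to one retained prototype, can be sketched as a lookup table. The selection rule below (top contribution scores, then nearest prototype by a similarity matrix) is a hypothetical stand-in for the calibration-based contribution and replaceability signals; the scores and matrix are toy values.

```python
import numpy as np


def build_remap(contribution, similarity, n_keep):
    """Pick prototypes by contribution and map every expert slot to one of them.

    Using a similarity matrix as the replaceability signal is an assumption
    made for illustration.
    """
    n_experts = len(contribution)
    keep = sorted(np.argsort(-contribution)[:n_keep].tolist())
    remap = {}
    for expert in range(n_experts):
        if expert in keep:
            remap[expert] = expert                     # retained experts map to themselves
        else:
            remap[expert] = max(keep, key=lambda p: similarity[expert, p])
    return keep, remap


contribution = np.array([0.9, 0.1, 0.8, 0.2])          # toy calibration scores
similarity = np.array([
    [1.0, 0.2, 0.1, 0.7],
    [0.2, 1.0, 0.6, 0.1],
    [0.1, 0.6, 1.0, 0.3],
    [0.7, 0.1, 0.3, 1.0],
])

keep, remap = build_remap(contribution, similarity, n_keep=2)
routed = [3, 1, 0, 2]                                   # expert ids chosen by the unchanged router
redirected = [remap[e] for e in routed]
print(keep, remap, redirected)
```

Because the router still emits the original expert ids and only the lookup changes, no weights are updated, which matches the train-free framing in the abstract.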

2605.29340 2026-05-29 cs.CL

A Study on Question-Answer Dataset for LLM Safety Evaluation with a Focus on Illegal Activities

Kenji Imamura, Masao Ideuchi, Atsushi Fujita

AI summary Based on a manual analysis of the AnswerCarefully dataset, proposes additional information, methods for creating question-answer examples, and an evaluation rubric for assessing LLM safety with respect to illegal activities.

Comments 10 pages, 1 figure

Abstract

In this paper, we discuss a question-answer dataset for LLM safety evaluation, with a focus on illegal activities. Specifically, on the basis of a manual analysis of AnswerCarefully, we introduce several additional pieces of information, methods for creating question-answer examples, and a rubric for evaluating LLM-generated responses. The outcomes of this study are intended to be shared with the "JAI-Trust" project.

2605.29339 2026-05-29 cs.CV

DMC-CF: Dynamic Multimodal CounterFactual QA benchmark for Causal Reasoning

DMC-CF: 用于因果推理的动态多模态反事实QA基准

Junzhe Zhang, Huixuan Zhang, Guirong Wang, Xingyao Zhang, Pei Liu, Lin Qu, Hu Wei, Xiaojun Wan

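AI summary To address existing causal reasoning datasets being limited in scale or built on non-realistic data, the paper proposes DMC-CF-Static, a large-scale multimodal causal counterfactual reasoning benchmark based on real-world videos, and uses a Dynamic Graph Intervention framework to construct the dynamic evaluation benchmark DMC-CF-Dynamic; experiments show that the causal reasoning capabilities of current multimodal large language models in real-world scenarios still require substantial improvement.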

AI中文摘要

随着多模态大语言模型(MLLMs)的快速发展,模型已展现出日益强大的多模态能力。然而,通过统计学习训练的MLLMs能否真正理解现实世界背后的因果关系仍是一个关键研究问题。近年来,众多多模态因果推理数据集被提出,但这些数据集要么规模有限,要么基于合成图像和视频、卡通内容或其他非真实多模态来源构建。为解决这些局限性,我们收集真实世界视频并构建了DMC-CF-Static,一个大规模多模态因果反事实推理基准。此外,为缓解传统静态评估中的数据污染等问题,我们使用因果图表示因果事件,并提出动态图干预(DGI)框架,从DMC-CF-Static构建动态评估基准DMC-CF-Dynamic。在包含静态和动态评估基准的整体DMC-CF上的实验结果表明,当前多模态大语言模型在真实场景下的多模态因果推理能力仍需大幅提升。

英文摘要

With the rapid advancement of multimodal large language models (MLLMs), models have demonstrated increasingly powerful multimodal capabilities. However, whether MLLMs trained through statistical learning can truly understand the causal relationships underlying the real world remains a key research question. In recent years, numerous multimodal causal reasoning datasets have been proposed. Nevertheless, these datasets are either limited in scale or constructed from synthetic images and videos, cartoon-based content, or other non-realistic multimodal sources. To address these limitations, we collect real-world videos and construct DMC-CF-Static, a large-scale benchmark for multimodal causal counterfactual reasoning. Furthermore, to mitigate issues such as data contamination in traditional static evaluation, we represent causal events using causal graphs and propose the Dynamic Graph Intervention (DGI) framework to build the dynamic evaluation benchmark DMC-CF-Dynamic from DMC-CF-Static. Experimental results on the overall DMC-CF, which includes both static and dynamic evaluation benchmarks, demonstrate that the multimodal causal reasoning capabilities of current multimodal large language models in real-world scenarios still require substantial improvement.