arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.20635 2026-05-28 cs.LG math.ST stat.ML stat.TH

The General Theory of Localization Methods

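Congwei Song

AI summary: This paper proposes the localization method, a general machine learning framework built on localization kernels and local means. It systematically traces the framework's connections to many existing models (kernel methods, MeanShift, Transformers, and others) and shows that it can unify and generalize modern architectures.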

Comments correct some math expressions

Abstract

This paper proposes a general machine learning framework called the localization method, which is fundamentally built on two core concepts: localization kernels and local means -- key components that underpin the self-attention mechanism. To establish a rigorous theoretical foundation, the framework is formally defined through two essential pillars: the formulation of the local(-ized) model and the localization trick. We systematically investigate the connections between the localization method and a wide range of existing machine learning models/methods, including (but not limited to) kernel methods, lazy learning, the MeanShift algorithm, relaxation labeling, Hopfield networks, local linear embedding (LLE), fuzzy inference, and denoising autoencoders (DAEs). By dissecting these relationships, we clarify the broader theoretical significance of the localization method and demonstrate its practical applicability across diverse machine learning tasks. Furthermore, we explore advanced extensions of the framework, such as adaptive kernels, hierarchical local models, and non-local models. Notably, we show that the Transformer -- a cornerstone of modern sequence modeling -- can be constructed using hierarchical local models, revealing the ability of the localization method to unify and generalize state-of-the-art architectures. This work not only provides a unified theoretical lens to reinterpret existing models but also offers new methodological tools for designing flexible, data-adaptive learning systems.
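
As a rough illustration of the two core concepts named in the abstract, the sketch below computes a kernel-weighted local mean. The Gaussian kernel, the function names, and the `bandwidth` parameter are assumptions made for illustration, not the paper's definitions. Swapping in an exponentiated dot-product kernel turns the same weighted mean into the familiar softmax attention readout.

```python
import math

def gaussian_kernel(x, xi, bandwidth=1.0):
    """Hypothetical localization kernel: weight decays with squared distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def dot_product_kernel(q, k):
    """exp(q.k / sqrt(d)); normalizing these weights gives softmax attention."""
    return math.exp(sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)))

def local_mean(query, points, values, kernel=gaussian_kernel):
    """Kernel-weighted average of `values`, localized around `query`."""
    weights = [kernel(query, p) for p in points]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

points = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
values = [[1.0], [3.0], [100.0]]
# The distant third point receives a negligible Gaussian weight.
smoothed = local_mean([0.5, 0.0], points, values)
# Same readout with attention-style weights (points act as keys).
attended = local_mean([0.5, 0.0], points, values, kernel=dot_product_kernel)
```

Because both results are convex combinations of `values`, they always lie between the smallest and largest value; only the kernel decides how local the average is.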

2605.19257 2026-05-28 cs.RO

PRISM-SLAM: Probabilistic Ray-Grounded Inference for Scale-aware Metric SLAM

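Eunsoo Im, Gyeonggwan Lee, Junghun Suh

AI summary: Proposes PRISM-SLAM, a framework that integrates vision foundation model priors into a Bayesian factor graph and uses a Plücker Ray-Distance Factor together with Dynamic Scene Uncertainty Gating to achieve real-time monocular metric SLAM without scale drift.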

Abstract

Monocular SLAM historically suffers from scale ambiguity and tracking failure in dynamic environments. While recent vision foundation models (VFMs) provide remarkable zero-shot depth priors, naively integrating these deterministic predictions ignores predictive uncertainty and frame-to-frame scale inconsistencies. We propose PRISM-SLAM, a real-time framework that rigorously integrates VFM priors into a structured Bayesian factor graph to achieve scale-aware, metric-consistent localization and mapping. Specifically, we introduce a Plücker Ray-Distance Factor to anchor monocular observations in absolute space within a globally consistent metric coordinate system, mathematically resolving scale drift by making the metric scale Fisher-identifiable. To handle environmental dynamics, we derive an epistemic uncertainty proxy from temporal depth consistency and formulate a Dynamic Scene Uncertainty Gating (DSUG) mechanism. This soft-gating approach probabilistically down-weights dynamic distractors without incurring the heavy computational overhead associated with traditional semantic segmentation masks. By employing a multi-process architecture that asynchronously processes VFM inference and geometric tracking, PRISM-SLAM provides verified metric output at 30 FPS using solely RGB input, bridging the gap between foundation models and real-world robotic applications. Evaluated on the TUM RGB-D and 7-Scenes benchmarks, PRISM-SLAM achieves a metric $SE(3)$ Absolute Trajectory Error (ATE) nearly identical to its oracle-aligned $Sim(3)$ error. This demonstrates that our system can produce deployment-ready metric trajectories by delivering robust metric SLAM solutions without any post-hoc scale correction. Project page: https://prismslam-cmd.github.io/prismslam_pr/
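
The abstract does not spell out the form of the Plücker Ray-Distance Factor, so the following is only a generic sketch of the underlying geometry: a ray stored in Plücker coordinates (unit direction `d`, moment `m = origin x d`) and the perpendicular distance from a 3D point to it. All function names are hypothetical, and the ray is treated as an infinite line.

```python
import math

def cross(a, b):
    """3D cross product."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def plucker_from_ray(origin, direction):
    """Return (d, m): unit direction and moment m = origin x d."""
    n = norm(direction)
    d = tuple(c / n for c in direction)
    return d, cross(origin, d)

def point_to_ray_distance(point, ray):
    """Perpendicular distance to the line (d, m): ||point x d - m|| for unit d."""
    d, m = ray
    pxd = cross(point, d)
    return norm(tuple(a - b for a, b in zip(pxd, m)))

# Camera ray along the z-axis; the point sits 5 units off the axis.
ray = plucker_from_ray(origin=(0.0, 0.0, 0.0), direction=(0.0, 0.0, 2.0))
residual = point_to_ray_distance((3.0, 4.0, 10.0), ray)
```

A residual of this kind is expressed in the same units as the map points, which is one way a ray-based term can carry metric scale information into an optimizer.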

2605.02263 2026-05-28 cs.LG

Break the Block: Dynamic-size Reasoning Blocks for Diffusion Large Language Models via Monotonic Entropy Descent with Reinforcement Learning

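Yan Jiang, Ruihong Qiu, Zi Huang

AI summary: To address the poor logical coherence and inefficiency caused by fixed-size reasoning blocks in diffusion large language models, proposes b1, a post-training framework that combines a Monotonic Entropy Descent objective with reinforcement learning to learn dynamic-size reasoning blocks and improve reasoning coherence.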

Abstract

Recent diffusion large language models (dLLMs) have demonstrated both effectiveness and efficiency in reasoning via a block-based semi-autoregressive generation paradigm. Despite this progress, fixed-size block generation remains a critical bottleneck for effective and coherent reasoning, for two reasons. (1) From a global perspective, different reasoning tasks correspond to different optimal decoding block sizes, which makes a "one-size-fits-all" assumption ineffective. (2) Even within a single reasoning task, rigid block partitioning breaks the logical flow and reduces reasoning coherence. Through empirical observations, we reveal that block-wise entropy fluctuates unsteadily between blocks in incorrect reasoning, whereas correctly solved tasks follow a consistent descending trend. Therefore, this paper proposes b1, a novel post-training framework for dLLMs that learns dynamic-size reasoning blocks via a Monotonic Entropy Descent objective with reinforcement learning to enhance reasoning coherence. b1 integrates seamlessly as a plug-and-play module with existing dLLM post-training algorithms. Extensive experiments across various reasoning benchmarks showcase b1's consistent improvement over existing fixed-size block baselines. Our code has been released at https://github.com/YanJiangJerry/Block-R1.
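
To make the block-wise entropy observation concrete, the sketch below computes the mean token entropy per block and a penalty that is zero only when entropy never rises from one block to the next. The penalty form and every function name are assumptions for illustration; the abstract does not give the actual Monotonic Entropy Descent objective.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one token's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def block_entropies(token_dists, block_sizes):
    """Mean token entropy per block; blocks may have different sizes."""
    out, start = [], 0
    for size in block_sizes:
        block = token_dists[start:start + size]
        out.append(sum(token_entropy(p) for p in block) / len(block))
        start += size
    return out

def descent_violation(entropies):
    """Sum of entropy increases between consecutive blocks (0 if non-increasing)."""
    return sum(max(0.0, b - a) for a, b in zip(entropies, entropies[1:]))

uniform = [0.25, 0.25, 0.25, 0.25]   # maximally uncertain token
peaked = [0.7, 0.1, 0.1, 0.1]        # fairly confident token
sure = [1.0, 0.0, 0.0, 0.0]          # fully confident token

steady = block_entropies([uniform, uniform, peaked, sure], block_sizes=[2, 1, 1])
unsteady = block_entropies([peaked, uniform, sure, uniform], block_sizes=[1, 1, 1, 1])
```

Here `descent_violation(steady)` is zero while `descent_violation(unsteady)` is positive, mirroring the contrast between the descending and fluctuating trends described above.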

2605.01046 2026-05-28 cs.LG

Learning in the Fisher Subspace: A Guided Initialization for LoRA Fine-Tuning

Zhi-Quan Feng, Ying-Jia Lin, Hung-Yu Kao

AI summary: Proposes a Fisher-information-guided initialization that uses curvature information from downstream data to select the LoRA adaptation subspace and improve fine-tuning performance.

Abstract

LoRA adapts large language models (LLMs) by restricting updates to low-rank subspaces of pre-trained weights. While this substantially reduces training cost, the effectiveness of adaptation critically depends on which subspace is chosen at initialization: a poor initialization that allocates capacity to task-irrelevant directions can severely hinder downstream performance. Existing initialization strategies primarily rely on the intrinsic properties of pre-trained weights, implicitly assuming that weight geometry alone reflects task relevance. However, such criteria overlook how the model interacts with the downstream data distribution. In this work, we formulate LoRA initialization as identifying the degree of impact of directions in parameter space under the target data distribution. We argue that data-aware sensitivity, rather than weight-only magnitude, should govern the choice of adaptation subspaces. Building on this perspective, we propose a Fisher-guided framework that leverages curvature information induced by downstream data to characterize how parameter perturbations influence model predictions. This perspective yields a principled, task-dependent criterion for selecting LoRA directions that better align adaptation with the target objective. Empirical results across diverse tasks and modalities demonstrate that data-aware initialization consistently and significantly improves downstream performance over existing approaches.

2605.23192 2026-05-28 cs.CV

Occlusion-Aware Physics-Semantic Keyframe Selection for Robust Video Editing

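Lin Liu, Zhihan Xiao, Haohang Xu, Rong Cong, Zhibo Zhang, Xiaopeng Zhang, Qi Tian

AI summary: Proposes an occlusion-aware physics-semantic keyframe selection framework that scores candidate frames on structural completeness, tracking stability, and attribute visibility, automatically selects the best anchor frame, and uses bidirectional tracking to generate spatiotemporal masks for robust, temporally consistent video editing.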

Abstract

Video editing has recently achieved remarkable progress with diffusion-based generative models, enabling diverse object-level manipulations from natural language instructions. However, existing methods often struggle under occlusion, viewpoint changes, and fast object motion, where unreliable visual observations lead to inaccurate localization, temporal flickering, and inconsistent edits. In this work, we identify the absence of reliable visual anchors as a fundamental bottleneck in occlusion-robust video editing. To address this issue, we propose an occlusion-aware physics-semantic keyframe selection framework that automatically identifies an optimal anchor frame for downstream editing. Specifically, our method evaluates candidate frames from three complementary perspectives: structural completeness for avoiding truncated observations, cycle-consistent tracking stability for measuring physical reliability, and vision-language-based attribute visibility for ensuring semantic clarity. The selected keyframe is then propagated through bidirectional tracking to generate dense spatiotemporal masks, which are used as auxiliary supervision for a diffusion-based video editing backbone. By transforming occlusion handling from explicit reconstruction into reliable anchor selection, our framework enables precise and temporally consistent editing without requiring manual annotations. Extensive experiments on challenging video editing benchmarks demonstrate the effectiveness and high-quality performance of our method.

2605.22949 2026-05-28 cs.LG cs.MA

MARGIN: Runtime Confidence Calibration for Multi-Agent Foundation Model Coordination

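Joss Armstrong

AI summary: Proposes MARGIN, an online calibration method that learns per-agent, per-confidence-band calibration factors from the task stream without model access or retraining, achieving 3-6x lower calibration error under distribution shift and substantially better multi-agent selection.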

Abstract

Foundation model agents increasingly operate in multi-agent deployments where a coordinator must decide which agent's response to trust. The standard approach weights agents by their self-reported confidence, but recent evidence shows that foundation model confidence is systematically miscalibrated and, on hard tasks, inversely correlated with accuracy. Design-time calibration methods (temperature scaling, Platt scaling, histogram binning) cannot address this problem because they fit a fixed correction to held-out data and degrade under distribution shift. We present MARGIN (Multi-Agent Runtime Grading via Incremental Normalisation), an online calibration method that learns per-agent, per-confidence-band calibration factors from the task stream itself, requiring no model access, no held-out data, and no retraining. MARGIN uses symmetric exponentially weighted moving averages with Bayesian shrinkage blending, and has three hyperparameters with robust defaults. Across 18 foundation models, 8 benchmarks, and over 44,000 observations, MARGIN achieves 3-6x lower calibration error than the best design-time baseline under distribution shift. In multi-agent selection, raw verbalized confidence fails to beat random at pairwise resolution (43-50%) on hard benchmarks. MARGIN corrects this completely, raising pairwise resolution to 70-89% and closing 37-78% of the Raw-to-Oracle pass@1 gap across the five code-generation benchmarks without any oracle knowledge of which model is strongest. Six formal propositions characterize convergence, tracking speed, and the optimality of symmetric updates for non-strategic agents, with all predictions illustrated empirically.
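
The sketch below shows one way an online, per-agent, per-confidence-band calibrator of this kind could look: an exponentially weighted moving average of observed correctness, shrunk toward the raw confidence while evidence is scarce. The class, the update rule, and the three parameters (`alpha`, `n_bands`, `prior_strength`) are illustrative assumptions, not MARGIN's actual formulas or hyperparameters.

```python
from collections import defaultdict

class OnlineCalibrator:
    """Hypothetical per-agent, per-band calibrator (not the paper's exact rule)."""

    def __init__(self, alpha=0.1, n_bands=5, prior_strength=5.0):
        self.alpha = alpha                    # EWMA step, same for hits and misses
        self.n_bands = n_bands                # number of confidence bands on [0, 1]
        self.prior_strength = prior_strength  # pseudo-count favoring raw confidence
        self.accuracy = {}                    # (agent, band) -> EWMA of correctness
        self.count = defaultdict(int)         # (agent, band) -> observations seen

    def _key(self, agent, confidence):
        band = min(int(confidence * self.n_bands), self.n_bands - 1)
        return agent, band

    def update(self, agent, confidence, correct):
        """Fold one observed outcome from the task stream into the estimate."""
        key = self._key(agent, confidence)
        outcome = 1.0 if correct else 0.0
        previous = self.accuracy.get(key, confidence)
        self.accuracy[key] = previous + self.alpha * (outcome - previous)
        self.count[key] += 1

    def calibrated(self, agent, confidence):
        """Blend the learned accuracy with the raw confidence by evidence count."""
        key = self._key(agent, confidence)
        n = self.count[key]
        trust = n / (n + self.prior_strength)
        return trust * self.accuracy.get(key, confidence) + (1.0 - trust) * confidence

cal = OnlineCalibrator()
for _ in range(50):
    cal.update("overconfident", 0.95, correct=False)
    cal.update("reliable", 0.95, correct=True)
```

After this stream, both agents still report 0.95, but a coordinator reading `cal.calibrated(...)` would rank "reliable" far above "overconfident"; an agent with no history simply keeps its raw confidence.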

2605.22705 2026-05-28 cs.CL

Tokenization with Split Trees

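Craig W. Schmidt, Michael Krumdick, Adam Wiemerslage, Seth Ebner, Varshini Reddy, Yuval Pinter, Chris Tanner

AI summary: Proposes ToaST, which directly optimizes compression through a recursive inference procedure and selects the vocabulary with an integer program. On English text it reduces token counts by more than 11% relative to BPE, WordPiece, and UnigramLM, and improves Renyi efficiency and language model performance.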

Comments All baseline tokenizers (BPE, WordPiece, Unigram) were trained incorrectly due to a bug in the Hugging Face tokenizers library: pair counts overflow i32 above ~108 GB of training data, dropping the most common merge pairs. All comparisons to ToaST are invalid. Thanks to Sander Land for identifying the missing merge pairs. See https://github.com/huggingface/tokenizers/issues/2058

Abstract

We introduce Tokenization with Split Trees (ToaST), a subword tokenization method that directly optimizes compression under a new recursive inference procedure. ToaST greedily splits each pretoken into a full binary tree using precomputed byte n-gram counts, independent of any vocabulary. Given a vocabulary, inference recursively descends each split tree and emits the first in-vocabulary node reached on each path. Vocabulary selection is formulated as an Integer Program (IP) that minimizes the total token count over all split trees under this inference procedure. The Linear Programming (LP) relaxation is near-integral in practice, yielding provably near-optimal vocabularies, with training time empirically scaling quadratically in the number of split trees. On English text, ToaST reduces token counts by more than 11% compared to BPE, WordPiece, and UnigramLM at vocabulary sizes of 40,960 and above, reducing the number of inference tokens for models using this tokenizer, thus extending the effective context length. ToaST also uses common single-byte tokens less frequently than these baselines, leading to a substantial improvement in Renyi efficiency. In experiments training 1.5B parameter language models, ToaST achieves the highest CORE score, outperforming baselines by 2.6%--7.6%, with significance for two of three, and scoring best on 13 of 22 individual tasks.
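
The inference rule described above (descend the split tree and emit the first in-vocabulary node on each path) can be sketched as follows. The `SplitNode` class, the hand-built tree, and the assumption that single-character leaves are always emitted are illustrative choices; a real split tree would be derived from byte n-gram counts.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SplitNode:
    """Node of a full binary split tree over one pretoken (hypothetical layout)."""
    text: str
    left: Optional["SplitNode"] = None
    right: Optional["SplitNode"] = None

def tokenize(node, vocab):
    """Emit the first in-vocabulary node met on each root-to-leaf path."""
    if node.text in vocab or node.left is None:
        # Leaves are emitted unconditionally (assumes single bytes are tokens).
        return [node.text]
    return tokenize(node.left, vocab) + tokenize(node.right, vocab)

def leaf(ch):
    return SplitNode(ch)

# Hand-built split tree for the pretoken "token".
tree = SplitNode(
    "token",
    SplitNode("to", leaf("t"), leaf("o")),
    SplitNode("ken", SplitNode("ke", leaf("k"), leaf("e")), leaf("n")),
)

tokens = tokenize(tree, vocab={"to", "ke", "t", "o", "k", "e", "n"})
```

With this vocabulary the pretoken is segmented as `to`, `ke`, `n`; adding `"token"` itself to the vocabulary would collapse it to a single token. Since the tree is fixed before the vocabulary is chosen, the token count for any candidate vocabulary follows directly from this traversal.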

2605.22547 2026-05-28 cs.CV cs.AI

Case-Aware Medical Image Classification with Multimodal Knowledge Graphs and Reliability-Guided Refinement

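Yiming Xu, Yixuan Liu, Yuhang Zhang, Ling Zheng, Yihan Wang, Qi Song

AI summary: Proposes a case-aware reasoning framework built on a multimodal knowledge graph. It constructs a structured diagnostic memory, adaptively retrieves similar cases, propagates and injects knowledge, and applies confidence-calibrated decision refinement to improve the performance and interpretability of medical image classification.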

Abstract

Deep learning has brought significant progress to medical image classification, yet most existing methods still rely on isolated visual evidence and cannot effectively leverage similar cases or external knowledge. In clinical practice, diagnosis is typically supported by similar historical cases and their associated symptoms. To explicitly model this evidence-based diagnostic process, we propose a case-aware reasoning framework driven by multimodal knowledge graphs for medical image classification. Specifically, we construct a case-aware multimodal knowledge graph as a structured diagnostic memory, where diseases, images, and symptoms are hierarchically organized. Given an input image, our method adaptively retrieves similar cases from this memory and extracts their corresponding case-centered subgraphs. We further introduce a knowledge propagation and injection mechanism, in which an image-centric Graph Attention Network aggregates heterogeneous semantics into case-based features, followed by a bidirectional cross-modal attention mechanism that injects these features into visual representations for cross-modal alignment. To mitigate noisy retrieval, we design a confidence-calibrated decision refinement scheme that estimates the reliability of each retrieved case by jointly considering prediction confidence and sample similarity, and reweights its contribution to the final prediction, providing interpretable case-level evidence. Extensive experiments on multiple medical imaging datasets demonstrate that our approach consistently outperforms strong baselines, while ablation and qualitative analyses validate its effectiveness and interpretability. The code is available at https://anonymous.4open.science/r/MKG-CARE-8B7B.
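
As a minimal sketch of the reliability-guided reweighting idea, the code below scores each retrieved case by the product of its prediction confidence and its similarity to the query, then blends the resulting votes with the model's own probabilities. The product rule, the `mix` parameter, and the dictionary layout are assumptions; the abstract does not give the actual estimator.

```python
def reliability_weights(cases):
    """Normalize confidence * similarity into per-case weights (assumed rule)."""
    raw = [c["confidence"] * c["similarity"] for c in cases]
    total = sum(raw)
    return [r / total for r in raw]

def refine_prediction(model_probs, cases, mix=0.5):
    """Blend model class probabilities with reliability-weighted case votes."""
    weights = reliability_weights(cases)
    votes = [0.0] * len(model_probs)
    for w, case in zip(weights, cases):
        votes[case["label"]] += w
    return [(1.0 - mix) * p + mix * v for p, v in zip(model_probs, votes)]

retrieved = [
    {"label": 1, "confidence": 0.9, "similarity": 0.8},
    {"label": 1, "confidence": 0.6, "similarity": 0.5},
    {"label": 0, "confidence": 0.3, "similarity": 0.2},  # noisy case, low weight
]
# The model alone slightly prefers class 0; reliable cases favor class 1.
refined = refine_prediction([0.55, 0.45], retrieved)
```

The per-case weights double as case-level evidence: they show which retrieved cases moved the final prediction and by how much.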

2605.22166 2026-05-28 cs.AI

Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents

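Tianshi Xu, Huifeng Wen, Meng Li

AI summary: Proposes Life-Harness, a runtime harness that evolves reusable environment-side interventions from training trajectories and markedly improves frozen LLM agents on deterministic tasks without modifying model weights or the evaluation environment.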

Comments Work in progress

Abstract

LLM agents are shaped not only by their language models, but also by the runtime harness that mediates observation, tool use, action execution, feedback interpretation, and trajectory control. While existing agent adaptation methods mainly update model parameters, many failures in deterministic, rule-governed domains stem from mismatches at the model--environment interface. We propose Life-Harness, a lifecycle-aware runtime harness that improves frozen LLM agents without changing model weights or evaluation environments. Life-Harness evolves from training trajectories by converting recurring interaction failures into reusable interventions across environment contracts, procedural skills, action realization, and trajectory regulation, and remains fixed for evaluation on unseen tasks. On seven deterministic environments from $τ$-bench, $τ^2$-bench, and AgentBench, Life-Harness improves 116 out of 126 model--environment settings across 18 model backbones, with an average relative improvement of 88.5%. Harnesses evolved only from Qwen3-4B-Instruct trajectories transfer to 17 other models, showing that Life-Harness captures reusable environment-side structure rather than model-specific behavior. These results position runtime interface adaptation as a complementary alternative to model-centric agent training. Code is available at https://github.com/Tianshi-Xu/Life-Harness.

2510.20665 2026-05-28 cs.AI

The Shape of Reasoning: Topological Analysis of Reasoning Traces in Large Language Models

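Xue Wen Tan, Nathaniel Tan, Galen Lee, Stanley Kok

AI summary: Proposes an evaluation framework based on topological data analysis (TDA) that captures the geometric structure of reasoning traces for efficient automated assessment; experiments show that topological features predict reasoning quality better than graph metrics.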

Comments Accepted in ICML 2026 Workshop: Epistemic Intelligence in Machine Learning

Abstract

Evaluating the quality of reasoning traces from large language models remains understudied, labor-intensive, and unreliable: current practice relies on expert rubrics, manual annotation, and slow pairwise judgments. Automated efforts are dominated by graph-based proxies that quantify structural connectivity but do not clarify what constitutes high-quality reasoning; such abstractions can be overly simplistic for inherently complex processes. We introduce a topological data analysis (TDA)-based evaluation framework that captures the geometry of reasoning traces and enables label-efficient, automated assessment. In our empirical study, topological features yield substantially higher predictive power for assessing reasoning quality than standard graph metrics, suggesting that effective reasoning is better captured by higher-dimensional geometric structures rather than purely relational graphs. We further show that a compact, stable set of topological features reliably indicates trace quality, offering a practical signal for future reinforcement learning algorithms.

2605.21832 2026-05-28 cs.AI

FLUID: From Ephemeral IDs to Multimodal Semantic Codes for Industrial-Scale Livestreaming Recommendation

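Xinhang Yuan, Zexi Huang, Anjia Cao, Xudong Lu, Zikai Wang, Penghao Zhou, Chang Liu, Wentao Guo, Qinglei Wang

AI summary: To address the ID cold-start problem in livestreaming recommendation, proposes FLUID, which uses a cross-domain multimodal encoder to generate hierarchical semantic codes (LUCID) that replace candidate-side IDs and adopts a staged warmup scheme, delivering significant gains on an industrial-scale system.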

Abstract

Modern recommender systems rely heavily on ID-based collaborative filtering: each item is represented by a unique ID embedding that accumulates collaborative signals from user interactions. Livestreaming recommendation, however, faces a unique challenge in this paradigm: a live room typically broadcasts for only tens of minutes, so its item ID remains poorly learned in a persistent cold-start state and ID-centric ranking models fail to generalize. We present FLUID, the first framework to fully retire the candidate-side item ID from a production-scale livestreaming ranker. FLUID introduces a cross-domain multimodal encoder, jointly trained on short videos and livestreams, to produce discrete hierarchical semantic codes, called LUCID, for content-based item characterization. To adapt the ranker to LUCID, FLUID further employs a staged warmup scheme: it first incorporates cold, slice-level LUCID as an independent token alongside the ID embedding, and then replaces the ID embedding with warm, room-level LUCID before online incremental training. Deployed on our industrial livestreaming recommenders with a cross-platform combined user base of over one billion globally, FLUID delivers significant online gains of +0.55% Quality Watch Duration, +2.05% Cold-Start Room Views, and +0.05% Active Hours.

2605.21743 2026-05-28 cs.AI econ.GN q-fin.EC

Who Uses AI? Platform Selection and the Measurement of Occupational AI Exposure

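Michelle Yin, Burhan Ogut

AI summary: By analyzing conversation logs from AI platforms, the paper shows that the composition of a platform's user base biases measurements of occupational AI exposure, and proposes a workforce-reweighted partial-identification approach to correct the estimates.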

Abstract

Conversation logs from AI platforms are increasingly used to measure occupational exposure to artificial intelligence, but the users observed in these logs are not the workforce. We show that platform-derived exposure scores combine task-level AI applicability with the occupational composition of the platform's user base. Holding the empirical design fixed, changing only the platform input changes the post-ChatGPT employment coefficient by a factor of 1.9, and consumer and enterprise channels within the same vendor disagree in sign. We formalize the resulting non-classical measurement error, decompose it into between- and within-occupation selection, and construct workforce-reweighted partial-identification bounds. Reweighting to Bureau of Labor Statistics employment shares attenuates estimates by 42 to 93 percent. The bias captures augmentation among observed users more directly than substitution in the workforce.
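
The compositional effect described above can be illustrated with simple arithmetic: the same task-level applicability scores give different aggregate exposure depending on whether occupations are weighted by platform-user shares or by workforce employment shares. Every number and name below is made up for illustration, and the sketch covers only the reweighting step, not the paper's partial-identification bounds.

```python
def weighted_exposure(applicability, shares):
    """Aggregate exposure: share-weighted mean of task-level applicability."""
    total = sum(shares.values())
    return sum(applicability[occ] * share / total
               for occ, share in shares.items())

# Hypothetical inputs; none of these values come from the paper.
applicability = {"software": 0.8, "office": 0.5, "construction": 0.1}
platform_shares = {"software": 0.6, "office": 0.3, "construction": 0.1}
workforce_shares = {"software": 0.1, "office": 0.4, "construction": 0.5}

platform_score = weighted_exposure(applicability, platform_shares)
workforce_score = weighted_exposure(applicability, workforce_shares)

# Fraction by which the aggregate shrinks after reweighting to the workforce.
attenuation = 1.0 - workforce_score / platform_score
```

In this toy case the platform over-represents the high-applicability occupation, so reweighting to workforce shares lowers the aggregate from 0.64 to 0.33, the same direction as the attenuation the abstract reports.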

2605.16578 2026-05-28 cs.SD cs.AI cs.HC cs.LG

Voice "Cloning" is Style Transfer

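Kaitlyn Zhou, Federico Bianchi, Martijn Bartelds, Anna Pot, Yongchan Kwon, James Zou

AI summary: Finds that voice cloning does not faithfully reproduce the source voice but systematically applies style transfer, making cloned voices more authoritative, warm, customer-service-like, and human-like, homogenizing speaker characteristics, and affecting human trust and behavior.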

Abstract

Artificially generated speech is increasingly embedded in everyday life. Voice cloning in particular enables applications where identity preservation is important, such as completing a recording, dubbing in a new language, or preserving the voices of individuals with speech loss. However, in our work, we find that despite the term, voice cloning does not faithfully "clone" an individual's voice. Instead, we find that widely used voice cloning models systematically apply style transfer to source voices. As rated by human annotators, cloned voices are perceived as more authoritative, warm, customer-service-like, and human-like compared to their sources. Human annotators also report greater trust in cloned voices than source voices, and a greater willingness to disclose sensitive personal information to them. Our work furthermore shows that voice cloning leads to homogenization of speaker characteristics, as measured by reduced variance in accent, speaking rate, and the audio embedding space. Together, our results highlight a new set of limitations and risks of voice cloning technology and their potential impact on human behavior.

2602.06511 2026-05-28 cs.LG

EvoMAS: Evolutionary Generation of Multi-Agent Systems

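Yuntong Hu, Yuting Zhang, Matthew Trager, Yi Zhang, Shuo Yang, Wei Xia, Stefano Soatto

AI summary: Proposes EvoMAS, which recasts multi-agent system generation as structured configuration generation and optimizes in configuration space with an evolutionary algorithm, improving task performance, executability, and robustness.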

Comments ICML2026

Journal ref ICML2026
Abstract

Large language model (LLM)-based multi-agent systems (MAS) show strong promise for complex reasoning, planning, and tool-augmented tasks, but designing effective MAS architectures remains labor-intensive, brittle, and hard to generalize. Existing automatic MAS generation methods either rely on code generation, which often leads to executability and robustness failures, or impose rigid architectural templates that limit expressiveness and adaptability. We propose Evolutionary Generation of Multi-Agent Systems (EvoMAS), which formulates MAS generation as structured configuration generation. EvoMAS performs evolutionary generation in configuration space. Specifically, EvoMAS selects initial configurations from a pool, applies feedback-conditioned mutation and crossover guided by execution traces, and iteratively refines both the candidate pool and an experience memory. We evaluate EvoMAS on diverse benchmarks, including BBEH, SWE-Bench, and WorkBench, covering reasoning, software engineering, and tool-use tasks. EvoMAS consistently improves task performance over both human-designed MAS and prior automatic MAS generation methods, while producing generated systems with higher executability and runtime robustness. EvoMAS outperforms the agent evolution method EvoAgent by +10.5 points on BBEH reasoning and +7.1 points on WorkBench. With Claude-4.5-Sonnet, EvoMAS also reaches 79.1% on SWE-Bench-Verified, matching the top of the leaderboard. Code is available at https://github.com/amazon-science/EvoMAS

2605.19729 2026-05-28 cs.CV cs.AI

LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models

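Hyunsoo Han, Sangyeop Yeo, Jaejun Yoo

AI summary: Proposes the LIFT and PLACE framework, whose coarse-to-fine distillation strategy addresses the difficulty a student has in mimicking a high-complexity teacher network, training stably and performing well even under extreme compression.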

Comments Project page: https://hyun-s.github.io/LIFT_PLACE_site (15 pages, 11 figures, 9 tables). To appear in CVPR 2026

Abstract

We demonstrate that in knowledge distillation for diffusion models, the teacher network's highly complex denoising process, which stems from its substantially larger capacity, poses a significant challenge for the student model to faithfully mimic. To address this problem, we propose a coarse-to-fine distillation framework with LInear FiTting-based distillation (LIFT) and Piecewise Local Adaptive Coefficient Estimation (PLACE). First, LIFT decomposes the objective into a "coarse" alignment and a "fine" refinement. The student is then trained on coarse alignment before proceeding to hard refinement. Second, PLACE extends LIFT to address spatially non-uniform errors by partitioning outputs into error-based groups, providing locally adaptive guidance. Our experiments show that LIFT and PLACE are effective across diffusion spaces (image/latent), backbones (U-Net/DiT), tasks (unconditional/conditional), and datasets, and even extend to flow-based models such as MMDiT (SD3). Furthermore, under extreme compression with a 1.3M-parameter student (only 1.6% of the teacher), conventional KD fails to provide sufficient guidance for stable training, with FID scores often degrading to 50-200+, but our method remains stably convergent and achieves an FID of 15.73.

2602.06025 2026-05-28 cs.CL cs.AI cs.LG

Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory

学习面向运行时智能体记忆的查询感知预算层级路由

Haozhen Zhang, Haodong Yue, Tao Feng, Quanyu Long, Jianzhu Bao, Bowen Jin, Weizhi Zhang, Xiao Li, Jiaxuan You, Chengwei Qin, Wenya Wang

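AI summary Proposes the BudgetMem framework, in which a lightweight router trained with reinforcement learning performs query-aware budget-tier routing to balance task performance against memory construction cost at runtime.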

Comments Accepted by ICML 2026. Code is available at https://github.com/ViktorAxelsen/BudgetMem

详情
AI中文摘要

记忆对于在单个上下文窗口之外运行的大型语言模型(LLM)智能体日益重要,然而大多数现有系统依赖于离线的、查询无关的记忆构建,这可能导致效率低下并丢弃查询关键信息。尽管运行时记忆利用是一种自然的替代方案,但先前的工作通常会产生大量开销,并且对性能-成本权衡的显式控制有限。在这项工作中,我们提出了 BudgetMem,一个用于显式、查询感知性能-成本控制的运行时智能体记忆框架。BudgetMem 将记忆处理结构化为一组记忆模块,每个模块提供三个预算层级(即 Low/Mid/High)。一个轻量级路由器在模块间执行预算层级路由,以平衡任务性能和记忆构建成本,该路由器实现为通过强化学习训练的紧凑神经策略。使用 BudgetMem 作为统一测试平台,我们研究了实现预算层级的三种互补策略:实现(方法复杂度)、推理(推理行为)和容量(模块模型大小)。在 LoCoMo、LongMemEval 和 HotpotQA 上,当优先考虑性能时(即高预算设置),BudgetMem 超越了强基线,并在更紧的预算下提供了更好的精度-成本边界。此外,我们的分析揭示了不同层级策略的优势和劣势,阐明了在不同预算制度下每个轴何时提供最有利的权衡。

英文摘要

Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query-agnostic memory construction that can be inefficient and may discard query-critical information. Although runtime memory utilization is a natural alternative, prior work often incurs substantial overhead and offers limited explicit control over the performance-cost trade-off. In this work, we present BudgetMem, a runtime agent memory framework for explicit, query-aware performance-cost control. BudgetMem structures memory processing as a set of memory modules, each offered in three budget tiers (i.e., Low/Mid/High). A lightweight router, implemented as a compact neural policy trained with reinforcement learning, performs budget-tier routing across modules to balance task performance and memory construction cost. Using BudgetMem as a unified testbed, we study three complementary strategies for realizing budget tiers: implementation (method complexity), reasoning (inference behavior), and capacity (module model size). Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem surpasses strong baselines when performance is prioritized (i.e., high-budget setting), and delivers better accuracy-cost frontiers under tighter budgets. Moreover, our analysis disentangles the strengths and weaknesses of different tiering strategies, clarifying when each axis delivers the most favorable trade-offs under varying budget regimes.

2605.20150 2026-05-28 cs.CV cs.PF

TideGS: Scalable Training of Over One Billion 3D Gaussian Splatting Primitives via Out-of-Core Optimization

TideGS: 通过外存优化训练超过十亿个3D高斯溅射基元

Chonghao Zhong, Linfeng Shi, Hua Chen, Tiecheng Sun, Hao Zhao, Binhang Yuan, Chaojian Li

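AI summary To address the memory bottleneck of large-scale 3D Gaussian Splatting training, proposes TideGS, an out-of-core training framework that combines SSD-CPU-GPU hierarchical management with three synergistic techniques to train more than one billion Gaussian primitives on a single GPU with the best reconstruction quality among the evaluated single-GPU baselines.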

Comments Accepted to ICML 2026 as Spotlight. Website: https://sponge-lab.github.io/TideGS

详情
AI中文摘要

训练十亿基元规模的3D高斯溅射(3DGS)本质上是内存受限的:每个高斯基元携带一个大的属性向量,总参数表迅速超出GPU容量,限制了先前系统在商用单GPU硬件上只能处理数千万高斯基元。我们观察到3DGS训练本质上是稀疏且轨迹条件的:每次迭代仅激活当前相机批次可见的高斯基元,因此GPU内存可以作为工作集缓存而非持久参数存储。基于这一洞察,我们引入了TideGS,一个外存训练框架,通过三种协同技术管理SSD-CPU-GPU层次结构中的参数:用于SSD对齐空间局部性的块虚拟化几何、用于将I/O与计算重叠的分层异步流水线,以及轨迹自适应差分流,该流在迭代之间仅传输增量工作集变化。实验表明,TideGS能够在单个24 GB GPU上训练超过十亿个高斯基元,同时在大规模场景中实现评估的单GPU基线中最佳的重建质量,超越了先前的外存基线(例如约1亿高斯基元)和标准内存训练(例如约1100万高斯基元)。

英文摘要

Training 3D Gaussian Splatting (3DGS) at billion-primitive scale is fundamentally memory-bound: each Gaussian primitive carries a large attribute vector, and the aggregate parameter table quickly exceeds GPU capacity, limiting prior systems to tens of millions of Gaussians on commodity single-GPU hardware. We observe that 3DGS training is inherently sparse and trajectory-conditioned: each iteration activates only the Gaussians visible from the current camera batch, so GPU memory can serve as a working-set cache rather than a persistent parameter store. Building on this insight, we introduce TideGS, an out-of-core training framework that manages parameters across an SSD-CPU-GPU hierarchy via three synergistic techniques: block-virtualized geometry for SSD-aligned spatial locality, a hierarchical asynchronous pipeline to overlap I/O with computation, and trajectory-adaptive differential streaming that transfers only incremental working-set deltas between iterations. Experiments show that TideGS enables training with over one billion Gaussians on a single 24 GB GPU while achieving the best reconstruction quality among evaluated single-GPU baselines on large-scale scenes, scaling beyond prior out-of-core baselines (e.g., approximately 100M Gaussians) and standard in-memory training (e.g., approximately 11M Gaussians).
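
The working-set idea in this abstract can be illustrated with a minimal sketch. It assumes parameters are grouped into integer block ids and that visibility is already known per camera batch; the function name `working_set_delta` and the block granularity are hypothetical and not taken from TideGS itself. Only the difference between consecutive working sets is transferred, and everything already resident stays on the GPU.

```python
def working_set_delta(resident, visible):
    """Return (to_load, to_evict) block ids for the next iteration.

    resident: set of block ids currently held in GPU memory.
    visible:  set of block ids needed by the next camera batch.
    Both names are illustrative assumptions, not the TideGS API.
    """
    to_load = visible - resident    # blocks that must be streamed in
    to_evict = resident - visible   # blocks that can be written back
    return to_load, to_evict


resident = {0, 1, 2, 3}
visible_next = {2, 3, 4}
to_load, to_evict = working_set_delta(resident, visible_next)
print(sorted(to_load), sorted(to_evict))  # prints: [4] [0, 1]
```

When consecutive camera batches overlap heavily, the delta is much smaller than the full working set, which is the property differential streaming relies on.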

2605.19778 2026-05-28 cs.LG

B-cos GNNs: Faithful Explanations through Dynamic Linearity

B-cos GNNs:通过动态线性实现忠实解释

Joschka Groß, Mohammad Shaique Solanki, Verena Wolf

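AI summary Proposes B-cos GNNs, inherently explainable graph neural networks whose predictions decompose exactly into per-node, per-feature contributions via a single input-dependent linear map, trading a small loss in predictive accuracy for high explainability.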

详情
AI中文摘要

我们引入B-cos GNNs,一类内在可解释的图神经网络,其预测通过单个输入依赖的线性映射精确分解为每个节点、每个特征的贡献。B-cos GNNs使用线性(求和)聚合,并用B-cos变换替换非线性消息和更新函数。这诱导了有意义的、任务特定的权重-输入对齐,可通过模型的动态线性直接访问。实例级解释来自单个前向和后向传播,无需辅助解释器、修改的学习目标或扰动过程。实例化为GIN后,我们的方法以较小的预测精度损失换取在各种合成和真实世界基准上最先进的解释性,产生的解释比事后基线快几个数量级。

英文摘要

We introduce B-cos GNNs, an inherently explainable class of graph neural networks whose predictions decompose exactly into per-node, per-feature contributions via a single input-dependent linear map. B-cos GNNs use linear (sum-based) aggregation and replace non-linear message and update functions with B-cos transforms. This induces meaningful, task-specific weight-input alignment that is directly accessible through the model's dynamic linearity. Instance-level explanations follow from a single forward and backward pass, requiring no auxiliary explainer, modified learning objective, or perturbation procedure. Instantiated as a GIN, our approach trades small losses in predictive accuracy for state-of-the-art explainability across diverse synthetic and real-world benchmarks, producing explanations orders of magnitude faster than post-hoc baselines.
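
To make "dynamic linearity" concrete, here is a small sketch of a single B-cos unit as defined in the earlier B-cos networks literature: the output equals an input-dependent weight vector applied linearly to the input, so it splits exactly into per-feature contributions. How the paper places such units inside message and update functions is not specified in the abstract; the function name `bcos_unit` and the toy inputs are assumptions for illustration.

```python
import numpy as np


def bcos_unit(x, w, b=2.0):
    """Single B-cos unit: |cos(x, w)|^(b-1) * (w_hat @ x).

    Returns the scalar output and the per-feature contributions,
    which sum exactly to the output.
    """
    w_hat = w / np.linalg.norm(w)                   # unit-norm weight
    cos = (w_hat @ x) / (np.linalg.norm(x) + 1e-12)  # alignment of x with w
    w_dyn = np.abs(cos) ** (b - 1.0) * w_hat         # input-dependent linear map
    contributions = w_dyn * x                        # exact decomposition
    return float(w_dyn @ x), contributions


x = np.array([3.0, 4.0])
w = np.array([1.0, 0.0])
out, contrib = bcos_unit(x, w)
print(out, contrib)
```

Because the output is `w_dyn @ x` with `w_dyn` depending on `x`, reading off `w_dyn * x` needs no extra explainer, which matches the abstract's claim that explanations follow from the model's own computation.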

2511.14159 2026-05-28 cs.CV

MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs

MVI-Bench:评估大型视觉语言模型对误导性视觉输入鲁棒性的综合基准

Huiyi Chen, Jiawei Peng, Dehai Min, Changchang Sun, Kaijie Chen, Yan Yan, Xu Yang, Lu Cheng

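AI summary Because existing robustness benchmarks overlook misleading visual inputs, proposes MVI-Bench, which builds 1,248 VQA instances in six categories on a three-level hierarchy of visual primitives (visual concept, visual attribute, visual relationship) and introduces the MVI-Sensitivity metric for fine-grained evaluation, revealing pronounced vulnerabilities in 18 LVLMs.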

Comments 18 pages, 9 figures

详情
AI中文摘要

评估大型视觉语言模型(LVLMs)的鲁棒性对于其持续发展和在现实世界应用中的负责任部署至关重要。然而,现有的鲁棒性基准通常关注幻觉或误导性文本输入,而在很大程度上忽视了评估视觉理解时由误导性视觉输入带来的同样关键的挑战。为填补这一重要空白,我们引入了MVI-Bench,这是首个专门设计用于评估误导性视觉输入如何削弱LVLMs鲁棒性的综合基准。基于基本视觉基元,MVI-Bench的设计围绕三个层次的误导性视觉输入:视觉概念、视觉属性和视觉关系。利用这一分类法,我们策划了六个代表性类别,并整理了1248个专家标注的VQA实例。为了促进细粒度的鲁棒性评估,我们进一步引入了MVI-Sensitivity,这是一种新颖的指标,可在细粒度上表征LVLM的鲁棒性。在18个最先进的LVLM上的实证结果揭示了它们对误导性视觉输入的显著脆弱性,我们在MVI-Bench上的深入分析提供了可操作的见解,可以指导开发更可靠和鲁棒的LVLM。基准和代码库可在https://github.com/chenyil6/MVI-Bench获取。

英文摘要

Evaluating the robustness of Large Vision-Language Models (LVLMs) is essential for their continued development and responsible deployment in real-world applications. However, existing robustness benchmarks typically focus on hallucination or misleading textual inputs, while largely overlooking the equally critical challenge posed by misleading visual inputs in assessing visual understanding. To fill this important gap, we introduce MVI-Bench, the first comprehensive benchmark specially designed for evaluating how Misleading Visual Inputs undermine the robustness of LVLMs. Grounded in fundamental visual primitives, the design of MVI-Bench centers on three hierarchical levels of misleading visual inputs: Visual Concept, Visual Attribute, and Visual Relationship. Using this taxonomy, we curate six representative categories and compile 1,248 expertly annotated VQA instances. To facilitate fine-grained robustness evaluation, we further introduce MVI-Sensitivity, a novel metric that characterizes LVLM robustness at a granular level. Empirical results across 18 state-of-the-art LVLMs uncover pronounced vulnerabilities to misleading visual inputs, and our in-depth analyses on MVI-Bench provide actionable insights that can guide the development of more reliable and robust LVLMs. The benchmark and codebase can be accessed at https://github.com/chenyil6/MVI-Bench.

2605.19743 2026-05-28 cs.AI cs.LG cs.MA

EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design

EngiAI: 面向LLM驱动工程设计的多智能体框架与基准测试套件

Gioele Molinari, Florian Felten, Soheyl Massoudi, Mark Fuge

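AI summary Proposes the EngiAI multi-agent system framework and a benchmark suite spanning three dimensions (workflow, RAG, HPC); a supervisor architecture coordinates seven specialized agents, and the evaluation verifies both the capabilities and the limitations of LLMs in engineering design.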

Comments 26 pages, 10 figures, to be published at IDETC 2026

详情
AI中文摘要

大型语言模型(LLM)智能体越来越多地应用于工程设计任务,但现有的评估框架未能充分处理结合仿真、检索和制造准备的多智能体系统。我们引入了一个包含三个评估维度的基准套件:(1)一个工作流基准,包含七种针对不同认知需求的提示风格——包括直接工具使用、语义消歧、条件分支和工作记忆任务;(2)一个检索增强生成(RAG)基准,采用门控评分来隔离检索对参数选择的贡献;(3)一个高性能计算(HPC)基准,评估在SLURM集群上的端到端机器学习训练编排。与基准一起,我们提出了EngiAI,一个基于LangGraph构建的多智能体系统(MAS)参考实现,通过监督架构协调七个专业智能体,统一拓扑优化、文档检索、HPC作业编排和3D打印机控制。在四个LLM后端和两个EngiBench问题上,专有模型在Beams2D上实现了96-97%的平均任务完成率,而开源4B参数模型达到55-78%,并显示出明显的代际改进。条件分支被证明最具挑战性,在Photonics2D上条件风格的任务完成率降至20-53%。RAG门控确认了近乎完美的检索增强分数(约1.0),而无检索时接近零,验证了评估设计。在HPC编排中,一个模型在100%的运行中完成了所有流水线步骤,而另一个模型降至50%,表明多步骤指令遵循在长时间运行的工作流中会退化。

英文摘要

Large Language Model (LLM) agents are increasingly applied to engineering design tasks, yet existing evaluation frameworks do not adequately address multi-agent systems that combine simulation, retrieval, and manufacturing preparation. We introduce a benchmark suite with three evaluation dimensions: (1) a workflow benchmark with seven prompt styles targeting distinct cognitive demands, including direct tool use, semantic disambiguation, conditional branching, and working-memory tasks; (2) a Retrieval-Augmented Generation (RAG) benchmark with gated scoring isolating retrieval contributions to parameter selection; and (3) a High-Performance Computing (HPC) benchmark evaluating end-to-end ML training orchestration on a SLURM cluster. Alongside the benchmark, we present EngiAI, a Multi-Agent System (MAS) reference implementation built on LangGraph that operationalizes the benchmark by coordinating seven specialized agents through a supervisor architecture, unifying topology optimization, document retrieval, HPC job orchestration, and 3D printer control. Across four LLM backends and two EngiBench problems, proprietary models achieve 96-97% average task completion on Beams2D, while open-source 4B-parameter models reach 55-78%, with clear generational improvement. Conditional branching proves most challenging, with task completion dropping to 20-53% for the conditional style on Photonics2D. RAG gating confirms near-perfect retrieval-augmented scores (about 1.0) versus near-zero without retrieval, validating the evaluation design. On HPC orchestration, one model completes all pipeline steps in 100% of runs while another drops to 50%, revealing that multi-step instruction following degrades over long-running workflows.

2605.19514 2026-05-28 cs.AI cs.CL cs.LG

Position: The Turing-Completeness of Autoregressive Transformers Relies Heavily on Context Management

立场:自回归Transformer的图灵完备性高度依赖于上下文管理

Guanyu Cui, Zhewei Wei, Kun He

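AI summary By distinguishing the fixed-system and scaling-family settings, argues that the context-management method decisively shapes the computational power of autoregressive Transformers, and notes that Turing-completeness proofs in the scaling-family setting do not carry over to fixed systems as actually deployed.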

Comments Accepted to the ICML 2026 Position Paper Track

详情
AI中文摘要

许多工作提出了引人注目的主张,即Transformer是图灵完备的。然而,文献常常混淆两种不同的设置:(i)固定系统设置,其中固定的自回归Transformer与固定的上下文管理方法耦合,逐步处理不同长度的输入;(ii)缩放族设置,其中使用一系列不同模型(具有增加的上下文窗口长度或数值精度)来处理不同的输入长度。现有的Transformer图灵完备性证明通常是在设置(ii)中建立的,而现实世界中的LLM部署以及图灵完备性的标准概念更自然地对应于设置(i)。在本文中,我们首先形式化固定系统设置,从而具体描述现实世界LLM的运行方式。然后,我们认为在缩放族设置中证明的结果提供了理论上有意义的资源界限,但并未建立图灵完备性,从而澄清了对现有结果的常见误解。最后,我们展示了不同的上下文管理方法可以产生截然不同的计算能力,并主张上下文管理是决定现实世界自回归Transformer计算能力的关键组成部分。

英文摘要

Many works make the eye-catching claim that Transformers are Turing-complete. However, the literature often conflates two distinct settings: (i) a fixed Transformer system setting, in which a fixed autoregressive Transformer is coupled with a fixed context-management method to process inputs of different lengths step by step, and (ii) a scaling-family setting, in which a family of different models (with increasing context-window length or numerical precision) is used to handle different input lengths. Existing proofs of Transformer Turing-completeness are frequently established in setting (ii), whereas real-world LLM deployment and the standard notion of Turing-completeness correspond more naturally to setting (i). In this paper, we first formalize the fixed-system setting, thereby providing a concrete characterization of how real-world LLMs operate. We then argue that results proved in the scaling-family setting provide theoretically meaningful resource bounds but do not establish Turing-completeness, thereby clarifying a common misinterpretation of existing results. Finally, we show that different context-management methods can yield sharply different computational power, and we advocate the position that context management is a central component that critically determines the computational power of real-world autoregressive Transformers.

2605.19444 2026-05-28 cs.LG cs.AI

Detecting and Mitigating the Correct-Answer Extinction Window in Test-Time Reinforcement Learning with Majority Voting

检测与缓解测试时强化学习中多数投票导致的正确答案灭绝窗口

Hongxiang Lin, Zhirui Kuai, Erpeng Xue, Lei Wang

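AI summary Proposes the TTRL-Guard framework, which uses three mechanisms (flip-rate-aware reward scaling, minority-preserving sampling, and risk-conditioned sparse updates) to detect and mitigate the permanent suppression of correct-answer signals caused by majority voting in test-time reinforcement learning.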

详情
AI中文摘要

测试时强化学习(TTRL)在使用多数投票作为伪标签信号时,在数学推理基准测试中报告了显著的准确率提升。我们认为这些提升被系统性地误解了:大部分提升反映的是已可解问题的锐化而非真正学习,而由正确变为错误的问题数量超过了真正学会的问题,且一旦多数投票锁定错误答案,这种损害是不可逆的。逐问题追踪显示,低能力问题中的正确答案信号在短暂活跃后会被永久抑制,我们将这一现象称为“正确答案灭绝窗口”,并以翻转率(FR)作为其领先指标。因此,我们提出TTRL-Guard,一个轻量级框架,包含三种针对灭绝窗口的机制:翻转率感知奖励缩放(FRS)在FR下降时降低高风险更新的权重,少数保留采样(MPS)保留少数正确答案的梯度信号,风险条件稀疏更新(RCSU)暂停对极化问题的更新。在三个模型和四个基准上的实验表明,TTRL-Guard在Qwen2.5-7B-Instruct和Qwen3-4B上取得了最佳平均pass@1,在AIME 2025上相对TTRL提升了+54%。

英文摘要

Test-time reinforcement learning (TTRL) reports substantial accuracy gains on mathematical reasoning benchmarks using majority vote as a pseudo-label signal. We argue these gains are systematically misinterpreted: most reflect sharpening of already-solvable problems rather than genuine learning, while problems corrupted from correct to incorrect outnumber truly learned ones, and this damage is irreversible once majority vote locks onto a wrong answer. Per-problem tracking reveals that correct-answer signals in low-ability problems are briefly active before being permanently suppressed, a phenomenon we term the Correct-Answer Extinction Window, with Flip Rate (FR) as its leading indicator. We thus propose TTRL-Guard, a lightweight framework with three mechanisms targeting the extinction window: Flip-Rate-Aware Reward Scaling (FRS) down-weights at-risk updates as FR declines, Minority-Preserving Sampling (MPS) retains gradient signal from minority correct answers, and Risk-Conditioned Sparse Updatings (RCSU) suspends updates on polarized problems. Experiments across three models and four benchmarks show that TTRL-Guard achieves the best average pass@1 on Qwen2.5-7B-Instruct and Qwen3-4B and improves over TTRL by a relative 54% on AIME 2025.
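
The lock-in risk described in this abstract follows directly from how a majority-vote pseudo-label assigns reward. The sketch below is a generic illustration and not the TTRL-Guard code; the function name and the sample answers are hypothetical, and Flip Rate and the three guard mechanisms are not modeled because the abstract does not define them precisely.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Use the most frequent sampled answer as the pseudo-label.

    Every sample that matches the pseudo-label gets reward 1.0, all others 0.0.
    No ground truth is consulted, so a wrong majority is reinforced.
    """
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


# Hypothetical rollouts for one problem whose true answer is "7".
samples = ["12", "12", "7", "12", "7"]
label, rewards = majority_vote_rewards(samples)
print(label, rewards)  # prints: 12 [1.0, 1.0, 0.0, 1.0, 0.0]
```

In this toy case the two correct samples receive zero reward while the wrong majority is rewarded, so each update pushes probability mass further from the correct answer.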

2605.19342 2026-05-28 cs.CV

Semantic-Enriched Latent Visual Reasoning

语义增强的潜在视觉推理

Tianrun Xu, Yue Sun, Qixun Wang, Jingyi Lu, Yuan Wang, Tianren Zhang, Longteng Guo, Fengyun Rao, Jing Lyu, Feng Chen, Jing Liu

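AI summary Proposes SLVR, a two-stage learning framework that enriches latent representations through attribute-level semantic supervision and Multi-query Group Relative Policy Optimization, improving the robustness and semantic consistency of latent visual reasoning.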

详情
AI中文摘要

多模态潜在空间推理旨在通过在紧凑的潜在空间中直接进行视觉推理,来替代使用图像的显式思考。然而,现有方法主要依赖视觉监督,产生的潜在表示缺乏足够的语义丰富性,限制了它们支持多样化区域级推理任务的能力。在这项工作中,我们引入了语义增强的潜在视觉推理(SLVR),这是一个两阶段学习框架,用属性级视觉语义丰富潜在表示,并将其与多样化的推理目标对齐。在第一阶段,SLVR在细粒度属性监督下学习语义增强的区域中心潜在表示。在第二阶段,我们设计了多查询组相对策略优化(M-GRPO),以对齐基于同一区域的多个查询的潜在表示。为了支持这一框架,我们构建了SLV-Set,包含约40万条区域级属性标注和80万个多查询问答样本,并引入了SV-QA,一个评估语义变化下潜在推理的基准。实验表明,与现有基线相比,SLVR提高了潜在视觉推理的鲁棒性和语义一致性。

英文摘要

Multimodal latent-space reasoning aims to replace explicit thinking with images by performing visual reasoning directly in a compact latent space. However, existing approaches largely rely on visual supervision and produce latent representations that lack sufficient semantic richness, limiting their ability to support diverse region-level reasoning tasks. In this work, we introduce Semantic-Enriched Latent Visual Reasoning (SLVR), a two-stage learning framework that enriches latent representations with attribute-level visual semantics and aligns them with diverse reasoning objectives. In the first stage, SLVR learns semantically enriched region-centric latents under fine-grained attribute supervision. In the second stage, we design Multi-query Group Relative Policy Optimization (M-GRPO) to align latent representations across multiple queries grounded in the same region. To support this framework, we construct SLV-Set, comprising approximately 400K region-level attribute annotations and 800K multi-query question answering samples, and introduce SV-QA, a benchmark that evaluates latent reasoning under semantic variation. Experiments demonstrate that SLVR improves the robustness and semantic consistency of latent visual reasoning compared to existing baselines.

2605.18692 2026-05-28 cs.AI math.OC

Democratizing Large-Scale Re-Optimization with LLM-Guided Model Patches

利用LLM引导的模型补丁实现大规模重新优化的大众化

Tinghan Ye, Arnaud Deza, Ved Mohan, El Mehdi Er Raqabi, Pascal Van Hentenryck

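AI summary Proposes an LLM-based agentic re-optimization framework that, through natural-language interaction and an optimization toolbox, lets non-expert users dynamically update and re-optimize deployed optimization models; two large-scale real-world case studies validate its effectiveness and scalability.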

详情
AI中文摘要

运筹学专家开发的优化模型通常作为工业环境中的决策支持系统部署。然而,现实环境是动态的,业务规则不断演变且存在不可预见的扰动。在这种情况下,最终用户理想情况下应重新优化模型以恢复可行且可实施的解决方案,但往往无法联系到原始模型开发者。本文介绍了一个代理重新优化框架,其中大语言模型充当运筹学专家,通过自然语言交互动态支持最终用户。大语言模型将用户提示转化为底层优化模型的结构化更新,从优化工具箱中选择合适的重新优化技术,并求解生成的实例以返回可实施的解决方案。该工具箱利用原始信息,包括历史解、有效不等式、求解器配置和元启发式算法,以加速重新优化同时保持解的质量。所提出的框架能够实现部署优化模型的交互式和持续适应,减少对运筹学专家的依赖,并提高决策支持系统的可持续性。在两个互补的大规模实际案例研究上的广泛实验证明了所提框架的有效性和可扩展性。第一个案例考虑在线供应链重新优化,其中必须快速生成解同时保持与部署计划接近,而第二个案例侧重于离线大学考试排程,其中解的质量优先于运行时间。结果表明,基于工具箱的架构通过基于原始信息和求解器感知的重新优化技术显著提高了计算效率,而基于结构化补丁的更新提高了模型修改的可解释性和可追溯性。

英文摘要

Optimization models developed by operations research (OR) experts are often deployed as decision-support systems in industrial settings. However, real-world environments are dynamic, with evolving business rules and unforeseen perturbations. In such contexts, end users should ideally re-optimize models to recover feasible and implementable solutions, often without access to the original model developers. This paper introduces an agentic re-optimization framework in which a large language model (LLM) acts as an OR expert, dynamically supporting end users through natural-language interaction. The LLM translates user prompts into structured updates of the underlying optimization model, selects suitable re-optimization techniques from an optimization toolbox, and solves the resulting instance to return implementable solutions. The toolbox leverages primal information, including historical solutions, valid inequalities, solver configurations, and metaheuristics, to accelerate re-optimization while preserving solution quality. The proposed framework enables interactive and continuous adaptation of deployed optimization models, reducing dependence on OR experts, and improving the sustainability of decision-support systems. Extensive experiments on two complementary large-scale real-world case studies demonstrate the effectiveness and scalability of the proposed framework. The first considers online supply chain re-optimization, where solutions must be generated rapidly while remaining close to the deployed plan, whereas the second focuses on offline university exam scheduling, where solution quality is prioritized over runtime. Results show that the toolbox-driven architecture significantly improves computational efficiency through primal-based and solver-aware re-optimization techniques, while the structured patch-based updates improve interpretability and traceability of model modifications.

2605.03517 2026-05-28 cs.LG stat.ML

Understanding Self-Supervised Learning via Latent Distribution Matching

通过潜在分布匹配理解自监督学习

Fabian A Mikulasch, Friedemann Zenke

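AI summary Formalizes self-supervised learning as latent distribution matching (LDM), which maximizes the log-probability of latent representations (alignment) and their entropy (uniformity); this unifies many SSL methods and yields a nonlinear, sampling-free Bayesian filtering model for high-dimensional time series.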

Comments Accepted to ICML 2026 (Spotlight)

详情
AI中文摘要

自监督学习(SSL)擅长从复杂数据中学习通用潜在表示,但缺乏统一的理论框架来解释现有各种方法并指导新方法的设计。我们将SSL视为潜在分布匹配(LDM):学习表示以最大化其在假设潜在模型下的对数概率(对齐),同时最大化潜在熵以防止坍塌(均匀性)。这一观点将独立成分分析与对比、非对比和预测性SSL方法(包括停止梯度方法)统一起来。利用LDM,我们推导出一个非线性的、无采样的贝叶斯滤波模型,其中包含基于卡尔曼的预测器,用于高维时间序列。我们进一步证明,在温和假设下,即使使用非线性预测器,预测性LDM也能产生可识别的潜在表示。总体而言,LDM阐明了现有SSL方法背后的假设,并为开发新方法提供了原则性指导。

英文摘要

Self-supervised learning (SSL) excels at finding general-purpose latent representations from complex data, yet lacks a unifying theoretical framework that explains the diverse existing methods and guides the design of new ones. We cast SSL as latent distribution matching (LDM): learning representations that maximize their log-probability under an assumed latent model (alignment), while maximizing latent entropy to prevent collapse (uniformity). This view unifies independent component analysis with contrastive, non-contrastive, and predictive SSL methods, including stop-gradient approaches. Leveraging LDM, we derive a nonlinear, sampling-free Bayesian filtering model with a Kalman-based predictor for high-dimensional time series. We further prove that predictive LDM yields identifiable latent representations under mild assumptions, even with nonlinear predictors. Overall, LDM clarifies the assumptions behind established SSL methods and provides principled guidance for developing new approaches.
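
The alignment-plus-uniformity objective described here can be written out to show why "distribution matching" is an apt name. The notation is an assumption of this note, not the paper's: $f_\theta$ is the encoder, $q_\theta(z)$ is the distribution of $z = f_\theta(x)$ induced by the data, and $p(z)$ is the assumed latent model.

```latex
\begin{aligned}
\max_{\theta}\quad
  & \underbrace{\mathbb{E}_{z \sim q_\theta}\big[\log p(z)\big]}_{\text{alignment}}
    \;+\;
    \underbrace{H\big[q_\theta(z)\big]}_{\text{uniformity}} \\[4pt]
  &= \mathbb{E}_{z \sim q_\theta}\big[\log p(z) - \log q_\theta(z)\big] \\[4pt]
  &= -\,\mathrm{KL}\big(q_\theta(z)\,\|\,p(z)\big).
\end{aligned}
```

Under these assumptions, maximizing the two terms together is the same as minimizing the KL divergence from the encoded latent distribution to the assumed latent model; dropping the entropy term would let $q_\theta$ collapse onto a mode of $p$.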

2605.18113 2026-05-28 cs.CL

iPOE: Interpretable Prompt Optimization via Explanations

iPOE: 基于解释的可解释提示优化

Jiahui Li, Yarik Menchaca Resendiz, Sean Papay, Roman Klinger

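AI summary Proposes iPOE, which makes prompt optimization interpretable by automatically generating guidelines from explanations and then optimizing them; performance improves by up to 39% on four datasets, and human and LLM judgments of which guidelines contribute reach a Cohen's kappa of up to 0.65.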

详情
AI中文摘要

提示优化通常被构建为一个离散搜索问题,旨在为LLM找到高性能且鲁棒的指令。然而,搜索结果可能无法透明地显示为什么以及在哪里特定的提示更改带来了性能提升。这与人类接受注释任务指导的方式形成对比。在人类任务中,研究人员精心设计注释指南,从而提高注释一致性。本文旨在结合这两种方法,并引入iPOE,一种通过解释进行可解释提示优化的新策略。我们通过自动从注释决策的解释(自动生成或来自人类)中创建指南来指导提示优化过程。此外,通过一系列操作(包括删除、添加、打乱和合并)来优化这组指南。最终的提示包含指导注释的指南,使LLM的决策过程和优化过程透明化。因此,它也为提示优化领域的非专业人士提供支持,特别是在需要专业知识的挑战性领域。在四个数据集上的实验中,我们发现iPOE相比评估基线最高可提升39%,并且LLM的解释可以替代所提出方法中的人类解释。此外,我们的可解释性验证研究表明,人类和LLM在哪些指南有助于其注释方面可以基本达成一致,Cohen's kappa得分高达0.65。

英文摘要

Prompt optimization has often been framed as a discrete search problem to find high-performing and robust instructions for an LLM. However, the search result might not make it transparent why and where specific prompt changes lead to performance gains. This is in contrast to how humans are instructed for annotation tasks. Here, researchers carefully design annotation guidelines, leading to enhanced annotation consistency. Our paper aims at joining these two approaches and introduces iPOE, a novel interpretable prompt optimization strategy via explanations. We guide the prompt optimization process with guidelines created automatically from explanations of annotation decisions (either automatically generated or from humans). This set of guidelines is further optimized by a series of operations, including removing, adding, shuffling, and merging. The resulting prompt includes guidelines that instruct the annotation, making the decision process of the LLM and the optimization transparent. It therefore also supports laypeople in the area of prompt optimization, particularly in challenging domains requiring expertise. In our experiments on four datasets, we find that iPOE can improve over the evaluated baselines by up to 39% and that LLM explanations can replace human explanations in the proposed method. Moreover, our interpretability validation study demonstrates that humans and LLMs can substantially agree on which guidelines contribute to their annotations, achieving a Cohen's kappa score of up to 0.65.
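
For readers unfamiliar with the agreement statistic reported here, Cohen's kappa compares observed agreement with the agreement expected by chance from each rater's label frequencies. The sketch below is a plain implementation of the standard formula; the two label lists are made-up toy data, not results from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is chance agreement. Undefined (division by zero) when p_e == 1.
    """
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in labels) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Toy example: 1 = "guideline contributed", 0 = "did not contribute".
human = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
llm = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]
print(round(cohens_kappa(human, llm), 2))  # prints: 0.6
```

Here the raters agree on 8 of 10 items (p_o = 0.8) and chance agreement is 0.5, which gives kappa = 0.6, a value in the range conventionally described as moderate to substantial agreement.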

2605.17929 2026-05-28 cs.RO

TacSE3: Equivariant SE(3) Motion Estimation from Low-Texture Visuotactile Images for In-Gripper Tracking and Compensation

TacSE3: 基于低纹理视触觉图像的等变SE(3)运动估计用于夹爪内跟踪与补偿

Zhongyuan Liao, Junzhe Wang, Qingyang Liu, Zhenmin Huang, Jun Ma, Yi Cai, Fei Meng, Haobo Liang, Michael Yu Wang

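AI summary Proposes TacSE3, a tactile motion-estimation pipeline that converts low-texture visuotactile observations into a decoupled three-dimensional force field and estimates incremental rigid-body motion on SE(3); dual-sensor sensing reduces translation-rotation ambiguity, enabling in-gripper tracking and compensation.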

详情
AI中文摘要

机器人手内操作需要在频繁视觉遮挡下可靠地跟踪物体运动,然而低纹理视触觉图像为传统的图像或几何匹配方法提供的稳定对应点很少。本文提出TacSE3,一种触觉运动估计流程,将低纹理视触觉观测转化为解耦的三维力场,并估计SE(3)上的增量刚体运动。该方法从接触质心运动推导平面平移,并主要从剪切相关的触觉响应估计旋转,从而为夹爪内跟踪和补偿提供物理可解释的信号。使用成对DM-Tac指尖传感器的实验表明,双传感器感知减少了平移-旋转歧义,支持跨轴和物体几何形状的旋转跟踪,并提供轻量级补偿信号,在不重新训练基础策略的情况下提高了下游操作任务中的扰动容忍度。

英文摘要

Robotic in-hand manipulation requires reliable object-motion tracking under frequent visual occlusion, yet low-texture visuotactile images provide few stable correspondences for conventional image- or geometry-matching methods. This paper presents TacSE3, a tactile motion-estimation pipeline that converts low-texture visuotactile observations into a decoupled three-dimensional force field and estimates incremental rigid-body motion on SE(3). The method derives planar translation from contact-centroid motion and estimates rotation primarily from shear-related tactile responses, yielding a physically interpretable signal for in-gripper tracking and compensation. Experiments with paired DM-Tac fingertip sensors show that dual-sensor sensing reduces translation-rotation ambiguity, supports rotation tracking across axes and object geometries, and provides a lightweight compensation signal that improves disturbance tolerance in downstream manipulation tasks without retraining the base policy.

2605.17842 2026-05-28 cs.LG

SNLP: Layer-Parallel Inference via Structured Newton Corrections

SNLP:通过结构化牛顿校正的层并行推理

Ligong Han, Kai Xu, Hao Wang, Akash Srivastava

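AI summary Proposes the Structured Newton Layer Parallelism (SNLP) framework, which treats the dependency between Transformer layers as a nonlinear residual equation and solves it in parallel with structured Newton corrections to accelerate inference, reaching up to 2.58x speedup on 0.5B models.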

Comments Project webpage: https://github.com/phymhan/nanochat-snlp

详情
AI中文摘要

自回归语言模型顺序执行Transformer层,造成传统张量或流水线并行无法消除的延迟瓶颈。我们研究是否可以通过将跨层的隐藏状态轨迹视为非线性残差方程的解,并用并行牛顿风格更新来求解,从而放松这种逐层依赖。虽然这一观点在原理上是合理的,但精确的牛顿校正需要昂贵的雅可比向量积,而朴素的固定点迭代在训练好的Transformer上不稳定。我们引入了结构化牛顿层并行(SNLP),一个训练和推理框架,用廉价的架构诱导替代动力学替换精确的层雅可比。在残差Transformer中,这产生了恒等牛顿(IDN),其中校正简化为前缀和式更新;在mHC风格架构中,HC牛顿(HCN)使用模型的残差混合矩阵。我们还研究了SNLP感知训练,包括预训练正则化和直接SNLP前向SFT。在Nanochat规模的Transformer上的实验表明,SNLP揭示了一个实用的速度-质量边界:在0.5B模型上,它实现了高达2.58倍的时钟加速,而一个较不激进的配置在不增加PPL的情况下实现了1.40倍加速。这种有用的权衡来自于IDN/HCN引入的有偏有限迭代计算,而不是精确恢复顺序轨迹。我们进一步表明,SNLP前向SFT可以保持下游任务准确性,并且SNLP可以作为自推测解码的草稿模型,而顺序验证器保持输出正确性。

英文摘要

Autoregressive language models execute Transformer layers sequentially, creating a latency bottleneck that is not removed by conventional tensor or pipeline parallelism. We study whether this layerwise dependency can be relaxed by treating the hidden-state trace across layers as the solution of a nonlinear residual equation and solving it with parallel Newton-style updates. While this view is principled, exact Newton corrections require expensive Jacobian-vector products and naive fixed-point iterations are unstable on trained Transformers. We introduce Structured Newton Layer Parallelism (SNLP), a training and inference framework that replaces exact layer Jacobians with cheap architecture-induced surrogate dynamics. In residual Transformers, this yields Identity Newton (IDN), where the correction reduces to a prefix-sum-like update; in mHC-style architectures, HC Newton (HCN) uses the model's residual mixing matrix. We also study SNLP-aware training, including pretraining regularization and direct SNLP-forward SFT. Experiments on Nanochat-scale Transformers show that SNLP exposes a practical speed-quality frontier: on 0.5B models, it reaches up to 2.58x wall-clock speedup, and a less aggressive configuration reaches 1.40x speedup without increasing PPL. The useful tradeoff comes from the biased finite-iteration computation induced by IDN/HCN rather than exact recovery of the sequential trace. We further show that SNLP-forward SFT can preserve downstream task accuracy, and that SNLP can serve as a drafter for self-speculative decoding while a sequential verifier preserves output correctness.
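
One way to read the "prefix-sum-like update" for residual networks is sketched below; this is an interpretation of the abstract under stated assumptions, not the authors' implementation. For layers of the form h[l+1] = h[l] + f_l(h[l]), replacing each layer Jacobian of f_l with zero in the Newton system leaves a bidiagonal identity system whose solution is h[l] = h0 + sum of f_k(h_old[k]) for k < l. The toy tanh layers and all function names are hypothetical.

```python
import numpy as np


def make_layer(weight):
    """Toy residual branch f(h) = tanh(W h); stands in for a Transformer block."""
    return lambda h: np.tanh(weight @ h)


def sequential_forward(h0, layers):
    """Reference: the usual layer-by-layer residual forward pass."""
    trace = [h0]
    for f in layers:
        trace.append(trace[-1] + f(trace[-1]))
    return trace


def identity_newton_forward(h0, layers, n_iters):
    """Layer-parallel approximation with an identity-surrogate Newton step."""
    depth = len(layers)
    trace = [h0.copy() for _ in range(depth + 1)]  # initial guess: all states = h0
    for _ in range(n_iters):
        # Each branch reads only the previous iterate, so these calls are
        # independent of one another and could run in parallel.
        updates = np.stack([f(trace[l]) for l, f in enumerate(layers)])
        prefix = np.cumsum(updates, axis=0)        # prefix sum over layers
        trace = [h0] + [h0 + prefix[l] for l in range(depth)]
    return trace


rng = np.random.default_rng(0)
layers = [make_layer(0.1 * rng.standard_normal((4, 4))) for _ in range(3)]
h0 = rng.standard_normal(4)
exact = sequential_forward(h0, layers)[-1]
approx = identity_newton_forward(h0, layers, n_iters=1)[-1]
print(np.linalg.norm(exact - approx))
```

In this sketch each iteration makes at least one more layer exact, so running as many iterations as there are layers reproduces the sequential result. Any speedup would therefore have to come from stopping earlier and accepting a biased result, consistent with the abstract's remark about finite-iteration computation.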

2605.02503 2026-05-28 cs.AI

DataClawBench: An Agent Benchmark for Exploratory Real-World Financial Data Analysis

DataClawBench: 面向探索性真实世界金融数据分析的智能体基准

Qiaohong Zhang, Weihao Ye, Jialong Chen, Yi Luo, BoYuan Li, Bowen Deng, Zibin Zheng, Jianhao Lin, Wei-Shi Zheng, Chuan Chen

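AI summary Proposes the DataClawBench benchmark, which uses financial think-tank scenarios to evaluate agents' exploratory data analysis without prior guidance; it contains about 2.06 million noisy real-world cross-domain records and 492 multi-step tasks, and experiments find that more exploration does not reliably translate into task progress or correctness.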

详情
AI中文摘要

自主数据分析智能体越来越期望在有限的人类数据指导下进行探索性分析。然而,现有基准通常在先验引导设置下评估此类智能体,提供选定的数据源、明确的数据模式或清洗后的数据,从而低估了探索负担。为了评估这一现实的探索性数据分析任务,我们引入了DataClawBench,这是一个基于金融智库咨询场景构建的基准,其中智能体必须独立探索不熟悉、有噪声的跨领域数据,并得出可验证的结论。DataClawBench提供了一个统一的真实世界数据环境,包含企业、行业和政策领域约206万条记录,并保留了原始数据噪声。在此数据环境之上,它定义了492个多步跨领域任务,每个任务都标注了中间里程碑,以诊断超出结果准确性的探索和推理失败。在OpenClaw智能体框架下对八个先进LLM的系统评估表明,探索性数据分析破坏了智能体的可靠性:更多的探索并不能可靠地转化为任务相关的进展或正确的最终答案。

英文摘要

Autonomous data analysis agents are increasingly expected to conduct exploratory analysis with limited human guidance about data. However, existing benchmarks typically evaluate such agents in prior-guided settings, providing selected data sources, explicit data schemas, or cleaned data, thereby understating the exploratory burden. To evaluate this realistic exploratory data analysis task, we introduce DataClawBench, a benchmark built from financial think-tank consulting scenarios where agents must independently explore unfamiliar, noisy, cross-domain data and produce verifiable conclusions. DataClawBench provides a unified real-world data environment with approximately 2.06 million records across enterprise, industry, and policy domains, with native data noise preserved. On top of this data environment, it defines 492 multi-step cross-domain tasks, each annotated with intermediate milestones that diagnose exploration and reasoning failures beyond outcome accuracy. A systematic evaluation of eight advanced LLMs under the OpenClaw agent reveals that exploratory data analysis breaks agent reliability: more exploration does not reliably translate into task-relevant progress or correct final answers.

2605.16030 2026-05-28 cs.LG cs.RO

Mind Dreamer: Untethering Imagination via Active Causal Intervention on Latent Manifolds

Mind Dreamer: 通过潜在流形上的主动因果干预释放想象力

Shaojun Xu, Xiaoling Zhou, Yihan Lin, Yapeng Meng, Xinglong Ji, Luping Shi, Rong Zhao

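AI summary To address Historical Tethering in model-based reinforcement learning, which makes policy optimization lag behind, proposes the Mind Dreamer framework, which generates non-continuous latent jumps through active causal intervention and derives a Relay Value Function and a Relay Uncertainty Function, improving sample efficiency.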

Comments 34 pages, 7 figures, ICML 2026 accepted

详情
AI中文摘要

基于模型的强化学习通过潜在想象实现样本效率,但仍受限于历史束缚:想象通常从观测状态初始化。这造成了学习不对称,即世界模型的流形发现速度超过策略的稀疏奖励优化速度。我们提出Mind Dreamer (MD)框架,实例化主动因果干预以超越马尔可夫连续性。MD将发现重新定义为全局中继期望自由能的最小化。它不是从历史数据初始化,而是从对抗生成器$s_0 \sim p_{gen}(\cdot)$中抽取初始状态,产生到认知盲点的非连续潜在跳跃,这些盲点物理上合理但认知上具有挑战性。我们推导了中继价值函数和中继不确定性函数,以解决这些空间断裂中的信用分配悖论。将合成锚点视为干预性中间状态,这些势能通过贝尔曼式备份传播实用和认知价值。值得注意的是,我们证明了跨不连续性的不确定性传播需要二次折扣$\gamma^2$,建立了形式化的认知视野。理论上,MD近似一个方差最小化重要性采样器,扩大了流形的谱间隙,减少了到达关键瓶颈状态的命中时间。实验上,MD在DeepMind Control Suite上比DreamerV3平均加速1.67倍,在稀疏奖励任务中达到8.8倍。

英文摘要

Model-Based Reinforcement Learning yields sample efficiency via latent imagination, yet remains constrained by Historical Tethering: imagination is typically initialized from observed states. This creates a learning asymmetry, where the world model's manifold discovery outpaces the policy's sparse-reward optimization. We propose Mind Dreamer (MD), a framework that instantiates Active Causal Intervention to transcend Markovian continuity. MD reformulates discovery as the minimization of a global Relay Expected Free Energy. Instead of initializing from historical data, it draws initial states from an adversarial generator $s_0 \sim p_{gen}(\cdot)$, creating non-continuous latent jumps to epistemic blind spots that are physically plausible yet cognitively challenging. We derive a Relay Value Function and a Relay Uncertainty Function to resolve the credit assignment paradox across these spatial ruptures. Treating synthesized anchors as interventional intermediary states, these potentials propagate pragmatic and epistemic value through Bellman-style backups. Notably, we prove that uncertainty propagation across discontinuities necessitates a quadratic discount $\gamma^2$, establishing a formal epistemic horizon. Theoretically, MD approximates a variance-minimizing importance sampler that expands the manifold's spectral gap, reducing the hitting time to critical bottleneck states. Empirically, MD achieves a 1.67$\times$ average speedup over DreamerV3 on the DeepMind Control Suite, reaching 8.8$\times$ in sparse-reward tasks.
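
One standard reason a squared discount appears when uncertainty is backed up is the scaling rule for variance. The lines below are a generic illustration under the assumption that the uncertainty is variance-like and that the reward and the successor value are uncorrelated; the paper's Relay Uncertainty Function may be defined differently.

```latex
\begin{aligned}
V(s) &= r + \gamma\, V(s'), \\
\operatorname{Var}\big[V(s)\big]
  &= \operatorname{Var}[r] + \gamma^{2}\,\operatorname{Var}\big[V(s')\big]
  \qquad \text{(assuming } r \text{ and } V(s') \text{ are uncorrelated)}.
\end{aligned}
```

Since $\gamma^2 < \gamma$ for $0 < \gamma < 1$, variance-like uncertainty decays faster along a backup chain than value does, which gives one intuition for why uncertainty would have a shorter effective horizon than reward.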