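arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering and systems science.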

2605.15518 2026-05-20 cs.CL

DetectRL-X: Towards Reliable Multilingual and Real-World LLM-Generated Text Detection

Junchao Wu, Yefeng Liu, Chenyu Zhu, Hao Zhang, Zeyu Wu, Tianqi Shi, Yichao Du, Longyue Wang, Weihua Luo, Jinsong Su, Derek F. Wong

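AI summary: This paper proposes DetectRL-X, a comprehensive multilingual benchmark for evaluating advanced detectors across 8 dimensions. It pairs human-written texts in 8 languages commonly used in commercial contexts, drawn from 6 domains susceptible to LLM misuse, with texts generated by 4 popular commercial LLMs and with AI-assisted writing operations such as polishing, expanding, and condensing. It analyzes how language, domain, generator, attack strategy, text length, and refinement operations affect detector performance, with the aim of strengthening multilingual and language-specific detectors.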

Comments ACL 2026 Main. Code and data are available at https://github.com/AIDC-AI/Marco-LLM/tree/main/DetectRL-X

Abstract

The effective detection and governance of Large Language Model (LLM) generated content has become increasingly critical due to the growing risk of misuse. Despite the impressive performance of existing detectors, their reliability and potential in multilingual, real-world scenarios remain largely underexplored. In this study, we introduce DetectRL-X, a comprehensive multilingual benchmark designed to evaluate advanced detectors across 8 dimensions. The benchmark encompasses 8 languages commonly used in commercial contexts and collects human-written texts from 6 domains highly susceptible to LLM misuse. To better align with real-world applications, we create LLM-generated texts using 4 popular commercial LLMs, and include typical AI-assisted writing operations such as polishing, expanding, and condensing to capture authentic usage patterns. Furthermore, we develop a multilingual framework for paraphrasing and perturbation attacks to simulate diverse human modifications and writing noise, enabling stress testing of detectors across languages. Experimental results on DetectRL-X reveal the strengths and limitations of current state-of-the-art detectors when applied to diverse linguistic resources. We further analyze how domains, generators, attack strategies, text length, and refinement operations influence performance in different languages, underscoring DetectRL-X as an effective benchmark for strengthening multilingual and language-specific detectors.

2605.15497 2026-05-20 cs.CV cs.GR

AnyAct: Towards Human Reenactment of Character Motion From Video

Liuhan Chen, Lei Zhong, Jiewei Wang, Qin Shuai, Li Yuan, Leidong Fan, Qing Li, Kanglin Liu

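AI summary: This paper studies how to derive an initial human reenactment directly from a monocular video of a non-human character, reinterpreting the character's motion as an editable human performance for downstream animation authoring. The core idea is that sparse local articulated motion cues preserve essential dynamics across large structural differences; on this basis the paper proposes AnyAct, which performs conditional human motion generation from transferable sparse local 2D articulated motion.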

Comments 12 pages

Abstract

We study the problem of directly deriving an initial human reenactment from a monocular video of a non-human character. Our goal is not to reconstruct the source character itself but to reinterpret its motion as a plausible and editable human performance for downstream animation authoring. This task is challenging because existing video-based motion capture methods are largely restricted to human-centric structural spaces, while motion retargeting methods typically require structured 3D source motions and known source topologies. Our key insight is that sparse local articulated motion cues can preserve essential dynamics across large structural differences, providing a stable bridge from character video to human reenactment. Based on this observation, we propose AnyAct, which formulates character-video-driven human reenactment as conditional human motion generation from transferable sparse local 2D articulated motion. To make this practical, we introduce three key designs: human-motion-only supervision via augmented 3D-to-2D projection, progressive 3D-to-2D training to alleviate conditioning ambiguity, and global-local motion decoupling for reliable local motion control. We further construct a benchmark primarily covering diverse non-human character videos. Experiments on the benchmark show that AnyAct produces high-fidelity initial human reenactments that preserve the essential dynamics of the characters in reference videos, and further ablation studies validate the effectiveness of its core designs.

2605.15336 2026-05-20 cs.RO cs.AI

HoloMotion-1 Technical Report

Maiyue Chen, Kaihui Wang, Bo Zhang, Xihan Ma, Zhiyuan Yang, Yi Ren, Qijun Huang, Zihao Zhu, Yucheng Wang, Zhizhong Su

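AI summary: This paper presents HoloMotion-1, a humanoid motion foundation model for zero-shot whole-body motion tracking. Its control policy is trained on a large-scale hybrid motion corpus, which broadens the range of behaviors covered, improves tracking accuracy, and yields robust generalization across diverse motion types and capture conditions.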

Comments 20 pages, 4 figures, 6 tables. Technical report

Abstract

In this report, we present HoloMotion-1, a humanoid motion foundation model for zero-shot whole-body motion tracking. A key innovation of HoloMotion-1 is to scale control-policy training with a large-scale hybrid motion corpus, where video-reconstructed motions from in-the-wild videos provide the dominant source of motion diversity, while curated motion-capture and in-house motion data provide higher-fidelity supervision and deployment-oriented coverage. This data regime enables HoloMotion-1 to move beyond conventional MoCap-only training and exposes the policy to substantially broader behaviors, capture conditions, and motion styles. Learning from such heterogeneous data introduces new challenges, including reconstruction noise, source-domain mismatch, uneven motion quality, and the need for temporal modeling under large behavioral variation. To address these challenges, HoloMotion-1 integrates large-capacity temporal modeling, a sparsely activated Mixture-of-Experts Transformer with KV-cache inference for real-time control, and a sequence-level training strategy that improves learning efficiency on extended motion sequences. Extensive experiments on multiple unseen motion benchmarks show that HoloMotion-1 generalizes robustly across diverse motion types and capture conditions, significantly improves tracking accuracy over prior methods, and transfers directly to a real humanoid robot without task-specific fine-tuning.

2605.15186 2026-05-20 cs.CV cs.AI

VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction

Kaixin Zhu, Yiwen Tang, Yifan Yang, Renrui Zhang, Bohan Zeng, Ziyu Guo, Ruichuan An, Zhou Liu, Qizhi Chen, Delin Qu, Jaehong Yoon, Wentao Zhang

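AI summary: This paper proposes VGGT-Edit, a feed-forward framework for text-conditioned native 3D scene editing. It introduces depth-synchronized text injection and a residual transformation head that directly predicts 3D geometric displacements. The paper also constructs the DeltaScene Dataset through an automated pipeline with 3D agreement filtering, and reports sharper object details, stronger multi-view consistency, and near-instant inference compared with 2D-lifting baselines.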

Abstract

High-quality 3D scene reconstruction has recently advanced toward generalizable feed-forward architectures, enabling the generation of complex environments in a single forward pass. However, despite their strong performance in static scene perception, these models remain limited in responding to dynamic human instructions, which restricts their use in interactive applications. Existing editing methods typically rely on a 2D-lifting strategy, where individual views are edited independently and then lifted back into 3D space. This indirect pipeline often leads to blurry textures and inconsistent geometry, as 2D editors lack the spatial awareness required to preserve structure across viewpoints. To address these limitations, we propose VGGT-Edit, a feed-forward framework for text-conditioned native 3D scene editing. VGGT-Edit introduces depth-synchronized text injection to align semantic guidance with the backbone's spatial poses, ensuring stable instruction grounding. This semantic signal is then processed by a residual transformation head, which directly predicts 3D geometric displacements to deform the scene while preserving background stability. To ensure high-fidelity results, we supervise the framework with a multi-term objective function that enforces geometric accuracy and cross-view consistency. We also construct the DeltaScene Dataset, a large-scale dataset generated through an automated pipeline with 3D agreement filtering to ensure ground-truth quality. Experiments show that VGGT-Edit substantially outperforms 2D-lifting baselines, producing sharper object details, stronger multi-view consistency, and near-instant inference speed. The project page is https://chriszkxxx.github.io/VGGT-Edit/.

2605.15113 2026-05-20 cs.LG

Learning from Language Feedback via Variational Policy Distillation

Yang Li, Erik Nijkamp, Semih Yavuz, Shafiq Joty

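AI summary: This paper proposes Variational Policy Distillation (VPD), which formalizes learning from language feedback as a variational expectation-maximization problem. This addresses the plateau in teacher capability that limits existing self-distillation methods, and VPD outperforms standard RLVR and existing self-distillation baselines on scientific reasoning and code generation tasks.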

Abstract

Reinforcement learning from verifiable rewards (RLVR) suffers from sparse outcome signals, creating severe exploration bottlenecks on complex reasoning tasks. Recent on-policy self-distillation methods attempt to address this by utilizing language feedback to generate dense, token-level supervision. However, these approaches rely on a fixed, passive teacher to interpret the feedback. As the student policy improves, the teacher's zero-shot assessment capabilities plateau, ultimately halting further learning. To overcome this, we propose Variational Policy Distillation (VPD), a framework that formalizes learning from language feedback as a Variational Expectation-Maximization (EM) problem. VPD co-evolves both policies: in the E-step, the teacher is actively refined on trajectory outcomes via an adaptive trust-region update, translating textual feedback into a dynamically improved target token distribution. In the M-step, the student internalizes this dense distributional guidance on its own on-policy rollouts. By continuously improving the teacher's ability to extract actionable signals from textual critique, VPD overcomes the limitations of passive distillation. Evaluated across diverse sources of diagnostic feedback on scientific reasoning and code generation tasks, VPD consistently outperforms both standard RLVR and existing self-distillation baselines. Finally, by stress-testing our framework on rigid mathematical reasoning and cold-start regimes, we illuminate the fundamental bounds of feedback-driven self-distillation compared to pure environment-driven RL.

2605.14678 2026-05-20 cs.AI

$π$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows

Haoran Zhang, Luxin Xu, Zhilin Wang, Runquan Gui, Shunkai Zhang, Haodi Lei, Zihao He, Bingsu He, Chicheng Qin, Tong Zhu, Xiaoye Qu, Yang Yang, Yu Cheng, Yafu Li

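AI summary: This paper introduces $π$-Bench, a benchmark for evaluating the proactive assistance of personal assistant agents in long-horizon workflows. Using 100 multi-turn tasks across 5 domain-specific user personas, it tests whether agents can identify and act on hidden intents before users state them, and it shows that proactive assistance remains challenging and that prior interaction helps in later tasks.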

Comments 44 pages

Abstract

The rise of personal assistant agents, e.g., OpenClaw, highlights the growing potential of large language models to support users across everyday life and work. A core challenge in these settings is proactive assistance, since users often begin with underspecified requests and leave important needs, constraints, or preferences unstated. However, existing benchmarks rarely evaluate whether agents can identify and act on such hidden intents before they are explicitly stated, especially in sustained multi-turn interactions where user needs emerge gradually. To address this gap, we introduce $π$-Bench, a benchmark for proactive assistance comprising 100 multi-turn tasks across 5 domain-specific user personas. By incorporating hidden user intents, inter-task dependencies, and cross-session continuity, $π$-Bench evaluates agents' ability to anticipate and address user needs over extended interactions, jointly measuring proactivity and task completion in long-horizon trajectories that better reflect real-world use. Experiments show (1) proactive assistance remains challenging, (2) a clear distinction between task completion and proactivity, and (3) the value of prior interaction for proactive intent resolution in later tasks.

2605.14530 2026-05-20 cs.CV

Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models

Sujung Hong, Chanyong Yoon, Seong Jae Hwang

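AI summary: This paper studies repetitive generation and degraded visual grounding in large diffusion vision-language models during long-form generation, and proposes a training-free approach that mitigates mask prior drift and positional attention collapse.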

Abstract

Large diffusion vision-language models (LDVLMs) have recently emerged as a promising alternative to autoregressive models, enabling parallel decoding for efficient inference and leveraging bidirectional attention for global context. Despite these advances, their behavior under long-form generation remains underexplored. In this work, we show that existing LDVLMs suffer from repetitive generation and degraded visual grounding, and identify two underlying causes. First, repetitive generation originates from a mask token prior: since generation tokens are initialized as mask tokens, their hidden representations progressively drift toward a shared prior direction over generation steps. Second, a fundamental misalignment between the positional attention bias and the iterative unmasking process suppresses attention toward informative visual tokens, degrading visual grounding. Based on these insights, we propose a training-free approach, introducing Mask Prior Suppression and Monotonic RoPE Scaling to mitigate mask prior drift and positional attention collapse during decoding. Experiments on general multimodal benchmarks and visual grounding tasks demonstrate improvements over baseline LDVLMs, with robust gains on long-form description benchmarks. Our results show that these failures can be effectively addressed with a lightweight, plug-and-play strategy that requires no additional training and generalizes across diverse LDVLM architectures.

2605.14102 2026-05-20 cs.AI

ChromaFlow: A Negative Ablation Study of Orchestration Overhead in Tool-Augmented Agent Evaluation

Tarun Mittal

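AI summary: Using the ChromaFlow framework, this study analyzes the effect of orchestration overhead in tool-augmented autonomous reasoning. It finds that more aggressive orchestration did not improve full-set performance and instead increased operational noise, and it argues that bounded planner escalation, deterministic extraction, evidence reconciliation, provider-health gates, and explicit run gates should be first-order requirements for reliable autonomous agent evaluation.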

Comments 12 pages, 6 tables, 1 figure. Updated with follow-up strict-provider full-Level-1 diagnostic

Abstract

Autonomous language-model agents increasingly combine planning, tool use, document processing, browsing, code execution, and verification loops. These capabilities make agent systems more useful, but they also introduce operational failure modes that are not visible from final accuracy alone. This report presents ChromaFlow, a tool-augmented autonomous reasoning framework built around planner-directed execution, specialized tool use, and telemetry-driven evaluation. We analyze ChromaFlow on GAIA 2023 Level-1 validation tasks under clean evaluation constraints. A frozen full Level-1 baseline achieved 29/53 correct answers, or 54.72%. A later recovery configuration with expanded orchestration achieved 27/53 correct answers, or 50.94%, while increasing tracebacks, timeout events, tool-failure mentions, token-log calls, and campaign-log cost estimates. Two randomized 20-task smoke evaluations produced 12/20 and 11/20 correct answers, showing that small diagnostic gains can be unstable across samples. The central result is therefore a negative ablation: more aggressive orchestration did not improve full-set performance and increased operational noise. A later strict-provider full-Level-1 diagnostic reached 30/53, or 56.60%, under explicit integrity controls, but at substantially higher token-log cost. The report argues that bounded planner escalation, deterministic extraction, evidence reconciliation, provider-health gates, and explicit run gates should be treated as first-order requirements for reliable autonomous agent evaluation.
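
The percentages quoted in this abstract follow directly from the reported correct/total counts. The short check below recomputes them; the run labels are shorthand chosen here for illustration, not names used by the report, and the two 20-task smoke results are included only to show how far small samples can sit from the full-set figures.

```python
# Correct/total counts as reported in the abstract.
# The dictionary keys are illustrative labels, not names from the report.
runs = {
    "frozen baseline": (29, 53),
    "expanded orchestration": (27, 53),
    "strict-provider diagnostic": (30, 53),
    "smoke sample A": (12, 20),
    "smoke sample B": (11, 20),
}


def accuracy_pct(correct: int, total: int) -> float:
    """Accuracy as a percentage, rounded to two decimal places."""
    return round(100.0 * correct / total, 2)


percentages = {name: accuracy_pct(c, t) for name, (c, t) in runs.items()}

for name, pct in percentages.items():
    print(f"{name}: {pct:.2f}%")
```

The full-set runs differ by only a few tasks out of 53, which is why the abstract treats the 20-task samples as unstable diagnostics and not as evidence of improvement.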

2605.14063 2026-05-20 cs.LG

Reliability-Gated Source Anchoring for Continual Test-Time Adaptation

Vikash Singh, Debargha Ganguly, Weicong Chen, Sabyasachi Sahoo, Sreehari Sankar, Biyao Zhang, Mohsen Hariri, Shouren Wang, Osama Zafar, Christian Gagné, Vipin Chaudhary

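AI summary: This study proposes RMemSafe, a reliability-gated source anchoring method for continual test-time adaptation (CTTA). It uses the frozen source's normalized predictive entropy to attenuate all explicit source-coupled terms, so that the source anchor and agreement filter switch off as source reliability degrades, lowering error on continual-corruption benchmarks.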

Abstract

Continual test-time adaptation (CTTA) updates a pretrained model online on an unlabeled, non-stationary stream while anchoring it to a frozen source checkpoint. This anchor is useful only when the source remains reliable. On CCC-Hard, however, a ResNet-50 source falls to approximately 1.3% top-1 accuracy, while existing source-anchored CTTA methods continue applying the same anchor strength. We call this failure mode blind anchoring and propose RMemSafe, a reliability-gated extension of ROID that uses the frozen source's normalized predictive entropy to attenuate all explicit source-coupled uses in the objective. When the source posterior approaches uniformity, the gate closes: the source anchor and agreement filter vanish, and the objective reduces to a source-agnostic fallback comprising ROID's base losses plus marginal calibration. Combined with ASR, RMemSafe achieves the lowest error on 8 of 9 matched-split continual-corruption cells and is the best reset-based method on all 9, improving ROID+ASR by 1.05 pp on ResNet-50 and 0.48 pp on ViT-B/16. A controlled source-degradation sweep shows a 1.13× shallower harm slope than ROID+ASR, consistent with the graceful-decay prediction. The entropy gate detects high-entropy source collapse, not confidently wrong low-entropy sources; this scope is explicitly evaluated and discussed.
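
The gating idea in this abstract can be illustrated with a small sketch. The abstract does not give the gate's functional form, so the linear gate below (one minus normalized entropy) and the way it multiplies a single anchor term are assumptions made for illustration; `reliability_gate` and `gated_objective` are hypothetical names, not the authors' code.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a categorical distribution divided by log(K), so it lies in [0, 1]."""
    k = len(probs)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(k)


def reliability_gate(probs):
    """Hypothetical linear gate: 1 for a one-hot source posterior, 0 for a uniform one."""
    return 1.0 - normalized_entropy(probs)


def gated_objective(base_loss, anchor_loss, source_probs):
    """Base loss plus the source-anchor term, attenuated by the gate (assumed form)."""
    return base_loss + reliability_gate(source_probs) * anchor_loss


uniform_gate = reliability_gate([0.25, 0.25, 0.25, 0.25])   # collapsed source
confident_gate = reliability_gate([1.0, 0.0, 0.0, 0.0])     # confident source
print("gate for uniform posterior:", uniform_gate)
print("gate for one-hot posterior:", confident_gate)
```

A one-hot posterior keeps the gate fully open whether or not the prediction is correct, which matches the scope stated in the abstract: the entropy gate reacts to high-entropy collapse, not to confidently wrong sources.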

2605.13652 2026-05-20 cs.LG cs.AI cs.CL

Beyond Perplexity: A Geometric and Spectral Study of Low-Rank Pre-Training

Namrata Shivagunde, Vijeta Deshpande, Sherin Muckatira, Anna Rumshisky

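AI summary: This paper studies low-rank pre-training methods through geometric and spectral analysis, showing that they differ from full-rank training, and from one another, in the solutions they reach. It also finds that validation perplexity does not reliably translate to downstream performance across model scales.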

Comments 9 pages, 5 figures, 2 tables

Abstract

Pre-training large language models is dominated by the memory cost of storing full-rank weights, gradients, and optimizer states. Low-rank pre-training has emerged to address this, and the space of methods has grown rapidly. A central question remains open: do low-rank methods produce models that generalize comparably to full-rank training, or does the rank constraint fundamentally alter the solutions reached? Existing comparisons rely almost entirely on validation perplexity from single-seed runs, often carried forward from prior literature. Yet perplexity is a poor proxy for solution quality; two methods can match on perplexity while converging to different loss landscape regions and internal representations. We close this gap by characterizing the solutions found by five low-rank pre-training methods, GaLore and Fira (memory-efficient optimizers), CoLA and SLTrain (architecture reparameterizations), and ReLoRA (adapter-style updates with periodic resets), against full-rank training at three model scales (60M, 130M, 350M). We evaluate each along 16 metrics across four dimensions: 1-D loss landscape along random/top-K PCA directions, 1-D interpolation between checkpoints, spectral structure of the weights and learned updates, and activation similarity to full-rank training. We show that low-rank methods are not equivalent to full-rank training, nor to one another, even when validation perplexity is close. Full-rank training settles into a sharper basin than low-rank methods along random directions, while the reverse holds for the top-1 PCA direction. Each method converges to a geometrically distinct basin. Low-rank activations diverge from full-rank in later layers as training progresses, with GaLore tracking full-rank most closely. Further, validation perplexity does not translate to downstream performance at every scale. Adding geometric and spectral metrics improves the prediction.
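
Two of the probes named in this abstract, a 1-D loss slice along a direction and a 1-D interpolation between checkpoints, are easy to picture on a toy problem. The sketch below uses a made-up quadratic loss over three parameters in place of a language model; the function names, curvatures, and checkpoints are all illustrative assumptions and say nothing about the paper's actual setup or metrics.

```python
import random


def toy_loss(theta):
    """Toy quadratic loss; the per-coordinate curvatures are made up for illustration."""
    curvatures = [1.0, 4.0, 0.25]
    return sum(c * t * t for c, t in zip(curvatures, theta))


def random_unit_direction(dim, rng):
    """Gaussian random direction scaled to unit Euclidean norm."""
    d = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = sum(x * x for x in d) ** 0.5
    return [x / norm for x in d]


def loss_slice(loss_fn, theta, direction, alphas):
    """Loss evaluated at theta + alpha * direction for each alpha."""
    return [loss_fn([t + a * d for t, d in zip(theta, direction)]) for a in alphas]


def interpolation_path(loss_fn, theta_a, theta_b, alphas):
    """Loss evaluated at (1 - alpha) * theta_a + alpha * theta_b for each alpha."""
    return [
        loss_fn([(1 - a) * x + a * y for x, y in zip(theta_a, theta_b)])
        for a in alphas
    ]


rng = random.Random(0)
direction = random_unit_direction(3, rng)
alphas = [-1.0, -0.5, 0.0, 0.5, 1.0]

# Slice through the minimum of the toy loss along a random direction.
slice_values = loss_slice(toy_loss, [0.0, 0.0, 0.0], direction, alphas)

# Straight-line path between two hypothetical checkpoints.
path_values = interpolation_path(
    toy_loss, [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.5, 1.0]
)
print("slice:", slice_values)
print("path:", path_values)
```

How quickly the slice rises away from alpha = 0 is one simple notion of basin sharpness, and a bump along the interpolation path would indicate a barrier between the two checkpoints.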

2605.13646 2026-05-20 cs.RO cs.AI

Causality-Aware End-to-End Autonomous Driving via Ego-Centric Joint Scene Modeling

Seokha Moon, Minseung Lee, Joon Seo, Jinkyu Kim, Jungbeom Lee

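AI summary: This paper proposes CaAD, a framework that captures causal dependencies between the ego vehicle and surrounding agents within a shared latent scene representation, improving closed-loop planning performance in end-to-end autonomous driving.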

Abstract

End-to-end autonomous driving, which bypasses traditional modular pipelines by directly predicting future trajectories from sensor inputs, has recently achieved substantial progress. However, existing methods often overlook the causal inter-dependencies in ego-vehicle planning, ignoring the reciprocal relations between the ego vehicle and surrounding agents. This causal oversight leads to inconsistent and unreliable trajectory predictions, especially in interaction-critical scenarios where ego decisions and neighboring agent behaviors must be reasoned about jointly. To address this limitation, we propose CaAD, a Causality-aware end-to-end Autonomous Driving framework that captures these dependencies within a shared latent scene representation. First, we propose an ego-centric joint-causal modeling module that builds on the marginal prediction branch, and learns causal dependencies between the ego vehicle and interaction-relevant agents. Second, we employ a causality-aware policy alignment stage implemented with joint-mode embeddings to align the stochastic ego policy with planning-oriented closed-loop feedback computed from surrounding traffic and map context. On the Bench2Drive and NAVSIM benchmarks, CaAD demonstrates strong closed-loop planning performance, achieving a Driving Score of 87.53 and Success Rate of 71.81 on Bench2Drive, and a PDMS of 91.1 on NAVSIM. The project page is available at https://moonseokha.github.io/CaAD/.

2605.12974 2026-05-20 cs.RO cs.SY eess.SY

Distributionally Robust Safety Under Arbitrary Uncertainties: A Safety Filtering Approach

Daniel M. Cherenson, Haejoon Lee, Taekyung Kim, Dimitra Panagou

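AI summary: This paper studies how to ensure probabilistic safety for nonlinear systems under distributional ambiguity. It builds on a backup-based safety filtering framework that switches between a high-performance nominal policy and a certified backup policy, adopts a distributionally robust formulation to handle arbitrary uncertainties, and certifies safety with a sampling-based procedure; the method is validated in simulation on three systems.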

Comments 10 pages, 4 figures, submitted to IEEE Robotics and Automation Letters (RA-L); Project Page: https://dcherenson.github.io/drs-gk

Abstract

In this work, we study how to ensure probabilistic safety for nonlinear systems under distributional ambiguity. Our approach builds on a backup-based safety filtering framework that switches between a high-performance nominal policy and a certified backup policy to ensure safety. To handle arbitrary uncertainties from ambiguous distributions, i.e., where the distribution is not of specific structure and the true distribution is unknown, we adopt a distributionally robust (DR) formulation using Wasserstein ambiguity sets. Rather than solving a high-dimensional DR trajectory optimization problem online, we exploit the structure of backup-based safety filtering to reduce safety certification to a one-dimensional search over the switching time between nominal and backup policies. We then develop a sampling-based certification procedure with finite-sample guarantees, where empirical failure probabilities are compared against a Wasserstein-inflated threshold. We validate our method through simulations across three systems, from a Dubins vehicle to a high-speed racing car and a fighter jet, demonstrating the broad applicability and computational efficiency.
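
The reduction described in this abstract, from trajectory optimization to a one-dimensional search over the switching time, can be sketched on a toy system. Everything concrete below is an assumption for illustration: the 1-D dynamics, the Gaussian disturbance, the simple latest-to-earliest scan, and the `threshold` argument, which merely stands in for the Wasserstein-inflated threshold whose formula the abstract does not give.

```python
import random

DT = 0.1       # step length (toy value)
HORIZON = 10   # number of simulation steps (toy value)
WALL = 1.0     # the rollout is unsafe once position exceeds this (toy value)


def rollout_fails(switch_step, w):
    """Nominal policy drives at speed 1 + w until switch_step; the backup policy then stops."""
    x = 0.0
    for step in range(HORIZON):
        speed = (1.0 + w) if step < switch_step else 0.0
        x += DT * speed
        if x > WALL:
            return True
    return False


def empirical_failure(switch_step, samples):
    """Fraction of sampled disturbances for which the rollout becomes unsafe."""
    return sum(rollout_fails(switch_step, w) for w in samples) / len(samples)


def latest_certified_switch(samples, threshold):
    """Scan switching steps from latest to earliest; return the first one that passes the check."""
    for switch_step in range(HORIZON, -1, -1):
        if empirical_failure(switch_step, samples) <= threshold:
            return switch_step
    return None


rng = random.Random(0)
samples = [rng.gauss(0.0, 0.1) for _ in range(2000)]  # hypothetical disturbance samples
best = latest_certified_switch(samples, threshold=0.05)
print("latest certified switching step:", best)
```

Switching later keeps the nominal policy in control for longer, so the search looks for the latest switching time whose empirical failure rate still clears the threshold; switching immediately (step 0) is always safe in this toy model.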

2605.11021 2026-05-20 cs.LG

A Switching System Theory of Q-Learning with Linear Function Approximation

Donghwan Lee, Han-Dong Lim

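AI summary: Building on joint spectral radius (JSR) theory, this paper gives a switching-system interpretation of Q-learning with linear function approximation. It derives an exact linear switching model, ties convergence to the stability of the corresponding switching system, extends the analysis to stochastic linear Q-learning with i.i.d. and Markovian observations, and offers a JSR-based view of regularized Q-learning.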

2605.10525 2026-05-20 cs.CV

GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth

Yuecheng Liu, Junda Cheng, Longliang Liu, Wenjing Liao, Hanrui Cheng, Yuzhou Wang, Xin Yang

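AI summary: This paper proposes GemDepth, a framework that introduces a Geometry-Embedding Module and an Alternating Spatio-Temporal Transformer to address spatial blurring and temporal inconsistency in video depth estimation, achieving sharp detail and robust 3D consistency.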

Abstract

Video depth estimation extends monocular prediction into the temporal domain to ensure coherence. However, existing methods often suffer from spatial blurring in fine-detail regions and temporal inconsistencies. We argue that current approaches, which primarily rely on temporal smoothing via Transformers, struggle to maintain strict 3D geometric consistency, particularly under rotations or drastic view changes. To address this, we propose GemDepth, a framework built on the insight that an explicit awareness of camera motion and global 3D structure is a prerequisite for 3D consistency. Distinctively, GemDepth introduces a Geometry-Embedding Module (GEM) that predicts inter-frame camera poses to generate implicit geometric embeddings. This injection of motion priors equips the network with intrinsic 3D perception and alignment capabilities. Guided by these geometric cues, our Alternating Spatio-Temporal Transformer (ASTT) captures latent point-level correspondences to simultaneously enhance spatial precision for sharp details and enforce rigorous temporal consistency. Furthermore, GemDepth employs a data-efficient training strategy, effectively bridging the gap between high efficiency and robust geometric consistency. As shown in Fig. 2, comprehensive evaluations demonstrate that GemDepth achieves state-of-the-art performance across multiple datasets, particularly in complex dynamic scenarios. The code is publicly available at: https://github.com/Yuecheng919/GemDepth.

2605.10344 2026-05-20 cs.AI

TMAS: Scaling Test-Time Compute via Multi-Agent Synergy

George Wu, Nan Jing, Qing Yi, Chuan Hao, Ming Yang, Feng Chang, Yuan Wei, Jian Yang, Ran Tao, Bryan Dai

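AI summary: This paper proposes TMAS, a framework that scales test-time compute through multi-agent synergy, using hierarchical memories and hybrid-reward reinforcement learning to improve reasoning and balance exploration with exploitation.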

Abstract

Test-time scaling has become an effective paradigm for improving the reasoning ability of large language models by allocating additional computation during inference. Recent structured approaches have further advanced this paradigm by organizing inference across multiple trajectories, refinement rounds, and verification-based feedback. However, existing structured test-time scaling methods either weakly coordinate parallel reasoning trajectories or rely on noisy historical information without explicitly deciding what should be retained and reused, limiting their ability to balance exploration and exploitation. In this work, we propose TMAS, a framework for scaling test-time compute via multi-agent synergy. TMAS organizes inference as a collaborative process among specialized agents, enabling structured information flow across agents, trajectories, and refinement iterations. To support effective cross-trajectory collaboration, TMAS introduces hierarchical memories: the experience bank reuses low-level reliable intermediate conclusions and local feedback, while the guideline bank records previously explored high-level strategies to steer subsequent rollouts away from redundant reasoning patterns. Furthermore, we design a hybrid reward reinforcement learning scheme tailored to TMAS, which jointly preserves basic reasoning capability, enhances experience utilization, and encourages exploration beyond previously attempted solution strategies. Extensive experiments on challenging reasoning benchmarks show that TMAS achieves stronger iterative scaling than existing test-time scaling baselines, with hybrid reward training further improving scaling effectiveness and stability across iterations. Code and data are available at https://github.com/IQuestLab/tmas.


2605.08879 2026-05-20 cs.RO

Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT

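Preserving Foundational Capabilities in Flow-Matching Vision-Language-Action Models through Conservative Supervised Fine-Tuning

Tianyi Zhang, Shaopeng Zhai, Haoran Zhang, Fuxian Huang, Qi Zhang

AI summary: This paper proposes Conservative Supervised Fine-Tuning (ConSFT), which dynamically adjusts the learning signal to reduce the damage that fine-tuning does to the pre-trained capabilities of flow-matching vision-language-action models, improving adaptation to the target distribution and capability retention without relying on prior data or architectural overhead.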

Comments 20 pages, 9 figures

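AI Chinese abstract (translated into English)

Unconstrained fine-tuning of flow-matching Vision-Language-Action (VLA) models causes dense parameter overwrites, which degrade pre-trained capabilities. We propose Conservative Supervised Fine-Tuning (ConSFT), an optimization objective that adapts to the target distribution while mitigating catastrophic forgetting, with no need for prior data or architectural overhead. By dynamically adjusting the learning signal according to model confidence, ConSFT suppresses excessive gradients from low-confidence samples, preventing disproportionate parameter updates and thereby bounding the intrinsic risk of parameter disruption. Inspired by trust-region clipping in reinforcement learning, this formulation establishes a progressive learning dynamic that secures both convergence on the target and retention of prior capabilities, keeping parameter updates sparse without the parallel reference network that explicit regularization requires. We evaluate ConSFT on the LIBERO and RoboTwin benchmarks with state-of-the-art flow-matching VLAs (π₀, π₀.₅, and GR00T-N1.6-3B). The method outperforms vanilla SFT in capability retention by an average absolute margin of more than 20%, and in a prior-data-free setting it matches the efficacy of data-heavy experience replay. Real-world robot deployments confirm that ConSFT prevents spatial overfitting during downstream adaptation, preserving pre-trained physical skills while acquiring sequential target tasks.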

Original English abstract

Unconstrained fine-tuning of flow-matching Vision-Language-Action (VLA) models drives dense parameter overwrites, degrading pre-trained capabilities. We present Conservative Supervised Fine-Tuning (ConSFT), an optimization objective that adapts to target distributions while mitigating catastrophic forgetting, requiring zero prior data or architectural overhead. By dynamically scaling learning signals based on model confidence, ConSFT suppresses excessive gradients from low-confidence samples to prevent disproportionate parameter updates, thereby bounding the intrinsic parameter disruption risk. Inspired by reinforcement learning's trust-region clipping, this formulation establishes a progressive learning dynamic to secure target convergence and prior capability retention, maintaining sparse parameter updates without relying on the parallel reference networks required by explicit regularization. We evaluate ConSFT on the LIBERO and RoboTwin benchmarks across state-of-the-art flow-matching VLAs ($π_0$, $π_{0.5}$, and GR00T-N1.6-3B). The method outperforms vanilla SFT in capability retention by an average absolute margin of over 20%, matching the efficacy of data-heavy Experience Replay in a prior-data-free regime. Real-world robotic deployments confirm that ConSFT precludes spatial overfitting during downstream adaptation, preserving pre-trained physical skills while acquiring sequential target tasks.


2605.08830 2026-05-20 cs.CV cs.AI cs.RO

VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving

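VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving

Rui Zhao, Jianlin Yu, Zhenhai Gao, Jiaqiao Liu, Fei Gao

AI summary: This paper proposes VECTOR-DRIVE, a framework that uses tightly coupled vision-language and trajectory expert routing to address the coupling trade-off between vision-language understanding and trajectory prediction in end-to-end autonomous driving, achieving higher task performance.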

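AI Chinese abstract (translated into English)

End-to-end autonomous driving requires models to understand traffic scenes, infer driving intent, and generate executable motion plans. Recent vision-language-action (VLA) models inherit semantic priors from large-scale vision-language pretraining, yet they still face a coupling trade-off: fully shared backbones preserve multimodal interaction but may entangle language reasoning and trajectory prediction, whereas decoupled reasoning-action pipelines reduce task conflict but weaken semantic-motion coupling. We propose VECTOR-DRIVE, a tightly coupled VLA framework built on Qwen2.5-VL-3B. VECTOR-DRIVE keeps all tokens coupled through shared self-attention and routes feed-forward computation according to token semantics. Vision and language tokens are processed by a Vision-Language Expert to preserve semantic priors, while target-point, ego-state, and noisy action tokens are routed to a Trajectory Expert for motion-specific computation. On the action-token pathway, a flow-matching planner refines noisy action tokens into future waypoints and speed profiles. This design couples semantic reasoning and motion planning within a single multimodal Transformer while separating task-specific FFN computation. On Bench2Drive, VECTOR-DRIVE achieves a Driving Score of 88.91 and outperforms representative end-to-end and VLA baselines. Qualitative results and ablations further validate the benefits of shared attention, semantic-aware expert routing, progressive training, and flow-based action decoding.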

Original English abstract

End-to-end autonomous driving requires models to understand traffic scenes, infer driving intent, and generate executable motion plans. Recent vision-language-action (VLA) models inherit semantic priors from large-scale vision-language pretraining, yet still face a coupling trade-off: fully shared backbones preserve multimodal interaction but may entangle language reasoning and trajectory prediction, whereas decoupled reasoning-action pipelines reduce task conflict but weaken semantic-motion coupling. We propose VECTOR-DRIVE, a tightly coupled VLA framework built on Qwen2.5-VL-3B. VECTOR-DRIVE keeps all tokens coupled through shared self-attention and routes feed-forward computation according to token semantics. Vision and language tokens are processed by a Vision-Language Expert to preserve semantic priors, while target-point, ego-state, and noisy action tokens are routed to a Trajectory Expert for motion-specific computation. On the action-token pathway, a flow-matching planner refines noisy action tokens into future waypoints and speed profiles. This design couples semantic reasoning and motion planning within a single multimodal Transformer while separating task-specific FFN computation. On Bench2Drive, VECTOR-DRIVE achieves a Driving Score of 88.91 and outperforms representative end-to-end and VLA-based baselines. Qualitative results and ablations further validate the benefits of shared attention, semantic-aware expert routing, progressive training, and flow-based action decoding.


2605.08696 2026-05-20 cs.CL cs.LG

Structured Recurrent Mixers for Massively Parallelized Sequence Generation

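Structured Recurrent Mixers for Massively Parallel Sequence Generation

Benjamin L. Badger

AI summary: This paper proposes the Structured Recurrent Mixer architecture, which converts algebraically between a sequence-parallel representation at training time and a recurrent representation at inference time, improving training efficiency, input information capacity, and inference throughput without specialized kernels or device-specific memory management.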

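AI Chinese abstract (translated into English)

Over the past two decades, language modeling has shifted from predominantly recurrent architectures, which process tokens sequentially during both training and inference, to non-recurrent models, which process sequence elements in parallel during training. The latter improve training efficiency and stability at the cost of lower inference throughput. This paper introduces the Structured Recurrent Mixer (SRM), an architecture that allows algebraic conversion between a sequence-parallel representation at training time and a recurrent representation at inference time, notably without specialized kernels or device-specific memory management. We show experimentally that, compared with other linear-complexity models, this dual representation offers advantages in training efficiency, input information capacity, and inference throughput and concurrency. We postulate that recurrent models are poorly suited to scaling sequence length for information-rich inputs such as language, but are well suited to scaling along the sample (batch) dimension because their memory requirement per sample is constant. We provide Mojo/MAX inference implementations of SRMs with 12x the throughput and 170x the concurrency of similarly capable Transformers run on vLLM, increases that are characteristic of the PyTorch implementations and result in a 30% increase in compute-constant GSM8k Pass@k. Finally, we demonstrate that SRMs are effective candidates for reinforcement learning training.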

Original English abstract

Over the last two decades, language modeling has experienced a shift from the use of predominantly recurrent architectures that process tokens sequentially during training and inference to non-recurrent models that process sequence elements in parallel during training, which results in greater training efficiency and stability at the expense of lower inference throughput. Here we introduce the Structured Recurrent Mixer, an architecture that allows for algebraic conversion between a sequence parallel representation at train time and a recurrent representation at inference, notably without the need for specialized kernels or device-specific memory management. We show experimentally that this dual representation allows for greater training efficiency, higher input information capacity, and larger inference throughput and concurrency when compared to other linear complexity models. We postulate that recurrent models are poorly suited to extended sequence length scaling for information-rich inputs typical of language, but are well suited to scaling in the sample (batch) dimension due to their constant memory per sample. We provide Mojo/MAX inference implementations of SRMs exhibiting 12x the throughput and 170x the concurrency of similarly powerful Transformers run on vLLM, increases characteristic of PyTorch implementations resulting in a 30% increase in compute-constant GSM8k Pass@k. We conclude by demonstrating that SRMs are effective reinforcement learning training candidates.


2605.08391 2026-05-20 cs.LG

SACHI: Structured Agent Coordination via Holistic Information Integration in Multi-Agent Reinforcement Learning

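SACHI: Structured Agent Coordination via Holistic Information Integration in Multi-Agent Reinforcement Learning

Nikunj Gupta, James Zachary Hare, Jesse Milzman, Rajgopal Kannan, Viktor Prasanna

AI summary: This paper proposes SACHI, which achieves structured agent coordination in multi-agent reinforcement learning through holistic information integration. It addresses the information bottleneck that agents face when coordinating actions from partial local observations by enriching each agent's representation with graph transformer convolutions over an inter-agent coordination graph, and it performs well across multiple tasks.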

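AI Chinese abstract (translated into English)

In cooperative multi-agent reinforcement learning, agents that act on partial local observations face a fundamental information bottleneck: the knowledge needed to select jointly optimal actions is scattered across the team, yet each agent must decide without access to its teammates' observations, intentions, or chosen actions. Existing methods either ignore this bottleneck, compress it into a scalar mixing signal, or route around it with learned communication channels. Treating action coordination as a problem of structured information integration among agents, we propose Structured Agent Coordination via Holistic Information Integration (SACHI), in which graph transformer convolutions over an inter-agent coordination graph enrich each agent's representation with receiver-sensitive, content-dependent signals from teammates before action selection. We evaluate SACHI on five cooperative tasks covering spatial, communicative, and adversarial coordination challenges, comparing against twelve baselines. SACHI matches or outperforms the best baseline on every task, and rigorous aggregate statistical analyses, including normalized metrics with bootstrap confidence intervals, Friedman ranking, and performance profiling, confirm that this advantage is statistically significant, robust across environments, and not attributable to increased model capacity. Parameter-matched ablations further trace the gains to a single architectural property: the degree of content dependence in the message-passing operator.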

Original English abstract

Cooperative multi-agent reinforcement learning agents that act on partial local observations face a fundamental information bottleneck: the knowledge needed to select jointly optimal actions is scattered across the team, yet each agent must commit to a decision without access to its teammates' observations, intentions, or chosen actions. Existing methods either ignore this bottleneck, compress it into a scalar mixing signal, or route around it with learned communication channels. Framing action coordination as a problem of structured information integration among agents, we propose structured agent coordination via holistic information integration, or SACHI, in which graph transformer convolutions over an inter-agent coordination graph enrich each agent's representation with receiver-sensitive, content-dependent signals from teammates prior to action selection. We evaluate SACHI across five cooperative tasks spanning spatial, communicative, and adversarial coordination challenges against twelve baselines. SACHI consistently matches or outperforms the best baseline on every task, and rigorous aggregate statistical analyses, including normalized metrics with bootstrap confidence intervals, Friedman ranking, and performance profiling, confirm that this advantage is statistically significant, robust across environments, and not attributable to increased model capacity. Parameter-matched ablations further trace the source of the gains to a single architectural property: the degree of content-dependence in the message-passing operator.


2605.08143 2026-05-20 cs.LG cs.AI

HoReN: Normalized Hopfield Retrieval for Large-Scale Sequential Model Editing

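HoReN: Normalized Hopfield Retrieval for Large-Scale Sequential Model Editing

Yuan Fang, Yi Xie, Xuming Ran

AI summary: This paper proposes HoReN, a codebook-based parameter-preserving editor that wraps a single MLP layer with a discrete key-value memory, enabling efficient retrieval and updates in large-scale sequential model editing and performing well across a range of benchmarks.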

Comments 30 pages, 10 figures

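AI Chinese abstract (translated into English)

Large language models encode a large amount of factual knowledge, but after deployment this knowledge can become outdated or incorrect, and retraining is prohibitively expensive. This motivates lifelong model editing, which aims to update specific behaviors while preserving the rest of the model. Existing editors, whether parameter-modifying or parameter-preserving, degrade severely as edits accumulate and struggle to generalize across paraphrases. We propose HoReN, a codebook-based parameter-preserving editor that wraps a single MLP layer with a discrete key-value memory. HoReN treats each codebook entry as both a knowledge key and a Hopfield stored pattern, retrieves edits by angular similarity on the unit hypersphere, and refines queries through damped Hopfield dynamics so that paraphrases converge to the correct memory basin while unrelated inputs remain stable. HoReN delivers strong editing performance across diverse benchmarks, including the standard ZsRE, the structured WikiBigEdit, and the unstructured UnKE evaluations. Moreover, HoReN scales to 50,000 sequential edits on ZsRE with overall performance consistently above 0.93, whereas prior editors collapse or degrade severely before reaching 10,000 edits. Our code is available at https://github.com/ha11ucin8/HoReN.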

Original English abstract

Large language models encode vast factual knowledge that can become outdated or incorrect after deployment, yet retraining is prohibitively costly. This motivates lifelong model editing, which updates targeted behavior while preserving the rest of the model. Existing editors, both parameter-modifying and parameter-preserving, degrade severely as edits accumulate and struggle to generalize across paraphrases. We propose HoReN, a codebook-based parameter-preserving editor that wraps a single MLP layer with a discrete key-value memory. HoReN treats each codebook entry as both a knowledge key and a Hopfield stored pattern, retrieves edits by angular similarity on the unit hypersphere, and refines queries through damped Hopfield dynamics so paraphrases converge to the correct memory basin while unrelated inputs remain stable. HoReN achieves strong editing performance with consistent gains across diverse benchmarks spanning standard ZsRE, structured WikiBigEdit, and unstructured UnKE evaluations. Moreover, HoReN scales to 50K sequential edits on ZsRE with stable overall performance above 0.93, while prior editors collapse or degrade severely before reaching 10K. Our code is available at https://github.com/ha11ucin8/HoReN.
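
The retrieval idea in this abstract (cosine similarity on the unit hypersphere, then a damped Hopfield-style refinement of the query) can be illustrated with a toy sketch. The update rule, the parameter names `beta` and `alpha`, and the `min_sim` gate for unrelated inputs are all illustrative assumptions; the abstract does not specify them, and this is not the authors' implementation.

```python
import math

def normalize(v):
    """Project a vector onto the unit hypersphere."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    """Cosine similarity of two unit vectors (their dot product)."""
    return sum(x * y for x, y in zip(a, b))

def hopfield_refine(query, keys, beta=8.0, alpha=0.5, steps=10):
    """Damped Hopfield-style refinement (hypothetical update rule).

    Each step moves the query a fraction `alpha` toward the
    softmax-weighted mix of stored keys, then renormalizes.
    """
    q = normalize(query)
    for _ in range(steps):
        sims = [cosine(q, k) for k in keys]
        top = max(sims)
        w = [math.exp(beta * (s - top)) for s in sims]
        z = sum(w)
        target = [sum(wi / z * k[d] for wi, k in zip(w, keys))
                  for d in range(len(q))]
        q = normalize([(1 - alpha) * qi + alpha * ti
                       for qi, ti in zip(q, target)])
    return q

def retrieve(query, keys, min_sim=0.5):
    """Return the index of the matching key, or None for unrelated inputs.

    The `min_sim` gate is an assumed stand-in for "unrelated inputs
    remain stable"; the real mechanism may differ.
    """
    q0 = normalize(query)
    if max(cosine(q0, k) for k in keys) < min_sim:
        return None
    q = hopfield_refine(query, keys)
    sims = [cosine(q, k) for k in keys]
    return max(range(len(keys)), key=sims.__getitem__)

# Two stored unit-norm keys; a "paraphrase" lies between them but nearer key 0.
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
paraphrase = [0.8, 0.6, 0.0]
unrelated = [0.0, 0.0, 1.0]
```

In this toy setup the paraphrase query drifts toward the first key over the refinement steps, while the orthogonal query is rejected before any refinement happens.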


2605.07721 2026-05-20 cs.CL cs.AI cs.LG

Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models

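Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models

Victor Conchello Vendrell, Arnau Padres Masdemont, Niccolò Grillo, Jordi Ros-Giralt, Arash Behboodi, Fabio Valerio Massoli

AI summary: This paper proposes the Memory-Efficient Looped Transformer (MELT), which decouples reasoning depth from memory consumption to achieve constant-memory iterative reasoning while preserving LoopLM performance, requiring only a lightweight post-training procedure.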

Comments 22 pages, 5 figures, 11 tables

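AI Chinese abstract (translated into English)

Recurrent large language model (LLM) architectures have emerged as a promising way to improve reasoning, because they allow multi-step computation in the embedding space without generating intermediate tokens. Models such as Ouro reason by iteratively updating internal representations while keeping a standard key-value (KV) cache for every iteration, so memory consumption grows linearly with reasoning depth. As a result, increasing the number of reasoning iterations can make memory usage prohibitive, limiting the practical scalability of such architectures. In this work, we propose the Memory-Efficient Looped Transformer (MELT), a novel architecture that decouples reasoning depth from memory consumption. Instead of using a standard KV cache per layer and per loop, MELT maintains a single KV cache per layer that is shared across reasoning loops. This cache is updated over time through a learnable gating mechanism. To enable stable and efficient training under this architecture, we propose training MELT with chunk-wise training in a two-phase procedure: an interpolated transition followed by attention-aligned distillation, both going from the LoopLM starting model to MELT. Empirically, we show that MELT models fine-tuned from pretrained Ouro parameters outperform standard LLMs of comparable size, while keeping a memory footprint comparable to those models and dramatically smaller than Ouro's. Overall, MELT achieves constant-memory iterative reasoning without sacrificing LoopLM performance, using only a lightweight post-training procedure.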

Original English abstract

Recurrent LLM architectures have emerged as a promising approach for improving reasoning, as they enable multi-step computation in the embedding space without generating intermediate tokens. Models such as Ouro perform reasoning by iteratively updating internal representations while retaining a standard Key-Value (KV) cache across iterations, causing memory consumption to grow linearly with reasoning depth. Consequently, increasing the number of reasoning iterations can lead to prohibitive memory usage, limiting the practical scalability of such architectures. In this work, we propose Memory-Efficient Looped Transformer (MELT), a novel architecture that decouples reasoning depth from memory consumption. Instead of using a standard KV cache per layer and loop, MELT maintains a single KV cache per layer that is shared across reasoning loops. This cache is updated over time via a learnable gating mechanism. To enable stable and efficient training under this architecture, we propose to train MELT using chunk-wise training in a two-phase procedure: interpolated transition, followed by attention-aligned distillation, both from the LoopLM starting model to MELT. Empirically, we show that MELT models fine-tuned from pretrained Ouro parameters outperform standard LLMs of comparable size, while maintaining a memory footprint comparable to those models and dramatically smaller than Ouro's. Overall, MELT achieves constant-memory iterative reasoning without sacrificing LoopLM performance, using only a lightweight post-training procedure.


2605.07379 2026-05-20 cs.CV cs.AI

RELO: Reinforcement Learning to Localize for Visual Object Tracking

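RELO: Reinforcement Learning to Localize for Visual Object Tracking

Xin Chen, Chuanyu Sun, Jiao Xu, Houwen Peng, Dong Wang, Huchuan Lu, Kede Ma

AI summary: This paper proposes RELO, which formulates target localization as a Markov decision process and uses reinforcement learning in place of conventional handcrafted spatial priors to improve tracking performance and consistency.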

Comments ICML 2026 paper

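AI Chinese abstract (translated into English)

Conventional visual object trackers usually localize targets with handcrafted spatial priors such as heatmaps, but these priors provide only surrogate supervision and are poorly aligned with tracking optimization and evaluation metrics such as intersection over union (IoU) and area under the success curve (AUC). This paper introduces RELO, a reinforcement-learning-to-localize method for visual object tracking that formulates target localization as a Markov decision process. Specifically, RELO replaces handcrafted spatial priors with a localization policy over spatial positions learned by reinforcement learning, with rewards that combine frame-level IoU and sequence-level AUC. We also introduce layer-aligned temporal token propagation to improve semantic consistency across frames at negligible computational cost. Across multiple benchmarks, RELO achieves superior performance, reaching 57.5% AUC on LaSOText without template updates. This confirms that reward-driven localization is an effective alternative to prior-driven localization for visual object tracking.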

Original English abstract

Conventional visual object trackers localize targets using handcrafted spatial priors, often in the form of heatmaps. Such priors provide only surrogate supervision and are poorly aligned with tracking optimization and evaluation metrics, such as intersection over union (IoU) and area under the success curve (AUC). Here, we introduce RELO, a REinforcement-learning-to-LOcalize method for visual object tracking that formulates target localization as a Markov decision process. Specifically, RELO replaces handcrafted spatial priors with a localization policy learned over spatial positions via reinforcement learning, with rewards combining frame-level IoU and sequence-level AUC. We additionally introduce layer-aligned temporal token propagation to improve semantic consistency across frames, with negligible computational overhead. Across multiple benchmarks, RELO achieves superior results, attaining 57.5% AUC on LaSOText without template updates. This confirms that reward-driven localization provides an effective alternative to prior-driven localization for visual object tracking.


2605.07066 2026-05-20 cs.AI

2.5-D Decomposition for LLM-Based Spatial Construction

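LLM-Based Spatial Construction with 2.5-D Decomposition

Paul Whitten, Li-Jen Chen, Sharath Baddam

AI summary: This paper proposes a neuro-symbolic pipeline based on 2.5-D decomposition, in which the LLM plans in the two-dimensional horizontal plane while a deterministic executor computes vertical placement, eliminating an entire class of errors and improving the accuracy of spatial construction.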

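AI Chinese abstract (translated into English)

Autonomous systems need reliable spatial reasoning to build structures from natural-language instructions, but large language models (LLMs) make systematic coordinate errors when generating three-dimensional block placements. This paper presents a neuro-symbolic pipeline based on 2.5-D decomposition: the LLM plans in the two-dimensional horizontal plane while a deterministic executor computes all vertical placement from column occupancy, eliminating an entire class of errors. On the Build What I Mean benchmark (160 rounds), GPT-4o-mini with this pipeline achieves 94.6% mean structural accuracy across 12 independent runs, within 3.0 percentage points of the 97.6% ceiling set by architect-agent errors, and outperforms both GPT-4o (90.3%) and the best competing system (76.3%). A controlled ablation confirms that 2.5-D decomposition is the dominant contributor, accounting for 50.7 percentage points of accuracy. The pipeline transfers directly to edge hardware: Nemotron-3 120B running locally on an NVIDIA Jetson Thor AGX matches the cloud result at 94.5% with no prompt modifications. The underlying principle, removing deterministic dimensions from the LLM's output space, applies to any autonomous construction or assembly task in which gravity or other physical constraints fix one or more degrees of freedom. A transfer experiment on 500 IGLU collaborative building tasks confirms that the effect generalizes beyond the primary benchmark.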

Original English abstract

Autonomous systems that build structures from natural-language instructions need reliable spatial reasoning, yet large language models (LLMs) make systematic coordinate errors when generating three-dimensional block placements. We present a neuro-symbolic pipeline based on 2.5-D decomposition: the LLM plans in the two-dimensional horizontal plane while a deterministic executor computes all vertical placement from column occupancy, eliminating an entire class of errors. On the Build What I Mean benchmark (160 rounds), GPT-4o-mini with this pipeline achieves 94.6% mean structural accuracy across 12 independent runs, within 3.0 percentage points of the 97.6% ceiling imposed by architect-agent errors that no builder-side improvement can address. This outperforms both GPT-4o at 90.3% and the best competing system at 76.3%. A controlled ablation confirms that 2.5-D decomposition is the dominant contributor, accounting for 50.7 percentage points of accuracy. The pipeline transfers directly to edge hardware: Nemotron-3 120B running locally on an NVIDIA Jetson Thor AGX matches the cloud result at 94.5% with no prompt modifications. The underlying principle, removing deterministic dimensions from the LLM's output space, applies to any autonomous construction or assembly task where gravity or other physical constraints fix one or more degrees of freedom. A transfer experiment on 500 IGLU collaborative building tasks confirms the effect generalizes beyond the primary benchmark.
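
The division of labor described in this abstract can be sketched in a few lines: the planner emits only horizontal coordinates, and a deterministic executor derives the height from how many blocks already sit in that column. The tuple layout `(x, z, color)`, the use of `y` as the vertical axis, and the function name `execute_plan` are illustrative assumptions, not the paper's interface.

```python
from collections import defaultdict

def execute_plan(plan_2d):
    """Place blocks from a 2-D plan; the height comes from column occupancy.

    `plan_2d` is a list of (x, z, color) steps, standing in for what an
    LLM planner might emit. The planner never chooses the vertical
    coordinate, so it cannot get it wrong.
    """
    column_height = defaultdict(int)   # (x, z) -> blocks already stacked
    placements = []
    for x, z, color in plan_2d:
        y = column_height[(x, z)]      # next free level in this column
        placements.append((x, y, z, color))
        column_height[(x, z)] += 1
    return placements

# Hypothetical plan: three blocks stacked at (0, 0), one block at (1, 0).
plan = [(0, 0, "red"), (0, 0, "blue"), (1, 0, "red"), (0, 0, "green")]
blocks = execute_plan(plan)
```

Because the vertical coordinate is computed rather than generated, floating or overlapping blocks cannot arise from the planner's output in this sketch.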


2605.06546 2026-05-20 cs.CL

Efficient Pre-Training with Token Superposition

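Efficient Pre-Training with Token Superposition

Bowen Peng, Théo Gigant, Jeffrey Quesnelle

AI summary: This paper proposes Token-Superposition Training (TST), a method that raises data throughput during pre-training without modifying the model architecture, yielding higher efficiency and performance in large-scale pre-training.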

Comments 25 pages, 11 figures, 28 tables

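AI Chinese abstract (translated into English)

Pre-training large language models is often prohibitively expensive and inefficient at scale, and achieving high data throughput requires complex, invasive modifications. In this work, we propose Token-Superposition Training (TST), a simple drop-in method that significantly improves data throughput per FLOP during pre-training without modifying the parallelism, optimizer, tokenizer, data, or model architecture. TST has two phases: (i) a highly efficient superposition phase, in which we merge many contiguous tokens into one bag and train with a multi-hot cross-entropy (MCE) objective; and (ii) a recovery phase, in which we return to standard training. We evaluate TST extensively at the 270M and 600M parameter scales and validate it on 3B and 10B A1B mixture-of-experts models, showing that it is highly robust across settings. Ultimately, TST outperforms the baseline in both loss and downstream evaluations, and under equal-loss settings it reduces total pre-training time by up to 2.5x at the 10B A1B scale.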

Original English abstract

Pre-training of Large Language Models is often prohibitively expensive and inefficient at scale, requiring complex and invasive modifications in order to achieve high data throughput. In this work, we present Token-Superposition Training (TST), a simple drop-in method that significantly improves the data throughput per FLOP during pre-training without modifying the parallelism, optimizer, tokenizer, data, or model architecture. TST is done in two phases: (i) a highly efficient superposition phase where we combine many contiguous tokens into one bag and train using a multi-hot cross-entropy (MCE) objective, and (ii) a recovery phase where we revert to standard training. We extensively evaluate TST on the scale of 270M and 600M parameters and validate on 3B and a 10B A1B mixture of experts model, demonstrating that it is highly robust in different settings. Ultimately, TST consistently outperforms the baseline in loss and downstream evaluations, and under equal-loss settings, TST yields up to a 2.5x reduction in total pre-training time at the 10B A1B scale.
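
To make the superposition phase concrete, here is a minimal sketch of bagging contiguous tokens and scoring a prediction against a multi-hot target. The abstract does not define the MCE objective precisely, so the version below assumes a target that spreads probability mass uniformly over the tokens in a bag; the helper names `make_bags` and `multi_hot_cross_entropy` are hypothetical.

```python
import math

def make_bags(token_ids, bag_size):
    """Group contiguous tokens into bags of `bag_size` (the last may be shorter)."""
    return [token_ids[i:i + bag_size] for i in range(0, len(token_ids), bag_size)]

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    top = max(logits)
    lse = top + math.log(sum(math.exp(x - top) for x in logits))
    return [x - lse for x in logits]

def multi_hot_cross_entropy(logits, bag):
    """Cross-entropy against a uniform target over the token ids in `bag`.

    Assumed form: the mean negative log-probability of the bag's tokens.
    With a bag of size 1 this reduces to ordinary next-token cross-entropy.
    """
    logp = log_softmax(logits)
    return -sum(logp[t] for t in bag) / len(bag)

bags = make_bags([5, 1, 3, 3, 2], bag_size=2)
uniform_logits = [0.0, 0.0, 0.0, 0.0]   # toy vocabulary of 4 tokens
loss_single = multi_hot_cross_entropy(uniform_logits, [2])
loss_bag = multi_hot_cross_entropy(uniform_logits, [0, 1])
```

Under this assumed definition, one forward pass supervises several tokens at once, which is one plausible reading of how throughput per FLOP could rise during the superposition phase.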


2605.06501 2026-05-20 cs.LG cs.CL

Cubit: Token Mixer with Kernel Ridge Regression

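Cubit: Token Mixer with Kernel Ridge Regression

Chuanyang Zheng, Jiankai Sun, Yihang Gao, Yuehao Wang, Liangchen Tan, Mac Schwager, Anderson Schneider, Yuriy Nevmyvaka, Xiaodong Liu

AI summary: This paper proposes Cubit, a new architecture based on kernel ridge regression. By switching the token-mixing mechanism from Nadaraya-Watson regression to kernel ridge regression, it aims for a stronger mathematical foundation and shows advantages in long-sequence modeling.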

Comments Tech Report

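AI Chinese abstract (translated into English)

Since its introduction in 2017, the Transformer has become one of the most widely adopted architectures in modern deep learning. Despite many improvements to positional encoding, attention mechanisms, and feed-forward networks, the core token-mixing mechanism of the Transformer is still attention. In this paper, we show that the attention module in Transformers can be interpreted as performing Nadaraya-Watson regression: it computes similarities between tokens and aggregates the values accordingly. Motivated by this view, we propose Cubit, a potential next-generation architecture that uses kernel ridge regression (KRR), whereas the vanilla Transformer relies on Nadaraya-Watson regression. Specifically, Cubit modifies the classical attention computation by incorporating the closed-form solution of KRR, combining value aggregation through kernel similarities with normalization through the inverse of the kernel matrix. To improve training stability, we further propose Limited-Range Rescale (LRR), which rescales the value layer within a controlled range. We argue that Cubit, as a KRR-based architecture, offers a stronger mathematical foundation than the vanilla Transformer, whose attention mechanism corresponds to Nadaraya-Watson regression. We validate this claim through comprehensive experiments. The results suggest that Cubit may have stronger long-sequence modeling capability. In particular, its performance gain over the Transformer appears to grow as the training sequence length increases.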

Original English abstract

Since its introduction in 2017, the Transformer has become one of the most widely adopted architectures in modern deep learning. Despite extensive efforts to improve positional encoding, attention mechanisms, and feed-forward networks, the core token-mixing mechanism in Transformers remains attention. In this work, we show that the attention module in Transformers can be interpreted as performing Nadaraya-Watson regression, where it computes similarities between tokens and aggregates the corresponding values accordingly. Motivated by this perspective, we propose Cubit, a potential next-generation architecture that leverages Kernel Ridge Regression (KRR), while the vanilla Transformer relies on Nadaraya-Watson regression. Specifically, Cubit modifies the classical attention computation by incorporating the closed-form solution of KRR, combining value aggregation through kernel similarities with normalization via the inverse of the kernel matrix. To improve the training stability, we further propose the Limited-Range Rescale (LRR), which rescales the value layer within a controlled range. We argue that Cubit, as a KRR-based architecture, provides a stronger mathematical foundation than the vanilla Transformer, whose attention mechanism corresponds to Nadaraya-Watson regression. We validate this claim through comprehensive experiments. The experimental results suggest that Cubit may exhibit stronger long-sequence modeling capability. In particular, its performance gain over the Transformer appears to increase as the training sequence length grows.
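
The contrast between the two estimators named in this abstract can be shown on scalar toy data. Nadaraya-Watson returns a kernel-weighted average of the values, while kernel ridge regression solves a linear system involving the kernel matrix before aggregating. The sketch below uses a Gaussian kernel on scalars and a hand-written solver purely for illustration (softmax attention corresponds to an exponential dot-product kernel instead); the function names and the regularizer `lam` are assumptions, and this is not Cubit's actual layer.

```python
import math

def rbf(a, b, bandwidth=1.0):
    """Gaussian kernel between two scalars."""
    return math.exp(-((a - b) ** 2) / (2 * bandwidth ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def nadaraya_watson(query, keys, values):
    """Kernel-weighted average of values (the attention-like estimator)."""
    w = [rbf(query, k) for k in keys]
    return sum(wi * v for wi, v in zip(w, values)) / sum(w)

def kernel_ridge(query, keys, values, lam=1e-9):
    """Closed-form KRR prediction: k(q)^T (K + lam * I)^{-1} v."""
    K = [[rbf(a, b) + (lam if i == j else 0.0) for j, b in enumerate(keys)]
         for i, a in enumerate(keys)]
    alpha = solve(K, values)
    return sum(rbf(query, k) * a for k, a in zip(keys, alpha))

keys = [0.0, 1.0, 2.0]
values = [1.0, -1.0, 2.0]
nw_at_key = nadaraya_watson(1.0, keys, values)
krr_at_key = kernel_ridge(1.0, keys, values)
```

Queried at a stored key, the averaging estimator blends in the neighboring values, whereas the ridge estimator with a tiny `lam` nearly reproduces the stored value. This is the role played by the inverse kernel matrix mentioned in the abstract.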


2605.06270 2026-05-20 cs.CV

Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction

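Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction

Zecheng Tang, Jiaye Fu, Qiankun Gao, Haijie Li, Yanmin Wu, Jiaqi Zhang, Siwei Ma, Jian Zhang

AI summary: This paper proposes Spark3R, a framework that uses asymmetric token reduction to accelerate feed-forward 3D reconstruction models without retraining, achieving up to a 28x speedup while maintaining high-quality reconstruction.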

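AI Chinese abstract (translated into English)

Feed-forward 3D reconstruction models based on Vision Transformers can estimate scene geometry and camera poses directly from a small number of input images, but scaling them to video inputs with hundreds or thousands of frames remains challenging because of the quadratic cost of global attention layers. Recent token-merging methods accelerate these models by compressing the token sequence inside the global attention layers, but they apply a uniform reduction to query tokens and key-value tokens, ignoring their functionally distinct roles in 3D reconstruction. In this paper, we identify a key property of feed-forward 3D reconstruction models: query tokens encode view-specific geometric requests and are sensitive to compression, whereas key-value tokens represent shared scene context and tolerate aggressive compression. Guided by this insight, we propose Spark3R, a training-free acceleration framework that decouples compression by assigning different reduction factors to query tokens and key-value tokens, applying intra-group token merging to query tokens and lightweight token pruning to key-value tokens. In addition, Spark3R adaptively adjusts the key-value reduction factor across layers, further improving the quality-efficiency trade-off. As a plug-and-play framework that requires no retraining, Spark3R integrates directly into multiple pretrained feed-forward 3D reconstruction models, including VGGT, π³, Depth-Anything-3, and VGGT-Ω, and achieves up to a 28x speedup on 1,000-frame inputs while maintaining competitive reconstruction quality.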

Original English abstract

Feed-forward 3D reconstruction models based on Vision Transformers can directly estimate scene geometry and camera poses from a small set of input images, but scaling them to video inputs with hundreds or thousands of frames remains challenging due to the quadratic cost of global attention layers. Recent token-merging methods accelerate these models by compressing the token sequence within the global attention layers, but they apply a uniform reduction to query tokens and key-value tokens, ignoring their functionally distinct roles in 3D reconstruction. In this work, we identify a key property of feed-forward 3D reconstruction models: query tokens encode view-specific geometric requests and are sensitive to compression, while key-value tokens represent shared scene context and tolerate aggressive compression. Guided by this insight, we propose Spark3R, a training-free acceleration framework that decouples the compression of query tokens and key-value tokens by assigning distinct reduction factors, with intra-group token merging applied to query tokens and lightweight token pruning to key-value tokens. Additionally, Spark3R adaptively adjusts the key-value reduction factor across layers, further improving the quality-efficiency trade-off. As a plug-and-play framework requiring no retraining, Spark3R integrates directly into multiple pretrained feed-forward 3D reconstruction models, including VGGT, $π^3$, Depth-Anything-3, and VGGT-$Ω$, and achieves up to $28\times$ speedup on 1,000-frame inputs while maintaining competitive reconstruction quality.
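
A rough way to see why separate reduction factors matter: the attention score matrix has one row per query token and one column per key-value token, so shrinking the two sides by different factors multiplies the savings. The arithmetic below is only a back-of-the-envelope sketch; the token counts and reduction factors are made-up numbers, not values from the paper.

```python
def attention_score_entries(num_tokens, q_reduction=1, kv_reduction=1):
    """Count query-key score entries after reducing each side separately."""
    n_q = num_tokens // q_reduction      # query tokens kept (mild reduction)
    n_kv = num_tokens // kv_reduction    # key-value tokens kept (aggressive)
    return n_q * n_kv

full = attention_score_entries(1000)
# Hypothetical asymmetric setting: halve the queries, keep 1 in 10 key-values.
asymmetric = attention_score_entries(1000, q_reduction=2, kv_reduction=10)
saving = full / asymmetric
```

In this toy count the asymmetric setting needs 20 times fewer score entries than full attention, while the compression-sensitive query side is only halved.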


2605.05480 2026-05-20 cs.LG cs.AI stat.ML

GRALIS: A Unified Canonical Framework for Linear Attribution Methods via Riesz Representation

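GRALIS: A Unified Canonical Framework for Linear Attribution Methods via Riesz Representation

Raimondo Fanale

AI summary: This paper proposes the GRALIS framework, which unifies linear attribution methods through Riesz representation theory. It provides seven formal theorems covering canonical form and completeness, convergence, Shapley interaction values, Hoeffding ANOVA decomposition, Sobol sensitivity generalization, and a multi-scale extension, and it reports preliminary validation results on medical images.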

Comments 25 pages, 6 tables, 2 figures. Theoretical framework with preliminary experimental validation on BreaKHis (1,187 images, DenseNet-121). Extended empirical comparison in preparation

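AI Chinese abstract (translated into English)

The main XAI attribution methods for deep neural networks (GradCAM, SHAP, LIME, Integrated Gradients) rest on different theoretical foundations and cannot be formally compared. We propose GRALIS (Gradient-Riesz Averaged Locally-Integrated Shapley), a mathematical framework that establishes a representation theory for attributions: every additive, linear, and continuous attribution functional on L^2(Q, mu) has a unique canonical representation (Q, w, Delta), whose necessity is proved by the Riesz Representation Theorem. This class includes SHAP, IG, LIME, and linearized GradCAM, but excludes nonlinear functionals such as standard GradCAM or attention maps. Seven formal theorems provide simultaneous guarantees that no single method offers: (T1) necessary canonical form; (T2) exact completeness; (T3) Monte Carlo convergence O(1/sqrt(m))+O(1/k); (T4) exact Shapley Interaction Values; (T5) Hoeffding ANOVA decomposition; (T6) Sobol sensitivity generalization; (T7) a multi-scale extension (MS-GRALIS) with minimum-variance weights. An algebraic appendix justifies the GRALIS-SIV correspondence via the Mobius transform without circular reasoning. GRALIS satisfies 13.5/14 axiomatic properties, compared with 2.5-6/14 for individual methods, including completeness, sensitivity, locality, order-k interactions, and optimal multi-scale aggregation. Preliminary validation on BreaKHis (1,187 histology images, DenseNet-121) reports deletion faithfulness AUC +0.015 (malignant), 96% class-conditional consistency, SAL = 0.762±0.109, and a sparsity index of 0.39. An extended comparison with baseline XAI methods is planned for a companion paper.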

Original English abstract

The main XAI attribution methods for deep neural networks -- GradCAM, SHAP, LIME, Integrated Gradients -- operate on separate theoretical foundations and are not formally comparable. We present GRALIS (Gradient-Riesz Averaged Locally-Integrated Shapley), a mathematical framework establishing a representation theory for attributions: every additive, linear, and continuous attribution functional on L^2(Q,mu) admits a unique canonical representation (Q, w, Delta), proved necessary by the Riesz Representation Theorem. This class encompasses SHAP, IG, LIME and linearized GradCAM, but excludes nonlinear functionals such as standard GradCAM or attention maps. Seven formal theorems provide simultaneous guarantees absent in any individual method: (T1) necessary canonical form; (T2) exact completeness; (T3) Monte Carlo convergence O(1/sqrt(m))+O(1/k); (T4) exact Shapley Interaction Values; (T5) Hoeffding ANOVA decomposition; (T6) Sobol sensitivity generalization; (T7) multi-scale extension (MS-GRALIS) with minimum-variance weights. An algebraic appendix justifies the GRALIS-SIV correspondence via the Mobius transform without circularity. GRALIS satisfies 13.5/14 axiomatic properties vs. 2.5-6/14 for individual methods, including completeness, sensitivity, locality, order-k interactions and optimal multi-scale aggregation simultaneously. Preliminary validation on BreaKHis (1,187 histology images, DenseNet-121) reports deletion faithfulness AUC +0.015 (malignant), 96% class-conditional consistency, SAL = 0.762+/-0.109 and sparsity index 0.39. Extended comparison with baseline XAI methods is planned for a companion paper.


2605.04525 2026-05-20 cs.RO

HDFlow: Hierarchical Diffusion-Flow Planning for Long-horizon Tasks

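HDFlow: Hierarchical Diffusion-Flow Planning for Long-Horizon Tasks

Nandiraju Gireesh, Yuanliang Ju, Chaoyi Xu, Weiheng Liu, Yuxuan Wan, He Wang

AI summary: This paper proposes HDFlow, a new hierarchical planning framework that combines the strengths of diffusion and rectified flow models to overcome the limitations of single-paradigm generative planners, and validates it on furniture assembly tasks in simulation and in the real world.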

Comments ICML 2026 (Spotlight)

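AI Chinese abstract (translated into English)

Recent advances in generative models have shown promise for generating behavior plans in long-horizon, sparse-reward tasks. Although these methods have produced promising results, they usually lack a principled framework for hierarchical decomposition, and their iterative denoising process makes it hard to meet the computational demands of real-time execution. In this paper, we introduce Hierarchical Diffusion-Flow (HDFlow), a novel hierarchical planning framework that leverages the strengths of diffusion and rectified flow models to overcome the limitations of single-paradigm generative planners. HDFlow uses a high-level diffusion planner to generate sequences of strategic subgoals in a learned latent space, drawing on diffusion's strong exploratory capability. These subgoals then guide a low-level rectified flow planner that generates smooth, dense trajectories, exploiting the speed and efficiency of trajectory generation based on ordinary differential equations (ODEs). We evaluate HDFlow on four challenging furniture assembly tasks, both in simulation and in the real world, where it significantly outperforms state-of-the-art methods. We also demonstrate the method's generalizability on two long-horizon benchmarks that include diverse locomotion and manipulation tasks. Project website: https://hdflow-page.github.io/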

Original English abstract

Recent advances in generative models have shown promise in generating behavior plans for long-horizon, sparse-reward tasks. While these approaches have achieved promising results, they often lack a principled framework for hierarchical decomposition and struggle with the computational demands of real-time execution, due to their iterative denoising process. In this work, we introduce Hierarchical Diffusion-Flow (HDFlow), a novel hierarchical planning framework that optimally leverages the strengths of diffusion and rectified flow models to overcome the limitations of single-paradigm generative planners. HDFlow employs a high-level diffusion planner to generate sequences of strategic subgoals in a learned latent space, capitalizing on diffusion's powerful exploratory capabilities. These subgoals then guide a low-level rectified flow planner that generates smooth and dense trajectories, exploiting the speed and efficiency of ordinary differential equation (ODE)-based trajectory generation. We evaluate HDFlow on four challenging furniture assembly tasks in both simulation and the real world, where it significantly outperforms state-of-the-art methods. Furthermore, we showcase our method's generalizability on two long-horizon benchmarks comprising diverse locomotion and manipulation tasks. Project website: https://hdflow-page.github.io/


2605.02223 2026-05-20 cs.SD cs.CV

Toward Fine-Grained Speech Inpainting Forensics: A Dataset, Method, and Metric for Multi-Region Tampering Localization

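Toward Fine-Grained Speech Inpainting Forensics: A Dataset, Method, and Metric for Multi-Region Tampering Localization

Tung Vu, Yen Nguyen, Hai Nguyen, Cuong Pham, Cong Tran

AI summary: This paper proposes the MIST dataset, the ISA method, and the SF1@tau metric for multi-region speech inpainting detection, and shows that existing deepfake detectors fall short on fine-grained speech inpainting.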


DOI: 10.1109/ACCESS.2026.3694045

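AI Chinese abstract (translated into English)

Recent advances in voice cloning and text-to-speech synthesis have made partial speech manipulation, in which an attacker replaces a few words in an utterance to change its meaning while preserving the speaker's identity, an increasingly realistic threat. Existing audio deepfake detection benchmarks focus on utterance-level binary classification or single-region tampering, and so cannot detect and localize multiple inpainted segments whose number is unknown in advance. We fill this gap with three contributions. First, we introduce MIST (Multiregion Inpainting Speech Tampering), a large-scale multilingual dataset covering 6 languages, with 1-3 independently inpainted word-level segments per utterance, generated through LLM-guided semantic replacement and neural voice cloning, in which fake content makes up only 2-7% of each utterance. Second, we propose ISA (Iterative Segment Analysis), a backbone-agnostic framework that performs coarse-to-fine sliding-window classification with gap-tolerant region proposal and boundary refinement, recovering all tampered regions without prior knowledge of their count. Third, we define SF1@tau, a segment-level F1 metric based on temporal IoU matching that jointly evaluates region-count accuracy and localization precision. Zero-shot evaluation shows that fine-grained speech inpainting remains unsolved by existing deepfake detectors: utterance-level classifiers trained on fully synthesized speech assign near-zero fake probability to MIST utterances. ISA consistently outperforms non-iterative baselines in this challenging setting, and the dataset, code, and evaluation toolkit are publicly released.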

Original English abstract

Recent advances in voice cloning and text-to-speech synthesis have made partial speech manipulation - where an adversary replaces a few words within an utterance to alter its meaning while preserving the speaker's identity - an increasingly realistic threat. Existing audio deepfake detection benchmarks focus on utterance-level binary classification or single-region tampering, leaving a critical gap in detecting and localizing multiple inpainted segments whose count is unknown a priori. We address this gap with three contributions. First, we introduce MIST (Multiregion Inpainting Speech Tampering), a large-scale multilingual dataset spanning 6 languages with 1-3 independently inpainted word-level segments per utterance, generated via LLM-guided semantic replacement and neural voice cloning, with fake content constituting only 2-7% of each utterance. Second, we propose ISA (Iterative Segment Analysis), a backbone-agnostic framework that performs coarse-to-fine sliding-window classification with gap-tolerant region proposal and boundary refinement to recover all tampered regions without prior knowledge of their count. Third, we define SF1@tau, a segment-level F1 metric based on temporal IoU matching that jointly evaluates region count accuracy and localization precision. Zero-shot evaluation reveals that partial inpainting at word granularity remains unsolved by existing deepfake detectors: utterance-level classifiers trained on fully synthesized speech assign near zero fake probability to MIST utterances where only 2-7% of content is manipulated. ISA consistently outperforms non-iterative baselines in this challenging setting, and the dataset, code, and evaluation toolkit are publicly released.
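
The idea behind a segment-level F1 with temporal IoU matching can be sketched as follows. The abstract does not spell out the matching protocol, so this version assumes greedy one-to-one matching in descending IoU order with a threshold `tau`; the function names are hypothetical and the intervals are made-up examples in seconds.

```python
def temporal_iou(a, b):
    """IoU of two time intervals given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def segment_f1(pred, truth, tau=0.5):
    """Segment-level F1 with greedy one-to-one IoU matching (assumed protocol).

    A predicted segment counts as a true positive when it is matched to an
    unused ground-truth segment with IoU >= tau. Spurious or missing
    segments lower precision or recall, so the score reflects both the
    region count and the localization quality.
    """
    pairs = sorted(
        ((temporal_iou(p, t), i, j)
         for i, p in enumerate(pred) for j, t in enumerate(truth)),
        reverse=True,
    )
    used_pred, used_truth, tp = set(), set(), 0
    for iou, i, j in pairs:
        if iou < tau:
            break
        if i not in used_pred and j not in used_truth:
            used_pred.add(i)
            used_truth.add(j)
            tp += 1
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(truth)
    return 2 * precision * recall / (precision + recall)

truth = [(1.0, 1.5), (3.0, 3.4)]
pred = [(1.1, 1.5), (5.0, 5.3), (2.9, 3.4)]
score = segment_f1(pred, truth, tau=0.5)
strict = segment_f1(pred, truth, tau=0.9)
```

In this toy case two of the three predicted segments match a tampered region, so the extra prediction lowers precision while recall stays perfect. Raising `tau` to 0.9 rejects both matches.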


2605.00768 2026-05-20 cs.CL

Characterizing the Expressivity of Local Attention in Transformers

Jiaoda Li, Ryan Cotterell

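AI summary: This paper studies the expressivity of local attention in transformers. Using formal language theory, it analyzes how local attention enlarges the class of recognizable regular languages and proves that global and local attention are expressively complementary; experiments show that hybrid transformers perform better on formal language recognition and natural language modeling.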

Comments ACL 2026

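AI abstract

The transformer is the most popular neural architecture for language modeling. At its core is the global attention mechanism, which lets the model aggregate information from all preceding tokens before generating the next token. Local attention is a common variant that restricts each token to aggregating information from a bounded window of preceding tokens, reducing the quadratic cost of global attention to linear. Although this restriction is usually motivated by efficiency, it has also been found to improve model quality, a phenomenon that has so far lacked a satisfactory explanation. This paper gives a formal account of the phenomenon in terms of recognizer expressivity. Fixed-precision transformers with global attention have been shown to correspond to a fragment of linear temporal logic containing a single past operator. The paper further proves that adding local attention introduces a second temporal operator, strictly enlarging the class of recognizable regular languages. Moreover, global and local attention are expressively complementary: neither subsumes the other, and combining them yields the richest fragment. Experiments on formal language recognition and natural language modeling corroborate the theory, showing that hybrid global-local transformers outperform transformers that use only global attention.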

English abstract

The transformer is the most popular neural architecture for language modeling. The cornerstone of the transformer is its global attention mechanism, which lets the model aggregate information from all preceding tokens before generating the next token. One common variant of attention is called local attention, which restricts each token to aggregating information from a bounded window of predecessors, reducing the quadratic cost of global attention to linear. Although this restriction is usually motivated by efficiency, it has also been found to improve model quality, a phenomenon that has so far lacked a satisfactory explanation. We provide a formal account of this phenomenon in terms of recognizer expressivity. It has been shown that fixed-precision transformers with global attention correspond to a fragment of linear temporal logic containing a single past operator. We additionally prove that adding local attention introduces a second temporal operator, strictly enlarging the class of recognizable regular languages. Moreover, global and local attention are expressively complementary: neither subsumes the other, and combining them yields the richest fragment. Experiments on formal language recognition and natural language modeling corroborate the theory, showing that hybrid global-local transformers outperform their global-only counterparts.
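
The difference between global and local attention described above can be illustrated with attention masks. The following is a minimal sketch in plain Python; the helper names `attention_mask` and `attended_pairs` are hypothetical, and the choice that a window of size `w` includes the current token is an assumption, since the abstract does not fix that convention.

```python
def attention_mask(n, window=None):
    """Build a causal mask: mask[i][j] is True when position i may attend to j.

    window=None gives global attention (all positions j <= i).
    window=w gives local attention over the w most recent positions,
    counting the current one (an assumed convention).
    """
    return [
        [j <= i and (window is None or i - j < window) for j in range(n)]
        for i in range(n)
    ]


def attended_pairs(mask):
    """Count (query, key) pairs the mask allows, a proxy for attention cost."""
    return sum(sum(row) for row in mask)


n, w = 8, 3
global_mask = attention_mask(n)
local_mask = attention_mask(n, window=w)

global_cost = attended_pairs(global_mask)  # n * (n + 1) / 2, quadratic in n
local_cost = attended_pairs(local_mask)    # at most n * w, linear in n
```

With a fixed window the number of allowed pairs grows linearly in sequence length, while the global mask grows quadratically. A hybrid model in the sense of the abstract would use both kinds of mask, for example in different layers or heads; how the paper combines them is not stated in the abstract.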
