arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.17423 2026-05-19 cs.CV

Soap2Soap: Long Cinematic Video Remaking via Multi-Agent Collaboration

Yiren Song, Huilin Zhong, Kevin Qinghong Lin, Haofan Wang, Mike Zheng Shou

AI summary: This work proposes Soap2Soap, a framework that remakes long cinematic videos through multi-agent collaboration, addressing long-term consistency and narrative fidelity in video-to-video generation.

Abstract

We study series-level cinematic remaking, a long-horizon video-to-video generation problem that localizes full episodes or films via stylization or actor replacement while strictly preserving narrative structure, motion choreography, and character identity across hundreds of shots. Existing video generation and editing pipelines often break down in this regime due to compounding identity drift, background mutation, and semantic erosion under large camera motions and viewpoint changes. We propose Soap2Soap, a multi-agent framework that enforces long-term language-visual consistency through a Dual-Bridge Consistency mechanism: a scene-aware JSON screenplay serving as a persistent semantic backbone, and dynamically allocated visual reference anchors at both scene and shot levels. To suppress drift before video synthesis, we introduce batch keyframe consistency, jointly generating multiple keyframes in a shared latent context via a grid-based formulation. A closed-loop verification agent further audits identity, stability, and alignment to trigger selective regeneration. Experiments on SoapBench demonstrate strong improvements over commercial video generation APIs in long-term consistency and narrative fidelity.
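
The closed-loop verification step can be pictured with a small sketch. The abstract does not specify how the audit is scored, so the `ShotAudit` fields, the score range, and the 0.8 threshold below are all hypothetical; the sketch only shows the idea of regenerating the failing shots rather than the whole sequence.

```python
from dataclasses import dataclass


@dataclass
class ShotAudit:
    """Hypothetical per-shot audit scores in [0, 1]; higher is better."""
    shot_id: str
    identity: float
    stability: float
    alignment: float


def shots_to_regenerate(audits, threshold=0.8):
    """Return the IDs of shots where any audited score falls below the threshold."""
    return [
        a.shot_id
        for a in audits
        if min(a.identity, a.stability, a.alignment) < threshold
    ]


audits = [
    ShotAudit("scene01_shot01", identity=0.95, stability=0.91, alignment=0.88),
    ShotAudit("scene01_shot02", identity=0.62, stability=0.90, alignment=0.85),
    ShotAudit("scene02_shot01", identity=0.93, stability=0.75, alignment=0.90),
]
print(shots_to_regenerate(audits))
```

Only the second and third shots are flagged here, so a pipeline built this way would leave the first shot untouched.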

2605.17421 2026-05-19 cs.RO

MUSE: Multimodal Uncertainty Quantification of State Estimation

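Minkyung Kim, Henry Che, Bhargav Chandaka, Bhumsitt Pramuanpornsatid, Chengyu Yang, Sheng Cheng, Xiaofeng Wang, Naira Hovakimyan, Shenlong Wang

AI summary: This paper proposes MUSE, a real-time learning-based framework that uses Mamba's strong and efficient sequential modeling capacity to estimate localization uncertainty from multiple asynchronous sensor streams, with better reliability and robustness than existing uncertainty quantification methods.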

Comments Code and dataset: https://github.com/hungdche/MUSE

Abstract

Accurate visual state estimation has been a central topic in robotics with a wide range of applications in robot navigation, autonomous driving, and autonomous flight. Recent advances in robot perception have led to significant improvements in the accuracy and robustness of state estimation, yet a fundamental challenge remains in how to quantify and calibrate its precision, i.e., how confident we are in an estimate and whether failures can be detected. This issue is particularly pronounced in visual-inertial odometry (VIO), where the heteroscedastic and multimodal nature of the problem makes uncertainty quantification especially difficult. This paper introduces MUSE (Multimodal Uncertainty Quantification of State Estimation), a novel real-time learning-based framework that leverages the strong and efficient sequential modeling capacity of Mamba to estimate localization uncertainty from multiple asynchronous sensor streams. Experiments on both public and in-house datasets demonstrate that MUSE achieves superior reliability and robustness compared to existing uncertainty quantification methods, and ablation studies justify the benefits of its key design choices.
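
The abstract does not say how MUSE is trained or scored. As a generic illustration of why per-sample (heteroscedastic) uncertainty matters, the sketch below uses the Gaussian negative log-likelihood, a common scoring rule that penalizes overconfident estimates. A single Gaussian is a simplification and does not capture the multimodal case the abstract highlights; the error and sigma values are made up.

```python
import math


def gaussian_nll(error, sigma):
    """Negative log-likelihood of a scalar error under N(0, sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * sigma ** 2) + error ** 2 / (2.0 * sigma ** 2)


position_error_m = 2.0  # hypothetical localization error in meters
overconfident = gaussian_nll(position_error_m, sigma=0.1)
well_matched = gaussian_nll(position_error_m, sigma=2.0)
print(overconfident > well_matched)
```

A predicted sigma that matches the actual error scale scores far better than one that claims centimeter-level confidence for a two-meter error.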

2605.17419 2026-05-19 cs.LG cs.AI

Learning Displacement-Robust Representations for Landslide Early Warning under Rainfall Forecast Uncertainty

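Ren Ozeki, Hamada Rizk, Hirozumi Yamaguchi

AI summary: This paper proposes a landslide early warning system that is robust to rainfall field displacement. It learns latent representations of rainfall and terrain data to improve landslide prediction under rainfall forecast uncertainty.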

Abstract

Rainfall-induced landslides pose a growing risk worldwide as climate change intensifies extreme rainfall events. To provide sufficient evacuation time, landslide early warning systems (LEWS) for real-time disaster monitoring must estimate near-future landslide risk by integrating observed rainfall with short-term rainfall forecasts from spatio-temporal environmental data streams. Although recent landslide prediction methods have improved predictive performance using statistical and deep learning approaches, most assume accurate rainfall inputs. In operational settings, however, landslide prediction relies on rainfall forecasts, which often contain spatial displacement of rainfall fields due to forecasting uncertainties. Such displacement can alter local accumulated rainfall and degrade prediction accuracy. To address this challenge, we propose a novel LEWS robust to rainfall field displacement. The key idea is to learn latent representations from rainfall and terrain data that remain stable under displacement in rainfall field motion, enabling reliable geospatial data integration for landslide risk estimation. The landslide prediction model is trained using Rainfall-Motion-Aware Contrastive Learning (RMCL), which introduces temporally correlated rainfall field perturbations to emulate forecast-induced displacement in rainfall-driven spatio-temporal environmental data streams. Experiments were conducted using two years of rainfall and terrain data across Japan, covering 19 regions with landslide events. The proposed system achieved up to 37% higher precision than state-of-the-art baselines. These results demonstrate that modeling rainfall as a moving spatial field and addressing rainfall field displacement during learning significantly improve the reliability of short-term landslide prediction in operational early warning systems.
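
To make the displacement problem concrete, here is a minimal sketch of shifting a gridded rainfall field with temporally correlated offsets. It is not the RMCL procedure from the paper: the AR(1) offset model, the wrap-around shift, and the grid values are assumptions chosen only to show that a displaced field keeps its total rainfall while the local value at a given cell changes.

```python
import random


def shift_field(field, dx, dy):
    """Shift a 2D grid by (dx, dy) cells with wrap-around, preserving total rainfall."""
    rows, cols = len(field), len(field[0])
    return [
        [field[(r - dy) % rows][(c - dx) % cols] for c in range(cols)]
        for r in range(rows)
    ]


def correlated_offsets(steps, rho=0.8, scale=1.0, seed=0):
    """AR(1) sequence of integer (dx, dy) offsets; rho sets the temporal correlation."""
    rng = random.Random(seed)
    x = y = 0.0
    offsets = []
    for _ in range(steps):
        x = rho * x + rng.gauss(0.0, scale)
        y = rho * y + rng.gauss(0.0, scale)
        offsets.append((round(x), round(y)))
    return offsets


# A single rain cell (mm/h) on a 4x4 grid; the slope of interest sits at row 1, column 1.
field = [[0, 0, 0, 0], [0, 10, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
shifted = shift_field(field, dx=1, dy=0)
print(shifted[1])
```

After a one-cell shift the slope cell receives 0 instead of 10 mm/h, which is the kind of local change that a displacement-robust representation is meant to tolerate.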

2605.17410 2026-05-19 cs.AI

Computational Challenges in Token Economics: Bridging Economic Theory and AI System Design

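Ou Wu, Yingjun Deng

AI summary: This paper examines the computational challenges of treating tokens as economic primitives in large language model systems. It introduces the notion of Computational Token Economics and the Token Economics Trilemma, and sets out a research agenda for bridging token economics and AI system design.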

Comments 43 pages

Abstract

Token economics has emerged as a useful lens for understanding resource allocation, value creation, and pricing in large language model systems. While recent work has increasingly treated tokens as economic primitives, there remains a substantial gap between high-level economic theory and the computational realities of modern AI infrastructure. This paper identifies and analyzes the key computational challenges that arise when token-economic principles are implemented in real-time inference systems. We argue that computational feasibility is not merely one dimension of token economics, but its governing constraint: these challenges are driven by fundamental tensions among fine-grained valuation, low-latency execution, and allocation optimality under uncertainty. To structure this problem space, we introduce the notion of Computational Token Economics and propose the Token Economics Trilemma, a conditional no-free-lunch principle that captures the inherent trade-offs among granularity, real-time performance, and optimality. We further categorize the main technical challenges into three areas: real-time value accounting, constrained resource allocation, and economic-aware system architecture. Rather than presenting a complete solution, this paper aims to define a research agenda for bridging token economics and AI system design, highlighting open problems at the intersection of computational economics, machine learning systems, and AI infrastructure.

2605.17405 2026-05-19 cs.SD cs.MM

A Distribution Matching Approach to Neural Piano Transcription with Optimal Transport

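Weixing Wei, Raynaldi Lalang, Dichucheng Li, Kazuyoshi Yoshii

AI summary: This paper casts automatic piano transcription as an optimal transport problem rather than a frame-level multi-label binary classification problem. Minimizing the cost of transporting the predicted note distribution to the ground-truth distribution tolerates temporal misalignment and makes optimization more perceptually relevant. It also proposes a convolutional recurrent neural network with a harmonics-aware attention mechanism to capture spectro-temporal dependencies in music.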

Comments Accepted to ICASSP2026

Abstract

This paper describes a novel paradigm that formalizes automatic piano transcription (APT) as an optimal transport (OT) problem, not as a frame-level multi-label binary classification problem. Our method learns to minimize the cost of transporting a predicted distribution of note events to the ground-truth distribution over time and frequency. The OT loss can thus accommodate temporal misalignment, leading to perceptually relevant optimization. We also propose a convolutional recurrent neural network (CRNN) with a harmonics-aware attention mechanism to capture the spectro-temporal dependencies inherent in music. Our experiments using the MAESTRO dataset showed that our method attained state-of-the-art performance in onset detection. We confirmed the versatility of the OT loss in application to existing models.
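
The difference between a transport cost and a frame-wise comparison can be seen in a one-dimensional toy case. This sketch is not the paper's loss: it uses only the time axis (the paper transports over time and frequency), a unit cost per frame, and made-up onset vectors. It computes the 1D Wasserstein-1 distance through the cumulative-sum formula.

```python
def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def wasserstein_1d(p, q):
    """W1 distance between two histograms on the same unit-spaced frame grid."""
    p, q = normalize(p), normalize(q)
    cost, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi      # running difference of the two CDFs
        cost += abs(cdf_gap)    # mass that still has to cross this frame boundary
    return cost


def framewise_mismatch(p, q):
    """Number of frames where the binarized activations disagree."""
    return sum((pi > 0) != (qi > 0) for pi, qi in zip(p, q))


truth     = [0, 0, 1, 0, 0, 0, 0, 0]
near_miss = [0, 0, 0, 1, 0, 0, 0, 0]  # onset predicted one frame late
far_miss  = [0, 0, 0, 0, 0, 0, 0, 1]  # onset predicted five frames late

print(wasserstein_1d(near_miss, truth), wasserstein_1d(far_miss, truth))
print(framewise_mismatch(near_miss, truth), framewise_mismatch(far_miss, truth))
```

The frame-wise count treats both predictions as equally wrong, while the transport cost grows with the size of the timing error.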

2605.17403 2026-05-19 cs.LG

Self-Supervised Learning for Sparse Matrix Reordering

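Ziwei Li, Tao Yuan, Fangfang Liu, Shuzi Niu, Huiyuan Li, Wenjia Wu

AI summary: This paper proposes a self-supervised method for sparse matrix reordering. A multigrid graph network captures structural information, a triplet sampling strategy is derived from the path triplet inequalities, and an end-max chain loss reduces the number of triplets whose predicted scores satisfy these inequalities, yielding fewer fill-ins and faster LU factorization.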

Comments Accepted by DASFAA 2026

Abstract

Rearranging the rows or columns of a sparse matrix using an appropriate ordering can significantly reduce fill-ins, i.e., new nonzeros introduced during matrix factorization, decreasing memory usage and runtime. However, finding an ordering that minimizes fill-ins is NP-complete. Existing approaches, including graph-theoretic and deep learning methods, rely on surrogate objectives without theoretical guarantees. The Fill-Path Theorem reveals a direct and intrinsic relationship between fill-in generation and the sparse structure of the matrix as path triplet inequalities. Here we first employ a multigrid graph network to capture structural information for each vertex. We then derive a triplet sampling strategy based on inequalities. Finally, we introduce an end-max chain loss function to reduce the number of triplets whose predicted scores satisfy these inequalities. Experimental evaluations on the publicly available SuiteSparse matrix collection demonstrate the superiority of the proposed method in terms of both fill-in reduction and speedup in LU factorization time.
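
How much the ordering matters can be shown with a small symbolic-elimination sketch. It assumes a symmetric sparsity pattern, represented as an undirected graph, and counts the edges added when vertices are eliminated in a given order. This is the textbook elimination-graph model, not the learned method from the paper, and the star graph is a made-up example.

```python
def count_fill_ins(edges, order):
    """Count fill-in edges created by eliminating vertices in the given order."""
    adj = {v: set() for v in order}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    fill = 0
    for v in order:
        nbrs = list(adj[v])
        # Eliminating v makes its remaining neighbors pairwise adjacent.
        for i, a in enumerate(nbrs):
            for b in nbrs[i + 1:]:
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill += 1
        for n in nbrs:
            adj[n].discard(v)
        del adj[v]
    return fill


# Star graph: vertex 0 is the hub, vertices 1 to 4 are leaves.
star = [(0, 1), (0, 2), (0, 3), (0, 4)]
hub_first = count_fill_ins(star, [0, 1, 2, 3, 4])
leaves_first = count_fill_ins(star, [1, 2, 3, 4, 0])
print(hub_first, leaves_first)
```

Eliminating the hub first connects all four leaves to each other, while eliminating the leaves first creates no new nonzeros at all.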

2605.17398 2026-05-19 cs.CL cs.LG

MiniGPT: Rebuilding GPT from First Principles

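Jibin Joseph

AI summary: This paper presents MiniGPT, a from-scratch PyTorch implementation of a GPT-style autoregressive language model. After studying the design of nanoGPT, the author rebuilds the core GPT pipeline from first principles while keeping the model and training code independently written. MiniGPT implements token and positional embeddings, causal multi-head self-attention, pre-LayerNorm Transformer blocks, residual connections, feed-forward MLP layers, next-token cross-entropy training (teacher forcing), validation tracking, checkpoint selection, and autoregressive text generation.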

Comments 13 pages, 2 figures

Abstract

This paper presents MiniGPT, a compact from-scratch implementation of GPT-style autoregressive language modeling in PyTorch. The aim is to rebuild the core GPT pipeline from first principles after studying the design of nanoGPT by Andrej Karpathy, while keeping the model and training code independently written in a single notebook. MiniGPT implements token and positional embeddings, causal multi-head self-attention, pre-LayerNorm Transformer blocks, residual connections, feed-forward MLP layers, next-token cross-entropy training (teacher forcing), validation tracking, checkpoint selection, and autoregressive text generation. This paper evaluates the implementation on the Tiny Shakespeare dataset using character-level tokenization. A baseline 0.83M-parameter model reaches a validation loss of 1.7236 after 3000 training iterations. A stronger 10.77M-parameter configuration, using a larger context length and improved training settings, reaches a best validation loss of 1.4780 and generates text with recognizable Shakespeare-style dialogue structure. MiniGPT does not introduce a new language-model architecture. Instead, it documents a clear and reproducible implementation path from raw text to trained character-level generation, including design choices, training behavior, generation quality, and practical limitations.
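
Character-level tokenization and teacher forcing can be illustrated without any deep learning library. The sketch below is not taken from the MiniGPT notebook; the sample text, the helper names, and the block size are assumptions. It shows how each training window pairs an input with the same sequence shifted by one character.

```python
text = "to be or not to be"

# Character-level vocabulary: one integer ID per distinct character.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}


def encode(s):
    return [stoi[ch] for ch in s]


def decode(ids):
    return "".join(itos[i] for i in ids)


def next_token_pairs(ids, block_size):
    """Yield (input, target) windows; the target is the input shifted by one position."""
    for start in range(len(ids) - block_size):
        x = ids[start:start + block_size]
        y = ids[start + 1:start + block_size + 1]
        yield x, y


ids = encode(text)
x, y = next(next_token_pairs(ids, block_size=4))
print(repr(decode(x)), "->", repr(decode(y)))
```

During training, the model would see `x` and be scored with cross-entropy against `y` at every position, so one window supplies `block_size` next-character predictions.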

2605.17393 2026-05-19 cs.AI cs.LG cs.MA

Heterogeneous Information-Bottleneck Coordination Graphs for Multi-Agent Reinforcement Learning

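Wei Duan, Junyu Xuan, En Yu, Xiaoyu Yang, Jie Lu

AI summary: This paper proposes Heterogeneous Information-Bottleneck Coordination Graphs (HIBCG), which give multi-agent reinforcement learning a theoretically grounded way to decide which coordination-graph edges should exist and how much message capacity each should carry. Using the graph information bottleneck, it constructs a group-aligned block-diagonal prior so that both edge existence and message capacity are theoretically justified.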

Abstract

Coordination graphs are a central abstraction in cooperative multi-agent reinforcement learning (MARL), yet existing sparse-graph learners lack a theoretically grounded mechanism to decide which edges should exist and how much information each edge should carry. Current methods rely on heuristic criteria that offer no formal guarantee on the learned topology, and no principled way to allocate different communication capacities to structurally different agent relationships. To address this, we propose Heterogeneous Information-Bottleneck Coordination Graphs (HIBCG), which learns a group-aware sparse graph in which both edge existence and message capacity are theoretically justified. With the graph information bottleneck (GIB) serving as the underlying tool, HIBCG first constructs a group-aligned block-diagonal prior that provides a closed-form criterion for edge retention -- determining which edges should exist and at what density per group block -- and then controls per-agent feature bandwidth on the resulting topology, compressing messages to retain only task-relevant content. We prove that the group-aligned prior strictly tightens the variational bound on topology learning, that the objective decomposes per group block, enabling differential edge control, and that capacity allocation follows a water-filling principle.
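
The abstract states that capacity allocation follows a water-filling principle but gives no formula. For readers unfamiliar with the term, the sketch below implements the classic water-filling allocation from communication theory: a fixed budget is poured over channels with different "floor" levels, and channels whose floor lies above the water level get nothing. The noise levels and budget are hypothetical stand-ins, not quantities from the paper.

```python
def water_filling(noise_levels, budget):
    """Classic water-filling: allocation_i = max(0, level - noise_i), summing to budget."""
    order = sorted(noise_levels)
    level = 0.0
    for k in range(len(order), 0, -1):
        # Assume the k quietest channels are active and solve for the water level.
        level = (budget + sum(order[:k])) / k
        if level > order[k - 1]:
            break
    return [max(0.0, level - n) for n in noise_levels]


alloc = water_filling([1.0, 2.0, 5.0], budget=3.0)
print(alloc)
```

With a budget of 3, the water level settles at 3: the two quiet channels receive 2 and 1 units, and the noisy channel receives none.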

2605.17382 2026-05-19 cs.AI cs.CL cs.GR

QQJ: Quantifying Qualitative Judgment for Scalable and Human-Aligned Evaluation of Generative AI

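Marjan Veysi, Pirooz Shamsinejadbabaki, Mohammad Zare, Mohammad Sabouri

AI summary: This paper proposes QQJ, a framework that anchors evaluation in expert-designed multi-dimensional rubrics and calibrates large language model evaluators with a small, high-quality annotation set, giving scalable evaluation that aligns with human judgment and showing that structured qualitative judgment can work at scale.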

Abstract

The rapid progress of generative artificial intelligence has exposed fundamental limitations in existing evaluation methodologies, particularly for open-ended, creative, and human-facing tasks. Traditional automatic metrics rely on surface-level statistical similarity and often fail to reflect human perceptions of quality, while purely human evaluation, although reliable, is costly, subjective, and difficult to scale. Recent approaches using large language models as evaluators offer improved scalability but frequently lack explicit grounding in human-defined evaluation principles, leading to bias and inconsistency. In this paper, we introduce Quantifying Qualitative Judgment (QQJ), a scalable and human-centric evaluation framework that explicitly bridges the gap between human judgment and automated assessment. QQJ separates the definition of quality from its execution by anchoring evaluation in expert-designed, multi-dimensional rubrics and calibrating large language model evaluators to align with expert reasoning using a small, high-quality annotation set. This design enables consistent, interpretable, and scalable evaluation across diverse generative tasks and modalities. Extensive experiments on text and image generation demonstrate that QQJ achieves substantially stronger alignment with human judgment than traditional automatic metrics and unconstrained LLM-based evaluators. Moreover, QQJ exhibits improved stability across repeated evaluations and superior diagnostic capability in identifying critical failure modes such as hallucination and intent mismatch. These results indicate that structured qualitative judgment can be operationalized at scale without sacrificing interpretability or human alignment, positioning QQJ as a practical foundation for reliable evaluation of modern generative AI systems.

2605.17380 2026-05-19 cs.AI cs.CR cs.LG

ADR: An Agentic Detection System for Enterprise Agentic AI Security

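Chenning Li, Pan Hu, Justin Xu, Baris Ozbas, Olivia Liu, Caroline Van, Manxue Li, Wei Zhou, Mohammad Alizadeh, Pengyu Zhang, KK Sriramadhesikan, Ming Zhang

AI summary: This paper presents ADR, a large-scale, production-proven enterprise framework for securing AI agents that operate through the Model Context Protocol (MCP). It targets three problems (limited observability, insufficient robustness, and high detection cost) with three components: the ADR Sensor, the ADR Explorer, and the ADR Detector.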

Comments Accepted at MLSys 2026 (Industry Track)

Abstract

We present the Agentic AI Detection and Response (ADR) system, the first large-scale, production-proven enterprise framework for securing AI agents operating through the Model Context Protocol (MCP). We identify three persistent challenges in this domain: (1) limited observability -- existing Endpoint Detection and Response (EDR) tools see file writes but not the agent reasoning, prompts, or causal chains linking intent to execution; (2) insufficient robustness -- static defenses constrained by pre-defined rules fail to generalize across diverse attack techniques and enterprise contexts; and (3) high detection costs -- LLM-based inference is prohibitively expensive at scale. ADR addresses these challenges via three components: the ADR Sensor for high-fidelity agentic telemetry, the ADR Explorer for systematic pre-deployment red teaming and hard-example generation, and the ADR Detector for scalable, two-tier online detection combining fast triage with context-aware reasoning. Deployed at Uber for over ten months, ADR has sustained reliable detection in production with growing adoption reaching over 7,200 unique hosts and processing over 10,000 agent sessions daily, uncovering hundreds of credential exposures across 26 categories and enabling a shift-left prevention layer (97.2% precision, 206 detected credentials). To validate the approach and enable community adoption, we introduce ADR-Bench (302 tasks, 17 techniques, 133 MCP servers), where ADR achieves zero false positives while detecting 67% of attacks -- outperforming three state-of-the-art baselines (ALRPHFS, GuardAgent, LlamaFirewall) by 2--4x in F1-score. On AgentDojo (public prompt injection benchmark), ADR detects all attacks with only three false alarms out of 93 tasks.
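
The ADR-Bench figures can be turned into an approximate F1-score with a short calculation. The abstract does not report the F1 value itself, so this rests on two assumptions: zero false positives is read as precision 1.0, and "detecting 67% of attacks" is read as recall 0.67.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Assumption: zero false positives gives precision 1.0, and
# "detecting 67% of attacks" is interpreted as recall 0.67.
adr_bench_f1 = f1_score(precision=1.0, recall=0.67)
print(round(adr_bench_f1, 3))
```

Under those assumptions the implied F1-score is roughly 0.80, which is the quantity the 2--4x comparison against the baselines refers to.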

2605.17379 2026-05-19 cs.CL cs.AI

Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization

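Gunjan Balde, Soumyadeep Roy, Mainack Mondal, Niloy Ganguly

AI summary: This paper proposes a parameter-efficient domain adaptation approach that combines vocabulary adaptation with pretraining to improve large language models on specialized-domain text summarization while reducing training time and parameter count.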

Comments 16 pages. Accepted in the 64th Annual Meeting of the Association for Computational Linguistics [ACL (Main) 2026] as a long paper

Abstract

Large language models pretrained on general-domain corpora often exhibit tokenization inefficiencies when applied to specialized domains. Although continual pretraining for domain adaptation partially alleviates performance degradation, it does not resolve the fundamental vocabulary mismatch. To address this gap, we introduce a targeted parameter-efficient domain adaptation approach that combines vocabulary adaptation with pretraining for LLM-based text summarization. Our unified framework augments pretrained tokenizers with domain-specific tokens while selectively replacing under-trained and unreachable tokens to limit parameter growth. We evaluate our approach on Llama-3.1-8B and Qwen2.5-7B across legal and medical summarization tasks on a challenge-oriented evaluation protocol focused on expert-driven text and summaries, which typically have a higher concentration of over-fragmented Out-of-Vocabulary (OOV) words. The vocabulary adaptation algorithm enhances the overall quality of the summarization model by improving semantic similarity between the generated summaries and their references. In addition, the adapted model produces summaries that incorporate more appropriate novel and domain-specific words, leading to improved coherence, relevance, and faithfulness. We further observe that our proposed approach reduces training time by 35-55% over continual pretraining and reduces parameter counts by up to 37% relative to expansion-only methods. We make the codebase publicly available at https://github.com/gb-kgp/VocabReplace-Then-Expand.
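
Over-fragmentation is easy to see with a toy tokenizer. The sketch below uses greedy longest-match segmentation, which is a simplification of the subword tokenizers used by Llama and Qwen, and both vocabularies are invented for illustration. It only shows why adding domain tokens shortens the token sequence for a specialized term.

```python
def greedy_tokenize(word, vocab):
    """Greedy longest-match segmentation; falls back to single characters."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            # Take the longest piece in the vocabulary, or one character as a last resort.
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens


general_vocab = {"hab", "eas", "cor", "pus"}
adapted_vocab = general_vocab | {"habeas", "corpus"}  # hypothetical domain tokens

before = greedy_tokenize("habeascorpus", general_vocab)
after = greedy_tokenize("habeascorpus", adapted_vocab)
print(before, after)
```

The same legal term drops from four tokens to two once the domain tokens are available, which is the kind of saving that shortens sequences during training and inference.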

2605.16234 2026-05-19 cs.LG cs.AI cs.CL

No Free Swap: Protocol-Dependent Layer Redundancy in Transformers

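Gabriel Garcia

AI summary: This paper studies layer redundancy in transformers by comparing two protocols, replacement and interchange. The two can disagree substantially for compression: under the same evaluator, the choice of protocol can change which layers look safe to prune, especially when replacement distances are high.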

Comments 40 pages, 8 figures, 24 tables. Code is available at https://github.com/Gpgabriel25/ProtocolGapDiagnostic

Abstract

When researchers ask whether two transformer layers are "equivalent" for compression, they often conflate distinct tests. Replacement asks whether one layer's map can substitute for another's in place; interchange asks whether two layers approximately commute when their positions are swapped. Both are output-grounded swap-KL probes, but they need not agree: on pretrained transformers the protocol gap can change which layers look safe to prune by several-fold under the same evaluator, especially when replacement distances are high. We measure both protocols across checkpoints and architectures. On a Pythia training trajectory (410M and 1.4B), the replacement-interchange gap grows from initialization to convergence. Under one matched WikiText-2 contract at 8B scale, Qwen3-8B enters a divergent regime: interchange-guided removal is several-fold safer than replacement-guided at the same layer budgets, while Llama-3.1-8B ties the two protocols for pruning cost even though interchange KL is lower, showing metric gaps need not map one-to-one to removal. Before layer removal or merging, score both swap-KLs on the target checkpoint; the diagnostic requires only unlabeled forward passes.
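
A toy example shows why the two tests need not agree. The sketch below is not the paper's diagnostic: the "layers" are element-wise scalings chosen because they commute, the input is made up, and the KL direction is an assumption since the abstract does not define swap-KL precisely.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def run(layers, x):
    for layer in layers:
        x = layer(x)
    return softmax(x)


# Two toy "layers" that commute: element-wise scaling by different factors.
def layer_a(x):
    return [2.0 * v for v in x]


def layer_b(x):
    return [3.0 * v for v in x]


x = [0.1, -0.2, 0.3]
baseline = run([layer_a, layer_b], x)
replacement_kl = kl(baseline, run([layer_a, layer_a], x))  # layer_b replaced by layer_a
interchange_kl = kl(baseline, run([layer_b, layer_a], x))  # positions swapped
print(replacement_kl, interchange_kl)
```

Because the two scalings commute, swapping them leaves the output essentially unchanged, yet substituting one for the other visibly shifts the output distribution. The layers pass the interchange test and fail the replacement test.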

2605.15735 2026-05-19 cs.CV cs.AI

UAM: A Dual-Stream Perspective on Forgetting in VLA Training

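Jianke Zhang, Yuanfei Luo, Yucheng Hu, Xiaoyu Chen, Yanjiang Guo, Ziyang Liu, Hongbin Xu, Tian Lan, Jianyu Chen

AI summary: This paper proposes UAM, a dual-stream architecture that addresses the loss of multimodal capability caused by relying on a single encoder in VLA training. It shows that semantic preservation can come from architectural separation rather than frozen weights or auxiliary data, and it achieves high success rates across a variety of tasks.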

Abstract

Vision-language-action (VLA) models are typically built by fine-tuning a pretrained vision-language model (VLM) on action data. However, we show that this standard recipe systematically erodes the VLM's multimodal competence, a side effect we call the embodiment tax. But do VLAs have to forget? Inspired by the two-stream organization of biological vision, we trace this degradation to a structural bottleneck: current VLAs ask a single encoder to support both language-grounded semantics and control-relevant visual features, whereas biological vision separates recognition and visuomotor control into distinct pathways. Building on this view, we propose the Unified Action Model (UAM), which adds a parallel Dorsal Expert, an analog of the brain's dorsal pathway. To make the Dorsal Expert an effective second pathway and reduce the control-learning burden on the VLM, we initialize it from a pretrained generative model and train it with a mid-level reasoning objective that predicts visual dynamics. This design allows us to train the whole VLA end-to-end on action data alone: with no parameter freezing, no gradient stopping, and no auxiliary VL co-training, UAM retains over 95% of the underlying VLM's multimodal capability and at the same time achieves the highest average success rate among baselines on a variety of manipulation tasks that probe out-of-distribution generalization, including unseen objects, novel object-target compositions, and instruction variation. Together, these results suggest that semantic preservation in VLAs can emerge from architectural separation itself, rather than being enforced by frozen weights or auxiliary data replay, and that this preserved semantic capability can naturally transfer from VLMs to semantic generalization in actions.

2605.15694 2026-05-19 cs.LG

Going Beyond the Edge: Distributed Inference of Transformer Models on Ultra-Low-Power Wireless Devices

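Alexander Gräfe, Ding Huo, Vincent de Bakker, Johannes Berger, Marco Zimmerling, Sebastian Trimpe

AI summary: This paper proposes CATS, a framework for distributed transformer inference on ultra-low-power wireless devices that lets multiple devices collaboratively execute models far larger than a single device can handle. The approach co-designs transformer partitioning, wireless communication, and training: the SomeGather communication primitive reduces bandwidth and RAM usage, a partitioning method built on it provides efficient model parallelism, and message-dropout during training makes models robust to message loss.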

Abstract

Transformer models are rapidly becoming a cornerstone of modern Internet of Things (IoT) applications, yet their computational and memory demands far exceed the capabilities of a single typical ultra-low-power IoT device. We present CATS, a framework for distributed transformer inference on ultra-low-power wireless devices, enabling multiple devices to collaboratively execute models far larger than what a single device can sustain. At its core, CATS is a communication-aware distributed transformer inference scheme co-designed across transformer partitioning, wireless communication and training. It employs SomeGather, a new pruned communication primitive that selectively broadcasts activation columns to reduce communication bandwidth and RAM usage without sacrificing model accuracy. Building on SomeGather, we design a partitioning method that exploits this primitive for efficient model parallelism. To cope with unreliable wireless communication, CATS employs message-dropout during training, which mimics packet losses and yields models that are robust to message loss during inference. In real-world experiments, we show that CATS brings distributed transformer inference to ultra-low-power wireless devices for the first time, with deployments on up to 16 devices that collaboratively execute transformer models up to 14 times larger than what a single device can run.
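
The two communication ideas can be sketched in a few lines. Nothing below comes from the CATS implementation: the function names, the dictionary message format, and the choice to fill missing columns with zeros are all assumptions made to illustrate selective broadcasting and training-time message loss.

```python
import random


def some_gather(local_columns, selected):
    """Broadcast only the selected activation columns instead of all of them."""
    return {idx: local_columns[idx] for idx in selected}


def message_dropout(messages, drop_prob, rng):
    """Randomly drop whole messages to mimic packet loss during training."""
    return [msg if rng.random() >= drop_prob else None for msg in messages]


def merge(messages, num_columns, column_len):
    """Assemble received columns; missing ones are filled with zeros (an assumption)."""
    merged = [[0.0] * column_len for _ in range(num_columns)]
    for msg in messages:
        if msg is None:
            continue
        for idx, col in msg.items():
            merged[idx] = col
    return merged


# Two hypothetical devices, each holding two activation columns of length 2.
device0 = {0: [1.0, 2.0], 1: [3.0, 4.0]}
device1 = {2: [5.0, 6.0], 3: [7.0, 8.0]}
messages = [some_gather(device0, [0]), some_gather(device1, [3])]

received = merge(message_dropout(messages, drop_prob=0.0, rng=random.Random(0)), 4, 2)
print(received)
```

Only two of the four columns are transmitted here, and raising `drop_prob` during training would expose the model to the partially filled inputs it may see over a lossy link.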

2605.15641 2026-05-19 cs.RO cs.CR

Propagating Unsafe Actions in LLM Controlled Multi-Robot Collaboration via Single Robot Compromise

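Zhen Huang, Zhihuang Liu, Mengxuan Luo, Weishang Wu, Zhiping Cai

AI summary: This paper studies security in LLM-controlled multi-robot collaboration. It proposes a new attack paradigm in which the adversary interacts with only a single robot, which then propagates malicious intent and causes coordinated unsafe actions across the system. Three metrics quantify the process, and experiments show the attack is efficient and persistent.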

Comments Accepted by the 35th International Joint Conference on Artificial Intelligence (IJCAI 2026). 9 pages, 4 figures, 3 tables

Abstract

Large language models (LLMs) are increasingly used as general planners in embodied intelligence, enabling high-level coordination and low-level task planning for both single-robot and multi-robot collaboration. This increasing reliance on embodied LLM planners also raises critical security concerns, since misaligned or manipulated instructions can be translated into physical actions. Prior work has studied such threats in single-robot settings, while security risks in LLM-controlled multi-robot collaboration, especially those propagated through inter-robot communication, remain largely unexplored. To bridge this gap, we propose a novel attack paradigm for multi-robot systems in which the adversary interacts with only a single entry robot. The compromised robot then propagates malicious intent through peer communication, leading to coordinated unsafe actions across the system. Our evaluation, covering the high-risk dimensions of dereliction of duty, privacy compromise, and public safety hazards, reveals a persistent safety alignment gap in multi-robot planners. We quantify this process with three metrics: obedience, infectiousness, and stealthiness. Experiments demonstrate both persistent attacker control and rapid propagation: obedience reaches 1.00 in the strongest cases, and infectiousness rises to 0.90. Notably, the attack is highly efficient, requiring as few as 3.0 rounds to compromise all the robots while maintaining a stealthiness score of 0.81. Such risks are amplified when robots must resolve trade-offs in critical situations, such as emergencies or conflicts of rights, because the coordination mechanism can unintentionally allow adversarial instructions to override safety requirements. The code is available at https://github.com/TheFatInsect/InfectBot.

2605.15622 2026-05-19 cs.LG

Position: Zeroth-Order Optimization in Deep Learning Is Underexplored, Not Underpowered

位置:深度学习中零阶优化被低估,而非无能

Sijia Liu, Yicheng Lang, Soumyadeep Pal, Changsheng Wang, Yancheng Huang, Chongyu Fan, James Diffenderfer, Bhavya Kailkhura, Yihua Zhang

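AI summary: This position paper argues that zeroth-order (ZO) optimization in deep learning is underexplored rather than underpowered. It sets out six positions spanning the algorithmic, systems, and evaluation stack. It revisits the feasibility boundaries of estimator-centric ZO methods through variance control, variance-query trade-offs, and directional-derivative lenses, and it identifies three underexplored opportunities: subspace and spectral views of ZO, the forward-only nature of ZO as a systems advantage for communication-efficient training, and the need to de-obfuscate ZO evaluations from task complexity.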

Comments Accepted by ICML 2026 Position Paper Track as a Spotlight Paper

详情
AI中文摘要

零阶(ZO)优化,通过函数评估的有限差分来学习,由于其内存效率和适用于灰箱或黑箱管道的适用性,最近在深度学习中重新受到关注。然而,ZO方法往往被忽视,因为估计方差和不利的查询复杂性被认为是根本无法扩展的。我们主张这一结论可能是误导的:ZO优化是被低估的,而不是无能的。我们证明了许多看似限制性的因素源于短视的发展实践,尤其是全空间、元素-wise、估计器中心的设计。我们阐述了六个涵盖算法、系统和评估栈的立场。首先,我们通过方差控制、方差-查询权衡和方向导数视角重新审视估计器中心ZO方法的可行性边界。然后,我们识别出三个未被充分利用的机会:(i)子空间和谱观点的ZO,使通过优雅的查询扩展实现可解释的方差减少;(ii)ZO作为系统优势,为通信高效、管道友好的和资源受限的训练提供优势;(iii)需要去模糊化ZO评估与任务复杂性之间的关系。我们强烈倡导围绕ZO优化的独特优势重新思考,并采取相应行动,打开通往大规模、系统感知和资源高效学习的可行路径。

英文摘要

Zeroth-order (ZO) optimization, learning from finite differences of function evaluations without backpropagation, has recently regained attention in deep learning due to its memory efficiency and applicability to gray- or black-box pipelines. Yet, ZO methods are often dismissed as fundamentally unscalable because of estimator variance and unfavorable query complexity. We argue that this conclusion might be misguided: ZO optimization is underexplored, not underpowered. We show that many perceived limitations stem from myopic development practices, most notably full-space, element-wise, estimator-centric designs. We articulate six positions spanning the algorithmic, systems, and evaluation stack. First, we revisit the feasibility boundaries of estimator-centric ZO methods through variance control, variance-query tradeoffs, and directional-derivative lenses. Then, we identify three underexplored opportunities: (i) subspace and spectral views of ZO that enable interpretable variance reduction with graceful query scaling, (ii) the forward-only nature of ZO as a systems advantage for communication-efficient, pipeline-friendly, and resource-constrained training, and (iii) the need to de-obfuscate ZO evaluations from task complexity. We strongly advocate rethinking ZO optimization around its unique strengths and acting accordingly, opening a viable path toward large-scale, system-aware, and resource-efficient learning with ZO optimization.
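
The forward-only, finite-difference idea described in this abstract can be sketched in a few lines. This is a minimal illustration and not the authors' method: it assumes a randomized two-point (central-difference) estimator with Gaussian directions, and the names `zo_gradient`, `zo_sgd`, `mu`, and `num_queries` are hypothetical. Averaging over more directions per step lowers estimator variance at the cost of more function queries.

```python
import random

def zo_gradient(f, x, mu=1e-4, num_queries=20, rng=None):
    """Randomized two-point finite-difference gradient estimate (no backpropagation)."""
    rng = rng or random.Random(0)
    d = len(x)
    grad = [0.0] * d
    for _ in range(num_queries):
        # Random Gaussian direction (an assumption; other direction choices exist).
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        # Directional derivative along u, estimated from two function evaluations.
        slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)
        for i in range(d):
            grad[i] += slope * u[i] / num_queries
    return grad

def zo_sgd(f, x0, lr=0.05, steps=200, **estimator_kwargs):
    """Plain gradient descent driven only by function evaluations."""
    x = list(x0)
    rng = random.Random(0)
    for _ in range(steps):
        g = zo_gradient(f, x, rng=rng, **estimator_kwargs)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

def quadratic(x):
    """Toy objective with its minimum at x = (1, 1, 1)."""
    return sum((xi - 1.0) ** 2 for xi in x)

x_final = zo_sgd(quadratic, [0.0, 0.0, 0.0])
```

Each step here costs `2 * num_queries` function evaluations and stores no activations, which is the memory advantage the abstract refers to.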

2605.15586 2026-05-19 cs.LG cs.AI cs.CV

Embracing Biased Transition Matrices for Complementary-Label Learning with Many Classes

拥抱偏置转移矩阵以实现多类互补标签学习

Tan-Ha Mai, Chao-Kai Chiang, Han-Hwa Shih, Gang Niu, Masashi Sugiyama, Hsuan-Tien Lin

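AI summary: This paper proposes BICL (Bias-Induced Constrained Labeling), a framework that overcomes the limits of conventional complementary-label learning in many-class settings by deliberately designing a biased label-generation process. It achieves more than sevenfold accuracy improvements over traditional methods on CIFAR-100 and TinyImageNet-200.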

Comments 33 pages, 16 figures, 18 tables

详情
AI中文摘要

互补标签学习(CLL)是一种弱监督范式,其中实例被标记为不属于其类别的标签。尽管已有十年的研究,CLL方法主要在10类分类任务中具有竞争力,而扩展到大规模标签空间仍然是一个持久的瓶颈。这种限制源于传统方法对均匀标签生成的假设,这在多类设置中严重稀释了学习信号。在本文中,我们证明通过故意设计偏置(非均匀)的生成过程,将互补标签限制在类别的子集,可以克服这一长期存在的障碍。这一发现促使我们提出Bias-Induced Constrained Labeling(BICL),一个涵盖数据收集到训练的原理性框架,利用这种偏置。BICL在CIFAR-100和TinyImageNet-200上实现了有效学习,比传统方法的准确率提高了超过七倍。我们的发现为在现实应用中使CLL适用于多类问题开辟了新的道路。

英文摘要

Complementary-label learning (CLL) is a weakly supervised paradigm where instances are labeled with classes they do not belong to. Despite a decade of research, CLL methods remain competitive mainly on 10-class classification, with scaling to large label spaces continuing to be an enduring bottleneck. This limitation stems from the common assumption of uniform label generation in traditional methods, which fatally dilutes the learning signal in many-class settings. In this paper, we demonstrate that this long-standing barrier can be overcome by deliberately designing a biased (non-uniform) generation process that restricts complementary labels to a subset of classes. This finding motivates us to propose Bias-Induced Constrained Labeling (BICL), a principled framework spanning data collection to training that leverages this bias. BICL enables effective learning on CIFAR-100 and TinyImageNet-200, achieving more than sevenfold accuracy improvements over traditional methods. Our findings establish a new trajectory for making CLL feasible for many classes in real-world applications.

2605.15508 2026-05-19 cs.LG cs.CL

STS: Efficient Sparse Attention with Speculative Token Sparsity

STS: 高效稀疏注意力与推测性标记稀疏性

Ceyu Xu, Jiangnan Yu, Yongji Wu, Yuan Xie

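AI summary: This paper proposes STS, a sparse attention mechanism that requires no model retraining. It uses the tokens that a smaller draft model identifies as important to predict the important tokens for a larger target model, enabling efficient sparse attention in LLM inference with a substantial speedup and negligible accuracy loss.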

Comments 14 pages, 12 figures

详情
AI中文摘要

注意力的二次复杂性对大型语言模型(LLM)推理造成了严重的内存和计算瓶颈。这一挑战在新兴的代理应用中尤为突出,这些应用需要处理数百万标记序列。我们提出STS,一种稀疏注意力机制,无需模型再训练。STS利用关键洞察:由较小的草稿模型识别出的重要标记对更大目标模型的重要标记具有高度预测性。通过整合到推测解码框架中,STS将草稿模型的注意力分数重新利用,动态构建标记和头部层面的稀疏性掩码。该掩码有效剪枝目标LLM中的昂贵注意力计算。我们的评估显示,STS在代表性的基准NarrativeQA上实现了约90%稀疏度下的2.67倍加速,与密集注意力相比,准确性降解可忽略不计。STS在稀疏性与准确性权衡上建立了新的状态-of-the-art,通过在给定准确性预算下实现更高的稀疏度水平,优于先前技术。

英文摘要

The quadratic complexity of attention imposes severe memory and computational bottlenecks on Large Language Model (LLM) inference. This challenge is particularly acute for emerging agentic applications that require processing multi-million-token sequences. We propose STS, a sparse attention mechanism that requires no model retraining. STS leverages the key insight that tokens identified as important by a smaller draft model are highly predictive of important tokens for a larger target model. By integrating into speculative decoding frameworks, STS repurposes the draft model's attention scores to dynamically construct a token-and-head-wise sparsity mask. This mask effectively prunes the expensive attention computation in the target LLM. Our evaluation shows that STS achieves a 2.67x speedup operating at approximately 90% sparsity on the representative NarrativeQA benchmark while maintaining negligible accuracy degradation compared to dense attention. STS establishes a new state-of-the-art on the sparsity-accuracy trade-off, outperforming prior techniques by enabling higher sparsity levels for a given accuracy budget.
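
A token-and-head-wise sparsity mask built from draft attention scores could look roughly like the sketch below. The abstract does not give the selection rule or how draft heads map to target heads, so the per-head top-k rule, the one-to-one head mapping, and the names `topk_mask` and `masked_softmax` are all assumptions made for illustration.

```python
import math

def topk_mask(draft_scores, k):
    """draft_scores[h][t]: draft-model attention weight of head h on cached token t.
    Returns mask[h][t] = True for the k tokens per head that the target keeps."""
    mask = []
    for head_scores in draft_scores:
        order = sorted(range(len(head_scores)), key=lambda t: head_scores[t], reverse=True)
        keep = set(order[:k])
        mask.append([t in keep for t in range(len(head_scores))])
    return mask

def masked_softmax(logits, mask_row):
    """Softmax over the kept positions only; pruned positions get weight 0."""
    kept = [x for x, m in zip(logits, mask_row) if m]
    top = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(x - top) if m else 0.0 for x, m in zip(logits, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

# Two draft heads over ten cached tokens (made-up numbers).
draft_scores = [
    [0.50, 0.05, 0.05, 0.02, 0.02, 0.02, 0.02, 0.02, 0.20, 0.10],
    [0.03, 0.40, 0.03, 0.03, 0.03, 0.03, 0.03, 0.30, 0.06, 0.06],
]
mask = topk_mask(draft_scores, k=2)  # keeps 2 of 10 tokens per head, i.e. 80% sparsity

# Target-model attention logits for head 0 (also made up).
target_logits = [1.0, 0.2, -0.5, 0.0, 0.3, -1.0, 0.1, 0.4, 2.0, 0.5]
weights = masked_softmax(target_logits, mask[0])
```

In a real kernel the saving comes from never computing the pruned positions; the dense loop here only shows which positions survive.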

2605.15487 2026-05-19 cs.LG cs.CV eess.IV

Learning Normalized Energy Models for Linear Inverse Problems

学习归一化能量模型以解决线性逆问题

Nicolas Zilberstein, Santiago Segarra, Eero Simoncelli, Florentin Guth

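AI summary: This paper proposes a new energy-based model for linear inverse problems, trained for denoising with a covariance-based regularization term that enforces consistency across measurement conditions. The trained model computes normalized posterior densities without additional retraining or fine-tuning, and it enables energy-guided adaptive sampling, unbiased Metropolis-Hastings correction steps, and blind estimation of the degradation operator via Bayes' rule.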

Comments ICML 2026

详情
Journal ref
Int'l Conf Machine Learning (ICML), Jul 2026. https://openreview.net/forum?id=PlFJwgaaDK
AI中文摘要

生成扩散模型可以为成像中的逆问题提供强大的先验概率模型,但现有实现存在两个关键限制:(i) 先验密度以隐式方式表示,(ii) 它们依赖于似然近似,这会引入采样偏见。我们通过引入一种新的能量模型来解决这些挑战,该模型针对去噪进行了训练,并引入了基于协方差的正则化项,以确保在不同测量条件下的一致性。训练后的模型能够为各种线性逆问题计算归一化的后验密度,而无需额外的重新训练或微调。除了保留扩散模型的采样能力外,这还使以前不可用的能力得以实现:能量引导的自适应采样,可以实时调整采样计划,无偏的Metropolis-Hastings修正步骤,以及通过贝叶斯规则估计退化算子。我们验证了该方法在多个数据集(ImageNet、CelebA、AFHQ)和任务(修复、去模糊)上的性能,证明了其与现有基线相比具有竞争力或更优的表现。

英文摘要

Generative diffusion models can provide powerful prior probability models for inverse problems in imaging, but existing implementations suffer from two key limitations: (i) the prior density is represented implicitly, and (ii) they rely on likelihood approximations that introduce sampling biases. We address these challenges by introducing a new energy-based model trained for denoising with a covariance-based regularization term that enforces consistency across different measurement conditions. The trained model can compute normalized posterior densities for diverse linear inverse problems, without additional retraining or fine-tuning. In addition to preserving the sampling capabilities of diffusion models, this enables previously unavailable capabilities: energy-guided adaptive sampling that adjusts schedules on the fly, unbiased Metropolis-Hastings correction steps, and blind estimation of the degradation operator via Bayes' rule. We validate the method on multiple datasets (ImageNet, CelebA, AFHQ) and tasks (inpainting, deblurring), demonstrating competitive or superior performance to established baselines.
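
A Metropolis-Hastings correction needs the ratio of target densities at the current and proposed points, which an explicit energy supplies directly. The sketch below shows only the generic acceptance rule for a density proportional to exp(-E(x)); the function name and arguments are hypothetical, and the abstract does not specify the paper's proposal distribution or sampler.

```python
import math

def mh_accept_prob(energy_current, energy_proposal, log_q_forward=0.0, log_q_backward=0.0):
    """Acceptance probability for a move x -> x' when p(x) is proportional to exp(-E(x)).

    log_q_forward  = log q(x' | x), the log-density of proposing x' from x
    log_q_backward = log q(x | x'), the log-density of the reverse proposal
    Both default to 0.0, which corresponds to a symmetric proposal.
    """
    log_ratio = (energy_current - energy_proposal) + (log_q_backward - log_q_forward)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)
```

With a symmetric proposal, a move to lower energy is always accepted, and a move that raises the energy by ln 2 is accepted half the time.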

2605.15377 2026-05-19 cs.AI

Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute

为AI控制的集束监控:多样信号胜过更多计算

Eugene Koran, Yejun Yun, Samantha Tetef, Benjamin Arnav, Pablo Bernabeu-Pérez

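AI summary: This paper studies combining signals from diverse monitors to improve detection of misaligned AI actions. Diverse ensembles outperform both individual monitors and homogeneous ensembles, and fine-tuned monitors appear in every top-performing ensemble.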

详情
AI中文摘要

随着AI系统在大规模自主代理环境中越来越广泛地部署,确保它们采取的安全和符合用户意图的行为变得至关重要。监控代理行为是关键的安全机制,但可靠的监控仍然难以构建,而系统规模使人类监督变得不切实际。我们证明,将来自不同监控器的信号组合成一个集合可以提高检测偏离行为的能力。我们使用提示和微调策略构建了12个GPT-4.1-Mini监控器。我们在编码任务中评估了它们,其中候选解决方案通过标准测试但失败于对抗性输入。在这种情况下,多样化的集合优于单个监控器和同质的集合。我们的最佳3监控集合在检测性能上比由三个相同监控器组成的集合提高了2.4倍,且在独立数据集上表现强劲。我们认为这些结果表明,收益来自于多样性而不是规模。最佳集合结合了强个体表现和监控器之间低相关性。此外,微调的监控器出现在每一个表现最好的集合中,并且在非分布攻击类型上保持了这一优势,表明微调能够激发检测能力,而提示单独无法做到。这些结果支持集合监控作为一种实用的AI控制策略,以在合理的推理成本下获得安全收益。

英文摘要

As AI systems are increasingly deployed in autonomous agentic settings at scale, it is important to ensure the actions they take are safe and aligned with user intent. Monitoring agent actions is a key safety mechanism, yet reliable monitors remain difficult to build and the scale of these systems makes human oversight impractical. We show that combining signals from diverse monitors into an ensemble improves detection of misaligned actions. We build 12 GPT-4.1-Mini monitors using both prompting and fine-tuning strategies. We evaluate them on coding tasks where candidate solutions pass standard tests but fail on adversarial inputs. In this setting, diverse ensembles outperform both individual monitors and homogeneous ensembles. Our best 3-monitor ensemble achieves a 2.4x greater detection performance gain than an ensemble composed of three identical monitors, with the same ensemble performing strongly on an independent dataset. We contend that these results show that diversity, not scale, drives gains. The best ensembles combine strong individual performance with low correlation between monitors. Furthermore, fine-tuned monitors appear in every top-performing ensemble and maintain this advantage on out-of-distribution attack types, suggesting that fine-tuning enables detection capabilities that prompting alone does not elicit. These results support ensemble monitoring as a practical AI control strategy for safety gains at reasonable inference costs.
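
The link between low correlation and ensemble gains can be illustrated with a toy example. The abstract does not say how monitor scores are aggregated, so averaging is an assumption here, and the scores, labels, and function names are invented for the illustration.

```python
def pearson(a, b):
    """Pearson correlation between two equally long score lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def ensemble_scores(monitor_scores):
    """Average the per-sample suspicion scores of several monitors (assumed aggregator)."""
    return [sum(col) / len(col) for col in zip(*monitor_scores)]

def separates(scores, labels):
    """True if every misaligned sample (label 1) scores above every benign one (label 0)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    return min(positives) > max(negatives)

labels = [1, 1, 0, 0]
monitor_a = [0.9, 0.4, 0.5, 0.1]  # under-scores the second misaligned sample
monitor_b = [0.4, 0.9, 0.1, 0.5]  # under-scores the first misaligned sample

homogeneous = ensemble_scores([monitor_a, monitor_a])
diverse = ensemble_scores([monitor_a, monitor_b])
```

Because the two monitors make different mistakes, their average separates the classes, while averaging a monitor with a copy of itself changes nothing.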

2605.15177 2026-05-19 cs.AI

OpenDeepThink: Parallel Reasoning via Bradley-Terry Aggregation

OpenDeepThink: 通过布拉德利-蒂尔利聚合实现并行推理

Shang Zhou, Wenhao Chai, Kaiyuan Liu, Huanzhi Mao, Qiuyang Mang, Jingbo Shang

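AI summary: This work proposes OpenDeepThink, a population-based test-time compute framework that samples candidates in parallel and addresses the resulting selection bottleneck by aggregating pairwise LLM judgments with a Bradley-Terry model, improving LLM reasoning performance on benchmarks such as Codeforces.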

Comments 19 pages, 4 figures

详情
AI中文摘要

测试时计算扩展是提高大语言模型推理能力的主要方向。现有方法主要通过扩展单个推理轨迹来扩展深度,而通过并行采样多个候选方案来扩展广度则较为简单,但会引入选择瓶颈:在没有地面真相验证器的情况下选择最佳候选方案,因为点wise LLM判断是嘈杂且有偏见的。为了解决这个问题,我们引入了OpenDeepThink,一种基于种群的测试时计算框架,通过成对的布拉德利-蒂尔利比较来选择。每次生成中,LLM随机判断候选方案对并利用布拉德利-蒂尔利聚合生成全局排名;排名最高的候选方案被保留,前四分之三的方案通过自然语言批评进行变异;后四分之一的方案被丢弃。OpenDeepThink在八个连续的LLM调用轮次中(约27分钟实时时钟时间)将Gemini 3.1 Pro的Codeforces Elo有效提升405分。该流程在较弱和较强模型之间转移时无需重新训练,并在多领域HLE基准测试中,收益集中在客观可验证的领域,而在主观领域则相反。我们发布了CF-73,一个包含73个专家评分的Codeforces问题的精选集,具有国际大师注释,并且与官方判决的本地评估一致性达到99%。

英文摘要

Test-time compute scaling is a primary axis for improving LLM reasoning. Existing methods primarily scale depth by extending a single reasoning trace. Scaling breadth by sampling multiple candidates in parallel is straightforward, but introduces a selection bottleneck: choosing the best candidate without a ground-truth verifier, since pointwise LLM judging is noisy and biased. To address this, we introduce OpenDeepThink, a population-based test-time compute framework that selects via pairwise Bradley-Terry comparison. Each generation, the LLM judges random pairs of candidates and aggregates votes via Bradley-Terry into a global ranking; top-ranked candidates are preserved and the top three quarters are mutated using the natural-language critiques produced during comparison; the bottom quarter is discarded. OpenDeepThink raises Gemini 3.1 Pro's effective Codeforces Elo by +405 points in eight sequential LLM-call rounds (~27 minutes wall-clock). The pipeline transfers across weaker and stronger models without retuning, and on the multi-domain HLE benchmark, gains appear concentrated in objectively verifiable domains and reverse in subjective ones. We release CF-73, a curated set of 73 expert-rated Codeforces problems with International Grandmaster annotation and 99% local-evaluation agreement against the official verdict.
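
Bradley-Terry aggregation turns a set of pairwise outcomes into one global ranking. The sketch below fits the model with the standard minorization-maximization update; the small `prior` pseudo-count, the function name, and the example comparisons are assumptions for illustration, and the abstract does not say which fitting procedure OpenDeepThink uses.

```python
def bradley_terry(n_items, comparisons, iters=200, prior=0.1):
    """Fit Bradley-Terry strengths from (winner, loser) index pairs.

    A small pseudo-count `prior` is added to every ordered pair so that
    candidates with no wins still get a positive strength (an assumption).
    Returns strengths that sum to 1.
    """
    wins = [[prior if i != j else 0.0 for j in range(n_items)] for i in range(n_items)]
    for winner, loser in comparisons:
        wins[winner][loser] += 1.0
    p = [1.0 / n_items] * n_items
    for _ in range(iters):
        new_p = []
        for i in range(n_items):
            total_wins = sum(wins[i])
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n_items) if j != i)
            new_p.append(total_wins / denom)
        norm = sum(new_p)
        p = [v / norm for v in new_p]
    return p

# Four candidates judged in random pairs (made-up outcomes).
comparisons = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (0, 1), (1, 2), (2, 3)]
strengths = bradley_terry(4, comparisons)
ranking = sorted(range(4), key=lambda i: strengths[i], reverse=True)

# Following the abstract, the bottom quarter of the ranking is discarded.
survivors = ranking[: len(ranking) - len(ranking) // 4]
```

Pairwise fitting uses every comparison jointly, so a candidate that only beat weak opponents is not ranked above one that beat strong ones.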

2605.14133 2026-05-19 cs.AI

ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

ClawForge: 为命令行代理生成可执行的交互式基准测试

Yuxiang Lai, Peng Xia, Haonian Ji, Kaiwen Xiong, Kaide Zeng, Jiaqi Liu, Fang Wu, Jike Zhong, Zeyu Zheng, Cihang Xie, Huaxiu Yao

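AI summary: ClawForge generates executable interactive benchmarks to reconcile scalable construction with realistic workflow evaluation, systematically testing how command-line agents handle pre-existing state conflicts.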

详情
AI中文摘要

交互式代理基准测试面临可扩展性构建与真实工作流评估之间的张力。手工编写的任务扩展和修改成本高,而静态提示评估忽略了只有在代理在持久状态上操作时才会出现的失败。现有的交互式基准测试已显著提升了代理评估,但大多数初始化任务从干净的状态开始,没有系统测试代理如何处理已存在的部分、过时或冲突的物品。我们提出了ClawForge,一个基于生成器的可执行命令行工作流基准测试框架,在状态冲突下。该框架将场景模板、扎根槽位、初始化状态、参考轨迹和验证器编译成可重复的任务规范,并通过归一化的终端状态和可观测的副作用逐步评估代理,而不是精确轨迹匹配。我们实例化该框架为ClawForge-Bench(17个场景,6个能力类别)。在七个前沿模型上的结果表明,最佳模型仅达到45.3%的严格准确率,错误状态替换在所有模型中低于17%,最宽的模型分离(17%到90%)由代理在行动前是否检查现有状态决定。部分信用和步骤效率分析进一步揭示了许多失败是近似关闭而非早期崩溃,且在状态冲突下模型表现出不同的失败风格。

英文摘要

Interactive agent benchmarks face a tension between scalable construction and realistic workflow evaluation. Hand-authored tasks are expensive to extend and revise, while static prompt evaluation misses failures that only appear when agents operate over persistent state. Existing interactive benchmarks have advanced agent evaluation significantly, but most initialize tasks from clean state and do not systematically test how agents handle pre-existing partial, stale, or conflicting artifacts. We present ClawForge, a generator-backed benchmark framework for executable command-line workflows under state conflict. The framework compiles scenario templates, grounded slots, initialized state, reference trajectories, and validators into reproducible task specifications, and evaluates agents step by step over persistent workflow surfaces using normalized end state and observable side effects rather than exact trajectory matching. We instantiate this framework as ClawForge-Bench (17 scenarios, 6 ability categories). Results across seven frontier models show that the best model reaches only 45.3% strict accuracy, wrong-state replacement remains below 17% for all models, and the widest model separation (17% to 90%) is driven by whether agents inspect existing state before acting. Partial-credit and step-efficiency analyses further reveal that many failures are near-miss closures rather than early breakdowns, and that models exhibit qualitatively different failure styles under state conflict.

2605.14038 2026-05-19 cs.AI

Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use

模型适应性工具必要性揭示了大语言模型工具使用中的知行差距

Yize Cheng, Chenrui Fan, Mahdi JafariRaviz, Keivan Rezaei, Soheil Feizi

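AI summary: This paper studies when LLMs actually need external tools. It proposes a model-adaptive definition of tool necessity grounded in each model's own empirical performance. Comparing four models on arithmetic and factual QA datasets, it finds substantial mismatches between tool necessity and observed tool-call behavior, revealing a knowing-doing gap in LLM tool use.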

详情
AI中文摘要

大型语言模型(LLMs)越来越多地作为自主代理,必须决定何时直接回答问题,何时调用外部工具。先前研究大多将工具必要性视为模型无关的属性,由人类或LLM判断者标注,主要涵盖答案明显的情况(例如获取天气与改写文本)。然而,现实中的工具必要性更为复杂,因为不同模型的能力边界存在分歧:一个强模型可以单独解决的问题,可能仍需要工具帮助弱模型。在本文中,我们引入了基于每个模型实证性能的模型适应性工具必要性定义。随后,我们比较了四个模型在算术和事实性问答数据集上的必要性与观察到的工具调用行为,发现存在26.5-54.0%和30.8-41.8%的显著不匹配。为了诊断失败,我们将工具使用分解为两个阶段:内部认知阶段,反映模型是否认为需要工具;执行阶段,决定模型是否实际做出调用动作。通过探测LLM隐藏状态,我们发现这两种信号往往可以线性解码,但它们的探测方向在晚期层、最后token的范围内几乎正交。通过追踪样本在两个阶段过程中的轨迹,我们进一步发现,大多数不匹配集中在认知到行动的转换过程中,而非认知本身。这些结果揭示了LLM工具使用中的知行差距:提高工具使用可靠性不仅需要更好的识别何时需要工具,还需要更好的将这种识别转化为行动。

英文摘要

Large language models (LLMs) increasingly act as autonomous agents that must decide when to answer directly vs. when to invoke external tools. Prior work studying adaptive tool use has largely treated tool necessity as a model-agnostic property, annotated by human or LLM judges, and has mostly covered cases where the answer is obvious (e.g., fetching the weather vs. paraphrasing text). However, tool necessity in the wild is more nuanced due to the divergence of capability boundaries across models: a problem solvable by a strong model on its own may still require tools for a weaker one. In this work, we introduce a model-adaptive definition of tool necessity, grounded in each model's empirical performance. Following this definition, we compare the necessity against observed tool-call behavior across four models on arithmetic and factual QA datasets, and find substantial mismatches of 26.5-54.0% and 30.8-41.8%, respectively. To diagnose the failure, we decompose tool use into two stages: an internal cognition stage that reflects whether a model believes a tool is necessary, and an execution stage that determines whether the model actually makes a tool-call action. By probing the LLM hidden states, we find that both signals are often linearly decodable, yet their probe directions become nearly orthogonal in the late-layer, last-token regime that drives the next-token action. By tracing the trajectory of samples in the two-stage process, we further discover that the majority of mismatch is concentrated in the cognition-to-action transition, not in cognition itself. These results reveal a knowing-doing gap in LLM tool use: improving tool-use reliability requires not only better recognition of when tools are needed, but also better translation of that recognition into action.

2605.14005 2026-05-19 cs.CL cs.LG

Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding

毒藤:针对推测解码的隐秘加速-崩溃攻击

Shuoyang Sun, Chang Dai, Hao Fang, Kuofeng Gao, Xinhao Zhong, Yi Sun, Fan Mo, Shu-Tao Xia, Bin Chen

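AI summary: This paper proposes Mistletoe, an attack that jointly optimizes a degradation objective and a semantic-preservation objective to stealthily reduce the average accepted length τ of speculative decoding, collapsing its speedup while preserving output quality.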

详情
AI中文摘要

推测解码已成为加速大型语言模型(LLM)推理的广泛采用技术,通过并行生成多个候选token并用目标模型验证。然而,其效率关键依赖于平均接受长度τ,即每个验证步骤中多少候选token能被接受。本文识别了基于模型的推测解码中的新机制层漏洞:drafter被训练去近似目标模型分布,但这种近似不可避免地不完美。这种drafter-目标不匹配创造了一个隐藏的攻击面,其中小扰动可以保持目标模型的可见行为,同时显著降低候选token的接受性。我们提出Mistletoe,一种针对推测解码的隐秘加速-崩溃攻击。Mistletoe直接针对推测解码的接受机制。它联合优化一个降质目标,以减少drafter-目标的一致性,以及一个语义保留目标,以约束目标模型的输出分布。为了解决这两个目标之间的冲突,我们引入了一个null-space投影机制,其中降质梯度被投影到局部语义保留方向之外,从而抑制候选token的接受,同时最小化语义漂移。在各种推测解码系统上的实验表明,Mistletoe显著降低了平均接受长度τ,崩溃速度提升,并降低了平均token吞吐量,同时保持输出质量和困惑度。我们的工作强调推测解码引入了超越现有输出鲁棒性的机制层攻击面,呼吁对LLM加速系统进行更鲁棒的设计。

英文摘要

Speculative decoding has become a widely adopted technique for accelerating large language model (LLM) inference by drafting multiple candidate tokens and verifying them with a target model in parallel. Its efficiency, however, critically depends on the average accepted length $\tau$, i.e., how many draft tokens survive each verification step. In this work, we identify a new mechanism-level vulnerability in model-based speculative decoding: the drafter is trained to approximate the target model distribution, but this approximation is inevitably imperfect. Such a drafter-target mismatch creates a hidden attack surface where small perturbations can preserve the target model's visible behavior while substantially reducing draft-token acceptability. We propose Mistletoe, a stealthy acceleration-collapse attack against speculative decoding. Mistletoe directly targets the acceptance mechanism of speculative decoding. It jointly optimizes a degradation objective that decreases drafter-target agreement and a semantic-preservation objective that constrains the target model's output distribution. To resolve the conflict between these objectives, we introduce a null-space projection mechanism, where degradation gradients are projected away from the local semantic-preserving direction, suppressing draft acceptance while minimizing semantic drift. Experiments on various speculative decoding systems show that Mistletoe substantially reduces average accepted length $\tau$, collapses speedup, and lowers average token throughput, while preserving output quality and perplexity. Our work highlights that speculative decoding introduces a mechanism-level attack surface beyond existing output robustness, calling for more robust designs of LLM acceleration systems.
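
The gradient projection mentioned in this abstract can be pictured with a rank-one case: remove from the degradation gradient its component along the semantic-preservation gradient, so that to first order the update leaves the semantic objective unchanged. This is an interpretation of the abstract, not the paper's exact formulation; the single-direction projection and the name `project_out` are assumptions.

```python
def project_out(grad_degrade, grad_semantic, eps=1e-12):
    """Remove from grad_degrade its component along grad_semantic.

    The result is orthogonal to grad_semantic, so a small step along it does
    not change the semantic objective to first order.
    """
    dot = sum(a * b for a, b in zip(grad_degrade, grad_semantic))
    norm_sq = sum(b * b for b in grad_semantic)
    if norm_sq < eps:
        # No meaningful semantic direction: leave the gradient unchanged.
        return list(grad_degrade)
    scale = dot / norm_sq
    return [a - scale * b for a, b in zip(grad_degrade, grad_semantic)]

projected = project_out([1.0, 1.0], [0.0, 2.0])
```

In the two-dimensional example, the semantic gradient points along the second axis, so only the first component of the degradation gradient survives.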

2605.13415 2026-05-19 cs.CL cs.AI cs.LG

KIT-TIP-NLP at MultiPride: Continual Learning with Multilingual Foundation Model

KIT-TIP-NLP 在 MultiPride 上的持续学习:多语言基础模型

Barathi Ganesh HB, Michal Ptaszynski, Rene Melendez, Juuso Eronen

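AI summary: This paper presents a multi-stage framework for detecting reclaimed slurs in multilingual social media. It addresses the challenge of distinguishing reclamatory from non-reclamatory uses of LGBTQ+-related slurs in English, Spanish, and Italian tweets. The framework combines data-driven model selection, semantic-preserving augmentation, inductive transfer learning, and domain-specific knowledge injection to handle cross-linguistic variation in how sentiment is expressed.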

Comments Final Workshop of the 9th evaluation campaign EVALITA 2026

详情
AI中文摘要

本文提出了一种多阶段框架,用于检测多语言社交媒体中重新使用的侮辱性语言。该框架解决了在英语、西班牙语和意大利语推文中识别重新使用与非重新使用LGBTQ+相关侮辱性语言的挑战。该框架处理了三个交织的方法学挑战:数据稀缺、类别不平衡和跨语言的情感表达差异。该框架整合了通过交叉验证的数据驱动模型选择、通过回译的语义保留增强、具有动态周期级欠采样的归纳迁移学习,以及通过掩码语言模型注入的领域特定知识。系统评估了八个多语言嵌入模型,XLM-RoBERTa被选为基础模型,基于宏平均F1分数。通过GPT-4o-mini回译进行的数据增强有效将训练语料库增加了三倍,同时保留了语义内容和类别分布比例。该框架生成了四个最终运行用于评估,其中RUN 1是带有增强和欠采样的归纳迁移学习,RUN 2是带有掩码语言模型预训练,RUN 3和RUN 4是通过语言特定决策阈值优化的先前预测。语言特定的阈值优化表明,最优决策边界在不同语言中存在显著差异。这反映了模型置信度分数的分布差异和重新使用语言使用的语言差异。基于阈值的优化在不需模型重新训练的情况下,带来了2-5%的绝对F1提升。该方法完全可复现,所有代码和实验设置可在https://github.com/rbg-research/MultiPRIDE-Evalita-2026上找到。

英文摘要

This paper presents a multi-stage framework for detecting reclaimed slurs in multilingual social media discourse. It addresses the challenge of identifying reclamatory versus non-reclamatory usage of LGBTQ+-related slurs across English, Spanish, and Italian tweets. The framework handles three intertwined methodological challenges: data scarcity, class imbalance, and cross-linguistic variation in sentiment expression. It integrates data-driven model selection via cross-validation, semantic-preserving augmentation through back-translation, inductive transfer learning with dynamic epoch-level undersampling, and domain-specific knowledge injection via masked language modeling. Eight multilingual embedding models were evaluated systematically, with XLM-RoBERTa selected as the foundation model based on macro-averaged F1 score. Data augmentation via GPT-4o-mini back-translation to alternate languages effectively tripled the training corpus while preserving semantic content and class distribution ratios. The framework produces four final runs for evaluation: RUN 1 uses inductive transfer learning with augmentation and undersampling, RUN 2 uses masked language modeling pre-training, and RUN 3 and RUN 4 are the previous predictions refined with language-specific decision thresholds optimized via ROC analysis. Language-specific threshold refinement reveals that optimal decision boundaries vary significantly across languages. This reflects distributional differences in model confidence scores and linguistic variation in reclamatory language usage. The threshold-based optimization yields a 2-5% absolute F1 improvement without requiring model retraining. The methodology is fully reproducible, with all code and experimental setup available at https://github.com/rbg-research/MultiPRIDE-Evalita-2026.
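
Per-language threshold selection from ROC analysis can be sketched as follows. The abstract does not name the selection criterion, so Youden's J statistic (true positive rate minus false positive rate) is an assumption here, and the scores and labels are invented.

```python
def best_threshold(scores, labels):
    """Pick the score threshold that maximizes Youden's J = TPR - FPR.

    Assumes `labels` contains at least one positive (1) and one negative (0).
    A sample is predicted positive when its score is >= the threshold.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / pos - fp / neg
        if j > best_j:
            best_t, best_j = t, j
    return best_t

# Made-up validation scores per language: (model confidences, gold labels).
validation = {
    "en": ([0.2, 0.4, 0.6, 0.8], [0, 0, 1, 1]),
    "it": ([0.1, 0.3, 0.35, 0.9], [0, 1, 1, 1]),
}
thresholds = {lang: best_threshold(s, y) for lang, (s, y) in validation.items()}
```

Because the thresholds are chosen on already-computed confidence scores, this step needs no retraining, which matches the abstract's description.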

2605.11975 2026-05-19 cs.LG

Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning

随机最小成本到达-避免强化学习

Jingduo Pan, Taoran Wu, Yiling Xue, Bai Xue

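AI summary: This paper studies stochastic minimum-cost reach-avoid reinforcement learning and proposes a method that minimizes expected cumulative cost while satisfying a reach-avoid specification with probability at least p. By introducing reach-avoid probability certificates (RAPCs) and a contraction-based Bellman formulation, the method integrates reach-avoid considerations into reinforcement learning and enables cost optimization under probabilistic constraints.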

Comments Accepted at the Forty-third International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

我们研究了随机最小成本到达-避免强化学习,其中智能体必须在概率至少p的情况下满足到达-避免规范,同时在随机环境中最小化预期累积成本。现有的安全和约束强化学习方法通常无法在随机环境中联合强制概率到达-避免约束并优化成本。为了解决这一挑战,我们引入了到达-避免概率证书(RAPCs),这些证书可以识别出从哪些状态可以满足随机到达-避免约束。基于RAPCs,我们开发了一种基于收缩的Bellman公式,该公式作为一种原理性的替代方法,用于将到达-避免考虑整合到强化学习中,从而在概率约束下实现成本优化。我们建立了所提出算法在结果目标下几乎确定收敛到局部最优策略。在MuJoCo模拟器中的实验显示了改进的成本性能和一致更高的到达-避免满足率。

英文摘要

We study stochastic minimum-cost reach-avoid reinforcement learning, where an agent must satisfy a reach-avoid specification with probability at least $p$ while minimizing expected cumulative costs in stochastic environments. Existing safe and constrained reinforcement learning methods typically fail to jointly enforce probabilistic reach-avoid constraints and optimize cost in the learning setting in stochastic environments. To address this challenge, we introduce reach-avoid probability certificates (RAPCs), which identify states from which stochastic reach-avoid constraints are satisfiable. Building on RAPCs, we develop a contraction-based Bellman formulation that serves as a principled surrogate for integrating reach-avoid considerations into reinforcement learning, enabling cost optimization under probabilistic constraints. We establish almost sure convergence of the proposed algorithms to locally optimal policies with respect to the resulting objective. Experiments in the MuJoCo simulator demonstrate improved cost performance and consistently higher reach-avoid satisfaction rates.

2605.11854 2026-05-19 cs.CL

Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models

自蒸馏轨迹感知玻尔兹曼建模:弥合扩散语言模型训练-推理差异

Kecheng Chen, Ziru Liu, Xijia Tao, Hui Liu, Yibing Liu, Xinyu Fu, Shi Wu, Suiyun Zhang, Dandan Tu, Lingpeng Kong, Rui Liu, Haoliang Li

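AI summary: This paper asks how self-distilled trajectories can be used for genuine knowledge acquisition rather than only for faster inference. It proposes TABOM, which models the inference unmasking preference with a Boltzmann distribution, improving diffusion language models in new domains and mitigating catastrophic forgetting.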

Comments Project website: https://tonyckc.github.io/TABOM-web/

详情
AI中文摘要

扩散语言模型(DLMs)最近涌现出作为自回归语言模型的有希望的替代方案,提供了更强的全局意识和高度并行的生成能力。然而,使用标准负证据下界(NELBO)基于的监督微调对DLMs进行后训练仍然效率低下:训练过程在单步中重建随机掩码的token,而推理则遵循一种由置信度引导的、多步易到难的去噪轨迹。最近的基于轨迹的自蒸馏方法主要利用这些推理轨迹来压缩和加速采样步骤,通常在不显著增强模型基础能力的情况下提高解码效率,甚至在完整扩散解码下可能降低性能。在本文中,我们问自蒸馏轨迹是否可以不仅仅用于更快的推理,而是用于真正的知识获取。虽然这些轨迹位于预训练DLM自身的分布流形上,因此提供了潜在的更低优化障碍,但我们发现使用标准NELBO目标直接微调仅能获得微小的提升。为了解决这一限制,我们提出了轨迹对齐优化通过玻尔兹曼建模(TABOM),一种基于轨迹的自蒸馏后训练框架,使训练与推理的易到难结构对齐。TABOM将推理解屏蔽偏好建模为预测熵上的玻尔兹曼分布,并推导出一个可计算的成对排名目标,以使模型的确定性顺序与观察到的解码轨迹对齐。经验上,TABOM在新领域中实现了显著的提升,扩展了DLMs的有效知识边界,并与标准SFT相比显著缓解了灾难性遗忘。

英文摘要

Diffusion Language Models (DLMs) have recently emerged as a promising alternative to autoregressive language models, offering stronger global awareness and highly parallel generation. However, post-training DLMs with standard Negative Evidence Lower Bound (NELBO)-based supervised fine-tuning remains inefficient: training reconstructs randomly masked tokens in a single step, whereas inference follows a confidence-guided, multi-step easy-to-hard denoising trajectory. Recent trajectory-based self-distillation methods exploit such inference trajectories mainly for sampling-step compression and acceleration, often improving decoding efficiency without substantially enhancing the model's underlying capability, and may even degrade performance under full diffusion decoding. In this work, we ask whether self-distilled trajectories can be used not merely for faster inference, but for genuine knowledge acquisition. Although these trajectories lie on the pretrained DLM's own distributional manifold and thus offer a potentially lower optimization barrier, we find that naively fine-tuning on them with standard NELBO objectives yields only marginal gains. To address this limitation, we propose Trajectory-Aligned optimization via Boltzmann Modeling (TABOM), a self-distilled trajectory-based post-training framework that aligns training with the easy-to-hard structure of inference. TABOM models the inference unmasking preference as a Boltzmann distribution over predictive entropies and derives a tractable pairwise ranking objective to align the model's certainty ordering with the observed decoding trajectory. Empirically, TABOM achieves substantial gains in new domains, expands the effective knowledge boundary of DLMs, and significantly mitigates catastrophic forgetting compared with standard SFT.
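
One way to read "a Boltzmann distribution over predictive entropies" with a pairwise ranking objective is sketched below. Under a Boltzmann preference with weights exp(-H/T), the probability that position i is unmasked before position j is a logistic function of (H_j - H_i)/T, which gives a logistic pairwise loss. This derivation, the temperature parameter, and the function names are assumptions; the abstract does not state TABOM's exact objective.

```python
import math

def unmask_distribution(entropies, temperature=1.0):
    """Boltzmann preference over masked positions: lower entropy, higher probability."""
    lowest = min(entropies)  # shift for numerical stability
    weights = [math.exp(-(h - lowest) / temperature) for h in entropies]
    total = sum(weights)
    return [w / total for w in weights]

def pairwise_ranking_loss(entropies, decode_order, temperature=1.0):
    """Mean logistic loss over all pairs (i unmasked before j).

    decode_order lists positions in the order the decoder unmasked them.
    The loss is small when earlier positions have lower predictive entropy.
    """
    loss, pairs = 0.0, 0
    for a in range(len(decode_order)):
        for b in range(a + 1, len(decode_order)):
            i, j = decode_order[a], decode_order[b]
            margin = (entropies[j] - entropies[i]) / temperature
            loss += math.log1p(math.exp(-margin))  # -log sigmoid(margin)
            pairs += 1
    return loss / pairs

entropies = [0.1, 1.0, 2.0]  # made-up predictive entropies at three masked positions
probs = unmask_distribution(entropies)
aligned = pairwise_ranking_loss(entropies, decode_order=[0, 1, 2])
reversed_order = pairwise_ranking_loss(entropies, decode_order=[2, 1, 0])
```

The loss is lower when the model's certainty ordering agrees with the observed easy-to-hard decoding order, which is the alignment the abstract describes.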

2605.11817 2026-05-19 cs.RO cs.CV

See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model

Yixu Feng, Zinan Zhao, Yanxiang Ma, Chenghao Xia, Chengbin Du, Yunke Wang, Chang Xu

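AI summary: This paper proposes a token-compression method for vision-language-action models based on differentiable grid sampling. Continuous token resampling preserves essential spatial information, keeping fewer than 10% of the original visual tokens and cutting FLOPs by 76% with no loss in success rate.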

详情
Journal ref
Proceedings of the Forty-third International Conference on Machine Learning, 2026
AI中文摘要

视觉-语言-动作(VLA)模型在机器人操作中表现出色,但其高计算成本限制了实时部署。现有token剪枝方法面临根本性的权衡:使用剪枝进行剧烈压缩会不可避免地丢弃关键几何细节,如接触点,导致性能严重下降。我们主张通过重新思考压缩作为几何感知的连续token重采样来打破这种权衡。为此,我们提出了可微网格采样器(GridS),一个即插即用的模块,用于在VLA中进行任务感知的连续重采样。通过自适应预测最小的显著坐标集并利用可微插值提取特征,GridS在保留关键空间信息的同时实现了大幅压缩(少于10%的原始视觉token)。在LIBERO基准和真实机器人平台上的实验表明,GridS实现了76%的FLOPs减少,而无需降级成功率。代码可在https://github.com/Fediory/Grid-Sampler上获得。

英文摘要

Vision-Language-Action (VLA) models have shown remarkable promise in robotic manipulation, yet their high computational cost hinders real-time deployment. Existing token pruning methods suffer from a fundamental trade-off: aggressive compression using pruning inevitably discards critical geometric details like contact points, leading to severe performance degradation. This forces a compromise, limiting the achievable compression rate and thus the potential speedup. We argue that breaking this trade-off requires rethinking compression as a geometry-aware, continuous token resampling in the vision encoder. To this end, we propose the Differentiable Grid Sampler (GridS), a plug-and-play module that performs task-aware, continuous resampling of visual tokens in VLA models. By adaptively predicting a minimal set of salient coordinates and extracting features via differentiable interpolation, GridS preserves essential spatial information while achieving drastic compression (with fewer than 10% of the original visual tokens). Experiments on both the LIBERO benchmark and a real robotic platform validate the lowest feasible visual token count reported to date: GridS achieves a 76% reduction in FLOPs with no degradation in the success rate. The code is available at https://github.com/Fediory/Grid-Sampler.
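
Extracting a feature at a continuous coordinate by interpolation is what makes this kind of resampling differentiable: the output varies smoothly with the predicted coordinate instead of jumping between discrete tokens. The sketch below uses plain bilinear interpolation as an assumed stand-in; the abstract does not specify GridS's interpolation scheme, and `bilinear_sample` is a hypothetical name. In an autograd framework the same operation is available as, for example, `torch.nn.functional.grid_sample` in PyTorch.

```python
def bilinear_sample(feature_map, x, y):
    """Sample a feature vector at continuous coordinates.

    feature_map[row][col] is a list of channel values; x is the (fractional)
    column coordinate and y the (fractional) row coordinate, in grid units.
    Coordinates outside the grid are clamped to the border.
    """
    height, width = len(feature_map), len(feature_map[0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    sampled = []
    for c in range(len(feature_map[0][0])):
        top = (1 - fx) * feature_map[y0][x0][c] + fx * feature_map[y0][x1][c]
        bottom = (1 - fx) * feature_map[y1][x0][c] + fx * feature_map[y1][x1][c]
        sampled.append((1 - fy) * top + fy * bottom)
    return sampled

# A 2x2 grid of single-channel visual tokens (made-up values).
tokens = [[[0.0], [1.0]],
          [[2.0], [3.0]]]
center = bilinear_sample(tokens, 0.5, 0.5)
corner = bilinear_sample(tokens, 1.0, 0.0)
```

A handful of such sampled points can replace the full token grid, and gradients can flow back to the coordinates that chose them.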

2605.11567 2026-05-19 cs.CV

Dynamic Execution Commitment of Vision-Language-Action Models

视觉-语言-动作模型的动态执行承诺

Feng Chen, Xianghui Wang, Yuxuan Chen, Boying Li, Yefei He, Zeyu Zhang, Yicheng Wu

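AI summary: This paper proposes A3, a mechanism that reframes dynamic execution commitment as a self-speculative prefix verification problem, balancing execution robustness against inference throughput for vision-language-action models in dynamic or out-of-distribution settings.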

Comments code is available at https://inceptionwang.github.io/A3/

详情
AI中文摘要

视觉-语言-动作(VLA)模型主要采用动作分块方法,即在单次前向传递中预测并承诺一系列连续的低层动作,以摊销大规模主干网络的推理成本并减少每步延迟。然而,将这些多步骤预测提交到现实世界执行需要在成功率和推理效率之间进行平衡,这一决策通常由针对特定任务调整的固定执行时间范围控制。此类启发式方法忽略了预测可靠性与状态依赖性的关系,导致在动态或分布外情况下表现脆弱。在本文中,我们引入了A3,一种自适应动作接受机制,将动态执行承诺重新定义为自推测前缀验证问题。A3首先通过群体采样计算轨迹级的动作共识分数,然后选择一个代表性的草稿并优先验证下游部分。具体而言,它强制执行:(1)共识有序的条件不变性,通过判断在高共识动作条件下重新解码后低共识动作是否保持一致来验证低共识动作;以及(2)前缀封闭的序列一致性,通过只接受从开始处最长连续验证动作序列来保证物理运行完整性。因此,执行时间范围自然成为满足内部模型逻辑和序列执行约束的最长可验证前缀。在多种VLA模型和基准测试中,实验表明A3消除了手动调整时间范围的需要,同时在执行鲁棒性和推理吞吐量之间实现了更优的平衡。

英文摘要

Vision-Language-Action (VLA) models predominantly adopt action chunking, i.e., predicting and committing to a short horizon of consecutive low-level actions in a single forward pass, to amortize the inference cost of large-scale backbones and reduce per-step latency. However, committing these multi-step predictions to real-world execution requires balancing success rate against inference efficiency, a decision typically governed by fixed execution horizons tuned per task. Such heuristics ignore the state-dependent nature of predictive reliability, leading to brittle performance in dynamic or out-of-distribution settings. In this paper, we introduce A3, an Adaptive Action Acceptance mechanism that reframes dynamic execution commitment as a self-speculative prefix verification problem. A3 first computes a trajectory-wise consensus score of actions via group sampling, then selects a representative draft and prioritizes downstream verification. Specifically, it enforces: (1) consensus-ordered conditional invariance, which validates low-consensus actions by judging whether they remain consistent when re-decoded conditioned on high-consensus actions; and (2) prefix-closed sequential consistency, which guarantees physical rollout integrity by accepting only the longest continuous sequence of verified actions starting from the beginning. Consequently, the execution horizon emerges as the longest verifiable prefix satisfying both internal model logic and sequential execution constraints. Experiments across diverse VLA models and benchmarks demonstrate that A3 eliminates the need for manual horizon tuning while achieving a superior trade-off between execution robustness and inference throughput.
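
The prefix-closed rule in this abstract means that a verified action later in the chunk does not count once an earlier action has failed verification. A minimal sketch, assuming the per-action verification results are already available as booleans; the function name and example actions are hypothetical.

```python
def longest_verified_prefix(actions, verified):
    """Return the longest run of verified actions starting from the first one."""
    horizon = 0
    for ok in verified:
        if not ok:
            break
        horizon += 1
    return actions[:horizon]

chunk = ["reach", "grasp", "lift", "place"]
checks = [True, True, False, True]  # "place" passed, but "lift" before it did not
to_execute = longest_verified_prefix(chunk, checks)
```

The execution horizon is then simply the length of the returned prefix, so it adapts per state instead of being a fixed number tuned per task.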

2605.11461 2026-05-19 cs.AI cs.LG

Breaking Winner-Takes-All: Cooperative Policy Optimization Improves Diverse LLM Reasoning

打破赢家通吃:合作策略优化提升大语言模型的多样化推理

Haoxuan Chen, Tianming Liang, Wei-Shi Zheng, Jian-Fang Hu

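AI summary: This paper proposes Group Cooperative Policy Optimization (GCPO), which shifts the training paradigm from rollout competition to team cooperation, improving both the accuracy and the solution diversity of large language models on reasoning tasks.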

AI Chinese abstract

基于验证器的强化学习(RLVR)已成为提升大语言模型(LLM)推理能力的核心范式,然而流行的基于群体的优化算法如GRPO常常面临探索崩溃问题,即模型过早收敛于一组高分模式,缺乏探索新解的能力。最近的研究尝试通过添加熵正则化或多样性奖励来缓解这一问题,但这些方法并未改变赢家通吃的本质,即rollouts仍为个体优势竞争而非合作最大化全局多样性。在本文中,我们提出Group Cooperative Policy Optimization(GCPO),将训练范式从rollout竞争转向团队合作。具体而言,GCPO将独立rollout评分替换为团队层面的信用分配:rollout被奖励其对团队有效解覆盖的贡献,而非其个体准确性。该覆盖被描述为奖励加权语义嵌入上的行列式体积,其中只有正确且非冗余的rollout才对这一体积做出贡献。在优势估计过程中,GCPO将集体团队奖励重新分配给每个单个rollout,根据其对团队的平均边际贡献。这种合作训练范式将优化方向导向非冗余的正确推理路径。在多个推理基准测试中,GCPO在现有方法的基础上显著提高了推理准确性和解题多样性。代码将在https://github.com/bradybuddiemarch/gcpo上发布。

English abstract

Reinforcement learning with verifiers (RLVR) has become a central paradigm for improving LLM reasoning, yet popular group-based optimization algorithms like GRPO often suffer from exploration collapse, where the model prematurely converges on a narrow set of high-scoring patterns and loses the ability to explore new solutions. Recent efforts attempt to alleviate this by adding entropy regularization or a diversity bonus. However, these approaches do not change the winner-takes-all nature of training, where rollouts still compete for individual advantage rather than cooperating to maximize global diversity. In this work, we propose Group Cooperative Policy Optimization (GCPO), which shifts the training paradigm from rollout competition to team cooperation. Specifically, GCPO replaces independent rollout scoring with team-level credit assignment: a rollout is rewarded by how much it contributes to the team's valid solution coverage, rather than by its individual accuracy. This coverage is described as a determinant volume over reward-weighted semantic embeddings, where only correct and non-redundant rollouts contribute to the volume. During advantage estimation, GCPO redistributes the collective team reward to each individual rollout according to its average marginal contribution to the team. This cooperative training paradigm routes optimization toward non-redundant correct reasoning paths. Experiments across multiple reasoning benchmarks demonstrate that GCPO significantly improves both reasoning accuracy and solution diversity over existing approaches. Code will be released at https://github.com/bradybuddiemarch/gcpo.
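
The team-level credit assignment described in the abstract can be illustrated with a small sketch. Everything below is an assumption made for illustration, not the paper's formulation. Coverage is taken to be log det(I + V Vᵀ), where the rows of V are reward-weighted embeddings, and the "average marginal contribution" is computed as an exact average over all orderings (a Shapley-style value), which is only feasible for tiny groups. The function names are hypothetical.

```python
from itertools import permutations
from math import log


def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result


def coverage(embeddings, rewards, members):
    """Assumed coverage: log det(I + V V^T) over the chosen rollouts."""
    if not members:
        return 0.0
    vecs = [[rewards[i] * x for x in embeddings[i]] for i in members]
    k = len(vecs)
    gram = [
        [sum(p * q for p, q in zip(vecs[i], vecs[j])) + (1.0 if i == j else 0.0)
         for j in range(k)]
        for i in range(k)
    ]
    return log(det(gram))


def average_marginal_contribution(embeddings, rewards):
    """Average each rollout's coverage gain over every join order."""
    n = len(embeddings)
    totals = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        team, previous = [], 0.0
        for i in order:
            team.append(i)
            current = coverage(embeddings, rewards, team)
            totals[i] += current - previous
            previous = current
    return [total / len(orders) for total in totals]
```

Under these assumptions, a rollout with zero reward adds nothing to the volume, and two correct rollouts with identical embeddings split a smaller gain than two correct rollouts with orthogonal embeddings would earn. The identity term keeps the determinant from collapsing to zero when duplicates are present. A real system would replace the enumeration of orderings with sampling.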