arXivDaily: daily arXiv digest, updated Monday through Friday
2510.26615 2026-06-08 cs.CL updated version

SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding

Yiqiao Jin, Rachneet Kaur, Zhen Zeng, Sumitra Ganesh, Srijan Kumar

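Affiliations: Georgia Institute of Technology; J.P. Morgan AI Research

AI summary: Proposes SlideAgent, a hierarchical agentic framework for understanding multimodal, multi-page documents such as slide decks. It builds a structured representation through three levels of reasoning (global, page, and element) and improves accuracy by 7.9% with proprietary models and 9.8% with open-source models.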

Comments
ACL 2026 Main Conference. https://slideagent.github.io/
Abstract

Multi-page visual documents such as manuals, brochures, presentations, and posters convey key information through layout, colors, icons, and cross-slide references. While multimodal large language models (MLLMs) offer opportunities in document understanding, current systems struggle with complex, multi-page visual documents, particularly in fine-grained reasoning over elements and pages. We introduce SlideAgent, a versatile agentic framework for understanding multi-modal, multi-page, and multi-layout documents, especially slide decks. SlideAgent employs specialized agents and decomposes reasoning into three specialized levels--global, page, and element--to construct a structured, query-agnostic representation that captures both overarching themes and detailed visual or textual cues. During inference, SlideAgent selectively activates specialized agents for multi-level reasoning and integrates their outputs into coherent, context-aware answers. Extensive experiments show that SlideAgent significantly improves accuracy over both proprietary (+7.9%) and open-source models (+9.8%).

2508.17821 2026-06-08 cs.LG cs.AI cs.CL updated version

Limitations of Normalization in Attention Mechanism

Timur Mudarisov, Mikhail Burtsev, Tatiana Petrova, Radu State

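Affiliations: University of Luxembourg; London Institute for Mathematical Sciences

AI summary: Using a theoretical framework and GPT-2 experiments, the paper shows that softmax normalization drives attention toward uniformity as the number of selected tokens grows, and analyzes the training difficulties caused by gradient sensitivity at low temperatures.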

Abstract

This paper investigates the limitations of normalization in attention mechanisms. We begin with a theoretical framework that enables the identification of the model's selective ability and the geometric separation involved in token selection. Our analysis includes explicit bounds on distances and separation criteria for token vectors under softmax scaling. Through experiments with a pre-trained GPT-2 model, we empirically validate our theoretical results and analyze key behaviors of the attention mechanism. Notably, we demonstrate that as the number of selected tokens increases, the model's ability to distinguish informative tokens declines, often converging toward a uniform selection pattern. We also show that gradient sensitivity under softmax normalization presents challenges during training, especially at low temperature settings. These findings advance current understanding of softmax-based attention mechanisms and motivate the need for more robust normalization and selection strategies in future attention architectures.
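A toy illustration of the two effects named in the abstract, not a reproduction of the paper's bounds. It assumes one informative token whose score sits a fixed gap above every distractor; the function names, the gap value, and the temperature values are all invented for this sketch. Under that assumption the informative token's weight is e^(gap/T) / (e^(gap/T) + n - 1), which shrinks toward the uniform level as n grows, while the softmax self-derivative w(1 - w)/T grows as the temperature T drops.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax with a temperature parameter."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def informative_weight(n_tokens, gap=2.0, temperature=1.0):
    """Weight of one token scoring `gap` above n_tokens - 1 distractors at 0."""
    scores = [gap] + [0.0] * (n_tokens - 1)
    return softmax(scores, temperature)[0]

def self_sensitivity(weight, temperature=1.0):
    """d(weight_i)/d(score_i) for softmax: weight * (1 - weight) / temperature."""
    return weight * (1.0 - weight) / temperature
```

Calling informative_weight for n_tokens = 2, 16, 128, 1024 shows the weight falling as the sequence grows, and self_sensitivity(0.5, T) for shrinking T shows the gradient scale rising.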

2510.16023 2026-06-08 cs.LG cond-mat.mtrl-sci updated version

A Conformation-Centric Generative Foundation Model for Linear Polymer Modeling and Design

Fanmeng Wang, Ruochao Wang, Shan Mei, Wentao Guo, Hongshuai Wang, Qi Ou, Zhifeng Gao, Hongteng Xu

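Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; DP Technology; SINOPEC Research Institute of Petroleum Processing Co., Ltd.

AI summary: Proposes PolyConFM, a foundation model that models linear polymers through conformation-centric generative pretraining (conditional generation, masked autoregressive modeling, and orientation transformations) and outperforms existing methods on a range of downstream tasks.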

Abstract

Linear polymers, macromolecules formed from monomers covalently bonded into continuous chains, underpin countless technologies and are indispensable to modern life. While deep learning is advancing polymer science, existing methods typically represent the whole linear polymer solely through monomer-level descriptors, overlooking the global structural information inherent in polymer conformations, which ultimately limits their practical performance. Moreover, this important field still lacks a dedicated foundation model that can effectively support diverse downstream tasks, thereby severely constraining progress. To address these challenges, we introduce PolyConFM, a foundation model tailored for modeling and designing linear polymers through conformation-centric generative pretraining. Recognizing that each linear polymer is essentially a continuous chain whose conformation can be naturally decomposed into a sequence of local conformations (i.e., those of its repeating units), we pretrain PolyConFM under the conditional generation paradigm, reconstructing these local conformations via masked autoregressive (MAR) modeling and further generating their orientation transformations to recover the corresponding polymer conformation. Meanwhile, we construct a linear polymer conformation dataset via molecular dynamics simulations to mitigate data sparsity, thereby enabling conformation-centric pretraining. Experiments demonstrate that PolyConFM consistently outperforms representative task-specific methods across diverse downstream tasks, thereby equipping polymer science with a powerful tool targeting linear polymers.

2510.01427 2026-06-08 cs.AI updated version

Small Language Model Agents Enable Efficient and High-Quality Knowledge Mining

Sipeng Zhang, Shuhuai Lin, Xinpeng Wei, Yihang Chen, Pin Qian, Su Wang, Huan Xu

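Affiliations: University of California, San Diego; Carnegie Mellon University; Georgia Institute of Technology

AI summary: Proposes Falconer, a framework that combines the agentic reasoning of large language models with lightweight proxy models. Through planning and annotation it enables scalable knowledge mining, preserving instruction-following accuracy while cutting inference cost by up to 90% and speeding up mining by more than 20x.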

Comments
Code available: https://github.com/LongfeiYun17/falconer
Abstract

At the core of Deep Research is knowledge mining, the task of extracting structured information from massive unstructured text in response to user instructions. Large language models (LLMs) excel at interpreting such instructions but are prohibitively expensive to deploy at scale, while traditional pipelines of classifiers and extractors remain efficient yet brittle and unable to generalize to new tasks. We introduce Falconer, a collaborative framework that combines the agentic reasoning of LLMs with lightweight proxy models for scalable knowledge mining. In Falconer, LLMs act as planners, decomposing user instructions into executable pipelines, and as annotators, generating supervision to train small proxies. The framework unifies classification and extraction into two atomic operations, get label and get span, enabling a single instruction-following model to replace multiple task-specific components. To evaluate the consistency between proxy models incubated by Falconer and annotations provided by humans and large models, we construct new benchmarks covering both planning and end-to-end execution. Experiments show that Falconer closely matches state-of-the-art LLMs in instruction-following accuracy while reducing inference cost by up to 90% and accelerating large-scale knowledge mining by more than 20x, offering an efficient and scalable foundation for Deep Research.

2411.09734 2026-06-08 cs.LG cs.NA math.NA math.OC updated version

Modeling AdaGrad, RMSProp, and Adam with Integro-Differential Equations

Carlos Heredia

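Affiliations: IAMM Research, Department of Applied Artificial Intelligence; DAMM

AI summary: Proposes continuous-time integro-differential equation models of AdaGrad, RMSProp, and Adam, checks their agreement with the discrete algorithms through numerical simulation and stability analysis, and offers a new perspective on adaptive optimization methods.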

Comments
60 pages, 15 figures; v3 - Section 4 corrected
Abstract

In this paper, we propose a continuous-time formulation for the AdaGrad, RMSProp, and Adam optimization algorithms by modeling them as first-order integro-differential equations. We perform numerical simulations of these equations, along with stability and convergence analyses, to demonstrate their validity as accurate approximations of the original algorithms. Our results indicate a strong agreement between the behavior of the continuous-time models and the discrete implementations, thus providing a new perspective on the theoretical understanding of adaptive optimization methods.
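For orientation, the discrete algorithms that the continuous-time models are meant to approximate can be sketched in their standard textbook form. This is a scalar sketch under assumptions: the hyperparameter values, the quadratic test objective, and the helper names are chosen for illustration, and none of the paper's integro-differential equations are reproduced here.

```python
import math

def adagrad_step(theta, grad, state, lr=0.1, eps=1e-8):
    """AdaGrad: divide by the root of the running SUM of squared gradients."""
    state["G"] = state.get("G", 0.0) + grad * grad
    return theta - lr * grad / (math.sqrt(state["G"]) + eps)

def rmsprop_step(theta, grad, state, lr=0.01, beta=0.9, eps=1e-8):
    """RMSProp: replace the sum with an exponential moving average."""
    state["v"] = beta * state.get("v", 0.0) + (1 - beta) * grad * grad
    return theta - lr * grad / (math.sqrt(state["v"]) + eps)

def adam_step(theta, grad, state, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: moving averages of the gradient and its square, bias-corrected."""
    t = state["t"] = state.get("t", 0) + 1
    state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * grad
    state["v"] = beta2 * state.get("v", 0.0) + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)

def minimize(step, theta0=1.0, steps=200):
    """Run an optimizer on f(theta) = theta**2 / 2, whose gradient is theta."""
    theta, state = theta0, {}
    for _ in range(steps):
        theta = step(theta, theta, state)
    return theta
```

Each update depends on the whole gradient history through an accumulated sum or moving average, which is the memory that an integral term would carry in a continuous-time description.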

2510.05363 2026-06-08 cs.AI updated version

MHA-RAG: Improving Efficiency, Accuracy, and Consistency by Encoding Exemplars as Soft Prompts

Abhinav Jain, Xinyu Yao, Thomas Reps, Christopher Jermaine

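Affiliations: Department of Computer Science, Rice University; Department of Computer Science, University of Wisconsin–Madison

AI summary: Proposes MHA-RAG, a framework that encodes domain exemplars as soft prompts and controls their generation through multi-head attention. Across several question-answering benchmarks it gains 20 points over standard RAG while cutting inference cost by a factor of 10.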

Comments
17 pages, 5 figures
Abstract

Adapting Foundation Models to new domains with limited training data is challenging and computationally expensive. While prior work has demonstrated the effectiveness of using domain-specific exemplars as in-context demonstrations, we investigate whether representing exemplars purely as text is the most efficient, effective, and stable approach. We explore an alternative: representing exemplars as soft prompts with an exemplar-order-invariant model architecture. To this end, we introduce Multi-Head Attention Retrieval-Augmented Generation (MHA-RAG), a framework with the number of attention heads serving as a simple hyperparameter to control soft-prompt generation across different tasks. Across multiple question-answering benchmarks and model scales, MHA-RAG achieves a 20-point performance gain over standard RAG while cutting inference costs by a factor of 10 in GFLOPs, delivering both higher accuracy and greater efficiency, invariant to exemplar order.

2509.14380 2026-06-08 cs.RO updated version

CRAFT: Coaching Reinforcement Learning Autonomously using Foundation Models for Multi-Robot Coordination Tasks

Seoyeon Choi, Kanghyun Ryu, Jonghoon Ock, Negar Mehr

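Affiliations: Department of Mechanical Engineering, University of California Berkeley

AI summary: Proposes CRAFT, a framework that uses large language models to decompose tasks and generate reward functions, refines those rewards with a vision language model, and thereby learns multi-robot coordination. It is validated on quadruped navigation and bimanual manipulation tasks.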

Abstract

Multi-Agent Reinforcement Learning (MARL) provides a powerful framework for learning coordination in multi-agent systems. However, applying MARL to robotics remains challenging due to high-dimensional continuous joint action spaces, complex reward design, and non-stationarity from concurrently learning agents. On the other hand, humans often learn complex coordination with the help of coaches, who guide learning through carefully designed curricula and detailed feedback. Building on the reasoning capabilities of foundation models, we argue that these models can similarly coach robots to learn coordination. Motivated by this, we propose CRAFT: Coaching Reinforcement learning Autonomously using Foundation models for learning coordination Tasks, a framework that leverages foundation models to act as a "coach" for multi-robot coordination. CRAFT automatically decomposes long-horizon coordination tasks into sequences of subtasks using the planning capability of Large Language Models (LLMs). Then, CRAFT trains each subtask using LLM-generated reward functions, and refines them through a Vision Language Model (VLM)-guided reward-refinement loop. We evaluate CRAFT on multi-quadruped navigation and bimanual manipulation tasks, and demonstrate its capability to learn complex coordination behaviors. In addition, in a multi-quadruped navigation setting, we show that our learned policies transfer to the real world. Project website: https://iconlab.negarmehr.com/CRAFT/

2509.24935 2026-06-08 cs.CV cs.AI cs.LG updated version

Scalable GANs with Transformers

Sangeek Hyun, MinKyu Lee, Jae-Pil Heo

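Affiliations: KAIST

AI summary: Studies the scalability of generative adversarial networks using a compact variational autoencoder latent space and a purely transformer-based architecture. It proposes lightweight intermediate supervision and width-aware learning-rate adjustment to address failure modes that appear when scaling, and reaches an FID of 2.96 on ImageNet-256 in 40 epochs.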

Comments
ICML 2026
Abstract

Scalability has driven recent advances in generative modeling, yet its principles remain underexplored for adversarial learning. We investigate the scalability of Generative Adversarial Networks (GANs) through two design choices that have proven to be effective in other types of generative models: training in a compact Variational Autoencoder latent space and adopting purely transformer-based generators and discriminators. Training in latent space enables efficient computation while preserving perceptual fidelity, and this efficiency pairs naturally with plain transformers, whose performance scales with computational budget. Building on these choices, we analyze failure modes that emerge when naively scaling GANs. Specifically, we find issues such as underutilization of early layers in the generator and optimization instability as the network scales. Accordingly, we provide simple and scale-friendly solutions such as lightweight intermediate supervision and width-aware learning-rate adjustment. Our experiments show that GAT, a purely transformer-based, latent-space GAN, can be trained reliably across a wide range of capacities (S through XL). Moreover, GAT-XL/2 achieves state-of-the-art single-step, class-conditional generation performance (FID of 2.96) on ImageNet-256 in just 40 epochs, 6x fewer epochs than strong baselines. Project page: https://hse1032.github.io/GAT.

2504.03635 2026-06-08 cs.AI cs.CL updated version

Finding the Minimal Parameter Budget for Implicit Reasoning: A Data Complexity Driven Scaling Law for Language Models

Xinyi Wang, Shawn Tan, Shenbo Xu, Mingyu Jin, William Yang Wang, Rameswar Panda, Yikang Shen

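Affiliations: University of California, Berkeley; University of Cambridge; University of Washington; University of Toronto; University of Tokyo

AI summary: Through pretraining experiments in a controlled synthetic environment, the paper finds a scaling law linking the minimal parameter budget needed for implicit reasoning in language models to graph search entropy, and identifies a capacity ceiling of about 0.008 bits of information per parameter.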

Comments
Accepted to ICML 2026
Abstract

Reasoning is a core capability of language models (LMs), yet it remains unclear how much model capacity is necessary to support reasoning during pretraining. In this work, we study the minimal parameter budget required for implicit reasoning, defined as the ability to infer new facts from learned knowledge without explicit chain-of-thought supervision. To isolate this phenomenon, we pretrain LMs from scratch in a controlled synthetic environment that mimics the structure and distribution of real-world knowledge graphs, and evaluate their ability to complete missing edges via multi-hop inference. From both a theoretical and an empirical perspective, we identify a scaling law linking this optimal parameter budget to a graph search entropy measure. Across a wide range of model sizes, training steps, and graph complexities, we show that an optimally sized language model can reliably reason over approximately 0.008 bits of information per parameter at most. Our results characterize the minimal sufficient capacity for implicit reasoning during pretraining. Our findings provide principled guidance for matching model size to data complexity and offer new insights into the scaling behavior of reasoning in large language models.

2503.03660 2026-06-08 cs.LG updated version

Chunking the Critic: A Transformer-based Soft Actor-Critic with N-Step Returns

Dong Tian, Onur Celik, Gerhard Neumann

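Affiliations: Karlsruhe Institute of Technology (KIT)

AI summary: Proposes a sequence-conditioned critic that models trajectory context with a lightweight Transformer and trains on aggregated N-step targets, strengthening the critic's temporal modeling for long-horizon and sparse-reward problems without importance sampling.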

Comments
39 pages, 15 figures, ICLR2026 Poster
Abstract

We introduce a sequence-conditioned critic for Soft Actor-Critic (SAC) that models trajectory context with a lightweight Transformer and trains on aggregated $N$-step targets. Unlike prior approaches that (i) score state-action pairs in isolation or (ii) rely on actor-side action chunking to handle long horizons, our method strengthens the critic itself by conditioning on short trajectory segments and integrating multi-step returns -- without importance sampling (IS). The resulting sequence-aware value estimates capture the critical temporal structure for extended-horizon and sparse-reward problems. On local-motion benchmarks, we further show that freezing critic parameters for several steps makes our update compatible with CrossQ's core idea, enabling stable training without a target network. Despite its simplicity -- a 2-layer Transformer with 128-256 hidden units and a maximum update-to-data ratio (UTD) of $1$ -- the approach consistently outperforms standard SAC and strong off-policy baselines, with particularly large gains on long-trajectory control. These results highlight the value of sequence modeling and $N$-step bootstrapping on the critic side for long-horizon reinforcement learning.
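As background for the $N$-step targets mentioned above, a plain $N$-step bootstrapped return can be sketched as follows. This is the generic textbook quantity, written under assumptions: the function name and default discount are invented here, SAC's entropy term is omitted, and the paper's aggregation over several horizons is not reproduced.

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """Discounted sum of N rewards plus a discounted bootstrap value.

    rewards:         [r_t, ..., r_{t+N-1}] collected along the trajectory
    bootstrap_value: critic estimate at the state reached after N steps
    """
    target = bootstrap_value
    # Fold from the last reward backwards: G = r + gamma * G_next.
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target
```

With an empty reward list the function returns the bootstrap value unchanged, and with one reward it reduces to the usual one-step target.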

2509.21751 2026-06-08 cs.LG physics.comp-ph physics.flu-dyn updated version

On the Effect of Neural Field Reparameterization for 4DVAR

Jaemin Oh

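Affiliations: Division of Applied Mathematics, Brown University

AI summary: Reparameterizes 4DVAR with neural fields, using their spectral bias as an implicit regularizer. The approach needs no background error covariance, enables parallel-in-time optimization, and outperforms the classical method on chaotic benchmarks.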

Comments
26 pages, 9 figures, 11 tables
Abstract

Four-dimensional variational data assimilation (4DVAR) is a cornerstone of numerical weather prediction, yet it remains computationally intensive and sensitive to initialization due to the non-convexity of its objective function. We propose a neural field-based reformulation of 4DVAR in which the spatiotemporal state is represented as a continuous function parameterized by a neural network. We demonstrate that optimizing in parameter space leverages the spectral bias of neural fields, acting as an implicit regularizer that stabilizes state estimation and suppresses spurious high-frequency oscillations without requiring explicit background error covariance information. Furthermore, by parameterizing the full spatiotemporal trajectory, our framework enables parallel-in-time optimization and incorporates physical constraints directly through physics-informed losses. Evaluations on chaotic benchmarks, including 2D Kolmogorov flow and 3D Taylor-Green vortices, show that neural reparameterization produces more accurate initial conditions than classical 4DVAR. When combined with separable neural architectures (SPINNs), the method achieves substantial speedups. Unlike many machine learning approaches, this framework requires no ground-truth training data, offering a robust and scalable alternative for operational data assimilation.

2505.21285 2026-06-08 cs.LG stat.ML updated version

Learnable Kernel Density Estimation for Graphs and Its Application to Graph-Level Anomaly Detection

Xudong Wang, Ziheng Sun, Chris Ding, Jicong Fan

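Affiliations: University of Science and Technology of China

AI summary: Proposes LGKDE, a framework that represents graphs as distributions using graph neural networks and learns a multi-scale kernel density estimate via maximum mean discrepancy. With theoretical guarantees, it captures structural patterns and semantic variations and outperforms existing methods on graph anomaly detection.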

Comments
Accepted in the Forty-Third International Conference on Machine Learning (ICML 2026), Main Track
Abstract

This work proposes a framework LGKDE that learns kernel density estimation for graphs. The key challenge in graph density estimation lies in effectively capturing both structural patterns and semantic variations while maintaining theoretical guarantees. Combining graph kernels and kernel density estimation (KDE) is a standard approach to graph density estimation, but has unsatisfactory performance due to the handcrafted and fixed features of kernels. Our method LGKDE leverages graph neural networks to represent each graph as a discrete distribution and utilizes maximum mean discrepancy to learn the graph metric for multi-scale KDE, where all parameters are learned by maximizing the density of graphs relative to the density of their well-designed perturbed counterparts. The perturbations are conducted on both node features and graph spectra, which helps better characterize the boundary of normal density regions. Theoretically, we establish consistency and convergence guarantees for LGKDE, including bounds on the mean integrated squared error, robustness, and generalization. We validate LGKDE by demonstrating its effectiveness in recovering the underlying density of synthetic graph distributions and applying it to graph anomaly detection across diverse benchmark datasets. Extensive empirical evaluation shows that LGKDE demonstrates superior performance compared to state-of-the-art baselines on most benchmark datasets.

2509.11740 2026-06-08 cs.RO updated version

From Pixels to Shelf: An Integrated Robotic System for Autonomous Supermarket Stocking with a Mobile Manipulator

Davide Peron, Victor Nan Fernandez-Ayala, Lukas Segelmark

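Affiliations: Department of Information Engineering, University of Padova; Division of Decision and Control Systems, School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology; Animum AB

AI summary: Presents a modular robotic system that integrates commercial hardware with ROS2 and uses behavior-tree planning, vision-based detection, and two-step MPC control. It achieves a pick-and-place success rate above 98% over more than 700 stocking events, though its performance still trails that of human workers.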

Comments
Preprint for CASE 2026
Abstract

Autonomous stocking in retail environments, particularly supermarkets, presents challenges due to dynamic human interactions, constrained spaces, and diverse product geometries. This paper introduces an efficient modular robotic system for autonomous shelf stocking, integrating commercially available hardware with a scalable algorithmic architecture. A major contribution of this work is the system integration of off-the-shelf hardware and ROS2-based perception, planning, and control into a single deployable platform for retail environments. Our solution leverages Behavior Trees (BTs) for task planning, fine-tuned vision models for object detection, and a two-step Model Predictive Control (MPC) framework for precise shelf navigation using ArUco markers. Laboratory experiments replicating realistic supermarket conditions demonstrate reliable performance, achieving over 98% success in pick-and-place operations across a total of more than 700 stocking events. However, our comparative benchmarks indicate that the performance and cost-effectiveness of current autonomous systems remain inferior to that of human workers, which we use to highlight key improvement areas and quantify the progress still required before widespread commercial deployment can realistically be achieved.

2509.05316 2026-06-08 cs.LG cs.AI updated version

Standard vs. Modular Sampling: Best Practices for Reliable LLM Unlearning

Praveen Bushipaka, Lucia Passaro, Tommaso Cucinotta

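Affiliations: Scuola Superiore Sant’Anna; University of Pisa

AI summary: To address shortcomings of sampling strategies in LLM unlearning, proposes the Modular Entity-Level Unlearning (MELU) strategy, which balances forget efficacy and model utility through diverse neighbor sets and modular sampling.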

Abstract

A conventional LLM Unlearning setting consists of two subsets, "forget" and "retain", with the objective of removing the undesired knowledge from the forget set while preserving the remaining knowledge from the retain set. In privacy-focused unlearning research, the retain set is often further divided into neighbor sets, containing data either directly or indirectly connected to the forget targets, and augmented by a general-knowledge set. A common practice in existing benchmarks is to employ only a single neighbor set together with general knowledge, which fails to reflect real-world data complexities and relationships. LLM Unlearning typically involves 1:1 sampling or cyclic iteration sampling. However, the efficacy and stability of these de facto standards have not been critically examined. In this study, we systematically evaluate these common practices. Our findings reveal that relying on a single neighbor set is suboptimal and that a standard sampling approach can obscure performance trade-offs. Based on this analysis, we propose and validate an initial set of best practices: (1) incorporating diverse neighbor sets to balance forget efficacy and model utility; (2) avoiding standard 1:1 sampling, which is inefficient and yields poor results; and (3) using our proposed Modular Entity-Level Unlearning (MELU) strategy as an alternative to cyclic sampling. We demonstrate that this modular approach, combined with robust algorithms, provides a clear and stable path towards effective unlearning.

2508.19239 2026-06-08 cs.AI updated version

Model Context Protocols in Adaptive Transport Systems: A Survey

Gaurab Chhetri, Shriyank Somvanshi, Md Monzurul Islam, Shamyo Brotee, Mahmuda Sultana Mimi, Dipti Koirala, Biplov Pandey, Subasish Das

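Affiliations: Texas State University San Marcos

AI summary: Presents the first systematic survey of the Model Context Protocol (MCP) as a unifying paradigm. It proposes a five-category taxonomy, exposes the limits of isolated adaptation in traditional protocols, and argues that MCP's client-server and JSON-RPC structure supports semantic interoperability, laying a foundation for next-generation adaptive, intelligent transport infrastructure.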

Abstract

The rapid expansion of interconnected devices, autonomous systems, and AI applications has created severe fragmentation in adaptive transport systems, where diverse protocols and context sources remain isolated. This survey provides the first systematic investigation of the Model Context Protocol (MCP) as a unifying paradigm, highlighting its ability to bridge protocol-level adaptation with context-aware decision making. Analyzing established literature, we show that existing efforts have implicitly converged toward MCP-like architectures, signaling a natural evolution from fragmented solutions to standardized integration frameworks. We propose a five-category taxonomy covering adaptive mechanisms, context-aware frameworks, unification models, integration strategies, and MCP-enabled architectures. Our findings reveal three key insights: traditional transport protocols have reached the limits of isolated adaptation, MCP's client-server and JSON-RPC structure enables semantic interoperability, and AI-driven transport demands integration paradigms uniquely suited to MCP. Finally, we present a research roadmap positioning MCP as a foundation for next-generation adaptive, context-aware, and intelligent transport infrastructures.
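To make the JSON-RPC structure mentioned above concrete, here is a minimal sketch of a JSON-RPC 2.0 request and its matching success response. The `tools/call` method name is taken from the MCP specification; the tool name `get_traffic_state`, its arguments, and the helper functions are hypothetical, invented for illustration and not drawn from the survey.

```python
import json

def make_request(request_id, method, params):
    """Build a JSON-RPC 2.0 request object."""
    return {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}

def make_result(request_id, result):
    """Build the JSON-RPC 2.0 success response that answers request_id."""
    return {"jsonrpc": "2.0", "id": request_id, "result": result}

# Hypothetical tool name and arguments, for illustration only.
request = make_request(
    1,
    "tools/call",
    {"name": "get_traffic_state", "arguments": {"intersection_id": "A12"}},
)

# Serialized form that a client would send over the transport.
wire_message = json.dumps(request)
```

The shared `id` field is what lets a client pair each response with the request that produced it, independent of the transport underneath.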

2508.03668 2026-06-08 cs.CL updated version

CTR-Sink: Attention Sink for Language Models in Click-Through Rate Prediction

Zixuan Li, Binzong Geng, Jing Xiong, Yong He, Yuxuan Hu, Jian Chen, Dingwei Chen, Xiyu Chang, Ngai Wong, Liang Zhang, Linjian Mo, Chengming Li, Chuan Yuan, Zhenan Sun

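Affiliations: NLPR, Institute of Automation, Chinese Academy of Sciences; Ant Group; The University of Hong Kong; City University of Hong Kong; Sun Yat-sen University; Shenzhen MSU-BIT University

AI summary: To address the semantic fragmentation caused by structural differences between user behavior sequences and the text used to pretrain language models, proposes CTR-Sink, a framework that introduces behavior-level attention sinks and dynamically regulates attention aggregation to improve click-through rate prediction.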

Abstract

Click-Through Rate (CTR) prediction, a core task in recommendation systems, estimates user click likelihood using historical behavioral data. Modeling user behavior sequences as text to leverage Language Models (LMs) for this task has gained traction, owing to LMs' strong semantic understanding and contextual modeling capabilities. However, a critical structural gap exists: user behavior sequences consist of discrete actions connected by semantically empty separators, differing fundamentally from the coherent natural language in LM pre-training. This mismatch causes semantic fragmentation, where LM attention scatters across irrelevant tokens instead of focusing on meaningful behavior boundaries and inter-behavior relationships, degrading prediction performance. To address this, we propose $\textit{CTR-Sink}$, a novel framework introducing behavior-level attention sinks tailored for recommendation scenarios. Inspired by attention sink theory, it constructs attention focus sinks and dynamically regulates attention aggregation via external information. Specifically, we insert sink tokens between consecutive behaviors, incorporating recommendation-specific signals such as temporal distance to serve as stable attention sinks. To enhance generality, we design a two-stage training strategy that explicitly guides LM attention toward sink tokens and an attention sink mechanism that amplifies inter-sink dependencies to better capture behavioral correlations. Experiments on one industrial dataset and two open-source datasets (MovieLens, Kuairec), alongside visualization results, validate the method's effectiveness across scenarios.
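
The idea of inserting sink tokens between consecutive behaviors can be illustrated with a small sketch. It only shows the input-construction step; the `<SINK:...>` token format, the `Behavior` type, and the time-gap buckets are hypothetical choices made for illustration, not the format used by CTR-Sink.

```python
from dataclasses import dataclass


@dataclass
class Behavior:
    text: str       # textual description of one user action
    timestamp: int  # time of the action, in seconds


def bucket_gap(seconds: int) -> str:
    """Map a time gap to a coarse bucket name (hypothetical bucketing)."""
    if seconds < 3600:
        return "lt_1h"
    if seconds < 86400:
        return "lt_1d"
    return "ge_1d"


def build_prompt(behaviors: list[Behavior]) -> str:
    """Join behaviors, placing a sink token that carries the temporal
    distance between each pair of consecutive behaviors."""
    parts = []
    for prev, cur in zip([None] + behaviors[:-1], behaviors):
        if prev is not None:
            gap = cur.timestamp - prev.timestamp
            parts.append(f"<SINK:{bucket_gap(gap)}>")
        parts.append(cur.text)
    return " ".join(parts)
```

In a real pipeline the sink tokens would presumably be added to the tokenizer vocabulary so that each one maps to a single token; that step is left out here.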

2507.12927 2026-06-08 cs.LG cs.IT math.IT Updated version

Trace Reconstruction with Language Models

Franziska Weindel, Michael Girsch, Reinhard Heckel

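Affiliations: School of Computation, Information and Technology, Technical University of Munich; Munich Center for Machine Learning

AI summary: Proposes TReconLM, a decoder-only transformer that treats trace reconstruction as a next-token prediction task; it is pretrained on synthetic data, fine-tuned on real data, and significantly outperforms existing algorithms.

Details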
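AI abstract (translated from Chinese)

The general trace reconstruction problem aims to recover an original sequence from noisy copies that have been independently corrupted by insertions, deletions, and substitutions. The problem arises in applications such as DNA data storage, a promising storage medium thanks to its high information density and longevity. However, errors introduced during DNA synthesis, storage, and sequencing must be corrected with algorithms and codes, and trace reconstruction is often part of data retrieval. In this work, we propose TReconLM, a decoder-only transformer that solves trace reconstruction as a next-token prediction task. TReconLM outperforms state-of-the-art trace reconstruction algorithms, including prior deep-learning methods, and recovers a higher fraction of sequences without error. We pretrain on synthetic data generated from a simple error model and fine-tune on real-world data to adapt to technology-specific error patterns. Code is available at https://github.com/MLI-lab/TReconLM.

English abstract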

The general trace reconstruction problem seeks to recover an original sequence from its noisy copies independently corrupted by insertions, deletions, and substitutions. This problem arises in applications such as DNA data storage, a promising storage medium due to its high information density and longevity. However, errors introduced during DNA synthesis, storage, and sequencing require correction through algorithms and codes, with trace reconstruction often used as part of data retrieval. In this work, we propose TReconLM, a decoder-only transformer that solves trace reconstruction as a next-token prediction task. TReconLM outperforms state-of-the-art trace reconstruction algorithms, including prior deep-learning approaches, recovering a substantially higher fraction of sequences without error. We pretrain on synthetic data generated from a simple error model and fine-tune on real-world data to adapt to technology-specific error patterns. Code is available at https://github.com/MLI-lab/TReconLM.
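
To make the "synthetic data generated from a simple error model" concrete, the sketch below simulates one plausible i.i.d. insertion/deletion/substitution channel over the DNA alphabet. The function names, the error rates, and the exact order in which errors are applied are assumptions for illustration; the authors' generator may differ.

```python
import random

ALPHABET = "ACGT"


def corrupt(seq: str, p_ins: float, p_del: float, p_sub: float,
            rng: random.Random) -> str:
    """Pass seq through an i.i.d. insertion/deletion/substitution channel."""
    out = []
    for base in seq:
        if rng.random() < p_ins:
            out.append(rng.choice(ALPHABET))  # insert a random base first
        r = rng.random()
        if r < p_del:
            continue                          # delete this base
        if r < p_del + p_sub:
            # substitute with a different base
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)                  # transmit unchanged
    return "".join(out)


def make_example(length: int, num_traces: int, rng: random.Random,
                 p_ins: float = 0.01, p_del: float = 0.01,
                 p_sub: float = 0.01):
    """Return (traces, target) for one synthetic training example."""
    target = "".join(rng.choice(ALPHABET) for _ in range(length))
    traces = [corrupt(target, p_ins, p_del, p_sub, rng)
              for _ in range(num_traces)]
    return traces, target
```

A next-token-prediction model would then be trained on some serialization of the traces followed by the target; that serialization is not specified in the abstract and is not sketched here.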

2505.05232 2026-06-08 cs.AI Updated version

ChemQuests: A Curated Chemistry Question-Answer Database Extracted from ChemRxiv papers

Mahmoud Amiri, Thomas Bocklitz

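Affiliations: University of California, Berkeley; Stanford University

AI summary: Presents ChemQuests, a dataset of 952 high-quality question-answer pairs extracted from 155 ChemRxiv papers across 17 chemistry subfields, intended for chemistry NLP research.

Details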
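AI abstract (translated from Chinese)

The rapid growth of the chemistry literature makes it hard for researchers to access domain-specific knowledge efficiently. To support progress in natural language processing (NLP) for chemistry, we present ChemQuests, a curated dataset of 952 high-quality question-answer (QA) pairs drawn from 155 ChemRxiv papers across 17 subfields of chemistry. Each QA pair is explicitly linked to its source text segment to ensure traceability and contextual accuracy. ChemQuests was built with an automated pipeline that combines optical character recognition (OCR), QA generation with GPT-4o, and fuzzy-search verification. The dataset emphasizes conceptual, mechanistic, applied, and synthetic or experimental questions, and supports retrieval-based QA systems, search engine development, and fine-tuning of domain-adapted large language models. We analyze the dataset's structure, coverage, and limitations, and outline future directions for expansion and expert validation. ChemQuests provides a foundational resource for chemistry NLP research, education, and tool development.

English abstract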

The rapid expansion of chemistry literature poses significant challenges for researchers seeking to efficiently access domain-specific knowledge. To support advancements in chemistry-focused natural language processing (NLP), we present ChemQuests, a curated dataset of 952 high-quality question-answer (QA) pairs derived from 155 ChemRxiv papers across 17 subfields of chemistry. Each QA pair is explicitly linked to its source text segment to ensure traceability and contextual accuracy. ChemQuests was constructed using an automated pipeline that combines optical character recognition (OCR), QA generation using GPT-4o, and fuzzy-search verification. The dataset emphasizes conceptual, mechanistic, applied, and synthetic or experimental questions, enabling applications in retrieval-based QA systems, search engine development, and fine-tuning of domain-adapted large language models. We analyze the dataset's structure, coverage, and limitations, and outline future directions for expansion and expert validation. ChemQuests provides a foundational resource for chemistry NLP research, education, and tool development.
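
The fuzzy-search verification step can be pictured as checking that a generated source snippet really occurs, up to OCR noise, in the paper text. The sketch below uses Python's standard `difflib.SequenceMatcher`; the sliding-window strategy, the function names, and the 0.85 threshold are assumptions for illustration, not the ChemQuests pipeline itself.

```python
from difflib import SequenceMatcher


def best_match_ratio(snippet: str, document: str) -> float:
    """Slide a window of len(snippet) over document and return the
    highest similarity ratio found (1.0 means an exact match)."""
    snippet = " ".join(snippet.lower().split())    # normalize whitespace/case
    document = " ".join(document.lower().split())
    n = len(snippet)
    if n == 0 or len(document) < n:
        return SequenceMatcher(None, snippet, document).ratio()
    step = max(1, n // 4)                          # coarse stride for speed
    best = 0.0
    for start in range(0, len(document) - n + 1, step):
        window = document[start:start + n]
        best = max(best, SequenceMatcher(None, snippet, window).ratio())
    return best


def is_grounded(snippet: str, document: str, threshold: float = 0.85) -> bool:
    """Accept a QA pair only if its source snippet is found in the document."""
    return best_match_ratio(snippet, document) >= threshold
```

A lower threshold tolerates more OCR damage but also admits snippets that the generator paraphrased or invented, so the value would need tuning on real data.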

2506.02622 2026-06-08 cs.RO cs.HC Updated version

HORUS: A Mixed Reality Interface for Managing Teams of Mobile Robots

Omotoye Shamsudeen Adekoya, Antonio Sgorbissa, Carmine Tommaso Recchiuto

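Affiliations: DIBRIS Department, RICE Laboratory, University of Genoa

AI summary: Presents HORUS, a mixed reality interface that supports monitoring and task assignment for multi-robot teams through a Mini-Map and several teleoperation modes; a user study validates its coordination effectiveness in a search-and-rescue task.

Details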
Comments
7 pages, 7 figures, conference paper submitted to UR 2026
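AI abstract (translated from Chinese)

Mixed reality (MR) interfaces have been widely explored for controlling mobile robots, but research on using them to manage teams of robots is limited. This paper presents HORUS (Holistic Operational Reality for Unified Systems), a mixed reality interface that provides a comprehensive set of tools for managing multiple mobile robots at once. HORUS lets operators monitor the status of individual robots, project sensor data in real time, and assign tasks to a single robot, a subset of the team, or the whole team, all through a Mini-Map (Ground Station). The interface also offers different teleoperation modes: a mini-map mode, which allows teleoperation while observing the robot model and its transform on the mini-map, and a semi-immersive mode, which provides a flat, screen-like view in either single or stereo (3D) view. We conducted a user study in which participants used HORUS to manage a team of mobile robots tasked with finding clues in an environment, simulating a search-and-rescue mission. The study compared HORUS's full team-management capabilities with teleoperation of individual robots. The experiments validated the versatility and effectiveness of HORUS for multi-robot coordination and showed its potential to advance human-robot collaboration in dynamic, team-based environments.

English abstract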

Mixed Reality (MR) interfaces have been extensively explored for controlling mobile robots, but there is limited research on their application to managing teams of robots. This paper presents HORUS: Holistic Operational Reality for Unified Systems, a Mixed Reality interface offering a comprehensive set of tools for managing multiple mobile robots simultaneously. HORUS enables operators to monitor individual robot statuses, visualize sensor data projected in real time, and assign tasks to single robots, subsets of the team, or the entire group, all from a Mini-Map (Ground Station). The interface also provides different teleoperation modes: a mini-map mode that allows teleoperation while observing the robot model and its transform on the mini-map, and a semi-immersive mode that offers a flat, screen-like view in either single or stereo view (3D). We conducted a user study in which participants used HORUS to manage a team of mobile robots tasked with finding clues in an environment, simulating search and rescue tasks. This study compared HORUS's full-team management capabilities with individual robot teleoperation. The experiments validated the versatility and effectiveness of HORUS in multi-robot coordination, demonstrating its potential to advance human-robot collaboration in dynamic, team-based environments.

2506.01850 2026-06-08 cs.CV cs.AI cs.LG cs.MM Updated version

MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs

Wayner Barrios, Andrés Villa, Juan León Alcázar, SouYoung Jin, Bernard Ghanem

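Affiliations: University of California, Berkeley

AI summary: Proposes MoDA, a modulation adapter that strengthens fine-grained visual grounding through instruction-guided channel-wise multiplicative modulation, yielding consistent gains on 12 benchmarks across three MLLM architectures with minimal computational overhead.

Details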
Comments
Accepted at ICML 2026. Code is available at https://github.com/waybarrios/MoDA
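AI abstract (translated from Chinese)

Multimodal large language models (MLLMs) have achieved remarkable success on instruction-following tasks by integrating pretrained visual encoders with large language models (LLMs). Existing methods, however, often struggle with fine-grained visual grounding because of semantic entanglement in visual patch representations: a single patch blends several distinct visual elements, which makes it hard for the model to focus on instruction-relevant details. To address this challenge, we propose MoDA (Modulation Adapter), a lightweight module that strengthens visual grounding through instruction-guided channel-wise modulation. Unlike token-level methods such as Q-Former, which perform additive feature selection, MoDA operates at the channel level by applying multiplicative modulation to already-aligned features, giving fine-grained control over which embedding dimensions are relevant to each instruction. Following the standard LLaVA training protocol, MoDA applies cross-attention between the language instruction and pre-aligned visual features to generate dynamic modulation masks, with no architectural modifications or additional supervision. We evaluate MoDA on 12 benchmarks covering visual question answering, vision-centric reasoning, and hallucination detection, including recent 2024 benchmarks (MMVP, CV-Bench, MMStar, RealWorldQA), and on three different MLLM architectures: LLaVA-1.5, LLaVA-MoRE (2025), and Qwen3-VL (2025). MoDA yields consistent gains across all three families: +12.0 points on MMVP for the LLaVA-1.5 family, +4.8 points on ScienceQA for the LLaVA-MoRE family, and +4.9 on ScienceQA, +4.1 on RealWorldQA, and +3.8 on GQA for Qwen3-VL, confirming that the gains generalize beyond CLIP-based encoders with minimal computational overhead (<1% FLOPs). Code is available at https://github.com/waybarrios/MoDA.

English abstract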

Multimodal Large Language Models (MLLMs) have achieved remarkable success in instruction-following tasks by integrating pretrained visual encoders with large language models (LLMs). However, existing approaches often struggle with fine-grained visual grounding due to semantic entanglement in visual patch representations, where individual patches blend multiple distinct visual elements, making it difficult for models to focus on instruction-relevant details. To address this challenge, we propose MoDA (Modulation Adapter), a lightweight module that enhances visual grounding through instruction-guided channel-wise modulation. Unlike token-level methods such as Q-Former that perform additive feature selection, MoDA operates at the channel level through multiplicative modulation on already-aligned features, enabling fine-grained control over which embedding dimensions are relevant for each instruction. Following the standard LLaVA training protocol, MoDA applies cross-attention between language instructions and pre-aligned visual features, generating dynamic modulation masks without architectural modifications or additional supervision. We evaluate MoDA across 12 benchmarks spanning visual question answering, vision-centric reasoning, and hallucination detection, including recent 2024 benchmarks (MMVP, CV-Bench, MMStar, RealWorldQA), on three distinct MLLM architectures: LLaVA-1.5, LLaVA-MoRE (2025), and Qwen3-VL (2025). MoDA delivers consistent gains across all three families, with +12.0 points on MMVP for the LLaVA-1.5 family and +4.8 points on ScienceQA for the LLaVA-MoRE family, and +4.9 ScienceQA, +4.1 RealWorldQA, and +3.8 GQA on Qwen3-VL, confirming that the gains generalize beyond CLIP-based encoders with minimal overhead (<1% FLOPs). Code is available at https://github.com/waybarrios/MoDA.

2505.23437 2026-06-08 cs.LG cs.AI cs.IR Updated version

Bounded-Abstention Pairwise Learning to Rank

Antonio Ferrara, Andrea Pugnana, Francesco Bonchi, Salvatore Ruggieri

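Affiliations: Intesa Sanpaolo AI Research; University of Trento; University of Pisa

AI summary: Proposes an abstention method for pairwise ranking based on thresholding the conditional risk, characterizes the optimal strategy theoretically, designs a model-agnostic plug-in algorithm, and validates its effectiveness experimentally.

Details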
Comments
KDD 2026
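AI abstract (translated from Chinese)

Ranking systems influence decisions in high-stakes domains such as health, education, and employment, where they can have major economic and social impact, so integrating safety mechanisms is essential. Abstention is one such mechanism: it lets an algorithmic decision-making system defer uncertain or low-confidence decisions to human experts. Abstention has been studied mainly for classification tasks, and its use in other machine learning paradigms remains underexplored. This paper proposes a new abstention method for pairwise learning-to-rank tasks. The method thresholds the ranker's conditional risk: when the estimated risk exceeds a predefined threshold, the system abstains from making a decision. Our contributions are threefold: a theoretical characterization of the optimal abstention strategy, a model-agnostic plug-in algorithm for building abstaining ranking models, and a comprehensive empirical evaluation on multiple datasets that demonstrates the effectiveness of our approach.

English abstract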

Ranking systems influence decision-making in high-stakes domains like health, education, and employment, where they can have substantial economic and social impacts. This makes the integration of safety mechanisms essential. One such mechanism is abstention, which enables algorithmic decision-making systems to defer uncertain or low-confidence decisions to human experts. While abstention has been predominantly explored in the context of classification tasks, its application to other machine learning paradigms remains underexplored. In this paper, we introduce a novel method for abstention in pairwise learning-to-rank tasks. Our approach is based on thresholding the ranker's conditional risk: the system abstains from making a decision when the estimated risk exceeds a predefined threshold. Our contributions are threefold: a theoretical characterization of the optimal abstention strategy, a model-agnostic, plug-in algorithm for constructing abstaining ranking models, and a comprehensive empirical evaluation across multiple datasets, demonstrating the effectiveness of our approach.
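
The thresholding rule can be sketched for a single pair of items. The sketch assumes a plug-in estimate `p_i_over_j` of the probability that item i should rank above item j, and takes the conditional risk to be the 0-1 risk of ordering the pair by the more likely direction; the paper's exact risk definition and calibration procedure may differ, and all names here are hypothetical.

```python
def conditional_risk(p_i_over_j: float) -> float:
    """0-1 risk of ordering the pair by the more likely direction."""
    return min(p_i_over_j, 1.0 - p_i_over_j)


def rank_pair(p_i_over_j: float, threshold: float) -> str:
    """Order the pair, or abstain when the estimated risk is too high."""
    if conditional_risk(p_i_over_j) > threshold:
        return "abstain"
    return "i>j" if p_i_over_j >= 0.5 else "j>i"


def calibrate_threshold(risks: list[float], max_abstain_rate: float) -> float:
    """Pick a threshold so that at most max_abstain_rate of the
    calibration pairs have risk strictly above it."""
    k = int(max_abstain_rate * len(risks))   # abstentions we can afford
    ordered = sorted(risks, reverse=True)
    if k >= len(ordered):
        return 0.0                           # budget covers every pair
    return ordered[k]                        # (k+1)-th largest risk
```

Choosing the threshold from a held-out calibration set is one simple way to respect an abstention budget; deferred pairs would then go to a human expert.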

2505.23131 2026-06-08 cs.LG cs.DC Updated version

DOPPLER: Dual-Policy Learning for Device Assignment in Asynchronous Dataflow Graphs

Xinyu Yao, Daniel Bourgeois, Abhinav Jain, Yuxin Tang, Jiawen Yao, Zhimin Ding, Arlei Silva, Chris Jermaine

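Affiliations: Rice University; Rice Ken Kennedy Institute

AI summary: Proposes the Doppler framework, which optimizes device assignment in asynchronous dataflow graphs with dual-policy networks (SEL selects operations, PLC places them on devices), reducing execution time and improving sampling efficiency.

Details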
Journal ref
Proceedings of the International Conference on Learning Representations (ICLR), 2026
Comments
32 pages, 19 figures
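AI abstract (translated from Chinese)

We study the problem of assigning the operations in a dataflow graph to devices so as to minimize execution time in a work-conserving system, with a focus on complex machine learning workloads. Prior learning-based methods often struggle because of three key limitations: (1) reliance on bulk-synchronous systems such as TensorFlow, which under-utilize devices because of barrier synchronization; (2) lack of awareness of the underlying system's scheduling mechanism when designing the learning-based method; and (3) exclusive reliance on reinforcement learning, which ignores the structure of effective expert-designed heuristics. In this paper, we propose Doppler, a three-stage framework for training dual-policy networks, consisting of 1) a $\mathsf{SEL}$ policy for selecting operations and 2) a $\mathsf{PLC}$ policy for placing the selected operations on devices. Our experiments show that Doppler outperforms all baseline methods across tasks by reducing system execution time, and that it is sample-efficient, reducing per-episode training time.

English abstract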

We study the problem of assigning operations in a dataflow graph to devices to minimize execution time in a work-conserving system, with emphasis on complex machine learning workloads. Prior learning-based methods often struggle due to three key limitations: (1) reliance on bulk-synchronous systems like TensorFlow, which under-utilize devices due to barrier synchronization; (2) lack of awareness of the scheduling mechanism of underlying systems when designing learning-based methods; and (3) exclusive dependence on reinforcement learning, ignoring the structure of effective heuristics designed by experts. In this paper, we propose Doppler, a three-stage framework for training dual-policy networks consisting of 1) a $\mathsf{SEL}$ policy for selecting operations and 2) a $\mathsf{PLC}$ policy for placing chosen operations on devices. Our experiments show that Doppler outperforms all baseline methods across tasks by reducing system execution time and additionally demonstrates sampling efficiency by reducing per-episode training time.

2505.14289 2026-06-08 cs.AI Updated version

EVA: Evolving Semantic Adversaries for Red-Teaming GUI Agents Against Environmental Injection Attacks

Yijie Lu, Manman Zhao, Tianjie Ju, Zihe Yan, Xinbei Ma, Yuan Guo, Daizong Ding, Gongshen Liu, Zhuosheng Zhang

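Affiliations: School of Cyber Science and Engineering, Wuhan University; School of Computer Science, Shanghai Jiao Tong University; Independent Researcher

AI summary: Proposes the EVA framework, which attacks GUI agents driven by multimodal large language models by evolving semantic adversarial payloads, shows that semantic deception is the key to attack success, and reaches an attack success rate of up to 85% with fast convergence.

Details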
Comments
Accepted by
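AI abstract (translated from Chinese)

Graphical user interface (GUI) agents driven by multimodal large language models (MLLMs) are increasingly deployed, yet they are vulnerable to environmental injection attacks (EIAs). Current red-teaming methods are limited by high computational cost and limited adaptability. A fundamental question remains open: does the bottleneck for attack success lie in visual perception or in semantic understanding? Through controlled experiments, we observe that semantic deception, rather than visual appearance, is the main determinant of attack success. Building on this insight, we introduce EVA, an evolutionary framework that evolves adversarial payloads only along the semantic dimension. EVA uses a discovery-deployment framework to mine linguistic vulnerability patterns and distill them into generalizable rules. Experiments on five representative victim agents show that EVA reaches an attack success rate of up to 85%, evolving benign seeds into successful attacks within only 1.18 to 1.71 iterations. This fast convergence reveals a dense semantic attack space in the models' latent representations and exposes a critical alignment paradox: the instruction-following ability reinforced by alignment training makes agents inherently susceptible to authoritative, semantically deceptive environmental cues.

English abstract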

Graphical User Interface (GUI) agents powered by Multimodal Large Language Models (MLLMs) are increasingly deployed yet vulnerable to Environmental Injection Attacks (EIAs). However, current red-teaming methods are hindered by prohibitive computational costs and limited adaptability. A fundamental question remains unaddressed: does the bottleneck of attack success lie in visual perception or semantic understanding? Through controlled experiments, we observe that semantic deception, rather than visual appearance, serves as the primary determinant of attack success. Based on this insight, we introduce EVA, an evolutionary framework that evolves adversarial payloads exclusively within the semantic dimension. EVA employs a discovery-deployment framework to mine linguistic vulnerability patterns and distill them into generalizable rules. Experimental results across five representative victim agents demonstrate that EVA achieves up to 85% attack success rate, evolving benign seeds into successful attacks within only 1.18 to 1.71 iterations. This rapid convergence uncovers a dense semantic attack space in the model's latent representation, unveiling a critical alignment paradox: the instruction-following capabilities reinforced by alignment training render agents inherently susceptible to authoritative, semantically deceptive environmental cues.

2505.10892 2026-06-08 cs.LG Updated version

Multi-Objective Preference Optimization: Improving Human Alignment of Generative Models

Akhil Agnihotri, Rahul Jain, Deepak Ramachandran, Zheng Wen

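Affiliations: University of California, Berkeley; Stanford University

AI summary: Observing that RLHF and preference optimization methods assume a single objective, the authors propose MOPO, a constrained, KL-regularized multi-objective preference optimization framework that maximizes a primary objective while guaranteeing lower bounds on secondary objectives; it recovers Pareto-optimal policies on synthetic benchmarks and Pareto-dominates baselines on human-preference data.

Details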
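AI abstract (translated from Chinese)

Post-training LLMs with RLHF and preference optimization methods (such as DPO and IPO) has greatly improved alignment, but these methods assume a single objective. In practice, humans express multiple, often conflicting objectives, such as helpfulness and harmlessness, with no natural scalarization. We study the multi-objective preference alignment problem, in which a policy must balance several objectives at once. We propose Multi-Objective Preference Optimization (MOPO), a constrained, KL-regularized framework that maximizes a primary objective while enforcing lower bounds on secondary objectives through tunable safety thresholds. MOPO works directly on pairwise preferences, needs no point-wise rewards, and admits simple closed-form iterative updates. Empirically, MOPO recovers Pareto-optimal policies on synthetic benchmarks and, when fine-tuned on human-preference data, yields multi-billion-parameter models that achieve higher rewards and Pareto-dominate the baselines, with stable and robust optimization dynamics.

English abstract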

Post-training LLMs with RLHF and preference optimization methods (e.g., DPO, IPO) has greatly improved alignment, yet these approaches assume a single objective. In reality, humans express multiple, often conflicting objectives, such as helpfulness and harmlessness, with no natural scalarization. We study the multi-objective preference alignment problem, where a policy must balance several objectives simultaneously. We propose Multi-Objective Preference Optimization (MOPO), a constrained KL-regularized framework that maximizes a primary objective while enforcing lower bounds on secondary objectives via tunable safety thresholds. MOPO operates directly on pairwise preferences without point-wise rewards, and admits simple closed-form iterative updates. Empirically, MOPO recovers Pareto-optimal policies on synthetic benchmarks and, when fine-tuned on human-preference data, yields multi-billion parameter models that achieve higher rewards and Pareto-dominate baselines, with stable and robust optimization dynamics.

2504.10102 2026-06-08 cs.RO cs.SY eess.SY Updated version

A Human-Sensitive Controller: Adapting to Human Musculoskeletal Disorder-Related Constraints via Reinforcement Learning

Vitor Martins, Sara M. Cerqueira, Mercedes Balcells, Elazer R Edelman, Cristina P. Santos

Affiliations: Fundação para a Ciência e Tecnologia; Centro de Microssistemas Eletromecânicos da Universidade do Minho; Massachusetts Institute of Technology; Brigham and Women’s Hospital, Harvard Medical School; GEVAB, IQS School of Engineering; LABBELS-Associate Laboratory, University of Minho

AI summary: Proposes a reinforcement-learning-based, human-sensitive robot control strategy that uses Q-learning and Deep Q-Networks to optimize ergonomics for collaborative robots; the DQN policy completes tasks on average 38% faster while keeping pain risk at zero.

Details
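AI abstract (translated from Chinese)

Work-related musculoskeletal disorders remain a major challenge in industrial settings, leading to reduced workforce participation, higher healthcare costs, and long-term disability. This study introduces a human-sensitive robotic system that aims to reintegrate individuals with a history of musculoskeletal disorders into standard job roles while also optimizing ergonomic conditions for the wider workforce. It uses reinforcement learning (RL) to develop a human-aware control strategy for collaborative robots, focusing on optimizing ergonomic conditions and preventing pain during task execution. Two RL methods, Q-learning and Deep Q-Network (DQN), were implemented and tested to personalize the control strategy to individual user characteristics. Although the experimental results revealed a simulation-to-real gap, a fine-tuning phase successfully adapted the policies to real-world conditions. DQN outperformed Q-learning, completing tasks faster while maintaining zero pain risk and safe ergonomic levels, with task completion times on average 38% shorter across all tested anthropometries. A structured testing protocol confirmed the system's adaptability to different anthropometries, highlighting the potential of RL-driven collaborative robots to enable safer, more inclusive workplaces.

English abstract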

Work-Related Musculoskeletal Disorders continue to be a major challenge in industrial environments, leading to reduced workforce participation, increased healthcare costs, and long-term disability. This study introduces a human-sensitive robotic system aimed at reintegrating individuals with a history of musculoskeletal disorders into standard job roles, while simultaneously optimizing ergonomic conditions for the broader workforce. This research leverages reinforcement learning (RL) to develop a human-aware control strategy for collaborative robots, focusing on optimizing ergonomic conditions and preventing pain during task execution. Two RL approaches, Q-Learning and Deep Q-Network (DQN), were implemented and tested to personalize control strategies based on individual user characteristics. Although experimental results revealed a simulation-to-real gap, a fine-tuning phase successfully adapted the policies to real-world conditions. DQN outperformed Q-Learning by completing tasks faster while maintaining zero pain risk and safe ergonomic levels, achieving on average 38% shorter task completion times across all tested anthropometries. The structured testing protocol confirmed the system's adaptability to diverse human anthropometries, underscoring the potential of RL-driven cobots to enable safer, more inclusive workplaces.
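
For readers unfamiliar with the simpler of the two methods, the standard tabular Q-learning update is sketched below. Only the update rule is textbook material; the state labels, actions, and reward values in the toy usage are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def q_learning_step(q, state, action, reward, next_state, actions,
                    alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update and return the new Q(state, action)."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next           # bootstrapped target
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical toy setup: states are coarse posture labels, actions are
# handover heights, and the reward penalizes postures flagged as risky.
q = defaultdict(float)
actions = ["low", "mid", "high"]
q_learning_step(q, "neutral", "mid", reward=1.0,
                next_state="neutral", actions=actions)
q_learning_step(q, "neutral", "high", reward=-5.0,
                next_state="strained", actions=actions)
```

A DQN replaces the table `q` with a neural network trained toward the same kind of target, which is what lets it generalize across continuous user characteristics.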

2504.00613 2026-06-08 cs.AI cs.IT cs.NE math.IT Updated version

LLM-Guided Search for Deletion-Correcting Codes

Franziska Weindel, Reinhard Heckel

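Affiliations: School of Computation, Information and Technology, Technical University of Munich

AI summary: Addressing the open problem of the maximum size of deletion-correcting codes, the authors use FunSearch, an LLM-guided evolutionary search, to discover functions that construct deletion-correcting codes at short code lengths. In the single-deletion case the discovered function provably constructs the conjectured-optimal Varshamov-Tenengolts code; for multiple deletions and quaternary edit codes it improves on existing constructions without offering new theoretical insight.

Details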
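AI abstract (translated from Chinese)

Finding deletion-correcting codes of maximum size has been an open problem for more than 70 years, even for a single deletion. We adapt FunSearch, a large language model (LLM)-guided evolutionary search, to discover functions that construct deletion-correcting codes at short code lengths. For a single deletion, our search finds a function that we prove constructs the conjectured-optimal Varshamov-Tenengolts code. For multiple deletions and quaternary edit codes, the discovered functions improve on prior explicit, search-based, and neural constructions, but they remain empirical heuristics that offer no new theoretical insight. We study the design choices of LLM-guided evolutionary search and find that, for our problem, compute is better spent on sampling more functions than on longer reasoning traces per function, and that co-evolving natural language descriptions with the code hurts search quality. We propose deduplicating logically identical functions during evolution, which is critical for search diversity. Our results show the potential of LLM-guided evolutionary search in information theory and code design and represent the first application of such methods to constructing error-correcting codes. In our current formulation, however, the cost of evaluating a function grows exponentially with code length, which limits the approach to short codes.

English abstract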

Finding deletion-correcting codes of maximum size has been an open problem for over 70 years, even for a single deletion. We adapt FunSearch, a large language model (LLM)-guided evolutionary search, to discover functions that construct deletion-correcting codes at short code lengths. For a single deletion, our search finds a function that we prove constructs the conjectured-optimal Varshamov-Tenengolts code. For multiple deletions and quaternary edit codes, the discovered functions improve on prior explicit, search-based, and neural constructions but remain empirical heuristics without new theoretical insights. We study design choices for LLM-guided evolutionary search and find that, for our problem, compute is better allocated to sampling more functions than to longer reasoning traces per function, and that co-evolving natural language descriptions with code hurts search quality. We propose deduplicating logically identical functions during evolution, which we find critical for search diversity. Our results demonstrate the potential of LLM-guided evolutionary search for information theory and code design and represent the first application of such methods for constructing error-correcting codes. However, in our current formulation, evaluating a function scales exponentially with code length, limiting the approach to short codes.
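
As background for the single-deletion result, the binary Varshamov-Tenengolts code VT_a(n) is the set of length-n binary words x with sum(i * x_i) congruent to a modulo n + 1. The sketch below builds it by brute force and checks single-deletion correctability by verifying that the deletion balls of distinct codewords are disjoint. It illustrates the classical code only, not the function found by the search, and its enumeration is exponential in n, in line with the limitation noted in the abstract.

```python
from itertools import product


def vt_code(n: int, a: int = 0) -> list[tuple[int, ...]]:
    """Binary Varshamov-Tenengolts code VT_a(n), built by enumeration."""
    return [x for x in product((0, 1), repeat=n)
            if sum(i * bit for i, bit in enumerate(x, start=1)) % (n + 1) == a]


def deletion_ball(x: tuple[int, ...]) -> set[tuple[int, ...]]:
    """All words obtained from x by deleting exactly one symbol."""
    return {x[:i] + x[i + 1:] for i in range(len(x))}


def corrects_single_deletion(code) -> bool:
    """True if no two codewords share a word in their deletion balls."""
    seen = set()
    for codeword in code:
        ball = deletion_ball(codeword)
        if seen & ball:
            return False
        seen |= ball
    return True
```

For example, `vt_code(4)` contains the four words 0000, 0110, 1001, and 1111, whose single-deletion balls do not overlap.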

2502.16531 2026-06-08 cs.RO cs.SY eess.SY Updated version

Efficient Coordination and Synchronization of Multi-Robot Systems Under Recurring Linear Temporal Logic

Davide Peron, Victor Nan Fernandez-Ayala, Eleftherios E. Vlahakis, Dimos V. Dimarogonas

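Affiliations: Department of Information Engineering, University of Padova; Division of Decision and Control Systems, School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology

AI summary: Proposes a bottom-up approach that combines offline plan synthesis with online coordination, adjusts plans dynamically through real-time communication, and adds a synchronization mechanism to handle action delays, giving a scalable coordination and synchronization framework for multi-robot systems.

Details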
Journal ref
Proc. IEEE ICRA, 2025, pp. 10194-10200
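AI abstract (translated from Chinese)

We consider multi-robot systems under recurring tasks formalized as linear temporal logic (LTL) specifications. To solve the planning problem efficiently, we propose a bottom-up approach that combines offline plan synthesis with online coordination and adjusts plans dynamically through real-time communication. To handle action delays, we introduce a synchronization mechanism that ensures coordinated task execution, resulting in a multi-agent coordination and synchronization framework suited to a wide range of multi-robot applications. The software package is developed in Python and ROS2 for broad deployment. We validate our findings with lab experiments involving nine robots, which show improved adaptability compared with previous methods. We also run simulations with up to ninety agents to demonstrate the reduced computational complexity and the scalability of our work.

English abstract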

We consider multi-robot systems under recurring tasks formalized as linear temporal logic (LTL) specifications. To solve the planning problem efficiently, we propose a bottom-up approach combining offline plan synthesis with online coordination, dynamically adjusting plans via real-time communication. To address action delays, we introduce a synchronization mechanism ensuring coordinated task execution, leading to a multi-agent coordination and synchronization framework that is adaptable to a wide range of multi-robot applications. The software package is developed in Python and ROS2 for broad deployment. We validate our findings through lab experiments involving nine robots showing enhanced adaptability compared to previous methods. Additionally, we conduct simulations with up to ninety agents to demonstrate the reduced computational complexity and the scalability features of our work.

2412.09119 2026-06-08 cs.LG cs.CR math.OC Updated version

The Utility and Complexity of in- and out-of-Distribution Machine Unlearning

Youssef Allouah, Joshua Kazdan, Rachid Guerraoui, Sanmi Koyejo

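Affiliations: EPFL; Stanford University

AI summary: Analyzes the utility, time, and space complexity trade-offs of approximate machine unlearning, shows that empirical risk minimization with output perturbation achieves tight trade-offs for in-distribution unlearning, and proposes a robust noisy gradient descent variant that amortizes time complexity for out-of-distribution unlearning.

Details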
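AI abstract (translated from Chinese)

Machine unlearning, the process of selectively removing data from a trained model, is increasingly important for addressing privacy concerns and knowledge gaps after deployment. Despite this importance, existing methods are often heuristic and lack formal guarantees. In this paper, we analyze the fundamental trade-offs among utility, time complexity, and space complexity in approximate unlearning, and provide rigorous certification analogous to differential privacy. For in-distribution forget data (data similar to the retain set), we show that a surprisingly simple and general procedure, empirical risk minimization with output perturbation, achieves tight unlearning-utility-complexity trade-offs, closing a previous theoretical gap concerning the separation from unlearning "for free" via differential privacy, which inherently facilitates removing such data. These techniques fail, however, on out-of-distribution forget data (data that differs significantly from the retain set), where the time complexity of unlearning can exceed that of retraining, even for a single sample. To address this, we propose a new robust and noisy gradient descent variant that provably amortizes the unlearning time complexity without compromising utility.

English abstract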

Machine unlearning, the process of selectively removing data from trained models, is increasingly crucial for addressing privacy concerns and knowledge gaps post-deployment. Despite this importance, existing approaches are often heuristic and lack formal guarantees. In this paper, we analyze the fundamental utility, time, and space complexity trade-offs of approximate unlearning, providing rigorous certification analogous to differential privacy. For in-distribution forget data -- data similar to the retain set -- we show that a surprisingly simple and general procedure, empirical risk minimization with output perturbation, achieves tight unlearning-utility-complexity trade-offs, addressing a previous theoretical gap on the separation from unlearning "for free" via differential privacy, which inherently facilitates the removal of such data. However, such techniques fail with out-of-distribution forget data -- data significantly different from the retain set -- where unlearning time complexity can exceed that of retraining, even for a single sample. To address this, we propose a new robust and noisy gradient descent variant that provably amortizes unlearning time complexity without compromising utility.

2502.00527 2026-06-08 cs.LG cs.CL Updated version

PolarQuant: Leveraging Polar Transformation for Efficient Key Cache Quantization and Decoding Acceleration

Songhao Wu, Ang Lv, Xiao Feng, Yufei Zhang, Xun Zhang, Guojun Yin, Wei Lin, Rui Yan

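Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; ShanghaiTech University; Meituan

AI summary: Proposes PolarQuant, which groups key vectors into two-dimensional sub-vectors and encodes them as a quantized radius and polar angle to handle outliers in key cache quantization, accelerates decoding with table lookups, and preserves the performance of the full-precision model.

Details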
Comments
NeurIPS 2025 version with minor revisions to the methodology
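AI abstract (translated from Chinese)

The KV cache in large language models is a major contributor to memory usage and limits their broader applicability. Quantizing the cache to lower bit widths is an effective way to reduce computational cost; however, previous methods struggle to quantize key vectors because of outliers, which leads to excessive overhead. We propose a new quantization method, PolarQuant, that handles the outlier challenge effectively. We observe that outliers usually appear in only one of two dimensions that are rotated together by a specific angle when rotary position embeddings are applied. Represented as two-dimensional vectors, these dimensions show well-structured patterns, with radii and angles distributed smoothly in polar coordinates. This eases the difficulty that outliers pose for per-channel quantization and makes the representation well suited to quantization. PolarQuant therefore splits key vectors into groups of two-dimensional sub-vectors and encodes them as the corresponding quantized radius and polar angle instead of quantizing the original key vectors directly. PolarQuant achieves superior efficiency in KV cache quantization and speeds up decoding by turning the query-key inner product into a table lookup, while maintaining the downstream performance of the full-precision model.

English abstract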

The KV cache in large language models is a dominant factor in memory usage, limiting their broader applicability. Quantizing the cache to lower bit widths is an effective way to reduce computational costs; however, previous methods struggle with quantizing key vectors due to outliers, resulting in excessive overhead. We propose a novel quantization approach called PolarQuant, which efficiently addresses the outlier challenge. We observe that outliers typically appear in only one of two dimensions, which are rotated together by a specific angle when rotary position embeddings are applied. When represented as two-dimensional vectors, these dimensions exhibit well-structured patterns, with radii and angles smoothly distributed in polar coordinates. This alleviates the challenge that outliers pose for per-channel quantization, making these dimensions well-suited for quantization. Thus, PolarQuant divides key vectors into groups of two-dimensional sub-vectors, encoding them as the corresponding quantized radius and the polar angle, rather than quantizing original key vectors directly. PolarQuant achieves superior efficiency in KV cache quantization and accelerates the decoding process by turning the query-key inner product into a table lookup, all while maintaining the downstream performance of full-precision models.
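
The polar encoding of two-dimensional sub-vectors can be sketched in a few lines. This is a simplified illustration under stated assumptions: dimensions are paired as adjacent entries, the radius is scaled per vector, and both radius and angle use uniform quantizers. The actual pairing (which follows the rotary embedding layout), the grouping across tokens, and the table-lookup decoding in PolarQuant are not reproduced here.

```python
import math


def quantize_polar(key, r_bits=4, theta_bits=4):
    """Encode adjacent pairs (key[2i], key[2i+1]) as integer
    (radius, angle) codes. Expects a non-empty, even-length vector."""
    pairs = [(key[i], key[i + 1]) for i in range(0, len(key), 2)]
    radii = [math.hypot(x, y) for x, y in pairs]
    r_max = max(radii) or 1.0                   # avoid dividing by zero
    r_levels = 2 ** r_bits - 1
    t_levels = 2 ** theta_bits
    codes = []
    for (x, y), r in zip(pairs, radii):
        theta = math.atan2(y, x) % (2 * math.pi)            # angle in [0, 2*pi)
        r_code = round(r / r_max * r_levels)                # nearest radius level
        t_code = int(theta / (2 * math.pi) * t_levels) % t_levels  # angle bin
        codes.append((r_code, t_code))
    return codes, r_max


def dequantize_polar(codes, r_max, r_bits=4, theta_bits=4):
    """Reconstruct an approximate key vector from the polar codes."""
    r_levels = 2 ** r_bits - 1
    t_levels = 2 ** theta_bits
    out = []
    for r_code, t_code in codes:
        r = r_code / r_levels * r_max
        theta = (t_code + 0.5) / t_levels * 2 * math.pi     # bin center
        out.extend((r * math.cos(theta), r * math.sin(theta)))
    return out
```

Because each sub-vector is reduced to a pair of small integers, the contribution of a quantized key pair to a query-key inner product depends only on those two codes and the query, which is what makes a precomputed lookup table possible.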

2501.15768 2026-06-08 cs.RO cs.SY eess.SY Updated version

Error-State LQR Formulation for Quadrotor UAV Trajectory Tracking

Micah Reich

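Affiliations: Carnegie Mellon University

AI summary: Proposes an error-state linear quadratic regulator approach to quadrotor UAV trajectory tracking that represents orientation error in exponential coordinates and combines full-state feedback with a cascaded body-rate controller for robust control.

Details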
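AI abstract (translated from Chinese)

This article presents an error-state linear quadratic regulator (LQR) formulation for robust trajectory tracking of quadrotor unmanned aerial vehicles (UAVs). The approach uses error-state dynamics and represents orientation error in exponential coordinates, which yields a linearized system representation suitable for real-time control. The control strategy integrates an LQR-based full-state feedback controller for trajectory tracking with a cascaded body-rate controller that handles actuator dynamics. Detailed derivations of the error-state dynamics, the linearization process, and the controller design are provided, highlighting the method's suitability for precise and stable quadrotor control in dynamic environments.

English abstract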

This article presents an error-state Linear Quadratic Regulator (LQR) formulation for robust trajectory tracking in quadrotor Unmanned Aerial Vehicles (UAVs). The proposed approach leverages error-state dynamics and employs exponential coordinates to represent orientation errors, enabling a linearized system representation for real-time control. The control strategy integrates an LQR-based full-state feedback controller for trajectory tracking, combined with a cascaded bodyrate controller to handle actuator dynamics. Detailed derivations of the error-state dynamics, the linearization process, and the controller design are provided, highlighting the applicability of the method for precise and stable quadrotor control in dynamic environments.
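
As a rough guide to what such a formulation usually looks like, the block below writes out a generic error-state LQR in continuous time. The symbols are assumptions chosen for illustration: $R$ and $R_d$ are the actual and desired attitude, $\mathbf{p}, \mathbf{v}$ are position and velocity, $A, B$ are the Jacobians of the error dynamics, and $Q, R_u$ are cost weights. The article's own state vector and linearization may differ.

```latex
\begin{aligned}
\boldsymbol{\phi} &= \log\!\left(R_d^{\top} R\right)^{\vee} \in \mathbb{R}^{3}
  && \text{(orientation error in exponential coordinates)} \\
\delta\mathbf{x} &=
  \begin{bmatrix} \mathbf{p}-\mathbf{p}_d \\ \mathbf{v}-\mathbf{v}_d \\ \boldsymbol{\phi} \end{bmatrix},
  \qquad
  \delta\dot{\mathbf{x}} \approx A\,\delta\mathbf{x} + B\,\delta\mathbf{u}
  && \text{(linearized error-state dynamics)} \\
J &= \int_{0}^{\infty}
  \left( \delta\mathbf{x}^{\top} Q\,\delta\mathbf{x}
       + \delta\mathbf{u}^{\top} R_u\,\delta\mathbf{u} \right) dt
  && \text{(quadratic tracking cost)} \\
\delta\mathbf{u} &= -K\,\delta\mathbf{x},
  \qquad K = R_u^{-1} B^{\top} P
  && \text{(full-state feedback)} \\
0 &= A^{\top} P + P A - P B R_u^{-1} B^{\top} P + Q
  && \text{(algebraic Riccati equation)}
\end{aligned}
```

The applied input is the reference input plus the correction $\delta\mathbf{u}$; in a cascaded design the attitude-related part of that command would be passed to the inner body-rate loop.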