arXivDaily arXiv daily academic digest, updated Monday through Friday
2510.01565 2026-06-19 cs.LG cs.DC Updated version

TetriServe: Efficiently Serving Mixed DiT Workloads

Runyu Lu, Shiqi He, Wenxuan Tan, Shenggui Li, Ruofan Wu, Jeff J. Ma, Ang Chen, Mosharaf Chowdhury

Affiliations University of Michigan; University of Wisconsin-Madison; Nanyang Technological University

AI summary For heterogeneous DiT workloads with mixed resolutions and deadlines, proposes TetriServe, a serving system built on step-level sequence parallelism. Round-based scheduling and adaptive parallelism raise SLO attainment by up to 32% without degrading image quality.

Abstract

Diffusion Transformer (DiT) models excel at generating high-quality images through iterative denoising steps, but serving them under strict Service Level Objectives (SLOs) is challenging due to their high computational cost, particularly at larger resolutions. Existing serving systems use fixed-degree sequence parallelism, which is inefficient for heterogeneous workloads with mixed resolutions and deadlines, leading to poor GPU utilization and low SLO attainment. In this paper, we propose step-level sequence parallelism to dynamically adjust the degree of parallelism of individual requests according to their deadlines. We present TetriServe, a DiT serving system that implements this strategy for highly efficient image generation. Specifically, TetriServe introduces a novel round-based scheduling mechanism that improves SLO attainment by (1) discretizing time into fixed rounds to make deadline-aware scheduling tractable, (2) adapting parallelism at the step level and minimizing GPU hour consumption, and (3) jointly packing requests to minimize late completions. Extensive evaluation on state-of-the-art DiT models shows that TetriServe achieves up to 32% higher SLO attainment compared to existing solutions without degrading image quality.
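
The idea of adapting parallelism to deadlines can be pictured with a small sketch: for each request, pick the smallest GPU count whose estimated remaining time still fits the deadline, since lower degrees usually cost fewer GPU-seconds. This is only an illustration under stated assumptions, not TetriServe's scheduler; the latency table `STEP_LATENCY` and the helper names are hypothetical.

```python
from typing import Optional

# Hypothetical profiled latency (seconds per denoising step) for each GPU count.
STEP_LATENCY = {1: 0.40, 2: 0.22, 4: 0.13, 8: 0.09}


def min_degree(steps_left: int, time_left: float,
               latency: dict = STEP_LATENCY) -> Optional[int]:
    """Return the smallest parallelism degree that meets the deadline, else None."""
    for degree in sorted(latency):
        if steps_left * latency[degree] <= time_left:
            return degree
    return None  # no degree can finish in time; the request would be late


def gpu_seconds(steps_left: int, degree: int,
                latency: dict = STEP_LATENCY) -> float:
    """GPU-seconds consumed if the remaining steps run at the given degree."""
    return steps_left * latency[degree] * degree
```

With this table, a request with 20 steps left and 5 seconds of slack gets 2 GPUs, while the same request with 10 seconds of slack gets 1 GPU and uses fewer GPU-seconds.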

2508.02604 2026-06-19 cs.RO cs.SY eess.SY Updated version

Periodic robust robotic rock chop via virtual model control

Yi Zhang, Fumiya Iida, Fulvio Forni

Affiliations University of Cambridge; University of Tokyo

AI summary Proposes a physically structured virtual-model controller that uses switched virtual mechanisms to generate robust periodic chopping motions without pre-planned trajectories, achieving sub-millimeter slicing accuracy on a range of vegetables with a Franka manipulator.

Abstract

Robotic cutting is a challenging, contact-rich manipulation task where the robot must simultaneously negotiate unknown object mechanics, large contact forces, and precise motion requirements. Our hypothesis is that this complexity can be alleviated through the design of a physically structured virtual-model controller that uses switched virtual mechanisms to generate a robust, rhythmic rock-chop motion for robotic cutting, without requiring pre-planned trajectories or precise environmental information. Motion is generated by the interaction between the environment, the robot's dynamics, and the virtual forces of the switching virtual mechanism, ultimately realized through the available actuation. Through theoretical analysis and experimental validation, we demonstrate that the controlled robot behavior settles into a stable periodic motion. Experiments with a Franka manipulator demonstrate robust cuts across five different vegetables, achieving sub-millimeter slice accuracy for thicknesses from 1 mm to 6 mm at a rate of nearly one cut per second. The controller maintains high performance despite changes in knife shape or cutting board height, and successfully adapts to a different humanoid manipulator, demonstrating robustness and platform independence.

2601.03040 2026-06-19 cs.RO cs.AI cs.LG Updated version

PiDR: Physics-Informed Inertial Dead Reckoning for Autonomous Platforms

Arup Kumar Sahoo, Itzik Klein

Affiliations Autonomous Navigation and Sensor Fusion Lab (ANSFL); Hatter Department of Marine Technologies; Charney School of Marine Sciences; University of Haifa

AI summary Proposes PiDR, a framework that incorporates inertial navigation principles into network training as a physics-informed residual, reducing trajectory drift in pure inertial navigation and improving positioning accuracy by more than 29% on mobile-robot and autonomous-underwater-vehicle datasets.

Comments 11 pages and 7 figures

Abstract

A fundamental requirement for full autonomy is the ability to sustain accurate navigation in the absence of external data, such as GNSS signals or visual information. In these challenging environments, the platform must rely exclusively on inertial sensors, leading to pure inertial navigation. However, the inherent noise and other error terms of the inertial sensors in such real-world scenarios will cause the navigation solution to drift over time. Although conventional deep-learning models have emerged as a possible approach to inertial navigation, they are inherently black-box in nature. Furthermore, they struggle to learn effectively with limited supervised sensor data and often fail to preserve physical principles. To address these limitations, we propose PiDR, a physics-informed inertial dead-reckoning framework for autonomous platforms in situations of pure inertial navigation. PiDR offers transparency by explicitly integrating inertial navigation principles into the network training process through a physics-informed residual component. PiDR plays a crucial role in mitigating abrupt trajectory deviations even under limited or sparse supervision. We evaluated PiDR on real-world datasets collected by a mobile robot and an autonomous underwater vehicle. We obtained more than 29% positioning improvement in both datasets, demonstrating the ability of PiDR to generalize across different platforms operating in various environments and dynamics. Thus, PiDR offers a robust, lightweight, yet effective architecture and can be deployed on resource-constrained platforms, enabling real-time pure inertial navigation in adverse scenarios.
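
The drift problem the abstract mentions can be illustrated with a toy calculation: integrating a constant accelerometer bias twice produces a position error that grows roughly as 0.5 * bias * t^2. The sketch below is a plain one-dimensional Euler integration with hypothetical numbers; it is not the PiDR model or its residual.

```python
def dead_reckon(accels, dt, v0=0.0, p0=0.0):
    """Integrate 1-D acceleration samples into a final position (Euler steps)."""
    v, p = v0, p0
    for a in accels:
        v += a * dt   # velocity from acceleration
        p += v * dt   # position from velocity
    return p


# The platform is actually at rest, but the sensor reports a constant bias.
bias = 0.01    # m/s^2, hypothetical accelerometer bias
dt = 0.01      # s, sample period
n = 10_000     # 100 s of samples
drift = dead_reckon([bias] * n, dt)
print(f"position drift after {n * dt:.0f} s: {drift:.2f} m")
```

Even this small bias yields a position error of about 50 m after 100 s, which is why pure inertial navigation needs some form of correction.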

2601.02379 2026-06-19 cs.RO cs.AI Updated version

Movement Primitives in Robotics: A Comprehensive Survey

Nolan B. Gutierrez, Joseph M. Cloud, William J. Beksi

Affiliations Department of Computer Science and Engineering, The University of Texas at Arlington, Arlington, USA

AI summary Surveys movement primitive frameworks in robotics, covering methods that encode trajectories from human demonstrations, analyzing properties such as spring-damper dynamics, probabilistic coupling, and neural networks, and discussing applications and open challenges.

Comments 105 pages, 3 figures, and 6 tables

Abstract

Biological systems exhibit a continuous stream of movements, consisting of sequential segments, that allow them to perform complex tasks in a creative and versatile fashion. This observation has led researchers towards identifying elementary building blocks of motion known as movement primitives, which are well-suited for generating motor commands in autonomous systems, such as robots. In this survey, we provide an encyclopedic overview of movement primitive approaches and applications in chronological order. Concretely, we present movement primitive frameworks as a way of representing robotic control trajectories acquired through human demonstrations. Within the area of robotics, movement primitives can encode basic motions at the trajectory level, such as how a robot would grasp a cup or the sequence of motions necessary to toss a ball. Furthermore, movement primitives have been developed with the desirable analytical properties of a spring-damper system, probabilistic coupling of multiple demonstrations, using neural networks in high-dimensional systems, and more, to address difficult challenges in robotics. Although movement primitives have widespread application to a variety of fields, the goal of this survey is to inform practitioners on the use of these frameworks in the context of robotics. Specifically, we aim to (i) present a systematic review of major movement primitive frameworks and examine their strengths and weaknesses; (ii) highlight applications that have successfully made use of movement primitives; and (iii) examine open questions and discuss practical challenges when applying movement primitives in robotics.

2512.24592 2026-06-19 cs.CV Updated version

GH-ESD: Grounded Hypothesis-Driven Error Slice Discovery for Instance-Level Vision Tasks

Wei Zhang, Chaoqun Wang, Zixuan Guan, Sam Kao, Pengfei Zhao, Peng Wu, Sifeng He

Affiliations Apple

AI summary Proposes GH-ESD, a framework that pairs LLM-generated hypotheses with vision-language-model verification to automatically discover spatially grounded, relational error slices in instance-level tasks, and builds the GESD benchmark, improving error-slice discovery precision on detection and segmentation tasks.

Comments Accepted by ECCV2026

Abstract

Systematic failures of vision models on semantically coherent subsets, known as error slices, reveal limitations in robustness and evaluation. Existing slice discovery approaches largely model slices as clusters in representation space or combinations of predefined attributes. While effective for image-level classification, such formulations are insufficient for instance-level tasks such as object detection and segmentation, where failures often arise from contextual relational and spatially grounded visual patterns. We propose GH-ESD (Grounded Hypothesis-Driven Error Slice Discovery), a generate-and-verify framework that reformulates slice discovery as grounded hypothesis generation and statistical verification. GH-ESD constructs relational failure hypotheses using LLM priors and grounded visual evidence, discovers hypothesis slices at the instance level via Vision Language Models, and verifies them through statistical trend analysis over instance-level errors. We also introduce GESD (Grounded Error Slice Dataset), a new benchmark for instance-level error slice discovery, providing expert-defined and spatially grounded slices derived from detection and segmentation failures. Extensive experiments demonstrate that GH-ESD consistently outperforms baselines, improving Precision@10 by 0.10 (0.73 vs. 0.63) on the GESD benchmark for detection tasks, while also supporting segmentation scenarios. GH-ESD identifies interpretable slices that facilitate actionable model improvements. The GESD dataset will be made publicly available upon acceptance.
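
For readers unfamiliar with the reported metric, Precision@10 is the fraction of the ten top-ranked discovered slices that count as correct. A minimal sketch follows, assuming slices are represented by hashable identifiers and that a slice is correct when it appears in a ground-truth set; the abstract does not state the benchmark's actual matching rule, so that part is an assumption.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k ranked items that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k  # standard definition: divide by k, not by len(ranked)


# Hypothetical slice identifiers; seven of the top ten are ground-truth slices.
ranked_slices = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9", "s10"]
true_slices = {"s1", "s2", "s4", "s5", "s7", "s8", "s10"}
score = precision_at_k(ranked_slices, true_slices)
```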

2512.18859 2026-06-19 cs.CL Updated version

Toward Human-Centered AI-Assisted Terminology Work

Antonio San Martin

Affiliations Université du Québec à Trois-Rivières

AI summary Proposes a human-centered AI framework for terminology work: generative AI automates tasks while augmenting terminologists' capabilities and preserving human control, so that terminological data remain accurate and reliable.

Comments Accepted for publication in the journal Terminology

Abstract

Generative AI is likely to transform terminology work by creating new opportunities for automation. At the same time, it raises concerns about the future of terminologists and terminological resources, as efficiency pressures may encourage excessive automation based on the perception that human expertise can be replaced by AI. However, large language models remain unreliable for terminological purposes due to errors, hallucinations, and various forms of bias, making terminologists indispensable for ensuring the accuracy and reliability of terminological data. This paper argues that human-centered AI, an approach that emphasizes that AI's primary goal should be to contribute to human well-being, provides a framework for maximizing the benefits of generative AI while mitigating its risks. It contends that high levels of automation and meaningful human control are compatible and desirable, and that AI should enhance terminologists' capabilities while preserving their agency and decision-making authority. The implications of AI-assisted terminology work are examined through three interrelated dimensions: the augmented terminologist, ethical AI, and human-centered design. In particular, the paper examines how AI integration reshapes the role of the terminologist, affects professional values and working conditions, requires the management of AI-generated bias, and calls for the design of AI tools around the terminologist's needs. The paper concludes that a human-centered orientation is necessary to ensure that AI strengthens, rather than undermines, the essential role of terminology work in supporting specialized communication and the accurate transmission of knowledge across languages and cultures.

2508.04266 2026-06-19 cs.CL Updated version

ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents

Jiangyuan Wang, Kejun Xiao, Qi Sun, Huaipeng Zhao, Tao Luo, Jian Dong Zhang, Xiaoyi Zeng

Affiliations Alibaba International Digital Commercial Group

AI summary Proposes ShoppingBench, a benchmark of real-world shopping tasks at multiple intent levels that evaluates LLM agents in a simulated environment with over 2.5 million products. GPT-4.1 achieves a success rate below 50%, and a trajectory distillation strategy improves the performance of smaller models.

Comments Accepted for oral presentation at AAAI 2026

Abstract

Existing benchmarks in e-commerce primarily focus on basic user intents, such as finding or purchasing products. However, real-world users often pursue more complex goals, such as applying vouchers, managing budgets, and finding multi-product sellers. To bridge this gap, we propose ShoppingBench, a novel end-to-end shopping benchmark designed to encompass increasingly challenging levels of grounded intent. Specifically, we propose a scalable framework to simulate user instructions based on various intents derived from sampled real-world products. To facilitate consistent and reliable evaluations, we provide a large-scale shopping sandbox that serves as an interactive simulated environment, incorporating over 2.5 million real-world products. Experimental results demonstrate that even state-of-the-art language agents (such as GPT-4.1) achieve absolute success rates under 50% on our benchmark tasks, highlighting the significant challenges posed by ShoppingBench. In addition, we propose a trajectory distillation strategy and leverage supervised fine-tuning, along with reinforcement learning on synthetic trajectories, to distill the capabilities of a large language agent into a smaller one. As a result, our trained agent achieves competitive performance compared to GPT-4.1.

2512.03818 2026-06-19 cs.CL Updated version

Improving Alignment Between Human and Machine Codes: An Empirical Assessment of Prompt Engineering for Construct Identification in Psychology

Kylie L. Anglin, Stephanie Milan, Brittney Hernandez, Claudia Ventura

Affiliations Department of Educational Psychology, Neag School of Education, University of Connecticut; Department of Psychological Sciences, College of Liberal Arts and Sciences, University of Connecticut

AI summary Presents an empirical framework for using prompt engineering to optimize how well large language models identify constructs in psychological texts. Experiments on five prompting strategies find that the construct definition and task framing matter most, and that few-shot prompts combining codebook-guided selection with automatic prompt engineering align best with expert judgment.

Comments 22 pages, 2 figures

Abstract

Due to their architecture and vast pre-training data, large language models (LLMs) demonstrate strong text classification performance. However, LLM output (here, the category assigned to a text) depends heavily on the wording of the prompt. While literature on prompt engineering is expanding, few studies focus on classification tasks, and even fewer address domains like psychology, where constructs have precise, theory-driven definitions that may not be well represented in pre-training data. We present an empirical framework for optimizing LLM performance for identifying constructs in texts via prompt engineering. We experimentally evaluate five prompting strategies (codebook-guided empirical prompt selection, automatic prompt engineering, persona prompting, chain-of-thought reasoning, and explanatory prompting) with zero-shot and few-shot classification. We find that persona, chain-of-thought, and explanations do not fully address performance loss accompanying a badly worded prompt. Instead, the most influential features of a prompt are the construct definition, task framing, and, to a lesser extent, the examples provided. Across three constructs and two models, the classifications most aligned with expert judgments resulted from a few-shot prompt combining codebook-guided empirical prompt selection with automatic prompt engineering. Based on our findings, we recommend that researchers generate and evaluate as many prompt variants as feasible, whether human-crafted, automatically generated, or ideally both, and select prompts and examples based on empirical performance in a training dataset, validating the final approach in a holdout set. This procedure offers a practical, systematic, and theory-driven method for optimizing LLM prompts in settings where alignment with expert judgment is critical.

2511.22283 2026-06-19 cs.LG Updated version

The Hidden Cost of Approximation in Online Mirror Descent

Ofir Schlisselberg, Uri Sherman, Tomer Koren, Yishay Mansour

Affiliations Tel Aviv University; Google Research

AI summary Studies the robustness of online mirror descent (OMD) to approximation errors and finds that error tolerance is closely tied to regularizer smoothness: uniformly smooth regularizers admit a tight bound, negative entropy over the simplex requires exponentially small errors, and log-barrier and Tsallis regularizers tolerate polynomially small errors.

Abstract

Online mirror descent (OMD) is a fundamental algorithmic paradigm that underlies many algorithms in optimization, machine learning and sequential decision-making. The OMD iterates are defined as solutions to optimization subproblems which, oftentimes, can be solved only approximately, leading to an inexact version of the algorithm. Nonetheless, existing OMD analyses typically assume an idealized error-free setting, thereby limiting our understanding of performance guarantees that should be expected in practice. In this work we initiate a systematic study into inexact OMD, and uncover an intricate relation between regularizer smoothness and robustness to approximation errors. When the regularizer is uniformly smooth, we establish a tight bound on the excess regret due to errors. Then, for barrier regularizers over the simplex and its subsets, we identify a sharp separation: negative entropy requires exponentially small errors to avoid linear regret, whereas log-barrier and Tsallis regularizers remain robust even when the errors are only polynomial. Finally, we show that when the losses are stochastic and the domain is the simplex, negative entropy regains robustness, but this property does not extend to all subsets, where exponentially small errors are again necessary to avoid suboptimal regret.
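
As background for the negative-entropy case: on the full simplex, the OMD subproblem has a well-known closed form, the multiplicative-weights (exponentiated-gradient) update, so the exact iterate is easy to compute there. The sketch below shows that exact step only; it does not model the approximation errors the paper studies, and the function names are illustrative.

```python
import math


def omd_entropy_step(x, grad, eta):
    """One exact OMD step with negative entropy on the simplex.

    Closed form: x_new[i] is proportional to x[i] * exp(-eta * grad[i]).
    """
    weights = [xi * math.exp(-eta * gi) for xi, gi in zip(x, grad)]
    total = sum(weights)
    return [w / total for w in weights]


def run_omd(losses, eta):
    """Play OMD against a sequence of linear loss vectors; return final point and total loss."""
    n = len(losses[0])
    x = [1.0 / n] * n          # start at the uniform distribution
    cumulative = 0.0
    for g in losses:
        cumulative += sum(xi * gi for xi, gi in zip(x, g))
        x = omd_entropy_step(x, g, eta)
    return x, cumulative
```

An inexact variant would return a point that only approximately solves the same subproblem; how large that gap may be before regret degrades is the question the abstract addresses.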

2508.04424 2026-06-19 cs.CV Updated version

Composed Object Retrieval: Object-level Retrieval via Composed Expressions

Tong Wang, Guanyu Yang, Nian Liu, Zongyan Han, Jinxing Zhou, Salman Khan, Fahad Shahbaz Khan

Affiliations Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, Ministry of Education, Jiangsu, China; Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE

AI summary Proposes the Composed Object Retrieval (COR) task, which performs object-level retrieval from a reference object, its mask, and a retrieval text, and builds the COR125K benchmark and the CORE model, which significantly outperforms existing methods.

Abstract

Retrieving fine-grained visual content based on user intent remains a challenge in multimodal systems. Although current Composed Image Retrieval (CIR) methods combine reference images with retrieval texts, they are constrained to image-level matching and cannot localize specific objects. To this end, we propose Composed Object Retrieval (COR), a new object-level retrieval task that retrieves target object(s) from candidate objects in a target image and grounds the retrieved result with pixel-level masks. Given a reference object, its mask, a target image, and a retrieval text describing the desired modification, COR requires models to perform composed visual-textual reasoning rather than relying on explicit category names. This setting introduces several challenges, including fine-grained compositional matching, negative-object filtering under visually similar distractors, and flexible single- or multi-object retrieval. We construct COR125K, the first large-scale COR benchmark, containing 125,541 retrieval triplets across 408 categories with base/novel splits for evaluating category-level generalization. We also present CORE, a unified end-to-end model that integrates reference region encoding, adaptive vision-text interaction, and region-level contrastive learning to align composed representations with target objects while suppressing background and distractors. Extensive experiments demonstrate that CORE significantly outperforms existing CIR-based pipelines and strong baselines in both base and novel categories, establishing a simple and effective foundation for fine-grained object-level multimodal retrieval. Code will be released publicly at https://github.com/wangtong627/COR.

2511.04260 2026-06-19 cs.CV cs.AI Updated version

Proto-LeakNet: Towards Signal-Leak Aware Attribution in Synthetic Human Face Imagery

Claudio Giusti, Luca Guarnera, Sebastiano Battiato

Affiliations Department of Mathematics and Computer Science; University of Catania

AI summary Proposes Proto-LeakNet, which exploits signal-leak traces left by diffusion models and combines closed-set classification with density-based open-set evaluation for interpretable generator attribution; trained only on closed-set data, it remains effective on unseen generators.

Comments 44 pages, 27 figures, 11 tables

Abstract

The growing sophistication of synthetic image and deepfake generation models has turned source attribution and authenticity verification into a critical challenge for modern computer vision systems. Recent studies suggest that diffusion pipelines unintentionally imprint persistent statistical traces, known as signal-leaks, within their outputs, particularly in latent representations. Building on this observation, we propose Proto-LeakNet, a signal-leak-aware and interpretable attribution framework that integrates closed-set classification with a density-based open-set evaluation on the learned embeddings, enabling analysis of unseen generators without retraining. Acting in the latent domain of diffusion models, our method re-simulates partial forward diffusion to expose residual generator-specific cues. A temporal attention encoder aggregates multi-step latent features, while a feature-weighted prototype head structures the embedding space and enables transparent attribution. Trained solely on closed-set data and achieving a Macro AUC of 98.13%, Proto-LeakNet learns a latent geometry that remains robust under post-processing, surpassing state-of-the-art methods, and achieves strong separability both between real images and known generators, and between known and unseen ones. The codebase is available at the following link: https://github.com/claudiunderthehood/Proto-LeakNet.

2510.18784 2026-06-19 cs.LG Updated version

CAGE: Curvature-Aware Gradient Estimation For Accurate Quantization-Aware Training

Soroush Tabesh, Mher Safaryan, Andrei Panferov, Alexandra Volkova, Dan Alistarh

Affiliations Anonymous Authors

AI summary Proposes CAGE, which augments the straight-through estimator with a curvature-aware correction term that balances loss minimization against quantization constraints, comes with convergence guarantees in the smooth non-convex setting, and markedly improves the accuracy of low-bit quantization-aware training.

Comments Accepted at MLSys 2026 (Oral). To appear in Proceedings of Machine Learning and Systems 8

Journal ref Proceedings of Machine Learning and Systems 8 (MLSys 2026)

Abstract

Despite significant work on low-bit quantization-aware training (QAT), there is still an accuracy gap between such techniques and native training. To address this, we introduce CAGE (Curvature-Aware Gradient Estimation), a new QAT method that augments the straight-through estimator (STE) gradient with a curvature-aware correction designed to counteract the loss increase induced by quantization. CAGE is derived from a multi-objective view of QAT that balances loss minimization with the quantization constraints, yielding a principled correction term that depends on local curvature information. On the theoretical side, we introduce the notion of Pareto-optimal solutions for quantized optimization, and establish that CAGE yields strong convergence guarantees in the smooth non-convex setting. In terms of implementation, our approach is optimizer-agnostic, but we provide a highly efficient implementation that leverages Adam statistics. CAGE significantly improves upon the prior state-of-the-art methods in terms of accuracy, for similar computational cost: for QAT fine-tuning, it halves the compression accuracy loss relative to the prior best method, while for QAT pre-training of Llama models, its accuracy for 3-bit weights-and-activations (W3A3) matches the accuracy achieved at 4 bits (W4A4) with the prior best method. The official implementation can be found at https://github.com/IST-DASLab/CAGE.

2507.23534 2026-06-19 cs.LG cs.CV Updated version

Continual Learning with Support Boundary Experience Blending

Chih-Fan Hsu, Ming-Ching Chang, Wei-Chao Chen

Affiliations National Taiwan University

AI summary Proposes the Experience Blending framework, which generates support boundary data with differential-privacy-inspired noise and trains jointly on exemplars and boundary data to regularize decision boundaries, improving continual-learning accuracy on multiple datasets.

Abstract

Continual learning (CL) seeks to mitigate catastrophic forgetting when models are trained with sequential tasks. A common approach, experience replay (ER), stores past exemplars but only sparsely approximates the data distribution, yielding fragile and oversimplified decision boundaries. We address this limitation by introducing Support Boundary Data (SBD), generated by injecting differential-privacy-inspired noise into latent features to create boundary-adjacent representations that implicitly regularize decision boundaries. Building on this idea, we propose Experience Blending (EB), a framework that jointly trains on exemplars and SBD through a dual-model aggregation strategy. EB has two components: (1) latent-space noise injection to generate support boundary data, and (2) end-to-end training that jointly leverages exemplars and SBD. Unlike standard experience replay, SBD enriches the feature space near decision boundaries, leading to more stable and robust continual learning. Extensive experiments on CIFAR-10, CIFAR-100, Tiny ImageNet, and ImageNet1K demonstrate consistent accuracy improvements of 10%, 6%, 13%, and 2%, respectively.

2510.27285 2026-06-19 cs.CV cs.CR Updated version

Rethinking Robust Adversarial Concept Erasure in Diffusion Models

Qinghong Yin, Yu Tian, Heming Yang, Xiang Chen, Xianlin Zhang, Yue Ming, Xueming Li, Yue Zhang

Affiliations Beijing University of Posts and Telecommunications; Dept. of Comp. Sci. and Tech., Institute for AI, Tsinghua University; University of Chinese Academy of Sciences; Nanjing University of Aeronautics and Astronautics

AI summary Adversarial training for concept erasure in diffusion models overlooks concept semantics and therefore fits the concept space poorly. Proposes S-GRACE, a semantics-guided robust adversarial concept erasure method that improves erasure performance by 26% and cuts training time by 90%.

Abstract

Concept erasure aims to selectively unlearn undesirable content in diffusion models (DMs) to reduce the risk of sensitive content generation. As a novel paradigm in concept erasure, most existing methods employ adversarial training to identify and suppress target concepts, thus reducing the likelihood of sensitive outputs. However, these methods often neglect the specificity of adversarial training in DMs, resulting in only partial mitigation. In this work, we investigate and quantify this specificity from the perspective of concept space, i.e., can adversarial samples truly fit the target concept space? We observe that existing methods neglect the role of conceptual semantics when generating adversarial samples, resulting in ineffective fitting of concept spaces. This oversight leads to the following issues: 1) when there are few adversarial samples, they fail to comprehensively cover the object concept; 2) conversely, they will disrupt other target concept spaces. Motivated by the analysis of these findings, we introduce S-GRACE (Semantics-Guided Robust Adversarial Concept Erasure), which gracefully leverages semantic guidance within the concept space to generate adversarial samples and perform erasure training. Experiments conducted with seven state-of-the-art methods and three adversarial prompt generation strategies across various DM unlearning scenarios demonstrate that S-GRACE significantly improves erasure performance by 26%, better preserves non-target concepts, and reduces training time by 90%. Our code is available at https://github.com/Qhong-522/S-GRACE.

2511.04514 2026-06-19 cs.LG 版本更新

Linear Mode Connectivity under Data Shifts for Deep Ensembles of Image Classifiers

图像分类器深度集成在数据偏移下的线性模式连通性

C. Hepburn, T. Zielke, A. P. Raulf

发表机构 * Institute for AI Safety & Security(人工智能安全与安保研究所)

AI总结 实验研究数据偏移下线性模式连通性(LMC)的条件,发现小学习率和大批量可减轻其影响,并揭示LMC在训练效率与集成多样性间的权衡。

Comments 17 pages, 22 figures

详情
AI中文摘要

线性模式连通性(LMC)现象将深度学习的多个方面联系起来,包括含噪随机梯度下的训练稳定性、局部最小值(盆地)的平滑性和泛化性、采样模型的相似性和功能多样性,以及架构对数据处理的影响。在这项工作中,我们实验研究了数据偏移下的LMC,并确定了减轻其影响的条件。我们将数据偏移解释为随机梯度噪声的额外来源,可以通过小学习率和大批量来减少。这些参数影响模型是收敛到相同的局部最小值,还是收敛到损失景观中具有不同平滑性和泛化性的区域。尽管通过LMC采样的模型往往比收敛到不同盆地的模型更频繁地犯相似错误,但LMC的好处在于平衡训练效率与从更大、更多样化的集成中获得的收益。代码和补充材料可从 https://github.com/DLR-KI/LMC 获取。本工作已提交给IEEE考虑发表。版权可能在不另行通知的情况下转移,此后此版本可能不再可访问。

英文摘要

The phenomenon of linear mode connectivity (LMC) links several aspects of deep learning, including training stability under noisy stochastic gradients, the smoothness and generalization of local minima (basins), the similarity and functional diversity of sampled models, and architectural effects on data processing. In this work, we experimentally study LMC under data shifts and identify conditions that mitigate their impact. We interpret data shifts as an additional source of stochastic gradient noise, which can be reduced through small learning rates and large batch sizes. These parameters influence whether models converge to the same local minimum or to regions of the loss landscape with varying smoothness and generalization. Although models sampled via LMC tend to make similar errors more frequently than those converging to different basins, the benefit of LMC lies in balancing training efficiency against the gains achieved from larger, more diverse ensembles. Code and supplementary materials are available at https://github.com/DLR-KI/LMC. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
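The notion of linear mode connectivity used above can be made concrete with a loss barrier: interpolate linearly between two parameter vectors and check how far the loss rises above the straight line joining the endpoint losses. The following is a minimal sketch under stated assumptions, not the authors' code: `toy_loss` is a hypothetical one-parameter loss with two basins, and `loss_barrier` and `interpolate` are illustrative names.

```python
from typing import Callable, List, Sequence


def interpolate(theta_a: Sequence[float], theta_b: Sequence[float], alpha: float) -> List[float]:
    """Point on the straight line between two parameter vectors."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]


def loss_barrier(loss: Callable[[Sequence[float]], float],
                 theta_a: Sequence[float],
                 theta_b: Sequence[float],
                 steps: int = 11) -> float:
    """Largest excess of the path loss over the linear blend of endpoint losses.

    A barrier near zero suggests the two solutions are linearly mode connected.
    """
    loss_a, loss_b = loss(theta_a), loss(theta_b)
    barrier = 0.0
    for i in range(steps):
        alpha = i / (steps - 1)
        path_loss = loss(interpolate(theta_a, theta_b, alpha))
        baseline = (1 - alpha) * loss_a + alpha * loss_b
        barrier = max(barrier, path_loss - baseline)
    return barrier


def toy_loss(theta: Sequence[float]) -> float:
    """Hypothetical loss with minima at x = -1 and x = +1 and a bump at x = 0."""
    x = theta[0]
    return (x * x - 1.0) ** 2


# Two solutions in different basins: the path crosses the bump at x = 0.
different_basins = loss_barrier(toy_loss, [-1.0], [1.0])
# Two solutions in the same basin: no bump on the path.
same_basin = loss_barrier(toy_loss, [0.9], [1.1])
```

In a real study the parameter vectors would be flattened network weights and the loss would be evaluated on held-out (possibly shifted) data; the toy function only illustrates the measurement.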

2510.24399 2026-06-19 cs.CV cs.RO 版本更新

GenTrack: A New Generation of Multi-Object Tracking

GenTrack:新一代多目标跟踪

Toan Van Nguyen, Rasmus G. K. Christiansen, Dirk Kraft, Leon Bodenhagen

发表机构 * SDU Robotics, University of Southern Denmark(SDU机器人实验室,南丹麦大学)

AI总结 提出GenTrack多目标跟踪方法,采用随机与确定性混合策略,结合粒子群优化与社会交互,在弱检测器、遮挡等场景下有效维持目标身份一致性并减少ID切换。

Comments This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

本文介绍了一种新颖的多目标跟踪(MOT)方法,称为GenTrack,其主要贡献包括:第一,一种混合跟踪方法,采用随机和确定性方式,以鲁棒地处理未知且时变的目标数量,特别是在维持目标身份(ID)一致性和管理非线性动态方面;第二,利用粒子群优化(PSO)和一些提出的适应度度量,引导随机粒子朝向其目标分布模式,从而即使在弱且噪声大的目标检测器下也能实现有效跟踪;第三,整合目标间的社会交互,以增强PSO引导的粒子,并改进强(匹配)和弱(未匹配)轨迹的连续更新,从而减少ID切换和轨迹丢失,尤其是在遮挡期间;第四,基于GenTrack重新定义的视觉MOT基线,结合了基于空间一致性、外观、检测置信度、轨迹惩罚和社会分数的综合状态与观测模型,以实现系统且高效的目标更新;第五,首个公开可用的最小依赖源代码参考实现,包含三种变体,包括GenTrack Simple、Strengthen和Super,便于灵活重新实现。实验结果表明,与最先进的跟踪器相比,GenTrack在标准基准和现实场景中提供了优越的性能,并集成了基线实现以进行公平比较。还讨论了未来工作的潜在方向。所提方法和比较跟踪器的源代码参考实现已在GitHub上提供:https://github.com/SDU-VelKoTek/GenTrack

英文摘要

This paper introduces a novel multi-object tracking (MOT) method, dubbed GenTrack, whose main contributions are: first, a hybrid tracking approach that combines stochastic and deterministic methods to robustly handle unknown and time-varying numbers of targets, particularly in maintaining target identity (ID) consistency and managing nonlinear dynamics; second, the use of particle swarm optimization (PSO) with proposed fitness measures to guide stochastic particles toward the modes of their target distributions, enabling effective tracking even with weak and noisy object detectors; third, the integration of social interactions among targets to enhance PSO-guided particles and to improve continuous updates of both strong (matched) and weak (unmatched) tracks, thereby reducing ID switches and track loss, especially during occlusions; fourth, a GenTrack-based redefined visual MOT baseline incorporating a comprehensive state and observation model based on space consistency, appearance, detection confidence, track penalties, and social scores for systematic and efficient target updates; and fifth, the first publicly available source-code reference implementation with minimal dependencies, featuring three variants (GenTrack Simple, Strengthen, and Super) to facilitate flexible reimplementation. Experimental results show that GenTrack provides superior performance on standard benchmarks and real-world scenarios compared to state-of-the-art trackers, with integrated implementations of baselines for fair comparison. Potential directions for future work are also discussed. The source-code reference implementations of both the proposed method and the compared trackers are provided on GitHub: https://github.com/SDU-VelKoTek/GenTrack
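To illustrate how particle swarm optimization can pull particles toward a mode of a fitness function, here is a textbook PSO loop in plain Python. It is a generic sketch, not GenTrack's implementation: the fitness used in the usage lines (negative squared distance to a hypothetical detection centre at (4, 6)) and all parameter values are assumptions chosen for illustration.

```python
import random
from typing import Callable, List, Tuple

Point = Tuple[float, float]


def pso(fitness: Callable[[Point], float],
        n_particles: int = 30,
        iters: int = 60,
        bounds: Tuple[float, float] = (0.0, 10.0),
        w: float = 0.7, c1: float = 1.5, c2: float = 1.5,
        seed: int = 0) -> Tuple[Point, float, List[float]]:
    """Maximise `fitness` over 2-D points; returns (best point, best fitness, history)."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi), rng.uniform(lo, hi)] for _ in range(n_particles)]
    vel = [[0.0, 0.0] for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                      # best position seen by each particle
    pbest_fit = [fitness((p[0], p[1])) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]     # best position seen by the swarm
    history = [gbest_fit]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(2):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward the particle's own best + pull toward the swarm best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            f = fitness((pos[i][0], pos[i][1]))
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
        history.append(gbest_fit)
    return (gbest[0], gbest[1]), gbest_fit, history


# Hypothetical fitness: closeness to a detection centre at (4, 6).
best, best_fit, history = pso(lambda p: -((p[0] - 4.0) ** 2 + (p[1] - 6.0) ** 2))
```

A tracker would replace the toy fitness with measures built from appearance, detection confidence, and similar cues; the abstract does not specify those measures, so none are reproduced here.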

2510.18383 2026-06-19 cs.CL cs.AI 版本更新

MENTOR: Reinforcement Learning via Flexible Teacher-Optimized Rewards for Tool-Use Distillation

MENTOR: 通过灵活的教师优化奖励进行工具使用蒸馏的强化学习

ChangSu Choi, Hoyun Song, Dongyeon Kim, WooHyeon Jung, Minkyung Cho, Sunjin Park, NohHyeob Bae, Seona Yu, KyungTae Lim

发表机构 * Seoul National University of Science and Technology(首尔科学技术大学) Korea Advanced Institute of Science and Technology(韩国科学技术院) LG CNS

AI总结 提出MENTOR方法,通过灵活的教师优化奖励结构,平衡行为对齐与下游性能,提升小模型在工具使用任务中的域外泛化能力。

详情
AI中文摘要

将大型语言模型(LLMs)的工具使用能力蒸馏到小型语言模型(SLMs)中对其实际应用至关重要。主要方法监督微调(SFT)由于与静态教师轨迹的刚性对齐,导致域外(OOD)泛化性能较差。虽然强化学习(RL)提供了一种替代方案,但SLMs的能力限制带来了严峻的困境:稀疏的结果奖励提供的指导不足,而严格的轨迹匹配施加了过于严格的约束。为了弥合这一能力驱动的差距,我们提出了MENTOR,它引入了一种灵活且过程感知的奖励结构。MENTOR不强制执行刚性复制,而是利用教师的参考来指导工具使用行为,平衡行为对齐与下游性能。在可控可执行工具基准上的大量实验表明,与SFT和严格RL基线相比,MENTOR提高了OOD工具使用性能。我们的研究结果表明,在可验证的工具使用环境中,灵活的工具使用对齐比严格的轨迹复制为开发适应性小模型提供了更有效的方法。

英文摘要

Distilling the tool-use capabilities of large language models (LLMs) into small language models (SLMs) is essential for their practical application. The predominant approach, supervised fine-tuning (SFT), suffers from poor out-of-domain (OOD) generalization due to its rigid alignment with static teacher trajectories. While reinforcement learning (RL) offers an alternative, the capacity limitations of SLMs pose a severe dilemma: sparse outcome rewards provide insufficient guidance, whereas strict trajectory matching imposes overly restrictive constraints. To bridge this capacity-driven gap, we propose MENTOR, which introduces a flexible yet process-aware reward structure. Instead of enforcing rigid replication, MENTOR uses the teacher's reference to guide tool-use behavior, balancing behavioral alignment with downstream performance. Extensive experiments on controlled executable-tool benchmarks demonstrate that MENTOR improves OOD tool-use performance compared to SFT and strict RL baselines. Our findings suggest that within verifiable tool-use environments, flexible tool-use alignment offers a more effective approach than strict trajectory replication for developing adaptable small models.

2510.21978 2026-06-19 cs.LG cs.AI 版本更新

Beyond Reasoning Gains: Mitigating General-Capability Forgetting in Large Reasoning Models

超越推理增益:缓解大型推理模型中的通用能力遗忘

Hoang Phan, Xianjun Yang, Yuanshun Yao, Jingyu Zhang, Shengjie Bi, Xiaocheng Tang, Madian Khabsa, Lijuan Liu, Deren Lei

发表机构 * Meta Superintelligence Labs(Meta超智能实验室) New York University(纽约大学) Johns Hopkins University(约翰霍普金斯大学)

AI总结 针对强化学习训练导致推理模型遗忘基础能力的问题,提出RECAP重放策略,通过动态目标重加权在线调整训练重点,在保持通用能力的同时提升推理性能。

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)在数学和多模态推理方面取得了显著进展,并已成为当代语言和视觉-语言模型的标准后训练范式。然而,RLVR方法引入了能力退化的重大风险,即模型在长时间训练后,若未采用正则化策略,会遗忘基础技能。我们通过实验证实了这一担忧,观察到开源推理模型在感知和忠实性等核心能力上出现性能下降。虽然施加KL散度等正则化项有助于防止偏离基础模型,但这些项是在当前任务上计算的,因此不能保证保留更广泛的知识。同时,跨异构领域的经验回放使得决定每个目标应获得多少训练权重变得困难。为解决这一问题,我们提出RECAP——一种具有动态目标重加权的重放策略,用于通用知识保留。我们的重加权机制利用短期收敛和不稳定信号在线自适应,将后训练焦点从饱和目标转移到表现不佳或不稳定的目标。我们的方法是端到端的,可直接应用于现有RLVR流程,无需训练额外模型或进行繁重调优。在Qwen2.5-VL-3B和Qwen2.5-VL-7B上的广泛实验证明了我们方法的有效性,该方法不仅保留了通用能力,还通过实现任务内奖励的更灵活权衡提升了推理性能。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has delivered impressive gains in mathematical and multimodal reasoning and has become a standard post-training paradigm for contemporary language and vision-language models. However, the RLVR recipe introduces a significant risk of capability regression, in which models forget foundational skills after prolonged training without employing regularization strategies. We empirically confirm this concern, observing that open-source reasoning models suffer performance degradation on core capabilities such as perception and faithfulness. While imposing regularization terms like KL divergence can help prevent deviation from the base model, these terms are computed on the current task and therefore do not guarantee preservation of broader knowledge. Meanwhile, commonly used experience replay across heterogeneous domains makes it nontrivial to decide how much training emphasis each objective should receive. To address this, we propose RECAP, a replay strategy with dynamic objective reweighting for general knowledge preservation. Our reweighting mechanism adapts online using short-horizon signals of convergence and instability, shifting the post-training focus away from saturated objectives and toward underperforming or volatile ones. Our method is end-to-end and readily applicable to existing RLVR pipelines without training additional models or heavy tuning. Extensive experiments on benchmarks using Qwen2.5-VL-3B and Qwen2.5-VL-7B demonstrate the effectiveness of our method, which not only preserves general capabilities but also improves reasoning by enabling more flexible trade-offs among in-task rewards.
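The idea of shifting emphasis away from saturated objectives and toward underperforming or volatile ones can be sketched with a simple heuristic. The rule below is a hypothetical stand-in, not RECAP's actual formula: it scores each objective by its headroom (one minus the recent mean reward) plus its recent standard deviation, then normalises the scores into weights. The function name, the `floor` parameter, and the reward histories are all assumptions.

```python
from statistics import mean, pstdev
from typing import Dict, List


def objective_weights(histories: Dict[str, List[float]], floor: float = 0.05) -> Dict[str, float]:
    """Map recent per-objective rewards in [0, 1] to normalised training weights.

    Assumed heuristic: score = headroom + volatility, clipped below by `floor`
    so that saturated objectives keep a small share instead of vanishing.
    """
    scores = {}
    for name, rewards in histories.items():
        headroom = 1.0 - mean(rewards)   # large when the objective underperforms
        volatility = pstdev(rewards)     # large when the objective is unstable
        scores[name] = max(floor, headroom + volatility)
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}


# Hypothetical short-horizon reward windows for two objectives.
weights = objective_weights({
    "math": [0.95, 0.96, 0.95, 0.96],        # saturated and stable
    "perception": [0.40, 0.60, 0.30, 0.70],  # underperforming and volatile
})
```

With these inputs the volatile, low-reward objective receives most of the weight, which matches the qualitative behaviour the abstract describes.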

2510.20454 2026-06-19 cs.LG 版本更新

Capturing Intransitive Dominance in Tennis Forecasting: A Graph Neural Network Approach

网球预测中非传递性优势的捕捉:一种图神经网络方法

Lawrence Clegg, John Cartlidge

发表机构 * School of Engineering Mathematics and Technology, University of Bristol(布里斯托大学工程数学与技术学院)

AI总结 针对网球中常见的非传递性优势(A胜B,B胜C,C胜A),提出图神经网络模型,通过时间有向图建模历史比赛结果,捕捉被传递性评级系统忽略的预测信号,与加权Elo结合后显著提升预测性能。

Comments 41 pages, 7 figures. Major revision reframing the paper from betting-market inefficiency toward intransitivity analysis, forecast complementarity, and robustness. Added forecast-encompassing tests, new intransitivity measures, robustness analyses, and expanded appendices

详情
AI中文摘要

非传递性球员优势(即球员A击败B,B击败C,但C击败A)在竞技网球中很常见。然而,很少有已知的尝试将其纳入预测方法中。我们通过一种图神经网络方法来解决这个问题,该方法通过时间有向图显式建模这些非传递性关系,其中球员作为节点,他们的历史比赛结果作为有向边。我们的模型(准确率65.7%,Brier分数0.214)与加权Elo等已建立的评级系统相比具有竞争力。尽管它在无条件准确性上没有超越基线,但一项预测包含测试表明它携带了互补信息。组合预测显著优于加权Elo,并且有迹象表明,在我们的模型针对的非传递性对决中,增益增长更强烈。因此,基于图的球员交互表示捕捉了传递性评级系统丢弃的预测信号,即使在没有共同对手的球员之间也是如此。

英文摘要

Intransitive player dominance, where player A beats B, B beats C, but C beats A, is common in competitive tennis. Yet, there are few known attempts to incorporate it within forecasting methods. We address this problem with a graph neural network approach that explicitly models these intransitive relationships through temporal directed graphs, with players as nodes and their historical match outcomes as directed edges. Our model (65.7% accuracy, 0.214 Brier score) forecasts competitively with established rating systems such as Weighted Elo. Although it does not improve on the baseline in unconditional accuracy, a forecast-encompassing test shows that it carries complementary information. A combined forecast significantly outperforms Weighted Elo, and there is some indication that the gain grows more strongly on the intransitive matchups our model targets. A graph-based representation of player interactions thus captures a forecasting signal that transitive rating systems discard, even between players who share no common opponents.
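A small example makes the A-beats-B, B-beats-C, C-beats-A pattern concrete. The sketch below builds a static directed dominance graph from match results and lists the intransitive triads in it, and also defines the Brier score quoted above. It is only an illustration of the data structure: the paper's model uses temporal graphs and a graph neural network, neither of which is reproduced here, and the match list is invented.

```python
from itertools import combinations
from typing import List, Sequence, Set, Tuple

Match = Tuple[str, str]  # (winner, loser)


def dominance_edges(matches: Sequence[Match]) -> Set[Tuple[str, str]]:
    """Directed edge (a, b) when a leads the head-to-head record against b."""
    wins = {}
    for winner, loser in matches:
        wins[(winner, loser)] = wins.get((winner, loser), 0) + 1
    return {(w, l) for (w, l), n in wins.items() if n > wins.get((l, w), 0)}


def intransitive_triads(matches: Sequence[Match]) -> List[Tuple[str, str, str]]:
    """All player triples whose dominance edges form a directed 3-cycle."""
    edges = dominance_edges(matches)
    players = sorted({p for match in matches for p in match})
    triads = []
    for a, b, c in combinations(players, 3):
        clockwise = {(a, b), (b, c), (c, a)} <= edges
        counter = {(a, c), (c, b), (b, a)} <= edges
        if clockwise or counter:
            triads.append((a, b, c))
    return triads


def brier_score(probabilities: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(probabilities)


# Invented results: A, B, C form a cycle; D loses to both A and B.
results = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "D"), ("B", "D")]
triads = intransitive_triads(results)
```

A single scalar rating per player cannot order A, B, and C consistently here, which is the signal a transitive rating system discards.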

2510.19893 2026-06-19 cs.LG 版本更新

EQPO: Equitable Group Relative Policy Optimization for Clinical Reasoning

EQPO: 面向临床推理的公平群体相对策略优化

Shiqi Dai, Wei Dai, Jiaee Cheong, Paul Pu Liang

发表机构 * MIT(麻省理工学院) Harvard University(哈佛大学)

AI总结 提出EQPO分层强化学习方法,通过自适应重加权样本促进异质临床人群的均衡学习,在7个诊断基准上降低F1标准差43.9%,缩小预测公平差距27.2%。

Comments Accepted as Oral on NeurIPS 2025 GenAI4Health Workshop

详情
AI中文摘要

医疗AI系统展示了令人印象深刻的诊断性能,但它们在不同人口统计群体之间通常表现出不均匀的准确性,使代表性不足的人群处于不利地位。尽管多模态推理基础模型推动了临床诊断的发展,基于强化学习的后训练倾向于吸收并放大多数主导训练语料中存在的偏见。我们提出公平群体相对策略优化(EQPO),一种分层强化学习方法,通过根据子群表示、任务难度和数据来源自适应地重新加权样本,鼓励跨异质临床人群的平衡学习。由于人口统计注释在真实临床数据中经常缺失,EQPO还在注释不可用时应用无监督聚类来恢复潜在子群。在覆盖5种模态(X射线、CT、皮肤镜、乳腺X线摄影、超声)的7个诊断基准上,EQPO在QoQ-Med3-8B上相比原始GRPO将F1标准差降低43.9%,最大跨群体F1差距降低42.7%,并在MedGemma-4B上相比经过偏差缓解的RL基线将预测均等差距缩小27.2%,同时即使没有任何人口统计标签也将F1提高12.5%。检查训练轨迹显示,EQPO在优化过程中稳步提高公平性,而基线方法的公平性随训练进行而下降,并且发现的隐式群体保持稳定并与被掩蔽的人口统计属性对齐。我们进一步发布了EquiMedGemma-4B和EquiQoQ-Med3-8B,这两种具有公平意识的临床VLLM在显著缩小人口统计差距的同时达到了最先进的准确性。

英文摘要

Medical AI systems have demonstrated impressive diagnostic performance, yet they routinely show uneven accuracy across demographic groups, disadvantaging underrepresented populations. Although multimodal reasoning foundation models have pushed clinical diagnosis forward, reinforcement learning-based post-training tends to absorb and magnify the biases present in majority-dominated training corpora. We propose Equitable Group Relative Policy Optimization (EQPO), a hierarchical reinforcement learning method that encourages balanced learning across heterogeneous clinical populations by adaptively reweighting samples according to subgroup representation, task difficulty, and data source. As demographic annotations are frequently missing in real-world clinical data, EQPO additionally applies unsupervised clustering to recover latent subpopulations when they are unavailable. On 7 diagnostic benchmarks covering 5 modalities (X-ray, CT, dermoscopy, mammography, ultrasound), EQPO reduces F1 standard deviation by 43.9% and the maximum cross-group F1 gap by 42.7% on QoQ-Med3-8B over vanilla GRPO, and narrows predictive parity gaps by 27.2% on MedGemma-4B over bias-mitigated RL baselines while raising F1 by 12.5% even without any demographic labels. Examining the training trajectory shows that EQPO steadily improves fairness over the course of optimization, in contrast to baseline methods whose fairness degrades as training proceeds, and the discovered implicit groups remain stable and align with masked demographic attributes. We further release EquiMedGemma-4B and EquiQoQ-Med3-8B, equitability-aware clinical VLLMs that attain state-of-the-art accuracy with markedly smaller demographic gaps.
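The two fairness quantities reported above (F1 standard deviation and maximum cross-group F1 gap) and the general idea of reweighting by subgroup representation can be illustrated in a few lines. This is a sketch under stated assumptions: inverse-frequency weighting is a common stand-in and is not claimed to be EQPO's weighting rule, and the group labels and confusion counts are invented.

```python
from collections import Counter
from statistics import pstdev
from typing import Dict, List, Sequence, Tuple


def inverse_frequency_weights(groups: Sequence[str]) -> List[float]:
    """Per-sample weights so that every subgroup carries the same total weight."""
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    return [n / (k * counts[g]) for g in groups]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 score from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_spread(per_group: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float]:
    """Return (standard deviation of group F1, maximum cross-group F1 gap)."""
    scores = [f1(*counts) for counts in per_group.values()]
    return pstdev(scores), max(scores) - min(scores)


# Invented data: group "b" is underrepresented and scores a lower F1.
sample_weights = inverse_frequency_weights(["a", "a", "a", "b"])
std_f1, max_gap = f1_spread({"a": (8, 2, 2), "b": (3, 3, 3)})
```

Lower values of both spread measures indicate more even performance across groups; the single minority sample here receives weight 2.0 while each majority sample receives two thirds.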

2510.16311 2026-06-19 cs.LG 版本更新

Toward General Digraph Contrastive Learning: A Dual Spatial Perspective

面向一般有向图对比学习:双空间视角

Zhengyu Wu, Daohan Su, Yang Zhang, Xunkai Li, Rong-Hua Li, Guoren Wang

发表机构 * National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出S2-DiGCL框架,从复数域和实数域双空间视角对有向图进行对比学习,通过磁拉普拉斯自适应调制和路径子图增强,在节点分类和链接预测任务上分别提升4.41%和4.34%。

详情
AI中文摘要

图对比学习(GCL)已成为一种从图中提取一致表示而无需标签信息的强大工具。然而,现有方法主要关注无向图,忽略了在实际网络(如社交网络和推荐系统)中基础且不可或缺的关键方向信息。本文提出了S2-DiGCL,一种新颖的框架,强调从复数域和实数域视角对有向图进行对比学习的空间洞察。从复数域视角,S2-DiGCL在磁拉普拉斯中引入个性化扰动,以自适应地调制边相位和方向语义。从实数域视角,它采用基于路径的子图增强策略,捕捉细粒度的局部不对称性和拓扑依赖性。通过联合利用这两个互补的空间视图,S2-DiGCL构建了高质量的正负样本,从而实现更通用和鲁棒的有向图对比学习。在7个真实有向图数据集上的大量实验证明了我们方法的优越性,在监督和无监督设置下,节点分类和链接预测分别实现了4.41%和4.34%的性能提升,达到了最先进水平。

英文摘要

Graph Contrastive Learning (GCL) has emerged as a powerful tool for extracting consistent representations from graphs, independent of labeled information. However, existing methods predominantly focus on undirected graphs, disregarding the pivotal directional information that is fundamental and indispensable in real-world networks (e.g., social networks and recommendations). In this paper, we introduce S2-DiGCL, a novel framework that emphasizes spatial insights from complex-domain and real-domain perspectives for directed graph (digraph) contrastive learning. From the complex-domain perspective, S2-DiGCL introduces personalized perturbations into the magnetic Laplacian to adaptively modulate edge phases and directional semantics. From the real-domain perspective, it employs a path-based subgraph augmentation strategy to capture fine-grained local asymmetries and topological dependencies. By jointly leveraging these two complementary spatial views, S2-DiGCL constructs high-quality positive and negative samples, leading to more general and robust digraph contrastive learning. Extensive experiments on 7 real-world digraph datasets demonstrate the superiority of our approach, achieving SOTA performance with 4.41% improvement in node classification and 4.34% in link prediction under both supervised and unsupervised settings.
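For readers unfamiliar with the magnetic Laplacian, the sketch below builds it for a small digraph using the standard definition: symmetrise the adjacency matrix for edge magnitudes, and encode direction as a complex phase exp(i·2πq·(A[u][v] − A[v][u])), where q is the charge parameter. It shows only the plain operator; how S2-DiGCL perturbs it is not described in enough detail above to reproduce, and the example graph is invented.

```python
import cmath
import math
from typing import List, Sequence


def magnetic_laplacian(adj: Sequence[Sequence[float]], q: float = 0.25) -> List[List[complex]]:
    """Unnormalised magnetic Laplacian L = D_s - A_s * exp(i * 2*pi*q * (A - A^T)).

    A_s = (A + A^T) / 2 is the symmetrised adjacency and D_s its degree matrix.
    The result is Hermitian; with q = 0 it reduces to the ordinary Laplacian of A_s.
    """
    n = len(adj)
    lap = [[0j] * n for _ in range(n)]
    for u in range(n):
        degree = 0.0
        for v in range(n):
            sym = 0.5 * (adj[u][v] + adj[v][u])                  # edge magnitude
            phase = 2.0 * math.pi * q * (adj[u][v] - adj[v][u])  # edge direction
            degree += sym
            lap[u][v] = -sym * cmath.exp(1j * phase)
        lap[u][u] += degree
    return lap


# Invented example: a directed 3-cycle 0 -> 1 -> 2 -> 0.
cycle = [[0, 1, 0],
         [0, 0, 1],
         [1, 0, 0]]
lap_q = magnetic_laplacian(cycle, q=0.25)
lap_0 = magnetic_laplacian(cycle, q=0.0)
```

With q = 0.25 the edge 0 → 1 appears as −0.5i in one direction and +0.5i in the other, so direction survives in the phase, whereas q = 0 gives −0.5 both ways and forgets it.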

2510.08807 2026-06-19 cs.RO cs.LG 版本更新

Humanoid Everyday: A Comprehensive Robotic Dataset for Open-World Humanoid Manipulation

Humanoid Everyday:面向开放世界人形机器人操作的综合机器人数据集

Zhenyu Zhao, Hongyi Jing, Xiawei Liu, Jiageng Mao, Abha Jha, Hanwen Yang, Rong Xue, Sergey Zakharov, Vitor Guizilini, Yue Wang

发表机构 * University of Southern California(南加州大学) Toyota Research Institute(丰田研究院)

AI总结 提出Humanoid Everyday数据集,包含10.3k轨迹、260个任务的多模态数据,用于人形机器人灵巧操作、人机交互和移动操作研究,并配套云评估平台。

详情
AI中文摘要

从运动到灵巧操作,人形机器人在展示复杂的全身能力方面取得了显著进展。然而,当前大多数机器人学习数据集和基准主要关注固定机器人臂,少数现有人形数据集要么局限于固定环境,要么任务多样性有限,通常缺乏人机交互和下肢运动。此外,缺乏用于在人形数据上对基于学习的策略进行基准测试的标准化评估平台。在这项工作中,我们提出了Humanoid Everyday,一个大规模且多样化的人形操作数据集,其特点是涉及灵巧物体操作、人机交互、运动集成动作等广泛的任务多样性。利用高效的人工监督遥操作流水线,Humanoid Everyday聚合了高质量的多模态感官数据,包括RGB、深度、LiDAR和触觉输入,以及自然语言注释,包含10.3k条轨迹和超过300万帧数据,涵盖7个大类共260个任务。此外,我们对数据集上的代表性策略学习方法进行了分析,提供了它们在不同任务类别中的优势和局限性的见解。为了标准化评估,我们引入了一个基于云的评估平台,允许研究人员在我们的受控环境中无缝部署他们的策略并接收性能反馈。通过发布Humanoid Everyday以及我们的策略学习分析和标准化的基于云的评估平台,我们旨在推进通用人形操作的研究,并为现实世界中更有能力和具身化的机器人代理奠定基础。我们的数据集、数据收集代码和云评估网站在我们的项目网站上公开发布。

英文摘要

From locomotion to dexterous manipulation, humanoid robots have made remarkable strides in demonstrating complex full-body capabilities. However, the majority of current robot learning datasets and benchmarks mainly focus on stationary robot arms, and the few existing humanoid datasets are either confined to fixed environments or limited in task diversity, often lacking human-humanoid interaction and lower-body locomotion. Moreover, there are few standardized evaluation platforms for benchmarking learning-based policies on humanoid data. In this work, we present Humanoid Everyday, a large-scale and diverse humanoid manipulation dataset characterized by extensive task variety involving dexterous object manipulation, human-humanoid interaction, locomotion-integrated actions, and more. Leveraging a highly efficient human-supervised teleoperation pipeline, Humanoid Everyday aggregates high-quality multimodal sensory data, including RGB, depth, LiDAR, and tactile inputs, together with natural language annotations, comprising 10.3k trajectories and over 3 million frames of data spanning 260 tasks in 7 broad categories. In addition, we conduct an analysis of representative policy learning methods on our dataset, providing insights into their strengths and limitations across different task categories. For standardized evaluation, we introduce a cloud-based evaluation platform that allows researchers to seamlessly deploy their policies in our controlled setting and receive performance feedback. By releasing Humanoid Everyday along with our policy learning analysis and a standardized cloud-based evaluation platform, we intend to advance research in general-purpose humanoid manipulation and lay the groundwork for more capable and embodied robotic agents in real-world scenarios. Our dataset, data collection code, and cloud evaluation website are made publicly available on our project website.

2509.13972 2026-06-19 cs.RO 版本更新

BIM Informed Visual SLAM for Construction Environments

BIM 引导的视觉 SLAM 在建筑环境中的应用

Asier Bikandi-Noya, Miguel Fernandez-Cortizas, Muhammad Shaheer, Ali Tourani, Holger Voos, Jose Luis Sanchez-Lopez

发表机构 * Automation and Robotics Research Group, Interdisciplinary Centre for Security, Reliability, and Trust (SnT), University of Luxembourg(自动化与机器人研究组,安全、可靠与信任跨学科研究中心(SnT),卢森堡大学)

AI总结 针对建筑环境中视觉SLAM轨迹漂移问题,提出利用建筑信息模型(BIM)的结构先验增强RGB-D SLAM系统,通过墙面对应与几何约束优化减少漂移,提升全局一致性,实验显示轨迹误差降低25.23%,地图精度提升7.14%。

Comments 9 pages, 7 tables, 4 figures

详情
AI中文摘要

监测建筑施工现场需要将计划设计与实际建造状态进行比较,而同步定位与地图构建(SLAM)技术可以实时估计实际状态。然而,视觉SLAM在建筑环境中容易产生轨迹漂移,生成的地图在几何上与实际环境不准确。为解决这一局限,我们利用从建筑信息模型(BIM)导出的结构先验增强现有的RGB-D SLAM系统。该系统将检测到的墙面与BIM中的对应墙面关联,并将这些对应关系作为几何约束加入后端优化,从而减少漂移并增强全局一致性。所提方法实时运行,并在多个真实建筑工地上验证,与最先进的基线相比,平均轨迹误差降低25.23%,地图精度提升7.14%。鲁棒性分析进一步表明,该方法对不完整的BIM数据以及计划模型与实际环境之间的几何差异具有韧性。

英文摘要

Monitoring building construction sites requires comparing the as-planned design with the as-built state, which can be estimated in real time using Simultaneous Localization and Mapping (SLAM) techniques. However, visual SLAM is prone to trajectory drift in construction environments, producing maps that are geometrically inconsistent with the actual environment. To address this limitation, we augment an existing RGB-D SLAM system with structural priors derived from the Building Information Model (BIM). The system associates detected walls with their BIM counterparts and includes these correspondences as geometric constraints in the back-end optimization, reducing drift and enhancing global consistency. The proposed method operates in real time and is validated on multiple real construction sites, achieving an average trajectory error reduction of 25.23% and a 7.14% improvement in map accuracy over state-of-the-art baselines. Robustness analyses further demonstrate resilience to incomplete BIM data and geometric discrepancies between as-planned models and the as-built environment.

2510.00831 2026-06-19 cs.AI cs.LG eess.SP 版本更新

Controlled Comparison of Machine Learning Models for Fault Classification and Localization in Power System Protection

电力系统保护中故障分类与定位的机器学习模型受控比较

Julian Oelhaf, Georg Kordowich, Changhun Kim, Paula Andrea Pérez-Toro, Christian Bergler, Andreas Maier, Johann Jäger, Siming Bayer

发表机构 * Department of Electrical Engineering, Media and Computer Science, Ostbayerische Technische Hochschule Amberg-Weiden(安贝格-魏登东巴伐利亚应用技术大学电气工程、媒体与计算机科学系)

AI总结 在统一电磁暂态数据集和10-50ms决策窗口下,对比机器学习模型在故障分类与定位中的性能,发现分类在10ms时F1>0.98,定位误差稳定在约10%线路长度。

Comments Accepted at IEEE PES Innovative Smart Grid Technologies Europe 2026 (ISGT Europe 2026). Pre-camera-ready author version; final proceedings version may differ

详情
AI中文摘要

现代电力系统因逆变器基和分布式能源的集成而日益复杂,挑战了传统保护方案的可靠性,并推动了机器学习在保护任务中的应用。然而,由于不同研究中的数据集、传感假设和决策时域各异,已发表的结果往往难以比较。本文在相同的传感、时序和验证条件下,基于公共电磁暂态数据集,使用10-50ms的决策窗口以反映保护相关时间尺度,对故障分类(FC)和故障定位(FL)的机器学习模型进行了受控比较。对于FC,性能最佳的非线性模型在10ms时F1分数已超过0.98,而低容量模型在较短时域下性能下降,但随窗口延长而改善,表明相关故障类型信息在最早暂态中已存在。对于FL,顶级模型在所有评估时域下达到约10%归一化线路长度的稳定定位误差,而较弱模型形成明显分离的第二性能层级。线路解析分析显示,定位精度随电网段变化,表明存在拓扑依赖的难度而非仅时间上下文不足。这些发现为比较两个信息需求根本不同的保护任务中的机器学习模型提供了受控参考。

英文摘要

The increasing complexity of modern power systems, driven by the integration of inverter-based and distributed energy resources, challenges the reliability of conventional protection schemes and motivates the use of machine learning for protection tasks. However, published results are often difficult to compare because datasets, sensing assumptions, and decision horizons vary across studies. This paper presents a controlled comparison of machine learning models for fault classification (FC) and fault localization (FL) under identical sensing, timing, and validation conditions on a common electromagnetic transient dataset, using decision windows of 10-50 ms to reflect protection-relevant time scales. For FC, the best-performing nonlinear models achieve F1 scores above 0.98 already at 10 ms, while lower-capacity models degrade at shorter horizons but improve with longer windows, indicating that relevant fault-type information is already present in the earliest transient. For FL, the top-performing models reach a stable localization error of about 10 % of normalized line length across all evaluated horizons, while weaker models form a clearly separated second performance tier. Line-resolved analysis shows that localization accuracy varies across grid segments, indicating topology-dependent difficulty rather than insufficient temporal context alone. These findings provide a controlled reference for comparing machine learning models across two protection tasks with fundamentally different information requirements.

2509.25148 2026-06-19 cs.AI 版本更新

AAPA: Adversarially Anchored Preference Alignment for Post-Training of Large Language Models

AAPA:用于大型语言模型后训练的对抗锚定偏好对齐

Faqiang Qian, Kang An, Weikun Zhang, Ziliang Wang, Xuhui Zheng, Liangjian Wen, Yong Dai, Mengya Gao, Yichao Wu

发表机构 * Southwest University of Finance and Economics(西南财经大学)

AI总结 提出AAPA框架,通过固定轻量判别器对策略输出与专家响应进行句子级对抗锚定,增强SFT、GRPO等后训练目标,在指令遵循基准上持续提升性能。

详情
AI中文摘要

大型语言模型的后训练对齐通常结合了专家演示上的监督微调(SFT)和来自偏好或可验证反馈的强化学习(RL)。SFT提供了有用的行为锚点,但可能过拟合静态演示,而RL鼓励探索但可能偏离专家行为或利用不完美的奖励。我们提出AAPA(对抗锚定偏好对齐),这是一个插件式框架,通过句子级对抗锚定信号增强现有的后训练目标。AAPA使用固定的轻量判别器将策略生成结果与离线预收集的专家响应进行比较,因此在策略优化期间既不需要在线教师推理,也不需要判别器协同训练。相同的锚定项可以添加到SFT、GRPO和CHORD中,同时保留其原始训练流程。在指令遵循基准上的实验表明,AAPA在不同模型规模上一致地改善了相应的基础目标。特别是,分阶段的AAPA配置在Qwen3-0.6B上比强GRPO基线提高了5.77%,在Qwen3-4B上提高了3.75%。对响应长度、对数概率分布和判别器变体的进一步分析表明,对抗锚定为偏好优化提供了稳定的语义基础信号。代码可在 https://github.com/IsFaqq/AAPA 获取。

英文摘要

Post-training alignment of large language models often combines supervised fine-tuning (SFT) on expert demonstrations with reinforcement learning (RL) from preference or verifiable feedback. SFT provides a useful behavioral anchor but can overfit to static demonstrations, whereas RL encourages exploration but may drift from expert behavior or exploit imperfect rewards. We propose AAPA (Adversarially Anchored Preference Alignment), a plug-in framework that augments existing post-training objectives with a sentence-level adversarial anchoring signal. AAPA compares policy rollouts with offline, pre-collected expert responses using a fixed lightweight discriminator, and therefore requires neither online teacher inference nor discriminator co-training during policy optimization. The same anchoring term can be added to SFT, GRPO, and CHORD while preserving their original training pipelines. Experiments on instruction-following benchmarks show that AAPA consistently improves the corresponding base objectives across model scales. In particular, the staged AAPA configuration improves over a strong GRPO baseline by 5.77% on Qwen3-0.6B and 3.75% on Qwen3-4B. Further analyses on response length, log-probability distributions, and discriminator variants suggest that adversarial anchoring provides a stable semantic grounding signal for preference optimization. Code is available at https://github.com/IsFaqq/AAPA.

2509.19658 2026-06-19 cs.RO cs.AI 版本更新

RoboSSM: Scalable In-context Imitation Learning via State-Space Models

RoboSSM: 基于状态空间模型的可扩展上下文模仿学习

Youngju Yoo, Jiaheng Hu, Yifeng Zhu, Bo Liu, Qiang Liu, Roberto Martín-Martín, Peter Stone

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) KAIST(韩国科学技术院) FAIR at Meta(Meta FAIR) Amazon(亚马逊) Sony AI(索尼人工智能)

AI总结 提出RoboSSM,用状态空间模型替代Transformer实现上下文模仿学习,在LIBERO基准上对未见和长时任务泛化更优,首次证明SSM是ICIL高效可扩展的骨干网络。

Comments IROS 2026

详情
AI中文摘要

上下文模仿学习(ICIL)使机器人能够从仅包含少量演示的提示中学习任务。通过消除部署时参数更新的需求,该范式支持对新任务的少样本适应。然而,最近的ICIL方法依赖于Transformer,而Transformer存在计算上的局限,并且在处理比训练时更长的提示时往往表现不佳。在这项工作中,我们引入了RoboSSM,一种基于状态空间模型(SSM)的可扩展上下文模仿学习方案。具体来说,RoboSSM用Longhorn(一种最先进的SSM)替代Transformer,该模型提供线性时间推理和强大的外推能力,非常适合长上下文提示。通过在LIBERO基准上的多样化实验,我们证明了将SSM应用于ICIL的有效性,通过处理测试时更长的上下文,实现了比基于Transformer的ICIL方法对未见和长时任务更好的泛化。这些结果首次表明,SSM是ICIL高效且可扩展的骨干网络。我们的代码可在 https://github.com/youngjuY/RoboSSM 获取。

英文摘要

In-context imitation learning (ICIL) enables robots to learn tasks from prompts consisting of just a handful of demonstrations. By eliminating the need for parameter updates at deployment time, this paradigm supports few-shot adaptation to novel tasks. However, recent ICIL methods rely on Transformers, which have computational limitations and tend to underperform when handling longer prompts than those seen during training. In this work, we introduce RoboSSM, a scalable recipe for in-context imitation learning based on state-space models (SSMs). Specifically, RoboSSM replaces Transformers with Longhorn, a state-of-the-art SSM that provides linear-time inference and strong extrapolation capabilities, making it well-suited for long-context prompts. Through diverse experiments on the LIBERO benchmark, we demonstrate the effectiveness of applying SSMs to ICIL, achieving better generalization to both unseen and long-horizon tasks than Transformer-based ICIL methods by handling longer contexts at test time. These results show for the first time that SSMs are an efficient and scalable backbone for ICIL. Our code is available at https://github.com/youngjuY/RoboSSM.
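The linear-time property mentioned above follows from the recurrent form of a state-space model: each new input updates a fixed-size state, so cost grows linearly with prompt length and memory stays constant. The sketch below is a generic scalar linear recurrence, not Longhorn's update rule; the coefficients `a`, `b`, and `c` are hypothetical.

```python
from typing import List, Sequence


def ssm_scan(inputs: Sequence[float], a: float = 0.9, b: float = 0.1, c: float = 1.0) -> List[float]:
    """Run h_t = a * h_{t-1} + b * x_t, y_t = c * h_t over a sequence.

    One constant-cost update per step: O(T) time and O(1) state for T inputs,
    in contrast to self-attention, whose cost per new token grows with context length.
    """
    state = 0.0
    outputs = []
    for x in inputs:
        state = a * state + b * x   # fold the new input into the fixed-size state
        outputs.append(c * state)
    return outputs


# An impulse decays geometrically through the state.
impulse_response = ssm_scan([1.0, 0.0, 0.0], a=0.5, b=1.0)
# A prompt far longer than a typical training length needs no extra memory.
long_outputs = ssm_scan([1.0] * 1000)
```

Practical SSM layers use vector or matrix states and learned, often input-dependent, coefficients, but the same one-step update structure is what gives the linear scaling.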

2509.10416 2026-06-19 cs.RO 版本更新

TASC: Task-Aware Shared Control for Relational Telemanipulation

TASC:面向关系遥操作的任务感知共享控制

Ze Fu, Pinhao Song, Yutong Hu, Renaud Detry

发表机构 * KU Leuven, Dept. Mechanical Engineering, Research unit Robotics, Automation and Mechatronics(鲁汶大学机械工程系,机器人、自动化与机电一体化研究单位) KU Leuven, Dept. Electrical Engineering, Research unit Processing Speech and Images(鲁汶大学电气工程系,语音与图像处理研究单位)

AI总结 提出TASC框架,通过视觉构建开放词汇交互图推断任务级用户意图,并基于空间约束提供共享控制辅助,提升关系遥操作效率与泛化能力。

Comments Accepted to IROS 2026

详情
AI中文摘要

我们提出了TASC,一个面向关系遥操作的任务感知共享控制框架,该框架从仅运动输入中推断任务级用户意图并提供辅助。为了在没有预定义模板的情况下支持抓握式关系任务,TASC从视觉输入构建一个开放词汇的交互图来表示功能性物体关系,并据此推断用户意图。然后,共享控制策略在抓取和物体交互过程中提供辅助,该辅助由视觉语言模型预测的空间约束引导。我们的方法解决了共享控制下关系遥操作的两个关键挑战:(1)从低级运动命令中推断任务级意图,以及(2)跨不同物体和任务的泛化辅助。在仿真和真实世界的实验表明,与先前方法相比,TASC提高了任务效率并减少了用户输入负担,同时实现了跨多种关系遥操作任务的零样本泛化。支持我们实验的代码已在 https://github.com/fitz0401/tasc 公开提供。

英文摘要

We present TASC, a Task-Aware Shared Control framework for relational telemanipulation that infers task-level user intent and provides assistance from motion-only input. To support prehensile relational tasks without predefined templates, TASC constructs an open-vocabulary interaction graph from visual input to represent functional object relationships, and infers user intent accordingly. A shared control policy then provides assistance during both grasping and object interaction, guided by spatial constraints predicted by a vision-language model. Our method addresses two key challenges in relational telemanipulation under shared control: (1) task-level intent inference from low-level motion commands, and (2) generalizable assistance across diverse objects and tasks. Experiments in both simulation and the real world demonstrate that TASC improves task efficiency and reduces user input effort compared to prior methods, while enabling zero-shot generalization across diverse relational telemanipulation tasks. The code that supports our experiments is publicly available at https://github.com/fitz0401/tasc.

2504.11171 2026-06-19 cs.CV cs.AI 版本更新

TerraMind: Large-Scale Generative Multimodality for Earth Observation

TerraMind:面向地球观测的大规模生成式多模态模型

Johannes Jakubik, Felix Yang, Benedikt Blumenstiel, Erik Scheurer, Rocco Sedona, Stefano Maurogiovanni, Jente Bosmans, Nikolaos Dionelis, Valerio Marsocci, Niklas Kopp, Rahul Ramachandran, Paolo Fraccaro, Thomas Brunschwiler, Gabriele Cavallaro, Juan Bernabe-Moreno, Nicolas Longépé

发表机构 * IBM Research – Europe(IBM欧洲研究院) ETH Zurich(苏黎世联邦理工学院) Forschungszentrum Jülich(尤利希研究中心) European Space Agency(欧洲航天局) Φ-Lab(Φ实验室) NASA IMPACT University of Iceland(冰岛大学)

AI总结 提出首个任意到任意生成式多模态基础模型TerraMind,通过双尺度表示(token级和像素级)预训练,实现零样本/少样本应用,并引入“模态思考”能力,在PANGAEA等基准上达到领先性能。

Comments Accepted at ICCV'25

详情
AI中文摘要

我们提出了TerraMind,这是首个面向地球观测(EO)的任意到任意生成式多模态基础模型。与其他多模态模型不同,TerraMind在跨模态的双尺度表示(结合token级和像素级数据)上进行预训练。在token级别,TerraMind编码高层上下文信息以学习跨模态关系;在像素级别,TerraMind利用细粒度表示捕捉关键空间细节。我们在一个全球大规模数据集的九种地理空间模态上预训练了TerraMind。在本文中,我们证明:(i)TerraMind的双尺度早期融合方法为地球观测解锁了一系列零样本和少样本应用;(ii)TerraMind引入了“模态思考”(TiM)——在微调和推理过程中生成额外人工数据以改善模型输出的能力;(iii)TerraMind在PANGAEA等社区标准的地球观测基准上达到了超越现有最优的性能。预训练数据集、模型权重和我们的代码均在宽松许可下开源。

英文摘要

We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data across modalities. On a token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on a pixel level, TerraMind leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities of a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind's dual-scale early fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces "Thinking-in-Modalities" (TiM) -- the capability of generating additional artificial data during finetuning and inference to improve the model output -- and (iii) TerraMind achieves beyond state-of-the-art performance in community-standard benchmarks for EO like PANGAEA. The pretraining dataset, the model weights, and our code are open-sourced under a permissive license.

2506.14990 2026-06-19 cs.AI Version update

MEAL: A Benchmark for Continual Multi-Agent Reinforcement Learning

MEAL: 持续多智能体强化学习基准

Tristan Tomilin, Luka van den Boogaard, Samuel Garcin, Constantin Ruhdorfer, Bram Grooten, Fabrice Kusters, Yali Du, Andreas Bulling, Mykola Pechenizkiy, Meng Fang

Affiliations * Eindhoven University of Technology, The Netherlands; University of Edinburgh, UK; University of Stuttgart, Germany; King's College London, UK; University of Liverpool, UK

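AI summary Proposes the MEAL benchmark, which uses JAX and GPU acceleration to train on sequences of 100 tasks and reveals failure modes that emerge in long task sequences.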

Comments To be published in the International Conference on Machine Learning (ICML) 2026

Details
AI-generated Chinese abstract

基准在强化学习(RL)研究中扮演核心角色,但其计算约束往往塑造了研究内容。尽管有终身学习的动机,大多数持续RL论文仅考虑3-10个顺序任务,因为CPU密集型环境使得更长的序列不切实际。同时,合作多智能体环境中的持续学习仍基本未被探索。为弥补这些空白,我们引入MEAL(多智能体自适应学习环境),这是首个持续多智能体RL基准。通过利用JAX和GPU加速,MEAL能够在单个GPU上几小时内训练100个任务的序列。我们发现,长任务序列揭示了在较小规模下不会出现的失败模式。

English abstract

Benchmarks play a central role in reinforcement learning (RL) research, yet their computational constraints often shape what is studied. Despite the motivation of lifelong learning, most continual RL papers consider only 3-10 sequential tasks, as CPU-bound environments make longer sequences impractical. Meanwhile, continual learning in cooperative multi-agent settings remains largely unexplored. To address these gaps, we introduce MEAL (Multi-agent Environments for Adaptive Learning), the first benchmark for continual multi-agent RL. By leveraging JAX and GPU acceleration, MEAL enables training on sequences of 100 tasks in a few hours on a single GPU. We find that long task sequences reveal failure modes that do not appear at smaller scales.

2509.00271 2026-06-19 cs.RO Version update

Learn from What We HAVE: History-Aware VErifier that Reasons about Past Interactions Online

从我们所拥有的学习:在线推理过去交互的历史感知验证器

Yishu Li, Xinyi Mao, Ying Yuan, Kyutae Sim, Ben Eisner, David Held

Affiliations * Robotics Institute, Carnegie Mellon University; Computer Science and Technology, Tsinghua University

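AI summary Proposes HAVE, a history-aware verifier that decouples action generation from verification and uses past interactions to resolve ambiguity online; theoretical analysis shows that it improves expected action quality, and experiments in multiple simulated and real-world environments confirm its effectiveness.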

Comments CoRL 2025

Details
AI-generated Chinese abstract

我们引入了一种新颖的历史感知验证器(HAVE),通过利用过去的交互来在线消除不确定场景中的歧义。机器人经常遇到视觉上模糊的物体,这些物体的操作结果直到物理交互之前都是不确定的。虽然仅凭生成模型理论上可以适应这种模糊性,但在实践中,即使在以动作历史为条件的情况下,它们在模糊情况下也会获得次优性能。为了解决这个问题,我们提出明确地将动作生成与验证解耦:我们使用无条件的基于扩散的生成器来提出多个候选动作,并采用我们的历史感知验证器通过推理过去的交互来选择最有希望的动作。通过理论分析,我们证明了使用验证器显著提高了期望动作质量。在多个模拟和真实环境(包括铰接物体、多模态门和不均匀物体拾取)中的实证评估和分析证实了我们方法的有效性以及对基线的改进。我们的项目网站位于:https://liy1shu.github.io/HAVE_CoRL25/

English abstract

We introduce a novel History-Aware VErifier (HAVE) to disambiguate uncertain scenarios online by leveraging past interactions. Robots frequently encounter visually ambiguous objects whose manipulation outcomes remain uncertain until physically interacted with. While generative models alone could theoretically adapt to such ambiguity, in practice they obtain suboptimal performance in ambiguous cases, even when conditioned on action history. To address this, we propose explicitly decoupling action generation from verification: we use an unconditional diffusion-based generator to propose multiple candidate actions and employ our history-aware verifier to select the most promising action by reasoning about past interactions. Through theoretical analysis, we demonstrate that employing a verifier significantly improves expected action quality. Empirical evaluations and analysis across multiple simulated and real-world environments including articulated objects, multi-modal doors, and uneven object pick-up confirm the effectiveness of our method and improvements over baselines. Our project website is available at: https://liy1shu.github.io/HAVE_CoRL25/