arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.19528 2026-05-20 cs.CV

Towards Camera-Robust 3D Localization: Equation-Anchored Tool-Use for MLLMs

Xueying Jiang, Wenhao Li, Quanhao Qian, Deli Zhao, Shijian Lu, Gongjie Zhang, Ran Xu

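AI summary: This paper proposes an equation-anchored tool-use framework that repurposes spatial tools as formula variables to address camera intrinsic ambiguity in 3D localization for multimodal large language models (MLLMs), yielding significant gains on 3D object detection and 3D visual grounding.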

Abstract

3D localization in Multimodal Large Language Models (MLLMs), including 3D object detection and 3D visual grounding, is fundamentally limited by camera intrinsic ambiguity: the same image admits different 3D scenes under different cameras. Existing MLLMs either ignore camera parameters and overfit to a canonical training intrinsic, or retrieve depth and 3D cues from external tools but treat the returned values as reference cues (numerical hints that the model is free to interpret implicitly), both preventing camera information from being deterministically propagated into the prediction. We propose an equation-anchored tool-use framework that re-purposes spatial tools as formula variables. The proposed framework proactively retrieves camera intrinsics and samples multi-point metric depths, writes the pinhole back-projection equation $\hat{X} = (u_c - c_x)\bar{Z}/f_x$ explicitly in Chain-of-Thought (CoT), and substitutes tool outputs into the formula before regressing the final 9-DoF bounding box. On both 3D object detection and 3D visual grounding tasks under rescaled camera intrinsics from $0.5\times$ to $1.5\times$, our method outperforms RGB-only and tool-augmented baselines, with significant gains where the camera deviates most from the training scale. Code and data will be released.
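
The back-projection step in the abstract can be illustrated with a minimal Python sketch. The function name `back_project_x` and all numeric values are hypothetical, and rescaling only the focal length while holding the principal point fixed is an assumption made here, not a detail confirmed by the abstract.

```python
def back_project_x(u_c, c_x, f_x, z_bar):
    """Pinhole back-projection: X_hat = (u_c - c_x) * Z_bar / f_x."""
    return (u_c - c_x) * z_bar / f_x


# Hypothetical tool outputs: pixel column, principal point, focal length (pixels), depth (metres).
u_c, c_x, f_x, z_bar = 800.0, 640.0, 1000.0, 5.0

x_hat = back_project_x(u_c, c_x, f_x, z_bar)          # training-scale intrinsics
x_half = back_project_x(u_c, c_x, 0.5 * f_x, z_bar)   # focal length rescaled by 0.5x
x_wide = back_project_x(u_c, c_x, 1.5 * f_x, z_bar)   # focal length rescaled by 1.5x

print(x_hat, x_half, x_wide)
```

The same pixel and depth map to different metric X coordinates once the focal length changes, which is the ambiguity the abstract describes: a model that ignores the intrinsics has no way to tell these cases apart.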

2605.19527 2026-05-20 cs.CV

Dual-Prompt CLIP with Hybrid Visual Encoders for Occluded Person Re-Identification

Zhangjian Ji, Shaotong Qiao, Kai Feng, Wei Wei

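AI summary: This paper proposes DPL-ReID, a dual prompt learning re-identification model that uses a dual prompt learning strategy and a real-world occlusion augmentation method to improve the robustness and accuracy of occluded person re-identification.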

Abstract

Occluded person re-identification focuses on matching partially visible pedestrians across multiple camera views. However, occlusions disrupt body-region cues, thereby complicating cross-view matching. Most person ReID methods built on pretrained vision-language models focus only on enhancing prompt-based feature learning and ignore the semantic information of occluders. Building on the success of CLIP-ReID, we propose a novel Dual Prompt Learning ReID (DPL-ReID) model for occluded person ReID. It incorporates a Dual Prompt Learning (Dual-PL) strategy, which uses textual cues to capture complete pedestrian semantics while remaining robust to occlusion, and a Real-World Occlusion Augmentation (RWOA) method that realistically simulates occlusion scenarios encountered in the real world to enrich occluded samples. In addition, we design a Weighted Gated Feature Fusion (WGFF) method, which incorporates LSNet to capture global information and act as a feature-gating mechanism. This mechanism can effectively guide the CLIP visual encoder toward generating more comprehensive feature representations. Extensive experiments on several benchmark occluded ReID datasets show that our proposed DPL-ReID achieves state-of-the-art performance. The occlusion instance library is available at https://github.com/stone-qiao/DPL-ReID.

2605.19524 2026-05-20 cs.RO cs.CV

SafeAlign-VLA: A Negative-Enhanced Safe Alignment Framework for Risk-Aware Autonomous Driving

Kefei Tian, Yuansheng Lian, Kai Yang, Xiangdong Chen, Shen Li

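AI summary: This paper proposes SafeAlign-VLA, a framework that incorporates negative data to improve an autonomous driving system's understanding of safety boundaries. It generates safety labels and counterfactual trajectories, and combines a two-stage training strategy with anchor-based group relative policy optimization to improve driving safety and robustness.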

Abstract

End-to-end autonomous driving systems excel in common scenarios but struggle with safety-critical long-tail cases. Vision-Language-Action (VLA) models are promising due to their strong reasoning capabilities. However, most VLA-based approaches rely on positive expert demonstrations and rarely exploit negative samples, leading to insufficient understanding of risky behaviors and safety boundaries. To address this limitation, we propose SafeAlign-VLA, a unified negative-enhanced safe alignment framework that incorporates negative data into supervised learning and reinforcement learning. First, we develop a counterfactual safety pairing paradigm to generate structured safety labels and counterfactual positive trajectories from risky scenarios via counterfactual reasoning. Then, a two-stage training strategy is adopted: negative-enhanced supervised fine-tuning for failure feedback and trajectory correction, followed by anchor-based group relative policy optimization that uses positive and negative trajectories as contrastive anchors to steer sampling and penalize high-risk behaviors via group-relative advantages. Experiments on NAVSIM and DeepAccident validate the proposed framework. SafeAlign-VLA achieves 89.1 PDMS on the NAVSIM v1 test set, improving over the baseline without negative data by 1.3%. On DeepAccident, it reduces the collision rate to 3.36%, while achieving 84.2% language accuracy and 85.8% risk prediction accuracy. These results demonstrate the effectiveness of the proposed negative-enhanced safe alignment framework for safe and robust autonomous driving.
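
The "group-relative advantages" mentioned above can be sketched in Python as the usual mean/standard-deviation normalization of rewards within one sampled group. This is the generic group-relative form, not the paper's anchor-based variant; the function name, the `eps` constant, and the reward values are hypothetical.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four trajectories sampled for the same driving scene.
rewards = [1.0, 0.0, 0.5, 0.5]
adv = group_relative_advantages(rewards)
print(adv)
```

Trajectories that score above the group mean receive a positive advantage and those below receive a negative one, so a high-risk sample is penalized relative to its peers rather than against an absolute threshold.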

2605.19523 2026-05-20 cs.CL cs.AI cs.CV

Investigating Cross-Modal Skill Injection: Scenarios, Methods, and Hyperparameters

Zhiyu Xu, Lean Wang, Yuanxin Liu, Lei Li, Hao Zhou, Fandong Meng, Jie Zhou, Xu Sun

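AI summary: This paper studies how cross-modal skill injection performs across scenarios and analyzes the effect of methods and hyperparameters. It finds that injection works well for instruction following and cross-lingual tasks but struggles with mathematical reasoning, and that classic methods such as TA and DARE outperform other merging methods.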

Abstract

Vision-Language Models (VLMs) have demonstrated remarkable proficiency in general multi-modal understanding; yet they struggle to efficiently acquire continually evolving domain-specific skills. Conventional approaches to enhancing VLM capabilities, such as Supervised Fine-Tuning (SFT), require extensive dataset curation and substantial computational resources. Model merging has emerged as an efficient alternative that enables the transfer of domain-specific expertise from Large Language Models (LLMs) to VLMs without incurring additional training data requirements or significant computational overhead. Unlike conventional merging of homogeneous LLMs, which mainly aggregates existing capabilities, cross-modal skill injection aims to induce emergent cross-modal capabilities by integrating a domain-expert LLM into a VLM. However, existing research lacks a systematic analysis of the applicability and methodology of cross-modal skill injection. In this study, we investigate cross-modal skill injection across three main aspects: scenarios, methods, and hyperparameters. For scenarios, we find that cross-modal skill injection generally performs well in instruction-following and cross-lingual settings, yet struggles with mathematical reasoning. For methods, we find that classic approaches such as TA and DARE consistently achieve superior performance over alternative merging methods. We also provide a systematic and quantitative analysis of the hyperparameter tuning that these classic methods critically depend on.
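
For readers unfamiliar with the two classic methods named above, here is a minimal Python sketch of task arithmetic (TA) and of DARE's drop-and-rescale step on flat parameter lists. It assumes the expert LLM and the VLM's language backbone share the same base LLM and parameter shapes; that assumption, all function names, and all values are illustrative and not taken from the paper.

```python
import random


def task_vector(finetuned, base):
    """TA: the skill is represented as the parameter delta from the shared base."""
    return [f - b for f, b in zip(finetuned, base)]


def dare(delta, drop_rate, rng):
    """DARE: randomly drop delta entries, rescale survivors by 1 / (1 - drop_rate)."""
    scale = 1.0 / (1.0 - drop_rate)
    return [0.0 if rng.random() < drop_rate else d * scale for d in delta]


def inject(target, delta, lam):
    """Add the (optionally sparsified) delta to the target weights with scale lam."""
    return [t + lam * d for t, d in zip(target, delta)]


base_llm = [0.0, 0.0, 0.0, 0.0]      # hypothetical shared base weights
expert_llm = [0.2, -0.4, 0.6, 0.8]   # hypothetical domain-expert weights
vlm_lm = [1.0, 1.0, 1.0, 1.0]        # hypothetical VLM language-backbone weights

tau = task_vector(expert_llm, base_llm)
merged_ta = inject(vlm_lm, tau, lam=0.5)
sparse_tau = dare(tau, drop_rate=0.5, rng=random.Random(0))
merged_dare = inject(vlm_lm, sparse_tau, lam=0.5)
print(merged_ta, merged_dare)
```

The scale `lam` and the `drop_rate` are the kind of hyperparameters the abstract says these methods critically depend on.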

2605.19522 2026-05-20 cs.CV

iDiff: Interpretable Difference-aware Framework for Pairwise Image Quality Assessment

Xinli Yue, JianHui Sun, Tao Shao, Liangchao Yao, Fan Xia, Yuetang Deng

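AI summary: This paper proposes iDiff, a dual-branch framework that combines interpretable difference modeling with structured multimodal reasoning to improve the robustness and interpretability of pairwise image quality assessment, and that took first place in the NTIRE 2026 RAIM challenge.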

Comments Accepted to CVPR 2026 Workshop

Abstract

Pairwise image quality assessment (IQA) in professional photography requires a model not only to identify the preferred image between two candidates, but also to provide convincing and image-grounded reasoning. In the NTIRE 2026 RAIM challenge, this requirement is further emphasized by jointly evaluating preference prediction and rationale generation. To address this task, we propose iDiff, an Interpretable Difference-aware framework for pairwise image quality assessment. Our method adopts a dual-branch design consisting of an Answer Model and a Thinking Model. The Answer Model performs robust preference prediction by explicitly decomposing each sample into left/right global and local views, followed by content-aware specialization for person and scene images and ensemble-based aggregation across backbones. The Thinking Model focuses on rationale generation and is progressively enhanced with expert-style templates, multi-source quality features, and answer-aware supervision conditioned on the Answer Model prediction. In this way, iDiff jointly models discriminative decision making and structured explanation, improving both robustness and interpretability. Extensive experiments demonstrate the effectiveness of the proposed framework on both accuracy and reasoning-quality metrics. Our method achieved first place in the NTIRE 2026 RAIM challenge, showing the effectiveness of integrating explicit difference modeling with structured multimodal reasoning for pairwise IQA.

2605.19521 2026-05-20 cs.AI cs.GT

Efficient Elicitation of Collective Disagreements

Mohamed Ouaguenouni, Felipe Garrido-Lucero, Umberto Grandi, César Hidalgo, Magdalena Tydrichova

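AI summary: This paper studies the structure of a population's disagreement over a set of alternatives. It proposes a stratified framework for identifying the minimal aggregated preference information needed to compute existing disagreement measures, introduces the plurality matrix, and shows the theoretical and experimental value of going beyond level 3.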

Abstract

We analyze the structure of the disagreement among a population of voters over a set of alternatives. Surveys typically ask either for pairwise comparisons, which are simple and intuitive for participants, or for full rankings over alternatives, which elicit voters' entire preferences. Building on the observation that pairwise comparisons cannot distinguish structural disagreement from noise, we propose a stratified framework to identify the minimal aggregated preference information needed to compute a number of disagreement measures from the literature. Specifically, we introduce the plurality matrix, a generalization of pairwise comparisons that records, for every subset $S$ of alternatives, the probability that each $a \in S$ ranks first in $S$. We define the level of a disagreement measure as the smallest subset size needed to express it, and show that many existing notions, including rank-variance and divisiveness, sit at level $3$, proving that pairwise comparisons are not enough. In addition, we demonstrate the value of going beyond level $3$ both theoretically and experimentally. To make these results actionable, we design two elicitation protocols to estimate the plurality matrix, exploring the trade-off between the number of required participants and the cognitive load imposed on each of them.
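
The plurality matrix defined above can be illustrated with a short Python sketch that estimates it from full rankings by empirical frequency. The function name, the `max_size` cutoff, and the four-voter profile are hypothetical; the paper's own elicitation protocols are not reproduced here.

```python
from itertools import combinations


def plurality_matrix(rankings, max_size):
    """For each subset S (2 <= |S| <= max_size), estimate P(a ranks first within S)."""
    alternatives = sorted(rankings[0])
    n = len(rankings)
    table = {}
    for size in range(2, max_size + 1):
        for subset in combinations(alternatives, size):
            counts = dict.fromkeys(subset, 0)
            for ranking in rankings:
                # The top alternative of S is the first member of S met in the ranking.
                top = next(a for a in ranking if a in subset)
                counts[top] += 1
            table[subset] = {a: c / n for a, c in counts.items()}
    return table


# Hypothetical profile: each tuple is one voter's ranking, best first.
rankings = [("a", "b", "c"), ("a", "c", "b"), ("b", "c", "a"), ("c", "b", "a")]
pm = plurality_matrix(rankings, max_size=3)
print(pm[("a", "b")], pm[("a", "b", "c")])
```

In this profile every pairwise entry is an even 50/50 split, while the size-3 entry shows that `a` is ranked first by half the voters and `b` and `c` by a quarter each, information that the pairwise level alone does not expose.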

2605.19518 2026-05-20 cs.AI

BLINKG: A Benchmark for LLM-Integrated Knowledge Graph Generation

Carla Castedo, Enrique Iglesias, Manuel Lama, Alberto Bugarin-Diz, Maria-Esther Vidal, David Chaves-Fraga

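AI summary: This paper proposes BLINKG, a benchmark for evaluating the mapping capabilities of large language models when generating knowledge graphs from heterogeneous data sources. Scenarios of increasing complexity and an experimental evaluation reveal both the potential and the limits of LLMs for knowledge graph construction.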

Abstract

Generating Knowledge Graphs (KGs) remains one of the most time-consuming and labor-intensive tasks for knowledge engineers, as they need to identify semantic equivalences between input data sources and ontology terms. While declarative solutions (e.g., RML, SPARQL-Anything) have helped to generalize this process, aligning input schema elements with ontology terms still involves intricate transformations and requires considerable manual effort. With the advent of Large Language Models (LLMs), there is growing interest in leveraging their capabilities to assist KG engineers. Although some studies have explored using LLMs to automate KG construction, there is still no standardized framework for assessing how effectively they establish correspondences between data schemas and ontology concepts. Therefore, in this paper, we propose BLINKG, a benchmark designed to evaluate the mapping capabilities of LLMs in constructing KGs from heterogeneous data sources. The benchmark includes a set of scenarios with increasing complexity, based on real-world use cases. We conduct an extensive experimental evaluation of several state-of-the-art LLMs using BLINKG and observe that they already offer promising solutions. However, their performance remains limited in complex scenarios. Thanks to this benchmark, we can already assess the current capabilities of LLMs for KG construction. Additionally, we define a set of requirements for achieving (semi)automated (LLM-driven) KG construction, opening new research lines in this area.

2605.19516 2026-05-20 cs.CL cs.AI cs.LG

Base Models Look Human To AI Detectors

Yixuan Even Xu, Ziqian Zhong, Aditi Raghunathan, Fei Fang, J. Zico Kolter

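AI summary: This study finds that text generated by base models is often judged as human-written by AI detectors. It proposes HIP, which improves detector evasion through iterative paraphrasing, and shows that current detectors track instruction tuning and local context more than general properties of machine-generated text.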

Comments 39 pages, 9 figures

Abstract

As AI-generated text enters the real world at scale, institutions increasingly use commercial AI-text detectors, especially in education and academic-integrity workflows. We report a surprising empirical finding about such systems: when evaluated by GPTZero and Pangram, text generated by base models is often judged overwhelmingly human, whereas text generated by their instruction-tuned counterparts is not. Building on this observation, we propose Humanization by Iterative Paraphrasing (HIP), a detector-agnostic pipeline that minimally fine-tunes a base model into a paraphraser and applies it iteratively. Compared with the baselines we test, HIP yields a stronger trade-off between semantic preservation and detector evasion on commercial detectors. Across Llama-3 and Qwen-3 families, spanning model sizes from 0.6B to 70B, HIP consistently improves detector human-likeness. Our findings suggest that current detectors are tracking artifacts of instruction tuning and local context more than any invariant notion of machine-generated text. This, in turn, calls for detector designs that model these factors more explicitly.

2605.19511 2026-05-20 cs.CV

Are Watermarked Images Editable? SafeMark for Watermark-Preserving Text-Guided Image Editing

Xiaodong Wu, Qi Li, Xiangman Li, Zelin Zhang, Lingshuang Liu, Jianbing Ni

AI summary: This paper asks whether watermarked images can remain editable without compromising watermark integrity, and proposes SafeMark, a framework that adds a thresholded watermark-decoding loss to the diffusion editor's training objective so that semantically valid edits also preserve the embedded watermark in the final output.

Abstract

This paper investigates a fundamental yet underexplored question: can watermarked images remain editable without compromising watermark integrity? We propose SafeMark, a framework for watermark-preserving text-guided image manipulation that explicitly integrates watermark integrity into the editing process. Specifically, SafeMark adds a thresholded watermark-decoding loss directly to the diffusion editor's training objective, fine-tuning the editor so that semantically valid edits also preserve the embedded watermark at the final output. This design admits a clean information-theoretic justification: maintaining high bit-accuracy on the edited image lower-bounds the mutual information that the editor channel preserves between watermark and edited output, the quantity that fundamentally controls watermark recoverability. SafeMark is compatible with differentiable diffusion-based editors, and requires no architectural modification. Extensive evaluations across multiple datasets, text-guided editing methods, and post-edit distortion settings demonstrate that SafeMark achieves high watermark bit accuracy across diverse editing settings while maintaining high-quality semantic edits, without sacrificing robustness to common post-edit distortions. These results demonstrate that semantic editability and watermark integrity are fundamentally compatible, enabling trustworthy image provenance in generative editing pipelines.

2605.19510 2026-05-20 cs.CV

Return of Frustratingly Easy Unsupervised Video Domain Adaptation

Pengfei Wei, Yiqun Sun, Zhiqiang Xu, Yiping Ke, Lawrence B. Hsieh

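AI summary: This paper proposes MetaTrans, a simple unsupervised video domain adaptation method whose model architecture handles the spatial and temporal divergence of cross-domain videos separately, yielding substantial gains on several cross-domain action recognition tasks.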

Comments To appear in ICML 2026

Abstract

Unsupervised video domain adaptation (UVDA) is a practical but under-explored problem. In this paper, we propose a frustratingly easy UVDA method, called MetaTrans. Specifically, MetaTrans adopts a concise learning objective that contains only two fundamental loss terms. Despite the simplicity of the learning objective, MetaTrans embodies an advanced UVDA idea, that is, handling the spatial and temporal divergence of cross-domain videos separately, through a subtle model architecture design. By implementing a temporal-static subtraction module, MetaTrans effectively removes spatial and temporal divergence. Extensive empirical evaluations, particularly on various cross-domain action recognition tasks, show substantial absolute adaptation performance enhancement and significantly superior relative performance gain compared with state-of-the-art UVDA baselines.

2605.19506 2026-05-20 cs.CV

EventPrune: Cascaded Event-Assisted Token Pruning for Efficient First-Person Dynamic Spatial Reasoning

Pengtao Ma, Ziliang Zhou, Ciyu Ruan, Haoyang Wang, Kaiyuan Li, Zihang Gong, Wenhua Ding, Chen Gao, Jingao Xu, Xinlei Chen

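AI summary: This paper proposes Event Cascade Pruning (ECP), a training-free framework that uses high-frequency motion cues from event cameras as a continuous event-guided motion prior for token selection, enabling efficient token pruning for first-person dynamic spatial reasoning with faster inference and less computation.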

Abstract

First-person dynamic spatial reasoning requires models to track continuous motion and precise geometric structure, but the quadratic attention cost of Transformer-based Video-LLMs makes dense visual tokens computationally expensive. Existing token pruning paradigms predominantly rely on discrete static snapshots, failing to preserve the motion and geometric cues essential for reasoning. We propose Event Cascade Pruning (ECP), to our knowledge the first training-free framework that leverages the high-frequency motion cues from event cameras as a continuous event-guided motion prior to guide token selection. ECP combines three stages: Event-Triggered Causal Sampling to anchor motion-informative keyframes, Event-guided Motion Saliency Filtering to suppress event-inactive visual tokens, and Event-Attention Ranking Fusion to calibrate spatial attention with motion-salient dynamics. With 80% visual token reduction, ECP outperforms the full-token baseline (37.62% vs. 36.31%) while achieving 1.89x inference speedup and 52% GFLOPs reduction. We further introduce ESR-Real, the first real-world RGB-event benchmark for first-person spatial reasoning, where ECP improves accuracy by 2.68 percentage points over full-token baselines.

2605.19501 2026-05-20 cs.RO cs.AI

CANINE: Coaching Visually Impaired Users for Interactive Navigation with a Robot Guide Dog

Cunjun Yu, Zishuo Wang, Anxing Xiao, Linfeng Li, David Hsu

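AI summary: This paper presents CANINE, a system that uses personalized, adaptive verbal feedback to help visually impaired users learn interactive navigation with a robot guide dog. It decomposes the complex coordination task into sub-skills and trains at two levels, improving learning efficiency and final navigation performance.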

Comments Accepted to RSS 2026

Abstract

Robot guide dogs offer navigation assistance that greatly expands the independent mobility of the visually impaired, but their effective use requires subtle human-robot coordination that is difficult for users to learn from generic verbal instructions. To tackle this challenge, we present CANINE, an automated coaching system that trains users for interactive navigation with a robot guide dog, through personalized, adaptive verbal feedback. CANINE decomposes a complex coordination task into sub-skills and operates at two levels. At the high level, it decides what to train by tracking the learner's proficiency across sub-skills using knowledge tracing and prioritizing training on the weakest areas. At the low level, CANINE decides how to train each sub-skill by observing each human practice episode, using foundation models to infer the underlying causes of errors, and generating targeted verbal corrections adaptively. A controlled study with blindfolded participants, treated as a proxy population for quantitative evaluation, demonstrates that CANINE significantly improves both learning efficiency and final navigation performance compared to generic verbal instructions. We further validate CANINE through a retention study and an exploratory case study. The retention study shows lasting skill improvement after two weeks. The case study confirms CANINE's effectiveness in training a visually impaired user, while revealing additional design considerations for real-world deployment. Both are well aligned with the findings of the controlled study. Project page: https://cunjunyu.github.io/project/canine/

2605.19490 2026-05-20 cs.RO cs.CV

Closed-Loop Hybrid Digital Twin Platform for Connected and Automated Vehicle Validation

Kanglong Quan, Zhebing Xia, Linfeng Jiang, Hao Yu, Ziheng Qiao, Dapeng Dong, Dongyao Jia

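AI summary: This paper proposes a closed-loop hybrid digital twin platform that tightly couples a high-fidelity CARLA-SUMO co-simulation with a physical test site and vehicle to validate connected and automated vehicles efficiently.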

Abstract

Comprehensive and efficient validation of connected and automated vehicles (CAVs) is critical prior to real-world deployment. While simulation-based testing offers scalability, existing approaches often lack seamless integration with real vehicles and field data, limiting their fidelity in capturing dynamic, real-world interactions. To bridge this gap, this paper proposes a novel real-time hybrid digital twin platform. Its core innovation lies in the tight coupling of a high-fidelity CARLA-SUMO co-simulation with a physical test site and vehicle via a low-latency Vehicle-to-Everything (V2X) communication link. A custom-developed middleware serves as the critical bridge, synchronizing a real CAV's kinematic state as a shadow vehicle in the simulation and translating virtual control commands into chassis-actuating Controller Area Network (CAN) messages for closed-loop control. Detailed implementation includes using photogrammetry for full-scale asset reconstruction and a cloud-edge collaborative architecture for scalable, multi-user operation. Experimental results demonstrate stable synchronization and effective closed-loop control with low latency, confirming the platform's practicality for multi-scenario CAV verification.

2605.19485 2026-05-20 cs.AI

Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models

Zheng Lin, Zhenxing Niu, Haoxuan Ji, Yuzhe Huang, Haichang Gao

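AI summary: This paper studies jailbreak attacks on large reasoning models and finds that the attack success rate is closely correlated with the model's attention patterns. It proposes a reinforcement-learning method that incorporates attention signals into the reward function and adds diverse persuasion strategies to raise the attack success rate.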

Abstract

Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in solving complex problems by generating structured, step-by-step reasoning content. However, exposing a model's internal reasoning process introduces additional safety risks; for example, recent studies show that LRMs are more vulnerable to jailbreak attacks than standard LLMs. In this paper, we investigate jailbreak attacks on LRMs and reveal that the attack success rate (ASR) is closely correlated with LRMs' attention patterns. Specifically, successful jailbreaks tend to assign lower attention to harmful tokens in the input prompt, while allocating higher attention to those tokens in the reasoning content. Motivated by this finding, we propose a novel jailbreak method for LRMs that leverages reinforcement learning (RL) to enhance attack effectiveness, explicitly incorporating attention signals into the reward function design. In addition, we introduce diverse persuasion strategies to enrich the RL action space, which consistently improves the ASR. Extensive experiments on five open-source and closed-source LRMs across three benchmarks demonstrate that our method achieves substantially higher ASR, outperforming existing approaches in terms of effectiveness, efficiency, and transferability.

2605.19484 2026-05-20 cs.CV cs.AI cs.GR cs.HC

CutVerse: A Compositional GUI Agents Benchmark for Media Post-Production Editing

Haobo Hu, Xiangwu Guo, Zhiheng Chen, Difei Gao, Haotian Liu, Libiao Jin, Qi Mao

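AI summary: This study introduces CutVerse, a benchmark for evaluating autonomous GUI agents in realistic media post-production environments, and shows the limits of existing agents on complex, long-horizon post-production workflows.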

Abstract

While GUI agents have made significant progress in web navigation and basic operating system tasks, their capabilities in professional creative workflows remain largely underexplored. To bridge this gap, we introduce CutVerse, a benchmark designed to systematically evaluate autonomous GUI agents in realistic media post-production environments. We curate expert demonstrations across 7 professional applications (e.g., Premiere Pro, Photoshop), covering 186 complex, long-horizon tasks grounded in authentic editing workflows, involving dense multimodal interfaces and tightly coupled interaction sequences. To support scalable evaluation, we develop a lightweight parser that transforms raw screen recordings and low-level interaction logs into structured, compositional GUI action trajectories with precise grounding. Extensive evaluations reveal that existing agents achieve only 36.0% task success on realistic media editing tasks, underscoring the challenges posed by complex, long-horizon media post-production workflows in our benchmark. While current models demonstrate promising spatial grounding, multimodal alignment, and coordinated action execution, they remain limited in long-horizon reliability and domain-specific planning.

2605.19483 2026-05-20 cs.LG

A dynamical systems view of training generative models and the memorization phenomenon

Siva Athreya, Chiranjib Bhattacharya, Vivek S. Borkar

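AI summary: This paper analyzes memorization in generative model training from a dynamical systems viewpoint. By studying time-scale separation and collapse in SGD, it explains the mechanism by which a generative model yields the same or similar outputs during training.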

Comments 12 pages

详情
AI中文摘要

利用作者之一(VSB)关于生成模型崩溃和高维随机梯度下降中双时间尺度动态的研究,本文从系统理论角度解释了生成模型中的记忆现象。这纯粹依赖于训练阶段的动力学特性。具体来说,我们使用Austin [2016] 的结果,提出一个简化的SGD损失函数模型,其中损失函数对某些变量有强依赖性,对其他变量有弱依赖性。这自然导致常数步长SGD中存在两个不同的时间尺度。这一事实已被用于解释SGD中的双下降现象(Borkar [2026])。结合Borkar [2025a] 中开发的SGD崩溃现象数学模型,我们利用Azizian等人 [2024] 的最新结果,分析常数步长SGD,以解释记忆现象,即在同时进行调优的生成模型中,输出在显著时间段内保持相同或相似。这为机器学习文献中报告的上述现象及其相互关系提供了新的视角,使用动力系统观点。

英文摘要

Using recent works of one of the authors (VSB) on collapse in generative models and two time scale dynamics in stochastic gradient descent in high dimensions, we give a system theoretic explanation of the memorization phenomenon in generative models. This relies purely on the dynamic aspects of the training phase. Specifically, we use a result of Austin [2016] to motivate a stylized model for the loss function for stochastic gradient descent (SGD) wherein the loss function has a strong dependence on some variables and weak dependence on the rest in a precise sense. This naturally leads to two distinct time scales in the constant step size SGD that is commonly used in machine learning. This fact has been used to explain the double descent phenomenon in SGD in Borkar [2026]. In conjunction with a mathematical model for collapse phenomenon in SGD developed in Borkar [2025a], we analyze the constant step size SGD using the recent results of Azizian et al. [2024] in order to explain the phenomenon of memorization wherein a generative model that is concurrently being tuned yields the same or similar outputs for significant stretches of time. This gives a novel perspective on the aforementioned phenomena reported in machine learning literature and their interrelationships, using a dynamical systems viewpoint.
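The two-time-scale behaviour described in the abstract can be illustrated with a toy example. The sketch below is not the authors' model: the quadratic loss, the curvature values `strong` and `weak`, the step size, and the noise level are all assumptions chosen for illustration. It runs constant step size SGD on a loss that depends strongly on `x` and weakly on `y`, so `x` settles quickly while `y` drifts slowly.

```python
import random


def two_timescale_sgd(steps=100, step_size=0.1, strong=1.0, weak=0.01,
                      noise=0.01, seed=0):
    """Constant step size SGD on the toy loss 0.5*strong*x**2 + 0.5*weak*y**2.

    The loss and all default values are illustrative assumptions, not taken
    from the paper. Returns the final (x, y) starting from (1.0, 1.0).
    """
    rng = random.Random(seed)
    x, y = 1.0, 1.0
    for _ in range(steps):
        # Noisy gradients: exact gradient plus zero-mean Gaussian noise.
        grad_x = strong * x + rng.gauss(0.0, noise)
        grad_y = weak * y + rng.gauss(0.0, noise)
        x -= step_size * grad_x
        y -= step_size * grad_y
    return x, y


final_x, final_y = two_timescale_sgd()
print(f"fast variable x = {final_x:.4f}, slow variable y = {final_y:.4f}")
```

With these values each step shrinks `x` by a factor of 0.9 but `y` only by a factor of 0.999, which is the separation of time scales in miniature.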

2605.19470 2026-05-20 cs.CL cs.LG

Drifting Objectives for Refining Discrete Diffusion Language Models

漂移目标用于细化离散扩散语言模型

Daisuke Oba, Hiroki Furuta, Naoaki Okazaki

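AI summary: This paper studies how to apply drifting methods to discrete diffusion language models. It introduces the TokenDrift objective, which lifts categorical predictions to soft-token features and applies anti-symmetric drifting in a frozen semantic space, improving generation quality.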

Comments Project page: https://daioba.github.io/tokendrift/

详情
AI中文摘要

离散扩散语言模型（DDLMs）通过迭代去噪类别令牌序列生成文本，而近期针对连续生成器的漂移方法表明，这种采样阶段的修正有一部分可以通过反对称不动点目标被吸收到训练中。我们研究如何将这一原理迁移到DDLMs中，其中主要挑战是与离散文本的接口：硬令牌样本不可微，类别预测也不直接提供可用于漂移的连续样本。我们提出了TokenDrift，一种漂移目标，它将类别预测提升为软令牌特征，在冻结的语义空间中应用反对称漂移，并将由此产生的stop-gradient特征目标反向传播到DDLM的logits中。在使用掩码和均匀状态扩散骨干网络的受控持续训练实验中，TokenDrift相对于匹配的续训基线提升了固定NFE下的生成质量，在4 NFEs时将Gen.-PPL在MDLM上降低了89%，在DUO上降低了86%。这些结果表明，漂移可以为DDLMs提供实用的细化目标。

英文摘要

Discrete diffusion language models (DDLMs) generate text by iteratively denoising categorical token sequences, while recent drifting methods for continuous generators suggest that part of this sampling-time correction can instead be absorbed into training through an anti-symmetric fixed-point objective. We study how to transfer this principle to DDLMs, where the main challenge is the interface with discrete text: hard token samples are non-differentiable, and categorical predictions do not directly provide continuous samples to drift. We formulate TokenDrift, a drifting objective that lifts categorical predictions to soft-token features, applies anti-symmetric drifting in a frozen semantic space, and backpropagates the resulting stop-gradient feature target to DDLM logits. In controlled continual-training experiments with masked and uniform-state diffusion backbones, TokenDrift improves fixed-NFE generation quality over matched continuation baselines, reducing Gen.-PPL at 4 NFEs by 89% on MDLM and 86% on DUO. These results suggest that drifting can provide a practical refinement objective for DDLMs.

2605.19469 2026-05-20 cs.LG cs.AI cs.RO

Sampling-Based Safe Reinforcement Learning

基于采样的安全强化学习

Luca Vignola, Bruce D. Lee, Manish Prajapat, Manuel Wendl, Melanie Zeilinger, Andreas Krause, Yarden As

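AI summary: This paper proposes a sampling-based safe reinforcement learning method that enforces constraints jointly across a finite set of dynamics samples to maintain safety during learning, provides practical safety guarantees in continuous domains, and achieves efficient exploration by constraining epistemic uncertainty.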

详情
AI中文摘要

安全探索仍然是强化学习(RL)中的基本挑战,限制了RL智能体在现实世界中的部署。我们提出了一种基于采样的安全强化学习(SBSRL),这是一种基于模型的RL算法,通过在有限的动力学样本集上联合施加约束,确保学习过程中的安全性。这种形式近似了在不确定动力学下的不可行最坏情况优化,并在连续域中实现了实用的安全保证。我们进一步引入了一种基于限制认知不确定性的探索策略,消除了显式探索奖励的需要。在常规条件下,我们推导了学习过程中安全性的高概率保证以及恢复近最优策略的有限时间样本复杂度界。实验证明,SBSRL在仿真和真实机器人硬件中均实现了安全且高效的探索,并可轻松扩展到实际的深度集合实现,以解决高维连续控制问题。

英文摘要

Safe exploration remains a fundamental challenge in reinforcement learning (RL), limiting the deployment of RL agents in the real world. We propose Sampling-Based Safe Reinforcement Learning (SBSRL), a model-based RL algorithm that maintains safety throughout the learning process by enforcing constraints jointly across a finite set of dynamics samples. This formulation approximates an intractable worst-case optimization over uncertain dynamics and enables practical safety guarantees in continuous domains. We further introduce an exploration strategy based on constraining epistemic uncertainty, eliminating the need for explicit exploration bonuses. Under regularity conditions, we derive high-probability guarantees of safety throughout learning and a finite-time sample complexity bound for recovering a near-optimal policy. Empirically, SBSRL achieves safe and efficient exploration both in simulation and in real robotic hardware, and readily extends to practical deep-ensemble implementations that scale to high-dimensional continuous control problems.
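The idea of enforcing a constraint jointly across a finite set of dynamics samples can be sketched in a few lines. This is only an illustration under stated assumptions, not the SBSRL algorithm: the scalar linear dynamics, the helper `make_linear_model`, the candidate actions, and the bound `limit` are all hypothetical. An action counts as safe only if the predicted next state stays within the bound under every sampled model.

```python
def jointly_safe_actions(state, candidate_actions, dynamics_samples, limit):
    """Keep actions whose predicted next state satisfies |next_state| <= limit
    under every sampled dynamics model, not just a single nominal one."""
    return [
        action for action in candidate_actions
        if all(abs(step(state, action)) <= limit for step in dynamics_samples)
    ]


def make_linear_model(gain):
    """Hypothetical scalar dynamics: next_state = gain * state + action."""
    return lambda state, action: gain * state + action


# Three sampled models standing in for uncertainty about the true dynamics.
samples = [make_linear_model(g) for g in (0.9, 1.0, 1.1)]
candidates = [-0.5, 0.0, 0.5, 1.0]

safe_all = jointly_safe_actions(1.0, candidates, samples, limit=1.5)
safe_nominal = jointly_safe_actions(1.0, candidates, samples[:1], limit=1.5)
print(safe_all, safe_nominal)
```

Checking only the first model would admit the action 0.5, which violates the bound under the third sample; requiring all samples to agree removes it.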

2605.19462 2026-05-20 cs.LG cs.AI

Quantifying the Pre-training Dividend: Generative versus Latent Self-Supervised Learning for Time Series Foundation Models

量化预训练红利:生成与潜在自监督学习在时间序列基础模型中的应用

Noam Major, Kathy Razmadze, Yoli Shavit

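AI summary: This paper studies self-supervised learning for time series, comparing generative paradigms with latent alignment architectures. It finds that the pre-training dividend is large for anomaly detection and classification but marginal for forecasting, and that representation quality is largely independent of data origin and saturates at moderate architectural depth.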

详情
AI中文摘要

自监督学习(SSL)在视觉和自然语言处理中的成功促使其在时间序列中的快速应用。然而,研究主要集中在生成范式和预测任务上,未量化学习表示的广泛应用。我们建立了一个受控框架来评估“预训练红利”:SSL在多样时间任务中的价值。我们系统比较了生成范式与潜在对齐架构,引入了适用于时间序列的LeJEPA和DINO的变体。这些变体利用离散小波变换(DWT)增强来强制对局部波动的不变性。我们的分析揭示预训练红利高度不对称:SSL在异常检测和分类任务中可获得高达375%的收益,但在预测任务中效果有限。我们证明表示的实用性非普遍,由精度-不变性权衡决定,任务所需的特定信号分辨率必须与目标一致。最后,我们显示表示质量与数据来源无关,并在适度的架构深度下趋于稳定,表明通过大规模合成生成可实现扩展。我们的代码可在:https://github.com/noammajor/Models 获取。

英文摘要

The success of self-supervised learning (SSL) in vision and NLP has motivated its rapid adoption for time series. However, research has focused primarily on Generative paradigms and forecasting tasks, leaving the broader utility of learned representations unquantified. We establish a controlled framework to evaluate the "pre-training dividend": the value added by SSL across diverse temporal tasks. We systematically compare Generative paradigms against Latent Alignment architectures, introducing adaptations of LeJEPA and DINO for time series. These adaptations utilize Discrete Wavelet Transform (DWT) augmentations to enforce invariance to local fluctuations. Our analysis reveals that the pre-training dividend is highly asymmetric: SSL yields gains of up to 375% for anomaly detection and classification, yet remains marginal for forecasting. We demonstrate that representational utility is non-universal, governed by a precision-invariance trade-off where the specific signal resolution required by the task must align with the objective. Finally, we show that representation quality is largely independent of data origin and saturates at moderate architectural depths, suggesting a path to scaling via massive synthetic generation. Our code is available at: https://github.com/noammajor/Models

2605.19461 2026-05-20 cs.AI

Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

超越模式崩溃:用于多样化推理的分布匹配

Xiaozhe Li, Yang Li, Xinyu Fang, Shengyuan Ding, Peiji Li, Yongkang Chen, Yichuan Ma, Tianyi Lyu, Linyang Li, Dahua Lin, Qipeng Guo, Qingwen Liu, Kai Chen

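AI summary: This paper proposes DMPO, which prevents mode collapse in on-policy reinforcement learning through a principled approximation of forward KL minimization, and shows improvements on NP-hard combinatorial optimization problems along with more diverse reasoning.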

详情
AI中文摘要

像GRPO这样的同策略（on-policy）强化学习方法会遭遇模式崩溃：解的多样性降低，一旦发现某个解，就将概率质量集中在这一个解上，并停止探索其他策略。我们证明这源于反向KL最小化的模式寻求（mode-seeking）行为，这种行为会强化首次发现的高回报轨迹，而不是维持覆盖多个多样解的分布。我们提出DMPO（分布匹配策略优化），通过对前向KL最小化的原理性近似来防止模式崩溃。DMPO在采样轨迹上构建一个与其奖励成正比的组级目标分布，然后将策略分布对齐到该目标。这带来了模式覆盖行为，而无需从难以处理的全局目标分布中采样，从而在整个训练过程中保持持续探索。我们在NP难组合优化问题上验证了DMPO：这类问题存在指数级多的可行解，但只有少数接近最优，是评估探索能力的理想测试环境。DMPO在基于文本的NP-Bench上实现了43.9%的Quality Ratio（GRPO为40.1%），在基于视觉的NP-Bench上实现了43.1%（GRPO为38.4%），相对提升分别为9%和12%。这些收益可推广到数学推理（+2.0%）和域外任务（+2.3%），表明保持多样性的训练增强了跨模态的通用推理能力。我们的工作确立了分布匹配作为防止同策略强化学习中模式崩溃的实用且有原理依据的方法，一致的质量提升表明模型在多样化推理任务中保持了持续探索。

英文摘要

On-policy reinforcement learning methods like GRPO suffer from mode collapse: they exhibit reduced solution diversity, concentrating probability mass on a single solution once discovered and ceasing exploration of alternative strategies. We show this stems from reverse KL minimization's mode-seeking behavior, which reinforces the first high-reward trajectory found rather than maintaining a distribution over multiple diverse solutions. We propose DMPO (Distribution-Matching Policy Optimization), which prevents mode collapse through principled approximation of forward KL minimization. DMPO constructs a group level target distribution over sampled trajectories proportional to their rewards, then aligns the policy distribution to this target. This provides mode-covering behavior without requiring sampling from the intractable global target distribution, enabling sustained exploration throughout training. We validate DMPO on NP-hard combinatorial optimization, where exponentially many feasible solutions exist but only a few approach optimality, an ideal testbed for evaluating exploration. DMPO achieves 43.9% Quality Ratio on text-based NP-Bench (vs. GRPO's 40.1%) and 43.1% on vision-based NP-Bench (vs. 38.4%), demonstrating 9% and 12% relative improvements respectively. These gains generalize to mathematical reasoning (+2.0%) and out-of-domain tasks (+2.3%), showing that diversity-preserving training enhances general reasoning capabilities across modalities. Our work establishes distribution matching as a practical, principled approach to preventing mode collapse in on-policy RL, with consistent quality improvements demonstrating sustained exploration across diverse reasoning tasks.
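The group-level objective described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the exact normalization, so the sketch assumes nonnegative rewards, a target equal to each reward divided by the group total, and a policy distribution obtained by renormalizing sequence log-probabilities within the group. All function names are hypothetical.

```python
import math


def reward_proportional_target(rewards):
    """Target over the sampled group, proportional to (nonnegative) rewards.
    Falls back to uniform when every reward is zero (an assumption)."""
    total = sum(rewards)
    if total <= 0:
        return [1.0 / len(rewards)] * len(rewards)
    return [r / total for r in rewards]


def group_policy_distribution(sequence_logprobs):
    """Renormalize the policy's sequence log-probabilities within the group."""
    peak = max(sequence_logprobs)  # subtract the max for numerical stability
    weights = [math.exp(lp - peak) for lp in sequence_logprobs]
    norm = sum(weights)
    return [w / norm for w in weights]


def forward_kl(target, policy):
    """KL(target || policy); terms with zero target mass contribute nothing."""
    return sum(q * (math.log(q) - math.log(p))
               for q, p in zip(target, policy) if q > 0.0)


rewards = [1.0, 1.0, 2.0]
target = reward_proportional_target(rewards)

# A policy that has collapsed onto the first trajectory of the group.
collapsed = group_policy_distribution([0.0, -5.0, -5.0])
# A policy whose group distribution already matches the target.
matched = group_policy_distribution([math.log(q) for q in target])

loss_collapsed = forward_kl(target, collapsed)
loss_matched = forward_kl(target, matched)
print(loss_collapsed, loss_matched)
```

Because the forward KL weights each term by the target mass, a policy that ignores any rewarded trajectory pays a large penalty, which is the mode-covering pressure the abstract refers to.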

2605.19458 2026-05-20 cs.LG

Implicit Bias of Mirror Flow in Homogeneous Neural Networks: Sparse and Dense Feature Learning

同质神经网络中镜像流的隐式偏置：稀疏与密集特征学习

Tom Jacobs, Guido Montufar

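AI summary: This paper studies the implicit bias of mirror flow in homogeneous neural networks and how it shapes sparse and dense feature learning. By deriving a new balance equation and validating it experimentally, it shows the role mirror maps play in optimization dynamics and classifier geometry.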

Comments 36 pages, 14 figures

详情
AI中文摘要

我们研究了在具有同质激活函数的深度神经网络中,镜像流达到的最大边际解。扩展经典梯度流结果,我们从凸对偶性推导出镜像流的新平衡方程,从而能够表征诱导边际的水平函数。我们进一步建立了最大边际特征以及收敛速度和范数增长估计。最后,我们通过合成数据集和标准视觉任务的实验支持我们的理论。具体而言,我们显示:(1)不同的非同质镜像映射可以诱导相同的最大边际解;(2)收敛可以非常缓慢,包括指数级缓慢的区域;以及(3)尽管所有考虑的镜像映射都表现出特征学习,但它们可以产生从稀疏到密集神经元激活的明显不同表示。这些结果为同质神经网络中的稀疏和密集特征学习提供了统一的视角,突显了镜像映射如何影响优化动态和学习分类器的几何结构。

英文摘要

We study the max-margin solutions reached by mirror flow in deep neural networks with homogeneous activation functions. Extending classical results on gradient flow, we derive a novel balance equation for mirror flow from convex duality, enabling a characterization of the horizon function governing the induced margin. We further establish max-margin characterizations together with convergence rates and norm growth estimates. Finally, we support our theory through experiments on synthetic datasets and standard vision tasks. Concretely, we show that: (1) distinct non-homogeneous mirror maps can induce the same max-margin solution; (2) convergence can be extremely slow, including exponentially slow regimes; and (3) although all considered mirror maps exhibit feature learning, they can produce markedly different representations, ranging from sparse to dense neuron activations. Together, these results provide a unified perspective on sparse and dense feature learning in homogeneous neural networks, highlighting how mirror maps shape both optimization dynamics and the geometry of the learned classifiers.

2605.19457 2026-05-20 cs.AI

Generative Auto-Bidding with Unified Modeling and Exploration

生成式自动出价:统一建模与探索

Mingming Zhang, Feiqing Zhuang, Na Li, Shengjie Sun, Xiaowei Chen, Junxiong Zhu, Fei Xiao, Keping Yang, Lixin Zou, Chenliang Li

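AI summary: This paper proposes the GUIDE framework, which combines directed exploration with a safe fallback mechanism to address the balance between exploration and safety in generative auto-bidding, unifying efficiency and safety.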

Comments 11 pages, SIGIR 2026

详情
AI中文摘要

自动出价是现代数字广告的核心。早期基于规则的方法缺乏适应性，而后续的强化学习方法将出价建模为马尔可夫决策过程，但难以处理长期依赖。最近的生成模型显示了潜力，但缺乏明确的机制来平衡探索和安全性，仅依赖动作扰动或轨迹引导，没有安全回退。这导致探索效率低下，并使广告平台面临更高的财务风险。为了解决这一差距，我们提出了GUIDE（Generative Auto-Bidding with Unified Modeling and Exploration），一个协同整合定向探索与安全回退机制的框架。GUIDE使用Decision Transformer（DT）联合建模历史出价动作和环境状态转移。Q值模块通过正则化约束引导DT的探索，而逆向动力学模块（IDM）利用DT预测的未来状态来推断鲁棒且行为一致的动作，作为安全的策略回退。随后，Q值模块在这两个选项之间自适应地选择最终动作，平衡探索和安全性。这些组件共同构成一个集成的“探索-安全回退-选择”流水线，统一了效率和安全。我们在公开数据集和模拟拍卖环境中进行了广泛实验，并在淘宝（中国领先的广告平台）上进行了大规模在线部署。结果表明，GUIDE在所有场景中均优于最先进的基线。在实际部署中，GUIDE取得了显著收益：广告GMV +4.10%，广告点击 +1.40%，广告成本 +1.66%，广告ROI +3.52%，证明了其有效性和较强的工业适用性。

英文摘要

Automated bidding is central to modern digital advertising. Early rule-based methods lacked adaptability, while subsequent Reinforcement Learning approaches modeled bidding as a Markov Decision Process but struggled with long-term dependencies. Recent generative models show promise, yet they lack explicit mechanisms to balance exploration and safety, relying solely on action perturbations or trajectory guidance without a safety fallback. This results in inefficient exploration and elevated financial risk for advertising platforms. To address this gap, we propose GUIDE (Generative Auto-Bidding with Unified Modeling and Exploration), a framework that synergistically integrates directed exploration with a safe fallback mechanism. GUIDE employs a Decision Transformer (DT) to jointly model historical bidding actions and environmental state transitions. A Q-value module guides the DT's exploration via regularization constraints, while an Inverse Dynamics Module (IDM) leverages DT-predicted future states to infer robust, behaviorally consistent actions as a safe policy fallback. The Q-value module then adaptively selects the final action between these two options, balancing exploration and safety. Together, these components form an integrated "explore-safeguard-select" pipeline that unifies efficiency and safety. We conduct extensive experiments on public datasets, in simulated auction environments, and through large-scale online deployment on Taobao, a leading Chinese advertising platform. Results show GUIDE consistently outperforms state-of-the-art baselines across all scenarios. In real-world deployment, GUIDE achieves notable gains: +4.10% ad GMV, +1.40% ad clicks, +1.66% ad cost, and +3.52% ad ROI, demonstrating its effectiveness and strong industrial applicability.

2605.19447 2026-05-20 cs.AI

What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents

蒸馏什么、何时蒸馏：面向多轮智能体的选择性后见蒸馏

Xiaozhe Li, Tianyi Lyu, Yang Li, Yichuan Ma, Peiji Li, Linyang Li, Qipeng Guo, Dahua Lin, Kai Chen

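AI summary: This paper studies how to use hindsight distillation selectively in multi-turn agents. It proposes SERL, a selective environment-reweighted learning framework that combines task rewards with environment feedback, and reports high success rates on ALFWorld and WebShop.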

详情
AI中文摘要

强化学习可以通过稀疏任务奖励训练大语言模型代理,但长周期信用分配仍然具有挑战性:一个成功或失败的信号必须分布在许多动作上。现有方法依赖于轨迹级奖励或代理信号,没有充分利用每一步的环境反馈。多轮代理设置尚不充分探索,其中反馈可以包括错误信息、页面变化、观察或参考轨迹。我们系统研究了五个反馈源和两种插入粒度,并引入了 SERL,一种选择性环境加权学习框架。SERL 使用任务奖励确定更新方向,而环境反馈调整放置和大小,专注于关键动作。在 ALFWorld 和 WebShop 上,SERL 分别达到 90.0% 和 80.1% 的成功率,优于强大的 RL 和蒸馏基线。分析显示,有意义的点上的基于事实、与动作相关的反馈始终优于随意使用更长或更丰富的上下文。

英文摘要

Reinforcement learning can train LLM agents from sparse task rewards, but long-horizon credit assignment remains challenging: a single success-or-failure signal must be distributed across many actions. Existing methods rely on trajectory-level rewards or proxy signals, without fully leveraging per-step environmental feedback. Multi-turn agent settings are underexplored, where feedback can include error messages, page changes, observations, or reference trajectories. We systematically study five feedback sources and two insertion granularities and introduce SERL, a selective environment-reweighted learning framework. SERL uses the task reward to determine update direction, while environment feedback adjusts placement and magnitude, focusing on critical actions. On ALFWorld and WebShop, SERL achieves 90.0% and 80.1% success, outperforming strong RL and distillation baselines. Analysis shows that grounded, action-relevant feedback at meaningful points consistently outperforms indiscriminate use of longer or richer context.

2605.19446 2026-05-20 cs.CV cs.AI

Targeted Downstream-Agnostic Attack

定向下游无关攻击

Zhuxin Lei, Ziyuan Yang, Yi Zhang

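AI summary: This paper proposes a targeted downstream-agnostic attack (TDAA) under a stricter threat model that requires the attack to be both targeted and downstream-agnostic, addressing the difficulty that the downstream task is unknown and encoders do not directly produce predictions. A "threat image" serves as a feature-level anchor, building a task-agnostic bridge that exposes the vulnerability of the victim encoder.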

详情
AI中文摘要

近年来,由于其在表示提取方面的强大能力,预训练编码器得到了广泛应用。然而,它们容易受到下游无关攻击(DAAs)的攻击。现有的DAA方法基于一种宽松的威胁模型,只要生成的下游无关对抗样本(DAEs)改变原始预测,攻击就算成功,而无需特定目标。在本文中,我们提出了一种在更严格的威胁模型下进行的定向DAA(TDAA)方法,要求攻击必须同时具有针对性和下游无关性。由于下游任务未知且编码器不直接生成预测,实现针对性攻击尤其具有挑战性。为此,我们引入了一个名为“威胁图像”的新组件,由攻击者预先选择作为目标。具体来说,设计了一个生成器,生成针对每个样本的对抗扰动,迫使受害者编码器为DAEs和威胁图像输出相同的特征。与以往的DAA方法生成所有样本共享的单一扰动不同,我们的方法采用样本特定的范式。这生成了针对每个图像的定制扰动,以确保高攻击成功率和隐蔽性。通过利用威胁图像作为特征级锚点,我们的方法构建了一个任务无关的桥梁,揭示了受害者编码器的脆弱性。在10种自监督方法上对3个基准数据集的广泛实验展示了我们方法的有效性,并揭示了预训练编码器的显著脆弱性。代码将在审查期结束后公开。

英文摘要

Recently, pre-trained encoders have gained widespread use due to their strong capability in representation extraction. However, they are vulnerable to downstream-agnostic attacks (DAAs). Existing DAA methods operate under a permissive threat model, where an attack is successful if the generated downstream-agnostic adversarial examples (DAEs) change the original prediction, without requiring a specific target. In this paper, we propose a Targeted DAA (TDAA) method under a stricter threat model requiring the attack to be both targeted and downstream-agnostic. Since the downstream task is unknown and encoders do not directly produce predictions, achieving a targeted attack is particularly challenging. To address this, we introduce a novel component termed the 'threat image', pre-selected by the attacker as the target. Specifically, a generator is designed to produce example-specific adversarial perturbations that compel the victim encoder to output identical features for both the DAEs and the threat image. Unlike previous DAA methods that generate a single shared perturbation for all samples, which often fails due to image diversity, our method adopts an example-specific paradigm. This generates tailored perturbations for each image to ensure a high attack success rate and invisibility. By leveraging the threat image as a feature-level anchor, our method builds a task-agnostic bridge to reveal the vulnerabilities of the victim encoder. Extensive experiments on 10 self-supervised methods across 3 benchmark datasets demonstrate the effectiveness of our approach and reveal the pronounced vulnerability of pre-trained encoders. The code will be made publicly available after the review period.

2605.19436 2026-05-20 cs.LG cs.CL cs.CV

CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization

CEPO: 使用对比证据策略优化进行RLVR自蒸馏

Ahmed Heakl, Abdelrahman M. Shaker, Youssef Mohamed, Rania Elbadry, Omar Fetouh, Fahad Shahbaz Khan, Salman Khan

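AI summary: This paper proposes CEPO, which uses contrastive evidence policy optimization to address the problems of self-distillation in RLVR, improving model performance by separating decisive reasoning steps from filler.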

Comments 9 pages

详情
AI中文摘要

当模型在可验证奖励强化学习（RLVR）下产生正确解时，每个token都会收到相同的奖励信号，无论它是决定性的推理步骤还是语法填充。一种自然的解决方法是让模型以正确答案为条件充当教师，找出模型在已知答案时本会生成得不同的token。先前的工作表明，这种方法要么因答案泄露到梯度中而破坏训练，要么只产生微弱的信号，无法区分决定性步骤和填充内容，因为两者相对于模型基线看起来同样出人意料。我们提出对比证据策略优化（CEPO），它在每个token上提出一个更尖锐的问题：不仅是“正确答案是否偏好此token？”，而且是“正确答案偏好它的同时，错误答案是否不偏好它？”。同时满足两者的token是真正的推理步骤；两者都不满足的则是填充内容。错误答案教师由训练批次中已有的被拒绝rollout构造，不增加额外的采样成本。我们证明CEPO继承了先前最优方法的所有结构性安全保证，同时在决定性token上严格锐化信用分配，且这一改进恰好在填充位置消失。实验表明，CEPO在五个多模态数学推理基准上的平均准确率在2B和4B规模下分别达到43.43%和60.56%，而GRPO在相同训练预算下为41.17%和57.43%。分布匹配自蒸馏方法（OPSD、SDPO）的表现低于未训练的基线，从实验上证实了我们的理论所预测的信息泄漏。我们的代码可在 https://github.com/ahmedheakl/CEPO 获取。

英文摘要

When a model produces a correct solution under reinforcement learning with verifiable rewards (RLVR), every token receives the same reward signal regardless of whether it was a decisive reasoning step or a grammatical filler. A natural fix is to condition the model on the correct answer as a teacher, identifying tokens it would have generated differently had it known the answer. Prior work shows this either corrupts training by leaking the answer into the gradient, or produces a weak signal that cannot distinguish decisive steps from filler, since both look equally surprising relative to the model's baseline. We propose Contrastive Evidence Policy Optimization (CEPO), which asks a sharper question at every token: not just "does the correct answer favor this token?" but "does the correct answer favor it while the wrong answer disfavors it?" A token satisfying both is a genuine reasoning step; one satisfying neither is filler. The wrong-answer teacher is constructed from rejected rollouts already in the training batch, incurring no additional sampling cost. We prove CEPO inherits all structural safety guarantees of the prior state of the art while strictly sharpening credit at decisive tokens, with the improvement vanishing exactly at filler positions. Empirically, CEPO achieves 43.43% and 60.56% average accuracy across five multimodal mathematical reasoning benchmarks at 2B and 4B scale, respectively, versus 41.17% and 57.43% for GRPO under identical training budgets. Distribution-matching self-distillation methods (OPSD, SDPO) fall below the untrained baseline, empirically confirming the information leakage our theory predicts. Our code is available at https://github.com/ahmedheakl/CEPO.

2605.19435 2026-05-20 cs.CV cs.AI

KappaPlace: Learning Hyperspherical Uncertainty for Visual Place Recognition via Prototype-Anchored Supervision

KappaPlace: 通过原型锚定监督学习超球面不确定性用于视觉位置识别

Maya Yanko, Yoli Shavit

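AI summary: This paper proposes KappaPlace, a framework for learning uncertainty-aware representations for visual place recognition. Its prototype-anchored supervision strategy uses latent class representatives as probabilistic targets to address poorly calibrated uncertainty estimates, improving the reliability of navigation systems.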

详情
AI中文摘要

视觉位置识别(VPR)对于自主导航至关重要,但最先进的方法缺乏良好的校准不确定性估计。标准流程无法可靠地指示查询是否模糊或匹配可能不正确,这在安全关键的机器人学中带来风险。我们提出KappaPlace,一种学习不确定性感知VPR表示的原理性框架。我们的核心贡献是一种原型锚定监督策略,利用潜在类别代表作为概率目标。通过将图像描述符建模为von Mises-Fisher(vMF)变量,我们学习了一个轻量级模块来预测浓度参数作为对aleatoric不确定性的直接代理。虽然现有的VPR不确定性方法通常局限于查询中心的视角,我们推导出一种新的匹配层面的公式来量化特定查询-参考对的可靠性。在五个多样化的基准测试中,KappaPlace将预期校准误差(ECE@K)比现有方法减少了高达50%,同时保持或提高了检索召回率。我们提供了联合训练变体和冻结骨干的后训练扩展。我们的结果表明,KappaPlace提供了稳健、稳定且校准良好的信号,能够在VPR流程中实现可靠的决策。我们的代码可在:https://github.com/mayayank95/UncertaintyAwareVPR

英文摘要

Visual Place Recognition (VPR) is critical for autonomous navigation, yet state-of-the-art methods lack well-calibrated uncertainty estimation. Standard pipelines cannot reliably signal when a query is ambiguous or a match is likely incorrect, posing risks in safety-critical robotics. We propose KappaPlace, a principled framework for learning uncertainty-aware VPR representations. Our core contribution is a Prototype-Anchored supervision strategy that leverages latent class representatives as targets for a probabilistic objective. By modeling image descriptors as von Mises-Fisher (vMF) variables, we learn a lightweight module to predict the concentration parameter as a direct proxy for aleatoric uncertainty. While existing VPR uncertainty methods are typically restricted to a query-centric view, we derive a novel match-level formulation to quantify the reliability of specific query-reference pairs. Across five diverse benchmarks, KappaPlace reduces Expected Calibration Error (ECE@K) by up to 50% compared to existing methods while maintaining or improving retrieval recall. We provide both a joint-training variant and a post-training extension for frozen backbones. Our results demonstrate that KappaPlace provides a robust, stable, and well-calibrated signal that enables reliable decision-making within the VPR pipeline. Our code is available at: https://github.com/mayayank95/UncertaintyAwareVPR

2605.19433 2026-05-20 cs.CL cs.AI

Backtracking When It Strays: Mitigating Dual Exposure Biases in LLM Reasoning Distillation

在偏离时回溯:缓解大语言模型推理蒸馏中的双重暴露偏差

Bing Wang, Shaotian Yan, Chen Shen, kaiyuan liu, Sinan Fan, Ximing Li, Rui Miao, Xiaosong Yuan, Zhanming Shen, Jieping Ye

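AI summary: This paper proposes MOTAB, a new LLM reasoning distillation method that dynamically monitors the student model's generation and backtracks when it strays beyond a safety boundary. This mitigates the dual exposure biases that arise in existing distillation methods from the mismatch between training distributions and inference contexts, improving reasoning performance.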

Comments 26 pages, 8 figures

详情
AI中文摘要

大型语言模型(LLMs)通过长链思考(CoT)在复杂推理任务中取得了显著成功,但其巨大的计算开销阻碍了实际应用。LLM推理蒸馏通过将推理能力从强大的教师模型转移到紧凑的学生模型来解决这一问题。然而,现有蒸馏方法面临根本性的困境。典型的离线蒸馏严格利用教师生成的黄金轨迹,由于训练分布与学生生成的推理上下文不匹配,导致长链CoT推理中出现错误级联。为了解决这一问题,在线蒸馏允许学生探索自己的轨迹,但我们证明这会引入相互的反向暴露偏差:当学生生成次优上下文时,教师模型也难以提供积极指导。为了解决这一双重暴露偏差问题,我们提出监控轨迹并在偏离时回溯(MOTAB)新的LLM推理蒸馏流程。具体而言,MOTAB动态监控学生在线生成过程,对照自适应的安全边界。当生成偏离并超过此阈值时,MOTAB回溯到上一个安全状态,并利用教师干预来纠正方向。这种方法本质上可以容忍少量学生错误以缓解暴露偏差,同时防止次优上下文以避免反向暴露偏差。在LIMO-v2和AceReason数据集上的广泛实验表明,MOTAB有效缓解了双重暴露偏差,使推理任务的平均性能提高了约3%。

英文摘要

Large language models (LLMs) have achieved remarkable success in complex reasoning tasks via long chain-of-thought (CoT), yet their immense computational overhead hinders real-world deployment. LLM reasoning distillation addresses this by transferring reasoning capabilities from formidable teacher models to compact student models. However, existing distillation paradigms face a fundamental dilemma. Typical off-policy distillation strictly utilizes teacher-generated golden trajectories, suffering from an exposure bias due to the mismatch between training distributions and student-generated inference contexts, which leads to error cascades in long CoT reasoning. To address this, on-policy distillation allows students to explore their own trajectories, but we demonstrate that it inherently introduces a reciprocal reversed exposure bias: the teacher model also struggles to provide positive guidance when conditioned on student-generated sub-optimal contexts. To resolve this dual exposure biases problem, we propose Monitoring Trajectories and Backtracking when it strays (MOTAB), a new LLM reasoning distillation pipeline. Specifically, MOTAB dynamically monitors the student's on-policy generation against an adaptive safety boundary. When the generation strays and exceeds this threshold, MOTAB backtracks to the last safe state and leverages teacher intervention to correct the course. This approach inherently tolerates minor student errors to mitigate exposure bias, while preventing sub-optimal contexts to circumvent reversed exposure bias. Extensive experiments on the LIMO-v2 and AceReason datasets demonstrate that MOTAB effectively alleviates the dual exposure biases, yielding a roughly 3% average performance improvement in reasoning tasks.

2605.19431 2026-05-20 cs.RO

Self-assembling Modular Aerial Robot for Versatile Aerial Tasks

自组装模块化空中机器人用于多功能空中任务

Junichiro Sugihara, Masaki Kitagawa, Jinjie Li, Yunong Li, Takuzumi Nishio, Kei Okada, Moju Zhao

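AI summary: This paper presents LEGION, a self-assembling modular aerial robot that assembles in flight for cooperative manipulation. Combining nimble maneuverability with reconfigurability, it lets aerial robots shift from passive observers to active participants and broadens the scope of aerial physical interaction.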

详情
AI中文摘要

多旋翼空中机器人在三维空间中具有出色的机动性,最近的进展使它们能够在复杂和狭窄的环境中进行灵活导航,尤其是对于小型机架。相比之下,用于高空工作的平台通常更大,以提供高推力以实现与环境的稳定物理交互。然而,这些矛盾的设计要求导致了灵活导航和稳健空中操作之间的长期权衡。本文提出了LEGION单元,这是一种可重新配置的模块化空中机器人,能够飞行中自组装以实现协同操作,灵感来自蚂蚁形成的自组织群体。每个单元保留了灵活的机动性,而两端的关节配备的对接接口使单元能够端到端自组装成飞行操作器。我们证明了多个单元可以自主飞行中对接;一旦锁定,它们通过控制接触力和扭矩保持零间隙锁定,即使在户外也能实现可靠的聚集和关节运动。我们进一步证明,自重构能力使单元能够在灵活的个体飞行和集体关节操作之间进行形态切换,同时实现核心飞行中操作原始操作,包括推、拉、旋转、抓取和携带。LEGION的自组织能力使空中机器人,特别是群组中的机器人,能够从被动观察者转变为环境中的主动参与者,拓展了空中物理交互的范围。

英文摘要

Multirotor aerial robots excel at maneuvering in three-dimensional space, and recent advances enable nimble navigation in cluttered and confined environments, especially for small airframes. By contrast, platforms built for high-altitude work tend to be larger to deliver high thrust for stable physical interaction with the environment. However, these conflicting design requirements create a long-standing trade-off between nimble navigation and robust aerial manipulation. Here, we present LEGION units, which are reconfigurable modular aerial robots capable of in-flight self-assembly for cooperative manipulation, drawing inspiration from the self-organized collectives formed by ants. Each unit retains nimble maneuverability while joint-equipped docking interfaces at both ends enable end-to-end self-assembly into a flying manipulator. We show that multiple units autonomously dock in flight; once latched, they maintain a zero-clearance interlock by controlling the contact force and torque, enabling reliable aggregation and articulated motion even outdoors. We further show that self-reconfigurability enables morphological switching between nimble individual flight and collective articulated manipulation, while realizing core in-flight manipulation primitives including pushing, pulling, rotating, grasping, and carrying. LEGION's self-organization enables aerial robots, especially in swarms, to shift from passive observers to active participants in their environment, broadening the scope of aerial physical interaction.

2605.19425 2026-05-20 cs.LG cs.AI

When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR

何时停止重用:动态梯度门控用于样本高效的RLVR

Yuchun Miao, Sen Zhang, Yuqi Zhang, Yaorui Shi, Qi Gu, Xunliang Cai, Lefei Zhang

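AI summary: This paper proposes Dynamic Gradient Gating (DGG), which monitors the lm_head gradient norm in real time to detect and intercept harmful gradients, improving sample efficiency and training speed.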

Comments 23 pages, 10 figures

详情
AI中文摘要

可验证奖励的强化学习(RLVR)已成为大型语言模型(LLMs)高级推理的主要范式,但获取rollout样本成本高昂,使得样本效率成为关键瓶颈。一种自然的解决方法是将每个rollout批次用于多个梯度更新,这是经典强化学习中的标准做法。然而在RLVR中,这会放大策略偏移,导致严重性能下降。检测降级的早期迹象并停止重用仍是一个开放且具有挑战性的问题。我们通过识别不均衡权重分歧(DWD)现象来填补这一空白:性能下降与lm_head权重变化的急剧上升同步,而中间层保持稳定。经验上,我们验证DWD在各种LLM和任务中一致出现。理论上,我们证明(i)有害梯度集中在lm_head,而中间层在结构上被衰减,(ii)lm_head梯度范数下界了策略偏移。这些结果确立了lm_head梯度范数作为灾难性策略偏移的原理性、实时信号。基于这一见解,我们提出动态梯度门控(DGG),一种轻量级干预,通过实时监控lm_head梯度范数并在有害梯度污染优化器前拦截它们。DGG在数学、ALFWorld、WebShop和搜索增强型问答任务中一致匹配或超过标准单次使用基线,实现高达2.93倍的样本效率和2.14倍的墙钟加速。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) has become the dominant paradigm for advanced reasoning in Large Language Models (LLMs), but rollout samples are expensive to obtain, making sample efficiency a critical bottleneck. A natural remedy is to reuse each rollout batch for multiple gradient updates, a standard practice in classical RL. Yet in RLVR, this amplifies policy shift, leading to severe performance degradation. Detecting the onset of degradation early enough to stop reuse remains an open and challenging problem. We close this gap by identifying the Disproportionate Weight Divergence (DWD) phenomenon: performance degradation is synchronized with a sharp surge in the lm_head weight change, while intermediate layers remain stable. Empirically, we verify that DWD emerges consistently across diverse LLMs and tasks. Theoretically, we prove that (i) harmful gradients concentrate at the lm_head while intermediate layers are structurally attenuated, and (ii) the lm_head gradient norm lower-bounds the policy divergence. These results establish the lm_head gradient norm as a principled, real-time signal of catastrophic policy shift. Guided by this insight, we propose Dynamic Gradient Gating (DGG), a lightweight intervention that monitors the lm_head gradient norm in real time and intercepts harmful gradients before they corrupt the optimizer. DGG consistently matches or exceeds the standard single-use baseline, achieving up to $2.93\times$ sample efficiency and $2.14\times$ wall-clock speedup across math, ALFWorld, WebShop, and search-augmented QA tasks.
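A gate of the kind described above might look like the following sketch. The abstract says only that DGG monitors the lm_head gradient norm and intercepts harmful gradients; the specific rule here (skip the update when the norm exceeds `ratio` times a running average) is an assumption made for illustration, as are the class name `GradientGate`, its parameters, and the use of a plain list of floats in place of a real gradient tensor.

```python
import math


class GradientGate:
    """Hypothetical gate: block an update when the monitored gradient norm
    spikes above `ratio` times its exponential moving average."""

    def __init__(self, ratio=3.0, momentum=0.9):
        self.ratio = ratio
        self.momentum = momentum
        self.average_norm = None  # set from the first observed norm

    def allow(self, lm_head_grad):
        """Return True if the optimizer step should proceed."""
        norm = math.sqrt(sum(g * g for g in lm_head_grad))
        if self.average_norm is None:
            self.average_norm = norm
            return True
        if norm > self.ratio * self.average_norm:
            # Gate closed: the spike is not folded into the running average,
            # so one bad batch does not raise the baseline.
            return False
        self.average_norm = (self.momentum * self.average_norm
                             + (1.0 - self.momentum) * norm)
        return True


gate = GradientGate(ratio=3.0)
grads = [[0.3, 0.4], [0.3, 0.4], [3.0, 4.0], [0.3, 0.4]]
decisions = [gate.allow(g) for g in grads]
print(decisions)
```

In a training loop the caller would skip the optimizer step (and stop reusing the current rollout batch) whenever `allow` returns False.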

2605.19420 2026-05-20 cs.RO

Beyond Waypoints: Dual-Heatmap Grounding for Cross-Embodiment Semantic Navigation

超越航点:双热图接地用于跨具身语义导航

Kaijie Yun, Yue Chen

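AI summary: This paper proposes a unified vision-language framework that replaces single-point regression with a dual-heatmap representation to bridge the gap between semantic instructions and physical reachability, improving the robustness and performance of cross-embodiment semantic navigation.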

详情
AI中文摘要

将开放式的语义指令接地为可执行的局部目标是人机交互中的基本挑战。尽管现有导航框架通常回归确定性的航点,但这种刚性方法会压缩空间不确定性,并且经常针对不可通行的物体中心,导致严重的执行失败。在本文中,我们专注于在视场内(in-FOV)的语义导航实际场景,其中机器人接收到简短的、交织的多模态(文本和图像)提示。为了弥合抽象语义意图与物理可达性之间的差距,我们提出了一种统一的视觉-语言框架,该框架放弃单点回归,转而采用双热图表示。我们的框架预测一个导航可及性热图,以捕捉连续的可到达区域,并结合一个面向热图用于方向约束。这些密集输出本质上充当可微的语义势场,能够无缝整合到下游的局部规划器中。为了支持这一范式,我们构建了一个完全自动化的、基于基础模型的合成数据管道,并建立了全面的模拟基准。广泛的实验表明,我们的框架在可比的8B基线中实现了最先进的性能。关键的是,通过特征融合研究和在不同机器人具身(Jetbot、H1、Aliengo)上的模拟研究,揭示出显式热图预测显著提高了可及率(AR)。通过将目标可靠地放置在可执行的自由空间中,我们的框架有效缓解了点回归的脆弱性,提供了一种可转移的路径,朝着安全的跨具身语义导航迈进。

英文摘要

Grounding open-ended semantic instructions into physically executable local goals is a fundamental challenge in human-robot interaction. While existing navigation frameworks often regress deterministic waypoints, this rigid formulation collapses spatial uncertainty and frequently targets non-traversable object centers, leading to severe execution failures. In this work, we focus on the practical setting of in-FOV semantic navigation, where a robot receives concise, interleaved multimodal (text and image) prompts. To bridge the gap between abstract semantic intent and physical reachability, we propose a unified Vision-Language framework that abandons single-point regression in favor of a Dual-Heatmap representation. Our framework predicts a navigation affordance heatmap that captures continuous reachable regions, coupled with a facing heatmap for orientation constraints. These dense outputs inherently function as a differentiable semantic potential field, integrating seamlessly with downstream local planners. To support this paradigm, we build a fully automated, foundation-model-assisted synthetic data pipeline and establish a comprehensive simulation benchmark. Extensive experiments demonstrate that our framework achieves state-of-the-art performance among comparable 8B baselines. Crucially, a feature-fusion study and simulation studies across diverse robot embodiments (Jetbot, H1, Aliengo) reveal that explicit heatmap prediction drastically improves the Affordance Rate (AR). By placing targets reliably in executable free space, our framework effectively mitigates the brittleness of point regression, offering a transferable path toward safe cross-embodiment semantic navigation.