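arXivDaily daily academic digest: synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems.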

2512.10659 2026-05-29 cs.LG

DCFO: Density-Based Counterfactuals for Outliers -- Additional Material

Tommaso Amico, Pernille Matthews, Lena Krieger, Arthur Zimek, Ira Assent

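AI summary: To address the lack of interpretability of the Local Outlier Factor (LOF), the paper proposes Density-based Counterfactuals for Outliers (DCFO), which partitions the data space into regions where LOF is smooth to enable efficient gradient-based optimization, and outperforms existing methods on 50 OpenML datasets.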

Abstract

Outlier detection identifies data points that significantly deviate from the majority of the data distribution. Explaining outliers is crucial for understanding the underlying factors that contribute to their detection, validating their significance, and identifying potential biases or errors. Effective explanations provide actionable insights, facilitating preventive measures to avoid similar outliers in the future. Counterfactual explanations clarify why specific data points are classified as outliers by identifying minimal changes required to alter their prediction. Although valuable, most existing counterfactual explanation methods overlook the unique challenges posed by outlier detection, and fail to target classical, widely adopted outlier detection algorithms. Local Outlier Factor (LOF) is one of the most popular unsupervised outlier detection methods, quantifying outlierness through relative local density. Despite LOF's widespread use across diverse applications, it lacks interpretability. To address this limitation, we introduce Density-based Counterfactuals for Outliers (DCFO), a novel method specifically designed to generate counterfactual explanations for LOF. DCFO partitions the data space into regions where LOF behaves smoothly, enabling efficient gradient-based optimisation. Extensive experimental validation on 50 OpenML datasets demonstrates that DCFO consistently outperforms benchmarked competitors, offering superior proximity and validity of generated counterfactuals.
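
For readers unfamiliar with LOF, the relative-local-density score mentioned in the abstract can be sketched in a few lines. This is a brute-force illustration of standard LOF only, not the DCFO method; it ignores distance ties, and the toy data and the choice k=3 are assumptions made for the example.

```python
import math

def lof_scores(points, k):
    """Return the Local Outlier Factor of every point (brute force, O(n^2))."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # k nearest neighbours of each point (excluding itself) and its k-distance.
    # Simplification: ties at the k-distance are not expanded.
    neighbours, k_dist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(order[:k])
        k_dist.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance.
    def lrd(i):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbours[i]]
        return len(reach) / sum(reach)

    density = [lrd(i) for i in range(n)]

    # LOF: average ratio of the neighbours' density to the point's own density.
    return [
        sum(density[j] for j in neighbours[i]) / (k * density[i])
        for i in range(n)
    ]

# Toy data (hypothetical): a tight 3x3 grid plus one far-away point.
data = [(x, y) for x in (0.0, 1.0, 2.0) for y in (0.0, 1.0, 2.0)] + [(8.0, 8.0)]
scores = lof_scores(data, k=3)
```

Scores near 1 indicate a point about as dense as its neighbours, while the isolated point receives a much larger score. Because the neighbour sets change abruptly as a point moves, the score is only piecewise smooth, which is the difficulty the abstract says DCFO addresses by partitioning the space.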

2512.04733 2026-05-29 cs.CV cs.AI

E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving

Yihong Tang, Haicheng Liao, Tong Nie, Junlin He, Ao Qu, Kehua Chen, Wei Ma, Zhenning Li, Lijun Sun, Chengzhong Xu

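AI summary: Proposes the E3AD framework, which uses a continuous VAD emotion model and a dual-pathway spatial reasoning module to bring emotion understanding into vision-language-action models, enabling emotion-aware trajectory planning for open-domain end-to-end autonomous driving and reaching SOTA performance on real-world datasets.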

Abstract

End-to-end autonomous driving (AD) systems increasingly adopt vision-language-action (VLA) models, yet they typically ignore the passenger's emotional state, which is central to comfort and AD acceptance. We introduce Open-Domain End-to-End (OD-E2E) autonomous driving, where an autonomous vehicle (AV) must interpret free-form natural-language commands, infer the emotion, and plan a physically feasible trajectory. We propose E3AD, an emotion-aware VLA framework that augments semantic understanding with two cognitively inspired components: a continuous Valence-Arousal-Dominance (VAD) emotion model that captures tone and urgency from language, and a dual-pathway spatial reasoning module that fuses egocentric and allocentric views for human-like spatial cognition. A consistency-oriented training scheme, combining modality pretraining with preference-based alignment, further enforces coherence between emotional intent and driving actions. Across real-world datasets, E3AD improves visual grounding and waypoint planning and achieves state-of-the-art (SOTA) VAD correlation for emotion estimation. These evaluation results show that injecting emotion into VLA-style driving yields more human-aligned grounding, planning, and feedback.

2512.03109 2026-05-29 cs.LG cs.AI stat.AP stat.ML

E-valuator: Reliable Agent Verifiers with Sequential Hypothesis Testing

Shuvom Sadhuka, Drew Prinster, Clara Fannjiang, Gabriele Scalia, Bonnie Berger, Aviv Regev, Hanchen Wang

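AI summary: Proposes e-valuator, a method that converts any black-box verifier score into a decision rule with controlled false alarm rates, using sequential hypothesis testing to monitor agent trajectories online, improving statistical power and saving tokens.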

Abstract

Agentic AI systems execute a sequence of actions, such as reasoning steps or tool calls, in response to a user prompt. To evaluate the success of their trajectories, researchers have developed verifiers, such as LLM judges and process-reward models, to score the quality of each action in an agent's trajectory. Although these heuristic scores can be informative, there are no guarantees of correctness when used to decide whether an agent will yield a successful output. Here, we introduce e-valuator, a method to convert any black-box verifier score into a decision rule with provable control of false alarm rates. We frame the problem of distinguishing successful trajectories (that is, a sequence of actions that will lead to a correct response to the user's prompt) and unsuccessful trajectories as a sequential hypothesis testing problem. E-valuator builds on tools from e-processes to develop a sequential hypothesis test that remains statistically valid at every step of an agent's trajectory, enabling online monitoring of agents over arbitrarily long sequences of actions. Empirically, we demonstrate that e-valuator provides greater statistical power and better false alarm rate control than other strategies across six datasets and three agents. We additionally show that e-valuator can be used to quickly terminate problematic trajectories and save tokens. Together, e-valuator provides a lightweight, model-agnostic framework that converts verifier heuristics into decision rules with statistical guarantees, enabling the deployment of more reliable agentic systems.
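
The idea of a test that stays valid at every step can be illustrated with a generic likelihood-ratio e-process. The sketch below is not the paper's construction: the score densities in `toy_ratio` are invented for the example, and a real system would have to estimate them. What it does rely on is a standard fact: under the null ("the trajectory will succeed") the running product of likelihood ratios is a nonnegative martingale starting at 1, so by Ville's inequality it reaches 1/alpha with probability at most alpha, no matter when monitoring stops.

```python
def monitor(scores, likelihood_ratio, alpha=0.05):
    """Flag a trajectory as soon as the running e-process reaches 1/alpha.

    Returns the 1-based step of the alarm, or None if no alarm is raised.
    """
    e_process = 1.0
    for step, score in enumerate(scores, start=1):
        # Multiply in the evidence against "this trajectory will succeed".
        e_process *= likelihood_ratio(score)
        if e_process >= 1.0 / alpha:
            return step
    return None

# Hypothetical score model, for illustration only: verifier scores in [0, 1]
# have density 2s on successful trajectories and 2(1 - s) on failing ones.
def toy_ratio(score):
    return (2.0 * (1.0 - score)) / (2.0 * score)

good_run = [0.9, 0.8, 0.85, 0.9]
bad_run = [0.2, 0.1, 0.3, 0.2, 0.1]

alarm_bad = monitor(bad_run, toy_ratio)
alarm_good = monitor(good_run, toy_ratio)
```

With these toy numbers the low-scoring run is flagged after its second step, which is the early-termination (token-saving) behaviour the abstract describes, while the high-scoring run is never flagged.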

2511.19316 2026-05-29 cs.CV cs.AI

Evaluating Dataset Watermarking for Fine-tuning Traceability of Customized Diffusion Models: A Comprehensive Benchmark and Removal Approach

Xincheng Wang, Hanchi Sun, Wenjun Sun, Kejun Xue, Wangqiu Zhou, Jianbo Zhang, Wei Sun, Dandan Zhu, Xiongkuo Min, Jun Jia, Zhijun Fang

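AI summary: To address copyright and security risks in diffusion model fine-tuning, the paper establishes a unified threat model and proposes an evaluation framework covering universality, transmissibility, and robustness, exposes the fragility of existing dataset watermarking methods, and further proposes a practical watermark removal method.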

Abstract

Recent fine-tuning techniques for diffusion models enable them to reproduce specific image sets, such as particular faces or artistic styles, but also introduce copyright and security risks. Dataset watermarking has been proposed to ensure traceability by embedding imperceptible watermarks into training images, which remain detectable in outputs even after fine-tuning. However, current methods lack a unified evaluation framework. To address this, this paper establishes a general threat model and introduces a comprehensive evaluation framework encompassing Universality, Transmissibility, and Robustness. Experiments show that existing methods perform well in universality and transmissibility, and exhibit some robustness against common image processing operations, yet still fall short under real-world threat scenarios. To reveal these vulnerabilities, the paper further proposes a practical watermark removal method that fully eliminates dataset watermarks without affecting fine-tuning, highlighting a key challenge for future research.

2511.17798 2026-05-29 cs.RO

SM2ITH: Safe Mobile Manipulation with Interactive Human Prediction via Task-Hierarchical Bilevel Model Predictive Control

Francesco D'Orazio, Sepehr Samavi, Xintong Du, Siqi Zhou, Giuseppe Oriolo, Angela P. Schoellig

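AI summary: Proposes the SM$^2$ITH framework, which combines hierarchical task model predictive control with interactive human prediction via bilevel optimization to achieve safe and efficient mobile manipulation in dynamic human-robot environments.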

Comments Accepted to the IEEE International Conference on Robotics and Automation (ICRA) 2026

Abstract

Mobile manipulators are designed to perform complex sequences of navigation and manipulation tasks in human-centered environments. While recent optimization-based methods such as Hierarchical Task Model Predictive Control (HTMPC) enable efficient multitask execution with strict task priorities, they have so far been applied mainly to static or structured scenarios. Extending these approaches to dynamic human-centered environments requires predictive models that capture how humans react to the actions of the robot. This work introduces Safe Mobile Manipulation with Interactive Human Prediction via Task-Hierarchical Bilevel Model Predictive Control (SM$^2$ITH), a unified framework that combines HTMPC with interactive human motion prediction through bilevel optimization that jointly accounts for robot and human dynamics. The framework is validated on two different mobile manipulators, the Stretch 3 and the Ridgeback-UR10, across three experimental settings: (i) delivery tasks with different navigation and manipulation priorities, (ii) sequential pick-and-place tasks with different human motion prediction models, and (iii) interactions involving adversarial human behavior. Our results highlight how interactive prediction enables safe and efficient coordination, outperforming baselines that rely on weighted objectives or open-loop human models.

2511.11118 2026-05-29 cs.LG

Improving Continual Learning of Knowledge Graph Embeddings via Informed Initialization

Gerard Pons, Besim Bilalli, Anna Queralt

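AI summary: Proposes an informed initialization strategy based on the knowledge graph schema and existing embeddings that improves the acquisition of new knowledge in continual learning, reduces catastrophic forgetting, and speeds up training.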

DOI: 10.1016/j.neucom.2026.134045

Abstract

Many Knowledge Graphs (KGs) are frequently updated, forcing their Knowledge Graph Embeddings (KGEs) to adapt to these changes. To address this problem, continual learning techniques for KGEs incorporate embeddings for new entities while updating the old ones. One necessary step in these methods is the initialization of the embeddings, as an input to the KGE learning process, which can have an important impact on the accuracy of the final embeddings, as well as on the time required to train them. This is especially relevant for relatively small and frequent updates. We propose a novel informed embedding initialization strategy, which can be seamlessly integrated into existing continual learning methods for KGE, that enhances the acquisition of new knowledge while reducing catastrophic forgetting. Specifically, the KG schema and the previously learned embeddings are utilized to obtain initial representations for the new entities, based on the classes the entities belong to. Our extensive experimental analysis shows that the proposed initialization strategy improves the predictive performance of the resulting KGEs, while also enhancing knowledge retention. Furthermore, our approach accelerates knowledge acquisition, reducing the number of epochs, and therefore time, required to incrementally learn new embeddings. Finally, its benefits across various types of KGE learning models are demonstrated.
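
One simple way to read "initial representations based on the classes the entities belong to" is a class centroid: start a new entity at the mean of the already-learned embeddings of entities sharing its classes. The sketch below assumes exactly that; the abstract does not specify the aggregation, and the function name, fallback rule, and toy graph are all hypothetical.

```python
import random

def informed_init(new_entity_classes, class_members, embeddings, dim, rng):
    """Initialise a new entity at the centroid of known entities of its classes.

    Falls back to a small random vector when no class member has an embedding.
    """
    peers = [
        embeddings[e]
        for cls in new_entity_classes
        for e in class_members.get(cls, [])
        if e in embeddings
    ]
    if not peers:
        return [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    return [sum(vec[d] for vec in peers) / len(peers) for d in range(dim)]

# Hypothetical toy graph: two known cities, one known person.
embeddings = {"Paris": [1.0, 0.0], "Rome": [0.0, 1.0], "Ada": [-1.0, -1.0]}
class_members = {"City": ["Paris", "Rome"], "Person": ["Ada"]}

rng = random.Random(0)
madrid = informed_init(["City"], class_members, embeddings, dim=2, rng=rng)
unknown = informed_init(["Planet"], class_members, embeddings, dim=2, rng=rng)
```

Here the new city starts between the two known cities rather than at a random point, which is the intuition behind needing fewer epochs to place it well.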

2511.08949 2026-05-29 cs.CL

EVADE: LLM-Based Explanation Generation and Validation for Error Detection in NLI

Longfei Zuo, Barbara Plank, Siyao Peng

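AI summary: Proposes the EVADE framework, which uses large language models to generate and validate explanations in order to detect annotation errors in NLI datasets; experiments show that LLM validation reduces human effort and improves fine-tuning performance.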

Abstract

High-quality datasets are critical for training and evaluating reliable NLP models. In tasks like natural language inference (NLI), human label variation (HLV) arises when multiple labels are valid for the same instance, making it difficult to separate annotation errors from plausible variation. An earlier framework, VARIERR (Weber-Genzel et al., 2024), asks multiple annotators to explain their label decisions in the first round and flags errors through validity judgments in the second round. However, conducting two rounds of manual annotation is costly and may limit the coverage of plausible labels or explanations. Our study proposes a new framework, EVADE, for generating and validating explanations to detect errors using large language models (LLMs). We perform a comprehensive analysis comparing human- and LLM-detected errors for NLI across distribution comparison, validation overlap, and impact on model fine-tuning. Our experiments demonstrate that LLM validation refines generated explanation distributions to more closely align with human annotations, and that removing LLM-detected errors from training data yields larger improvements in fine-tuning performance than removing errors identified by human annotators. This highlights the potential to scale error detection, reducing human effort while improving dataset quality under label variation.

2511.08548 2026-05-29 cs.AI

A Matter of Interest: Understanding Interestingness of Math Problems in Humans and Language Models

Shubhra Mishra, Yuka Machino, Gabriel Poesia, Albert Jiang, Joy Hsu, Adrian Weller, Challenger Mishra, David Broman, Joshua B. Tenenbaum, Mateja Jamnik, Cedegao E. Zhang, Katherine M. Collins

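AI summary: Compares interestingness ratings of math problems from large language models with those from people with different mathematical backgrounds to study how well LLMs agree with humans on interestingness, and evaluates their ability to generate interesting problems.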

Comments Published at the Math-AI Workshop, NeurIPS 2025

Abstract

The evolution of mathematics is shaped importantly by interestingness: researchers choose which problems to pursue, and students choose which problems to engage with, based on expectations of interest and challenge. As AI systems, particularly large language models (LLMs) that operate flexibly over natural language and formal mathematics, are increasingly used in mathematics research and education, it becomes crucial to characterize how closely their judgments align with people from different mathematical backgrounds. We study whether LLMs align with human interestingness judgments by comparing LLM ratings with those of two populations, crowdsourced participants with college math experience and International Math Olympiad competitors. Although many LLMs broadly agree with human notions of interestingness, they largely fail to match the distribution of human judgments. They also weakly align with why humans find problems interesting, with low correlation to human-selected rationales. Finally, we evaluate LLMs' ability to generate interesting problems and find that, after filtering for validity, LLMs are able to generate engaging problems. We conclude with takeaways, including the need for multi-LLM human-AI collaborative systems, that highlight both the promise and current limits of LLMs as partners in mathematical reasoning.

2511.08423 2026-05-29 cs.CV

OmniAID: Decoupling Semantic and Artifacts for Universal AI-Generated Image Detection in the Wild

Yuncheng Guo, Junyan Ye, Chenjue Zhang, Hengrui Kang, Haohuan Fu, Conghui He, Weijia Li

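AI summary: Proposes the OmniAID framework, which separates semantic flaws from universal artifacts with a decoupled Mixture-of-Experts architecture and, together with a two-stage training strategy and the Mirage dataset, achieves robust AI-generated image detection across generative models and semantic content.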

Comments Accepted by ICML 2026

Abstract

A truly universal AI-Generated Image (AIGI) detector must simultaneously generalize across diverse generative models and varied semantic content. Current methods learn a single, entangled forgery representation, conflating content-dependent flaws with content-agnostic artifacts, and are further constrained by outdated benchmarks. We propose OmniAID, a novel framework centered on a decoupled Mixture-of-Experts (MoE) architecture that separates: (1) semantic flaws across distinct content domains via Routable Specialized Semantic Experts, and (2) content-agnostic universal artifacts from content-dependent flaws via a Fixed Universal Artifact Expert. A two-stage training strategy first specializes experts independently with domain-specific hard-sampling, then trains a lightweight gating network for effective input routing. By explicitly decoupling "what is generated" (content-specific flaws) from "how it is generated" (universal artifacts), OmniAID achieves robust generalization. We also introduce Mirage, a large-scale, contemporary dataset comprising a modern training set and a challenging test set. Extensive experiments demonstrate that OmniAID surpasses existing detectors, establishing a new standard for AIGI detection against modern, in-the-wild threats. Code is available at https://github.com/yunncheng/OmniAID.
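
The decoupled design (routed semantic experts plus one always-on artifact expert) can be pictured with a toy forward pass. This is a sketch under loud assumptions: the experts are reduced to precomputed scalar scores, the gate is a plain softmax, and the equal-weight fusion is an arbitrary choice for the example, not something the abstract states.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def detect(gate_logits, semantic_scores, artifact_score):
    """Combine routed semantic experts with an always-on artifact expert.

    gate_logits:     one logit per semantic expert (from a gating network)
    semantic_scores: each semantic expert's "fake" score for the image
    artifact_score:  the fixed artifact expert's "fake" score
    """
    weights = softmax(gate_logits)
    routed = sum(w * s for w, s in zip(weights, semantic_scores))
    # Equal-weight fusion is an arbitrary choice made for this sketch.
    return 0.5 * routed + 0.5 * artifact_score

# Hypothetical numbers: the gate strongly prefers the first semantic expert.
gate = softmax([4.0, 0.0, 0.0])
score = detect([4.0, 0.0, 0.0], [0.9, 0.2, 0.1], artifact_score=0.7)
```

The point of the structure is that the artifact expert contributes regardless of routing, so content-agnostic evidence is never gated away by a content-specific decision.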

2511.04758 2026-05-29 cs.RO cs.AI cs.MA

ScheduleStream: Temporal Planning with Samplers for GPU-Accelerated Multi-Arm Task and Motion Planning & Scheduling

Caelan Garrett, Fabio Ramos

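AI summary: Proposes ScheduleStream, the first general-purpose framework to combine hybrid durative actions and domain-independent algorithms with GPU-accelerated samplers for parallel multi-arm task and motion planning and scheduling.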

Comments Project website: https://schedulestream.github.io

Journal ref: 2026 IEEE International Conference on Robotics and Automation (ICRA)

Abstract

Bimanual and humanoid robots are appealing because of their human-like ability to leverage multiple arms to efficiently complete tasks. However, controlling multiple arms at once is computationally challenging due to the growth in the hybrid discrete-continuous action space. Task and Motion Planning (TAMP) algorithms can efficiently plan in hybrid spaces but generally produce plans in which only one arm moves at a time, rather than schedules that allow for parallel arm motion. In order to extend TAMP to produce schedules, we present ScheduleStream, the first general-purpose framework for planning & scheduling with sampling operations. ScheduleStream models temporal dynamics using hybrid durative actions, which can be started asynchronously and persist for a duration that's a function of their parameters. We propose domain-independent algorithms that solve ScheduleStream problems without any application-specific mechanisms. We apply ScheduleStream to Task and Motion Planning & Scheduling (TAMPAS), where we use GPU acceleration within samplers to expedite planning. We compare ScheduleStream algorithms to several ablations in simulation and find that they produce more efficient solutions. We demonstrate ScheduleStream on several real-world bimanual robot tasks at https://schedulestream.github.io.

2510.27391 2026-05-29 cs.CV cs.LG

Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds

Wei Wu, Xiaomeng Fan, Yuwei Wu, Zhi Gao, Pengxiang Li, Yunde Jia, Mehrtash Harandi

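AI summary: Proposes a method for aligning tree-like hierarchical image and text features on heterogeneous hyperbolic manifolds: it extracts hierarchical visual features with cross-attention, embeds them in heterogeneous manifolds, and learns an intermediate manifold with a KL distance measure, outperforming baselines on open-set classification tasks.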

Comments Published as a conference paper at ICLR 2026

Journal ref: The Fourteenth International Conference on Learning Representations (ICLR 2026), Rio de Janeiro, Brazil, 2026

Abstract

Modality alignment is critical for vision-language models (VLMs) to effectively integrate information across modalities. However, existing methods extract hierarchical features from text while representing each image with a single feature, leading to asymmetric and suboptimal alignment. To address this, we propose Alignment across Trees, a method that constructs and aligns tree-like hierarchical features for both image and text modalities. Specifically, we introduce a semantic-aware visual feature extraction framework that applies a cross-attention mechanism to visual class tokens from intermediate Transformer layers, guided by textual cues to extract visual features with coarse-to-fine semantics. We then embed the feature trees of the two modalities into hyperbolic manifolds with distinct curvatures to effectively model their hierarchical structures. To align across the heterogeneous hyperbolic manifolds with different curvatures, we formulate a KL distance measure between distributions on heterogeneous manifolds, and learn an intermediate manifold for manifold alignment by minimizing the distance. We prove the existence and uniqueness of the optimal intermediate manifold. Experiments on taxonomic open-set classification tasks across multiple image datasets demonstrate that our method consistently outperforms strong baselines under few-shot and cross-domain settings.
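
As background on why curvature matters for embedding trees, here is the geodesic distance on a Poincaré ball of curvature -c, a common model of hyperbolic space. The abstract does not say which hyperbolic model the paper uses, so the Poincaré ball is an assumption, and the sample points are made up.

```python
import math

def squared_norm(v):
    return sum(a * a for a in v)

def poincare_distance(x, y, c=1.0):
    """Geodesic distance on the Poincare ball of curvature -c (c > 0).

    Both points must lie strictly inside the ball: c * ||v||^2 < 1.
    """
    diff = squared_norm([a - b for a, b in zip(x, y)])
    denom = (1.0 - c * squared_norm(x)) * (1.0 - c * squared_norm(y))
    return math.acosh(1.0 + 2.0 * c * diff / denom) / math.sqrt(c)

# Hypothetical embedding: a root at the origin and two leaves near the boundary.
origin, leaf_a, leaf_b = (0.0, 0.0), (0.9, 0.0), (0.0, 0.9)
d_root = poincare_distance(origin, leaf_a)
d_leaves = poincare_distance(leaf_a, leaf_b)
```

The two leaves are far apart in hyperbolic terms even though their Euclidean gap is modest; the leaf-to-leaf distance is close to the cost of travelling through the root, which is how tree distances behave. Changing `c` rescales this geometry, so two modalities embedded with different curvatures are not directly comparable, which motivates the intermediate manifold described in the abstract.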

2510.26270 2026-05-29 cs.AI

Graph-Enhanced Policy Optimization in LLM Agent Training

Jiazhen Yuan, Zhike Gong, Jinquan Hang, Zhengbiao Bai, Wei Zhao

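AI summary: Proposes the Graph-Enhanced Policy Optimization (GEPO) framework, which improves success rates in multi-step LLM agent training through dual-level structural credit assignment on a state-transition graph, based on a task-conditioned criticality score.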

Abstract

Multi-step LLM agents in interactive environments represent a crucial step toward long-horizon decision-making. To train such agents, group-based reinforcement learning is widely adopted, which reinforces trajectories with higher relative performance within the group. However, in most existing methods, every step within a trajectory and every trajectory with the same terminal reward receive identical credit, regardless of their actual contributions. Since different states play different structural roles in an online state-transition graph built from sampled trajectories, their impacts should be differentiated and converted into task-aware credit at both the step and trajectory levels. We therefore present Graph-Enhanced Policy Optimization (GEPO), a framework for dual-level structural credit assignment in multi-step LLM agent training. Specifically, GEPO derives a state-level Task-Conditioned Criticality score that combines topological betweenness on the state-transition graph with semantic similarity to the task prompt. Based on this score, trajectory-level credit is reshaped through a state-adaptive discount, while step-level credit is scaled by the criticality of its successor state. Experimental results show that GEPO outperforms the strongest baselines by 1.1% in success rate on ALFWorld, 3.2% on WebShop, and 3.8% on average across search-augmented QA tasks at the 7B scale. Compared with flat group-based methods, GEPO reduces across-seed variance and concentrates gradient signals on the most critical steps.
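
The step-level part of this idea can be sketched on a toy state-transition graph. The betweenness routine is the standard Brandes algorithm for unweighted directed graphs; everything else is an assumption for illustration: combining betweenness and similarity by a product, the similarity values, and the graph itself. The abstract does not give GEPO's actual formula.

```python
from collections import deque

def betweenness(graph):
    """Brandes' algorithm for unweighted directed graphs (unnormalised)."""
    score = {v: 0.0 for v in graph}
    for source in graph:
        order, preds = [], {v: [] for v in graph}
        paths = {v: 0 for v in graph}
        depth = {v: -1 for v in graph}
        paths[source], depth[source] = 1, 0
        queue = deque([source])
        while queue:  # BFS that counts shortest paths
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if depth[w] < 0:
                    depth[w] = depth[v] + 1
                    queue.append(w)
                if depth[w] == depth[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        carry = {v: 0.0 for v in graph}
        for w in reversed(order):  # accumulate dependencies, farthest first
            for v in preds[w]:
                carry[v] += paths[v] / paths[w] * (1.0 + carry[w])
            if w != source:
                score[w] += carry[w]
    return score

def criticality(graph, similarity):
    """Normalised betweenness times task similarity (the product is an assumption)."""
    raw = betweenness(graph)
    top = max(raw.values()) or 1.0
    return {v: (raw[v] / top) * similarity[v] for v in graph}

def step_credit(trajectory, advantage, crit):
    """Scale the trajectory advantage at each step by its successor's criticality."""
    return [advantage * crit[nxt] for nxt in trajectory[1:]]

# Hypothetical graph: two routes merge at a bottleneck state before two goals.
graph = {"s0": ["a", "b"], "a": ["hub"], "b": ["hub"],
         "hub": ["goal1", "goal2"], "goal1": [], "goal2": []}
similarity = {"s0": 1.0, "a": 1.0, "b": 1.0, "hub": 0.8, "goal1": 1.0, "goal2": 1.0}

raw = betweenness(graph)
crit = criticality(graph, similarity)
credits = step_credit(["s0", "a", "hub", "goal1"], 1.0, crit)
```

In this toy graph the step that enters the bottleneck state receives the largest share of credit. Terminal states have zero betweenness, so a practical score would need some floor or smoothing; that detail is left out here.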

2510.22437 2026-05-29 cs.AI cs.CL

Modeling Hierarchical Thinking in Large Reasoning Models

G M Shahariar, Erfan Shayegani, Ali Nazari, Nael Abu-Ghazaleh

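AI summary: Proposes approximating the hierarchical reasoning dynamics of large reasoning models (LRMs) as trajectories in a finite state machine (FSM), and achieves efficient reasoning optimization with a Q-value-guided inference-time control method.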

Comments Accepted in ICML 2026 as Oral

Abstract

Large Reasoning Models (LRMs) solve complex tasks by generating long Chain-of-Thought (CoT) sequences; however, the emergent dynamics governing reasoning trajectories are not well understood and can lead to inconsistencies and reasoning pathologies. In this work, we propose to approximate LRM's emerging hierarchical reasoning dynamics as a trajectory within a Finite State Machine (FSM) transitioning among six abstract cognitive states. We demonstrate that these states and transitions can be captured in the latent state of the model. We believe that this representation can have different applications in the interpretability and optimization of LRM models. For example, by analyzing the topology of these transitions, we identify statistical shifts in reasoning strategies that help identify effective reasoning chains from those that fail. To illustrate these potential advantages, we propose Q-Value guided steering, a training-free inference-time control method that treats reasoning as a planning problem. We estimate the long-horizon utility of state transitions and apply sparse, orthogonal activation steering at sentence boundaries to align the CoT generation with optimal reasoning policies. Experiments across four benchmarks (AIME25, MATH-500, GSM8k, and GPQA Diamond) using three state-of-the-art open reasoning models demonstrate that Q-Value steering policy achieves significant performance gains with "surgical" efficiency, often requiring 25 times fewer interventions than greedy and weighted baselines, which suggests that reasoning can be effectively controlled by guiding high-level cognitive dynamics rather than micro-managing token generation. Code is available at: https://github.com/shahariar-shibli/CoT-FSM.
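
Analysing "the topology of these transitions" starts from an empirical transition table. The sketch below estimates one from sentence-level state labels. The state names are placeholders (the abstract mentions six abstract cognitive states but does not list them), and how sentences get labelled is outside the scope of this example.

```python
from collections import Counter, defaultdict

def transition_probabilities(traces):
    """Estimate P(next state | current state) from labelled reasoning traces."""
    counts = defaultdict(Counter)
    for trace in traces:
        for current, following in zip(trace, trace[1:]):
            counts[current][following] += 1
    return {
        state: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
        for state, nexts in counts.items()
    }

# Hypothetical state labels, one per sentence of a chain of thought.
traces = [
    ["plan", "compute", "verify", "answer"],
    ["plan", "compute", "compute", "verify", "answer"],
    ["plan", "compute", "backtrack", "compute", "verify", "answer"],
]
probs = transition_probabilities(traces)
```

Comparing such tables between successful and failed chains is one way to surface the statistical shifts in reasoning strategy that the abstract refers to.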

2510.18416 2026-05-29 cs.SD

SegTune: Structured and Fine-Grained Control for Song Generation

Pengfei Cai, Joanna Wang, Haorui Zheng, Xu Li, Zihao Ji, Teng Ma, Zhongliang Liu, Chen Zhang, Pengfei Wan

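AI summary: Proposes SegTune, a non-autoregressive framework that enables structured, controllable song generation through segment-level local descriptions and global prompts, and introduces an LLM-based duration predictor for precise lyric-to-music alignment.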

Comments This technical report was later revised and published at ACL 2026 (oral). ACL paper link: https://openreview.net/forum?id=FKf2S4u8at , code: https://github.com/KlingAIResearch/SegTune

Abstract

Recent advancements in song generation have shown promising results in generating songs from lyrics and/or global text prompts. However, most existing systems lack the ability to model the temporally varying attributes of songs, limiting fine-grained control over musical structure and dynamics. In this paper, we propose SegTune, a non-autoregressive framework for structured and controllable song generation. SegTune enables segment-level control by allowing users or large language models to specify local musical descriptions aligned to song sections. The segmental prompts are injected into the model by temporally broadcasting them to corresponding time windows, while global prompts influence the whole song to ensure stylistic coherence. To obtain accurate segment durations and enable precise lyric-to-music alignment, we introduce an LLM-based duration predictor that autoregressively generates sentence-level timestamped lyrics in LRC format. We further construct a large-scale data pipeline for collecting high-quality songs with aligned lyrics and prompts, and propose new evaluation metrics to assess segment-level alignment and vocal attribute consistency. Experimental results show that SegTune achieves superior controllability and musical coherence compared to existing baselines. See https://cai525.github.io/SegTune_demo for demos of our work.
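
LRC is a plain-text lyric format in which each line starts with a `[mm:ss.xx]` timestamp, so line durations fall out of consecutive timestamps. The sketch below parses that basic form and derives durations; the lyrics, timings, and song length are made up, and LRC metadata tags are simply skipped.

```python
import re

LRC_LINE = re.compile(r"\[(\d+):(\d+(?:\.\d+)?)\](.*)")

def parse_lrc(text):
    """Return (start_seconds, lyric) pairs from LRC-formatted lines."""
    entries = []
    for line in text.splitlines():
        match = LRC_LINE.match(line.strip())
        if match:
            minutes, seconds, lyric = match.groups()
            entries.append((int(minutes) * 60 + float(seconds), lyric.strip()))
    return entries

def line_durations(entries, song_length):
    """Duration of each line: gap to the next timestamp (or to the song end)."""
    starts = [start for start, _ in entries] + [song_length]
    return [round(b - a, 2) for a, b in zip(starts, starts[1:])]

# Made-up lyrics and timings, for illustration only.
lrc = """[00:12.50]First line of the verse
[00:17.00]Second line of the verse
[00:24.25]Chorus starts here"""
entries = parse_lrc(lrc)
durations = line_durations(entries, song_length=30.0)
```

Durations like these are what would let a segment-level prompt be broadcast to the correct time window of the song.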

2510.14150 2026-05-29 cs.AI cs.LG cs.NE

CodeEvolve: an open source evolutionary coding agent for algorithmic discovery and optimization

Henrique Assumpção, Diego Ferreira, Leandro Campos, Fabricio Murai

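AI summary: Proposes CodeEvolve, an open-source framework that combines large language models with island-based evolutionary search and, using inspiration-based crossover, meta-prompting, and depth-based refinement, matches or surpasses AlphaEvolve on 5 of 9 benchmark problems, outperforms OpenEvolve and ShinkaEvolve under matched conditions, and, with an open-weight backbone, exceeds the reported AlphaEvolve score on both CirclePackingSquare instances at roughly an order of magnitude lower cost than a frontier closed-source ensemble.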

Comments 21 pages, 16 figures, 8 tables

Abstract

We introduce CodeEvolve, an open-source framework that couples large language models with island-based evolutionary search for end-to-end algorithmic discovery. CodeEvolve integrates inspiration-based crossover, meta-prompting, and depth-based refinement on top of a CVT-MAP-Elites archive and a weighted LLM ensemble to generate optimized solutions for complex problems. On the AlphaEvolve benchmark suite, CodeEvolve matches or surpasses the reported AlphaEvolve results on 5 of 9 problems and, under matched conditions, outperforms the open-source frameworks OpenEvolve and ShinkaEvolve on 6 of 9. With the open-weight Qwen3-Coder-30B backbone, it surpasses the reported AlphaEvolve score on both CirclePackingSquare instances at roughly an order of magnitude lower cost than a frontier closed-source ensemble, and remains competitive with EoH on heuristic-design tasks without retuning. Ablations show that the interaction between CodeEvolve's components, rather than any single operator, drives these results. We release the framework, experimental data, and practical hyperparameter guidelines at https://github.com/inter-co/science-codeevolve.
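To make the phrase "island-based evolutionary search" concrete, here is a toy sketch of an island model with elitist selection and ring migration. Everything in it is an assumption made for illustration: the numeric `fitness` function stands in for scoring a candidate program, `mutate` stands in for an LLM-proposed edit, and the parameter values are arbitrary. It does not reproduce CodeEvolve's CVT-MAP-Elites archive, meta-prompting, or LLM ensemble.

```python
import random

def fitness(x):
    """Toy objective to maximise; a real system would score a candidate program."""
    return -(x - 3.0) ** 2

def mutate(x, rng):
    """Stand-in for an LLM-proposed edit: a small random perturbation."""
    return x + rng.gauss(0.0, 0.5)

def evolve_islands(n_islands=4, pop_size=8, generations=30, migrate_every=5, seed=0):
    rng = random.Random(seed)
    islands = [[rng.uniform(-10.0, 10.0) for _ in range(pop_size)]
               for _ in range(n_islands)]
    initial_best = max(fitness(x) for pop in islands for x in pop)
    for gen in range(1, generations + 1):
        for i, pop in enumerate(islands):
            # Keep the better half (elitism), refill with mutated survivors.
            survivors = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]
            children = [mutate(rng.choice(survivors), rng)
                        for _ in range(pop_size - len(survivors))]
            islands[i] = survivors + children
        if gen % migrate_every == 0:
            # Ring migration: each island's best replaces the next island's worst.
            bests = [max(pop, key=fitness) for pop in islands]
            for i, pop in enumerate(islands):
                worst = min(range(len(pop)), key=lambda j: fitness(pop[j]))
                pop[worst] = bests[(i - 1) % n_islands]
    final_best = max((x for pop in islands for x in pop), key=fitness)
    return initial_best, final_best

initial_best, best = evolve_islands()
```

Islands evolve independently between migrations, which preserves diversity, while occasional migration lets a good candidate spread to other populations.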

2510.11499 2026-05-29 cs.LG cs.AI

Offline Reinforcement Learning with Generative Trajectory Policies

Xinsong Feng, Leshu Tang, Chenan Wang, Haipeng Chen

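AI summary: Proposes Generative Trajectory Policies (GTPs), which treat diffusion, flow matching, and consistency models as instances of learning a continuous-time generative trajectory governed by an ODE, and introduces two theoretically principled adaptations, achieving state-of-the-art performance on D4RL benchmarks.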

Comments ICML 2026

Abstract

Generative models have emerged as a powerful class of policies for offline reinforcement learning (RL) due to their ability to capture complex, multi-modal behaviors. However, existing methods face a stark trade-off: slow, iterative models like diffusion policies are computationally expensive, while fast, single-step models like consistency policies often suffer from degraded performance. In this paper, we demonstrate that it is possible to bridge this gap. The key to moving beyond the limitations of individual methods, we argue, lies in a unifying perspective that views modern generative models, including diffusion, flow matching, and consistency models, as specific instances of learning a continuous-time generative trajectory governed by an Ordinary Differential Equation (ODE). This principled foundation provides a clearer design space for generative policies in RL and allows us to propose Generative Trajectory Policies (GTPs), a new and more general policy paradigm that learns the entire solution map of the underlying ODE. To make this paradigm practical for offline RL, we further introduce two key theoretically principled adaptations. Empirical results demonstrate that GTP achieves state-of-the-art performance on D4RL benchmarks: it significantly outperforms prior generative policies, achieving perfect scores on several notoriously hard AntMaze tasks.
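The notion of a "solution map" of an ODE can be illustrated with a toy case where it is known in closed form. For dx/dt = -x, the map from a state at time t to the state at time s is x·exp(-(s - t)), so many small integration steps (the iterative regime) can be compared against a single jump. The linear ODE and the function names below are assumptions chosen for illustration; GTP is described as learning such a map for a learned vector field, which this sketch does not attempt.

```python
import math

def velocity(x, t):
    """Toy ODE dx/dt = -x; a generative model would use a learned vector field."""
    return -x

def euler_integrate(x, t_start, t_end, n_steps):
    """Iterative sampling: follow the ODE with many small Euler steps."""
    dt = (t_end - t_start) / n_steps
    t = t_start
    for _ in range(n_steps):
        x = x + dt * velocity(x, t)
        t += dt
    return x

def solution_map(x, t_start, t_end):
    """Exact solution map for this ODE: jumps from t_start to t_end in one call."""
    return x * math.exp(-(t_end - t_start))

x0 = 2.0
slow = euler_integrate(x0, 0.0, 1.0, n_steps=1000)   # many function evaluations
fast = solution_map(x0, 0.0, 1.0)                    # a single evaluation
# The map also composes: going 0 -> 0.4 -> 1.0 matches going 0 -> 1.0 directly.
composed = solution_map(solution_map(x0, 0.0, 0.4), 0.4, 1.0)
```

A model that approximates the full map between arbitrary time pairs can trade off the number of jumps against accuracy, rather than being fixed at either many steps or exactly one.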

2510.08722 2026-05-29 cs.LG cs.AI

The Impact of Semantic Pairs on Self-Supervised Representation Learning

Mohammad Alkhalefi, Georgios Leontidis, Mingjun Zhong

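AI summary: A controlled study comparing semantic positive pairs (different instances of the same class) with augmented positive pairs in self-supervised learning. Semantic pairs improve generalisation, with contrastive methods benefiting most.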

Comments 19 pages, 7 figures, 5 tables

Abstract

Instance discrimination learns visual representations by treating different augmented views of the same image as positive pairs. While this encourages invariance to handcrafted transformations, same-image positives can preserve nuisance correlations such as background, texture, illumination, and object-specific details. Semantic positive pairs, i.e., different same-class instances, may reduce these correlations by presenting objects across diverse contexts. However, previous studies often combine semantic pairs with augmented positives or false neighbors (i.e., incorrectly mapped semantic pairs), making it difficult to isolate the effect of semantic pairing. We present a controlled empirical study of semantic positive pairs for self-supervised representation learning. From ImageNet-1K, we construct two matched subsets: an augmented-pair baseline and a manually curated semantic-pair dataset with the same class composition and training-pair count. We use these datasets to compare representative contrastive and non-contrastive SSL methods under matched training conditions. Across transfer learning and object detection evaluations, semantic-pair pretraining consistently improves generalisation over augmented-pair pretraining. Additional ablations show that semantic pairs induce invariances beyond the standard transformation pipeline. Among the evaluated methods, contrastive learning benefits most strongly from semantic pairs, with SimCLR showing the largest relative improvement. These results clarify the role of semantic positive pairs in SSL and provide guidance for selecting and designing frameworks that can exploit semantic pair information effectively.

2510.03550 2026-05-29 cs.CV

Streaming Drag-Oriented Interactive Video Manipulation: Drag Anything, Anytime!

Junbao Zhou, Yuan Zhou, Kesen Zhao, Qingshan Xu, Beier Zhu, Richang Hong, Hanwang Zhang

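AI summary: Proposes the REVEL task and the DragStream method, which uses adaptive distribution self-rectification and spatial-frequency selective optimization to enable streaming, drag-based interactive manipulation in autoregressive video diffusion models.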

Abstract

Achieving streaming, fine-grained control over the outputs of autoregressive video diffusion models remains challenging, making it difficult to ensure that they consistently align with user expectations. To bridge this gap, we propose stReaming drag-oriEnted interactiVe vidEo manipuLation (REVEL), a new task that enables users to modify generated videos anytime on anything via fine-grained, interactive drag. Beyond DragVideo and SG-I2V, REVEL unifies drag-style video manipulation as editing and animating video frames, with both supporting user-specified translation, deformation, and rotation effects, making drag operations versatile. In resolving REVEL, we observe: (i) drag-induced perturbations accumulate in latent space, causing severe latent distribution drift that halts the drag process; (ii) streaming drag is easily disturbed by context frames, thereby yielding visually unnatural outcomes. We thus propose a training-free approach, DragStream, comprising: (i) an adaptive distribution self-rectification strategy that leverages neighboring frames' statistics to effectively constrain the drift of latent embeddings; (ii) a spatial-frequency selective optimization mechanism, allowing the model to fully exploit contextual information while mitigating its interference via selectively propagating visual cues along generation. Our method can be seamlessly integrated into existing autoregressive video diffusion models, and extensive experiments firmly demonstrate the effectiveness of DragStream.

2510.00936 2026-05-29 cs.CV

Resolution as a Direction: Vector-Panning Feature Alignment for Cross-Resolution Re-Identification

Zanwu Liu, Chao Yuan, Bo Li, Xiaowei Zhang, Guanglin Niu

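AI summary: Proposes Vector Panning Feature Alignment (VPFA), which pans low-resolution features along a learned resolution direction to obtain pseudo-high-resolution representations, giving lightweight and efficient cross-resolution person re-identification.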

Abstract

Cross-resolution person re-identification (CR-ReID) remains challenging in practical surveillance, where camera quality and capture distance lead to substantial resolution gaps between low-resolution (LR) queries and high-resolution (HR) gallery images. Prior approaches commonly rely on super-resolution (SR) or resolution-invariant representation learning, which often increases system complexity and may not directly address the feature mismatch induced by resolution degradation. In this work, we report a new empirical finding from a dedicated analysis in which identity-specific variation is averaged out: the HR-LR feature discrepancy produced by standard ReID backbones exhibits a consistent, resolution-related semantic direction in the embedding space. We further support this observation with statistical analyses based on Canonical Correlation Analysis (CCA) and Pearson correlation analysis. Motivated by this finding, we propose Vector Panning Feature Alignment (VPFA), a lightweight post-hoc module that learns to pan LR features along the learned resolution direction to obtain pseudo-HR representations. VPFA operates after feature extraction and can be integrated into existing ReID systems with negligible overhead. Extensive experiments on multiple CR-ReID benchmarks show that VPFA achieves state-of-the-art performance while improving efficiency compared to SR-based or jointly trained alternatives.
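The idea of a shared resolution direction can be sketched with a deliberately simplified stand-in: average the HR-minus-LR feature difference over paired samples, then shift an LR feature by that average. This fixed mean offset is an assumption used in place of the learned module the abstract describes, and the function names and 2-D toy features are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def estimate_resolution_direction(hr_features, lr_features):
    """Average the HR-minus-LR difference over paired samples.

    Averaging over many identities cancels identity-specific variation and
    leaves the shared, resolution-related offset.
    """
    diffs = [[h - l for h, l in zip(hr, lr)]
             for hr, lr in zip(hr_features, lr_features)]
    return mean_vector(diffs)

def pan(lr_feature, direction, scale=1.0):
    """Shift an LR feature along the resolution direction to get a pseudo-HR feature."""
    return [x + scale * d for x, d in zip(lr_feature, direction)]

# Toy paired features for three identities (2-D embeddings, made up for the example).
hr = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0]]
lr = [[0.5, 1.0], [2.5, 0.0], [-0.5, 3.0]]

direction = estimate_resolution_direction(hr, lr)
pseudo_hr = pan([2.0, 2.0], direction)
```

Because the shift is applied after feature extraction, a step like this adds only a vector addition per query, which is consistent with the negligible overhead the abstract claims for a post-hoc module.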

2509.24895 2026-05-29 cs.LG

Towards Understanding the Shape of Representations in Protein Language Models

Kosio Beshkov, Anders Malthe-Sørenssen

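AI summary: Analyzes the representation space of protein language models (PLMs) using square-root velocity representations and graph filtrations. In ESM2 models, the Karcher mean and effective dimension vary non-linearly across layers. PLMs preferentially encode local relations between residues, and the most structurally faithful representations appear close to, but before, the last layer.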

Comments Accepted as a poster at ICLR 2026. OpenReview: https://openreview.net/forum?id=Dnn8SSBJaY

Journal ref: International Conference on Learning Representations (ICLR), 2026

Abstract

While protein language models (PLMs) are one of the most promising avenues of research for future de novo protein design, the way in which they transform sequences to hidden representations, as well as the information encoded in such representations, is yet to be fully understood. Several works have attempted to propose interpretability tools for PLMs, but they have focused on understanding how individual sequences are transformed by such models. Therefore, the way in which PLMs transform the whole space of sequences along with their relations is still unknown. In this work we attempt to understand this transformed space of sequences by identifying protein structure and representation with square-root velocity (SRV) representations and graph filtrations. Both approaches naturally lead to a metric space in which pairs of proteins or protein representations can be compared with each other. We analyze different types of proteins from the SCOP dataset and show that the Karcher mean and effective dimension of the SRV shape space follow a non-linear pattern as a function of the layers in ESM2 models of different sizes. Furthermore, we use graph filtrations as a tool to study the context lengths at which models encode the structural features of proteins. We find that PLMs preferentially encode immediate as well as local relations between residues, but start to degrade for larger context lengths. The most structurally faithful encoding tends to occur close to, but before, the last layer of the models, indicating that training a folding model on top of these layers might lead to improved folding performance.

2509.23730 2026-05-29 cs.AI

EAPO: Enhancing Policy Optimization with On-Demand Expert Assistance

Siyao Song, Cong Ma, Zhihao Cheng, Shiye Lei, Minghao Li, Ying Zeng, Huaixiao Tou, Kai Jia

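AI summary: Proposes Expert-Assisted Policy Optimization (EAPO), an RL framework that enhances exploration through multi-turn interactions with external experts during training, addressing sparse rewards and inefficient exploration. It averages a 5-point gain over self-exploration RL on AIME 2024/2025 and AIMO 2025.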

Comments Accepted by ICML 2026

Abstract

Large language models (LLMs) have recently advanced in reasoning when optimized with reinforcement learning (RL) under verifiable rewards. Existing methods primarily rely on outcome-based supervision to strengthen internal LLM reasoning, often leading to inefficient exploration and sparse rewards. To mitigate this issue, we propose Expert-Assisted Policy Optimization (EAPO), a novel RL framework that enhances exploration by incorporating multi-turn interactions with external experts during training. Unlike prior methods, where policies reason in isolation, EAPO incentivizes the policy to adaptively determine when and how to consult experts, yielding richer reward signals and more reliable reasoning trajectories. External assistance ultimately internalizes expert knowledge into the policy model, amplifying the model's inherent reasoning capabilities. During evaluation, the policy model has been well-optimized to solve questions independently, producing improved reasoning paths and more accurate solutions. On AIME 2024/2025 and AIMO 2025, EAPO consistently outperforms expert-assisted, expert-distilled, and RL baselines, averaging a 5-point gain over self-exploration RL, and also generalizes to non-math benchmarks, including HumanEval, HLE, GPQA, MMLU, EvalPlus, HotpotQA, and SimpleQA.

2509.22504 2026-05-29 cs.AI cs.LG

Estimating the Empowerment of Language Model Agents

Jinyeop Song, Jeff Gore, Max Kleiman-Weiner

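AI summary: Proposes EELMA, an evaluation approach based on the information-theoretic notion of empowerment, which approximates effective empowerment from multi-turn text interactions. Experiments show that empowerment correlates strongly with task performance and can serve as a goal-agnostic metric that complements task-success measures.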

Comments Published at the International Conference on Machine Learning (ICML) 2026. 9 pages, 9 figures; camera-ready version

Abstract

As language model (LM) agents become increasingly capable and adopted in real-world applications, there is a growing need for scalable evaluation frameworks beyond costly, manually designed benchmarks. We propose information-theoretic evaluation based on empowerment, an information-theoretic measure of an agent's influence on future states through its actions. To handle the unique challenges of text-based environments, we introduce EELMA (Estimating Empowerment of Language Model Agents), an algorithm for approximating effective empowerment from multi-turn text interactions. We demonstrate EELMA on textual games and realistic web and tool-use environments, showing that empowerment strongly correlates with average task performance. We further analyze how empowerment varies across models, environment complexity, and agent configurations, and show that high-empowerment states and actions often mark pivotal moments for general capabilities. These results establish empowerment as a goal-agnostic metric that complements task-success measures for LM-agent evaluation.
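Empowerment is usually defined as the maximum, over action distributions, of the mutual information between an agent's actions and the resulting future states. The sketch below computes only the inner quantity for one fixed, observed action distribution, using a plug-in estimate over discrete (action, next state) samples. The toy data and the function name are assumptions; EELMA's estimator for free-form text is not described in this abstract and is not reproduced here.

```python
import math
from collections import Counter

def mutual_information_bits(pairs):
    """Plug-in estimate of I(A; S') in bits from (action, next_state) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    actions = Counter(a for a, _ in pairs)
    states = Counter(s for _, s in pairs)
    mi = 0.0
    for (a, s), count in joint.items():
        p_joint = count / n
        mi += p_joint * math.log2(p_joint / ((actions[a] / n) * (states[s] / n)))
    return mi

# Actions fully determine the next state: 1 bit of influence.
controllable = [("left", "room_A"), ("right", "room_B")] * 50
# Next state ignores the action: no influence.
uncontrollable = [("left", "room_A"), ("right", "room_A")] * 50

mi_high = mutual_information_bits(controllable)
mi_low = mutual_information_bits(uncontrollable)
```

The two toy cases show why the measure is goal-agnostic: it rewards an agent whose choices change what happens next, without reference to any particular task reward.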

2509.21979 2026-05-29 cs.CV cs.AI

Benchmarking and Mitigating Sycophancy in Medical Vision Language Models

Juangui Xu, Zikun Guo, Jingwei Lv, Hongbin Lin, Shu Yang, Jun Wen, Di Wang, Lijie Hu

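AI summary: Addresses sycophancy in medical vision language models with a hierarchical medical visual question answering benchmark and the VIPER strategy, which filters out non-evidence-based social cues to reduce sycophancy and improve robustness.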

Comments 19 figures, 61 pages. The first two authors contributed equally

Abstract

Visual language models (VLMs) have the potential to transform medical workflows. However, their deployment is limited by sycophancy. Despite this serious threat to patient safety, a systematic benchmark remains lacking. This paper addresses this gap by introducing a medical benchmark that applies multiple templates to VLMs in a hierarchical medical visual question answering task. We find that current VLMs are highly susceptible to visual cues, with failure rates showing a correlation to model size or overall accuracy. We discover that perceived authority and user mimicry are powerful triggers, suggesting a bias mechanism independent of visual data. To overcome this, we propose a Visual Information Purification for Evidence-based Responses (VIPER) strategy that proactively filters out non-evidence-based social cues, thereby reinforcing evidence-based reasoning. VIPER reduces sycophancy while maintaining interpretability and consistently outperforms baseline methods, laying the necessary foundation for the robust and secure integration of VLMs.

2509.21154 2026-05-29 cs.LG cs.AI

GRPO is Secretly a Process Reward Model

Michael Sullivan, Alexander Koller

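AI summary: Proves theoretically that GRPO equipped with an outcome reward model is equivalent to a PRM-aware RL objective with a Monte-Carlo-based process reward model, identifies a flaw in the GRPO objective, and proposes λ-GRPO, which improves performance on reasoning tasks.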

Comments 16 pages, 9 figures; accepted at ICML 2026

Abstract

Process reward models (PRMs) allow for fine-grained credit assignment in reinforcement learning (RL), and seemingly contrast with outcome reward models (ORMs), which assign a single reward to an entire trajectory. However, we provide theoretical proof in this work that the Group Relative Policy Optimization (GRPO) RL algorithm equipped with an ORM is in fact equivalent to a PRM-aware RL objective equipped with a non-trivial, Monte-Carlo-based PRM (given mild assumptions). Leveraging the framework of GRPO-as-a-PRM, we identify a flaw in the GRPO objective that interacts with imbalanced process steps and rewards to hinder both exploration and exploitation (under different conditions). We propose a simple modification to the algorithm to mitigate this defect (λ-GRPO), and show that LLMs tuned with λ-GRPO outperform LLMs tuned with standard GRPO on downstream reasoning tasks, and reach peak performance more rapidly. These results show that we can leverage the hidden, built-in PRM structure within the vanilla GRPO algorithm to boost model performance without employing an explicit PRM, and with a negligible impact on training time and cost.
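For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage that standard GRPO computes from outcome rewards: each completion's reward is standardised against the other completions sampled for the same prompt. The use of the population standard deviation and the small `eps` term are assumptions, and the λ-GRPO modification is not shown because this abstract does not specify it.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardise outcome rewards within one group of sampled completions."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four completions sampled for the same prompt, scored 1 (correct) or 0 (wrong).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
# Every token of completion i shares advantages[i] in the policy-gradient loss.
```

Because completions in a group often share prefixes, a token in a shared prefix effectively receives a mix of the advantages of all completions passing through it, which is one way to read the abstract's claim that a process-level reward structure is already implicit in the objective.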

2508.19282 2026-05-29 cs.CL cs.AI

Less Is More: Elevating RAG via Performance-Driven Context Compression

Ziqiang Cui, Yunpeng Weng, Xing Tang, Peiyang Liu, Shiwei Li, Bowei He, Jiamin Chen, Yansen Zhang, Xiuqiang He, Chen Ma

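AI summary: Proposes CORE-RAG, a framework that uses task performance as a feedback signal to iteratively refine the compression policy. At a 3% compression ratio it improves the average Exact Match score by 3.3 points over using full documents.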

Comments Accepted by ICML 2026

Abstract

Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm for improving the timeliness of knowledge updates and the factual accuracy of large language models. However, incorporating a large volume of retrieved documents significantly increases input length, leading to prohibitive computational costs. Existing compression approaches often compromise task performance, primarily due to their reliance on predefined heuristics. These heuristics fail to ensure that the compressed context is conducive to the generation tasks. To address these limitations, we propose CORE-RAG, a novel framework for context compression in RAG systems. CORE eliminates reliance on proxy heuristics through a performance-driven learning framework, which directly utilizes task performance as a feedback signal to iteratively refine the compressor policy. Prior to this optimization process, we incorporate a knowledge distillation phase to initialize the compressor with a robust policy. Extensive experiments demonstrate the superiority of our approach. At a high compression ratio of 3%, CORE not only avoids performance degradation but also improves the average Exact Match (EM) score by 3.3 points compared to using full documents. Our code is available at https://github.com/ziqiangcui/CORE-RAG-ICML26.
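The Exact Match (EM) score quoted above is commonly computed by normalising the predicted and reference answers and checking for string equality. The sketch below assumes a SQuAD-style normalisation (lowercasing, stripping punctuation and English articles); the paper's own evaluation script may differ, and the example questions are made up.

```python
import re
import string

def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, references):
    """Return 1 if the normalized prediction equals any normalized reference, else 0."""
    pred = normalize_answer(prediction)
    return int(any(pred == normalize_answer(ref) for ref in references))

def average_em(predictions, references_list):
    """Mean EM over a dataset, reported on a 0-100 scale."""
    scores = [exact_match(p, refs) for p, refs in zip(predictions, references_list)]
    return 100.0 * sum(scores) / len(scores)

# Hypothetical predictions and gold answers.
preds = ["The Eiffel Tower.", "Berlin"]
refs = [["Eiffel Tower"], ["Paris"]]
score = average_em(preds, refs)
```

On this scale, a 3.3-point gain means roughly 3.3 more questions answered exactly right per hundred.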

2508.19202 2026-05-29 cs.CL

Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning

Alan Li, Yixin Liu, Arpan Sarkar, Doug Downey, Arman Cohan

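AI summary: Introduces the SciReas benchmark suite and the KRUX probing framework to study the roles of knowledge and reasoning in LLM scientific problem solving. Retrieving knowledge from model parameters is found to be a critical bottleneck, and both in-context external knowledge and enhanced reasoning improve performance.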

Comments 33 pages, 18 figures

Journal ref: ICML 2026 Main Conference

Abstract

Scientific problem solving poses unique challenges for LLMs, requiring both deep domain knowledge and the ability to apply such knowledge through complex reasoning. While automated scientific reasoners hold great promise for assisting human scientists, there is currently no widely adopted holistic benchmark for evaluating scientific reasoning, and few approaches systematically disentangle the distinct roles of knowledge and reasoning in these tasks. To address these gaps, we introduce SciReas, a diverse suite of existing benchmarks for scientific reasoning tasks, and SciReas-Pro, a selective subset that requires more complex reasoning. Our holistic evaluation surfaces insights about scientific reasoning performance that remain hidden when relying on individual benchmarks alone. We then propose KRUX, a probing framework for studying the distinct roles of reasoning and knowledge in scientific tasks. Combining the two, we conduct an in-depth analysis that yields several key findings: (1) Retrieving task-relevant knowledge from model parameters is a critical bottleneck for LLMs in scientific reasoning; (2) Reasoning models consistently benefit from external knowledge added in-context on top of the reasoning enhancement; (3) Enhancing verbalized reasoning improves LLMs' ability to surface task-relevant knowledge.

2508.15180 2026-05-29 cs.AI

PuzzleClone: A DSL-Powered Framework for Synthesizing Verifiable Data

Kai Xiong, Yanwei Huang, Rongjunchen Zhang, Kun Chen, Haipang Wu, Yingcai Wu

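AI summary: Proposes PuzzleClone, a DSL-driven framework for synthesizing large-scale, reliable, and diverse verifiable mathematical and logical data, and builds the PC-83K benchmark. Experiments show that post-training on it substantially improves LLM performance on logic and math tasks.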

Abstract

High-quality mathematical and logical datasets with verifiable answers are essential for strengthening the reasoning capabilities of large language models (LLMs). While recent data augmentation techniques have facilitated the creation of large-scale benchmarks, existing LLM-generated datasets often suffer from limited reliability, diversity, and scalability. To address these challenges, we introduce PuzzleClone, a formal framework for synthesizing verifiable data at scale using a novel DSL-driven approach. Our approach features three key innovations: (1) encoding seed puzzles into structured logical specifications, (2) generating scalable variants through systematic variable and constraint randomization, and (3) ensuring validity via a reproduction mechanism. Applying PuzzleClone, we construct PC-83K, a benchmark comprising over 83K diverse and programmatically validated puzzles. The generated puzzles span a wide spectrum of difficulty and formats, posing significant challenges to current state-of-the-art models. Experimental results show that post-training (SFT and RL) on PC-83K yields substantial improvements not only on the test set but also on various logic and mathematical benchmarks. Post-training raises average performance on PC-83K from 14.5 to 66.0 and delivers consistent improvements across 7 logic and mathematical benchmarks, by up to 18.4 absolute percentage points (SATBench from 51.6 to 70.0). Our code and data are available at https://github.com/HiThink-Research/PuzzleClone.
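The three steps listed in this abstract (a structured specification, randomized variants, programmatic validation) can be mimicked with a very small toy generator. The sum-and-difference template, the brute-force verifier, and all function names below are assumptions invented for illustration; PuzzleClone's actual DSL and reproduction mechanism are not described here and are not reproduced.

```python
import random

def make_puzzle(rng):
    """Randomize the hidden variables of a seed template and derive its constraints."""
    x, y = rng.randint(1, 20), rng.randint(1, 20)
    constraints = {"sum": x + y, "diff": x - y}
    question = (f"Two integers between 1 and 20 have sum {constraints['sum']} "
                f"and difference {constraints['diff']}. What are they?")
    return question, constraints, (x, y)

def solve(constraints):
    """Brute-force solver used as the verifier: enumerate all candidate answers."""
    return [(a, b) for a in range(1, 21) for b in range(1, 21)
            if a + b == constraints["sum"] and a - b == constraints["diff"]]

def generate_verified(n, seed=0):
    """Keep only variants whose answer is unique and matches the generator's answer."""
    rng = random.Random(seed)
    puzzles = []
    while len(puzzles) < n:
        question, constraints, answer = make_puzzle(rng)
        if solve(constraints) == [answer]:
            puzzles.append((question, answer))
    return puzzles

puzzles = generate_verified(5)
```

Checking each variant with an independent solver is what makes the answers verifiable, since a variant with no solution or several solutions would be rejected before entering the dataset.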

2508.14610 2026-05-29 cs.RO

TRUST-Planner: Topology-guided Robust Trajectory Planner for AAVs with Uncertain Obstacle Spatial-temporal Avoidance

Junzhi Li, Teng Long, Jingliang Sun, Jianxin Zhong

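AI summary: Proposes TRUST-Planner, a topology-guided hierarchical planning framework that uses a dynamic enhanced visible probabilistic roadmap (DEV-PRM), a uniform terminal-free minimum control polynomial (UTF-MINCO), and a dynamic distance field (DDF) for robust spatial-temporal obstacle avoidance in complex dynamic environments, reaching a 96% success rate with millisecond-level computation.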

Comments Accepted by IEEE Transactions on Industrial Electronics (TIE) for publication. The final version will be available online at https://ieeexplore.ieee.org/ after publication

DOI: 10.1109/TIE.2026.3695224

Abstract

Despite extensive developments in motion planning of autonomous aerial vehicles (AAVs), existing frameworks face the challenges of local minima and deadlock in complex dynamic environments, leading to increased collision risks. To address these challenges, we present TRUST-Planner, a topology-guided hierarchical planning framework for robust spatial-temporal obstacle avoidance. In the frontend, a dynamic enhanced visible probabilistic roadmap (DEV-PRM) is proposed to rapidly explore topological paths for global guidance. The backend utilizes a uniform terminal-free minimum control polynomial (UTF-MINCO) and dynamic distance field (DDF) to enable efficient predictive obstacle avoidance and fast parallel computation. Furthermore, an incremental multi-branch trajectory management framework is introduced to enable spatio-temporal topological decision-making, while efficiently leveraging historical information to reduce replanning time. Simulation results show that TRUST-Planner outperforms baseline competitors, achieving a 96% success rate and millisecond-level computation efficiency in tested complex environments. Real-world experiments further validate the feasibility and practicality of the proposed method.

2507.23270 2026-05-29 cs.RO cs.SY eess.SY

Simulation-based planning of Motion Sequences for Automated Procedure Optimization in Multi-Robot Assembly Cells

Loris Schneider, Marc Ungen, Elias Huber, Jan-Felix Klein

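AI summary: Proposes a simulation-based method that splits assembly steps into core operations and traverse operations and optimizes the schedule with a decomposition-based motion planning strategy, generating efficient, collision-free multi-robot motion sequences that reduce assembly time.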

Comments Accepted for publication at IEEE CASE 2026

Abstract

Reconfigurable multi-robot cells offer a promising approach to meet fluctuating assembly demands. However, the recurrent planning of their configurations introduces new challenges, particularly in generating optimized, coordinated multi-robot motion sequences that minimize the assembly duration. This work presents a simulation-based method for generating such optimized sequences. The approach separates assembly steps into task-related core operations and connecting traverse operations. While core operations are constrained and predetermined, traverse operations offer substantial optimization potential. Scheduling the core operations is formulated as an optimization problem, requiring feasible traverse operations to be integrated using a decomposition-based motion planning strategy. Several solution techniques are explored, including a sampling heuristic, tree-based search and gradient-free optimization. For motion planning, a decomposition method is proposed that identifies specific areas in the schedule, which can be solved independently with modified centralized path planning algorithms. The proposed method generates efficient and collision-free multi-robot assembly procedures that outperform a baseline relying on decentralized, robot-individual motion planning. Its effectiveness is demonstrated through simulation experiments.

2507.16880 2026-05-29 cs.CV cs.AI cs.LG

Finding DoRI: Discovery of Retained Images in Diffusion Models

Antoni Kowalczuk, Dominik Hintersdorf, Lukas Struppek, Kristian Kersting, Adam Dziedzic, Franziska Boenisch

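AI summary: Challenges the assumption that memorization can be localized. Small perturbations to text embeddings can re-trigger data replication after pruning, indicating that memorization is not inherently local, and adversarial fine-tuning yields more robust mitigation.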

Comments Published at ICML 2026

Abstract

Text-to-image diffusion models (DMs) have achieved remarkable success in image generation. However, concerns about data privacy and intellectual property remain due to their potential to inadvertently memorize and replicate training data. Recent mitigation efforts have focused on identifying and pruning weights responsible for triggering verbatim training data replication, based on the assumption that memorization can be localized. We challenge this assumption and demonstrate that, even after such pruning, small perturbations to the text embeddings of previously mitigated prompts can re-trigger data replication, revealing the fragility of such methods. Our further analysis then provides multiple indications that memorization is indeed not inherently local: (1) replication triggers for memorized images are distributed throughout text embedding space; (2) embeddings yielding the same replicated image produce divergent model activations; and (3) different pruning methods identify inconsistent sets of memorization-related weights for the same image. Finally, we show that bypassing the locality assumption enables more robust mitigation through adversarial fine-tuning. These findings provide new insights into the fundamental nature of memorization in text-to-image DMs and inform the future development of more reliable mitigation methods against DM memorization.
