arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.23597 2026-05-25 cs.CL cs.LG

Structure-Guided Entity Resolution: Fine-Tuning LLMs for Robust Name Matching in Complex Linguistic Contexts

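Shivam Chourasia, Hitesh Kapoor, Nilesh Patil

Affiliations: Dream Sports

AI summary: This paper studies entity resolution for person-name matching in linguistically and culturally complex settings. It proposes Structure-Guided Entity Resolution (SGER), a framework that uses two-phase curriculum fine-tuning to strengthen an LLM's understanding of name structure and semantics and thereby improve matching accuracy. On Indian identity data, a highly diverse and noisy real-world setting, the method reaches 99.02% accuracy and has been deployed in production, supporting its effectiveness and robustness in large-scale multilingual systems.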

Comments Accepted to ACL 2026. 8 pages, 1 figure, 2 tables

Abstract

Matching person names across heterogeneous records is a core challenge in entity resolution, especially within linguistically and culturally complex environments. Variations in naming conventions, inconsistent transliteration across scripts, and frequent data entry errors make it difficult to unify user identities, an essential requirement for Know Your Customer (KYC) compliance. While Large Language Models have shown promise in understanding natural language, they often struggle with the structured ambiguity present in such domain-specific settings. This paper introduces Structure-Guided Entity Resolution (SGER), a novel framework that fine-tunes an LLM through a two-phase curriculum. The model is first trained to parse the grammatical and semantic structure of personal names, then optimized for the downstream task of binary entity matching. We evaluate SGER in the challenging context of Indian identity data, one of the most linguistically diverse and noisy environments globally. SGER achieves 99.02% accuracy and an F1 of 0.994 on a held-out set of 50,000 real-world pairs, outperforming GPT-4o few-shot prompting and single-stage fine-tuning baselines. The system is fully deployed in production at Dream11, the world's largest fantasy sports platform, serving 250M+ users. Our results demonstrate that curriculum-guided training enables robust, high-precision entity resolution in real-world multilingual systems at scale.

2605.23592 2026-05-25 cs.AI

Solving the Aircraft Disassembly Scheduling Problem

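Charles Thomas, Pierre Schaus

Affiliations: Institute of Information and Communication Technologies, Electronics and Applied Mathematics (ICTEAM), UCLouvain

AI summary: This paper studies the scheduling problem that arises when dismantling end-of-life aircraft. The problem involves a large number of tasks and several kinds of constraints, and it matters for making disassembly both sustainable and profitable. The paper proposes two solution approaches, a Constraint Programming model and a Mixed-Integer Programming model, and tests them on instances of up to 1450 tasks built from real data supplied by an industrial partner.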

Abstract

Dismantling aircraft that have reached their end of life is a complex endeavour that is necessary for sustainability but yields small income margins for air transport companies. Efficient scheduling of the disassembly procedure is thus crucial to ensure the profitability of the process and to incentivize the practice. This is a large scheduling problem that involves thousands of tasks and many different constraints. Extracting parts that are destined to be reused requires technicians with specific certifications and equipment. Extraction operations might be subject to precedence relations. Furthermore, the aircraft must be kept balanced during the whole process. Finally, some locations in the aircraft have limited space, which caps the number of technicians able to work there concurrently. This article presents the problem in detail and proposes two approaches to solve it: a Constraint Programming model and a Mixed-Integer Programming (MIP) model. The models are tested on instances of varying sizes involving up to 1450 tasks, which are based on real operational data provided by an industrial partner.
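
Two of the constraints named in the abstract, precedence relations and the per-location cap on concurrent technicians, can be illustrated with a small schedule checker. This is only a sketch under stated assumptions: the `Task` fields, the integer time grid, and the function name are hypothetical, and certifications, equipment, and aircraft balance are left out. It is not the paper's CP or MIP model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    # All field names are hypothetical, chosen for illustration only.
    name: str
    start: int        # start time on an integer time grid
    duration: int
    location: str     # zone of the aircraft where the task takes place
    technicians: int  # technicians occupied for the whole duration


def violations(tasks, precedences, capacity):
    """Return a list of violated constraints for a fixed schedule.

    precedences: iterable of (before, after) task-name pairs.
    capacity: dict mapping location -> max concurrent technicians.
    """
    by_name = {t.name: t for t in tasks}
    found = []
    # Precedence: the predecessor must finish before the successor starts.
    for before, after in precedences:
        b, a = by_name[before], by_name[after]
        if b.start + b.duration > a.start:
            found.append(f"precedence {before} -> {after}")
    # Capacity: at every time step, sum technicians per location.
    horizon = max((t.start + t.duration for t in tasks), default=0)
    for time in range(horizon):
        load = {}
        for t in tasks:
            if t.start <= time < t.start + t.duration:
                load[t.location] = load.get(t.location, 0) + t.technicians
        for loc, used in load.items():
            if used > capacity.get(loc, 0):
                found.append(f"capacity {loc} at t={time}")
    return found
```

A solver would search over start times; a checker like this only validates a candidate schedule, which is useful for testing a model's output on small instances.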

2605.23590 2026-05-25 cs.AI

Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

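Jiazheng Kang, Bowen Zhang, Zixin Song, Jiangwang Chen, Xiao Yang, Da Zhu, Guanjun Jiang

Affiliations: Qwen Applications Business Group of Alibaba; Tsinghua University

AI summary: Co-ReAct is a rubric-guided action-selection framework that aims to improve how ReAct agents make decisions in multi-step reasoning tasks. At each reasoning step it injects a rubric as guidance, specifying what the agent should target in evidence seeking, reasoning, or self-evaluation, so that trajectories become deeper and better targeted. With a dedicated rubric generator trained against a multi-judge consensus-ranking objective, Co-ReAct improves results on several benchmarks without modifying the agent's underlying decision mechanism.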

Abstract

ReAct-style agents for search-intensive, multi-step reasoning tasks rely largely on their own internal judgment to decide what evidence to seek, which reasoning or action step to take next, and when to stop, often producing shallow, redundant, or poorly targeted trajectories. Prior work has explored rubrics as external quality signals, but existing uses are mostly evaluative rather than action-guiding: rubrics typically serve as training-time rewards or post-hoc evaluators of completed outputs, and in deep-research settings they are often coarse-grained and report-level rather than step-level. We introduce Co-ReAct, a rubric-guided action-selection framework that uses rubrics as step-level guidance during inference. At each decision step, Co-ReAct injects a rubric into the agent's context to guide the next Reason-or-Act decision, specifying what the agent should target in evidence seeking, search, reasoning, or self-evaluation. To make this guidance reliable, we train a dedicated rubric generator with GRPO. Unlike prior pairwise or binary preference formulations, our objective optimizes a list-wise Spearman rank-correlation reward against multi-judge expert consensus rankings, encouraging rubrics that are discriminative rather than merely plausible. On DeepResearchBench and SQA-CS-V2, Co-ReAct consistently improves over ReAct and representative test-time compute baselines across search agents built on both 8B/14B open-source and frontier closed-source base models. The trained rubric generator can also serve as a drop-in component that improves these baselines without changing their underlying decision mechanisms. Our code is publicly available at https://github.com/ZBWpro/Co-ReAct.
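
The list-wise Spearman reward mentioned above can be sketched in a few lines. The sketch assumes that a rubric induces a score for each item in a list of candidates and that the consensus ranking has no ties; both are assumptions, and the function names are hypothetical. The reward shaping used in the paper may differ.

```python
def ranks(scores):
    """Rank positions (1 = highest score); assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman_reward(rubric_scores, consensus_ranks):
    """Spearman rank correlation between rubric-induced and consensus ranks."""
    n = len(rubric_scores)
    if n < 2 or n != len(consensus_ranks):
        raise ValueError("need two lists of equal length >= 2")
    predicted = ranks(rubric_scores)
    d2 = sum((p - c) ** 2 for p, c in zip(predicted, consensus_ranks))
    # Classic no-ties formula: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
    return 1 - 6 * d2 / (n * (n * n - 1))
```

A reward of this kind is 1 when the rubric orders the candidates exactly as the judges do and -1 when it reverses them, so a rubric that scores every candidate similarly gains little.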

2605.23583 2026-05-25 cs.RO cs.LG

How Many Training Samples Are Needed for the Inverse Kinematics Solutions by Artificial Neural Networks

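Dong-Won Lim

Affiliations: The University of Suwon

AI summary: This paper studies how many training samples an artificial neural network needs in order to solve robot inverse kinematics. It builds training sets of different sizes, trains feedforward networks, and evaluates their accuracy, convergence, and generalization, finding that beyond 125 samples the gain in model efficiency is no longer significant. The study offers guidance on sizing ANN training data to balance computational cost against model accuracy in practical robotic applications.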

Comments 14 pages, 5 figures

Abstract

Inverse Kinematics (IK) plays a critical role in robotic motion planning and control. The IK of a robot manipulator can be solved by conventional geometric, algebraic, or Jacobian-based methods, each of which has drawbacks. Artificial Neural Networks (ANNs) have become a promising alternative for approximating IK solutions due to their generalization ability and computational efficiency. This approach trains a network on only a few recorded samples of the end-effector position. However, a fundamental question remains: how many training samples are sufficient to achieve reliable and accurate IK predictions? This study investigates a mathematical framework relating the size of the training dataset to the accuracy of ANN-based IK solvers. Using an articulated robotic manipulator, we generate varying numbers of joint-position pairs to train feedforward neural networks and assess their accuracy, convergence, and generalization capability. The results reveal that using more than 125 training samples did not improve model efficiency, a comparable measure that relates approximation accuracy to sample size, offering valuable insight into data efficiency. This work provides practical guidance for optimizing the data sizing of ANN solutions, balancing computational cost and model accuracy for real-world robotic applications.
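
To make the notion of joint-position pairs concrete, the sketch below generates them for a two-link planar arm by forward kinematics. The arm geometry, link lengths, angle ranges, and function names are all assumptions for illustration; the abstract does not specify the manipulator or the sampling scheme.

```python
import math


def forward_kinematics(theta1, theta2, l1=1.0, l2=1.0):
    """End-effector (x, y) of a planar two-link arm (hypothetical geometry)."""
    x = l1 * math.cos(theta1) + l2 * math.cos(theta1 + theta2)
    y = l1 * math.sin(theta1) + l2 * math.sin(theta1 + theta2)
    return x, y


def make_dataset(samples_per_joint):
    """Grid-sample both joints over [0, pi] and pair positions with joints.

    Each entry is ((x, y), (theta1, theta2)): the network input is the
    position and the regression target is the joint configuration.
    """
    steps = max(samples_per_joint - 1, 1)
    pairs = []
    for i in range(samples_per_joint):
        for j in range(samples_per_joint):
            t1 = math.pi * i / steps
            t2 = math.pi * j / steps
            pairs.append((forward_kinematics(t1, t2), (t1, t2)))
    return pairs
```

Varying `samples_per_joint` changes the dataset size, which is the quantity whose effect on accuracy the study examines.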

2605.23580 2026-05-25 cs.CV

Calibration-Informative Region Selection for Online LiDAR--Camera Calibration in Agricultural Environments

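Rajitha de Silva, Grzegorz Cielniak

Affiliations: Lincoln Institute for Agri-Food Technology, University of Lincoln, UK

AI summary: This paper studies the selection of calibration-informative regions for online LiDAR-camera calibration in agricultural environments. It proposes a support-map-driven approach to multi-modal calibration that splits the process into four blocks: initial calibration, cross-modal residual extraction, support-map estimation, and support-aware refinement. Combining the target-less calibration method MDPCalib with the dense matching model CMRNext, it builds a dense calibration support map that identifies regions where calibration evidence is reliable. Experiments use the Bacchus Long-Term and KITTI datasets; on KITTI, support-guided refinement improves calibration, mainly in the translation parameters.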

Comments Accepted to ICRA 2026 Workshop on Agricultural Robotics

Abstract

Reliable multi-modal calibration requires identifying which observations truly constrain the extrinsic parameters and which ones mainly add noise or ambiguity. In this paper, we propose a support-map-driven approach to multi-modal calibration that decouples four functional blocks: initial calibration, cross-modal residual extraction, support-map estimation, and support-aware refinement. We instantiate this formulation for online LiDAR--camera calibration using MDPCalib, a target-less LiDAR--camera calibration method based on motion and deep point correspondences, and CMRNext, a dense LiDAR--camera matching model that predicts optical-flow-like image-plane residuals. The key contribution is a dense calibration support map that aggregates cross-modal agreement over aligned observations and highlights where calibration evidence is consistently reliable. Across the Bacchus Long-Term (BLT) dataset and KITTI, we show that calibration evidence is spatially and semantically non-uniform, indicating that some semantic regions provide stronger cues for calibration than others. On KITTI, support-guided refinement improves the calibration performance with better translation accuracy while rotational gains remain limited.

2605.23574 2026-05-25 cs.LG cs.SE

Push Your Agent: Measuring and Enforcing Quantitative Goal Persistence in Long-Horizon LLM Agents

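Yuandao Cai, Yuzhang Zhu, Liyou Gao, Wensheng Tang, Shengchao Qin

Affiliations: Independent Researcher; Xidian University

AI summary: This paper studies Quantitative Goal Persistence (QGP) in long-horizon language agents: whether an agent keeps working until an external verifier confirms that enough valid items have been completed. The authors introduce the PushBench benchmark, which directly measures repeated work, duplicate submissions, false completion, and related failures. In their experiments, controllers based on state tracking and work-unit tracking reduce duplicate submissions and raise completion rates, while current frontier agents succeed far less often as the requested count grows, showing that quantitative goals place additional demands on agent reliability.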

Abstract

Long-horizon language agents can make many plausible local tool calls yet fail to persist until a requested count is actually complete. We study this gap as Quantitative Goal Persistence (QGP): whether an agent keeps working until an external verifier confirms enough distinct valid items. PushBench turns this into a benchmark for repository-artifact collection and verifier-backed work units, so repeated work, duplicate submissions, false completion, and progress drift are measured directly rather than hidden behind a final success flag. In matched controller comparisons, a state-tracking retrieval controller reaches 69-78% success while eliminating duplicate submissions, and a backlog-tracking work-unit controller reaches 25-50% success in settings where standard and completion-gated controllers complete no task instances. Black-box frontier-agent evaluations with Claude Code (Sonnet 4.6) and Codex CLI (gpt-5.4) solve many 50-artifact tasks but drop to 3 out of 9 successes per condition at 100 artifacts. The results show that quantitative goals stress a different reliability requirement from local task competence: agents must maintain verified progress and stop only when the requested work is complete.
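
The behaviour that QGP asks for (track verified progress, avoid duplicate submissions, stop only when the requested count is confirmed) can be sketched as a simple loop. This is a hypothetical illustration, not the PushBench controllers: `collect`, its arguments, and the `verify` callback are invented names.

```python
def collect(candidates, verify, target):
    """Submit candidates until `target` distinct items pass `verify`.

    Returns (verified_items, submissions_made, completed).
    """
    verified = []
    seen = set()
    submissions = 0
    for item in candidates:
        if len(verified) >= target:
            break               # stop only once the verified count is met
        if item in seen:
            continue            # skip duplicates instead of resubmitting
        seen.add(item)
        submissions += 1
        if verify(item):        # progress counts only if the verifier agrees
            verified.append(item)
    return verified, submissions, len(verified) >= target
```

The third return value separates true completion from running out of candidates, which is the distinction a final success flag alone would hide.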

2605.23569 2026-05-25 cs.AI

CP or DP? Why Not Both: A Case Study in the Partial Shop Scheduling Problem

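Emma Legrand, Roger Kameugne, Pierre Schaus

Affiliations: ICTEAM, UCLouvain, Belgium

AI summary: This paper studies how to combine Dynamic Programming (DP) and Constraint Programming (CP) to solve the Partial Shop Scheduling Problem (PSSP). The authors propose a hybrid in which DP is the primary search framework and CP provides global constraint propagation. The approach supports arbitrary precedence constraints, works with anytime DP strategies, and enables a DP-based Large Neighborhood Search scheme, demonstrating that integrating DP and CP is viable for combinatorial optimization.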

Abstract

Dynamic Programming (DP) and Constraint Programming (CP) are well-established paradigms for solving combinatorial optimization problems. Usually, these two approaches are used separately. This paper aims to show that the two can be combined effectively and elegantly, with DP serving as the primary search framework and CP used as a subroutine to leverage global constraint propagation. This paper presents such an approach for the Partial Shop Scheduling Problem (PSSP), for which a pure DP method has previously been proposed, and efficient CP filtering algorithms are available. The PSSP is a general scheduling problem where each job consists of a set of operations with arbitrary precedence constraints. The approach is flexible enough to accommodate anytime DP strategies, such as anytime column search, whereas the original DP algorithm operated in a strictly layer-wise manner. Moreover, the flexibility of the CP modeling makes it straightforward to incorporate arbitrary precedence constraints. As a result, the model naturally handles any precedence graph and even enables the design of a Large Neighborhood Search (LNS) scheme, in which the DP model is reused, and partial-order schedules are imposed across restarts to improve the incumbent solution. While not competitive with state-of-the-art pure CP solvers for this specific problem, our primary contribution is demonstrating the viability of this hybrid integration.

2605.23568 2026-05-25 cs.RO cs.SY eess.SY

TactileReflex: Noise-Statistics-Driven Vision-Tactile Reflex Control for Force-Sensitive Manipulation

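Ziyan Feng, Yulong Fu, Zheng Li, Yuxin He, Jieji Ren, Lujia Wang, Jinni Zhou, Yudong Zhong, Qiang Nie

Affiliations: Thrust of Robotics and Autonomous Systems, The Hong Kong University of Science and Technology (Guangzhou); School of Mechanical Engineering, Shanghai Jiao Tong University

AI summary: This paper proposes TactileReflex, a vision-tactile reflex control method driven by noise statistics, for force-sensitive manipulation such as grasping and handling liquid-filled plastic cups. The method derives the controller thresholds directly from the intrinsic noise characteristics of the tactile sensor, without external force calibration or manual tuning. Experiments show that TactileReflex prevents irreversible container deformation and achieves a high success rate in a dynamic pouring task, suggesting that it can serve as a safety layer beneath higher-level manipulation systems.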

Comments 8 pages, 4 figures, 6 tables

Abstract

Manipulating fragile deformable containers, such as disposable plastic cups filled with liquid, demands real-time grip-force adaptation within an extremely narrow force margin: insufficient force causes slip, while excessive force irreversibly deforms the thin wall. Existing approaches struggle to achieve such force-sensitive manipulation tasks. We propose a noise-statistics-based calibration-driven reflex control paradigm with vision-based tactile sensing: by analyzing the sensor's intrinsic noise characteristics (via a brief static-hold-and-unload protocol), we directly derive all controller thresholds, eliminating external force calibration, trial-and-error manual tuning, or material-specific physical models. Instantiating this paradigm, we present TactileReflex, a three-channel closed-loop controller that extracts three image-level proxies, shear intensity ($S_y$), contact intensity ($F_n$), and center of pressure ($C$), from dual visuo-tactile sensors and drives prioritized reflex channels at ~12 Hz for slip suppression, weight-adaptive release, and force protection. Each channel closes the loop directly on its proxy via noise-derived thresholds. Ablation demonstrates that only the full three-channel system is able to prevent irreversible container deformation (5/5 success vs. at most 1/5 for partial configurations). In a dynamic pouring task, fixed-effort baselines fail in all 10 attempts due to pose drift, while TactileReflex achieves 9/10 success across two water volumes. As a self-contained and interpretable controller, TactileReflex can serve as a plug-and-play safety layer beneath high-level manipulation pipelines, including haptic-free VR teleoperation and vision-language-action (VLA) policies.
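
The abstract does not state how thresholds are computed from the sensor's noise statistics. One common choice, assumed here purely for illustration, is a k-sigma rule over proxy readings recorded during a static hold. The function name and the default `k` are hypothetical and not taken from the paper.

```python
import statistics


def noise_threshold(static_samples, k=3.0):
    """Threshold = mean + k * sample standard deviation of static readings.

    Assumption: a proxy value above this level is unlikely to be sensor
    noise alone. The paper's actual derivation may differ.
    """
    mean = statistics.fmean(static_samples)
    std = statistics.stdev(static_samples)  # needs at least two samples
    return mean + k * std


def reflex_triggered(proxy_value, threshold):
    """Fire a reflex channel when its proxy exceeds the noise-derived level."""
    return proxy_value > threshold
```

Because the threshold comes from the recorded noise rather than from a force model, re-running the static hold is enough to adapt it to a different sensor, which matches the calibration-free motivation in the abstract.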

2605.23565 2026-05-25 cs.LG cs.AI

Understanding Goal Generalisation in Sequential Reinforcement Learning

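Jason Ross Brown, Edward James Young

Affiliations: University of Cambridge; Geodesic Research

AI summary: This study examines how sequentially trained reinforcement learning agents generalise their goals to new environments and how training history shapes that behaviour. Studying over 100 sequential training pipelines evaluated across over 250 out-of-distribution environments, it finds that salient features and goals learnt early in training strongly influence later generalisation. It introduces latent policy gradients, a method that predicts the out-of-distribution behaviour a training pipeline is likely to induce, with strong predictive accuracy, generalisation to unseen pipeline types, and interpretability, laying groundwork for a developmental view of goal generalisation.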

Abstract

Reinforcement learning agents often exhibit unintended goal-directed behaviour outside their training distribution, but we currently lack a principled understanding of how such agents will generalise to novel environments based on their training history. We address this gap for agents trained sequentially on one or more tasks. We study over 100 sequential training pipelines, evaluating behaviour across over 250 out-of-distribution environments. We find that salient features drive generalisation, and that goals learnt early in training can persist and influence those acquired later. To explain these phenomena, we introduce latent policy gradients, a method that predicts what out-of-distribution behaviour a training pipeline will likely induce. Our method simulates the evolution of low-dimensional latent variables during training according to what would achieve high reward on the training objective with respect to a simple model of how the latent variables map to behaviour. It achieves strong predictive accuracy, generalises to unseen types of training pipeline, and is interpretable. Our findings demonstrate that while out-of-distribution RL agent behaviour is dependent on the whole training pipeline, this dependence has an underlying structure we can capture, laying groundwork for understanding goal generalisation from a developmental perspective.

2605.23563 2026-05-25 cs.LG

MARS: Magnitude-Aware Rank Statistics

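Muhammad Rajabinasab, Afsaneh M. Nejad, Arthur Zimek

Affiliations: University of Southern Denmark

AI summary: Faithfully reflecting performance differences is an important problem in the comprehensive evaluation of machine learning models. Standard Critical Difference (CD) diagrams rely on discrete ranks and ignore the magnitude of performance gaps between models, a problem the authors call magnitude-blindness. The paper proposes Magnitude-Aware Rank Statistics (MARS), which weights the discrete ranks by a relative margin coefficient so that the statistics reflect performance differences more realistically and give more insight across broad experimental settings.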

Comments Preprint submitted to Elsevier Pattern Recognition Letters

Abstract

Comprehensive evaluation of machine learning models is key to ensuring that they perform as robustly and consistently as desired. To summarize experimental results and pick a winner, Critical Difference (CD) diagrams are commonly used. Standard CD diagrams rely on discrete ranks and discard the magnitude of the performance gaps between models, an issue we call magnitude-blindness. To address it, we propose Magnitude-Aware Rank Statistics (MARS), which incorporates a relative margin coefficient as a weight for the discrete ranks. This coefficient scales ranks based on the distance between the best and worst performers, with a dynamic projection to handle boundary cases. Combined with the subsequent calculation of a CD value, MARS yields a more realistic statistical representation of differences in model performance and more insight into how methods actually perform across extensive experimental settings.
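
Magnitude-blindness is easy to reproduce with the standard rank-based procedure. The sketch below computes average ranks and the Nemenyi critical difference, CD = q_alpha * sqrt(k(k+1) / (6N)), which is the usual basis of CD diagrams. It does not implement the MARS coefficient, whose formula the abstract does not give; function names are hypothetical, and tie handling is omitted.

```python
import math


def average_ranks(scores):
    """scores[d][m] = score of model m on dataset d (higher is better).

    Returns the mean rank of each model; assumes no ties within a dataset.
    """
    n_models = len(scores[0])
    totals = [0.0] * n_models
    for row in scores:
        order = sorted(range(n_models), key=lambda m: -row[m])
        for rank, m in enumerate(order, start=1):
            totals[m] += rank
    return [t / len(scores) for t in totals]


def nemenyi_cd(q_alpha, n_models, n_datasets):
    """Critical difference for the Nemenyi post-hoc test.

    q_alpha must be looked up in a studentized-range-based table.
    """
    return q_alpha * math.sqrt(n_models * (n_models + 1) / (6 * n_datasets))
```

Two result tables with the same ordering, for example scores of (0.90, 0.89, 0.88) versus (0.90, 0.50, 0.10) on each dataset, produce identical average ranks, so a rank-only diagram cannot tell a near-tie from a large gap. For three models at a significance level of 0.05, the tabulated `q_alpha` is commonly given as 2.343.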

2605.23559 2026-05-25 cs.CV cs.AI

PathNavigate: A Training-Free Pathology Agent with Surprise-Guided Scan and Shared Slide Memory for Whole-Slide Image VQA

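Chunze Yang, Qidong Liu, Wenjie Zhao, Yue Tang, Jiusong Ge, Di Zhang, Jiashuai Liu, Lei Wu, Junbo Lu, Ni Zhang, Xian Wu, Zeyu Gao, Chen Li

Affiliations: School of Comp. Science & Technology, Xi’an Jiaotong University; Tencent Jarvis Lab; University of Cambridge

AI summary: PathNavigate is a training-free pathology agent for whole-slide image visual question answering (WSI-VQA), a task that requires locating key pathological evidence efficiently under a limited inspection budget. It follows a scan-search-readout routine: a shared online memory module produces a pool of abnormal regions, and question-conditioned relevance then selects high-magnification targets within that pool, improving answer accuracy and interpretability. Experiments show that, with all models kept frozen, PathNavigate achieves higher efficiency and more interpretable evidence-selection trajectories.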

Abstract

Whole-slide image visual question answering (WSI-VQA) frames pathology as an extreme-context search problem: to answer a free-form clinical query, a system must first navigate a gigapixel slide under a strict inspection budget to locate sparse, high-resolution evidence. Existing approaches largely fall into two paradigms: i) supervised pathology multimodal large language models (MLLMs) and agents can absorb localization and reasoning into learned modules, but they often couple navigation to task-specific supervision and retraining, limiting their practicality; ii) training-free pathology agents avoid this cost by keeping core models frozen, but often follow a question-first design, constructing the initial candidate set mainly from query-conditioned relevance. This can miss decisive morphology that is not named in the question and force heavier inference-time scaffolding. To address this challenge, we introduce PathNavigate, a training-free pathology agent built around a scan-search-readout routine. Before question matching, PathNavigate scans the current slide at low magnification with a shared online memory module over frozen pathology features, producing a slide-specific surprise field that marks an abnormal-region pool. It then applies question-conditioned PLIP relevance only within this pool to select high-magnification search targets. Finally, it extracts local high-magnification evidence and answers with a frozen perceptor-adjudicator stack, using the same online memory as slide-level context. Experiments on WSI-VQA and SlideBench-BCNB show that the proposed scan-search-readout design improves answer accuracy and yields more interpretable evidence-selection trajectories with higher efficiency. The code is available online.

2605.23556 2026-05-25 cs.LG cs.IR math.CO

Is Dimensionality a Barrier for Retrieval Models?

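Kiril Bangachev, Guy Bresler, Jonathan Kogan, Yury Polyanskiy

Affiliations: Department of Electrical Engineering and Computer Science

AI summary: This paper asks why modern embedding-based retrieval models can handle billions or even trillions of data points with low-dimensional representations (around 1000 dimensions). It studies maximal-margin embeddings: given a query-document relevance matrix, how large a margin can be achieved in a limited number of dimensions. The paper proves that, when the relevance matrix contains every $k$-sparse row exactly once, dimension $d = O(k \log(n/k))$ suffices to attain the maximal possible margin, resolving the dimension requirement of that model, and it shows experimentally that the sigmoid loss has an advantage in producing large-margin embeddings.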

Abstract

Why does the low dimensionality of representations, typically $d\approx 1000$, not prevent modern embedding-based retrieval models from scaling to billions, or even trillions, of data points? To answer this question, we study maximal-margin embeddings in the following retrieval model, classically studied in communication complexity [PS86] and more recently in embedding-based retrieval [WBNL26]. Let $A\in \{0,1\}^{N\times n}$ be a matrix indicating whether each of $N$ queries is relevant to each of $n$ documents. We are interested in the largest margin $m>0$, denoted by $\mathsf{m}^{\mathsf{rd}}(d, A)$, for which there exist unit-norm embeddings of the queries and documents, $\{U_j\}_{j = 1}^N$ and $\{V_i\}_{i = 1}^n$, with the following property: $\langle U_j, V_i\rangle \ge m$ whenever $A_{ji} = 1$ and $\langle U_j, V_i\rangle \le -m$ otherwise. A large margin is a key proxy for representation quality: it controls both robustness to perturbations and compositional generalization across queries. Our main theorem establishes that the best possible margin without a restriction on the dimension, $\mathsf{m}^{\mathsf{rd}}(+\infty, A)$, can be nearly achieved in dimension $d = O(\mathsf{m}^{\mathsf{rd}}(+\infty, A)^{-2}\log n)$, which improves a theorem of [BDES02]. Together with a matching lower bound in Theorem 1.5, we conclude that when $A\in \{0,1\}^{\binom{n}{k}\times n}$ is the matrix containing all possible $k$-sparse rows once, dimension $d = O(k\log (n/k))$ is necessary and sufficient for the maximal possible margin $\mathsf{m}^{\mathsf{rd}}(+\infty, A) = \Theta(k^{-1/2})$ in this setting. This fully resolves the setup of [WBNL26]. We also give several constructions for large margins when $d = o(k\log (n/k))$. Finally, we empirically test the InfoNCE and sigmoid losses for producing large-margin embeddings and demonstrate a clear advantage of the sigmoid loss.
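
The margin definition above can be checked numerically for any given set of embeddings. The helper below is a hypothetical illustration, not code from the paper: it returns the largest $m$ that satisfies both inequalities, which is the minimum signed inner product over all query-document pairs. It assumes the rows of `U` and `V` are already unit-norm and does not verify this.

```python
def margin(U, V, A):
    """Largest m with <U_j, V_i> >= m if A[j][i] == 1 and <= -m otherwise.

    U: list of query vectors, V: list of document vectors,
    A: 0/1 relevance matrix with len(U) rows and len(V) columns.
    A non-positive result means the embeddings fail to separate relevant
    from irrelevant pairs with a positive margin.
    """
    worst = float("inf")
    for j, u in enumerate(U):
        for i, v in enumerate(V):
            dot = sum(a * b for a, b in zip(u, v))
            signed = dot if A[j][i] == 1 else -dot
            worst = min(worst, signed)
    return worst
```

For two documents embedded at opposite points of the unit circle and two queries that each match one of them, the margin is 1. The paper's question is how this value degrades as the number of documents and the sparsity $k$ grow while the dimension stays fixed.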

2605.23555 2026-05-25 cs.CV

Generator-Refiner-Examiner: A Tri-Module Data Augmentation Framework for 3D Human Avatar Learning from Monocular Videos

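Gangjian Zhang, Jian Shu, Sicheng Yu, Wenhao Shen, Yu Feng, Hao Wang

Affiliations: Nanyang Technological University

AI summary: This paper studies the challenge of reconstructing photorealistic, animatable 3D human avatars from monocular videos. To address the difficulty existing methods have in capturing detail when data is scarce, it proposes TrioMan, a tri-module data augmentation framework with three cooperating components: the Generator creates diverse samples, the Refiner improves the quality of generated data, and the Examiner selects subject-consistent samples. Experiments show that the method outperforms state-of-the-art methods on the X-Humans and NeuMan benchmarks.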

Abstract

This paper addresses the challenge of reconstructing photorealistic and animatable 3D human avatars from monocular videos. While existing methods rely on combining per-subject optimization with generic human priors, they often fail to capture fine-grained details when training frames are limited. To mitigate this data scarcity, we propose TrioMan, a systematic tri-module framework for augmented 3D avatar learning. Our approach comprises three synergistic components. The Generator creates diverse unseen samples by imposing Gaussian perturbations on pose and camera. The Refiner improves the quality of generated data through one-step diffusion guided by texture and geometry cues. The Examiner selects subject-consistent samples using a dual-branch attention-based similarity evaluation. Experiments on the X-Humans and NeuMan benchmarks show that TrioMan outperforms state-of-the-art methods.

2605.23551 2026-05-25 cs.LG cs.AI

Goal-Conditioned Agents that Learn Everything All at Once

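Michael Matthews, Matthew Jackson, Michael Beukman, Thomas Foster, Alistair Letcher, Scott Fujimoto, Cédric Colas, Jakob Foerster

Affiliations: University of Oxford; McGill University; MIT; Inria

AI summary: This paper proposes LEO (Learning Everything all at Once), a method for making goal-conditioned reinforcement learning more efficient. By outputting the values and actions for every goal at once, it enables efficient parallel all-goals updates, addressing the high computational cost of conventional all-goals learning. Experiments show that LEO significantly outperforms other methods on goal-conditioned Craftax, is competitive on continuous control environments, and achieves a more than 250x speed-up over all-goals relabelling.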

Abstract

A goal-conditioned reinforcement learning agent exploring an environment will see a wealth of information throughout a trajectory, most of which is discarded when only performing on-policy updates with respect to the commanded goal. All-goals learning, where each transition is used for learning off-policy with respect to every goal, allows agents to extract maximal information; however, it is usually computationally infeasible when done via naive relabelling. This can be overcome by jointly outputting values and actions for every goal at once, allowing for efficient, parallel all-goals updates with a single pass through the network, in a process we call Learning Everything all at Once (LEO). We show that this approach significantly outperforms other methods on goal-conditioned Craftax and is competitive with existing baselines on continuous control environments, while achieving a >250x speed-up compared to all-goals relabelling. We then go on to show that this approach can be made even more powerful by using LEO as a teacher network, rather than a direct actor. We hope that, by unlocking all-goals learning at scale, LEO can serve as a useful tool for RL practitioners in complex environments. We open source our code.
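
The idea of learning from one transition with respect to every goal can be sketched in a tabular setting. The code below is a stand-in under stated assumptions, not the LEO network: goals are taken to be states, the reward is 1 when the next state equals the goal, and `q[state][action]` stores one value per goal, playing the role of a network head that outputs all goals at once.

```python
def all_goals_update(q, s, a, s_next, n_actions, alpha=0.5, gamma=0.9):
    """Off-policy Q-learning update for every goal from one transition.

    q[state][action] is a list with one value per goal (goals = states).
    """
    n_goals = len(q[s][a])
    for g in range(n_goals):
        if s_next == g:
            # Goal reached: reward 1 and no bootstrapping for this goal.
            target = 1.0
        else:
            # Reward 0; bootstrap from the best next action for goal g.
            target = gamma * max(q[s_next][b][g] for b in range(n_actions))
        q[s][a][g] += alpha * (target - q[s][a][g])
```

The loop over goals is what a vectorised implementation would compute in a single forward and backward pass, instead of replaying the transition once per relabelled goal.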

2605.23540 2026-05-25 cs.LG

When One Point Is Not Enough: Addressing Ambiguous Instances in Dimensionality Reduction by Splitting

当一点不够时:通过分裂解决降维中的模糊实例

Diede P. M. van der Hoorn, Alessio Arleo, Fernando V. Paulovich

Affiliations * Eindhoven University of Technology

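AI summary This paper studies the distortion of neighborhood structure that ambiguous data points cause in dimensionality reduction. It proposes a graph-based approach that identifies these ambiguous instances and replicates them, mapping each copy to a different position so that the projection better reflects the instance's multiple neighborhoods in the high-dimensional space. The method mitigates the loss of local structure caused by mapping each instance to a single point, and several examples show that it reveals previously hidden neighborhood relationships.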

详情
AI中文摘要

降维(DR)方法广泛用于可视化高维数据。基于DR的分析中的一个关键任务是发现邻域,这依赖于分析投影的细粒度局部结构。然而,DR本质上是一个有损过程;没有技术能完美保留高维关系,因此投影包含视觉伪影。在本文中,我们强调了一个通常被忽视的视觉伪影来源:模糊实例。这些实例与高维空间中多个相互不相似的邻域高度相似。标准DR方法无法忠实地投影此类实例,因为每个数据实例被映射到视觉空间中的一个单点。因此,这样的实例仅被放置在其一个邻域中(或根本不放置),因此仅表示其部分邻域结构。我们称这种失真为部分邻域嵌入。在本文中,我们引入了一种基于图的方法,该方法识别模糊实例并将其复制为投影中的多个点,将每个副本放置在其各自的邻域中。我们使用UMAP来展示结果,但我们的方法也推广到其他基于局部图的DR技术,并且我们表明,我们的方法揭示了投影中先前隐藏的邻域成员关系,减少了多个示例中的部分邻域嵌入,并得到了定量分析的支持。

英文摘要

Dimensionality Reduction (DR) methods are widely used to visualize high-dimensional data. One key task in DR-based analysis is discovering neighborhoods, which relies on analyzing the fine-grained local structure of a projection. However, DR is an inherently lossy process; no technique can perfectly preserve the high-dimensional relationships, and projections therefore contain visual artifacts. In this paper, we highlight a typically overlooked source of visual artifacts: ambiguous instances. These are instances that are highly similar to multiple mutually dissimilar neighborhoods in the high-dimensional space. Standard DR methods cannot faithfully project such instances, since each data instance is mapped to a single point in the visual space. As a result, such an instance is placed in only one of its neighborhoods (or in none at all), so only part of its neighborhood structure is represented. We call this distortion partial neighborhood embedding. In this paper, we introduce a graph-based approach that identifies ambiguous instances and replicates them as multiple points in the projection, placing each copy within its respective neighborhood. We use UMAP for our results, but our approach also generalizes to other local graph-based DR techniques. We show that it reveals previously hidden neighborhood memberships in projections and reduces partial neighborhood embedding across multiple examples, a finding further supported by quantitative analyses.
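To make the notion of an ambiguous instance concrete, here is a toy sketch. The criterion used (an instance is ambiguous when its k nearest neighbors fall into two or more groups that are not linked by k-nearest-neighbor edges) is an assumption for illustration and not necessarily the paper's definition; the data and function names are hypothetical.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest neighbors of every row of X (self excluded)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def neighbor_groups(i, nbrs):
    """Split the neighbors of i into groups connected by kNN edges among themselves."""
    members = [int(m) for m in nbrs[i]]
    parent = {m: m for m in members}

    def find(a):
        while parent[a] != a:
            a = parent[a]
        return a

    for a in members:
        for b in members:
            if a != b and (b in nbrs[a] or a in nbrs[b]):
                parent[find(a)] = find(b)

    groups = {}
    for m in members:
        groups.setdefault(find(m), []).append(m)
    return list(groups.values())

# Two tight clusters and one instance (index 8) equally close to both.
cluster_a = [(0, 0), (0, 1), (1, 0), (1, 1)]
cluster_b = [(10, 0), (10, 1), (11, 0), (11, 1)]
X = np.array(cluster_a + cluster_b + [(5.5, 0.5)], dtype=float)

nbrs = knn_indices(X, k=4)
groups_bridge = neighbor_groups(8, nbrs)    # two separate groups
groups_regular = neighbor_groups(0, nbrs)   # a single group
# One copy of the instance per neighbor group would then be placed in the projection.
n_copies = len(groups_bridge)
```

In this toy setting the bridging instance gets two copies, one per neighborhood, while an ordinary cluster member keeps a single point.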

2605.23523 2026-05-25 cs.CV

ComPose: When to Trust Hands for Object Pose Tracking

ComPose:何时信任手部进行物体姿态跟踪

Jisu Shin, Junoh Lee, JunGyu Lee, Inhwan Bae, Dohyeon Lee, Hokyun Im, Youngwoon Lee, Hae-Gon Jeon

Affiliations * GIST; Yonsei Univ.; DGIST

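AI summary This paper proposes ComPose, a 6DoF object pose tracking framework for robustly tracking hand-occluded objects in RGB video. Rather than treating the hand purely as an occluder, the method uses hand motion as a complementary cue: within a unified tracking pipeline it combines object and hand cues, adaptively selects informative hand joints, fuses the cues, and refines the result with geometric evidence to estimate stable and accurate object trajectories. Experiments show strong performance under severe occlusion and geometric ambiguity, and the method yields temporally consistent 3D trajectories without external smoothing, which suits downstream tasks such as robot manipulation.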

Comments 22 pages, 10 figures

详情
AI中文摘要

从视频中重建物体运动是具身AI和机器人操作的关键组成部分。尽管已经研究了多种物体姿态跟踪方法,但它们严重依赖强大的外部先验(如深度数据或3D模板),并且即使使用显式掩码,仍然极易受到手部抓取造成的严重遮挡的影响。在这项工作中,我们提出了ComPose,一个6DoF物体跟踪框架,旨在从RGB视频中进行手部感知的物体姿态估计。我们的方法不是将手部纯粹视为遮挡物,而是将手部运动协调为物体跟踪的补充线索。具体来说,我们通过在一个统一的跟踪流程中结合来自基础模型的物体和手部线索,随时间恢复多种物体运动。在此,ComPose自适应地选择信息丰富的手部关节,结合物体和手部衍生的线索进行运动估计,并使用可见的几何证据和学习到的校正来细化所得的物体运动。我们进一步在旋转和平移上强制时间一致性,从而在没有外部平滑的情况下产生稳定的3D物体轨迹。大量实验表明,我们的方法在严重手部遮挡和几何模糊下准确、高效且鲁棒。此外,所得的轨迹还可以通过使机器人能够从在线视频中重建人类动作,有效地转移到下游机器人操作中。

英文摘要

Reconstructing the motion of objects from videos is a key component for embodied AI and robot manipulation. While diverse approaches to object pose tracking have been studied, they rely heavily on strong external priors, such as depth data or 3D templates, and remain highly vulnerable to severe occlusions by hand grasps despite the use of explicit masks. In this work, we present ComPose, a 6DoF object tracking framework designed for hand-aware object pose estimation from RGB video. Rather than treating the hand purely as an occluder, our method harmonizes hand motions as a complementary cue for object tracking. In detail, we recover a variety of object motions over time by combining object and hand cues from foundation models within a unified tracking pipeline. Here, ComPose adaptively selects informative hand joints, combines object- and hand-derived cues for motion estimation, and refines the resulting object motion using visible geometric evidence and a learned correction. We further enforce temporal consistency over both rotation and translation, yielding stable 3D object trajectories over time without any external smoothing. Extensive experiments show that our method is accurate, efficient, and robust under severe hand occlusion and geometric ambiguity. In addition, the resulting trajectories can also effectively transfer to downstream robot manipulation by enabling robots to reconstruct human actions from online videos.

2605.23522 2026-05-25 cs.LG cs.AI cs.CV

Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models

Precise: 用于流匹配模型强化学习后训练的SDE一致随机采样

Jade Zou, Tao Huang, Weijie Kong, Junzhe Li, Yue Wu, Qi Tian, Jiangfeng Xiong, Jianwei Zhang, Liefeng Bo, Zhao Zhong

Affiliations * Peking University; Tencent Hunyuan

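AI summary This paper studies reinforcement learning (RL) post-training of flow-matching models to improve generation quality and prompt alignment. The key step is turning the deterministic sampling trajectory into a stochastic policy by designing a sampler that follows a stochastic differential equation (SDE) and balances exploration with stability. The proposed sampler, Precise, keeps the denoising trajectory SDE-consistent while reducing excess noise, and experiments show faster reward optimization and better generation quality than existing samplers.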

详情
AI中文摘要

强化学习已成为提升扩散和流匹配生成器中提示对齐和感知质量的有效方法。将在线强化学习应用于流匹配的关键步骤是将确定性采样轨迹转化为随机策略,通常通过用随机微分方程替代逆向常微分方程来实现。随机采样器控制探索行为和去噪动力学,因此是策略的一部分,其设计会显著影响奖励优化性能。我们将采样器设计分解为两个相互依赖的组成部分:选择适量的随机探索,以及在强化学习中使用的少量步数下忠实地离散化得到的SDE。针对第一个组成部分,我们分析了去噪过程中探索与稳定性之间的固有张力,并推导出平衡两者的SDE调度。针对离散化挑战,我们使用一个玩具示例表明,现有采样器可能偏离流匹配过程,要么引入过多的离散化噪声,要么依赖不能保证收敛到数据分布的启发式规则。为解决这些问题,我们提出了Precise,一种新的随机采样器,平衡了有效探索与稳定性。关键地,Precise通过一种冻结干净潜变量后验均值的新颖近似,使去噪轨迹保持SDE一致,解决了标准采样器中的过度噪声问题。大量实验表明,该公式通过强化学习实现了显著更快且更稳定的奖励优化,达到了最先进的对齐分数(例如PickScore、HPSv2.1),同时匹配先前采样器的最佳域内性能所需的训练时间减少了13.1-53.2%。

英文摘要

Reinforcement learning (RL) has become an effective way to improve prompt alignment and perceptual quality in diffusion and flow-matching generators. A critical step for applying online RL to flow matching is turning the deterministic sampling trajectory into a stochastic policy, typically by replacing the reverse-time Ordinary Differential Equation (ODE) with a Stochastic Differential Equation (SDE). The stochastic sampler, controlling the exploration behavior and denoising dynamics, is thus part of the policy, and its design can significantly affect the reward optimization performance. We break down the sampler design into two interdependent components: choosing the right amount of stochastic exploration, and discretizing the resulting SDE faithfully at the small step counts used in RL. To address the first component, we analyze the inherent tension between exploration and stability in denoising and derive an SDE schedule that balances the two. Turning to the discretization challenge, we use a toy example to show that existing samplers can deviate from the flow-matching process, either by introducing excessive discretization noise or by relying on heuristic rules that do not guarantee convergence to the data distribution. To address these issues, we propose Precise, a new stochastic sampler that balances effective exploration with stability. Crucially, Precise keeps the denoising trajectory SDE-consistent through a novel approximation that freezes the clean-latent posterior mean, resolving the excess noise issue in standard samplers. Extensive experiments demonstrate that this formulation leads to significantly faster and more stable reward optimization via reinforcement learning, achieving state-of-the-art alignment scores (e.g., PickScore, HPSv2.1) while requiring 13.1-53.2% less wall-clock training time to match the best in-domain performance of prior samplers.

2605.23518 2026-05-25 cs.CV

VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset

VINS-120K:基于大规模数据集的超高分辨率图像编辑

Zhizhou Chen, Shanyan Guan, Zhanxin Gao, En Ci, Yanhao Ge, Wei Li, Zhenyu Zhang, Jian Yang, Ying Tai

Affiliations * Nanjing University; vivo

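AI summary This paper presents VINS-120K, a large-scale dataset of 120K instruction-based image editing triplets in which every image exceeds 4K resolution, intended to advance research on ultra-high-resolution (UHR) image editing. It also proposes a high-frequency-aware post-adaptation strategy that lets existing models handle UHR images, and builds the VINS-4KEval benchmark for evaluating edits.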

详情
AI中文摘要

直接编辑超高分辨率(UHR)图像具有价值但尚未充分探索,主要由于缺乏高质量数据以及高频纹理细节建模的挑战。我们引入VINS-120K,首个用于基于指令的UHR图像编辑的大规模数据集,包含120K精心筛选的指令、输入图像和编辑图像三元组。每张图像超过4K分辨率(≥4096×4096),并通过严格的多阶段流水线过滤以确保视觉质量、指令对齐和美学保真度。基于VINS-120K,我们进一步开发了一种高频感知的后适应策略,将预训练的非高分辨率模型扩展到UHR领域。我们还提出了VINS-4KEval基准,涵盖多种编辑类型,以促进UHR设置下的一致评估。实验证实,我们的工作在UHR图像编辑中改善了细粒度细节合成和纹理真实感。

英文摘要

Directly editing ultra-high-resolution (UHR) images is valuable but underexplored, primarily due to the lack of high-quality data and the challenge in modeling high-frequency texture details. We introduce VINS-120K, the first large-scale dataset for instruction-based UHR image editing, comprising 120K carefully curated triplets of instruction, input image, and edited image. Each image exceeds 4K resolution (≥4096 × 4096) and is filtered through a rigorous multi-stage pipeline to ensure visual quality, instruction alignment, and aesthetic fidelity. Built on VINS-120K, we further develop a high-frequency-aware post-adaptation strategy to extend pretrained non-high-resolution models to the UHR regime. We also present VINS-4KEval, a benchmark covering diverse editing types, to facilitate consistent evaluation in UHR settings. Experiments confirm that our work improves fine-grained detail synthesis and texture realism in UHR image editing.

2605.23510 2026-05-25 cs.LG

Learning partially observed systems with neural Hamiltonian ordinary differential equations

学习部分观测系统:神经哈密顿常微分方程

Sunniva Meltzer, Sølve Eidnes, Alexander Johannes Stasik

Affiliations * Department of Mathematics and Cybernetics, SINTEF Digital; Department of Physics, University of Oslo; Department of Data Science, Norwegian University of Life Sciences

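AI summary This paper proposes neural Hamiltonian ordinary differential equations (NHODE), a framework for learning dynamical systems from partially observed data. It combines Hamiltonian neural networks with neural ODEs: the Hamiltonian structure enforces energy conservation, and the flexibility of neural ODEs allows the loss to be defined only on observed variables, so that unobserved variables can still be inferred. Experiments on several systems show higher prediction accuracy and long-horizon stability; NHODE captures both observed and latent dynamics and outperforms purely data-driven baselines.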

详情
AI中文摘要

从数据中学习动力系统时,嵌入物理结构可以约束解空间并提高泛化能力,但许多物理信息模型假设可以访问完整的系统状态。这限制了它们在部分观测场景中的使用,其中某些状态变量完全未被观测到,且必须在没有直接监督的情况下推断。在这里,我们提出了神经哈密顿常微分方程(NHODE),这是一个结合哈密顿神经网络(HNN)和神经常微分方程(neural ODE)的框架,用于从数据中学习部分观测的动力系统。哈密顿结构通过构造保证能量守恒,而神经常微分方程框架则提供了灵活的训练过程,使得损失可以仅定义在观测变量上。我们还通过对称性感知的坐标变换和可分离的能量公式,融入了额外的物理约束。该框架在复杂度递增的系统上进行了评估,从线性和非线性质量-弹簧系统到混沌三体问题。在所有示例中,嵌入的物理结构越多,预测的准确性和长期稳定性就越好。即使在最具挑战性的情况下,NHODE框架也能捕捉到观测和潜在动力学,而纯数据驱动的基线则变得不稳定。

英文摘要

When learning dynamical systems from data, embedding physical structure can constrain the solution space and improve generalization, but many physics-informed models assume access to the full system state. This limits their use in partially observed settings, where some state variables are completely unobserved and must be inferred without direct supervision. Here, we present neural Hamiltonian ordinary differential equations (NHODE), a framework that combines Hamiltonian neural networks (HNNs) with neural ordinary differential equations (neural ODEs) to learn partially observed dynamical systems from data. The Hamiltonian structure enforces energy conservation by construction, while the neural ODE framework enables a flexible training procedure that allows the loss to be defined only on observed variables. We also incorporate additional physical constraints through symmetry-aware coordinate transformations and separable energy formulations. The framework is evaluated on systems of increasing complexity, from linear and nonlinear mass-spring systems to the chaotic three-body problem. Across all examples, increasing the amount of embedded physical structure improves the accuracy and long-horizon stability of the predictions. Even in the most challenging regimes, the NHODE framework captures both observed and latent dynamics, whereas purely data-driven baselines become unstable.
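The combination of Hamiltonian dynamics with a loss on observed variables only can be sketched on a mass-spring system. This is a simplified stand-in, not the NHODE implementation: the potential is a fixed parametric form rather than a neural network, the integrator is a plain leapfrog scheme, and the position q is assumed to be observed while the momentum p is not. All names are hypothetical.

```python
import numpy as np

def grad_V(q, k):
    """Gradient of the assumed potential V(q) = 0.5 * k * q**2."""
    return k * q

def leapfrog(q, p, k, dt, steps):
    """Integrate the separable Hamiltonian H = 0.5 * p**2 + V(q)."""
    qs = [q]
    for _ in range(steps):
        p = p - 0.5 * dt * grad_V(q, k)
        q = q + dt * p
        p = p - 0.5 * dt * grad_V(q, k)
        qs.append(q)
    return np.array(qs), p

def observed_loss(params, q_obs, dt):
    """Mean squared error on positions only; momentum never enters the loss."""
    k, p0 = params  # p0 is the unobserved initial momentum, fitted jointly with k
    q_pred, _ = leapfrog(q_obs[0], p0, k, dt, len(q_obs) - 1)
    return float(np.mean((q_pred - q_obs) ** 2))

dt = 0.05
# Synthetic "observations": positions only, generated with k = 2.0 and p0 = 0.5.
q_obs, _ = leapfrog(1.0, 0.5, k=2.0, dt=dt, steps=100)

loss_true = observed_loss((2.0, 0.5), q_obs, dt)
loss_wrong = observed_loss((1.0, 0.0), q_obs, dt)
```

Minimizing `observed_loss` over `(k, p0)` recovers both the dynamics and the latent momentum from position data alone, which is the partially observed setting the abstract describes.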

2605.23507 2026-05-25 cs.CV

MDS-DETR: DETR with Masked Duplicate Suppressor

MDS-DETR: 带有掩码重复抑制器的DETR

Chanho Lee, Seunghee Koh, Yunho Jeon, Junmo Kim

Affiliations * Samsung Research; Korea Advanced Institute of Science and Technology; Department of Artificial Intelligence Software, Hanbat National University

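AI summary DETR is a powerful end-to-end object detector, but its one-to-one matching strategy suffers from slow convergence and low recall. MDS-DETR addresses this by combining one-to-one and one-to-many supervision within a single decoder. Its Masked Duplicate Suppressor (MDS), built on confidence-based causal masking, filters out the duplicate predictions produced by one-to-many supervision and yields explainable, duplicate-free predictions without extra queries or auxiliary decoders. On COCO, MDS-DETR reaches higher detection accuracy than existing methods with only a small increase in training time relative to Deformable-DETR.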

Comments Code is available at https://github.com/DChoLee/MDS-DETR

详情
AI中文摘要

DEtection TRansformer (DETR) 是一种强大的端到端目标检测器,但其一对一匹配策略存在收敛慢和召回率低的问题。解决此问题的常见方法是使用一对多标签分配以提供更多正样本。然而,现有使用一对多匹配作为辅助目标的方法会导致训练成本增加,且其辅助解码器在推理时被丢弃。为解决这一限制,我们提出MDS-DETR,它在单一解码器中同时利用一对一和一对多监督。具体来说,我们引入了一个掩码重复抑制器(MDS),通过基于置信度的因果掩码向自注意力注入不对称性。MDS过滤掉由一对多监督层生成的重复项,在完全端到端的框架中实现可解释、无重复的预测。MDS-DETR优于现有的一对多DETR变体,如MS-DETR、MR.DETR和Relation-DETR,且无需依赖任何额外的查询或辅助解码器。在MS COCO上使用ResNet-50骨干网络进行12轮训练,MDS-DETR相比Deformable-DETR实现了+2.8 mAP的提升,训练时间仅增加5%,并且比最先进的MR.DETR高出+0.3 mAP,同时训练速度甚至快20%。我们的代码和模型可在 https://github.com/DChoLee/MDS-DETR 获取。

英文摘要

The DEtection TRansformer (DETR) is a powerful end-to-end object detector, yet its one-to-one matching strategy suffers from slow convergence and low recall. A common approach to address this issue is to use one-to-many label assignment to provide more positive samples. However, existing methods that use one-to-many matching as an auxiliary objective lead to increased training costs, with their auxiliary decoders discarded during inference. To address this limitation, we propose MDS-DETR, which leverages both one-to-one and one-to-many supervision within a single decoder. Specifically, we introduce a Masked Duplicate Suppressor (MDS) that injects asymmetry into self-attention via confidence-based causal masking. MDS filters out the duplicates generated by the one-to-many supervised layer, enabling explainable, duplicate-free predictions in a fully end-to-end framework. MDS-DETR outperforms existing one-to-many DETR variants such as MS-DETR, MR.DETR and Relation-DETR, without relying on any additional queries or auxiliary decoders. Under a 12-epoch training schedule on MS COCO with a ResNet-50 backbone, MDS-DETR achieves a +2.8 mAP improvement over Deformable-DETR with only a 5% increase in training time, and outperforms the state-of-the-art MR.DETR by +0.3 mAP while being even 20% faster in training. Our code and models are available at https://github.com/DChoLee/MDS-DETR.
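One plausible reading of "confidence-based causal masking" is sketched below. The rule assumed here (each query attends to itself and to queries with strictly higher confidence) is an illustration only; the abstract does not spell out the exact masking rule, and the function names are hypothetical.

```python
import numpy as np

def confidence_causal_mask(conf):
    """mask[i, j] is True when query i may attend to query j.

    Assumed rule: a query sees itself and every query with strictly higher confidence.
    """
    conf = np.asarray(conf, dtype=float)
    higher = conf[None, :] > conf[:, None]
    return higher | np.eye(len(conf), dtype=bool)

def masked_attention(scores, mask):
    """Row-wise softmax over the allowed positions only."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

conf = [0.9, 0.2, 0.6]                # per-query confidence scores
mask = confidence_causal_mask(conf)
attn = masked_attention(np.zeros((3, 3)), mask)
```

The resulting mask is asymmetric: the most confident query sees only itself, while a low-confidence query can see the stronger ones and so has the information needed to recognize itself as a duplicate.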

2605.23504 2026-05-25 cs.LG cs.AI

VACE: Learning Geometrically Structured Representations for Time Series Anomaly Detection

VACE:学习几何结构化表示用于时间序列异常检测

Alberto D. Cencillo, Leonardo Concepción, Isaac Triguero, Julián Luengo

Affiliations * Andalusian Research Institute in Data Science and Computational Intelligence (DaSCI); Department of Computer Science and Artificial Intelligence (DECSAI), University of Granada

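AI summary This paper proposes VACE, a self-supervised method for anomaly detection in multivariate time series. Through velocity-aligned channel embeddings, VACE learns a compact and directionally coherent representation of normal behaviour, so that anomalies can be identified more precisely. It needs no negative samples or synthetic anomalies: the encoder is trained with a velocity-consistency objective so that normal trajectories stay locally smooth and aligned in the embedding space. Experiments show that VACE outperforms more complex methods on the TSB-AD-M benchmark.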

Comments 16 pages, 5 figures

详情
AI中文摘要

多变量时间序列中的异常检测是广泛实际应用中的关键任务,其中异常行为罕见、标签不可用且漏检成本高昂。核心挑战在于学习足够精确的正常性表征以标记偏差。表示自监督学习(通常通过对比方法)通过将时间补丁嵌入到潜在空间来解决这一问题,其中正常性占据一个定义明确的区域,异常通过几何偏差检测。然而,对比方法通过配对采样启发式间接塑造该空间,无法对基于距离评分所需的几何结构进行显式控制。这意味着正常表示的紧凑程度以及距离是否具有方向意义。我们提出VACE(速度对齐通道嵌入),一种自监督异常检测方法,将正常性表示为嵌入空间中紧凑且方向一致的区域。为此,VACE通过速度一致性目标训练通道感知编码器,无需负样本和合成异常,使得正常轨迹局部平滑且对齐。在测试时,马氏距离位置得分和速度库方向得分相乘,标记同时偏离分布和动态异常的点。尽管方法简单,VACE在严格评估下于TSB-AD-M上实现了最先进性能,显著优于使用更大预算训练的复杂方法。

英文摘要

Anomaly detection in multivariate time series is a critical task across a wide range of real-world applications, where abnormal behaviour is rare, labels are unavailable, and the cost of a miss is high. The central challenge is learning a characterisation of normality precise enough to flag deviations. Representation self-supervised learning, typically through contrastive approaches, addresses this by embedding temporal patches into a latent space where normality occupies a well-defined region, with anomalies detected by geometric deviation. However, contrastive approaches shape this space indirectly through pair-sampling heuristics, providing no explicit control over the geometric structure that distance-based scoring requires, namely how tightly normal representations are grouped and whether distances are directionally meaningful. We present VACE (Velocity-Aligned Channel Embeddings), a self-supervised anomaly detection method that represents normality as a compact, directionally coherent region in the embedding space. To this end, VACE trains a channel-aware encoder through a velocity-consistency objective, with no negatives and no synthetic anomalies, so that normal trajectories are locally smooth and aligned. At test time, a Mahalanobis positional score and a velocity-bank directional score are combined multiplicatively, flagging points that are simultaneously off-distribution and dynamically atypical. Despite its simplicity, VACE achieves state-of-the-art performance on TSB-AD-M under rigorous evaluation, significantly outperforming more complex methods trained on substantially larger budgets.
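The multiplicative combination of a positional and a directional score can be sketched as follows. The Mahalanobis distance is standard; the directional score used here (one minus the best cosine similarity against a bank of normal velocities) is an assumption for illustration, as are all function names and the toy embeddings.

```python
import numpy as np

def mahalanobis_score(z, mean, cov_inv):
    """Distance of embedding z from the normal region."""
    d = z - mean
    return float(np.sqrt(d @ cov_inv @ d))

def directional_score(v, bank):
    """Assumed form: 1 - max cosine similarity between v and a bank of normal velocities."""
    v = v / (np.linalg.norm(v) + 1e-12)
    bank = bank / (np.linalg.norm(bank, axis=1, keepdims=True) + 1e-12)
    return float(1.0 - np.max(bank @ v))

def anomaly_score(z_prev, z, mean, cov_inv, bank):
    """High only when z is both far from normal and moving in an atypical direction."""
    return mahalanobis_score(z, mean, cov_inv) * directional_score(z - z_prev, bank)

mean = np.zeros(2)
cov_inv = np.eye(2)
bank = np.array([[1.0, 0.0]])  # normal trajectories move along the first axis

origin = np.zeros(2)
score_aligned = anomaly_score(origin, np.array([0.5, 0.0]), mean, cov_inv, bank)
score_atypical = anomaly_score(origin, np.array([0.0, 3.0]), mean, cov_inv, bank)
```

Because the two scores are multiplied, a point that drifts away from the mean along a familiar direction is scored low, and only points that are both distant and moving atypically stand out.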

2605.23497 2026-05-25 cs.CL

Asking For An Old Friend: Diagnosing and Mitigating Temporal Failure Modes in LLM-based Statutory Question Answering

询问老朋友:诊断和缓解基于LLM的法定问答中的时间故障模式

Max Prior, Andreas Schultz, Matthias Grabmair

Affiliations * Technical University of Munich

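AI summary This study examines two temporal failure modes of LLM-based legal question answering: staleness after statutes are amended, and a bias toward newer provisions. The authors build a benchmark of 312 expert-validated German statutory QA pairs and evaluate several LLMs under different inference settings. Retrieval-augmented methods substantially improve temporal validity, whereas relying on web search alone gives unstable gains and shows recency bias; the study concludes that legal QA must treat temporal validity as a hard constraint.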

详情
AI中文摘要

大型语言模型越来越多地用于法律研究,但其固定的训练截止日期和对静态参数知识的依赖与成文法的演变性质相矛盾。我们研究了两种时间故障模式:截止后过时(模型在立法修正后应用被取代的规则)和近因偏差(即使历史版本支配事实模式,模型也偏好较新的规定)。为此,我们提出了一个包含312个专家验证、时间敏感的德国法定问答对的基准,涵盖三个类别:截止后修正问题、修正前问题和多条款修正前问题。我们评估了来自OpenAI、Anthropic和DeepSeek的五个LLM,在四种推理设置下:普通、网络搜索和两种检索增强变体(通过事实日期提取和版本过滤强制执行时间有效性)。使用经过人类专家评分验证的LLM作为评判,我们发现普通设置在截止后设置中性能严重下降。两种RAG方法在所有问题类型上均显著提高了性能,而网络搜索则产生不稳定的收益,并在历史锚定任务上表现出明显的近因偏差。我们的结果表明,可靠的法律问答需要将时间有效性视为硬约束。

英文摘要

Large language models are increasingly used for legal research, yet their fixed training cutoffs and reliance on static parametric knowledge are at odds with the evolving nature of statutory law. We study two temporal failure modes: post-cutoff staleness, where models apply superseded rules after legislative amendments, and recency bias, where models prefer newer provisions even when a historical version governs the fact pattern. To this end, we present a benchmark of 312 expert-validated, time-sensitive German statutory QA pairs spanning three categories: Post-Cutoff Amendment Questions, Pre-Amendment Questions, and Multi-Provision Pre-Amendment Questions. We evaluate five LLMs from OpenAI, Anthropic and DeepSeek under four inference settings: Vanilla, Web-search, and two retrieval-augmented variants that enforce temporal validity via fact-date extraction and version filtering. Using an LLM-as-a-judge validated against human expert ratings, we find severe degradation in the Vanilla post-cutoff setting. Both RAG approaches substantially improve performance across all question types, while web search yields unstable gains and exhibits a marked recency bias on historically anchored tasks. Our results indicate that reliable legal QA requires treating temporal validity as a hard constraint.
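Version filtering by fact date can be sketched as a simple lookup. This is an illustrative sketch, not the authors' retrieval pipeline: the data model, the provision identifier, the dates, and the texts are all hypothetical, and the fact date is assumed to have been extracted from the question already.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass(frozen=True)
class ProvisionVersion:
    provision_id: str
    valid_from: date
    valid_to: Optional[date]  # None means the version is still in force
    text: str

def version_in_force(versions: List[ProvisionVersion], provision_id: str,
                     fact_date: date) -> Optional[ProvisionVersion]:
    """Return the version of a provision that governed on fact_date, if any."""
    for v in versions:
        if v.provision_id != provision_id:
            continue
        started = v.valid_from <= fact_date
        not_ended = v.valid_to is None or fact_date <= v.valid_to
        if started and not_ended:
            return v
    return None

# Hypothetical provision with one amendment taking effect on 2021-01-01.
versions = [
    ProvisionVersion("EXAMPLE-1", date(2010, 1, 1), date(2020, 12, 31), "old rule"),
    ProvisionVersion("EXAMPLE-1", date(2021, 1, 1), None, "new rule"),
]

historical = version_in_force(versions, "EXAMPLE-1", date(2019, 6, 1))
current = version_in_force(versions, "EXAMPLE-1", date(2023, 1, 1))
too_early = version_in_force(versions, "EXAMPLE-1", date(2000, 1, 1))
```

Filtering before retrieval makes temporal validity a hard constraint: a newer version can never be handed to the model for a fact pattern that predates it, which is the recency-bias failure the abstract describes.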

2605.23493 2026-05-25 cs.AI

EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation

EDGE-OPD:通过证据引导的在线策略蒸馏内化特权上下文

Aristotelis Lazaridis, Dylan Bates, Aman Sharma, Brian King, Vincent Lu, Jack FitzGerald

Affiliations * EdgeRunner AI

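AI summary This paper studies how to keep privileged context from changing model behavior in unintended ways during On-Policy Self-Distillation (OPSD), and proposes EDGE-OPD. Using guided rollouts and an evidence mask, the method injects privileged information more precisely during training so that the student learns the target behavior rather than side effects. Experiments show that EDGE-OPD makes identity learning succeed and helps preserve the model's general capabilities.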

详情
AI中文摘要

在线策略蒸馏(OPD)作为一种LLM后训练范式,因其在不引入模型分布漂移和通用任务回归的情况下有效提升能力而受到广泛关注。在线策略自蒸馏(OPSD)是OPD的一种高效用例,它仅需单一模型同时作为学生和教师,并且具有在训练过程中向教师提供推理时缺失的特权上下文(例如角色、私有事实或已解决的方案)的优势。该方法面临的挑战在于,特权信息可能过度改变模型行为:它可能修改推理、降低通用能力,并影响响应长度、风格或局部token偏好等性能指标。因此,OPSD可能训练学生模型学习副作用而非期望的可迁移行为。本文在稀有token/身份设定下研究该问题,并提出EDGE-OPD(证据引导的在线策略蒸馏),这是OPSD的一种改进,具有两个显著特征:a) 使用引导展开在采样时向学生注入特权上下文行为,使得稀有目标行为实际出现在在线策略数据中;b) 应用证据掩码:学生仅在特权上下文支持采样token的token位置进行更新,而非展开中的每个token。实验表明,OPSD(及其变体RLSD,无论是否使用验证器)完全无法学习目标身份,而引导展开的集成使其成功。此外,掩码区域消融实验显示,角色信号定位于正证据尾部,这使我们能够获得关于高效知识迁移和通用能力保持的宝贵见解。

英文摘要

On-Policy Distillation (OPD) has gained wide traction as an LLM post-training paradigm due to its effectiveness in improving capabilities without introducing model distribution drift and, consequently, regression in general tasks. On-Policy Self-Distillation (OPSD) is an efficient use case of OPD, which is appealing as it requires only a single model as both student and teacher, and it also has the benefit of providing privileged context that is absent at inference time (e.g. a persona, a private fact, or a worked solution) to the teacher during the training process. The challenge in this approach is that the privileged information can change model behavior more than intended: it can modify reasoning, degrade general capabilities, and affect performance indicators like response length, style, or local token preferences. Consequently, OPSD may train the student on side effects rather than a desired, transferable behavior. In this paper, we study this problem in a rare-token/identity setting and propose EviDence GuidEd On-Policy Distillation (EDGE-OPD), a modification of OPSD with two distinct characteristics: a) it uses guided rollouts to inject privileged-context behavior to the student at sampling time, so that the rare target behavior is actually present in the on-policy data, and b) it applies an evidence mask: the student is updated only at token positions where the privileged context supports the sampled token, rather than on every token in the rollout. We empirically show that OPSD (and its variant RLSD, with and without a verifier) completely fails to learn a target identity, while the integration of guided rollouts allows them to succeed. Additionally, mask-region ablations show that the persona signal is localized to the positive-evidence tail, allowing us to draw valuable insights about efficient knowledge transfer and preservation of general-purpose capabilities.
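The evidence mask can be illustrated with per-token log-probabilities. The definition assumed here (a position counts as supported when the sampled token is more likely with the privileged context than without it) is one possible reading of the abstract, not a confirmed detail of EDGE-OPD; the numbers and names are hypothetical.

```python
import numpy as np

def evidence_mask(logp_with_ctx, logp_without_ctx, margin=0.0):
    """True where the privileged context raises the sampled token's log-probability."""
    gap = np.asarray(logp_with_ctx) - np.asarray(logp_without_ctx)
    return gap > margin

def masked_mean(per_token_loss, mask):
    """Average a per-token loss over masked positions only."""
    mask = np.asarray(mask, dtype=float)
    if mask.sum() == 0:
        return 0.0
    return float((np.asarray(per_token_loss) * mask).sum() / mask.sum())

# Log-probabilities of the sampled tokens in one rollout (toy values).
logp_with_ctx = [-0.1, -2.0, -0.5, -0.2]     # teacher pass, privileged context present
logp_without_ctx = [-3.0, -1.0, -0.5, -1.2]  # student pass, no privileged context
per_token_loss = [2.0, 5.0, 1.0, 1.0]        # any per-token distillation loss

mask = evidence_mask(logp_with_ctx, logp_without_ctx)
loss = masked_mean(per_token_loss, mask)
```

Positions where the context makes no difference or lowers the token's probability contribute nothing to the update, which is how the mask keeps side effects such as style or length shifts out of the training signal.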

2605.23482 2026-05-25 cs.CV cs.AI

Multimodal Distribution Matching for Vision-Language Dataset Distillation

多模态分布匹配用于视觉-语言数据集蒸馏

Jongoh Jeong, Hoyong Kwon, Minseok Kim, Kuk-Jin Yoon

Affiliations * Visual Intelligence Lab., KAIST

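AI summary This study proposes Multimodal Distribution Matching (MDM), a multimodal dataset distillation method that produces compact synthetic datasets preserving vision-language semantics under tight compute and memory budgets. MDM combines complementary components at the data, model, and loss levels to preserve cross-modal alignment and representation quality: it initializes image-text pairs by sampling in the joint embedding space, builds a mixed teacher by interpolating fine-tuned models in weight space, and matches joint distributions with a geometry-aware loss. Experiments on image-text retrieval with cross-architecture evaluation show that MDM substantially reduces distillation cost and remains robust across architectures.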

Comments Accepted for publication at CVPR 2026. Project Page: https://andyj1.github.io/mdm

详情
AI中文摘要

数据集蒸馏将大型训练集压缩为紧凑的合成数据集,同时保持下游性能。随着现代系统越来越多地处理成对的视觉-语言输入,多模态蒸馏必须在严格的计算和内存预算下保持表示质量和跨模态对齐,然而先前的方法通常需要大量计算并忽略其相关性。为了解决这个问题,我们提出了多模态分布匹配(MDM),一种用于高效且可泛化的多模态蒸馏的几何感知框架。具体来说,MDM在数据、模型和损失层面集成了互补组件。在数据层面,它通过在联合嵌入空间中的聚类采样来初始化合成图像-文本对。在模型层面,它通过在权重空间中根据独立微调模型与预训练锚点的角度偏差进行插值,形成混合教师模型。在损失层面,它使用几何感知的匹配目标在单位超球面上匹配联合分布,该目标利用跨模态一致性和差异方向上的联合特征以及对称对比学习。在跨架构评估的图像-文本检索基准上,MDM生成的紧凑合成集保留了多模态语义,显著降低了蒸馏成本,并在不同架构下保持鲁棒性。

英文摘要

Dataset distillation compresses large training sets into compact synthetic datasets while preserving downstream performance. As modern systems increasingly operate on paired vision-language inputs, multimodal distillation must preserve representation quality and cross-modal alignment under tight compute and memory budgets, yet prior methods often require heavy computation and overlook cross-modal correlations. To address this, we present Multimodal Distribution Matching (MDM), a geometry-aware framework for efficient and generalizable multimodal distillation. Specifically, MDM integrates complementary components at the data, model, and loss levels. At the data level, it initializes synthetic image-text pairs by sampling from clusters in the joint embedding space. At the model level, it forms a mixed teacher by interpolating independently fine-tuned models in weight space according to their angular deviation from the pretrained anchor. At the loss level, it matches joint distributions on the unit hypersphere using a geometry-aware matching objective that exploits the joint features in the cross-modal agreement and discrepancy directions along with symmetric contrastive learning. Across image-text retrieval benchmarks with cross-architecture evaluation, MDM yields compact synthetic sets that preserve multimodal semantics, substantially reduce distillation cost, and remain robust across architectures.

2605.23478 2026-05-25 cs.CV cs.AI

PhenoYieldNet: Learning Crop-Aware Phenological Responses for Multi-Crop Yield Prediction

PhenoYieldNet: 学习作物感知的物候响应以进行多作物产量预测

Yu Luo, Xiaogang Zhu, Shan Zeng, Wei Xiang, Thomas Francis Bishop, Zhiyong Wang, Kun Hu

Affiliations * School of Computer Science, The University of Sydney; School of Computer Science and Information Technology, Adelaide University; College of Mathematics and Computer Science, Wuhan Polytechnic University; School of Computing, La Trobe University; School of Science, Edith Cowan University

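AI summary Accurate crop yield prediction is crucial for sustainable agriculture and global food security. Existing methods mostly target a single crop, generalize poorly across crops, and do not account for crop-specific phenological responses to weather. This paper proposes PhenoYieldNet, a multi-crop yield prediction framework that learns crop-specific phenology by explicitly modeling these responses. It comprises a Crop Phenology Bank and a Crop Phenology Attention module that capture temporal features across phenological stages, and it improves generalization with a pre-trained encoder and a self-supervised adaptation strategy. Experiments on multi-crop datasets show that it significantly outperforms existing methods.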

Comments Accepted to CVPR 2026

详情
AI中文摘要

准确的作物产量预测对于可持续农业和全球粮食安全至关重要。现有方法主要针对单一作物预测开发,通常难以泛化到不同作物类型,且未能解决由复杂天气模式动态调节的独特作物物候响应。在本文中,我们提出PhenoYieldNet,一个多作物产量预测框架,通过显式建模作物对时间驱动因素的响应来学习作物特异性物候。具体来说,我们开发了一个作物感知的时间解码器,由作物物候库(CPB)和作物物候注意力(CPA)模块组成。CPB集成了一组可学习的嵌入,利用查询引导CPA模块学习特定作物最相关的物候模式。CPA模块显式捕获多尺度趋势和变化成分以构建时间上下文,使模型能够动态调整不同物候阶段的注意力。为了学习鲁棒且可泛化的多作物预测特征,编码器使用预训练基础模型初始化,并通过自监督时序对比适应策略进一步调整以对齐农业时间动态。在多作物数据集上进行的大量实验表明,我们提出的方法显著优于最先进的方法,在不同地区和作物上展现出强大的泛化能力。

英文摘要

Accurate crop yield prediction is crucial for sustainable agriculture and global food security. Existing methods are predominantly developed for single-crop prediction; they often struggle to generalize across diverse crop types and do not address the unique crop phenological responses that are dynamically modulated by complex weather patterns. In this paper, we propose PhenoYieldNet, a multi-crop yield prediction framework that learns crop-specific phenology by explicitly modeling crop responses to temporal drivers. Specifically, we develop a crop-aware temporal decoder consisting of a Crop Phenology Bank (CPB) and a Crop Phenology Attention (CPA) module. The CPB integrates a set of learnable embeddings, which leverage a query to guide the CPA module to learn the most relevant phenology patterns for the specific crop. The CPA module explicitly captures multi-scale trend and variation components to construct temporal contexts, enabling the model to dynamically adjust the attention across different phenological stages. To learn robust and generalizable features for multi-crop prediction, the encoder is initialized with a pre-trained foundation model, and further adapted via a self-supervised Temporal Contrastive Adaptation strategy to align with agricultural temporal dynamics. Extensive experiments conducted on multi-crop datasets indicate that our proposed method significantly outperforms state-of-the-art methods, exhibiting strong generalization capabilities across different regions and crops.

2605.23477 2026-05-25 cs.RO

Semantically Structured Mixture-of-Experts for Compositional Robotic Manipulation

语义结构化混合专家用于组合机器人操作

Chengyu Deng, Guanqi Chen, Yizhou Chen, Zejia Liu, Zhiwen Ruan, Guanhua Chen, Jia Pan

Affiliations * The University of Hong Kong; Southern University of Science and Technology

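AI summary This study targets the high computational cost and poor generalization of diffusion-based robotic manipulation policies in multi-task settings, and proposes a Semantically Structured Mixture-of-Experts Diffusion Policy (SMoDP). A lightweight skill predictor, supervised by vision-language model annotations, routes action chunks at inference time to experts specialized for particular behavioral phases, improving efficiency and interpretability. For robust routing, a dual contrastive alignment strategy aligns multi-modal observations with language-defined skill semantics. Experiments on multi-task benchmarks show better parameter efficiency and transfer to new tasks.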

Comments Accepted to Robotics: Science and Systems (RSS) 2026

详情
AI中文摘要

基于扩散的策略为精确机器人操作建立了新标准,但面临关键的可扩展性瓶颈:高性能模型计算成本高,而轻量级替代方案通常难以在多样化的多任务环境中泛化。混合专家(MoE)架构通过仅激活参数子集提供了一条有前景的效率路径。然而,现有的MoE路由机制通常依赖于低级噪声或潜在统计量,忽略了操作任务的组合性质。这可能导致可重用行为在专家间碎片化,限制可解释性和可迁移性。我们提出了用于组合机器人操作的语义结构化混合专家扩散策略(SMoDP),这是一个将专家专业化建立在语义任务结构上的框架。SMoDP利用一个轻量级的推理时技能预测器,该预测器由视觉语言模型(VLM)的离线标注监督,将动作块路由到特定行为阶段专业化的专家。为了确保鲁棒的分配,我们提出了一种双对比对齐策略,该策略将多模态观测建立在语言定义的技能语义上(模态间),同时强制执行视觉上不同但功能相关行为之间的路由一致性(模态内)。我们的方法在多任务基准测试中优于代表性的扩散和基于MoE的基线,参数效率显著提高,并通过参数高效微调展示了向新任务的有效组合迁移。项目网站:https://deng-cy20.github.io/SMoDP/

英文摘要

Diffusion-based policies have established a new standard for precise robotic manipulation but face a critical scalability bottleneck: high-performance models are computationally expensive, while lightweight alternatives often fail to generalize across diverse multi-task environments. Mixture-of-Experts (MoE) architectures offer a promising path to efficiency by activating only a subset of parameters. However, existing MoE routing mechanisms typically rely on low-level noise or latent statistics, ignoring the compositional nature of manipulation tasks. This can fragment reusable behaviors across experts, limiting interpretability and transferability. We introduce Semantically Structured Mixture-of-Experts Diffusion Policy (SMoDP) for compositional robotic manipulation, a framework that grounds expert specialization in semantic task structure. SMoDP leverages a lightweight, inference-time skill predictor, supervised by offline annotations from Vision-Language Models (VLMs), to route action chunks to experts specialized for specific behavioral phases. To ensure robust assignment, we propose a dual contrastive alignment strategy that grounds multi-modal observations in language-defined skill semantics (Inter-modal) while enforcing routing consistency across visually distinct but functionally related behaviors (Intra-modal). Our approach outperforms representative diffusion and MoE-based baselines on multi-task benchmarks with significantly improved parameter efficiency and demonstrates effective compositional transfer to novel tasks through parameter-efficient fine-tuning. Project website: https://deng-cy20.github.io/SMoDP/

2605.23476 2026-05-25 cs.LG cond-mat.dis-nn cond-mat.mtrl-sci math.OC

Non-normal spectral signatures of instability in neural network training dynamics

神经网络训练动态中不稳定性的非正态谱特征

Souvik Ghosh

Affiliations * Department of Physics, National Sun Yat-sen University, Kaohsiung 80424, Taiwan

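AI summary This paper studies instabilities that are common in deep network training, such as loss spikes, oscillatory convergence, and gradient pathologies, and explains them with non-normal operator theory. It finds that the linearized update operators of commonly used optimizers are generically non-normal, with the non-normality arising from the interaction between the Hessian and the adaptive preconditioner, or from the momentum structure. Using non-normal stability theory, the author derives a conservative pseudospectral precursor bound and shows that the condition number κ(V) can serve as an early-warning indicator of transient amplification during training.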

Comments 9 pages, 3 figures

详情
AI中文摘要

深度网络中的训练不稳定性——损失尖峰、振荡收敛和梯度病态——在经验上普遍存在,但缺乏严格的算子理论解释。我们证明,实际使用的优化器的线性化更新算子通常是非正态的:对于Adam,非正态性由Hessian与对角自适应预条件子之间的换位子[H, M]控制;而对于带动量的SGD,它源于更新映射的增广状态空间结构。将非正态稳定性理论应用于这些算子,我们推导出一个保守的伪谱前兆界,其中κ(V)作为瞬态放大的早期预警指标,即使谱半径仍小于1;并且我们建立了更新算子的异常点作为该框架中κ(V) → ∞的极限情况。在两层网络上的数值实验证实,谱半径ρ(J)无法区分稳定和不稳定的训练阶段,而κ(V)能将它们分开约一个数量级,用非正态放大的连续严重性度量补充了经典的锐度准则。这些结果确立了非厄米算子理论作为神经网络优化稳定性中一个有用且未被充分探索的框架,为理解自适应优化稳定性提供了诊断语言和概念验证基准。

英文摘要

Training instabilities in deep networks - loss spikes, oscillatory convergence, and gradient pathologies - are empirically prevalent but lack a rigorous operator-theoretic explanation. We show that the linearized update operators for practically used optimizers are generically non-normal: for Adam, non-normality is controlled by the commutator [H, M] between the Hessian and the diagonal adaptive preconditioner, while for SGD with momentum it arises from the augmented state-space structure of the update map. Applying non-normal stability theory to these operators, we derive a conservative pseudospectral precursor bound in which κ(V) serves as an early-warning indicator of transient amplification even when the spectral radius remains below one, and we establish that exceptional points of the update operator appear as the κ(V) → ∞ limiting case of this framework. Numerical experiments on two-layer networks confirm that the spectral radius ρ(J) provides no separation between stable and unstable training phases while κ(V) separates them by approximately one order of magnitude, complementing the classical sharpness criterion with a continuous severity measure of non-normal amplification. These results establish non-Hermitian operator theory as a useful and underexplored framework for neural network optimization stability, offering a diagnostic language and proof-of-concept benchmark for understanding adaptive optimization stability.
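Transient amplification with a spectral radius below one can be reproduced with a 2x2 matrix. The sketch below assumes that κ(V) denotes the 2-norm condition number of the eigenvector matrix V of the update operator J, and uses the standard bound ‖J^k‖ ≤ κ(V) ρ(J)^k for diagonalizable J; the matrix itself is a hypothetical example, not one taken from the paper.

```python
import numpy as np

# Hypothetical linearized update operator: stable eigenvalues, strong coupling.
J = np.array([[0.9, 5.0],
              [0.0, 0.8]])

eigvals, V = np.linalg.eig(J)
rho = float(np.max(np.abs(eigvals)))   # spectral radius, below one
kappa_V = float(np.linalg.cond(V))     # condition number of the eigenvector matrix

# Norm of J^k for k = 1..59: it grows before the asymptotic decay sets in.
steps = range(1, 60)
growth = [float(np.linalg.norm(np.linalg.matrix_power(J, k), 2)) for k in steps]
peak = max(growth)

# For diagonalizable J the bound ||J^k|| <= kappa(V) * rho^k holds at every step.
bound_holds = all(g <= kappa_V * rho ** k * (1 + 1e-9) for g, k in zip(growth, steps))
```

The spectral radius alone predicts monotone decay here, yet perturbations are amplified more than tenfold before shrinking; a large κ(V) is what signals that such a transient is possible.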

2605.23472 2026-05-25 cs.CV

Rethinking Transfer Learning for Industrial Inspection: DINOv3 vs. ImageNet Pretraining Across RGB and X-ray Tasks

重新思考工业检测的迁移学习:DINOv3与ImageNet预训练在RGB和X射线任务上的对比

Mehdi Gharbage, Céline Teulière, Pierre Bouges, Thierry Chateau

Affiliations * Michelin Tyres Manufacturer; Université Clermont Auvergne, CNRS, Institut Pascal

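AI summary This paper examines how well modern vision foundation models transfer to industrial inspection, comparing ConvNeXt backbones pretrained with supervised ImageNet classification against those pretrained with DINOv3 self-supervised distillation on RGB and X-ray inspection tasks. DINOv3 shows no clear advantage in frozen transfer, but under full fine-tuning on RGB tasks it provides a better initialization, with faster convergence and better final performance; on X-ray tasks, supervised ImageNet pretraining remains more effective. The results indicate that modern vision foundation models are promising for industrial RGB inspection, but that their transferability depends strongly on downstream adaptation and data modality.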

Comments Accepted to the CVPR 2026 Workshop on Vision Foundation Models for Industrial Inspection (VISION'26)

Details
AI Chinese abstract

最近,在网页规模数据上预训练的视觉基础模型在许多下游任务中展现出强大的迁移能力,但它们在工业视觉检测中的有效性仍不明确。工业数据与网页数据差异显著,通常需要细粒度的密集预测,这引发了一个问题:现代自监督预训练能否超越基于监督ImageNet初始化的传统迁移学习范式。在这项工作中,我们比较了使用监督ImageNet分类或DINOv3蒸馏预训练的ConvNeXt骨干网络,并将它们与传统的ResNet-50基线相关联。我们在四个下游数据集上评估了语义分割、实例分割和物体检测,这些数据集涵盖RGB表面缺陷检测和X射线缺陷检测。我们进一步研究了冻结和完全微调两种适应机制。我们的结果表明,DINOv3在冻结迁移中没有明显优势,但在RGB任务完全微调后提供了更强的初始化,实现了更快的收敛和更好的最终性能。然而,在X射线模态偏移下,监督ImageNet预训练在冻结和微调设置中仍然更有效。总体而言,我们的发现表明,现代视觉基础模型对于监督RGB工业检测是有前景的,但它们的迁移能力强烈依赖于下游适应和目标模态。

English abstract

Vision foundation models pretrained on web-scale data have recently shown strong transfer capabilities on many downstream tasks, but their effectiveness for industrial visual inspection remains unclear. Industrial data differ substantially from web data and often require fine-grained dense prediction, raising the question of whether modern self-supervised pretraining can improve over the conventional transfer-learning paradigm based on supervised ImageNet initialization. In this work, we compare ConvNeXt backbones pretrained with supervised ImageNet classification or DINOv3 distillation, and relate them to the conventional ResNet-50 baseline. We evaluate semantic segmentation, instance segmentation, and object detection across four downstream datasets spanning RGB surface-defect inspection and X-ray defect detection. We further study both frozen and fully finetuned adaptation regimes. Our results show that DINOv3 offers no clear advantage in frozen transfer, but provides a stronger initialization after full finetuning on RGB tasks, yielding faster convergence and better final performance. Under X-ray modality shift, however, supervised ImageNet pretraining remains more effective in both frozen and finetuned settings. Overall, our findings suggest that modern vision foundation models are promising for supervised RGB industrial inspection, but their transferability is strongly conditioned by downstream adaptation and target modality.

2605.23471 2026-05-25 cs.LG cs.AI

CBANet: A Compact Attention-Based CNN-BiLSTM Network for Aggressive Driving Event Detection

CBANet:一种用于激进驾驶事件检测的紧凑型注意力CNN-BiLSTM网络

Hanadi Alhamdan, Ghadah Alosaimi, Amir Atapour-Abarghouei, Farshad Arvin

Affiliations * Department of Computer Science, Princess Nourah bint Abdulrahman University; Department of Computer Science, Durham University; Department of Computer Science, Imam Mohammad Ibn Saud Islamic University

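AI summary This paper proposes CBANet, a compact deep learning framework that combines an attention mechanism with a CNN-BiLSTM to detect aggressive driving events. The method builds engineered dynamic features that capture steering, acceleration, and braking behaviour, and uses a stable training strategy that combines SMOTE-based oversampling with a class-weighted loss to handle the extreme rarity of aggressive events in naturalistic driving data. Experiments show that it significantly outperforms conventional deep learning methods on minority-class recall and safety-critical F-score while remaining computationally efficient.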

Comments 8 pages, 4 figures, 4 tables. Submitted to IJCNN/WCCI 2026. CBANet: A compact attention-based CNN-BiLSTM framework for aggressive driving event detection using multivariate vehicle dynamics signals. Code available at https://github.com/halhamdan/CBANet

Details
AI Chinese abstract

激进驾驶是交通事故的主要原因,对道路安全构成严重威胁。尽管深度学习方法在从车辆传感器数据检测危险驾驶行为方面显示出有希望的结果,但它们在现实条件下的性能通常受到严重数据不平衡、驾驶员间巨大差异以及缺乏物理可解释的车辆动力学表示的限制。在本文中,我们提出了一种增强的深度学习框架,用于使用多变量车辆动力学信号进行激进驾驶检测。该方法不仅依赖原始测量,还构建了捕捉转向、加速和制动行为的工程动力学特征。为了解决自然驾驶数据中激进事件的极端稀少性,我们引入了一种稳定的训练策略,结合了基于SMOTE的受控过采样和类别加权损失公式,并评估了用于不平衡处理的焦点损失变体。此外,采用基于类别特定阈值校准的安全导向决策策略,以更好地反映现实应用中漏检和误报的不对称风险。该框架在新收集的自然驾驶数据集上进行了评估。大量实验表明,所提出的方法在保持实际计算效率的同时,在少数类召回率和安全关键F-score指标上始终优于标准深度学习基线。代码:https://github.com/halhamdan/CBANet

English abstract

Aggressive driving is a major cause of traffic accidents and poses a serious threat to road safety. Although deep learning methods have shown promising results in detecting risky driving behaviours from vehicle sensor data, their performance in real-world conditions is often limited by severe data imbalance, large variability between drivers, and the lack of physically interpretable vehicle dynamics representations. In this paper, we propose an enhanced deep learning framework for aggressive driving detection using multivariate vehicle dynamics signals. Instead of relying solely on raw measurements, the proposed approach constructs engineered dynamic features that capture steering, acceleration, and braking behaviour. To address the extreme rarity of aggressive events in naturalistic driving data, we introduce a stable training strategy that combines controlled SMOTE-based oversampling with a class-weighted loss formulation, and we evaluate focal loss variants for imbalance handling. Furthermore, a safety-oriented decision strategy based on class-specific threshold calibration is adopted to better reflect the asymmetric risks of missed detections and false alarms in real-world applications. The proposed framework is evaluated on a newly collected naturalistic driving dataset. Extensive experiments show that the proposed method consistently outperforms standard deep learning baselines with significant improvements in minority-class recall and safety-critical F-score metrics while maintaining practical computational efficiency. Code: https://github.com/halhamdan/CBANet
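
The class-weighted loss and threshold calibration described above can be illustrated with a minimal sketch. It is not the CBANet implementation: it assumes a binary aggressive/normal setting, inverse-frequency class weights, and an F2 score as the safety-oriented criterion, and it leaves out SMOTE and focal loss. All function names and the toy validation scores are hypothetical.

```python
import math

def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class)."""
    n = len(labels)
    counts = {c: labels.count(c) for c in set(labels)}
    return {c: n / (len(counts) * k) for c, k in counts.items()}

def weighted_bce(probs, labels, weights, eps=1e-12):
    """Class-weighted binary cross-entropy, normalised by the total weight."""
    total, wsum = 0.0, 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)      # keep log() finite
        w = weights[y]
        total += -w * (math.log(p) if y == 1 else math.log(1.0 - p))
        wsum += w
    return total / wsum

def fbeta(probs, labels, thr, beta=2.0):
    """F-beta of the positive class when predicting 1 for p >= thr."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= thr and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= thr and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < thr and y == 1)
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0

def calibrate_threshold(probs, labels, beta=2.0):
    """Pick the observed score that maximises F-beta on validation data."""
    candidates = sorted(set(probs))
    return max(candidates, key=lambda t: fbeta(probs, labels, t, beta))

# Toy validation set: 8 normal windows (label 0), 2 aggressive windows (label 1).
labels = [0] * 8 + [1] * 2
probs = [0.05, 0.1, 0.1, 0.2, 0.2, 0.3, 0.35, 0.6, 0.4, 0.8]

weights = class_weights(labels)
loss = weighted_bce(probs, labels, weights)
thr = calibrate_threshold(probs, labels, beta=2.0)
print("class weights:", weights)
print("weighted BCE:", round(loss, 4))
print("calibrated threshold:", thr)
print("F2 at 0.5:", round(fbeta(probs, labels, 0.5), 3),
      "| F2 at calibrated threshold:", round(fbeta(probs, labels, thr), 3))
```

With β = 2 the F-score weights recall more heavily than precision, so the selected threshold tends to fall below 0.5 when missed detections are costlier than false alarms.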

2605.23470 2026-05-25 cs.LG cs.AI cs.CE

Learning Individual Dynamics from Sparse Cross-Sectional Snapshots

从稀疏横截面快照中学习个体动力学

Christian Lagemann, Kai Lagemann, Steven L. Brunton, Sach Mukherjee

Affiliations * Statistics and Machine Learning, German Center for Neurodegenerative Diseases (DZNE); MediaTek Research; Department of Mechanical Engineering & AI Institute in Dynamic Systems, University of Washington, Seattle; DZNE & University of Bonn, Bonn, Germany and University of Cambridge, Cambridge, United Kingdom

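AI summary This study aims to learn individual dynamics from sparse cross-sectional snapshots, a setting in which conventional methods struggle to infer individual continuous-time trajectories because the data are sparse or entirely cross-sectional. It proposes CADENCE, a probabilistic framework that recovers individual trajectories from isolated snapshots by tying latent dynamics to static individual-level context. The method combines a score-based spatial encoder with a soft mixture-of-experts routing mechanism, provides identifiability guarantees for single-timepoint trajectory inference, and matches or exceeds state-of-the-art sequence models on a range of benchmarks.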

Details
AI Chinese abstract

预测一个动力学单元如何随时间演化——例如个体如何衰老、流行病如何传播、物理系统如何退化——通常需要密集的纵向追踪。当只有极其稀疏或完全横截面的数据可用时,推断个体化的连续时间轨迹本质上是病态的。现有方法迫使严格妥协:序列模型(如潜在ODE)需要密集的纵向数据,而横截面方法(如最优传输、基于流匹配的)映射聚合群体,丢失了个体动力学。在本文中,我们证明这种二分法可以被打破。我们介绍CADENCE,一个原则性的概率框架,通过将潜在动力学锚定到静态的个体级上下文,从孤立快照中恢复连续的个体轨迹。我们为单时间点轨迹推断提供了新颖的可识别性保证。通过结合基于分数的空间编码器(双射概率流ODE)以消除微分同胚歧义,以及软混合专家(SMoE)路由器,我们证明个体动力学参数和路由函数是联合可识别的。在一系列涵盖物理系统到真实世界生物数据的基准测试中,CADENCE严格在具有上下文结构的极端稀疏快照上训练,其性能匹配或超过了在密集全轨迹数据上训练的最先进序列模型。

English abstract

Predicting how a dynamical unit evolves over time - how an individual ages, an epidemic spreads, or a physical system degrades - typically requires dense longitudinal tracking. When only extremely sparse or entirely cross-sectional data is available, inferring individualized, continuous-time trajectories is fundamentally ill-posed. Existing methods force a strict compromise: sequence models (e.g. latent ODEs) require dense longitudinal data, while cross-sectional methods (e.g. optimal transport, flow matching-based) map aggregate populations, losing individual dynamics. In this paper, we demonstrate that this dichotomy can be broken. We introduce CADENCE, a principled probabilistic framework that recovers continuous individual trajectories from isolated snapshots by anchoring latent dynamics to static, individual-level contexts. We provide novel identifiability guarantees for single-timepoint trajectory inference. By combining a score-based spatial encoder (bijective Probability Flow ODE) to eliminate diffeomorphic ambiguities with a Soft Mixture-of-Experts (SMoE) router, we show that the individual dynamical parameters and the routing function are jointly identifiable. Across a suite of benchmarks spanning physical systems to real-world biological data, CADENCE, trained strictly on extremely sparse snapshots with context structure, matches or exceeds the performance of state-of-the-art sequential models trained on dense, full-trajectory data.