arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.23569 2026-05-25 cs.AI

CP or DP? Why Not Both: A Case Study in the Partial Shop Scheduling Problem

Emma Legrand, Roger Kameugne, Pierre Schaus

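AI summary: This paper studies how to combine dynamic programming (DP) and constraint programming (CP) effectively to solve the Partial Shop Scheduling Problem (PSSP). The authors propose a hybrid approach that uses DP as the main search framework and CP for global constraint propagation, improving solving efficiency and flexibility. The method supports arbitrary precedence constraints, can be combined with anytime strategies, and enables a DP-based large neighborhood search scheme, demonstrating the feasibility of integrating DP and CP for combinatorial optimization problems.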

Abstract

Dynamic Programming (DP) and Constraint Programming (CP) are well-established paradigms for solving combinatorial optimization problems. Usually, these two approaches are used separately. This paper aims to show that the two can be combined effectively and elegantly, with DP serving as the primary search framework and CP used as a subroutine to leverage global constraint propagation. This paper presents such an approach for the Partial Shop Scheduling Problem (PSSP), for which a pure DP method has previously been proposed, and efficient CP filtering algorithms are available. The PSSP is a general scheduling problem where each job consists of a set of operations with arbitrary precedence constraints. The approach is flexible enough to accommodate anytime DP strategies, such as anytime column search, whereas the original DP algorithm operated in a strictly layer-wise manner. Moreover, the flexibility of the CP modeling makes it straightforward to incorporate arbitrary precedence constraints. As a result, the model naturally handles any precedence graph and even enables the design of a Large Neighborhood Search (LNS) scheme, in which the DP model is reused, and partial-order schedules are imposed across restarts to improve the incumbent solution. While not competitive with state-of-the-art pure CP solvers for this specific problem, our primary contribution is demonstrating the viability of this hybrid integration.

2605.23568 2026-05-25 cs.RO cs.SY eess.SY

TactileReflex: Noise-Statistics-Driven Vision-Tactile Reflex Control for Force-Sensitive Manipulation

Ziyan Feng, Yulong Fu, Zheng Li, Yuxin He, Jieji Ren, Lujia Wang, Jinni Zhou, Yudong Zhong, Qiang Nie

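AI summary: This paper proposes TactileReflex, a vision-tactile reflex control method based on noise statistics, for force-sensitive fine manipulation tasks such as grasping and handling liquid-filled plastic cups. By analyzing the intrinsic noise characteristics of the tactile sensor, the method derives the controller thresholds directly, with no external force calibration or manual tuning. Experiments show that TactileReflex effectively prevents irreversible container deformation and achieves strong stability and success rates in a dynamic pouring task, suggesting its potential as a safety layer for high-level manipulation systems.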

Comments: 8 pages, 4 figures, 6 tables
Abstract

Manipulating fragile deformable containers, such as disposable plastic cups filled with liquid, demands real-time grip-force adaptation within an extremely narrow force margin: insufficient force causes slip, while excessive force irreversibly deforms the thin wall. Existing approaches struggle to achieve such force-sensitive manipulation tasks. We propose a noise-statistics-based calibration-driven reflex control paradigm with vision-based tactile sensing: by analyzing the sensor's intrinsic noise characteristics (via a brief static-hold-and-unload protocol), we directly derive all controller thresholds, eliminating external force calibration, trial-and-error manual tuning, or material-specific physical models. Instantiating this paradigm, we present TactileReflex, a three-channel closed-loop controller that extracts three image-level proxies, shear intensity ($S_y$), contact intensity ($F_n$), and center of pressure ($C$), from dual visuo-tactile sensors and drives prioritized reflex channels at ~12 Hz for slip suppression, weight-adaptive release, and force protection. Each channel closes the loop directly on its proxy via noise-derived thresholds. Ablation demonstrates that only the full three-channel system is able to prevent irreversible container deformation (5/5 success vs. at most 1/5 for partial configurations). In a dynamic pouring task, fixed-effort baselines fail in all 10 attempts due to pose drift, while TactileReflex achieves 9/10 success across two water volumes. As a self-contained and interpretable controller, TactileReflex can serve as a plug-and-play safety layer beneath high-level manipulation pipelines, including haptic-free VR teleoperation and vision-language-action (VLA) policies.
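
The idea of deriving controller thresholds from sensor noise can be illustrated with a minimal sketch. It assumes a threshold of the form mean plus `k` standard deviations of a proxy signal recorded during a static hold; that rule, the value `k=3.0`, the sample values, and the function names `noise_threshold` and `slip_reflex` are all hypothetical and are not taken from the paper.

```python
import statistics

def noise_threshold(static_samples, k=3.0):
    """Return mean + k * standard deviation of a proxy signal recorded at rest.

    Assumption: a k-sigma band is a reasonable stand-in for a
    noise-derived threshold; the paper's actual rule is not given here.
    """
    mu = statistics.fmean(static_samples)
    sigma = statistics.pstdev(static_samples)
    return mu + k * sigma

def slip_reflex(shear_value, threshold):
    """Fire the reflex only when the proxy leaves the sensor's noise band."""
    return shear_value > threshold

# Hypothetical shear-intensity readings taken while the gripper holds still.
static_shear = [0.10, 0.12, 0.09, 0.11, 0.10, 0.08]
thr = noise_threshold(static_shear)

print(slip_reflex(0.10, thr))  # a reading inside the noise band
print(slip_reflex(0.50, thr))  # a reading well outside the noise band
```

The point of such a rule is that nothing in it depends on a force unit or a material model: the threshold is expressed in the proxy's own units and scales with the sensor's measured noise.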

2605.23565 2026-05-25 cs.LG cs.AI

Understanding Goal Generalisation in Sequential Reinforcement Learning

Jason Ross Brown, Edward James Young

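AI summary: This study examines how sequentially trained reinforcement learning agents generalise their goals to new environments, and how their training history shapes their behaviour. By studying over 100 sequential training pipelines and evaluating them in more than 250 out-of-distribution environments, it finds that salient features and goals learnt early have an important influence on later generalisation. The study proposes a method called latent policy gradients that predicts the out-of-distribution behaviour a training pipeline is likely to induce; it offers high predictive accuracy, good generalisation, and interpretability, providing a basis for understanding goal generalisation from a developmental perspective.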

Abstract

Reinforcement learning agents often exhibit unintended goal-directed behaviour outside their training distribution, but we currently lack a principled understanding of how such agents will generalise to novel environments based on their training history. We address this gap for agents trained sequentially on one or more tasks. We study over 100 sequential training pipelines, evaluating behaviour across over 250 out-of-distribution environments. We find that salient features drive generalisation, and that goals learnt early in training can persist and influence those acquired later. To explain these phenomena, we introduce latent policy gradients, a method that predicts what out-of-distribution behaviour a training pipeline will likely induce. Our method simulates the evolution of low-dimensional latent variables during training according to what would achieve high reward on the training objective with respect to a simple model of how the latent variables map to behaviour. It achieves strong predictive accuracy, generalises to unseen types of training pipeline, and is interpretable. Our findings demonstrate that while out-of-distribution RL agent behaviour is dependent on the whole training pipeline, this dependence has an underlying structure we can capture, laying groundwork for understanding goal generalisation from a developmental perspective.

2605.23563 2026-05-25 cs.LG

MARS: Magnitude-Aware Rank Statistics

Muhammad Rajabinasab, Afsaneh M. Nejad, Arthur Zimek

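AI summary: In the comprehensive evaluation of machine learning models, accurately reflecting performance differences between models is an important problem. Traditional Critical Difference (CD) diagrams rely on discrete ranks and ignore the magnitude of performance gaps, which leads to "magnitude-blindness". The paper proposes MARS, a magnitude-aware rank statistics method that weights discrete ranks with a relative margin coefficient, so that model performance differences are reflected more realistically and more insight is gained across a wide range of experimental settings.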

Comments: Preprint submitted to Elsevier Pattern Recognition Letters
Abstract

Comprehensive evaluation of machine learning models is key to ensuring that they perform as robustly and consistently as desired. To summarize experimental results and pick a winner, Critical Difference (CD) diagrams are commonly used. Standard CD diagrams rely on discrete ranks and discard the magnitude of performance gaps between models, an issue we call magnitude-blindness. To address it, we propose Magnitude-Aware Rank Statistics (MARS), which incorporates a relative margin coefficient as a weight for the discrete ranks. This coefficient scales ranks based on the distance between the best and worst performers, with a dynamic projection to handle boundary cases. Combined with the calculation of a CD value, MARS yields a more realistic statistical representation of differences in model performance and more insight into how methods actually perform across extensive experimental settings.
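
Magnitude-blindness is easy to reproduce with a few lines of code. The sketch below does not implement the MARS coefficient, whose exact form is not stated in the abstract; it only shows that discrete ranks are identical for a near-tie and a blowout, and computes a Nemenyi-style critical difference. The scores are made up, the use of the Nemenyi formula is an assumption about which CD test is meant, and `q_alpha=2.343` is the commonly tabulated critical value for three models at the 0.05 level.

```python
import math

def ranks(scores):
    """Rank models on one dataset (1 = best, higher score is better; no tie handling)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    r = [0] * len(scores)
    for position, model in enumerate(order, start=1):
        r[model] = position
    return r

def nemenyi_cd(q_alpha, k, n_datasets):
    """Critical difference for average ranks: q_alpha * sqrt(k (k + 1) / (6 N))."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))

close_race = [0.901, 0.900, 0.899]  # three models with near-identical accuracy
blowout = [0.95, 0.60, 0.20]        # same ordering, very different gaps

# Both datasets produce the same rank vector, so the gap size is lost.
print(ranks(close_race), ranks(blowout))
print(nemenyi_cd(q_alpha=2.343, k=3, n_datasets=10))
```

Because both score vectors map to the same ranks, any statistic built only on those ranks treats the two datasets as equally strong evidence, which is the gap a magnitude-aware weighting is meant to close.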

2605.23562 2026-05-25 cs.MA cs.AI

ARMS: Automatic Reward Shaping for Sparse-Reward Multi-Agent Reinforcement Learning

Elie Abboud, Oren Gal

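AI summary: In multi-agent reinforcement learning, sparse rewards are a major bottleneck, and conventional reward shaping methods struggle to improve learning efficiency while preserving strategic structure. This paper proposes ARMS, an automatic reward shaping framework that learns dense shaping rewards from sparse environmental rewards through trajectory ranking and, based on conditional best-response reasoning, guarantees under certain conditions that each agent's best-response set under fixed opponent policies and the set of Nash equilibria are preserved. Experiments show that ARMS significantly improves sample efficiency on partially observable multi-agent pathfinding tasks, generalizes well across environments, and reveals oscillatory behavior in multi-agent systems caused by insufficient exploration and coupled policy-reward dynamics.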

Abstract

Sparse rewards are a major bottleneck in multi-agent reinforcement learning (MARL), where simultaneous learning induces non-stationarity and makes reward design especially delicate. Reward shaping can accelerate learning, but in the multi-agent setting it must preserve the strategic structure of the problem rather than merely improve short-term optimization. We propose Automatic Reward-shaping in Multi-agent Systems (ARMS), a self-supervised reward shaping framework for MARL that learns dense shaping signals from sparse environmental rewards through trajectory ranking. Since single-agent trajectory-ranking guarantees do not directly transfer to MARL, we reformulate policy invariance through conditional best-response reasoning, and show that if certain conditions hold, then using shaping rewards preserves each agent's best-response set under fixed opponent policies, and consequently preserves the set of Nash equilibria. Guided by this perspective, ARMS alternates between policy learning and reward learning while sharing shaping parameters across agents for efficiency. Experiments in a partially observable multi-agent pathfinding domain show that ARMS improves sampling efficiency under increasing reward sparsity and agent count, generalizes to unseen environments, and reveals a MARL-specific failure mode in which limited exploration and coupled policy-reward dynamics induce oscillatory behavior. Increasing exploration mitigates this effect and stabilizes learning. To the best of our knowledge, ARMS is the first automatic reward shaping framework for MARL whose design is motivated by a game-theoretic equilibrium-preservation result.
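
Learning a dense signal from trajectory rankings can be sketched with a pairwise loss. The version below is only an illustration under stated assumptions: it uses a Bradley-Terry style logistic loss and a linear shaping reward over per-step features, neither of which is confirmed by the abstract, and the names `shaped_return` and `ranking_loss` and the feature values are hypothetical.

```python
import math

def shaped_return(theta, trajectory):
    """Sum of a linear shaping reward theta . features over one trajectory."""
    return sum(sum(t * f for t, f in zip(theta, step)) for step in trajectory)

def ranking_loss(theta, better, worse):
    """Pairwise logistic loss: small when the better trajectory scores higher."""
    diff = shaped_return(theta, better) - shaped_return(theta, worse)
    return math.log1p(math.exp(-diff))

# Hypothetical one-dimensional feature per step, e.g. progress toward a goal.
# "better" is the trajectory that earned the higher sparse return.
better = [[1.0], [1.0]]
worse = [[0.0], [0.0]]

print(ranking_loss([1.0], better, worse))   # shaping agrees with the ranking
print(ranking_loss([-1.0], better, worse))  # shaping contradicts the ranking
```

Minimizing such a loss over many ranked pairs pushes the shaping parameters toward rewards that order trajectories the same way the sparse return does; whether the resulting shaping preserves best responses is a separate question that the paper addresses theoretically.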

2605.23559 2026-05-25 cs.CV cs.AI

PathNavigate: A Training-Free Pathology Agent with Surprise-Guided Scan and Shared Slide Memory for Whole-Slide Image VQA

Chunze Yang, Qidong Liu, Wenjie Zhao, Yue Tang, Jiusong Ge, Di Zhang, Jiashuai Liu, Lei Wu, Junbo Lu, Ni Zhang, Xian Wu, Zeyu Gao, Chen Li

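AI summary: PathNavigate is a training-free pathology question-answering agent that addresses the problem of efficiently locating key pathological evidence under a limited inspection budget in whole-slide image question answering (WSI-VQA). The method follows a scan-search-readout routine: a shared online memory module produces a pool of abnormal regions, and question-conditioned relevance then selects high-magnification target regions within it, improving answer accuracy and interpretability. Experiments show that PathNavigate achieves higher efficiency and more reliable evidence-selection trajectories while keeping the models frozen.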

Abstract

Whole-slide image visual question answering (WSI-VQA) frames pathology as an extreme-context search problem: to answer a free-form clinical query, a system must first navigate a gigapixel slide under a strict inspection budget to locate sparse, high-resolution evidence. Existing approaches largely fall into two paradigms: i) supervised pathology multimodal large language models (MLLMs) and agents can absorb localization and reasoning into learned modules, but they often couple navigation to task-specific supervision and retraining, limiting their practicality; ii) training-free pathology agents avoid this cost by keeping core models frozen, but often follow a question-first design, constructing the initial candidate set mainly from query-conditioned relevance. This can miss decisive morphology that is not named in the question, and forces heavier inference-time scaffolding. To address this challenge, we introduce PathNavigate, a training-free pathology agent built around a scan-search-readout routine. Before question matching, PathNavigate scans the current slide at low magnification with a shared online memory module over frozen pathology features, producing a slide-specific surprise field that marks an abnormal-region pool. It then applies question-conditioned PLIP relevance only within this pool to select high-magnification search targets. Finally, it extracts local high-magnification evidence and answers with a frozen perceptor-adjudicator stack, using the same online memory as slide-level context. Experiments on WSI-VQA and SlideBench-BCNB show that the proposed scan-search-readout design improves answer accuracy and yields more interpretable evidence-selection trajectories with higher efficiency. The code is available online.

2605.23556 2026-05-25 cs.LG cs.IR math.CO

Is Dimensionality a Barrier for Retrieval Models?

Kiril Bangachev, Guy Bresler, Jonathan Kogan, Yury Polyanskiy

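AI summary: This paper explores why modern embedding-based retrieval models can handle billions or even trillions of data points despite a low representation dimension (about 1000). The study focuses on the maximal-margin embedding problem, analyzing how large a margin can be achieved in limited dimension for a given query-document relevance matrix. The paper proves that, under specific conditions, dimension $O(k \log(n/k))$ suffices to reach the theoretically optimal margin, resolving the dimension requirement of the corresponding model, and experimentally verifies the advantage of the sigmoid loss for producing large-margin embeddings.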

Abstract

Why does the low dimensionality of representations, typically $d\approx 1000$, not prevent modern embedding-based retrieval models from scaling to billions, or even trillions, of data points? To answer this question, we study maximal-margin embeddings in the following retrieval model, classically studied in communication complexity [PS86] and more recently in embedding-based retrieval [WBNL26]. Let $A\in \{0,1\}^{N\times n}$ be a matrix indicating whether each of $N$ queries is relevant to each of $n$ documents. We are interested in the largest margin $m>0$, denoted by $\mathsf{m}^{\mathsf{rd}}(d, A)$, for which there exist unit-norm embeddings of the queries and documents $\{U_j\}_{j = 1}^N, \{V_i\}_{i = 1}^n$ with the following property: $\langle U_j, V_i\rangle \ge m$ whenever $A_{ji} = 1$ and $\langle U_j, V_i\rangle \le -m$ otherwise. A large margin is a key proxy for representation quality: it controls both robustness to perturbations and compositional generalization across queries. Our main theorem establishes that the best possible margin without a restriction on the dimension, $\mathsf{m}^{\mathsf{rd}}(+\infty, A)$, can be nearly achieved in dimension $d = O(\mathsf{m}^{\mathsf{rd}}(+\infty, A)^{-2}\log n)$, which improves a theorem of [BDES02]. Together with a matching lower bound in Theorem 1.5, we conclude that when $A\in \{0,1\}^{\binom{n}{k}\times n}$ is the matrix containing all possible $k$-sparse rows once, dimension $d = O(k\log (n/k))$ is necessary and sufficient for the maximal possible margin $\mathsf{m}^{\mathsf{rd}}(+\infty, A) = \Theta(k^{-1/2})$ in this setting. This fully resolves the setup of [WBNL26]. We also give several constructions for large margins when $d = o(k\log (n/k))$. Finally, we empirically test the InfoNCE and sigmoid losses for producing large margin embeddings and demonstrate a clear advantage of the sigmoid loss.
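
The margin in this definition can be evaluated directly for any candidate embedding. The following is a minimal sketch, assuming NumPy is available; the helper name `margin` and the toy matrices are illustrative, and the function only measures the margin a given pair $(U, V)$ achieves on $A$, it does not search for the optimal one.

```python
import numpy as np

def margin(U, V, A):
    """Largest m with <U_j, V_i> >= m when A[j, i] = 1 and <= -m otherwise.

    A non-positive result means these embeddings do not separate A at all.
    """
    U = U / np.linalg.norm(U, axis=1, keepdims=True)  # enforce unit norm
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    signs = np.where(A == 1, 1.0, -1.0)               # +1 relevant, -1 irrelevant
    return float(np.min(signs * (U @ V.T)))

# Toy instance: N = 2 queries, n = 2 documents, each query relevant to one document.
A = np.eye(2, dtype=int)
U = np.array([[1.0], [-1.0]])  # one-dimensional query embeddings
V = np.array([[1.0], [-1.0]])  # one-dimensional document embeddings
good = margin(U, V, A)

# Mapping both documents to the same point cannot separate them.
bad = margin(U, np.array([[1.0], [1.0]]), A)
print(good, bad)
```

In this toy case a single dimension already achieves the largest possible margin, which is only possible because the relevance pattern is so simple; the abstract's results concern how the required dimension grows for richer matrices such as all $k$-sparse rows.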

2605.23555 2026-05-25 cs.CV

Generator-Refiner-Examiner: A Tri-Module Data Augmentation Framework for 3D Human Avatar Learning from Monocular Videos

Gangjian Zhang, Jian Shu, Sicheng Yu, Wenhao Shen, Yu Feng, Hao Wang

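AI summary: This paper studies the challenge of reconstructing photorealistic, animatable 3D human avatars from monocular videos. To address the difficulty existing methods have in capturing details when data is scarce, it proposes TrioMan, a tri-module data augmentation framework whose three cooperating components, a Generator, a Refiner, and an Examiner, respectively generate diverse samples, improve generation quality, and select subject-consistent samples. Experiments show that the method outperforms existing state-of-the-art methods on multiple benchmark datasets.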

Abstract

This paper addresses the challenge of reconstructing photorealistic and animatable 3D human avatars from monocular videos. While existing methods rely on combining per-subject optimization with generic human priors, they often fail to capture fine-grained details when training frames are limited. To mitigate this data scarcity, we propose TrioMan, a systematic tri-module framework for augmented 3D avatar learning. Our approach comprises three synergistic components. The Generator creates diverse unseen samples by imposing Gaussian perturbations on pose and camera. The Refiner improves the quality of generated data through one-step diffusion guided by texture and geometry cues. The Examiner selects subject-consistent samples using a dual-branch attention-based similarity evaluation. Experiments on the X-Humans and NeuMan benchmarks show that TrioMan outperforms state-of-the-art methods.

2605.23551 2026-05-25 cs.LG cs.AI

Goal-Conditioned Agents that Learn Everything All at Once

Michael Matthews, Matthew Jackson, Michael Beukman, Thomas Foster, Alistair Letcher, Scott Fujimoto, Cédric Colas, Jakob Foerster

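AI summary: This paper proposes LEO (Learning Everything all at Once), a new method for improving the efficiency of goal-conditioned reinforcement learning. By outputting the values and actions for all goals at once, it enables efficient parallel updates and removes the heavy computational cost of conventional all-goals learning. Experiments show that LEO performs well on goal-conditioned tasks and in continuous control environments, with a speed-up of more than 250x over the conventional approach, providing a useful tool for reinforcement learning in complex environments.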

Abstract

A goal-conditioned reinforcement learning agent exploring an environment will see a wealth of information throughout a trajectory, most of which is discarded when only performing on-policy updates with respect to the commanded goal. All-goals learning, where each transition is used for learning off-policy with respect to every goal, allows agents to extract maximal information; however, it is usually computationally infeasible when done via naive relabelling. This can be overcome by jointly outputting values and actions for every goal at once, allowing for efficient, parallel all-goals updates with a single pass through the network, in a process we call Learning Everything all at Once (LEO). We show that this approach significantly outperforms other methods on goal-conditioned Craftax and is competitive with existing baselines on continuous control environments, while achieving a >250x speed-up compared to all-goals relabelling. We then go on to show that this approach can be made even more powerful by using LEO as a teacher network, rather than a direct actor. We hope that, by unlocking all-goals learning at scale, LEO can serve as a useful tool for RL practitioners in complex environments. We open source our code.
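
The contrast between relabelling one goal at a time and updating every goal at once can be seen in a tabular stand-in. This is a sketch under assumptions, not the paper's network: a table `Q[s, a, g]` replaces the neural network, a goal `g` is taken to mean "reach state g" with reward 1 on arrival, and the function name and hyperparameters are hypothetical. It assumes NumPy is available.

```python
import numpy as np

def all_goals_update(Q, s, a, s_next, lr=0.5, gamma=0.9):
    """Update Q[s, a, g] for every goal g from a single transition.

    Q has shape (n_states, n_actions, n_goals); goal g means "reach state g".
    """
    n_goals = Q.shape[2]
    reached = (np.arange(n_goals) == s_next).astype(float)  # reward 1 for the goal just reached
    bootstrap = Q[s_next].max(axis=0)                        # best next value, one entry per goal
    target = reached + gamma * (1.0 - reached) * bootstrap   # no bootstrapping past a reached goal
    Q[s, a, :] += lr * (target - Q[s, a, :])                 # one vectorized update over all goals
    return Q

Q = np.zeros((3, 2, 3))
Q = all_goals_update(Q, s=1, a=0, s_next=2)  # teaches "from state 1, action 0 reaches goal 2"
Q = all_goals_update(Q, s=0, a=0, s_next=1)  # reuses that knowledge for goal 2 from state 0
print(Q[0, 0])
```

A naive relabelling loop would call a single-goal update once per goal for each transition; here the goal axis is just another array dimension, which is the tabular analogue of producing values for every goal in one forward pass.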

2605.23550 2026-05-25 math.OC cs.AI cs.NA math.NA

RA-DCA: A Randomized Active-Set DCA for Directional Stationarity in Max-Structured DC Programs

Yi-Shuai Niu

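AI summary: This paper studies a class of nonsmooth difference-of-convex optimization problems in which the subtracted convex term is the maximum of several smooth convex functions. To address the problem that standard DCA may converge to critical points that are not directionally stationary, while avoiding the high computational cost of large or combinatorial active sets, the author proposes RA-DCA, a DCA method based on randomized active sets. By projecting active gradients onto sampled directions, checking a sampled vertex residual, and using a small linear program as a fallback only when the residual is small, the method preserves the descent structure of DCA while reducing the randomized screening step to matrix multiplications. Experiments show that the method avoids nonstationary critical points across several models and screens effectively on combinatorial problems.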

Comments: 40 pages, 7 figures
Abstract

We study nonsmooth difference-of-convex programs whose subtracted convex term is a finite maximum of smooth convex functions. In this setting, standard DCA iterations may converge to critical points that are not directionally stationary, whereas exact active-vertex screening can be expensive when active sets are large or combinatorial. We propose RA-DCA, a vertex-first randomized active-set DCA that projects active gradients onto sampled directions, checks a sampled vertex residual, and uses a small linear program only as a low-residual convex-combination fallback. The method preserves the descent structure of DCA and reduces the randomized screening layer to matrix multiplications. Under the stated regularity, numerical active-set consistency, and random-embedding assumptions, every accumulation point generated by the safeguarded method is directionally stationary with probability one. MATLAB experiments first test the theorem on degenerate max-affine, max-quadratic, and sparse support-function models, where the safeguard avoids nonstationary critical points and closely tracks a full active-vertex scan. Block top-k tests then show that the same screening idea remains useful when exact aggregate enumeration is combinatorial. Trimmed-regression, complementarity, and QUBO diagnostics separate cases where active-set selection helps from cases dominated by multistart search, the DC split, or other problem-specific features.
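
The gap between critical points and directionally stationary points shows up already in one dimension. The sketch below uses the textbook example $f(x) = x^2/2 - \max(x, -x)$, chosen here for illustration and not taken from the paper, and it performs a plain exhaustive check over active pieces rather than RA-DCA's randomized screening. It relies on the fact that, for smooth $g$, a point is directionally stationary for $g - \max_i h_i$ exactly when $\nabla g$ equals the gradient of every active piece.

```python
def f(x):
    """f(x) = g(x) - max(h_1(x), h_2(x)) with g = x^2 / 2, h_1 = x, h_2 = -x."""
    return 0.5 * x * x - max(x, -x)

def active_gradients(x, tol=1e-12):
    """Gradients of the pieces h_i that attain the maximum at x."""
    values = {1.0: x, -1.0: -x}  # maps gradient of h_i to the value h_i(x)
    top = max(values.values())
    return [grad for grad, val in values.items() if top - val <= tol]

def dca_step(y):
    """argmin_x x^2 / 2 - y * x, which is simply x = y."""
    return y

def is_d_stationary(x, tol=1e-12):
    """With grad g(x) = x, x is d-stationary iff it equals every active gradient."""
    return all(abs(x - grad) <= tol for grad in active_gradients(x))

x0 = 0.0
stuck = dca_step(0.0)                        # subgradient 0 = 0.5 * (+1) + 0.5 * (-1)
escaped = dca_step(active_gradients(x0)[0])  # a single active-vertex gradient

print(stuck, f(stuck), is_d_stationary(stuck))
print(escaped, f(escaped), is_d_stationary(escaped))
```

At $x = 0$ both pieces are active, so a DCA iteration that happens to pick the averaged subgradient never moves, even though either vertex gradient yields a strict decrease. Checking vertices one by one is cheap here but scales with the size of the active set, which is the cost the abstract's sampled screening is designed to avoid.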

2605.23540 2026-05-25 cs.LG

When One Point Is Not Enough: Addressing Ambiguous Instances in Dimensionality Reduction by Splitting

Diede P. M. van der Hoorn, Alessio Arleo, Fernando V. Paulovich

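AI summary: This paper studies the distortion of neighborhood structure caused by ambiguous data points in dimensionality reduction, and proposes a graph-based approach that identifies these ambiguous instances and replicates them, mapping them to multiple positions to reflect more accurately their multiple neighborhood relationships in the high-dimensional space. The approach mitigates the loss of local structure caused by single-point mapping in conventional dimensionality reduction techniques, and is shown on multiple examples to reveal hidden neighborhood relationships.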

Abstract

Dimensionality Reduction (DR) methods are widely used to visualize high-dimensional data. One key task in DR-based analysis is discovering neighborhoods, which relies on analyzing the fine-grained local structure of a projection. However, DR is an inherently lossy process; no technique can perfectly preserve the high-dimensional relationships, and projections therefore contain visual artifacts. In this paper, we highlight a typically overlooked source of visual artifacts: ambiguous instances. These are instances that are highly similar to multiple mutually dissimilar neighborhoods in the high-dimensional space. Standard DR methods cannot faithfully project such instances, since each data instance is mapped to a single point in the visual space. As a result, such an instance is placed in only one of its neighborhoods (or in none at all), so only part of its neighborhood structure is represented. We call this distortion partial neighborhood embedding. We introduce a graph-based approach that identifies ambiguous instances and replicates them as multiple points in the projection, placing each copy within its respective neighborhood. We use UMAP for our results, but our approach also generalizes to other local graph-based DR techniques. We show that it reveals previously hidden neighborhood memberships in projections and reduces partial neighborhood embedding across multiple examples, a finding further supported by quantitative analyses.

2605.23523 2026-05-25 cs.CV

ComPose: When to Trust Hands for Object Pose Tracking

Jisu Shin, Junoh Lee, JunGyu Lee, Inhwan Bae, Dohyeon Lee, Hokyun Im, Youngwoon Lee, Hae-Gon Jeon

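AI summary: This paper proposes ComPose, a 6DoF object pose tracking framework for robustly tracking hand-occluded objects from RGB video. The method treats hand motion as a complementary cue rather than a mere occluder: within a unified tracking pipeline it combines object and hand cues, adaptively selects key hand joints, fuses cues from multiple sources, and refines the result using geometric evidence, yielding stable and accurate object trajectory estimates. Experiments show that the method performs well under severe occlusion and geometric ambiguity, produces temporally consistent 3D trajectories without external smoothing, and suits downstream tasks such as robot manipulation.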

Comments: 22 pages, 10 figures
Abstract

Reconstructing the motion of objects from videos is a key component for embodied AI and robot manipulation. While diverse approaches to object pose tracking have been studied, they rely heavily on strong external priors, such as depth data or 3D templates, and remain highly vulnerable to severe occlusions by hand grasps despite the use of explicit masks. In this work, we present ComPose, a 6DoF object tracking framework designed for hand-aware object pose estimation from RGB video. Rather than treating the hand purely as an occluder, our method harmonizes hand motions as a complementary cue for object tracking. In detail, we recover a variety of object motions over time by combining object and hand cues from foundation models within a unified tracking pipeline. Here, ComPose adaptively selects informative hand joints, combines object- and hand-derived cues for motion estimation, and refines the resulting object motion using visible geometric evidence and a learned correction. We further enforce temporal consistency over both rotation and translation, yielding stable 3D object trajectories over time without any external smoothing. Extensive experiments show that our method is accurate, efficient, and robust under severe hand occlusion and geometric ambiguity. In addition, the resulting trajectories can also effectively transfer to downstream robot manipulation by enabling robots to reconstruct human actions from online videos.

2605.23522 2026-05-25 cs.LG cs.AI cs.CV

Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models

Jade Zou, Tao Huang, Weijie Kong, Junzhe Li, Yue Wu, Qi Tian, Jiangfeng Xiong, Jianwei Zhang, Liefeng Bo, Zhao Zhong

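AI summary: This paper studies how to post-train flow-matching models with reinforcement learning (RL) to improve generation quality and prompt alignment. The core step is turning the deterministic sampling trajectory into a stochastic policy, by designing a sampler consistent with a stochastic differential equation (SDE) that balances exploration and stability. The proposed sampler, Precise, keeps the denoising trajectory SDE-consistent while effectively reducing excess noise, and experiments show that it outperforms existing methods in both reward optimization speed and generation quality.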

Abstract

Reinforcement learning (RL) has become an effective way to improve prompt alignment and perceptual quality in diffusion and flow-matching generators. A critical step for applying online RL to flow matching is turning the deterministic sampling trajectory into a stochastic policy, typically by replacing the reverse-time Ordinary Differential Equation (ODE) with a Stochastic Differential Equation (SDE). The stochastic sampler, controlling the exploration behavior and denoising dynamics, is thus part of the policy, and its design can significantly affect the reward optimization performance. We break down the sampler design into two interdependent components: choosing the right amount of stochastic exploration, and discretizing the resulting SDE faithfully at the small step counts used in RL. To address the first component, we analyze the inherent tension between exploration and stability in denoising and derive an SDE schedule that balances the two. Turning to the discretization challenge, we use a toy example to show that existing samplers can deviate from the flow-matching process, either by introducing excessive discretization noise or by relying on heuristic rules that do not guarantee convergence to the data distribution. To address these issues, we propose Precise, a new stochastic sampler that balances effective exploration with stability. Crucially, Precise keeps the denoising trajectory SDE-consistent through a novel approximation that freezes the clean-latent posterior mean, resolving the excess noise issue in standard samplers. Extensive experiments demonstrate that this formulation leads to significantly faster and more stable reward optimization via reinforcement learning, achieving state-of-the-art alignment scores (e.g., PickScore, HPSv2.1) while requiring 13.1-53.2% less wall-clock training time to match the best in-domain performance of prior samplers.

2605.23518 2026-05-25 cs.CV

VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset

Zhizhou Chen, Shanyan Guan, Zhanxin Gao, En Ci, Yanhao Ge, Wei Li, Zhenyu Zhang, Jian Yang, Ying Tai

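AI summary: This paper presents VINS-120K, a large-scale dataset of 120K high-resolution image-editing instruction examples, each image exceeding 4K resolution, to advance research on ultra-high-resolution image editing. The study also proposes a high-frequency-aware post-adaptation strategy that lets existing models handle ultra-high-resolution images effectively, and builds the VINS-4KEval benchmark to evaluate editing results. The work provides high-quality data and methodological improvements for ultra-high-resolution image editing.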

Abstract

Directly editing ultra-high-resolution (UHR) images is valuable but underexplored, primarily due to the lack of high-quality data and the challenge in modeling high-frequency texture details. We introduce VINS-120K, the first large-scale dataset for instruction-based UHR image editing, comprising 120K carefully curated triplets of instruction, input image, and edited image. Each image exceeds 4K resolution ($\geq$4096 $\times$ 4096) and is filtered through a rigorous multi-stage pipeline to ensure visual quality, instruction alignment, and aesthetic fidelity. Built on VINS-120K, we further develop a high-frequency-aware post-adaptation strategy to extend pretrained non-high-resolution models to the UHR regime. We also present VINS-4KEval, a benchmark covering diverse editing types, to facilitate consistent evaluation in UHR settings. Experiments confirm that our work improves fine-grained detail synthesis and texture realism in UHR image editing.

2605.23510 2026-05-25 cs.LG

Learning partially observed systems with neural Hamiltonian ordinary differential equations

学习部分观测系统:神经哈密顿常微分方程

Sunniva Meltzer, Sølve Eidnes, Alexander Johannes Stasik

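AI summary This paper proposes neural Hamiltonian ordinary differential equations (NHODE), a framework for learning dynamical systems from partially observed data. The method combines Hamiltonian neural networks with neural ODEs: the Hamiltonian structure ensures energy conservation, while the flexibility of neural ODEs allows the loss to be defined only on the observed variables, enabling effective inference of the unobserved ones. Experiments show that NHODE achieves higher prediction accuracy and long-horizon stability across several complex systems, capturing both observed and latent dynamics and outperforming purely data-driven methods.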

详情
AI中文摘要

从数据中学习动力系统时,嵌入物理结构可以约束解空间并提高泛化能力,但许多物理信息模型假设可以访问完整的系统状态。这限制了它们在部分观测场景中的使用,其中某些状态变量完全未被观测到,且必须在没有直接监督的情况下推断。在这里,我们提出了神经哈密顿常微分方程(NHODE),这是一个结合哈密顿神经网络(HNN)和神经常微分方程(neural ODE)的框架,用于从数据中学习部分观测的动力系统。哈密顿结构通过构造保证能量守恒,而神经常微分方程框架则提供了灵活的训练过程,使得损失可以仅定义在观测变量上。我们还通过对称性感知的坐标变换和可分离的能量公式,融入了额外的物理约束。该框架在复杂度递增的系统上进行了评估,从线性和非线性质量-弹簧系统到混沌三体问题。在所有示例中,嵌入的物理结构越多,预测的准确性和长期稳定性就越好。即使在最具挑战性的情况下,NHODE框架也能捕捉到观测和潜在动力学,而纯数据驱动的基线则变得不稳定。

英文摘要

When learning dynamical systems from data, embedding physical structure can constrain the solution space and improve generalization, but many physics-informed models assume access to the full system state. This limits their use in partially observed settings, where some state variables are completely unobserved and must be inferred without direct supervision. Here, we present neural Hamiltonian ordinary differential equations (NHODE), a framework that combines Hamiltonian neural networks (HNNs) with neural ordinary differential equations (neural ODEs) to learn partially observed dynamical systems from data. The Hamiltonian structure enforces energy conservation by construction, while the neural ODE framework enables a flexible training procedure that allows the loss to be defined only on observed variables. We also incorporate additional physical constraints through symmetry-aware coordinate transformations and separable energy formulations. The framework is evaluated on systems of increasing complexity, from linear and nonlinear mass-spring systems to the chaotic three-body problem. Across all examples, increasing the amount of embedded physical structure improves the accuracy and long-horizon stability of the predictions. Even in the most challenging regimes, the NHODE framework captures both observed and latent dynamics, whereas purely data-driven baselines become unstable.
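
The two ingredients named in this abstract, a Hamiltonian that conserves energy by construction and a loss defined only on observed variables, can be illustrated on the linear mass-spring case. The following is a minimal sketch, not the authors' implementation: it assumes a hand-written parametric Hamiltonian H(q, p) = p²/(2m) + k q²/2 in place of a neural network, a leapfrog integrator in place of a neural ODE solver, and hypothetical names (`simulate`, `observed_loss`).

```python
def simulate(k, m=1.0, q0=1.0, p0=0.0, dt=0.01, steps=1000):
    """Leapfrog integration of H(q, p) = p^2 / (2m) + k q^2 / 2."""
    q, p = q0, p0
    qs, ps = [q], [p]
    for _ in range(steps):
        p -= 0.5 * dt * k * q   # half kick: dp/dt = -dH/dq
        q += dt * p / m         # drift:     dq/dt =  dH/dp
        p -= 0.5 * dt * k * q   # half kick
        qs.append(q)
        ps.append(p)
    return qs, ps


def energy(q, p, k, m=1.0):
    """Value of the Hamiltonian at one state."""
    return p * p / (2 * m) + 0.5 * k * q * q


def observed_loss(k_guess, q_observed):
    """Mean squared error on positions only; momenta are never supervised."""
    qs, _ = simulate(k_guess, steps=len(q_observed) - 1)
    return sum((a - b) ** 2 for a, b in zip(qs, q_observed)) / len(q_observed)


# "Data": positions are observed, momenta stay hidden (assumed toy setup).
q_obs, p_hidden = simulate(k=2.0)

loss_true = observed_loss(2.0, q_obs)    # candidate with the right stiffness
loss_wrong = observed_loss(1.0, q_obs)   # candidate with the wrong stiffness
drift = abs(energy(q_obs[-1], p_hidden[-1], 2.0) - energy(q_obs[0], p_hidden[0], 2.0))
```

Because the candidate model is integrated as a full Hamiltonian system, fitting positions alone also pins down the unobserved momenta in this toy case, which is the intuition behind inferring latent variables without direct supervision.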

2605.23508 2026-05-25 cs.GR cs.AI cs.CV cs.MM eess.IV

DrawVideo: Generating Long Video from Storyboard Keyframe Sketches

DrawVideo: 从故事板关键帧草图生成长视频

Chuanzhi Xu, Huiqi Liang, Bang Shi, Huiming Zhang, Yifan Xiao, Guangcheng Lin, Haodong Chen, Qiang Qu, Zhicheng Lu, Weidong Cai

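AI summary DrawVideo is a controllable long-video generation framework driven by sketches and storyboards. From user-provided black-and-white sketches, appearance descriptions, and motion prompts, it generates long videos with clear structure and coherent content. The method decomposes a video into independently controllable shots, each defined by a sketch, an appearance prompt, and a motion prompt, and uses a hierarchical strategy to generate reference frames and action-state frames before synthesizing the full video. The study also builds SketchLongVideo, the first dataset for sketch-guided long-video generation; experiments show strong structural control, appearance consistency, and visual stability.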

详情
Comments
45 pages, 19 figures
AI中文摘要

长视频生成需要高保真合成、连贯的叙事结构以及用户对长时间跨度的控制。现有的文本到视频方法通常依赖单一长提示,限制了对姿态、构图、布局和运动的控制。我们提出 DrawVideo,一种草图引导、故事板驱动的可控长视频生成框架。DrawVideo 将长视频分解为独立可控的镜头,每个镜头由黑白草图、外观提示和运动提示定义。草图控制姿态和布局,外观提示定义身份、场景和风格,运动提示引导时间动态。DrawVideo 遵循分层“全局多镜头、局部单草图”策略:首先生成结构对齐的参考关键帧,然后将运动提示扩展为代表动作状态的衍生关键帧,最后在相邻关键帧之间合成片段以构建每个镜头。我们还引入了 SketchLongVideo,这是首个用于草图引导的文本到长视频生成的数据集,通过镜头检测、关键帧提取、视觉语言识别、提示分解和草图转换从动画视频构建。实验表明,DrawVideo 实现了强大的结构可控性、外观一致性、视觉稳定性和连贯的长视频生成。

英文摘要

Long video generation requires high-fidelity synthesis, coherent narrative structure, and user control over extended time spans. Existing text-to-video methods often rely on a single long prompt, limiting control over pose, composition, layout, and motion. We propose DrawVideo, a sketch-guided, storyboard-driven framework for controllable long-video generation. DrawVideo decomposes long videos into independently controllable shots, each defined by a black-and-white sketch, an appearance prompt, and a motion prompt. The sketch controls pose and layout, the appearance prompt defines identity, scene, and style, and the motion prompt guides temporal dynamics. DrawVideo follows a hierarchical 'global multi-shot, local single-sketch' strategy: it first generates a structure-aligned reference keyframe, then expands the motion prompt into derivative keyframes representing action states, and finally synthesizes clips between adjacent keyframes to build each shot. We also introduce SketchLongVideo, the first dataset for sketch-guided text-to-long-video generation, constructed from animation videos via shot detection, keyframe extraction, vision-language recognition, prompt decomposition, and sketch conversion. Experiments show that DrawVideo achieves strong structural controllability, appearance consistency, visual stability, and coherent long-video generation.

2605.23507 2026-05-25 cs.CV

MDS-DETR: DETR with Masked Duplicate Suppressor

MDS-DETR: 带有掩码重复抑制器的DETR

Chanho Lee, Seunghee Koh, Yunho Jeon, Junmo Kim

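AI summary DETR is a powerful end-to-end object detector, but its one-to-one matching strategy suffers from slow convergence and low recall. To address this, the paper proposes MDS-DETR, which combines one-to-one and one-to-many supervision within a single decoder. Its Masked Duplicate Suppressor (MDS), built on confidence-based causal masking, filters out the duplicate predictions produced by one-to-many supervision, yielding explainable, duplicate-free predictions without extra queries or auxiliary decoders. Experiments on COCO show that MDS-DETR achieves higher detection accuracy than existing methods with only a small increase in training time.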

详情
Comments
code is available at https://github.com/DChoLee/MDS-DETR
AI中文摘要

DEtection TRansformer (DETR) 是一种强大的端到端目标检测器,但其一对一匹配策略存在收敛慢和召回率低的问题。解决此问题的常见方法是使用一对多标签分配以提供更多正样本。然而,现有使用一对多匹配作为辅助目标的方法会导致训练成本增加,且其辅助解码器在推理时被丢弃。为解决这一限制,我们提出MDS-DETR,它在单一解码器中同时利用一对一和一对多监督。具体来说,我们引入了一个掩码重复抑制器(MDS),通过基于置信度的因果掩码向自注意力注入不对称性。MDS过滤掉由一对多监督层生成的重复项,在完全端到端的框架中实现可解释、无重复的预测。MDS-DETR优于现有的一对多DETR变体,如MS-DETR、MR.DETR和Relation-DETR,且无需依赖任何额外的查询或辅助解码器。在MS COCO上使用ResNet-50骨干网络进行12轮训练,MDS-DETR相比Deformable-DETR实现了+2.8 mAP的提升,训练时间仅增加5%,并且比最先进的MR.DETR高出+0.3 mAP,同时训练速度甚至快20%。我们的代码和模型可在 https://github.com/DChoLee/MDS-DETR 获取。

英文摘要

The DEtection TRansformer (DETR) is a powerful end-to-end object detector, yet its one-to-one matching strategy suffers from slow convergence and low recall. A common approach to address this issue is to use one-to-many label assignment to provide more positive samples. However, existing methods that use one-to-many matching as an auxiliary objective lead to increased training costs, with their auxiliary decoders discarded during inference. To address this limitation, we propose MDS-DETR, which leverages both one-to-one and one-to-many supervision within a single decoder. Specifically, we introduce a Masked Duplicate Suppressor (MDS) that injects asymmetry into self-attention via confidence-based causal masking. MDS filters out the duplicates generated by the one-to-many supervised layer, enabling explainable, duplicate-free predictions in a fully end-to-end framework. MDS-DETR outperforms existing one-to-many DETR variants such as MS-DETR, MR.DETR and Relation-DETR, without relying on any additional queries or auxiliary decoders. Under a 12-epoch training schedule on MS COCO with a ResNet-50 backbone, MDS-DETR achieves a +2.8 mAP improvement over Deformable-DETR with only a 5% increase in training time, and outperforms the state-of-the-art MR.DETR by +0.3 mAP while being even 20% faster in training. Our code and models are available at https://github.com/DChoLee/MDS-DETR.
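
The abstract does not spell out how the confidence-based causal mask is built, so the following is only one plausible reading, written as a minimal sketch: queries are ranked by confidence, and a query may attend to itself and to queries ranked above it. The function name `confidence_causal_mask` and the tie-breaking rule are assumptions for illustration; the actual construction is in the authors' repository.

```python
def confidence_causal_mask(confidences):
    """Return mask[i][j] == True when query i may attend to query j.

    Assumed rule: rank queries by descending confidence (ties broken by
    index) and let each query see itself and every higher-ranked query.
    """
    n = len(confidences)
    order = sorted(range(n), key=lambda i: -confidences[i])
    rank = [0] * n
    for position, index in enumerate(order):
        rank[index] = position
    return [[rank[j] <= rank[i] for j in range(n)] for i in range(n)]


# Three queries; query 1 is the most confident, query 0 the least.
mask = confidence_causal_mask([0.2, 0.9, 0.5])
```

Under this rule the mask is asymmetric: the least confident query can see the others and recognise itself as a duplicate, while the most confident query sees only itself and so is never suppressed by a weaker copy.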

2605.23504 2026-05-25 cs.LG cs.AI

VACE: Learning Geometrically Structured Representations for Time Series Anomaly Detection

VACE:学习几何结构化表示用于时间序列异常检测

Alberto D. Cencillo, Leonardo Concepción, Isaac Triguero, Julián Luengo

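AI summary This paper proposes VACE, a self-supervised method for anomaly detection in multivariate time series. Through velocity-aligned channel embeddings, VACE learns a representation of normality that is compact and directionally coherent, so anomalies can be identified more accurately. The method needs no negative samples or synthetic anomalies: the encoder is trained with a velocity-consistency objective so that normal trajectories stay locally smooth and aligned in the embedding space. Experiments show that VACE outperforms more complex methods on multiple benchmark datasets.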

详情
Comments
16 pages, 5 figures
AI中文摘要

多变量时间序列中的异常检测是广泛实际应用中的关键任务,其中异常行为罕见、标签不可用且漏检成本高昂。核心挑战在于学习足够精确的正常性表征以标记偏差。表示自监督学习(通常通过对比方法)通过将时间补丁嵌入到潜在空间来解决这一问题,其中正常性占据一个定义明确的区域,异常通过几何偏差检测。然而,对比方法通过配对采样启发式间接塑造该空间,无法对基于距离评分所需的几何结构进行显式控制。这意味着正常表示的紧凑程度以及距离是否具有方向意义。我们提出VACE(速度对齐通道嵌入),一种自监督异常检测方法,将正常性表示为嵌入空间中紧凑且方向一致的区域。为此,VACE通过速度一致性目标训练通道感知编码器,无需负样本和合成异常,使得正常轨迹局部平滑且对齐。在测试时,马氏距离位置得分和速度库方向得分相乘,标记同时偏离分布和动态异常的点。尽管方法简单,VACE在严格评估下于TSB-AD-M上实现了最先进性能,显著优于使用更大预算训练的复杂方法。

英文摘要

Anomaly detection in multivariate time series is a critical task across a wide range of real-world applications, where abnormal behaviour is rare, labels are unavailable, and the cost of a miss is high. The central challenge is learning a characterisation of normality precise enough to flag deviations. Self-supervised representation learning, typically through contrastive approaches, addresses this by embedding temporal patches into a latent space where normality occupies a well-defined region, with anomalies detected by geometric deviation. However, contrastive approaches shape this space indirectly through pair-sampling heuristics, providing no explicit control over the geometric structure that distance-based scoring requires, namely how tightly normal representations are grouped and whether distances are directionally meaningful. We present VACE (Velocity-Aligned Channel Embeddings), a self-supervised anomaly detection method that represents normality as a compact, directionally coherent region in the embedding space. To this end, VACE trains a channel-aware encoder through a velocity-consistency objective, with no negatives and no synthetic anomalies, so that normal trajectories are locally smooth and aligned. At test time, a Mahalanobis positional score and a velocity-bank directional score are combined multiplicatively, flagging points that are simultaneously off-distribution and dynamically atypical. Despite its simplicity, VACE achieves state-of-the-art performance on TSB-AD-M under rigorous evaluation, significantly outperforming more complex methods trained on substantially larger budgets.
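
The multiplicative test-time score can be sketched in a few lines. This is an illustrative toy, not the paper's scoring code: it assumes 2-D embeddings, takes the directional score to be one minus the best cosine similarity against a bank of normal velocities, and uses hypothetical names throughout.

```python
import math


def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance of a 2-D point from a Gaussian (mean, cov)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - mean[0], x[1] - mean[1])
    quad = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.sqrt(quad)


def cosine(u, v):
    dot = u[0] * v[0] + u[1] * v[1]
    return dot / (math.hypot(*u) * math.hypot(*v))


def directional_score(velocity, bank):
    """Assumed form: 0 when the velocity matches some stored normal velocity."""
    return 1.0 - max(cosine(velocity, ref) for ref in bank)


def anomaly_score(z, velocity, mean, cov, bank):
    """Product of positional and directional scores."""
    return mahalanobis_2d(z, mean, cov) * directional_score(velocity, bank)


mean, cov = (0.0, 0.0), ((1.0, 0.0), (0.0, 1.0))
bank = [(1.0, 0.0)]  # normal embeddings drift along +x in this toy setup

far_but_typical = anomaly_score((3.0, 0.0), (1.0, 0.0), mean, cov, bank)
far_and_atypical = anomaly_score((3.0, 0.0), (-1.0, 0.0), mean, cov, bank)
```

The product form means a point far from the normal region still scores zero if it moves the way normal points do, which matches the abstract's wording that flagged points are simultaneously off-distribution and dynamically atypical.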

2605.23497 2026-05-25 cs.CL

Asking For An Old Friend: Diagnosing and Mitigating Temporal Failure Modes in LLM-based Statutory Question Answering

询问老朋友:诊断和缓解基于LLM的法定问答中的时间故障模式

Max Prior, Andreas Schultz, Matthias Grabmair

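AI summary This study examines two failure modes of LLM-based legal question answering when handling time-sensitive statutes: staleness after statutory amendments and a bias toward newer provisions. The authors build a benchmark of 312 expert-validated German legal QA pairs and evaluate several LLMs under different inference settings. The results show that retrieval-augmented methods substantially improve temporal validity, whereas relying on web search alone is unstable and exhibits recency bias. The study stresses that temporal validity must be treated as a hard constraint in legal QA.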

详情
AI中文摘要

大型语言模型越来越多地用于法律研究,但其固定的训练截止日期和对静态参数知识的依赖与成文法的演变性质相矛盾。我们研究了两种时间故障模式:截止后过时(模型在立法修正后应用被取代的规则)和近因偏差(即使历史版本支配事实模式,模型也偏好较新的规定)。为此,我们提出了一个包含312个专家验证、时间敏感的德国法定问答对的基准,涵盖三个类别:截止后修正问题、修正前问题和多条款修正前问题。我们评估了来自OpenAI、Anthropic和DeepSeek的五个LLM,在四种推理设置下:普通、网络搜索和两种检索增强变体(通过事实日期提取和版本过滤强制执行时间有效性)。使用经过人类专家评分验证的LLM作为评判,我们发现普通设置在截止后设置中性能严重下降。两种RAG方法在所有问题类型上均显著提高了性能,而网络搜索则产生不稳定的收益,并在历史锚定任务上表现出明显的近因偏差。我们的结果表明,可靠的法律问答需要将时间有效性视为硬约束。

英文摘要

Large language models are increasingly used for legal research, yet their fixed training cutoffs and reliance on static parametric knowledge are at odds with the evolving nature of statutory law. We study two temporal failure modes: post-cutoff staleness, where models apply superseded rules after legislative amendments, and recency bias, where models prefer newer provisions even when a historical version governs the fact pattern. To this end, we present a benchmark of 312 expert-validated, time-sensitive German statutory QA pairs spanning three categories: Post-Cutoff Amendment Questions, Pre-Amendment Questions, and Multi-Provision Pre-Amendment Questions. We evaluate five LLMs by OpenAI, Anthropic and DeepSeek under four inference settings: Vanilla, Web-search, and two retrieval-augmented variants that enforce temporal validity via fact-date extraction and version filtering. Using an LLM-as-a-judge validated against human expert ratings, we find severe degradation in the Vanilla post-cutoff setting. Both RAG approaches substantially improve performance across all question types, while web search yields unstable gains and exhibits a marked recency bias on historically anchored tasks. Our results indicate that reliable legal QA requires treating temporal validity as a hard constraint.
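
Treating temporal validity as a hard constraint amounts to filtering retrieved provision versions by the date of the facts before anything reaches the model. The sketch below shows that filtering step only, under stated assumptions: the `ProvisionVersion` record, its field names, the provision identifier, and the placeholder texts are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ProvisionVersion:
    provision_id: str
    valid_from: date
    valid_to: Optional[date]  # None means the version is still in force
    text: str


def versions_in_force(versions: List[ProvisionVersion], fact_date: date):
    """Keep only the versions whose validity interval contains the fact date."""
    return [
        v for v in versions
        if v.valid_from <= fact_date
        and (v.valid_to is None or fact_date <= v.valid_to)
    ]


# Hypothetical provision with one amendment taking effect on 2022-01-01.
history = [
    ProvisionVersion("§ 1 ExampleAct", date(2015, 1, 1), date(2021, 12, 31), "old wording"),
    ProvisionVersion("§ 1 ExampleAct", date(2022, 1, 1), None, "new wording"),
]

hits_2020 = versions_in_force(history, date(2020, 6, 1))
hits_2024 = versions_in_force(history, date(2024, 6, 1))
```

Filtering before generation, rather than asking the model to pick the right version, is what makes the constraint hard: a fact pattern dated 2020 can never be answered from the 2022 wording because that wording is not in the context at all.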

2605.23493 2026-05-25 cs.AI

EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation

EDGE-OPD:通过证据引导的在线策略蒸馏内化特权上下文

Aristotelis Lazaridis, Dylan Bates, Aman Sharma, Brian King, Vincent Lu, Jack FitzGerald

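AI summary This paper studies how to keep privileged information from interfering with model behavior in unintended ways during On-Policy Self-Distillation (OPSD) with privileged context, and proposes EDGE-OPD. Using guided sampling and an evidence-masking mechanism, the method injects privileged information more precisely during training, so the student learns the target behavior rather than side effects. Experiments show that EDGE-OPD effectively improves identity learning and helps preserve the model's general capabilities.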

详情
AI中文摘要

在线策略蒸馏(OPD)作为一种LLM后训练范式,因其在不引入模型分布漂移和通用任务回归的情况下有效提升能力而受到广泛关注。在线策略自蒸馏(OPSD)是OPD的一种高效用例,它仅需单一模型同时作为学生和教师,并且具有在训练过程中向教师提供推理时缺失的特权上下文(例如角色、私有事实或已解决的方案)的优势。该方法面临的挑战在于,特权信息可能过度改变模型行为:它可能修改推理、降低通用能力,并影响响应长度、风格或局部token偏好等性能指标。因此,OPSD可能训练学生模型学习副作用而非期望的可迁移行为。本文在稀有token/身份设定下研究该问题,并提出EDGE-OPD(证据引导的在线策略蒸馏),这是OPSD的一种改进,具有两个显著特征:a) 使用引导展开在采样时向学生注入特权上下文行为,使得稀有目标行为实际出现在在线策略数据中;b) 应用证据掩码:学生仅在特权上下文支持采样token的token位置进行更新,而非展开中的每个token。实验表明,OPSD(及其变体RLSD,无论是否使用验证器)完全无法学习目标身份,而引导展开的集成使其成功。此外,掩码区域消融实验显示,角色信号定位于正证据尾部,这使我们能够获得关于高效知识迁移和通用能力保持的宝贵见解。

英文摘要

On-Policy Distillation (OPD) has gained wide traction as an LLM post-training paradigm due to its effectiveness in improving capabilities without introducing model distribution drift and, consequently, regression in general tasks. On-Policy Self-Distillation (OPSD) is an efficient use case of OPD, which is appealing as it requires only a single model as both student and teacher, and it also has the benefit of providing privileged context that is absent at inference time (e.g. a persona, a private fact, or a worked solution) to the teacher during the training process. The challenge in this approach is that the privileged information can change model behavior more than intended: it can modify reasoning, degrade general capabilities, and affect performance indicators like response length, style, or local token preferences. Consequently, OPSD may train the student on side effects rather than a desired, transferable behavior. In this paper, we study this problem in a rare-token/identity setting and propose EviDence GuidEd On-Policy Distillation (EDGE-OPD), a modification of OPSD with two distinct characteristics: a) it uses guided rollouts to inject privileged-context behavior into the student at sampling time, so that the rare target behavior is actually present in the on-policy data, and b) it applies an evidence mask: the student is updated only at token positions where the privileged context supports the sampled token, rather than on every token in the rollout. We empirically show that OPSD (and its variant RLSD, with and without a verifier) completely fails to learn a target identity, while the integration of guided rollouts allows them to succeed. Additionally, mask-region ablations show that the persona signal is localized to the positive-evidence tail, allowing us to draw valuable insights about efficient knowledge transfer and preservation of general-purpose capabilities.
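
The evidence mask described in point b) can be sketched as a per-token comparison. The abstract does not define "supports", so this minimal example assumes it means that the sampled token's log-probability is higher with the privileged context than without it; the function names, the `margin` parameter, and the numbers are illustrative assumptions.

```python
def evidence_mask(logp_with_context, logp_without_context, margin=0.0):
    """True at positions where the privileged context raises the sampled
    token's log-probability by more than `margin` (assumed criterion)."""
    return [
        with_ctx - without_ctx > margin
        for with_ctx, without_ctx in zip(logp_with_context, logp_without_context)
    ]


def masked_mean_loss(per_token_loss, mask):
    """Average the distillation loss over the kept positions only."""
    kept = [loss for loss, keep in zip(per_token_loss, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0


# Made-up log-probabilities of four sampled tokens under the two conditions.
logp_with = [-0.1, -2.0, -0.3, -1.5]
logp_without = [-0.1, -1.0, -2.5, -4.0]
per_token_loss = [0.5, 0.7, 1.2, 0.8]

mask = evidence_mask(logp_with, logp_without)
loss = masked_mean_loss(per_token_loss, mask)
```

Positions where the context changes nothing, or lowers the token's probability, contribute no gradient, which is how the update avoids training on side effects such as style or length shifts.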

2605.23482 2026-05-25 cs.CV cs.AI

Multimodal Distribution Matching for Vision-Language Dataset Distillation

多模态分布匹配用于视觉-语言数据集蒸馏

Jongoh Jeong, Hoyong Kwon, Minseok Kim, Kuk-Jin Yoon

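AI summary This study proposes Multimodal Distribution Matching (MDM), a multimodal dataset distillation method that efficiently produces compact synthetic datasets preserving vision-language semantics under limited compute and memory. MDM combines complementary components at the data, model, and loss levels to maintain cross-modal alignment and representation quality: it generates image-text pairs by sampling in the joint embedding space, builds a mixed teacher through weight-space interpolation based on pretrained models, and matches joint distributions with a geometry-aware loss. Experiments show that MDM performs well on multiple cross-architecture image-text retrieval tasks, substantially reducing distillation cost while remaining robust.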

详情
Comments
Accepted for publication at CVPR 2026. Project Page: https://andyj1.github.io/mdm
AI中文摘要

数据集蒸馏将大型训练集压缩为紧凑的合成数据集,同时保持下游性能。随着现代系统越来越多地处理成对的视觉-语言输入,多模态蒸馏必须在严格的计算和内存预算下保持表示质量和跨模态对齐,然而先前的方法通常需要大量计算并忽略其相关性。为了解决这个问题,我们提出了多模态分布匹配(MDM),一种用于高效且可泛化的多模态蒸馏的几何感知框架。具体来说,MDM在数据、模型和损失层面集成了互补组件。在数据层面,它通过在联合嵌入空间中的聚类采样来初始化合成图像-文本对。在模型层面,它通过在权重空间中根据独立微调模型与预训练锚点的角度偏差进行插值,形成混合教师模型。在损失层面,它使用几何感知的匹配目标在单位超球面上匹配联合分布,该目标利用跨模态一致性和差异方向上的联合特征以及对称对比学习。在跨架构评估的图像-文本检索基准上,MDM生成的紧凑合成集保留了多模态语义,显著降低了蒸馏成本,并在不同架构下保持鲁棒性。

英文摘要

Dataset distillation compresses large training sets into compact synthetic datasets while preserving downstream performance. As modern systems increasingly operate on paired vision-language inputs, multimodal distillation must preserve representation quality and cross-modal alignment under tight compute and memory budgets, yet prior methods often require heavy computation and overlook cross-modal correlations. To address this, we present Multimodal Distribution Matching (MDM), a geometry-aware framework for efficient and generalizable multimodal distillation. Specifically, MDM integrates complementary components at the data, model, and loss levels. At the data level, it initializes synthetic image-text pairs by sampling from clusters in the joint embedding space. At the model level, it forms a mixed teacher by interpolating independently fine-tuned models in weight space according to their angular deviation from the pretrained anchor. At the loss level, it matches joint distributions on the unit hypersphere using a geometry-aware matching objective that exploits the joint features in the cross-modal agreement and discrepancy directions along with symmetric contrastive learning. Across image-text retrieval benchmarks with cross-architecture evaluation, MDM yields compact synthetic sets that preserve multimodal semantics, substantially reduce distillation cost, and remain robust across architectures.

2605.23478 2026-05-25 cs.CV cs.AI

PhenoYieldNet: Learning Crop-Aware Phenological Responses for Multi-Crop Yield Prediction

PhenoYieldNet: 学习作物感知的物候响应以进行多作物产量预测

Yu Luo, Xiaogang Zhu, Shan Zeng, Wei Xiang, Thomas Francis Bishop, Zhiyong Wang, Kun Hu

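AI summary Accurate crop yield prediction is vital for sustainable agriculture and global food security. Existing methods mostly target a single crop, generalize poorly to multiple crops, and do not adequately account for each crop's specific phenological response to weather variation. This paper proposes PhenoYieldNet, a multi-crop yield prediction framework that learns crop-specific phenological features by explicitly modeling crops' phenological responses. It comprises a crop phenology bank and an attention module that dynamically capture spatiotemporal features across phenological stages, and it improves generalization with a pretrained model and a self-supervised strategy. Experiments show that it significantly outperforms existing methods on multi-crop datasets.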

详情
Comments
Accepted by CVPR2026
AI中文摘要

准确的作物产量预测对于可持续农业和全球粮食安全至关重要。现有方法主要针对单一作物预测开发,通常难以泛化到不同作物类型,且未能解决由复杂天气模式动态调节的独特作物物候响应。在本文中,我们提出PhenoYieldNet,一个多作物产量预测框架,通过显式建模作物对时间驱动因素的响应来学习作物特异性物候。具体来说,我们开发了一个作物感知的时间解码器,由作物物候库(CPB)和作物物候注意力(CPA)模块组成。CPB集成了一组可学习的嵌入,利用查询引导CPA模块学习特定作物最相关的物候模式。CPA模块显式捕获多尺度趋势和变化成分以构建时间上下文,使模型能够动态调整不同物候阶段的注意力。为了学习鲁棒且可泛化的多作物预测特征,编码器使用预训练基础模型初始化,并通过自监督时序对比适应策略进一步调整以对齐农业时间动态。在多作物数据集上进行的大量实验表明,我们提出的方法显著优于最先进的方法,在不同地区和作物上展现出强大的泛化能力。

英文摘要

Accurate crop yield prediction is crucial for sustainable agriculture and global food security. While existing methods are predominantly developed for single-crop prediction, they often struggle to generalize across diverse crop types, without addressing the unique crop phenological responses that are dynamically modulated by complex weather patterns. In this paper, we propose PhenoYieldNet, a multi-crop yield prediction framework that learns crop-specific phenology by explicitly modeling crop responses to temporal drivers. Specifically, we develop a crop-aware temporal decoder consisting of a Crop Phenology Bank (CPB) and a Crop Phenology Attention (CPA) module. The CPB integrates a set of learnable embeddings, which leverage a query to guide the CPA module to learn the most relevant phenology patterns for the specific crop. The CPA module explicitly captures multi-scale trend and variation components to construct temporal contexts, enabling the model to dynamically adjust the attention across different phenological stages. To learn robust and generalizable features for multi-crop prediction, the encoder is initialized with a pre-trained foundation model, and further adapted via a self-supervised Temporal Contrastive Adaptation strategy to align with agricultural temporal dynamics. Extensive experiments conducted on multi-crop datasets indicate that our proposed method significantly outperforms state-of-the-art methods, exhibiting strong generalization capabilities across different regions and crops.

2605.23477 2026-05-25 cs.RO

Semantically Structured Mixture-of-Experts for Compositional Robotic Manipulation

语义结构化混合专家用于组合机器人操作

Chengyu Deng, Guanqi Chen, Yizhou Chen, Zejia Liu, Zhiwen Ruan, Guanhua Chen, Jia Pan

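AI summary This study targets the high computational cost and poor generalization of diffusion-based robotic manipulation policies in multi-task settings and proposes a semantically structured mixture-of-experts diffusion policy (SMoDP). A lightweight skill predictor, guided by vision-language model annotations, routes action segments at inference time to expert modules specialized for particular behavioral phases, improving efficiency and interpretability. To make routing robust, the study also designs a dual contrastive alignment strategy that strengthens consistency between multimodal observations and language-defined skill semantics. Experiments show higher parameter efficiency and better task transfer on multi-task benchmarks.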

详情
Comments
Accepted to Robotics: Science and Systems (RSS) 2026
AI中文摘要

基于扩散的策略为精确机器人操作建立了新标准,但面临关键的可扩展性瓶颈:高性能模型计算成本高,而轻量级替代方案通常难以在多样化的多任务环境中泛化。混合专家(MoE)架构通过仅激活参数子集提供了一条有前景的效率路径。然而,现有的MoE路由机制通常依赖于低级噪声或潜在统计量,忽略了操作任务的组合性质。这可能导致可重用行为在专家间碎片化,限制可解释性和可迁移性。我们提出了用于组合机器人操作的语义结构化混合专家扩散策略(SMoDP),这是一个将专家专业化建立在语义任务结构上的框架。SMoDP利用一个轻量级的推理时技能预测器,该预测器由视觉语言模型(VLM)的离线标注监督,将动作块路由到特定行为阶段专业化的专家。为了确保鲁棒的分配,我们提出了一种双对比对齐策略,该策略将多模态观测建立在语言定义的技能语义上(模态间),同时强制执行视觉上不同但功能相关行为之间的路由一致性(模态内)。我们的方法在多任务基准测试中优于代表性的扩散和基于MoE的基线,参数效率显著提高,并通过参数高效微调展示了向新任务的有效组合迁移。项目网站:https://deng-cy20.github.io/SMoDP/

英文摘要

Diffusion-based policies have established a new standard for precise robotic manipulation but face a critical scalability bottleneck: high-performance models are computationally expensive, while lightweight alternatives often fail to generalize across diverse multi-task environments. Mixture-of-Experts (MoE) architectures offer a promising path to efficiency by activating only a subset of parameters. However, existing MoE routing mechanisms typically rely on low-level noise or latent statistics, ignoring the compositional nature of manipulation tasks. This can fragment reusable behaviors across experts, limiting interpretability and transferability. We introduce Semantically Structured Mixture-of-Experts Diffusion Policy (SMoDP) for compositional robotic manipulation, a framework that grounds expert specialization in semantic task structure. SMoDP leverages a lightweight, inference-time skill predictor, supervised by offline annotations from Vision-Language Models (VLMs), to route action chunks to experts specialized for specific behavioral phases. To ensure robust assignment, we propose a dual contrastive alignment strategy that grounds multi-modal observations in language-defined skill semantics (Inter-modal) while enforcing routing consistency across visually distinct but functionally related behaviors (Intra-modal). Our approach outperforms representative diffusion and MoE-based baselines on multi-task benchmarks with significantly improved parameter efficiency and demonstrates effective compositional transfer to novel tasks through parameter-efficient fine-tuning. Project website: https://deng-cy20.github.io/SMoDP/

2605.23476 2026-05-25 cs.LG cond-mat.dis-nn cond-mat.mtrl-sci math.OC

Non-normal spectral signatures of instability in neural network training dynamics

神经网络训练动态中不稳定性的非正态谱特征

Souvik Ghosh

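AI summary This paper studies instabilities that commonly arise when training deep networks, such as loss spikes, oscillatory convergence, and gradient pathologies, and explains them through non-normal operator theory. It finds that the linearized update operators of commonly used optimizers are generically non-normal, with the non-normality arising from the interaction between the Hessian and the adaptive preconditioner or momentum structure. Using non-normal stability theory, the author derives a conservative pseudospectrum-based precursor bound and shows that the condition number κ(V) can serve as an early-warning indicator of transient amplification during training, providing a new diagnostic tool and theoretical framework for understanding the stability of adaptive optimization algorithms.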

详情
Comments
9 pages, 3 figurea
AI中文摘要

深度网络中的训练不稳定性——损失尖峰、振荡收敛和梯度病态——在经验上普遍存在,但缺乏严格的算子理论解释。我们证明,实际使用的优化器的线性化更新算子通常是非正态的:对于Adam,非正态性由Hessian与对角自适应预条件子之间的换位子[H, M]控制;而对于带动量的SGD,它源于更新映射的增广状态空间结构。将非正态稳定性理论应用于这些算子,我们推导出一个保守的伪谱前兆界,其中κ(V)作为瞬态放大的早期预警指标,即使谱半径仍小于1;并且我们建立了更新算子的异常点作为该框架中κ(V) → ∞的极限情况。在两层网络上的数值实验证实,谱半径ρ(J)无法区分稳定和不稳定的训练阶段,而κ(V)能将它们分开约一个数量级,用非正态放大的连续严重性度量补充了经典的锐度准则。这些结果确立了非厄米算子理论作为神经网络优化稳定性中一个有用且未被充分探索的框架,为理解自适应优化稳定性提供了诊断语言和概念验证基准。

英文摘要

Training instabilities in deep networks - loss spikes, oscillatory convergence, and gradient pathologies - are empirically prevalent but lack a rigorous operator-theoretic explanation. We show that the linearized update operators for practically used optimizers are generically non-normal: for Adam, non-normality is controlled by the commutator [H, M] between the Hessian and the diagonal adaptive preconditioner, while for SGD with momentum it arises from the augmented state-space structure of the update map. Applying non-normal stability theory to these operators, we derive a conservative pseudospectral precursor bound in which κ(V) serves as an early-warning indicator of transient amplification even when the spectral radius remains below one, and we establish that exceptional points of the update operator appear as the κ(V) → ∞ limiting case of this framework. Numerical experiments on two-layer networks confirm that the spectral radius ρ(J) provides no separation between stable and unstable training phases while κ(V) separates them by approximately one order of magnitude, complementing the classical sharpness criterion with a continuous severity measure of non-normal amplification. These results establish non-Hermitian operator theory as a useful and underexplored framework for neural network optimization stability, offering a diagnostic language and proof-of-concept benchmark for understanding adaptive optimization stability.
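
Why a spectral radius below one does not rule out transient growth can be seen on a 2×2 toy operator. The sketch below is not from the paper: it assumes κ(V) denotes the 2-norm condition number of the eigenvector matrix V of J (the usual meaning in non-normal stability theory, where ‖J^k‖ ≤ κ(V) ρ(J)^k), and the matrix entries are arbitrary illustrative values.

```python
import math

# Upper-triangular, non-normal update operator with eigenvalues a and b.
a, b, c = 0.9, 0.8, 5.0
J = [[a, c], [0.0, b]]
rho = max(abs(a), abs(b))  # spectral radius, below one

# Unit eigenvectors as columns: (1, 0) for a, and (c, b - a) / n for b.
n = math.hypot(c, b - a)
V = [[1.0, c / n], [0.0, (b - a) / n]]


def cond_2x2(M):
    """2-norm condition number of a 2x2 matrix from its singular values,
    using s1^2 + s2^2 = ||M||_F^2 and s1 * s2 = |det M|."""
    frob2 = sum(x * x for row in M for x in row)
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    disc = math.sqrt(max(frob2 * frob2 - 4.0 * det * det, 0.0))
    return math.sqrt((frob2 + disc) / (frob2 - disc))


kappa = cond_2x2(V)

# Iterate x <- J x from a unit vector and record the largest norm reached.
x = [0.0, 1.0]
peak = 1.0
for _ in range(50):
    x = [J[0][0] * x[0] + J[0][1] * x[1], J[1][0] * x[0] + J[1][1] * x[1]]
    peak = max(peak, math.hypot(x[0], x[1]))
```

Here the iteration is asymptotically stable, yet the state norm grows by more than an order of magnitude before decaying, and the nearly parallel eigenvectors give a large κ(V). As b approaches a the eigenvectors coalesce and κ(V) diverges, which is the exceptional-point limit the abstract mentions.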

2605.23472 2026-05-25 cs.CV

Rethinking Transfer Learning for Industrial Inspection: DINOv3 vs. ImageNet Pretraining Across RGB and X-ray Tasks

重新思考工业检测的迁移学习:DINOv3与ImageNet预训练在RGB和X射线任务上的对比

Mehdi Gharbage, Céline Teulière, Pierre Bouges, Thierry Chateau

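AI summary This paper investigates how well modern vision foundation models transfer to industrial inspection, comparing ConvNeXt backbones pretrained with supervised ImageNet classification and with DINOv3 self-supervised distillation on RGB and X-ray inspection tasks. DINOv3 shows no clear advantage in frozen transfer, but under full fine-tuning on RGB tasks it provides a better initialization, speeding up convergence and improving performance; on X-ray tasks, supervised ImageNet pretraining remains preferable. The results indicate that modern vision foundation models hold promise for industrial RGB inspection, but their transfer performance depends heavily on downstream adaptation and data modality.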

详情
Comments
Accepted to the CVPR 2026 Workshop on Vision Foundation Models for Industrial Inspection (VISION'26)
AI中文摘要

最近,在网页规模数据上预训练的视觉基础模型在许多下游任务中展现出强大的迁移能力,但它们在工业视觉检测中的有效性仍不明确。工业数据与网页数据差异显著,通常需要细粒度的密集预测,这引发了一个问题:现代自监督预训练能否超越基于监督ImageNet初始化的传统迁移学习范式。在这项工作中,我们比较了使用监督ImageNet分类或DINOv3蒸馏预训练的ConvNeXt骨干网络,并将它们与传统的ResNet-50基线相关联。我们在四个下游数据集上评估了语义分割、实例分割和物体检测,这些数据集涵盖RGB表面缺陷检测和X射线缺陷检测。我们进一步研究了冻结和完全微调两种适应机制。我们的结果表明,DINOv3在冻结迁移中没有明显优势,但在RGB任务完全微调后提供了更强的初始化,实现了更快的收敛和更好的最终性能。然而,在X射线模态偏移下,监督ImageNet预训练在冻结和微调设置中仍然更有效。总体而言,我们的发现表明,现代视觉基础模型对于监督RGB工业检测是有前景的,但它们的迁移能力强烈依赖于下游适应和目标模态。

英文摘要

Vision foundation models pretrained on web-scale data have recently shown strong transfer capabilities on many downstream tasks, but their effectiveness for industrial visual inspection remains unclear. Industrial data differ substantially from web data and often require fine-grained dense prediction, raising the question of whether modern self-supervised pretraining can improve over the conventional transfer-learning paradigm based on supervised ImageNet initialization. In this work, we compare ConvNeXt backbones pretrained with supervised ImageNet classification or DINOv3 distillation, and relate them to the conventional ResNet-50 baseline. We evaluate semantic segmentation, instance segmentation, and object detection across four downstream datasets spanning RGB surface-defect inspection and X-ray defect detection. We further study both frozen and fully finetuned adaptation regimes. Our results show that DINOv3 offers no clear advantage in frozen transfer, but provides a stronger initialization after full finetuning on RGB tasks, yielding faster convergence and better final performance. Under X-ray modality shift, however, supervised ImageNet pretraining remains more effective in both frozen and finetuned settings. Overall, our findings suggest that modern vision foundation models are promising for supervised RGB industrial inspection, but their transferability is strongly conditioned by downstream adaptation and target modality.

2605.23471 2026-05-25 cs.LG cs.AI

CBANet: A Compact Attention-Based CNN-BiLSTM Network for Aggressive Driving Event Detection

CBANet:一种用于激进驾驶事件检测的紧凑型注意力CNN-BiLSTM网络

Hanadi Alhamdan, Ghadah Alosaimi, Amir Atapour-Abarghouei, Farshad Arvin

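AI summary This paper proposes CBANet, a compact deep learning framework combining an attention mechanism with a CNN-BiLSTM for detecting aggressive driving events. The method builds engineered dynamic features that capture steering, acceleration, and braking behavior, and adopts a stable training strategy combining SMOTE-based oversampling with a class-weighted loss to cope with the extreme rarity of aggressive events in naturalistic driving data. Experiments show that it clearly outperforms conventional deep learning methods on minority-class recall and safety-critical F-score while remaining computationally efficient.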

详情
Comments
8 pages, 4 figures, 4 tables. Submitted to IJCNN/WCCI 2026. CBANet: A compact attention-based CNN-BiLSTM framework for aggressive driving event detection using multivariate vehicle dynamics signals. Code available at https://github.com/halhamdan/CBANet
AI中文摘要

激进驾驶是交通事故的主要原因,对道路安全构成严重威胁。尽管深度学习方法在从车辆传感器数据检测危险驾驶行为方面显示出有希望的结果,但它们在现实条件下的性能通常受到严重数据不平衡、驾驶员间巨大差异以及缺乏物理可解释的车辆动力学表示的限制。在本文中,我们提出了一种增强的深度学习框架,用于使用多变量车辆动力学信号进行激进驾驶检测。该方法不仅依赖原始测量,还构建了捕捉转向、加速和制动行为的工程动力学特征。为了解决自然驾驶数据中激进事件的极端稀少性,我们引入了一种稳定的训练策略,结合了基于SMOTE的受控过采样和类别加权损失公式,并评估了用于不平衡处理的焦点损失变体。此外,采用基于类别特定阈值校准的安全导向决策策略,以更好地反映现实应用中漏检和误报的不对称风险。该框架在新收集的自然驾驶数据集上进行了评估。大量实验表明,所提出的方法在保持实际计算效率的同时,在少数类召回率和安全关键F-score指标上始终优于标准深度学习基线。代码:https://github.com/halhamdan/CBANet

英文摘要

Aggressive driving is a major cause of traffic accidents and poses a serious threat to road safety. Although deep learning methods have shown promising results in detecting risky driving behaviours from vehicle sensor data, their performance in real-world conditions is often limited by severe data imbalance, large variability between drivers, and the lack of physically interpretable vehicle dynamics representations. In this paper, we propose an enhanced deep learning framework for aggressive driving detection using multivariate vehicle dynamics signals. Instead of relying solely on raw measurements, the proposed approach constructs engineered dynamic features that capture steering, acceleration, and braking behaviour. To address the extreme rarity of aggressive events in naturalistic driving data, we introduce a stable training strategy that combines controlled SMOTE-based oversampling with a class-weighted loss formulation, and evaluate focal loss variants for imbalance handling. Furthermore, a safety-oriented decision strategy based on class-specific threshold calibration is adopted to better reflect the asymmetric risks of missed detections and false alarms in real-world applications. The proposed framework is evaluated on a newly collected naturalistic driving dataset. Extensive experiments show that the proposed method consistently outperforms standard deep learning baselines with significant improvements in minority-class recall and safety-critical F-score metrics while maintaining practical computational efficiency. Code: https://github.com/halhamdan/CBANet
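
Two of the imbalance-handling ideas in this abstract, a class-weighted loss and a lowered decision threshold for the rare class, can be shown on a handful of made-up predictions. This is a minimal sketch, not the CBANet code: the weight of 7.0, the thresholds, and the probabilities are illustrative assumptions, and the real settings are in the linked repository.

```python
import math


def weighted_bce(y_true, p_pred, pos_weight):
    """Binary cross-entropy with the positive (aggressive) class up-weighted."""
    eps = 1e-12
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def classify(p_pred, threshold):
    return [1 if p >= threshold else 0 for p in p_pred]


def recall(y_true, y_pred):
    true_pos = sum(1 for y, yh in zip(y_true, y_pred) if y == 1 and yh == 1)
    positives = sum(y_true)
    return true_pos / positives if positives else 0.0


# Seven normal windows and one aggressive event, scored too low by the model.
y_true = [0, 0, 0, 0, 0, 0, 0, 1]
p_pred = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.45, 0.4]

plain_loss = weighted_bce(y_true, p_pred, pos_weight=1.0)
weighted_loss = weighted_bce(y_true, p_pred, pos_weight=7.0)

recall_default = recall(y_true, classify(p_pred, threshold=0.5))
recall_lowered = recall(y_true, classify(p_pred, threshold=0.35))
```

The weighted loss penalises the under-scored aggressive event more heavily during training, and the lower threshold recovers it at decision time at the price of one false alarm (the 0.45 window), which is the asymmetric trade-off the abstract refers to.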

2605.23470 2026-05-25 cs.LG cs.AI cs.CE

Learning Individual Dynamics from Sparse Cross-Sectional Snapshots

从稀疏横截面快照中学习个体动力学

Christian Lagemann, Kai Lagemann, Steven L. Brunton, Sach Mukherjee

AI总结 该研究旨在从稀疏的横截面快照中学习个体的动态演化过程,传统方法在数据稀疏或完全横截面的情况下难以准确推断个体的连续时间轨迹。本文提出了一种名为CADENCE的概率框架,通过将潜在动态与静态个体上下文关联,实现了从孤立快照中恢复个体轨迹。该方法结合了基于分数的空域编码器和软专家混合路由机制,提供了单时间点轨迹推断的可识别性保证,并在多个基准测试中表现出优于现有序列模型的性能。

详情
AI中文摘要

预测一个动力学单元如何随时间演化——例如个体如何衰老、流行病如何传播、物理系统如何退化——通常需要密集的纵向追踪。当只有极其稀疏或完全横截面的数据可用时,推断个体化的连续时间轨迹本质上是病态的。现有方法迫使严格妥协:序列模型(如潜在ODE)需要密集的纵向数据,而横截面方法(如最优传输、基于流匹配的)映射聚合群体,丢失了个体动力学。在本文中,我们证明这种二分法可以被打破。我们介绍CADENCE,一个原则性的概率框架,通过将潜在动力学锚定到静态的个体级上下文,从孤立快照中恢复连续的个体轨迹。我们为单时间点轨迹推断提供了新颖的可识别性保证。通过结合基于分数的空间编码器(双射概率流ODE)以消除微分同胚歧义,以及软混合专家(SMoE)路由器,我们证明个体动力学参数和路由函数是联合可识别的。在一系列涵盖物理系统到真实世界生物数据的基准测试中,CADENCE严格在具有上下文结构的极端稀疏快照上训练,其性能匹配或超过了在密集全轨迹数据上训练的最先进序列模型。

英文摘要

Predicting how a dynamical unit evolves over time - how an individual ages, an epidemic spreads, or a physical system degrades - typically requires dense longitudinal tracking. When only extremely sparse or entirely cross-sectional data is available, inferring individualized, continuous-time trajectories is fundamentally ill-posed. Existing methods force a strict compromise: sequence models (e.g. latent ODEs) require dense longitudinal data, while cross-sectional methods (e.g. optimal transport, flow-matching-based) map aggregate populations, losing individual dynamics. In this paper, we demonstrate that this dichotomy can be broken. We introduce CADENCE, a principled probabilistic framework that recovers continuous individual trajectories from isolated snapshots by anchoring latent dynamics to static, individual-level contexts. We provide novel identifiability guarantees for single-timepoint trajectory inference. By combining a score-based spatial encoder (bijective Probability Flow ODE) to eliminate diffeomorphic ambiguities with a Soft Mixture-of-Experts (SMoE) router, we show that the individual dynamical parameters and the routing function are jointly identifiable. Across a suite of benchmarks spanning physical systems to real-world biological data, CADENCE, trained strictly on extremely sparse snapshots with context structure, matches or exceeds the performance of state-of-the-art sequential models trained on dense, full-trajectory data.

2605.23467 2026-05-25 cs.LG

S$^3$GNN: Efficient Global Mixing and Local Message Passing for Long-Range Graph Learning

S$^3$GNN:用于长程图学习的高效全局混合与局部消息传递

Dai Shi, Luke Thompson, Linhan Luo, Lequan Lin, Andi Han, Junbin Gao, José Miguel Hernández Lobato

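AI summary: This paper addresses the information bottleneck that graph neural networks face when capturing long-range dependencies and proposes a new method, S$^3$GNN. By introducing a lightweight global information-mixing mechanism, it mitigates oversquashing without relying on restrictive theoretical assumptions. Experiments across several domains show that S$^3$GNN reduces error by up to an order of magnitude while using up to 50% fewer parameters.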

Details
AI Chinese abstract

消息传递神经网络(MPNN)在捕获长程依赖时常常遭受信息瓶颈,导致过挤压(OSQ)现象。除了空间连通性增强(例如,重连)外,最近的研究表明,谱滤波可以产生强大的长程学习结果,因为谱算子能够实现全局信息混合,从而缓解OSQ。这些方法通过稳定深层传播中的雅可比能量或在强理论假设下保证OSQ缓解来实现这一点。我们重新审视这些结论,并表明相关的雅可比敏感性下界在实践中通常难以实现。然后,我们提出S$^3$GNN,它通过以显著较低的计算复杂度轻量级地重新引入被忽略的组件来缓解OSQ,而不需要这些限制性假设,同时特征变换的标准稳定性约束在我们的新动态下仍然有效。跨不同领域(例如,长程基准、KGQA和基于网格的流体动力学)的大量实验表明,S$^3$GNN在参数减少多达50%的情况下实现了高达一个数量级的误差降低。我们的代码可在https://github.com/EEthanShi/S3-GNN.git找到。

English abstract

Message-passing neural networks (MPNNs) often suffer from an information bottleneck when capturing long-range dependencies, leading to the oversquashing (OSQ) phenomenon. Alongside spatial connectivity enrichment (e.g., rewiring), recent studies have shown that spectral filtering can yield strong long-range learning outcomes, as spectral operators enable global information mixing that alleviates OSQ. These approaches achieve this either by stabilizing the Jacobian energies in deep propagation or by guaranteeing OSQ mitigation under strong theoretical assumptions. We revisit these conclusions and show that the associated Jacobian sensitivity lower bound is generally difficult to achieve in practice. We then propose S$^3$GNN, which mitigates OSQ without such restrictive assumptions by lightweightly reintroducing omitted components with substantially lower computational complexity, while standard stability constraints on feature transformations remain effective under our new dynamics. Extensive experiments across diverse domains (e.g., long-range benchmarks, KGQA, and mesh-based fluid dynamics) demonstrate that S$^3$GNN achieves up to an order-of-magnitude error reduction with up to 50% fewer parameters. Our code can be found in https://github.com/EEthanShi/S3-GNN.git.
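The contrast between local message passing and global spectral mixing mentioned in this abstract can be seen on a toy path graph. The following is a minimal sketch and not the S$^3$GNN method: the heat-kernel filter exp(-tL), the mean-aggregation operator, the graph size, and all variable names are assumptions chosen for illustration, and NumPy is assumed to be installed. Both operators are linear, so the Jacobian of node 0's output with respect to node n-1's input is simply the corresponding matrix entry.

```python
import numpy as np

n = 8  # path graph 0 - 1 - ... - 7 (assumed toy example)
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
L = np.diag(A.sum(axis=1)) - A  # combinatorial graph Laplacian

# Local operator: one round of mean aggregation with self-loops.
A_hat = A + np.eye(n)
P = A_hat / A_hat.sum(axis=1, keepdims=True)

# Spectral operator: heat-kernel filter g(lambda) = exp(-t * lambda),
# applied through the eigendecomposition of the symmetric Laplacian.
t = 1.0
eigvals, eigvecs = np.linalg.eigh(L)
H = (eigvecs * np.exp(-t * eigvals)) @ eigvecs.T

# Sensitivity of node 0's output to node n-1's input.
local_one_round = abs(P[0, n - 1])
local_many_rounds = abs(np.linalg.matrix_power(P, n - 1)[0, n - 1])
spectral_one_layer = abs(H[0, n - 1])

print("one message-passing round:", local_one_round)
print("n-1 message-passing rounds:", local_many_rounds)
print("one spectral layer:", spectral_one_layer)
```

In this toy setting a single message-passing round gives the two end nodes zero sensitivity to each other, and n-1 rounds are needed before any signal arrives, whereas one spectral layer already couples them. How S$^3$GNN obtains its global mixing at lower cost is not specified in the abstract.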

2605.23464 2026-05-25 cs.LG

Unextractable Protocol Models: Collaborative Training and Inference without Weight Materialization

不可提取协议模型:无需权重物化的协作训练与推理

Alexander Long, Chamin Hewa Koneputugodage, Thalaiyasingam Ajanthan, Yan Zuo, Gil Avraham, Violetta Shevchenko, Hadi Mohaghegh Dolatabadi, Sameera Ramasinghe

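AI summary: This paper studies collaborative training and inference of large neural networks in a decentralized setting and proposes a framework called Unextractable Protocol Models (UPMs). The method periodically injects time-varying invertible transforms at the boundaries between participants, so that model shards from different time steps are incompatible and the full weights cannot be extracted. Experiments show that UPMs preserve model performance while improving security, and the paper analyzes the training and inference overhead and the defenses against several classes of attack.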

Details
Journal ref
Advances in Neural Information Processing Systems 38, pp. 18677-18713 (NeurIPS 2025)
Comments
Accepted at NeurIPS 2025. 34 pages, 6 figures (5 in main body, 1 in appendix). Alexander Long and Chamin Hewa Koneputugodage contributed equally
AI Chinese abstract

我们考虑一个去中心化设置,其中参与者协作训练和提供大型神经网络服务,且每个参与者只处理模型的一个子集。在此设置中,我们探索了不可物化权重的可能性,即完整权重集永远不会对任何参与者可用。我们引入了不可提取协议模型(UPMs):一种利用分片模型设置来确保参与者持有的模型分片(即子集)在不同时间步不兼容的训练和推理框架。UPMs 在参与者边界定期注入时变、随机、可逆的变换;保持整体网络功能,但使跨时间组装变得不连贯。在 Qwen-2.5-0.5B 和 Llama-3.2-1B 上,10,000 次变换使 FP32 困惑度保持不变(ΔPPL < 0.01;Jensen-Shannon 漂移 < 4×10^{-5}),并且我们展示了如何控制低精度数据类型的增长。每 30 秒应用一次变换在推理时增加 3% 的延迟、0.1% 的带宽和 10% 的 GPU 内存开销,而训练开销降至 1.6% 的时间和 < 1% 的内存。我们考虑了多种攻击,表明直接攻击的要求不切实际且易于防御,并且基于梯度的拼接分区微调消耗了从头训练所需 token 的 ≥ 60%。通过使模型能够协作训练但不可提取,UPMs 使得在社区驱动的去中心化训练中嵌入程序化激励机制变得可行。

English abstract

We consider a decentralized setup in which the participants collaboratively train and serve a large neural network, and where each participant only processes a subset of the model. In this setup, we explore the possibility of unmaterializable weights, where a full weight set is never available to any one participant. We introduce Unextractable Protocol Models (UPMs): a training and inference framework that leverages the sharded model setup to ensure model shards (i.e., subsets) held by participants are incompatible at different time steps. UPMs periodically inject time-varying, random, invertible transforms at participant boundaries; preserving the overall network function yet rendering cross-time assemblies incoherent. On Qwen-2.5-0.5B and Llama-3.2-1B, 10,000 transforms leave FP32 perplexity unchanged ($Δ$PPL $< 0.01$; Jensen-Shannon drift $< 4 \times 10^{-5}$), and we show how to control growth for lower precision datatypes. Applying a transform every 30s adds 3% latency, 0.1% bandwidth, and 10% GPU-memory overhead at inference, while training overhead falls to 1.6% time and $< 1$% memory. We consider several attacks, showing that the requirements of direct attacks are impractical and easy to defend against, and that gradient-based fine-tuning of stitched partitions consumes $\geq 60$% of the tokens required to train from scratch. By enabling models to be collaboratively trained yet not extracted, UPMs make it practical to embed programmatic incentive mechanisms in community-driven decentralized training.
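The boundary transform described in this abstract can be illustrated with a toy two-shard linear model. The following is a minimal sketch, not the authors' implementation: the helper names (`random_transform`, `forward`), the choice of a randomly scaled permutation as the invertible transform, and the purely linear shards are all assumptions made for illustration. It shows that re-parameterizing both sides of a boundary with T and its inverse leaves the end-to-end function unchanged, while pairing shards from different time steps does not.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def random_transform(n, rng):
    """Return (T, T_inv) for a random scaled permutation (hypothetical choice)."""
    perm = list(range(n))
    rng.shuffle(perm)
    scales = [rng.uniform(0.5, 2.0) for _ in range(n)]
    T = [[0.0] * n for _ in range(n)]
    T_inv = [[0.0] * n for _ in range(n)]
    for i, j in enumerate(perm):
        T[i][j] = scales[i]            # output i reads input j, scaled
        T_inv[j][i] = 1.0 / scales[i]  # undo the scaling and the permutation
    return T, T_inv


def forward(W1, W2, x):
    """Shard A applies W1; shard B applies W2 to the boundary activation."""
    return matvec(W2, matvec(W1, x))


def max_abs_diff(a, b):
    return max(abs(u - v) for u, v in zip(a, b))


rng = random.Random(0)
d_in, d_hidden, d_out = 3, 4, 2
W1 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_hidden)]
W2 = [[rng.gauss(0, 1) for _ in range(d_hidden)] for _ in range(d_out)]
x = [rng.gauss(0, 1) for _ in range(d_in)]

T, T_inv = random_transform(d_hidden, rng)
W1_new = matmul(T, W1)      # shard A after the transform step
W2_new = matmul(W2, T_inv)  # shard B after the transform step

y_ref = forward(W1, W2, x)
same_time_err = max_abs_diff(forward(W1_new, W2_new, x), y_ref)
cross_time_err = max_abs_diff(forward(W1, W2_new, x), y_ref)

print("same-time error:", same_time_err)    # shards from the same step agree
print("cross-time error:", cross_time_err)  # old shard A with new shard B
```

The sketch covers only a linear boundary. How the transforms interact with nonlinear layers, and how they are sampled and scheduled, is not stated in the abstract.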

2605.23459 2026-05-25 cs.SE cs.AI

AI Assurance: A Comprehensive Testing Strategy for Enterprise AI Systems

AI 保证:企业 AI 系统的综合测试策略

Chitra Badagi, Divye Singh, Animesh Sen, Adinath Shirsath

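AI summary: For enterprise AI systems built on large language models, retrieval pipelines, and autonomous agents, this paper proposes a comprehensive testing and assurance strategy for new kinds of risk that traditional software quality assurance struggles to handle. It argues that AI testing should focus on continuous risk reduction rather than strict correctness verification, and that evaluation should be treated as an engineering discipline on par with development. The paper introduces a structured AI failure taxonomy, proposes a revised five-layer AI assurance pyramid, and gives practical guidance on evaluation-driven development, RAG system testing, and model lifecycle management, aiming to give enterprise engineering leaders and practitioners an assurance strategy that is both well grounded and actionable.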

Details
AI Chinese abstract

企业 AI 系统构建于大语言模型、检索管道和自主代理之上,引入了一类传统软件质量保证从未设计应对的风险。这些系统是概率性的、上下文敏感的和涌现性的:它们无法在经典意义上被验证为正确,只能通过不断增加信心来评估。本文提出了一种围绕三个关键原则的企业 AI 系统综合保证策略:第一,AI 测试应侧重于持续风险降低而非严格正确性验证;第二,评估必须与开发一起被视为核心工程学科;第三,AI 保证中的失败可能导致与传统确定性软件系统根本不同的组织影响。我们引入了结构化的 AI 故障分类法,提出了修订后的五层 AI 保证金字塔,并提供了关于评估驱动开发、RAG 系统测试、模型生命周期管理和治理的操作指南。目标是让工程领导者和从业者掌握一种既有哲学基础又可操作部署的策略。

English abstract

Enterprise AI systems, built on large language models, retrieval pipelines and autonomous agents, introduce a class of risks that traditional software quality assurance was never designed to address. These systems are probabilistic, context-sensitive and emergent: they cannot be verified to be correct in the classical sense, but only evaluated with increasing confidence. This paper presents a comprehensive assurance strategy for enterprise AI systems built around three key principles: first, that AI testing should focus on continuous risk reduction rather than strict correctness verification; second, that evaluation must be treated as a core engineering discipline alongside development; and third, that failures in AI assurance can lead to organizational impacts that are fundamentally different from those seen in traditional deterministic software systems. We introduce a structured AI Failure Taxonomy, propose a revised five-layer AI Assurance Pyramid and provide operational guidance on evaluation-driven development, RAG system testing, model lifecycle management and governance. The goal is to equip engineering leaders and practitioners with a strategy that is both philosophically grounded and operationally deployable.