arXivDaily: arXiv daily academic digest, updated Monday to Friday
2602.08783 2026-05-29 cs.AI cs.CL

Dynamics Within Latent Chain-of-Thought: An Empirical Study of Causal Structure

Zirui Li, Xuefeng Bai, Kehai Chen, Yizhi Li, Jian Yang, Chenghua Lin, Min Zhang

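AI summary: Uses a structural causal model to run intervention analyses on latent chain-of-thought, revealing its causal structure, how influence propagates across steps, and how it differs from explicit chain-of-thought.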

Comments
Accepted to ICML 2026; 25 pages, 23 figures

Abstract

Latent or continuous chain-of-thought methods replace explicit textual rationales with a number of internal latent steps, but these intermediate computations are difficult to evaluate beyond correlation-based probes. In this paper, we view latent chain-of-thought as a manipulable causal process in representation space by modeling latent steps as variables in a structural causal model (SCM) and analyzing their effects through step-wise do-interventions. We study two representative paradigms (i.e., Coconut and CODI) on both mathematical and general reasoning tasks to investigate three key questions: (1) which steps are causally necessary for correctness and when answers become decodable early; (2) how influence propagates across steps and how this structure compares to explicit CoT; and (3) whether intermediate trajectories retain competing answer modes and how output-level commitment differs from representational commitment across steps. We find that latent-step budgets behave less like homogeneous extra depth and more like staged functionality with non-local routing, and we identify a persistent gap between early output bias and late representational commitment. These results motivate mode-conditional and stability-aware analyses, together with corresponding training/decoding objectives, as more reliable tools for interpreting and improving latent reasoning systems. Code is available at https://github.com/J1mL1/causal-latent-cot.

2602.06791 2026-05-29 cs.LG cond-mat.dis-nn cond-mat.stat-mech

Rare Event Analysis of Large Language Models

Jake McAllister Dorman, Edward Gillman, Dominic C. Rose, Jamie F. Mair, Juan P. Garrahan

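AI summary: Proposes an end-to-end framework for the systematic analysis of rare events in large language models, covering theory, efficient generation strategies, probability estimation, and error analysis, illustrated with concrete examples.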

Comments
ICML 2026 Oral Spotlight

Abstract

Being probabilistic models, during inference large language models (LLMs) display rare events: behaviour that is far from typical but highly significant. By definition all rare events are hard to see, but the enormous scale of LLM usage means that events completely unobserved during development are likely to become prominent in deployment. Here we present an end-to-end framework for the systematic analysis of rare events in LLMs. We provide a practical implementation spanning theory, efficient generation strategies, probability estimation and error analysis, which we illustrate with concrete examples. We outline extensions and applications to other models and contexts, highlighting the generality of the concepts and techniques presented here.

2602.06036 2026-05-29 cs.CL

DFlash: Block Diffusion for Flash Speculative Decoding

Jian Chen, Yesheng Liang, Zhijian Liu

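AI summary: Proposes DFlash, a framework that drafts in parallel with a lightweight block diffusion model conditioned on context features from the target model, yielding high-quality drafts and high acceptance rates; it achieves over 6x lossless acceleration across a range of models and tasks, with up to 2.5x higher speedup than the state-of-the-art speculative decoding method EAGLE-3.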

Comments
Accepted at ICML 2026. Camera-ready version. Code: https://github.com/z-lab/dflash

Abstract

Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast draft model whose outputs are verified in parallel by the target LLM; however, existing methods still rely on autoregressive drafting, which remains sequential and limits practical speedups. Diffusion LLMs offer a promising alternative by enabling parallel generation, but current diffusion models typically underperform compared with autoregressive models. In this paper, we introduce DFlash, a speculative decoding framework that employs a lightweight block diffusion model for parallel drafting. By generating draft tokens in a single forward pass and conditioning the draft model on context features extracted from the target model, DFlash enables efficient drafting with high-quality outputs and higher acceptance rates. Experiments show that DFlash achieves over 6x lossless acceleration across a range of models and tasks, delivering up to 2.5x higher speedup than the state-of-the-art speculative decoding method EAGLE-3.
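The draft-then-verify loop described in this abstract can be sketched as follows. This is a toy illustration under stated assumptions, not DFlash itself: `toy_draft` and `toy_target` are hypothetical stand-ins for the block drafter and the target LLM, decoding is greedy, and the target is called once per position here, whereas a real system would verify the whole block in one parallel forward pass.

```python
from typing import Callable, List

Token = int


def speculative_step(
    prefix: List[Token],
    draft_block: Callable[[List[Token], int], List[Token]],
    target_next: Callable[[List[Token]], Token],
    block_size: int,
) -> List[Token]:
    """One draft-then-verify step under greedy decoding.

    The drafter proposes `block_size` tokens at once. The target keeps the
    longest prefix of the block that matches its own greedy choices, then
    contributes one token itself (a correction or a bonus token).
    """
    proposal = draft_block(prefix, block_size)
    accepted: List[Token] = []
    for tok in proposal:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: use the target's token and stop
            return accepted
        accepted.append(tok)
    accepted.append(target_next(prefix + accepted))  # whole block accepted: bonus token
    return accepted


def generate(prefix, draft_block, target_next, block_size, max_new_tokens):
    """Repeat speculative steps until `max_new_tokens` tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new_tokens:
        out.extend(speculative_step(out, draft_block, target_next, block_size))
    return out[: len(prefix) + max_new_tokens]


# Hypothetical toy models: the target counts upward modulo 10.
def toy_target(seq):
    return (seq[-1] + 1) % 10


def toy_draft(seq, k):
    block, last = [], seq[-1]
    for i in range(k):
        last = (last + 1) % 10
        block.append(0 if i == 2 else last)  # third draft token is forced to 0, usually wrong
    return block


tokens = generate([0], toy_draft, toy_target, block_size=4, max_new_tokens=8)
```

Because every emitted token is either confirmed or produced by the target, the output matches plain greedy decoding of the target. That is the sense of "lossless" in this sketch; only the number of target passes changes.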

2602.05961 2026-05-29 cs.LG stat.ML

Discrete diffusion samplers and bridges: Off-policy algorithms and applications in latent spaces

Arran Carter, Sanghyeok Choi, Kirill Tamogashev, Víctor Elvira, Esmeralda S. Whitammer

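AI summary: Proposes off-policy training techniques that improve discrete diffusion samplers, introduces data-to-energy Schrödinger bridge training for the discrete domain for the first time, and applies the samplers to data-free posterior sampling in the discrete latent spaces of image generative models.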

Comments
ICML 2026. Code: https://github.com/mmacosha/offpolicy-discrete-diffusion-samplers-and-bridges

Abstract

Sampling from a distribution $p(x) \propto e^{-\mathcal{E}(x)}$ known up to a normalising constant is an important and challenging problem in statistics. Recent years have seen the rise of a new family of amortised sampling algorithms, commonly referred to as diffusion samplers, that enable fast and efficient sampling from an unnormalised density. Such algorithms have been widely studied for continuous-space sampling tasks; however, their application to problems in discrete space remains largely unexplored. Although some progress has been made in this area, discrete diffusion samplers do not take full advantage of ideas commonly used for continuous-space sampling. In this paper, we propose to bridge this gap by introducing off-policy training techniques for discrete diffusion samplers. We show that these techniques improve the performance of discrete samplers on both established and new synthetic benchmarks. Next, we generalise discrete diffusion samplers to the task of bridging between two arbitrary distributions, introducing data-to-energy Schrödinger bridge training for the discrete domain for the first time. Lastly, we showcase the application of the proposed diffusion samplers to data-free posterior sampling in the discrete latent spaces of image generative models.

2602.03582 2026-05-29 cs.LG

Optimization and Generation in Aerodynamics Inverse Design

Huaguan Chen, Ning Lin, Luxi Chen, Jiacheng Cen, Rui Zhang, Wenbing Huang, Chongxuan Li, Hao Sun

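AI summary: Proposes a probabilistic framework that unifies visual-feature preservation and aerodynamic performance optimization in a single target, performing optimization and guided generation by reweighting a learned distribution; experiments on vehicle and aircraft design reduce drag while maintaining visual consistency.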

Abstract

Aerodynamic inverse design can improve vehicle and aircraft efficiency, but practical design rarely seeks performance alone: vehicle refinement must reduce drag while preserving visual features linked to design language, brand recognition and user perception. Traditional CFD-driven optimization is accurate but slow for broad exploration, and current learning-based methods are still largely performance-driven and lack a coherent target linking optimization, generation and visual consistency. Here we formulate visual preservation and aerodynamic improvement as one probability target. Designs consistent with a reference shape or view define a learned visual design distribution, which is reweighted by aerodynamic cost. Optimization then refines an initial geometry toward a low-cost, high-probability design, whereas guided generation samples lower-cost 3D candidates from the same input view. OpenFOAM evaluation shows that visual-feature-preserving optimization reduces vehicle drag by 5.8% relative to the initial vehicle and reduces the best valid aircraft drag-to-lift objective by 28.8% relative to the initial aircraft while preserving input visual features. For view-based generation, guidance reduces vehicle drag by 3.0% and the aircraft drag-to-lift objective by 68.6% relative to direct generation from the same view, while maintaining visual consistency. Wind-tunnel tests with 3D-printed vehicle prototypes provide an independent wake-level check, and controlled analyses explain the distributional mechanisms behind these results. This work provides a probabilistic foundation and practical route for visual-feature-preserving aerodynamic refinement and early-stage 3D design exploration.

2602.03357 2026-05-29 cs.LG math.OC

Achieving Linear Speedup for Composite Federated Learning

Kun Huang, Shi Pu, Karl Henrik Johansson

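AI summary: Proposes FedNMap, a normal map-based method that handles the nonsmooth term through normal map updates and uses a local correction strategy to mitigate data heterogeneity, establishing for the first time in nonconvex composite federated learning a linear speedup with respect to both the number of clients and the number of local updates.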

Comments
38 pages, 19 figures

Abstract

This paper proposes FedNMap, a normal map-based method for composite federated learning, where the objective consists of a smooth loss and a possibly nonsmooth regularizer. FedNMap leverages a normal map-based update scheme to handle the nonsmooth term and incorporates a local correction strategy to mitigate the impact of data heterogeneity across clients. Under standard assumptions, including smooth local losses, weak convexity of the regularizer, and bounded stochastic gradient variance, FedNMap achieves linear speedup with respect to both the number of clients and the number of local updates for nonconvex losses, both with and without the Polyak-Łojasiewicz condition. To the best of our knowledge, this is the first algorithm establishing linear speedup for nonconvex composite federated learning. Numerical experiments corroborate our theoretical findings and demonstrate the linear speedup of FedNMap.
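The abstract does not spell out the update rule, so the following is only a rough single-machine sketch of the normal-map idea, not FedNMap itself: there are no clients and no local correction. It uses the normal map as commonly defined in the nonsmooth optimization literature, F(z) = ∇f(x) + (z − x)/λ with x = prox_{λφ}(z). The ℓ1 regularizer, the quadratic loss, and all names and parameter values are assumptions chosen for illustration.

```python
def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1, applied coordinate-wise."""
    return [(1.0 if v > 0 else -1.0) * max(abs(v) - t, 0.0) for v in z]


def normal_map(z, grad_f, lam, l1_weight):
    """Return (F_nor(z), x) with x = prox_{lam * phi}(z) and phi = l1_weight * ||.||_1."""
    x = soft_threshold(z, lam * l1_weight)
    g = grad_f(x)
    return [gi + (zi - xi) / lam for gi, zi, xi in zip(g, z, x)], x


def solve(b, l1_weight, lam=0.5, step=0.3, iters=200):
    """Minimise 0.5 * ||x - b||^2 + l1_weight * ||x||_1 by iterating z <- z - step * F_nor(z)."""
    z = [0.0] * len(b)
    grad_f = lambda x: [xi - bi for xi, bi in zip(x, b)]  # gradient of the smooth part
    for _ in range(iters):
        f_nor, _ = normal_map(z, grad_f, lam, l1_weight)
        z = [zi - step * fi for zi, fi in zip(z, f_nor)]
    return soft_threshold(z, lam * l1_weight)  # the model is x = prox(z), not z itself


solution = solve([3.0, -0.2, 1.5], l1_weight=0.5)
```

For this toy problem the minimiser is known in closed form, `soft_threshold(b, l1_weight)`, so `solution` should approach `[2.5, 0.0, 1.0]`. The point of the construction is that the iteration runs on the auxiliary variable `z`, while the nonsmooth term is handled entirely inside the proximal step.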

2602.02849 2026-05-29 cs.AI

AutoSizer: Automatic Sizing of Analog and Mixed-Signal Circuits via Large Language Model (LLM) Agents

Xi Yu, Dmitrii Torbunov, Soumyajit Mandal, Yihui Ren

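AI summary: Proposes AutoSizer, a reflective LLM-driven meta-optimization framework whose two-loop structure unifies circuit understanding, adaptive search-space construction, and optimization orchestration, achieving higher solution quality, faster convergence, and higher success rates in analog and mixed-signal circuit sizing.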

Abstract

The design of Analog and Mixed-Signal (AMS) integrated circuits remains heavily reliant on expert knowledge, with transistor sizing a major bottleneck due to nonlinear behavior, high-dimensional design spaces, and strict performance constraints. Existing Electronic Design Automation (EDA) methods typically frame sizing as static black-box optimization, resulting in inefficient and less robust solutions. Although Large Language Models (LLMs) exhibit strong reasoning abilities, they are not suited for precise numerical optimization in AMS sizing. To address this gap, we propose AutoSizer, a reflective LLM-driven meta-optimization framework that unifies circuit understanding, adaptive search-space construction, and optimization orchestration in a closed loop. It employs a two-loop optimization framework, with an inner loop for circuit sizing and an outer loop that analyzes optimization dynamics and constraints to iteratively refine the search space from simulation feedback. We further introduce AMS-SizingBench, an open benchmark comprising 24 diverse AMS circuits in SKY130 CMOS technology, designed to evaluate adaptive optimization policies under realistic simulator-based constraints. AutoSizer experimentally achieves higher solution quality, faster convergence, and higher success rate across varying circuit difficulties, outperforming both traditional optimization methods and existing LLM-based agents.

2602.02103 2026-05-29 cs.LG cs.CL

How Far Ahead Do LLMs Plan? Uncovering the Latent Horizon in Chain-of-Thought Reasoning

Liyan Xu, Mo Yu, Fandong Meng, Jie Zhou

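AI summary: Uses the probing method Tele-Lens to study the latent planning ability of LLMs in chain-of-thought reasoning, finds a myopic horizon, and on that basis proposes a hypothesis for improving uncertainty estimation with a sparse set of pivot positions, together with automatic recognition of CoT bypass.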

Comments
Accepted to ICML 2026

Abstract

Chain-of-thought (CoT) reasoning has become a central mechanism for eliciting multi-step reasoning in Large Language Models (LLMs). Yet recent evidence presents a tension: hidden states appear to already encode future reasoning before CoT fully unfolds, while explicit steps remain crucial for tasks requiring compositional computation. To deepen the understanding of the relationship between an LLM's internal states and its verbalized reasoning trajectories, we investigate the latent planning strength of LLMs through our probing method, Tele-Lens, applied to hidden states across diverse task domains. Our empirical results indicate that LLMs exhibit a myopic horizon, primarily conducting incremental transitions without precise global planning. Leveraging this characteristic, we propose a hypothesis on enhancing uncertainty estimation of CoT, and we validate that a sparse set of pivot positions can effectively represent the uncertainty of the entire path. We further underscore the significance of exploiting CoT dynamics, and demonstrate that automatic recognition of CoT bypass can be achieved without performance degradation. Our code, data and models are released at https://github.com/lxucs/tele-lens.

2602.01869 2026-05-29 cs.AI

Skill-Pro: Learning Reusable Skills from Experience via Non-Parametric PPO for LLM Agents

Qirui Mi, Zhijian Ma, Mengyue Yang, Haoxuan Li, Yisen Wang, Haifeng Zhang, Jun Wang

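AI summary: Proposes Skill-Pro, a framework that uses Non-Parametric PPO to learn reusable procedural skills automatically from interaction experience without parameter updates, enabling efficient experience reuse and long-term autonomy.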

Comments
Accepted at ICML 2026 (spotlight); 22 Pages, 6 Figures, 5 Tables

Abstract

LLM-driven agents excel at sequential decision-making but often rely on on-the-fly reasoning, re-deriving solutions even in recurring scenarios. This insufficient experience reuse leads to computational redundancy and instability. To bridge this gap, we propose Skill-Pro, a framework enabling agents to autonomously learn reusable procedural skills from interaction experiences without parameter updates. By formalizing a Skill-MDP, Skill-Pro transforms passive episodic narratives into executable Skills defined by activation, execution, and termination conditions to ensure executability. To achieve reliable reusability without capability degradation, we introduce Non-Parametric PPO, which leverages semantic gradients for high-quality candidate generation and a PPO Gate for robust Skill verification. Through score-based maintenance, Skill-Pro sustains compact, high-quality procedural memory. Experimental results across in-domain, cross-task, and cross-agent scenarios demonstrate that Skill-Pro achieves superior reuse rates and significant gains with extreme memory compression. Visualized evolutionary trajectories and Skill distributions further reveal how Skill-Pro transparently accumulates, refines, and reuses procedural knowledge to facilitate long-term autonomy.

2602.01456 2026-05-29 cs.LG cs.CV

Rectified LpJEPA: Joint-Embedding Predictive Architectures with Sparse and Maximum-Entropy Representations

Yilun Kuang, Yash Dagade, Tim G. J. Rudner, Randall Balestriero, Yann LeCun

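AI summary: Proposes the Rectified Distribution Matching Regularization (RDMReg) loss, which aligns representations to a Rectified Generalized Gaussian distribution to obtain sparse, maximum-entropy representations in Joint-Embedding Predictive Architectures (JEPA).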

Comments
ICML 2026

Abstract

Joint-Embedding Predictive Architectures (JEPA) learn view-invariant representations and admit projection-based distribution matching for collapse prevention. Existing approaches regularize representations towards isotropic Gaussian distributions, but inherently favor dense representations and fail to capture the key property of sparsity observed in efficient representations. We introduce Rectified Distribution Matching Regularization (RDMReg), a sliced two-sample distribution-matching loss that aligns representations to a Rectified Generalized Gaussian (RGG) distribution. RGG enables explicit control over expected $\ell_0$ norm through rectification, while its continuous truncated component admits a maximum-entropy characterization under expected $\ell_p$ norm and support constraints. Equipping JEPAs with RDMReg yields Rectified LpJEPA, which strictly generalizes prior Gaussian-based JEPAs. Empirically, Rectified LpJEPA learns sparse, non-negative representations with favorable sparsity-performance trade-offs and competitive downstream performance on image classification benchmarks, showing that RDMReg can enforce sparsity while preserving task-relevant information.
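To give a feel for how rectification controls sparsity, here is a minimal sampling sketch. It assumes one plausible parametrization that the abstract does not confirm: draw a generalized Gaussian variable with density proportional to exp(-|x/scale|^p), add a `shift`, and clip negatives to zero. The function names and the `shift` parameter are hypothetical, and the RDMReg loss itself is not shown.

```python
import random


def sample_generalized_gaussian(p, scale, rng):
    """Draw x with density proportional to exp(-|x / scale| ** p)."""
    # If G ~ Gamma(1/p, 1), then G ** (1/p) has density proportional to exp(-x ** p) on x > 0.
    magnitude = rng.gammavariate(1.0 / p, 1.0) ** (1.0 / p)
    sign = 1.0 if rng.random() < 0.5 else -1.0
    return sign * scale * magnitude


def sample_rectified(p, scale, shift, dim, rng):
    """Shift a generalized Gaussian vector, then clip negatives to exactly zero."""
    return [max(0.0, shift + sample_generalized_gaussian(p, scale, rng)) for _ in range(dim)]


rng = random.Random(0)
vectors = [sample_rectified(p=1.0, scale=1.0, shift=0.0, dim=16, rng=rng) for _ in range(500)]
# Average number of non-zero coordinates per vector, i.e. the empirical l0 norm.
mean_l0 = sum(sum(1 for v in vec if v > 0.0) for vec in vectors) / len(vectors)
```

With `shift=0.0` the underlying variable is symmetric, so about half of the 16 coordinates survive rectification. A negative `shift` lowers the expected number of non-zeros and a positive one raises it, which is the kind of explicit control over the expected $\ell_0$ norm that the abstract attributes to rectification.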

2601.23156 2026-05-29 cs.LG cs.FL

Unsupervised Hierarchical Skill Discovery

Damion Harvey, Geraud Nangue Tasse, Benjamin Rosman, Branden Ingram, Steven James

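AI summary: Proposes an unsupervised grammar-based method that segments unlabelled trajectories into skills and builds a hierarchy over them; in pixel-based environments such as Craftax and Minecraft it outperforms existing baselines and accelerates downstream reinforcement learning tasks.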

Comments
Accepted to ICML 2026. 27 pages. 15 figures

Abstract

We consider the problem of unsupervised skill segmentation and hierarchical structure discovery in reinforcement learning. While recent approaches have sought to segment trajectories into reusable skills or options, most rely on action labels, rewards, or handcrafted annotations, limiting their applicability. We propose a method that segments unlabelled trajectories into skills and induces a hierarchical structure over them using a grammar-based approach. The resulting hierarchy captures both low-level behaviours and their composition into higher-level skills. We evaluate our approach in high-dimensional, pixel-based environments, including Craftax and the full, unmodified version of Minecraft. Using metrics for skill segmentation, reuse, and hierarchy quality, we find that our method consistently produces more structured and semantically meaningful hierarchies than existing baselines. Furthermore, as a proof of concept, we demonstrate that these discovered hierarchies accelerate and stabilise learning on downstream reinforcement learning tasks.
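The abstract says only that the hierarchy is induced with "a grammar-based approach". As an illustration of how a grammar can turn a flat sequence of skill labels into nested rules, the sketch below uses a simple Re-Pair-style procedure that repeatedly replaces the most frequent adjacent pair with a new symbol. This is a stand-in chosen for brevity, not necessarily the authors' algorithm, and the skill names are made up.

```python
from collections import Counter


def induce_grammar(sequence, min_count=2):
    """Compress a label sequence into a top-level sequence plus binary rules."""
    seq = list(sequence)
    rules = {}
    next_id = 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < min_count:
            break  # no adjacent pair repeats often enough to justify a rule
        name = f"R{next_id}"
        next_id += 1
        rules[name] = pair
        merged, i = [], 0
        while i < len(seq):
            # Greedy left-to-right replacement of non-overlapping occurrences.
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                merged.append(name)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return seq, rules


skills = ["move", "chop", "craft", "move", "chop", "craft"]
top_level, rules = induce_grammar(skills)
```

Here `R0` groups `("move", "chop")` and `R1` groups `("R0", "craft")`, so the top-level sequence becomes two uses of one higher-level skill built from a lower-level one.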

2601.22274 2026-05-29 cs.LG

Server-Proximal Aggregation for Federated Domain-Incremental Learning under Partial Participation: Task-Uniform Convergence and Backward Transfer

Longtao Xu, Jian Li

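AI summary: For Federated Domain-Incremental Learning (FDIL), where clients are heterogeneous, tasks arrive sequentially, and the label space is fixed, proposes SPECIAL, a memory-free algorithm whose lightweight server-side proximal term curbs cumulative drift, with a backward knowledge transfer (BKT) guarantee and the first non-convex convergence rate under partial participation, O((E/NT)^(1/2)).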

Comments
Accepted in ICML2026

Abstract

Real-world federated systems seldom operate on static data: input distributions drift while privacy rules forbid raw-data sharing. We study this setting as Federated Domain-Incremental Learning (FDIL), where (i) clients are heterogeneous, (ii) tasks arrive sequentially with shifting domains, yet (iii) the label space remains fixed. Two theoretical pillars remain missing for FDIL under realistic deployment: a guarantee of backward knowledge transfer (BKT) and a convergence rate that holds across the sequence of all tasks with partial participation. We introduce SPECIAL (Server-Proximal Efficient Continual Aggregation for Learning), a simple, memory-free FDIL algorithm that adds a single server-side "anchor" to vanilla FedAvg: in each round, the server nudges the aggregated update of the uniformly sampled participating clients toward the previous global model with a lightweight proximal term. This anchor curbs cumulative drift without replay buffers, synthetic data, or task-specific heads, keeping communication and model size unchanged. Our theory shows that SPECIAL (i) preserves earlier tasks: a BKT bound caps any increase in prior-task loss by a drift-controlled term that shrinks with more rounds, local epochs, and participating clients; and (ii) learns efficiently across all tasks: the first communication-efficient non-convex convergence rate for FDIL with partial participation, O((E/NT)^(1/2)), with E local epochs, T communication rounds, and N participating clients per round, matching single-task FedAvg while explicitly separating optimization variance from inter-task drift. Experimental results further demonstrate the effectiveness of SPECIAL.
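The server-side anchor can be pictured with a small sketch. It assumes a quadratic proximal term, so the server solves min_w 0.5·||w − avg||² + (prox_weight/2)·||w − w_prev||², which has the closed form shown below. The abstract does not give the exact form used by SPECIAL, so the function name, `prox_weight`, and the formula are illustrative assumptions.

```python
def server_proximal_aggregate(global_prev, client_models, prox_weight):
    """Average the participating clients' models, then pull the result toward the previous global model.

    Closed-form minimiser of
        0.5 * ||w - avg||^2 + 0.5 * prox_weight * ||w - global_prev||^2.
    With prox_weight = 0 this reduces to plain FedAvg averaging.
    """
    n = len(client_models)
    avg = [sum(m[i] for m in client_models) / n for i in range(len(global_prev))]
    return [(a + prox_weight * g) / (1.0 + prox_weight) for a, g in zip(avg, global_prev)]


previous_global = [0.0, 0.0]
sampled_clients = [[1.0, 2.0], [3.0, 4.0]]  # models returned by this round's participants
anchored = server_proximal_aggregate(previous_global, sampled_clients, prox_weight=1.0)
fedavg = server_proximal_aggregate(previous_global, sampled_clients, prox_weight=0.0)
```

In this form the anchor needs only the previous global model, which the server already holds. That is consistent with the abstract's claim that communication and model size stay unchanged.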

2601.21909 2026-05-29 cs.AI cs.CL

From Meta-Thought to Execution: Cognitively Aligned Post-Training for Generalizable and Reliable LLM Reasoning

Shaojie Wang, Liang Zhang

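AI summary: Proposes a cognitively inspired two-stage post-training framework: Chain-of-Meta-Thought supervised learning acquires general strategies, and Confidence-Calibrated Reinforcement Learning improves execution reliability, yielding gains of 2.10% in-distribution and 3.86% out-of-distribution.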

Abstract

Current LLM post-training methods optimize complete reasoning trajectories through Supervised Fine-Tuning (SFT) followed by outcome-based Reinforcement Learning (RL). While effective, a closer examination reveals a fundamental gap: this approach does not align with how humans actually solve problems. Human cognition naturally decomposes problem-solving into two distinct stages: first acquiring abstract strategies (i.e., meta-knowledge) that generalize across problems, then adapting them to specific instances. In contrast, by treating complete trajectories as basic units, current methods are inherently problem-centric, entangling abstract strategies with problem-specific execution. To address this misalignment, we propose a cognitively-inspired framework that explicitly mirrors the two-stage human cognitive process. Specifically, Chain-of-Meta-Thought (CoMT) focuses supervised learning on abstract reasoning patterns without specific executions, enabling acquisition of generalizable strategies. Confidence-Calibrated Reinforcement Learning (CCRL) then optimizes task adaptation via confidence-aware rewards on intermediate steps, preventing overconfident errors from cascading and improving execution reliability. Experiments across four models and ten benchmarks show 2.10% and 3.86% improvements in-distribution and out-of-distribution respectively over standard methods, while remaining highly robust to variations in teacher model selection, optimization methods, and symbolic perturbations.

2601.21568 2026-05-29 cs.LG

Bridging Functional and Representational Similarity via Usable Information

Antonio Almudévar, Alfonso Ortega

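AI summary: Proposes a unified framework based on usable information that gives a theoretical and empirical synthesis along three dimensions (functional similarity, representational similarity, and their relationship), showing that representational similarity is sufficient but not necessary for functional similarity.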

Abstract

We present a unified framework for quantifying the similarity between representations through the lens of usable information, offering a rigorous theoretical and empirical synthesis across three key dimensions. First, addressing functional similarity, we establish a formal link between stitching performance and conditional mutual information. We further reveal that stitching is inherently asymmetric, demonstrating that robust functional comparison necessitates a bidirectional analysis rather than a unidirectional mapping. Second, concerning representational similarity, we find that reconstruction-based metrics and standard tools (e.g., CKA, RSA) act as estimators of usable information under specific constraints. Crucially, we show that similarity is relative to the capacity of the predictive family: representations that appear distinct to a rigid observer may be identical to a more expressive one. Third, we demonstrate that representational similarity is sufficient but not necessary for functional similarity. We unify these concepts through a task-granularity hierarchy: similarity on a complex task guarantees similarity on any coarser derivative, establishing representational similarity as the limit of maximum granularity: input reconstruction.
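For readers unfamiliar with CKA, one of the standard tools named in this abstract, the following is a minimal pure-Python sketch of linear CKA in its usual form, ||Xᵀ Y||_F² / (||Xᵀ X||_F · ||Yᵀ Y||_F) on column-centred feature matrices. It illustrates the existing metric only and does not implement the paper's usable-information estimators. The example matrices are made up, and the sketch assumes neither input has all-constant features, which would make the denominator zero.

```python
import math


def center_columns(m):
    """Subtract each column's mean (rows are examples, columns are features)."""
    n = len(m)
    means = [sum(row[j] for row in m) / n for j in range(len(m[0]))]
    return [[v - means[j] for j, v in enumerate(row)] for row in m]


def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for row-major matrices a (n x p) and b (n x q)."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            s = sum(ra[i] * rb[j] for ra, rb in zip(a, b))
            total += s * s
    return total


def linear_cka(x, y):
    """Linear centred kernel alignment between two representations of the same n examples."""
    xc, yc = center_columns(x), center_columns(y)
    num = cross_gram_sq_norm(xc, yc)
    den = math.sqrt(cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc))
    return num / den


x = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 3.0]]
y = [[0.0, 2.0], [2.0, 0.0], [2.0, 4.0], [6.0, 2.0]]  # x with columns swapped and scaled by 2
same_up_to_rotation = linear_cka(x, y)
```

Because `y` is an orthogonal transformation of `x` followed by a uniform rescaling, linear CKA treats the two as identical, with a score of 1 up to rounding. This is a small instance of the abstract's point that similarity is relative to what the comparing family can express.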

2601.21564 2026-05-29 cs.LG

Representation Unlearning: Forgetting through Information Compression

Antonio Almudévar, Alfonso Ortega

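AI summary: Proposes Representation Unlearning, a framework that performs unlearning directly by learning an information-bottleneck transformation in the model's representation space without modifying model parameters, achieving reliable forgetting, utility retention, and computational efficiency.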

Abstract

Machine unlearning seeks to remove the influence of specific training data from a model, a need driven by privacy regulations and robustness concerns. Existing approaches typically modify model parameters, but such updates can be unstable, computationally costly, and limited by local approximations. We introduce Representation Unlearning, a framework that performs unlearning directly in the model's representation space. Instead of modifying model parameters, we learn a transformation over representations that imposes an information bottleneck: maximizing mutual information with retained data while suppressing information about data to be forgotten. We derive variational surrogates that make this objective tractable and show how they can be instantiated in two practical regimes: when both retain and forget data are available, and in a zero-shot setting where only forget data can be accessed. Experiments across several benchmarks demonstrate that Representation Unlearning achieves more reliable forgetting, better utility retention, and greater computational efficiency than parameter-centric baselines.

2601.19947 2026-05-29 cs.LG cs.AI cs.CV

NCSAM: Noise-Compensated Sharpness-Aware Minimization for Noisy Label Learning

Jiayu Xu, Junbiao Pang

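AI summary: Proposes NCSAM, which uses a noise-compensated perturbation to correct the optimization bias caused by noisy labels and mitigate their memorization, improving over SAM baselines on synthetic and real-world noisy-label benchmarks.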

Comments
11 pages, 1 figure, 8 tables. Major revision of v1: revised PAC-Bayesian theoretical analysis, clarified the NCSAM formulation, added appendix derivations, reorganized experiments and ablations, updated related work, citations, writing, and author list

Abstract

Learning from Noisy Labels (LNL) remains a fundamental challenge in deep learning because real-world datasets often contain corrupted annotations. Most existing methods rely on label correction or sample selection mechanisms. In contrast, we study LNL from an optimization perspective by establishing a theoretical connection between label noise and the flatness-seeking behavior of Sharpness-Aware Minimization (SAM). Based on this analysis, we propose Noise-Compensated Sharpness-Aware Minimization (NCSAM), which uses a noise-compensated perturbation to counteract the optimization bias induced by noisy labels. By correcting distorted SAM perturbations, NCSAM mitigates the memorization of noisy labels during training while preserving the simplicity of optimization-based learning. Experiments on synthetic and real-world noisy-label benchmarks show that NCSAM consistently improves over SAM-based optimization baselines and remains competitive with representative noisy-label learning methods.
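As background for the perturbation that NCSAM modifies, the sketch below shows one step of standard SAM on a toy objective: move to the approximate worst-case point within radius `rho` along the normalised gradient, then descend using the gradient taken there. The noise-compensated perturbation itself is not described in the abstract and is not reproduced. The function names and the quadratic loss are assumptions for illustration.

```python
import math


def sam_step(weights, grad_fn, lr, rho):
    """One update of standard Sharpness-Aware Minimization (no noise compensation)."""
    g = grad_fn(weights)
    norm = math.sqrt(sum(v * v for v in g))
    if norm == 0.0:
        return list(weights)  # zero gradient: no ascent direction, leave weights unchanged
    # Ascent step: perturb the weights by rho along the normalised gradient.
    perturbed = [w + rho * v / norm for w, v in zip(weights, g)]
    # Descent step: apply the gradient evaluated at the perturbed point to the original weights.
    g_sharp = grad_fn(perturbed)
    return [w - lr * v for w, v in zip(weights, g_sharp)]


def quadratic_grad(w):
    """Gradient of the toy loss 0.5 * ||w||^2."""
    return list(w)


updated = sam_step([3.0, 4.0], quadratic_grad, lr=0.1, rho=0.05)
```

On this toy loss the perturbed point is `[3.03, 4.04]`, so the update lands near `[2.697, 3.596]`. A method in the spirit of NCSAM would change how the perturbation is computed when the gradient is distorted by noisy labels, leaving the two-step structure intact.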

2601.18395 2026-05-29 cs.CL

Do not be greedy, Think Twice: Sampling and Selection for Document-level Information Extraction

Mikel Zubillaga, Oscar Sainz, Oier Lopez de Lacalle, Eneko Agirre

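AI summary: Proposes ThinkTwice, a framework that samples multiple candidate templates and selects the best one using unsupervised agreement or a supervised reward model, outperforming greedy decoding in document-level information extraction.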

Comments
Submitted to EMNLP 2026

Abstract

Document-level Information Extraction (DocIE) aims to produce an output template with the entities, relations, and events of interest occurring in the given document. Standard practices include prompting decoder-only LLMs using greedy decoding to avoid output variability. Rather than treating this variability as a limitation, we show that sampling can produce substantially better solutions than greedy decoding, especially when using reasoning models. We thus propose ThinkTwice, a sampling and selection framework in which the LLM generates multiple candidate templates for a given document, and a selection module chooses the most suitable one. We introduce both an unsupervised method that exploits agreement across generated outputs, and a supervised selection method using reward models trained on labeled DocIE data. To address the scarcity of golden reasoning trajectories for DocIE, we propose a rejection-sampling-based method to generate silver training data that pairs output templates with reasoning traces. Our experiments show the validity of unsupervised and supervised ThinkTwice, consistently outperforming greedy baselines and the supervised state-of-the-art.
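
The unsupervised selection idea (pick the sampled template that agrees most with the others) can be sketched roughly as below. This is an assumption-laden illustration, not the ThinkTwice implementation: templates are modeled as sets of (event, role, filler) facts, agreement is an F1-style overlap, and `f1_overlap` and `select_by_agreement` are hypothetical names.

```python
def f1_overlap(a, b):
    """F1-style overlap between two templates given as sets of facts."""
    if not a and not b:
        return 1.0
    common = len(a & b)
    if common == 0:
        return 0.0
    precision = common / len(a)
    recall = common / len(b)
    return 2 * precision * recall / (precision + recall)

def select_by_agreement(candidates):
    """Index of the candidate with the highest mean overlap with the rest."""
    if len(candidates) == 1:
        return 0
    best_idx, best_score = 0, -1.0
    for i, cand in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(f1_overlap(cand, o) for o in others) / len(others)
        if score > best_score:  # ties keep the earliest sample
            best_idx, best_score = i, score
    return best_idx

# Three hypothetical sampled templates for one document
samples = [
    frozenset({("attack", "perpetrator", "group A"), ("attack", "location", "Lima")}),
    frozenset({("attack", "perpetrator", "group A"), ("attack", "location", "Lima")}),
    frozenset({("attack", "perpetrator", "group B")}),
]
chosen = select_by_agreement(samples)
```

The supervised variant described in the abstract would replace the mean-overlap score with a reward model's score for each candidate.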

2601.14855 2026-05-29 cs.LG

Adaptive Exponential Integration for Stable Gaussian Mixture Black-Box Variational Inference

自适应指数积分用于稳定高斯混合黑箱变分推断

Baojun Che, Yifan Chen, Daniel Zhengyu Huang, Xinying Mao, Weijie Wang

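AI summary: To address the instability and inefficiency of Gaussian mixture black-box variational inference, proposes a stable and efficient framework combining affine-invariant preconditioning, an exponential integrator that unconditionally preserves the positive definiteness of covariances, and adaptive time stepping, and proves its convergence.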

详情
Comments
41 pages, 10 figures
AI中文摘要

黑箱变分推断(BBVI)结合高斯混合族提供了一种灵活的方法来近似复杂的后验分布,无需目标密度的梯度。然而,标准的数值优化方法常常遭受不稳定和低效的问题。我们开发了一个稳定高效的框架,结合了三个关键组件:(1)通过自然梯度公式实现的仿射不变预处理,(2)无条件保持协方差矩阵正定性的指数积分器,以及(3)自适应时间步长以确保稳定性并适应不同的预热和收敛阶段。所提出的方法与流形优化和镜像下降有自然联系。对于高斯后验,我们证明了在无噪声设置下的指数收敛性和在蒙特卡洛估计下的几乎必然收敛性,严格论证了自适应时间步长的必要性。在多模态分布、Neal多尺度漏斗以及基于PDE的达西流贝叶斯逆问题上的数值实验证明了所提方法的有效性。

英文摘要

Black-box variational inference (BBVI) with Gaussian mixture families offers a flexible approach for approximating complex posterior distributions without requiring gradients of the target density. However, standard numerical optimization methods often suffer from instability and inefficiency. We develop a stable and efficient framework that combines three key components: (1) affine-invariant preconditioning via natural gradient formulations, (2) an exponential integrator that unconditionally preserves the positive definiteness of covariance matrices, and (3) adaptive time stepping to ensure stability and to accommodate distinct warm-up and convergence phases. The proposed approach has natural connections to manifold optimization and mirror descent. For Gaussian posteriors, we prove exponential convergence in the noise-free setting and almost-sure convergence under Monte Carlo estimation, rigorously justifying the necessity of adaptive time stepping. Numerical experiments on multimodal distributions, Neal's multiscale funnel, and a PDE-based Bayesian inverse problem for Darcy flow demonstrate the effectiveness of the proposed method.
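
Why an exponential update preserves positivity regardless of step size can be seen in a one-dimensional toy, sketched below. This is not the paper's matrix-valued integrator, which the abstract does not spell out. It assumes a scalar variance following d(var)/dt = -g * var with the driving term `g` frozen over the step; `euler_step` and `exponential_step` are hypothetical names.

```python
import math

def euler_step(var, g, h):
    """Explicit Euler on d(var)/dt = -g * var; goes negative when h * g > 1."""
    return var * (1.0 - h * g)

def exponential_step(var, g, h):
    """Exact flow with g frozen over the step: var * exp(-h * g) stays positive."""
    return var * math.exp(-h * g)

var, g, h = 1.0, 3.0, 0.5
euler_var = euler_step(var, g, h)        # negative: not a valid variance
expo_var = exponential_step(var, g, h)   # strictly positive for any h
```

In the matrix case the analogous property is that a symmetric positive definite covariance stays positive definite; the adaptive time stepping in the abstract then governs stability and convergence rather than validity of the iterate.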

2601.14758 2026-05-29 cs.LG cs.AI cs.CL

Mechanism Shift During Post-training from Autoregressive to Masked Diffusion Language Models

从自回归到掩码扩散语言模型的后训练中的机制转变

Injin Kong, Hyoungjoon Lee, Yohan Jo

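AI summary: Through comparative circuit analysis, finds that post-trained masked diffusion models structurally preserve or reorganize autoregressive circuits depending on the task and semantically shift from localized specialization to distributed integration, indicating that diffusion post-training is a deep reorganization of internal computation.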

详情
AI中文摘要

将预训练的自回归模型(ARMs)后训练为掩码扩散模型(MDMs)已成为一种克服顺序生成局限性的经济有效方法。然而,后训练的MDMs是否获得了真正的新计算机制,还是仅仅以非自回归形式重新表达了自回归计算,仍不清楚。通过对ARMs及其从相同骨干网络后训练得到的MDM对应物进行电路比较分析,我们揭示了两个互补的重组轴。在结构上,转变是任务依赖的:MDMs在局部因果任务上保留自回归电路,但在全局任务上放弃继承的路径并将计算前置到早期层。在语义上,转变在不同机制间是一致的:ARMs中尖锐的局部专业化让位于MDMs中的分布式整合。这些发现共同表明,扩散后训练并非生成过程的表面变化,而是内部计算的重组,其深度取决于任务。

英文摘要

Post-training pretrained autoregressive models (ARMs) into masked diffusion models (MDMs) has emerged as a cost-effective way to overcome the limitations of sequential generation. Yet it remains unclear whether post-trained MDMs acquire genuinely new computational mechanisms or merely re-express autoregressive computation in a non-autoregressive form. Through a comparative circuit analysis of ARMs and their MDM counterparts post-trained from the same backbones, we uncover two complementary axes of reorganization. Structurally, the shift is task-dependent: MDMs preserve autoregressive circuitry on locally causal tasks but abandon inherited pathways and front-load computation into early layers on global tasks. Semantically, the shift is consistent across regimes: sharp, localized specialization in ARMs gives way to distributed integration in MDMs. Together, these findings show that diffusion post-training is not a surface-level change in the generation procedure but a reorganization of internal computation whose depth depends on the task.

2601.13111 2026-05-29 cs.CL cs.AI cs.IR

CORE-T: COherent REtrieval of Tables for Text-to-SQL

CORE-T: 面向文本到SQL的表格连贯检索

Hassan Soliman, Vivek Gupta, Dan Roth, Iryna Gurevych

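AI summary: Proposes the CORE-T framework, which uses LLM-generated metadata and a precomputed compatibility cache to efficiently retrieve coherent, joinable table sets from heterogeneous table collections without training, improving table-selection F1 by up to 22.7 points while returning up to 40% fewer tables.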

详情
Comments
Preprint is revised and under review. Code and data available at: https://github.com/UKPLab/arxiv2026-core-t
AI中文摘要

现实中的文本到SQL工作流通常需要连接多个表格。因此,准确检索相关表集合成为端到端性能的关键瓶颈。我们研究一种开放书设置,其中查询必须从多个来源汇集的大规模异构表集合中回答,且没有数据库标识符等清晰的限定信号。在此设置下,密集检索(DR)实现了高召回率但返回大量干扰项,而考虑连接的方法通常依赖额外假设和/或产生高推理开销。我们提出CORE-T,一个可扩展、无需训练的框架,通过LLM生成的用途元数据丰富表格,并预计算轻量级表兼容性缓存。推理时,DR返回前K个候选;单次LLM调用选择一个连贯、可连接的子集,然后两步加法调整阶段恢复强兼容的表。在Bird、Spider、MMQA和Beaver上,CORE-T在表选择F1上比DR提升最多22.7点,同时返回的表减少最多40%,在多表执行准确率上提升最多24.4点,并且使用的总选择token比LLM密集型基线少1.64-4.20倍。

英文摘要

Realistic text-to-SQL workflows often require joining multiple tables. As a result, accurately retrieving the relevant set of tables becomes a key bottleneck for end-to-end performance. We study an open-book setting where queries must be answered over large, heterogeneous table collections pooled from many sources, without clean scoping signals such as database identifiers. Here, dense retrieval (DR) achieves high recall but returns many distractors, while join-aware alternatives often rely on extra assumptions and/or incur high inference overhead. We propose CORE-T, a scalable, training-free framework that enriches tables with LLM-generated purpose metadata and pre-computes a lightweight table-compatibility cache. At inference time, DR returns top-K candidates; a single LLM call selects a coherent, joinable subset, and a two-step additive adjustment stage restores strongly compatible tables. Across Bird, Spider, MMQA, and Beaver, CORE-T improves over DR by up to 22.7 points in table-selection F1 while returning up to 40% fewer tables, and by up to 24.4 points in multi-table execution accuracy, and uses 1.64-4.20x fewer total selection tokens than LLM-intensive baselines.
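
The inference-time flow described above (top-K retrieval, one selection call, then restoring strongly compatible tables) might be sketched as follows. Everything here is an assumption for illustration: the LLM call is replaced by a fixed list, the paper's two-step adjustment is collapsed into a single pass, and `top_k`, `additive_adjustment`, the scores, and the threshold are hypothetical.

```python
def top_k(scores, k):
    """Return the k table names with the highest dense-retrieval scores."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]

def additive_adjustment(selected, candidates, compat, threshold):
    """Re-add candidates that are strongly compatible with a selected table."""
    restored = list(selected)
    for table in candidates:
        if table in restored:
            continue
        if any(compat.get(frozenset((table, s)), 0.0) >= threshold for s in selected):
            restored.append(table)
    return restored

# Hypothetical dense-retrieval scores and precomputed compatibility cache
scores = {"orders": 0.9, "customers": 0.8, "weather": 0.7, "order_items": 0.6, "logs": 0.1}
compat = {
    frozenset(("orders", "order_items")): 0.95,
    frozenset(("orders", "weather")): 0.10,
}

candidates = top_k(scores, k=4)
llm_selected = ["orders", "customers"]  # stand-in for the single LLM selection call
final_tables = additive_adjustment(llm_selected, candidates, compat, threshold=0.8)
```

Because the compatibility cache is computed offline, the adjustment step adds only dictionary lookups at inference time, which is consistent with the low token overhead the abstract reports.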

2601.12500 2026-05-29 cs.CV

Video Individual Counting and Tracking from Moving Drones: A Benchmark and Methods

来自移动无人机的视频个体计数与跟踪:基准与方法

Yaowu Fan, Jia Wan, Tao Han, Andy J. Ma, Wanli Ouyang, Antoni B. Chan

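AI summary: For large-scale dense-crowd scenes, introduces the moving-drone video dataset MovingDroneCrowd++ and designs the counting and tracking methods GD3A and DVTrack, based on optimal transport and descriptor voting, which substantially reduce counting error and improve tracking accuracy.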

详情
AI中文摘要

在大规模场景中计数和跟踪密集人群是一个高度实用但具有挑战性的问题。现有方法大多依赖于场景覆盖有限的固定摄像头数据集,使其不足以用于大规模场景的人群分析。为弥补这一差距,我们引入了MovingDroneCrowd++,这是最大的视频级数据集,专门用于快速移动无人机下的密集人群计数和跟踪,在多种飞行高度、相机角度和光照条件下采集。然而,现有方法在这些具有挑战性的空中条件下仍无法达到令人满意的视频个体计数或跟踪性能。为此,我们提出了GD3A(通过分组描述符关联的全局密度图分解),一种视频个体计数方法,该方法首先通过带有自适应垃圾桶分数的最优传输建立帧间行人描述符的像素级对应关系。然后,采用分组关联来指导将全局密度图分解为共享、流入和流出密度图。我们进一步引入了一种行人跟踪方法DVTrack(描述子投票跟踪),该方法通过描述子投票将描述符级匹配转换为实例级关联。我们的方法依赖于每个行人的分组多个描述符的关联结果,而不是单个向量。由于组内匹配错误不影响最终的计数和跟踪结果,我们的方法在密集人群和具有挑战性的空中条件下更加鲁棒。实验表明,我们的方法在密集人群和复杂运动的移动无人机视频上,在人群计数和跟踪方面均取得了显著提升,计数误差降低了47.4%,跟踪精度提高了64.6%。代码、数据集和预训练模型可在 https://github.com/fyw1999/MovingDroneCrowd 获取。

英文摘要

Counting and tracking dense crowds in large-scale scenes is a highly practical yet challenging problem. Existing methods mostly rely on fixed-camera datasets with limited scene coverage, making them inadequate for crowd analysis in large-scale scenes. To bridge this gap, we introduce MovingDroneCrowd++, the largest video-level dataset dedicated to dense crowd counting and tracking with fast-moving drones, captured under diverse flight altitudes, camera angles, and illumination conditions. Existing methods, however, still fail to achieve satisfactory video individual counting or tracking performance under these challenging aerial conditions. To this end, we propose GD3A (Global Density map Decomposition via group-wise Descriptor Association), a video individual counting method that first establishes pixel-level correspondences between pedestrian descriptors across frames via optimal transport with an adaptive dustbin score. Then, group-wise association is adopted to guide the decomposition of the global density map into shared, inflow, and outflow density maps. We further introduce a pedestrian tracking method, DVTrack (Descriptor Voting Track), which converts descriptor-level matching into instance-level association through descriptor voting. Our methods rely on the association results of group-wise multiple descriptors for each pedestrian rather than a single vector. Since intra-group matching errors do not affect the final counting and tracking results, our methods are more robust in dense crowds and challenging aerial conditions. Experiments show that our methods achieve substantial gains in both crowd counting and tracking on moving-drone videos with dense crowds and complex motions, reducing counting error by 47.4% and improving tracking accuracy by 64.6%. Code, dataset, and pretrained models are available at https://github.com/fyw1999/MovingDroneCrowd.

2601.11178 2026-05-29 cs.AI cs.CL cs.MM cs.SI

TANDEM: Temporal-Aware Neural Detection for Multimodal Hate Speech

TANDEM: 面向多模态仇恨言论的时间感知神经检测

Girish A. Koushik, Helen Treharne, Diptesh Kanojia

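AI summary: Proposes TANDEM, a unified framework that jointly optimizes vision-language and audio-language models through a tandem reinforcement learning strategy, recasting audio-visual hate detection as a structured reasoning problem; it reaches 0.73 F1 in target identification on HateMM (a 30% improvement) while maintaining precise temporal grounding.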

详情
Comments
Under review at ICWSM 2027
AI中文摘要

社交媒体平台日益被长篇多模态内容主导,其中有害叙事通过音频、视觉和文本线索的复杂交互构建。虽然自动化系统能以高准确率标记仇恨言论,但它们通常作为“黑箱”运作,无法提供细粒度、可解释的证据(如精确时间戳和目标身份),而这对于有效的人机协同审核是必需的。在这项工作中,我们提出了TANDEM,一个统一框架,将音频-视觉仇恨检测从二元分类任务转化为结构化推理问题。我们的方法采用一种新颖的串联强化学习策略,其中视觉-语言和音频-语言模型通过自约束跨模态上下文相互优化,在无需密集帧级监督的情况下,稳定地推理长时序列。在三个基准数据集上的实验表明,TANDEM显著优于零样本和上下文增强基线,在HateMM上目标识别F1达到0.73(比现有最佳方法提升30%),同时保持精确的时间定位。我们进一步观察到,虽然二元检测是鲁棒的,但由于固有的标签模糊性和数据集不平衡,在多类设置中区分攻击性和仇恨性内容仍然具有挑战性。更广泛地说,我们的发现表明,即使在复杂的多模态环境中,结构化、可解释的对齐也是可实现的,为下一代透明且可操作的在线安全审核工具提供了蓝图。

英文摘要

Social media platforms are increasingly dominated by long-form multimodal content, where harmful narratives are constructed through a complex interplay of audio, visual, and textual cues. While automated systems can flag hate speech with high accuracy, they often function as "black boxes" that fail to provide the granular, interpretable evidence, such as precise timestamps and target identities, required for effective human-in-the-loop moderation. In this work, we introduce TANDEM, a unified framework that transforms audio-visual hate detection from a binary classification task into a structured reasoning problem. Our approach employs a novel tandem reinforcement learning strategy where vision-language and audio-language models optimize each other through self-constrained cross-modal context, stabilizing reasoning over extended temporal sequences without requiring dense frame-level supervision. Experiments across three benchmark datasets demonstrate that TANDEM significantly outperforms zero-shot and context-augmented baselines, achieving 0.73 F1 in target identification on HateMM (a 30% improvement over state-of-the-art) while maintaining precise temporal grounding. We further observe that while binary detection is robust, differentiating between offensive and hateful content remains challenging in multi-class settings due to inherent label ambiguity and dataset imbalance. More broadly, our findings suggest that structured, interpretable alignment is achievable even in complex multimodal settings, offering a blueprint for the next generation of transparent and actionable online safety moderation tools.

2601.05149 2026-05-29 cs.CV

Multi-Scale Local Speculative Decoding for Image Generation

多尺度局部推测解码用于图像生成

Elia Peruzzo, Guillaume Sautière, Amirhossein Habibian

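AI summary: Proposes the Multi-Scale Local Speculative Decoding (MuLo-SD) framework, which combines a low-resolution drafter with a high-resolution target model and a local rejection and resampling mechanism to accelerate autoregressive image generation, achieving up to 5x speedup while maintaining semantic alignment and perceptual quality.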

详情
Comments
Accepted at CVPR 2026
AI中文摘要

自回归(AR)模型在图像合成中取得了显著成功,但其顺序性带来了严重的延迟限制。推测解码提供了一种有前景的加速途径,但现有方法受限于令牌级模糊性和缺乏空间感知。在这项工作中,我们引入了多尺度局部推测解码(MuLo-SD),一种新颖的框架,结合多分辨率草稿与空间感知验证来加速AR图像生成。我们的方法利用低分辨率草稿模型配合上采样步骤来提出候选图像令牌,然后由高分辨率目标模型并行验证。关键的是,我们引入了局部拒绝和重采样机制,通过关注空间邻域而非在第一次拒绝后进行光栅扫描重采样,从而高效纠正草稿错误。当与并行解码重采样集成时,MuLo-SD实现了显著的加速——高达$\mathbf{5\times}$——在加速方面优于推测解码和并行解码基线,同时保持相当的语义对齐和感知质量。这些结果在MS-COCO 5k验证集上使用GenEval、DPG-Bench和FID/HPSv2进行了验证。广泛的消融实验突出了上采样设计、概率池化以及局部拒绝和重采样与邻域扩展的影响。我们的方法为图像合成中的推测解码设立了新的最先进水平,弥合了效率与保真度之间的差距。项目页面见https://qualcomm-ai-research.github.io/mulo-sd-webpage/。

英文摘要

Autoregressive (AR) models have achieved remarkable success in image synthesis, yet their sequential nature imposes significant latency constraints. Speculative Decoding offers a promising avenue for acceleration, but existing approaches are limited by token-level ambiguity and lack of spatial awareness. In this work, we introduce Multi-Scale Local Speculative Decoding (MuLo-SD), a novel framework that combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. Our method leverages a low-resolution drafter paired with an up-sampling step to propose candidate image tokens, which are then verified in parallel by a high-resolution target model. Crucially, we incorporate a local rejection and resampling mechanism, enabling efficient correction of draft errors by focusing on spatial neighborhoods rather than raster-scan resampling after the first rejection. When integrated with parallel decoding resampling, MuLo-SD achieves substantial speedups -- up to $\mathbf{5\times}$ -- outperforming both speculative decoding and parallel decoding baselines in terms of acceleration, while maintaining comparable semantic alignment and perceptual quality. These results are validated using GenEval, DPG-Bench, and FID/HPSv2 on the MS-COCO 5k validation split. Extensive ablations highlight the impact of up-sampling design, probability pooling, and local rejection and resampling with neighborhood expansion. Our approach sets a new state-of-the-art in speculative decoding for image synthesis, bridging the gap between efficiency and fidelity. Project page is available at https://qualcomm-ai-research.github.io/mulo-sd-webpage/ .
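
As background for the verification step, the token-level accept/reject rule of standard speculative decoding can be sketched as below. This is the generic rule only, not MuLo-SD's local rejection and resampling over spatial neighborhoods; `residual`, `verify_token`, and the two-token distributions are hypothetical. Here `p` is the target distribution and `q` the draft distribution over the same vocabulary.

```python
import random

def residual(p, q):
    """Normalized max(p - q, 0), the resampling distribution after a rejection."""
    raw = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    total = sum(raw)
    return [r / total for r in raw] if total > 0 else list(p)

def verify_token(token, p, q, rng):
    """Accept a drafted token with probability min(1, p/q); else resample."""
    # q[token] > 0 is assumed because the token was sampled from q
    if rng.random() < min(1.0, p[token] / q[token]):
        return token, True
    replacement = rng.choices(range(len(p)), weights=residual(p, q), k=1)[0]
    return replacement, False

rng = random.Random(0)
# The target puts all mass on token 1, so a drafted token 0 is always rejected
result = verify_token(0, [0.0, 1.0], [0.9, 0.1], rng)
```

In the generic scheme a rejection discards every later draft token in raster order; the abstract's local mechanism instead confines resampling to the spatial neighborhood of the rejected token.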

2601.04765 2026-05-29 cs.CL cs.AI cs.LG physics.comp-ph

Differential syntactic and semantic encoding in LLMs

大型语言模型中句法与语义的差异编码

Santiago Acevedo, Alessandro Laio, Marco Baroni

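AI summary: By averaging the hidden-representation vectors of sentences that share syntactic structure or meaning, this study finds that syntactic and semantic information in the inner-layer representations of large language models (with DeepSeek-V3 as the example) is at least partially linearly encoded, and that the two have different encoding profiles and can be decoupled to some extent.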

详情
Comments
Published as conference paper at ICML 2026
AI中文摘要

我们研究了句法和语义信息如何在大型语言模型(LLMs)的内部层表示中编码,重点关注非常大的DeepSeek-V3。我们发现,通过平均共享句法结构或语义的句子的隐藏表示向量,我们得到了能够捕获表示中相当大比例的句法和语义信息的向量。特别是,从句子向量中减去这些句法和语义“质心”会强烈影响它们与句法和语义匹配句子的相似性,这表明句法和语义至少部分地线性编码。我们还发现句法和语义的跨层编码轮廓不同,并且这两种信号可以在一定程度上解耦,这表明LLM表示中这两种语言信息的差异编码。

英文摘要

We study how syntactic and semantic information is encoded in inner layer representations of Large Language Models (LLMs), focusing on the very large DeepSeek-V3. We find that, by averaging hidden-representation vectors of sentences sharing syntactic structure or meaning, we obtain vectors that capture a significant proportion of the syntactic and semantic information contained in the representations. In particular, subtracting these syntactic and semantic "centroids" from sentence vectors strongly affects their similarity with syntactically and semantically matched sentences, respectively, suggesting that syntax and semantics are, at least partially, linearly encoded. We also find that the cross-layer encoding profiles of syntax and semantics are different, and that the two signals can to some extent be decoupled, suggesting differential encoding of these two types of linguistic information in LLM representations.
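
The centroid-subtraction probe can be illustrated with a toy, sketched below under stated assumptions: three made-up 3-dimensional "sentence vectors" share one large component (standing in for a common syntactic structure), and `centroid`, `subtract`, and `cosine` are hypothetical helper names. Real hidden states would come from a model layer.

```python
import math

def centroid(vectors):
    """Component-wise mean of a group of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def subtract(v, c):
    return [vi - ci for vi, ci in zip(v, c)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical sentence vectors sharing the component [3, 0, 0]
group = [[3.0, 1.0, 0.0], [3.0, 0.0, 1.0], [3.0, -1.0, -1.0]]
c = centroid(group)

before = cosine(group[0], group[1])                        # high: shared component dominates
after = cosine(subtract(group[0], c), subtract(group[1], c))  # shared component removed
```

A large drop from `before` to `after` is the kind of evidence the abstract reads as (partially) linear encoding of the shared property.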

2601.03729 2026-05-29 cs.CV

MATANet: A Multi-context Attention and Taxonomy-Aware Network for Fine-Grained Underwater Recognition of Marine Species

MATANet:用于海洋物种细粒度识别的多上下文注意与分类感知网络

Donghwan Lee, Byeongjin Kim, Geunhee Kim, Hyukjin Kwon, Nahyeon Maeng, Wooju Kim

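AI summary: Proposes the MATANet framework, which combines organism appearance, environmental context, and taxonomic structure through a multi-context environmental attention module and a hierarchy-aware representation learning module for fine-grained recognition of marine organisms, achieving the best performance on FathomNet2025 and LifeCLEF2015-Fish.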

详情
AI中文摘要

海洋生物的细粒度识别对于生态研究、生物多样性监测、栖息地保护和基于证据的政策制定至关重要。然而,许多现有方法主要依赖于以物体或ROI为中心的表征。这些限制在具有挑战性的水下场景中会降低判别性能,因为视觉上相似的生物通常出现在不同的环境条件下。为了解决这些问题,我们提出了MATANet(多上下文注意与分类感知网络),一个用于海洋生物细粒度分类识别的框架。MATANet的动机来自专家分类识别实践,其中在识别过程中同时考虑生物体形态和上下文线索。该框架由两个主要组件组成。首先,多上下文环境注意力模块(MCEAM)对主要感兴趣区域(ROI)与多尺度周围环境区域之间的交叉注意力进行建模,从而将局部形态线索与栖息地级上下文信息相结合。其次,层级感知表示学习模块(HRLM)使用分类层次作为辅助监督来正则化表示学习,并鼓励跨分类级别的语义结构化嵌入。通过联合建模生物外观、环境上下文和分类结构,MATANet学习了用于细粒度分类识别的更具判别性的表示。在FathomNet2025和LifeCLEF2015-Fish上的实验表明,MATANet持续优于现有方法的识别性能。在FAIR1M上的额外实验进一步检验了所提框架在水下图像之外的适用性。值得注意的是,MATANet在CVPR 2025 FGVC12研讨会的FathomNet 2025挑战赛中获得了第一名。

英文摘要

Fine-grained recognition of marine organisms is important for ecological research, biodiversity monitoring, habitat conservation, and evidence-based policy-making. However, many existing approaches primarily rely on object- or ROI-centered representations. These limitations can reduce discriminative performance in challenging underwater scenes, where visually similar organisms often appear under diverse environmental conditions. To address these challenges, we propose MATANet (Multi-context Attention and Taxonomy-Aware Network), a framework for fine-grained taxonomic recognition of marine organisms. MATANet is motivated by expert taxonomic identification practices, in which both organism-level morphology and contextual cues are considered during recognition. The framework consists of two main components. First, the Multi-Context Environmental Attention Module (MCEAM) models cross-attention between the primary region of interest (ROI) and multi-scale surrounding environmental regions, thereby combining local morphological cues with habitat-level contextual information. Second, the Hierarchy-Aware Representation Learning Module (HRLM) uses taxonomic hierarchy as auxiliary supervision to regularize representation learning and encourage semantically structured embeddings across taxonomic levels. By jointly modeling organism appearance, environmental context, and taxonomic structure, MATANet learns more discriminative representations for fine-grained taxonomic recognition. Experiments on FathomNet2025 and LifeCLEF2015-Fish demonstrate that MATANet consistently improves recognition performance over existing methods. Additional experiments on FAIR1M further examine the applicability of the proposed framework beyond underwater imagery. Notably, MATANet ranked first in the FathomNet 2025 Challenge at the CVPR 2025 FGVC12 workshop.

2601.00065 2026-05-29 cs.LG cs.CL cs.CR

When the Same Coefficients Reach Different Places: Asymmetric Realizability in Transplanting Tokenizers across Large Language Models

当相同系数到达不同位置:跨大型语言模型移植分词器中的非对称可实现性

Xiaoze Liu, Weichen Yu, Matt Fredrikson, Xiaoqian Wang, Jing Gao

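AI summary: Identifies a geometric asymmetry in tokenizer transplant for cross-vocabulary model composition and constructs "breaker tokens" that exploit it, experimentally verifying their existence across multiple model pairs and their robustness to defenses such as fine-tuning and spectral filtering.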

详情
AI中文摘要

跨词汇模型组合中的分词器移植将仅存在于捐赠者的嵌入行重构为基于共享词汇锚点的加权组合,并在基础模型上重用这些系数。我们识别出这种重构的一个结构几何特性:相同的系数向量在捐赠者和基础锚点跨度中到达不同的集合,即一个“非对称可实现性”差距。在OMP下的65个捐赠者-基础对中,通过CLP、WECHSEL和FOCUS的跨算子验证,我们构造了“破坏令牌”:在捐赠者锚点跨度中保持统计惰性,同时在基础中产生高显著性重构的单一系数向量。相同的Gemma-2-2B捐赠者检查点允许针对来自五个模型家族的13个不同下游基础进行此构造。植入的方向与未改变的干净参考权重合并。在部署者案例研究中,标准LoRA微调主要抑制了其提示分布与训练语料匹配的破坏者,并且在我们设置中不足以缓解此类攻击家族。测试的谱滤波器未能捕捉到非对称性。我们讨论了在开放权重组合供应链中的潜在滥用。

英文摘要

Tokenizer transplant in cross-vocabulary model composition reconstructs donor-only embedding rows as weighted combinations over shared lexical anchors and reuses those coefficients on the base. We identify a structural geometric property of this reconstruction: the same coefficient vector reaches different sets in the donor and base anchor spans, an "asymmetric realizability" gap. Across 65 donor-base pairs under OMP, with cross-operator validation on CLP, WECHSEL, and FOCUS, we construct "breaker tokens": single coefficient vectors that remain statistically inert in the donor anchor span while producing a high-salience reconstruction in the base. The same Gemma-2-2B donor checkpoint admits this construction against 13 different downstream bases drawn from five model families. The planted direction passes weight-merging with a clean reference unchanged. In a deployer case study, standard LoRA fine-tuning suppresses the breaker primarily on prompts whose distribution matches the training corpus and is not a sufficient mitigation against this attack family in our setting. The tested spectral filters miss the asymmetry. We discuss potential misuse in the open-weight composition supply chain.
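
The core geometric point (one coefficient vector, two different reconstructions) can be shown with a deliberately tiny toy, sketched below. It does not reproduce OMP or the paper's construction: the two-dimensional anchor embeddings are made up, and `combine` is a hypothetical helper that forms the weighted sum of anchor rows.

```python
def combine(coeffs, anchors):
    """Weighted sum of anchor embedding rows using the given coefficients."""
    dim = len(anchors[0])
    return [sum(c * row[d] for c, row in zip(coeffs, anchors)) for d in range(dim)]

# Hypothetical embeddings of the same two shared anchor tokens in each model
donor_anchors = [[1.0, 0.0], [1.0, 0.0]]   # nearly collinear in the donor
base_anchors = [[1.0, 0.0], [0.0, 1.0]]    # well separated in the base

coeffs = [1.0, -1.0]                        # one coefficient vector, reused as-is

donor_vec = combine(coeffs, donor_anchors)  # cancels out: inert in the donor
base_vec = combine(coeffs, base_anchors)    # non-trivial direction in the base
```

Because the transplant reuses `coeffs` unchanged, any difference between the two anchor geometries carries over directly into the base model's new embedding row, which is the gap the abstract calls asymmetric realizability.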

2512.21311 2026-05-29 cs.LG

Learning to Solve PDEs on Neural Shape Representations

在神经形状表示上学习求解偏微分方程

Lilian Welschinger, Yilin Liu, Zican Wang, Niloy Mitra

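AI summary: Proposes a meshfree formulation that learns a local update operator conditioned on neural local shape attributes, solving surface PDEs directly on neural representations without explicit meshing or per-instance optimization while preserving differentiability.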

详情
Comments
Accepted at CVPR 2026. Project page: https://welschinger.github.io/Learning-to-Solve-PDEs-on-Neural-Shape-Representations/
AI中文摘要

在形状上求解偏微分方程支撑着许多形状分析和工程任务;然而,主流的偏微分方程求解器在多边形/三角形网格上运行,而现代3D资产越来越多地以神经表示的形式存在。这种不匹配导致没有合适的方法直接在神经域内求解表面偏微分方程,迫使进行显式网格提取或逐实例残差训练,阻碍了端到端的工作流程。我们提出了一种新颖的无网格公式,学习一个基于神经(局部)形状属性条件化的局部更新算子,使得表面偏微分方程能够直接在神经数据所在处求解。该算子自然地与流行的神经表面表示集成,仅在单个代表性形状上训练一次,并能在形状和拓扑变化中泛化,实现准确、快速的推理,无需显式网格划分或逐实例优化,同时保持可微性。在解析基准测试(球面上的热扩散方程和泊松方程)以及各种形状和神经表面表示上,我们的方法达到了与经典求解器相当的精度,同时实现了跨神经和传统表面表示的统一端到端流水线。我们的源代码和项目页面:https://welschinger.github.io/Learning-to-Solve-PDEs-on-Neural-Shape-Representations/。

英文摘要

Solving partial differential equations (PDEs) on shapes underpins many shape analysis and engineering tasks; yet, prevailing PDE solvers operate on polygonal/triangle meshes while modern 3D assets increasingly live as neural representations. This mismatch leaves no suitable method to solve surface PDEs directly within the neural domain, forcing explicit mesh extraction or per-instance residual training, preventing end-to-end workflows. We present a novel, meshfree formulation that learns a local update operator conditioned on neural (local) shape attributes, enabling surface PDEs to be solved directly where the (neural) data lives. The operator integrates naturally with prevalent neural surface representations, is trained once on a single representative shape, and generalizes across shape and topology variations, enabling accurate, fast inference without explicit meshing or per-instance optimization while preserving differentiability. Across analytic benchmarks (heat diffusion and Poisson equations on the sphere) and on diverse shapes and neural surface representations, our method achieves accuracy comparable to classical solvers while enabling a unified, end-to-end pipeline across neural and traditional surface representations. Our source code and project page: https://welschinger.github.io/Learning-to-Solve-PDEs-on-Neural-Shape-Representations/.

2512.19199 2026-05-29 cs.LG cs.AI

On the Koopman-Based Generalization Bounds for Multi-Task Deep Learning

基于Koopman的多任务深度学习泛化界

Mahdi Mohammadigohari, Giuseppe Di Fatta, Giuseppe Nicosia, Panos M. Pardalos

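AI summary: Uses operator-theoretic techniques to establish generalization bounds for multi-task deep neural networks; by exploiting small condition numbers of the weight matrices and introducing a tailored Sobolev space as an expanded hypothesis space, it obtains a tighter bound than conventional norm-based methods, one that remains valid in the single-output setting and improves on existing Koopman-based bounds.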

详情
Journal ref
Machine Learning, Optimization, and Data Science (LOD 2025), Lecture Notes in Computer Science (LNCS), vol. 16468, Springer, 2026, pp. 376--392
Comments
Accepted at the 11th International Conference on Machine Learning, Optimization, and Data Science (LOD), Castiglione della Pescaia, Italy, September 21-24, 2025. To appear in Lecture Notes in Computer Science (LNCS), volume 16467
AI中文摘要

本文利用算子理论技术建立了多任务深度神经网络的泛化界。作者通过利用权重矩阵中的小条件数并引入定制的Sobolev空间作为扩展假设空间,提出了比传统基于范数的方法更紧的界。该增强的界即使在单输出设置下仍然有效,优于现有的基于Koopman的界。所得框架保持了关键优势,如灵活性和与网络宽度无关,为核方法背景下的多任务深度学习提供了更精确的理论理解。

英文摘要

The paper establishes generalization bounds for multitask deep neural networks using operator-theoretic techniques. The authors propose a tighter bound than those derived from conventional norm-based methods by leveraging small condition numbers in the weight matrices and introducing a tailored Sobolev space as an expanded hypothesis space. This enhanced bound remains valid even in single-output settings, outperforming existing Koopman-based bounds. The resulting framework maintains key advantages such as flexibility and independence from network width, offering a more precise theoretical understanding of multitask deep learning in the context of kernel methods.

2512.19184 2026-05-29 cs.LG cs.AI

Operator-Based Generalization Bound for Deep Learning: Insights on Multi-Task Learning

基于算子的深度学习泛化界:多任务学习的洞见

Mahdi Mohammadigohari, Giuseppe Di Fatta, Giuseppe Nicosia, Panos M. Pardalos

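AI summary: Within an operator-theoretic framework, combines a Koopman-based approach with existing techniques to obtain tighter generalization bounds for vector-valued neural networks and deep kernel methods, introduces sketching techniques to reduce computational cost, and proposes a deep vector-valued reproducing kernel Hilbert space framework that uses Perron-Frobenius operators to enhance deep kernel methods, deriving a new Rademacher generalization bound that addresses underfitting and overfitting.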

详情
Journal ref
Machine Learning, Optimization, and Data Science (LOD 2025), Lecture Notes in Computer Science (LNCS), vol. 16468, Springer, 2026, pp. 120--137
Comments
Accepted at the 11th International Conference on Machine Learning, Optimization, and Data Science (LOD), Castiglione della Pescaia, Italy, September 21-24, 2025. To appear in Lecture Notes in Computer Science (LNCS), volume 16467
AI中文摘要

本文提出了向量值神经网络和深度核方法的新型泛化界,通过算子理论框架聚焦多任务学习。我们的关键发展在于策略性地将基于Koopman的方法与现有技术相结合,实现了比传统基于范数的界更紧的泛化保证。为缓解基于Koopman方法的计算挑战,我们引入了适用于向量值神经网络的草图技术。这些技术在一般Lipschitz损失下给出了超额风险界,为包括鲁棒回归和多重分位数回归在内的应用提供了性能保证。此外,我们提出了一个新的深度学习框架——深度向量值再生核希尔伯特空间(vvRKHS),利用Perron-Frobenius(PF)算子增强深度核方法。我们为该框架推导了新的Rademacher泛化界,通过核精炼策略明确处理欠拟合和过拟合。这项工作为深度学习架构下的多任务学习泛化性质提供了新颖洞见,该领域直到最近才有所发展。

英文摘要

This paper presents novel generalization bounds for vector-valued neural networks and deep kernel methods, focusing on multi-task learning through an operator-theoretic framework. Our key development lies in strategically combining a Koopman-based approach with existing techniques, achieving tighter generalization guarantees compared to traditional norm-based bounds. To mitigate computational challenges associated with Koopman-based methods, we introduce sketching techniques applicable to vector-valued neural networks. These techniques yield excess risk bounds under generic Lipschitz losses, providing performance guarantees for applications including robust and multiple quantile regression. Furthermore, we propose a novel deep learning framework, deep vector-valued reproducing kernel Hilbert spaces (vvRKHS), leveraging Perron-Frobenius (PF) operators to enhance deep kernel methods. We derive a new Rademacher generalization bound for this framework, explicitly addressing underfitting and overfitting through kernel refinement strategies. This work offers novel insights into the generalization properties of multitask learning with deep learning architectures, an area that has been relatively unexplored until recent developments.

2512.11944 2026-05-29 cs.RO cs.AI

A Review of Learning-Based Motion Planning: Toward a Data-Driven Optimal Control Approach

基于学习的运动规划综述:迈向数据驱动的最优控制方法

Jia Hu, Yang Chang, Haoran Wang

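AI summary: Systematically reviews the Data-Driven Optimal Control paradigm, which fuses the theoretical guarantees of optimal control with the adaptive capabilities of machine learning, provides a three-dimensional implementation roadmap for autonomous-driving motion planning, and identifies four future research directions.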

详情
Comments
44 pages, 14 figures
AI中文摘要

自动驾驶的运动规划面临一个关键的权衡。传统的基于规则的流程提供了可验证的安全性和可解释性,但往往难以在复杂场景中泛化。相反,新兴的基于学习的方法——包括模仿学习、强化学习和生成式AI——提供了更大的适应性,但通常受限于不透明性和安全风险。现有的综述通常孤立地分析这些AI方法,忽视了将它们与严格的控制框架相结合的潜力。为弥合这一差距,本文首次系统综述了数据驱动最优控制(DDOC)范式,明确考察了它如何协同最优控制的理论保证与现代机器学习的自适应能力。基于这一框架,我们提出了首个DDOC运动规划路线图,将其实现结构化为三个关键维度:定制化、动力学自适应和自整定。最后,为缩小剩余的现实差距,我们确定了四个未来研究方向,从而加速向可信赖且类人的自动驾驶的过渡。

英文摘要

Motion planning for autonomous driving (AD) faces a critical trade-off. While traditional rule-based pipelines offer verifiable safety and interpretability, they often fail to generalize in complex scenarios. Conversely, emerging learning-based methods, including imitation learning (IL), reinforcement learning (RL), and generative AI, offer greater adaptability but are often constrained by opacity and safety risks. Existing surveys typically analyze these AI methods in isolation, overlooking the potential of integrating them with rigorous control frameworks. To bridge this gap, this paper presents the first systematic review of the Data-Driven Optimal Control (DDOC) paradigm, explicitly examining how it synergizes the theoretical guarantees of optimal control with the adaptive capabilities of modern machine learning. Building on this framework, we propose the first roadmap for DDOC-based motion planning, structuring its implementation into three critical dimensions: customization, dynamics adaptation, and self-tuning. Finally, to close the remaining reality gap, we identify four future research directions, thereby accelerating the transition to trustworthy and human-like autonomous driving.