arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.23458 2026-05-25 cs.CV cs.AI

One-Forcing: Towards Stable One-Step Autoregressive Video Generation

Jiaqi Feng, Justin Cui, Yuanhao Ban, Cho-Jui Hsieh

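AI summary: This paper proposes One-Forcing, a method that targets the stability and quality problems of one-step autoregressive video generation. It adds an auxiliary generative adversarial network (GAN) loss to the DMD (distribution matching distillation) objective to obtain high-quality, efficient one-step generation. On the VBench benchmark, One-Forcing reaches a total score of 83.76, the state of the art among one-step causal video generation methods, and it achieves stable framewise one-step autoregressive generation at one-third of the training cost of the chunkwise model, a setting in which prior methods failed.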

Comments
Work in Progress. Project Page: https://aurora-edu.github.io/one-forcing/, Code: https://github.com/Aurora-edu/One-Forcing
Abstract

Recent advances have substantially improved real-time interactive video generation in the autoregressive regime. However, most existing few-step autoregressive video generation methods, often distilled from a corresponding many-step teacher, default to a 4-step sampling configuration, which still incurs considerable latency during deployment and suffers from severe quality degradation when the number of sampling steps is further reduced, particularly in the one-step setting. Trajectory-style consistency distillation methods often produce videos with weak dynamics, while DMD-based approaches, such as Self-Forcing, tend to yield blurry frames. To address this challenge, we propose One-Forcing, a simple yet effective approach which augments the DMD objective with an auxiliary GAN loss for high-quality and efficient one-step video generation. Experiments on VBench show that One-Forcing achieves a total score of 83.76, establishing state-of-the-art performance among one-step causal video generation methods and remaining competitive with strong many-step approaches. We further demonstrate that one-step framewise autoregressive generation can be achieved stably with merely one-third of the training cost of the chunkwise model, a setting that prior methods have failed to achieve successfully.
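
The abstract states only that an auxiliary GAN loss augments the DMD objective. One generic way to write such a combined generator objective is shown below; the additive form and the weight $\lambda$ are assumptions for illustration, not details taken from the paper.

```latex
\mathcal{L}_{\text{generator}}
  = \mathcal{L}_{\text{DMD}} + \lambda \, \mathcal{L}_{\text{GAN}},
\qquad \lambda > 0 \quad \text{(hypothetical weighting coefficient)}
```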

2605.23451 2026-05-25 cs.CV

Efficient One-Step Diffusion Restoration Model with Compact Token Compression and Linear Attention

Bingtian Qiao, Yue Shi, Yingjie Zhou, Yong Guo, Guangtao Zhai, Jiezhang Cao

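AI summary: Existing real-world image super-resolution methods suffer from heavy computation, high memory use, and long inference latency. This paper proposes SANA-SR, an efficient one-step restoration framework. A deep compression autoencoder with a 32x compression ratio sharply reduces the number of latent tokens, and a linear-attention DiT with LoRA fine-tuning restores high-resolution images with linear complexity. Experiments report strong quantitative results on the benchmark datasets, and the pruned model is small and fast enough to be a good candidate for practical deployment.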

Abstract

Real-world image super-resolution aims to recover high-quality images from complex and unknown real-world degradations. However, existing generative Real-ISR methods largely inherit the dense latent representations and quadratic-cost global modeling paradigm developed for high-resolution image synthesis, causing computation, memory usage, and inference latency to scale unfavorably with resolution and thus limiting practical deployment. We argue that the key bottleneck lies not in insufficient restoration priors, but in excessive token redundancy and costly token interactions during high-resolution restoration. Motivated by this observation, we revisit Real-ISR from the perspectives of compact latent representation and linear-complexity modeling, and propose SANA-SR, an efficient one-step restoration framework. Specifically, SANA-SR employs a deep compression autoencoder with a 32x compression ratio to drastically reduce latent tokens while preserving restoration-relevant structures and textures. On top of this compact latent space, we introduce a linear-attention DiT with LoRA fine-tuning, enabling efficient high-resolution restoration with linear-complexity token mixing. Extensive experiments on all benchmark datasets demonstrate that SANA-SR achieves highly competitive and often superior quantitative performance against existing methods, while restoring clearer and more realistic textures. Moreover, after pruning, the deployed model runs in 0.019s with 407.95G MACs and 344M parameters, highlighting its strong potential for practical mobile deployment.
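
To make the token-redundancy argument concrete, the following sketch counts latent tokens at two compression ratios. The 1024x1024 input, the 8x comparison baseline, and the one-token-per-latent-position convention are assumptions for illustration; only the 32x ratio comes from the abstract.

```python
def latent_tokens(height: int, width: int, compression: int) -> int:
    """Latent positions left after downsampling each spatial side by `compression`."""
    return (height // compression) * (width // compression)


def pairwise_cost(tokens: int) -> int:
    """Number of token pairs that quadratic (softmax) attention must score."""
    return tokens * tokens


# Hypothetical 1024x1024 input; the 8x baseline is an assumed point of comparison.
tokens_8x = latent_tokens(1024, 1024, 8)
tokens_32x = latent_tokens(1024, 1024, 32)

print(tokens_8x, tokens_32x)                                   # 16384 1024
print(pairwise_cost(tokens_8x) // pairwise_cost(tokens_32x))   # 256
```

Under these assumptions the token count drops 16-fold, so an attention layer whose cost is linear in the token count shrinks by the same factor, while a quadratic layer would shrink 256-fold.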

2605.23449 2026-05-25 cs.LG cs.CV math.AG

Commutator-Induced Uncertainty in VAEs

Tahereh Dehdarirad, Michael Felsberg, Gabriel Eilertsen, Ziliang Xiong

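AI summary: Variational autoencoders (VAEs) often struggle to represent non-commutative structure in their latent spaces. This paper proposes a Lie group VAE framework that combines geometric and algebraic perspectives on uncertainty and separates discrete generative factors from continuous geometric transformations. It diagnoses algebraic non-commutativity and then aligns the decoder's order sensitivity with it, which improves reconstruction quality and the consistency of the latent structure. The method shows better reconstructions and latent traversals on several benchmark datasets.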

Abstract

Variational autoencoders (VAEs) often struggle to represent non-commutative structure in learned latent spaces. Symmetry-aware VAEs commonly address this issue by enforcing commutativity through algebraic regularization, which is appropriate for commutative transformation groups but can suppress meaningful non-commutative structure when it is intrinsic to the data. We argue that non-commutativity should instead be explicitly diagnosed and reflected in reconstruction behavior. We introduce a Lie Group VAE framework that combines geometric and algebraic perspectives on uncertainty while separating discrete generative factors from continuous geometric transformations. In a first phase, the model is trained without structural constraints while algebraic non-commutativity is measured through finite Baker-Campbell-Hausdorff deviations and decoder order sensitivity is measured through reconstruction order-swap tests. These diagnostics reveal a scale mismatch between latent non-commutativity and reconstruction behavior under unconstrained training. In a second phase, we introduce a deformation-stability constraint with a data-driven calibration constant that aligns decoder sensitivity with algebraic non-commutativity. We evaluate the framework on dSprites, 3DShapes, 3DCars, and CelebA against generic and symmetry-aware baselines, including beta-VAE, CLG-VAE, and CFASL. Across synthetic benchmarks, the method improves reconstruction quality and yields decoder-level behavior more consistent with latent non-commutative structure. Qualitative analyses show clearer order-dependent latent compositions and more stable reconstructions. On CelebA, the model yields more faithful reconstructions and factor-specific latent traversals than CFASL, while also exhibiting meaningful order-dependent interactions between learned latent directions.
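
The Baker-Campbell-Hausdorff series gives $\log(e^{A}e^{B}) = A + B + \tfrac{1}{2}[A,B] + \dots$, so the leading sign of non-commutativity is the commutator $[A,B]$. The sketch below is a generic numerical illustration of that fact on two rotation generators, not the paper's diagnostic: it checks that the order-swap gap $e^{tA}e^{tB} - e^{tB}e^{tA}$ is approximately $t^2[A,B]$ for small $t$. All function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def add(a: Matrix, b: Matrix, sign: float = 1.0) -> Matrix:
    """Return a + sign * b."""
    return [[x + sign * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a: Matrix, s: float) -> Matrix:
    return [[s * x for x in row] for row in a]


def identity(n: int) -> Matrix:
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def expm(a: Matrix, terms: int = 20) -> Matrix:
    """Matrix exponential by truncated Taylor series (adequate for small norms)."""
    result, term = identity(len(a)), identity(len(a))
    for k in range(1, terms):
        term = scale(matmul(term, a), 1.0 / k)
        result = add(result, term)
    return result


def commutator(a: Matrix, b: Matrix) -> Matrix:
    return add(matmul(a, b), matmul(b, a), sign=-1.0)


def order_swap_gap(a: Matrix, b: Matrix, t: float) -> Matrix:
    """exp(tA) exp(tB) - exp(tB) exp(tA): zero exactly when the two factors commute."""
    ea, eb = expm(scale(a, t)), expm(scale(b, t))
    return add(matmul(ea, eb), matmul(eb, ea), sign=-1.0)


def max_abs(a: Matrix) -> float:
    return max(abs(x) for row in a for x in row)


# Generators of rotations about the x and y axes (elements of so(3)).
LX = [[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
LY = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0], [-1.0, 0.0, 0.0]]

t = 0.01
gap = order_swap_gap(LX, LY, t)
residual = max_abs(add(scale(gap, 1.0 / t**2), commutator(LX, LY), sign=-1.0))
print(residual < 0.05)  # True: the gap matches t^2 [A, B] up to O(t^3)
```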

2605.23448 2026-05-25 cs.CR cs.AI

AI Security Research Should Better Incentivize Defense Research

Youqian Zhang

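AI summary: This paper identifies an imbalance in AI security research: far more work attacks AI systems than defends them. Surveying papers across several subfields, it finds skewed attack-to-defense ratios. Attacks are often evaluated under favorable conditions that overstate the practical threat, while defenses are held to a stricter standard, so few usable defenses exist. The author therefore argues that AI security research should better incentivize defense research.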

Comments
14 pages, 3 figures, 3 tables
Abstract

This work examines an imbalance in artificial intelligence (AI) security research: the field tends to produce more work on attacking AI systems than on defending them. Drawing on related academic papers, we find biased attack-to-defense ratios across subfields, including federated learning, speech recognition, membership inference, large language models, etc. The imbalance possibly means far beyond a simple count: attack papers are routinely evaluated under favorable conditions that make threats look more severe than they are in practice, while defenses are held to a stricter standard that few can meet. The result is a literature rich in demonstrated vulnerabilities and thin on usable and deployed protections. We thus argue that AI security research should better incentivize defense research.

2605.23446 2026-05-25 cs.LG math.CO

Weisfeiler-Leman Is Incomplete on Simple Spectrum Graphs, so Canonicalize Them

Snir Hordan, Nadav Dym, Tim Seppelt

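AI summary: This work studies graph isomorphism for graphs with a simple spectrum. It proves that for every natural number $k$, the $k$-Weisfeiler-Leman test cannot distinguish all non-isomorphic simple-spectrum graphs, which exposes a limitation of WL-bounded graph neural networks on this class. To close the gap, it introduces PRiSM, the first provably complete canonicalization of simple-spectrum eigendecompositions. Composed with DeepSets or a Transformer, PRiSM achieves universal approximation on simple-spectrum graphs, giving theoretical support for canonicalized Laplacian positional encodings.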

Abstract

Graphs with a simple spectrum admit cubic-time isomorphism testing, yet we prove that for every natural number $k$, the $k$-Weisfeiler-Leman ($k$-WL) test cannot distinguish all non-isomorphic graphs with a simple spectrum. As the WL hierarchy upper-bounds the distinguishing power of widely-used Graph Neural Networks (GNNs), this incompleteness applies to all such GNNs, ruling out completeness for every $k$-WL-aligned GNN family. To close this gap, we introduce PRiSM (Partition, Refine, Solve, Match), the first provably complete canonicalization of simple-spectrum eigendecompositions. PRiSM obtains the completeness guarantee that prior canonicalizations provably lack, and resolves the open problem of achieving complete expressivity on simple-spectrum graphs. When composed with DeepSets or a Transformer, PRiSM achieves universal approximation on simple-spectrum graphs, justifying the use of canonicalized Laplacian positional encodings. Empirically, PRiSM performs comparably to or outperforms existing spectral canonicalizations on graph regression, classification, and expressivity
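
For readers unfamiliar with the Weisfeiler-Leman test, here is a minimal sketch of its simplest level, 1-WL colour refinement, together with a textbook pair of non-isomorphic graphs it cannot separate. The example graphs are regular and therefore have repeated eigenvalues, so they are not simple-spectrum graphs; the paper's result concerns that harder class and every $k$. The function name and graph encoding are assumptions.

```python
from collections import Counter
from typing import Dict, Hashable, List

Graph = Dict[Hashable, List[Hashable]]


def wl_indistinguishable(graph_a: Graph, graph_b: Graph) -> bool:
    """Run 1-WL colour refinement on the disjoint union and compare colour histograms."""
    union = {("a", v): [("a", u) for u in nbrs] for v, nbrs in graph_a.items()}
    union.update({("b", v): [("b", u) for u in nbrs] for v, nbrs in graph_b.items()})

    colors = {v: 0 for v in union}
    for _ in range(len(union)):
        # A node's signature is its own colour plus the multiset of neighbour colours.
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in union[v]))) for v in union}
        palette = {sig: i for i, sig in enumerate(sorted(set(sigs.values())))}
        refined = {v: palette[sigs[v]] for v in union}
        # Refinement only ever splits classes, so an equal class count means stability.
        if len(set(refined.values())) == len(set(colors.values())):
            break
        colors = refined

    hist_a = Counter(c for (side, _), c in colors.items() if side == "a")
    hist_b = Counter(c for (side, _), c in colors.items() if side == "b")
    return hist_a == hist_b


six_cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
path3 = {0: [1], 1: [0, 2], 2: [1]}
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}

print(wl_indistinguishable(six_cycle, two_triangles))  # True: both graphs are 2-regular
print(wl_indistinguishable(path3, triangle))           # False: degree sequences differ
```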

2605.23445 2026-05-25 cs.CV

DFSAttn: Dynamic Fine-grained Sparse Attention for Efficient Video Generation

Jie Hu, Zixiang Gao, Yutong He, Kun Yuan

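AI summary: This paper proposes DFSAttn, a dynamic fine-grained sparse attention mechanism that makes diffusion transformers for video generation more efficient. Existing block sparse attention loses quality at high sparsity ratios; the authors derive a theoretical lower bound on attention recall and use it to design a training-free sparse attention framework with three core components: Hilbert curve-based token reordering, hierarchical block scoring, and sparse mask caching with adaptive ratios. Experiments show up to 2.1x end-to-end speedup while maintaining high generation quality.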

Comments
ICML 2026; 17 pages, 8 figures
Abstract

Diffusion transformers have achieved remarkable success in high-quality video generation, yet their reliance on spatiotemporal 3D full attention incurs prohibitive computational cost due to the quadratic complexity of attention. Block sparse attention is a common approach to mitigate this by focusing computation on important regions. However, attention maps in DiTs exhibit inherently dynamic and fine-grained sparsity, which causes existing block sparse attention methods to degrade significantly in quality, especially at high sparsity ratios. In this paper, we revisit block sparse attention and derive a theoretical lower bound on attention recall to characterize the key factors governing its effectiveness. Guided by these insights, we propose DFSAttn, a training-free sparse attention framework that enables dynamic, fine-grained sparsification efficiently. DFSAttn incorporates three core designs: Hilbert curve-based token reordering to achieve fine-grained sparsity while preserving efficient GPU execution, hierarchical block scoring for accurate block importance estimation, and sparse mask caching with adaptive ratios to balance accuracy and efficiency. Experimental results demonstrate that DFSAttn consistently outperforms prior methods under high sparsity, achieving up to 2.1$\times$ end-to-end speedup while maintaining high generation quality. Our code is open-sourced and available at https://github.com/jessica-hujie/DFSAttn.
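
Hilbert curve reordering can be illustrated in two dimensions with the standard index-to-coordinate conversion. This is a sketch of the general idea only: the paper works with spatiotemporal video tokens, and how it applies the curve there is not described in the abstract. The helper names are hypothetical.

```python
from typing import List, Tuple


def hilbert_d2xy(order: int, d: int) -> Tuple[int, int]:
    """Map curve index d in [0, 4**order) to (x, y) on a 2**order x 2**order grid."""
    x = y = 0
    t = d
    s = 1
    side = 1 << order
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:  # rotate/flip the quadrant so consecutive segments connect
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_token_order(order: int) -> List[int]:
    """Row-major token indices listed in Hilbert-curve order."""
    side = 1 << order
    return [y * side + x for x, y in (hilbert_d2xy(order, d) for d in range(side * side))]


points = [hilbert_d2xy(3, d) for d in range(64)]
steps = [abs(x1 - x0) + abs(y1 - y0) for (x0, y0), (x1, y1) in zip(points, points[1:])]
print(set(steps))              # {1}: consecutive tokens are always grid neighbours
print(hilbert_token_order(2)[:4])
```

Because consecutive indices stay spatially adjacent, a contiguous block of reordered tokens covers a compact patch rather than a thin row-major strip, which is the property that makes block-level sparsity finer-grained in space.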

2605.23434 2026-05-25 cs.LG

Onsager-Machlup Posterior Transport for Deep Gaussian Processes

Jian Xu, Delu Zeng, John Paisley, Qibin Zhao

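AI summary: Approximate inference over inducing variables is the computational bottleneck of deep Gaussian processes (DGPs). This paper proposes a posterior transport approach: a deterministic sampler maps a tractable reference measure to posterior-relevant inducing variables, regularized by a path prior derived from a Doob-bridged diffusion. The method builds on Song's probability-flow ODE and the Onsager-Machlup action. On seven UCI regression benchmarks it wins with statistical significance on the two largest datasets, ties on two, and concedes the three small, noisy datasets to DBVI.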

Abstract

Approximate inference over inducing variables is the central computational bottleneck of Deep Gaussian Processes (DGPs). Existing methods either fit an explicit density $q_ϕ(\bU)$ by an ELBO (DSVI, IPVI, DDVI, DBVI) or sample by MCMC (SGHMC). We instead frame DGP inference as \emph{posterior transport}: learn a deterministic sampler that maps a tractable reference measure to posterior-relevant inducing variables, regularised by a path prior derived from the Doob-bridged reference diffusion. Our realisation, \textbf{OM-Path} (formally FBVI-bridge-Path), uses Song's probability-flow ODE applied to DBVI's Doob-bridged forward SDE; the reference drift is closed-form from the bridge marginal coefficients (no score matching) and the path regulariser is the \textbf{Onsager--Machlup action}. At the finite-$ε$ value used at training, the objective is the negative log unnormalised density of a tempered Doob-bridge path posterior, and Theorem 1 identifies it with the same posterior's small-noise MAP path via the Freidlin--Wentzell LDP. Two strict path-space ELBO variants on the same bridge backbone (FFJORD log-det; OM-regularised CNF) are derived as ablations. Under a matched-seed paired Wilcoxon test against DBVI on seven UCI regression benchmarks, OM-Path delivers statistically significant wins on the two largest datasets (\textit{power}: $p\!=\!0.014$, NLL $\mathbf{0.012}$ matching the DSVI baseline of $0.017$; \textit{protein}: $p\!=\!0.002$, RMSE $\mathbf{0.716}$ vs.\ $0.764$, NLL $\mathbf{1.086}$ vs.\ $1.149$), statistical ties on \textit{yacht} / \textit{qsar}, and concedes \textit{boston} / \textit{energy} / \textit{concrete} to DBVI on small-$N$ noisy data. The strict-ELBO variants do not clear DBVI on any UCI metric: in this regime, reducing the variance of the path objective dominates exact-density tracking.

2605.23428 2026-05-25 cs.CV cs.MM

FAST-ME: Foundation-aware Adaptive Stopping for Motion Estimation for Efficient IoT Video Analysis

Kakia Panagidi, Stathes Hadjieftymiadis

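AI summary: In resource-constrained IoT video analysis, block motion estimation (ME) remains a computational bottleneck for video compression and understanding. This paper proposes an Optimal Stopping Theory (OST) algorithm based on spatiotemporal differences, and combines it with foundation models (FMs) in a semantic-aware motion estimation framework. Semantic attention scores extracted by pretrained vision models are fused with traditional distortion metrics so that the stopping rule weighs both motion magnitude and semantic importance. Experiments on benchmark and multimodal video datasets show a significant reduction in computation with minimal accuracy loss and improved semantic coverage.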

Abstract

In modern multimedia systems, efficient video processing is critical, especially in resource-constrained environments such as IoT-based camera networks, autonomous platforms, and wireless sensor multimedia systems. A key bottleneck in video compression and understanding is block motion estimation (ME), a process that remains computationally expensive despite the development of fast search techniques. This work introduces an Optimal Stopping Theory (OST) algorithm for block motion estimation based on the assessment of spatiotemporal differences within and across video frames. It also proposes a semantic-aware motion estimation framework that integrates Foundation Models (FMs) with the OST-based decision process. By leveraging pretrained visual models such as Vision Transformers (ViT) and the Segment Anything Model (SAM), the framework extracts semantic attention scores that indicate the importance of motion within specific spatial regions. These scores are fused with traditional distortion-based metrics, such as the Sum of Absolute Differences (SAD), to guide a hybrid stopping criterion that jointly considers motion magnitude and semantic relevance. The resulting adaptive algorithm stops early in redundant regions while continuing the search in areas where motion is semantically significant. Experiments compare the proposed solution with widely used approaches from the literature on benchmark and multimodal video datasets. The proposed method achieves a significant reduction in computation with minimal accuracy loss and improved semantic coverage. The results highlight the benefits of bridging low-level motion analysis with high-level semantic reasoning, offering a promising direction for efficient multimodal video understanding in next-generation smart systems.
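
The interplay between SAD-based block matching and early stopping can be sketched as follows. The fixed `stop_threshold` is a deliberately simple stand-in for the paper's OST criterion, which the abstract does not specify; in a semantic-aware variant, a caller could lower the threshold for regions with high attention scores so that the search runs longer there. All names are hypothetical.

```python
from typing import List, Tuple

Frame = List[List[int]]


def sad(ref: Frame, cur: Frame, ref_top: int, ref_left: int,
        cur_top: int, cur_left: int, size: int) -> int:
    """Sum of Absolute Differences between two size x size blocks."""
    return sum(abs(ref[ref_top + i][ref_left + j] - cur[cur_top + i][cur_left + j])
               for i in range(size) for j in range(size))


def search_motion(ref: Frame, cur: Frame, top: int, left: int, size: int,
                  radius: int, stop_threshold: float) -> Tuple[Tuple[int, int], float, int]:
    """Nearest-first search; stops once the best SAD is at or below stop_threshold.

    Returns (displacement into the reference frame, best SAD, candidates evaluated).
    """
    height, width = len(ref), len(ref[0])
    candidates = sorted(
        ((dy, dx) for dy in range(-radius, radius + 1) for dx in range(-radius, radius + 1)),
        key=lambda v: (abs(v[0]) + abs(v[1]), v),
    )
    best_vec, best_cost, checked = (0, 0), float("inf"), 0
    for dy, dx in candidates:
        r_top, r_left = top + dy, left + dx
        if r_top < 0 or r_left < 0 or r_top + size > height or r_left + size > width:
            continue  # candidate block falls outside the reference frame
        cost = sad(ref, cur, r_top, r_left, top, left, size)
        checked += 1
        if cost < best_cost:
            best_vec, best_cost = (dy, dx), cost
        if best_cost <= stop_threshold:
            break  # early stop: match is already good enough
    return best_vec, best_cost, checked


# Synthetic frames: `cur` shows the content of `ref` displaced by (dy, dx) = (1, 2).
ref = [[10 * y + x for x in range(8)] for y in range(8)]
cur = [[ref[y + 1][x + 2] if y + 1 < 8 and x + 2 < 8 else 0 for x in range(8)]
       for y in range(8)]

exact = search_motion(ref, cur, top=2, left=2, size=3, radius=2, stop_threshold=0)
loose = search_motion(ref, cur, top=2, left=2, size=3, radius=2, stop_threshold=1000)
print(exact)  # finds displacement (1, 2) with SAD 0 before exhausting all 25 candidates
print(loose)  # ((0, 0), 108, 1): a permissive threshold stops after one candidate
```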

2605.23426 2026-05-25 cs.HC cs.AI

Socially fluent AI decouples conversational signals from source identity in online interaction

Lixiang Yan, Yueqiao Jin, Xibin Han, Dragan Gašević

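AI summary: This study asks whether socially fluent AI agents can converse in online interaction enough like ordinary people that identity can no longer be judged from conversational signals alone. In group collaboration tasks, participants could not distinguish AI from human teammates above chance, even though conversational behaviour contained cues that separate the two. People relied on subjective impressions and suspicion heuristics rather than on the behavioural features that actually encode identity, which leaves online discourse more open to influence and manipulation by AI agents.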

Abstract

Socially fluent agentic AI can now participate in online interaction in ways that resemble ordinary human conversation, potentially weakening people's ability to infer who is human from conversational signals alone. We tested this possibility in synchronous text-based group interaction by embedding undisclosed AI agents as ordinary teammates across analytical, creative, and ethical tasks. Across 786 participants who made 1,572 post-interaction identity judgments, people did not distinguish AI from human teammates above chance. This failure did not arise because the interaction lacked identity-relevant information. Conversational behaviour contained robust cues that differentiated AI from humans and supported highly accurate computational classification. Instead, participants relied on familiar suspicion heuristics, including response speed, fluency, and perceived scriptedness, that were only weakly related to actual identity. Representational analyses further showed that judgments were organised around subjective impressions rather than the behavioural structure encoding ground truth. This dissociation creates new vulnerabilities to coordinated AI agents that can influence and manipulate online discourse at scale.

2605.23424 2026-05-25 cs.IT cs.LG math.IT

Sparse In-Network Learning via Shortest-Path Backpropagation and Finite-Rate Gating

Mohammad Reza Deylam Salehi

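AI summary: This paper studies sparse communication in in-network learning (INL) and proposes D-INL, which builds on a shortest-path tree and finite-rate gating. It retains a capacity-aware shortest-path tree rooted at the fusion node and removes non-tree links, and it models local routing as a finite-rate stochastic gate to balance sparsity against predictive information. In the experiments, D-INL cuts training exchange by 70.4% while preserving classification accuracy, and finite-rate regularization further reduces the estimated latent rate by 45.7%.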

Abstract

In-network learning (INL) trains distributed neural modules by exchanging latent activations and backpropagated errors over a communication graph. This letter proposes Dijkstra-pruned INL (D-INL), which removes non-tree links by retaining a capacity-aware shortest-path tree rooted at the fusion node. To balance sparsity and predictive information, local routing (or aggregation) is modeled as a finite-rate stochastic gate with rate $R_g=I(Z; T)$. We derive a rate-distortion-generalization bound and validate the method on a reproducible distributed-classification experiment, where D-INL reduces training exchange by $70.4\%$ while preserving accuracy within the standard deviation of dense INL. Adding finite-rate regularization further reduces the estimated latent rate by $45.7\%$ relative to unregularized Dijkstra INL.
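
The pruning step can be sketched with a textbook Dijkstra run from the fusion node. The abstract says the tree is capacity-aware but does not give the link cost, so the `1 / capacity` cost below is an assumption, as are the function names and the toy topology.

```python
import heapq
from typing import Dict, Hashable, Tuple

Links = Dict[Tuple[Hashable, Hashable], float]  # undirected link -> capacity


def shortest_path_tree(links: Links, root: Hashable) -> Dict[Hashable, Hashable]:
    """Dijkstra from `root`; returns each reachable node's parent in the tree."""
    adjacency: Dict[Hashable, list] = {}
    for (u, v), capacity in links.items():
        cost = 1.0 / capacity  # assumed cost: high-capacity links are cheap
        adjacency.setdefault(u, []).append((v, cost))
        adjacency.setdefault(v, []).append((u, cost))

    dist = {root: 0.0}
    parent: Dict[Hashable, Hashable] = {}
    heap = [(0.0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, cost in adjacency.get(u, []):
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v], parent[v] = nd, u
                heapq.heappush(heap, (nd, v))
    return parent


def prune_to_tree(links: Links, root: Hashable) -> Links:
    """Keep only the links that lie on the shortest-path tree rooted at `root`."""
    parent = shortest_path_tree(links, root)
    tree_edges = {frozenset((child, par)) for child, par in parent.items()}
    return {edge: cap for edge, cap in links.items() if frozenset(edge) in tree_edges}


# Toy topology: "F" plays the role of the fusion node.
links = {("F", "A"): 10.0, ("F", "B"): 10.0, ("A", "B"): 1.0,
         ("A", "C"): 5.0, ("B", "C"): 1.0}
kept = prune_to_tree(links, "F")
print(sorted(kept))  # [('A', 'C'), ('F', 'A'), ('F', 'B')]
```

In this toy graph two of the five links are removed; the 70.4% reduction reported in the abstract is an experimental result on the authors' setup and is not reproduced here.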

2605.23422 2026-05-25 cs.LG

Hinge Regression Trees and HRT-Boost: Newton-Optimized Oblique Learning for Compact Tabular Models

Hongyi Li, Jun Xu, Hong Yan

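AI summary: This paper proposes the Hinge Regression Tree (HRT) framework, which recasts each oblique split as a nonlinear least-squares problem over two linear predictors to improve the learning of oblique decision trees. The node-level optimization is solved as a damped Newton method, and the authors prove that HRT is a universal approximator with an explicit approximation rate. Building on HRT, they propose the HRT-Boost ensemble, which couples node-level Newton updates with stage-wise functional gradient descent and guarantees stage-wise empirical risk reduction under the squared loss. Experiments on several benchmarks show strong performance and more compact models.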

Comments
arXiv admin note: substantial text overlap with arXiv:2602.05371
Abstract

Learning high-quality oblique decision trees remains a significant challenge due to the discrete and non-convex nature of split optimization. We present the Hinge Regression Tree (HRT) framework, which reframes each oblique split as a nonlinear least-squares problem over two linear predictors whose max/min envelope induces ReLU-like representation capacity. We show that the resulting node-level optimization can be interpreted as a damped Newton method, and we establish the monotonic decrease of the node objective for its backtracking line-search variant. We establish, theoretically, that HRT is a universal approximator with an explicit $O(δ^2)$ approximation rate. Building upon this base learner, we propose HRT-Boost, a mathematically synergistic ensemble extension that couples node-level Newton updates with stage-wise functional gradient descent. We show that this ensemble construction admits a stage-wise empirical risk reduction guarantee under the squared loss. Empirical evaluations on synthetic and real-world benchmarks show that HRT is highly competitive with established single-tree baselines, and HRT-Boost compares favorably with strong ensemble baselines and often yields substantially more compact models. The code is publicly available at https://github.com/Hongyi-Li-sz/HRT-Boost.
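
The "ReLU-like" remark follows from the identity $\max(a, b) = a + \mathrm{ReLU}(b - a)$: the upper envelope of two linear predictors is one linear predictor plus a rectified linear term. The sketch below checks this numerically; the weights are arbitrary illustrative values and the function names are hypothetical, not the paper's API.

```python
from typing import Sequence


def linear(weights: Sequence[float], bias: float, x: Sequence[float]) -> float:
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def hinge_max(x, w1, b1, w2, b2) -> float:
    """Upper envelope of two linear predictors; the kink lies on an oblique hyperplane."""
    return max(linear(w1, b1, x), linear(w2, b2, x))


def relu(z: float) -> float:
    return max(z, 0.0)


def hinge_via_relu(x, w1, b1, w2, b2) -> float:
    """Same function written as a linear term plus a ReLU of a linear term."""
    a, b = linear(w1, b1, x), linear(w2, b2, x)
    return a + relu(b - a)


# Arbitrary illustrative parameters.
w1, b1 = [1.0, -2.0], 0.5
w2, b2 = [-0.5, 1.0], 1.5
samples = [[0.0, 0.0], [1.0, 1.0], [-2.0, 3.0], [4.0, -1.0]]

for x in samples:
    print(x, hinge_max(x, w1, b1, w2, b2), hinge_via_relu(x, w1, b1, w2, b2))
```

The two predictors agree where $(w_1 - w_2)\cdot x + (b_1 - b_2) = 0$, so the split boundary is a general hyperplane rather than an axis-aligned threshold.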

2605.23420 2026-05-25 cs.CL

Naturalistic measure of social norms alignment

Yevhen Kostiuk, Kenneth Enevoldsen, Peter Bjerregaard Vahlstrup, Márton Kardos, Kristoffer Nielbo

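AI summary: This study addresses naturalistic measurement of social norms alignment. Prior approaches mostly rely on artificial closed-form evaluations; this paper proposes a framework based on free-form solution matching that measures agreement between the responses of any two parties (humans or large language models) to a social dilemma. It introduces two evaluation metrics and builds a Danish dataset of 3,000 non-trivial social dilemmas, each with reference solutions derived from three culturally grounded panelists. The metrics produce consistent model rankings, and agreement is higher on topics such as neighbor conflicts and shared living situations.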

Abstract

Social norms reflect shared expectations on acceptable behavior. Measuring social norms alignment remains challenging, with existing approaches typically relying on artificial closed-form evaluations such as multiple-choice questionnaires or measuring agreement with predefined statements. In the context of this work, social norms alignment refers to measuring an agreement between solutions with respect to the social problem or dilemma. We propose a framework for measuring social norm alignment in naturalistic, free-form settings through solution matching. The framework enables us to measure alignment between any two dilemma responses e.g., LLMs to a human, LLMs to LLMs, or human to human. We introduce two metrics: stated and explicit agreement accuracy, and construct a dataset of 3k non-trivial social dilemmas in Danish. All dilemmas are assigned reference solutions derived from three panelists, who serve as culturally grounded judges. We evaluate the agreement of several LLMs and human responses in an interaction setup that resembles natural user-model conversations. Our results show that the proposed metrics produce consistent model rankings and reveal variation in agreement across different types of dilemmas, with higher agreement observed for topics such as neighbor conflicts and shared living situations. Overall, our work introduces a dataset and evaluation framework for studying culturally grounded social reasoning in naturalistic open-ended conversations.

2605.23417 2026-05-25 cs.LG

An Open-Source Training Dataset for Foundation Models for Black-box Optimization

Aaron Klein, Herilalaina Rakotoarison, Luca Thale-Bombien, David Salinas

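AI summary: This paper introduces BBO-Pile, an open-source training dataset with over 500K optimization trajectories covering 3,095 different black-box optimization problems, the largest public pre-training dataset for black-box optimization to date. The authors train foundation models at several scales on it and show that large-scale pre-training is an effective way to imitate black-box optimization methods, laying groundwork for future research.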

Abstract

Most black-box optimization methods require extensive hyperparameter tuning, often limiting their ability to generalize across different optimization domains. Foundation models for black-box optimization that learn optimization principles from a large collection of optimization trajectories offer a promising alternative, with the potential to outperform manually designed methods across diverse problem classes. However, prior work has either relied on non-public datasets or on purely synthetic data, limiting reproducibility and generalization to real-world problems. As a result, progress in this area has been constrained by the lack of large-scale, real-world, publicly available pre-training data. We introduce BBO-Pile, the first open-source dataset comprising over 500K optimization trajectories evaluated across 3095 different black-boxes for different optimizers, which represents by far the largest public dataset for this task. Using this dataset, we train a family of foundation models at multiple scales, ranging from 2M to 80M parameters and from 200M to 2B training tokens, and study their scaling behavior with respect to compute. Our results demonstrate that large-scale pre-training is a viable and effective approach to imitate black-box optimization methods, paving the way for future research in this direction.

2605.23416 2026-05-25 cs.CL cs.SD

Articulatory strategy as a source of variation in acoustic vowel dynamics

Patrycja Strycharczuk, Justin J. H. Lo, Sam Kirkham

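AI summary: This study examines how articulatory strategy shapes acoustic vowel dynamics, linking individual articulation habits to formant transitions. Using ultrasound tongue imaging data from 36 speakers of Northern-Anglo English, it finds that tongue shape in the vowel /i/ is a significant predictor of formant dynamics in diphthongs with a palatal offglide. Greater displacement of the tongue root and dorsum leads to earlier and steeper formant transitions, offering a new perspective on individual differences in speech.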

Journal ref
Journal of the Acoustical Society of America (2026) 159(5): 4068-4078
Abstract

Acoustic vowel dynamics have some speaker-identifying characteristics, which have been ascribed to individual properties of articulatory strategies: formant transitions have a particular shape because speakers move their articulators, using specific and practised movements. However, there is little existing evidence that different articulatory strategies systematically affect formant dynamics. The present study corroborates the link between the two. Ultrasound tongue imaging data from 36 speakers of Northern-Anglo English are used to identify distinct articulatory strategies for the production of palatal vowel /i/. Tongue shape in /i/ is found to be a significant predictor of formant dynamics in diphthongs with a palatal offglide. The observed relationships can be explained by the characteristics of articulatory movement conditioned by vocal tract shape. Greater articulatory displacement of tongue root and/or dorsum produces greater distortion from the mean tongue shape in palatal vowels, and it also requires higher articulatory velocities, resulting in relatively earlier and steeper formant transitions. The results contribute to the conceptual understanding of individuality in speech, by illuminating the regularising and individual aspects of articulatory compensation.

2605.23414 2026-05-25 cs.AI cs.LG

When Planning Fails Despite Correct Execution: On Epistemic Calibration for LLM-Based Multi-Agent Systems

Zehao Wang, Shilong Jin, Zhao Cao, Lanjun Wang

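AI summary: This paper studies why LLM-based multi-agent systems can fail even when planned actions are executed correctly: agents misjudge their own knowledge when assessing plan feasibility, which the authors call epistemic miscalibration. They propose EPC-AW, which evaluates whether plans remain stable under varying information conditions rather than verifying feasibility directly. Experiments show an average improvement of 9.75% in system-level success.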

Abstract

LLM-based multi-agent systems can fail even when planned actions are executed correctly because agents may misjudge their knowledge when evaluating plan feasibility, a phenomenon we term epistemic miscalibration in planning. Unlike execution errors, epistemic miscalibration is latent during planning, as generated plans can remain self-consistent and executable without observable errors; the miscalibration is also dynamic, as new information can alter feasibility assessments, potentially obscuring past miscalibration signals and causing them to recur over time. To address this, we propose the Epistemic Planning Calibration Agentic Workflow (EPC-AW), which assesses whether plans remain supported under varying information conditions rather than directly verifying feasibility. EPC-AW employs Information-consistency-based Plan Selection, selecting plans whose evaluations are stable across agents, together with Consistency-guided Epistemic State Refinement to adapt calibration over time by leveraging past discrepancies to guide future planning. Experiments show that EPC-AW improves system-level success by an average of 9.75%.

2605.23412 2026-05-25 cs.CL

EquiSumm: A Gender Bias-Aware Framework for Inclusive Tweet Summarization

EquiSumm:面向包容性推文摘要的性别偏见感知框架

Chaitanya Wanjari, Jessica Kamal, Riddhi Jain, Samruddhi Kurhe, Roshni Chakraborty

AI总结 在新闻事件中,社交媒体平台如Twitter提供了大规模意见分享的渠道,但人工处理海量内容以提炼关键观点是不现实的。为此,已有自动摘要技术被提出,但大多未考虑人口统计学公平性,尤其是性别偏差问题。本文提出EquiSumm,一种关注性别偏见的包容性推文摘要框架,通过考虑意见中的性别因素生成更加公平的摘要,实验结果表明其在两个主要数据集上的性能优于现有方法。

详情
Comments
Accepted at AI for Social Good Workshop, Pattern Recognition and Machine Intelligence (PReMI 2025), IIT Delhi. 6 pages, 2 figures
AI中文摘要

虽然Twitter等社交媒体平台为新闻事件期间的大规模意见分享提供了媒介,但个人或媒体机构手动处理大量内容以识别关键观点是不可能的。为了解决这个问题,已经提出了几种自动摘要技术,将大量推文压缩成简洁且信息丰富的摘要。然而,这些算法没有明确考虑人口统计公平性。现有的若干研究工作已经开发了自动摘要方法,可以提供社交媒体平台上与新闻事件相关的关键方面和主要意见的整体概述。然而,这些方法没有明确考虑不同形式的人口统计代表性,例如性别,这可能导致有偏见的摘要表示。在本文中,我们提出了EquiSumm,它考虑了共享意见的性别方面来生成摘要,我们在两个主要数据集上的实验分析表明了相对于现有研究工作的性能有效性。

英文摘要

While social media platforms such as Twitter provide a medium for large-scale opinion sharing during news events, it is impossible for individuals or media agencies to manually process the vast volume of content and identify key viewpoints. To address this, several automatic summarization techniques have been proposed that condense large collections of tweets into concise, informative summaries giving a holistic overview of the key aspects and major opinions shared about a news event. However, these approaches do not explicitly consider demographic fairness: they ignore different forms of demographic representation, such as gender, which can lead to biased summaries. In this paper, we propose EquiSumm, which considers the gender aspect of the shared opinions when generating a summary. Our experimental analysis on two major datasets indicates its effectiveness relative to existing approaches.

2605.23411 2026-05-25 cs.LG cs.CR cs.CV

Sample-wise Targeted Adversarial Attacks on Test-time Adaptation

面向测试时自适应的样本级定向对抗攻击

Phuc Duc Nguyen, Quang Duc Nguyen

AI总结 本文研究了针对测试时适应(TTA)的样本级定向对抗攻击问题,旨在在不引起分布异常的情况下,使特定样本被错误分类。为解决现有方法在批量操作中导致目标标签频率异常的问题,作者提出了一种基于元学习的攻击方法,结合优先级感知的梯度对齐策略,以确保攻击成功率同时保持整体标签分布不变。实验表明,该方法在多个数据集上取得了高成功率,且难以被检测,对现有防御机制也表现出较强的鲁棒性。

详情
Comments
32 pages, 17 figures
AI中文摘要

测试时自适应(TTA)有效应对分布偏移,但通过未标记的测试流使模型暴露于对抗性操纵之下。现有的类别级定向攻击在此场景下难以实现隐蔽利用:由于TTA在批次上操作,强制部分样本朝向目标标签会无意中拉拢相似的良性样本,导致目标标签出现频率异常高,易于检测。为了捕捉更现实的威胁,我们引入了一种样本级定向攻击。与先前方法不同,攻击者旨在仅使携带攻击者选择的触发器的输入被错误分类,同时保持良性查询的全局标签分布以逃避检测。为实现这一目标,我们提出了一种基于元学习的攻击,采用新颖的优先感知梯度对齐策略,明确优先考虑攻击成功率。该策略将梯度更新形式化为椭球信任区域问题,缓解了攻击成功与分布隐蔽性之间的失调,同时为在梯度失调情况下有效优化攻击目标提供了理论保证。在CIFAR-10-C、CIFAR-100-C和ImageNet-C上跨TTA协议的大量实验表明,我们的方法在保持与无攻击基线一致的标签分布的同时,实现了高定向成功率,使其在未标记的TTA部署场景中难以检测。此外,我们证明了我们的攻击对现有防御表现出强鲁棒性。

英文摘要

Test-time adaptation (TTA) effectively counters distribution shifts but exposes models to adversarial manipulation via the unlabeled test stream. Existing class-wise targeted attacks remain impractical for stealthy exploitation in this setting: since TTA operates on batches, forcing a subset of samples toward a target label unintentionally pulls similar benign samples along, resulting in a conspicuously high frequency of the target label that is easy to detect. To capture a more realistic threat, we introduce a sample-wise targeted attack. Unlike prior approaches, the attacker aims to misclassify only inputs carrying an attacker-chosen trigger, while preserving the global label distribution of benign queries to evade detection. To achieve this, we propose a meta-learning-based attack with a novel priority-aware gradient alignment strategy that explicitly prioritizes attack success. The strategy formulates the gradient update as an ellipsoidal trust-region problem, mitigating the misalignment between attack success and distributional stealth, while providing theoretical guarantees for effective optimization of the attack objective in the presence of gradient misalignment. Extensive experiments on CIFAR-10-C, CIFAR-100-C, and ImageNet-C across TTA protocols demonstrate that our method achieves high targeted success rates while maintaining a label distribution that is consistent with the no-attack baseline, making it difficult to detect in unlabeled TTA deployment scenarios. Furthermore, we demonstrate that our attack shows strong robustness against existing defenses.

2605.23410 2026-05-25 cs.LG cs.CV

What Linear Probes Miss: Multi-View Probing for Weight-Space Learning

线性探测的盲区:面向权重空间学习的多视角探测

Eunwoo Heo, Kyeongkook Seo, Jaejun Yoo

AI总结 随着开源模型库的快速增长,如何高效识别和分析模型参数成为重要问题。现有基于探针的方法虽轻量,但受限于单一视角设计,难以捕捉参数间的高阶交互信息。本文提出多视角探针框架 MVProbe,结合一阶结构与基于格拉姆矩阵的交互感知视角,理论分析表明其能更全面地表征模型参数,实验显示其在多种架构上均优于现有方法。

详情
Comments
Accepted at ICML 2026. Code: https://github.com/AI-hew-math/MVProbe ; Project page: https://ai-hew-math.github.io/MVProbe/
AI中文摘要

开源模型库的爆炸式增长催生了“模型丛林”,其中检查点经常在缺乏充分文档或元数据的情况下共享。虽然权重空间学习提供了一种直接从参数识别和分析这些模型的途径,但处理全尺度权重在计算上成本高昂。基于探测的方法作为一种轻量级替代方案出现,通过可学习的探测向量提取置换等变表示。然而,现有探测方法受限于单视角设计:它们捕获一阶结构,但未能编码行-列交互中固有的丰富高阶相关模式。为弥补这一差距,我们引入MVProbe,一个多视角探测框架,它综合了一阶信号与交互感知(基于Gram)的视角。我们的方法有理论依据;我们分析了不同探测阶数的缩放定律,以推导出原则性的标准化和融合策略,确保所有分支的贡献平衡。在Model Jungle基准上,MVProbe在多种架构上持续优于最先进的ProbeX,包括判别式骨干网络(ResNet、SupViT、MAE、DINO)和大规模生成式LoRA适配器(Stable Diffusion LoRA)。

英文摘要

The explosive growth of open-source model repositories has created a Model Jungle, where checkpoints are frequently shared without adequate documentation or metadata. While weight-space learning offers a pathway to identify and analyze these models directly from their parameters, processing full-scale weights is computationally prohibitive. Probing-based methods have emerged as a lightweight alternative, extracting permutation-equivariant representations via learnable probe vectors. However, existing probing methods are limited by a single-view design: they capture first-order structures but fail to encode the rich, higher-order correlation patterns inherent in row-column interactions. To bridge this gap, we introduce MVProbe, a multi-perspective probing framework that synthesizes first-order signals with interaction-aware (Gram-based) views. Our approach is theoretically grounded; we analyze the scaling laws of different probing orders to derive a principled standardization and fusion strategy that ensures balanced contributions from all branches. On the Model Jungle benchmark, MVProbe consistently outperforms the state-of-the-art ProbeX across diverse architectures, including discriminative backbones (ResNet, SupViT, MAE, DINO) and large-scale generative LoRA adapters (Stable Diffusion LoRA).

2605.23409 2026-05-25 cs.CV cs.AI

Online Hand Gesture Recognition Using 3D Convolutional Neural Networks

使用3D卷积神经网络的在线手势识别

Yinghao Qin, Tijana Timotijevic

AI总结 本文提出了一种基于3D卷积神经网络的在线手部手势识别系统,旨在实现实时视频流中手势的定位与分类。为提高系统鲁棒性,采用滑动窗口方法对多窗口结果进行优化。该系统在Jester数据集上训练,检测和分类准确率分别达到98%以上和90%以上,在自制数据集上达到37.5%的Levenshtein准确率,且响应时间在三秒以内。

详情
Comments
Master's dissertation work written in Autumn 2020
AI中文摘要

在人机交互中,动态手势的实时检测与分类具有挑战性,因为:1) 系统必须在实时视频流中运行,且执行手势后响应无明显延迟;2) 不同人执行手势的方式差异较大,使得识别更加困难。本文提出一种在线手势识别系统,能够定位实时视频流中的手势并识别其类别。为提高系统鲁棒性,采用滑动窗口方法对多个窗口的结果进行优化。项目中的所有模型均在Jester数据库上训练,检测器准确率达到98%以上,分类器准确率达到90%以上。在系统整体性能方面,最佳组可在三秒内响应,并在自制数据集上达到37.5%的Levenshtein准确率。本工作使用的项目代码已公开。

英文摘要

In human-computer interaction, real-time detection and classification of dynamic hand gestures is challenging for two reasons: 1) the system must run on a real-time video stream and respond without noticeable lag after a gesture is performed; 2) people perform gestures in very different ways, which makes recognition more difficult. In this paper, an online hand gesture recognition system is proposed that localizes gestures in a real-time video stream and recognizes what they are. To improve the robustness of the system, a sliding-window approach is used to refine results from multiple windows. All of the models in the project are trained on the Jester dataset, achieving 98+% accuracy for the detector and 90+% accuracy for the classifier. For the overall performance of the system, the best group can respond within three seconds and reaches 37.5% Levenshtein accuracy on the homemade dataset. The project code used in this work is publicly available.
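The abstract reports a Levenshtein accuracy but does not define it. The sketch below assumes a common definition, one minus the edit distance between the predicted and ground-truth gesture sequences divided by the ground-truth length; the function names and the gesture labels in the usage note are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,              # delete x
                curr[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),   # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def levenshtein_accuracy(predicted, truth):
    """Assumed definition: 1 - distance / len(truth), floored at 0."""
    if not truth:
        return 1.0 if not predicted else 0.0
    return max(0.0, 1.0 - levenshtein(predicted, truth) / len(truth))
```

For example, if the system emits `["swipe_left", "stop", "thumb_up"]` for a clip whose true sequence is `["swipe_left", "thumb_up"]`, the single spurious detection gives an edit distance of 1 and an accuracy of 0.5 under this definition.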

2605.23406 2026-05-25 cs.CV

RS2AD-LiDAR: End-to-End Autonomous Driving LiDAR Data Generation from Roadside Sensor Observations

RS2AD-LiDAR:基于路侧传感器观测的端到端自动驾驶LiDAR数据生成

Runyi Huang, Ni Ding, Ruidan Xing, Yuheng Shi, Lei He, Keqiang Li

AI总结 本文提出了一种名为RS2AD-LiDAR的全新框架,用于从路边传感器观测数据中重建和生成车载激光雷达数据,以解决当前自动驾驶系统在数据采集和标注成本高、场景稀缺等问题。该方法通过坐标转换、虚拟激光雷达建模和点云重采样技术,生成高保真的车载激光雷达数据,并构建了专门用于评估的R2V-LiDAR数据集。实验表明,生成数据在语义相似性和目标检测性能上均表现出良好的效果,有效提升了自动驾驶模型的感知能力。

详情
AI中文摘要

端到端自动驾驶解决方案直接处理多模态传感器数据并输出细粒度控制命令,随着自动驾驶技术的发展已逐渐成为主流方向。然而,当前此类方法依赖单车数据采集进行模型训练和优化,面临采集和标注成本高、有价值场景稀缺以及数据孤岛等问题。为解决这些挑战,我们提出RS2AD-LiDAR,一种从路侧传感器观测重建和生成车载LiDAR数据的新框架。由于目前没有公开数据集提供路侧与车载LiDAR传感器之间高度重叠的感知覆盖(这对于研究路侧到车辆数据生成至关重要),我们构建了专用数据集R2V-LiDAR,仅用于本文的评估。具体而言,我们的方法将路侧LiDAR点云变换到车载LiDAR坐标系,并通过虚拟LiDAR建模和点云重采样技术合成高保真车载数据。据我们所知,这是首个从路侧传感器输入重建车载LiDAR数据的方法。大量实验比较表明,生成数据与真实数据具有语义相似性。此外,目标检测实验显示,将生成数据融入真实数据用于模型训练,可同时提升鸟瞰图(BEV)和3D检测精度,从而验证了所提方法的有效性。

英文摘要

End-to-end autonomous driving solutions, which directly process multimodal sensory data and output fine-grained control commands, have gradually become a mainstream direction with the development of autonomous driving technology. However, current methods in this category rely on single-vehicle data collection for model training and optimization, which suffers from high acquisition and annotation costs, scarcity of valuable scenarios, and data silos. To address these challenges, we propose RS2AD-LiDAR, a novel framework for reconstructing and generating vehicle-mounted LiDAR data from roadside sensor observations. Since no public dataset currently provides highly overlapping perception coverage between roadside and vehicle-mounted LiDAR sensors, which is essential for studying roadside-to-vehicle data generation, we constructed a dedicated dataset named R2V-LiDAR which is used solely for evaluation in this work. Specifically, our method transforms roadside LiDAR point clouds into the vehicle-mounted LiDAR coordinate system, and synthesizes high-fidelity vehicle-mounted data via virtual LiDAR modeling and point cloud resampling techniques. To the best of our knowledge, this is the first approach to reconstruct vehicle-mounted LiDAR data from roadside sensor inputs. Extensive experimental comparisons demonstrate the semantic similarity between the generated data and real data. Furthermore, object detection experiments show that incorporating the generated data into real data for model training improves both Bird's Eye View (BEV) and 3D detection accuracy, thereby validating the effectiveness of the proposed method.
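The coordinate-transformation step can be sketched with homogeneous transforms. This is only an illustration of moving points from a roadside LiDAR frame into a vehicle LiDAR frame, assuming both sensor poses are known in a shared world frame; the function names and the yaw-only pose are assumptions, and the virtual LiDAR modeling and resampling stages are not covered.

```python
import numpy as np


def pose_matrix(yaw, translation):
    """4x4 homogeneous pose: rotation about z by `yaw` (radians) plus a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T


def roadside_to_vehicle(points_rs, T_world_rs, T_world_veh):
    """Map (N, 3) points from the roadside LiDAR frame into the vehicle LiDAR frame."""
    # Compose: roadside frame -> world frame -> vehicle frame.
    T_veh_rs = np.linalg.inv(T_world_veh) @ T_world_rs
    homog = np.hstack([points_rs, np.ones((len(points_rs), 1))])
    return (homog @ T_veh_rs.T)[:, :3]
```

With the roadside sensor at the world origin and the vehicle at (10, 0, 0) facing along world +y, a roadside point at (10, 5, 0) lands 5 m straight ahead of the vehicle, at (5, 0, 0) in its frame.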

2605.23403 2026-05-25 cs.LG physics.ao-ph quant-ph

Hybrid Quantum-Classical Corrective Diffusion Modeling for Meteorological Downscaling

混合量子-经典校正扩散模型用于气象降尺度

Rui Wang, Edoardo Pasetto, Amer Delilbasic, Morris Riedel, Kristel Michielsen, Gabriele Cavallaro

AI总结 本文提出了一种混合量子-经典修正扩散模型,用于天气场的概率统计降尺度,旨在从低分辨率输入生成高分辨率天气数据。该模型在扩散UNet的最压缩瓶颈处插入变分量子电路层,而回归分支保持完全经典,以测试量子电路是否能作为非线性特征映射提升潜在通道混合效果。实验表明,该混合模型在风场降尺度任务中表现出稳定性,保留了大尺度空间结构,并在多个配置中提升了平均绝对误差和连续排名概率评分,同时展示了对动能谱和风速分布的保持能力,突显了量子混合方法在气象降尺度中的潜力与当前硬件限制。

详情
Comments
11 pages, 9 figures. Submitted to IEEE QCE 2026
AI中文摘要

统计降尺度是天气建模领域的关键组成部分,需要在不付出动力细化全部成本的情况下,从粗分辨率输入重建高分辨率输出。在这项工作中,我们研究了一种用于天气场概率统计降尺度的混合量子-经典校正扩散模型。所提出的模型将变分量子电路层插入到扩散UNet的最压缩瓶颈中,同时保留回归分支完全经典。这种放置测试了量子电路是否可以作为潜在通道混合的紧凑非线性特征映射。我们在10米风场分量上评估了通道内和跨通道的ansätze。在2020年验证集上,混合模型保持稳定,保留了生成风场的大尺度空间组织,并在几种配置中相对于经典校正扩散模型改善了MAE和CRPS。结构诊断进一步表明,混合变体保持了与其经典对应物相似的动能谱和风速分布,同时在尾部行为、极端风速定位和联合风场分量结构方面产生受控变化。2020年验证集上的后端研究表明,在测试的电路规模下,模拟设备噪声的影响可以忽略不计,而实际硬件部署仍受限于量子比特可用性和执行保真度。2021年分布外测试表明,这些分布内增益在时间偏移下不能均匀转移,揭示了泛化差距,这促使未来通过稳定化和正则化进行缓解。这些结果表明,瓶颈级别的量子混合可以为天气统计降尺度做出不可忽视的贡献,同时也强调了电路规模和硬件部署仍然是关键的限制因素。

英文摘要

Statistical downscaling is a crucial component of the weather modeling field, where high-resolution outputs must be reconstructed from coarse-resolution inputs without the full cost of dynamical refinement. In this work, we investigate a hybrid quantum-classical corrective diffusion model for probabilistic statistical downscaling of weather fields. The proposed model inserts variational quantum circuit layers into the most compressed bottleneck of the diffusion UNet while leaving the regression branch fully classical. This placement tests whether quantum circuits can act as compact nonlinear feature maps for latent-channel mixing. We evaluate intra-channel and cross-channel ansätze on 10m wind components. On the 2020 validation set, the hybrid models remain stable, preserve the large-scale spatial organization of the generated wind fields, and improve both MAE and CRPS relative to a classical corrective diffusion model in several configurations. Structural diagnostics further show that the hybrid variants preserve kinetic-energy spectra and windspeed distributions similar to those of their classical counterpart while producing controlled changes in tail behavior, extreme-windspeed localization, and the joint structure of the wind-field components. Backend studies on the 2020 validation set show negligible impact from simulated device noise at the tested circuit scale, whereas real-hardware deployment remains limited by qubit availability and execution fidelity. The 2021 out-of-distribution test shows that these in-distribution gains do not transfer uniformly under temporal shift, revealing a generalization gap that motivates future mitigation through stabilization and regularization. These results show that bottleneck-level quantum hybridization can make a nontrivial contribution to weather statistical downscaling, while also highlighting that circuit scale and hardware deployment remain key limiting factors.
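The two scores named above can be computed from an ensemble of generated fields. The sketch below uses the standard ensemble estimator CRPS = E|X - y| - 0.5 E|X - X'|, averaged over grid points; the array layout (ensemble members along the first axis) and the function names are assumptions, and the abstract does not say which CRPS estimator the authors used.

```python
import numpy as np


def mae(pred, obs):
    """Mean absolute error between a deterministic field and the observation."""
    return float(np.mean(np.abs(np.asarray(pred, dtype=float) - np.asarray(obs, dtype=float))))


def crps_ensemble(members, obs):
    """Ensemble CRPS averaged over the grid.

    members: array of shape (M, ...) holding M generated fields.
    obs:     array of shape (...) holding the reference field.
    """
    members = np.asarray(members, dtype=float)
    obs = np.asarray(obs, dtype=float)
    # E|X - y|: mean absolute error of each member against the observation.
    term1 = np.mean(np.abs(members - obs), axis=0)
    # 0.5 * E|X - X'|: mean absolute difference over all member pairs.
    diffs = np.abs(members[:, None] - members[None, :])
    term2 = 0.5 * np.mean(diffs, axis=(0, 1))
    return float(np.mean(term1 - term2))
```

With a single ensemble member the pairwise term vanishes and the CRPS reduces to the MAE, which is why the two scores are often reported together for probabilistic and deterministic outputs.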

2605.23402 2026-05-25 cs.LG cs.AI

Parametric Prior Mapping Framework for Non-stationary Probabilistic Time Series Forecasting

非平稳概率时间序列预测的参数先验映射框架

Jinglin Li, Jun Tan, QI Fang, Ning Gui

AI总结 本文提出了一种参数先验映射框架(PPM),用于非平稳概率时间序列预测。该方法通过引入参数化的结构先验,结合生成模型的优势,实现了在保持计算效率的同时捕捉复杂时间依赖关系。实验表明,PPM在非平稳数据预测任务中优于现有方法,在准确性和计算效率之间取得了更好的平衡。

详情
Comments
20 pages, 8 figures, accepted by ICML 2026
AI中文摘要

在概率多变量时间序列(MTS)预测中有效建模非平稳动态需要在表达性和鲁棒性之间取得平衡。现有参数方法受益于强归纳偏置但缺乏灵活性,而深度生成模型在没有大量数据和计算的情况下难以捕捉复杂的时间依赖性。我们引入了参数先验映射(PPM),这是一个将参数化结构先验注入生成建模过程的框架。具体来说,PPM利用参数化估计器推导出一个动态的自适应先验,通过可学习的映射指导复杂预测分布的学习。这种设计使模型能够保留参数方法的效率,同时利用生成模型的表达能力。通过混合目标训练,PPM产生精确的预测,并具有良好校准的不确定性估计。实验结果表明,PPM在处理非平稳数据方面优于现有基线,在精度和计算效率之间提供了更好的权衡。代码可在https://github.com/ljl8336/PPM获取。

英文摘要

Effectively modeling non-stationary dynamics in probabilistic multivariate time series (MTS) forecasting requires balancing expressiveness with robustness. Existing parametric approaches benefit from strong inductive biases but lack flexibility, whereas deep generative models struggle to capture complex temporal dependencies without extensive data and computation. We introduce Parametric Prior Mapping (PPM), a framework that injects parametric structural priors into a generative modeling process. Specifically, PPM utilizes a parametric estimator to derive a dynamic, adaptive prior that guides the learning of a complex predictive distribution via a learnable mapping. This design allows the model to retain the efficiency of parametric methods while exploiting the expressive power of generative models. Trained with a hybrid objective, PPM yields precise forecasts with well-calibrated uncertainty estimates. Empirical results show that PPM outperforms existing baselines in handling non-stationary data, offering a superior trade-off between accuracy and computational efficiency. The code is available at https://github.com/ljl8336/PPM.

2605.23397 2026-05-25 cs.CV

Joint Target-Less Intrinsic and Extrinsic Camera-LiDAR Calibration using Deep Point Correspondences

基于深度点对应的无靶标联合相机-激光雷达内参和外参标定

Simon Bultmann, Daniele Cattaneo, Abhinav Valada

AI总结 本文研究了无需标定目标的相机-激光雷达联合标定问题,提出了一种基于深度点对应关系的全新方法,能够同时估计相机的内参(包括径向-切向畸变)和外参。该方法通过运动恢复结构(structure-from-motion)自动初始化内参,扩展了对未知畸变图像的匹配能力,并将点对应估计与内、外参的联合非线性优化紧密耦合,实现了更准确的标定效果。实验表明,该方法在KITTI数据集上表现出优越的外参精度和内参恢复能力。

详情
Comments
presented at 2nd German Robotics Conference (GRC)
AI中文摘要

精确的相机-激光雷达标定是机器人多模态感知鲁棒性的前提。最近基于深度点对应的无靶标方法在外参标定中取得了显著性能,但假设图像已校正且内参已知。本文克服了这一限制,提出了首个完全无靶标的流程,通过深度像素-点对应联合估计相机内参(径向-切向畸变的针孔模型)和相机-激光雷达外参。我们的方法通过以下方式扩展了基于深度对应的标定:(i) 通过运动恢复结构(structure-from-motion)自动初始化内参,(ii) 将相机-激光雷达匹配推广到包含未知畸变的原始图像,(iii) 将对应估计与内参和外参的联合非线性优化紧密耦合。我们在KITTI数据集上使用未见过的相机-激光雷达对评估了该方法,并证明联合标定在恢复精确内参的同时提高了外参精度。

英文摘要

Accurate camera-LiDAR calibration is a prerequisite for robust multi-modal perception in robotics. Recent target-less approaches based on deep point correspondences achieve remarkable performance for extrinsic calibration but assume rectified images with known intrinsics. In this work, we overcome this limitation and present the first fully target-less pipeline that jointly estimates camera intrinsics (pinhole model with radial-tangential distortion) and camera-LiDAR extrinsics with deep pixel-point correspondences. Our approach extends deep correspondence-based calibration by (i) automatic intrinsic initialization via structure-from-motion, (ii) generalizing camera-LiDAR matching to raw images with unknown intrinsics including distortion, and (iii) tightly coupling correspondence estimation with joint nonlinear optimization over both intrinsics and extrinsics. We evaluate our method on the KITTI dataset with unseen camera-LiDAR pairs and demonstrate that joint calibration achieves improved extrinsic accuracy while additionally recovering accurate intrinsics.
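The camera model named above (pinhole with radial-tangential distortion) can be written out as a projection function, which is the quantity a joint intrinsic and extrinsic optimization would compare against matched pixels. The sketch follows the widely used convention with two radial coefficients (k1, k2) and two tangential coefficients (p1, p2); the number of coefficients and the function name are assumptions, since the abstract does not state them.

```python
import numpy as np


def project_points(points_lidar, R, t, fx, fy, cx, cy, k1=0.0, k2=0.0, p1=0.0, p2=0.0):
    """Project (N, 3) LiDAR points to pixel coordinates.

    R, t:            extrinsics mapping LiDAR coordinates into the camera frame.
    fx, fy, cx, cy:  pinhole intrinsics.
    k1, k2, p1, p2:  radial and tangential distortion coefficients.
    """
    pc = points_lidar @ R.T + t                 # LiDAR frame -> camera frame
    x = pc[:, 0] / pc[:, 2]                     # normalized image coordinates
    y = pc[:, 1] / pc[:, 2]
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 * r2
    xd = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    yd = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return np.stack([fx * xd + cx, fy * yd + cy], axis=1)
```

Given pixel-point correspondences, the residual `project_points(...) - matched_pixels` depends on both the extrinsics and the intrinsics, so a nonlinear least-squares solver can refine them together rather than fixing the intrinsics in advance.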

2605.23393 2026-05-25 cs.LG cs.AI

Every Component is a Lookup: Token Attribution and Composition from a Single Decomposition

每个组件都是一个查找:来自单一分解的令牌归因与组合

Po-Kai Chen, Niki van Stein, Aske Plaat

AI总结 该论文研究了如何从单一前向传播中解析Transformer模型中各组件对预测结果的贡献及其组合方式。作者提出了一种名为Unpack的反向递归方法,通过分解注意力和MLP子层中的信用,揭示了不同组件之间的交互强度以及每个token的归因信息,无需干预、梯度或辅助训练。实验表明,该方法在GPT-2和Pythia系列模型上有效恢复了组件间的组合结构,并展示了对token级归因的准确捕捉,验证了其在机制可解释性方面的有效性。

详情
AI中文摘要

Transformer的机制可解释性不仅需要识别哪些组件重要,还需要理解它们如何组合成产生预测的计算路径。注意力和MLP都遵循共享的键值模板 $ϕ(S)U$。我们利用这一结构开发了Unpack,一种后向递归方法,通过两个子层分解贡献,产生任意两个组件之间的交互强度、带有K/Q/V组合标签的具名端到端路径,以及每个令牌的归因,全部来自单次前向传递,无需干预、梯度或辅助训练。我们在间接宾语识别任务上进行了评估。在GPT-2 small上,该方法恢复了Wang等人(2023)描述的所有三种组合连接,包括每个连接的特定模式路由(K、Q或V)。为了测试超越简单复制的令牌级归因,我们比较了同一分解中同一名称的两次出现:第一次提及保持强归因,而重复检测位置被抑制,这一模式在匹配的控制提示中不存在。在Pythia系列从160M到6.9B参数中,这一抑制模式在每个尺度上一致地恢复,表明该方法无需真实电路标签即可追踪机制结构。代码可在https://github.com/Fun-Cry/unpacklm获取。

英文摘要

Mechanistic interpretability of transformers requires identifying not just which components matter but how they compose into the computational route that produced a prediction. Both attention and MLP follow a shared key-value template $ϕ(S)U$. We exploit this structure to develop Unpack, a backward recursion that decomposes credit through both sublayers, producing interaction strengths between any two components, named end-to-end paths with K/Q/V composition labels, and per-token attribution from a single forward pass, without intervention, gradients, or auxiliary training. We evaluate on the indirect object identification task. On GPT-2 small, the method recovers all three composition connections described by Wang et al. (2023), including the mode-specific routing of each connection (K, Q, or V). To test token-level attribution beyond trivial copying, we compare two occurrences of the same name in the same decomposition: the first mention retains strong credit while the duplicate-detection position is suppressed, a pattern absent in matched control prompts. Across the Pythia family from 160M to 6.9B parameters, this suppression pattern is consistently recovered at every scale, demonstrating that the method tracks mechanistic structure without ground-truth circuit labels. Code is available at https://github.com/Fun-Cry/unpacklm.

2605.23391 2026-05-25 cs.LG cs.NA math.NA

Coupling-Robust Accuracy in Multiphysics Physics Informed Neural Networks via Kronecker-Preconditioned Optimization

通过Kronecker预条件优化实现多物理场物理信息神经网络的耦合鲁棒精度

Youngjae Park, Jaemin Kim, Junghwa Hong

AI总结 物理信息神经网络(PINNs)在处理耦合多物理场系统时,随着方程间耦合增强,会出现系统性精度下降的问题。本文通过神经切线核(NTK)分析,揭示了这一现象的理论原因,并提出了一种基于克罗内克预处理的优化方法SOAP+GN,有效抑制了耦合强度对学习稳定性的影响。实验表明,该方法在多种耦合偏微分方程系统中均能保持较高的精度,显著优于传统优化方法。

详情
Comments
20 pages, 10 figures. Extended version of AI4Physics Workshop submission (ICML 2026)
AI中文摘要

用于耦合多物理场系统的物理信息神经网络(PINN)在方程间耦合增强时会遭受系统性精度退化。我们通过神经正切核(NTK)分析为这一现象提供了理论解释:对于线性耦合系统,我们证明标准NTK的谱半径随耦合强度γ呈Ω(γ²)增长,缩小了稳定学习率,而块对角Gauss-Newton(GN)预条件产生预条件NTK $K_P = J H^{+} J^\top$(其中$H$是块对角GN Hessian矩阵),其谱半径以$S$(网络数量)为界,与γ无关。我们在对称、非对称和非线性耦合PDE系统上数值验证了Ω(γ²)增长,并确认在所有情况下$λ_{\max}(K_P) = S$。将Kronecker预条件优化器SOAP与逆梯度范数损失平衡(SOAP+GN)相结合,实现了耦合鲁棒精度:在跨越三个非线性递增的一维系统和一个二维电渗流基准的234个实验中,即使耦合参数变化一到两个数量级,SOAP+GN保持最终epoch的$L_2$退化≤1.1倍(强耦合与弱耦合误差之比),而Adam+GN则超过$10^2$倍。SOAP+GN进一步扩展到EDL分辨条件下的二维六PDE电渗流系统——这一所有先前PINN电动力学研究通过简化物理而避免的工况——而Adam+GN完全失败($L_2 > 0.9$)。

英文摘要

Physics-informed neural networks (PINNs) for coupled multiphysics systems suffer systematic accuracy degradation as inter-equation coupling strengthens. We provide a theoretical explanation for this phenomenon through neural tangent kernel (NTK) analysis: for linearly coupled systems, we prove that the standard NTK's spectral radius grows as $Ω(γ^2)$ with coupling strength $γ$, shrinking the stable learning rate, while block-diagonal Gauss--Newton (GN) preconditioning yields a preconditioned NTK $K_P = J H^{+} J^\top$ (where $H$ is the block-diagonal GN Hessian) whose spectral radius is bounded by $S$ ($S$ = number of networks), independent of $γ$. We verify the $Ω(γ^2)$ growth numerically across symmetric, asymmetric, and nonlinear coupled PDE systems, and confirm $λ_{\max}(K_P) = S$ with equality in all cases. Combining the Kronecker-preconditioned optimizer SOAP with inverse-gradient-norm loss balancing (SOAP+GN) yields coupling-robust accuracy: across 234 experiments spanning three 1D systems of increasing nonlinearity and a 2D electroosmotic flow benchmark, SOAP+GN maintains final-epoch $L_2$ degradation $\leq 1.1\times$ (ratio of strong- to weak-coupling error) even as coupling parameters vary over one to two orders of magnitude, compared with $> 10^2\times$ for Adam+GN. SOAP+GN further scales to a 2D, 6-PDE electroosmotic flow system at EDL-resolved conditions -- a regime that all prior PINN electrokinetics studies have avoided through simplified physics -- where Adam+GN fails entirely ($L_2 > 0.9$).
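The spectral claims can be illustrated on a toy problem. The sketch below is not the paper's experiment: it assumes S = 2 networks with random Jacobian blocks, a linear coupling that scales the cross-equation rows by γ, and Gauss-Newton Hessian blocks $H_s = J_s^\top J_s$, so that $K_P = \sum_s J_s H_s^{+} J_s^\top$ is a sum of S orthogonal projectors. All names and sizes are hypothetical.

```python
import numpy as np


def coupled_jacobian_blocks(gamma, n=4, p=12, seed=0):
    """Toy Jacobian column blocks for two networks with coupling strength gamma.

    Rows are residuals of equations 1 and 2 (n points each); each block holds
    the derivatives with respect to one network's p parameters.
    """
    rng = np.random.default_rng(seed)
    A1, A2, B1, B2 = (rng.standard_normal((n, p)) for _ in range(4))
    J1 = np.vstack([A1, gamma * B1])   # network 1: own equation, then coupled term
    J2 = np.vstack([gamma * B2, A2])   # network 2: coupled term, then own equation
    return [J1, J2]


def ntk_spectral_radii(blocks):
    """Largest eigenvalues of the standard NTK and the GN-preconditioned NTK."""
    J = np.hstack(blocks)
    K = J @ J.T
    # Block-diagonal GN preconditioning: one pseudo-inverse per network.
    K_P = sum(Js @ np.linalg.pinv(Js.T @ Js, rcond=1e-10) @ Js.T for Js in blocks)
    K_P = 0.5 * (K_P + K_P.T)          # remove rounding asymmetry
    return np.linalg.eigvalsh(K)[-1], np.linalg.eigvalsh(K_P)[-1]
```

Sweeping `gamma` over 1, 10, and 100 in this toy setting, the largest eigenvalue of `K` grows by orders of magnitude while that of `K_P` stays at S = 2, because each term of `K_P` is a projector with eigenvalues in {0, 1}.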

2605.23386 2026-05-25 cs.RO

Droneulator: A Portable UAV Simulator for Agricultural Workflows with RotorPy and Godot 4

Droneulator: 一种基于RotorPy和Godot 4的农业工作流便携式无人机模拟器

Jacob Swindell, Michael Lowen, Marija Popovic, Riccardo Polvara

AI总结 本文提出了一款名为Droneulator的便携式无人机模拟器,专为农业应用场景设计,结合了RotorPy进行多旋翼动力学仿真,以及Godot 4进行场景渲染与传感器数据生成。该模拟器支持PX4控制和轻量级WebSocket指令路径,并通过Zenoh实现ROS 2兼容的数据流传输,能够在不修改基础设施的前提下支持农业无人机的图像采集、局部路径规划和强化学习实验。实验结果表明,Droneulator在多种农业无人机任务中表现出良好的性能,包括三维重建、障碍物避让规划和基于深度感知的导航策略训练。

详情
AI中文摘要

农业无人机研究需要模拟器集成逼真的3D场景、高保真车辆动力学和机器人中间件,同时在实际部署中能够跨异构开发机器运行。我们提出Droneulator,一种便携式无人机模拟器架构,结合RotorPy用于多旋翼动力学和Godot 4用于渲染和传感器生成。Droneulator提供基于PX4的控制和轻量级WebSocket命令路径,并通过基于Zenoh的ROS 2兼容管道发布同步的视觉和状态流。这种集成使得单一栈能够支持面向检测的数据捕获、ROS 2/PX4局部规划和强化学习实验,而无需修改模拟器基础设施。我们通过三个农业无人机工作流对当前系统进行了量化验证:使用COLMAP进行3D重建的树冠尺度图像采集、使用EGO-Planner围绕冠层障碍物的局部规划,以及通过自定义Gymnasium环境的闭环强化学习。在报告的设置中,结果表明模拟器能够维持低延迟感知,支持不同捕获密度下的重建导向数据采集,执行围绕冠层障碍物的无碰撞局部规划,并支持基于深度感知的障碍感知导航策略训练。这些结果共同展示了Droneulator在农业无人机检测、规划和学习中作为一个可部署栈的潜力。

英文摘要

Agricultural UAV research requires simulators that integrate realistic 3D scenes, high-fidelity vehicle dynamics, and robotics middleware, while remaining practical to deploy across heterogeneous development machines. We present Droneulator, a portable UAV simulator architecture that combines RotorPy for multirotor dynamics with Godot 4 for rendering and sensor generation. Droneulator exposes both PX4-based control and a lightweight WebSocket command path, and publishes synchronised visual and state streams through a Zenoh-based ROS 2-compatible pipeline. This integration enables a single stack to support inspection-oriented data capture, ROS 2/PX4 local planning, and reinforcement learning experiments without modifying the simulator infrastructure. We present quantified validation of the current system across three agricultural UAV workflows: tree-scale image collection for 3D reconstruction with COLMAP, local planning around canopy obstacles using EGO-Planner, and closed-loop reinforcement learning through a custom Gymnasium environment. In the reported setup, the results show that the simulator can sustain low-latency sensing, support reconstruction-oriented data collection under varying capture density, execute collision-free local planning around canopy obstacles, and support stable depth-sensing-based policy training for obstacle-aware navigation. Together, these results show the potential of Droneulator for agricultural UAV inspection, planning, and learning within one deployable stack.

2605.23384 2026-05-25 cs.CL cs.AI

Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals

元认知作为奖励:通过知识和调节信号强化LLM推理

Sirui Chen, Lei Xu, Yuying Zhao, Yutian Chen, Yu Wang, Beier Zhu, Hanwang Zhang, Shengjie Zhao, Chaochao Lu

AI总结 该论文提出了一种基于元认知的强化学习框架 MaR,旨在提升大语言模型的推理能力。MaR 通过元认知知识和元认知调节两个维度提供奖励信号,前者用于识别任务相关的信息,后者用于规划和调整推理过程,从而超越仅依赖最终答案的奖励设计。实验表明,MaR 在多个基准测试中显著提升了模型性能,并在部分任务上超越了更强大的模型。

详情
AI中文摘要

最近的强化学习方法显著提高了LLM的推理能力。现有的奖励设计主要遵循两种范式:(1) 基于可验证奖励的强化学习(RLVR)从可执行检查或真实答案中获取结果信号,但对中间推理行为的指导有限。(2) 基于评分标准的奖励(RaR)通过使用自然语言评分标准来评估推理质量和任务合规性,超越了最终答案检查,但通常需要实例特定的评分标准和大量设计工作。为解决这些问题,我们引入了元认知奖励(MaR),一种受元认知启发的RL框架,通过两个通用过程维度指导LLM推理:i) 元认知知识,无需手工制作的实例特定评分标准即可识别任务相关信息;ii) 元认知调节,规划和调整推理过程,以提供超越最终答案结果的奖励指导。MaR将模型轨迹分解为显式的元认知组件,并通过任务知识覆盖度、调节保真度和最终答案正确性的轨迹级奖励进行优化。通过这种方式,MaR将奖励反馈扩展到推理轨迹,同时将奖励信号锚定在通用的元认知维度上。在22个基准上的实验表明,MaR持续提升模型性能,相比基础模型最高提升7.7%,相比原始DAPO最高提升11.0%。值得注意的是,Qwen3.5-9B + MaR缩小了与前沿模型的差距,在整体平均上超越GPT-OSS-120B,并在多个单独基准上超越更强模型。过程级分析进一步显示推理过程质量显著提升。MaR还能泛化到域外数据集,MaR训练的模型在平均性能上优于对应的基础模型。

英文摘要

Recent RL methods have substantially improved the reasoning abilities of LLMs. Existing reward designs mainly follow two paradigms: (1) Reinforcement learning with verifiable rewards (RLVR) derives outcome signals from executable checks or ground-truth answers, but provides limited guidance for intermediate reasoning behaviors. (2) Rubrics-as-reward (RaR) goes beyond final-answer checking by using natural-language rubrics to assess reasoning quality and task compliance, but often requires instance-specific rubrics and substantial design effort. To address these issues, we introduce Metacognition-as-Reward (MaR), a metacognition-inspired RL framework that guides LLM reasoning through two general process dimensions: i) metacognitive knowledge, which identifies task-relevant information without hand-crafted instance-specific rubrics, and ii) metacognitive regulation, which plans and adjusts the reasoning process to provide reward guidance beyond final-answer outcomes. MaR scaffolds model rollouts into explicit metacognitive components and optimizes them with a trajectory-level reward over task knowledge coverage, regulation fidelity, and final-answer correctness. In this way, MaR extends reward feedback to reasoning trajectories while grounding the reward signals in general metacognitive dimensions. Experiments on 22 benchmarks show that MaR consistently improves model performance, achieving up to a 7.7% gain over the base model and up to an 11.0% gain over vanilla DAPO. Notably, Qwen3.5-9B + MaR narrows the gap to frontier models, surpassing GPT-OSS-120B on overall average and outperforming stronger models on several individual benchmarks. Process-level analysis further shows substantial improvements in reasoning process quality. MaR also generalizes to out-of-domain datasets, where MaR-trained models improve over their corresponding base models on average.

2605.23382 2026-05-25 cs.CL

From Correctness to Preference: A Framework for Personalized Agentic Reinforcement Learning

从正确性到偏好:个性化智能体强化学习框架

Ranxu Zhang, Zeyang Li, Jiacheng Huang, Rui Zhang, Xiaozhou Xu, Sun Zhe, Yanyong Zhang, Chao Wang

AI总结 该论文提出了一种面向个性化智能体的强化学习框架,旨在解决现实场景中用户需求差异带来的行为规划与工具使用策略多样化问题。核心方法是通过解耦通用任务奖励与个性化偏好奖励,并引入用户特定锚点以稳定学习过程,同时结合分阶段偏好分离奖励模型和个性化技能演化图记忆结构,实现偏好识别、策略优化与技能积累的闭环。实验表明,该框架在多个基准任务中优于现有强化学习与记忆方法。

详情
Comments
34 pages, 7 figures, Under Review
AI中文摘要

智能体强化学习(Agentic RL)在具有明确成功信号的任务中取得了显著进展。然而,许多现实世界的智能体应用需要用户条件行为:同一查询可能在不同用户之间需要不同的规划策略和工具使用决策。这种设置带来了关键挑战:通用奖励无法捕捉异构用户偏好,观察到的行为与从众效应纠缠在一起,扁平记忆无法支持个性化技能检索。为此,我们提出了一个统一的个性化智能体RL框架,将个性化嵌入训练时优化。其核心是个性化锚点奖励解耦策略优化(PARPO),它将通用任务质量奖励与个性化偏好奖励解耦,并使用用户特定锚点在异构奖励尺度下稳定学习。我们进一步引入了两阶段偏好解耦奖励模型和偏好对齐技能演化图记忆(PSGM),用于个性化监督和偏好对齐技能检索。它们共同构成了偏好识别、策略优化和结构化技能积累的闭环。在ETAPP、ETAPP-Hard和SJAgent上的实验表明,我们的框架始终优于强记忆和RL基线。代码和数据包含在补充材料中。

英文摘要

Agentic reinforcement learning (Agentic RL) has achieved strong progress in tasks with clear success signals. However, many real-world agent applications require user-conditioned behavior: the same query may call for different planning strategies and tool-use decisions across users. This setting raises key challenges: generic rewards cannot capture heterogeneous user preferences, observed behaviors are entangled with conformity effects, and flat memories cannot support personalized skill retrieval. To this end, we propose a unified personalized Agentic RL framework that embeds personalization into training-time optimization. At its core is Personalized Anchor Reward-Decoupled Policy Optimization (PARPO), which decouples generic task-quality rewards from personalized preference rewards and uses user-specific anchors to stabilize learning under heterogeneous reward scales. We further introduce a two-stage preference-disentangled reward model and Preference-Aligned Skill Evolution Graph Memory (PSGM) for personalized supervision and preference-aligned skill retrieval. Together, they form a closed loop of preference identification, policy optimization, and structured skill accumulation. Experiments on ETAPP, ETAPP-Hard, and SJAgent show that our framework consistently outperforms strong memory and RL baselines. Code and data are included in the supplementary materials.

2605.23381 2026-05-25 cs.CV

VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation

VDE: 通过速度分解与估计实现无训练加速整流流模型

Junwen Tan, Jinglin Liang, Hongyuan Chen, Shuangping Huang

AI总结 尽管rectified flow模型在图像、视频和3D生成中表现出色,但其推理速度较慢限制了实际应用。本文提出了一种无需训练的加速方法VDE,通过速度分解与估计将传统缓存复用的范式转变为分解估计,提升了输入适应性与输出质量。VDE将模型速度分解为与输入平行和正交的两个分量,并利用其时间可预测性和方向稳定性进行精确估计,同时通过定期全前向传播防止误差累积,实验表明该方法在保持视觉质量的同时显著提升了生成效率。

详情
Comments
Accepted by CVPR 2026
AI中文摘要

尽管整流流模型在图像、视频和3D生成中取得了显著性能,但其实际部署受到推理速度慢的挑战。先前的加速方法重用前一步的缓存特征,忽略了静态缓存与不断变化的输入之间日益增长的失配,导致输出保真度下降。本文提出速度分解与估计(VDE),一种无训练加速方法,将范式从缓存重用转变为分解估计。具体而言,VDE将模型的速度分解为与输入平行和正交的分量,利用它们的时间可预测性和方向稳定性进行精确的输入自适应估计。为防止误差累积,它通过完整前向传播定期锚定模型状态。在图像和视频生成任务上的大量实验表明,VDE在视觉质量损失极小的情况下实现了显著加速。值得注意的是,VDE将Flux加速3.22倍,并在Qwen-Image上实现了0.069的LPIPS,比最佳基线降低了52.2%。

English abstract

Although rectified flow models have achieved remarkable performance in image, video, and 3D generation, their practical deployment is hindered by slow inference. Prior acceleration methods reuse cached features from previous steps, which neglects the growing mismatch between static caches and the evolving input and reduces output fidelity. This work proposes Velocity Decomposition and Estimation (VDE), a training-free acceleration method that shifts the paradigm from caching-and-reusing to decomposing-and-estimating. Specifically, VDE decomposes the model's velocity into components parallel and orthogonal to the input, exploiting their temporal predictability and directional stability for precise, input-adaptive estimation. To prevent error accumulation, it periodically anchors the model's state via full forward passes. Extensive experiments on image and video generation tasks demonstrate that VDE achieves substantial acceleration with minimal loss in visual quality. Notably, VDE accelerates Flux by 3.22× and achieves an LPIPS of 0.069 on Qwen-Image, a 52.2% reduction relative to the best baseline.
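
The decomposition step named in the abstract can be illustrated with plain vector projection. The sketch below is only a minimal illustration under assumptions: the names `dot` and `decompose_velocity` are hypothetical, vectors are flat Python lists, and nothing here reproduces VDE's estimator or its anchoring schedule, which the abstract does not specify.

```python
from math import isclose


def dot(a, b):
    """Inner product of two equal-length vectors given as lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def decompose_velocity(v, x):
    """Split v into components parallel and orthogonal to x.

    Hypothetical helper for illustration; returns the projection
    coefficient, the parallel component, and the orthogonal component.
    """
    xx = dot(x, x)
    if xx == 0.0:
        raise ValueError("input x must be non-zero")
    coeff = dot(v, x) / xx                      # scalar projection coefficient
    v_par = [coeff * xi for xi in x]            # component along x
    v_orth = [vi - pi for vi, pi in zip(v, v_par)]  # remainder, orthogonal to x
    return coeff, v_par, v_orth


x = [3.0, 4.0]   # toy stand-in for the model input
v = [1.0, 2.0]   # toy stand-in for the predicted velocity
coeff, v_par, v_orth = decompose_velocity(v, x)
```

By construction the two components sum back to `v`, and the orthogonal part has zero inner product with `x` up to floating-point error.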

2605.23378 2026-05-25 math.OC cs.LG

Selective Ambulance Dispatch Under Contextual Travel-Time Uncertainty

上下文旅行时间不确定性下的选择性救护车调度

Zikun Lin, Daniel Zhuoyu Long, Viet Anh Nguyen

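AI summary: This paper studies how to selectively dispatch ambulances to out-of-hospital cardiac arrest emergencies when travel times are uncertain. It proposes IDEAL, an intelligent dual-dispatch framework that sends a second ambulance only when the optimistic gap between the primary and secondary paths exceeds a threshold, preserving response speed while reducing resource consumption. The method learns context-dependent edge travel times with a weakly supervised bilevel network, trains the nonsmooth model with a convergence guarantee, and models uncertainty to support efficient real-time decisions. On real data and in simulation, it achieves a better balance between response time and resource use than existing methods.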

Details
AI Chinese abstract

救护车响应在院外心脏骤停(OHCA)中具有时间紧迫性,调度员必须在及时到达与有限车队容量之间取得平衡。静态区域和确定性旅行时间估计易受动态拥堵影响,而始终双调度增加了冗余但消耗了车队容量。我们提出IDEAL(智能双调度急救车),一种选择性双调度框架,仅当主要路径与次要路径之间的乐观差距超过阈值时才派出第二辆救护车。IDEAL利用弱监督双层表示网络,从行程级调度记录(包括未观测路线)中学习上下文特定的边旅行时间。我们使用小批量保守梯度训练非光滑模型,并证明渐近收敛保证。IDEAL通过Burg散度扰动对学习表示空间中的共享度量进行建模,从而引起边旅行时间的相关变化,并从历史低估误差中学习上下文特定半径。对于实时决策,IDEAL将乐观差距计算转化为凸差规划,并推导出具有复杂度保证的高效预言机。与香港消防处合作,我们使用历史OHCA记录和实时自适应模拟评估IDEAL。相对于所有基于区域和基于谷歌的基线,结果实现了更强的响应时间/资源权衡。

English abstract

Ambulance response is time-critical in out-of-hospital cardiac arrest (OHCA), where dispatchers must balance timely arrivals with limited fleet capacity. Static territories and deterministic travel-time estimates are vulnerable to dynamic congestion, while always-dual dispatch adds redundancy but consumes fleet capacity. We propose IDEAL (Intelligent Dual dispatch of Emergency AmbuLances), a selective dual-dispatch framework that sends a second ambulance only when the optimistic gap between primary and secondary paths exceeds a threshold. IDEAL learns context-specific edge travel times from trip-level dispatch records, including unobserved routes, using a weakly supervised bilevel representation network. We train the nonsmooth model with mini-batch conservative gradients and prove an asymptotic convergence guarantee. IDEAL models uncertainty via Burg-divergence perturbations to a shared metric in the learned representation space, thereby inducing correlated changes in edge travel times, and learns context-specific radii from historical underprediction errors. For real-time decisions, IDEAL casts optimistic-gap computation as a difference-of-convex program and derives an efficient oracle with complexity guarantees. In collaboration with the Hong Kong Fire Services Department, we evaluate IDEAL using historical OHCA records and real-time adaptive simulations. IDEAL achieves a better response-time/resource trade-off than all region-based and Google-based baselines.
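
The threshold rule for selective dual dispatch can be sketched as follows. This is a toy illustration under loud assumptions: the abstract does not define the optimistic gap, so `optimistic_gap` here is a hypothetical stand-in that takes the largest time saving the secondary path could offer over a finite list of travel-time scenarios, whereas the paper computes the gap with a difference-of-convex program. All names, units, and numbers are made up for illustration.

```python
def optimistic_gap(primary_times, secondary_times):
    """Toy stand-in (assumption): the largest amount by which the secondary
    path beats the primary path across paired travel-time scenarios."""
    if not primary_times or len(primary_times) != len(secondary_times):
        raise ValueError("need equally many, at least one, scenario per path")
    return max(p - s for p, s in zip(primary_times, secondary_times))


def should_dual_dispatch(primary_times, secondary_times, threshold):
    """Send a second ambulance only when the gap exceeds the threshold."""
    return optimistic_gap(primary_times, secondary_times) > threshold


# Hypothetical scenario travel times in minutes for the two candidate paths.
primary = [6.0, 7.5, 11.0]
secondary = [8.0, 8.0, 8.5]

gap = optimistic_gap(primary, secondary)               # 11.0 - 8.5 = 2.5
send_second = should_dual_dispatch(primary, secondary, threshold=2.0)
hold_second = should_dual_dispatch(primary, secondary, threshold=3.0)
```

With these toy numbers the second ambulance is sent at a 2-minute threshold but not at a 3-minute one, which shows how the threshold trades response time against fleet capacity.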