arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.09877 2026-05-18 cs.LG cs.AI cs.CL

Key-Value Means: Transformers with Expandable Block-Recurrent Compressed Memory

Daniel Goldstein, Eugene Cheah

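AI summary: This paper proposes Key-Value Means (KVM), a new block-recurrent attention mechanism that supports either a fixed-size or a growable state. With only a negligible number of added parameters, fixed-size KVM layers turn a strong transformer baseline into an $O(N)$ chunked RNN, and a transformer with a growable KVM cache performs competitively on long-context tests with subquadratic prefill time and sublinear state growth. KVM combines benefits of traditional transformers and linear RNNs, supports chunk-wise parallel training and prefill, can be used on every layer to save KV-cache memory, and can replace traditional attention in hybrid models alongside LRNN layers to improve long-context performance.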

Abstract

We present Key-Value Means ("KVM"), a novel block-recurrence for attention that can accommodate either fixed-size or growing state. Equipping a strong transformer baseline with fixed-size KVM attention layers yields a strong $O(N)$ chunked RNN, while adding only an insignificant number of new parameters. We train a transformer with a growable KVM cache and show it performs competitively on long-context tests with only subquadratic prefill time and sublinear state growth. KVM is implementable with standard operations and without custom kernels, and supports chunk-wise parallelizable training and prefill. It provides many of the benefits of both traditional transformers (expandable context memory, chunk-wise parallelizable training and prefill) and linear RNNs in a single unified package. It can be used on every layer, saving KV-cache memory, and allowing a continuous range of choices of prefill time complexity between $O(N)$ and $O(N^2)$. It can also be implemented in a hybrid solution in tandem with LRNN layers in place of traditional attention, to supplement the LRNN with improved sublinear memory growth context length usage and long context decoding. We release our code at https://github.com/featherless-ai/KVM-paper and trained models at https://huggingface.co/collections/featherless-ai/kvm-paper under the Apache 2.0 license.

2605.09869 2026-05-18 cs.RO cs.CV

ConsistNav: Closing the Action Consistency Gap in Zero-Shot Object Navigation with Semantic Executive Control

Haosen Wang, Zhenyang Li, Yinqiang Zhang, Zongqi He, Lutao Jiang, Kai Li, Yizhou Zhao, Liaoyuan Fan, Wenjian Hou, Tingbang Liang, Yibin Wen, Defeng Gu

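AI summary: This paper studies the action consistency problem in zero-shot object navigation: because semantic evidence is reinterpreted at every step, the agent fails to keep pursuing a target it has already detected. The authors propose ConsistNav, a training-free zero-shot object navigation framework whose three modules (a finite-state executive controller, persistent candidate memory, and stability-aware action control) improve sustained target pursuit and action consistency. Experiments on HM3D and MP3D show state-of-the-art results among the compared zero-shot methods, with gains in success rate (SR) and success weighted by path length (SPL).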

Comments 13 pages, 5 figures

Abstract

Zero-shot object navigation has advanced rapidly with open-vocabulary detectors, image-text models, and language-guided exploration. However, even after current methods detect a plausible target hypothesis, the agent may still oscillate between exploration and pursuit, or abandon the object near success. We identify this failure mode as an action consistency gap: semantic evidence is repeatedly reinterpreted at each step without persistent commitment across the episode. We introduce ConsistNav, a training-free zero-shot ObjectNav framework built around a semantic executive composed of three coordinated modules: Finite-State Executive Controller stages target pursuit through guarded semantic phases; Persistent Candidate Memory accumulates cross-frame target evidence into stable object hypotheses; and Stability-Aware Action Control suppresses rotational stagnation, ineffective pursuit, and unverified stopping. This design changes neither the detector nor the low-level planner; instead, it controls when semantic evidence should influence navigation and when it should be suppressed or revisited. We conduct extensive experiments on HM3D and MP3D, where ConsistNav achieves state-of-the-art results among compared zero-shot ObjectNav methods and improves SR by 11.4% and SPL by 7.9% over the controlled baseline on MP3D. Ablation studies and real-world deployment experiments further demonstrate the effectiveness and robustness of the proposed executive mechanism.

2605.08949 2026-05-18 cs.LG

Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning

Binghang Lu, Zheyuan Deng, Runyu Zhang, Bing Hu, Yunhan Zhao, Yuan Tian, Changhong Mou, Guang Lin, Xiaomin Li

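AI summary: Catastrophic forgetting is a central challenge in continual learning for large language models. This paper proposes Muon-OGD, which combines the spectral-norm geometry of the Muon optimizer with orthogonal projection constraints. Each update is posed as a spectral-norm-constrained optimization problem and solved efficiently with dual iterations, so that updates avoid interfering with parameter directions associated with earlier tasks. Experiments show that Muon-OGD outperforms sequential fine-tuning and orthogonal-gradient baselines on several continual learning benchmarks while remaining computationally scalable.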

Abstract

A central challenge in continual learning for large language models (LLMs) is catastrophic forgetting, where adapting to new tasks can substantially degrade performance on previously learned ones. Existing projection-based methods mitigate such interference by restricting parameter updates to subspaces that are orthogonal to directions associated with past tasks. However, these methods are typically formulated under Euclidean parameter geometry, with update magnitudes and projections governed by the Frobenius norm. The recent empirical success of the Muon optimizer, which applies orthogonalized matrix updates and admits a spectral-norm interpretation, suggests that Frobenius geometry may not be the most effective choice for matrix-valued LLM parameters. Motivated by this observation, we propose Muon-OGD, a spectral-norm-aware continual learning framework that integrates Muon-style operator-norm geometry with orthogonal projection constraints. Our method formulates each update as a spectral-norm-constrained optimization problem with linear non-interference constraints, and solves it efficiently through dual iterations and Newton-Schulz matrix-sign approximations. By applying orthogonalized momentum updates that avoid protected directions associated with prior tasks, Muon-OGD aims to improve the stability-plasticity trade-off in sequential LLM adaptation. We evaluate the proposed method on standard continual learning benchmarks, TRACE, and domain-specific Coding-Math-Medical curricula using both encoder-decoder and decoder-only architectures. Empirically, Muon-OGD consistently improves over sequential fine-tuning and competitive orthogonal-gradient baselines, while remaining computationally scalable. These results suggest that spectral-norm-aware update geometry provides a practical and effective alternative to Frobenius-norm projection for continual learning in LLMs.
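
The two ingredients named in the abstract, orthogonalized updates and avoidance of protected directions, can be illustrated with a minimal NumPy sketch. This is not the paper's dual-iteration solver: it simply projects a gradient away from a protected subspace and then applies the classic cubic Newton-Schulz iteration. The function names, matrix shapes, and the protected basis `P` are assumptions made for illustration.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=30):
    """Approximate the orthogonal polar factor U V^T of G = U S V^T."""
    # Dividing by the Frobenius norm bounds every singular value by 1,
    # which keeps the cubic iteration inside its convergence region.
    X = G / (np.linalg.norm(G) + 1e-12)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def project_out(G, P):
    """Remove from G the component lying in span(P); P has orthonormal columns."""
    return G - P @ (P.T @ G)

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 4))                    # hypothetical gradient of a weight matrix
P, _ = np.linalg.qr(rng.standard_normal((8, 2)))   # hypothetical protected directions
update = newton_schulz_orthogonalize(project_out(G, P))
```

Because each Newton-Schulz iterate stays in the column space of its input, the orthogonalized update remains orthogonal to `P` in this simplified setting.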

2605.08894 2026-05-18 cs.CL cs.AI

Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs

Yuzhuang Xu, Xu Han, Yuxuan Li, Pengzhan Li, Wanxiang Che

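AI summary: Existing extremely low-bit quantization methods focus mainly on preserving numerical accuracy, but this paper shows that extremely quantized LLMs also suffer systematic smoothness degradation. Using a smoothness proxy and sequence neighborhood modeling, the study finds that the degradation worsens as the bit-width decreases and lowers generation quality. The authors introduce a smoothness-preserving principle into both post-training quantization and quantization-aware training, which yields additional gains and underlines the importance of smoothness in extreme quantization.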

Comments 19 pages, 4 tables, 14 figures

Abstract

Large language models (LLMs) achieve strong performance but incur high deployment costs, motivating extremely low-bit but lossy quantization. Existing quantization algorithms mainly focus on improving the numerical accuracy of forward computation to eliminate performance degradation. In this paper, we show that extremely quantized LLMs suffer from systematic smoothness degradation beyond numerical precision loss. Through a smoothness proxy, we observe that such degradation becomes increasingly severe as the quantization bit-width decreases. Furthermore, based on sequence neighborhood modeling, we find that quantized models exhibit a rapid reduction of effective token candidates within the prediction neighborhood, which directly leads to a sparser decoding tree and degraded generation quality. To validate it, we introduce a simple smoothness-preserving principle in both post-training quantization and quantization-aware training, and demonstrate that preserving smoothness brings additional gains beyond numerical accuracy. The core goal of this paper is to highlight smoothness preservation as an important design consideration for future extreme quantization methods. Code is available at https://github.com/xuyuzhuang11/FINE.

2605.08245 2026-05-18 cs.CV cs.AI

When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models

Harshvardhan Saini, Samyak Jha, Yiming Tang, Dianbo Liu

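AI summary: This paper studies hallucination in vision-language models (VLMs) caused by over-alignment between the language and vision modalities. It traces the root cause to decoder-based architectures that over-align visual embeddings with the text manifold, injecting a statistical linguistic bias that overshadows fine-grained visual evidence. The authors give the first quantitative characterization of this phenomenon and propose two complementary remedies, a training-free inference strategy and a bias-aware fine-tuning method, both of which remove the linguistic bias from visual representations. Experiments show that these methods significantly reduce hallucinations on several benchmarks and improve long-form captioning quality.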

Abstract

Vision-Language Models (VLMs) increasingly power high-stakes applications, from medical imaging to autonomous systems, yet they routinely hallucinate, confidently describing content not present in the input. We investigate the root causes of these failure modes with a mechanistic analysis focusing on the decoder-based VLMs. We trace these failure modes to a geometric over-alignment: to bridge the modality gap required by attention mechanisms, decoder-based VLMs over-align visual embeddings with the text manifold, injecting a statistical linguistic bias that systematically overshadows fine-grained visual evidence. While prior work either aggressively closes this gap or suppresses hallucinations through expensive black-box decoding strategies, none addresses the underlying geometric cause. We provide the first quantitative characterization of this over-alignment, demonstrating that linguistic bias concentrates in the top principal components of a universal, dataset-agnostic text subspace. Building on this insight, we propose two complementary remedies: a training-free inference strategy and a bias-aware fine-tuning paradigm, both of which explicitly project out this subspace from visual representations. Our methods significantly reduce hallucinations across POPE, CHAIR, and AMBER benchmarks, and improve CLAIR scores on long-form captioning tasks, with the training-free variant adding no computational overhead over the base model.
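
The operation of projecting a subspace out of visual representations can be sketched in a few lines of NumPy. The sketch below assumes the text subspace is estimated by PCA over a matrix of text embeddings and that the top `k` directions are removed; the function names, the value of `k`, and the random stand-in data are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def top_principal_directions(text_emb, k):
    """Return a (d, k) orthonormal basis of the top-k principal directions."""
    centered = text_emb - text_emb.mean(axis=0, keepdims=True)
    # Rows of Vt are right singular vectors, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[:k].T

def project_out(visual_emb, basis):
    """Remove the component of each row that lies in span(basis)."""
    return visual_emb - (visual_emb @ basis) @ basis.T

rng = np.random.default_rng(0)
text_emb = rng.standard_normal((200, 16))    # stand-in for text embeddings
visual_emb = rng.standard_normal((10, 16))   # stand-in for visual embeddings
basis = top_principal_directions(text_emb, k=3)
debiased = project_out(visual_emb, basis)
```

After the projection, every visual embedding has zero component along the removed directions, and applying the projection a second time changes nothing.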

2605.07557 2026-05-18 cs.LG

Beyond Distribution Estimation: Simplex Anchored Structural Inference Towards Universal Semi-Supervised Learning

Yaxin Hou, Jun Ma, Hanyang Li, Bo Han, Jie Yu, Yuheng Jia

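AI summary: This paper addresses semi-supervised learning in realistic settings where labeled data is scarce and the distribution of unlabeled data is unknown, and formalizes this setting as Universal Semi-supervised Learning (UniSSL). To avoid the erroneous pseudo-labels that arise when pseudo-labeling depends on distribution estimation, the authors turn to representation-level structural inference and propose SAGE, which captures high-order inter-sample dependencies to build structural consensus and uses a simplex equiangular tight frame to guide inter-class representation separation. Experiments show that SAGE outperforms existing methods on several benchmarks, with an average accuracy gain of 8.52%.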

Comments The paper is accepted by ICML 2026

Abstract

Semi-supervised learning faces significant challenges in realistic scenarios where labeled data is scarce and unlabeled data follows unknown, arbitrary distributions. We formalize this critical yet under-explored paradigm as Universal Semi-supervised Learning (UniSSL). Existing methods typically leverage unlabeled data via pseudo-labeling. However, they often rely on the idealized assumption of a uniform unlabeled data distribution or require sufficient labeled data to estimate it. In the UniSSL setting, such dependencies lead to numerous erroneous pseudo-labels, thereby triggering representation confusion. Fortunately, we observe that inter-sample relations captured by representations are more reliable than pseudo-labels. Leveraging this insight, we shift our focus to representation-level structural inference to bypass distribution estimation. Accordingly, we propose Simplex Anchored Graph-state Equipartition (SAGE), which captures high-order inter-sample dependencies to establish structural consensus for guiding representation learning. Meanwhile, to mitigate representation confusion, we employ vectors that satisfy a simplex equiangular tight frame to serve as a coordinate frame for guiding inter-class representation separation. Finally, we introduce a weighting strategy based on distribution-agnostic metrics to prioritize reliable pseudo-labels and an auxiliary branch to isolate potentially erroneous pseudo-labels. Evaluations on five standard benchmarks show that SAGE consistently outperforms state-of-the-art methods, with an average accuracy gain of 8.52%.
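
For readers unfamiliar with the simplex equiangular tight frame (ETF) mentioned above, the following sketch builds one with the standard construction: K unit-norm vectors whose pairwise cosine similarity is exactly -1/(K-1). How SAGE uses these vectors during training is not shown; the function name and the requirement `dim >= num_classes` are assumptions of this particular construction.

```python
import numpy as np

def simplex_etf(num_classes, dim, seed=0):
    """Return a (dim, num_classes) matrix whose columns form a simplex ETF."""
    if dim < num_classes:
        raise ValueError("this construction assumes dim >= num_classes")
    K = num_classes
    rng = np.random.default_rng(seed)
    # U has orthonormal columns, so U.T @ U is the K x K identity.
    U, _ = np.linalg.qr(rng.standard_normal((dim, K)))
    centering = np.eye(K) - np.ones((K, K)) / K
    return np.sqrt(K / (K - 1)) * U @ centering

anchors = simplex_etf(num_classes=5, dim=16)
gram = anchors.T @ anchors   # ones on the diagonal, -1/(K-1) elsewhere
```

The Gram matrix makes the geometry explicit: all class anchors are equally and maximally separated, which is why such frames are a natural target for inter-class separation.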

2605.07074 2026-05-18 cs.CV

Decoupling Semantics and Fingerprints: A Universal Representation for AI-Generated Image Detection

Zhiyuan Wang, Yanxiang Chen, Pengcheng Zhao, Yunfeng Diao, Xin Liao

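AI summary: This paper studies how to detect AI-generated images produced by unseen architectures. It observes that existing detectors over-rely on generator-specific fingerprints and semantic content, which limits generalization, and identifies feature entanglement as the main cause. The proposed Orthogonal Decomposition and Purification Network (ODP-Net) structurally separates universal forgery traces, generator fingerprints, and semantic content, improving detection on unseen generative models.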

Comments ~10 pages (IEEEtran two-column), 6 figures, 6 tables, 1 algorithm

Abstract

Detecting AI-generated images across unseen architectures remains challenging, as existing models often overfit to generator-specific fingerprints and semantic content rather than learning universal forgery traces. We attribute this failure to feature entanglement: detectors learn these factors as a single entangled representation, where universal forgery traces are inextricably confounded with both generator-specific fingerprints and semantic content. Crucially, our spectral analysis reveals that this entanglement is avoidable: distinct generator-specific fingerprints (e.g., GAN stripes vs. Diffusion Model spots) occupy disjoint frequency subspaces and coexist as independent superpositions. Leveraging this physical orthogonality, we propose the Orthogonal Decomposition and Purification Network (ODP-Net) to structurally disentangle these factors. Specifically, ODP-Net employs (1) Instance-aware Orthogonal Decomposition to project features into mutually exclusive subspaces: universal forgery traces, generator-specific fingerprints, and semantic content; (2) Perturbation-based Purification to enforce semantic invariance via cross-sample feature injection; and (3) Manifold Alignment to bridge domain gaps. By explicitly decoupling universal forgery traces from generator-specific fingerprints and semantic content, ODP-Net achieves state-of-the-art performance on unseen architectures (e.g., Stable Diffusion 3), validating that structural disentanglement is key to generalization.

2605.06390 2026-05-18 cs.AI

Automated alignment is harder than you think

Aleksandr Bowkis, Marie Davidsen Buhl, Jacob Pfau, Geoffrey Irving

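AI summary: This paper examines the risks of automated alignment on the path to artificial superintelligence (ASI). It argues that even when research agents do not deliberately sabotage alignment work, automated alignment could still produce misleading safety assessments and lead to the unintentional deployment of misaligned AI. The reason is that alignment research involves many hard-to-supervise fuzzy tasks on which human judgement is systematically flawed, and under optimization pressure automated systems may produce errors that humans struggle to detect, undermining the reliability of the results. Training agents to perform these tasks reliably is therefore a key challenge for automated alignment research.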

Comments 15 pages, 4 figures

Abstract

A leading proposal for aligning artificial superintelligence (ASI) is to use AI agents to automate an increasing fraction of alignment research as capabilities improve. We argue that, even when research agents are not scheming to deliberately sabotage alignment work, this plan could produce compelling but catastrophically misleading safety assessments resulting in the unintentional deployment of misaligned AI. This could happen because alignment research involves many hard-to-supervise fuzzy tasks (tasks without clear evaluation criteria, for which human judgement is systematically flawed). Consequently, research outputs will contain systematic, undetected errors, and even correct outputs could be incorrectly aggregated into overconfident safety assessments. This problem is likely to be worse for automated alignment research than for human-generated alignment research for several reasons: 1) optimisation pressure means agent-generated mistakes are concentrated among those that human reviewers are least likely to catch; 2) agents are likely to produce errors that do not resemble human mistakes; 3) AI-generated alignment solutions may involve arguments humans cannot evaluate; and 4) shared weights, data and training processes may make AI outputs more correlated than human equivalents. Therefore, agents must be trained to reliably perform hard-to-supervise fuzzy tasks. Generalisation and scalable oversight are the leading candidates for achieving this but both face novel challenges in the context of automated alignment.

2605.05179 2026-05-18 cs.LG cond-mat.dis-nn stat.ML

Estimating the expected output of wide random MLPs more efficiently than sampling

Wilson Wu, Victor Lecomte, Michael Winer, George Robinson, Jacob Hilton, Paul Christiano

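AI summary: This paper presents a method that is more efficient than sampling for estimating the expected output of a wide random multilayer perceptron (MLP) at initialization over Gaussian inputs. The method builds approximate distributions of the activations at each layer using tools such as cumulants and Hermite expansions, avoiding the cost of running inputs through the network one by one. The authors show that it reaches a target mean squared error with substantially less computation, performs particularly well at estimating the probabilities of rare events, and can be used for model training, suggesting a new route to reducing tail risks in models.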

Comments 68 pages. Code is available at https://github.com/alignment-research-center/mlp_cumulant_propagation

Abstract

By far the most common way to estimate an expected loss in machine learning is to draw samples, compute the loss on each one, and take the empirical average. However, sampling is not necessarily optimal. Given an MLP at initialization, we show how to estimate its expected output over Gaussian inputs without running samples through the network at all. Instead, we produce approximate representations of the distributions of activations at each layer, leveraging tools such as cumulants and Hermite expansions. We show both theoretically and empirically that for sufficiently wide networks, our estimator achieves a target mean squared error using substantially fewer FLOPs than Monte Carlo sampling. We find moreover that our methods perform particularly well at estimating the probabilities of rare events, and additionally demonstrate how they can be used for model training. Together, these findings suggest a path to producing models with a greatly reduced probability of catastrophic tail risks.
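
The contrast between sampling and analytic estimation can be seen in a toy case that is far simpler than the paper's setting: a single ReLU unit with a zero-mean Gaussian pre-activation, for which the expectation has the closed form sigma / sqrt(2*pi). The sketch below is only this one-neuron illustration, not the cumulant-propagation estimator; the function names are hypothetical.

```python
import math
import random

def expected_relu_closed_form(sigma):
    """E[max(0, z)] for z ~ N(0, sigma^2), computed without any samples."""
    return sigma / math.sqrt(2.0 * math.pi)

def expected_relu_monte_carlo(sigma, num_samples, seed=0):
    """Empirical average of max(0, z) over num_samples Gaussian draws."""
    rng = random.Random(seed)
    total = sum(max(0.0, rng.gauss(0.0, sigma)) for _ in range(num_samples))
    return total / num_samples

sigma = 2.0
exact = expected_relu_closed_form(sigma)
estimate = expected_relu_monte_carlo(sigma, num_samples=100_000)
```

The Monte Carlo error shrinks only as one over the square root of the number of samples, whereas the closed form costs a constant amount of work. That gap is the kind of saving the abstract describes at the scale of whole networks.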

2605.03548 2026-05-18 cs.LG cs.AI

PerFlow: Physics-Embedded Rectified Flow for Efficient Reconstruction and Uncertainty Quantification of Spatiotemporal Dynamics

Hao Zhou, Rui Zhang, Han Wan, Hao Sun

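AI summary: This study proposes PerFlow, a physics-embedded rectified flow model for efficiently reconstructing spatiotemporal fields governed by partial differential equations (PDEs) and quantifying their uncertainty. PerFlow decouples observation conditioning from physics enforcement, enabling efficient conditional sampling without gradient guidance, and enforces physical consistency through a constraint-preserving projection. Experiments show competitive reconstruction accuracy with sound physics consistency and much faster inference than guided diffusion baselines.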

Comments 17 pages, 8 figures. Accepted to IJCAI-ECAI 2026

Abstract

Reconstructing PDE-governed fields from sparse and irregular measurements is challenging due to their ill-posed nature. Deterministic surrogates trained on dense fields struggle with limited measurements and uncertainty quantification. Generative models, by learning distributions over spatiotemporal fields, can better handle sparsity and uncertainty. However, existing generative approaches enforce data consistency and PDE constraints simultaneously via sampling-time gradient guidance, resulting in slow and unstable inference. To this end, we propose PerFlow, a Physics-embedded rectified Flow for efficient sparse reconstruction and uncertainty quantification of spatiotemporal dynamics. PerFlow decouples observation conditioning from physics enforcement, performing guidance-free conditioning by feeding observations into rectified-flow dynamics while embedding hard physics via a constraint-preserving projection (e.g., incompressibility or conservation). Theoretically, we establish invariance guarantees to ensure that trajectories remain on the physics-consistent manifold throughout sampling. Experiments on various PDE systems demonstrate competitive reconstruction accuracy with sound physics consistency, while enabling efficient conditional sampling (e.g., 50 steps) and up to 320x faster inference than 2000-step guided diffusion baselines.
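
The invariance idea can be illustrated with a toy linear constraint. In the sketch below, the velocity used in each Euler step is projected onto the null space of a constraint matrix `A`, so a sample that starts on the constraint set `A x = 0` stays on it. The constraint (conservation of the field's sum), the stand-in velocity field, and all names are assumptions; the paper's actual projection and learned dynamics are not reproduced here.

```python
import numpy as np

def null_space_projector(A):
    """Orthogonal projector onto {v : A v = 0} for a full-row-rank A."""
    return np.eye(A.shape[1]) - A.T @ np.linalg.solve(A @ A.T, A)

def sample(x0, velocity, projector, num_steps=50):
    """Euler integration from t=0 to t=1 with the velocity projected at each step."""
    x, dt = x0.copy(), 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * (projector @ velocity(x, i * dt))
    return x

A = np.ones((1, 8))                  # toy constraint: the sum of the field is conserved
P = null_space_projector(A)
rng = np.random.default_rng(0)
x0 = P @ rng.standard_normal(8)      # start on the constraint set (sum = 0)
target = rng.standard_normal(8)
# Hypothetical stand-in for a learned velocity field.
x1 = sample(x0, lambda x, t: target - x, P)
```

Since every increment lies in the null space of `A`, the quantity `A @ x` is unchanged along the whole trajectory, up to floating-point error.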

2605.02960 2026-05-18 cs.LG

MoE-Prefill: Zero Redundancy Overheads in MoE Prefill Serving

Zhaoyuan Su, Olatunji Ruwase, Karthik Ganesan, Aurick Qiao, Samyam Rajbhandari, Juncheng Yang, Yue Cheng, Yuxiong He

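AI summary: This paper studies how to efficiently serve prefill-only production workloads (such as classification and recommendation) on mixture-of-experts (MoE) models. It proposes MoE-Prefill, a system whose Asynchronous Expert Parallelism (AsyncEP) backend overlaps the gathering of expert weights with computation, avoiding the redundant computation and synchronization overheads of existing approaches. Its frontend adds prefix-aware routing and true-FLOPs load tracking. The system improves throughput and compute utilization across several hardware and precision configurations.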

Comments 19 pages, 12 figures, 4 tables

Abstract

Production LLM workloads increasingly serve discriminative tasks, such as classification, recommendation, and verification, whose answers are read from the logits of a single prefill pass with no autoregressive decoding. Serving these prefill-only workloads on mixture-of-experts (MoE) models is bottlenecked not by compute but by the distributed execution required to fit the model: existing parallel strategies (tensor, expert, and pipeline parallelism) trade memory pressure for redundant computation, communication, and synchronization, severely degrading MoE prefill serving efficiency. We observe that these overheads stem from coupling expert placement with synchronous activation routing -- a design inherited from the decoding era. The long, compute-bound forward passes of large-batch prefill open a per-layer window wide enough to stream expert weights in the background, replacing per-layer activation AllToAll with asynchronous weight AllGather fully overlapped with computation. We propose MoE-Prefill, a prefill-only serving system whose backend, AsyncEP (Asynchronous Expert Parallelism), gathers experts by weight rather than routing them by activation, and whose frontend co-enforces a physically-derived saturation threshold through prefix-aware routing and true-FLOPs load tracking. On Qwen3-235B-A22B across four hardware/precision configurations, MoE-Prefill delivers 1.35-1.37x throughput over the strongest distributed baseline on real-world workloads and up to 1.59x on long-context synthetic workloads, sustaining 29.8-36.2% per-GPU model FLOPs utilization.

2605.01852 2026-05-18 cs.CV

DP-SfM: Dual-Pixel Structure-from-Motion without Scale Ambiguity

Lilika Makabe, Kohei Ashida, Hiroaki Santo, Fumio Okura, Yasuyuki Matsushita

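AI summary: This paper proposes DP-SfM, which performs multi-view 3D reconstruction from images captured with a dual-pixel (DP) sensor and resolves the scale ambiguity automatically, without a reference object or prior calibration. By combining depth maps with the defocus blur observed in DP images, it estimates the absolute scale with a simple yet effective linear method and then aligns the left and right images through intensity-based optimization. Experiments show that the method works well on diverse scenes captured with different cameras and lenses.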

Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026
Abstract

Multi-view 3D reconstruction, namely, structure-from-motion followed by multi-view stereo, is a fundamental component of 3D computer vision. In general, multi-view 3D reconstruction suffers from an unknown scale ambiguity unless a reference object of known size is present in the scene. In this article, we show that multi-view images captured using a dual-pixel (DP) sensor can automatically resolve the scale ambiguity, without requiring a reference object or prior calibration. Specifically, the defocus blur observed in DP images provides sufficient information to determine the absolute scale when paired with depth maps (up to scale) recovered from multi-view 3D reconstruction. Based on this observation, we develop a simple yet effective linear method to estimate the absolute scale, followed by the intensity-based optimization stage that aligns the left and right DP images by shifting them back toward each other using cross-view blur kernels. Experiments demonstrate the effectiveness of the proposed approach across diverse scenes captured with different cameras and lenses. Code and data are available at https://github.com/lilika-makabe/dp-sfm-tpami.git

2604.27859 2026-05-18 cs.AI cs.ET

Rethinking Agentic Reinforcement Learning In Large Language Models

Fangming Cui, Ruixiao Zhu, Cheng Fang, Sunan Li, Jiahong Li

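AI summary: This paper rethinks agentic reinforcement learning (Agentic RL) in the context of large language models (LLMs). It focuses on how LLM capabilities such as goal-setting, long-term planning, dynamic strategy adaptation, and interactive reasoning can be brought into the reinforcement learning framework to handle complex, open-ended real-world tasks. The paper analyzes the core concepts, methodological innovations, and design principles of this paradigm, and identifies current challenges and future directions.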

Abstract

Reinforcement Learning (RL) has traditionally focused on training specialized agents to optimize predefined reward functions within narrowly defined environments. However, the advent of powerful Large Language Models (LLMs) and increasingly complex, open-ended tasks has catalyzed a shift towards agentic paradigms within RL. This emerging framework extends beyond traditional RL by emphasizing the development of autonomous agents capable of goal-setting, long-term planning, dynamic strategy adaptation, and interactive reasoning in uncertain, real-world environments. Unlike conventional approaches that rely heavily on static objectives and episodic interactions, LLM-based Agentic RL incorporates cognitive-like capabilities such as meta-reasoning, self-reflection, and multi-step decision-making directly into the learning loop. In this paper, we take an in-depth look at the conceptual foundations, methodological innovations, and effective designs underlying this trend. Furthermore, we identify critical challenges and outline promising future directions for building LLM-based Agentic RL.

2604.26139 2026-05-18 cs.CL

HIVE: Hidden-Evidence Verification for Hallucination Detection in Diffusion Large Language Models

Guoshenghui Zhao, Tan Yu, Weijie Zhao

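AI summary: This paper proposes HIVE, a hidden-evidence verification framework for detecting hallucinations in diffusion large language models (D-LLMs). HIVE extracts compressed hidden evidence from denoising trajectories, selects informative step-layer evidence, and conditions a verifier language model on it through prefix embeddings, producing both a continuous hallucination score and structured verification outputs. Experiments show that HIVE outperforms existing methods on several benchmarks, confirming the value of hidden evidence for hallucination detection.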

Comments 5 figures, appendix included

Abstract

Diffusion large language models generate text through multi-step denoising, where hallucination signals may emerge throughout the trajectory rather than only in the final output. Existing detectors mainly rely on output uncertainty or coarse trace statistics, which often fail to capture the richer hidden dynamics of D-LLMs. We propose HIVE, a hidden-evidence verification framework that extracts compressed hidden evidence from denoising trajectories, selects informative step-layer evidence, and conditions a verifier language model on the selected evidence through prefix embeddings. HIVE produces both a continuous hallucination score from verifier decision logits and structured verification outputs, including hallucination types, evidence pairs, and short rationales. Across two D-LLMs and three QA benchmarks, HIVE consistently outperforms eight strong baselines and achieves up to 0.9236 AUROC and 0.9537 AUPRC. Ablation studies further confirm the importance of hidden-evidence conditioning, learned evidence selection, two-stream evidence representation, and step-layer embeddings. These results suggest that selected hidden evidence from denoising trajectories provides a stronger and more usable hallucination signal than output-only uncertainty or coarse trace statistics.

2604.17669 2026-05-18 cs.CV

Low Light Image Enhancement Challenge at NTIRE 2026

George Ciubotariu, Sharif S M A, Abdur Rehman, Fayaz Ali Dharejo, Rizwan Ali Naqvi, Marcos V. Conde, Radu Timofte, Zhi Jin, Hongjun Wu, Wenjian Zhang, Chang Ye, Xunpeng Yi, Qinglong Yan, Yibing Zhang, Zaynab Ali, Saiprasad Meesiyawar, Varda I Pattanshetty, Varsha I Pattanshetty, Nikhil Akalwadi, Padmashree Desai, Ramesh Ashok Tabib, Uma Mudenagudi, Hao Yang, Ruikun Zhang, Liyuan Pan, Furkan Kınlı, Donghun Ryou, Inju Ha, Junoh Kang, Bohyung Han, Wei Zhou, Yuval Haitman, Ariel Lapid, Reuven Peretz, Idit Diamant, Leilei Cao, Shuo Zhang, Praful Hambarde, Prateek Shaily, Jayant Kumar, Hardik Sharma, Aashish Negi, Sachin Chaudhary, Akshay Dudhane, Amit Shukla, MoHao Wu, Lin Wang, Jiachen Tu, Guoyi Xu, Yaoxin Jiang, Jiajia Liu, Yaokun Shi, Raul Balmez, Alexandru Brateanu, Ciprian Orhei, Cosmin Ancuti, Codruta O. Ancuti, Bilel Benjdira, Anas M. Ali, Wadii Boulila, Kaifan Qiao, Bofei Chen, Jingyi Xu, Duo Zhang, Xin Deng, Mai Xu, Shengxi Li, Lai Jiang, Harini A, Ananya N, Lakshanya K, Ying Xu, Xinyi Zhu, Shijun Shi, Jiangning Zhang, Yong Liu, Kai Hu, Jing Xu, Xianfang Zeng, Jinao Song, Guangsheng Tang, Cheng Li, Yuqiang Yang, Ziyi Wang, Yan Chen, Long Bao, Heng Sun, Mohab Kishawy, Jun Chen, Wan-Chi Siu, Yihao Cheng, Hon Man Hammond Lee, Chun-Chuen Hui

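AI summary: This paper reviews the NTIRE 2026 Low Light Image Enhancement Challenge, covering the solutions proposed by participants and the final results. The challenge seeks networks that produce clearer and more visually appealing images from low-contrast, noisy inputs. A total of 22 teams submitted valid entries. The paper evaluates state-of-the-art methods for (joint denoising and) low-light image enhancement, shows the progress made in the field, and draws on samples from a new dataset.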

Abstract

This paper presents a comprehensive review of the NTIRE 2026 Low Light Image Enhancement Challenge, highlighting the proposed solutions and final results. The objective of this challenge is to identify effective networks capable of producing clearer and visually compelling images in diverse and challenging conditions by learning representative visual cues with the purpose of restoring information loss due to low-contrast and noisy images. A total of 195 participants registered for the first track and 153 for the second track of the competition, and 22 teams ultimately submitted valid entries. This paper thoroughly evaluates the state-of-the-art advances in (joint denoising and) low-light image enhancement, showcasing the significant progress in the field, while leveraging samples of our novel dataset.

2604.16925 2026-05-18 cs.CV

Rethinking Cross-Dose PET Denoising: Mitigating Averaging Effects via Residual Noise Learning

Yichao Liu, Zongru Shao, Yueyang Teng, Junwen Guo

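AI summary: This paper studies cross-dose denoising of low-dose positron emission tomography (LDPET) images. Conventional models generalize poorly across dose levels, mainly because noise magnitude and statistical properties vary. The authors show that existing training formulations implicitly optimize an expectation over heterogeneous noise distributions, so the network learns an averaged cross-dose denoising mapping that cannot accurately model dose-specific noise. They propose a unified residual noise learning framework that estimates the noise directly from the low-dose image instead of predicting the full-dose image. Experiments on large-scale datasets from two medical centers show that the method outperforms existing approaches on cross-dose denoising.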

Abstract

Cross-dose denoising for low-dose positron emission tomography (LDPET) has been proposed to address the limited generalization of models trained at a single noise level. However, neural networks trained on a specific dose level often fail to generalize to other dose conditions due to variations in noise magnitude and statistical properties. Conventional "one-size-for-all" models attempt to mitigate this variability but tend to learn averaged representations across noise levels, resulting in degraded performance. In this work, we analyze this limitation and show that standard training formulations implicitly optimize an expectation over heterogeneous noise distributions, causing the network to learn an averaged denoising mapping that cannot accurately model dose-specific noise characteristics. We propose a unified residual noise learning framework that estimates noise directly from low-dose PET images rather than predicting full-dose images. Experiments on large-scale multi-dose PET datasets from two medical centers demonstrate that the proposed method outperforms the "one-size-for-all" model, individual dose-specific U-Net models, and dose-conditioned approaches, achieving improved denoising performance. These results indicate that residual noise learning effectively mitigates the averaging effect and enhances generalization for cross-dose PET denoising.

2604.15221 2026-05-18 cs.RO cs.CV

Vision-Based Safe Human-Robot Collaboration with Uncertainty Guarantees

Jakob Thumm, Marian Frei, Tianle Ni, Matthias Althoff, Marco Pavone

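AI summary: This paper proposes a framework for vision-based human pose estimation and motion prediction that provides verifiable uncertainty guarantees for safe collaboration. The method combines aleatoric uncertainty estimation with out-of-distribution detection to raise prediction confidence, and introduces conformal prediction sets so that predictions remain reliable in real human-robot collaboration. It is evaluated on recorded human motion data and in a real-world human-robot collaboration setting.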

Abstract

We propose a framework for vision-based human pose estimation and motion prediction that gives conformal prediction guarantees for certifiably safe human-robot collaboration. Our framework combines aleatoric uncertainty estimation with OOD detection for high probabilistic confidence. To integrate our pipeline in certifiable safety frameworks, we propose conformal prediction sets for human motion predictions with high, valid confidence. We evaluate our pipeline on recorded human motion data and a real-world human-robot collaboration setting.
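
As background on the conformal guarantee, the sketch below shows plain split conformal prediction, which may differ from the construction used in the paper. Given held-out calibration errors, it returns a radius such that a new error falls within it with probability at least 1 - alpha, assuming exchangeable data. The error values and the function name are hypothetical.

```python
import math

def conformal_radius(calibration_errors, alpha):
    """Split-conformal quantile of calibration errors for miscoverage level alpha."""
    n = len(calibration_errors)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return math.inf  # too few calibration samples for this alpha
    return sorted(calibration_errors)[rank - 1]

# Hypothetical joint-position prediction errors, in metres.
errors = [0.02, 0.05, 0.03, 0.08, 0.04, 0.06, 0.01, 0.07, 0.09]
radius = conformal_radius(errors, alpha=0.2)
```

A ball of this radius around each predicted joint position is then a prediction set with the stated coverage, which is the kind of object a certifiable safety framework can consume.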

2604.08302 2026-05-18 cs.LG cs.AI

DMax: Aggressive Parallel Decoding for dLLMs

Zigeng Chen, Gongfan Fang, Xinyin Ma, Ruonan Yu, Xinchao Wang

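AI summary: This paper proposes DMax, a new approach to efficient generation with diffusion language models (dLLMs). Through progressive self-refinement and a soft parallel decoding strategy, it mitigates error accumulation in parallel decoding, enabling more aggressive parallel generation while preserving quality. DMax also introduces On-Policy Uniform Training, a training strategy that unifies masked and uniform dLLMs, and improves decoding efficiency on several benchmarks.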

Comments Work in progress. Code is available at: https://github.com/czg1225/DMax

Abstract

We present DMax, a new paradigm for efficient diffusion language models (dLLMs). It mitigates error accumulation in parallel decoding, enabling aggressive decoding parallelism while preserving generation quality. Unlike conventional masked dLLMs that decode through a binary mask-to-token transition, DMax reformulates decoding as a progressive self-refinement from mask embeddings to token embeddings. At the core of our approach is On-Policy Uniform Training, a novel training strategy that efficiently unifies masked and uniform dLLMs, equipping the model to recover clean tokens from both masked inputs and its own erroneous predictions. Building on this foundation, we further propose Soft Parallel Decoding. We represent each intermediate decoding state as an interpolation between the predicted token embedding and the mask embedding, enabling iterative self-revising in embedding space. Extensive experiments across a variety of benchmarks demonstrate the effectiveness of DMax. Compared with the original LLaDA-2.0-mini, our method improves TPF on GSM8K from 2.04 to 5.47 while preserving accuracy. On MBPP, it increases TPF from 2.71 to 5.86 while maintaining comparable performance. On two H200 GPUs, our model achieves an average of 1,338 TPS at batch size 1. Code is available at: https://github.com/czg1225/DMax
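
The interpolation between the predicted token embedding and the mask embedding can be sketched as follows. The abstract does not say how the interpolation weight is chosen; using the model's top-token probability as the weight is an assumption made here for illustration, as are the function name and the toy shapes.

```python
import numpy as np

def soft_state(logits, token_embeddings, mask_embedding):
    """Blend predicted-token and mask embeddings, weighted by prediction confidence."""
    # Numerically stable softmax over the vocabulary axis.
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    predicted = probs.argmax(axis=-1)
    confidence = probs.max(axis=-1, keepdims=True)   # assumed interpolation weight
    return confidence * token_embeddings[predicted] + (1.0 - confidence) * mask_embedding

rng = np.random.default_rng(0)
token_embeddings = rng.standard_normal((5, 4))   # toy vocabulary of 5 tokens, width 4
mask_embedding = rng.standard_normal(4)
logits = rng.standard_normal((3, 5))             # 3 positions being decoded in parallel
states = soft_state(logits, token_embeddings, mask_embedding)
```

Under this weighting, a confident position is represented almost entirely by its predicted token, while an uncertain one stays close to the mask embedding and can still be revised in later steps.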

2604.04539 2026-05-18 cs.LG cs.RO

FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control

Donghu Kim, Youngdo Lee, Minho Park, Kinam Kim, I Made Aswin Nahendra, Takuma Seno, Sehee Min, Daniel Palenicek, Florian Vogt, Danica Kragic, Jan Peters, Jaegul Choo, Hojoon Lee

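AI summary: This paper proposes FlashSAC, a fast and stable off-policy reinforcement learning algorithm for high-dimensional robot control. Built on Soft Actor-Critic, it cuts the number of gradient updates by compensating with larger models and higher data throughput, and it maintains stability by explicitly bounding weight, feature, and gradient norms. Across more than 60 tasks in multiple simulators, FlashSAC outperforms PPO and other strong off-policy methods, with the largest gains on high-dimensional tasks, and it sharply reduces training time for sim-to-real humanoid locomotion.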

Comments RSS'26

Abstract

Reinforcement learning (RL) is a core approach for robot control when expert demonstrations are unavailable. On-policy methods such as Proximal Policy Optimization (PPO) are widely used for their stability, but their reliance on narrowly distributed on-policy data limits accurate policy evaluation in high-dimensional state and action spaces. Off-policy methods can overcome this limitation by learning from a broader state-action distribution, yet suffer from slow convergence and instability, as fitting a value function over diverse data requires many gradient updates, causing critic errors to accumulate through bootstrapping. We present FlashSAC, a fast and stable off-policy RL algorithm built on Soft Actor-Critic. Motivated by scaling laws observed in supervised learning, FlashSAC sharply reduces gradient updates while compensating with larger models and higher data throughput. To maintain stability at increased scale, FlashSAC explicitly bounds weight, feature, and gradient norms, curbing critic error accumulation. Across over 60 tasks in 10 simulators, FlashSAC consistently outperforms PPO and strong off-policy baselines in both final performance and training efficiency, with the largest gains on high-dimensional tasks such as dexterous manipulation. In sim-to-real humanoid locomotion, FlashSAC reduces training time from hours to minutes, demonstrating the promise of off-policy RL for sim-to-real transfer.

2604.04310 2026-05-18 cs.RO

frax: Fast Robot Kinematics and Dynamics in JAX

Daniel Morton, Marco Pavone

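AI summary: This paper introduces frax, a JAX-based library for robot kinematics and dynamics that aims to be high-performance, easy to use, and compatible with both CPUs and accelerators. It takes a fully vectorized approach, supports real-time control and parallel computation, and is compatible with automatic differentiation for optimization-based methods. Experiments show that on CPU frax reaches microsecond-level computation times suitable for kilohertz control rates, outperforming common Python libraries and approaching optimized C++ implementations; on GPU it scales to thousands of instances, reaching upwards of 100 million dynamics evaluations per second.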

Comments ICRA 2026 Workshop on Frontiers of Optimization for Robotics

Abstract

In robot control, planning, and learning, there is a need for rigid-body dynamics libraries that are highly performant, easy to use, and compatible with CPUs and accelerators. While existing libraries often excel at either low-latency CPU execution or high-throughput GPU workloads, few provide a unified framework that targets multiple architectures without compromising performance or ease-of-use. To address this, we introduce frax, a JAX-based library for robot kinematics and dynamics, providing a high-performance, pure-Python interface across CPU, GPU, and TPU. Via a fully-vectorized approach to robot dynamics, frax enables efficient real-time control and parallelization, while supporting automatic differentiation for optimization-based methods. On CPU, frax achieves low-microsecond computation times suitable for kilohertz control rates, outperforming common libraries in Python and approaching optimized C++ implementations. On GPU, the same code scales to thousands of instances, reaching upwards of 100 million dynamics evaluations per second. We validate performance on a Franka Panda manipulator and a Unitree G1 humanoid, and release frax as an open-source library.

2604.02268 2026-05-18 cs.LG

SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization

Zhengxi Lu, Zhiyuan Yao, Jinyang Wu, Chengcheng Han, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen

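AI summary: This work investigates how to internalize skills into model parameters so that agents behave autonomously in a zero-shot setting without runtime retrieval. It proposes SKILL0, an in-context reinforcement learning framework that progressively withdraws skill context during training, guiding the model to learn tool invocation and multi-turn task completion. Experiments show that SKILL0 substantially outperforms the standard RL baseline on several agentic tasks while keeping context usage highly efficient.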

Abstract

Agent skills, structured packages of procedural knowledge and executable resources that agents dynamically load at inference time, have become a reliable mechanism for augmenting LLM agents. Yet inference-time skill augmentation is fundamentally limited: retrieval noise introduces irrelevant guidance, injected skill content imposes substantial token overhead, and the model never truly acquires the knowledge it merely follows. We ask whether skills can instead be internalized into model parameters, enabling zero-shot autonomous behavior without any runtime skill retrieval. We introduce SKILL0, an in-context reinforcement learning framework designed for skill internalization. SKILL0 introduces a training-time curriculum that begins with full skill context and progressively withdraws it. Skills are grouped offline by category and rendered with interaction history into a compact visual context, teaching the model tool invocation and multi-turn task completion. A Dynamic Curriculum then evaluates each skill file's on-policy helpfulness, retaining only those from which the current policy still benefits within a linearly decaying budget, until the agent operates in a fully zero-shot setting. Extensive agentic experiments demonstrate that SKILL0 achieves substantial improvements over the standard RL baseline (+9.7\% for ALFWorld, +6.6\% for Search-QA, and +10.1\% for WebShop), while maintaining a highly efficient context of fewer than 0.5k tokens per step. Our code is available at https://github.com/ZJU-REAL/SkillZero.

2603.27043 2026-05-18 cs.CL

Introducing MELI: the Mandarin-English Language Interview Corpus

Suyuan Liu, Molly Babel

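AI summary: This paper introduces the MELI Corpus, an open-source corpus of 29.8 hours of speech from 51 Mandarin-English bilingual speakers, covering two speaking styles: read sentences and spontaneous interviews about language varieties, standardness, and learning experiences. The corpus provides transcriptions force-aligned at the word and phone levels and records metadata such as language attitudes, supporting cross-language and cross-speaker acoustic comparison and both quantitative and qualitative research.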

Comments Accepted at LREC 2026 (14th International Conference on Language Resources and Evaluation), to appear in the conference proceedings

Journal ref
In Proceedings of the Fifteenth Language Resources and Evaluation Conference (pp. 5896-5904). European Language Resources Association (ELRA) 2026
Abstract

We introduce the Mandarin-English Language Interview (MELI) Corpus, an open-source resource of 29.8 hours of speech from 51 Mandarin-English bilingual speakers. MELI combines matched sessions in Mandarin and English with two speaking styles: read sentences and spontaneous interviews about language varieties, standardness, and learning experiences. Audio was recorded at 44.1 kHz (16-bit, stereo). Interviews were fully transcribed, force-aligned at word and phone levels, and anonymized. Descriptively, the Mandarin component totals ~14.7 hours (mean duration 17.3 minutes) and the English component ~15.1 hours (mean duration 17.8 minutes). We report token/type statistics for each language and document code-switching patterns (frequent in Mandarin sessions; more limited in English sessions). The corpus design supports within-/cross-speaker, within/cross-language acoustic comparison and links acoustics to speakers' stated language attitudes, enabling both quantitative and qualitative analyses. The MELI Corpus will be released with transcriptions, alignments, metadata, scans of labelled maps and documentation under a CC BY-NC 4.0 license.

2603.23433 2026-05-18 cs.AI

Mecha-nudges for Machines

Giulio Frey, Kawin Ethayarajh

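AI summary: This paper studies how the decisions of AI agents acting as decision-makers on the Internet can be systematically influenced by changes to their environment, a phenomenon the authors call "mecha-nudging". Combining Bayesian persuasion from economics with $\mathcal{V}$-usable information from computer science, they propose a unified way to quantify how environmental changes affect AI agents. Analyzing more than six million Etsy listings, they find that after ChatGPT's release the machine-usable information in listings for predicting agent curation decisions increased significantly, while human-usable information barely changed. The study offers the first large-scale empirical evidence that systematic mecha-nudging is already occurring in the wild but going unnoticed.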

Abstract

AI agents are becoming active decision-makers on the Internet. As they make decisions in the same environments as humans, the environments themselves can change to influence them. We call this $\textit{mecha-nudging}$: changes to how choices are presented that systematically influence AI agents without materially degrading the decision environment for humans. To measure this phenomenon, we combine two frameworks -- Bayesian persuasion from economics and $\mathcal{V}$-usable information from computer science -- to get a common unit (bits) for quantifying how environments change across a wide range of interventions, contexts, and models. We apply this framework to over six million Etsy listings and find that, after ChatGPT's release, listings contain significantly more machine-usable information for predicting agent curation decisions, increasing by 0.143 bits out of a maximum possible increase of 0.355. This shift is robust across prompts, token choices, labeling models, and fine-tuning architectures; absent in a regulated-text placebo; and far larger than the effect of generic LLM rewriting. In contrast, a human study finds little to no change in human-usable information. Our results provide the first large-scale evidence that systematic mecha-nudging is already occurring in the wild, but going unnoticed.

2603.14764 2026-05-18 cs.CV cs.AI cs.LG

Topology-Preserving Polygon Augmentation for Segmentation in Structured Visual Domains

Sudip Laudari, Sang Hun Baek

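AI summary: This paper studies image augmentation that preserves the topology of polygon annotations in structured visual domains such as architectural floorplan analysis. Conventional geometric augmentation can split a polygon region and break its semantic connectivity; to address this, the paper proposes a lightweight topology-preserving augmentation strategy that repairs adjacency relations in index space without changing the vertex order. Experiments show near-perfect Cyclic Adjacency Preservation (CAP) under common geometric transformations and improved annotation consistency for polygon-based segmentation.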

Comments 10 pages, 6 figures

Abstract

Geometric data augmentation is widely used in segmentation workflows, but polygon annotations are often assumed to remain valid after transformation. This assumption can fail in structured domains such as architectural floorplan analysis, where a region may contain an interior void encoded as part of a single ordered polygon chain. Cropping or clipping can remove bridge vertices in this chain, causing one semantic region to split into disconnected components. We propose a lightweight topology-preserving augmentation strategy that repairs missing adjacency relations in index space while preserving the original vertex order. The method adds minimal overhead and can be integrated into existing preprocessing workflows. Experiments show that the proposed approach achieves near-perfect Cyclic Adjacency Preservation (CAP) across common geometric transformations and improves annotation consistency in polygon-based segmentation.

2603.10881 2026-05-18 cs.LG

LAtte: Hyperbolic Lorentz Attention for Cross-Subject EEG Classification

Ahmad Bdeir, Johannes Burchert, Tom Hanika, Lars Schmidt-Thieme, Niels Landwehr

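AI summary: This paper proposes LAtte, a framework that addresses the generalization challenge in cross-subject electroencephalogram (EEG) classification. It combines Lorentz attention with a hyperbolic InceptionTime-based encoder and decomposes EEG signals into a baseline component and task-relevant deviations to obtain more structured representations. The model also adds subject-specific low-rank adaptation modules, together with a Lorentz boost-based mechanism and hyperbolic projection layers, to improve robustness and adaptability. Across several datasets it improves on state-of-the-art methods for smaller datasets and maintains performance for larger ones.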

Abstract

Electroencephalogram (EEG) classification plays a key role in medical diagnosis and brain-computer interfaces, but remains challenging due to low signal-to-noise ratios and high inter-subject variability. As a result, many existing approaches rely on subject-specific models, which fail to exploit shared structure in neural signals and do not generalize to unseen subjects. To address these limitations, we propose LAtte, a framework that combines Lorentz attention with a hyperbolic InceptionTime-based encoder to improve cross-subject generalization in EEG classification. The model explicitly decomposes EEG signals into a learned baseline component and task-relevant deviations, enabling more structured representation learning. To further improve robustness and adaptability, we incorporate subject-specific low-rank adaptation (LoRA) modules at both encoder and decoder levels, augmented with a Lorentz boost-based LoRA mechanism and hyperbolic projection layers to reduce overfitting in geometric representations. We evaluate LAtte with and without finetuning in three settings: subject-specific, subject-conditional, and leave-one-subject-out (LOSO) on five established EEG datasets, achieving a consistent improvement in performance over current state-of-the-art methods for smaller datasets and maintaining performance for larger datasets.

2603.08063 2026-05-18 cs.CV

SkyLink: A Large Vision-Language Model Driven Re-ranking Framework for Cross-View UAV geolocalization

Bowen Liu, Pengyue Jia, Wanyu Wang, Derong Xu, Jiawei Cheng, Jiancheng Dong, Xiao Han, Zimo Zhao, Chao Zhang, Bowen Yu, Fangyu Hong, Xiangyu Zhao

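AI summary: SkyLink is a re-ranking framework for cross-view UAV geolocalization driven by a Large Vision-Language Model (LVLM), designed to improve matching accuracy between UAV and satellite images. It models the visual-semantic relationships between views to achieve more effective cross-view matching, and it introduces a relational-aware loss that strengthens the model's discriminative capacity and training stability. Experiments show that SkyLink significantly improves the re-ranking performance of existing models on multiple benchmark datasets, particularly in challenging scenarios.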

Abstract

Cross-view UAV geolocalization is fundamentally a challenging large-scale image retrieval task, aiming to determine the geographic coordinates of Unmanned Aerial Vehicle (UAV) queries by matching them against an extensive geo-tagged satellite image database. Most existing methods learn separate feature representations for each view and determine the final prediction using naive heuristics to assess feature similarity, thereby neglecting to model the crucial cross-view relationships. In this paper, we propose SkyLink, a novel plug-and-play ranking framework that pioneers joint relational modeling of inter-view relationships to enhance cross-view UAV geolocalization. SkyLink leverages a Large Vision-Language Model (LVLM) to model the intricate visual-semantic relationships between UAV and satellite views, facilitating effective cross-view matching. To further refine the learning process, we introduce a relational-aware loss. It leverages soft labels to provide a more nuanced supervision signal, mitigating the harsh penalty on near-positive pairs. This approach enhances both training stability and the model's discriminative capacity. Extensive experiments conducted across multiple base retrieval architectures and benchmark datasets demonstrate that SkyLink significantly boosts the ranking effectiveness of existing models, consistently achieving superior performance in various challenging scenarios.

2603.07514 2026-05-18 cs.LG cs.AI cs.CV

A Unified View of Score-Based and Drifting Models

Chieh-Hsin Lai, Bac Nguyen, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon, Molei Tao

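AI summary: This paper examines the connection between drifting models and score-based generative models, showing that drifting is essentially a score-matching objective on smoothed distributions. With Gaussian kernels, the mean-shift field exactly equals the difference between the scores of the smoothed data and model distributions, a result that follows from Tweedie's formula. For the Laplace kernel used in practice, theory and experiments both show that the residual term is negligible in high dimension, so drifting as applied in practice is nearly score-based. The work gives a unified view of these generative models and identifies the structural similarities and differences between drifting and diffusion models in their transport directions.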

Abstract

Drifting models train one-step generators by optimizing a kernel-induced mean-shift discrepancy between the data and model distributions, with Laplace kernels used by default in practice. At each point, this discrepancy compares the kernel-weighted displacement toward nearby data samples with the corresponding displacement toward nearby model samples, thereby defining a transport direction for generated samples. In this paper, we show that drifting is more closely connected to score-based generative modeling than it may first appear, establishing a precise link to the score-matching principle underlying diffusion models. For Gaussian kernels, the population mean-shift field exactly equals the difference between the scores (i.e., the gradient-log-densities) of the Gaussian-smoothed data and model distributions. This identity follows from Tweedie's formula, which links the score of a Gaussian-smoothed density to its conditional mean, and implies that Gaussian-kernel drifting is exactly a score-matching objective on smoothed distributions. More generally, we derive an exact decomposition for radial kernels in which mean shift equals a score-based field plus a residual term. For the practical Laplace kernel, we further show theoretically and empirically that this residual is negligible in high dimension, implying that the transport field used in practice is nearly score-based. Our results reveal a structural connection to diffusion models: both methods use score-mismatch transport directions, but drifting realizes the score nonparametrically through kernel-based estimates, whereas diffusion models learn it parametrically with neural networks.
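The Gaussian-kernel identity stated in the abstract can be written out as a short derivation. This is a sketch under assumed notation, not the paper's own formulation: the kernel $k_\sigma$, the normalized mean-shift field $V_p$, and the smoothed densities $p_\sigma$, $q_\sigma$ are symbols introduced here, and the factor $\sigma^2$ may be absorbed differently under the paper's normalization.

```latex
% Assumed notation (not taken from the paper):
%   k_\sigma(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))      Gaussian kernel
%   p_\sigma = p * \mathcal{N}(0, \sigma^2 I)               Gaussian-smoothed data density
%   q_\sigma = q * \mathcal{N}(0, \sigma^2 I)               Gaussian-smoothed model density
%   \mathbb{E}_p[y \mid x]: posterior mean of y \sim p given x = y + \sigma\varepsilon
\begin{align*}
V_p(x) &= \frac{\mathbb{E}_{y \sim p}\!\left[k_\sigma(x, y)\,(y - x)\right]}
               {\mathbb{E}_{y \sim p}\!\left[k_\sigma(x, y)\right]}
        = \mathbb{E}_p[y \mid x] - x, \\
\nabla_x \log p_\sigma(x) &= \frac{\mathbb{E}_p[y \mid x] - x}{\sigma^2}
        \qquad \text{(Tweedie's formula)}, \\
V_p(x) - V_q(x) &= \sigma^2 \left( \nabla_x \log p_\sigma(x) - \nabla_x \log q_\sigma(x) \right).
\end{align*}
```

The first line holds because the posterior of $y$ given $x$ is proportional to $p(y)\,k_\sigma(x, y)$, so the kernel-weighted average is a posterior mean.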

2603.03243 2026-05-18 cs.RO

HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations

Xiaomeng Xu, Jisang Park, Han Zhang, Eric Cousineau, Aditya Bhat, Jose Barreiros, Dian Wang, Jeannette Bohg, Shuran Song

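AI summary: This paper proposes HoMMI, a framework that learns whole-body mobile manipulation directly from robot-free human demonstrations. It augments UMI interfaces with egocentric sensing, which enables portable and scalable data collection but also widens the human-to-robot embodiment gap. To bridge that gap, the authors design a cross-embodiment hand-eye policy comprising an embodiment-agnostic visual representation, a relaxed head action representation, and a controller that coordinates whole-body motion, enabling policy transfer for complex mobile manipulation tasks.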

Abstract

We present Whole-Body Mobile Manipulation Interface (HoMMI), a data collection and policy learning framework that learns whole-body mobile manipulation directly from robot-free human demonstrations. We augment UMI interfaces with egocentric sensing to capture the global context required for mobile manipulation, enabling portable, robot-free, and scalable data collection. However, naively incorporating egocentric sensing introduces a larger human-to-robot embodiment gap in both observation and action spaces, making policy transfer difficult. We explicitly bridge this gap with a cross-embodiment hand-eye policy design, including an embodiment agnostic visual representation; a relaxed head action representation; and a whole-body controller that realizes hand-eye trajectories through coordinated whole-body motion under robot-specific physical constraints. Together, these enable long-horizon mobile manipulation tasks requiring bimanual and whole-body coordination, navigation, and active perception. Results are best viewed on: https://hommi-robot.github.io

2603.01283 2026-05-18 cs.AI cs.LG

The Informational Cost of Agency: A Bounded Measure of Interaction Efficiency for Deployed Reinforcement Learning

Wael Hafez, Cameron Reid, Amit Nazeri

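AI summary: This paper proposes Bipredictability (denoted P), an information-theoretic metric that quantifies how efficiently the closed-loop interaction between an agent and its environment converts uncertainty into shared predictability. The metric has a provable upper bound of 0.5, and responsive agency is shown to suppress P below this ceiling, an effect termed the "informational cost of agency". Experiments indicate that P applies not only to reinforcement learning systems but also to language models, vision systems, and other domains. In addition, the Information Digital Twin (IDT) architecture built on P detects system degradation at a higher rate and with lower latency, offering a new reliability assessment tool for deployed autonomous systems.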

Comments 12 pages, 2 figures

Abstract

Deployed reinforcement learning systems lack a principled runtime reliability theory. We close this gap by introducing Bipredictability, $P$, a closed-form information-theoretic metric that quantifies how efficiently a closed-loop interaction between agent and environment converts uncertainty into shared predictability. $P$ admits a provable classical bound $P \le 0.5$, derived from Shannon entropy subadditivity, and responsive agency necessarily suppresses $P$ below this ceiling, a structural prediction we term the informational cost of agency. Across 21 trained continuous control agents, we confirm this prediction empirically at $P = 0.33 \pm 0.02$. The same suppression signature reproduces in language model dialogue, convolutional vision systems, and classical mechanical baselines, indicating that $P$ captures a substrate-independent property of agentic interaction rather than an algorithm-specific artifact. The Information Digital Twin (IDT), a model-agnostic architecture that computes $P$ from the external interaction stream, detects 89.3% of coupling degradations against 44.0% for reward-based monitoring, with 4.4 times lower latency. $P$ provides the missing measurement layer for runtime reliability and closed-loop self-regulation in deployed autonomous systems.

2602.23409 2026-05-18 cs.LG cs.AI cs.ET quant-ph

Long Range Frequency Tuning for QML

Michael Poppel, Markus Baumann, Sebastian Wölckert, Claudia Linnhoff-Popien, Jonas Stein

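AI summary: This work addresses frequency encoding in variational quantum circuits and proposes a new initialization method to improve their ability to fit high-frequency functions. Fixed encodings require many gate operations, and trainable-frequency circuits, though promising, are limited because the spectral gap restricts what gradient descent can achieve. The proposed ternary grid initialization sets the frequency prefactors so that the effect of the spectral gap is removed, substantially improving model performance. Experiments show that it outperforms existing methods on synthetic and real-world datasets.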

Abstract

Angle-encoded variational quantum circuits admit a truncated Fourier series representation of their output, but approximating functions with maximum frequency $ω_{\max}$ using fixed unary encoding requires $\mathcal{O}(ω_{\max})$ encoding gates. Trainable-frequency (TF) circuits promise a reduction by learning the data-encoding prefactors alongside the ansatz parameters, adapting the accessible frequency spectrum to the target during training. We identify a practical barrier that prevents this promise from being realized: the prefactor gradient is suppressed by the spectral gap between the circuit's accessible frequencies and the target spectrum, independently of the ansatz parameters, confining gradient-driven prefactor movement to a narrow neighborhood of initialization. We propose \emph{ternary grid initialization} -- setting prefactors to $\{1, 3, 9, \ldots, 3^{k-1}\}$ -- which resolves this limitation by ensuring every target frequency within $[-ω_{\max}, ω_{\max}]$ lies within $\tfrac{1}{2}$ unit of a grid point at initialization, removing the spectral gap suppression by construction. On a synthetic benchmark with target frequencies shifted well beyond the standard initialization range, ternary initialization achieves median $R^2 = 0.997$ versus $0.18$ for unary initialization, with $100\%$ of runs achieving $R^2 > 0.95$ against $0\%$. CMA-ES with $20\times$ the evaluation budget reaches only $25\%$ success, confirming the limitation is a property of the optimization landscape rather than of gradient-based optimization specifically. Real-world validation on two benchmark datasets demonstrates consistent advantages over both fixed and trainable unary baselines.
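The coverage property of the ternary grid can be checked with a few lines of Python. This is a sketch under an assumption that is not stated in the abstract: each encoding gate with prefactor w is taken to shift the accessible frequency by -w, 0, or +w, so the accessible spectrum is the set of all signed sums of prefactors. The helper name `accessible_frequencies` is hypothetical.

```python
from itertools import product


def accessible_frequencies(prefactors):
    """Return all sums of c_i * w_i with c_i in {-1, 0, 1}, sorted.

    Assumption for illustration: each encoding gate contributes a
    frequency shift of -w, 0, or +w for its prefactor w.
    """
    sums = {
        sum(c * w for c, w in zip(coeffs, prefactors))
        for coeffs in product((-1, 0, 1), repeat=len(prefactors))
    }
    return sorted(sums)


k = 3
ternary = [3 ** i for i in range(k)]  # prefactors 1, 3, 9
unary = [1] * k                       # prefactors 1, 1, 1

ternary_freqs = accessible_frequencies(ternary)
unary_freqs = accessible_frequencies(unary)

print(len(ternary_freqs), max(ternary_freqs))  # 27 13
print(len(unary_freqs), max(unary_freqs))      # 7 3
```

Under this assumption, k ternary prefactors reach every integer frequency up to (3^k - 1)/2, so any real target in that range is within 1/2 of a reachable frequency, whereas k unary prefactors only reach up to k.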