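arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.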

2510.20042 2026-06-04 cs.CV

Exposing Blindspots: Cultural Bias Evaluation in Generative Image Models

Huichan Seo, Sieun Choi, Minki Hong, Yi Zhou, Junseo Kim, Lukman Ismaila, Naome Etori, Mehul Agarwal, Zhixuan Liu, Jihie Kim, Jean Oh

Affiliations: Carnegie Mellon University; Dongguk University; Delft University of Technology; Johns Hopkins University, School of Medicine; University of Minnesota–Twin Cities; Lavoro AI

AI summary: Proposes a unified evaluation framework that combines automatic metrics, culture-aware VQA, and expert human judgments to assess cultural bias in text-to-image and image-to-image models across countries, eras, and categories. It finds that models default to Global-North, modern depictions, that iterative editing erodes cultural fidelity, and that edits apply only superficial cues.

Comments 28 pages, 8 figures. Accepted at IASEAI 2026. Huichan Seo, Sieun Choi, and Minki Hong contributed equally

Abstract

Generative image models produce striking visuals yet often misrepresent culture. Prior work has examined cultural bias mainly in text-to-image (T2I) systems, leaving image-to-image (I2I) editors underexplored. We bridge this gap with a unified evaluation across six countries, an 8-category/36-subcategory schema, and era-aware prompts, auditing both T2I generation and I2I editing under a standardized protocol that yields comparable diagnostics. Using open models with fixed settings, we derive cross-country, cross-era, and cross-category evaluations. Our framework combines standard automatic metrics, a culture-aware retrieval-augmented VQA, and expert human judgments collected from native reviewers. To enable reproducibility, we release the complete image corpus, prompts, and configurations. Our study reveals three findings: (1) under country-agnostic prompts, models default to Global-North, modern-leaning depictions that flatten cross-country distinctions; (2) iterative I2I editing erodes cultural fidelity even when conventional metrics remain flat or improve; and (3) I2I models apply superficial cues (palette shifts, generic props) rather than era-consistent, context-aware changes, often retaining source identity for Global-South targets. These results highlight that culture-sensitive edits remain unreliable in current systems. By releasing standardized data, prompts, and human evaluation protocols, we provide a reproducible, culture-centered benchmark for diagnosing and tracking cultural bias in generative image models. Project page: https://seochan99.github.io/ECB


2603.12433 2026-06-04 cs.CV cs.AI cs.LG

Revisiting Model Stitching In the Foundation Model Era

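Zheda Mai, Ke Zhang, Fu-En Wang, Zixiao Ken Wang, Albert Y. C. Chen, Lu Xia, Min Sun, Wei-Lun Chao, Cheng-Hao Kuo

Affiliations: The Ohio State University; Boston University; Amazon

AI summary: Studies the stitchability of vision foundation models (e.g., CLIP, DINOv2, SigLIP 2) under a systematic protocol, proposes stitching with a feature-matching loss at the target model's penultimate layer, and builds the VFM Stitch Tree (VST) to offer an accuracy-latency trade-off when multimodal LLMs use multiple VFMs.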

Comments Accepted by CVPR 2026

Abstract

Model stitching, connecting early layers of one model (source) to later layers of another (target) via a light stitch layer, has served as a probe of representational compatibility. Prior work finds that models trained on the same dataset remain stitchable (negligible accuracy drop) despite different initializations or objectives. We revisit stitching for Vision Foundation Models (VFMs) that vary in objectives, data, and modality mix (e.g., CLIP, DINOv2, SigLIP 2) and ask: Are heterogeneous VFMs stitchable? We introduce a systematic protocol spanning the stitch points, stitch layer families, training losses, and downstream tasks. Three findings emerge. (1) Stitch layer training matters: conventional approaches that match the intermediate features at the stitch point or optimize the task loss end-to-end struggle to retain accuracy, especially at shallow stitch points. (2) With a simple feature-matching loss at the target model's penultimate layer, heterogeneous VFMs become reliably stitchable across vision tasks. (3) For deep stitch points, the stitched model can surpass either constituent model at only a small inference overhead (for the stitch layer). Building on these findings, we further propose the VFM Stitch Tree (VST), which shares early layers across VFMs while retaining their later layers, yielding a controllable accuracy-latency trade-off for multimodal LLMs that often leverage multiple VFMs. Taken together, our study elevates stitching from a diagnostic probe to a practical recipe for integrating complementary VFM strengths and pinpointing where their representations align or diverge.


2512.14099 2026-06-04 cs.CV

ViewMask-1-to-3: Multi-View Consistent Image Generation via Multimodal Discrete Diffusion Models

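Ruishu Zhu, Zhihao Huang, Jiacheng Sun, Ping Luo, Hongyuan Zhang, Xuelong Li

Affiliations: Nanyang Technological University

AI summary: Proposes ViewMask-1-to-3, which formulates multi-view generation as discrete sequence prediction. Using MAGVIT-v2 visual tokens and discrete diffusion via masked token prediction, it generates views progressively through iterative unmasking without specialized architectures or 3D geometric priors, and outperforms the baseline on the GSO and 3D-FUTURE benchmarks.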

Comments Accepted by ICML 2026

Abstract

Motivated by discrete diffusion's success in language-vision modeling, we explore its potential for multi-view generation, a task dominated by continuous approaches. We introduce ViewMask-1-to-3, formulating multi-view generation as a discrete sequence modeling problem where each viewpoint is represented as visual tokens from MAGVIT-v2. Through discrete diffusion via masked token prediction, our approach enables progressive multi-view generation via iterative token unmasking, unifying language and vision in a shared token space. Importantly, simple random masking combined with self-attention naturally encourages cross-view consistency without specialized architectures or 3D geometric priors. Our method outperforms the baseline on the GSO and 3D-FUTURE benchmarks, ranking first on average across standard image metrics, and achieving a 10.6% higher IoU than continuous diffusion models on 3D-FUTURE. Furthermore, the proposed framework can be naturally extended to support text-to-image generation and multimodal understanding, highlighting its potential toward a more unified paradigm for multimodal understanding and generation.
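
Illustrative sketch (not from the paper): the loop below shows one common way to realize "progressive generation via iterative token unmasking". The abstract does not specify the unmasking schedule, so the confidence-ranked schedule, the `predict` callback, and the `toy_predict` stand-in are all assumptions made for illustration; a real system would call a trained token predictor over MAGVIT-v2 tokens instead.

```python
import math
from typing import Callable, List, Optional, Tuple

Token = Optional[int]  # None marks a position that is still masked


def iterative_unmask(
    length: int,
    predict: Callable[[List[Token]], List[Tuple[int, float]]],
    steps: int,
) -> List[Token]:
    """Start fully masked and reveal the most confident predictions each step.

    `predict` is a hypothetical model call returning one (token, confidence)
    pair per position, given the current partially masked sequence.
    """
    tokens: List[Token] = [None] * length
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t is None]
        if not masked:
            break
        preds = predict(tokens)
        # Spread the remaining masked positions evenly over the remaining steps.
        k = math.ceil(len(masked) / (steps - step))
        most_confident = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in most_confident:
            tokens[i] = preds[i][0]
    return tokens


def toy_predict(tokens: List[Token]) -> List[Tuple[int, float]]:
    """Stand-in predictor: a fixed token and a confidence that decays with position."""
    return [(i % 7, 1.0 / (1 + i)) for i in range(len(tokens))]


result = iterative_unmask(length=10, predict=toy_predict, steps=4)
```

With the toy predictor, earlier positions are revealed first because their confidence is higher; with a real model, the order would depend on what has already been unmasked.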


2405.15454 2026-06-04 cs.CL cs.SY eess.SY

LiSeCo: Linear Semantic Control for Language Generation

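Emily Cheng, Carmen Amo Alonso

Affiliations: Universitat Pompeu Fabra; Stanford University

AI summary: Proposes LiSeCo, a lightweight, gradient-free linear semantic control method that uses control theory to intervene online on activations in embedding space, steering the generation trajectory into a predefined safe semantic region for efficient text generation control with performance guarantees.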

Comments TMLR 2026 camera ready; earlier version in NeurIPS MINT Workshop 2024

Abstract

The prevalence of Large Language Models (LLMs) in critical applications highlights the need for controlled language generation methods that are computationally efficient and come with performance guarantees. To address this need, we use a common model of concept semantics as linearly represented in an LLM's latent space. In particular, we take the view that natural language generation traces a trajectory in this continuous semantic space, realized by the language model's hidden activations. This view permits a control-theoretic treatment of text generation in latent space, in which we propose Linear Semantic Control (LiSeCo), a lightweight, gradient-free intervention that dynamically steers trajectories away from regions corresponding to undesired meanings. In particular, we propose to intervene directly, in an online fashion, on the embedding-space activations of the token being generated. Crucially, LiSeCo does not simply steer activations towards a desirable region. Instead, it relies on classical techniques from control theory to precisely control activations in a context-dependent way, and guarantees that they are brought into a specific pre-defined region of embedding space that corresponds to allowed semantics. The intervention is computed in closed form according to an optimal controller formulation, minimally impacting generation time. This control of the activations in embedding space allows for fine-grained steering of attributes of the generated sequence. We demonstrate that our approach is effective on toxicity, sentiment, and language (English/Spanish) steering tasks while maintaining text quality.
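
Illustrative sketch (not the paper's controller): to show what a closed-form, gradient-free intervention on an activation can look like, the code below assumes a single linear probe `(w, b)` that scores an activation `x` as `w·x + b`, and an allowed region `w·x + b <= tau`. Under those assumptions, the smallest Euclidean change that brings `x` into the region is a projection onto a half-space. The probe, the threshold, and the function name are hypothetical.

```python
from typing import List


def steer_activation(x: List[float], w: List[float], b: float, tau: float) -> List[float]:
    """Return the minimum-norm edit of x such that w.x + b <= tau.

    (w, b) is a hypothetical linear probe for an undesired attribute;
    tau is the largest probe score considered acceptable.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    if score <= tau:
        return list(x)  # already inside the allowed region: leave untouched
    w_norm_sq = sum(wi * wi for wi in w)
    if w_norm_sq == 0.0:
        raise ValueError("probe direction w must be non-zero")
    # Closed-form step along w that lands exactly on the boundary w.x + b = tau.
    step = (score - tau) / w_norm_sq
    return [xi - step * wi for xi, wi in zip(x, w)]


steered = steer_activation(x=[2.0, 0.0], w=[1.0, 0.0], b=0.0, tau=0.5)
unchanged = steer_activation(x=[0.1, 3.0], w=[1.0, 0.0], b=0.0, tau=0.5)
```

The edit is context-dependent in the sense that its size depends on how far the current activation lies outside the region, and it is zero when the activation is already allowed.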


2603.09803 2026-06-04 cs.LG

Good Reasoning Makes Good Demonstrations: Implicit Reasoning Quality Supervision via In-Context Reinforcement Learning

Tiehua Mei, Minxuan Lv, Leiyu Pan, Zhenpeng Su, Hongru Hou, Hengrui Chen, Ao Xu, Deqing Yang

Affiliations: School of Data Science, Fudan University; University of Chinese Academy of Sciences; College of Intelligence and Computing, Tianjin University

AI summary: Proposes In-Context RLVR, which uses the policy model's own in-context learning ability to measure Demonstration Utility and improves reasoning quality and accuracy through implicit reward reweighting.

Comments Accepted at ACL 2026

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) improves reasoning in large language models but treats all correct solutions equally, potentially reinforcing flawed traces that arrive at correct answers by chance. We observe that better reasoning makes better demonstrations: high-quality solutions serve as more effective in-context examples than low-quality ones. We term this teaching ability Demonstration Utility, and show that the policy model's own in-context learning ability provides an efficient way to measure it, yielding a quality signal termed Evidence Gain. To leverage this signal during training, we introduce In-Context RLVR, which prepends demonstrations before each rollout. Theoretically, we prove that this simple input modification implicitly reweights rewards by a factor approximately proportional to Evidence Gain, assigning higher weights to high-quality traces without requiring costly computation. Experiments on mathematical reasoning benchmarks demonstrate consistent improvements in both accuracy and reasoning quality over standard RLVR baselines. Our code and datasets are available at https://github.com/Mithas-114/IC-DAPO.


2603.09493 2026-06-04 cs.CV cs.AI

EvoPrompt: Guided Prompt Evolution for Vision-Language Models Adaptation

Enming Zhang, Jiayang Li, Yanlong Wang, Yanru Wu, Zhenyu Liu, Yang Li

Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University; Sun Yat-sen University; Chinese University of Hong Kong, Shenzhen

AI summary: Proposes EvoPrompt, a framework that guides the evolutionary path of prompts and decouples low-rank updates into directional and magnitude components, enabling forgetting-free fine-tuning of vision-language models in few-shot learning while preserving zero-shot capability.

Abstract

The adaptation of large-scale vision-language models (VLMs) to downstream tasks with limited labeled data remains a significant challenge. While parameter-efficient prompt learning methods offer a promising path, they often suffer from catastrophic forgetting of pre-trained knowledge. Toward addressing this limitation, our work is grounded in the insight that governing the evolutionary path of prompts is essential for forgetting-free adaptation. To this end, we propose EvoPrompt, a novel framework designed to explicitly steer the prompt trajectory for knowledge-preserving fine-tuning. Specifically, our approach employs a Modality-Shared Prompt Projector (MPP) to generate hierarchical prompts from a unified embedding space. Critically, an evolutionary training strategy decouples low-rank updates into directional and magnitude components, preserving early-learned semantic directions while only adapting their magnitude, thus enabling prompts to evolve without discarding foundational knowledge. This process is further stabilized by Feature Geometric Regularization (FGR), which enforces feature decorrelation to prevent representation collapse. Extensive experiments demonstrate that EvoPrompt achieves state-of-the-art performance in few-shot learning while robustly preserving the original zero-shot capabilities of pre-trained VLMs.
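
Illustrative sketch (not the paper's training code): the idea of splitting an update into a direction and a magnitude, then adapting only the magnitude, can be shown on a single vector. The abstract applies this to low-rank prompt updates; the flat-vector setting, the function names, and the plain gradient step below are simplifying assumptions.

```python
import math
from typing import List, Tuple


def decompose(update: List[float]) -> Tuple[float, List[float]]:
    """Split a vector into its magnitude and unit-norm direction."""
    mag = math.sqrt(sum(u * u for u in update))
    if mag == 0.0:
        raise ValueError("cannot take the direction of a zero vector")
    return mag, [u / mag for u in update]


def recompose(mag: float, direction: List[float]) -> List[float]:
    """Rebuild the vector from magnitude and direction."""
    return [mag * d for d in direction]


def magnitude_only_step(mag: float, direction: List[float],
                        grad: List[float], lr: float) -> float:
    """One gradient step on the magnitude while the direction stays frozen.

    Since update = mag * direction, the chain rule gives
    d(loss)/d(mag) = grad . direction, where grad is d(loss)/d(update).
    """
    grad_mag = sum(g * d for g, d in zip(grad, direction))
    return mag - lr * grad_mag


mag, direction = decompose([3.0, 4.0])  # magnitude 5.0, direction [0.6, 0.8]
new_mag = magnitude_only_step(mag, direction, grad=[1.0, 0.0], lr=0.5)
new_update = recompose(new_mag, direction)
```

Because only the scalar changes, the updated vector stays parallel to the earlier one, which is the sense in which the learned direction is preserved.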


2603.09391 2026-06-04 cs.SD cs.AI eess.AS

Physics-Informed Neural Engine Sound Modeling with Differentiable Pulse-Train Synthesis

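Robin Doerfler, Lonce Wyse

Affiliations: GitHub

AI summary: Proposes the Pulse-Train-Resonator (PTR) model, a differentiable synthesis architecture that directly models engine pulse shapes and temporal structure, using physics-informed inductive biases to improve harmonic reconstruction and reduce total loss.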

Comments Revised version; to appear in the Proceedings of the 34th European Signal Processing Conference (EUSIPCO 2026)

Abstract

Engine sounds originate from sequential exhaust pressure pulses rather than sustained harmonic oscillations. While neural synthesis methods typically aim to approximate the resulting spectral characteristics, we propose directly modeling the underlying pulse shapes and temporal structure. We present the Pulse-Train-Resonator (PTR) model, a differentiable synthesis architecture that generates engine audio as parameterized pulse trains aligned to engine firing patterns and propagates them through recursive Karplus-Strong resonators simulating exhaust acoustics. The architecture integrates physics-informed inductive biases including harmonic decay, thermodynamic pitch modulation, valve-dynamics envelopes, exhaust system resonances and derived engine operating modes such as throttle operation and Deceleration Fuel Cutoff (DFCO). Validated on three diverse engine types totaling 7.5 hours of audio, PTR achieves a 21% improvement in harmonic reconstruction and a 5.7% reduction in total loss over a harmonic-plus-noise baseline model, while providing interpretable parameters corresponding to physical phenomena. Complete code, model weights, and audio examples are openly available.
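
Illustrative sketch (not the PTR model): the signal flow described above, a pulse train aligned to engine firing and passed through a recursive Karplus-Strong style resonator, can be written in a few lines of plain Python. This version is neither learned nor differentiable. The unit impulses, the four-stroke firing formula, and all parameter values are assumptions chosen for illustration.

```python
from typing import List


def firing_period_samples(sample_rate: int, rpm: float, cylinders: int) -> int:
    """Samples between firing events, assuming a four-stroke engine
    (each cylinder fires once every two crankshaft revolutions)."""
    firing_hz = (rpm / 60.0) * (cylinders / 2.0)
    return round(sample_rate / firing_hz)


def pulse_train(num_samples: int, period: int) -> List[float]:
    """Unit impulses every `period` samples (a stand-in for shaped exhaust pulses)."""
    return [1.0 if n % period == 0 else 0.0 for n in range(num_samples)]


def karplus_strong(x: List[float], delay: int, feedback: float = 0.9) -> List[float]:
    """Recursive resonator: y[n] = x[n] + feedback * 0.5 * (y[n-delay] + y[n-delay-1]).

    The two-tap average in the feedback path is the classic Karplus-Strong
    low-pass, so higher harmonics decay faster than lower ones.
    """
    y = [0.0] * len(x)
    for n in range(len(x)):
        d1 = y[n - delay] if n >= delay else 0.0
        d2 = y[n - delay - 1] if n >= delay + 1 else 0.0
        y[n] = x[n] + feedback * 0.5 * (d1 + d2)
    return y


period = firing_period_samples(sample_rate=16000, rpm=3000.0, cylinders=4)
excitation = pulse_train(num_samples=400, period=period)
audio = karplus_strong(excitation, delay=50, feedback=0.9)
```

At 3000 RPM a four-cylinder four-stroke engine fires 100 times per second, so at a 16 kHz sample rate the pulses are 160 samples apart; the delay length sets the resonance of the simulated exhaust path.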


2603.09242 2026-06-04 cs.CV

When Detectors Forget Forensics: Blocking Semantic Shortcuts for Generalizable AI-Generated Image Detection

Chao Shuai, Shaojing Fan, Chenlin Zou, Bin Gong, Weichen Lian, Xiuli Bi, Zhenguang Liu, Zhongjie Ba, Kui Ren

Affiliations: State Key Laboratory of Blockchain and Data Security, Zhejiang University; National University of Singapore; Chongqing University of Posts and Telecommunications

AI summary: Proposes the Geometric Semantic Decoupling (GSD) framework, which suppresses semantically dominant directions to promote invariant forensic representations, addressing the poor generalization that semantic fallback causes when pre-trained vision foundation models are used for AI-generated image detection.

Abstract

The growing realism of generative models has blurred the boundary between real and synthetic content, posing significant challenges to reliable AI-generated image detection. Although large-scale pre-trained Vision Foundation Models have advanced detection capability, their generalization to images from unseen generation pipelines remains inadequate. In this paper, we identify, for the first time, a key failure mechanism, termed semantic fallback, wherein forensic fine-tuning fails to fully reshape the representation space. Consequently, the resulting representations remain organized along high-level semantic structures rather than manipulation-specific forensic cues. Building on this insight, we propose a Geometric Semantic Decoupling (GSD) framework, which explicitly suppresses semantically dominant directions, thereby promoting invariant forensic representations. Specifically, GSD leverages a frozen CLIP encoder to estimate the dominant semantic subspace via Singular Value Decomposition (SVD). It then suppresses the semantic components through a geometry-constrained formulation with the suppression strength adaptively modulated across samples and layers. We further introduce a mini-batch SVD approximation strategy that amortizes subspace estimation, achieving over a 15× reduction in computational overhead while preserving effectiveness. Finally, considering practical scenarios spanning both large-scale and online evaluation, we develop three inference protocols, batch, per-sample, and reference-based inference, and demonstrate that they induce consistent semantic decoupling, yielding a stable forgery-oriented feature manifold.
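
Illustrative sketch (not the GSD implementation): the core geometric step, estimating a dominant direction from a batch of features and removing each feature's component along it, can be shown with a rank-1 simplification. The abstract describes a full subspace from SVD and an adaptive suppression strength; here a single direction is found by power iteration (which converges to the top right singular vector of the feature matrix), and `strength` is a fixed hypothetical constant.

```python
import math
from typing import List


def dominant_direction(rows: List[List[float]], iters: int = 100) -> List[float]:
    """Top right singular vector of the feature matrix via power iteration on X^T X.

    The uniform starting vector is an arbitrary choice; it only fails in the
    degenerate case where it is exactly orthogonal to the dominant direction.
    """
    dim = len(rows[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        proj = [sum(r[j] * v[j] for j in range(dim)) for r in rows]  # X v
        u = [sum(proj[i] * rows[i][j] for i in range(len(rows)))     # X^T (X v)
             for j in range(dim)]
        norm = math.sqrt(sum(c * c for c in u))
        if norm == 0.0:
            break
        v = [c / norm for c in u]
    return v


def suppress(feature: List[float], direction: List[float],
             strength: float = 1.0) -> List[float]:
    """Remove `strength` times the feature's component along a unit direction."""
    coeff = sum(f * d for f, d in zip(feature, direction))
    return [f - strength * coeff * d for f, d in zip(feature, direction)]


features = [[3.0, 0.1], [-2.0, 0.0], [4.0, -0.1]]  # toy stand-in for encoder features
semantic_dir = dominant_direction(features)
decoupled = suppress([5.0, 1.0], semantic_dir, strength=1.0)
```

With `strength=1.0` the result is orthogonal to the estimated direction; values between 0 and 1 would attenuate the component rather than remove it.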


2603.09170 2026-06-04 cs.RO cs.AI

ZeroWBC: Learning Natural Whole-Body Humanoid Interaction from Human Egocentric Data

Haoran Yang, Jiacheng Bao, Yucheng Xin, Haoming Song, Yuyang Tian, Bin Zhao, Dong Wang, Xuelong Li

Affiliations: University of Science and Technology of China; Shanghai AI Laboratory; Northwestern Polytechnical University; Tsinghua University; Shanghai Jiao Tong University; TeleAI, China Telecom

AI summary: Proposes ZeroWBC, a framework that uses human egocentric video with synchronized whole-body motion data and a generation-then-tracking approach to achieve teleoperation-free whole-body interaction control for humanoid robots.

Abstract

Achieving versatile and natural whole-body humanoid interaction control remains challenging due to the high cost of whole-body teleoperation data. We present ZeroWBC, a teleoperation-free framework that learns humanoid whole-body interaction from human egocentric videos paired with synchronized whole-body motion and text annotations. ZeroWBC adopts a generation-then-tracking formulation to tackle the static scene whole-body interaction control problem. Given an initial egocentric image and a language instruction, a fine-tuned Vision-Language Model generates future human whole-body motion tokens, which are decoded into continuous motions and retargeted to the humanoid. The resulting reference motions, together with root and key body-part trajectories, are then executed by a general interactive motion tracking policy. To improve interaction performance, we introduce an interaction-oriented tracking reward that prioritizes global root and key body-part trajectory alignment while preserving natural whole-body motion. Experiments on the Unitree G1 humanoid robot show that ZeroWBC enables diverse scene-aware behaviors without robot teleoperation demonstrations. These results suggest a scalable paradigm for learning natural humanoid whole-body interaction from human egocentric data.


2510.27191 2026-06-04 cs.RO cs.AI

Vectorized Online POMDP Planning

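Marcus Hoerger, Muhammad Sudrajat, Hanna Kurniawati

Affiliations: School of Computing, Australian National University

AI summary: Proposes the Vectorized Online POMDP Planner (VOPP), which removes parallelization bottlenecks through fully vectorized computation, enabling massively parallel online planning that is at least 20× more computationally efficient than an existing state-of-the-art parallel solver.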

Comments 8 pages, 3 figures. Accepted at ICRA 2026

Abstract

Planning under partial observability is an essential capability of autonomous robots. The Partially Observable Markov Decision Process (POMDP) provides a powerful framework for planning under partial observability problems, capturing the stochastic effects of actions and the limited information available through noisy observations. POMDP solving could benefit tremendously from massive parallelization on today's hardware, but parallelizing POMDP solvers has been challenging. Most solvers rely on interleaving numerical optimization over actions with the estimation of their values, which creates dependencies and synchronization bottlenecks between parallel processes that can offset the benefits of parallelization. In this paper, we propose Vectorized Online POMDP Planner (VOPP), a novel parallel online solver that leverages a recent POMDP formulation which analytically solves part of the optimization component, leaving numerical computations to consist of only estimation of expectations. VOPP represents all data structures related to planning as a collection of tensors, and implements all planning steps as fully vectorized computations over this representation. The result is a massively parallel online solver with no dependencies or synchronization bottlenecks between concurrent processes. Experimental results indicate that VOPP is at least 20× more efficient in computing near-optimal solutions compared to an existing state-of-the-art parallel online solver. Moreover, VOPP outperforms state-of-the-art sequential online solvers, while using a planning budget that is 1000× smaller.


2603.08485 2026-06-04 cs.RO

3PoinTr: 3D Point Tracks for Learning Manipulation from Unconstrained Human Videos

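Adam Hung, Bardienus Pieter Duisterhof, Jeffrey Ichnowski

Affiliations: Carnegie Mellon University

AI summary: Proposes 3PoinTr, which pretrains sample-efficient robot manipulation policies from unconstrained human videos by predicting dense 3D point tracks, substantially outperforming baselines on real-world and simulated tasks.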

Abstract

Learning manipulation policies from human videos could greatly reduce the need for expensive robot demonstrations, but existing approaches typically require restrictive assumptions such as choreographed human motions, predefined keypoints, manual annotations, or known grasp locations. We propose 3PoinTr, a method for pretraining sample-efficient robot policies from unconstrained human videos by predicting dense 3D point tracks. In the unconstrained human demonstration videos, humans are free to follow whatever trajectories and manipulation strategies they see fit, rather than choreographing their motions to mimic a robot. 3PoinTr uses a lightweight visibility-aware transformer to learn how scene points should move from human videos, and then trains a closed-loop multitask robot policy to flexibly extract action-relevant priors from those predicted point tracks. With only 20 action-labeled robot demonstrations, 3PoinTr achieves a 25.0 percentage point higher average success rate than the strongest behavior cloning and video-pretraining baselines on real-world tasks, and a 29.6 percentage point higher average success rate in simulation. Targeted ablations support the key design choices and confirm the benefit of learning from actionless videos. We further show that 3PoinTr's point track prediction transformer outperforms a strong baseline by preserving supervision over partially occluded points. Project page: https://adamhung60.github.io/3PoinTr/.


2603.07584 2026-06-04 cs.SD cs.LG eess.AS

Analysis-Driven Procedural Generation of an Engine Sound Dataset with Embedded Control Annotations

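Robin Doerfler, Lonce Wyse

Affiliations: rdoerfler

AI summary: Proposes an analysis-driven framework that extracts harmonic structures from real recordings via pitch-adaptive spectral analysis and uses them to drive an extended parametric harmonic-plus-noise synthesizer, generating an engine audio dataset with precisely time-aligned control annotations for data-driven engine sound modeling.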

Comments To appear in the Proceedings of the 34th European Signal Processing Conference (EUSIPCO 2026)

Abstract

Computational engine sound modeling is central to the automotive audio industry, particularly for active sound design applications and virtual prototyping. Emerging data-driven engine sound synthesis methods require large volumes of standardized, clean audio recordings with precisely time-aligned operating-state annotations: data that is difficult to obtain due to high costs, specialized measurement equipment requirements, and inevitable noise contamination. We present an analysis-driven framework for generating engine audio with sample-accurate control annotations. The method extracts harmonic structures from real recordings through pitch-adaptive spectral analysis, which then drive an extended parametric harmonic-plus-noise synthesizer. With this framework, we augment 5-10 min of source audio per engine 15-30x via diverse control trajectories and parametric variation, producing the Procedural Engine Sounds Dataset (19.0 h, 5,935 files): a set of engine audio signals with sample-accurate RPM and torque annotations spanning a wide range of operating conditions, signal complexities, and harmonic profiles. Comparison against real recordings validates that the synthesized data preserves characteristic harmonic structures, and a baseline differentiable synthesis network trained on the dataset confirms its suitability for data-driven engine sound modeling. The dataset is released publicly to support research on engine timbre analysis, control parameter estimation, and neural generative synthesis.


2603.05881 2026-06-04 cs.CL

Confidence Before Answering: A Paradigm Shift for Efficient LLM Uncertainty Estimation

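Changcheng Li, Jiancan Wu, Hengheng Zhang, Zhengsu Chen, Guo An, Junxiang Qiu, Xiang Wang, Qi Tian

Affiliations: University of Science and Technology of China; Huawei Inc.

AI summary: Proposes CoCA, a framework that jointly optimizes confidence calibration and answer accuracy with GRPO reinforcement learning so the model outputs its confidence before answering, making uncertainty estimation more efficient.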

Abstract

Reliable deployment of large language models (LLMs) requires accurate uncertainty estimation. Existing methods are predominantly answer-first, producing confidence only after generating an answer, which measures the correctness of a specific response and limits practical usability. We study a confidence-first paradigm, where the model outputs its confidence before answering, interpreting this score as the model's probability of answering the question correctly under its current policy. We propose CoCA (Co-optimized Confidence and Answers), a GRPO reinforcement learning framework that jointly optimizes confidence calibration and answer accuracy via segmented credit assignment. By assigning separate rewards and group-relative advantages to confidence and answer segments, CoCA enables stable joint optimization and avoids reward hacking. Experiments across math, code, and factual QA benchmarks show improved calibration and uncertainty discrimination while preserving answer quality, thereby enabling a broader range of downstream applications.
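
Illustrative sketch (not the CoCA reward design): the code below shows what "separate rewards and group-relative advantages" for the confidence and answer segments could look like for one group of sampled responses to the same question. The group-relative normalization (reward minus group mean, divided by group standard deviation) follows the usual GRPO recipe. The confidence reward, a negative squared error against the group's empirical accuracy, is an assumption made here because the abstract does not state the reward's form.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def segmented_advantages(
    confidences: List[float], correct: List[bool]
) -> Tuple[List[float], List[float]]:
    """Return (confidence-segment advantages, answer-segment advantages).

    Assumed rewards: 1/0 for answer correctness, and a Brier-style penalty
    measuring how far each stated confidence is from the group's accuracy.
    """
    answer_rewards = [1.0 if c else 0.0 for c in correct]
    accuracy = mean(answer_rewards)
    confidence_rewards = [-(p - accuracy) ** 2 for p in confidences]
    return (group_relative_advantages(confidence_rewards),
            group_relative_advantages(answer_rewards))


conf_adv, ans_adv = segmented_advantages(
    confidences=[0.5, 0.9, 0.1, 0.5],
    correct=[True, False, True, False],
)
```

In this toy group the accuracy is 0.5, so rollouts that stated 0.5 get a positive confidence advantage regardless of whether their own answer was right, while the answer advantage depends only on correctness. Keeping the two signals on separate token segments is what the abstract calls segmented credit assignment.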


2602.12147 2026-06-04 cs.LG

It's TIME: Towards the Next Generation of Time Series Forecasting Benchmarks

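Zhongzheng Qiao, Sheng Pan, Anni Wang, Viktoriya Zhukova, Yong Liu, Xudong Jiang, Qingsong Wen, Mingsheng Long, Ming Jin, Chenghao Liu

Affiliations: University of Science and Technology of China; Tsinghua University

AI summary: To address limitations of existing time series benchmarks in data composition, integrity, task design, and analysis perspective, proposes the TIME benchmark with 50 new datasets and 98 forecasting tasks, a human-in-the-loop pipeline to ensure data integrity, and a pattern-level evaluation perspective, and ranks 12 foundation models at multiple granularities.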

Comments Accepted to ICML 2026. Camera-ready version

Abstract

Time series foundation models (TSFMs) are revolutionizing the forecasting landscape from specific dataset modeling to generalizable task evaluation. However, we contend that existing benchmarks exhibit common limitations in four dimensions: constrained data composition dominated by reused legacy sources, compromised data integrity lacking rigorous quality assurance, misaligned task formulations detached from real-world contexts, and rigid analysis perspectives that obscure generalizable insights. To bridge these gaps, we introduce TIME, a next-generation task-centric benchmark comprising 50 fresh datasets and 98 forecasting tasks, tailored for strict zero-shot TSFM evaluation free from data leakage. Integrating large language models and human expertise, we establish a human-in-the-loop benchmark construction pipeline to ensure high data integrity and redefine task formulation by aligning forecasting configurations with real-world operational requirements and variate predictability. Furthermore, we propose a novel pattern-level evaluation perspective that moves beyond traditional dataset-level evaluations based on static meta labels. By leveraging structural time series features to characterize intrinsic temporal properties, this approach offers generalizable insights into model capabilities across diverse patterns. We evaluate 12 TSFMs and establish a multi-granular leaderboard to facilitate in-depth analysis and visualized inspection. The leaderboard is available at https://huggingface.co/spaces/Real-TSF/TIME-leaderboard.


2603.03482 2026-06-04 cs.CV cs.AI cs.LG

Beyond Pixel Histories: World Models with Persistent 3D State

超越像素历史：具有持久3D状态的世界模型

Samuel Garcin, Thomas Walker, Steven McDonagh, Tim Pearce, Hakan Bilen, Tianyu He, Kaixin Wang, Jiang Bian

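Affiliations: University of Edinburgh; Microsoft Research

AI summary: Proposes PERSIST, a world-model paradigm that simulates the evolution of a latent 3D scene (environment, camera, and renderer). This gives the model persistent spatial memory and consistent geometry, and substantially improves 3D consistency, spatial memory, and long-horizon stability.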

Comments Accepted to the International Conference on Machine Learning (ICML) 2026. To appear in the Proceedings of Machine Learning Research (PMLR). 9 pages


AI中文摘要

交互式世界模型通过响应用户的动作持续生成视频，实现开放式的生成能力。然而，现有模型通常缺乏环境的3D表示，意味着3D一致性必须从数据中隐式学习，且空间记忆受限于有限的时域上下文窗口。这导致不真实的用户体验，并对训练智能体等下游任务构成重大障碍。为解决这一问题，我们提出PERSIST，一种新的世界模型范式，它模拟潜在3D场景（环境、相机和渲染器）的演化。这使得我们能够合成具有持久空间记忆和一致几何的新帧。定量指标和定性用户研究均表明，与现有方法相比，在空间记忆、3D一致性和长期稳定性方面有显著提升，从而实现连贯、演化的3D世界。我们进一步展示了新颖的能力，包括从单张图像合成多样化的3D环境，以及通过直接在3D空间中支持环境编辑和指定，实现对生成体验的细粒度、几何感知控制。项目页面：https://francelico.github.io/persist.github.io

英文摘要

Interactive world models continually generate video by responding to a user's actions, enabling open-ended generation capabilities. However, existing models typically lack a 3D representation of the environment, meaning 3D consistency must be implicitly learned from data, and spatial memory is restricted to limited temporal context windows. This results in an unrealistic user experience and presents significant obstacles to downstream tasks such as training agents. To address this, we present PERSIST, a new paradigm of world model which simulates the evolution of a latent 3D scene: environment, camera, and renderer. This allows us to synthesise new frames with persistent spatial memory and consistent geometry. Both quantitative metrics and a qualitative user study show substantial improvements in spatial memory, 3D consistency, and long-horizon stability over existing methods, enabling coherent, evolving 3D worlds. We further demonstrate novel capabilities, including synthesising diverse 3D environments from a single image, as well as enabling fine-grained, geometry-aware control over generated experiences by supporting environment editing and specification directly in 3D space. Project page: https://francelico.github.io/persist.github.io


2603.03205 2026-06-04 cs.CL

Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use

学习何时行动或拒绝：为安全的多步骤工具使用守护代理推理模型

Aradhye Agarwal, Gurdit Siyan, Yash Pandya, Joykirat Singh, Akshay Nambi, Ahmed Awadallah

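Affiliations: Microsoft Research

AI summary: Proposes MOSAIC, a framework that uses explicit safety reasoning and preference-based reinforcement learning so that agentic models make safe decisions during tool use, reducing harmful behavior while preserving benign-task performance.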

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026)


AI中文摘要

代理语言模型在安全机制上与聊天模型根本不同：它们必须规划、调用工具并执行长期行动，其中单个失误（如访问文件或输入凭据）可能导致不可逆的伤害。现有的对齐方法主要针对静态生成和任务完成进行优化，由于顺序决策、对抗性工具反馈和过度自信的中间推理，在这些设置中失效。我们引入了MOSAIC，一个后训练框架，通过使安全决策显式且可学习，对齐代理以实现安全的多步骤工具使用。MOSAIC将推理构建为规划、检查、然后行动或拒绝的循环，将显式安全推理和拒绝作为第一类行动。为了在没有轨迹级标签的情况下进行训练，我们使用基于偏好的强化学习与成对轨迹比较，这捕获了标量奖励常常忽略的安全区别。我们在三个模型家族（Qwen2.5-7B、Qwen3-4B-Thinking和Phi-4）以及跨分布基准（涵盖有害任务、提示注入、良性工具使用和跨域隐私泄露）上评估了MOSAIC的零样本性能。MOSAIC将有害行为减少高达50%，在注入攻击上将有害任务拒绝率提高超过20%，减少隐私泄露，并保持或改善良性任务性能，展示了跨模型、领域和代理设置的鲁棒泛化能力。

英文摘要

Agentic language models operate in a fundamentally different safety regime than chat models: they must plan, call tools, and execute long-horizon actions where a single misstep, such as accessing files or entering credentials, can cause irreversible harm. Existing alignment methods, largely optimized for static generation and task completion, break down in these settings due to sequential decision-making, adversarial tool feedback, and overconfident intermediate reasoning. We introduce MOSAIC, a post-training framework that aligns agents for safe multi-step tool use by making safety decisions explicit and learnable. MOSAIC structures inference as a plan, check, then act or refuse loop, with explicit safety reasoning and refusal as first-class actions. To train without trajectory-level labels, we use preference-based reinforcement learning with pairwise trajectory comparisons, which captures safety distinctions often missed by scalar rewards. We evaluate MOSAIC zero-shot across three model families, Qwen2.5-7B, Qwen3-4B-Thinking, and Phi-4, and across out-of-distribution benchmarks spanning harmful tasks, prompt injection, benign tool use, and cross-domain privacy leakage. MOSAIC reduces harmful behavior by up to 50%, increases harmful-task refusal by over 20% on injection attacks, cuts privacy leakage, and preserves or improves benign task performance, demonstrating robust generalization across models, domains, and agentic settings.
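
The plan, check, then act or refuse loop described in the abstract can be sketched in a few lines of Python. This is only an illustration and not MOSAIC itself: `Step`, `run_agent`, and `toy_check` are hypothetical names, the plan is fixed up front rather than produced by a model, and the safety check is a hand-written rule standing in for learned safety reasoning.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    tool: str       # name of the tool the agent wants to call
    argument: str   # argument it wants to pass


def run_agent(plan: List[Step], is_safe: Callable[[Step], bool]) -> List[str]:
    """Walk through the plan, checking each step before acting on it."""
    log: List[str] = []
    for step in plan:
        if not is_safe(step):
            # Refusal is recorded as an action in its own right and ends the episode.
            log.append(f"refuse:{step.tool}")
            break
        log.append(f"act:{step.tool}")
    return log


def toy_check(step: Step) -> bool:
    # Hypothetical rule: never pass credentials to a tool.
    return "password" not in step.argument.lower()


plan = [
    Step("search", "weather in Paris"),
    Step("fill_form", "password=hunter2"),
    Step("send", "done"),
]
print(run_agent(plan, toy_check))
```

The point of the sketch is that the check sits inside the loop, before every tool call, so a single unsafe step halts execution instead of being discovered after the fact.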


2603.02697 2026-06-04 cs.CV cs.AI

ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling

ShareVerse：面向共享世界建模的多智能体一致视频生成

Jiayi Zhu, Jianing Zhang, Yiying Yang, Wei Cheng, Xiaoyun Yuan

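Affiliations: Shanghai Jiao Tong University, China; Fudan University, China; StepFun, China

AI summary: Proposes ShareVerse, a framework that combines a multi-agent interaction dataset, a spatial concatenation strategy, and a cross-agent attention mechanism to generate consistent video of a world shared by multiple agents.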


AI中文摘要

本文提出ShareVerse，一个视频生成框架，支持多智能体共享世界建模，解决了现有工作缺乏统一共享世界构建和多智能体交互支持的问题。ShareVerse利用大型视频模型的生成能力，并整合了三个关键创新：1）在CARLA仿真平台上构建了大规模多智能体交互世界建模数据集，包含多样场景、天气条件和交互轨迹，以及配对的每智能体四视角视频（前/后/左/右视图）和相机数据。2）我们提出了一种针对独立智能体四视角视频的空间拼接策略，以建模更广泛的环境并确保内部多视角几何一致性。3）我们将跨智能体注意力模块集成到预训练视频模型中，实现跨智能体时空信息的交互传递，保证重叠区域的共享世界一致性和非重叠区域的合理生成。支持49帧大规模视频生成的ShareVerse能够准确感知动态智能体的位置，实现一致的共享世界建模。

英文摘要

This paper presents ShareVerse, a video generation framework enabling multi-agent shared world modeling, addressing the gap in existing works that lack support for unified shared world construction with multi-agent interaction. ShareVerse leverages the generation capability of large video models and integrates three key innovations: 1) A dataset for large-scale multi-agent interactive world modeling is built on the CARLA simulation platform, featuring diverse scenes, weather conditions, and interactive trajectories with paired multi-view videos (front/rear/left/right views per agent) and camera data. 2) We propose a spatial concatenation strategy for four-view videos of independent agents to model a broader environment and to ensure internal multi-view geometric consistency. 3) We integrate cross-agent attention blocks into the pretrained video model, which enable interactive transmission of spatial-temporal information across agents, guaranteeing shared world consistency in overlapping regions and reasonable generation in non-overlapping regions. ShareVerse, which supports 49-frame large-scale video generation, accurately perceives the position of dynamic agents and achieves consistent shared world modeling.


2602.20651 2026-06-04 cs.LG stat.AP stat.ML

Sparse Bayesian Deep Functional Learning with Structured Region Selection

稀疏贝叶斯深度函数学习与结构化区域选择

Xiaoxian Zhu, Yingmeng Li, Shuangge Ma, Mengyun Wu

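Affiliations: School of Statistics and Data Science, Shanghai University of Finance and Economics; Department of Biostatistics, Yale School of Public Health

AI summary: Proposes a sparse Bayesian functional deep neural network (sBayFDNN) that learns adaptive functional embeddings through a deep Bayesian architecture to capture complex nonlinear relationships, and uses a structured prior for interpretable region selection with quantified uncertainty. On the theory side, it provides what the paper describes as the first rigorous guarantees for a Bayesian deep functional model: approximation error bounds, posterior consistency, and region selection consistency.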


AI中文摘要

在现代应用如心电图监测、神经影像、可穿戴传感和工业设备诊断中，复杂且连续结构化的数据无处不在，为函数数据分析带来了挑战和机遇。然而，现有方法面临关键权衡：传统函数模型受限于线性，而深度学习方法缺乏可解释的区域选择以处理稀疏效应。为弥合这些差距，我们提出了一种稀疏贝叶斯函数深度神经网络（sBayFDNN）。它通过深度贝叶斯架构学习自适应函数嵌入以捕捉复杂的非线性关系，同时结构化先验使得能够对具有量化不确定性的影响域进行可解释的区域选择。理论上，我们建立了严格的近似误差界、后验一致性和区域选择一致性。这些结果为贝叶斯深度函数模型提供了首个理论保证，确保了其可靠性和统计严谨性。实证上，全面的模拟和真实世界研究证实了sBayFDNN的有效性和优越性。关键的是，sBayFDNN在识别复杂依赖关系以实现准确预测方面表现出色，并能更精确地识别功能上有意义的区域，这些能力从根本上超越了现有方法。

英文摘要

In modern applications such as ECG monitoring, neuroimaging, wearable sensing, and industrial equipment diagnostics, complex and continuously structured data are ubiquitous, presenting both challenges and opportunities for functional data analysis. However, existing methods face a critical trade-off: conventional functional models are limited by linearity, whereas deep learning approaches lack interpretable region selection for sparse effects. To bridge these gaps, we propose a sparse Bayesian functional deep neural network (sBayFDNN). It learns adaptive functional embeddings through a deep Bayesian architecture to capture complex nonlinear relationships, while a structured prior enables interpretable, region-wise selection of influential domains with quantified uncertainty. Theoretically, we establish rigorous approximation error bounds, posterior consistency, and region selection consistency. These results provide the first theoretical guarantees for a Bayesian deep functional model, ensuring its reliability and statistical rigor. Empirically, comprehensive simulations and real-world studies confirm the effectiveness and superiority of sBayFDNN. Crucially, sBayFDNN excels in recognizing intricate dependencies for accurate predictions and more precisely identifies functionally meaningful regions, capabilities fundamentally beyond existing approaches.


2502.03799 2026-06-04 cs.CL cs.SY eess.SY

Enhancing Hallucination Detection through Noise Injection

通过噪声注入增强幻觉检测

Litian Liu, Reza Pourreza, Sunny Panchal, Apratim Bhattacharyya, Yubing Jian, Yao Qin, Roland Memisevic

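Affiliations: Qualcomm AI Research

AI summary: Proposes a training-free noise-injection method, motivated by Bayesian model uncertainty, that perturbs model parameters or hidden-unit activations during sampling and significantly improves hallucination detection for large language models.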

Comments ICLR 2026 main conference paper


AI中文摘要

大型语言模型（LLMs）容易生成看似合理但错误的响应，即幻觉。因此，有效检测幻觉对于LLMs的安全部署至关重要。最近的研究将幻觉与模型不确定性联系起来，表明可以通过测量从模型中抽取的多个样本的答案分布离散度来检测幻觉。虽然从模型定义的词元分布中抽取样本是一种自然的方式，但在这项工作中，我们认为这对于检测幻觉而言并非最优。我们表明，通过以贝叶斯方式考虑模型不确定性，可以显著改进检测。为此，我们提出了一种非常简单、无需训练的方法，该方法基于在采样过程中扰动适当子集的模型参数，或等效地扰动隐藏单元激活。我们证明，我们的方法在跨不同数据集、模型架构和不确定性度量上，显著优于标准采样的推理时幻觉检测。

英文摘要

Large Language Models (LLMs) are prone to generating plausible yet incorrect responses, known as hallucinations. Effectively detecting hallucinations is therefore crucial for the safe deployment of LLMs. Recent research has linked hallucinations to model uncertainty, suggesting that hallucinations can be detected by measuring dispersion over answer distributions obtained from multiple samples drawn from a model. While drawing from the distribution over tokens defined by the model is a natural way to obtain samples, in this work, we argue that it is suboptimal for the purpose of detecting hallucinations. We show that detection can be improved significantly by taking into account model uncertainty in the Bayesian sense. To this end, we propose a very simple, training-free approach based on perturbing an appropriate subset of model parameters, or equivalently hidden unit activations, during sampling. We demonstrate that our approach significantly improves inference-time hallucination detection over standard sampling across diverse datasets, model architectures, and uncertainty metrics.
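
A toy sketch of the idea, under loud assumptions: the "model" below is a single linear layer over a hand-picked hidden vector, the noise is uniform (the abstract does not specify a distribution), and dispersion is measured as the entropy of the sampled answers. All names (`sample_answers`, `answer_entropy`, `noise_scale`) are hypothetical and none of this is the authors' implementation.

```python
import math
import random
from collections import Counter


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def answer_entropy(answers):
    """Entropy (in nats) of the empirical distribution over sampled answers."""
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in Counter(answers).values())


def sample_answers(hidden, weights, n_samples, noise_scale, rng):
    """Draw answers, perturbing the hidden activations before each draw."""
    answers = []
    for _ in range(n_samples):
        noisy = [h + rng.uniform(-noise_scale, noise_scale) for h in hidden]
        logits = [sum(w * h for w, h in zip(row, noisy)) for row in weights]
        probs = softmax(logits)
        answers.append(rng.choices(range(len(probs)), weights=probs)[0])
    return answers


hidden = [1.0, -0.5]
weights = [[2.0, 0.0], [1.5, 1.0], [0.0, 2.0]]  # three candidate answers
rng = random.Random(0)
plain = sample_answers(hidden, weights, 200, 0.0, rng)   # standard sampling
noisy = sample_answers(hidden, weights, 200, 1.0, rng)   # with injected noise
print(answer_entropy(plain), answer_entropy(noisy))
```

Setting `noise_scale` to zero recovers ordinary sampling from the token distribution, so the two runs can be compared with the same dispersion measure.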


2603.01013 2026-06-04 cs.LG

Feature-Weighted Maximum Representative Subsampling

特征加权最大代表性子抽样

Tony Hauptmann, Stefan Kramer

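Affiliations: Institute of Computer Science, Johannes Gutenberg University Mainz

AI summary: Addresses the problem that correcting a few highly biased features can introduce bias into variables that are already representative. Proposes feature-weighted Maximum Representative Subsampling (FW-MRS), which uses feature weights to reduce the influence of highly biased features and retains more instances while maintaining generalization performance on downstream tasks.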


DOI: 10.1038/s41598-026-54180-1

AI中文摘要

在社会科学中，通常需要在得出有效结论之前对研究和调查进行去偏。去偏算法能够使用样本权重通过计算方式去除偏差。然而，当只有一部分特征高度偏倚而其余特征已经具有代表性时，就会出现问题。算法需要强烈改变样本分布以处理少数高度偏倚的特征，这反过来可能会给已经具有代表性的变量引入偏差。为了解决这个问题，我们开发了一种使用特征权重的方法，以最小化高度偏倚特征对样本权重计算的影响。我们的算法基于最大代表性子抽样（MRS），该方法通过迭代移除元素来对齐非代表性样本与代表性样本，从而创建代表性子样本，进而对数据集进行去偏。新算法名为特征加权MRS（FW-MRS），它降低了对高度偏倚特征的重视程度，从而能够为下游任务保留更多实例。特征权重来源于一个域分类器的特征重要性，该分类器经过训练以区分代表性数据集和非代表性数据集。我们使用八个表格数据集验证了FW-MRS，每个数据集都被人为偏倚。偏倚特征可能对下游任务很重要，较少关注它们可能导致泛化性能下降。因此，我们评估了FW-MRS在下游任务上的泛化性能，发现没有统计学上的显著差异。此外，FW-MRS被应用于一个来自社会科学的真实数据集。源代码可在https://github.com/kramerlab/FeatureWeightDebiasing获取。

英文摘要

In the social sciences, it is often necessary to debias studies and surveys before valid conclusions can be drawn. Debiasing algorithms enable the computational removal of bias using sample weights. However, an issue arises when only a subset of features is highly biased, while the rest is already representative. Algorithms need to strongly alter the sample distribution to manage a few highly biased features, which can in turn introduce bias into already representative variables. To address this issue, we developed a method that uses feature weights to minimize the impact of highly biased features on the computation of sample weights. Our algorithm is based on Maximum Representative Subsampling (MRS), which debiases datasets by aligning a non-representative sample with a representative one through iterative removal of elements to create a representative subsample. The new algorithm, named feature-weighted MRS (FW-MRS), decreases the emphasis on highly biased features, allowing it to retain more instances for downstream tasks. The feature weights are derived from the feature importance of a domain classifier trained to differentiate between the representative and non-representative datasets. We validated FW-MRS using eight tabular datasets, each of which we artificially biased. Biased features can be important for downstream tasks, and focusing less on them could lead to a decline in generalization. For this reason, we assessed the generalization performance of FW-MRS on downstream tasks and found no statistically significant differences. Additionally, FW-MRS was applied to a real-world dataset from the social sciences. The source code is available at https://github.com/kramerlab/FeatureWeightDebiasing.
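
The following is a heavily simplified sketch of feature-weighted subsampling, not the FW-MRS algorithm from the paper. Two stand-ins are assumptions made for brevity: the per-feature mean gap replaces the feature importance of a trained domain classifier, and a fixed number of removals replaces the method's stopping rule. Function names are hypothetical.

```python
def column_means(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def feature_weights(biased, representative):
    """Give less weight to features whose means differ most between samples."""
    gaps = [abs(a - b) for a, b in
            zip(column_means(biased), column_means(representative))]
    raw = [1.0 / (1.0 + g) for g in gaps]  # larger gap -> smaller weight
    total = sum(raw)
    return [w / total for w in raw]


def weighted_gap(sample, representative, weights):
    """Weighted distance between the feature means of the two samples."""
    return sum(w * abs(a - b) for w, a, b in
               zip(weights, column_means(sample), column_means(representative)))


def greedy_subsample(biased, representative, weights, n_remove):
    """Iteratively drop the row whose removal leaves the smallest weighted gap."""
    kept = list(biased)
    for _ in range(n_remove):
        best = min(
            range(len(kept)),
            key=lambda i: weighted_gap(kept[:i] + kept[i + 1:],
                                       representative, weights),
        )
        kept.pop(best)
    return kept


representative = [[0, 0], [1, 1], [0, 1], [1, 0]]
biased = [[1, 0], [1, 1], [1, 1], [0, 1], [1, 0], [0, 0]]  # feature 0 is skewed
weights = feature_weights(biased, representative)
kept = greedy_subsample(biased, representative, weights, n_remove=2)
print(weights, kept)
```

In this toy data only the first feature is skewed, so it receives the smaller weight and the removals concentrate on rows that restore its balance.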


2602.23214 2026-06-04 cs.CV cs.LG eess.IV

Plug-and-Play Diffusion Meets ADMM: Dual-Variable Coupling for Robust Medical Image Reconstruction

即插即用扩散遇见ADMM：双变量耦合用于鲁棒医学图像重建

Chenhe Du, Xuanyu Tian, Qing Wu, Muyu Liu, Jingyi Yu, Hongjiang Wei, Yuyao Zhang

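Affiliations: ShanghaiTech University; Shanghai Jiao Tong University

AI summary: Proposes Dual-Coupled PnP Diffusion (DC-PnPDP), which restores the classical dual variable to provide integral feedback and applies Spectral Homogenization (SH) to handle structured artifacts. This addresses the steady-state bias and hallucination problems of existing PnP solvers and achieves state-of-the-art fidelity with faster convergence on CT and MRI reconstruction.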

Comments Accepted by ICML 2026


AI中文摘要

即插即用扩散先验（PnPDP）框架通过将预训练生成模型视为模块化先验，已成为解决成像逆问题的强大范式。然而，我们发现当前PnP求解器（例如基于HQS或近端梯度）存在一个关键缺陷：它们作为无记忆算子，仅基于瞬时梯度更新估计。这种缺乏历史跟踪的做法不可避免地导致非消失稳态偏差，使得重建在严重损坏下无法严格满足物理测量。为了解决这个问题，我们提出了双耦合PnP扩散（DC-PnPDP），它恢复了经典对偶变量以提供积分反馈，逐步强制数据一致性和先验之间的一致性。然而，这种严格的几何耦合引入了第二个挑战：累积的对偶残差表现出频谱有色、结构化的伪影，违反了扩散先验的加性白高斯噪声（AWGN）假设，导致严重的幻觉。为了弥合这一差距，我们引入了频谱均匀化（SH），一种频域适应机制，将这些结构化残差调制为统计上合规的伪AWGN输入。这有效地将求解器的严格优化轨迹与去噪器的有效统计流形对齐。在CT和MRI重建上的大量实验表明，我们的方法解决了偏差-幻觉权衡，实现了最先进的保真度并显著加速收敛。代码可在https://github.com/duchenhe/DC-PnPDP获取。

英文摘要

Plug-and-Play diffusion prior (PnPDP) frameworks have emerged as a powerful paradigm for solving imaging inverse problems by treating pretrained generative models as modular priors. However, we identify a critical flaw in prevailing PnP solvers (e.g., based on HQS or Proximal Gradient): they function as memoryless operators, updating estimates solely based on instantaneous gradients. This lack of historical tracking inevitably leads to non-vanishing steady-state bias, where the reconstruction fails to strictly satisfy physical measurements under heavy corruption. To resolve this, we propose Dual-Coupled PnP Diffusion (DC-PnPDP), which restores the classical dual variable to provide integral feedback, progressively enforcing agreement between data consistency and the prior. However, this rigorous geometric coupling introduces a secondary challenge: the accumulated dual residuals exhibit spectrally colored, structured artifacts that violate the Additive White Gaussian Noise (AWGN) assumption of diffusion priors, causing severe hallucinations. To bridge this gap, we introduce Spectral Homogenization (SH), a frequency-domain adaptation mechanism that modulates these structured residuals into statistically compliant pseudo-AWGN inputs. This effectively aligns the solver's rigorous optimization trajectory with the denoiser's valid statistical manifold. Extensive experiments on CT and MRI reconstruction demonstrate that our approach resolves the bias-hallucination trade-off, achieving state-of-the-art fidelity with significantly accelerated convergence. The code is available at https://github.com/duchenhe/DC-PnPDP
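
The role of the dual variable can be seen in a scalar toy problem. The sketch below is not DC-PnPDP: a soft-threshold operator stands in for the diffusion denoiser, the measurement model is the identity, the penalty `rho` is fixed, and Spectral Homogenization is omitted. It only contrasts a splitting iteration without memory against the classical ADMM update that accumulates the residual `x - z` in `u`.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |v|, used here as a stand-in denoiser."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def solve(y, lam, rho, iters, use_dual):
    """Minimize 0.5 * (x - y)**2 + lam * |z| subject to x = z, by splitting."""
    x = z = u = 0.0
    for _ in range(iters):
        x = (y + rho * (z - u)) / (1.0 + rho)   # data-consistency step
        z = soft_threshold(x + u, lam / rho)    # prior (denoising) step
        if use_dual:
            u += x - z                          # integral feedback on the gap
    return x, z


x_admm, z_admm = solve(y=3.0, lam=1.0, rho=1.0, iters=100, use_dual=True)
x_hqs, z_hqs = solve(y=3.0, lam=1.0, rho=1.0, iters=100, use_dual=False)
print("with dual:   ", x_admm, z_admm)
print("without dual:", x_hqs, z_hqs)
```

With the dual update, the data-side and prior-side variables agree at the minimizer (2.0 for these inputs). Without it, and with `rho` held fixed, the two variables settle a constant distance apart, which is a small-scale picture of the disagreement that the abstract attributes to memoryless solvers.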


2602.20971 2026-06-04 cs.LG cs.AI

Does Order Matter : Connecting The Law of Robustness to Robust Generalization

顺序重要吗：连接鲁棒性定律与鲁棒泛化

Mihir More, Aritra Das, Jaee Ponde, Himadri Mandal, Vishnu Varadarajan, Debayan Gupta

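Affiliations: Ashoka University; Truth Audit Labs; Indian Statistical Institute

AI summary: Connects the Law of Robustness (a lower bound on the Lipschitz constant) to robust generalization error through global and local Rademacher complexity. For arbitrary data distributions, the order of the Lipschitz bound is unchanged in the global case, while in the local case it varies with the perturbation radius and a localized concentration term.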


AI中文摘要

Bubeck和Sellke（2021）将鲁棒性定律与鲁棒泛化误差之间的联系作为一个开放问题提出。鲁棒性定律指出，过参数化对于模型实现鲁棒插值是必要的，即插值函数必须是Lipschitz的。Wu等人（2023）将该定律推广到任意数据分布，证明Lipschitz常数满足$L = Ω(n^{1/d})$。另一方面，鲁棒泛化研究小的鲁棒训练损失是否意味着小的鲁棒测试损失。这可以使用统计学习技术（如Rademacher复杂度）来研究，其中鲁棒损失类的Rademacher复杂度的界意味着函数类Lipschitz性的界。我们利用这一联系，明确地将两者联系起来，适用于任意数据分布。(i) 我们证明，在考虑鲁棒损失类的全局Rademacher复杂度时，Lipschitz界的阶保持不变。(ii) 在局部尺度上，即对于具有小经验误差的函数子集，Lipschitz界的阶随扰动半径$ρ$和局部浓度项$\sqrt{r/n}$变化。

英文摘要

Bubeck and Sellke (2021) propose the connection between the Law of Robustness and robust generalization error as an open problem. The Law of Robustness states that overparameterization is necessary for models to interpolate robustly, i.e., the interpolating function is required to be Lipschitz. Wu et al. (2023) extend this law to arbitrary data distributions, proving that the Lipschitz constant satisfies $L = Ω(n^{1/d})$. Robust generalization, on the other hand, asks whether small robust training loss implies small robust test loss. This can be studied using statistical learning techniques such as Rademacher complexities, where a bound on the Rademacher complexity of the robust loss class implies a bound on the Lipschitzness of the function class. We use this connection to explicitly link the two for arbitrary data distributions. (i) We prove that the order of the Lipschitz bound remains the same when considering the global Rademacher complexity of robust loss classes. (ii) At the local scale, i.e., for subsets of functions with small empirical error, the order of the Lipschitz bound changes with the perturbation radius $ρ$ and the localized concentration term $\sqrt{r/n}$.


2602.21103 2026-06-04 cs.CL cs.IR

Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning

提示级蒸馏：一种非参数化的模型微调替代方案，用于高效推理

Sanket Badhe, Deep Shah

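Affiliations: Google, Mountain View, California, USA

AI summary: Proposes Prompt-Level Distillation (PLD), which extracts reasoning patterns from a teacher model, organizes them into structured instructions, and injects them into the student model's system prompt. Without fine-tuning, this improves small-model reasoning to near-frontier levels on several benchmarks with negligible latency overhead.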

Comments Accepted at ACL 2026 Industry Track


AI中文摘要

高级推理通常需要思维链提示，虽然准确但会导致高昂的延迟和测试时推理成本。标准的替代方案——微调较小的模型——往往牺牲可解释性，同时引入显著的计算和操作开销。为了解决这些限制，我们引入了提示级蒸馏（PLD）。我们从教师模型中提取显式推理模式，并将其组织成结构化的表达性指令列表，用于学生模型的系统提示。使用Gemma-3 4B进行评估，PLD在StereoSet上将Macro F1分数从57%提升至90.0%，在Contract-NLI上从67%提升至83%，同时将LogiQA准确率提升至70%。在Mistral Small 3.1上的类似结果证明了跨架构的泛化能力，使得这些紧凑模型能够以可忽略的延迟开销达到前沿性能。这些表达性指令使决策过程透明化，允许对逻辑进行完全人工验证，使得该方法非常适合法律、金融和内容审核等受监管行业，以及高容量用例和边缘设备。

英文摘要

Advanced reasoning typically requires Chain-of-Thought prompting, which is accurate but incurs prohibitive latency and substantial test-time inference costs. The standard alternative, fine-tuning smaller models, often sacrifices interpretability while introducing significant resource and operational overhead. To address these limitations, we introduce Prompt-Level Distillation (PLD). We extract explicit reasoning patterns from a Teacher model and organize them into a structured list of expressive instructions for the Student model's System Prompt. Evaluated using Gemma-3 4B, PLD improved Macro F1 scores on StereoSet (57% to 90.0%) and Contract-NLI (67% to 83%), while increasing LogiQA accuracy to 70%. Similar results on Mistral Small 3.1 demonstrate cross-architecture generalizability, enabling these compact models to match frontier performance with negligible latency overhead. These expressive instructions render the decision-making process transparent, allowing for full human verification of logic, making this approach ideal for regulated industries such as law, finance, and content moderation, as well as high-volume use cases and edge devices.
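
As a rough illustration of the final step only (placing a structured instruction list into the student's system prompt), consider the sketch below. The function name, prompt layout, and example rules are all hypothetical; how the paper extracts and organizes instructions from the teacher is not shown here.

```python
from typing import List


def build_system_prompt(task: str, instructions: List[str]) -> str:
    """Render a task description and numbered rules as one system prompt."""
    lines = [f"Task: {task}", "Apply the following rules in order:"]
    lines += [f"{i}. {rule}" for i, rule in enumerate(instructions, start=1)]
    return "\n".join(lines)


# Hypothetical rules of the kind a teacher model might produce.
rules = [
    "If the clause restricts disclosure to third parties, label it 'entailment'.",
    "If the contract is silent on the hypothesis, label it 'not mentioned'.",
]
prompt = build_system_prompt("Contract natural language inference", rules)
print(prompt)
```

Because the rules are plain text in the prompt, a reviewer can read and audit them directly, which is the transparency property the abstract emphasizes.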


2511.05722 2026-06-04 cs.CL cs.AI

OckBench: Measuring the Efficiency of LLM Reasoning

OckBench：衡量LLM推理效率

Zheng Du, Hao Kang, Song Han, Tushar Krishna, Ligeng Zhu

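Affiliations: Georgia Institute of Technology; Massachusetts Institute of Technology; NVIDIA

AI summary: Proposes OckBench, a benchmark that jointly evaluates accuracy and token efficiency on reasoning and coding tasks, and shows that current models use tokens inefficiently.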


AI中文摘要

大型语言模型（LLM）如GPT-5和Gemini 3已推动自动推理和代码生成的前沿。然而，当前基准强调准确性和输出质量，忽略了关键维度：token使用的效率。在实际应用中，token效率变化很大。解决相同问题且准确率相近的模型，其token长度差异可达5.0倍，导致模型推理能力存在巨大差距。这种差异暴露了显著的冗余，凸显了对标准化基准来量化token效率差距的迫切需求。因此，我们引入OckBench，这是首个联合衡量推理和编码任务中准确性与token效率的基准。我们的评估表明，当前模型的token效率在很大程度上未得到优化，显著增加了服务成本和延迟。这些发现为社区优化潜在推理能力（即token效率）提供了具体路线图。最终，我们主张评估范式转变：token不应被无谓地倍增。我们的基准可在https://ockbench.github.io/获取。

英文摘要

Large language models (LLMs) such as GPT-5 and Gemini 3 have pushed the frontier of automated reasoning and code generation. Yet current benchmarks emphasize accuracy and output quality, neglecting a critical dimension: efficiency of token usage. Token efficiency is highly variable in practice. Models solving the same problem with similar accuracy can exhibit up to a 5.0× difference in token length, leading to a massive gap in model reasoning ability. Such variance exposes significant redundancy, highlighting the critical need for a standardized benchmark to quantify the gap in token efficiency. Thus, we introduce OckBench, the first benchmark that jointly measures accuracy and token efficiency across reasoning and coding tasks. Our evaluation reveals that token efficiency remains largely unoptimized across current models, significantly inflating serving costs and latency. These findings provide a concrete roadmap for the community to optimize a latent aspect of reasoning ability: token efficiency. Ultimately, we argue for an evaluation paradigm shift: tokens must not be multiplied beyond necessity. Our benchmarks are available at https://ockbench.github.io/.
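
To make the notion of "similar accuracy, different token length" concrete, here is a minimal sketch that reports accuracy alongside mean output tokens for two models. The data are invented for illustration, and `summarize` is a hypothetical helper; OckBench's actual metric definitions are not given in the abstract and may differ.

```python
from typing import List, Tuple


def summarize(results: List[Tuple[bool, int]]) -> Tuple[float, float]:
    """Return (accuracy, mean tokens) from per-problem (correct, tokens) records."""
    n = len(results)
    accuracy = sum(1 for ok, _ in results if ok) / n
    mean_tokens = sum(tokens for _, tokens in results) / n
    return accuracy, mean_tokens


# Invented per-problem records for two hypothetical models.
model_a = [(True, 200), (True, 300), (False, 400)]
model_b = [(True, 1000), (True, 1500), (False, 2000)]

acc_a, tok_a = summarize(model_a)
acc_b, tok_b = summarize(model_b)
print(f"A: acc={acc_a:.2f}, tokens={tok_a:.0f}")
print(f"B: acc={acc_b:.2f}, tokens={tok_b:.0f}")
print(f"token ratio B/A = {tok_b / tok_a:.1f}")
```

Reporting the two numbers side by side makes the cost difference visible even when an accuracy-only leaderboard would rank the models as equal.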


2601.04493 2026-06-04 cs.RO

Continuum Robot State Estimation with Actuation Uncertainty

具有驱动不确定性的连续体机器人状态估计

James M. Ferguson, Alan Kuntz, Tucker Hermans

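Affiliations: Department of Electrical and Computer Engineering and Department of Computer Science, Vanderbilt University; Kahlert School of Computing and Robotics Center, University of Utah; NVIDIA

AI summary: For shape estimation of continuum robots under unknown interaction forces and model uncertainty, proposes a discrete Cosserat rod model with mechanically principled actuation priors that jointly estimates robot shape, external loads, and actuation inputs, with efficient nonlinear optimization over a sparse factor graph.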

Comments Public preprint for IEEE RAL. Accepted May 2026


AI中文摘要

连续体机器人是柔性、细长的操纵器，非常适合受限的手术环境。在这些环境中，未知的相互作用力和模型不确定性显著影响机器人形状，从而激发了从外部观测进行状态估计的需求。现有的估计方法要么忽略驱动建模，要么依赖于简化的确定性驱动模型。相比之下，我们使用机械原理的驱动先验联合估计机器人形状、外部载荷和驱动输入。为此，我们提出了一种具有分段线性应变积分的离散Cosserat杆公式，该公式提供了高数值精度，同时诱导出稀疏因子图结构以实现高效的非线性优化。我们将该框架扩展到仿真中的腱驱动和并联机器人，并在手术同心管机器人上进行了实验验证。总体而言，我们的方法能够在多种机器人架构上实现原理性的实时估计，同时通过线性化因子图直接访问操纵器雅可比矩阵。

英文摘要

Continuum robots are flexible, slender manipulators well suited for confined surgical environments. In these settings, unknown interaction forces and model uncertainty significantly affect robot shape, motivating state estimation from external observations. Existing estimation methods either neglect actuation modeling or rely on simplified deterministic actuation models. In contrast, we jointly estimate robot shape, external loads, and actuation inputs using mechanically principled actuation priors. To achieve this, we present a discrete Cosserat rod formulation with piecewise-linear strain integration that provides high numerical accuracy while inducing a sparse factor graph structure for efficient nonlinear optimization. We extend the framework to tendon-driven and parallel robots in simulation and validate it experimentally on a surgical concentric tube robot. Overall, our approach enables principled real-time estimation across multiple robot architectures while providing direct access to manipulator Jacobians through the linearized factor graph.


2602.19101 2026-06-04 cs.CL cs.AI

Value Entanglement: Conflation Between Different Kinds of Good In (Some) Large Language Models

价值纠缠：大型语言模型中不同种类好的混淆

Seong Hah Cho, Junyi Li, Anna Leshinskaya

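Affiliations: Independent Department of Cognitive Sciences, UC Irvine

AI summary: By probing model behavior, embeddings, and residual-stream activations, finds that value entanglement is widespread in large language models: moral, grammatical, and economic value are conflated, with grammatical and economic value overly influenced by moral value. Selectively ablating the activation vectors associated with morality repairs the problem.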

2602.16966 2026-06-04 cs.LG cs.AI

A Unified Framework for Locality in Scalable MARL

可扩展多智能体强化学习中局部性的统一框架

Sourav Chakraborty, Amit Kiran Rege, Claire Monteleoni, Lijun Chen

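Affiliations: University of Colorado Boulder; INRIA Paris

AI summary: Proposes a unified framework that decomposes the matrix $C^π$ into environment-sensitivity and policy-sensitivity parts. Using the fact that the spectral-radius condition $ρ(H^π)<1$ is strictly weaker than the row-sum condition, it shows that the softmax temperature directly controls locality and gives a deterministic guarantee for block-coordinate KL-proximal policy improvement.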


AI中文摘要

网络化多智能体强化学习的可扩展方法让每个智能体仅使用智能体图的一小部分邻域进行规划。这仅在系统是值局部性时有效，即一个智能体的扰动对远处另一个智能体的长期值影响较弱。在平均奖励设置中，验证局部性的标准方法是Dobrushin行和界，该界基于一个矩阵$C^π$，该矩阵捕捉每个智能体的下一个状态如何依赖于其他智能体的当前状态。为了使该矩阵易于处理，先前的工作通过联合动作的上确界来约束它。得到的界与策略无关，但当策略从不选择最坏情况动作时，该界是松的。我们将$C^π$分解为分别跟踪环境敏感性和策略敏感性的部分，$C^π\preceq E^{\mathrm s}+E^{\mathrm a}Π(π)$，其中$E^{\mathrm s}$衡量下一个状态如何随当前状态变化，$E^{\mathrm a}$衡量它如何随当前动作变化，$Π(π)$衡量策略对状态变化的反应程度。那么$H^π:= E^{\mathrm s}+E^{\mathrm a}Π(π)$的谱半径控制平均奖励泊松解的衰减，谱证书$ρ(H^π)<1$严格弱于同一矩阵上的行和条件$\|H^π\|_\infty<1$，并适用于先前Dobrushin风格工作中使用的策略无关动作上确界界无法处理的场景。对于温度-$τ$ softmax策略，我们有$Π(π)\le L/(2τ)$，因此softmax温度直接控制局部性。我们利用这一衰减结果为块坐标KL近端策略改进模板提供确定性预言机保证，其截断偏差随消息传递半径$κ$指数衰减。

英文摘要

Scalable methods for networked multi-agent reinforcement learning let each agent plan using only a small neighborhood of the agent graph. This works only when the system is value-local, meaning a perturbation at one agent affects the long-run value at another agent weakly when the two are far apart. In the average-reward setting, the standard way to certify locality is the Dobrushin row-sum bound on a single matrix $C^π$ that captures how each agent's next state depends on each other agent's current state. To make this matrix easy to work with, prior work bounds it by a supremum over joint actions. The resulting bound is independent of the policy, but it is loose whenever the policy never picks the worst-case action. We split $C^π$ into pieces that separately track environment sensitivity and policy sensitivity, $C^π\preceq E^{\mathrm s}+E^{\mathrm a}Π(π)$, where $E^{\mathrm s}$ measures how the next state moves with the current state, $E^{\mathrm a}$ measures how it moves with the current action, and $Π(π)$ measures how reactive the policy is to changes in state. The spectral radius of $H^π:= E^{\mathrm s}+E^{\mathrm a}Π(π)$ then controls the decay of the average-reward Poisson solution, and the spectral certificate $ρ(H^π)<1$ is strictly weaker than the row-sum condition $\|H^π\|_\infty<1$ on the same matrix and applies in regimes where policy-independent action-supremum bounds used in prior Dobrushin-style work cannot. For temperature-$τ$ softmax policies we get $Π(π)\le L/(2τ)$, so the softmax temperature directly controls locality. We use this decay result to give a deterministic oracle guarantee for a block-coordinate KL-proximal policy-improvement template whose truncation bias decays exponentially in the message-passing radius $κ$.
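
The gap between the spectral certificate and the row-sum condition can be checked numerically on a small example. In the sketch below the matrices `e_s` and `e_a` are invented, and the policy term $Π(π)$ is replaced by the scalar upper bound $L/(2τ)$, which is a simplifying assumption made only for illustration. The spectral radius is estimated by power iteration, which is adequate here because the matrix is entrywise positive.

```python
def inf_norm(m):
    """Maximum absolute row sum, i.e. the induced infinity norm."""
    return max(sum(abs(v) for v in row) for row in m)


def spectral_radius(m, iters=200):
    """Power iteration; assumes an entrywise positive matrix."""
    v = [1.0] * len(m)
    radius = 0.0
    for _ in range(iters):
        w = [sum(a * b for a, b in zip(row, v)) for row in m]
        radius = max(abs(x) for x in w)
        v = [x / radius for x in w]
    return radius


def build_h(e_s, e_a, lipschitz, temperature):
    """H = E_s + (L / (2 * tau)) * E_a, with the policy term as a scalar bound."""
    pi_bound = lipschitz / (2.0 * temperature)
    return [[s + pi_bound * a for s, a in zip(rs, ra)]
            for rs, ra in zip(e_s, e_a)]


e_s = [[0.3, 0.2], [0.1, 0.3]]   # invented environment sensitivities
e_a = [[0.2, 0.4], [0.0, 0.2]]   # invented action sensitivities

h_warm = build_h(e_s, e_a, lipschitz=2.0, temperature=1.0)
h_cold = build_h(e_s, e_a, lipschitz=2.0, temperature=0.25)
print(inf_norm(h_warm), spectral_radius(h_warm))
print(inf_norm(h_cold), spectral_radius(h_cold))
```

For the higher temperature, the row-sum test fails (the norm exceeds 1) while the spectral radius stays below 1, so only the spectral condition certifies locality. Lowering the temperature scales up the policy term and pushes the spectral radius above 1 as well.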


2602.13081 2026-06-04 cs.RO

Agentic AI for Robot Control: Flexible but still Fragile

用于机器人控制的智能体AI：灵活但依然脆弱

Oscar Lima, Marc Vinci, Martin Günther, Marian Renz, Alexander Sung, Sebastian Stock, Johannes Brust, Lennart Niecksch, Zongyao Yi, Felix Igelbrink, Benjamin Kisliuk, Martin Atzmueller, Joachim Hertzberg

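Affiliations: ETH Zurich; University of Cambridge; Technical University of Munich; Max Planck Institute for Intelligent Systems; German Aerospace Center (DLR)

AI summary: Presents an agentic control system built on a reasoning-capable language model that selects and invokes robot skills in an iterative plan-and-execute loop to complete tasks. Experiments on two physical platforms show that the system is flexible but fragile, with non-deterministic behavior, instruction-following errors, and sensitivity to prompts.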


DOI: 10.1609/aaaiss.v8i1.42578

AI中文摘要

最近的工作利用生成模型的能力和常识先验进行机器人控制。在本文中，我们提出了一种智能体控制系统，其中具有推理能力的语言模型通过迭代规划器和执行器循环选择并调用机器人技能来规划和执行任务。我们在两种物理机器人平台上部署该系统，分别用于：(i) 室内移动操作（Mobipick）中的桌面抓取、放置和箱子插入，以及 (ii) 自主农业导航和感知（Valdemar）。两种设置都涉及不确定性、部分可观测性、传感器噪声和模糊的自然语言命令。该系统公开了其规划和决策过程的结构化内省，通过显式事件检查对外部事件做出反应，并支持操作员干预以修改或重定向正在进行的执行。在两个平台上，我们的概念验证实验揭示了显著的脆弱性，包括非确定性次优行为、指令遵循错误以及对提示规范的高度敏感性。同时，该架构是灵活的：转移到不同的机器人和任务领域主要需要更新系统提示（领域模型、可供性和动作目录）并将相同的工具接口重新绑定到平台特定的技能API。

英文摘要

Recent work leverages the capabilities and commonsense priors of generative models for robot control. In this paper, we present an agentic control system in which a reasoning-capable language model plans and executes tasks by selecting and invoking robot skills within an iterative planner and executor loop. We deploy the system on two physical robot platforms in two settings: (i) tabletop grasping, placement, and box insertion in indoor mobile manipulation (Mobipick) and (ii) autonomous agricultural navigation and sensing (Valdemar). Both settings involve uncertainty, partial observability, sensor noise, and ambiguous natural-language commands. The system exposes structured introspection of its planning and decision process, reacts to exogenous events via explicit event checks, and supports operator interventions that modify or redirect ongoing execution. Across both platforms, our proof-of-concept experiments reveal substantial fragility, including non-deterministic suboptimal behavior, instruction-following errors, and high sensitivity to prompt specification. At the same time, the architecture is flexible: transfer to a different robot and task domain largely required updating the system prompt (domain model, affordances, and action catalogue) and re-binding the same tool interface to the platform-specific skill API.


2602.12215 2026-06-04 cs.RO

LDA-1B: Scaling Latent Dynamics Action Model via Universal Embodied Data Ingestion

LDA-1B：通过通用具身数据摄取扩展潜动力学动作模型

Jiangran Lyu, Kai Liu, Xuheng Zhang, Haoran Liao, Yusen Feng, Wenxuan Zhu, Tingrui Shen, Jiayi Chen, Jiazhao Zhang, Yifei Dong, Wenbo Cui, Senmao Qi, Shuo Wang, Yixin Zheng, Mi Yan, Xuesong Shi, Haoran Li, Dongbin Zhao, Ming-Yu Liu, Zhizheng Zhang, Li Yi, Yizhou Wang, He Wang

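Affiliations: Peking University; Galbot; CASIA (Institute of Automation, Chinese Academy of Sciences); BAAI (Beijing Academy of Artificial Intelligence); Tsinghua University; Sun Yat-sen University; NVIDIA

AI summary: Proposes LDA-1B, a robot foundation model that jointly models dynamics, policy, and visual forecasting, using the unified-format embodied interaction dataset EI-30k and dynamics learning in a structured DINO latent space. It improves on prior methods by up to 21%, 48%, and 23% on contact-rich, dexterous, and long-horizon tasks, respectively.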

Comments Accepted at RSS 2026. Project page: https://pku-epic.github.io/LDA


AI中文摘要

最近的机器人基础模型很大程度上依赖于大规模行为克隆，该方法模仿专家动作，但丢弃了嵌入在异构具身数据中的可迁移动力学知识。虽然统一世界模型（UWM）公式有潜力利用这种多样化数据，但由于粗粒度的数据使用和碎片化的数据集，现有实例难以扩展到基础模型级别。我们引入了LDA-1B，一个通过通用具身数据摄取进行扩展的机器人基础模型，它通过联合学习动力学、策略和视觉预测，为不同质量的数据分配不同的角色。为了大规模支持这种机制，我们组装并标准化了EI-30k，一个具身交互数据集，包含超过3万小时的人类和机器人轨迹，采用统一格式。通过结构化DINO潜空间中的预测实现了对这种异构数据的可扩展动力学学习，避免了冗余的像素空间外观建模。作为这种表示的补充，LDA-1B采用多模态扩散变换器来处理异步视觉和动作流，从而在1B参数规模上实现稳定训练。在模拟和真实世界的实验中，LDA-1B在接触丰富、灵巧和长时任务上分别比先前方法（例如$π_{0.5}$）高出高达21%、48%和23%。值得注意的是，LDA-1B实现了数据高效的微调，通过利用通常有害且被丢弃的30%低质量轨迹，获得了10%的提升。

英文摘要

Recent robot foundation models largely rely on large-scale behavior cloning, which imitates expert actions but discards transferable dynamics knowledge embedded in heterogeneous embodied data. While the Unified World Model (UWM) formulation has the potential to leverage such diverse data, existing instantiations struggle to scale to the foundation-model level due to coarse data usage and fragmented datasets. We introduce LDA-1B, a robot foundation model that scales through universal embodied data ingestion by jointly learning dynamics, policy, and visual forecasting, assigning distinct roles to data of varying quality. To support this regime at scale, we assemble and standardize EI-30k, an embodied interaction dataset comprising over 30k hours of human and robot trajectories in a unified format. Scalable dynamics learning over such heterogeneous data is enabled by prediction in a structured DINO latent space, which avoids redundant pixel-space appearance modeling. Complementing this representation, LDA-1B employs a multi-modal diffusion transformer to handle asynchronous vision and action streams, enabling stable training at the 1B-parameter scale. Experiments in simulation and the real world show LDA-1B outperforms prior methods (e.g., $π_{0.5}$) by up to 21%, 48%, and 23% on contact-rich, dexterous, and long-horizon tasks, respectively. Notably, LDA-1B enables data-efficient fine-tuning, gaining 10% by leveraging the 30% of low-quality trajectories that are typically harmful and discarded.

2510.26219 2026-06-04 cs.LG cs.AI

Test-time reward-guided alignment of language models by importance sampling on pre-logit space

Sekitoshi Kanai, Tsukasa Yoshida, Hiroshi Takahashi, Haru Kuroki, Kazumune Hashimoto

Affiliations: NTT, Inc.; Toyohashi University of Technology; The University of Osaka

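AI summary: Proposes AISP, a test-time alignment method based on adaptive importance sampling in the pre-logit space. It applies Gaussian perturbations and importance sampling to optimize the expected reward, and is more sample-efficient than best-of-n sampling and other test-time alignment methods.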
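
The general idea of adaptive importance sampling over Gaussian perturbations can be sketched on a toy problem. The code below is not the AISP algorithm from the paper: it is a generic self-normalized importance sampling loop in which a plain vector stands in for pre-logit features, and the reward function, temperature, sample counts, and all names are illustrative assumptions.

```python
import math
import random


def toy_reward(z, target):
    """Hypothetical reward: higher when z is closer to target."""
    return -sum((a - b) ** 2 for a, b in zip(z, target))


def adaptive_importance_sampling(reward_fn, dim, n_samples=64, n_iters=20,
                                 sigma=0.5, temperature=1.0, seed=0):
    """Iteratively move a Gaussian proposal mean toward high-reward regions."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    for _ in range(n_iters):
        # Draw Gaussian perturbations around the current proposal mean.
        samples = [[m + rng.gauss(0.0, sigma) for m in mean]
                   for _ in range(n_samples)]
        rewards = [reward_fn(z) for z in samples]
        # Self-normalized weights proportional to exp(reward / temperature),
        # i.e. the importance weights of the reward-tilted proposal.
        # Subtracting the max reward keeps exp() numerically stable.
        r_max = max(rewards)
        weights = [math.exp((r - r_max) / temperature) for r in rewards]
        total = sum(weights)
        # The new proposal mean is the weighted average of the samples.
        mean = [sum(w * z[i] for w, z in zip(weights, samples)) / total
                for i in range(dim)]
    return mean


# Assumed toy setup: a hidden target vector defines the reward.
target = [1.0, -2.0, 0.5]
final_mean = adaptive_importance_sampling(lambda z: toy_reward(z, target), dim=3)
```

By contrast, best-of-n sampling would draw all candidates from a fixed distribution and keep the single highest-reward one, whereas this loop reuses every sample through its weight and shifts the proposal between rounds.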

Comments 24 pages, 10 figures

Exposing Blindspots: Cultural Bias Evaluation in Generative Image Models

Revisiting Model Stitching In the Foundation Model Era

ViewMask-1-to-3: Multi-View Consistent Image Generation via Multimodal Discrete Diffusion Models

LiSeCo: Linear Semantic Control for Language Generation

Good Reasoning Makes Good Demonstrations: Implicit Reasoning Quality Supervision via In-Context Reinforcement Learning

EvoPrompt: Guided Prompt Evolution for Vision-Language Models Adaptation

Physics-Informed Neural Engine Sound Modeling with Differentiable Pulse-Train Synthesis

When Detectors Forget Forensics: Blocking Semantic Shortcuts for Generalizable AI-Generated Image Detection

ZeroWBC: Learning Natural Whole-Body Humanoid Interaction from Human Egocentric Data

Vectorized Online POMDP Planning

3PoinTr: 3D Point Tracks for Learning Manipulation from Unconstrained Human Videos

Analysis-Driven Procedural Generation of an Engine Sound Dataset with Embedded Control Annotations

Confidence Before Answering: A Paradigm Shift for Efficient LLM Uncertainty Estimation

It's TIME: Towards the Next Generation of Time Series Forecasting Benchmarks

Beyond Pixel Histories: World Models with Persistent 3D State

Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use

ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling

Sparse Bayesian Deep Functional Learning with Structured Region Selection

Enhancing Hallucination Detection through Noise Injection

Feature-Weighted Maximum Representative Subsampling

Plug-and-Play Diffusion Meets ADMM: Dual-Variable Coupling for Robust Medical Image Reconstruction

Does Order Matter : Connecting The Law of Robustness to Robust Generalization

Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning

OckBench: Measuring the Efficiency of LLM Reasoning

Continuum Robot State Estimation with Actuation Uncertainty

Value Entanglement: Conflation Between Different Kinds of Good In (Some) Large Language Models

A Unified Framework for Locality in Scalable MARL

Agentic AI for Robot Control: Flexible but still Fragile

LDA-1B: Scaling Latent Dynamics Action Model via Universal Embodied Data Ingestion

Test-time reward-guided alignment of language models by importance sampling on pre-logit space