arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.19341 2026-06-18 cs.CV cs.CL cs.SD New submission

Native Active Perception as Reasoning for Omni-Modal Understanding


Zhenghao Xing, Ruiyang Xu, Yuxuan Wang, Jinzheng He, Ziyang Ma, Qize Yang, Yunfei Chu, Jin Xu, Junyang Lin, Chi-Wing Fu, Pheng-Ann Heng

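Affiliations The Chinese University of Hong Kong; Shanghai Jiao Tong University; Nanyang Technological University; Qwen Team, Alibaba Group

AI summary Proposes OmniAgent, a native omni-modal agent built on a POMDP-based iterative observation-thought-action loop. Active perception decouples reasoning complexity from video duration, and the agent achieves the best results among open-source models on multiple benchmarks.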

Comments Accepted at ICML 2026. Code and models: https://github.com/harryhsing/omniagent


Abstract

Passive models for long video understanding typically rely on a "watch-it-all" paradigm, processing frames uniformly regardless of query difficulty, causing computational cost to grow with video duration. Although interactive frameworks have emerged, they often rely on global pre-scanning, and their context cost still scales with video length. We propose OmniAgent, the first native omni-modal agent that formulates video understanding as a POMDP-based iterative Observation-Thought-Action cycle. OmniAgent executes on-demand actions to selectively distill audio-visual cues into a persistent textual memory, effectively decoupling reasoning complexity from raw video duration. To operationalize this, we introduce (1) Agentic Supervised Fine-Tuning to bootstrap native active perception via best-of-N trajectory synthesis with dual-stage quality control, and (2) Agentic Reinforcement Learning with TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage), which leverages turn-level entropy to steer credit assignment toward pivotal discovery turns. Crucially, OmniAgent exhibits positive test-time scaling, where performance improves as the number of reasoning turns increases, validating the efficacy of active perception. Empirical results across ten benchmarks (e.g., VideoMME, LVBench) demonstrate that OmniAgent achieves state-of-the-art performance among open-source models. Notably, on LVBench, our 7B agent outperforms the 10× larger Qwen2.5-VL-72B (50.5% vs. 47.3%).

2606.19340 2026-06-18 cs.RO New submission

Zero-Shot Long-Horizon Dexterous Manipulation via Multi-View 3D-Grounded VLM Reasoning


Jisoo Kim, Sangwon Baik, Taeksoo Kim, Sungjoo Kim, Junyoung Lee, Mingi Choi, Hanbyul Joo

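Affiliations Seoul National University; RLWRLD

AI summary Proposes a zero-shot framework in which a VLM generates 3D task plans from multi-view RGB images. Triangulation combined with ray voting provides precise 3D grounding, supporting grasping and tool use, and the framework outperforms baseline methods in real-world experiments.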


Abstract

We present a zero-shot framework for long-horizon dexterous manipulation that grounds language instructions into executable 3D task plans from calibrated multi-view RGB images. Rather than training an end-to-end policy, our system uses a vision-language model (VLM) to produce reference-frame task grounding and primitive-level 2D keypoints, then lifts them into 3D via multi-view fusion. This lifting combines triangulation of view-wise VLM groundings with reference-view ray voting, which searches along a semantic camera ray for geometrically consistent candidates across neighboring views. The resulting 3D keypoints support both pick-and-place and tool-use: for tool-use, we retrieve an object-centric atomic action corresponding to the inferred skill category and align its stored 6D tool trajectory to the scene; for dexterous execution, we expand the lifted grasp keypoint into a task-conditioned grasp affordance region and generate feasible grasp-motion pairs with an arm-hand motion generator. Real-world experiments show improved 3D grounding accuracy and execution reliability over single-view RGB-D grounding and fine-tuned VLA baselines. We further demonstrate long-horizon manipulation through closed-loop status verification and replanning, enabling zero-shot execution on unseen objects and tool-use tasks in novel scenes.
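The lifting step in the abstract above turns 2D keypoints seen from several calibrated cameras into one 3D point. The abstract does not give the authors' formulation, so the sketch below swaps in the textbook midpoint method for two rays. It is only an illustration of what triangulation means here; it does not reproduce the VLM grounding or the ray voting. The function name and the example camera positions are hypothetical.

```python
def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def triangulate_midpoint(o1, d1, o2, d2):
    """Midpoint of the shortest segment between two camera rays.

    o1, o2 are camera centers and d1, d2 are ray directions through the
    2D keypoints (hypothetical inputs; any nonzero length works).
    """
    w = [a - b for a, b in zip(o1, o2)]
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; depth is not recoverable")
    s = (b * e - c * d) / denom  # parameter of the closest point on ray 1
    t = (a * e - b * d) / denom  # parameter of the closest point on ray 2
    p1 = [o + s * k for o, k in zip(o1, d1)]
    p2 = [o + t * k for o, k in zip(o2, d2)]
    return [(x + y) / 2 for x, y in zip(p1, p2)]


# Two cameras whose rays both pass through the point (1, 2, 5).
point = triangulate_midpoint([0, 0, 0], [1, 2, 5], [2, 0, 0], [-1, 2, 5])
```

With noisy keypoints the two rays no longer intersect, and the midpoint gives a simple compromise between them. Using more than two views would typically call for a least-squares formulation instead.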

2606.19338 2026-06-18 cs.CV New submission

Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games


Shengyuan Ding, Xilin Wei, Xinyu Fang, Haodong Duan, Dahua Lin, Jiaqi Wang, Yuhang Zang

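Affiliations Fudan University; Shanghai Innovation Institute; Shanghai Artificial Intelligence Laboratory; Zhejiang University; The Chinese University of Hong Kong

AI summary Proposes RNG-Bench, a benchmark suite of two games, Matching Pairs and 3D Maze, that evaluates how well multimodal large models reconstruct past observations and act on them in non-Markov environments. Most errors stem from forgetting rather than decision making, and fine-tuning improves performance.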


Abstract

Deploying multimodal foundation models as closed-loop policies increasingly requires conditioning actions on observations that are no longer visible. However, existing benchmarks either expose the full state, conflate hidden-state reconstruction with other agent skills, or test recall only after an episode has ended. We introduce RNG-Bench (Reconstructive Non-Markov Games), a benchmark suite designed to isolate a base model's ability to reconstruct past observations and act on them during multi-step interaction. RNG-Bench includes two complementary games: Matching Pairs, where card identities briefly revealed at specific locations must later be recalled, and 3D Maze, where egocentric views must be integrated into a spatial map. Both games are evaluated under a unified harness with three controlled difficulty axes: grid size, visual pattern, and observation modality. The benchmark further introduces a head-to-head duel protocol to control for instance-level variance and a Memory Gap metric that disentangles forgetting from poor action selection. The hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode, and remain far from saturated by frontier MLLMs. Memory Gap analysis shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decision making. Finally, fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench and transfers to existing benchmarks without degrading general multimodal capability.

2606.19336 2026-06-18 cs.CL New submission

Learning User Simulators with Turing Rewards


Yingshan Susan Wang, Cedegao E. Zhang, Linlu Qiu, Zexue He, Pengyuan Li, Alex Pentland, Roger P. Levy, Yoon Kim

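Affiliations Massachusetts Institute of Technology; Stanford University; MIT-IBM Watson AI Lab

AI summary Proposes Turing-RL, a Turing-Test-based reinforcement learning method for training user simulators. A discriminative Turing reward pushes generated responses to be indistinguishable from those of real users, and the method outperforms baselines on conversational chat and forum discussion.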


Abstract

Learning to simulate human users in interactive settings could advance the training of agent assistants, evaluation of personalization systems, research in the social sciences, and more. Existing approaches generally do so by training a large language model (LLM) to match a single ground truth response, either by maximizing the log probability or by using a similarity reward. We instead propose Turing-RL: a Turing-Test-based reinforcement learning approach for training user simulator models. Turing-RL uses a discriminative Turing reward with an LLM judge to score how indistinguishable a generated response is from the real user's given the user's history, and the user simulator LLM learns to produce responses indistinguishable from what the user could have said with such rewards. Across two different domains, conversational chat and Reddit forum discussion, we find that Turing-RL consistently outperforms baseline methods on both LLM and human evaluation metrics. Our study suggests that optimizing for indistinguishability, rather than response matching, is effective for learning user simulators.

2606.19334 2026-06-18 cs.CL cs.CY cs.LG New submission

Freeing the Law with LOCUS: A Local Ordinance Corpus for the United States


Denis Peskoff, Joe Barrow, Christopher Vu, Diag Davenport

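Affiliations UC Berkeley; School of Information; Independent

AI summary To address the lack of a machine-readable corpus of U.S. local ordinances, builds LOCUS, a corpus of ordinance codes from 9,239 cities and counties, and trains ModernBERT classifiers to analyze dimensions such as legal transparency.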

Comments 14 pages, 6 figures


Abstract

Progress in legal AI increasingly depends on access to authoritative legal text at scale. Yet one of the most consequential layers of American law remains largely absent from existing machine-readable corpora: local ordinances. Local codes govern zoning, housing, business licensing, public health, noise, animal control, and many other domains of everyday regulation, but they are fragmented across vendor platforms designed for human browsing rather than bulk research access. We introduce LOCUS - the Local Ordinance Corpus for the United States - a comprehensive corpus and county-harmonized access layer for U.S. municipal and county ordinance codes. The raw corpus, available for release to researchers, represents nearly all publicly available municipal and county ordinance codes. The resulting raw corpus contains codes from 9,239 cities and counties. A smaller county-harmonized LOCUS access layer provides coverage for the largest 2,309 of 3,144 U.S. counties, accounting for a majority of the population. We use OCR to handle the myriad of document formats that have kept the law from being a public resource. We release the corpus with coverage metadata to support reproducibility, downstream legal AI research, and the incremental expansion of machine-readable access to local law. We train a collection of ModernBERT-based classifiers and scorers to facilitate analyzing U.S. local law along several dimensions, such as opacity and paternalism, that have not previously been studied at this scale. LOCUS-v1 and its derivative models are available at: https://huggingface.co/datasets/LocalLaws/LOCUS-v1

2606.19333 2026-06-18 cs.RO cs.CV New submission

Do as I Do: Dexterous Manipulation Data from Everyday Human Videos


Bhawna Paliwal, Haritheja Etukuru, William Liang, Pieter Abbeel, Nur Muhammad Mahi Shafiullah, Jitendra Malik

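Affiliations UC Berkeley

AI summary Proposes the DO AS I DO algorithm, which reconstructs hand-object interactions from monocular RGB human videos and retargets them to multi-fingered dexterous robot hands, producing executable manipulation data and outperforming existing methods.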

Comments Project website: https://do-as-i-do.com/


Abstract

How can we scalably generate data for robotic manipulation, especially on human-like platforms such as dexterous multi-fingered hands? Learning from human videos has recently emerged as a likely answer to this question. However, difficulties in estimating hand-object interaction and crossing the human-to-robot embodiment gap have hindered the adoption of abundant monocular RGB-only human videos as the primary source of robot manipulation data. In this work, we present DO AS I DO, an algorithm to reconstruct and retarget monocular RGB human videos to multi-fingered dexterous robotic hands. DO AS I DO reconstructs hand-object interactions from various egocentric and exocentric in-the-wild video sources. The algorithm then retargets these hand-object interaction estimates into a sequence of actions executable in the real world, yielding robot-complete manipulation data from disparate human videos. Overall, DO AS I DO outperforms previous state of the art in estimating hand-object interactions and extracting dexterous manipulation trajectories from RGB videos, as we show in experiments on datasets with ground truths and on a dataset of video clips collected online. Our experiments enable us to propose an efficacy playbook for practitioners collecting human data for manipulation.

2606.19328 2026-06-18 cs.LG cs.AI cs.RO New submission

UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based Reinforcement Learning


Mohamed Nabail, Leo Cheng, Jingmin Wang, Nicholas Rhinehart

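Affiliations Learning, Embodied Autonomy, and Forecasting (LEAF) Lab, University of Toronto

AI summary Proposes UBP2, which actively directs exploration by jointly reasoning over uncertainty in the reward, dynamics, and value functions, substantially improving sample efficiency on the Meta-World benchmark.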


Abstract

Preference-based RL provides an approach to learning reward models from pairwise comparisons of behaviors, bypassing the need for explicit reward design. However, existing methods typically rely on passive data collection and suffer from poor sample efficiency, especially during the early stages of learning. We introduce a model-based approach that actively directs exploration by jointly reasoning over uncertainties in the reward, dynamics, and value functions. Our method, Uncertainty-Balanced Preference Planning (UBP2), uses ensembles of reward, dynamics, and value function models to evaluate candidate trajectories according to a unified score that combines expected reward, terminal value, and epistemic uncertainty. Planning under this objective yields an explicit tradeoff between exploitation and information acquisition without requiring ad hoc exploration heuristics. Under standard regularity assumptions, we establish sublinear regret guarantees for both finite-horizon and infinite-horizon settings. Empirically, experiments on the Meta-World benchmark show UBP2 achieves substantially higher sample efficiency than model-free preference-based methods and non-optimistic model-based baselines.
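The abstract describes a unified score over candidate trajectories that combines expected reward, terminal value, and epistemic uncertainty. The exact form is not given there, so the following is a minimal sketch under loud assumptions: the score is additive, ensemble disagreement (population standard deviation of predicted returns) stands in for epistemic uncertainty, `beta` is a hypothetical weight, and the dynamics ensemble is left out by taking the candidate trajectories as given.

```python
from statistics import mean, pstdev


def trajectory_score(reward_preds, value_preds, beta=1.0):
    """Score one candidate trajectory from ensemble predictions.

    reward_preds: one list of per-step rewards per ensemble member.
    value_preds: one terminal value estimate per ensemble member.
    beta: hypothetical weight on the uncertainty bonus (an assumption).
    """
    returns = [sum(rewards) + v for rewards, v in zip(reward_preds, value_preds)]
    # Mean return rewards exploitation; disagreement rewards information gain.
    return mean(returns) + beta * pstdev(returns)


# Two candidates, each scored by a two-member ensemble (illustrative numbers).
candidates = {
    "safe": ([[1, 1], [1, 1]], [2, 2]),
    "uncertain": ([[1, 1], [1, 3]], [2, 2]),
}
scores = {name: trajectory_score(r, v, beta=0.5) for name, (r, v) in candidates.items()}
best = max(scores, key=scores.get)
```

Here the planner would pick the "uncertain" candidate, since the ensemble members disagree about its return and the bonus favors trajectories that are informative to try.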

2606.19327 2026-06-18 cs.AI cs.CL New submission

Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation


Siyi Gu, Jialin Chen, Sophia Zhou, Arman Cohan, Rex Ying

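Affiliations Yale University

AI summary Proposes a rubric-conditioned self-distillation framework that guides reasoning models with structured, fine-grained feedback, surpassing GRPO by 1.0 points and OPSD by 0.9 points on average on science reasoning benchmarks.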


Abstract

Post-training of reasoning language models is commonly driven by supervised distillation and reinforcement learning with verifiable rewards. Distillation often relies on chain-of-thought annotations that are expensive to obtain and may themselves be noisy, incomplete, or partially incorrect; even when the final solution is correct, an imperfect rationale can interfere with learning. Reinforcement learning with verified rewards, on the other hand, typically compresses evaluative feedback into a scalar signal, obscuring which aspects of a response should be improved. We propose Rubric-Conditioned Self-Distillation, a framework that incorporates rubrics as structured, fine-grained feedback for on-policy self-distillation. Our method conditions the teacher model on criterion-level rubrics and uses it to provide token-level guidance on the student's own sampled trajectories. This design avoids treating a single reference rationale as the sole supervision target. Instead, rubrics specify what a strong response should satisfy, enabling more fine-grained credit assignment over the reasoning process than scalar reward optimization. We instantiate this framework with a two-stage pipeline that first learns to generate task-specific rubrics and then trains a rubric-guided reasoner. We evaluate on a diverse suite of science reasoning benchmarks and results show that rubric-conditioned self-distillation effectively converts rubric-level criteria into token-level guidance over the reasoning process, surpassing GRPO by 1.0 points and OPSD by 0.9 points on average.

2606.19325 2026-06-18 cs.SD cs.AI cs.CV New submission

Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors


Michael Finkelson, Daniel Segal, Eitan Richardson, Shahar Armon, Nani Goldring, Poriya Panet, Nir Zabari, Benjamin Brazowski, Or Patashnik, Yoav HaCohen

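Affiliations Lightricks; Tel Aviv University

AI summary Proposes ScenA, which conditions a pretrained text-to-audio flow-matching foundation model on multiple reference voices and a natural language prompt to generate multi-speaker audio scenes. It uses a high-noise-biased timestep distribution to counter the reference shortcut and outperforms existing systems on the CoVoMix2-Dialogue benchmark.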

Comments Project page at https://finmickey.github.io/scena/


Abstract

Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision: per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. These systems operate within speech-only pipelines that produce clean vocal sequences without the ambient texture of real conversations. We take a different approach. Our method, ScenA, conditions a text-to-audio flow-matching foundation model, pretrained on large-scale in-the-wild data, directly on multiple reference voices and a free-form natural language prompt that describes an entire multi-speaker audio scene. Leveraging such a foundational model allows us to inherit its capacity for natural, non-studio audio: background noise, room acoustics, overlapping dialogue, and spontaneous paralinguistic events, while adding multi-speaker control without any per-turn structure. Concretely, reference latents are concatenated into the model's token sequence and distinguished by lightweight identity-aware positional encodings. However, we identify a critical obstacle to this approach: the Reference Shortcut. During training under standard noise schedules, the model can identify the matching reference by acoustic similarity to the noisy target, bypassing the text prompt entirely. We address this with a high-noise-biased timestep distribution that forces the model to rely on the text prompt for speaker assignment. We evaluate ScenA on the CoVoMix2-Dialogue benchmark, showing that it outperforms existing multi-speaker systems on speaker-binding metrics while generating rich conversational audio with overlapping speech, emotional vocalizations, and ambient sound. Our results demonstrate the advantage of using a general-purpose audio model conditioned on a free-form scene description, rather than passing structured dialog scripts through a speech-only pipeline.
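The abstract names a high-noise-biased timestep distribution but does not say which one. As a rough illustration of the idea only, the sketch below draws noise levels from a power-law skew of a uniform sample. The power-law choice, the `bias` parameter, and the convention that 1.0 means pure noise are all assumptions made for this example.

```python
import random


def sample_noise_level(rng, bias=3.0):
    """Draw a noise level in [0, 1], skewed toward 1.

    Assumes 1.0 means pure noise; bias > 1 pushes more mass to high noise.
    """
    return rng.random() ** (1.0 / bias)


rng = random.Random(0)
levels = [sample_noise_level(rng) for _ in range(10_000)]
average = sum(levels) / len(levels)  # expectation is bias / (bias + 1) = 0.75
```

At high noise levels the target carries little acoustic information, which is consistent with the abstract's point that the model can then no longer match a reference by similarity and has to read the prompt.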

2606.19317 2026-06-18 cs.LG cs.AI New submission

Explaining Attention with Program Synthesis


Amiri Hayes, Belinda Li, Jacob Andreas

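Affiliations NJIT; MIT EECS; MIT CSAIL

AI summary Proposes approximating the behavior of deep network components with executable programs. For Transformer attention heads, generated Python programs reproduce the attention patterns, making the heads interpretable.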


Abstract

A longstanding goal of research on interpretable deep learning is to replace opaque neural computations with human-meaningful symbolic descriptions. In this paper, we propose an approach for approximating the behavior of components of deep networks with executable programs. We focus on attention heads in transformer language models. For a given head, we first compute its associated attention matrices on a collection of randomly selected training examples. Next, we prompt a pre-trained language model with a summary of these matrices, and instruct it to generate a set of Python programs that can reproduce the associated attention patterns given only text from the input sentence. Finally, we re-rank programs according to how well our final set of programs predicts behavior on held-out inputs. We demonstrate that a set of fewer than 1,000 such generated programs can reproduce the attention patterns of heads in GPT-2, TinyLlama-1.1B, and Llama-3B, achieving an average Intersection-over-Union similarity above 75% on TinyStories. Moreover, the best-fit programs can replace neural attention heads without substantially affecting model behavior: replacing 25% of attention heads with programmatic surrogates across the three models incurs only a 16% average perplexity increase, while maintaining performance on a variety of downstream question answering benchmarks. This work contributes a scalable pipeline for reverse-engineering attention heads in transformer models using human-readable, executable code, advancing a path toward symbolic transparency in neural models.
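To make the idea of a program that reproduces an attention pattern concrete, here is a small sketch of what one candidate program and an Intersection-over-Union check could look like. The "previous token" rule, the 0.5 binarization threshold, and the toy attention matrix are all hypothetical. The abstract does not specify how the authors compute IoU, so this is one plausible reading and not their metric.

```python
def previous_token_program(tokens):
    """Hypothetical candidate: each token attends to its predecessor."""
    return {(i, max(i - 1, 0)) for i in range(len(tokens))}


def binarize(attention, threshold=0.5):
    """Keep (query, key) pairs whose weight reaches the assumed threshold."""
    return {
        (i, j)
        for i, row in enumerate(attention)
        for j, weight in enumerate(row)
        if weight >= threshold
    }


def iou(predicted, observed):
    """Intersection-over-Union of two sets of (query, key) pairs."""
    union = predicted | observed
    return len(predicted & observed) / len(union) if union else 1.0


tokens = ["The", "cat", "sat"]
attention = [  # toy causal attention matrix, rows sum to 1
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.2, 0.2, 0.6],
]
score = iou(previous_token_program(tokens), binarize(attention))  # 0.5
```

The program matches the head on the first two rows and misses the third, where the toy head attends to the current token. A re-ranking step like the one in the abstract would compare such scores across many candidate programs on held-out inputs.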

2606.19316 2026-06-18 cs.CV New submission

NeuMesh++: Towards Versatile and Efficient Volumetric Editing with Disentangled Neural Mesh-based Implicit Field


Chong Bao, Yuan Li, Bangbang Yang, Yujun Shen, Hujun Bao, Zhaopeng Cui, Yinda Zhang, Guofeng Zhang

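Affiliations State Key Lab of CAD&CG, College of Computer Science, Zhejiang University; Ant Research; Google; ByteDance

AI summary Proposes a disentangled neural radiance field representation based on mesh vertices that enables efficient volumetric editing guided by geometry, texture, and semantics, including mesh-guided geometry editing, texture swapping, filling, and painting, and semantic editing.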

Comments TPAMI 2025; Project Page: https://zju3dv.github.io/neumeshplusplus/


Abstract

Recently, neural implicit rendering techniques have evolved rapidly and demonstrated significant advantages in novel view synthesis and 3D scene reconstruction. However, existing neural rendering methods for editing purposes offer limited functionalities, e.g., rigid transformation and category-specific editing. In this paper, we present a novel mesh-based representation by encoding the neural radiance field with disentangled geometry, texture, and semantic codes on mesh vertices, which empowers a set of efficient and comprehensive editing functionalities, including mesh-guided geometry editing, designated texture editing with texture swapping, filling and painting operations, and semantic-guided editing. To this end, we develop several techniques including a novel local space parameterization to enhance rendering quality and training stability, a learnable per-vertex modification color to improve the fidelity of texture editing, a spatial-aware optimization strategy to realize precise texture editing, and a semantic-aided region selection to ease the laborious annotation of implicit field editing. Extensive experiments and editing examples on both real and synthetic datasets demonstrate the superiority of our method in representation quality and editing ability. Project page: https://zju3dv.github.io/neumeshplusplus/

2606.19315 2026-06-18 cs.LG New submission

Diffusion-Proof: Recipe for Formal Theorem Proving Beyond Auto-Regressive Generation


Ruida Wang, Rui Pan, Pengcheng Wang, Shizhe Diao, Tong Zhang

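Affiliations University of Illinois Urbana-Champaign; NVIDIA

AI summary Proposes the Diffusion-Proof framework, the first to apply diffusion language models to formal theorem proving. With whole-proof generation and local correction, it improves on ProofNet and MiniF2F by 1.61% and 6.14% respectively and solves an IMO problem that DeepSeek-Prover-V2-7B could not.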


Abstract

Enhancing the formal math reasoning capabilities of Large Language Models (LLMs) has become a key focus in both the mathematical and computer science communities in recent years. While significant progress has been made in using state-of-the-art Auto-Regressive (AR) LLMs for formal theorem proving, these models suffer from inherent limitations. Their next-token prediction generation methods may yield suboptimal performance due to the challenges of long-range coherence and the compounding of errors over long sequences. Recent advancements in diffusion LLMs (dLLMs), which generate text through iterative denoising of a multi-token block, offer a promising alternative. However, the application of dLLMs to formal mathematics, where maintaining long-range coherence is critical, remains largely understudied. To address the challenges above, we propose Diffusion-Proof, to the best of our knowledge the first framework to train and apply dLLMs for formal theorem proving. Our framework contains training and inference methods for two models. The first is dLLM-Prover-7B, which performs whole-proof writing with long-range coherent tactic usage. The second is dLLM-Corrector-7B, a novel large-block diffusion-based correction model. It leverages the in-filling capabilities of dLLMs to perform local proof correction using bi-directional information. Extensive experiments demonstrate that Diffusion-Proof significantly outperforms the AR LLM baseline trained on the same dataset. Diffusion-Proof achieves an absolute improvement of 1.61% on the ProofNet-Test benchmark and 6.14% on the MiniF2F-Test benchmark compared to the baseline. Notably, Diffusion-Proof successfully resolves one IMO problem that the more advanced thinking model DeepSeek-Prover-V2-7B could not solve, showcasing the unique advantage of dLLMs in formal theorem proving.

2606.19314 2026-06-18 cs.RO New submission

Modeling Branches for Active Manipulation using Iterative Parameter Estimation


Madhav Rijal, Rashik Shrestha, Trevor Smith, Yu Gu

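Affiliations Department of Mechanical and Aerospace Engineering, West Virginia University

AI summary Proposes modeling plant branches by iteratively estimating material parameters. Using finite element simulation and a deformation-aware motion planner, it enables precise branch manipulation and reduces deformation energy by 35.69% on average.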

Comments Accepted to IROS 2026


Abstract

This study presents a method for modeling diverse plant branches by iteratively estimating material parameters to support delicate branch manipulation. Branch manipulation is necessary in agricultural robotics for plant repositioning, stabilizing, and clearing visual obstructions in dense foliage. The proposed method builds a tetrahedral branch model from point-cloud data and simulates its behavior using the finite element method. Using real observed deformation data, it iteratively estimates branch parameters and then computes an optimal path with a deformation-aware motion planner to move and stabilize branches within another robot's field of view. Across 30 trials on branches with varying geometries and material properties, the proposed method reduced the deformation energy by 35.69% while increasing the path length by 8.10% on average.

2606.19308 2026-06-18 cs.CL cs.MA New submission

Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play


Leyang Shen, Yang Zhang, Xiaoyan Zhao, Chun Kai Ling, Tat-Seng Chua

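Affiliations National University of Singapore

AI summary To handle decision-making tasks in multi-agent systems that resist decomposition because of stance entanglement, proposes Multi-Agent Fictitious Play (MAFP), a paradigm built on fictitious play that seeks equilibrium through iterated best responses, improving decision quality and robustness.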

Comments 18 pages, 8 figures


Abstract

Large language model (LLM)-based multi-agent systems (MAS) have demonstrated great potential in solving tasks with execution complexity, by distributing subtasks across cooperative agents. However, this divide-and-conquer paradigm falls short on decision-making tasks that are also prevalent in the real world. These tasks require simultaneous reasoning from the stances of all involved stakeholders whose decisions are mutually dependent and thus cannot be solved in isolation. We characterize this challenge as stance entanglement, a form of decision complexity distinct from execution complexity. To address it, we propose Multi-Agent Fictitious Play (MAFP), a novel MAS paradigm that represents stakeholder stances as agents and formulates decision-making as an equilibrium-seeking process. Built on the game-theoretic principle of fictitious play, MAFP iteratively updates each agent's decision by best responding to the empirical mixture of other agents' past decisions. This enables agents to expose and address one another's weaknesses, progressively improving decision quality and robustness. We evaluate MAFP on challenging decision-making tasks that test the capability of deciding strategies for competitive scenarios prior to acting. MAFP outperforms both single-round and multi-round baselines on two complementary metrics, tournament strength and robustness, demonstrating its effectiveness in addressing stance entanglement.
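The fictitious play update described above, where each agent best-responds to the empirical mixture of the others' past decisions, can be sketched for a two-player matrix game. This is classical fictitious play with numeric payoff tables, not MAFP itself: in the paper the best responses come from LLM agents, and the payoff matrices, function names, and starting actions below are assumptions made for illustration.

```python
def best_response(payoff, opponent_counts):
    """Action with the highest expected payoff against the empirical mix.

    payoff[i][j] is the payoff for own action i against opponent action j.
    """
    total = sum(opponent_counts)
    expected = [
        sum(p * count / total for p, count in zip(row, opponent_counts))
        for row in payoff
    ]
    return max(range(len(expected)), key=expected.__getitem__)


def fictitious_play(payoff_a, payoff_b, rounds):
    """Run fictitious play and return each player's empirical action mix."""
    counts_a = [0] * len(payoff_a)
    counts_b = [0] * len(payoff_b)
    counts_a[0] += 1  # arbitrary initial decisions (an assumption)
    counts_b[0] += 1
    for _ in range(rounds - 1):
        a = best_response(payoff_a, counts_b)
        b = best_response(payoff_b, counts_a)
        counts_a[a] += 1
        counts_b[b] += 1
    return [c / rounds for c in counts_a], [c / rounds for c in counts_b]


# Prisoner's dilemma: action 0 cooperates, action 1 defects.
dilemma = [[3, 0], [5, 1]]
mix_a, mix_b = fictitious_play(dilemma, dilemma, rounds=10)  # [0.1, 0.9] each
```

Both players start by cooperating, then switch to defection because it is the best response to any history, so the empirical mix drifts toward the game's equilibrium. In games without a dominant action the mixes move more slowly, and convergence is only guaranteed for certain game classes such as two-player zero-sum games.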

2606.19307 2026-06-18 cs.RO 新提交

Observability and Consistency Analysis for Visual-Inertial Navigation with Anchored Feature Parameterizations

基于锚定特征参数化的视觉惯性导航的可观性与一致性分析

Mitchell Cohen, Vassili Korotkine, James Richard Forbes

发表机构 * McGill University(麦吉尔大学)

AI总结 分析基于滤波的视觉惯性导航系统(VINS)使用锚定特征表示时的可观性与一致性,证明其不可观子空间独立于估计的地标状态,从而改善一致性,但仍依赖导航状态,需额外一致性增强技术。

Comments Accepted to IEEE/RSJ IROS. 8 pages, 3 figures, 4 tables

详情
AI中文摘要

本文分析了使用锚定特征表示的基于滤波的视觉惯性导航系统(VINS)的可观性和一致性特性。结果表明,采用锚定地标参数化的VINS的不可观子空间独立于估计的地标状态,从而无需任何额外修改即可改善估计器的一致性。然而,不可观子空间仍然依赖于估计的导航状态,因此需要额外的一致性增强技术。本文提出了两种方法来改善采用锚定特征表示的VINS的一致性。仿真结果表明,与使用全局参考系解析特征的算法相比,所有采用锚定特征参数化的估计器都表现出更好的一致性,特别是在特征初始化可能较差的情况下。在TUM-VI数据集上的真实世界实验表明,仅使用锚定特征表示即可获得与采用全局特征表示的一致性改进估计器相当的性能,证明了在VINS中使用锚定特征参数化的优势。

英文摘要

This paper presents an analysis of the observability and consistency properties of filtering-based visual-inertial navigation systems (VINS) that utilize anchored feature representations. The unobservable subspace of VINS with anchored landmark parameterizations is shown to be independent of the estimated landmark state, which leads to improved estimator consistency properties without any additional modifications. However, the unobservable subspace is still found to depend on the estimated navigation state, necessitating additional consistency-enforcing techniques. Two methods to improve the consistency of VINS with anchored feature representations are presented. Simulation results showcase that all estimators employing anchored feature parameterizations exhibit improved consistency properties compared to algorithms that estimate features resolved in a global reference frame, especially in scenarios where feature initialization may be poor. Real-world experiments on the TUM-VI dataset showcase that the use of anchored feature representations alone can yield comparable performance to consistency-improved estimators employing a global feature representation, demonstrating the benefit of using anchored feature parameterizations for VINS.

2606.19303 2026-06-18 cs.LG 新提交

P-K-GCN: Physics-augmented Koopman-enhanced Graph Convolutional Network for Deep Spatiotemporal Super-resolution

P-K-GCN:物理增强的Koopman图卷积网络用于深度时空超分辨率

Xizhuo Zhang, Zekai Wang, Fei Liu, Bing Yao

发表机构 * Department of Industrial & Systems Engineering, The University of Tennessee, Knoxville(田纳西大学诺克斯维尔分校工业与系统工程系) Charles F. Dolan School of Business, Fairfield University(费尔菲尔德大学查尔斯·F·多兰商学院) Department of Electrical Engineering & Computer Science, The University of Tennessee, Knoxville(田纳西大学诺克斯维尔分校电气工程与计算机科学系)

AI总结 提出P-K-GCN,结合样条GCN和Koopman算子理论,在非规则几何上实现时空超分辨率,并通过物理损失和理论分析保证误差降低。

详情
AI中文摘要

高保真时空动力学模拟计算成本高昂,因此需要高效的超分辨率技术从粗粒度输入重建高分辨率数据。传统数据驱动方法缺乏物理约束,而简单的物理信息学习难以处理不规则空间几何和复杂时间演化。为解决这些问题,我们提出了一种物理增强的Koopman图卷积网络(P-K-GCN),用于不规则几何上的时空超分辨率。具体地,首先设计了一个基于连续样条的GCN,直接从粗粒度图中提取空间依赖关系,并引入Koopman算子理论将非线性动力学投影到紧凑的潜空间,其中时间演化被线性化。其次,我们通过基于物理的损失增强优化目标,迫使数据驱动重建遵循物理定律,以提高预测保真度和鲁棒性。最后,我们提供了严格的理论分析,证明物理增强和Koopman正则化通过减少Rademacher复杂度和收紧泛化界,数学上保证了超分辨率误差的降低。我们在从稀疏低分辨率测量重建三维心脏几何上的高分辨率心脏电动力学上评估了我们的框架。数值实验表明,我们的方法相比基线模型实现了更高的精度。

英文摘要

High-fidelity simulation of spatiotemporal dynamics is computationally prohibitive, necessitating efficient super-resolution techniques to reconstruct high-resolution data from coarse-grained inputs. Traditional data-driven methods often lack physical constraints, and simple physics-informed learning struggles with irregular spatial geometries and intricately evolving temporal dynamics. To tackle these challenges, we propose a Physics-augmented Koopman-enhanced Graph Convolutional Network (P-K-GCN) for spatiotemporal super-resolution on irregular geometries. Specifically, a continuous spline-based GCN is first designed to extract spatial dependencies directly from the coarse graph, and Koopman operator theory is incorporated to project the nonlinear dynamics into a compact latent space where temporal progression is linearized. Second, we augment the optimization objective with a physics-based loss to force the data-driven reconstructions to adhere to physical laws, improving predictive fidelity and robustness. Finally, we provide a rigorous theoretical analysis, establishing that the physics augmentation and Koopman regularization mathematically guarantee a reduction in super-resolution error by diminishing Rademacher complexity and tightening generalization bounds. We evaluate our framework on reconstructing spatially high-resolution cardiac electrodynamics across a 3D heart geometry from sparse low-resolution measurements. Numerical experiments demonstrate that our method achieves superior accuracy compared to baseline models.

2606.19300 2026-06-18 cs.CV cs.LG 新提交

Confidence is Not Reliability: Rethinking MC Dropout in Brain Tumour Segmentation

置信度不等于可靠性:重新思考脑肿瘤分割中的MC Dropout

Xin Ci Wong, Duygu Sarikaya, Kieran Zucker, Marc De Kamps, Nishant Ravikumar

发表机构 * Centre for Doctoral Training in AI for Medical Diagnosis and Care, School of Computing, University of Leeds(利兹大学计算机学院人工智能医学诊断与护理博士培训中心) School of Computer Science, University of Leeds(利兹大学计算机科学学院)

AI总结 通过MC Dropout不确定性估计,发现全局不确定性-误差对齐(AUROC≈0.97)可能掩盖关键子区域(如增强肿瘤)的严重误校准(ECE=0.915),表明子区域校准评估对临床安全至关重要。

Comments Accepted for MIUA2016

详情
AI中文摘要

多参数MRI中的胶质瘤分割是治疗计划的关键组成部分。一个在治疗关键子区域上静默失败的分割模型会带来患者安全风险,而Dice分数等基于重叠的指标无法暴露这种风险。我们探究通过蒙特卡洛(MC)Dropout进行的体素级不确定性估计能否可靠地识别临床关键子区域中的分割错误,以及校准失败模式是否仅从标准报告指标中可检测。在126名BraTS21患者的两模型实证案例研究中,我们评估了高性能预训练SegResNet和本地训练的带有残差单元的UNet(UNet-Res)。MC dropout保持了分割准确性($|\Delta \text{Dice}|$ $<0.01$),同时实现了强不确定性-误差对齐(熵(H)的AUROC $\approx$0.97),表明不确定性正确地将错误体素排在正确体素之上。基于熵的患者分层识别出一个高不确定性亚组,其分割性能显著较低(全肿瘤Dice中位数$0.835$ vs. $0.925$),支持不确定性作为实用的分诊信号。然而,全局对齐可能掩盖重要的区域特异性差异。尽管AUROC相似,UNet-Res在增强肿瘤熵上接近零($0.054$),期望校准误差(ECE)为$0.915$,Dice仅为$0.714$,表明在最临床关键子区域上置信度严重误校准,这是标准Dice和AUROC报告无法发现的失败模式。这些发现表明,强不确定性-误差对齐对于临床安全是必要但不充分的:在选择临床部署模型时,子区域特异性校准评估必须伴随AUROC评估。

英文摘要

Glioma segmentation in multiparametric MRI is a critical component of treatment planning. A segmentation model that fails silently on treatment-critical sub-regions represents a patient safety risk that overlap-based metrics such as Dice scores cannot expose. We ask whether voxel-level uncertainty estimation via Monte Carlo (MC) Dropout can reliably identify segmentation errors in clinically critical sub-regions, and whether calibration failure modes are detectable from standard reporting metrics alone. In an empirical two-model case study on 126 BraTS21 patients, we evaluate a high-performance pretrained SegResNet and a locally trained UNet with residual units (UNet-Res). MC dropout preserved segmentation accuracy ($|\Delta \text{Dice}|$ $<0.01$) while achieving strong uncertainty-error alignment (AUROC for entropy (H) $\approx$0.97), indicating uncertainty correctly ranks erroneous voxels above correct ones. Entropy-based patient stratification identified a high-uncertainty subgroup with substantially lower segmentation performance (median whole-tumour Dice $0.835$ vs. $0.925$), supporting uncertainty as a practical triage signal. However, global alignment can mask important region-specific differences. Despite similar AUROC, UNet-Res exhibited near-zero enhancing tumour entropy ($0.054$) and Expected Calibration Error (ECE) of $0.915$, with a Dice of only $0.714$, indicating severely miscalibrated confidence on the most clinically critical sub-region, a failure mode invisible to standard Dice and AUROC reporting. These findings demonstrate that strong uncertainty-error alignment is necessary but insufficient for clinical safety: sub-region-specific calibration assessment must accompany AUROC evaluation when selecting models for clinical deployment.
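The two quantities contrasted in the abstract, predictive entropy and Expected Calibration Error, can be computed from MC Dropout outputs roughly as follows. This is a hedged sketch under common textbook definitions: the entropy is taken over the mean class distribution across stochastic passes, and ECE uses equal-width confidence bins. The function names, the bin count, and the toy numbers are assumptions for illustration; the abstract does not state the paper's exact binning or aggregation.

```python
import math


def predictive_entropy(mc_probs):
    """Entropy (in nats) of the mean class distribution over MC passes.

    mc_probs: list of T probability vectors for a single voxel.
    """
    n_classes = len(mc_probs[0])
    mean = [sum(p[k] for p in mc_probs) / len(mc_probs) for k in range(n_classes)]
    return -sum(m * math.log(m) for m in mean if m > 0.0)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: sum over bins of (|B|/N) * |accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 falls in the last bin
        bins[idx].append((conf, hit))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(1 for _, h in members if h) / len(members)
            ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# Toy voxel on which every dropout pass agrees: entropy is near zero
# whether or not the shared prediction is actually right.
agreeing_passes = [[0.99, 0.01]] * 20
low_entropy = predictive_entropy(agreeing_passes)

# Toy overconfident predictions: 99% confidence, but only 1 of 10 is correct.
overconfident_ece = expected_calibration_error([0.99] * 10, [True] + [False] * 9)
```

The toy case mirrors the failure mode the abstract reports: entropy stays low because the stochastic passes agree with one another, while ECE is large because that shared confidence far exceeds the accuracy.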

2606.19297 2026-06-18 cs.LG cs.RO 新提交

Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models

VLA 甚至知道基础知识吗?衡量视觉-语言-动作模型中的常识和世界知识保留

Nikita Kachaev, Andrey Moskalenko, Matvey Skripkin, Nikita Kurlaev, Daria Pugacheva, Albina Burlova, Mikhail Kolosov, Denis Shepelev, Andrey Kuznetsov, Elena Tutubalina, Aleksandr I. Panov, Alexey K. Kovalev, Vlad Shakhuro

发表机构 * CogAI Lab(CogAI实验室) FusionBrain Lab(FusionBrain实验室) IAI MSU(莫斯科大学人工智能研究所) Lomonosov MSU(莫斯科国立罗蒙诺索夫大学) NUST MISIS(国立研究型技术大学MISIS) Applied AI Institute(应用人工智能研究所) HSE University(高等经济大学) Generalizable AI Systems(通用人工智能系统实验室) ISP RAS(俄罗斯科学院系统编程研究所) MIRAI Domain-specific NLP Group(领域特定自然语言处理组)

AI总结 提出 Act2Answer 协议,通过动作回答评估 VLA 模型的知识保留,发现模型在简单概念上表现良好,但在丰富语义类别上存在差距,且 VQA 联合训练有助于知识保留。

Comments Project page: https://tttonyalpha.github.io/act2answer/

详情
AI中文摘要

具身视觉-语言-动作(VLA)模型通常通过在机器人数据上微调强大的预训练 VLM 获得,但目前尚不清楚它们在适应后保留了多少常识和事实知识。在知识敏感任务上的失败是模糊的,混淆了知识缺失与低级控制泛化能力差。我们引入 Act2Answer,一种轻量级协议,通过要求智能体通过动作来回答,将 VLM 知识基准适配到 VLA 评估。每个问题变成一个简短的桌面场景,其中智能体执行单个物体放置动作以选择候选答案,从而产生动作基础的、减少控制混淆的成功率。我们在不同的常识和世界知识类别中策划了这样的环境测试套件,并引入逐层意图探测以定位 VLM 骨干和动作头中与答案相关的信息。在对 7 个 VLA 模型和 9 个 VLM 基线的大规模研究中,我们系统地跨类别对模型进行排名,发现 VLA 在简单概念上表现稳健,但在更丰富的语义类别上相对于其源 VLM 显示出更大的差距,VQA 联合训练与更好的知识保留相关,并且答案相关信号在 VLA 中间层达到峰值,但在上层减弱。Act2Answer 可在以下网址获取:https://tttonyalpha.github.io/act2answer/。

英文摘要

Embodied Vision-Language-Action (VLA) models are typically obtained by fine-tuning powerful pretrained VLMs on robotics data, yet it is unclear how much commonsense and factual knowledge they retain after adaptation. Failures on knowledge-sensitive tasks are ambiguous, conflating missing knowledge with poor generalization of low-level control. We introduce Act2Answer, a lightweight protocol that adapts VLM knowledge benchmarks to VLA evaluation by requiring agents to answer through action. Each question becomes a short tabletop episode where the agent performs a single object-placement action to select among candidate answers, yielding an action-grounded success rate with reduced control confounds. We curate a test suite of such environments across diverse commonsense and world-knowledge categories and introduce layerwise intent probing to localize answer-relevant information across the VLM backbone and action head. In a large-scale study of 7 VLA models and 9 VLM baselines, we systematically rank models across categories, finding that VLAs show solid performance on simple concepts while exhibiting larger gaps on richer semantic categories relative to their source VLMs, that VQA co-training is associated with better knowledge retention, and that answer-relevant signals peak in middle VLA layers but attenuate in upper layers. Act2Answer is available at https://tttonyalpha.github.io/act2answer/.

2606.19292 2026-06-18 cs.LG 新提交

Risk Stratification for ICU Delirium using Pervasive Ambient Sensing Information

使用普适环境感知信息进行ICU谵妄风险分层

Jiaqing Zhang, Sabyasachi Bandyopadhyay, Miguel Contreras, Jessica Sena, Yuanfang Ren, Andrea Davidson, Ziyuan Guan, Tezcan Ozrazgat-Baslanti, Subhash Nerella, Azra Bihorac, Parisa Rashidi

发表机构 * University of Florida(佛罗里达大学) Stanford University(斯坦福大学)

AI总结 本研究利用环境声音和光照强度数据,通过高效序列神经网络模型预测ICU患者谵妄风险,发现声音是主要预测因子,结合光照可改善短期预测,AUC达0.80。

详情
AI中文摘要

谵妄是重症监护室(ICU)中常见且严重的并发症,与发病率增加、住院时间延长和医疗成本升高相关。尽管其普遍存在,早期预测和预防仍具挑战性。环境因素如环境声音和光照可能影响谵妄的发生,但在风险评估中常被忽视。在本研究中,我们检验了光照强度和声压级是否能在多个预测时间窗口内独立预测谵妄。我们评估了四种高效的序列神经网络模型,这些模型基于来自9个ICU的309名患者的数据,用于预测10种预测窗口大小的谵妄。我们使用Shapley Additive Explanations分析报告了特征重要性和影响方向。卷积模型实现了最强的区分能力,在声音数据和组合数据上的AUC均为0.80。声音特征是整体上的主要预测因子。将声音与光照结合改善了短期(<1周)预测,组合模型在感知期后立即分配最高风险。这些发现表明,被动环境感知,尤其是声音,可以为谵妄风险评估增加临床上有意义、可解释的信号,并为丰富多模态ICU预测和预防策略提供实用途径。

英文摘要

Delirium is a common and serious complication in the Intensive Care Unit (ICU), associated with increased morbidity, prolonged hospital stays, and higher healthcare costs. Despite its prevalence, early prediction and prevention remain challenging. Environmental factors such as ambient sound and light may influence the onset of delirium, yet they are often overlooked in risk assessments. In this study, we examined whether light intensity and sound pressure levels can independently predict delirium across multiple prediction horizons. We evaluated four efficient sequential neural network models on data collected from 9 ICUs across 309 patients to predict delirium for 10 prediction-window sizes. We reported feature importance and direction of influence using Shapley Additive Explanations analysis. The convolutional model achieved the strongest discrimination, with AUC = 0.80 on sound data and on combined data. Sound features were the dominant predictors overall. Integrating sound with light improved short-term ($<1$ week) prediction, with the combined model assigning the highest risk immediately after the sensing period. These findings suggest that passive ambient sensing, especially sound, can add a clinically meaningful, interpretable signal for delirium risk estimation and offer a practical pathway to enrich multimodal ICU prediction and prevention strategies.

2606.19277 2026-06-18 cs.CV 新提交

A Unified Framework for Efficient Remote Sensing Visual Question Answering: Adapting Dual, Hybrid, and Encoder-Decoder Architectures

高效遥感视觉问答的统一框架:适配双编码器、混合架构和编码器-解码器架构

Timothy Agboada, Shikha Chandel, Yadav Raj Ghimire, Leila Hashemi-Beni

发表机构 * Engineering North Carolina A&T State University Greensboro - NC, USA College of Science Technology North Carolina A&T State University Greensboro - NC, USA

AI总结 提出RS Adapter参数高效微调策略,在三种视觉语言模型架构上注入轻量瓶颈适配器,仅用不到5%可训练参数实现遥感VQA,混合架构FLAVA在多模态推理与检索间取得最佳平衡。

Comments 4 pages, 2 figures, accepted and to be presented at 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026), scheduled for 9 to 14 August 2026 in Washington, D.C.

详情
AI中文摘要

遥感领域的视觉问答因航空影像的高分辨率、多尺度目标分布和语义复杂性而面临独特挑战。尽管通用领域的基础模型取得了显著成功,但直接应用于RSVQA受到巨大领域偏移和全微调计算成本高昂的阻碍。本研究对RS Adapter(一种参数高效微调策略)在三种不同的视觉语言模型架构上进行了比较分析:双编码器CLIP、编码器-解码器BLIP和混合FLAVA。我们引入了一个统一的架构手术流水线,将轻量瓶颈适配器注入冻结骨干网络的注意力和MLP层,从而以少于5%的可训练参数实现快速适应。在高分辨率RSVQA x数据集上的实验结果表明,虽然所有适配模型均实现收敛,但混合FLAVA架构相比单模态对应模型提供了更优越的多模态推理与检索能力平衡。我们的发现为灾害评估和城市监测中的资源高效VQA建立了新的基准。

英文摘要

Visual Question Answering (VQA) in the Remote Sensing (RS) domain presents unique challenges due to the high resolution, multi-scale object distribution, and semantic complexity of aerial imagery. While general-domain Foundation Models have achieved remarkable success, their direct application to RSVQA is hindered by massive domain shifts and the computationally prohibitive nature of full fine-tuning. This study presents a comparative analysis of RS Adapter, a Parameter-Efficient Fine-Tuning (PEFT) strategy, applied across three distinct Vision-Language Model (VLM) architectures: the Dual-Encoder CLIP, the Encoder-Decoder BLIP, and the Hybrid FLAVA. We introduce a unified architectural surgery pipeline that injects lightweight bottleneck adapters into the attention and MLP layers of frozen backbones, enabling rapid adaptation with less than 5 percent of the parameters trainable. Experimental results on the high-resolution RSVQA x dataset demonstrate that while all adapted models achieve convergence, the Hybrid FLAVA architecture offers a superior balance of multimodal reasoning and retrieval capabilities compared to its unimodal counterparts. Our findings establish a new baseline for resource-efficient VQA in disaster assessment and urban monitoring.
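The parameter budget of bottleneck adapters can be sanity-checked with simple arithmetic. The sketch below assumes the standard adapter layout (a down-projection and an up-projection, each with a bias, around a nonlinearity) and two adapters per layer, one after attention and one after the MLP. The backbone size, hidden width, bottleneck width, and function names are hypothetical values chosen for illustration, not figures from the paper.

```python
def adapter_param_count(hidden_dim, bottleneck_dim):
    """Parameters of one bottleneck adapter: down- and up-projection, each with a bias."""
    down = hidden_dim * bottleneck_dim + bottleneck_dim
    up = bottleneck_dim * hidden_dim + hidden_dim
    return down + up


def trainable_fraction(frozen_params, hidden_dim, bottleneck_dim, n_layers,
                       adapters_per_layer=2):
    """Share of all parameters that are trainable when only the adapters are updated."""
    trainable = n_layers * adapters_per_layer * adapter_param_count(hidden_dim, bottleneck_dim)
    return trainable / (frozen_params + trainable)


# Hypothetical ViT-Base-like backbone: 86M frozen parameters, 12 layers, width 768.
fraction = trainable_fraction(86_000_000, hidden_dim=768, bottleneck_dim=64, n_layers=12)
```

Under these assumed sizes the adapters account for roughly 2.7 percent of all parameters, which is consistent in spirit with the sub-5-percent budget the abstract reports.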

2606.19269 2026-06-18 cs.SD 新提交

Scoring Backends Matter More Than Pooling: A Systematic Study of Training-Free Anomalous Sound Detection under Domain Shift

评分后端比池化更重要:域偏移下无训练异常声音检测的系统研究

Jingwen Zhou, Mingzhe Wang

发表机构 * Xidian University(西安电子科技大学)

AI总结 本研究系统比较了无训练异常声音检测中不同评分后端和时序池化方法对域偏移鲁棒性的影响,发现后端选择(如余弦距离、马氏距离等)主导性能,平均AUC变化13.8点,而池化仅3.2点,并提出无标签分数融合方法。

详情
AI中文摘要

无训练异常声音检测(ASD)通过将测试片段与来自冻结预训练音频编码器的正常嵌入记忆库进行评分。最近的研究将域偏移鲁棒性主要归因于帧级特征随时间池化的方式;而应用于池化嵌入之上的评分后端受到的关注较少。使用单个冻结的BEATs编码器在DCASE 2023 Task 2开发集(全部七种机器类型)上,我们交叉了四种经典后端——最近邻余弦距离、马氏距离、局部密度归一化kNN和PCA子空间重建残差——与三种时序池化(均值、GeM、最大值)。切换后端使目标域AUC平均移动13.8点(最高达53.8),而切换池化仅移动3.2点:在这种无训练机制中,后端而非池化主导域偏移鲁棒性。没有后端在所有情况下都表现最佳,但机器相关的模式在DCASE 2025开发数据(风扇、轴承)上重现。利用这一点,我们提出了一种无标签分数融合方法,该方法对每个后端使用其训练库自分数进行z归一化并取最小值;它达到了63.3%的调和平均目标AUC,而每机器oracle为64.4%,超过了所有固定的单一后端,同时保持了源域准确性。我们还报告了一个负面结果:通过源域伪验证与代理异常来选择后端失败,因为所有后端在代理任务上都饱和了。

英文摘要

Training-free anomalous sound detection (ASD) scores a test clip against a memory bank of normal embeddings from a frozen pretrained audio encoder. Recent work attributes domain-shift robustness mainly to how frame-level features are pooled over time; the scoring backend applied on top of the pooled embedding has received far less systematic attention. Using a single frozen BEATs encoder on the DCASE 2023 Task 2 development set (all seven machine types), we cross four classical backends -- nearest-neighbor cosine distance, Mahalanobis distance, locally density-normalized kNN, and PCA-subspace reconstruction residual -- with three temporal poolings (mean, GeM, max). Switching the backend moves target-domain AUC by 13.8 points on average (up to 53.8), whereas switching the pooling moves it by only 3.2 points: in this training-free regime, the backend, not the pooling, dominates domain-shift robustness. No backend wins everywhere, but the machine-dependent pattern reproduces on the DCASE 2025 development data (fan, bearing). Exploiting this, we propose a label-free score fusion that z-normalizes each backend with its training-bank self-scores and takes the minimum; it reaches a harmonic-mean target AUC of 63.3% versus 64.4% for the per-machine oracle, surpassing every fixed single backend while preserving source-domain accuracy. We also report a negative result: selecting a backend by source-domain pseudo-validation with proxy outliers fails, because all backends saturate on the proxy task.
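The label-free fusion rule described in the abstract (z-normalize each backend with its training-bank self-scores, then take the minimum) can be sketched in a few lines. The function names, the toy scores, and the use of the population standard deviation are assumptions; the abstract does not specify how the self-scores are computed (for example, whether each bank item is excluded when scoring itself).

```python
import statistics


def z_normalize(score, bank_self_scores):
    """Standardize a test score with the mean/std of a backend's training-bank self-scores."""
    mean = statistics.fmean(bank_self_scores)
    std = statistics.pstdev(bank_self_scores)
    return (score - mean) / std if std > 0 else 0.0


def fused_score(test_scores, bank_self_scores):
    """Minimum of z-normalized anomaly scores across backends.

    test_scores: {backend_name: raw anomaly score of the test clip}
    bank_self_scores: {backend_name: raw scores of normal training clips}
    """
    return min(
        z_normalize(test_scores[name], bank_self_scores[name])
        for name in test_scores
    )


# Hypothetical raw scores on incompatible scales (cosine distance vs. Mahalanobis).
bank = {
    "knn_cosine": [0.10, 0.12, 0.08, 0.10],
    "mahalanobis": [20.0, 22.0, 18.0, 20.0],
}
clip = {"knn_cosine": 0.16, "mahalanobis": 23.0}
score = fused_score(clip, bank)
```

Normalizing first puts backends with different raw scales on a common footing; taking the minimum then flags a clip only when every backend finds it unusual relative to its own normal data.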

2606.19267 2026-06-18 cs.RO cs.SY eess.SY 新提交

A Mixed-Reality Testbed for Autonomous Vehicles

自动驾驶汽车的混合现实测试平台

H. M. Sabbir Ahmad, Ehsan Sabouni, Emrullah Celik, Zean Wan, Damola Ajeyemi, Christos G. Cassandras, Wenchao Li

发表机构 * Boston University(波士顿大学)

AI总结 提出一种混合现实硬件在环测试平台,集成物理移动机器人与高保真仿真环境,用于验证感知、规划和控制算法,并支持多智能体系统研究。

Comments 9 pages, 7 figures, 1 table

详情
AI中文摘要

我们提出了一种用于自动驾驶汽车的混合现实、硬件在环(HIL)测试平台,该平台将物理移动机器人测试平台与高保真仿真环境无缝集成。虚拟仿真能够创建多样化的、安全关键的驾驶场景,以验证最先进的感知、规划和控制算法,同时通过配备多模态传感器的物理机器人在逼真的虚拟环境中增强仿真,进一步促进严格的验证。我们的测试平台还利用无线通信实现车辆连接,并通过物理机器人和虚拟仿真代理的组合容纳大量代理,支持包括网联自动驾驶汽车(CAV)在内的多智能体系统研究。最后,我们提出了一种结合感知、规划和一种新颖的基于控制障碍函数(CBF)的在线学习控制器的安全保证框架,用于CAV。使用所提出框架的实验用于验证和展示测试平台的关键功能以及其在弥合仿真与真实世界硬件部署之间差距方面的整体效用。

英文摘要

We propose a mixed-reality, hardware-in-the-loop (HIL) testbed for autonomous vehicles that seamlessly integrates a physical testbed of mobile robots with a high-fidelity simulation environment. The virtual simulation enables the creation of diverse, safety-critical driving scenarios to validate state-of-the-art perception, planning, and control algorithms, while augmenting simulations with physical robots equipped with multimodal sensors in photorealistic virtual environments further facilitates rigorous validation. Our testbed also features vehicular connectivity using wireless communication and can accommodate a large number of agents through the combination of physical robots and virtual simulated agents, supporting research on multi-agent systems including Connected and Autonomous Vehicles (CAVs). Finally, we present a safety-guaranteed framework for CAVs that combines perception, planning, and a novel online learning-based controller using Control Barrier Functions (CBFs). Experiments with the proposed framework validate and demonstrate the key functionalities of the testbed and its overall utility in bridging the gap between simulation and real-world hardware deployment.

2606.19266 2026-06-18 cs.CL cs.AI 新提交

Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA

医学LLM适应中的权衡:法语问答的实证研究

Ikram Belmadani, Oumaima El Khettari, Carlos Ramisch, Frederic Bechet, Richard Dufour, Benoit Favre

发表机构 * Aix-Marseille Univ., CNRS, LIS UMR 7020(艾克斯-马赛大学,法国国家科学研究中心,计算机与系统实验室) Nantes Univ., École Centrale Nantes, CNRS, LS2N UMR 6004(南特大学,南特中央理工学院,法国国家科学研究中心,数字科学实验室) Grenoble Alpes Univ., CNRS, INRIA, Grenoble INP, LIG UMR 5217(格勒诺布尔-阿尔卑斯大学,法国国家科学研究中心,法国国家信息与自动化研究所,格勒诺布尔理工学院,信息学实验室)

AI总结 通过法语医学问答任务,实证比较持续预训练(CPT)和监督微调(SFT)在多个模型家族和规模下的效果,发现CPT+SFT在多项选择问答上最优但增益小,SFT是强且经济的默认选择,而CPT在开放式问答中提升重叠指标。

详情
AI中文摘要

大型语言模型(LLMs)的发展导致了对它们适应专业领域和语言的关注增加,但领域适应策略的有效性仍不明确。我们以法语医学问答(QA)为案例,进行了医学领域适应的研究。我们比较了持续预训练(CPT)、监督微调(SFT)及其组合,跨越三个模型家族、多个规模和三种初始化类型,明确区分了适应效果与基础模型选择。我们在贪婪和约束解码下,使用自动指标和LLM-as-a-Judge评估,评估了多项选择问答(MCQA)和开放式问答(OEQA)。对于MCQA,CPT+SFT通常取得最佳分数,但相比SFT的增益很小且通常不显著,使得SFT成为强大且成本效益高的默认选择。对于OEQA,CPT持续改善基于重叠的指标,而SFT常降低生成质量;指令调优和CPT+SFT在基于LLM的评估中更受青睐。跨语言实验进一步显示,法语适应能有效迁移到英语基准。总体而言,我们为在计算约束下选择适应策略提供了实用指南。

英文摘要

The development of large language models (LLMs) has led to an increased focus on their adaptation to specialized domains and languages, yet the effectiveness of domain adaptation strategies remains unclear. We present a study of medical domain adaptation using French medical question-answering (QA) as a case study. We compare continual pretraining (CPT), supervised fine-tuning (SFT), and their combination across three model families, multiple sizes, and three initialization types, explicitly disentangling adaptation effects from base model choice. We evaluate both multiple-choice (MCQA) and open-ended QA (OEQA) under greedy and constrained decoding using automatic metrics and LLM-as-a-Judge evaluation. For MCQA, CPT+SFT most often achieves the best scores, but gains over SFT are small and frequently not statistically significant, making SFT a strong and cost-effective default. For OEQA, CPT consistently improves overlap-based metrics, while SFT often degrades generation quality; instruction tuning and CPT+SFT are preferred by LLM-based evaluation. Cross-lingual experiments further show effective transfer from French adaptation to English benchmarks. Overall, we provide practical guidelines for selecting adaptation strategies under computational constraints.

2606.19265 2026-06-18 cs.RO 新提交

Shape Sensing of Continuum Robots using Direct Laser Writing

使用直接激光写入的连续体机器人形状感知

Amber K. Rothe, Nidhi Malhotra, Jaydev P. Desai

发表机构 * Winship Cancer Institute of Emory University(埃默里大学温希普癌症研究所) Medical Robotics and Automation (RoboMed) Laboratory, Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology(佐治亚理工学院华莱士·H·库尔特生物医学工程系医疗机器人与自动化实验室)

AI总结 本文利用直接激光写入技术制造应变传感器,集成于连续体机器人关节中,通过线性和非线性模型预测关节角度,误差低至1.76度,并实现闭环控制,跟踪误差小于3度。

Comments This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

连续体机器人因其固有的柔顺性和灵巧性,为微创和自然腔道手术提供了一种有前景的方法。然而,这种灵活性也使得估计机器人当前形状变得具有挑战性。已有多种方法用于重建这些机器人的形状,包括成像、光学传感、磁传感和电阻传感。使用直接激光写入(DLW)制造的应变传感器可以提供一种替代传感方法。该技术涉及使用激光诱导某些聚合物碳化,以创建石墨烯图案,例如应变传感器。在本文中,我们展示了如何使用同一激光和同一设置将柔性连续体关节和DLW传感器加工成一个整体结构。使用线性和非线性模型对制造的传感器进行表征,这些模型用于预测关节角度,误差低至1.76度。此外,我们展示了如何使用DLW传感器在机器人关节中实现闭环控制,跟踪误差低于3度。

英文摘要

Continuum robots offer a promising approach for minimally invasive and natural-orifice surgical procedures due to their inherent compliance and dexterity. However, this flexibility also makes estimating the current shape of the robot challenging. Several approaches have been used to reconstruct the shape of these robots, including imaging, optical sensing, magnetic sensing, and resistive sensing. Strain sensors fabricated using direct laser writing (DLW) could provide an alternative sensing method. This technique involves using a laser to induce carbonization of certain polymers to create graphene patterns, such as strain sensors. In this paper, we demonstrate how a flexible continuum joint and a DLW sensor can be machined as one monolithic structure using the same laser and the same setup. The fabricated sensors are characterized using linear and nonlinear models, which are used to predict the joint angle with error as low as 1.76 degrees. Furthermore, we demonstrate how a DLW sensor can be used to implement closed-loop control in a robotic joint, achieving tracking error under 3 degrees.

2606.19262 2026-06-18 cs.LG 新提交

Detecting Hidden ML Training With Zero-Overhead Telemetry

使用零开销遥测检测隐藏的机器学习训练

Robi Rahman, Sabiha Tajdari

发表机构 * Machine Intelligence Research Institute(机器智能研究所) University of Virginia(弗吉尼亚大学)

AI总结 本文评估了仅使用零开销、隐私保护的NVML遥测(内容无关信号)对GPU工作负载分类的对抗鲁棒性,开发了一个分类器,在识别训练工作负载时达到98.2%的二元准确率,并对最具挑战性的意外工作负载达到43-87%的准确率。

Comments Technical AI Governance Research workshop at ICML 2026

详情
AI中文摘要

硬件支持的GPU工作负载监控是许多AI计算治理方案的基础,但如果开发者能够击败监控机制,这些方案将不可行。我们评估了仅使用零开销、隐私保护的NVML遥测(内容无关信号,观察计算的物理效应而不访问模型权重、训练数据或超参数)的GPU工作负载分类的对抗鲁棒性。在5轮监控-逃避迭代中,我们在跨越4代架构的9个GPU模型上评估了20种逃避策略家族。我们开发了一个分类器,在整个语料库上识别训练工作负载时达到98.2%的二元准确率,并在最具挑战性的意外工作负载上(即使它们被对抗性伪装)达到43-87%的准确率。

英文摘要

Hardware-enabled monitoring of GPU workloads underpins many proposals for AI compute governance, but if developers can defeat monitoring mechanisms, such schemes are unworkable. We evaluate the adversarial robustness of GPU workload classification using only zero-overhead, privacy-preserving NVML telemetry: content-agnostic signals that observe physical effects of computation without accessing model weights, training data, or hyperparameters. Across 5 rounds of monitor-evader iteration, we evaluate 20 evasion strategy families on 9 GPU models spanning 4 architecture generations. We develop a classifier that achieves 98.2% binary accuracy at identifying training workloads across the whole corpus, and 43-87% accuracy against the most challenging unexpected workloads even when they are adversarially disguised.

2606.19258 2026-06-18 cs.CV cs.RO 新提交

CABLE: Cloud-Assisted Bandwidth-efficient LMM-based Encoding for V2X Systems

CABLE: 面向V2X系统的云辅助带宽高效LMM编码框架

Haohua Que, Zhipeng Bao, Qianyi Wu, Handong Yao

发表机构 * University of Georgia(佐治亚大学)

AI总结 提出CABLE框架,通过边缘端利用自我运动补偿和残差运动线索传播云分割掩码,生成感兴趣区域(ROI)并仅上传ROI掩码图像,形成掩码-ROI-LMM反馈循环,在五个数据集上实现73-87%的ROI像素覆盖减少和5-8倍LMM预填充加速。

详情
AI中文摘要

云托管的大型多模态模型(LMM)可以为车联网系统提供强大的开放词汇感知能力,但简单地将全分辨率帧从边缘传输到云会导致严重的通信开销和云侧预填充延迟。我们提出了CABLE,一种用于边缘-云感知的云辅助带宽高效LMM编码框架。CABLE在边缘端利用自我运动补偿传播先前的云分割掩码,通过残差运动线索进行细化,并通过走廊包络整合断开区域,形成鲁棒的感兴趣区域(ROI)。仅上传ROI掩码图像,而云分割输出作为下一帧的先验反馈,形成掩码-ROI-LMM反馈循环。在五个数据集(nuScenes、WOD-ZB、Waymo、KITTI和CADC)上的实验表明,该方法在保持感知能力的同时实现了显著的通信节省,相对于全帧推理,ROI像素覆盖减少73-87%,估计LMM预填充加速5-8倍,检测质量略有折衷。

英文摘要

Cloud-hosted large multimodal models (LMMs) can provide strong open-vocabulary perception for Vehicle-to-Everything systems, but naively transmitting full-resolution frames from edge to cloud causes severe communication overhead and high cloud-side prefill latency. We present CABLE, a cloud-assisted bandwidth-efficient LMM-based encoding framework for edge-cloud perception. CABLE propagates the previous cloud segmentation mask on the edge using ego-motion compensation, refines it with residual-motion cues, and consolidates disconnected regions via a corridor envelope to form a robust region of interest (ROI). Only ROI-masked images are uploaded, while the cloud segmentation output is fed back as the prior for the next frame, forming a mask-to-ROI-to-LMM feedback loop. Experiments on five datasets (nuScenes, WOD-ZB, Waymo, KITTI, and CADC) show consistent communication savings while largely preserving perception, achieving $73$--$87\%$ ROI pixel-coverage reduction with $5$--$8\times$ estimated LMM prefill speedup at a modest detection-quality trade-off relative to full-frame inference.

2606.19257 2026-06-18 cs.CL 新提交

DreamReasoner-8B: Block-Size Curriculum Learning for Diffusion Reasoning Models

DreamReasoner-8B:面向扩散推理模型的块大小课程学习

Zirui Wu, Lin Zheng, Jiacheng Ye, Shansan Gong, Xueliang Zhao, Yansong Feng, Wei Bi, Lingpeng Kong

发表机构 * The University of Hong Kong(香港大学) Peking University(北京大学)

AI总结 提出块大小课程学习,通过从细粒度到粗粒度的渐进训练,解决块扩散语言模型在长链推理中性能差距问题,DreamReasoner-8B在数学和代码推理上达到与Qwen3-8B相当的水平。

详情
AI中文摘要

块扩散语言模型通过并行块级去噪加速解码,但其能否可靠地扩展到长思维链(CoT)推理仍未解决。为此,我们开发了开源块扩散推理模型DreamReasoner-8B,并系统研究了训练和推理块大小如何影响长CoT推理。我们的分析揭示了显著的性能差距:使用大块大小训练会导致推理性能极差,而小块大小则能保持有效的推理。为了弥合这一粒度差距,我们提出了块大小课程学习,逐步从细粒度块大小过渡到粗粒度块大小进行训练,从而克服了这一限制,并实现了在多种推理块大小上泛化的强大推理性能。在数学和代码推理基准测试中,DreamReasoner-8B取得了与领先的开源自回归模型(如Qwen3-8B)相竞争的结果。这项工作为高效、具备推理能力的扩散语言模型奠定了实践基础。我们在以下网址发布模型:https://github.com/DreamLM/DreamReasoner。

英文摘要

Block diffusion language models accelerate decoding through parallel block-wise denoising, yet whether they can be reliably scaled for long chain-of-thought (CoT) reasoning remains unresolved. To this end, we develop DreamReasoner-8B, an open-source block diffusion reasoning model, and conduct a systematic study of how training and inference block sizes affect long-CoT reasoning. Our analysis reveals a stark performance disparity: training with large block sizes yields remarkably poor reasoning, whereas small block sizes preserve effective reasoning. To bridge this granularity gap, we propose block-size curriculum learning, which gradually transitions training from fine-grained to coarse-grained block sizes, thereby overcoming this limitation and enabling strong reasoning performance that generalizes across diverse inference block sizes. On mathematical and code reasoning benchmarks, DreamReasoner-8B achieves results competitive with leading open autoregressive models such as Qwen3-8B. This work establishes a practical foundation for efficient, reasoning-capable diffusion language models. We release our model at https://github.com/DreamLM/DreamReasoner.
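One simple way to realize a fine-to-coarse block-size curriculum is a staged schedule that maps the training step to a block size. The sketch below is purely illustrative: the equal-length stages, the specific block sizes, and the function name are assumptions, since the abstract only states that training transitions gradually from fine-grained to coarse-grained block sizes.

```python
def block_size_at(step, total_steps, block_sizes=(4, 8, 16, 32)):
    """Block size for a training step under an equal-length staged curriculum.

    block_sizes must be ordered from fine-grained to coarse-grained.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must lie in [0, total_steps)")
    stage = step * len(block_sizes) // total_steps
    return block_sizes[stage]


# Hypothetical 1000-step run: the schedule never moves back to a finer block size.
schedule = [block_size_at(s, 1000) for s in range(1000)]
```

A smoother transition (for example, sampling block sizes from a distribution that shifts over training) would fit the same description; the staged version is shown only because it is the easiest to inspect.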

2606.19255 2026-06-18 cs.LG 新提交

SCAN: Enhance Time Series Anomaly Detection via Multi-Scale Neighborhood-Centered Clustering

SCAN: 通过多尺度邻域中心聚类增强时间序列异常检测

Xingze Zheng, Hanyin Cheng, Siyuan Wang, Yiting Hao, Peng Chen, Yuan Jun, Yang Shu

发表机构 * East China Normal University(华东师范大学) APPLab, Huawei(华为2012应用实验室) Huawei(华为)

AI总结 提出SCAN方法,通过多尺度聚类增强重建型异常检测,在表示层集成正常模式聚类中心约束重建,在异常判据层结合聚类概率与重建误差,并利用邻域中心表示改进聚类性能,在多个真实数据集上达到最优。

详情
AI中文摘要

时间序列异常检测在广泛的现实应用中扮演着关键角色。基于重建的方法已成为主流范式,但它们面临过度泛化和欠泛化问题,且难以平衡。为了解决这一问题,我们引入多尺度聚类来增强基于重建的方法。在表示层面,我们整合正常模式的聚类中心表示,以约束模型针对代表性正常模式进行重建,防止强大能力和表示能力的主导。在异常判据层面,我们基于聚类成员概率推导异常置信度分数,并将其与重建误差结合,提供双重检测标准。此外,聚类中心表示和异常置信度分数的有效性取决于聚类性能。因此,我们提取邻域中心表示用于多视图聚类,以提高聚类性能。在来自不同应用领域的多个真实数据集上的大量实验表明,SCAN达到了最先进的性能。

英文摘要

Time series anomaly detection plays a crucial role in a wide range of real-world applications. Reconstruction-based methods have become the mainstream paradigm, but they suffer from over-generalization and under-generalization problems, which are challenging to balance. To address this, we introduce multi-scale clustering to enhance reconstruction-based methods. At the representation level, we integrate the cluster center representations of normal patterns to constrain the model to target representative normal patterns for reconstruction, preventing the model's strong capacity and representation capability from dominating. At the anomaly criterion level, we derive an anomaly confidence score based on cluster membership probability and combine it with the reconstruction error, providing dual criteria for detection. Furthermore, the effectiveness of the cluster center representations and the anomaly confidence score depends on the clustering performance. Accordingly, we extract neighborhood-centered representations for multi-view clustering to improve clustering performance. Extensive experiments on multiple real-world datasets from diverse application domains demonstrate the state-of-the-art performance of SCAN.

2606.19253 2026-06-18 cs.CV cs.AI cs.LG cs.RO 新提交

OneCanvas: 3D Scene Understanding via Panoramic Reprojection

OneCanvas: 通过全景重投影实现3D场景理解

Bartłomiej Baranowski, Dave Zhenyu Chen, Matthias Nießner

发表机构 * Technical University of Munich(慕尼黑工业大学) Huawei(华为)

AI总结 提出OneCanvas方法,将多视图补丁特征聚合到全景画布上,利用深度和相机位姿进行重投影,无需复杂几何编码器或大量训练,在SQA3D等基准上达到最先进精度。

Comments Project page: https://baranowskibrt.github.io/onecanvas/

详情
AI中文摘要

现有的视觉语言模型(VLM)中的3D场景理解方法要么依赖复杂的、模型特定的几何编码器,要么为了追求空间推理而需要大量的训练预算。相反,OneCanvas将所有视图的补丁特征聚合到一个单一的等距柱状全景画布上。具体来说,每个补丁利用其深度和相机位姿被反投影到3D世界坐标,然后根据从画布原点看到的该点的连续经度和纬度放置在画布上,无需对重叠视图进行光栅化或聚合。补丁的度量坐标的3D位置嵌入被添加到其特征中,从而恢复了将世界位置压缩到角度画布坐标时丢失的深度。因此,来自所有帧的补丁共享一个空间坐标系,无需融合或对主干网络进行重大架构修改。预训练的VLM将此表示视为普通图像。由于画布可以以任何感兴趣的姿态为中心,相同的表示直接支持从特定视角进行情境推理,这是机器人和具身AI中的常见需求。得益于这种表示,我们还可以引入空间预训练课程:通过程序化地将从真实图像中提取的对象的补丁特征放置在原本空白的画布上的选定3D世界位置,我们生成了涵盖广泛空间推理任务的即时监督,并控制答案分布以减少空间推理捷径。OneCanvas在SQA3D和VSI-Bench上达到了最先进的准确率,并在SPBench上泛化到分布外数据,其训练计算量比最强竞争方法少一个数量级。

英文摘要

Existing approaches to 3D scene understanding in Vision-Language Models (VLMs) either rely on complex, model-specific geometry encoders or large training budgets in pursuit of spatial reasoning. Instead, OneCanvas aggregates patch features from all views onto a single equirectangular panoramic canvas. Namely, each patch is unprojected to a 3D world coordinate using its depth and camera pose, then placed on the canvas at the continuous longitude and latitude of that point as seen from the canvas origin, with no rasterization or aggregation across overlapping views. A 3D position embedding of the patch's metric coordinates is added to its feature, restoring the depth lost when collapsing the world position to an angular canvas coordinate. Patches from all frames thus share one spatial coordinate system with no fusion or major architectural modifications of the backbone. The pretrained VLM consumes this representation as if it were an ordinary image. Because the canvas can be centered on any pose of interest, the same representation directly supports situated reasoning from a specific viewpoint, a common requirement in robotics and embodied AI. Thanks to this representation, we can also introduce a spatial pretraining curriculum: by procedurally placing patch features of objects, drawn from real images, at chosen 3D world positions on an otherwise empty canvas, we generate on-the-fly supervision spanning a broad range of spatial reasoning tasks, with answer distributions controlled to reduce spatial reasoning shortcuts. OneCanvas achieves state-of-the-art accuracy on SQA3D and VSI-Bench, and generalizes to out-of-distribution data on SPBench, using an order of magnitude less training compute than the strongest competing methods.
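The placement rule in the abstract (unproject each patch with its depth and camera pose, then place it at the longitude and latitude seen from the canvas origin) can be sketched geometrically. This is a minimal illustration under stated assumptions: a pinhole camera model, a camera-to-world rotation and translation, a world frame with +z forward, +x right, and +y up, and a hypothetical canvas size. None of the names or conventions below come from the paper.

```python
import math


def unproject(u, v, depth, fx, fy, cx, cy, rotation, translation):
    """Pinhole unprojection of pixel (u, v) at metric depth to world coordinates.

    rotation (3x3, row-major lists) and translation map camera to world; the
    rotation is assumed to absorb any camera-to-world axis convention change.
    """
    cam = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    return tuple(
        sum(rotation[i][j] * cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def canvas_coords(point, origin, width, height):
    """Continuous equirectangular (column, row) of a world point seen from origin.

    Assumed world convention: +z forward, +x right, +y up.
    The point must not coincide with the origin.
    """
    dx, dy, dz = (p - o for p, o in zip(point, origin))
    radius = math.sqrt(dx * dx + dy * dy + dz * dz)
    longitude = math.atan2(dx, dz)      # in (-pi, pi]
    latitude = math.asin(dy / radius)   # in [-pi/2, pi/2]
    col = (longitude / (2 * math.pi) + 0.5) * width
    row = (0.5 - latitude / math.pi) * height
    return col, row, radius             # radius is the depth the angles discard


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Centre pixel of a hypothetical 640x480 camera, 2 m away, camera at the world origin.
world_point = unproject(320, 240, 2.0, 500.0, 500.0, 320.0, 240.0,
                        IDENTITY, (0.0, 0.0, 0.0))
col, row, radius = canvas_coords(world_point, (0.0, 0.0, 0.0), width=1024, height=512)
```

The returned radius is exactly the information lost when a world position collapses to an angular coordinate, which is why the abstract adds a 3D position embedding of the metric coordinates back onto each patch feature.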

2606.19249 2026-06-18 cs.CV cs.LG New submission

Transformer Geometry Observatory TGO-I: Spectral Geometry Observatory

Kaustubh Kapil, Kishor P. Upla

Affiliations * Sardar Vallabhbhai National Institute of Technology (SVNIT), Surat, India

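AI summary Proposes the TGO framework, which analyzes the spectral geometry of ViT representations (effective rank, stable rank, participation ratio, spectral entropy, spectral flatness, spectral anisotropy, and related quantities). Over the course of training, dimensional utilization increases, anisotropy decreases, and spectral entropy and participation ratio rise; the final CLS token representation has the highest effective dimensionality and the lowest anisotropy.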

AI abstract (translated from Chinese)

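Although Vision Transformers (ViTs) are widely adopted and successful across many computer vision applications, their dimensional and representational geometry remains relatively underexplored. To close this gap, the authors introduce the Transformer Geometry Observatory (TGO), a systematic framework of experiments and analysis pipelines for studying the representational geometry and dynamics of Vision Transformers. TGO-I, the first installment of the framework, focuses on the spectral geometry of ViT representations. Using a ViT-Small/16 model trained on ImageNet-100, they analyze effective rank, stable rank, participation ratio, spectral entropy, spectral flatness, spectral anisotropy, covariance structure, eigenspectra, and singular value spectra throughout training. The results show a consistent increase in dimensional utilization, accompanied by decreasing anisotropy, increasing spectral entropy, increasing participation ratio, and progressively flatter eigenspectra. Contrary to the common intuition that training should concentrate information into a few dominant directions, variance is progressively redistributed across representational dimensions. The effect is most pronounced in the final CLS token representation, which has the highest effective dimensionality and the lowest anisotropy in the network.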

English abstract

Despite the widespread adoption of Vision Transformers (ViTs) and their success across numerous computer vision applications, the fundamental understanding of their dimensional and representational geometry remains relatively underexplored. To address this gap, we introduce Transformer Geometry Observatory (TGO), a systematic framework of experiments and analysis pipelines designed to investigate the representational geometry and dynamics of Vision Transformers. TGO-I, the first installment of the framework, focuses on the spectral geometry of ViT representations. Using a ViT-Small/16 model trained on ImageNet-100, we analyze Effective Rank, Stable Rank, Participation Ratio, Spectral Entropy, Spectral Flatness, Spectral Anisotropy, covariance structure, eigenspectra, and singular value spectra throughout training. Our results reveal a consistent increase in dimensional utilization, accompanied by decreasing anisotropy, increasing spectral entropy, increasing participation ratio, and progressively flatter eigenspectra. Contrary to the common intuition that training should concentrate information into a small number of dominant directions, we observe a progressive redistribution of variance across representational dimensions. This phenomenon is particularly pronounced in the final CLS token representation, which exhibits the highest effective dimensionality and lowest anisotropy within the network.
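Several of the spectral quantities named in the abstract have standard textbook definitions in terms of the eigenvalues of a feature covariance matrix. The sketch below computes them from a list of eigenvalues. It is an illustration under stated assumptions, not the paper's pipeline: the abstract does not give its exact formulas, so the definitions here (effective rank as the exponential of the spectral entropy, stable rank as total variance over the largest eigenvalue, and so on) and the function name `spectral_metrics` are assumptions. Spectral anisotropy is omitted because its definition varies between works.

```python
import math

def spectral_metrics(eigenvalues):
    """Summary statistics of a covariance eigenspectrum.

    `eigenvalues` are assumed to be the (non-negative) eigenvalues of a
    feature covariance matrix; tiny negative values from numerical
    error are clipped to zero.
    """
    lam = [max(float(x), 0.0) for x in eigenvalues]
    total = sum(lam)
    if total <= 0.0:
        raise ValueError("spectrum has no positive eigenvalue")
    # Normalised spectrum treated as a probability distribution.
    p = [x / total for x in lam if x > 0.0]
    entropy = -sum(q * math.log(q) for q in p)
    arithmetic_mean = total / len(lam)
    if min(lam) == 0.0:
        flatness = 0.0  # geometric mean is zero when any eigenvalue is zero
    else:
        geometric_mean = math.exp(sum(math.log(x) for x in lam) / len(lam))
        flatness = geometric_mean / arithmetic_mean
    return {
        "spectral_entropy": entropy,
        "effective_rank": math.exp(entropy),
        "participation_ratio": total ** 2 / sum(x * x for x in lam),
        "stable_rank": total / max(lam),
        "spectral_flatness": flatness,
    }

# Usage: a flat spectrum versus one dominated by a single direction.
flat = spectral_metrics([1.0, 1.0, 1.0, 1.0])
peaked = spectral_metrics([100.0, 1.0, 1.0, 1.0])
```

Under these definitions a perfectly flat spectrum of dimension d gives an effective rank, participation ratio, and stable rank of d and a flatness of 1, while a spectrum dominated by one direction pushes all of them toward their minima. That is the sense in which the reported trends (rising entropy and participation ratio, flatter eigenspectra) indicate broader use of the available dimensions.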