arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.20563 2026-06-19 cs.CV New submission

JanusMesh: Fast and Zero-Shot 3D Visual Illusion Generation via Cross-Space Denoising

Siang-Ling Zhang, Huai-Hsun Cheng, Tsung-Ju Yang, Yu-Lun Liu

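Affiliations: National Yang Ming Chiao Tung University

AI summary: Proposes a fast, training-free framework that uses cross-space dual-branch denoising and view-conditioned texture synthesis to generate highly realistic dual-semantic 3D visual illusions in 3-5 minutes, outperforming existing methods.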

Comments ECCV 2026. Project page: https://siang1105.github.io/JanusMesh.github.io/


Abstract

Creating 3D visual illusions, a single 3D mesh that reveals entirely different semantics from various viewing angles, is a fascinating but tough challenge. Existing optimization-based methods are slow and can produce oversaturated colors. In contrast, naive stitching approaches fail to produce geometrically coherent objects. This results in visible unnatural seams and semantic leaks. In this paper, we present a fast and training-free framework for generating text-driven 3D visual illusions. Our approach decouples the generation into two stages. First, we propose a cross-space dual-branch denoising process. This process dynamically decodes 3D latents into voxel space for CLIP-guided orientation alignment and Signed Distance Field (SDF) blending, which ensures seamless geometric fusion. Second, we introduce a view-conditioned texture synthesis module that projects and aggregates view-specific 2D diffusion priors onto the fused geometry. Extensive experiments demonstrate that our method generates highly realistic, dual-semantic 3D illusions in just 3-5 minutes. It significantly outperforms existing methods in geometric integrity, semantic recognizability, and efficiency. Project page: https://siang1105.github.io/JanusMesh.github.io/
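
The abstract does not say how the SDF blending is carried out. As a rough illustration only, the sketch below assumes a polynomial smooth-minimum union, a common graphics technique that is not necessarily the authors' operator. It fuses two signed distance fields so that the seam between them is rounded rather than creased. The sphere shapes, the function names, and the blend width `k` are all hypothetical.

```python
import math

def sphere_sdf(p, center, radius):
    """Signed distance from point p to a sphere: negative inside, positive outside."""
    return math.dist(p, center) - radius

def smooth_union(d1, d2, k):
    """Polynomial smooth minimum of two signed distances; k > 0 sets the blend width."""
    h = max(k - abs(d1 - d2), 0.0) / k
    return min(d1, d2) - h * h * k * 0.25

def blended_sdf(p, k=0.5):
    # Two hypothetical shapes standing in for the two semantic branches.
    a = sphere_sdf(p, (-0.6, 0.0, 0.0), 1.0)
    b = sphere_sdf(p, (0.6, 0.0, 0.0), 1.0)
    return smooth_union(a, b, k)
```

Where the two distances differ by more than `k`, the result equals the plain minimum (a hard union); the rounding only acts near the seam.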

2606.20562 2026-06-19 cs.RO New submission

MemoryWAM: Efficient World Action Modeling with Persistent Memory

Sizhe Yang, Juncheng Mu, Tianming Wei, Chenhao Lu, Xiaofan Li, Linning Xu, Zhengrong Xue, Zhecheng Yuan, Dahua Lin, Jiangmiao Pang, Huazhe Xu

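Affiliations: The Chinese University of Hong Kong; Tsinghua University; Zhejiang University

AI summary: Proposes MemoryWAM, which uses a hybrid memory design and a tailored attention mechanism to make efficient memory-dependent decisions in long-horizon robotic manipulation tasks, outperforming existing VLA and WAM baselines.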


Abstract

Robust robotic manipulation in the real world requires not only an understanding of the current observation, but also memory and dynamics modeling. World action models (WAMs) possess these capabilities by jointly modeling visual foresight and actions conditioned on both current and historical observations, making them a promising paradigm for robotic manipulation. However, existing WAMs face a fundamental trade-off: methods with efficient inference typically condition only on a bounded window of recent observations and therefore struggle in non-Markovian environments, whereas methods that preserve long histories incur time and space costs that grow substantially with sequence length. To address this challenge, we introduce MemoryWAM, a world action model with efficient persistent memory. MemoryWAM uses a hybrid memory design that combines recent frames, event-boundary anchor frames, and compact gist tokens that summarize long-range history. A tailored attention mechanism enables retrieval of both detailed short-term context and compressed long-term context, supporting memory-dependent decision-making with reduced inference latency and GPU memory usage. Across long-horizon, memory-dependent manipulation tasks in both simulation and the real world, MemoryWAM outperforms strong vision-language-action (VLA) and WAM baselines while maintaining favorable computational efficiency.

2606.20560 2026-06-19 cs.LG cs.AI New submission

How Transparent is DiffusionGemma?

Joshua Engels, Callum McDougall, Bilal Chughtai, Janos Kramar, Senthoran Rajamanoharan, Cindy Wu, Arthur Conmy, Asic Q Chen, Jean Tarbouriech, Min Ma, Brendan O'Donoghue, João Gabriel Lopes de Oliveira, Rohin Shah, Neel Nanda

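Affiliations: Google

AI summary: Studies the reasoning transparency of DiffusionGemma, which computes in a continuous latent space, by decomposing transparency into variable transparency and algorithmic transparency; finds that an interpretable token bottleneck reduces the opaque serial depth to 1.1x that of Gemma 4, and reveals diffusion-specific phenomena.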

Comments 20 main text pages and 6 pages of references and appendices


Abstract

LLM reasoning transparency is a critical affordance for understanding model decisions, mitigating misuse and misalignment, and debugging surprising model behaviors. However, DiffusionGemma performs a larger fraction of its computation in a continuous latent space; does this make its reasoning less transparent? We study this question by decomposing transparency into two components: variable transparency, whether we understand intermediate snapshots of a model's computational state; and algorithmic transparency, whether we can use these snapshots to reconstruct the process by which the model arrived at its outputs. Naively, DiffusionGemma has poor variable transparency: its opaque serial depth, the amount of serial computation that occurs in between interpretable model states, seems at first 28.6X higher than the corresponding autoregressive Gemma 4 model. However, we show that we can map the information flowing between denoising steps through an interpretable token bottleneck with no decrease in downstream performance. Treating these intermediate states as interpretable reduces the opaque serial depth to just 1.1X that of Gemma 4. Algorithmic transparency is harder for diffusion models than for autoregressive models because all token predictions in the canvas can change at every denoising step, giving the model the power to implement complicated distributed algorithms during the denoising process. To begin bridging this gap, we conduct a suite of interpretability case studies, uncovering initial evidence of novel diffusion-specific phenomena such as non-chronological reasoning, token and sequence smearing, and intermediate-context reasoning. Finally, we test monitorability, a key application of transparency that measures whether model outputs are useful for downstream tasks. We find that DiffusionGemma is similarly monitorable to Gemma 4.

2606.20556 2026-06-19 cs.CV New submission

Thinking in Boxes: 3D Editing in Real Images Made Easy

Pradhaan S Bhat, Naveen Chandra R, Rishubh Parihar, Vaibhav Vavilala, R. Venkatesh Babu, D. A. Forsyth, Anand Bhattad

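Affiliations: Indian Institute of Science; Apple; UIUC; Johns Hopkins University

AI summary: Proposes using 3D boxes as a structured specification: the user supplies input and output boxes to precisely control translation, rotation, scaling, and viewpoint changes in real images while preserving scene and object identity and recovering previously unseen object regions.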

Comments Project Page: https://thinking-in-boxes.github.io/


Abstract

Text and 2D-conditioning interfaces provide weak, ambiguous control over spatial transformations in image editing, particularly under large object motions and camera changes. Prior work has used 3D primitives such as boxes, but only as loose conditioning signals indicating approximate object location rather than specifying the transformation. We instead use 3D boxes as structured specifications: the user provides the input and output boxes of the edit, casting editing as a well-posed geometry problem. This "thinking in boxes" interface, where each box face is color-coded to convey 3D orientation, gives precise control over translation, rotation, scaling, and viewpoint changes in real images while preserving scene and object identity, and recovering previously unseen object regions. To ground transformations in scene appearance, we introduce a depth-aligned planar floor as a global reference frame, shaded with depth-aware cues. Conditioned on this structure, an image generator produces consistent results under large transformations. Trained in two stages, on synthetic multi-object scenes and on a small set of real-world videos from Objectron, the system generalizes to complex, in-the-wild real images. Our method operates directly on real photographs and substantially outperforms recent state-of-the-art methods on large 3D edits.

2606.20549 2026-06-19 cs.RO New submission

Generating Robot Hands from Human Demonstrations

Sha Yi, Nicklas Hansen, Xueqian Bai, Carmelo Sferrazza, Michael T. Tolley, Xiaolong Wang

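Affiliations: University of California San Diego; Amazon Frontier AI & Robotics

AI summary: Proposes a data-driven framework that uses more than 4 million frames of human fingertip motion from everyday manipulation, matching fingertip positions through inverse kinematics, to optimize the design of tree-structured robot hands; it produces a general-purpose 6-DoF hand and lower-DoF task-specific hands, and trains a reinforcement-learning agent to accelerate the design search.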


Abstract

Robot learning has advanced rapidly in learning control, but learning the physical body of a robot remains much more difficult because jointly searching over design and control creates a very large combinatorial problem. Here, we present a data-driven framework for generating robot hands from human demonstrations. Instead of learning a complex controller together with each candidate design, we generate robot hand designs using the same simple control policy used after fabrication: matching fingertip positions through inverse kinematics. Using more than 4 million frames of human fingertip motion from everyday manipulation, our algorithm optimizes tree-structured robot hands to reproduce desired target motions. The framework produced both a 6-degree-of-freedom (DoF) general-purpose hand and lower-DoF task-specific hands with spatial four-bar mimic joints. To accelerate the search over designs, we trained a reinforcement-learning (RL) actor to propose good hand designs and joint angles, reducing search time from hours to minutes. We fabricated the mechanisms directly as one-piece articulated structures with print-in-place joints. In real-world experiments, the 6-DoF hand achieved highly accurate teleoperated fingertip tracking better than available commercial robot hands, whereas the specialized 3-DoF hands reproduced structured human and synthetic trajectories with reduced mechanical complexity. These results showed that large-scale human motion data can be used not only to train robot controllers but also as a reference for optimizing and generating the physical embodiment of robots.

2606.20545 2026-06-19 cs.CV New submission

Current World Models Lack a Persistent State Core

Jinpeng Lu, Dexu Zhu, Haoyuan Shi, Linghan Cai, Guo Tang, Yinda Chen, Jie Cao, Duyu Tang, Yi Zhang, Yong Dai, Xiaozhu Ju

Affiliations: University of Science and Technology of China; Beijing Innovation Center of Humanoid Robotics (X-Humanoid); NLPR, Institute of Automation, Chinese Academy of Sciences; Independent Researcher; Dresden University of Technology; Peking University

AI summary: Introduces the WRBench benchmark and finds that existing world models fail to keep the world state evolving when observation is interrupted, arguing that the stability of the physical state kernel should be a first-class objective of world-model design.

Comments 39 pages, 16 figures


Abstract

World models are increasingly regarded as a decisive step toward artificial general intelligence, yet modeling the physical world demands more than rendering convincing frames on demand: it requires an internal world state that keeps evolving over time, decoupled from observation, so that objects endure and events run to their conclusions whether or not a camera is watching, much as the moon holds to its orbit when no one is looking. This requirement is a blind spot of existing benchmarks, which reward surface properties such as fidelity, motion, and camera controllability while never asking whether a generated world keeps evolving once it is unobserved. We introduce WRBench, the first systematic diagnostic benchmark that treats camera motion as an intervention on observability and resolves evaluation into a human-calibrated chain that asks whether the camera executes the requested interaction, whether the scene stays continuous and identifiable while in view, and whether a returning target remains consistent with the event that was set in motion. Across 9,600 videos from 23 models spanning four control paradigms, one finding proves stubborn: current systems maintain the observed world as a tracking shot, resuming a returning target in the state at which it was abandoned rather than advancing the event while it went unseen. Because this failure recurs across control paradigms, model families, and increments of scale, robust world-state evolution does not follow from cleaner imagery, tighter control, richer geometric priors, or sheer parameter count. We therefore argue that the stability of the physical state kernel and the consistency of worldlines under viewpoint intervention should become first-class objectives of world-model design, so that a world model captures how the world will unfold rather than how the next frame appears.

2606.20544 2026-06-19 cs.AI cs.LG New submission

Toward Calibrated Mixture-of-Experts Under Distribution Shift

Gina Wong, Drew Prinster, Suchi Saria, Rama Chellappa, Anqi Liu

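Affiliations: Johns Hopkins University

AI summary: Studies the calibration of mixture-of-experts models under distribution shift and proposes an adversarial reweighting method to reduce the calibration error of the routed aggregate, improving the accuracy-calibration tradeoff.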

Journal ref ICML 2026

Abstract

Calibration aligns a model's predictive uncertainty with the frequencies of its empirical outcomes and is important for understanding and trusting reported probabilities. Recent work shows that enforcing calibration at the level of individual predictors can improve ensemble accuracy and calibration, with mixture-of-experts (MoE) models showing strong empirical improvements in particular; however, the conditions under which calibration helps MoE are not well understood. In this work, we study how MoE models behave under distribution shift, focusing on how routing mechanisms interact with expert-level calibration. We show that expert calibration is sufficient to ensure calibration of the overall model under a broad class of distribution shifts in hard-routed models, but is insufficient for calibrating soft-routed models. To address this, we propose an adversarial reweighting that penalizes calibration errors of the routed aggregate under distribution shift, and we demonstrate that it improves the accuracy-calibration tradeoff both on average and on difficult subsets of the data, across model classes, prediction tasks, and distribution shifts.
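
The notion of calibration in the first sentence can be made concrete with the expected calibration error (ECE), a standard binned metric. The abstract does not state which metric the authors use, so treat this as a generic sketch; the function name and the default bin count are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that conf == 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(accuracy - mean_conf)
    return ece
```

A predictor that reports 0.75 confidence and is right three times out of four scores zero; one that reports 0.9 and is right half the time scores about 0.4.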

2606.20543 2026-06-19 cs.CV New submission

SSD: Spatially Speculative Decoding Accelerates Autoregressive Image Generation

Shilong Xiang, Zirui Zhang, Lijun Yu, Chengzhi Mao

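Affiliations: Rutgers University

AI summary: Proposes Spatially Speculative Decoding (SSD), which exploits 2D spatial correlation to predict the adjacent horizontal token and the token below simultaneously, overcoming the memory bottleneck in visual inference and accelerating autoregressive image generation by up to 13.3x.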


Abstract

Autoregressive models excel in visual generation by treating images as 1D sequences of discrete tokens, mirroring language modeling. However, this flattening discards the intrinsic 2D spatial locality of visual signals, creating severe computational bottlenecks during inference. We introduce Spatially Speculative Decoding (SSD), a framework that aligns the predictive objective with the natural geometry of images. Rather than predicting only the immediate next token in a 1D sequence, our model simultaneously predicts the adjacent horizontal token and the token directly below it. By capitalizing on this 2D spatial correlation, spatially speculative decoding overcomes the memory wall in visual inference. Our approach accelerates autoregressive image generation by up to 13.3x while maintaining high fidelity on DPG-Bench and GenEval. Our results suggest that respecting the underlying geometry of vision unlocks massive computational efficiencies, paving the way for real-time, high-resolution autoregressive generative models.

2606.20542 2026-06-19 cs.CV New submission

CalTennis: Large Multi-View Tennis Video Dataset and Benchmark of Monocular-to-3D Pose Estimation

Ilona Demler, Xinran Xie, Blake Werner, Anna Szczuka, Pietro Perona

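Affiliations: California Institute of Technology

AI summary: Presents CalTennis, a large multi-view tennis video dataset (11 million frames, 40 players) for evaluating in-the-wild monocular-to-3D pose estimation, and finds that existing models fall short on depth estimation and foot contact.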


Abstract

The Caltech Tennis Dataset (CalTennis) is a large-scale video benchmark for evaluating monocular-to-3D pose estimation in the wild. CalTennis comprises over 11 million frames (51 hours) of tennis practice and match play from 40 players, captured with 2-6 synchronized cameras at 60 Hz. It is 10 times larger than existing in-the-wild human motion video datasets and 3 times larger than existing MOCAP-ground-truthed datasets, and it is the first large-scale benchmark to provide synchronized multi-view recordings of expert athletic motion. The multi-view setup enables inexpensive, label-free evaluation of monocular-to-3D pose estimation algorithms. We describe a simple, standardized protocol that enables data collection without specialized equipment or expertise, along with fully automated video calibration and synchronization. Benchmarking state-of-the-art monocular-to-3D pose methods on CalTennis, we find that while 3D joint angle recovery is now quite accurate, all models struggle to estimate depth and foot contact consistently. We further propose two novel performance metrics, footwork and stability, and qualitatively study body shape inconsistency. These metrics expose previously underexplored failure modes and point to concrete opportunities for improvement in pose estimation and action analysis.

2606.20538 2026-06-19 cs.LG New submission

Multi-Task Bayesian In-Context Learning

Qingyang Zhu, Eric Karl Oermann, Kyunghyun Cho

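Affiliations: New York University

AI summary: Proposes a multi-task in-context learning framework that represents prior information as a prefix of in-context datasets and trains a transformer to perform hierarchical Bayesian predictive inference, matching oracle Bayesian performance under a range of distribution shifts while being orders of magnitude faster.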

Comments ICML 2026


Abstract

Bayesian predictive inference provides a principled framework for uncertainty quantification, data efficiency, and robust generalization. However, exact inference is often intractable, and scalable approximations may remain computationally expensive or require restrictive modeling assumptions that degrade predictive performance. Prior-Data Fitted and in-context models have recently emerged as an amortized alternative by learning to map datasets directly to predictive distributions, but existing approaches are tightly coupled to the support of the training prior and lack explicit mechanisms for adapting to new priors at test time, resulting in limited robustness under distribution shift. We introduce a multi-task in-context learning framework for amortized hierarchical Bayesian predictive inference that explicitly represents prior information as a prefix of in-context datasets. A transformer trained on sequences of prior and target tasks learns to adapt its predictions across families of priors. On a suite of evaluations with increasing difficulty, including out-of-meta-distribution priors and priors with high-dimensional latent structures, our method matches oracle Bayesian predictors while being orders of magnitude faster. We further demonstrate its practical relevance on a real-world spatiotemporal temperature prediction benchmark. Code is available at https://github.com/martianmartina/multi-task-bayesian-icl/.

2606.20536 2026-06-19 cs.CV New submission

The FID Lottery: Quantifying Hidden Randomness in Generative-Model Evaluation

Nicolas Dufour, Alexei A. Efros, Patrick Pérez

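Affiliations: Kyutai; UC Berkeley

AI summary: Studies the variance of FID as a random variable over training and generation seeds, finds that retraining causes larger FID fluctuations than resampling, and proposes a new evaluation protocol: use per-cell optimal guidance and report error bars over several training seeds.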

Comments Website: https://kyutai.org/fid-lottery


Abstract

The Fréchet Inception Distance (FID) is the de facto arbiter of image generation, yet most papers report just a single number from a single trained model using a single sampling seed. How reproducible is that number if we retrain the model, or merely resample from it? In this paper, we treat FID as a random variable on a two-axis panel of training and generation seeds, and measure its variance directly on several hundred SiT networks trained on class-conditional ImageNet 256x256. We report surprising findings: (a) Retraining the model using the same recipe with a different seed moves FID 3.2x more (in Inception feature space) than redrawing samples from a fixed network. (b) That gap is driven by three factors: random initialisation, data ordering, and the per-step Gaussian noise of the flow-matching loss. (c) Increasing compute or model size barely tightens the spread, holding the FID coefficient of variation (CoV) inside a 1-2% band. (d) Per-cell classifier-free-guidance tuning halves the spread but reshuffles which seeds work best, and a lucky training seed reaches the same FID with up to 2x less compute than an unlucky one. Based on these findings, we recommend a new FID evaluation protocol: evaluate under per-cell optimal guidance, treat any FID gap below the empirically measured ~1.3% CoV as inconclusive, and report an error bar over several training seeds rather than a single FID number.
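
The last two recommendations can be sketched in a few lines of standard-library Python. This is one possible reading of the protocol, not the authors' code: it assumes the ~1.3% threshold is compared against the relative gap between mean FIDs, and the function names and FID values are hypothetical.

```python
import statistics

def coefficient_of_variation(fids):
    """Sample standard deviation divided by the mean, as a fraction."""
    return statistics.stdev(fids) / statistics.mean(fids)

def compare(fids_a, fids_b, cov_threshold=0.013):
    """Return 'inconclusive' when the relative gap in mean FID is below the threshold."""
    mean_a, mean_b = statistics.mean(fids_a), statistics.mean(fids_b)
    rel_gap = abs(mean_a - mean_b) / min(mean_a, mean_b)
    if rel_gap < cov_threshold:
        return "inconclusive"
    return "A better" if mean_a < mean_b else "B better"

# Hypothetical FID scores from several training seeds of two methods.
method_a = [2.10, 2.14, 2.08, 2.12]
method_b = [2.11, 2.13, 2.09, 2.15]
```

With these made-up numbers the two means differ by well under 1.3%, so `compare(method_a, method_b)` reports the gap as inconclusive; the per-method error bar would be `statistics.stdev` over the seeds.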

2606.20531 2026-06-19 cs.CV New submission

VisDom: Sparse Novel View Synthesis with Visible Domain Constraint

Mariia Gladkova*, Tarun Yenamandra*, Edmond Boyer, Robert Maier, Tony Tung, Daniel Cremers

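Affiliations: TU Munich; MCML (Munich Center for Machine Learning)

AI summary: Proposes VisDom, a learning-free geometric constraint that augments visual hull reconstruction with a minimum multi-view visibility requirement; used as a spatial prior for sparse novel view synthesis and integrated into NeRF and GS pipelines, it enables high-quality reconstruction from four input images.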


Abstract

Sparse novel view synthesis (NVS) remains challenging due to the ambiguity of recovering 3D geometry from few input views. While NeRF- and Gaussian Splatting (GS)-based methods perform well with dense supervision, they often overfit in sparse settings, producing floating artifacts and inconsistent geometry. Silhouette consistency is commonly used as a regularizer, but it remains insufficient, as silhouette-consistent regions can extend beyond the true object geometry. We introduce VisDom, a learning-free geometric constraint that augments classical carving-based visual hull reconstruction by enforcing a minimum multi-view visibility requirement. Specifically, we define a visible domain as the subset of 3D space observed by at least K views and use it as an additional filtering criterion on top of standard silhouette-based reconstruction. This provides a stronger spatial prior in sparse-view settings. We integrate VisDom into both implicit (NeRF) and explicit (GS) pipelines by restricting volumetric sampling and guiding Gaussian placement during optimization. Experiments on three challenging datasets show consistent improvements in sparse-view NVS, enabling high-quality object-centric reconstruction from as few as four input images. Our method is domain-agnostic, requires only silhouettes, and introduces no learned parameters, making it a simple complement to existing approaches. Applying VisDom on top of GaussianObject further improves performance on Omni3D and MipNeRF360, while matching or surpassing it at 22x lower training cost.
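
The visible-domain filter can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes orthographic cameras with a square image window and a disk-shaped silhouette, and it assumes that a view can only carve points it actually observes. `OrthoView` and `keep_point` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class OrthoView:
    """Toy orthographic camera: drops one axis, sees a square window, has a disk silhouette."""
    drop_axis: int            # axis along which the camera looks (0=x, 1=y, 2=z)
    half_window: float        # half-width of the square image window
    silhouette_radius: float  # radius of the disk-shaped object mask

    def project(self, p):
        return tuple(c for i, c in enumerate(p) if i != self.drop_axis)

    def observes(self, p):
        u, v = self.project(p)
        return abs(u) <= self.half_window and abs(v) <= self.half_window

    def in_silhouette(self, p):
        u, v = self.project(p)
        return u * u + v * v <= self.silhouette_radius ** 2

def keep_point(p, views, k):
    """Keep p if no observing view carves it away and at least k views observe it."""
    observing = [view for view in views if view.observes(p)]
    silhouette_ok = all(view.in_silhouette(p) for view in observing)
    return silhouette_ok and len(observing) >= k
```

With three axis-aligned views, the point (0, 0, 5) is seen by only one camera and lies inside its silhouette, so plain carving keeps it; requiring k = 2 removes it, which is the kind of floater the extra criterion is meant to suppress.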

2606.20529 2026-06-19 cs.AI cs.CL New submission

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

Md Nayem Uddin, Amir Saeidi, Eduardo Blanco, Chitta Baral

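Affiliations: Arizona State University; University of Arizona

AI summary: For policy-adherent tool-calling agents in customer-service domains, proposes LedgerAgent, which maintains task state in a separate ledger, renders it into the prompt, and checks state-dependent policy constraints before executing tool calls, improving multi-trial consistency.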

Comments Work in Progress


Abstract

Policy-adherent tool-calling agents in customer-service domains must maintain task states across turns while calling tools and obeying domain policies. Task states consist of relevant facts, identifiers, constraints, and conditions observed through user interaction and tool calls. In standard agents, task states are not represented separately. Observations, tool returns, and policy instructions are placed in the prompt, leaving agents to reconstruct the relevant states from the prompt each time they decide what to do next. This design makes state management implicit, creating two common failure modes. An agent may retrieve the right facts but later ground its decision in stale, missing, or incorrect information; and a syntactically valid tool call may still violate a domain policy that depends on the current task state. We introduce LedgerAgent, an inference-time method for tool-calling agents that maintains observed task states in a separate ledger and renders the states into the prompt. The ledger is also used to check state-dependent policy constraints before environment-changing tool calls are executed, blocking policy violations. Across four customer-service domains and a mixed panel of open- and closed-weight models, LedgerAgent improves average pass^k over a standard prompt-based tool-calling approach, with the largest gains under stricter multi-trial consistency metrics.
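
The two roles the abstract gives the ledger (rendering state into the prompt, and gating environment-changing tool calls) can be sketched as follows. Everything here is hypothetical: the class, the refund policy, and the tool name are invented for illustration and do not come from the paper.

```python
class Ledger:
    """Holds observed task state separately from the conversation text."""

    def __init__(self):
        self.state = {}

    def record(self, key, value):
        self.state[key] = value

    def render(self):
        # Text block that would be placed into the prompt on every turn.
        return "\n".join(f"{k}: {v}" for k, v in sorted(self.state.items()))

def refund_allowed(state):
    """Hypothetical policy: refunds only for delivered orders within 30 days."""
    delivered = state.get("order_status") == "delivered"
    return delivered and state.get("days_since_delivery", float("inf")) <= 30

# Hypothetical mapping from environment-changing tools to their policy checks.
POLICY_CHECKS = {"issue_refund": refund_allowed}

def execute_tool(name, ledger, tools):
    """Run a tool only if its state-dependent policy check passes."""
    check = POLICY_CHECKS.get(name)
    if check is not None and not check(ledger.state):
        return {"blocked": True, "reason": f"policy check failed for {name}"}
    return {"blocked": False, "result": tools[name]()}
```

Because the check reads the ledger rather than the prompt, a syntactically valid `issue_refund` call is blocked while the recorded order status is still "shipped", and goes through once the ledger shows a recent delivery.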

2606.20527 2026-06-19 cs.CL cs.CV New submission

StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs

Shaghayegh Kolli, Timo Cavelius, Nafiseh Nikeghbal, Samantha Dalal, Jana Diesner

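Affiliations: Technical University of Munich; Munich Center for Machine Learning; Princeton Center for Information and Technology Policy

AI summary: Introduces the StylisticBias benchmark, which varies a single visual attribute at a time, and finds that age and body type dominate identity-level bias, while about 15 attributes such as fashion style explain nearly 80% of the variation in bias, so bias is concentrated in a few visual cues.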

Comments Accepted to the non-archival workshops AI4Good and Culture x AI at ICML 2026


Abstract

Multimodal large language models (MLLMs) are increasingly deployed in personally and societally consequential settings, yet the visual cues that shape how these models judge people remain poorly understood. Prior work often compares different (groups of) individuals, making it difficult to separate appearance effects from identity differences. We introduce StylisticBias, a controlled benchmark for evaluating attribute-level social bias in MLLMs. We generate 500 photorealistic base faces and create about 50 single-attribute variations per face, producing about 25K images. This design keeps identity fixed and changes one visual attribute at a time. It lets us measure how specific cues shift model judgments. We evaluate six MLLMs across 25 binary social judgment scenarios. We find that age and body type dominate identity-level effects, while fashion style and other visual cues drive the largest attribute-level shifts. We further find that about 15 attributes account for nearly 80% of the total variation, showing that bias is concentrated in a small set of visual cues. Sensitivity is strongest in judgments that are semantically aligned with appearance, especially socioeconomic and style-related judgments. We release StylisticBias as a benchmark for fine-grained bias evaluation in multimodal models. Code and dataset: https://github.com/timo-cavelius/StylisticBias and https://hf.co/datasets/shaghayegh/stylistic-bias-dataset.

2606.20526 2026-06-19 cs.AI 新提交

DeepSWIP: Quotient-WMC Counterfactuals for Neural Probabilistic Logic Programs

DeepSWIP: 神经概率逻辑程序的商-WMC反事实

Saimun Habib, Vaishak Belle, Fengxiang He

发表机构 * University of Edinburgh(爱丁堡大学)

AI总结 提出DeepSWIP,一种用于DeepProbLog程序的单世界反事实语义,通过神经物化、SWIP和加权模型计数实现精确反事实推理,实验证明比孪生网络方法快2.14倍。

AI中文摘要

诸如DeepProbLog之类的神经符号系统将神经感知与概率逻辑相结合,但标准推理是关联性的。反事实推理还需要干预和证据的因果语义。我们引入了DeepSWIP,一种用于DeepProbLog程序的单世界反事实语义。利用神经物化,我们将固定上下文神经谓词简化为普通的ProbLog选择,应用单世界干预程序(SWIP),并通过单个转换程序上的加权模型计数(WMC)计算反事实。在有限基和唯一支持模型假设下,DeepSWIP相对于学习到的物化FCM是精确的。ProbLog条件句的标准商-WMC形式识别了活跃的神经概率,并解释了干预清理、校准敏感性和罕见证据不稳定性。在MPI3D上的实验证实了该转换相对于DeepTwin构造在12,000个查询上的有效性,并且由于避免了孪生网络的内源性重复,推理速度提升了2.14倍。一个SUMO HOV实验表明,神经校准退化会偏置插件估计,而正确作用域的随机策略AIPW估计器消除了总体均值和ATE估计量的大部分一阶偏差。代码位于 https://github.com/saibib/deep_SWIP 。

英文摘要

Neurosymbolic systems such as DeepProbLog combine neural perception with probabilistic logic, but standard inference is associational. Counterfactual reasoning additionally requires a causal semantics for interventions and evidence. We introduce DeepSWIP, a single-world counterfactual semantics for DeepProbLog programs. Using neural materialization, we reduce fixed-context neural predicates to ordinary ProbLog choices, apply Single World Intervention Programs (SWIPs), and compute counterfactuals by weighted model counting (WMC) over a single transformed program. Under finite grounding and unique-supported-model assumptions, DeepSWIP is exact relative to the learned materialized FCM. The standard quotient-WMC form of ProbLog conditionals identifies active neural probabilities and explains intervention cleaning, calibration sensitivity, and rare-evidence instability. Experiments on MPI3D confirm the transformation against a DeepTwin construction on 12,000 queries and show, as predicted, a 2.14× inference speedup from avoiding the Twin's endogenous duplication. A SUMO HOV experiment shows that neural calibration degradation biases plug-in estimates, while a correctly scoped randomized-policy AIPW estimator removes most first-order bias for population mean and ATE estimands. Code is at https://github.com/saibib/deep_SWIP.
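
The quotient-WMC form mentioned above writes a conditional as a ratio of two weighted model counts, P(q | e) = WMC(q and e) / WMC(e). The following is a minimal brute-force sketch of that ratio for independent probabilistic facts. It is a generic illustration, not the DeepSWIP transformation; the toy program, the fact names, and the helper functions are assumptions made for this example.

```python
from itertools import product


def wmc(weights, condition):
    """Sum the weights of all truth assignments (worlds) that satisfy `condition`."""
    names = sorted(weights)
    total = 0.0
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        if condition(world):
            w = 1.0
            for name in names:
                p = weights[name]
                w *= p if world[name] else 1.0 - p
            total += w
    return total


def conditional(weights, query, evidence):
    """P(query | evidence) as a quotient of two weighted model counts."""
    denom = wmc(weights, evidence)
    if denom == 0.0:
        raise ZeroDivisionError("evidence has zero probability")
    return wmc(weights, lambda w: query(w) and evidence(w)) / denom


# Hypothetical toy program: alarm :- burglary ; earthquake.
facts = {"burglary": 0.1, "earthquake": 0.2}


def alarm(world):
    return world["burglary"] or world["earthquake"]


p_burglary_given_alarm = conditional(facts, lambda w: w["burglary"], alarm)
print(round(p_burglary_given_alarm, 4))
```

The denominator is the probability of the evidence, so when evidence is rare the ratio divides by a small number and small errors in the fact probabilities are amplified. That is one way to read the rare-evidence instability the abstract refers to.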

2606.20523 2026-06-19 cs.CV cs.AI cs.DB 新提交

SARLO-80: Worldwide Slant SAR Language Optic Dataset 80cm

SARLO-80:全球斜距SAR语言光学数据集80cm

Solène Debuysère, Nicolas Trouvé, Nathan Letheule, Elise Colin, Georgia Channing

发表机构 * DEMR-ONERA – The French Aerospace Lab, Université Paris-Saclay(法国航空航天实验室DEMR-ONERA,巴黎-萨克雷大学) DTIS-ONERA – The French Aerospace Lab, Université Paris-Saclay(法国航空航天实验室DTIS-ONERA,巴黎-萨克雷大学) Hugging Face

AI总结 为解决高分辨率SAR与光学图像及文本对齐的数据稀缺问题,基于Umbra SLC数据构建了80cm斜距网格的SAR-光学-文本三元组数据集,支持跨模态检索与生成任务。

AI中文摘要

多模态基础模型因大规模光学基准而快速发展,但合成孔径雷达(SAR)的类似资源仍然有限。现有的SAR-光学数据集主要依赖低分辨率、仅强度的地面距离检测(GRD)产品,未保留复值SAR测量或原生采集几何,限制了基于物理的多模态学习。特别是,结合甚高分辨率(VHR)SAR SLC、对齐光学图像和自然语言描述的大规模公开数据集仍然缺乏。我们提出了一个基于开源Umbra聚束模式采集的传感器独立复数据(SICD)构建的VHR SAR-光学-文本数据集。从约2500个全球场景(VV/HH,20cm–2m原生分辨率)出发,通过带限FFT重采样将所有SAR数据标准化到80cm斜距网格,并将图像分割为1024×1024的图块。对于每个SAR图块,我们检索高分辨率光学图块,并利用局部坐标对应关系将其扭曲到SAR网格以实现局部像素级对齐。我们进一步为每个样本生成三种描述变体(短/中/长),以支持视觉-语言训练和评估。我们的数据集包含119,566个三元组(复数和幅度斜距SAR图块、对齐光学图块、自然语言描述),覆盖72个国家的257个地点以及广泛的地物类型和基础设施。我们发布固定的训练/验证/测试划分以及完整的预处理和基线代码,以支持在原生SAR几何中进行跨模态检索和条件生成的多模态对齐的可重复基准测试。该数据集在Hugging Face Hub上公开可用,网址为 https://huggingface.co/datasets/ONERA/SARLO-80 。

英文摘要

Multimodal foundation models have advanced rapidly thanks to large optical benchmarks, but comparable resources for synthetic aperture radar (SAR) remain limited. Existing SAR-optical datasets largely rely on low-resolution, intensity-only Ground Range Detected (GRD) products and do not preserve complex-valued SAR measurements or native acquisition geometry, which restricts physically grounded multimodal learning. In particular, large-scale public datasets combining very-high-resolution (VHR) SAR SLC, aligned optical imagery, and natural-language descriptions are still lacking. We present a VHR SAR-optical-text dataset built from open-access Umbra spotlight acquisitions distributed as Sensor Independent Complex Data (SICD). From around 2,500 worldwide scenes (VV/HH, 20 cm to 2 m native resolution), we standardize all SAR data to an 80 cm slant-range grid via band-limited FFT resampling and tile the imagery into 1024 by 1024 patches. For each SAR patch, we retrieve a high-resolution optical tile and warp it into the SAR grid using local coordinate correspondences for local pixel-level alignment. We further generate three caption variants (SHORT/MID/LONG) per sample to support vision-language training and evaluation. Our dataset contains 119,566 triplets (complex and amplitude slant-range SAR patch, aligned optical patch, natural-language description) covering 257 locations across 72 countries and a broad range of land types and infrastructures. We release fixed train/validation/test splits and the full preprocessing and baseline code to enable reproducible benchmarks for multimodal alignment on cross-modal retrieval and conditional generation in native SAR geometry. The dataset is publicly available on the Hugging Face Hub at https://huggingface.co/datasets/ONERA/SARLO-80.

2606.20521 2026-06-19 cs.CV 新提交

HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining

HumanScale: 以自我为中心的人类视频在具身预训练中可超越真实机器人数据

Juncheng Ma, Jianxin Bi, Yufan Deng, Xuanran Zhai, Kewei Zhang, Ye Huang, Bo Liang, Shukai Gong, Jiankai Tu, Xiaotian Tang, Jiaxin Li, Kaiqi Chen, Duomin Wang, Yuqi Wang, Bingyi Kang, Eric Huang, Zhiyang Dou, Zhen Dong, Enze Xie, Wojciech Matusik, Tat-Seng Chua, Daquan Zhou

发表机构 * PKU(北京大学) NUS(新加坡国立大学) MIT(麻省理工学院) UCSB(加州大学圣塔芭芭拉分校) NVIDIA(英伟达)

AI总结 本文通过系统比较发现,经过精心设计的过滤和标注流程,以自我为中心的人类视频在具身基础模型预训练中不仅可行,而且性能优于遥操作真实机器人数据,验证了“预训练于人类视频+少量机器人数据适配”的可扩展范式。

Comments Github: https://github.com/DAGroup-PKU/HumanNet/

AI中文摘要

具身基础模型有望像大型语言模型一样从数据扩展中受益,但面临更严重的数据瓶颈。遥操作真实机器人轨迹因其精确的动作监督和具身对齐而仍然是主要的预训练来源,但其可扩展性受限于高采集成本、获取难度以及低行为和环境多样性。这些限制引发了对以自我为中心的人类视频作为可扩展、成本显著更低且更多样化的具身模型预训练替代方案的兴趣。然而,与遥操作真实机器人数据相比,其有效性仍未得到充分探索。为了解决这个问题,我们在固定的后训练和验证协议下,进行了一项系统研究,比较以自我为中心的人类视频和遥操作真实机器人轨迹作为具身基础模型的预训练数据源。令人惊讶的是,我们发现经过精心设计的过滤和标注流程处理的以自我为中心的数据,不仅是模型预训练的可行替代品,而且可以带来更优的性能。在相同预训练数据量下,在以自我为中心数据上预训练的模型在真实机器人动作预测上的验证损失降低了24%,在分布内和分布外真实机器人任务执行上的成功率分别提高了52.5%和90%。这一发现验证了具身基础模型的一种可扩展范式:在以自我为中心的人类视频上预训练以学习多样化的世界表征,然后使用少量标注的真实机器人数据进行适配以实现动作空间对齐。我们希望这项研究能鼓励对以自我为中心数据的更广泛探索,并在昂贵的机器人数据收集之前为数据质量评估提供指导。

英文摘要

Embodied foundation models are expected to benefit from data scaling like large language models, but face a much tighter data bottleneck. Teleoperated real-robot trajectories remain the dominant pretraining source due to their precise action supervision and embodiment alignment, yet their scalability is limited by high collection cost, acquisition difficulty, and low behavioral and environmental diversity. These limitations have sparked interest in egocentric human video as a scalable, substantially lower-cost, and more diverse alternative for embodied model pretraining. However, its effectiveness compared to teleoperated real-robot data remains underexplored. To address this question, we conduct a systematic study comparing egocentric human video and teleoperated real-robot trajectories as pretraining data sources for embodied foundation models, under fixed post-training and validation protocols. Surprisingly, we find that egocentric data, when processed through a carefully designed filtering and labeling pipeline, is not merely a viable substitute for model pretraining but can lead to superior performance. With the same amount of pretraining data, models pretrained on egocentric data achieve a 24% lower validation loss on real-robot action prediction, as well as 52.5% and 90% higher success rates on in-distribution and out-of-distribution real-robot task execution, respectively. This finding verifies a scalable paradigm for embodied foundation models: pretrain on egocentric human video to learn diverse world representations, then adapt with a small amount of labeled real-robot data for action-space alignment. We hope this study encourages broader exploration of egocentric data and offers guidance for data quality assessment before costly robot data collection.

2606.20518 2026-06-19 cs.AI 新提交

FlowEdit: Associative Memory for Lifelong Pronunciation Adaptation in Flow-Matching TTS

FlowEdit: 流匹配TTS中终身发音适应的联想记忆

Harshit Singh, Ayush Pratap Singh, Nityanand Mathur

发表机构 * University Of Maryland(马里兰大学) TU Darmstadt(达姆施塔特工业大学) Smallest AI

AI总结 针对流匹配TTS部署后无法纠正专有名词发音错误的问题,提出FlowEdit框架,通过潜在条件编辑而非权重更新学习发音修正,并利用现代Hopfield网络存储和检索修正,在312个多语言专有名词基准上将音素错误率降低92.7%。

AI中文摘要

流匹配文本到语音系统在零样本场景下表现出色,但部署后保持静态:除非重新训练模型,否则对词汇表外的专有名词的发音错误会持续存在。我们提出FlowEdit,一个用于冻结的流匹配TTS的终身适应框架,它将发音修正学习为潜在条件编辑而非权重更新。当提供纠正性反馈时,FlowEdit优化文本嵌入空间中的令牌级扰动,然后将修正存储在作为内容可寻址情景记忆的现代Hopfield网络中。在推理时,通过具有相似性门控的软注意力检索修正,实现模糊形态匹配。在我们整理的涵盖18个语系的312个多语言专有名词基准上,FlowEdit相对于零样本基线将目标词音素错误率降低了92.7%,同时保持相同的通用语音质量。修正过程在单个GPU上大约15秒完成。

英文摘要

Flow-matching text-to-speech systems achieve remarkable zero-shot quality but remain static after deployment: pronunciation errors on out-of-vocabulary proper nouns persist unless the model is retrained. We introduce FlowEdit, a life-long adaptation framework for frozen flow-matching TTS that learns pronunciation corrections as latent conditioning edits rather than weight updates. When corrective feedback is provided, FlowEdit optimizes a token-level perturbation in the text embedding space, then stores the correction in a Modern Hopfield Network serving as content-addressable episodic memory. At inference, corrections are retrieved via soft attention with a similarity gate, enabling fuzzy morphological matching. On our curated benchmark of 312 multilingual proper nouns across 18 language families, FlowEdit reduces target-word Phoneme Error Rate by 92.7% relative to the zero-shot baseline while maintaining identical general-speech quality. Corrections complete in approximately 15 seconds on a single GPU.
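
The retrieval step described above (soft attention over stored corrections, guarded by a similarity gate) can be sketched with the standard Modern Hopfield readout, a softmax over key similarities applied to the stored values. The cosine similarity, the inverse temperature `beta`, the gate threshold, and the toy vectors are all assumptions made for illustration; the abstract does not specify them.

```python
import math


def retrieve(query, keys, values, beta=8.0, gate=0.8):
    """Softmax-attention lookup over stored (key, value) pairs.

    Returns None when no stored key is similar enough to the query,
    meaning no correction should be applied.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b)

    sims = [cosine(query, k) for k in keys]
    best = max(sims)
    if best < gate:
        return None  # gate closed: leave the text embedding unedited
    # Numerically stable softmax over scaled similarities.
    exps = [math.exp(beta * (s - best)) for s in sims]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


# Hypothetical memory with two stored corrections (2-D toy embeddings).
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[0.5, 0.0], [0.0, -0.5]]
hit = retrieve([1.0, 0.0], keys, values)
miss = retrieve([1.0, 1.0], keys, values)
```

A query close to a stored key returns (almost exactly) that key's stored perturbation, while a query that matches nothing well falls below the gate and returns `None`, so unrelated words are left untouched.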

2606.20517 2026-06-19 cs.AI cs.PL 新提交

Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages

Multi-LCB: 将 LiveCodeBench 扩展到多种编程语言

Maria Ivanova, Pavel Zadorozhny, Rodion Levichev, Ivan Petrov, Adamenko Pavel, Ivan Lopatin, Alexey Kutalev, Dmitrii Babaev

发表机构 * GigaCode Yandex School of Data Analysis, Applied AI Institute(Yandex数据分析学院,应用人工智能研究所)

AI总结 提出 Multi-LCB 基准,将 LiveCodeBench 的 Python 任务扩展到 12 种编程语言,评估 LLM 跨语言代码生成能力,发现 Python 过拟合和语言特定污染等问题。

Comments ICLR 2026

AI中文摘要

LiveCodeBench (LCB) 最近已成为评估大型语言模型 (LLM) 在代码生成任务上的广泛采用的基准。通过策划竞争性编程问题、不断向集合中添加新问题并根据发布日期进行过滤,LCB 提供了污染感知的评估,并提供了编码能力的整体视图。然而,LCB 仍然局限于 Python,留下了 LLM 是否能够泛化到现实软件工程所需的各种编程语言的问题。我们引入了 Multi-LCB,这是一个跨十二种编程语言(包括 Python)评估 LLM 的基准。Multi-LCB 将 LCB 数据集中的 Python 任务转换为其他语言中的等效任务,同时保留 LCB 的污染控制和评估协议。由于它与原始 LCB 格式完全兼容,Multi-LCB 将自动跟踪未来的 LCB 更新,从而能够系统地评估跨语言代码生成能力,并要求模型在 Python 之外保持良好的性能。我们在 Multi-LCB 上评估了 24 个 LLM 的指令和推理能力,发现了 Python 过拟合、语言特定污染以及多语言性能显著差异的证据。我们的结果将 Multi-LCB 确立为多编程语言代码评估的严格新基准,直接解决了 LCB 的主要局限性,并揭示了当前 LLM 能力的关键差距。

英文摘要

LiveCodeBench (LCB) has recently become a widely adopted benchmark for evaluating large language models (LLMs) on code-generation tasks. By curating competitive programming problems, constantly adding fresh problems to the set, and filtering them by release dates, LCB provides contamination-aware evaluation and offers a holistic view of coding capability. However, LCB remains restricted to Python, leaving open the question of whether LLMs can generalize across the diverse programming languages required in real-world software engineering. We introduce Multi-LCB, a benchmark for evaluating LLMs across twelve programming languages, including Python. Multi-LCB transforms Python tasks from the LCB dataset into equivalent tasks in other languages while preserving LCB's contamination controls and evaluation protocol. Because it is fully compatible with the original LCB format, Multi-LCB will automatically track future LCB updates, enabling systematic assessment of cross-language code generation competence and requiring models to sustain performance well beyond Python. We evaluated 24 LLMs for instruction and reasoning on Multi-LCB, uncovering evidence of Python overfitting, language-specific contamination, and substantial disparities in multilingual performance. Our results establish Multi-LCB as a rigorous new benchmark for multi-programming-language code evaluation, directly addressing LCB's primary limitation and exposing critical gaps in current LLM capabilities.
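
Filtering problems by release date, as described above, is what makes the evaluation contamination-aware: a model is only scored on problems published after its training cutoff. A minimal sketch follows; the record layout (`id`, `released`) and the cutoff date are hypothetical and do not reflect the actual LCB or Multi-LCB data format.

```python
from datetime import date


def uncontaminated(problems, training_cutoff):
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p["released"] > training_cutoff]


# Hypothetical problem records.
problems = [
    {"id": "p1", "released": date(2024, 1, 10)},
    {"id": "p2", "released": date(2025, 3, 2)},
]
fresh = uncontaminated(problems, date(2024, 6, 30))
print([p["id"] for p in fresh])
```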

2606.20515 2026-06-19 cs.CV 新提交

S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

S-Agent:空间工具使用激发空间智能推理

Yalun Dai, Hao Li, Shulin Tian, Runmao Yao, Yuhao Dong, Fangzhou Hong, Zhaoxi Chen, Fangfu Liu, Baoliang Tian, Dingwen Zhang, Tao Wang, Kim-Hui Yap, Ziwei Liu

发表机构 * NTU(南洋理工大学) THU(清华大学) ByteDance(字节跳动) NWPU(西北工业大学)

AI总结 提出S-Agent空间工具使用智能体范式,通过时空证据积累和层次化工具集,将VLM作为语义规划器,实现连续多视图图像和视频的空间推理,在无训练下提升开源和闭源VLM性能,并基于S-300K轨迹微调得到紧凑空间智能体S-Agent-8B。

Comments Project page: https://Ropedia.github.io/S-Agent

AI中文摘要

现实世界的空间智能需要对连续且不断变化的三维世界进行推理,然而现有的VLM和工具增强智能体大多仍局限于从孤立的视觉观察中进行静态、无状态的推理。我们引入了S-Agent,一种用于理解和推理连续多视图图像和视频的空间工具使用智能体范式。通过将空间推理表述为时空证据积累而非孤立的帧级预测,S-Agent将空间感知重塑为以场景为中心的理解,超越以帧为中心的识别。具体而言,S-Agent将VLM作为语义规划器,决定需要哪些证据,而层次化的空间工具和专家将物体锚定在2D中,将其提升为3D几何证据,并将这些证据聚合为高级空间知识(例如,计数、测量、方向和相对位置)。此外,时间记忆机制,包括用于维护不断演变的场景状态的场景记忆和用于积累推理上下文的智能体记忆,实现了跨帧和推理步骤的证据整合。在多视图和视频空间推理基准上的全面实验表明,S-Agent以无需训练的方式持续提升开源和闭源VLM的性能。除了推理时增强,在S-Agent生成的空间轨迹S-300K上进行监督微调(SFT)得到了S-Agent-8B,一个紧凑的空间智能体,显著超越了类似规模的基线(例如,Qwen3-VL-8B),并与先进的闭源模型(例如,GPT-5.4和Gemini 3)性能相当。

英文摘要

Real-world spatial intelligence requires reasoning over a continuous and evolving 3D world, yet existing VLMs and tool-augmented agents largely remain tied to static, stateless inference from isolated visual observations. We introduce S-Agent, a spatial tool-use agentic paradigm for understanding and reasoning over continuous multi-view images and videos. By formulating spatial reasoning as spatio-temporal evidence accumulation rather than isolated frame-level prediction, S-Agent reshapes spatial perception into scene-centric understanding beyond frame-centric recognition. Specifically, S-Agent casts the VLM as a semantic planner that decides what evidence is needed, while a hierarchy of spatial tools and experts grounds objects in 2D, lifts them into 3D geometric evidence, and aggregates this evidence into high-level spatial knowledge (e.g., counting, measurement, orientation, and relative position). Additionally, a temporal memory mechanism, including Scene Memory for maintaining the evolving scene state and Agent Memory for accumulating reasoning context, enables evidence integration across frames and reasoning steps. Comprehensive experiments on multi-view and video spatial reasoning benchmarks show that S-Agent consistently improves both open-source and closed-source VLMs in a training-free manner. Beyond inference-time augmentation, supervised fine-tuning (SFT) on the S-Agent-generated spatial trajectories S-300K yields S-Agent-8B, a compact spatial agent that significantly surpasses similar-scale baselines (e.g., Qwen3-VL-8B) and performs comparably to advanced closed-source models (e.g., GPT-5.4 and Gemini 3).

2606.20493 2026-06-19 cs.LG cs.AI cs.MA 新提交

Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems

传染网络:多智能体LLM系统中的评估者偏见传播

Zewen Liu

发表机构 * Qilu Institute of Technology, School of Software Engineering(齐鲁理工学院软件工程学院)

AI总结 提出传染网络框架,量化评估者偏见在多智能体LLM系统中的传播,发现同模型智能体间偏见传播系数为0.157-0.352,且增大评估委员会规模可减少72.4%的传播效应。

Comments 20 pages, 4 figures, 4 tables

AI中文摘要

当大型语言模型在多智能体系统中担任评估者时,其系统性评估偏见会通过智能体网络传播。我们引入传染网络,这是一个用于衡量评估者偏见如何在交互的LLM智能体间传播的正式框架。在使用DeepSeek-chat进行的受控3智能体实验中,我们采用了三种不同的评估者偏见配置文件(结构化、平衡、基于证据),测量了跨智能体传染矩阵Gamma_3,并发现评估者偏见始终在智能体间传播(gamma在[0.157, 0.352]范围内),即使是在相同底层模型内也是如此。我们识别出由谱半径rho(Gamma_N)控制的三种传播机制,并证明同质模型智能体产生的传染系数比先前工作中观察到的跨模型系数弱3-5倍(MM-EPC: gamma约0.85-1.3),使其处于抑制机制中。我们表明,将评估委员会规模从k=1增加到k=3可将有效传染减少72.4%,提供了一种可行的缓解策略。我们发布了开源的传染网络实验框架。

英文摘要

When large language models serve as evaluators in multi-agent systems, their systematic evaluation biases propagate through the agent network. We introduce Contagion Networks, a formal framework for measuring how evaluator biases spread across interacting LLM agents. In a controlled 3-agent experiment using DeepSeek-chat with three distinct evaluator bias profiles (structured, balanced, evidence-based), we measure the Cross-Agent Contagion Matrix Gamma_3 and find that evaluator biases consistently propagate between agents (gamma in [0.157, 0.352]), even within the same underlying model. We identify three propagation regimes governed by the spectral radius rho(Gamma_N), and demonstrate that homogeneous-model agents produce contagion coefficients 3-5x weaker than cross-model coefficients observed in prior work (MM-EPC: gamma approx 0.85-1.3), placing them in the suppression regime. We show that increasing evaluator committee size from k=1 to k=3 reduces effective contagion by 72.4%, providing an actionable mitigation strategy. We release the open-source Contagion Network experimental framework.
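
Because the propagation regimes are governed by the spectral radius of the contagion matrix, it may help to see how that quantity is computed. The sketch below estimates it by power iteration for a non-negative matrix. The matrix entries are invented (chosen only to fall inside the reported range of 0.157 to 0.352), and reading a spectral radius below 1 as the suppression regime is an assumption: the abstract does not state the regime boundaries.

```python
def spectral_radius(matrix, iterations=200):
    """Estimate the dominant eigenvalue of a non-negative square matrix by power iteration."""
    n = len(matrix)
    v = [1.0] * n
    estimate = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        estimate = max(abs(x) for x in w)
        if estimate == 0.0:
            return 0.0
        v = [x / estimate for x in w]  # renormalize so the largest entry is 1
    return estimate


# Hypothetical 3-agent contagion matrix; entry [i][j] is the influence of agent j on agent i.
gamma_3 = [
    [0.0, 0.2, 0.3],
    [0.157, 0.0, 0.25],
    [0.352, 0.2, 0.0],
]
rho = spectral_radius(gamma_3)
# Assumed reading: a linear update x <- Gamma x dies out when rho < 1.
regime = "suppression" if rho < 1.0 else "amplification or critical"
print(round(rho, 3), regime)
```

For a non-negative matrix the spectral radius lies between the smallest and largest row sums, which gives a quick sanity check on the estimate.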

2606.20491 2026-06-19 cs.RO cs.CV 新提交

Fast Human Attention Prediction for Fixation-guided Active Perception in Autonomous Navigation

用于自主导航中注视引导主动感知的快速人类注意力预测

Fatma Youssef Mohammed, Grzegorz Malczyk, Kostas Alexis

发表机构 * Norwegian University of Science and Technology (NTNU)(挪威科技大学)

AI总结 提出GazeLNN,一种基于液态神经网络和MobileNetV3的轻量级扫描路径预测模型,在MIT低分辨率数据集上达到最优性能,计算成本降低99.40%,推理速度提升6倍,并集成到强化学习训练的主动相机-机器人控制策略中,实现自主导航中的注视引导感知。

Comments Accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)

AI中文摘要

人类视觉注意力依赖于结构化的扫描路径来高效处理场景,但将这种行为注入机器人自主性仍处于初级阶段,且受到现有预测模型高计算成本的阻碍。为了解决这一问题,我们提出了GazeLNN,一种计算轻量级的扫描路径预测模型,该模型采用液态神经网络作为其循环引擎,并使用MobileNetV3进行特征提取。该架构以自回归方式运行,根据当前视觉刺激和注视历史预测顺序注视热图。尽管仅需0.61 GFLOPs,GazeLNN在MIT低分辨率数据集上达到了最先进的性能,获得了0.47的ScanMatch分数。它在多种评估指标上优于现有的循环基线,同时将计算成本降低了99.40%,并将推理速度提高了六倍。为了研究人类注意力建模在机器人自主性中的作用,并展示这种高效架构的实际效用,我们将GazeLNN集成到通过强化学习训练的主动相机-机器人控制策略中。这种集成使得在自主导航过程中能够实现人类注视引导的感知,并通过在无人机上的成功实际部署得到了验证。

英文摘要

Human visual attention relies on structured scanpaths to efficiently process scenes, yet instilling this behavior into robot autonomy is in its infancy and hindered by the high computational costs of existing predictive models. To address this, we introduce GazeLNN, a computationally lightweight scanpath prediction model that leverages Liquid Neural Networks as its recurrent engine and employs MobileNetV3 for feature extraction. Operating auto-regressively, the architecture predicts sequential fixation heatmaps conditioned on the current visual stimulus and fixation history. Despite requiring only 0.61 GFLOPs, GazeLNN achieves state-of-the-art performance on the MIT Low Resolution dataset, with a ScanMatch score of 0.47. It outperforms existing recurrent baselines across diverse evaluation metrics, while reducing computational costs by 99.40% and accelerating inference by up to six times. To investigate the role of human attention modeling in robot autonomy and demonstrate the practical utility of this highly efficient architecture, we integrate GazeLNN into an active camera-robot control policy trained via Reinforcement Learning. This integration enables human-fixation-guided perception during autonomous navigation, validated through successful real-world deployments on an aerial robot.

2606.20488 2026-06-19 cs.CV 新提交

How Fragile Are Training-Free AI-Generated Image Detectors? A Controlled Audit of Score Direction, Preprocessing, and Compression

无训练AI生成图像检测器有多脆弱?对分数方向、预处理和压缩的受控审计

Jingwen Zhou, Mingzhe Wang

发表机构 * Xidian University(西安电子科技大学)

AI总结 本文通过统一协议审计两种无训练检测分数(自编码重建和噪声扰动特征相似性)及kNN基线,发现实现细节、分数方向选择和数据集格式偏差会导致AUROC变化高达0.38,且简单融合无法超越最佳单分数。

AI中文摘要

无训练的AI生成图像检测器承诺无需分类器训练即可实现生成器无关的部署,但其报告的数字很少在单一受控协议下进行比较。我们审计了两种代表性的无训练分数——一种自编码器重建分数(AEROBLADE风格)和一种噪声扰动特征相似性分数(RIGID风格),外加一个朴素的特征kNN控制,在包含七个生成器和JPEG压缩质量70和50的公共1,500图像GenImage衍生基准上进行。审计得出三个警示性发现。(i)实现细节伪装成方法差异:将LPIPS骨干网络(AlexNet -> VGG-16)替换使整体AUROC变化+0.085,在resize-to-512和原始分辨率预处理之间切换使每个生成器的结论翻转高达0.38 AUROC。(ii)分数方向不是方法的属性而是其超参数的属性:RIGID风格分数在噪声水平sigma=0.05时对SD1.5和Wukong反转(AUROC < 0.5),在sigma=0.01时对所有生成器恢复至>0.5,在sigma=0.3时降至0.15。(iii)数据集格式偏差夸大鲁棒性声明:没有统一重新编码时,JPEG-50下的AUROC超过AlexNet骨干重建分数的干净条件;偏差校正后残余异常定位到单个生成器(BigGAN)。审计的分数具有互补的逐生成器失败集,但朴素z-score融合未能击败最佳单分数,表明利用互补性需要方向感知的组合。

英文摘要

Training-free detectors of AI-generated images promise generator-agnostic deployment without classifier training, yet their reported numbers are rarely compared under a single controlled protocol. We audit two representative training-free scores -- an autoencoder-reconstruction score (AEROBLADE-style) and a noise-perturbation feature-similarity score (RIGID-style) -- plus a naive feature-kNN control, on a common 1,500-image GenImage-derived benchmark spanning seven generators and JPEG compression at quality 70 and 50. The audit yields three cautionary findings. (i) Implementation details masquerade as method differences: replacing the LPIPS backbone (AlexNet -> VGG-16) changes overall AUROC by +0.085, and switching between resize-to-512 and native-resolution preprocessing flips per-generator conclusions by up to 0.38 AUROC. (ii) Score direction is not a property of the method but of its hyperparameters: the RIGID-style score is inverted (AUROC < 0.5) on SD1.5 and Wukong at noise level sigma=0.05, recovers to >0.5 for every generator at sigma=0.01, and collapses to 0.15 at sigma=0.3. (iii) Dataset format bias inflates robustness claims: without unified re-encoding, AUROC under JPEG-50 exceeds the clean condition for the AlexNet-backbone reconstruction score; after bias correction the residual anomaly localizes to a single generator (BigGAN). The audited scores have complementary per-generator failure sets, but naive z-score fusion does not beat the best single score, indicating that exploiting complementarity requires direction-aware combination.
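
Finding (ii) is easier to follow with the definition of AUROC in hand: it is the probability that a randomly chosen fake image receives a higher score than a randomly chosen real one, so negating the score maps an AUROC of a to 1 - a. The sketch below shows this on invented scores; the numbers are not from the audit.

```python
def auroc(scores_fake, scores_real):
    """Probability that a random fake scores higher than a random real (ties count 0.5)."""
    wins = 0.0
    for f in scores_fake:
        for r in scores_real:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(scores_fake) * len(scores_real))


# Invented scores where the "fake" class tends to score lower (an inverted detector).
fake = [0.2, 0.3, 0.1]
real = [0.6, 0.7, 0.25]
as_reported = auroc(fake, real)
flipped = auroc([-s for s in fake], [-s for s in real])
print(round(as_reported, 3), round(flipped, 3))
```

An AUROC far below 0.5 therefore still carries signal, but only if the correct direction is known in advance. When the direction changes with a hyperparameter such as the noise level, it cannot be treated as a fixed property of the method, which is the abstract's point.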

2606.20487 2026-06-19 cs.CL 新提交

Beyond Global Replanning: Hierarchical Recovery for Cross-Device Agent Systems

超越全局重规划:跨设备智能体系统的分层恢复

Shu Yao, Yuhua Luo, Qian Long, Jingru Fan, Zhuoyuan Yu, Yuheng Wang, Lin Wu, Yufan Dang, Huatao Li, Chen Qian

发表机构 * School of Artificial Intelligence, Shanghai Jiao Tong University(上海交通大学人工智能学院) Shanghai Innovation Institute(上海创新研究院) Southeast University(东南大学) Tsinghua University(清华大学)

AI总结 提出分层重规划框架H-RePlan,通过统一API-CLI-GUI执行和跨层失败抽象,区分设备本地策略恢复与全局重规划,在HeraBench基准上显著提升跨设备任务完成率和指令遵循度。

AI中文摘要

现实世界中的计算机使用任务通常跨越多个应用程序和设备,要求智能体在动态运行时故障下协调异构环境。现有的多设备智能体系统支持任务分解和跨设备分配,但恢复仍然粗粒度:当执行失败时,它们通常重试相同策略、重新分配子任务或修改全局计划,而没有系统地建模设备本地策略空间。这限制了它们区分可在当前设备内修复的故障与需要跨设备重规划的故障的能力。我们提出H-RePlan,一个用于具有统一API-CLI-GUI执行的多设备智能体的分层重规划框架。H-RePlan为每个设备配备可互换的执行策略,并通过紧凑的跨层失败抽象将设备本地策略恢复与编排器级全局重规划分离。为了评估这一能力,我们引入HeraBench,一个故障注入基准,它在Linux和Android设备上构建跨设备工作流,并注入策略级和设备级故障。实验表明,H-RePlan显著优于单策略和粗粒度多设备基线,实现了更高的完成率、指令遵循率和完美通过率,同时降低了可靠端到端成功所需的令牌成本。这些结果表明,范围感知的分层恢复对于鲁棒的多设备智能体执行至关重要。

英文摘要

Real-world computer-use tasks often span multiple applications and devices, requiring agents to coordinate heterogeneous environments under dynamic runtime failures. Existing multi-device agent systems support task decomposition and cross-device assignment, but recovery remains largely coarse-grained: when execution fails, they typically retry the same strategy, reassign the subtask, or revise the global plan, without systematically modeling the device-local strategy space. This limits their ability to distinguish failures that can be repaired within the current device from those that require cross-device replanning. We propose H-RePlan, a hierarchical replanning framework for multi-device agents with unified API-CLI-GUI execution. H-RePlan equips each device with interchangeable execution strategies and separates device-local strategy recovery from orchestrator-level global replanning through a compact cross-layer failure abstraction. To evaluate this capability, we introduce HeraBench, a fault-injected benchmark that constructs cross-device workflows over Linux and Android devices and injects strategy- and device-level failures. Experiments show that H-RePlan substantially outperforms single-strategy and coarse-grained multi-device baselines, achieving higher completion, instruction adherence, and perfect-pass rates while reducing the token cost required for reliable end-to-end success. These results demonstrate that scope-aware hierarchical recovery is essential for robust multi-device agent execution.
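
The distinction between device-local recovery and global replanning can be illustrated with a two-level loop: exhaust the strategies available on the current device first, and only then escalate. This is a deliberately simplified sketch and not the H-RePlan algorithm. The function, the device callables, and the use of "move to the next device" as a stand-in for global replanning are all hypothetical.

```python
def run_subtask(subtask, devices, strategies=("api", "cli", "gui")):
    """Try every strategy on a device before escalating to another device.

    `devices` maps a device name to a callable(subtask, strategy) -> bool.
    Returns (device, strategy, log), or (None, None, log) if everything failed.
    """
    log = []
    for device, execute in devices.items():
        for strategy in strategies:
            ok = execute(subtask, strategy)
            log.append((device, strategy, ok))
            if ok:
                return device, strategy, log
        # Every local strategy failed here: escalate (stand-in for global replanning).
    return None, None, log


# Hypothetical devices: on "linux" only the CLI path works; "android" always succeeds.
devices = {
    "linux": lambda task, strategy: strategy == "cli",
    "android": lambda task, strategy: True,
}
device, strategy, log = run_subtask("export report", devices)
print(device, strategy, len(log))
```

Here the API failure is repaired locally by switching to the CLI strategy, so the orchestrator never has to reassign the subtask. A coarse-grained system would have escalated after the first failure.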

2606.20482 2026-06-19 cs.CL cs.HC cs.LG 新提交

Your Mouse and Eyes Secretly Leak Your Preference: LLM Alignment using Implicit Feedback from Users

你的鼠标和眼睛悄悄泄露你的偏好:利用用户隐式反馈进行LLM对齐

Haw-Shiuan Chang, Jeffrey Gomez, Mehul Patwari, Aryan Sajith, Hamed Zamani

发表机构 * University of Massachusetts, Amherst(马萨诸塞大学阿默斯特分校) York University(约克大学)

AI总结 针对显式反馈稀缺的问题,提出利用鼠标轨迹和眼动数据等隐式反馈训练奖励模型,将文本奖励模型准确率从55%提升至64%,并显著提高DPO对齐后响应质量。

AI中文摘要

为了对齐大型语言模型(LLM),大多数现有方法收集显式的人类反馈,并基于响应文本训练奖励模型来预测人类偏好。这些现有方法有两个关键局限性。首先,用户很少为LLM响应提供显式反馈,这使得高质量偏好标注的收集成本高昂。其次,这些方法没有利用隐式人类反馈,而隐式反馈已被证明对互联网巨头的经济护城河至关重要。为了量化隐式反馈的价值,我们构建了一个名为IFLLM的新数据集,收集了来自59名Mechanical Turk工作者的1336个多轮问题、他们的鼠标轨迹以及通过网络摄像头对LLM响应的眼动注视点。IFLLM显示用户具有非常多样化的注视行为和鼠标轨迹。基于隐式用户反馈的奖励模型将基于文本的奖励模型准确率从55%提升至64%,并在将DPO应用于八个LLM后,相对响应质量改进几乎翻了三倍,证明了隐式反馈在现实场景中的价值。我们的数据收集网站、数据集和代码可在以下网址找到:https://github.com/themehulpatwari/llm-implicit-feedback/ 。

英文摘要

To align a Large Language Model (LLM), most existing methods collect explicit human feedback and train a reward model to predict human preference based on the response text. These methods have two key limitations. First, users rarely provide explicit feedback on LLM responses, which makes high-quality preference annotations expensive to collect. Second, the methods do not leverage implicit human feedback, which has proven vital to the economic moats of Internet giants. To quantify the value of implicit feedback, we build a new dataset called IFLLM, which collects 1336 multi-turn questions from 59 Mechanical Turk workers, together with their mouse trajectories and webcam-recorded eye-gaze points on the LLMs' responses. IFLLM shows that users exhibit very diverse gazing behaviors and mouse trajectories. Our reward model based on implicit user feedback boosts the accuracy of the text-based reward model from 55% to 64% and nearly triples the relative response quality improvements after applying DPO to eight LLMs, demonstrating the value of implicit feedback in the wild. Our data collection website, dataset, and code can be found at https://github.com/themehulpatwari/llm-implicit-feedback/.

2606.20479 2026-06-19 cs.RO 新提交

GroundControl: Anticipating Navigation Failures in Vision-Language Agents via Trajectory-Consistent Uncertainty Estimates

GroundControl: 通过轨迹一致的不确定性估计预测视觉语言智能体中的导航失败

Nastaran Darabi, Divake Kumar, Sina Tayebati, Devashri Naik, Amit Ranjan Trivedi

发表机构 * University of Illinois at Chicago (UIC)(伊利诺伊大学芝加哥分校)

AI总结 提出轨迹一致的不确定性估计方法GroundControl,通过卡尔曼滤波建模距离变化并结合轨迹特征,有效预测导航失败,在选择性风险-覆盖评估中优于基线。

AI中文摘要

视觉语言导航智能体在基准任务上取得了具有竞争力的平均成功率,但失败通常源于可预测的轨迹级问题,如振荡、停滞或低效绕路。因此,可靠部署需要能够在执行过程中预测新兴失败动态的不确定性信号,而不仅仅是反映瞬时动作熵。我们引入了GroundControl,一种轨迹一致的不确定性估计器,定义为在一个回合中聚合的、相对于标称目标导向的距离-目标动态的统计偏差。GroundControl使用恒定速度卡尔曼滤波器对距离演化进行建模,并将归一化创新统计量与补充轨迹特征(捕捉进展、单调性、路径效率和振荡行为)相结合。由此产生的不确定性分数反映了导航行为中的几何和时间不一致性,而非局部预测分散。为了独立于任务成功评估不确定性质量,我们形式化了选择性风险-覆盖导航(SRCN)协议,该协议通过风险-覆盖曲线和AURC/E-AURC摘要,衡量不确定性分数按失败或低效对回合进行排序的有效性。在五个EB-Navigation分割($N=300$个回合)上,基于成功的选择性风险下,轨迹一致的不确定性实现了接近神谕的排序,GPT-4o模型的加权平均$\mathrm{E\text{-}AURC}_{\mathrm{SR}}=0.0024$,显著优于熵、共形和启发式基线。在基于SPL的选择性评估下,GroundControl在模型和导航分割上始终实现最低的AURC和E-AURC。这些结果表明,对目标导向动态的偏离进行建模,为预测视觉语言智能体中的导航失败提供了可解释且鲁棒的信号。

英文摘要

Vision-language navigation agents achieve competitive average success on benchmark tasks, yet failures often arise through predictable trajectory-level breakdowns such as oscillation, stagnation, or inefficient detours. Reliable deployment, therefore, requires uncertainty signals that anticipate emerging failure dynamics during execution rather than reflect only instantaneous action entropy. We introduce GroundControl, a trajectory-consistent uncertainty estimator defined as statistical deviation from nominal goal-directed distance-to-goal dynamics aggregated over an episode. GroundControl models distance evolution using a constant-velocity Kalman filter and combines normalized innovation statistics with complementary trajectory features capturing progress, monotonicity, path efficiency, and oscillatory behavior. The resulting uncertainty score reflects geometric and temporal inconsistency in navigation behavior rather than local prediction dispersion. To evaluate uncertainty quality independently of task success, we formalize Selective Risk-Coverage Navigation (SRCN), a protocol that measures how effectively an uncertainty score ranks episodes by failure or inefficiency using risk-coverage curves and AURC / E-AURC summaries. Across five EB-Navigation splits ($N=300$ episodes), trajectory-consistent uncertainty achieves near-oracle ordering under success-based selective risk, with weighted-average $\mathrm{E\text{-}AURC}_{\mathrm{SR}}=0.0024$ for the GPT-4o model, substantially outperforming entropy-based, conformal, and heuristic baselines. Under SPL-based selective evaluation, GroundControl consistently achieves the lowest AURC and E-AURC across models and navigation splits. These results show that modeling deviation from goal-directed dynamics provides an interpretable and robust signal for anticipating navigation failures in vision-language agents.
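
The core signal described above, normalized innovation statistics from a constant-velocity Kalman filter over distance-to-goal, can be sketched in a few lines. This is a generic textbook filter, not the authors' implementation: the noise levels `q` and `r`, the diagonal process noise, the initial covariance, and the toy trajectories are assumptions, and the complementary trajectory features are omitted.

```python
def innovation_scores(distances, q=1e-3, r=1e-2, dt=1.0):
    """Run a constant-velocity Kalman filter over distance-to-goal readings and
    return the normalized innovation squared (NIS) for each reading after the first."""
    d, v = distances[0], 0.0           # state: distance and its rate of change
    p00, p01, p11 = 1.0, 0.0, 1.0      # symmetric covariance [[p00, p01], [p01, p11]]
    scores = []
    for z in distances[1:]:
        # Predict: d <- d + v*dt and P <- F P F^T + Q, with Q = q * I (assumed).
        d = d + v * dt
        p00 = p00 + 2.0 * dt * p01 + dt * dt * p11 + q
        p01 = p01 + dt * p11
        p11 = p11 + q
        # Innovation and its variance (the measurement observes distance only).
        y = z - d
        s = p00 + r
        scores.append(y * y / s)
        # Update with Kalman gain K = P H^T / s.
        k0, k1 = p00 / s, p01 / s
        d, v = d + k0 * y, v + k1 * y
        p00, p01, p11 = (1.0 - k0) * p00, (1.0 - k0) * p01, p11 - k1 * p01
    return scores


# Toy episodes: steady progress toward the goal versus oscillation.
steady = innovation_scores([10, 9, 8, 7, 6, 5, 4, 3])
oscillating = innovation_scores([10, 9, 10, 9, 10, 9, 10, 9])
print(round(sum(steady), 3), round(sum(oscillating), 3))
```

Once the filter has locked onto a steady approach, its predictions match the readings and the NIS stays small. An oscillating trajectory keeps violating the constant-velocity assumption, so its NIS stays large, and aggregating that value over the episode gives a simple uncertainty score.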

2606.20477 2026-06-19 cs.CV cs.CL cs.LG 新提交

Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology

面向放射学的空间定位2D视觉-语言模型的可扩展训练

Yusuf Salcan, Simon Ging, Robin Schirrmeister, Philipp Arnold, Elmar Kotter, Behzad Bozorgtabar, Thomas Brox

发表机构 * Computer Vision Group, University of Freiburg, Germany(德国弗莱堡大学计算机视觉组) Department of Radiology, Medical Center -- University of Freiburg, Germany(德国弗莱堡大学医学中心放射科) CRIION-AI Lab, Freiburg, Germany(德国弗莱堡CRIION-AI实验室)

AI总结 提出RefRad2D大规模双语数据集,通过LLM和自动分割生成空间定位数据,训练RadGrounder模型联合完成报告生成、VQA和空间定位,在外部基准上取得竞争性结果。

Comments Accepted for MICCAI 2026. First two authors: equal contribution. Last two authors: equal supervision

AI中文摘要

我们研究了如何在没有手动空间标注的情况下,为放射学训练具有视觉定位能力的视觉-语言模型(VLM)。我们引入了RefRad2D,这是一个大规模的双语(德语/英语)数据集,包含来自临床实践的120万对CT和MR图像-文本对,并通过基于LLM的筛选和自动分割自动生成任务特定的VQA和空间定位子集。在此数据上训练的模型RadGrounder联合执行报告生成、视觉问答以及通过边界框检测或分割进行的空间定位。在外部VQA基准(Slake,VQA-RAD)上,RadGrounder取得了与专用医学VLM竞争的结果。将我们的临床数据加入训练混合集,相比于仅在下游数据集上微调,提高了开放式VQA的性能,显示了数据集的迁移性。关键在于,添加定位监督不会降低语言质量,从而在不牺牲VQA性能的情况下实现空间可验证的输出。

英文摘要

We study how to train visually grounded vision-language models (VLMs) for radiology without manual spatial annotations. We introduce RefRad2D, a large-scale bilingual (German/English) dataset of 1.2M CT and MR image-text pairs derived from clinical practice, with task-specific VQA and spatial grounding subsets generated automatically via LLM-based curation and automated segmentation. Trained on this data, our model RadGrounder jointly performs report generation, visual question answering, and spatial grounding via bounding-box detection or segmentation. On external VQA benchmarks (Slake, VQA-RAD), RadGrounder achieves competitive results with specialized medical VLMs. Adding our clinical data to the training mixture improves open-ended VQA over fine-tuning on the downstream datasets alone, showing the transferability of our dataset. Crucially, adding grounding supervision does not degrade language quality, enabling spatially verifiable outputs at no cost to VQA performance.

2606.20475 2026-06-19 cs.LG 新提交

Marginal Advantage Accumulation for Memory-Driven Agent Self-Evolution

边际优势累积用于记忆驱动智能体自我进化

Mingyu Yang, Keye Zheng, Congchao Cheng, Yujie Liu, Xingkang Lu, Fan Jiang, Yefei Zheng

发表机构 * Alibaba International Digital Commerce Group(阿里巴巴国际数字商业集团)

AI总结 针对批量式轨迹蒸馏中跨批次证据缺失问题,提出边际优势累积(MAA)方法,通过差分信号构造、指数移动平均累积和语义身份合并,在16个设置中14个取得最佳结果,优化阶段token消耗减少约75%。

Comments 26 pages, 4 figures, 10 tables, 42 references


Abstract

In batch-style trace distillation, the same memory operation may receive contradictory feedback across different batches. Existing methods lack a cross-batch, operation-level evidence accumulation mechanism, making it impossible to distinguish stably effective operations from accidental hits. This paper formalizes the requirement as two structural conditions, alignability and comparability, and proposes Marginal Advantage Accumulation (MAA). MAA constructs differential signals that are comparable across batches, accumulates signed evidence per operation via an exponential moving average (EMA), and ensures cross-batch traceability through semantic identity merging. As a post-processing architecture, MAA achieves the best results in 14 out of 16 settings across 4 benchmarks and 4 target models, consistently outperforming existing batch-level distillation baselines and matching or surpassing online alternatives in most settings, while reducing optimization-phase token consumption by approximately 75%.
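The accumulation idea can be illustrated with a small sketch. It assumes that the differential signal is a score minus a per-batch baseline and that operations already carry canonical identifiers (standing in for semantic identity merging). The class name, decay value, threshold, and toy scores are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def differential_signal(score_with_op, batch_baseline):
    """Score relative to the batch baseline, so that signals from easy and
    hard batches live on a comparable scale (assumed form of the signal)."""
    return score_with_op - batch_baseline


class EvidenceAccumulator:
    """Keeps one signed EMA of evidence per memory operation."""

    def __init__(self, decay=0.5):
        self.decay = decay
        self.evidence = defaultdict(float)  # starts at 0 for unseen operations

    def update(self, op_id, signal):
        previous = self.evidence[op_id]
        self.evidence[op_id] = self.decay * previous + (1 - self.decay) * signal
        return self.evidence[op_id]

    def stable_ops(self, threshold=0.05):
        """Operations whose accumulated evidence stays clearly positive."""
        return sorted(op for op, e in self.evidence.items() if e > threshold)


# Toy batches: (batch baseline, {operation id: score when the operation is applied}).
batches = [
    (0.50, {"op_a": 0.70, "op_b": 0.80}),  # op_b gets a lucky hit here
    (0.60, {"op_a": 0.75, "op_b": 0.50}),
    (0.40, {"op_a": 0.60, "op_b": 0.30}),
]

acc = EvidenceAccumulator(decay=0.5)
for baseline, scores in batches:
    for op_id, score in scores.items():
        acc.update(op_id, differential_signal(score, baseline))

kept = acc.stable_ops()
```

In this toy run `op_b` has the largest single-batch gain, but its accumulated evidence turns negative once later batches contradict it, while `op_a` stays positive throughout. This is the distinction between stably effective operations and accidental hits that the abstract describes.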

2606.20474 2026-06-19 cs.LG cs.AI cs.PF New submission

UltraQuant: 4-bit KV Caching for Context-Heavy Agents


Inesh Chakrabarti, David Limpus, Aditi Ghai Rana, Bowen Bao, Spandan Tiwari, Thiago Crepaldi, Ashish Sirasao

Affiliations * Advanced Micro Devices; University of California, Los Angeles; Purdue University

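AI summary For context-heavy agent workloads, proposes UltraQuant, which combines 4-bit KV-cache compression with rotation and codebook quantization and AMD GPU optimizations, cutting P50 time-to-first-token by up to 3.47x and raising output throughput by 1.63x on long-context, multi-turn tasks.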

Comments 11 pages, 9 figures


Abstract

Context-heavy agents place unusual pressure on the key-value (KV) cache: long prefixes are reused across many short turns, while concurrency determines whether the serving system can keep GPUs utilized. We study 4-bit KV-cache compression for this setting, using TurboQuant-style rotation and codebook quantization as a quality anchor and vLLM FP8 KV caching as the deployment anchor. We report three contributions. First, we frame 4-bit KV caching around multi-round agent workloads where task quality, cache residency, and serving throughput must be measured jointly. Second, we describe the practical design choices needed to make the 4-bit path robust, including asymmetric K/V treatment, Walsh-Hadamard rotation, QJL removal, and block-scale variants. Third, we present serving optimizations on AMD GPUs, including optimized decode-attention kernels and UltraQuant, an FP4 approximation path that uses FP8 queries, FP4 KV tensors, UE8M0 group scales, and native scaled-MFMA support on CDNA4. On a long-context, multi-turn agentic workload, UltraQuant cuts P50 time-to-first-token by 3.47x in the cache-pressured late rounds (2.3x across all rounds) and raises output throughput by 1.63x over the FP8 KV baseline.
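To make the "rotate, then quantize to 4 bits" idea concrete, here is a minimal pure-Python sketch. It applies an orthonormal Walsh-Hadamard rotation and then a simplified symmetric integer quantizer with one floating-point scale per group. That quantizer is swapped in for illustration: the paper's codebook quantization, FP4 tensors, and UE8M0 group scales are not reproduced here, and the function names, group size, and toy vector are assumptions.

```python
import math


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two).
    With the 1/sqrt(n) scaling the transform is its own inverse."""
    out = list(vec)
    n = len(out)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def quantize_int4(vec, group_size=4):
    """Symmetric 4-bit quantization: integer codes in [-7, 7], one scale per group."""
    codes, scales = [], []
    for g in range(0, len(vec), group_size):
        group = vec[g:g + group_size]
        amax = max(abs(x) for x in group)
        scale = amax / 7.0 if amax > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-7, min(7, round(x / scale))) for x in group)
    return codes, scales


def dequantize_int4(codes, scales, group_size=4):
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


# Toy key vector with a single large outlier channel.
key = [8.0, 0.1, -0.2, 0.05, 0.1, -0.1, 0.2, 0.0]
rotated = fwht(key)                      # spreads the outlier across coordinates
codes, scales = quantize_int4(rotated)   # 4-bit codes plus per-group scales
recon = fwht(dequantize_int4(codes, scales))  # undo the rotation
```

Because the rotation is orthonormal, the reconstruction error in the original space has the same L2 norm as the quantization error in the rotated space, which is why a rotation can be added in front of a quantizer without changing how the error is measured.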

2606.20469 2026-06-19 cs.LG cs.CG New submission

Fisher-Geometric Sharpness and the Implicit Bias of SGD toward Flat Minima


Md Sakir Ahmed, Kumaresh Sarmah, Hemen Dutta

Affiliations * Gauhati University

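AI summary Because SGD favors flat minima but Euclidean sharpness is not reparametrization-invariant, proposes a Riemannian sharpness based on the Fisher Information Matrix, proves its invariance, shows that the SGD stationary distribution concentrates at flat minima, and links this bias to generalization through a PAC-Bayes bound.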

Comments 18 pages, 5 figures, preprint


Abstract

A widely held intuition in deep learning is that stochastic gradient descent (SGD) implicitly favors flat minima and that flat minima generalize better. However, standard Euclidean measures of flatness, such as the trace or maximum eigenvalue of the loss Hessian, are not invariant under reparametrizations that preserve the network function, which undermines the theoretical foundations of this narrative. In this study we resolve this issue by grounding flatness in the Riemannian geometry of the statistical manifold induced by the Fisher Information Matrix (FIM). We define Riemannian sharpness mathematically and prove that it is invariant under smooth, function-preserving reparametrizations, which directly addresses the critique of Dinh et al. in the paper "Sharp minima can generalize for deep nets". We note that this invariance is a property of the true FIM; the diagonal empirical estimator used in practice (and in all experiments below) inherits invariance only approximately, and exact invariance under arbitrary reparametrizations would require structured estimators such as K-FAC. We formalize the gradient noise of mini-batch SGD as having a covariance structure proportional to the FIM, derive the stationary distribution of the resulting stochastic differential equation, and then show that the probability mass is exponentially concentrated at Riemannian-flat minima. A PAC-Bayes generalization bound controlled explicitly by SR formally links this geometric bias to test performance. Our experiments on MNIST and CIFAR-10 confirm that SR reliably tracks generalization in ways that Euclidean sharpness does not, and that its scaling with $\eta/B$ matches the theoretical predictions. Together these results provide a rigorous, reparametrization-invariant account of why flat minima generalize.
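The abstract does not state the exact definition of Riemannian sharpness, so the following is only a sketch of why a Fisher-weighted curvature can be invariant. It assumes sharpness is read off the spectrum of $F^{-1}H$ at a minimum, with $F$ the (invertible) Fisher matrix and $H$ the loss Hessian; the symbols $\varphi$, $\psi$, and $J$ are introduced here for illustration. Let $\theta = \varphi(\psi)$ be a smooth, invertible reparametrization with Jacobian $J$.

```latex
\begin{align*}
\tilde{L}(\psi) &= L(\varphi(\psi)), \qquad J = \frac{\partial \varphi}{\partial \psi} \\
\tilde{F} &= J^{\top} F J
  && \text{(the Fisher matrix transforms as a metric tensor)} \\
\tilde{H} &= J^{\top} H J
  && \text{(at a minimum, where } \nabla L = 0 \text{ removes the second-derivative term of } \varphi\text{)} \\
\tilde{F}^{-1} \tilde{H}
  &= J^{-1} F^{-1} J^{-\top} J^{\top} H J
   = J^{-1} \left( F^{-1} H \right) J
\end{align*}
```

Since $\tilde{F}^{-1}\tilde{H}$ is similar to $F^{-1}H$, the two share eigenvalues, so quantities such as $\lambda_{\max}(F^{-1}H)$ and $\operatorname{tr}(F^{-1}H)$ do not change under the reparametrization. By contrast, the Euclidean quantity $\lambda_{\max}(J^{\top} H J)$ can be made larger or smaller by the choice of $J$, which is the non-invariance the abstract refers to.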