arXivDaily: arXiv daily academic digest, updated Monday through Friday

1. Multimodal and Vision-Language Models (1 paper)

2606.19120 2026-06-18 cs.LG cs.CV cross-listed

Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

Sihan Wang, Xiyao Liu, Lianqing Liu, Zhi Han

Affiliations: State Key Laboratory of Robotics and Intelligent Systems, Shenyang Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences

AI summary: Proposes ViGOS, a framework that decouples perception from reasoning to avoid text shortcuts during MLLM post-training and to strengthen image-grounded behavior.

Comments: 29 pages, 5 figures, 8 tables

Abstract

On-policy self-distillation (OPSD) trains a model on its own rollouts and uses a frozen copy to provide dense token-level targets conditioned on a reference target. This works well for LLM reasoning, but a direct extension to multimodal large language models (MLLMs) can create a shortcut: the privileged target may guide tokens mainly based on the text reference target rather than the image. We propose ViGOS, a visually grounded OPSD framework for MLLM post-training. The student first writes a visual description and then reasons toward the final answer. For valid rollouts, an image-only perception teacher supervises the description, while a privileged reasoning teacher supervises the reasoning and final answer on the same student prefix. A reference teacher is used only for invalid rollouts to recover the output format. Across general vision-language, expert reasoning, visual math, spatial grounding, and visual-language-prior benchmarks, ViGOS keeps the main benefits of OPSD and improves image-grounded behavior in shortcut-prone settings.

2. Embodied AI, Robotics, and Autonomous Driving (4 papers)

2606.18610 2026-06-18 cs.RO cs.CV cross-listed

SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation

Wei-Cheng Tseng, Gashon Hussein, Yuzhu Dong, Allen Z. Ren, Lucy X. Shi, XuDong Wang, Sergey Levine, Zhaoshuo Li, Jinwei Gu, Florian Shkurti, Ming-Yu Liu, Quan Vuong

Affiliations: University of Toronto; Vector Institute; NVIDIA; Physical Intelligence; Stanford University; UC Berkeley; Allen Institute for AI

AI summary: Proposes SC3-Eval, which uses forward-inverse dynamics consistency, cross-view consistency, and test-time consistency to turn a pre-trained video foundation model into an accurate policy evaluator, reaching a Pearson correlation of 0.929 across seven real-world policies.

Abstract

Evaluating generalist robot manipulation policies in the real world is expensive, slow, and difficult to scale. Action-conditioned video world models offer a scalable alternative by simulating policy rollouts. Autoregressive rollouts accumulate compounding errors, observations across multiple camera views must remain mutually consistent, and the evaluator must generalize to policies whose behaviors lie outside the training distribution. We address these challenges with SC3-Eval, a self-consistent video generation recipe that adapts a pre-trained video foundation model into an accurate policy evaluator by enforcing three complementary forms of consistency. First, forward-inverse dynamics consistency jointly trains the model to predict frames from actions and to recover actions from frames, anchoring generated rollouts to a physically plausible action manifold and counteracting the drift a forward-only model cannot penalize. Second, cross-view consistency trains the model to inpaint each camera view from the other, keeping the multi-camera observation coherent over long rollouts without any explicit memory mechanism. Third, test-time consistency reuses the inverse dynamics mode at inference as a per-action-chunk uncertainty signal that terminates rollouts whose generated frames drift away from the requested actions. We also demonstrate SC3-Eval rollouts reproduce the failure modes that policies exhibit in real-world rollouts, supporting fine-grained diagnostic comparison rather than aggregate ranking alone. Across seven real-world vision-language-action policies, SC3-Eval attains a closed-loop Pearson correlation of $0.929$ and MMRV of $0.119$, outperforming three strong prior video-model-based baselines, and generalizes to new tasks.
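The two agreement metrics quoted in this abstract can be sketched as follows. This is an illustrative sketch, not the authors' evaluation code: the success rates are made-up numbers, and the MMRV definition used here (the mean over policies of the largest rank violation, weighted by the gap in real success rate) is assumed from earlier simulation-based policy evaluation work rather than taken from the abstract.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def mmrv(real, sim):
    """Mean maximum rank violation (assumed definition, see lead-in)."""
    n = len(real)
    total = 0.0
    for i in range(n):
        worst = 0.0
        for j in range(n):
            # A violation: the evaluator orders policies i and j
            # differently from the real-world results.
            if (sim[i] < sim[j]) != (real[i] < real[j]):
                worst = max(worst, abs(real[i] - real[j]))
        total += worst
    return total / n

# Hypothetical success rates for four policies (not from the paper)
real_success = [0.20, 0.45, 0.60, 0.90]
sim_success = [0.25, 0.40, 0.70, 0.85]
print(pearson(real_success, sim_success), mmrv(real_success, sim_success))
```

With these sample numbers the evaluator preserves the real-world ordering, so the MMRV is 0; a higher Pearson value and a lower MMRV both indicate closer agreement with real rollouts.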

2606.19067 2026-06-18 cs.RO cs.CV cross-listed

Sensor Configuration Matters: A Systematic Evaluation of Multimodal SLAM on Quadruped Robots

Roberto Corlito, Fabian Schmidt, Nils Seibert, Markus Enzweiler, Abhinav Valada, Arne Roennau

Affiliations: Machine Intelligence and Robotics Lab, Karlsruhe Institute of Technology (KIT); Institute for Intelligent Systems, Esslingen University of Applied Sciences; Department of Computer Science, University of Freiburg

AI summary: Addresses sensor configuration for quadruped locomotion by systematically evaluating visual, visual-inertial, and LiDAR-visual-inertial SLAM methods; finds that stereo cameras, global shutters, and appropriate inertial integration significantly improve localization robustness.

Abstract

Autonomous navigation of quadrupedal robots in diverse environments fundamentally relies on resilient Simultaneous Localization and Mapping (SLAM). While visual-inertial SLAM has matured across wheeled, handheld, and aerial platforms, a critical evaluation gap remains regarding how hardware-level sensor configurations affect performance under the aggressive dynamics of legged locomotion. Quadrupeds introduce distinct embodiment-induced sensory challenges, including foot-impact shocks, high-frequency mechanical vibrations, and rapid angular rotations, which degrade standard perception pipelines. To address this gap, we present a systematic evaluation of state-of-the-art visual, visual-inertial, and LiDAR-visual-inertial SLAM methods using the GrandTour dataset recorded on an ANYmal D quadruped. We isolate and quantify the impacts of camera modalities, shutter techniques, and inertial sensor tiers, analyzing their trade-offs across localization accuracy, algorithmic robustness, and computational resource utilization. Our empirical findings demonstrate that hardware selection has substantial influence on system resilience: stereo configurations consistently outperform monocular and RGB-D modalities, global shutter cameras significantly mitigate motion-induced tracking failures compared to rolling shutter cameras, and, crucially, standard inertial integration can degrade the performance of primarily vision-based frameworks under harsh legged locomotion. These insights additionally offer concrete design guidelines for tailoring custom sensor payloads to achieve dependable perception on agile legged systems.

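2606.19240 2026-06-18 cs.RO cs.CV cs.HC cs.SY eess.SY cross-listed

Seeing Through Occlusion: Deterministic Arm Kinematic Correction for Robot Teleoperation

Thomas M. Kwok, Nicholas Koenig, Yue Hu

Affiliations: University of Waterloo

AI summary: Proposes an Arm Kinematic Correction method that uses the constant-arm-length geometric constraint and the Pythagorean theorem to deterministically reconstruct occluded joint depths without complex modeling; validated against Vicon and successfully applied to teleoperation.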

Abstract

Markerless, single-RGB-D-camera motion capture provides a low-cost and non-invasive alternative to conventional marker-based systems for robot teleoperation; however, depth estimation often degrades in the presence of self-occlusion, particularly during upper-limb motion. This paper presents an Arm Kinematic Correction (AKC) method that improves depth estimation by enforcing geometric constraints based on constant arm lengths. The proposed approach reconstructs occluded joint depths by leveraging wrist positions and predefined arm lengths via a deterministic formulation based on the Pythagorean theorem, thereby avoiding the need for complex probabilistic modeling or parameter tuning. Experimental validation against a Vicon reference system demonstrates reliable performance for both static and dynamic joint motions, evaluated using root-mean-square error (RMSE) and Pearson correlation. Furthermore, motion-mapping teleoperation is successfully demonstrated in both simulated and physical robot environments. The results show that AKC enhances robustness and preserves anatomical consistency under long-duration, severe self-occlusion, even when paired with less reliable temporal filters, highlighting its practicality for real-time applications such as robot teleoperation and human-robot interaction.
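As a sketch of how a constant-length constraint can pin down an occluded depth, assume a camera-frame coordinate system in which the occluded joint's lateral coordinates (x, y) are still observed, the wrist is fully observed, and the segment length is known in advance. The function name, the argument layout, and the rule for choosing between the two roots are assumptions for illustration, not the paper's formulation.

```python
import math

def occluded_depth(wrist, joint_xy, segment_length, behind_wrist=True):
    """Depth z of a joint from its fixed distance to the wrist.

    Uses segment_length**2 = dx**2 + dy**2 + dz**2
    (the Pythagorean theorem in 3D) and solves for dz.
    """
    wx, wy, wz = wrist
    jx, jy = joint_xy
    dz_sq = segment_length ** 2 - (jx - wx) ** 2 - (jy - wy) ** 2
    # Measurement noise can push the lateral offset past the segment
    # length; clamp so the square root stays real.
    dz = math.sqrt(max(dz_sq, 0.0))
    # The constraint alone gives two roots; the caller picks the side.
    return wz + dz if behind_wrist else wz - dz

# Hypothetical values in metres: wrist at 1.0 m depth, 0.25 m forearm
print(occluded_depth((0.0, 0.0, 1.0), (0.15, 0.0), 0.25))
```

Because the computation is a single closed-form expression, it needs no probabilistic model or tuned parameters, which matches the deterministic character described in the abstract.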

2606.19333 2026-06-18 cs.RO cs.CV cross-listed

Do as I Do: Dexterous Manipulation Data from Everyday Human Videos

Bhawna Paliwal, Haritheja Etukuru, William Liang, Pieter Abbeel, Nur Muhammad Mahi Shafiullah, Jitendra Malik

Affiliations: UC Berkeley

AI summary: Proposes the DO AS I DO algorithm, which reconstructs hand-object interactions from monocular RGB human videos and retargets them to multi-fingered dexterous robotic hands, producing executable manipulation data and outperforming existing methods.

Comments: Project website: https://do-as-i-do.com/

Abstract

How can we scalably generate data for robotic manipulation, especially on human-like platforms such as dexterous multi-fingered hands? Learning from human videos has recently emerged as a likely answer to this question. However, difficulties in estimating hand-object interaction and crossing the human-to-robot embodiment gap have hindered the adoption of abundant monocular RGB-only human videos as the primary source of robot manipulation data. In this work, we present DO AS I DO, an algorithm to reconstruct and retarget monocular RGB human videos to multi-fingered dexterous robotic hands. DO AS I DO reconstructs hand-object interactions from various egocentric and exocentric in-the-wild video sources. The algorithm then retargets these hand-object interaction estimates into a sequence of actions executable in the real world, yielding robot-complete manipulation data from disparate human videos. Overall, DO AS I DO outperforms previous state of the art in estimating hand-object interactions and extracting dexterous manipulation trajectories from RGB videos, as we show in experiments on datasets with ground truths and on a dataset of video clips collected online. Our experiments enable us to propose an efficacy playbook for practitioners collecting human data for manipulation.

3. Video Understanding and Temporal Vision (1 paper)

2606.18732 2026-06-18 cs.LG cs.CV cross-listed

Low-Cost Neuromorphic Fall Detection Using Synthetic Event Data and Hybrid SNNs

Guillermo Rojas, Gonzalo Soto, Daniel Yunge

Affiliations: School of Electrical Engineering, Pontificia Universidad Católica de Valparaíso, Chile

AI summary: Proposes hybrid SNN-CNN models and synthesizes event-camera data from smartphone videos for efficient, accurate fall detection.

Comments: 4 pages, 6 figures, presented at ICONS 2025 during the Poster Session, but not published

Abstract

This work presents the development of hybrid models that integrate spiking neural networks (SNNs) with components of convolutional neural networks (CNNs) to learn from simulated event-based camera data (Dynamic Vision Sensor, DVS) generated from conventional smartphone videos. Aimed primarily at human fall detection, the approach leverages the energy efficiency and spatio-temporal processing capabilities of SNNs by converting video frames into event-based data. The proposed models are evaluated through simulations on multiple datasets, comparing their performance to that of traditional machine learning models. Results demonstrate significant gains in efficiency without sacrificing accuracy, underscoring the potential of combining SNNs and DVS technology for complex tasks in real-world environments.

4. Generative Vision and World Models (2 papers)

2606.19162 2026-06-18 cs.LG cs.CV cross-listed

The Reward Was in Your Data All Along: Correcting Flow Matching with Discriminator-Guided RL

Nicolas Beltran-Velez, Felix Friedrich, Zhang Xiaofeng, Reyhane Askari-Hemmat, Xiaochuang Han, Adriana Romero-Soriano, Michal Drozdzal

Affiliations: FAIR at Meta; Columbia University; McGill University; Canada CIFAR AI Chair

AI summary: Targets visual defects in flow-matching models caused by the mismatch between the training loss and sample quality; proposes Discriminator-Guided RL (DRL), which uses the logit of a discriminator in a pretrained representation space as the reward, substantially improving guidance-free FID and semantic-space FD as well as preference alignment.

Comments: 84 pages, including appendices

Abstract

Score- and flow-matching models often rely on preference-based reinforcement learning for two purposes: aligning with subjective preferences and, surprisingly, recovering properties such as visual realism and coherent object structure that matching-based training is intended to learn from the data itself. We argue that this reflects a structural mismatch. Matching losses measure $\ell_2$ regression error on the velocity or score field under training-time marginals, a proxy poorly aligned with the visual and semantic properties that determine sample quality at inference. Given a reward aligned with these properties, RL sidesteps the mismatch by evaluating the model on its own samples and following the reward landscape directly. The challenge is to obtain such a reward without relying on human preferences, which are expensive and conflate data realism with annotator inclinations. We propose Discriminator-Guided RL (DRL). DRL trains a discriminator to separate data from base-model samples in a pretrained representation space and uses its logit as the reward in KL-regularized RL. The pretrained space restricts the discriminator to perceptually meaningful directions, and the logit estimates the log-likelihood ratio between data and model, which is the optimal reward for targeting the data distribution. Across SiT, JiT, REPA, and RAE, DRL reduces guidance-free FID (e.g., $9.38 \to 2.62$ on SiT) and semantic-space FD (e.g., $88.2 \to 19.3$ on DINOv3 for SiT), with consistent gains across all backbones, and improves human-preference rewards without training on them. It also yields a better Pareto frontier between preference reward and image fidelity under subsequent preference-based post-training, increasing alignment while reducing low-level artifacts such as oversaturation and excessive brightness.
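The statement that the discriminator logit is the optimal reward for targeting the data distribution follows from two textbook facts. The derivation below is a sketch under idealized assumptions (a Bayes-optimal discriminator trained on balanced data, and a KL coefficient of exactly 1); the symbols $p_{\text{data}}$, $p_\theta$, $r$, and $\beta$ are notation chosen here and may differ from the paper's.

```latex
% Bayes-optimal discriminator between data and base-model samples
D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_\theta(x)}
\quad\Longrightarrow\quad
\operatorname{logit} D^*(x)
  = \log\frac{D^*(x)}{1 - D^*(x)}
  = \log\frac{p_{\text{data}}(x)}{p_\theta(x)}

% KL-regularized RL: maximize  E_\pi[r(x)] - \beta\,\mathrm{KL}(\pi \,\|\, p_\theta)
\pi^*(x) \propto p_\theta(x)\,\exp\!\big(r(x)/\beta\big)

% Substituting r = \operatorname{logit} D^* and \beta = 1
\pi^*(x) \propto p_\theta(x)\cdot\frac{p_{\text{data}}(x)}{p_\theta(x)} = p_{\text{data}}(x)
```

In practice the discriminator is only an approximation and operates in a pretrained feature space, so the recovered distribution is an estimate rather than an exact match.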

2606.19325 2026-06-18 cs.SD cs.AI cs.CV cross-listed

Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors

Michael Finkelson, Daniel Segal, Eitan Richardson, Shahar Armon, Nani Goldring, Poriya Panet, Nir Zabari, Benjamin Brazowski, Or Patashnik, Yoav HaCohen

Affiliations: Lightricks; Tel Aviv University

AI summary: Proposes ScenA, which uses a pretrained text-to-audio flow-matching foundation model to generate multi-speaker audio scenes from multiple reference voices and a natural-language prompt, and adopts a high-noise-biased timestep distribution to address the reference shortcut; outperforms existing systems on the CoVoMix2-Dialogue benchmark.

Comments: Project page at https://finmickey.github.io/scena/

Abstract

Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision: per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. These systems operate within speech-only pipelines that produce clean vocal sequences without the ambient texture of real conversations. We take a different approach. Our method, ScenA, conditions a text-to-audio flow-matching foundation model, pretrained on large-scale in-the-wild data, directly on multiple reference voices and a free-form natural language prompt that describes an entire multi-speaker audio scene. Leveraging such a foundational model allows us to inherit its capacity for natural, non-studio audio: background noise, room acoustics, overlapping dialogue, and spontaneous paralinguistic events, while adding multi-speaker control without any per-turn structure. Concretely, reference latents are concatenated into the model's token sequence and distinguished by lightweight identity-aware positional encodings. However, we identify a critical obstacle to this approach: the Reference Shortcut. During training under standard noise schedules, the model can identify the matching reference by acoustic similarity to the noisy target, bypassing the text prompt entirely. We address this with a high-noise-biased timestep distribution that forces the model to rely on the text prompt for speaker assignment. We evaluate ScenA on the CoVoMix2-Dialogue benchmark, showing that it outperforms existing multi-speaker systems on speaker-binding metrics while generating rich conversational audio with overlapping speech, emotional vocalizations, and ambient sound. Our results demonstrate the advantage of using a general-purpose audio model conditioned on a free-form scene description, rather than passing structured dialog scripts through a speech-only pipeline.

5. 3D Vision, Point Clouds, and Spatial Intelligence (2 papers)

2606.18588 2026-06-18 cs.DC cs.CV cross-listed

Splaxel: Efficient Distributed Training of 3D Gaussian Splatting for Large-scale Scene Reconstruction via Pixel-level Communication

Wenqi Jia, Zhewen Hu, Ying Huang, Yu Gong, Stavros Kalafatis, Yuke Wang, Wei Niu, Chengming Zhang, Ang Li, Sheng Di, Yuede Ji, Bo Fang, Miao Yin

Affiliations: Independent Researcher; Rice University; University of Georgia; University of Houston; University of Washington; Argonne National Labs

AI summary: Proposes the Splaxel framework, which replaces Gaussian synchronization with pixel-level local rendering and global composition, keeping communication overhead stable while maintaining mathematical consistency; combined with visibility prediction and conflict-free camera-view consolidation, it accelerates large-scale distributed 3DGS training by up to 7.6×.

Comments: 17 pages, 25 figures

Abstract

3D Gaussian Splatting (3DGS) enables high-fidelity and real-time 3D scene reconstruction, but scaling training to large-scale scenes requires optimizing hundreds of millions of Gaussians across multiple GPUs. Existing distributed approaches either partition scenes into isolated regions, causing global inconsistency, or rely on global Gaussian-level exchanges, which lead to substantial growth in inter-GPU communication and quickly dominate iteration time. We propose Splaxel, a communication-efficient distributed 3DGS training framework based on pixel-level local rendering and global composition. Instead of synchronizing Gaussians, each GPU renders its local subset and exchanges only partial pixel values, maintaining mathematical consistency while keeping communication cost stable as the scene size increases. Splaxel further reduces pixel-level redundancy through geometric and transmittance visibility prediction and improves GPU utilization via conflict-free camera-view consolidation. Evaluated on large-scale datasets with up to 120M Gaussians, Splaxel achieves up to 7.6$\times$ speedup over the state-of-the-art distributed 3DGS framework while preserving high reconstruction quality.
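Why exchanging partial pixel values can stay mathematically consistent is easiest to see for a single ray. Front-to-back alpha compositing is associative, so each worker can reduce its own depth-contiguous run of splats to a (color, transmittance) pair, and the pairs can then be merged in depth order. The sketch below checks this numerically on scalar colors; the partitioning into depth-contiguous chunks and all names are assumptions for illustration, not Splaxel's actual implementation.

```python
def composite(splats):
    """Front-to-back alpha blending of (color, alpha) splats on one ray.

    Returns (accumulated color, remaining transmittance).
    """
    color, transmittance = 0.0, 1.0
    for c, a in splats:
        color += transmittance * a * c
        transmittance *= 1.0 - a
    return color, transmittance

def merge(partials):
    """Combine per-worker (color, transmittance) pairs, nearest first."""
    color, transmittance = 0.0, 1.0
    for c, t in partials:
        # Light from a farther chunk is attenuated by everything nearer.
        color += transmittance * c
        transmittance *= t
    return color, transmittance

# One pixel's splats sorted near to far: (color, alpha), made-up values
ray = [(0.9, 0.3), (0.2, 0.5), (0.7, 0.4), (0.1, 0.8)]
full = composite(ray)
# Two hypothetical workers own the near and far halves of the ray
distributed = merge([composite(ray[:2]), composite(ray[2:])])
print(full, distributed)
```

Each worker sends two numbers per pixel (per color channel) regardless of how many Gaussians it holds, which is consistent with the abstract's point that communication cost stays stable as the scene grows.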

2606.18826 2026-06-18 physics.optics cs.CV eess.IV cross-listed

EDoF-NeRF: extended depth-of-field neural radiance fields using a coded aperture camera

Yoshiyuki Shirasaki, Ryoichi Horisaki

Affiliations: Department of Information Physics and Computing, Graduate School of Information Science and Technology, The University of Tokyo

AI summary: Proposes extending the depth of field with a coded-aperture camera to build high-fidelity neural radiance fields that render novel views from images captured at different viewpoints, and shows that it outperforms conventional-aperture cameras.

Abstract

We propose a method for extending the depth-of-field (DoF) to construct high-fidelity neural radiance fields (NeRF) -- an emerging technique for rendering photorealistic novel views from a dataset of images captured at different viewpoints, based on implicit neural representations. The trade-off between DoF and light quantity is inherent not only in conventional cameras but also in NeRF, since the datasets used by NeRF are captured by these cameras. To address this issue, we introduce a coded aperture placed at the camera pupil, preserving spatial frequency components under defocused conditions. We develop a camera model incorporating coded apertures into NeRF, allowing direct input of coded images and enabling the generation of novel views with an extended DoF. We validate the proposed method, termed extended DoF-NeRF (EDoF-NeRF), through simulations and experiments, demonstrating its superior performance compared to conventional aperture cameras.

6. Medical Imaging and Biological Vision (2 papers)

2606.18523 2026-06-18 q-bio.QM cs.CV cross-listed

DART: A design-aware microfluidic chip paradigm for real-time live-cell image analysis

Johannes Seiffarth, Matthias Pesch, Lukas Scholtes, Dietrich Kohlheyer, Hanno Scharr, Katharina Nöh

Affiliations: Institute for Bio- and Geosciences, IBG-1: Biotechnology; Computational Systems Biotechnology (AVT.CSB), RWTH Aachen University; Institute for Advanced Simulation, IAS-8: Data Analytics and Machine Learning

AI summary: Proposes the DART paradigm, which aligns the CAD blueprint with the physical chip via embedded markers and deep-learning-based detection, enabling fast localization of all regions of interest on high-throughput microfluidic chips and fully automated image processing with real-time analysis.

Abstract

High-throughput microfluidic live-cell imaging generates rich single-cell data. Yet semi-automated procedures for locating regions of interest (RoIs), each containing one cell population, and removing surrounding microfluidic structures from recorded images, scale with the number of RoIs. This prevents real-time image analysis and delays time-to-insight by hours to days. We introduce the Design-Aware and Real-Time capable (DART) paradigm for microfluidic cultivation chips, which aligns the CAD blueprint with the physical chip and thereby enables throughput-independent localization of all RoIs and fully automated image processing across diverse RoI geometries and chip layouts. DART establishes this alignment through embedded fiducial markers and deep-learning-based marker detection. We validate DART using the Swiss Army Knife chip, which combines eight structurally distinct RoI designs across 1164 RoI locations. DART localizes all RoIs in five minutes, removes microfluidic structures from raw microscopy images in 40 ms, and performs fully automated image analysis, including cell segmentation, in under 1.1 s per image. Together, these capabilities establish DART as an end-to-end hardware-software paradigm with real-time-capable analysis that paves the way toward closed-loop and outcome-driven smart microscopy.

2606.18970 2026-06-18 cs.LG cs.AI cs.CV cross-listed

A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI

Syed Mujtaba Haider, Silvia Figini

Affiliations: Department of Mathematics; Department of Political and Social Sciences

AI summary: Uses a controlled benchmark to compare quantum and classical generators for brain-MRI data augmentation; neither significantly outperforms real-data-only training, and the quantum generator offers no additional advantage.

Comments: This work has been submitted to the IEEE for possible publication.

Abstract

Medical image classification is often constrained by limited labeled data, motivating generative augmentation; recently, quantum generative models have been proposed for this purpose, frequently reporting accuracy gains. However, such claims are typically based on single training runs, do not match the parameter budgets of the quantum and classical generators, and do not characterize the data regime in which any benefit appears. We present a controlled benchmark that isolates the contribution of a quantum generator to brain-MRI augmentation. Images are encoded into a KL-regularized latent space in which a conditional Wasserstein GAN with gradient penalty is trained using either a variational quantum generator or a classical generator of near-identical parameter count (1648 vs. 1632). Synthetic samples are decoded and used to augment a pretrained classifier across labeled data fractions from 5% to 100%, evaluated over eight random seeds with paired significance testing (with multiple-comparison correction) and with intraset diversity and latent-distribution analyses. Across all fractions, no augmentation variant significantly outperforms real-data-only training, and the quantum and classical generators are statistically indistinguishable. Any low-data benefit behaves as regularization rather than faithful data expansion: synthetic samples are off distribution and severely mode collapsed precisely where data is scarce, and the quantum generator is no more diverse than its classical counterpart. We release the protocol as a testbed for rigorous evaluation of quantum generative augmentation in medical imaging.

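7. Robustness, Safety, Privacy, and Trustworthy Vision (1 paper)

2606.18839 2026-06-18 cs.LG cs.CV cross-listed

Semantic Robustness Certification for Vision-Language Models

Peiyu Yang, Paul Montague, Feng Liu, Andrew C. Cullen, Amardeep Kaur, Christopher Leckie, Sarah M. Erfani

Affiliations: School of Computing & Information Systems, University of Melbourne, Australia

AI summary: Proposes the first framework that certifies the robustness of vision-language models to semantic-level variation (e.g., shape, size, style) without additional data, using text prompts as semantic proxies and characterizing the decision boundary to guarantee that the predicted class is unchanged under semantic transformations.

Comments: Accepted to ICML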

Abstract

Vision-language models (VLMs) are now widely used in downstream tasks. However, real-world applications often expose VLMs to distribution shifts induced by semantic variation (e.g., shape, size, and style). Robustness certification determines if a model's prediction changes when transformations are applied to its input. While most certification frameworks study geometric or pixel-level transformations over inputs, this work proposes a novel framework that enables certifying VLM robustness under semantic-level transformations. Leveraging the open-vocabulary capability of VLMs, we use text prompts as semantic proxies to construct transformations parameterized by an extent that controls the degree of semantic variation. By characterizing the VLM decision boundary in closed form, our framework quantitatively certifies extent intervals for which the predicted class remains unchanged under the semantic transformation. Our framework is the first to certify VLM robustness under semantic-level variations without requiring additional data for each variation, making it practical to apply. Experiments on both synthetic and real-world data show that our framework enables certifying robustness under diverse semantic variations across scenarios.

8. Datasets, Benchmarks, Evaluation, and Training Methods (1 paper)

2606.18676 2026-06-18 cs.LG cs.CV cross-listed

InTrain: Intrinsic Trainability for Zero-Cost Neural Architecture Search

Qinqin Zhou, Fuhai Chen, Jipeng Wu, Zhiwei Chen, Zhikai Hu, Weiwei Cai

Affiliations: School of Computer and Data Science, Fuzhou University; School of Computer and Data Science, Minjiang University; School of Artificial Intelligence, Nanchang University; Department of Computer Science, Hong Kong Baptist University; School of Interdisciplinary Medicine and Engineering, Harbin Medical University

AI summary: Proposes InTrain, a unified theoretical proxy that formalizes architectural trainability through two synergistic components, geometric capacity and optimization resilience, achieving ranking correlations on NAS benchmarks comparable to ensemble-based methods.

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026

Abstract

Training-free neural architecture search promises efficient discovery of high-performance networks without costly training. However, existing zero-cost proxies rely on fragmented heuristics that fail to capture the fundamental question: what makes an architecture trainable? This paper introduces Intrinsic Trainability (InTrain), a unified theoretical proxy that formalizes trainability as an architectural invariant emerging from two synergistic components: geometric capacity and optimization resilience. We operationalize intrinsic trainability through analysis of neural information processing. Geometric capacity is quantified via the participation ratio of activation covariance eigenspectrum, capturing the effective dimensionality of representation manifolds. Optimization resilience is measured through cumulative gradient health, assessing the robustness of backpropagation across network depth. InTrain synthesizes these dimensions through a scale-invariant multiplicative coupling, which we hypothesize is essential for capturing their synergistic, non-additive relationship. Extensive experiments on standard NAS benchmarks and search spaces demonstrate that InTrain achieves ranking correlations on par with state-of-the-art ensemble-based proxies and outperforms other single-metric methods.
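The participation ratio mentioned in this abstract has a standard closed form, PR = (Σλ_i)² / Σλ_i², over the eigenvalues λ_i of the activation covariance. Because Σλ_i = tr(C) and Σλ_i² = tr(C²) for a symmetric matrix C, it can be computed without an eigendecomposition. The sketch below does this in plain Python; how InTrain collects activations and combines this term with gradient health is not specified in the abstract, so the input layout here is an assumption.

```python
def participation_ratio(activations):
    """Effective dimensionality of a batch of activation vectors.

    activations: list of samples, each a list of feature values.
    Assumes the activations are not all identical (nonzero variance).
    """
    n, d = len(activations), len(activations[0])
    means = [sum(row[k] for row in activations) / n for k in range(d)]
    centered = [[row[k] - means[k] for k in range(d)] for row in activations]
    # Covariance matrix C (d x d); its overall scale cancels in the ratio
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
           for i in range(d)]
    trace = sum(cov[i][i] for i in range(d))
    # tr(C^2) equals the squared Frobenius norm when C is symmetric
    trace_sq = sum(cov[i][j] ** 2 for i in range(d) for j in range(d))
    return trace ** 2 / trace_sq

# Toy inputs: variance spread over two axes vs. confined to one direction
isotropic = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
collapsed = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
print(participation_ratio(isotropic), participation_ratio(collapsed))
```

The value ranges from 1 (all variance along one direction) to the feature dimension d (variance spread evenly), which is why it serves as a measure of effective dimensionality.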

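9. Other / General Vision (1 paper)

2606.19151 2026-06-18 cs.CY cs.CV cross-listed

The Market in the Model: Latent Diffusion as Neural Economy

Eryk Salvaggio

Affiliations: Cambridge Digital Humanities; University of Cambridge; Machine Visual Culture Research Group; Max Planck Institute

AI summary: Starting from the engineering problems of computer vision, analyzes the mechanisms of the latent diffusion model, argues that it operates as a neural economy that abstracts social communication into commensurable vectors, and warns that critique focused only on copyright and commodity defenses may reinforce the fetishism the model produces.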

Abstract

Valuable critique of generative image models within visual culture and the humanities has emphasized the role of datasets in shaping the images they produce. Yet, close studies of the ideological positions embedded into the mechanism of the models have been neglected, leaving them imagined as "black boxes." In a bid to expand, rather than replace, dataset critique, this paper examines the mechanisms of the latent diffusion model in terms of the problems they were brought in to solve on behalf of computer vision engineers, and the decisions each component was tasked with automating. I interpret that ensemble through the histories of its parts and the theory of vision the system inscribes into every generated image. Drawing on Impett and Offert's notion of neural exchange value, I offer this analysis to argue that the model operates as a neural economy: a contained symbolic system that abstracts social communication into commensurable vectors as it transfers the social sphere into parcels for sale. Tracing the training and generation pipelines component by component reveals what each operation displaces, and how it further entrenches the logics of platform and attention economies over social communication. The paper warns that any critique fixated exclusively on copyright and commodity defenses risks reaffirming the very fetishism the model produces, and argues instead for centering social exchange.