Multimodal Information Fusion - arXivDaily Topic

2606.19927 2026-06-19 cs.CV New submission 90%

CARE: Competence-Aware Reward Shaping for Adaptive Reasoning Length in Video-MLLMs

Chengwen Liu, Hao Peng, Jisheng Dang, Hong Peng, Bin Hu, Tat-Seng Chua

Affiliations: School of Information Science and Engineering, Lanzhou University; School of Medical Technology, Beijing Institute of Technology; School of Computing, National University of Singapore

Topic match (audio-visual / vision-language fusion): video multimodal reasoning, involving the fusion of vision and language

AI summary: Proposes CARE, a framework that adaptively optimizes reasoning length through competence-aware reward shaping. It estimates competence with an exponential moving average, shifts the reward preference in stages, and combines batch-level normalization with a posterior amplifier to improve efficiency and accuracy.

Abstract

In multimodal video reasoning, reinforcement learning-based methods typically rely on simplistic and inflexible reasoning-length control strategies that fail to adapt to the model's evolving competence. This mismatch may suppress necessary exploration at early stages, while encouraging redundant reasoning and inefficient decoding once the model becomes more competent. In this paper, we propose CARE, a competence-aware reward shaping framework for adaptive reasoning length optimization in multimodal reasoning. Specifically, CARE maintains a smoothed competence estimate via an exponential moving average of pass rates, and uses it to route training into progressive stages that shift the reward preference from exploration-oriented long-form reasoning to efficiency-oriented concise reasoning. To avoid conflating verbosity with intrinsic task complexity, CARE further normalizes reasoning effort with batch-level statistics, and introduces a posterior amplifier to strengthen reward signals for unexpectedly strong performance on historically difficult samples. The proposed mechanism is seamlessly integrated into the GRPO training pipeline and incurs no additional inference-time overhead. Extensive experiments on multiple video reasoning and general video understanding benchmarks demonstrate that CARE consistently improves reasoning accuracy, stabilizes reinforcement learning, and significantly enhances token efficiency. Moreover, CARE exhibits a characteristic inverted-U trajectory of reasoning length during training, and yields shorter yet more informative reasoning traces at convergence, indicating effective adaptive allocation of reasoning budget. We provide the source code for our proposed CARE framework and experiments at https://github.com/1Pansy/Video-CARE.
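
The competence estimate and stage routing described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the decay factor, the stage thresholds, the stage names, and the form of the length bonus are all assumptions chosen for illustration, and the posterior amplifier is not modeled.

```python
from dataclasses import dataclass


@dataclass
class CompetenceTracker:
    """Smoothed competence estimate: an EMA over per-batch pass rates."""
    decay: float = 0.9          # hypothetical smoothing factor
    value: float = 0.0          # current estimate in [0, 1]
    initialized: bool = False

    def update(self, pass_rate: float) -> float:
        if not self.initialized:
            # Seed the EMA with the first observation.
            self.value, self.initialized = pass_rate, True
        else:
            self.value = self.decay * self.value + (1.0 - self.decay) * pass_rate
        return self.value


def stage(competence: float, low: float = 0.3, high: float = 0.7) -> str:
    """Route training to a stage; both thresholds are illustrative."""
    if competence < low:
        return "explore"    # prefer longer reasoning
    if competence < high:
        return "neutral"    # no length preference
    return "compress"       # prefer concise reasoning


def length_bonus(lengths, correct, competence, scale=0.1):
    """Length-shaping term per rollout, to be added to the task reward.

    Lengths are z-scored with batch statistics, so the bonus reflects
    effort relative to the batch rather than absolute verbosity.
    Only correct rollouts receive a length bonus in this sketch.
    """
    if not lengths:
        return []
    n = len(lengths)
    mean = sum(lengths) / n
    std = (sum((x - mean) ** 2 for x in lengths) / n) ** 0.5 or 1.0
    sign = {"explore": 1.0, "neutral": 0.0, "compress": -1.0}[stage(competence)]
    return [scale * sign * (x - mean) / std if ok else 0.0
            for x, ok in zip(lengths, correct)]
```

With a low competence estimate the longer correct rollouts in a batch receive a positive bonus; once the estimate crosses the upper threshold the sign flips and shorter correct rollouts are favored.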

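2606.19882 2026-06-19 cs.CV cs.LG New submission 90%

Multimodal Concept Bottleneck Models

Tongqing Shi, Ge Yan, Tuomas Oikarinen, Tsui-Wei Weng

Affiliations: UC San Diego

Topic match (audio-visual / vision-language fusion): multimodal concept bottleneck model that aligns image and text embeddings

AI summary: Proposes the Multimodal Concept Bottleneck Model (MM-CBM), which uses dual concept bottleneck layers to align image and text embeddings, enabling interpretable zero-shot classification and image retrieval, with an average accuracy improvement of up to 51.26% across four benchmarks.

Comments: Presented at the NeurIPS 2025 Mechanistic Interpretability Workshop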

Abstract

Concept Bottleneck Models (CBMs) enhance the interpretability of deep learning networks by aligning the features extracted from images with natural concepts. However, existing CBMs are limited in their ability to generalize beyond a fixed set of predefined classes and are prone to non-concept information leakage, where predictive signals outside the intended concepts are inadvertently exploited. In this paper, we propose the Multimodal Concept Bottleneck Model (MM-CBM) to address these issues and extend CBMs to CLIP. MM-CBM utilizes dual Concept Bottleneck Layers (CBLs) to align both the image and text embeddings into interpretable features. This allows us to perform new vision tasks like zero-shot classification or image retrieval in an interpretable way. Compared to existing methods, MM-CBM achieves up to 51.26% accuracy improvement on average across four standard benchmarks. Our method maintains high accuracy, staying within ~5% of black-box performance while offering greater interpretability.

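2603.10791 2026-06-19 eess.IV Updated version 90%

Semantic Satellite Communications for Synchronized Audiovisual Reconstruction

Fangyu Liu, Peiwen Jiang, Wenjin Wang, Xiao Li, Shi Jin

Topic match (audio-visual / vision-language fusion): proposes an audiovisual semantic transmission system that performs cross-modal generation and synchronized reconstruction

AI summary: Proposes an adaptive multimodal semantic transmission system that uses a dual-stream generative architecture and a dynamic keyframe update mechanism to achieve high-quality synchronized audiovisual reconstruction in bandwidth-limited satellite scenarios, significantly reducing bandwidth consumption and improving robustness.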

Abstract

Satellite communications face severe bottlenecks in supporting high-fidelity synchronized audiovisual services, as conventional schemes struggle with cross-modal coherence under fluctuating channel conditions, limited bandwidth, and long propagation delays. To address these limitations, this paper proposes an adaptive multimodal semantic transmission system tailored for satellite scenarios, aiming for high-quality synchronized audiovisual reconstruction under bandwidth constraints. Unlike static schemes with fixed modal priorities, our framework features a dual-stream generative architecture that flexibly switches between video-driven audio generation and audio-driven video generation. This allows the system to dynamically decouple semantics, transmitting only the most important modality while employing cross-modal generation to recover the other. To balance reconstruction quality and transmission overhead, a dynamic keyframe update mechanism adaptively maintains the shared knowledge base according to wireless scenarios and user requirements. Furthermore, a large language model based decision module is introduced to enhance system adaptability. By integrating satellite-specific knowledge, this module jointly considers task requirements and channel factors such as weather-induced fading to proactively adjust transmission paths and generation workflows. Simulation results demonstrate that the proposed system significantly reduces bandwidth consumption while achieving high-fidelity audiovisual synchronization, improving transmission efficiency and robustness in challenging satellite scenarios.

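2606.20077 2026-06-19 cs.CV cs.AI New submission 85%

The Hidden Evolution of Disguised Visual Context inside the VLM

Wish Suharitdamrong, Tony Alex, Muhammad Awais, Sara Atito

Affiliations: Surrey Institute for People-Centred AI, University of Surrey; Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey

Topic match (audio-visual / vision-language fusion): fusion of visual tokens with the language space in vision-language models

AI summary: Studies how visual tokens in vision-language models are transformed into meaningful representations under different integration architectures (in-context injection versus layer-wise injection), revealing their internal evolution and its effect on performance.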

Abstract

Visual tokens enter Large Language Models (LLMs) as raw, foreign signals. How they are transformed into meaningful representations and interact with the language space depends entirely on the integration architecture: whether visual tokens are treated as in-context prompts within the input sequence or injected directly into the LLM's intermediate layers. How these architectural choices affect visual information and its internal transformation to integrate with the LLM remains underexplored, and a controlled comparison is lacking. We provide a fair comparison by evaluating in-context and layer-wise injection VLM integration paradigms under identical training conditions across single-image, multi-image, and video benchmarks. In doing so, we uncover a hidden evolution in which visual tokens enter the LLM as disguised visual context (raw representations lacking linguistic structure) but are progressively reshaped depending on the integration paradigm, with each paradigm capturing fundamentally different frequency characteristics of the visual signal. We show that this evolution inside the LLM determines what visual features the VLM can utilize effectively, how visual representations align with the language space, and ultimately how each paradigm performs across different tasks. We further demonstrate that attention allocation alone is insufficient, and that performance is driven by the quality of visual representations at each layer.

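2606.19944 2026-06-19 cs.CV New submission 85%

Timage: A Generative Text-in-Image Paradigm for Fine-Tuning Vision-Language Models

Yifeng Wu, Huimin Huang, Ruiluo Wu, Chunyi Lin, Guanhua Chen, Xian Wu, Wang Song, Ruize Han

Affiliations: Fudan University; Shenzhen University of Advanced Technology; Tencent Jarvis Lab; Southern University of Science and Technology

Topic match (audio-visual / vision-language fusion): embedding text in images to strengthen spatial reasoning in vision-language models

AI summary: Proposes the Timage paradigm, which uses a Constrained Schrödinger Bridge to embed the query text into the image as a typeset overlay, guiding the model's attention with explicit spatial anchors and improving fine-grained spatial reasoning without eroding the backbone's capabilities.

Comments: ECCV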

Abstract

Multimodal Large Language Models (MLLMs) often lose track of the right image regions during fine-grained spatial reasoning, because a textual query rarely carries any explicit geometric anchor into the pixel domain. Prevailing remedies either rewire the model's weights or pad the prompt with verbose instructions, yet neither reliably pins the language to the correct visual coordinates without eroding the backbone's general competence. We introduce Timage, a paradigm that recasts multimodal understanding as an alignment problem solved at the input: the query is drawn, as a typeset overlay, onto the image itself. The placement and appearance of this overlay are produced by a Constrained Schrödinger Bridge (cSB), an entropic optimal-transport sampler that factorizes layout synthesis into two coupled stochastic stages. The first stage, Region Search, transports noise toward query-aligned image zones while obeying a hard occlusion barrier that protects salient foreground content; the second stage, Appearance Shaping, sizes the glyphs through an "ink-budget" regularizer so that the rendered text stays legible and visually balanced. The resulting overlay behaves as an explicit attention beacon that channels the model's focus along spatial semantics. On the VMCBench suite, Timage paired with a modest 7B backbone clearly overtakes far larger proprietary systems as well as parameter-tuned baselines. The study positions deliberate input reconstruction as a powerful, architecture-neutral lever for strengthening multimodal reasoning.

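2606.19915 2026-06-19 cs.CV New submission 85%

SpatialSV: Internalizing Interpretable 3D Spatial Awareness in MLLMs via Task-Oriented Visual Supervision

Jiayu Tang, Yuchen Zhou, Chao Gou

Affiliations: School of Intelligent Systems Engineering, Sun Yat-sen University

Topic match (audio-visual / vision-language fusion): lifting 2D visual features into 3D representations; multimodal fusion

AI summary: Proposes the SpatialSV framework, which uses task-oriented visual supervision to lift an MLLM's 2D features into explicit 3D representations (depth maps, camera poses, point clouds), internalizing interpretable 3D spatial awareness without external tools and showing strong generalization in semi-supervised settings.

Comments: Accepted by IJCAI 2026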

Abstract

Unlocking the spatial intelligence of multimodal large language models (MLLMs) is crucial for understanding and interacting with the 3D world. Prevailing approaches typically inject spatial priors via external tools, which impose significant inference overhead, or rely on latent feature distillation, which remains uninterpretable and lacks fine-grained geometric constraints. To address these issues, we propose SpatialSV, a framework designed to internalize robust 3D spatial awareness within MLLMs while simultaneously offering inherent interpretability. Deviating from passive feature imitation, SpatialSV employs task-oriented visual supervision, compelling the model to actively lift its 2D visual features into explicit 3D representations, including depth maps, camera poses, and point clouds. Crucially, this 2D-to-3D lifting process provides a transparent window into the model's representations: the resulting 3D reconstructions serve as an intuitive proxy for visualizing and diagnosing the quality of the model's intrinsic spatial knowledge. Extensive experiments across multiple models and benchmarks demonstrate the effectiveness of SpatialSV in enhancing and interpreting MLLMs' spatial intelligence. Furthermore, the framework exhibits strong generalization in semi-supervised settings, validating its potential to leverage unlabeled visual data for scalable, interpretable spatial representation learning.
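
To make the relationship between the 3D representations named above concrete, the sketch below back-projects a depth map into a camera-frame point cloud using the standard pinhole model. It only illustrates how a depth map and camera intrinsics determine a point cloud; it says nothing about how SpatialSV itself predicts or supervises these outputs, and the function name and argument layout are assumptions.

```python
def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map to camera-frame 3D points (pinhole model).

    depth: list of rows of z values; non-positive values are treated as invalid.
    fx, fy: focal lengths in pixels; cx, cy: principal point in pixels.
    """
    points = []
    for v, row in enumerate(depth):          # v: pixel row index
        for u, z in enumerate(row):          # u: pixel column index
            if z <= 0:
                continue                     # skip missing depth
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points
```

Applying a camera pose (a rotation and a translation) to these points would place them in a shared world frame, which is what links per-view depth maps to a single point cloud.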

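2606.19776 2026-06-19 cs.CV New submission 85%

Occ-VLM: Occupancy Grounded Vision Language Model for Indoor Scene Understanding

Jianing Li, Zhou Fang, Yijiang Liu, Li Du

Affiliations: School of Electronic Science and Engineering, Nanjing University

Topic match (audio-visual / vision-language fusion): occupancy-grounded vision-language model that fuses 3D and 2D semantics

AI summary: Proposes Occ-VLM, which uses only posed RGB images and a single 2D vision encoder and reconstructs 3D occupancy as a geometric prior for unified 3D scene understanding, achieving state-of-the-art multi-view occupancy prediction and performance on par with 3D-input VLMs on 3D VQA and 3D dense captioning.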

Abstract

Recently, vision-language models (VLMs) have made significant progress in 3D scene understanding, driving advances in applications such as embodied intelligence and robotic vision. However, existing approaches typically either rely directly on explicit 3D inputs (e.g., point clouds or RGB-D sequences), or introduce an additional 3D geometry encoder to derive 3D-aware visual tokens from 2D images. Such designs structurally decouple 3D geometric perception from the rich 2D semantics learned via vision-language pre-training, hindering the development of a unified 3D vision-language representation. In this work, we propose Occ-VLM, a novel framework for 3D scene understanding that operates purely on posed RGB images and employs a single 2D vision encoder. Specifically, Occ-VLM reconstructs 3D scene occupancy as an auxiliary geometric prior, which is utilized to spatially associate foreground 2D tokens with 3D space. These tokens are then decoded by a Large Language Model (LLM) for unified scene understanding. Extensive experiments demonstrate that Occ-VLM achieves both accurate geometric perception and robust vision-language reasoning: it attains state-of-the-art performance on multi-view occupancy prediction, while performing on par with 3D-input VLMs on 3D Visual Question Answering (VQA) and 3D dense captioning benchmarks.

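2508.15228 2026-06-19 cs.CV Updated version 85%

Collaborative Multi-Modal Coding for High-Quality 3D Generation

Ziang Cao, Zhaoxi Chen, Liang Pan, Ziwei Liu

Affiliations: S-Lab, Nanyang Technological University, Singapore; Shanghai Artificial Intelligence Laboratory

Topic match (audio-visual / vision-language fusion): collaborative multi-modal coding that fuses RGB, RGBD, and point cloud features

AI summary: Proposes TriMM, the first feed-forward 3D-native generative model to learn from basic multi-modalities, which fuses RGB, RGBD, and point cloud features through collaborative multi-modal coding and combines auxiliary 2D/3D supervision with a triplane latent diffusion model to generate high-quality 3D assets.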

Abstract

3D content inherently encompasses multi-modal characteristics and can be projected into different modalities (e.g., RGB images, RGBD, and point clouds). Each modality exhibits distinct advantages in 3D asset modeling: RGB images contain vivid 3D textures, whereas point clouds define fine-grained 3D geometries. However, most existing 3D-native generative architectures either operate predominantly within single-modality paradigms, thus overlooking the complementary benefits of multi-modality data, or restrict themselves to 3D structures, thereby limiting the scope of available training datasets. To holistically harness multi-modalities for 3D modeling, we present TriMM, the first feed-forward 3D-native generative model that learns from basic multi-modalities (e.g., RGB, RGBD, and point cloud). Specifically, 1) TriMM first introduces collaborative multi-modal coding, which integrates modality-specific features while preserving their unique representational strengths. 2) Furthermore, auxiliary 2D and 3D supervision is introduced to raise the robustness and performance of multi-modal coding. 3) Based on the embedded multi-modal code, TriMM employs a triplane latent diffusion model to generate 3D assets of superior quality, enhancing both the texture and the geometric detail. Extensive experiments on multiple well-known datasets demonstrate that TriMM, by effectively leveraging multi-modality, achieves competitive performance with models trained on large-scale datasets, despite utilizing a small amount of training data. Furthermore, we conduct additional experiments on recent RGB-D datasets, verifying the feasibility of incorporating other multi-modal datasets into 3D generation.

2508.04424 2026-06-19 cs.CV Updated version 85%

Composed Object Retrieval: Object-level Retrieval via Composed Expressions

Tong Wang, Guanyu Yang, Nian Liu, Zongyan Han, Jinxing Zhou, Salman Khan, Fahad Shahbaz Khan

Affiliations: Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, Ministry of Education, Jiangsu, China; Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE

Topic match (audio-visual / vision-language fusion): composed object retrieval combines vision and text, a form of vision-language fusion

AI summary: Proposes the Composed Object Retrieval (COR) task, which performs object-level retrieval by composing a reference object, its mask, and a retrieval text, and builds the COR125K benchmark and the CORE model, which significantly outperforms existing methods.

Abstract

Retrieving fine-grained visual content based on user intent remains a challenge in multimodal systems. Although current Composed Image Retrieval (CIR) methods combine reference images with retrieval texts, they are constrained to image-level matching and cannot localize specific objects. To this end, we propose Composed Object Retrieval (COR), a new object-level retrieval task that retrieves target object(s) from candidate objects in a target image and grounds the retrieved result with pixel-level masks. Given a reference object, its mask, a target image, and a retrieval text describing the desired modification, COR requires models to perform composed visual-textual reasoning rather than relying on explicit category names. This setting introduces several challenges, including fine-grained compositional matching, negative-object filtering under visually similar distractors, and flexible single- or multi-object retrieval. We construct COR125K, the first large-scale COR benchmark, containing 125,541 retrieval triplets across 408 categories with base/novel splits for evaluating category-level generalization. We also present CORE, a unified end-to-end model that integrates reference region encoding, adaptive vision-text interaction, and region-level contrastive learning to align composed representations with target objects while suppressing background and distractors. Extensive experiments demonstrate that CORE significantly outperforms existing CIR-based pipelines and strong baselines in both base and novel categories, establishing a simple and effective foundation for fine-grained object-level multimodal retrieval. Code will be released publicly at https://github.com/wangtong627/COR.

2606.20101 2026-06-19 cs.SD cs.AI cs.MM New submission 80%

Hybrid Diffusion Transformer for Instruction-Guided Audio Editing via Rectified Flow

Liting Gao, Yonggang Zhu, Yaru Chen, Dongyu Wang, Shubin Zhang, Zhenbo Li, Jean-Yves Guillemaut, Wenwu Wang

Affiliations: Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey; School of Artificial Intelligence, Beijing University of Posts and Telecommunications; Fisheries College, Ocean University of China; College of Information and Electrical Engineering, China Agricultural University

Topic match (audio-visual / vision-language fusion): instruction-guided audio editing, involving text-audio fusion

AI summary: Proposes a hybrid two-stage diffusion transformer architecture that uses a coarse-to-fine strategy to balance global semantic alignment with local detail editing, improving performance and efficiency on tasks with overlapping audio events and complex instructions.

Abstract

Audio editing aims to modify specific content in an existing audio clip according to a natural language instruction while preserving the remaining acoustic content. Despite the remarkable progress of diffusion models, existing training-based editing methods mainly rely on the local inductive biases and cross-attention interaction in convolutional U-Net backbones, which often hinder long-range semantic alignment and precise understanding and localization of instructions. In contrast, diffusion transformers provide stronger global modeling and multimodal fusion, but existing editing architectures usually adopt a simple stack of MMDiT and DiT blocks. Applying joint attention over concatenated audio and text tokens in all blocks results in quadratic complexity with respect to token length. To balance editing performance and efficiency, we propose a hybrid two-stage diffusion transformer architecture for instruction-guided audio editing based on rectified flow matching. It performs joint attention over audio and text tokens to establish coarse semantic alignment at the low-resolution stage, then switches to alternating joint-attention and cross-attention blocks to refine editing details at the high-resolution stage. This coarse-to-fine strategy enables efficient and accurate instruction-guided audio editing. Experiments show that the proposed framework achieves notable performance gains on challenging editing tasks involving overlapping audio events and complex instructions, while substantially improving editing efficiency with a compact model.
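
The quadratic cost of joint attention mentioned in the abstract can be made concrete by counting query-key pairs. The sketch below is back-of-the-envelope arithmetic only; the token counts in the usage note are hypothetical and are not taken from the paper.

```python
def joint_attention_pairs(n_audio: int, n_text: int) -> int:
    """Query-key pairs when audio and text tokens are concatenated
    and every token attends to every token."""
    n = n_audio + n_text
    return n * n


def cross_attention_pairs(n_audio: int, n_text: int) -> int:
    """Query-key pairs when audio tokens (queries) attend only to
    text tokens (keys)."""
    return n_audio * n_text
```

For example, with 1024 audio tokens and 77 text tokens, joint attention scores 1,212,201 pairs per block while cross-attention scores 78,848. If a low-resolution stage reduced the audio sequence to 256 tokens (an assumed 4x downsampling), joint attention there would score 110,889 pairs, which suggests why running joint attention at low resolution and mixing in cross-attention at high resolution can save compute.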

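2606.19985 2026-06-19 cs.CV New submission 80%

Vision-Reasoning-Guided Occlusion Removal from Light Fields

Mohamed Youssef, Oliver Bimber

Affiliations: Johannes Kepler University

Topic match (audio-visual / vision-language fusion): fuses light fields with vision-language models to remove occlusions and recover the scene

AI summary: Proposes a framework combining light field integration with vision-language models, recovering occluded scenes through multi-view fusion and semantic priors, and achieving state-of-the-art performance on synthetic and real-world data.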

Abstract

Occlusion-robust scene recovery remains a major challenge in computational imaging, particularly in natural environments where dense foreground vegetation severely limits visibility. We propose a vision-reasoning-guided light field occlusion removal framework that combines the visibility recovery capability of light field integration (LFI) with the semantic reasoning capacity of vision-language models (VLMs). Multi-view observations are first integrated via LFI to suppress foreground occlusions and produce an initial visibility-enhanced representation. A VLM is then incorporated as a conditional semantic prior to restore degraded structures and recover fine details, guided by the observed measurements. To improve recovery consistency and reduce hallucination artifacts, we introduce a multi-sample fusion strategy that aggregates multiple generated hypotheses into a unified estimate. Experimental results on synthetic and real-world datasets demonstrate state-of-the-art performance, achieving the highest average SSIM across four synthetic light field benchmark scenes (4-Syn) and strong generalization across structured and unstructured acquisition settings. These results highlight the effectiveness of combining physical imaging constraints with vision-language reasoning for robust perception under severe occlusion, with applicability to search-and-rescue and exploratory robotic navigation.
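
The abstract does not say how the multi-sample fusion aggregates its hypotheses. One simple possibility, shown here purely as an assumption, is a pixel-wise median, which suppresses details that appear in only a minority of generated samples. Images are represented as lists of rows of grayscale values to keep the sketch dependency-free.

```python
from statistics import median


def fuse_hypotheses(hypotheses):
    """Pixel-wise median over K same-sized grayscale images.

    A hypothetical aggregator: a detail hallucinated in fewer than
    half of the samples cannot survive the median at that pixel.
    """
    if not hypotheses:
        raise ValueError("need at least one hypothesis")
    height, width = len(hypotheses[0]), len(hypotheses[0][0])
    for h in hypotheses:
        if len(h) != height or any(len(row) != width for row in h):
            raise ValueError("all hypotheses must share the same shape")
    return [[median(h[y][x] for h in hypotheses) for x in range(width)]
            for y in range(height)]
```

A mean would blur outliers into the result rather than reject them, which is why the median is used in this illustration.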

2606.19950 2026-06-19 cs.CV cs.AI New submission 80%

Confidence Calibration for Multimodal LLMs: An Empirical Study through Medical VQA

Yuetian Du, Yucheng Wang, Ming Kong, Tian Liang, Qiang Long, Bingdi Chen, Qiang Zhu

Affiliations: College of Computer Science and Technology, Zhejiang University; School of Computer Science and Technology, Xidian University; Zhihui Medical Technology (Shanghai) Co., Ltd.

Topic match (audio-visual / vision-language fusion): confidence calibration of multimodal LLMs for medical visual question answering

AI summary: To address the mismatch between confidence and accuracy of multimodal large language models on medical tasks, proposes a method that combines multi-strategy fused querying with evaluation by an expert large language model, reducing expected calibration error by 40% on average across three medical VQA datasets and improving model reliability.

Comments: Accepted by MICCAI 2025
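
Since the summary reports results in terms of expected calibration error (ECE), a minimal sketch of the standard equal-width-bin ECE may help. The bin count of 10 is a common default and an assumption here; the paper's exact binning is not stated in this entry.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins over [0, 1].

    ECE = sum over bins of (bin size / N) * |bin accuracy - bin mean confidence|.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)   # conf == 1.0 goes in the last bin
        bins[idx].append((conf, 1.0 if ok else 0.0))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue                                # empty bins contribute nothing
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(a for _, a in items) / len(items)
        ece += len(items) / total * abs(accuracy - avg_conf)
    return ece
```

A model that answers four questions with confidence 0.9 and gets three right has a gap of 0.15 between confidence and accuracy in its single occupied bin, so its ECE is 0.15.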

2606.16615 2026-06-19 cs.CV New submission 80%

SUP-MCRL: Subject-aware Unified Pseudo-feature Coded Multimodal Contrastive Representation Learning for EEG Visual Decoding

Shengyu Gong, Weiming Zeng, Yueyang Li, Zijian Kang, Hongjie Yan, Wai Ting Siok, Nizhuan Wang

Affiliations: Lab of Digital Image and Intelligent Computation, Shanghai Maritime University; Department of Language Science and Technology, The Hong Kong Polytechnic University; Affiliated Lianyungang Hospital of Xuzhou Medical University

Topic match (audio-visual / vision-language fusion): multimodal contrastive learning that fuses EEG and visual features for visual decoding

AI summary: Proposes the SUP-MCRL framework, which uses a semantic-entity aware visual encoder, a unified EEG enhancer, and a prototype-based progressive augmenter to address semantic consistency and subject selectivity in multimodal contrastive learning, reaching 66.0%/91.9% intra-subject Top-1/Top-5 accuracy on the THINGS-EEG zero-shot task.

Abstract

Non-invasive brain-computer interfaces exhibit significant performance degradation when moving from controlled laboratory stimuli to real-world natural images. This degradation occurs because conventional multimodal contrastive representation learning models focus exclusively on optimizing geometric distance alignment, thereby failing to account for semantic consistency and inter-subject variability in neural representation and selective attention. As a result, these models are prone to producing spurious zero-shot matches. To address these limitations, we propose SUP-MCRL, a unified framework integrating three collaborative mechanisms: (1) a Semantic-entity Aware Visual Encoder (SAVE) that learns spatial attention to extract semantic content without relying on pre-trained saliency models; (2) a Unified EEG Enhancer (UEE) that employs multi-scale atrous convolutions and inter-band attention for adaptive cross-subject robustness; and (3) a Prototype-based Progressive Augmenter (PPA) that maintains an EMA-updated pseudo-feature pool to prevent representation collapse. Zero-shot experiments on THINGS-EEG achieve 66.0%/91.9% (Top-1/Top-5) intra-subject accuracy and 24.0%/52.9% leave-one-subject-out (LOSO) accuracy, significantly surpassing state-of-the-art methods and demonstrating that structured alignment supervision is key to overcoming the limitations of cross-modal decoding. Code is available at https://github.com/NZWANG/SUP-MCRL.
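
The Top-1/Top-5 figures above are retrieval-style metrics: an EEG embedding is compared against one embedding per candidate class, and a trial counts as a hit if the true class is among the k most similar. The sketch below shows that computation under the assumption that cosine similarity is the matching score; it is a generic illustration and not the evaluation code from the linked repository.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_accuracy(eeg_embeddings, labels, class_embeddings, k):
    """Fraction of trials whose true class index is among the k most
    similar class embeddings."""
    hits = 0
    for emb, label in zip(eeg_embeddings, labels):
        scores = [cosine(emb, c) for c in class_embeddings]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)
```

Top-5 accuracy is always at least Top-1 accuracy on the same trials, because the top-5 candidate set contains the top-1 candidate.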

2606.20083 2026-06-19 cs.CV New submission 75%

Holo-World: Unified Camera, Object and Weather Control for Video World Model

Xiangchen Yin, Wenzhang Sun, Jiahui Yuan, Zijie Liu, Yinda Chen, Wei Li, Dachun Kai, Chunfeng Wang, Xiaoyan Sun

Affiliations: University of Science and Technology of China; Li Auto; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center

Topic match (audio-visual / vision-language fusion): video world model with joint control of camera, objects, and weather

AI summary: Proposes Holo-World, a unified video world model that jointly controls camera, object motion, and weather from a single image, achieving world preservation and weather transfer through a scene adapter and decomposed CFG.

Comments: Project page: https://xiangchenyin.github.io/Holo-World Code: https://github.com/XiangchenYin/Holo-World

Abstract

Video world models are moving toward preserving an observed world under controllable camera and object motion while allowing its environmental state to change. Yet these controls remain isolated, and weather generation typically relies on a source video or reconstructed scene that already specifies future structure. We study a first-frame-anchored source-to-state setting, where the model starts from a single image and follows explicit camera and object controls and an optional weather instruction, then generates a video that either preserves the source world or transfers it to a target weather state. To address these challenges, we first build HoloStateData, a state video dataset that turns diverse videos into unified control samples for camera, object, and weather supervision. Second, we introduce Holo-World, a unified controllable video world model that jointly controls the scene from a single image. Its Unified Scene Adapter factorizes world preservation and weather transfer into distinct parameter subspaces, using rendered background, geometry buffers, and object controls to maintain controlled scene structure while modeling weather-dependent appearance and particle effects. Additionally, Scene-Weather Decomposed CFG guides scene and weather residuals separately, strengthening target weather effects without over-amplifying the full condition. Quantitative and qualitative experiments demonstrate that Holo-World maintains precise camera and object control with consistent scene structure while transferring scenes into diverse target weather states, outperforming video-to-video weather editing baselines on weather-state generation. Our project page is available at https://xiangchenyin.github.io/Holo-World/.
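
The abstract describes guiding scene and weather residuals separately but does not give the formula. A common way to decompose classifier-free guidance over two conditions, shown here as an assumption rather than the paper's definition, is to apply one weight to the scene residual and another to the additional residual contributed by the weather condition. All names are hypothetical, and predictions are flat lists of floats for simplicity.

```python
def decomposed_cfg(eps_uncond, eps_scene, eps_full, w_scene, w_weather):
    """Two-condition guidance with separate weights.

    eps_uncond: prediction with no condition
    eps_scene:  prediction with scene controls only
    eps_full:   prediction with scene controls and weather instruction
    Scene residual is (eps_scene - eps_uncond); weather residual is
    (eps_full - eps_scene).
    """
    if not (len(eps_uncond) == len(eps_scene) == len(eps_full)):
        raise ValueError("all predictions must have the same length")
    return [u + w_scene * (s - u) + w_weather * (f - s)
            for u, s, f in zip(eps_uncond, eps_scene, eps_full)]
```

With both weights set to the same value w, the expression collapses to ordinary classifier-free guidance on the full condition, u + w * (f - u). Raising only the weather weight strengthens the weather effect without also scaling up the scene residual, which matches the abstract's stated goal of not over-amplifying the full condition.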

2509.10416 2026-06-19 cs.RO Updated version 75%

TASC: Task-Aware Shared Control for Relational Telemanipulation

Ze Fu, Pinhao Song, Yutong Hu, Renaud Detry

Affiliations: KU Leuven, Dept. Mechanical Engineering, Research unit Robotics, Automation and Mechatronics; KU Leuven, Dept. Electrical Engineering, Research unit Processing Speech and Images

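Topic match (audio-visual / vision-language fusion): uses a vision-language model to predict the spatial constraints that guide assistance, which falls under vision-language fusion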

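AI summary: Proposes the TASC framework, which builds an open-vocabulary interaction graph from visual input to infer task-level user intent and provides shared-control assistance based on spatial constraints, improving the efficiency and generalization of relational telemanipulation.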

Comments: Accepted to IROS 2026

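AI-generated Chinese abstract (English translation)

We present TASC, a task-aware shared control framework for relational telemanipulation that infers task-level user intent from motion-only input and provides assistance. To support prehensile relational tasks without predefined templates, TASC builds an open-vocabulary interaction graph from visual input to represent functional object relationships and infers user intent from it. A shared control policy then provides assistance during grasping and object interaction, guided by spatial constraints predicted by a vision-language model. Our method addresses two key challenges of relational telemanipulation under shared control: (1) inferring task-level intent from low-level motion commands, and (2) providing assistance that generalizes across different objects and tasks. Experiments in simulation and in the real world show that, compared with prior methods, TASC improves task efficiency and reduces user input effort while achieving zero-shot generalization across a variety of relational telemanipulation tasks. The code supporting our experiments is publicly available at https://github.com/fitz0401/tasc.

English abstract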

We present TASC, a Task-Aware Shared Control framework for relational telemanipulation that infers task-level user intent and provides assistance from motion-only input. To support prehensile relational tasks without predefined templates, TASC constructs an open-vocabulary interaction graph from visual input to represent functional object relationships, and infers user intent accordingly. A shared control policy then provides assistance during both grasping and object interaction, guided by spatial constraints predicted by a vision-language model. Our method addresses two key challenges in relational telemanipulation under shared control: (1) task-level intent inference from low-level motion commands, and (2) generalizable assistance across diverse objects and tasks. Experiments in both simulation and the real world demonstrate that TASC improves task efficiency and reduces user input effort compared to prior methods, while enabling zero-shot generalization across diverse relational telemanipulation tasks. The code that supports our experiments is publicly available at https://github.com/fitz0401/tasc.

2606.20094 2026-06-19 cs.CV cs.AI cs.GR cs.LG cs.MM New submission 70%

MakeupMirror: Improving Facial Attribute Preservation in Diffusion Models for Makeup Transfer

Nefeli Andreou, Angel Martínez-González, Sabine Sternig, Matthieu Guillaumin, Epameinondas Antonakos, Michael Opitz

Affiliations: Amazon

Topic match (audio-visual / vision-language fusion): makeup transfer, involving fusion of image and text conditioning

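AI summary: Proposes MakeupMirror, a diffusion model that combines ControlNet geometry conditioning, region-specific transfer control, skin-tone modulation, and a Levenberg-Marquardt Langevin sampler to achieve high-quality makeup transfer while preserving facial features and skin tone. Relative to Stable-Makeup, it improves facial recognition similarity by 60% and reduces skin tone difference by 50% (both relative changes).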

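AI-generated Chinese abstract (English translation)

Makeup transfer models enable fun augmented reality (AR) experiences and virtual try-on (VTO) for online makeup shopping. Although recent state-of-the-art diffusion-based solutions such as Stable-Makeup have markedly improved the accuracy and realism of makeup transfer, they remain limited in preserving identity and skin color, which makes production-level VTO for makeup shopping impractical. In this work, we propose MakeupMirror, a diffusion-based makeup transfer method that makes significant progress in preserving facial features and skin tone. We introduce several technical innovations on top of Stable-Makeup: (1) integrating facial geometry conditioning with ControlNets to maintain facial fidelity; (2) region-specific makeup transfer control for precise makeup application across facial regions such as skin, eyes, and lips; (3) skin tone-based makeup transfer modulation that prevents skin tone changes in cross-subject transfer scenarios; and (4) integrating a Levenberg-Marquardt Langevin sampler to accelerate inference while maintaining generation quality. Our experiments on the CPM-Real, Makeup Wild, and MakeupSelfies datasets (the last newly collected for this paper and more diverse) show that, compared with Stable-Makeup, MakeupMirror improves relative facial recognition similarity by 60% and reduces relative skin tone difference by 50%, with a latency of 0.7 seconds, while reaching a 94% expert acceptance rate on core facial identity preservation criteria.

English abstract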

Makeup transfer models enable fun augmented reality (AR) experiences as well as virtual try-on (VTO) for online makeup shopping. While recent state-of-the-art diffusion-based solutions such as Stable-Makeup dramatically improve the accuracy and realism of makeup transfer, they still face limitations in identity and skin color preservation, making production-level VTO for makeup shopping unrealistic. In this work, we propose MakeupMirror, a diffusion-based approach to makeup transfer that makes significant progress towards preserving facial features and skin tone. We introduce several technical innovations over Stable-Makeup: (1) integration of facial geometry conditioning with ControlNets to maintain facial fidelity; (2) region-specific makeup transfer control to enable precise makeup application across facial regions such as skin, eyes, and lips; (3) skin tone-based makeup transfer modulation that prevents skin tone alteration in cross-subject transfer scenarios; and (4) integration of a Levenberg-Marquardt Langevin sampler to speed up inference while maintaining generation quality. Our experiments on the CPM-Real, Makeup Wild, and (herein newly collected, more diverse) MakeupSelfies datasets show that, relative to Stable-Makeup, MakeupMirror improves facial recognition similarity by 60% and reduces skin tone difference by 50% (both relative changes), with a latency of 0.7s, while achieving an expert acceptance rate of 94% across core facial identity preservation criteria.

2606.05833 2026-06-19 cs.CV cs.AI Updated version 70%

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

Haibo Wang, Lifu Huang

Affiliations: University of California, Davis

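Topic match (audio-visual / vision-language fusion): learns 3D geometric representations from video to strengthen the spatial intelligence of multimodal large language models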

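AI summary: Proposes the GeoVR framework, which distills 3D geometric knowledge (camera poses, depth maps, scale factors, and multi-scale 3D features) from 2D video sequences to reshape the internal representations of multimodal large language models and give them spatial intelligence, reaching state-of-the-art performance on spatial reasoning benchmarks.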

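AI-generated Chinese abstract (English translation)

Multimodal large language models (MLLMs) perform well at 2D semantic understanding but lack intrinsic 3D awareness, so their representations fail to stay geometrically and spatially consistent across video frames. Given the scarcity of large-scale 3D data, we propose GeoVR, a novel framework that learns geometric representations using only 2D video sequences. The approach effectively restructures the semantic latent space inside MLLMs to unlock spatial intelligence. Rather than relying on shallow feature mixing, GeoVR reshapes the MLLM's internal representations by distilling geometric knowledge from pre-trained 3D foundation models. This is achieved through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed changing viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations naturally develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks show that GeoVR achieves state-of-the-art performance and establishes a new paradigm for giving foundation models spatial intelligence.

English abstract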

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we present GeoVR, a novel framework that learns geometric representations using purely 2D video sequences. This approach effectively restructures the semantic latent space within MLLMs to unlock spatial intelligence. Rather than employing superficial feature mixing, GeoVR reshapes the internal representations of the MLLM by distilling geometry knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed varying viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations naturally develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks demonstrate that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.
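
A multi-objective strategy with four geometric targets is typically trained on a weighted sum of per-target losses. The sketch below shows that pattern only. The abstract does not specify the loss functions or weights, so the choice of mean squared error for pose and feature terms, mean absolute error for depth and scale terms, the name `combined_geometric_loss`, and all weights are assumptions made for illustration.

```python
from typing import Callable, Dict, Sequence, Tuple


def mean_squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Average squared difference between two equal-length sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and equal in length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mean_absolute_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Average absolute difference between two equal-length sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and equal in length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


# ASSUMPTION: which loss is paired with which target is hypothetical.
TERM_LOSSES: Dict[str, Callable[[Sequence[float], Sequence[float]], float]] = {
    "pose": mean_squared_error,      # inter-frame camera pose
    "depth": mean_absolute_error,    # dense depth map
    "scale": mean_absolute_error,    # metric scale factor
    "feature": mean_squared_error,   # distilled 3D features
}


def combined_geometric_loss(
    pred: Dict[str, Sequence[float]],
    target: Dict[str, Sequence[float]],
    weights: Dict[str, float],
) -> Tuple[float, Dict[str, float]]:
    """Return the weighted total loss and the unweighted per-target terms."""
    terms = {name: fn(pred[name], target[name]) for name, fn in TERM_LOSSES.items()}
    total = sum(weights[name] * value for name, value in terms.items())
    return total, terms


pred = {"pose": [1.0, 2.0], "depth": [1.0, 3.0], "scale": [2.0], "feature": [0.0, 0.0]}
target = {"pose": [1.0, 0.0], "depth": [2.0, 1.0], "scale": [1.5], "feature": [1.0, 1.0]}
weights = {"pose": 1.0, "depth": 1.0, "scale": 1.0, "feature": 1.0}
total, terms = combined_geometric_loss(pred, target, weights)
print(total, terms)
```

Returning the per-target terms alongside the total makes it easy to log each objective separately and see which one dominates training.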
