Vision Foundation Models / VLM - arXivDaily Topic

2606.17030 2026-06-18 cs.CV New submission 90%

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

Jie Zhang, Xiaoyue Chen, Anzhe Chen, Dayiheng Liu, Deqing Li, Gengze Zhou, Hale Yin, Haoqi Yuan, Haoyang Li, Jiahao Li, Jiazhao Zhang, Jingren Zhou, Kaiyuan Gao, Kun Yan, Lihan Jiang, Ningyuan Tang, Pei Lin, Qihang Peng, Shengming Yin, Tianhe Wu, Tianyi Yan, Xiao Xu, Yan Shu, Yanran Zhang, Ye Wang, Yi Wang, Yilei Chen, Yixian Xu, Yiyang Huang, Yuxiang Chen, Zekai Zhang, Zhendong Wang, Zixing Lei, Zhixuan Liang, Zihao Liu, Zikai Zhou, Chenxu Lv, Xiong-Hui Chen, Chenfei Wu

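Affiliations: Qwen Team

Topic match (visual reasoning): language-conditioned video world model; visual reasoning and generation

AI summary: Proposes Qwen-RobotWorld, a language-conditioned video world model that uses natural language as a unified action interface. Using a double-stream MMDiT, a large-scale embodied world knowledge corpus, and a progressive training curriculum, it predicts physically consistent future visual trajectories for tasks such as robotic manipulation and autonomous driving, and ranks first overall on EWMBench and DreamGen Bench.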

English abstract

We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation provides three promising application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. This is achieved through a three-part design: a) Double-Stream MMDiT with MLLM Action Encoding, where a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M video-text corpus (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive results show strong competitiveness: the model ranks 1st overall on EWMBench and DreamGen Bench and outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on the RoboTwin-IF benchmark further support robust generalization and multi-view consistency.

2606.18846 2026-06-18 cs.CV New submission 85%

From Bounding Boxes to Visual Reasoning: An On-Policy Data Annotation Tool for Vision-Language Models

Like Zhang, Runliang Niu, Shiqi Wang, Xiyu Hu, Qianli Xing, Pan Wang, Qingzu He, Qi Wang

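Affiliations: School of Artificial Intelligence, Jilin University; College of Computer Science, Jilin University; OPPO

Topic match (visual reasoning): VLM data annotation tool that supports visual reasoning

AI summary: Proposes ScreenAnnotator, which combines a unified annotation atom schema, an on-policy annotation loop, and a Bayesian verifier to address the limited expressiveness, annotation-training decoupling, and poor data reusability of existing tools, enabling efficient multi-task data generation.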

Comments 14 pages, 7 figures

English abstract

Vision-language models (VLMs) are rapidly advancing toward sophisticated grounded structured visual reasoning. Training models for such advanced capabilities demands a new genre of data that seamlessly unifies spatial coordinates, open-vocabulary descriptions, structured attributes, and topological relationships into a singular representation. However, existing data annotation tools fundamentally fail to meet these intricate demands, suffering from three systematic bottlenecks: limited expressiveness, severe annotation-training decoupling, and poor data reusability. To bridge this infrastructure gap, we introduce an open-source annotation tool, ScreenAnnotator. First, we define a unified annotation atom schema that binds spatial, semantic, and structural primitives into a single unit. Second, we implement an on-policy annotation loop embedded with a Bayesian Annotation Verifier (BAV). Finally, we design a template-driven multi-task data synthesis process that dynamically transforms static atoms into diverse multi-dimensional reasoning tasks, eliminating redundant re-annotation. The on-policy loop drives the annotation accept rate to nearly 100% on flowcharts and 77% on GUI screenshots, while steadily reducing per-image annotation time as labeled data accumulate. In the flowchart scenario, fine-tuning a VLM yields 76.1% average accuracy, an absolute gain of 35.1 percentage points. Our code is available at: https://github.com/WnQinm/Annotator.

2606.18839 2026-06-18 cs.LG cs.CV New submission 85%

Semantic Robustness Certification for Vision-Language Models

Peiyu Yang, Paul Montague, Feng Liu, Andrew C. Cullen, Amardeep Kaur, Christopher Leckie, Sarah M. Erfani

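Affiliations: School of Computing & Information Systems, University of Melbourne, Australia

Topic match (visual reasoning): semantic robustness certification for VLMs; text prompts as semantic proxies

AI summary: Proposes the first framework that certifies VLM robustness to semantic-level variation (e.g., shape, size, and style) without additional data per variation. It uses text prompts as semantic proxies and characterizes the decision boundary in closed form to certify the extent intervals over which the predicted class stays unchanged.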

Comments Accepted to ICML

English abstract

Vision-language models (VLMs) are now widely used in downstream tasks. However, real-world applications often expose VLMs to distribution shifts induced by semantic variation (e.g., shape, size, and style). Robustness certification determines if a model's prediction changes when transformations are applied to its input. While most certification frameworks study geometric or pixel-level transformations over inputs, this work proposes a novel framework that enables certifying VLM robustness under semantic-level transformations. Leveraging the open-vocabulary capability of VLMs, we use text prompts as semantic proxies to construct transformations parameterized by an extent that controls the degree of semantic variation. By characterizing the VLM decision boundary in closed form, our framework quantitatively certifies extent intervals for which the predicted class remains unchanged under the semantic transformation. Our framework is the first to certify VLM robustness under semantic-level variations without requiring additional data for each variation, making it practical to apply. Experiments on both synthetic and real-world data show that our framework enables certifying robustness under diverse semantic variations across scenarios.

2606.18681 2026-06-18 cs.CV New submission 85%

Moving Beyond Diversity: Visual Token Pruning as Subspace Reconstruction for Efficient VLMs

Jaeyeon Lee, Shunjie Wen, Dong-Wan Choi

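Affiliations: Inha University

Topic match (visual reasoning): visual token pruning for VLMs to improve efficiency

AI summary: Proposes SPARE, which recasts token pruning as a subspace reconstruction problem: it iteratively selects the tokens with the largest projection residuals and adds an anti-relevance criterion to preserve contextual information. On LLaVA, it removes up to 94% of visual tokens while retaining 95% of baseline performance.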

Comments ECCV 2026 Under Review

English abstract

Despite their remarkable performance, Vision Language Models (VLMs) incur substantial computational overhead due to the large number of visual tokens. While diversity maximization has become a dominant strategy for token reduction, existing methods rely on cosine-based normalized similarity that discards magnitude information, failing to faithfully approximate the original feature representation and leading to suboptimal performance, particularly on compositional multi-skill reasoning tasks. In this paper, we introduce SPARE, a subspace reconstruction method that reformulates token pruning as a column subset selection problem and explicitly minimizes reconstruction error. By iteratively selecting tokens with large projection residuals, SPARE performs reconstruction-driven pruning beyond angular diversity. Moreover, we reveal a counterintuitive anti-relevance phenomenon: tokens with lower image-text relevance score can better preserve contextual information. Based on this finding, we incorporate anti-relevance into SPARE as an additional selection criterion to promote context-aware token selection. Extensive experiments across multiple VLMs and benchmarks demonstrate that SPARE consistently achieves state-of-the-art performance, with strong gains on compositional tasks. When applied to LLaVA, SPARE removes up to 94% of visual tokens while retaining 95% of the baseline performance, all in a fully training-free manner.
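
The reconstruction-driven selection described in this abstract can be illustrated with a greedy column subset selection loop. This is a minimal sketch under stated assumptions, not the authors' SPARE implementation: tokens are plain Python lists, the function name `select_tokens_by_residual` is hypothetical, and the anti-relevance criterion is left out. At each step it keeps the token whose residual (the part not explained by the tokens already kept) has the largest norm, then projects that direction out of every residual, so token magnitude influences the choice instead of being normalized away.

```python
import math


def select_tokens_by_residual(tokens, k):
    """Greedily pick up to k token indices with the largest projection residuals.

    Hypothetical helper for illustration; `tokens` is a list of equal-length
    feature vectors (lists of floats).
    """
    residuals = [list(t) for t in tokens]  # copy so the input is not modified
    selected = []
    for _ in range(min(k, len(tokens))):
        norms = [math.sqrt(sum(x * x for x in r)) for r in residuals]
        best = max(range(len(tokens)), key=lambda i: norms[i])
        if norms[best] < 1e-12:
            break  # every remaining token already lies in the selected span
        selected.append(best)
        # Unit direction of the chosen residual.
        q = [x / norms[best] for x in residuals[best]]
        # Remove that direction from all residuals (Gram-Schmidt style step).
        for r in residuals:
            dot = sum(a * b for a, b in zip(r, q))
            for d in range(len(r)):
                r[d] -= dot * q[d]
    return selected


if __name__ == "__main__":
    toy_tokens = [[3.0, 0.0], [1.0, 0.0], [0.0, 2.0], [2.9, 0.1]]
    print(select_tokens_by_residual(toy_tokens, 2))  # [0, 2]
```

In the toy input, the token `[2.9, 0.1]` is almost fully explained by the first pick `[3.0, 0.0]`, so the second pick is `[0.0, 2.0]`, which adds a new direction to the span.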

2606.18385 2026-06-18 cs.AI New submission 85%

CaVe-VLM-CoT: An Interpretable Vision-Language Model Framework

Sneha Rao, Shaina Raza, Dhanesh Ramachandram

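Affiliations: Vector Institute

Topic match (visual reasoning): proposes an interpretable VLM framework that combines CoT and RAG

AI summary: Proposes the CaVe-VLM-CoT framework, which enforces evidence-grounded reasoning through a five-stage closed-loop pipeline (Extractor, Retriever, Solver, Citation Injector, Verifier), and introduces CaVeScore, a composite metric for assessing retrieval quality, citation faithfulness, and cross-modal grounding. It reports 87.1% accuracy on ScienceQA and 55.2% on MMMU.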

English abstract

Vision-Language Models (VLMs) remain prone to hallucinations, producing fluent but visually unfaithful outputs. Existing chain-of-thought and retrieval-augmented methods only partially address this, as they neither enforce step-level citation grounding nor route verification failures back to retrieval for correction. We present CaVe-VLM-CoT, a modular reflection-based agentic-RAG framework that enforces evidence-grounded reasoning through a five-stage closed-loop pipeline: Extractor, Retriever, Solver, Citation Injector, and Verifier, in which detected ungrounded claims trigger structured feedback to the Extractor for targeted re-retrieval. Since no existing framework jointly measures retrieval quality, step-wise citation faithfulness, and cross-modal grounding, we propose a suite of 23 component-wise metrics across all stages, anchored by CaVeScore, a composite metric weighting accuracy, citation precision and recall, attribution, and evidence grounding. Without any architectural or prompt modifications, CaVe-VLM-CoT achieves 87.1% accuracy and 56.6% CaVeScore on ScienceQA, and 55.2% accuracy and 35.7% CaVeScore on MMMU (30 subjects).

2606.19253 2026-06-18 cs.CV cs.AI cs.LG cs.RO New submission 80%

OneCanvas: 3D Scene Understanding via Panoramic Reprojection

Bartłomiej Baranowski, Dave Zhenyu Chen, Matthias Nießner

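Affiliations: Technical University of Munich; Huawei

Topic match (visual reasoning): 3D scene understanding method for VLMs

AI summary: Proposes OneCanvas, which aggregates multi-view patch features onto a panoramic canvas by reprojecting them with depth and camera pose. It needs neither a complex geometry encoder nor a large training budget and reaches state-of-the-art accuracy on benchmarks such as SQA3D.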

Comments Project page: https://baranowskibrt.github.io/onecanvas/

English abstract

Existing approaches to 3D scene understanding in Vision-Language Models (VLMs) either rely on complex, model-specific geometry encoders or large training budgets in pursuit of spatial reasoning. Instead, OneCanvas aggregates patch features from all views onto a single equirectangular panoramic canvas. Namely, each patch is unprojected to a 3D world coordinate using its depth and camera pose, then placed on the canvas at the continuous longitude and latitude of that point as seen from the canvas origin, with no rasterization or aggregation across overlapping views. A 3D position embedding of the patch's metric coordinates is added to its feature, restoring the depth lost when collapsing the world position to an angular canvas coordinate. Patches from all frames thus share one spatial coordinate system with no fusion or major architectural modifications of the backbone. The pretrained VLM consumes this representation as if it were an ordinary image. Because the canvas can be centered on any pose of interest, the same representation directly supports situated reasoning from a specific viewpoint, a common requirement in robotics and embodied AI. Thanks to this representation, we can also introduce a spatial pretraining curriculum: by procedurally placing patch features of objects, drawn from real images, at chosen 3D world positions on an otherwise empty canvas, we generate on-the-fly supervision spanning a broad range of spatial reasoning tasks, with answer distributions controlled to reduce spatial reasoning shortcuts. OneCanvas achieves state-of-the-art accuracy on SQA3D and VSI-Bench, and generalizes to out-of-distribution data on SPBench, using an order of magnitude less training compute than the strongest competing methods.
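
The placement step in this abstract (unproject a patch with depth and pose, then read off longitude and latitude from the canvas origin) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: a pinhole camera with intrinsics `fx, fy, cx, cy`, a camera-to-world rotation matrix and translation, a world frame with z pointing up, and a canvas whose horizontal axis spans longitude in [-pi, pi] and whose vertical axis runs from latitude +pi/2 at the top to -pi/2 at the bottom. The function names are hypothetical.

```python
import math


def unproject(u, v, depth, fx, fy, cx, cy, rotation, translation):
    """Lift pixel (u, v) with metric depth to a 3D world point (pinhole model)."""
    cam = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Apply the assumed camera-to-world pose: world = R @ cam + t.
    return tuple(
        sum(rotation[i][j] * cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def canvas_coords(point, origin, width, height):
    """Map a world point to continuous equirectangular canvas coordinates.

    Returns (x, y, distance); the distance is what the angular coordinates
    discard and what a 3D position embedding would need to restore.
    """
    dx, dy, dz = (p - o for p, o in zip(point, origin))
    r = math.sqrt(dx * dx + dy * dy + dz * dz)
    if r == 0.0:
        raise ValueError("point coincides with the canvas origin")
    lon = math.atan2(dy, dx)   # longitude in [-pi, pi]
    lat = math.asin(dz / r)    # latitude in [-pi/2, pi/2]
    x = (lon + math.pi) / (2 * math.pi) * width
    y = (math.pi / 2 - lat) / math.pi * height
    return x, y, r
```

Because `origin` is a free argument, the same points can be re-placed on a canvas centered at any pose of interest, which is the property the abstract uses for situated reasoning.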

2606.17412 2026-06-18 cs.CV cs.AI New submission 80%

Enhancing Pathological VLMs with Cross-scale Reasoning

Chi Phan, Tianyi Zhang, Qiaochu Xue, Yufeng Wu, Dan Hu, Zeyu Liu, Sudong Wang, Yueming Jin

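Affiliations: Department of Electrical and Computer Engineering, National University of Singapore; PuzzleLogic Pte Ltd; Department of Pathology, Fujian Medical University Cancer Hospital & Fujian Cancer Hospital

Topic match (visual reasoning): enhancing cross-scale visual reasoning in pathology VLMs

AI summary: Proposes the first cross-scale training and evaluation paradigm, which strengthens the cross-scale reasoning of pathology VLMs through multi-magnification visual question answering. It also builds the high-quality benchmark Scale-VQA and the model ScaleReasoner-R1, which achieves state-of-the-art performance.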

English abstract

Pathological images are inherently multi-scale, requiring pathologists to integrate evidence from global tissue architecture at low magnification to cellular morphology at higher magnification for accurate diagnosis. While existing pathological datasets for vision-language models (VLMs) include various scales, they often lack an explicit cross-scale reasoning objective. This limitation prevents VLMs from capturing essential cross-scale representations and learning evidence-based reasoning. To bridge this gap, we introduce the first cross-scale training and evaluation paradigm that formulates pathology interpretation as multi-magnification reasoning. However, creating such a task reveals a critical challenge: multi-image visual question answering (VQA) is prone to text-only shortcuts, which allow models to guess answers using magnification-dependent artifacts rather than visual evidence. To address this, we propose a leakage-aware curation pipeline that combines adversarial text-only screening with constraint-guided question design. Using this pipeline, we construct Scale-VQA, a high-quality benchmark with 4,685 multiple-choice questions grounded in 2,537 pathology images across multiple magnification levels. Finally, we present ScaleReasoner-R1, a model trained via reinforcement learning to optimize performance on the cross-scale VQA task. ScaleReasoner-R1 achieves state-of-the-art performance on our cross-scale reasoning benchmark and generalizes to state-of-the-art performance on established single-scale benchmarks. The findings suggest that even limited cross-scale supervision can significantly improve pathological understanding. The code and demos will be open-sourced.

2606.18738 2026-06-18 cs.SD New submission 75%

GRIDEX: Grid-Grounded Forensic Explanations for Deepfake Spectrogram Analysis

Thi Ngan Ha Do, Tingmin Wu, Alsharif Abuadbba, Kristen Moore

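Affiliations: CSIRO

Topic match (visual reasoning): deepfake spectrogram analysis with generated forensic explanations

AI summary: Proposes GRIDEX, a framework trained in two stages (SFT + GRPO) that localizes anomalous regions in a spectrogram and generates structured forensic explanations, improving the interpretability of deepfake detection.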

English abstract

The advancement of speech generation technologies has made artificial speech increasingly realistic. Although modern classification models can achieve high accuracy in deepfake detection, they do not produce evidence, such as where spoof cues appear in the spectrogram and what they imply acoustically, which limits their usefulness in forensic settings. Manual analysis of full spectrograms is resource-intensive, so evidence should narrow attention to the most diagnostic regions. Moreover, existing explainability methods have limited capabilities in connecting contextual attributes to localized evidence, making explanations harder to verify. To overcome this limitation, we propose GRIDEX, a pipeline that, when given a deepfake spectrogram, generates forensic explanations of its anomalies. The pipeline (i) selects top-K anomalous regions in the spectrogram and (ii) produces an explanation for each anomaly. The explanations follow a schema of categorical acoustic fields, including temporal, spectral, and phonetic information, plus interpretation text. To our knowledge, this is the first framework to generate structured forensic explanations using regional grounding for deepfake spectrograms. GRIDEX is trained with a two-stage learning paradigm that combines supervised fine-tuning (SFT) with Group Relative Policy Optimization (GRPO). Experiments on our dataset show improved artifact localization and explanation quality over strong vision-language model (VLM) baselines. The dataset and code will be released upon publication.
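
For readers unfamiliar with the second training stage, the core of GRPO as it is commonly described is a group-relative advantage: several responses are sampled for the same input, each is scored, and every reward is standardized against its own group. The sketch below shows only that normalization. The abstract does not specify the reward design used for GRIDEX, so the reward values are hypothetical, and using the population standard deviation with a small `eps` is an assumption (implementations differ on this detail).

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for four explanations sampled for one spectrogram.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Responses that score above their group average receive positive advantages and the rest receive negative ones, so no separate value network is needed to form a baseline.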

2606.18661 2026-06-18 cs.CV cs.AI New submission 75%

LandslideAgent with Multimodal LandslideBench: A Domain-Rule-Augmented Agent for Autonomous Landslide Identification and Analysis

Chengfu Liu, Dongyang Hou, Junwu Xiang, Cheng Yang, Xuezhi Cui, Zeyuan Wang, Liangtian Liu, Zelang Miao

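Affiliations: Central South University

Topic match (visual reasoning): landslide-specific vision-language model that strengthens geological semantic understanding

AI summary: Proposes an instruction-driven agentic framework comprising the multimodal dataset LandslideBench, the landslide-specific vision-language model LandslideVLM, and the domain-rule-augmented agent LandslideAgent, enabling autonomous landslide identification and analysis.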

English abstract

Intelligent landslide hazard interpretation is critical for disaster prevention, yet current paradigms struggle to simultaneously extract visual features and high-level geoscientific semantics, while general-purpose vision-language models (VLMs) suffer from perceptual limitations and domain hallucinations in complex geological scenarios. To address these challenges, we propose an instruction-driven agentic framework comprising three components. First, LandslideBench, a multimodal fine-grained dataset with seven subtype labels, high-resolution imagery, pixel-level masks, and high-quality textual descriptions, is constructed via multi-VLM cross-validation and interactive annotation. Then, LandslideVLM, a landslide-oriented VLM, is fine-tuned via LoRA on LandslideBench to enhance geological semantic understanding. Finally, LandslideAgent, a domain rule-enhanced agent taking LandslideVLM as its cognitive backbone, employs a dual-rule controller incorporating structured report metadata constraints and cross-validation identification constraints to regulate automated tool invocation. Experiments demonstrate that LandslideBench provides effective baselines across five mainstream models on fine-grained classification and semantic segmentation. LandslideVLM achieves accuracy improvements of 10.96%, 32.87%, and 15.91% on landslide discrimination, fine-grained classification, and semantic description quality, respectively. LandslideAgent further enables autonomous multi-source spatial data inference, realizing full-process intelligence for landslide identification and analysis.

2606.18558 2026-06-18 cs.CV New submission 75%

MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

Jianing Zhang, Chenhao Zheng, Yajun Yang, Max Argus, Rustin Soraki, Winson Han, Taira Anderson, Chun-Liang Li, Shuo Liu, Jiafei Duan, Zhongzheng Ren, Jieyu Zhang, Ranjay Krishna

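Affiliations: Allen Institute for AI; University of Washington; UNC-Chapel Hill

Topic match (visual reasoning): language-instructed 3D point trajectory forecasting

AI summary: Proposes a language-instructed method for forecasting 3D point motion. It builds a large-scale dataset and benchmark, predicts class-agnostic, view-stable motion trajectories, and validates the approach on robot manipulation and video generation.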

English abstract

Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. We argue that 3D points in world coordinates provide a general representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks. We formalize the task of goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object of interest, and a language description of the intended goal, the model predicts the future 3D trajectory of each point. We introduce a full stack to study this task at scale: (1) MolmoMotion-1M is a large corpus of action-described, object-grounded 3D point trajectories annotated from 1.16M unconstrained videos; (2) PointMotionBench is a human-verified benchmark spanning 111 object categories and 61 motion types; and (3) MolmoMotion is a general motion forecasting model that supports both autoregressive coordinate prediction and flow-matching-based trajectory generation. MolmoMotion accurately predicts diverse motion patterns with different language instructions, and significantly outperforms existing motion prediction baselines on PointMotionBench. Finally, we show that the learned 3D motion prior transfers well to downstream applications: it improves training efficiency and generalization for robot manipulation, and its predicted trajectories provide effective motion guidance for generative models to synthesize videos with more realistic object motion.

2606.19258 2026-06-18 cs.CV cs.RO New submission 70%

CABLE: Cloud-Assisted Bandwidth-efficient LMM-based Encoding for V2X Systems

Haohua Que, Zhipeng Bao, Qianyi Wu, Handong Yao

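Affiliations: College of Engineering, University of Georgia

Topic match (visual reasoning): LMM-based encoding for edge-cloud perception

AI summary: Proposes CABLE, which propagates the cloud segmentation mask on the edge using ego-motion compensation and residual-motion cues to form a region of interest (ROI), uploads only ROI-masked images, and closes a mask-to-ROI-to-LMM feedback loop. Across five datasets it reduces ROI pixel coverage by 73-87% with an estimated 5-8x LMM prefill speedup.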

English abstract

Cloud-hosted large multimodal models (LMMs) can provide strong open-vocabulary perception for Vehicle-to-Everything systems, but naively transmitting full-resolution frames from edge to cloud causes severe communication overhead and high cloud-side prefill latency. We present CABLE, a cloud-assisted bandwidth-efficient LMM-based encoding framework for edge-cloud perception. CABLE propagates the previous cloud segmentation mask on the edge using ego-motion compensation, refines it with residual-motion cues, and consolidates disconnected regions via a corridor envelope to form a robust region of interest (ROI). Only ROI-masked images are uploaded, while the cloud segmentation output is fed back as the prior for the next frame, forming a mask-to-ROI-to-LMM feedback loop. Experiments on five datasets (nuScenes, WOD-ZB, Waymo, KITTI, and CADC) show consistent communication savings while largely preserving perception, achieving a 73-87% reduction in ROI pixel coverage and an estimated 5-8x LMM prefill speedup at a modest detection-quality trade-off relative to full-frame inference.

2606.19120 2026-06-18 cs.LG cs.CV New submission 70%

Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

Sihan Wang, Xiyao Liu, Lianqing Liu, Zhi Han

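Affiliations: State Key Laboratory of Robotics and Intelligent Systems, Shenyang Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences

Topic match (visual reasoning): visual descriptions that support reasoning; within the VLM scope

AI summary: Proposes ViGOS, a framework that decouples perception from reasoning so that MLLM post-training avoids text shortcuts and image-grounded behavior improves.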

Comments 29 pages, 5 figures, 8 tables

English abstract

On-policy self-distillation (OPSD) trains a model on its own rollouts and uses a frozen copy to provide dense token-level targets conditioned on a reference target. This works well for LLM reasoning, but a direct extension to multimodal large language models (MLLMs) can create a shortcut: the privileged target may guide tokens mainly based on the text reference target rather than the image. We propose ViGOS, a visually grounded OPSD framework for MLLM post-training. The student first writes a visual description and then reasons toward the final answer. For valid rollouts, an image-only perception teacher supervises the description, while a privileged reasoning teacher supervises the reasoning and final answer on the same student prefix. A reference teacher is used only for invalid rollouts to recover the output format. Across general vision-language, expert reasoning, visual math, spatial grounding, and visual-language-prior benchmarks, ViGOS keeps the main benefits of OPSD and improves image-grounded behavior in shortcut-prone settings.

2606.17372 2026-06-18 cs.CL cs.AI New submission 70%

Implicit vs. Explicit Prompting Strategies for LVLMs in Referential Communication

Peter Zeng, Amie J. Paige, Weiling Li, Susan E. Brennan, Owen Rambow, Cameron R. Jones

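Affiliations: Stony Brook University

Topic match (visual reasoning): prompting strategies for LVLMs in referential communication

AI summary: Controlling for task differences, this study compares how explicit and implicit prompting affect the ability of LVLMs to produce efficient referring expressions. Under explicit prompting the models coordinate on efficient expressions, whereas under implicit prompting they fail to, revealing a key difference in human-machine communication.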

2606.18634 2026-06-18 cs.RO cs.AI New submission 60%

EffiNav: Fusing Depth and Vision-Language for Efficient Object Goal Navigation

Zecheng Yin, Benedict Jun Ma

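Affiliations: Systems Hub of Intelligence Transportation HKUST(GZ)

Topic match (visual reasoning): uses a vision-language model to predict exploration frontiers

AI summary: Proposes EffiNav, a framework that fuses depth information with a vision-language model and guides navigation through predicted exploration frontiers and semantic priors. On the HM3D and OVON datasets it matches or outperforms baselines, improving path efficiency and generalization.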

English abstract

Locating a target object while exploring an unknown environment is a fundamental capability for autonomous agents, with applications ranging from search-and-rescue to field robots. A simplified version of this task is Object Goal Navigation (ObjNav). In ObjNav, successful arrival at the target object provides a basic measure of performance; however, the efficiency of the navigation trajectory is equally important, as it indicates how intelligently the agent explores and how much time remains for subsequent tasks. In unknown environments, the key to efficient navigation lies in deciding where to explore next. While many prior works aim to address this core challenge and have achieved promising performance in certain settings, recent training-based models and non-training frameworks still suffer from generalization and efficiency issues, respectively, which in the worst cases can lead to excessive exploration of already-visited areas or redundant back-and-forth motion. We evaluate EffiNav on two widely used simulation benchmarks, Habitat Matterport 3D (HM3D) and Open-Vocabulary Object goal Navigation (OVON), and further validate its effectiveness on physical robots in real-world settings. We conduct failure analysis on a large number of simulation episodes. With minimal modification, we also extend EffiNav to a memory-augmented ObjNav task on the GOAT-BENCH dataset, demonstrating its adaptability beyond standard ObjNav settings. Across two standard metrics, Success Rate (SR) and Success weighted by Path Length (SPL), EffiNav matches or outperforms recent baselines, reflecting its efficiency, robustness, and practical applicability. Given the different emphases of the two datasets, this performance indicates that the framework is more balanced and generalizable for efficient ObjNav.
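
The two metrics named in this abstract have standard definitions in the embodied navigation literature: SR is the fraction of successful episodes, and SPL averages, over all episodes, the success indicator multiplied by the ratio of the shortest path length to the longer of the agent's path and the shortest path. The sketch below computes both. The episode tuple layout `(success, shortest_path_length, agent_path_length)` and the example numbers are assumptions for illustration, and shortest path lengths are assumed to be positive.

```python
def success_rate(episodes):
    """Fraction of episodes in which the agent reached the target."""
    return sum(1 for success, _, _ in episodes if success) / len(episodes)


def spl(episodes):
    """Success weighted by Path Length, averaged over all episodes."""
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            # A path shorter than the reference cannot push the ratio above 1.
            total += shortest / max(taken, shortest)
    return total / len(episodes)


if __name__ == "__main__":
    # Hypothetical episodes: (success, shortest_path_length, agent_path_length)
    episodes = [(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)]
    print(success_rate(episodes))
    print(spl(episodes))  # 0.5
```

The second episode succeeds but takes twice the shortest path, so it contributes only 0.5 to SPL. This is why redundant back-and-forth motion lowers SPL even when SR is unchanged.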
