Large Vision Models / VLM

2606.19277 2026-06-18 cs.CV New submission Topic 90

A Unified Framework for Efficient Remote Sensing Visual Question Answering: Adapting Dual, Hybrid, and Encoder-Decoder Architectures

Timothy Agboada, Shikha Chandel, Yadav Raj Ghimire, Leila Hashemi-Beni

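Affiliations: Computational Data Science and Engineering; College of Science and Technology

Topic match (visual question answering): parameter-efficient fine-tuning for remote sensing VQA, adapted to multiple VLM architectures.

AI summary: Proposes RS Adapter, a parameter-efficient fine-tuning strategy that injects lightweight bottleneck adapters into three vision-language model architectures and handles remote sensing VQA with under 5% trainable parameters; the hybrid FLAVA architecture strikes the best balance between multimodal reasoning and retrieval.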

Comments 4 pages, 2 figures, accepted and to be presented at the 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026), scheduled for 9 to 14 August 2026 in Washington, D.C.

Abstract

Visual Question Answering (VQA) in the Remote Sensing (RS) domain presents unique challenges due to the high resolution, multi-scale object distribution, and semantic complexity of aerial imagery. While general-domain Foundation Models have achieved remarkable success, their direct application to RSVQA is hindered by massive domain shifts and the computationally prohibitive nature of full fine-tuning. This study presents a comparative analysis of RS Adapter, a Parameter-Efficient Fine-Tuning (PEFT) strategy, applied across three distinct Vision-Language Model (VLM) architectures: the Dual-Encoder CLIP, the Encoder-Decoder BLIP, and the Hybrid FLAVA. We introduce a unified architectural surgery pipeline that injects lightweight bottleneck adapters into the attention and MLP layers of frozen backbones, enabling rapid adaptation with less than 5 percent of parameters trainable. Experimental results on the high-resolution RSVQA x dataset demonstrate that while all adapted models achieve convergence, the Hybrid FLAVA architecture offers a superior balance of multimodal reasoning and retrieval capabilities compared to its unimodal counterparts. Our findings establish a new baseline for resource-efficient VQA in disaster assessment and urban monitoring.
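To give a feel for why bottleneck adapters can stay under a few percent of trainable parameters, the sketch below counts the weights of a generic adapter (down-projection, then up-projection) and compares them with a frozen backbone. The layer count, width, bottleneck size, and backbone size are hypothetical round numbers chosen for illustration; they are not taken from the paper, and the paper's RS Adapter may be laid out differently.

```python
def bottleneck_adapter_params(d_model: int, bottleneck: int) -> int:
    """Parameters of one generic adapter: down-projection plus up-projection, with biases."""
    down = d_model * bottleneck + bottleneck   # W_down and b_down
    up = bottleneck * d_model + d_model        # W_up and b_up
    return down + up


def trainable_fraction(backbone_params: int, n_layers: int, d_model: int,
                       bottleneck: int, adapters_per_layer: int = 2) -> float:
    """Share of all parameters that are trainable when only the adapters are updated."""
    per_adapter = bottleneck_adapter_params(d_model, bottleneck)
    adapter_total = n_layers * adapters_per_layer * per_adapter
    return adapter_total / (backbone_params + adapter_total)


# Hypothetical encoder: 12 layers, width 768, roughly 86M frozen weights,
# one adapter after attention and one after the MLP in every layer.
frac = trainable_fraction(backbone_params=86_000_000, n_layers=12,
                          d_model=768, bottleneck=64)
print(f"trainable share: {frac:.2%}")
```

Under these assumed sizes the adapters account for a small single-digit percentage of the total, which is the regime the abstract describes.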

2606.18609 2026-06-18 cs.CV New submission Topic 85

Hallucination Detection and Correction in Medical VLMs via Counter-Evidence Verification

Nan Zhou, Ke Zou, Meng Liu, Linchao He, Jiaqi Zhu, Yi Zhang, Hu Chen, Huazhu Fu

Affiliations: College of Computer Science, Sichuan University; Yong Loo Lin School of Medicine, National University of Singapore; Key Laboratory of Data Protection and Intelligent Management, Ministry of Education, Sichuan University; National Key Laboratory of Autonomous Intelligent Unmanned Systems, Beijing Institute of Technology; Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR)

Topic match (visual question answering): hallucination detection and correction for medical vision-language models.

AI summary: Proposes the CoEV framework, which detects and corrects hallucinations in medical VLMs through bidirectional verification between textual and visual evidence, requires no retraining, and significantly improves detection and correction performance on four datasets.

Comments Accepted to MICCAI 2026. Submission version.

Abstract

The reliability of vision-language models (VLMs) in medical diagnosis is challenged by trust-undermining hallucinations. Existing hallucination detection approaches mainly focus on identifying factual inconsistencies between generated text and reference data. While some studies analyze where models attend in images, they seldom verify whether such attention truly reflects the visual evidence supporting the generated text. To address this gap, we propose Counter-Evidence Verification (CoEV), a training-free plug-and-play framework that detects and corrects hallucinations through evidence-based factual consistency verification. CoEV performs bidirectional verification between textual assertions and visual evidence, testing whether each statement is supported by its corresponding evidence region, and assigns each statement to a four-quadrant diagnostic map capturing combinations of text factuality and visual grounding. CoEV detects hallucinated content and serves as a post hoc refinement tool, correcting hallucinations without retraining. Extensive experiments on four medical datasets show that CoEV combats hallucinations in VLMs. For hallucination detection, CoEV consistently outperforms existing methods, improving average PR-AUC and ROC-AUC by 3.0% and 3.9% absolute points respectively, with notable gains of up to 18.5% in specific VQA scenarios. For hallucination correction, it improves Micro-F1 by up to 12.5%, reduces hallucination rates by over 11.9% on medical report generation, and also boosts medical VQA accuracy. These results show that CoEV enables reliable detection and correction of hallucinations, providing clinicians with dependable, evidence-based cues for diagnosis. Code will be released upon acceptance.
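The four-quadrant map can be pictured as the cross product of two binary checks per statement. The sketch below is only an illustration of that bookkeeping: the `Verdict` class, the quadrant labels, and the rule that everything outside the first quadrant is flagged are assumptions, and the abstract does not say how CoEV computes either check.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    statement: str
    text_factual: bool       # outcome of the textual factuality check (assumed given)
    visually_grounded: bool  # outcome of the evidence-region check (assumed given)


def quadrant(v: Verdict) -> str:
    """Map the two binary checks onto one of four cells (labels are illustrative)."""
    if v.text_factual and v.visually_grounded:
        return "factual+grounded"
    if v.text_factual:
        return "factual+ungrounded"
    if v.visually_grounded:
        return "nonfactual+grounded"
    return "nonfactual+ungrounded"


# Made-up report statements with made-up check outcomes.
verdicts = [
    Verdict("Left lower lobe opacity is present.", True, True),
    Verdict("There is a pleural effusion.", True, False),
    Verdict("The heart is enlarged.", False, False),
]
flagged = [v.statement for v in verdicts if quadrant(v) != "factual+grounded"]
print(flagged)
```

Keeping the two checks separate is what lets a statement that happens to be correct but is not supported by its image region be told apart from one that is both correct and grounded.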

2606.19100 2026-06-18 cs.CV New submission Topic 80

AMALIA-VL: A Native European Portuguese Open-Source Vision and Language Model

Diogo Glória-Silva, João Cardeira, Manuel Letras da Luz, Afonso Simplício, Gonçalo Vinagre, Diogo Tavares, Rafael Ferreira, Inês Calvo, Inês Vieira, David Semedo, João Magalhães

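Affiliations: NOVA School of Science and Technology; NOVA LINCS

Topic match (visual question answering): open-source vision-language model for European Portuguese.

AI summary: To address the lack of open-source multimodal models for European Portuguese, proposes AMALIA-VL, which uses three-stage training and a Portuguese-centric data mix to establish a strong baseline, with all resources released openly.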

Abstract

Large Vision and Language Models (LVLMs) have advanced rapidly, yet European Portuguese (pt-PT) remains systematically underserved by existing open-source multimodal models, which either conflate it with Brazilian Portuguese or severely under-represent it in their training data mixes. We introduce AMALIA-VL, the first open-source instruction-tuned LVLM built natively for pt-PT, pairing a high-resolution vision encoder with dynamic image tiling and a fully open pt-PT-optimized language model via a learned connector. We contribute a purposefully designed three-stage training process (vision-language alignment, general visual instruction tuning, and preference optimization) together with a pt-PT-centric multimodal data mix combining curated and translated public datasets with novel datasets that address the near-total absence of European Portuguese multimodal resources. Our evaluation shows that AMALIA-VL establishes a strong baseline for open-source pt-PT LVLMs. We will release model weights, training data, and construction pipelines along with machine-translated pt-PT evaluation benchmarks to help democratize pt-PT LVLM development.

2606.18271 2026-06-18 cs.AI cs.LG New submission Topic 80

NAVI-Orbital: First In-Orbit Demonstration of a Zero-Shot Vision-Language Model for Autonomous Earth Observation

Juan Manuel Delfa Victoria, Taran Cyriac John, Andrew W. Herson

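Affiliations: NASA Jet Propulsion Laboratory (JPL); Loft Orbital

Topic match (visual question answering): in-orbit VLM deployment for autonomous Earth observation and multimodal inference.

AI summary: Presents NAVI-Orbital, a system that achieves what the authors describe as the first autonomous multimodal inference by a vision-language model onboard a low Earth orbit satellite, using semantic compression to ease the data downlink bottleneck.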

Comments 17 pages, 47 figures

Abstract

As Earth Observation data generation outpaces downlink bandwidth and human-in-the-loop processing, a widening gap has emerged between onboard collection and actionable ground intelligence. This paper presents NAVI-Orbital, a software system deployed on a Low Earth Orbit (LEO) spacecraft. On April 16, 2026, NAVI-Orbital achieved what is, to the authors' knowledge, the first in-orbit demonstration of a vision-language model performing autonomous multi-modal inference entirely onboard. NAVI-Orbital uses a local vision-language model (Gemma 3) to classify each captured scene, produce a text description of its content and the relationships between its features, and respond to operator follow-up via natural-language dialogue. The system is re-tasked through plain-English prompts in place of conventional command sequences, and is orchestrated by a graph-based state machine (LangGraph) coordinating dedicated agents for detection and dialogue. Results across ground benchmarking (88.16% accuracy on the 7,960-image curated AID benchmark), Flatsat validation, and live in-orbit captures of newly acquired, previously unseen Earth imagery (including uncorrected YAM-9 imagery, processed onboard with hardware-accelerated GPU inference and no fine-tuning for the flight instrument) demonstrate the feasibility of running foundation models on satellite-class edge computers to invert the conventional acquire-then-downlink-everything bandwidth profile through semantic compression of Earth observations in-orbit.

2606.17188 2026-06-18 cs.CV cs.CL New submission Topic 80

Not Truly Multilingual: Script Consistency as a Missing Dimension in VLM Evaluation

Prabhjot Singh, Bhushan Pawar, Madhu Reddiboina, Rajvee Sheth

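Affiliations: RediMinds Inc.; The University of Texas at Austin; Independent Researcher

Topic match (visual question answering): evaluating VLM visual reasoning across multiple writing scripts.

AI summary: Proposes the PuMVR benchmark, evaluates 10 VLMs on Punjabi's three scripts, finds a significant script gap, and proposes the Script Consistency Rate (SCR) as a necessary evaluation metric.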

Abstract

Current multilingual evaluations for Vision-Language Models (VLMs) assume a one-to-one mapping between language and orthography, overlooking billions of users of multi-script languages. We introduce PuMVR (Punjabi Multimodal Visual Reasoning), a benchmark of 1,000 strictly parallel image-text instances across Punjabi's three active scripts: Gurmukhi, Shahmukhi, and Roman. Evaluating 10 state-of-the-art VLMs, we expose a substantial and systematic Script Gap. Models frequently solve visual tasks in one script while failing identical tasks in another, with accuracy deltas reaching 16%. Crucially, visual input boosts absolute performance uniformly yet does not close the orthographic gap. Furthermore, cross-script in-context transfer is highly brittle, exposing script-locked knowledge representation. Supported by McNemar tests across all script pairs, our findings demonstrate that current "multilingual" VLMs are not truly multi-script. We propose the Script Consistency Rate (SCR), which falls as low as 24.8% on our benchmark, as a mandatory metric for script-agnostic evaluation to ensure equitable AI access. Data and code are available at: https://github.com/prabhjotschugh/Not-Truly-Multilingual-PuMVR.
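Two of the quantities mentioned above can be sketched in a few lines. The exact McNemar test on paired correctness is standard; the definition of SCR used here (share of items whose answer is identical in every script) is an assumption, since the abstract names the metric without defining it. The answer data are made up.

```python
from math import comb


def script_consistency_rate(answers_by_script: dict[str, list[str]]) -> float:
    """Share of items whose answer is identical in every script (assumed definition)."""
    columns = list(answers_by_script.values())
    n_items = len(columns[0])
    consistent = sum(1 for item in zip(*columns) if len(set(item)) == 1)
    return consistent / n_items


def mcnemar_exact(correct_a: list[bool], correct_b: list[bool]) -> float:
    """Two-sided exact McNemar p-value for paired correctness on the same items."""
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)  # only A correct
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)  # only B correct
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up answers of one model to the same four items in three scripts.
answers = {
    "gurmukhi":  ["A", "B", "C", "D"],
    "shahmukhi": ["A", "B", "D", "D"],
    "roman":     ["A", "C", "C", "D"],
}
scr = script_consistency_rate(answers)
p_value = mcnemar_exact([True] * 5, [False] * 5)
print(scr, p_value)
```

McNemar's test only looks at items where the two scripts disagree, which is why it suits strictly parallel benchmarks: agreement on easy items cannot mask a systematic gap.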

2606.18553 2026-06-18 cs.CV New submission Topic 70

Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning

Minh-Loi Nguyen, Xuan-Vu Le, Long-Bao Nguyen, Hoang-Bach Ngo, Trung-Nghia Le

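Affiliations: University of Science, VNU-HCM; Vietnam National University, Ho Chi Minh City

Topic match (visual question answering): combines a VLM and an LLM to generate image captions.

AI summary: Proposes an image captioning framework augmented by hierarchical multimodal article retrieval; with structure-aware retrieval and contextual refinement, it combines a VLM and an LLM to generate captions rich in contextual detail, and placed 5th in the EVENTA 2025 challenge.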

Comments SOICT 2025

Abstract

Traditional image captioning methods often struggle to generate comprehensive, context-rich descriptions, especially for details not directly observable from visual cues. To overcome this, we propose a novel retrieval-augmented image captioning framework that generates captions with deeper insights, such as object attributes, event context, and underlying significance, by leveraging external knowledge. Our approach features a hierarchical multi-modal article retrieval mechanism that moves beyond monolithic text entities. This retrieval considers article structure-aware features, including weighted textual components (e.g., headlines, body sections) and visual placement patterns, alongside multi-faceted similarity computations (content-visual, visual-visual, and discourse positioning). A subsequent contextual relevance refinement stage further enhances the retrieved information. The retrieved articles then serve as the knowledge base for caption generation: first, a VLM generates a concise image description; second, we segment relevant information from the retrieved articles based on this description; and finally, an LLM utilizes both the description and extracted knowledge to generate a comprehensive, contextually detailed caption. We participated in the ACM Multimedia EVENTA 2025 Challenge and achieved 5th place with an overall score of 0.2824 on the private test set of the OpenEvent-V1 dataset. Source code is publicly released at https://github.com/mf0212/EVENTA-Challange.

2606.19053 2026-06-18 cs.CV New submission Topic 90

Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: From Evaluation to Diagnosis

Hong-Tao Yu, Chen-Wei Xie, Yuxin Peng, Serge Belongie, Xiu-Shen Wei

Affiliations: School of Computer Science and Engineering, Southeast University, China; Alibaba Group; School of Computer Science and Engineering, School of Intelligence Science and Engineering, and Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, China; Wangxuan Institute of Computer Technology, National Key Laboratory for Multimedia Information Processing, Peking University, China; University of Copenhagen, Denmark

Topic match: benchmark for fine-grained image tasks, diagnosing LVLMs.

AI summary: Proposes the FG-BMK benchmark, with 1.01 million questions and 0.28 million images, which evaluates LVLMs' fine-grained semantic recognition and visual discriminability through human-oriented and machine-oriented paradigms, diagnoses the causes of failure, and identifies bottlenecks in visual representation, semantic grounding, and related areas.

Abstract

Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception and reasoning capabilities. While numerous benchmarks have evaluated LVLMs from holistic or task-specific perspectives, their capabilities on fine-grained image tasks, which are fundamental to computer vision, remain insufficiently understood. To address this gap, we introduce FG-BMK, a comprehensive fine-grained evaluation benchmark containing 1.01 million questions and 0.28 million images, covering diverse scenarios from common object-centric domains to specialized domains. FG-BMK jointly evaluates dialogue-level fine-grained semantic recognition and feature-level visual discriminability through human-oriented and machine-oriented paradigms, enabling diagnostic analysis of whether LVLM failures arise from insufficient visual representations, weak visual-to-semantic grounding, or limited fine-grained knowledge. Through extensive experiments on a diverse set of representative LVLMs/VLMs, we find that current LVLMs remain inadequate fine-grained recognizers, with failures arising from intertwined bottlenecks in visual representations, semantic grounding, modality alignment, and category-level knowledge. We further analyze training design factors for improving fine-grained capabilities and examine how visual and linguistic perturbations affect LVLM predictions. These findings provide diagnostic insights into the limitations of current LVLMs and offer guidance for future data construction and model design in developing more reliable LVLMs for fine-grained visual tasks. Our code is open-source and available at https://fg-bmk.github.io/.

2606.18101 2026-06-18 cs.AI New submission Topic 90

Trust the Right Teacher: Quality-Aware Self-Distillation for GUI Grounding

Jingyuan Huang, Zuming Huang, Yucheng Shi, Tianze Yang, Xiaoming Zhai, Wei Chu, Ninghao Liu

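Affiliations: University of Georgia; INFLY Tech; Tencent AI Lab; The Hong Kong Polytechnic University

Topic match (visual grounding): self-distillation to improve VLM GUI grounding.

AI summary: Proposes a quality-aware self-distillation method that improves the quality of coordinate-token teacher signals through soft correctness-aware gating and teacher-probability scaling, improving VLM performance on GUI grounding.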

Comments corrected some claims

Abstract

Graphical user interface (GUI) grounding requires vision-language models (VLMs) to identify small target elements in high-resolution screenshots and predict precise screen coordinates. On-policy self-distillation (OPSD) is a promising post-training approach for this coordinate-sensitive task, since it provides dense token-level teacher signals beyond hard coordinate labels. However, naive OPSD is not well suited to GUI grounding: because OPSD evaluates the teacher on student-generated prefixes, the quality of coordinate-token teacher signals can degrade when the prefix has already deviated from the target coordinate, leading to unreliable teacher signals. To mitigate this, we propose quality-aware self-distillation for VLM-based GUI grounding, which improves coordinate-token teacher-signal quality through soft correctness-aware gating and teacher-probability scaling. The soft correctness-aware gate checks whether the teacher's current coordinate-token prediction can still be completed into the ground-truth box under the student-generated prefix. If not, the corresponding teacher signal is down-weighted. Teacher-probability scaling then uses the teacher's confidence as a lightweight factor to further calibrate the strength of the gated supervision. A key empirical finding is that neither component alone improves overall performance, whereas combining them consistently improves performance. This suggests that the two mechanisms play complementary roles: correctness-aware gating suppresses unreliable coordinate-token supervision, while teacher-probability scaling calibrates the strength of the remaining signals. Experiments across six GUI grounding benchmarks show that our method consistently improves the base model and outperforms strong baselines.
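One way to picture the gate is as a prefix-completability check on a digit-tokenized coordinate. The sketch below assumes coordinates are emitted as fixed-width decimal strings, one digit per token, and that the down-weighting is a constant factor; the tokenization, the `soft_penalty` value, and both function names are hypothetical and not taken from the paper.

```python
def can_complete(prefix: str, lo: int, hi: int, width: int = 4) -> bool:
    """True if some `width`-digit completion of `prefix` lies inside [lo, hi]."""
    remaining = width - len(prefix)
    if remaining < 0:
        return False  # prefix is already longer than a full coordinate
    base = (int(prefix) if prefix else 0) * 10 ** remaining
    smallest, largest = base, base + 10 ** remaining - 1
    return smallest <= hi and largest >= lo


def token_weight(prefix: str, teacher_token: str, teacher_prob: float,
                 lo: int, hi: int, soft_penalty: float = 0.1) -> float:
    """Weight for one coordinate token: completability gate times teacher confidence."""
    gate = 1.0 if can_complete(prefix + teacher_token, lo, hi) else soft_penalty
    return gate * teacher_prob


# Ground-truth box spans x in [1200, 1299]; the student has already emitted "1".
print(token_weight("1", "2", teacher_prob=0.8, lo=1200, hi=1299))  # teacher stays inside the box
print(token_weight("1", "3", teacher_prob=0.8, lo=1200, hi=1299))  # teacher has left the box
```

In this toy form the gate decides whether a teacher token is still worth imitating and the probability term decides how strongly, which mirrors the complementary roles the abstract attributes to the two components.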

2606.17030 2026-06-18 cs.CV New submission Topic 90

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

Jie Zhang, Xiaoyue Chen, Anzhe Chen, Dayiheng Liu, Deqing Li, Gengze Zhou, Hale Yin, Haoqi Yuan, Haoyang Li, Jiahao Li, Jiazhao Zhang, Jingren Zhou, Kaiyuan Gao, Kun Yan, Lihan Jiang, Ningyuan Tang, Pei Lin, Qihang Peng, Shengming Yin, Tianhe Wu, Tianyi Yan, Xiao Xu, Yan Shu, Yanran Zhang, Ye Wang, Yi Wang, Yilei Chen, Yixian Xu, Yiyang Huang, Yuxiang Chen, Zekai Zhang, Zhendong Wang, Zixing Lei, Zhixuan Liang, Zihao Liu, Zikai Zhou, Chenxu Lv, Xiong-Hui Chen, Chenfei Wu

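Affiliations: Qwen Team

Topic match (visual reasoning): language-conditioned video world model, visual reasoning and generation.

AI summary: Proposes Qwen-RobotWorld, a language-conditioned video world model that uses natural language as a unified action interface; with a double-stream MMDiT, a large-scale embodied world knowledge corpus, and progressive curriculum training, it predicts physically consistent future visual trajectories in tasks such as robotic manipulation and autonomous driving, and achieves top results on several benchmarks.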

Abstract

We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation provides three promising application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. This is achieved through a three-part design: a) Double-Stream MMDiT with MLLM Action Encoding, where a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M video-text corpus (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive results show strong competitiveness: the model ranks 1st overall on EWMBench and DreamGen Bench and outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on the RoboTwin-IF benchmark further support robust generalization and multi-view consistency.

2606.18846 2026-06-18 cs.CV New submission Topic 85

From Bounding Boxes to Visual Reasoning: An On-Policy Data Annotation Tool for Vision-Language Models

Like Zhang, Runliang Niu, Shiqi Wang, Xiyu Hu, Qianli Xing, Pan Wang, Qingzu He, Qi Wang

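Affiliations: School of Artificial Intelligence, Jilin University; College of Computer Science, Jilin University; OPPO

Topic match (visual reasoning): VLM data annotation tool supporting visual reasoning.

AI summary: Proposes ScreenAnnotator, which combines a unified annotation atom schema, an on-policy loop, and a Bayesian verifier to address the limited expressiveness, annotation-training decoupling, and poor data reusability of existing tools, enabling efficient multi-task data generation.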

Comments 14 pages, 7 figures

Abstract

Vision-language models (VLMs) are rapidly advancing toward sophisticated grounded structured visual reasoning. Training models for such advanced capabilities demands a new genre of data that seamlessly unifies spatial coordinates, open-vocabulary descriptions, structured attributes, and topological relationships into a singular representation. However, existing data annotation tools fundamentally fail to meet these intricate demands, suffering from three systematic bottlenecks: limited expressiveness, severe annotation-training decoupling, and poor data reusability. To bridge this infrastructure gap, we introduce an open-source annotation tool, ScreenAnnotator. First, we define a unified annotation atom schema that binds spatial, semantic, and structural primitives into a single unit. Second, we implement an on-policy annotation loop embedded with a Bayesian Annotation Verifier (BAV). Finally, we design a template-driven multi-task data synthesis process that dynamically transforms static atoms into diverse multi-dimensional reasoning tasks, eliminating redundant re-annotation. The on-policy loop drives the annotation accept rate to nearly 100% on flowcharts and 77% on GUI screenshots, while steadily reducing per-image annotation time as labeled data accumulate. In the flowchart scenario, fine-tuning a VLM yields 76.1% average accuracy, which is a 35.1-percentage-point absolute gain. Our code is available at: https://github.com/WnQinm/Annotator.

2606.18839 2026-06-18 cs.LG cs.CV New submission Topic 85

Semantic Robustness Certification for Vision-Language Models

Peiyu Yang, Paul Montague, Feng Liu, Andrew C. Cullen, Amardeep Kaur, Christopher Leckie, Sarah M. Erfani

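Affiliations: School of Computing & Information Systems, University of Melbourne, Australia

Topic match (visual reasoning): semantic robustness certification for VLMs, with text prompts as proxies.

AI summary: Proposes the first framework that certifies vision-language model robustness at the semantic level (e.g., shape, size, style) without additional data, using text prompts as semantic proxies and quantifying the decision boundary to guarantee that the predicted class stays unchanged under semantic transformations.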

Comments Accepted to ICML

Abstract

Vision-language models (VLMs) are now widely used in downstream tasks. However, real-world applications often expose VLMs to distribution shifts induced by semantic variation (e.g., shape, size, and style). Robustness certification determines if a model's prediction changes when transformations are applied to its input. While most certification frameworks study geometric or pixel-level transformations over inputs, this work proposes a novel framework that enables certifying VLM robustness under semantic-level transformations. Leveraging the open-vocabulary capability of VLMs, we use text prompts as semantic proxies to construct transformations parameterized by an extent that controls the degree of semantic variation. By characterizing the VLM decision boundary in closed form, our framework quantitatively certifies extent intervals for which the predicted class remains unchanged under the semantic transformation. Our framework is the first to certify VLM robustness under semantic-level variations without requiring additional data for each variation, making it practical to apply. Experiments on both synthetic and real-world data show that our framework enables certifying robustness under diverse semantic variations across scenarios.

2606.18681 2026-06-18 cs.CV New submission Topic 85

Moving Beyond Diversity: Visual Token Pruning as Subspace Reconstruction for Efficient VLMs

Jaeyeon Lee, Shunjie Wen, Dong-Wan Choi

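Affiliations: Inha University

Topic match (visual reasoning): visual token pruning for VLMs to improve efficiency.

AI summary: Proposes SPARE, which recasts token pruning as a subspace reconstruction problem, iteratively selects the tokens with large projection residuals to keep, and adds an anti-relevance mechanism to preserve contextual information; on LLaVA it retains 95% of baseline performance after pruning up to 94% of visual tokens.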

Comments ECCV 2026 Under Review

Abstract

Despite their remarkable performance, Vision Language Models (VLMs) incur substantial computational overhead due to the large number of visual tokens. While diversity maximization has become a dominant strategy for token reduction, existing methods rely on cosine-based normalized similarity that discards magnitude information, failing to faithfully approximate the original feature representation and leading to suboptimal performance, particularly on compositional multi-skill reasoning tasks. In this paper, we introduce SPARE, a subspace reconstruction method that reformulates token pruning as a column subset selection problem and explicitly minimizes reconstruction error. By iteratively selecting tokens with large projection residuals, SPARE performs reconstruction-driven pruning beyond angular diversity. Moreover, we reveal a counterintuitive anti-relevance phenomenon: tokens with lower image-text relevance scores can better preserve contextual information. Based on this finding, we incorporate anti-relevance into SPARE as an additional selection criterion to promote context-aware token selection. Extensive experiments across multiple VLMs and benchmarks demonstrate that SPARE consistently achieves state-of-the-art performance, with strong gains on compositional tasks. When applied to LLaVA, SPARE removes up to 94% of visual tokens while retaining 95% of the baseline performance, all in a fully training-free manner.
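The idea of choosing tokens by projection residual can be illustrated with a generic greedy column subset selection, which is essentially pivoted Gram-Schmidt. This is a textbook-style sketch rather than SPARE itself: it omits the anti-relevance criterion, uses plain Python lists, and the function name and toy vectors are invented for illustration.

```python
def greedy_residual_selection(tokens: list[list[float]], k: int) -> list[int]:
    """Pick up to k token indices, each time taking the largest projection residual."""
    residuals = [list(t) for t in tokens]   # part of each token not yet explained
    selected: list[int] = []
    for _ in range(min(k, len(tokens))):
        candidates = [i for i in range(len(tokens)) if i not in selected]
        norms = {i: sum(x * x for x in residuals[i]) for i in candidates}
        best = max(candidates, key=lambda i: norms[i])
        if norms[best] <= 1e-12:
            break                            # remaining tokens are already reconstructed
        selected.append(best)
        scale = norms[best] ** 0.5
        q = [x / scale for x in residuals[best]]   # new orthonormal direction
        for i, r in enumerate(residuals):
            dot = sum(a * b for a, b in zip(r, q))
            residuals[i] = [a - dot * b for a, b in zip(r, q)]
    return selected


# Toy features: token 1 is nearly a copy of token 0, token 2 is small but orthogonal.
tokens = [[3.0, 0.0], [2.9, 0.1], [0.0, 1.0]]
kept = greedy_residual_selection(tokens, k=2)
print(kept)
```

Because the residual keeps vector magnitudes, a large-norm token is taken first and near-duplicates of it contribute almost nothing afterwards, which is the contrast with cosine-only diversity that the abstract draws.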

2606.18385 2026-06-18 cs.AI New submission Topic 85

CaVe-VLM-CoT: An Interpretable Vision-Language Model Framework

Sneha Rao, Shaina Raza, Dhanesh Ramachandram

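Affiliations: Vector Institute

Topic match (visual reasoning): proposes an interpretable VLM framework combining CoT and RAG.

AI summary: Proposes the CaVe-VLM-CoT framework, which performs evidence-grounded reasoning through a five-stage closed-loop pipeline (Extractor, Retriever, Solver, Citation Injector, Verifier) and introduces the composite CaVeScore metric to evaluate retrieval quality, citation faithfulness, and cross-modal grounding; it reports 87.1% accuracy on ScienceQA and 55.2% on MMMU.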

Abstract

Vision-Language Models (VLMs) remain prone to hallucinations, producing fluent but visually unfaithful outputs. Existing chain-of-thought and retrieval-augmented methods only partially address this, as they neither enforce step-level citation grounding nor route verification failures back to retrieval for correction. We present CaVe-VLM-CoT, a modular reflection-based agentic-RAG framework that enforces evidence-grounded reasoning through a five-stage closed-loop pipeline: Extractor, Retriever, Solver, Citation Injector, and Verifier, in which detected ungrounded claims trigger structured feedback to the Extractor for targeted re-retrieval. Since no existing framework jointly measures retrieval quality, step-wise citation faithfulness, and cross-modal grounding, we propose a suite of 23 component-wise metrics across all stages, anchored by CaVeScore, a composite metric weighting accuracy, citation precision and recall, attribution, and evidence grounding. Without any architectural or prompt modifications, CaVe-VLM-CoT achieves 87.1% accuracy and 56.6% CaVeScore on ScienceQA, and 55.2% accuracy and 35.7% CaVeScore on MMMU (30 subjects).

2606.19253 2026-06-18 cs.CV cs.AI cs.LG cs.RO New submission Topic 80

OneCanvas: 3D Scene Understanding via Panoramic Reprojection

Bartłomiej Baranowski, Dave Zhenyu Chen, Matthias Nießner

Affiliations * Technical University of Munich; Huawei

Topic match: visual reasoning. A 3D scene understanding method for VLMs.

AI summary: Proposes OneCanvas, which aggregates multi-view patch features onto a panoramic canvas by reprojecting them with depth and camera pose, without complex geometry encoders or large training budgets, and reaches state-of-the-art accuracy on benchmarks such as SQA3D.

Comments Project page: https://baranowskibrt.github.io/onecanvas/

Abstract

Existing approaches to 3D scene understanding in Vision-Language Models (VLMs) either rely on complex, model-specific geometry encoders or large training budgets in pursuit of spatial reasoning. Instead, OneCanvas aggregates patch features from all views onto a single equirectangular panoramic canvas. Namely, each patch is unprojected to a 3D world coordinate using its depth and camera pose, then placed on the canvas at the continuous longitude and latitude of that point as seen from the canvas origin, with no rasterization or aggregation across overlapping views. A 3D position embedding of the patch's metric coordinates is added to its feature, restoring the depth lost when collapsing the world position to an angular canvas coordinate. Patches from all frames thus share one spatial coordinate system with no fusion or major architectural modifications of the backbone. The pretrained VLM consumes this representation as if it were an ordinary image. Because the canvas can be centered on any pose of interest, the same representation directly supports situated reasoning from a specific viewpoint, a common requirement in robotics and embodied AI. Thanks to this representation, we can also introduce a spatial pretraining curriculum: by procedurally placing patch features of objects, drawn from real images, at chosen 3D world positions on an otherwise empty canvas, we generate on-the-fly supervision spanning a broad range of spatial reasoning tasks, with answer distributions controlled to reduce spatial reasoning shortcuts. OneCanvas achieves state-of-the-art accuracy on SQA3D and VSI-Bench, and generalizes to out-of-distribution data on SPBench, using an order of magnitude less training compute than the strongest competing methods.
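The placement step described in this abstract (unproject a patch with its depth and camera pose, then place it at the longitude and latitude seen from the canvas origin) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: the pinhole intrinsics `fx, fy, cx, cy`, the camera-to-world pose `(R, t)`, the axis convention (x right, y up, z forward), and the mapping from angles to canvas pixels are all assumptions, and both function names are hypothetical.

```python
import math


def unproject(u, v, depth, fx, fy, cx, cy, R, t):
    """Lift pixel (u, v) with metric depth to a 3D world point.

    Assumes a pinhole camera and a camera-to-world pose given as a 3x3
    rotation R (nested lists) and a translation t.
    """
    pc = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    return tuple(sum(R[i][j] * pc[j] for j in range(3)) + t[i] for i in range(3))


def canvas_coords(point, origin, width, height):
    """Map a world point to continuous equirectangular canvas coordinates.

    Returns (column, row, range). The range is returned only to show what
    the angular coordinates alone discard.
    """
    dx, dy, dz = (p - o for p, o in zip(point, origin))
    rng = math.sqrt(dx * dx + dy * dy + dz * dz)
    if rng == 0.0:
        raise ValueError("point coincides with the canvas origin")
    lon = math.atan2(dx, dz)   # longitude in (-pi, pi], 0 = straight ahead
    lat = math.asin(dy / rng)  # latitude in [-pi/2, pi/2], 0 = horizon
    col = (lon / (2.0 * math.pi) + 0.5) * width
    row = (0.5 - lat / math.pi) * height
    return col, row, rng


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
point = unproject(320.0, 240.0, 2.0, 500.0, 500.0, 320.0, 240.0,
                  identity, [0.0, 0.0, 0.0])
print(canvas_coords(point, (0.0, 0.0, 0.0), 1024, 512))
```

A point straight ahead of the origin lands at the canvas center, and moving the origin re-centers the same set of points, which is how a canvas of this kind could be centered on any pose of interest.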

2606.17412 2026-06-18 cs.CV cs.AI New submission Topic 80

Enhancing Pathological VLMs with Cross-scale Reasoning

Chi Phan, Tianyi Zhang, Qiaochu Xue, Yufeng Wu, Dan Hu, Zeyu Liu, Sudong Wang, Yueming Jin

Affiliations * Department of Electrical and Computer Engineering, National University of Singapore; PuzzleLogic Pte Ltd; Department of Pathology, Fujian Medical University Cancer Hospital & Fujian Cancer Hospital

Topic match: visual reasoning. Enhances cross-scale visual reasoning in pathology VLMs.

AI summary: Proposes the first cross-scale training and evaluation paradigm, strengthening the cross-scale reasoning of pathology vision-language models through multi-magnification visual question answering, and builds the high-quality benchmark Scale-VQA and the model ScaleReasoner-R1, which achieves the best performance.

Abstract

Pathological images are inherently multi-scale, requiring pathologists to integrate evidence from global tissue architecture at low magnification to cellular morphology at higher magnification for accurate diagnosis. While existing pathological datasets for vision-language model (VLM) include various scales, they often lack an explicit cross-scale reasoning objective. This limitation prevents VLMs from capturing essential cross-scale representations and learning evidence-based reasoning. To bridge this gap, we introduce the first cross-scale training and evaluation paradigm that formulates pathology interpretation as multi-magnification reasoning. However, creating such a task reveals a critical challenge: multi-image visual question answering (VQA) is prone to text-only shortcuts, which allow models to guess answers using magnification-dependent artifacts rather than visual evidence. To address this, we propose a leakage-aware curation pipeline that combines adversarial text-only screening with constraint-guided question design. Using this pipeline, we construct Scale-VQA, a high-quality benchmark with 4,685 multiple-choice questions grounded in 2,537 pathology images across multiple magnification levels. Finally, we present ScaleReasoner-R1, a model trained via reinforcement learning to optimize performance on the cross-scale VQA task. ScaleReasoner-R1 achieves state-of-the-art performance on our cross-scale reasoning benchmark and generalizes to SOTA performance on established single-scale benchmarks. Findings suggest that even the limited cross-scale supervision can significantly improve pathological understanding. The code and demos will be open-sourced.

2606.18738 2026-06-18 cs.SD New submission Topic 75

GRIDEX: Grid-Grounded Forensic Explanations for Deepfake Spectrogram Analysis

Thi Ngan Ha Do, Tingmin Wu, Alsharif Abuadbba, Kristen Moore

Affiliations * CSIRO

Topic match: visual reasoning. Deepfake spectrogram analysis that generates forensic explanations.

AI summary: Proposes the GRIDEX framework, which uses two-stage learning (SFT + GRPO) to localize anomalous spectrogram regions and generate structured forensic explanations, improving the interpretability of spoof detection.

Abstract

The advancement of speech generation technologies has made artificial speech increasingly realistic. Although modern classification models can achieve high accuracy in deepfake detection, they do not produce evidence, such as where spoof cues appear in the spectrogram and what they imply acoustically, which limits their usefulness in forensic settings. Manual analysis of full spectrograms is resource-intensive, so evidence should narrow attention to the most diagnostic regions. Moreover, existing explainability methods have limited capabilities in connecting contextual attributes to localized evidence, making explanations harder to verify. To overcome this limitation, we propose GRIDEX, a pipeline that, when given a deepfake spectrogram, generates forensic explanations of its anomalies. The pipeline (i) selects top-K anomalous regions in the spectrogram and (ii) produces an explanation for each anomaly. The explanations follow a schema of categorical acoustic fields, including temporal, spectral, and phonetic information and interpretation text. To our knowledge, this is the first framework to generate structured forensic explanations using regional grounding for deepfake spectrograms. GRIDEX is trained with a two-stage learning paradigm that combines supervised fine-tuning (SFT) with Group Relative Policy Optimization (GRPO). Experiments on our dataset show improved artifact localization and explanation quality over strong vision-language model (VLM) baselines. The dataset and code will be released upon publication.

2606.18661 2026-06-18 cs.CV cs.AI New submission Topic 75

LandslideAgent with Multimodal LandslideBench: A Domain-Rule-Augmented Agent for Autonomous Landslide Identification and Analysis

Chengfu Liu, Dongyang Hou, Junwu Xiang, Cheng Yang, Xuezhi Cui, Zeyuan Wang, Liangtian Liu, Zelang Miao

Affiliations * Central South University

Topic match: visual reasoning. A landslide-specific vision-language model that enhances geological semantic understanding.

AI summary: Proposes an instruction-driven agentic framework comprising the multimodal dataset LandslideBench, the landslide-specific vision-language model LandslideVLM, and the domain-rule-augmented agent LandslideAgent, enabling autonomous landslide identification and analysis.

Abstract

Intelligent landslide hazard interpretation is critical for disaster prevention, yet current paradigms struggle to simultaneously extract visual features and high-level geoscientific semantics, while general-purpose vision-language models (VLMs) suffer from perceptual limitations and domain hallucinations in complex geological scenarios. To address these challenges, we propose an instruction-driven agentic framework comprising three components. First, LandslideBench, a multimodal fine-grained dataset with seven subtype labels, high-resolution imagery, pixel-level masks, and high-quality textual descriptions, is constructed via multi-VLM cross-validation and interactive annotation. Then, LandslideVLM, a landslide-oriented VLM, is fine-tuned via LoRA on LandslideBench to enhance geological semantic understanding. Finally, LandslideAgent, a domain rule-enhanced agent taking LandslideVLM as its cognitive backbone, employs a dual-rule controller incorporating structured report metadata constraints and cross-validation identification constraints to regulate automated tool invocation. Experiments demonstrate that LandslideBench provides effective baselines across five mainstream models on fine-grained classification and semantic segmentation. LandslideVLM achieves accuracy improvements of 10.96%, 32.87%, and 15.91% on landslide discrimination, fine-grained classification, and semantic description quality, respectively. LandslideAgent further enables autonomous multi-source spatial data inference, realizing full-process intelligence for landslide identification and analysis.

2606.18558 2026-06-18 cs.CV New submission Topic 75

MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

Jianing Zhang, Chenhao Zheng, Yajun Yang, Max Argus, Rustin Soraki, Winson Han, Taira Anderson, Chun-Liang Li, Shuo Liu, Jiafei Duan, Zhongzheng Ren, Jieyu Zhang, Ranjay Krishna

Affiliations * Allen Institute for AI; University of Washington; UNC-Chapel Hill

Topic match: visual reasoning. Language-instruction-guided 3D point trajectory forecasting.

AI summary: Proposes a language-instructed 3D point motion forecasting method; by building a large-scale dataset and benchmark, it achieves class-agnostic, view-stable trajectory prediction and validates its effectiveness in robot manipulation and video generation.

Abstract

Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. We argue that 3D points in world coordinates provide a general representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks. We formalize the task of goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object of interest, and a language description of the intended goal, the model predicts the future 3D trajectory of each point. We introduce a full stack to study this task at scale: (1) MolmoMotion-1M is a large corpus of action-described, object-grounded 3D point trajectories annotated from 1.16M unconstrained videos; (2) PointMotionBench is a human-verified benchmark spanning 111 object categories and 61 motion types; and (3) MolmoMotion is a general motion forecasting model that supports both autoregressive coordinate prediction and flow-matching-based trajectory generation. MolmoMotion accurately predicts diverse motion patterns with different language instructions, and significantly outperforms existing motion prediction baselines on PointMotionBench. Finally, we show that the learned 3D motion prior transfers well to downstream applications: it improves training efficiency and generalization for robot manipulation, and its predicted trajectories provide effective motion guidance for generative models to synthesize videos with more realistic object motion.

2606.19258 2026-06-18 cs.CV cs.RO New submission Topic 70

CABLE: Cloud-Assisted Bandwidth-efficient LMM-based Encoding for V2X Systems

Haohua Que, Zhipeng Bao, Qianyi Wu, Handong Yao

Affiliations * College of Engineering, University of Georgia

Topic match: visual reasoning. Uses LMMs for edge-cloud perception encoding.

AI summary: Proposes the CABLE framework, which propagates cloud segmentation masks on the edge using ego-motion compensation and residual-motion cues to produce a region of interest (ROI), uploads only ROI-masked images, and forms a mask-to-ROI-to-LMM feedback loop; across five datasets it achieves a 73-87% reduction in ROI pixel coverage and a 5-8x LMM prefill speedup.

Abstract

Cloud-hosted large multimodal models (LMMs) can provide strong open-vocabulary perception for Vehicle-to-Everything systems, but naively transmitting full-resolution frames from edge to cloud causes severe communication overhead and high cloud-side prefill latency. We present CABLE, a cloud-assisted bandwidth-efficient LMM-based encoding framework for edge-cloud perception. CABLE propagates the previous cloud segmentation mask on the edge using ego-motion compensation, refines it with residual-motion cues, and consolidates disconnected regions via a corridor envelope to form a robust region of interest (ROI). Only ROI-masked images are uploaded, while the cloud segmentation output is fed back as the prior for the next frame, forming a mask-to-ROI-to-LMM feedback loop. Experiments on five datasets (nuScenes, WOD-ZB, Waymo, KITTI, and CADC) show consistent communication savings while largely preserving perception, achieving 73-87% ROI pixel-coverage reduction with 5-8x estimated LMM prefill speedup at a modest detection-quality trade-off relative to full-frame inference.

2606.19120 2026-06-18 cs.LG cs.CV New submission Topic 70

Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

Sihan Wang, Xiyao Liu, Lianqing Liu, Zhi Han

Affiliations * State Key Laboratory of Robotics and Intelligent Systems, Shenyang Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences

Topic match: visual reasoning. Visual descriptions assist reasoning; falls within the VLM scope.

AI summary: Proposes the ViGOS framework, which decouples perception from reasoning to avoid text shortcuts in MLLM post-training and improve image-grounded behavior.

Comments 29 pages, 5 figures, 8 tables

Abstract

On-policy self-distillation (OPSD) trains a model on its own rollouts and uses a frozen copy to provide dense token-level targets conditioned on a reference target. This works well for LLM reasoning, but a direct extension to multimodal large language models (MLLMs) can create a shortcut: the privileged target may guide tokens mainly based on the text reference target rather than the image. We propose ViGOS, a visually grounded OPSD framework for MLLM post-training. The student first writes a visual description and then reasons toward the final answer. For valid rollouts, an image-only perception teacher supervises the description, while a privileged reasoning teacher supervises the reasoning and final answer on the same student prefix. A reference teacher is used only for invalid rollouts to recover the output format. Across general vision-language, expert reasoning, visual math, spatial grounding, and visual-language-prior benchmarks, ViGOS keeps the main benefits of OPSD and improves image-grounded behavior in shortcut-prone settings.

2606.17372 2026-06-18 cs.CL cs.AI New submission Topic 70

Implicit vs. Explicit Prompting Strategies for LVLMs in Referential Communication

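Peter Zeng, Amie J. Paige, Weiling Li, Susan E. Brennan, Owen Rambow, Cameron R. Jones

Affiliations * Stony Brook University

Topic match: visual reasoning. Studies prompting strategies for LVLMs in referential communication.

AI summary: Controlling for task differences, this study compares how explicit and implicit prompting affect LVLMs' generation of efficient referring expressions. It finds that models coordinate on efficient expressions under explicit prompting but fail under implicit prompting, revealing a key difference between human and machine communication.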

2606.18634 2026-06-18 cs.RO cs.AI New submission Topic 60

EffiNav: Fusing Depth and Vision-Language for Efficient Object Goal Navigation

Zecheng Yin, Benedict Jun Ma

Affiliations * Systems Hub of Intelligence Transportation HKUST(GZ)

Topic match: visual reasoning. Uses a vision-language model to predict exploration frontiers.

AI summary: Proposes the EffiNav framework, which fuses depth information with a vision-language model and guides navigation by predicting exploration frontiers and semantic priors; on the HM3D and OVON datasets it matches or outperforms baselines, improving path efficiency and generalization.

Abstract

Locating a target object while exploring an unknown environment is a fundamental capability for autonomous agents, with applications ranging from search-and-rescue to field robots. A simplified version of this task is Object Goal Navigation (ObjNav). In ObjNav, successful arrival at the target object provides a basic measure of performance; however, the efficiency of the navigation trajectory is equally important, as it indicates how intelligently the agent explores and how much time remains for subsequent tasks. In unknown environments, the key to efficient navigation lies in deciding where to explore next. While many prior works aim to address this core challenge and have achieved promising performance in certain settings, recent training-based models and training-free frameworks still suffer from generalization and efficiency issues, respectively, which in the worst cases can lead to excessive exploration of already-visited areas or redundant back-and-forth motion. We evaluate EffiNav on two widely used simulation benchmarks, Habitat Matterport 3D (HM3D) and Open-Vocabulary Object Goal Navigation (OVON), and further validate its effectiveness on physical robots in real-world settings. We conduct failure analysis on a large number of simulation episodes. With minimal modification, we also extend EffiNav to a memory-augmented ObjNav task on the GOAT-BENCH dataset, demonstrating its adaptability beyond standard ObjNav settings. Across two standard metrics, Success Rate (SR) and Success weighted by Path Length (SPL), EffiNav matches or outperforms recent baselines, reflecting its efficiency, robustness, and practical applicability. Given the different emphases of the two datasets, these results indicate that the framework is more balanced and generalizable for efficient ObjNav.
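The two metrics named in this abstract can be made concrete with a short computation. The sketch below assumes the commonly used definition of SPL, the mean over episodes of S_i * l_i / max(p_i, l_i), where S_i is the success flag, l_i the shortest-path length, and p_i the length of the path actually taken. The episode values are made up for illustration and are not results from the paper.

```python
def success_rate(episodes):
    """Fraction of episodes that reached the goal."""
    return sum(1 for success, _, _ in episodes if success) / len(episodes)


def spl(episodes):
    """Success weighted by Path Length.

    Each episode is (success, shortest_path_length, taken_path_length).
    Assumes shortest_path_length > 0 for every episode.
    """
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            # A detour (taken > shortest) scales the credit below 1.
            total += shortest / max(taken, shortest)
    return total / len(episodes)


# Hypothetical episodes: an optimal success, a success with a 2x detour,
# and a failure.
episodes = [(True, 5.0, 5.0), (True, 5.0, 10.0), (False, 5.0, 3.0)]
print(success_rate(episodes))  # two of three episodes succeed
print(spl(episodes))           # 0.5
```

The second episode succeeds but earns only half credit, which is why SPL can separate two agents that have the same success rate but differ in how much they wander.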

1. Visual question answering, 6 papers

A Unified Framework for Efficient Remote Sensing Visual Question Answering: Adapting Dual, Hybrid, and Encoder-Decoder Architectures

Hallucination Detection and Correction in Medical VLMs via Counter-Evidence Verification

AMALIA-VL: A Native European Portuguese Open-Source Vision and Language Model

NAVI-Orbital: First In-Orbit Demonstration of a Zero-Shot Vision-Language Model for Autonomous Earth Observation

Not Truly Multilingual: Script Consistency as a Missing Dimension in VLM Evaluation

Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning

2. Other, 1 paper

Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: From Evaluation to Diagnosis

3. Visual grounding, 1 paper

Trust the Right Teacher: Quality-Aware Self-Distillation for GUI Grounding

4. Visual reasoning, 14 papers

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

From Bounding Boxes to Visual Reasoning: An On-Policy Data Annotation Tool for Vision-Language Models

Semantic Robustness Certification for Vision-Language Models

Moving Beyond Diversity: Visual Token Pruning as Subspace Reconstruction for Efficient VLMs

CaVe-VLM-CoT: An Interpretable Vision-Language Model Framework

OneCanvas: 3D Scene Understanding via Panoramic Reprojection

Enhancing Pathological VLMs with Cross-scale Reasoning

GRIDEX: Grid-Grounded Forensic Explanations for Deepfake Spectrogram Analysis

LandslideAgent with Multimodal LandslideBench: A Domain-Rule-Augmented Agent for Autonomous Landslide Identification and Analysis

MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

CABLE: Cloud-Assisted Bandwidth-efficient LMM-based Encoding for V2X Systems

Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

Implicit vs. Explicit Prompting Strategies for LVLMs in Referential Communication

EffiNav: Fusing Depth and Vision-Language for Efficient Object Goal Navigation