Large Vision Models / VLM - arXivDaily Topic

2606.19277 2026-06-18 cs.CV New submission 90%

A Unified Framework for Efficient Remote Sensing Visual Question Answering: Adapting Dual, Hybrid, and Encoder-Decoder Architectures

Timothy Agboada, Shikha Chandel, Yadav Raj Ghimire, Leila Hashemi-Beni

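Affiliations: Computational Data Science and Engineering; College of Science and Technology

Topic match (visual question answering): parameter-efficient fine-tuning for remote sensing VQA, adapted to multiple VLM architectures

AI summary: Proposes RS Adapter, a parameter-efficient fine-tuning strategy that injects lightweight bottleneck adapters into three vision-language model architectures and performs remote sensing VQA with fewer than 5% of parameters trainable; the hybrid FLAVA architecture strikes the best balance between multimodal reasoning and retrieval.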

Comments: 4 pages, 2 figures. Accepted for presentation at the 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026), scheduled for 9 to 14 August 2026 in Washington, D.C.

Abstract

Visual Question Answering (VQA) in the Remote Sensing (RS) domain presents unique challenges due to the high resolution, multi-scale object distribution, and semantic complexity of aerial imagery. While general-domain foundation models have achieved remarkable success, their direct application to RSVQA is hindered by massive domain shifts and the computationally prohibitive nature of full fine-tuning. This study presents a comparative analysis of RS Adapter, a Parameter-Efficient Fine-Tuning (PEFT) strategy, applied across three distinct Vision-Language Model (VLM) architectures: the dual-encoder CLIP, the encoder-decoder BLIP, and the hybrid FLAVA. We introduce a unified architectural surgery pipeline that injects lightweight bottleneck adapters into the attention and MLP layers of frozen backbones, enabling rapid adaptation with less than 5 percent of parameters trainable. Experimental results on the high-resolution RSVQA x dataset demonstrate that while all adapted models achieve convergence, the hybrid FLAVA architecture offers a superior balance of multimodal reasoning and retrieval capabilities compared to its unimodal counterparts. Our findings establish a new baseline for resource-efficient VQA in disaster assessment and urban monitoring.
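To make the "bottleneck adapter" and "less than 5 percent" claims concrete, here is a minimal pure-Python sketch. It is not the authors' RS Adapter code: the hidden size (768), bottleneck width (64), two adapters per layer, the 4x MLP expansion, and all function names are assumptions chosen for illustration, and layer norms are ignored in the count.

```python
from typing import List


def adapter_param_count(d_model: int, bottleneck: int) -> int:
    """Parameters of one bottleneck adapter: down- and up-projection with biases."""
    down = d_model * bottleneck + bottleneck
    up = bottleneck * d_model + d_model
    return down + up


def transformer_layer_param_count(d_model: int, mlp_ratio: int = 4) -> int:
    """Frozen parameters of one transformer layer (attention + MLP, no layer norms)."""
    attention = 4 * (d_model * d_model + d_model)  # Q, K, V and output projections
    hidden = mlp_ratio * d_model
    mlp = (d_model * hidden + hidden) + (hidden * d_model + d_model)
    return attention + mlp


def trainable_fraction(d_model: int, bottleneck: int, adapters_per_layer: int = 2) -> float:
    """Share of parameters that are trainable when only the adapters are updated."""
    frozen = transformer_layer_param_count(d_model)
    trainable = adapters_per_layer * adapter_param_count(d_model, bottleneck)
    return trainable / (frozen + trainable)


def adapter_forward(x: List[float], w_down, b_down, w_up, b_up) -> List[float]:
    """Residual bottleneck: x + W_up . relu(W_down . x + b_down) + b_up."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_down, b_down)]
    delta = [sum(w * h for w, h in zip(row, hidden)) + b
             for row, b in zip(w_up, b_up)]
    return [xi + di for xi, di in zip(x, delta)]


if __name__ == "__main__":
    # Assumed sizes: d_model=768, bottleneck=64 (not taken from the paper).
    print(f"trainable share: {trainable_fraction(768, 64):.2%}")
```

With a zero-initialized up-projection the adapter is an identity map, so the frozen backbone's behavior is unchanged at the start of training; this is a common initialization choice, not something the abstract states.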

2606.18609 2026-06-18 cs.CV New submission 85%

Hallucination Detection and Correction in Medical VLMs via Counter-Evidence Verification

Nan Zhou, Ke Zou, Meng Liu, Linchao He, Jiaqi Zhu, Yi Zhang, Hu Chen, Huazhu Fu

Affiliations: College of Computer Science, Sichuan University; Yong Loo Lin School of Medicine, National University of Singapore; Key Laboratory of Data Protection and Intelligent Management, Ministry of Education, Sichuan University; National Key Laboratory of Autonomous Intelligent Unmanned Systems, Beijing Institute of Technology; Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR)

Topic match (visual question answering): hallucination detection and correction for medical vision-language models

AI summary: Proposes the CoEV framework, which detects and corrects hallucinations in medical VLMs through bidirectional verification between textual and visual evidence, without retraining, and substantially improves detection and correction performance on four datasets.

Comments: Accepted at MICCAI 2026. Submission version.

Abstract

The reliability of Vision-Language Models (VLMs) in medical diagnosis is challenged by trust-undermining hallucinations. Existing hallucination detection approaches mainly focus on identifying factual inconsistencies between generated text and reference data. While some studies analyze where models attend in images, they seldom verify whether such attention truly reflects the visual evidence supporting the generated text. To address this gap, we propose Counter-Evidence Verification (CoEV), a training-free plug-and-play framework that detects and corrects hallucinations through evidence-based factual consistency verification. CoEV performs bidirectional verification between textual assertions and visual evidence, testing whether each statement is supported by its corresponding evidence region, and assigns each statement to a four-quadrant diagnostic map capturing combinations of text factuality and visual grounding. CoEV detects hallucinated content and serves as a post hoc refinement tool, correcting hallucinations without retraining. Extensive experiments on four medical datasets show that CoEV combats hallucinations in VLMs. For hallucination detection, CoEV consistently outperforms existing methods, improving average PR-AUC and ROC-AUC by 3.0% and 3.9% absolute points, respectively, with notable gains of up to 18.5% in specific VQA scenarios. For hallucination correction, it improves Micro-F1 by up to 12.5%, reduces hallucination rates by over 11.9% on medical report generation, and also boosts medical VQA accuracy. These results show that CoEV enables reliable detection and correction of hallucinations, providing clinicians with dependable, evidence-based cues for diagnosis. Code will be released upon acceptance.
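The four-quadrant diagnostic map can be pictured with a small sketch. This is only an illustration of crossing two binary judgments (text factuality and visual grounding); the score fields, the 0.5 thresholds, and the quadrant names are hypothetical and are not taken from the CoEV paper, whose code is not yet released.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Statement:
    text: str
    factuality: float  # hypothetical score in [0, 1] from a text-side check
    grounding: float   # hypothetical score in [0, 1] from an evidence-region check


def quadrant(s: Statement, fact_thr: float = 0.5, ground_thr: float = 0.5) -> str:
    """Place a statement in one of four cells: factual or not, grounded or not."""
    factual = s.factuality >= fact_thr
    grounded = s.grounding >= ground_thr
    if factual and grounded:
        return "factual/grounded"
    if factual:
        return "factual/ungrounded"
    if grounded:
        return "nonfactual/grounded"
    return "nonfactual/ungrounded"


def flag_for_review(statements: List[Statement]) -> List[str]:
    """Return statements outside the fully supported cell (candidates for correction)."""
    return [s.text for s in statements if quadrant(s) != "factual/grounded"]


if __name__ == "__main__":
    report = [
        Statement("Cardiomegaly is present.", factuality=0.9, grounding=0.8),
        Statement("There is a left pleural effusion.", factuality=0.7, grounding=0.2),
        Statement("A pacemaker is visible.", factuality=0.1, grounding=0.1),
    ]
    for s in report:
        print(quadrant(s), "|", s.text)
```

Separating the two axes is what lets a statement that happens to be true but is not supported by the attended region be treated differently from one that is both false and unsupported.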

2606.19100 2026-06-18 cs.CV New submission 80%

AMALIA-VL: A Native European Portuguese Open-Source Vision and Language Model

Diogo Glória-Silva, João Cardeira, Manuel Letras da Luz, Afonso Simplício, Gonçalo Vinagre, Diogo Tavares, Rafael Ferreira, Inês Calvo, Inês Vieira, David Semedo, João Magalhães

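Affiliations: NOVA School of Science and Technology; NOVA LINCS

Topic match (visual question answering): open-source vision-language model for European Portuguese

AI summary: To address the lack of open-source multimodal models for European Portuguese, proposes AMALIA-VL, which uses three-stage training and a Portuguese-centric data mix to establish a strong baseline, with all resources to be open-sourced.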

Abstract

Large Vision and Language Models (LVLMs) have advanced rapidly, yet European Portuguese (pt-PT) remains systematically underserved by existing open-source multimodal models, which either conflate it with Brazilian Portuguese or severely under-represent it in their training data mixes. We introduce AMALIA-VL, the first open-source instruction-tuned LVLM built natively for pt-PT, pairing a high-resolution vision encoder with dynamic image tiling and a fully open pt-PT-optimized language model via a learned connector. We contribute a purposefully designed three-stage training process (vision-language alignment, general visual instruction tuning, and preference optimization) together with a pt-PT-centric multimodal data mix combining curated and translated public datasets with novel datasets that address the near-total absence of European Portuguese multimodal resources. Our evaluation shows that AMALIA-VL establishes a strong baseline for open-source pt-PT LVLMs. We will release model weights, training data, and construction pipelines along with machine-translated pt-PT evaluation benchmarks to help democratize pt-PT LVLM development.
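The abstract mentions dynamic image tiling without describing the scheme. The sketch below shows one common way such tiling is done: pick the tile grid whose aspect ratio is closest to the image's, then cut the image into that grid. The `max_tiles` budget, the tie-breaking rule, and both function names are assumptions and may differ from what AMALIA-VL actually uses.

```python
from typing import List, Tuple


def choose_tile_grid(width: int, height: int, max_tiles: int = 6) -> Tuple[int, int]:
    """Return (cols, rows) with cols * rows <= max_tiles that best matches the aspect ratio."""
    aspect = width / height
    best, best_key = (1, 1), None
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            # Prefer the closest aspect ratio; break ties in favor of more tiles.
            key = (abs(cols / rows - aspect), -cols * rows)
            if best_key is None or key < best_key:
                best, best_key = (cols, rows), key
    return best


def tile_boxes(width: int, height: int, cols: int, rows: int) -> List[Tuple[int, int, int, int]]:
    """Pixel boxes (left, top, right, bottom) covering the image in row-major order."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            top, bottom = r * height // rows, (r + 1) * height // rows
            boxes.append((left, top, right, bottom))
    return boxes


if __name__ == "__main__":
    cols, rows = choose_tile_grid(1600, 800)
    print(cols, rows, tile_boxes(1600, 800, cols, rows))
```

Each tile would then be resized to the vision encoder's input resolution, which is how a fixed-resolution encoder can see fine detail in a large image.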

2606.18271 2026-06-18 cs.AI cs.LG New submission 80%

NAVI-Orbital: First In-Orbit Demonstration of a Zero-Shot Vision-Language Model for Autonomous Earth Observation

Juan Manuel Delfa Victoria, Taran Cyriac John, Andrew W. Herson

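Affiliations: NASA Jet Propulsion Laboratory (JPL); Loft Orbital

Topic match (visual question answering): deploying a VLM in orbit for autonomous Earth observation and multimodal inference

AI summary: Presents NAVI-Orbital, a system that achieves what the authors believe to be the first autonomous multimodal inference by a vision-language model on a low Earth orbit satellite, using semantic compression to address the data downlink bottleneck.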

Comments: 17 pages, 47 figures

Abstract

As Earth Observation data generation outpaces downlink bandwidth and human-in-the-loop processing, a widening gap has emerged between onboard collection and actionable ground intelligence. This paper presents NAVI-Orbital, a software system deployed on a Low Earth Orbit (LEO) spacecraft. On April 16, 2026, NAVI-Orbital achieved what is, to the authors' knowledge, the first in-orbit demonstration of a vision-language model performing autonomous multi-modal inference entirely onboard. NAVI-Orbital uses a local vision-language model (Gemma 3) to classify each captured scene, produce a text description of its content and the relationships between its features, and respond to operator follow-up via natural-language dialogue. The system is re-tasked through plain-English prompts in place of conventional command sequences, and is orchestrated by a graph-based state machine (LangGraph) coordinating dedicated agents for detection and dialogue. Results across ground benchmarking (88.16% accuracy on the 7,960-image curated AID benchmark), Flatsat validation, and live in-orbit captures of newly acquired, previously unseen Earth imagery (including uncorrected YAM-9 imagery, processed onboard with hardware-accelerated GPU inference and no fine-tuning for the flight instrument) demonstrate the feasibility of running foundation models on satellite-class edge computers to invert the conventional acquire-then-downlink-everything bandwidth profile through semantic compression of Earth observations in-orbit.
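The orchestration idea, a graph-based state machine routing between a detection agent and a dialogue agent, can be sketched in plain Python. This deliberately does not use the LangGraph API, and the node functions are stubs standing in for calls to the onboard model; every name and the routing rule are hypothetical.

```python
from typing import Callable, Dict

State = Dict[str, object]


def detect(state: State) -> str:
    """Stub for the detection agent: classify the scene and describe it."""
    state["label"] = "harbor" if "ship" in str(state["scene"]) else "unknown"
    state["description"] = f"Scene classified as {state['label']}."
    # Route to dialogue only if the operator asked a follow-up question.
    return "dialogue" if state.get("question") else "done"


def dialogue(state: State) -> str:
    """Stub for the dialogue agent: answer the operator using the description."""
    state["answer"] = f"Regarding '{state['question']}': {state['description']}"
    return "done"


NODES: Dict[str, Callable[[State], str]] = {"detect": detect, "dialogue": dialogue}


def run(state: State, start: str = "detect", max_steps: int = 10) -> State:
    """Walk the graph: each node mutates the shared state and names the next node."""
    node = start
    for _ in range(max_steps):
        node = NODES[node](state)
        if node == "done":
            return state
    raise RuntimeError("state machine did not terminate")


if __name__ == "__main__":
    result = run({"scene": "ship near pier", "question": "Any vessels?"})
    print(result["label"], "|", result["answer"])
```

Only the short text fields in the final state would need to be downlinked, which is the sense in which a description acts as a compressed stand-in for the raw image.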

2606.17188 2026-06-18 cs.CV cs.CL New submission 80%

Not Truly Multilingual: Script Consistency as a Missing Dimension in VLM Evaluation

Prabhjot Singh, Bhushan Pawar, Madhu Reddiboina, Rajvee Sheth

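Affiliations: RediMinds Inc.; The University of Texas at Austin; Independent Researcher

Topic match (visual question answering): evaluating VLM visual reasoning across multiple writing scripts

AI summary: Proposes the PuMVR benchmark, evaluates 10 VLMs on Punjabi's three scripts, finds a substantial script gap, and proposes the Script Consistency Rate (SCR) as a necessary evaluation metric.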

Abstract

Current multilingual evaluations for Vision-Language Models (VLMs) assume a one-to-one mapping between language and orthography, overlooking billions of users of multi-script languages. We introduce PuMVR (Punjabi Multimodal Visual Reasoning), a benchmark of 1,000 strictly parallel image-text instances across Punjabi's three active scripts: Gurmukhi, Shahmukhi, and Roman. Evaluating 10 state-of-the-art VLMs, we expose a substantial and systematic Script Gap. Models frequently solve visual tasks in one script while failing identical tasks in another, with accuracy deltas reaching 16%. Crucially, visual input boosts absolute performance uniformly yet does not close the orthographic gap. Furthermore, cross-script in-context transfer is highly brittle, exposing script-locked knowledge representation. Supported by McNemar tests across all script pairs, our findings demonstrate that current "multilingual" VLMs are not truly multi-script. We propose the Script Consistency Rate (SCR), which falls as low as 24.8% on our benchmark, as a mandatory metric for script-agnostic evaluation to ensure equitable AI access. Data and code are available at: https://github.com/prabhjotschugh/Not-Truly-Multilingual-PuMVR.
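The abstract does not define the Script Consistency Rate, so the sketch below assumes one plausible reading: the share of parallel items a model answers correctly in every script. It also shows the exact (binomial) McNemar test on the discordant pairs between two scripts. The toy correctness vectors are made up for illustration and are not PuMVR results.

```python
from math import comb
from typing import Dict, List, Tuple


def script_consistency_rate(results: Dict[str, List[int]]) -> float:
    """Assumed definition: fraction of items answered correctly in all scripts."""
    per_item = list(zip(*results.values()))
    return sum(1 for item in per_item if all(item)) / len(per_item)


def discordant(a: List[int], b: List[int]) -> Tuple[int, int]:
    """Count items correct only under a, and items correct only under b."""
    only_a = sum(1 for x, y in zip(a, b) if x and not y)
    only_b = sum(1 for x, y in zip(a, b) if y and not x)
    return only_a, only_b


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Toy per-item correctness (1 = correct) on the same six items in each script.
    results = {
        "gurmukhi":  [1, 1, 0, 1, 0, 1],
        "shahmukhi": [1, 0, 0, 1, 0, 0],
        "roman":     [1, 1, 0, 0, 0, 1],
    }
    print("SCR:", script_consistency_rate(results))
    b, c = discordant(results["gurmukhi"], results["shahmukhi"])
    print("discordant:", (b, c), "p =", mcnemar_exact(b, c))
```

McNemar's test fits this setting because the same items are answered under both scripts, so only the items where the two scripts disagree carry information about a script gap.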

2606.18553 2026-06-18 cs.CV New submission 70%

Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning

Minh-Loi Nguyen, Xuan-Vu Le, Long-Bao Nguyen, Hoang-Bach Ngo, Trung-Nghia Le

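Affiliations: University of Science, VNU-HCM; Vietnam National University, Ho Chi Minh City

Topic match (visual question answering): combining a VLM and an LLM to generate image captions

AI summary: Proposes an image captioning framework augmented with hierarchical multimodal article retrieval; it uses structure-aware retrieval and contextual refinement and combines a VLM with an LLM to generate captions rich in contextual detail, placing 5th in the EVENTA 2025 Challenge.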

Comments: SOICT 2025

Abstract

Traditional image captioning methods often struggle to generate comprehensive, context-rich descriptions, especially for details not directly observable from visual cues. To overcome this, we propose a novel retrieval-augmented image captioning framework that generates captions with deeper insights, such as object attributes, event context, and underlying significance, by leveraging external knowledge. Our approach features a hierarchical multi-modal article retrieval mechanism that moves beyond monolithic text entities. This retrieval considers article structure-aware features, including weighted textual components (e.g., headlines, body sections) and visual placement patterns, alongside multi-faceted similarity computations (content-visual, visual-visual, and discourse positioning). A subsequent contextual relevance refinement stage further enhances the retrieved information. The retrieved articles then serve as the knowledge base for caption generation: first, a VLM generates a concise image description; second, we segment relevant information from the retrieved articles based on this description; and finally, an LLM utilizes both the description and extracted knowledge to generate a comprehensive, contextually detailed caption. We participated in the ACM Multimedia EVENTA 2025 Challenge and achieved 5th place with an overall score of 0.2824 on the private test set of the OpenEvent-V1 dataset. Source code is publicly released at https://github.com/mf0212/EVENTA-Challange.
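The idea of weighting textual components during article retrieval can be illustrated with a toy scorer. This is a sketch under stated assumptions, not the released EVENTA code: token-overlap (Jaccard) similarity stands in for whatever embedding similarity the authors use, and the component weights and function names are hypothetical.

```python
from typing import Dict, List

# Hypothetical weights: a headline match counts more than a body match.
COMPONENT_WEIGHTS = {"headline": 0.6, "body": 0.4}


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; a stand-in for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def article_score(description: str, article: Dict[str, str]) -> float:
    """Weighted sum of per-component similarities to the image description."""
    return sum(weight * jaccard(description, article.get(name, ""))
               for name, weight in COMPONENT_WEIGHTS.items())


def retrieve(description: str, articles: List[Dict[str, str]], top_k: int = 1) -> List[Dict[str, str]]:
    """Rank articles by weighted score and keep the top_k."""
    ranked = sorted(articles, key=lambda a: article_score(description, a), reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    articles = [
        {"id": "a", "headline": "flood hits coastal city",
         "body": "rescue teams evacuate residents"},
        {"id": "b", "headline": "team wins football final",
         "body": "fans celebrate in the stadium"},
    ]
    best = retrieve("rescue boats in a flooded coastal city", articles)
    print(best[0]["id"])
```

In the pipeline the abstract describes, the description passed to the retriever would come from the VLM, and the retrieved article text would then be handed to the LLM that writes the final caption.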
