arXivDaily arXiv每日学术速递 周一至周五更新

AI 大模型

视觉大模型 / VLM

视觉语言模型、视觉推理、视觉问答、图文理解和视觉 grounding。

今日/当前日期收录 4 信号源:cs.CV, cs.AI, cs.LG
2606.19646 2026-06-19 cs.IR cs.CV 新提交 85%

SAFE-Cascade: Cost-Adaptive Vision-Language Routing for Chart Question Answering

SAFE-Cascade: 面向图表问答的成本自适应视觉语言路由

Ayush Dwivedi, Qixin Wang, Ashvi Soni, Ruoteng Wang, Han Li, Animesh Mahapatra, Neeraj Agrawal, Xintao Wu

发表机构 * University of Arkansas(亚拉巴马大学)

专题命中 视觉问答 :提出成本自适应路由系统,用于图表问答,涉及VLM调用决策。

AI总结 提出SAFE-Cascade系统,通过OCR和轻量语言模型先给出答案,再由学习路由器决定是否调用VLM,在ChartQA上以73.1%的VLM调用率达到69.1%准确率,减少26.9%的VLM调用和9.3%的成本。

Comments Demo paper submitted at CIKM 2026. 4 pages, 2 figures

详情
AI中文摘要

视觉语言模型(VLM)在图表问答中表现出色,但若每个查询都调用VLM,当许多问题可通过OCR文本和轻量语言推理回答时,成本会不必要地高昂。我们展示了SAFE-Cascade,一个用于成本自适应图表问答的交互系统。给定图表图像和自然语言问题,SAFE-Cascade首先通过OCR提取图表文本,从纯文本语言模型获得临时答案,然后使用学习路由器决定接受文本答案还是升级到VLM。该演示向用户展示这一决策过程:OCR证据、纯文本答案、路由概率、升级决策、最终答案、估计成本和估计延迟并排显示。SAFE-Cascade被设计为一个透明界面,用于理解何时实际需要视觉基础。用户可以上传或选择图表、提问、检查每条路径使用的证据、比较纯文本和VLM答案,并调整升级阈值以探索准确率-成本边界。该系统使用Azure Document Intelligence进行OCR,gpt-5-mini作为纯文本模型,gemini-2.5-flash-image作为VLM,以及基于推理时特征训练的随机森林路由器。在从2500个样本实验中留出的375个ChartQA测试集上,SAFE-Cascade实现了69.1%的统一准确率和73.1%的VLM调用率,而全VLM基线为67.7%准确率和100% VLM调用率。观察到的+1.4个百分点差异在统计上不确定,因此我们将SAFE-Cascade解释为匹配全VLM性能,同时减少26.9%的VLM调用和9.3%的估计成本。该演示展示了选择性模态路由如何使多模态知识系统更加透明、可调优和成本感知。

英文摘要

Vision-language models (VLMs) are powerful for chart question answering, but invoking a VLM for every query can be unnecessarily expensive when many questions are answerable from OCR text and lightweight language reasoning. We demonstrate SAFE-Cascade, an interactive system for cost-adaptive chart question answering. Given a chart image and a natural-language question, SAFE-Cascade first extracts chart text with OCR, obtains a provisional answer from a text-only language model, and then uses a learned router to decide whether to accept the text answer or escalate to a VLM. The demo exposes this decision process to users: OCR evidence, text-only answer, routing probability, escalation decision, final answer, estimated cost, and estimated latency are shown side by side. SAFE-Cascade is designed as a transparent interface for understanding when visual grounding is actually needed. Users can upload or select charts, ask questions, inspect the evidence used by each pathway, compare text-only and VLM answers, and adjust the escalation threshold to explore the accuracy-cost frontier. The system is implemented with Azure Document Intelligence for OCR, gpt-5-mini as the text-only model, gemini-2.5-flash-image as the VLM, and a Random Forest router trained on inference-time features. On a held-out ChartQA test split of 375 examples from a 2,500-example experiment, SAFE-Cascade achieves 69.1% unified accuracy with 73.1% VLM invocation, compared with 67.7% accuracy and 100% VLM invocation for the full-VLM baseline. The observed +1.4 percentage-point difference is statistically uncertain, so we interpret SAFE-Cascade as matching full-VLM performance while reducing VLM calls by 26.9% and estimated cost by 9.3%. The demonstration shows how selective modality routing can make multimodal knowledge systems more transparent, tunable, and cost-aware.

2606.20477 2026-06-19 cs.CV cs.CL cs.LG 新提交 80%

Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology

面向放射学的空间定位2D视觉-语言模型的可扩展训练

Yusuf Salcan, Simon Ging, Robin Schirrmeister, Philipp Arnold, Elmar Kotter, Behzad Bozorgtabar, Thomas Brox

发表机构 * Computer Vision Group, University of Freiburg, Germany(德国弗莱堡大学计算机视觉组) Department of Radiology, Medical Center -- University of Freiburg, Germany(德国弗莱堡大学医学中心放射科) CRIION-AI Lab, Freiburg, Germany(德国弗莱堡CRIION-AI实验室)

专题命中 视觉问答 :联合报告生成、VQA和空间定位

AI总结 提出RefRad2D大规模双语数据集,通过LLM和自动分割生成空间定位数据,训练RadGrounder模型联合完成报告生成、VQA和空间定位,在外部基准上取得竞争性结果。

Comments Accepted for MICCAI 2026. First two authors: equal contribution. Last two authors: equal supervision

详情
AI中文摘要

我们研究了如何在没有手动空间标注的情况下,为放射学训练具有视觉定位能力的视觉-语言模型(VLM)。我们引入了RefRad2D,这是一个大规模的双语(德语/英语)数据集,包含来自临床实践的120万对CT和MR图像-文本对,并通过基于LLM的筛选和自动分割自动生成任务特定的VQA和空间定位子集。在此数据上训练的模型RadGrounder联合执行报告生成、视觉问答以及通过边界框检测或分割进行的空间定位。在外部VQA基准(Slake,VQA-RAD)上,RadGrounder取得了与专用医学VLM竞争的结果。将我们的临床数据加入训练混合集,相比于仅在下游数据集上微调,提高了开放式VQA的性能,显示了数据集的迁移性。关键在于,添加定位监督不会降低语言质量,从而在不牺牲VQA性能的情况下实现空间可验证的输出。

英文摘要

We study how to train visually grounded vision-language models (VLMs) for radiology without manual spatial annotations. We introduce RefRad2D, a large-scale bilingual (German/English) dataset of 1.2M CT and MR image-text pairs derived from clinical practice, with task-specific VQA and spatial grounding subsets generated automatically via LLM-based curation and automated segmentation. Trained on this data, our model RadGrounder jointly performs report generation, visual question answering, and spatial grounding via bounding-box detection or segmentation. On external VQA benchmarks (Slake, VQA-RAD), RadGrounder achieves competitive results with specialized medical VLMs. Adding our clinical data to the training mixture improves open-ended VQA over fine-tuning on the downstream datasets alone, showing the transferability of our dataset. Crucially, adding grounding supervision does not degrade language quality, enabling spatially verifiable outputs at no cost to VQA performance.

2606.20561 2026-06-19 cs.CV 新提交 70%

TimeProVe: Propose, then Verify for Efficient Long Video Temporal Reasoning in Activities of Daily Living

TimeProVe: 先提出后验证,实现日常活动中的高效长视频时间推理

Arkaprava Sinha, Dominick Reilly, Siddharth Krishnan, Hieu Le, Srijan Das

发表机构 * University of North Carolina, Charlotte(北卡罗来纳大学夏洛特分校)

专题命中 视觉问答 :使用VLM进行长视频问答验证

AI总结 提出TimeProVe框架,先通过轻量模块生成基于动作的候选假设,再调用昂贵VLM验证,在长视频问答中降低75%VLM调用和93%推理成本,性能提升7.3%。

详情
AI中文摘要

长视频问答(LVQA)需要在数小时未修剪的视频中识别稀疏的、与查询相关的证据。现有方法要么使用大型视觉语言模型(VLM)密集处理视频,导致计算成本过高,要么依赖稀疏的基于字幕的推理,这往往会遗漏时间局部化和以运动为中心的证据。我们提出TimeProVe,一种用于长视频中时间基础推理的高效混合框架。TimeProVe首先使用轻量模块生成基于动作的答案-证据假设,随后仅调用昂贵的VLM进行针对性验证。我们框架的核心在于基于动作的候选证据(ACE)模块,该模块通过轻量级LLM推理将时间局部化的动作转换为查询条件化的候选答案和支持证据窗口。我们进一步引入OpenTSUBench(OTB),一个开放基准测试,旨在评估真实世界日常活动(ADL)场景中的时间基础推理。实验表明,TimeProVe在OTB上比最强基线高出7.3%,同时减少了75%的VLM调用和93%的推理成本。此外,在没有显式时间基础训练的情况下,TimeProVe在Charades-STA上取得了竞争性性能,并在结合基础VLM增强时达到了最先进的结果。

英文摘要

Long Video Question Answering (LVQA) requires identifying sparse, query-relevant evidence within hours-long untrimmed videos. Existing approaches either process videos densely with large vision-language models (VLMs), incurring prohibitive computational cost, or rely on sparse caption-based reasoning, which often misses temporally localized and motion-centric evidence. We introduce TimeProVe, a cost-efficient hybrid framework for temporally grounded reasoning in long videos. TimeProVe first employs lightweight modules to generate action-grounded answer--evidence hypotheses and subsequently invokes an expensive VLM only for targeted verification. The core of our framework lies in the Action-based Candidate Evidence (ACE) module, which converts temporally localized actions into query-conditioned candidate answers and supporting evidence windows through lightweight LLM reasoning. We further introduce OpenTSUBench (OTB), an open-ended benchmark designed to evaluate temporally grounded reasoning in real-world Activities of Daily Living (ADL) scenarios. Experiments show that TimeProVe outperforms the strongest baseline on OTB by 7.3%, while reducing VLM calls by 75% and inference cost by 93%. Furthermore, without explicit temporal grounding training, TimeProVe achieves competitive performance on Charades-STA, and reaches state-of-the-art results when enhanced with grounding VLMs.

2606.19684 2026-06-19 cs.CV 新提交 70%

Exploring Multi-Modal Large Language Models and Two-Stage Fine-Tuning for Fashion Image Retrieval

探索多模态大语言模型与两阶段微调在时尚图像检索中的应用

Nguyen Cao Hoang, Hoang Bui Le, Nam Vo Hoang, Trung-Nghia Le

发表机构 * University of Science, VNU-HCM(胡志明市国家大学下属理科大学) Vietnam National University, Ho Chi Minh(胡志明市国家大学)

专题命中 视觉问答 :利用LLaVA生成属性感知三元组进行时尚图像检索

AI总结 提出融合多模态大语言模型(LLaVA)生成属性感知三元组,并采用两阶段微调策略增强对比学习,以解决时尚图像检索中标注数据稀缺和负采样简单的问题。

Comments SOICT 2025

详情
AI中文摘要

组合图像检索通过参考图像和修改文本描述的复合查询来检索目标图像。在时尚领域,该任务需要理解颜色、图案和纹理等细微属性变化。然而,现有方法因标注数据稀缺和负采样简单而面临局限性。我们提出了一种新颖框架,该框架集成多模态大语言模型(LLaVA)以生成属性感知三元组,并引入两阶段微调策略来增强对比学习。我们利用预训练的视觉-语言模型(如CLIP-ViT/B32)生成句子级提示并与相对描述拼接,以及使用静态表示来增加负样本数量。实验结果表明,该框架增强了组合推理能力并改进了细粒度检索行为,突显了所提框架在时尚检索中的可行性和潜力。

英文摘要

Composed image retrieval retrieves a target image using a composed query of a reference image and a modified text description. In the fashion domain, this task requires understanding subtle attribute variations such as color, pattern, and texture. However, existing approaches face limitations due to scarce annotated data and simplistic negative sampling. We propose a novel framework that integrates a multi-modal large language model (LLaVA) to generate attribute-aware triplets and introduces a two-stage fine-tuning strategy to enhance contrastive learning. We leverage pretrained vision-language models, such as CLIP-ViT/B32, to generate and concatenate sentence-level prompts with the relative caption and to scale the number of negatives using static representations. Experimental results demonstrate enhanced compositional reasoning and improved fine-grained retrieval behavior, underscoring the feasibility and potential of the proposed framework for fashion retrieval.