Large Vision Models / VLM - arXivDaily Topic

2603.28387 2026-06-19 cs.AI cs.LG Version update 85%

The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation

Doan Nam Long Vu, Simone Balloccu

Affiliations: Technical University of Darmstadt

Topic match: Visual question answering. Reveals the scaffold effect of prompt framing in clinical VLM evaluation.

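AI summary: The study finds that, in clinical VLM evaluation, merely mentioning MRI availability in the prompt accounts for 70-80% of the observed performance shift, regardless of whether image data is actually present. This "scaffold effect" shows that surface-level evaluation does not reflect genuine multimodal reasoning.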

Abstract

Trustworthy clinical AI requires that performance gains reflect genuine evidence integration rather than surface-level artifacts. We evaluate 12 open-weight vision-language models (VLMs) on binary classification across two clinical neuroimaging cohorts, FOR2107 (affective disorders) and OASIS-3 (cognitive decline). Both datasets come with structural MRI data that carries no reliable individual-level diagnostic signal. Under these conditions, smaller VLMs exhibit gains of up to 58% F1 upon introduction of neuroimaging context, with distilled models becoming competitive with counterparts an order of magnitude larger. A contrastive confidence analysis reveals that merely mentioning MRI availability in the task prompt accounts for 70-80% of this shift, independent of whether imaging data is present, a domain-specific instance of modality collapse we term the scaffold effect. Expert evaluation reveals fabrication of neuroimaging-grounded justifications across all conditions, and preference alignment, while eliminating MRI-referencing behavior, collapses both conditions toward random baseline. Our findings demonstrate that surface evaluations are inadequate indicators of multimodal reasoning, with direct implications for the deployment of VLMs in clinical settings.
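One way to read the "accounts for 70-80% of this shift" claim is as a ratio of two confidence shifts. The sketch below is only an illustration under that assumption. It is not the authors' contrastive confidence analysis, and the function name, the three-condition setup, and the toy numbers are all hypothetical.

```python
def scaffold_share(baseline: float, mention_only: float, mention_with_image: float) -> float:
    """Fraction of the total confidence shift reproduced by the prompt mention alone.

    Hypothetical decomposition (an assumption, not taken from the paper):
      baseline           - mean confidence with no MRI mention and no image
      mention_only       - mean confidence when the prompt mentions MRI, no image given
      mention_with_image - mean confidence when the prompt mentions MRI and the image is given
    """
    total_shift = mention_with_image - baseline
    if total_shift == 0:
        raise ValueError("no total shift to decompose")
    # Share of the full shift that appears without any imaging data.
    return (mention_only - baseline) / total_shift


# Toy values chosen for illustration only; they are not results from the paper.
share = scaffold_share(baseline=0.50, mention_only=0.65, mention_with_image=0.70)
```

With these toy inputs the mention-only condition reproduces about three quarters of the total shift, which is the kind of pattern the abstract describes.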

2506.06952 2026-06-19 cs.CV Version update 70%

LaTtE-Flow: Layerwise Timestep-Expert Flow-based Transformer

Ying Shen, Zhiyang Xu, Jiuhai Chen, Shizhe Diao, Jiaxin Zhang, Yuguang Yao, Joy Rimchala, Ismini Lourentzou, Lifu Huang

Affiliations: University of Illinois Urbana-Champaign; University of Maryland; Nvidia; Salesforce AI Research; Intuit AI Research

Topic match: Visual question answering. Unifies image understanding and generation on top of a pretrained VLM.

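AI summary: Proposes LaTtE-Flow, an efficient unified architecture built on pretrained vision-language models. It uses layerwise timestep experts and timestep-conditioned residual attention to support both image understanding and generation, with around 6x faster inference than recent unified multimodal models.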

Comments: Unified multimodal model, Flow-matching

Abstract

Recent advances in multimodal foundation models unifying image understanding and generation have opened exciting avenues for tackling a wide range of vision-language tasks within a single framework. Despite progress, existing unified models typically require extensive pretraining and struggle to achieve the same level of performance compared to models dedicated to each task. Additionally, many of these models suffer from slow image generation speeds, limiting their practical deployment in real-time or resource-constrained settings. In this work, we propose Layerwise Timestep-Expert Flow-based Transformer (LaTtE-Flow), a novel and efficient architecture that unifies image understanding and generation within a single multimodal model. LaTtE-Flow builds upon powerful pretrained Vision-Language Models (VLMs) to inherit strong multimodal understanding capabilities, and extends them with a novel Layerwise Timestep Experts flow-based architecture for efficient image generation. LaTtE-Flow distributes the flow-matching process across specialized groups of Transformer layers, each responsible for a distinct subset of timesteps. This design significantly improves sampling efficiency by activating only a small subset of layers at each sampling timestep. To further enhance performance, we propose a Timestep-Conditioned Residual Attention mechanism for efficient information reuse across layers. Experiments demonstrate that LaTtE-Flow achieves strong performance on multimodal understanding tasks, while achieving competitive image generation quality with around 6x faster inference speed compared to recent unified multimodal models.
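The layerwise timestep-expert idea in the abstract (groups of Transformer layers, each handling a distinct subset of timesteps) can be pictured as a routing table from timestep to layer group. The sketch below is a minimal illustration, not the paper's implementation. It assumes contiguous, equally sized layer groups, a timestep normalized to [0, 1], and a uniform split of that range across groups; the function names are hypothetical.

```python
from typing import List


def build_timestep_groups(num_layers: int, num_groups: int) -> List[List[int]]:
    """Split layer indices 0..num_layers-1 into contiguous, equally sized groups.

    Equal contiguous groups are an assumption made for this sketch.
    """
    if num_groups <= 0 or num_layers % num_groups != 0:
        raise ValueError("num_layers must be a positive multiple of num_groups")
    size = num_layers // num_groups
    return [list(range(g * size, (g + 1) * size)) for g in range(num_groups)]


def active_layers(t: float, groups: List[List[int]]) -> List[int]:
    """Return the layer indices assumed to run at normalized timestep t in [0, 1].

    Each group owns one equal-width slice of the timestep range (assumption).
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    # Clamp so that t == 1.0 falls into the last group instead of overflowing.
    idx = min(int(t * len(groups)), len(groups) - 1)
    return groups[idx]


groups = build_timestep_groups(num_layers=24, num_groups=4)
layers_at_mid = active_layers(0.5, groups)  # layers 12..17 under these assumptions
```

In this toy configuration only 6 of 24 layers run at any sampling step, which illustrates why activating a small subset of layers per timestep can reduce sampling cost.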

2504.02885 2026-06-19 cs.CL Version update 70%

Med-R2: Perception and Reflection-driven Complex Reasoning for Medical Report Generation

Hao Wang, Shuchang Ye, Jinghao Lin, Usman Naseem, Jinman Kim

Affiliations: The School of Computer Science, The University of Sydney; The School of Computing, Macquarie University; Doubao Medical Group, ByteDance

Topic match: Visual question answering. Medical report generation with vision-language models.

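AI summary: Proposes Med-R2, a fine-tuning strategy that adds a perception-driven long reasoning process guided by radiology-specific knowledge, plus a reflection mechanism that corrects perceptual errors, improving the pathological feature perception and diagnostic accuracy of LVLMs in medical report generation.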

Comments: 28 pages, 3 figures, 1 table

Abstract

Automated medical report generation (MRG) is increasingly used to reduce the burden of manual reporting and for decision support. Large vision-language models (LVLMs) hold great promise for automated MRG due to their fine-grained image-text alignment and advanced text-generation capabilities. Currently, state-of-the-art MRGs primarily focus on adapting pre-trained LVLMs with direct supervised fine-tuning (SFT), a fine-tuning strategy with medical image-report pairs. However, several factors limit the performance of these LVLMs. Firstly, direct SFT enables LVLMs to generate medical reports directly without an intermediate thinking process of pathological feature perception and diagnostic reasoning. This causes a potential failure to perceive pathological features and thus leads to misdiagnosis. Secondly, direct SFT lacks the incorporation of radiology-specific knowledge guidance, causing LVLMs to misinterpret perceived pathological features and make incorrect diagnoses. To address these gaps, we propose a novel fine-tuning strategy named Med-R2. We introduce a perception-driven long reasoning process that precedes report generation and incorporates radiology-specific knowledge as guidance. Additionally, to alleviate potential perceptual errors in complex reasoning, a reflection mechanism is introduced to refine the perception of pathological features and the generated report. Our experiments demonstrate that Med-R2 effectively enhances the capability of pathological features perception and diagnosis accuracy for MRG via fine-tuned LVLMs.
