Multimodal Large Models

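2606.14702 2026-06-18 cs.CV New submission, topic 90

OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains

Xinyue Cai, Chaoyou Fu, Yi-Fan Zhang, Ran He, Caifeng Shan

Affiliations: Nanjing University; CASIA (Institute of Automation, Chinese Academy of Sciences)

Topic match: audio-visual multimodal (audio-visual reasoning dataset and QA)

AI summary: Proposes the OmniVideo-100K dataset, which uses entity-anchored video scripts and clue-guided QA generation to address cross-segment entity inconsistency and weak long-range reasoning in audio-visual QA; fine-tuned models improve markedly on several benchmarks.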

Comments Project page: https://github.com/MiG-NJU/OmniVideo-100K

Abstract

Current automated pipelines for audio-visual Question Answering (QA) generally adopt a "video-caption-QA" paradigm. However, these methods typically segment videos into short clips and generate separate descriptions for audio and visual modalities. This decoupled processing severs inherent associations between sounds and their visual sources, while independent clip processing often causes inconsistent descriptions of the same entity across segments. Furthermore, coupling long-text comprehension and QA synthesis into a single step often restricts models to localized events, yielding questions lacking long-term temporal connections and deep cross-modal reasoning. To address these issues, we propose an automated data engine featuring two mechanisms: (1) Entity-Anchored Video Scripting transforms videos into structured scripts, comprising summaries, main entity lists, and segment-wise audio-visual descriptions. The entity list serves as a global prior to ensure cross-segment referential consistency and reconstruct audio-visual associations. (2) Clue-Guided QA Generation prompts models to first mine cross-segment, multimodal clues from the script, and subsequently generate QA pairs based on these high-value clues. Leveraging this pipeline, we construct the instruction-tuning dataset OmniVideo-100K and a human-verified test set, OmniVideo-Test. Fine-tuning VITA-1.5, Qwen2.5-Omni-7B and Qwen3-Omni-30B on OmniVideo-100K yields performance gains of up to 20.59% on OmniVideo-Test, demonstrating strong generalization (up to 12.64% improvements) across established benchmarks like Daily-Omni and JointAVBench.
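The structured script described in the abstract (summary, main entity list, segment-wise descriptions) could be represented roughly as follows. This is only an illustrative sketch: the class names, field names, and the two helper methods are assumptions made for this example and are not taken from the OmniVideo-100K release.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    start_s: float      # segment start time in seconds (hypothetical field)
    end_s: float        # segment end time in seconds (hypothetical field)
    description: str    # joint audio-visual description of this segment
    entity_ids: list[str] = field(default_factory=list)


@dataclass
class VideoScript:
    summary: str
    entities: dict[str, str]   # entity id -> canonical description (the global prior)
    segments: list[Segment]

    def unknown_references(self) -> list[tuple[int, str]]:
        """Return (segment index, entity id) pairs missing from the entity list."""
        return [
            (i, eid)
            for i, seg in enumerate(self.segments)
            for eid in seg.entity_ids
            if eid not in self.entities
        ]

    def cross_segment_entities(self) -> list[str]:
        """Entity ids referenced in two or more segments, sorted."""
        counts: dict[str, int] = {}
        for seg in self.segments:
            for eid in set(seg.entity_ids):
                counts[eid] = counts.get(eid, 0) + 1
        return sorted(eid for eid, n in counts.items() if n >= 2)


# Toy script with invented content, for illustration only.
script = VideoScript(
    summary="A street musician plays guitar while a dog barks nearby.",
    entities={"E1": "man in a red jacket playing guitar", "E2": "small brown dog"},
    segments=[
        Segment(0.0, 10.0, "E1 strums a guitar; guitar chords are audible.", ["E1"]),
        Segment(10.0, 20.0, "E2 barks at E1, who keeps playing.", ["E1", "E2"]),
    ],
)
```

Entities that recur across segments (here `E1`) are natural candidates for the cross-segment clues that the second mechanism mines before writing QA pairs.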


2606.19157 2026-06-18 eess.AS cs.CL New submission, topic 85

IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages

Sakshi Joshi, Dhruv Subhash Rathi, Sanskar Singh, Eldho Ittan George, R J Hari, Kaushal Bhogale, Mitesh M. Khapra

Affiliations: AI4Bharat, Indian Institute of Technology Madras, India; Sarvam AI, India

Topic match: audio-visual multimodal (evaluating context utilisation in audio large language models)

AI summary: Proposes the IndicContextEval benchmark, comprising 56 hours of natural speech from 555 speakers across 8 Indian languages, and uses a 7-level prompting framework to evaluate whether audio large language models genuinely use context rather than rely on parametric knowledge.

Comments Accepted at Interspeech 2026

Abstract

AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists. However, it remains unclear whether these models genuinely utilise such context or rely on parametric knowledge learned during pretraining. Existing benchmarks cannot answer this question because they evaluate transcription under fixed prompting conditions and rarely include explicit contextual inputs. We introduce IndicContextEval, a 56-hour multilingual benchmark of natural speech from 555 speakers across 8 Indian languages and 23 professional domains. We design a 7-level prompting framework that progressively introduces contextual signals, including metadata, natural-language descriptions, entity lists in English and native script, and adversarial prompts with incorrect entities. Evaluating five models reveals substantial differences in context utilisation behaviour, highlighting the need for explicit evaluation of contextual grounding in AudioLLMs.


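2606.18924 2026-06-18 cs.SD New submission, topic 85

Who Wins the Conflict? Mechanistic Interpretability of Text Bias in Audio LLMs

Hyebin Cho, Suho Yoo, Jaehyuk Jang, Changick Kim, Joon Son Chung

Affiliations: School of Electrical Engineering, KAIST

Topic match: audio-visual multimodal (analysis of the text-bias mechanism in audio LLMs)

AI summary: Through mechanistic analysis, reveals text-dominance bias in audio LLMs, finds that the text pathway actively suppresses intact audio representations, and applies back-patching, a training-free intervention, to strengthen audio representations and mitigate text dominance.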

Comments Preprint

Abstract

While Audio Large Language Models (Audio LLMs) excel at multimodal understanding, they suffer from text dominance, a bias where models blindly favor text over acoustic evidence, causing hallucinations. However, the internal mechanisms underlying how these models behave when audio and textual inputs contradict each other remain unexplored. In this work, we present the first mechanistic analysis of this phenomenon by tracing the propagation of internal representations across layers. Our investigation reveals three key findings: (i) text dominance is systematically and empirically present across models; (ii) while text and audio rely on functionally distinct pathways, they ultimately converge into a shared semantic space in late layers; and (iii) the text pathway does not erase audio information, but rather actively suppresses intact audio representations. Building on these insights, we leverage back-patching, a training-free intervention that routes late-layer audio activations back into earlier layers. This amplifies the audio representations, enabling them to overcome textual suppression. Our evaluation shows that back-patching consistently reduces text dominance, paving the way for mechanistic multimodal alignment under conflict.
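The routing idea behind back-patching (take an activation from a late layer and feed it back into an earlier layer, then continue the forward pass) can be illustrated with a toy layer stack. This sketch uses scalar "activations" and plain Python functions as layers; it is an assumption-laden illustration of the general mechanism, not the authors' implementation, which operates on audio-token activations inside an Audio LLM.

```python
def run_layers(x, layers, patch=None):
    """Run a toy layer stack, optionally overwriting the input of one layer.

    patch: (layer_index, activation) replaces the hidden state entering
           `layer_index` with `activation`.
    Returns all hidden states: hidden[k] is the input to layer k and
    hidden[-1] is the final output.
    """
    hidden = [x]
    for k, layer in enumerate(layers):
        h = hidden[-1]
        if patch is not None and patch[0] == k:
            h = patch[1]
            hidden[-1] = h
        hidden.append(layer(h))
    return hidden


def back_patch(x, layers, source_layer, target_layer):
    """Re-run the stack with the state entering `source_layer` fed into `target_layer`."""
    assert target_layer < source_layer, "back-patching routes late to early"
    first_pass = run_layers(x, layers)
    return run_layers(x, layers, patch=(target_layer, first_pass[source_layer]))


# Three toy layers standing in for transformer blocks.
layers = [lambda h: h + 1.0, lambda h: h * 2.0, lambda h: h + 3.0]
baseline = run_layers(0.0, layers)[-1]
patched = back_patch(0.0, layers, source_layer=2, target_layer=1)[-1]
```

Because the patched state passes through the intermediate layers a second time, its contribution to the output is amplified relative to the baseline pass, which is the intuition the abstract gives for overcoming textual suppression.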


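2606.18273 2026-06-18 cs.CL cs.AI cs.SD eess.AS New submission, topic 85

Continuous Audio Thinking for Large Audio Language Models

Gyojin Han, Dong-Jae Lee, Changho Choi, Jongsuk Kim, Junmo Kim

Affiliations: KAIST

Topic match: audio-visual multimodal (proposes the CoAT framework to strengthen continuous audio thinking in audio language models)

AI summary: Proposes the Continuous Audio Thinking (CoAT) framework, which organizes acoustic information in a continuous latent space via expert distillation so that audio language models can use rich acoustic features before generating a response, at no extra autoregressive decoding cost, improving performance on multiple audio tasks.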

Comments Preprint

Abstract

Large audio language models (LALMs) have shown impressive capabilities on diverse audio understanding tasks, ranging from speech transcription to music analysis. However, because LALMs are typically trained to produce text-aligned responses, their hidden states are progressively shaped for text generation rather than for preserving acoustic information. As a result, the diverse acoustic content that audio carries, such as phonetic detail, prosody, sound events, affect, and pitch, is lost along the way and difficult to leverage in the response. We introduce Continuous Audio Thinking (CoAT), a framework that equips audio language models with a continuous latent workspace for organizing acoustic information prior to response generation, grounded by distillation from audio experts. Within the thinking space, the model can utilize the rich acoustic information provided by expert distillation when generating its response. Furthermore, the proposed continuous thinking block can be processed in a single prefill, so CoAT does not require additional autoregressive decoding cost over the baseline. Across three LALMs, Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo 3, performance gains on a broad benchmark suite spanning audio reasoning, audio understanding, music classification, speech emotion, and speech transcription demonstrate the effectiveness of CoAT. Further analysis confirms that the auxiliary supervision propagates from the thinking positions to the model's textual responses.


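2606.19203 2026-06-18 eess.AS New submission, topic 80

DASH: Dual-View Self-Distillation with Multi-Layer Hidden Representations for Robust Speech Recognition

Jaeeun Baik, Ui-Hyeop Shin, Jiwoon Lee, Woocheol Jeong, Hyung-Min Park

Topic match: audio-visual multimodal (self-distillation framework for robust speech recognition; audio processing)

AI summary: Proposes DASH, a self-distillation framework that learns clean-noisy consistency from two views, distills hidden representations from multiple encoder layers, and minimizes the KL divergence between prototype assignment distributions, improving noise robustness while preserving clean accuracy, with extra overhead of only about 4% of fine-tuning time.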

Comments Accepted to Interspeech 2026

Abstract

Automatic Speech Recognition (ASR) often degrades in real-world noisy environments, making noise robustness essential for deployment. Supervised noise-augmented fine-tuning is a common remedy, but it can introduce a robustness-clean trade-off and overfit to specific corruptions, degrading recognition in clean conditions. We propose DASH, a self-distillation framework that improves robustness by learning clean-noisy consistency from paired views. DASH distills hidden representations from multiple encoder layers to capture features from low-level acoustics to high-level semantics, and stabilizes training by minimizing KL divergence between prototype assignment distributions of clean and noisy views. Experiments on LibriSpeech show that DASH consistently improves recognition under diverse noisy conditions while preserving clean accuracy. This is achieved by adding a label-free pre-training stage to standard fine-tuning, with minimal additional overhead (about 4% of fine-tuning time).
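The consistency term described above, a KL divergence between the prototype assignment distributions of a clean view and a noisy view, can be sketched in a few lines. The prototype vectors, the temperature, the dot-product similarity, and the KL direction (clean as the reference) are all assumptions chosen for illustration; the abstract does not specify them.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def prototype_assignment(feature, prototypes, temperature=0.1):
    """Soft assignment of one feature vector to each prototype."""
    return softmax([dot(feature, p) for p in prototypes], temperature)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Toy values: three prototypes and one frame feature per view (all invented).
prototypes = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
clean_feature = [0.9, 0.1]
noisy_feature = [0.6, 0.4]

p_clean = prototype_assignment(clean_feature, prototypes)
p_noisy = prototype_assignment(noisy_feature, prototypes)
consistency_loss = kl_divergence(p_clean, p_noisy)  # clean view as reference (assumption)
```

Minimizing this quantity pulls the noisy view's assignment toward the clean view's, which is one plausible reading of "clean-noisy consistency".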


2606.19338 2026-06-18 cs.CV New submission, topic 85

Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

Shengyuan Ding, Xilin Wei, Xinyu Fang, Haodong Duan, Dahua Lin, Jiaqi Wang, Yuhang Zang

Affiliations: Fudan University; Shanghai Innovation Institute; Shanghai Artificial Intelligence Laboratory; Zhejiang University; The Chinese University of Hong Kong

Topic match: multimodal evaluation (non-Markov games for evaluating multimodal model memory)

AI summary: Proposes the RNG-Bench benchmark suite, which uses two games, Matching Pairs and 3D Maze, to evaluate how well multimodal large language models reconstruct past observations and act on them in non-Markov environments; finds that errors stem mainly from forgetting rather than decision making, and that fine-tuning improves performance.

Abstract

Deploying multimodal foundation models as closed-loop policies increasingly requires conditioning actions on observations that are no longer visible. However, existing benchmarks either expose the full state, conflate hidden-state reconstruction with other agent skills, or test recall only after an episode has ended. We introduce RNG-Bench (Reconstructive Non-Markov Games), a benchmark suite designed to isolate a base model's ability to reconstruct past observations and act on them during multi-step interaction. RNG-Bench includes two complementary games: Matching Pairs, where card identities briefly revealed at specific locations must later be recalled, and 3D Maze, where egocentric views must be integrated into a spatial map. Both games are evaluated under a unified harness with three controlled difficulty axes: grid size, visual pattern, and observation modality. The benchmark further introduces a head-to-head duel protocol to control for instance-level variance and a Memory Gap metric that disentangles forgetting from poor action selection. The hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode, and remain far from saturated by frontier MLLMs. Memory Gap analysis shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decision making. Finally, fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench and transfers to existing benchmarks without degrading general multimodal capability.


2606.19120 2026-06-18 cs.LG cs.CV New submission, topic 85

Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

Sihan Wang, Xiyao Liu, Lianqing Liu, Zhi Han

Affiliations: State Key Laboratory of Robotics and Intelligent Systems, Shenyang Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences

Topic match: image-text multimodal (MLLM post-training framework that decouples perception and reasoning)

AI summary: Proposes the ViGOS framework, which decouples perception from reasoning to avoid text shortcuts during MLLM post-training and improve image-grounded behavior.

Comments 29 pages, 5 figures, 8 tables

Abstract

On-policy self-distillation (OPSD) trains a model on its own rollouts and uses a frozen copy to provide dense token-level targets conditioned on a reference target. This works well for LLM reasoning, but a direct extension to multimodal large language models (MLLMs) can create a shortcut: the privileged target may guide tokens mainly based on the text reference target rather than the image. We propose ViGOS, a visually grounded OPSD framework for MLLM post-training. The student first writes a visual description and then reasons toward the final answer. For valid rollouts, an image-only perception teacher supervises the description, while a privileged reasoning teacher supervises the reasoning and final answer on the same student prefix. A reference teacher is used only for invalid rollouts to recover the output format. Across general vision-language, expert reasoning, visual math, spatial grounding, and visual-language-prior benchmarks, ViGOS keeps the main benefits of OPSD and improves image-grounded behavior in shortcut-prone settings.


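2606.18988 2026-06-18 cs.AI New submission, topic 85

ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection

Jinhao Song, Shan Liang, Yiqun Yue, Zhuhuayang Zhang, Tianqi Gao

Affiliations: Xi'an Jiaotong-Liverpool University

Topic match: image-text multimodal (brings multimodal large language models to interpretable deception detection, combining vision and audio)

AI summary: Proposes the ThinkDeception framework, which brings multimodal large language models to deception detection and achieves interpretable cognitive reasoning through step-by-step reasoning and Visual-Audio Consistency Group Relative Policy Optimization (VAC-GRPO), reaching a new SOTA on mainstream benchmarks.

Comments 10 pages, 4 figures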

Abstract

Multimodal deception detection is critical for identifying fraudulent intentions, yet existing approaches predominantly rely on end-to-end black-box paradigms. These methods suffer from a severe lack of interpretability, failing to provide transparent reasoning trajectories and struggling to explicitly capture the subtle cross-modal inconsistencies inherent in deceptive behaviors. To transcend these limitations, we propose ThinkDeception, a novel and interpretable multimodal deception detection framework. As a pioneering effort, it introduces Multimodal Large Language Models (MLLMs) into this domain, transforming deception detection from a traditional binary classification task into an explicit cognitive reasoning process. Facilitated by the first meticulously annotated step-by-step multimodal Chain of Thought (CoT) dataset, we develop a foundational model, ThinkDeception Base, empirically validating the critical role of modal inconsistency in decoding deception. Building upon this foundation, our core innovation lies in proposing Visual-Audio Consistency Group Relative Policy Optimization (VAC-GRPO) equipped with a progressive training strategy. Distinct from standard GRPO, we stratify the training data into four progressive difficulty tiers, guiding the model through a psychologically grounded easy-to-hard cognitive transition. By innovatively coupling this dynamic curriculum scheduler with a multi-dimensional, process-aware reward mechanism and a reflective learning paradigm, we significantly elevate the model's overall reasoning quality. Extensive experiments on mainstream benchmarks demonstrate that ThinkDeception establishes a new SOTA, significantly outperforming existing methods in both detection accuracy and rationale quality. Ultimately, this work successfully drives the field of deception detection toward interpretable, multimodal cognitive reasoning.


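2606.18780 2026-06-18 cs.CV cs.CL cs.MM New submission, topic 85

SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction

Quanjiang Guo, Chong Mu, Jiazhou Pan, Ming Jia, Ling Tian, Hui Gao, Zhao Kang

Affiliations: School of Computer Science and Engineering, University of Electronic Science and Technology of China

Topic match: image-text multimodal (multimodal information extraction with multi-expert MLLM data augmentation)

AI summary: Proposes SAMA, a semantic anchor-aligned augmentation framework that builds structured semantic anchors to guide a multi-expert multimodal large language model in generating high-fidelity text, synthesizes images with an anchor-preserving diffusion mechanism, and adds a dual-constraint filtering module, yielding significant gains on low-resource multimodal information extraction.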

Comments Accepted by IEEE Transactions on Multimedia

Abstract

Multimodal Information Extraction (MIE), which covers tasks such as Multimodal Named Entity Recognition (MNER), Relation Extraction (MRE), and Event Extraction (MEE), is essential for understanding multimedia content but remains constrained by severe data scarcity. Although data augmentation is a promising remedy, existing approaches are impeded by coarse cross-modal alignment and fragmented, task-specific designs that fail to exploit shared semantic knowledge. To overcome these limitations, we introduce Semantic Anchor-aligned Multimodal Augmentation (SAMA), a unified framework for generating high-fidelity, task-aware synthetic data. SAMA constructs structured semantic anchors from ground-truth labels to guide a Collaborative Multi-Experts Multimodal Large Language Model (CME-MLLM), which integrates a Universal Adapter for shared semantics with Task-Specific Adapters to produce diverse yet constraint-compliant textual samples. For image synthesis, SAMA employs an Anchor-Preserving Diffusion mechanism that uses anchor-weighted prompts and latent conditioning to maintain critical semantic anchors while diversifying visual contexts. To eliminate the need for manual verification, SAMA further introduces a Dual-Constraint Filtering module that selects synthetic samples based on both cross-modal consistency and anchor fidelity. Extensive experiments across benchmark datasets for MNER, MRE, and MEE demonstrate that SAMA consistently outperforms state-of-the-art augmentation baselines under both fully supervised and low-resource settings, underscoring its versatility, robustness, and effectiveness.


2606.17030 2026-06-18 cs.CV New submission, topic 85

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

Jie Zhang, Xiaoyue Chen, Anzhe Chen, Dayiheng Liu, Deqing Li, Gengze Zhou, Hale Yin, Haoqi Yuan, Haoyang Li, Jiahao Li, Jiazhao Zhang, Jingren Zhou, Kaiyuan Gao, Kun Yan, Lihan Jiang, Ningyuan Tang, Pei Lin, Qihang Peng, Shengming Yin, Tianhe Wu, Tianyi Yan, Xiao Xu, Yan Shu, Yanran Zhang, Ye Wang, Yi Wang, Yilei Chen, Yixian Xu, Yiyang Huang, Yuxiang Chen, Zekai Zhang, Zhendong Wang, Zixing Lei, Zhixuan Liang, Zihao Liu, Zikai Zhou, Chenxu Lv, Xiong-Hui Chen, Chenfei Wu

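Affiliations: Qwen Team

Topic match: image-text multimodal (multimodal world model combining vision and language)

AI summary: Proposes Qwen-RobotWorld, a language-conditioned video world model with natural language as a unified action interface; using a double-stream MMDiT, a large-scale embodied world knowledge corpus, and a progressive training curriculum, it predicts physically consistent future visual trajectories for tasks such as robotic manipulation and autonomous driving, with top results on several benchmarks.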

Abstract

We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation provides three promising application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. This is achieved through a three-part design: a) Double-Stream MMDiT with MLLM Action Encoding, where a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M video-text corpus (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive results show strong competitiveness: the model ranks 1st overall on EWMBench and DreamGen Bench and outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on the RoboTwin-IF benchmark further support robust generalization and multi-view consistency.


2606.15088 2026-06-18 cs.SD cs.CL eess.AS New submission, topic 85

When the Same Musical Knowledge Forgets Differently: A Clean Probe of Pathway-Dependent Forgetting

Yu Liu, Zhiwei Yang, Wenxiao Zhang, Cong Cao, Fangfang Yuan, Kun Peng, Haimei Qin, Lei Jiang, Jin B. Hong, Hao Peng, Yanbing Liu

Affiliations: Institute of Information Engineering, CAS; School of Cyber Security, UCAS; The University of Western Australia; Beihang University

Topic match: image-text multimodal (pathway-dependent knowledge forgetting in multimodal models)

AI summary: Proposes the Paired Pathway Controlled Protocol (PPCP) and finds that, in multimodal models, knowledge acquired through the text pathway is forgotten more readily than knowledge acquired through the audio pathway; the effect is not explained by architectural depth and stems mainly from differences in input representation.

Abstract

A model can learn that the piano piece Für Elise is calm and reflective by listening to the audio or by reading a text description, but does it matter which route that knowledge took when it is later at risk of being forgotten? Forgetting research in multimodal models measures what knowledge is lost under adaptation, yet has not asked whether acquisition route affects how easily that knowledge is forgotten. We call this untested premise the Pathway-Invariant Assumption. Music understanding enables a clean test because a music clip and a canonical text description can be aligned to the same perceptual content, allowing the same knowledge unit to enter a model through listening or reading while the target remains fixed. Across multiple architecturally distinct audio-language models, we observe a consistent asymmetry: text-pathway knowledge is forgotten more than matched audio-pathway knowledge under identical adaptation pressure. To attribute this effect to route rather than confounds, we introduce the Paired Pathway Controlled Protocol (PPCP), a three-phase design that establishes matched pathway baselines, activates both pathways under symmetric supervision on the same knowledge pool, and applies identical forgetting pressure to both pathways. The gap is stable across models and gain-controlled analyses, persists when contradictory overwrite is replaced by correct-label cross-domain learning, remains under single-modality pressure, and is not removed by lightweight replay. Two independent routing-depth controls confirm that the effect is not explained by architectural depth, pointing to input representation as the dominant factor. Under PPCP, our results demonstrate that forgetting is highly route-dependent, establishing acquisition route as a new analytical dimension for forgetting research and multimodal system design.


2606.18974 2026-06-18 cs.CV New submission, topic 80

Visual-OPSD: Cross-Modal On-Policy Self-Distillation for Efficient Unified Multimodal Reasoning

Pengyu Li, Zhitao Gao, Lingling Zhang, Muye Huang, Yuanming Li, Fangzhi Xu, Jun Liu

Affiliations: Xi’an Jiaotong University; MOE KLINNS Lab; Shaanxi Province Key Laboratory of Big Data Knowledge Engineering; Sun Yat-sen University

Topic match: image-text multimodal (cross-modal self-distillation transfers visual reasoning ability to a text-only model)

AI summary: Proposes Visual-OPSD, which uses cross-modal on-policy self-distillation to transfer the visual-thought reasoning produced by multi-step diffusion to a text-only student, achieving a 14.3x speedup and a gain of 3.40 percentage points.

Abstract

Unified multimodal models (UMMs) interleave generated "visual thoughts" (VTs) with text reasoning to improve spatial tasks. This incurs roughly an order-of-magnitude inference cost from multi-step diffusion. We find this cost yields limited direct benefit. On ThinkMorph, removing or noising VTs barely changes accuracy across nine benchmarks. Once rendered, attention concentrates on the VT regardless of content. Yet a KL diagnostic shows that conditioning on a privileged VT trace shifts the model's completion distribution. This suggests the generation pathway encodes useful reasoning beyond the rendered pixels. Motivated by this gap, we propose Visual On-Policy Self-Distillation (Visual-OPSD). Teacher and student share identical weights but differ in context: the teacher sees privileged VTs while the student sees only the question. Token-level JSD distillation on on-policy student trajectories transfers the teacher's reasoning to a text-only student. Across nine benchmarks, Visual-OPSD improves over its generative teacher by +3.40 pp with a 14.3× speedup (10.0 s vs. 142.8 s per sample) and outperforms same-scale VLMs by +63.83 pp on VSP. A Gaussian-noise control (+0.40 pp vs. +10.28 pp for real VTs) and 58.4% closure of the KL gap confirm that gains come from the semantic content of the generation pathway.
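Token-level JSD distillation, as named in the abstract, compares the teacher's and the student's next-token distributions at each position of a student-sampled trajectory. The sketch below shows the standard Jensen-Shannon divergence averaged over positions; the averaging, the toy three-token vocabulary, and the example distributions are assumptions made for illustration rather than details from the paper. The last line re-derives the reported speedup from the two per-sample timings quoted in the abstract.

```python
import math


def kl(p, q):
    """KL(p || q), skipping zero-probability terms of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two next-token distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def sequence_jsd(teacher_dists, student_dists):
    """Mean per-token JSD over one student-sampled trajectory."""
    per_token = [jsd(t, s) for t, s in zip(teacher_dists, student_dists)]
    return sum(per_token) / len(per_token)


# Two token positions over a toy 3-token vocabulary (invented numbers).
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]   # teacher context includes privileged VTs
student = [[0.4, 0.4, 0.2], [0.1, 0.8, 0.1]]   # student context is the question only
loss = sequence_jsd(teacher, student)

# Speedup implied by the reported timings: 142.8 s vs. 10.0 s per sample.
speedup = 142.8 / 10.0
```

Unlike plain KL, JSD is symmetric and bounded above by log 2, which keeps the per-token loss well behaved even where teacher and student disagree sharply.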


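2606.18893 2026-06-18 cs.CL New submission, topic 80

Learning Robust Pair Confidence for Multimodal Emotion-Cause Pair Extraction

Zhuangzhuang Pan, Ning Dong, Yingna Su, Yan Xia

Affiliations: Institute for Advanced Studies; Universiti Malaya; School of Information Engineering; Suqian University; Digitization Department

Topic match: image-text multimodal (multimodal emotion-cause pair extraction with robust confidence learning)

AI summary: Proposes the RPCL framework, which uses a confidence-difference margin constraint and alignment with predictions from a corrupted view to make pair confidence more discriminative and stable in multimodal emotion-cause pair extraction, improving Pair F1 by about 2.6 to 2.8 percentage points on three datasets.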

Comments 11 pages, 3 figures, 5 tables

Abstract

Multimodal emotion-cause pair extraction (MECPE) requires reliable pair confidence over candidate pairs. Existing pair scorers commonly use pair-level cross entropy over valid candidates, which treats links mostly independently. This leaves the relative confidence geometry among competing causes under-constrained, allowing gold pairs to stay close to hard negatives or rely on incidental non-gold context. We study this vulnerability as pair-confidence brittleness and propose RPCL (Robust Pair Confidence Learning), a training-only framework for pair-confidence learning. RPCL encourages pair confidence to be both discriminative and stable: gold pairs are separated from row-wise hard negatives through a confidence-difference margin constraint, and clean pair predictions are aligned with predictions from a corrupted view where non-gold contextual utterance representations are partially corrupted. The original clean pair scorer and decoding pipeline are used unchanged at inference time. On ECF, MECAD, and MEC4, RPCL improves the three-seed mean Pair F1 over a matched base model by 2.58 to 2.83 percentage points in the full text-audio-video setting, and improves mean Pair AUPRC on all three datasets. Diagnostic analysis further shows larger gold-negative confidence gaps and lower margin-violation severity. These results suggest that explicitly shaping pair confidence is an effective training strategy for MECPE.
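The confidence-difference margin constraint can be read as a hinge penalty that fires whenever the hardest negative in a row comes within a margin of a gold pair. The following sketch makes that reading concrete; the margin value, the hardest-negative selection, the averaging, and the toy scores are assumptions for illustration and may differ from RPCL's actual loss.

```python
def margin_loss(scores, gold, margin=0.2):
    """Hinge penalty when a row's hardest negative gets within `margin` of gold.

    scores: list of rows; scores[i][j] is the confidence that utterance j
            causes the emotion in utterance i.
    gold:   set of (i, j) gold emotion-cause pairs.
    """
    losses = []
    for i, row in enumerate(scores):
        gold_cols = [j for j in range(len(row)) if (i, j) in gold]
        neg_scores = [s for j, s in enumerate(row) if (i, j) not in gold]
        if not gold_cols or not neg_scores:
            continue  # rows without both a gold pair and a negative add no constraint
        hardest_neg = max(neg_scores)
        for j in gold_cols:
            losses.append(max(0.0, margin - (row[j] - hardest_neg)))
    return sum(losses) / len(losses) if losses else 0.0


# Toy confidences (invented): row 0 has a hard negative, row 1 is well separated.
scores = [
    [0.90, 0.85, 0.10],   # gold (0, 0) is only 0.05 above the negative at 0.85
    [0.20, 0.30, 0.95],   # gold (1, 2) is 0.65 above the hardest negative
]
gold = {(0, 0), (1, 2)}
loss = margin_loss(scores, gold)
```

Only the first row contributes here, since its gold pair sits closer to a negative than the margin allows; the second row is already separated and incurs no penalty.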


2606.18710 2026-06-18 cs.CR 新提交专题 80

LARE: Low-Attention Region Encoding for Text-Image Retrieval

Abdulmalik Alquwayfili, Faisal Almeshal, Jumanah Almajnouni, Leena Alotaibi, Faisal Alhajari, Mohammed Alkhrashi, Alreem Almuhrij, Abdullah Aldwyish, Raied Aljadaany, Huda Alamri, Muhammad Kamran J. Khan

Affiliations * Saudi Data and Artificial Intelligence Authority (SDAIA)

Topic match Cross-modal retrieval: text-image cross-modal retrieval

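AI summary Proposes the LARE framework, which encodes low-attention regions and the full image in parallel so that visual encoders no longer overlook key details in crowded scenes, and improves retrieval performance on a dense-scene subset.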

Comments Accepted at the ICML 2026 Workshop on Efficient Multimodal Question Answering (EMM-QA). Code: https://github.com/AbdulmalikDS/LARE ; Dataset: https://huggingface.co/datasets/AbdulmalekDS/Dense-Set

Abstract

Image retrieval in crowded scenes is particularly challenging due to the salience bias of conventional visual encoders, which tend to focus on dominant objects while neglecting low-attention regions that are often crucial for fine-grained retrieval. We propose LARE (Low-Attention Region Encoding), a framework that explicitly models these overlooked regions. LARE adopts a dual-encoding strategy that encodes low-attention regions of an image and the full image in parallel, leading to more diverse and informative image embeddings. To evaluate image retrieval performance in challenging crowded scenes, we introduce Dense-Set, a challenging subset derived from COCO and Flickr30K. In this subset, images are re-captioned to provide richer descriptions of low-attention or previously overlooked regions. This dataset highlights the limitations of existing retrieval models and enables a more rigorous evaluation under densely crowded scene conditions. Experimental results demonstrate that the proposed framework improves retrieval performance by preserving subtle, non-dominant visual cues within the shared latent space.
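The dual-encoding idea in the abstract can be sketched on toy patch features. This is a simplified illustration under stated assumptions: the abstract does not say how low-attention regions are selected, pooled, or combined with the full-image embedding, so the `keep_ratio` threshold, mean pooling, the max-of-cosines scoring rule, and all function names here are hypothetical.

```python
import math


def mean_pool(vectors):
    """Average a non-empty list of equal-length feature vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity, returning 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def dual_embeddings(patch_feats, attention, keep_ratio=0.25):
    """Return (full-image embedding, low-attention embedding).

    The low-attention branch pools only the `keep_ratio` fraction of
    patches with the smallest attention scores (assumed selection rule).
    """
    k = max(1, int(len(patch_feats) * keep_ratio))
    low_idx = sorted(range(len(attention)), key=lambda i: attention[i])[:k]
    full = mean_pool(patch_feats)
    low = mean_pool([patch_feats[i] for i in low_idx])
    return full, low


def retrieval_score(text_emb, patch_feats, attention, keep_ratio=0.25):
    """Score a caption against both views and keep the better match (assumed rule)."""
    full, low = dual_embeddings(patch_feats, attention, keep_ratio)
    return max(cosine(text_emb, full), cosine(text_emb, low))


# Three dominant patches share one direction; one low-attention patch differs.
patches = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
attn = [0.4, 0.3, 0.05, 0.25]
detail_query = [0.0, 1.0]  # caption about the overlooked region
score = retrieval_score(detail_query, patches, attn)
```

In this toy case the full-image embedding is dominated by the three salient patches, so a caption about the overlooked region matches the low-attention view far better than the pooled full image.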

2606.19140 2026-06-18 cs.LG New submission Topic 55

ChronoSurv: A Clinical Pathway-Guided Graph Framework for Multimodal Survival Analysis

Hugo Miccinilli, Theo Di Piazza

Affiliations * Université Paris-Saclay, CentraleSupélec, MICS, France; University of Lyon, INSA Lyon, CREATIS, France

Topic match Other multimodal: handles multimodal clinical data, but is not a large model

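AI summary Proposes ChronoSurv, a directed-graph framework for multimodal survival analysis that models clinical trajectories through a hierarchical topology and heterogeneous message passing, achieving state-of-the-art discriminative performance with reliable calibration on head and neck cancer datasets.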

Comments Accepted at MICCAI 2026. Submitted version due to embargo

Abstract

Accurate survival prediction is essential for personalized treatment planning in head and neck cancer, yet remains challenging due to the heterogeneous and high-dimensional nature of multimodal clinical data. While deep survival models have improved predictive performance over classical statistical approaches, existing methods typically rely on static fusion strategies or temporally agnostic modeling, limiting their ability to capture structured clinical workflows. In this work, we propose ChronoSurv, a heterogeneous hierarchical directed graph framework for multimodal survival analysis. ChronoSurv represents patient care as a progression-aware clinical trajectory using directed graphs aligned with key diagnostic steps. A hierarchical topology incorporates fine-grained, coarse, and global representations, further supporting flexible adaptation to missing modalities, while heterogeneous message passing models complex and asymmetric relationships across modalities and clinical steps. Experimental results on two public datasets demonstrate that ChronoSurv achieves state-of-the-art discriminative performance while maintaining statistically reliable calibration. Comprehensive ablation studies further confirm the contribution of each architectural component, highlighting the potential of trajectory-aware graph modeling for multimodal survival prediction.
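The abstract reports discriminative performance without naming the metric. In survival analysis, discrimination is commonly measured with Harrell's concordance index, so the sketch below assumes that metric purely for illustration; it is not taken from the paper, and the function name and toy data are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times:  observed time for each patient (event or censoring time).
    events: 1 if the event was observed, 0 if the patient was censored.
    risks:  predicted risk scores, where higher means earlier expected event.
    A pair (i, j) is comparable when patient i had an observed event
    strictly before patient j's observed time.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in predicted risk count half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort: two observed events and one censored patient.
times = [2, 4, 6]
events = [1, 1, 0]
perfect = concordance_index(times, events, [0.9, 0.5, 0.1])
reversed_order = concordance_index(times, events, [0.1, 0.5, 0.9])
```

A value of 1.0 means every comparable pair is ranked correctly, 0.5 corresponds to random ranking, and 0.0 means every pair is ranked in reverse.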

1. Audio-visual multimodal (5 papers)

OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains

IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages

Who Wins the Conflict? Mechanistic Interpretability of Text Bias in Audio LLMs

Continuous Audio Thinking for Large Audio Language Models

DASH: Dual-View Self-Distillation with Multi-Layer Hidden Representations for Robust Speech Recognition

2. Multimodal evaluation (1 paper)

Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

3. Image-text multimodal (11 papers)

Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection

SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

When the Same Musical Knowledge Forgets Differently: A Clean Probe of Pathway-Dependent Forgetting

Visual-OPSD: Cross-Modal On-Policy Self-Distillation for Efficient Unified Multimodal Reasoning

Learning Robust Pair Confidence for Multimodal Emotion-Cause Pair Extraction

Image Prompt Reconstruction Attacks on Distributed MLLM Inference Frameworks

When Prompts Mislead: Textual Dominance and Diagnostic Bias in MLLMs

LandslideAgent with Multimodal LandslideBench: A Domain-Rule-Augmented Agent for Autonomous Landslide Identification and Analysis

Reasoning as Intersection: Consensus-Frame Alignment for Visual Focus in Video-MLLMs

4. Cross-modal retrieval (2 papers)

DREAM: Extending Vision-Language Models with Dual-Objective Encoding for Cross-Modal Retrieval

LARE: Low-Attention Region Encoding for Text-Image Retrieval

5. Other multimodal (1 paper)

ChronoSurv: A Clinical Pathway-Guided Graph Framework for Multimodal Survival Analysis