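arXivDaily: arXiv daily academic digest, updated Monday through Friday

AI Large Models

Multimodal Large Models

Large models and learning methods that span text, image, video, audio, and other modalities.

Collected today / on the current date: 20. Signal sources: cs.CV, cs.CL, cs.AI, cs.MM, eess.AS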

1. Audio-Visual Multimodal: 5 papers

2606.14702 2026-06-18 cs.CV New submission Topic 90

OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains

Xinyue Cai, Chaoyou Fu, Yi-Fan Zhang, Ran He, Caifeng Shan

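Affiliations * Nanjing University; CASIA (Institute of Automation, Chinese Academy of Sciences)

Topic match Audio-Visual Multimodal: audio-visual reasoning dataset and question answering

AI summary Proposes the OmniVideo-100K dataset, which uses entity-anchored video scripts and a clue-guided QA generation mechanism to address cross-segment entity inconsistency and weak long-range reasoning in audio-visual QA; fine-tuned models show significant gains on multiple benchmarks.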

Comments Project page: https://github.com/MiG-NJU/OmniVideo-100K

Abstract

Current automated pipelines for audio-visual Question Answering (QA) generally adopt a "video-caption-QA" paradigm. However, these methods typically segment videos into short clips and generate separate descriptions for audio and visual modalities. This decoupled processing severs inherent associations between sounds and their visual sources, while independent clip processing often causes inconsistent descriptions of the same entity across segments. Furthermore, coupling long-text comprehension and QA synthesis into a single step often restricts models to localized events, yielding questions lacking long-term temporal connections and deep cross-modal reasoning. To address these issues, we propose an automated data engine featuring two mechanisms: (1) Entity-Anchored Video Scripting transforms videos into structured scripts, comprising summaries, main entity lists, and segment-wise audio-visual descriptions. The entity list serves as a global prior to ensure cross-segment referential consistency and reconstruct audio-visual associations. (2) Clue-Guided QA Generation prompts models to first mine cross-segment, multimodal clues from the script, and subsequently generate QA pairs based on these high-value clues. Leveraging this pipeline, we construct the instruction-tuning dataset OmniVideo-100K and a human-verified test set, OmniVideo-Test. Fine-tuning VITA-1.5, Qwen2.5-Omni-7B and Qwen3-Omni-30B on OmniVideo-100K yields performance gains of up to 20.59% on OmniVideo-Test and generalizes well to established benchmarks such as Daily-Omni and JointAVBench (improvements of up to 12.64%).
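
The structured script described in the abstract (a summary, a main entity list, and segment-wise audio-visual descriptions) can be pictured as a small data structure. The following is only a minimal sketch under stated assumptions: the class names, field names, and the two helper methods are hypothetical and are not taken from the OmniVideo-100K release. It illustrates how a global entity list could be used to check cross-segment referential consistency and to surface entities that recur across segments as candidate clues.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One clip with a joint audio-visual description (hypothetical layout)."""
    start_s: float
    end_s: float
    description: str
    entity_ids: list[str] = field(default_factory=list)  # entities referenced in this clip


@dataclass
class VideoScript:
    """Hypothetical container for an entity-anchored script."""
    summary: str
    entities: dict[str, str]  # entity id -> canonical description (the global prior)
    segments: list[Segment]

    def unknown_references(self) -> list[tuple[int, str]]:
        """Return (segment index, entity id) pairs missing from the global entity list."""
        return [
            (i, eid)
            for i, seg in enumerate(self.segments)
            for eid in seg.entity_ids
            if eid not in self.entities
        ]

    def cross_segment_entities(self) -> dict[str, list[int]]:
        """Map each entity to its segments, keeping entities seen in two or more segments."""
        seen: dict[str, list[int]] = {}
        for i, seg in enumerate(self.segments):
            for eid in dict.fromkeys(seg.entity_ids):  # de-duplicate within a segment
                seen.setdefault(eid, []).append(i)
        return {eid: idxs for eid, idxs in seen.items() if len(idxs) >= 2}


script = VideoScript(
    summary="A street musician plays while a dog barks.",
    entities={"E1": "street musician with a guitar", "E2": "brown dog"},
    segments=[
        Segment(0.0, 10.0, "E1 strums a guitar; guitar audio is heard.", ["E1"]),
        Segment(10.0, 20.0, "E2 barks at E1, who keeps playing.", ["E1", "E2"]),
    ],
)
```

In this toy script, `E1` appears in both segments, so it is the kind of entity a cross-segment question could be built around.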

2606.19157 2026-06-18 eess.AS cs.CL New submission Topic 85

IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages

Sakshi Joshi, Dhruv Subhash Rathi, Sanskar Singh, Eldho Ittan George, R J Hari, Kaushal Bhogale, Mitesh M. Khapra

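Affiliations * AI4Bharat, Indian Institute of Technology Madras, India; Sarvam AI, India

Topic match Audio-Visual Multimodal: evaluating context utilisation in audio large language models

AI summary Proposes the IndicContextEval benchmark, containing 56 hours of natural speech from 555 speakers across 8 Indian languages, and uses a 7-level prompting framework to evaluate whether audio large language models genuinely use context rather than rely on parametric knowledge.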

Comments Accepted at Interspeech 2026

Abstract

AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists. However, it remains unclear whether these models genuinely utilise such context or rely on parametric knowledge learned during pretraining. Existing benchmarks cannot answer this question because they evaluate transcription under fixed prompting conditions and rarely include explicit contextual inputs. We introduce IndicContextEval, a 56-hour multilingual benchmark of natural speech from 555 speakers across 8 Indian languages and 23 professional domains. We design a 7-level prompting framework that progressively introduces contextual signals, including metadata, natural-language descriptions, entity lists in English and native script, and adversarial prompts with incorrect entities. Evaluating five models reveals substantial differences in context utilisation behaviour, highlighting the need for explicit evaluation of contextual grounding in AudioLLMs.

2606.18924 2026-06-18 cs.SD New submission Topic 85

Who Wins the Conflict? Mechanistic Interpretability of Text Bias in Audio LLMs

Hyebin Cho, Suho Yoo, Jaehyuk Jang, Changick Kim, Joon Son Chung

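Affiliations * School of Electrical Engineering, KAIST

Topic match Audio-Visual Multimodal: analysis of the text-bias mechanism in audio LLMs

AI summary Through mechanistic analysis, this paper reveals a text-dominance bias in audio LLMs, finds that the text pathway actively suppresses intact audio representations, and uses back-patching, a training-free intervention, to strengthen audio representations and mitigate text dominance.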

Comments Preprint

Abstract

While Audio Large Language Models (Audio LLMs) excel at multimodal understanding, they suffer from text dominance, a bias where models blindly favor text over acoustic evidence, causing hallucinations. However, the internal mechanisms underlying how these models behave when audio and textual inputs contradict each other remain unexplored. In this work, we present the first mechanistic analysis of this phenomenon by tracing the propagation of internal representations across layers. Our investigation reveals three key findings: (i) text dominance is systematically and empirically present across models; (ii) while text and audio rely on functionally distinct pathways, they ultimately converge into a shared semantic space in late layers; and (iii) the text pathway does not erase audio information, but rather actively suppresses intact audio representations. Building on these insights, we leverage back-patching, a training-free intervention that routes late-layer audio activations back into earlier layers. This amplifies the audio representations, enabling them to overcome textual suppression. Our evaluation shows that back-patching consistently reduces text dominance, paving the way for mechanistic multimodal alignment under conflict.
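
The abstract describes back-patching as routing late-layer audio activations back into earlier layers. The mechanics can be sketched on a toy stack of layers. Everything here is an illustrative assumption: the scalar "hidden states", the `make_layer` toy layer, and the choice of position 0 as the audio token are hypothetical and do not reflect the authors' models or code.

```python
def make_layer(gain):
    """Toy layer: scale each position by `gain` and mix in a little of the sequence mean."""
    def layer(hidden):
        mean = sum(hidden) / len(hidden)
        return [gain * h + 0.1 * mean for h in hidden]
    return layer


def run(layers, hidden, start=0, record=None):
    """Apply layers[start:] in order; optionally record each layer's output."""
    for layer in layers[start:]:
        hidden = layer(hidden)
        if record is not None:
            record.append(list(hidden))
    return hidden


def back_patch(layers, inputs, positions, src_layer, dst_layer):
    """Copy the output of `src_layer` at `positions` into the output of `dst_layer`, then resume."""
    states = []
    run(layers, inputs, record=states)  # states[i] is the output of layer i
    patched = list(states[dst_layer])
    for pos in positions:
        patched[pos] = states[src_layer][pos]
    return run(layers, patched, start=dst_layer + 1)


layers = [make_layer(1.2) for _ in range(4)]
inputs = [1.0, -1.0, 0.5]  # position 0 stands in for an audio token (assumption)
baseline = run(layers, inputs)
patched = back_patch(layers, inputs, positions=[0], src_layer=3, dst_layer=0)
```

Patching a layer's output into itself (`src_layer == dst_layer`) reproduces the ordinary forward pass, which is a convenient sanity check before trying a real late-to-early patch.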

2606.18273 2026-06-18 cs.CL cs.AI cs.SD eess.AS New submission Topic 85

Continuous Audio Thinking for Large Audio Language Models

Gyojin Han, Dong-Jae Lee, Changho Choi, Jongsuk Kim, Junmo Kim

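Affiliations * KAIST

Topic match Audio-Visual Multimodal: proposes the CoAT framework to strengthen continuous audio thinking in audio language models.

AI summary Proposes the Continuous Audio Thinking (CoAT) framework, which organizes acoustic information in a continuous latent space via expert distillation so that audio language models can draw on rich acoustic features before generating a response, at no additional autoregressive decoding cost, improving performance across multiple audio tasks.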

Comments Preprint

Abstract

Large audio language models (LALMs) have shown impressive capabilities on diverse audio understanding tasks, ranging from speech transcription to music analysis. However, because LALMs are typically trained to produce text-aligned responses, their hidden states are progressively shaped for text generation rather than for preserving acoustic information. As a result, the diverse acoustic content that audio carries, such as phonetic detail, prosody, sound events, affect, and pitch, is lost along the way and is difficult to leverage in the response. We introduce Continuous Audio Thinking (CoAT), a framework that equips audio language models with a continuous latent workspace for organizing acoustic information prior to response generation, grounded by distillation from audio experts. Within the thinking space, the model can utilize the rich acoustic information provided by expert distillation when generating its response. Furthermore, the proposed continuous thinking block can be processed in a single prefill, so CoAT does not require additional autoregressive decoding cost over the baseline. Across three LALMs, Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo 3, performance gains on a broad benchmark suite spanning audio reasoning, audio understanding, music classification, speech emotion, and speech transcription demonstrate the effectiveness of CoAT. Further analysis confirms that the auxiliary supervision propagates from the thinking positions to the model's textual responses.

2606.19203 2026-06-18 eess.AS New submission Topic 80

DASH: Dual-View Self-Distillation with Multi-Layer Hidden Representations for Robust Speech Recognition

Jaeeun Baik, Ui-Hyeop Shin, Jiwoon Lee, Woocheol Jeong, Hyung-Min Park

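Topic match Audio-Visual Multimodal: proposes a self-distillation framework to improve speech recognition robustness (audio processing)

AI summary Proposes DASH, a self-distillation framework that learns clean-noisy consistency from two views, distills hidden representations from multiple encoder layers, and minimizes the KL divergence between prototype assignment distributions, improving noise robustness while preserving clean accuracy at an extra cost of only about 4% of fine-tuning time.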

Comments Accepted to Interspeech 2026

Abstract

Automatic Speech Recognition (ASR) often degrades in real-world noisy environments, making noise robustness essential for deployment. Supervised noise-augmented fine-tuning is a common remedy, but it can introduce a robustness-clean trade-off and overfit to specific corruptions, degrading recognition in clean conditions. We propose DASH, a self-distillation framework that improves robustness by learning clean-noisy consistency from paired views. DASH distills hidden representations from multiple encoder layers to capture features from low-level acoustics to high-level semantics, and stabilizes training by minimizing KL divergence between prototype assignment distributions of clean and noisy views. Experiments on LibriSpeech show that DASH consistently improves recognition under diverse noisy conditions while preserving clean accuracy. It does so by adding a label-free pre-training stage to standard fine-tuning, with minimal additional overhead (about 4% of fine-tuning time).
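
The consistency term mentioned in the abstract, a KL divergence between prototype assignment distributions of the clean and noisy views, can be illustrated with a few lines of plain Python. This is a rough sketch under assumptions the abstract does not confirm: assignments are computed as a temperature-scaled softmax over dot-product similarities, the KL direction is clean-to-noisy, and the prototype and feature values are made up for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prototype_assignment(feature, prototypes, temperature=0.1):
    """Soft assignment of one feature vector over prototypes (dot-product similarity, assumed)."""
    sims = [sum(f * p for f, p in zip(feature, proto)) for proto in prototypes]
    return softmax(sims, temperature)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


prototypes = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical prototype vectors
clean_feature = [0.9, 0.1]             # hypothetical feature from the clean view
noisy_feature = [0.6, 0.4]             # hypothetical feature from the noisy view

p_clean = prototype_assignment(clean_feature, prototypes)
p_noisy = prototype_assignment(noisy_feature, prototypes)
consistency_loss = kl_divergence(p_clean, p_noisy)  # direction is an assumption
```

The loss is zero when both views yield the same assignment and grows as noise pushes the noisy view toward a different prototype.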

2. Multimodal Evaluation: 1 paper

2606.19338 2026-06-18 cs.CV New submission Topic 85

Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

Shengyuan Ding, Xilin Wei, Xinyu Fang, Haodong Duan, Dahua Lin, Jiaqi Wang, Yuhang Zang

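Affiliations * Fudan University; Shanghai Innovation Institute; Shanghai Artificial Intelligence Laboratory; Zhejiang University; The Chinese University of Hong Kong

Topic match Multimodal Evaluation: evaluating multimodal model memory with non-Markov games

AI summary Proposes the RNG-Bench benchmark suite, which uses two games, Matching Pairs and 3D Maze, to evaluate how well multimodal large models reconstruct past observations and act on them in non-Markov environments; finds that most errors stem from forgetting rather than decision making, and that fine-tuning improves performance.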

Abstract

Deploying multimodal foundation models as closed-loop policies increasingly requires conditioning actions on observations that are no longer visible. However, existing benchmarks either expose the full state, conflate hidden-state reconstruction with other agent skills, or test recall only after an episode has ended. We introduce RNG-Bench (Reconstructive Non-Markov Games), a benchmark suite designed to isolate a base model's ability to reconstruct past observations and act on them during multi-step interaction. RNG-Bench includes two complementary games: Matching Pairs, where card identities briefly revealed at specific locations must later be recalled, and 3D Maze, where egocentric views must be integrated into a spatial map. Both games are evaluated under a unified harness with three controlled difficulty axes: grid size, visual pattern, and observation modality. The benchmark further introduces a head-to-head duel protocol to control for instance-level variance and a Memory Gap metric that disentangles forgetting from poor action selection. The hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode, and remain far from saturated by frontier MLLMs. Memory Gap analysis shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decision making. Finally, fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench and transfers to existing benchmarks without degrading general multimodal capability.

3. Image-Text Multimodal: 11 papers

2606.19120 2026-06-18 cs.LG cs.CV New submission Topic 85

Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

Sihan Wang, Xiyao Liu, Lianqing Liu, Zhi Han

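Affiliations * State Key Laboratory of Robotics and Intelligent Systems, Shenyang Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences

Topic match Image-Text Multimodal: MLLM post-training framework that decouples perception and reasoning

AI summary Proposes the ViGOS framework, which decouples perception from reasoning to avoid text shortcuts during MLLM post-training and to improve image-grounded behavior.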

Comments 29 pages, 5 figures, 8 tables

Abstract

On-policy self-distillation (OPSD) trains a model on its own rollouts and uses a frozen copy to provide dense token-level targets conditioned on a reference target. This works well for LLM reasoning, but a direct extension to multimodal large language models (MLLMs) can create a shortcut: the privileged target may guide tokens mainly based on the text reference target rather than the image. We propose ViGOS, a visually grounded OPSD framework for MLLM post-training. The student first writes a visual description and then reasons toward the final answer. For valid rollouts, an image-only perception teacher supervises the description, while a privileged reasoning teacher supervises the reasoning and final answer on the same student prefix. A reference teacher is used only for invalid rollouts to recover the output format. Across general vision-language, expert reasoning, visual math, spatial grounding, and visual-language-prior benchmarks, ViGOS keeps the main benefits of OPSD and improves image-grounded behavior in shortcut-prone settings.

2606.18988 2026-06-18 cs.AI New submission Topic 85

ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection

Jinhao Song, Shan Liang, Yiqun Yue, Zhuhuayang Zhang, Tianqi Gao

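Affiliations * Xi'an Jiaotong-Liverpool University

Topic match Image-Text Multimodal: introduces multimodal large models for interpretable deception detection, combining vision and audio.

AI summary Proposes the ThinkDeception framework, which brings multimodal large language models to deception detection and achieves interpretable cognitive reasoning through step-by-step reasoning and Visual-Audio Consistency Group Relative Policy Optimization (VAC-GRPO), reaching a new SOTA on mainstream benchmarks.

Comments 10 pages, 4 figures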

Abstract

Multimodal deception detection is critical for identifying fraudulent intentions, yet existing approaches predominantly rely on end-to-end black-box paradigms. These methods suffer from a severe lack of interpretability, failing to provide transparent reasoning trajectories and struggling to explicitly capture the subtle cross-modal inconsistencies inherent in deceptive behaviors. To transcend these limitations, we propose ThinkDeception, a novel and interpretable multimodal deception detection framework. As a pioneering effort, it introduces Multimodal Large Language Models (MLLMs) into this domain, transforming deception detection from a traditional binary classification task into an explicit cognitive reasoning process. Facilitated by the first meticulously annotated step-by-step multimodal Chain of Thought (CoT) dataset, we develop a foundational model, ThinkDeception Base, empirically validating the critical role of modal inconsistency in decoding deception. Building upon this foundation, our core innovation lies in proposing Visual-Audio Consistency Group Relative Policy Optimization (VAC-GRPO) equipped with a progressive training strategy. Distinct from standard GRPO, we stratify the training data into four progressive difficulty tiers, guiding the model through a psychologically grounded easy-to-hard cognitive transition. By coupling this dynamic curriculum scheduler with a multi-dimensional, process-aware reward mechanism and a reflective learning paradigm, we significantly elevate the model's overall reasoning quality. Extensive experiments on mainstream benchmarks demonstrate that ThinkDeception establishes a new SOTA, significantly outperforming existing methods in both detection accuracy and rationale quality. Ultimately, this work drives the field of deception detection toward interpretable, multimodal cognitive reasoning.
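
For readers unfamiliar with the GRPO baseline that VAC-GRPO builds on, the core idea is to score each sampled response relative to the other responses in its group rather than with a learned value function. The sketch below shows only that standard group-relative normalization plus a simple easy-to-hard ordering over four tiers. It is not the paper's VAC-GRPO: the reward values, the tier labels, and the function names are hypothetical, and the paper's consistency reward and curriculum scheduler are not specified in the abstract.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses (standard GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; a sketch-level choice
    return [(r - mean) / (std + eps) for r in rewards]


def curriculum_order(samples):
    """Order (sample_id, tier) pairs from the easiest tier (1) to the hardest (4)."""
    return sorted(samples, key=lambda s: s[1])


# Hypothetical rewards for four responses sampled for the same prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])

# Hypothetical training samples tagged with a difficulty tier.
ordered = curriculum_order([("a", 3), ("b", 1), ("c", 4), ("d", 2)])
```

Responses above the group mean receive positive advantages and those below receive negative ones; a group with identical rewards yields all-zero advantages and therefore no learning signal.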

2606.18780 2026-06-18 cs.CV cs.CL cs.MM New submission Topic 85

SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction

Quanjiang Guo, Chong Mu, Jiazhou Pan, Ming Jia, Ling Tian, Hui Gao, Zhao Kang

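Affiliations * School of Computer Science and Engineering, University of Electronic Science and Technology of China

Topic match Image-Text Multimodal: multimodal information extraction, using a multi-expert MLLM to augment data.

AI summary Proposes SAMA, a semantic anchor-aligned augmentation framework that builds structured semantic anchors to guide a multi-expert multimodal large model in generating high-fidelity text, synthesizes images with an anchor-preserving diffusion mechanism, and adds a dual-constraint filtering module, significantly improving performance on low-resource multimodal information extraction.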

Comments Accepted by IEEE Transactions on Multimedia

Abstract

Multimodal Information Extraction (MIE), which covers tasks such as Multimodal Named Entity Recognition (MNER), Relation Extraction (MRE), and Event Extraction (MEE), is essential for understanding multimedia content but remains constrained by severe data scarcity. Although data augmentation is a promising remedy, existing approaches are impeded by coarse cross-modal alignment and fragmented, task-specific designs that fail to exploit shared semantic knowledge. To overcome these limitations, we introduce Semantic Anchor-aligned Multimodal Augmentation (SAMA), a unified framework for generating high-fidelity, task-aware synthetic data. SAMA constructs structured semantic anchors from ground-truth labels to guide a Collaborative Multi-Experts Multimodal Large Language Model (CME-MLLM), which integrates a Universal Adapter for shared semantics with Task-Specific Adapters to produce diverse yet constraint-compliant textual samples. For image synthesis, SAMA employs an Anchor-Preserving Diffusion mechanism that uses anchor-weighted prompts and latent conditioning to maintain critical semantic anchors while diversifying visual contexts. To eliminate the need for manual verification, SAMA further introduces a Dual-Constraint Filtering module that selects synthetic samples based on both cross-modal consistency and anchor fidelity. Extensive experiments across benchmark datasets for MNER, MRE, and MEE demonstrate that SAMA consistently outperforms state-of-the-art augmentation baselines under both fully supervised and low-resource settings, underscoring its versatility, robustness, and effectiveness.

2606.17030 2026-06-18 cs.CV New submission Topic 85

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

Jie Zhang, Xiaoyue Chen, Anzhe Chen, Dayiheng Liu, Deqing Li, Gengze Zhou, Hale Yin, Haoqi Yuan, Haoyang Li, Jiahao Li, Jiazhao Zhang, Jingren Zhou, Kaiyuan Gao, Kun Yan, Lihan Jiang, Ningyuan Tang, Pei Lin, Qihang Peng, Shengming Yin, Tianhe Wu, Tianyi Yan, Xiao Xu, Yan Shu, Yanran Zhang, Ye Wang, Yi Wang, Yilei Chen, Yixian Xu, Yiyang Huang, Yuxiang Chen, Zekai Zhang, Zhendong Wang, Zixing Lei, Zhixuan Liang, Zihao Liu, Zikai Zhou, Chenxu Lv, Xiong-Hui Chen, Chenfei Wu

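Affiliations * Qwen Team

Topic match Image-Text Multimodal: a multimodal world model that fuses vision and language

AI summary Proposes Qwen-RobotWorld, a language-conditioned video world model that uses natural language as a unified action interface; with a double-stream MMDiT, a large-scale embodied world knowledge corpus, and a progressive training curriculum, it predicts physically consistent future visual trajectories for tasks such as robotic manipulation and autonomous driving, and achieves top results on multiple benchmarks.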

Abstract

We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation provides three promising application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. This is achieved through a three-part design: a) Double-Stream MMDiT with MLLM Action Encoding, where a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M video-text corpus (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive results show strong competitiveness: the model ranks first overall on EWMBench and DreamGen Bench and outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on the RoboTwin-IF benchmark further support robust generalization and multi-view consistency.

2606.15088 2026-06-18 cs.SD cs.CL eess.AS New submission Topic 85

When the Same Musical Knowledge Forgets Differently: A Clean Probe of Pathway-Dependent Forgetting

Yu Liu, Zhiwei Yang, Wenxiao Zhang, Cong Cao, Fangfang Yuan, Kun Peng, Haimei Qin, Lei Jiang, Jin B. Hong, Hao Peng, Yanbing Liu

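Affiliations * Institute of Information Engineering, CAS; School of Cyber Security, UCAS; The University of Western Australia; Beihang University

Topic match Image-Text Multimodal: studies pathway dependence of knowledge forgetting in multimodal models

AI summary Proposes the Paired Pathway Controlled Protocol (PPCP) and finds that, in multimodal models, knowledge acquired through the text pathway is forgotten more easily than knowledge acquired through the audio pathway; the effect is not explained by architectural depth and stems mainly from differences in input representation.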

Abstract

A model can learn that the piano piece Für Elise is calm and reflective by listening to the audio or by reading a text description, but does it matter which route that knowledge took when it is later at risk of being forgotten? Forgetting research in multimodal models measures what knowledge is lost under adaptation, yet has not asked whether acquisition route affects how easily that knowledge is forgotten. We call this untested premise the Pathway-Invariant Assumption. Music understanding enables a clean test because a music clip and a canonical text description can be aligned to the same perceptual content, allowing the same knowledge unit to enter a model through listening or reading while the target remains fixed. Across multiple architecturally distinct audio-language models, we observe a consistent asymmetry: text-pathway knowledge is forgotten more than matched audio-pathway knowledge under identical adaptation pressure. To attribute this effect to route rather than confounds, we introduce the Paired Pathway Controlled Protocol (PPCP), a three-phase design that establishes matched pathway baselines, activates both pathways under symmetric supervision on the same knowledge pool, and applies identical forgetting pressure to both pathways. The gap is stable across models and gain-controlled analyses, persists when contradictory overwrite is replaced by correct-label cross-domain learning, remains under single-modality pressure, and is not removed by lightweight replay. Two independent routing-depth controls confirm that the effect is not explained by architectural depth, pointing to input representation as the dominant factor. Under PPCP, our results demonstrate that forgetting is highly route-dependent, establishing acquisition route as a new analytical dimension for forgetting research and multimodal system design.

2606.18974 2026-06-18 cs.CV New submission Topic 80

Visual-OPSD: Cross-Modal On-Policy Self-Distillation for Efficient Unified Multimodal Reasoning

Pengyu Li, Zhitao Gao, Lingling Zhang, Muye Huang, Yuanming Li, Fangzhi Xu, Jun Liu

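Affiliations * Xi’an Jiaotong University; MOE KLINNS Lab; Shaanxi Province Key Laboratory of Big Data Knowledge Engineering; Sun Yat-sen University

Topic match Image-Text Multimodal: cross-modal self-distillation that transfers visual reasoning ability to a text-only model.

AI summary Proposes Visual-OPSD, which uses cross-modal on-policy self-distillation to transfer the visual-thought reasoning ability of multi-step diffusion generation to a text-only student model, achieving a 14.3× speedup and a 3.40 percentage point improvement.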

Abstract

Unified multimodal models (UMMs) interleave generated "visual thoughts" (VTs) with text reasoning to improve spatial tasks. This incurs roughly an order-of-magnitude inference cost from multi-step diffusion. We find this cost yields limited direct benefit. On ThinkMorph, removing or noising VTs barely changes accuracy across nine benchmarks. Once rendered, attention concentrates on the VT regardless of content. Yet a KL diagnostic shows that conditioning on a privileged VT trace shifts the model's completion distribution. This suggests the generation pathway encodes useful reasoning beyond the rendered pixels. Motivated by this gap, we propose Visual On-Policy Self-Distillation (Visual-OPSD). Teacher and student share identical weights but differ in context: the teacher sees privileged VTs while the student sees only the question. Token-level JSD distillation on on-policy student trajectories transfers the teacher's reasoning to a text-only student. Across nine benchmarks, Visual-OPSD improves over its generative teacher by +3.40 pp with a 14.3× speedup (10.0s vs. 142.8s per sample) and outperforms same-scale VLMs by +63.83 pp on VSP. A Gaussian-noise control (+0.40 pp vs. +10.28 pp for real VTs) and 58.4% closure of the KL gap confirm that gains come from the semantic content of the generation pathway.
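
The distillation signal named in the abstract is a token-level Jensen-Shannon divergence (JSD) between teacher and student next-token distributions. The following minimal sketch computes that quantity over toy distributions. It assumes an unweighted JSD averaged over token positions; the abstract does not state the exact weighting or reduction, and the example distributions are invented.

```python
import math


def kl(p, q):
    """KL(p || q), skipping zero-probability entries of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def token_level_jsd(teacher_dists, student_dists):
    """Mean JSD over aligned token positions (mean reduction is an assumption)."""
    per_token = [jsd(t, s) for t, s in zip(teacher_dists, student_dists)]
    return sum(per_token) / len(per_token)


# Hypothetical next-token distributions over a 3-word vocabulary at two positions.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
student = [[0.5, 0.3, 0.2], [0.1, 0.8, 0.1]]
loss = token_level_jsd(teacher, student)
```

JSD is symmetric and bounded by ln 2, which makes it a gentler target than a one-directional KL when teacher and student disagree sharply. As a side check on the reported numbers, 142.8 s / 10.0 s ≈ 14.3, matching the stated speedup.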

2606.18893 2026-06-18 cs.CL New submission Topic 80

Learning Robust Pair Confidence for Multimodal Emotion-Cause Pair Extraction

Zhuangzhuang Pan, Ning Dong, Yingna Su, Yan Xia

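Affiliations * Institute for Advanced Studies; Universiti Malaya; School of Information Engineering; Suqian University; Digitization Department

Topic match Image-Text Multimodal: multimodal emotion-cause pair extraction, learning robust confidence

AI summary Proposes the RPCL framework, which uses a confidence-difference margin constraint and alignment with predictions from a corrupted view to make pair confidence more discriminative and stable in multimodal emotion-cause pair extraction, improving Pair F1 by about 2.6 to 2.8 percentage points on three datasets.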

Comments 11 pages, 3 figures, 5 tables

Abstract

Multimodal emotion-cause pair extraction (MECPE) requires reliable pair confidence over candidate pairs. Existing pair scorers commonly use pair-level cross entropy over valid candidates, which treats links mostly independently. This leaves the relative confidence geometry among competing causes under-constrained, allowing gold pairs to stay close to hard negatives or rely on incidental non-gold context. We study this vulnerability as pair-confidence brittleness and propose RPCL (Robust Pair Confidence Learning), a training-only framework for pair-confidence learning. RPCL encourages pair confidence to be both discriminative and stable: gold pairs are separated from row-wise hard negatives through a confidence-difference margin constraint, and clean pair predictions are aligned with predictions from a corrupted view where non-gold contextual utterance representations are partially corrupted. The original clean pair scorer and decoding pipeline are used unchanged at inference time. On ECF, MECAD, and MEC4, RPCL improves the three-seed mean Pair F1 over a matched base model by 2.58 to 2.83 percentage points in the full text-audio-video setting, and improves mean Pair AUPRC on all three datasets. Diagnostic analysis further shows larger gold-negative confidence gaps and lower margin-violation severity. These results suggest that explicitly shaping pair confidence is an effective training strategy for MECPE.
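
The confidence-difference margin constraint in the abstract can be read as a hinge penalty that fires whenever a gold pair's confidence fails to exceed the hardest negative in the same row by some margin. The sketch below is one plausible reading, not the authors' loss: the margin value, the mean reduction over gold pairs, and treating a "row" as one emotion utterance's candidate causes are all assumptions.

```python
def row_margin_loss(confidences, gold_indices, margin=0.2):
    """Hinge penalty separating gold pairs from the hardest negative in one row."""
    gold = sorted(set(gold_indices))
    negatives = [c for j, c in enumerate(confidences) if j not in gold]
    if not gold or not negatives:
        return 0.0  # nothing to separate in this row
    hardest_negative = max(negatives)
    violations = [
        max(0.0, margin - (confidences[j] - hardest_negative)) for j in gold
    ]
    return sum(violations) / len(violations)


# Hypothetical confidences over three candidate causes for one emotion utterance.
well_separated = row_margin_loss([0.9, 0.3, 0.1], gold_indices=[0])
too_close = row_margin_loss([0.5, 0.45, 0.1], gold_indices=[0])
```

In the first row the gold pair leads the hardest negative by 0.6, so the penalty is zero; in the second the lead is only 0.05, so the penalty is positive and pushes the two confidences apart.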

2606.18710 2026-06-18 cs.CR New submission Topic 80

Image Prompt Reconstruction Attacks on Distributed MLLM Inference Frameworks

Xinjian Luo, Hongyan Chang, Jianxin Wei, Yuncheng Wu, Xiaofeng Gao, Meikang Qiu, Ting Yu, Xue Liu

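Topic match Image-text multimodal: image-prompt reconstruction attacks on distributed MLLMs.

AI summary Studies the risk that intermediate embeddings leak image prompts in distributed MLLM inference, and proposes two passive black-box attacks, MPAA and IEDA, that achieve pixel-level and semantic-level image reconstruction.

AI Chinese abstract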

分布式大语言模型(LLM)推理框架将孤立的消费级设备连接起来进行大规模模型推理,大幅降低了硬件限制。然而,最近的研究表明,参与者之间传输的中间嵌入可能会泄露私有提示。随着LLM演变为多模态LLM(MLLM),这种风险已扩展到文本之外:图像提示包含丰富的视觉和语义信息,使其中间嵌入高度隐私敏感。然而,分布式MLLM推理中的图像提示泄露问题在很大程度上尚未被探索。在本文中,我们研究了分布式MLLM框架中由中间嵌入引起的输入图像隐私风险。我们首先分析了从图像像素到中间表示的信息流。由于图像和文本嵌入通常在MLLM各层中交织,我们设计了一种图像嵌入提取算法作为重建攻击的前提,在我们的实验中,该算法在几乎所有MLLM层上实现了100%的提取准确率。在此基础上,我们开发了两种被动的黑盒图像重建攻击:MPAA和IEDA,反映了来自知识有限、能力有限的正常参与者的现实威胁。MPAA通过逐块信息提取和组装进行细粒度像素级重建,而IEDA通过嵌入引导的扩散生成进行粗粒度语义重建。我们在四个代表性的MLLM系列上评估了我们的攻击:Gemma 3、Phi 4 Multimodal、Qwen 2.5 VL和Llama 4 Scout。结果显示在各种设置下均具有一致优越的重建性能。我们进一步分析了MoE架构、图像预处理、模型大小和文本-图像依赖关系对攻击性能的影响。据我们所知,这是对MLLM图像重建攻击的首次研究。

English abstract

Distributed large language model (LLM) inference frameworks connect isolated consumer-grade devices for large-scale model inference, substantially reducing hardware constraints. However, recent studies show that intermediate embeddings transmitted among participants can leak private prompts. As LLMs evolve into multimodal LLMs (MLLMs), this risk extends beyond text: image prompts contain rich visual and semantic information, making their intermediate embeddings highly privacy-sensitive. Yet, image-prompt leakage in distributed MLLM inference remains largely unexplored. In this paper, we investigate privacy risks to input images caused by intermediate embeddings in distributed MLLM frameworks. We first analyze the information flow from image pixels to intermediate representations. Since image and text embeddings are often intertwined across MLLM layers, we design an image embedding extraction algorithm as a prerequisite for reconstruction attacks, achieving 100% extraction accuracy across almost all MLLM layers in our experiments. Building on this, we develop two passive black-box image reconstruction attacks, MPAA and IEDA, reflecting realistic threats from normal participants with limited knowledge and capability. MPAA performs fine-grained pixel-level reconstruction via patch-wise information extraction and assembly, while IEDA performs coarse-grained semantic reconstruction through embedding-guided diffusion generation. We evaluate our attacks on four representative MLLM families: Gemma 3, Phi 4 Multimodal, Qwen 2.5 VL, and Llama 4 Scout. Results show consistently superior reconstruction performance in various settings. We further analyze the effects of MoE architecture, image preprocessing, model size, and text-image dependency on attack performance. To our knowledge, this is the first study of image reconstruction attacks on MLLMs.

2606.18262 2026-06-18 cs.HC New submission Topic 75

When Prompts Mislead: Textual Dominance and Diagnostic Bias in MLLMs

当提示误导:多模态大语言模型中的文本主导与诊断偏差

Inhyuk Park, Doohyun Park

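Topic match Image-text multimodal: studies the textual-dominance bias of multimodal LLMs in medical diagnosis.

AI summary Shows that in medical MLLMs, textual prompts can override visual cues and bias the diagnosis; prompting strategies may remain unsafe even when the model retains spatial grounding.

Comments Accepted to the CVPR 2026 MMFM-BIOMED Workshop

AI Chinese abstract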

多模态大语言模型(MLLMs)正越来越多地被评估用于医疗应用,其中计算约束通常使提示策略成为微调之外唯一实用的替代方案。这类策略通常被认为支持诊断推理,但其在医学MLLMs中的潜在故障模式仍缺乏特征描述。我们分析了开源眼科MLLM FundusExpert-1B,在公共BRSET数据集上执行出血与玻璃膜疣的鉴别任务,该数据集被用作我们分析的受控测试平台。(i) 通过人工注入标记的受控探针证实,模型保留了粗粒度的区域级空间定位能力。(ii) 与零样本推理相比,单样本文本提示使预测偏向于提示的发现。(iii) 当叠加的病灶轮廓与不一致的文本声明配对时,文本提示覆盖了正确的视觉线索:整体准确率从仅视觉条件下的75%下降到46%,而思维链(CoT)推理与进一步退化而非自我纠正相关。尽管仅限于单个模型和数据集,我们的发现表明,仅靠提示策略可能不足以实现医学MLLMs的安全临床部署。

English abstract

Multimodal large language models (MLLMs) are increasingly being evaluated for medical applications, where computational constraints often make prompting strategies the only practical alternative to fine-tuning. Such strategies are generally assumed to support diagnostic reasoning, yet their potential failure modes in medical MLLMs remain poorly characterized. We analyze FundusExpert-1B, an open-source ophthalmology MLLM, on a hemorrhage versus drusen discrimination task using the public BRSET dataset, adopted here as a controlled testbed for our analysis. (i) A controlled probe with artificially injected markers confirms that the model retains coarse, region-level spatial grounding. (ii) Compared with zero-shot inference, one-shot textual prompts bias predictions toward the prompted finding. (iii) When an overlaid lesion contour is paired with an inconsistent textual claim, the textual prompt overrides the correct visual cue: overall accuracy drops from 75% to 46% relative to the visual-only condition, and Chain-of-Thought (CoT) reasoning is associated with further degradation rather than self-correction. Although limited to a single model and dataset, our findings suggest that prompting strategies alone may be insufficient for the safe clinical deployment of medical MLLMs.

2606.18661 2026-06-18 cs.CV cs.AI New submission Topic 70

LandslideAgent with Multimodal LandslideBench: A Domain-Rule-Augmented Agent for Autonomous Landslide Identification and Analysis

LandslideAgent与多模态LandslideBench:一种面向自主滑坡识别与分析的领域规则增强型智能体

Chengfu Liu, Dongyang Hou, Junwu Xiang, Cheng Yang, Xuezhi Cui, Zeyuan Wang, Liangtian Liu, Zelang Miao

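Affiliations * Central South University

Topic match Image-text multimodal: the multimodal dataset contains images, masks, and text descriptions.

AI summary Proposes an instruction-driven agentic framework comprising the multimodal dataset LandslideBench, the landslide-specific vision-language model LandslideVLM, and the domain-rule-enhanced agent LandslideAgent, enabling autonomous landslide identification and analysis.

AI Chinese abstract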

智能滑坡灾害解译对于防灾减灾至关重要,然而当前范式难以同时提取视觉特征和高层次地球科学语义,而通用视觉语言模型在复杂地质场景中存在感知局限和领域幻觉。为解决这些挑战,我们提出一个指令驱动的智能体框架,包含三个组成部分。首先,通过多VLM交叉验证和交互式标注构建LandslideBench,这是一个多模态细粒度数据集,包含七个子类型标签、高分辨率图像、像素级掩膜和高质量文本描述。然后,通过LoRA在LandslideBench上微调面向滑坡的VLM——LandslideVLM,以增强地质语义理解。最后,以LandslideVLM为认知核心的领域规则增强智能体LandslideAgent,采用双规则控制器,结合结构化报告元数据约束和交叉验证识别约束,来调控自动化工具调用。实验表明,LandslideBench为五种主流模型在细粒度分类和语义分割上提供了有效基线。LandslideVLM在滑坡判别、细粒度分类和语义描述质量上分别提升了10.96%、32.87%和15.91%。LandslideAgent进一步实现了自主多源空间数据推理,实现了滑坡识别与分析的全流程智能化。

English abstract

Intelligent landslide hazard interpretation is critical for disaster prevention, yet current paradigms struggle to simultaneously extract visual features and high-level geoscientific semantics, while general-purpose vision-language models (VLMs) suffer from perceptual limitations and domain hallucinations in complex geological scenarios. To address these challenges, we propose an instruction-driven agentic framework comprising three components. First, LandslideBench, a multimodal fine-grained dataset with seven subtype labels, high-resolution imagery, pixel-level masks, and high-quality textual descriptions, is constructed via multi-VLM cross-validation and interactive annotation. Then, LandslideVLM, a landslide-oriented VLM, is fine-tuned via LoRA on LandslideBench to enhance geological semantic understanding. Finally, LandslideAgent, a domain rule-enhanced agent taking LandslideVLM as its cognitive backbone, employs a dual-rule controller incorporating structured report metadata constraints and cross-validation identification constraints to regulate automated tool invocation. Experiments demonstrate that LandslideBench provides effective baselines across five mainstream models on fine-grained classification and semantic segmentation. LandslideVLM achieves accuracy improvements of 10.96%, 32.87%, and 15.91% on landslide discrimination, fine-grained classification, and semantic description quality, respectively. LandslideAgent further enables autonomous multi-source spatial data inference, realizing full-process intelligence for landslide identification and analysis.

2606.18441 2026-06-18 cs.CV New submission Topic 70

Reasoning as Intersection: Consensus-Frame Alignment for Visual Focus in Video-MLLMs

推理即交集:视频多模态大语言模型中视觉焦点的一致性帧对齐

Chengwen Liu, Zhe Huang, Jisheng Dang, Hong Peng, Qi Tian, Tat-Seng Chua

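Affiliations * School of Information Science and Engineering, Lanzhou University; Beijing University of Posts and Telecommunications; Cloud and AI BU, Huawei; School of Computing, National University of Singapore

Topic match Image-text multimodal: concerns reasoning optimization for video multimodal LLMs.

AI summary Proposes CF-GRPO, a temporal-annotation-free process-level reward framework that builds a consensus frame prior from intrinsic video cues and uses a Consensus Frame Reward to align the model's frame usage with that prior, improving video reasoning performance.

AI Chinese abstract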

强化学习提升了大型语言模型的推理能力,但将仅结果奖励应用于视频多模态大语言模型(Video-MLLMs)时,对哪些视觉证据应支持答案提供的指导有限。受多感官整合启发(其中一致的线索可以增强感知估计的显著性和可靠性),我们引入了一致性帧GRPO(CF-GRPO),一种无需时间标注的过程级奖励框架,用于证据感知的视频推理。CF-GRPO从内在视频线索中构建一致性帧先验,包括时间覆盖、场景转换线索和查询条件化的视觉相关性。然后,它从视觉和响应表示中计算模型侧的帧使用分数,并通过一致性帧奖励(CFR)优化它们的一致性。通过显著性感知的稀疏聚合和分布锐化,CFR提供了高对比度的奖励信号,无需人工时间标注。实验表明,VideoCFR在复杂视频推理基准上取得了有竞争力的性能,并在多个指标上优于代表性的Video-MLLM和RL基线,同时一致性先验提供了训练中强调的证据帧的可解释视图。实现代码见:https://github.com/1Pansy/VideoCFR。

English abstract

Reinforcement learning has improved the reasoning ability of large language models, but applying outcome-only rewards to video multimodal large language models (Video-MLLMs) provides limited guidance on which visual evidence should support the answer. Inspired by multisensory integration, where consistent cues can enhance the salience and reliability of perceptual estimates, we introduce Consensus Frame GRPO (CF-GRPO), a temporal-annotation-free process-level reward framework for evidence-aware video reasoning. CF-GRPO constructs a consensus frame prior from intrinsic video cues, including temporal coverage, scene-transition cues, and query-conditioned visual relevance. It then computes a model-side frame-use score from visual and response representations and optimizes their agreement through the Consensus Frame Reward (CFR). With salience-aware sparse aggregation and distribution sharpening, CFR provides a high-contrast reward signal without requiring human temporal annotations. Experiments show that VideoCFR achieves competitive performance across complex video reasoning benchmarks and improves several metrics over representative Video-MLLM and RL baselines, while the consensus prior provides an interpretable view of the evidence frames emphasized during training. The implementation is available at https://github.com/1Pansy/VideoCFR.
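
As a rough illustration of rewarding agreement between a frame prior and a model-side frame-use score, the sketch below sparsifies and sharpens two per-frame weight vectors and scores their overlap. The abstract does not specify how CFR is computed, so the top-k sparsification, the power-based sharpening, the histogram-intersection score, and all names and default values are assumptions, not the paper's definition.

```python
def top_k_sparse(weights, k):
    """Keep the k largest frame weights, zero the rest, renormalize."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    keep = set(order[:k])
    kept = [w if i in keep else 0.0 for i, w in enumerate(weights)]
    total = sum(kept)
    return [w / total for w in kept]

def sharpen(weights, temperature=0.5):
    """Raise weights to 1/temperature and renormalize (assumed sharpening)."""
    powered = [w ** (1.0 / temperature) for w in weights]
    total = sum(powered)
    return [p / total for p in powered]

def agreement_reward(prior, frame_use, k=2, temperature=0.5):
    """Overlap in [0, 1] between the sharpened, sparsified distributions."""
    p = sharpen(top_k_sparse(prior, k), temperature)
    q = sharpen(top_k_sparse(frame_use, k), temperature)
    return sum(min(a, b) for a, b in zip(p, q))

# Hypothetical per-frame weights for a four-frame clip.
prior = [0.1, 0.6, 0.3, 0.0]
frame_use = [0.0, 0.5, 0.1, 0.4]
print(agreement_reward(prior, frame_use))
```

The reward is 1 when both distributions emphasize the same frames with the same weights and 0 when their top-k frames are disjoint.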

4. Cross-modal retrieval, 2 papers

2606.19062 2026-06-18 cs.CV New submission Topic 85

DREAM: Extending Vision-Language Models with Dual-Objective Encoding for Cross-Modal Retrieval

DREAM: 通过双目标编码扩展视觉-语言模型用于跨模态检索

Kaleem Ullah, Altaf Hussain, Muhammad Munsif, Sung Wook Baik

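Affiliations * Sejong University; Korea Advanced Institute of Science and Technology; Ulsan National Institute of Science and Technology

Topic match Cross-modal retrieval: cross-modal retrieval with dual-objective encoding.

AI summary Proposes DREAM, which combines dual-path representation enhancement and alignment with a hierarchical vision encoder and hybrid language modeling, reporting new state-of-the-art results on video retrieval.

AI Chinese abstract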

在当今媒体驱动的世界中,视频内容在监控、教育和娱乐等领域的指数级增长使得通过自然语言查询检索语义相关视频变得日益关键。早期的视频检索系统依赖于手工特征或浅层跨模态映射,限制了其捕捉复杂语义和时间动态的能力。虽然大规模视觉-语言模型改进了跨模态对齐,但在建模细粒度时间依赖和微妙语言结构方面仍存在挑战。本文介绍DREAM:双路径表示增强与对齐模型,一种通过增强视觉和文本编码来解决这些局限性的新型多模态框架。DREAM采用混合语言建模策略,结合掩码和排列语言建模目标,以捕捉局部和全局语言语义。在视觉方面,我们设计了一个具有级联组注意力的层级视觉编码器,通过多阶段令牌交互和从粗到细的注意力细化来整合空间和时间信息。我们通过在广泛使用的MSRVTT、MSVD和LSMDC基准数据集上进行全面评估来验证DREAM,分别取得了49.4%、49.7%和27.3%的新SOTA R1分数。定性分析进一步展示了模型在帧间保持连贯注意力以及将复杂查询与动态视频内容对齐的能力。这些发现强调了层级注意力和双目标文本建模在实现鲁棒、上下文感知视频检索中的有效性,并为推进跨模态表示学习的未来研究铺平了道路。

English abstract

In today's media-driven world, the exponential growth of video content across domains such as surveillance, education, and entertainment has made retrieving semantically relevant videos via natural language queries increasingly critical. Early video retrieval systems relied on handcrafted features or shallow cross-modal mappings, limiting their ability to capture complex semantics and temporal dynamics. While large-scale vision-language models have improved cross-modal alignment, challenges remain in modeling fine-grained temporal dependencies and nuanced linguistic structures. In this paper, we introduce DREAM: Dual-path Representation Enhancement and Alignment Model, a novel multimodal framework that addresses these limitations through enhanced visual and textual encoding. DREAM incorporates a hybrid language modeling strategy that combines masked and permuted language modeling objectives to capture both local and global linguistic semantics. On the visual side, we design a hierarchical vision encoder with cascaded group attention, which integrates spatial and temporal information through multi-stage token interaction and coarse-to-fine attention refinement. We validate DREAM through comprehensive evaluations on the widely-used MSRVTT, MSVD and LSMDC benchmark datasets, where it achieves new state-of-the-art R1 scores of 49.4%, 49.7% and 27.3%, respectively. Qualitative analyses further show the model's ability to maintain coherent attention across frames and align complex queries with dynamic video content. These findings underscore the effectiveness of hierarchical attention and dual-objective textual modeling in enabling robust, context-aware video retrieval, and pave the way for future research in advancing cross-modal representation learning.
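
The R1 scores quoted above are Recall@1 values for text-to-video retrieval. The sketch below shows how Recall@K is commonly computed from a query-by-video similarity matrix; it assumes that query i is paired with video i, and the function name and toy scores are hypothetical rather than taken from the paper.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose paired video appears in the top-k results.

    similarity[i][j] is the score between text query i and video j;
    the ground-truth match for query i is assumed to be video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)

# Toy 3x3 similarity matrix: queries 0 and 2 rank their own video first.
toy = [
    [0.9, 0.1, 0.0],
    [0.2, 0.1, 0.7],
    [0.3, 0.2, 0.5],
]
print(recall_at_k(toy, k=1))
```

Benchmarks such as MSRVTT usually report R@1, R@5, and R@10 with this same procedure, varying only `k`.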

2606.18885 2026-06-18 cs.CV cs.IR New submission Topic 75

LARE: Low-Attention Region Encoding for Text-Image Retrieval

LARE: 低注意力区域编码用于文本-图像检索

Abdulmalik Alquwayfili, Faisal Almeshal, Jumanah Almajnouni, Leena Alotaibi, Faisal Alhajari, Mohammed Alkhrashi, Alreem Almuhrij, Abdullah Aldwyish, Raied Aljadaany, Huda Alamri, Muhammad Kamran J. Khan

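Affiliations * Saudi Data and Artificial Intelligence Authority (SDAIA)

Topic match Cross-modal retrieval: text-image cross-modal retrieval.

AI summary Proposes LARE, a framework that encodes low-attention regions and the full image in parallel to address visual encoders overlooking key details in crowded scenes, improving retrieval performance on a dense-scene subset.

Comments Accepted at the ICML 2026 Workshop on Efficient Multimodal Question Answering (EMM-QA). Code: https://github.com/AbdulmalikDS/LARE ; Dataset: https://huggingface.co/datasets/AbdulmalekDS/Dense-Set

AI Chinese abstract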

拥挤场景中的图像检索尤其具有挑战性,因为传统视觉编码器存在显著性偏差,倾向于关注主要对象而忽略低注意力区域,而这些区域通常对细粒度检索至关重要。我们提出了LARE(低注意力区域编码),一个显式建模这些被忽略区域的框架。LARE采用双编码策略,并行编码图像的低注意力区域和完整图像,从而产生更多样化和信息丰富的图像嵌入。为了评估拥挤场景下的图像检索性能,我们引入了Dense-Set,一个源自COCO和Flickr30K的具有挑战性的子集。在该子集中,图像被重新标注,以提供对低注意力或先前被忽略区域的更丰富描述。该数据集突显了现有检索模型的局限性,并能够在密集拥挤场景条件下进行更严格的评估。实验结果表明,所提出的框架通过在共享潜在空间中保留微妙的非主导视觉线索来提高检索性能。

English abstract

Image retrieval in crowded scenes is particularly challenging due to the salience bias of conventional visual encoders, which tend to focus on dominant objects while neglecting low-attention regions that are often crucial for fine-grained retrieval. We propose LARE (Low-Attention Region Encoding), a framework that explicitly models these overlooked regions. LARE adopts a dual-encoding strategy that encodes low-attention regions of an image and the full image in parallel, leading to more diverse and informative image embeddings. To evaluate image retrieval performance in challenging crowded scenes, we introduce Dense-Set, a challenging subset derived from COCO and Flickr30K. In this subset, images are re-captioned to provide richer descriptions of low-attention or previously overlooked regions. This dataset highlights the limitations of existing retrieval models and enables a more rigorous evaluation under densely crowded scene conditions. Experimental results demonstrate that the proposed framework improves retrieval performance by preserving subtle, non-dominant visual cues within the shared latent space.
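
The dual-encoding idea can be sketched on toy patch features: select the patches that receive the least attention, pool them separately, and combine that vector with a full-image embedding. The abstract does not say how regions are selected or how the two encodings are merged, so the bottom-fraction selection, mean pooling, concatenation, and all names below are assumptions for illustration only.

```python
def low_attention_indices(attention, fraction=0.5):
    """Indices of the patches with the lowest attention scores."""
    n_keep = max(1, int(len(attention) * fraction))
    order = sorted(range(len(attention)), key=lambda i: attention[i])
    return sorted(order[:n_keep])

def mean_pool(vectors):
    """Element-wise mean of equally sized feature vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def dual_embedding(patch_features, attention, fraction=0.5):
    """Concatenate a full-image embedding with a low-attention embedding."""
    full = mean_pool(patch_features)
    low_idx = low_attention_indices(attention, fraction)
    low = mean_pool([patch_features[i] for i in low_idx])
    return full + low  # list concatenation, not element-wise addition

# Four hypothetical patches with 2-D features and per-patch attention.
patches = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
attention = [0.70, 0.10, 0.15, 0.05]
print(dual_embedding(patches, attention))
```

In this toy case patches 1 and 3 carry the least attention, so their pooled features form the second half of the embedding.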

5. Other multimodal, 1 paper

2606.19140 2026-06-18 cs.LG New submission Topic 55

ChronoSurv: A Clinical Pathway-Guided Graph Framework for Multimodal Survival Analysis

ChronoSurv:一种临床路径引导的多模态生存分析图框架

Hugo Miccinilli, Theo Di Piazza

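Affiliations * Université Paris-Saclay, CentraleSupélec, MICS, France; University of Lyon, INSA Lyon, CREATIS, France

Topic match Other multimodal: handles multimodal clinical data, but is not a large model.

AI summary Proposes ChronoSurv, a directed-graph framework for multimodal survival analysis that models clinical trajectories through a hierarchical topology and heterogeneous message passing, achieving the best discriminative performance with reliable calibration on head and neck cancer datasets.

Comments Accepted at MICCAI 2026. Submitted version due to embargo

AI Chinese abstract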

准确的生存预测对于头颈癌的个性化治疗计划至关重要,但由于多模态临床数据的异质性和高维性,这仍然具有挑战性。虽然深度生存模型在预测性能上优于经典统计方法,但现有方法通常依赖于静态融合策略或时间无关建模,限制了其捕捉结构化临床工作流程的能力。在这项工作中,我们提出了ChronoSurv,一种用于多模态生存分析的异质层次有向图框架。ChronoSurv使用与关键诊断步骤对齐的有向图,将患者护理表示为进展感知的临床轨迹。层次拓扑包含细粒度、粗粒度和全局表示,进一步支持对缺失模态的灵活适应,而异质消息传递则建模了跨模态和临床步骤的复杂非对称关系。在两个公共数据集上的实验结果表明,ChronoSurv在保持统计可靠校准的同时,实现了最先进的判别性能。全面的消融研究进一步证实了每个架构组件的贡献,突出了轨迹感知图建模在多模态生存预测中的潜力。

English abstract

Accurate survival prediction is essential for personalized treatment planning in head and neck cancer, yet remains challenging due to the heterogeneous and high-dimensional nature of multimodal clinical data. While deep survival models have improved predictive performance over classical statistical approaches, existing methods typically rely on static fusion strategies or temporally agnostic modeling, limiting their ability to capture structured clinical workflows. In this work, we propose ChronoSurv, a heterogeneous hierarchical directed graph framework for multimodal survival analysis. ChronoSurv represents patient care as a progression-aware clinical trajectory using directed graphs aligned with key diagnostic steps. A hierarchical topology incorporates fine-grained, coarse, and global representations, further supporting flexible adaptation to missing modalities, while heterogeneous message passing models complex and asymmetric relationships across modalities and clinical steps. Experimental results on two public datasets demonstrate that ChronoSurv achieves state-of-the-art discriminative performance while maintaining statistically reliable calibration. Comprehensive ablation studies further confirm the contribution of each architectural component, highlighting the potential of trajectory-aware graph modeling for multimodal survival prediction.