arXivDaily: arXiv daily academic digest, updated Monday through Friday

AI Large Models

Multimodal Large Models

Large models and learning methods that span text, image, video, audio, and other modalities.

Indexed today (current date): 33 papers. Sources: cs.CV, cs.CL, cs.AI, cs.MM, eess.AS

1. Audio-Visual Multimodal (1 paper)

2603.09234 2026-06-18 eess.AS Version update Topic 70

StuPASE: Towards Low-Hallucination Studio-Quality Generative Speech Enhancement

Xiaobin Rong, Jun Gao, Zheng Wang, Mansur Yesilbursa, Kamil Wojcicki, Jing Lu

Topic match: Audio-Visual Multimodal. Generative speech enhancement, which falls under audio processing.

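AI summary: Proposes StuPASE, built on the PASE framework. By fine-tuning with dry targets and replacing the GAN with a flow-matching module, it reaches studio-level speech quality while retaining low hallucination, outperforming existing methods.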

Comments: Accepted to Interspeech 2026

English abstract

Achieving high perceptual quality without hallucination remains a challenge in generative speech enhancement (SE). A representative approach, PASE, is robust to hallucination but has limited perceptual quality under adverse conditions. We propose StuPASE, built upon PASE to achieve studio-level quality while retaining its low-hallucination property. First, we show that finetuning PASE with dry targets rather than targets containing simulated early reflections substantially improves dereverberation. Second, to address performance limitations under strong additive noise, we replace the GAN-based generative module in PASE with a flow-matching module, enabling studio-quality generation even under highly challenging conditions. Experiments demonstrate that StuPASE consistently produces perceptually high-quality speech while maintaining low hallucination, outperforming state-of-the-art SE methods. Audio demos are available at: https://xiaobin-rong.github.io/stupase_demo/.
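As a rough illustration of what a flow-matching generative module learns, the sketch below builds the training target for a straight-line path between a noise sample and a clean target, then integrates a velocity field with Euler steps. This is a generic toy under stated assumptions and not StuPASE's module: the abstract does not say which flow-matching variant, conditioning, or network is used, and the names `interpolate`, `target_velocity`, `flow_matching_loss`, and `euler_sample` are hypothetical.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from x0 (noise) to x1 (clean target) at time t."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; constant in t for a linear interpolant."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predicted_v, x0, x1):
    """Mean squared error between a predicted velocity and the path velocity."""
    v = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(predicted_v, v)) / len(v)


def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
x1 = [0.5, -0.25, 1.0]                    # stand-in for clean speech features (assumed)
x0 = [rng.gauss(0.0, 1.0) for _ in x1]    # Gaussian noise sample


def oracle(x, t):
    """Oracle field that knows the target; a trained network would approximate it."""
    return target_velocity(x0, x1)


generated = euler_sample(oracle, x0, steps=10)
```

In a real system the oracle would be replaced by a network conditioned on the degraded input and trained to minimise `flow_matching_loss` over random `t`; the toy only shows that following the path velocity carries noise to the target.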

2. Image-Text Multimodal (1 paper)

2601.14968 2026-06-18 cs.LG cs.AI Version update Topic 70

InstructTime++: Time Series Classification with Multimodal Language Modeling via Implicit Feature Enhancement

Mingyue Cheng, Xiaoyu Tao, Huajian Zhang, Qi Liu, Zhiding Liu, Yucong Luo, Yiheng Chen, Enhong Chen

Affiliations: State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China

Topic match: Image-Text Multimodal. Multimodal inputs that combine numerical sequences, textual features, and instructions.

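AI summary: Proposes recasting time series classification as a multimodal generative task, bridging the modality gap with a discretization module and an alignment projection layer, and using implicit feature modeling to improve language model performance.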

English abstract

Most existing time series classification methods adopt a discriminative paradigm that maps input sequences directly to one-hot encoded class labels. While effective, this paradigm struggles to incorporate contextual features and fails to capture semantic relationships among classes. To address these limitations, we propose InstructTime, a novel framework that reformulates time series classification as a multimodal generative task. Specifically, continuous numerical sequences, contextual textual features, and task instructions are treated as multimodal inputs, while class labels are generated as textual outputs by tuned language models. To bridge the modality gap, InstructTime introduces a time series discretization module that converts continuous sequences into discrete temporal tokens, together with an alignment projection layer and a generative self-supervised pre-training strategy to enhance cross-modal representation alignment. Building upon this framework, we further propose InstructTime++, which extends InstructTime by incorporating implicit feature modeling to compensate for the limited inductive bias of language models. InstructTime++ leverages specialized toolkits to mine informative implicit patterns from raw time series and contextual inputs, including statistical feature extraction and vision-language-based image captioning, and translates them into textual descriptions for seamless integration. Extensive experiments on multiple benchmark datasets demonstrate the superior performance of InstructTime++.
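To make the idea of turning a continuous series into discrete temporal tokens concrete, here is a minimal sketch that z-normalizes a series, bins it uniformly, and joins the tokens with context and an instruction into one text input. It is only a stand-in: the abstract does not describe how InstructTime's discretization module works, and `discretize`, `build_prompt`, the `<ts_k>` token format, and the prompt layout are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def discretize(series, num_bins=8, clip=2.0):
    """Map a numeric series to integer token ids via z-normalization and uniform bins."""
    mu = mean(series)
    sigma = pstdev(series) or 1.0          # avoid division by zero on flat series
    tokens = []
    for value in series:
        z = (value - mu) / sigma
        z = max(-clip, min(clip, z))       # clip outliers into [-clip, clip]
        # Scale [-clip, clip] to [0, num_bins] and floor to a bin index.
        idx = int((z + clip) / (2 * clip) * num_bins)
        tokens.append(min(idx, num_bins - 1))
    return tokens


def build_prompt(tokens, context, instruction):
    """Join time tokens, contextual text, and the task instruction into one string."""
    token_text = " ".join(f"<ts_{t}>" for t in tokens)
    return f"{instruction}\nContext: {context}\nSeries: {token_text}\nLabel:"


series = [0.0, 1.0, 2.0, 3.0, 4.0]
tokens = discretize(series, num_bins=4)
prompt = build_prompt(tokens, "hypothetical sensor reading", "Classify the series.")
```

A language model fine-tuned on such inputs would then generate the class label as text after `Label:`. Learned tokenizers are a common alternative to fixed bins; the fixed-bin version is used here only because it is easy to verify by hand.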

3. Other Multimodal (1 paper)

2606.19140 2026-06-18 cs.LG New submission Topic 55

ChronoSurv: A Clinical Pathway-Guided Graph Framework for Multimodal Survival Analysis

Hugo Miccinilli, Theo Di Piazza

Affiliations: Université Paris-Saclay, CentraleSupélec, MICS, France; University of Lyon, INSA Lyon, CREATIS, France

Topic match: Other Multimodal. Handles multimodal clinical data, but is not a large model.

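AI summary: Proposes ChronoSurv, a directed-graph framework for multimodal survival analysis that models clinical trajectories through a hierarchical topology and heterogeneous message passing, achieving the best discriminative performance with reliable calibration on head and neck cancer datasets.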

Comments: Accepted at MICCAI 2026. Submitted version due to embargo

English abstract

Accurate survival prediction is essential for personalized treatment planning in head and neck cancer, yet remains challenging due to the heterogeneous and high-dimensional nature of multimodal clinical data. While deep survival models have improved predictive performance over classical statistical approaches, existing methods typically rely on static fusion strategies or temporally agnostic modeling, limiting their ability to capture structured clinical workflows. In this work, we propose ChronoSurv, a heterogeneous hierarchical directed graph framework for multimodal survival analysis. ChronoSurv represents patient care as a progression-aware clinical trajectory using directed graphs aligned with key diagnostic steps. A hierarchical topology incorporates fine-grained, coarse, and global representations, further supporting flexible adaptation to missing modalities, while heterogeneous message passing models complex and asymmetric relationships across modalities and clinical steps. Experimental results on two public datasets demonstrate that ChronoSurv achieves state-of-the-art discriminative performance while maintaining statistically reliable calibration. Comprehensive ablation studies further confirm the contribution of each architectural component, highlighting the potential of trajectory-aware graph modeling for multimodal survival prediction.
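For readers unfamiliar with how discriminative performance is typically scored in survival analysis, the sketch below computes Harrell's concordance index on right-censored data. The abstract does not name the metric ChronoSurv reports, so treating it as the C-index is an assumption, and the function name and toy data are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: fraction of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when the subject with the shorter time had an
    observed event. Higher predicted risk should go with shorter survival.
    Ties in predicted risk count as half-correct.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue                      # censored subjects cannot anchor a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort: follow-up times, event indicators (1 = event observed, 0 = censored),
# and model-predicted risk scores.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.7, 0.8, 0.1]
c_index = concordance_index(times, events, risks)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. Calibration, which the abstract also mentions, is a separate property and is not measured by this index.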