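arXivDaily: arXiv daily academic digest, updated Monday through Friday

Vision and Robotics

Image Generation

Image generation, text-to-image, image editing, diffusion models, and controllable generation.

Today / current date: 9 papers indexed. Sources: cs.CV, cs.GR, cs.MM

1. Image restoration (3 papers)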

2603.05010 2026-06-18 cs.CV Version update 90%

How far have we gone in Generative Image Restoration? A study on its capability, limitations and evaluation practices

Xiang Yin, Jinfan Hu, Zhiyuan You, Kainan Yan, Yu Tang, Chao Dong, Jinjin Gu

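Affiliations * Fudan University; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; University of the Chinese Academy of Sciences; Multimedia Laboratory, The Chinese University of Hong Kong; Shenzhen University of Advanced Technology

Topic match Image restoration: studies generative image restoration, including diffusion and GAN models.

AI summary Uses a multi-dimensional evaluation pipeline to systematically compare generative models (diffusion, GAN, and others) with PSNR-oriented models, reveals a paradigm shift from insufficient detail to detail quality and semantic control, and trains an IQA model that aligns better with human perception.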

Comments Accepted by CVPR 2026 Findings

Abstract

Generative Image Restoration (GIR) has achieved impressive perceptual realism, but how far have its practical capabilities truly advanced compared with previous methods? To answer this, we present a large-scale study grounded in a new multi-dimensional evaluation pipeline that assesses models on detail, sharpness, semantic correctness, and overall quality. Our analysis covers diverse architectures, including diffusion-based, GAN-based, PSNR-oriented, and general-purpose generation models, revealing critical performance disparities. Furthermore, our analysis uncovers a key evolution in failure modes that signifies a paradigm shift for the perception-oriented low-level vision field. The central challenge is evolving from the previous problem of detail scarcity (under-generation) to the new frontier of detail quality and semantic control (preventing over-generation). We also leverage our benchmark to train a new IQA model that better aligns with human perceptual judgments. Ultimately, this work provides a systematic study of modern generative image restoration models, offering crucial insights that redefine our understanding of their true state and chart a course for future development.

2602.00176 2026-06-18 cs.CV cs.AI Version update 70%

Posterior Continuation with Noise-Conditioned Frequency Exposure for Diffusion Inverse Problems

Feng Tian, Yixuan Li, Weili Zeng, Weitian Zhang, Yichao Yan, Xiaokang Yang

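Affiliations * Shanghai Jiao Tong University

Topic match Image restoration: proposes a posterior continuation framework for diffusion inverse problems, including inpainting.

AI summary Proposes a posterior continuation framework that progressively exposes measurement frequencies according to the diffusion noise level and, combined with a stabilized sampler, achieves strong performance on super-resolution, inpainting, and deblurring.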

Abstract

Diffusion posterior sampling solves inverse problems by combining a pretrained diffusion prior with measurement-consistency guidance. However, full-band guidance can be unreliable at high noise levels, where clean estimates contain score-induced errors and high-frequency measurement directions are weakly identifiable. We argue that posterior guidance should expose measurement frequencies according to the instantaneous diffusion noise level. Based on this principle, we propose a posterior continuation framework that constructs a family of intermediate posteriors whose likelihood emphasizes currently reliable frequency bands and gradually returns to full-band consistency. We instantiate this framework with a stabilized sampler that combines a diffusion predictor, frequency-limited likelihood refinement, and a Haar-domain commitment rule that commits reliable coarse corrections while deferring weakly identifiable details. Across super-resolution, inpainting, and deblurring, our method achieves competitive-to-state-of-the-art restoration performance, including up to 5 dB PSNR improvement on motion deblurring over strong baselines in evaluations on FFHQ and ImageNet.
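
The noise-conditioned exposure idea can be illustrated with a toy schedule. This is not the paper's sampler: the linear cutoff schedule, the function names `exposed_cutoff` and `band_mask`, and the parameters `sigma_max` and `min_frac` are all assumptions made for illustration. The sketch only shows how a likelihood term could be limited to low frequencies at high noise and widened to the full band as the noise level falls.

```python
def exposed_cutoff(sigma: float, sigma_max: float, min_frac: float = 0.1) -> float:
    """Fraction of the frequency band exposed to measurement guidance.

    Hypothetical linear schedule (an assumption, not the paper's rule):
    min_frac at sigma == sigma_max, 1.0 (full band) at sigma == 0.
    """
    if sigma_max <= 0:
        raise ValueError("sigma_max must be positive")
    level = min(max(sigma / sigma_max, 0.0), 1.0)  # clamp to [0, 1]
    return 1.0 - (1.0 - min_frac) * level


def band_mask(num_freqs: int, cutoff_frac: float) -> list:
    """Binary mask over frequency indices 0..num_freqs-1, low frequencies first."""
    keep = max(1, round(num_freqs * cutoff_frac))
    return [1 if k < keep else 0 for k in range(num_freqs)]


# Walk from high noise to no noise and watch the exposed band widen.
for sigma in (1.0, 0.5, 0.0):
    print(sigma, band_mask(8, exposed_cutoff(sigma, sigma_max=1.0)))
```

In a real sampler the mask would weight the measurement-consistency term in a transform domain; here it is just a list so the schedule itself is easy to inspect.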

2204.14224 2026-06-18 cs.CV cs.LG eess.IV Version update 65%

Investigation of Neural Network Methods for Reconstruction and Classification of Texture Images Under Conditions of Incomplete Information

Galymzhan Abdimanap, Kairat Bostanbekov, Abdelrahman Abdallah, Anel Alimova, Darkhan Kurmangaliyev, Daniyar Nurseitov, Tatyana Dedova, Larissa Balakay, Serik Nurakynov

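Affiliations * Satbayev University; Institute of Ionosphere LLP; Information Technology Department; Assiut University

Topic match Image restoration: uses a GAN for image inpainting to reconstruct missing details.

AI summary Proposes an end-to-end framework combining object detection, GAN (CRA) inpainting, and Transformer/CNN classification. Reconstruction quality is high (PSNR 28.7 dB) but classification accuracy reaches only 53%; a confidence-based hybrid ensemble raises MCA from 48% to 58%, exposing the problem of generative models producing semantically ambiguous features.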

Comments IEEE ACCESS

Abstract

The automated analysis of heterogeneous natural textures is frequently hindered by physical damage and data loss, presenting a significant challenge to computer vision. While deep learning has shown success in controlled environments, its application to complex geological materials under conditions of incomplete information remains underexplored. This study presents an integrated framework for the inpainting and classification of high-resolution core sample images. We propose an end-to-end pipeline that utilizes object detection for sample segmentation, followed by image inpainting using Generative Adversarial Networks (GANs) with Contextual Residual Aggregation (CRA) to reconstruct missing high-frequency details. Subsequently, we evaluate the performance of modern Transformer-based (Swin, ViT) and CNN architectures on the reconstructed data. Our experiments revealed a critical divergence between reconstruction quality and downstream utility: despite high structural fidelity (PSNR 28.7 dB, FID 74.01), classification accuracy plateaued at 53%. To improve minority-class detection, we propose a confidence-based hybrid ensemble that raises MCA from 48% to 58%. These results highlight the limitations of current state-of-the-art generative models, which may produce visually plausible but semantically ambiguous features ("hallucinations") that confound classifiers. This work provides insights into the dependencies between image reconstruction quality and classification performance, offering a reproducible baseline for future research in non-destructive testing and material science. Given that cross-well accuracy remains in the 49-53% range, we position the resulting system as a decision-support and screening tool for lithofacies interpretation rather than as a fully autonomous classifier. The code is available at https://github.com/GalymzhanAbdimanap/Lithology_recognition
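
The abstract does not say how the confidence-based hybrid ensemble combines its models, so the following is only one plausible reading, sketched under stated assumptions: a primary classifier is trusted when its top probability reaches a threshold, and otherwise its scores are averaged with a secondary classifier. The function name `hybrid_predict`, the threshold value, and the averaging rule are all hypothetical.

```python
def hybrid_predict(primary_probs, secondary_probs, threshold=0.6):
    """Return the predicted class index for one sample.

    Assumed rule (not taken from the paper): use the primary model when its
    top probability reaches `threshold`; otherwise average both models.
    """
    if len(primary_probs) != len(secondary_probs):
        raise ValueError("probability vectors must have the same length")
    if max(primary_probs) >= threshold:
        scores = list(primary_probs)
    else:
        scores = [(p + s) / 2 for p, s in zip(primary_probs, secondary_probs)]
    # argmax over class indices
    return max(range(len(scores)), key=scores.__getitem__)


# Confident primary model: its own argmax (class 0) is used.
print(hybrid_predict([0.8, 0.1, 0.1], [0.2, 0.7, 0.1]))
# Uncertain primary model: the averaged scores favor class 1.
print(hybrid_predict([0.4, 0.35, 0.25], [0.1, 0.8, 0.1]))
```

A fallback of this kind tends to help minority classes only if the secondary model is stronger on them, which is the situation the abstract's MCA improvement suggests but does not spell out.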

2. Text-to-image (2 papers)

2508.03483 2026-06-18 cs.CV cs.AI Version update 90%

When Cars Have Stereotypes: Auditing Demographic Bias in Objects from Text-to-Image Models

Dasol Choi, Jihwan Lee, Minjae Lee, Minsuk Kahng

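Affiliations * AIM Intelligence; Yonsei University

Topic match Text-to-image: audits demographic bias in text-to-image models, which relates to image generation.

AI summary Proposes the SODA framework, which uses three metrics to systematically measure demographic bias in objects generated by text-to-image models. Neutral prompts implicitly lean toward middle-aged and White people, and demographic cues lead to highly skewed stereotypical outputs.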

Abstract

While prior research on text-to-image generation has predominantly focused on biases in human depictions, demographic bias in generated objects remains relatively underexplored. We introduce SODA (Stereotyped Object Diagnostic Audit), a novel framework for systematically measuring these biases through automated attribute discovery and three standardized metrics: Base vs. Demographic Divergence (BDS), Cross-Demographic Disparity (CDS), and Visual Attribute Concentration (VAC). Applying SODA to 8,000 images across five state-of-the-art models and eight object categories (e.g., cars), we find that "neutral" prompts produce outputs most visually similar to middle-aged and White people, suggesting these groups are implicitly over-represented in model defaults. Furthermore, demographic cues trigger highly skewed stereotypical outputs: 26.6% of object-model-demographic combinations produce results where all 20 generated images share the exact same attribute value (e.g., rose gold laptops for women). Finally, prompt-level debiasing reduces inter-group disparity but paradoxically collapses within-group diversity, replacing one stereotype with another. SODA offers a practical pipeline for making these implicit associations measurable, serving as a step toward more responsible AI development.
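
The "all 20 generated images share the exact same attribute value" finding can be made concrete with a simple concentration measure. The abstract does not give the formula for Visual Attribute Concentration, so the definition below (share of images carrying the most common value) is an assumption, and `attribute_concentration` is a hypothetical name.

```python
from collections import Counter


def attribute_concentration(values):
    """Share of images that carry the most common attribute value (0 to 1).

    Assumed stand-in for a concentration metric; not the paper's definition.
    """
    if not values:
        raise ValueError("need at least one attribute value")
    top_count = Counter(values).most_common(1)[0][1]
    return top_count / len(values)


collapsed = ["rose gold"] * 20  # every image has the same value
mixed = ["silver"] * 10 + ["black"] * 6 + ["rose gold"] * 4
print(attribute_concentration(collapsed))  # 1.0
print(attribute_concentration(mixed))      # 0.5
```

Under this reading, a combination counts as fully collapsed when the measure equals 1.0 over its 20 images.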

2605.14877 2026-06-18 cs.CV Version update 85%

HeatKV: Head-tuned KV-cache Compression for Visual Autoregressive Modeling

Jonathan Cederlund, Axel Berg, William Isaksson, Durmus Alp Emre Acar, Chuteng Zhou, Pontus Giselsson

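Affiliations * Dept. of Automatic Control, Lund University; Arm

Topic match Text-to-image: proposes the HeatKV compression method for visual autoregressive image generation.

AI summary Proposes HeatKV, which adapts cache allocation per head based on each head's attention to previously generated scales, giving more efficient KV-cache compression that improves memory use while preserving image generation quality.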

Comments 18 pages total including appendix; 6 main-paper figures, 2 appendix figures; 4 tables

Abstract

Visual Autoregressive (VAR) models have recently demonstrated impressive image generation quality while maintaining low latency. However, they suffer from severe KV-cache memory constraints, often requiring gigabytes of memory per generated image. We introduce HeatKV, a novel compression method that adapts cache allocation in each head based on its attention to previously generated scales. Using a small offline calibration set, the attention heads are ranked according to their attention scores over prior scales. Based on this ranking, we construct a static pruning schedule tailored to a given memory budget. Applied to the Infinity-2B model, HeatKV achieves 2× higher compression ratio in memory allocation for KV cache compared to existing methods, while maintaining similar or better image fidelity, prompt alignment and human perception score. Our method achieves a new state-of-the-art (SOTA) for VAR model KV-cache compression, showcasing the effectiveness of fine-grained, head-specific cache allocation. Code and calibration script available at https://github.com/arm-research/heatkv.
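
The ranking-then-budgeting step can be sketched as a small allocation routine. This is not the HeatKV algorithm or the API of the linked repository: the proportional rule, the per-head minimum, and the name `allocate_cache` are assumptions chosen to illustrate how calibration scores might be turned into a static, budget-constrained schedule.

```python
def allocate_cache(head_scores, total_budget, min_per_head=1):
    """Split `total_budget` cache slots across heads in proportion to score.

    Hypothetical rule: every head keeps at least `min_per_head` slots, the
    remainder is shared proportionally to the calibration score, and slots
    lost to rounding down go to the highest-ranked heads first.
    """
    n = len(head_scores)
    total_score = sum(head_scores)
    if n == 0 or total_score <= 0:
        raise ValueError("need at least one head with a positive score")
    if total_budget < n * min_per_head:
        raise ValueError("budget too small for the per-head minimum")
    spare = total_budget - n * min_per_head
    budgets = [min_per_head + int(spare * s / total_score) for s in head_scores]
    # Rank heads once (offline); the resulting schedule is static.
    ranked = sorted(range(n), key=lambda h: head_scores[h], reverse=True)
    leftover = total_budget - sum(budgets)
    for h in ranked[:leftover]:
        budgets[h] += 1
    return budgets


# Scores stand for attention mass on prior scales, in arbitrary units.
print(allocate_cache([6, 1, 3], total_budget=13))
```

Because the schedule depends only on calibration scores and the budget, it can be computed once and reused for every generated image.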

3. Diffusion models (1 paper)

2606.05883 2026-06-18 cs.CV Version update 85%

Geometry-Aware Dataset Condensation for Diffusion Model Training

Xiao Cui, Yulei Qin, Mo Zhu, Wengang Zhou, Hongsheng Li, Houqiang Li

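Affiliations * GitHub

Topic match Diffusion models: geometry-aware dataset condensation for diffusion model training.

AI summary For diffusion model training, proposes real-subset selection based on geometry-aware distribution alignment. One-sided partial optimal transport preserves geometric structure, lightweight feature-statistics and semantic-consistency regularization complement it, and a two-stage discrete optimization makes the condensation efficient.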

Comments ICML 2026

Abstract

Dataset condensation aims to construct compact datasets from real data via synthesis or selection. However, existing approaches are ill-suited for diffusion model training: synthetic data generation often yields low-fidelity samples unsuitable for authentic modeling, while real subset selection typically fails to preserve the distributional geometry required by diffusion likelihood objectives. To address this, we propose to reformulate real subset selection as a geometry-aware distribution alignment problem. By incorporating one-sided partial optimal transport, our method selectively aligns a compact subset with the full data distribution while allowing unmatched mass in low-density regions, ensuring the preserved geometric structure necessary for effective diffusion model training. To further ensure distributional fidelity, we complement geometric alignment with lightweight feature-statistics and semantic consistency regularization. An efficient two-stage discrete optimization strategy is proposed to achieve this alignment objective. Extensive experiments across diffusion variants, subset sizes, image resolutions, and training rounds show that our method achieves superior fidelity and distributional coverage in diffusion model training. Codes are available at https://github.com/2018cx/GADC.

4. Image editing (2 papers)

2605.21431 2026-06-18 cs.CV Version update 85%

iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance

Jun Zheng, Zhengze Xu, Mengting Chen, Jing Wang, Jinsong Lan, Xiaoyong Zhu, Kaifu Zhang, Bo Zheng, Xiaodan Liang

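Affiliations * Shenzhen Campus of Sun Yat-sen University; Taobao & Tmall Group of Alibaba

Topic match Image editing: interactive video virtual try-on, which falls under image generation and editing.

AI summary Proposes the iTryOn framework, which uses spatial-semantic guidance to address semantic ambiguity and complex garment deformation in interactive video virtual try-on, enabling a more dynamic and controllable try-on experience.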

Comments Project Page: https://zhengjun-ai.github.io/itryon-page. Accepted by ICML 2026

Abstract

Video Virtual Try-On (VVT) aims to seamlessly replace a garment on a person in a video with a new one. While existing methods have made significant strides in maintaining temporal consistency, they are predominantly confined to non-interactive scenarios where models merely showcase garments. This limitation overlooks a crucial aspect of real-world apparel presentation: active human-garment interaction. To bridge this gap, we introduce and formalize a new challenging task: Interactive Video Virtual Try-On (Interactive VVT), where subjects in the video actively engage with their clothing. This task introduces unique challenges beyond simple texture preservation, including: (1) resolving the semantic ambiguity of interactions from standard pose information, and (2) learning complex garment deformations from video where interactive moments are sparse and brief. To address these challenges, we propose iTryOn, a novel framework built upon a large-scale video diffusion Transformer. iTryOn pioneers a multi-level interaction injection mechanism to guide the generation of complex dynamics. At the spatial level, we introduce a garment-agnostic 3D hand prior to provide fine-grained guidance for precise hand-garment contact, effectively resolving spatial ambiguity. At the semantic level, iTryOn leverages global captions for overall context and time-stamped action captions for localized interactions, synchronized via our novel Action-aware Rotational Position Embedding (A-RoPE). Extensive experiments demonstrate that iTryOn not only achieves state-of-the-art performance on traditional VVT benchmarks but also establishes a commanding lead in the new interactive setting, marking a significant step towards more dynamic and controllable virtual try-on experiences.

2604.03156 2026-06-18 cs.CV Version update 85%

CAMEO: A Conditional and Quality-Aware Multi-Agent Image Editing Orchestrator

Yuhan Pu, Hao Zheng, Ziqian Mo, Zirui Pang, Hill Zhang, Tianyi Fan, Shuhong Wu, Jiaheng Wei

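Affiliations * The Hong Kong University of Science and Technology (Guangzhou); Harbin Institute of Technology; Shenzhen University; Claremont McKenna College; Research Institute of Petroleum Exploration and Development, CNPC

Topic match Image editing: multi-agent framework for conditional image editing, with quality evaluation.

AI summary Proposes CAMEO, a multi-agent framework that recasts conditional image editing as a quality-aware, feedback-driven process. By decomposing editing into stages and embedding an evaluation loop, it raises the average win rate by 20% on anomaly insertion and human pose switching.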

Abstract

Conditional image editing aims to modify a source image according to textual prompts and optional reference guidance. Such editing is crucial in scenarios requiring strict structural control (e.g., anomaly insertion in driving scenes and complex human pose transformation). Despite recent advances in large-scale editing models (e.g., Seedream and Nano Banana), most approaches rely on single-step generation. This paradigm often lacks explicit quality control, may introduce excessive deviation from the original image, and frequently produces structural artifacts or environment-inconsistent modifications, typically requiring manual prompt tuning to achieve acceptable results. We propose CAMEO, a structured multi-agent framework that reformulates conditional editing as a quality-aware, feedback-driven process rather than a one-shot generation task. CAMEO decomposes editing into coordinated stages of planning, structured prompting, hypothesis generation, and adaptive reference grounding, where external guidance is invoked only when task complexity requires it. To overcome the lack of intrinsic quality control in existing methods, evaluation is embedded directly within the editing loop. Intermediate results are iteratively refined through structured feedback, forming a closed-loop process that progressively corrects structural and contextual inconsistencies. We evaluate CAMEO on anomaly insertion and human pose switching tasks. Across multiple strong editing backbones and independent evaluation models, CAMEO consistently achieves a 20% higher win rate on average compared to multiple state-of-the-art models, demonstrating improved robustness, controllability, and structural reliability in conditional image editing.
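
The closed-loop idea, evaluation embedded inside the editing loop, can be sketched as a generate, score, re-prompt cycle. This is a minimal illustration and not CAMEO's implementation: `refine_edit`, `edit_fn`, `score_fn`, the acceptance threshold, and the way feedback is appended to the prompt are all hypothetical stand-ins for an editing backbone and an evaluator agent.

```python
def refine_edit(edit_fn, score_fn, prompt, max_rounds=3, accept_at=0.8):
    """Generate, evaluate, and re-prompt until the score is acceptable.

    `edit_fn(prompt) -> result` and `score_fn(result) -> (score, feedback)`
    are hypothetical callables; the best-scoring result is returned.
    """
    best_result, best_score = None, float("-inf")
    for _ in range(max_rounds):
        result = edit_fn(prompt)
        score, feedback = score_fn(result)
        if score > best_score:
            best_result, best_score = result, score
        if score >= accept_at:
            break
        # Fold the evaluator's feedback into the next prompt.
        prompt = f"{prompt}\nFix: {feedback}"
    return best_result, best_score


# Toy stand-ins: the "edit" improves as more feedback lines accumulate.
def toy_edit(prompt):
    return prompt.count("Fix:")


def toy_score(result):
    return min(1.0, 0.5 + 0.2 * result), "keep lane markings intact"


print(refine_edit(toy_edit, toy_score, "insert a fallen tree on the road"))
```

Compared with one-shot generation, the loop trades extra model calls for a chance to correct structural or contextual errors before a result is accepted.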

5. Other image generation (1 paper)

2605.08189 2026-06-18 eess.AS Version update 55%

DiffVQE: Hybrid Diffusion Voice Quality Enhancement Under Acoustic Echo and Noise

Haljan Lugo, Ernst Seidel, Pejman Mowlaee, Ziyue Zhao, Tim Fingscheidt

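Topic match Other image generation: proposes a diffusion model for voice quality enhancement; not image generation.

AI summary Proposes DiffVQE, the first diffusion-based acoustic echo control model, which outperforms the discriminative DeepVQE model in echo and noise control performance, computational complexity, and model size.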

Comments 6 pages, 4 figures, accepted at Interspeech 2026

Abstract

Acoustic echo and background noise pose challenges for speech enhancement in hands-free systems and speakerphones. Discriminatively trained end-to-end methods represent a powerful solution for joint acoustic echo control (AEC) and denoising. However, with the advent of generative methods, diffusion-based approaches have shown remarkable performance in speech enhancement tasks. In this work, to the best of our knowledge, we provide the first (still non-causal) diffusion-based AEC model (DiffVQE) that is reproducible in terms of topology, training data, and training framework. So far, without employing diffusion, Microsoft's discriminative DeepVQE model has been shown to outperform all of the ICASSP 2023 AEC Challenge entries, achieving remarkable performance. Using data from the Interspeech 2025 URGENT Challenge for a diverse, high-quality training dataset, our DiffVQE outperforms DeepVQE both in echo and noise control performance, as well as in computational complexity and model size.