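arXivDaily: arXiv daily academic digest, updated Monday to Friday

Vision and Robotics

Image Generation

Image generation, text-to-image, image editing, diffusion models, and controllable generation.

Entries for today / current date: 2. Source categories: cs.CV, cs.GR, cs.MM
2606.19259 2026-06-18 cs.CV cs.AI New submission 70%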

A Multi-Domain Benchmark for Detecting AI-Generated Text-Rich Images from GPT-Image-2

Yijin Wang, Shuyi Wang, Wenhan Zhang, Yuqi Ouyang

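Affiliation: College of Computer Science

Topic match: Other image generation: detecting images generated by GPT-Image-2

AI summary: To address the lack of text-rich image detection in existing benchmarks, the authors build a multi-domain benchmark of 8,602 images across 6 categories and evaluate 5 detectors, finding that performance is highly domain-dependent and vulnerable to JPEG compression.

Details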
English abstract

Text-rich images often contain privacy-sensitive, transactional, or decision-relevant information. As recent multimodal image generation models become increasingly capable of synthesizing realistic textual content and structured visual designs, detecting AI-generated text-rich images has become an important challenge for digital trust and content authenticity. Existing benchmarks, however, largely focus on object-centric images and provide limited coverage of scenarios where textual semantics and layout organization are central. In this paper, we introduce a multi-domain benchmark for detecting text-rich images generated by OpenAI's GPT Image 2. The benchmark contains 8,602 images across six representative categories: commercial posters, infographics, academic posters, receipts, tables, and UI screenshots. Using this benchmark, we evaluate five representative AI-generated image detectors in a zero-shot setting and analyze their overall, category-wise, and post-processing robustness. Our results show that detector performance is highly domain-dependent: methods that perform well in some categories often fail on others, and even the strongest conventional detector exhibits severe sensitivity to JPEG compression. We further conduct an exploratory evaluation with a multimodal vision-language model, revealing both its promise and its limitations on structured formats. These findings highlight the need for text- and layout-aware detection methods for modern AI-generated images. Our dataset is released at XXX.
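
The category-wise analysis mentioned in the abstract can be illustrated with a minimal sketch. It assumes each evaluated image is recorded as a (category, ground-truth label, detector prediction) triple; the function name `category_accuracy` and the toy records are hypothetical and are not taken from the paper or its dataset.

```python
from collections import defaultdict


def category_accuracy(records):
    """Return accuracy per category.

    records: iterable of (category, is_ai_generated, predicted_ai) tuples.
    This record layout is an assumption made for illustration.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, label, prediction in records:
        total[category] += 1
        # Count a hit when the detector's decision matches the label.
        correct[category] += int(label == prediction)
    return {category: correct[category] / total[category] for category in total}


# Hypothetical toy records, NOT results from the paper.
toy_records = [
    ("receipts", True, True),
    ("receipts", True, False),
    ("tables", True, True),
    ("tables", False, False),
]

per_category = category_accuracy(toy_records)
print(per_category)
```

Reporting accuracy per category rather than one pooled number is what makes domain dependence visible: a detector can look acceptable overall while failing on a single format such as receipts.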

2605.08189 2026-06-18 eess.AS Version update 55%

DiffVQE: Hybrid Diffusion Voice Quality Enhancement Under Acoustic Echo and Noise

Haljan Lugo, Ernst Seidel, Pejman Mowlaee, Ziyue Zhao, Tim Fingscheidt

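Topic match: Other image generation: proposes a diffusion model for voice quality enhancement, not image generation.

AI summary: Proposes DiffVQE, the first diffusion-based acoustic echo control model, which outperforms the discriminative DeepVQE model in echo and noise control performance, computational complexity, and model size.

Comments: 6 pages, 4 figures, accepted at Interspeech 2026

Details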
English abstract

Acoustic echo and background noise pose challenges for speech enhancement in hands-free systems and speakerphones. Discriminatively trained end-to-end methods represent a powerful solution for joint acoustic echo control (AEC) and denoising. However, with the advent of generative methods, diffusion-based approaches have shown remarkable performance in speech enhancement tasks. In this work, to the best of our knowledge, we provide the first (still non-causal) diffusion-based AEC model (DiffVQE) that is reproducible in terms of topology, training data, and training framework. So far, without employing diffusion, Microsoft's discriminative DeepVQE model has been shown to outperform all of the ICASSP 2023 AEC Challenge entries, achieving remarkable performance. Using data from the Interspeech 2025 URGENT Challenge as a diverse, high-quality training dataset, our DiffVQE outperforms DeepVQE both in echo and noise control performance and in computational complexity and model size.