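arXivDaily: arXiv Daily Academic Digest, updated Monday to Friday

Vision and Robotics

Image Generation

Image generation, text-to-image, image editing, diffusion models, and controllable generation.

Entries for today/current date: 6. Sources: cs.CV, cs.GR, cs.MM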
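2606.20536 2026-06-19 cs.CV New submission 75%

The FID Lottery: Quantifying Hidden Randomness in Generative-Model Evaluation

Nicolas Dufour, Alexei A. Efros, Patrick Pérez

Affiliations * Kyutai; UC Berkeley

Topic match Other image generation: studies randomness in FID evaluation, which affects how generative models are benchmarked

AI summary Studies the variance of FID as a random variable over training and generation seeds, finds that retraining causes larger FID fluctuations than resampling, and proposes a new evaluation protocol: evaluate under per-cell optimal guidance and report error bars over multiple training seeds.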

Comments Website: https://kyutai.org/fid-lottery

Abstract

The Fréchet Inception Distance (FID) is the de facto arbiter of image generation, yet most papers report just a single number from a single trained model using a single sampling seed. How reproducible is that number if we retrain the model, or merely resample from it? In this paper, we treat FID as a random variable on a two-axis panel of training and generation seeds, and measure its variance directly on several hundred SiT networks trained on class-conditional ImageNet 256x256. We report surprising findings: (a) Retraining the model using the same recipe with a different seed moves FID 3.2x more (in Inception feature space) than redrawing samples from a fixed network. (b) That gap is driven by three factors: random initialisation, data ordering, and the per-step Gaussian noise of the flow-matching loss. (c) Increasing compute or model size barely tightens the spread, holding the FID coefficient of variation (CoV) inside a 1-2% band. (d) Per-cell classifier-free-guidance tuning halves the spread but reshuffles which seeds work best, and a lucky training seed reaches the same FID with up to 2x less compute than an unlucky one. Based on these findings, we recommend a new FID evaluation protocol: evaluate under per-cell optimal guidance, treat any FID gap below the empirically measured ~1.3% CoV as inconclusive, and report an error bar over several training seeds rather than a single FID number.
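The reporting side of that protocol can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the five FID values, and the choice to normalise the gap by the mean of the two FIDs are all assumptions made for the example. Only the ~1.3% threshold comes from the abstract.

```python
import statistics


def fid_summary(fids):
    """Return mean, sample standard deviation, and CoV (std / mean) of FID values."""
    mean = statistics.fmean(fids)
    std = statistics.stdev(fids)  # sample standard deviation (n - 1 denominator)
    return mean, std, std / mean


def gap_is_inconclusive(fid_a, fid_b, cov_threshold=0.013):
    """Return True if the relative FID gap falls below the CoV threshold.

    The 0.013 default mirrors the ~1.3% CoV quoted in the abstract.
    Normalising by the mean of the two FIDs is an assumption of this sketch.
    """
    rel_gap = abs(fid_a - fid_b) / ((fid_a + fid_b) / 2)
    return rel_gap < cov_threshold


# Hypothetical FIDs from five training seeds of one recipe (illustrative only).
seed_fids = [2.31, 2.27, 2.35, 2.29, 2.33]
mean, std, cov = fid_summary(seed_fids)
print(f"FID = {mean:.3f} +/- {std:.3f} (CoV {cov:.2%})")
print("2.31 vs 2.29 inconclusive:", gap_is_inconclusive(2.31, 2.29))
print("2.31 vs 2.20 inconclusive:", gap_is_inconclusive(2.31, 2.20))
```

Reporting the mean with its spread across training seeds, rather than one number, is what lets a reader judge whether a claimed improvement exceeds seed noise.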

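2606.20488 2026-06-19 cs.CV New submission 75%

How Fragile Are Training-Free AI-Generated Image Detectors? A Controlled Audit of Score Direction, Preprocessing, and Compression

Jingwen Zhou, Mingzhe Wang

Affiliations * Xidian University

Topic match Other image generation: detects AI-generated images and evaluates generation quality

AI summary Audits two training-free detection scores (autoencoder reconstruction and noise-perturbation feature similarity) plus a kNN baseline under a unified protocol, finding that implementation details alone shift per-generator AUROC by up to 0.38, that score direction depends on hyperparameters, that dataset format bias inflates robustness claims, and that naive z-score fusion does not beat the best single score.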

Abstract

Training-free detectors of AI-generated images promise generator-agnostic deployment without classifier training, yet their reported numbers are rarely compared under a single controlled protocol. We audit two representative training-free scores -- an autoencoder-reconstruction score (AEROBLADE-style) and a noise-perturbation feature-similarity score (RIGID-style) -- plus a naive feature-kNN control, on a common 1,500-image GenImage-derived benchmark spanning seven generators and JPEG compression at quality 70 and 50. The audit yields three cautionary findings. (i) Implementation details masquerade as method differences: replacing the LPIPS backbone (AlexNet -> VGG-16) changes overall AUROC by +0.085, and switching between resize-to-512 and native-resolution preprocessing flips per-generator conclusions by up to 0.38 AUROC. (ii) Score direction is not a property of the method but of its hyperparameters: the RIGID-style score is inverted (AUROC < 0.5) on SD1.5 and Wukong at noise level sigma=0.05, recovers to >0.5 for every generator at sigma=0.01, and collapses to 0.15 at sigma=0.3. (iii) Dataset format bias inflates robustness claims: without unified re-encoding, AUROC under JPEG-50 exceeds the clean condition for the AlexNet-backbone reconstruction score; after bias correction the residual anomaly localizes to a single generator (BigGAN). The audited scores have complementary per-generator failure sets, but naive z-score fusion does not beat the best single score, indicating that exploiting complementarity requires direction-aware combination.
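The point about score direction and fusion can be made concrete with a toy example. The sketch below is an assumption-laden illustration, not the paper's pipeline: the detector outputs are invented, `auroc`, `zscore`, and `fuse` are hypothetical helper names, and the "direction-aware" variant simply flips the sign of a detector known to be inverted.

```python
def auroc(scores, labels):
    """AUROC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def zscore(xs):
    """Standardise a list of scores to zero mean and unit (population) variance."""
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / std for x in xs]


def fuse(score_lists, signs):
    """Sum z-scored detector outputs, each multiplied by its sign (+1 or -1)."""
    zs = [zscore(s) for s in score_lists]
    return [sum(sign * z[i] for sign, z in zip(signs, zs))
            for i in range(len(zs[0]))]


# Toy data (invented): label 1 = generated, 0 = real.
labels = [1, 1, 1, 0, 0, 0]
det_a = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]  # higher score means "generated"
det_b = [0.1, 0.3, 0.2, 0.7, 0.9, 0.8]  # inverted: lower score means "generated"

print("detector A:", auroc(det_a, labels))
print("detector B:", auroc(det_b, labels))
print("naive fusion:", auroc(fuse([det_a, det_b], [1, 1]), labels))
print("direction-aware fusion:", auroc(fuse([det_a, det_b], [1, -1]), labels))
```

In this toy setting, adding an inverted score without correcting its sign drags the fused AUROC below that of the better single detector, which is one mechanism consistent with the abstract's conclusion that fusion needs to be direction-aware.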

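2606.20563 2026-06-19 cs.CV New submission 70%

JanusMesh: Fast and Zero-Shot 3D Visual Illusion Generation via Cross-Space Denoising

Siang-Ling Zhang, Huai-Hsun Cheng, Tsung-Ju Yang, Yu-Lun Liu

Affiliations * National Yang Ming Chiao Tung University

Topic match Other image generation: generates dual-semantic 3D visual illusions, which falls under image generation

AI summary Proposes a fast, training-free framework that uses cross-space dual-branch denoising and view-conditioned texture synthesis to generate highly realistic dual-semantic 3D visual illusions in 3-5 minutes, outperforming existing methods.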

Comments ECCV 2026. Project page: https://siang1105.github.io/JanusMesh.github.io/

Abstract

Creating 3D visual illusions, a single 3D mesh that reveals entirely different semantics from various viewing angles, is a fascinating but tough challenge. Existing optimization-based methods are slow and can produce oversaturated colors. In contrast, naive stitching approaches fail to produce geometrically coherent objects. This results in visible unnatural seams and semantic leaks. In this paper, we present a fast and training-free framework for generating text-driven 3D visual illusions. Our approach decouples the generation into two stages. First, we propose a cross-space dual-branch denoising process. This process dynamically decodes 3D latents into voxel space for CLIP-guided orientation alignment and Signed Distance Field (SDF) blending, which ensures seamless geometric fusion. Second, we introduce a view-conditioned texture synthesis module that projects and aggregates view-specific 2D diffusion priors onto the fused geometry. Extensive experiments demonstrate that our method generates highly realistic, dual-semantic 3D illusions in just 3-5 minutes. It significantly outperforms existing methods in geometric integrity, semantic recognizability, and efficiency. Project page: https://siang1105.github.io/JanusMesh.github.io/

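2606.16417 2026-06-19 cs.SD eess.AS New submission 70%

Joycent: Diffusion-based Accent TTS without Accented Phone Prediction

Xintong Wang, Ye Wang

Affiliations * University of Science and Technology of China

Topic match Other image generation: diffusion model applied to accented speech synthesis

AI summary Proposes Joycent, a diffusion-based accent TTS method that synthesizes accented speech directly from standard phone sequences and speech references, without accented phone prediction. It integrates accent and speaker representations through conditional layer normalization and introduces the WhisAID accent identification model, improving accentedness while preserving speaker identity.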

Abstract

Accent text-to-speech (TTS) aims to synthesize speech with target accents. Existing accent TTS systems typically rely on a two-stage pipeline that first converts standard phone sequences into accented phone sequences and then synthesizes accented speech. However, such approaches suffer from error accumulation and require paired standard-accented phone sequence data, which is often limited in practice. Moreover, text-based accented phone representations are insufficient to model acoustic accent characteristics such as prosody and rhythm. In this work, we propose Joycent, a diffusion-based accent TTS model that synthesizes accented speech directly from standard phone sequences and speech references without accented phone prediction. Joycent integrates accent and speaker representations through conditional layer normalization (CLN) in the text encoder. We introduce WhisAID, a Mandarin accent identification model trained on accented Mandarin speech to extract accent representations. Experimental results show that Joycent improves accentedness while preserving speaker identity compared with baseline systems. We release our code and demos at: https://github.com/oshindow/Joycent-code.
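For readers unfamiliar with conditional layer normalization, the general idea can be sketched as follows. This is a generic illustration and not Joycent's implementation: the `1 + scale` parameterisation, the helper names, the toy dimensions, and the way the accent and speaker embeddings are combined into `cond` are all assumptions.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalise one feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(cond, weights):
    """Project cond (length C) through a C x H weight matrix to a length-H vector."""
    return [sum(c * w for c, w in zip(cond, col)) for col in zip(*weights)]


def conditional_layer_norm(x, cond, w_scale, w_bias):
    """Layer norm whose scale and bias are predicted from a conditioning vector.

    `cond` stands in for a combined accent/speaker embedding (hypothetical).
    """
    x_hat = layer_norm(x)
    scale = linear(cond, w_scale)
    bias = linear(cond, w_bias)
    return [(1.0 + s) * h + b for h, s, b in zip(x_hat, scale, bias)]


# Toy example: hidden size 4, conditioning size 2 (illustrative values only).
hidden = [1.0, 2.0, 3.0, 4.0]
cond = [1.0, 0.0]
w_scale = [[0.0] * 4, [0.0] * 4]  # zero weights: scale stays at 1
w_bias = [[0.5] * 4, [0.0] * 4]   # first conditioning unit shifts output by 0.5
print(conditional_layer_norm(hidden, cond, w_scale, w_bias))
```

With zero projection weights the layer reduces to plain layer normalization, so the conditioning acts as a learned, input-dependent modulation on top of the usual normalisation.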

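2606.19957 2026-06-19 cs.CY New submission 60%

Modest, artistic, and radical solutions to the environmental impact of image-generating machine learning

Laura U. Marks, Jess MacCormack, Kehui Li

Topic match Other image generation: discusses the environmental impact of image-generating ML and possible solutions

AI summary Addresses the high energy consumption of image-generating ML by exploring, from computer engineering, media studies, and artistic perspectives, solutions such as inexact computing, small models, and low-precision hardware, and proposes true-cost accounting.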

Comments Paper in Proceedings of LIMITS 2026: 12th Workshop on Computing within Limits, 2026-06-23-25, Online

Abstract

Machine learning is often touted to improve the efficiency of ICT, but that small gain is overwhelmed by the enormous carbon, water, and land footprints of data centers and ML-ready devices. We survey the electricity consumption of ML applications in training and inference, focusing on electricity-intensive image generation. Our team of a computer engineer, a media scholar, and an artist explores solutions including inexact computing; tiny language models; low-precision hardware architectures; hardware with limited capacity; and anticipating and mitigating energy demands at the design phase. We will sketch our work in progress on an ethical and aesthetically sophisticated tiny image generator using non-scraped data. Looking to the economic context, we will propose a true-cost accounting for the environmental impact of machine learning and suggest that the criterion of efficiency is driven by the shareholder-capitalist framing of ICT.

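2606.19701 2026-06-19 astro-ph.HE New submission 55%

On the Contribution of Local Sources to the Galactic Cosmic-Ray Spectrum: An Exact Series Solution for Two-Zone Diffusion

Zi-Hang Liu, Yiwei Bao, Ruo-Yu Liu

Topic match Other image generation: diffusion model of local-source contributions to the cosmic-ray spectrum

AI summary Derives a series Green's function for the two-zone diffusion model. Monte Carlo simulations show that slow diffusion near sources raises the probability of a comparable local-source contribution from about 0.4% to 1.7-2.2%, yet the statistical difficulty remains and the local-source interpretation is highly model dependent.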

Comments submitted to PRD, The code accompanying this paper will be released soon

Abstract

Measurements of cosmic-ray proton and helium spectra below the knee show deviations from simple power laws, including multi-TeV structures. A possible explanation is that one or a few nearby sources contribute an additional component to the local spectrum. However, previous studies show that a dominant local contribution is statistically unlikely under a homogeneous diffusion model. In this work, we investigate how this probability changes if cosmic rays experience inefficient transport near their sources, motivated by observations of extended gamma-ray emission around Galactic accelerators. We derive a series Green's function that enables fast calculation of the particle distribution in this scenario, making Monte Carlo calculations for Galactic source populations feasible. The inner slow-diffusion region delays escape and redistributes the arriving particles in time and energy. In Monte Carlo realizations, the probability that the strongest local source becomes comparable to the background at $10\,\mathrm{TeV}$ increases from about $0.4\%$ in homogeneous diffusion to $1.7$--$2.2\%$ in the two-zone models. Thus inhibited near-source transport weakens, but does not remove, the statistical difficulty. We then examine cataloged nearby candidate supernova remnants and show that a $10\,\mathrm{TeV}$ feature can be reproduced only with additional assumptions, especially a harder local injection spectrum and a favorable diffusion coefficient. The predicted contribution of a given source changes strongly among different particle transport models. Therefore, the local source interpretations are plausible but highly model dependent, and require independent constraints on source injection history, particle transport mechanisms, and local interstellar turbulence.