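arXivDaily: arXiv daily academic digest, updated Monday through Friday

Vision and Robotics

Image Generation

Image generation, text-to-image, image editing, diffusion models, and controllable generation.

Indexed for the current date: 44 papers. Sources: cs.CV, cs.GR, cs.MM

1. Text-to-Image: 7 papers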

2606.20100 2026-06-19 cs.CV New submission 95%

WeGenBench: A Multidimensional Diagnostic Benchmark towards Text-to-Image Model Optimization

Qian Liang, Xiaomin Li, Ying Zhang, Jia Xu, Lihao Ni, Hongrui Li, Jingjing Li, Jing Lyu, Chen Li

Affiliations: University of Electronic Science and Technology of China; Dalian University of Technology; Weixin, Tencent

Topic match (Text-to-Image): an evaluation benchmark for text-to-image generation.

AI summary: Proposes the WeGenBench benchmark of 4,000 bilingual Chinese-English prompts. Scene classification and multi-dimensional tags enable cross-dimensional evaluation, and new metrics based on vision-language models pinpoint model deficiencies in specific generation categories.

Abstract

Recent text-to-image generation models have demonstrated remarkable capabilities in synthesizing highly realistic images from text inputs alone. Although existing benchmarks can evaluate the generation capabilities of various models to some extent, they struggle to comprehensively and accurately measure performance across multiple dimensions, often failing to reveal the inherent deficiencies of models in specific categories. To address these limitations, we propose WeGenBench, a novel benchmark designed for the comprehensive, multi-perspective evaluation of text-to-image generation capabilities. Our benchmark comprises a total of 4,000 test prompts across two primary categories, meticulously balanced between Chinese and English to evaluate bilingual and cross-cultural generation capabilities. Beyond macroscopic scene classification, we annotate each prompt with multi-dimensional tags tailored to the distinct content and challenges of each language, thereby refining the generation tasks into more specific sub-categories. Through a cross-dimensional evaluation mechanism leveraging both scene classifications and multi-dimensional tags, WeGenBench can precisely pinpoint model shortcomings in specific generation categories. Furthermore, to measure generation quality more accurately, we design and validate several novel evaluation metrics by integrating Vision-Language Models (VLMs), which assess model performance on domain-specific tasks from three core aspects. Crucially, our approach yields both the assessment outcomes and the detailed reasoning trajectories, facilitating a rigorous verification of the accuracy and soundness of the evaluation results. Finally, we conduct systematic benchmarking on current state-of-the-art methods and provide an in-depth analysis of the limitations present in existing models.
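The cross-dimensional evaluation described in this abstract can be pictured as a group-by over (scene, tag) pairs. The following is a minimal sketch under stated assumptions, not the benchmark's actual tooling: the record fields `scene`, `tags`, and `score`, and both function names, are hypothetical.

```python
from collections import defaultdict


def cross_dimensional_scores(records):
    """Average the score of every (scene, tag) cell.

    `records` is an iterable of dicts with the hypothetical keys
    "scene" (str), "tags" (list of str) and "score" (float).
    """
    buckets = defaultdict(list)
    for rec in records:
        for tag in rec["tags"]:
            buckets[(rec["scene"], tag)].append(rec["score"])
    return {cell: sum(vals) / len(vals) for cell, vals in buckets.items()}


def weakest_cells(table, k=3):
    """Return the k lowest-scoring (scene, tag) cells, worst first."""
    return sorted(table.items(), key=lambda item: item[1])[:k]
```

Sorting the cells by mean score is one simple way to surface the categories where a model is weakest; the abstract does not say how WeGenBench itself ranks them.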

2606.20506 2026-06-19 cs.CV cs.AI New submission 90%

FreeStyle: Free Control of Style-Content Dual-Reference Generation from Community LoRA Mining

Jinghong Lan, Wei Cheng, Yunuo Chen, Ziqi Ye, Peng Xing, Yixiao Fang, Rui Wang, Yufeng Yang, Xuanyang Zhang, Xianfang Zeng, Difan Zou, Gang Yu, Chi Zhang

Affiliations: Fudan University; StepFun; Westlake University; University of Hong Kong

Topic match (Text-to-Image): proposes a style-content dual-reference image generation framework.

AI summary: Proposes the FreeStyle framework, which uses community LoRAs as anchors and a two-stage curriculum (an attention-level constraint and frequency-aware RoPE modulation) to address content leakage in dual-reference generation. It also introduces a new benchmark and evaluation metrics, balancing style alignment, content preservation, and leakage suppression.

Comments: 35 pages, 26 figures. Project page: https://github.com/Blue2Giant/FreeStyle

Abstract

Style-content dual-reference generation aims to synthesize an image that preserves the structure and semantics of a content reference while adopting the style of a separate style reference. Despite recent progress, this setting remains challenging because models must balance content fidelity, style alignment, and instruction following while avoiding semantic leakage from the style reference. A key bottleneck is the lack of large-scale triplet data with clean content-style separation and broad long-tail style coverage. In this work, we propose FreeStyle, a scalable dual-reference generation framework based on community LoRA mining. We treat community LoRAs as compositional anchors for style and content, and design a rigorous generation and filtering pipeline to construct large-scale Style-Reference and Content-Reference triplets across multiple base models. To address content leakage, we adopt a two-stage curriculum with stage-specific disentanglement mechanisms: an attention-level enrichment constraint that suppresses style-reference leakage in the style-transfer stage, and a frequency-aware RoPE modulation strategy that targets positional-correspondence-based leakage in the harder dual-reference stage. We also introduce a benchmark covering both style-reference and dual-reference generation, with evaluations on style similarity, content preservation, aesthetics, instruction following, and leakage rejection. The benchmark incorporates a style-invariant Content Alignment Score (CAS) and introduces a calibrated VLM-based Rejection Score for evaluating generation reliability and leakage suppression. Extensive experiments show that our model achieves a strong balance among style alignment, content preservation, and leakage suppression.

2606.20543 2026-06-19 cs.CV New submission 85%

SSD: Spatially Speculative Decoding Accelerates Autoregressive Image Generation

Shilong Xiang, Zirui Zhang, Lijun Yu, Chengzhi Mao

Affiliations: Rutgers University

Topic match (Text-to-Image): accelerates autoregressive image generation, an image generation technique.

AI summary: Proposes Spatially Speculative Decoding (SSD), which exploits 2D spatial correlation to predict the adjacent horizontal token and the token directly below at the same time, overcoming the memory bottleneck in visual inference and accelerating autoregressive image generation by up to 13.3x.

Abstract

Autoregressive models excel in visual generation by treating images as 1D sequences of discrete tokens, mirroring language modeling. However, this flattening discards the intrinsic 2D spatial locality of visual signals, creating severe computational bottlenecks during inference. We introduce Spatially Speculative Decoding (SSD), a framework that aligns the predictive objective with the natural geometry of images. Rather than predicting only the immediate next token in a 1D sequence, our model simultaneously predicts the adjacent horizontal token and the token directly below it. By capitalizing on this 2D spatial correlation, spatially speculative decoding overcomes the memory wall in visual inference. Our approach accelerates autoregressive image generation by up to 13.3x while maintaining high fidelity on DPG-Bench and GenEval. Our results suggest that respecting the underlying geometry of vision unlocks massive computational efficiencies, paving the way for real-time, high-resolution autoregressive generative models.
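To see why predicting the right-hand neighbour and the token below can shorten decoding, a toy counting argument may help. This sketch is not the authors' algorithm: it assumes an idealised case in which every speculative prediction is accepted, ignores any verification step, and uses hypothetical function names. Under that assumption the grid fills along anti-diagonals.

```python
def raster_steps(height, width):
    """Plain next-token decoding commits one token per step."""
    return height * width


def wavefront_steps(height, width):
    """Count steps when every known token reveals two neighbours per step.

    Hypothetical idealisation: each decoded token yields its right
    neighbour and the token directly below, and all are accepted.
    """
    known = {(0, 0)}  # the first token is decoded normally
    steps = 1
    while len(known) < height * width:
        frontier = set()
        for row, col in known:
            if col + 1 < width:
                frontier.add((row, col + 1))  # adjacent horizontal token
            if row + 1 < height:
                frontier.add((row + 1, col))  # token directly below
        known |= frontier
        steps += 1
    return steps
```

In this idealised model a 16x16 token grid needs 31 steps instead of 256. The 13.3x figure in the abstract is the paper's measured result and is not something this toy model reproduces.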

2606.20241 2026-06-19 cs.CV New submission 85%

BAFIS: Dataset + Framework to assess occupational Bias and Human Preference in modern Text-to-image Models

Thomas Klassert, Adrian Ulges, Biying Fu

Affiliations: RheinMain University of Applied Sciences

Topic match (Text-to-Image): evaluates occupational bias in text-to-image models.

AI summary: Presents the BAFIS platform and a dataset of 21,140 images generated from multilingual prompts to evaluate gender and ethnicity bias in occupation-related generation across five text-to-image models. Combined with human preference feedback, the study finds systematic biases and stresses the need to incorporate human preferences.

Comments: Accepted at the IEEE Winter Conference on Applications of Computer Vision, WACV 2026

Abstract

Generative artificial intelligence has the potential to improve productivity and transform the production of creative content. However, existing research indicates that image generation models are significantly influenced by biases. This work investigates the inherent biases and language-induced biases present in text-to-image models within the context of occupation-related image generation, complementing established metrics with human preference feedback. We present a comprehensive evaluation of five current text-to-image models: Midjourney v6.1, Stable Diffusion 3 Medium, DALL-E 3, Playground v2.5, and FLUX.1-dev, focusing on gender and ethnicity bias, image quality, and prompt alignment. To facilitate this evaluation, we developed the "Battle-Arena for Fair Image Synthesis" (BAFIS), a platform designed to collect human feedback on bias in generated images. Furthermore, we created a dataset comprising 21,140 synthetic images generated using multilingual prompts, which serves as a basis for our analysis. We further place our results within a broader social context by comparing them to official statistics from the German Federal Employment Agency. Our findings reveal systematic biases in text-to-image models, with established evaluation metrics in partial correlation with subjective user ratings. Thus, our research emphasizes the need for including human preferences to develop fairer and more inclusive text-to-image models.

2606.20155 2026-06-19 cs.CV cs.CL New submission 85%

NAMESAKES: Probing Identity Memorization in Text-to-Image Models

Morris Alper, Vasudha Varadarajan, Moran Yanuka, Angelina Wang, Hadar Averbuch-Elor

Affiliations: Carnegie Mellon University; Tel Aviv University; Cornell University

Topic match (Text-to-Image): probes identity memorization in text-to-image models.

AI summary: Proposes a black-box behavioral probe that, without reference photos or training data, distinguishes whether a face generated by a text-to-image model is memorized or fabricated, and validates it on the NAMESAKES dataset.

Abstract

Text-to-image (T2I) models generate realistic likenesses of some individuals when prompted with their names, raising privacy concerns. However, distinguishing whether a generated face is memorized or fabricated currently requires ground-truth photos, access to training data, or white-box access to model internals, limiting applicability. We introduce a fully black-box behavioral probe that distinguishes between these regimes while requiring no reference photos or prior knowledge of training data. To benchmark this task, we present the NAMESAKES dataset of over one thousand names and faces of public figures spanning a wide range of fame levels, along with perturbed, less famous names. Experiments on state-of-the-art T2I models show that our probe substantially predicts identity memorization and separates memorized from unrecognized names, with further insights into differences across model families.

2606.17979 2026-06-19 cs.AI New submission 85%

STAR: SpatioTemporal Adaptive Reward Allocation for Text-to-Image RL Post-Training

Jinjie Shen, Wei Deng, Xian Hu, Daiguo Zhou, Jian Luan

Topic match (Text-to-Image): a reward allocation method for post-training text-to-image generation.

AI summary: To address the granularity mismatch between the reward and the generative trajectory in text-to-image generation, proposes STAR, which uses text-image attention to build spatiotemporally adaptive allocation maps and applies stronger policy updates to the relevant latent regions, improving semantic alignment and text rendering.

Abstract

Existing RL post-training methods for text-to-image generation usually convert the final-image reward into a single scalar advantage and apply it with the same strength to the entire generative trajectory. However, text-to-image generation naturally has temporal and spatial structure: different denoising steps are responsible for different generation stages, and the content that truly determines text alignment often appears only in part of the image. This granularity mismatch makes it difficult for policy updates to focus on the generative components that actually affect the reward. To address this issue, we propose SpatioTemporal Adaptive Reward (STAR) Allocation for RL post-training of text-to-image diffusion and flow models. STAR uses text-image attention inside the generative model and starts from the core content that the user truly cares about in the prompt. It constructs spatial allocation maps that dynamically vary across denoising steps and rollouts, and allocates the same group-relative advantage to more relevant latent regions with almost no additional computational overhead. STAR then applies stronger policy updates to these regions through a spatially resolved policy objective. We use Stable Diffusion 3.5 Medium as the base model and evaluate on three tasks: GenEval, OCR text rendering, and PickScore. Experimental results show that STAR improves compositional semantic alignment, text rendering, and preference optimization without changing the external reward source, achieving 0.9759, 0.9757, and 23.60 on GenEval, OCR, and PickScore, respectively.
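The idea of spreading one scalar advantage unevenly over latent regions can be sketched in a few lines. This is an illustration under assumptions, not STAR's actual objective: the function name, the `strength` parameter, and the mean-preserving normalisation are all hypothetical, and `relevance` stands in for whatever text-image attention mass the method extracts.

```python
def allocate_advantage(advantage, relevance, strength=1.0):
    """Turn one scalar advantage into per-region advantages.

    relevance: non-negative score per latent region (assumed to come
        from text-image attention).
    strength: 0.0 gives the uniform baseline, 1.0 weights regions fully
        by relevance. The weights always average to 1, so the mean
        per-region advantage equals the original scalar.
    """
    count = len(relevance)
    total = sum(relevance)
    if total <= 0:
        return [advantage] * count  # no signal: fall back to uniform
    weights = [(1.0 - strength) + strength * count * r / total
               for r in relevance]
    return [advantage * w for w in weights]
```

For example, `allocate_advantage(2.0, [1.0, 3.0])` gives the more relevant region three times the update strength of the other, while `strength=0.0` recovers the single-scalar behaviour the abstract criticises.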

2606.19939 2026-06-19 cs.CV New submission 80%

DiffMath: Symbol- and Graph-Aware Latent Diffusion Transformer for Handwritten Mathematical Expression Generation

Wei Pan, Xuhan Zheng, Yilin Shi, Huiguo He, Hiuyi Cheng, Dezhi Peng, Minghui Liao, Lianwen Jin

Affiliations: South China University of Technology; Huawei Technologies Co., Ltd.

Topic match (Text-to-Image): proposes a diffusion framework for handwritten mathematical expression generation.

AI summary: Proposes the DiffMath framework, which uses the hierarchical structure of LaTeX as a prior. Through a Relational Abstract Syntax Tree, structure-preserving latent representations, and conditional denoising, it generates structurally consistent handwritten mathematical expressions without positional supervision.

Abstract

Handwritten Mathematical Expression Generation (HMEG) is challenging due to the complex two-dimensional layouts and long-range structural dependencies of mathematical expressions. Existing methods typically rely on explicit spatial supervision, such as symbol-level bounding boxes, which incurs high annotation costs and limits scalability. In this work, we propose DiffMath, a symbol- and graph-aware latent diffusion framework that leverages the hierarchical structure inherent in LaTeX as a structural prior, eliminating the need for positional supervision. First, we design a Relational Abstract Syntax Tree (RelAST), a generation-oriented representation that distills MathML trees into compact triplet sequences [S, R, D], where each token directly encodes a symbol identity, spatial relation, or nesting depth. Second, we introduce MathVAE, which learns structure-preserving latent representations through symbol-aware and relation-aware perceptual regularization, ensuring that the latent space captures both character semantics and spatial topology. Third, MathDiT performs conditional denoising in this structured latent space, further guided by a global symbol-count prior via Adaptive Layer Normalization (AdaLN) to improve structural coherence. Experiments show that DiffMath produces structurally consistent handwritten expressions, achieves superior performance over existing methods, and improves the accuracy of downstream OCR models through synthetic data augmentation.

2. Diffusion Models: 10 papers

2606.20416 2026-06-19 cs.LG cs.CV New submission 90%

On the Redundancy of Timestep Embeddings in Diffusion Models

José A. Chávez

Affiliations: Independent Researcher, Lima, Peru

Topic match (Diffusion Models): studies the redundancy of timestep embeddings in diffusion models, with implications for image generation.

AI summary: Shows theoretically and empirically that, for U-Net and Diffusion Transformer architectures and under certain conditions, diffusion models can reach the global optimum of the training objective without explicit timestep embeddings, and can even surpass conditioned models on some metrics.

Comments: 17 pages

Abstract

Diffusion models rely heavily on explicit timestep embeddings to modulate the denoising process across various noise scales. In this work, we challenge the necessity of these temporal signals by analyzing their impact on U-Net and Diffusion Transformer architectures. Beyond empirical evidence, we provide a theoretical framework demonstrating that, under certain conditions, the global minimizer of the diffusion training objective can be achieved without explicit timestep conditioning. Our findings reveal a surprising robustness when timestep embeddings are completely removed. Extensive ablation studies on the CelebA and CIFAR-10 datasets show that these time-agnostic models can maintain high structural fidelity and even surpass their conditioned counterparts in competitive metrics, including FID, precision, and recall. Our analysis suggests these architectures can implicitly infer noise scales from the corrupted input under specific assumptions, rendering explicit temporal conditioning redundant. This study challenges long-standing temporal conditioning paradigms and paves the way for more efficient and structurally focused generative architectures.
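The claim that a network can infer the noise scale from the corrupted input has a simple high-dimensional intuition. The sketch below is a toy under stated assumptions, not the paper's proof: clean data is assumed to be a vector of ±1 entries (per-coordinate second moment 1) corrupted as x = x0 + σ·ε with standard Gaussian ε, so the mean of x² concentrates around 1 + σ². The function names are hypothetical.

```python
import math
import random


def make_noisy(dim, sigma, rng):
    """Sample x = x0 + sigma * eps with x0 uniform in {-1, +1}^dim."""
    return [rng.choice((-1.0, 1.0)) + sigma * rng.gauss(0.0, 1.0)
            for _ in range(dim)]


def estimate_sigma(noisy, clean_second_moment=1.0):
    """Recover sigma from the input alone via E[x^2] = E[x0^2] + sigma^2."""
    mean_sq = sum(v * v for v in noisy) / len(noisy)
    return math.sqrt(max(mean_sq - clean_second_moment, 0.0))
```

With tens of thousands of coordinates the estimate is already close to the true σ, which is one informal way to read "implicitly infer noise scales from the corrupted input". The specific assumptions under which the paper's result holds are not stated in the abstract.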

2606.19970 2026-06-19 cs.CV New submission 90%

CrossFlow: One-Step Generation Across Latent and Pixel Spaces

Xiyuan Wang, Xiao Zhang, Yang Li, Ruoxi Jiang, Zhao Zhong, Liefeng Bo, Muhan Zhang

Affiliations: Institute for Artificial Intelligence, Peking University; Tencent; Fudan University

Topic match (Diffusion Models): proposes a cross-space flow model for one-step generation.

AI summary: Proposes CrossFlow, a cross-space flow model that maps noisy latent inputs directly to pixel images. A velocity-free one-step objective enables latent-to-pixel generation and replaces the decoder in latent diffusion, reaching 1.62 FID on ImageNet-1k.

Comments: Preprint, Under Review

Abstract

Most diffusion and flow-matching generators define the prior, probability path, and prediction target in the same representation space. Latent diffusion improves efficiency by moving this path into an autoencoder latent space, but the final sample is still produced by a separately trained decoder. This separation creates a mismatch: the generator is optimized for latent-space prediction, while final quality depends on how the decoder handles generated latents that may differ from clean encoder outputs. We introduce CrossFlow, a cross-space flow formulation that maps noisy latent inputs directly to pixel-space images. The key technical step is a velocity-free one-step objective: the latent trajectory defines the training path, but the supervised prediction is an image rather than a latent displacement. This lets one model act both as a one-step latent-to-pixel generator and as a decoder replacement for latent diffusion pipelines. On class-conditional ImageNet-1k at $256\times256$, CrossFlow-XL achieves 1.62 FID with one function evaluation. Ablations show that the latent encoder and pixel-space perceptual and adversarial losses are important for fidelity. These results indicate that cross-space flow objectives can combine the efficiency of latent representations with direct pixel-space supervision, without requiring a separate decoder at inference.

2606.19662 2026-06-19 cs.CV New submission 90%

Learning When to Denoise: Optimizing Asynchronous Schedules for Latent Diffusion

Bingshuo Qian, Xiang Cheng

Affiliations: Department of Electrical and Computer Engineering

Topic match (Diffusion Models): learns an asynchronous schedule that optimizes the denoising order of multi-representation diffusion models.

AI summary: Proposes learning the asynchronous schedule, using a schedule-corrected objective to optimize when each representation of a multi-representation diffusion model is denoised. On ImageNet 256x256, with less than 1% additional training compute, the 200-epoch model matches the 800-epoch baseline with 4x less training (FID 1.05), and training to 600 epochs reaches FID 1.02.

Comments: 25 pages, 9 figures, 4 tables

Abstract

Multi-representation diffusion models can improve visual synthesis by denoising complementary views of an image, but their performance depends critically on the asynchronous schedule that determines when each representation is denoised. We propose to learn this schedule. Our method formulates asynchronous flow matching over multiple representation spaces and uses a schedule-corrected objective that keeps each representation's local noising-time weights fixed as the schedule changes. We instantiate the schedule with a flexible parametric class that is convex and monotone by construction, and learn it using a fast joint probe with less than 1% additional training compute. On ImageNet 256x256, the learned schedule substantially improves both convergence speed and final quality under a matched 675M-parameter XL backbone. With AutoGuidance, our 200-epoch model reaches FID 1.05, matching the 800-epoch SFD-XL baseline with 4x less training. Training to 600 epochs further improves to FID 1.02, outperforming the 1B-parameter SFD-XXL result of FID 1.04 while using a smaller model. In the unguided setting, our 200-epoch model reaches FID 2.37, already below the best 800-epoch SFD-XL result (2.54) at 4x less training, and improves to FID 2.14 at 600 epochs. Code is available at https://github.com/bsq532087/LWD
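A schedule that is "convex and monotone by construction" can be obtained in several ways. The sketch below shows one hypothetical parametrisation, not necessarily the class used in the paper: a softmax-weighted mixture of power functions s^p with p ≥ 1. Each term is convex and nondecreasing on [0, 1], and a convex combination preserves both properties, so no constraint has to be enforced during optimisation.

```python
import math


def schedule(progress, logits, exponents):
    """Map global progress in [0, 1] to one representation's time.

    logits: unconstrained parameters, turned into mixture weights
        by a softmax.
    exponents: fixed powers, each >= 1, so every basis function is
        convex and nondecreasing on [0, 1].
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    if any(p < 1.0 for p in exponents):
        raise ValueError("exponents must be >= 1 to keep convexity")
    shift = max(logits)  # subtract the max for numerical stability
    raw = [math.exp(v - shift) for v in logits]
    norm = sum(raw)
    return sum(w / norm * progress ** p for w, p in zip(raw, exponents))
```

Because the endpoints are pinned at 0 and 1 for any `logits`, changing the parameters only changes how early or late a representation is denoised relative to the others.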

2606.20112 2026-06-19 cs.CV eess.IV New submission 85%

Pixel-Level Residual Diffusion Transformer: Scalable 3D CT Volume Generation

Zhenkai Zhang, Markus Hiller, Krista A. Ehinger, Tom Drummond

Affiliations: School of Computing and Information Systems, The University of Melbourne

Topic match (Diffusion Models): 3D image generation based on a diffusion transformer.

AI summary: Proposes the Pixel-Level Residual Diffusion Transformer (PRDiT), which uses two-stage training (a local MLP-based blind estimator that separates low-frequency structures, plus a global residual diffusion transformer that models high-frequency residuals) to generate high-fidelity 3D CT volumes, outperforming existing methods on the LIDC-IDRI and RAD-ChestCT datasets.

Comments: Accepted at ICLR 2026. Code available at https://github.com/Fredy-Zhang/PRDiT

Abstract

Generating high-resolution 3D CT volumes with fine details remains challenging due to substantial computational demands and optimization difficulties inherent to existing generative models. In this paper, we propose the Pixel-Level Residual Diffusion Transformer (PRDiT), a scalable generative framework that synthesizes high-quality 3D medical volumes directly at voxel-level. PRDiT introduces a two-stage training architecture comprising 1) a local denoiser in the form of an MLP-based blind estimator operating on overlapping 3D patches to separate low-frequency structures efficiently, and 2) a global residual diffusion transformer employing memory-efficient attention to model and refine high-frequency residuals across entire volumes. This coarse-to-fine modeling strategy simplifies optimization, enhances training stability, and effectively preserves subtle structures without the limitations of an autoencoder bottleneck. Extensive experiments conducted on the LIDC-IDRI and RAD-ChestCT datasets demonstrate that PRDiT consistently outperforms state-of-the-art models, such as HA-GAN, 3D LDM and WDM-3D, achieving significantly lower 3D FID, MMD and Wasserstein distance scores.

2606.20076 2026-06-19 cs.CV cs.AI New submission 85%

Variable-Length Tokenization via Learnable Global Merging for Diffusion Transformers

Dong Hoon Lee, Seunghoon Hong

Affiliations: Kim Jaechul Graduate School of AI, KAIST, Daejeon, South Korea; School of Computing, KAIST, Daejeon, South Korea

Topic match (Diffusion Models): variable-length tokenization for diffusion transformers.

AI summary: Because a fixed compression ratio limits the quality-compute trade-off of diffusion models, proposes a variable-length tokenizer based on learnable global merging. Merging tokens aligns representations across lengths and yields a better gFID-compute trade-off on ImageNet 256×256 generation.

Abstract

Latent Diffusion Models (LDMs) have become dominant in visual synthesis, but their quality-compute trade-off is largely constrained by the tokenizer's fixed compression ratio. Variable-length tokenizers (VLTs) promise adaptive compression by varying token counts, allowing diffusion models to flexibly balance quality and compute. However, conventional VLTs modulate length by truncating ordered token sequences, which makes token semantics depend on token position and breaks representational alignment across lengths. This leads to a cross-length shift in the latent distribution that hinders a single variable-length diffusion model from operating effectively. To address this, we propose a novel variable-length tokenizer that modulates length by merging tokens. We show that encouraging similar tokens to merge enables direct cross-length representation alignment when the diffusion transformer operates according to the merging pattern. Since conventional merging methods are data-dependent, making the merging pattern inaccessible during generation, we introduce learnable global merging, which is data-independent, to ensure compatibility with diffusion transformers. On ImageNet $256\times256$ generation, our merging-based variable-length tokenizer integrated with a diffusion transformer achieves a superior gFID-compute trade-off compared to prior VLT methods. Code is available at https://github.com/movinghoon/lgm
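The distinction between data-dependent and data-independent merging is easy to state in code. In the sketch below the merge pattern is a fixed `assignment` list that does not depend on the token values, so a generator can know it before any image exists. Everything here is a hypothetical illustration: in the paper the pattern is learned, and averaging is only an assumed merge operator.

```python
def merge_tokens(tokens, assignment, num_groups):
    """Average the tokens that share a group index.

    tokens: list of equal-length float vectors.
    assignment: assignment[i] is the group of token i. It is the same
        for every input (data-independent).
    num_groups: length of the shortened token sequence.
    """
    dim = len(tokens[0])
    sums = [[0.0] * dim for _ in range(num_groups)]
    counts = [0] * num_groups
    for vec, group in zip(tokens, assignment):
        counts[group] += 1
        for k, value in enumerate(vec):
            sums[group][k] += value
    if 0 in counts:
        raise ValueError("every group needs at least one token")
    return [[s / c for s in row] for row, c in zip(sums, counts)]
```

Shorter sequences come from coarser assignments of the same tokens rather than from truncating an ordered sequence, which is the property the abstract credits for keeping representations aligned across lengths.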

2601.21542 2026-06-19 cs.CV cs.AI Updated version 85%

Bi-Anchor Interpolation Solver for Accelerating Generative Modeling

Hongxu Chen, Hongxiang Li, Zhen Wang, Long Chen

Affiliations: The Hong Kong University of Science and Technology

Topic match (Diffusion Models): accelerates generative modeling with a bi-anchor interpolation solver.

AI summary: Proposes BA-solver, in which a lightweight SideNet (1-2% of the backbone size) learns bidirectional temporal perception and bi-anchor velocity integration. Without retraining the backbone and at very low training cost, it matches the quality of a 100+ NFE Euler solver in 10 NFEs and works plug-and-play.

Abstract

Flow Matching (FM) models have emerged as a leading paradigm for high-fidelity synthesis. However, their reliance on iterative Ordinary Differential Equation (ODE) solving creates a significant latency bottleneck. Existing solutions face a dichotomy: training-free solvers suffer from significant performance degradation at low Neural Function Evaluations (NFEs), while training-based one- or few-step generation methods incur prohibitive training costs and lack plug-and-play versatility. To bridge this gap, we propose the Bi-Anchor Interpolation Solver (BA-solver). BA-solver retains the versatility of standard training-free solvers while achieving significant acceleration by introducing a lightweight SideNet (1-2% backbone size) alongside the frozen backbone. Specifically, our method is founded on two synergistic components: 1) Bidirectional Temporal Perception, where the SideNet learns to approximate both future and historical velocities without retraining the heavy backbone; and 2) Bi-Anchor Velocity Integration, which utilizes the SideNet with two anchor velocities to efficiently approximate intermediate velocities for batched high-order integration. By utilizing the backbone to establish high-precision "anchors" and the SideNet to densify the trajectory, BA-solver enables large interval sizes with minimized error. Empirical results on ImageNet-256^2 demonstrate that BA-solver achieves generation quality comparable to a 100+ NFE Euler solver in just 10 NFEs and maintains high fidelity in as few as 5 NFEs, incurring negligible training costs. Furthermore, BA-solver ensures seamless integration with existing generative pipelines, facilitating downstream tasks such as image editing.
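The degradation of training-free solvers at low NFE can be seen on a one-dimensional toy problem. The sketch below shows only the Euler baseline that the abstract compares against, not BA-solver itself; the velocity field dx/dt = -x is an assumption chosen because its exact solution is known, and the function names are hypothetical.

```python
import math


def euler_solve(velocity, x0, t0, t1, nfe):
    """Integrate dx/dt = velocity(x, t) with `nfe` equal Euler steps.

    Each step costs exactly one velocity evaluation, so the number of
    steps equals the number of function evaluations (NFE).
    """
    step = (t1 - t0) / nfe
    x, t = x0, t0
    for _ in range(nfe):
        x = x + step * velocity(x, t)
        t += step
    return x


def decay(x, t):
    """Toy velocity field whose exact solution is x0 * exp(-t)."""
    return -x


def euler_error(nfe):
    """Absolute error at t = 1 when starting from x0 = 1."""
    return abs(euler_solve(decay, 1.0, 0.0, 1.0, nfe) - math.exp(-1.0))
```

Here `euler_error(5)` is roughly twenty times larger than `euler_error(100)`, which is the kind of gap a solver has to close if it wants 100-step quality from 10 evaluations.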

2606.19894 2026-06-19 cs.LG New submission 80%

Score Approximation for Diffusion Models on Arbitrary Low-Dimensional Structures

Xinhe Mu, Zaijiu Shang, Zhaoqi Zhou, Chuan Zhou, Qi Meng, Guiying Yan, Zhiming Ma

Affiliations: Shanghai Institute for Mathematics and Interdisciplinary Sciences; Huawei Technologies Co., Ltd.

Topic match (Diffusion Models): score approximation theory for diffusion models, covering non-smooth data.

AI summary: For arbitrary compactly supported distributions, proposes a score approximation method based on a discrete mixture and proves that the ReLU network complexity grows exponentially only with the upper Minkowski dimension d, breaking the curse of ambient dimensionality and explaining the effectiveness of diffusion models on non-smooth data.

Abstract

The remarkable success of score-based diffusion models has spurred significant efforts to establish their theoretical foundations. However, existing complexity bounds for score approximation rely heavily on restrictive assumptions like Lipschitz continuous densities or smooth manifold supports, which are routinely violated by the singularities, sharp boundaries, and disjoint clusters inherent to real-world perceptual data. This work establishes a universal score approximation theorem that works for any distribution supported on any compact set of upper Minkowski dimension $d$. Using a novel discrete-mixture formulation, we prove that the score function can be approximated with a ReLU network whose complexity grows exponentially only with $d$, thus breaking the exponential curse of ambient dimensionality. Combined with existing theories on accurately solving the backward diffusion SDE for arbitrary compact distributions, our work shows that diffusion models readily adapt to irregular, non-smooth data structures, explaining their competence in real-world generative tasks.
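For intuition about discrete mixtures, recall the standard closed form for a finite set of atoms smoothed by Gaussian noise. If the data distribution puts weight π_i on point μ_i and is convolved with N(0, σ²I), the score is a softmax-weighted average of (μ_i − x)/σ². The sketch below implements that textbook formula; it is not the paper's construction, it assumes a noising process without signal scaling, and the function names are hypothetical.

```python
import math


def _log_terms(x, atoms, weights, sigma):
    """log(pi_i) - ||x - mu_i||^2 / (2 sigma^2) for every atom."""
    return [math.log(w)
            - sum((xi - ai) ** 2 for xi, ai in zip(x, a)) / (2 * sigma ** 2)
            for a, w in zip(atoms, weights)]


def mixture_score(x, atoms, weights, sigma):
    """Gradient of log p(x) for p = sum_i pi_i N(mu_i, sigma^2 I)."""
    terms = _log_terms(x, atoms, weights, sigma)
    shift = max(terms)  # stabilise the softmax
    resp = [math.exp(t - shift) for t in terms]
    norm = sum(resp)
    return [sum(r / norm * (a[k] - x[k]) for r, a in zip(resp, atoms))
            / sigma ** 2
            for k in range(len(x))]


def log_density(x, atoms, weights, sigma):
    """log p(x) for the same mixture, used to check the score."""
    terms = _log_terms(x, atoms, weights, sigma)
    shift = max(terms)
    log_sum = shift + math.log(sum(math.exp(t - shift) for t in terms))
    return log_sum - 0.5 * len(x) * math.log(2 * math.pi * sigma ** 2)
```

The number of atoms needed to cover a compact set at a given resolution depends on the set's intrinsic dimension rather than the ambient one, which gives an informal sense of why a bound in terms of d is plausible.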

2606.19397 2026-06-19 cs.RO 新提交 80%

DiffusionVS: A Generative Framework for Robust Visual Servoing Based on Diffusion Policy

DiffusionVS:基于扩散策略的鲁棒视觉伺服生成框架

Hongkang Cui, Rui He, Haoyao Chen

专题命中 扩散模型 :基于扩散策略生成相机速度,利用条件去噪。

AI总结 提出基于扩散策略的视觉伺服方法,通过条件去噪生成相机速度,并采用在线训练增强泛化能力,仿真成功率近100%,物理实验93%。

Comments 8 pages, 4 figures, 7 tables

详情
AI中文摘要

视觉伺服是机器人操作和导航中的基础技术。基于回归的视觉伺服常因噪声敏感的单步映射和分布偏移时的误差累积而出现轨迹抖动。相比之下,扩散策略通过预测动作序列保持时间一致性,并通过隐式数据增强提高鲁棒性。本文提出一种新颖的基于扩散的伺服方法。基于扩散策略,该方法使用观测标签角点的归一化图像坐标作为输入,通过条件去噪生成相机速度。为了克服在静态数据集上训练的模型的泛化限制,采用了在线训练范式,通过交互经验收集持续扩展训练数据的多样性。该策略显著提升了模型的性能和泛化能力。全面的仿真和实际实验证明了该方法的有效性,在仿真中实现了近100%的成功率,在物理实验中达到93%。除了具体的流程,我们进一步验证了扩散机制的通用性。实验表明,现有的视觉伺服网络在与我们的扩散模块集成时,性能持续提升。这些结果表明,所提出的策略具有广泛的适用性,能够增强除本文具体架构之外的各种视觉伺服系统。

英文摘要

Visual servoing is a fundamental technique in robotic manipulation and navigation. Regression-based visual servoing frequently experiences trajectory jitter as a result of noise-sensitive single-step mappings and the accumulation of errors during distribution shifts. In contrast, Diffusion Policy maintains temporal consistency by predicting action sequences and improves robustness through implicit data augmentation. This paper presents a novel diffusion-based servoing method. Based on Diffusion Policy, the proposed approach uses normalized image coordinates of observed tag corners as input and generates camera velocity through conditional denoising. To overcome the generalization limitations of models trained on static datasets, an online training paradigm is adopted, continuously expanding the diversity of training data through interactive experience collection. This strategy substantially enhances both the performance and generalization capability of the model. Comprehensive simulations and real-world experiments demonstrate the effectiveness of the proposed method, achieving success rates of nearly 100% in simulation and 93% in physical experiments. Beyond the specific pipeline, we further validate the generality of the diffusion mechanism. Experiments show that existing visual servoing networks consistently achieve improved performance when integrated with our diffusion-based module. These results indicate that the proposed strategy possesses broad applicability and can enhance various visual servoing systems beyond the specific architecture presented here.

2603.20455 2026-06-19 math.OC 版本更新 80%

Time-Reversed BSDEs for Accurate Gradient Estimation in Diffusion Models

时间反向BSDE用于扩散模型中的精确梯度估计

Yuhang Mei, Amirhossein Taghvaei

专题命中 扩散模型 :扩散模型梯度估计的BSDE方法

AI总结 针对扩散模型微调中梯度估计不稳定问题,提出基于时间反向BSDE的自适应伴随过程,降低方差并提高稳定性。

Comments 10 pages, 3 figures

详情
AI中文摘要

越来越多的文献采用随机最优控制(SOC)视角来微调扩散模型及相关生成策略。一类称为迭代扩散优化的著名方法通过模拟扩散过程、评估损失函数并应用随机优化算法来解决SOC问题,其中伴随匹配已成为最先进的方法。然而,这些方法中使用的伴随过程并不适应于前向扩散的域流(filtration),可能导致不稳定或高方差的梯度估计。在本文中,我们通过后向随机微分方程(BSDE)的视角重新审视扩散模型中的梯度估计。我们提出了一种基于我们先前工作中引入的时间反向BSDE公式的替代估计器,该估计器产生适应于底层域流的伴随过程。这种适应结构带来更稳定的梯度估计,且可能具有更低的方差。我们分析了所提估计器的准确性,并将其与伴随匹配进行了比较。在微调玩具扩散模型上的数值实验证明了改进的梯度稳定性和有竞争力的性能。

英文摘要

There is a growing literature adopting a stochastic optimal control (SOC) perspective to fine-tune diffusion models and related generative policies. A prominent class of methods, known as iterative diffusion optimization, solves the SOC problem by simulating the diffusion process, evaluating a loss function, and applying stochastic optimization algorithms, with adjoint matching emerging as a state-of-the-art approach. However, the adjoint process used in these methods is not adapted to the forward diffusion filtration, which can lead to unstable or high-variance gradient estimates. In this paper, we revisit gradient estimation in diffusion models through the lens of backward stochastic differential equations (BSDEs). We propose an alternative estimator based on a time-reversed BSDE formulation introduced in our prior work, which produces an adjoint process adapted to the underlying filtration. This adapted structure leads to more stable gradient estimates with potentially lower variance. We analyze the accuracy of the proposed estimator and compare it with adjoint matching. Numerical experiments on fine-tuning toy diffusion models demonstrate improved gradient stability and competitive performance.

2601.03112 2026-06-19 eess.IV cs.CV 版本更新 80%

DiT-JSCC: Rethinking Deep JSCC with Diffusion Transformers and Semantic Representations

DiT-JSCC:基于扩散变换器与语义表示的深度JSCC再思考

Kailin Tan, Jincheng Dai, Sixian Wang, Guo Lu, Shuo Shao, Kai Niu, Wenjun Zhang, Ping Zhang

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Shanghai Jiao Tong University(上海交通大学) University of Shanghai for Science and Technology(上海理工大学)

专题命中 扩散模型 :利用扩散变换器作为生成解码器

AI总结 提出DiT-JSCC框架,联合学习语义优先表示编码器和扩散变换器生成解码器,通过粗细粒度条件解码和基于Kolmogorov复杂度的自适应带宽分配,在极端信道条件下提升语义一致性与传输效率。

Comments 14 pages, 14 figures, 2 tables

详情
AI中文摘要

生成式联合源信道编码(GJSCC)已成为一种新的深度JSCC范式,用于在极端无线信道条件(如超低带宽和低信噪比)下实现高保真和鲁棒的图像传输。近期研究通常采用扩散模型作为生成解码器,但经常产生视觉上逼真但语义一致性有限的结果。这种局限性源于面向重建的JSCC编码器与生成解码器之间的根本性不匹配,因为前者缺乏显式的语义判别能力,无法提供可靠的条件线索。在本文中,我们提出DiT-JSCC,一种新颖的GJSCC骨干网络,能够联合学习语义优先的表示编码器和基于扩散变换器(DiT)的生成解码器,我们的开源项目旨在促进GJSCC的未来研究。具体来说,我们设计了一个语义-细节双分支编码器,与从粗到细的条件DiT解码器自然对齐,在极端信道条件下优先考虑语义一致性。此外,受Kolmogorov复杂度启发,引入了一种无需训练的自适应带宽分配策略,以进一步提高传输效率,从而真正重新定义生成解码时代的信息价值概念。大量实验表明,DiT-JSCC在语义一致性和视觉质量上始终优于现有JSCC方法,尤其是在极端条件下。

英文摘要

Generative joint source-channel coding (GJSCC) has emerged as a new Deep JSCC paradigm for achieving high-fidelity and robust image transmission under extreme wireless channel conditions, such as ultra-low bandwidth and low signal-to-noise ratio. Recent studies commonly adopt diffusion models as generative decoders, but they frequently produce visually realistic results with limited semantic consistency. This limitation stems from a fundamental mismatch between reconstruction-oriented JSCC encoders and generative decoders, as the former lack explicit semantic discriminability and fail to provide reliable conditional cues. In this paper, we propose DiT-JSCC, a novel GJSCC backbone that can jointly learn a semantics-prioritized representation encoder and a diffusion transformer (DiT) based generative decoder. Our open-source project aims to promote future research in GJSCC. Specifically, we design a semantics-detail dual-branch encoder that aligns naturally with a coarse-to-fine conditional DiT decoder, prioritizing semantic consistency under extreme channel conditions. Moreover, a training-free adaptive bandwidth allocation strategy inspired by Kolmogorov complexity is introduced to further improve the transmission efficiency, thereby redefining the notion of information value in the era of generative decoding. Extensive experiments demonstrate that DiT-JSCC consistently outperforms existing JSCC methods in both semantic consistency and visual quality, particularly in extreme regimes.

3. 图像编辑 5 篇

2606.20094 2026-06-19 cs.CV cs.AI cs.GR cs.LG cs.MM 新提交 90%

MakeupMirror: Improving Facial Attribute Preservation in Diffusion Models for Makeup Transfer

MakeupMirror:在用于化妆迁移的扩散模型中改进面部属性保持

Nefeli Andreou, Angel Martínez-González, Sabine Sternig, Matthieu Guillaumin, Epameinondas Antonakos, Michael Opitz

发表机构 * Amazon(亚马逊)

专题命中 图像编辑 :扩散模型用于化妆迁移

AI总结 提出MakeupMirror扩散模型,通过ControlNet几何条件、区域特定迁移控制、肤色调制和Langevin采样器,在保持面部特征和肤色的同时实现高质量化妆迁移,相比Stable-Makeup提升面部识别相似度60%、降低肤色差异50%。

详情
AI中文摘要

化妆迁移模型能够实现有趣的增强现实(AR)体验以及在线化妆购物的虚拟试妆(VTO)。尽管最近最先进的基于扩散的解决方案(如Stable-Makeup)显著提高了化妆迁移的准确性和逼真度,但在身份和肤色保持方面仍存在局限性,使得用于化妆购物的生产级VTO不切实际。在这项工作中,我们提出了MakeupMirror,一种基于扩散的化妆迁移方法,在保持面部特征和肤色方面取得了显著进展。我们在Stable-Makeup的基础上引入了多项技术创新:(1)将面部几何条件与ControlNets集成以保持面部保真度;(2)区域特定的化妆迁移控制,以便在面部区域(如皮肤、眼睛和嘴唇)实现精确的化妆应用;(3)基于肤色的化妆迁移调制,防止跨主体迁移场景中的肤色改变;(4)集成Levenberg-Marquardt Langevin采样器以加速推理同时保持生成质量。我们在CPM-Real、Makeup Wild以及(本文新收集的、更多样化的)MakeupSelfies数据集上的实验表明,与Stable-Makeup相比,MakeupMirror将相对面部识别相似度提高了+60%,将相对肤色差异降低了-50%,延迟为0.7秒,同时在核心面部身份保持标准上达到了94%的专家接受率。

英文摘要

Makeup transfer models enable fun augmented reality (AR) experiences as well as virtual try-on (VTO) for online makeup shopping. While recent state-of-the-art diffusion-based solutions such as Stable-Makeup dramatically improve the accuracy and realism of makeup transfer, they still face limitations in identity and skin color preservation, making production-level VTO for makeup shopping unrealistic. In this work, we propose MakeupMirror, a diffusion-based approach to makeup transfer that makes significant progress towards preserving facial features and skin tone. We introduce several technical innovations over Stable-Makeup: (1) integration of facial geometry conditioning with ControlNets to maintain facial fidelity; (2) region-specific makeup transfer control to enable precise makeup application across facial regions such as skin, eyes and lips; (3) skin tone-based makeup transfer modulation that prevents skin tone alteration in cross-subject transfer scenarios; and (4) integration of a Levenberg-Marquardt Langevin sampler to speed up inference while maintaining generation quality. Our experiments on the CPM-Real, Makeup Wild, and (herein newly collected, more diverse) MakeupSelfies datasets show that, relative to Stable-Makeup, MakeupMirror improves facial recognition similarity by 60% and reduces skin tone difference by 50%, with a latency of 0.7s, while achieving an expert acceptance rate of 94% across core facial identity preservation criteria.

2606.19961 2026-06-19 cs.CV 新提交 85%

Addressing Detail Bottlenecks in Latent Diffusion for RGB-to-SWIR Image Translation

解决潜在扩散模型中RGB到SWIR图像翻译的细节瓶颈

Kaili Wang, Martin Dimitrievski, Jose Maria Salvador, Ben Stoffelen, David Van Hamme, Lore Goetschalckx

发表机构 * imec imec-IPI-Ghent University(imec-IPI-根特大学) Yale University(耶鲁大学)

专题命中 图像编辑 :改进潜在扩散模型用于RGB到SWIR翻译

AI总结 针对潜在扩散模型在RGB到SWIR图像翻译中丢失空间细节的问题,提出源条件自编码器和可学习引导编码器两种轻量级改进,在驾驶场景下将检测mAP提升至2倍,小目标提升3.4倍,并达到最优FID。

详情
AI中文摘要

潜在扩散模型(LDM)能够高效地进行图像到图像的翻译,但在压缩过程中丢弃了精细的空间细节,从而降低了下游感知任务的性能。我们识别出两个瓶颈:自编码器(丢失空间信息)和条件路径(通过朴素下采样进一步退化源信号)。我们提出了两种轻量级、与骨干网络无关的修复方法:源条件自编码器(SCAE),通过跳跃连接将高分辨率源特征注入解码器;以及可学习引导编码器(LGE),用学习到的条件信号替代朴素下采样。在驾驶场景的RGB到SWIR翻译任务上,使用两种去噪骨干网络(U-Net和DiT)进行评估,我们的方法在潜在扩散基线基础上将检测mAP提升了高达2倍,小目标(COCO-small,<32^2像素^2)上提升高达3.4倍,同时达到了最先进的FID。我们进一步表明FID与检测性能相关性较差,从而激励多轴评估。结果零样本泛化到公开的RASMD基准。我们将公开发布带有标注的测试数据、所有检查点和训练代码。

英文摘要

Latent diffusion models (LDMs) enable efficient image-to-image translation but discard fine spatial details during compression, degrading downstream perception tasks. We identify two bottlenecks: the autoencoder, which loses spatial information, and the conditioning pathway, which further degrades the source signal through naive downsampling. We propose two lightweight, backbone-agnostic fixes: a Source-Conditioned Autoencoder (SCAE) that injects high-resolution source features into the decoder via skip connections, and a Learnable Guidance Encoder (LGE) that replaces naive downsampling with a learned conditioning signal. Evaluated on RGB-to-SWIR translation for driving scenes with two denoiser backbones (U-Net and DiT), our approach improves detection mAP by up to 2x over the latent diffusion baseline, with up to 3.4x gains on small objects (COCO-small, <32^2 px^2), while achieving state-of-the-art FID. We further show that FID and detection performance are poorly correlated, motivating multi-axis evaluation. Results generalise zero-shot to the public RASMD benchmark. We will publicly release test data with annotations, all checkpoints, and training code.
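For readers unfamiliar with the "COCO-small, <32^2 px^2" notation above, the sketch below shows the usual COCO size buckets. The thresholds (32^2 and 96^2 pixels of area) are the standard COCO ones; using bounding-box width times height as the area and the exact handling of values that fall on a threshold are simplifying assumptions for this example (COCO evaluation itself works from annotated object area).

```python
def coco_size_bucket(width, height):
    """Assign an object to a COCO-style size bucket from its box dimensions.

    Assumption: box area is used as a proxy for the annotated object area.
    """
    area = width * height
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"

buckets = [
    coco_size_bucket(20, 20),    # 400 px^2
    coco_size_bucket(32, 32),    # 1024 px^2, on the small/medium boundary
    coco_size_bucket(100, 100),  # 10000 px^2
]
```

Small objects occupy only a handful of latent cells after autoencoder compression, which is one reason a detail bottleneck tends to show up most strongly in this bucket.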

2603.07236 2026-06-19 cs.CV 版本更新 85%

HY-WU (Part I): An Extensible Functional Neural Memory Framework and An Instantiation in Text-Guided Image Editing

HY-WU (第一部分): 一种可扩展的功能性神经记忆框架及其在文本引导图像编辑中的应用

Mengxuan Wu, Xuanlei Zhao, Ziqiao Wang, Ruicheng Feng, Zhangyang Wang, Kai Wang

发表机构 * Tencent HY Team(腾讯 HY 团队)

专题命中 图像编辑 :提出HY-WU框架用于文本引导图像编辑。

AI总结 提出HY-WU框架,通过功能性神经记忆模块即时生成实例特定权重更新,避免共享权重覆盖导致的干扰,解决持续学习与个性化中的灾难性遗忘问题。

详情
AI中文摘要

基础模型正从离线预测器过渡到期望长时间运行的部署系统。在实际部署中,目标并非固定:领域漂移、用户偏好演变,以及模型发布后出现新任务。这将持续学习和即时个性化从可选功能提升为核心架构要求。然而,大多数适应流程仍遵循静态权重范式:训练后(或任何适应步骤后),推理执行单一参数向量,而不考虑用户意图、领域或实例特定约束。这将训练或适应后的模型视为参数空间中的单个点。在异构且持续演变的机制中,不同目标可能在参数上诱导分离的可行区域,迫使任何单一共享更新陷入妥协、干扰或过度专业化。结果,持续学习和个性化通常实现为对共享权重的重复覆盖,冒着先前学习行为退化的风险。我们提出HY-WU(权重释放),一种记忆优先的适应框架,将适应压力从覆盖单一共享参数点转移。HY-WU将功能性(算子级)记忆实现为神经模块:一个根据实例条件即时合成权重更新的生成器,产生实例特定算子而无需测试时优化。

英文摘要

Foundation models are transitioning from offline predictors to deployed systems expected to operate over long time horizons. In real deployments, objectives are not fixed: domains drift, user preferences evolve, and new tasks appear after the model has shipped. This elevates continual learning and instant personalization from optional features to core architectural requirements. Yet most adaptation pipelines still follow a static weight paradigm: after training (or after any adaptation step), inference executes a single parameter vector regardless of user intent, domain, or instance-specific constraints. This treats the trained or adapted model as a single point in parameter space. In heterogeneous and continually evolving regimes, distinct objectives can induce separated feasible regions over parameters, forcing any single shared update into compromise, interference, or overspecialization. As a result, continual learning and personalization are often implemented as repeated overwriting of shared weights, risking degradation of previously learned behaviors. We propose HY-WU (Weight Unleashing), a memory-first adaptation framework that shifts adaptation pressure away from overwriting a single shared parameter point. HY-WU implements functional (operator-level) memory as a neural module: a generator that synthesizes weight updates on-the-fly from the instance condition, yielding instance-specific operators without test-time optimization.

2606.20404 2026-06-19 cs.CV 新提交 80%

FlowBender: Feedback-Aware Training for Self-Correcting Conditional Flows

FlowBender: 面向自校正条件流的反馈感知训练

Daniel Gilo, Sven Elflein, Ido Sobol, Or Litany

发表机构 * Technion(以色列理工学院) NVIDIA(英伟达) University of Toronto(多伦多大学) Vector Institute(向量研究所)

专题命中 图像编辑 :反馈感知训练用于条件流模型,提升图像翻译和修复

AI总结 针对条件扩散/流模型常违反任务约束的问题,提出FlowBender闭环框架,将对齐误差作为输入训练网络学习校正策略,在图像翻译、复原和3D纹理贴图中同时提升保真度与合理性。

Comments Project page: https://flow-bender.github.io/

详情
AI中文摘要

条件扩散和流模型通常无法满足定义其任务的约束条件。例如,深度条件模型经常产生重新提取的深度与输入不一致的图像,尽管定义约束的前向算子(深度预测器)在训练和推理期间都可用。现有方法通常分为两类:将条件信号视为静态线索并在推理时忽略对齐信息的监督模型,以及通过手动调整的线性更新咨询约束的基于引导的方法,通常以生成样本的合理性为代价来换取对条件的保真度。我们认为这两种范式的根本差距在于模型从未被训练利用自身的对齐误差。我们引入FlowBender,一个闭环框架,将此误差视为一等输入,训练网络学习基于推理时反馈的校正策略。在每一步,无引导的前瞻传递估计干净信号,通过前向算子计算特定任务的偏差,然后细化传递消耗此信号以产生校正速度。我们提出了FlowBender的几种变体,包括用于可微算子的基于梯度的公式和用于不可微设置(如JPEG压缩)的零阶变体。为了实现高效采样,我们引入了一个前一步捷径,使得以最小的额外计算成本实现闭环校正。在图像到图像翻译、复原和3D网格纹理贴图中,FlowBender始终优于标准监督基线、对齐损失增强训练和最先进的推理时引导,同时提高保真度和合理性,而不是在它们之间进行权衡。项目页面:https://flow-bender.github.io/

英文摘要

Conditional diffusion and flow models routinely fail to satisfy the very constraints that define their task. For instance, a depth-conditioned model often produces images whose re-extracted depth disagrees with the input, even though the forward operator--the depth predictor defining the constraint--is available during both training and inference. Existing approaches generally fall into two categories: supervised models that treat the conditioning signal as a static cue and ignore alignment information at inference, and guidance-based methods that consult it through hand-tuned linear updates, typically trading fidelity to the condition against the plausibility of the generated sample. We argue that the fundamental gap in both paradigms is that the model is never trained to utilize its own alignment error. We introduce FlowBender, a closed-loop framework that treats this error as a first-class input, training the network to learn a correction policy conditioned on inference-time feedback. At each step, an unguided look-ahead pass estimates the clean signal, a task-specific deviation is computed via the forward operator, and a refinement pass consumes this signal to produce a corrected velocity. We propose several variants of FlowBender, including a gradient-based formulation for differentiable operators and a zero-order variant for non-differentiable settings such as JPEG compression. For efficient sampling, we introduce a prior-step shortcut that enables closed-loop correction at a minimal additional computational cost. Across image-to-image translation, restoration, and 3D mesh texturing, FlowBender consistently outperforms standard supervised baselines, alignment-loss-augmented training, and state-of-the-art inference-time guidance, improving fidelity and plausibility simultaneously rather than trading them against each other. Project page: https://flow-bender.github.io/
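The three-stage step described in the abstract (look-ahead, deviation, refinement) can be sketched as below. This is only a schematic under stated assumptions, not the authors' implementation: it assumes a linear interpolation path so that the clean estimate is `x_t + (1 - t) * v`, and both `base_velocity` and `corrected_velocity` are hand-written toy functions standing in for trained networks. All names are hypothetical.

```python
def closed_loop_step(x_t, t, dt, y, base_velocity, corrected_velocity, forward_op):
    """One Euler step with feedback from the forward operator."""
    # 1) Unguided look-ahead pass: estimate the clean signal at t = 1.
    v_look = base_velocity(x_t, t)
    x1_hat = [x + (1.0 - t) * v for x, v in zip(x_t, v_look)]
    # 2) Task-specific deviation between the re-extracted and given condition.
    deviation = [a - b for a, b in zip(forward_op(x1_hat), y)]
    # 3) Refinement pass: consume the deviation to produce a corrected velocity.
    v = corrected_velocity(x_t, t, deviation)
    return [x + dt * vi for x, vi in zip(x_t, v)]

def sample(x0, y, num_steps, base_velocity, corrected_velocity, forward_op):
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = closed_loop_step(x, i * dt, dt, y, base_velocity, corrected_velocity, forward_op)
    return x

# Toy stand-ins (assumptions): the base flow drifts to 1.0 regardless of y,
# the forward operator is the identity, and the "learned" correction simply
# shifts the endpoint by the observed deviation.
def toy_base(x_t, t):
    return [(1.0 - x) / (1.0 - t) for x in x_t]

def toy_corrected(x_t, t, deviation):
    return [v - d / (1.0 - t) for v, d in zip(toy_base(x_t, t), deviation)]

def identity_op(x):
    return list(x)

result = sample([0.0], [2.0], 4, toy_base, toy_corrected, identity_op)
```

In the toy setup the uncorrected flow would end at 1.0 and violate the condition, while the feedback-aware step ends at the requested value; in the paper this correction is learned during training rather than written by hand.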

2606.19802 2026-06-19 cs.LG cs.CV 新提交 80%

Flow Map Denoisers: Traversing the Distortion-Perception Plane for Inverse Problems

流映射去噪器:遍历逆问题的失真-感知平面

Nicolas Zilberstein, Morteza Mardani, Santiago Segarra

发表机构 * Rice University(莱斯大学) NVIDIA Inc.(英伟达公司)

专题命中 图像编辑 :提出流映射去噪器,实现图像恢复中的失真-感知权衡。

AI总结 提出流映射模型,通过单一参数t在MMSE和感知质量间连续调节,实现逆问题的失真-感知权衡,无需额外监督或调参。

详情
AI中文摘要

图像复原面临一个基本权衡:最小化误差的方法产生模糊重建,而最大化感知质量的方法产生锐利但不够保真的图像。现有方法要么在失真-感知(DP)前沿上固定一个操作点,要么需要配对数据监督、辅助模型或对采样器进行超参数调优以访问不同点。我们证明,流映射模型——一种用于少步采样的流匹配的近期扩展,学习一个平均场——隐式定义了一个单参数去噪器族,连续跨越DP前沿。前瞻参数t充当MMSE和感知区域之间的控制旋钮。对于高斯目标,我们证明改变t精确恢复最优DP前沿;对于自然图像,我们在经验上观察到类似行为。在即插即用求解器中,相同机制扩展到一般逆问题,控制感知对齐与数据一致性之间的权衡。尽管在此设置中缺乏精确最优性保证,单个训练的流映射跨越DP权衡,在两端匹配或超越专门基线。在CelebA(128×128)和AFHQ(256×256)上的多个线性和非线性逆任务的广泛实验验证了我们的发现。

英文摘要

Image restoration faces a fundamental tradeoff: methods that minimize error produce blurry reconstructions, while those that maximize perceptual quality yield sharp but less faithful images. Existing approaches either commit to a single operating point on this distortion-perception (DP) frontier or require paired-data supervision, auxiliary models, or hyperparameter tuning of the sampler to access different points. We show that flow map models, a recent extension of flow matching for few-step sampling that learns an average field, implicitly define a one-parameter family of denoisers that continuously spans the DP frontier. The lookahead parameter $t$ acts as a control knob between the MMSE and perceptual regimes. For Gaussian targets, we prove that varying $t$ exactly recovers the optimal DP frontier; for natural images, we observe similar behavior empirically. Within a Plug-and-Play solver, the same mechanism extends to general inverse problems, where it controls a tradeoff between perceptual alignment and data consistency. Despite the lack of exact optimality guarantees in this setting, a single trained flow map spans the DP tradeoff, matching or exceeding specialized baselines at both extremes. Extensive experiments on CelebA ($128\times 128$) and AFHQ ($256\times 256$) across several linear and nonlinear inverse tasks validate our findings.

4. 可控生成 5 篇

2606.19718 2026-06-19 cs.CV 新提交 90%

One-Shot Novel View and Pose Human Image Synthesis via 3D Prior Guided Diffusion Model

基于3D先验引导扩散模型的单样本新视角与姿态人体图像合成

Shenjian Gong, Kangkan Wang, Shanshan Zhang, Jian Yang

发表机构 * PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, and Jiangsu Key Lab of Image and Video Understanding for Social Security, School of Computer Science and Engineering, Nanjing University of Science and Technology(南京理工大学计算机科学与工程学院教育部高维信息智能感知与系统重点实验室、江苏省社会安全图像与视频理解重点实验室及PCA实验室) Advanced Laser Technology Laboratory of Anhui Province, Electronic Engineering Institute, National University of Defense Technology, and Jianghuai Advance Technology Center(国防科技大学电子工程学院安徽省先进激光技术实验室及江淮前沿技术中心)

专题命中 可控生成 :基于扩散模型合成新视角和姿态的人体图像。

AI总结 提出一种基于条件去噪扩散模型的方法,利用3D人体先验(法线图和颜色提示)作为几何和颜色条件,从单张参考图像合成任意姿态和视角的高质量人体图像,包括被遮挡部分。

Comments 30 pages, 10 figures

详情
AI中文摘要

本文解决了单样本新视角和姿态人体图像合成的挑战。现有方法通过一组2D姿态关键点将参考人体图像转移到目标姿态,或基于可泛化人体NeRF(使用人体模型先验提取逐点特征)合成人体图像。然而,基于姿态转移的方法无法处理使用模糊2D姿态作为条件的复杂人体姿态,而可泛化人体NeRF在缺乏可靠特征时可能无法准确恢复被遮挡/不可见的人体部分。为解决这些问题,我们提出了一种基于条件去噪扩散模型的新方法,用于从单张人体图像进行新视角和姿态合成。我们的扩散模型将新视角和姿态合成问题分解为一系列条件去噪步骤。具体而言,为了生成具有复杂和任意姿态的人体,我们将3D人体先验(即3D法线图和颜色提示)作为几何和颜色条件引入生成过程。通过一系列扩散步骤将参考人体转移到目标人体,我们的扩散模型能够实现高质量合成,包括被遮挡/不可见部分。此外,我们提出了一种基于自重建的自定义细化方法,以在测试新人物时增强细节。在多个公共数据集上的实验结果表明,我们的方法显著优于先前方法,并显示出更好的跨数据集泛化能力。代码将在 https://github.com/Yankeegsj/3DPGDM 公开。

英文摘要

This paper addresses the challenge of one-shot novel view and pose human image synthesis. Existing methods either transfer the reference human image to a target pose using a set of 2D pose keypoints, or synthesize human images with a generalizable human NeRF, which uses human model priors to extract point-wise features. However, pose-transfer-based methods cannot handle complex human poses when conditioned on ambiguous 2D poses, while generalizable human NeRFs may fail to accurately recover occluded/invisible human parts when reliable features cannot be extracted. To solve these problems, we propose a novel approach for novel view and pose synthesis from a single human image via a conditional denoising diffusion model. Our diffusion model divides the novel view and pose synthesis problem into a sequence of conditional denoising steps. Specifically, to generate humans with complex and arbitrary poses, we introduce 3D human priors, i.e., 3D normal map and color prompt, as geometry and color conditions into the generation process. By transferring the reference human into the target human with a series of diffusion steps, our diffusion model enables high-quality synthesis including the occluded/invisible parts. Further, we propose a self-reconstruction based customized refinement to enhance fine details when tested on novel persons. Experimental results on different public datasets demonstrate that our approach significantly outperforms previous methods and also shows better generalization ability across datasets. The code will be made publicly available at https://github.com/Yankeegsj/3DPGDM.

2606.20110 2026-06-19 cs.CV 新提交 80%

FrozenDrive: Zero-Shot Text-Guided Driving Scene Generation and Data Augmentation with Parameter-Free Frozen Diffusion Model

FrozenDrive: 零样本文本引导驾驶场景生成与数据增强的无参数冻结扩散模型

Yuhwan Jeong, Hyeonseong Kim, Daehyun We, Seonkyu Song, Jinnyeong Yang, Hyun-Kurl Jang, Youngho Yoon, Kuk-Jin Yoon

发表机构 * KAIST, Visual Intelligence Lab(韩国科学技术院视觉智能实验室)

专题命中 可控生成 :文本引导的驾驶场景生成

AI总结 提出FrozenDrive框架,利用冻结的预训练扩散模型,通过知识保留的时空注意力实现多视图一致性和时间连贯性,无需微调即可生成恶劣天气下的驾驶场景,提升自动驾驶模型鲁棒性。

Comments Accepted to ECCV 2026

详情
AI中文摘要

自动驾驶的合成数据正在激增,这得益于扩散模型能够实现可扩展的场景生成。然而,关键障碍依然存在,因为强制执行多视图和时间一致性通常依赖于骨干网络微调或添加层,这会侵蚀预训练知识并削弱文本对齐。模型也保持接近训练分布,在恶劣天气和未见配置下表现不佳,并且保真度偏向频繁类别而非稀有类别。我们通过FrozenDrive解决这些差距,这是一个可控生成框架,在保持预训练扩散模型知识的同时实现强一致性。FrozenDrive以丰富的驾驶堆栈信号和文本提示为条件,并引入知识保留的时空注意力,在无参数的冻结扩散骨干中单次通过时施加跨视图对齐和时间连贯性。额外的对象聚焦约束提高了稀有类别的每个对象保真度。无需任何天气或场景特定的微调,我们的模型从文本合成全局连贯的多视图驾驶场景,特别是在恶劣和稀有条件下,并超越了先前的基线。在nuScenes上,FrozenDrive增强数据显著提升了AD模型的性能,尤其是在夜间和雨天,当使用我们的场景定向数据训练时,展示了更强的鲁棒性。

英文摘要

Synthetic data for autonomous driving is surging, powered by diffusion models that promise scalable scene generation. Yet key obstacles remain, as enforcing multi-view and temporal consistency often relies on backbone fine-tuning or added layers, which erodes pre-trained knowledge and weakens text alignment. Models also stay close to the training distribution, struggling under adverse weather and unseen configurations, and fidelity favors frequent over rare classes. We address these gaps with FrozenDrive, a controllable generative framework that preserves a pretrained diffusion model's knowledge while achieving strong consistency. FrozenDrive conditions on rich driving-stack signals and text prompts, and introduces knowledge-preserving spatio-temporal attention to impose cross-view alignment and temporal coherence in a single pass within a parameter-free frozen diffusion backbone. An additional object-focused constraint improves per-object fidelity for rare categories. Without any weather- or scene-specific fine-tuning, our model synthesizes globally coherent multi-view driving scenes from text, particularly under adverse and rare conditions, and surpasses prior baselines. On nuScenes, FrozenDrive-augmented data significantly improves the performance of AD models, especially at night and in rain, demonstrating stronger robustness when trained with our scenario-targeted data.

2606.20083 2026-06-19 cs.CV 新提交 80%

Holo-World: Unified Camera, Object and Weather Control for Video World Model

Holo-World: 视频世界模型的统一相机、物体和天气控制

Xiangchen Yin, Wenzhang Sun, Jiahui Yuan, Zijie Liu, Yinda Chen, Wei Li, Dachun Kai, Chunfeng Wang, Xiaoyan Sun

发表机构 * University of Science and Technology of China(中国科学技术大学) Li Auto(理想汽车) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center(合肥综合国家科学中心人工智能研究院)

专题命中 可控生成 :相机、物体、天气联合控制

AI总结 提出Holo-World,一种从单张图像联合控制相机、物体运动和天气的统一视频世界模型,通过场景适配器和解耦CFG实现世界保持与天气迁移。

Comments Project Page: https://xiangchenyin.github.io/Holo-World Code: https://github.com/XiangchenYin/Holo-World

详情
AI中文摘要

视频世界模型正朝着在可控相机和物体运动下保持观察到的世界,同时允许其环境状态变化的方向发展。然而,这些控制仍然是孤立的,天气生成通常依赖于已经指定未来结构的源视频或重建场景。我们研究了一种基于第一帧锚定的源到状态设置,其中模型从单张图像开始,遵循明确的相机和物体控制以及可选的天气指令,然后生成一个视频,该视频要么保持源世界,要么将其转移到目标天气状态。为了解决这些挑战,我们首先构建了HoloStateData,一个状态视频数据集,将多样化的视频转换为用于相机、物体和天气监督的统一控制样本。其次,我们引入了Holo-World,一个统一的、可控制的视频世界模型,从单张图像联合控制场景。其统一场景适配器将世界保持和天气迁移分解为不同的参数子空间,使用渲染背景、几何缓冲区和物体控制来维持受控场景结构,同时建模依赖天气的外观和粒子效果。此外,场景-天气解耦CFG分别引导场景和天气残差,增强目标天气效果而不过度放大完整条件。定量和定性实验表明,Holo-World在保持精确的相机和物体控制以及一致场景结构的同时,将场景迁移到多样化的目标天气状态,在天气状态生成上优于视频到视频的天气编辑基线。我们的项目页面可在 https://xiangchenyin.github.io/Holo-World/ 获取。

英文摘要

Video world models are moving toward preserving an observed world under controllable camera and object motion while allowing its environmental state to change. Yet these controls remain isolated, and weather generation typically relies on a source video or reconstructed scene that already specifies future structure. We study a first-frame-anchored source-to-state setting, where the model starts from a single image and follows explicit camera and object controls and an optional weather instruction, then generates a video that either preserves the source world or transfers it to a target weather state. To address these challenges, we first build HoloStateData, a state video dataset that turns diverse videos into unified control samples for camera, object, and weather supervision. Second, we introduce Holo-World, a unified controllable video world model that jointly controls the scene from a single image. Its Unified Scene Adapter factorizes world preservation and weather transfer into distinct parameter subspaces, using rendered background, geometry buffers, and object controls to maintain controlled scene structure while modeling weather-dependent appearance and particle effects. Additionally, Scene-Weather Decomposed CFG guides scene and weather residuals separately, strengthening target weather effects without over-amplifying the full condition. Quantitative and qualitative experiments demonstrate that Holo-World maintains precise camera and object control with consistent scene structure while transferring scenes into diverse target weather states, outperforming video-to-video weather editing baselines on weather-state generation. Our project page is available at https://xiangchenyin.github.io/Holo-World/.
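The idea of guiding scene and weather residuals separately can be illustrated with the common multi-condition form of classifier-free guidance. The abstract does not give the exact formula, so the decomposition below (unconditional, scene-only, and scene-plus-weather predictions with one scale per residual) is an assumption, and `decomposed_cfg` and its parameter names are hypothetical.

```python
def decomposed_cfg(pred_uncond, pred_scene, pred_full, scene_scale, weather_scale):
    """Combine three model predictions with a separate scale per residual.

    pred_uncond: prediction with no conditions
    pred_scene:  prediction with scene controls only
    pred_full:   prediction with scene controls and the weather instruction
    """
    return [
        u + scene_scale * (s - u) + weather_scale * (f - s)
        for u, s, f in zip(pred_uncond, pred_scene, pred_full)
    ]

# With both scales at 1 the result is the fully conditioned prediction;
# with equal scales w it reduces to ordinary CFG, u + w * (f - u).
plain = decomposed_cfg([0.0], [1.0], [3.0], 1.0, 1.0)
standard = decomposed_cfg([0.0], [1.0], [3.0], 2.0, 2.0)
weather_boost = decomposed_cfg([0.0], [1.0], [3.0], 1.0, 2.0)
```

Raising only `weather_scale` amplifies the weather residual while leaving the scene residual at its nominal strength, which matches the stated goal of strengthening weather effects without over-amplifying the full condition.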

2606.19736 2026-06-19 cs.CV 新提交 80%

VFACamou: View-Fused Adversarial Camouflage for Environment-Adaptive Physical Evasion

VFACamou: 视图融合的对抗性伪装用于环境自适应物理规避

Shihui Yan, Hu Liu, Junyu Shi, Zihui Zhu, Ziqi Zhou, Yufei Song, Youming Geng, Minghui Li, Shengshan Hu

发表机构 * State Key Laboratory of Intelligent Vehicle Safety Technology(智能汽车安全技术国家重点实验室) School of Cyber Science and Engineering, Huazhong University of Science and Technology(华中科技大学网络空间安全学院) School of Computer Science and Technology, Huazhong University of Science and Technology(华中科技大学计算机科学与技术学院) School of Software Engineering, Huazhong University of Science and Technology(华中科技大学软件学院) Hebei Energy College of Vocation And Technology(河北能源职业技术学院)

专题命中 可控生成 :使用扩散纹理生成器生成对抗图案。

AI总结 提出一种端到端框架,结合UV体积渲染与扩散纹理生成器,并引入照明颜色一致性估计器和多尺度动态训练策略,生成可穿戴对抗图案,在无人机侦察等动态视角和光照变化下实现稳定物理攻击。

Comments Accepted by ICME 2026

详情
AI中文摘要

物理世界中的对抗性伪装仍然极具挑战性,尤其是在无人机侦察场景下,目标会经历连续的几何变化和极端光照变化。现有方法要么优化无法泛化到动态视角的2D数字扰动,要么产生视觉上不自然的纹理而无法在实际场景中部署。因此,我们提出一个端到端的对抗性伪装生成框架,能够自动生成可穿戴的对抗图案,并在视角、姿态和光照条件变化的真实物理环境中保持稳定的攻击性能。我们的方法将UV体积渲染与基于扩散的纹理生成器相结合,使得在不同尺度、姿态和光照条件下外观保持一致。为了确保环境真实性,我们提出一个照明颜色一致性估计器,提取主导背景属性并引导自然纹理损失,使生成的UV纹理与周围环境对齐。多尺度动态训练策略进一步增强了对抗视角变化和身体变形的鲁棒性。在多个主流检测器上的大量实验表明,我们的方法在保持高感知自然性的同时实现了强大且稳定的物理攻击性能,在不引入不自然伪影的情况下降低了人类检测率。

英文摘要

Adversarial camouflage in the physical world remains highly challenging, particularly under UAV reconnaissance where targets undergo continuous geometric changes and extreme illumination variations. Existing methods either optimize 2D digital perturbations that fail to generalize to dynamic viewpoints or produce visually unnatural textures that cannot be deployed in real scenarios. Therefore, we propose an end-to-end framework for adversarial camouflage generation that automatically produces wearable adversarial patterns and maintains stable attack performance in real physical environments with changing viewpoints, poses, and lighting conditions. Our method integrates UV-volume rendering with a diffusion-based texture generator, enabling consistent appearance under varying scales, poses, and lighting conditions. To ensure environmental realism, we propose an illumination color consistency estimator that extracts dominant background attributes and guides a natural texture loss to align the generated UV texture with the surrounding environment. A multi-scale dynamic training strategy further enhances robustness against viewpoint shifts and body deformation. Extensive experiments across multiple mainstream detectors demonstrate that our method achieves strong and stable physical attack performance while maintaining high perceptual naturalness, reducing human detection rates without introducing unnatural artifacts.

2601.12870 2026-06-19 cs.CE 版本更新 75%

Text2Structure3D: Graph-Based Generative Modeling of Equilibrium Structures with Diffusion Transformers

Text2Structure3D: 基于扩散变换器的图生成建模平衡结构

Lazlo Bleker, Zifeng Guo, Kaleb E. Smith, Kam-Ming Mark Tam, Karla Saldaña Ochoa, Pierluigi D'Acunto

专题命中 可控生成 :从文本生成平衡结构图,属于可控结构生成。

AI总结 提出Text2Structure3D,结合潜在扩散、变分图自编码器和图变换器,从自然语言提示生成接近平衡状态的结构图,并通过残余力优化确保完全满足静力平衡。

Journal ref Results in Engineering 31 (2026) 111375

详情
AI中文摘要

本文提出Text2Structure3D,一种基于图的机器学习模型,能够从自然语言提示生成平衡结构。Text2Structure3D旨在支持概念结构设计过程中新的直观设计探索和迭代方式。该方法将潜在扩散与变分图自编码器(VGAE)和图变换器相结合,生成接近平衡状态的结构图。Text2Structure3D集成了一个残余力优化后处理步骤,确保生成的结构完全满足静力平衡。该模型使用一个跨类型的悬链线找形和静定桥梁结构数据集进行训练和验证,该数据集配有针对每座桥梁的形式和结构特征的文本描述。结果表明,Text2Structure3D生成的平衡结构高度遵循基于文本的规范,并且与基于参数模型的方法相比,大大提高了泛化能力。Text2Structure3D代表了迈向结构设计通用基础模型的早期一步,使生成式AI能够集成到概念设计工作流程中。

英文摘要

This paper presents Text2Structure3D, a graph-based Machine Learning (ML) model that generates equilibrium structures from natural language prompts. Text2Structure3D is designed to support new intuitive ways of design exploration and iteration in the conceptual structural design process. The approach combines latent diffusion with a Variational Graph Auto-Encoder (VGAE) and graph transformers to generate structural graphs that are close to an equilibrium state. Text2Structure3D integrates a residual force optimization post-processing step that ensures generated structures fully satisfy static equilibrium. The model was trained and validated using a cross-typological dataset of funicular form-found and statically determinate bridge structures, paired with text descriptions that capture the formal and structural features of each bridge. Results demonstrate that Text2Structure3D generates equilibrium structures with strong adherence to text-based specifications and greatly improves generalization capabilities compared to parametric model-based approaches. Text2Structure3D represents an early step toward a general-purpose foundation model for structural design, enabling the integration of generative AI into conceptual design workflows.
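To clarify what "residual force" means in the post-processing step, here is a minimal sketch that evaluates nodal residuals for a 2D pin-jointed structure. It only checks equilibrium and does not perform the optimization described in the paper; the data layout (`nodes`, `members` as `(i, j, tension)` tuples, `loads`) is a hypothetical one chosen for the example.

```python
import math

def residual_forces(nodes, members, loads):
    """Return the net force at every node of a 2D pin-jointed structure.

    nodes:   list of (x, y) coordinates
    members: list of (i, j, tension); positive tension pulls i and j together
    loads:   list of (fx, fy) external loads, one per node
    A free node is in static equilibrium when its residual is (0, 0);
    at supports the residual is balanced by the reaction force.
    """
    residuals = [list(load) for load in loads]
    for i, j, tension in members:
        dx = nodes[j][0] - nodes[i][0]
        dy = nodes[j][1] - nodes[i][1]
        length = math.hypot(dx, dy)
        ux, uy = dx / length, dy / length
        residuals[i][0] += tension * ux
        residuals[i][1] += tension * uy
        residuals[j][0] -= tension * ux
        residuals[j][1] -= tension * uy
    return residuals

# Two supports and one loaded free node hanging between them.
nodes = [(0.0, 0.0), (2.0, 0.0), (1.0, -1.0)]
loads = [(0.0, 0.0), (0.0, 0.0), (0.0, -1.0)]
tension = math.sqrt(2.0) / 2.0  # balances the unit load at 45-degree members
members = [(0, 2, tension), (1, 2, tension)]
free_node_residual = residual_forces(nodes, members, loads)[2]
```

A generated structure that is only close to equilibrium would show a small nonzero residual at its free nodes, and a post-processing step could then adjust geometry or forces to drive that residual to zero.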

5. 其他图像生成 3 篇

2606.20536 2026-06-19 cs.CV 新提交 75%

The FID Lottery: Quantifying Hidden Randomness in Generative-Model Evaluation

FID 彩票:量化生成模型评估中的隐藏随机性

Nicolas Dufour, Alexei A. Efros, Patrick Pérez

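Affiliations * Kyutai, UC Berkeley (加州大学伯克利分校)

Topic match Other image generation: studies randomness in FID evaluation, which affects how generative models are assessed.

AI summary Studies the variance of FID as a random variable over training and generation seeds, finds that retraining moves FID more than resampling does, and proposes a new evaluation protocol: evaluate under per-cell optimal guidance and report error bars over several training seeds.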

Comments Website: https://kyutai.org/fid-lottery

Details
AI Chinese abstract

Fréchet Inception Distance (FID) 是图像生成事实上的评判标准,但大多数论文仅报告单个训练模型在单个采样种子下得到的单一数值。如果我们重新训练模型,或仅从中重新采样,该数值的可复现性如何?在本文中,我们将 FID 视为训练种子与生成种子构成的二维面板上的随机变量,并直接在数百个在类别条件 ImageNet 256x256 上训练的 SiT 网络上测量其方差。我们报告了令人惊讶的发现:(a) 使用相同配方但不同种子重新训练模型,所引起的 FID 变动(在 Inception 特征空间中)是从固定网络重新采样的 3.2 倍。(b) 这一差距由三个因素驱动:随机初始化、数据排序和流匹配损失的每步高斯噪声。(c) 增加计算量或模型大小几乎不会收窄离散程度,FID 变异系数 (CoV) 始终保持在 1-2% 的区间内。(d) 逐单元格(per-cell)的无分类器引导调优使离散程度减半,但会重新洗牌哪些种子效果最好;幸运的训练种子达到相同 FID 所需的计算量最多可减少至不幸种子的一半。基于这些发现,我们推荐一种新的 FID 评估协议:在逐单元格最优引导下进行评估,将任何低于经验测得的约 1.3% CoV 的 FID 差距视为无法定论,并报告跨多个训练种子的误差条,而不是单一的 FID 数值。

English abstract

The Fréchet Inception Distance (FID) is the de facto arbiter of image generation, yet most papers report just a single number from a single trained model using a single sampling seed. How reproducible is that number if we retrain the model, or merely resample from it? In this paper, we treat FID as a random variable on a two-axis panel of training and generation seeds, and measure its variance directly on several hundred SiT networks trained on class-conditional ImageNet 256x256. We report surprising findings: (a) Retraining the model using the same recipe with a different seed moves FID 3.2x more (in Inception feature space) than redrawing samples from a fixed network. (b) That gap is driven by three factors: random initialisation, data ordering, and the per-step Gaussian noise of the flow-matching loss. (c) Increasing compute or model size barely tightens the spread, holding the FID coefficient of variation (CoV) inside a 1-2% band. (d) Per-cell classifier-free-guidance tuning halves the spread but reshuffles which seeds work best, and a lucky training seed reaches the same FID with up to 2x less compute than an unlucky one. Based on these findings, we recommend a new FID evaluation protocol: evaluate under per-cell optimal guidance, treat any FID gap below the empirically measured ~1.3% CoV as inconclusive, and report an error bar over several training seeds rather than a single FID number.
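The recommended protocol can be illustrated with a few lines of arithmetic. The sketch below computes the mean, standard deviation, and coefficient of variation of FID over several training seeds, and flags a gap between two models as inconclusive when it falls below a CoV threshold. Reading the "gap" as a relative difference between two FID values is an assumption on my part, and the FID numbers are invented for illustration only; they are not results from the paper.

```python
from statistics import mean, stdev


def summarize_fid(fids):
    """Mean, sample standard deviation, and CoV of FID over training seeds."""
    m = mean(fids)
    s = stdev(fids)
    return {"mean": m, "std": s, "cov": s / m}


def gap_is_inconclusive(fid_a, fid_b, cov_threshold=0.013):
    """True when the relative FID gap is smaller than the CoV threshold.

    The default 0.013 mirrors the ~1.3% CoV quoted in the abstract; using the
    midpoint of the two values as the reference is an assumed convention.
    """
    reference = (fid_a + fid_b) / 2.0
    return abs(fid_a - fid_b) / reference < cov_threshold


# Hypothetical FID values from five training seeds of one recipe.
seed_fids = [2.00, 2.04, 1.96, 2.02, 1.98]
summary = summarize_fid(seed_fids)
print(f"FID = {summary['mean']:.2f} +/- {summary['std']:.2f} "
      f"(CoV {100 * summary['cov']:.1f}%)")

print(gap_is_inconclusive(2.00, 2.02))  # small gap, within seed noise
print(gap_is_inconclusive(2.00, 2.10))  # larger gap
```

Reporting the mean with an error bar, rather than the single best seed, is the part of the protocol this sketch is meant to make concrete.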

2606.20488 2026-06-19 cs.CV New submission 75%

How Fragile Are Training-Free AI-Generated Image Detectors? A Controlled Audit of Score Direction, Preprocessing, and Compression

无训练AI生成图像检测器有多脆弱?对分数方向、预处理和压缩的受控审计

Jingwen Zhou, Mingzhe Wang

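Affiliations * Xidian University (西安电子科技大学)

Topic match Other image generation: detecting AI-generated images and evaluating generation quality.

AI summary Audits two training-free detection scores (autoencoder reconstruction and noise-perturbation feature similarity) plus a kNN baseline under one unified protocol. Preprocessing choices alone flip per-generator conclusions by up to 0.38 AUROC, score direction depends on the noise-level hyperparameter, dataset format bias inflates robustness claims, and naive fusion does not beat the best single score.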

Details
AI Chinese abstract

免训练的AI生成图像检测器承诺无需分类器训练即可实现生成器无关的部署,但其报告的数字很少在单一受控协议下进行比较。我们审计了两种代表性的免训练分数,即一种自编码器重建分数(AEROBLADE风格)和一种噪声扰动特征相似性分数(RIGID风格),外加一个朴素的特征kNN对照基线;评测在一个统一的、由GenImage衍生的1,500张图像基准上进行,该基准涵盖七个生成器以及质量为70和50的JPEG压缩。审计得出三个警示性发现。(i)实现细节伪装成方法差异:将LPIPS骨干网络替换(AlexNet -> VGG-16)使整体AUROC变化+0.085,在resize-to-512和原始分辨率预处理之间切换使逐生成器的结论翻转高达0.38 AUROC。(ii)分数方向不是方法的属性而是其超参数的属性:RIGID风格分数在噪声水平sigma=0.05时对SD1.5和Wukong发生反转(AUROC < 0.5),在sigma=0.01时对所有生成器恢复至>0.5,在sigma=0.3时降至0.15。(iii)数据集格式偏差夸大鲁棒性声明:在未做统一重新编码时,AlexNet骨干重建分数在JPEG-50下的AUROC反而高于干净条件;偏差校正后,残余异常定位到单个生成器(BigGAN)。被审计的分数具有互补的逐生成器失败集,但朴素z-score融合未能超过最佳单一分数,表明利用互补性需要方向感知的组合。

English abstract

Training-free detectors of AI-generated images promise generator-agnostic deployment without classifier training, yet their reported numbers are rarely compared under a single controlled protocol. We audit two representative training-free scores -- an autoencoder-reconstruction score (AEROBLADE-style) and a noise-perturbation feature-similarity score (RIGID-style) -- plus a naive feature-kNN control, on a common 1,500-image GenImage-derived benchmark spanning seven generators and JPEG compression at quality 70 and 50. The audit yields three cautionary findings. (i) Implementation details masquerade as method differences: replacing the LPIPS backbone (AlexNet -> VGG-16) changes overall AUROC by +0.085, and switching between resize-to-512 and native-resolution preprocessing flips per-generator conclusions by up to 0.38 AUROC. (ii) Score direction is not a property of the method but of its hyperparameters: the RIGID-style score is inverted (AUROC < 0.5) on SD1.5 and Wukong at noise level sigma=0.05, recovers to >0.5 for every generator at sigma=0.01, and collapses to 0.15 at sigma=0.3. (iii) Dataset format bias inflates robustness claims: without unified re-encoding, AUROC under JPEG-50 exceeds the clean condition for the AlexNet-backbone reconstruction score; after bias correction the residual anomaly localizes to a single generator (BigGAN). The audited scores have complementary per-generator failure sets, but naive z-score fusion does not beat the best single score, indicating that exploiting complementarity requires direction-aware combination.
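Finding (ii) is easier to see with a concrete AUROC computation. The sketch below uses the pairwise (Mann-Whitney) definition of AUROC and shows that a score pointing the "wrong way" yields a value below 0.5, and that negating the score gives exactly one minus that value. The score lists are invented toy numbers, not values from the paper, and the function name is hypothetical.

```python
def auroc(scores_fake, scores_real):
    """AUROC when a higher score is meant to indicate an AI-generated image.

    Computed as the fraction of (fake, real) pairs in which the fake image
    receives the higher score, counting ties as half a win.
    """
    wins = 0.0
    for f in scores_fake:
        for r in scores_real:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(scores_fake) * len(scores_real))


# Toy scores where generated images mostly score LOWER than real ones,
# i.e. the assumed direction is inverted for this setting.
fake = [0.1, 0.2, 0.4]
real = [0.3, 0.5, 0.6]

inverted = auroc(fake, real)
flipped = auroc([-s for s in fake], [-s for s in real])
print(inverted, flipped)
```

Because the direction can change with a hyperparameter, an evaluation that fixes the sign once and reuses it across settings can report a detector as worse than chance when the same score, negated, separates the classes well.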

2606.20563 2026-06-19 cs.CV New submission 70%

JanusMesh: Fast and Zero-Shot 3D Visual Illusion Generation via Cross-Space Denoising

JanusMesh: 通过跨空间去噪实现快速零样本3D视觉错觉生成

Siang-Ling Zhang, Huai-Hsun Cheng, Tsung-Ju Yang, Yu-Lun Liu

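Affiliations * National Yang Ming Chiao Tung University (国立阳明交通大学)

Topic match Other image generation: generates dual-semantic 3D visual illusions, which falls under image generation.

AI summary Proposes a fast, training-free framework that uses cross-space dual-branch denoising and view-conditioned texture synthesis to generate highly realistic dual-semantic 3D visual illusions in 3-5 minutes, outperforming existing methods.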

Comments ECCV 2026. Project page: https://siang1105.github.io/JanusMesh.github.io/

Details
AI Chinese abstract

创建3D视觉错觉(即一个从不同视角呈现完全不同语义的单一3D网格)是一个迷人但艰巨的挑战。现有的基于优化的方法速度慢且可能产生过饱和颜色。相比之下,简单的拼接方法无法生成几何一致的物体,导致可见的不自然接缝和语义泄露。在本文中,我们提出了一个快速且无需训练的框架,用于生成文本驱动的3D视觉错觉。我们的方法将生成过程解耦为两个阶段。首先,我们提出一个跨空间双分支去噪过程。该过程动态地将3D潜在变量解码到体素空间,用于CLIP引导的方向对齐和符号距离场(SDF)混合,确保无缝的几何融合。其次,我们引入一个视图条件纹理合成模块,将特定视图的2D扩散先验投影并聚合到融合的几何上。大量实验表明,我们的方法在仅3-5分钟内生成高度逼真的双语义3D错觉,在几何完整性、语义可识别性和效率上显著优于现有方法。项目页面:https://siang1105.github.io/JanusMesh.github.io/

English abstract

Creating a 3D visual illusion, a single 3D mesh that reveals entirely different semantics from different viewing angles, is a fascinating but tough challenge. Existing optimization-based methods are slow and can produce oversaturated colors. In contrast, naive stitching approaches fail to produce geometrically coherent objects, which results in visible unnatural seams and semantic leaks. In this paper, we present a fast and training-free framework for generating text-driven 3D visual illusions. Our approach decouples the generation into two stages. First, we propose a cross-space dual-branch denoising process. This process dynamically decodes 3D latents into voxel space for CLIP-guided orientation alignment and Signed Distance Field (SDF) blending, which ensures seamless geometric fusion. Second, we introduce a view-conditioned texture synthesis module that projects and aggregates view-specific 2D diffusion priors onto the fused geometry. Extensive experiments demonstrate that our method generates highly realistic, dual-semantic 3D illusions in just 3-5 minutes. It significantly outperforms existing methods in geometric integrity, semantic recognizability, and efficiency. Project page: https://siang1105.github.io/JanusMesh.github.io/
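The abstract names SDF blending but does not say which blend operator is used, so the following is a generic illustration rather than the JanusMesh method. It fuses two shapes with a polynomial smooth-minimum union, a common graphics technique for merging signed distance fields without a hard crease. The two spheres, the blend width `k`, and all function names are hypothetical choices for the example.

```python
from math import sqrt


def sphere_sdf(point, center, radius):
    """Signed distance to a sphere: negative inside, positive outside."""
    return sqrt(sum((p - c) ** 2 for p, c in zip(point, center))) - radius


def smooth_union(d1, d2, k):
    """Polynomial smooth minimum of two signed distances.

    k > 0 sets the width of the blend region; where the two distances
    differ by more than k, this reduces to an ordinary min (hard union).
    """
    h = max(k - abs(d1 - d2), 0.0) / k
    return min(d1, d2) - h * h * k * 0.25


def blended_sdf(point, k=0.2):
    """Blend two overlapping spheres standing in for two source shapes."""
    d_a = sphere_sdf(point, (-0.4, 0.0, 0.0), 0.5)
    d_b = sphere_sdf(point, (0.4, 0.0, 0.0), 0.5)
    return smooth_union(d_a, d_b, k)


# Sample the fused field at the midpoint between the two spheres and at a
# point that is only near the first sphere.
print(blended_sdf((0.0, 0.0, 0.0)))
print(blended_sdf((-0.9, 0.0, 0.0)))
```

The smooth union never exceeds the hard minimum, so the fused surface bulges slightly outward in the blend region instead of leaving a seam, which is the qualitative behavior a seamless geometric fusion needs.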