arXivDaily arXiv每日学术速递 周一至周五更新

视觉与机器人

图像生成

图像生成、文生图、图像编辑、扩散模型和可控生成。

今日/当前日期收录 4 信号源:cs.CV, cs.GR, cs.MM
2606.19162 2026-06-18 cs.LG cs.CV 新提交 85%

The Reward Was in Your Data All Along: Correcting Flow Matching with Discriminator-Guided RL

奖励一直就在你的数据中:用判别器引导的强化学习纠正流匹配

Nicolas Beltran-Velez, Felix Friedrich, Zhang Xiaofeng, Reyhane Askari-Hemmat, Xiaochuang Han, Adriana Romero-Soriano, Michal Drozdzal

发表机构 * FAIR at Meta Columbia University Mila -- Qu\' e bec AI Institute McGill University Canada CIFAR AI Chair

专题命中 扩散模型 :用RL纠正流匹配模型视觉缺陷,提升生成质量

AI总结 针对流匹配模型因损失函数与样本质量不匹配导致的视觉缺陷,提出判别器引导的强化学习(DRL),利用预训练空间中判别器的logit作为奖励,显著提升无引导FID和语义FD,并改善偏好对齐。

Comments 84 pages, including appendices

详情
AI中文摘要

得分匹配和流匹配模型通常依赖基于偏好的强化学习来实现两个目的:与主观偏好对齐,以及令人惊讶地恢复视觉真实性和连贯对象结构等属性——而这些属性本应通过匹配训练从数据本身学习。我们认为这反映了结构上的不匹配。匹配损失衡量训练时边缘分布下速度或得分场的$\ell_2$回归误差,这一代理指标与决定推理时样本质量的视觉和语义属性对齐不良。给定一个与这些属性对齐的奖励,强化学习通过评估模型自身生成的样本并直接遵循奖励景观来规避不匹配。挑战在于如何在不依赖人类偏好的情况下获得这样的奖励,因为人类偏好昂贵且会将数据真实性与标注者倾向混为一谈。我们提出判别器引导的强化学习(DRL)。DRL训练一个判别器,在预训练表示空间中区分数据样本和基础模型样本,并将其logit作为KL正则化强化学习中的奖励。预训练空间将判别器限制在感知有意义的方向上,而logit估计数据与模型之间的对数似然比,这是针对数据分布的最优奖励。在SiT、JiT、REPA和RAE上,DRL降低了无引导FID(例如,SiT上从9.38降至2.62)和语义空间FD(例如,SiT上DINOv3从88.2降至19.3),在所有骨干网络上均有一致提升,并且在没有经过偏好奖励训练的情况下改善了人类偏好奖励。在后续基于偏好的后训练中,DRL还在偏好奖励与图像保真度之间产生了更好的帕累托前沿,在提高对齐度的同时减少了过饱和和过亮等低级伪影。

英文摘要

Score- and flow-matching models often rely on preference-based reinforcement learning for two purposes: aligning with subjective preferences and, surprisingly, recovering properties such as visual realism and coherent object structure that matching-based training is intended to learn from the data itself. We argue that this reflects a structural mismatch. Matching losses measure $\ell_2$ regression error on the velocity or score field under training-time marginals, a proxy poorly aligned with the visual and semantic properties that determine sample quality at inference. Given a reward aligned with these properties, RL sidesteps the mismatch by evaluating the model on its own samples and following the reward landscape directly. The challenge is to obtain such a reward without relying on human preferences, which are expensive and conflate data realism with annotator inclinations. We propose Discriminator-Guided RL (DRL). DRL trains a discriminator to separate data from base-model samples in a pretrained representation space and uses its logit as the reward in KL-regularized RL. The pretrained space restricts the discriminator to perceptually meaningful directions, and the logit estimates the log-likelihood ratio between data and model, which is the optimal reward for targeting the data distribution. Across SiT, JiT, REPA, and RAE, DRL reduces guidance-free FID (e.g., $9.38 \to 2.62$ on SiT) and semantic-space FD (e.g., $88.2 \to 19.3$ on DINOv3 for SiT), with consistent gains across all backbones, and improves human-preference rewards without training on them. It also yields a better Pareto frontier between preference reward and image fidelity under subsequent preference-based post-training, increasing alignment while reducing low-level artifacts such as oversaturation and excessive brightness.

2606.18765 2026-06-18 cs.CV 新提交 85%

SpectralDiT: Timestep-Conditioned Spectral Residual Correction for Flow-Matching DiTs

SpectralDiT:流匹配DiT的时间步条件谱残差校正

Jiayu Tian

发表机构 * Peking University(北京大学)

专题命中 扩散模型 :改进流匹配DiT,谱残差校正提升生成质量。

AI总结 提出SpectralDiT,通过时间步条件谱残差校正模块,在CIFAR-10和ImageNet-100上以极少额外计算和参数提升流匹配DiT的生成质量,FID分别降低5.1%和8.7%。

详情
AI中文摘要

我们提出SpectralDiT,一种对流匹配扩散变换器(Diffusion Transformers)的轻量级修改,它在MLP残差分支中添加了时间步条件谱校正。该模块将每个残差更新分解为补丁-令牌网格上的低频和高频分量,然后学习一个零初始化的加法门,使得模型最初与基线DiT匹配。在CIFAR-10像素空间生成中,SpectralDiT在补丁大小为1时将FID从20.78提升至19.71,并缩小了径向傅里叶谱差距。此外,我们将方法扩展到ImageNet-100上的潜在扩散。在额外理论FLOPs增加0.6%和参数增加1.36%的情况下,SpectralDiT改进了潜在流匹配,在无分类器引导(CFG 2.0)下实现了8.7%的相对FID降低。所有报告结果均为五个种子的平均值。在CIFAR-10上的消融实验和门控可视化揭示了稳定的块特定谱校正模式。

英文摘要

We propose SpectralDiT, a lightweight modification to flow-matching Diffusion Transformers that adds timestep-conditioned spectral correction to the MLP residual branch. The module decomposes each residual update into low- and high-frequency components on the patch-token grid, then learns a zero-initialized additive gate so the model initially matches the baseline DiT. On CIFAR-10 pixel-space generation, SpectralDiT improves FID from 20.78 to 19.71 at patch size 1 and reduces the radial Fourier spectrum gap. Furthermore, we scale our method to latent diffusion on ImageNet-100. With 0.6% additional theoretical FLOPs and 1.36% additional parameters, SpectralDiT improves latent flow-matching, achieving an 8.7% relative FID reduction under classifier-free guidance (CFG 2.0). All reported results are averaged over five seeds. Ablations and gate visualizations on CIFAR-10 reveal stable block-specific spectral correction patterns.

2606.19163 2026-06-18 cs.DC 新提交 75%

Pulse: Training Acceleration for Large Diffusion Models with Automatic Pipeline Parallelism

Pulse: 面向大规模扩散模型的自动流水线并行训练加速

Boran Sun, Guoyong Jiang, Lin Zhang, Chen Chen, Yuechen Tao, Zhishu Che, Jieling Yu, Shan Chang, Huaxi Gu, Fangming Liu, Bo Li

专题命中 扩散模型 :针对扩散模型训练加速,优化UNet流水线并行

AI总结 提出PULSE自动流水线并行策略,通过将跳跃连接层同设备放置、局部缓存激活值,消除跨流水线通信,结合动态规划分区器、ILP调度合成器和混合并行调优器,在通信受限硬件上实现最高2.3倍吞吐提升。

Comments Accepted by International Conference on Distributed Computing Systems(ICDCS'26)

详情
AI中文摘要

扩散模型目前是高保真图像和视频生成的主流方法,但在GPU集群上扩展其训练仍具挑战。与仅含Transformer的架构不同,扩散骨干通常采用具有异构层和长距离跳跃连接的UNet风格编码器-解码器结构。在传统流水线并行下,这些非局部依赖迫使大型跳跃激活值及其梯度穿越多个流水线边界,使得点对点(P2P)通信成为主要瓶颈,并显著降低流水线效率。本文提出PULSE,一种自动流水线并行训练策略,将跳跃局部性作为首要优化目标。PULSE通过将跳跃连接的编码器-解码器层放置在同一设备上,并在本地缓存跳跃激活值以供反向传播使用,从而消除跳跃引起的通信。为了在保持高流水线利用率的同时实现这种放置,PULSE协同设计了:(1)一种跳跃感知的动态规划分区器,在对称共置约束下平衡异构阶段负载;(2)一种基于ILP的调度合成器,为生成的阶段到设备映射生成气泡高效的波调度;(3)一种混合并行调优器,在内存和网络约束下选择流水线/数据并行度及微批次大小。大量实验表明,与最先进的并行策略相比,通信量可减少89%,在通信受限硬件上训练吞吐量可提升高达2.3倍。

英文摘要

Diffusion models are now a dominant approach for high-fidelity image and video generation, yet scaling their training across GPU clusters remains challenging. Unlike transformer-only architectures, diffusion backbones commonly adopt UNet-style encoder-decoder structures with heterogeneous layers and long-range skip connections. Under conventional pipeline parallelism, these non-local dependencies force large skip activations and their gradients to traverse multiple pipeline boundaries, making peer-to-peer (P2P) communication a dominant bottleneck and substantially reducing pipeline efficiency. In this paper, we present PULSE, an automatic pipeline-parallel training strategy that makes skip locality a first-class optimization objective. PULSE eliminates skip-induced communication by collocating skip-connected encoder-decoder layers on the same device and caching skip activations locally for later use in backpropagation. To realize this placement while maintaining high pipeline utilization, PULSE co-designs: (1) a skip-aware dynamic-programming partitioner that balances heterogeneous stage workloads under symmetric collocation constraints, (2) an ILP-based schedule synthesizer that generates bubble-efficient wave schedules for the resulting stage-to-device mapping, and (3) a hybrid parallelism tuner that selects pipeline/data-parallel degrees and microbatch sizes under memory and network constraints. Our extensive experiments show that the volume of communication can be reduced by 89 percent, and the training throughput can be increased by up to 2.3x on communication-bound hardware, compared with state-of-the-art parallelism strategies.

2606.19151 2026-06-18 cs.CY cs.CV 新提交 70%

The Market in the Model: Latent Diffusion as Neural Economy

模型中的市场:潜在扩散作为神经经济

Eryk Salvaggio

发表机构 * Cambridge Digital Humanities(剑桥数字人文研究中心) University of Cambridge(剑桥大学) Machine Visual Culture Research Group(机器视觉文化研究组) Max Planck Institute(马克斯·普朗克研究所)

专题命中 扩散模型 :分析潜在扩散模型机制,属于图像生成理论

AI总结 本文从计算机视觉工程问题出发,分析潜在扩散模型的机制,论证其作为神经经济运作,将社会交流抽象为可通约向量,并警示仅关注版权与商品防御的批评可能强化模型产生的拜物教。

详情
AI中文摘要

在视觉文化和人文学科中,对生成图像模型的有价值批评强调了数据集在塑造其生成图像中的作用。然而,对嵌入模型机制的意识形态立场的细致研究一直被忽视,使得它们被想象为“黑箱”。为了扩展而非取代数据集批评,本文从潜在扩散模型被引入以解决计算机视觉工程师问题的角度,以及每个组件被赋予自动化决策的任务,审视了其机制。我通过其各部分的历史以及系统刻入每个生成图像中的视觉理论来解释这个集成。借鉴Impett和Offert的神经交换价值概念,我提出这一分析以论证该模型作为神经经济运作:一个封闭的符号系统,将社会交流抽象为可通约向量,同时将社会领域转化为待售包裹。逐组件追踪训练和生成流程揭示了每个操作取代了什么,以及它如何进一步巩固平台经济和注意力经济对社会交流的逻辑。本文警告,任何只关注版权和商品防御的批评都可能重申模型所产生的拜物教,并主张以社会交换为中心。

英文摘要

Valuable critique of generative image models within visual culture and the humanities has emphasized the role of datasets in shaping the images they produce. Yet, close studies of the ideological positions embedded into the mechanism of the models have been neglected, leaving them imagined as "black boxes." In a bid to expand, rather than replace, dataset critique, this paper examines the mechanisms of the latent diffusion model in terms of the problems they were brought in to solve on behalf of computer vision engineers, and the decisions each component was tasked with automating. I interpret that ensemble through the histories of its parts and the theory of vision the system inscribes into every generated image. Drawing on Impett and Offert's notion of neural exchange value, I offer this analysis to argue that the model operates as a neural economy: a contained symbolic system that abstracts social communication into commensurable vectors as it transfers the social sphere into parcels for sale. Tracing the training and generation pipelines component by component reveals what each operation displaces, and how it further entrenches the logics of platform and attention economies over social communication. The paper warns that any critique fixated exclusively on copyright and commodity defenses risks reaffirming the very fetishism the model produces, and argues instead for centering social exchange.