Injecting Image Guidance into Text-Conditioned Diffusion Models at Inference
Agata Żywot, Iason Skylitsis, Thijmen Nijdam, Zoe Tzifa-Kratira, Derck Prinzhorn, Konrad Szewczyk, Aritra Bhowmik
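AI summary: Proposes Visual Concept Fusion (VCF), a method that conditions on both an image and a text prompt at inference time without retraining, injecting visual concepts by aligning CLIP image features with the text embedding space.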
Text-to-image diffusion models such as Stable Diffusion generate high-quality images from text, but they lack a way to inject visual guidance (e.g., sketches, styles) at inference without retraining. Existing methods either require computationally expensive fine-tuning or rely on style transfer techniques that risk semantic misalignment with the text prompt. We introduce Visual Concept Fusion (VCF), the first method offering dual conditioning on both an image and a text prompt at inference time without any concept-specific training. VCF enables visual concept injection into Stable Diffusion by aligning CLIP image features with the text embedding space. VCF consists of three components: (1) a lightweight aligner that maps image tokens to the text embedding manifold using InfoNCE and cross-attention reconstruction losses, (2) a fusion strategy that preserves both textual and visual semantics, and (3) an optional Prompt-Noise Optimization (PNO) module for test-time refinement. Our experiments demonstrate that VCF successfully transfers visual attributes including style, composition, and color palette from reference images while maintaining prompt adherence. Quantitative results show a trade-off between text alignment (CLIP score) and visual correspondence (LPIPS), with VCF outperforming baselines in reference fidelity.
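The InfoNCE objective named for the aligner can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the abstract does not give the temperature, the batch construction, or whether the loss is applied symmetrically, so the function name `info_nce`, the helper `l2_normalize`, the one-directional (image-to-text) form, and the default temperature of 0.07 are all assumptions made for illustration. The cross-attention reconstruction loss is not covered.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def info_nce(image_embs, text_embs, temperature=0.07):
    """Mean image-to-text InfoNCE loss over a batch of paired embeddings.

    Row i of image_embs is the positive for row i of text_embs; every other
    text in the batch acts as a negative. The temperature is an assumed
    default, not a value reported in the abstract.
    """
    if len(image_embs) != len(text_embs) or not image_embs:
        raise ValueError("need equally sized, non-empty batches")
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(imgs):
        # Cosine similarity of image i with every text, scaled by temperature.
        logits = [sum(a * b for a, b in zip(img, t)) / temperature for t in txts]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the matching text (index i) as the target.
        total += log_denom - logits[i]
    return total / len(imgs)


# Toy batch: correctly paired embeddings versus deliberately swapped pairs.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the correctly paired embeddings give a much lower loss than the swapped ones, which is the signal an aligner would be trained on: mapped image features are pulled toward their matching text embeddings and pushed away from the other texts in the batch.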