arXivDaily: arXiv daily academic digest, updated Monday through Friday

Vision and Robotics

Image Generation

Image generation, text-to-image, image editing, diffusion models, and controllable generation.

Today / current date: 2 papers included. Signal sources: cs.CV, cs.GR, cs.MM
2605.10898 2026-06-19 cs.HC version update 70%

How Creatives Approach GenAI Image Generation: Tensions Between Structured Guidance, Self-Experimentation, and Creative Autonomy

Haidan Liu, Isabelle Kwan, Taiga Okuma, Jeffrey Loverock, Nicholas Vincent, Parmit K Chilana

Topic match: text-to-image. Studies how creatives use GenAI image generation tools.

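AI summary: The study examines how creatives balance structured guidance and self-experimentation when using generative AI image tools. It finds that although guidance helps them understand AI, many still prefer self-exploration to preserve creative freedom.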

Comments: Accepted at ACM Creativity & Cognition 2026

Abstract

As generative AI tools increasingly influence creative practice, they raise longstanding HCI questions about how creatives learn complex software and how they can be better supported. We conducted an interview study with artists and hobbyists (n=8) and a follow-up survey (n=159) to understand how this population approaches and seeks guidance for GenAI image tools. We found that creatives commonly use either self-experimentation or tutorials to explore GenAI tools, yet many struggle with confusing AI terminology. To gain further insight into creatives' learning experiences, we developed a research probe to elicit creatives' perceptions of structured guidance. Our user study with 17 creatives revealed that, even when creatives described the guidance as helpful for understanding AI, many still preferred self-experimentation, feeling that guidance could limit their creativity. Our findings highlight a central tension in supporting AI literacy for creatives: balancing guidance and promoting literacy while preserving creative freedom.

2506.06952 2026-06-19 cs.CV version update 70%

LaTtE-Flow: Layerwise Timestep-Expert Flow-based Transformer

Ying Shen, Zhiyang Xu, Jiuhai Chen, Shizhe Diao, Jiaxin Zhang, Yuguang Yao, Joy Rimchala, Ismini Lourentzou, Lifu Huang

Affiliations: University of Illinois Urbana-Champaign, University of Maryland, Nvidia, Salesforce AI Research, Intuit AI Research

Topic match: text-to-image. Proposes an efficient architecture for image generation with around 6x faster inference.

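AI summary: Proposes LaTtE-Flow, an efficient unified architecture built on pretrained vision-language models. It uses layerwise timestep experts and a timestep-conditioned residual attention mechanism to support both image understanding and generation, with around 6x faster generation.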

Comments: Unified multimodal model, Flow-matching

Abstract

Recent advances in multimodal foundation models unifying image understanding and generation have opened exciting avenues for tackling a wide range of vision-language tasks within a single framework. Despite progress, existing unified models typically require extensive pretraining and struggle to achieve the same level of performance compared to models dedicated to each task. Additionally, many of these models suffer from slow image generation speeds, limiting their practical deployment in real-time or resource-constrained settings. In this work, we propose Layerwise Timestep-Expert Flow-based Transformer (LaTtE-Flow), a novel and efficient architecture that unifies image understanding and generation within a single multimodal model. LaTtE-Flow builds upon powerful pretrained Vision-Language Models (VLMs) to inherit strong multimodal understanding capabilities, and extends them with a novel Layerwise Timestep Experts flow-based architecture for efficient image generation. LaTtE-Flow distributes the flow-matching process across specialized groups of Transformer layers, each responsible for a distinct subset of timesteps. This design significantly improves sampling efficiency by activating only a small subset of layers at each sampling timestep. To further enhance performance, we propose a Timestep-Conditioned Residual Attention mechanism for efficient information reuse across layers. Experiments demonstrate that LaTtE-Flow achieves strong performance on multimodal understanding tasks, while achieving competitive image generation quality with around 6x faster inference speed compared to recent unified multimodal models.
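
The layerwise timestep-expert idea in the abstract above (groups of Transformer layers, each responsible for a subset of timesteps, with only one group active per sampling step) can be illustrated with a toy routing sketch. This is not the paper's implementation: the abstract does not say how timesteps are partitioned, so the equal-width split of [0, 1], the contiguous layer blocks, and the names `timestep_to_group`, `active_layers`, and `sampling_schedule` are all assumptions made for illustration.

```python
from typing import List, Tuple


def timestep_to_group(t: float, num_groups: int) -> int:
    """Map a flow time t in [0, 1] to the index of the layer group assigned to it."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    # Assumption: [0, 1] is split into equal-width intervals, one per group.
    # t == 1.0 is clamped into the last group.
    return min(int(t * num_groups), num_groups - 1)


def active_layers(t: float, num_layers: int, num_groups: int) -> List[int]:
    """Return the indices of the layers that would run at time t."""
    if num_layers <= 0 or num_groups <= 0 or num_layers % num_groups != 0:
        raise ValueError("num_layers must be a positive multiple of num_groups")
    group_size = num_layers // num_groups
    group = timestep_to_group(t, num_groups)
    # Assumption: each group is a contiguous block of layers.
    return list(range(group * group_size, (group + 1) * group_size))


def sampling_schedule(
    num_steps: int, num_layers: int, num_groups: int
) -> List[Tuple[float, List[int]]]:
    """Pair each sampling time t = i / num_steps with its active layer indices."""
    return [
        (i / num_steps, active_layers(i / num_steps, num_layers, num_groups))
        for i in range(num_steps)
    ]


if __name__ == "__main__":
    for t, layers in sampling_schedule(num_steps=8, num_layers=8, num_groups=4):
        print(f"t={t:.3f} -> layers {layers}")
```

In this toy setup with 8 layers and 4 groups, every sampling step touches 2 of the 8 layers. That fraction is a property of the sketch, not a measurement from the paper, and it does not by itself account for the reported speedup of around 6x.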