arXivDaily: arXiv daily academic digest, updated Monday through Friday

Vision and Robotics

Image Generation

Image generation, text-to-image, image editing, diffusion models, and controllable generation.

Entries for today / current date: 3. Signal sources: cs.CV, cs.GR, cs.MM
2606.16849 2026-06-18 cs.NE cs.GR cs.HC New submission 80%

Evolution & Foundation: AI Shares Creative Control

Dylan Banarse, Stephen Todd, William Latham, Frederic Fol Leymarie

Topic match (controllable generation): genetic algorithms and multimodal AI for generating 3D organic forms

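AI summary: Proposes a framework that combines genetic algorithms with multimodal AI foundation models to automate the design of 3D organic forms, shifting the artist's role from direct selection to system design and accelerating creative exploration.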

Abstract

This paper investigates the creative process of automated design and artistic evaluation using an evolutionary system. We consider how a multimodal artificial intelligence (AI) model can communicate with and guide a combined generative and evolutionary computational system. This creates a framework for the evolution of aesthetically pleasing complex 3D organic forms by integrating genetic algorithms with the visual reasoning capabilities of large-scale AI foundation models. The framework shifts the artist's role from that of intensive direct selection to one of system design, transferring detailed step-by-step curation to an AI agent capable of multimodal aesthetic judgement. This framework enables the human artist/designer to rapidly traverse large areas of multi-dimensional evolutionary parameter space to find creative outcomes based on their semantic targets. Detailed audit trails of the AI's aesthetic reasoning are generated for each experiment. Interactive visualisation tools, together with AI-generated summaries and evolutionary narratives, enable deep exploration of each evolutionary experiment and provide transparent insight into the AI-guided process.
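To make the loop described in this abstract concrete, here is a minimal sketch of a genetic algorithm whose selection step is delegated to a scoring function. It is not the authors' system: `evolve`, `toy_judge`, the genome layout, and every parameter value are hypothetical, and `toy_judge` is a plain numeric stand-in for what the paper describes as a multimodal AI model judging rendered 3D forms. The `audit_trail` list only imitates the idea of logging a record per generation.

```python
import random


def evolve(score_fn, genome_len=6, pop_size=12, generations=20, seed=0):
    """Evolve real-valued genomes; score_fn stands in for the AI aesthetic judge."""
    rng = random.Random(seed)
    population = [
        [rng.uniform(-1.0, 1.0) for _ in range(genome_len)]
        for _ in range(pop_size)
    ]
    audit_trail = []  # one record per generation (hypothetical log format)
    for gen in range(generations):
        ranked = sorted(population, key=score_fn, reverse=True)
        best = ranked[0]
        audit_trail.append({"generation": gen, "best_score": score_fn(best)})
        parents = ranked[: pop_size // 2]   # keep the better-scoring half
        children = [best[:]]                # elitism: copy the best genome unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, genome_len)
            child = a[:cut] + b[cut:]       # one-point crossover
            i = rng.randrange(genome_len)
            child[i] += rng.gauss(0.0, 0.1)  # small Gaussian mutation on one gene
            children.append(child)
        population = children
    return max(population, key=score_fn), audit_trail


def toy_judge(genome):
    """Placeholder judge (assumption): prefers genomes whose genes are near 0.5."""
    return -sum((g - 0.5) ** 2 for g in genome)


best_genome, trail = evolve(toy_judge)
print(trail[0]["best_score"], trail[-1]["best_score"])
```

Because the best genome is carried over unchanged each generation, the logged best score never decreases. Replacing `toy_judge` with a call to an image-capable model would be the step that corresponds to the AI-guided curation the abstract describes.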

2606.13768 2026-06-18 cs.CV cs.AI New submission 80%

CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation

Sharath Girish, Tsai-Shien Chen, Zhikang Dong, Mukesh Singhal, Hao Chen, Sergey Tulyakov, Aliaksandr Siarohin

Affiliations: Snap Inc.; UC Merced

Topic match (controllable generation): fine-grained conditional control with a diffusion model

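AI summary: Proposes CineOrchestra, a video diffusion model with unified control over subjects, events, cameras, and shot transitions. It achieves joint multi-axis control through entity-centric conditioning primitives and parameter-free rotary position embeddings, and outperforms six specialist methods on dense caption following and shot-transition timing.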

Comments: Project page: https://snap-research.github.io/CineOrchestra

Abstract

Cinematic video depicts multiple subjects acting or interacting at specific moments, captured with deliberate camera movement, and stitched together by shot transitions. Together, these elements demand a level of fine-grained control beyond current text-to-video models. Existing work addresses each axis in isolation: multi-subject personalization, temporal control, multi-shot synthesis, or camera control; no prior framework jointly integrates all four. We present CineOrchestra, a unified video diffusion model that controls subjects, events, cameras, and shot transitions simultaneously. Our key insight is that these heterogeneous cinematic elements share a fundamental structure: each is an entity acting over a specific temporal interval, so all of them can be expressed through one shared structure of entity-centric conditioning primitives, augmented with reference images for visual entities. This formulation reduces the architectural challenge to a single positional encoding problem, which we solve with two parameter-free coordinated rotary embeddings: (a) an interval-sampled temporal RoPE that yields consistent attention behavior across events of dramatically varying duration, and (b) a 2D entity-temporal cross-attention RoPE that disambiguates per-entity conditions and routes each to its corresponding spatiotemporal region. On two new benchmarks, CineOrchestra outperforms six per-axis specialists on dense caption following and shot-transition timing, with consistent gains in a pairwise user study and component ablations.
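For readers unfamiliar with the rotary embeddings this abstract builds on, the sketch below shows standard RoPE only. It does not reproduce the paper's interval-sampled temporal RoPE or its 2D entity-temporal variant, whose details are not given in the abstract. The helper names `rope_rotate` and `dot`, the example vectors, and the positions are illustrative assumptions. The point it demonstrates is the property that makes RoPE parameter-free and position-aware: after rotation, the query-key dot product depends only on the difference between the two positions.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Apply standard rotary position embedding to an even-length vector."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # rotation angle for this feature pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])  # 2D rotation of the pair
    return out


def dot(a, b):
    """Plain dot product, i.e. an unnormalized attention score."""
    return sum(x * y for x, y in zip(a, b))


# Illustrative query/key vectors (made-up numbers).
query = [0.3, -1.2, 0.7, 0.5]
key = [1.1, 0.4, -0.6, 0.9]

# Same relative offset (3.0) at two different absolute positions.
score_a = dot(rope_rotate(query, 5.0), rope_rotate(key, 2.0))
score_b = dot(rope_rotate(query, 105.0), rope_rotate(key, 102.0))
print(abs(score_a - score_b) < 1e-9)
```

Since positions enter only as rotation angles, they can be any real numbers. How the positions are chosen for events and entities is where the paper's contribution lies, and that part is not sketched here.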

2606.18788 2026-06-18 cs.CV cs.CL New submission 75%

HandwritingAgent: Language-Driven Handwriting Synthesis in Scalable Vector Space

Jaward Sesay, Yue Yu, Börje F. Karlsson

Affiliations: Beijing Institute of Technology; Beijing Academy of Artificial Intelligence

Topic match (controllable generation): language-driven generation of handwriting stroke sequences

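AI summary: Proposes HandwritingAgent, which uses a large reasoning model to autoregressively generate handwriting stroke sequences in SVG format without style-specific training. Style is controlled through natural language and a reference image, and the agent matches or surpasses state-of-the-art methods on imitation, recognition, multilingual synthesis, and complex mathematical expression synthesis.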

Abstract

Teaching machines to emulate natural handwriting styles remains an open challenge, as it requires synthesizing stroke sequences that dynamically vary in shape, texture, pressure and script - not only across individuals, but also within a single person's handwriting. Attempts at this challenge have largely explored deep learning methods in both online and offline settings. However, these approaches are often constrained by style-specific architectural choices, heavy reliance on large datasets, high compute costs, and a lack of flexible control over writing styles through natural language. To this end, we introduce HandwritingAgent, a language-driven agent that can synthesize natural handwriting sequences directly in Scalable Vector Graphics (SVG) format with no need for style-specific training. The agent leverages a large reasoning model to geometrically analyse and autoregressively generate target handwritten glyphs as stroke sequences in a discrete grid canvas environment. Generation is conditioned on texts provided in either conversational or non-conversational mode, along with a reference handwriting-style image. Experiments on diverse handwriting tasks spanning imitation, recognition, multi-lingual handwriting synthesis, and generation of complex handwritten maths and science expressions indicate substantial improvement in performance, with HandwritingAgent matching or surpassing state-of-the-art generative handwriting models, while providing a more efficient, controllable, and generalizable synthesis method.
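As a rough illustration of what a stroke sequence on a discrete grid canvas looks like once serialized to SVG, consider the sketch below. It is not HandwritingAgent's format or pipeline: the function name `strokes_to_svg`, the grid size, the cell scale, and the straight-line path commands are all assumptions made for this example, and no reasoning model is involved. Each stroke is simply a list of integer grid points that becomes one SVG `<path>` element.

```python
def strokes_to_svg(strokes, grid_size=16, cell=10):
    """Serialize strokes (lists of integer grid points) into an SVG document string."""
    paths = []
    for stroke in strokes:
        commands = []
        for k, (x, y) in enumerate(stroke):
            if not (0 <= x < grid_size and 0 <= y < grid_size):
                raise ValueError(f"point {(x, y)} is outside the grid")
            op = "M" if k == 0 else "L"  # move to the first point, then draw lines
            commands.append(f"{op} {x * cell} {y * cell}")
        d = " ".join(commands)
        paths.append(
            f'<path d="{d}" fill="none" stroke="black" stroke-width="2"/>'
        )
    side = grid_size * cell
    body = "\n  ".join(paths)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 {side} {side}">\n'
        f"  {body}\n"
        "</svg>"
    )


# A single stroke tracing a capital "L" (made-up coordinates).
letter_l = [[(4, 2), (4, 12), (10, 12)]]
svg = strokes_to_svg(letter_l)
print(svg)
```

Keeping points on an integer grid means a generated sequence is a short list of discrete tokens, which is one plausible reason a language model could emit it step by step. Real handwriting would likely need curves and many more points than this example uses.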