Video Large Models - arXivDaily Topic

2602.15819 2026-06-19 cs.CV Version update 90%

VideoSketcher: Sequential Sketch Generation Using Video Model Priors

Hui Ren, Yuval Alaluf, Omer Bar Tal, Alexander Schwing, Antonio Torralba, Yael Vinker

Institution: MIT

Topic match: video generation. Uses a video diffusion model to generate sequential sketches, combined with LLM planning.

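AI summary: Proposes VideoSketcher, which combines the semantic planning of an LLM with the temporal rendering of a video diffusion model, and uses two-stage fine-tuning to learn stroke order and style from a small number of examples, generating high-quality sequential sketches.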

Abstract

Sketching is inherently sequential: strokes are drawn progressively to explore and refine ideas. Yet most generative approaches treat sketches as static images, ignoring the temporal process underlying creative exploration. Modeling this sequential structure remains challenging: prior methods either rely on large-scale human-drawn datasets with limited diversity, or use large language models (LLMs) to produce drawing instructions, often at the cost of visual fidelity. We present VideoSketcher, a method for generating high-quality sketching processes by adapting pretrained text-to-video diffusion models to the sparse, continuous nature of sketch formation. Our key insight is that LLMs and video diffusion models offer complementary strengths: LLMs act as semantic planners that decompose concepts into step-by-step instructions, while video diffusion models serve as powerful "renderers" that translate them into temporally coherent sketch sequences. We introduce a two-stage fine-tuning strategy that decouples temporal structure from visual appearance: stroke ordering is learned from synthetic shape compositions, while style is distilled from as few as seven hand-drawn examples. Despite minimal supervision, our method can generate diverse, high-quality sequential sketches that faithfully follow specified drawing orders. Our framework naturally extends to brush style control and autoregressive generation, supporting artistic applications.
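As a toy illustration of how a drawing order can be represented as a frame sequence, the sketch below turns an ordered list of step names into cumulative frames. The function name `cumulative_frames`, the shape names, and the `frames_per_step` parameter are assumptions made for this example; it is not the authors' data pipeline or model.

```python
from typing import List, Tuple


def cumulative_frames(steps: List[str], frames_per_step: int = 2) -> List[Tuple[str, ...]]:
    """Return one tuple per frame listing the strokes visible so far."""
    if frames_per_step < 1:
        raise ValueError("frames_per_step must be at least 1")
    frames: List[Tuple[str, ...]] = [()]  # start from a blank canvas
    for i in range(len(steps)):
        visible = tuple(steps[: i + 1])
        # Hold each new state for a few frames so the ordering stays visible.
        frames.extend([visible] * frames_per_step)
    return frames


# Hypothetical step-by-step plan, standing in for a planner's output.
plan = ["circle", "square", "triangle"]
frames = cumulative_frames(plan, frames_per_step=2)
```

Each frame only ever adds strokes relative to the previous one, which is the property a sequential sketch is expected to have.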

2605.31158 2026-06-19 cs.CV cs.LG Version update 85%

Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models

Jiacheng Lu, Haoyi Zhu, Sipei Yi, Enze Xie, Yu Li, Cheng Zhuo

Institutions: Zhejiang University; NVIDIA

Topic match: video generation. Accelerates inference for interactive video world models.

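AI summary: To address the high inference cost of interactive video world models, the paper proposes Light Interaction, a training-free acceleration framework that achieves up to 2.59x speedup through adaptive context management, denoising cache acceleration, and 3D block sparse attention.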

Comments: 13 pages, 6 figures, 3 tables. Project page: https://2843721358l-del.github.io/Light-Interaction-Project/

Abstract

Interactive video world models generate video chunk by chunk in response to user-controlled camera movements, enabling applications such as real-time game simulation, virtual scene navigation, and embodied AI training. However, scaling to long interactive trajectories is prohibitively expensive due to growing context memory, quadratic attention complexity, and repeated denoising steps. We present Light Interaction, a training-free inference acceleration framework for interactive video world models. Our key insight is that interaction naturally enables trajectory-dependent adaptive computation: retrieved spatial memory can be discarded during novel exploration, temporal context can be adjusted according to local latent dynamics, and early-step model outputs can be reused when the camera revisits familiar regions. Based on this insight, Light Interaction combines adaptive context management, denoising cache acceleration, and hardware-software co-designed 3D block sparse attention with fused Triton kernels. Evaluated on HY-WorldPlay and Matrix-Game-3.0, Light Interaction achieves up to 2.59x speedup without model retraining while maintaining competitive visual quality.
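The idea of reusing earlier outputs when the camera revisits a familiar region can be pictured with a small pose-keyed cache. This is only a minimal sketch under stated assumptions: the `PoseCache` class, the Euclidean distance test, the `radius` threshold, and the string "outputs" are all hypothetical and are not taken from Light Interaction, whose actual caching criterion is not described in the abstract.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple


class PoseCache:
    """Toy cache that returns a stored output when a camera pose is revisited."""

    def __init__(self, radius: float) -> None:
        self.radius = radius  # hypothetical threshold for "familiar region"
        self.entries: List[Tuple[Tuple[float, ...], str]] = []

    def lookup(self, pose: Sequence[float]) -> Optional[str]:
        # Return the output of the nearest cached pose within the radius, if any.
        best, best_dist = None, self.radius
        for cached_pose, output in self.entries:
            dist = math.dist(cached_pose, pose)
            if dist <= best_dist:
                best, best_dist = output, dist
        return best

    def store(self, pose: Sequence[float], output: str) -> None:
        self.entries.append((tuple(pose), output))


def step(cache: PoseCache, pose: Sequence[float],
         compute: Callable[[Sequence[float]], str]) -> Tuple[str, bool]:
    """Return (output, cache_hit); run the expensive compute only on a miss."""
    cached = cache.lookup(pose)
    if cached is not None:
        return cached, True
    output = compute(pose)
    cache.store(pose, output)
    return output, False


calls: List[Tuple[float, ...]] = []


def expensive_compute(pose: Sequence[float]) -> str:
    # Stand-in for a costly model call; records how often it actually runs.
    calls.append(tuple(pose))
    return f"chunk{len(calls)}"


cache = PoseCache(radius=0.5)
out1, hit1 = step(cache, (0.0, 0.0), expensive_compute)  # new region
out2, hit2 = step(cache, (5.0, 0.0), expensive_compute)  # new region
out3, hit3 = step(cache, (0.1, 0.0), expensive_compute)  # near the first pose
```

In this toy setup the third step is served from the cache, so the expensive call runs only twice for three steps.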
