See Before You Code: Learning Visual Priors for Spatially Aware Educational Animation Generation
Yuejia Li, Ke He, Junheng Li, Shutong Chen, Jingkang Xia, Zhiyue Su, Junchi Zhang, Mang Ye
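AI summary: This paper proposes the OmniManim framework, which uses visual planning and feedback mechanisms to improve the quality of generated educational animations, both in rendered output and in instructional effectiveness.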
Comments: 21 pages, 4 figures
Large language models can generate executable code for educational animations, but the resulting renders often exhibit visual defects, including element overlap, misalignment, and broken animation continuity. These defects cannot be reliably detected from the code alone and become apparent only after execution. We formalize this problem as render-feedback-aware constrained code generation: given a natural language specification, the model must generate executable code whose rendered output satisfies structured quality criteria that can be evaluated only after rendering. To address this problem, we introduce OmniManim, a render-feedback-aware educational animation generation framework built around a shared scene state, explicit visual planning, structured post-render diagnostics, and localized repair. Within OmniManim, the Vision Agent is a task-specific visual planning module: it predicts sparse keyframe layouts with coarse-to-fine bounding-box denoising and optimizes an interpolation-aware objective to reduce intermediate-frame failures induced by downstream animation interpolation. We further construct two datasets, ManimLayout-1K and EduRequire-500, and provide a reproducible evaluation protocol covering executability, instructional quality, visual quality, and efficiency. On EduRequire-500, OmniManim improves measured render quality over both single-model baselines and existing multi-agent frameworks. Systematic ablation studies further verify that explicit visual planning, especially its coarse spatial prior, bounding-box refinement, and interpolation-aware optimization, is central to these gains.
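To illustrate why intermediate-frame failures can go unnoticed when only keyframes are inspected, the following is a minimal sketch and not the paper's method. It assumes axis-aligned bounding boxes, plain linear interpolation between two keyframes, and a summed pairwise overlap area as the penalty; the names `Box`, `lerp_box`, `overlap_area`, and `interpolation_overlap_penalty` are hypothetical and do not come from OmniManim.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box: lower-left corner (x, y), width w, height h."""
    x: float
    y: float
    w: float
    h: float


def lerp_box(a: Box, b: Box, t: float) -> Box:
    """Linearly interpolate every box parameter from a (t=0) to b (t=1)."""
    return Box(
        a.x + (b.x - a.x) * t,
        a.y + (b.y - a.y) * t,
        a.w + (b.w - a.w) * t,
        a.h + (b.h - a.h) * t,
    )


def overlap_area(p: Box, q: Box) -> float:
    """Area of the intersection of two boxes, 0.0 if they do not intersect."""
    dx = min(p.x + p.w, q.x + q.w) - max(p.x, q.x)
    dy = min(p.y + p.h, q.y + q.h) - max(p.y, q.y)
    return max(dx, 0.0) * max(dy, 0.0)


def interpolation_overlap_penalty(start: dict, end: dict, steps: int = 8) -> float:
    """Sum pairwise overlap areas over frames sampled between two keyframes.

    Assumption: elements move by linear interpolation; a real animation
    engine may use other easing curves or paths.
    """
    names = sorted(start)
    total = 0.0
    for k in range(steps + 1):
        t = k / steps
        frame = {n: lerp_box(start[n], end[n], t) for n in names}
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                total += overlap_area(frame[a], frame[b])
    return total


# Two elements swap places: both keyframes are overlap-free,
# yet the elements pass through each other midway.
start = {"title": Box(0, 0, 2, 1), "formula": Box(4, 0, 2, 1)}
end = {"title": Box(4, 0, 2, 1), "formula": Box(0, 0, 2, 1)}

keyframe_overlap = (
    overlap_area(start["title"], start["formula"])
    + overlap_area(end["title"], end["formula"])
)
penalty = interpolation_overlap_penalty(start, end)
print(keyframe_overlap, penalty)
```

In this toy case the keyframe-only check reports no overlap while the sampled penalty is positive, which is the kind of gap an interpolation-aware objective is meant to close. How OmniManim actually defines and optimizes its objective is not specified in the abstract.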