arXivDaily arXiv每日学术速递 周一至周五更新

视觉与机器人

图像生成

图像生成、文生图、图像编辑、扩散模型和可控生成。

今日/当前日期收录 1 信号源:cs.CV, cs.GR, cs.MM
2606.19195 2026-06-18 cs.CV 新提交 95%

Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance

Moebius: 0.2B轻量级图像修复框架,性能达10B级别

Kangsheng Duan, Ziyang Xu, Wenyu Liu, Xiaohu Ruan, Xiaoxin Chen, Xinggang Wang

发表机构 * Huazhong University of Science and Technology(华中科技大学) VIVO AI Lab(VIVO人工智能实验室)

专题命中 图像修复 :轻量级图像修复框架,属于图像修复

AI总结 提出Moebius轻量级图像修复框架,通过局部-λ混合交互模块和自适应多粒度蒸馏策略,以0.22B参数实现与10B级模型FLUX.1-Fill-Dev相当甚至更优的生成质量,推理速度提升15倍以上。

详情
AI中文摘要

尽管10B级别的工业基础模型推动了图像修复的边界,但其高昂的计算成本严重阻碍了实际部署。构建高度优化的任务特定专家模型是一个有前景的解决方案,然而极端的结构压缩不可避免地引发了严重的表示瓶颈。为解决这一问题,我们提出了Moebius,一个高效的轻量级修复框架。我们通过引入局部-λ混合交互($L\lambda MI$)模块系统地重构了扩散主干。该模块由局部-λ和交互-λ子模块组成,巧妙地将空间上下文和全局语义先验总结为固定大小的线性矩阵,在保留复杂潜在交互的同时大幅减少参数。此外,为了释放这种高度紧凑架构的全部表示能力,我们将其与自适应多粒度蒸馏策略协同配对。该策略严格在潜在空间内操作以避免昂贵的像素空间解码,动态平衡多个基于梯度的损失以实现高保真对齐。在自然和肖像基准上的大量实验表明,这种最优协同使Moebius能够媲美甚至超越10B级工业通用模型FLUX.1-Fill-Dev的生成质量。值得注意的是,Moebius仅使用不到2%的参数(0.22B vs. 11.9B)就实现了这一点,同时总推理时间加速超过15倍,为高保真修复设立了新的效率标准。项目页面见此https URL。

英文摘要

While 10B-level industrial foundation models have pushed the boundaries of image inpainting, their prohibitive computational costs severely hinder practical deployment. Constructing a highly optimized task-specific specialist offers a promising solution; however, extreme structural compression inevitably triggers a severe representation bottleneck. To conquer this, we propose Moebius, a highly efficient lightweight inpainting framework. We systematically reconstruct the diffusion backbone by introducing the Local-$λ$ Mix Interaction ($LλMI$) block. Comprising Local-$λ$ and Interactive-$λ$ modules, it elegantly summarizes spatial contexts and global semantic priors into fixed-size linear matrices, preserving complex latent interactions while drastically shedding parameters. Furthermore, to unlock the full representational capacity of this highly compact architecture, we synergistically pair it with an adaptive multi-granularity distillation strategy. Operating strictly within the latent space to avoid expensive pixel-space decoding, this strategy dynamically balances multiple gradient-based losses to achieve high-fidelity alignment. Extensive experiments across natural and portrait benchmarks demonstrate that this optimal synergy enables Moebius to rival or even surpass the generation quality of the 10B-level industrial generalist FLUX.1-Fill-Dev. Remarkably, Moebius achieves this using less than 2\% of the parameters (0.22B vs. 11.9B) while delivering a $>15\times$ acceleration in total inference time, setting a new efficiency standard for high-fidelity inpainting. Project page at https://hustvl.github.io/Moebius.