From Pixels to Words -- Towards Native One-Vision Models at Scale
Haiwen Diao, Jiahao Wang, Penghao Wu, Yuhao Dong, Yuwei Niu, Yue Zhu, Zhongang Cai, Weichen Fan, Linjun Dai, Silei Wu, Xuanyu Zheng, Mingxuan Li, Yuanhan Zhang, Bo Li, Hanming Deng, Huchuan Lu, Quan Wang, Lei Yang, Lewei Lu, Dahua Lin, Ziwei Liu
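AI summary: This paper proposes NEO-ov, a native foundation model that learns cross-frame and pixel-word correspondence end-to-end, without external encoders or adapters. It narrows the gap to modular models while performing strongly on fine-grained visual perception, supporting the claim that native one-vision architectures are feasible and competitive.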
Details
- Comments: 13 pages, 6 figures
Current vision-language models (VLMs) typically stitch together separate image encoders and language decoders via multi-stage alignment, a modular framework that inevitably fragments pixel-level signals across frames and scatters early pixel-word interactions. In parallel, native VLMs, despite impressive performance on single images, remain largely unexplored in multi-image understanding, video understanding, and spatial intelligence. Hence, we introduce NEO-ov, a native foundation model that learns cross-frame and pixel-word correspondence end-to-end, without any external encoders, auxiliary adapters, or post-hoc fusion. By eliminating module boundaries entirely, NEO-ov enables fine-grained and unified spatiotemporal modeling to emerge natively inside the model. Notably, NEO-ov largely narrows the gap to modular counterparts while excelling at fine-grained visual perception, validating that native "one-vision" architectures are not only feasible but also competitive at scale. Beyond empirical performance, we present systematic architectural analyses and detailed training recipes to facilitate subsequent native multimodal modeling. Our code and models are publicly available at https://github.com/EvolvingLMMs-Lab/NEO.