Stateful Visual Encoders for Vision-Language Models
Zirui Wang, Junwei Yu, Adam Yala, David M. Chan, Joseph E. Gonzalez, Trevor Darrell
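AI summary: Proposes a stateful visual encoder that conditions each visual representation on prior visual features, strengthening a vision-language model's ability to perceive visual changes in multi-image, multi-turn interactions. It yields consistent improvements on tasks such as cross-image spatial aggregation, multi-object visual differencing, and trajectory behavior cloning.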
- Comments
- Project page: https://statefulvisualencoders.github.io/
Vision-language models (VLMs) are increasingly used in multi-image, multi-turn agentic settings where decisions depend on visual changes. However, in existing open-weight VLMs, visual comparisons happen only inside the language model, while the visual encoder itself remains stateless: each image is encoded independently, without access to the prior visual context. As a result, small but task-critical changes may be attenuated before the language model has a chance to compare them, especially when those changes do not affect the high-level semantics of the scene. We introduce a Stateful Visual Encoder, which conditions each visual representation on prior visual features. Under supervised finetuning, VLMs equipped with stateful encoders achieve consistent improvements on controlled tasks involving cross-image spatial aggregation, multi-object visual differencing, and visual trajectory behavior cloning. These improvements are consistent across input resolutions, language model sizes, and VLM backbones. Finally, we validate our model on real-world tasks, including longitudinal radiology, fine-grained image comparison, and remote sensing, where stateful encoders consistently improve generalist VLM baselines and can match or surpass specialized models in selected domains. Project page: https://statefulvisualencoders.github.io/
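To make the stateless versus stateful distinction concrete, the following toy sketch may help. It is not the paper's architecture, which the abstract does not specify. It assumes images are flat lists of floats, uses a hypothetical average-pooling function `encode_stateless` as the per-image encoder, and assumes that conditioning on prior features means appending the difference from the previous image's features. The class name `ToyStatefulEncoder` is likewise illustrative only.

```python
from typing import List, Optional

Vector = List[float]


def encode_stateless(image: Vector) -> Vector:
    """Toy per-image encoder (hypothetical): average-pool with window 2."""
    pooled = []
    for i in range(0, len(image), 2):
        window = image[i:i + 2]
        pooled.append(sum(window) / len(window))
    return pooled


class ToyStatefulEncoder:
    """Encodes each image together with its change from the previous one.

    Assumption: conditioning is modeled as [features | features - previous].
    The actual mechanism used in the paper is not given in the abstract.
    """

    def __init__(self) -> None:
        # State carried across images; None until the first image is seen.
        self.prev: Optional[Vector] = None

    def encode(self, image: Vector) -> Vector:
        feats = encode_stateless(image)
        if self.prev is None:
            # First image: there is nothing to compare against.
            delta = [0.0] * len(feats)
        else:
            delta = [f - p for f, p in zip(feats, self.prev)]
        self.prev = feats
        # List concatenation: current features followed by the change signal.
        return feats + delta
```

In this sketch the second half of each output is an explicit change signal computed inside the encoder, so a downstream language model would not have to recover it by comparing two independently produced encodings. A real design would presumably use learned conditioning rather than a fixed subtraction.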