VLMs May Not Globally Enhance Human Alignment over LLMs During Natural Reading
Jinzhou Wu, Zhengwu Ma, Jixing Li, Baoping Tang, Zitong Lu
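AI summary: Comparing LLMs and VLMs under a strictly text-only setting shows that multimodal pretraining brings no global human-alignment advantage during natural reading, although VLMs show a selective advantage on sentences with strong visual semantic content.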
Comments: 17 pages, 10 figures
Large language models (LLMs) have become increasingly useful computational models of human language processing, but it remains unclear whether vision-language learning makes text representations more human-like during natural reading. Here, we address this question by comparing tightly matched LLM and vision-language model (VLM) pairs under a strictly text-only setting, which isolates the effect of multimodal training history from online visual input or cross-modal fusion. We evaluate model alignment on a human natural-reading dataset that includes whole-cortex fMRI responses and synchronized eye-tracking saccades. Our findings suggest that multimodal pretraining may not confer a uniform, global advantage in human alignment during natural reading, indicating that language-internal representations remain the key factor for modeling human text processing. However, a VLM advantage may emerge more selectively when sentences contain stronger visual semantic content, with converging evidence from both fMRI and eye-movement alignment. Together, our findings provide a controlled in silico framework for testing how visual learning history shapes model-human alignment in language processing, suggesting that multimodal pretraining contributes selectively rather than globally to human-like language representations during natural reading.
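The abstract does not state which alignment metric is used. As a rough illustration of how a paired text-only comparison of this kind could be scored, the sketch below assumes a cross-validated ridge encoding model, a common choice in model-brain alignment work but not confirmed as the authors' method. All function names, array shapes, and the synthetic data are hypothetical and carry no information about the paper's results.

```python
import numpy as np

def ridge_fit(X, Y, alpha):
    """Closed-form ridge weights mapping features X to responses Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)

def pearson_per_column(A, B):
    """Pearson correlation between matching columns of A and B."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    denom = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=0)
    return (A * B).sum(axis=0) / np.where(denom == 0, 1.0, denom)

def alignment_score(features, responses, alpha=1.0, n_folds=5):
    """Mean held-out prediction correlation over folds and response channels.

    features:  (n_sentences, n_dims) text-only embeddings from one model
    responses: (n_sentences, n_channels) human measurements (e.g. voxels)
    """
    n = features.shape[0]
    scores = []
    for test_idx in np.array_split(np.arange(n), n_folds):
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        # Standardize features with training-fold statistics only.
        mu = features[train].mean(axis=0)
        sd = features[train].std(axis=0)
        sd[sd == 0] = 1.0
        X_train = (features[train] - mu) / sd
        X_test = (features[test_idx] - mu) / sd
        y_mean = responses[train].mean(axis=0)
        W = ridge_fit(X_train, responses[train] - y_mean, alpha)
        pred = X_test @ W + y_mean
        scores.append(pearson_per_column(pred, responses[test_idx]).mean())
    return float(np.mean(scores))

# Synthetic sanity check (hypothetical data, not the paper's dataset):
# responses are generated from feats_a, so feats_a should score higher
# than the unrelated feats_b.
rng = np.random.default_rng(0)
n_sentences, n_dims, n_channels = 200, 16, 30
feats_a = rng.standard_normal((n_sentences, n_dims))
feats_b = rng.standard_normal((n_sentences, n_dims))
responses = (feats_a @ rng.standard_normal((n_dims, n_channels))
             + rng.standard_normal((n_sentences, n_channels)))

score_a = alignment_score(feats_a, responses)
score_b = alignment_score(feats_b, responses)
```

In a real comparison, `feats_a` and `feats_b` would be sentence embeddings from a matched LLM and VLM given the same text, and the same scoring could be repeated on the subset of sentences rated high in visual semantic content to probe a selective effect.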