Disentangling Visual and Factual Correctness in LVLMs' Visualization Literacy
Soohyun Lee, Jaeyoung Kim, Seokhyeon Park, Sihyeon Lee, Jiwon Song, Bohyoung Kim, Hyunjoo Song, Jinwook Seo
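AI summary: Proposes a framework that separates visual correctness from factual correctness and, through counterfactual tests and arbitration metrics, shows that LVLMs in visualization literacy assessments can rely on factual memory rather than visual reasoning.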
Details
- Comments: Under review at IEEE Transactions on Visualization and Computer Graphics (TVCG). 23 pages, 9 figures
Large Vision-Language Models (LVLMs) show strong visualization interpretation, yet it is unclear whether their responses reflect genuine reasoning over visual evidence or factual priors learned during training. Current evaluations mix these two sources, obscuring cases in which correct visual interpretation is overridden by memorized facts. We present a framework that isolates visual correctness from factual correctness, revealing validity limitations in existing visualization literacy assessments. Across three experiments with 15 state-of-the-art LVLMs, we find the following. (1) Several models reach human-level performance on standard tests (VLAT), but this may reflect factual recall rather than visual understanding, while randomized-data tests (reVLAT) underestimate literacy when correct visual interpretation is superseded by factual priors. (2) Using our Counterfactual Visualization Literacy Assessment Test (CVLAT) with capability-normalized arbitration metrics, we classify models by the sign of their visual-factual reliance index (VFRI), revealing a visualization-oriented majority and a factual knowledge-oriented minority, though several models with near-zero VFRI warrant cautious interpretation. A human baseline (N=30) on the same counterfactual items confirms that people overwhelmingly follow the chart under conflict, providing a human reference point. (3) Prompt-based intervention can shift prioritization, but its effectiveness is highly model-dependent and direction-asymmetric, and high chart-reading capability does not predict prompt-controllability. Overall, high visualization accuracy is not sufficient evidence of faithful visual reasoning: reliable integration into visual analytics requires evaluating not only visualization literacy but also how models arbitrate between visual evidence and factual priors when the two diverge. Benchmark and code: https://github.com/JaeyoungKim-HCIL/CVLAT
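The abstract does not give the formula for VFRI or state which sign corresponds to which orientation, so the following is only a rough illustration of the idea of sign-based classification on counterfactual items, not the authors' metric. It assumes each item carries a chart-consistent answer and a conflicting fact-consistent answer, uses a hypothetical un-normalized index (`toy_reliance_index`, which omits the capability normalization the paper describes), assumes that a positive value means chart-following, and uses an arbitrary `near_zero` band to flag ambiguous cases.

```python
from collections import Counter


def tally_conflict_responses(items):
    """Count how often a response follows the chart, the prior fact, or neither.

    `items` is an iterable of (response, chart_answer, fact_answer) tuples,
    where chart_answer and fact_answer disagree (a counterfactual item).
    This data layout is an assumption made for illustration.
    """
    counts = Counter()
    for response, chart_answer, fact_answer in items:
        if response == chart_answer:
            counts["chart"] += 1
        elif response == fact_answer:
            counts["fact"] += 1
        else:
            counts["other"] += 1
    return counts


def toy_reliance_index(counts):
    """Hypothetical signed index in [-1, 1]; NOT the paper's VFRI.

    Positive values mean more chart-following than fact-following responses.
    No capability normalization is applied here.
    """
    total = counts["chart"] + counts["fact"] + counts["other"]
    if total == 0:
        raise ValueError("at least one item is required")
    return (counts["chart"] - counts["fact"]) / total


def classify_by_sign(index, near_zero=0.05):
    """Label a model by the sign of the index; the tolerance is an arbitrary choice."""
    if abs(index) <= near_zero:
        return "near-zero (interpret with caution)"
    if index > 0:
        return "visualization-oriented"
    return "factual knowledge-oriented"


# Made-up responses to four conflict items, for demonstration only.
example_items = [
    ("30", "30", "80"),  # follows the chart
    ("80", "30", "80"),  # follows the memorized fact
    ("30", "30", "80"),  # follows the chart
    ("55", "30", "80"),  # matches neither
]
example_counts = tally_conflict_responses(example_items)
example_index = toy_reliance_index(example_counts)
example_label = classify_by_sign(example_index)
```

With the made-up data above, two responses follow the chart, one follows the fact, and one matches neither, so the toy index is 0.25 and the label is "visualization-oriented". The tolerance band is there only to mirror the abstract's caution about near-zero cases; a real analysis would rely on the definitions in the paper and its released code.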