VLMs Trace Without Tracking: Diagnosing Failures in Visual Path Following
Hyesoo Hong, Minsoo Kim, Wonje Jeung, Sangyeon Yoon, Dongjae Jeon, Albert No
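AI summary: Studies how VLMs perform on visual path-following tasks and finds that they tend to switch paths when faced with locally similar distractors, identifying local competition as the cause of these failures.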
Vision-language models (VLMs) achieve strong performance on multimodal benchmarks, but may still lack robust control over basic visual operations. We study line tracing, where a model must follow a selected visual path through successive local continuations. To isolate this ability, we design controlled tracing tasks that introduce nearby competitors while reducing semantic and topological ambiguity such as crossings and overlaps. Across these tasks, even state-of-the-art VLMs frequently lose the target path and switch to nearby alternatives, especially when those alternatives look locally similar to the target. Behavioral interventions and internal analyses indicate that these failures arise from local competition: nearby similar distractors pull the model away from the true continuation. Standard remedies do not remove this bottleneck: model-size scaling provides only limited gains, reasoning partially compensates through costly substitute strategies, and explicit tracing instructions fail to recover stable path following. Finally, tests on tangled-cable scenes and metro maps with richer visual complexity show that the same path-switching failure persists beyond our controlled settings.