arXivDaily: arXiv daily academic digest, updated Monday to Friday

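AI Large Models

Large Vision Models / VLM

Vision-language models, visual reasoning, visual question answering, image-text understanding, and visual grounding.

Indexed today/current date: 1 Signal sources: cs.CV, cs.AI, cs.LG
2606.18101 2026-06-18 cs.AI New submission 90%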

Trust the Right Teacher: Quality-Aware Self-Distillation for GUI Grounding

Jingyuan Huang, Zuming Huang, Yucheng Shi, Tianze Yang, Xiaoming Zhai, Wei Chu, Ninghao Liu

Affiliations * University of Georgia, INFLY Tech, Tencent AI Lab, The Hong Kong Polytechnic University

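Topic match: Visual grounding: self-distillation improves the GUI grounding ability of VLMs

AI summary: Proposes a quality-aware self-distillation method that improves the quality of coordinate-token teacher signals through soft correctness-aware gating and teacher-probability scaling, raising VLM performance on GUI grounding tasks.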

Comments: corrected some claims

Details
English abstract

Graphical user interface (GUI) grounding requires vision-language models (VLMs) to identify small target elements in high-resolution screenshots and predict precise screen coordinates. On-policy self-distillation (OPSD) is a promising post-training approach for this coordinate-sensitive task, since it provides dense token-level teacher signals beyond hard coordinate labels. However, naive OPSD is not well suited to GUI grounding: because OPSD evaluates the teacher on student-generated prefixes, the quality of coordinate-token teacher signals can degrade when the prefix has already deviated from the target coordinate, leading to unreliable teacher signals. To mitigate this, we propose quality-aware self-distillation for VLM-based GUI grounding, which improves coordinate-token teacher-signal quality through soft correctness-aware gating and teacher-probability scaling. The soft correctness-aware gate checks whether the teacher's current coordinate-token prediction can still be completed into the ground-truth box under the student-generated prefix. If not, the corresponding teacher signal is down-weighted. Teacher-probability scaling then uses the teacher's confidence as a lightweight factor to further calibrate the strength of the gated supervision. A key empirical finding is that neither component alone improves overall performance, whereas combining them consistently improves performance. This suggests that the two mechanisms play complementary roles: correctness-aware gating suppresses unreliable coordinate-token supervision, while teacher-probability scaling calibrates the strength of the remaining signals. Experiments across six GUI grounding benchmarks show that our method consistently improves the base model and outperforms strong baselines.
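
The abstract does not give formulas, so the following is only a toy sketch of how the two mechanisms might combine into a per-token weight. It is not the authors' implementation. Everything specific here is an assumption made for illustration: coordinates are treated as fixed-width, zero-padded digit strings along one axis, the gate falls back to a hypothetical `soft_floor` value instead of zero, the gate and the teacher probability are simply multiplied, and the function names `can_complete`, `token_weight`, and `weighted_distill_loss` are invented.

```python
from typing import Sequence


def can_complete(prefix: str, next_digit: str, lo: int, hi: int, width: int = 4) -> bool:
    """Return True if some width-digit number starting with prefix + next_digit
    lies inside the ground-truth interval [lo, hi] (one axis of the box).

    Assumption: coordinates are emitted as zero-padded digit strings.
    """
    head = prefix + next_digit
    remaining = width - len(head)
    if remaining < 0:
        raise ValueError("prefix plus next digit exceeds the coordinate width")
    smallest = int(head) * 10 ** remaining       # pad the rest with 0s
    largest = smallest + 10 ** remaining - 1     # pad the rest with 9s
    return smallest <= hi and largest >= lo


def token_weight(prefix: str, teacher_digit: str, teacher_prob: float,
                 lo: int, hi: int, width: int = 4, soft_floor: float = 0.1) -> float:
    """Weight for one coordinate token's teacher signal.

    Gate (assumed form): 1.0 if the teacher's digit can still reach the box
    under the student-generated prefix, else soft_floor (down-weighted, not
    dropped). Scaling (assumed form): multiply by the teacher's probability
    for that digit.
    """
    gate = 1.0 if can_complete(prefix, teacher_digit, lo, hi, width) else soft_floor
    return gate * teacher_prob


def weighted_distill_loss(per_token_loss: Sequence[float],
                          weights: Sequence[float]) -> float:
    """Average of per-token distillation losses after applying the weights."""
    if len(per_token_loss) != len(weights):
        raise ValueError("per_token_loss and weights must have the same length")
    if not per_token_loss:
        return 0.0
    return sum(w * l for w, l in zip(weights, per_token_loss)) / len(per_token_loss)
```

With a ground-truth interval of [512, 540], a student prefix of `"0"` and a teacher digit of `"5"` can still reach the box, so the weight is just the teacher probability. With prefix `"06"` and teacher digit `"1"`, every completion falls in 610 to 619, so the signal is scaled down by `soft_floor` before the probability factor is applied.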