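arXivDaily: arXiv daily academic digest, updated Monday through Friday

AI Large Models

Large Vision Models / VLM

Vision-language models, visual reasoning, visual question answering, image-text understanding, and visual grounding.

Papers indexed for today / the current date: 20. Signal sources: cs.CV, cs.AI, cs.LG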
2606.20515 2026-06-19 cs.CV New submission 90%

S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

Yalun Dai, Hao Li, Shulin Tian, Runmao Yao, Yuhao Dong, Fangzhou Hong, Zhaoxi Chen, Fangfu Liu, Baoliang Tian, Dingwen Zhang, Tao Wang, Kim-Hui Yap, Ziwei Liu

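Affiliations * NTU (Nanyang Technological University); THU (Tsinghua University); ByteDance; NWPU (Northwestern Polytechnical University)

Topic match Visual reasoning: uses a VLM as a semantic planner for spatial reasoning

AI summary Proposes S-Agent, a spatial tool-use agentic paradigm that uses the VLM as a semantic planner and combines spatio-temporal evidence accumulation with a hierarchical toolset to reason spatially over continuous multi-view images and videos. It improves open-source and closed-source VLMs without training, and fine-tuning on the S-300K trajectories yields the compact spatial agent S-Agent-8B.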

Comments Project Page : https://Ropedia.github.io/S-Agent

Abstract

Real-world spatial intelligence requires reasoning over a continuous and evolving 3D world, yet existing VLMs and tool-augmented agents largely remain tied to static, stateless inference from isolated visual observations. We introduce S-Agent, a spatial tool-use agentic paradigm for understanding and reasoning over continuous multi-view images and videos. By formulating spatial reasoning as spatio-temporal evidence accumulation rather than isolated frame-level prediction, S-Agent reshapes spatial perception into scene-centric understanding beyond frame-centric recognition. Specifically, S-Agent casts the VLM as a semantic planner that decides what evidence is needed, while a hierarchy of spatial tools and experts grounds objects in 2D, lifts them into 3D geometric evidence, and aggregates this evidence into high-level spatial knowledge (e.g., counting, measurement, orientation, and relative position). Additionally, a temporal memory mechanism, including Scene Memory for maintaining the evolving scene state and Agent Memory for accumulating reasoning context, enables evidence integration across frames and reasoning steps. Comprehensive experiments on multi-view and video spatial reasoning benchmarks show that S-Agent consistently improves both open-source and closed-source VLMs in a training-free manner. Beyond inference-time augmentation, supervised fine-tuning (SFT) on S-Agent-generated spatial trajectories S-300K yields S-Agent-8B, a compact spatial agent that significantly surpasses similar-scale baselines (e.g., Qwen3-VL-8B) and performs comparably to advanced closed-source models (e.g., GPT-5.4 and Gemini 3).
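The pipeline in this abstract (ground objects in 2D, lift them to 3D, aggregate across frames) can be illustrated with a small sketch. It assumes a pinhole camera with known intrinsics, a metric depth per detection, and points already expressed in one shared world frame. The names `backproject` and `SceneMemory` and the 0.5 m merge radius are hypothetical and are not taken from the paper.

```python
import math

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a 3D camera-frame point (pinhole model)."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

class SceneMemory:
    """Accumulates object observations across frames, merging repeats of one object."""

    def __init__(self, merge_radius=0.5):
        self.merge_radius = merge_radius  # assumed threshold, in metres
        self.objects = []  # entries: {"label", "position", "hits"}

    def add(self, label, position):
        for obj in self.objects:
            same_label = obj["label"] == label
            if same_label and math.dist(obj["position"], position) <= self.merge_radius:
                # Same object seen again: update the running mean position.
                n = obj["hits"]
                obj["position"] = tuple(
                    (p * n + q) / (n + 1) for p, q in zip(obj["position"], position)
                )
                obj["hits"] = n + 1
                return
        self.objects.append({"label": label, "position": tuple(position), "hits": 1})

    def count(self, label):
        """Counting query answered from the accumulated scene state."""
        return sum(1 for obj in self.objects if obj["label"] == label)
```

Because repeated sightings are merged, a counting question is answered from the scene state rather than from any single frame, which is the scene-centric idea the abstract describes.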

2606.19776 2026-06-19 cs.CV New submission 90%

Occ-VLM: Occupancy Grounded Vision Language Model for Indoor Scene Understanding

Jianing Li, Zhou Fang, Yijiang Liu, Li Du

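Affiliations * School of Electronic Science and Engineering, Nanjing University

Topic match Visual reasoning: an occupancy-grounded VLM for indoor 3D scene understanding.

AI summary Proposes Occ-VLM, which uses only posed RGB images and a single 2D vision encoder and reconstructs 3D occupancy as a geometric prior for unified 3D scene understanding. It reaches state-of-the-art multi-view occupancy prediction and performs on par with 3D-input VLMs on 3D VQA and dense captioning.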

Abstract

Recently, vision-language models (VLMs) have made significant progress in 3D scene understanding, driving advances in applications such as embodied intelligence and robotic vision. However, existing approaches typically either rely directly on explicit 3D inputs (e.g., point clouds or RGB-D sequences), or introduce an additional 3D geometry encoder to derive 3D-aware visual tokens from 2D images. Such designs structurally decouple 3D geometric perception from the rich 2D semantics learned via vision-language pre-training, hindering the development of a unified 3D vision-language representation. In this work, we propose Occ-VLM, a novel framework for 3D scene understanding that operates purely on posed RGB images and employs a single 2D vision encoder. Specifically, Occ-VLM reconstructs 3D scene occupancy as an auxiliary geometric prior, which is utilized to spatially associate foreground 2D tokens with 3D space. These tokens are then decoded by a Large Language Model (LLM) for unified scene understanding. Extensive experiments demonstrate that Occ-VLM achieves both accurate geometric perception and robust vision-language reasoning: it attains state-of-the-art performance on multi-view occupancy prediction, while performing on par with 3D-input VLMs on 3D Visual Question Answering (VQA) and 3D dense captioning benchmarks.

2606.19552 2026-06-19 cs.CL New submission 90%

LaViSA: A Language and Vision Structural Ambiguity Benchmark

Lee Sangmyeong, Shun Inadumi, Koichiro Yoshino

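Affiliations * Nara Institute of Science and Technology; Guardian Robot Project, RIKEN; The University of Osaka

Topic match Visual reasoning: evaluates how well VLMs use visual scenes to resolve structural ambiguity

AI summary Proposes the LaViSA benchmark, which pairs ambiguous sentences from seven ambiguity categories with corresponding images to evaluate how well vision-language models use visual scenes to resolve structural ambiguity. Experiments show that current models make partial use of visual information but still struggle with certain ambiguity types and subtle semantic distinctions.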

Abstract

Structural ambiguity arises when a single sentence admits multiple valid interpretations due to its syntactic structure, posing a fundamental challenge for language understanding. Visual scenes serve as useful cues for resolving such ambiguity, and Vision and Language Models (VLMs) need to be capable of deriving possible semantic interpretations from visual scenes. We introduce Language and Vision Structural Ambiguity (LaViSA), a benchmark designed to evaluate the ability of VLMs to resolve structural ambiguity by leveraging visual scenes. LaViSA consists of ambiguous sentences, their disambiguated sentences, and corresponding images of these disambiguated sentences across seven ambiguity categories. Using LaViSA, we conduct a comprehensive evaluation of diverse VLMs, including both proprietary and open-source models with varying parameter scales and reasoning capabilities. Experimental results show that although recent VLMs can leverage visual scenes to resolve structural ambiguity to some extent, they still struggle with certain ambiguity types and visually subtle semantic distinctions, indicating remaining limitations in resolving structural ambiguity using visual scenes.

2606.20527 2026-06-19 cs.CL cs.CV New submission 85%

StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs

Shaghayegh Kolli, Timo Cavelius, Nafiseh Nikeghbal, Samantha Dalal, Jana Diesner

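Affiliations * Technical University of Munich; Munich Center for Machine Learning; Princeton Center for Information and Technology Policy

Topic match Visual reasoning: evaluates social bias in MLLMs caused by visual cues

AI summary Proposes the StylisticBias benchmark, which varies one visual attribute at a time. It finds that age and body type dominate identity-level bias, while about 15 attributes, fashion style among them, explain nearly 80% of the variation in bias, so bias is concentrated in a small set of visual cues.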

Comments Accepted to the non-archival workshops AI4Good and Culture x AI at ICML 2026

Abstract

Multimodal large language models (MLLMs) are increasingly deployed in personally and societally consequential settings, yet the visual cues that shape how these models judge people remain poorly understood. Prior work often compares different (groups of) individuals, making it difficult to separate appearance effects from identity differences. We introduce StylisticBias, a controlled benchmark for evaluating attribute-level social bias in MLLMs. We generate 500 photorealistic base faces and create about 50 single-attribute variations per face, producing about 25K images. This design keeps identity fixed and changes one visual attribute at a time. It lets us measure how specific cues shift model judgments. We evaluate six MLLMs across 25 binary social judgment scenarios. We find that age and body type dominate identity-level effects, while fashion style and other visual cues drive the largest attribute-level shifts. We further find that about 15 attributes account for nearly 80% of the total variation, showing that bias is concentrated in a small set of visual cues. Sensitivity is strongest in judgments that are semantically aligned with appearance, especially socioeconomic and style-related judgments. We release StylisticBias as a benchmark for fine-grained bias evaluation in multimodal models. Code and dataset: https://github.com/timo-cavelius/StylisticBias and https://hf.co/datasets/shaghayegh/stylistic-bias-dataset.
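The concentration claim in this abstract (about 15 attributes accounting for nearly 80% of the variation) corresponds to a cumulative-share computation, sketched below. The function name and the toy effect sizes in the usage note are made up for illustration; they are not the paper's data, and the paper's exact variance decomposition may differ.

```python
def attributes_covering(effects, share=0.8):
    """Return the fewest attributes, largest effect first, whose summed
    effect reaches `share` of the total effect."""
    total = sum(effects.values())
    ranked = sorted(effects.items(), key=lambda item: item[1], reverse=True)
    chosen, running = [], 0.0
    for name, value in ranked:
        chosen.append(name)
        running += value
        if running >= share * total:
            break
    return chosen
```

With the invented effects `{"style": 5.0, "age": 3.0, "hair": 1.0, "glasses": 1.0}`, the two largest attributes already reach 80% of the total, so the function returns `["style", "age"]`.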

2606.20419 2026-06-19 cs.CV New submission 85%

Spectral Query-Key Product Weight Steering for Training-Free VLM Hallucination Mitigation

Karn Tiwari, Varnith Chordia, Prathosh A P

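Affiliations * Indian Institute of Science, Bengaluru; Snap Research

Topic match Visual reasoning: a training-free method that reduces object hallucination in VLMs and improves visual reasoning

AI summary Proposes QK Product Steering, a data-free, training-free, zero-inference-cost weight edit that reduces object hallucination by suppressing dominant singular modes in middle layers, lowering CHAIR$_s$ by an average of 4.0% (relative) across three GQA-based VLMs.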

Comments Under Review

Abstract

Vision-language models (VLMs) often generate fluent but visually unsupported descriptions, especially by mentioning objects absent from the image. We propose QK Product Steering, a data-free, training-free, and zero-inference-cost weight edit for reducing object hallucination. The method directly edits the per-head query-key product, the operator that produces pre-softmax attention logits, by suppressing a small number of dominant singular modes in selected middle layers. The edited product is then mapped back to the query weights through a closed-form query-only update while keeping shared key weights fixed, making the edit compatible with grouped-query attention. We further decompose the QK product into symmetric and antisymmetric components to distinguish mutual content-similarity patterns from directional attention patterns. Across three GQA-based VLMs, QK Product Steering achieves an average relative CHAIR$_s$ reduction of $4.0\%$, while matched random-mode controls show negligible change. Interpretability ablations show that the hallucination signal is specific to dominant QK modes and is primarily localized to the symmetric mutual-attention channel. Overall, QK Product Steering offers a simple alternative to decoding-time mitigation, requiring no additional data, fine-tuning, or inference-time overhead while largely preserving general multimodal capability.
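As a rough illustration of the weight edit described in this abstract, the sketch below damps the single dominant singular mode of a per-head product M = W_q W_k^T and folds the change into W_q alone. It is a sketch under stated assumptions: the function names, the damping factor `alpha`, the use of power iteration, the restriction to one mode, and the projection form of the update are all chosen for illustration, since the abstract does not spell out the paper's closed form.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def top_left_singular_vector(m, iters=200):
    """Power iteration on M M^T; returns the dominant left singular vector u.
    Assumes M is nonzero and its top singular value is unique."""
    mmt = matmul(m, transpose(m))
    u = [1.0] * len(m)
    for _ in range(iters):
        u = [sum(r * x for r, x in zip(row, u)) for row in mmt]
        norm = math.sqrt(sum(x * x for x in u))
        u = [x / norm for x in u]
    return u

def steer_query_weights(w_q, w_k, alpha=0.5):
    """Damp the top singular mode of M = W_q W_k^T by editing W_q only.

    Because u^T M = sigma * v^T, we have
    M - alpha * sigma * u v^T = (I - alpha * u u^T) W_q W_k^T,
    so W_q' = W_q - alpha * u (u^T W_q) realizes the edit with W_k untouched.
    """
    u = top_left_singular_vector(matmul(w_q, transpose(w_k)))
    rows, cols = len(w_q), len(w_q[0])
    ut_wq = [sum(u[i] * w_q[i][j] for i in range(rows)) for j in range(cols)]
    return [[w_q[i][j] - alpha * u[i] * ut_wq[j] for j in range(cols)]
            for i in range(rows)]
```

Leaving `w_k` unchanged is what makes an edit of this shape compatible with grouped-query attention, where several query heads share one key projection.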

2606.20364 2026-06-19 cs.LG New submission 85%

Judging to Improve: A De-biased VLM-as-3D-Judge Protocol for Single-Image 3D Generation

Ali Asaria, Tony Salomone, Deep Gandhi

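Affiliations * Transformer Lab

Topic match Visual reasoning: a VLM as a judge for optimizing 3D generation

AI summary Proposes a de-biased, cross-model VLM-as-3D-Judge protocol that takes the judge from ranking to optimization by keeping the training and evaluation judges separate, correcting position bias, and fixing three failure modes. With lightweight adaptation, the methods match the strong base model rather than exceed it.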

Abstract

A companion study established a de-biased, cross-model VLM-as-3D-judge that reliably ranks single-image-to-3D mesh quality where cheap geometry and CLIP proxies fall short. This paper asks: can that judge's preferences specialize a strong open generator, TRELLIS, on one asset class (furniture), cheaply and without human labels? Taking the judge from ranking to optimization is where the work lives. Pushing a VLM judge into the training and evaluation loop exposes failure modes ranking never triggered, so our contribution is an optimization-grade hardening of the judge: a training judge (Qwen2.5-VL-7B) held distinct from an evaluation judge (InternVL3-8B) to break circularity; position-bias correction; and fixes for three failure modes (image overload, geometry-hiding splat renders, and reference-free judging that rewards clean-but-wrong outputs), with calibration evidence (clear-gap win-rate 0.83-1.0; base-vs-base ~0.5). Using this protocol as an independent evaluator, and working only from public models and data with lightweight parameter-efficient adaptation, we find our methods match the strong base rather than exceed it. Independent base samples carry essentially no learnable preference (0.94 order-flip rate), so signal must be engineered by quality-contrastive construction. Across six adaptation methods, two input regimes, and a severity sweep, the most targeted - conditioner repair under severe degradation - reaches parity (0.50) with the base, while no method clears the >=65% win-rate target. The result is mechanistic: clean inputs saturate the judge, flow-DIT fine-tuning washes out through the sampler, and conditioning repair is the locus that moves geometry. Win-rates are directional at n=8 objects. Matching a strong public-data base with cheap adaptation is itself informative: exceeding it needs more than lightweight PEFT on public data, and the judge protocol is reusable.
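One common way to correct position bias in a pairwise judge is to query it in both presentation orders and keep only verdicts that survive the swap; a minimal sketch follows. The `judge` callable (returning "first" or "second") and the definition of the order-flip rate used here are assumptions for illustration and may differ from the paper's protocol.

```python
def debiased_preference(judge, a, b):
    """Query the judge in both presentation orders; keep only consistent verdicts."""
    forward = judge(a, b)    # "first" means a is preferred
    backward = judge(b, a)   # "first" means b is preferred
    if forward == "first" and backward == "second":
        return "a"
    if forward == "second" and backward == "first":
        return "b"
    return "flip"  # the verdict followed the position, not the content

def summarize(judge, pairs):
    """Order-flip rate over all pairs, and win rate of `a` over decided pairs."""
    verdicts = [debiased_preference(judge, a, b) for a, b in pairs]
    flips = verdicts.count("flip")
    decided = len(verdicts) - flips
    return {
        "order_flip_rate": flips / len(verdicts),
        "win_rate_a": verdicts.count("a") / decided if decided else None,
    }
```

Under this definition, a high flip rate means the pairs carry little usable preference signal, which is how the abstract reads its 0.94 figure for independent base samples.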

2606.20244 2026-06-19 cs.CV cs.AI New submission 85%

SPOT-E: Test-Time Entropy Shaping with Visual Spotlights for Frozen VLMs

Bo Yin, Xiaobin Hu, Chengming Xu, Ruolin Shen, Mo Yang, Jiangning Zhang, Peng-Tao Jiang, Cheng Tan, Shuicheng YAN

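Affiliations * National University of Singapore; Fudan University; Technical University of Munich; Sagenic Tech; Zhejiang University; vivo; Shanghai Artificial Intelligence Laboratory

Topic match Visual reasoning: test-time entropy shaping improves evidence grounding in VLMs

AI summary Proposes SPOT-E, which uses test-time entropy shaping and visual spotlights to address VLMs' poor performance on evidence-intensive tasks caused by overlooking small, localized key evidence, improving grounding and robustness without retraining.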

Abstract

Vision-language models (VLMs) often underperform on evidence-intensive tasks because decisive visual evidence is small, localized, and easy to overlook, leading to failures in evidence readout even when high-level reasoning is intact. Prior inference-time visual interventions can improve grounding without retraining, but they are largely open-loop and lack a mechanism to verify whether highlighted evidence is actually used. We study answer-span prediction entropy as a model-internal feedback signal and show that naive entropy minimization is ambiguous, since low entropy may arise from evidence-grounded confidence or shortcut collapse. To resolve this ambiguity, we introduce low-entropy anchors and an entropy-shaping objective that reduces answer uncertainty while preserving baseline high-confidence tokens. We instantiate this principle in SPOT-E, a plug-and-play test-time method that produces question-conditioned spotlights, optimized per instance via lightweight tuning based on Group Relative Policy Optimization (GRPO). Across all benchmarks and different VLM families, SPOT-E yields consistent gains and improved robustness under visual corruptions. Code is publicly available at: https://github.com/YinBo0927/SPOT-E
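To make the feedback signal concrete, the sketch below computes answer-span prediction entropy and one possible shaped objective that penalizes lost confidence at positions that were already confident at baseline. The anchor threshold, the penalty weight, and the form of `shaped_objective` are assumptions for illustration; the abstract does not give the paper's actual objective.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def answer_span_entropy(span_probs):
    """Mean entropy over the token positions of the answer span."""
    return sum(token_entropy(p) for p in span_probs) / len(span_probs)

def shaped_objective(base_span, new_span, anchor_threshold=0.1, penalty=1.0):
    """Lower is better: mean span entropy after the intervention, plus a penalty
    whenever a baseline low-entropy position (an anchor) becomes less certain."""
    loss = answer_span_entropy(new_span)
    for base_p, new_p in zip(base_span, new_span):
        h_base, h_new = token_entropy(base_p), token_entropy(new_p)
        if h_base <= anchor_threshold:
            loss += penalty * max(0.0, h_new - h_base)
    return loss
```

Plain entropy minimization would score any confident answer equally well; the anchor term is one way to discourage an intervention from trading away confidence the model already had.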

2606.20077 2026-06-19 cs.CV cs.AI New submission 85%

The Hidden Evolution of Disguised Visual Context inside the VLM

Wish Suharitdamrong, Tony Alex, Muhammad Awais, Sara Atito

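Affiliations * Surrey Institute for People-Centred AI, University of Surrey; Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey

Topic match Visual reasoning: analysis of how visual tokens evolve inside VLMs

AI summary Studies how visual tokens in vision-language models are transformed into meaningful representations under different integration architectures (in-context versus layer-wise injection), revealing their internal evolution and its effect on performance.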

Abstract

Visual tokens enter Large Language Models (LLMs) as raw, foreign signals. How they are transformed into meaningful representations and interact with the language space depends entirely on the integration architecture: whether visual tokens are treated as in-context prompts within the input sequence or injected directly into the LLM's intermediate layers. How these architectural choices affect visual information and its internal transformation to integrate with the LLM remains underexplored, and a controlled comparison is lacking. We provide a fair comparison by evaluating in-context and layer-wise injection VLM integration paradigms under identical training conditions across single-image, multi-image, and video benchmarks. In doing so, we uncover a hidden evolution in which visual tokens enter the LLM as disguised visual context (raw representations lacking linguistic structure) but are progressively reshaped depending on the integration paradigm, with each paradigm capturing fundamentally different frequency characteristics of the visual signal. We show that this evolution inside the LLM determines what visual features the VLM can utilize effectively, how visual representations align with the language space, and ultimately how each paradigm performs across different tasks. We further demonstrate that attention allocation alone is insufficient, and that performance is driven by the quality of visual representations at each layer.

2606.20045 2026-06-19 cs.CV cs.AI New submission 85%

See-and-Reach: Precise Vision-Language Navigation for UAVs within the Field of View

Fanfu Xue, En Yu, Yantian Shen, Zhikun Hu, Hongjun Wang, Yang Yang, Xindi Wang, Jiande Sun

Affiliations * School of Information Science and Engineering, Shandong University; Faculty of Engineering and Information Technology, University of Technology Sydney; School of Computer Science and Technology, Shandong University; School of Artificial Intelligence, Shandong University; School of Computer Science and Artificial Intelligence, Shandong Normal University; Interdisciplinary Research Center of General Artificial Intelligence, Shandong Normal University

Topic match Visual reasoning: proposes a vision-language navigation framework for precise target reaching by UAVs.

AI summary UAV vision-language navigation rarely evaluates how precisely a UAV reaches a target once the target is visible. This work proposes the UAV-VLN-FOV task and the 3DG-VLN framework, which uses dynamic 3D direction cues to strengthen fine-grained visual grounding and spatial alignment. It improves success rate by 13.82% over competitive baselines on the benchmark, and real-world trials show its practical potential.

Comments 12 pages, 7 figures

Abstract

UAV Vision-Language Navigation (UAV-VLN) is typically formulated as a holistic search-and-reach problem, where long-range target discovery and final target approach are optimized and evaluated jointly. This formulation makes it difficult to assess a critical capability of aerial embodied agents, namely whether a UAV can accurately ground a visible target and translate vision-language evidence into precise 3D motion once the target enters its field of view. To address this limitation, we introduce UAV-VLN-FOV, a target-visible navigation task that isolates the see-and-reach stage and enables a more diagnostic evaluation of terminal reaching ability. We further propose 3DG-VLN, a vision-language waypoint prediction framework guided by dynamic 3D direction cues to enhance fine-grained visual grounding and spatial direction alignment for precise target reaching. Specifically, 3DG-VLN adaptively processes high-resolution front-view and downward-view observations to preserve fine-grained visual and geometric details for target grounding. It also updates the target-relative direction online during closed-loop navigation, allowing the agent to maintain spatial alignment with the target and reduce accumulated direction drift. To support this task, we construct a dedicated high-resolution benchmark which contains 2,717 trajectories with target-oriented high-level instructions, high-resolution front-view and downward-view egocentric observations, and continuous 3D waypoint annotations. Experiments show that 3DG-VLN outperforms competitive UAV-VLN baselines, achieving a 13.82% improvement in success rate. Real-world trials further demonstrate the potential of 3DG-VLN for practical see-and-reach navigation. The source code and benchmark are available at https://github.com/xuefanfu/3DG-VLN.
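The online update of the target-relative direction can be pictured with a purely geometric sketch: recompute the unit vector to the target at every step instead of committing to the initial heading. This illustrates only the direction cue, not the learned waypoint predictor; the function names, step size, and tolerance are hypothetical, and the target position is assumed known here.

```python
import math

def target_direction(agent_pos, target_pos):
    """Unit vector pointing from the agent to the target; None if they coincide."""
    delta = [t - a for a, t in zip(agent_pos, target_pos)]
    norm = math.sqrt(sum(d * d for d in delta))
    if norm == 0.0:
        return None
    return tuple(d / norm for d in delta)

def fly_to_target(start, target, step=1.0, tolerance=0.5, max_steps=100):
    """Closed loop: re-estimate the direction at every step, then move one waypoint."""
    pos = tuple(start)
    for _ in range(max_steps):
        remaining = math.dist(pos, target)
        if remaining <= tolerance:
            break
        direction = target_direction(pos, target)
        stride = min(step, remaining)  # do not overshoot the target
        pos = tuple(p + stride * d for p, d in zip(pos, direction))
    return pos
```

Recomputing the direction from the current position means an error in one step is corrected at the next, rather than accumulating as drift.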

2606.19965 2026-06-19 cs.CV cs.AI New submission 85%

ROSE: Benchmarking the Perception-to-Action Gap in Multimodal Models

Yihao Wang, Zijian He, Jie Ren, Keze Wang

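Affiliations * Sun Yat-sen University; Shaanxi Normal University

Topic match Visual reasoning: proposes the ROSE benchmark to test the perception-to-action gap in MLLMs

AI summary Proposes the ROSE benchmark, which holds the visual scene fixed while varying region constraints and symbolic outputs to test whether multimodal large models can turn the same visual evidence into the action each context requires. Model performance drops by as much as 44.5 percentage points, exposing a perception-to-action bottleneck.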

Comments 29 pages, 11 figures

Abstract

Multimodal large language models (MLLMs) are increasingly expected to act on visual information, yet the same scene may require different actions under different task contexts. How reliably can a model turn the same visual evidence into the action required by the current context? To answer this question, we introduce ROSE (Reference-conditioned Oddity and Symbolic Execution), a controlled benchmark that holds the visual scene fixed while varying region constraints and required symbolic outputs. Through coupled counting and coordinate-action tasks, ROSE tests whether models can infer an implicit majority reference and act on the resulting fine-grained visual evidence under changing contexts. Across nine recent MLLMs, performance drops by as much as 44.5 percentage points from counting-oriented tasks to region-conditioned action, despite 98.8% human performance. The gap persists on paired scenes and regions for which the same model returns the correct count, while global-click and matched local controls show that coordinate grounding explains only part of the loss, revealing a distinct, model-dependent bottleneck in turning shared visual evidence into context-specific actions.

2606.19944 2026-06-19 cs.CV New submission 85%

Timage: A Generative Text-in-Image Paradigm for Fine-Tuning Vision-Language Models

Yifeng Wu, Huimin Huang, Ruiluo Wu, Chunyi Lin, Guanhua Chen, Xian Wu, Wang Song, Ruize Han

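Affiliations * Fudan University; Shenzhen University of Advanced Technology; Tencent Jarvis Lab; Southern University of Science and Technology

Topic match Visual reasoning: proposes a text-in-image paradigm that improves fine-grained spatial reasoning in VLMs

AI summary Proposes the Timage paradigm, which uses a Constrained Schrödinger Bridge to embed the query text in the image as a typeset overlay, giving the model explicit spatial anchors and improving fine-grained spatial reasoning without eroding the backbone's capabilities.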

Comments ECCV

Abstract

Multimodal Large Language Models (MLLMs) often lose track of the right image regions during fine-grained spatial reasoning, because a textual query rarely carries any explicit geometric anchor into the pixel domain. Prevailing remedies either rewire the model's weights or pad the prompt with verbose instructions, yet neither reliably pins the language to the correct visual coordinates without eroding the backbone's general competence. We introduce Timage, a paradigm that recasts multimodal understanding as an alignment problem solved at the input: the query is drawn, as a typeset overlay, onto the image itself. The placement and appearance of this overlay are produced by a Constrained Schrödinger Bridge (cSB), an entropic optimal-transport sampler that factorizes layout synthesis into two coupled stochastic stages. The first stage, Region Search, transports noise toward query-aligned image zones while obeying a hard occlusion barrier that protects salient foreground content; the second stage, Appearance Shaping, sizes the glyphs through an "ink-budget" regularizer so that the rendered text stays legible and visually balanced. The resulting overlay behaves as an explicit attention beacon that channels the model's focus along spatial semantics. On the VMCBench suite, Timage paired with a modest 7B backbone clearly overtakes far larger proprietary systems as well as parameter-tuned baselines. The study positions deliberate input reconstruction as a powerful, architecture-neutral lever for strengthening multimodal reasoning.

2606.19828 2026-06-19 cs.CV New submission 85%

3D-PLOT-LLM: Part-Level Object Tokens for 3D Large Language Models

Jintang Xue, Xinyu Wang, Yixing Wu, Jingwen Chen, C. -C. Jay Kuo

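Affiliations * University of Southern California; Ohio State University

Topic match Visual reasoning: a 3D multimodal large model that supports part-level object tokens and reasoning.

AI summary Proposes 3D-PLOT-LLM, which reorganizes the input token stream so that parts are directly addressable through the LLM's vocabulary, with no segmentation decoder or bounding boxes, and outperforms prior methods on most part-level benchmark metrics.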

Abstract

3D multimodal large language models (3D MLLMs) describe a 3D object as a whole but cannot address, name, or reason about its parts. Prior part-aware attempts add segmentation decoders, heavier 3D encoders, or bounding-box grammars at substantial parameter cost. We take a fundamentally different path: we reorganize the input token stream so that parts become directly addressable through the LLM's own vocabulary. Our model, 3D-PLOT-LLM, partitions the frozen point encoder's patches into K locally coherent regions and inserts, before each region's patch tokens, a learnable per-region marker and a reserved vocabulary token <part_k>; a Marker-Space Refinement (MSR) module then conditions each marker on its region's spatial statistics and adjacency neighbors. The model thus cites parts in its output and follows prompts that refer to parts by token, a capability absent from prior object-level 3D MLLMs. To probe this interface, we construct PartVerse-QA, a vocabulary-level part-QA benchmark adapted from PartVerse mesh annotations (77K training pairs and 588 held-out queries on disjoint object splits), on which 3D-PLOT-LLM reaches caption-to-slots Jaccard 0.459 and Exact-match 13.78%, with a slot-to-caption GPT-4o judge of 44.68. On the 3DCoMPaT-GrIn part-aware grounded description benchmark, 3D-PLOT-LLM outperforms PointLLM, Kestrel, PARIS3D, and SegPoint on every text-output metric, and ShapeLLM on 3 of 4, with up to +3.03 GPT-4o judge over PointLLM. On Objaverse whole-object captioning, adding PartVerse-QA at Stage 2 yields +0.65 SBERT and +1.85 GPT-4o over PointLLM, and tops PointLLM-PiSA on 4 of 5 traditional metrics (SBERT, SimCSE, BLEU-1, METEOR) despite targeting a different (part-grounded) objective. All of this is achieved with under 1M new trainable parameters on a frozen point encoder, an order of magnitude below prior part-aware 3D MLLMs, and with no segmentation decoder or bounding-box head.
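The token-stream reorganization in this abstract can be sketched as a simple interleaving step, together with the Jaccard overlap used to score predicted part slots. String tokens stand in for embeddings, and `build_part_token_stream`, the `[marker_k]` placeholder, and the region assignment passed in are hypothetical; the learnable markers and the MSR module are not modeled.

```python
def build_part_token_stream(patch_tokens, region_ids, num_regions):
    """Group patch tokens by region and prefix each group with a per-region
    marker placeholder and the reserved vocabulary token <part_k>."""
    stream = []
    for k in range(num_regions):
        members = [tok for tok, r in zip(patch_tokens, region_ids) if r == k]
        if not members:
            continue  # skip regions that received no patches
        stream.append(f"[marker_{k}]")
        stream.append(f"<part_{k}>")
        stream.extend(members)
    return stream

def jaccard(predicted, reference):
    """Jaccard index between two sets of part slots (1.0 if both are empty)."""
    predicted, reference = set(predicted), set(reference)
    union = predicted | reference
    return len(predicted & reference) / len(union) if union else 1.0
```

Because `<part_k>` is an ordinary vocabulary token, a prompt or an output can refer to a region simply by emitting that token, which is the addressing mechanism the abstract describes.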

2606.19584 2026-06-19 cs.CV New submission 85%

Language-Instructed Vision Embeddings for Controllable and Generalizable Perception

Chengzhi Mao, Xudong Lin, Wen-Sheng Chu

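Affiliations * Google

Topic match Visual reasoning: a language-instructed vision embedding method that improves visual reasoning and generalization

AI summary Proposes Language-Instructed Vision Embeddings (LIVE), which uses language to dynamically guide the vision encoder into producing task-centric embeddings, removing the need for task-specific retraining, reducing visual hallucinations, and improving generalization.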

Journal ref Published as a conference paper at ICLR 2026

Abstract

Vision foundation models are typically trained as static feature extractors, placing the burden of task adaptation onto large downstream models. We propose an alternative paradigm: instead of solely feeding visual features into language models, we use language itself to dynamically guide the vision encoder. Our method, Language-Instructed Vision Embeddings (LIVE), leverages language as high-level guidance to produce task-centric embeddings at inference time, removing the need for task-specific retraining. This enables the encoder to focus on contextually relevant aspects of the input, yielding more controllable and generalizable representations. Empirically, LIVE reduces visual hallucinations (+34 points on MMVP), surpasses vision-language models with orders of magnitude more parameters on visual question answering, and generalizes to unseen instructions and tasks, offering a direct path toward adaptive, instruction-driven visual intelligence.

2606.18950 2026-06-19 cs.AI New submission 85%

RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models

San Kim, Daechul Ahn, Reokyoung Kim, Hyeonbeom Choi, Seungyeon Jwa, Jonghyun Choi

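Affiliations * Seoul National University

Topic match Visual reasoning: evaluates the strategic reasoning of VLMs in RTS games

AI summary Proposes RTSGameBench, built on the game Beyond All Reason, which evaluates the strategic reasoning of vision-language models in real-time strategy games through diverse matchups, diagnostic mini-games, and a self-evolving generation framework.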

Comments First two authors contributed equally

Abstract

Modern Vision-Language Models (VLMs) often struggle with strategic reasoning, i.e., anticipating and influencing other agents' actions, under uncertainty in competitive and cooperative settings. Real-time strategy (RTS) games can be a natural testbed for diagnosing this limitation, as they demand coordination with allies, adaptation to opponents' strategy, and long-horizon planning under partial observability. However, existing RTS benchmarks offer limited evaluation scope, lack systematic competency diagnosis, and remain fixed in the pre-designed scenario coverage. To address these limitations, we present RTSGameBench, which is built on Beyond All Reason, a large-scale RTS game with an expanded battlefield that demands broader strategy diversity than the existing testbeds. The proposed benchmark provides evaluations through diverse gameplay across various matchup structures, diagnostic assessment via mini-games, each targeting an individual strategic competency, and extensible coverage via a self-evolving generation framework that converts free-form queries into new mini-games, improving over successive cycles. Additionally, for VLMs to operate in large-scale RTS games, we provide RTSGameAgent, which manages units through an FSM with agentic memory. We empirically validate that multiple state-of-the-art VLMs perform poorly when matchups demand tighter or multi-agent coordination and when task scale increases.

2606.20177 2026-06-19 cs.CV cs.AI New submission 80%

Evaluating and Enhancing Negation Comprehension in Remote Sensing MLLMs

Haochen Han, Jue Wang, Alex Jinpeng Wang, Fangming Liu

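Affiliations * Peng Cheng Laboratory, Tsinghua University, Central South University

Topic match Visual reasoning: evaluates negation comprehension in remote sensing MLLMs, a vision-language reasoning task.

AI summary Proposes the RS-Neg benchmark for evaluating negation comprehension in remote sensing MLLMs, and designs NeFo, a test-time learning method that uses about 5% of unlabeled samples to substantially improve model performance.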

Comments ECCV 2026 Accepted

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in various Remote Sensing (RS) tasks. However, their ability to comprehend negation remains underexplored, limiting deployment in real-world applications where models must explicitly identify what is false or absent, e.g., emergency responders need to locate non-flooded routes for evacuation. To comprehensively study this limitation, we introduce RS-Neg, the first benchmark to evaluate negation understanding across region-level to scene-level tasks. Specifically, we design an automated data generation pipeline for RS imagery, using LLMs to synthesize diverse negation queries, and introduce a dynamic visual focus module for verification. Our evaluation reveals that advanced RS MLLMs struggle with negation, exhibiting hallucinations and substantial performance degradation. To close this gap, we propose NeFo, a novel test-time learning method that explicitly incorporates the logical role of negation into the model optimization. Remarkably, using about 5% of unlabeled test samples, NeFo significantly improves the negation understanding of models and shows strong generalization to unseen tasks. Code and data will be released upon acceptance.

2606.19927 2026-06-19 cs.CV New submission 80%

CARE: Competence-Aware Reward Shaping for Adaptive Reasoning Length in Video-MLLMs

Chengwen Liu, Hao Peng, Jisheng Dang, Hong Peng, Bin Hu, Tat-Seng Chua

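Affiliations * School of Information Science and Engineering, Lanzhou University; School of Medical Technology, Beijing Institute of Technology; School of Computing, National University of Singapore

Topic match Visual reasoning: proposes an adaptive reasoning-length optimization framework for video MLLMs

AI summary Proposes the CARE framework, which adaptively optimizes reasoning length through competence-aware reward shaping: it estimates competence with an exponential moving average, adjusts the reward preference in stages, and combines batch-level normalization with a posterior amplifier to improve efficiency and accuracy.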

Abstract

In multimodal video reasoning, reinforcement learning-based methods typically rely on simplistic and inflexible reasoning-length control strategies that fail to adapt to the model's evolving competence. This mismatch may suppress necessary exploration at early stages, while encouraging redundant reasoning and inefficient decoding once the model becomes more competent. In this paper, we propose CARE, a competence-aware reward shaping framework for adaptive reasoning length optimization in multimodal reasoning. Specifically, CARE maintains a smoothed competence estimate via an exponential moving average of pass rates, and uses it to route training into progressive stages that shift the reward preference from exploration-oriented long-form reasoning to efficiency-oriented concise reasoning. To avoid conflating verbosity with intrinsic task complexity, CARE further normalizes reasoning effort with batch-level statistics, and introduces a posterior amplifier to strengthen reward signals for unexpectedly strong performance on historically difficult samples. The proposed mechanism is seamlessly integrated into the GRPO training pipeline and incurs no additional inference-time overhead. Extensive experiments on multiple video reasoning and general video understanding benchmarks demonstrate that CARE consistently improves reasoning accuracy, stabilizes reinforcement learning, and significantly enhances token efficiency. Moreover, CARE exhibits a characteristic inverted-U trajectory of reasoning length during training, and yields shorter yet more informative reasoning traces at convergence, indicating effective adaptive allocation of reasoning budget. We provide the source code for our proposed CARE framework and experiments at https://github.com/1Pansy/Video-CARE.
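The competence estimate and batch-level normalization described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the decay value, the stage names, the two thresholds, and all function names are hypothetical, and only the idea of an exponential moving average over pass rates, stage routing, and batch-normalized reasoning length comes from the abstract.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class CompetenceTracker:
    """Smoothed competence estimate built from per-batch pass rates."""
    decay: float = 0.9        # assumed EMA decay, not taken from the paper
    competence: float = 0.0   # assumed initial estimate

    def update(self, pass_rate: float) -> float:
        # Standard EMA update: new = decay * old + (1 - decay) * observation.
        self.competence = self.decay * self.competence + (1.0 - self.decay) * pass_rate
        return self.competence


def stage_for(competence: float, explore_below: float = 0.3, concise_above: float = 0.7) -> str:
    """Route training to a stage; the thresholds and stage names are hypothetical."""
    if competence < explore_below:
        return "explore"   # reward preference favors long-form reasoning
    if competence > concise_above:
        return "concise"   # reward preference favors short reasoning
    return "neutral"


def normalized_length(length: float, batch_lengths: Sequence[float]) -> float:
    """Z-score of one response length against the lengths in its batch."""
    mean = sum(batch_lengths) / len(batch_lengths)
    variance = sum((x - mean) ** 2 for x in batch_lengths) / len(batch_lengths)
    std = variance ** 0.5
    return 0.0 if std == 0 else (length - mean) / std
```

In a training loop under these assumptions, `update` would be called once per batch with the fraction of correct rollouts, and `stage_for` would decide which length preference the reward applies.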

2606.19882 2026-06-19 cs.CV cs.LG New submission 80%

Multimodal Concept Bottleneck Models

Tongqing Shi, Ge Yan, Tuomas Oikarinen, Tsui-Wei Weng

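Affiliations * UC San Diego

Topic match Visual reasoning: multimodal concept bottleneck models for interpretable zero-shot classification.

AI summary Proposes the Multimodal Concept Bottleneck Model (MM-CBM), which uses dual concept bottleneck layers to align image and text embeddings, enabling interpretable zero-shot classification and image retrieval, with up to 51.26% average accuracy improvement across four benchmarks.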

Comments Presented at NeurIPS 2025 Mechanistic Interpretability Workshop

Abstract

Concept Bottleneck Models (CBMs) enhance the interpretability of deep learning networks by aligning the features extracted from images with natural concepts. However, existing CBMs are constrained in their ability to generalize beyond a fixed set of predefined classes and the risk of non-concept information leakage, where predictive signals outside the intended concepts are inadvertently exploited. In this paper, we propose Multimodal Concept Bottleneck Model (MM-CBM) to address these issues and extend CBMs into CLIP. MM-CBM utilizes dual Concept Bottleneck Layers (CBLs) to align both the image and text embeddings into interpretable features. This allows us to perform new vision tasks like zero-shot classification or image retrieval in an interpretable way. Compared to existing methods, MM-CBM achieves up to 51.26% accuracy improvement on average across four standard benchmarks. Our method maintains high accuracy, staying within ~5% of black-box performance while offering greater interpretability.

2606.20458 2026-06-19 cs.RO New submission 75%

Slow Brain, Fast Planner: Latency-Resilient VLM-Augmented Urban Navigation

Zhenghao "Mark" Peng, Honglin He, Quanyi Li, Yukai Ma, Bolei Zhou

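Affiliations * Amazon FAR, UCLA, Independent, Zhejiang University

Topic match Visual reasoning: uses a VLM to improve trajectory selection in urban navigation.

AI summary To address the trajectory scoring gap in sidewalk navigation for mobile robots, proposes a training-free, latency-resilient trajectory-level fusion layer that uses a VLM to select a candidate trajectory and fuses it with the planner's output, reducing ADE by 30% in challenging scenarios.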

Abstract

Learning-based planners for sidewalk navigation can generate diverse candidate trajectories in real time, yet their scoring functions often fail to select the best trajectory in challenging situations, outputting trajectories that make the mobile robot drive onto grass, toward pedestrians, or in the wrong direction, even when better candidates exist in the same set. We call this the trajectory scoring gap: in real-world sidewalk navigation, the gap between an anchor-based planner's top choice and the best possible candidate is substantial, likely due to the planner's limited high-level scene understanding. Rather than replacing the planner with an end-to-end Vision-Language-Action model, we propose a VLM-Planner interface that uses a VLM to select a candidate index from the planner's proposal set and then fuses it with the planner's initial output. However, VLMs take 1-3 s per query and so cannot directly drive a 5-20 Hz control loop. We contribute a training-free, latency-resilient trajectory-level fusion layer that turns a stale VLM selection into real-time planner scoring via geometric similarity with exponential decay. On about 2,000 challenging real-world scenarios (e.g., junctions, pedestrian encounters), VLM selection achieves 30% ADE reduction versus the planner's best selection, while the planner remains competitive in routine situations. In simulation, Score Fusion maintains >80% success rate with delays up to 5 s. We demonstrate the full system on a mobile robot navigating challenging campus sidewalks with varied network latency.
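As a rough illustration of turning a stale VLM selection into real-time scores "via geometric similarity with exponential decay", the sketch below adds a bonus to each current candidate based on how close it is to the trajectory the VLM picked earlier, and shrinks that bonus as the selection ages. The additive form, the distance measure, and the parameters `tau_s`, `length_scale_m`, and `gain` are assumptions for illustration, not the paper's formulation.

```python
import math
from typing import List, Sequence, Tuple

Trajectory = Sequence[Tuple[float, float]]  # (x, y) waypoints in meters


def mean_distance(a: Trajectory, b: Trajectory) -> float:
    """Average pointwise Euclidean distance between two equal-length trajectories."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def fuse_scores(
    planner_scores: Sequence[float],
    candidates: Sequence[Trajectory],
    vlm_choice: Trajectory,
    age_s: float,
    tau_s: float = 2.0,           # hypothetical decay time constant
    length_scale_m: float = 1.0,  # hypothetical similarity length scale
    gain: float = 1.0,            # hypothetical weight of the VLM bonus
) -> List[float]:
    """Add a time-decayed similarity bonus to each planner score."""
    decay = math.exp(-age_s / tau_s)  # older VLM selections count less
    fused = []
    for score, candidate in zip(planner_scores, candidates):
        # Similarity is 1 for an identical trajectory and falls toward 0 with distance.
        similarity = math.exp(-mean_distance(candidate, vlm_choice) / length_scale_m)
        fused.append(score + gain * decay * similarity)
    return fused
```

With a fresh selection the candidate nearest the VLM's pick is boosted the most. Once the selection is many time constants old, the bonus vanishes and the ranking falls back to the planner's own scores.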

2606.19489 2026-06-19 cs.LG cs.AI New submission 75%

Concept Flow Models: Anchoring Concept-Based Reasoning with Hierarchical Bottlenecks

Ya Wang, Adrian Paschke

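Affiliations * Fraunhofer Institute for Open Communication Systems, Freie Universität Berlin

Topic match Visual reasoning: uses vision-language models to generate concept embeddings and improve interpretability.

AI summary Proposes Concept Flow Models (CFMs), which replace the flat bottleneck with a hierarchical concept decision tree that progressively narrows the prediction scope, reducing information leakage and improving interpretability while preserving predictive performance.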

Journal ref Transactions on Machine Learning Research, 2/2026

Abstract

Concept Bottleneck Models (CBMs) enhance interpretability by projecting learned features into a human-understandable concept space. Recent approaches leverage vision-language models to generate concept embeddings, reducing the need for manual concept annotations. However, these models suffer from a critical limitation: as the number of concepts approaches the embedding dimension, information leakage increases, enabling the model to exploit spurious or semantically irrelevant correlations and undermining interpretability. In this work, we propose Concept Flow Models (CFMs), which replace the flat bottleneck with a hierarchical, concept-driven decision tree. Each internal node in the hierarchy focuses on a localized subset of discriminative concepts, progressively narrowing the prediction scope. Our framework constructs decision hierarchies from visual embeddings, distributes semantic concepts at each hierarchy level, and trains differentiable concept weights through probabilistic tree traversal. Extensive experiments on diverse benchmarks demonstrate that CFMs match the predictive performance of flat CBMs, while substantially mitigating information leakage by reducing effective concept usage. Furthermore, CFMs yield stepwise decision flows that enable transparent and auditable model reasoning with hierarchical class structures.
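To make "probabilistic tree traversal" concrete, here is a small sketch of a soft decision tree in which each internal node looks only at its own subset of concepts and class probabilities are products of branch probabilities along the path. The binary branching, the sigmoid gate, and the `Node` structure are assumptions chosen for brevity. The abstract does not specify these details.

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass
class Node:
    label: Optional[str] = None       # set on leaves only
    concept_ids: Sequence[int] = ()   # local concept subset used at this node
    weights: Sequence[float] = ()     # one weight per local concept
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def leaf_probabilities(node: Node, concept_scores: Sequence[float], reach: float = 1.0) -> Dict[str, float]:
    """Return P(class) for every leaf as the product of branch probabilities on its path."""
    if node.label is not None:
        return {node.label: reach}
    # The gate sees only this node's local concepts, not the full concept vector.
    logit = sum(w * concept_scores[i] for w, i in zip(node.weights, node.concept_ids))
    p_right = 1.0 / (1.0 + math.exp(-logit))
    probs = leaf_probabilities(node.left, concept_scores, reach * (1.0 - p_right))
    probs.update(leaf_probabilities(node.right, concept_scores, reach * p_right))
    return probs
```

Because every step is differentiable in the weights, a structure like this could in principle be trained end to end, and the per-node branch probabilities give a stepwise trace of how a prediction was narrowed down.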

2606.20274 2026-06-19 cs.AI New submission 70%

Lagrange: An Open-Vocabulary, Energy-Based Sparse Framework for Generalized End-to-End Driving

Shihao Ji, HongXi Li, Zihui Song, Mingyu Li

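Topic match Visual reasoning: uses a VLM for open-vocabulary reasoning

AI summary Proposes the Lagrange framework, which uses Masked Latent Fields and vision-language models for open-vocabulary, sparse computation, enforces kinematic constraints through Lagrangian action minimization, and is evaluated for robustness and interpretability on the nuScenes and CODA benchmarks.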

Abstract

Scaling end-to-end autonomous driving to complex, open-world environments requires perceptual models that generalize to anomalous scenarios and planners that produce kinematically valid trajectories. Existing paradigms face a distinct dichotomy between representational efficiency and generalization capacity. Dense models (e.g., occupancy networks), while geometrically robust, incur critical computational bottlenecks and struggle with high-level semantic reasoning. Conversely, sparse, query-based planners are efficient but reliant on closed-set definitions, rendering them vulnerable to out-of-distribution (OOD) events. Although recent Vision-Language-Action (VLA) models offer open-vocabulary reasoning, their autoregressive, discrete token generation fundamentally conflicts with the continuous, high-frequency control requirements of vehicle dynamics. To address this, we propose Lagrange, an open-vocabulary, computationally sparse driving framework based on Masked Latent Fields (MLF). Rather than relying on dense volumetric reconstructions or closed-set query mechanisms, Lagrange exploits Vision-Language Models (VLMs) to encode class-agnostic object proposals into continuous semantic visual tokens. We introduce an intent-driven masked cross-attention module that temporally filters irrelevant entities, decoding the attended tokens into an implicit continuous energy field defined over spatial coordinates. By framing decision-making as a Lagrangian action minimization problem spanning this energy field, we enforce strict compliance with vehicle kinematics while executing collision avoidance. Extensive offline evaluations on both standard (nuScenes) and long-tail (CODA) benchmarks demonstrate that Lagrange establishes a promising framework for robust, interpretable, and kinematically feasible open-world autonomy.