Multimodal Large Models - arXivDaily Topic

2606.19338 2026-06-18 cs.CV New submission 85%

Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

Shengyuan Ding, Xilin Wei, Xinyu Fang, Haodong Duan, Dahua Lin, Jiaqi Wang, Yuhang Zang

Affiliations: Fudan University; Shanghai Innovation Institute; Shanghai Artificial Intelligence Laboratory; Zhejiang University; The Chinese University of Hong Kong

Topic match: multimodal evaluation (non-Markov games for evaluating memory in multimodal models)

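AI summary: Proposes RNG-Bench, a benchmark suite that uses two games, Matching Pairs and 3D Maze, to evaluate how well multimodal large models reconstruct past observations and act on them in non-Markov environments. It finds that most errors stem from forgetting rather than from decision making, and that fine-tuning improves performance.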

Details

English abstract

Deploying multimodal foundation models as closed-loop policies increasingly requires conditioning actions on observations that are no longer visible. However, existing benchmarks either expose the full state, conflate hidden-state reconstruction with other agent skills, or test recall only after an episode has ended. We introduce RNG-Bench (Reconstructive Non-Markov Games), a benchmark suite designed to isolate a base model's ability to reconstruct past observations and act on them during multi-step interaction. RNG-Bench includes two complementary games: Matching Pairs, where card identities briefly revealed at specific locations must later be recalled, and 3D Maze, where egocentric views must be integrated into a spatial map. Both games are evaluated under a unified harness with three controlled difficulty axes: grid size, visual pattern, and observation modality. The benchmark further introduces a head-to-head duel protocol to control for instance-level variance and a Memory Gap metric that disentangles forgetting from poor action selection. The hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode, and remain far from saturated by frontier MLLMs. Memory Gap analysis shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decision making. Finally, fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench and transfers to existing benchmarks without degrading general multimodal capability.
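
To make the non-Markov aspect of Matching Pairs concrete, here is a minimal toy sketch in Python. It is not RNG-Bench code and does not reproduce the paper's harness, duel protocol, or Memory Gap metric. The class `MatchingPairs` and the helpers `play_without_memory`, `play_with_memory`, and `find_known_pair` are hypothetical names invented for illustration. The sketch only shows that a player who remembers previously revealed cards can finish in at most twice the number of pairs, while a player who acts on the current board alone has no such bound.

```python
import random


class MatchingPairs:
    """Toy game: card identities are visible only at the moment they are flipped."""

    def __init__(self, num_pairs, seed=0):
        self.cards = list(range(num_pairs)) * 2
        random.Random(seed).shuffle(self.cards)
        self.matched = [False] * len(self.cards)

    def hidden_positions(self):
        # The persistent observation: which positions are still face-down.
        return [i for i, m in enumerate(self.matched) if not m]

    def reveal(self, pos):
        # Briefly shows the identity at one position.
        return self.cards[pos]

    def resolve(self, i, j):
        # Ends a turn: the two flipped cards stay face-up only if they match.
        if i != j and self.cards[i] == self.cards[j]:
            self.matched[i] = self.matched[j] = True

    def done(self):
        return all(self.matched)


def play_without_memory(game, rng):
    """Acts on the current board only: flips two random face-down cards per turn."""
    turns = 0
    while not game.done():
        i, j = rng.sample(game.hidden_positions(), 2)
        game.resolve(i, j)
        turns += 1
    return turns


def find_known_pair(seen, hidden):
    """Returns two face-down positions remembered to hold the same identity, if any."""
    first_at = {}
    for p in hidden:
        if p in seen:
            if seen[p] in first_at:
                return first_at[seen[p]], p
            first_at[seen[p]] = p
    return None


def play_with_memory(game, rng):
    """Keeps every revealed identity and uses it to choose later flips."""
    seen = {}  # position -> identity revealed earlier in the episode
    turns = 0
    while not game.done():
        hidden = game.hidden_positions()
        known_pair = find_known_pair(seen, hidden)
        if known_pair:
            i, j = known_pair
        else:
            unseen = [p for p in hidden if p not in seen]
            i = rng.choice(unseen)
            seen[i] = game.reveal(i)
            partner = [p for p in hidden if p != i and seen.get(p) == seen[i]]
            if partner:
                j = partner[0]
            else:
                j = rng.choice([p for p in unseen if p != i])
                seen[j] = game.reveal(j)
        game.resolve(i, j)
        turns += 1
    return turns


if __name__ == "__main__":
    pairs = 8
    with_memory = play_with_memory(MatchingPairs(pairs, seed=1), random.Random(0))
    without_memory = play_without_memory(MatchingPairs(pairs, seed=1), random.Random(0))
    print("turns with memory:", with_memory)
    print("turns without memory:", without_memory)
```

In this toy setting, every non-matching turn of the memory player reveals two new cards, which is why its turn count is bounded by twice the number of pairs. The difference between the two turn counts is only a loose analogy for separating forgetting from poor action selection; the paper's Memory Gap metric is not defined in the abstract and is not modeled here.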
