VideoSEG-O3: A Multi-turn Reinforcement Learning Framework for Reasoning Video Object Segmentation
Ming Dai, Sen Yang, Boqiang Duan, Boyuan Tong, Jiedong Zhuang, Wankou Yang, Jingdong Wang
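AI summary: Proposes VideoSEG-O3, the first multi-turn reinforcement learning framework for reasoning video object segmentation. It combines a multi-turn temporal-spatial chain-of-thought with SEG-aware logit calibration to segment coarse-to-fine, targeting precise pixel-level localization in complex videos.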
- Comments
- ICML2026
Reasoning Video Object Segmentation (RVOS) demands a sophisticated integration of temporal dynamics, spatial details, and linguistic reasoning to achieve precise pixel-level localization. Existing methods are limited to reasoning over fixed initial inputs and lack the capacity to actively acquire further visual evidence, which is often essential for resolving complex references in long or intricate videos. To address this, we propose VideoSEG-O3, the first multi-turn reinforcement learning framework for RVOS that emulates the human "coarse-to-fine" cognitive process. It employs a multi-turn temporal-spatial chain-of-thought to capture fine-grained details by iteratively pinpointing critical intervals and keyframes. Additionally, to enable the policy to perceive segmentation quality beyond the mere text probability of [SEG] during the RL stage, we introduce SEG-aware logit calibration, which integrates pixel-wise segmentation feedback directly into the token-level logits. Furthermore, we design a decoupled thinking trace to hierarchically decompose the reasoning process into temporal, spatial, and linguistic dimensions, and construct VTS-CoT, a specialized cold-start dataset featuring comprehensive reasoning trajectories. The code and models will be released at https://github.com/Dmmm1997/VideoSEG-O3.
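
The coarse-to-fine idea of iteratively pinpointing a critical interval and then a keyframe can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the procedure, so the function names (`uniform_indices`, `coarse_to_fine`), the parameters (`frames_per_turn`, `max_turns`), and the `score_frame` callback are all hypothetical. In particular, `score_frame` stands in for whatever relevance judgment the learned policy would make on each turn.

```python
from typing import Callable, List, Tuple


def uniform_indices(start: int, end: int, k: int) -> List[int]:
    """Return up to k roughly evenly spaced frame indices in [start, end]."""
    if end < start:
        raise ValueError("end must be >= start")
    span = end - start
    if span == 0 or k <= 1:
        return [start]
    k = min(k, span + 1)  # never request more samples than frames available
    return [start + (i * span) // (k - 1) for i in range(k)]


def coarse_to_fine(
    num_frames: int,
    score_frame: Callable[[int], float],  # hypothetical stand-in for the policy
    frames_per_turn: int = 5,
    max_turns: int = 3,
) -> Tuple[Tuple[int, int], int]:
    """Narrow a frame interval over several turns and return (interval, keyframe).

    Each turn views a sparse sample of the current interval, keeps the
    best-scoring frame as the keyframe candidate, and shrinks the interval
    to the neighbouring samples around it.
    """
    start, end = 0, num_frames - 1
    best = start
    for _ in range(max_turns):
        idx = uniform_indices(start, end, frames_per_turn)
        best = max(idx, key=score_frame)
        if end - start + 1 <= frames_per_turn:
            break  # every frame in the interval has already been viewed
        pos = idx.index(best)
        start = idx[max(pos - 1, 0)]
        end = idx[min(pos + 1, len(idx) - 1)]
    return (start, end), best


if __name__ == "__main__":
    # Toy relevance score that peaks at frame 37 of a 100-frame video.
    interval, keyframe = coarse_to_fine(100, lambda t: -abs(t - 37))
    print(interval, keyframe)
```

In this toy version each turn inspects only `frames_per_turn` frames, so a long video is covered with a small, fixed budget per turn while the sampling density around the target grows as the interval shrinks. A real system would replace the scoring callback with model reasoning over the sampled frames and the referring expression.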