SD-GRPO: Verifiable Segment Decomposition for Long-Form Vision-Language Generation
Hyunwoong Kim, Seongeun Lee, Hannah Yun, Junhyun Park, Jonggwon Park
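AI summary: Proposes SD-GRPO, which decomposes long-form outputs into segments and computes per-segment advantages, addressing the coarse-grained credit assignment that limits GRPO on vision-language tasks. Experiments show it outperforms baselines across several long-form generation tasks.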
Group Relative Policy Optimization (GRPO) and its variants, originally developed for Large Language Models (LLMs), have recently been applied to Multimodal LLMs and produced strong results. However, their coarse-grained holistic credit assignment from a single scalar advantage underfits vision-language (VL) tasks, where outputs are often long-form responses grounded in semantically rich images. To address this limitation, we exploit a structured signal that single-scalar formulations discard: the natural segmentation of long-form VL outputs. Concretely, we propose Segment-Decomposed GRPO (SD-GRPO), which z-normalizes verifiable per-segment rewards across the rollout group, yielding a vector of per-segment advantages in place of a single scalar. We evaluate SD-GRPO across three settings spanning controlled and real-world long-form VL generation, organized by increasing semantic entanglement across segments. On a controlled multi-panel dense-captioning task constructed from DOCCI, where segments are semantically independent, SD-GRPO consistently outperforms the GRPO baseline, with larger gains at higher segment counts. Extending to a controlled multi-chart long-form VQA task constructed from MultiChartQA, we show both theoretically and empirically that rollout-level rewards suffer from cross-segment credit misattribution that scales with output length. On a real-world scientific figure captioning task on the MMSci dataset, where subfigure captions share context across the figure, blending holistic and per-segment rewards further improves on both, suggesting per-segment normalization alone is insufficient when segments are semantically entangled. Finally, by integrating SD-GRPO into Dr. GRPO, we confirm that it can be applied to any GRPO framework with minimal implementation overhead to enhance long-form VL generation.
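The contrast between a single scalar advantage and a per-segment advantage vector can be illustrated with a minimal sketch. It assumes each rollout in a group of size G is scored with K verifiable per-segment rewards, that the scalar baseline sums them into one rollout-level reward, and that normalization uses the population standard deviation with a small `eps`. The function names, the toy reward matrix, and these choices are illustrative assumptions and are not taken from the paper; how each segment's advantage is mapped onto its tokens is likewise left out.

```python
import numpy as np


def grpo_advantages(segment_rewards, eps=1e-8):
    """Scalar baseline (assumed form): z-normalize the summed reward per rollout."""
    total = segment_rewards.sum(axis=1)                  # shape (G,)
    return (total - total.mean()) / (total.std() + eps)  # shape (G,)


def sd_grpo_advantages(segment_rewards, eps=1e-8):
    """Per-segment variant: z-normalize each segment's reward across the group."""
    mean = segment_rewards.mean(axis=0, keepdims=True)   # shape (1, K)
    std = segment_rewards.std(axis=0, keepdims=True)     # shape (1, K)
    return (segment_rewards - mean) / (std + eps)        # shape (G, K)


# Toy group: G = 4 rollouts, K = 3 segments, binary verifiable rewards.
rewards = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0],
])

scalar_adv = grpo_advantages(rewards)      # one value per rollout
segment_adv = sd_grpo_advantages(rewards)  # one value per (rollout, segment)

print(scalar_adv)
print(segment_adv)
```

In this toy group the first three rollouts each fail a different segment yet receive the same scalar advantage, so the failed segment is reinforced as strongly as the correct ones. The per-segment vector instead assigns a negative advantage to exactly the failed segment of each rollout.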