2510.01833
2026-05-27
cs.AI
cs.CL
Plan Then Action: High-Level Planning Guidance Reinforcement Learning for LLM Reasoning
Zhihao Dou, Qinjian Zhao, Zhongwei Wan, Dinggen Zhang, Weida Wang, Towsif Raiyan, Benteng Chen, Qingtao Pan, Yang Ouyang, Chaoda Song, Zhiqiang Gao, Shufei Zhang, Sumon Biswas
Affiliations
Case Western Reserve University, Cleveland, OH, USA
Kean University, Union, NJ, USA
The Ohio State University, Columbus, OH, USA
Fudan University, Shanghai, China
Shanghai Artificial Intelligence Laboratory, Shanghai, China
The University of Hong Kong, Hong Kong, China
North Carolina State University, Raleigh, NC, USA
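AI Summary
Proposes PTA-GRPO, a two-stage framework that jointly optimizes high-level planning guidance and reinforcement learning to improve the accuracy and generalization of LLMs on mathematical and natural-science reasoning tasks.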