SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training
Zhongyu He, Yuanfan Li, Fei Huang, Tianyu Chen, Siyuan Chen, Xingyang Li, Meng Hsuan Yu, Xiangrong Liu, Leyi Wei, Lu Pan, Ke Zeng, Xunliang Cai
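AI summary: Proposes the SIRI framework, which lets LLM agents improve long-horizon task performance through self-skill mining, validation, and internalization, without an external skill generator or an inference-time skill bank. It outperforms baseline methods on ALFWorld and WebShop.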
Long-horizon LLM agents can benefit from reusable skills, yet existing skill-based methods often rely on external skill generators during training or persistent skill retrieval at inference, increasing engineering complexity, context length, and deployment latency. We propose Self-Internalizing Reinforcement learning with Intrinsic skills (SIRI), a three-phase framework that enables agents to discover, validate, and internalize skills without external skill generators or inference-time skill banks. SIRI first warms up the policy with GiGPO to acquire basic interaction ability and collect successful skill-free trajectories. It then performs self-skill mining, where the current policy summarizes compact skills from its own successful plain rollouts and validates them through paired skill-augmented and skill-free rollouts. Finally, SIRI distills only the beneficial skill-guided action tokens into the plain policy, selecting them by trajectory-level utility and action-level advantage. At inference, the agent runs with the original prompt only. On ALFWorld and WebShop with Qwen2.5-7B-Instruct, SIRI improves GiGPO from 0.908 to 0.930 on ALFWorld and from 0.728 to 0.813 on WebShop, outperforming prompt-based, RL-based, and memory-augmented baselines. Further analysis shows that our self-mining strategy can achieve performance comparable to distillation from a closed-source large model. Our code is available at https://github.com/kirito618/SIRI.
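The validation and selective distillation steps described above can be pictured with a small sketch. It is illustrative only and is not taken from the SIRI code: the abstract does not give the formulas, so the sketch assumes that a skill's trajectory-level utility is the success-rate gap between paired skill-augmented and skill-free rollouts, and that a token is kept for distillation when that utility and its own action-level advantage both exceed a threshold. The names `Rollout`, `skill_utility`, `distillation_mask`, and both thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """One episode: whether it succeeded, plus one advantage per action token."""
    success: bool
    advantages: List[float]


def success_rate(rollouts: List[Rollout]) -> float:
    if not rollouts:
        raise ValueError("need at least one rollout")
    return sum(r.success for r in rollouts) / len(rollouts)


def skill_utility(with_skill: List[Rollout], without_skill: List[Rollout]) -> float:
    """Assumed utility: success-rate gain of skill-augmented over skill-free rollouts."""
    return success_rate(with_skill) - success_rate(without_skill)


def distillation_mask(
    rollout: Rollout,
    utility: float,
    min_utility: float = 0.0,    # hypothetical threshold
    min_advantage: float = 0.0,  # hypothetical threshold
) -> List[bool]:
    """Mark the action tokens of a skill-augmented rollout that would be distilled.

    If the skill did not help at the trajectory level, nothing is kept;
    otherwise only tokens with an above-threshold advantage are kept.
    """
    if utility <= min_utility:
        return [False] * len(rollout.advantages)
    return [a > min_advantage for a in rollout.advantages]


# Paired rollouts for one task: the skill lifts success from 1/2 to 2/2.
with_skill = [Rollout(True, [0.5, -0.2, 0.1]), Rollout(True, [0.3, 0.0])]
without_skill = [Rollout(True, [0.1]), Rollout(False, [-0.1])]

utility = skill_utility(with_skill, without_skill)
masks = [distillation_mask(r, utility) for r in with_skill]
```

In a training loop, a mask like this would gate a per-token distillation loss on the plain-prompt policy, so that tokens from unhelpful skills or low-advantage actions contribute nothing.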