SKILLC: Learning Autonomous Skill Internalization in LLM Agents via Contrastive Credit Assignment
Hongxiang Lin, Zhirui Kuai, Erpeng Xue, Lei Wang
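AI summary: Proposes SkillC, a framework based on Contrastive Skill Credit Assignment (CSCA) that turns the skill-helpfulness contrast into a direct learning signal, enabling autonomous skill internalization in LLM agents. On ALFWorld and WebShop it surpasses the strongest baseline by 5.5% and 4.4%, respectively.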
Structured skill prompts improve exploration in long-horizon agentic reinforcement learning (RL). Skill-augmented RL methods retain external skills at inference, while skill-internalization RL methods withdraw them during training so that the agent performs autonomously. However, existing internalization approaches use the skill-helpfulness contrast only for curriculum control: the policy update itself is unchanged and cannot distinguish skill-dependent success from autonomous success. We propose SkillC, a framework based on Contrastive Skill Credit Assignment (CSCA) that converts this contrast into a direct learning signal for internalization. Within a single policy update, SkillC samples paired skill-injected and skill-free rollouts for tasks from active skill types, and injects their task-level contrast into optimization via a dual-stream advantage estimator that preserves global ranking while applying a one-sided correction toward skill-free success. A smoothed validation-level signal further drives an adaptive curriculum over attribution strength, rollout allocation, and monotonic active-set pruning. Experiments on ALFWorld and WebShop show that, without runtime skill access, SkillC surpasses the strongest prior skill-internalization RL baseline by 5.5% and 4.4%, respectively, while remaining competitive with skill-augmented RL methods.
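The abstract does not give the estimator's formula, so the following is only a minimal sketch of what a dual-stream advantage with a one-sided correction might look like, not the authors' implementation. It assumes binary task rewards, a pooled mean/std baseline shared by both streams, and a bonus proportional to the positive part of the task-level contrast. The function name `dual_stream_advantages` and the `strength` parameter are hypothetical.

```python
from statistics import mean, pstdev


def dual_stream_advantages(skill_rewards, free_rewards, strength=0.5, eps=1e-6):
    """Illustrative only: advantages for one task's paired rollouts.

    skill_rewards: rewards of skill-injected rollouts (assumed 0/1).
    free_rewards:  rewards of skill-free rollouts (assumed 0/1).
    strength:      hypothetical attribution-strength coefficient.
    """
    # Shared baseline: normalize every rollout against the pooled group,
    # so both streams are scored on one common scale.
    pooled = list(skill_rewards) + list(free_rewards)
    mu, sigma = mean(pooled), pstdev(pooled)
    base = [(r - mu) / (sigma + eps) for r in pooled]

    # Task-level contrast: how much the skill prompt helps on this task.
    contrast = mean(skill_rewards) - mean(free_rewards)

    # One-sided correction: only a positive contrast produces a bonus,
    # and only successful skill-free rollouts receive it.
    bonus = strength * max(0.0, contrast)

    n_skill = len(skill_rewards)
    skill_adv = base[:n_skill]
    free_adv = [
        a + bonus if r > 0 else a
        for a, r in zip(base[n_skill:], free_rewards)
    ]
    return skill_adv, free_adv


if __name__ == "__main__":
    skill_adv, free_adv = dual_stream_advantages([1, 1, 1, 0], [1, 0, 0, 0])
    print("skill-injected:", skill_adv)
    print("skill-free:    ", free_adv)
```

Under these assumptions, a skill-free success on a task where the skill still helps receives a larger advantage than a skill-injected success, while failures in both streams are scored identically. When the skill no longer helps (non-positive contrast), the correction vanishes and the estimator reduces to the pooled baseline.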