Post-Training LLMs as Better Decision-Making Agents: A Regret-Minimization Approach
Chanwoo Park, Ziyang Chen, Asuman Ozdaglar, Kaiqing Zhang
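AI summary: Proposes Iterative Regret-Minimization Fine-Tuning (Iterative RMFT), which post-trains LLMs by repeatedly distilling low-regret decision trajectories, improving their performance on online decision-making tasks without relying on known algorithms or hand-crafted templates.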
Comments: Camera-ready version, ICML 2026
Details
Large language models (LLMs) are increasingly deployed as "agents" for decision-making (DM) in interactive and dynamic environments. Yet, since they were not originally designed for DM, recent studies show that LLMs can struggle even in basic online DM problems, failing to achieve low regret or an effective exploration-exploitation tradeoff. To address this, we introduce Iterative Regret-Minimization Fine-Tuning (Iterative RMFT), a post-training procedure that repeatedly distills low-regret decision trajectories back into the base model. At each iteration, the model rolls out multiple decision trajectories, selects the k lowest-regret ones, and fine-tunes itself on them. Unlike prior methods that (a) distill action sequences from known DM algorithms or (b) rely on manually crafted chain-of-thought templates, our approach uses the regret metric to elicit the model's own DM ability and reasoning rationales. Relying on model-generated reasoning avoids rigid output engineering and provides more flexible, natural-language training signals. Empirical results show that Iterative RMFT improves LLMs' DM performance across diverse models, from Transformers with numerical input/output to open-weight LLMs and advanced closed-weight models such as GPT-4o mini. Its flexibility in output and reasoning formats enables generalization across tasks with varying horizons, action spaces, reward processes, and natural-language contexts. Finally, we provide theoretical insight showing that a single-layer Transformer under this paradigm can act as a no-regret learner in a simplified setting. Overall, Iterative RMFT offers a principled and general post-training framework for enhancing LLMs' decision-making capabilities.
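The loop described in the abstract (roll out several trajectories, keep the k with the lowest regret, fine-tune on them, repeat) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the LLM is replaced by a one-parameter epsilon-greedy policy, the environment is a three-armed Bernoulli bandit with made-up means, regret is computed as pseudo-regret against the best arm, and "fine-tuning" is approximated by averaging the parameters of the selected rollouts. The names `rollout`, `select_lowest_regret`, and `iterative_rmft` are hypothetical.

```python
import random


def rollout(epsilon, means, horizon, rng):
    """Play one epsilon-greedy episode on a Bernoulli bandit.

    Returns the action sequence and its pseudo-regret against the best arm.
    """
    counts = [0] * len(means)
    sums = [0.0] * len(means)
    actions = []
    for _ in range(horizon):
        # Explore with probability epsilon, or while some arm is still untried.
        if rng.random() < epsilon or 0 in counts:
            arm = rng.randrange(len(means))
        else:
            arm = max(range(len(means)), key=lambda a: sums[a] / counts[a])
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        actions.append(arm)
    regret = sum(max(means) - means[a] for a in actions)
    return actions, regret


def select_lowest_regret(trajectories, k):
    """Keep the k trajectories with the lowest regret."""
    return sorted(trajectories, key=lambda t: t["regret"])[:k]


def iterative_rmft(epsilon=0.5, iterations=5, n_rollouts=32, k=4,
                   horizon=100, seed=0):
    """Toy stand-in for the rollout / select / fine-tune loop."""
    rng = random.Random(seed)
    means = [0.2, 0.5, 0.8]  # assumed arm means, chosen for illustration
    mean_regrets = []
    for _ in range(iterations):
        trajectories = []
        for _ in range(n_rollouts):
            # Sampling noise stands in for the stochasticity of model rollouts.
            eps = min(1.0, max(0.0, epsilon + rng.gauss(0.0, 0.1)))
            actions, regret = rollout(eps, means, horizon, rng)
            trajectories.append(
                {"epsilon": eps, "actions": actions, "regret": regret}
            )
        best = select_lowest_regret(trajectories, k)
        # "Fine-tune": move the policy parameter to the mean of the selected ones.
        epsilon = sum(t["epsilon"] for t in best) / len(best)
        mean_regrets.append(sum(t["regret"] for t in trajectories) / n_rollouts)
    return epsilon, mean_regrets


if __name__ == "__main__":
    final_epsilon, mean_regrets = iterative_rmft()
    print("final epsilon:", round(final_epsilon, 3))
    print("mean regret per iteration:", [round(r, 2) for r in mean_regrets])
```

In the setting the abstract describes, the selected trajectories would include the model's own natural-language reasoning, and the update step would be supervised fine-tuning on that text instead of a parameter average.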