Learning-to-Defer with Expert-Conditional Advice
Yannis Montreuil, Leïna Montreuil, Axel Carlier, Lai Xing Ng, Wei Tsang Ooi
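AI summary: Studies the Learning-to-Defer problem in which additional information (advice) can be supplied to the chosen expert at decision time, proposes an augmented surrogate loss over the composite expert-advice action space, and proves its consistency guarantees and its ability to recover the optimal policy.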
Details
Learning-to-Defer routes each input to the expert that minimizes expected cost, but it assumes that the information available to every expert is fixed at decision time. Many modern systems violate this assumption: after selecting an expert, one may also choose what additional information that expert should receive, such as retrieved documents, tool outputs, or escalation context. We study this problem and call it Learning-to-Defer with advice. We show that a broad family of natural separated surrogates, which learn routing and advice with distinct heads, is inconsistent even in the smallest non-trivial setting. We then introduce an augmented surrogate that operates on the composite expert-advice action space and prove an $\mathcal{H}$-consistency guarantee together with an excess-risk transfer bound, yielding recovery of the Bayes-optimal policy in the limit. Experiments on tabular, language, and multi-modal tasks show that the resulting method improves over standard Learning-to-Defer while adapting its advice-acquisition behavior to the cost regime; a synthetic benchmark confirms the failure mode predicted for separated surrogates.
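To make the idea of a composite expert-advice action space concrete, the following is a minimal toy sketch, not the paper's surrogate or its inconsistency construction. It assumes two experts, two advice options, and a hypothetical table of expected costs for a single input; the names `cost`, `joint_policy`, and `separated_policy` are illustrative only. It compares choosing the (expert, advice) pair jointly with choosing each component independently from averaged costs.

```python
# Toy expected costs for ONE input (hypothetical numbers, not from the paper).
# cost[j][a] = expected cost of deferring to expert j with advice option a
# (a = 0: no extra information, a = 1: e.g. retrieved documents).
cost = [
    [0.50, 0.10],  # expert 0: weak alone, strong once it receives advice
    [0.25, 0.30],  # expert 1: decent alone, advice only adds cost
]


def joint_policy(cost):
    """Pick the (expert, advice) pair with the lowest expected cost."""
    pairs = [(j, a) for j in range(len(cost)) for a in range(len(cost[j]))]
    return min(pairs, key=lambda p: cost[p[0]][p[1]])


def separated_policy(cost):
    """Pick expert and advice independently, each from its averaged costs.

    This is an assumed stand-in for "distinct heads", used for illustration.
    """
    n_experts, n_advice = len(cost), len(cost[0])
    expert_marginal = [sum(row) / n_advice for row in cost]
    advice_marginal = [
        sum(cost[j][a] for j in range(n_experts)) / n_experts
        for a in range(n_advice)
    ]
    j = min(range(n_experts), key=expert_marginal.__getitem__)
    a = min(range(n_advice), key=advice_marginal.__getitem__)
    return (j, a)


joint = joint_policy(cost)
separated = separated_policy(cost)
print("joint choice:", joint, "cost:", cost[joint[0]][joint[1]])
print("separated choice:", separated, "cost:", cost[separated[0]][separated[1]])
```

In this toy table the joint rule selects expert 0 with advice, while the independent rule selects expert 1 with advice and pays a higher expected cost, because the value of advice depends on which expert receives it. This is only an intuition for why the composite action space matters; the paper's formal result concerns surrogate losses and is stated in terms of consistency.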