Dr.LLM: Dynamic Layer Routing in LLMs
Ahmed Heakl, Martin Gubri, Salman Khan, Sangdoo Yun, Seong Joon Oh
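AI summary: This paper proposes Dr.LLM, a framework for dynamic layer routing that adds lightweight per-layer routers to a pretrained model. The routers are trained with explicit supervision and leave the base weights unchanged, improving both the compute efficiency and the accuracy of inference.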
Comments: Published at ICLR 2026
Large Language Models (LLMs) process every token through all layers of a transformer stack, which wastes computation on simple queries and leaves no flexibility for harder ones that need deeper reasoning. Adaptive-depth methods can improve efficiency, but prior approaches rely on costly inference-time search, architectural changes, or large-scale retraining, and in practice often degrade accuracy despite efficiency gains. We introduce Dr.LLM, Dynamic routing of Layers for LLMs, a retrofittable framework that equips pretrained models with lightweight per-layer routers that decide whether to skip, execute, or repeat a block. Routers are trained with explicit supervision: using Monte Carlo Tree Search (MCTS), we derive high-quality layer configurations that preserve or improve accuracy under a compute budget. Our design combines windowed pooling for stable routing, focal loss with class balancing, and bottleneck MLP routers to ensure robustness under class imbalance and long sequences. On ARC (logic) and DART (math), Dr.LLM improves accuracy by up to 3.4 percentage points while saving 5 layers per example on average. Routers generalize to out-of-domain tasks (MMLU, GSM8k, AIME, TruthfulQA, SQuADv2, GPQA, PIQA, AGIEval) with only a 0.85% accuracy drop while retaining efficiency, and outperform prior routing methods by up to 7.7 percentage points. Overall, Dr.LLM shows that explicitly supervised routers can retrofit frozen LLMs for budget-aware, accuracy-driven inference without altering base weights. Code is available at https://github.com/parameterlab/dr-llm.
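To make the routing mechanism described in the abstract more concrete, the following is a minimal, dependency-free sketch of per-layer skip/execute/repeat routing with windowed pooling, a bottleneck MLP router, and a class-weighted focal loss. It is an illustration under stated assumptions, not the authors' implementation: the names `Router`, `windowed_mean_pool`, `routed_forward`, and `focal_loss`, the default window count, the averaging of logits across windows, and the random initialization are all hypothetical choices made here. The released code at the URL above is the authoritative reference.

```python
import math
import random

SKIP, EXECUTE, REPEAT = 0, 1, 2  # the three routing decisions named in the abstract


def windowed_mean_pool(hidden, num_windows):
    """Split a sequence of token vectors into contiguous windows and mean-pool each."""
    n, dim = len(hidden), len(hidden[0])
    num_windows = min(num_windows, n)  # never create empty windows
    pooled = []
    for w in range(num_windows):
        chunk = hidden[w * n // num_windows:(w + 1) * n // num_windows]
        pooled.append([sum(tok[d] for tok in chunk) / len(chunk) for d in range(dim)])
    return pooled


def matvec(weights, x):
    """Multiply a row-major weight matrix by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class Router:
    """Bottleneck MLP: dim -> bottleneck -> 3 logits (skip / execute / repeat)."""

    def __init__(self, dim, bottleneck, rng):
        # Assumption: small random weights; a real router would be trained.
        self.w_down = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(bottleneck)]
        self.w_up = [[rng.gauss(0, 0.1) for _ in range(bottleneck)] for _ in range(3)]

    def logits(self, hidden, num_windows=4):
        per_window = []
        for vec in windowed_mean_pool(hidden, num_windows):
            h = [max(0.0, v) for v in matvec(self.w_down, vec)]  # ReLU
            per_window.append(matvec(self.w_up, h))
        # Assumption: aggregate windows by averaging their logits.
        return [sum(l[c] for l in per_window) / len(per_window) for c in range(3)]

    def decide(self, hidden):
        scores = self.logits(hidden)
        return max(range(3), key=lambda c: scores[c])


def routed_forward(hidden, layers, routers):
    """Apply each frozen layer 0, 1, or 2 times according to its router."""
    executed = 0
    for layer, router in zip(layers, routers):
        for _ in range(router.decide(hidden)):  # 0 = skip, 1 = execute, 2 = repeat
            hidden = layer(hidden)
            executed += 1
    return hidden, executed


def focal_loss(logits, target, class_weights, gamma=2.0):
    """Class-weighted focal loss for one routing label (e.g. from an MCTS-derived path)."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]  # numerically stable softmax
    p_t = exps[target] / sum(exps)
    return -class_weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    rng = random.Random(0)
    dim, depth = 8, 6
    tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(16)]
    # Stand-in "layers": a toy residual update instead of a transformer block.
    layers = [lambda h: [[v + 0.01 for v in tok] for tok in h] for _ in range(depth)]
    routers = [Router(dim, bottleneck=4, rng=rng) for _ in range(depth)]
    _, used = routed_forward(tokens, layers, routers)
    print(f"layer executions: {used} (dense baseline: {depth})")
```

In this sketch the base layers are never modified; only the routers carry trainable parameters, which mirrors the abstract's claim that base weights stay frozen. With `gamma=0` and unit class weights, `focal_loss` reduces to ordinary cross-entropy, which makes it easy to sanity-check.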