GAMMA: Global Bit Allocation for Mixed-Precision Models under Arbitrary Budgets
Zhangyang Yao, Haiyan Zhao, Haoyu Wang, Tianbo Huang, Lihua Zhang, Xu Han
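AI summary: This paper proposes the GAMMA framework, which learns module-wise precision preferences within a post-training pipeline, optimizes a teacher-forced hidden-state reconstruction objective, and uses integer programming to obtain exact budget-feasible allocations. It improves the accuracy of large language models under arbitrary budgets and outperforms fixed-precision baselines and search-based mixed-precision methods.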
Details
Mixed-precision quantization improves the budget-accuracy trade-off for large language models (LLMs) by allocating more bits to sensitive modules. However, automating this allocation at LLM scale faces a unique combination of constraints: learnable approaches require quantization-aware training, which is infeasible for billion-parameter models; training-free alternatives rely on static proxy metrics that miss cross-module interactions and must be recomputed per target budget; and search-based methods are expensive without guaranteeing exact budget compliance. We propose GAMMA, a quantizer-agnostic framework that learns module-wise precision preferences entirely within a post-training pipeline. GAMMA optimizes a teacher-forced hidden-state reconstruction objective under an augmented Lagrangian constraint, and projects the learned preferences into exact budget-feasible discrete assignments via integer programming. A key property is score reuse: because the learned preferences encode a stable sensitivity ranking rather than budget-specific weights, a single training run serves arbitrary deployment targets by re-solving only the integer program, reducing per-budget adaptation from hours to a few minutes. Across Llama and Qwen models (8B-32B), GAMMA outperforms both fixed-precision baselines (up to +12.99 Avg.) and search-based mixed-precision methods (up to +7.00 Avg.), and can match fixed 3-bit quality at 2.5-bit average precision, enabling deployment at substantially smaller memory footprints.
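The projection step and the score-reuse property can be illustrated with a small sketch. The abstract does not spell out GAMMA's actual integer program, so the version below is an assumption: it treats the problem as a multiple-choice knapsack (one bit-width per module, maximize the summed preference score, total bits within budget) and solves it exactly with dynamic programming. The function name `allocate_bits`, the module sizes, and the score values are all hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def allocate_bits(
    sizes: Sequence[int],
    scores: Sequence[Dict[int, float]],
    avg_bits_budget: float,
) -> Tuple[List[int], float]:
    """Pick one bit-width per module, maximizing total score under a budget.

    sizes[i]  : size of module i in integer units (hypothetical values).
    scores[i] : candidate bit-width -> learned preference (hypothetical).
    Returns (chosen bit-width per module, achieved average bits).
    """
    total_size = sum(sizes)
    # Total bit budget in size units, floored; epsilon guards float rounding.
    capacity = math.floor(avg_bits_budget * total_size + 1e-9)

    # best maps exact cost so far -> (best total score, choices so far).
    best: Dict[int, Tuple[float, List[int]]] = {0: (0.0, [])}
    for size, options in zip(sizes, scores):
        nxt: Dict[int, Tuple[float, List[int]]] = {}
        for cost, (total, picks) in best.items():
            for bits, score in options.items():
                new_cost = cost + size * bits
                if new_cost > capacity:
                    continue  # this choice would exceed the budget
                cand = (total + score, picks + [bits])
                if new_cost not in nxt or cand[0] > nxt[new_cost][0]:
                    nxt[new_cost] = cand
        best = nxt

    if not best:
        raise ValueError("budget too small for any assignment")

    _, picks = max(best.values(), key=lambda entry: entry[0])
    used = sum(s * b for s, b in zip(sizes, picks))
    return picks, used / total_size


# Hypothetical scores from a single "training run", reused for two budgets.
sizes = [4, 2, 2]
scores = [
    {2: 0.0, 3: 1.0, 4: 1.5},
    {2: 0.0, 3: 2.0, 4: 3.0},
    {2: 0.0, 3: 0.2, 4: 0.3},
]
for budget in (2.5, 3.0):
    print(budget, allocate_bits(sizes, scores, budget))
```

Only the solver call is repeated per budget and the scores stay fixed, which is the sense in which a single set of learned preferences can serve several deployment targets. A production-scale version would more plausibly hand the same formulation to an off-the-shelf integer programming solver rather than a hand-written dynamic program.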