MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models
Feihu Huang, Yuning Luo, Songcan Chen
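AI summary: This paper studies the generalization error of the Muon optimizer and proposes an improved mixed Muon optimizer, MiMuon, which is proven to have a lower generalization error while retaining the same convergence rate as Muon.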
Comments: 25 pages
Details
Matrix-structured parameters appear frequently in many artificial intelligence models, such as large language models. Recently, the efficient Muon optimizer was designed for the matrix parameters of large-scale models, and it shows markedly faster convergence than vector-wise algorithms. Although some works have begun to study the convergence properties (i.e., optimization error) of the Muon optimizer, its generalization properties (i.e., generalization error) have not yet been established. In this paper, we therefore study the generalization error of the Muon optimizer based on algorithmic stability and mathematical induction, and prove that Muon has a generalization error of $O\big(\frac{1}{N\kappa^{T}}\big)$, where $N$ is the training sample size, $T$ denotes the number of iterations, and $\kappa>0$ denotes the minimum difference between the singular values of the gradient estimate. To enhance the generalization of Muon, we propose an effective mixed Muon (MiMuon) optimizer that uses orthogonalization of the gradient cautiously and is a hybrid of the Muon and momentum-based SGD optimizers. We then prove that our MiMuon optimizer has a generalization error of $O\big(\frac{1}{N}\big)$, lower than the $O\big(\frac{1}{N\kappa^{T}}\big)$ of the Muon optimizer, since $\kappa$ is generally very small. Meanwhile, we also study the convergence properties of our MiMuon algorithm and prove that it has the same convergence rate of $O\big(\frac{1}{T^{1/4}}\big)$ as the Muon algorithm. Numerical experiments on training large models, including Qwen3-0.6B and YOLO26m, demonstrate the efficiency of the MiMuon optimizer.
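To get a feel for the gap between the two generalization bounds stated above, the following minimal sketch evaluates only the order terms $\frac{1}{N\kappa^{T}}$ and $\frac{1}{N}$, ignoring every constant hidden by the $O(\cdot)$ notation. The function names and the sample values of $N$, $\kappa$, and $T$ are illustrative assumptions, not taken from the paper, and the comparison is only meaningful under the stated regime of small $\kappa$ (here $\kappa<1$).

```python
def muon_bound(n: int, kappa: float, t: int) -> float:
    """Order term 1 / (N * kappa^T) of the Muon bound; constants are ignored."""
    return 1.0 / (n * kappa ** t)


def mimuon_bound(n: int) -> float:
    """Order term 1 / N of the MiMuon bound; constants are ignored."""
    return 1.0 / n


if __name__ == "__main__":
    # Illustrative values only (assumptions, not from the paper).
    n, kappa = 1000, 0.5
    for t in (1, 5, 10):
        ratio = muon_bound(n, kappa, t) / mimuon_bound(n)
        print(f"T={t:2d}  Muon term={muon_bound(n, kappa, t):.6f}  "
              f"MiMuon term={mimuon_bound(n):.6f}  ratio={ratio:.1f}")
```

With $\kappa<1$ the ratio of the two terms is $\kappa^{-T}$, so it grows exponentially in the number of iterations, while the $\frac{1}{N}$ term does not depend on $T$ at all.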