Convergence of Spectral Descent for Non-smooth Optimization
Yixuan Yang, Yuqing He, Song Li
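AI summary: Studies Spectral Descent (SD), a simplified variant of the Muon optimizer, and its truncated version (TSD), establishing their global linear convergence in non-smooth convex optimization and applying the results to robust low-rank matrix recovery.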
The Muon optimizer has recently demonstrated remarkable empirical success in training large language models. However, the theoretical understanding of its mechanisms remains limited. Current convergence guarantees for Muon rely heavily on smoothness assumptions, leaving its non-smooth convergence behavior largely unexplored. In this work, we take a step toward bridging this gap by investigating Spectral Descent (SD), a simplified variant of Muon, together with its truncated counterpart, Truncated Spectral Descent (TSD). Under convexity, Lipschitz continuity, and sharpness conditions, we establish global linear convergence for both SD and TSD in non-smooth convex formulations. We also study regularized variants equipped with decoupled weight decay and derive sublinear convergence guarantees through their connection with Frank-Wolfe methods. Finally, we apply our theoretical framework to robust low-rank matrix recovery under mixed sparse and dense noise regimes and provide rigorous recovery guarantees. Numerical experiments support the theoretical findings and demonstrate the effectiveness of Muon-type methods for non-smooth optimization.
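
For orientation, the block below sketches what a spectral descent step and a sharpness condition commonly look like. The abstract does not state either one, so everything here is an assumption based on the usual description of Muon without momentum: the notation ($G_k$, $\eta_k$, $\mu$, $\mathcal{X}^{\star}$), the use of the compact SVD, and the choice of norm in the distance are illustrative, and the paper's own definitions (in particular the truncation rule in TSD) may differ.

```latex
% Assumed notation (not taken from the abstract):
%   f                   convex, Lipschitz, possibly non-smooth objective on matrices
%   G_k                 any subgradient of f at the iterate X_k
%   \eta_k > 0          step size at iteration k
%   \mathcal{X}^{\star} set of minimizers, f^{\star} the optimal value
\begin{align*}
  G_k &\in \partial f(X_k), \qquad G_k = U_k \Sigma_k V_k^{\top}
        \quad \text{(compact SVD)}, \\
  X_{k+1} &= X_k - \eta_k \, U_k V_k^{\top}
        \quad \text{(spectral descent step: singular values of } G_k \text{ are discarded)}, \\
  f(X) - f^{\star} &\ge \mu \, \operatorname{dist}(X, \mathcal{X}^{\star})
        \quad \text{for all } X, \ \mu > 0
        \quad \text{(sharpness)}.
\end{align*}
```

Under this reading, the step direction $U_k V_k^{\top}$ has unit spectral norm, so the step length is set entirely by $\eta_k$. Linear convergence results for sharp non-smooth problems are typically obtained with geometrically decaying step sizes; whether the paper uses that schedule is not stated in the abstract.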