On Second-Order Methods for Bilevel Optimization
Jiawen Bi, Jiaxiang Li, Mingyi Hong, Shuzhong Zhang
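AI summary: This paper proposes a single-loop cubic regularized Newton algorithm for bilevel optimization. In the setting with a nonconvex upper level and a strongly convex lower level, it achieves the optimal $\mathcal{O}(\varepsilon^{-1.5})$ total oracle complexity, which the authors describe as the first single-loop result to attain the optimal rate for finding second-order stationary points.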
Bilevel optimization is an indispensable modeling tool for modern machine learning and engineering design. However, the theory and practice of finding second-order stationary points in bilevel optimization remain largely unsettled. Even when the lower-level problem is strongly convex, the hyperfunction it induces is in general nonconvex. Although Cubic Regularized Newton (CRN) methods famously achieve the optimal $\mathcal{O}(\varepsilon^{-1.5})$ rate for finding a second-order stationary point (SOSP) in single-level optimization, it is unclear how to control the accuracy of the hypergradient and hyper-Hessian computations when second-order methods are applied to bilevel problems so that the overall process is efficient. In this paper, we set out to answer this question. In particular, we first formulate a double-loop CRN baseline that achieves the optimal outer rate but requires repeated lower-level solves. Next, we propose a single-loop cubic regularized Newton algorithm that combines one lower-level gradient step with one Newton step for the hypergradient, and prove an overall deterministic $\mathcal{O}(\varepsilon^{-1.5})$ total oracle complexity, which is optimal. In addition, we illustrate that some intuitively simple modifications of our method may fail to preserve the convergence guarantee. To the best of our knowledge, this is the first deterministic single-loop method for the unconstrained NCSC (nonconvex upper-level and strongly convex lower-level) bilevel optimization setting that achieves the optimal $\mathcal{O}(\varepsilon^{-1.5})$ convergence rate for finding an $\varepsilon$-SOSP of the hyperfunction.
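To make the objects named in the abstract concrete, the following is a minimal sketch on a scalar toy problem. It is not the paper's algorithm. The toy objectives, the constants `Q`, `A`, `B`, `M`, and all function names are assumptions introduced for illustration. The lower-level problem is solved exactly in closed form, so the sketch is closer in spirit to a double-loop scheme with an exact inner solve than to the single-loop method. The hypergradient uses the standard implicit-differentiation formula $\nabla\Phi(x) = \nabla_x f - \nabla^2_{xy} g\,[\nabla^2_{yy} g]^{-1}\nabla_y f$ evaluated at $y^*(x)$, and the step minimizes the one-dimensional cubic model $\text{grad}\cdot s + \tfrac{1}{2}\,\text{hess}\cdot s^2 + \tfrac{M}{6}|s|^3$ in closed form.

```python
import math

# Hypothetical toy problem (not from the paper), with scalar x and y:
#   lower level: g(x, y) = 0.5*Q*y**2 - A*x*y      (strongly convex in y since Q > 0)
#   upper level: f(x, y) = 0.5*(y - B)**2 + cos(x) (nonconvex in x)
Q, A, B = 2.0, 1.0, 0.5
M = 1.0  # cubic regularization weight; here |Phi'''(x)| = |sin(x)| <= 1 = M


def lower_solution(x):
    # y*(x) solves dg/dy = Q*y - A*x = 0 exactly.
    return A * x / Q


def hyperfunction(x):
    # Phi(x) = f(x, y*(x))
    return 0.5 * (lower_solution(x) - B) ** 2 + math.cos(x)


def hypergradient(x):
    # Implicit-differentiation formula, evaluated at y = y*(x).
    y = lower_solution(x)
    grad_x_f = -math.sin(x)
    grad_y_f = y - B
    hess_xy_g = -A  # d/dx of dg/dy
    hess_yy_g = Q
    return grad_x_f - hess_xy_g * grad_y_f / hess_yy_g


def hyper_hessian(x):
    # Valid for this toy only: y*(x) is linear in x, so
    # Phi''(x) = f_xx + f_yy * (dy*/dx)**2 = -cos(x) + (A/Q)**2.
    dy_dx = A / Q
    return -math.cos(x) + dy_dx ** 2


def cubic_newton_step(grad, hess, reg):
    # Global minimizer of grad*s + 0.5*hess*s**2 + (reg/6)*|s|**3 in 1-D.
    disc = math.sqrt(hess * hess + 2.0 * reg * abs(grad))
    if hess > 0.0:
        radius = 2.0 * abs(grad) / (hess + disc)  # cancellation-free form
    else:
        radius = (disc - hess) / reg
    direction = -1.0 if grad > 0.0 else 1.0
    return direction * radius


x = 0.0  # start where the hyperfunction has negative curvature
for _ in range(30):
    x += cubic_newton_step(hypergradient(x), hyper_hessian(x), M)

final_grad = hypergradient(x)
final_hess = hyper_hessian(x)
```

Starting from `x = 0.0`, where `hyper_hessian` is negative, the cubic term lets the step use the negative curvature instead of stalling, and the iterates end at a point with a near-zero hypergradient and positive hyper-Hessian. The difficulty the abstract addresses is what happens when `lower_solution` is replaced by a single inexact gradient step, so that both `hypergradient` and `hyper_hessian` are only approximate; this sketch does not attempt that.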