To solve a class of nonconvex minimax optimization problems arising in machine learning, we propose a new stochastic alternating gradient projection algorithm (SAGP). The algorithm builds on the alternating gradient projection algorithm (AGP) and performs its updates with stochastic gradients in the stochastic setting. Compared with the stochastic alternating gradient descent ascent algorithm (SAGDA), its computational efficiency is greatly improved. Numerical experiments on Gaussian distributed data sets show that the new algorithm is feasible and effective.
1. Introduction
This paper considers the following minimax optimization problem:
$\min_{x \in \mathbb{R}^n} \max_{y \in \mathbb{R}^m} f(x, y) \triangleq \mathbb{E}[F(x, y; \xi)],$
where $\xi$ is a random vector with support $\Xi$, and $f(x, y)$ is a smooth function that is convex in $x$ and nonconcave in $y$.
In machine learning, many problems can be formulated in the form $\min_x \max_y f(x, y)$, for example generative adversarial networks (GANs), distributed nonconvex optimization, multi-domain robust learning, and power control and transceiver design problems in signal processing. One of the simplest methods for solving such problems is a natural generalization of gradient descent (GD) called the gradient descent ascent (GDA) algorithm, which at each iteration performs a gradient descent step on $x$ and a gradient ascent step on $y$, using either simultaneous or alternating updates.
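As an illustration only (not part of the original text), the following minimal Python sketch shows one alternating GDA iteration; the gradient oracles grad_x and grad_y and the step sizes tau1, tau2 are assumed placeholders.

def gda_step(x, y, grad_x, grad_y, tau1=0.01, tau2=0.01):
    # One alternating GDA iteration: descent step on x, then ascent step on y.
    # grad_x(x, y) and grad_y(x, y) are assumed to return the partial gradients of f.
    x_new = x - tau1 * grad_x(x, y)       # gradient descent step on x
    y_new = y + tau2 * grad_y(x_new, y)   # gradient ascent step on y, using the updated x
    return x_new, y_new

# Example usage on the bilinear toy problem f(x, y) = x * y:
# x1, y1 = gda_step(1.0, 1.0, lambda x, y: y, lambda x, y: x)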
In the convex-nonconcave setting, solving $\max_y f(x, y)$ for any given $x$ is NP-hard. Almost all existing nested-loop algorithms and some existing single-loop algorithms [12] [13] therefore lose their theoretical guarantees, since these algorithms require solving $\max_y f(x, y)$. At present only the AGP algorithm [14] can solve convex-nonconcave minimax optimization problems whose objective $f(x, y)$ is a smooth function, and its convergence has been proved.
A subclass of convex-nonconcave minimax problems consists of convex-PL minimax problems, in which $f(x, y)$ is convex in $x$ and satisfies the Polyak-Lojasiewicz (PL) condition in $y$. The PL condition was first introduced by Polyak [20], who proved that gradient descent converges globally at a linear rate under it. Assuming that the objective satisfies the PL condition in $y$ is milder than assuming strong concavity in $y$, and does not even require the objective to be concave in $y$; this assumption has been shown to hold for the linear quadratic regulator [21] and for over-parameterized neural networks [22]. For minimax optimization problems, some studies [17] require the variable $y$ to satisfy the PL condition, while others [23] [24] find global solutions by requiring $x$ to satisfy it. Moreover, the PL condition has been shown to apply to certain nonconvex applications in machine learning [21] [25], including problems related to deep neural networks [22] [26], and has therefore attracted wide attention.
The alternating gradient projection algorithm is a single-loop algorithm that updates $x$ and $y$ through two gradient projection steps in each iteration. At the $t$-th iteration, the AGP algorithm performs its updates using the gradient of the following auxiliary function:
$\hat{f}(x, y) = f(x, y) + \frac{b_t}{2}\|x\|^2 - \frac{c_t}{2}\|y\|^2,$
where $b_t \ge 0$ and $c_t \ge 0$ are regularization parameters. In the convex-nonconcave setting, $c_t = 0$ and $\{b_t\}$ is a nonnegative, monotonically decreasing sequence. The framework of the alternating gradient projection algorithm is given below:
Algorithm 1 Alternating gradient projection algorithm (AGP)
Step 1 Input fixed step sizes $\tau_1 > 0$, $\tau_2 > 0$, an initial point $(x_0, y_0)$, and a parameter $b_0 > 0$.
Step 2 Compute $b_t$ and update $x_t$:
$x_{t+1} = P_X\left(x_t - \tau_1 \left(\nabla_x f(x_t, y_t) + b_t x_t\right)\right),$
Step 3 Update $y_t$:
$y_{t+1} = P_Y\left(y_t + \tau_2 \nabla_y f(x_{t+1}, y_t)\right),$
Step 4 If the termination criterion is satisfied, stop; otherwise set $t = t + 1$ and return to Step 2.
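A minimal Python/NumPy sketch of Algorithm 1 follows, intended only as a reading aid; the projection operators proj_X and proj_Y, the gradient oracles, the schedule b_t = b_0/(t+1), and the stopping test are illustrative assumptions rather than prescriptions from the text.

import numpy as np

def agp(x0, y0, grad_x, grad_y, proj_X, proj_Y,
        tau1, tau2, b0, max_iter=1000, tol=1e-6):
    # Alternating gradient projection (AGP) for the convex-nonconcave case (c_t = 0).
    x, y = np.asarray(x0, dtype=float), np.asarray(y0, dtype=float)
    for t in range(max_iter):
        b_t = b0 / (t + 1)  # assumed nonnegative, monotonically decreasing sequence
        # Step 2: projected gradient descent step on x with regularization term b_t * x
        x_new = proj_X(x - tau1 * (grad_x(x, y) + b_t * x))
        # Step 3: projected gradient ascent step on y, using the updated x
        y_new = proj_Y(y + tau2 * grad_y(x_new, y))
        # Step 4: an assumed termination criterion based on the change of the iterates
        if np.linalg.norm(x_new - x) + np.linalg.norm(y_new - y) < tol:
            return x_new, y_new
        x, y = x_new, y_new
    return x, y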
3. PL Condition
PL condition of the objective $f(x, y)$ in $y$: for any fixed $x$, $\max_y f(x, y)$ has a nonempty solution set and a finite optimal value, and there exists $\mu > 0$ such that $\|\nabla_y f(x, y)\|^2 \ge 2\mu \left[\max_y f(x, y) - f(x, y)\right]$ for all $x, y$.
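As a simple illustration not taken from the paper, consider $f(x, y) = x^2 - (y^2 + 3\sin^2 y)$, which is nonconcave in $y$ yet satisfies the PL condition in $y$ (here $\max_y f(x, y) = x^2$, attained at $y = 0$). The short Python/NumPy check below estimates the largest $\mu$ validating the inequality on a grid; the grid range is an arbitrary choice.

import numpy as np

# Check ||grad_y f||^2 >= 2*mu*(max_y f - f); both sides are independent of x here.
ys = np.linspace(-10.0, 10.0, 200001)
ys = ys[np.abs(ys) > 1e-6]                     # exclude the maximizer y = 0
grad_y = -(2.0 * ys + 3.0 * np.sin(2.0 * ys))  # derivative of f with respect to y
gap = ys**2 + 3.0 * np.sin(ys)**2              # max_y f(x, y) - f(x, y)
mu_est = 0.5 * np.min(grad_y**2 / gap)         # largest mu consistent with the grid
print(f"estimated PL constant on this grid: mu is about {mu_est:.4f}")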
4. Stochastic Alternating Gradient Projection Algorithm
As described in [17], for solving NC-PL minimax optimization problems in the stochastic setting, the Stoc-AGDA and Stoc-Smoothed-AGDA algorithms were constructed from the alternating gradient descent ascent (AGDA) algorithm and the Smoothed-AGDA algorithm [16] by performing the updates with stochastic gradients $G_x(x, y, \xi)$ and $G_y(x, y, \xi)$, where $G_x(x, y, \xi)$ and $G_y(x, y, \xi)$ are unbiased stochastic estimators of $\nabla_x f(x, y)$ and $\nabla_y f(x, y)$, respectively, with bounded variance $\sigma^2 > 0$. Motivated by the good numerical performance of these algorithms, for convex-PL minimax optimization problems this paper incorporates stochastic gradients into the alternating gradient projection algorithm (AGP). The framework of the stochastic alternating gradient projection algorithm is given below:
Algorithm 2 Stochastic alternating gradient projection algorithm (SAGP)
Step 1 Input fixed step sizes $\tau_1 > 0$, $\tau_2 > 0$, an initial point $(x_0, y_0)$, and a parameter $b_0 > 0$.
Step 2 for $t = 0, 1, 2, \cdots, T - 1$ do
Step 3 Draw two i.i.d. samples $\xi_1^t, \xi_2^t$.
Step 4 $x_{t+1} = x_t - \tau_1 \left[G_x(x_t, y_t, \xi_1^t) + b_t x_t\right]$.
Step 5 $y_{t+1} = y_t + \tau_2 G_y(x_{t+1}, y_t, \xi_2^t)$.
Step 6 end for
Step 7 Draw $(\tilde{x}, \tilde{y})$ uniformly at random from $\{(x_t, y_t)\}_{t=0}^{T-1}$.
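For concreteness, a minimal Python/NumPy sketch of Algorithm 2 is given below; the stochastic gradient oracles G_x and G_y, the sampler sample_xi for $\xi$, and the schedule b_t = b_0/(t+1) are assumptions made for illustration, not specifications from the text.

import numpy as np

def sagp(x0, y0, G_x, G_y, sample_xi, tau1, tau2, b0, T, seed=None):
    # Stochastic alternating gradient projection (SAGP), unconstrained version.
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x0, dtype=float), np.asarray(y0, dtype=float)
    iterates = []
    for t in range(T):
        b_t = b0 / (t + 1)                   # assumed decreasing regularization schedule
        xi1, xi2 = sample_xi(), sample_xi()  # Step 3: two i.i.d. samples per iteration
        iterates.append((x.copy(), y.copy()))
        x = x - tau1 * (G_x(x, y, xi1) + b_t * x)  # Step 4
        y = y + tau2 * G_y(x, y, xi2)              # Step 5: uses the updated x
    # Step 7: draw an iterate uniformly at random from {(x_t, y_t)}, t = 0, ..., T - 1
    return iterates[rng.integers(T)]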
In the numerical experiments on Gaussian distributed data sets, the following problem is solved:
$\min_{\mu, \sigma} \max_{\phi_1, \phi_2} F(\mu, \sigma, \phi_1, \phi_2) \triangleq \mathbb{E}_{(x_{\mathrm{real}}, z) \sim \mathcal{D}} \left[ D_\phi(x_{\mathrm{real}}) - D_\phi(G_{\mu, \sigma}(z)) \right] - \lambda \|\phi\|^2,$
where the generator is $G_{\mu, \sigma}(z) = \mu + \sigma z$ and the discriminator is $D_\phi(x) = \phi_1 x + \phi_2 x^2$; $x$ denotes either real data or fake data produced by the generator, and $\mathcal{D}$ is the distribution of the real data and the latent variable. The Gaussian data sets for the real data $x_{\mathrm{real}}$ and the latent variable $z$ are generated from a normal distribution with mean 0 and variance 1, and $x_{\mathrm{real}}$ is produced with $\hat{\mu} = 0$, $\hat{\sigma} = 0.1$. The batch size is fixed at 100 and the experiment is repeated 3 times. The experimental results are shown in Figure 1.
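The following Python/NumPy snippet sketches how the stochastic gradient oracles for this experiment can be formed and plugged into the SAGP sketch above, with $x = (\mu, \sigma)$ and $y = (\phi_1, \phi_2)$; the regularization weight lam, the random seed, and the step sizes in the commented example are illustrative values not stated in the excerpt.

import numpy as np

rng = np.random.default_rng(0)
mu_hat, sigma_hat = 0.0, 0.1   # parameters generating the real data
lam, batch = 0.001, 100        # lam is an assumed regularization weight; batch size 100

def sample_xi():
    # One sample xi is a mini-batch of real data x_real and latent variables z ~ N(0, 1).
    x_real = mu_hat + sigma_hat * rng.standard_normal(batch)
    z = rng.standard_normal(batch)
    return x_real, z

def G_x(x, y, xi):
    # Mini-batch gradient of F with respect to (mu, sigma); fake = G_{mu, sigma}(z).
    (mu, sigma), (phi1, phi2), (x_real, z) = x, y, xi
    fake = mu + sigma * z
    g_mu = np.mean(-phi1 - 2.0 * phi2 * fake)
    g_sigma = np.mean((-phi1 - 2.0 * phi2 * fake) * z)
    return np.array([g_mu, g_sigma])

def G_y(x, y, xi):
    # Mini-batch gradient of F with respect to (phi1, phi2), including the -lam*||phi||^2 term.
    (mu, sigma), (phi1, phi2), (x_real, z) = x, y, xi
    fake = mu + sigma * z
    g_phi1 = np.mean(x_real - fake) - 2.0 * lam * phi1
    g_phi2 = np.mean(x_real**2 - fake**2) - 2.0 * lam * phi2
    return np.array([g_phi1, g_phi2])

# Example call with assumed initial point and step sizes:
# x_tilde, y_tilde = sagp(np.array([1.0, 1.0]), np.zeros(2), G_x, G_y, sample_xi,
#                         tau1=0.05, tau2=0.05, b0=0.1, T=1000)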
References
Nemirovski, A. (2004) Prox-Method with Rate of Convergence O(1/t) for Variational Inequalities with Lipschitz Continuous Monotone Operators and Smooth Convex-Concave Saddle Point Problems. SIAM Journal on Optimization, 15, 229-251. https://doi.org/10.1137/S1052623403425629
Nesterov, Y. (2007) Dual Extrapolation and Its Applications to Solving Variational Inequalities and Related Problems. Mathematical Programming, 109, 319-344. https://doi.org/10.1007/s10107-006-0034-z
Monteiro, R.D.C. and Svaiter, B.F. (2010) On the Complexity of the Hybrid Proximal Extragradient Method for the Iterates and the Ergodic Mean. SIAM Journal on Optimization, 20, 2755-2787. https://doi.org/10.1137/090753127
Monteiro, R.D.C. and Svaiter, B.F. (2011) Complexity of Variants of Tseng’s Modified F-B Splitting and Korpelevich’s Methods for Hemivariational Inequalities with Applications to Saddle-Point and Convex Optimization Problems. SIAM Journal on Optimization, 21, 1688-1720. https://doi.org/10.1137/100801652
Abernethy, J., Lai, K.A. and Wibisono, A. (2019) Last-Iterate Convergence Rates for Min-Max Optimization. ArXiv Preprint arXiv: 1906.02027.
Rafique, H., Liu, M., Lin, Q. and Yang, T. (2018) Non-Convex Min-Max Optimization: Provable Algorithms and Applications in Machine Learning. ArXiv Preprint arXiv: 1810.02060.
Nouiehed, M., Sanjabi, M., Huang, T., Lee, J.D. and Razaviyayn, M. (2019) Solving a Class of Non-Convex Min-Max Games Using Iterative First Order Methods. 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, 8-14 December 2019, 14934-14942.
Thekumparampil, K.K., Jain, P., Netrapalli, P. and Oh, S. (2019) Efficient Algorithms for Smooth Minimax Optimization. 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, 8-14 December 2019, 12659-12670.
Kong, W. and Monteiro, R.D.C. (2021) An Accelerated Inexact Proximal Point Method for Solving Nonconvex-Concave Min-Max Problems. SIAM Journal on Optimization, 31, 2558-2585. https://doi.org/10.1137/20M1313222
Lin, T., Jin, C. and Jordan, M.I. (2020) Near-Optimal Algorithms for Minimax Optimization. In: Abernethy, J. and Agarwal, S., Eds., Proceedings of Thirty Third Conference on Learning Theory, PMLR 125, ML Research Press, Maastricht, 2738-2779.
Lin, T., Jin, C. and Jordan, M.I. (2020) On Gradient Descent Ascent for Nonconvex-Concave Minimax Problems. In: Daumé III, H. and Singh, A., Eds., Proceedings of the 37th International Conference on Machine Learning, PMLR 119, ML Research Press, Maastricht, 6083-6093.
Jin, C., Netrapalli, P. and Jordan, M.I. (2019) Minmax Optimization: Stable Limit Points of Gradient Descent Ascent Are Locally Optimal. ArXiv Preprint arXiv: 1902.00618.
Lu, S., Tsaknakis, I., Hong, M. and Chen, Y. (2020) Hybrid Block Successive Approximation for One-Sided Non-Convex Min-Max Problems: Algorithms and Applications. IEEE Transactions on Signal Processing, 68, 3676-3691. https://doi.org/10.1109/TSP.2020.2986363
Xu, Z., Zhang, H., Xu, Y. and Lan, G. (2020) A Unified Single-Loop Alternating Gradient Projection Algorithm for Nonconvex-Concave and Convex-Nonconcave Minimax Problems. ArXiv Preprint arXiv: 2006.02032.
Boţ, R.I. and Böhm, A. (2020) Alternating Proximal-Gradient Steps for (Stochastic) Nonconvex-Concave Minimax Problems. ArXiv Preprint arXiv: 2007.13605.
Zhang, J., Xiao, P., Sun, R. and Luo, Z.-Q. (2020) A Single-Loop Smoothed Gradient Descent-Ascent Algorithm for Nonconvex-Concave Min-Max Problems. 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, 6-12 December 2020.
Yang, J., Orvieto, A., Lucchi, A. and He, N. (2022) Faster Single-Loop Algorithms for Minimax Optimization without Strong Concavity. In: Camps-Valls, G., Ruiz, F.J.R. and Valera, I., Eds., Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, PMLR 151, ML Research Press, Maastricht, 5485-5517.
Chambolle, A. and Pock, T. (2016) On the Ergodic Convergence Rates of a First-Order Primal-Dual Algorithm. Mathematical Programming, 159, 253-287. https://doi.org/10.1007/s10107-015-0957-3
Daskalakis, C. and Panageas, I. (2018) The Limit Points of (Optimistic) Gradient Descent in Min-Max Optimization. 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, 2-8 December 2018, 9236-9246.
Polyak, B.T. (1963) Gradient Methods for Minimizing Functionals. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 3, 643-653.
Fazel, M., Ge, R., Kakade, S. and Mesbahi, M. (2018) Global Convergence of Policy Gradient Methods for the Linear Quadratic Regulator. In: Dy, J. and Krause, A., Eds., Proceedings of the 35th International Conference on Machine Learning, PMLR 80, ML Research Press, Maastricht, 1467-1476.
Liu, C., Zhu, L. and Belkin, M. (2020) Loss Landscapes and Optimization in Over-Parameterized Non-Linear Systems and Neural Networks. ArXiv Preprint arXiv: 2003.00307.
Yang, J., Kiyavash, N. and He, N. (2020) Global Convergence and Variance Reduction for a Class of Nonconvex-Nonconcave Minimax Problems. 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, 6-12 December 2020, 1153-1165.
Guo, Z., Liu, M., Yuan, Z., Shen, L., Liu, W. and Yang, T. (2020) Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks. In: Daumé III, H. and Singh, A., Eds., Proceedings of the 37th International Conference on Machine Learning, PMLR 119, ML Research Press, Maastricht, 3864-3874.
Cai, Q., Hong, M., Chen, Y. and Wang, Z. (2019) On the Global Convergence of Imitation Learning: A Case for Linear Quadratic Regulator. ArXiv Preprint arXiv: 1901.03674.
Du, S., Lee, J., Li, H., Wang, L. and Zhai, X. (2019) Gradient Descent Finds Global Minima of Deep Neural Networks. In: Chaudhuri, K. and Salakhutdinov, R., Eds., Proceedings of the 36th International Conference on Machine Learning, PMLR 97, ML Research Press, Maastricht, 1675-1685.
Rafique, H., Liu, M., Lin, Q. and Yang, T. (2022) Weakly-Convex-Concave Min-Max Optimization: Provable Algorithms and Applications in Machine Learning. Optimization Methods and Software, 37, 1087-1121. https://doi.org/10.1080/10556788.2021.1895152