Abstract
We present a unified analysis of methods for a broad class of problems, variational inequalities, which includes minimization and saddle point problems as special cases. The analysis builds on the extragradient method, a classic technique for solving variational inequalities. We consider the monotone and strongly monotone cases, which correspond to convex-concave and strongly-convex-strongly-concave saddle point problems. The theoretical analysis rests on parametric assumptions about the extragradient iterations and can therefore serve as a strong basis for combining existing methods of various types and for creating new algorithms. To demonstrate this, we develop new robust methods, including methods with quantization, coordinate methods, and distributed randomized local methods. Most of these approaches have never been considered in the generality of variational inequalities and have previously been used only for minimization problems. The robustness of the new methods is confirmed by numerical experiments with GANs.
REFERENCES
F. Facchinei and J.-S. Pang, Finite-Dimensional Variational Inequalities and Complementarity Problems (Springer, New York, 2003). https://doi.org/10.1007/b97544
Yu. Nesterov, “Smooth minimization of non-smooth functions,” Math. Program. 103 (1), 127–152 (2005).
A. Nemirovski, “Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems,” SIAM J. Optim. 15 (1), 229–251 (2004).
A. Chambolle and T. Pock, “A first-order primal-dual algorithm for convex problems with applications to imaging,” J. Math. Imaging Vision 40 (1), 120–145 (2011).
E. Esser, X. Zhang, and T. F. Chan, “A general framework for a class of first order primal-dual algorithms for convex optimization in imaging science,” SIAM J. Imaging Sci. 3 (4), 1015–1046 (2010).
I. J. Goodfellow et al., “Generative adversarial networks” (2014). arXiv:1406.2661 [stat.ML]
F. Hanzely and P. Richtárik, “Federated learning of a mixture of global and local models” (2020). arXiv preprint arXiv:2002.05516
F. Hanzely et al., “Lower bounds and optimal algorithms for personalized federated learning” (2020). arXiv preprint arXiv:2010.02372
G. M. Korpelevich, “The extragradient method for finding saddle points and other problems,” Ekon. Mat. Metody 12 (4), 747–756 (1976).
A. Juditsky, A. Nemirovski, and C. Tauvel, “Solving variational inequalities with stochastic mirror-prox algorithm,” Stochastic Syst. 1 (1), 17–58 (2011).
A. Alacaoglu and Y. Malitsky, “Stochastic variance reduction for variational inequality methods” (2021). arXiv preprint arXiv:2102.08352
H. Robbins and S. Monro, “A stochastic approximation method,” Ann. Math. Stat. 22 (3), 400–407 (1951). https://doi.org/10.1214/aoms/1177729586
R. Johnson and T. Zhang, “Accelerating stochastic gradient descent using predictive variance reduction,” Adv. Neural Inf. Process. Syst. 26, 315–323 (2013).
D. Alistarh et al., “QSGD: Communication-efficient SGD via gradient quantization and encoding,” Advances in Neural Information Processing Systems 30 (NIPS 2017), pp. 1709–1720.
F. Hanzely, K. Mishchenko, and P. Richtarik, “SEGA: Variance reduction via gradient sketching” (2018). arXiv preprint arXiv:1809.03054
E. Gorbunov, F. Hanzely, and P. Richtarik, “A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent,” Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, PMLR (2020), pp. 680–690.
Y.-G. Hsieh et al., “On the convergence of single-call stochastic extra-gradient methods” (2019). arXiv:1908.08465 [math.OC]
K. Mishchenko et al., “Revisiting stochastic extragradient,” Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, PMLR (2020), pp. 4573–4582.
P. Tseng, “A modified forward-backward splitting method for maximal monotone mappings,” SIAM J. Control Optim. 38 (2), 431–446 (2000). https://doi.org/10.1137/S0363012998338806
Yu. Nesterov, “Dual extrapolation and its applications to solving variational inequalities and related problems,” Math. Program. 109 (2), 319–344 (2007).
P. Balamurugan and F. Bach, “Stochastic variance reduction methods for saddle-point problems,” Proceedings of the 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, Ed. by D. Lee et al. (Curran Associates, 2016), pp. 1416–1424. https://proceedings.neurips.cc/paper/2016/file/1aa48fc4880bb0c9b8a3bf979d3b917e-Paper.pdf
T. Chavdarova et al., “Reducing noise in GAN training with variance reduced extragradient” (2019). arXiv preprint arXiv:1904.08598
A. Sidford and K. Tian, “Coordinate methods for accelerating ℓ∞ regression and faster approximate maximum flow,” 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS) (IEEE, 2018), pp. 922–933.
Y. Carmon et al., “Coordinate methods for matrix games” (2020). arXiv preprint arXiv:2009.08447
A. Sadiev et al., “Zeroth-order algorithms for smooth saddle-point problems” (2020). arXiv preprint arXiv:2009.09908
Y. Deng and M. Mahdavi, “Local stochastic gradient descent ascent: Convergence analysis and communication efficiency,” Proceedings of the 24th International Conference on Artificial Intelligence and Statistics PMLR (2021), pp. 1387–1395.
A. Beznosikov, V. Samokhin, and A. Gasnikov, “Distributed saddle-point problems: Lower bounds, optimal algorithms and federated GANs” (2021). arXiv preprint arXiv:2010.13112
S. U. Stich, “Unified optimal analysis of the (stochastic) gradient method” (2019). arXiv preprint arXiv:1907.04232
S. J. Wright, “Coordinate descent algorithms,” Math. Program. 151 (1), 3–34 (2015).
Yu. Nesterov, “Efficiency of coordinate descent methods on huge-scale optimization problems,” SIAM J. Optim. 22 (2), 341–362 (2012).
A. Beznosikov et al., “On biased compression for distributed learning” (2020). arXiv preprint arXiv:2002.12410
S. Barratt and R. Sharma, “A note on the inception score” (2018). arXiv preprint arXiv:1801.01973
A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks” (2016). arXiv:1511.06434 [cs.LG]
M. Mirza and S. Osindero, “Conditional generative adversarial nets” (2014). arXiv preprint arXiv:1411.1784
Funding
This work was supported by the Ministry of Science and Higher Education of the Russian Federation, state assignment no. 075-00337-20-03, project no. 0714-2020-0005.
Ethics declarations
The authors declare that they have no conflicts of interest.
Translated by I. Ruzanova
APPENDIX
A. ANALYSIS OF VARIOUS METHODS
A.1. Extra Step Method
Let us start with the simplest case of (1)–(3), namely, the stochastic setting with uniformly bounded noise [10]:
where \(z\) and \(\xi \) are independent. The following method can be used in this case (Algorithm 1):
Algorithm 1. Extra step method
Parameters: Stepsize \(\gamma \), \(K\).
Initialization: Choose \({{z}^{0}} \in \mathcal{Z}\).
for \(k = 0,1, \ldots ,K - 1\) do
Generate random \({{\xi }^{k}}\), \({{\xi }^{{k + 1/2}}}\),
\({{z}^{{k + 1/2}}} = {\text{pro}}{{{\text{x}}}_{{\gamma h}}}({{z}^{k}} - \gamma F({{z}^{k}},{{\xi }^{k}}))\),
\({{z}^{{k + 1}}} = {\text{pro}}{{{\text{x}}}_{{\gamma h}}}({{z}^{k}} - \gamma F({{z}^{{k + 1/2}}},{{\xi }^{{k + 1/2}}}))\).
end for
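To make the scheme concrete, here is a minimal Python sketch of Algorithm 1 in a hypothetical toy setting (not from the paper): we take \(h \equiv 0\) (so the prox is the identity), an exact oracle (\(\sigma = 0\)), and the monotone, 1-Lipschitz operator \(F(x,y) = (y, -x)\) of the bilinear saddle point \(\min_x \max_y xy\).

```python
import math

def F(z):
    # Monotone operator of the toy saddle point min_x max_y x*y:
    # F(x, y) = (d/dx of xy, -d/dy of xy) = (y, -x).
    x, y = z
    return (y, -x)

def extra_step(z0, gamma=0.1, K=200):
    x, y = z0
    for _ in range(K):
        fx, fy = F((x, y))
        xh, yh = x - gamma * fx, y - gamma * fy    # extrapolation step z^{k+1/2}
        fxh, fyh = F((xh, yh))
        x, y = x - gamma * fxh, y - gamma * fyh    # update step z^{k+1}
    return x, y

x, y = extra_step((1.0, 1.0))
dist = math.hypot(x, y)  # distance to the solution (0, 0)
```

Note that the plain gradient step \(z^{k+1} = z^k - \gamma F(z^k)\) diverges on this bilinear problem; it is the extrapolation step that makes the iterates contract toward the solution.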
Note that, in this algorithm, \(\tau = 0\) and, hence, \({{w}^{k}} = {{z}^{k}}\) for all \(k\). Additionally, we set \({{\sigma }_{k}} = 0.\) The following lemma determines the constants from Assumption 2.
Lemma 2. Assume that \(F\) is bounded-Lipschitz with constants \(L\) and \(D\) (Assumption \(3\)). Then \({{g}^{k}}\) and \({{g}^{{k + 1}}}\) from Algorithm \(1\) satisfy Assumption \(2\) with constants \(A = 3{{L}^{2}}\), \({{D}_{1}} = 3{{D}^{2}} + 6{{\sigma }^{2}}\), and \({{D}_{3}} = {{\sigma }^{2}}\).
Proof. It is easy to see that \({{g}^{{k + 1/2}}}\) is unbiased. Next,
and, finally,
Corollary 1. Assume that \(F\) is bounded-Lipschitz with constants \(L\) and \(D\). Then
\( \bullet \) in the strongly monotone case with \(\gamma \leqslant \min \left\{ {\frac{1}{{6L}};\frac{1}{{4\left( {{{\mu }_{F}} + {{\mu }_{h}}} \right)}}} \right\},\) the extra step method satisfies
\( \bullet \) in the monotone case with \(\gamma \leqslant 1{\text{/}}(3L)\), the extra step method satisfies
With a proper choice of γ (see, e.g., [28]), the following convergence estimates can be obtained:
\( \bullet \) in the strongly monotone case,
\( \bullet \) in the monotone case,
Remark. This analysis covers the smooth case with \(D = 0\). To obtain estimates in the nonsmooth case with a bounded operator F, it suffices to use \(L = 0\) and set \(1{\text{/}}L = + \infty \).
A.2. Past Extra Step Method
The setting is the same as in the preceding subsection, namely, (1)–(3). However, we work with a modification of the classical extra step method.
Algorithm 2. Past extra step method (Past-ES)
Parameters: Stepsize \(\gamma \), \(K\).
Initialization: Choose \({{z}^{0}} \in \mathcal{Z}\).
for \(k = 0,1, \ldots ,K - 1\) do
Generate random \({{\xi }^{{k + 1/2}}}\),
\({{z}^{{k + 1/2}}} = {\text{pro}}{{{\text{x}}}_{{\gamma h}}}({{z}^{k}} - \gamma F({{z}^{{k - 1/2}}},{{\xi }^{{k - 1/2}}}))\),
\({{z}^{{k + 1}}} = {\text{pro}}{{{\text{x}}}_{{\gamma h}}}({{z}^{k}} - \gamma F({{z}^{{k + 1/2}}},{{\xi }^{{k + 1/2}}}))\).
end for
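A sketch of Algorithm 2 on the same toy bilinear problem as before (same assumptions: \(h \equiv 0\), exact oracle). The point of the method is that the extrapolation step reuses the operator value from the previous iteration, so only one fresh oracle call is made per iteration.

```python
import math

def F(z):
    # same toy monotone operator as in the previous sketch: F(x, y) = (y, -x)
    x, y = z
    return (y, -x)

def past_extra_step(z0, gamma=0.1, K=800):
    z = z0
    f_prev = F(z)                    # plays the role of F(z^{-1/2})
    for _ in range(K):
        # z^{k+1/2} reuses the operator value from the previous half step
        zh = (z[0] - gamma * f_prev[0], z[1] - gamma * f_prev[1])
        f_half = F(zh)               # the only oracle call this iteration
        z = (z[0] - gamma * f_half[0], z[1] - gamma * f_half[1])
        f_prev = f_half              # recycle it for the next extrapolation
    return z

x, y = past_extra_step((1.0, 1.0))
dist = math.hypot(x, y)
```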
Lemma 3. Assume that \(F\) is bounded-Lipschitz with constants \(L\) and \(D\) (Assumption 3). Then \({{g}^{k}}\) and \({{g}^{{k + 1}}}\) from Algorithm 2 satisfy Assumption 2 with constants \(\rho = 1{\text{/}}3,\) \(B = 3,\) \(C = 2{{L}^{2}},\) \({{D}_{1}} = 6{{\sigma }^{2}},\) \({{D}_{2}} = 4{{D}^{2}} + 12{{\sigma }^{2}},\) and \({{D}_{3}} = {{\sigma }^{2}}\).
Proof. Let \(\sigma _{k}^{2} = {\text{||}}F({{z}^{{k - 1/2}}}) - F({{z}^{{k + 1/2}}}){\text{|}}{{{\text{|}}}^{2}}\). Then
if we assume that \(\gamma \leqslant 1{\text{/}}(3L)\). Therefore,
and, finally,
Corollary 2. Assume that \(F\) is bounded-Lipschitz with constants \(L\) and \(D.\) Then
\( \bullet \) in the strongly monotone case with \(T = 36\) and \(\gamma \leqslant \min \left\{ {\frac{1}{{12L\sqrt 2 }};\frac{1}{{4\left( {{{\mu }_{F}} + {{\mu }_{h}}} \right)}}} \right\}\), the Past-ES satisfies
\( \bullet \) in the monotone case with \(T = 18\) and \(\gamma \leqslant \frac{1}{{12L\sqrt 2 }}\), the Past-ES satisfies
For the next method, we consider the setting of a finite sum: (1)–(4).
A.3. Variance Reduced Extra Step Method
Algorithm 3. Variance reduced extra step method (VR-ES)
Parameters: Stepsize \(\gamma \), \(K\).
Initialization: Choose \({{z}^{0}} = {{w}^{0}} \in \mathcal{Z}\).
for \(k = 0,1, \ldots ,K - 1\) do
\({{\bar {z}}^{k}} = \tau {{z}^{k}} + (1 - \tau ){{w}^{k}}\)
Generate uniformly random \({{m}_{k}} \in \{ 1, \ldots ,M\} \),
\({{z}^{{k + 1/2}}} = {\text{pro}}{{{\text{x}}}_{{\gamma h}}}({{\bar {z}}^{k}} - \gamma F({{w}^{k}}))\),
\({{z}^{{k + 1}}} = {\text{pro}}{{{\text{x}}}_{{\gamma h}}}({{\bar {z}}^{k}} - \gamma ({{F}_{{{{m}_{k}}}}}({{z}^{{k + 1/2}}}) - {{F}_{{{{m}_{k}}}}}({{w}^{k}}) + F({{w}^{k}})))\),
\({{w}^{{k + 1}}} = \left\{ {\begin{array}{*{20}{l}} {{{z}^{{k + 1}}}\quad {\text{with probability }}1 - \tau ,} \\ {{{w}^{k}}\quad {\text{with probability }}\tau .} \end{array}} \right.\)
end for
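The key property behind Lemma 4 is that the variance-reduced estimator used in the update for \({{z}^{{k + 1}}}\) is unbiased. A small Python check with toy scalar operators (a hypothetical example, assuming \(F\) is the average of the \({{F}_{m}}\), as the uniform sampling of \({{m}_{k}}\) requires):

```python
# Toy finite sum: F(z) = (1/M) * sum_m F_m(z) with scalar linear F_m(z) = a_m * z.
a = [1.0, 2.0, 5.0]
M = len(a)

def F_m(m, z):
    return a[m] * z

def F(z):
    return sum(F_m(m, z) for m in range(M)) / M

z_half, w = 0.7, 0.3

def g(m):
    # SVRG-style estimator from the z^{k+1} update of Algorithm 3
    return F_m(m, z_half) - F_m(m, w) + F(w)

# Exact expectation over the uniform index m_k recovers F(z^{k+1/2}).
expected = sum(g(m) for m in range(M)) / M
```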
We set \({{\sigma }_{k}} = 0\). The following lemma gives the values of the constants from Assumption 2:
Lemma 4. Assume that all \({{F}_{{{{m}_{k}}}}}\) and \(F\) itself are bounded-Lipschitz with constants \(L\) and \(D\) (Assumption 3). Then \({{g}^{k}}\) and \({{g}^{{k + 1}}}\) from Algorithm 3 satisfy Assumption 2 with constants \(A = {{L}^{2}}\), \({{D}_{1}} = {{D}^{2}}\), \(E = 4{{L}^{2}}\), and \({{D}_{3}} = 4{{D}^{2}}\).
Proof. It is easy to see that \({{g}^{{k + 1/2}}}\) is unbiased. Next,
and, finally,
Corollary 3. Assume that all \({{F}_{{{{m}_{k}}}}}\) and \(F\) itself are bounded-Lipschitz with constants \(L\) and \(D\). Then
\( \bullet \) in the strongly monotone case with \(\gamma \leqslant \min \left\{ {\frac{{\sqrt {1 - \tau } }}{{2\sqrt 2 L}};\frac{{1 - \tau }}{{4\left( {{{\mu }_{F}} + {{\mu }_{h}}} \right)}}} \right\}\), VR-ES satisfies
\( \bullet \) in the monotone case with \(\gamma \leqslant \frac{{\sqrt {1 - \tau } }}{{2\sqrt 6 L}}\), VR-ES satisfies
At every iteration, we compute only one of the \(M\) operators. However, when \({{w}^{k}}\) is updated, it is necessary to compute all \(M\) operators at the new point \({{w}^{{k + 1}}}\). Based on this, an optimal value of \(\tau \) can be chosen as follows:
A.4. Coordinate Extra Step Method
Let us go back and once again consider the most common setting without finite sums: (1).
Algorithm 4. Coordinate extra step method (Coord-ES)
Parameters: Stepsize \(\gamma \), \(K\).
Initialization: Choose \({{z}^{0}} = {{w}^{0}} \in \mathcal{Z}\).
for \(k = 0,1, \ldots ,K - 1\) do
\({{\bar {z}}^{k}} = \tau {{z}^{k}} + (1 - \tau ){{w}^{k}}\)
Generate uniformly random \({{i}_{k}} \in \{ 1, \ldots ,d\} \),
\({{z}^{{k + 1/2}}} = {\text{pro}}{{{\text{x}}}_{{\gamma h}}}({{\bar {z}}^{k}} - \gamma F({{w}^{k}}))\),
\({{z}^{{k + 1}}} = {\text{pro}}{{{\text{x}}}_{{\gamma h}}}({{\bar {z}}^{k}} - \gamma (d{{[F({{z}^{{k + 1/2}}})]}_{{{{i}_{k}}}}}{{e}_{{{{i}_{k}}}}} - d{{[F({{w}^{k}})]}_{{{{i}_{k}}}}}{{e}_{{{{i}_{k}}}}} + F({{w}^{k}})))\),
\({{w}^{{k + 1}}} = \left\{ {\begin{array}{*{20}{l}} {{{z}^{{k + 1}}}\quad {\text{with probability }}1 - \tau ,} \\ {{{w}^{k}}\quad {\text{with probability }}\tau .} \end{array}} \right.\)
end for
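Similarly, the coordinate estimator of Algorithm 4 is unbiased: averaging \(d{{[F(z)]}_{i}}{{e}_{i}} - d{{[F(w)]}_{i}}{{e}_{i}} + F(w)\) over all coordinates \(i\) recovers \(F(z)\) exactly. A toy Python check (the operator below is a hypothetical example):

```python
def F(z):
    # toy 3-dimensional linear monotone operator (hypothetical example)
    x, y, s = z
    return [y + 0.1 * x, -x + 0.1 * y, 0.5 * s]

z_half = [0.7, -0.2, 1.0]
w = [0.3, 0.1, -0.5]
d = 3
Fz, Fw = F(z_half), F(w)

def g(i):
    # coordinate estimator: full F(w^k) plus a single rescaled coordinate update
    out = list(Fw)
    out[i] += d * (Fz[i] - Fw[i])
    return out

# Averaging over all d equally likely coordinates gives F(z^{k+1/2}) exactly.
expected = [sum(g(i)[j] for i in range(d)) / d for j in range(d)]
```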
We set \({{\sigma }_{k}} = 0\). The following lemma gives the values of the constants from Assumption 2.
Lemma 5. Assume that \(F\) is bounded-Lipschitz with constants \(L\) and \(D\) (Assumption 3). Then \({{g}^{k}}\) and \({{g}^{{k + 1}}}\) from Algorithm 4 satisfy Assumption 2 with constants \(A = d{{L}^{2}}\), \({{D}_{1}} = d{{D}^{2}}\), \(E = 2(d + 1){{L}^{2}}\), and \({{D}_{3}} = 2(d + 1){{D}^{2}}\).
Proof. It is easy to see that \({{g}^{{k + 1/2}}}\) is unbiased. Next,
and, finally,
Corollary 4. Assume that \(F\) is bounded-Lipschitz with constants \(L\) and \(D\). Then
\( \bullet \) in the strongly monotone case with \(\gamma \leqslant \min \left\{ {\frac{{\sqrt {1 - \tau } }}{{2\sqrt {2d} L}};\frac{{1 - \tau }}{{4\left( {{{\mu }_{F}} + {{\mu }_{h}}} \right)}}} \right\}\), Coord-ES satisfies
\( \bullet \) in the monotone case with \(\gamma \leqslant \frac{{\sqrt {1 - \tau } }}{{2L\sqrt {4d + 2} }}\), Coord-ES satisfies
The optimal value is \(\tau = d{\text{/}}(d + 1)\).
A.5. Quantized Extra Step Method
In this section, we discuss a method that works with quantization operators.
Algorithm 5. Quantized extra step method (Quant-ES)
Parameters: Stepsize \(\gamma \), \(K\).
Initialization: Choose \({{z}^{0}} = {{w}^{0}} \in \mathcal{Z}\).
for \(k = 0,1, \ldots ,K - 1\) do
\({{\bar {z}}^{k}} = \tau {{z}^{k}} + (1 - \tau ){{w}^{k}}\)
\({{z}^{{k + 1/2}}} = {\text{pro}}{{{\text{x}}}_{{\gamma h}}}({{\bar {z}}^{k}} - \gamma F({{w}^{k}}))\),
\({{z}^{{k + 1}}} = {\text{pro}}{{{\text{x}}}_{{\gamma h}}}({{\bar {z}}^{k}} - \gamma Q(F({{z}^{{k + 1/2}}}) - F({{w}^{k}})) + \gamma F({{w}^{k}}))\),
\({{w}^{{k + 1}}} = \left\{ {\begin{array}{*{20}{l}} {{{z}^{{k + 1}}}\quad {\text{with probability }}1 - \tau ,} \\ {{{w}^{k}}\quad {\text{with probability }}\tau .} \end{array}} \right.\)
end for
Definition 1. We say that \(Q(x)\) is a quantization of a vector \(x \in {{\mathbb{R}}^{d}}\) if
for some \(\omega \).
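The displayed condition of Definition 1 is missing from this version; in the quantization literature it is usually unbiasedness, \(\mathbb{E}[Q(x)] = x\), together with a variance bound of the form \(\mathbb{E}\,{\text{||}}Q(x) - x{\text{|}}{{{\text{|}}}^{2}} \leqslant \omega {\text{||}}x{\text{|}}{{{\text{|}}}^{2}}\). A standard example is Rand-\(k\) sparsification, which satisfies these with \(\omega = d{\text{/}}k - 1\); the Python check below verifies both properties exactly by enumerating all \(k\)-subsets:

```python
import itertools

def rand_k(x, subset, k):
    # Rand-k sparsification: keep the chosen coordinates, rescale by d/k
    d = len(x)
    return [(d / k) * v if i in subset else 0.0 for i, v in enumerate(x)]

x = [1.0, -2.0, 3.0]
d, k = len(x), 1
subsets = list(itertools.combinations(range(d), k))

# E[Q(x)] over a uniformly random k-subset equals x (unbiasedness).
mean = [sum(rand_k(x, s, k)[i] for s in subsets) / len(subsets) for i in range(d)]

# E||Q(x) - x||^2 equals (d/k - 1) * ||x||^2, i.e. omega = d/k - 1 = 2 here.
norm2 = sum(v * v for v in x)
var = sum(sum((q - xi) ** 2 for q, xi in zip(rand_k(x, s, k), x))
          for s in subsets) / len(subsets)
```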
Lemma 6. Assume that \(F\) is bounded-Lipschitz with constants \(L\) and \(D\) (Assumption 3). Then \({{g}^{k}}\) and \({{g}^{{k + 1}}}\) from Algorithm 5 satisfy Assumption 2 with constants \(A = \omega {{L}^{2}}\), \({{D}_{1}} = \omega {{D}^{2}}\), \(E = 2(\omega + 1){{L}^{2}}\), and \({{D}_{3}} = 2(\omega + 1){{D}^{2}}\).
Proof. It is easy to see that \({{g}^{{k + 1/2}}}\) is unbiased. Next,
and, finally,
Corollary 5. Assume that \(F\) is bounded-Lipschitz with constants \(L\) and \(D\). Then
\( \bullet \) in the strongly monotone case with \(\gamma \leqslant \min \left\{ {\frac{{\sqrt {1 - \tau } }}{{2L\sqrt {2\omega } }};\frac{{1 - \tau }}{{4\left( {{{\mu }_{F}} + {{\mu }_{h}}} \right)}}} \right\}\), Quant-ES satisfies
\( \bullet \) in the monotone case with \(\gamma \leqslant \frac{{\sqrt {1 - \tau } }}{{2L\sqrt {4\omega + 2} }}\), Quant-ES satisfies
Consider the case \(D = 0\). Quantization is needed to compress the information, with \(\omega \) regarded as a compression coefficient: transmitting an uncompressed operator costs roughly \(\omega \) times more information. However, an uncompressed operator has to be computed, on average, once every \(1{\text{/}}(1 - \tau )\) iterations (when \({{w}^{k}}\) is updated). Based on this, an optimal value of \(\tau \) can be chosen as follows:
A.6. Quantized Variance Reduced Extra Step Method
In what follows, the ideas of quantization and variance reduction are combined for the case of problem (1)–(4) with an operator in the form of a finite sum.
Algorithm 6. Quantized variance reduced extra step method
Parameters: Stepsize \(\gamma \), \(K\).
Initialization: Choose \({{z}^{0}} = {{w}^{0}} \in \mathcal{Z}\).
for \(k = 0,1, \ldots ,K - 1\) do
Generate uniformly random \({{m}_{k}} \in \{ 1, \ldots ,M\} \),
\({{\bar {z}}^{k}} = \tau {{z}^{k}} + (1 - \tau ){{w}^{k}}\),
\({{z}^{{k + 1/2}}} = {\text{pro}}{{{\text{x}}}_{{\gamma h}}}({{\bar {z}}^{k}} - \gamma F({{w}^{k}}))\),
\({{z}^{{k + 1}}} = {\text{pro}}{{{\text{x}}}_{{\gamma h}}}({{\bar {z}}^{k}} - \gamma Q({{F}_{{{{m}_{k}}}}}({{z}^{{k + 1/2}}}) - {{F}_{{{{m}_{k}}}}}({{w}^{k}})) + \gamma F({{w}^{k}}))\),
\({{w}^{{k + 1}}} = \left\{ {\begin{array}{*{20}{l}} {{{z}^{{k + 1}}}\quad {\text{with probability }} 1 - \tau ,} \\ {{{w}^{k}}\quad {\text{ with probability }} \tau .} \end{array}} \right.\)
end for
We set \({{\sigma }_{k}} = 0\). The following lemma gives the values of the constants from Assumption 2.
Lemma 7. Assume that all \({{F}_{{{{m}_{k}}}}}\) and \(F\) itself are bounded-Lipschitz with constants \(L\) and \(D\) (Assumption 3). Then \({{g}^{k}}\) and \({{g}^{{k + 1}}}\) from Algorithm 6 satisfy Assumption 2 with \(A = \omega {{L}^{2}}\), \({{D}_{1}} = \omega {{D}^{2}}\), \(E = 2(\omega + 1){{L}^{2}}\), and \({{D}_{3}} = 2(\omega + 1){{D}^{2}}\).
Proof. It is easy to see that \({{g}^{{k + 1/2}}}\) is unbiased. Next,
and, finally,
Corollary 6. Assume that all \({{F}_{{{{m}_{k}}}}}\) and \(F\) itself are bounded-Lipschitz with constants \(L\) and \(D\). Then the quantized variance reduced extra step method
\( \bullet \) in the strongly monotone case with \(\gamma \leqslant \min \left\{ {\frac{{\sqrt {1 - \tau } }}{{2L\sqrt {2\omega } }};\frac{{1 - \tau }}{{4\left( {{{\mu }_{F}} + {{\mu }_{h}}} \right)}}} \right\}\), satisfies
\( \bullet \) in the monotone case with \(\gamma \leqslant \frac{{\sqrt {1 - \tau } }}{{2L\sqrt {4\omega + 2} }}\), satisfies
The optimal value \(\tau \) is the same as in the preceding subsection.
A.7. Importance Sampling Extra Step Method
Here, we consider a more general case than a finite sum. Specifically, each function has its own weight \({{p}_{m}}\). Namely, consider a discrete random variable \(\eta \) of the form
At every iteration, we call \({{F}_{\eta }}\). The weights/probabilities \({{p}_{m}}\) can be given a priori or can be specified by the user: for example, it is reasonable to use \({{p}_{m}} = {{L}_{m}}{\text{/}}\left( {\sum\nolimits_m \,{{L}_{m}}} \right)\).
Algorithm 7. Importance sampling extra step method
Parameters: Stepsize \(\gamma \), \(K\).
Initialization: Choose \({{z}^{0}} = {{w}^{0}} \in \mathcal{Z}\).
for \(k = 0,1, \ldots ,K - 1\) do
Generate \({{\eta }_{k}}\),
\({{\bar {z}}^{k}} = \tau {{z}^{k}} + (1 - \tau ){{w}^{k}}\),
\({{z}^{{k + 1/2}}} = {\text{pro}}{{{\text{x}}}_{{\gamma h}}}({{\bar {z}}^{k}} - \gamma F({{w}^{k}}))\),
\({{z}^{{k + 1}}} = {\text{pro}}{{{\text{x}}}_{{\gamma h}}}({{\bar {z}}^{k}} - \gamma \frac{1}{{{{p}_{{{{\eta }_{k}}}}}}}({{F}_{{{{\eta }_{k}}}}}({{z}^{{k + 1/2}}}) - {{F}_{{{{\eta }_{k}}}}}({{w}^{k}})) + \gamma F({{w}^{k}}))\),
\({{w}^{{k + 1}}} = \left\{ {\begin{array}{*{20}{l}} {{{z}^{{k + 1}}}\quad {\text{with probability }} 1 - \tau ,} \\ {{{w}^{k}}\quad {\text{ with probability }} \tau .} \end{array}} \right.\)
end for
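A Python check of the unbiasedness of the importance sampling estimator in Algorithm 7, with toy scalar operators (a hypothetical example; we assume here that \(F = \sum\nolimits_m {{{F}_{m}}} \), which is what the \(1{\text{/}}{{p}_{{{{\eta }_{k}}}}}\) scaling requires):

```python
# P(eta = m) = p_m with p_m = L_m / sum_m L_m, as suggested in the text.
L = [1.0, 4.0, 10.0]                 # per-operator Lipschitz constants
p = [Lm / sum(L) for Lm in L]

def F_m(m, z):
    return L[m] * z                  # toy scalar operators (hypothetical)

def F(z):
    return sum(F_m(m, z) for m in range(len(L)))

z_half, w = 0.7, 0.3

def g(m):
    # importance sampling estimator from the z^{k+1} update of Algorithm 7
    return (F_m(m, z_half) - F_m(m, w)) / p[m] + F(w)

# Exact expectation over eta recovers F(z^{k+1/2}): the 1/p_m factor cancels p_m.
expected = sum(p[m] * g(m) for m in range(len(L)))
```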
Lemma 8. Assume that all \({{F}_{m}}\) are bounded-Lipschitz with constants \({{L}_{m}}\) and \({{D}_{m}}\) (Assumption 3), and so is \(F\) with constants \(L\) and \(D\). Then \({{g}^{k}}\) and \({{g}^{{k + 1}}}\) from Algorithm 7 satisfy Assumption 2 with constants \(A = \sum\nolimits_{m = 1}^M {\frac{{L_{m}^{2}}}{{{{p}_{m}}}}} \), \({{D}_{1}} = \sum\nolimits_{m = 1}^M {\frac{{D_{m}^{2}}}{{{{p}_{m}}}}} \), \(E = 2\left( {\sum\nolimits_{m = 1}^M {\frac{{L_{m}^{2}}}{{{{p}_{m}}}}} + {{L}^{2}}} \right)\), and \({{D}_{3}} = 2\left( {\sum\nolimits_{m = 1}^M {\frac{{D_{m}^{2}}}{{{{p}_{m}}}}} + {{D}^{2}}} \right)\).
Proof. It is easy to see that \({{g}^{{k + 1/2}}}\) is unbiased. Next,
and, finally,
Corollary 7. Assume that all \({{F}_{m}}\) are bounded-Lipschitz with constants \({{L}_{m}}\) and \({{D}_{m}}\) (Assumption 3) and so is \(F\) with constants \(L\) and \(D\). Then the importance sampling extra step method
\( \bullet \) in the strongly monotone case with \(\gamma \leqslant \min \left\{ {\frac{{\sqrt {1 - \tau } }}{{2\sqrt 2 \sum\nolimits_{m = 1}^M {\frac{{L_{m}^{2}}}{{{{p}_{m}}}}} }};\frac{{1 - \tau }}{{4\left( {{{\mu }_{F}} + {{\mu }_{h}}} \right)}}} \right\}\), satisfies
\( \bullet \) in the monotone case with \(\gamma \leqslant \frac{{\sqrt {1 - \tau } }}{{2\sqrt {2{{L}^{2}} + 4\sum\nolimits_{m = 1}^M {\frac{{L_{m}^{2}}}{{{{p}_{m}}}}} } }}\), satisfies
A.8. Local Extra Step Method
This method is designed for the distributed problem (1)–(5). It alternates between local iterations and averaging with the server value. A local step is taken with probability \(\tau \), and communication occurs with probability \(1 - \tau \).
Here, \({{Z}_{{{\text{avg}}}}} = [{{\bar {z}}^{T}}, \ldots ,{{\bar {z}}^{T}}] \in {{\mathbb{R}}^{{Md}}}\) with \(\bar {z} = \frac{1}{M}\sum\nolimits_{m = 1}^M \,{{z}_{m}}\) (the same for \({{W}_{{{\text{avg}}}}}\)).
Algorithm 8. Randomized local extra step method
Parameters: Stepsize \(\gamma \), \(K\); probability \(\tau \).
Initialization: Choose \({{z}^{0}} = {{w}^{0}} \in \mathcal{Z}\) and \(z_{m}^{0} = {{z}^{0}}\) for all \(m\).
for \(k = 0,1, \ldots ,K - 1\) do
\({{\bar {Z}}^{k}} = \tau {{Z}^{k}} + (1 - \tau ){{W}^{k}}\),
\({{Z}^{{k + 1/2}}} = {\text{pro}}{{{\text{x}}}_{{\gamma h}}}\left( {{{{\bar {Z}}}^{k}} - \gamma (\Phi ({{W}^{k}}) + \lambda ({{W}^{k}} - W_{{{\text{avg}}}}^{k})} \right)\),
\(G(Z) = \left\{ {\begin{array}{*{20}{l}} {\frac{1}{\tau }\Phi (Z)\quad {\text{ with probability }} \tau ,} \\ {\frac{1}{{1 - \tau }}\lambda (Z - {{Z}_{{{\text{avg}}}}})\quad {\text{with probability }} 1 - \tau ,} \end{array}} \right.\)
\({{Z}^{{k + 1}}} = {\text{pro}}{{{\text{x}}}_{{\gamma h}}}\left( {{{{\bar {Z}}}^{k}} - \gamma (G({{Z}^{{k + 1/2}}}) - G({{W}^{k}}) + \Phi ({{W}^{k}}) + \lambda ({{W}^{k}} - W_{{{\text{avg}}}}^{k}))} \right)\),
\({{W}^{{k + 1}}} = \left\{ {\begin{array}{*{20}{l}} {{{Z}^{{k + 1}}}\quad {\text{with probability }} 1 - \tau ,} \\ {{{W}^{k}}\quad {\text{ with probability }} \tau .} \end{array}} \right.\)
end for
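The randomized operator \(G(Z)\) in Algorithm 8 is an unbiased estimate of the full operator \(\Phi (Z) + \lambda (Z - {{Z}_{{{\text{avg}}}}})\), since the \(1{\text{/}}\tau \) and \(1{\text{/}}(1 - \tau )\) factors exactly cancel the branch probabilities. A scalar per-node Python sketch (toy \(\Phi \) and values, all hypothetical):

```python
tau, lam = 0.6, 0.5

def Phi(Z):
    return [0.8 * z for z in Z]          # toy local operators (hypothetical)

def Z_avg(Z):
    zbar = sum(Z) / len(Z)               # averaging with the server value
    return [zbar] * len(Z)

Z = [1.0, -0.4, 0.2]
avg = Z_avg(Z)
full = [Phi(Z)[m] + lam * (Z[m] - avg[m]) for m in range(len(Z))]

# E[G(Z)] = tau * (1/tau) * Phi(Z) + (1 - tau) * (1/(1-tau)) * lam * (Z - Z_avg)
local_branch = [f / tau for f in Phi(Z)]
comm_branch = [lam * (Z[m] - avg[m]) / (1 - tau) for m in range(len(Z))]
expected = [tau * local_branch[m] + (1 - tau) * comm_branch[m]
            for m in range(len(Z))]
```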
In fact, this method was analyzed in the preceding subsection. Indeed, we have two operators, \(\Phi (Z)\) and \(\lambda (Z - {{Z}_{{{\text{avg}}}}})\), which are \(L\)- and \(\lambda \)-Lipschitz, respectively.
If we use \(\tau = L{\text{/}}(L + \lambda )\), then the following result is true.
Corollary 8. The randomized local extra step method
\( \bullet \) in the strongly monotone case with \(\gamma \leqslant \min \left\{ {\frac{{\sqrt \lambda }}{{2\sqrt 2 {{{(L + \lambda )}}^{{3/2}}}}};\frac{{\sqrt \lambda }}{{4\left( {{{\mu }_{F}} + {{\mu }_{h}}} \right)\sqrt {L + \lambda } }}} \right\}\), satisfies
\( \bullet \) in the monotone case with \(\gamma \leqslant \frac{{\sqrt \lambda }}{{2\sqrt 6 {{{(L + \lambda )}}^{{3/2}}}}}\), satisfies
Based on this, the number of local computations and the number of communications can be estimated as follows:
\( \bullet \) in the strongly monotone case,
\( \bullet \) in the monotone case,
Note that these estimates are fairly good for small values of \(\lambda \).
B. EXPERIMENTS
B.1. Generative Adversarial Networks
As an example of optimizing a minimax value functional by applying different optimization approaches, we consider generative adversarial networks (GANs). A GAN is a framework for estimating generative models via an adversarial process in which two models are trained simultaneously: a generative model \(G\) that captures the data distribution and a discriminative model \(D\) that estimates the probability that a sample came from the training set rather than from \(G\). Thus, \(D(G(z))\) is the (scalar) probability that the output of the generator \(G\) is a real image. According to [6], \(D\) tries to maximize the probability of correctly classifying real and fake samples, \(\log D(x)\), while \(G\) tries to minimize the probability that \(D\) labels its outputs as fakes, \(\log (1 - D(G(z)))\). From that paper, the GAN loss function is
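As a scalar illustration of this value function (toy numbers, not the trained networks): at the classical equilibrium \(D(x) = D(G(z)) = 1{\text{/}}2\), the discriminator objective equals \( - 2\log 2\).

```python
import math

# Scalar sketch of the GAN value function described in the text:
# the discriminator maximizes log D(x) + log(1 - D(G(z))),
# the generator minimizes log(1 - D(G(z))).
def discriminator_objective(d_real, d_fake):
    return math.log(d_real) + math.log(1.0 - d_fake)

def generator_objective(d_fake):
    return math.log(1.0 - d_fake)

# At the equilibrium D(x) = D(G(z)) = 1/2 the value is -2*log(2).
value = discriminator_objective(0.5, 0.5)
```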
The purpose of our experiments was not to achieve state-of-the-art results with a new architecture or a new GAN formulation, but rather to confirm the performance of the SVRG and quantization methods for training GANs.
Data, model, and optimizers. In these experiments, we used the CIFAR-10 dataset. Its training and validation sets contain \(50\,000\) and \(10\,000\) images, respectively, equally distributed over ten classes. For each optimization approach, Adam was used as the optimizer. The hyperparameters of each of the optimizers (for the generator and for the discriminator) were tuned in the course of the experiments (see Table 1).
Based on the results of the convergence analysis (tested after 100 epochs of training), the following parameters were chosen: \({{\beta }_{1}} = 0.9,\) \({{\beta }_{2}} = 0.999,\) and the learning rate \(2 \times {{10}^{{ - 4}}}\) for both optimization methods. The batch size for one gradient calculation was 64.
To check the quality of the output images, we used the Inception score metric (see [32]):
where \(x \sim {{p}_{g}}\) means that the image \(x\) is generated from the distribution \({{p}_{g}}\) and \({{D}_{{KL}}}(p\,\|\,q)\) is the Kullback–Leibler divergence between two distributions \(p\) and \(q\).
This metric allows us to automatically assess the quality of images generated by the model. Experiments have shown that it correlates well with human evaluation on generated CIFAR-10 images. The metric uses the Inception v3 neural network pretrained on the ImageNet dataset and collects statistics of the network's outputs on the generated images. In our experiments, we used the DCGAN architecture [33] with the conditional approach [34]. A feature of this architecture is that the model can be trained to sample images from a particular class, so that the generated samples resemble elements of the desired distribution in the training set.
The main purpose of these experiments was to check that the same architectures can be reused, without modification, across the different optimization approaches. During training, the generator and the discriminator performed the same number of optimization steps.
Results. The results are presented in Figs. 1 and 2; they suggest that the approaches proposed in this paper outperform the original ones.
\( \bullet \) The variance reduction method made it possible to optimize the functional more accurately than the original method throughout each optimization epoch. However, toward the final iterations of a training epoch, the full gradient differed rather widely from the gradients obtained at the last iterations, which introduced inaccuracy into the optimization and led to overfitting of the generator.
\( \bullet \) The method with quantization/clipping (randomly zeroing 70% of the whole model gradient in both the generator and the discriminator, which corresponds to the compression operator Rand70%) made it possible to prevent overfitting of the generator and the discriminator at the early stages, which in turn led to better results.
To sum up, all of the above approaches produced better results than the original training technique and can be used to optimize GANs; moreover, quantization quickly attains the best results while saving computational resources.
B.2. Policeman vs. Burglar
To compare the performance of some of the methods presented in this paper, we consider the Policeman vs. Burglar problem for a square city of 200 by 200 cells. Each cell contains a house and a police booth. The burglar chooses one house to rob, and the policeman chooses a booth in which he will be on duty. The task is to find optimal mixed strategies for the burglar and the policeman, treated as adversarial players in a game: the burglar aims to rob a house \(i\) with the maximal wealth \({{w}_{i}}\), while the policeman aims to choose an optimal booth and catch the burglar, preventing the maximum expected loss caused by the latter. We assume that the probability of catching the burglar for house \(i\) and booth \(j\) is \(\exp \left( { - \theta d(i,j)} \right)\), where \(d\) is the distance function introduced below. This setting can be formulated as the bilinear saddle point problem
for \(x\) and \(y\) being the vectors of probability of choosing some house and booth, respectively, and for the matrices
where the wealth \({{w}^{{(k)}}}\) and the distance function \(d\) are defined as follows (these expressions are easy to understand if \(i\) is regarded as a flattened coordinate on an \(n \times n\) playing field, \(i(x,y) = xn + y\), \(w\) is plotted as a pyramid centered at the center of this field, and \(d\) is the Euclidean distance on it):
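The displayed formulas for the matrices, the wealth, and the distance are missing from this version; the Python sketch below is one plausible reconstruction under the reading described in the text (flattened index \(i(x,y) = xn + y\), pyramid wealth centered on the field, Euclidean distance, and expected burglar gain \({{w}_{i}}(1 - {{e}^{{ - \theta d(i,j)}}})\), since a caught burglar gains nothing); it uses a smaller grid, \(n = 5\), for brevity:

```python
import math

theta, n = 0.6, 5                       # smaller grid than in the experiments

def coords(i):
    return divmod(i, n)                 # invert the flattened index i = x*n + y

def dist(i, j):
    (xi, yi), (xj, yj) = coords(i), coords(j)
    return math.hypot(xi - xj, yi - yj)  # Euclidean distance on the grid

def wealth(i):
    x, y = coords(i)
    c = (n - 1) / 2.0
    return n - (abs(x - c) + abs(y - c))  # pyramid centered on the field

# A[i][j]: burglar's expected gain for house i against booth j; he is caught
# with probability exp(-theta * d(i, j)) and then gains nothing.
A = [[wealth(i) * (1.0 - math.exp(-theta * dist(i, j))) for j in range(n * n)]
     for i in range(n * n)]
```

With this payoff matrix, the burglar's gain vanishes when the policeman sits next to the chosen house and grows with the distance between them, which matches the description in the text.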
In the experiments, the parameters were specified as \(\theta = 0.6,\) \(n = 25,\) and \(\sigma = 3\). In the Coord-ES, Quant-ES, and Past-ES methods, the value of \(\gamma \) was the same as in the Extra Step method, for which it is optimal. We compared the Past method in terms of the number of calls to the oracle \(F\), the method with quantization in terms of the number of transmitted bits, and the coordinate method in terms of the number of used coordinates (Fig. 3).
Beznosikov, A.N., Gasnikov, A.V., Zainullina, K.E. et al. A Unified Analysis of Variational Inequality Methods: Variance Reduction, Sampling, Quantization, and Coordinate Descent. Comput. Math. and Math. Phys. 63, 147–174 (2023). https://doi.org/10.1134/S0965542523020045