A Unified Analysis of Variational Inequality Methods: Variance Reduction, Sampling, Quantization, and Coordinate Descent

Abstract

We present a unified analysis of methods for a wide class of problems, namely, variational inequalities, which include minimization and saddle point problems as special cases. The analysis builds on the extragradient method, a classical technique for solving variational inequalities. We consider the monotone and strongly monotone cases, which correspond to convex-concave and strongly-convex-strongly-concave saddle point problems. The theoretical analysis is based on parametric assumptions about the extragradient iterations. Therefore, it can serve as a strong basis for combining existing methods of various types and for creating new algorithms. To demonstrate this, we develop new robust methods, including methods with quantization, coordinate methods, and distributed randomized local methods. Most of these approaches have never been considered in the generality of variational inequalities and have previously been used only for minimization problems. The robustness of the new methods is confirmed by numerical experiments with GANs.

REFERENCES

  1. F. Facchinei and J.-S. Pang, Finite-Dimensional Variational Inequalities and Complementarity Problems (Springer, New York, 2003). https://doi.org/10.1007/b97544

  2. Yu. Nesterov, “Smooth minimization of non-smooth functions,” Math. Program. 103 (1), 127–152 (2005).

  3. A. Nemirovski, “Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems,” SIAM J. Optim. 15 (1), 229–251 (2004).

  4. A. Chambolle and T. Pock, “A first-order primal-dual algorithm for convex problems with applications to imaging,” J. Math. Imaging Vision 40 (1), 120–145 (2011).

  5. E. Esser, X. Zhang, and T. F. Chan, “A general framework for a class of first order primal-dual algorithms for convex optimization in imaging science,” SIAM J. Imaging Sci. 3 (4), 1015–1046 (2010).

  6. I. J. Goodfellow et al., “Generative adversarial networks” (2014). arXiv: 1406.2661 [stat.ML]

  7. F. Hanzely and P. Richtárik, “Federated learning of a mixture of global and local models” (2020). arXiv preprint arXiv:2002.05516

  8. F. Hanzely et al., “Lower bounds and optimal algorithms for personalized federated learning” (2020). arXiv preprint arXiv:2010.02372

  9. G. M. Korpelevich, “The extragradient method for finding saddle points and other problems,” Ekon. Mat. Metody 12 (4), 747–756 (1976).

  10. A. Juditsky, A. Nemirovski, and C. Tauvel, “Solving variational inequalities with stochastic mirror-prox algorithm,” Stochastic Syst. 1 (1), 17–58 (2011).

  11. A. Alacaoglu and Y. Malitsky, “Stochastic variance reduction for variational inequality methods” (2021). arXiv preprint arXiv:2102.08352

  12. H. Robbins and S. Monro, “A stochastic approximation method,” Ann. Math. Stat. 22 (3), 400–407 (1951). https://doi.org/10.1214/aoms/1177729586

  13. R. Johnson and T. Zhang, “Accelerating stochastic gradient descent using predictive variance reduction,” Adv. Neural Inf. Process. Syst. 26, 315–323 (2013).

  14. D. Alistarh et al., “QSGD: Communication-efficient SGD via gradient quantization and encoding,” Advances in Neural Information Processing Systems 30 (NIPS 2017), pp. 1709–1720.

  15. F. Hanzely, K. Mishchenko, and P. Richtarik, “SEGA: Variance reduction via gradient sketching” (2018). arXiv preprint arXiv:1809.03054

  16. E. Gorbunov, F. Hanzely, and P. Richtarik, “A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent,” Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, PMLR (2020), pp. 680–690.

  17. Y.-G. Hsieh et al., “On the convergence of single-call stochastic extra-gradient methods” (2019). arXiv:1908.08465 [math.OC]

  18. K. Mishchenko et al., “Revisiting stochastic extragradient,” Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, PMLR (2020), pp. 4573–4582.

  19. P. Tseng, “A modified forward-backward splitting method for maximal monotone mappings,” SIAM J. Control Optim. 38 (2), 431–446 (2000). https://doi.org/10.1137/S0363012998338806

  20. Yu. Nesterov, “Dual extrapolation and its applications to solving variational inequalities and related problems,” Math. Program. 109 (2), 319–344 (2007).

  21. P. Balamurugan and F. Bach, “Stochastic variance reduction methods for saddle-point problems,” Proceedings of the 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, Ed. by D. Lee et al. (Curran Associates, 2016), pp. 1416–1424. https://proceedings.neurips.cc/paper/2016/file/1aa48fc4880bb0c9b8a3bf979d3b917e-Paper.pdf

  22. T. Chavdarova et al., “Reducing noise in GAN training with variance reduced extragradient” (2019). arXiv preprint arXiv:1904.08598

  23. A. Sidford and K. Tian, “Coordinate methods for accelerating ℓ∞ regression and faster approximate maximum flow,” 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS) (IEEE, 2018), pp. 922–933.

  24. Y. Carmon et al., “Coordinate methods for matrix games” (2020). arXiv preprint arXiv:2009.08447

  25. A. Sadiev et al., “Zeroth-order algorithms for smooth saddle-point problems” (2020). arXiv preprint arXiv:2009.09908

  26. Y. Deng and M. Mahdavi, “Local stochastic gradient descent ascent: Convergence analysis and communication efficiency,” Proceedings of the 24th International Conference on Artificial Intelligence and Statistics PMLR (2021), pp. 1387–1395.

  27. A. Beznosikov, V. Samokhin, and A. Gasnikov, “Distributed saddle-point problems: Lower bounds, optimal algorithms and federated GANs” (2021). arXiv preprint arXiv:2010.13112

  28. S. U. Stich, “Unified optimal analysis of the (stochastic) gradient method” (2019). arXiv preprint arXiv:1907.04232

  29. S. J. Wright, “Coordinate descent algorithms,” Math. Program. 151 (1), 3–34 (2015).

  30. Yu. Nesterov, “Efficiency of coordinate descent methods on huge-scale optimization problems,” SIAM J. Optim. 22 (2), 341–362 (2012).

  31. A. Beznosikov et al., “On biased compression for distributed learning” (2020). arXiv preprint arXiv:2002.12410

  32. S. Barratt and R. Sharma, “A note on the inception score” (2018). arXiv preprint arXiv:1801.01973

  33. A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks” (2016). arXiv:1511.06434 [cs.LG]

  34. M. Mirza and S. Osindero, “Conditional generative adversarial nets” (2014). arXiv preprint arXiv:1411.1784

Funding

This work was supported by the Ministry of Science and Higher Education of the Russian Federation, state assignment no. 075-00337-20-03, project no. 0714-2020-0005.

Author information

Corresponding author

Correspondence to A. V. Gasnikov.

Ethics declarations

The authors declare that they have no conflicts of interest.

Additional information

Translated by I. Ruzanova

APPENDIX

A. ANALYSIS OF VARIOUS METHODS

A.1. Extra Step Method

Let us start with the simplest case of (1)–(3), namely, the stochastic setting with uniformly bounded noise [10]:

$$F(z) = \mathbb{E}\left[ {F(z,\xi )} \right],\quad \mathbb{E}\left[ {{{{\left\| {F(z,\xi ) - F(z)} \right\|}}^{2}}} \right] \leqslant {{\sigma }^{2}},$$

where \(z\) and \(\xi \) are independent. The following method can be used in this case (Algorithm 1):

Algorithm 1. Extra step method

Parameters: Stepsize \(\gamma \), \(K\).

Initialization: Choose \({{z}^{0}} \in \mathcal{Z}\).

for \(k = 0,1, \ldots ,K - 1\) do

Generate random \({{\xi }^{k}}\), \({{\xi }^{{k + 1/2}}}\),

\({{z}^{{k + 1/2}}} = {\text{pro}}{{{\text{x}}}_{{\gamma h}}}({{z}^{k}} - \gamma F({{z}^{k}},{{\xi }^{k}}))\),

\({{z}^{{k + 1}}} = {\text{pro}}{{{\text{x}}}_{{\gamma h}}}({{z}^{k}} - \gamma F({{z}^{{k + 1/2}}},{{\xi }^{{k + 1/2}}}))\).

end for
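
As an illustration (ours, not part of the original text), one iteration of Algorithm 1 can be sketched in Python as follows; the proximal operator and the stochastic oracle are assumed to be supplied by the user, and all names are hypothetical.

```python
import numpy as np

def extra_step(prox, oracle, z0, gamma, K, rng=None):
    """Sketch of Algorithm 1 (extra step method).

    prox(v, gamma) -- proximal operator of gamma * h evaluated at v
    oracle(z, rng) -- unbiased stochastic sample F(z, xi)
    """
    if rng is None:
        rng = np.random.default_rng()
    z = np.asarray(z0, dtype=float)
    for _ in range(K):
        g = oracle(z, rng)                    # F(z^k, xi^k)
        z_half = prox(z - gamma * g, gamma)   # extrapolation point z^{k+1/2}
        g_half = oracle(z_half, rng)          # F(z^{k+1/2}, xi^{k+1/2})
        z = prox(z - gamma * g_half, gamma)   # both prox steps start from z^k
    return z
```

For an unconstrained problem with \(h \equiv 0\), one can simply take prox = lambda v, gamma: v.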

Note that, in this algorithm, \(\tau = 0\) and, hence, \({{w}^{k}} = {{z}^{k}}\) for all \(k\). Additionally, we set \({{\sigma }_{k}} = 0.\) The following lemma determines the constants from Assumption 2.

Lemma 2. Assume that \(F\) is bounded-Lipschitz with constants \(L\) and \(D\) (Assumption \(3\)). Then \({{g}^{k}}\) and \({{g}^{{k + 1}}}\) from Algorithm \(1\) satisfy Assumption \(2\) with constants \(A = 3{{L}^{2}}\), \({{D}_{1}} = 3{{D}^{2}} + 6{{\sigma }^{2}}\), and \({{D}_{3}} = {{\sigma }^{2}}\).

Proof. It is easy to see that \({{g}^{{k + 1/2}}}\) is unbiased. Next,

$$\begin{gathered} \mathbb{E}\left[ {{{{\left\| {{{g}^{{k + 1/2}}} - {{g}^{k}}} \right\|}}^{2}}} \right] = \mathbb{E}\left[ {{{{\left\| {F({{z}^{{k + 1/2}}},{{\xi }^{{k + 1/2}}}) - F({{z}^{k}},{{\xi }^{k}})} \right\|}}^{2}}} \right] \leqslant 3\mathbb{E}\left[ {{{{\left\| {F({{z}^{{k + 1/2}}}) - F({{z}^{k}})} \right\|}}^{2}}} \right] \\ + \;3\mathbb{E}\left[ {{{{\left\| {F({{z}^{{k + 1/2}}},{{\xi }^{{k + 1/2}}}) - F({{z}^{{k + 1/2}}})} \right\|}}^{2}}} \right] + 3\mathbb{E}\left[ {{{{\left\| {F({{z}^{k}},{{\xi }^{k}}) - F({{z}^{k}})} \right\|}}^{2}}} \right] \\ \, \leqslant 3{{L}^{2}}\mathbb{E}\left[ {{{{\left\| {{{z}^{{k + 1/2}}} - {{z}^{k}}} \right\|}}^{2}}} \right] + 3{{D}^{2}} + 6{{\sigma }^{2}}, \\ \end{gathered} $$

and, finally,

$$\mathbb{E}\left[ {{{{\left\| {{{g}^{{k + 1/2}}} - F({{z}^{{k + 1/2}}})} \right\|}}^{2}}} \right] = \mathbb{E}\left[ {{{{\left\| {F({{z}^{{k + 1/2}}},{{\xi }^{{k + 1/2}}}) - F({{z}^{{k + 1/2}}})} \right\|}}^{2}}} \right] \leqslant {{\sigma }^{2}}.$$

Corollary 1. Assume that \(F\) is bounded-Lipschitz with constants \(L\) and \(D\) . Then

\( \bullet \) in the strongly monotone case with \(\gamma \leqslant \min \left\{ {\frac{1}{{6L}};\frac{1}{{4\left( {{{\mu }_{F}} + {{\mu }_{h}}} \right)}}} \right\},\) the extra step method satisfies

$$\mathbb{E}\left[ {{{{\left\| {{{z}^{K}} - z\text{*}} \right\|}}^{2}}} \right] \leqslant {{\left( {1 - \gamma \frac{{{{\mu }_{F}} + {{\mu }_{h}}}}{{16}}} \right)}^{{K - 1}}}{{\left\| {{{z}^{0}} - z\text{*}} \right\|}^{2}} + \frac{{96\gamma ({{D}^{2}} + 2{{\sigma }^{2}})}}{{{{\mu }_{F}} + {{\mu }_{h}}}};$$

\( \bullet \) in the monotone case with \(\gamma \leqslant 1{\text{/}}(3L)\), the extra step method satisfies

$$\mathbb{E}\left[ {{\text{Gap}}({{{\bar {z}}}^{K}})} \right] \leqslant \frac{{8{{{\max }}_{{u \in \mathcal{C}}}}\left[ {{{{\left\| {{{z}^{0}} - u} \right\|}}^{2}}} \right]}}{{\gamma K}} + \gamma (21{{D}^{2}} + 43{{\sigma }^{2}}).$$

With a proper choice of γ (see, e.g., [28]), the following convergence estimates can be obtained:

\( \bullet \) in the strongly monotone case,

$$\mathbb{E}\left[ {{{{\left\| {{{z}^{K}} - z\text{*}} \right\|}}^{2}}} \right] = \tilde {\mathcal{O}}\left( {\exp \left( { - \frac{{({{\mu }_{F}} + {{\mu }_{h}})(K - 1)}}{{96L}}} \right){{{\left\| {{{z}^{0}} - z\text{*}} \right\|}}^{2}}} \right. + \left. {\frac{{({{D}^{2}} + {{\sigma }^{2}})}}{{{{{({{\mu }_{F}} + {{\mu }_{h}})}}^{2}}(K - 1)}}} \right),$$

\( \bullet \) in the monotone case,

$$\mathbb{E}\left[ {{\text{Gap}}({{{\bar {z}}}^{K}})} \right] = \mathcal{O}\left( {\frac{{L{{{\max }}_{{u \in \mathcal{C}}}}\left[ {{{{\left\| {{{z}^{0}} - u} \right\|}}^{2}}} \right]}}{K} + \frac{{(D + \sigma ){{{\max }}_{{u \in \mathcal{C}}}}\left[ {\left\| {{{z}^{0}} - u} \right\|} \right]}}{{\sqrt K }}} \right).$$

Remark. This analysis covers the smooth case with \(D = 0\). To obtain estimates in the nonsmooth case with a bounded operator F, it suffices to use \(L = 0\) and set \(1{\text{/}}L = + \infty \).

A.2. Past Extra Step Method

The setting is the same as in the preceding subsection, namely, (1)–(3). However, we work with a modification of the classical extra step method.

Algorithm 2. Past extra step method (Past-ES)

Parameters: Stepsize \(\gamma \), \(K\).

Initialization: Choose \({{z}^{0}} \in \mathcal{Z}\).

for \(k = 0,1, \ldots ,K - 1\) do

Generate random \({{\xi }_{{k + 1/2}}}\),

\({{z}^{{k + 1/2}}} = {\text{pro}}{{{\text{x}}}_{{\gamma h}}}({{z}^{k}} - \gamma F({{z}^{{k - 1/2}}},{{\xi }_{{k - 1/2}}}))\),

\({{z}^{{k + 1}}} = {\text{pro}}{{{\text{x}}}_{{\gamma h}}}({{z}^{k}} - \gamma F({{z}^{{k + 1/2}}},{{\xi }_{{k + 1/2}}}))\),

end for
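
A minimal sketch (our illustration, with hypothetical names) emphasizing the difference from Algorithm 1: the extrapolation step reuses the oracle value stored at the previous half step, so only one fresh stochastic oracle call is made per iteration.

```python
import numpy as np

def past_extra_step(prox, oracle, z0, gamma, K, rng=None):
    """Sketch of Algorithm 2 (Past-ES); one oracle call per iteration."""
    if rng is None:
        rng = np.random.default_rng()
    z = np.asarray(z0, dtype=float)
    g_prev = oracle(z, rng)                       # stands for F(z^{-1/2}, xi_{-1/2})
    for _ in range(K):
        z_half = prox(z - gamma * g_prev, gamma)  # reuse the stored oracle value
        g_half = oracle(z_half, rng)              # the only fresh oracle call
        z = prox(z - gamma * g_half, gamma)
        g_prev = g_half                           # remember for the next iteration
    return z
```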

Lemma 3. Assume that \(F\) is bounded-Lipschitz with constants \(L\) and \(D\) (Assumption 3). Then \({{g}^{k}}\) and \({{g}^{{k + 1}}}\) from Algorithm 2 satisfy Assumption 2 with constants \(\rho = 1{\text{/}}3,\) \(B = 3,\) \(C = 2{{L}^{2}},\) \({{D}_{1}} = 6{{\sigma }^{2}},\) \({{D}_{2}} = 4{{D}^{2}} + 12{{\sigma }^{2}},\) and \({{D}_{3}} = {{\sigma }^{2}}\).

Proof. Let \(\sigma _{k}^{2} = {{\left\| {F({{z}^{{k - 1/2}}}) - F({{z}^{{k + 1/2}}})} \right\|}^{2}}\). Then

$$\begin{gathered} \mathbb{E}\left[ {\sigma _{k}^{2}} \right] \leqslant 2\mathbb{E}\left[ {{{{\left\| {F({{z}^{k}}) - F({{z}^{{k + 1/2}}})} \right\|}}^{2}}} \right] + 2\mathbb{E}\left[ {{{{\left\| {F({{z}^{{k - 1/2}}}) - F({{z}^{k}})} \right\|}}^{2}}} \right] \leqslant 2{{L}^{2}}\mathbb{E}\left[ {{{{\left\| {{{z}^{k}} - {{z}^{{k + 1/2}}}} \right\|}}^{2}}} \right] \\ \, + 2{{L}^{2}}\mathbb{E}\left[ {{{{\left\| {{{z}^{{k - 1/2}}} - {{z}^{k}}} \right\|}}^{2}}} \right] + 4{{D}^{2}} = 2{{L}^{2}}\mathbb{E}\left[ {{{{\left\| {{{z}^{k}} - {{z}^{{k + 1/2}}}} \right\|}}^{2}}} \right] + 4{{D}^{2}} \\ + \;2{{L}^{2}}\mathbb{E}\left[ {{{{\left\| {{{z}^{{k - 1}}} - \gamma F({{z}^{{k - 1/2}}},{{\xi }_{k}}) - {{z}^{{k - 1}}} + \gamma F({{z}^{{k - 3/2}}},{{\xi }_{{k - 1}}})} \right\|}}^{2}}} \right] = 2{{L}^{2}}\mathbb{E}\left[ {{{{\left\| {{{z}^{k}} - {{z}^{{k + 1/2}}}} \right\|}}^{2}}} \right] \\ \end{gathered} $$
$$\begin{gathered} + \;2{{L}^{2}}{{\gamma }^{2}}\mathbb{E}\left[ {{{{\left\| {F({{z}^{{k - 1/2}}},{{\xi }_{k}}) - F({{z}^{{k - 3/2}}},{{\xi }_{{k - 1}}})} \right\|}}^{2}}} \right] + 4{{D}^{2}} \leqslant 2{{L}^{2}}\mathbb{E}\left[ {{{{\left\| {{{z}^{k}} - {{z}^{{k + 1/2}}}} \right\|}}^{2}}} \right] \\ \, + 6{{L}^{2}}{{\gamma }^{2}}\mathbb{E}\left[ {{{{\left\| {F({{z}^{{k - 1/2}}}) - F({{z}^{{k - 3/2}}})} \right\|}}^{2}}} \right] + 4{{D}^{2}} + 12{{\sigma }^{2}} \\ \, \leqslant 2{{L}^{2}}\mathbb{E}\left[ {{{{\left\| {{{z}^{k}} - {{z}^{{k + 1/2}}}} \right\|}}^{2}}} \right] + \frac{2}{3}\mathbb{E}\left[ {\sigma _{{k - 1}}^{2}} \right] + 4{{D}^{2}} + 12{{\sigma }^{2}}, \\ \end{gathered} $$

if we assume that \(\gamma \leqslant 1{\text{/}}(3L)\). Therefore,

$$\mathbb{E}\left[ {{{{\left\| {{{g}^{{k + 1/2}}} - {{g}^{k}}} \right\|}}^{2}}} \right] = \mathbb{E}\left[ {{{{\left\| {F({{z}^{{k + 1/2}}},{{\xi }_{{k + 1}}}) - F({{z}^{{k - 1/2}}},{{\xi }_{k}})} \right\|}}^{2}}} \right] \leqslant 3\mathbb{E}\left[ {\sigma _{k}^{2}} \right] + 6{{\sigma }^{2}},$$

and, finally,

$$\mathbb{E}\left[ {{{{\left\| {{{g}^{{k + 1/2}}} - F({{z}^{{k + 1/2}}})} \right\|}}^{2}}} \right] = \mathbb{E}\left[ {{{{\left\| {F({{z}^{{k + 1/2}}},{{\xi }^{{k + 1/2}}}) - F({{z}^{{k + 1/2}}})} \right\|}}^{2}}} \right] \leqslant {{\sigma }^{2}}.$$

Corollary 2. Assume that \(F\) is bounded-Lipschitz with constants \(L\) and \(D.\) Then

\( \bullet \) in the strongly monotone case with \(T = 36\) and \(\gamma \leqslant \min \left\{ {\frac{1}{{12L\sqrt 2 }};\frac{1}{{4\left( {{{\mu }_{F}} + {{\mu }_{h}}} \right)}}} \right\}\), the Past-ES satisfies

$$\mathbb{E}\left[ {{{{\left\| {{{z}^{K}} - z\text{*}} \right\|}}^{2}}} \right] \leqslant {{\left( {1 - \gamma \frac{{{{\mu }_{F}} + {{\mu }_{h}}}}{{16}}} \right)}^{{K - 1}}}{{\left\| {{{z}^{0}} - z\text{*}} \right\|}^{2}} + \frac{{192\gamma (37{{\sigma }^{2}} + 12{{D}^{2}})}}{{{{\mu }_{F}} + {{\mu }_{h}}}};$$

\( \bullet \) in the monotone case with \(T = 18\) and \(\gamma \leqslant \frac{1}{{12L\sqrt 2 }}\), the Past-ES satisfies

$$\mathbb{E}\left[ {{\text{Gap}}({{{\bar {z}}}^{K}})} \right] \leqslant \frac{{8{{{\max }}_{{u \in \mathcal{C}}}}\left[ {{{{\left\| {{{z}^{0}} - u} \right\|}}^{2}}} \right] + 72{{\gamma }^{2}}\sigma _{0}^{2}}}{{\gamma K}} + \gamma (216{{D}^{2}} + 691{{\sigma }^{2}}).$$

For the next method, we consider the setting of a finite sum: (1)–(4).

A.3. Variance Reduced Extra Step Method

Algorithm 3. Variance reduced extra step method (VR-ES)

Parameters: Stepsize \(\gamma \), \(K\).

Initialization: Choose \({{z}^{0}} = {{w}^{0}} \in \mathcal{Z}\).

for \(k = 0,1, \ldots ,K - 1\) do

\({{\bar {z}}^{k}} = \tau {{z}^{k}} + (1 - \tau ){{w}^{k}}\)

Generate uniformly random \({{m}_{k}} \in 1, \ldots ,M,\)

\({{z}^{{k + 1/2}}} = {\text{pro}}{{{\text{x}}}_{{\gamma h}}}({{\bar {z}}^{k}} - \gamma F({{w}^{k}}))\),

\({{z}^{{k + 1}}} = {\text{pro}}{{{\text{x}}}_{{\gamma h}}}({{\bar {z}}^{k}} - \gamma ({{F}_{{{{m}_{k}}}}}({{z}^{{k + 1/2}}}) - {{F}_{{{{m}_{k}}}}}({{w}^{k}}) + F({{w}^{k}})))\),

\({{w}^{{k + 1}}} = \left\{ {\begin{array}{*{20}{l}} {{{z}^{{k + 1}}}\quad {\text{with probability}} 1 - \tau ,} \\ {{{w}^{k}}\quad {\text{ with probability}} \tau .} \end{array}} \right.\)

end for

We set \({{\sigma }_{k}} = 0\). The following lemma gives the values of the constants from Assumption 2:

Lemma 4. Assume that all \({{F}_{{{{m}_{k}}}}}\) and \(F\) itself are bounded-Lipschitz with constants \(L\) and \(D\) (Assumption 3). Then \({{g}^{k}}\) and \({{g}^{{k + 1}}}\) from Algorithm 3 satisfy Assumption 2 with constants \(A = {{L}^{2}}\), \({{D}_{1}} = {{D}^{2}}\), \(E = 4{{L}^{2}}\), and \({{D}_{3}} = 4{{D}^{2}}\).

Proof. It is easy to see that \({{g}^{{k + 1/2}}}\) is unbiased. Next,

$$\begin{gathered} \mathbb{E}\left[ {{{{\left\| {{{g}^{{k + 1/2}}} - {{g}^{k}}} \right\|}}^{2}}} \right] = \mathbb{E}\left[ {{{{\left\| {{{F}_{{{{m}_{k}}}}}({{z}^{{k + 1/2}}}) - {{F}_{{{{m}_{k}}}}}({{w}^{k}}) + F({{w}^{k}}) - F({{w}^{k}})} \right\|}}^{2}}} \right] \\ \, = \mathbb{E}\left[ {{{{\left\| {{{F}_{{{{m}_{k}}}}}({{z}^{{k + 1/2}}}) - {{F}_{{{{m}_{k}}}}}({{w}^{k}})} \right\|}}^{2}}} \right] \leqslant {{L}^{2}}\mathbb{E}\left[ {{{{\left\| {{{z}^{{k + 1/2}}} - {{w}^{k}}} \right\|}}^{2}}} \right] + {{D}^{2}} \\ \end{gathered} $$

and, finally,

$$\begin{gathered} \mathbb{E}\left[ {{{{\left\| {{{g}^{{k + 1/2}}} - F({{z}^{{k + 1/2}}})} \right\|}}^{2}}} \right] = \mathbb{E}\left[ {{{{\left\| {{{F}_{{{{m}_{k}}}}}({{z}^{{k + 1/2}}}) - {{F}_{{{{m}_{k}}}}}({{w}^{k}}) + F({{w}^{k}}) - F({{z}^{{k + 1/2}}})} \right\|}}^{2}}} \right] \\ \leqslant 2\mathbb{E}\left[ {{{{\left\| {{{F}_{{{{m}_{k}}}}}({{z}^{{k + 1/2}}}) - {{F}_{{{{m}_{k}}}}}({{w}^{k}})} \right\|}}^{2}}} \right] + 2\mathbb{E}\left[ {{{{\left\| {F({{z}^{{k + 1/2}}}) - F({{w}^{k}})} \right\|}}^{2}}} \right] \leqslant 4{{L}^{2}}\mathbb{E}\left[ {{{{\left\| {{{w}^{k}} - {{z}^{{k + 1/2}}}} \right\|}}^{2}}} \right] + 4{{D}^{2}}. \\ \end{gathered} $$

Corollary 3. Assume that all \({{F}_{{{{m}_{k}}}}}\) and \(F\) itself are bounded-Lipschitz with constants \(L\) and \(D\) . Then

\( \bullet \) in the strongly monotone case with \(\gamma \leqslant \min \left\{ {\frac{{\sqrt {1 - \tau } }}{{2\sqrt 2 L}};\frac{{1 - \tau }}{{4\left( {{{\mu }_{F}} + {{\mu }_{h}}} \right)}}} \right\}\), VR-ES satisfies

$$\mathbb{E}\left[ {\tau {{{\left\| {{{z}^{{k + 1}}} - z\text{*}} \right\|}}^{2}} + {{{\left\| {{{w}^{{k + 1}}} - z\text{*}} \right\|}}^{2}}} \right] \leqslant {{\left( {1 - \gamma \frac{{{{\mu }_{F}} + {{\mu }_{h}}}}{{16}}} \right)}^{{K - 1}}}\left( {\tau {{{\left\| {{{z}^{0}} - z\text{*}} \right\|}}^{2}} + {{{\left\| {{{w}^{0}} - z\text{*}} \right\|}}^{2}}} \right) + \frac{{32\gamma {{D}^{2}}}}{{{{\mu }_{F}} + {{\mu }_{h}}}};$$

\( \bullet \) in the monotone case with \(\gamma \leqslant \frac{{\sqrt {1 - \tau } }}{{2\sqrt 6 L}}\), VR-ES satisfies

$$\mathbb{E}\left[ {{\text{Gap}}({{{\bar {z}}}^{K}})} \right] \leqslant \frac{{8{{{\max }}_{{u \in \mathcal{C}}}}\left[ {{{{\left\| {{{z}^{0}} - u} \right\|}}^{2}}} \right]}}{{\gamma K}} + 11\gamma {{D}^{2}}.$$

At every iteration, we calculate only one out of \(M\) operators. However, when \({{w}^{k}}\) is updated, it is necessary to compute all \(M\) operators at the new point \({{w}^{{k + 1}}}\). Based on this, an optimal value of \(\tau \) can be chosen as follows:

$$(1 - \tau )M \sim \tau \quad \Rightarrow \quad \tau = \frac{M}{{M + 1}}.$$
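
The whole procedure can be sketched in Python as follows (our illustration; we assume \(F = \frac{1}{M}\sum\nolimits_{m = 1}^M {{{F}_{m}}} \), the \({{F}_{m}}\) given as a list of callables, and use the optimal \(\tau \) derived above).

```python
import numpy as np

def vr_extra_step(prox, F_list, z0, gamma, K, rng=None):
    """Sketch of Algorithm 3 (VR-ES) with tau = M / (M + 1)."""
    if rng is None:
        rng = np.random.default_rng()
    M = len(F_list)
    tau = M / (M + 1)                           # optimal choice derived above

    def full_F(u):
        return sum(Fm(u) for Fm in F_list) / M  # full operator F(u)

    z = np.asarray(z0, dtype=float)
    w, Fw = z, full_F(z)                        # anchor point and full operator there
    for _ in range(K):
        z_bar = tau * z + (1 - tau) * w
        z_half = prox(z_bar - gamma * Fw, gamma)
        m = rng.integers(M)                     # one random summand per iteration
        g = F_list[m](z_half) - F_list[m](w) + Fw
        z = prox(z_bar - gamma * g, gamma)
        if rng.random() < 1 - tau:              # w is refreshed with probability 1 - tau
            w, Fw = z, full_F(z)                # this step requires all M operators
    return z
```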

A.4. Coordinate Extra Step Method

Let us go back and once again consider the most common setting without finite sums: (1).

Algorithm 4. Coordinate extra step method (Coord-ES)

Parameters: Stepsize \(\gamma \), \(K\).

Initialization: Choose \({{z}^{0}} = {{w}^{0}} \in \mathcal{Z}\).

for \(k = 0,1, \ldots ,K - 1\) do

\({{\bar {z}}^{k}} = \tau {{z}^{k}} + (1 - \tau ){{w}^{k}}\)

Generate uniformly random \({{i}_{k}} \in 1, \ldots ,d\),

\({{z}^{{k + 1/2}}} = {\text{pro}}{{{\text{x}}}_{{\gamma h}}}({{\bar {z}}^{k}} - \gamma F({{w}^{k}}))\),

\({{z}^{{k + 1}}} = {\text{pro}}{{{\text{x}}}_{{\gamma h}}}({{\bar {z}}^{k}} - \gamma (d{{[F({{z}^{{k + 1/2}}})]}_{{{{i}_{k}}}}}{{e}_{{{{i}_{k}}}}} - d{{[F({{w}^{k}})]}_{{{{i}_{k}}}}}{{e}_{{{{i}_{k}}}}} + F({{w}^{k}})))\),

\({{w}^{{k + 1}}} = \left\{ {\begin{array}{*{20}{l}} {{{z}^{{k + 1}}}\quad {\text{with probability}} 1 - \tau ,} \\ {{{w}^{k}}\quad {\text{ with probability}} \tau .} \end{array}} \right.\)

end for

We set \({{\sigma }_{k}} = 0\). The following lemma gives the values of the constants from Assumption 2.

Lemma 5. Assume that \(F\) is bounded-Lipschitz with constants \(L\) and \(D\) (Assumption 3). Then \({{g}^{k}}\) and \({{g}^{{k + 1}}}\) from Algorithm 4 satisfy Assumption 2 with constants \(A = d{{L}^{2}}\), \({{D}_{1}} = d{{D}^{2}}\), \(E = 2(d + 1){{L}^{2}}\), and \({{D}_{3}} = 2(d + 1){{D}^{2}}\).

Proof. It is easy to see that \({{g}^{{k + 1/2}}}\) is unbiased. Next,

$$\begin{gathered} \mathbb{E}\left[ {{{{\left\| {{{g}^{{k + 1/2}}} - {{g}^{k}}} \right\|}}^{2}}} \right] = \mathbb{E}\left[ {{{{\left\| {d{{{[F({{z}^{{k + 1/2}}})]}}_{{{{i}_{k}}}}}{{e}_{{{{i}_{k}}}}} - d{{{[F({{w}^{k}})]}}_{{{{i}_{k}}}}}{{e}_{{{{i}_{k}}}}} + F({{w}^{k}}) - F({{w}^{k}})} \right\|}}^{2}}} \right] \\ = \mathbb{E}\left[ {{{{\left\| {d{{{[F({{z}^{{k + 1/2}}}) - F({{w}^{k}})]}}_{{{{i}_{k}}}}}{{e}_{{{{i}_{k}}}}}} \right\|}}^{2}}} \right] \leqslant d\mathbb{E}\left[ {{{{\left\| {F({{z}^{{k + 1/2}}}) - F({{w}^{k}})} \right\|}}^{2}}} \right] \leqslant d{{L}^{2}}\mathbb{E}\left[ {{{{\left\| {{{z}^{{k + 1/2}}} - {{w}^{k}}} \right\|}}^{2}}} \right] + d{{D}^{2}} \\ \end{gathered} $$

and, finally,

$$\begin{gathered} \mathbb{E}\left[ {{{{\left\| {{{g}^{{k + 1/2}}} - F({{z}^{{k + 1/2}}})} \right\|}}^{2}}} \right] = \mathbb{E}\left[ {{{{\left\| {d{{{[F({{z}^{{k + 1/2}}})]}}_{{{{i}_{k}}}}}{{e}_{{{{i}_{k}}}}} - d{{{[F({{w}^{k}})]}}_{{{{i}_{k}}}}}{{e}_{{{{i}_{k}}}}} + F({{w}^{k}}) - F({{z}^{{k + 1/2}}})} \right\|}}^{2}}} \right] \\ \leqslant 2\mathbb{E}\left[ {{{{\left\| {d{{{[F({{z}^{{k + 1/2}}}) - F({{w}^{k}})]}}_{{{{i}_{k}}}}}{{e}_{{{{i}_{k}}}}}} \right\|}}^{2}}} \right] + 2\mathbb{E}\left[ {{{{\left\| {F({{z}^{{k + 1/2}}}) - F({{w}^{k}})} \right\|}}^{2}}} \right] \\ \, \leqslant 2(d + 1){{L}^{2}}\mathbb{E}\left[ {{{{\left\| {{{w}^{k}} - {{z}^{{k + 1/2}}}} \right\|}}^{2}}} \right] + 2(d + 1){{D}^{2}}. \\ \end{gathered} $$

Corollary 4. Assume that \(F\) is bounded-Lipschitz with constants \(L\) and \(D\) . Then

\( \bullet \) in the strongly monotone case with \(\gamma \leqslant \min \left\{ {\frac{{\sqrt {1 - \tau } }}{{2\sqrt {2d} L}};\frac{{1 - \tau }}{{4\left( {{{\mu }_{F}} + {{\mu }_{h}}} \right)}}} \right\}\), Coord-ES satisfies

$$\mathbb{E}\left[ {\tau {{{\left\| {{{z}^{{k + 1}}} - z\text{*}} \right\|}}^{2}} + {{{\left\| {{{w}^{{k + 1}}} - z\text{*}} \right\|}}^{2}}} \right] \leqslant {{\left( {1 - \gamma \frac{{{{\mu }_{F}} + {{\mu }_{h}}}}{{16}}} \right)}^{{K - 1}}}\left( {\tau {{{\left\| {{{z}^{0}} - z\text{*}} \right\|}}^{2}} + {{{\left\| {{{w}^{0}} - z\text{*}} \right\|}}^{2}}} \right) + \frac{{32\gamma d{{D}^{2}}}}{{{{\mu }_{F}} + {{\mu }_{h}}}};$$

\( \bullet \) in the monotone case with \(\gamma \leqslant \frac{{\sqrt {1 - \tau } }}{{2L\sqrt {4d + 2} }}\), Coord-ES satisfies

$$\mathbb{E}\left[ {{\text{Gap}}({{{\bar {z}}}^{K}})} \right] \leqslant \frac{{8{{{\max }}_{{u \in \mathcal{C}}}}\left[ {{{{\left\| {{{z}^{0}} - u} \right\|}}^{2}}} \right]}}{{\gamma K}} + \gamma (9d + 2){{D}^{2}}.$$

The optimal value is \(\tau = d{\text{/}}(d + 1)\).
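
For completeness, the coordinate estimator used at the second prox step can be written as the following sketch (ours; in practice one would compute the single coordinate of \(F\) directly instead of indexing the full operator value).

```python
import numpy as np

def coord_estimator(F, z_half, Fw, i, d):
    """g^{k+1/2} = d [F(z^{k+1/2})]_i e_i - d [F(w^k)]_i e_i + F(w^k) (illustrative).

    Fw is the precomputed value F(w^k) as an array of length d,
    and i is a coordinate index drawn uniformly from {0, ..., d - 1}.
    """
    g = Fw.copy()
    g[i] += d * (F(z_half)[i] - Fw[i])   # only the i-th coordinate is corrected
    return g
```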

A.5. Quantized Extra Step Method

In this section, we discuss a method that works with quantization operators.

Algorithm 5. Quantized extra step method (Quant-ES)

Parameters: Stepsize \(\gamma \), \(K\).

Initialization: Choose \({{z}^{0}} = {{w}^{0}} \in \mathcal{Z}\).

for \(k = 0,1, \ldots ,K - 1\) do

\({{\bar {z}}^{k}} = \tau {{z}^{k}} + (1 - \tau ){{w}^{k}}\)

\({{z}^{{k + 1/2}}} = {\text{pro}}{{{\text{x}}}_{{\gamma h}}}({{\bar {z}}^{k}} - \gamma F({{w}^{k}}))\),

\({{z}^{{k + 1}}} = {\text{pro}}{{{\text{x}}}_{{\gamma h}}}({{\bar {z}}^{k}} - \gamma Q(F({{z}^{{k + 1/2}}}) - F({{w}^{k}})) + \gamma F({{w}^{k}}))\),

\({{w}^{{k + 1}}} = \left\{ {\begin{array}{*{20}{l}} {{{z}^{{k + 1}}}\quad {\text{with probability}} 1 - \tau ,} \\ {{{w}^{k}}\quad {\text{ with probability}} \tau .} \end{array}} \right.\)

end for

Definition 1. We say that \(Q(x)\) is a quantization of a vector \(x \in {{\mathbb{R}}^{d}}\) if

$$\mathbb{E}\left[ {Q(x)} \right] = x,\quad \mathbb{E}\left[ {{{{\left\| {Q(x)} \right\|}}^{2}}} \right] \leqslant \omega {{\left\| x \right\|}^{2}}$$

for some \(\omega \).
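
For example, the Rand-k (random sparsification) operator, which keeps \(k\) uniformly chosen coordinates of \(x\) rescaled by \(d{\text{/}}k\), satisfies this definition with \(\omega = d{\text{/}}k\); a compressor of this kind (nullifying a fixed fraction of coordinates) is similar to the Rand70% operator mentioned in the GAN experiments of Appendix B. A minimal sketch (ours):

```python
import numpy as np

def rand_k(x, k, rng=None):
    """Rand-k sparsification: E[Q(x)] = x and E||Q(x)||^2 = (d / k) ||x||^2."""
    if rng is None:
        rng = np.random.default_rng()
    d = x.size
    q = np.zeros_like(x, dtype=float)
    idx = rng.choice(d, size=k, replace=False)   # keep k coordinates chosen uniformly
    q[idx] = (d / k) * x[idx]                    # rescale to keep the estimator unbiased
    return q
```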

Lemma 6. Assume that \(F\) is bounded-Lipschitz with constants \(L\) and \(D\) (Assumption 3). Then \({{g}^{k}}\) and \({{g}^{{k + 1}}}\) from Algorithm 5 satisfy Assumption 2 with constants \(A = \omega {{L}^{2}}\), \({{D}_{1}} = \omega {{D}^{2}}\), \(E = 2(\omega + 1){{L}^{2}}\), and \({{D}_{3}} = 2(\omega + 1){{D}^{2}}\).

Proof. It is easy to see that \({{g}^{{k + 1/2}}}\) is unbiased. Next,

$$\begin{gathered} \mathbb{E}\left[ {{{{\left\| {{{g}^{{k + 1/2}}} - {{g}^{k}}} \right\|}}^{2}}} \right] = \mathbb{E}\left[ {{{{\left\| {Q(F({{z}^{{k + 1/2}}}) - F({{w}^{k}})) + F({{w}^{k}}) - F({{w}^{k}})} \right\|}}^{2}}} \right] \\ = \mathbb{E}\left[ {{{{\left\| {Q(F({{z}^{{k + 1/2}}}) - F({{w}^{k}}))} \right\|}}^{2}}} \right] \leqslant \omega \mathbb{E}\left[ {{{{\left\| {F({{z}^{{k + 1/2}}}) - F({{w}^{k}})} \right\|}}^{2}}} \right] \leqslant \omega {{L}^{2}}\mathbb{E}\left[ {{{{\left\| {{{z}^{{k + 1/2}}} - {{w}^{k}}} \right\|}}^{2}}} \right] + \omega {{D}^{2}} \\ \end{gathered} $$

and, finally,

$$\begin{gathered} \mathbb{E}\left[ {{{{\left\| {{{g}^{{k + 1/2}}} - F({{z}^{{k + 1/2}}})} \right\|}}^{2}}} \right] = \mathbb{E}\left[ {{{{\left\| {Q(F({{z}^{{k + 1/2}}}) - F({{w}^{k}})) + F({{w}^{k}}) - F({{z}^{{k + 1/2}}})} \right\|}}^{2}}} \right] \\ \leqslant 2\mathbb{E}\left[ {{{{\left\| {Q(F({{z}^{{k + 1/2}}}) - F({{w}^{k}}))} \right\|}}^{2}}} \right] + 2\mathbb{E}\left[ {{{{\left\| {F({{w}^{k}}) - F({{z}^{{k + 1/2}}})} \right\|}}^{2}}} \right] \\ \, \leqslant 2(\omega + 1){{L}^{2}}\mathbb{E}\left[ {{{{\left\| {{{w}^{k}} - {{z}^{{k + 1/2}}}} \right\|}}^{2}}} \right] + 2(\omega + 1){{D}^{2}}. \\ \end{gathered} $$

Corollary 5. Assume that \(F\) is bounded-Lipschitz with constants \(L\) and \(D\) . Then Quant-ES

\( \bullet \) in the strongly monotone case with \(\gamma \leqslant \min \left\{ {\frac{{\sqrt {1 - \tau } }}{{2L\sqrt {2\omega } }};\frac{{1 - \tau }}{{4\left( {{{\mu }_{F}} + {{\mu }_{h}}} \right)}}} \right\}\), Quant-ES satisfies

$$\mathbb{E}\left[ {\tau {{{\left\| {{{z}^{{k + 1}}} - z\text{*}} \right\|}}^{2}} + {{{\left\| {{{w}^{{k + 1}}} - z\text{*}} \right\|}}^{2}}} \right] \leqslant {{\left( {1 - \gamma \frac{{{{\mu }_{F}} + {{\mu }_{h}}}}{{16}}} \right)}^{{K - 1}}}\left( {\tau {{{\left\| {{{z}^{0}} - z\text{*}} \right\|}}^{2}} + {{{\left\| {{{w}^{0}} - z\text{*}} \right\|}}^{2}}} \right) + \frac{{32\gamma \omega {{D}^{2}}}}{{{{\mu }_{F}} + {{\mu }_{h}}}};$$

\( \bullet \) in the monotone case with \(\gamma \leqslant \frac{{\sqrt {1 - \tau } }}{{2L\sqrt {4\omega + 2} }}\), Quant-ES satisfies

$$\mathbb{E}\left[ {{\text{Gap}}({{{\bar {z}}}^{K}})} \right] \leqslant \frac{{8{{{\max }}_{{u \in \mathcal{C}}}}\left[ {{{{\left\| {{{z}^{0}} - u} \right\|}}^{2}}} \right]}}{{\gamma K}} + \gamma (9\omega + 2){{D}^{2}}.$$

Consider the case \(D = 0\). Quantization is needed to compress the transmitted information, with \(\omega \) regarded as a compression coefficient: transferring an uncompressed operator requires \(\omega \) times more information than transferring a compressed one. However, an uncompressed operator has to be computed, on average, at every \(1{\text{/}}(1 - \tau ){\text{th}}\) iteration (when \({{w}^{k}}\) is updated). Based on this, an optimal value of \(\tau \) can be chosen as follows:

$$(1 - \tau ) \sim \tau \frac{1}{\omega }\quad \Rightarrow \quad \tau = \frac{\omega }{{\omega + 1}}.$$

A.6. Quantized Variance Reduced Extra Step Method

In what follows, the ideas of quantization and variance reduction are combined for the case of problem (1)–(4) with an operator in the form of a finite sum.

Algorithm 6. Quantized variance reduced extra step method

Parameters: Stepsize \(\gamma \), \(K\).

Initialization: Choose \({{z}^{0}} = {{w}^{0}} \in \mathcal{Z}\).

for \(k = 0,1, \ldots ,K - 1\) do

Generate uniformly random \({{m}_{k}} \in 1, \ldots ,M\),

\({{\bar {z}}^{k}} = \tau {{z}^{k}} + (1 - \tau ){{w}^{k}}\),

\({{z}^{{k + 1/2}}} = {\text{pro}}{{{\text{x}}}_{{\gamma h}}}({{\bar {z}}^{k}} - \gamma F({{w}^{k}}))\),

\({{z}^{{k + 1}}} = {\text{pro}}{{{\text{x}}}_{{\gamma h}}}({{\bar {z}}^{k}} - \gamma Q({{F}_{{{{m}_{k}}}}}({{z}^{{k + 1/2}}}) - {{F}_{{{{m}_{k}}}}}({{w}^{k}})) + \gamma F({{w}^{k}}))\),

\({{w}^{{k + 1}}} = \left\{ {\begin{array}{*{20}{l}} {{{z}^{{k + 1}}}\quad {\text{with probability }} 1 - \tau ,} \\ {{{w}^{k}}\quad {\text{ with probability }} \tau .} \end{array}} \right.\)

end for

We set \({{\sigma }_{k}} = 0\). The following lemma gives the values of the constants from Assumption 2.

Lemma 7. Assume that all \({{F}_{{{{m}_{k}}}}}\) and \(F\) itself are bounded-Lipschitz with constants \(L\) and \(D\) (Assumption 3). Then \({{g}^{k}}\) and \({{g}^{{k + 1}}}\) from Algorithm 6 satisfy Assumption 2 with \(A = \omega {{L}^{2}}\), \({{D}_{1}} = \omega {{D}^{2}}\), \(E = 2(\omega + 1){{L}^{2}}\), and \({{D}_{3}} = 2(\omega + 1){{D}^{2}}\).

Proof. It is easy to see that \({{g}^{{k + 1/2}}}\) is unbiased. Next,

$$\begin{gathered} \mathbb{E}\left[ {{{{\left\| {{{g}^{{k + 1/2}}} - {{g}^{k}}} \right\|}}^{2}}} \right] = \mathbb{E}\left[ {{{{\left\| {Q({{F}_{{{{m}_{k}}}}}({{z}^{{k + 1/2}}}) - {{F}_{{{{m}_{k}}}}}({{w}^{k}})) + F({{w}^{k}}) - F({{w}^{k}})} \right\|}}^{2}}} \right] \\ = \mathbb{E}\left[ {{{{\left\| {Q({{F}_{{{{m}_{k}}}}}({{z}^{{k + 1/2}}}) - {{F}_{{{{m}_{k}}}}}({{w}^{k}}))} \right\|}}^{2}}} \right] \leqslant \omega \mathbb{E}\left[ {{{{\left\| {{{F}_{{{{m}_{k}}}}}({{z}^{{k + 1/2}}}) - {{F}_{{{{m}_{k}}}}}({{w}^{k}})} \right\|}}^{2}}} \right] \leqslant \omega {{L}^{2}}\mathbb{E}\left[ {{{{\left\| {{{w}^{k}} - {{z}^{{k + 1/2}}}} \right\|}}^{2}}} \right] + \omega {{D}^{2}} \\ \end{gathered} $$

and, finally,

$$\begin{gathered} \mathbb{E}\left[ {{{{\left\| {{{g}^{{k + 1/2}}} - F({{z}^{{k + 1/2}}})} \right\|}}^{2}}} \right] = \mathbb{E}\left[ {{{{\left\| {Q({{F}_{{{{m}_{k}}}}}({{z}^{{k + 1/2}}}) - {{F}_{{{{m}_{k}}}}}({{w}^{k}})) + F({{w}^{k}}) - F({{z}^{{k + 1/2}}})} \right\|}}^{2}}} \right] \\ \leqslant 2\mathbb{E}\left[ {{{{\left\| {Q({{F}_{{{{m}_{k}}}}}({{z}^{{k + 1/2}}}) - {{F}_{{{{m}_{k}}}}}({{w}^{k}}))} \right\|}}^{2}}} \right] + 2\mathbb{E}\left[ {{{{\left\| {F({{w}^{k}}) - F({{z}^{{k + 1/2}}})} \right\|}}^{2}}} \right] \\ \, \leqslant 2(\omega + 1){{L}^{2}}\mathbb{E}\left[ {{{{\left\| {{{w}^{k}} - {{z}^{{k + 1/2}}}} \right\|}}^{2}}} \right] + 2(\omega + 1){{D}^{2}}. \\ \end{gathered} $$

Corollary 6. Assume that all \({{F}_{{{{m}_{k}}}}}\) and \(F\) itself are bounded-Lipschitz with constants \(L\) and \(D\) . Then the quantized variance reduced extra step method

\( \bullet \) in the strongly monotone case with \(\gamma \leqslant \min \left\{ {\frac{{\sqrt {1 - \tau } }}{{2L\sqrt {2\omega } }};\frac{{1 - \tau }}{{4\left( {{{\mu }_{F}} + {{\mu }_{h}}} \right)}}} \right\}\), satisfies

$$\mathbb{E}\left[ {\tau {{{\left\| {{{z}^{{k + 1}}} - z\text{*}} \right\|}}^{2}} + {{{\left\| {{{w}^{{k + 1}}} - z\text{*}} \right\|}}^{2}}} \right] \leqslant {{\left( {1 - \gamma \frac{{{{\mu }_{F}} + {{\mu }_{h}}}}{{16}}} \right)}^{{K - 1}}}\left( {\tau {{{\left\| {{{z}^{0}} - z\text{*}} \right\|}}^{2}} + {{{\left\| {{{w}^{0}} - z\text{*}} \right\|}}^{2}}} \right) + \frac{{32\gamma \omega {{D}^{2}}}}{{{{\mu }_{F}} + {{\mu }_{h}}}},$$

\( \bullet \) in the monotone case with \(\gamma \leqslant \frac{{\sqrt {1 - \tau } }}{{2L\sqrt {4\omega + 2} }}\), satisfies

$$\mathbb{E}\left[ {{\text{Gap}}({{{\bar {z}}}^{K}})} \right] \leqslant \frac{{8{{{\max }}_{{u \in \mathcal{C}}}}\left[ {{{{\left\| {{{z}^{0}} - u} \right\|}}^{2}}} \right]}}{{\gamma K}} + \gamma (9\omega + 2){{D}^{2}}.$$

The optimal value \(\tau \) is the same as in the preceding subsection.

A.7. Importance Sampling Extra Step Method

Here, we consider a more general case than a finite sum. Specifically, each function has its own weight \({{p}_{m}}\). Namely, consider a discrete random variable \(\eta \) of the form

$$\mathbb{P}(\eta = m) = {{p}_{m}},\quad \sum\limits_{m = 1}^M \,{{p}_{m}} = 1.$$

At every iteration, we call \({{F}_{\eta }}\). The weights/probabilities \({{p}_{m}}\) can be given a priori or can be specified by the user: for example, it is reasonable to use \({{p}_{m}} = {{L}_{m}}{\text{/}}\left( {\sum\nolimits_m \,{{L}_{m}}} \right)\).

Algorithm 7. Importance sampling extra step method

Parameters: Stepsize \(\gamma \), \(K\).

Initialization: Choose \({{z}^{0}} = {{w}^{0}} \in \mathcal{Z}\).

for \(k = 0,1, \ldots ,K - 1\) do

Generate \({{\eta }_{k}}\),

\({{\bar {z}}^{k}} = \tau {{z}^{k}} + (1 - \tau ){{w}^{k}}\),

\({{z}^{{k + 1/2}}} = {\text{pro}}{{{\text{x}}}_{{\gamma h}}}({{\bar {z}}^{k}} - \gamma F({{w}^{k}}))\),

\({{z}^{{k + 1}}} = {\text{pro}}{{{\text{x}}}_{{\gamma h}}}({{\bar {z}}^{k}} - \gamma \frac{1}{{{{p}_{{{{\eta }_{k}}}}}}}({{F}_{{{{\eta }_{k}}}}}({{z}^{{k + 1/2}}}) - {{F}_{{{{\eta }_{k}}}}}({{w}^{k}})) + \gamma F({{w}^{k}}))\),

\({{w}^{{k + 1}}} = \left\{ {\begin{array}{*{20}{l}} {{{z}^{{k + 1}}}\quad {\text{with probability }} 1 - \tau ,} \\ {{{w}^{k}}\quad {\text{ with probability }} \tau .} \end{array}} \right.\)

end for
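
The sampling probabilities and the resulting unbiased estimator used at the second prox step of Algorithm 7 can be sketched as follows (our illustration with hypothetical names).

```python
import numpy as np

def importance_probs(L_consts):
    """Probabilities proportional to the Lipschitz constants: p_m = L_m / sum_m L_m."""
    L_consts = np.asarray(L_consts, dtype=float)
    return L_consts / L_consts.sum()

def importance_estimator(F_list, p, z_half, w, Fw, rng=None):
    """g^{k+1/2} = (1 / p_eta) (F_eta(z^{k+1/2}) - F_eta(w^k)) + F(w^k)."""
    if rng is None:
        rng = np.random.default_rng()
    m = rng.choice(len(F_list), p=p)    # draw eta_k with the prescribed probabilities
    return (F_list[m](z_half) - F_list[m](w)) / p[m] + Fw
```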

Lemma 8. Assume that all \({{F}_{m}}\) are bounded-Lipschitz with constants \({{L}_{m}}\) and \({{D}_{m}}\) (Assumption 3), and so is \(F\) with constants \(L\) and \(D\). Then \({{g}^{k}}\) and \({{g}^{{k + 1}}}\) from Algorithm 7 satisfy Assumption 2 with constants \(A = \sum\nolimits_{m = 1}^M {\frac{{L_{m}^{2}}}{{{{p}_{m}}}}} \), \({{D}_{1}} = \sum\nolimits_{m = 1}^M {\frac{{D_{m}^{2}}}{{{{p}_{m}}}}} \), \(E = 2\left( {\sum\nolimits_{m = 1}^M {\frac{{L_{m}^{2}}}{{{{p}_{m}}}}} + {{L}^{2}}} \right)\), and \({{D}_{3}} = 2\left( {\sum\nolimits_{m = 1}^M {\frac{{D_{m}^{2}}}{{{{p}_{m}}}}} + {{D}^{2}}} \right)\).

Proof. It is easy to see that \({{g}^{{k + 1/2}}}\) is unbiased. Next,

$$\begin{gathered} \mathbb{E}\left[ {{{{\left\| {{{g}^{{k + 1/2}}} - {{g}^{k}}} \right\|}}^{2}}} \right] = \mathbb{E}\left[ {{{{\left\| {\frac{1}{{{{p}_{{{{\eta }_{k}}}}}}}({{F}_{{{{\eta }_{k}}}}}({{z}^{{k + 1/2}}}) - {{F}_{{{{\eta }_{k}}}}}({{w}^{k}})) + F({{w}^{k}}) - F({{w}^{k}})} \right\|}}^{2}}} \right] \\ = \mathbb{E}\left[ {{{{\left\| {\frac{1}{{{{p}_{{{{\eta }_{k}}}}}}}({{F}_{{{{\eta }_{k}}}}}({{z}^{{k + 1/2}}}) - {{F}_{{{{\eta }_{k}}}}}({{w}^{k}}))} \right\|}}^{2}}} \right] = \mathbb{E}\sum\limits_{m = 1}^M \frac{1}{{{{p}_{m}}}}\left[ {{{{\left\| {{{F}_{m}}({{z}^{{k + 1/2}}}) - {{F}_{m}}({{w}^{k}})} \right\|}}^{2}}} \right] \\ \leqslant \sum\limits_{m = 1}^M \frac{{L_{m}^{2}}}{{{{p}_{m}}}}\mathbb{E}\left[ {{{{\left\| {{{z}^{{k + 1/2}}} - {{w}^{k}}} \right\|}}^{2}}} \right] + \sum\limits_{m = 1}^M \frac{{D_{m}^{2}}}{{{{p}_{m}}}} \\ \end{gathered} $$

and, finally,

$$\begin{gathered} \mathbb{E}\left[ {{{{\left\| {{{g}^{{k + 1/2}}} - F({{z}^{{k + 1/2}}})} \right\|}}^{2}}} \right] = \mathbb{E}\left[ {{{{\left\| {\frac{1}{{{{p}_{{{{\eta }_{k}}}}}}}({{F}_{{{{m}_{k}}}}}({{z}^{{k + 1/2}}}) - {{F}_{{{{m}_{k}}}}}({{w}^{k}})) + F({{w}^{k}}) - F({{z}^{{k + 1/2}}})} \right\|}}^{2}}} \right] \\ \leqslant 2\mathbb{E}\left[ {{{{\left\| {\frac{1}{{{{p}_{{{{\eta }_{k}}}}}}}({{F}_{{{{m}_{k}}}}}({{z}^{{k + 1/2}}}) - {{F}_{{{{m}_{k}}}}}({{w}^{k}}))} \right\|}}^{2}}} \right] + 2\mathbb{E}\left[ {{{{\left\| {F({{w}^{k}}) - F({{z}^{{k + 1/2}}})} \right\|}}^{2}}} \right] \\ \leqslant 2\left( {\sum\limits_{m = 1}^M \frac{{L_{m}^{2}}}{{{{p}_{m}}}} + {{L}^{2}}} \right)\mathbb{E}\left[ {{{{\left\| {{{w}^{k}} - {{z}^{{k + 1/2}}}} \right\|}}^{2}}} \right] + 2\left( {\sum\limits_{m = 1}^M \frac{{D_{m}^{2}}}{{{{p}_{m}}}} + {{D}^{2}}} \right). \\ \end{gathered} $$

Corollary 7. Assume that all \({{F}_{m}}\) are bounded-Lipschitz with constants \({{L}_{m}}\) and \({{D}_{m}}\) (Assumption 3) and so is \(F\) with constants \(L\) and \(D\). Then the importance sampling extra step method

\( \bullet \) in the strongly monotone case with \(\gamma \leqslant \min \left\{ {\frac{{\sqrt {1 - \tau } }}{{2\sqrt {2\sum\nolimits_{m = 1}^M {\frac{{L_{m}^{2}}}{{{{p}_{m}}}}} } }};\frac{{1 - \tau }}{{4\left( {{{\mu }_{F}} + {{\mu }_{h}}} \right)}}} \right\}\), satisfies

$$\begin{gathered} \mathbb{E}\left[ {\tau {{{\left\| {{{z}^{{k + 1}}} - z\text{*}} \right\|}}^{2}} + {{{\left\| {{{w}^{{k + 1}}} - z\text{*}} \right\|}}^{2}}} \right] \\ \leqslant {{\left( {1 - \gamma \frac{{{{\mu }_{F}} + {{\mu }_{h}}}}{{16}}} \right)}^{{K - 1}}}\left( {\tau {{{\left\| {{{z}^{0}} - z\text{*}} \right\|}}^{2}} + {{{\left\| {{{w}^{0}} - z\text{*}} \right\|}}^{2}}} \right) + \frac{{32\gamma }}{{{{\mu }_{F}} + {{\mu }_{h}}}}\sum\limits_{m = 1}^M \frac{{D_{m}^{2}}}{{{{p}_{m}}}}, \\ \end{gathered} $$

\( \bullet \) in the monotone case with \(\gamma \leqslant \frac{{\sqrt {1 - \tau } }}{{2\sqrt {2{{L}^{2}} + 4\sum\nolimits_{m = 1}^M {\frac{{L_{m}^{2}}}{{{{p}_{m}}}}} } }}\), satisfies

$$\mathbb{E}\left[ {{\text{Gap}}({{{\bar {z}}}^{K}})} \right] \leqslant \frac{{8{{{\max }}_{{u \in \mathcal{C}}}}\left[ {{{{\left\| {{{z}^{0}} - u} \right\|}}^{2}}} \right]}}{{\gamma K}} + \gamma \left( {9\sum\limits_{m = 1}^M \frac{{D_{m}^{2}}}{{{{p}_{m}}}} + 2{{D}^{2}}} \right).$$

A.8. Local Extra Step Method

This method is designed for the distributed problem (1)–(5). It alternates between local iterations and averaging with the server value. A local step is taken with probability \(\tau \), and communication occurs with probability \(1 - \tau \).

Here, \({{Z}_{{{\text{avg}}}}} = [{{\bar {z}}^{T}}, \ldots ,{{\bar {z}}^{T}}] \in {{\mathbb{R}}^{{Md}}}\) with \(\bar {z} = \frac{1}{M}\sum\nolimits_{m = 1}^M \,{{z}_{m}}\) (the same for \({{W}_{{{\text{avg}}}}}\)).

Algorithm 8. Randomized local extra step method

Parameters: Stepsize \(\gamma \), \(K\); probability \(\tau \).

Initialization: Choose \({{z}^{0}} = {{w}^{0}} \in \mathcal{Z}\) and \(z_{m}^{0} = {{z}^{0}}\) for all \(m\).

for \(k = 0,1, \ldots ,K - 1\) do

\({{\bar {Z}}^{k}} = \tau {{Z}^{k}} + (1 - \tau ){{W}^{k}}\),

\({{Z}^{{k + 1/2}}} = {\text{pro}}{{{\text{x}}}_{{\gamma h}}}\left( {{{{\bar {Z}}}^{k}} - \gamma (\Phi ({{W}^{k}}) + \lambda ({{W}^{k}} - W_{{{\text{avg}}}}^{k}))} \right)\),

\(G(Z) = \left\{ {\begin{array}{*{20}{l}} {\frac{1}{\tau }\Phi (Z)\quad {\text{ with probability }} \tau ,} \\ {\frac{1}{{1 - \tau }}\lambda (Z - {{Z}_{{{\text{avg}}}}})\quad {\text{with probability }} 1 - \tau ,} \end{array}} \right.\)

\({{Z}^{{k + 1}}} = {\text{pro}}{{{\text{x}}}_{{\gamma h}}}\left( {{{{\bar {Z}}}^{k}} - \gamma (G({{Z}^{{k + 1/2}}}) - G({{W}^{k}}) + \Phi ({{W}^{k}}) + \lambda ({{W}^{k}} - W_{{{\text{avg}}}}^{k}))} \right)\),

\({{W}^{{k + 1}}} = \left\{ {\begin{array}{*{20}{l}} {{{Z}^{{k + 1}}}\quad {\text{with probability }} 1 - \tau ,} \\ {{{W}^{k}}\quad {\text{ with probability }} \tau .} \end{array}} \right.\)

end for

In fact, this method was analyzed in the preceding subsection. Indeed, we have two operators, \(\Phi (Z)\) and \(\lambda (Z - {{Z}_{{{\text{avg}}}}})\), which are \(L\)- and \(\lambda \)-smooth, respectively.
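
To make the correspondence explicit, the randomized operator \(G\) of Algorithm 8 can be sketched as follows (our illustration; the stacked variable \(Z \in {{\mathbb{R}}^{{Md}}}\) is stored here as an \(M \times d\) array). In expectation it equals \(\Phi (Z) + \lambda (Z - {{Z}_{{{\text{avg}}}}})\), i.e., the full operator used at the half step.

```python
import numpy as np

def randomized_local_operator(Phi, Z, lam, tau, rng=None):
    """G(Z): a scaled local step with probability tau, otherwise a scaled consensus step."""
    if rng is None:
        rng = np.random.default_rng()
    Z_avg = np.broadcast_to(Z.mean(axis=0), Z.shape)   # every device block replaced by the mean
    if rng.random() < tau:
        return Phi(Z) / tau                            # local computation on every device
    return lam * (Z - Z_avg) / (1.0 - tau)             # one communication (averaging) round
```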

If we use \(\tau = L{\text{/}}(L + \lambda )\), then the following result is true.

Corollary 8. The randomized local extra step method

\( \bullet \) in the strongly monotone case with \(\gamma \leqslant \min \left\{ {\frac{{\sqrt \lambda }}{{2\sqrt 2 {{{(L + \lambda )}}^{{3/2}}}}};\frac{{\sqrt \lambda }}{{4\left( {{{\mu }_{F}} + {{\mu }_{h}}} \right)\sqrt {L + \lambda } }}} \right\}\), satisfies

$$\begin{gathered} \mathbb{E}\left[ {\tau {{{\left\| {{{z}^{{k + 1}}} - z\text{*}} \right\|}}^{2}} + {{{\left\| {{{w}^{{k + 1}}} - z\text{*}} \right\|}}^{2}}} \right] \\ \leqslant {{\left( {1 - \gamma \frac{{{{\mu }_{F}} + {{\mu }_{h}}}}{{16}}} \right)}^{{K - 1}}}\left( {\tau {{{\left\| {{{z}^{0}} - z\text{*}} \right\|}}^{2}} + {{{\left\| {{{w}^{0}} - z\text{*}} \right\|}}^{2}}} \right) + \frac{{32\gamma }}{{{{\mu }_{F}} + {{\mu }_{h}}}}\sum\limits_{m = 1}^M \frac{{D_{m}^{2}}}{{{{p}_{m}}}}, \\ \end{gathered} $$

\( \bullet \) in the monotone case with \(\gamma \leqslant \frac{{\sqrt \lambda }}{{2\sqrt 6 {{{(L + \lambda )}}^{{3/2}}}}}\), satisfies

$$\mathbb{E}\left[ {{\text{Gap}}({{{\bar {z}}}^{K}})} \right] \leqslant \frac{{8{{{\max }}_{{u \in \mathcal{C}}}}\left[ {{{{\left\| {{{z}^{0}} - u} \right\|}}^{2}}} \right]}}{{\gamma K}} + \gamma \left( {9\sum\limits_{m = 1}^M \frac{{D_{m}^{2}}}{{{{p}_{m}}}} + 2{{D}^{2}}} \right).$$

Based on this, the number of local computations and the number of communications can be estimated as follows:

\( \bullet \) in the strongly monotone case,

$$\mathcal{O}\left( {\frac{{\sqrt {\lambda (\lambda + L)} }}{\mu }\log \frac{1}{\varepsilon }} \right)\;{\text{communications}},\quad \mathcal{O}\left( {\frac{{\sqrt {L(\lambda + L)} }}{\mu }\log \frac{1}{\varepsilon }} \right)\;{\text{local computations}};$$

\( \bullet \) in the monotone case,

$$\mathcal{O}\left( {\frac{{\sqrt {\lambda (\lambda + L)} {{\Omega }^{2}}}}{\varepsilon }} \right)\;{\text{communications}},\quad \mathcal{O}\left( {\frac{{\sqrt {L(\lambda + L)} {{\Omega }^{2}}}}{\varepsilon }} \right)\;{\text{local computations}}.$$

Note that these estimates are fairly good for small values of \(\lambda \).

B. EXPERIMENTS

B.1. Generative Adversarial Networks

As an example of optimizing a minimax functional by different optimization approaches, we consider training generative adversarial networks (GANs). GAN is a framework for estimating generative models via an adversarial (opposing) process in which two models are trained simultaneously: a generative model \(G\) that captures the data distribution and a discriminative model \(D\) that estimates the probability of a sample coming from the training set rather than from \(G\). Thus, \(D(G(z))\) is the (scalar) probability that the output of the generator \(G\) is a real image. According to [6], \(D\) tries to maximize the probability of correctly classifying real and fake samples (i.e., \(\log D(x)\)), and \(G\) tries to minimize the probability that \(D\) labels its outputs as fakes (i.e., \(\log (1 - D(G(z)))\)). Following [6], the GAN loss function is

$$\mathop {\min }\limits_{{{\theta }_{G}}} \mathop {\max }\limits_{{{\theta }_{D}}} {{\mathbb{E}}_{{x \sim {{\mathbb{P}}_{{{\text{data}}}}}(x)}}}[\log (D(x))] + {{\mathbb{E}}_{{z \sim {{\mathbb{P}}_{z}}(z)}}}[\log (1 - D(G(z)))].$$
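
As an illustration (ours, not the training code used below), a Monte-Carlo estimate of this saddle-point objective for given generator and discriminator callables could look as follows.

```python
import numpy as np

def gan_objective(D, G, real_batch, noise_batch, eps=1e-12):
    """Batch estimate of E_x[log D(x)] + E_z[log(1 - D(G(z)))].

    D and G are callables returning discriminator probabilities and generated samples.
    """
    real_term = np.mean(np.log(D(real_batch) + eps))
    fake_term = np.mean(np.log(1.0 - D(G(noise_batch)) + eps))
    return real_term + fake_term
```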

The purpose of our experiments was not to achieve state-of-the-art results with a new architecture or a new formulation of the GAN task, but rather to confirm the performance of variance reduction (SVRG) and quantization methods for training GANs.

Data, model, and optimizers. In these experiments, we used the CIFAR dataset. Its training and validation sets contain \(50\,000\) and \(10\,000\) images, respectively, equally distributed over ten classes. For each optimization approach, Adam was used as an optimizer. The hyperparameters of each of the optimizers (generator and discriminator) were analyzed in the course of the experiments (see Table 1).

Table 1.   Set of analyzed hyperparameters

Based on the results of the convergence analysis (tested after 100 epochs of training), the following parameters were chosen: \({{\beta }_{1}} = 0.9,\) \({{\beta }_{2}} = 0.999,\) and the learning rate \( = 2 \times {{10}^{{ - 4}}}\) for both optimization methods. The batch size for one gradient calculation was 64.

To check the quality of the output images, we used the Inception score metric (see [32]):

$${\mathbf{IS}}(G) = \exp ({{\mathbb{E}}_{{x \sim {{p}_{g}}}}}{{D}_{{KL}}}(p(y\,|\,{\mathbf{x}})\,|{\kern 1pt} {\kern 1pt} |\,p(y))),$$

where \(x \sim {{p}_{g}}\) means that the image x is generated from the distribution \({{p}_{g}}\) and \({{D}_{{KL}}}(p\,|{\kern 1pt} {\kern 1pt} |\,q)\) is the Kullback–Leibler divergence between two distributions \(p,\;q\).
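
Given the class probabilities \(p(y\,|\,{\mathbf{x}})\) predicted by the pretrained classifier for a batch of generated images, the score can be computed as in the following sketch (ours).

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """IS = exp( mean_x KL( p(y|x) || p(y) ) ) for an (N, num_classes) array of probabilities."""
    probs = np.asarray(probs, dtype=float)
    p_y = probs.mean(axis=0, keepdims=True)          # marginal distribution p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))
```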

This metric allows us to automatically assess the quality of images generated by the model; it has been shown to correlate well with the results of human evaluation on images generated from CIFAR10. The metric uses the Inception v3 neural network pretrained on the ImageNet dataset and collects statistics of the network outputs on the generated images. In the experiments, we used the DCGan architecture [33] with the conditional approach [34]. A feature of this architecture is that the model can be trained to sample images from a particular class, so that the generated samples resemble elements of the desired class of the training set.

The main purpose of these experiments was to check whether overfitting of a given architecture can be avoided under different optimization approaches. During training, the generator and discriminator performed the same number of optimization steps.

Results. They are presented in Figs. 1 and 2. The results obtained in the experiments suggest that the approaches proposed in this paper outperform the original ones.

Fig. 1.

Results of the model optimized with the Adam method in different ways: the approach using quantization/clipping of 70% of the entire model gradient, the approach using variance reduction, and the original approach (see [6]).

Fig. 2.

These images were generated by the conditional DCGan architecture trained using different approaches: (a) initial optimization using Adam, (b) optimization with nullifying \(70\% \) of coordinates of the entire gradient, and (c) optimization with stochastic variance-reduced gradient descent (SVRG).

\( \bullet \) The variance reduction method made it possible to optimize the functional more accurately than the original method throughout each training epoch. However, as the final iterations of each training epoch were approached, the full gradient differed rather widely from the direction of the gradients obtained at the last iterations, which introduced inaccuracy into the optimization and led to overfitting of the generator.

\( \bullet \) The method with quantization/clipping (randomly nullifying 70% of the entire model gradient in both the generator and the discriminator, which corresponds to the compression operator Rand70%) made it possible to prevent overfitting of the generator and discriminator at the early stages, which in turn led to better results.

To sum up, all of the above-mentioned approaches produced better results than the original learning technique and can be used to optimize GANs, but quantization allows one to quickly obtain the best results while saving computational resources.

B.2. Policeman vs. Burglar

To compare the performance of some of the methods presented in this paper, we consider the Policeman vs. Burglar problem for a square city of 200 by 200 cells. There is a house and a police booth in each cell. The burglar chooses one house to rob, and the policeman chooses a booth in which he will be on duty. The task is to find optimal mixed strategies for the burglar and policeman treated as adversarial players in a game where the burglar’s goal is to rob some chosen house \(i\) with maximal wealth \({{w}_{i}}\) and the policeman’s goal is to choose an optimal booth and to catch the burglar to prevent the maximum expected loss caused by the latter. We assume that the probability of catching the burglar for house \(i\) and booth \(j\) is \(\exp \left( { - \theta d(i,j)} \right)\), where \(d\) is the distance function introduced below. This setting can be formulated as the bilinear saddle point problem

$$\mathop {\min }\limits_{x \in \Delta ({{n}^{2}})} \mathop {\max }\limits_{y \in \Delta ({{n}^{2}})} f(x,y): = \frac{1}{n}\sum\limits_{k = 1}^n \,{{y}^{ \top }}{{A}^{{(k)}}}x$$
(A.22)

for \(x\) and \(y\) being the vectors of probability of choosing some house and booth, respectively, and for the matrices

$$A_{{ij}}^{{(k)}} = w_{i}^{{(k)}}\left[ {1 - \exp \left( { - \theta d(i,j)} \right)} \right],$$

where the wealth \({{w}^{{(k)}}}\) and the distance function \(d\) are defined as follows (these expressions are easy to understand if \(i\) is regarded as a flattened coordinate on an \(n \times n\) playing field, \(i(x,y) = xn + y\), \(w\) is plotted as a pyramid centered at the center of this field, and \(d\) is the Euclidean distance on it):

$$\begin{gathered} {{w}_{i}} = 1 - \frac{2}{n}\min \left\{ {\left| {\left\lfloor {i{\text{/}}n} \right\rfloor - n{\text{/}}2} \right|,\left| {i\bmod n - n{\text{/}}2} \right|} \right\}, \\ w_{i}^{{(k)}} = {{w}_{i}}(1 + {{\xi }^{{(k)}}})\quad {\text{for}}\quad {{\xi }^{{(k)}}} \sim \mathcal{U}(0,\sigma ), \\ d(i,j) = \sqrt {{{{(\left\lfloor {i{\text{/}}n} \right\rfloor - \left\lfloor {j{\text{/}}n} \right\rfloor )}}^{2}} + {{{(i\bmod n - j\bmod n)}}^{2}}} . \\ \end{gathered} $$
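
For reference, the problem data can be generated as in the following sketch (ours; following the notation above, the noise \(\xi ^{{(k)}}\) is drawn once per matrix).

```python
import numpy as np

def burglar_matrices(n=25, theta=0.6, sigma=3.0, rng=None):
    """Payoff matrices A^{(k)}, k = 1, ..., n, of problem (A.22)."""
    if rng is None:
        rng = np.random.default_rng()
    idx = np.arange(n * n)
    row, col = idx // n, idx % n                              # flattened coordinate i = x * n + y
    w = 1.0 - (2.0 / n) * np.minimum(np.abs(row - n / 2), np.abs(col - n / 2))
    dist = np.sqrt((row[:, None] - row[None, :]) ** 2 + (col[:, None] - col[None, :]) ** 2)
    base = 1.0 - np.exp(-theta * dist)                        # 1 - probability of catching the burglar
    return [(w * (1.0 + rng.uniform(0.0, sigma)))[:, None] * base for _ in range(n)]
```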

In the experiments, the parameters were specified as \(\theta = 0.6,\) \(n = 25,\) and \(\sigma = 3\). In the Coord-ES, Quant-ES, and Past-ES methods, the value of \(\gamma \) was the same as in the extra step method, for which it is optimal. We compared Past-ES in terms of the number of calls to the oracle \(F,\) the method with quantization in terms of the number of used bits, and the coordinate method in terms of the number of used coordinates (Fig. 3).

Fig. 3.

Convergence of the extra step methods as applied to the Policeman vs. Burglar problem (A.22): (a) extra step and Past-ES, (b) extra step and Coord-ES, and (c) extra step and Quant-ES.

Cite this article

Beznosikov, A.N., Gasnikov, A.V., Zainullina, K.E. et al. A Unified Analysis of Variational Inequality Methods: Variance Reduction, Sampling, Quantization, and Coordinate Descent. Comput. Math. and Math. Phys. 63, 147–174 (2023). https://doi.org/10.1134/S0965542523020045
