
Theoretical analysis of Adam using hyperparameters close to one without Lipschitz smoothness


Abstract

Convergence and convergence rate analyses of adaptive methods, such as Adaptive Moment Estimation (Adam) and its variants, have been widely studied for nonconvex optimization. These analyses assume that the expected or empirical average loss function is Lipschitz smooth (i.e., that its gradient is Lipschitz continuous) and that the learning rates depend on the Lipschitz constant of the gradient. Meanwhile, numerical evaluations of Adam and its variants have shown that using small constant learning rates that do not depend on the Lipschitz constant, together with hyperparameters (\(\beta _1\) and \(\beta _2\)) close to one, is advantageous for training deep neural networks. Since computing the Lipschitz constant is NP-hard, the Lipschitz smoothness condition is unrealistic in practice. This paper provides theoretical analyses of Adam that do not assume the Lipschitz smoothness condition, in order to bridge the gap between theory and practice. The main contribution is theoretical evidence that Adam using small learning rates and hyperparameters close to one performs well, whereas the previous theoretical results were all for hyperparameters close to zero. Our analysis also leads to the finding that Adam performs well with large batch sizes. Moreover, we show that Adam performs well when it uses diminishing learning rates and hyperparameters close to one.


Availability of supporting data

Not Applicable

Notes

  1. https://www.cs.toronto.edu/~kriz/cifar.html

  2. http://yann.lecun.com/exdb/mnist/

References

  1. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein GAN. https://arxiv.org/pdf/1701.07875.pdf (2017)

  2. Borwein, J.M., Lewis, A.S.: Convex Analysis and Nonlinear Optimization: Theory and Examples. Springer, New York (2000)

  3. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Review 60, 223–311 (2018)


  4. Chen, H., Zheng, L., AL Kontar, R., Raskutti, G.: Stochastic gradient descent in correlated settings: A study on Gaussian processes. In: Advances in Neural Information Processing Systems, vol. 33 (2020)

  5. Chen, J., Zhou, D., Tang, Y., Yang, Z., Cao, Y., Gu, Q.: Closing the generalization gap of adaptive gradient methods in training deep neural networks. In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, vol. 452, pp. 3267–3275 (2021)

  6. Chen, X., Liu, S., Sun, R., Hong, M.: On the convergence of a class of Adam-type algorithms for non-convex optimization. In: Proceedings of The International Conference on Learning Representations (2019)

  7. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, 2121–2159 (2011)


  8. Fehrman, B., Gess, B., Jentzen, A.: Convergence rates for the stochastic gradient descent method for non-convex objective functions. Journal of Machine Learning Research 21, 1–48 (2020)


  9. Ghadimi, S., Lan, G.: Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: A generic algorithmic framework. SIAM Journal on Optimization 22, 1469–1492 (2012)

  10. Ghadimi, S., Lan, G.: Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization II: Shrinking procedures and optimal algorithms. SIAM Journal on Optimization 23, 2061–2089 (2013)


  11. Gower, R.M., Sebbouh, O., Loizou, N.: SGD for structured nonconvex functions: Learning rates, minibatching and interpolation. In: Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, vol. 130 (2021)

  12. Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, Cambridge (1985)

  13. Iiduka, H.: Appropriate learning rates of adaptive learning rate optimization algorithms for training deep neural networks. IEEE Transactions on Cybernetics 52(12), 13250–13261 (2022)


  14. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of The International Conference on Learning Representations (2015)

  15. Loizou, N., Vaswani, S., Laradji, I., Lacoste-Julien, S.: Stochastic Polyak step-size for SGD: An adaptive learning rate for fast convergence. In: Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, vol. 130 (2021)

  16. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2019)

  17. Luo, L., Xiong, Y., Liu, Y., Sun, X.: Adaptive gradient methods with dynamic bound of learning rate. In: Proceedings of The International Conference on Learning Representations (2019)

  18. Mendler-Dünner, C., Perdomo, J.C., Zrnic, T., Hardt, M.: Stochastic optimization for performative prediction. In: Advances in Neural Information Processing Systems, vol. 33 (2020)

  19. Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization 19, 1574–1609 (2009)


  20. Nesterov, Y.: A method for unconstrained convex minimization problem with the rate of convergence \({O}(1/k^2)\). Doklady AN USSR 269, 543–547 (1983)


  21. Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics 4, 1–17 (1964)


  22. Reddi, S.J., Kale, S., Kumar, S.: On the convergence of Adam and beyond. In: Proceedings of The International Conference on Learning Representations (2018)

  23. Robbins, H., Monro, S.: A stochastic approximation method. The Annals of Mathematical Statistics 22, 400–407 (1951)


  24. Scaman, K., Malherbe, C.: Robustness analysis of non-convex stochastic gradient descent using biased expectations. In: Advances in Neural Information Processing Systems, vol. 33 (2020)

  25. Shallue, C.J., Lee, J., Antognini, J., Sohl-Dickstein, J., Frostig, R., Dahl, G.E.: Measuring the effects of data parallelism on neural network training. Journal of Machine Learning Research 20, 1–49 (2019)


  26. Smith, S.L., Kindermans, P.J., Le, Q.V.: Don’t decay the learning rate, increase the batch size. In: Proceedings of The International Conference on Learning Representations (2018)

  27. Tieleman, T., Hinton, G.: RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning 4, 26–31 (2012)

  28. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)

  29. Virmaux, A., Scaman, K.: Lipschitz regularity of deep neural networks: analysis and efficient estimation. In: Advances in Neural Information Processing Systems, vol. 31 (2018)

  30. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: Neural image caption generation with visual attention. In: Proceedings of the 32nd International Conference on Machine Learning, vol. 37, pp. 2048–2057 (2015)

  31. Zaheer, M., Reddi, S., Sachan, D., Kale, S., Kumar, S.: Adaptive methods for nonconvex optimization. In: Advances in Neural Information Processing Systems, vol. 31 (2018)

  32. Zhang, G., Li, L., Nado, Z., Martens, J., Sachdeva, S., Dahl, G.E., Shallue, C.J., Grosse, R.: Which algorithmic choices matter at which batch sizes? Insights from a noisy quadratic model. In: Advances in Neural Information Processing Systems, vol. 32 (2019)

  33. Zhou, D., Chen, J., Cao, Y., Tang, Y., Yang, Z., Gu, Q.: On the convergence of adaptive gradient methods for nonconvex optimization. In: 12th Annual Workshop on Optimization for Machine Learning (2020)

  34. Zhuang, J., Tang, T., Ding, Y., Tatikonda, S., Dvornek, N., Papademetris, X., Duncan, J.S.: AdaBelief optimizer: Adapting stepsizes by the belief in observed gradients. In: Advances in Neural Information Processing Systems, vol. 33 (2020)

  35. Zinkevich, M.: Online convex programming and generalized infinitesimal gradient ascent. In: Proceedings of the 20th International Conference on Machine Learning, pp. 928–936 (2003)

  36. Zinkevich, M., Weimer, M., Li, L., Smola, A.: Parallelized stochastic gradient descent. In: Advances in Neural Information Processing Systems, vol. 23 (2010)

  37. Zou, F., Shen, L., Jie, Z., Zhang, W., Liu, W.: A sufficient condition for convergences of Adam and RMSProp. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11127–11135 (2019)


Acknowledgements

I am sincerely grateful to Editor Claude Brezinski, the Associate Editor, and the anonymous reviewers for helping me improve the original manuscript. I also thank Naoki Sato for his input on the numerical examples.

Funding

This work was supported by JSPS KAKENHI Grant Number 21K11773.

Author information

Contributions

H.I. wrote the manuscript.

Corresponding author

Correspondence to Hideaki Iiduka.

Ethics declarations

Ethical Approval

Not Applicable

Conflict of interest

Not Applicable

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Appendix A

Unless stated otherwise, all relationships between random variables are assumed to hold almost surely.
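
The lemmas below repeatedly use the update of Algorithm 1. As a reference point, here is a minimal Python sketch of one Adam iteration in the notation of this appendix, assuming the standard bias-corrected form with \(\textsf{H}_k = \textsf{diag}(\hat{\textit{v}}_{k,i}^{1/2})\); it omits the usual stabilization constant \(\epsilon \), since the analysis instead assumes \(\textit{v}_* > 0\). The argument grad stands for the mini-batch gradient \(\nabla f_{B_k}(\varvec{\theta }_k)\).

```python
import numpy as np

def adam_step(theta, m, v, grad, k, alpha, beta1, beta2):
    # One iteration of Algorithm 1 (k = 0, 1, ...), with m and v initialized to zero vectors.
    # m_k = beta_{1k} m_{k-1} + (1 - beta_{1k}) grad,  v_k = beta_{2k} v_{k-1} + (1 - beta_{2k}) grad**2
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # Bias corrections: m_hat = m_k / (1 - beta1^{k+1}),  v_hat = v_k / (1 - beta2^{k+1})
    m_hat = m / (1.0 - beta1 ** (k + 1))
    v_hat = v / (1.0 - beta2 ** (k + 1))
    # H_k = diag(sqrt(v_hat)), d_k = -H_k^{-1} m_hat, theta_{k+1} = theta_k + alpha_k d_k
    # (no epsilon term; this sketch assumes every component of v_hat is positive)
    return theta - alpha * m_hat / np.sqrt(v_hat), m, v
```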

A.1 Lemmas

Lemma 1

Suppose that (S1), (S2)(13), and (S3) hold. Then, Adam defined by Algorithm 1 satisfies the following: for all \(k\in \mathbb {N}\) and all \(\varvec{\theta } \in \mathbb {R}^d\),

$$\begin{aligned} \mathbb {E}\left[ \left\| \varvec{\theta }_{k+1} - \varvec{\theta } \right\| _{\textsf{H}_k}^2 \right]&= \mathbb {E}\left[ \left\| \varvec{\theta }_{k} - \varvec{\theta } \right\| _{\textsf{H}_k}^2 \right] + \alpha _k^2 \mathbb {E}\left[ \left\| \varvec{\textsf{d}}_k \right\| _{\textsf{H}_k}^2 \right] \\&\quad + 2 \alpha _k \left\{ \frac{\beta _{1k}}{\tilde{\beta }_{1k}} \mathbb {E}\left[ (\varvec{\theta } \!-\! \varvec{\theta }_k)^\top \varvec{m}_{k-1} \right] \!+\!\frac{\hat{\beta }_{1k}}{\tilde{\beta }_{1k}} \mathbb {E}\left[ (\varvec{\theta } \!-\! \varvec{\theta }_k)^\top \nabla f (\varvec{\theta }_k) \right] \right\} , \end{aligned}$$

where \(\varvec{\textsf{d}}_k:= - \textsf{H}_k^{-1} \hat{\varvec{m}}_k\), \(\hat{\beta }_{1k}:= 1 - \beta _{1k}\), and \(\tilde{\beta }_{1k}:= 1 - \beta _{1k}^{k+1}\).

Proof

Let \(\varvec{\theta } \in \mathbb {R}^d\) and \(k\in \mathbb {N}\). The definition of \(\varvec{\theta }_{k+1}:= \varvec{\theta }_{k} + \alpha _k \varvec{\textsf{d}}_k\) implies that

$$\begin{aligned} \Vert \varvec{\theta }_{k+1} - \varvec{\theta } \Vert _{\textsf{H}_k}^2 = \Vert \varvec{\theta }_{k} - \varvec{\theta } \Vert _{\textsf{H}_k}^2 + 2 \alpha _k \langle \varvec{\theta }_{k} - \varvec{\theta }, \varvec{\textsf{d}}_k \rangle _{\textsf{H}_k} + \alpha _k^2 \Vert \varvec{\textsf{d}}_k\Vert _{\textsf{H}_k}^2. \end{aligned}$$

Moreover, the definitions of \(\varvec{\textsf{d}}_k\), \(\varvec{m}_k\), and \(\hat{\varvec{m}}_k\) ensure that

$$\begin{aligned} \left\langle \varvec{\theta }_k - \varvec{\theta }, \varvec{\textsf{d}}_k \right\rangle _{\textsf{H}_k}&= \left\langle \varvec{\theta }_k - \varvec{\theta }, \textsf{H}_k \varvec{\textsf{d}}_k \right\rangle = \left\langle \varvec{\theta } - \varvec{\theta }_k, \hat{\varvec{m}}_k \right\rangle = \frac{1}{\tilde{\beta }_{1k}} (\varvec{\theta } - \varvec{\theta }_k)^\top {\varvec{m}}_k\\&= \frac{\beta _{1k}}{\tilde{\beta }_{1k}} (\varvec{\theta } - \varvec{\theta }_k)^\top \varvec{m}_{k-1} + \frac{\hat{\beta }_{1k}}{\tilde{\beta }_{1k}} (\varvec{\theta } - \varvec{\theta }_k)^\top \nabla f_{B_k}(\varvec{\theta }_k). \end{aligned}$$

Hence,

$$\begin{aligned} \begin{aligned} \left\| \varvec{\theta }_{k+1} - \varvec{\theta } \right\| _{\textsf{H}_k}^2&= \left\| \varvec{\theta }_k -\varvec{\theta } \right\| _{\textsf{H}_k}^2 + \alpha _k^2 \left\| \varvec{\textsf{d}}_k \right\| _{\textsf{H}_k}^2\\&\quad + 2 \alpha _k \left\{ \frac{\beta _{1k}}{\tilde{\beta }_{1k}} (\varvec{\theta } - \varvec{\theta }_k)^\top \varvec{m}_{k-1} + \frac{\hat{\beta }_{1k}}{\tilde{\beta }_{1k}} (\varvec{\theta } - \varvec{\theta }_k)^\top \nabla f_{B_k} (\varvec{\theta }_k) \right\} . \end{aligned} \end{aligned}$$
(24)

Conditions (13) and (S3) guarantee that

$$\begin{aligned} \mathbb {E}\left[ \mathbb {E} \left[ (\varvec{\theta } \!-\! \varvec{\theta }_k)^\top \nabla f_{B_k} (\varvec{\theta }_k) \Big | \varvec{\theta }_k \right] \right] \!=\! \mathbb {E} \left[ (\varvec{\theta } \!-\! \varvec{\theta }_k)^\top \mathbb {E} \left[ \nabla f_{B_k} (\varvec{\theta }_k) \Big | \varvec{\theta }_k \right] \right] = \mathbb {E} \left[ (\varvec{\theta } \!-\! \varvec{\theta }_k)^\top \nabla f (\varvec{\theta }_k) \right] . \end{aligned}$$

Therefore, the lemma follows by taking the expectation on both sides of (24). This completes the proof. \(\square \)

Remark 1

Let us consider (16); that is,

$$\begin{aligned} \varvec{\theta }_{k+1} = P_{C,\textsf{H}_k} (\varvec{\theta }_{k} + \alpha _k \varvec{\textsf{d}}_k). \end{aligned}$$

Let \(k\in \mathbb {N}\) and \(\varvec{\theta } \in C\) (i.e., \(\varvec{\theta } = P_{C,\textsf{H}_k}(\varvec{\theta })\)). The nonexpansivity of \(P_{C,\textsf{H}_k}\) ensures that

$$\begin{aligned} \Vert \varvec{\theta }_{k+1} - \varvec{\theta }\Vert _{\textsf{H}_k} = \Vert P_{C,\textsf{H}_k} (\varvec{\theta }_{k} + \alpha _k \varvec{\textsf{d}}_k)- P_{C,\textsf{H}_k}(\varvec{\theta })\Vert _{\textsf{H}_k} \le \Vert (\varvec{\theta }_{k} + \alpha _k \varvec{\textsf{d}}_k)- \varvec{\theta }\Vert _{\textsf{H}_k}. \end{aligned}$$

Hence, we have that

$$\begin{aligned} \Vert \varvec{\theta }_{k+1} - \varvec{\theta } \Vert _{\textsf{H}_k}^2 \le \Vert \varvec{\theta }_{k} - \varvec{\theta } \Vert _{\textsf{H}_k}^2 + 2 \alpha _k \langle \varvec{\theta }_{k} - \varvec{\theta }, \varvec{\textsf{d}}_k \rangle _{\textsf{H}_k} + \alpha _k^2 \Vert \varvec{\textsf{d}}_k\Vert _{\textsf{H}_k}^2. \end{aligned}$$

Accordingly, an argument similar to the proof of Lemma 1 ensures that, for all \(\varvec{\theta } \in C\) and all \(k\in \mathbb {N}\),

$$\begin{aligned} \begin{aligned} \mathbb {E}\left[ \left\| \varvec{\theta }_{k+1} - \varvec{\theta } \right\| _{\textsf{H}_k}^2 \right]&\le \mathbb {E}\left[ \left\| \varvec{\theta }_{k} - \varvec{\theta } \right\| _{\textsf{H}_k}^2 \right] + \alpha _k^2 \mathbb {E}\left[ \left\| \varvec{\textsf{d}}_k \right\| _{\textsf{H}_k}^2 \right] \\&\quad \!+\! 2 \alpha _k \left\{ \frac{\beta _{1k}}{\tilde{\beta }_{1k}} \mathbb {E}\left[ (\varvec{\theta } \!-\! \varvec{\theta }_k)^\top \varvec{m}_{k-1} \right] \!+\!\frac{\hat{\beta }_{1k}}{\tilde{\beta }_{1k}} \mathbb {E}\left[ (\varvec{\theta } \!-\! \varvec{\theta }_k)^\top \nabla f (\varvec{\theta }_k) \right] \right\} . \end{aligned} \end{aligned}$$
(25)

We may assume without loss of generality that the assertion in Lemma 1 holds for all \(\varvec{\theta }\in C\) and all \(k\in \mathbb {N}\), since the theorems in this paper evaluate the upper bounds of (4), (5), (6), and (7). An argument similar to the proofs of the theorems in this paper (see the following lemmas and the proofs of the theorems) leads to versions of the theorems for all \(\varvec{\theta }\) belonging to C; that is, the sequence \((\varvec{\theta }_k)_{k\in \mathbb {N}}\) generated by (16) satisfies the assertions in the theorems for all \(\varvec{\theta } \in C\).
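
As an illustration of the projected variant (16), the following is a minimal sketch under the additional assumption that C is a box \([\ell , u]^d\) (a hypothetical special case, not required by the analysis): since \(\textsf{H}_k\) is diagonal, the weighted least-squares problem defining \(P_{C,\textsf{H}_k}\) separates coordinate-wise, so the metric projection reduces to coordinate-wise clipping, independently of \(\textsf{H}_k\). For a general closed convex set C the projection need not have a closed form.

```python
import numpy as np

def projected_adam_update(theta, d, alpha, lower, upper):
    # theta_{k+1} = P_{C,H_k}(theta_k + alpha_k d_k) for a box C = [lower, upper]^d.
    # P_{C,H_k}(y) minimizes sum_i sqrt(v_hat_{k,i}) * (x_i - y_i)^2 over x in C,
    # which separates per coordinate, so the minimizer is coordinate-wise clipping.
    return np.clip(theta + alpha * d, lower, upper)
```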

Lemma 2

Suppose that (S2)(13), (14), and (A1) hold. Then, Adam defined by Algorithm 1 satisfies the following: for all \(k\in \mathbb {N}\),

$$\begin{aligned} \mathbb {E}\left[ \left\| \varvec{m}_k \right\| ^2 \right] \le \frac{\sigma ^2}{b} + G^2, \quad \mathbb {E}\left[ \left\| \varvec{\textsf{d}}_k \right\| _{\textsf{H}_k}^2 \right] \le \frac{\sqrt{\tilde{\beta }_{2k}}}{\tilde{\beta }_{1k}^2 \sqrt{{\textit{v}}_*}} \left( \frac{\sigma ^2}{b} + G^2 \right) , \end{aligned}$$

where \({v}_*:= \inf \{ \min _{i\in [d]} {\textit{v}}_{k,i} :k\in \mathbb {N}\}\), \(\tilde{\beta }_{1k}:= 1 - \beta _{1k}^{k+1}\), and \(\tilde{\beta }_{2k}:= 1 - \beta _{2k}^{k+1}\).

Proof

Assumption (S2)(13) implies that

$$\begin{aligned} \begin{aligned} \mathbb {E} \left[ \left\| \nabla f_{B_k} (\varvec{\theta }_{k}) \right\| ^2 \Big | \varvec{\theta }_k \right]&= \mathbb {E} \left[ \left\| \nabla f_{B_k} (\varvec{\theta }_{k}) - \nabla f (\varvec{\theta }_{k}) + \nabla f (\varvec{\theta }_{k}) \right\| ^2 \Big | \varvec{\theta }_k \right] \\&= \mathbb {E} \left[ \left\| \nabla f_{B_k} (\varvec{\theta }_{k}) - \nabla f (\varvec{\theta }_{k}) \right\| ^2 \Big | \varvec{\theta }_k \right] + \mathbb {E} \left[ \left\| \nabla f (\varvec{\theta }_{k}) \right\| ^2 \Big | \varvec{\theta }_k \right] \\&\quad + 2 \mathbb {E} \left[ (\nabla f_{B_k} (\varvec{\theta }_{k}) - \nabla f (\varvec{\theta }_{k}))^\top \nabla f (\varvec{\theta }_{k}) \Big | \varvec{\theta }_k \right] \\&= \mathbb {E} \left[ \left\| \nabla f_{B_k} (\varvec{\theta }_{k}) - \nabla f (\varvec{\theta }_{k}) \right\| ^2 \Big | \varvec{\theta }_k \right] + \Vert \nabla f (\varvec{\theta }_{k}) \Vert ^2, \end{aligned} \end{aligned}$$
(26)

which, together with (S2)(14) and (A1), implies that

$$\begin{aligned} \mathbb {E} \left[ \left\| \nabla f_{B_k} (\varvec{\theta }_{k}) \right\| ^2 \right] \le \frac{\sigma ^2}{b} + G^2. \end{aligned}$$
(27)

The convexity of \(\Vert \cdot \Vert ^2\), together with the definition of \(\varvec{m}_k\) and (27), guarantees that, for all \(k\in \mathbb {N}\),

$$\begin{aligned} \mathbb {E}\left[ \left\| \varvec{m}_k \right\| ^2 \right]&\le \beta _{1k} \mathbb {E}\left[ \left\| \varvec{m}_{k-1} \right\| ^2 \right] + \hat{\beta }_{1k} \mathbb {E}\left[ \left\| \nabla f_{B_k} (\varvec{\theta }_k) \right\| ^2 \right] \\&\le \beta _{1k} \mathbb {E} \left[ \left\| \varvec{m}_{k-1} \right\| ^2 \right] + \hat{\beta }_{1k} \left( \frac{\sigma ^2}{b} + G^2 \right) . \end{aligned}$$

Induction thus ensures that, for all \(k\in \mathbb {N}\),

$$\begin{aligned} \mathbb {E} \left[ \left\| \varvec{m}_k \right\| ^2 \right] \le \max \left\{ \Vert \varvec{m}_{-1}\Vert ^2, \frac{\sigma ^2}{b} + G^2 \right\} = \frac{\sigma ^2}{b} + G^2, \end{aligned}$$
(28)

where \(\varvec{m}_{-1} = \varvec{0}\). For \(k\in \mathbb {N}\), \(\textsf{H}_k \in \mathbb {S}_{++}^d\) guarantees the existence of a unique matrix \(\overline{\textsf{H}}_k \in \mathbb {S}_{++}^d\) such that \(\textsf{H}_k = \overline{\textsf{H}}_k^2\) [12, Theorem 7.2.6]. We have that, for all \(\varvec{x}\in \mathbb {R}^d\), \(\Vert \varvec{x}\Vert _{\textsf{H}_k}^2 = \Vert \overline{\textsf{H}}_k \varvec{x} \Vert ^2\). Accordingly, the definitions of \(\varvec{\textsf{d}}_k\) and \(\hat{\varvec{m}}_k\) imply that, for all \(k\in \mathbb {N}\),

$$\begin{aligned} \mathbb {E} \left[ \left\| \varvec{\textsf{d}}_k \right\| _{\textsf{H}_k}^2 \right] = \mathbb {E} \left[ \left\| \overline{\textsf{H}}_k^{-1} \textsf{H}_k\varvec{\textsf{d}}_k \right\| ^2 \right] \le \frac{1}{\tilde{\beta }_{1k}^2} \mathbb {E} \left[ \left\| \overline{\textsf{H}}_k^{-1} \right\| ^2 \Vert \varvec{m}_k \Vert ^2 \right] , \end{aligned}$$

where

$$\begin{aligned} \left\| \overline{\textsf{H}}_k^{-1} \right\| = \left\| \textsf{diag}\left( \hat{\textit{v}}_{k,i}^{-\frac{1}{4}} \right) \right\| = \max _{i\in [d]} \hat{\textit{v}}_{k,i}^{-\frac{1}{4}} = \max _{i\in [d]} \left( \frac{\textit{v}_{k,i}}{\tilde{\beta }_{2k}} \right) ^{-\frac{1}{4}} =: \left( \frac{\textit{v}_{k,i^*}}{\tilde{\beta }_{2k}} \right) ^{-\frac{1}{4}}. \end{aligned}$$

Moreover, the definition of

$$\begin{aligned} {\textit{v}}_* := \inf \left\{ {\textit{v}}_{k,i^*} :k\in \mathbb {N} \right\} \end{aligned}$$

and (28) imply that, for all \(k\in \mathbb {N}\),

$$\begin{aligned} \mathbb {E} \left[ \left\| \varvec{\textsf{d}}_k \right\| _{\textsf{H}_k}^2 \right] \le \frac{\tilde{\beta }_{2k}^{\frac{1}{2}}}{\tilde{\beta }_{1k}^2 {\textit{v}}_*^{\frac{1}{2}}} \left( \frac{\sigma ^2}{b} + G^2 \right) , \end{aligned}$$

completing the proof. \(\square \)
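
The induction step above rests on the fact that an exponential moving average of vectors whose squared norms are bounded by a constant \(C\) keeps its squared norm bounded by \(C\) (by convexity of \(\Vert \cdot \Vert ^2\)). The following small check illustrates this deterministically, with \(C\) playing the role of \(\sigma ^2/b + G^2\); it is only an illustration of the induction, not the expectation bound of Lemma 2.

```python
import numpy as np

rng = np.random.default_rng(0)
d, C, beta1 = 5, 4.0, 0.999            # C plays the role of sigma^2/b + G^2
m = np.zeros(d)                        # m_{-1} = 0
for k in range(10_000):
    g = rng.standard_normal(d)
    g *= np.sqrt(C) / max(np.linalg.norm(g), 1e-12)   # enforce ||g_k||^2 <= C
    m = beta1 * m + (1.0 - beta1) * g                  # m_k = beta1 m_{k-1} + (1 - beta1) g_k
    assert m @ m <= C + 1e-9           # ||m_k||^2 stays below C throughout
```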

Lemma 3

Suppose that (S1)–(S3) and (A1)–(A2) hold. Then, Adam defined by Algorithm 1 satisfies the following: for all \(k\in \mathbb {N}\) and all \(\varvec{\theta } \in \mathbb {R}^d\),

$$\begin{aligned} \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_{k-1} \right] \le \frac{D(\varvec{\theta }) M^{\frac{1}{4}}}{{{\textit{v}}_*^{\frac{1}{4}}} \beta _{1k}} \sqrt{ \frac{\sigma ^2}{b} + G^2} + \frac{\alpha _k \sqrt{\tilde{\beta }_{2k}}}{2 \sqrt{\textit{v}_*} \beta _{1k} \tilde{\beta }_{1k}} \left( \frac{\sigma ^2}{b} + G^2 \right) + D(\varvec{\theta }) G \frac{\hat{\beta }_{1k}}{\beta _{1k}}, \end{aligned}$$

where \(\nabla f_{B_k}(\varvec{\theta }_k) \odot \nabla f_{B_k}(\varvec{\theta }_k):= (g_{k,i}^2) \in \mathbb {R}_{+}^d\), \(M:= \sup \{\max _{i\in [d]} g_{k,i}^2 :k\in \mathbb {N}\} < + \infty \), \(\hat{\beta }_{1k}:= 1 - \beta _{1k}\), \(\tilde{\beta }_{1k}:= 1 - \beta _{1k}^{k+1}\), \(\tilde{\beta }_{2k}:= 1 - \beta _{2k}^{k+1}\), \({\textit{v}}_*\) is defined as in Lemma 2, and \(D(\varvec{\theta })\) and G are defined as in Assumptions (A1) and (A2).

Proof

Let \(\varvec{\theta } \in \mathbb {R}^d\). Lemma 1 guarantees that for all \(k\in \mathbb {N}\),

$$\begin{aligned} \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_{k-1} \right]&= \underbrace{\frac{\tilde{\beta }_{1k}}{2 \alpha _k \beta _{1k}} \left\{ \mathbb {E}\left[ \left\| \varvec{\theta }_{k} - \varvec{\theta } \right\| _{\textsf{H}_k}^2 \right] - \mathbb {E}\left[ \left\| \varvec{\theta }_{k+1} - \varvec{\theta } \right\| _{\textsf{H}_k}^2 \right] \right\} }_{a_k}\\&\quad + \underbrace{\frac{\alpha _k \tilde{\beta }_{1k}}{2 \beta _{1k}} \mathbb {E}\left[ \left\| \varvec{\textsf{d}}_k \right\| _{\textsf{H}_k}^2 \right] }_{b_k}\\&\quad + \underbrace{\frac{\hat{\beta }_{1k}}{\beta _{1k}} \mathbb {E}\left[ (\varvec{\theta } - \varvec{\theta }_k)^\top \nabla f (\varvec{\theta }_k) \right] }_{c_k}. \end{aligned}$$
(29)

The triangle inequality and the definition of \(\varvec{\theta }_{k+1}:= \varvec{\theta }_{k} + \alpha _k \varvec{\textsf{d}}_k\) ensure that

$$\begin{aligned} \begin{aligned} a_k&= \frac{\tilde{\beta }_{1k}}{2 \alpha _k \beta _{1k}} \mathbb {E}\left[ \left( \left\| \varvec{\theta }_{k} - \varvec{\theta } \right\| _{\textsf{H}_k} + \left\| \varvec{\theta }_{k+1} - \varvec{\theta } \right\| _{\textsf{H}_k} \right) \left( \left\| \varvec{\theta }_{k} - \varvec{\theta } \right\| _{\textsf{H}_k} - \left\| \varvec{\theta }_{k+1} - \varvec{\theta } \right\| _{\textsf{H}_k} \right) \right] \\&\le \frac{\tilde{\beta }_{1k}}{2 \alpha _k \beta _{1k}} \mathbb {E}\left[ \left( \left\| \varvec{\theta }_{k} - \varvec{\theta } \right\| _{\textsf{H}_k} + \left\| \varvec{\theta }_{k+1} - \varvec{\theta } \right\| _{\textsf{H}_k} \right) \left\| \varvec{\theta }_{k} - \varvec{\theta }_{k+1} \right\| _{\textsf{H}_k} \right] \\&= \frac{\tilde{\beta }_{1k}}{2\beta _{1k}} \mathbb {E}\left[ \left( \left\| \varvec{\theta }_{k} - \varvec{\theta } \right\| _{\textsf{H}_k} + \left\| \varvec{\theta }_{k+1} - \varvec{\theta } \right\| _{\textsf{H}_k} \right) \left\| \varvec{\textsf{d}}_k \right\| _{\textsf{H}_k} \right] . \end{aligned} \end{aligned}$$
(30)

Let \(\nabla f_{B_k}(\varvec{\theta }_k) \odot \nabla f_{B_k}(\varvec{\theta }_k):= (g_{k,i}^2) \in \mathbb {R}_{+}^d\). Assumption (A1) ensures that there exists \(M \in \mathbb {R}\) such that, for all \(k\in \mathbb {N}\), \(\max _{i\in [d]} g_{k,i}^2 \le M\). The definition of \({\varvec{ v}}_k\) guarantees that, for all \(i\in [d]\) and all \(k\in \mathbb {N}\),

$$\begin{aligned} \textit{v}_{k,i} = \beta _{2k} \textit{v}_{k-1,i} + \hat{\beta }_{2k} g_{k,i}^2. \end{aligned}$$

Induction thus ensures that, for all \(i\in [d]\) and all \(k\in \mathbb {N}\),

$$\begin{aligned} \textit{v}_{k,i} \le \max \{ \textit{v}_{0,i}, M \} = M, \end{aligned}$$

where \(\varvec{ v}_0 = (\textit{v}_{0,i}) = \varvec{0}\). From the definition of \(\hat{\varvec{ v}}_k\), we have that, for all \(i\in [d]\) and all \(k\in \mathbb {N}\),

$$\begin{aligned} \hat{\textit{v}}_{k,i} = \frac{\textit{v}_{k,i}}{\tilde{\beta }_{2k}} \le \frac{M}{\tilde{\beta }_{2k}}, \end{aligned}$$
(31)

which implies that

$$\begin{aligned} \left\| \overline{\textsf{H}}_k \right\| = \left\| \textsf{diag}\left( \hat{\textit{v}}_{k,i}^{\frac{1}{4}} \right) \right\| = {\max _{i\in [d]} \hat{\textit{v}}_{k,i}^{\frac{1}{4}}} \le \left( \frac{M}{\tilde{\beta }_{2k}} \right) ^{\frac{1}{4}}. \end{aligned}$$

Hence, (A2) implies that, for all \(k\in \mathbb {N}\),

$$\begin{aligned}&\left\| \varvec{\theta }_{k} - \varvec{\theta } \right\| _{\textsf{H}_k} = \left\| \overline{\textsf{H}}_k (\varvec{\theta }_{k} - \varvec{\theta }) \right\| \le \left\| \overline{\textsf{H}}_k \right\| \left\| \varvec{\theta }_{k} - \varvec{\theta }\right\| \le D(\varvec{\theta }) \left( \frac{M}{\tilde{\beta }_{2k}} \right) ^{\frac{1}{4}},\\&\left\| \varvec{\theta }_{k+1} - \varvec{\theta } \right\| _{\textsf{H}_k} = \left\| \overline{\textsf{H}}_k (\varvec{\theta }_{k+1} - \varvec{\theta }) \right\| \le \left\| \overline{\textsf{H}}_k \right\| \left\| \varvec{\theta }_{k+1} - \varvec{\theta }\right\| \le D(\varvec{\theta }) \left( \frac{M}{\tilde{\beta }_{2k}} \right) ^{\frac{1}{4}}. \end{aligned}$$

Lemma 2, Jensen’s inequality, and (30) ensure that, for all \(k\in \mathbb {N}\),

$$\begin{aligned} \begin{aligned} a_k&\le \frac{\tilde{\beta }_{1k}}{2\beta _{1k}} 2 D(\varvec{\theta }) \left( \frac{M}{\tilde{\beta }_{2k}} \right) ^{\frac{1}{4}} \mathbb {E}\left[ \left\| \varvec{\textsf{d}}_k \right\| _{\textsf{H}_k} \right] \le \frac{\tilde{\beta }_{1k}}{\beta _{1k}} D(\varvec{\theta }) \frac{M^{\frac{1}{4}}}{\tilde{\beta }_{2k}^{\frac{1}{4}}} \frac{\tilde{\beta }_{2k}^{\frac{1}{4}}}{\tilde{\beta }_{1k} \textit{v}_*^{\frac{1}{4}}} \sqrt{ \frac{\sigma ^2}{b} + G^2} \\&= \frac{D(\varvec{\theta }) M^{\frac{1}{4}}}{{\textit{v}_*^{\frac{1}{4}}}\beta _{1k}} \sqrt{ \frac{\sigma ^2}{b} + G^2}. \end{aligned} \end{aligned}$$
(32)

Lemma 2 guarantees that, for all \(k\in \mathbb {N}\),

$$\begin{aligned} b_k \!=\! \frac{\alpha _k \tilde{\beta }_{1k}}{2 \beta _{1k}} \mathbb {E}\left[ \left\| \varvec{\textsf{d}}_k \right\| _{\textsf{H}_k}^2 \right] \!\le \! \frac{\alpha _k \tilde{\beta }_{1k}}{2 \beta _{1k}} \frac{\sqrt{\tilde{\beta }_{2k}}}{\tilde{\beta }_{1k}^2 \sqrt{\textit{v}_*}} \left( \frac{\sigma ^2}{b} \!+\! G^2 \right) \!=\! \frac{\alpha _k \sqrt{\tilde{\beta }_{2k}}}{2 \sqrt{\textit{v}_*} \beta _{1k} \tilde{\beta }_{1k}} \left( \frac{\sigma ^2}{b} \!+\! G^2 \right) . \end{aligned}$$
(33)

The Cauchy–Schwarz inequality and Assumption (A2) imply that, for all \(k\in \mathbb {N}\),

$$\begin{aligned} c_k = \frac{\hat{\beta }_{1k}}{\beta _{1k}} \mathbb {E}\left[ (\varvec{\theta } - \varvec{\theta }_k)^\top \nabla f (\varvec{\theta }_k) \right] \le D(\varvec{\theta }) G \frac{\hat{\beta }_{1k}}{\beta _{1k}}. \end{aligned}$$
(34)

Therefore, (29), (32), (33), and (34) ensure that, for all \(k\in \mathbb {N}\),

$$\begin{aligned} \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_{k-1} \right] \le \frac{D(\varvec{\theta }) M^{\frac{1}{4}}}{{\textit{v}_*^{\frac{1}{4}}}\beta _{1k}} \sqrt{ \frac{\sigma ^2}{b} + G^2} + \frac{\alpha _k \sqrt{\tilde{\beta }_{2k}}}{2 \sqrt{\textit{v}_*} \beta _{1k} \tilde{\beta }_{1k}} \left( \frac{\sigma ^2}{b} + G^2 \right) + D(\varvec{\theta }) G \frac{\hat{\beta }_{1k}}{\beta _{1k}}, \end{aligned}$$

which completes the proof. \(\square \)

Lemma 4

Suppose that (S1)–(S3) and (A1)–(A2) hold. Then, Adam defined by Algorithm 1 satisfies the following: for all \(k\in \mathbb {N}\) and all \(\varvec{\theta } \in \mathbb {R}^d\),

$$\begin{aligned} \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_{k} \right]&\le \frac{D(\varvec{\theta }) M^{\frac{1}{4}}}{{\textit{v}_*^{\frac{1}{4}}} \beta _{1k}} \sqrt{ \frac{\sigma ^2}{b} + G^2} + \frac{\alpha _k \sqrt{\tilde{\beta }_{2k}}}{2 \sqrt{\textit{v}_*} \beta _{1k} \tilde{\beta }_{1k}} \left( \frac{\sigma ^2}{b} + G^2 \right) + D(\varvec{\theta }) G \frac{\hat{\beta }_{1k}}{\beta _{1k}}\\&\quad + \hat{\beta }_{1k} D(\varvec{\theta }) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) , \end{aligned}$$

where the parameters are defined as in Lemma 3.

Proof

Let \(\varvec{\theta } \in \mathbb {R}^d\) and \(k\in \mathbb {N}\). The definition of \(\varvec{m}_k\) implies that

$$\begin{aligned} (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_{k}&= (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_{k-1} + (\varvec{\theta }_k - \varvec{\theta })^\top (\varvec{m}_{k} - \varvec{m}_{k-1})\\&= (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_{k-1} + \hat{\beta }_{1k} (\varvec{\theta }_k - \varvec{\theta })^\top (\nabla f_{B_k}(\varvec{\theta }_k) - \varvec{m}_{k-1}), \end{aligned}$$

which, together with the Cauchy–Schwarz inequality, the triangle inequality, and Assumptions (A1) and (A2), implies that

$$\begin{aligned} (\varvec{\theta }_k \!-\! \varvec{\theta })^\top \varvec{m}_{k}&\!\le \! (\varvec{\theta }_k \!-\! \varvec{\theta })^\top \varvec{m}_{k-1} \!+\! \hat{\beta }_{1k} D(\varvec{\theta }) \Vert \nabla f_{B_k}(\varvec{\theta }_k) \!-\! \varvec{m}_{k-1}\Vert \\&\!\le \! (\varvec{\theta }_k \!-\! \varvec{\theta })^\top \varvec{m}_{k-1} \!+\! \hat{\beta }_{1k} D(\varvec{\theta }) (B \!+\! \Vert \varvec{m}_{k-1}\Vert ). \end{aligned}$$

Lemma 2 and Jensen’s inequality guarantee that

$$\begin{aligned} \mathbb {E} \left[ (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_{k} \right] \le \mathbb {E} \left[ (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_{k-1} \right] + \hat{\beta }_{1k} D(\varvec{\theta }) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) . \end{aligned}$$
(35)

Hence, Lemma 3 implies that

$$\begin{aligned} \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_{k} \right]&\le \frac{D(\varvec{\theta }) M^{\frac{1}{4}}}{{\textit{v}_*^{\frac{1}{4}}} \beta _{1k}} \sqrt{ \frac{\sigma ^2}{b} + G^2} + \frac{\alpha _k \sqrt{\tilde{\beta }_{2k}}}{2 \sqrt{\textit{v}_*} \beta _{1k} \tilde{\beta }_{1k}} \left( \frac{\sigma ^2}{b} + G^2 \right) + D(\varvec{\theta }) G \frac{\hat{\beta }_{1k}}{\beta _{1k}}\\&\quad + \hat{\beta }_{1k} D(\varvec{\theta }) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) , \end{aligned}$$

which completes the proof. \(\square \)

Lemma 5

Suppose that (S1)–(S3) and (A1)–(A2) hold. Then, Adam defined by Algorithm 1 satisfies the following: for all \(k\in \mathbb {N}\) and all \(\varvec{\theta } \in \mathbb {R}^d\),

$$\begin{aligned} \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \nabla f(\varvec{\theta }_k) \right]&\le \frac{D(\varvec{\theta }) M^{\frac{1}{4}}}{{\textit{v}_*^{\frac{1}{4}}} \beta _{1k}} \sqrt{ \frac{\sigma ^2}{b} + G^2} + \frac{\alpha _k \sqrt{\tilde{\beta }_{2k}}}{2 \sqrt{\textit{v}_*} \beta _{1k} \tilde{\beta }_{1k}} \left( \frac{\sigma ^2}{b} + G^2 \right) + D(\varvec{\theta }) G \frac{\hat{\beta }_{1k}}{\beta _{1k}}\\&\quad + D(\varvec{\theta })\left( \frac{1}{\beta _{1k}} + 2 \hat{\beta }_{1k} \right) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) , \end{aligned}$$

where the parameters are defined as in Lemma 3.

Proof

Let \(\varvec{\theta } \in \mathbb {R}^d\) and \(k\in \mathbb {N}\). The definition of \(\varvec{m}_k\) ensures that

$$\begin{aligned}&(\varvec{\theta }_k - \varvec{\theta })^\top \nabla f_{B_k}(\varvec{\theta }_k)\\&= (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_k + (\varvec{\theta }_k - \varvec{\theta })^\top (\nabla f_{B_k}(\varvec{\theta }_k) - \varvec{m}_{k-1}) + (\varvec{\theta }_k - \varvec{\theta })^\top (\varvec{m}_{k-1} - \varvec{m}_{k})\\&= (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_k + \frac{1}{\beta _{1k}}(\varvec{\theta }_k - \varvec{\theta })^\top (\nabla f_{B_k}(\varvec{\theta }_k) - \varvec{m}_{k}) + \hat{\beta }_{1k} (\varvec{\theta }_k - \varvec{\theta })^\top (\varvec{m}_{k-1} - \nabla f_{B_k}(\varvec{\theta }_k)), \end{aligned}$$

which, together with the Cauchy–Schwarz inequality, the triangle inequality, and Assumptions (A1) and (A2), implies that

$$\begin{aligned} (\varvec{\theta }_k - \varvec{\theta })^\top \nabla f_{B_k}(\varvec{\theta }_k) \le (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_k + \frac{1}{\beta _{1k}} D(\varvec{\theta }) (B + \Vert \varvec{m}_{k}\Vert ) + \hat{\beta }_{1k} D(\varvec{\theta }) (B + \Vert \varvec{m}_{k-1}\Vert ). \end{aligned}$$

Lemma 2 and Jensen’s inequality guarantee that

$$\begin{aligned} \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \nabla f(\varvec{\theta }_k) \right] \le \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_k \right] + \left( \frac{1}{\beta _{1k}} + \hat{\beta }_{1k} \right) D(\varvec{\theta }) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) , \end{aligned}$$
(36)

which, together with Lemma 4, implies that

$$\begin{aligned} \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \nabla f(\varvec{\theta }_k) \right]&\le \frac{D(\varvec{\theta }) M^{\frac{1}{4}}}{{\textit{v}_*^{\frac{1}{4}}} \beta _{1k}} \sqrt{ \frac{\sigma ^2}{b} + G^2} + \frac{\alpha _k \sqrt{\tilde{\beta }_{2k}}}{2 \sqrt{\textit{v}_*} \beta _{1k} \tilde{\beta }_{1k}} \left( \frac{\sigma ^2}{b} + G^2 \right) + D(\varvec{\theta }) G \frac{\hat{\beta }_{1k}}{\beta _{1k}}\\&\quad + D(\varvec{\theta })\left( \frac{1}{\beta _{1k}} + 2 \hat{\beta }_{1k} \right) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) , \end{aligned}$$

which completes the proof. \(\square \)

Lemma 6

Suppose that (S1)–(S3) and (A1)–(A2) hold, \(\beta _{1k}:= \beta _1 \in (0,1)\), and \((\alpha _k)_{k\in \mathbb {N}}\) is monotone decreasing. Then, Adam defined by Algorithm 1 with (9) satisfies the following: for all \(K\ge 1\) and all \(\varvec{\theta } \in \mathbb {R}^d\),

$$\begin{aligned} \frac{1}{K}\sum _{k=1}^K \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_{k-1} \right] \le \frac{d \tilde{D}(\varvec{\theta }) \sqrt{M} \tilde{\beta }_{1K}}{2 \beta _1 \alpha _K \sqrt{\tilde{\beta }_{2K}}K} + \frac{(\sigma ^2 b^{-1} + G^2)}{2 \sqrt{\textit{v}_*} \beta _{1} \hat{\beta }_1 K} \sum _{k=1}^K \alpha _k \sqrt{\tilde{\beta }_{2k}} + D(\varvec{\theta }) G \frac{\hat{\beta }_{1}}{\beta _{1}}, \end{aligned}$$

where the parameters are defined as in Lemma 3 and \(\tilde{D} (\varvec{\theta }):= \sup \{ \max _{i\in [d]} (\theta _{k,i} - \theta _i)^2 :k \in \mathbb {N} \} < + \infty \).

Proof

Let \(\varvec{\theta } \in \mathbb {R}^d\) and

$$\begin{aligned} \gamma _k := \frac{\tilde{\beta }_{1k}}{2 \beta _{1} \alpha _k} \end{aligned}$$

for all \(k\in \mathbb {N}\). Since \((\alpha _k)_{k\in \mathbb {N}}\) is monotone decreasing and \(\tilde{\beta }_{1k} = 1 - \beta _1^{k+1} \le 1 - \beta _1^{k+2} = \tilde{\beta }_{1,k+1}\), \((\gamma _k)_{k\in \mathbb {N}}\) is monotone increasing. From the definition of \(a_k\) in (29), we have that, for all \(K \ge 1\),

$$\begin{aligned} \begin{aligned} \sum _{k = 1}^K a_k&= \gamma _1 \mathbb {E}\left[ \left\| \varvec{\theta }_{1} - \varvec{\theta } \right\| _{\textsf{H}_{1}}^2\right] + \underbrace{ \sum _{k=2}^K \left\{ \gamma _k \mathbb {E}\left[ \left\| \varvec{\theta }_{k} - \varvec{\theta } \right\| _{\textsf{H}_{k}}^2\right] - \gamma _{k-1} \mathbb {E}\left[ \left\| \varvec{\theta }_{k} - \varvec{\theta } \right\| _{\textsf{H}_{k-1}}^2\right] \right\} }_{{\Gamma }_K}\\&\quad - \gamma _{K} \mathbb {E} \left[ \left\| \varvec{\theta }_{K+1} - \varvec{\theta } \right\| _{\textsf{H}_{K}}^2 \right] . \end{aligned} \end{aligned}$$
(37)

Since \(\overline{\textsf{H}}_k \in \mathbb {S}_{++}^d\) exists such that \(\textsf{H}_k = \overline{\textsf{H}}_k^2\), we have \(\Vert \varvec{x}\Vert _{\textsf{H}_k}^2 = \Vert \overline{\textsf{H}}_k \varvec{x} \Vert ^2\) for all \(\varvec{x}\in \mathbb {R}^d\). Accordingly, we have

$$\begin{aligned} {\Gamma }_K = \mathbb {E} \left[ \sum _{k=2}^K \left\{ \gamma _{k} \left\| \overline{\textsf{H}}_{k} (\varvec{\theta }_{k} - \varvec{\theta }) \right\| ^2 - \gamma _{k-1} \left\| \overline{\textsf{H}}_{k-1} (\varvec{\theta }_{k} - \varvec{\theta }) \right\| ^2 \right\} \right] . \end{aligned}$$

From \(\overline{\textsf{H}}_{k} = \textsf{diag}(\hat{\textit{v}}_{k,i}^{1/4})\), we have that, for all \(\varvec{x} = (x_i)_{i=1}^d \in \mathbb {R}^d\), \(\Vert \overline{\textsf{H}}_{k} \varvec{x} \Vert ^2 = \sum _{i=1}^d \sqrt{\hat{\textit{v}}_{k,i}} x_i^2\). Hence, for all \(K\ge 2\),

$$\begin{aligned} {\Gamma }_K = \mathbb {E} \left[ \sum _{k=2}^K \sum _{i=1}^d \left( \gamma _{k} \sqrt{\hat{\textit{v}}_{k,i}} - \gamma _{k-1} \sqrt{\hat{\textit{v}}_{k-1,i}} \right) (\theta _{k,i} - \theta _i)^2 \right] . \end{aligned}$$
(38)

Condition (9) and \(\gamma _k \ge \gamma _{k-1}\) (\(k \ge 1\)) imply that, for all \(k \ge 1\) and all \(i\in [d]\),

$$\begin{aligned} \gamma _{k} \sqrt{\hat{\textit{v}}_{k,i}} - \gamma _{k-1} \sqrt{\hat{\textit{v}}_{k-1,i}} \ge 0. \end{aligned}$$

Moreover, (A2) ensures that \(\tilde{D} (\varvec{\theta }):= \sup \{ \max _{i\in [d]} (\theta _{k,i} - \theta _i)^2 :k \in \mathbb {N} \} < + \infty \). Accordingly, for all \(K \ge 2\),

$$\begin{aligned} {\Gamma }_K \le \tilde{D}(\varvec{\theta }) \mathbb {E} \left[ \sum _{k=2}^K \sum _{i=1}^d \left( \gamma _{k}\sqrt{\hat{\textit{v}}_{k,i}} - \gamma _{k-1} \sqrt{\hat{\textit{v}}_{k-1,i}} \right) \right] = \tilde{D}(\varvec{\theta }) \mathbb {E} \left[ \sum _{i=1}^d \left( \gamma _{K} \sqrt{\hat{\textit{v}}_{K,i}} - \gamma _{1} \sqrt{\hat{\textit{v}}_{1,i}} \right) \right] . \end{aligned}$$

Therefore, (37), \(\mathbb {E} [\Vert \varvec{\theta }_{1} - \varvec{\theta }\Vert _{\textsf{H}_{1}}^2] \le \tilde{D}(\varvec{\theta }) \mathbb {E} [ \sum _{i=1}^d \sqrt{\hat{\textit{v}}_{1,i}}]\), and (31) imply, for all \(K\ge 1\),

$$\begin{aligned} \begin{aligned} \sum _{k=1}^K a_k&\le \gamma _{1} \tilde{D}(\varvec{\theta }) \mathbb {E} \left[ \sum _{i=1}^d \sqrt{\hat{\textit{v}}_{1,i}} \right] + \tilde{D}(\varvec{\theta }) \mathbb {E} \left[ \sum _{i=1}^d \left( \gamma _{K} \sqrt{\hat{\textit{v}}_{K,i}} - \gamma _{1} \sqrt{\hat{\textit{v}}_{1,i}} \right) \right] \\&= \gamma _{K} \tilde{D}(\varvec{\theta }) \mathbb {E} \left[ \sum _{i=1}^d \sqrt{\hat{\textit{v}}_{K,i}} \right] \\&\le {\gamma }_K \tilde{D}(\varvec{\theta }) \sum _{i=1}^d \sqrt{\frac{M}{\tilde{\beta }_{2K}}}\\&\le \frac{d \tilde{D}(\varvec{\theta }) \sqrt{M} \tilde{\beta }_{1K}}{2 \beta _1 \alpha _K \sqrt{\tilde{\beta }_{2K}}}. \end{aligned} \end{aligned}$$
(39)

Inequality (33) with \(\beta _{1k} = \beta _1\) and \(\tilde{\beta }_{1k}:= 1 - \beta _1^{k+1} \ge 1 - \beta _1 =: \hat{\beta }_1\) implies that

$$\begin{aligned} b_k \le \frac{\alpha _k \sqrt{\tilde{\beta }_{2k}}}{2 \sqrt{\textit{v}_*} \beta _{1k} \tilde{\beta }_{1k}} \left( \frac{\sigma ^2}{b} + G^2 \right) \le \frac{\alpha _k \sqrt{\tilde{\beta }_{2k}}}{2 \sqrt{\textit{v}_*} \beta _{1} \hat{\beta }_1} \left( \frac{\sigma ^2}{b} + G^2 \right) . \end{aligned}$$
(40)

Inequality (34) with \(\beta _{1k} = \beta _1\) implies that

$$\begin{aligned} c_k \le D(\varvec{\theta }) G \frac{\hat{\beta }_{1k}}{\beta _{1k}} = D(\varvec{\theta }) G \frac{\hat{\beta }_{1}}{\beta _{1}}. \end{aligned}$$
(41)

Hence, (29), (39), (40), and (41) ensure that, for all \(K\ge 1\),

$$\begin{aligned} \frac{1}{K}\sum _{k=1}^K \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_{k-1} \right] \le \frac{d \tilde{D}(\varvec{\theta }) \sqrt{M} \tilde{\beta }_{1K}}{2 \beta _1 \alpha _K \sqrt{\tilde{\beta }_{2K}}K} + \frac{(\sigma ^2 b^{-1} + G^2)}{2 \sqrt{\textit{v}_*} \beta _{1} \hat{\beta }_1 K} \sum _{k=1}^K \alpha _k \sqrt{\tilde{\beta }_{2k}} + D(\varvec{\theta }) G \frac{\hat{\beta }_{1}}{\beta _{1}}, \end{aligned}$$

which completes the proof. \(\square \)

Lemma 7

Suppose that (S1)–(S3) and (A1)–(A2) hold, \(\beta _{1k}:= \beta _1 \in (0,1)\), and \((\alpha _k)_{k\in \mathbb {N}}\) is monotone decreasing. Then, Adam defined by Algorithm 1 with (9) satisfies the following: for all \(K\ge 1\) and all \(\varvec{\theta } \in \mathbb {R}^d\),

$$\begin{aligned} \frac{1}{K}\sum _{k=1}^K \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_{k} \right]&\le \frac{d \tilde{D}(\varvec{\theta }) \sqrt{M} \tilde{\beta }_{1K}}{2 \beta _1 \alpha _K \sqrt{\tilde{\beta }_{2K}}K} + \frac{(\sigma ^2 b^{-1} + G^2)}{2 \sqrt{\textit{v}_*} \beta _{1} \hat{\beta }_1 K} \sum _{k=1}^K \alpha _k \sqrt{\tilde{\beta }_{2k}} + D(\varvec{\theta }) G \frac{\hat{\beta }_{1}}{\beta _{1}}\\&\quad + \hat{\beta }_{1} D(\varvec{\theta }) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) , \end{aligned}$$

where the parameters are defined as in Lemma 6.

Proof

Let \(\varvec{\theta } \in \mathbb {R}^d\). Inequality (35) with \(\beta _{1k} = \beta _1\) implies that, for all \(K \ge 1\),

$$\begin{aligned} \frac{1}{K} \sum _{k=1}^K \mathbb {E} \left[ (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_{k} \right] \le \frac{1}{K} \sum _{k=1}^K \mathbb {E} \left[ (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_{k-1} \right] + \hat{\beta }_{1} D(\varvec{\theta }) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) . \end{aligned}$$

Hence, Lemma 6 leads to Lemma 7. \(\square \)

Lemma 8

Suppose that (S1)–(S3) and (A1)–(A2) hold, \(\beta _{1k}:= \beta _1 \in (0,1)\), and \((\alpha _k)_{k\in \mathbb {N}}\) is monotone decreasing. Then, Adam defined by Algorithm 1 with (9) satisfies the following: for all \(K\ge 1\) and all \(\varvec{\theta } \in \mathbb {R}^d\),

$$\begin{aligned} \frac{1}{K}\sum _{k=1}^K \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \nabla f (\varvec{\theta }_{k}) \right]&\le \frac{d \tilde{D}(\varvec{\theta }) \sqrt{M} \tilde{\beta }_{1K}}{2 \beta _1 \alpha _K \sqrt{\tilde{\beta }_{2K}}K} + \frac{(\sigma ^2 b^{-1} + G^2)}{2 \sqrt{\textit{v}_*} \beta _{1} \hat{\beta }_1 K} \sum _{k=1}^K \alpha _k \sqrt{\tilde{\beta }_{2k}} + D(\varvec{\theta }) G \frac{\hat{\beta }_{1}}{\beta _{1}}\\&\quad + \left( \frac{1}{\beta _1} + 2\hat{\beta }_{1} \right) D(\varvec{\theta }) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) , \end{aligned}$$

where the parameters are defined as in Lemma 6.

Proof

Let \(\varvec{\theta } \in \mathbb {R}^d\). Inequality (36) with \(\beta _{1k} = \beta _1\) implies that, for all \(K \ge 1\),

$$\begin{aligned}&\frac{1}{K} \sum _{k=1}^K \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \nabla f(\varvec{\theta }_k) \right] \\&\le \frac{1}{K} \sum _{k=1}^K \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_k \right] + \left( \frac{1}{\beta _{1}} + \hat{\beta }_{1} \right) D(\varvec{\theta }) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) , \end{aligned}$$

which, together with Lemma 7, shows that Lemma 8 holds. \(\square \)

A.2 Proof of Theorem 1

Proof

Lemmas 4 and 5 with

$$\begin{aligned} \alpha _k = \alpha , \text { } \beta _{1k} = \beta _1, \text { } \beta _{2k} = \beta _2, \text { } \tilde{\beta }_{1k} = 1 - \beta _1^{k+1},\text { } \tilde{\beta }_{2k} = 1 - \beta _2^{k+1}, \text { } \hat{\beta }_{1} = 1 - \beta _1 \end{aligned}$$

imply that

$$\begin{aligned} \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta }^\star )^\top \varvec{m}_{k} \right]&\le \frac{D(\varvec{\theta }^\star ) M^{\frac{1}{4}}}{{\textit{v}_*^{\frac{1}{4}}} \beta _{1}} \sqrt{ \frac{\sigma ^2}{b} + G^2} + \frac{\alpha \sqrt{\tilde{\beta }_{2k}}}{2 \sqrt{\textit{v}_*} \beta _{1} \tilde{\beta }_{1k}} \left( \frac{\sigma ^2}{b} + G^2 \right) + D(\varvec{\theta }^\star ) G \frac{\hat{\beta }_{1}}{\beta _{1}}\\&\quad + \hat{\beta }_{1} D(\varvec{\theta }^\star ) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) ,\\ \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \nabla f(\varvec{\theta }_k) \right]&\le \frac{D(\varvec{\theta }) M^{\frac{1}{4}}}{{\textit{v}_*^{\frac{1}{4}}} \beta _{1}} \sqrt{ \frac{\sigma ^2}{b} + G^2} + \frac{\alpha \sqrt{\tilde{\beta }_{2k}}}{2 \sqrt{\textit{v}_*} \beta _{1} \tilde{\beta }_{1k}} \left( \frac{\sigma ^2}{b} + G^2 \right) \\&\quad + D(\varvec{\theta }) G \frac{\hat{\beta }_{1}}{\beta _{1}} + D(\varvec{\theta })\left( \frac{1}{\beta _{1}} + 2 \hat{\beta }_{1} \right) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) , \end{aligned}$$

which completes the proof. \(\square \)

A.3 Proof of Corollary 1

Proof

The sequences \((\tilde{\beta }_{1k})_{k\in \mathbb {N}}\) and \((\tilde{\beta }_{2k})_{k\in \mathbb {N}}\) converge to 1. Theorem 1 thus leads to Corollary 1. \(\square \)

A.4 Proof of Theorem 2

Proof

Lemmas 4 and 5 with

$$\begin{aligned}&\alpha _k = \frac{1}{k^a}, \text { } \beta _{1k} = 1 - \frac{1}{k^{b_1}}, \text { } \beta _{2k} = \left( 1 - \frac{1}{k^{b_2}} \right) ^{\frac{1}{k+1}}, \text { } \tilde{\beta }_{1k} = 1 - \beta _{1k}^{k+1} \ge 1 - \beta _{1k},\text { } \tilde{\beta }_{2k} = 1 - \beta _{2k}^{k+1},\\&\hat{\beta }_{1k} = 1 - \beta _{1k} \end{aligned}$$

imply that

$$\begin{aligned} \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_{k} \right]&\le \frac{D(\varvec{\theta }) M^{\frac{1}{4}}}{{\textit{v}_*^{\frac{1}{4}}} \beta _{1k}} \sqrt{ \frac{\sigma ^2}{b} + G^2} + \frac{\alpha _k \sqrt{\tilde{\beta }_{2k}}}{2 \sqrt{\textit{v}_*} \beta _{1k} \tilde{\beta }_{1k}} \left( \frac{\sigma ^2}{b} + G^2 \right) + D(\varvec{\theta }) G \frac{\hat{\beta }_{1k}}{\beta _{1k}}\\&\quad + \hat{\beta }_{1k} D(\varvec{\theta }) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) \\&\le \frac{D(\varvec{\theta }^\star ) M^{\frac{1}{4}} k^{b_1}}{\textit{v}_*^{\frac{1}{4}}(k^{b_1} - 1)}\sqrt{ \frac{\sigma ^2}{b} + G^2} + \frac{1}{2 \sqrt{v_*} (k^{b_1} - 1) k^{a +\frac{b_2}{2} -2 b_1}} \left( \frac{\sigma ^2}{b} + G^2 \right) \\&\quad + \frac{1}{k^{b_1} -1} D(\varvec{\theta }^\star ) G + \frac{1}{k^{b_1}} D(\varvec{\theta }^\star ) \left( B + \sqrt{ \frac{\sigma ^2}{b} + G^2} \right) , \end{aligned}$$
$$\begin{aligned} \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \nabla f(\varvec{\theta }_k) \right]&\le \frac{D(\varvec{\theta }) M^{\frac{1}{4}}}{{\textit{v}_*^{\frac{1}{4}}} \beta _{1k}} \sqrt{ \frac{\sigma ^2}{b} + G^2} + \frac{\alpha _k \sqrt{\tilde{\beta }_{2k}}}{2 \sqrt{\textit{v}_*} \beta _{1k} \tilde{\beta }_{1k}} \left( \frac{\sigma ^2}{b} + G^2 \right) + D(\varvec{\theta }) G \frac{\hat{\beta }_{1k}}{\beta _{1k}}\\&\quad + D(\varvec{\theta })\left( \frac{1}{\beta _{1k}} + 2 \hat{\beta }_{1k} \right) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) \\&\le \frac{D(\varvec{\theta }) M^{\frac{1}{4}} k^{b_1}}{\textit{v}_*^{\frac{1}{4}}(k^{b_1} - 1)}\sqrt{ \frac{\sigma ^2}{b} + G^2} + \frac{1}{2 \sqrt{\textit{v}_*} (k^{b_1} - 1) k^{a +\frac{b_2}{2} -2 b_1}} \left( \frac{\sigma ^2}{b} + G^2 \right) \\&\quad + \frac{1}{k^{b_1} -1} D(\varvec{\theta }) G + \frac{1}{k^{b_1}} D(\varvec{\theta }) \left( B + \sqrt{ \frac{\sigma ^2}{b} + G^2} \right) \\&\quad + \frac{k^{{b_1}^2} + 2 k^{b_1} - 2}{k^{b_1} (k^{b_1} - 1)} D(\varvec{\theta }) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) , \end{aligned}$$

which completes the proof. \(\square \)
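
For concreteness, the diminishing schedules instantiated in this proof can be computed as in the following sketch; the exponents a, b1, and b2 below are placeholders and must satisfy the conditions stated in Theorem 2 (here \(k \ge 2\) keeps \(\beta _{1k} \in (0,1)\)). Note that \(\beta _{2k}^{k+1} = 1 - 1/k^{b_2}\), so \(\tilde{\beta }_{2k} = 1/k^{b_2}\) and \(\sqrt{\tilde{\beta }_{2k}} = k^{-b_2/2}\) exactly, which produces the \(k^{b_2/2}\) factor in the denominators above.

```python
def schedules(k, a=0.5, b1=0.5, b2=0.5):
    # alpha_k = 1/k^a, beta_{1k} = 1 - 1/k^{b1}, beta_{2k} = (1 - 1/k^{b2})^(1/(k+1)).
    # The default exponents are placeholders, not values prescribed by Theorem 2.
    alpha_k = 1.0 / k ** a
    beta1_k = 1.0 - 1.0 / k ** b1
    beta2_k = (1.0 - 1.0 / k ** b2) ** (1.0 / (k + 1))
    return alpha_k, beta1_k, beta2_k
```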

A.5 Proof of Corollary 2

Proof

Since \(a - b_1 + b_2/2 > 0\), we have that

$$\begin{aligned} \frac{1}{(k^{b_1} - 1) k^{a +\frac{b_2}{2} -2 b_1}} = \frac{1}{k^{a +\frac{b_2}{2} - b_1}- k^{a +\frac{b_2}{2} -2 b_1}} = \frac{1}{k^{a +\frac{b_2}{2} - b_1}}\left( {1 - \frac{1}{k^{b_1}}}\right) ^{-1} \rightarrow 0. \end{aligned}$$

Theorem 2 thus leads to Corollary 2. \(\square \)

A.6 Proof of Theorem 3

Proof

Lemmas 7 and 8 with

$$\begin{aligned} \alpha _k = \alpha , \text { } \beta _{1k} = \beta _1, \text { } \beta _{2k} = \beta _2, \text { } \tilde{\beta }_{1k} = 1 - \beta _1^{k+1},\text { } \tilde{\beta }_{2k} = 1 - \beta _2^{k+1} \le 1, \text { } \hat{\beta }_{1} = 1 - \beta _1 \end{aligned}$$

ensure that

$$\begin{aligned} \frac{1}{K}\sum _{k=1}^K \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta }^\star )^\top \varvec{m}_{k} \right]&\le \frac{d \tilde{D}(\varvec{\theta }^\star ) \sqrt{M} \tilde{\beta }_{1K}}{2 \beta _1 \alpha \sqrt{\tilde{\beta }_{2K}}K} + \frac{(\sigma ^2 b^{-1} + G^2)}{2 \sqrt{\textit{v}_*} \beta _{1} \hat{\beta }_1 K} \sum _{k=1}^K \alpha \sqrt{\tilde{\beta }_{2k}} + D(\varvec{\theta }^\star ) G \frac{\hat{\beta }_{1}}{\beta _{1}}\\&\quad + \hat{\beta }_{1} D(\varvec{\theta }^\star ) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) \\&\le \frac{d \tilde{D}(\varvec{\theta }^\star ) \sqrt{M} \tilde{\beta }_{1K}}{2 \beta _1 \alpha \sqrt{\tilde{\beta }_{2K}}K} + \frac{(\sigma ^2 b^{-1} + G^2)}{2 \sqrt{\textit{v}_*} \beta _{1} \hat{\beta }_1} \alpha + D(\varvec{\theta }^\star ) G \frac{\hat{\beta }_{1}}{\beta _{1}}\\&\quad + \hat{\beta }_{1} D(\varvec{\theta }^\star ) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) \end{aligned}$$

and that

$$\begin{aligned} \frac{1}{K}\sum _{k=1}^K \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \nabla f (\varvec{\theta }_{k}) \right]&\le \frac{d \tilde{D}(\varvec{\theta }) \sqrt{M} \tilde{\beta }_{1K}}{2 \beta _1 \alpha \sqrt{\tilde{\beta }_{2K}}K} + \frac{(\sigma ^2 b^{-1} + G^2)}{2 \sqrt{\textit{v}_*} \beta _{1} \hat{\beta }_1 K} \sum _{k=1}^K \alpha \sqrt{\tilde{\beta }_{2k}} + D(\varvec{\theta }) G \frac{\hat{\beta }_{1}}{\beta _{1}}\\&\quad + \left( \frac{1}{\beta _1} + 2\hat{\beta }_{1} \right) D(\varvec{\theta }) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) \\&\le \frac{d \tilde{D}(\varvec{\theta }) \sqrt{M} \tilde{\beta }_{1K}}{2 \beta _1 \alpha \sqrt{\tilde{\beta }_{2K}}K} + \frac{(\sigma ^2 b^{-1} + G^2)}{2 \sqrt{\textit{v}_*} \beta _{1} \hat{\beta }_1} \alpha + D(\varvec{\theta }) G \frac{\hat{\beta }_{1}}{\beta _{1}}\\&\quad + \left( \frac{1}{\beta _1} + 2\hat{\beta }_{1} \right) D(\varvec{\theta }) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) , \end{aligned}$$

which completes the proof. \(\square \)

A.7 Proof of Theorem 4

Proof

Let

$$\begin{aligned} \alpha _k = \alpha , \text { } \beta _{1k} = \beta _1, \text { } \beta _{2k} = \left( 1 - \frac{1}{k^{b_2}} \right) ^{\frac{1}{k+1}}, \text { } \tilde{\beta }_{1k} = 1 - \beta _1^{k+1} \le 1,\text { } \tilde{\beta }_{2k} = 1 - \beta _{2k}^{k+1}, \text { } \hat{\beta }_{1} = 1 - \beta _1, \end{aligned}$$

where \(b_2 \in (0,2)\). We have that

$$\begin{aligned} \sqrt{\tilde{\beta }_{2k}} = \sqrt{1 - \beta _{2k}^{k+1}} = \sqrt{\frac{1}{k^{b_2}}}. \end{aligned}$$

Lemma 7 ensures that

$$\begin{aligned} \frac{1}{K}\sum _{k=1}^K \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta }^\star )^\top \varvec{m}_{k} \right]&\le \frac{d \tilde{D}(\varvec{\theta }^\star ) \sqrt{M} \tilde{\beta }_{1K}}{2 \beta _1 \alpha \sqrt{\tilde{\beta }_{2K}}K} + \frac{(\sigma ^2 b^{-1} + G^2)}{2 \sqrt{\textit{v}_*} \beta _{1} \hat{\beta }_1 K} \sum _{k=1}^K \alpha \sqrt{\tilde{\beta }_{2k}} + D(\varvec{\theta }^\star ) G \frac{\hat{\beta }_{1}}{\beta _{1}}\\&\quad + \hat{\beta }_{1} D(\varvec{\theta }^\star ) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) \\&\le \frac{d \tilde{D}(\varvec{\theta }^\star ) \sqrt{M}}{2 \beta _1 \alpha K^{1-\frac{b_2}{2}}} + \frac{(\sigma ^2 b^{-1} + G^2)\alpha }{2 \sqrt{\textit{v}_*} \beta _{1} \hat{\beta }_1 K} \sum _{k=1}^K \frac{1}{k^{\frac{b_2}{2}}} + D(\varvec{\theta }^\star ) G \frac{\hat{\beta }_{1}}{\beta _{1}}\\&\quad + \hat{\beta }_{1} D(\varvec{\theta }^\star ) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) . \end{aligned}$$

We also have that

$$\begin{aligned} \begin{aligned} \frac{1}{K} \sum _{k=1}^K \frac{1}{k^{\frac{b_2}{2}}}&\le \frac{1}{K} \left( 1 + \int _1^K \frac{\textrm{d}t}{t^{\frac{b_2}{2}}} \right) = \frac{1}{K} \left\{ 1 + \left[ \left( 1 - \frac{b_2}{2} \right) t^{ 1 - \frac{b_2}{2}} \right] _1^K \right\} \\&\le \frac{1}{K} \left\{ 1 + \left( 1 - \frac{b_2}{2} \right) K^{ 1 - \frac{b_2}{2}} \right\} \le \frac{2}{K}K^{ 1 - \frac{b_2}{2}} = \frac{2}{K^{\frac{b_2}{2}}}. \end{aligned} \end{aligned}$$
(42)

Hence,

$$\begin{aligned} \frac{1}{K}\sum _{k=1}^K \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta }^\star )^\top \varvec{m}_{k} \right]&\le \frac{d \tilde{D}(\varvec{\theta }^\star ) \sqrt{M}}{2 \beta _1 \alpha K^{1-\frac{b_2}{2}}} + \frac{(\sigma ^2 b^{-1} + G^2)\alpha }{\sqrt{\textit{v}_*} \beta _{1} \hat{\beta }_1 K^{\frac{b_2}{2}}} + D(\varvec{\theta }^\star ) G \frac{\hat{\beta }_{1}}{\beta _{1}}\\&\quad + \hat{\beta }_{1} D(\varvec{\theta }^\star ) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) . \end{aligned}$$

An argument similar to the one used to show the above inequality, together with Lemma 8, implies that

$$\begin{aligned} \frac{1}{K} \sum _{k=1}^K \mathbb {E}\left[ \nabla f(\varvec{\theta }_k)^\top (\varvec{\theta }_k - \varvec{\theta }) \right]&\le \frac{d \tilde{D}(\varvec{\theta }) \sqrt{M}}{2 \alpha \beta _1 K^{1 - \frac{b_2}{2}}} + \frac{\alpha }{\sqrt{\textit{v}_*} \beta _{1} (1-\beta _1) K^{\frac{b_2}{2}}}\left( \frac{\sigma ^2}{b} + G^2 \right) \\&\quad + \frac{1 - \beta _{1}}{\beta _{1}} D(\varvec{\theta }) G + \left( \frac{1}{\beta _{1}} + 2 (1 - \beta _{1}) \right) D(\varvec{\theta }) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) , \end{aligned}$$

which completes the proof.\(\square \)

A.8 Proof of Theorem 5

Proof

Let

$$\begin{aligned} \alpha _k \!=\! \frac{1}{k^a}, \text { } \beta _{1k} \!=\! \beta _1, \text { } \beta _{2k} \!=\! \beta _2, \text { } \tilde{\beta }_{1k} \!=\! 1 \!-\! \beta _1^{k+1} \!\le \! 1,\text { } \tilde{\beta }_{2k} \!=\! 1 \!-\! \beta _2^{k+1} \!\le \! 1, \text { } \hat{\beta }_{1} \!=\! 1 \!-\! \beta _1. \end{aligned}$$

We have that \(\tilde{\beta }_{2k} = 1 - \beta _{2k}^{k+1} \ge 1 - \beta _2\). Lemma 7 ensures that

$$\begin{aligned} \frac{1}{K}\sum _{k=1}^K \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta }^\star )^\top \varvec{m}_{k} \right]&\le \frac{d \tilde{D}(\varvec{\theta }^\star ) \sqrt{M} \tilde{\beta }_{1K}}{2 \beta _1 \alpha _K \sqrt{\tilde{\beta }_{2K}}K} + \frac{(\sigma ^2 b^{-1} + G^2)}{2 \sqrt{\textit{v}_*} \beta _{1} \hat{\beta }_1 K} \sum _{k=1}^K \alpha _k \sqrt{\tilde{\beta }_{2k}} + D(\varvec{\theta }^\star ) G \frac{\hat{\beta }_{1}}{\beta _{1}}\\&\quad + \hat{\beta }_{1} D(\varvec{\theta }^\star ) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) \\&\le \frac{d \tilde{D}(\varvec{\theta }^\star ) \sqrt{M}}{2 \beta _1 \sqrt{1-\beta _2}K^{1-a}} + \frac{(\sigma ^2 b^{-1} + G^2)}{2 \sqrt{\textit{v}_*} \beta _{1} \hat{\beta }_1 K} \sum _{k=1}^K \frac{1}{k^a} + D(\varvec{\theta }^\star ) G \frac{\hat{\beta }_{1}}{\beta _{1}}\\&\quad + \hat{\beta }_{1} D(\varvec{\theta }^\star ) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) , \end{aligned}$$

which, together with (42), implies that

$$\begin{aligned} \frac{1}{K}\sum _{k=1}^K \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta }^\star )^\top \varvec{m}_{k} \right]&\le \frac{d \tilde{D}(\varvec{\theta }^\star ) \sqrt{M}}{2 \beta _1 \sqrt{1-\beta _2}K^{1-a}} + \frac{(\sigma ^2 b^{-1} + G^2)}{2 \sqrt{\textit{v}_*} \beta _{1} \hat{\beta }_1 K^a} + D(\varvec{\theta }^\star ) G \frac{\hat{\beta }_{1}}{\beta _{1}}\\&\quad + \hat{\beta }_{1} D(\varvec{\theta }^\star ) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) . \end{aligned}$$

An argument similar to the one used to show the above inequality, together with Lemma 8, implies the second assertion in Theorem 5. \(\square \)

A.9 Proof of Theorem 6

Proof

The proofs of Theorems 4 and 5 lead to Theorem 6. \(\square \)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Iiduka, H. Theoretical analysis of Adam using hyperparameters close to one without Lipschitz smoothness. Numer Algor 95, 383–421 (2024). https://doi.org/10.1007/s11075-023-01575-0

