Appendix A
Unless stated otherwise, all relationships between random variables are assumed to hold almost surely.
A.1 Lemmas
Lemma 1
Suppose that (S1), (S2)(13), and (S3) hold. Then, Adam defined by Algorithm 1 satisfies the following: for all \(k\in \mathbb {N}\) and all \(\varvec{\theta } \in \mathbb {R}^d\),
$$\begin{aligned} \mathbb {E}\left[ \left\| \varvec{\theta }_{k+1} - \varvec{\theta } \right\| _{\textsf{H}_k}^2 \right]&= \mathbb {E}\left[ \left\| \varvec{\theta }_{k} - \varvec{\theta } \right\| _{\textsf{H}_k}^2 \right] + \alpha _k^2 \mathbb {E}\left[ \left\| \varvec{\textsf{d}}_k \right\| _{\textsf{H}_k}^2 \right] \\&\quad + 2 \alpha _k \left\{ \frac{\beta _{1k}}{\tilde{\beta }_{1k}} \mathbb {E}\left[ (\varvec{\theta } \!-\! \varvec{\theta }_k)^\top \varvec{m}_{k-1} \right] \!+\!\frac{\hat{\beta }_{1k}}{\tilde{\beta }_{1k}} \mathbb {E}\left[ (\varvec{\theta } \!-\! \varvec{\theta }_k)^\top \nabla f (\varvec{\theta }_k) \right] \right\} , \end{aligned}$$
where \(\varvec{\textsf{d}}_k:= - \textsf{H}_k^{-1} \hat{\varvec{m}}_k\), \(\hat{\beta }_{1k}:= 1 - \beta _{1k}\), and \(\tilde{\beta }_{1k}:= 1 - \beta _{1k}^{k+1}\).
Proof
Let \(\varvec{\theta } \in \mathbb {R}^d\) and \(k\in \mathbb {N}\). The definition of \(\varvec{\theta }_{k+1}:= \varvec{\theta }_{k} + \alpha _k \varvec{\textsf{d}}_k\) implies that
$$\begin{aligned} \Vert \varvec{\theta }_{k+1} - \varvec{\theta } \Vert _{\textsf{H}_k}^2 = \Vert \varvec{\theta }_{k} - \varvec{\theta } \Vert _{\textsf{H}_k}^2 + 2 \alpha _k \langle \varvec{\theta }_{k} - \varvec{\theta }, \varvec{\textsf{d}}_k \rangle _{\textsf{H}_k} + \alpha _k^2 \Vert \varvec{\textsf{d}}_k\Vert _{\textsf{H}_k}^2. \end{aligned}$$
Moreover, the definitions of \(\varvec{\textsf{d}}_k\), \(\varvec{m}_k\), and \(\hat{\varvec{m}}_k\) ensure that
$$\begin{aligned} \left\langle \varvec{\theta }_k - \varvec{\theta }, \varvec{\textsf{d}}_k \right\rangle _{\textsf{H}_k}&= \left\langle \varvec{\theta }_k - \varvec{\theta }, \textsf{H}_k \varvec{\textsf{d}}_k \right\rangle = \left\langle \varvec{\theta } - \varvec{\theta }_k, \hat{\varvec{m}}_k \right\rangle = \frac{1}{\tilde{\beta }_{1k}} (\varvec{\theta } - \varvec{\theta }_k)^\top {\varvec{m}}_k\\&= \frac{\beta _{1k}}{\tilde{\beta }_{1k}} (\varvec{\theta } - \varvec{\theta }_k)^\top \varvec{m}_{k-1} + \frac{\hat{\beta }_{1k}}{\tilde{\beta }_{1k}} (\varvec{\theta } - \varvec{\theta }_k)^\top \nabla f_{B_k}(\varvec{\theta }_k). \end{aligned}$$
Hence,
$$\begin{aligned} \begin{aligned} \left\| \varvec{\theta }_{k+1} - \varvec{\theta } \right\| _{\textsf{H}_k}^2&= \left\| \varvec{\theta }_k -\varvec{\theta } \right\| _{\textsf{H}_k}^2 + \alpha _k^2 \left\| \varvec{\textsf{d}}_k \right\| _{\textsf{H}_k}^2\\&\quad + 2 \alpha _k \left\{ \frac{\beta _{1k}}{\tilde{\beta }_{1k}} (\varvec{\theta } - \varvec{\theta }_k)^\top \varvec{m}_{k-1} + \frac{\hat{\beta }_{1k}}{\tilde{\beta }_{1k}} (\varvec{\theta } - \varvec{\theta }_k)^\top \nabla f_{B_k} (\varvec{\theta }_k) \right\} . \end{aligned} \end{aligned}$$
(24)
Conditions (13) and (S3) guarantee that
$$\begin{aligned} \mathbb {E}\left[ \mathbb {E} \left[ (\varvec{\theta } \!-\! \varvec{\theta }_k)^\top \nabla f_{B_k} (\varvec{\theta }_k) \Big | \varvec{\theta }_k \right] \right] \!=\! \mathbb {E} \left[ (\varvec{\theta } \!-\! \varvec{\theta }_k)^\top \mathbb {E} \left[ \nabla f_{B_k} (\varvec{\theta }_k) \Big | \varvec{\theta }_k \right] \right] = \mathbb {E} \left[ (\varvec{\theta } \!-\! \varvec{\theta }_k)^\top \nabla f (\varvec{\theta }_k) \right] . \end{aligned}$$
Therefore, the lemma follows by taking the expectation on both sides of (24). This completes the proof. \(\square \)
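The expansion in the first display of the proof is a purely algebraic identity in the weighted norm \(\Vert \cdot \Vert _{\textsf{H}_k}\), so it can be sanity-checked numerically. The sketch below uses an arbitrary positive diagonal matrix as a stand-in for \(\textsf{H}_k\) and random vectors; all values are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 5
# Illustrative positive-definite diagonal matrix standing in for H_k.
H = np.diag(rng.uniform(0.5, 2.0, dim))
theta_k = rng.standard_normal(dim)
theta = rng.standard_normal(dim)
d_k = rng.standard_normal(dim)
alpha = 0.1

def sq_norm_H(x):
    # ||x||_H^2 = x^T H x
    return x @ H @ x

# theta_{k+1} = theta_k + alpha_k d_k, as in the proof.
theta_next = theta_k + alpha * d_k
lhs = sq_norm_H(theta_next - theta)
rhs = (sq_norm_H(theta_k - theta)
       + 2 * alpha * (theta_k - theta) @ H @ d_k
       + alpha**2 * sq_norm_H(d_k))
assert np.isclose(lhs, rhs)
```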
Remark 1
Let us consider (16); that is,
$$\begin{aligned} \varvec{\theta }_{k+1} = P_{C,\textsf{H}_k} (\varvec{\theta }_{k} + \alpha _k \varvec{\textsf{d}}_k). \end{aligned}$$
Let \(k\in \mathbb {N}\) and \(\varvec{\theta } \in C\) (i.e., \(\varvec{\theta } = P_{C,\textsf{H}_k}(\varvec{\theta })\)). The nonexpansivity of \(P_{C,\textsf{H}_k}\) ensures that
$$\begin{aligned} \Vert \varvec{\theta }_{k+1} - \varvec{\theta }\Vert _{\textsf{H}_k} = \Vert P_{C,\textsf{H}_k} (\varvec{\theta }_{k} + \alpha _k \varvec{\textsf{d}}_k)- P_{C,\textsf{H}_k}(\varvec{\theta })\Vert _{\textsf{H}_k} \le \Vert (\varvec{\theta }_{k} + \alpha _k \varvec{\textsf{d}}_k)- \varvec{\theta }\Vert _{\textsf{H}_k}. \end{aligned}$$
Hence, we have that
$$\begin{aligned} \Vert \varvec{\theta }_{k+1} - \varvec{\theta } \Vert _{\textsf{H}_k}^2 \le \Vert \varvec{\theta }_{k} - \varvec{\theta } \Vert _{\textsf{H}_k}^2 + 2 \alpha _k \langle \varvec{\theta }_{k} - \varvec{\theta }, \varvec{\textsf{d}}_k \rangle _{\textsf{H}_k} + \alpha _k^2 \Vert \varvec{\textsf{d}}_k\Vert _{\textsf{H}_k}^2. \end{aligned}$$
Accordingly, a discussion similar to the proof of Lemma 1 ensures that, for all \(\varvec{\theta } \in C\) and all \(k\in \mathbb {N}\),
$$\begin{aligned} \begin{aligned} \mathbb {E}\left[ \left\| \varvec{\theta }_{k+1} - \varvec{\theta } \right\| _{\textsf{H}_k}^2 \right]&\le \mathbb {E}\left[ \left\| \varvec{\theta }_{k} - \varvec{\theta } \right\| _{\textsf{H}_k}^2 \right] + \alpha _k^2 \mathbb {E}\left[ \left\| \varvec{\textsf{d}}_k \right\| _{\textsf{H}_k}^2 \right] \\&\quad \!+\! 2 \alpha _k \left\{ \frac{\beta _{1k}}{\tilde{\beta }_{1k}} \mathbb {E}\left[ (\varvec{\theta } \!-\! \varvec{\theta }_k)^\top \varvec{m}_{k-1} \right] \!+\!\frac{\hat{\beta }_{1k}}{\tilde{\beta }_{1k}} \mathbb {E}\left[ (\varvec{\theta } \!-\! \varvec{\theta }_k)^\top \nabla f (\varvec{\theta }_k) \right] \right\} . \end{aligned} \end{aligned}$$
(25)
We may assume, without loss of generality, that the assertion in Lemma 1 holds for all \(\varvec{\theta }\in C\) and all \(k\in \mathbb {N}\), since the theorems in this paper evaluate the upper bounds of (4), (5), (6), and (7). A discussion similar to the proofs of the theorems (see the following lemmas and the proofs of the theorems) leads to versions of the theorems for all \(\varvec{\theta }\) belonging to C; that is, the sequence \((\varvec{\theta }_k)_{k\in \mathbb {N}}\) generated by (16) satisfies the assertions in the theorems for all \(\varvec{\theta } \in C\).
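When \(\textsf{H}_k\) is diagonal and C is a box, the \(\textsf{H}_k\)-norm projection separates coordinate-wise and reduces to plain clipping, so the nonexpansivity used above can be checked directly. The diagonal entries and box bounds below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 4
h = rng.uniform(0.5, 2.0, dim)   # diagonal of H_k (illustrative)
lo, hi = -1.0, 1.0               # box C = [lo, hi]^dim (assumed for the sketch)

def proj(x):
    # For diagonal H, the H-norm projection onto a box is coordinate-wise clipping.
    return np.clip(x, lo, hi)

def norm_H(x):
    # ||x||_H = sqrt(x^T H x) for diagonal H with entries h.
    return np.sqrt(np.sum(h * x * x))

x = 3.0 * rng.standard_normal(dim)
y = 3.0 * rng.standard_normal(dim)
# Nonexpansivity: ||P(x) - P(y)||_H <= ||x - y||_H.
assert norm_H(proj(x) - proj(y)) <= norm_H(x - y) + 1e-12
```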
Lemma 2
Suppose that (S2)(13), (S2)(14), and (A1) hold. Then, Adam defined by Algorithm 1 satisfies the following: for all \(k\in \mathbb {N}\),
$$\begin{aligned} \mathbb {E}\left[ \left\| \varvec{m}_k \right\| ^2 \right] \le \frac{\sigma ^2}{b} + G^2, \quad \mathbb {E}\left[ \left\| \varvec{\textsf{d}}_k \right\| _{\textsf{H}_k}^2 \right] \le \frac{\sqrt{\tilde{\beta }_{2k}}}{\tilde{\beta }_{1k}^2 \sqrt{{\textit{v}}_*}} \left( \frac{\sigma ^2}{b} + G^2 \right) , \end{aligned}$$
where \({v}_*:= \inf \{ \min _{i\in [d]} {\textit{v}}_{k,i} :k\in \mathbb {N}\}\), \(\tilde{\beta }_{1k}:= 1 - \beta _{1k}^{k+1}\), and \(\tilde{\beta }_{2k}:= 1 - \beta _{2k}^{k+1}\).
Proof
Assumption (S2)(13) implies that
$$\begin{aligned} \begin{aligned} \mathbb {E} \left[ \left\| \nabla f_{B_k} (\varvec{\theta }_{k}) \right\| ^2 \Big | \varvec{\theta }_k \right]&= \mathbb {E} \left[ \left\| \nabla f_{B_k} (\varvec{\theta }_{k}) - \nabla f (\varvec{\theta }_{k}) + \nabla f (\varvec{\theta }_{k}) \right\| ^2 \Big | \varvec{\theta }_k \right] \\&= \mathbb {E} \left[ \left\| \nabla f_{B_k} (\varvec{\theta }_{k}) - \nabla f (\varvec{\theta }_{k}) \right\| ^2 \Big | \varvec{\theta }_k \right] + \mathbb {E} \left[ \left\| \nabla f (\varvec{\theta }_{k}) \right\| ^2 \Big | \varvec{\theta }_k \right] \\&\quad + 2 \mathbb {E} \left[ (\nabla f_{B_k} (\varvec{\theta }_{k}) - \nabla f (\varvec{\theta }_{k}))^\top \nabla f (\varvec{\theta }_{k}) \Big | \varvec{\theta }_k \right] \\&= \mathbb {E} \left[ \left\| \nabla f_{B_k} (\varvec{\theta }_{k}) - \nabla f (\varvec{\theta }_{k}) \right\| ^2 \Big | \varvec{\theta }_k \right] + \Vert \nabla f (\varvec{\theta }_{k}) \Vert ^2, \end{aligned} \end{aligned}$$
(26)
which, together with (S2)(14) and (A1), implies that
$$\begin{aligned} \mathbb {E} \left[ \left\| \nabla f_{B_k} (\varvec{\theta }_{k}) \right\| ^2 \right] \le \frac{\sigma ^2}{b} + G^2. \end{aligned}$$
(27)
The convexity of \(\Vert \cdot \Vert ^2\), together with the definition of \(\varvec{m}_k\) and (27), guarantees that, for all \(k\in \mathbb {N}\),
$$\begin{aligned} \mathbb {E}\left[ \left\| \varvec{m}_k \right\| ^2 \right]&\le \beta _{1k} \mathbb {E}\left[ \left\| \varvec{m}_{k-1} \right\| ^2 \right] + \hat{\beta }_{1k} \mathbb {E}\left[ \left\| \nabla f_{B_k} (\varvec{\theta }_k) \right\| ^2 \right] \\&\le \beta _{1k} \mathbb {E} \left[ \left\| \varvec{m}_{k-1} \right\| ^2 \right] + \hat{\beta }_{1k} \left( \frac{\sigma ^2}{b} + G^2 \right) . \end{aligned}$$
Induction thus ensures that, for all \(k\in \mathbb {N}\),
$$\begin{aligned} \mathbb {E} \left[ \left\| \varvec{m}_k \right\| ^2 \right] \le \max \left\{ \Vert \varvec{m}_{-1}\Vert ^2, \frac{\sigma ^2}{b} + G^2 \right\} = \frac{\sigma ^2}{b} + G^2, \end{aligned}$$
(28)
where \(\varvec{m}_{-1} = \varvec{0}\). For \(k\in \mathbb {N}\), \(\textsf{H}_k \in \mathbb {S}_{++}^d\) guarantees the existence of a unique matrix \(\overline{\textsf{H}}_k \in \mathbb {S}_{++}^d\) such that \(\textsf{H}_k = \overline{\textsf{H}}_k^2\) [12, Theorem 7.2.6]. We have that, for all \(\varvec{x}\in \mathbb {R}^d\), \(\Vert \varvec{x}\Vert _{\textsf{H}_k}^2 = \Vert \overline{\textsf{H}}_k \varvec{x} \Vert ^2\). Accordingly, the definitions of \(\varvec{\textsf{d}}_k\) and \(\hat{\varvec{m}}_k\) imply that, for all \(k\in \mathbb {N}\),
$$\begin{aligned} \mathbb {E} \left[ \left\| \varvec{\textsf{d}}_k \right\| _{\textsf{H}_k}^2 \right] = \mathbb {E} \left[ \left\| \overline{\textsf{H}}_k^{-1} \textsf{H}_k\varvec{\textsf{d}}_k \right\| ^2 \right] \le \frac{1}{\tilde{\beta }_{1k}^2} \mathbb {E} \left[ \left\| \overline{\textsf{H}}_k^{-1} \right\| ^2 \Vert \varvec{m}_k \Vert ^2 \right] , \end{aligned}$$
where
$$\begin{aligned} \left\| \overline{\textsf{H}}_k^{-1} \right\| = \left\| \textsf{diag}\left( \hat{\textit{v}}_{k,i}^{-\frac{1}{4}} \right) \right\| = \max _{i\in [d]} \hat{\textit{v}}_{k,i}^{-\frac{1}{4}} = \max _{i\in [d]} \left( \frac{\textit{v}_{k,i}}{\tilde{\beta }_{2k}} \right) ^{-\frac{1}{4}} =: \left( \frac{\textit{v}_{k,i^*}}{\tilde{\beta }_{2k}} \right) ^{-\frac{1}{4}}. \end{aligned}$$
Moreover, the definition of
$$\begin{aligned} {\textit{v}}_* := \inf \left\{ {\textit{v}}_{k,i^*} :k\in \mathbb {N} \right\} \end{aligned}$$
and (28) imply that, for all \(k\in \mathbb {N}\),
$$\begin{aligned} \mathbb {E} \left[ \left\| \varvec{\textsf{d}}_k \right\| _{\textsf{H}_k}^2 \right] \le \frac{\tilde{\beta }_{2k}^{\frac{1}{2}}}{\tilde{\beta }_{1k}^2 {\textit{v}}_*^{\frac{1}{2}}} \left( \frac{\sigma ^2}{b} + G^2 \right) , \end{aligned}$$
completing the proof. \(\square \)
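The induction step behind (28) is deterministic once the gradient norms are bounded. The sketch below feeds bounded synthetic gradients into the exponential moving average \(\varvec{m}_k = \beta _{1k} \varvec{m}_{k-1} + \hat{\beta }_{1k} \nabla f_{B_k}(\varvec{\theta }_k)\) with an illustrative constant \(\beta _{1k} \equiv 0.9\) and a stand-in bound for \(\sigma ^2/b + G^2\), and confirms \(\Vert \varvec{m}_k\Vert ^2\) never exceeds that bound.

```python
import numpy as np

rng = np.random.default_rng(2)
dim, K = 8, 200
beta1 = 0.9
G2 = 1.0                  # stand-in for sigma^2/b + G^2 (assumed for the sketch)
m = np.zeros(dim)         # m_{-1} = 0, as in the proof
for k in range(K):
    g = rng.standard_normal(dim)
    g *= np.sqrt(G2) / max(np.linalg.norm(g), 1.0)   # enforce ||g_k||^2 <= G2
    m = beta1 * m + (1 - beta1) * g
    # Convexity of ||.||^2 plus induction: ||m_k||^2 <= max{||m_{-1}||^2, G2} = G2.
    assert m @ m <= G2 + 1e-12
```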
Lemma 3
Suppose that (S1)–(S3) and (A1)–(A2) hold. Then, Adam defined by Algorithm 1 satisfies the following: for all \(k\in \mathbb {N}\) and all \(\varvec{\theta } \in \mathbb {R}^d\),
$$\begin{aligned} \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_{k-1} \right] \le \frac{D(\varvec{\theta }) M^{\frac{1}{4}}}{{{\textit{v}}_*^{\frac{1}{4}}} \beta _{1k}} \sqrt{ \frac{\sigma ^2}{b} + G^2} + \frac{\alpha _k \sqrt{\tilde{\beta }_{2k}}}{2 \sqrt{\textit{v}_*} \beta _{1k} \tilde{\beta }_{1k}} \left( \frac{\sigma ^2}{b} + G^2 \right) + D(\varvec{\theta }) G \frac{\hat{\beta }_{1k}}{\beta _{1k}}, \end{aligned}$$
where \(\nabla f_{B_k}(\varvec{\theta }_k) \odot \nabla f_{B_k}(\varvec{\theta }_k):= (g_{k,i}^2) \in \mathbb {R}_{+}^d\), \(M:= \sup \{\max _{i\in [d]} g_{k,i}^2 :k\in \mathbb {N}\} < + \infty \), \(\hat{\beta }_{1k}:= 1 - \beta _{1k}\), \(\tilde{\beta }_{1k}:= 1 - \beta _{1k}^{k+1}\), \(\tilde{\beta }_{2k}:= 1 - \beta _{2k}^{k+1}\), \({\textit{v}}_*\) is defined as in Lemma 2, and \(D(\varvec{\theta })\) and G are defined as in Assumptions (A1) and (A2).
Proof
Let \(\varvec{\theta } \in \mathbb {R}^d\). Lemma 1 guarantees that for all \(k\in \mathbb {N}\),
$$\begin{aligned} \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_{k-1} \right]&= \underbrace{\frac{\tilde{\beta }_{1k}}{2 \alpha _k \beta _{1k}} \left\{ \mathbb {E}\left[ \left\| \varvec{\theta }_{k} - \varvec{\theta } \right\| _{\textsf{H}_k}^2 \right] - \mathbb {E}\left[ \left\| \varvec{\theta }_{k+1} - \varvec{\theta } \right\| _{\textsf{H}_k}^2 \right] \right\} }_{a_k}\\&\quad + \underbrace{\frac{\alpha _k \tilde{\beta }_{1k}}{2 \beta _{1k}} \mathbb {E}\left[ \left\| \varvec{\textsf{d}}_k \right\| _{\textsf{H}_k}^2 \right] }_{b_k} + \underbrace{\frac{\hat{\beta }_{1k}}{\beta _{1k}} \mathbb {E}\left[ (\varvec{\theta } - \varvec{\theta }_k)^\top \nabla f (\varvec{\theta }_k) \right] }_{c_k}. \end{aligned}$$
(29)
The triangle inequality and the definition of \(\varvec{\theta }_{k+1}:= \varvec{\theta }_{k} + \alpha _k \varvec{\textsf{d}}_k\) ensure that
$$\begin{aligned} \begin{aligned} a_k&= \frac{\tilde{\beta }_{1k}}{2 \alpha _k \beta _{1k}} \mathbb {E}\left[ \left( \left\| \varvec{\theta }_{k} - \varvec{\theta } \right\| _{\textsf{H}_k} + \left\| \varvec{\theta }_{k+1} - \varvec{\theta } \right\| _{\textsf{H}_k} \right) \left( \left\| \varvec{\theta }_{k} - \varvec{\theta } \right\| _{\textsf{H}_k} - \left\| \varvec{\theta }_{k+1} - \varvec{\theta } \right\| _{\textsf{H}_k} \right) \right] \\&\le \frac{\tilde{\beta }_{1k}}{2 \alpha _k \beta _{1k}} \mathbb {E}\left[ \left( \left\| \varvec{\theta }_{k} - \varvec{\theta } \right\| _{\textsf{H}_k} + \left\| \varvec{\theta }_{k+1} - \varvec{\theta } \right\| _{\textsf{H}_k} \right) \left\| \varvec{\theta }_{k} - \varvec{\theta }_{k+1} \right\| _{\textsf{H}_k} \right] \\&= \frac{\tilde{\beta }_{1k}}{2\beta _{1k}} \mathbb {E}\left[ \left( \left\| \varvec{\theta }_{k} - \varvec{\theta } \right\| _{\textsf{H}_k} + \left\| \varvec{\theta }_{k+1} - \varvec{\theta } \right\| _{\textsf{H}_k} \right) \left\| \varvec{\textsf{d}}_k \right\| _{\textsf{H}_k} \right] . \end{aligned} \end{aligned}$$
(30)
Let \(\nabla f_{B_k}(\varvec{\theta }_k) \odot \nabla f_{B_k}(\varvec{\theta }_k):= (g_{k,i}^2) \in \mathbb {R}_{+}^d\). Assumption (A1) ensures that there exists \(M \in \mathbb {R}\) such that, for all \(k\in \mathbb {N}\), \(\max _{i\in [d]} g_{k,i}^2 \le M\). The definition of \({\varvec{ v}}_k\) guarantees that, for all \(i\in [d]\) and all \(k\in \mathbb {N}\),
$$\begin{aligned} \textit{v}_{k,i} = \beta _{2k} \textit{v}_{k-1,i} + \hat{\beta }_{2k} g_{k,i}^2. \end{aligned}$$
Induction thus ensures that, for all \(i\in [d]\) and all \(k\in \mathbb {N}\),
$$\begin{aligned} \textit{v}_{k,i} \le \max \{ \textit{v}_{0,i}, M \} = M, \end{aligned}$$
where \(\varvec{ v}_0 = (\textit{v}_{0,i}) = \varvec{0}\). From the definition of \(\hat{\varvec{ v}}_k\), we have that, for all \(i\in [d]\) and all \(k\in \mathbb {N}\),
$$\begin{aligned} \hat{\textit{v}}_{k,i} = \frac{\textit{v}_{k,i}}{\tilde{\beta }_{2k}} \le \frac{M}{\tilde{\beta }_{2k}}, \end{aligned}$$
(31)
which implies that
$$\begin{aligned} \left\| \overline{\textsf{H}}_k \right\| = \left\| \textsf{diag}\left( \hat{\textit{v}}_{k,i}^{\frac{1}{4}} \right) \right\| = {\max _{i\in [d]} \hat{\textit{v}}_{k,i}^{\frac{1}{4}}} \le \left( \frac{M}{\tilde{\beta }_{2k}} \right) ^{\frac{1}{4}}. \end{aligned}$$
Hence, (A2) implies that, for all \(k\in \mathbb {N}\),
$$\begin{aligned}&\left\| \varvec{\theta }_{k} - \varvec{\theta } \right\| _{\textsf{H}_k} = \left\| \overline{\textsf{H}}_k (\varvec{\theta }_{k} - \varvec{\theta }) \right\| \le \left\| \overline{\textsf{H}}_k \right\| \left\| \varvec{\theta }_{k} - \varvec{\theta }\right\| \le D(\varvec{\theta }) \left( \frac{M}{\tilde{\beta }_{2k}} \right) ^{\frac{1}{4}},\\&\left\| \varvec{\theta }_{k+1} - \varvec{\theta } \right\| _{\textsf{H}_k} = \left\| \overline{\textsf{H}}_k (\varvec{\theta }_{k+1} - \varvec{\theta }) \right\| \le \left\| \overline{\textsf{H}}_k \right\| \left\| \varvec{\theta }_{k+1} - \varvec{\theta }\right\| \le D(\varvec{\theta }) \left( \frac{M}{\tilde{\beta }_{2k}} \right) ^{\frac{1}{4}}. \end{aligned}$$
Lemma 2, Jensen’s inequality, and (30) ensure that, for all \(k\in \mathbb {N}\),
$$\begin{aligned} \begin{aligned} a_k&\le \frac{\tilde{\beta }_{1k}}{2\beta _{1k}} 2 D(\varvec{\theta }) \left( \frac{M}{\tilde{\beta }_{2k}} \right) ^{\frac{1}{4}} \mathbb {E}\left[ \left\| \varvec{\textsf{d}}_k \right\| _{\textsf{H}_k} \right] \le \frac{\tilde{\beta }_{1k}}{\beta _{1k}} D(\varvec{\theta }) \frac{M^{\frac{1}{4}}}{\tilde{\beta }_{2k}^{\frac{1}{4}}} \frac{\tilde{\beta }_{2k}^{\frac{1}{4}}}{\tilde{\beta }_{1k} \textit{v}_*^{\frac{1}{4}}} \sqrt{ \frac{\sigma ^2}{b} + G^2} \\&= \frac{D(\varvec{\theta }) M^{\frac{1}{4}}}{{\textit{v}_*^{\frac{1}{4}}}\beta _{1k}} \sqrt{ \frac{\sigma ^2}{b} + G^2}. \end{aligned} \end{aligned}$$
(32)
Lemma 2 guarantees that, for all \(k\in \mathbb {N}\),
$$\begin{aligned} b_k \!=\! \frac{\alpha _k \tilde{\beta }_{1k}}{2 \beta _{1k}} \mathbb {E}\left[ \left\| \varvec{\textsf{d}}_k \right\| _{\textsf{H}_k}^2 \right] \!\le \! \frac{\alpha _k \tilde{\beta }_{1k}}{2 \beta _{1k}} \frac{\sqrt{\tilde{\beta }_{2k}}}{\tilde{\beta }_{1k}^2 \sqrt{\textit{v}_*}} \left( \frac{\sigma ^2}{b} \!+\! G^2 \right) \!=\! \frac{\alpha _k \sqrt{\tilde{\beta }_{2k}}}{2 \sqrt{\textit{v}_*} \beta _{1k} \tilde{\beta }_{1k}} \left( \frac{\sigma ^2}{b} \!+\! G^2 \right) . \end{aligned}$$
(33)
The Cauchy–Schwarz inequality and Assumption (A2) imply that, for all \(k\in \mathbb {N}\),
$$\begin{aligned} c_k = \frac{\hat{\beta }_{1k}}{\beta _{1k}} \mathbb {E}\left[ (\varvec{\theta } - \varvec{\theta }_k)^\top \nabla f (\varvec{\theta }_k) \right] \le D(\varvec{\theta }) G \frac{\hat{\beta }_{1k}}{\beta _{1k}}. \end{aligned}$$
(34)
Therefore, (29), (32), (33), and (34) ensure that, for all \(k\in \mathbb {N}\),
$$\begin{aligned} \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_{k-1} \right] \le \frac{D(\varvec{\theta }) M^{\frac{1}{4}}}{{\textit{v}_*^{\frac{1}{4}}}\beta _{1k}} \sqrt{ \frac{\sigma ^2}{b} + G^2} + \frac{\alpha _k \sqrt{\tilde{\beta }_{2k}}}{2 \sqrt{\textit{v}_*} \beta _{1k} \tilde{\beta }_{1k}} \left( \frac{\sigma ^2}{b} + G^2 \right) + D(\varvec{\theta }) G \frac{\hat{\beta }_{1k}}{\beta _{1k}}, \end{aligned}$$
which completes the proof. \(\square \)
Lemma 4
Suppose that (S1)–(S3) and (A1)–(A2) hold. Then, Adam defined by Algorithm 1 satisfies the following: for all \(k\in \mathbb {N}\) and all \(\varvec{\theta } \in \mathbb {R}^d\),
$$\begin{aligned} \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_{k} \right]&\le \frac{D(\varvec{\theta }) M^{\frac{1}{4}}}{{\textit{v}_*^{\frac{1}{4}}} \beta _{1k}} \sqrt{ \frac{\sigma ^2}{b} + G^2} + \frac{\alpha _k \sqrt{\tilde{\beta }_{2k}}}{2 \sqrt{\textit{v}_*} \beta _{1k} \tilde{\beta }_{1k}} \left( \frac{\sigma ^2}{b} + G^2 \right) + D(\varvec{\theta }) G \frac{\hat{\beta }_{1k}}{\beta _{1k}}\\&\quad + \hat{\beta }_{1k} D(\varvec{\theta }) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) , \end{aligned}$$
where the parameters are defined as in Lemma 3.
Proof
Let \(\varvec{\theta } \in \mathbb {R}^d\) and \(k\in \mathbb {N}\). The definition of \(\varvec{m}_k\) implies that
$$\begin{aligned} (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_{k}&= (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_{k-1} + (\varvec{\theta }_k - \varvec{\theta })^\top (\varvec{m}_{k} - \varvec{m}_{k-1})\\&= (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_{k-1} + \hat{\beta }_{1k} (\varvec{\theta }_k - \varvec{\theta })^\top (\nabla f_{B_k}(\varvec{\theta }_k) - \varvec{m}_{k-1}), \end{aligned}$$
which, together with the Cauchy–Schwarz inequality, the triangle inequality, and Assumptions (A1) and (A2), implies that
$$\begin{aligned} (\varvec{\theta }_k \!-\! \varvec{\theta })^\top \varvec{m}_{k}&\!\le \! (\varvec{\theta }_k \!-\! \varvec{\theta })^\top \varvec{m}_{k-1} \!+\! \hat{\beta }_{1k} D(\varvec{\theta }) \Vert \nabla f_{B_k}(\varvec{\theta }_k) \!-\! \varvec{m}_{k-1}\Vert \\&\!\le \! (\varvec{\theta }_k \!-\! \varvec{\theta })^\top \varvec{m}_{k-1} \!+\! \hat{\beta }_{1k} D(\varvec{\theta }) (B \!+\! \Vert \varvec{m}_{k-1}\Vert ). \end{aligned}$$
Lemma 2 and Jensen’s inequality guarantee that
$$\begin{aligned} \mathbb {E} \left[ (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_{k} \right] \le \mathbb {E} \left[ (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_{k-1} \right] + \hat{\beta }_{1k} D(\varvec{\theta }) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) . \end{aligned}$$
(35)
Hence, Lemma 3 implies that
$$\begin{aligned} \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_{k} \right]&\le \frac{D(\varvec{\theta }) M^{\frac{1}{4}}}{{\textit{v}_*^{\frac{1}{4}}} \beta _{1k}} \sqrt{ \frac{\sigma ^2}{b} + G^2} + \frac{\alpha _k \sqrt{\tilde{\beta }_{2k}}}{2 \sqrt{\textit{v}_*} \beta _{1k} \tilde{\beta }_{1k}} \left( \frac{\sigma ^2}{b} + G^2 \right) + D(\varvec{\theta }) G \frac{\hat{\beta }_{1k}}{\beta _{1k}}\\&\quad + \hat{\beta }_{1k} D(\varvec{\theta }) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) , \end{aligned}$$
which completes the proof. \(\square \)
Lemma 5
Suppose that (S1)–(S3) and (A1)–(A2) hold. Then, Adam defined by Algorithm 1 satisfies the following: for all \(k\in \mathbb {N}\) and all \(\varvec{\theta } \in \mathbb {R}^d\),
$$\begin{aligned} \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \nabla f(\varvec{\theta }_k) \right]&\le \frac{D(\varvec{\theta }) M^{\frac{1}{4}}}{{\textit{v}_*^{\frac{1}{4}}} \beta _{1k}} \sqrt{ \frac{\sigma ^2}{b} + G^2} + \frac{\alpha _k \sqrt{\tilde{\beta }_{2k}}}{2 \sqrt{\textit{v}_*} \beta _{1k} \tilde{\beta }_{1k}} \left( \frac{\sigma ^2}{b} + G^2 \right) + D(\varvec{\theta }) G \frac{\hat{\beta }_{1k}}{\beta _{1k}}\\&\quad + D(\varvec{\theta })\left( \frac{1}{\beta _{1k}} + 2 \hat{\beta }_{1k} \right) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) , \end{aligned}$$
where the parameters are defined as in Lemma 3.
Proof
Let \(\varvec{\theta } \in \mathbb {R}^d\) and \(k\in \mathbb {N}\). The definition of \(\varvec{m}_k\) ensures that
$$\begin{aligned}&(\varvec{\theta }_k - \varvec{\theta })^\top \nabla f_{B_k}(\varvec{\theta }_k)\\&= (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_k + (\varvec{\theta }_k - \varvec{\theta })^\top (\nabla f_{B_k}(\varvec{\theta }_k) - \varvec{m}_{k-1}) + (\varvec{\theta }_k - \varvec{\theta })^\top (\varvec{m}_{k-1} - \varvec{m}_{k})\\&= (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_k + \frac{1}{\beta _{1k}}(\varvec{\theta }_k - \varvec{\theta })^\top (\nabla f_{B_k}(\varvec{\theta }_k) - \varvec{m}_{k}) + \hat{\beta }_{1k} (\varvec{\theta }_k - \varvec{\theta })^\top (\varvec{m}_{k-1} - \nabla f_{B_k}(\varvec{\theta }_k)), \end{aligned}$$
which, together with the Cauchy–Schwarz inequality, the triangle inequality, and Assumptions (A1) and (A2), implies that
$$\begin{aligned} (\varvec{\theta }_k - \varvec{\theta })^\top \nabla f_{B_k}(\varvec{\theta }_k) \le (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_k + \frac{1}{\beta _{1k}} D(\varvec{\theta }) (B + \Vert \varvec{m}_{k}\Vert ) + \hat{\beta }_{1k} D(\varvec{\theta }) (B + \Vert \varvec{m}_{k-1}\Vert ). \end{aligned}$$
Lemma 2 and Jensen’s inequality guarantee that
$$\begin{aligned} \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \nabla f(\varvec{\theta }_k) \right] \le \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_k \right] + \left( \frac{1}{\beta _{1k}} + \hat{\beta }_{1k} \right) D(\varvec{\theta }) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) , \end{aligned}$$
(36)
which, together with Lemma 4, implies that
$$\begin{aligned} \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \nabla f(\varvec{\theta }_k) \right]&\le \frac{D(\varvec{\theta }) M^{\frac{1}{4}}}{{\textit{v}_*^{\frac{1}{4}}} \beta _{1k}} \sqrt{ \frac{\sigma ^2}{b} + G^2} + \frac{\alpha _k \sqrt{\tilde{\beta }_{2k}}}{2 \sqrt{\textit{v}_*} \beta _{1k} \tilde{\beta }_{1k}} \left( \frac{\sigma ^2}{b} + G^2 \right) + D(\varvec{\theta }) G \frac{\hat{\beta }_{1k}}{\beta _{1k}}\\&\quad + D(\varvec{\theta })\left( \frac{1}{\beta _{1k}} + 2 \hat{\beta }_{1k} \right) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) , \end{aligned}$$
which completes the proof. \(\square \)
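The three-term decomposition of \((\varvec{\theta }_k - \varvec{\theta })^\top \nabla f_{B_k}(\varvec{\theta }_k)\) at the start of this proof is a purely algebraic consequence of \(\varvec{m}_k = \beta _{1k} \varvec{m}_{k-1} + \hat{\beta }_{1k} \nabla f_{B_k}(\varvec{\theta }_k)\). A quick numerical check with arbitrary vectors (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
dim = 6
beta1 = 0.9
hbeta1 = 1.0 - beta1
g = rng.standard_normal(dim)        # stand-in for grad f_{B_k}(theta_k)
m_prev = rng.standard_normal(dim)   # m_{k-1}
m = beta1 * m_prev + hbeta1 * g     # m_k
u = rng.standard_normal(dim)        # stand-in for theta_k - theta

# u^T g = u^T m_k + (1/beta1) u^T (g - m_k) + (1 - beta1) u^T (m_{k-1} - g)
lhs = u @ g
rhs = u @ m + (1.0 / beta1) * (u @ (g - m)) + hbeta1 * (u @ (m_prev - g))
assert np.isclose(lhs, rhs)
```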
Lemma 6
Suppose that (S1)–(S3) and (A1)–(A2) hold, \(\beta _{1k}:= \beta _1 \in (0,1)\), and \((\alpha _k)_{k\in \mathbb {N}}\) is monotone decreasing. Then, Adam defined by Algorithm 1 with (9) satisfies the following: for all \(K\ge 1\) and all \(\varvec{\theta } \in \mathbb {R}^d\),
$$\begin{aligned} \frac{1}{K}\sum _{k=1}^K \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_{k-1} \right] \le \frac{d \tilde{D}(\varvec{\theta }) \sqrt{M} \tilde{\beta }_{1K}}{2 \beta _1 \alpha _K \sqrt{\tilde{\beta }_{2K}}K} + \frac{(\sigma ^2 b^{-1} + G^2)}{2 \sqrt{\textit{v}_*} \beta _{1} \hat{\beta }_1 K} \sum _{k=1}^K \alpha _k \sqrt{\tilde{\beta }_{2k}} + D(\varvec{\theta }) G \frac{\hat{\beta }_{1}}{\beta _{1}}, \end{aligned}$$
where the parameters are defined as in Lemma 3 and \(\tilde{D} (\varvec{\theta }):= \sup \{ \max _{i\in [d]} (\theta _{k,i} - \theta _i)^2 :k \in \mathbb {N} \} < + \infty \).
Proof
Let \(\varvec{\theta } \in \mathbb {R}^d\) and
$$\begin{aligned} \gamma _k := \frac{\tilde{\beta }_{1k}}{2 \beta _{1} \alpha _k} \end{aligned}$$
for all \(k\in \mathbb {N}\). Since \((\alpha _k)_{k\in \mathbb {N}}\) is monotone decreasing and \(\tilde{\beta }_{1k} = 1 - \beta _1^{k+1} \le 1 - \beta _1^{k+2} = \tilde{\beta }_{1,k+1}\), \((\gamma _k)_{k\in \mathbb {N}}\) is monotone increasing. From the definition of \(a_k\) in (29), we have that, for all \(K \ge 1\),
$$\begin{aligned} \begin{aligned} \sum _{k = 1}^K a_k&= \gamma _1 \mathbb {E}\left[ \left\| \varvec{\theta }_{1} - \varvec{\theta } \right\| _{\textsf{H}_{1}}^2\right] + \underbrace{ \sum _{k=2}^K \left\{ \gamma _k \mathbb {E}\left[ \left\| \varvec{\theta }_{k} - \varvec{\theta } \right\| _{\textsf{H}_{k}}^2\right] - \gamma _{k-1} \mathbb {E}\left[ \left\| \varvec{\theta }_{k} - \varvec{\theta } \right\| _{\textsf{H}_{k-1}}^2\right] \right\} }_{{\Gamma }_K}\\&\quad - \gamma _{K} \mathbb {E} \left[ \left\| \varvec{\theta }_{K+1} - \varvec{\theta } \right\| _{\textsf{H}_{K}}^2 \right] . \end{aligned} \end{aligned}$$
(37)
Since \(\overline{\textsf{H}}_k \in \mathbb {S}_{++}^d\) exists such that \(\textsf{H}_k = \overline{\textsf{H}}_k^2\), we have \(\Vert \varvec{x}\Vert _{\textsf{H}_k}^2 = \Vert \overline{\textsf{H}}_k \varvec{x} \Vert ^2\) for all \(\varvec{x}\in \mathbb {R}^d\). Accordingly, we have
$$\begin{aligned} {\Gamma }_K = \mathbb {E} \left[ \sum _{k=2}^K \left\{ \gamma _{k} \left\| \overline{\textsf{H}}_{k} (\varvec{\theta }_{k} - \varvec{\theta }) \right\| ^2 - \gamma _{k-1} \left\| \overline{\textsf{H}}_{k-1} (\varvec{\theta }_{k} - \varvec{\theta }) \right\| ^2 \right\} \right] . \end{aligned}$$
From \(\overline{\textsf{H}}_{k} = \textsf{diag}(\hat{\textit{v}}_{k,i}^{1/4})\), we have that, for all \(\varvec{x} = (x_i)_{i=1}^d \in \mathbb {R}^d\), \(\Vert \overline{\textsf{H}}_{k} \varvec{x} \Vert ^2 = \sum _{i=1}^d \sqrt{\hat{\textit{v}}_{k,i}} x_i^2\). Hence, for all \(K\ge 2\),
$$\begin{aligned} {\Gamma }_K = \mathbb {E} \left[ \sum _{k=2}^K \sum _{i=1}^d \left( \gamma _{k} \sqrt{\hat{\textit{v}}_{k,i}} - \gamma _{k-1} \sqrt{\hat{\textit{v}}_{k-1,i}} \right) (\theta _{k,i} - \theta _i)^2 \right] . \end{aligned}$$
(38)
Condition (9) and \(\gamma _k \ge \gamma _{k-1}\) (\(k \ge 1\)) imply that, for all \(k \ge 1\) and all \(i\in [d]\),
$$\begin{aligned} \gamma _{k} \sqrt{\hat{\textit{v}}_{k,i}} - \gamma _{k-1} \sqrt{\hat{\textit{v}}_{k-1,i}} \ge 0. \end{aligned}$$
Moreover, (A2) ensures that \(\tilde{D} (\varvec{\theta }):= \sup \{ \max _{i\in [d]} (\theta _{k,i} - \theta _i)^2 :k \in \mathbb {N} \} < + \infty \). Accordingly, for all \(K \ge 2\),
$$\begin{aligned} {\Gamma }_K \le \tilde{D}(\varvec{\theta }) \mathbb {E} \left[ \sum _{k=2}^K \sum _{i=1}^d \left( \gamma _{k}\sqrt{\hat{\textit{v}}_{k,i}} - \gamma _{k-1} \sqrt{\hat{\textit{v}}_{k-1,i}} \right) \right] = \tilde{D}(\varvec{\theta }) \mathbb {E} \left[ \sum _{i=1}^d \left( \gamma _{K} \sqrt{\hat{\textit{v}}_{K,i}} - \gamma _{1} \sqrt{\hat{\textit{v}}_{1,i}} \right) \right] . \end{aligned}$$
Therefore, (37), \(\mathbb {E} [\Vert \varvec{\theta }_{1} - \varvec{\theta }\Vert _{\textsf{H}_{1}}^2] \le \tilde{D}(\varvec{\theta }) \mathbb {E} [ \sum _{i=1}^d \sqrt{\hat{\textit{v}}_{1,i}}]\), and (31) imply, for all \(K\ge 1\),
$$\begin{aligned} \begin{aligned} \sum _{k=1}^K a_k&\le \gamma _{1} \tilde{D}(\varvec{\theta }) \mathbb {E} \left[ \sum _{i=1}^d \sqrt{\hat{\textit{v}}_{1,i}} \right] + \tilde{D}(\varvec{\theta }) \mathbb {E} \left[ \sum _{i=1}^d \left( \gamma _{K} \sqrt{\hat{\textit{v}}_{K,i}} - \gamma _{1} \sqrt{\hat{\textit{v}}_{1,i}} \right) \right] \\&= \gamma _{K} \tilde{D}(\varvec{\theta }) \mathbb {E} \left[ \sum _{i=1}^d \sqrt{\hat{\textit{v}}_{K,i}} \right] \\&\le {\gamma }_K \tilde{D}(\varvec{\theta }) \sum _{i=1}^d \sqrt{\frac{M}{\tilde{\beta }_{2K}}}\\&\le \frac{d \tilde{D}(\varvec{\theta }) \sqrt{M} \tilde{\beta }_{1K}}{2 \beta _1 \alpha _K \sqrt{\tilde{\beta }_{2K}}}. \end{aligned} \end{aligned}$$
(39)
Inequality (33) with \(\beta _{1k} = \beta _1\) and \(\tilde{\beta }_{1k}:= 1 - \beta _1^{k+1} \ge 1 - \beta _1 =: \hat{\beta }_1\) implies that
$$\begin{aligned} b_k \le \frac{\alpha _k \sqrt{\tilde{\beta }_{2k}}}{2 \sqrt{\textit{v}_*} \beta _{1k} \tilde{\beta }_{1k}} \left( \frac{\sigma ^2}{b} + G^2 \right) \le \frac{\alpha _k \sqrt{\tilde{\beta }_{2k}}}{2 \sqrt{\textit{v}_*} \beta _{1} \hat{\beta }_1} \left( \frac{\sigma ^2}{b} + G^2 \right) . \end{aligned}$$
(40)
Inequality (34) with \(\beta _{1k} = \beta _1\) implies that
$$\begin{aligned} c_k \le D(\varvec{\theta }) G \frac{\hat{\beta }_{1k}}{\beta _{1k}} = D(\varvec{\theta }) G \frac{\hat{\beta }_{1}}{\beta _{1}}. \end{aligned}$$
(41)
Hence, (29), (39), (40), and (41) ensure that, for all \(K\ge 1\),
$$\begin{aligned} \frac{1}{K}\sum _{k=1}^K \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_{k-1} \right] \le \frac{d \tilde{D}(\varvec{\theta }) \sqrt{M} \tilde{\beta }_{1K}}{2 \beta _1 \alpha _K \sqrt{\tilde{\beta }_{2K}}K} + \frac{(\sigma ^2 b^{-1} + G^2)}{2 \sqrt{\textit{v}_*} \beta _{1} \hat{\beta }_1 K} \sum _{k=1}^K \alpha _k \sqrt{\tilde{\beta }_{2k}} + D(\varvec{\theta }) G \frac{\hat{\beta }_{1}}{\beta _{1}}, \end{aligned}$$
which completes the proof. \(\square \)
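The collapse of the \(\gamma _1\) term plus \(\Gamma _K\) into the single \(\gamma _K\) term in (39) is a telescoping sum. The sketch below verifies it with a random increasing sequence standing in for \(\gamma _k\) and arbitrary positive stand-ins for \(\sqrt{\hat{\textit{v}}_{k,i}}\) (illustrative values only; the monotonicity needed for the inequality step is not modeled here).

```python
import numpy as np

rng = np.random.default_rng(4)
K, dim = 50, 3
gamma = np.sort(rng.uniform(0.1, 1.0, K))     # monotone increasing gamma_k (illustrative)
sqrt_vhat = rng.uniform(0.5, 2.0, (K, dim))   # stand-ins for sqrt(v_hat_{k,i})
s = gamma[:, None] * sqrt_vhat                # s[k, i] = gamma_k * sqrt(v_hat_{k,i})

# First term plus the telescoping differences collapses to the K-th term,
# mirroring the first equality in (39).
total = s[0] + np.sum(s[1:] - s[:-1], axis=0)
assert np.allclose(total, s[-1])
```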
Lemma 7
Suppose that (S1)–(S3) and (A1)–(A2) hold, \(\beta _{1k}:= \beta _1 \in (0,1)\), and \((\alpha _k)_{k\in \mathbb {N}}\) is monotone decreasing. Then, Adam defined by Algorithm 1 with (9) satisfies the following: for all \(K\ge 1\) and all \(\varvec{\theta } \in \mathbb {R}^d\),
$$\begin{aligned} \frac{1}{K}\sum _{k=1}^K \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_{k} \right]&\le \frac{d \tilde{D}(\varvec{\theta }) \sqrt{M} \tilde{\beta }_{1K}}{2 \beta _1 \alpha _K \sqrt{\tilde{\beta }_{2K}}K} + \frac{(\sigma ^2 b^{-1} + G^2)}{2 \sqrt{\textit{v}_*} \beta _{1} \hat{\beta }_1 K} \sum _{k=1}^K \alpha _k \sqrt{\tilde{\beta }_{2k}} + D(\varvec{\theta }) G \frac{\hat{\beta }_{1}}{\beta _{1}}\\&\quad + \hat{\beta }_{1} D(\varvec{\theta }) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) , \end{aligned}$$
where the parameters are defined as in Lemma 6.
Proof
Let \(\varvec{\theta } \in \mathbb {R}^d\). Inequality (35) with \(\beta _{1k} = \beta _1\) implies that, for all \(K \ge 1\),
$$\begin{aligned} \frac{1}{K} \sum _{k=1}^K \mathbb {E} \left[ (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_{k} \right] \le \frac{1}{K} \sum _{k=1}^K \mathbb {E} \left[ (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_{k-1} \right] + \hat{\beta }_{1} D(\varvec{\theta }) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) . \end{aligned}$$
Hence, Lemma 6 leads to Lemma 7. \(\square \)
Lemma 8
Suppose that (S1)–(S3) and (A1)–(A2) hold, \(\beta _{1k}:= \beta _1 \in (0,1)\), and \((\alpha _k)_{k\in \mathbb {N}}\) is monotone decreasing. Then, Adam defined by Algorithm 1 with (9) satisfies the following: for all \(K\ge 1\) and all \(\varvec{\theta } \in \mathbb {R}^d\),
$$\begin{aligned} \frac{1}{K}\sum _{k=1}^K \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \nabla f (\varvec{\theta }_{k}) \right]&\le \frac{d \tilde{D}(\varvec{\theta }) \sqrt{M} \tilde{\beta }_{1K}}{2 \beta _1 \alpha _K \sqrt{\tilde{\beta }_{2K}}K} + \frac{(\sigma ^2 b^{-1} + G^2)}{2 \sqrt{\textit{v}_*} \beta _{1} \hat{\beta }_1 K} \sum _{k=1}^K \alpha _k \sqrt{\tilde{\beta }_{2k}} + D(\varvec{\theta }) G \frac{\hat{\beta }_{1}}{\beta _{1}}\\&\quad + \left( \frac{1}{\beta _1} + 2\hat{\beta }_{1} \right) D(\varvec{\theta }) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) , \end{aligned}$$
where the parameters are defined as in Lemma 6.
Proof
Let \(\varvec{\theta } \in \mathbb {R}^d\). Inequality (36) with \(\beta _{1k} = \beta _1\) implies that, for all \(K \ge 1\),
$$\begin{aligned}&\frac{1}{K} \sum _{k=1}^K \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \nabla f(\varvec{\theta }_k) \right] \\&\le \frac{1}{K} \sum _{k=1}^K \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \varvec{m}_k \right] + \left( \frac{1}{\beta _{1}} + \hat{\beta }_{1} \right) D(\varvec{\theta }) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) , \end{aligned}$$
which, together with Lemma 7, shows that Lemma 8 holds. \(\square \)
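The bias-correction identity \(\tilde{\beta }_{1k} = 1 - \beta _1^{k+1}\) used throughout these lemmas can be sanity-checked numerically. The following is a minimal sketch of the standard Adam moment recursions (illustrative only; it omits the parameter update and the matrix \(\textsf{H}_k\) of Algorithm 1): with a constant gradient \(g\), one has \(\varvec{m}_k = (1-\beta _1^{k+1})g\), so the bias-corrected estimate \(\hat{\varvec{m}}_k = \varvec{m}_k/\tilde{\beta }_{1k}\) recovers \(g\) exactly at every step, and likewise \(\hat{v}_k = g^2\).

```python
def adam_moments(grads, beta1=0.9, beta2=0.999):
    """First/second moment recursions with bias corrections
    1 - beta1**(k+1) and 1 - beta2**(k+1) (tilde_beta_{1k}, tilde_beta_{2k} in the text)."""
    m, v = 0.0, 0.0
    history = []
    for k, g in enumerate(grads):
        m = beta1 * m + (1 - beta1) * g       # m_k
        v = beta2 * v + (1 - beta2) * g * g   # v_k
        m_hat = m / (1 - beta1 ** (k + 1))    # hat{m}_k = m_k / tilde_beta_{1k}
        v_hat = v / (1 - beta2 ** (k + 1))    # hat{v}_k = v_k / tilde_beta_{2k}
        history.append((m_hat, v_hat))
    return history

# With a constant gradient, bias correction is exact at every step.
hist = adam_moments([0.5] * 20)
```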
A.2 Proof of Theorem 1
Proof
Lemmas 4 and 5 with
$$\begin{aligned} \alpha _k = \alpha , \text { } \beta _{1k} = \beta _1, \text { } \beta _{2k} = \beta _2, \text { } \tilde{\beta }_{1k} = 1 - \beta _1^{k+1},\text { } \tilde{\beta }_{2k} = 1 - \beta _2^{k+1}, \text { } \hat{\beta }_{1} = 1 - \beta _1 \end{aligned}$$
imply that
$$\begin{aligned} \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta }^\star )^\top \varvec{m}_{k} \right]&\le \frac{D(\varvec{\theta }^\star ) M^{\frac{1}{4}}}{{\textit{v}_*^{\frac{1}{4}}} \beta _{1}} \sqrt{ \frac{\sigma ^2}{b} + G^2} + \frac{\alpha \sqrt{\tilde{\beta }_{2k}}}{2 \sqrt{\textit{v}_*} \beta _{1} \tilde{\beta }_{1k}} \left( \frac{\sigma ^2}{b} + G^2 \right) + D(\varvec{\theta }^\star ) G \frac{\hat{\beta }_{1}}{\beta _{1}}\\&\quad + \hat{\beta }_{1} D(\varvec{\theta }^\star ) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) ,\\ \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \nabla f(\varvec{\theta }_k) \right]&\le \frac{D(\varvec{\theta }) M^{\frac{1}{4}}}{{\textit{v}_*^{\frac{1}{4}}} \beta _{1}} \sqrt{ \frac{\sigma ^2}{b} + G^2} + \frac{\alpha \sqrt{\tilde{\beta }_{2k}}}{2 \sqrt{\textit{v}_*} \beta _{1} \tilde{\beta }_{1k}} \left( \frac{\sigma ^2}{b} + G^2 \right) \\&\quad + D(\varvec{\theta }) G \frac{\hat{\beta }_{1}}{\beta _{1}} + D(\varvec{\theta })\left( \frac{1}{\beta _{1}} + 2 \hat{\beta }_{1} \right) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) , \end{aligned}$$
which completes the proof. \(\square \)
A.3 Proof of Corollary 1
Proof
The sequences \((\tilde{\beta }_{1k})_{k\in \mathbb {N}}\) and \((\tilde{\beta }_{2k})_{k\in \mathbb {N}}\) converge to 1. Theorem 1 thus leads to Corollary 1. \(\square \)
A.4 Proof of Theorem 2
Proof
Lemmas 4 and 5 with
$$\begin{aligned}&\alpha _k = \frac{1}{k^a}, \text { } \beta _{1k} = 1 - \frac{1}{k^{b_1}}, \text { } \beta _{2k} = \left( 1 - \frac{1}{k^{b_2}} \right) ^{\frac{1}{k+1}}, \text { } \tilde{\beta }_{1k} = 1 - \beta _{1k}^{k+1} \ge 1 - \beta _{1k},\text { } \tilde{\beta }_{2k} = 1 - \beta _{2k}^{k+1},\\&\hat{\beta }_{1k} = 1 - \beta _{1k} \end{aligned}$$
imply that
$$\begin{aligned} \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta }^\star )^\top \varvec{m}_{k} \right]&\le \frac{D(\varvec{\theta }^\star ) M^{\frac{1}{4}}}{{\textit{v}_*^{\frac{1}{4}}} \beta _{1k}} \sqrt{ \frac{\sigma ^2}{b} + G^2} + \frac{\alpha _k \sqrt{\tilde{\beta }_{2k}}}{2 \sqrt{\textit{v}_*} \beta _{1k} \tilde{\beta }_{1k}} \left( \frac{\sigma ^2}{b} + G^2 \right) + D(\varvec{\theta }^\star ) G \frac{\hat{\beta }_{1k}}{\beta _{1k}}\\&\quad + \hat{\beta }_{1k} D(\varvec{\theta }^\star ) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) \\&\le \frac{D(\varvec{\theta }^\star ) M^{\frac{1}{4}} k^{b_1}}{\textit{v}_*^{\frac{1}{4}}(k^{b_1} - 1)}\sqrt{ \frac{\sigma ^2}{b} + G^2} + \frac{1}{2 \sqrt{\textit{v}_*} (k^{b_1} - 1) k^{a +\frac{b_2}{2} -2 b_1}} \left( \frac{\sigma ^2}{b} + G^2 \right) \\&\quad + \frac{1}{k^{b_1} -1} D(\varvec{\theta }^\star ) G + \frac{1}{k^{b_1}} D(\varvec{\theta }^\star ) \left( B + \sqrt{ \frac{\sigma ^2}{b} + G^2} \right) , \end{aligned}$$
$$\begin{aligned} \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \nabla f(\varvec{\theta }_k) \right]&\le \frac{D(\varvec{\theta }) M^{\frac{1}{4}}}{{\textit{v}_*^{\frac{1}{4}}} \beta _{1k}} \sqrt{ \frac{\sigma ^2}{b} + G^2} + \frac{\alpha _k \sqrt{\tilde{\beta }_{2k}}}{2 \sqrt{\textit{v}_*} \beta _{1k} \tilde{\beta }_{1k}} \left( \frac{\sigma ^2}{b} + G^2 \right) + D(\varvec{\theta }) G \frac{\hat{\beta }_{1k}}{\beta _{1k}}\\&\quad + D(\varvec{\theta })\left( \frac{1}{\beta _{1k}} + 2 \hat{\beta }_{1k} \right) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) \\&\le \frac{D(\varvec{\theta }) M^{\frac{1}{4}} k^{b_1}}{\textit{v}_*^{\frac{1}{4}}(k^{b_1} - 1)}\sqrt{ \frac{\sigma ^2}{b} + G^2} + \frac{1}{2 \sqrt{\textit{v}_*} (k^{b_1} - 1) k^{a +\frac{b_2}{2} -2 b_1}} \left( \frac{\sigma ^2}{b} + G^2 \right) \\&\quad + \frac{1}{k^{b_1} -1} D(\varvec{\theta }) G + \frac{1}{k^{b_1}} D(\varvec{\theta }) \left( B + \sqrt{ \frac{\sigma ^2}{b} + G^2} \right) \\&\quad + \frac{k^{2 b_1} + 2 k^{b_1} - 2}{k^{b_1} (k^{b_1} - 1)} D(\varvec{\theta }) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) , \end{aligned}$$
which completes the proof. \(\square \)
A.5 Proof of Corollary 2
Proof
Since \(a - b_1 + b_2/2 > 0\), we have that
$$\begin{aligned} \frac{1}{(k^{b_1} - 1) k^{a +\frac{b_2}{2} -2 b_1}} = \frac{1}{k^{a +\frac{b_2}{2} - b_1}- k^{a +\frac{b_2}{2} -2 b_1}} = \frac{1}{k^{a +\frac{b_2}{2} - b_1}}\left( {1 - \frac{1}{k^{b_1}}}\right) ^{-1} \rightarrow 0. \end{aligned}$$
Theorem 2 thus leads to Corollary 2. \(\square \)
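The vanishing of this middle term can also be checked numerically. The sketch below uses illustrative parameter values satisfying \(a - b_1 + b_2/2 > 0\) (here \(a - b_1 + b_2/2 = 0.2\)); the term decays like \(k^{-(a + b_2/2 - b_1)}\).

```python
def middle_term(k, a, b1, b2):
    # 1 / ((k^{b1} - 1) k^{a + b2/2 - 2 b1}), as in the proof of Corollary 2
    return 1.0 / ((k ** b1 - 1.0) * k ** (a + b2 / 2 - 2 * b1))

# Evaluate along k = 10, 100, ..., 10^12; the sequence should decrease to 0.
vals = [middle_term(10 ** p, a=0.5, b1=0.4, b2=0.2) for p in range(1, 13)]
```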
A.6 Proof of Theorem 3
Proof
Lemmas 7 and 8 with
$$\begin{aligned} \alpha _k = \alpha , \text { } \beta _{1k} = \beta _1, \text { } \beta _{2k} = \beta _2, \text { } \tilde{\beta }_{1k} = 1 - \beta _1^{k+1},\text { } \tilde{\beta }_{2k} = 1 - \beta _2^{k+1} \le 1, \text { } \hat{\beta }_{1} = 1 - \beta _1 \end{aligned}$$
ensure that
$$\begin{aligned} \frac{1}{K}\sum _{k=1}^K \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta }^\star )^\top \varvec{m}_{k} \right]&\le \frac{d \tilde{D}(\varvec{\theta }^\star ) \sqrt{M} \tilde{\beta }_{1K}}{2 \beta _1 \alpha \sqrt{\tilde{\beta }_{2K}}K} + \frac{(\sigma ^2 b^{-1} + G^2)}{2 \sqrt{\textit{v}_*} \beta _{1} \hat{\beta }_1 K} \sum _{k=1}^K \alpha \sqrt{\tilde{\beta }_{2k}} + D(\varvec{\theta }^\star ) G \frac{\hat{\beta }_{1}}{\beta _{1}}\\&\quad + \hat{\beta }_{1} D(\varvec{\theta }^\star ) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) \\&\le \frac{d \tilde{D}(\varvec{\theta }^\star ) \sqrt{M} \tilde{\beta }_{1K}}{2 \beta _1 \alpha \sqrt{\tilde{\beta }_{2K}}K} + \frac{(\sigma ^2 b^{-1} + G^2)}{2 \sqrt{\textit{v}_*} \beta _{1} \hat{\beta }_1} \alpha + D(\varvec{\theta }^\star ) G \frac{\hat{\beta }_{1}}{\beta _{1}}\\&\quad + \hat{\beta }_{1} D(\varvec{\theta }^\star ) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) \end{aligned}$$
and that
$$\begin{aligned} \frac{1}{K}\sum _{k=1}^K \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta })^\top \nabla f (\varvec{\theta }_{k}) \right]&\le \frac{d \tilde{D}(\varvec{\theta }) \sqrt{M} \tilde{\beta }_{1K}}{2 \beta _1 \alpha \sqrt{\tilde{\beta }_{2K}}K} + \frac{(\sigma ^2 b^{-1} + G^2)}{2 \sqrt{\textit{v}_*} \beta _{1} \hat{\beta }_1 K} \sum _{k=1}^K \alpha \sqrt{\tilde{\beta }_{2k}} + D(\varvec{\theta }) G \frac{\hat{\beta }_{1}}{\beta _{1}}\\&\quad + \left( \frac{1}{\beta _1} + 2\hat{\beta }_{1} \right) D(\varvec{\theta }) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) \\&\le \frac{d \tilde{D}(\varvec{\theta }) \sqrt{M} \tilde{\beta }_{1K}}{2 \beta _1 \alpha \sqrt{\tilde{\beta }_{2K}}K} + \frac{(\sigma ^2 b^{-1} + G^2)}{2 \sqrt{\textit{v}_*} \beta _{1} \hat{\beta }_1} \alpha + D(\varvec{\theta }) G \frac{\hat{\beta }_{1}}{\beta _{1}}\\&\quad + \left( \frac{1}{\beta _1} + 2\hat{\beta }_{1} \right) D(\varvec{\theta }) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) , \end{aligned}$$
which completes the proof. \(\square \)
A.7 Proof of Theorem 4
Proof
Let
$$\begin{aligned} \alpha _k = \alpha , \text { } \beta _{1k} = \beta _1, \text { } \beta _{2k} = \left( 1 - \frac{1}{k^{b_2}} \right) ^{\frac{1}{k+1}}, \text { } \tilde{\beta }_{1k} = 1 - \beta _1^{k+1} \le 1,\text { } \tilde{\beta }_{2k} = 1 - \beta _{2k}^{k+1}, \text { } \hat{\beta }_{1} = 1 - \beta _1, \end{aligned}$$
where \(b_2 \in (0,2)\). We have that
$$\begin{aligned} \sqrt{\tilde{\beta }_{2k}} = \sqrt{1 - \beta _{2k}^{k+1}} = \sqrt{\frac{1}{k^{b_2}}} = \frac{1}{k^{\frac{b_2}{2}}}. \end{aligned}$$
Lemma 7 ensures that
$$\begin{aligned} \frac{1}{K}\sum _{k=1}^K \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta }^\star )^\top \varvec{m}_{k} \right]&\le \frac{d \tilde{D}(\varvec{\theta }^\star ) \sqrt{M} \tilde{\beta }_{1K}}{2 \beta _1 \alpha \sqrt{\tilde{\beta }_{2K}}K} + \frac{(\sigma ^2 b^{-1} + G^2)}{2 \sqrt{\textit{v}_*} \beta _{1} \hat{\beta }_1 K} \sum _{k=1}^K \alpha \sqrt{\tilde{\beta }_{2k}} + D(\varvec{\theta }^\star ) G \frac{\hat{\beta }_{1}}{\beta _{1}}\\&\quad + \hat{\beta }_{1} D(\varvec{\theta }^\star ) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) \\&\le \frac{d \tilde{D}(\varvec{\theta }^\star ) \sqrt{M}}{2 \beta _1 \alpha K^{1-\frac{b_2}{2}}} + \frac{(\sigma ^2 b^{-1} + G^2)\alpha }{2 \sqrt{\textit{v}_*} \beta _{1} \hat{\beta }_1 K} \sum _{k=1}^K \frac{1}{k^{\frac{b_2}{2}}} + D(\varvec{\theta }^\star ) G \frac{\hat{\beta }_{1}}{\beta _{1}}\\&\quad + \hat{\beta }_{1} D(\varvec{\theta }^\star ) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) . \end{aligned}$$
We also have that
$$\begin{aligned} \begin{aligned} \frac{1}{K} \sum _{k=1}^K \frac{1}{k^{\frac{b_2}{2}}}&\le \frac{1}{K} \left( 1 + \int _1^K \frac{\textrm{d}t}{t^{\frac{b_2}{2}}} \right) = \frac{1}{K} \left\{ 1 + \left[ \frac{t^{ 1 - \frac{b_2}{2}}}{1 - \frac{b_2}{2}} \right] _1^K \right\} = \frac{1}{K} \left\{ 1 + \frac{K^{ 1 - \frac{b_2}{2}} - 1}{1 - \frac{b_2}{2}} \right\} \\&\le \frac{1}{K} \cdot \frac{K^{ 1 - \frac{b_2}{2}}}{1 - \frac{b_2}{2}} = \frac{2}{(2 - b_2) K^{\frac{b_2}{2}}}, \end{aligned} \end{aligned}$$
(42)
where the last inequality uses \(1 - \frac{b_2}{2} \le 1\). Hence,
$$\begin{aligned} \frac{1}{K}\sum _{k=1}^K \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta }^\star )^\top \varvec{m}_{k} \right]&\le \frac{d \tilde{D}(\varvec{\theta }^\star ) \sqrt{M}}{2 \beta _1 \alpha K^{1-\frac{b_2}{2}}} + \frac{(\sigma ^2 b^{-1} + G^2)\alpha }{(2 - b_2) \sqrt{\textit{v}_*} \beta _{1} \hat{\beta }_1 K^{\frac{b_2}{2}}} + D(\varvec{\theta }^\star ) G \frac{\hat{\beta }_{1}}{\beta _{1}}\\&\quad + \hat{\beta }_{1} D(\varvec{\theta }^\star ) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) . \end{aligned}$$
An argument similar to the one above, together with Lemma 8, implies that
$$\begin{aligned} \frac{1}{K} \sum _{k=1}^K \mathbb {E}\left[ \nabla f(\varvec{\theta }_k)^\top (\varvec{\theta }_k - \varvec{\theta }) \right]&\le \frac{d \tilde{D}(\varvec{\theta }) \sqrt{M}}{2 \alpha \beta _1 K^{1 - \frac{b_2}{2}}} + \frac{\alpha }{(2 - b_2) \sqrt{\textit{v}_*} \beta _{1} (1-\beta _1) K^{\frac{b_2}{2}}}\left( \frac{\sigma ^2}{b} + G^2 \right) \\&\quad + \frac{1 - \beta _{1}}{\beta _{1}} D(\varvec{\theta }) G + \left( \frac{1}{\beta _{1}} + 2 (1 - \beta _{1}) \right) D(\varvec{\theta }) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) , \end{aligned}$$
which completes the proof. \(\square \)
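The elementary integral comparison behind (42), bounding the average of \(k^{-c}\) with \(c = b_2/2 \in (0,1)\) by \((1 + \int _1^K t^{-c}\,\textrm{d}t)/K\) via the exact antiderivative \(t^{1-c}/(1-c)\), can be verified numerically (illustrative values of \(c\) and \(K\)):

```python
def avg_power_sum(K, c):
    # (1/K) * sum_{k=1}^K k^{-c}
    return sum(k ** (-c) for k in range(1, K + 1)) / K

def integral_bound(K, c):
    # (1 + int_1^K t^{-c} dt) / K, with exact antiderivative t^{1-c} / (1-c)
    return (1.0 + (K ** (1.0 - c) - 1.0) / (1.0 - c)) / K

# The bound should dominate the average for every tested pair (K, c)
checks = [(K, c) for K in (10, 100, 10_000) for c in (0.1, 0.5, 0.95)]
```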
A.8 Proof of Theorem 5
Proof
Let
$$\begin{aligned} \alpha _k \!=\! \frac{1}{k^a}, \text { } \beta _{1k} \!=\! \beta _1, \text { } \beta _{2k} \!=\! \beta _2, \text { } \tilde{\beta }_{1k} \!=\! 1 \!-\! \beta _1^{k+1} \!\le \! 1,\text { } \tilde{\beta }_{2k} \!=\! 1 \!-\! \beta _2^{k+1} \!\le \! 1, \text { } \hat{\beta }_{1} \!=\! 1 \!-\! \beta _1. \end{aligned}$$
We have that \(\tilde{\beta }_{2k} = 1 - \beta _{2k}^{k+1} \ge 1 - \beta _2\). Lemma 7 ensures that
$$\begin{aligned} \frac{1}{K}\sum _{k=1}^K \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta }^\star )^\top \varvec{m}_{k} \right]&\le \frac{d \tilde{D}(\varvec{\theta }^\star ) \sqrt{M} \tilde{\beta }_{1K}}{2 \beta _1 \alpha _K \sqrt{\tilde{\beta }_{2K}}K} + \frac{(\sigma ^2 b^{-1} + G^2)}{2 \sqrt{\textit{v}_*} \beta _{1} \hat{\beta }_1 K} \sum _{k=1}^K \alpha _k \sqrt{\tilde{\beta }_{2k}} + D(\varvec{\theta }^\star ) G \frac{\hat{\beta }_{1}}{\beta _{1}}\\&\quad + \hat{\beta }_{1} D(\varvec{\theta }^\star ) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) \\&\le \frac{d \tilde{D}(\varvec{\theta }^\star ) \sqrt{M}}{2 \beta _1 \sqrt{1-\beta _2}K^{1-a}} + \frac{(\sigma ^2 b^{-1} + G^2)}{2 \sqrt{\textit{v}_*} \beta _{1} \hat{\beta }_1 K} \sum _{k=1}^K \frac{1}{k^a} + D(\varvec{\theta }^\star ) G \frac{\hat{\beta }_{1}}{\beta _{1}}\\&\quad + \hat{\beta }_{1} D(\varvec{\theta }^\star ) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) , \end{aligned}$$
which, together with (42), implies that
$$\begin{aligned} \frac{1}{K}\sum _{k=1}^K \mathbb {E}\left[ (\varvec{\theta }_k - \varvec{\theta }^\star )^\top \varvec{m}_{k} \right]&\le \frac{d \tilde{D}(\varvec{\theta }^\star ) \sqrt{M}}{2 \beta _1 \sqrt{1-\beta _2}K^{1-a}} + \frac{(\sigma ^2 b^{-1} + G^2)}{2 (1-a) \sqrt{\textit{v}_*} \beta _{1} \hat{\beta }_1 K^a} + D(\varvec{\theta }^\star ) G \frac{\hat{\beta }_{1}}{\beta _{1}}\\&\quad + \hat{\beta }_{1} D(\varvec{\theta }^\star ) \left( B + \sqrt{\frac{\sigma ^2}{b} + G^2} \right) . \end{aligned}$$
An argument similar to the one above, together with Lemma 8, implies the second assertion in Theorem 5. \(\square \)
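The two \(K\)-dependent terms in the bound above decay as \(K^{-(1-a)}\) and \(K^{-a}\), so the overall rate is governed by \(K^{-\min \{a,\, 1-a\}}\) and is balanced at \(a = 1/2\). A small numerical illustration of this trade-off (constants dropped; not part of the proof):

```python
def rate(K, a):
    # sum of the two K-dependent decay factors, K^{-(1-a)} + K^{-a}
    return K ** (a - 1.0) + K ** (-a)

K = 10 ** 6
grid = [i / 10 for i in range(1, 10)]   # a in {0.1, ..., 0.9}
best = min(grid, key=lambda a: rate(K, a))  # balanced step-size exponent
```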
A.9 Proof of Theorem 6
Proof
The proofs of Theorems 4 and 5 lead to Theorem 6. \(\square \)