Abstract
Adaptive subgradient methods leverage second-order information about the functions to improve the regret and have become popular for online learning and optimization. According to the amount of information used, these methods can be divided into a diagonal-matrix version (ADA-DIAG) and a full-matrix version (ADA-FULL). In practice, ADA-DIAG is adopted far more often than ADA-FULL: although ADA-FULL attains smaller regret when the gradients are correlated, it is computationally intractable in high dimensions. In this paper, we employ matrix-approximation techniques to accelerate ADA-FULL and develop two methods based on random projections. Compared with ADA-FULL, at each iteration our methods reduce the space complexity from \(O(d^2)\) to \(O(\tau d)\) and the time complexity from \(O(d^3)\) to \(O(\tau ^2 d)\), where d is the dimensionality of the data and \(\tau \ll d\) is the number of random projections. Experimental results on online convex optimization and on training convolutional neural networks show that our methods are comparable to ADA-FULL and outperform other state-of-the-art algorithms, including ADA-DIAG.
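The complexity claim in the abstract can be illustrated with a minimal sketch. This is not the paper's exact algorithm; the function name `precondition` and the sizes `tau`, `alpha` are illustrative. The idea: a \(\tau \times d\) Gaussian projection of the gradient history needs only \(O(\tau d)\) memory, and applying the resulting preconditioner \((\alpha I_d + S^\top S)^{-1}\) via the Woodbury identity costs \(O(\tau^2 d)\) time instead of the \(O(d^3)\) of a direct \(d \times d\) solve.

```python
import numpy as np

def precondition(grad, S, alpha):
    """Apply (alpha*I_d + S^T S)^{-1} to grad in O(tau^2 d) time via the
    Woodbury identity, never forming a d x d matrix.
    S is the tau x d random-projection sketch of past (sub)gradients."""
    tau = S.shape[0]
    # (alpha*I_d + S^T S)^{-1} g = (g - S^T (alpha*I_tau + S S^T)^{-1} S g) / alpha
    small = alpha * np.eye(tau) + S @ S.T          # tau x tau system only
    return (grad - S.T @ np.linalg.solve(small, S @ grad)) / alpha

rng = np.random.default_rng(0)
d, tau, alpha = 200, 10, 0.1
G = rng.standard_normal((50, d))                   # gradient history (rows)
R = rng.standard_normal((tau, 50)) / np.sqrt(tau)  # Gaussian random projection
S = R @ G                                          # tau x d sketch: O(tau d) memory
g = rng.standard_normal(d)

fast = precondition(g, S, alpha)
exact = np.linalg.solve(alpha * np.eye(d) + S.T @ S, g)  # O(d^3) reference
assert np.allclose(fast, exact)
```

The `assert` confirms that the Woodbury route and the direct \(d \times d\) solve agree; only the \(\tau \times \tau\) system is ever factored.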
References
Achlioptas, D.: Database-friendly random projections: Johnson–Lindenstrauss with binary coins. J. Comput. Syst. Sci. 66(4), 671–687 (2003)
Allesiardo, R., Féraud, R., Maillard, O.A.: The non-stationary stochastic multi-armed bandit problem. Int. J. Data Sci. Anal. 3(4), 267–283 (2017)
Awerbuch, B., Kleinberg, R.: Online linear optimization and adaptive routing. J. Comput. Syst. Sci. 74(1), 97–114 (2008)
Boutsidis, C., Zouzias, A., Drineas, P.: Random projections for \(k\)-means clustering. In: Advances in Neural Information Processing Systems, vol. 23, pp. 298–306 (2010)
Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 1–27 (2011)
Dasgupta, S., Freund, Y.: Random projection trees and low dimensional manifolds. In: Proceedings of the 40th Annual ACM Symposium on Theory of Computing, pp. 537–546 (2008)
Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)
Duchi, J., Shalev-Shwartz, S., Singer, Y., Tewari, A.: Composite objective mirror descent. In: Proceedings of the 23rd Annual Conference on Learning Theory, pp. 14–26 (2010)
Fern, X.Z., Brodley, C.E.: Random projection for high dimensional data clustering: a cluster ensemble approach. In: Proceedings of the 20th International Conference on Machine Learning, pp. 186–193 (2003)
Fradkin, D., Madigan, D.: Experiments with random projections for machine learning. In: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 517–522 (2003)
Freund, Y., Dasgupta, S., Kabra, M., Verma, N.: Learning the structure of manifolds using random projections. In: Advances in Neural Information Processing Systems, vol. 21, pp. 473–480 (2008)
Gao, W., Jin, R., Zhu, S., Zhou, Z.H.: One-pass AUC optimization. In: Proceedings of the 30th International Conference on Machine Learning, pp. 906–914 (2013)
Ghashami, M., Liberty, E., Phillips, J.M., Woodruff, D.P.: Frequent directions: simple and deterministic matrix sketching. SIAM J. Comput. 45(5), 1762–1792 (2016)
Hager, W.W.: Updating the inverse of a matrix. SIAM Rev. 31(2), 221–239 (1989)
Hassani, M., Töws, D., Cuzzocrea, A., Seidl, T.: BFSPMiner: an effective and efficient batch-free algorithm for mining sequential patterns over data streams. Int. J. Data Sci. Anal. 1–17 (2017). https://doi.org/10.1007/s41060-017-0084-8
Hazan, E., Agarwal, A., Kale, S.: Logarithmic regret algorithms for online convex optimization. Mach. Learn. 69(2), 169–192 (2007)
Kaski, S.: Dimensionality reduction by random mapping: fast similarity computation for clustering. In: Proceedings of the 1998 IEEE International Joint Conference on Neural Networks, vol. 1, pp. 413–418 (1998)
Krizhevsky, A.: Learning multiple layers of features from tiny images. Technical report, University of Toronto (2009)
Krummenacher, G., McWilliams, B., Kilcher, Y., Buhmann, J.M., Meinshausen, N.: Scalable adaptive stochastic optimization using random projections. In: Advances in Neural Information Processing Systems, vol. 29, pp. 1750–1758 (2016)
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
Liberty, E., Ailon, N., Singer, A.: Dense fast random projections and lean Walsh transforms. Discrete Comput. Geom. 45(1), 34–44 (2011)
Luo, H., Agarwal, A., Cesa-Bianchi, N., Langford, J.: Efficient second order online learning by sketching. In: Advances in Neural Information Processing Systems, vol. 29, pp. 902–910 (2016)
Magen, A., Zouzias, A.: Low rank matrix-valued Chernoff bounds and approximate matrix multiplication. In: Proceedings of the 22nd Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1422–1436 (2011)
Maillard, O.A., Munos, R.: Linear regression with random projections. J. Mach. Learn. Res. 13, 2735–2772 (2012)
Miyaguchi, K., Yamanishi, K.: Online detection of continuous changes in stochastic processes. Int. J. Data Sci. Anal. 3(3), 213–229 (2017)
Halko, N., Martinsson, P.G., Tropp, J.A.: Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev. 53(2), 217–288 (2011)
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. In: NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011 (2011)
Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In: Advances in Neural Information Processing Systems, vol. 21, pp. 1177–1184 (2008)
Tropp, J.A.: An introduction to matrix concentration inequalities. Found. Trends Mach. Learn. 8(1–2), 1–230 (2015)
Wan, Y., Wei, N., Zhang, L.: Efficient adaptive online learning via frequent directions. In: Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 2748–2754 (2018)
Wan, Y., Zhang, L.: Accelerating adaptive online learning by matrix approximation. In: Proceedings of the 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 405–417 (2018)
Woodruff, D.P.: Sketching as a tool for numerical linear algebra. Found. Trends Mach. Learn. 10(1–2), 1–157 (2014)
Xiao, L.: Dual averaging method for regularized stochastic learning and online optimization. In: Advances in Neural Information Processing Systems, vol. 22, pp. 2116–2124 (2009)
Yenala, H., Jhanwar, A., Chinnakotla, M.K., Goyal, J.: Deep learning for detecting inappropriate content in text. Int. J. Data Sci. Anal. 6(4), 273–286 (2018)
Zhang, L., Mahdavi, M., Jin, R., Yang, T., Zhu, S.: Recovering the optimal solution by dual random projection. In: Proceedings of the 26th Annual Conference on Learning Theory, pp. 135–157 (2013)
Zhang, L., Yang, T., Jin, R., Xiao, Y., Zhou, Z.H.: Online stochastic linear optimization under one-bit feedback. In: Proceedings of the 33rd International Conference on Machine Learning, pp. 392–401 (2016)
Zinkevich, M.: Online convex programming and generalized infinitesimal gradient ascent. In: Proceedings of the 20th International Conference on Machine Learning, pp. 928–936 (2003)
Acknowledgements
This work was partially supported by the National Key R&D Program of China (2018YFB1004300), NSFC-NRF Joint Research Project (61861146001) and YESS (2017QNRC001).
This paper is an extended version of the PAKDD’2018 Long Presentation paper “Accelerating Adaptive Online Learning by Matrix Approximation” [31].
A Theoretical analysis
In this section, we provide proofs of Theorems 1 and 2.
1.1 Supporting results
The following results are used throughout our analysis.
Lemma 1
(Proposition 3 of [7]). Let sequence \(\{\varvec{\beta }_t\}\) be generated by ADA-DP. We have
Lemma 2
Let \({X}_t=\sum _{i=1}^t{\mathbf {x}}_i{\mathbf {x}}_i^\top \) and let \({A}^\dagger \) denote the pseudo-inverse of \({A}\). Then
Lemma 2 can be proved in the same way as Lemma 10 of [7].
Theorem 3
(Theorem 2.3 of [32]). Let \(0<\epsilon ,\delta <1\) and \({S}=\frac{1}{\sqrt{k}}{R}\in {\mathbb {R}}^{k\times n}\) where the entries of R are independent standard normal random variables. If \(k=\Theta (\frac{d+\log (1/\delta )}{\epsilon ^{2}})\), then for any fixed \(n\times d\) matrix A, with probability \(1-\delta \), simultaneously for all \({\mathbf {x}}\in {\mathbb {R}}^d\),
\((1-\epsilon )\Vert {A}{\mathbf {x}}\Vert _2\le \Vert {S}{A}{\mathbf {x}}\Vert _2\le (1+\epsilon )\Vert {A}{\mathbf {x}}\Vert _2.\)
Based on the above theorem, we derive the following corollary.
Corollary 1
Let \(0<\epsilon ,\delta <1\) and let each entry of \({\mathbf {r}}_t\in {\mathbb {R}}^{\tau }\) be a Gaussian random variable independently drawn from \({\mathcal {N}}(0,1/\tau )\). Then, if \(\tau =\Omega (\frac{r+\log (T/\delta )}{\epsilon ^{2}})\), with probability \(1-\delta \), simultaneously for all \(t=1,\ldots ,T\),
\((1-\epsilon ){C}_t^\top {C}_t\preceq {S}_t^\top {S}_t\preceq (1+\epsilon ){C}_t^\top {C}_t.\)
Theorem 4
(Theorem 10 of [35]). Let \({C}=\mathrm{diag}(c_1,\ldots ,c_p)\) and \({S}=\mathrm{diag}(s_1,\ldots ,s_p)\) be \(p\times p\) diagonal matrices, where \(c_i\ne 0\) and \(c_i^2+s_i^2=1\) for all i. Let \({R}\in {\mathbb {R}}^{p\times n}\) be a Gaussian random matrix. Let \({M}={C}^2+\frac{1}{n}{S}{R}{R}^\top {S}\) and \(r=\sum _is_i^2\).
where the constant c is at least \(1/32\), and q is the rank of S.
Based on the above theorem, we derive the following corollary.
Corollary 2
Let \(c\ge 1/32\), \(\alpha >0\), \(\sigma _{ti}^2=\lambda _i({C}_t^\top {C}_t)\), \({\tilde{r}}_t=\sum _i\frac{\sigma ^2_{ti}}{\alpha +\sigma ^2_{ti}}\), \({\tilde{r}}_*=\max \limits _{k\le t\le T}{\tilde{r}}_t\) and \({\sigma ^2_{*1}}=\max \limits _{1\le t\le T}{\sigma ^2_{t1}}\). Let \({K}_t = \alpha {I}_d+{C}_t^\top {C}_t\), \(\tilde{{K}}_t = \alpha {I}_d+{S}_t^\top {S}_t\) and \(\tilde{{I}}_t={K}_t^{-1/2}\tilde{{K}}_t{K}_t^{-1/2}\). If \(\tau \ge \frac{{\tilde{r}}_*{\sigma ^2_{*1}}}{c\epsilon ^2(\alpha +{\sigma ^2_{*1}})}\log \frac{2dT}{\delta }\), with probability at least \(1-\delta \), simultaneously for all \(t=1,\ldots ,T\),
1.2 Proof of Theorem 1
Let \(\widetilde{{X}}_t\) denote \({S}_t^\top {S}_t\). First, we consider bounding the first term in the upper bound of Lemma 1. With probability \(1-\delta \), we have
where the first inequality is due to Corollary 1.
Thus, we can get
Note that \(\varvec{\beta }_1={\mathbf {0}}\), then
where the inequality is due to Corollary 1.
Then, we consider the bound of \(\sum _{t=1}^{T}\Vert f^\prime _t(\varvec{\beta }_t)\Vert ^2_{\Psi _{t}^*}\). With probability \(1-\delta \), we have
where the inequality is due to Corollary 1. According to Lemma 2, we have
We complete the proof by substituting (3), (4) and (5) into Lemma 1.
1.3 Proof of Theorem 2
Following the proof of Theorem 1, we derive Theorem 2 by bounding each term in the upper bound of Lemma 1 in turn. Before that, we need lower and upper bounds on \(({S}_t^\top {S}_t)^{1/2}\), which follow from Corollary 2.
Let the SVD of \({C}_t^\top \) be \({C}_t^\top = {U}{\Sigma }{V}^\top \) where \({U}\in {\mathbb {R}}^{d\times d},{\Sigma }\in {\mathbb {R}}^{d\times d},{V}\in {\mathbb {R}}^{t\times d}\). Let \({K}_t = \alpha {I}_d+{C}_t^\top {C}_t\), \(\tilde{{K}}_t = \alpha {I}_d+{S}_t^\top {S}_t\) and \(\tilde{{I}}_t={K}_t^{-1/2}\tilde{{K}}_t{K}_t^{-1/2}\). According to Corollary 2, with probability at least \(1-\delta \), simultaneously for all \(t=1,\ldots ,T\),
and
Then simultaneously for all \(t=1,\ldots ,T\), we have
and
Then we consider bounding the first term in the upper bound of Lemma 1. Let \(\widetilde{{X}}_t\) denote \({S}_t^\top {S}_t\). Simultaneously for all \(t=1,\ldots ,T\), we have
where the first inequality is due to (6), (7) and the last inequality has been proved in the proof of Theorem 1.
Thus, we can get
Note that \(\varvec{\beta }_1={\mathbf {0}}\), then
Before considering the upper bound of \(\sum _{t=1}^{T}\Vert f^\prime _t(\varvec{\beta }_t)\Vert ^2_{\Psi _{t}^*}\), we need to derive the upper bound of \({H}_t^{-1}\).
Let the SVD of \({S}_t^\top \) be \({S}_t^\top = {U}{\Sigma }{V}^\top \) where \({U}\in {\mathbb {R}}^{d\times d},{\Sigma }\in {\mathbb {R}}^{d\times d},{V}\in {\mathbb {R}}^{t\times d}\). We also have, for all \(t=1,\ldots ,T\),
due to \(\sigma \ge \sqrt{\alpha }\ge \sqrt{\lambda _i({S}_t^\top {S}_t)+\alpha }-\sqrt{\lambda _i({S}_t^\top {S}_t)}\) for all \(i=1,\ldots ,d\).
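For completeness, the second inequality above is the elementary scalar bound \(\sqrt{\lambda +\alpha }-\sqrt{\lambda }\le \sqrt{\alpha }\) applied with \(\lambda =\lambda _i({S}_t^\top {S}_t)\); it can be verified by squaring:

```latex
\sqrt{\lambda+\alpha}-\sqrt{\lambda}\le\sqrt{\alpha}
\;\Longleftrightarrow\;
\lambda+\alpha\le\left(\sqrt{\lambda}+\sqrt{\alpha}\right)^{2}
=\lambda+2\sqrt{\lambda\alpha}+\alpha,
```

which holds for all \(\lambda ,\alpha \ge 0\) since \(\sqrt{\lambda \alpha }\ge 0\).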
Then according to Corollary 2, with probability at least \(1-\delta \), simultaneously for all \(t=1,\ldots ,T\),
Thus, we can get
According to Lemma 2, we have
We complete the proof by substituting (8), (9) and (10) into Lemma 1.
1.4 Proof of Corollary 1
Let \({C}_t={U}{\Sigma } {V}^\top \) be the singular value decomposition of \({C}_t\). Notice that \({U}\in {\mathbb {R}}^{t\times r}, {\Sigma } {V}^\top \in {\mathbb {R}}^{r\times d}\). According to Theorem 3, we have if \(\tau =\Theta (\frac{r+\log (1/\delta )}{\epsilon ^{2}})\), then simultaneously \(\forall {\mathbf {x}} \in {\mathbb {R}}^{r}\), with probability \(1-\delta \),
Let \({\mathbf {y}}\in {\mathbb {R}}^d\) be an arbitrary vector; then \({C}_t{\mathbf {y}}={U}{\Sigma }{V}^\top {\mathbf {y}}={U}{\mathbf {x}}\) where \({\mathbf {x}}={\Sigma }{V}^\top {\mathbf {y}}\in {\mathbb {R}}^{r}\).
Then we have
and
Then, we have \((1-\epsilon ){C}_t^\top {C}_t\preceq {S}_t^\top {S}_t\preceq (1+\epsilon ){C}_t^\top {C}_t\) with probability \(1-\delta \), provided \(\tau =\Omega (\frac{r+\log (1/\delta )}{\epsilon ^{2}})\). Using the union bound, if \(\tau =\Omega (\frac{r+\log (T/\delta )}{\epsilon ^{2}})\), then with probability \(1-\delta \), simultaneously for all \(t=1,\ldots ,T\),
\((1-\epsilon ){C}_t^\top {C}_t\preceq {S}_t^\top {S}_t\preceq (1+\epsilon ){C}_t^\top {C}_t.\)
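The spectral sandwich of Corollary 1 can be checked numerically. The sketch below uses illustrative sizes (the values of `t`, `d`, `r`, `tau`, `eps` are chosen for the experiment, not taken from the analysis): it draws a rank-\(r\) matrix \(C_t\), forms \(S_t = R_t C_t\) with entries of \(R_t\) drawn from \(\mathcal{N}(0,1/\tau)\), and verifies \((1-\epsilon )\Vert C_t{\mathbf {y}}\Vert ^2\le \Vert S_t{\mathbf {y}}\Vert ^2\le (1+\epsilon )\Vert C_t{\mathbf {y}}\Vert ^2\) on many random directions \({\mathbf {y}}\).

```python
import numpy as np

rng = np.random.default_rng(1)
t, d, r = 100, 50, 5          # history length, dimension, rank of C_t
tau, eps = 400, 0.5           # tau on the order of (r + log(1/delta)) / eps^2

# rank-r matrix C_t (its rows play the role of the gradient history)
C = rng.standard_normal((t, r)) @ rng.standard_normal((r, d))
R = rng.standard_normal((tau, t)) / np.sqrt(tau)   # entries ~ N(0, 1/tau)
S = R @ C                                          # sketch: S_t = R_t C_t

# check (1-eps)||C y||^2 <= ||S y||^2 <= (1+eps)||C y||^2 on random directions
Y = rng.standard_normal((d, 1000))
cy = np.sum((C @ Y) ** 2, axis=0)   # y^T C^T C y for each column y
sy = np.sum((S @ Y) ** 2, axis=0)   # y^T S^T S y for each column y
ok = bool(np.all(((1 - eps) * cy <= sy) & (sy <= (1 + eps) * cy)))
assert ok
```

Because \(C_t{\mathbf {y}}\) always lies in an \(r\)-dimensional subspace, a sketch size \(\tau \) scaling with \(r\) rather than \(d\) suffices for every direction to stay inside the \((1\pm \epsilon )\) band.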
1.5 Proof of Corollary 2
Define the SVD of \({C}_t^\top \) as \({C}_t^\top = {U}{\Sigma }{V}^\top \) where \({U}\in {\mathbb {R}}^{d\times d},{\Sigma }\in {\mathbb {R}}^{d\times d},{V}\in {\mathbb {R}}^{t\times d}\). Then we have \({K}_t = {U}(\alpha {I}_d+{\Sigma }{\Sigma }^\top ){U}^\top \) and
where \({R}={V}^\top {R}_t^\top \in {\mathbb {R}}^{d\times \tau }\) is a Gaussian random matrix, because the columns of V are orthonormal and \({R}_t^\top \) is a Gaussian random matrix. Let \(c_i^2=\frac{\alpha }{\alpha +{\sigma ^2_{ti}}}\) and \(s_i^2=\frac{{\sigma ^2_{ti}}}{\alpha +{\sigma ^2_{ti}}}\). Then according to Theorem 4, with probability at least \(1-\delta \),
provided \(\tau \ge \frac{{\tilde{r}}_t{\sigma ^2_{t1}}}{c\epsilon ^2(\alpha +{\sigma ^2_{t1}})}\log \frac{2d}{\delta }\), where the constant c is at least \(1/32\). Using the union bound, we complete the proof.
Wan, Y., Zhang, L. Accelerating adaptive online learning by matrix approximation. Int J Data Sci Anal 9, 389–400 (2020). https://doi.org/10.1007/s41060-019-00174-4