
Accelerating adaptive online learning by matrix approximation

  • Regular Paper
  • International Journal of Data Science and Analytics

Abstract

Adaptive subgradient methods are able to leverage second-order information of the functions to improve the regret and have become popular for online learning and optimization. According to the amount of information used, these methods can be divided into a diagonal-matrix version (ADA-DIAG) and a full-matrix version (ADA-FULL). In practice, ADA-DIAG is adopted far more often than ADA-FULL, because ADA-FULL is computationally intractable in high dimensions even though it attains smaller regret when gradients are correlated. In this paper, we propose to employ matrix approximation techniques to accelerate ADA-FULL and develop two methods based on random projections. Compared with ADA-FULL, at each iteration, our methods reduce the space complexity from \(O(d^2)\) to \(O(\tau d)\) and the time complexity from \(O(d^3)\) to \(O(\tau ^2 d)\), where d is the dimensionality of the data and \(\tau \ll d\) is the number of random projections. Experimental results on online convex optimization and on training convolutional neural networks show that our methods are comparable to ADA-FULL and outperform other state-of-the-art algorithms including ADA-DIAG.
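To make the stated savings concrete, the following NumPy sketch (our own simplified illustration, not the authors' released implementation; the helper sketched_ada_step and all parameter choices are hypothetical) maintains a \(\tau \times d\) random-projection sketch \(S_t=R_tC_t\) of the past gradients and applies the resulting preconditioner through a thin SVD, so each step uses \(O(\tau d)\) memory and \(O(\tau ^2 d)\) time instead of the \(O(d^2)\) and \(O(d^3)\) required by ADA-FULL.

import numpy as np

def sketched_ada_step(S, g, sigma, tau, rng):
    """One illustrative update (hypothetical helper, not the paper's exact algorithm).

    S     : (tau, d) sketch S_t = R_t C_t of the past (sub)gradients
    g     : (d,)     current (sub)gradient g_t
    sigma : regularizer so that H_t = sigma*I_d + (S_t^T S_t)^{1/2} is invertible
    Returns the updated sketch and the preconditioned direction H_t^{-1} g_t,
    using O(tau * d) memory and O(tau^2 * d) time per call.
    """
    # rank-one sketch update: project the new gradient with a fresh Gaussian vector
    r = rng.normal(0.0, 1.0 / np.sqrt(tau), size=tau)   # entries of r_t ~ N(0, 1/tau)
    S = S + np.outer(r, g)                               # O(tau * d)

    # thin SVD of the tau x d sketch costs O(tau^2 * d)
    _, s, Vt = np.linalg.svd(S, full_matrices=False)     # S = U diag(s) Vt
    # H_t = sigma*I_d + Vt^T diag(s) Vt, so H_t^{-1} g never needs a d x d matrix:
    coeff = Vt @ g                                       # coordinates of g in the sketched subspace
    direction = (g - Vt.T @ coeff) / sigma               # component orthogonal to the sketch
    direction += Vt.T @ (coeff / (sigma + s))            # component inside the sketched subspace
    return S, direction

# usage sketch: plain online gradient steps with the sketched preconditioner
rng = np.random.default_rng(0)
d, tau, sigma, eta = 1000, 20, 1e-3, 0.1
S, beta = np.zeros((tau, d)), np.zeros(d)
for _ in range(100):
    g = rng.normal(size=d)                               # stand-in for f'_t(beta_t)
    S, direction = sketched_ada_step(S, g, sigma, tau, rng)
    beta -= eta * direction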


Notes

  1. https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py.

References

  1. Achlioptas, D.: Database-friendly random projections: Johnson–Lindenstrauss with binary coins. J. Comput. Syst. Sci. 66(4), 671–687 (2003)

  2. Allesiardo, R., Féraud, R., Maillard, O.A.: The non-stationary stochastic multi-armed bandit problem. Int. J. Data Sci. Anal. 3(4), 267–283 (2017)

  3. Awerbuch, B., Kleinberg, R.: Online linear optimization and adaptive routing. J. Comput. Syst. Sci. 74(1), 97–114 (2008)

  4. Boutsidis, C., Zouzias, A., Drineas, P.: Random projections for \(k\)-means clustering. In: Advances in Neural Information Processing Systems, vol. 23, pp. 298–306 (2010)

  5. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 1–27 (2011)

  6. Dasgupta, S., Freund, Y.: Random projection trees and low dimensional manifolds. In: Proceedings of the 40th Annual ACM Symposium on Theory of computing, pp. 537–546 (2008)

  7. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)

  8. Duchi, J., Shalev-Shwartz, S., Singer, Y., Tewari, A.: Composite objective mirror descent. In: Proceedings of the 23rd Annual Conference on Learning Theory, pp. 14–26 (2010)

  9. Fern, X.Z., Brodley, C.E.: Random projection for high dimensional data clustering: a cluster ensemble approach. In: Proceedings of the 20th International Conference on Machine Learning, pp. 186–193 (2003)

  10. Fradkin, D., Madigan, D.: Experiments with random projections for machine learning. In: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 517–522 (2003)

  11. Freund, Y., Dasgupta, S., Kabra, M., Verma, N.: Learning the structure of manifolds using random projections. In: Advances in Neural Information Processing Systems, vol. 21, pp. 473–480 (2008)

  12. Gao, W., Jin, R., Zhu, S., Zhou, Z.H.: One-pass AUC optimization. In: Proceedings of the 30th International Conference on Machine Learning, pp. 906–914 (2013)

  13. Ghashami, M., Liberty, E., Phillips, J.M., Woodruff, D.P.: Frequent directions: simple and deterministic matrix sketching. SIAM J. Comput. 45(5), 1762–1792 (2016)

  14. Hager, W.W.: Updating the inverse of a matrix. SIAM Rev. 31(2), 221–239 (1989)

  15. Hassani, M., Töws, D., Cuzzocrea, A., Seidl, T.: BFSPMiner: an effective and efficient batch-free algorithm for mining sequential patterns over data streams. Int. J. Data Sci. Anal. 1–17 (2017). https://doi.org/10.1007/S41060-017-0084-8

  16. Hazan, E., Agarwal, A., Kale, S.: Logarithmic regret algorithms for online convex optimization. Mach. Learn. 69(2), 169–192 (2007)

  17. Kaski, S.: Dimensionality reduction by random mapping: fast similarity computation for clustering. In: Proceedings of the 1998 IEEE International Joint Conference on Neural Networks, vol. 1, pp. 413–418 (1998)

  18. Krizhevsky, A.: Learning multiple layers of features from tiny images. Technical report, University of Toronto (2009)

  19. Krummenacher, G., McWilliams, B., Kilcher, Y., Buhmann, J.M., Meinshausen, N.: Scalable adaptive stochastic optimization using random projections. In: Advances in Neural Information Processing Systems, vol. 29, pp. 1750–1758 (2016)

  20. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)

  21. Liberty, E., Ailon, N., Singer, A.: Dense fast random projections and lean Walsh transforms. Discrete Comput. Geom. 45(1), 34–44 (2011)

  22. Luo, H., Agarwal, A., Cesa-Bianchi, N., Langford, J.: Efficient second order online learning by sketching. In: Advances in Neural Information Processing Systems, vol. 29, pp. 902–910 (2016)

  23. Magen, A., Zouzias, A.: Low rank matrix-valued Chernoff bounds and approximate matrix multiplication. In: Proceedings of the 22nd Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1422–1436 (2011)

  24. Maillard, O.A., Munos, R.: Linear regression with random projections. J. Mach. Learn. Res. 13, 2735–2772 (2012)

  25. Miyaguchi, K., Yamanishi, K.: Online detection of continuous changes in stochastic processes. Int. J. Data Sci. Anal. 3(3), 213–229 (2017)

  26. Halko, N., Martinsson, P.G., Tropp, J.A.: Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev. 53(2), 217–288 (2011)

  27. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. In: NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011 (2011)

  28. Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In: Advances in Neural Information Processing Systems, vol. 21, pp. 1177–1184 (2008)

  29. Tropp, J.A.: An introduction to matrix concentration inequalities. Found. Trends Mach. Learn. 8(1–2), 1–230 (2015)

  30. Wan, Y., Wei, N., Zhang, L.: Efficient adaptive online learning via frequent directions. In: Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 2748–2754 (2018)

  31. Wan, Y., Zhang, L.: Accelerating adaptive online learning by matrix approximation. In: Proceedings of the 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 405–417 (2018)

  32. Woodruff, D.P.: Sketching as a tool for numerical linear algebra. Found. Trends Theor. Comput. Sci. 10(1–2), 1–157 (2014)

  33. Xiao, L.: Dual averaging method for regularized stochastic learning and online optimization. In: Advances in Neural Information Processing Systems, vol. 22, pp. 2116–2124 (2009)

  34. Yenala, H., Jhanwar, A., Chinnakotla, M.K., Goyal, J.: Deep learning for detecting inappropriate content in text. Int. J. Data Sci. Anal. 6(4), 273–286 (2018)

  35. Zhang, L., Mahdavi, M., Jin, R., Yang, T., Zhu, S.: Recovering the optimal solution by dual random projection. In: Proceedings of the 26th Annual Conference on Learning Theory, pp. 135–157 (2013)

  36. Zhang, L., Yang, T., Jin, R., Xiao, Y., Zhou, Z.H.: Online stochastic linear optimization under one-bit feedback. In: Proceedings of the 33rd International Conference on Machine Learning, pp. 392–401 (2016)

  37. Zinkevich, M.: Online convex programming and generalized infinitesimal gradient ascent. In: Proceedings of the 20th International Conference on Machine Learning, pp. 928–936 (2003)

Download references

Acknowledgements

This work was partially supported by the National Key R&D Program of China (2018YFB1004300), NSFC-NRF Joint Research Project (61861146001) and YESS (2017QNRC001).

Author information

Corresponding author

Correspondence to Lijun Zhang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This paper is an extended version of the PAKDD’2018 Long Presentation paper “Accelerating Adaptive Online Learning by Matrix Approximation” [31].

A Theoretical analysis

In this section, we provide the proofs of Theorems 1 and 2, together with the proofs of Corollaries 1 and 2.

1.1 Supporting results

The following results are used throughout our analysis.

Lemma 1

(Proposition 3 of [7]). Let sequence \(\{\varvec{\beta }_t\}\) be generated by ADA-DP. We have

$$\begin{aligned} R(T)&\le \frac{1}{\eta }\sum \limits _{t=1}^{T-1}\left[ B_{\Psi _{t+1}}(\varvec{\beta }^*,\varvec{\beta }_{t+1})-B_{\Psi _{t}}(\varvec{\beta }^*,\varvec{\beta }_{t+1})\right] \\&\quad +\frac{1}{\eta }B_{\Psi _1}(\varvec{\beta }^*,\varvec{\beta }_1)+\frac{\eta }{2}\sum \limits _{t=1}^{T}\Vert f^\prime _t(\varvec{\beta }_t)\Vert ^2_{\Psi _{t}^*}. \end{aligned}$$

Lemma 2

Let \({X}_t=\sum _{i=1}^t{\mathbf {x}}_i{\mathbf {x}}_i^\top \) and let \({A}^\dagger \) denote the pseudo-inverse of A. Then

$$\begin{aligned} \sum \limits _{t=1}^T\left\langle {\mathbf {x}}_t,({X}_t^{1/2})^{\dagger }{\mathbf {x}}_t\right\rangle&\le 2\sum \limits _{t=1}^T\left\langle {\mathbf {x}}_t,({X}_T^{1/2})^\dagger {\mathbf {x}}_t\right\rangle \\&=2\mathrm{tr}({X}_T^{1/2}). \end{aligned}$$

Lemma 2 can be proved in the same way as Lemma 10 of [7].
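Lemma 2 can also be checked numerically; the following NumPy snippet (illustrative only, not part of the analysis) verifies the inequality on random data.

import numpy as np

rng = np.random.default_rng(1)
d, T = 5, 50
xs = rng.normal(size=(T, d))

def psd_sqrt_pinv(X, tol=1e-10):
    # (X^{1/2})^dagger for a symmetric PSD matrix X, via eigendecomposition
    w, Q = np.linalg.eigh(X)
    inv = np.where(w > tol, 1.0 / np.sqrt(np.maximum(w, tol)), 0.0)
    return (Q * inv) @ Q.T

X = np.zeros((d, d))
lhs = 0.0
for t in range(T):
    X += np.outer(xs[t], xs[t])                      # X_t = sum_{i<=t} x_i x_i^T
    lhs += xs[t] @ psd_sqrt_pinv(X) @ xs[t]          # <x_t, (X_t^{1/2})^dagger x_t>

w, Q = np.linalg.eigh(X)
rhs = 2.0 * np.sum(np.sqrt(np.clip(w, 0.0, None)))   # 2 tr(X_T^{1/2})
assert lhs <= rhs + 1e-6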

Theorem 3

(Theorem 2.3 of [32]). Let \(0<\epsilon ,\delta <1\) and \({S}=\frac{1}{\sqrt{k}}{R}\in {\mathbb {R}}^{k\times n}\), where the entries of R are independent standard normal random variables. If \(k=\Theta (\frac{d+\log (1/\delta )}{\epsilon ^{2}})\), then for any fixed \(n\times d\) matrix A, with probability \(1-\delta \), simultaneously for all \({\mathbf {x}}\in {\mathbb {R}}^d\),

$$\begin{aligned} (1-\epsilon )\Vert {A}{\mathbf {x}}\Vert _2^2\le \Vert {S}{A}{\mathbf {x}}\Vert _2^2\le (1+\epsilon )\Vert {A}{\mathbf {x}}\Vert _2^2. \end{aligned}$$
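The subspace-embedding property of Theorem 3 is easy to observe empirically. The following snippet (illustrative only; the constant hidden in \(\Theta (\cdot )\) is assumed to be 1) draws a Gaussian sketching matrix and measures the distortion \(\Vert {S}{A}{\mathbf {x}}\Vert _2^2/\Vert {A}{\mathbf {x}}\Vert _2^2\) over random directions.

import numpy as np

rng = np.random.default_rng(2)
n, d, eps, delta = 2000, 10, 0.25, 0.01
k = int((d + np.log(1.0 / delta)) / eps**2)   # k = Theta((d + log(1/delta)) / eps^2), constant assumed 1

A = rng.normal(size=(n, d))
S = rng.normal(size=(k, n)) / np.sqrt(k)      # S = R / sqrt(k), R with standard normal entries
SA = S @ A

ratios = []
for _ in range(1000):
    x = rng.normal(size=d)
    ratios.append(np.sum((SA @ x) ** 2) / np.sum((A @ x) ** 2))
print(min(ratios), max(ratios))               # typically stays within [1 - eps, 1 + eps]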

Based on the above theorem, we derive the following corollary.

Corollary 1

Let \(0<\epsilon ,\delta <1\) and let each entry of \({\mathbf {r}}_t\in {\mathbb {R}}^{\tau }\) be a Gaussian random variable independently drawn from \({\mathcal {N}}(0,1/\tau )\). If \(\tau =\Omega (\frac{r+\log (T/\delta )}{\epsilon ^{2}})\), then with probability \(1-\delta \), simultaneously for all \(t=1,\ldots ,T\),

$$\begin{aligned} (1-\epsilon ){C}_t^\top {C}_t\preceq {S}_t^\top {S}_t\preceq (1+\epsilon ){C}_t^\top {C}_t. \end{aligned}$$

Theorem 4

(Theorem 10 of [35]). Let \({C}=\mathrm{diag}(c_1,\ldots ,c_p)\) and \({S}=\mathrm{diag}(s_1,\ldots ,s_p)\) be \(p\times p\) diagonal matrices, where \(c_i\ne 0\) and \(c_i^2+s_i^2=1\) for all i. Let \({R}\in {\mathbb {R}}^{p\times n}\) be a Gaussian random matrix. Let \({M}={C}^2+\frac{1}{n}{S}{R}{R}^\top {S}\) and \(r=\sum _is_i^2\). Then

$$\begin{aligned}&\Pr (\lambda _1({M})\ge 1+t)\le q\cdot \exp \left( -\frac{cnt^2}{\max _i(s_i^2)r}\right) ,\\&\Pr (\lambda _p({M})\le 1-t)\le q\cdot \exp \left( -\frac{cnt^2}{\max _i(s_i^2)r}\right) , \end{aligned}$$

where the constant c is at least 1/32, and q is the rank of S.

Based on the above theorem, we derive the following corollary.

Corollary 2

Let \(c\ge 1/32\), \(\alpha >0\), \(\sigma _{ti}^2=\lambda _i({C}_t^\top {C}_t)\), \({\tilde{r}}_t=\sum _i\frac{\sigma ^2_{ti}}{\alpha +\sigma ^2_{ti}}\), \({\tilde{r}}_*=\max \limits _{1\le t\le T}{\tilde{r}}_t\) and \({\sigma ^2_{*1}}=\max \limits _{1\le t\le T}{\sigma ^2_{t1}}\). Let \({K}_t = \alpha {I}_d+{C}_t^\top {C}_t\), \(\tilde{{K}}_t = \alpha {I}_d+{S}_t^\top {S}_t\) and \(\tilde{{I}}_t={K}_t^{-1/2}\tilde{{K}}_t{K}_t^{-1/2}\). If \(\tau \ge \frac{{\tilde{r}}_*{\sigma ^2_{*1}}}{c\epsilon ^2(\alpha +{\sigma ^2_{*1}})}\log \frac{2dT}{\delta }\), then with probability at least \(1-\delta \), simultaneously for all \(t=1,\ldots ,T\),

$$\begin{aligned} (1-\epsilon ){I}_d\preceq \tilde{{I}}_t\preceq (1+\epsilon ){I}_d. \end{aligned}$$
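The conclusion of Corollary 2 can also be observed numerically. In the following snippet (illustrative only; \(\tau \) and the spectrum of \({C}_t\) are chosen ad hoc rather than from the stated bound), the eigenvalues of \(\tilde{{I}}_t\) concentrate around 1 as \(\tau \) grows.

import numpy as np

rng = np.random.default_rng(3)
t, d, tau, alpha = 500, 50, 250, 1.0

C = rng.normal(size=(t, d)) * np.linspace(1.0, 0.05, d)   # C_t with a decaying column scale
R = rng.normal(0.0, 1.0 / np.sqrt(tau), size=(tau, t))    # R_t with entries ~ N(0, 1/tau)
S = R @ C                                                  # S_t = R_t C_t

K = alpha * np.eye(d) + C.T @ C                            # K_t
K_tilde = alpha * np.eye(d) + S.T @ S                      # tilde K_t

w, Q = np.linalg.eigh(K)
K_inv_sqrt = (Q / np.sqrt(w)) @ Q.T                        # K_t^{-1/2}
I_tilde = K_inv_sqrt @ K_tilde @ K_inv_sqrt                # tilde I_t
evals = np.linalg.eigvalsh(I_tilde)
print(evals.min(), evals.max())                            # both approach 1 as tau grows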

1.2 Proof of Theorem 1

Let \(\widetilde{{X}}_t\) denote \({S}_t^\top {S}_t\). First, we consider bounding the first term in the upper bound of Lemma 1. With probability \(1-\delta \), we have

$$\begin{aligned}&B_{\Psi _{t+1}}(\varvec{\beta }^*,\varvec{\beta }_{t+1})-B_{\Psi _{t}}(\varvec{\beta }^*,\varvec{\beta }_{t+1})\\&\quad =\frac{1}{2}\left\langle \varvec{\beta }^*-\varvec{\beta }_{t+1},(\widetilde{{X}}_{t+1}^{1/2}-\widetilde{{X}}_t^{1/2})(\varvec{\beta }^*-\varvec{\beta }_{t+1})\right\rangle \\&\quad \le \frac{1}{2}\left\langle \varvec{\beta }^*-\varvec{\beta }_{t+1},\sqrt{1+\epsilon }{X}_{t+1}^{1/2}(\varvec{\beta }^*-\varvec{\beta }_{t+1})\right\rangle \\&\qquad -\frac{1}{2}\left\langle \varvec{\beta }^*-\varvec{\beta }_{t+1},\sqrt{1-\epsilon }{X}_{t}^{1/2}(\varvec{\beta }^*-\varvec{\beta }_{t+1})\right\rangle \\&\quad \le \frac{1}{2}\left\langle \varvec{\beta }^*-\varvec{\beta }_{t+1},({X}_{t+1}^{1/2}-{X}_{t}^{1/2})(\varvec{\beta }^*-\varvec{\beta }_{t+1})\right\rangle \\&\qquad +\frac{\epsilon }{4}\left\langle \varvec{\beta }^*-\varvec{\beta }_{t+1},({X}_{t+1}^{1/2}+{X}_{t}^{1/2})(\varvec{\beta }^*-\varvec{\beta }_{t+1})\right\rangle \\&\quad \le \frac{1}{2}\Vert \varvec{\beta }^*-\varvec{\beta }_{t+1}\Vert _2^2\Vert ({X}_{t+1}^{1/2}-{X}_{t}^{1/2})\Vert \\&\qquad +\frac{\epsilon }{4}\left\langle \varvec{\beta }^*-\varvec{\beta }_{t+1},({X}_{t+1}^{1/2}+{X}_{t}^{1/2})(\varvec{\beta }^*-\varvec{\beta }_{t+1})\right\rangle \\&\quad \le \frac{1}{2}\Vert \varvec{\beta }^*-\varvec{\beta }_{t+1}\Vert _2^2\mathrm{tr}({X}_{t+1}^{1/2}-{X}_{t}^{1/2})\\&\qquad +\frac{\epsilon }{4}\left\langle \varvec{\beta }^*-\varvec{\beta }_{t+1},({X}_{t+1}^{1/2}+{X}_{t}^{1/2})(\varvec{\beta }^*-\varvec{\beta }_{t+1})\right\rangle \end{aligned}$$

where the first inequality is due to Corollary 1.

Thus, we can get

$$\begin{aligned} \begin{aligned}&\sum \limits _{t=1}^{T-1}\left[ B_{\Psi _{t+1}}(\varvec{\beta }^*,\varvec{\beta }_{t+1})-B_{\Psi _{t}}(\varvec{\beta }^*,\varvec{\beta }_{t+1})\right] \\&\quad \le \frac{1}{2}\sum \limits _{t=1}^{T-1}\Vert \varvec{\beta }^*-\varvec{\beta }_{t+1}\Vert _2^2\mathrm{tr}({X}_{t+1}^{1/2}-{X}_{t}^{1/2})\\&\qquad +\frac{\epsilon }{4}\sum \limits _{t=1}^{T-1}\left\langle \varvec{\beta }^*-\varvec{\beta }_{t+1},({X}_{t+1}^{1/2}+{X}_{t}^{1/2})(\varvec{\beta }^*-\varvec{\beta }_{t+1})\right\rangle \\&\quad \le \frac{1}{2}\max \limits _{t\le T}\Vert \varvec{\beta }^*-\varvec{\beta }_{t}\Vert _2^2\mathrm{tr}({X}_{T}^{1/2})-\frac{1}{2}\Vert \varvec{\beta }^*-\varvec{\beta }_{1}\Vert _2^2\mathrm{tr}({X}_{1}^{1/2})\\&\qquad +\frac{\epsilon }{2}\max \limits _{t\le T}\Vert \varvec{\beta }^*-\varvec{\beta }_{t}\Vert _2^2\sum \limits _{t=1}^{T}\Vert {X}_{t}^{1/2}\Vert \\&\qquad -\frac{\epsilon }{4}\Vert \varvec{\beta }^*-\varvec{\beta }_{1}\Vert _2^2\mathrm{tr}({X}_{1}^{1/2}). \end{aligned} \end{aligned}$$
(3)

Since \(\varvec{\beta }_1={\mathbf {0}}\), we have

$$\begin{aligned} \begin{aligned} B_{\Psi _1}(\varvec{\beta }^*,\varvec{\beta }_1)&=\frac{1}{2}\left\langle \varvec{\beta }^*,(\sigma {I}_d+\widetilde{{X}}_1^{1/2})\varvec{\beta }^*\right\rangle \\&\le \frac{1}{2}\sigma \Vert \varvec{\beta }^{*}\Vert _2^2+\frac{2+\epsilon }{4}\Vert \varvec{\beta }^{*}\Vert _2^2\mathrm{tr}({X}_{1}^{1/2}) \end{aligned} \end{aligned}$$
(4)

where the inequality is due to Corollary 1.

Next, we bound \(\sum _{t=1}^{T}\Vert f^\prime _t(\varvec{\beta }_t)\Vert ^2_{\Psi _{t}^*}\). With probability \(1-\delta \), we have

$$\begin{aligned}&\frac{1}{2}\Vert f^\prime _t(\varvec{\beta }_t)\Vert ^2_{\Psi _{t}^*}=\left\langle {\mathbf {g}}_t,(\sigma {I}_d+\widetilde{{X}}_t^{1/2})^{-1}{\mathbf {g}}_t\right\rangle \\&\quad \le \frac{1}{\sqrt{1-\epsilon }}\left\langle {\mathbf {g}}_t,({X}_{t}^{\dagger })^{1/2}{\mathbf {g}}_t\right\rangle =\frac{l^\prime (\varvec{\beta }_t^\top {\mathbf {x}}_t)^2}{\sqrt{1-\epsilon }}\left\langle {\mathbf {x}}_t,({X}_{t}^{\dagger })^{1/2}{\mathbf {x}}_t\right\rangle \end{aligned}$$

where the inequality is due to Corollary 1. According to Lemma 2, we have

$$\begin{aligned} \begin{aligned}&\sum \limits _{t=1}^{T}\Vert f^\prime _t(\varvec{\beta }_t)\Vert ^2_{\Psi _{t}^*}\\&\quad \le \sum \limits _{t=1}^{T}\frac{2l^\prime (\varvec{\beta }_t^\top {\mathbf {x}}_t)^2}{\sqrt{1-\epsilon }}\left\langle {\mathbf {x}}_t,({X}_{t}^{\dagger })^{1/2}{\mathbf {x}}_t\right\rangle \\&\quad \le \max \limits _{t\le T}l^\prime (\varvec{\beta }_t^\top {\mathbf {x}}_t)^2\frac{2}{\sqrt{1-\epsilon }}\sum \limits _{t=1}^{T}\left\langle {\mathbf {x}}_t,({X}_{t}^{\dagger })^{1/2}{\mathbf {x}}_t\right\rangle \\&\quad \le \frac{4}{\sqrt{1-\epsilon }}\max \limits _{t\le T}l^\prime (\varvec{\beta }_t^\top {\mathbf {x}}_t)^2\mathrm{tr}({X}_T^{1/2}). \end{aligned} \end{aligned}$$
(5)

We complete the proof by substituting (3), (4) and (5) into Lemma 1.

1.3 Proof of Theorem 2

Following the proof of Theorem 1, we derive Theorem 2 by bounding each term in the upper bound of Lemma 1 separately. Before doing so, we need lower and upper bounds on \(({S}_t^\top {S}_t)^{1/2}\), which we derive from Corollary 2.

Let the SVD of \({C}_t^\top \) be \({C}_t^\top = {U}{\Sigma }{V}^\top \) where \({U}\in {\mathbb {R}}^{d\times d},{\Sigma }\in {\mathbb {R}}^{d\times d},{V}\in {\mathbb {R}}^{t\times d}\). Let \({K}_t = \alpha {I}_d+{C}_t^\top {C}_t\), \(\tilde{{K}}_t = \alpha {I}_d+{S}_t^\top {S}_t\) and \(\tilde{{I}}_t={K}_t^{-1/2}\tilde{{K}}_t{K}_t^{-1/2}\). According to Corollary 2, with probability at least \(1-\delta \), simultaneously for all \(t=1,\ldots ,T\),

$$\begin{aligned} {S}_t^\top {S}_t&=\tilde{{K}}_t - \alpha {I}_d={K}_t^{1/2}\tilde{{I}}_t {K}_t^{1/2}- \alpha {I}_d\\&\preceq (1+\epsilon ){K}_t-\alpha {I}_d=(1+\epsilon ){C}_t^\top {C}_t+\epsilon \alpha {I}_d\\&={U}\big ((1+\epsilon ){\Sigma }{\Sigma }+\epsilon \alpha {I}_d\big ){U}^\top \end{aligned}$$

and

$$\begin{aligned} {S}_t^\top {S}_t+\epsilon \alpha {I}_d&=\tilde{{K}}_t - \alpha {I}_d+\epsilon \alpha {I}_d\\&={K}_t^{1/2}\tilde{{I}}_t{K}_t^{1/2}- \alpha {I}_d+\epsilon \alpha {I}_d\\&\succeq (1-\epsilon ){K}_t-\alpha {I}_d+\epsilon \alpha {I}_d\\&=(1-\epsilon ){C}_t^\top {C}_t. \end{aligned}$$

Then simultaneously for all \(t=1,\ldots ,T\), we have

$$\begin{aligned} \begin{aligned} ({S}_t^\top {S}_t)^{1/2}&\preceq \sqrt{1+\epsilon }{U}({\Sigma }{\Sigma })^{1/2}{U}^\top +\sqrt{\epsilon \alpha }{U}{I}_d{U}^\top \\&=\sqrt{1+\epsilon }{X}_{t}^{1/2}+\sqrt{\epsilon \alpha }{I}_d \end{aligned} \end{aligned}$$
(6)

and, since the matrix square root is operator monotone and satisfies \(({A}+{B})^{1/2}\preceq {A}^{1/2}+{B}^{1/2}\) for positive semidefinite matrices A and B,

$$\begin{aligned} \begin{aligned} ({S}_t^\top {S}_t)^{1/2}&=({S}_t^\top {S}_t)^{1/2}+\sqrt{\epsilon \alpha }{I}_d-\sqrt{\epsilon \alpha }{I}_d\\&\succeq \big (({S}_t^\top {S}_t)+\epsilon \alpha {I}_d\big )^{1/2}-\sqrt{\epsilon \alpha }{I}_d\\&\succeq \sqrt{1-\epsilon }{X}_{t}^{1/2}-\sqrt{\epsilon \alpha }{I}_d. \end{aligned} \end{aligned}$$
(7)

Then we consider bounding the first term in the upper bound of Lemma 1. Let \(\widetilde{{X}}_t\) denote \({S}_t^\top {S}_t\). Simultaneously for all \(t=1,\ldots ,T\), we have

$$\begin{aligned}&B_{\Psi _{t+1}}(\varvec{\beta }^*,\varvec{\beta }_{t+1})-B_{\Psi _{t}}(\varvec{\beta }^*,\varvec{\beta }_{t+1})\\&\quad =\frac{1}{2}\left\langle \varvec{\beta }^*-\varvec{\beta }_{t+1},(\widetilde{{X}}_{t+1}^{1/2}-\widetilde{{X}}_t^{1/2})(\varvec{\beta }^*-\varvec{\beta }_{t+1})\right\rangle \\&\quad \le \frac{1}{2}\left\langle \varvec{\beta }^*-\varvec{\beta }_{t+1},\sqrt{1+\epsilon }{X}_{t+1}^{1/2}(\varvec{\beta }^*-\varvec{\beta }_{t+1})\right\rangle \\&\qquad -\frac{1}{2}\left\langle \varvec{\beta }^*-\varvec{\beta }_{t+1},\sqrt{1-\epsilon }{X}_{t}^{1/2}(\varvec{\beta }^*-\varvec{\beta }_{t+1})\right\rangle \\&\qquad +\frac{1}{2}\left\langle \varvec{\beta }^*-\varvec{\beta }_{t+1},2\sqrt{\epsilon \alpha }{I}_{d}(\varvec{\beta }^*-\varvec{\beta }_{t+1})\right\rangle \\&\quad =\frac{1}{2}\left\langle \varvec{\beta }^*-\varvec{\beta }_{t+1},\sqrt{1+\epsilon }{X}_{t+1}^{1/2}(\varvec{\beta }^*-\varvec{\beta }_{t+1})\right\rangle \\&\qquad -\frac{1}{2}\left\langle \varvec{\beta }^*-\varvec{\beta }_{t+1},\sqrt{1-\epsilon }{X}_{t}^{1/2}(\varvec{\beta }^*-\varvec{\beta }_{t+1})\right\rangle \\&\qquad +\sqrt{\epsilon \alpha }\Vert \varvec{\beta }^*-\varvec{\beta }_{t+1}\Vert _2^2\\&\quad \le \frac{1}{2}\Vert \varvec{\beta }^*-\varvec{\beta }_{t+1}\Vert _2^2\mathrm{tr}({X}_{t+1}^{1/2}-{X}_{t}^{1/2})\\&\qquad +\frac{\epsilon }{4}\left\langle \varvec{\beta }^*-\varvec{\beta }_{t+1},({X}_{t+1}^{1/2}+{X}_{t}^{1/2})(\varvec{\beta }^*-\varvec{\beta }_{t+1})\right\rangle \\&\qquad +\sqrt{\epsilon \alpha }\Vert \varvec{\beta }^*-\varvec{\beta }_{t+1}\Vert _2^2 \end{aligned}$$

where the first inequality is due to (6) and (7), and the last inequality was established in the proof of Theorem 1.

Thus, we can get

$$\begin{aligned} \begin{aligned}&\sum \limits _{t=1}^{T-1}\left[ B_{\Psi _{t+1}}(\varvec{\beta }^*,\varvec{\beta }_{t+1})-B_{\Psi _{t}}(\varvec{\beta }^*,\varvec{\beta }_{t+1})\right] \\&\quad \le \frac{1}{2}\max \limits _{t\le T}\Vert \varvec{\beta }^*-\varvec{\beta }_{t}\Vert _2^2\mathrm{tr}({X}_{T}^{1/2})-\frac{1}{2}\Vert \varvec{\beta }^*-\varvec{\beta }_{1}\Vert _2^2\mathrm{tr}({X}_{1}^{1/2})\\&\qquad +\frac{\epsilon }{2}\max \limits _{t\le T}\Vert \varvec{\beta }^*-\varvec{\beta }_{t}\Vert _2^2\sum \limits _{t=1}^{T}\Vert {X}_{t}^{1/2}\Vert \\&\qquad -\frac{\epsilon }{4}\Vert \varvec{\beta }^*-\varvec{\beta }_{1}\Vert _2^2\mathrm{tr}({X}_{1}^{1/2})\\&\qquad +\sqrt{\epsilon \alpha }(T-1)\max \limits _{t\le T}\Vert \varvec{\beta }^*-\varvec{\beta }_{t}\Vert _2^2. \end{aligned} \end{aligned}$$
(8)

Since \(\varvec{\beta }_1={\mathbf {0}}\), we have

$$\begin{aligned} \begin{aligned} B_{\Psi _1}(\varvec{\beta }^*,\varvec{\beta }_1)&=\frac{1}{2}\left\langle \varvec{\beta }^*,(\sigma {I}_d+\widetilde{{X}}_1^{1/2})\varvec{\beta }^*\right\rangle \\&\le \frac{1}{2}\sigma \Vert \varvec{\beta }^{*}\Vert _2^2+\frac{2+\epsilon }{4}\Vert \varvec{\beta }^{*}\Vert _2^2\mathrm{tr}({X}_{1}^{1/2})\\&\quad +\frac{1}{2}\sqrt{\epsilon \alpha }\Vert \varvec{\beta }^*\Vert _2^2. \end{aligned} \end{aligned}$$
(9)

Before considering the upper bound of \(\sum _{t=1}^{T}\Vert f^\prime _t(\varvec{\beta }_t)\Vert ^2_{\Psi _{t}^*}\), we need to derive the upper bound of \({H}_t^{-1}\).

Let the SVD of \({S}_t^\top \) be \({S}_t^\top = {U}{\Sigma }{V}^\top \) where \({U}\in {\mathbb {R}}^{d\times d},{\Sigma }\in {\mathbb {R}}^{d\times d},{V}\in {\mathbb {R}}^{\tau \times d}\). We also have, for all \(t=1,\ldots ,T\),

$$\begin{aligned} {H}_t&=\sigma {I}_d+({S}^\top _{t}{S}_{t})^{1/2}={U}\big (\sigma {I}_d+({\Sigma }{\Sigma })^{1/2}\big ){U}^\top \\&\succeq {U}\big (\alpha {I}_d+({\Sigma }{\Sigma })\big )^{1/2}{U}^\top =(\alpha {I}_d+{S}^\top _{t}{S}_{t})^{1/2} \end{aligned}$$

due to \(\sigma \ge \sqrt{\alpha }\ge \sqrt{\lambda _i({S}_t^\top {S}_t)+\alpha }-\sqrt{\lambda _i({S}_t^\top {S}_t)}\) for all \(i=1,\ldots ,d\).

Then according to Corollary 2, with probability at least \(1-\delta \), simultaneously for all \(t=1,\ldots ,T\),

$$\begin{aligned} {H}_t^{-1}&\preceq \big ((\alpha {I}_d+{S}^\top _{t}{S}_{t})^{1/2}\big )^{-1}=\big (({K}_t^{1/2}\tilde{{I}}_t{K}_t^{1/2})^{-1}\big )^{1/2}\\&\preceq \frac{1}{\sqrt{1-\epsilon }}({K}_t^{-1})^{1/2}=\frac{1}{\sqrt{1-\epsilon }}\big ((\alpha {I}_d+{X}_{t})^{-1}\big )^{1/2}. \end{aligned}$$

Thus, we can get

$$\begin{aligned} \Vert f^\prime _t(\varvec{\beta }_t)\Vert ^2_{\Psi _{t}^*}&=2\left\langle {\mathbf {g}}_t,{H}_t^{-1}{\mathbf {g}}_t\right\rangle \\&\le \frac{2}{\sqrt{1-\epsilon }}\left\langle {\mathbf {g}}_t,\big ((\alpha {I}_d+{X}_{t})^{-1}\big )^{1/2}{\mathbf {g}}_t\right\rangle \\&=\frac{2l^\prime (\varvec{\beta }_t^\top {\mathbf {x}}_t)^2}{\sqrt{1-\epsilon }}\left\langle {\mathbf {x}}_t,({X}_{t}^{\dagger })^{1/2}{\mathbf {x}}_t\right\rangle . \end{aligned}$$

According to Lemma 2, we have

$$\begin{aligned} \begin{aligned}&\sum \limits _{t=1}^{T}\Vert f^\prime _t(\varvec{\beta }_t)\Vert ^2_{\Psi _{t}^*}\\&\quad \le \frac{2}{\sqrt{1-\epsilon }}\max \limits _{t\le T}l^\prime (\varvec{\beta }_t^\top {\mathbf {x}}_t)^2\sum \limits _{t=1}^{T}\left\langle {\mathbf {x}}_t,({X}_{t}^{\dagger })^{1/2}{\mathbf {x}}_t\right\rangle \\&\quad \le \frac{4}{\sqrt{1-\epsilon }}\max \limits _{t\le T}l^\prime (\varvec{\beta }_t^\top {\mathbf {x}}_t)^2\mathrm{tr}({X}_T^{1/2}). \end{aligned} \end{aligned}$$
(10)

We complete the proof by substituting (8), (9) and (10) into Lemma 1.

1.4 Proof of Corollary 1

Let \({C}_t={U}{\Sigma } {V}^\top \) be the singular value decomposition of \({C}_t\), where \({U}\in {\mathbb {R}}^{t\times r}\) and \({\Sigma } {V}^\top \in {\mathbb {R}}^{r\times d}\). According to Theorem 3, if \(\tau =\Theta (\frac{r+\log (1/\delta )}{\epsilon ^{2}})\), then with probability \(1-\delta \), simultaneously for all \({\mathbf {x}} \in {\mathbb {R}}^{r}\),

$$\begin{aligned} (1-\epsilon )\Vert {U}{\mathbf {x}}\Vert _2^2\le \Vert {R_t}{U}{\mathbf {x}}\Vert _2^2\le (1+\epsilon )\Vert {U}{\mathbf {x}}\Vert _2^2 \end{aligned}$$

Let \({\mathbf {y}}\in {\mathbb {R}}^d\) be an arbitrary vector; then \({C}_t{\mathbf {y}}={U}{\Sigma }{V}^\top {\mathbf {y}}={U}{\mathbf {x}}\), where \({\mathbf {x}}={\Sigma }{V}^\top {\mathbf {y}}\in {\mathbb {R}}^{r}\).

Then we have

$$\begin{aligned} {\mathbf {y}}^\top {S}_t^\top {S}_t{\mathbf {y}}&={\mathbf {y}}^\top {C}_t^\top {R}^\top _t{R}_t{C}_t{\mathbf {y}}=\Vert {R_t}{U}{\mathbf {x}}\Vert _2^2\\&\le (1+\epsilon )\Vert {U}{\mathbf {x}}\Vert _2^2=(1+\epsilon ){\mathbf {y}}^\top {C}_t^\top {C}_t{\mathbf {y}} \end{aligned}$$

and

$$\begin{aligned} {\mathbf {y}}^\top {S}_t^\top {S}_t{\mathbf {y}}&={\mathbf {y}}^\top {C}_t^\top {R}^\top _t{R}_t{C}_t{\mathbf {y}}=\Vert {R_t}{U}{\mathbf {x}}\Vert _2^2\\&\ge (1-\epsilon )\Vert {U}{\mathbf {x}}\Vert _2^2=(1-\epsilon ){\mathbf {y}}^\top {C}_t^\top {C}_t{\mathbf {y}}. \end{aligned}$$

Then, we have \((1-\epsilon ){C}_t^\top {C}_t\preceq {S}_t^\top {S}_t\preceq (1+\epsilon ){C}_t^\top {C}_t\) with probability \(1-\delta \), provided \(\tau =\Omega (\frac{r+\log (1/\delta )}{\epsilon ^{2}})\). Using the union bound, we have if \(\tau =\Omega (\frac{r+\log (T/\delta )}{\epsilon ^{2}})\), with probability \(1-\delta \), simultaneously for all \(t=1,\ldots ,T\),

$$\begin{aligned} (1-\epsilon ){C}_t^\top {C}_t\preceq {S}_t^\top {S}_t\preceq (1+\epsilon ){C}_t^\top {C}_t. \end{aligned}$$
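The sandwich relation just established can also be checked numerically by following the same SVD argument: it holds exactly when every squared singular value of \({R}_t{U}\) lies in \([1-\epsilon ,1+\epsilon ]\). A short NumPy check (illustrative only; r and \(\tau \) are chosen ad hoc rather than from the stated bound):

import numpy as np

rng = np.random.default_rng(4)
t, d, r, tau = 400, 60, 15, 300

C = rng.normal(size=(t, r)) @ rng.normal(size=(r, d))      # C_t of rank r
U = np.linalg.svd(C, full_matrices=False)[0][:, :r]        # left singular vectors, U in R^{t x r}

R = rng.normal(0.0, 1.0 / np.sqrt(tau), size=(tau, t))     # R_t with entries ~ N(0, 1/tau)
svals = np.linalg.svd(R @ U, compute_uv=False)

# (1 - eps) C^T C <= S^T S <= (1 + eps) C^T C holds iff all squared singular values of R_t U
# lie in [1 - eps, 1 + eps]; they concentrate around 1 as tau grows relative to r.
print(svals.min() ** 2, svals.max() ** 2)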

1.5 Proof of Corollary 2

Define the SVD of \({C}_t^\top \) as \({C}_t^\top = {U}{\Sigma }{V}^\top \) where \({U}\in {\mathbb {R}}^{d\times d},{\Sigma }\in {\mathbb {R}}^{d\times d},{V}\in {\mathbb {R}}^{t\times d}\). Then we have \({K}_t = {U}(\alpha {I}_d+{\Sigma }{\Sigma }^\top ){U}^\top \) and

$$\begin{aligned} \tilde{{I}}_t&={K}_t^{-1/2}\tilde{{K}}_t{K}_t^{-1/2}={K}_t^{-1/2}(\alpha {I}_d+{C}_t^\top {R}_t^\top {R}_t{C}_t){K}_t^{-1/2}\\&={U}\Big ((\alpha {I}_d+{\Sigma }{\Sigma })^{-1/2}{\Sigma }{V}^\top {R}_t^\top {R}_t{V}{\Sigma }(\alpha {I}_d+{\Sigma }{\Sigma }^\top )^{-1/2}\\&\quad +\alpha {I}_d(\alpha {I}_d+{\Sigma }{\Sigma })^{-1}\Big ){U}^\top \\&={U}\Big ((\alpha {I}_d+{\Sigma }{\Sigma })^{-1/2}{\Sigma }{R}{R}^\top {\Sigma }(\alpha {I}_d+{\Sigma }{\Sigma }^\top )^{-1/2}\\&\quad +\alpha {I}_d(\alpha {I}_d+{\Sigma }{\Sigma })^{-1}\Big ){U}^\top \end{aligned}$$

where \({R}={V}^\top {R}_t^\top \in {\mathbb {R}}^{d\times \tau }\) is a Gaussian random matrix, because V is an orthogonal matrix and \({R}_t^\top \) is a Gaussian random matrix. Let \(c_i^2=\frac{\alpha }{\alpha +{\sigma ^2_{ti}}}\) and \(s_i^2=\frac{{\sigma ^2_{ti}}}{\alpha +{\sigma ^2_{ti}}}\). Then according to Theorem 4, with probability at least \(1-\delta \),

$$\begin{aligned} (1-\epsilon ){I}_d\preceq \tilde{{I}}_t\preceq (1+\epsilon ){I}_d \end{aligned}$$

provided \(\tau \ge \frac{{\tilde{r}}_t{\sigma ^2_{t1}}}{c\epsilon ^2(\alpha +{\sigma ^2_{t1}})}\log \frac{2d}{\delta }\), where the constant c is at least 1/32. Applying the union bound over \(t=1,\ldots ,T\) completes the proof.

About this article

Cite this article

Wan, Y., Zhang, L. Accelerating adaptive online learning by matrix approximation. Int J Data Sci Anal 9, 389–400 (2020). https://doi.org/10.1007/s41060-019-00174-4
