Abstract
Adaptive subgradient methods leverage second-order information about the functions to improve the regret and have become popular for online learning and optimization. According to the amount of information used, these methods can be divided into a diagonal-matrix version (ADA-DIAG) and a full-matrix version (ADA-FULL). In practice, ADA-DIAG is adopted far more often than ADA-FULL: although ADA-FULL attains smaller regret when the gradients are correlated, it is computationally intractable in high dimensions. In this paper, we employ matrix-approximation techniques to accelerate ADA-FULL and develop two methods based on random projections. Compared with ADA-FULL, at each iteration our methods reduce the space complexity from \(O(d^2)\) to \(O(\tau d)\) and the time complexity from \(O(d^3)\) to \(O(\tau ^2 d)\), where d is the dimensionality of the data and \(\tau \ll d\) is the number of random projections. Experimental results on online convex optimization and on training convolutional neural networks show that our methods are comparable to ADA-FULL and outperform other state-of-the-art algorithms, including ADA-DIAG.
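The complexity claim in the abstract can be illustrated with a minimal sketch. This is not the paper's exact algorithm; the function name `precondition` and the sizes `tau`, `alpha` are illustrative. The idea: a \(\tau \times d\) Gaussian projection of the gradient history needs only \(O(\tau d)\) memory, and applying the resulting preconditioner \((\alpha I_d + S^\top S)^{-1}\) via the Woodbury identity costs \(O(\tau^2 d)\) time instead of the \(O(d^3)\) of a direct \(d \times d\) solve.

```python
import numpy as np

def precondition(grad, S, alpha):
    """Apply (alpha*I_d + S^T S)^{-1} to grad in O(tau^2 d) time via the
    Woodbury identity, never forming a d x d matrix.
    S is the tau x d random-projection sketch of past (sub)gradients."""
    tau = S.shape[0]
    # (alpha*I_d + S^T S)^{-1} g = (g - S^T (alpha*I_tau + S S^T)^{-1} S g) / alpha
    small = alpha * np.eye(tau) + S @ S.T          # tau x tau system only
    return (grad - S.T @ np.linalg.solve(small, S @ grad)) / alpha

rng = np.random.default_rng(0)
d, tau, alpha = 200, 10, 0.1
G = rng.standard_normal((50, d))                   # gradient history (rows)
R = rng.standard_normal((tau, 50)) / np.sqrt(tau)  # Gaussian random projection
S = R @ G                                          # tau x d sketch: O(tau d) memory
g = rng.standard_normal(d)

fast = precondition(g, S, alpha)
exact = np.linalg.solve(alpha * np.eye(d) + S.T @ S, g)  # O(d^3) reference
assert np.allclose(fast, exact)
```

The `assert` confirms that the Woodbury route and the direct \(d \times d\) solve agree; only the \(\tau \times \tau\) system is ever factored.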
References
Achlioptas, D.: Database-friendly random projections: Johnson–Lindenstrauss with binary coins. J. Comput. Syst. Sci. 66(4), 671–687 (2003)
Allesiardo, R., Féraud, R., Maillard, O.A.: The non-stationary stochastic multi-armed bandit problem. Int. J. Data Sci. Anal. 3(4), 267–283 (2017)
Awerbuch, B., Kleinberg, R.: Online linear optimization and adaptive routing. J. Comput. Syst. Sci. 74(1), 97–114 (2008)
Boutsidis, C., Zouzias, A., Drineas, P.: Random projections for \(k\)-means clustering. In: Advances in Neural Information Processing Systems, vol. 23, pp. 298–306 (2010)
Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 1–27 (2011)
Dasgupta, S., Freund, Y.: Random projection trees and low dimensional manifolds. In: Proceedings of the 40th Annual ACM Symposium on Theory of Computing, pp. 537–546 (2008)
Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)
Duchi, J., Shalev-Shwartz, S., Singer, Y., Tewari, A.: Composite objective mirror descent. In: Proceedings of the 23rd Annual Conference on Learning Theory, pp. 14–26 (2010)
Fern, X.Z., Brodley, C.E.: Random projection for high dimensional data clustering: a cluster ensemble approach. In: Proceedings of the 20th International Conference on Machine Learning, pp. 186–193 (2003)
Fradkin, D., Madigan, D.: Experiments with random projections for machine learning. In: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 517–522 (2003)
Freund, Y., Dasgupta, S., Kabra, M., Verma, N.: Learning the structure of manifolds using random projections. In: Advances in Neural Information Processing Systems, vol. 21, pp. 473–480 (2008)
Gao, W., Jin, R., Zhu, S., Zhou, Z.H.: One-pass AUC optimization. In: Proceedings of the 30th International Conference on Machine Learning, pp. 906–914 (2013)
Ghashami, M., Liberty, E., Phillips, J.M., Woodruff, D.P.: Frequent directions: simple and deterministic matrix sketching. SIAM J. Comput. 45(5), 1762–1792 (2016)
Hager, W.W.: Updating the inverse of a matrix. SIAM Rev. 31(2), 221–239 (1989)
Hassani, M., Töws, D., Cuzzocrea, A., Seidl, T.: BFSPMiner: an effective and efficient batch-free algorithm for mining sequential patterns over data streams. Int. J. Data Sci. Anal. 1–17 (2017). https://doi.org/10.1007/s41060-017-0084-8
Hazan, E., Agarwal, A., Kale, S.: Logarithmic regret algorithms for online convex optimization. Mach. Learn. 69(2), 169–192 (2007)
Kaski, S.: Dimensionality reduction by random mapping: fast similarity computation for clustering. In: Proceedings of the 1998 IEEE International Joint Conference on Neural Networks, vol. 1, pp. 413–418 (1998)
Krizhevsky, A.: Learning multiple layers of features from tiny images. Technical report, University of Toronto (2009)
Krummenacher, G., McWilliams, B., Kilcher, Y., Buhmann, J.M., Meinshausen, N.: Scalable adaptive stochastic optimization using random projections. In: Advances in Neural Information Processing Systems, vol. 29, pp. 1750–1758 (2016)
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
Liberty, E., Ailon, N., Singer, A.: Dense fast random projections and lean Walsh transforms. Discrete Comput. Geom. 45(1), 34–44 (2011)
Luo, H., Agarwal, A., Cesa-Bianchi, N., Langford, J.: Efficient second order online learning by sketching. In: Advances in Neural Information Processing Systems, vol. 29, pp. 902–910 (2016)
Magen, A., Zouzias, A.: Low rank matrix-valued Chernoff bounds and approximate matrix multiplication. In: Proceedings of the 22nd Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1422–1436 (2011)
Maillard, O.A., Munos, R.: Linear regression with random projections. J. Mach. Learn. Res. 13, 2735–2772 (2012)
Miyaguchi, K., Yamanishi, K.: Online detection of continuous changes in stochastic processes. Int. J. Data Sci. Anal. 3(3), 213–229 (2017)
Halko, N., Martinsson, P.G., Tropp, J.A.: Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev. 53(2), 217–288 (2011)
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. In: NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011 (2011)
Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In: Advances in Neural Information Processing Systems, vol. 21, pp. 1177–1184 (2008)
Tropp, J.A.: An introduction to matrix concentration inequalities. Found. Trends Mach. Learn. 8(1–2), 1–230 (2015)
Wan, Y., Wei, N., Zhang, L.: Efficient adaptive online learning via frequent directions. In: Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 2748–2754 (2018)
Wan, Y., Zhang, L.: Accelerating adaptive online learning by matrix approximation. In: Proceedings of the 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 405–417 (2018)
Woodruff, D.P.: Sketching as a tool for numerical linear algebra. Found. Trends Mach. Learn. 10(1–2), 1–157 (2014)
Xiao, L.: Dual averaging method for regularized stochastic learning and online optimization. In: Advances in Neural Information Processing Systems, vol. 22, pp. 2116–2124 (2009)
Yenala, H., Jhanwar, A., Chinnakotla, M.K., Goyal, J.: Deep learning for detecting inappropriate content in text. Int. J. Data Sci. Anal. 6(4), 273–286 (2018)
Zhang, L., Mahdavi, M., Jin, R., Yang, T., Zhu, S.: Recovering the optimal solution by dual random projection. In: Proceedings of the 26th Annual Conference on Learning Theory, pp. 135–157 (2013)
Zhang, L., Yang, T., Jin, R., Xiao, Y., Zhou, Z.H.: Online stochastic linear optimization under one-bit feedback. In: Proceedings of the 33rd International Conference on Machine Learning, pp. 392–401 (2016)
Zinkevich, M.: Online convex programming and generalized infinitesimal gradient ascent. In: Proceedings of the 20th International Conference on Machine Learning, pp. 928–936 (2003)
Acknowledgements
This work was partially supported by the National Key R&D Program of China (2018YFB1004300), NSFC-NRF Joint Research Project (61861146001) and YESS (2017QNRC001).
This paper is an extended version of the PAKDD’2018 Long Presentation paper “Accelerating Adaptive Online Learning by Matrix Approximation” [31].
A Theoretical analysis
In this section, we provide proofs of Theorems 1 and 2.
1.1 Supporting results
The following results are used throughout our analysis.
Lemma 1
(Proposition 3 of [7]). Let sequence \(\{\varvec{\beta }_t\}\) be generated by ADA-DP. We have
Lemma 2
Let \({X}_t=\sum _{i=1}^t{\mathbf {x}}_i{\mathbf {x}}_i^\top \) and let \({A}^\dagger \) denote the pseudo-inverse of \({A}\). Then
Lemma 2 can be proved in the same way as Lemma 10 of [7].
Theorem 3
(Theorem 2.3 of [32]). Let \(0<\epsilon ,\delta <1\) and \({S}=\frac{1}{\sqrt{k}}{R}\in {\mathbb {R}}^{k\times n}\) where the entries of R are independent standard normal random variables. If \(k=\Theta (\frac{d+\log (1/\delta )}{\epsilon ^{2}})\), then for any fixed \(n\times d\) matrix A, with probability \(1-\delta \), simultaneously for all \({\mathbf {x}}\in {\mathbb {R}}^d\),
\((1-\epsilon )\Vert {A}{\mathbf {x}}\Vert _2\le \Vert {S}{A}{\mathbf {x}}\Vert _2\le (1+\epsilon )\Vert {A}{\mathbf {x}}\Vert _2.\)
Based on the above theorem, we derive the following corollary.
Corollary 1
Let \(0<\epsilon ,\delta <1\) and let each entry of \({\mathbf {r}}_t\in {\mathbb {R}}^{\tau }\) be a Gaussian random variable independently drawn from \({\mathcal {N}}(0,1/\tau )\). Then, if \(\tau =\Omega (\frac{r+\log (T/\delta )}{\epsilon ^{2}})\), with probability \(1-\delta \), simultaneously for all \(t=1,\ldots ,T\),
\((1-\epsilon ){C}_t^\top {C}_t\preceq {S}_t^\top {S}_t\preceq (1+\epsilon ){C}_t^\top {C}_t.\)
Theorem 4
(Theorem 10 of [35]). Let \({C}=\mathrm{diag}(c_1,\ldots ,c_p)\) and \({S}=\mathrm{diag}(s_1,\ldots ,s_p)\) be \(p\times p\) diagonal matrices, where \(c_i\ne 0\) and \(c_i^2+s_i^2=1\) for all i. Let \({R}\in {\mathbb {R}}^{p\times n}\) be a Gaussian random matrix. Let \({M}={C}^2+\frac{1}{n}{S}{R}{R}^\top {S}\) and \(r=\sum _is_i^2\).
where the constant c is at least \(1/32\), and q is the rank of S.
Based on the above theorem, we derive the following corollary.
Corollary 2
Let \(c\ge 1/32\), \(\alpha >0\), \(\sigma _{ti}^2=\lambda _i({C}_t^\top {C}_t)\), \({\tilde{r}}_t=\sum _i\frac{\sigma ^2_{ti}}{\alpha +\sigma ^2_{ti}}\), \({\tilde{r}}_*=\max \limits _{k\le t\le T}{\tilde{r}}_t\) and \({\sigma ^2_{*1}}=\max \limits _{1\le t\le T}{\sigma ^2_{t1}}\). Let \({K}_t = \alpha {I}_d+{C}_t^\top {C}_t\), \(\tilde{{K}}_t = \alpha {I}_d+{S}_t^\top {S}_t\) and \(\tilde{{I}}_t={K}_t^{-1/2}\tilde{{K}}_t{K}_t^{-1/2}\). If \(\tau \ge \frac{{\tilde{r}}_*{\sigma ^2_{*1}}}{c\epsilon ^2(\alpha +{\sigma ^2_{*1}})}\log \frac{2dT}{\delta }\), with probability at least \(1-\delta \), simultaneously for all \(t=1,\ldots ,T\),
1.2 Proof of Theorem 1
Let \(\widetilde{{X}}_t\) denote \({S}_t^\top {S}_t\). First, we consider bounding the first term in the upper bound of Lemma 1. With probability \(1-\delta \), we have
where the first inequality is due to Corollary 1.
Thus, we can get
Note that \(\varvec{\beta }_1={\mathbf {0}}\), then
where the inequality is due to Corollary 1.
Then, we consider the bound of \(\sum _{t=1}^{T}\Vert f^\prime _t(\varvec{\beta }_t)\Vert ^2_{\Psi _{t}^*}\). With probability \(1-\delta \), we have
where the inequality is due to Corollary 1. According to Lemma 2, we have
We complete the proof by substituting (3), (4) and (5) into Lemma 1.
1.3 Proof of Theorem 2
Following the proof of Theorem 1, we derive Theorem 2 by bounding each term in the upper bound of Lemma 1 in turn. Before that, we need lower and upper bounds on \(({S}_t^\top {S}_t)^{1/2}\), which follow from Corollary 2.
Let the SVD of \({C}_t^\top \) be \({C}_t^\top = {U}{\Sigma }{V}^\top \) where \({U}\in {\mathbb {R}}^{d\times d},{\Sigma }\in {\mathbb {R}}^{d\times d},{V}\in {\mathbb {R}}^{t\times d}\). Let \({K}_t = \alpha {I}_d+{C}_t^\top {C}_t\), \(\tilde{{K}}_t = \alpha {I}_d+{S}_t^\top {S}_t\) and \(\tilde{{I}}_t={K}_t^{-1/2}\tilde{{K}}_t{K}_t^{-1/2}\). According to Corollary 2, with probability at least \(1-\delta \), simultaneously for all \(t=1,\ldots ,T\),
and
Then simultaneously for all \(t=1,\ldots ,T\), we have
and
Then we consider bounding the first term in the upper bound of Lemma 1. Let \(\widetilde{{X}}_t\) denote \({S}_t^\top {S}_t\). Simultaneously for all \(t=1,\ldots ,T\), we have
where the first inequality is due to (6), (7) and the last inequality has been proved in the proof of Theorem 1.
Thus, we can get
Note that \(\varvec{\beta }_1={\mathbf {0}}\), then
Before considering the upper bound of \(\sum _{t=1}^{T}\Vert f^\prime _t(\varvec{\beta }_t)\Vert ^2_{\Psi _{t}^*}\), we need to derive the upper bound of \({H}_t^{-1}\).
Let the SVD of \({S}_t^\top \) be \({S}_t^\top = {U}{\Sigma }{V}^\top \) where \({U}\in {\mathbb {R}}^{d\times d},{\Sigma }\in {\mathbb {R}}^{d\times d},{V}\in {\mathbb {R}}^{t\times d}\). We also have, for all \(t=1,\ldots ,T\),
due to \(\sigma \ge \sqrt{\alpha }\ge \sqrt{\lambda _i({S}_t^\top {S}_t)+\alpha }-\sqrt{\lambda _i({S}_t^\top {S}_t)}\) for all \(i=1,\ldots ,d\).
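For completeness, the second inequality above is the elementary scalar bound \(\sqrt{\lambda +\alpha }-\sqrt{\lambda }\le \sqrt{\alpha }\) applied with \(\lambda =\lambda _i({S}_t^\top {S}_t)\); it can be verified by squaring:

```latex
\sqrt{\lambda+\alpha}-\sqrt{\lambda}\le\sqrt{\alpha}
\;\Longleftrightarrow\;
\lambda+\alpha\le\left(\sqrt{\lambda}+\sqrt{\alpha}\right)^{2}
=\lambda+2\sqrt{\lambda\alpha}+\alpha,
```

which holds for all \(\lambda ,\alpha \ge 0\) since \(\sqrt{\lambda \alpha }\ge 0\).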
Then according to Corollary 2, with probability at least \(1-\delta \), simultaneously for all \(t=1,\ldots ,T\),
Thus, we can get
According to Lemma 2, we have
We complete the proof by substituting (8), (9) and (10) into Lemma 1.
1.4 Proof of Corollary 1
Let \({C}_t={U}{\Sigma } {V}^\top \) be the singular value decomposition of \({C}_t\). Notice that \({U}\in {\mathbb {R}}^{t\times r}, {\Sigma } {V}^\top \in {\mathbb {R}}^{r\times d}\). According to Theorem 3, we have if \(\tau =\Theta (\frac{r+\log (1/\delta )}{\epsilon ^{2}})\), then simultaneously \(\forall {\mathbf {x}} \in {\mathbb {R}}^{r}\), with probability \(1-\delta \),
Let \({\mathbf {y}}\in {\mathbb {R}}^d\) be an arbitrary vector; then \({C}_t{\mathbf {y}}={U}{\Sigma }{V}^\top {\mathbf {y}}={U}{\mathbf {x}}\) where \({\mathbf {x}}={\Sigma }{V}^\top {\mathbf {y}}\in {\mathbb {R}}^{r}\).
Then we have
and
Then, we have \((1-\epsilon ){C}_t^\top {C}_t\preceq {S}_t^\top {S}_t\preceq (1+\epsilon ){C}_t^\top {C}_t\) with probability \(1-\delta \), provided \(\tau =\Omega (\frac{r+\log (1/\delta )}{\epsilon ^{2}})\). Using the union bound, if \(\tau =\Omega (\frac{r+\log (T/\delta )}{\epsilon ^{2}})\), then with probability \(1-\delta \), simultaneously for all \(t=1,\ldots ,T\),
\((1-\epsilon ){C}_t^\top {C}_t\preceq {S}_t^\top {S}_t\preceq (1+\epsilon ){C}_t^\top {C}_t.\)
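The spectral sandwich of Corollary 1 can be checked numerically. The sketch below uses illustrative sizes (the values of `t`, `d`, `r`, `tau`, `eps` are chosen for the experiment, not taken from the analysis): it draws a rank-\(r\) matrix \(C_t\), forms \(S_t = R_t C_t\) with entries of \(R_t\) drawn from \(\mathcal{N}(0,1/\tau)\), and verifies \((1-\epsilon )\Vert C_t{\mathbf {y}}\Vert ^2\le \Vert S_t{\mathbf {y}}\Vert ^2\le (1+\epsilon )\Vert C_t{\mathbf {y}}\Vert ^2\) on many random directions \({\mathbf {y}}\).

```python
import numpy as np

rng = np.random.default_rng(1)
t, d, r = 100, 50, 5          # history length, dimension, rank of C_t
tau, eps = 400, 0.5           # tau on the order of (r + log(1/delta)) / eps^2

# rank-r matrix C_t (its rows play the role of the gradient history)
C = rng.standard_normal((t, r)) @ rng.standard_normal((r, d))
R = rng.standard_normal((tau, t)) / np.sqrt(tau)   # entries ~ N(0, 1/tau)
S = R @ C                                          # sketch: S_t = R_t C_t

# check (1-eps)||C y||^2 <= ||S y||^2 <= (1+eps)||C y||^2 on random directions
Y = rng.standard_normal((d, 1000))
cy = np.sum((C @ Y) ** 2, axis=0)   # y^T C^T C y for each column y
sy = np.sum((S @ Y) ** 2, axis=0)   # y^T S^T S y for each column y
ok = bool(np.all(((1 - eps) * cy <= sy) & (sy <= (1 + eps) * cy)))
assert ok
```

Because \(C_t{\mathbf {y}}\) always lies in an \(r\)-dimensional subspace, a sketch size \(\tau \) scaling with \(r\) rather than \(d\) suffices for every direction to stay inside the \((1\pm \epsilon )\) band.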
1.5 Proof of Corollary 2
Define the SVD of \({C}_t^\top \) as \({C}_t^\top = {U}{\Sigma }{V}^\top \) where \({U}\in {\mathbb {R}}^{d\times d},{\Sigma }\in {\mathbb {R}}^{d\times d},{V}\in {\mathbb {R}}^{t\times d}\). Then we have \({K}_t = {U}(\alpha {I}_d+{\Sigma }{\Sigma }^\top ){U}^\top \) and
where \({R}={V}^\top {R}_t^\top \in {\mathbb {R}}^{d\times \tau }\) is a Gaussian random matrix, because the columns of V are orthonormal and \({R}_t^\top \) is a Gaussian random matrix. Let \(c_i^2=\frac{\alpha }{\alpha +{\sigma ^2_{ti}}}\) and \(s_i^2=\frac{{\sigma ^2_{ti}}}{\alpha +{\sigma ^2_{ti}}}\). Then according to Theorem 4, with probability at least \(1-\delta \),
provided \(\tau \ge \frac{{\tilde{r}}_t{\sigma ^2_{t1}}}{c\epsilon ^2(\alpha +{\sigma ^2_{t1}})}\log \frac{2d}{\delta }\), where the constant c is at least \(1/32\). Using the union bound, we complete the proof.
Wan, Y., Zhang, L. Accelerating adaptive online learning by matrix approximation. Int J Data Sci Anal 9, 389–400 (2020). https://doi.org/10.1007/s41060-019-00174-4