Abstract
We study the problem of estimating multiple predictive functions from a dictionary of basis functions in the nonparametric regression setting. Our estimation scheme assumes that each predictive function can be estimated in the form of a linear combination of the basis functions. By assuming that the coefficient matrix admits a sparse low-rank structure, we formulate the function estimation problem as a convex program regularized by the trace norm and the \(\ell _1\)-norm simultaneously. We propose to solve the convex program using the accelerated gradient (AG) method; we also develop efficient algorithms to solve the key components in AG. In addition, we conduct theoretical analysis on the proposed function estimation scheme: we derive a key property of the optimal solution to the convex program; based on an assumption on the basis functions, we establish a performance bound of the proposed function estimation scheme (via the composite regularization). Simulation studies demonstrate the effectiveness and efficiency of the proposed algorithms.
References
Bach FR (2008) Consistency of trace norm minimization. J Mach Learn Res 9:1019–1048
Beck A, Teboulle M (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J Imag Sci 2(1):183–202
Bickel PJ, Ritov Y, Tsybakov AB (2009) Simultaneous analysis of Lasso and Dantzig selector. Ann Stat 37(4):1705–1732
Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, Cambridge
Bunea F, Tsybakov AB, Wegkamp MH (2006) Aggregation and sparsity via \(\ell _1\) penalized least squares. In: Proceedings of the 19th annual conference on learning theory, pp 379–391
Candès EJ, Tao T (2005) Decoding by linear programming. IEEE Trans Inf Theory 51(12):4203–4215
Candès EJ, Recht B (2009) Exact matrix completion via convex optimization. Found Comput Math 9: 717–772
Candès EJ, Li X, Ma Y, Wright J (2009) Robust principal component analysis? J ACM 58(1):1–37
Chandrasekaran V, Sanghavi S, Parrilo PA, Willsky AS (2011) Rank-sparsity incoherence for matrix decomposition. SIAM J Optim 21(2):572–596
Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32(2):407–499
Fazel M, Hindi H, Boyd SP (2001) A rank minimization heuristic with application to minimum order system approximation. In: Proceedings of the American control conference, pp 4734–4739
Friedman J, Hastie T, Höfling H, Tibshirani R (2007) Pathwise coordinate optimization. Ann Appl Stat 1(2):302–332
Grippo L, Sciandrone M (2000) On the convergence of the block nonlinear Gauss–Seidel method under convex constraints. Oper Res Lett 26(3):127–136
Huang J, Zhang T, Metaxas DN (2011) Learning with structured sparsity. J Mach Learn Res 12:3371–3412
Lewis DD, Yang Y, Rose TG, Li F (2004) RCV1: a new benchmark collection for text categorization research. J Mach Learn Res 5:361–397
Lounici K, Pontil M, Tsybakov AB, van de Geer SA (2009) Taking advantage of sparsity in multi-task learning. In: The 22nd annual conference on learning theory
Moreau JJ (1965) Proximité et dualité dans un espace hilbertien. Bull de la Société Mathématique de France 93:273–299
Negahban S, Wainwright MJ (2011) Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Ann Stat 39(2):1069–1097
Nesterov Y (2004) Introductory lectures on convex optimization: a basic course. Springer, Berlin
Nesterov Y (2013) Gradient methods for minimizing composite functions. Math Program 140(1):125–161
Pati YC, Kailath T (1994) Phase-shifting masks for microlithography: automated design and mask requirements. J Opt Soc Am A 11(9):2438–2452
Recht B, Fazel M, Parrilo PA (2010) Guaranteed minimum rank solutions to linear matrix equations via nuclear norm minimization. SIAM Rev 52(3):471–501
Rennie JDM, Srebro N (2005) Fast maximum margin matrix factorization for collaborative prediction. In: Proceedings of the 22nd international conference on machine learning, pp 713–719
Rohde A, Tsybakov AB (2011) Estimation of high-dimensional low rank matrices. Ann Stat 39(2):887–930
Srebro N, Rennie JDM, Jaakkola TS (2005) Maximum-margin matrix factorization. In: Advances in neural information processing systems, vol 17
Szarek SJ (1991) Condition numbers of random matrices. J Complex 7(2):131–149
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58:267–288
Toh KC, Todd MJ, Tutuncu R (1999) SDPT3: a MATLAB software package for semidefinite programming. Optim Methods Softw 11:545–581
Toh KC, Yun S (2010) An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Pac J Optim 6:615–640
Ueda N, Saito K (2002) Single-shot detection of multiple categories of text using parametric mixture models. In: Proceedings of the 8th ACM SIGKDD international conference on knowledge discovery and data mining, pp 626–631
Watson GA (1992) Characterization of the subdifferential of some matrix norms. Linear Algebra Appl 170:33–45
Zhang T (2009) Some sharp performance bounds for least squares regression with \(l_1\) regularization. Ann Stat 37(5A):2109–2144
Zhao P, Yu B (2006) On model selection consistency of Lasso. J Mach Learn Res 7:2541–2563
Appendix
1.1 Operators \(\mathcal S _0\) and \(\mathcal S _1\)
We define two operators, namely \(\mathcal{S}_0\) and \(\mathcal{S}_1\), on an arbitrary matrix pair (of the same size) based on Lemma 3.4 in Recht et al. (2010), as summarized in the following lemma.
Lemma 1
Given any \({\varTheta}\) and \({\varDelta}\) of size \(h \times k\), let \(\mathrm{rank}({\varTheta}) = r\) and denote the SVD of \({\varTheta}\) as
\[
{\varTheta} = U \begin{pmatrix} {\varSigma} & 0 \\ 0 & 0 \end{pmatrix} V^T,
\]
where \(U \in \mathbb{R}^{h \times h}\) and \(V \in \mathbb{R}^{k \times k}\) are orthogonal, and \({\varSigma} \in \mathbb{R}^{r \times r}\) is diagonal consisting of the non-zero singular values on its main diagonal. Let
\[
{\widehat{{\varDelta}}} = U^T {\varDelta} V = \begin{pmatrix} {\widehat{{\varDelta}}}_{11} & {\widehat{{\varDelta}}}_{12} \\ {\widehat{{\varDelta}}}_{21} & {\widehat{{\varDelta}}}_{22} \end{pmatrix},
\]
where \({{\widehat{{\varDelta}}}}_{11} \in \mathbb{R}^{r \times r}\), \({{\widehat{{\varDelta}}}}_{12} \in \mathbb{R}^{r \times (k-r)}\), \({{\widehat{{\varDelta}}}}_{21} \in \mathbb{R}^{(h-r) \times r}\), and \({{\widehat{{\varDelta}}}}_{22} \in \mathbb{R}^{(h-r) \times (k-r)}\). Define \(\mathcal{S}_0\) and \(\mathcal{S}_1\) as
\[
\mathcal{S}_0({\varTheta}, {\varDelta}) = U \begin{pmatrix} {\widehat{{\varDelta}}}_{11} & {\widehat{{\varDelta}}}_{12} \\ {\widehat{{\varDelta}}}_{21} & 0 \end{pmatrix} V^T, \qquad
\mathcal{S}_1({\varTheta}, {\varDelta}) = U \begin{pmatrix} 0 & 0 \\ 0 & {\widehat{{\varDelta}}}_{22} \end{pmatrix} V^T.
\]
Then the following conditions hold: \(\mathrm{rank}\left( \mathcal{S}_0({\varTheta}, {\varDelta}) \right) \le 2r\), \({\varTheta}\, \mathcal{S}_1({\varTheta}, {\varDelta})^T = 0\), and \({\varTheta}^T \mathcal{S}_1({\varTheta}, {\varDelta}) = 0\).
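As a quick numerical sanity check (not part of the original analysis), the construction of \(\mathcal{S}_0\) and \(\mathcal{S}_1\) from Lemma 3.4 of Recht et al. (2010) can be verified in a short NumPy sketch; the variable names below are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
h, k, r = 8, 6, 2

# Random rank-r Theta and an arbitrary Delta of the same size.
Theta = rng.standard_normal((h, r)) @ rng.standard_normal((r, k))
Delta = rng.standard_normal((h, k))

# Full SVD: Theta = U diag(s) V^T with U (h x h) and V (k x k) orthogonal.
U, s, Vt = np.linalg.svd(Theta, full_matrices=True)
V = Vt.T

# Rotate Delta into the singular bases and partition into blocks.
D = U.T @ Delta @ V
D0 = D.copy()
D0[r:, r:] = 0.0                 # keep blocks (1,1), (1,2), (2,1)
D1 = np.zeros_like(D)
D1[r:, r:] = D[r:, r:]           # keep block (2,2) only

S0 = U @ D0 @ V.T                # S_0(Theta, Delta)
S1 = U @ D1 @ V.T                # S_1(Theta, Delta)

print(np.linalg.matrix_rank(S0) <= 2 * r)
print(np.allclose(Theta @ S1.T, 0.0), np.allclose(Theta.T @ S1, 0.0))
print(np.allclose(S0 + S1, Delta))
```

All three checks print `True`: the rank bound, the orthogonality conditions, and the fact that the two operators decompose \({\varDelta}\).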
The result presented in Lemma 1 implies a condition under which the trace norm on a matrix pair is additive. From Lemma 1 we can easily verify that
\[
\left\| {\varTheta} + \mathcal{S}_1({\varTheta}, {\varDelta}) \right\|_* = \left\| {\varTheta} \right\|_* + \left\| \mathcal{S}_1({\varTheta}, {\varDelta}) \right\|_* \tag{28}
\]
for arbitrary \({\varTheta}\) and \({\varDelta}\) of the same size, since the row and column spaces of \(\mathcal{S}_1({\varTheta}, {\varDelta})\) are orthogonal to those of \({\varTheta}\). To avoid cluttered notation, we denote \(\mathcal{S}_0({\varTheta}, {\varDelta})\) by \(\mathcal{S}_0({\varDelta})\), and \(\mathcal{S}_1({\varTheta}, {\varDelta})\) by \(\mathcal{S}_1({\varDelta})\) throughout this paper, as the appropriate \({\varTheta}\) can be easily determined from the context.
1.2 Bound on trace norm
As a consequence of Lemma 1, we derive a bound on the trace norm of the matrices of interest as summarized below.
Corollary 1
Given an arbitrary matrix pair \({\widehat{{\varTheta}}}\) and \({\varTheta}\), let \({\varDelta} = {\widehat{{\varTheta}}} - {\varTheta}\). Then
\[
\| {\varTheta} \|_* - \| {\widehat{{\varTheta}}} \|_* \le \| \mathcal{S}_0({\varDelta}) \|_* - \| \mathcal{S}_1({\varDelta}) \|_* .
\]
Proof
From Lemma 1 we have \({\varDelta} = \mathcal{S}_0({\varDelta}) + \mathcal{S}_1({\varDelta})\) for the matrix pair \({\varTheta}\) and \({\varDelta}\). Moreover,
\[
\| {\widehat{{\varTheta}}} \|_* = \| {\varTheta} + \mathcal{S}_0({\varDelta}) + \mathcal{S}_1({\varDelta}) \|_*
\ge \| {\varTheta} + \mathcal{S}_1({\varDelta}) \|_* - \| \mathcal{S}_0({\varDelta}) \|_*
= \| {\varTheta} \|_* + \| \mathcal{S}_1({\varDelta}) \|_* - \| \mathcal{S}_0({\varDelta}) \|_*, \tag{29}
\]
where the inequality above follows from the triangle inequality and the last equality above follows from Eq. (28). Rearranging the result in Eq. (29), we have
\[
\| {\varTheta} \|_* - \| {\widehat{{\varTheta}}} \|_* \le \| \mathcal{S}_0({\varDelta}) \|_* - \| \mathcal{S}_1({\varDelta}) \|_* .
\]
This completes the proof of the corollary.
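The bound in Corollary 1 can likewise be spot-checked numerically. The sketch below (our own helper names, not from the paper) rebuilds the \(\mathcal{S}_0 / \mathcal{S}_1\) split and confirms the trace-norm inequality on a random instance:

```python
import numpy as np

def split(Theta, Delta, r):
    """Split Delta = S0 + S1, where S1's row/column spaces are
    orthogonal to Theta's and S0 has rank at most 2r."""
    U, _, Vt = np.linalg.svd(Theta, full_matrices=True)
    V = Vt.T
    D = U.T @ Delta @ V
    D1 = np.zeros_like(D)
    D1[r:, r:] = D[r:, r:]
    S1 = U @ D1 @ V.T
    return Delta - S1, S1

nuc = lambda A: np.linalg.norm(A, 'nuc')   # trace (nuclear) norm

rng = np.random.default_rng(1)
h, k, r = 10, 7, 3
Theta = rng.standard_normal((h, r)) @ rng.standard_normal((r, k))
Theta_hat = Theta + 0.5 * rng.standard_normal((h, k))
Delta = Theta_hat - Theta
S0, S1 = split(Theta, Delta, r)

lhs = nuc(Theta) - nuc(Theta_hat)
rhs = nuc(S0) - nuc(S1)
print(lhs <= rhs + 1e-10)
```

The check prints `True`, matching the corollary \(\| {\varTheta} \|_* - \| {\widehat{{\varTheta}}} \|_* \le \| \mathcal{S}_0({\varDelta}) \|_* - \| \mathcal{S}_1({\varDelta}) \|_*\).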
1.3 Bound on \(\ell _1\)-norm
Analogous to the bound on the trace norm in Corollary 1, we also derive a bound on the \(\ell _1\)-norm of the matrices of interest in the following lemma. For arbitrary matrices \({\varTheta }\) and \({\varDelta }\), we denote by \(J ({\varTheta }) =\{(i,j) \}\) the coordinate set (the location set of nonzero entries) of \({\varTheta }\), and by \(J({\varTheta })_\bot \) the associated complement (the location set of zero entries); we denote by \({\varDelta }_{J({\varTheta })}\) the matrix of the same entries as \({\varDelta }\) on the set \(J({\varTheta })\) and of zero entries on the set \(J({\varTheta })_\bot \). We now present a result associated with \(J({\varTheta })\) and \(J({\varTheta })_\bot \) in the following lemma. Note that a similar result for the vector case is presented in Bickel et al. (2009).
Lemma 2
Given a matrix pair \({\widehat{{\varTheta}}}\) and \({\varTheta}\) of the same size, let \({\varDelta} = {\widehat{{\varTheta}}} - {\varTheta}\). Then the inequality below always holds:
\[
\| {\varTheta} \|_1 - \| {\widehat{{\varTheta}}} \|_1 \le \| {\varDelta}_{J({\varTheta})} \|_1 - \| {\varDelta}_{J({\varTheta})_\bot} \|_1 .
\]
Proof
It can be verified that the inequality
\[
\| {\varTheta} + {\varDelta} \|_1 \ge \| {\varTheta} + {\varDelta}_{J({\varTheta})_\bot} \|_1 - \| {\varDelta}_{J({\varTheta})} \|_1
\]
and the equalities
\[
\| {\varTheta} + {\varDelta}_{J({\varTheta})_\bot} \|_1 = \| {\varTheta} \|_1 + \| {\varDelta}_{J({\varTheta})_\bot} \|_1, \qquad
{\varDelta} = {\varDelta}_{J({\varTheta})} + {\varDelta}_{J({\varTheta})_\bot}
\]
hold, where the first equality follows because \({\varTheta}\) and \({\varDelta}_{J({\varTheta})_\bot}\) have disjoint supports. Therefore we can derive
\[
\| {\widehat{{\varTheta}}} \|_1 = \| {\varTheta} + {\varDelta} \|_1 \ge \| {\varTheta} \|_1 + \| {\varDelta}_{J({\varTheta})_\bot} \|_1 - \| {\varDelta}_{J({\varTheta})} \|_1 ,
\]
which is equivalent to the claimed inequality. This completes the proof of this lemma.
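A small numerical illustration (ours, not the paper's) of the \(\ell_1\) bound in Lemma 2, using a sparse \({\varTheta}\) and its support mask for \(J({\varTheta})\):

```python
import numpy as np

rng = np.random.default_rng(2)
h, k = 6, 5

# Sparse Theta: keep roughly 30% of the entries.
Theta = rng.standard_normal((h, k)) * (rng.random((h, k)) < 0.3)
Theta_hat = Theta + 0.1 * rng.standard_normal((h, k))
Delta = Theta_hat - Theta

J = Theta != 0                    # support set J(Theta)
l1 = lambda A: np.abs(A).sum()    # entrywise l1-norm

lhs = l1(Theta) - l1(Theta_hat)
rhs = l1(Delta[J]) - l1(Delta[~J])   # ||Delta_J||_1 - ||Delta_{J_perp}||_1
print(lhs <= rhs + 1e-12)
```

The inequality holds deterministically (it only uses the triangle inequality), so the check prints `True` for any draw.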
1.4 Concentration inequality
Lemma 3
Let \(\sigma_{X(l)}\) be the largest singular value of the matrix \(\mathcal{G}_X \in \mathbb{R}^{n \times h}\); let \(W \in \mathbb{R}^{n \times k}\) be a matrix with i.i.d. entries \(w_{ij} \sim \mathcal{N}(0, \sigma_w^2)\). Let \(\lambda = 2 \sigma_{X(l)} \sigma_w \sqrt{n} \left( 1 + \sqrt{k/n} + t \right) / N\). Then
\[
\mathrm{Pr} \left( \frac{1}{N} \left\| W^T \mathcal{G}_X \right\|_2 \ge \frac{\lambda}{2} \right) \le \exp \left( - \frac{n t^2}{2} \right).
\]
Proof
It is known (Szarek 1991) that a Gaussian matrix \({\widehat{W}} \in \mathbb{R}^{n \times k}\) with \(n \ge k\) and i.i.d. entries \(\hat{w}_{ij} \sim \mathcal{N}(0, 1/n)\) satisfies
\[
\mathrm{Pr} \left( \| {\widehat{W}} \|_2 \ge 1 + \sqrt{k/n} + t \right) \le \exp \left( - \frac{n t^2}{2} \right) \tag{30}
\]
for any \(t > 0\). From the definition of the largest singular value, there exists a unit vector \(b \in \mathbb{R}^h\), i.e., \(\Vert b \Vert_2 = 1\), such that \(\Vert W^T \mathcal{G}_X \Vert_2 = \Vert W^T \mathcal{G}_X b \Vert_2 \le \Vert W \Vert_2 \Vert \mathcal{G}_X b \Vert_2 \le \sigma_{X(l)} \Vert W \Vert_2\). Since \(w_{ij} / \left( \sigma_w \sqrt{n} \right) \sim \mathcal{N}(0, 1/n)\), applying Eq. (30) we have
\[
\mathrm{Pr} \left( \| W \|_2 \ge \sigma_w \sqrt{n} \left( 1 + \sqrt{k/n} + t \right) \right) \le \exp \left( - \frac{n t^2}{2} \right).
\]
Combining the inequality above with the bound \(\Vert W^T \mathcal{G}_X \Vert_2 \le \sigma_{X(l)} \Vert W \Vert_2\), we complete the proof of this lemma.
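A Monte Carlo illustration of Lemma 3 (our own sketch, with illustrative dimensions and a placeholder \(\mathcal{G}_X\)): the empirical probability that \(\Vert W^T \mathcal{G}_X \Vert_2 / N\) exceeds \(\lambda/2\) should stay below \(\exp(-nt^2/2)\).

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, h, N = 200, 20, 15, 200
sigma_w, t = 1.0, 0.5

# Placeholder design matrix G_X; sigma_Xl is its largest singular value.
G_X = rng.standard_normal((n, h))
sigma_Xl = np.linalg.norm(G_X, 2)

# Regularization level from Lemma 3.
lam = 2 * sigma_Xl * sigma_w * np.sqrt(n) * (1 + np.sqrt(k / n) + t) / N

# Empirical exceedance probability over independent Gaussian draws of W.
trials = 200
exceed = sum(
    np.linalg.norm((sigma_w * rng.standard_normal((n, k))).T @ G_X, 2) / N > lam / 2
    for _ in range(trials)
)
print(exceed / trials, np.exp(-n * t**2 / 2))
```

With \(n = 200\) and \(t = 0.5\) the theoretical bound \(\exp(-nt^2/2)\) is astronomically small, so no trial should exceed the threshold.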
Chen, J., Ye, J. Sparse trace norm regularization. Comput Stat 29, 623–639 (2014). https://doi.org/10.1007/s00180-013-0440-7