
Sparse trace norm regularization


Abstract

We study the problem of estimating multiple predictive functions from a dictionary of basis functions in the nonparametric regression setting. Our estimation scheme assumes that each predictive function can be estimated in the form of a linear combination of the basis functions. By assuming that the coefficient matrix admits a sparse low-rank structure, we formulate the function estimation problem as a convex program regularized by the trace norm and the \(\ell _1\)-norm simultaneously. We propose to solve the convex program using the accelerated gradient (AG) method, and we develop efficient algorithms to solve the key components in AG. In addition, we conduct a theoretical analysis of the proposed function estimation scheme: we derive a key property of the optimal solution to the convex program, and, under an assumption on the basis functions, we establish a performance bound of the proposed function estimation scheme (via the composite regularization). Simulation studies demonstrate the effectiveness and efficiency of the proposed algorithms.
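To make the composite penalty concrete, the sketch below is our addition (not code from the paper); it assumes Python with NumPy, and the names `prox_trace`, `prox_l1`, and `tau` are ours. An accelerated gradient method for an objective regularized by \(\alpha \Vert {\varTheta }\Vert _* + \beta \Vert {\varTheta }\Vert _1\) builds on proximal operators; the two standard individual maps are singular-value soft-thresholding for the trace norm and entrywise soft-thresholding for the \(\ell _1\)-norm:

```python
import numpy as np

def prox_trace(Theta, tau):
    # Prox of tau * ||.||_* : soft-threshold the singular values.
    U, s, Vt = np.linalg.svd(Theta, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def prox_l1(Theta, tau):
    # Prox of tau * ||.||_1 : soft-threshold each entry.
    return np.sign(Theta) * np.maximum(np.abs(Theta) - tau, 0.0)
```

Note that the proximal map of the combined penalty \(\alpha \Vert \cdot \Vert _* + \beta \Vert \cdot \Vert _1\) is not simply a composition of these two maps; developing efficient algorithms for that key component of AG is part of the paper's contribution.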


Notes

  1. http://www.csie.ntu.edu.tw/~cjlin.

References

  • Bach FR (2008) Consistency of trace norm minimization. J Mach Learn Res 9:1019–1048

  • Beck A, Teboulle M (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J Imag Sci 2(1):183–202

  • Bickel PJ, Ritov Y, Tsybakov AB (2009) Simultaneous analysis of Lasso and Dantzig selector. Ann Stat 37(4):1705–1732

  • Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, Cambridge

  • Bunea F, Tsybakov AB, Wegkamp MH (2006) Aggregation and sparsity via \(\ell _1\) penalized least squares. In: Proceedings of the 19th annual conference on learning theory, pp 379–391

  • Candès EJ, Tao T (2005) Decoding by linear programming. IEEE Trans Inf Theory 51(12):4203–4215

  • Candès EJ, Recht B (2009) Exact matrix completion via convex optimization. Found Comput Math 9:717–772

  • Candès EJ, Li X, Ma Y, Wright J (2009) Robust principal component analysis? J ACM 58(1):1–37

  • Chandrasekaran V, Sanghavi S, Parrilo PA, Willsky AS (2011) Rank-sparsity incoherence for matrix decomposition. SIAM J Optim 21(2):572–596

  • Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32(2):407–499

  • Fazel M, Hindi H, Boyd SP (2001) A rank minimization heuristic with application to minimum order system approximation. In: Proceedings of the American control conference, pp 4734–4739

  • Friedman J, Hastie T, Höfling H, Tibshirani R (2007) Pathwise coordinate optimization. Ann Appl Stat 1(2):302–332

  • Grippo L, Sciandrone M (2000) On the convergence of the block nonlinear Gauss–Seidel method under convex constraints. Oper Res Lett 26(3):127–136

  • Huang J, Zhang T, Metaxas DN (2011) Learning with structured sparsity. J Mach Learn Res 12:3371–3412

  • Lewis DD, Yang Y, Rose TG, Li F (2004) RCV1: a new benchmark collection for text categorization research. J Mach Learn Res 5:361–397

  • Lounici K, Pontil M, Tsybakov AB, van de Geer SA (2009) Taking advantage of sparsity in multi-task learning. In: Proceedings of the 22nd annual conference on learning theory

  • Moreau JJ (1965) Proximité et dualité dans un espace hilbertien. Bull Soc Math France 93:273–299

  • Negahban S, Wainwright MJ (2011) Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Ann Stat 39(2):1069–1097

  • Nesterov Y (2004) Introductory lectures on convex optimization: a basic course. Springer, Berlin

  • Nesterov Y (2013) Gradient methods for minimizing composite objective function. Math Program 140(1):125–161

  • Pati YC, Kailath T (1994) Phase-shifting masks for microlithography: automated design and mask requirements. J Opt Soc Am A 11(9):2438–2452

  • Recht B, Fazel M, Parrilo PA (2010) Guaranteed minimum rank solutions to linear matrix equations via nuclear norm minimization. SIAM Rev 52(3):471–501

  • Rennie JDM, Srebro N (2005) Fast maximum margin matrix factorization for collaborative prediction. In: Proceedings of the 22nd international conference on machine learning, pp 713–719

  • Rohde A, Tsybakov AB (2011) Estimation of high-dimensional low rank matrices. Ann Stat 39(2):887–930

  • Srebro N, Rennie JDM, Jaakkola TS (2005) Maximum-margin matrix factorization. In: Advances in neural information processing systems, vol 17

  • Szarek SJ (1991) Condition numbers of random matrices. J Complex 7(2):131–149

  • Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58:267–288

  • Toh KC, Todd MJ, Tutuncu R (1999) SDPT3: a MATLAB software package for semidefinite programming. Optim Methods Softw 11:545–581

  • Toh KC, Yun S (2010) An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Pac J Optim 6:615–640

  • Ueda N, Saito K (2002) Single-shot detection of multiple categories of text using parametric mixture models. In: Proceedings of the 8th ACM SIGKDD international conference on knowledge discovery and data mining, pp 626–631

  • Watson GA (1992) Characterization of the subdifferential of some matrix norms. Linear Algebra Appl 170:33–45

  • Zhang T (2009) Some sharp performance bounds for least squares regression with \(l_1\) regularization. Ann Stat 37(5A):2109–2144

  • Zhao P, Yu B (2006) On model selection consistency of Lasso. J Mach Learn Res 7:2541–2563


Author information


Correspondence to Jianhui Chen.

Appendix

1.1 Operators \(\mathcal S _0\) and \(\mathcal S _1\)

We define two operators, namely \(\mathcal S _0\) and \(\mathcal S _1\), on an arbitrary matrix pair (of the same size) based on Lemma 3.4 in Recht et al. (2010), as summarized in the following lemma.

Lemma 1

Given any \({\varTheta }\) and \({\varDelta }\) of size \(h \times k\), let \(\mathrm{rank}({\varTheta }) = r\) and denote the SVD of \({\varTheta }\) as

$$\begin{aligned} {\varTheta } = U \left[ \begin{array}{cc} {\varSigma } & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{array} \right] V^T, \end{aligned}$$
(26)

where \(U \in \mathbb R ^{h \times h}\) and \(V \in \mathbb R ^{k \times k}\) are orthogonal, and \({\varSigma } \in \mathbb R ^{r \times r}\) is a diagonal matrix with the nonzero singular values of \({\varTheta }\) on its main diagonal. Let

$$\begin{aligned} {\widehat{{\varDelta }}} = U^T {\varDelta } V = \left[ \begin{array}{cc} {\widehat{{\varDelta }}}_{11} & {\widehat{{\varDelta }}}_{12} \\ {\widehat{{\varDelta }}}_{21} & {\widehat{{\varDelta }}}_{22} \end{array} \right] , \end{aligned}$$
(27)

where \({{\widehat{{\varDelta }}}}_{11} \in \mathbb R ^{r \times r}\), \({{\widehat{{\varDelta }}}}_{12} \in \mathbb R ^{r \times (k-r)}\), \({{\widehat{{\varDelta }}}}_{21} \in \mathbb R ^{(h-r) \times r}\), and \({{\widehat{{\varDelta }}}}_{22} \in \mathbb R ^{(h-r) \times (k-r)}\). Define \(\mathcal S _0\) and \(\mathcal S _1\) as

$$\begin{aligned} \mathcal S _0 ({\varTheta }, {\varDelta }) = U \left[ \begin{array}{cc} {\widehat{{\varDelta }}}_{11} & {\widehat{{\varDelta }}}_{12} \\ {\widehat{{\varDelta }}}_{21} & \mathbf{0} \end{array} \right] V^T, \quad \mathcal S _1 ({\varTheta }, {\varDelta }) = U \left[ \begin{array}{cc} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & {\widehat{{\varDelta }}}_{22} \end{array} \right] V^T. \end{aligned}$$

Then the following conditions hold: \(\mathrm{rank} \left( \mathcal S _0 ({\varTheta }, {\varDelta }) \right) \le 2r\), \({\varTheta }\, \mathcal S _1 ({\varTheta }, {\varDelta })^T = 0\), and \({\varTheta }^T \mathcal S _1 ({\varTheta }, {\varDelta }) = 0\).

The result presented in Lemma 1 implies a condition under which the trace norm on a matrix pair is additive. From Lemma 1 we can easily verify that

$$\begin{aligned} \Vert {\varTheta } + \mathcal S _1 ({{\varTheta }, {\varDelta }})\Vert _* = \Vert {\varTheta }\Vert _* + \Vert \mathcal S _1 ({\varTheta }, {\varDelta })\Vert _*, \end{aligned}$$
(28)

for arbitrary \({\varTheta }\) and \({\varDelta }\) of the same size. To avoid notational clutter, we denote \(\mathcal S _0 ({\varTheta }, {\varDelta })\) by \(\mathcal S _0 ({\varDelta })\), and \(\mathcal S _1 ({\varTheta }, {\varDelta })\) by \(\mathcal S _1 ({\varDelta })\) throughout this paper, as the appropriate \({\varTheta }\) can be easily determined from the context.
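The construction in Lemma 1 and the additivity property in Eq. (28) are easy to verify numerically. The following sketch is our illustration (not part of the original paper); it assumes Python with NumPy, and the helper name `split_operators` is ours.

```python
import numpy as np

def split_operators(Theta, Delta, tol=1e-10):
    # Form S0(Theta, Delta) and S1(Theta, Delta) as in Lemma 1.
    U, s, Vt = np.linalg.svd(Theta)       # full SVD: U is h x h, Vt is k x k
    r = int(np.sum(s > tol))              # rank of Theta
    D = U.T @ Delta @ Vt.T                # \widehat{Delta} = U^T Delta V
    S1_hat = np.zeros_like(D)
    S1_hat[r:, r:] = D[r:, r:]            # keep only the (2,2) block
    S1 = U @ S1_hat @ Vt
    S0 = Delta - S1                       # Delta = S0(Delta) + S1(Delta)
    return S0, S1

rng = np.random.default_rng(0)
h, k, r = 8, 6, 2
Theta = rng.standard_normal((h, r)) @ rng.standard_normal((r, k))  # rank r
Delta = rng.standard_normal((h, k))
S0, S1 = split_operators(Theta, Delta)

nuc = lambda A: np.linalg.norm(A, "nuc")  # trace (nuclear) norm
assert np.linalg.matrix_rank(S0) <= 2 * r                         # rank bound
assert np.allclose(Theta @ S1.T, 0) and np.allclose(Theta.T @ S1, 0)
assert np.isclose(nuc(Theta + S1), nuc(Theta) + nuc(S1))          # Eq. (28)
```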

1.2 Bound on trace norm

As a consequence of Lemma 1, we derive a bound on the trace norm of the matrices of interest as summarized below.

Corollary 1

Given an arbitrary matrix pair \({\widehat{{\varTheta }}}\) and \({\varTheta }\) of the same size, let \({\varDelta } = {\widehat{{\varTheta }}} - {\varTheta }\). Then

$$\begin{aligned} \Vert {\widehat{{\varTheta }}} - {\varTheta } \Vert _* + \Vert {\varTheta }\Vert _* - \Vert {\widehat{{\varTheta }}}\Vert _* \le 2 \Vert \mathcal S _0 ({\varDelta })\Vert _*. \end{aligned}$$

Proof

From Lemma 1 we have \({\varDelta } = \mathcal S _0 ({\varDelta }) + \mathcal S _1 ({\varDelta })\) for the matrix pair \({\varTheta }\) and \({\varDelta }\). Moreover,

$$\begin{aligned} \Vert {\widehat{{\varTheta }}}\Vert _*&= \Vert {\varTheta } + \mathcal S _0 ({\varDelta }) + \mathcal S _1 ({\varDelta }) \Vert _* \ge \Vert {\varTheta } + \mathcal S _1 ({\varDelta }) \Vert _* - \Vert \mathcal S _0 ({\varDelta }) \Vert _* \nonumber \\&= \Vert {\varTheta } \Vert _* + \Vert \mathcal S _1 ({\varDelta }) \Vert _* - \Vert \mathcal S _0 ({\varDelta }) \Vert _*, \end{aligned}$$
(29)

where the inequality above follows from the triangle inequality and the last equality above follows from Eq. (28). Using the result in Eq. (29), we have

$$\begin{aligned} \Vert {\widehat{{\varTheta }}} - {\varTheta } \Vert _* + \Vert {\varTheta }\Vert _* - \Vert {\widehat{{\varTheta }}}\Vert _*&\le \Vert {\varDelta }\Vert _* + \Vert {\varTheta }\Vert _* - \Vert {\varTheta } \Vert _* - \Vert \mathcal S _1 ({\varDelta }) \Vert _* + \Vert \mathcal S _0 ({\varDelta }) \Vert _* \\&\le 2 \Vert \mathcal S _0 ({\varDelta })\Vert _*. \end{aligned}$$

This completes the proof of the corollary.
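Continuing the NumPy sketch from Sect. 1.1 (again our illustration; `split_operators`, `nuc`, and the random setup are defined there), Corollary 1 can be sanity-checked on random instances:

```python
Theta_hat = Theta + 0.1 * rng.standard_normal((h, k))  # a perturbed estimate
Delta = Theta_hat - Theta
S0, _ = split_operators(Theta, Delta)
lhs = nuc(Theta_hat - Theta) + nuc(Theta) - nuc(Theta_hat)
assert lhs <= 2 * nuc(S0) + 1e-8                       # Corollary 1
```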

1.3 Bound on \(\ell _1\)-norm

Analogous to the bound on the trace norm in Corollary 1, we derive a bound on the \(\ell _1\)-norm of the matrices of interest. For arbitrary matrices \({\varTheta }\) and \({\varDelta }\), we denote by \(J({\varTheta }) = \{(i,j)\}\) the coordinate set of \({\varTheta }\) (the locations of its nonzero entries) and by \(J({\varTheta })_\bot \) its complement (the locations of its zero entries); we denote by \({\varDelta }_{J({\varTheta })}\) the matrix that agrees with \({\varDelta }\) on \(J({\varTheta })\) and has zero entries on \(J({\varTheta })_\bot \). The result, stated in the following lemma, parallels a vector-case result in Bickel et al. (2009).

Lemma 2

Given a matrix pair \({\widehat{{\varTheta }}}\) and \({\varTheta }\) of the same size, the inequality below always holds

$$\begin{aligned} \Vert {\widehat{{\varTheta }}} - {\varTheta } \Vert _1 + \Vert {\varTheta } \Vert _1 - \Vert {\widehat{{\varTheta }}} \Vert _1 \le 2 \Vert {\widehat{{\varTheta }}}_{J({\varTheta })} - {\varTheta }_{J({\varTheta })} \Vert _1. \end{aligned}$$

Proof

It can be verified that the inequality

$$\begin{aligned} \Vert {\varTheta }_{J({\varTheta })} \Vert _1 - \Vert {\widehat{{\varTheta }}}_{J({\varTheta })} \Vert _1 \le \Vert ({\widehat{{\varTheta }}} - {\varTheta })_{J({\varTheta })} \Vert _1, \end{aligned}$$

and the equalities

$$\begin{aligned} {\varTheta }_{J({\varTheta })_{\bot }} = \mathbf{0}, \quad \Vert ({\widehat{{\varTheta }}} - {\varTheta })_{J({\varTheta })_\bot } \Vert _1 - \Vert {\widehat{{\varTheta }}}_{J({\varTheta })_\bot } \Vert _1 = 0, \end{aligned}$$

hold. Therefore we can derive

$$\begin{aligned} \Vert {\widehat{{\varTheta }}} - {\varTheta } \Vert _1 + \Vert {\varTheta } \Vert _1 - \Vert {\widehat{{\varTheta }}} \Vert _1&= \Vert ({\widehat{{\varTheta }}} - {\varTheta })_{J({\varTheta })} \Vert _1 + \Vert ({\widehat{{\varTheta }}} - {\varTheta })_{J({\varTheta })_\bot } \Vert _1 \\&\quad + \Vert {\varTheta }_{J({\varTheta })} \Vert _1 + \Vert {\varTheta }_{J({\varTheta })_\bot } \Vert _1 - \Vert {\widehat{{\varTheta }}}_{J({\varTheta })} \Vert _1 - \Vert {\widehat{{\varTheta }}}_{J({\varTheta })_\bot } \Vert _1 \\&\le 2 \Vert ({\widehat{{\varTheta }}} - {\varTheta })_{J({\varTheta })} \Vert _1. \end{aligned}$$

This completes the proof of this lemma.
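As with Corollary 1, Lemma 2 is easy to check numerically. A short NumPy sketch (our illustration, reusing `rng`, `h`, `k`, and `Theta` from the earlier snippets; the names `l1` and `Theta_s` are ours) is below.

```python
l1 = lambda A: np.abs(A).sum()                  # entrywise l1-norm
Theta_s = Theta * (rng.random((h, k)) < 0.3)    # sparsified Theta
Theta_hat = Theta_s + 0.1 * rng.standard_normal((h, k))
J = Theta_s != 0                                # coordinate set J(Theta)
lhs = l1(Theta_hat - Theta_s) + l1(Theta_s) - l1(Theta_hat)
rhs = 2 * l1((Theta_hat - Theta_s) * J)         # restriction to J(Theta)
assert lhs <= rhs + 1e-8                        # Lemma 2
```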

1.4 Concentration inequality

Lemma 3

Let \(\sigma _{X(l)}\) be the largest singular value of the matrix \(\mathcal G _X \in \mathbb R ^{n \times h}\), and let \(W \in \mathbb R ^{n \times k}\) be a matrix with i.i.d. entries \(w_{ij} \sim \mathcal N (0, \sigma _w^2)\). Let \(\lambda = \frac{2 \sigma _{X(l)} \sigma _w \sqrt{n}}{N} \left( 1 + \sqrt{\frac{k}{n}} + t \right) \). Then

$$\begin{aligned} \Pr \left( \frac{1}{N} \Vert W^T \mathcal G _X \Vert _2 \le \frac{\lambda }{2} \right) \ge 1 - \exp \left( - \frac{1}{2} n t^2 \right) . \end{aligned}$$

Proof

It is known (Szarek 1991) that a Gaussian matrix \({\widehat{W}} \in \mathbb R ^{n \times k}\) with \(n \ge k\) and \(\hat{w}_{ij} \sim \mathcal N (0, {1}/{n})\) satisfies

$$\begin{aligned} \mathrm{Pr} \left( \Vert {\widehat{W}}\Vert _2 > 1 + \sqrt{\frac{k}{n}} + t \right) \le \exp \left( - \frac{1}{2} n t^2 \right) , \end{aligned}$$
(30)

for any \(t > 0\). From the definition of the largest singular value, there exists a unit vector \(b \in \mathbb R ^h\), i.e., \(\Vert b\Vert _2 = 1\), such that \(\Vert W^T \mathcal G _X\Vert _2 = \Vert W^T \mathcal G _X b\Vert _2 \le \Vert W \Vert _2 \Vert \mathcal G _X b\Vert _2 \le \sigma _{X(l)}\Vert W\Vert _2\). Since \(w_{ij} / \left( \sigma _w \sqrt{n} \right) \sim \mathcal N (0, 1 / n)\), we have

$$\begin{aligned} \mathrm{Pr} \left( \frac{1}{N} \Vert W^T \mathcal G _X \Vert _2 > \frac{\lambda }{2} \right) \le \mathrm{Pr} \left( \frac{1}{N} {\sigma _{{X(l)}}} \left\| W \right\| _2 > \frac{\lambda }{2} \right) . \end{aligned}$$

Applying the result in Eq. (30) to the inequality above completes the proof of this lemma.
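Lemma 3 can likewise be probed by simulation. The sketch below is our illustration; the values of \(n\), \(h\), \(k\), \(\sigma _w\), \(t\), and the choice \(N = n\) are assumptions made purely for the test. The empirical frequency of the event in Lemma 3 should be at least \(1 - \exp (-n t^2 / 2)\).

```python
n, h2, k2, sigma_w, t = 200, 10, 5, 1.0, 0.5    # assumed test values
GX = rng.standard_normal((n, h2))               # plays the role of G_X
sigma_X = np.linalg.norm(GX, 2)                 # largest singular value of G_X
N = n                                           # assumed for this test
lam = 2 * sigma_X * sigma_w * np.sqrt(n) * (1 + np.sqrt(k2 / n) + t) / N
trials = 500
hits = sum(
    np.linalg.norm((sigma_w * rng.standard_normal((n, k2))).T @ GX, 2) / N
    <= lam / 2
    for _ in range(trials)
)
print(hits / trials, "vs lower bound", 1 - np.exp(-0.5 * n * t**2))
```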



Cite this article

Chen, J., Ye, J. Sparse trace norm regularization. Comput Stat 29, 623–639 (2014). https://doi.org/10.1007/s00180-013-0440-7
