Abstract
We study the problem of estimating multiple predictive functions from a dictionary of basis functions in the nonparametric regression setting. Our estimation scheme assumes that each predictive function can be estimated in the form of a linear combination of the basis functions. By assuming that the coefficient matrix admits a sparse low-rank structure, we formulate the function estimation problem as a convex program regularized by the trace norm and the \(\ell _1\)-norm simultaneously. We propose to solve the convex program using the accelerated gradient (AG) method; we also develop efficient algorithms to solve the key components in AG. In addition, we conduct theoretical analysis on the proposed function estimation scheme: we derive a key property of the optimal solution to the convex program; based on an assumption on the basis functions, we establish a performance bound of the proposed function estimation scheme (via the composite regularization). Simulation studies demonstrate the effectiveness and efficiency of the proposed algorithms.
References
Bach FR (2008) Consistency of trace norm minimization. J Mach Learn Res 9:1019–1048
Beck A, Teboulle M (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J Imag Sci 2(1):183–202
Bickel PJ, Ritov Y, Tsybakov AB (2009) Simultaneous analysis of Lasso and Dantzig selector. Ann Stat 37(4):1705–1732
Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, Cambridge
Bunea F, Tsybakov AB, Wegkamp MH (2006) Aggregation and sparsity via \(\ell _1\) penalized least squares. In: Proceedings of the 19th annual conference on learning theory, pp 379–391
Candès EJ, Tao T (2005) Decoding by linear programming. IEEE Trans Inf Theory 51(12):4203–4215
Candès EJ, Recht B (2009) Exact matrix completion via convex optimization. Found Comput Math 9: 717–772
Candès EJ, Li X, Ma Y, Wright J (2009) Robust principal component analysis? J ACM 58(1):1–37
Chandrasekaran V, Sanghavi S, Parrilo PA, Willsky AS (2011) Rank-sparsity incoherence for matrix decomposition. SIAM J Optim 21(2):572–596
Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32(2):407–499
Fazel M, Hindi H, Boyd SP (2001) A rank minimization heuristic with application to minimum order system approximation. In: Proceedings of the American control conference, pp 4734–4739
Friedman J, Hastie T, Höfling H, Tibshirani R (2007) Pathwise coordinate optimization. Ann Appl Stat 1(2):302–332
Grippo L, Sciandrone M (2000) On the convergence of the block nonlinear Gauss–Seidel method under convex constraints. Oper Res Lett 26(3):127–136
Huang J, Zhang T, Metaxas DN (2011) Learning with structured sparsity. J Mach Learn Res 12:3371–3412
Lewis DD, Yang Y, Rose TG, Li F (2004) RCV1: a new benchmark collection for text categorization research. J Mach Learn Res 5:361–397
Lounici K, Pontil M, Tsybakov AB, van de Geer SA (2009) Taking advantage of sparsity in multi-task learning. In: The 22nd annual conference on learning theory
Moreau JJ (1965) Proximité et dualité dans un espace hilbertien. Bull de la Société Mathématique de France 93:273–299
Negahban S, Wainwright MJ (2011) Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Ann Stat 39(2):1069–1097
Nesterov Y (2004) Introductory lectures on convex optimization: a basic course. Springer, Berlin
Nesterov Y (2013) Gradient methods for minimizing composite functions. Math Program 140(1):125–161
Pati YC, Kailath T (1994) Phase-shifting masks for microlithography: automated design and mask requirements. J Opt Soc Am A 11(9):2438–2452
Recht B, Fazel M, Parrilo PA (2010) Guaranteed minimum rank solutions to linear matrix equations via nuclear norm minimization. SIAM Rev 52(3):471–501
Rennie JDM, Srebro N (2005) Fast maximum margin matrix factorization for collaborative prediction. In: Proceedings of the 22nd international conference on machine learning, pp 713–719
Rohde A, Tsybakov AB (2011) Estimation of high-dimensional low rank matrices. Ann Stat 39(2):887–930
Srebro N, Rennie JDM, Jaakkola TS (2005) Maximum-margin matrix factorization. In: Advances in neural information processing systems, vol 17
Szarek SJ (1991) Condition numbers of random matrices. J Complex 7(2):131–149
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58:267–288
Toh KC, Todd MJ, Tutuncu R (1999) SDPT3: a MATLAB software package for semidefinite programming. Optim Methods Softw 11:545–581
Toh KC, Yun S (2010) An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Pac J Optim 6:615–640
Ueda N, Saito K (2002) Single-shot detection of multiple categories of text using parametric mixture models. In: Proceedings of the 8th ACM SIGKDD international conference on knowledge discovery and data mining, pp 626–631
Watson GA (1992) Characterization of the subdifferential of some matrix norms. Linear Algebra Appl 170:33–45
Zhang T (2009) Some sharp performance bounds for least squares regression with \(l_1\) regularization. Ann Stat 37(5A):2109–2144
Zhao P, Yu B (2006) On model selection consistency of Lasso. J Mach Learn Res 7:2541–2563
Appendix
1.1 Operators \(\mathcal S _0\) and \(\mathcal S _1\)
We define two operators, namely \(\mathcal{S}_0\) and \(\mathcal{S}_1\), on an arbitrary matrix pair (of the same size) based on Lemma 3.4 in Recht et al. (2010), as summarized in the following lemma.
Lemma 1
Given any \({\varTheta}\) and \({\varDelta}\) of size \(h \times k\), let \(\mathrm{rank}({\varTheta}) = r\) and denote the SVD of \({\varTheta}\) as
\[
{\varTheta} = U \begin{pmatrix} {\varSigma} & 0 \\ 0 & 0 \end{pmatrix} V^T,
\]
where \(U \in \mathbb{R}^{h \times h}\) and \(V \in \mathbb{R}^{k \times k}\) are orthogonal, and \({\varSigma} \in \mathbb{R}^{r \times r}\) is diagonal consisting of the non-zero singular values on its main diagonal. Let
\[
{\widehat{{\varDelta}}} = U^T {\varDelta} V = \begin{pmatrix} {\widehat{{\varDelta}}}_{11} & {\widehat{{\varDelta}}}_{12} \\ {\widehat{{\varDelta}}}_{21} & {\widehat{{\varDelta}}}_{22} \end{pmatrix},
\]
where \({{\widehat{{\varDelta}}}}_{11} \in \mathbb{R}^{r \times r}\), \({{\widehat{{\varDelta}}}}_{12} \in \mathbb{R}^{r \times (k-r)}\), \({{\widehat{{\varDelta}}}}_{21} \in \mathbb{R}^{(h-r) \times r}\), and \({{\widehat{{\varDelta}}}}_{22} \in \mathbb{R}^{(h-r) \times (k-r)}\). Define \(\mathcal{S}_0\) and \(\mathcal{S}_1\) as
\[
\mathcal{S}_0({\varTheta}, {\varDelta}) = U \begin{pmatrix} {\widehat{{\varDelta}}}_{11} & {\widehat{{\varDelta}}}_{12} \\ {\widehat{{\varDelta}}}_{21} & 0 \end{pmatrix} V^T, \qquad
\mathcal{S}_1({\varTheta}, {\varDelta}) = U \begin{pmatrix} 0 & 0 \\ 0 & {\widehat{{\varDelta}}}_{22} \end{pmatrix} V^T.
\]
Then the following conditions hold: \(\mathrm{rank}\left( \mathcal{S}_0({\varTheta}, {\varDelta}) \right) \le 2r\), \({\varTheta}\, \mathcal{S}_1({\varTheta}, {\varDelta})^T = 0\), and \({\varTheta}^T \mathcal{S}_1({\varTheta}, {\varDelta}) = 0\).
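As a quick numerical sanity check (not part of the original analysis), the construction of \(\mathcal{S}_0\) and \(\mathcal{S}_1\) from Lemma 3.4 of Recht et al. (2010) can be verified in a short NumPy sketch; the variable names below are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
h, k, r = 8, 6, 2

# Random rank-r Theta and an arbitrary Delta of the same size.
Theta = rng.standard_normal((h, r)) @ rng.standard_normal((r, k))
Delta = rng.standard_normal((h, k))

# Full SVD: Theta = U diag(s) V^T with U (h x h) and V (k x k) orthogonal.
U, s, Vt = np.linalg.svd(Theta, full_matrices=True)
V = Vt.T

# Rotate Delta into the singular bases and partition into blocks.
D = U.T @ Delta @ V
D0 = D.copy()
D0[r:, r:] = 0.0                 # keep blocks (1,1), (1,2), (2,1)
D1 = np.zeros_like(D)
D1[r:, r:] = D[r:, r:]           # keep block (2,2) only

S0 = U @ D0 @ V.T                # S_0(Theta, Delta)
S1 = U @ D1 @ V.T                # S_1(Theta, Delta)

print(np.linalg.matrix_rank(S0) <= 2 * r)
print(np.allclose(Theta @ S1.T, 0.0), np.allclose(Theta.T @ S1, 0.0))
print(np.allclose(S0 + S1, Delta))
```

All three checks print `True`: the rank bound, the orthogonality conditions, and the fact that the two operators decompose \({\varDelta}\).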
The result presented in Lemma 1 implies a condition under which the trace norm on a matrix pair is additive. From Lemma 1 we can easily verify that
\[
\left\| {\varTheta} + \mathcal{S}_1({\varTheta}, {\varDelta}) \right\|_* = \left\| {\varTheta} \right\|_* + \left\| \mathcal{S}_1({\varTheta}, {\varDelta}) \right\|_* \tag{28}
\]
for arbitrary \({\varTheta}\) and \({\varDelta}\) of the same size, since the row and column spaces of \(\mathcal{S}_1({\varTheta}, {\varDelta})\) are orthogonal to those of \({\varTheta}\). To avoid cluttered notation, we denote \(\mathcal{S}_0({\varTheta}, {\varDelta})\) by \(\mathcal{S}_0({\varDelta})\), and \(\mathcal{S}_1({\varTheta}, {\varDelta})\) by \(\mathcal{S}_1({\varDelta})\) throughout this paper, as the appropriate \({\varTheta}\) can be easily determined from the context.
1.2 Bound on trace norm
As a consequence of Lemma 1, we derive a bound on the trace norm of the matrices of interest as summarized below.
Corollary 1
Given an arbitrary matrix pair \({\widehat{{\varTheta}}}\) and \({\varTheta}\), let \({\varDelta} = {\widehat{{\varTheta}}} - {\varTheta}\). Then
\[
\| {\varTheta} \|_* - \| {\widehat{{\varTheta}}} \|_* \le \| \mathcal{S}_0({\varDelta}) \|_* - \| \mathcal{S}_1({\varDelta}) \|_* .
\]
Proof
From Lemma 1 we have \({\varDelta} = \mathcal{S}_0({\varDelta}) + \mathcal{S}_1({\varDelta})\) for the matrix pair \({\varTheta}\) and \({\varDelta}\). Moreover,
\[
\| {\widehat{{\varTheta}}} \|_* = \| {\varTheta} + \mathcal{S}_0({\varDelta}) + \mathcal{S}_1({\varDelta}) \|_*
\ge \| {\varTheta} + \mathcal{S}_1({\varDelta}) \|_* - \| \mathcal{S}_0({\varDelta}) \|_*
= \| {\varTheta} \|_* + \| \mathcal{S}_1({\varDelta}) \|_* - \| \mathcal{S}_0({\varDelta}) \|_*, \tag{29}
\]
where the inequality above follows from the triangle inequality and the last equality above follows from Eq. (28). Rearranging the result in Eq. (29), we have
\[
\| {\varTheta} \|_* - \| {\widehat{{\varTheta}}} \|_* \le \| \mathcal{S}_0({\varDelta}) \|_* - \| \mathcal{S}_1({\varDelta}) \|_* .
\]
This completes the proof of the corollary.
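The bound in Corollary 1 can likewise be spot-checked numerically. The sketch below (our own helper names, not from the paper) rebuilds the \(\mathcal{S}_0 / \mathcal{S}_1\) split and confirms the trace-norm inequality on a random instance:

```python
import numpy as np

def split(Theta, Delta, r):
    """Split Delta = S0 + S1, where S1's row/column spaces are
    orthogonal to Theta's and S0 has rank at most 2r."""
    U, _, Vt = np.linalg.svd(Theta, full_matrices=True)
    V = Vt.T
    D = U.T @ Delta @ V
    D1 = np.zeros_like(D)
    D1[r:, r:] = D[r:, r:]
    S1 = U @ D1 @ V.T
    return Delta - S1, S1

nuc = lambda A: np.linalg.norm(A, 'nuc')   # trace (nuclear) norm

rng = np.random.default_rng(1)
h, k, r = 10, 7, 3
Theta = rng.standard_normal((h, r)) @ rng.standard_normal((r, k))
Theta_hat = Theta + 0.5 * rng.standard_normal((h, k))
Delta = Theta_hat - Theta
S0, S1 = split(Theta, Delta, r)

lhs = nuc(Theta) - nuc(Theta_hat)
rhs = nuc(S0) - nuc(S1)
print(lhs <= rhs + 1e-10)
```

The check prints `True`, matching the corollary \(\| {\varTheta} \|_* - \| {\widehat{{\varTheta}}} \|_* \le \| \mathcal{S}_0({\varDelta}) \|_* - \| \mathcal{S}_1({\varDelta}) \|_*\).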
1.3 Bound on \(\ell _1\)-norm
Analogous to the bound on the trace norm in Corollary 1, we also derive a bound on the \(\ell _1\)-norm of the matrices of interest in the following lemma. For arbitrary matrices \({\varTheta }\) and \({\varDelta }\), we denote by \(J ({\varTheta }) =\{(i,j) \}\) the coordinate set (the location set of nonzero entries) of \({\varTheta }\), and by \(J({\varTheta })_\bot \) the associated complement (the location set of zero entries); we denote by \({\varDelta }_{J({\varTheta })}\) the matrix of the same entries as \({\varDelta }\) on the set \(J({\varTheta })\) and of zero entries on the set \(J({\varTheta })_\bot \). We now present a result associated with \(J({\varTheta })\) and \(J({\varTheta })_\bot \) in the following lemma. Note that a similar result for the vector case is presented in Bickel et al. (2009).
Lemma 2
Given a matrix pair \({\widehat{{\varTheta}}}\) and \({\varTheta}\) of the same size, let \({\varDelta} = {\widehat{{\varTheta}}} - {\varTheta}\). Then the inequality below always holds:
\[
\| {\varTheta} \|_1 - \| {\widehat{{\varTheta}}} \|_1 \le \| {\varDelta}_{J({\varTheta})} \|_1 - \| {\varDelta}_{J({\varTheta})_\bot} \|_1 .
\]
Proof
It can be verified that the inequality
\[
\| {\varTheta} + {\varDelta} \|_1 \ge \| {\varTheta} + {\varDelta}_{J({\varTheta})_\bot} \|_1 - \| {\varDelta}_{J({\varTheta})} \|_1
\]
and the equalities
\[
\| {\varTheta} + {\varDelta}_{J({\varTheta})_\bot} \|_1 = \| {\varTheta} \|_1 + \| {\varDelta}_{J({\varTheta})_\bot} \|_1, \qquad
{\varDelta} = {\varDelta}_{J({\varTheta})} + {\varDelta}_{J({\varTheta})_\bot}
\]
hold, where the first equality follows because \({\varTheta}\) and \({\varDelta}_{J({\varTheta})_\bot}\) have disjoint supports. Therefore we can derive
\[
\| {\widehat{{\varTheta}}} \|_1 = \| {\varTheta} + {\varDelta} \|_1 \ge \| {\varTheta} \|_1 + \| {\varDelta}_{J({\varTheta})_\bot} \|_1 - \| {\varDelta}_{J({\varTheta})} \|_1 ,
\]
which is equivalent to the claimed inequality. This completes the proof of this lemma.
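A small numerical illustration (ours, not the paper's) of the \(\ell_1\) bound in Lemma 2, using a sparse \({\varTheta}\) and its support mask for \(J({\varTheta})\):

```python
import numpy as np

rng = np.random.default_rng(2)
h, k = 6, 5

# Sparse Theta: keep roughly 30% of the entries.
Theta = rng.standard_normal((h, k)) * (rng.random((h, k)) < 0.3)
Theta_hat = Theta + 0.1 * rng.standard_normal((h, k))
Delta = Theta_hat - Theta

J = Theta != 0                    # support set J(Theta)
l1 = lambda A: np.abs(A).sum()    # entrywise l1-norm

lhs = l1(Theta) - l1(Theta_hat)
rhs = l1(Delta[J]) - l1(Delta[~J])   # ||Delta_J||_1 - ||Delta_{J_perp}||_1
print(lhs <= rhs + 1e-12)
```

The inequality holds deterministically (it only uses the triangle inequality), so the check prints `True` for any draw.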
1.4 Concentration inequality
Lemma 3
Let \(\sigma_{X(l)}\) be the largest singular value of the matrix \(\mathcal{G}_X \in \mathbb{R}^{n \times h}\); let \(W \in \mathbb{R}^{n \times k}\) be a matrix with i.i.d. entries \(w_{ij} \sim \mathcal{N}(0, \sigma_w^2)\). Let \(\lambda = 2 \sigma_{X(l)} \sigma_w \sqrt{n} \left( 1 + \sqrt{k/n} + t \right) / N\). Then
\[
\mathrm{Pr} \left( \frac{1}{N} \left\| W^T \mathcal{G}_X \right\|_2 \ge \frac{\lambda}{2} \right) \le \exp \left( - \frac{n t^2}{2} \right).
\]
Proof
It is known (Szarek 1991) that a Gaussian matrix \({\widehat{W}} \in \mathbb{R}^{n \times k}\) with \(n \ge k\) and i.i.d. entries \(\hat{w}_{ij} \sim \mathcal{N}(0, 1/n)\) satisfies
\[
\mathrm{Pr} \left( \| {\widehat{W}} \|_2 \ge 1 + \sqrt{k/n} + t \right) \le \exp \left( - \frac{n t^2}{2} \right) \tag{30}
\]
for any \(t > 0\). From the definition of the largest singular value, there exists a unit vector \(b \in \mathbb{R}^h\), i.e., \(\Vert b \Vert_2 = 1\), such that \(\Vert W^T \mathcal{G}_X \Vert_2 = \Vert W^T \mathcal{G}_X b \Vert_2 \le \Vert W \Vert_2 \Vert \mathcal{G}_X b \Vert_2 \le \sigma_{X(l)} \Vert W \Vert_2\). Since \(w_{ij} / \left( \sigma_w \sqrt{n} \right) \sim \mathcal{N}(0, 1/n)\), applying Eq. (30) we have
\[
\mathrm{Pr} \left( \| W \|_2 \ge \sigma_w \sqrt{n} \left( 1 + \sqrt{k/n} + t \right) \right) \le \exp \left( - \frac{n t^2}{2} \right).
\]
Combining the inequality above with the bound \(\Vert W^T \mathcal{G}_X \Vert_2 \le \sigma_{X(l)} \Vert W \Vert_2\), we complete the proof of this lemma.
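A Monte Carlo illustration of Lemma 3 (our own sketch, with illustrative dimensions and a placeholder \(\mathcal{G}_X\)): the empirical probability that \(\Vert W^T \mathcal{G}_X \Vert_2 / N\) exceeds \(\lambda/2\) should stay below \(\exp(-nt^2/2)\).

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, h, N = 200, 20, 15, 200
sigma_w, t = 1.0, 0.5

# Placeholder design matrix G_X; sigma_Xl is its largest singular value.
G_X = rng.standard_normal((n, h))
sigma_Xl = np.linalg.norm(G_X, 2)

# Regularization level from Lemma 3.
lam = 2 * sigma_Xl * sigma_w * np.sqrt(n) * (1 + np.sqrt(k / n) + t) / N

# Empirical exceedance probability over independent Gaussian draws of W.
trials = 200
exceed = sum(
    np.linalg.norm((sigma_w * rng.standard_normal((n, k))).T @ G_X, 2) / N > lam / 2
    for _ in range(trials)
)
print(exceed / trials, np.exp(-n * t**2 / 2))
```

With \(n = 200\) and \(t = 0.5\) the theoretical bound \(\exp(-nt^2/2)\) is astronomically small, so no trial should exceed the threshold.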
Chen, J., Ye, J. Sparse trace norm regularization. Comput Stat 29, 623–639 (2014). https://doi.org/10.1007/s00180-013-0440-7