Rank penalized estimators for high-dimensional matrices

In this paper we consider the trace regression model. Assume that we observe a small set of entries or linear combinations of entries of an unknown matrix $A_0$ corrupted by noise. We propose a new rank penalized estimator of $A_0$. For this estimator we establish general oracle inequalities for the prediction error both in probability and in expectation. We also prove upper bounds for the rank of our estimator. Then, we apply our general results to the problems of matrix completion and matrix regression. In these cases our estimator has a particularly simple form: it is obtained by hard thresholding of the singular values of a matrix constructed from the observations.


Introduction
In this paper we consider the trace regression problem. Assume that we observe $n$ independent random pairs $(X_i, Y_i)$, $i = 1, \dots, n$. Here $X_i$ are random matrices of dimension $m_1 \times m_2$ with known distribution $\Pi_i$, and $Y_i$ are random variables in $\mathbb{R}$ which satisfy
(1) $\mathbb{E}(Y_i \mid X_i) = \operatorname{tr}(X_i^T A_0), \quad i = 1, \dots, n,$
where $A_0 \in \mathbb{R}^{m_1 \times m_2}$ is an unknown matrix, $\mathbb{E}(Y_i \mid X_i)$ is the conditional expectation of $Y_i$ given $X_i$ and $\operatorname{tr}(A)$ denotes the trace of the matrix $A$. We consider the problem of estimating $A_0$ based on the observations $(X_i, Y_i)$, $i = 1, \dots, n$. Though the results of this paper are obtained for general $n, m_1, m_2$, our main motivation is the high-dimensional case, which corresponds to $m_1 m_2 \gg n$, with low rank matrices $A_0$.
Setting $\xi_i = Y_i - \mathbb{E}(Y_i \mid X_i)$ we can equivalently write our model in the form
(2) $Y_i = \operatorname{tr}(X_i^T A_0) + \xi_i, \quad i = 1, \dots, n.$
The noise variables $(\xi_i)_{i=1,\dots,n}$ are independent and have mean zero.
The problem of estimating low rank matrices recently generated a considerable number of works. The most popular methods are based on minimization of the empirical risk penalized by the nuclear norm with various modifications, see, for example, [1,2,3,4,6,7,9,14,17].
In this paper we propose a new estimator of $A_0$. In our construction we combine penalization by the rank with the use of the knowledge of the distribution $\Pi = \frac{1}{n}\sum_{i=1}^n \Pi_i$. An important feature of our estimator is that in a number of interesting examples we can write it out explicitly. Penalization by the rank was previously considered in [5,10] for the multivariate response regression model. The criterion introduced by Bunea, She and Wegkamp in [5], the rank selection criterion (RSC), minimizes the Frobenius norm of the fit plus a regularization term proportional to the rank. The rank of the RSC estimator consistently estimates the number of singular values of the signal $XA_0$ above a certain noise level; here $X$ is the matrix of predictors. In [5] the authors also establish oracle inequalities on the mean squared errors of the RSC estimator. The paper [10] is mainly focused on the case of unknown noise variance; the author gives a minimal sublinear penalty for RSC and provides oracle inequalities on the mean squared risk.
The idea to incorporate the knowledge of the distribution $\Pi$ in the construction of the estimator was first introduced in [13], but with a different penalization term, proportional to the nuclear norm. In [13] the authors establish general sharp oracle inequalities for the trace regression model and apply them to the noisy matrix completion problem. They also provide lower bounds.
In the present work we consider a more general model than that of [5,10]. It contains as particular cases a number of interesting problems such as matrix completion, multi-task learning, the linear regression model and the matrix response regression model. The analysis of our model requires different techniques and uses the matrix version of Bernstein's inequality for the estimation of the stochastic term, similarly to [13]. However, we use a different penalization term than in [13] and the main scheme of our proof is quite different. In particular, we obtain a bound on the rank of our estimator in a very general setting (Theorem 2, (i)) and bounds on the prediction error in expectation (Theorem 3). Such bounds are not available for the nuclear norm penalization used in [13]. Note, however, that under very specific assumptions on $X_i$, [4] shows that the rank of $A_0$ can be reconstructed exactly, with high probability, when the dimension of the problem is smaller than the sample size.
The paper is organized as follows. In Section 2 we define the main objects of our study, in particular, our estimator. We also show how some well-known problems (matrix completion, column masks, "complete" subgaussian design) are related to our model. In Section 3, we show that the rank of our estimator is bounded from above by the rank of the unknown matrix $A_0$, up to a constant close to 1. In the same section we prove general oracle inequalities for the prediction error both in probability and in expectation. Then, in Section 4 we apply these general results to the noisy matrix completion problem. In this case our estimator has a particularly simple form: it is obtained by hard thresholding of the singular values of a matrix constructed from the observations $(X_i, Y_i)$, $i = 1, \dots, n$. Moreover, up to a logarithmic factor, the rates attained by our estimator are optimal under the Frobenius risk for the simple class of matrices $\mathcal{A}(r, a)$ defined as follows: for any $A_0 \in \mathcal{A}(r, a)$ the rank of $A_0$ is not larger than a given $r$ and all the entries of $A_0$ are bounded in absolute value by a constant $a$. Finally, in Section 5, we consider the matrix regression model and compare our bounds to those obtained in [5].

Definitions and assumptions
For $0 < q \le \infty$ the Schatten-$q$ (quasi-)norm of the matrix $A$ is defined by
$\|A\|_q = \Big( \sum_{j} \sigma_j(A)^q \Big)^{1/q} \ \text{for } 0 < q < \infty, \qquad \|A\|_\infty = \sigma_1(A),$
where $(\sigma_j(A))_j$ are the singular values of $A$ indexed in decreasing order.
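As a numerical illustration (not part of the paper's formal development), the Schatten-$q$ norm can be computed directly from the singular values; the function name below is an arbitrary choice:

```python
import numpy as np

def schatten_norm(A, q):
    """Schatten-q norm of A: the l_q norm of the vector of singular values.

    q = 2 recovers the Frobenius norm, q = 1 the nuclear norm, and
    q = np.inf the spectral norm (largest singular value).
    """
    s = np.linalg.svd(A, compute_uv=False)  # singular values, decreasing
    if np.isinf(q):
        return s[0]
    return (s ** q).sum() ** (1.0 / q)

A = np.array([[3.0, 0.0], [0.0, 4.0]])
# Singular values of A are {4, 3}.
print(schatten_norm(A, 2))       # Frobenius norm: sqrt(16 + 9) = 5
print(schatten_norm(A, np.inf))  # spectral norm: 4
```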
For any matrices $A, B \in \mathbb{R}^{m_1 \times m_2}$, we define the scalar product $\langle A, B \rangle = \operatorname{tr}(A^T B)$ and the bilinear symmetric form
(3) $\langle A, B \rangle_{L_2(\Pi)} = \frac{1}{n} \sum_{i=1}^n \mathbb{E}\big[ \operatorname{tr}(A^T X_i)\, \operatorname{tr}(B^T X_i) \big].$
We introduce the following assumption on the distribution of the matrices $X_i$:
Assumption 1. There exists a constant $\mu > 0$ such that, for all matrices $A \in \mathbb{R}^{m_1 \times m_2}$,
$\|A\|_{L_2(\Pi)}^2 \ge \mu^{-2} \|A\|_2^2.$
Under Assumption 1 the bilinear form defined by (3) is a scalar product. This assumption is satisfied, often with equality, in several interesting examples such as matrix completion, column masks and "complete" subgaussian design.
The trace regression model is quite a general model which contains as particular cases a number of interesting problems.
• Matrix completion. Assume that the design matrices $X_i$ are i.i.d. uniformly distributed on the set
(4) $\mathcal{X} = \big\{ e_j(m_1)\, e_k(m_2)^T :\ 1 \le j \le m_1,\ 1 \le k \le m_2 \big\},$
where $e_l(m)$ are the canonical basis vectors in $\mathbb{R}^m$. Then, the problem of estimating $A_0$ coincides with the problem of matrix completion under uniform sampling at random (USR). The latter problem was studied in [11,15] in the non-noisy case ($\xi_i = 0$) and in [17,9,13] in the noisy case. In a slightly different setting the problem of matrix completion was considered, for example, in [7,6,8,11,12].
For such $X_i$, we have the relation
$\|A\|_{L_2(\Pi)}^2 = \frac{1}{m_1 m_2} \|A\|_2^2.$
• Column masks. Assume that the design matrices $X_i$ are i.i.d. replications of a random matrix $X$ which has only one nonzero column. If the distribution of $X$ is such that all the columns have the same probability to be non-zero and the non-zero column $X_j$ satisfies $\mathbb{E}\, X_j X_j^T = I$, then Assumption 1 is satisfied with $\mu = \sqrt{m_2}$.
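For the USR design, the relation $\mathbb{E}\,\operatorname{tr}(X^T A)^2 = \|A\|_2^2 / (m_1 m_2)$ can be verified by direct enumeration over the basis matrices; a small Python check (dimensions chosen arbitrarily):

```python
import numpy as np

# Under USR matrix completion the design X is uniform on the m1*m2 basis
# matrices e_j(m1) e_k(m2)^T, so tr(X^T A) picks out the single entry A[j, k].
# Averaging over all basis matrices gives ||A||_{L2(Pi)}^2 = ||A||_F^2 / (m1 m2),
# i.e. Assumption 1 holds with equality and mu = sqrt(m1 m2).
m1, m2 = 3, 4
rng = np.random.default_rng(0)
A = rng.standard_normal((m1, m2))

# Exact expectation: enumerate every basis matrix with probability 1/(m1 m2).
expectation = 0.0
for j in range(m1):
    for k in range(m2):
        X = np.outer(np.eye(m1)[j], np.eye(m2)[k])  # e_j(m1) e_k(m2)^T
        expectation += np.trace(X.T @ A) ** 2 / (m1 * m2)

print(np.isclose(expectation, np.linalg.norm(A, "fro") ** 2 / (m1 * m2)))  # True
```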
• "Complete" subgaussian design Suppose that the design matrices X i are i.i.d. replications of a random matrix X and the entries of X are either i.i.d. standard Gaussian or Rademacher random variables. In both cases, Assumption 1 is satisfied with µ = 1. • Matrix regression The matrix regression model is given by where U i are 1 × m 2 vectors of response variables, V i are 1 × m 1 vectors of predictors, A 0 is an unknown m 1 × m 2 matrix of regression coefficients and E i are random 1 × m 2 vectors of noise with independent entries and mean zero. We can equivalently write this model as a trace regression model. Let U i = (U ik ) k=1,...,m 2 , E i = (E ik ) k=1,...,m 2 and Z T ik = e k (m 2 ) V i , where e k (m 2 ) are the m 2 × 1 vectors of the canonical basis of R m 2 . Then we can write (6) as . . , l and k = 1, . . . , m 2 .
Assumption 1, which is a condition of isometry in expectation, is used in the case of random $X_i$. In the case of matrix regression with deterministic $V_i$ we do not need it; see Section 5 for more details.
• Linear regression with vector parameter. Let $m_1 = m_2$ and let $\mathcal{D}$ denote the set of diagonal matrices of size $m_1 \times m_1$. If $A_0$ and the $X_i$ belong to $\mathcal{D}$, then the trace regression model becomes the linear regression model with vector parameter.
Our main motivation comes from the matrix completion and matrix regression problems, which we treat in Section 4 and Section 5.
We define the following estimator of $A_0$:
(7) $\hat{A} = \operatorname*{arg\,min}_{A \in \mathbb{R}^{m_1 \times m_2}} \Big\{ \|A\|_{L_2(\Pi)}^2 - \frac{2}{n} \sum_{i=1}^n Y_i \operatorname{tr}(X_i^T A) + \lambda \operatorname{rank}(A) \Big\},$
where $\lambda > 0$ is a regularization parameter and $\operatorname{rank}(A)$ is the rank of the matrix $A$.
For the matrix regression problem with deterministic $X_i$, our estimator coincides with the estimator of [5]:
$\hat{A} = \operatorname*{arg\,min}_{A} \Big\{ \frac{1}{n} \|U - VA\|_2^2 + \lambda \operatorname{rank}(A) \Big\},$
where $U$ and $V$ are the matrices with rows $U_i$ and $V_i$. This estimator, called the RSC estimator, can be computed efficiently using the procedure described in [5]. Under Assumption 1, the functional
$\psi(A) = \|A\|_{L_2(\Pi)}^2 - \frac{2}{n} \sum_{i=1}^n Y_i \operatorname{tr}(X_i^T A) + \lambda \operatorname{rank}(A)$
is lower semi-continuous; thus $\psi$ attains a minimum on the compact set $\{A : \|A\|_{L_2(\Pi)} \le c\}$ and, for $c$ large enough, this minimum is a global minimum of $\psi$ on $\mathbb{R}^{m_1 \times m_2}$. Suppose that Assumption 1 is satisfied with equality, i.e.,
$\|A\|_{L_2(\Pi)}^2 = \mu^{-2} \|A\|_2^2.$
Then our estimator has a particularly simple form. Set $\mathbf{X} = \frac{\mu^2}{n} \sum_{i=1}^n Y_i X_i$. The optimization problem (7) may equivalently be written as
$\hat{A} = \operatorname*{arg\,min}_{0 \le k \le m_1 \wedge m_2} \ \min_{A:\ \operatorname{rank}(A) \le k} \big\{ \mu^{-2} \|A - \mathbf{X}\|_2^2 + \lambda k \big\}.$
Here, the inner minimization problem is to compute the restricted rank estimator $\hat{A}_k$ that minimizes the norm $\|A - \mathbf{X}\|_2^2$ over all matrices $A$ of rank at most $k$. Write the singular value decomposition (SVD) of $\mathbf{X}$:
$\mathbf{X} = \sum_j \sigma_j(\mathbf{X})\, u_j(\mathbf{X})\, v_j(\mathbf{X})^T,$
where
• $\sigma_j(\mathbf{X})$ are the singular values of $\mathbf{X}$ indexed in decreasing order,
• $u_j(\mathbf{X})$ (resp. $v_j(\mathbf{X})$) are the corresponding left (resp. right) singular vectors of $\mathbf{X}$.
Following [16], one can write $\hat{A}_k = \sum_{j=1}^k \sigma_j(\mathbf{X})\, u_j(\mathbf{X})\, v_j(\mathbf{X})^T$. Using this, we easily see that $\hat{A}$ has the form
$\hat{A} = \sum_{j:\ \sigma_j(\mathbf{X}) > \mu \sqrt{\lambda}} \sigma_j(\mathbf{X})\, u_j(\mathbf{X})\, v_j(\mathbf{X})^T.$
Thus, the computation of $\hat{A}$ reduces to hard thresholding of the singular values in the SVD of $\mathbf{X}$. Remark. We can generalize the estimator given by (7) by taking the minimum over a closed set of matrices, such as the set of all diagonal matrices, instead of the whole space $\mathbb{R}^{m_1 \times m_2}$.
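The hard thresholding step is straightforward to implement; the following Python sketch (the function name and the values of $\lambda$ and $\mu$ are illustrative choices, not prescriptions from the theory) computes the estimator from a data matrix:

```python
import numpy as np

def rank_penalized_estimate(X_mat, lam, mu):
    """Hard-threshold the singular values of X_mat at mu * sqrt(lam).

    Sketch of the explicit form of the rank penalized estimator in the case
    where Assumption 1 holds with equality; lam and mu are user-supplied.
    """
    U, s, Vt = np.linalg.svd(X_mat, full_matrices=False)
    keep = s > mu * np.sqrt(lam)
    # Rebuild the matrix from the retained singular triplets only.
    return (U[:, keep] * s[keep]) @ Vt[keep, :]

# Singular values {5, 3, 0.4}; with mu = 1 and lam = 1 the value 0.4 < 1
# is thresholded away and the estimate has rank 2.
A_hat = rank_penalized_estimate(np.diag([5.0, 3.0, 0.4]), lam=1.0, mu=1.0)
print(np.linalg.matrix_rank(A_hat))  # 2
```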

General oracle inequalities
In the following theorem we bound the rank of our estimator in a very general setting. To the best of our knowledge, such estimates were not known. We also prove general oracle inequalities for the prediction errors in probability analogous to those obtained in [13, Theorem 2] for the nuclear norm penalization.
Given $n$ observations $Y_i \in \mathbb{R}$ and $X_i$, we define the random matrix
$M = \frac{1}{n} \sum_{i=1}^n \xi_i X_i.$
The value $\|M\|_\infty$ determines the "noise level" of our problem. Let $\Delta = \|M\|_\infty$.

Theorem 2. Let Assumption 1 be satisfied and
Proof. It follows from the definition of the estimator $\hat{A}$ that, for all $A \in \mathbb{R}^{m_1 \times m_2}$, one has Therefore we obtain Due to the trace duality $\langle A, B \rangle \le \|A\|_p \|B\|_q$ for $p$ and $q$ such that $1/p + 1/q = 1$, Under Assumption 1, this yields which implies To prove (i), we take $A = A_0$ in (15) and we obtain: Thus, To prove (ii), we first consider the case $\operatorname{rank}(A) \le \operatorname{rank}(\hat{A})$. Then (15) implies Therefore, for $\operatorname{rank}(A) \le \operatorname{rank}(\hat{A})$, we have Using (i) we obtain Consider now the case $\operatorname{rank}(A) \ge \operatorname{rank}(\hat{A})$. Using that Finally, the elementary inequality Using (19) and (21), we obtain (ii).
To prove (iii), we use (14) to obtain From which we get and (iii) follows.
In the next theorem we obtain bounds for the prediction error in expectation. Set $m = m_1 + m_2$, $m_1 \wedge m_2 = \min(m_1, m_2)$ and $m_1 \vee m_2 = \max(m_1, m_2)$. Suppose that $E(\Delta^2) < \infty$ and let $\mathcal{B}_r$ be the set of nonnegative random variables $W$ bounded by $r$. We set Theorem 3. Let Assumption 1 be satisfied. Consider $\varrho \ge 1$ and a regularization parameter $\lambda$ satisfying Proof. To prove (a) we take the expectation of (16) to obtain Note that the Cauchy-Schwarz inequality and $C^2 \ge S$ imply Taking $W = \operatorname{rank}(\hat{A}) + \operatorname{rank}(A)$ we find If $E\big(\operatorname{rank}(\hat{A}) + \operatorname{rank}(A)\big) < 1$, which implies $A = 0$, as $\sqrt{\lambda} \ge 2\varrho\mu C$ we obtain This proves (b) in the case $E\big(\operatorname{rank}(\hat{A}) + \operatorname{rank}(A)\big) < 1$.
We now prove part (c). From (14) we compute Taking the expectation we obtain .
The assumption on λ and (25) imply (c). This completes the proof of Theorem 3.
The next lemma gives an upper bound on S in the case when ∆ concentrates exponentially around its mean.

Matrix Completion
In this section we present some consequences of the general oracle inequalities of Theorems 2 and 3 for the model of USR matrix completion. Assume that the design matrices $X_i$ are i.i.d. uniformly distributed on the set $\mathcal{X}$ defined in (4). This implies that
$\|A\|_{L_2(\Pi)}^2 = \frac{1}{m_1 m_2} \|A\|_2^2$
for all matrices $A \in \mathbb{R}^{m_1 \times m_2}$, i.e., Assumption 1 holds with equality and $\mu = \sqrt{m_1 m_2}$. Then, we can write $\hat{A}$ explicitly: setting $\mathbf{X} = \frac{m_1 m_2}{n} \sum_{i=1}^n Y_i X_i$, we have
$\hat{A} = \sum_{j:\ \sigma_j(\mathbf{X}) > \sqrt{\lambda m_1 m_2}} \sigma_j(\mathbf{X})\, u_j(\mathbf{X})\, v_j(\mathbf{X})^T.$
Set $\hat{r} = \operatorname{rank}(\hat{A})$. In the case of matrix completion, we can improve point (i) of Theorem 2 and bound the difference of the first $\hat{r}$ singular values of $\hat{A}$ and $A_0$. We also get bounds on the prediction error measured in norms different from the Frobenius norm, in particular in the spectral norm.
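As a numerical illustration, the following Python sketch simulates USR matrix completion and computes the estimator by hard thresholding the singular values of the data matrix $\frac{m_1 m_2}{n}\sum_i Y_i X_i$; the dimensions, sample size, noise level and threshold below are illustrative choices, not the values prescribed by the theory:

```python
import numpy as np

rng = np.random.default_rng(3)
m1, m2, n = 30, 30, 20000

# Rank-2 target with singular values 10 and 6 (orthonormal factors via QR).
U0, _ = np.linalg.qr(rng.standard_normal((m1, 2)))
V0, _ = np.linalg.qr(rng.standard_normal((m2, 2)))
A0 = U0 @ np.diag([10.0, 6.0]) @ V0.T

# USR sampling: each X_i = e_j(m1) e_k(m2)^T picks one entry; Y_i is that
# entry observed with additive Gaussian noise.
rows = rng.integers(0, m1, size=n)
cols = rng.integers(0, m2, size=n)
Y = A0[rows, cols] + 0.1 * rng.standard_normal(n)

# Data matrix (m1 m2 / n) * sum_i Y_i X_i, accumulated entry by entry.
X_mat = np.zeros((m1, m2))
np.add.at(X_mat, (rows, cols), Y)
X_mat *= m1 * m2 / n

# Hard thresholding of the singular values of the data matrix; the theory
# uses the threshold sqrt(lambda * m1 * m2), here set to 3.0 for illustration.
threshold = 3.0
U, s, Vt = np.linalg.svd(X_mat, full_matrices=False)
keep = s > threshold
A_hat = (U[:, keep] * s[keep]) @ Vt[keep, :]

print(np.linalg.matrix_rank(A_hat))                     # recovered rank
print(np.linalg.norm(A_hat - A0) / np.linalg.norm(A0))  # relative error
```

With this oversampled regime the two signal singular values sit well above the noise level, so the thresholding recovers the rank of $A_0$.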
Proof. The proof is obtained by adapting the proof of [13, Theorem 8] to hard thresholding estimators. For completeness, we give the proofs of (iii) and (iv). Let us start with the proof of (iii). Note that Then and we get (iii).
In view of Theorems 2 and 3, to specify the value of the regularization parameter $\lambda$, we need to estimate $\Delta$ with high probability. We will use the bounds obtained in [13] in the following two settings of particular interest:
(A) Statistical learning setting. There exists a constant $\eta$ such that $\max_{i=1,\dots,n} |Y_i| \le \eta$. Then, we set
(39) $\rho(m_1, m_2, n, t) = 4\eta \max\left\{ \sqrt{\frac{t + \log(m)}{(m_1 \wedge m_2)\, n}},\ \frac{2(t + \log(m))}{n} \right\}, \qquad n_* = 4(m_1 \wedge m_2) \log m, \qquad c_* = 4\eta.$
(B) Sub-exponential noise. We suppose that the pairs $(X_i, Y_i)_i$ are i.i.d. and that there exist constants $\omega, c_1 > 0$, $\alpha \ge 1$ and $c_2$ such that where $C > 0$ is a large enough constant that depends only on $\alpha, c_1, c_2$. In both cases we can estimate $\Delta$ with high probability:
Lemma 6 ([13], Lemmas 1, 2 and 3). For all $t > 0$, with probability at least $1 - e^{-t}$ in the statistical learning setting (respectively, $1 - 3e^{-t}$ in the case of sub-exponential noise), one has
(41) $\Delta \le \rho(m_1, m_2, n, t).$
As a corollary of Lemma 6 we obtain the following bound for

Lemma 7. Let one of the sets of conditions (A) or (B) be satisfied.
Assume $n > n_*$, $\log m \ge 5$ and that $W$ is a non-negative random variable such that $W \le m_1 \wedge m_2$; then Proof. We will prove (42) in the statistical learning setting. The proof in the case of sub-exponential noise is completely analogous. Set Note that Lemma 6 implies that We set $\nu_1 = n(m_1 \wedge m_2)(c_*)^{-2}$, $\nu_2 = n(2c_*)^{-1}$ and $q = \frac{\log(m)}{\log(m) - 1}$.
By Hölder's inequality we get We first estimate $\big(E\Delta^{2\log m}\big)^{1/\log m}$. Inequalities (43) and (44) imply that The Gamma function satisfies the following bound: We give a proof of this inequality in the Appendix. Plugging it into (46) we compute Observe that $n > n_*$ implies $\nu_1 \log m \le \nu_2^2$ and we obtain $\big(E\Delta^{2\log m}\big)^{1/\log m} \le e \log(m) \nu^{-1}$ If $E(W^q) < 1$ we get (42) directly from (45). If $E(W^q) \ge 1$, the bound $W \le m_1 \wedge m_2$ implies that and we compute The function The natural choice of $t$ in Lemma 6 is of the order $\log m$ (see the discussion in [13]). Then, in Theorems 2 and 3 we can take
$\sqrt{\lambda} = 2\varrho c \sqrt{\frac{(m_1 \vee m_2) \log(m)}{n}},$
where the constant $c$ is large enough, to obtain the following corollary.

Corollary 8. Let one of the sets of conditions (A) or (B) be satisfied.
Assume $n > n_*$, $\log m \ge 5$, $\varrho \ge 1$ and Then, (i) with probability at least $1 - 3/(m_1 + m_2)$, one has and, in particular, and, in particular, and, in particular, (v) with probability at least $1 - 3/(m_1 + m_2)$, one has Proof. (i)-(iv) are straightforward in view of Theorems 2 and 3. (v) is a consequence of Theorem 5 (iii). The proof of (vi) follows from (i) using the same argument as in [13], Corollary 2.
This corollary guarantees that the normalized Frobenius error $\|\hat{A} - A_0\|_2 / \sqrt{m_1 m_2}$ of the estimator $\hat{A}$ is small whenever $n > C (m_1 \vee m_2) \log(m) \operatorname{rank}(A_0)$ with a constant $C$ large enough. This quantifies the sample size $n$ necessary for successful matrix completion from noisy data. Comparing Corollary 8 with Theorems 6 and 7 of [13] we see that, in the case of Gaussian errors and for the statistical learning setting, the rate of convergence of our estimator is optimal for the class of matrices $\mathcal{A}(r, a)$ defined as follows: for any $A_0 \in \mathcal{A}(r, a)$ the rank of $A_0$ is not larger than a given $r$ and all the entries of $A_0$ are bounded in absolute value by a constant $a$.

Matrix Regression
In this section we apply the general oracle inequalities of Theorems 2 and 3 to the matrix regression model and compare our bounds to those obtained by Bunea, She and Wegkamp in [5]. Recall that the matrix regression model is given by
(52) $U_i = V_i A_0 + E_i, \quad i = 1, \dots, l,$
where $U_i$ are $1 \times m_2$ vectors of response variables, $V_i$ are $1 \times m_1$ vectors of predictors, $A_0$ is an unknown $m_1 \times m_2$ matrix of regression coefficients and $E_i$ are random $1 \times m_2$ vectors of noise with independent entries and mean zero. As mentioned in Section 2, we can equivalently write this model as a trace regression model.
Let $U_i = (U_{ik})_{k=1,\dots,m_2}$, $E_i = (E_{ik})_{k=1,\dots,m_2}$ and $Z_{ik}^T = e_k(m_2) V_i$, where $e_k(m_2)$ are the $m_2 \times 1$ vectors of the canonical basis of $\mathbb{R}^{m_2}$. Then we can write (52) as
$U_{ik} = \operatorname{tr}(Z_{ik}^T A_0) + E_{ik}, \quad i = 1, \dots, l \text{ and } k = 1, \dots, m_2.$
then for deterministic predictors Note that we use Assumption 1 in the proof of Theorem 2 only to derive (14) from (13). In the case of matrix regression with deterministic $V_i$, we do not need this assumption and proceed as follows. Let $P_A$ denote the orthogonal projector on the linear span of the columns of the matrix $A$ and let $P_A^\perp = 1 - P_A$. Note that $P_A^\perp A = 0$. Then, one has $M = V^T E = V^T P_V E$. Now, we use (13) and the fact that Hence, the trace duality yields (14), where we set $\Delta = \|P_V E\|_\infty$. Thus, in the case of matrix regression with deterministic $V_i$, we have proved that Theorems 2 and 3 hold with $\Delta = \|P_V E\|_\infty$ even if Assumption 1 is not satisfied.
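The algebraic identity $M = V^T E = V^T P_V E$ underlying this reduction can be checked numerically; a minimal sketch (dimensions arbitrary, projector built via the pseudoinverse):

```python
import numpy as np

# Check the identity V^T E = V^T P_V E used for deterministic predictors:
# P_V, the orthogonal projector on the column span of V, satisfies P_V V = V,
# hence V^T P_V = V^T.
l, m1, m2 = 7, 4, 3
rng = np.random.default_rng(4)
V = rng.standard_normal((l, m1))   # stacked predictor rows V_i
E = rng.standard_normal((l, m2))   # stacked noise rows E_i

P_V = V @ np.linalg.pinv(V)        # orthogonal projector on span of columns of V
assert np.allclose(V.T @ E, V.T @ P_V @ E)
print("V^T E = V^T P_V E: OK")
```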
In order to get an upper bound on S = sup in the case of Gaussian noise we will use the following result.
Lemma 9 ([5], Lemma 3). Let $r = \operatorname{rank}(V)$ and assume that the $E_{ij}$ are independent $N(0, \sigma^2)$ random variables. Then Using (54) and Lemma 4 applied to $P_V E$ we get the following bound on $S$: For $\log m_2 \ge 4$, we have that $m_2 \ge 2e(2 \log m_2 + 1)$. Then, these two lemmas imply that in Theorems 2 and 3 we can take $\sqrt{\lambda} = 4\sigma(\sqrt{r} + \sqrt{m_2})$ to get the following corollary: and, in particular, and, in particular, and, in particular, The symbol $\lesssim$ means that the inequality holds up to multiplicative numerical constants.
This corollary shows that our error bounds are comparable to those obtained in [5]. Points (i) and (iii) are new; here we have inequalities with leading constant 1. The results (ii) and (iv) give the same bounds as in [5], up to constants and up to an additional exponentially small term in the analog of (iv) in [5].

Appendix
For completeness, we give here the proof of (47). Observe that $F$ is infinitely differentiable on $[1, +\infty)$. Moreover, the series defining $F$ can be differentiated $k$ times to obtain $F^{(k)}$. Thus