Matrix completion by singular value thresholding: sharp bounds

We consider the matrix completion problem where the aim is to esti-mate a large data matrix for which only a relatively small random subset of its entries is observed. Quite popular approaches to matrix completion problem are iterative thresholding methods. In spite of their empirical success, the theoretical guarantees of such iterative thresholding methods are poorly understood. The goal of this paper is to provide strong theo-retical guarantees, similar to those obtained for nuclear-norm penalization methods and one step thresholding methods, for an iterative thresholding algorithm which is a modification of the softImpute algorithm. An im-portant consequence of our result is the exact minimax optimal rates of convergence for matrix completion problem which were known until know only up to a logarithmic factor.


Introduction
Suppose that we observe a small subset of entries of a large data matrix. The problem of inferring the many missing entries from this small set of observations is known as the matrix completion problem. This problem has attracted considerable attention in the past five years. The first works [7,6,5,11,21] introduce nuclear-norm minimization method. A different approach, called OP-TISPACE has been proposed in [12,13]. More recently, a method based on max-norm minimization was studied in [4,10]. Other methods include, for example, GROUSE (Grassmannian Rank-One Update Subspace Estimation) [1] and orthogonal rank-one matrix pursuit [25].
A quite popular direction in the matrix completion literature are the thresholding methods which can be divided in two groups: one-step thresholding methods and iterative thresholding methods. Strong theoretical guarantees were obtained for one-step thresholding procedures. For example, Koltchinskii et al in [17] introduce a soft-thresholding method and show that it is minimax optimal up to a logarithmic factor. In [14] Klopp consider a hard thresholding proceedure. Chatterjee [8] propose an universal singular value thresholding that can be applied to a large number of matrix estimation problems, including matrix completion. Despite strong theoretical guarantees these one-step thresholding methods has two important drawbacks: they show poor behavior in practice and only work under the uniform sampling distribution which is not realistic in many practical situations.
Much better practical performances have been shown by iterative thresholding methods. For example, in [3], Cai et al propose a first-order singular value thresholding algorithm SVT which approximately solves the nuclear norm minimization problem. In [19], Mazmuder et al introduce softImpute algorithm. softImpute produces a sequence of solutions that converges to a solution of the nuclear norm regularized least-squares problem when the number of iterations goes to infinity. These iterative thresholding algorithms are simple to implement, scale to relatively large matrices and in practice achieve competitive errors compared to the state-of-the-art algorithms. More recently Dhanjal et al [9] propose an improvement for the softImpute algorithm using randomized SVDs along with a novel updating method. This improvement allows to bypass the bottleneck in the algorithm which consists in the use of the singular value decomposition of a large matrix at each iteration.
The majority of existing algorithms for matrix completion are batch methods, that is, they operate on the full data matrix. However in some applications such as recommendation systems or localization in sensor networks we observe a sequence of data matrix M 1 , . . . , M T reviled sequentially where from M t to M t+1 we add new observations. In such situations the predictive rule should be refined incrementally. One advantage of iterative thresholding algorithms is that they can be adapted to such sequential learning, see for example [9].
In spite of their empirical success, the theoretical guarantees of such iterative thresholding methods are poorly understood. The goal of this paper is to provide strong theoretical guarantees, similar to those obtained for nuclearnorm penalization methods (see, for example [20,15]) and one step thresholding methods (see [17,14,8]) for a modification of the softImpute algorithm.

Contributions and Related Work
The contributions of the present paper to the theoretical study of the modified softImpute algorithm are multifaceted. In Section 3.2 we prove an upper bound on the estimation error of the outputM of our algorithm. Let M 0 ∈ R m1×m2 be the unknown matrix of interest. Suppose, for simplicity, that each entry is observed with the same probability p, then we prove the following upper bound on the estimation error ofM Here the symbol means that the inequality holds up to a multiplicative numerical constant. To the best of our knowledge, the upper bound on the estimation error given by (1) .
On the other hand, [17,20,15], among some other papers, consider a slightly different setting where the matrix completion problem is viewed as a particular case of the trace regression model. In this setting the number of observations n is fixed. The drawback here is that in this model each entry can be observed multiple times which is not the case in a large number of practical situations. We consider a different setting where each entry can be observed at most once (see Section 2.1). However, it is easy to see that these two settings are closely related if we put n = pm 1 m 2 . Comparing to (1), the bounds obtained in [17,20,15] have an additional log(d 1 + d 2 ) factor. Koltchinskii et al in [17] obtained lower bounds for the estimation error without this additional log(d 1 + d 2 ) factor. So our result answer the important theoretical question what is the exact minimax rate of convergence for matrix completion problem. As the lower bound in [17] is obtained for a different setting, in Section 4 we adapt their proof to our setting, showing that the minimax rate of convergence for matrix completion problem is given by (1) and that the estimator produced by our algorithm is minimax optimal. Note that our techniques can be adapted to the setting considered in [17,20,15] and lead to an upper bound without the additional log(d 1 + d 2 ) factor in this setting also.
Another important point is that a large part of matrix completion literature consider uniform sampling at random setting where each entry is observed with the same probability p. In many applications, such as recommendation systems, this assumption is not realistic. The theoretical analysis in the present paper is carried out for quite general sampling distributions and show that our iterative thresholding algorithm has good performances in such situations. Finally our results give theoretical insights for the chose of the parameters in the modified softImpute algorithm.

Organisation of the paper
The remainder of this paper is organized as follows. In Section 2.1 we introduce our model and the assumptions on the sampling scheme. For the reader's convenience, we collect notation which we use throughout the paper in Section 2.2.
In Section 3.1 we present a modification of the softImpute algorithm for matrix completion. The upper bounds on the estimation error are derived in Section 3.2. Finally the lower bounds are obtained in Section 4 and the Appendix contains the proofs.

Model and sampling scheme
Suppose that we observe a relatively small number of entries of a data matrix Here M 0 = (m ij ) ∈ R m1×m2 is the unknown matrix of interest and E = (ξ ij ) ∈ R m1×m2 is the matrix containing the noise. We assume that the noise variables ξ ij are independent, zero mean and bounded: We suppose that each entry of X is observed independently of the other entries. For the entry (i, j) ∈ [m 1 ] × [m 2 ], we denote the probability to be observed by π ij . Let η ij be the independent Bernoulli variables with parameters π ij and y ij = η ij (m ij + ξ ij ). Then, Y = (y ij ) is the matrix containing our observations. We denote by Ω the random set of observed indices.
In the simplest situation each coefficient is observed with the same probability, i.e. for every (i, j) ∈ [m 1 ]×[m 2 ], π ij = p. Unfortunately, such an assumption on the sampling distribution is not realistic in many practical applications. In the present paper, we consider general sampling model. We suppose that each coefficient is observed with a positive probability: Assumption 2. There exists p > 0 such that for any (i, j) ∈ {1, . . . , m 1 } × {1, . . . , m 2 } π ij ≥ p.

Assumption 2 implies that
We denote the column and row marginals by Suppose that we know an upper bound L on it's maximum: Note that we can easily get an estimation on this upper bound using the empirical frequenciesπ

Notation
We provide a brief summary of the notation used throughout this paper. Let A, B be matrices in R m1×m2 .
• For a matrix A, A ij is its (i, j)th entry.
• For any set I, |I| denotes its cardinal andĪ its complement. Let a ∨ b = max(a, b) and a ∧ b = min(a, b).
• For two matrices A, B ∈ R m1×m2 we define the scalar product • We denote by A 2 the usual l 2 −norm. Additionally, we use the following matrix norms: A * is the nuclear norm (the sum of singular values), A is the operator norm (the largest singular value), A ∞ is the largest absolute value of the entries: • π ij is the probability to observe the (i, j)-th element. For j = 1 . . . m 2 , , we define its restriction on I, A I , in the following way: • Let {ǫ ij } be an i.i.d. Rademacher sequence and X ij = e i (m 1 )e * j (m 2 ) where e k (l) are the canonical basis vectors in R l . We define

The Singular Value Thresholding Algorithm
In this section we introduce an iterative singular value thresholding algorithm and discuss its theoretical properties. We show that it enjoys strong theoretical guarantees and, unlike one-step thresholding procedures, is well adapted for general non-uniform sampling distributions.

Algorithm
Our algorithm is based on the softImpute algorithm proposed by Mazumder et al in [19]. SoftImpute algorithm is inspired by SVD-Impute of Troyanskaya et al [23]. It alternates between imputing the missing values from a current SVD, and updating the SVD using the data matrix.

Algorithm 1
Require : Matrix Y , regularization parameter λ and a, an upper bound on the sup-norm of M 0 .

OutputM .
This algorithm repeatedly replaces the missing entries with the current guess, update the guess by solving and truncating M new . Let us denote by (M k ) k≥0 the sequence of solutions produced by Algorithm 1. We have the following result :

Upper bound on the estimation error
In this section we derive an upper bound on the estimation error ofM produced by Algorithm 1. This bound is non-asymptotic and implies, in particular, that the proposed estimator is minimax optimal. We start by a general result which is proven in Appendix A.
Using Assumption 2, Theorem 2 implies the following bound on the estimation error measured in normalized Frobenius norm Corollary 3. Under assumptions of Theorem 2 and with probability at least In order to get a bound in a closed form we need to obtain a suitable upper bounds on E ( Σ R ) and, with probability close to 1, on Σ .
Lemma 4. Suppose that (ξ ij ) are independent and satisfy Assumption 1. Then, there exists absolute constants c * , C * > 0 such that, for all t > 0 with probability at least 1 − me −t 2 we have where L ≤ 1 is defined in (4).
This Lemma is proven in Appendix F. Taking t = 2 log(d) in Lemma 4, we get that with probability at least then, we can choose With this choice of λ we obtain the following Theorem.
Theorem 5. Let Assumptions 1 and 2 be satisfied and M 0 ∞ ≤ a. Then, with probability at least 1 − 8/d, Remark 1. Note that π ij ≥ p yields L ≥ M p. Then, the upper bound on the estimation error in the Theorem 5 is at least a constant times rank(M 0 ) pm .
So, in order to get a small estimation error, p should be larger then rank(M 0 ) m .
We denote by n = ij π ij the expected number of observations. Condition p ≥ rank(M 0 ) m implies the following condition on n n ≥ C rank(M 0 ) M.
When the rank of the matrix M 0 is small, this necessary number of observations is close to the number of degree of freedom of the matrix M 0 , which is Let us restrict our attention to the non-degenerated case M 0 = 0 (we can easily include this case replacing rank(M 0 ) by rank(M 0 )∨1). Assuming that the expected number of observations n is not too small, we can get simpler bound on the estimation error. Suppose that n > c * m log(d). Then, using Lm ≥ n ≥ c * m log d we get L ≥ c * log d and we can chose λ in the following way With this choice of λ we get the following bound on the estimation error In order to compare this result with previous results on noisy matrix completion we consider a more restrictive assumption on the sampling distribution. That is, we assume that this distribution is close to the uniform one: Assumption 3. There exists positives constants µ 1 and µ 2 independent on m 1 and m 2 and a 0 < p < 1 such that for every (i, j) ∈ {1, . . . , m 1 } × {1, . . . , m 2 } we have

Under this assumption Theorem 2 yields
Corollary 7. Let Assumptions 1 and 3 be satisfied and M 0 ∞ ≤ a. Assume that n ≥ m log(d) and λ given by (14). Then, with probability at least 1 − 8/d, Remark 2. Let us compare the bound given by Corollary 7 with bounds available in the literature. Our model was previously considered by Chatterjee in [8] in the case of uniform sampling distribution, that is π ij = p for any (i, j) ∈ {1, . . . , m 1 } × {1, . . . , m 2 }. In [8], Chatterjee introduces a simple estimation procedure, called Universal Singular Value Thresholding which is applied to a number of questions in low rank matrix estimation, blockmodels, distance matrix completion, latent space models and etc. For matrix completion problem and under the additional assumption p ≥ n −1+ǫ for some ǫ > 0, the bound obtained in [8] is the following one The rate of convergence given by Corollary 7 is faster and, as we will see in Section4, is minimax optimal. Note that the additional assumption p ≥ n −1+ǫ yields the following condition on the expected number of observations n > m ǫ M.
For low rank matrices, this necessary number of observations is larger than the number of observations required by our method and given by (13).
In [20,17,15] a closely related set up for matrix completion problem using the trace regression model was considered. The main difference between these two settings is that in the case of the trace regression the number of observations is not random and each entry may be observed multiple times. In our setting the number of observations is random and each entry is observed at most once. Comparing with Corollary 7 and using n = pm 1 m 2 we see that bounds obtained in [20,17,15] contain an additional logarithmic factor log(m 1 + m 2 ).

Minimax Lower bounds
In this section, we prove the minimax lower bound showing that the rates attained by our estimator are optimal. The minimax lower bound in a closely related problem was obtained by Koltchinskii et al in [17]. We adapt their proof to our set up.
For any integer 0 ≤ r ≤ min(m 1 , m 2 ) and any a > 0, we consider the class of matrices We will prove the lower bound in the case of the uniform sampling distribution, that is, we suppose that each entry is observed with the same probability p. As it was noted in Remark 1, in order to get a small estimation error we need to observe a sufficiently large number of entries, or, equivalently, the probability p should be larger then r/m. We prove a lower bound on the estimation risk when this condition is satisfied.
Theorem 8. Suppose that m 1 , m 2 ≥ 2 and p ≥ r m . Fix a > 0 and integer 1 ≤ r ≤ min(m 1 , m 2 ). Suppose that the variables ξ i are i.i.d. Gaussian N (0, σ 2 ), σ 2 > 0, for i = 1, . . . , n. Then, there exist absolute constants β ∈ (0, 1) and A Proof of Theorem 2 1. By Lemma 1 in [19],M minimizes Then, using the sub-gradient stationary conditions we have 2. We estimate each term in (17) separately. For the first term, we have that (Y − M 0 ) Ω = Σ where Σ = (i,j) η ij ξ ij X ij . Then, by the duality between the nuclear and the operator norms, we obtain For the second term, using again the duality between the nuclear and the operator norms and the stopping criteria for the Algorithm 1, we obtain 3. In order to estimate the third term, we use that by monotonicity of subdifferentails of convex functions we have that V Since P A (B) = P S ⊥ 1 (A) BP S2(A) + P S1(A) B and rank(P Si(A) B) ≤ rank(A) we have that rank(P A (B)) ≤ 2 rank(A). (22) Note that the subdifferential of the convex function A → A * is the following set of matrices (cf. [26]) Inequality (19) and (23) imply Using the fact that Now, by the duality between the nuclear and the operator norms, there exists W with W ≤ 1 and such that For this particular choice of W , (25) and (26) imply Putting (18), (19), and (27) into (17) and using λ ≥ 3 Σ we obtain

The triangle inequality and (22) lead to
and 5. For a 0 < r ≤ m we consider the following constrain set (32) Note that the condition A * ≤ √ r A 2 is satisfied if rank(A) ≤ r. We have the following result for matrices in C(r). Its proof is given in Appendix C.
Lemma 9. For all A ∈ C(r) with probability at least 1 − 8/d.
We now consider two cases, depending on whether the matrix M − M 0 3a belongs to the set C (72 rank(M 0 )) or not. Then (31) implies that 1 3a M − M 0 ∈ C (72 rank(M 0 )) and we can apply Lemma 9. From Lemma 9 and (29) we obtain that with probability at least 1 − 8/d one has which leads to the statement of the Theorem 2.

B Proof of Theorem 8
We adopt the proof of Theorem 5 in [17] to our setting. Assume w.l.o.g. that m 1 ≥ m 2 . For a γ ≤ 1, definẽ and consider the associated set of block matrices where O denotes the m 1 × (m 2 − r⌊m 2 /(2r)⌋) zero matrix, and ⌊x⌋ is the integer part of x.
By construction, any element of A as well as the difference of any two elements of A has rank at most r. In addition, condition p ≥ r m implies that the entries of any matrix in A take values in [0, a]. Thus, A ⊂ A(r, a).
The Varshamov-Gilbert bound (cf. Lemma 2.9 in [24]) guarantees the existence of a subset A 0 ⊂ A with cardinality Card(A 0 ) ≥ 2 (rM)/8 + 1 containing the zero m 1 × m 2 matrix 0 and such that, for any two distinct elements A 1 and A 2 of A 0 , Using that, conditionally on X i , the distributions of ξ i are Gaussian, we get that, for any A ∈ A 0 , the Kullback-Leibler divergence K P 0 , P A between P 0 and P A satisfies From (34) we deduce that the condition is satisfied for any α > 0 if γ > 0 is chosen as a sufficiently small numerical constant depending on α. In view of (33) and (35) and using the application of Theorem 2.5 in [24] implies for some absolute constants β ∈ (0, 1), which implies the statement of Theorem 8.

C Proof of Lemma 9
This proof is close to the proof of Lemma 12 in [15]. Set We will show that the probability of the following "bad" event is small Note that B contains the complement of the event that we are interested in. In order to estimate the probability of B we use a standard peeling argument.
Let ν = log(d) 0.0006 log (6/5) p and α = 6 5 . For l ∈ N set If the event B holds for some matrix A ∈ C(r), then A belongs to some S l and For T > ν consider the following set of matrices C(r, T ) = A ∈ C(r) : A 2 L2(Π) ≤ T and the following event Note that A ∈ S l implies that A ∈ C(r, α l ν). Then (37) implies that B l holds and we get B ⊂ ∪ B l . Thus, it is enough to estimate the probability of the simpler event B l and then apply the union bound. Such an estimation is given by the following lemma. Its proof is given in Appendix D. Let Lemma 10. We have that Lemma 10 implies that P (B l ) ≤ 4 exp(−c 1 p α l ν). Using the union bound we obtain where we used e x ≥ x. We finally compute for ν = log(d) 0.0006 p log (6/5) .
This completes the proof of Lemma 9.

D Proof of Lemma 10
We will start by showing that Z T concentrates around its expectation and then we will upper bound the expectation. Recall that by definition, We use the following Talagrand's concentration inequality : For a proof see [22] and [8]. Let f (x 11 , . . . , x m1m2 ) : = sup It is easy to see that f (x 11 , . . . , x m1m2 ) is a Lipschitz function with Lipschitz constant L = p −1 T . Indeed, where we used ||a| − |b|| ≤ |a − b|, A ∞ ≤ 1 and A 2 L2(Π) ≤ T . Now, Theorem 11 and 2 p −1 T ≤ T + p −1 imply Taking t = 1 9 1 3 T we get P Z T ≥ E(Z T ) + 768 p −1 + 1 9 with c 1 ≥ 0.0006.
Next we bound the expectation E (Z T ). Using a standard symmetrization argument (see e.g. [18]) we obtain where we have used (3). Then, by the duality between nuclear and operator norms, we compute Finally, using 1 9 T + 44 r p −1 (E ( Σ R )) 2 and the concentration bound (38) we obtain that with c 1 ≥ 0.0006 as stated.

E Proof of Lemma 1
It is easy to see that Thus, it is enough to show (8). The proof of (8) is close to the proof of Lemma 4 in [19]. Let us denote for byM k the solutions produced by Algorithm 1 after softthresholding step and before truncating step (6). We have that where in the second inequality we used the following result (see, for example, Lemma 3 in [19]) The inequality (40) shows that the sequence {Q(M k ,M k )} k≥1 converges. This and (40) yield Now, it is easy to see that

F Proof of Lemma 4
In order to prove (10), we use the following remarkable bound on the spectral norms of random matrices. It is obtained by extension to rectangular matrices via self-adjoint dilation of Corollary 3.12 and Remark 3.13 in [2] (cf., Section 3.1 in [2]).
In order to prove (11) we use the following result Proposition 14 (Corollary 3.3 in [2]). Let A be the m 1 ×m 2 rectangular matrix with A ij independent centered bounded random variables. Then, there exists a universal constant C * such that, where σ 1 , σ 2 , σ * are defined in Proposition 13.