Noisy Matrix Completion: Understanding Statistical Guarantees for Convex Relaxation via Nonconvex Optimization

This paper studies noisy low-rank matrix completion: given partial and corrupted entries of a large low-rank matrix, the goal is to estimate the underlying matrix faithfully and efficiently. Arguably one of the most popular paradigms to tackle this problem is convex relaxation, which achieves remarkable efficacy in practice. However, the theoretical support of this approach is still far from optimal in the noisy setting, falling short of explaining the empirical success. We make progress towards demystifying the practical efficacy of convex relaxation vis-à-vis random noise. When the rank of the unknown matrix is a constant, we demonstrate that the convex programming approach achieves near-optimal estimation errors --- in terms of the Euclidean loss, the entrywise loss, and the spectral norm loss --- for a wide range of noise levels. All of this is enabled by bridging convex relaxation with the nonconvex Burer-Monteiro approach, a seemingly distinct algorithmic paradigm that is provably robust against noise. More specifically, we show that an approximate critical point of the nonconvex formulation serves as an extremely tight approximation of the convex solution, allowing us to transfer the desired statistical guarantees of the nonconvex approach to its convex counterpart.


Introduction
Suppose we are interested in a large low-rank data matrix, but only get to observe a highly incomplete subset of its entries. Can we hope to estimate the underlying data matrix in a reliable manner? This problem, often dubbed low-rank matrix completion, spans a diverse array of science and engineering applications (e.g. collaborative filtering [RS05], localization [SY07], system identification [LV09], magnetic resonance parameter mapping [ZPL15], joint alignment [CC18a]), and has inspired a flurry of research activities in the past decade. In the statistics literature, matrix completion also falls under the category of factor models with a large amount of missing data, which finds numerous statistical applications such as controlling false discovery rates for dependent data [Efr07,Efr10,FHG12,FKSZ19], factor-adjusted variable selection [KS11,FKW18], principal component regression [Jol82,BN06,PBHT08,FXY17], and large covariance matrix estimation [FLM13,FWZ19]. Recent years have witnessed the development of many tractable algorithms that come with statistical guarantees, with convex relaxation being one of the most popular paradigms [FHB04,CR09,CT10]. See [DR16,CC18b] for an overview of this topic. This paper focuses on noisy low-rank matrix completion, assuming that the revealed entries are corrupted by a certain amount of noise. Setting the stage, consider the task of estimating a rank-r data matrix M^⋆ = [M^⋆_ij]_{1≤i,j≤n} ∈ R^{n×n}, and suppose that this needs to be performed on the basis of a subset of noisy entries

M_ij = M^⋆_ij + E_ij,  (i, j) ∈ Ω,    (1)

where Ω ⊆ {1, · · · , n} × {1, · · · , n} denotes a set of indices, and E_ij stands for the additive noise at the location (i, j). As we shall elaborate shortly, solving noisy matrix completion via convex relaxation, while practically exhibiting excellent stability (in terms of the estimation errors against noise), is far less understood theoretically compared to the noiseless setting.

Convex relaxation: limitations of prior results
Naturally, one would search for a low-rank solution that best fits the observed entries. One choice is the regularized least-squares formulation

minimize_{Z ∈ R^{n×n}}  (1/2) Σ_{(i,j)∈Ω} (Z_ij − M_ij)² + λ · rank(Z),    (2)

where λ > 0 is some regularization parameter. In words, this approach optimizes a certain trade-off between the goodness of fit (through the squared loss in the first term of (2)) and the low-rank structure (through the rank function in the second term of (2)). Due to the computational intractability of rank minimization, we often resort to convex relaxation in order to obtain computationally feasible solutions. One notable example is the following convex program:

minimize_{Z ∈ R^{n×n}}  g(Z) := (1/2) Σ_{(i,j)∈Ω} (Z_ij − M_ij)² + λ‖Z‖_*,    (3)

where ‖Z‖_* denotes the nuclear norm (i.e. the sum of singular values) of Z, a convex surrogate for the rank function. A significant portion of existing theory supports the use of this paradigm in the noiseless setting: when E_ij vanishes for all (i, j) ∈ Ω, the solution to (3) is known to be faithful (i.e. the estimation error becomes zero) even under near-minimal sample complexity [CR09, CP10, CT10, Gro11, Rec11, Che15]. By contrast, the performance of convex relaxation remains largely unclear when it comes to more practically relevant noisy scenarios. To begin with, the stability of an equivalent variant of (3) against noise was first studied by Candès and Plan [CP10]. The estimation error ‖Z_cvx − M^⋆‖_F derived therein, of the solution Z_cvx to (3), is significantly larger than the oracle lower bound. This does not explain well the effectiveness of (3) in practice. In fact, the numerical experiments reported in [CP10] already indicated that the performance of convex relaxation is far better than their theoretical bounds. In order to improve the statistical guarantees, several variants of (3) have been put forward, most notably by Negahban and Wainwright [NW12] and Koltchinskii et al. [KLT11]; see Section 1.4 for more details. Nevertheless, the stability analyses in [NW12, KLT11] can be suboptimal when the noise magnitude is not sufficiently large; in fact, their estimation error bounds do not vanish as the size of the noise approaches zero.
All of these give rise to the following natural yet challenging questions: Where does the convex program (3) stand in terms of its stability vis-à-vis additive noise? Can we establish a theoretical statistical guarantee that matches its practical effectiveness? Is it capable of accommodating a wider (and more practical) range of noise levels?

A detour: nonconvex optimization
While the focus of the current paper is convex relaxation, we take a moment to discuss a seemingly distinct algorithmic paradigm: nonconvex optimization, which turns out to be remarkably helpful in understanding convex relaxation. Inspired by the Burer-Monteiro approach [BM03], the nonconvex scheme starts by representing the rank-r decision matrix (or parameters) Z as Z = XY^⊤ via low-rank factors X, Y ∈ R^{n×r}, and proceeds by solving the following nonconvex (regularized) least-squares problem directly [KMO10a]

minimize_{X,Y ∈ R^{n×r}}  (1/2) Σ_{(i,j)∈Ω} ((XY^⊤)_ij − M_ij)² + reg(X, Y).    (4)

Here, reg(·, ·) denotes a certain regularization term that promotes additional structural properties.
To see its intimate connection [MHT10, SSS11] with the convex program (3), we make the following observation: if the solution to (3) has rank r, then it must coincide with the solution to

minimize_{X,Y ∈ R^{n×r}}  (1/2) Σ_{(i,j)∈Ω} ((XY^⊤)_ij − M_ij)² + (λ/2)‖X‖²_F + (λ/2)‖Y‖²_F.    (5)

This can be easily verified by recognizing the elementary fact that

‖Z‖_* = min_{X,Y ∈ R^{n×r}: XY^⊤ = Z} (1/2)(‖X‖²_F + ‖Y‖²_F)

for any rank-r matrix Z [SS05, MHT10]. Note, however, that it is very challenging to predict when the key assumption in establishing this connection, namely the rank-r assumption on the solution to (3), can possibly hold. Despite the nonconvexity of (4), simple first-order optimization methods, in conjunction with proper initialization, are often effective in solving (4). Examples include gradient descent on manifolds [KMO10a,KMO10b,WCCL16], gradient descent [SL16,MWCC17], and projected gradient descent [CW15,ZL16]. Apart from their practical efficiency, the nonconvex optimization approach is also appealing in theory. To begin with, algorithms tailored to (4) often enable exact recovery in the noiseless setting. Perhaps more importantly, for a wide range of noise settings, the nonconvex approach achieves appealing estimation accuracy [CW15,MWCC17], which can be significantly better than the bounds derived for convex relaxation discussed earlier. See [CLC18,CC18b] for a summary of recent results. Such intriguing statistical guarantees motivate us to take a closer inspection of the underlying connection between the two contrasting algorithmic frameworks.
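To make the objective in (5) concrete, here is a minimal NumPy sketch of its evaluation and gradients; the function names and the boolean-mask encoding of Ω are our own illustrative choices, not notation from this paper.

```python
import numpy as np

def f5(X, Y, M, mask, lam):
    """Objective (5): squared loss over observed entries plus
    (lam/2) * (||X||_F^2 + ||Y||_F^2)."""
    resid = mask * (X @ Y.T - M)              # P_Omega(X Y^T - M)
    return 0.5 * np.sum(resid**2) + 0.5 * lam * (np.sum(X**2) + np.sum(Y**2))

def grad_f5(X, Y, M, mask, lam):
    """Gradients of (5) with respect to X and Y."""
    resid = mask * (X @ Y.T - M)
    return resid @ Y + lam * X, resid.T @ X + lam * Y
```

Running gradient descent on these two gradients is precisely the nonconvex routine whose iterates are analyzed later in the paper.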

Empirical evidence: convex and nonconvex solutions are often close
In order to obtain a better sense of the relationship between convex and nonconvex approaches, we begin by comparing the estimates returned by the two approaches via numerical experiments. Fix n = 1000 and r = 5. We generate M^⋆ = X^⋆Y^{⋆⊤}, where X^⋆, Y^⋆ ∈ R^{n×r} are random orthonormal matrices. Each entry M^⋆_ij of M^⋆ is observed with probability p = 0.2 independently, and then corrupted by independent Gaussian noise E_ij ∼ N(0, σ²). Throughout the experiments, we set λ = 5σ√(np). The convex program (3) is solved by the proximal gradient method [PB14], whereas we attempt to solve the nonconvex formulation (5) by gradient descent with spectral initialization (see [CLC18] for details). Let Z_cvx (resp. Z_ncvx = X_ncvx Y_ncvx^⊤) be the solution returned by the convex program (3) (resp. the nonconvex program (5)). Figure 1 displays the relative estimation errors of both methods (‖Z_cvx − M^⋆‖_F/‖M^⋆‖_F and ‖Z_ncvx − M^⋆‖_F/‖M^⋆‖_F) as well as the relative distance ‖Z_cvx − Z_ncvx‖_F/‖M^⋆‖_F between the two estimates. The results are averaged over 20 independent trials. Interestingly, the distance between the convex and the nonconvex solutions seems extremely small (e.g. ‖Z_cvx − Z_ncvx‖_F/‖M^⋆‖_F is typically below 10^{−7}); in comparison, the relative estimation errors of both Z_cvx and Z_ncvx are substantially larger. In other words, the estimate returned by the nonconvex approach serves as a remarkably accurate approximation of the convex solution. Given that the nonconvex approach is often guaranteed to achieve intriguing statistical guarantees vis-à-vis random noise [MWCC17], this suggests that the convex program is equally stable, a phenomenon that was not captured by prior theory [CP10]. Can we leverage existing theory for the nonconvex scheme to improve the statistical analysis of the convex relaxation approach?
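For readers who wish to replicate the flavor of this comparison, below is a self-contained sketch of the data-generating process together with a basic proximal gradient solver for (3) (singular value soft-thresholding). It is a sketch under our own conventions: we shrink n for speed, and the seed, iteration count, and helper names are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, p, sigma = 200, 5, 0.2, 1e-3               # smaller n than the paper, for speed

# Ground truth M* = X* Y*^T with random orthonormal factors.
X_star = np.linalg.qr(rng.standard_normal((n, r)))[0]
Y_star = np.linalg.qr(rng.standard_normal((n, r)))[0]
M_star = X_star @ Y_star.T

mask = rng.random((n, n)) < p                     # Bernoulli sampling of Omega
M = mask * (M_star + sigma * rng.standard_normal((n, n)))  # noisy data on Omega
lam = 5 * sigma * np.sqrt(n * p)                  # lambda = 5 sigma sqrt(np)

def svt(Z, tau):
    """Prox of tau * ||.||_*: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

# Proximal gradient for (3); the smooth part has Lipschitz constant 1.
Z = np.zeros((n, n))
for _ in range(300):
    Z = svt(Z - mask * (Z - M), lam)

print(np.linalg.norm(Z - M_star) / np.linalg.norm(M_star))   # relative error
```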

Models and main results
The numerical experiments reported in Section 1.3 suggest an alternative route for analyzing convex relaxation for noisy matrix completion. If one can formally justify the proximity between the convex and the nonconvex solutions, then it is possible to propagate the appealing stability guarantees from the nonconvex scheme to the convex approach. As it turns out, this simple idea leads to significantly enhanced statistical guarantees for the convex program (3), which we formally present in this subsection.
Before proceeding, we introduce a few model assumptions that play a crucial role in our theory.

Figure 1: Relative estimation errors of Z_cvx (the estimate returned by the convex program (3)) and Z_ncvx (the estimate returned by the nonconvex approach tailored to (5)), and the relative distance between them, vs. the standard deviation σ of the noise. The results are reported for n = 1000, r = 5, p = 0.2, λ = 5σ√(np) and are averaged over 20 independent trials.
Assumption 1.
(a) (Random sampling) Each index (i, j) belongs to the index set Ω independently with probability p.
(b) (Random noise) The noise matrix E = [E_ij]_{1≤i,j≤n} consists of i.i.d. zero-mean sub-Gaussian entries with sub-Gaussian norm at most σ.
In addition, let M^⋆ = U^⋆Σ^⋆V^{⋆⊤} be the singular value decomposition (SVD) of M^⋆, where U^⋆, V^⋆ ∈ R^{n×r} consist of orthonormal columns and Σ^⋆ = diag(σ_1, σ_2, · · · , σ_r) ∈ R^{r×r} is a diagonal matrix obeying σ_max := σ_1 ≥ σ_2 ≥ · · · ≥ σ_r =: σ_min. Denote by κ := σ_max/σ_min the condition number of M^⋆. We impose the following incoherence condition on M^⋆, which is known to be crucial for reliable recovery of M^⋆ [CR09,Che15].
Definition 1 (Incoherence). A rank-r matrix M^⋆ with SVD M^⋆ = U^⋆Σ^⋆V^{⋆⊤} is said to be μ-incoherent if

‖U^⋆‖_{2,∞} ≤ √(μ/n) ‖U^⋆‖_F = √(μr/n)  and  ‖V^⋆‖_{2,∞} ≤ √(μ/n) ‖V^⋆‖_F = √(μr/n).

Here, ‖U‖_{2,∞} denotes the largest ℓ₂ norm of all rows of a matrix U.
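As a quick sanity check on Definition 1, one can compute the smallest μ for which a given rank-r matrix is μ-incoherent. The helper below is a sketch of ours, not code from the paper.

```python
import numpy as np

def incoherence(M, r):
    """Smallest mu such that M (rank r) is mu-incoherent:
    mu = (n/r) * max squared row norm of U (and of V)."""
    n = M.shape[0]
    U, _, Vt = np.linalg.svd(M)
    U, V = U[:, :r], Vt[:r].T                     # top-r singular subspaces
    mu_U = (n / r) * np.max(np.sum(U**2, axis=1))
    mu_V = (n / r) * np.max(np.sum(V**2, axis=1))
    return max(mu_U, mu_V)
```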
With these in place, we are ready to present our improved statistical guarantees for convex relaxation.
Theorem 1. Let M^⋆ be rank-r and μ-incoherent with a condition number κ. Suppose that Assumption 1 holds and take λ = C_λ σ√(np) in (3) for some large enough constant C_λ > 0. Assume that the sample size obeys n²p ≥ Cκ⁴μ²r²n log³ n for some sufficiently large constant C > 0, and that the noise satisfies σ√(n/p) ≤ c σ_min/√(κ⁴μr log n) for some sufficiently small constant c > 0. Then with probability exceeding 1 − O(n^{−3}):

1. Any minimizer Z_cvx of (3) obeys the Euclidean, spectral, and entrywise error bounds in (7).

2. Letting Z_cvx,r := argmin_{Z: rank(Z)≤r} ‖Z − Z_cvx‖_F be the best rank-r approximation of Z_cvx, we have the proximity bound (8), and the error bounds in (7) continue to hold if Z_cvx is replaced by Z_cvx,r.
Remark 2. The factor 1/n 3 in (8) can be replaced by 1/n c for an arbitrarily large fixed constant c > 0.
Several implications of Theorem 1 follow immediately. The discussions below concentrate on the case where r, µ and κ are all O(1), under our random noise assumption.
• Improved stability guarantees. Our results reveal that the Euclidean error of any convex optimizer Z_cvx of (3) obeys

‖Z_cvx − M^⋆‖_F ≲ σ√(n/p),    (9)

implying that the performance of convex relaxation degrades gracefully as the signal-to-noise ratio decreases. This result matches the minimax lower bound derived in [KLT11,NW12] for the range of noise obeying σ ≲ ‖M^⋆‖_∞, which improves upon prior art in the following aspects:

- Candès and Plan [CP10] provided a stability guarantee in the presence of arbitrary bounded noise. When applied to the random noise model assumed here, their results yield ‖Z_cvx − M^⋆‖_F ≲ σn^{3/2}, which could be O(√(n²p)) times more conservative than our bound (9).
- Koltchinskii et al. [KLT11] studied a modified convex program that replaces the squared loss over Ω in (3) with a full squared loss against the rescaled zero-padded data p^{−1}P_Ω(M), where M_ij is set to 0 for any unobserved entry (i.e. those with (i, j) ∉ Ω). This variant effectively performs singular value thresholding on a rescaled zero-padded data matrix. Under our conditions, their results read (up to some logarithmic factor)

‖Ẑ − M^⋆‖_F ≲ (σ + ‖M^⋆‖_∞) √(n/p),    (10)

where Ẑ is the estimate returned by their algorithm. This becomes suboptimal when σ ≪ ‖M^⋆‖_∞, a highly relevant regime covered by our analysis.
- Negahban and Wainwright [NW12] proposed to enforce an extra constraint ‖Z‖_∞ ≤ α when solving (3), in order to explicitly control the spikiness of the estimate. When applied to our model, their error bound is the same as (10) (modulo some log factor), which also becomes increasingly loose as σ decreases. In addition, the choice of α may add unwanted variations in practice.
• Nearly low-rank structure of the convex solution. In light of (8), the optimizer of the convex program (3) is almost, if not exactly, rank-r. When the true rank r is known a priori, it is not uncommon for practitioners to return the rank-r approximation of Z_cvx. Our theorem formally justifies that there is no loss of statistical accuracy, measured in terms of either ‖·‖_F, ‖·‖, or ‖·‖_∞, when performing the rank-r projection operation (see the short sketch following this list).
• Entrywise and spectral norm error control. Moving beyond the Euclidean loss, our theory uncovers that the estimation errors of the convex optimizer are fairly spread out across all entries, thus implying near-optimal entrywise error control. This is a stronger form of error bound, as optimal Euclidean estimation accuracy alone does not preclude the possibility of the estimation errors being spiky and localized. Furthermore, the spectral norm error of the convex optimizer is also well controlled. Figure 2 displays the relative estimation errors in both the ℓ_∞ norm and the spectral norm, under the same setting as in Figure 1. As can be seen, both forms of estimation errors scale linearly with the noise level, corroborating our theory.

• Implicit regularization. As a byproduct of the entrywise error control, this result indicates that the additional constraint ‖Z‖_∞ ≤ α suggested by [NW12] is automatically satisfied and is hence unnecessary. In other words, the convex approach implicitly controls the spikiness of its entries, without resorting to explicit regularization.

• Statistical guarantees for convex optimization algorithms. In practice, the convex program (3) is solved via iterative convex optimization algorithms, e.g. [BT09,TY10], to name just a few. Our theory immediately provides statistical guarantees for these algorithms. As we shall make precise in Section 2, any point Z with g(Z) ≤ g(Z_cvx) + ε (where g(·) is defined in (3)) enjoys the same error bounds as in (7) (with Z_cvx replaced by Z in (7)), provided that ε > 0 is sufficiently small. In other words, when these convex optimization algorithms converge w.r.t. the objective value, they are guaranteed to return a statistically reliable estimate.
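The rank-r projection referenced in the list above is simply a truncated SVD, by the Eckart-Young theorem; a one-function sketch (assuming NumPy, with a helper name of our choosing):

```python
import numpy as np

def best_rank_r(Z, r):
    """Best rank-r approximation of Z in Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]
```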
Finally, we remark that our results are likely suboptimal when r and κ are allowed to scale with n. Interested readers are referred to Section 4 for more detailed discussions.

Strategy and novelty
In this section, we introduce the strategy for proving our main theorem. Informally, the main technical difficulty stems from the lack of closed-form expressions for the primal solution to (3), which in turn makes it difficult to construct a dual certificate. This is in stark contrast to the noiseless setting, where one clearly anticipates the ground truth M^⋆ to be the primal solution; in fact, this is precisely why the analysis for the noisy case is significantly more challenging. Our strategy, as we shall detail below, mainly entails invoking an iterative nonconvex algorithm to "approximate" such a primal solution. Before continuing, we introduce a few more notations. Let P_Ω(·) : R^{n×n} → R^{n×n} represent the projection onto the subspace of matrices supported on Ω, namely,

[P_Ω(Z)]_ij = Z_ij if (i, j) ∈ Ω, and 0 otherwise,

for any matrix Z ∈ R^{n×n}. For a rank-r matrix M with singular value decomposition UΣV^⊤, denote by T its tangent space, i.e.

T := {UA^⊤ + BV^⊤ : A, B ∈ R^{n×r}}.
Correspondingly, let P_T(·) be the orthogonal projection onto the subspace T, that is,

P_T(Z) = UU^⊤Z + ZVV^⊤ − UU^⊤ZVV^⊤    (13)

for any matrix Z ∈ R^{n×n}. In addition, let T^⊥ and P_{T^⊥}(·) denote the orthogonal complement of T and the projection onto T^⊥, respectively. With regards to the ground truth, we denote

X^⋆ := U^⋆(Σ^⋆)^{1/2}  and  Y^⋆ := V^⋆(Σ^⋆)^{1/2},

so that M^⋆ = X^⋆Y^{⋆⊤}. The nonconvex problem (5) is equivalent to

minimize_{X,Y ∈ R^{n×r}}  f(X, Y) := (1/(2p)) ‖P_Ω(XY^⊤ − M)‖²_F + (λ/(2p)) ‖X‖²_F + (λ/(2p)) ‖Y‖²_F,    (15)

where we have inserted an extra factor 1/p (compared to (5)) to simplify the presentation of the analysis later on.
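For concreteness, the two projections just defined translate directly into code; the following sketch (our own naming) mirrors the boolean-mask convention of the earlier snippets.

```python
import numpy as np

def P_Omega(Z, mask):
    """Projection onto matrices supported on Omega: zero out unobserved entries."""
    return mask * Z

def P_T(Z, U, V):
    """Projection onto the tangent space T of U Sigma V^T, cf. (13):
    P_T(Z) = U U^T Z + Z V V^T - U U^T Z V V^T."""
    UUtZ = U @ (U.T @ Z)
    ZVVt = (Z @ V) @ V.T
    return UUtZ + ZVVt - UUtZ @ V @ V.T

def P_T_perp(Z, U, V):
    """Projection onto the orthogonal complement of T."""
    return Z - P_T(Z, U, V)
```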

Exact duality
In order to analyze the convex program (3), it is natural to start with the first-order optimality condition. Specifically, suppose that Z ∈ R^{n×n} is a (primal) solution to (3) with SVD Z = UΣV^⊤. As before, let T be the tangent space of Z, and let T^⊥ be the orthogonal complement of T. Then the first-order optimality condition for (3) reads: there exists a matrix W ∈ T^⊥ (called a dual certificate) such that

UV^⊤ + W = λ^{−1} P_Ω(M − Z),    (16a)
‖W‖ < 1.    (16b)

This condition is not only necessary to certify the optimality of Z, but also "almost sufficient" in guaranteeing the uniqueness of the solution Z; see Appendix B for in-depth discussions. The challenge then boils down to identifying such a primal-dual pair (Z, W) satisfying the optimality condition (16). For the noise-free case, the primal solution is clearly Z = M^⋆ if exact recovery is to be expected; the dual certificate can then be either constructed exactly as the least-squares solution to a certain underdetermined linear system [CR09,CT10], or produced approximately via a clever golfing scheme pioneered by Gross [Gro11]. For the noisy case, however, it is often difficult to hypothesize on the primal solution Z, as it depends on the random noise in a complicated way. In fact, the lack of a suitable guess of Z (and hence W) was the major hurdle that prior works faced when carrying out the duality analysis.

A candidate primal solution via nonconvex optimization
Motivated by the numerical experiment in Section 1.3, we propose to examine whether the optimizer of the nonconvex problem (5) stays close to the solution to the convex program (3). Towards this, suppose that X, Y ∈ R^{n×r} form a critical point of (5) with rank(X) = rank(Y) = r. Then the first-order condition reads

P_Ω(XY^⊤ − M) Y + λX = 0  and  P_Ω(XY^⊤ − M)^⊤ X + λY = 0.    (17)

To develop some intuition about the connection between (16) and (17), let us take a look at the case with r = 1. Denote X = x and Y = y, and assume that the two rank-1 factors are "balanced", namely, ‖x‖₂ = ‖y‖₂ ≠ 0. It then follows from (17) that λ^{−1} P_Ω(M − xy^⊤) has a singular value 1, whose corresponding left and right singular vectors are x/‖x‖₂ and y/‖y‖₂, respectively. In other words, one can express

λ^{−1} P_Ω(M − xy^⊤) = (x/‖x‖₂)(y/‖y‖₂)^⊤ + W,    (18)

where W is orthogonal to the tangent space of xy^⊤; this is precisely the condition (16a). It remains to argue that (16b) is valid as well. Towards this end, the first-order condition (17) alone is insufficient, as there might be non-global critical points (e.g. saddle points) that are unable to approximate the convex solution well. Fortunately, as long as the candidate xy^⊤ is not far away from the ground truth M^⋆, one can guarantee ‖W‖ < 1 as required in (16b).
The above informal argument about the link between the convex and the nonconvex problems can be rigorized. To begin with, we introduce the following conditions on the regularization parameter λ.

Condition 1. The regularization parameter λ obeys
(a) ‖P_Ω(E)‖ < λ/8;
(b) ‖P_Ω(XY^⊤ − M^⋆) − p(XY^⊤ − M^⋆)‖ < λ/8.

Remark 3. Condition 1 requires that the regularization parameter λ dominate a certain norm of the noise, as well as of the deviation of P_Ω(XY^⊤ − M^⋆) from its mean p(XY^⊤ − M^⋆); as will be seen shortly, the latter condition can be met when (X, Y) is sufficiently close to (X^⋆, Y^⋆).
With the above condition in place, the following result demonstrates that a critical point (X, Y ) of the nonconvex problem (5) readily translates to the unique minimizer of the convex program (3). This lemma is established in Appendix C.1.
Lemma 1 (Exact nonconvex vs. convex optimizers). Suppose that (X, Y) is a critical point of (5) satisfying rank(X) = rank(Y) = r, and that the sampling operator P_Ω is injective when restricted to elements of the tangent space T of XY^⊤, namely,

P_Ω(H) = 0  implies  H = 0,  for all H ∈ T.    (19)

Then under Condition 1, the point Z := XY^⊤ is the unique minimizer of (3).
In order to apply Lemma 1, one needs to locate a critical point of (5) that is sufficiently close to the truth, for which one natural candidate is the global optimizer of (5). The caveat, however, is the lack of theory directly characterizing the properties of the optimizer of (5). Instead, what is available in prior theory is a characterization of certain iterative sequences (e.g. gradient descent iterates) aimed at solving (5). It is unclear from prior theory whether the iterative algorithm under study (e.g. gradient descent) converges to the global optimizer in the presence of noise. This leads to technical difficulty in justifying the proximity between the nonconvex optimizer and the convex solution via Lemma 1.

Approximate nonconvex optimizers
Fortunately, perfect knowledge of the nonconvex optimizer is not pivotal. Instead, an approximate solution to the nonconvex problem (5) (or equivalently (15)) suffices to serve as a reasonably tight approximation of the convex solution. More precisely, we seek two factors (X, Y) whose gradient is nearly zero (rather than exactly zero), in the sense of the small-gradient condition (21) below, where f(·, ·) is the nonconvex objective function defined in (15). This relaxes the condition discussed in Lemma 1 (which applies only to critical points of (5) as opposed to approximate critical points). As it turns out, such points can be found via gradient descent tailored to (5). The sufficiency of the near-zero gradient condition is made possible by slightly strengthening the injectivity assumption (19), as stated below.
Condition 2 (Injectivity). Let T be the tangent space of XY^⊤. There is a quantity c_inj > 0 such that

p^{−1} ‖P_Ω(H)‖²_F ≥ c_inj ‖H‖²_F,  for all H ∈ T.    (20)

The following lemma states quantitatively how an approximate nonconvex optimizer serves as an excellent proxy for the convex solution; we establish it in Appendix C.2.
Lemma 2 (Approximate nonconvex vs. convex optimizers). Suppose that (X, Y) obeys the small-gradient condition (21) for some sufficiently small constant c > 0, and further assume that every singular value of X and Y lies in [√(σ_min/2), √(2σ_max)]. Then under Conditions 1-2, any minimizer Z_cvx of (3) satisfies the proximity bound (22).

Remark 4. In fact, this lemma continues to hold if Z_cvx is replaced by any Z obeying g(Z) ≤ g(XY^⊤), where g(·) is the objective function defined in (3) and X and Y are low-rank factors obeying the conditions of Lemma 2. This is important in providing statistical guarantees for iterative methods like SVT [CCS10], FISTA [BT09], etc. To be more specific, suppose that (X, Y) yields an approximate optimizer of (3), namely, g(XY^⊤) = g(Z_cvx) + ε for some sufficiently small ε > 0. Then any Z obeying g(Z) ≤ g(XY^⊤) = g(Z_cvx) + ε satisfies the bound (23). As a result, as long as the above-mentioned algorithms converge in terms of the objective value, they must return a solution obeying (23), which is exceedingly close to XY^⊤ if ‖∇f(X, Y)‖_F is small.
It is clear from Lemma 2 that, as the size of the gradient ∇f(X, Y) gets smaller, the nonconvex estimate XY^⊤ becomes an increasingly tighter approximation of any convex optimizer of (3), which is consistent with Lemma 1. In contrast to Lemma 1, due to the lack of strong convexity, a nonconvex estimate with a near-zero gradient does not imply the uniqueness of the optimizer of the convex program (3); rather, it indicates that any minimizer of (3) lies within a sufficiently small neighborhood surrounding XY^⊤ (cf. (22)).

Construction of an approximate nonconvex optimizer
So far, Lemmas 1-2 are both deterministic results based on Condition 1. As we will soon see, under Assumption 1, we can derive simpler conditions that, with high probability, guarantee Condition 1. We start with Condition 1(a).

Lemma 3. With probability at least 1 − O(n^{−10}), one has ‖P_Ω(E)‖ ≲ σ√(np); consequently, Condition 1(a) holds as long as λ = C_λ σ√(np) for some sufficiently large constant C_λ > 0.

Proof. This follows from [CW15, Lemma 11] with a slight and straightforward modification to accommodate the asymmetric noise here. For brevity, we omit the proof.

Turning attention to Condition 1(b) and Condition 2, we have the following lemma, whose proof is deferred to Appendix C.3.

Lemma 4. With high probability, Condition 1(b) and Condition 2 hold simultaneously for all (X, Y) obeying the proximity condition (24). Here, T denotes the tangent space of XY^⊤, and C_∞ > 0 is some absolute constant.
This lemma is a uniform result, namely, the bounds hold irrespective of the statistical dependency between (X, Y ) and Ω. As a consequence, to demonstrate the proximity between the convex and the nonconvex solutions (cf. (22)), it remains to identify a point (X, Y ) with vanishingly small gradient (cf. (21)) that is sufficiently close to the truth (cf. (24)).
As we already alluded to previously, a simple gradient descent algorithm aimed at solving the nonconvex problem (5) might help us produce an approximate nonconvex optimizer. This procedure is summarized in Algorithm 1. Our hope is this: when initialized at the ground truth and run for sufficiently many iterations, the GD trajectory produced by Algorithm 1 will contain at least one approximate stationary point of (5) with the desired properties (21) and (24). We shall note that Algorithm 1 is not practical, since it starts from the ground truth (X^⋆, Y^⋆); this is an auxiliary step mainly to simplify the theoretical analysis. While we can certainly make it practical by adopting spectral initialization as in [MWCC17, CLL19], doing so would require lengthier proofs without further improving our statistical guarantees.
Algorithm 1 Construction of an approximate primal solution.
Initialization: X⁰ = X^⋆, Y⁰ = Y^⋆.
Gradient updates: for t = 0, 1, . . . , t₀ − 1, do

X^{t+1} = X^t − η ∇_X f(X^t, Y^t)  and  Y^{t+1} = Y^t − η ∇_Y f(X^t, Y^t).    (25)

Here, η > 0 is the step size.
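In code, Algorithm 1 is plain gradient descent on f in (15), started at the truth; the following is a sketch under the conventions of the earlier snippets (helper name ours), returning the whole trajectory so that a small-gradient iterate can be selected afterwards.

```python
def algorithm1(X_star, Y_star, M, mask, lam, p, eta, t0):
    """Gradient descent on f in (15) (note the 1/p rescaling),
    initialized at the ground truth (X*, Y*)."""
    X, Y = X_star.copy(), Y_star.copy()
    iterates = [(X.copy(), Y.copy())]
    for _ in range(t0):
        resid = mask * (X @ Y.T - M)          # P_Omega(X Y^T - M)
        gX = (resid @ Y + lam * X) / p        # gradient of f w.r.t. X
        gY = (resid.T @ X + lam * Y) / p      # gradient of f w.r.t. Y
        X, Y = X - eta * gX, Y - eta * gY
        iterates.append((X.copy(), Y.copy()))
    return iterates
```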

Properties of the nonconvex iterates
In this subsection, we build upon the literature on nonconvex low-rank matrix completion to justify that the estimates returned by Algorithm 1 satisfy the requirement stated in (24). Our theory is largely established upon the leave-one-out strategy introduced by Ma et al. [MWCC17], which is an effective analysis technique to control the ℓ_{2,∞} error of the estimates. This strategy has recently been extended by Chen et al. [CLL19] to the more general rectangular case with an improved sample complexity bound. Before continuing, we introduce several useful notations. Notice that the matrix product XY^⊤ is invariant under global orthonormal transformation, namely, for any orthonormal matrix R ∈ R^{r×r} one has XR(YR)^⊤ = XY^⊤. Viewed in this light, we shall consider distance metrics modulo global rotation. In particular, the theory relies heavily on a specific global rotation matrix defined as

H^t := argmin_{R ∈ O^{r×r}} ‖F^t R − F^⋆‖_F,    (26)

where O^{r×r} is the set of r × r orthonormal matrices, and F^t (resp. F^⋆) denotes the matrix obtained by stacking X^t on top of Y^t (resp. X^⋆ on top of Y^⋆).
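The minimization in (26) is the classical orthogonal Procrustes problem, solved in closed form by an SVD; a short sketch (helper name ours):

```python
import numpy as np

def best_rotation(F_t, F_star):
    """Solve (26): the orthonormal R minimizing ||F_t R - F_star||_F.
    If A S B^T is the SVD of F_t^T F_star, the minimizer is R = A B^T."""
    A, _, Bt = np.linalg.svd(F_t.T @ F_star)
    return A @ Bt
```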
We are now ready to present the performance guarantees for Algorithm 1.
Lemma 5 (Quality of the nonconvex estimates). Instate the notation and hypotheses of Theorem 1. With probability at least 1 − O(n^{−3}), the iterates {(X^t, Y^t)}_{0≤t≤t₀} of Algorithm 1 satisfy the Frobenius, spectral, and ℓ_{2,∞} error bounds in (27) as well as the small-gradient bound (28), where C_F, C_op, C_∞ > 0 are some absolute constants, provided that η ≍ 1/(nκ³σ_max) and that t₀ = n¹⁸.
This lemma, which we establish in Appendix D, reveals that for a polynomially large number of iterations, all iterates of the gradient descent sequence, when initialized at the ground truth, remain fairly close to the true low-rank factors. This holds in terms of the estimation errors measured by the Frobenius norm, the spectral norm, and the ℓ_{2,∞} norm. In particular, the proximity in terms of the ℓ_{2,∞} norm error plays a pivotal role in implementing our analysis strategy (particularly Lemmas 2-4) described previously. In addition, this lemma (cf. (28)) guarantees the existence of a small-gradient point within the sequence {(X^t, Y^t)}_{0≤t≤t₀}, a somewhat straightforward property of GD tailored to smooth problems [Nes12]. This in turn enables us to invoke Lemma 2.
As an immediate consequence of Lemma 5, with high probability there exists an iterate (X_ncvx, Y_ncvx) := (X^t H^t, Y^t H^t) for some 0 ≤ t ≤ t₀, with H^t the rotation defined in (26), such that (X_ncvx, Y_ncvx) obeys (i) the small-gradient condition (21) and (ii) the proximity condition (24); the proof of this claim is deferred to Appendix D.12. We are now positioned to invoke Lemma 2: for any optimizer Z_cvx of (3), the proximity bound (22) holds with XY^⊤ = X_ncvx Y_ncvx^⊤.

Proof of Theorem 1
The last line arises since n ≳ κ, a consequence of the sample complexity condition np ≳ κ⁴μ²r² log³ n (and hence n ≥ np ≳ κ⁴μ²r² log³ n ≥ κ⁴). This taken collectively with the property (29) implies the Euclidean error bound in (7). In other words, since X_ncvx Y_ncvx^⊤ and Z_ncvx are exceedingly close, the error Z_cvx − M^⋆ is mainly attributed to X_ncvx Y_ncvx^⊤ − M^⋆. Similar arguments lead to the spectral and entrywise bounds in (7). We are left with proving the properties of Z_cvx,r. Since Z_cvx,r is defined to be the best rank-r approximation of Z_cvx and since X_ncvx Y_ncvx^⊤ has rank at most r, one can invoke (30) to derive

‖Z_cvx,r − Z_cvx‖_F ≤ ‖X_ncvx Y_ncvx^⊤ − Z_cvx‖_F,

from which (8) follows. Repeating the above calculations implies that (7) holds if Z_cvx is replaced by Z_cvx,r, thus concluding the proof.

Prior art
Nuclear norm minimization, pioneered by the seminal works [RFP10,CR09,CT10,Faz02], has been a popular and principled approach to low-rank matrix recovery. In the noiseless setting, i.e. E = 0, it amounts to solving the following constrained convex program

minimize_{Z ∈ R^{n×n}}  ‖Z‖_*  subject to  P_Ω(Z) = P_Ω(M),    (31)

which enjoys great theoretical success. Informally, this approach enables exact recovery of a rank-r matrix M^⋆ ∈ R^{n×n} as soon as the sample size is on the order of nr, the intrinsic number of degrees of freedom of a rank-r matrix [Gro11,Rec11,Che15]. When it comes to the noisy case, Candès and Plan [CP10] first studied the stability of convex programming when the noise is bounded and possibly adversarial, followed by [NW12] and [KLT11] using two modified convex programs. As we have already discussed, none of these papers provides optimal statistical guarantees under our model when r = O(1). Other related papers such as [Klo14,CZ16] derive similar estimation error bounds and suffer from similar suboptimality issues.
Turning to nonconvex optimization, we note that this approach has recently received much attention for various low-rank matrix factorization problems, owing to its computational advantage over convex programming; see [CLC18,CC18b] for overviews. Another line of works asserted that a large family of SDPs admits low-rank solutions [Bar95], which in turn motivates the Burer-Monteiro approach [BM03,BVB16]. When applied to matrix completion, however, the generic theoretical guarantees therein lead to conservative results. Take the noiseless case (31) for instance: these results reveal the existence of a solution of rank at most O(√(n²p)), which is often much larger than the true rank (e.g. when r ≍ 1 and p ≍ polylog(n)/n, one has √(n²p) ≳ √n ≫ r). Moreover, this line of works does not imply that all solutions to the SDP of interest are (approximately) low-rank.
Finally, the connection between convex and nonconvex optimization has also been explored in line spectral estimation [LT18], although the context therein is drastically different from ours.

Discussion
This paper provides an improved statistical analysis for the natural convex program (3), without the need to enforce an additional spikiness constraint. Our theoretical analysis uncovers an intriguing connection between convex relaxation and nonconvex optimization, which we believe is applicable to many other problems beyond matrix completion. Having said that, our current theory leaves open a variety of important directions for future exploration. Here we sample a few interesting ones.
• Improving dependency on r and κ. While our theory is optimal when r and κ are both constants, it becomes increasingly loose as either r or κ grows. For instance, in the noiseless setting, it has been shown that the sample complexity for convex relaxation scales as O(nr), linear in r and independent of κ, which is better than the current results. It is worth noting that existing theory for nonconvex matrix factorization typically falls short of providing optimal scaling in r and κ [KMO10a, SL16, CW15, MWCC17, CLL19]. Thus, tightening the dependency of the sample complexity on r and κ might call for new analysis tools.
• Approximate low-rank structure. So far our theory is built upon the assumption that the ground-truth matrix M is exactly low-rank, which falls short of accommodating the more realistic scenario where M is only approximately low-rank. For the approximate low-rank case, it is not yet clear whether the nonconvex factorization approach can still serve as a tight proxy. In addition, the landscape of nonconvex optimization for the approximately low-rank case [CL17] might shed light on how to handle this case.
• Extension to deterministic noise. Our current theory -in particular, the leave-one-out analysis for the nonconvex approach -relies heavily on the randomness assumption (i.e. i.i.d. sub-Gaussian) of the noise. In order to justify the broad applicability of convex relaxation, it would be interesting to see whether one can generalize the theory to cover deterministic noise with bounded magnitudes.
• Extension to structured matrix completion. Many applications involve low-rank matrices that exhibit additional structures, enabling a further reduction of the sample complexity [FHB03,CC14,CWW19]. For instance, if a matrix is Hankel and low-rank, then the sample complexity can be O(n) times smaller than the generic low-rank case. The existing stability guarantee of Hankel matrix completion, however, is overly pessimistic compared to practical performance [CC14]. The analysis framework herein might be amenable to the study of Hankel matrix completion and help close the theory-practice gap.
• Extension to robust PCA and blind deconvolution. Convex relaxation has also been applied to problems such as robust principal component analysis and blind deconvolution. The stability analyses of the convex relaxation approaches for these problems [ZLW + 10, ARR14, LS17] often adopt a similar approach as [CP10], and are consequently suboptimal. The insights from the present paper might promise tighter statistical guarantees for such problems.

A Preliminaries
In this section, we gather a few notations and preliminary facts that are used throughout the proofs.
To begin with, in view of the incoherence assumption (cf. Definition 1), one has

max{‖X^⋆‖_{2,∞}, ‖Y^⋆‖_{2,∞}} ≤ √(μrσ_max/n).    (32)

This follows from

‖X^⋆‖_{2,∞} = ‖U^⋆(Σ^⋆)^{1/2}‖_{2,∞} ≤ ‖U^⋆‖_{2,∞} ‖(Σ^⋆)^{1/2}‖ ≤ √(μr/n) · √(σ_max).

The bound for Y^⋆ follows from the same argument. Finally, for notational convenience, we shall often denote

P^debias_Ω(Z) := P_Ω(Z) − pZ    (33)

for any matrix Z ∈ R^{n×n}.
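In the mask convention of the earlier snippets, the debiased operator (33) is a one-liner; a small sketch of ours:

```python
def P_debias(Z, mask, p):
    """Debiased sampling operator (33): P_Omega(Z) - p * Z."""
    return mask * Z - p * Z
```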

B Exact duality analysis
We show in this section why the first-order optimality condition is almost sufficient in guaranteeing the uniqueness of the optimizer. The argument is standard; see e.g. [CR09].
Lemma 6. Let Z = UΣV^⊤ be the SVD of Z ∈ R^{n×n}. Denote by T the tangent space of Z and by T^⊥ its orthogonal complement. Suppose that there exists W ∈ T^⊥ such that

UV^⊤ + W = λ^{−1} P_Ω(M − Z).    (34)

Then Z is the unique minimizer of (3) if
1. ‖W‖ < 1;
2. the operator P_Ω(·) restricted to elements in T is injective, i.e. P_Ω(H) = 0 implies H = 0 for any H ∈ T.
Proof of Lemma 6. To begin with, the assumption of this lemma implies that UV^⊤ + W ∈ ∂‖Z‖_*, where ∂‖Z‖_* denotes the subdifferential of ‖·‖_* at Z. This combined with (34) reveals that

0 = P_Ω(Z − M) + λ(UV^⊤ + W) ∈ P_Ω(Z − M) + λ ∂‖Z‖_*,

thus indicating that Z is a minimizer of the convex program (3). Next, we justify the uniqueness of Z. Before continuing, we record a fact regarding the minimizers of (3).

Claim 1. Any minimizer Ẑ of (3) obeys P_Ω(Ẑ) = P_Ω(Z).
With this claim at hand, every minimizer of (3) can be written as Z + H for some H obeying P_Ω(H) = 0. It then suffices to prove that for any H ≠ 0 of this form one has g(Z + H) > g(Z), where g(·) is the objective function in (3). To this end, we note that the squared loss portions of g(Z + H) and g(Z) coincide, a consequence of Claim 1 (i.e. P_Ω(H) = 0). Let S be a subgradient of ‖·‖_* at the point Z obeying (37). Using the convexity of ‖·‖_*, one can further lower bound (36) via ⟨S, H⟩. Here, (i) follows from our assumption that UV^⊤ + W is supported on Ω (cf. (34)) and the fact that P_Ω(H) = 0, and (ii) holds since P_T(S) = UV^⊤ (cf. (37)). We can now expand the above expression, where the last inequality holds by using the last property of (37) and invoking an elementary inequality. Given that W is assumed to obey ‖W‖ < 1, one has g(Z + H) > g(Z) unless P_{T^⊥}(H) = 0. However, if P_{T^⊥}(H) = 0 (and hence H ∈ T), then the injectivity assumption together with the fact that P_Ω(H) = 0 forces H = 0. Consequently, any minimizer Z + H with H ≠ 0 must satisfy g(Z + H) > g(Z), which results in a contradiction. This concludes the proof.
Proof of Claim 1. Consider any two minimizers Z₁ ≠ Z₂ of (3), and suppose instead that P_Ω(Z₁) ≠ P_Ω(Z₂). Since ‖·‖_* is convex, the nuclear norm of the midpoint (Z₁ + Z₂)/2 is at most the average of the nuclear norms of Z₁ and Z₂. Furthermore, by the strong convexity of ‖·‖²_F, the squared loss of the midpoint over Ω is strictly smaller than the average of the squared losses of Z₁ and Z₂, and hence g((Z₁ + Z₂)/2) < g(Z₁). This contradicts the fact that Z₁ is a minimizer of (3), thus completing the proof.

C.1 Proof of Lemma 1
First of all, since (X, Y ) is a stationary point of (5), we have the first-order optimality conditions As an immediate consequence, one has In words, any stationary point (X, Y ) has "balanced" scale. Let U ΣV be the singular value decomposition of XY with U , V ∈ R n×r orthonormal and Σ ∈ R r×r diagonal. In view of the balanced scale of (X, Y ) (namely, (41)) and Lemma 20, we can write for some orthonormal matrix R ∈ R r×r . Substitution into (40) results in implying that the columns of U (resp. V ) are the left (resp. right) singular vectors of the matrix P Ω (M − XY ). We can therefore write where W ∈ T ⊥ ; recall that T is the tangent space of XY and also U V . In view of Lemma 6, it suffices to show that W < 1, which is the content of the rest of the proof. One can rewrite P Ω (M − XY ) as Substitute this identity into (43) and rearrange terms to obtain These tell us that the columns of U (resp. V ) are the left (resp. right) singular vectors of the matrix which is equivalent to saying that 5 pM + P debias for some W 2 ∈ T ⊥ . One can then derive from (44) that where (i), (ii) and (iv) arise from the facts that U V ∈ T , XY ∈ T and U (pΣ+λI r )V ∈ T , respectively, and (iii) relies on the identity (45). It then suffices to control W 2 . To this end, apply Weyl's inequality to (45) to obtain that: for r + 1 ≤ i ≤ n, the ith largest singular value of U (pΣ + λI r )V + λW 2 obeys where the second inequality comes from the fact that M has rank r (so that σ i (M ) = 0 for r + 1 ≤ i ≤ n) as well as the triangle inequality, and the last inequality follows from the assumptions of the lemma. Furthermore, it is seen that U (pΣ + λI r )V has rank r and all of its singular values are at least λ. These facts taken collectively demonstrate that This together with Lemma 6 completes the proof.

C.2 Proof of Lemma 2
We begin by collecting a few simple properties resulting from our assumptions. By definition, the gradients of f(·, ·) in (15) are given by

∇_X f(X, Y) = (1/p)[P_Ω(XY^⊤ − M) Y + λX]  and  ∇_Y f(X, Y) = (1/p)[P_Ω(XY^⊤ − M)^⊤ X + λY],

which together with the small-gradient assumption (21) implies that both ‖P_Ω(XY^⊤ − M)Y + λX‖_F and ‖P_Ω(XY^⊤ − M)^⊤X + λY‖_F are vanishingly small. Throughout the proof, we let the SVD of XY^⊤ be XY^⊤ = UΣV^⊤, and denote by T the tangent space of XY^⊤ and by T^⊥ its orthogonal complement. Additionally, our assumption regarding the singular values of X and Y implies the bound (47), which can be easily seen from two elementary inequalities. Before proceeding, we record a claim that will prove useful in the subsequent analysis.
Claim 2. Under the notations and assumptions of Lemma 2, the decomposition (48) holds, where R is some residual matrix satisfying (49).

With Claim 2 in place, we are ready to prove Lemma 2. Let Z_cvx be any minimizer of (3) and denote ∆ := Z_cvx − XY^⊤. The proof can be divided into the following steps.
• First, show that the difference ∆ primarily lies in the tangent space of XY^⊤; see (56).
• Next, utilize this property to connect ‖P_Ω(∆)‖²_F with the size of the gradient ∇f(X, Y); see (58).
• In the end, obtain a lower bound on ‖P_Ω(∆)‖²_F in terms of ‖∆‖_F using the injectivity property; see (59). The desired upper bound on ‖∆‖_F advertised in the lemma then follows by combining these results. In what follows, we shall carry out these steps one by one.
1. The optimality of Z cvx = XY + ∆ reveals that A little algebra allows us to rearrange terms as follows In addition, it follows from the convexity of · * that for any W ∈ T ⊥ obeying W ≤ 1, where U V + W serves as a subgradient of · * at XY . In what follows, we shall pick W such that W , ∆ = P T ⊥ (∆) * . Combining this with (50) and (51), we reach This together with the decomposition (48) leads to and therefore In addition, elementary inequalities give From the condition (49) we have P T ⊥ (R) ≤ λ/2, and hence the above inequality gives which together with the condition (49) on P T (R) F and the small gradient assumption (21) yields This essentially means that ∆ lies primarily in the tangent space of XY for c sufficiently small. As an immediate consequence, as long as c is sufficiently small. Note that we also use the elementary fact that c inj ≤ 1/p (otherwise we will have the contradictory inequality p −1 P Ω (H) 2 F ≥ c inj H 2 F > p −1 H 2 F ). 2. Continue the upper bound in (53) to obtain Here, the last line uses the fact that − P T ⊥ (R), ∆ ≤ P T ⊥ (R) · P T ⊥ (∆) * ≤ λ 2 P T ⊥ (∆) * , which follows from (49). Therefore, using the condition (49) we reach 3. We are left with lower bounding P Ω (∆) 2 F . Using the decomposition ∆ = P T (∆) + P T ⊥ (∆), we obtain where the last inequality follows from the injectivity assumption (20). In addition, (56) implies as long as c is sufficiently small. As a result, In addition, by (57) we have Taking (58) and (59) collectively yields

C.2.1 Proof of Claim 2
Before proceeding to the proof of Claim 2, we state a useful fact; the proof is deferred to Appendix C.2.2.
Claim 3. Instate the notations and assumptions of Lemma 2. Let UΣV^⊤ be the SVD of XY^⊤. Then there exists an invertible matrix Q ∈ R^{r×r} such that

X = UΣ^{1/2}Q  and  Y = VΣ^{1/2}Q^{−⊤},

where the SVD U_Q Σ_Q V_Q^⊤ of Q obeys (60).
In light of the assumptions (46), one has In the sequel, we shall prove the upper bounds on both P T (R) F and P T ⊥ (R) separately.
1. From the definition of P T (·) (see (13)), we have In addition, invoke Claim 3 to obtain for some invertible matrix Q ∈ R r×r , whose SVD U Q Σ Q V Q obeys (60). Combine (61) and (62) to see which together with (64) yields Apply the triangle inequality to get In order to further upper bound (65), we first recognize that (47) implies Second, Claim 3 yields with the proviso that c is sufficiently small. Here we have used the facts that c inj ≤ 1/p and that κ ≥ 1. This in turn implies that Q = Σ Q ≤ 2. Putting the above bounds together yields 2. We now move on to bounding P T ⊥ (R) . In view of the definition of P debias Ω (·) in (33), we can rearrange (61) to derive In view of the representation X = U Σ 1/2 Q and Y = V Σ 1/2 Q − , the above identities are equivalent to for some residual matrixR ∈ R n×n , we have where (i) follows from the definition of R and the fact that U V ∈ T , (ii) uses the definition of P debias Ω (·), (iii) relies on the fact that XY ∈ T , and (iv) applies (67) and the facts that U ΣV ∈ T and that U Σ 1/2 QQ Σ −1/2 V ∈ T . Therefore, it suffices to bound P T ⊥ (R) . Rewrite (67) as Suppose for the moment that This together with the assumptions that P debias Ω XY − M < λ/8 and P Ω (E) < λ/8 reveals that By Weyl's inequality and the relations (69) and (71), one has for any r + 1 ≤ i ≤ n, where σ i (A) denotes the ith largest singular value of a matrix A. Here, we have used the fact that M has rank r and hence σ i (M ) = 0 for any i > r. In addition, it is seen that Note that in (65), we have obtained as long as c is sufficiently small, and hence Σ 1/2 QQ Σ −1/2 − I r ≤ 1/10. Therefore, for any 1 ≤ i ≤ r we know that where the second inequality results from Weyl's inequality. This combined with (68) and (72) yields this happens because at least n − r singular values of U pΣ + λΣ 1/2 QQ Σ −1/2 V + P T ⊥ R are no larger than λ/2 and they cannot correspond to directions simultaneously in the column space spanned by U and the row space spanned by V .
The proof is then complete by verifying (70). To this end, observe that Then following similar technique used to bound P T (R) , we have as long as c is small enough.

C.2.2 Proof of Claim 3
Let for some B 1 , B 2 ∈ R n×r . Clearly, it is seen from the assumption (46) that In addition, the identities (74) allow us to obtain Here, the last line makes use of (75) and the assumption that X , Y ≤ √ 2σ max . In view of Lemma 20, one can find an invertible Q such that where Σ Q is a diagonal matrix consisting of all singular values of Q. Here, (i) follows from (47) as well as the bound (76), and the last inequality (ii) uses the assumption (21). This completes the proof.

C.3 Proof of Lemma 4
Lemma 4 consists of two parts, which we restate into the following two lemmas, namely Lemmas 7-8. First of all, Lemma 7 demonstrates that as long as (X, Y ) is sufficiently close to (X , Y ), the operator P Ω (·) restricted to the tangent space T of XY is injective. The proof is deferred to Appendix C.3.1.
Lemma 7. Suppose that the sample size obeys n²p ≥ Cμrn log n for some sufficiently large constant C > 0. Then with probability exceeding 1 − O(n^{−10}), the operator P_Ω(·) restricted to the tangent space T is injective, in the sense of Condition 2, uniformly over all (X, Y) obeying the proximity condition (77). Here, c > 0 is some sufficiently small constant, and T denotes the tangent space of XY^⊤.
Remark 5. In the prior literature, the injectivity of P Ω (·) has been mostly studied when restricted to a fixed tangent space independent of Ω (see [CR09,Gro11]). In comparison, this lemma demonstrates that the injectivity property holds uniformly over a large set of tangent spaces. This allows one to handle tangent spaces that are statistically dependent on Ω.

C.3.1 Proof of Lemma 7
By definition, any H ∈ T can be expressed as

H = XA^⊤ + BY^⊤    (78)

for some A, B ∈ R^{n×r}. Given that this is an underdetermined linear system of equations, there might be numerous pairs (A, B) compatible with (78). We take the specific minimum-norm choice in (79), which satisfies a property that plays an important role in the subsequent analysis:

X^⊤B = A^⊤Y.    (80)

To see this, consider the Lagrangian associated with the minimum-norm program over (Ã, B̃). Taking the derivatives w.r.t. Ã and B̃ and setting them to zero yield A = −Λ^⊤X and B = −ΛY for some Lagrange multiplier matrix Λ ∈ R^{n×n}. The claim (80) then follows immediately. The remaining proof consists of two steps.
• First, we would like to show that • Second, we prove that Taking (81) and (82) together immediately yields the claimed bounds in the lemma. In what follows, we shall establish these two bounds separately.

Regarding the upper bound (81), it follows from elementary inequalities that
It then suffices to control max{ X , Y }. In view of the assumption (77), one has as long as c < 1. This together with the triangle inequality reveals that Similarly, one has Y ≤ 2 √ σ max . Substitution into (83) yields the desired upper bound (81).

We now move on to the lower bound (82). To this end, one first decomposes
The basic idea is to demonstrate that (1) α 2 is bounded from below, and (2) α 1 is sufficiently small compared to α 2 .
(a) We start by controlling α 2 , towards which we can expand The property X B = A Y (see (80)) implies that , where the second line arises from the Cauchy-Schwarz inequality. Recalling from (84) that ∆ X ≤ c X /κ, we arrive at provided that c ≤ 1/200. A similar bound holds for BY 2 F , thus leading to (b) Next, we control α 1 . First, it is seen that As a result, we can expand α 1 as i. Regarding γ 1 , it follows from the bounds in [CR09, Section 4.2] that as long as np µr log n. ii. Invoke Lemma 19 to show that as long as n 2 p n log n and c > 0 is sufficiently small. Here we have utilized the assumption that max{ ∆ X 2,∞ , ∆ Y 2,∞ } ≤ c X /(κ √ n). iii. The term γ 4 can be controlled via Lemma 21: where the second line uses the bound p −1 P Ω 11 −11 n/p guaranteed by [KMO10a, Lemma 3.2]. Continue the upper bound to get Here the first relation (i) arises from the assumption that np 1. The second inequality (ii) applies the elementary inequality ab ≤ (a 2 + b 2 )/2 and the last one (iii) holds with the proviso that c > 0 is small enough. iv. The last term γ 5 can be further decomposed into the sum of four terms. For brevity, we take one out as an example, namely the term Apply the triangle inequality to obtain In light of [CR09, Section 4.2] and [ZL16, Lemma 9], we have Taking the above three bounds collectively yields Using the assumption that ∆ X 2,∞ ≤ c X /(κ √ n), one has for c > 0 small enough. The same argument applies to the remaining three terms, resulting in v. Combining the previous bounds on γ 1 through γ 5 , we arrive at (c) Taking the preceding bounds on α 1 and α 2 collectively yields The proof is then complete.

C.3.2 Proof of Lemma 8
To start with, we decompose P^debias_Ω(XY^⊤ − M^⋆), which together with the triangle inequality yields an upper bound on its spectral norm. Apply [CL17, Lemma 4.5] to control the term involving P^debias_Ω(·). In addition, the assumption (24) yields the desired control as long as σ√((n log n)/p) ≲ σ_min/κ (recall that λ = C_λ σ√(np) for some constant C_λ > 0). As a consequence, one obtains a bound on ‖P^debias_Ω(XY^⊤ − M^⋆)‖, where the last inequality follows from the upper bound max{‖X^⋆‖_{2,∞}, ‖Y^⋆‖_{2,∞}} ≤ √(μrσ_max/n) (cf. (32)). Rearranging the right-hand side of (85), we reach the claimed bound, where the last line holds because of the assumption n²p ≳ κ⁴μ²r²n log n as well as the choice of λ.

D Analysis of the nonconvex gradient descent algorithm
Lemma 5 is similar in spirit to [MWCC17, Theorem 2] and [CLL19, Lemma 3.5], with one difference: the nonconvex loss function (15) has an additional term ‖X‖²_F + ‖Y‖²_F (scaled by λ/(2p)) to balance the scale of X and Y. To simplify the presentation, we find it convenient to introduce a few notations: denote by F^t ∈ R^{2n×r} the matrix obtained by stacking X^t on top of Y^t, and define F^⋆ analogously. It is easily seen from (26) that H^t is the rotation best aligning F^t with F^⋆.

Algorithm 2 Construction of the l-th leave-one-out sequence.
Similar to [MWCC17, CLL19], we resort to leave-one-out sequences to control the ℓ₂/ℓ_∞ errors. Specifically, for each 1 ≤ l ≤ n (corresponding to row indices), we construct {F^{t,(l)}}_{t≥0} to be the gradient descent iterates (see Algorithm 2) w.r.t. an auxiliary loss function in which P_{Ω_{−l,·}}(·) (resp. P_{l,·}(·)) denotes the orthogonal projection onto the space of matrices supported on the index set Ω_{−l,·} := {(i, j) ∈ Ω | i ≠ l} (resp. {(i, j) | i = l}). Similarly, for each n + 1 ≤ l ≤ 2n (with l − n corresponding to the column index), we define {F^{t,(l)}}_{t≥0} to be the GD iterates (see Algorithm 2) operating on an analogous loss, in which P_{Ω_{·,−(l−n)}}(·) and P_{·,(l−n)}(·) zero out (resp. retain) all entries in the (l − n)-th column. The key ideas are: (1) the iterates are not perturbed by much when one drops a small number of samples (and hence F^t and F^{t,(l)} remain sufficiently close); (2) the auxiliary iterates F^{t,(l)} are independent of the samples directly related to the l-th row (or column) of M, which in turn allows us to exploit certain statistical independence to control the l-th row of F^{t,(l)} (and hence of F^t). See [MWCC17, Section 5] for a detailed explanation. Last but not least, the step size is set to be η, and we take F^{0,(l)} = F^⋆ for all 1 ≤ l ≤ 2n (the same initialization as in Algorithm 1). With the help of the leave-one-out sequences, we are ready to establish Lemma 5 in an inductive manner. Concretely, we aim at proving that the hypotheses (91a)-(91f) hold for all 0 ≤ t ≤ t₀ = n¹⁸ and for some constants C_F, C_op, C₃, C₄, C_∞, C_B > 0, provided that η ≍ 1/(nκ³σ_max). In addition, we also intend to establish the small-gradient bound (92) for all 1 ≤ t ≤ t₀ = n¹⁸. Here, H^{t,(l)} and R^{t,(l)} are rotation matrices defined as the best rotations aligning the corresponding sequences (cf. (87)). Note that the induction hypotheses (91a), (91b) and (91e) readily imply the statements (27a), (27b) and (27c) in Lemma 5, respectively, whereas the last bound on the size of the gradient (28) follows from (92). We summarize the last connection in the following lemma, whose proof is in Appendix D.2.
The rest of this section is devoted to proving the hypotheses (91) and (92) via induction. We start with the base case, i.e. t = 0: all of the induction hypotheses (91) are easily verified, since F⁰ = F^{0,(l)} = F^⋆ by construction. We now proceed to the induction step, which is demonstrated via the following lemmas; all the proofs are deferred to subsequent subsections.
Suppose that the sample size satisfies n²p ≳ κ²μ²r²n log n and that the noise satisfies σ√(n/p) ≲ σ_min/√(κ² log n). If the iterates satisfy (91) at the t-th iteration, then with probability at least 1 − O(n^{−100}), the corresponding bound holds for some sufficiently large constant C_B ≳ C²_op, provided that 0 < η < 1/σ_min.
Suppose that the noise satisfies σ√(n/p) ≲ σ_min/√r. If the iterates satisfy (91) at the t-th iteration, then with high probability the corresponding bound holds, as long as η ≍ 1/(κnσ_max).

D.1 Preliminaries and notations
Before proceeding to the proofs, we collect a few useful facts and notations. To begin with, for any matrix A, we denote by A_{l,·} (resp. A_{·,l}) the l-th row (resp. column) of A.
Define an augmented loss function f_aug(X, Y) to be

f_aug(X, Y) := f(X, Y) + (1/8) ‖X^⊤X − Y^⊤Y‖²_F.    (94)

As the name suggests, this new function augments the original loss function (cf. (15)) with an additional term ‖X^⊤X − Y^⊤Y‖²_F/8, which is commonly used in the literature on asymmetric low-rank matrix factorization to balance the scale of X and Y [TBS + 16,YPCC16,CLL19]. We emphasize that, in contrast to the aforementioned works, here our gradient descent algorithm (cf. Algorithm 1) operates on f(·, ·) instead of f_aug(·, ·); the introduction of f_aug(·, ·) is mainly to simplify the proof.
The gradients of f_aug(·, ·) are easily computed in closed form. Correspondingly, define the difference between the gradients ∇f(X, Y) and ∇f_aug(X, Y) accordingly, so that the decomposition (97) holds. Regarding F^⋆, simple algebra reveals the bounds in (98), where the last one follows from the incoherence assumption (32). We start with a lemma that characterizes the local geometry of the nonconvex loss function, whose proof is given in Appendix D.10.
Lemma 17. Set λ = C_λ σ√(np) for some constant C_λ > 0. Suppose that the sample size obeys n²p ≥ Cκμrn log² n for some sufficiently large constant C > 0 and that the noise satisfies σ√(n/p) ≲ σ_min. Recall the function f_aug(·, ·) defined in (94). Then with probability at least 1 − O(n^{−10}), f_aug enjoys the desired local geometry (restricted strong convexity and smoothness) in a neighborhood of F^⋆.

Last but not least, a few immediate consequences of (91) are gathered in the following lemma, whose proof is given in Appendix D.11.

Lemma 18. We have the following four sets of consequences of the induction hypotheses (91).
1. Suppose that the sample size obeys n ≳ μr log n. If the t-th iterates obey (91), then the first set of bounds holds.

2. Suppose that the noise satisfies σ√(n/p) ≲ σ_min/√(κ² log n). If the t-th iterates obey (91), then the second set of bounds holds.

3. Suppose that n ≳ κ²μr log n and that σ√(n/p) ≲ σ_min/√(κ² log n). If the t-th iterates obey (91), then we have the bounds in (100).

4. Suppose that n ≥ κμ and that σ√(n/p) ≲ σ_min/√(κ² log n). If the t-th iterates obey (91), then (100) also holds for F^{t,(l)} H^{t,(l)}. In addition, one has the final set of bounds.

D.2 Proof of Lemma 9
Summing (92) from t = 1 to t = t₀ leads to a telescopic sum, which further implies the claimed bound under our assumptions on the step size and the iteration count. Towards this end, we can use the fact that f(X, Y) = f(XR, YR) for any R ∈ O^{r×r} to obtain the desired identity, where F̃ lies in the line segment connecting F^{t₀}H^{t₀} and F^⋆. Apply the triangle inequality, where the second line follows from the fact that ‖∇²f(F̃)‖ ≤ 10σ_max. To see this, use (91e), where the last relation arises from σ√(np) ≲ λ. Substitution into (101) yields the advertised result, provided that η ≍ 1/(nκ³σ_max), t₀ = n¹⁸ and that n ≥ κ, which is a consequence of our sample complexity n ≥ np ≳ κμr log² n.
Here (i) uses the fact that ∇f (F R) = ∇f (F )R for all R ∈ O r×r ; the last relation (ii) uses the decomposition (97) and the triangle inequality. In the following, we bound α 1 , α 2 and α 3 in the reverse order.
1. First, regarding α 3 , since X X = Y Y , one has η ∇f (F ) F = η ∇f aug (F ) F . Repeating our arguments for (103) and (104) gives as long as λ σ √ np. Here the last inequality also relies on the fact that X F = Y F .
2. We now move on to α 2 , for which one has Utilize the fact that max{ X t , Y t } ≤ F ≤ 2 X (see Lemma 18) and the induction hypothesis (91f) to obtain where the second inequality uses X F ≥ √ rσ min and the last one holds as long as 2C B κ 5/2 σ max η ≤ 1.
3. In the end, for α 1 , the fundamental theorem of calculus [Lan93, Chapter XIII, Theorem 4.2] reveals that where we denote F (τ ) F + τ (F t H t − F ) for all 0 ≤ τ ≤ 1. Taking the squared Euclidean norm of both sides of the equality (105) leads to where (106) results from the fact that Applying the same argument as in (102), one gets for all Putting these two bounds back to (106) yields Here the last relation holds as long as 0 ≤ η ≤ 1/(1000κσ max ). As a result, we have Combine the above bounds on α 1 , α 2 and α 3 to conclude that

D.4 Proof of Lemma 11
To facilitate the analysis, we define an auxiliary point $\widetilde{F}^{t+1}$ with blocks $\widetilde{X}^{t+1}$ and $\widetilde{Y}^{t+1}$. Then the triangle inequality tells us that the quantity of interest is bounded by $\alpha_1 + \alpha_2$. In what follows, we shall control $\alpha_1$ and $\alpha_2$ separately.
1. We start with $\alpha_2$. By the triangle inequality again, we can decompose it into three terms $\beta_1$, $\beta_2$, and $\beta_3$. The term $\beta_1$ is the same as the term $\alpha_2$ in [CLL19, Section 4.2].
Therefore we can adopt the bound therein. Moving to $\beta_2$, one has the stated bound for some constant $C > 0$; here the last inequality arises from Lemma 3 and the fact that $\|F^\star\| = \sqrt{2}\,\|X^\star\|$ (cf. (98a)). We are now left with the term $\beta_3$, which is exactly the term $\alpha_1$ in [CLL19, Section 4.2]. Reusing their results, we obtain the corresponding bound. The last line follows from the facts that $\|P_\Omega(\mathbf{1}\mathbf{1}^\top) - p\,\mathbf{1}\mathbf{1}^\top\| \lesssim \sqrt{np}$ (see [KMO10a, Lemma 3.2]) and that $\max\{\|\Delta_X^t\|_{2,\infty}, \|\Delta_Y^t\|_{2,\infty}\} \le \|F^\star\|_{2,\infty}$, provided that $\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}} \lesssim \frac{1}{\sqrt{\kappa^2\log n}}$ (see Lemma 18). Combining the above three bounds gives (109) for some sufficiently large constant $\widetilde{C} > 0$, where the second inequality arises from the condition on $\kappa$. An immediate consequence of (109) is (110); to see this, apply the induction hypotheses (91b) and (91e). Here (i) holds under the assumptions that $C_{\mathrm{op}} \gtrsim C + \widetilde{C}C_\infty\,\kappa^2\mu^2 r^2\,\frac{\log n}{np}$ and that $\|F^\star\|_{2,\infty} \le \sqrt{\frac{\mu r}{n}}\,\|X^\star\|$ (cf. (98b)); (ii) arises since $\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}} \lesssim 1/\kappa$ and $\lambda \gtrsim \sigma\sqrt{np}$. Under the sample complexity $n^2p \gtrsim \kappa^4\mu^2 r^2 n\log n$, the first condition simplifies to $C_{\mathrm{op}} \ge 2C$.
2. Next we bound $\alpha_1$, towards which we first record an identity. It is straightforward to verify that $H^{t\top}H^{t+1}$ is the best rotation matrix to align $F^{t+1}H^t$ and $F^\star$ (in the sense of (87)). Regarding $\widetilde{F}^{t+1}$, we obtain the following claim, which demonstrates that it is already aligned with $F^\star$, i.e. $I_r$ is the best rotation matrix to align $\widetilde{F}^{t+1}$ and $F^\star$.

Claim 4. The identity matrix $I_r$ is the best rotation matrix to align $\widetilde{F}^{t+1}$ and $F^\star$ in the sense of (87).

Now we intend to apply Lemma 22, for which we need to check the two conditions therein. First, in view of (110), the first condition holds. Second, making use of the gradient update rules (25) and the decomposition (97), we obtain a decomposition into two terms $\theta_1$ and $\theta_2$.
The term $\theta_2$ has been controlled as $\alpha_2$ in the proof of Lemma 10, where we obtained the corresponding bound. We now move on to $\theta_1$.

Combining [CLL19, Equation (4.13)] and [KMO10a, Lemma 3.2] yields the bound on $\xi_1$.
Here the penultimate inequality arises from the facts that $\max\{\|\Delta_X^t\|_{2,\infty}, \|\Delta_Y^t\|_{2,\infty}\} \le \|\Delta^t\|_{2,\infty} \le \|F^\star\|_{2,\infty}$ and, similarly, $\max\{\|\Delta_X^t\|, \|\Delta_Y^t\|\} \le \|\Delta^t\| \le \|X^\star\|$; see Lemma 18. In addition, the last line holds because of the induction hypotheses (91b) and (91e), provided that two conditions involving $C_\infty$, $\kappa\,\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}}$, and $\mu^2 r^2\frac{\log n}{np}$ hold; again, the first of these is guaranteed by the sample size condition $n^2p \gtrsim \kappa^4\mu^2 r^2 n\log n$ and the noise condition $\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}} \lesssim 1/\kappa$. Next, the term $\xi_2$ can be easily controlled, where the last line follows from the same argument used for bounding $\beta_2$ above. Taking the bounds on $\theta_1$ and $\theta_2$ collectively yields (112); the final inequality is true as long as $\lambda \gtrsim \sigma\sqrt{np}$ and $\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}} \lesssim \frac{1}{\kappa}$. An immediate consequence of (112) is the second condition of Lemma 22, as long as $\eta \lesssim 1/(C_{\mathrm{B}}\kappa^2\sigma_{\max}\sqrt{r})$, $\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}} \lesssim \frac{1}{\kappa}$, and $\lambda \gtrsim \sigma\sqrt{np}$. Armed with these two conditions, we can invoke Lemma 22 to bound $\alpha_1$. Combine the bounds on $\alpha_1$ and $\alpha_2$ to reach the desired conclusion, with the proviso that $C_{\mathrm{op}} \gtrsim 1$ and $n^2p \gtrsim \kappa^4\mu^2 r^2 n\log n$. Here the last line follows from the same argument as in bounding (111). This completes the proof.
Proof of Claim 4. In view of [MWCC17, Lemma 35], it suffices to show that $F^{\star\top}\widetilde{F}^{t+1}$ is symmetric and positive semidefinite.
Following the argument of [MWCC17, Lemma 35], it is straightforward to verify that $F^{\star\top}\widetilde{F}^{t+1}$ is symmetric (which we omit here for brevity). In addition, (110) supplies a lower bound on $\lambda_{\min}(F^{\star\top}\widetilde{F}^{t+1})$, where $\lambda_{\min}(A)$ stands for the minimum eigenvalue of a matrix $A$. To conclude, $F^{\star\top}\widetilde{F}^{t+1}$ is both symmetric and positive semidefinite, thus establishing the claim.
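As a quick sanity check of the mechanism behind Claim 4 (an illustration, not part of the paper's argument): the orthogonal Procrustes problem $\min_{R\in\mathcal{O}^{r\times r}}\|AR - B\|_{\mathrm{F}}$ is solved by $R = \operatorname{sgn}(A^\top B)$, and when $A^\top B$ is symmetric positive definite this optimal rotation is exactly $I_r$. The snippet below verifies both facts numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 30, 4
A, B = rng.standard_normal((n, r)), rng.standard_normal((n, r))

def sgn(M):
    # matrix sign function: U V' for the SVD M = U diag(s) V'
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt

# (i) Procrustes: R = sgn(A'B) is at least as good as random orthogonal matrices
R_star = sgn(A.T @ B)
best = np.linalg.norm(A @ R_star - B, "fro")
for _ in range(200):
    R = sgn(rng.standard_normal((r, r)))        # a random orthogonal matrix
    assert best <= np.linalg.norm(A @ R - B, "fro") + 1e-10

# (ii) if A'B is symmetric positive definite, the optimal rotation is I_r
S = rng.standard_normal((r, r))
M = S @ S.T + r * np.eye(r)                      # symmetric positive definite
assert np.linalg.norm(sgn(M) - np.eye(r)) < 1e-10
```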

D.5 Proof of Lemma 12
Without loss of generality, we consider the case where $1 \le l \le n$; the case with $n+1 \le l \le 2n$ can be derived in a similar way. From the definition of $R^{t+1,(l)}$ (cf. (93b)), we obtain the starting identity. The gradient update rules (25) and (88) then give a decomposition into the terms $A_1$, $A_2$, and $A_3$, where we have used the facts that $\nabla f(F)R = \nabla f(FR)$ and $\nabla f^{(l)}(F)R = \nabla f^{(l)}(FR)$ for any orthonormal matrix $R \in \mathcal{O}^{r\times r}$.
In what follows, we shall bound $A_1$, $A_2$, and $A_3$ sequentially.
1. The first term $A_1$ is similar to $\alpha_1$ in the proof of Lemma 10. Going through the same derivations therein, we obtain the analogous bound, provided that $\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}} \lesssim \frac{1}{\sqrt{\kappa^4\mu r\log n}}$ and that $0 \le \eta \le 1/(1000\,\kappa\sigma_{\max})$.
2. Next, we turn attention to $A_2$, which clearly obeys a crude bound. Recall from the term $\alpha_2$ in the proof of Lemma 10 the corresponding estimate. Applying Lemma 15 and going through the same derivation as in bounding $\alpha_2$ in the proof of Lemma 10, one gets a third estimate. Combine the above three inequalities to obtain the claimed bound on $A_2$. Here the second inequality arises from the elementary inequality $\|X\| \le \sqrt{n}\,\|X\|_{2,\infty}$, whereas the last one holds true because of $\|X^\star\|_{2,\infty} \le \|F^\star\|_{2,\infty}$ and the condition on $\eta$.
3. We are now left with $A_3$. To this end, we first observe a decomposition into the terms $B_1$, $B_2$ and $C_1$, $C_2$. The following claims allow one to bound $B_1$, $B_2$ and $C_1$, $C_2$; the proofs are deferred to the end of this subsection.
Claim 5. Suppose that $\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}} \lesssim \frac{1}{\sqrt{\kappa^2\log n}}$ and that $np \gtrsim \log^2 n$. Then with probability at least $1 - O(n^{-100})$, the first bound holds.

Claim 6. Suppose that $\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}} \lesssim \frac{1}{\sqrt{\kappa^2\log n}}$ and that $np \gtrsim \log n$. Then with probability at least $1 - O(n^{-100})$, the second bound holds.

Claim 7. Suppose that $\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}} \lesssim \frac{1}{\sqrt{\kappa^2\log n}}$ and that $np \gtrsim \log^3 n$. Then with probability at least $1 - O(n^{-100})$, the third bound holds.

With these claims in place, one can readily bound $A_3$, where the last line follows from the induction hypotheses (91c) and (91e).
This together with the bounds on $A_1$ and $A_2$ gives the advertised estimate for some constant $\widetilde{C} > 0$, as claimed. Here, (i) invokes the induction hypothesis (91c), whereas (ii) holds as long as $C_3$ is large enough and the sample size satisfies $n^2p \gtrsim \kappa^4\mu^2 r^2 n\log n$.
Proof of Claim 5. For notational simplicity, introduce the matrix $W$ defined below. Since the Frobenius norm is unitarily invariant, we may work with $W$ directly. All nonzero entries of the matrix $W$ reside in its $l$th row, and therefore the quantity of interest reduces to a sum over this row, where $\delta_{lj} := \mathbb{1}_{\{(l,j)\in\Omega\}}$. Notice that, conditional on $X^{t,(l)}$ and $Y^{t,(l)}$, the right-hand side is composed of a sum of independent random vectors, where the randomness comes from $\{\delta_{lj}\}_{1\le j\le n}$. It then follows that the stated concentration bound holds, where both (i) and (ii) hold with probability exceeding $1 - O(n^{-100})$. As a result, we arrive at (119) as soon as $np \gtrsim \log^2 n$. To finish up, we make the observation (120), where the last line arises from Lemma 18. This combined with (119) gives the claim, where (i) comes from (120), and (ii) makes use of the incoherence condition $\|F^\star\|_{2,\infty} \le \sqrt{\mu r\sigma_{\max}/n}$ and the fact that $\|Y^\star\|_{\mathrm{F}} \le \sqrt{r\sigma_{\max}}$.
Proof of Claim 6. Instate the notation in the proof of Claim 5. By the unitary invariance of the Frobenius norm and the fact that all nonzero entries of the matrix $W$ reside in its $l$th row, we can reduce to the quantity $b_2$. Note the bounds on the variance statistic $V$ and the uniform bound $L$ that hold for all $j$. Then the matrix Bernstein inequality [Tro15, Theorem 6.1.1] reveals that $b_2 \lesssim \sqrt{V\log n} + L\log n$, which, after substituting the bounds on $V$ and $L$, is further bounded by a term proportional to $\|F^{t,(l)}R^{t,(l)} - F^\star\|_{2,\infty}\,\|F^\star\|_{2,\infty}$ (with prefactors involving $C_\infty$, $\log n$, and $np$), with probability exceeding $1 - O(n^{-100})$, as long as $np \gtrsim \log n$. Here the last relation uses (120). Observe that $\|X^{t,(l)}\|_{2,\infty} \le 2\|F^\star\|_{2,\infty}$ as long as $\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}} \lesssim \frac{1}{\sqrt{\kappa^2\log n}}$; see Lemma 18. Making use of the incoherence condition (98a) yields the final bound. We can then conclude the proof.
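For the reader's convenience, we recall the shape of the matrix Bernstein bound being invoked (stated informally here; see [Tro15, Theorem 6.1.1] for the precise constants and conditions): if $S_1,\dots,S_m$ are independent zero-mean random matrices of size $d_1\times d_2$ with $\|S_k\| \le L$, and $V := \max\big\{\big\|\sum_k\mathbb{E}[S_kS_k^\top]\big\|,\ \big\|\sum_k\mathbb{E}[S_k^\top S_k]\big\|\big\}$, then with high probability
$$\Big\|\sum_{k=1}^m S_k\Big\| \;\lesssim\; \sqrt{V\log(d_1+d_2)} \;+\; L\,\log(d_1+d_2),$$
which is exactly the $\sqrt{V\log n} + L\log n$ pattern appearing in the display above.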
Proof of Claim 7. By the unitary invariance of the Frobenius norm, one can reduce to the quantity below. Since the entries of $P_{\Omega_{l,\cdot}}(E)$ are all zero except those on the $l$th row, we obtain the desired bound, where we denote $\delta_{lj} := \mathbb{1}_{\{(l,j)\in\Omega\}}$.

D.6 Proof of Lemma 13

Proof of Claim 8. To facilitate the analysis, we introduce an auxiliary point $\widetilde{F}^{t+1,(l)}$. We first claim that $I_r$ is the best rotation matrix to align $\widetilde{F}^{t+1,(l)}$ and $F^\star$; its proof is similar to that of Claim 4 and is hence omitted for brevity.
Claim 9. The identity matrix $I_r$ is the best rotation matrix to align $\widetilde{F}^{t+1,(l)}$ and $F^\star$.

With this claim at hand, we intend to invoke Lemma 23, where the last line uses Claim 9. Here $\operatorname{sgn}(A) = UV^\top$ for any matrix $A$ with SVD $U\Sigma V^\top$. It then boils down to controlling $\|F^{t+1,(l)}H^{t,(l)} - \widetilde{F}^{t+1,(l)}\|$, for which we have a decomposition in terms of a quantity $B$ defined below. This enables us to obtain the desired control: in view of Lemma 15, the first piece is under control, and we are left with bounding $B$. Decompose $B$ into $B_1$, $B_2$, and $B_3$.
To control $B_1$, following the same argument as in Lemma 8, we obtain the first bound. We now move on to $B_2$, which is equal to $b_2/p$ as defined in the proof of Claim 6. The last term $B_3$ can be easily bounded via Lemma 3, which concludes the proof.

D.7 Proof of Lemma 14
Fix any $1 \le l \le 2n$. Apply the triangle inequality to obtain (127), where the second line follows from Lemma 13. Apply Lemma 18 to the $(t+1)$th iterates to obtain (128), where the second line follows from Lemma 12. Combine (127) and (128) to reach the desired bound as long as $C_\infty \ge 5C_3 + C_4$. The proof is then complete since this holds for all $1 \le l \le 2n$.

D.8 Proof of Lemma 15
To simplify the notation hereafter, we introduce some shorthand. In view of the gradient descent update rules (25), we obtain an identity for the iterates, which gives rise to the decomposition below (with the auxiliary quantities defined therein). With these in mind, a little calculation, together with the triangle inequality, yields the stated bound as long as $\lambda\eta/p < 1$ (a condition that is guaranteed by our assumptions on $\lambda$ and $\eta$). It then boils down to controlling the remaining term, which is supplied in the following claim.
Claim 10. Suppose that the sample complexity satisfies $n^2p \gtrsim \kappa^2\mu^2 r^2 n\log n$ and that the noise satisfies $\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}} \lesssim \frac{1}{\sqrt{\kappa^2\log n}}$. Then one has the bound stated below.

Taking the above bounds together, we arrive at the desired conclusion for some constant $\widetilde{C} > 0$, as long as $\lambda \ge \sigma\sqrt{np}$ and $C_{\mathrm{B}} \gtrsim C_{\mathrm{op}}^2$.

Proof of Claim 10. The triangle inequality yields (131). It is easy to see from Lemma 18 that the relevant norms of the iterates are under control, which allows us to further upper bound (131). It remains to bound $D^t$. To this end, recall from (130) the decomposition of $D^t$ into three terms, which we bound sequentially. First, Lemma 3 tells us that $\big\|\frac{1}{p}P_\Omega(E)\big\| \lesssim \sigma\sqrt{\frac{n}{p}}$. Next, repeating the arguments in the proof of Lemma 8 gives a bound which, together with the induction hypothesis (91e), yields the desired control.

D.9 Proof of Lemma 16
In light of the facts that $f(FR) = f(F)$ and $\nabla f(FR) = \nabla f(F)R$ for any $R \in \mathcal{O}^{r\times r}$, one has a second-order expansion for some $\widetilde{F}$ lying between $F^tH^t$ and $F^tH^t - \eta\nabla f(F^tH^t)$. Suppose for the moment that the conditions (134) hold. One can then invoke Lemma 17 to obtain $\|\nabla^2 f(\widetilde{F})\| \le 10\,\sigma_{\max}$, and hence the claimed descent bound. Here the equality uses again the facts that $f(FR) = f(F)$ and $\nabla f(FR) = \nabla f(F)R$ for any $R \in \mathcal{O}^{r\times r}$, and the last inequality holds as long as $\eta \le \frac{1}{10\sigma_{\max}}$. We are left with proving the aforementioned conditions (134). The first condition has been established in the proof of Lemma 9, and hence we concentrate on the second one, namely (134b). Apply the triangle inequality and the fundamental theorem of calculus [Lan93, Chapter XIII, Theorem 4.2], and follow arguments similar to those in the proofs of Lemma 10 and Lemma 9, to obtain the requisite bound. Here the middle inequality uses the induction hypothesis (91b), and the last relation holds true provided that $\lambda \gtrsim \sigma\sqrt{np}$, $\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}} \lesssim 1/\sqrt{r}$, and $\eta \lesssim 1/(\kappa n\sigma_{\max})$. This proves the second condition and also the whole lemma.

D.10 Proof of Lemma 17
We start by defining a new loss function $f_{\mathrm{clean}}(\cdot,\cdot)$, which satisfies the local geometry bounds (135), including $\|\nabla^2 f_{\mathrm{clean}}(X,Y)\| \le 5\,\sigma_{\max}$.
It then boils down to controlling $-\frac{2}{p}\big\langle P_\Omega(E),\, \Delta_X\Delta_Y^\top\big\rangle + \frac{\lambda}{p}\|\Delta\|_{\mathrm{F}}^2$. To this end, one can bound the first term, where the last relation holds due to Lemma 3 and the elementary fact about the nuclear norm (6), i.e. $|\langle A, B\rangle| \le \|A\|\cdot\|B\|_{*}$.
Regarding the term $\lambda\|\Delta\|_{\mathrm{F}}^2/p$, it is easy to see from the assumption $\lambda \asymp \sigma\sqrt{np}$ that $\frac{\lambda}{p}\|\Delta\|_{\mathrm{F}}^2 \lesssim \sigma\sqrt{\frac{n}{p}}\,\|\Delta\|_{\mathrm{F}}^2$. Combine the above two bounds and use the triangle inequality to reach (137), with the proviso that $\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}} \lesssim 1$. Taking (135) and (137) together immediately establishes the claims on $\nabla^2 f_{\mathrm{aug}}(\cdot,\cdot)$.
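As an aside (an illustrative check, not part of the argument), the duality fact $|\langle A, B\rangle| \le \|A\|\,\|B\|_*$ recorded above, together with the standard bound $\|UV^\top\|_* \le \|U\|_{\mathrm{F}}\,\|V\|_{\mathrm{F}}$ for low-rank products, can be verified numerically:

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 40, 3
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
U = rng.standard_normal((n, r))
V = rng.standard_normal((n, r))

op  = lambda M: np.linalg.norm(M, 2)          # spectral norm
nuc = lambda M: np.linalg.norm(M, "nuc")      # nuclear norm
fro = lambda M: np.linalg.norm(M, "fro")

# duality between the spectral and nuclear norms: |<A, B>| <= ||A|| * ||B||_*
assert abs(np.sum(A * B)) <= op(A) * nuc(B) + 1e-8

# nuclear norm of a rank-r product: ||U V'||_* <= ||U||_F * ||V||_F
assert nuc(U @ V.T) <= fro(U) * fro(V) + 1e-8
```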
D.11 Proof of Lemma 18

Similar bounds can be obtained for $F^{t,(l)}R^{t,(l)} - F^\star$ provided that $n \gtrsim \mu r\log n$. We continue to establish the second set of consequences, namely (100). Since $\|\cdot\|$ is unitarily invariant, one can apply the triangle inequality to get the desired bound. Here (i) uses the induction hypothesis (91b) and the fact that $\|F^\star\| = \sqrt{2}\,\|X^\star\|$, and (ii) holds as long as $\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}} \lesssim 1$. Similarly one can obtain $\|F^t\|_{\mathrm{F}} \le 2\|X^\star\|_{\mathrm{F}}$ provided that $\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}} \lesssim 1$, and $\|F^t\|_{2,\infty} \le 2\|F^\star\|_{2,\infty}$ as long as $\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}} \lesssim \frac{1}{\sqrt{\kappa^2\log n}}$. Notice that along the way, we have also proven the intermediate bounds used above. We now move on to $\|F^tH^t - F^{t,(l)}H^{t,(l)}\|_{\mathrm{F}}$, for which we intend to apply Lemma 22 to connect it with $\|F^tH^t - F^{t,(l)}R^{t,(l)}\|_{\mathrm{F}}$. First, in view of the induction hypothesis (91b), one has the required proximity bound, where the equality arises since $\|F^\star\| = \sqrt{2\sigma_{\max}}$ (see (98a)) and $\|X^\star\| = \sqrt{\sigma_{\max}}$, and the last line holds as long as $\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}} \lesssim 1/\kappa$. In addition, it follows from the induction hypothesis (91c) that the second condition of Lemma 22 holds, where the penultimate inequality arises from the facts that $\|F^\star\|_{2,\infty} \le \sqrt{\mu r\sigma_{\max}/n}$ and $\|F^\star\| = \sqrt{2\sigma_{\max}}$ (cf. (98)), and the last relation holds as long as $\big(\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n\log n}{p}} + \frac{\lambda}{p\,\sigma_{\min}}\big)\sqrt{\frac{\mu r}{n}} \lesssim \frac{1}{\kappa}$. Invoke Lemma 22 with $F_0 = F^\star$, $F_1 = F^tH^t$ and $F_2 = F^{t,(l)}R^{t,(l)}$ to arrive at
$$\big\|F^tH^t - F^{t,(l)}H^{t,(l)}\big\|_{\mathrm{F}} \;\le\; 5\,\frac{\sigma_1^2(F^\star)}{\sigma_r^2(F^\star)}\,\big\|F^tH^t - F^{t,(l)}R^{t,(l)}\big\|_{\mathrm{F}} \;=\; 5\kappa\,\big\|F^tH^t - F^{t,(l)}R^{t,(l)}\big\|_{\mathrm{F}}.$$
The last set of consequences can be derived following arguments similar to those used for the first set; for brevity, we omit the proof.

D.12 Proof of the inequalities (29)
We single out the proof of the bound on $\|X^tY^{t\top} - M^\star\|_\infty$; the bounds on $\|X^tY^{t\top} - M^\star\|_{\mathrm{F}}$ and $\|X^tY^{t\top} - M^\star\|$ follow from the same argument. Recognize the decomposition of $X^tY^{t\top} - M^\star$, which together with the triangle inequality gives an entrywise bound. In view of Lemma 18, one has $\|Y^tH^t\|_{2,\infty} \le 2\|F^\star\|_{2,\infty}$ as long as the noise obeys $\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}} \lesssim \frac{1}{\sqrt{\kappa^2\log n}}$. This further implies that
$$\big\|X^tY^{t\top} - M^\star\big\|_\infty \;\le\; 3C_\infty\,\kappa\left(\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n\log n}{p}} + \frac{\lambda}{p\,\sigma_{\min}}\right)\|F^\star\|_{2,\infty}^2 \;\le\; 3C_\infty\sqrt{\kappa^3\mu r}\left(\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n\log n}{p}} + \frac{\lambda}{p\,\sigma_{\min}}\right)\|M^\star\|_\infty,$$
where the last relation uses $\|F^\star\|_{2,\infty}^2 \le \sqrt{\kappa\mu r}\,\|M^\star\|_\infty$. To see this, one has, for any $1 \le i \le n$,
$$\big\|X^\star_{i,\cdot}\big\|_2^2\;\lambda_{\min}\big(Y^{\star\top}Y^\star\big) \;\le\; \big\|X^\star_{i,\cdot}\,Y^{\star\top}\big\|_2^2 \;=\; \sum_{1\le j\le n}\big(M^\star_{ij}\big)^2 \;\le\; n\,\|M^\star\|_\infty^2.$$
Here $\lambda_{\min}(\cdot)$ denotes the minimum eigenvalue. Since the inequality holds for all $1 \le i \le n$, and since $\lambda_{\min}(Y^{\star\top}Y^\star) = \sigma_{\min}$, we arrive at $\|X^\star\|_{2,\infty} \le \sqrt{n/\sigma_{\min}}\,\|M^\star\|_\infty$. Similarly one can obtain $\|Y^\star\|_{2,\infty} \le \sqrt{n/\sigma_{\min}}\,\|M^\star\|_\infty$, which further implies $\|F^\star\|_{2,\infty} = \max\{\|X^\star\|_{2,\infty}, \|Y^\star\|_{2,\infty}\} \le \sqrt{n/\sigma_{\min}}\,\|M^\star\|_\infty$. As a result, we arrive at $\|F^\star\|_{2,\infty}^2 \le \sqrt{\mu r\sigma_{\max}/n}\cdot\sqrt{n/\sigma_{\min}}\,\|M^\star\|_\infty = \sqrt{\kappa\mu r}\,\|M^\star\|_\infty$, where we used the incoherence assumption (98a).
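The elementary bound $\|X^\star\|_{2,\infty} \le \sqrt{n/\sigma_{\min}}\,\|M^\star\|_\infty$ for a balanced factorization can be checked numerically; the snippet below (purely illustrative, with randomly generated data) builds balanced factors of a random rank-$r$ matrix and verifies the inequality.

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 200, 5

# random rank-r matrix and a balanced factorization M = X Y' with X'X = Y'Y
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
U, s, Vt = np.linalg.svd(M)
X = U[:, :r] * np.sqrt(s[:r])          # X = U Sigma^{1/2}
Y = Vt[:r].T * np.sqrt(s[:r])          # Y = V Sigma^{1/2}
sigma_min = s[r - 1]                    # smallest nonzero singular value of M

row_norm = np.linalg.norm(X, axis=1).max()              # ||X||_{2,inf}
assert np.allclose(X @ Y.T, M)
assert np.allclose(X.T @ X, Y.T @ Y)
assert row_norm <= np.sqrt(n / sigma_min) * np.abs(M).max() + 1e-10
```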

E Technical lemmas
Lemma 19. Suppose $n^2p \ge Cn\log n$ for some sufficiently large constant $C > 0$. Then with probability exceeding $1 - O(n^{-10})$, the claimed bound holds.

Proof. One has the first bound and, similarly, $\|AB^\top\|_{\mathrm{F}}^2 \le n\,\|A\|_{\mathrm{F}}^2\,\|B\|_{2,\infty}^2$. Combining the previous bounds with the triangle inequality establishes the claim.
Lemma 20. Let $U\Sigma V^\top$ be the SVD of a rank-$r$ matrix $XY^\top$ with $X, Y \in \mathbb{R}^{n\times r}$. Then there exists an invertible matrix $Q \in \mathbb{R}^{r\times r}$ such that $X = U\Sigma^{1/2}Q$ and $Y = V\Sigma^{1/2}Q^{-\top}$. In addition, the perturbation bound (138) holds, where $U_Q\Sigma_QV_Q^\top$ is the SVD of $Q$. In particular, if $X$ and $Y$ have balanced scale, i.e. $X^\top X - Y^\top Y = 0$, then $Q$ must be a rotation matrix.
Proof. The existence of $Q$ is trivial by setting $Q := \Sigma^{-1/2}U^\top X$. To see this, one has $U\Sigma^{1/2}Q = U\Sigma^{1/2}\Sigma^{-1/2}U^\top X = UU^\top X = X$, where the last equality holds since the columns of $U$ form an orthonormal basis of the column space of $X$. The relation $Y = V\Sigma^{1/2}Q^{-\top}$ can be verified in a similar fashion. We now move on to proving the perturbation bound (138). In view of the SVD of $Q$, i.e. $Q = U_Q\Sigma_QV_Q^\top$, one can reduce the bound to one involving $\Sigma_Q$. Denote $B := U_Q^\top\Sigma U_Q \succ 0$, let $C = \Sigma_Q B^{1/2}$ and $D = \Sigma_Q^{-1}B^{1/2}$, and write $\Delta = C - D$. Note that $C^\top D = B$ and that $C^\top\Delta = C^\top C - C^\top D = C^\top C - B$ is symmetric. One can then continue the bound, where the key inequality follows since $4 - 2\sqrt{2} \ge 1$. Write $B = B^{1/2}\cdot B^{1/2}$ and recognize that $\sigma_{\min}(B) = \sigma_{\min}(\Sigma)$ to finish the proof of (138). Combining $X^\top X = Y^\top Y$ and (138) yields $\|\Sigma_Q - \Sigma_Q^{-1}\|_{\mathrm{F}} = 0$, which implies $\Sigma_Q = I_r$. Under this circumstance, $Q = U_Q\Sigma_QV_Q^\top = U_QV_Q^\top$ is a rotation matrix. The proof is then complete.
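A quick numerical illustration of Lemma 20 (not part of the proof): starting from balanced factors $X, Y$ with $X^\top X = Y^\top Y$, the matrix $Q = \Sigma^{-1/2}U^\top X$ recovered from the SVD of $XY^\top$ is indeed orthonormal.

```python
import numpy as np

rng = np.random.default_rng(4)
n, r = 60, 4

# build a balanced pair: X = U0 S^{1/2} R, Y = V0 S^{1/2} R for a rotation R
U0, _ = np.linalg.qr(rng.standard_normal((n, r)))
V0, _ = np.linalg.qr(rng.standard_normal((n, r)))
S = np.diag(np.sort(rng.uniform(1.0, 5.0, size=r))[::-1])
R, _ = np.linalg.qr(rng.standard_normal((r, r)))
X, Y = U0 @ np.sqrt(S) @ R, V0 @ np.sqrt(S) @ R
assert np.allclose(X.T @ X, Y.T @ Y)            # balanced scale

# recover Q from the SVD of X Y' as in Lemma 20 and check it is a rotation
U, s, Vt = np.linalg.svd(X @ Y.T)
Q = np.diag(1.0 / np.sqrt(s[:r])) @ U[:, :r].T @ X
assert np.allclose(X, U[:, :r] @ np.diag(np.sqrt(s[:r])) @ Q)
assert np.allclose(Q @ Q.T, np.eye(r), atol=1e-8)
```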
Lemma 21. For all $A, B, C, D \in \mathbb{R}^{n\times r}$, one has the stated inequality.

Proof. This is a simple consequence of the following two bounds: one has the first bound and, similarly, $\sum_{k}\|C_{k,\cdot}\|_2^2\,\|D_{k,\cdot}\|_2^2 \le \|C\|_{2,\infty}^2\,\|D\|_{\mathrm{F}}^2$. Putting these together concludes the proof.
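Both this row-norm bound and the inequality $\|AB^\top\|_{\mathrm{F}}^2 \le n\,\|A\|_{\mathrm{F}}^2\,\|B\|_{2,\infty}^2$ used in the proof of Lemma 19 are elementary; the following snippet (an illustration only) confirms them on random inputs.

```python
import numpy as np

rng = np.random.default_rng(5)
n, r = 80, 6
A, B, C, D = (rng.standard_normal((n, r)) for _ in range(4))

row = lambda M: np.linalg.norm(M, axis=1)       # row-wise l2 norms
two_inf = lambda M: row(M).max()                # ||M||_{2,inf}
fro = lambda M: np.linalg.norm(M, "fro")

# sum_k ||C_k||^2 ||D_k||^2 <= ||C||_{2,inf}^2 ||D||_F^2
assert np.sum(row(C) ** 2 * row(D) ** 2) <= two_inf(C) ** 2 * fro(D) ** 2 + 1e-8

# ||A B'||_F^2 <= n ||A||_F^2 ||B||_{2,inf}^2
assert fro(A @ B.T) ** 2 <= n * fro(A) ** 2 * two_inf(B) ** 2 + 1e-8
```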
Lemma 22. Suppose $F_1, F_2, F_0 \in \mathbb{R}^{2n\times r}$ are three matrices satisfying the two conditions stated therein. Then the following two inequalities hold true.

Proof. This is the same as [MWCC17, Lemma 37].
Lemma 23. Let $S \in \mathbb{R}^{r\times r}$ be a nonsingular matrix. Then for any matrix $K \in \mathbb{R}^{r\times r}$ with $\|K\| \le \sigma_{\min}(S)$, one has
$$\big\|\operatorname{sgn}(S + K) - \operatorname{sgn}(S)\big\| \;\le\; \frac{2}{\sigma_{r-1}(S) + \sigma_r(S)}\,\|K\|,$$
where $\operatorname{sgn}(\cdot)$ denotes the matrix sign function, i.e. $\operatorname{sgn}(A) = UV^\top$ for a matrix $A$ with SVD $U\Sigma V^\top$.
Proof. This is the same as [MWCC17, Lemma 36].
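As a final illustration (not part of the paper), the perturbation bound of Lemma 23 can be probed numerically: for random nonsingular $S$ and a perturbation $K$ with $\|K\| \le \sigma_{\min}(S)$, the change in the matrix sign function stays below $\frac{2\|K\|}{\sigma_{r-1}(S)+\sigma_r(S)}$.

```python
import numpy as np

rng = np.random.default_rng(6)
r = 5

def sgn(M):
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt

for _ in range(100):
    S = rng.standard_normal((r, r)) + 3 * np.eye(r)   # nonsingular with high probability
    s = np.linalg.svd(S, compute_uv=False)
    K = rng.standard_normal((r, r))
    K *= 0.5 * s[-1] / np.linalg.norm(K, 2)           # enforce ||K|| <= sigma_min(S)
    lhs = np.linalg.norm(sgn(S + K) - sgn(S), 2)
    rhs = 2.0 * np.linalg.norm(K, 2) / (s[-2] + s[-1])
    assert lhs <= rhs + 1e-8
```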