Matrix Completion via Max-Norm Constrained Optimization

Matrix completion has been well studied under the uniform sampling model, and trace-norm regularized methods perform well both theoretically and numerically in that setting. However, the uniform sampling model is unrealistic for a range of applications, and the standard trace-norm relaxation can behave very poorly when the underlying sampling scheme is non-uniform. In this paper we propose and analyze a max-norm constrained empirical risk minimization method for noisy matrix completion under a general sampling model. The optimal rate of convergence is established under the Frobenius norm loss in the context of approximately low-rank matrix reconstruction. It is shown that the max-norm constrained method is minimax rate-optimal and yields a unified and robust approximate recovery guarantee with respect to the sampling distributions. The computational effectiveness of this method is also discussed, based on first-order algorithms for solving convex optimization problems involving max-norm regularization.


Introduction
The problem of recovering a low-rank matrix from a subset of its entries, also known as matrix completion, has been an active topic of recent research, with a range of applications including collaborative filtering (the Netflix problem) (Goldberg et al., 1992), multi-task learning (Argyriou, Evgeniou and Pontil, 2008), system identification (Liu and Vandenberghe, 2009), and sensor localization (Singer and Cucuringu, 2010; Candès and Plan, 2010), among many others. We refer to Candès and Plan (2010) for detailed discussions of the aforementioned applications. Another noteworthy example is the structure-from-motion problem in computer vision (Tomasi and Kanade, 1992; Chen and Suter, 2004). Let f and d be the number of frames and feature points, respectively. The data are stacked into a low-rank matrix of trajectories, say M ∈ R^{2f×d}, such that every element of M corresponds to an image coordinate of a feature point of a rigid moving object at a given frame. Due to object occlusions, tracking errors, or variables out of range (i.e., images beyond the camera's field of view), missing data are inevitable in real-life applications and are represented as empty entries in the matrix. Therefore, accurate and effective matrix completion methods, which fill in missing entries with suitable estimates, are required.
Because a direct search for the lowest-rank matrix satisfying the equality constraints is NP-hard, most previous work on matrix completion has focused on using the trace-norm, defined as the sum of the singular values of the matrix, as a convex relaxation for the rank. This can be viewed as an analogue of relaxing the sparsity of a vector to its ℓ1-norm, which has been shown to be effective both empirically and theoretically in compressed sensing. Several recent papers proved, in different settings, that a generic d × d rank-r matrix can be exactly and efficiently recovered from O{rd poly(log d)} randomly chosen entries (Candès and Recht, 2009; Candès and Tao, 2010; Gross, 2011; Recht, 2011). These results thus provide theoretical guarantees for the constrained trace-norm minimization method. In the case of recovering approximately low-rank matrices from noisy observations, several types of trace-norm based estimators, akin to the Lasso and the Dantzig selector used in sparse signal recovery, have been proposed and well studied. See, for example, Candès and Plan (2010), Keshavan and Montanari (2010), Rohde and Tsybakov (2011), Koltchinskii, Lounici and Tsybakov (2011), Negahban and Wainwright (2012), Koltchinskii (2011) and Klopp (2011, 2014), among others.
It is, however, unclear whether the trace-norm is the best convex relaxation for the rank, especially when the underlying sampling scheme is non-uniform and, more importantly, unknown. A matrix M ∈ R^{d1×d2} can be viewed as an operator mapping from R^{d2} to R^{d1}, and its rank can be alternatively expressed as the smallest integer k such that M can be decomposed as M = UV⊺ for some U ∈ R^{d1×k} and V ∈ R^{d2×k}. In view of the matrix factorization M = UV⊺, by enforcing U and V to have a small number of columns we obtain a low-rank M. The number of columns of U and V can be relaxed in a different way from the usual trace-norm by the so-called max-norm (Linial et al., 2004), defined by

‖M‖max = inf{ ‖U‖_{2,∞} ‖V‖_{2,∞} : M = UV⊺ },   (1.1)

where the infimum is carried out over all factorizations M = UV⊺, with ‖U‖_{2,∞} denoting the operator norm of U: ℓ2^k → ℓ∞^{d1} and ‖V‖_{2,∞} the operator norm of V: ℓ2^k → ℓ∞^{d2} (or, equivalently, of V⊺: ℓ1^{d2} → ℓ2^k), and k = 1, ..., min(d1, d2). Note that ‖U‖_{2,∞} is also the maximum ℓ2 row norm of U. Since ℓ2 is a Hilbert space, the factorization constant ‖·‖max indeed defines a norm on the space of operators between ℓ1^{d2} and ℓ∞^{d1}. The max-norm was recently proposed as an alternative convex surrogate for the rank of a matrix. For collaborative filtering problems, the max-norm has been shown to be empirically superior to the trace-norm (Srebro, Rennie and Jaakkola, 2004). Foygel and Srebro (2011) used the max-norm for matrix completion under the uniform sampling distribution. Their results are direct consequences of a recent bound on the excess risk for a smooth loss function, such as the quadratic loss, with a bounded second derivative (Srebro, Sridharan and Tewari, 2010). Further, a max-norm constrained maximum likelihood method was considered by Cai and Zhou (2013) for one-bit matrix completion, where instead of observing real-valued entries of an unknown matrix one is only able to see binary outputs, i.e.
yes/no, true/false, agree/disagree (Davenport et al., 2014). Theoretical guarantees are obtained under general non-uniform sampling models, and numerical studies show that the max-norm based approach is comparable to, and sometimes slightly outperforms, the corresponding trace-norm method.
Matrix completion has been well analyzed under the uniform sampling model, where observed entries are assumed to be sampled randomly and uniformly. In such a setting, the trace-norm regularized approach has been shown to have good theoretical and numerical performance. However, in some applications such as collaborative filtering, the uniform sampling model is unrealistic. For example, in the Netflix problem, the uniform sampling model is equivalent to assuming that all users are equally likely to rate each movie and all movies are equally likely to be rated by any user. In practice, invariably some users are more active than others and some movies are more popular and thus rated more frequently. Hence, the sampling distribution is in fact non-uniform in the real world. In such a setting, Salakhutdinov and Srebro (2010) showed that the standard trace-norm relaxation can sometimes behave poorly, and suggested a weighted trace-norm penalty, which incorporates knowledge of the true sampling distribution in its construction. Since the true sampling distribution is most likely unknown and can only be estimated from the locations of the entries revealed in the sample, a practically available method relies on the empirically-weighted trace-norm (Foygel et al., 2011). It is also worth noting that, when the sampling probabilities are bounded from below and above, the trace-norm penalized estimator is minimax optimal up to a logarithmic factor (Klopp, 2014). We refer to Fang et al. (2015b) for further numerical evaluations of the trace-norm regularized method under various non-uniform sampling schemes.
In this paper, we employ the max-norm as a convex relaxation for the rank to study matrix completion based on noisy observations in a general, unspecified sampling model. The rate of convergence for the max-norm constrained least squares estimator is obtained. Information-theoretic methods are used to establish a matching minimax lower bound under the general non-uniform sampling model. Together, the minimax upper and lower bounds yield the optimal rate of convergence for the Frobenius norm loss. It is shown that the max-norm regularized approach indeed provides a unified and robust approximate recovery guarantee with respect to sampling schemes. In the special case of the uniform sampling model, our results also show that the extra logarithmic factors appearing in the error rates obtained by Srebro, Sridharan and Tewari (2010) and Foygel and Srebro (2011) can be avoided by a careful analysis that matches the minimax lower bound with the upper bound (see Theorems 3.1 and 3.3 and the discussions in Section 3).
The max-norm constrained minimization problem is a convex program. To solve general convex programs that involve either a max-norm constraint or a max-norm penalty, a first-order algorithm was proposed by Lee et al. (2010), which is computationally effective and outperforms the semi-definite programming (SDP) method of Srebro, Rennie and Jaakkola (2004). In principle, the method of Lee et al. (2010) is based on a nonconvex relaxation; therefore, their algorithm is only guaranteed to find a stationary point, and the statistical properties of such solutions are difficult to analyze. Recently, Fang et al. (2015b) proposed a scalable algorithm based on the alternating direction method of multipliers to efficiently solve the max-norm constrained optimization problem with a guaranteed rate of convergence to the global optimum. In summary, the max-norm constrained empirical risk minimization problem can indeed be solved in time polynomial in the sample size and the matrix dimensions.
The remainder of the paper is organized as follows. After introducing basic notation and definitions, Section 2 collects a few useful results on the max-norm, the trace-norm and Rademacher complexity that will be needed in the rest of the paper. Section 3 introduces the model and the estimation procedure and then investigates the theoretical properties of the estimator. Both minimax upper and lower bounds are given. The results show that the max-norm constrained minimization method achieves the optimal rate of convergence over the parameter space. A comparison with past work is also given. Computation and implementation issues are discussed in Section 4. A brief discussion is given in Section 5, and the proofs of the main results and key technical lemmas are placed in Section 6.

Notations and Preliminaries
In this section, we begin with some notation that will be used throughout the paper, and then collect some known results on the max-norm, trace-norm and Rademacher complexity that will be applied repeatedly later.
For any positive integer d, we use [d] to denote the collection of integers {1, 2, ..., d}. For any set S, denote by S^c its complement and by |S| its cardinality. For a vector u ∈ R^d and 1 ≤ p < ∞, define its ℓp-norm by ‖u‖p = (Σ_{k=1}^d |u_k|^p)^{1/p}. For a matrix M = (M_{kℓ}) ∈ R^{d1×d2}, let ‖M‖_F = (Σ_{k,ℓ} M_{kℓ}²)^{1/2} be the Frobenius norm and let ‖M‖∞ = max_{(k,ℓ)∈[d1]×[d2]} |M_{kℓ}| denote the elementwise ℓ∞-norm. Given two norms ℓp and ℓq on R^{d1} and R^{d2} respectively, the corresponding operator norm ‖·‖_{p,q} of a matrix M ∈ R^{d1×d2} is defined by ‖M‖_{p,q} = sup_{‖u‖p=1} ‖Mu‖q. It is easy to verify that ‖M‖_{p,q} = ‖M⊺‖_{q*,p*}, where (p, p*) and (q, q*) are conjugate pairs; that is, 1/p + 1/p* = 1/q + 1/q* = 1. In particular, ‖M‖ = ‖M‖_{2,2} is the spectral norm, and ‖M‖_{2,∞} = max_{k=1,...,d1} (Σ_{ℓ=1}^{d2} M_{kℓ}²)^{1/2} is also known as the maximum row norm of M. Moreover, for two real numbers a and b, we write for ease of presentation a ∨ b = max(a, b) and a ∧ b = min(a, b).
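As a quick numerical illustration of these definitions (our own sketch, not part of the paper), the following Python snippet evaluates ‖M‖_F, ‖M‖∞, the spectral norm ‖M‖_{2,2} and the maximum row norm ‖M‖_{2,∞} for a small matrix:

```python
import numpy as np

def norm_2_inf(M):
    # ‖M‖_{2,∞}: the maximum ℓ2-norm over the rows of M
    return np.sqrt((M ** 2).sum(axis=1)).max()

M = np.array([[3.0, 4.0],
              [1.0, 0.0],
              [0.0, 2.0]])

fro = np.linalg.norm(M, "fro")        # Frobenius norm
spec = np.linalg.norm(M, 2)           # spectral norm ‖M‖_{2,2}
elem_inf = np.abs(M).max()            # elementwise ℓ∞-norm
row_norm = norm_2_inf(M)              # maximum row norm ‖M‖_{2,∞}
```

Here the first row (3, 4) has ℓ2-norm 5, so ‖M‖_{2,∞} = 5 while ‖M‖∞ = 4; the inequality ‖M‖∞ ≤ ‖M‖_{2,∞} always holds, since each entry is dominated by the ℓ2-norm of its row.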

Max-norm and trace-norm
For a matrix M ∈ R^{d1×d2}, the trace-norm (also known as the Schatten 1-norm) ‖M‖₁ is defined as the sum of all singular values of M or, equivalently,

‖M‖₁ = min{ ½(‖U‖_F² + ‖V‖_F²) : M = UV⊺ }.

In other words, the trace-norm promotes low-rank decompositions with factors in ℓ2. Similarly, using Grothendieck's inequality (Jameson, 1987), the max-norm defined in (1.1) has the following analogous representation in terms of factors in ℓ∞: it is equivalent, up to a constant factor, to

inf{ Σ_j |σ_j| : M = Σ_j σ_j u_j v_j⊺, u_j ∈ {±1}^{d1}, v_j ∈ {±1}^{d2} }.

The factor of equivalence is Grothendieck's constant K_G ∈ (1.67, 1.79).
Based on these properties, the max-norm regularization is expected to be more effective when dealing with uniformly bounded data (Lee et al., 2010).
Of the same spirit as the definition of the max-norm in (1.1), the trace-norm has the following equivalent characterization in terms of matrix factorizations:

‖M‖₁ = min{ ‖U‖_F ‖V‖_F : M = UV⊺ }.   (2.1)

See, for example, Srebro and Shraibman (2005). It is easy to see that

(1/√(d1 d2)) ‖M‖₁ ≤ ‖M‖max,

which in turn implies that any low max-norm approximation is also a low trace-norm approximation. As pointed out by Srebro and Shraibman (2005), however, there can be a large gap between (1/√(d1 d2))‖·‖₁ and ‖·‖max. The following relation between the trace-norm and the Frobenius norm is well known: ‖M‖_F ≤ ‖M‖₁ ≤ √(rank(M)) ‖M‖_F. An analogous bound holds for the max-norm, in connection with the elementwise ℓ∞-norm (Linial et al., 2004):

‖M‖∞ ≤ ‖M‖max ≤ √(rank(M)) ‖M‖∞.   (2.2)

For any R > 0, let B_max(R) = {M ∈ R^{d1×d2} : ‖M‖max ≤ R} and B₁(R) = {M ∈ R^{d1×d2} : (1/√(d1 d2))‖M‖₁ ≤ R} be the max-norm ball and the (scaled) trace-norm ball with radius R, respectively. It is now well known (Srebro and Shraibman, 2005) that B_max(1) can be bounded, from both below and above, by the convex hull of rank-one sign matrices:

conv(M±) ⊂ B_max(1) ⊂ K_G conv(M±), where M± = { uv⊺ : u ∈ {±1}^{d1}, v ∈ {±1}^{d2} },

with K_G ∈ (1.67, 1.79) denoting Grothendieck's constant. Moreover, M± is a finite class with cardinality |M±| = 2^{d1+d2−1}.
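These norm relations can be checked numerically on rank-one sign matrices, for which the three scales coincide. The sketch below is our own illustration (not from the paper); `max_norm_ub` is only an upper bound on ‖M‖max obtained from one explicit factorization, which here matches the exact value since ‖M‖∞ ≤ ‖M‖max:

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 6, 8
u = rng.choice([-1.0, 1.0], size=d1)
v = rng.choice([-1.0, 1.0], size=d2)
M = np.outer(u, v)                       # rank-one sign matrix, an element of M±

# Upper bound on ‖M‖max from the factorization M = u v⊺:
# both factors have maximum row norm 1, so ‖M‖max ≤ 1.
max_norm_ub = np.abs(u).max() * np.abs(v).max()

trace_norm = np.linalg.svd(M, compute_uv=False).sum()
scaled_trace = trace_norm / np.sqrt(d1 * d2)   # (1/√(d1 d2)) ‖M‖₁
elem_inf = np.abs(M).max()                     # ‖M‖∞
```

For this M, ‖M‖∞ = ‖M‖max = (1/√(d1 d2))‖M‖₁ = 1, consistent with the chains of inequalities above.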

Rademacher complexity
A technical tool used in our analysis involves data-dependent estimates of the Rademacher and Gaussian complexities of a function class. We refer to Bartlett and Mendelson (2002) and Srebro and Shraibman (2005) for a detailed introduction to these concepts.
Definition 2.1. For a class F of functions mapping from X to R, its empirical Rademacher complexity over a specific sample S = (x1, x2, ..., xn) ⊆ X is given by

R̂_S(F) = E_ε [ sup_{f∈F} (2/n) | Σ_{t=1}^n ε_t f(x_t) | ],

where ε1, ..., εn are i.i.d. Rademacher (symmetric ±1-valued) random variables. Replacing ε1, ..., εn with independent standard normal variables g1, ..., gn leads to the definition of the (empirical) Gaussian complexity.
Considering a matrix as a function from index pairs to entry values, Srebro and Shraibman (2005) obtained upper bounds on the Rademacher complexity of the unit balls under both the trace-norm and the max-norm. Specifically, for any d1, d2 > 2 and any sample of size 2 < |S| < d1 d2, the empirical Rademacher complexity of the max-norm unit ball is bounded by

R̂_S(B_max(1)) ≤ 12 √( (d1 + d2) / |S| ).   (2.4)
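Since a linear functional attains its supremum over a convex hull at an extreme point, the empirical Rademacher complexity over conv(M±) equals that over the finite class M±, and by the sandwich conv(M±) ⊂ B_max(1) ⊂ K_G conv(M±) this pins down R̂_S(B_max(1)) up to the factor K_G. The following Monte Carlo sketch (our own illustration; dimensions, sample size and the number of Rademacher draws are arbitrary choices) estimates this quantity for a tiny matrix by enumerating all rank-one sign matrices:

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
d1, d2, n, reps = 3, 3, 8, 2000

# All rank-one sign matrices u v⊺ (the extreme points of conv(M±))
signs1 = list(itertools.product([-1.0, 1.0], repeat=d1))
signs2 = list(itertools.product([-1.0, 1.0], repeat=d2))
cls = np.array([np.outer(u, v) for u in signs1 for v in signs2])

# A random sample S of index pairs, and f(x_t) for every f in the class
rows = rng.integers(0, d1, size=n)
cols = rng.integers(0, d2, size=n)
vals = cls[:, rows, cols]                 # shape (|class|, n)

# Monte Carlo estimate of E_eps sup_f (2/n) |sum_t eps_t f(x_t)|
eps = rng.choice([-1.0, 1.0], size=(reps, n))
sups = np.abs(eps @ vals.T).max(axis=1)
rad_est = (2.0 / n) * sups.mean()
```

Because every entry value lies in {−1, +1}, the estimate is trivially bounded by 2, and for such small dimensions it is close to that trivial bound, in line with (2.4) only becoming informative once |S| exceeds d1 + d2.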

The statistical model
We now consider matrix completion under a general random sampling model. Let M* ∈ R^{d1×d2} be the unknown target matrix. Suppose that a random sample of the index set is drawn independently according to a general sampling distribution Π = {π_{kℓ}}_{(k,ℓ)∈[d1]×[d2]} on [d1] × [d2]. Given a random index subset S = {(i1, j1), ..., (in, jn)} of size n, we observe noisy entries {Y_{i_t j_t}}_{t=1}^n indexed by S, i.e.

Y_{i_t j_t} = M*_{i_t j_t} + σ ξ_t, t = 1, ..., n,   (3.1)

for some σ > 0. The noise variables ξ_t are independent with zero mean and unit variance. By expressing the model as in (3.1), it is implicitly assumed that the noise on an entry is drawn independently each time that entry is sampled.
Instead of assuming the uniform sampling distribution, we consider a general sampling distribution Π here. Motivated by some applications, to ensure that each entry is observed with positive probability, it is sometimes natural to assume that there exists a constant ν ≥ 1 such that

π_{kℓ} ≥ 1/(ν d1 d2) for all (k, ℓ) ∈ [d1] × [d2].   (3.2)

The Frobenius norm ‖·‖_F is typically used in the literature as a natural measure of the estimation accuracy. Now that the sampling distribution Π is arbitrary, we use instead the weighted Frobenius norm with respect to Π to measure the estimation error. For any A = (A_{kℓ}) ∈ R^{d1×d2}, define

‖A‖_{w(Π)} = ( Σ_{(k,ℓ)∈[d1]×[d2]} π_{kℓ} A_{kℓ}² )^{1/2}.   (3.3)

The preceding work on matrix completion has mainly focused on the case of exactly low-rank matrices. Here we allow a relaxation of this assumption and consider the more general setting of approximately low-rank matrices. Specifically, we consider recovery of matrices in the class with ℓ∞-norm and max-norm constraints defined by

K(α, R) = { M ∈ R^{d1×d2} : ‖M‖∞ ≤ α, ‖M‖max ≤ R }.   (3.4)

Here both α and R are free parameters to be determined. If the matrix M* is of rank at most r and ‖M*‖∞ ≤ α, then by (2.2) we have ‖M*‖max ≤ α√r, and hence M* ∈ K(α, α√r).
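To make the sampling model concrete, the following sketch (our own illustration; the particular choice of Π as a product of Dirichlet-drawn marginals is just one convenient way to obtain a non-uniform distribution) simulates model (3.1): a bounded low-rank target, a non-uniform distribution on [d1] × [d2], and n noisy entry observations:

```python
import numpy as np

rng = np.random.default_rng(1)
d1, d2, n, sigma = 20, 30, 200, 0.1

# Low-rank target M* = U V^T with entries bounded by the rank r
r = 2
U = rng.uniform(-1, 1, size=(d1, r))
V = rng.uniform(-1, 1, size=(d2, r))
M_star = U @ V.T

# Non-uniform sampling distribution Pi (product of non-uniform marginals)
row_p = rng.dirichlet(np.ones(d1))
col_p = rng.dirichlet(np.ones(d2))
Pi = np.outer(row_p, col_p)              # entries sum to one

# Draw n index pairs i.i.d. from Pi and observe noisy entries, as in (3.1)
idx = rng.choice(d1 * d2, size=n, p=Pi.ravel())
rows, cols = np.unravel_index(idx, (d1, d2))
Y = M_star[rows, cols] + sigma * rng.standard_normal(n)
```

Each |M*_{kℓ}| is at most r here, so α = r and R = α√r give one valid choice of parameters with M* ∈ K(α, R), by (2.2).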

Max-norm constrained least squares estimator
Given a collection of observations Y_S = {Y_{i_t j_t}}_{t=1}^n from the observation model (3.1), we estimate the unknown M* ∈ K(α, R), for some α, R > 0, by the minimizer of the empirical risk with respect to the quadratic loss function L_n(M; Y) = (1/n) Σ_{t=1}^n (Y_{i_t j_t} − M_{i_t j_t})². That is,

M̂_max = argmin_{M ∈ K(α,R)} L_n(M; Y).   (3.5)

The minimization procedure requires that all the entries of M* be bounded in magnitude by a prespecified constant α. This condition enforces that M* should not be too "spiky", and too large a bound may jeopardize the accuracy of the estimation. See, for example, Koltchinskii, Lounici and Tsybakov (2011), Negahban and Wainwright (2012) and Klopp (2014). On the other hand, as argued in Lee et al. (2010), max-norm regularization is expected to be particularly effective for uniformly bounded data, which is our main motivation for using the max-norm constrained estimator.
Although the max-norm constrained minimization problem (3.5) is a convex program, fast and efficient algorithms for solving large-scale optimization problems that incorporate the max-norm have only been developed recently, in Lee et al. (2010) and Fang et al. (2015b). We will show in Section 4 that the convex optimization problem (3.5) can be solved in time polynomial in the sample size n and the dimensions d1 and d2.

Upper bounds
In this section, we state our main results regarding the recovery of an approximately low-rank (low-max-norm) matrix M * using max-norm constrained empirical risk minimization.
Theorem 3.1. Suppose that the noise variables {ξ_t}_{t=1}^n are independent sub-exponential random variables; that is, there is a constant K > 0 such that

max_{t=1,...,n} E exp(|ξ_t|/K) ≤ e,   (3.6)

and suppose the parameters α, R > 0 are such that M* ∈ K(α, R). Then, for a sample size n ≥ d1 + d2, the error bound (3.7) holds with probability at least 1 − 2e^{−d}, where d = d1 + d2.
(1) It is worth noting that the general approximate reconstruction guarantee (3.7) holds without any prior information on the sampling distribution Π, in particular without the lower bound assumption (3.2). In fact, the result reflects that, for each location index (k, ℓ), the smaller the sampling probability π_{kℓ} is, the more difficult it will be to recover the entry at that location.
(2) The upper bounds given in Theorem 3.1 hold with high probability. The rate of convergence in expectation can be obtained as a direct consequence. More specifically, for a sample size n with d ≤ n ≤ d1 d2, the bound (3.9) on the expected error, uniform over K(α, R), holds. In view of the upper bound in (6.1), when the noise level σ is comparable to or dominated by α, the rate is of order αR(d/n)^{1/2}. To fully understand how the random noise affects the estimation accuracy, particularly when σ is much smaller than α, we provide a complementary result in Theorem 3.2, which generalizes Theorem 9 in Foygel and Srebro (2011) to the general non-uniform sampling model.

Theorem 3.2. Assume that the conditions of Theorem 3.1 are satisfied and that σ ≤ α. Then the corresponding error bound holds with probability at least 1 − 2n^{−1} over a random sample of size n satisfying the stated growth condition.

An interesting consequence of Theorem 3.2 is that, in the noiseless case where σ = 0 and a random subset of the entries of M* is perfectly observed, for any prespecified tolerance ε > 0 the target matrix M* can be approximately recovered to within accuracy ε, provided the sample size is large enough.

Information-theoretic lower bounds
Theorem 3.1 gives the rate of convergence for the max-norm constrained least squares estimator M̂_max. In this section we use information-theoretic methods to establish a minimax lower bound for matrix completion over the max-norm ball under non-uniform random sampling. The minimax lower bound matches the rate of convergence given in (3.8) when the sampling distribution Π satisfies 1/(ν d1 d2) ≤ min_{k,ℓ} π_{kℓ} ≤ max_{k,ℓ} π_{kℓ} ≤ µ/(d1 d2) for some constants ν and µ. The results show that the max-norm constrained least squares estimator is indeed rate-optimal in such a setting.
To derive the lower bound, we assume that the sampling distribution Π satisfies

max_{k,ℓ} π_{kℓ} ≤ µ/(d1 d2)   (3.11)

for a positive constant µ ≥ 1. Clearly, when µ = 1, this amounts to saying that the sampling distribution is uniform.
Theorem 3.3. Suppose that the noise variables {ξ_t}_{t=1}^n are i.i.d. standard normal random variables, the sampling distribution Π satisfies condition (3.11), and the quintuple (n, d1, d2, α, R) satisfies condition (3.12). Then the minimax lower bound (3.13) holds. In particular, for a sufficiently large sample size n, the lower bound takes the form (3.14).

Assume that both ν and µ, appearing in (3.2) and (3.11) respectively, are bounded above by universal constants. Comparing the lower bound (3.14) with the upper bound (3.9) then shows that, once the sample size n is large enough, the max-norm constrained least squares estimator (3.5) is rate-optimal. The requirement here on the sample size n is a mild constraint, since R² is of order α² r0 in the exactly low-rank case where r0 = rank(M*).
The proof of Theorem 3.3 uses information-theoretic methods. A key technical tool for the proof is the following lemma, which guarantees the existence of a suitably large packing set for K(α, R) in the Frobenius norm.
Lemma 3.1. There exists a set M ⊂ K(α, R) with the following properties: (i) M has suitably large cardinality, and (ii) for any two distinct M^i, M^j ∈ M, the Frobenius distance ‖M^i − M^j‖_F is bounded from below. The proof of Lemma 3.1 is based on an adaptation of the arguments used to prove Lemma 3 in Davenport et al. (2014) and, for self-containment, is given in Section 6.4.

Comparison to past work
We now compare the results established in this section with those known in the literature for matrix completion under uniform or general sampling schemes.

Approximate/non-exact low-rank recoveries
It is now well known that the exact recovery of a low-rank matrix in the noiseless case requires "incoherence conditions" on the target matrix M* (Candès and Recht, 2009; Candès and Tao, 2010; Recht, 2011; Gross, 2011). In this paper, we consider instead the general setting of approximately low-rank matrices, and prove that approximate recovery is still possible without enforcing exact structural assumptions.
Our results are directly comparable to those of Koltchinskii, Lounici and Tsybakov (2011) and Negahban and Wainwright (2012), in which the trace-norm was used as a proxy for the rank. Taking the latter as an example, Negahban and Wainwright (2012) considered the setup where the sampling distribution is a product distribution, i.e. π_{kℓ} = π_{k•} π_{•ℓ} for all (k, ℓ), where the marginals π_{k•} and π_{•ℓ} are bounded away from zero, and accordingly defined weighted versions of the matrix norms. Based on a collection of observations in which the (i_t, j_t) are i.i.d. according to P{(i_t, j_t) = (k, ℓ)} = π_{kℓ} and ε_t ∈ {−1, +1} are i.i.d. random signs, and under the assumption that the unknown matrix M* satisfies the weighted "spikiness" condition (3.17), Negahban and Wainwright (2012) proposed the weighted trace-norm penalized estimator (3.18) of M*. In the context of low-trace-norm (approximately low-rank) matrix recovery where the true matrix M* satisfies (3.17), they proved that, for a properly chosen λn depending on σ (see, e.g., Corollary 2 therein), there exist absolute positive constants c1–c3 such that their error bound holds with probability at least 1 − c2 exp(−c3 log d).
First, the product distribution assumption can be fairly restrictive in practice and is not valid in many applications. For example, in the case of the Netflix problem, this assumption would imply that, conditional on any movie, it is rated by all users with the same probability. Second, the constraint on M* depends heavily on the true sampling distribution, which is typically unknown in practice and can only be estimated through the empirical frequencies: for any pair (k, ℓ), the marginals can be estimated by π̂_{k•} = (1/n) Σ_{t=1}^n 1{i_t = k} and π̂_{•ℓ} = (1/n) Σ_{t=1}^n 1{j_t = ℓ}. Since only a relatively small sample of the entries of M* is observed, these estimates may not be accurate enough. The max-norm constrained minimization approach, on the other hand, is proved (Theorem 3.1) to be effective in the presence of non-uniform sampling distributions. The method requires neither a product distribution nor knowledge of the exact true sampling distribution. From this point of view, the max-norm constrained method indeed yields a more robust approximate recovery guarantee with respect to the sampling distributions.
We now turn to the special case of uniform sampling. The "spikiness" assumption in Negahban and Wainwright (2012) can actually be reduced to a single constraint on the ℓ∞-norm (Klopp, 2014). Let B∞(α) = {M ∈ R^{d1×d2} : ‖M‖∞ ≤ α} be the ℓ∞-norm ball with radius α, and define the corresponding class of matrices K_tr(α, R) as in (3.20). It can be seen from (2.1) and (2.2) that K(α, R) ⊆ K_tr(α, R).
The following results provide upper bounds on the accuracy of both the max-and trace-norm regularized estimators under the Frobenius norm.
Corollary 3.1. Suppose that the noise variables {ξ_t}_{t=1}^n are i.i.d. N(0, 1) random variables and that the sampling distribution Π is uniform on [d1] × [d2]. Then the following inequalities hold with probability at least 1 − 3d^{−1}: the max-norm constrained estimator M̂_max satisfies (3.21), and the minimizer M̂_tr of the SDP (3.18), with all weighted norms replaced by the standard ones and with a properly chosen λn, satisfies (3.22).

The upper bound (3.21) follows immediately from (6.1) in Theorem 3.1, and (3.22) is a straightforward extension of Theorem 7 in Klopp (2014) on exact low-rank matrix recovery to the case of low-trace-norm matrix reconstruction. The proof is essentially the same and thus is omitted. Foygel and Srebro (2011) analyzed the recovery guarantee for M̂_max based on an excess risk bound for empirical risk minimization with a smooth loss function developed in Srebro, Sridharan and Tewari (2010). Specifically, assuming a uniform sampling model with sub-exponential noise and that the target matrix satisfies M* ∈ K(α, R), they proved that the bound (3.23) holds with high probability, where Y = M* + Z with Z = (ξ_{kℓ})_{1≤k≤d1, 1≤ℓ≤d2}, and σ̄² := (1/(d1 d2)) Σ_{k=1}^{d1} Σ_{ℓ=1}^{d2} ξ_{kℓ}² denotes the average noise level, which concentrates around σ² with high probability.
After a more delicate analysis, our result shows that the additional logarithmic factors in (3.23) arise purely as an artifact of the proof technique and thus can be avoided. Moreover, in view of the lower bounds given in Theorem 3.3, we see that the max-norm constrained least squares estimator M̂_max achieves the optimal rate of convergence for recovering approximately low-rank matrices over the parameter space K(α, R) under the Frobenius norm loss. To our knowledge, the best known rate for the trace-norm regularized estimator, given in (3.22), is near-optimal up to logarithmic factors in the minimax sense, over the larger parameter space K_tr(α, R).

Uniform/non-uniform sampling distributions
We now provide further insight into why the max-norm regularized/constrained method is more robust with respect to the sampling distribution. As before, we focus on the setting with a product sampling distribution π_{kℓ} = π_{k•} π_{•ℓ}. Motivated by Salakhutdinov and Srebro (2010), Negahban and Wainwright (2012) studied the weighted trace-norm penalized estimator M̂_tr given in (3.18), where the weighted trace-norm ‖M‖_{w(1)} of any matrix M is defined as in (3.24), using the true marginals π_{k•} and π_{•ℓ}. However, the "true" form of the sampling distribution is ambiguous and, even if it is a product distribution, the marginal probabilities π_{k•} and π_{•ℓ} are typically unknown. Therefore, the weighted trace-norm ‖·‖_{w(1)} cannot be used in practice.
For the max-norm, a useful equivalent definition is that, for any M ∈ R^{d1×d2},

‖M‖max = sup_{p, q} ‖ diag(p)^{1/2} M diag(q)^{1/2} ‖₁,

where the supremum is over all probability vectors p on [d1] and q on [d2]. See, for example, Theorem 9 in Lee, Shraibman and Spalek (2008). As a result, by considering a max-norm penalized estimator that solves

min_M { L_n(M; Y) + λ ‖M‖max },

all possible marginal probabilities are taken into account, and the solution is therefore expected to be more robust with respect to the unknown sampling distribution.
Although the sampling distribution is not known exactly in practice, an estimated version is expected to be stable enough as an alternative. Following Foygel et al. (2011), given a random sample S = {(i_t, j_t)}_{t=1}^n, we can estimate the marginals empirically by π̂_{k•} = (1/n) Σ_{t=1}^n 1{i_t = k} and π̂_{•ℓ} = (1/n) Σ_{t=1}^n 1{j_t = ℓ}, and set π̂_{kℓ} = π̂_{k•} π̂_{•ℓ}. The empirically-weighted trace-norm ‖·‖_{ŵ(1)} can then be defined in the same spirit as the weighted trace-norm in (3.24), only with π_{kℓ} replaced by π̂_{kℓ}. The unknown matrix can then be estimated via penalization with the π̂-weighted trace-norm, i.e. by minimizing the empirical risk plus a multiple of ‖M‖_{ŵ(1)}. Foygel et al. (2011) proved an error bound for the excess risk of the empirically-weighted trace-norm constrained estimator when the loss function is Lipschitz. It would be interesting to investigate whether results similar to those in Negahban and Wainwright (2012) hold for the empirically-weighted trace-norm constrained and penalized estimators when the quadratic loss function is used.
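A minimal sketch of this empirical weighting step is given below (our own illustration; the exact normalization of the weighted trace-norm in (3.24) may differ, and here we simply evaluate ‖diag(p̂)^{1/2} M diag(q̂)^{1/2}‖₁ with the estimated marginals):

```python
import numpy as np

rng = np.random.default_rng(2)
d1, d2, n = 15, 25, 500

# i.i.d. index pairs from some (here uniform, in general unknown) distribution
rows = rng.integers(0, d1, size=n)
cols = rng.integers(0, d2, size=n)

# Empirical marginal frequencies, as used by the empirically-weighted trace-norm
p_hat = np.bincount(rows, minlength=d1) / n
q_hat = np.bincount(cols, minlength=d2) / n

def weighted_trace_norm(M, p, q):
    # ‖diag(p)^{1/2} M diag(q)^{1/2}‖₁: sum of singular values after reweighting
    W = np.sqrt(p)[:, None] * M * np.sqrt(q)[None, :]
    return np.linalg.svd(W, compute_uv=False).sum()

M = rng.standard_normal((d1, d2))
wtn = weighted_trace_norm(M, p_hat, q_hat)
```

As a sanity check, plugging in the exactly uniform marginals p = 1/d1 and q = 1/d2 recovers the scaled trace-norm (1/√(d1 d2))‖M‖₁ that appeared in Section 2.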
It is also worth noting that, under condition (3.6) and when the sampling distribution is nearly uniform in the sense of condition (3.25), for some constants ν, L ≥ 1, Klopp (2014) showed that the trace-norm penalized estimator attains the stated error bound with probability greater than 1 − 3d^{−1}, provided that ‖M*‖∞ ≤ α and λ = λn ≍ σ(L log d/(nd))^{1/2}. In the case of Gaussian errors and under condition (3.11), the above rate of convergence is minimax optimal, up to a logarithmic factor, for the class of exactly low-rank matrices {M ∈ R^{d1×d2} : ‖M‖∞ ≤ α, rank(M) ≤ r0} (Koltchinskii, Lounici and Tsybakov, 2011). An interesting and challenging open problem is whether, in the context of exact low-rank matrix recovery and when the sampling probabilities satisfy (3.25), the optimal recovery guarantee can be achieved using the max-norm constrained method. Also, to the best of our knowledge, there are no theoretical guarantees for exactly recovering a low-rank matrix when the sampling distribution is non-uniform and unspecified.

Computational Algorithms
Although Theorem 3.1 presents theoretical guarantees that hold uniformly for any global minimizer, it does not provide guidance on how to approximate such a global minimizer using a polynomial-time algorithm. A parallel line of work has studied computationally efficient algorithms for solving problems with a trace-norm constraint or penalty. See Lin et al. (2009), Mazumder, Hastie and Tibshirani (2010) and Nesterov (2013), among others. Here we restrict our attention to the less-studied max-norm oriented approach. We discuss two different types of algorithms, both designed to solve large-scale optimization problems that incorporate the max-norm as a semidefinite relaxation of the rank. The first is a fast first-order algorithm developed in Lee et al. (2010) based on a nonconvex relaxation. The problem of interest to us is the optimization program (3.5) with both max-norm and elementwise ℓ∞-norm constraints, in which case the algorithm introduced in Lee et al. (2010) can be applied after suitable modifications, as described in Section 4.1. The second is a convex approach proposed by Fang et al. (2015b) using the alternating direction method of multipliers, with guaranteed convergence to the global optimum since it deals with the convex problem directly.

A projected gradient method
Due to Srebro, Rennie and Jaakkola (2004), the max-norm of a d1 × d2 matrix M can be computed via a semi-definite program:

‖M‖max = min{ t : [ W1 M ; M⊺ W2 ] ⪰ 0, diag(W1) ≤ t, diag(W2) ≤ t },

where W1 ∈ S^{d1} and W2 ∈ S^{d2}. Correspondingly, we can reformulate (3.5) as an SDP problem in which the objective function f is given by f(M) = L_n(M; Y). This SDP can be solved using standard interior-point methods, though these are fairly slow and do not scale to matrices with large dimensions. For large-scale problems, an alternative factorization method based on (1.1), as described below, is preferred (Lee et al., 2010). We begin by introducing factor variables U ∈ R^{d1×k} and V ∈ R^{d2×k}. In practice, without a known guarantee on the rank of M̂_max, we truncate the number of columns k to some reasonably large value less than d1 + d2. Then, we rewrite the original problem (3.5) in the factored form as follows:

min_{U, V} (1/n) Σ_{t=1}^n (Y_{i_t j_t} − (UV⊺)_{i_t j_t})² subject to ‖U‖_{2,∞} ≤ √R, ‖V‖_{2,∞} ≤ √R, ‖UV⊺‖∞ ≤ α,   (4.1)

where {(i1, j1), ..., (in, jn)} is a training set of row-column indices, and U_i and V_j denote the i-th row of U and the j-th row of V, respectively. This problem, however, is non-convex, since it involves a constraint on all product factorizations UV⊺. When the size k of the problem is large enough, Burer and Choi (2006) proved that this reformulated problem has no local minima. To solve this problem fast and efficiently, Lee et al. (2010) suggested the following first-order method. The projected gradient method generates a sequence of iterates {(U^t, V^t), t = 0, 1, 2, ...} by the following recursion. First, compute an intermediate iterate by taking a gradient step in (U, V); if it violates the constraint ‖UV⊺‖∞ ≤ α, rescale it accordingly, otherwise we keep it as is. Next, compute updates by projecting the rows of U and V onto the ℓ2 balls of radius √R. This projection can be computed by rescaling the rows of the current iterate whose ℓ2-norms exceed √R so that their norms become exactly √R, while rows with norms already at most √R remain unchanged.
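A minimal implementation sketch of this projected gradient scheme follows (our own illustration, not the authors' code; the step size, initialization scale, and the global rescaling heuristic for the ℓ∞ constraint are assumptions):

```python
import numpy as np

def project_rows(A, radius):
    # Rescale any row whose ℓ2-norm exceeds `radius` back onto the sphere;
    # rows already inside the ball are left unchanged.
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    scale = np.minimum(1.0, radius / np.maximum(norms, 1e-12))
    return A * scale

def projected_gradient(rows, cols, y, d1, d2, R, alpha,
                       k=5, step=0.2, iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    U = 0.01 * rng.standard_normal((d1, k))
    V = 0.01 * rng.standard_normal((d2, k))
    n, sqrtR = len(y), np.sqrt(R)
    for _ in range(iters):
        resid = (U[rows] * V[cols]).sum(axis=1) - y   # (UV^T)_{it jt} - Y_t
        gU = np.zeros_like(U)
        gV = np.zeros_like(V)
        np.add.at(gU, rows, (2.0 / n) * resid[:, None] * V[cols])
        np.add.at(gV, cols, (2.0 / n) * resid[:, None] * U[rows])
        U = project_rows(U - step * gU, sqrtR)        # enforce row norms <= sqrt(R)
        V = project_rows(V - step * gV, sqrtR)
        # Heuristic rescaling to respect the constraint max |(UV^T)_{kl}| <= alpha
        m = np.abs(U @ V.T).max()
        if m > alpha:
            s = np.sqrt(alpha / m)
            U, V = s * U, s * V
    return U, V
```

On fully observed data from a rank-one bounded matrix, a few hundred iterations of this recursion typically drive the training error close to zero, provided R is at least the max-norm of the target.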

An alternating direction method of multipliers based approach
The first-order algorithm described in Section 4.1 is computationally efficient and fast. However, (4.1) is in principle a non-convex optimization problem and thus the algorithm is only guaranteed to find a stationary point. Recently, an alternating direction method of multipliers (ADMM) based approach was proposed by Fang et al. (2015b) to solve the convex program (3.5) efficiently with strong theoretical guarantees. Furthermore, it was shown in Fang et al. (2015a) that the worst-case rate of convergence of the ADMM method is of order 1/t, where t denotes the iteration counter. We briefly summarize this ADMM approach here for the sake of readability.
Define the class of matrices 𝒲 ⊆ S^d consisting of the symmetric matrices satisfying the ℓ∞-norm and diagonal constraints induced by (4.1), where d = d1 + d2, S^d denotes the class of all symmetric matrices in R^{d×d}, and for every W ∈ S^d we write W12 ∈ R^{d1×d2} for its upper-right block. In this notation, the problem (4.1) can be equivalently formulated as

min over W ∈ 𝒲 and X ∈ S^d_+ of f(W12), subject to W − X = 0, (4.2)

where, as before, the function f : R^{d1×d2} → R is given by f(M) = L_n(M; Y). As pointed out by Fang et al. (2015b), the rationale for reformulating the problem as (4.2) is to split the complexity of the feasible set in (4.1), which consists of a positive semidefinite constraint and ℓ∞-norm constraints, into two parts. Then, using an iterative method, we only need to control the ℓ∞-norm of W and project X onto the positive semidefinite cone in each step. The additional constraint W − X = 0 ensures the feasibility of both W and X. More specifically, consider the augmented Lagrangian function of (4.2) given by

L(W, X; Z) = f(W12) + ⟨Z, W − X⟩ + (ρ/2) ‖W − X‖²_F,

where Z denotes the dual variable and ρ > 0 is prespecified. The ADMM solves (4.2) iteratively as follows: initialize (W⁰, X⁰, Z⁰) and ρ > 0; at the (t + 1)-th iteration, update (W, X, Z) according to

W^{t+1} = argmin_{W ∈ 𝒲} L(W, X^t; Z^t),
X^{t+1} = Π_{S^d_+}(W^{t+1} + ρ^{-1} Z^t),
Z^{t+1} = Z^t + ρ (W^{t+1} − X^{t+1}), (4.3)

where Π_{S^d_+} : R^{d×d} → S^d_+ is the map that projects a matrix onto the semidefinite cone S^d_+. The second step of (4.3) has an explicit solution (Fang et al., 2015b): writing the eigendecomposition W^{t+1} + ρ^{-1} Z^t = Σ_i λ_i u_i u_i^⊺, the projection is Σ_i max(λ_i, 0) u_i u_i^⊺, i.e. the negative eigenvalues are set to zero.
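The per-iteration workhorse is the projection Π_{S^d_+}, which reduces to eigenvalue thresholding. A minimal sketch (the function names are illustrative; the W-update is passed in as a callable since its exact form depends on the loss f and the constraint set 𝒲):

```python
import numpy as np

def project_psd(A):
    """Project a symmetric matrix onto the PSD cone S^d_+.

    Eigendecompose and zero out the negative eigenvalues.
    """
    A = (A + A.T) / 2.0                 # guard against asymmetry from round-off
    w, Q = np.linalg.eigh(A)
    return (Q * np.maximum(w, 0.0)) @ Q.T

def admm_sweep(W, X, Z, rho, w_step):
    """One (W, X, Z) sweep of the ADMM iteration (4.3).

    w_step(X, Z, rho) -> W performs the problem-specific W-update
    (minimizing the augmented Lagrangian over the constrained set).
    """
    W = w_step(X, Z, rho)
    X = project_psd(W + Z / rho)        # explicit solution of the second step
    Z = Z + rho * (W - X)               # dual ascent on the constraint W = X
    return W, X, Z
```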

Implementation
Before the max-norm constrained approach can actually be implemented in practice to generate a full matrix by filling in missing entries, additional prior knowledge of the unknown true matrix is needed to avoid distorted results. As before, let M* ∈ R^{d1×d2} be the true underlying matrix. Suitable upper bounds on the following key quantities are needed in advance: the entry-wise sup-norm ‖M*‖∞ ≤ α0, the rank rank(M*) ≤ r0, and the max-norm ‖M*‖max ≤ R0. Since R0 is difficult to estimate directly from a matrix with missing data, it can be seen from (2.2) that α0√r0 is a sharp upper bound on R0 and is more amenable to estimation. Fortunately, it is possible to convincingly specify α0 beforehand in many real-life applications. When dealing with the Netflix data, for instance, α0 can be chosen as the highest rating index; in the structure-from-motion problem, α0 depends on the range of the camera field of view, which in most cases is sufficiently large to capture the feature point trajectories. In cases where the percentage of missing entries is low, the largest magnitude of the observed entries can be used as an alternative for α0.
As for r0, we recommend the rank estimation approach recently developed in Juliá et al. (2011), which was shown to be effective in computer vision problems. Recall that in the structure-from-motion problem, each column of the data matrix corresponds to the trajectory of a given feature point along the frames, and can be regarded as a signal vector with missing coordinates. Due to the rigidity of the moving objects, it was noted in Juliá et al. (2011) that the observed and missing data behave in the same way and thus generate analogous (frequency) spectral representations. Motivated by this observation, the proposed approach studies the changes in the frequency spectra of the initial matrix after the missing entries are recovered.
In general, choosing the tuning parameter R > 0 in (3.5) adaptively is a difficult problem. In the regression setting, it can be done by the scaled LASSO method (Sun and Zhang, 2012); it is unclear whether a similar approach would work for matrix completion problems. By convexity and strong duality, the optimization program in (3.5) is equivalent to the penalized program

min over ‖M‖∞ ≤ α of L_n(M; Y) + λ ‖M‖max

for a properly chosen λ. In fact, for any R > 0 specified in (3.5), there exists a λ > 0 such that the solutions to the two problems (3.5) and (4.2) coincide. In practice, we suggest solving (4.2) using the ADMM method described in Section 4.2 with λ obtained via cross-validation, in a way similar to that for the LASSO or for the trace-norm penalized M-estimator studied in Negahban and Wainwright (2011).
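A K-fold cross-validation loop over held-out observed entries can be sketched as follows (the solver `solve_penalized` is a hypothetical stand-in for the ADMM routine of Section 4.2; its name and signature are assumptions):

```python
import numpy as np

def choose_lambda(obs, lambdas, solve_penalized, n_folds=5, seed=0):
    """Pick the penalty level by K-fold cross-validation on the observed entries.

    obs is a list of (i, j, y) triples; solve_penalized(train_obs, lam) is a
    hypothetical solver returning a completed matrix for penalty level lam.
    """
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(obs))
    folds = np.array_split(idx, n_folds)
    errors = np.zeros(len(lambdas))
    for f in folds:
        held = set(f.tolist())
        train = [obs[t] for t in range(len(obs)) if t not in held]
        test = [obs[t] for t in f]
        for k, lam in enumerate(lambdas):
            M_hat = solve_penalized(train, lam)
            # accumulate held-out squared error for this penalty level
            errors[k] += sum((M_hat[i, j] - y) ** 2 for i, j, y in test)
    return lambdas[int(np.argmin(errors))]
```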
Next we describe an implementation of the max-norm constrained matrix completion procedure, which incorporates the rank estimation approach in Juliá et al. (2011). Assume without loss of generality that α0 is known.
(1) Given the observed partial matrix M_S, obtain the initial matrix M_ini by filling each missing entry of M_S with the average of the observed entries in the corresponding column. Apply the Fast Fourier Transform (FFT) to the columns of M_ini and take its modulus, i.e. F := |FFT(M_ini)|.
(2) Set an initial rank r = 2 and an upper bound r_max. Clearly, r_max ≤ min(d1, d2), and it can be determined automatically by adding a criterion for stopping the iteration.
(3) For the current value of r, use the computational algorithms given in Section 4 with R = α0√r to solve the max-norm constrained optimization (3.5). Denote the resulting estimated full matrix by M_r.
(4) Apply the FFT to M_r as in step 1. Write F_r = |FFT(M_r)| and compute the error e(r) = ‖F − F_r‖_F.
(5) If r < r max , set r = r + 1 and go to step 3.
Finally, let r* = arg min_{2≤r≤r_max} e(r); the corresponding M_{r*} is the final estimate of M*. Clearly, the above procedure can be modified by replacing the rank r with the max-norm R. A suitable initial value for the max-norm is R = α0√2, and at each iteration R is increased to R + δ for a fixed step size δ > 0. An upper bound R_max can be determined automatically by adding a criterion for stopping the iteration.
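The five-step procedure can be sketched as follows (the solver `solve_maxnorm` for (3.5) is a hypothetical stand-in for the algorithms of Section 4; its name and signature are assumptions):

```python
import numpy as np

def complete_with_rank_selection(M_obs, mask, alpha0, r_max, solve_maxnorm):
    """Steps (1)-(5): column-mean initialization, FFT spectrum comparison,
    and selection of the rank r minimizing the spectral error e(r).

    M_obs holds the observed entries (zeros elsewhere); mask is a boolean
    matrix marking which entries were observed.
    """
    # Step (1): fill missing entries with column averages, then take |FFT|.
    M_ini = M_obs.copy()
    col_means = np.where(mask, M_obs, 0.0).sum(axis=0) / np.maximum(mask.sum(axis=0), 1)
    M_ini[~mask] = np.broadcast_to(col_means, M_obs.shape)[~mask]
    F = np.abs(np.fft.fft(M_ini, axis=0))

    # Steps (2)-(5): sweep r, solve (3.5) with R = alpha0 * sqrt(r),
    # and keep the completion whose spectrum is closest to F.
    best_err, best_M = np.inf, None
    for r in range(2, r_max + 1):
        M_r = solve_maxnorm(M_obs, mask, R=alpha0 * np.sqrt(r))
        F_r = np.abs(np.fft.fft(M_r, axis=0))
        err = np.linalg.norm(F - F_r)   # e(r) in the Frobenius norm
        if err < best_err:
            best_err, best_M = err, M_r
    return best_M
```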

Discussions
This paper considers the approximate recovery of approximately low-rank matrices, in particular low-max-norm matrices in contrast to low-trace-norm matrices. The unit max-norm ball is nearly equivalent to the convex hull of rank-one sign matrices, so the max-norm serves as an alternative convex surrogate for the rank. A max-norm constrained empirical risk minimization method is proposed and its theoretical properties are studied along with computational algorithms. Allowing for an unknown non-uniform sampling distribution, which is an important relaxation of the uniform sampling assumption in practice, it is shown that the method is rate-optimal and can be solved efficiently in polynomial time.
When the underlying matrix has exact rank r, it is known that the trace-norm based approach leads to a mean squared error of order O{rd(log d)/n} (Keshavan and Montanari, 2010; Koltchinskii, Lounici and Tsybakov, 2011; Negahban and Wainwright, 2012; Klopp, 2014), where d = d1 + d2. In the ideal uniform sampling model, the trace-norm regularized method is arguably the preferable one, as it achieves the optimal rate of convergence (up to a logarithmic factor) and is computationally feasible. The sampling scheme considered in this paper is unspecified and is allowed to be highly non-uniform, which brings additional randomness and uncertainty to the recovery problem. We are therefore essentially dealing with a much more complex model, and the max-norm constraint is not only a convex relaxation of low-rankness via (2.2) but also accounts for the effect of non-uniform sampling.

Proofs
We prove the main results, Theorems 3.1 and 3.3, in this section.The proofs of a few key technical lemmas including Lemma 3.1 are also given.

Proof of Theorem 3.1
For ease of exposition, we write M = M_max as long as there is no ambiguity. To illustrate the main idea, we first consider the case where ξ1, . . ., ξn are i.i.d. normal random variables and prove that there exists an absolute constant C such that, for any t ∈ (0, 1) and any sample size n satisfying the stated condition, the claimed bound holds with probability greater than 1 − t − e^{−d}. The case of sub-exponential noise can be obtained via a straightforward adaptation of the arguments for Gaussian noise.
To begin with, note that M is optimal and M* is feasible for the convex optimization problem (3.5), so we have the basic inequality L_n(M; Y) ≤ L_n(M*; Y). Combined with our model assumption Y_{i_t j_t} = M*_{i_t j_t} + ξ_t, this yields

(1/n) Σ_{t=1}^n ∆²_{i_t j_t} ≤ (2/n) Σ_{t=1}^n ξ_t ∆_{i_t j_t}, (6.2)

where ∆ = M − M* ∈ K(2α, 2R) is the error matrix. By (6.2), the major challenges in proving Theorem 3.1 consist of two parts: bounding the left-hand side of (6.2) from below in a uniform sense, and bounding the right-hand side of (6.2) from above.
Step 1. (Upper bound). Recall that {ξ_t}_{t=1}^n is a sequence of N(0, 1) random variables and that S = {(i_1, j_1), . . ., (i_n, j_n)} is drawn i.i.d. according to Π on [d_1] × [d_2]. Due to Pisier (1989), for any realization of the training set S and any δ > 0, with probability at least 1 − δ over ξ = {ξ_t}_{t=1}^n, the supremum of the right-hand side of (6.2) over K(α, R) exceeds its expectation by at most a deviation term depending on δ. Thus it remains to estimate this expectation over the class of matrices K(α, R). As a direct consequence of (2.3), the supremum can be reduced to a maximum over the set M± of rank-one sign matrices, with cardinality |M±| ≤ 2^d, giving (6.5). For each fixed M ∈ M±, the sum Σ_{t=1}^n ξ_t M_{i_t j_t} is a Gaussian random variable with mean zero and variance n. Then, the expectation of the Gaussian maximum in (6.5) can be bounded by the standard bound for the maximum of finitely many Gaussians. Substituting this into (6.5) gives the desired estimate. Since this upper bound holds uniformly over all realizations of S, we conclude that with probability at least 1 − δ over both the random samples S and the noise ξ, the bound (6.6) holds. In the case of sub-exponential noise, i.e. when {ξ_t}_{t=1}^n satisfies assumption (3.6), it follows from (2.3) that the supremum can again be reduced to a maximum over M±. For any realization of the training set S = {(i_1, j_1), . . ., (i_n, j_n)} and for any fixed M ∈ M±, a Bernstein-type inequality for sub-exponential random variables (Vershynin, 2012) yields the corresponding tail bound, where c > 0 is an absolute constant. By the union bound, it can easily be verified that for a sample size n ≥ d, the bound (6.7) holds with probability at least 1 − e^{−d} for some absolute constant C > 0.
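The Gaussian-maximum step can be made explicit with the following standard bound (stated here as a general fact, not as the paper's exact display): for N centered Gaussians g_1, . . ., g_N, each with variance n,

```latex
\mathbb{E}\Big[\max_{1\le i\le N}|g_i|\Big] \;\le\; \sqrt{2\,n\log(2N)} .
```

With N = |M±| and log N of order d, this yields a bound of order √(nd), which after dividing by n gives the √(d/n)-type rate appearing in the upper bounds.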
Step 2. (Lower bound). For the given sampling distribution Π, note that the expectation of n^{-1}‖M_S‖²₂ over S ∼ Π equals the weighted squared norm of M under Π, where M_S = (M_{i_1 j_1}, . . ., M_{i_n j_n})^⊺ ∈ R^n for any training set S = {(i_t, j_t)}_{t=1}^n of size n. For β ≥ 1 and δ > 0, consider the subset C(β, δ) defined below; here, δ can be regarded as a tolerance parameter. The goal is to show that there exists some function f_β such that, with high probability, the inequality (6.8) holds uniformly over M ∈ C(β, δ).
Proof of (6.8). We will in fact prove a stronger result: with exponentially high probability, (6.8) holds for all M ∈ C(β, δ), based on a straightforward adaptation of the peeling argument used in Negahban and Wainwright (2012). Taking ϱ = 3/2, define a sequence of subsets as in (6.9). If there were some M ∈ C(β, δ) violating the claimed inequality, it would belong to one of these subsets; therefore, the main task is to show that the latter event occurs with small probability. To this end, define the maximum deviation ∆_D(S) for each radius D as in (6.10). The following lemma shows that n^{-1}‖M_S‖²₂ concentrates around its expectation.

Lemma 6.1 (Concentration). There exists a universal positive constant C₁ such that, for any D > 0, the bound (6.11) holds.

In view of the above lemma, we take f_β(n, d₁, d₂) = C₁β d/n and consider the sequence of events indexed by the subsets in (6.9). Using the union bound, we have, with c₀ = log(3/2)/26, a bound on the probability of their union, where we used an elementary inequality. Consequently, for a sample size n ≤ d₁d₂ satisfying exp(−c₀nδ) ≤ 1/2, or equivalently n > (c₀δ)^{-1} log 2, we obtain that with probability greater than 1 − 2 exp(−c₀nδ), (6.8) holds for all M ∈ C(β, δ).
Step 3. Now we combine the results in Step 1 and Step 2 to finish the proof.
Similarly, using the upper bound (6.7) instead of (6.6), together with the lower bound (6.13), proves (3.7) in the case of sub-exponential noise.
Proof of Lemma 6.1

Here we prove the concentration inequality given in Lemma 6.1. The argument is based on standard techniques of probability in Banach spaces, including symmetrization, the contraction inequality and Bousquet's version of Talagrand's concentration inequality, as well as the upper bound (2.4) on the empirical Rademacher complexity of the max-norm ball.
Regarding the matrix M ∈ R^{d1×d2} as a function on the index pairs (k, ℓ), we are interested in the empirical process ∆_D(S) indexed by B(D). Recalling that |M_{kℓ}| ≤ ‖M‖∞ ≤ 1 for all pairs (k, ℓ), the summands are uniformly bounded. We first bound E_{S∼Π}{∆_D(S)}, and then show that ∆_D(S) is concentrated around its expectation. A standard symmetrization argument (Ledoux and Talagrand, 1991) yields a bound in terms of the Rademacher process, where {ε_i}_{i=1}^n is an i.i.d. Rademacher sequence independent of S. Given an index set S = {(i_1, j_1), . . ., (i_n, j_n)}, since |M_{i_t j_t}| ≤ 1, the Ledoux–Talagrand contraction inequality (Ledoux and Talagrand, 1991) removes the squares, where we used inequality (2.4) in the last step. Since the "worst-case" Rademacher complexity is uniformly bounded, we obtain (6.14). Next, Bousquet's version of Talagrand's concentration inequality for empirical processes indexed by bounded functions (Bousquet, 2003) completes the proof.

Proof of Theorem 3.3

The proof is based on a general result in Srebro, Sridharan and Tewari (2010) on excess risk bounds for learning with a smooth loss. Recall that the noisy response is of the form Y_{i_t j_t} = M*_{i_t j_t} + ξ_t for t = 1, 2, . . ., where the location (i_t, j_t) of the entry is drawn from [d_1] × [d_2] according to Π and the noise ξ_t on the entry is drawn independently each time. For every d_1 × d_2 matrix M, define the quadratic loss function L(M) = E_{(i_t,j_t)∼Π, ξ_t∼N(0,1)} (M_{i_t j_t} − Y_{i_t j_t})², which equals ‖M − M*‖²_{L₂(Π)} + σ², and its empirical counterpart L̂(M) = (1/n) Σ_{t=1}^n (M_{i_t j_t} − Y_{i_t j_t})² for a given i.i.d. sample {(i_t, j_t), Y_{i_t j_t} = M*_{i_t j_t} + ξ_t}_{t=1}^n. In this notation, our estimator M_max can be written as M_max = arg min_{M ∈ K(α,R)} L̂(M).
In view of Definition 2.1, define the worst-case Rademacher complexity of the class K = K(α, R).
For any B > 0, let E_B be the event that max_{1≤t≤n} |ξ_t| ≤ B holds. On E_B, applying Theorem 1 in Srebro, Sridharan and Tewari (2010), the resulting excess risk bound holds with probability at least 1 − δ over a random sample {(i_t, j_t)}_{t=1}^n of size n, where C₁ > 0 is an absolute constant. By (2.4), the worst-case Rademacher complexity R_n(K) is bounded by 6R√(d/n). Moreover, note that min_{M∈K(α,R)} L(M) = L(M*) = σ². Putting the above calculations together, we obtain the claimed bound on the event E_B. In particular, taking δ = n^{-1} in both (6.15) and (6.16) proves (3.10).
where K((Y_S | M_i), (Y_S | M_j)) denotes the Kullback–Leibler divergence between the distributions (Y_S | M_i) and (Y_S | M_j). For the observation model (3.1) with i.i.d. Gaussian noise, we have