Inequality-Constrained Matrix Completion: Adding the Obvious Helps!

We propose imposing box constraints on the individual elements of the unknown matrix in the matrix completion problem and present a number of natural applications, ranging from collaborative filtering under interval uncertainty to computer vision. Moreover, we design an alternating direction parallel coordinate descent method (MACO) for a smooth unconstrained optimization reformulation of the problem. In large-scale numerical experiments in collaborative filtering under uncertainty, our method obtains solutions with considerably smaller errors compared to classical matrix completion with equalities. We show that, surprisingly, seemingly obvious and trivial inequality constraints, when added to the formulation, can have a large impact. This is demonstrated on a number of machine learning problems.


Motivation
Matrix completion is a well-known problem, with applications ranging from image processing to recommender systems. When the dimensions of a matrix X and some of its elements X_{i,j}, (i, j) ∈ I, are known, the goal is to find the unknown elements. Without imposing any further requirements on X, there are infinitely many solutions. In many applications, however, the matrix completion that minimizes the rank,

min rank(Y), subject to Y_{i,j} = X_{i,j}, (i, j) ∈ I, (1)

works the best. In this paper, we present a variant of the problem with inequalities instead of equalities. This variant has a number of important applications:

Collaborative Filtering under Uncertainty. Collaborative filtering is a well-established application of matrix completion Srebro (2004), largely thanks to the success of the Netflix Prize (netflixprize.com). Consider a matrix where each row corresponds to one user and each column corresponds to a product or service. Only a small number of entries are known, considering that every user rates only a modest number of products or services. Further, notice that one user may provide two different ratings for one and the same product at two different times, depending on the current mood and other circumstances at the two times. One may hence want to consider an interval [x^L, x^U] instead of a fixed value x, e.g., [x − ε, x + ε], or rather [max{L, x − ε}, min{x + ε, U}] when x is known to be a rating on the scale [L, U]. One may hence want to solve:

min_Y max_{X̂_{i,j} ∈ [X^L_{i,j}, X^U_{i,j}], ∀(i,j) ∈ I} rank(Y), subject to Y_{i,j} = X̂_{i,j}, (i, j) ∈ I. (2)

Notice that this generalizes the robust linear programming of Soyster (1973) to rank minimization.
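The clipped intervals above can be computed in one vectorized step. This is a minimal sketch; the function name, the parameter ε, and the default scale bounds are ours, not from the text.

```python
import numpy as np

def interval_bounds(ratings, eps, lo=1.0, hi=5.0):
    """Clipped interval uncertainty sets [max(lo, x - eps), min(x + eps, hi)]
    around observed ratings x on a [lo, hi] rating scale."""
    lower = np.maximum(lo, ratings - eps)
    upper = np.minimum(hi, ratings + eps)
    return lower, upper

# Example: a rating of 5 on a 1..5 scale keeps its upper bound at 5.
lo_b, up_b = interval_bounds(np.array([1.0, 3.0, 5.0]), eps=1.0)
```

Extreme ratings thus keep one-sided slack only, which is exactly the [max{L, x − ε}, min{x + ε, U}] construction.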
Low-Rank Approximations in Image Processing. Another use of matrix completion can be found in image processing. In inpainting problems, a subset of pixels from an image is given and the task is to fill in the missing pixels. Matrix completion with equalities (1), where I is the index set of all known pixels, has been used numerous times in this setting. If the original matrix comes from real life, it will probably be of full rank, albeit with quickly decreasing singular values in the spectrum. In this case, instead of solving the equality-constrained problem (1), one would like to find a low-rank approximation Y* of X such that the known entries of X are not far from Y*, i.e., ∀(i, j) ∈ I we have Y*_{i,j} ≈ X_{i,j}. Indeed, if X has singular values σ_1 ≥ σ_2 ≥ ⋯ ≥ σ_{min{m,n}} and Y* is its best rank-r approximation, then |X_{i,j} − Y*_{i,j}| ≤ (Σ_{i=r+1}^{min{m,n}} σ_i^2)^{1/2} =: R(r) for all (i, j). Therefore, we should not require equality constraints in (1), but rather the inequalities |Y_{i,j} − X_{i,j}| ≤ R(r), ∀(i, j) ∈ I. We would like to stress that this approach is not the same as minimizing Σ_{(i,j)∈I} (X_{i,j} − Y_{i,j})^2 over all rank-r matrices, because we do not penalize the elements of Y which are already close to X. It is also different from the usual treatment of noise in the observations Candès & Plan (2010). One could rather formulate this as the minimization of Σ_{(i,j)∈I} max{0, |X_{i,j} − Y_{i,j}| − R(r)}^2 over all rank-r matrices.
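The tolerance R(r) can be read off the singular values: by the Eckart–Young theorem, the Frobenius norm of the tail of the spectrum bounds every entrywise deviation of the best rank-r approximation. A minimal sketch (the function name is ours):

```python
import numpy as np

def tail_tolerance(X, r):
    """R(r): Frobenius norm of the trailing singular values of X.
    This bounds |X_ij - Y_ij| entrywise for the best rank-r approximation Y
    of X, since the entrywise gap is at most the Frobenius-norm gap."""
    s = np.linalg.svd(X, compute_uv=False)
    return float(np.sqrt(np.sum(s[r:] ** 2)))
```

For an exactly rank-r matrix the tolerance vanishes, and the inequality constraints reduce to the classical equalities.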
Inpainting with Side Information. Let us present another image processing application. If our original matrix is a grayscale image, then one knows more about the missing pixels than just that they are missing! In particular, one knows that those missing pixels take values from the interval [0, 1]. This can hence be seen as "side information" which, as we show in the numerical section, improves the recovery of a low-rank approximation considerably. One can extend this approach further; e.g., if a pixel is missing within a light region of the image, one can assume that its intensity should be at least 0.8.
Forecasting with Side Information. A related application comes from the forecasting of seasonal data, e.g., sales. Let us assume that in a process {X_t}, one knows a period τ such that F_X(x_{t_1+τ}, …, x_{t_k+τ}) = F_X(x_{t_1}, …, x_{t_k}) for the cumulative distribution function F_X of the joint distribution of {X_t} at times t_1 + τ, …, t_k + τ. One can then formulate forecasting into the future as a matrix completion problem, where the historical datum at time t appears at row ⌊t/τ⌋, column t mod τ, specified by an equality or a pair of inequalities, and where inequalities represent side information. For an example of such side information in sales forecasts, notice that one often has bookings many months in advance and knows that the sales for the respective months will not be less than the bookings taken.
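The layout above can be sketched as follows. We read the indexing as row t // tau, column t % tau; the function name and the NaN convention for entries still to be forecast are ours.

```python
import numpy as np

def seasonal_matrix(series, tau):
    """Arrange a seasonal series with period tau into a matrix: the
    observation at time t goes to row t // tau, column t % tau.
    Entries without an observation (the future) are left as NaN."""
    n_rows = -(-len(series) // tau)        # ceiling division
    M = np.full((n_rows, tau), np.nan)
    for t, x in enumerate(series):
        M[t // tau, t % tau] = x
    return M
```

Completing the NaN entries of this matrix, subject to lower bounds given by bookings, is then exactly the inequality-constrained completion problem.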
A number of other applications, e.g., in the recovery of structured matrices Chen & Chi (2013) and in sparse principal component analysis with priors on the principal components, can be envisioned.

The Problem
In this section we introduce our notation and formalize the problem. Let X be an m × n matrix to be reconstructed. Assume that for elements (i, j) ∈ E of X we have prescribed values X_{i,j}, for elements (i, j) ∈ L we have lower bounds X^L_{i,j}, and for elements (i, j) ∈ U we have upper bounds X^U_{i,j}. We propose the following natural formulation for the equality- and inequality-constrained matrix completion problem:

min rank(Y), subject to Y_{i,j} = X_{i,j}, (i, j) ∈ E; Y_{i,j} ≥ X^L_{i,j}, (i, j) ∈ L; Y_{i,j} ≤ X^U_{i,j}, (i, j) ∈ U. (3)

This problem is NP-hard, even with U = L = ∅ Natarajan (1995). This special case of (3) has been widely studied, e.g., in Recht et al. (2011); Goldfarb et al. (2009); Ma et al. (2011).
A popular heuristic enforces low rank in a synthetic way by writing X as a product of two matrices, X = LR, where L ∈ R^{m×r} and R ∈ R^{r×n}; hence X is of rank at most r. This has been proposed and analyzed by Lee et al. (2010); Recht et al. (2010); Srebro et al. (2004); Tanner & Wei (2013). Let L_{i:} and R_{:j} be the i-th row of L and the j-th column of R, respectively. Instead of (3), we consider the smooth unconstrained problem

min_{L,R} f(L, R) := f_E(L, R) + f_L(L, R) + f_U(L, R) + (μ/2)(‖L‖_F^2 + ‖R‖_F^2), (4)

where

f_E(L, R) = (1/2) Σ_{(i,j)∈E} (L_{i:} R_{:j} − X_{i,j})^2,
f_L(L, R) = (1/2) Σ_{(i,j)∈L} ((X^L_{i,j} − L_{i:} R_{:j})_+)^2,
f_U(L, R) = (1/2) Σ_{(i,j)∈U} ((L_{i:} R_{:j} − X^U_{i,j})_+)^2,

and ξ_+ = max{0, ξ}.
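The rank bound behind this heuristic can be checked numerically. A generic sketch; the matrices below are random and purely illustrative:

```python
import numpy as np

# Any product of an m x r and an r x n matrix has rank at most r,
# so the factorization X = LR enforces the rank bound synthetically.
rng = np.random.default_rng(1)
L_demo = rng.standard_normal((6, 2))   # m = 6, r = 2
R_demo = rng.standard_normal((2, 5))   # r = 2, n = 5
rank_LR = np.linalg.matrix_rank(L_demo @ R_demo)
```

For generic factors the rank is exactly r, and it can never exceed r whatever L and R are.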
The parameter µ helps to prevent scaling issues 1 . Hence, we could optionally set µ to zero and then from time to time rescale the matrices L and R so that their product stays constant Tanner & Wei (2013). The term f_E (resp. f_U, f_L) encourages the equality (resp. inequality) constraints to hold.
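The penalized objective can be transcribed directly: squared residuals on the equality entries, one-sided squared hinges on the bounded entries, plus the µ-regularizer. This is a sketch; the function name and the representation of the index sets as lists of (i, j) pairs are ours.

```python
import numpy as np

def objective(L, R, eq, X, low, XL, upp, XU, mu):
    """f(L, R) = f_E + f_L + f_U + (mu/2)(||L||_F^2 + ||R||_F^2), with
    squared hinges (xi_+ = max{0, xi}) penalizing violated inequalities."""
    Y = L @ R
    f = 0.5 * sum((Y[i, j] - X[i, j]) ** 2 for i, j in eq)
    f += 0.5 * sum(max(0.0, XL[i, j] - Y[i, j]) ** 2 for i, j in low)
    f += 0.5 * sum(max(0.0, Y[i, j] - XU[i, j]) ** 2 for i, j in upp)
    f += 0.5 * mu * (np.sum(L * L) + np.sum(R * R))
    return float(f)
```

Note that an entry strictly inside its bounds contributes nothing, which is exactly the "do not penalize elements already close to X" behaviour discussed in the motivation.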

The Algorithm
Coordinate descent algorithms (CDA) are effective in solving large-scale problems, due to their low per-iteration computational cost. Although each iteration of CDA is cheap, many more iterations are required for convergence, compared to second-order and similar algorithms. Stochastic CDA has received much attention recently, because it has numerous benefits compared to the deterministic version Nesterov (2012); Tseng (2001). Notably, it has been shown that stochastic CDA can be efficiently parallelized with almost linear speed-up Bradley et al. (2011); Richtárik & Takáč (2012); Recht et al. (2011), e.g., in regimes where the number of parallel updates τ is much smaller than the dimension of the optimization problem.
We now present our alternating parallel coordinate descent method for MAtrix COmpletion ("MACO") in Algorithm 1. In Steps 3-8 of the algorithm, we fix R, choose a random v and a random set Ŝ of rows of L, and update, in parallel for i ∈ Ŝ: L_{iv} ← L_{iv} + δ_{iv}. In Steps 9-14, we fix L, choose a random v and a random set Ŝ of columns of R, and update, in parallel for j ∈ Ŝ: R_{vj} ← R_{vj} + δ_{vj}.
Let us now comment on the computation of the updates δ_{iv} and δ_{vj}. First, note that while f is not convex jointly in (L, R), it is convex in L for fixed R and convex in R for fixed L.
If we now fix i ∈ {1, 2, …, m} and v ∈ {1, 2, …, r}, and view f as a function of L_{iv} only, it has a Lipschitz-continuous gradient with constant

W_{iv} = μ + Σ_{j : (i,j) ∈ E∪L∪U} R_{vj}^2.

That is, for all L, R and δ ∈ R, we have

f(L + δE_{iv}, R) ≤ f(L, R) + ⟨∇_L f(L, R), E_{iv}⟩ δ + (W_{iv}/2) δ^2,

where E_{iv} is the m × r matrix with 1 in the (i, v) entry and zeros elsewhere. Likewise, if we fix j ∈ {1, 2, …, n} and v ∈ {1, 2, …, r}, and view f as a function of R_{vj} only, it has a Lipschitz-continuous gradient with constant

V_{vj} = μ + Σ_{i : (i,j) ∈ E∪L∪U} L_{iv}^2.

That is, for all L, R and δ ∈ R, we have

f(L, R + δE_{vj}) ≤ f(L, R) + ⟨∇_R f(L, R), E_{vj}⟩ δ + (V_{vj}/2) δ^2,

where E_{vj} is the r × n matrix with 1 in the (v, j) entry and zeros elsewhere.
The minimizer of the right-hand side of the bound on f(L + δE_{iv}, R) is given by

δ_{iv} = −⟨∇_L f(L, R), E_{iv}⟩ / W_{iv}. (5)

Note that plugging this step back into the bound yields the guaranteed decrease

f(L + δ_{iv}E_{iv}, R) ≤ f(L, R) − ⟨∇_L f(L, R), E_{iv}⟩^2 / (2W_{iv}). (6)

The minimizer of the right-hand side of the bound on f(L, R + δE_{vj}) is given by

δ_{vj} = −⟨∇_R f(L, R), E_{vj}⟩ / V_{vj}, (7)

with the analogous guaranteed decrease

f(L, R + δ_{vj}E_{vj}) ≤ f(L, R) − ⟨∇_R f(L, R), E_{vj}⟩^2 / (2V_{vj}). (8)

The random set Ŝ can be chosen uniformly at random, or nonuniformly, as is common in importance sampling. In our experiments we have chosen the uniform variant. If we have a multicore machine with τ cores, then a reasonable subset Ŝ should have cardinality τ, or some integral multiple of τ, so that every core has a reasonable load at every iteration (not too small, so that no core is underutilized, but not too large, so that processing does not take too long).

Efficient implementation. Formulas (5) and (7) suggest that the computation of each step is expensive. This can, however, be avoided if we maintain auxiliary matrices A ∈ R^{m×r} and B ∈ R^{r×n} caching the sums needed for the constants W_{iv} and V_{vj}, updating them after each update of the solution. Similarly, one can store sparse residual matrices ∆_E, ∆_L, ∆_U, where (∆_E)_{i,j} = L_{i:}R_{:j} − X_{i,j} for (i, j) ∈ E and zero otherwise, and ∆_L, ∆_U are defined in a similar way. Subsequently, the computation of δ_{iv} or δ_{vj} is reduced to just a few multiplications and additions.
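A minimal serial sketch of the method (|Ŝ| = 1) can make the updates concrete. Every entry is treated either as an equality constraint or as a box constraint; each step takes the exact minimizer of the quadratic upper bound, δ = −grad/W with W equal to μ plus the sum of squared partner coefficients. Function names, initialization, and defaults are ours, not from Algorithm 1 itself.

```python
import numpy as np

def maco(X, is_eq, XL, XU, r=1, mu=1e-4, iters=60000, seed=0):
    """Serial sketch of the alternating coordinate descent: pick a random
    entry of L or of R and take an exact step on its quadratic upper bound."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    L = 0.5 + 0.1 * rng.standard_normal((m, r))
    R = 0.5 + 0.1 * rng.standard_normal((r, n))

    def g(y, i, j):
        # derivative of the per-entry loss at prediction y
        if is_eq[i, j]:
            return y - X[i, j]
        return max(0.0, y - XU[i, j]) - max(0.0, XL[i, j] - y)

    for _ in range(iters):
        v = rng.integers(r)
        if rng.random() < 0.5:                     # update one entry of L
            i = rng.integers(m)
            y = L[i] @ R                           # predictions for row i
            grad = mu * L[i, v] + sum(g(y[j], i, j) * R[v, j] for j in range(n))
            W = mu + sum(R[v, j] ** 2 for j in range(n))
            L[i, v] -= grad / W
        else:                                      # update one entry of R
            j = rng.integers(n)
            y = L @ R[:, j]                        # predictions for column j
            grad = mu * R[v, j] + sum(g(y[i], i, j) * L[i, v] for i in range(m))
            V = mu + sum(L[i, v] ** 2 for i in range(m))
            R[v, j] -= grad / V
    return L, R
```

The parallel variant updates several rows of L (or columns of R) at once; since the updates touch disjoint coordinates, they do not conflict, which is the basis of the lock-free implementation below.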
Lock-free implementations. Since each iteration is cheap and does not depend on the size of the problem, one can tackle problems of virtually any dimension. Because each thread deals with a different row of L (or column of R), there is no risk of race conditions at run time; hence no atomic operations are required. At some point, no single computer will have sufficient memory capacity, and one will have to distribute the computation across a cluster. Fortunately, the techniques developed in Yun et al. (2013) are also applicable to Algorithm 1, and hence the algorithm can be extended to the distributed setting.
Related work. Let us note that Cai et al. (2010) analyzed matrix completion with an arbitrary convex constraint and proposed to solve the problem using the Singular Value Thresholding (SVT) algorithm. This, however, requires the computation of a singular value decomposition (SVD) in each iteration. A number of other approaches, e.g., augmented Lagrangian methods Tomioka et al. (2010), could also be extended, but would require a singular value decomposition or a number of iterations of the power method Jaggi & Sulovský (2010); Shalev-Shwartz et al. (2011). Even considering the recent progress in randomized methods for approximating singular value decompositions Halko et al. (2011), the approximation becomes very time-consuming as the dimensions of the matrices grow.
Our algorithm can be seen as a coordinate-wise version of the alternating least squares (ALS) algorithm. If r = 1, U ≡ L ≡ ∅, and one always chooses all elements in Ŝ, then the algorithm is equivalent to classical ALS.

Convergence Analysis
Due to the non-convex nature of (4), one has to be satisfied with convergence to a stationary point.
Theorem 1. Let µ > 0 and let (L^{(k)}, R^{(k)}) be the (random) matrices produced by Algorithm 1 after k iterations. Then Algorithm 1 is monotonic, i.e., for all k ≥ 0,

f(L^{(k+1)}, R^{(k+1)}) ≤ f(L^{(k)}, R^{(k)}). (9)

Moreover, almost surely, ‖∇f(L^{(k)}, R^{(k)})‖ → 0 as k → ∞.

Sketch of the proof. Monotonicity can be deduced from (6) and (8). The assumption µ > 0 together with monotonicity (9) implies that the level set {(L, R) : f(L, R) ≤ f(L^{(0)}, R^{(0)})} is bounded. Hence, the Lipschitz constants W_{iv} and V_{vj} are bounded above. The rest follows again from (6) and (8).

Numerical Experiments
In this section, we present the results of various experiments, including a comparison with classical matrix completion, i.e., with U ≡ L ≡ ∅. We focus on how much one can benefit from imposing obvious inequalities.

Classical MC vs. inequality-constrained MC with margin ∆
Motivated by the fact that the best rank-r approximation of the original matrix can differ from the observed elements in every entry, we propose the following experiment. We generate a random matrix X ∈ R^{20×20} with rank 8. Afterwards, we sample p% of the entries of the matrix, store their indices in the set I, and solve (4) with just the inequality constraints, i.e., E ≡ ∅, U ≡ L ≡ I, X^L = X − ∆1 and X^U = X + ∆1, where 1 ∈ R^{m×n} is the matrix of all ones. Let us denote by Y*(∆) the solution of this optimization problem after 10^5 serial iterations (|Ŝ| = 1) with µ = 10^{−5}. Figure 1 shows the dependence on ∆ of the error

‖Y*(∆) − X(r)‖_F / ‖X(r)‖_F,

where X(r) is the best rank-r approximation of X obtained using the SVD of the whole matrix. Figure 1 clearly suggests that, e.g., if 50% of the elements are observed, then by allowing each reconstructed entry indexed by I to lie in a ∆-neighborhood of the observed value, we can decrease the relative error of the reconstruction from approximately 1.22 to 0.4 for ∆ ≈ R(r). In this case, the value of ‖X(7)‖_F was 21.3245 and R(r) = 0.1075.

Recovery
It is well known that the recovery of a rank-r matrix X ∈ R^{m×n} from just p < r(m + n − r) observed entries is an ill-posed problem Candès & Tao (2010); Tanner & Wei (2013), because there can exist infinitely many rank-r matrices matching the observed entries.
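The threshold r(m + n − r) is the number of degrees of freedom of a rank-r matrix (r(m + n) parameters of the factors, minus the r^2 ambiguity of the factorization). A one-line helper, with a name of our choosing:

```python
def dof(m, n, r):
    """Degrees of freedom of an m x n matrix of rank r; with fewer than
    r * (m + n - r) observed entries, recovery is ill-posed."""
    return r * (m + n - r)
```

For instance, the 20 x 20 rank-8 matrices of the previous experiment need at least 256 observed entries before recovery can even be well posed.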
In the next experiment, we constructed matrices X ∈ R^{m×n} with ranks r ∈ {1, 2, …, min{m, n}} and tried 10 different random samplings of p elements. For each random sampling, we ran Algorithm 1 for a maximum of 10^6 serial iterations (|Ŝ| = 1). Figure 2 shows how many times (out of 10) we managed to obtain a reconstruction with relative error less than 5% of the original matrix. The red line is the theoretical line, above which there is no guarantee of recovery (when p < r(m + n − r)).

Inpainting
Inpainting is the process of reconstructing missing parts of images or videos. Again, we can think of a (e.g., grayscale) image as a matrix X with values in [0, 1]. We again observe just a subset of elements indexed by I, and we want to find a low-rank matrix Y such that P_I(Y) ≈ P_I(X). However, this is not all we know! We also know that Y_{i,j} should lie in [0, 1] for all (i, j). Actually, because we are searching for a rank-r approximation of X, we only know for sure that −R(r) ≤ Y_{i,j} ≤ 1 + R(r), but for simplicity we assume that r is big enough, and therefore R(r) is small. To show how this simple and obvious side information can help to find a better rank-r matrix, we undertook the following experiment. We took a 512 × 512 grayscale image (Lenna) and chose 50% of the pixels at random, indexed by I. Then, we ran Algorithm 1 for 10^7 serial iterations (|Ŝ| = 1) and obtained solutions X_E(rank) and X_IN(rank), where X_E(rank) was obtained using only the equality constraints and X_IN(rank) using the additional [0, 1] inequality constraints (Figure 3). To illustrate the effect of the obvious constraints further, we took a 50 × 50 image and sampled 50% of the pixels at random. (The image is the top-left corner of the Lenna image.) Figure 4 shows the original image X and the best rank-10 approximation X(10). The solutions X_E, X_{E+U}, X_{E+L} and X_{E+U+L} were obtained by running Algorithm 1 for 3 × 10^5 serial iterations (|Ŝ| = 1), where E contains the observed pixels and U and L contain all other pixels. We used X^L = 0 and X^U = 1. The results again suggest that adding simple and obvious constraints leads to a better low-rank reconstruction and helps to keep the reconstructed elements of the matrix within the expected bounds.
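The constraint sets for this experiment can be assembled in one vectorized step. A sketch; the function name and the convention of pinning observed pixels via degenerate intervals are ours.

```python
import numpy as np

def inpainting_bounds(img, observed):
    """Entrywise bounds for inpainting: observed pixels are pinned to their
    values, every missing pixel gets the 'obvious' box [0, 1]."""
    XL = np.where(observed, img, 0.0)
    XU = np.where(observed, img, 1.0)
    return XL, XU
```

Feeding these bounds to the solver is all that distinguishes the inequality-constrained reconstruction from the classical equality-only one.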

The Netflix Problem
Within collaborative filtering, we focus on the problem presented in the Netflix Prize, which bears the name of Netflix, a company which provides streaming media (e.g., movies and TV series) on demand. Customers of Netflix can rate movies they have seen, and Netflix uses such ratings to suggest which movies to watch next. If you have ever rated movies on Netflix, though, you may have noticed that whether you give a movie three stars or four depends on your current mood, viewing conditions, etc. Formally, there is a matrix X, where each row corresponds to one user and each column corresponds to a movie. We know the values X_{i,j} for all (i, j) ∈ I, but consider interval uncertainty sets around the actual ratings and solve:

minimize rank(Y), subject to max{1, X_{i,j} − 1} ≤ Y_{i,j} ≤ min{5, X_{i,j} + 1}, (i, j) ∈ I.

Given that Netflix uses a scale of 1 to 5 stars in the ratings, we use width-2 interval uncertainty sets, but this can be changed freely.
In our computational experiments, we have used a dataset which contains 100,198,805 ratings of 480,189 users for 17,770 movies, along with a smaller variant (smallnetflix) that comes with a held-out validation set. On the larger dataset, we look for a 480189 × 17770 matrix of rank 20, but we do not have a validation matrix to match.
All data were obtained from a repository hosted by Carnegie Mellon University. In order to illustrate the impact of the use of inequalities, we present the evolution of the Root-Mean-Square Error (RMSE) on smallnetflix in Figure 5. The RMSE is defined as

RMSE = ( (1/c_val) Σ_{(i,j)∈I_val} (Y*_{i,j} − X_{i,j})^2 )^{1/2},

where I_val contains c_val = 545,177 validation points and Y* is the solution obtained by Algorithm 1 with µ = 0.001 and constraints X_{i,j} − ∆ ≤ Y_{i,j} ≤ X_{i,j} + ∆ for all (i, j) ∈ I_tr. By an epoch we mean c_tr element updates of the matrix L and c_tr element updates of the matrix R. Let us remark that the RMSE is sensitive to the choice of ∆ and the rank of the matrix we are looking for. If the underlying matrix has a higher rank than expected, ∆ > 0 can lead to smaller values of RMSE. We should also note that for some fixed ∆_1 and ∆_2, the RMSE can be better with ∆_1 for a few epochs, but then become worse when compared with ∆_2. Hence, in practice, cross-validation should be used to determine a suitable value of the parameter ∆. One can use the following heuristic: start solving the problem with a relatively large ∆ and decrease the parameter slowly, e.g., after each epoch. This corresponds to a process where one seeks progressively less rough approximations of X, similar to decreasing the penalty parameter λ in LASSO Tibshirani (1996).
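The validation metric is straightforward to compute. A sketch, with a function name of our choosing:

```python
import numpy as np

def rmse(Y, X, I_val):
    """Root-mean-square error of the reconstruction Y against the held-out
    ratings of X over the validation index set I_val."""
    sq = [(Y[i, j] - X[i, j]) ** 2 for i, j in I_val]
    return float(np.sqrt(np.mean(sq)))
```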
In order to illustrate the run time and the efficiency of parallelization of Algorithm 1, Figure 6 shows the evolution of RMSE both per iteration and per run time on the larger dataset. As expected, the evolution per iteration is almost identical; the only difference stems from the fact that different random seeds were chosen. The evolution per run time shows an almost linear speed-up between 1 and 4 cores and a marginally worse speed-up between 4 and 8 cores. Let us remark that in the recovery of an m × n matrix X, one iteration denotes m coordinate updates of the matrix L and n coordinate updates of the matrix R.

Extensions and Conclusion
We have presented the inequality-constrained matrix completion problem and an efficient algorithm, which converges to stationary points of the NP-hard, non-convex optimization problem without ever trying to approximate the spectrum of the matrix. In our computational experiments, we have shown that even the most obvious inequality constraints are useful in a number of applications. Some of the applications, e.g., collaborative filtering under uncertainty, may be of independent interest. This opens numerous avenues for further research:

Non-negative matrix factorization. The coordinate descent algorithm for problem (4) is easy to extend, e.g., towards non-negative factorization. It suffices to modify lines 7 and 13 in Algorithm 1 as follows: L_{i,r} = max{0, L_{i,r} + δ_{i,r}}, R_{r,j} = max{0, R_{r,j} + δ_{r,j}}. One could consider extensions beyond box constraints on the individual elements as well.
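Both modifications of lines 7 and 13 are instances of one projected coordinate update: take the step, then clip back into an interval. A sketch; the function name and parameters are ours.

```python
def clipped_update(value, delta, lo, hi):
    """Projected coordinate update: take the step delta, then clip back
    into [lo, hi]. With lo = 0, hi = inf this is the non-negative variant;
    with lo = -zeta, hi = zeta it is the bounded variant that removes the
    need for the regularizer mu."""
    return min(hi, max(lo, value + delta))
```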
Getting rid of the parameter µ. If we have some a priori bound on the magnitude of the entries of the factors of the matrix to be reconstructed, let us denote it ζ, then we can modify lines 7 and 13 in Algorithm 1 as follows: L_{i,r} = max{min{ζ, L_{i,r} + δ_{i,r}}, −ζ}, R_{r,j} = max{min{ζ, R_{r,j} + δ_{r,j}}, −ζ}.
We would be delighted to share our code with other researchers interested in the problem.

Figure 2: Recovery of a rank-r matrix X ∈ R^{m×n} for different numbers of observed elements p.

Figure 3: Adding obvious constraints can help to obtain a better solution.

Figure 4: Original 50 × 50 image, the best rank-10 approximation, and reconstructions using Algorithm 1 with different settings. RE is the relative error defined as RE(X•) = ‖X• − X(10)‖_F / ‖X(10)‖_F.

Figure 6: The evolution of RMSE on the larger dataset as a function of the number of iterations and of run time, using 1, 4 and 8 cores. The rank parameter is 20 and µ = 10^{−3}.
It is an easy exercise to show that for any X ∈ R^{m×n} with singular values σ_1 ≥ σ_2 ≥ ⋯ ≥ σ_{min{m,n}}, and Y* the best rank-r approximation of X, one has |X_{i,j} − Y*_{i,j}| ≤ (Σ_{i=r+1}^{min{m,n}} σ_i^2)^{1/2} = R(r) for all (i, j).