Adaptive Multinomial Matrix Completion

The task of estimating a matrix given a sample of observed entries is known as the \emph{matrix completion problem}. Most works on matrix completion have focused on recovering an unknown real-valued low-rank matrix from a random sample of its entries. Here, we investigate the case of highly quantized observations when the measurements can take only a small number of values. These quantized outputs are generated according to a probability distribution parametrized by the unknown matrix of interest. This model corresponds, for example, to ratings in recommender systems or labels in multi-class classification. We consider a general, non-uniform, sampling scheme and give theoretical guarantees on the performance of a constrained, nuclear norm penalized maximum likelihood estimator. One important advantage of this estimator is that it does not require knowledge of the rank or an upper bound on the nuclear norm of the unknown matrix and, thus, it is adaptive. We provide lower bounds showing that our estimator is minimax optimal. An efficient algorithm based on lifted coordinate gradient descent is proposed to compute the estimator. A limited Monte-Carlo experiment, using both simulated and real data is provided to support our claims.


Introduction
The matrix completion problem arises in a wide range of applications such as image processing [14,15,27], quantum state tomography [12], seismic data 0 /Adaptive Multinomial Matrix Completion 1 reconstruction [28] or recommender systems [20,2]. It consists in recovering all the entries of an unknown matrix, based on partial, random and, possibly, noisy observations of its entries. Of course, since only a small proportion of entries is observed, the problem of matrix completion is, in general, ill-posed and requires a penalization favoring low rank solutions. In the classical setting, the entries are assumed to be real valued and observed in presence of additive, homoscedastic Gaussian or sub-Gaussian noise. In this framework, the matrix completion problem can be solved provided that the unknown matrix is low rank, either exactly or approximately; see [6,16,19,24,4,18] and the references therein. Most commonly used methods amount to solve a least square program under a rank constraint or its convex relaxation provided by the nuclear (or trace) norm [9].
In this paper, we consider a statistical model where instead of observing a real-valued entry of an unknown matrix we are now able to see only highly quantized outputs. These discrete observations are generated according to a probability distribution which is parameterized by the corresponding entry of the unknown low-rank matrix. This model is well suited to the analysis of voting patterns, preference ratings, or recovery of incomplete survey data, where typical survey responses are of the form "true/false", "yes/no" or "agree/disagree/no opinion" for instance. The problem of matrix completion over a finite alphabet has received much less attention than the traditional unquantized matrix completion. One-bit matrix completion, corresponding to the case of binary, i.e. yes/no, observations, was first introduced by [7]. In this paper, the first theoretical guarantees on the performance of a nuclear-norm constrained maximum likelihood estimator are given. The sampling model considered in [7] assumes that the entries are sampled uniformly at random. Unfortunately, this condition is unrealistic for recommender system applications: in such a context some users are more active than others and popular items are rated more frequently. Another important issue is that the method of [7] requires the knowledge of an upper bound on the nuclear norm or on the rank of the unknown matrix. Such information is usually not available in applications. On the other hand, our estimator yields a faster rate of convergence than those obtained in [7].
One-bit matrix completion was further considered by [5] where a max-norm constrained maximum likelihood estimate is considered. This method allows more general non-uniform sampling schemes but still requires an upper bound on the max-norm of the unknown matrix. Here again, the rates of convergence obtained in [5] are slower than the rate of convergence of our estimator. Recently, [13] consider general exponential family distributions, which cover some distributions over finite sets. Their method, unlike our estimator, requires the knowledge of the "spikiness ratio" (usually unknown) and the uniform sampling scheme.
In the present paper, we consider a maximum likelihood estimator with nuclear-norm penalization. Our method allows us to consider general sampling scheme and only requires the knowledge of an upper bound on the maximum absolute value of the entries of the unknown matrix. All the previous works on this model also require the knowledge of this bound together with some additional (and more difficult to obtain) information on the unknown matrix.
The paper is organized as follows. In Section 2.1, the one-bit matrix completion is first discussed and our estimator is introduced. We establish upper bounds both on the Frobenius norm between the unknown true matrix and the proposed estimator and on the associated Kullback-Leibler divergence. In Section 2.2 lower bounds are established, showing that our upper bounds are minimax optimal up to logarithmic factors. Then, the one-bit matrix completion problem is extended to the case of a more general finite alphabet. In Section 3 an implementation based on the lifted coordinate descent algorithm recently introduced in [8] is proposed. A limited Monte Carlo experiment supporting our claims is then presented in Section 4.

Notations
For any integers n, m 1 , m 2 > 0, [n] := {1, . . . , n}, m 1 ∨ m 2 := max(m 1 , m 2 ) and m 1 ∧ m 2 := min(m 1 , m 2 ). We equip the set of m 1 × m 2 matrices with real entries (denoted R m1×m2 ) with the scalar product X|X ′ := tr(X ⊤ X ′ ). For a given matrix X ∈ R m1×m2 we write X ∞ := max i,j |X i,j | and for any ρ ≥ 1, we denote its Schatten ρ-norm (see [1]) by with σ i (X) the singular values of X ordered in decreasing order. The operator norm of X is X σ,∞ := σ 1 (X). For any integer q > 0, we denote by R m1×m2×q the set of m 1 × m 2 × q (3-way) tensors. A tensor X is of the form X = (X l ) q l=1 where X l ∈ R m1×m2 for any l ∈ [q]. For any integer p > 0, a function f : R q → S p is called a p-link function, where S p is the p−dimensional probability simplex. Given a p-link function f and X , X ′ ∈ R m1×m2×q , we define the squared Hellinger distance .
For any tensor X ∈ R m1×m2×q we define rk(X ) := max l∈[q] rk(X l ), where rk(X l ) is the rank of the matrix X l and its sup-norm by X ∞ := max l∈[q] X l ∞ .

One-bit matrix completion
Assume that the observations follow a binomial distribution parametrized by a matrixX ∈ R m1×m2 . Assume in addition that an i.i.d. sequence of coefficients ) n is revealed and denote by Π their distribution. The observations associated to these coefficients are denoted by (Y i ) n i=1 ∈ {1, 2} n and distributed as follows where f = (f j ) 2 j=1 is a 2−link function. For ease of notation, we often writeX i instead ofX ωi . Denote by Φ Y the (normalized) negative log-likelihood of the observations: Let γ > 0 be an upper bound of X ∞ . We consider the following estimator ofX: with λ > 0 being a regularization parameter. Consider the following assumptions.
Remark 1. As shown in [7, Lemma 2], K γ satisfies Our framework allows a general distribution Π. We assume that Π satisfies the following assumptions introduced in [18] in the classical setting of unquantized matrix completion: H2. There exists a constant µ > 0 such that, for any m 1 > 0 and m 2 > 0 Denote by R k = m2 k ′ =1 π k,k ′ and C k ′ = m1 k=1 π k,k ′ the probability of revealing a coefficient from row k and column k ′ , respectively.
H3. There exists a constant ν ≥ 1 such that, for all m 1 , m 2 , The first assumption ensures that every coefficient has a nonzero probability of being observed, whereas the second assumption requires that no column nor row is sampled with too high probability (see also [10,18] for more details on these conditions). For instance, the uniform distribution yields µ = ν = 1. Define Theorem 1. Assume 1, 2, 3 and that X ∞ ≤ γ. Assume in addition that n ≥ 2m log(d)/(9ν). Take Then, with probability at least 1−3d −1 the Kullback-Leibler divergence is bounded by withc a universal constant whose value is specified in the proof.
Proof. Using Lemma 9 and Theorem 1, the result follows.
Remark 2. Note that, up to the factor L 2 γ /K 2 γ , the rate of convergence given by Corollary 2, is the same as in the case of usual unquantized matrix completion, see, for example, [18] and [19]. For this usual matrix completion setting, it has been shown in [19,Theorem 3] that this rate is minimax optimal up to a logarithmic factor. Let us compare this rate of convergence with those obtained in previous works on 1-bit matrix completion. In [7], the parameterX is estimated by minimizing the negative log-likelihood under the constraints X ∞ ≤ γ and X σ,1 ≤ γ √ rm 1 m 2 for some r > 0. Under the assumption that rk(X) ≤ r, they could prove that X −X 2 σ,2 where C γ is a constant depending on γ (see [7,Theorem 1]). This rate of convergence is slower than the rate of convergence given by Corollary 2.
[5] studied a max-norm constrained maximum likelihood estimate and obtain a rate of convergence similar to [7]. In [13], matrix completion was considered for a likelihood belonging to the exponential family. Note, for instance, that the logit distribution belongs to such a family. The following upper bound on the estimation error is provided (see [13, Theorem 1]) Comparing with Corollary 2, (10) contains an additional term α 2 * where α * is an upper bound of √ m 1 m 2 X ∞ .

Minimax lower bounds for one-bit matrix completion
Corollary 2 insures that our estimator achieves certain Frobenius norm errors. We now discuss the extent to which this result is optimal. A classical way to address this question is by determining minimax rates of convergence. For any integer 0 ≤ r ≤ min(m 1 , m 2 ) and any γ > 0, we consider the following family of matrices We will denote by infX the infimum over all estimatorsX that are functions of the data (ω i , Y i ) n i=1 . For any X ∈ R m1×m2 , let P X denote the probability distribution of the observations (ω i , Y i ) n i=1 for a given 2−link function f and sampling distribution Π. We establish a lower bound under an additional assumption on the function f 1 : In particular, 4 is satisfied in the case of logit or probit models. The following theorem establishes a lower bound on the minimax risk in squared Frobenius norm: Proof. See Section 5.3.
Note that the lower bound given by Theorem 3 is proportional to the rank multiplied by the maximum dimension ofX and inversely proportional the sample size n. Therefore the lower bound matches the upper bound given by Corollary 2 up to a constant and a logarithmic factor. The lower bound does not capture the dependance on γ, note however that the upper and lower bound only differ by a factor L 2 γ /K γ .

Extension to multi-class problems
Let us now consider a more general setting where the observations follow a distribution over a finite set {1, . . . , p}, parameterized by a tensorX ∈ R m1×m2×q . The distribution of the observations where f = (f j ) p j=1 is now a p-link function andX ωi denotes the vector (X l ωi ) q l=1 . The negative log-likelihood of the observations is now given by: where we use the notation X i = X ωi . Our proposed the estimator is defined as: In order to extend the results of the previous sections we make an additional assumption which allows to split the log-likelihood as a sum.
such that the p-link function f can be factorized as follows The model considered above covers many finite distributions including among others logistic binomial (see Section 2.1) and conditional logistic multinomial (see Section 3). Assumptions on constants depending on the link function are extended by H6. There exist positive constant H γ , L γ and K γ such that: For any tensor X ∈ R m1×m2×q , we writeΣ : We can now state the main results of this paper.
Note that the lower bound of λ is stochastic and the expectation E Σ R σ,∞ is unknown. However, these quantities can be controlled using 3.

Implementation
In this section an implementation for the following p-class link function is given: This p-class link function boils down to parameterizing the distribution of the observation as follows: Assumption 5 is satisfied and the problem (13) is separable w.r.t. each matrix X l . Following [7], we solve (13) without taking into account the constraint γ; as reported in [7] and confirmed by our experiments, the impact of this projection is negligible, whereas it increases significantly the computation burden. Because the problem is separable, it suffices to solve in parallel each subproblem This can be achieved by using the coordinate gradient descent algorithm introduced by [8]. To describe the algorithm, consider first the set of normalized rank one matrices Such function is not unique. Consider an SVD of X i.e., Conversely, for any θ ∈ Θ + , define and the auxiliary objective function: The triangular inequality implies that for all θ ∈ Θ + ,  (19) is actually equivalent to the minimization of (17); see [8,Theorem 3.2]. The minimization (19) can be implemented using a coordinate gradient descent algorithm which updates at each iteration the nonnegative finite support function θ.
The algorithm is summarized in Algorithm 1. Compared to the Soft-Impute [23] or the SVT [3] algorithms, this algorithm does not require the computation of a full SVD at each step of the main loop of an iterative (proximal) algorithm (recall that the proximal operator associated to the nuclear norm is the softthresholding operator of the singular values). The proposed algorithm requires only to compute the largest singular values and associated singular vectors.
Another interest of this algorithm is that it only requires to evaluate the coordinate of the gradient for the entries which have been actually observed. It is therefore memory efficient when the number of observations is smaller than the total number of coefficients m 1 m 2 , which is the typical setting in which matrix completion is used. Moreover, we use Arnoldi iterations to compute the top singular values and vector pairs (see [11,Section 10.5] for instance) which allows us to take full advantage of sparse structures, the minimizations in the inner loop are carried out using the L-BFGS-B algorithm. Table 1 provides the execution time one-bit matrix completion (on a 3.07Ghz w3550 Xeon CPU with RAM 1.66 Go, Cache 8 Mo, C implementation).

Numerical Experiments
We have performed numerical experiments on both simulated and real data provided by the MovieLens project (http://grouplens.org). Both the onebit matrix completion -p = 2, q = 1 -and the extended multi-class setting -p = 5, q = 4 -are considered; comparisons are also provided with the classical Gaussian matrix completion algorithm to assess the potential gain achieved by explicitly taking into account the facts that the observations belong to a finite alphabet. Only a limited part of the experiments are reported in this article; a more extensive assessment can be obtained upon authors request.
The observations are sampled to the conditional multinomial logistic model introduced in Section 3. For comparison purposes we have also computedX N , the classical Gaussian version (i.e., using a squared Frobenius norm in (13)). Contrary to the logit version, the Gaussian matrix completion does not directly recover the distribution of the observations (Y i ) n i=1 . However, we can estimate P(Y i = j) by the following quantity: where F N (0,1) is the cdf of a zero-mean standard Gaussian random variable. The choice of the regularization parameter λ has been solved for all methods by performing 5-fold cross-validation on a geometric grid of size 0.6 log(n) (note that the estimators are null for λ greater than ∇ Φ Y (0) σ,∞ ).
As evidenced in Figure 1, the Kullback-Leibler divergence for the logistic estimator is significantly lower than for the Gaussian estimator, for both the p = 2 and p = 5 cases. This was expected because the Gaussian model assume implicitly symmetric distributions with the same variance for all the ratings, These assumptions are of course avoided by the logistic modem.
Regarding the prediction error, Table 2 and Table 3 summarize the results obtained for a 1000 × 600 matrix. The logistic model outperforms the Gaussian model (slightly for p = 2 and significantly for p = 5).  Table 2 Prediction errors for a binomial (2 classes) underlying model, for a 1000 × 600 matrix.
the prediction errors, we randomly select 20% of the entries as a test set, and the remaining entries are split between a training set (80%) and a validation set (20%).
For this dataset, ratings range from 1 to 5. To consider the benefit of a binomial model, we have tested each rating against the others (e.g., ratings 5 are set to 0 and all others are set to 1).
These results are summarized in Table 4. For the multinomial case, we find a prediction error of 0.59 for the logistic model against a 0.63 for the Gaussian one.  Table 4 Binomial prediction error when performing one versus the others procedure on the MovieLens 100k dataset.

Proof of Theorem 1 and Theorem 4
Proof. Since Theorem 1 is an application of Theorem 4 for p = 2 and q = 1 it suffices to prove Theorem 4. We consider a tensor X which satisfies Φ λ Y (X ) ≤ Φ λ Y (X ), (e.g., X =X ). We get from Lemma 6 Let us define where the expectation is taken both over the (E i ) 1≤i≤n and (Y i ) 1≤i≤n . As stated in Lemma 11, 2 implies µ D f (X ), f (X ) ≥ KL f (X ), f (X ) . We now need to control the left hand side of (20) uniformly over X with high probability. Since we assume λ > 2 max l∈[q] Σ l σ,∞ applying Lemma 10 (30) and then Lemma 11 Consequently, if we define C(r) as we need to control (Φ Y (X ) − Φ Y (X )) for X ∈ C(16µr). We have to ensure that D f (X ), f (X ) is greater than a given threshold β > 0 and therefore we define the following set We then consider the two following cases. (23) gives X ∈ C β (16µr). Plugging Lemma 12 in (20) with β = 2M γ log(d)/(η n log(α)) , α = e and η = 1/(4α) then it holds with probability at least where ǫ is defined in Lemma 12. Recalling Lemma 11 we get An analysis of this second order polynomial and the relation ǫ(16µr, α, η)/µ = ǫ(16r, α, η) lead to KL f (X ), f (X ) ≤ µ λ √r + λ 2r + 2ǫ(16r, α, η) .
where we have used Lemma 7-(ii) and (iii) and for the last two lines and the definition of K γ and Lemma 8 to get the result.
Lemma 8. For any tensor X ,X ∈ R m1×m2×q and p-link function f it holds:

Proof. See [26, Lemma 4.2]
Lemma 9. For any p, q > 0 and p-link function f and any X ,X ∈ R m1×m2×q satisfying X ∞ ≤ γ and X ∞ ≤ γ, we get: Proof. For p = 2 and q = 1, it is a consequence of Remark 1 and Lemma 8. Otherwise, the proof follows from the definition (16) of K γ and Lemma 8.
Proof. The proof is adapted from [24,Theorem 1] and [18,Lemma 12]. We use a peeling argument combined with a sharp deviation inequality detailed in Lemma 13, Consider the events Let us also define the set and Then for any X ∈ B ∩ S l we have Moreover by definition of S l , X ∈ C β (r, α l β). Therefore If we now apply the union bound and Lemma 13 we get where we used x ≤ e x in the second inequality.
Proof. Using Massart's inequality ([22, Theorem 9]) we get for 0 < η < 1/(2α) By using the standard symmetrization argument, we get where ε := (ε i ) 1≤i≤n is a Rademacher sequence which is independent from (Y i ) 1≤i≤n and (E i ) 1≤i≤n . 5 yields Since for any i ∈ [n], the function is a contraction satisfying φ i (0) = 0, the contraction principle ( [21,Theorem 4.12]) and the fact that Denoting Σ R := n −1 n i=1 ε i E i and the duality, the previous inequality implies where we have the definition of C β (r, t) for the last inequality. Plugging into (37) gives . The proof is concluded by noting that, since for any a, b ∈ R and c > 0, ab ≤ (a 2 /c + cb 2 )/2,

Proof of Theorem 5
Proof. By Theorem 5 it suffices to control Σ l σ,∞ and E[ Σ R σ,∞ ]. For any l ∈ [q], by definition with ∂ l designating the partial derivative against the l-th variable. The sequence of matrices and (f j (X k,k ′ )) p j=1 is a probability distribution, we obtain were we have 3 for the last inequality. Using a similar argument we get E[ n i=1 Z ⊤ i Z i ] σ,∞ /n ≤ L 2 γ ν/m. Therefore, Proposition 14 applied with t = log(d), U = L γ and σ 2 Z = L 2 γ ν/m yields with at least probability 1 − 1/d, With the same analysis for Σ R := 1 n n i=1 ε i E i and by applying Lemma 15 with U = 1 and σ 2 Z = ν m , for n ≥ n * := m log(d)/(9ν) it holds: Proof. The proof is adapted from [18,Lemma 6]. Define t * := (9nσ 2 Z )/U 2 − log(d) the value of t for which the two bounds of Proposition 14 are equal. Let ν 1 := n/(σ 2 Z c * 2 ) and ν 2 := 3n/(U c * ) then, from Proposition 14 we have Let h ≥ 1, then .
For notational convenience, the dependence of κ * in r, M and n is implicit. We start with a packing set construction, inspired by [7]. Assume w.l.o.g., that m 1 ≥ m 2 . For κ ≤ 1, define and consider the associated set of block matrices where O denotes the m 1 × (m 2 − r⌊m 2 /r⌋) zero matrix, and ⌊x⌋ is the integer part of x.

Remark 3.
In the case m 1 < m 2 , we only need to change the construction of the low rank component of the test set. We first build a matrixL ∈ R r×m2 with entries in − κγ 2 , κγ 2 and then we replicate this matrix to obtain a block matrix L of size m 1 × m 2 .
Let I m1×m2 denote the m 1 × m 2 matrix of ones. The Varshamov-Gilbert bound ([26, Lemma 2.9]) guarantees the existence of a subset L ′′ ⊂ L ′ with cardinality Card(L ′′ ) ≥ 2 (rM)/8 + 1 containing the matrix (κγ/2) I m1×m2 and such that, for any two distinct elements X 1 and X 2 of L ′′ , By construction, any element of A as well as the difference of any two elements of A has rank at most r, the entries of any matrix in A take values in [0, γ], and X 0 = γI m1×m2 belongs to A. Thus, A ⊂ F (r, γ). Note that A has the same size as L ′′ and it also satisfies the same bound on pairwise distances, i.e. for any two distinct elements X 1 and X 2 of A, (42) is satisfied.
For some X ∈ A, we now estimate the Kullback-Leibler divergence D (P X 0 P X ) between probability measures P X 0 and P X . By independence of the observations Since X 0 ω1 = γ and either X ω1 = X 0 ω1 or X ω1 = (1 − κ)γ, by Lemma 16 we get