On the distance to low-rank matrices in the maximum norm

Every sufficiently large matrix with small spectral norm has a nearby low-rank matrix if the distance is measured in the maximum norm (Udell & Townsend, SIAM J. Math. Data Sci., 2019). We use the Hanson-Wright inequality to improve the estimate of the distance for matrices with incoherent column and row spaces. In numerical experiments with several classes of matrices, we study how well the theoretical upper bound describes the approximation errors achieved with the method of alternating projections.


Introduction
To check whether a matrix X ∈ R^{m×n} admits a good low-rank approximation, we typically look at its singular values σ_1, ..., σ_{min{m,n}}. The theorems of Schmidt [1], Eckart-Young [2] and Mirsky [3] state that for every unitarily invariant norm ∥·∥_• and every r < rank(X), the minimum of the functional ∥X − Y∥_• over all matrices Y with rank(Y) = r equals ∥diag(0, ..., 0, σ_{r+1}, ..., σ_{min{m,n}})∥_• and is achieved at the truncated singular value decomposition of X. Recall that a matrix norm is called unitarily invariant if ∥Q_m A Q_n∥_• = ∥A∥_• for every A ∈ R^{m×n} and all orthogonal matrices Q_m ∈ R^{m×m} and Q_n ∈ R^{n×n}. This class includes the widely used spectral norm ∥·∥_2 and Frobenius norm ∥·∥_F, but there are other important norms that are not unitarily invariant and, therefore, are not covered by the above theorems. We focus on the maximum norm ∥X∥_max = max_{i,j} |x_{ij}|, which is not unitarily invariant. Let d_r(X) denote the distance in the maximum norm from X to the set of matrices of rank at most r. Just as for any other norm, there is an ultimate upper bound on the approximation error, attained in the limit by a vanishing sequence of rank-r matrices:

d_r(X) ≤ ∥X∥_max. (1)

The ideas of cross approximation [4,5] lead to a bound that can be simplified to

d_r(X) ≤ 2σ_{r+1} when m, n ≥ 2r − 1. (2)

This result is similar in flavor to the case of the spectral norm but, unlike it, can become trivial. Indeed, the spikiness of X allows us to rewrite the estimate (2) in terms of ∥X∥_max, and the right-hand side then exceeds ∥X∥_max once σ_{r+1} > ∥X∥_2/2. Therefore, the upper bound (2) guarantees good approximation in the maximum norm when the singular values decay rapidly but ceases to be restrictive when the spectrum is flat.
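The Schmidt-Eckart-Young-Mirsky theorem is easy to verify numerically. The following sketch (ours, with NumPy; not part of the original text) computes the truncated SVD and checks the optimal errors in the two unitarily invariant norms:

```python
import numpy as np

def best_rank_r(X, r):
    """Truncated SVD of X: by Schmidt/Eckart-Young/Mirsky, this minimizes
    every unitarily invariant norm of X - Y over matrices Y of rank <= r."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 30))
r = 5
Y = best_rank_r(X, r)
s = np.linalg.svd(X, compute_uv=False)

err_spec = np.linalg.norm(X - Y, 2)      # equals sigma_{r+1}
err_fro = np.linalg.norm(X - Y, 'fro')   # equals the l2-norm of the tail singular values
```

The maximum norm of X − Y, by contrast, carries no such closed-form characterization, which is the starting point of the paper.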
A completely different route was taken in [6], where the authors showed that d_r(X) can be small regardless of the singular values. Namely, they proved the following.
For example, if X is a multiple of the n × n identity matrix for sufficiently large n, then d_r(X) ≤ ε∥X∥_max for a rank r that grows only logarithmically in n, while the best rank-r approximation errors in the spectral and Frobenius norms are equal to ∥X∥_2 and √(1 − r/n) ∥X∥_F. To achieve the same relative approximation error ε in the Frobenius norm, the rank must be at least n(1 − ε²), which scales linearly with n. For the spectral norm, the relative approximation error is always equal to one. At the same time, the rank scales as log(n) for the maximum norm: a truly low-rank approximation when n is extremely large. Theorem 1.1 was extended to nonsymmetric matrices in [7], though the maximum norm in the right-hand side had to be replaced with the spectral norm.
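A short computation (our illustration) confirms the identity-matrix numbers quoted above: the relative spectral error of any rank-r truncation is one, and the relative Frobenius error is √(1 − r/n):

```python
import numpy as np

n, r = 200, 50
X = np.eye(n)

# Best rank-r approximation: keep r of the n unit singular values.
U, s, Vt = np.linalg.svd(X)
Y = (U[:, :r] * s[:r]) @ Vt[:r, :]

rel_spec = np.linalg.norm(X - Y, 2) / np.linalg.norm(X, 2)
rel_fro = np.linalg.norm(X - Y, 'fro') / np.linalg.norm(X, 'fro')
# rel_spec equals 1; rel_fro equals sqrt(1 - r/n), so a rank of at least
# n * (1 - eps^2) is needed for relative Frobenius error eps.
```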
When m and n are sufficiently large, this theorem gives an upper bound that is restrictive for ε < γ_X/√(mn) even when all singular values of X are equal.

Contributions and outline
Theorem 1.2 is an interesting statement about low-rank approximation in non-unitarily-invariant norms, but its proof in [7] contains a small mistake. We feel obliged to correct it and simultaneously refine the estimate for matrices that are not of full rank and whose column and row spaces are incoherent; this is the purpose of Theorem 2.3. Following [7], we rely on the Johnson-Lindenstrauss lemma in the proof of Theorem 2.3. We then turn to the Hanson-Wright inequality to prove Theorem 3.4, an estimate of the approximation error that is tighter than that of Theorem 2.3.
Both Theorem 2.3 and Theorem 3.4 provide upper bounds on the distance d_r(X) in the maximum norm from X to the set of rank-r matrices. How accurate are they? We address this question numerically with the method of alternating projections as developed in [8, Section 7.5].
As far as we know, this is the first extensive numerical study of low-rank matrix approximation in the maximum norm.
The text is naturally divided into three parts. In Section 2, we recall the Johnson-Lindenstrauss lemma and prove Theorem 2.3. We introduce the Hanson-Wright inequality in Section 3 and prove Theorem 3.4 with its help. Finally, we present the method of alternating projections and discuss the results of our numerical experiments in Section 4.

Proof based on the Johnson-Lindenstrauss lemma
The lemma of Johnson and Lindenstrauss [9] states that every finite set in a high-dimensional Euclidean space can be mapped into a lower-dimensional Euclidean space so that the pairwise distances between the points of the set are approximately preserved. Below, we present a particular version of the lemma from [10]. The ℓ_2-norm of a vector is written as ∥·∥.
The Johnson-Lindenstrauss lemma plays a pivotal role in the proof of Theorem 1.2 in [7]. The lemma is not applied directly, though; rather, it is used to show that the pairwise inner products are approximately preserved together with the pairwise distances.
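The flavor of the lemma can be demonstrated with a Gaussian random projection. The sketch below is ours; the target dimension and the constant 8 are illustrative rather than the sharp values from [10]:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
m, d, eps = 50, 2000, 0.2
k = int(np.ceil(8 * np.log(m) / eps**2))       # target dimension, illustrative constant

points = rng.standard_normal((m, d))
P = rng.standard_normal((k, d)) / np.sqrt(k)   # Gaussian Johnson-Lindenstrauss map
proj = points @ P.T

# Ratios of squared pairwise distances after/before the projection:
# with high probability they all lie close to one.
ratios = [np.sum((proj[i] - proj[j])**2) / np.sum((points[i] - points[j])**2)
          for i, j in combinations(range(m), 2)]
```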
there exists a matrix Q ∈ R^{k×r} with orthonormal columns such that, for all i, j ∈ [m], the pairwise inner products are preserved to the stated accuracy. Proof. Follow the proof of [7, Lemma 2.4] and use Theorem 2.1.
To formulate our Theorem 2.3, a refined version of Theorem 1.2, we need to introduce the coherence of a subspace. Let L be a k-dimensional subspace of R^m and let Π_L : R^m → L be the corresponding orthogonal projection operator. The coherence of L is defined through the largest squared norm of the projections of the canonical basis vectors onto L: µ_L = (m/k) max_{1≤i≤m} ∥Π_L e_i∥².
For every subspace L, its coherence µ L lies in the closed interval [1, m/k].
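In code, the coherence of a subspace given by an orthonormal basis reduces to the largest row norm of the basis matrix. This is a sketch of ours, following the definition above:

```python
import numpy as np

def coherence(U):
    """Coherence of the column span L of U, where U (m x k) has orthonormal
    columns: mu = (m/k) * max_i ||Pi_L e_i||^2. Since Pi_L = U U^T,
    ||Pi_L e_i|| is just the norm of the i-th row of U."""
    m, k = U.shape
    return (m / k) * np.max(np.sum(U**2, axis=1))

m, k = 100, 10
# Most coherent subspace: spanned by k canonical basis vectors (mu = m/k).
U_spiky = np.eye(m)[:, :k]
# A random subspace is incoherent with high probability.
U_rand, _ = np.linalg.qr(np.random.default_rng(2).standard_normal((m, k)))
```

Both extremes of the interval [1, m/k] are attained: the canonical-basis subspace achieves m/k, while a random subspace stays close to the lower end.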
where µ_col and µ_row are the coherences of the column and row spaces of X.
Proof. Let X = UΣVᵀ be the thin singular value decomposition of X, define Ũ = UΣ^{1/2} ∈ R^{m×k} and Ṽ = VΣ^{1/2} ∈ R^{n×k}, and denote their rows by ũ_1, ..., ũ_m and ṽ_1, ..., ṽ_n. We then apply Corollary 2.2 with accuracy ε/3 to the set {ũ_1, ..., ũ_m, ṽ_1, ..., ṽ_n} ⊂ R^k. The choice of r guarantees that the corollary is applicable, so there is a matrix Q ∈ R^{k×r} with orthonormal columns such that the inner products ũ_iᵀQQᵀṽ_j approximate ũ_iᵀṽ_j = x_ij for all i ∈ [m] and j ∈ [n]. Next, we estimate the norms of the rows: ∥ũ_i∥² ≤ (k/m) µ_col ∥X∥_2 and, similarly, ∥ṽ_j∥² ≤ (k/n) µ_row ∥X∥_2. Using the definition of spikiness, we arrive at the desired bound with Y = ŨQQᵀṼᵀ. It remains to note that rank(Y) = r since Ũ, Ṽ and Q are of full rank.
This is exactly the statement of Theorem 1.2. In its original proof [7, Theorem 1.0], the rows and columns of Ũ and Ṽ were confused, which explains why the estimate of the rank in Theorem 1.2 depends on 2 min{m, n} rather than on m + n. All the other proofs in [7] that follow the same line of reasoning are free of this inaccuracy.
Theorem 2.3 controls the approximation error better than Theorem 1.2 for matrices with low spikiness and incoherent column and row spaces. Consider two examples. First, the Hadamard matrix of size n × n has γ_X = √n and µ_col = µ_row = 1, so the coherence factors in Theorem 2.3 are as small as possible. The second example involves randomness. It was proved in [11, Lemma 2.2] that if L is a random k-dimensional subspace of R^m then its coherence is with high probability bounded by an absolute constant (for k large enough compared to log m). Then, if X is designed to have random column and row spaces, Theorem 2.3 yields a correspondingly improved estimate. In both cases, even if m and n become arbitrarily large, the gain over Theorem 1.2 is within a constant factor. With the upcoming Theorem 3.4, we achieve a more substantial improvement.
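The claims about the Hadamard matrix are easy to check numerically. In this sketch of ours, the spikiness is taken to be γ_X = √(mn) ∥X∥_max / ∥X∥_2, which indeed equals √n for a Hadamard matrix:

```python
import numpy as np
from scipy.linalg import hadamard

n = 64
H = hadamard(n).astype(float)            # entries are +-1 and H H^T = n * I

spec = np.linalg.norm(H, 2)              # equals sqrt(n)
spikiness = n * np.abs(H).max() / spec   # sqrt(m * n) * ||H||_max / ||H||_2

# H is a scaled orthogonal matrix, so its column space is all of R^n:
# the orthogonal projector is the identity and the coherence equals 1.
Q, _ = np.linalg.qr(H)
mu_col = (n / n) * np.max(np.sum(Q**2, axis=1))
```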

Proof based on the Hanson-Wright inequality
Under the hood, the Johnson-Lindenstrauss lemma is a probabilistic statement about the concentration of random variables. So is the Hanson-Wright inequality [12,13], which establishes the concentration of a quadratic form in sub-Gaussian random variables around its mean. Recall that a real-valued random variable ξ is called sub-Gaussian if its tails decay at a Gaussian rate, P(|ξ| > t) ≤ 2 exp(−ct²/∥ξ∥²_{ψ2}) for all t ≥ 0, where c > 0 is an absolute constant and ∥ξ∥_{ψ2} denotes the sub-Gaussian norm of ξ.
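As a quick empirical illustration (ours), the quadratic form xᵀAx in a standard Gaussian vector x concentrates around its mean trace(A), with fluctuations on the scale of ∥A∥_F; this is precisely the phenomenon the Hanson-Wright inequality quantifies:

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 200, 2000
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                          # symmetric test matrix

x = rng.standard_normal((trials, n))       # independent standard Gaussian vectors
quad = np.einsum('ti,ij,tj->t', x, A, x)   # x^T A x for each sample

mean_emp = quad.mean()                     # close to trace(A)
std_emp = quad.std()                       # close to sqrt(2) * ||A||_F for Gaussian x
```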
In the proof of Theorem 2.3, we used Corollary 2.2 to control the deviation of inner products from their means; our current plan is to employ the Hanson-Wright inequality for this task. To do so, we need to set up a specific quadratic form and study its properties. We write ⊗ for the Kronecker product of matrices and I_r for the r × r identity matrix.
Proof. The spectral and Frobenius norms of a Kronecker product of matrices are equal to the products of the respective norms of the factors.
Proof. The constant L is chosen to normalize the sub-Gaussian norm of the entries. Theorem 3.1 then guarantees a concentration bound in which the squared term dominates for the relevant range of deviations.

Theorem 3.4. Let ε ∈ [√(k_0/k), 1] and assume that r ∈ N satisfies k ≥ r ≥ k_0/ε². Then for every matrix X ∈ R^{m×n} of rank k there exists a matrix Y ∈ R^{m×n} of rank r whose maximum-norm distance to X is controlled in terms of ε, the spectral norm of X, and the coherences µ_col and µ_row of its column and row spaces.
Proof. Take a non-zero centered sub-Gaussian random variable ξ with ∥ξ∥_{ψ2} ≤ 1 and populate a matrix Q ∈ R^{k×r} with independent copies of r^{−1/2} ξ. Following the proof of Theorem 2.3, define the matrices Ũ ∈ R^{m×k} and Ṽ ∈ R^{n×k} with rows {ũ_iᵀ}_{i=1}^m and {ṽ_jᵀ}_{j=1}^n. Then E[ũ_iᵀQQᵀṽ_j] = E[ξ²] ũ_iᵀṽ_j for every i ∈ [m] and j ∈ [n]. Next, we can rewrite ũ_iᵀQQᵀṽ_j as a quadratic form in vec(Q) ∈ R^{kr}, obtained by stacking the columns of Q, and use Lemma 3.3 to control its deviation from the mean. Assume that r ≥ c log(4mn)/ε². Then the union bound guarantees that the deviations are uniformly small with positive probability. Thus, the desired Q exists, and we can select one of full rank. Set Y = ŨQQᵀṼᵀ.

Let us return to the example from Section 2. Consider a rank-k matrix X ∈ R^{n×n} whose column and row spaces are drawn at random; under the assumptions of Theorem 3.4, for sufficiently large n, the coherences µ_col and µ_row of X are bounded by an absolute constant C, and there exists a matrix Y of rank C log(4n²)/ε² with relative maximum-norm error at most ε. Take a sequence of such n × n matrices {X_n} with ∥X_n∥_2 ≤ n^{δ/2} and rank(X_n) = ⌈n^{1−δ}⌉ for some δ ∈ (0, 1). Then Theorem 3.4 guarantees that d_{r_n}(X_n) → 0 with r_n = C log(4n²)/ε² as n → ∞. This behavior cannot be explained with Theorem 2.3.
Theorem 3.4 still has limitations, though. For a sequence of random orthogonal matrices {X_n}, it guarantees that d_{r_n}(X_n) ≤ ε for large n. Meanwhile, for every fixed rank r, the distance is known to approach the ultimate upper bound (1) [14].

Numerical experiments

Known algorithms
Even though the maximum-norm distance from a matrix to the set of rank-r matrices is always attained, we cannot expect to compute the closest matrix with a reasonably efficient numerical algorithm since the problem is known to be NP-hard even for r = 1 (unless the sign pattern of the minimizer is known, in which case there is a polynomial-complexity algorithm that checks whether the distance is smaller than a given constant [15]). Combinatorial arguments were used in [16] to design an optimization algorithm that converges to the best rank-one approximant and is based on repeated alternating minimization with different initial conditions.
The rank-r approximation problem was addressed in [17] with alternating minimization (but the algorithm was only shown to reach local minimizers) and in [15] with a heuristic coordinate-descent approach. The algorithm from [18] is based on the column subset selection problem and is proved to find a rank-r matrix Y such that ∥X − Y∥_max = O(r) · d_r(X) using O(r^r log_r(m) · poly(mn)) operations. In [19], gradient descent was applied to a smoothed and regularized functional; with an optimal (but practically impossible to select) choice of parameters, the algorithm can find a rank-r approximant with a guaranteed bound on the approximation error.

Alternating projections
In this article, we use the method of alternating projections to numerically compute low-rank approximations in the maximum norm. Following [8, Section 7.5], we introduce the closed maximum-norm ball B_ε(X) of radius ε centered at X and the set M_r of rank-r matrices. The method of alternating projections consists of choosing an initial low-rank matrix Y_0 ∈ M_r and computing successive Euclidean projections onto the two sets. The crucial aspect making alternating projections viable for our problem is that the projections are required to minimize the Euclidean distance (i.e. the Frobenius norm) rather than the maximum-norm distance. For the closed ball B_ε(X), the Euclidean projection is unique and amounts to clipping the large elements of Y_k − X: Z_k = X + clip(Y_k − X, −ε, ε), where the clipping acts entrywise. The Euclidean projection onto M_r follows from the truncated singular value decomposition and can be non-unique if there are repeated singular values (any truncation can be chosen in this case). When the initial condition Y_0 is "not too bad", both sequences {Z_k} and {Y_k} approach the same limit in M_r ∩ B_ε(X). Combining the method of alternating projections with a binary search over feasible values of ε, we can estimate the maximum-norm distance d_r(X) and provide a numerical upper bound d_r(X) ≤ ε_+. To accelerate the computations, we use the randomized singular value decomposition for rank truncation. Find further details in [8, Section 7.5].
We are going to consider four different classes of matrices in order to better understand how the maximum-norm distance d r (X) depends on the properties of the matrix X.

Identity matrices
The first class are the identity matrices {I_n} of varying sizes. Some preliminary numerical results on identity matrices were reported in [8], and our goal here is to elaborate on them. The maximum and spectral norms of I_n are independent of n, hence the ultimate upper bound (1) and Theorem 3.4 guarantee, respectively, that d_r(I_n) ≤ 1 for every r and d_{r_n}(I_n) ≤ ε for r_n = C log(4n²)/ε². In two series of experiments, we use the method of alternating projections to numerically estimate d_r(I_n) for
• fixed matrix size n and varying approximation rank r,
• fixed approximation rank r and varying matrix size n.
For every pair (n, r), we form a random initial matrix Y_0 ∈ M_r as a product G_1 G_2ᵀ of two n × r random Gaussian matrices with variance r^{−1} and repeat the experiment five times with different instances of Y_0. We plot the smallest errors achieved over the five experiments in Fig. 1.
When n is fixed and r is small, we observe that the numerical estimate of d_r(I_n) decays as r^{−1} (the same behavior was reported in [8] for n = 1600). As the approximation rank r approaches n, the numerical error drops faster (as is to be expected), at a rate similar to r^{−1}√(n − r). For moderate values of r, not too small and not too close to n, this numerical bound is in accordance with the r^{−1/2} decay rate of Theorem 3.4. With fixed r, the error grows approximately as √(n − r) for n close to r and slows down to a polylogarithmic rate for larger n (we observed similar behavior for r = 40 in [8], but with a different power of the logarithm). At the same time, Theorem 3.4 suggests that the approximation error grows as log(n), which is slower than what we observe for the studied range of matrix sizes n.

Random uniform matrices
The second class are n × n random matrices {U_n} with independent entries distributed uniformly in the interval (−1, 1). Their maximum and spectral norms with high probability satisfy ∥U_n∥_max ≈ 1 and ∥U_n∥_2 ≈ 2√(n/3) [20, Section 2.3]. This means that the upper bound of Theorem 3.4 for random uniform matrices is about √n times larger than for identity matrices. We carry out two series of experiments similar to those with the identity matrices, repeating the computations ten times with different instances of the initial approximation and plotting the median together with the 10th and 25th percentiles (see Fig. 2). For a fixed matrix size n, our results show that the errors decrease approximately as r^{−1}(n − r) when the approximation rank r approaches n; this rate is faster than for identity matrices by a factor of √(n − r). In addition, we do not see the r^{−1}-decay regime for small values of r, and the corresponding errors are close to one.
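The two norm estimates are easy to check directly. In this sketch of ours, the spectral-norm approximation 2√(n/3) is our gloss of [20]: it is the Bai-Yin limit 2σ√n for i.i.d. entries with variance σ² = 1/3:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1000
U = rng.uniform(-1.0, 1.0, size=(n, n))

max_norm = np.abs(U).max()         # tends to 1: the maximum of n^2 i.i.d. uniforms
spec_norm = np.linalg.norm(U, 2)   # ~ 2 * sigma * sqrt(n) with sigma^2 = 1/3
```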

Discussion
We can formulate three implications of the results of our numerical experiments. First, they illustrate how the distance d_r(X) depends on rank(X) and confirm the corresponding scaling in Theorem 3.4. The proof of this dependency is a novel contribution of our article, as neither Theorem 1.1 nor Theorem 1.2 describes it. Second, the numerical results serve as evidence that, for general matrices X, the distance d_r(X) depends on the spectral norm ∥X∥_2 (as suggested by Theorems 1.2 and 3.4) rather than on the maximum norm ∥X∥_max (as in the symmetric positive semidefinite case of Theorem 1.1). Therefore, it is likely that the need to replace ∥X∥_max with ∥X∥_2 in the process of generalizing Theorem 1.1 to non-symmetric matrices is not a technical limitation of the proof.
Third, our results reveal a phase transition. When the approximation rank r and the actual rank(X) are close, the distance d_r(X) exhibits power-law scaling with respect to the parameters. When r is small compared to rank(X), however, the rate of change switches to the polylogarithmic regime, until d_r(X) eventually approaches the ultimate upper bound (1).
It would be interesting to see our numerical experiments repeated, and extended, with different algorithms, especially those with stronger optimality guarantees.

Figure 1: Low-rank approximation errors in the maximum norm obtained with the method of alternating projections for the class of identity matrices: (left) fixed matrix size n, (right) fixed approximation rank r.

Figure 2: Low-rank approximation errors in the maximum norm obtained with the method of alternating projections for the class of random uniform matrices: (left) fixed matrix size n, (right) fixed approximation rank r.

Figure 4: Properties of 2000 × 2000 matrices formed as products of 2000 × k random matrices with orthonormal columns and normalized to unit maximum norm: (left) spectral norm as a function of rank k; (right) low-rank approximation errors in the maximum norm obtained with the method of alternating projections for fixed approximation rank r.