On the convex geometry of blind deconvolution and matrix completion

Low-rank matrix recovery from structured measurements has been a topic of intense study in the last decade and many important problems like matrix completion and blind deconvolution have been formulated in this framework. An important benchmark method to solve these problems is to minimize the nuclear norm, a convex proxy for the rank. A common approach to establish recovery guarantees for this convex program relies on the construction of a so-called approximate dual certificate. However, this approach provides only limited insight in various respects. Most prominently, the noise bounds exhibit seemingly suboptimal dimension factors. In this paper we take a novel, more geometric viewpoint to analyze both the matrix completion and the blind deconvolution scenario. We find that for both these applications the dimension factors in the noise bounds are not an artifact of the proof, but the problems are intrinsically badly conditioned. We show, however, that bad conditioning only arises for very small noise levels: Under mild assumptions that include many realistic noise levels we derive near-optimal error estimates for blind deconvolution under adversarial noise.


Introduction
A number of recent works have explored the observation that various ill-posed inverse problems in signal processing, imaging, and machine learning can be naturally formulated as the task of recovering a low-rank matrix X_0 ∈ C^{n_1×n_2} from an underdetermined system of structured linear measurements

y = A(X_0) + e,

where A : C^{n_1×n_2} → C^m is a linear map and e ∈ C^m, ‖e‖ ≤ τ, represents additive noise. Such problems include, for example, matrix completion [8], phase retrieval [9], blind deconvolution [1], robust PCA [6], and demixing [49]. In this paper, we aim to analyze the worst case scenario, that is, we do not make any assumptions on the noise except for the bound on its Euclidean norm (this scenario is sometimes referred to as adversarial noise, as it allows for noise specifically designed to be most harmful in a given situation). A natural first approach to recover X_0 that remains an important benchmark is to solve the semidefinite program

minimize ‖X‖_*  subject to  ‖A(X) − y‖ ≤ τ,

where ‖·‖_* denotes the nuclear norm, i.e., the sum of the singular values. Recovery guarantees have been shown under the assumption that the measurement operator A possesses a certain degree of randomness. To establish such guarantees, various proof strategies have been proposed, including approaches via the restricted isometry property [55,47], descent cone analysis [13], and so-called approximate dual certificates [26,25]. While the latter approach remains state of the art for many structured problems, including the highly relevant problems of randomized blind deconvolution and matrix completion, it has some apparent disadvantages. Most prominently, the resulting recovery guarantees take the form

‖X̂ − X_0‖_F ≲ √d · τ,   (1)

where X̂ denotes a minimizer of the semidefinite program above, ‖·‖_F denotes the Frobenius norm, and √d is a factor that grows polynomially in the ambient dimension, whereas under comparable normalization the first two approaches, when applicable, give rise to superior recovery guarantees of the form ‖X̂ − X_0‖_F ≲ τ.
Before this paper it was open whether the additional dimension scaling factor in (1) is a proof artifact. Similarly, for randomized blind deconvolution one of the coherence terms appearing in the result was believed to arise only from the proof technique (cf. [46,Remark 2]).
Another drawback of proceeding via an approximate dual certificate is that it gives only limited insight into geometric properties of the problems such as the null-space property [17], which is also an important ingredient for the study of some more efficient non-convex algorithms [21,42].
Approaches via descent cone analysis [13], in contrast, provide much more geometric insight. The underlying idea of such approaches is to study the minimum conic singular value defined by

λ_min(A, K) := inf{ ‖A(Z)‖ : Z ∈ K, ‖Z‖_F = 1 }

for K the descent cone of the underlying atomic norm, the nuclear norm in the case of low-rank matrix recovery. For a more detailed review of this approach, including a precise definition of the descent cone, we refer to Section 2.3 below. Through the study of the minimum conic singular value, many superior results were obtained for low-rank recovery problems, most importantly in the context of phase retrieval [41,40]. Furthermore, minimum conic singular values can also help to understand certain nonlinear measurement models [52].
For all these reasons, it would be desirable to apply this approach also to matrix completion and blind deconvolution. A challenge that one faces, however, is that for both problems one cannot hope to recover all low-rank matrices; rather, only matrices that satisfy certain coherence constraints are admissible (cf. the discussion in [60, Section 5.4]). In this article we address this challenge, providing the first geometric analysis of these problems. We find that the dimensional factors appearing in the error bounds reflect the true scaling of the minimum conic singular value and hence relate intrinsically to the underlying geometry. Nevertheless, for blind deconvolution near-optimal recovery is possible if the noise level is not too small.

Organization of the paper and our contribution
In Section 2 we will review blind deconvolution, matrix completion, as well as some techniques related to descent cone analysis. In Section 3 we will present the main results of this paper. Theorems 3.1 and 3.5 establish that for both blind deconvolution and matrix completion, nuclear norm minimization is intrinsically ill-conditioned. In contrast, Theorem 3.7 provides a near-optimal error bound for blind deconvolution when the noise level is not too small, implying that the conditioning problems only take effect for very small noise levels. The upper bounds for the minimum conic singular value which are the main ingredients of Theorems 3.1 and 3.5 are derived in Section 4. In Section 5 we prove the stability results for blind deconvolution. We believe that not only our results, but also the proof techniques and geometric insights in this manuscript will be of general interest and help to obtain further understanding of low-rank matrix recovery models, in particular under coherence constraints. We discuss interesting directions for future research in Section 6.

Background and related work

2.1. Blind deconvolution
Blind deconvolution problems arise in a number of different areas in science and engineering, such as astronomy, imaging, and communications. The goal is to recover both an unknown signal and an unknown kernel from their convolution. In this paper we work with the circular convolution, which is defined by

(w ⊛ x)_k := Σ_{j∈[L]} w_j x_{k−j},

where the index difference k − j is considered modulo L. Without further assumptions on w and x this bilinear map is far from injective. Consequently, it is crucial to impose structural constraints on both w and x. Arguably, the simplest such model is given by linear constraints, that is, both w and x are constrained to known subspaces. Such a model is reasonable in many applications. In wireless communication, for example, it makes sense to assume that the channel behaviour is dominated by the most direct paths, and for the signal x a subspace model can be enforced by embedding the message via a suitable coding map into a higher-dimensional space before transmission.
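As a concrete reference point, the circular convolution above can be written out directly; a minimal numpy sketch (the function name is ours):

```python
import numpy as np

def circ_conv(w, x):
    """Circular convolution (w ⊛ x)_k = sum_j w_j * x_{(k - j) mod L}."""
    L = len(w)
    return np.array([sum(w[j] * x[(k - j) % L] for j in range(L)) for k in range(L)])
```

The bilinearity of the map is exactly the source of the scaling ambiguity mentioned later: (cw) ⊛ (x/c) produces the same output for any c ≠ 0.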
The first rigorous recovery guarantees for such a model were derived by Ahmed, Recht, and Romberg [1]. More precisely, they assume that w = Bh, where B ∈ C^{L×K} is a fixed, deterministic matrix such that B^*B = Id_K (i.e., B is an isometry), and they model x = C m̄_0, where m̄_0 denotes the complex conjugate of m_0 ∈ C^N. Here, the matrix C ∈ C^{L×N} is a random matrix whose entries are independent and identically distributed with circularly symmetric normal distribution CN(0, 1/L). In this paper we also adopt this model.
Using the well-known fact that the Fourier transform diagonalizes the circular convolution, one can rewrite

F(w ⊛ x) = (F B h_0) ⊙ (√L F C m̄_0),

where F ∈ C^{L×L} denotes the normalized, unitary discrete Fourier matrix and ⊙ denotes the entrywise product. Denoting by b_ℓ the ℓth row of the matrix F B and by c_ℓ the ℓth row of the matrix √L F C, one observes that the ℓth entry of F(w ⊛ x) depends on h_0 and m_0 only through the rank-one matrix h_0 m_0^*. Furthermore, because of the rotation invariance of the circularly symmetric normal distribution, all the entries of the vectors {c_ℓ}_{ℓ=1}^L are (jointly) independent and identically distributed with distribution CN(0, 1). Noting that the expression ⟨h_0 m_0^*, b_ℓ c_ℓ^*⟩_F is linear in h_0 m_0^*, Ahmed, Recht, and Romberg [1] defined the operator

A : C^{K×N} → C^L,  (A(X))_ℓ := ⟨X, b_ℓ c_ℓ^*⟩_F,   (2)

obtaining the measurement model

y = A(h_0 m_0^*) + e,

where e ∈ C^L is additive noise and X_0 = h_0 m_0^*. The goal is then to determine h_0 and m_0 from y ∈ C^L up to the inherent scaling ambiguity or, equivalently, to find the rank-one matrix X_0. For e = 0, among all solutions of the equation y = A(X_0), the matrix X_0 is the one with the smallest rank. For this reason, Ahmed, Recht, and Romberg [1] suggested minimizing a natural proxy for the rank, the nuclear norm ‖·‖_*, defined as the sum of the singular values of a matrix, and to recover X_0 via the semidefinite program

minimize ‖X‖_*  subject to  ‖A(X) − y‖ ≤ τ.   (3)
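The diagonalization step above can be checked numerically. The following sketch (with illustrative dimensions and a generic isometry B in place of a concrete choice) verifies that the normalized DFT turns the circular convolution of w = Bh_0 and x = C m̄_0 into entrywise products, i.e., into measurements depending on (h_0, m_0) only through rank-one structure:

```python
import numpy as np

rng = np.random.default_rng(1)
L, K, N = 32, 4, 5

# normalized (unitary) DFT matrix F
F = np.fft.fft(np.eye(L)) / np.sqrt(L)

# B: a generic isometry (B*B = Id_K); C: iid CN(0, 1/L) entries
B, _ = np.linalg.qr(rng.standard_normal((L, K)) + 1j * rng.standard_normal((L, K)))
C = (rng.standard_normal((L, N)) + 1j * rng.standard_normal((L, N))) / np.sqrt(2 * L)

h0 = rng.standard_normal(K) + 1j * rng.standard_normal(K)
m0 = rng.standard_normal(N) + 1j * rng.standard_normal(N)
w, x = B @ h0, C @ m0.conj()

# F diagonalizes circular convolution: F(w ⊛ x) = (F B h0) ⊙ (√L F C m̄0)
y = F @ np.fft.ifft(np.fft.fft(w) * np.fft.fft(x))
rank_one_form = (F @ B @ h0) * (np.sqrt(L) * (F @ C @ m0.conj()))
```

Each entry of `rank_one_form` is a product of a linear functional of h_0 and a linear functional of m̄_0, which is the rank-one measurement structure used throughout the paper.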
Here τ > 0 is an a priori bound for the noise level, that is, we assume that ‖e‖ ≤ τ. For this semidefinite program, they establish the following recovery guarantee.
Theorem 2.1 ([1]). Consider measurements of the form y = A(h_0 m_0^*) + e for h_0 ∈ C^K, m_0 ∈ C^N, e ∈ C^L, and A as defined in (2). Assume that ‖e‖ ≤ τ and

L / log³ L ≳ μ²_max K + μ²_{h_0} N.

Then with probability exceeding 1 − O(L^{−1}) every minimizer X̂ of the SDP (3) satisfies

‖X̂ − h_0 m_0^*‖_F ≲ √(K + N) τ.   (4)

Here μ²_max and μ²_{h_0} are coherence parameters, which are defined via

μ²_max := (L/K) max_{ℓ∈[L]} ‖b_ℓ‖²  and  μ²_{h_0} := (L / ‖h_0‖²) max_{ℓ∈[L]} |⟨b_ℓ, h_0⟩|².

The third coherence factor μ̃_{h_0} is a technical term corresponding to a partition that is constructed as a part of the proof of Theorem 2.1, which is based on the Golfing Scheme [25].
To put the impact of the coherence factors into perspective, observe that if all vectors b_ℓ have the same ℓ₂-norm, one obtains that μ_max = 1; this will be the case, for example, when B is a low-frequency Fourier matrix, as it appears in applications in wireless communication. The second coherence factor always satisfies 1 ≤ μ²_{h_0} ≤ K μ²_max. If μ_{h_0} is small, this indicates that the mass of F B h_0 is spread out rather evenly over its entries. Numerical simulations in [1] confirm that vectors h_0 corresponding to large μ_{h_0} show worse performance, indicating that this factor may be necessary. The last coherence factor μ̃_{h_0}, in contrast, will no longer appear in our result below, which is why we refrain from a detailed discussion and refer the interested reader to the construction in the proof of Theorem 2.1. For generic h_0 the parameters μ_{h_0} and μ̃_{h_0} are reasonably small. For example, if h_0 is chosen from the uniform distribution on the sphere, one can show that with high probability μ_{h_0} = O(√log L). For the noiseless case, i.e., τ = 0, Theorem 2.1 yields exact recovery, and the required sample complexity L ≳ (K + N) log³ L is optimal up to logarithmic factors, as the number of degrees of freedom is K + N − 1 (see [32] for an exact identifiability analysis based on algebraic geometry). However, if there is noise, the bound for the reconstruction error scales with √(K + N), in contrast to other measurement scenarios such as low-rank matrix recovery from Gaussian measurements (see, e.g., [13]).
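The coherence parameters are easy to compute explicitly for the low-frequency Fourier example mentioned above. In the sketch below (names and the concrete choice of B are ours for illustration), B embeds C^K into the first K coordinates, so that F B consists of the first K columns of the DFT and all rows b_ℓ have equal norm; we measure |⟨b_ℓ, h⟩| via the entries of F B h:

```python
import numpy as np

L, K = 64, 8
F = np.fft.fft(np.eye(L)) / np.sqrt(L)   # normalized DFT
B = np.eye(L, K)                         # embeds C^K into the first K coordinates
rows = F @ B                             # rows b_ℓ of F B; each has squared norm K/L

# mu_max^2 = (L/K) * max_ℓ ||b_ℓ||^2; equals 1 here since all rows have equal norm
mu2_max = (L / K) * np.max(np.sum(np.abs(rows) ** 2, axis=1))

def mu2_h(h):
    # mu_h^2 = L * max_ℓ |(F B h)_ℓ|^2 / ||h||^2
    return L * np.max(np.abs(rows @ h) ** 2) / np.linalg.norm(h) ** 2
```

For h = e_1 one gets the minimal value μ²_h = 1 (the mass of F B h is perfectly spread), and in general 1 ≤ μ²_h ≤ K μ²_max, matching the bounds stated above.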
Let us comment on some related work. The foundational paper [1] has triggered a number of follow-up works on the problem of randomized blind deconvolution. A first line of works extended the result to recovering signals from their superposition Σ_{i=1}^r w_i ⊛ x_i, a problem often referred to as blind demixing [46,30]. Another line of works investigated non-convex (gradient-descent based) algorithms [45,48,28], which have the advantage that they are computationally less expensive, as they operate in the natural parameter space. It has been shown that they require a near-optimal number of measurements for recovery. For such an algorithm, [45] derived near-optimal noise bounds for a Gaussian noise model. However, as in this paper we focus on the scenario of adversarial noise (instead of random noise), the resulting guarantees are not comparable to ours below.

Matrix completion
The matrix completion problem of reconstructing a low-rank matrix X_0 ∈ R^{n_1×n_2} (we assume w.l.o.g. that n_1 ≥ n_2) from only a part of its entries arises in many different applications, such as collaborative filtering [56] and multiclass learning [3]. For this reason, this problem has seen a flurry of work in the last decade, and we will only be able to give a very selective overview of this topic. The precise sampling model that we consider is that m entries of X_0 are sampled uniformly at random with replacement. Denoting by e_i the standard coordinate vectors in R^{n_1} and R^{n_2}, respectively, the corresponding measurement operator A : R^{n_1×n_2} → R^m can be written as

(A(X))_i := √(n_1 n_2 / m) ⟨X, e_{a_i} e_{b_i}^T⟩_F = √(n_1 n_2 / m) X_{a_i, b_i},   (5)

where (a_i, b_i) ∈ [n_1] × [n_2] is chosen uniformly at random for each i ∈ [m] (and independently from all other measurements). The scaling factor √(n_1 n_2 / m) in the definition of the measurement operator A is chosen to ensure that E‖A(X)‖² = ‖X‖²_F. (Some other papers on matrix completion choose a different scaling. We have chosen this normalization because in this way the results for the matrix completion problem can be better compared to those for the blind deconvolution scenario.) Alternative sampling models analyzed in other works include sampling a subset Ω uniformly from [n_1] × [n_2] (i.e., without replacement, see, e.g., [10]), or sampling using random selectors.
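A minimal sketch of this sampling operator and its normalization (variable names are ours). The normalization E‖A(X)‖² = ‖X‖²_F can be exercised deterministically by letting the sampling pattern enumerate every entry exactly once:

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2, m = 12, 9, 40

# (a_i, b_i) drawn uniformly at random with replacement
a, b = rng.integers(0, n1, m), rng.integers(0, n2, m)

def A(X, a, b):
    # (A(X))_i = sqrt(n1 * n2 / m) * X[a_i, b_i]
    return np.sqrt(n1 * n2 / len(a)) * X[a, b]
```

With this scaling, each squared measurement is an unbiased estimate of ‖X‖²_F / m, which is what makes the error bounds here directly comparable to the blind deconvolution setting.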
Again we aim to recover X_0 from noisy observations y = A(X_0) + e, with a noise vector e ∈ R^m that satisfies ‖e‖ ≤ τ, via the SDP

minimize ‖X‖_*  subject to  ‖A(X) − y‖ ≤ τ.   (6)

For matrix completion, this approach has first been studied in [8].
It is well known that, similarly to the blind deconvolution problem, some incoherence assumptions are necessary to allow for successful recovery. Indeed, suppose that X_0 = e_1 e_1^*. Then, if m ≪ n_1 n_2, with high probability it holds that A(X_0) = 0, and one cannot hope to recover X_0. To avoid such special cases, one needs to ensure that the mass of the Frobenius norm of X_0 is spread out over all entries rather evenly. If UΣV^T is the singular value decomposition of the rank-r matrix X_0 (with Σ ∈ R^{r×r}), then this property is captured by the following coherence parameters [25]:

μ(U) := (n_1 / r) max_{i∈[n_1]} ‖U^T e_i‖²  and  μ(V) := (n_2 / r) max_{i∈[n_2]} ‖V^T e_i‖².

For these coherence parameters, a series of works [8,10,25,54,14,19] led to the following recovery guarantee for the noiseless scenario.

Theorem 2.2 ([19]). Consider measurements of the form y = A(X_0), where X_0 ∈ R^{n_1×n_2} is a rank-r matrix with singular value decomposition X_0 = UΣV^T and A is given by (5). Assume that

m ≳ max{μ(U), μ(V)} r n_1 log² n_1.

Then with probability at least 1 − O(n_1^{−1}) the matrix X_0 is the unique minimizer of the SDP (6) with τ = 0.
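The coherence parameter μ(U) is straightforward to compute; a small sketch, assuming U has orthonormal columns (the function name is ours). The matrix X_0 = e_1 e_1^* from the example above corresponds to the maximal value μ(U) = n_1:

```python
import numpy as np

def coherence(U):
    """mu(U) = (n / r) * max_i ||U^T e_i||^2 for U in R^{n x r} with orthonormal columns."""
    n, r = U.shape
    return (n / r) * np.max(np.sum(U ** 2, axis=1))
```

Since the squared row norms of U sum to r and each is at most 1, the value always lies between 1 (mass perfectly spread) and n/r (mass concentrated on few rows).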
As for blind deconvolution, this result has been shown using an approximate dual certificate. In [7] this result has been generalized to the case of adversarial noise, showing that with high probability the minimizer X̂ of (6) satisfies

‖X̂ − X_0‖_F ≲ √(n_1 n_2 / m) · τ,   (7)

whenever m ≳ n_1 polylog(n_1). As in the blind deconvolution framework, this error bound differs from the case of full Gaussian measurements as discussed, for example, in [13], and also from oracle estimates [6, Section III.B], by a dimensional scaling factor, which will be addressed in this paper. Random noise models for matrix completion have also been studied in a number of works. In particular, we would like to mention [36,51], which derive near-optimal rates (both in sample size and estimation error) for matrix completion under subexponential noise, using a slightly different nuclear-norm penalized estimator than the one we consider, as long as the noise level is not too small. Similar bounds have also been obtained in [35] using an estimator that is closer to the one in this work.
Apart from convex methods, many nonconvex algorithms have also been proposed and analysed, for example a number of variants of gradient descent (see, e.g., [33,29,27,59,23,21,42,48]). Arguably the strongest result for matrix completion under adversarial noise has been shown in [33,34]. These works propose a non-convex algorithm based on Riemannian optimization and show that if the number of measurements is larger than r² n_1 polylog(n_1), the true matrix can be reconstructed up to an estimation error superior to the one in [7]. Namely, for κ denoting the condition number of the matrix X_0, they show that the output X̂ of their algorithm satisfies (in our notation) a bound of the form

‖X̂ − X_0‖_F ≲ C(r, κ) √m ‖e‖_∞,   (8)

where the factor C(r, κ) depends only on the rank r and the condition number κ, provided the noise level is below a certain, small threshold that scales with the smallest singular value of X_0. For error vectors e that are spread out evenly and matrices that are well conditioned, one has that √m ‖e‖_∞ ≈ ‖e‖₂, so this bound is superior to (7) in the sense that the scaling factors that appear only grow with the rank r and not with the dimension. It should be noted, though, that in contrast to nuclear norm minimization, the underlying algorithm requires precise knowledge of the true rank of the matrix to be recovered.
Just before completion of this manuscript, Chen et al. [16] bridged convex and nonconvex approaches, using nonconvex methods to analyze a convex recovery scheme. Their results provide near-optimal recovery guarantees for the matrix completion problem via nuclear norm minimization under a subgaussian random noise model for a much larger range of admissible noise levels than the aforementioned works. More precisely, the proof is based on the observation that in their scenario the minimizer of the convex problem is very close to an approximate critical point of a non-convex gradient-based method. This allows them to transfer existing stability results [48] for non-convex optimization to the convex problem. However, the required sample complexity scales suboptimally in the rank r of the matrix, and similarly to (8), the error bound depends on the condition number κ.

Descent cone analysis
In recent years a number of works have studied low-rank matrix recovery and compressed sensing via a descent cone analysis. This approach was pioneered for ℓ₁-norm minimization in [58] and for more general (atomic) norms in [13]. Here the descent cone of a norm at a point X_0 ∈ C^{K×N} is the set of all possible directions Z ∈ C^{K×N} in which the norm does not increase. For the nuclear norm, this leads to the following definition.

Definition 2.3. For any matrix X_0 ∈ C^{K×N} define its descent cone K_*(X_0) by

K_*(X_0) := {Z ∈ C^{K×N} : there is t > 0 such that ‖X_0 + tZ‖_* ≤ ‖X_0‖_*}.

To understand its relevance for recovery guarantees, assume for a moment that we are in the noiseless scenario, i.e., τ = 0 and e = 0. Then the matrix X_0 ∈ C^{K×N} is the unique minimizer of the semidefinite program (3) if and only if the null space of A does not intersect the descent cone K_*(X_0). In the case of noise, the constraint ‖y − A(X)‖ ≤ τ in the SDPs (3) and (6) defines a region around X_0 + ker A, i.e., around the affine subspace consistent with the observed measurements in the noiseless scenario. The intersection of this region with the set of all signals that have a smaller nuclear norm than the ground truth X_0 is the set of feasible solutions that are preferred to X_0. The following quantity for a matrix X_0, which is often referred to as the minimum conic singular value, quantifies the size of this intersection:

λ_min(A, K_*(X_0)) := inf{ ‖A(Z)‖ : Z ∈ K_*(X_0), ‖Z‖_F = 1 }.

If λ_min(A, K_*(X_0)) becomes larger, this intersection becomes smaller, which translates into stronger recovery guarantees. The following theorem confirms this intuition.

Theorem 2.4 ([13, Proposition 2.2]). Let A : C^{n_1×n_2} → C^m be a linear operator and assume that y = A(X_0) + e with ‖e‖ ≤ τ. Then any minimizer X̂ of the SDP (3) satisfies

‖X̂ − X_0‖_F ≤ 2τ / λ_min(A, K_*(X_0)).
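The mechanism behind Theorem 2.4 is a short argument; a sketch in the notation above:

```latex
% Both X_0 and any minimizer \hat X are feasible for (3), so by the triangle inequality
\|\mathcal{A}(\hat X - X_0)\|
  \le \|\mathcal{A}(\hat X) - y\| + \|y - \mathcal{A}(X_0)\| \le 2\tau.
% Optimality gives \|\hat X\|_* \le \|X_0\|_*, hence \hat X - X_0 \in \mathcal{K}_*(X_0),
% and the definition of the minimum conic singular value yields
\lambda_{\min}\bigl(\mathcal{A}, \mathcal{K}_*(X_0)\bigr)\,\|\hat X - X_0\|_F
  \le \|\mathcal{A}(\hat X - X_0)\| \le 2\tau.
```

In particular, any upper bound on λ_min translates into a limit on the error guarantees obtainable this way, which is exactly how the instability results of Section 3 will be derived.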
When the measurement matrices of the operator A are full Gaussian matrices (in contrast to the rank-one measurements considered in this paper) and A is normalized such that E[A^*A] = Id, then for an arbitrary low-rank matrix X_0 one has with high probability that λ_min(A, K_*(X_0)) ≍ 1. Consequently, Theorem 2.4 yields an optimal estimation error even for adversarial noise. As we will show, this is no longer the case for blind deconvolution and matrix completion.
The geometric analysis of linear inverse problems via the descent cone and the minimum conic singular value has led to many new results and insights in compressed sensing and low-rank matrix recovery. For convex programs the phase transition of the success rate could be precisely predicted [2]. As the corresponding proofs are specific to full Gaussian measurements, they do not apply to a number of important structured and heavy-tailed measurement scenarios. Stronger results [43,20,40,31,41] were subsequently obtained using a powerful tool for bounding a nonnegative empirical process from below, now often referred to as Mendelson's small ball method [37,50].

Notation
For n ∈ N we will write [n] to denote the set {1, . . . , n}. For any set A we will denote its cardinality by |A|. For a complex number z we will denote its real part by Re(z) and its imaginary part by Im(z). By log(·) we will denote the logarithm to the base e. By E X we will denote the expectation of a random variable X, and by P(A) we denote the probability of an event A. If v ∈ C^n we will denote its ℓ₂-norm by ‖v‖ and its Hermitian transpose by v^*. For u, v ∈ C^n the (Euclidean) inner product is defined by ⟨u, v⟩ := u^* v. Furthermore, for Z ∈ C^{n_1×n_2} its spectral norm is given by ‖Z‖, i.e., the dual norm of the nuclear norm ‖Z‖_*. Moreover, the Frobenius norm of Z is denoted by ‖Z‖_F, with corresponding inner product ⟨Z, W⟩_F := Tr(Z^*W), where W ∈ C^{n_1×n_2}. When we study matrix completion, we will work with matrices Z ∈ R^{n_1×n_2}, and the previous quantities will be defined analogously. Moreover, in that scenario we will use the notation ‖Z‖_{ℓ∞} := max_{i,j} |Z_{i,j}|.

Our results
3.1. Instability of low-rank matrix recovery

Blind deconvolution
Our first main result states that randomized blind deconvolution can be unstable under adversarial noise.
Theorem 3.1. Suppose that L, K, and N satisfy the assumptions quantified by the constants C_1, C_2, and C_3 below. Then there exists a matrix B ∈ C^{L×K} satisfying B^*B = Id_K and with F B having rows of equal norm, i.e., μ²_max = 1, such that for all h_0 ∈ C^K \ {0} and m_0 ∈ C^N \ {0} the following holds: with high probability there is τ_0 > 0 such that for all τ ≤ τ_0 there exists an adversarial noise vector e ∈ C^L with ‖e‖ ≤ τ that admits an alternative solution X̃ with the following properties.
• X̃ is preferred to the true solution by the SDP (3), i.e., ‖X̃‖_* ≤ ‖X_0‖_*, but
• X̃ is far from the true solution in Frobenius norm, i.e., ‖X̃ − h_0 m_0^*‖_F exceeds τ by a factor that grows polynomially in the dimension.
The constants C_1, C_2, and C_3 are universal.
Remark 3.2. The matrix B in the above result exactly fits into the framework of Theorem 2.1. Indeed, one can check that its assumptions are satisfied in many interesting cases. That is, the assumptions of Theorem 2.1 cannot be enough to deduce stability. We do not expect, however, that this kind of instability is observed for arbitrary isometric embeddings B ∈ C^{L×K}. For example, let B be a random embedding chosen from the uniform distribution over the Stiefel manifold V_{L,K}, i.e., the manifold consisting of all matrices B̃ ∈ C^{L×K} such that B̃^*B̃ = Id_K ∈ C^{K×K}. In this case, we expect that a proof similar to those in [53,40] applies and that the multiplicative dimensional factor does not appear in the error bound with high probability if one randomizes over B and C simultaneously. In particular, this implies the existence of an isometric embedding B such that a result analogous to Theorem 3.1 cannot hold. An interesting open problem is whether the statement of Theorem 3.1 still holds if F B is a low-frequency discrete Fourier matrix, which is a common assumption in blind deconvolution.
The corresponding b_ℓ's should lead to better conditioning than in our counterexample, but worse than in the case of a random B, as a number of b_ℓ's exhibit substantial correlation while many are uncorrelated. In that sense, this scenario lies in between the scenario of random B's sketched above and the adversarial scenario in Theorem 3.1.
To put our results in perspective, note that for L ≍ (K + N) polylog(K + N), which is the minimal number of measurements required for noiseless recovery, the minimum conic singular value is at most of order 1/√(K + N) up to logarithmic factors. Up to logarithmic factors, this coincides with the rate predicted by (4) whenever K ≍ N. Theorem 3.1 is a direct consequence of the following proposition, which we think is interesting in its own right.
Proposition 3.3. There exists B ∈ C^{L×K} satisfying B^*B = Id_K and μ²_max = 1 whose corresponding measurement operator A satisfies, with high probability, an upper bound on the minimum conic singular value λ_min(A, K_*(h_0 m_0^*)) that decays polynomially in the dimension. Here C_1, C_2, and C_3 are absolute constants.
The proof of Proposition 3.3 will be provided in Section 4. Note that, by the definition of the minimum conic singular value λ_min(A, K_*(h_0 m_0^*)), Proposition 3.3 is equivalent to the statement that with high probability there is Z ∈ K_*(h_0 m_0^*) for which ‖A(Z)‖ is small compared to ‖Z‖_F, as quantified in (11). Our construction of such a Z ∈ K_*(h_0 m_0^*) relies on the observation that with high probability there is a rank-one matrix W ∈ C^{K×N} in the null space of A which is relatively close to the descent cone (with respect to the ‖·‖_F-distance). Perturbing W by −βh_0 m_0^* for a suitable β, one can then obtain a matrix Z ∈ K_*(h_0 m_0^*) which fulfills (11). The existence of such a matrix W ∈ ker A also reveals a fact about the geometry of the problem which we find somewhat surprising: while the null space of A does not intersect the descent cone (otherwise exact recovery would not be possible), the angle between these objects is very small. This is very different from the behavior for measurement matrices A with i.i.d. Gaussian entries (instead of b_ℓ c_ℓ^*).

Remark 3.4. While X̃ is preferred to the true solution by the SDP (3), X̃ is typically not a minimizer of (3). To see this, assume that without noise exact recovery is possible, which is the case with high probability by Theorem 2.1; then, for small noise levels, the minimizer of (3) will typically be closer to the true solution than X̃ (see also Section 6). Even if the minimizer X̂ of (3) is closer to the ground truth (in ‖·‖_F-distance) than X̃, however, the nuclear norms of X̂ and X̃ will be very close, which can easily lead to numerical instabilities.

Matrix completion
Our second main result states that for arbitrary incoherent low-rank matrices, matrix completion is unstable with high probability. Note that, in contrast to Theorem 3.1, which is based on a specific choice of parameters, the following result holds for an arbitrary incoherent matrix X_0.
Theorem 3.5. Let n_1 ≥ n_2 and let A : R^{n_1×n_2} → R^m be defined as in (5). Assume that X_0 ∈ R^{n_1×n_2} \ {0} is a rank-r matrix with singular value decomposition X_0 = UΣV^* satisfying suitable coherence assumptions. Then, with high probability, there is τ_0 > 0 such that for all τ ≤ τ_0 there exists an adversarial noise vector e ∈ R^m with ‖e‖ ≤ τ for which the noisy measurement vector y = A(X_0) + e admits an alternative solution X̃ ∈ R^{n_1×n_2} with the following properties.
• X̃ is preferred to X_0 by the SDP (6), i.e., ‖X̃‖_* ≤ ‖X_0‖_*, but
• X̃ is far from the true solution in Frobenius norm, i.e., ‖X̃ − X_0‖_F exceeds τ by a factor that grows polynomially in the dimension.
Here the constants C_1, C_2, and C_3 are universal.
Again, to put our results in perspective, note that for m ≍ n_1 polylog(n_1), which is the minimal number of measurements required for noiseless recovery, the minimum conic singular value is at most of order 1/√n_2 up to logarithmic factors. Up to logarithmic factors, this coincides with the rate predicted by (7). Theorem 3.5 is a direct consequence of the following proposition, which, in our opinion, is of independent interest, as it provides a negative answer to a question by Tropp [60, Section 5.4].
Proposition 3.6. Let X_0 ∈ R^{n_1×n_2} \ {0} be a rank-r matrix with corresponding singular value decomposition X_0 = UΣV^*, and assume that X_0 satisfies suitable coherence assumptions. Then, with high probability, the minimum conic singular value λ_min(A, K_*(X_0)) admits an upper bound that decays polynomially in the dimension. The constants C_1, C_2, and C_3 are universal.
Proposition 3.6 corresponds to Proposition 3.3 for blind deconvolution and will be proved analogously. We will again show that with high probability there is W ∈ R^{n_1×n_2} such that A(W) = 0 and W is relatively close to the descent cone of X_0 in ‖·‖_F-distance. Setting Z := W − βUV^* for a suitable β > 0 then yields an element of K_*(X_0) for which ‖A(Z)‖ is small compared to ‖Z‖_F.

Stable recovery
A geometric interpretation of Theorems 3.1 and 3.5 is that the nuclear norm ball is near-tangential to the kernels of the measurement operators of both matrix completion and randomized blind deconvolution. Given that tangent spaces only provide a local approximation, these results leave open what happens at some distance, i.e., for larger noise levels; this will depend on the curvature of the nuclear norm ball. Our third main result concerns exactly this problem for the randomized blind deconvolution setup. As it turns out, the descent directions Z ∈ K_*(h_0 m_0^*) with ‖A(Z)‖ / ‖Z‖_F very small correspond to directions of significant curvature. That is, only a very short segment in such a direction will have smaller nuclear norm than h_0 m_0^*, and the corresponding alternative solutions all correspond to very small e. For noise levels τ large enough, in contrast, these directions can be excluded, and one can obtain near-optimal error bounds.
Figure 1: Geometric illustration of our approach: Close to X_0 the descent set (indicated by the red line) is near-tangential to the kernel of the measurement operator A (the affine space X_0 + ker A), so the descent cone (light blue) is rather wide. By restricting to noise levels above a certain threshold we only need to cover the descent set at some distance, which is achieved by a much smaller cone (green). Note that below the noise level (orange strip), the green cone does not contain the full descent set.
In order to precisely formulate this observation, we denote the set of μ-incoherent vectors h ∈ C^K with respect to B ∈ C^{L×K} for μ ≥ 1 by

{h ∈ C^K : √L max_{ℓ∈[L]} |⟨b_ℓ, h⟩| ≤ μ ‖h‖}.

With this notation, our result reads as follows.
Here C 1 , C 2 , and C 3 are absolute constants.
In words, this theorem establishes linear scaling in the noise level τ with only a logarithmic dimensional factor for τ ≥ α ‖h_0 m_0^*‖_F, in contrast to the polynomial factor required for small noise levels as a consequence of Theorem 3.1. Here the value of α can be chosen arbitrarily small, at the expense of an increased number of measurements. For example, when one is interested in noise levels τ = ε μ^{−2} log^{−2} L for some ε > ε_0 (this is the largest order at which to expect meaningful error bounds despite the additional logarithmic factors), one should choose α ≍ ε_0 μ^{−2} log^{−2} L, and near-linear error bounds will be guaranteed for a correspondingly increased sample complexity.

Upper bounds for the minimum conic singular value

The goal of this section is to prove Proposition 3.3 and Proposition 3.6, from which we will then be able to deduce Theorem 3.1 and Theorem 3.5. For that we first discuss a characterization of the descent cone K_*(X). In order to state this characterization, Lemma 4.1, we need to introduce some additional notation. Let X ∈ C^{n_1×n_2} be a matrix of rank r. We will denote its corresponding singular value decomposition by X = UΣV^*, where Σ ∈ R^{r×r} is a diagonal matrix with nonnegative entries and U ∈ C^{n_1×r} and V ∈ C^{n_2×r} have orthonormal columns, i.e., U^*U = V^*V = Id_r. This allows us to define the tangent space of the manifold of rank-r matrices at the point X by

T_X := {UM^* + NV^* : M ∈ C^{n_2×r}, N ∈ C^{n_1×r}}.

By P_{T_X} we will denote the orthogonal projection onto T_X, and by P_{T_X^⊥} = Id − P_{T_X} the projection onto its orthogonal complement.
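The tangent space T_X and its projection admit a simple closed form, which can be sanity-checked numerically. A sketch in the real-valued case with illustrative dimensions (the formula for P_{T_X} used here is the standard one):

```python
import numpy as np

rng = np.random.default_rng(4)
n1, n2, r = 7, 5, 2

X = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))  # rank-r matrix
U, _, Vh = np.linalg.svd(X)
U, V = U[:, :r], Vh[:r, :].T

def P_T(Z):
    """Orthogonal projection onto T_X: U U^T Z + Z V V^T - U U^T Z V V^T."""
    return U @ U.T @ Z + Z @ V @ V.T - U @ U.T @ Z @ V @ V.T
```

The checks below confirm that P_T is idempotent, that X lies in its own tangent space, and that P_T(Z) is orthogonal to its residual, as required of an orthogonal projection.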
Lemma 4.1. Let X ∈ C^{n_1×n_2} \ {0} be a matrix of rank r with corresponding singular value decomposition X = UΣV^*. Then

cl(K_*(X)) = {Z ∈ C^{n_1×n_2} : ‖P_{T_X^⊥}(Z)‖_* ≤ −Re(⟨UV^*, Z⟩_F)},

where cl(K_*(X)) denotes the topological closure of K_*(X).

The proof of Lemma 4.1 relies on the duality between the descent cone and the subdifferential of a convex function. In the following we will denote by ∂‖·‖_*(X) the subdifferential of the nuclear norm at the point X ∈ C^{n_1×n_2}. We will use the well-known characterization of ∂‖·‖_* [64]: for all X ∈ C^{n_1×n_2} with corresponding singular value decomposition X = UΣV^* it holds that

∂‖·‖_*(X) = {UV^* + W : W ∈ T_X^⊥, ‖W‖ ≤ 1}.   (15)

Proof. Recall that for a set of matrices V ⊂ C^{n_1×n_2} its polar cone V° is defined by

V° := {Z ∈ C^{n_1×n_2} : Re(⟨Z, W⟩_F) ≤ 0 for all W ∈ V}.

For all X ∈ C^{n_1×n_2} \ {0} we have the following polarity relation between the descent cone and the subdifferential:

(K_*(X))° = cl({λW : λ ≥ 0, W ∈ ∂‖·‖_*(X)}).

For sets and functions defined in R^n with the usual Euclidean inner product, this is [57, Theorem 23.7]. The complex case follows directly, as C^{n_1×n_2} with the inner product Re(⟨·,·⟩_F) can be identified with a 2n_1n_2-dimensional real vector space with the standard Euclidean inner product.
It follows from the bipolar theorem (see, e.g., [5, p. 53]) that

cl(K_*(X)) = ((K_*(X))°)° = (∂‖·‖_*(X))°.

Hence, in order to complete the proof it is sufficient to show that

(∂‖·‖_*(X))° = {Z ∈ C^{n_1×n_2} : ‖P_{T_X^⊥}(Z)‖_* ≤ −Re(⟨UV^*, Z⟩_F)}.   (16)

First, let Z be such that ‖P_{T_X^⊥}(Z)‖_* ≤ −Re(⟨UV^*, Z⟩_F), and let UV^* + W ∈ ∂‖·‖_*(X) with W ∈ T_X^⊥, ‖W‖ ≤ 1, be arbitrary. Then

Re(⟨UV^* + W, Z⟩_F) = Re(⟨UV^*, Z⟩_F) + Re(⟨W, P_{T_X^⊥}(Z)⟩_F) ≤ Re(⟨UV^*, Z⟩_F) + ‖P_{T_X^⊥}(Z)‖_* ≤ 0.

In the first inequality we have used that the spectral norm is the dual norm of the nuclear norm together with ‖W‖ ≤ 1; the second inequality follows from the assumption on Z. Hence, we have shown that Z ∈ (∂‖·‖_*(X))°. Next, let Z ∈ (∂‖·‖_*(X))° be arbitrary. Choose W̃ ∈ T_X^⊥ such that Re(⟨W̃, Z⟩_F) = ‖P_{T_X^⊥}(Z)‖_* and ‖W̃‖ ≤ 1. Then by (15) it follows that UV^* + W̃ ∈ ∂‖·‖_*(X), and as Z ∈ (∂‖·‖_*(X))° we obtain that

0 ≥ Re(⟨UV^* + W̃, Z⟩_F) = Re(⟨UV^*, Z⟩_F) + ‖P_{T_X^⊥}(Z)‖_*.

This shows that −Re(⟨UV^*, Z⟩_F) ≥ ‖P_{T_X^⊥}(Z)‖_*. Hence, we have verified (16), which completes the proof.
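The characterization (15) can be sanity-checked numerically: every W = UV^* + W_⊥ with W_⊥ ∈ T_X^⊥ and ‖W_⊥‖ ≤ 1 must satisfy the subgradient inequality ‖X + Z‖_* ≥ ‖X‖_* + ⟨W, Z⟩_F for all Z. A sketch in the real-valued case (dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n1, n2, r = 6, 5, 2

X = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))  # rank-r matrix
U, _, Vh = np.linalg.svd(X)
U, V = U[:, :r], Vh[:r, :].T

# a subgradient W = U V^T + W_perp with W_perp in T_X^perp, spectral norm 1, cf. (15)
G = rng.standard_normal((n1, n2))
W_perp = (np.eye(n1) - U @ U.T) @ G @ (np.eye(n2) - V @ V.T)
W = U @ V.T + W_perp / np.linalg.norm(W_perp, 2)

# nuclear norm as the sum of singular values
nuc = lambda M: np.sum(np.linalg.svd(M, compute_uv=False))
```

Running the subgradient inequality against many random directions Z (as in the check below) never produces a violation, consistent with (15).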

Upper bound for blind deconvolution
The goal of this section is to prove Proposition 3.3. For that we need the following lemma, which is a consequence of the concentration of measure theorem for Lipschitz functions.
(For a proof of the real-valued case see, e.g., [63, Lemma 5.3.2]. The complex case can be shown analogously.)

Lemma 4.3. Let P : C^n → C^n be a random projection onto a k-dimensional subspace that is uniformly distributed in the Grassmannian Gr(k, C^n). Fix z ∈ C^n. Then for all ε > 0, with probability at least 1 − 2e^{−c̃kε²} we have that

where c̃ > 0 is some absolute constant.
Proof of Proposition 3.3. A core ingredient of the proof is to find a tight frame B such that each of its frame vectors is orthogonal to all but a near-minimal number of the other frame vectors. For such a B we then choose a vector h among these frame vectors and use it to construct a matrix in the descent cone that is close to the kernel of the measurement map. For that, we exploit that, by the choice of h, any rank-one matrix hm^* leads to a large number of zero measurements due to the orthogonality. Consequently, there are also many pairs of vectors m_1 and m_2 such that hm_1^* and hm_2^* lead to the same measurements, including some choices for which hm_1^* − hm_2^* is not only in the kernel of the measurement map, but also close to the descent cone.
When K divides L, a suitable choice for B consists of L/K repetitions of a fixed orthonormal basis. When K does not divide L, one can still start off in the same way if one completes the matrix appropriately to obtain a unit-norm tight frame without introducing too many pairs of non-orthogonal frame vectors. One way to achieve this is to find a tight frame that consists only of very sparse vectors, as this will also lead to many vanishing inner products. A natural candidate is hence a so-called spectral tetris frame [11], as it has been shown to be maximally sparse [12]. Indeed, our construction uses exactly this frame.
The Spectral Tetris algorithm is based on the observation that the rows of a matrix G ∈ R^{L×K} form a tight frame if and only if the columns of G are orthogonal and of equal norm. It is easy to see that for unit-norm tight frames, the squared norm of each column must equal L/K. Spectral Tetris starts by greedily filling up the first column of G, choosing the first ⌊L/K⌋ rows of G to be the first standard basis vector e_1. The next two frame vectors are chosen of the form αe_1 ± √(1 − α²) e_2, where α is chosen to fulfil the norm constraint on the first column. Next, the second column is greedily filled, then the third, and so on. By construction, the resulting frame consists only of 1-sparse vectors and 2-sparse vectors supported on neighboring entries. Consequently, the sets

Moreover, note that it follows from the Spectral Tetris algorithm that

where in the last inequality we used that by assumption 2K ≤ L.
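The greedy procedure just described can be sketched in code. The following is an illustrative implementation (our own sketch, assuming 2K ≤ L as in the text; the floating-point tolerances are our choice): it fills each column to squared norm L/K using 1-sparse rows and ± pairs of 2-sparse rows.

```python
import numpy as np

def spectral_tetris(L, K):
    """Sketch of the Spectral Tetris construction of a unit-norm tight frame
    of L vectors in R^K (assumes 2K <= L, as in the text).

    The rows of the returned G in R^{L x K} are the frame vectors; the
    columns of G are orthogonal with squared norm L/K.
    """
    assert 2 * K <= L
    t = L / K                  # required squared norm of every column
    rows, col, r = [], 0, t    # r = weight still missing in the current column
    while len(rows) < L:
        if r >= 1 - 1e-12:     # place a 1-sparse row e_col
            e = np.zeros(K); e[col] = 1.0
            rows.append(e)
            r -= 1
            if r < 1e-12:      # column complete, move to the next one
                col, r = col + 1, t
        else:                  # place two 2-sparse rows a*e_col +- b*e_{col+1}
            a, b = np.sqrt(r / 2), np.sqrt(1 - r / 2)
            for sign in (+1.0, -1.0):
                e = np.zeros(K); e[col], e[col + 1] = a, sign * b
                rows.append(e)
            # the pair contributes r to column col and 2 - r to column col+1
            col, r = col + 1, t - (2 - r)
    return np.vstack(rows)

G = spectral_tetris(7, 3)
# Unit-norm rows and the tight-frame property G^T G = (L/K) Id:
assert np.allclose(np.sum(G**2, axis=1), 1.0)
assert np.allclose(G.T @ G, (7 / 3) * np.eye(3))
```

The ± sign in each 2-sparse pair is what makes consecutive columns orthogonal, and a² + b² = 1 keeps every frame vector unit-norm.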
Without loss of generality we assume that ‖h_0‖ = ‖m_0‖ = 1, as rescaling does not change the descent cone K_*(h_0m_0^*). For the proof we will condition on two events. The first event states that

where ĉ > 0 is some absolute constant. As the matrix C is Gaussian, the different subspaces D_i, and hence also the random vectors (m_i)_{i=1}^K, are independent, so with probability at least

there exists at least one k ∈ [⌊K/3⌋] such that (19) holds (with i = k). Also note that

which for C_1 = 24 log 2/ĉ follows from assumption (9). To summarize, we have shown that the two events happen with probability at least 1 − O(exp(−C_2 L/µ²)), where C_2 > 0 is an appropriately chosen constant.
Conditional on E_1 and E_2, we will construct Z ∈ C^{K×N} (depending on the realization of the random matrix C) such that Z lies in the closure of K_*(h_0m_0^*) with Z ≠ 0 and such that the inequality

is satisfied. Note that this will complete the proof. Indeed, by definition of the closure and the continuity of A this implies that there exists Z̃ ∈ K_*(h_0m_0^*) such that

which by the definition of λ_min(A, K_*(h_0m_0^*)) implies that (10) holds with constant C_3 = 12.
To construct Z satisfying (20), define

where i ∈ [⌊K/3⌋] is chosen to satisfy (19). It follows directly from the definition of W that ‖W‖_F = 1. We observe that A(W) = 0, as for each i ∈ [⌊K/3⌋] and ℓ ∈ [L] we either have ⟨e_i, b_ℓ⟩ = 0, if ℓ ∉ B_i, or ⟨m_i^⊥, c_ℓ⟩ = 0, if ℓ ∈ B_i. Denote by T = T_{X_0} the tangent space of the manifold of rank-one matrices at X_0 = h_0m_0^* as defined in (14), and by P_T and P_{T^⊥} the corresponding orthogonal projections. It follows that

Thus we have shown that W, an element of the null space of A, is close to the tangent space T. We will now show that for β = 3 the matrix

lies in the closure of the descent cone K_*(h_0m_0^*). For that, we observe that

and hence Lemma 4.1 entails that Z lies in the closure of K_*(h_0m_0^*). Moreover, note that by the triangle inequality and by the assumption L ≤ KN/36 it holds that

These observations together with A(W) = 0 yield that

This shows (20), as desired.

Upper bound for matrix completion
In this section, we prove Proposition 3.6. For that we introduce sets N_a, a ∈ [n_1], via

That is, N_a contains the indices of those entries in the a-th row of the matrix X_0 that are observed by the measurements. Furthermore, denote by P_{N_a} ∈ R^{n_2×n_2} the projection onto the coordinates contained in N_a, i.e., P_{N_a} = Σ_{b∈N_a} e_b e_b^*. By P_{N_a^⊥} = Id − P_{N_a} we denote the coordinate projection onto [n_2] \ N_a.
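These coordinate projections are simple 0/1 diagonal operators. A minimal sketch (our own illustration; the index set below is hypothetical):

```python
import numpy as np

# Sketch: for observed column indices N_a in row a, the projection
# P_{N_a} = sum_{b in N_a} e_b e_b^* keeps the observed coordinates
# and zeroes out the rest; P_{N_a^perp} = Id - P_{N_a}.
n2 = 6
N_a = [0, 2, 5]                      # hypothetical observed indices in row a
I = np.eye(n2)
P = sum(np.outer(I[b], I[b]) for b in N_a)
P_perp = I - P                       # projection onto [n2] \ N_a

x = np.arange(1.0, 7.0)
assert np.allclose(P @ x, [1, 0, 3, 0, 0, 6])
assert np.allclose(P @ x + P_perp @ x, x)
```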
We need the following technical lemma.
Then with probability at least

there exists a ∈ [n_1] such that ‖P_{N_a}V‖ ≤ 2m/(n_1n_2).
Here C_1 and C_2 are universal constants.
Proof. For each a ∈ [n_1] we set I_a := {i ∈ [m] : a_i = a} and define the event

We first derive a lower bound for P(E_a | I_a). For that we note that ‖P_{N_a}V‖² = ‖V^* P_{N_a} V‖. Let v_1, v_2, …, v_{n_2} denote the rows of the matrix V. By definition of N_a and I_a it follows that

for

Here we write A ⪯ B for two symmetric matrices A and B if and only if B − A is positive semidefinite. By (24) it is sufficient to bound the probability of the event

conditionally on I_a. To bound ‖Σ_{i∈I_a} X_i‖ we will use the matrix Bernstein inequality (see, e.g., [61, Theorem 6.1.1]) conditionally on I_a, which requires as ingredients

where the fourth line is due to the definition of µ²(V). This implies that

To find an appropriate K > 0 note that almost surely

where in the fourth line we used that max

Finally, to apply the Bernstein inequality we need that the X_i are independent conditionally on I_a, which follows from the fact that the a_i and b_i are drawn independently. With these ingredients the matrix Bernstein inequality yields that
Setting t = m/(2n_1n_2), this implies that for fixed a ∈ [n_1] it holds that

To complete the proof we restrict our attention to A := {a ∈ [n_1] : |I_a| ≤ 4m/(3n_1)}, as it follows from (25) that

and, consequently, for a ∈ A we obtain from (26) and (27) that

As the E_a only depend on {b_i}_{i∈I_a} and are hence independent conditionally on {I_a}, this implies that

where in the last line we have used assumption (23) with C_1 large enough. Furthermore, note that

implies that |A| ≥ n_1/4 almost surely. Hence, it follows that
This shows that, conditional on {I_a}_{a=1}^{n_1}, we have almost surely

Taking expectations yields the claim.

Now we are prepared to give a proof of Proposition 3.6.
Proof of Proposition 3.6. For the proof we will condition on two events E_1 and E_2, which we define in the following. The event E_1 is defined by

Observe that

Note that almost surely

where we have applied the Cauchy–Schwarz inequality and the definitions of µ(U) and µ(V). Hence, one has

where we used ‖UV^*‖_F² = r and the previous estimate. Hence, by the Bernstein inequality (see, e.g., [63, Theorem 2.8.4]) we obtain that
By setting t = ‖UV^*‖_F² = r we observe that E_1 holds with probability at least 1 − 2 exp(−cm/(rµ²(U)µ²(V))). The second event E_2 is defined by

For C_1 in assumption (12) chosen large enough, Lemma 4.4 then yields a lower bound on the probability of E_2. Consequently, on E_2 we can find a ∈ [n_1] (depending on the random sampling pattern) such that the condition defining E_2 is satisfied.
Note that in order to prove Proposition 3.6 it is enough to find a nonzero Z in the closure of K_*(X_0) such that

because by definition of the closure and the continuity of A this implies that there is a Z̃ ∈ K_*(X_0) with the analogous properties, which implies (13) with constant C_3 = 8. In the following we construct such a matrix Z. Let x ∈ R^r be a vector with ‖x‖ = 1. Then for a ∈ [n_1] as above we define the vector w_a ∈ R^{n_2} by w_a := P_{N_a^⊥} V x and set

It follows directly from the definition of N_a that A(W) = 0. In the following let T be the tangent space of the manifold of rank-r matrices at X_0 as defined in (14). Furthermore, denote by P_U = UU^* the orthogonal projection onto the column space of U and, analogously, by P_V = VV^* the orthogonal projection onto the column space of V. Then we obtain that

where in the second equality we have used that P_{T^⊥}M = P_{U^⊥} M P_{V^⊥} for all M ∈ R^{n_1×n_2}, and in the last line we used that ‖P_{U^⊥} e_i‖ ≤ 1. Plugging in w_a = P_{N_a^⊥} V x it follows that

where the last line is due to P_{V^⊥} V x = 0. The fact that ‖P_{V^⊥}‖ ≤ 1 then yields that

where for the last line we used that a ∈ [n_1] was chosen such that the condition in E_2 holds. This shows that W is relatively close to T. Based on W we now aim to find Z in the closure of K_*(X_0) of the form

where β > 0 will be chosen such that Z lies in this closure, which by Lemma 4.1 is equivalent to −⟨UV^*, Z⟩_F ≥ ‖P_{T^⊥}Z‖_*. First, we note that

due to P_{T^⊥}(UV^*) = 0 and the inequality chain (31). Furthermore, we have that

Hence, setting β = 2m/(r²n_1n_2) and combining (32) and (33), it follows that Z lies in the closure of K_*(X_0). This Z also satisfies (30). To see this we observe that

where in the last line we used the assumption that m ≤ n_1n_2/32. Furthermore, from A(W) = 0 it follows that

Combining the last two inequality chains implies (30), which completes the proof.

Proof of Theorem 3.1 and Theorem 3.5
As already mentioned, Theorem 3.1 can be deduced from Proposition 3.3, and Theorem 3.5 from Proposition 3.6. We only show how to prove Theorem 3.1, as the proof of Theorem 3.5 is analogous.
Proof of Theorem 3.1. By Proposition 3.3 and the definition of the minimum conic singular value λ_min(A, K_*(h_0m_0^*)), with probability at least 1 − O(exp(−K)) there exists Z such that

and such that X_t := h_0m_0^* + tZ obeys ‖X_t‖_* ≤ ‖h_0m_0^*‖_* for all 0 < t ≤ 1. Next, set e_t = (t/2) A(Z). Then for y_t = A(h_0m_0^*) + e_t we have that

Hence, by setting τ_0 := ‖A(Z)‖/2 we observe that ‖A(X_t) − y_t‖ = tτ_0. Furthermore, note that

Outline of the proof and main ideas
The goal of this section is to prove Theorem 3.7. We first give a proof sketch and present the key ideas. We have seen in Proposition 3.3 that for certain isometries B ∈ C^{L×K}, with high probability λ_min(A, K_*(h_0m_0^*)) is bounded by the small quantity appearing in (10). Hence, applying Theorem 2.4 cannot lead to very strong error estimates. However, if we closely inspect the proof of Proposition 3.3, we observe the following. Again denote by T the tangent space of the manifold of rank-one matrices at the point h_0m_0^* as defined in (14), and assume that ‖h_0‖ = ‖m_0‖ = 1. By construction we have that

As ‖Z‖_F ≍ 1, this implies that ⟨h_0m_0^*, Z⟩_F is quite small, meaning that Z and h_0m_0^* are almost orthogonal to each other. All descent directions Z with this property, however, have in common that the admissible descent step size is necessarily very small. Geometrically this corresponds to the fact that the nuclear norm ball is curved near X_0, which is why its near-tangential behaviour only holds locally. This will be made precise in Lemma 5.7 below. For this reason, the idea of the proof of Theorem 3.7 is to split the descent cone into two parts. The first part consists of all matrices aligned with h_0m_0^*. The second part consists of all remaining matrices, which are almost orthogonal to h_0m_0^*. As mentioned above, these matrices must necessarily be close to T. The first part is captured by the set E_{µ,δ} with

where δ > 0. In Section 5.2 we will show that with high probability it holds that

Hence, if for the minimizer X̂ of (3) the difference X̂ − h_0m_0^* is an element of the conic hull of E_{µ,δ}, we can proceed similarly as in [13] to obtain near-optimal error bounds. Let us briefly explain which property of E_{µ,δ} allows us to show (35). For any matrix W ∈ C^{K×N} we define its ‖·‖_{B_1}-norm by

In other words, ‖W‖_{B_1} is the ℓ_1-norm of the vector (‖W^* b_ℓ‖)_{ℓ=1}^L.
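The B_1-norm just defined is straightforward to compute. A minimal sketch (our own illustration, with the frame vectors b_ℓ stored as rows of a matrix B):

```python
import numpy as np

def b1_norm(W, B):
    """||W||_{B_1} = sum_l ||W^* b_l||_2, i.e., the l1-norm of the vector
    (||W^* b_l||)_{l=1}^L, for frame vectors b_l given as the rows of B."""
    return sum(np.linalg.norm(W.conj().T @ b) for b in B)

rng = np.random.default_rng(1)
K, N, L = 4, 3, 8
B = rng.standard_normal((L, K))
W = rng.standard_normal((K, N))

# Sanity check: for L = K and B = Id, the B_1-norm is the sum of row norms of W.
assert np.isclose(b1_norm(W, np.eye(K)), np.linalg.norm(W, axis=1).sum())
```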
We show in Lemma 5.5 below that all Z ∈ E_{µ,δ} have a rather large ‖·‖_{B_1}-norm, which entails that the mass of the vector (‖Z^* b_ℓ‖)_{ℓ=1}^L cannot be concentrated on only very few entries. This in turn will allow us to employ a non-i.i.d. version of Mendelson's small-ball method [37] to show the lower bound (35); see Lemma 5.6 below. To understand the behaviour on the second part, recall from Proposition 3.3 that for matrices Z/‖Z‖_F ∈ K_*(h_0m_0^*) \ E_{µ,δ} the quantity ‖A(Z)‖ may be quite small, so a uniform bound is not feasible. However, as Z is almost orthogonal to h_0m_0^*, the quantity ‖P_{T^⊥}Z‖_* must also be rather small because of the characterization of the descent cone, Lemma 4.1. Hence, Z is close to the tangent space and almost orthogonal to h_0m_0^*. For that reason, whenever ‖h_0m_0^* + tZ‖_* ≤ ‖h_0m_0^*‖_* holds, the cylindrical shape of the nuclear norm ball implies that t > 0 is small. This fact is captured by Lemma 5.7 below. Theorem 3.7 can then be proven by combining inequality (35) and Lemma 5.7; see Section 5.3.

A lower bound for the minimum conic singular value
First we recall the notion of Gaussian width (see, e.g., [63]).

Definition 5.1. For a set E ⊂ C^{K×N} its Gaussian width is defined by

where G ∈ C^{K×N} is a matrix whose entries are independent and identically distributed with distribution CN(0, 1).
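This definition can be probed numerically. The following Monte Carlo sketch is our own illustration (the supremum oracle `sup_fn` is a hypothetical stand-in for sup_{Z∈E} Re⟨G, Z⟩_F over a given set E):

```python
import numpy as np

def gaussian_width_mc(sup_fn, K, N, trials=2000, seed=0):
    """Monte Carlo estimate of the Gaussian width
    w(E) = E sup_{Z in E} Re<G, Z>_F, with G having iid CN(0,1) entries.
    sup_fn(G) must return sup_{Z in E} Re<G, Z>_F for a given G."""
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(trials):
        # CN(0,1) entries: real and imaginary parts N(0, 1/2) each
        G = (rng.standard_normal((K, N)) + 1j * rng.standard_normal((K, N))) / np.sqrt(2)
        vals.append(sup_fn(G))
    return np.mean(vals)

# Sanity check on the Frobenius unit sphere, where the supremum equals ||G||_F
# (take Z = G / ||G||_F), so the width is close to sqrt(K*N).
K, N = 10, 12
w = gaussian_width_mc(lambda G: np.linalg.norm(G), K, N)
assert abs(w - np.sqrt(K * N)) < 0.5
```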
This definition allows us to state the following lemma, which is important for our analysis of the conic singular value. It relies on a uniform lower bound on the number of measurements whose magnitude is larger than a certain constant and is a variant of Theorem 2.1 in [37].
Here ε_1, …, ε_L are independent Rademacher variables, i.e., random variables taking the values ±1 with probability 1/2 each. The proof of Lemma 5.2 is based on a variant of Mendelson's small-ball method and proceeds in analogy to [37]; we defer a detailed proof to Appendix A. In order to apply Lemma 5.2 we need to estimate the first term on the right-hand side of the inequality in Lemma 5.2. Such an estimate can be derived using the Paley–Zygmund inequality as in [37]. For the sake of completeness we have included a proof in Appendix B.
In order to use Lemma 5.3 we need a lower bound for |{ℓ ∈ [L] : ‖X^* b_ℓ‖ ≥ ξ}|. This will be achieved by the next lemma. For its statement and proof we introduce the following notion. For any matrix W ∈ C^{K×N} we define its ‖·‖_{B_{1,w}}-quasinorm by

That is, ‖W‖_{B_{1,w}} is the weak ℓ_1-norm of the vector (‖W^* b_ℓ‖)_{ℓ=1}^L. (For a more detailed discussion of the weak ℓ_1-norm see, e.g., [24].) A direct consequence of this interpretation is inequality (37).

Lemma 5.4. Let Z ∈ C^{K×N} be such that ‖Z‖_F = 1. Then it holds that

Proof of Lemma 5.4. Choose ξ^* such that

As

where we also used inequality (37). We observe that

where for the second equality we used the identity Σ_{ℓ=1}^L b_ℓ b_ℓ^* = Id. Using inequality (39) and rearranging terms, it follows that

which completes the proof.
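The weak ℓ_1-norm admits the standard closed form max_k k·v_(k), where v_(k) is the k-th largest entry of the (nonnegative) vector. A minimal sketch of this computation (our own illustration):

```python
import numpy as np

def weak_l1(v):
    """Weak l1-(quasi)norm: max_k k * (k-th largest entry of |v|)."""
    s = np.sort(np.abs(v))[::-1]                  # decreasing rearrangement
    return np.max((np.arange(len(s)) + 1) * s)

v = np.array([3.0, 1.0, 4.0, 0.5])
# sorted: 4, 3, 1, 0.5 -> candidates 4, 6, 3, 2 -> weak l1 norm = 6
assert weak_l1(v) == 6.0
# The weak l1-norm is always dominated by the l1-norm.
assert weak_l1(v) <= np.abs(v).sum()
```

Applied to the vector (‖W^* b_ℓ‖)_{ℓ=1}^L, this computes ‖W‖_{B_{1,w}}, and the second assertion illustrates the domination by ‖W‖_{B_1}.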
The next lemma gives a lower bound on inf_{Z∈E_{µ,δ}} ‖Z‖_{B_1}.

Proof. Let Z ∈ E_{µ,δ} be arbitrary. By definition of E_{µ,δ} there are h_0 ∈ H_µ and m_0 ∈ C^N such that Z ∈ K_*(h_0m_0^*) and such that the inequality

holds, where for the second equality we have used that Σ_{ℓ=1}^L b_ℓ b_ℓ^* = Id. Note that for all ℓ ∈ [L] it holds that

Hence, by the previous inequality chain it follows that

where in the last inequality we used the definitions of µ and ‖Z‖_{B_1}. Rearranging terms and taking the infimum over all Z ∈ E_{µ,δ} yields the desired inequality.
Having gathered all the necessary ingredients we can state and prove the main lemma of this section.
Lemma 5.6. Let δ > 0. Assume that

Then with probability at least

it holds that

where C_1 and C_2 are absolute constants.
Proof. Our goal is to apply Lemma 5.2. For this we first derive a lower bound for the first term on the right-hand side of inequality (36). Recall that by Lemma 5.5 it holds that

Thus, for any Z ∈ E_{µ,δ} we obtain that

where the first inequality follows from (42), the second one is due to Lemma 5.4, and the third one follows again from (42). Hence, by Lemma 5.3 applied with ξ = δ/(4√L log(eL)µ) we obtain that

Next, we need an upper bound for the Gaussian width. For that, we first observe that

The Gaussian width of E has been bounded in [31, Lemma 4.1]; combined with the monotonicity of the Gaussian width, their result yields that

Thus, for ξ = δ/(4√L log(eL)µ) we obtain from Lemma 5.2 together with (43) and (44) that with probability at least 1 − exp(−2t²) it holds that

where the second inequality follows from assumption (40) if the constant C_1 > 0 is chosen large enough. Consequently, setting t = 9δ²√L/(128 log²(eL)µ²) and recalling that (A(Z))(ℓ) = ⟨b_ℓc_ℓ^*, Z⟩_F, we have that with probability at least

Summing up, we obtain that with probability at least

This shows the claim.

Proof of Theorem 3.7
As already mentioned in Section 4, in order to control all matrices Z ∈ K_*(h_0m_0^*) that are almost orthogonal to h_0m_0^*, we need the following key lemma.
Proof. We observe that

Rearranging terms yields the result.
Now we have gathered all tools which are needed to prove Theorem 3.7.
Proof of Theorem 3.7. Having introduced all necessary tools in the last two sections, we can now give a proof of Theorem 3.7. We set δ := (log eL)^{2/3} µ^{2/3} α^{1/3}. Throughout the proof we will assume that inequality (41) holds, which by Lemma 5.6 is the case with probability at least

Let h_0 ∈ H_µ and m_0 ∈ C^N. Furthermore, let X̂ be a minimizer of (3) and set Z := X̂ − h_0m_0^*. From the minimality of X̂ it follows that ‖X̂‖_* ≤ ‖h_0m_0^*‖_*. This implies that Z ∈ K_*(h_0m_0^*). To prove the theorem it remains to derive an appropriate upper bound on ‖Z‖_F. For that we distinguish two cases, namely Z/‖Z‖_F ∈ E_{µ,δ} and Z/‖Z‖_F ∉ E_{µ,δ}. If Z/‖Z‖_F ∈ E_{µ,δ}, it follows from inequality (41) that

where in the second inequality we used the triangle inequality as well as Z = X̂ − h_0m_0^* and y = A(h_0m_0^*) + e. In the third inequality we used that X̂ is feasible and ‖e‖ ≤ τ. If Z/‖Z‖_F ∉ E_{µ,δ}, we argue via Lemma 5.7 and obtain (46).

Combining the estimates (45) and (46) we obtain that

which completes the proof.

Outlook
In this paper we have analyzed two important cases of structured low-rank matrix recovery problems, blind deconvolution and matrix completion, through an inspection of the descent cone of the nuclear norm and its interaction with the measurement operator A.
We have shown that the conic singular value is typically quite small and, consequently, previous analysis approaches cannot give strong recovery guarantees. For the example of blind deconvolution we have presented a new approach based on a refined analysis of the descent cone, showing that the nuclear norm minimization approach is stable against adversarial noise in certain important parameter regimes and allows for uniform recovery guarantees in the presence of noise. In our opinion our results give rise to a number of interesting follow-up questions.
• Stability for small noise levels: Until now, our stability result only covers the situation where the noise level τ is of constant order (up to logarithmic factors). For small τ, Theorem 3.1 and Theorem 3.5, respectively, put some barriers on what performance can be expected. Nevertheless, it will be interesting to examine the transitional case, where τ is rather small, even further. For example, while the bad conditioning for small noise levels has been established, it remains open whether one can construct a noise vector e such that the true minimizer behaves like the alternative (but non-minimal) solutions constructed in Theorems 3.1 and 3.5. Also the transitional behavior of the minimum conic singular value between very small noise levels (where we established bad conditioning) and larger noise levels (where, at least for randomized blind deconvolution, we proved stability) will be an interesting question to study.
• Extension to the rank r case: Understanding nuclear norm recovery for matrix completion under adversarial noise remains an important open problem in the field. While our result established that recovery guarantees for arbitrary noise levels are not feasible, our considerations for the rank one scenario give hope that for sufficiently large noise levels, near optimal guarantees are within reach also for matrices of arbitrary rank.
Similarly, a natural generalization of blind deconvolution is the problem of blind demixing [46, 30], where one observes a noisy superposition of several convolutions, that is, y = Σ_{i=1}^r w_i ∗ x_i + e. The corresponding low-rank matrix formulation can be interpreted as a rank-r version of the randomized blind deconvolution problem.
We expect that a rank r version of Theorem 3.7 will apply to both these scenarios, which is why we consider this a very promising direction for future research.
• Extension to other low-rank matrix recovery models: Various other low-rank matrix models also involve incoherence in some way, for example, robust PCA [6] and spectral compressed sensing via matrix completion [15]. For these problems, too, recovery results are typically proven via the Golfing Scheme and lead to a seemingly suboptimal noise bound (see, e.g., [65, Section VI]). Can these problems be analyzed with the methods developed in this paper? Moreover, [38] provided an incoherence-based analysis of the phase retrieval problem under random Bernoulli measurements. It will be interesting to analyze this setup with similar methods as in this manuscript.
[65] Z. Zhou, X. Li

A. Proof of Lemma 5.2

The proof of Lemma 5.2 relies on the following two lemmas. The first lemma is a version of Mendelson's small-ball method for non-i.i.d. measurements. In order to state it, let X_1, …, X_L be independent, matrix-valued random variables defined on a probability space (Ω, µ). For every measurable, real-valued function f and for every ξ > 0 we define the quantity

Lemma A.1. Let X_1, …, X_L ∈ C^{K×N} be independent random variables and let F be a set of real-valued functions that are measurable with respect to (Ω, µ). Let t > 0 and ξ > 0. Then with probability at least 1 − exp(−2t²) it holds that

where ε_1, …, ε_L are independent Rademacher variables, i.e., random variables taking the values ±1 with probability 1/2 each. The proof of Lemma A.1 is exactly analogous to the proof of the original small-ball method [37]; see Section A.1. The second auxiliary lemma, proved in Section A.2, relates the quantity E sup

Then, by a direct application of Lemma A.1 we obtain that with probability at least 1 − exp(−2t²) it holds that

To bound the second summand, we observe that

where in the third line we used that Re⟨b_ℓc_ℓ^*, X⟩_F and Im⟨b_ℓc_ℓ^*, X⟩_F have the same distribution. The fourth line follows from the symmetry of the set E, and the last line is due to Lemma A.2. Combining (47) and (48) finishes the proof.

A.1. Proof of Lemma A.1
We directly trace the steps of the proof of Theorem 1.5 in [37]. In the following, 𝟙_A denotes the indicator function, which takes the value 1 if the event A occurs and the value 0 otherwise. Note that

Taking the infimum, we observe that by the definition of Q_{2ξ}

The bounded difference inequality (see, for example, [4]) implies that with probability at least 1 − exp(−2t²) it holds that

To deal with the expectation we will use the function Ψ_ξ : [0, +∞) → R defined by

We observe that Ψ_ξ is Lipschitz continuous. Together with the inequality chains (49) and (50), this completes the proof.
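The function Ψ_ξ used above is, in the standard form of the small-ball method (assumed here, since the displayed definition did not survive extraction), a soft indicator that is 0 below ξ, 1 above 2ξ, and linear in between; it is (1/ξ)-Lipschitz and sandwiched between two indicators. A minimal sketch:

```python
import numpy as np

def psi(t, xi):
    """Soft indicator used in the small-ball method (assumed standard form):
    0 for t <= xi, linear on [xi, 2*xi], and 1 for t >= 2*xi.
    It is (1/xi)-Lipschitz and satisfies
    1_{t >= 2 xi} <= psi(t, xi) <= 1_{t >= xi}."""
    return np.clip((t - xi) / xi, 0.0, 1.0)

t = np.linspace(0, 3, 301)
xi = 1.0
# Verify the sandwich between the two indicator functions.
assert np.all((t >= 2 * xi).astype(float) <= psi(t, xi) + 1e-12)
assert np.all(psi(t, xi) <= (t >= xi).astype(float) + 1e-12)
```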

A.2. Proof of Lemma A.2
First, we observe that

Note that, due to the definition of ω(E), in order to finish the proof it is enough to show that the entries of the matrix X = Σ_{ℓ=1}^L b_ℓc_ℓ^* are independent and identically distributed with distribution CN(0, 1). For that, let (i, j) ∈ [K] × [N] and compute that

This implies that e_i^* (Σ_{ℓ=1}^L b_ℓc_ℓ^*) e_j ∼ CN(0, 1). It remains to show that the individual entries of the matrix Σ_{ℓ=1}^L b_ℓc_ℓ^* are independent. For that, we set

Now let (i, j), (i′, j′) ∈ [K] × [N] be such that (i, j) ≠ (i′, j′). Our goal is to show that E X_{i,j} X_{i′,j′} = 0. If j ≠ j′, this follows immediately from the observation that c_ℓ^* e_j and c_ℓ^* e_{j′} are independent for all ℓ ∈ [L]. Now assume that j = j′. Then we can compute that

Hence, we have shown that all entries of the matrix X are uncorrelated. As the entries of X are jointly Gaussian, this implies that they are independent, which completes the proof.
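The distributional claim can be checked empirically. The following is our own small-dimensional sketch (all sizes hypothetical): if Σ_ℓ b_ℓb_ℓ^* = Id and the c_ℓ are iid CN(0, Id), the entries of X = Σ_ℓ b_ℓc_ℓ^* should have unit variance and vanishing pairwise correlations.

```python
import numpy as np

# Empirical check: B has orthonormal columns, so its rows b_l (as vectors in
# R^K) satisfy sum_l b_l b_l^* = B^T B = Id_K.
rng = np.random.default_rng(0)
L, K, N, n = 6, 4, 3, 20000
B, _ = np.linalg.qr(rng.standard_normal((L, K)))
# n independent draws of the c_l, each with iid CN(0,1) coordinates.
C = (rng.standard_normal((n, L, N)) + 1j * rng.standard_normal((n, L, N))) / np.sqrt(2)
# n samples of X = sum_l b_l c_l^*  (entries X[k, j] = sum_l B[l, k] * conj(C[l, j])).
X = np.einsum('lk,nlj->nkj', B, C.conj())

v = np.mean(np.abs(X[:, 0, 0]) ** 2)              # empirical variance of one entry
corr = np.mean(X[:, 0, 0] * np.conj(X[:, 1, 2]))  # empirical correlation of two entries
assert abs(v - 1.0) < 0.05
assert abs(corr) < 0.05
```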