On signal detection and confidence sets for low rank inference problems

We consider the signal detection problem in the Gaussian design trace regression model with low rank alternative hypotheses. We derive the precise (Ingster-type) detection boundary for the Frobenius and the nuclear norm. We then apply these results to show that honest confidence sets for the unknown matrix parameter that adapt to all low rank sub-models in nuclear norm do not exist. This shows that recently obtained positive results in (Carpentier, Eisert, Gross and Nickl, 2015) for confidence sets in low rank recovery problems are essentially optimal.


Introduction
Consider the Gaussian design trace regression model

Y_i = tr(X_i θ) + ε_i, i = 1, …, n,    (1)

where ε = (ε_1, …, ε_n)^T ∼ N(0, I_n) is a vector of i.i.d. Gaussian noise. Here the X_i are d × d square matrices with i.i.d. entries (X_i)_{mk} ∼ N(0, 1), and θ is the unknown d × d matrix we want to make inference on. We are interested in the case where the model dimension d² is possibly large compared to the sample size n, but where θ has low rank k, in which case we write θ ∈ R(k), 1 ≤ k ≤ d. This setting serves as a prototype for various matrix inference problems such as those occurring in compressed sensing [4] or in quantum tomography [7]. Throughout we consider a high-dimensional regime where min(d, n) → ∞, reflecting contemporary statistical challenges.
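To fix ideas, the following minimal Python sketch simulates model (1); the function name, the factorised construction of a rank-k θ, and all sizes are illustrative choices, not part of the model.

```python
import numpy as np

def simulate_trace_regression(n, d, k, signal=1.0, rng=None):
    """Simulate model (1): Y_i = tr(X_i theta) + eps_i with theta of rank k."""
    rng = np.random.default_rng(rng)
    # A rank-k parameter theta = A B^T, rescaled to Frobenius norm `signal`.
    A = rng.standard_normal((d, k))
    B = rng.standard_normal((d, k))
    theta = A @ B.T
    theta *= signal / np.linalg.norm(theta, "fro")
    # Gaussian design: each X_i has i.i.d. N(0, 1) entries.
    X = rng.standard_normal((n, d, d))
    # tr(X_i theta) = sum_{m,k} (X_i)_{mk} theta_{km}.
    Y = np.einsum("imk,km->i", X, theta) + rng.standard_normal(n)
    return Y, X, theta
```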
The first problem we study in this paper is the signal detection problem with low rank alternatives: we want to test the hypothesis

H_0 : θ = 0   vs.   H_1 : θ ≠ 0, θ ∈ R(k), ||θ|| ≥ ρ,
where || · || equals either the Frobenius norm || · ||_F or the nuclear norm || · ||_* (both defined in detail below), and where ρ should be the minimal 'signal strength' condition under which the above hypothesis testing problem has a consistent solution (in the sense of Ingster, see [10]). We will show that the minimax optimal detection boundary in Frobenius norm is of the form

ρ ≍ min( √(d/n), n^{-1/4} ),

whereas in nuclear norm it is

ρ ≍ min( √(kd/n), √k n^{-1/4} ).
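For concreteness, these two boundaries can be tabulated as follows; this small helper simply evaluates the displayed rates (all constants omitted) and has the reconstruction above as its only input.

```python
import math

def detection_boundary(n, d, k):
    """Evaluate the two displayed detection rates (constants omitted)."""
    # The (d/n)^(1/2) branch is active iff n > d^2, the n^(-1/4) branch otherwise.
    rho_frob = min(math.sqrt(d / n), n ** -0.25)
    rho_nucl = min(math.sqrt(k * d / n), math.sqrt(k) * n ** -0.25)
    return rho_frob, rho_nucl
```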
A remarkable feature is that for the Frobenius norm the detection rate does not depend at all on the complexity of the alternative hypothesis (the rank k), whereas for the nuclear norm it does. The phase transition between the two regimes in these rates depends precisely on whether the sample size n exceeds the dimension d 2 of the maximal parameter space R(d) or not. The upper bounds in our proofs are related to the papers [9,1] about the detection boundary in the sparse regression setting, and our main contribution consists in deriving the matching lower bounds for low rank alternatives.
Our interest in the detection boundary is triggered by the second problem we investigate here: the question of existence and non-existence of adaptive confidence sets for low rank parameters. It follows from general decision-theoretic principles (see Chapter 8.3 in [6] and also [8, 2]) that the answer to this question is closely related to a 'composite version' of the detection problem (see (15) below). This approach was employed in [14] to prove that adaptive and honest confidence sets for the parameter θ do not exist in sparse regression models if an ℓ₂-risk performance beyond O(n^{-1/4}) is desired. In contrast, in the recent paper [5] it was shown that if sparsity constraints are replaced by low rank conditions, then adaptive and fully honest confidence sets exist over the entire parameter space R(d). Adaptation means here that the expected Frobenius norm diameter of the confidence set reflects the minimax risk over arbitrary low rank sub-models R(k), 1 ≤ k ≤ d. The fact that the detection rates obtained here in Frobenius norm are independent of the rank constraint θ ∈ R(k) provides another heuristic explanation of the result in [5].
Moreover, [5] constructed another confidence set whose diameter adapts to low rank sub-models in the stronger nuclear norm distance, and which is honest for all θ that are non-negative definite and have trace equal to one, that is, whenever θ is the density matrix of a quantum state. Such a constraint on θ is natural in the quantum physics context considered in [5], but not in general, and the question arises whether it is essentially necessary. In the present paper we show that the existence results of [5] are indeed specific to the geometry induced by the Frobenius norm or to the quantum state constraint, and that nuclear norm adaptive and honest confidence sets over general low rank parameter spaces do not exist in the model (1). For example, our results imply that if one requires coverage of a confidence set over all of R(d), then the worst case nuclear norm diameter for rank-one parameters can be off the minimax estimation rate over R(1) by as much as √d. Our results thus further illustrate the subtleties involved in the theory of confidence sets for high-dimensional parameters, and show that the positive results in [5] are of a rather specific nature.
Our proofs are given in the simplest model where both the design and the noise are Gaussian, and the matrices involved are of square type. As usual, our results extend without major difficulty to sub-Gaussian design and noise, to certain correlated random designs, and also to non-square matrices, at the expense of slightly more technical proofs. Generalisations of our results to the matrix completion problem are currently under investigation.

Notation
We write M_d for the set of d × d matrices with real entries. If X : M_d → R^n denotes the 'sampling operator'

θ ↦ X θ = ( tr(X_1 θ), …, tr(X_n θ) )^T,

then the model (1) can be written as

Y = X θ + ε,

where Y = (Y_1, …, Y_n)^T and ε = (ε_1, …, ε_n)^T. We write E_X for the expectation over the distribution of X only, E^θ for the expectation conditional on X, and E_θ = E_X E^θ for the full expectation. The corresponding probability laws are denoted by P_X, P^θ and P_θ, and we employ the usual o/O/o_P/O_P notation, always with respect to min(n, d) → ∞.
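As a sanity check on this notation, the sketch below applies the sampling operator and verifies by Monte Carlo the identity E_X ||X θ||₂² = n ||θ||²_F, which is used repeatedly in the proofs below; all sizes are illustrative.

```python
import numpy as np

def sampling_op(X, theta):
    """(X theta)_i = tr(X_i theta) for a stack X of n design matrices."""
    return np.einsum("imk,km->i", X, theta)

rng = np.random.default_rng(0)
n, d = 5000, 8
theta = rng.standard_normal((d, d))
X = rng.standard_normal((n, d, d))
# The empirical mean of (X theta)_i^2 should be close to ||theta||_F^2.
print(np.mean(sampling_op(X, theta) ** 2), np.linalg.norm(theta, "fro") ** 2)
```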
We denote the standard norm on Euclidean space by || · ||₂ and the associated inner product by ⟨·, ·⟩₂. Let || · ||_F be the Frobenius norm on M_d, i.e.,

||M||²_F = Σ_{m,k} M²_{mk} = tr(M^T M) = Σ_{i=1}^d λ_i²,

where the λ_i² are the eigenvalues of M^T M (equivalently, the λ_i ≥ 0 are the singular values of M). The associated inner product is

⟨M, N⟩_F = tr(M^T N).

We also define the nuclear norm of M as

||M||_* = Σ_{i=1}^d λ_i.

These two norms are in fact defined also for matrices that are not of square type. Finally we recall that for any matrix M ∈ R(k) we have, by the Cauchy-Schwarz inequality,

||M||_* ≤ √k ||M||_F.
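These definitions are straightforward to check numerically; the following sketch computes both norms from the singular values of a random rank-k matrix and verifies the displayed inequality (sizes are illustrative).

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 8, 3
M = rng.standard_normal((d, k)) @ rng.standard_normal((k, d))  # rank k
lam = np.linalg.svd(M, compute_uv=False)        # singular values lambda_i
frob = np.sqrt(np.sum(lam ** 2))                # ||M||_F
nuclear = np.sum(lam)                           # ||M||_*
assert np.isclose(frob, np.linalg.norm(M, "fro"))
assert nuclear <= np.sqrt(k) * frob + 1e-9      # ||M||_* <= sqrt(k) ||M||_F
```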

Signal detection for low rank alternatives
We consider first the following hypothesis testing problem, also known as the signal detection problem:

H_0 : θ = 0   vs.   H_1 : θ ∈ R(k), ||θ|| ≥ ρ.

Here the alternative space is restricted to a 'low rank' hypothesis θ ∈ R(k) for some 1 ≤ k ≤ d.
Moreover, for a separation constant ρ > 0, the detection boundary is described by a 'signal strength' condition ||θ|| ≥ ρ, measured either in the Frobenius norm or in the nuclear norm of θ. In the high-dimensional regime where min(n, d) → ∞, we want to find the minimal sequence ρ ≡ ρ_{n,d} such that for any α > 0 a level-α test Ψ = Ψ(Y, X, α) exists, in the sense that

E_0 Ψ + sup_{θ ∈ H_1} E_θ(1 − Ψ) ≤ α.    (3)

Recall that a test is simply a random indicator function Ψ = 1_A where the rejection event A depends only on Y, X, α; in (3) we require the sum of the type-one and type-two errors of the test to be controlled at any fixed level α > 0.
The tests Ψ constructed in the proof are given in (9) below and are straightforward to implement. Note also that the || · ||_*-separated alternatives are a subset of the || · ||_F-separated alternatives (see (10) below), and our results imply that an optimal test for the case || · || = || · ||_F is essentially optimal also for || · ||_*.

Confidence sets for low rank recovery
Low rank recovery algorithms are well studied in compressed sensing and high-dimensional statistics, see e.g. [4, 7, 11, 12, 13, 3] and the references therein. In the setting of model (1) they provide minimax optimal estimators θ̂ of θ ∈ R(k) with (high probability) performance guarantees of the form

||θ̂ − θ||_F ≲ √(kd/n).    (5)

The question we study here is whether associated uncertainty quantification methodology exists, that is, whether we can find confidence sets C_n ⊂ M_d such that

inf_{θ ∈ R(d)} P_θ( θ ∈ C_n ) ≥ 1 − α,    (6)

at least for min(n, d) large enough, and such that the diameter |C_n| of C_n reflects the accuracy of adaptive estimation, in the sense that |C_n| shrinks, with high probability, at the optimal rates from (5) whenever θ ∈ R(k). We insist here on an adaptive confidence set that does not require knowledge of the unknown rank k of θ.
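One standard algorithm of this kind is nuclear norm penalised least squares. The sketch below solves it by proximal gradient descent (iterative singular value thresholding); the step size rule and penalty level are crude illustrative choices, not the tuned procedures of the cited references.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: the proximal map of tau * ||.||_*."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def matrix_lasso(Y, X, lam, n_iter=300):
    """argmin_theta (1/2n) ||Y - X theta||_2^2 + lam * ||theta||_*  via ISTA."""
    n, d, _ = X.shape
    theta = np.zeros((d, d))
    # Conservative Lipschitz bound for the gradient of the quadratic term.
    L = np.sum(X ** 2) / n
    for _ in range(n_iter):
        residual = np.einsum("imk,km->i", X, theta) - Y
        grad = np.einsum("i,imk->km", residual, X) / n
        theta = svt(theta - grad / L, lam / L)
    return theta
```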
A first result that is proved in the paper [5] is that such adaptive confidence sets do exist in the model (1) if the diameter is measured in Frobenius distance. The construction of this set is straightforward; see [5] for details.
Theorem 2 (Theorem 2 in [5]) For every α > 0 there exists a confidence set C_n = C_n(Y, X, α) such that for all n ∈ N, (6) holds, and such that uniformly in θ ∈ R(k_0) for any 1 ≤ k_0 ≤ d, with high P_θ-probability the Frobenius norm diameter |C_n|_F of C_n satisfies

|C_n|_F ≲ √(k_0 d/n) + n^{-1/4}.

A second result proved in [5] is that an (asymptotic) adaptive confidence set exists also in nuclear norm provided the 'quantum state constraint' is satisfied, namely, provided it is known a priori that θ is non-negative definite and has nuclear norm one, and provided the coverage requirement in (6) is relaxed to hold only over a maximal model R(k) in which asymptotically consistent estimation of θ is possible (i.e., k √(d/n) = o(1)). Define

R_+(k) = R(k) ∩ { θ : θ is non-negative definite, tr(θ) = 1 },

the set of quantum state density matrices of rank at most k.
Theorem 3 (Theorem 4 in [5]) Assume k √(d/n) = o(1) for some 1 ≤ k ≤ d, and let α > 0 be given. Then there exists a confidence set C_n = C_n(Y, X, α) such that

lim inf_{min(n,d)→∞} inf_{θ ∈ R_+(k)} P_θ( θ ∈ C_n ) ≥ 1 − α,

and such that uniformly in θ ∈ R_+(k_0) for any 1 ≤ k_0 ≤ k, with high P_θ-probability the nuclear norm diameter |C_n|_* of C_n satisfies

|C_n|_* ≲ k_0 √(d/n).
In fact it is not difficult to generalise the above theorem to the case where the condition tr(θ) = 1 is relaxed to ||θ||_* ≤ 1.
The next theorem, which is the main result of this subsection, implies that no analogue of Theorem 2 can hold true if the Frobenius norm there is replaced by the nuclear norm, and it also shows that Theorem 3 cannot hold true if R_+(k) is replaced by R(k), that is, if the 'quantum state constraint' is relaxed. More precisely, we show that if a confidence set C_n is required to have coverage over the maximal model R(k_1), then the worst case expected nuclear norm diameter of C_n over arbitrary sub-models R(k_0), k_0 = o(k_1), depends on the maximal model dimension k_1 and does not improve as k_0 ↓ 1. The proof of Theorem 4 is based on Part 2) of Theorem 1 and on lower bound techniques for adaptive confidence sets from [8, 2].
Theorem 4 Let k_1 → ∞ be such that k_1 = o(d) as min(n, d) → ∞. Suppose that for any 0 < α < 1/3 the confidence set C_n = C_n(Y, X, α) is asymptotically honest over the maximal model R(k_1), that is, it satisfies

lim inf_{min(n,d)→∞} inf_{θ ∈ R(k_1)} P_θ( θ ∈ C_n ) ≥ 1 − α.    (7)

Then for every k_0 = o(k_1) and some constant c > 0 depending on α, we have

sup_{θ ∈ R(k_0)} E_θ |C_n|_* ≥ c min( √(k_1 d/n), √k_1 n^{-1/4} )    (8)

for every min(n, d) large enough. In particular, no confidence set exists that is honest over all of M_d and that also adapts in nuclear norm to any model R(k_0) with k_0 = o(d). For notational simplicity we have lower bounded the expected diameter |C_n|_* in (8), but the proof actually contains a stronger 'in probability' version of this lower bound.

Remark 1 A few remarks on Theorem 4 are in order:
i) In the least favourable case where one wants coverage over the entire space R(d) = M_d while still adapting to rank-one matrices (i.e., k_0 = 1), the performance of any honest confidence set is off the minimax optimal adaptive estimation rate √(d/n) over R(1) by a diverging factor that can be as close to √d as desired.
ii) Even if one restricts coverage to hold only over 'consistently estimable' models R(k_1) with k_1 √(d/n) → 0 (as in Theorem 3), the diameter |C_n|_* can be off the minimax rate of estimation over R(1) by a factor of √k_1.
iii) We also note that the above result does not disprove the existence of adaptive confidence sets for sub-models R(k_0) of 'moderate rank' k_0 ≥ √d. While more of technical interest (note that k_0 ≥ √d rules out n < d² if consistent recovery is to be possible), this regime currently remains open; it is related to the apparently hard problem of finding optimal separation rates in the composite testing problem (15) below.

Proof of Theorem 1, upper bounds
When n < d², define r̂_n = (1/n) Σ_{i=1}^n (Y_i² − 1) and the test

Ψ_n = 1{ r̂_n ≥ z_α τ_n },    (9)

where τ_n = n^{-1/2} and z_α is a quantile constant chosen below. These tests work for Frobenius norm separation, by effectively the same proofs as in [9], using that we can embed the matrix regression model into a vector regression model with p = d² parameters, and since the separation rates only depend on the model dimension (and not on low rank or sparsity degrees). However, to provide intuition, we give some details, first for the case n < d². Under H_0 we have Y = ε, and so for every n ∈ N and z_α large enough

P_0( Ψ_n = 1 ) = P( (1/n) Σ_{i=1}^n (ε_i² − 1) ≥ z_α n^{-1/2} ) ≤ α/2

(using either Chebyshev's inequality and Eε_i⁴ = 3, or Theorem 4.1.9 in [6] for a more precise non-asymptotic bound). Now for the alternatives θ ∈ H_1 we use the basic concentration result Lemma 1a) in [5], which implies that for any fixed θ the event

E = { ||X θ||₂² ≥ n ||θ||²_F / 2 }

has P_X-probability at least 1 − 2 exp(−n/24). Hence, for n ≥ n_α such that 2 exp(−n/24) < α/6, decomposing Y_i² − 1 = (ε_i² − 1) + 2 ε_i (X θ)_i + (X θ)_i², we obtain

E_θ(1 − Ψ_n) ≤ P_θ( (2/n) ⟨ε, X θ⟩₂ ≤ −||X θ||₂²/(4n), E ) + P_θ( (1/n) Σ_{i=1}^n (ε_i² − 1) ≤ −||X θ||₂²/(4n), E ) + α/6,

since, by the hypothesis on ρ, we have for D large enough that, on the event E,

(1/n) ||X θ||₂² ≥ ||θ||²_F / 2 ≥ ρ²/2 ≥ (D²/2) n^{-1/2} ≥ 2 z_α n^{-1/2}.

The last probability is bounded by α/6 as under H_0, and the last but one probability is also bounded by α/6 by a direct (conditional on X) Gaussian tail inequality (restricting to the event E, just as in term II of the proof of Theorem 3 in [5] with θ̂ = 0 there), so that in total we have bounded the testing errors in (3) by α/2 + (3/6)α = α, as desired. The case n ≥ d² follows from similar but slightly more technical arguments, adapting the proof of Theorem 3 in [5], or arguing directly as in Theorem 4.3 in [9] with p = d².
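A minimal implementation sketch of the test just described, using the reconstruction of (9) above; the numerical value of z_alpha is an illustrative placeholder for the quantile constant.

```python
import numpy as np

def frobenius_detection_test(Y, z_alpha=4.0):
    """Reject H_0 iff r_n = (1/n) sum_i (Y_i^2 - 1) exceeds z_alpha / sqrt(n)."""
    n = len(Y)
    r_n = np.mean(Y ** 2) - 1.0
    return r_n >= z_alpha / np.sqrt(n)
```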
The test (9) also works for nuclear norm separation since, by the inequality ||θ||_* ≤ √k ||θ||_F for θ ∈ R(k),

{ θ ∈ R(k) : ||θ||_* ≥ ρ } ⊆ { θ ∈ R(k) : ||θ||_F ≥ ρ/√k },    (10)

so that a nuclear norm alternative with separation ρ is detected by the Frobenius norm test at separation level ρ/√k. We now turn to the more difficult lower bounds.

Proof of Theorem 1, lower bounds
Let Ψ be any test, that is, any measurable function of (Y, X) taking values in {0, 1}. Assume ρ = o(r_{n,d}) as min(n, d) → ∞, where r_{n,d} denotes the relevant detection rate from Theorem 1, and let H_1 = H_1(ρ) be the corresponding alternative hypothesis.
Step I: Reduction to averaged likelihood ratios. Let π = π_{n,d} be a sequence of finitely supported probability distributions on M_d such that π_{n,d}(H_1) → 1, and denote by π|H_1 that measure restricted to H_1 and renormalised to unit mass. Define the averaged likelihood ratio

Z = ∫ Π_{i=1}^n [ dP^{(θ)}_i / dP^{(0)}_i ](Y_i) dπ(θ),

where dP^{(θ)}_i is the distribution of Y_i | X when the parameter generating the data is θ, and dP^{(0)}_i is the distribution of Y_i | X when the parameter generating the data is 0. Then, by a standard testing lower bound (e.g., (6.23) in [6]), for any η > 0,

E_0 Ψ + sup_{θ ∈ H_1} E_θ(1 − Ψ) ≥ 1 − η − π(H_1^c) − P_0( Z ≤ 1 − η ),

and, since E_0[Z] = 1, Chebyshev's inequality gives P_0(Z ≤ 1 − η) ≤ η^{-2} E_0[(Z − 1)²] = η^{-2}( E_0[Z²] − 1 ). Hence, if we show that E_0[Z²] → 1 as min(n, d) → ∞ for a suitable choice of π, then the lower bound (4) will follow by letting η → 0. Recall the notation E_θ = E_X E^θ.
Step II: Computation of E_0[Z²]. The (Y_i) are independent with distribution N((X θ)_i, 1) conditional on the design X, hence

[ dP^{(θ)}_i / dP^{(0)}_i ](Y_i) = exp( Y_i (X θ)_i − (X θ)_i²/2 ),

and we can hence write

Z = ∫ exp( ⟨Y, X θ⟩₂ − ||X θ||₂²/2 ) dπ(θ).

Thus, if θ, θ′ are independent copies of joint law π², then, using E_0 exp(⟨Y, v⟩₂) = exp(||v||₂²/2) for Y ∼ N(0, I_n) and any fixed v ∈ R^n, we have

E^0[ Z² | X ] = ∫∫ exp( ⟨X θ, X θ′⟩₂ ) dπ(θ) dπ(θ′).

Step III: Integrating over X. Since 2⟨X θ, X θ′⟩₂ = ||X(θ + θ′)||₂² − ||X θ||₂² − ||X θ′||₂² and E_X ||X ϑ||₂² = n ||ϑ||²_F, the E_X-expectation of the last expression equals

E_0[Z²] = ∫∫ exp( n ⟨θ, θ′⟩_F ) E_X exp( (Z_1 − Z_2 − Z_3)/2 ) dπ(θ) dπ(θ′),

where Z_ℓ = ||X ϑ_ℓ||₂² − n ||ϑ_ℓ||²_F, with ϑ_1 = θ + θ′, ϑ_2 = θ, ϑ_3 = θ′. The last factor can be bounded, by applying the Cauchy-Schwarz inequality twice, by

E_X exp( (Z_1 − Z_2 − Z_3)/2 ) ≤ ( E_X e^{Z_1} )^{1/2} ( E_X e^{−2 Z_2} )^{1/4} ( E_X e^{−2 Z_3} )^{1/4}.    (11)

Each Z_ℓ equals ||ϑ_ℓ||²_F times a centred chi-square statistic Σ_{i=1}^n (g_i² − 1) with g_i i.i.d. N(0, 1). Applying Theorem 3.1.9 in [6] with τ_i ≡ 1 and λ = ||ϑ_1||²_F or λ = 2||ϑ_ℓ||²_F, ℓ = 2, 3 (and hence setting ||A|| = 1, ||A||²_HS = n in that theorem), we see that if max_ℓ ||ϑ_ℓ||²_F ≤ 1/4 then each factor in (11) is bounded by exp( C n ||ϑ_ℓ||⁴_F ) for a universal constant C > 0. As a consequence, if

max_{ℓ=1,2,3} ||ϑ_ℓ||²_F = o( n^{-1/2} )    (12)

then the product (11) is bounded above by 1 + o(1). We conclude that if the prior π satisfies (12) almost surely then

E_0[Z²] ≤ (1 + o(1)) E exp( n ⟨θ, θ′⟩_F ).

Step IV: Construction of π and bounds for E_0[Z²]. Assume for notational simplicity that d is an integer multiple of k; the general case needs only minor notational adjustments. Pick independent random d × 1 vectors v_ℓ, ℓ = 1, …, k, each of which consists of i.i.d. Rademacher entries (i.e., taking values ±1 with probability 1/2). Create a d × d matrix W as follows: in the first d/k columns insert v_1 times a random sign B_{1,j}, j = 1, …, d/k; then, in the ℓ-th block of d/k columns, repeat the same with v_1 replaced by v_ℓ and random signs B_{ℓ,j}, j = 1, …, d/k. If || · || = || · ||_F let γ_n = ρ_n/d, and if || · || = || · ||_* set γ_n = 2ρ_n/(√k d), so that in either case

||W||²_F = d², and ||γ_n W||_F = γ_n d equals ρ_n or 2ρ_n/√k, respectively.

Define the random matrix θ = γ_n W and let θ′ = γ_n W′ be an independent copy of it. Thus

E_0[Z²] ≤ (1 + o(1)) E exp( n γ_n² ⟨W, W′⟩_F ).

As products of Rademacher variables are again Rademacher variables we have, for ε_{ℓ,m}, ε̃_{ℓ,j} i.i.d. Rademacher variables (all defined on a suitable product probability space),

⟨W, W′⟩_F =^d Σ_{ℓ=1}^k Σ_{j=1}^{d/k} ε̃_{ℓ,j} S_ℓ, where S_ℓ = Σ_{m=1}^d ε_{ℓ,m},

and where each Rademacher sum S_ℓ is a sub-Gaussian random variable with variance proxy σ² = d (cf. Section 2.3 in [6]). Thus by (2.24) in [6], conditioning first on the S_ℓ, we have

E exp( n γ_n² ⟨W, W′⟩_F ) ≤ E exp( (n² γ_n⁴ d / 2k) Σ_{ℓ=1}^k S_ℓ² ) ≤ ( 1 − n² γ_n⁴ d²/k )^{-k/2} = 1 + o(1),

since n² γ_n⁴ d² = o(1) under the hypothesis ρ_n = o(r_{n,d}), and noting that (12) holds π-almost surely in view of ||θ||²_F = γ_n² ||W||²_F = γ_n² d² = o(n^{-1/2}) (so that also ||θ + θ′||²_F ≤ 4 γ_n² d² = o(n^{-1/2})).
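The prior of Step IV is easy to implement; the sketch below (function name and sizes are illustrative) draws θ = γ_n W and verifies that θ ∈ R(k) and ||W||²_F = d².

```python
import numpy as np

def draw_prior_matrix(d, k, gamma, rng=None):
    """Draw theta = gamma * W: k column blocks of width d/k, block l holding
    copies of a Rademacher vector v_l times independent signs B_{l,j}."""
    assert d % k == 0
    rng = np.random.default_rng(rng)
    v = rng.choice([-1.0, 1.0], size=(k, d))        # vectors v_1, ..., v_k
    B = rng.choice([-1.0, 1.0], size=(k, d // k))   # signs B_{l,j}
    W = np.concatenate([np.outer(v[l], B[l]) for l in range(k)], axis=1)
    return gamma * W

theta = draw_prior_matrix(d=12, k=3, gamma=1.0)
assert np.linalg.matrix_rank(theta) <= 3                       # theta in R(k)
assert np.isclose(np.linalg.norm(theta, "fro") ** 2, 12 ** 2)  # ||W||_F^2 = d^2
```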
Step V: Asymptotic concentration of π on H_1. Finally we show that for the above prior we have indeed π_{n,d}(H_1) → 1. First, since θ consists of columns that are linear combinations of at most k distinct vectors v_ℓ, we immediately have θ ∈ R(k) almost surely. Moreover, for the case || · || = || · ||_F we have from the last display and by the definition of γ_n that ||θ||²_F = γ_n² d² = ρ_n², so π_{n,d}(H_1) = 1 follows.
For the case || · || = || · ||_* we have to show that π_{n,d}( ||θ||_* ≥ ρ_n ) → 1 as min(n, d) → ∞. We can transform θ into the d × k matrix θU consisting of the k column vectors γ_n √(d/k) v_ℓ, ℓ = 1, …, k. Here the d × k matrix U consists of k column vectors, the ℓ-th of which has zero entries except at the indices m corresponding to the ℓ-th column block of W, where it equals √(k/d) B_{ℓ,m}. Thus U has orthonormal columns, and we deduce that ||θ||_* ≥ ||θU||_*.
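A quick Monte Carlo check of this last step (illustrative sizes), reusing draw_prior_matrix from the sketch in Step IV: with γ_n = 2ρ_n/(√k d), the nuclear norm of θ indeed lands near 2ρ_n, comfortably above ρ_n.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, rho = 120, 3, 1.0
gamma = 2 * rho / (np.sqrt(k) * d)               # nuclear norm calibration
theta = draw_prior_matrix(d, k, gamma, rng=rng)  # prior draw from Step IV
nuclear = np.linalg.svd(theta, compute_uv=False).sum()
print(nuclear, rho)                              # nuclear is near 2*rho >= rho
```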