Minimax Euclidean Separation Rates for Testing Convex Hypotheses in $\mathbb{R}^d$

We consider composite-composite testing problems for the expectation in the Gaussian sequence model where the null hypothesis corresponds to a closed convex subset $\mathcal{C}$ of $\mathbb{R}^d$. We adopt a minimax point of view and our primary objective is to describe the smallest Euclidean distance between the null and alternative hypotheses such that there is a test with small total error probability. In particular, we focus on the dependence of this distance on the dimension $d$ and the sample size/variance parameter $n$, giving rise to the minimax separation rate. In this paper we discuss lower and upper bounds on this rate for different smooth and non-smooth choices for $\mathcal{C}$.


Introduction
In this paper we consider the problem of testing whether a vector $\mu \in \mathbb{R}^d$ belongs to a closed convex subset $\mathcal{C}$ of $\mathbb{R}^d$, $d \in \mathbb{N}$, based on a noisy observation $X$ obtained from the Gaussian sequence model with variance scaling parameter $n \in \mathbb{N}$, i.e.
$$X = \mu + \frac{1}{\sqrt{n}}\,\varepsilon, \qquad (1.1)$$
where $\varepsilon$ is a standard Gaussian vector. More precisely, in an $\ell_2$ sense, we aim at finding the order of magnitude of the smallest separation distance $\rho > 0$ from $\mathcal{C}$ such that the testing problem
$$H_0 \colon \mu \in \mathcal{C} \qquad \text{vs.} \qquad H_\rho \colon \mu \in A_\rho, \qquad (1.2)$$
where $A_\rho := \{x \in \mathbb{R}^d \mid \operatorname{dist}(x, \mathcal{C}) \ge \rho\}$ and $\operatorname{dist}(x, \mathcal{C}) := \inf_{z \in \mathcal{C}} \|x - z\|_2$ is the Euclidean distance from $\mathcal{C}$, can be solved in the following sense: for $\eta \in (0, 1)$, we can construct a uniformly $\eta$-consistent test $\varphi$ for (1.2), i.e.
$$\sup_{\mu \in \mathcal{C}} P_\mu(\varphi = 1) + \sup_{\mu \in A_\rho} P_\mu(\varphi = 0) \le \eta.$$
We write $\rho^*(\mathcal{C}) := \rho^*(\mathcal{C}, d, n, \eta)$ for the minimax-optimal separation distance $\rho$ of this test, i.e.
$$\rho^*(\mathcal{C}) = \inf\Big\{\rho \ge 0 \;\Big|\; \exists \text{ test } \varphi \colon \sup_{\mu \in \mathcal{C}} P_\mu(\varphi = 1) + \sup_{\mu \in A_\rho} P_\mu(\varphi = 0) \le \eta \Big\},$$
i.e. the smallest distance $\rho$ that enables the existence of a uniformly $\eta$-consistent test for the testing problem (1.2). See Section 2 for a more precise description of the model and the relevant quantities. The theory of minimax testing in general has profited greatly from the seminal work of Ingster and Suslina; see for example their book [15].
An instance of this problem that has been extensively studied is signal detection, i.e. the case where $\mathcal{C}$ is a singleton; see e.g. [14] for an extensive survey of this problem and also [2]. From this literature, we can deduce that the minimax-optimal order of $\rho^*$ in this case is $\frac{d^{1/4}}{\sqrt{n}}$. In its general form, however, this problem is a composite-composite testing problem (i.e. neither $\mathcal{C}$ nor $A_\rho$ is a singleton). A versatile way of solving such testing problems was introduced in [12], where the authors combine signal detection ideas with a covering of the null hypothesis to derive minimax-optimal testing procedures for composite-composite testing problems, provided that the null hypothesis is not too large (i.e. that its entropy number is not too large; see Assumption (A3) in [12]). In this case, the authors prove that the minimax-optimal separation rate is the same as the signal detection separation rate, namely $\frac{d^{1/4}}{\sqrt{n}}$. This idea can be generalised to the case where the null hypothesis is "too large" (when Assumption (A3) in [12] is not satisfied); the approach then implies that an upper bound on the minimax separation rate is the sum of the signal detection rate and the optimal estimation rate in the null hypothesis $\mathcal{C}$; see [5] for an illustration of this for a specific convex shape. Using this technique, one finds that the smaller the entropy of $\mathcal{C}$, the smaller the separation rate.
This idea has the advantage of generality, but it is nevertheless sub-optimal in many simple cases. For instance, if $\mathcal{C}$ is a half-space, the minimax-optimal separation rate is $\frac{1}{\sqrt{n}}$, which is much smaller than the minimax-optimal signal detection rate, even though a half-space has a much larger entropy (it is even infinite) and larger dimension than a single point. See Section 3 for an extended discussion of this case. This highlights the fact that for such a testing problem, it is in many cases not the entropy, or size, of the null hypothesis that drives the rate, but rather other properties of the shape of $\mathcal{C}$.
In order to overcome the limitations of this approach, other ideas have been proposed. A first line of work can be found in [3], where the authors consider the general testing problem (1.2), but with separation in the $\|\cdot\|_\infty$-norm instead of the $\|\cdot\|_2$-norm. Since any convex set can be written as an intersection of half-spaces, they rewrite the problem as a multiple testing problem. This approach is quite fruitful, but the $\|\cdot\|_\infty$-norm results translate in a non-optimal way to the $\|\cdot\|_2$-norm in terms of the dependence on the dimension $d$, particularly for large $d$. A second main direction was to consider testing for specific convex shapes, such as the cone of positive, monotone, or convex functions, see e.g. [16], or balls for certain metrics [17, 8]. These papers exhibit the minimax-optimal separation distance (or a near-optimal distance, in some cases of [16] and [17]) for the specific convex shapes that are considered, namely cones and smoothness balls. The models considered in these works differ from ours in that they concern functional estimation; moreover, they do not provide results for more general choices of the null hypothesis. In Sections 3 and 5, we derive results for our model and for shapes related to those of these papers, namely the positive orthant and the Euclidean ball, in order to relate our work to these earlier results. Finally, a last type of result related to our problem concerns the case where the null hypothesis can be parametrised, see e.g. [11], where the authors consider shapes that can be parametrised by a quadratic functional. This approach and their results suggest that the smoothness of the shape of $\mathcal{C}$ has an impact on the testing rate.
In this paper, we take a more general approach to the testing problem (1.2). In Section 3, we expose the range of possible separation rates by demonstrating that, without any further assumptions on $\mathcal{C}$, the statement
$$\frac{1}{\sqrt{n}} \lesssim \rho^*(\mathcal{C}) \lesssim \frac{\sqrt{d}}{\sqrt{n}} \qquad (1.4)$$
is sharp up to $\ln(d)$-factors. After that, in Sections 4 and 5, we investigate the potential of a geometric smoothness property of the boundary of $\mathcal{C}$. Despite its simplicity, this property takes us quite far: in particular, given any separation rate satisfying (1.4), it allows for constructing a set $\mathcal{C}$ exhibiting this rate up to $\ln(d)$-factors.

Setting
Let $d, n \in \mathbb{N}$. We consider the $d$-dimensional statistical model
$$X = \mu + \frac{1}{\sqrt{n}}\,\varepsilon, \qquad (2.1)$$
where $\mu \in \mathbb{R}^d$ is unknown and $\varepsilon$ is a standard Gaussian vector, written $\varepsilon \sim N(O_d, I_d)$. For $k \in \mathbb{N}$, $O_k$ denotes the origin of $\mathbb{R}^k$ and $I_k \in \mathbb{R}^{k \times k}$ the identity matrix. Clearly, by construction, the variance scaling parameter $n$ may also be interpreted as a sample size, since the distribution of $X$ is precisely the distribution of the mean of $n$ i.i.d. observations from $N(\mu, I_d)$. Now, let $\mathcal{C} \subseteq \mathbb{R}^d$ be closed, nonempty and convex. For $x \in \mathbb{R}^d$ we write $\operatorname{dist}(x, \mathcal{C}) := \inf_{z \in \mathcal{C}} \|x - z\|_2$. The open Euclidean ball with center $z \in \mathbb{R}^k$ and radius $r > 0$ is denoted $B_k(z, r)$; moreover, we indicate vector concatenation by $[\cdot, \cdot]$, so that, for instance, $[O_{d-1}, 1] \in \mathbb{R}^d$ is the last standard basis vector $e_d$. Given $\rho > 0$, we are interested in the testing problem
$$H_0 \colon \mu \in \mathcal{C} \qquad \text{vs.} \qquad H_\rho \colon \operatorname{dist}(\mu, \mathcal{C}) \ge \rho, \qquad (2.2)$$
and we write $A_\rho := \{x \in \mathbb{R}^d \mid \operatorname{dist}(x, \mathcal{C}) \ge \rho\}$. Our goal is to find the smallest value of $\rho$ such that testing (2.2) with prescribed total error probability is possible in a minimax sense, i.e. the quantity $\rho^*(\mathcal{C}) := \rho^*(\mathcal{C}, d, n, \eta)$ for some fixed $\eta \in (0, 1)$. Here, a test $\varphi$ is a measurable function $\varphi \colon \mathbb{R}^d \to \{0, 1\}$.
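To make the setting concrete, the following minimal Python sketch simulates one observation from the model (2.1) and evaluates $\operatorname{dist}(X, \mathcal{C})$ for the orthant $\mathcal{C}_{\mathrm{O}} = (-\infty, 0]^d$ studied in Section 3; the helper names `sample_X` and `dist_to_orthant` are ours and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_X(mu, n, rng):
    """Draw one observation X = mu + eps / sqrt(n), eps ~ N(O_d, I_d)."""
    return mu + rng.standard_normal(mu.shape) / np.sqrt(n)

def dist_to_orthant(x):
    """dist(x, C_O) for C_O = (-inf, 0]^d: projecting onto C_O clips
    positive coordinates to 0, so the distance is ||max(x, 0)||_2."""
    return np.linalg.norm(np.maximum(x, 0.0))

d, n = 100, 50
mu = np.zeros(d)            # a null point: mu lies on the boundary of C_O
X = sample_X(mu, n, rng)
print(dist_to_orthant(X))   # fluctuates at scale sqrt(d / n)
```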
In particular, we focus on the dependence of $\rho^*(\mathcal{C})$ on the dimension $d$ and on $n$. In terms of notation, this is done by using the symbols $\lesssim$, $\gtrsim$ and $\asymp$ as follows: for a function $g_{\mathcal{C}}$ that may depend only on $n$ and $d$, we write $\rho^*(\mathcal{C}) \lesssim g_{\mathcal{C}}(d, n)$ if there is a constant $w(\eta) > 0$, depending only on $\eta$, such that $\rho^*(\mathcal{C}) \le w(\eta)\, g_{\mathcal{C}}(d, n)$ for all $d, n \in \mathbb{N}$. We define the symbol $\gtrsim$ analogously (other direction). Finally, if $g_{\mathcal{C}}(d, n) \lesssim \rho^*(\mathcal{C}) \lesssim g_{\mathcal{C}}(d, n)$, we write $\rho^*(\mathcal{C}) \asymp g_{\mathcal{C}}(d, n)$; $g_{\mathcal{C}}(d, n)$ then exhibits the minimax Euclidean separation rate for (2.2), or simply the separation rate.
Remark 2.1. In the proofs of upper bounds on $\rho^*(\mathcal{C})$ it is necessary to consider the type-I and type-II errors $\sup_{\mu \in \mathcal{C}} P_\mu(\varphi = 1)$ and $\sup_{\mu \in A_\rho} P_\mu(\varphi = 0)$ separately, leading to parameters $\alpha, \beta \in (0, \frac{1}{2})$ rather than $\eta$. However, this does not affect the separation rate. For the sake of consistency in notation, we will state the exact constants $w(\eta)$ in upper bounds with $\alpha = \beta = \frac{\eta}{2}$. In these statements and in the proofs, we use the abbreviation $v_x := \ln \frac{1}{x}$, $x > 0$.

A General Guarantee and Extreme Cases
The quantity ρ * (C) clearly depends on C.
Let us first examine a simple, essentially one-dimensional case, namely a half-space.
Theorem 3.1. Let $\mathcal{C}_{\mathrm{HS}} := \{x \in \mathbb{R}^d \mid x_d \le 0\}$. Then, in the testing problem (2.2), we have
$$\rho^*(\mathcal{C}_{\mathrm{HS}}) \asymp \frac{1}{\sqrt{n}}.$$
Remark 3.2. As can be seen in the proof (Section 6.2.1), this testing problem is essentially equivalent to the problem $\mu = 0$ vs. $\mu = \rho$ in dimension $d = 1$, so that, alternatively, the rate $\frac{1}{\sqrt{n}}$ can be obtained by analysing the optimal test in the sense of Neyman-Pearson. Furthermore, and in fact closely related to that, note that the lower bound in the previous theorem is valid for any choice of closed convex set $\mathcal{C}$ such that $\mathcal{C}$ and $\mathbb{R}^d \setminus \mathcal{C}$ are non-empty: indeed, we find this rate by considering a fixed pair of points $(\mu_0, \mu_1) \in \mathcal{C} \times A_\rho$ that minimises the distance between $\mathcal{C}$ and $A_\rho$, i.e. $\|\mu_0 - \mu_1\|_2 = \rho$. This seems to have first been discussed in [6]; other related (classical) literature includes, for instance, [10] and [13].

Now, on the other hand, making no additional assumptions about $\mathcal{C}$, a natural choice of $\varphi$ for solving (2.2) is a plug-in test based on confidence balls. This gives rise to the following general upper bound:

Theorem 3.3. Let $\mathcal{C}$ be an arbitrary closed convex subset of $\mathbb{R}^d$ such that $\mathcal{C}$ and $\mathbb{R}^d \setminus \mathcal{C}$ are non-empty. Then, in the testing problem (2.2), we have $\rho^*(\mathcal{C}) \le w(\eta) \frac{\sqrt{d}}{\sqrt{n}}$ for a constant $w(\eta)$ depending only on $\eta$, and therefore
$$\rho^*(\mathcal{C}) \lesssim \frac{\sqrt{d}}{\sqrt{n}}.$$
Remark 3.4. Note that this upper bound is the rate of estimation of $\mu$ in $\ell_2$-norm in the model (2.1) (see Equation (6.2) in Section 6.1.2).
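As an illustration of the plug-in idea behind Theorem 3.3, here is one possible calibration in Python: since $\operatorname{dist}(X, \mathcal{C}) \le \|X - \mu\|_2 = \|\varepsilon\|_2 / \sqrt{n}$ for every $\mu \in \mathcal{C}$, and $\|\varepsilon\|_2^2 \sim \chi^2_d$, thresholding the plug-in distance at the corresponding quantile controls the type-I error. This is a sketch of one natural choice, not necessarily the exact test used in the proof.

```python
import numpy as np
from scipy.stats import chi2

def plug_in_test(X, dist_to_C, n, alpha=0.05):
    """Plug-in test phi(X) = 1{dist(X, C) >= tau}.

    For mu in C we have dist(X, C) <= ||X - mu||_2 = ||eps||_2 / sqrt(n),
    and ||eps||_2^2 ~ chi^2_d, so tau below gives type-I error <= alpha.
    Note tau is of order sqrt(d / n), matching the rate of Theorem 3.3.
    """
    d = X.shape[0]
    tau = np.sqrt(chi2.ppf(1.0 - alpha, df=d) / n)
    return int(dist_to_C(X) >= tau)
```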
Remark 3.5. From Remark 3.2 and Theorem 3.3 it is clear that
$$\frac{1}{\sqrt{n}} \lesssim \rho^*(\mathcal{C}) \lesssim \frac{\sqrt{d}}{\sqrt{n}}$$
whenever $\mathcal{C}$ is a closed convex subset of $\mathbb{R}^d$ such that $\mathcal{C}$ and $\mathbb{R}^d \setminus \mathcal{C}$ are non-empty.
Given this observation, it is natural to ask whether the upper bound in Theorem 3.3 is also sharp, in the sense that there is a choice of $\mathcal{C}$ that requires the separation rate $\frac{\sqrt{d}}{\sqrt{n}}$, at least up to logarithmic factors. It turns out that the answer is yes when $\mathcal{C}$ is taken to be an orthant:

Theorem 3.6. Let $\eta \in (0, \frac{8}{9})$ and $\mathcal{C}_{\mathrm{O}} := (-\infty, 0]^d$. Then, for the testing problem (2.2), we have, up to $\ln(d)$-factors,
$$\rho^*(\mathcal{C}_{\mathrm{O}}) \gtrsim \frac{\sqrt{d}}{\sqrt{n}}.$$
This result relies heavily on tailoring the priors such that they have a certain number of moments in common. A related application of this approach can be found in the proof of Theorem 1 in [16]; see also, for instance, [7].

A Simple Smoothness-Type Property
Clearly, the two extreme cases $\mathcal{C}_{\mathrm{HS}}$ and $\mathcal{C}_{\mathrm{O}}$ differ significantly with respect to the smoothness of their boundaries. Based on this observation, in order to be able to handle $\rho^*(\mathcal{C})$ more flexibly, we propose to describe convex sets by their boundaries' degree of smoothness, where the boundary of a set $S \subseteq \mathbb{R}^d$ is denoted by $\partial S$ and its closure by $\bar{S} = S \cup \partial S$. To begin with, we examine the potential of the following very simple and purely geometric smoothness concept:

Definition 4.1 ($R$-rounding). Let $R \in (0, \infty)$. A closed set $S \subseteq \mathbb{R}^d$ is called $R$-rounded if for every $x \in \partial S$ there is a $z \in S$ such that $x \in \partial B_d(z, R)$ and $\overline{B_d(z, R)} \subseteq S$.

Note that $R$-rounding is a stronger requirement the higher the value of $R$, i.e. intuitively the degree of the boundary's smoothness grows with $R$. In particular, a half-space is $R$-rounded for every $R \in (0, \infty)$. The definition of $R$-rounding is closely related to the so-called $R$-rolling condition employed in [1]. In fact, $R$-rounding of $S$ is equivalent to saying that $\mathbb{R}^d \setminus S$ fulfils the $R$-rolling condition.
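As a worked example of Definition 4.1, one can check that the closed ball $\overline{B_d(z, R_0)}$ is $R$-rounded exactly when $R \le R_0$; a sketch of the computation:

```latex
% For x \in \partial \overline{B_d(z, R_0)} and R \le R_0, place the
% touching ball of radius R at the center
%   z' := z + (1 - R/R_0)(x - z),
% so that \|x - z'\|_2 = R and, for every y \in \overline{B_d(z', R)},
\begin{align*}
  \|y - z\|_2 \;\le\; \|y - z'\|_2 + \|z' - z\|_2 \;\le\; R + (R_0 - R) \;=\; R_0,
\end{align*}
% i.e. \overline{B_d(z', R)} \subseteq \overline{B_d(z, R_0)}. Conversely,
% for R > R_0, no ball of radius R fits inside \overline{B_d(z, R_0)} at
% all, since its diameter 2R would exceed 2R_0.
```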
Another related concept worth mentioning is the radius of curvature, though the connection is more subtle: the radius of curvature at a point $x \in \partial S$ would be the radius $r$ of the ball $B$ that best fits $\partial S$ in the sense of a common tangential hyperplane of $\partial S$ and $B$ at $x$ and common analytical curvature; see for instance [9]. Hence, it is possible that the infimum $R$ of these radii $r$ with respect to $x \in \partial S$ corresponds to the parameter $R$ in our previous definition. However, we can then still not easily guarantee that the resulting balls $B$ of the form $B_d(z, R)$ are in fact contained in $S$. Since smoothness is usually defined as a local property of a function, we provide a suggestion for how to cast the above concept in that context for a closed convex set $\mathcal{C}$: given any $x \in \partial \mathcal{C}$, without loss of generality (w.l.o.g.) apply a rotation and translation $G$ such that $G(x) = O_d$ and, locally around $O_d$, the boundary of $G(\mathcal{C})$ is the graph of a function $f$ defined on a neighbourhood $B \subseteq \mathbb{R}^{d-1}$ of $O_{d-1}$.

Lemma 4.3. In the situation described in the latter paragraph, if $f$ is twice differentiable on $B$ (i.e. the gradient $\nabla f(\cdot)$ and the Hessian matrix $Hf(\cdot)$ exist), then sufficient conditions for the graph of $f$ to remain below the boundary of the corresponding ball of radius $R$ can be formulated in terms of $\nabla f$ and the extreme eigenvalues of $Hf$, where $\lambda_{\min}(\cdot)$ and $\lambda_{\max}(\cdot)$ denote the lowest and highest eigenvalues of a real symmetric matrix, respectively.

Now, let us examine how the additional assumption of $R$-rounding affects the general upper bound of Theorem 3.3:

Theorem 4.4. If $\mathcal{C}$ is $R$-rounded for some $R \in (0, \infty)$, then for the testing problem (2.2) we have
$$\rho^*(\mathcal{C}) \lesssim \frac{1}{\sqrt{n}} + \frac{\sqrt{d}}{nR}$$
and therefore, taking Theorem 3.3 into account,
$$\rho^*(\mathcal{C}) \lesssim \min\left(\frac{\sqrt{d}}{\sqrt{n}},\; \frac{1}{\sqrt{n}} + \frac{\sqrt{d}}{nR}\right).$$
The following result confirms that this upper bound can be sharp up to $\ln(d)$-factors, namely in the case where $\mathcal{C}$ is taken to be an $R$-inflated orthant:

Theorem 4.5. Let $d \ge 43$, $\eta \in (0, \frac{8}{9})$ and $\mathcal{C} := \{x \in \mathbb{R}^d \mid \operatorname{dist}(x, \mathcal{C}_{\mathrm{O}}) \le R\}$, where $\mathcal{C}_{\mathrm{O}} = (-\infty, 0]^d$ is the orthant from Theorem 3.6. Furthermore, let $R$ lie in a suitable range depending on $d$ and $n$. Then, in the testing problem (2.2), we have, up to $\ln(d)$-factors,
$$\rho^*(\mathcal{C}) \gtrsim \frac{1}{\sqrt{n}} + \frac{\sqrt{d}}{nR},$$
with a constant involving $s = \frac{\sqrt{3}}{28}$.
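For the $R$-inflated orthant of Theorem 4.5, distances can be computed in closed form: since $\mathcal{C}$ is the $R$-enlargement of the convex set $\mathcal{C}_{\mathrm{O}}$, we have $\operatorname{dist}(x, \mathcal{C}) = \max(\operatorname{dist}(x, \mathcal{C}_{\mathrm{O}}) - R, 0)$. A short Python sketch (helper name ours):

```python
import numpy as np

def dist_to_inflated_orthant(x, R):
    """dist(x, C) for C = {y : dist(y, C_O) <= R}, the R-inflated orthant.

    For the R-enlargement of a convex set,
        dist(x, C) = max(dist(x, C_O) - R, 0),
    and for the orthant itself dist(x, C_O) = ||max(x, 0)||_2.
    """
    return max(np.linalg.norm(np.maximum(x, 0.0)) - R, 0.0)
```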

Discussion
The concept of $R$-rounding allows for the construction of hypotheses $\mathcal{C}$ with any separation rate between $\frac{1}{\sqrt{n}}$ and $\frac{\sqrt{d}}{\sqrt{n}}$, up to $\ln(d)$-factors. On the other hand, we must acknowledge that $R$-rounding is too weak a concept to fully describe the difficulty of testing an arbitrary $\mathcal{C}$; an examination of the most natural $R$-rounded set, namely a ball of radius $R$, provides clear evidence of this drawback. The result of Section 5,
$$\rho^*\big(\overline{B_d(z, R)}\big) \asymp \frac{1}{\sqrt{n}} + \min\Big(\frac{d^{1/4}}{\sqrt{n}}, \frac{\sqrt{d}}{nR}\Big),$$
is a direct generalisation of the known rate $\rho^*(\mathcal{C}) \sim \frac{d^{1/4}}{\sqrt{n}}$ in the signal detection setting, see [2]. Therefore, for small $R$, the upper bound of Theorem 4.4 can be far from sharp for the ball; clearly, Theorem 4.5 does not capture this case. As a consequence, future work will be concerned with finding a stronger concept, possibly a localised version of $R$-rounding, that ideally allows for describing $\rho^*(\mathcal{C})$ for any choice of $\mathcal{C}$. However, we suspect this to be quite an ambitious goal.

Techniques for Obtaining Lower Bounds
We employ a classical Bayesian approach for proving lower bounds; see the references in [2] for its origins. We briefly give the main theoretical ingredients of this approach in our setting: let $\nu_0$ be a distribution with $S_0 := \operatorname{supp}(\nu_0) \subseteq \mathcal{C}$ and let $\nu_\rho$ be a distribution with $S_\rho := \operatorname{supp}(\nu_\rho) \subseteq \{\mu \in \mathbb{R}^d \mid \operatorname{dist}(\mu, \mathcal{C}) = \rho\}$ (priors). Dirac priors on some $x \in \mathbb{R}^d$ will be denoted $\delta_x$. Furthermore, for $i \in \{0, \rho\}$, let $P_{\nu_i}$ be the resulting distribution of $X$ given $\mu \sim \nu_i$. Now, we see that for any test $\varphi$,
$$\sup_{\mu \in \mathcal{C}} P_\mu(\varphi = 1) + \sup_{\mu \in A_\rho} P_\mu(\varphi = 0) \;\ge\; 1 - \tfrac{1}{2}\|P_{\nu_\rho} - P_{\nu_0}\|_{\mathrm{TV}},$$
see for instance [2]. This justifies the following reasoning, used for each lower bound proof in the present paper: let $\eta \in (0, 1)$. For any $\rho > 0$ such that either
$$\tfrac{1}{2}\|P_{\nu_\rho} - P_{\nu_0}\|_{\mathrm{TV}} \le 1 - \eta \qquad \text{or} \qquad \int \Big(\frac{dP_{\nu_\rho}}{dP_{\nu_0}}\Big)^2 \, dP_{\nu_0} \le 1 + 4(1 - \eta)^2, \qquad (6.1)$$
it holds that $\sup_{\mu \in \mathcal{C}} P_\mu(\varphi = 1) + \sup_{\mu \in A_\rho} P_\mu(\varphi = 0) \ge \eta$ and thus, for the testing problem (2.2), we have $\rho^*(\mathcal{C}) \ge \rho$.
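In the simplest (two-point) instance of this scheme, with Dirac priors $\nu_0 = \delta_{\mu_0}$ and $\nu_\rho = \delta_{\mu_1}$, the total variation distance between the two Gaussian laws is explicit, and the resulting error bound can be evaluated numerically. A Python sketch (function name ours):

```python
import numpy as np
from scipy.stats import norm

def two_point_error_bound(mu0, mu1, n):
    """Lower bound on the total testing error for the Dirac priors
    nu_0 = delta_{mu0} and nu_rho = delta_{mu1}.

    For Gaussian shifts with variance 1/n,
        (1/2) ||P_{nu_rho} - P_{nu_0}||_TV = 2 Phi(delta / 2) - 1,
    with delta = sqrt(n) ||mu1 - mu0||_2, so any test has total error
    at least 2 Phi(-delta / 2).
    """
    delta = np.sqrt(n) * np.linalg.norm(np.asarray(mu1) - np.asarray(mu0))
    return 2.0 * norm.cdf(-delta / 2.0)

# With rho of order 1/sqrt(n), the bound stays bounded away from zero:
print(two_point_error_bound([0.0], [1.0 / np.sqrt(50)], n=50))
```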

Frequently used Bounds for Expressions Containing Square Roots
We will employ the following bounds on several occasions, which makes it convenient to state them here.
Lemma. For any $a > 0$ and $b > 0$, we have
$$\frac{a}{2\sqrt{a + b^2}} \;\le\; \sqrt{a + b^2} - b \;\le\; \frac{a}{2b} \qquad (6.3)$$
and
$$\sqrt{a + b^2} - b = \frac{a}{\sqrt{a + b^2} + b}. \qquad (6.4)$$
Proof. Firstly, through a Taylor expansion of $\sqrt{a + b^2} - b$ as a function of $a$, we see that there is a $\xi \in (0, a)$ such that
$$\sqrt{a + b^2} - b = \frac{a}{2\sqrt{\xi + b^2}}.$$
Now, with $\xi \ge 0$ and $\xi \le a$ we obtain the upper and lower bounds in (6.3), respectively. Secondly, explicit calculation tells us that
$$\big(\sqrt{a + b^2} - b\big)\big(\sqrt{a + b^2} + b\big) = a,$$
which yields (6.4) and concludes the proof.
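A quick numerical sanity check of the bounds in (6.3), sampling $a, b > 0$ at random:

```python
import numpy as np

rng = np.random.default_rng(1)

# Check a / (2 sqrt(a + b^2)) <= sqrt(a + b^2) - b <= a / (2 b).
for _ in range(10_000):
    a, b = rng.uniform(0.01, 10.0, size=2)
    val = np.sqrt(a + b**2) - b
    assert a / (2 * np.sqrt(a + b**2)) <= val <= a / (2 * b)
```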

Proof of Theorem 3.1

Proof. We prove independently that the order of $\rho^*(\mathcal{C}_{\mathrm{HS}})$ is lower and upper bounded by $\frac{1}{\sqrt{n}}$.

Lower Bound.
In accordance with the framework in Section 6.1.1, we verify that the bound holds in the special case $\nu_0 = \delta_{O_d}$ and $\nu_\rho = \delta_{\rho \cdot e_d}$, where $e_d = [O_{d-1}, 1]$ is the last standard basis vector. Since both the null and alternative hypotheses are simple, the corresponding density functions $F_{\nu_0}(x)$ and $F_{\nu_\rho}(x)$ are readily given and we obtain
$$\int \Big(\frac{dP_{\nu_\rho}}{dP_{\nu_0}}\Big)^2 \, dP_{\nu_0} = e^{n\rho^2}.$$
Therefore, inequality (6.1) is satisfied (with equality) if the latter quantity is equal to $1 + 4(1 - \eta)^2$, i.e. for
$$\rho = \sqrt{\frac{\ln\big(1 + 4(1 - \eta)^2\big)}{n}}.$$
This yields the claim.
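The Gaussian computation behind this step can be made explicit; a sketch, using the moment generating function of a standard Gaussian vector:

```latex
% Sketch of the chi-square computation: with \Delta := \rho e_d and
% \sigma^2 = 1/n, the likelihood ratio between the two Gaussian laws is
%   \frac{dP_{\nu_\rho}}{dP_{\nu_0}}(x)
%     = \exp\!\Big(\frac{\langle x - \mu_0, \Delta\rangle}{\sigma^2}
%                  - \frac{\|\Delta\|_2^2}{2\sigma^2}\Big),
% and, using E\, e^{\langle t, \varepsilon\rangle} = e^{\|t\|_2^2/2} for
% \varepsilon \sim N(O_d, I_d),
\begin{align*}
\int \Big(\frac{dP_{\nu_\rho}}{dP_{\nu_0}}\Big)^2 dP_{\nu_0}
  = \exp\Big(\frac{2\|\Delta\|_2^2}{\sigma^2} - \frac{\|\Delta\|_2^2}{\sigma^2}\Big)
  = e^{\|\Delta\|_2^2/\sigma^2}
  = e^{n\rho^2}.
\end{align*}
```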

Proof of Theorem 3.6
Proof. The arguments of this proof are related to the ones used in [16] and [7]. We decompose the proof into several steps.

1. Choice of priors.
We make use of the following lemma, used and explained in [16]:

Lemma. For any $M \in \mathbb{N}$ and $b > 0$, there are distributions $\nu_0$ and $\nu_1$ satisfying the properties (6.5.I)-(6.5.III); in particular, their first $M$ moments coincide.

For now, let $\nu_0$ and $\nu_1$ be such distributions and $\bar{\nu}_i = \nu_i^{\otimes d}$, $i \in \{0, 1\}$; $M$, $b$ and $\rho$ will be specified later. Furthermore, writing $\sigma^2 = \frac{1}{n}$, let
$$P_{\bar{\nu}_i} := \bar{\nu}_i * N(O_d, \sigma^2 I_d), \qquad i \in \{0, 1\},$$
where $*$ denotes convolution. Clearly, the corresponding density function can be written as
$$F_{\bar{\nu}_i}(x) = \prod_{j=1}^{d} \int \phi(x_j; \mu, \sigma^2) \, d\nu_i(\mu),$$
where $\phi(x; \mu, \sigma^2)$ is the density of $N(\mu, \sigma^2)$. It will be convenient to examine the case $d = 1$, denoted by $P_i$. Note that $P_0$ is in accordance with $P_{\nu_0}$ from Section 6.1.1, but the construction of $\nu_1$ does not warrant the notion of Euclidean distance we are interested in ($\nu_1$ has support inside $\mathcal{C}_{\mathrm{O}}$), hence the slight difference in notation. This technical obstacle is necessary for property (6.5.III), but it can be resolved for a small price, which we explain in the last step of this proof.

2. Controlling the total variation distance.
Based on our construction, we have, for $i \in \{0, 1\}$ and fixed $x \in \mathbb{R}$, the coordinatewise mixture density
$$p_i(x) = \int \phi(x; \mu, \sigma^2) \, d\nu_i(\mu). \qquad (6.6)$$
Then (6.6) in conjunction with (6.5.III) allows us to expand $\frac{1}{2}\|P_1 - P_0\|_{\mathrm{TV}}$ into a series whose first $M$ terms vanish due to the moment-matching property, and we take a moment to upper bound the individual remaining summands, using a bound from [18]. Through Stirling's approximation and elementary manipulation, with $M \ge 32$, the resulting series is summable, and, continuing (6.7), we finally arrive at an upper bound on $\frac{1}{2}\|P_1 - P_0\|_{\mathrm{TV}}$ that decreases in $M$. By direct computation, we now see that for any $\eta \in (0, 1)$ the choice $M = M_\eta$ of order $\ln(d) + c_\eta$ guarantees $\frac{1}{2}\|P_1 - P_0\|_{\mathrm{TV}} \le 1 - \eta$.

3. Application.
Note that the upper bound $\frac{1}{2}\|P_1 - P_0\|_{\mathrm{TV}} \le 1 - \eta$ does not yet formally allow for determining a lower bound on $\rho^*(\mathcal{C})$, since $H_0$ and $H_1$ are not separated in a Euclidean sense. In a final step, we resolve this by a suitable restriction of $H_1$.
Hence, inference from the testing problem discussed in Steps 1 and 2 to the problem with the alternative restricted to $A_\rho$ is valid as long as $\eta \in (0, \frac{8}{9})$ ($\eta$ corresponds to $\eta - \frac{1}{9}$ above).
The following observation concludes the proof: clearly, $H_1$ agrees with $A_\rho$ for a suitable choice of $\rho$, whose order we now determine. Applying Taylor's theorem with Lagrange's remainder yields the required lower bound; on the other hand, we can use a classical eigenvalue representation to obtain the desired upper bound: for some $s \in (0, 1)$, the claim follows by assumption and (6.4).

Proof of Theorem 4.4
Proof. We define a test statistic $T(X)$ adapted to the geometry of $\mathcal{C}$ and a corresponding test of the form $\varphi(X) = \mathbb{1}_{\{T(X) \ge \tau\}}$.
Under the null hypothesis, the distribution of $T(X)$ can be controlled, which tells us how to calibrate $\tau$; clearly, this bound holds uniformly over $\mu \in \mathcal{C}$. Based on the general property
$$P(A + B \ge a + b) \le P(A \ge a) + P(B \ge b)$$
for random variables $A$ and $B$, and by using (6.2.I) and (6.2.II), we finally obtain the rejection threshold $\tau$ for a fixed level $\alpha \in (0, \frac{1}{2})$. On the other hand, w.l.o.g., choose $\mu = [O_{d-1}, -\rho]$. This is valid since, by construction, $\mu$ minimises the distance between $\mathcal{C}$ and $A_\rho$, and $O_d$ represents an arbitrary element of $\partial \mathcal{C}$. Bounding the type-II error at this $\mu$, it is sufficient to ensure that $T(X) \ge \tau$ with sufficiently high probability, which leads to the claimed condition on $\rho$. This concludes the proof.
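The general property invoked here is elementary: if $A + B \ge a + b$, then $A \ge a$ or $B \ge b$, and the union bound applies. A short numerical illustration in Python:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustration of P(A + B >= a + b) <= P(A >= a) + P(B >= b):
# samplewise, {A + B >= a + b} is contained in {A >= a} union {B >= b},
# so the empirical frequencies satisfy the same inequality exactly.
A = rng.standard_normal(100_000)
B = rng.gamma(2.0, size=100_000)
a, b = 1.0, 4.0
lhs = np.mean(A + B >= a + b)
rhs = np.mean(A >= a) + np.mean(B >= b)
assert lhs <= rhs + 1e-12
print(lhs, rhs)
```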

Proof of Theorem 4.5
Proof. This is a variation on the proof of Theorem 3.6. Using the same construction and notation as previously, and taking $d \ge 3$, let now, for $i \in \{0, 1\}$, the priors $\nu_i^{\otimes (d-1)}$ act on the first $d - 1$ coordinates, while the last coordinate is deterministically set to $R$. Since the common deterministic coordinate $\mu_d = R$ is irrelevant for the total variation distance between the resulting distributions $P_0$ and $P_1$, the bounds in Step 2 of the proof of Theorem 3.6 also hold here with $d - 1$ instead of $d$.
The most important modification arises when calculating $\rho$: now, if at least $\frac{d-1}{3}$ of the coordinates take the value $u = \frac{b}{4M^2}$, computing the Euclidean distance of $\mu$ from $\mathcal{C}$ and using (6.3) leads to the claimed lower bound, provided $d$ is large enough in the sense that $M_\eta \le C \ln(d - 1)$ for some $C > 0$, where $M_\eta$ is given in the statement of the theorem. This concludes the proof.

Proof of Theorem 5.1

Proof. W.l.o.g., let $z = O_d$. We prove independently that $\rho^*(\mathcal{C})$ is lower and upper bounded by the right-hand side of (5.1).
Lower Bound. Let $\nu_0 = \delta_{R e_d}$, giving rise to the density function
$$F_{\nu_0}(x) = \Big(\frac{n}{2\pi}\Big)^{d/2} \exp\Big(-\frac{n}{2}\|x - R e_d\|_2^2\Big).$$
On the other hand, for a suitable $h > 0$ specified in a moment, let $\nu_\rho$ be the uniform distribution on
$$P_h := \big\{[v, R] \;\big|\; v \in \{-h, h\}^{d-1}\big\}.$$
Since each element of $P_h$ has Euclidean distance $\sqrt{R^2 + (d-1)h^2} - R$ from $\mathcal{C}$, which should correspond to $\rho$, we set $h^2 = \frac{(R+\rho)^2 - R^2}{d-1}$. This gives rise to the following density function:
$$F_{\nu_\rho}(x) = \Big(\frac{n}{2\pi}\Big)^{d/2} \exp\Big(-\frac{n}{2}(x_d - R)^2\Big) \prod_{i=1}^{d-1} \exp\Big(-\frac{n}{2}(x_i^2 + h^2)\Big) \cosh(n h x_i);$$
the subsequent $\chi^2$-type computation then involves, per coordinate, terms of the form $\exp(-n x_i^2) \cosh^2(n h x_i)$.
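The prior $\nu_\rho$ is straightforward to simulate, which also serves as a check of the distance calibration via $h$. A Python sketch (function name ours):

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_from_P_h(d, R, rho, rng):
    """Draw mu uniformly from P_h = {[v, R] : v in {-h, h}^(d-1)} with
    h^2 = ((R + rho)^2 - R^2) / (d - 1), so that
    dist(mu, B_d(O_d, R)) = sqrt(R^2 + (d-1) h^2) - R = rho."""
    h = np.sqrt(((R + rho) ** 2 - R ** 2) / (d - 1))
    v = h * rng.choice([-1.0, 1.0], size=d - 1)
    return np.concatenate([v, [R]])

d, R, rho = 50, 2.0, 0.3
mu = sample_from_P_h(d, R, rho, rng)
print(np.linalg.norm(mu) - R)   # equals rho up to floating point error
```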