A strong converse bound for multiple hypothesis testing, with applications to high-dimensional estimation

Abstract: In statistical inference problems, we wish to obtain lower bounds on the minimax risk, that is, to bound the performance of any possible estimator. A standard technique for doing this involves the use of Fano's inequality. However, recent work in an information-theoretic setting has shown that an argument based on binary hypothesis testing gives tighter converse results (error lower bounds) than Fano for channel coding problems. We adapt this technique to the statistical setting, and argue that Fano's inequality can always be replaced by this approach to obtain tighter lower bounds that can be easily computed and are asymptotically sharp. We illustrate our technique in three applications: density estimation, active learning of a binary classifier, and compressed sensing, obtaining tighter risk lower bounds in each case.


Introduction
When solving an inference problem, we would like to know whether the algorithm we use is close to optimal. In statistical language, we seek a lower bound on the performance of any estimator over a class of problems (often called the minimax risk). In the language of information theory, such results are called converse results, and give performance bounds satisfied by any communication scheme over a noisy channel.
In the statistics literature, one standard approach to proving converse results is via Fano's inequality (see [1, Theorem 2.11.1]). However, recent work in the information-theoretic literature has shown how to obtain sharper converse bounds than those obtained using Fano's inequality. The resulting improvements can be significant at finite sample size, and give bounds that are close to optimal, as illustrated in the work of Polyanskiy, Poor and Verdú [2]. The present paper shows how the method of [2], although developed for channel coding problems, can be applied in a statistical context to obtain stronger risk lower bounds for high-dimensional inference problems than the standard Fano approach.
We first describe a general framework within which we understand our inference problems, enabling us to obtain a lower bound on the expected value of the loss function (or risk). This framework at first closely follows the treatment and notation of [3, Chapter 2], before we demonstrate how results based on Fano's inequality can be strengthened using a new method based on [2].
Consider an inference problem (possibly non-parametric) where we wish to estimate some quantity θ ∈ F from samples Y = (Y_1, ..., Y_n) generated according to a distribution P_θ(Y). For example, in Section 3 we consider θ to be a probability density chosen from a pre-specified class, and in Section 5 we consider θ to be a k-sparse vector in R^n. Let θ̂ := θ̂(Y) be any estimator of θ and let d(θ̂, θ) represent the loss. We assume that the loss function d is a distance, although (as in [3]) our results hold when d is a semi-distance; that is, d satisfies the triangle inequality and symmetry, though d(θ, θ′) = 0 need not imply that θ = θ′. We wish to obtain lower bounds on the minimax risk

  inf_θ̂ sup_{θ∈F} E_θ[ w(ψ_n^{-1} d(θ̂, θ)) ],   (1)

where ψ_n > 0 is a normalizing sequence and w is any monotonically increasing function with w(0) = 0. For example, we may consider w(u) = u^p for any p > 0 or w(u) = I(u ≥ c) for some c > 0.
We begin with a high-level overview of the Fano's inequality approach to producing lower bounds on (1). First, a set {θ_1, ..., θ_M} ⊆ F is chosen, such that we have a lower bound on the pairwise distance between any two elements of the set. The pairwise distance is measured using the loss function d(•, •). Then, as explained below, any estimator θ̂ can be used to define an M-ary hypothesis test that aims to detect one of {θ_1, ..., θ_M} based on the data Y. Next, the key step is to obtain a lower bound for the error probability associated with this hypothesis test. For a well-constructed set, Fano's inequality can often be used to show that the average error probability of the hypothesis test is bounded away from 0 as n → ∞. In this paper, we present a technique that can be used to show that the error probability in fact approaches 1 as n → ∞ (with the same set). In information theory parlance (see for example [1, P.207]), we prove a "strong converse" result, in contrast to the "weak converse" provided by Fano's inequality.
We now explain the details, following the framework in [3]. For any positive constants A and ψ_n, using Markov's inequality we have

  E_θ[ w(ψ_n^{-1} d(θ̂, θ)) ] ≥ w(A) P_θ( d(θ̂, θ) ≥ Aψ_n ),

allowing us to deduce [3, Eq. (2.5)], which states that

  inf_θ̂ sup_{θ∈F} E_θ[ w(ψ_n^{-1} d(θ̂, θ)) ] ≥ w(A) [ inf_θ̂ sup_{θ∈F} P_θ( d(θ̂, θ) ≥ Aψ_n ) ].   (2)

While applying (2), we typically choose ψ_n as a decreasing function of n to give the desired convergence rate, and A as a constant that can be used to optimize the lower bound. The goal then is to control the bracketed term on the RHS of (2) to obtain a lower bound on the minimax risk. We use the following definition.

Definition (packing set). A set P_{M,d_min} = {θ_1, ..., θ_M} ⊆ F is called a packing set of size M and minimum distance d_min if d(θ_i, θ_j) ≥ d_min for all i ≠ j.
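As a sanity check, the Markov-type reduction above can be verified numerically on a toy loss distribution. This is an illustrative sketch only, not part of the formal argument; the function w, the constants ψ_n and A, and the simulated loss values are all arbitrary choices.

```python
import random

# Toy check of the Markov-type reduction: for monotonically increasing w with
# w(0) = 0, we have w(d/psi) >= w(A) on the event {d >= A*psi}, hence
# E[w(d/psi)] >= w(A) * P(d >= A*psi).  All constants here are arbitrary.
random.seed(0)
w = lambda u: u ** 2
psi, A = 0.1, 1.5
losses = [random.random() * 0.5 for _ in range(10000)]  # simulated loss values

lhs = sum(w(d / psi) for d in losses) / len(losses)     # E[w(d/psi)]
rhs = w(A) * sum(d >= A * psi for d in losses) / len(losses)
assert lhs >= rhs
```

The inequality holds pointwise for each simulated loss value, so it holds for any sample, not just this seed.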
In general, the packing set is not explicitly constructed, but its existence is guaranteed via combinatorial arguments. In Remarks 3.1 and 4.1 below, we give examples where the existence of packing sets of a certain size is guaranteed by applying the Gilbert-Varshamov bound to obtain a set of vertices well-separated in Hamming distance on the high-dimensional binary hypercube. In Remark 5.1, the existence of a packing set is guaranteed via the probabilistic method. We emphasize that we use these existing packing set constructions: our contribution is to provide tighter lower bounds than can be obtained using Fano's inequality. However, it is possible that the resulting risk lower bounds could be improved by a further constant factor, by optimizing over the size of the packing set.
In statistical language, we think of the packing set P_{M,d_min} as multiple hypotheses to be distinguished on the basis of data. An alternative information-theoretic interpretation is to think of P_{M,d_min} as a codebook, that is, a collection of M codewords, one of which is transmitted over a noisy communication channel. Given a packing set P_{M,d_min}, any estimator θ̂ provides a way to distinguish between multiple hypotheses (act as a channel decoder) as follows: given θ̂, we choose î = arg min_j d(θ̂, θ_j), i.e. the index of the closest value in the packing set. In coding theory, this is called the minimum distance decoder.
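A minimal sketch of the minimum distance decoder, here in a one-dimensional toy setting with an absolute-value loss; the packing set and test points are illustrative choices, not taken from any of the constructions discussed later.

```python
def min_distance_decode(theta_hat, packing, d):
    """Return i_hat = argmin over j of d(theta_hat, theta_j)."""
    return min(range(len(packing)), key=lambda i: d(theta_hat, packing[i]))

# Toy packing set in R with the absolute-value distance; d_min = 1, so any
# estimate within d_min/2 = 0.5 of the true theta_i decodes correctly.
packing = [0.0, 1.0, 2.0, 3.0]
dist = lambda a, b: abs(a - b)

assert min_distance_decode(1.2, packing, dist) == 1
assert min_distance_decode(2.4, packing, dist) == 2
```

The second assertion illustrates the triangle-inequality argument used below: an estimate within half the minimum distance of the true codeword is always decoded to it.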
If θ_i is the true value, a simple triangle inequality argument shows that the error event {î ≠ i} implies d(θ̂, θ_i) ≥ d_min/2. Taking d_min = 2Aψ_n, we conclude that the worst-case error probability of the minimum distance decoder can be bounded as

  max_{i∈{1,...,M}} P_{θ_i}( î ≠ i ) ≤ max_{i∈{1,...,M}} P_{θ_i}( d(θ̂, θ_i) ≥ Aψ_n ).   (3)

Although the minimum distance decoder may be suboptimal, we can use this expression to bound the bracketed term on the RHS of (2) in terms of the average error probability of the optimal decoder. Denoting the output of the optimal decoder by i* = i*(Y), we have

  inf_θ̂ sup_{θ∈F} P_θ( d(θ̂, θ) ≥ Aψ_n ) ≥ (1/M) Σ_{i=1}^M P_{θ_i}( i*(Y) ≠ i ).   (4)

The main focus of this paper is to obtain a sharp and easily computable bound for the average error probability on the RHS of (4). In the rest of the paper, we denote the average error probability by

  ǫ_M := (1/M) Σ_{i=1}^M P_{θ_i}( i*(Y) ≠ i ).

A standard technique to bound this average error probability is to use Fano's inequality, which gives the bound [3, Lemma 2.10]

  ǫ_M ≥ 1 − ( (1/M) Σ_{i=1}^M D(P_{θ_i} || P̄) + log 2 ) / log M,   (5)

where P̄ := (1/M) Σ_{i=1}^M P_{θ_i}, and D(P||Q) is the Kullback-Leibler (KL) divergence from distribution P to distribution Q. To apply (5), one typically obtains a bound on the average KL divergence of the form

  (1/M) Σ_{i=1}^M D(P_{θ_i} || P̄) ≤ α log M

for some constant α ∈ (0, 1) (see [3, Section 2.7.1]). Then (5) implies that

  ǫ_M ≥ 1 − α − (log 2)/(log M).   (6)

When log M → ∞ as n → ∞, the bound in (6) is close to (1 − α) for large sample sizes n, meaning that we deduce a weak converse. Substituting (6) in (4) then gives a lower bound on the risk (via (2)). The use of Fano's inequality in this kind of statistical context dates back to the work of Ibragimov and Khasminskii [4].
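For concreteness, the weak-converse computation above can be evaluated numerically: given an average KL divergence of α log M, Fano's inequality (5) yields a bound that approaches 1 − α but never 1, no matter how large M grows. The values of α and M below are arbitrary.

```python
import math

def fano_lower_bound(avg_kl: float, M: int) -> float:
    """Fano's inequality: eps_M >= 1 - (avg_KL + log 2) / log M."""
    return 1.0 - (avg_kl + math.log(2)) / math.log(M)

# With average KL equal to alpha * log M, the bound is 1 - alpha - log2/logM,
# which approaches 1 - alpha (a weak converse) as M grows -- never 1.
alpha = 0.5
bounds = [fano_lower_bound(alpha * math.log(M), M) for M in (2**10, 2**20, 2**40)]
assert bounds == sorted(bounds)            # improves with M ...
assert all(b < 1 - alpha for b in bounds)  # ... but stays below 1 - alpha
```

This gap between 1 − α and 1 is exactly what the strong converse of the next section closes.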
Contributions of this paper: We first derive a result (Theorem 1) that strengthens Fano's inequality. We then apply Theorem 1 to three high-dimensional estimation problems, and show that in each case, the average error probability ǫ_M → 1 as n → ∞, i.e., we obtain a strong converse. In each case, our method replaces the Fano-based part of the argument, which gives a weak converse; for comparison purposes, we use the existing packing set constructions derived in the original papers. The strong converse results are obtained via a lower bound on ǫ_M that is specified in terms of a parameter λ > 0, which can be optimized to strengthen the non-asymptotic lower bound.
The remainder of this paper is organized as follows. In Section 2, we present the lower bound result (Theorem 1), and show how Fano's inequality can be obtained as a special case. In Section 2.1, we discuss prior work related to the results in this paper. In Section 3, we consider a density estimation problem studied by Yu [6], and obtain a strong converse using Theorem 1, in comparison with the Fano-based result in [6]. In Section 4, we apply Theorem 1 to obtain strengthened risk lower bounds for active learning of a binary classifier, a problem studied by Castro and Nowak in [7]. In Section 5, we show how Theorem 1 can be used to improve (by a factor of nearly 8) lower bounds provided by Candès and Davenport [5] for the minimax mean-squared error in compressed sensing.

Lower bound on the Average Error Probability
We bound the average error probability on the RHS of (4) in terms of a different binary hypothesis testing problem. We adopt the notation and formalism of [2] as follows. We consider a random variable S, representing a message chosen uniformly at random from the set {1, ..., M}. The message S is mapped to a codeword θ, using the simple encoder that generates θ = θ_i when S = i. Thus the induced distribution P_θ is uniform over the set {θ_1, ..., θ_M}.
We think of Y = (Y_1, ..., Y_n) as the output of a channel with input θ. The space of Y is denoted by 𝒴, and throughout the paper, we use boldface notation to denote vectors of length n. Using arguments from [2,8], we bound the desired average error probability of the optimal decoder (4) in terms of the Type I error probability of the following binary hypothesis testing problem:

  H_0 : (θ, Y) ∼ P_θ × Q_Y    versus    H_1 : (θ, Y) ∼ P_θ P_{Y|θ},

for some probability distribution Q_Y that does not depend on θ. We assume that P_{Y|θ} and Q_Y have densities p_{Y|θ} and q_Y respectively with respect to some common reference measure μ. In other words, we wish to determine whether θ and Y are independent, or are generated by the true underlying channel model.
In the rest of this paper, for brevity we will often write p θ i (y) to denote p Y|θ (y|θ i ), and q(y) to denote q Y (y).
Theorem 1. Let ǫ_M denote the average error probability of any decoder over channel P_{Y|θ}, for a channel code with input distribution P_θ uniform over the M codewords {θ_1, ..., θ_M}. If {θ_1, ..., θ_M} form a packing set P_{M,2Aψ_n} with minimum distance 2Aψ_n, then the minimax risk satisfies

  inf_θ̂ sup_{θ∈F} E_θ[ w(ψ_n^{-1} d(θ̂, θ)) ] ≥ w(A) ǫ_M,   (9)

where ǫ_M can be bounded from below as follows. Consider any λ > 0, and any probability density q over 𝒴 such that p_{θ_i} is absolutely continuous with respect to q for 1 ≤ i ≤ M. Then,

  ǫ_M ≥ 1 − (1+λ) λ^{−λ/(1+λ)} M^{−λ/(1+λ)} [ (1/M) Σ_{i=1}^M ∫_𝒴 p_{θ_i}(y)^{1+λ} q(y)^{−λ} μ(dy) ]^{1/(1+λ)}.   (10)

Proof of Theorem 1. First, note that the risk lower bound in (9) follows by substituting (4) in (2), and noting that the RHS of (4) is the average error probability ǫ_M. The following lemma is a key ingredient in proving (10), the lower bound on ǫ_M.
Lemma 2.1. With the assumptions and notation of Theorem 1, we have, for any γ > 0,

  ǫ_M ≥ P[ p_{Y|θ}(Y|θ) / q_Y(Y) ≤ γ ] − γ/M,   (11)

where P denotes probability under the true channel model H_1. The proof of the lemma is given in Appendix A. Using I(•) to denote the indicator function, the probability term in (11) can be written as

  P[ p_{Y|θ}(Y|θ)/q_Y(Y) ≤ γ ] = 1 − (1/M) Σ_{i=1}^M ∫_𝒴 p_{θ_i}(y) I( p_{θ_i}(y) > γ q(y) ) μ(dy)
                              ≥ 1 − γ^{−λ} (1/M) Σ_{i=1}^M ∫_𝒴 p_{θ_i}(y)^{1+λ} q(y)^{−λ} μ(dy),

where the inequality holds for any λ > 0 since I(p > γq) ≤ (p/(γq))^λ. Computing the maximum over γ > 0 and rearranging, we get (10).
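The optimization over γ admits a closed form: writing B for the average integral (1/M) Σ_i ∫ p_{θ_i}^{1+λ} q^{−λ} dμ, minimizing γ/M + γ^{−λ}B by calculus gives γ* = (λMB)^{1/(1+λ)}. The following sketch checks this closed form numerically; the values of B, M, and λ are arbitrary.

```python
def strong_converse_bound(B: float, M: int, lam: float) -> float:
    """Closed form of the maximum over gamma of 1 - gamma/M - gamma**(-lam) * B:
    eps_M >= 1 - (1+lam) * lam**(-lam/(1+lam)) * (B / M**lam)**(1/(1+lam))."""
    return 1.0 - (1 + lam) * lam ** (-lam / (1 + lam)) * (B / M ** lam) ** (1 / (1 + lam))

def gamma_form(B: float, M: int, lam: float, gamma: float) -> float:
    """The pre-optimization bound 1 - gamma/M - gamma**(-lam) * B."""
    return 1.0 - gamma / M - gamma ** (-lam) * B

B, M, lam = 5.0, 10**6, 1.0
gamma_star = (lam * M * B) ** (1 / (1 + lam))  # optimal gamma

# The closed form agrees with the gamma-form at gamma_star ...
assert abs(strong_converse_bound(B, M, lam) - gamma_form(B, M, lam, gamma_star)) < 1e-12
# ... and dominates it at other gamma values.
assert all(gamma_form(B, M, lam, g) <= strong_converse_bound(B, M, lam) + 1e-12
           for g in (10.0, 100.0, 1e3, 1e4, 1e5))
# For these toy values the strong-converse bound is already close to 1.
assert strong_converse_bound(B, M, lam) > 0.99
```

With λ = 1 the closed form reduces to 1 − 2(B/M)^{1/2}, the χ²-type bound mentioned in Section 2.1.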
Remark 2.1. The integral in (10) can be expressed in terms of a Rényi divergence as follows:

  ∫_𝒴 p_{θ_i}(y)^{1+λ} q(y)^{−λ} μ(dy) = exp( λ D_{1+λ}(p_{θ_i} || q) ),

where D_{1+λ}(p_{θ_i} || q) is the Rényi divergence of order (1+λ), defined as

  D_{1+λ}(p_{θ_i} || q) := (1/λ) log ∫_𝒴 p_{θ_i}(y)^{1+λ} q(y)^{−λ} μ(dy).

As shown in the active learning example in Sec. 4, one can use upper bounds available for the Rényi divergence [9,10] to obtain lower bounds for ǫ_M.
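As a concrete illustration of the Rényi divergence in the Bernoulli case (the case arising in the active-learning analysis of Appendix D), the following sketch evaluates D_{1+λ} and checks two standard properties: convergence to the KL divergence as λ → 0, and monotonicity in the order. The parameter values are arbitrary.

```python
import math

def renyi_bernoulli(p: float, q: float, lam: float) -> float:
    """Renyi divergence of order 1+lam between Bernoulli(p) and Bernoulli(q):
    (1/lam) * log( p^(1+lam) q^(-lam) + (1-p)^(1+lam) (1-q)^(-lam) )."""
    return (1 / lam) * math.log(p ** (1 + lam) * q ** (-lam)
                                + (1 - p) ** (1 + lam) * (1 - q) ** (-lam))

def kl_bernoulli(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# As lam -> 0, the order-(1+lam) Renyi divergence recovers the KL divergence.
assert abs(renyi_bernoulli(0.6, 0.5, 1e-6) - kl_bernoulli(0.6, 0.5)) < 1e-4
# The Renyi divergence is nondecreasing in its order.
assert renyi_bernoulli(0.6, 0.5, 1.0) >= renyi_bernoulli(0.6, 0.5, 0.5)
```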
Remark 2.2. In Appendix B, we show how Fano's inequality in (5) can be obtained from the lower bound in Theorem 1. Furthermore, the examples in the next three sections show that Theorem 1 yields strictly better lower bounds than the Fano-based approach.

Related work
In [3, Proposition 2.2], Tsybakov gives a result similar to Lemma 2.1. This result can then be used to obtain a lower bound on ǫ_M using the average pairwise χ²-distance between q and the elements of the packing set [3, Theorem 2.6]. This bound is similar to the one obtained by using λ = 1 in Theorem 1. In this paper, we show via two examples (active learning and compressed sensing) that Theorem 1 can be applied with a general λ > 0 to obtain stronger non-asymptotic bounds. Furthermore, as n → ∞, Theorem 1 gives a strong converse (ǫ_M → 1), unlike Fano's inequality.
Birgé [11] gives stronger, but less transparent, bounds than Fano's inequality using Fano-type arguments; again, ǫ M is bounded in terms of an average of Kullback-Leibler divergences, but these are used as the argument for a function, rather than directly substituted.
Note that an alternative approach to hypothesis testing bounds, avoiding the use of Fano's inequality, is given by Assouad [12]. Indeed, the paper [6] makes a detailed comparison between Fano-based bounds and those coming from Assouad's Lemma [12]. However, there is little practical difference observed between these bounds; indeed [6] quotes Birgé [13, p. 279] that Fano "is in a sense more general because it applies in more general situations. It could also replace Assouad's Lemma in almost any practical case . . .". Using Fano's inequality, Yang and Barron [14] obtained order-optimal minimax risk lower bounds that depend only on global metric entropy features of the underlying function class, without explicitly constructing a packing set. The required metric entropy features (bounds on the packing number and covering number) are available from results in approximation theory for many function classes of interest. Guntuboyina [15] obtained a lower bound on the average error probability in terms of general f-divergences, and also generalized the metric entropy results of Yang and Barron [14] to certain f-divergences such as the χ²-divergence. An interesting direction for future work would be to obtain a non-asymptotic result analogous to Theorem 1 for the case where only the global metric entropy features are available.
An important historical remark is that Hayashi and Nagaoka [16] first linked channel coding and binary hypothesis testing, with later work [17] by Hayashi clarifying this approach. Further, the work of Nagaoka [18] used similar ideas to derive strong converse results. The recent work by Vazquez-Vilar et al. [8] also provides results characterizing the average error probability of a channel coding problem in terms of the Type I error of a binary hypothesis testing problem. This link with channel coding has been applied in other contexts to prove strong converse results, including [19], which derived strong converse results for the group testing problem.

Application to density estimation
For the remainder of this paper, we show how Theorem 1 can be applied to a number of high-dimensional estimation problems. In this section we apply Theorem 1 to the following density estimation problem, taken from Yu [6, Example 2, P.431].
Let F be the class of smooth densities on [0, 1] satisfying boundedness and smoothness conditions specified by positive constants a_0, a_1, a_2. The goal is to estimate the density θ from Y = (Y_1, ..., Y_n), where {Y_i} are generated i.i.d. from θ. We want to bound from below the risk of any estimator θ̂_n = θ̂_n(Y), where the loss is measured using the squared Hellinger distance, i.e.,

  d_H²(f, g) = ∫_0^1 ( √f(y) − √g(y) )² dy.

The packing set in [6] is constructed via a hypercube class of densities defined via small perturbations of the uniform density on [0, 1]. Fix a smooth, bounded function g(x) integrating to zero. We partition the unit interval [0, 1] into m subintervals of length 1/m, and perturb the uniform density on each subinterval by a small amount, proportional to a version of g rescaled and translated to lie on that subinterval. That is, for some sufficiently small fixed constant c, we can define functions g_1, ..., g_m accordingly. Considering perturbations of the uniform density by ±{g_j}, define the following hypercube class of densities indexed by τ = (τ_1, ..., τ_m) ∈ {±1}^m:

  f_τ(y) = 1 + c Σ_{j=1}^m τ_j g_j(y).

The bandwidth parameter m will be chosen later as an increasing function of n, to optimize the risk lower bound.
Remark 3.1 (Packing set construction). It is shown in [6, Lemma 4] that there exists a subset A ⊆ {−1, 1}^m of size M ≥ exp(c_0 m), where c_0 ≃ 0.082, whose elements have minimum pairwise Hamming distance at least m/3. It is then shown in [6] that this results in a packing set of densities P_{M,ac²/(3m⁴)}, where ac²/(3m⁴) is a lower bound on the squared Hellinger distance (see (15)) between distinct densities in the packing set. Here a is defined in (17), and c is defined in (18). We use exactly this packing set P_{M,ac²/(3m⁴)} as the set of M codewords {θ_1, ..., θ_M} in Theorem 1.
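The flavour of the Gilbert-Varshamov-type guarantee in Remark 3.1 can be illustrated by a greedy construction for small m. This is only a toy check that a packing of size exp(0.082 m) with pairwise Hamming distance m/3 exists; m = 12 is an arbitrary small choice, and the greedy code is not the construction used in [6].

```python
import itertools
import math

def greedy_packing(m: int, dmin: int):
    """Greedily collect binary vectors with pairwise Hamming distance >= dmin."""
    code = []
    for v in itertools.product((0, 1), repeat=m):
        if all(sum(a != b for a, b in zip(v, c)) >= dmin for c in code):
            code.append(v)
    return code

m = 12
code = greedy_packing(m, m // 3)  # pairwise Hamming distance >= 4

# The greedy code comfortably exceeds the exp(c0 * m) guarantee with c0 = 0.082.
assert len(code) >= math.exp(0.082 * m)
# Double-check the separation property.
assert all(sum(a != b for a, b in zip(x, y)) >= m // 3
           for x, y in itertools.combinations(code, 2))
```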
We now use Theorem 1 to bound the risk. To do this, we first state an explicit bound on the bracketed term in (10), for λ = 1.

Lemma 3.1. Taking q to be the uniform density on [0, 1]^n and identifying each p_{θ_i} with a density f_τ^n(y) = Π_{i=1}^n f_τ(y_i) for τ ∈ A, with λ = 1, the bracketed term in (10) satisfies

  (1/M) Σ_{τ∈A} ∫_{[0,1]^n} f_τ^n(y)² dy ≤ exp( c²an / (2m⁴) ).
Proof.See Appendix C.
Combining Lemma 3.1 with Theorem 1, we deduce the following lower bound.
Proposition 3.2. For any positive constant ν < (c_0/(c²a))^{1/5}, the risk of any estimator θ̂_n satisfies

  inf_{θ̂_n} sup_{θ∈F} E_θ[ d_H²(θ̂_n, θ) ] ≥ (c²a ν⁴ / 6) n^{−4/5} ǫ_M,   (20)

where

  ǫ_M ≥ 1 − 2 exp( −(n^{1/5}/2) ( c_0/ν − c²a ν⁴/2 ) ).   (21)

Therefore, for large n we have

  inf_{θ̂_n} sup_{θ∈F} E_θ[ d_H²(θ̂_n, θ) ] ≥ (c²a/6) (c_0/(c²a))^{4/5} n^{−4/5} (1 − o(1)).   (22)

Proof. We apply Theorem 1 by setting the minimum distance ac²/(3m⁴) of the packing set in Remark 3.1 to 2Aψ_n. Taking A = 1, we obtain ψ_n = c²a/(6m⁴). Taking w to be the identity in (9), we deduce

  inf_{θ̂_n} sup_{θ∈F} E_θ[ d_H²(θ̂_n, θ) ] ≥ (c²a/(6m⁴)) ǫ_M.

Taking λ = 1 in Theorem 1 and using Lemma 3.1, we bound ǫ_M as

  ǫ_M ≥ 1 − 2 M^{−1/2} exp( c²an/(4m⁴) ) ≥ 1 − 2 exp( −(1/2)( c_0 m − c²an/(2m⁴) ) ),   (a)

where inequality (a) is obtained using the fact that the packing set of densities P_{M,ac²/(3m⁴)} has size M ≥ exp(c_0 m), as described in Remark 3.1 above. The result (20) follows by taking m = n^{1/5}/ν.
To obtain the asymptotic bound in (22), we take ν to approach (c_0/(c²a))^{1/5} as n → ∞, but slowly enough to ensure that the exponent on the RHS of (21) is negative, so that ǫ_M tends to 1. For example, we can take ν = (c_0/(c²a))^{1/5} (1 − n^{−1/κ}) for κ > 25.
Remark 3.2. The paper [6] derives Fano-type bounds in this setting: combining Lemmas 3 and 5 of [6] and taking m = n^{1/5}/ν gives the same bound as (20), but with a looser lower bound on ǫ_M given by (24). For the bound (24) to be meaningful, we need ν < c_0, where c_g = c sup_x |g(x)|. The scaling factor c has to be chosen so that c_g < 1.
Thus Proposition 3.2 provides a strong converse (error probability tending to 1), whereas the result (24) extracted from [6] gives a weak converse (error probability bounded away from 0). Our bound also offers greater flexibility in choosing ν and removes the need to control c_g.

Remark 3.3. Theorem 1 can similarly be applied to obtain strong converses for estimating densities belonging to either Hölder or Sobolev classes, strengthening the risk lower bounds described in [3, Sec. 2.6.1].

Application to active learning of a classifier
In this section, we use Theorem 1 to derive strengthened minimax lower bounds for active learning algorithms for a family of classification problems introduced by Castro and Nowak [7] (see also [20], which considered a similar class of problems). We will use the explicit packing set construction of [7], but will modify their notation to be consistent with the rest of our paper.
Consider data of the form Y = (U, V) = ((U_1, V_1), ..., (U_n, V_n)). Each pair (U_r, V_r) consists of a d-dimensional feature vector U_r ∈ R^d (where we assume d ≥ 2) and a binary label V_r ∈ {0, 1}, and is drawn independently from an underlying joint distribution P_{UV} = P_U P_{V|U}. The goal of classification is to predict the value of the label V, given a future observation U. This is done via a function of the form φ(u) = I(u ∈ G), where G is a measurable subset of R^d. Given a U ∈ R^d, the classifier estimates its label as V̂ := V̂(U) := I(U ∈ G). The risk of a classifier is the probability of classification error, given by

  R(G) = P( I(U ∈ G) ≠ V ),

where (U, V) ∼ P_{UV}. It is well-known (see [20]) that, given knowledge of P_{UV}, the Bayes-optimal classifier is

  G* = { u : η(u) ≥ 1/2 },

where η(u) = P_{V|U}(1|u) is the feature conditional probability.
As we do not know P_{UV}, our goal is to estimate G* from data Y, or equivalently to estimate η. The performance of a classifier G_n is measured by the excess risk (or regret) [7, Eq. (1)]

  R(G_n) − R(G*) = ∫_{G_n Δ G*} |2η(u) − 1| dP_U(u),

where Δ represents the symmetric difference between sets. For the remainder of this section, as in [7], we will assume that P_U is supported on [0, 1]^d. It is clear that the difficulty of a classification problem will depend on both the shape of the boundary of G* and the behaviour of (2η(u) − 1) for u close to this boundary. We consider the class of joint distribution functions BF(α, κ, L, c), defined formally in [7, Section IV]. For our purposes it suffices to understand this class as a set of distributions P_{UV} such that: 1. The boundary of G* can be expressed as an α-Hölder smooth function with constant L.
Algorithms that attempt to learn the Bayes-optimal classifier G* from data are categorized as passive or active. In a passive learning algorithm, we aim to learn G* from a pre-specified, possibly random, choice of (U_1, ..., U_n) and the corresponding labels (V_1, ..., V_n). In contrast, an active learning algorithm is one where for each r ≥ 1, we choose U_r based on the previous values (U, V)⁻_r := (U_1, ..., U_{r−1}, V_1, ..., V_{r−1}). This allows us to adaptively probe the boundary of G*. A (randomized) active learning algorithm is defined by a sequence of conditional distributions {P_{U_r|(U,V)⁻_r}}_{r=1}^n, which defines the joint distribution as follows:

  P(U, V) = Π_{r=1}^n P_{U_r|(U,V)⁻_r} P_{V_r|U_r},

where P_{V_r|U_r} ≡ P_{V|U}; in particular, conditioned on U_r, the label V_r is independent of (U_1, ..., U_{r−1}). We assume that for each r, the conditional distribution P_{U_r|(U,V)⁻_r} has a density p_{U_r|(U,V)⁻_r} with respect to Lebesgue measure on [0, 1]^d. Note that active learning algorithms correspond to channel coding with feedback, and to adaptive group testing algorithms in the language of [19]. Passive learning corresponds to channel coding without feedback, and to non-adaptive group testing algorithms.
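The power of adaptivity can be illustrated in the simplest possible setting: learning a one-dimensional threshold classifier with noiseless labels by bisection, where each query U_r depends on the previous responses. This toy sketch (the threshold value and query budget are arbitrary, and it uses d = 1 with no label noise, unlike the setting above) localizes the boundary to width 2^{−n} after n queries, versus roughly 1/n for passive uniform sampling.

```python
def active_threshold_learner(true_t: float, n: int) -> float:
    """Bisection sketch of active learning for a 1-D threshold classifier:
    each query point is chosen adaptively from the previous (noiseless) labels."""
    lo, hi = 0.0, 1.0
    for _ in range(n):
        u = (lo + hi) / 2            # adaptive query U_r
        v = 1 if u >= true_t else 0  # noiseless label from the oracle
        if v:
            hi = u                   # boundary lies at or below u
        else:
            lo = u                   # boundary lies strictly above u
    return (lo + hi) / 2

# n adaptive queries localise the boundary to width 2**-n.
est = active_threshold_learner(0.3, 20)
assert abs(est - 0.3) < 2 ** -19
```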
We provide lower bounds on the excess risk of active learning algorithms that strengthen those in [7, Theorem 3], but our techniques can also be applied to the results corresponding to the passive case in [7, Theorem 4]. We use the packing set constructed in [7], which is defined via a hypercube class of joint distributions on (U, V). Fix an integer m (to be chosen later as a function of n). For each vector τ ∈ {0, 1}^{m^{d−1}}, Castro and Nowak [7, Appendix C] construct a unique distribution of (U, V) whose feature conditional probability is denoted by η_τ(u), with corresponding Bayes classifier denoted by G*_τ. We denote this hypercube class of 2^{m^{d−1}} distributions by F_m. Each distribution in F_m has the same U-marginal P_U, so the joint distribution is determined by the conditional distribution P_{V|U}. The conditional distributions in F_m (equivalently, the feature conditional probabilities η_τ(u) for each τ ∈ {0,1}^{m^{d−1}}) are not explicitly defined here, but the definition can be found in the displayed equation at the foot of [7, P.2350]. The definition ensures that the hypercube class satisfies F_m ⊆ BF(α, κ, L, c).
Remark 4.1 (Packing set construction). The packing set defined in [7, Appendix C] is a subset P_{M+1,β_m/8} of the distributions in F_m, of size M + 1, where M ≥ 2^{m^{d−1}/8}, with minimum pairwise distance β_m/8; here β_m = LHm^{−α}, and H = ‖h‖_1 is the norm of a suitable smooth function h. Furthermore, P_{M+1,β_m/8} contains the function η_0, corresponding to the point τ = (0, 0, ..., 0) in the hypercube. We use the other M functions in the packing set P_{M+1,β_m/8} to act as the M codewords {θ_1, ..., θ_M} in Theorem 1. The Bayes classifiers corresponding to these codewords are denoted by G*_1, ..., G*_M.
As in Section 3, we prove an explicit bound on the bracketed term in (10) of Theorem 1, with (u, v) corresponding to y in (10).
Proof.See Appendix D.
Combining Lemma 4.1 with Theorem 1, we deduce the following lower bound.
Proposition 4.2. Let ρ = (d − 1)/α. For any positive constant ν, the risk of a classifier G_n learnt via any active learning algorithm satisfies the lower bound (28), where ǫ_M is bounded as in (29). Therefore, for large n, we have the asymptotic bound (30).

Proof. Using Lemma 4.1 in Theorem 1, for any λ > 0 the average error probability ǫ_M can be bounded from below as in (31), where inequality (a) is obtained using the fact that the packing set of distributions P_{M+1,β_m/8} has M ≥ 2^{m^{d−1}/8}, as described in Remark 4.1 above. Defining f(ψ_n) := min{4cκ2^κ ψ_n^κ, ψ_n}, we obtain a chain of inequalities in which inequality (b) follows from (4). Hence, using ψ_n = LHm^{−α}/16 together with (31), the result follows by taking m = n^{1/(α(2κ−2)+d−1)}/ν. To get the asymptotic risk bound (30), we choose the supremum of ν such that ǫ_M → 1 as n → ∞, in order to obtain the largest possible prefactor in (9).

Remark 4.2. The paper [7] derives Fano-type bounds in this setting: in particular, taking m = n^{1/(α(2κ−2)+d−1)}/ν, the computation on p. 2351 of [7] together with Theorem 6 of that paper gives the same bound as (28), but with a looser lower bound on ǫ_M given by (33), where ξ = 256 c²(LH)^{2κ−2} ν / log 2. For the bound (33) to be meaningful, we need ξ < 1/2, which implies ν < log 2 / (512 c²(LH)^{2κ−2}). Again, Proposition 4.2 provides a strong converse, while (33) provides a weak one, with ǫ_M bounded away from zero.

Application to compressed sensing
We now describe how Theorem 1 can give improved risk lower bounds in compressed sensing. The goal in compressed sensing is to estimate a sparse vector x ∈ R^n from a measurement y ∈ R^m of the form

  y = Ax + w.   (34)
Here A ∈ R^{m×n} is the (known) measurement matrix, and w ∼ N(0, σ²I_m) is the noise vector. We assume that the signal x is k-sparse. In particular, we consider x ∈ Σ_k, where

  Σ_k := { x ∈ R^n : ‖x‖_0 ≤ k }

is the set of vectors with at most k non-zero entries. In the pioneering works [21-23] of Candès, Donoho, Romberg, and Tao, among others, it was shown that under suitable assumptions on A and the sparsity level k, the signal can be estimated to a high degree of accuracy with an efficient algorithm, even when m ≪ n. For example, when A satisfies the Restricted Isometry Property [24], reconstruction techniques based on minimizing the ℓ_1 norm produce an estimate x̂ whose error is within a universal constant factor C_0 of the noise level with high probability, provided that m is of order at least k log(n/k) [25]. Here and throughout this section, ‖•‖ represents the Euclidean (L_2) norm.

Footnote 1 (to the proof of Proposition 4.2): As m ≫ 1 and κ ≥ 1, we assume for brevity that f(ψ_n) = 4cκ2^κ ψ_n^κ. This is always true for sufficiently large m when κ > 1. If κ = 1 and c > 1/2, then f(ψ_n) = ψ_n; however, the definition of c in [7, Eq. (9)] implies that c can be restricted to (0, 1/2] without loss of generality.
To complement these achievability results, several authors, e.g., [5, 26-28], have derived lower bounds on the minimax risk under various assumptions on A and x. The minimax risk is defined as

  M*(A) := inf_{x̂} sup_{x∈Σ_k} E ‖x̂(y) − x‖².

Here, we consider the lower bound on M*(A) obtained by Candès and Davenport [5], as it holds for general A, k, and n. We show how Theorem 1 can be used to obtain a strong converse, improving by a constant factor the lower bound obtained using Fano's inequality in [5].
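To make the measurement model and the risk quantity concrete, the following sketch simulates y = Ax + w for a k-sparse x and evaluates an oracle least-squares estimator that is told the true support. This benchmark is illustrative only: the dimensions and Gaussian design are arbitrary choices, and the oracle is not the ℓ_1-based estimator discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k, sigma = 200, 60, 5, 0.1  # arbitrary illustrative dimensions

# k-sparse signal and a Gaussian measurement matrix.
x = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x[support] = rng.normal(0, 1, k)
A = rng.normal(0, 1 / np.sqrt(m), (m, n))
y = A @ x + rng.normal(0, sigma, m)  # the model y = Ax + w

# Oracle least squares on the true support: a benchmark estimator, not the
# l1-based reconstruction discussed in the text.
x_hat = np.zeros(n)
x_hat[support], *_ = np.linalg.lstsq(A[:, support], y, rcond=None)

assert np.linalg.norm(x_hat - x) < 1.0  # far smaller than a typical ||x||
```

Even this oracle incurs an error of order σ√k, which is the kind of floor the minimax lower bounds of this section formalize.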
Remark 5.1 (Packing set construction). Using the probabilistic method, [5] shows the existence of a packing set of well-separated vectors in Σ_k: a set X = {u_1, ..., u_M} ⊆ Σ_k, with M = (n/k)^{k/4}, whose elements satisfy, in particular:

2. ‖u_i − u_j‖_2 ≥ 1/√2 for all 1 ≤ i, j ≤ M such that i ≠ j;

3. a spectral bound, involving a constant β, on the average outer product (1/M) Σ_{i=1}^M u_i u_i^T.

Here β is a constant that can be made arbitrarily small with growing n.
The set X gives a packing set P_{M,C/√2} := {θ_1, ..., θ_M} of codewords with minimum distance ‖θ_i − θ_j‖ ≥ C/√2, simply by taking θ_i = Cu_i, where the value of C will be specified later. Observe that here we take the distance d to be that defined by the Euclidean norm ‖•‖.
In fact, we will consider a subset of the packing set P_{M,C/√2}, of size M′, defined via the following lemma, where M = (n/k)^{k/4}. Then there exists a subset P_{M′,C/√2} ⊆ P_{M,C/√2} on which the quantities ‖Aθ_i‖² are suitably controlled.

Proof. We first bound the average over the packing set P_{M,C/√2}, given by (1/M) Σ_{i=1}^M ‖Aθ_i‖². Using steps similar to those on [5, p.320], we obtain a chain of bounds in which step (a) holds because both (A^T A) and Σ_{i=1}^M u_i u_i^T / M are positive semi-definite, and step (b) is obtained using Property 3 of the packing set as defined in Remark 5.1.

We then use the following fact: if the average of M non-negative numbers c_1 ≤ c_2 ≤ ... ≤ c_M is c, then c_j ≤ c / (1 − (j−1)/M) for 1 ≤ j ≤ M (otherwise the sum of the (M − j + 1) largest numbers would exceed Mc). The result then follows by picking M′ elements of P_{M,C/√2} in increasing order of ‖Aθ‖², and calling this set P_{M′,C/√2}.

As we restrict attention to the subset P_{M′,C/√2} in the rest of this section, with mild abuse of notation, let us denote its elements by {θ_1, ..., θ_M′}. Also, let φ(y; μ, Σ) denote the normal density in R^m with mean vector μ and covariance matrix Σ. Then, for any θ_i, from the measurement model (34), we have

  p_{θ_i}(y) = φ(y; Aθ_i, σ²I_m),

where the rth row of A is denoted by A_r^T ∈ R^n. Further, we choose

  q(y) = φ(y; Aθ̄, (1+∆)σ²I_m),

for a constant ∆ > 0, where θ̄ ∈ R^n will be specified later. With these choices of p_{θ_i} and q, we can prove the following bound for the integral in (10).
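The elementary averaging fact used in the proof above can be checked numerically; the random inputs and sample size below are arbitrary.

```python
import random

random.seed(1)
M = 1000
vals = sorted(random.random() for _ in range(M))  # nonnegative, sorted ascending
c = sum(vals) / M                                 # their average

# If c_1 <= ... <= c_M are nonnegative with average c, then
# c_j <= c / (1 - (j-1)/M): otherwise the M - j + 1 largest values alone
# would already sum to more than M*c.
assert all(vals[j - 1] <= c / (1 - (j - 1) / M) + 1e-12 for j in range(1, M + 1))
```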
Combining Lemma 5.2 with Theorem 1, we deduce the following lower bound.
To obtain (43), we recall from Remark 5.1 that β can be chosen arbitrarily small as n → ∞. Furthermore, λ and ∆ can also be arranged to go to 0 (at suitably slow rates) as n → ∞.
Remark 5.2. The paper [5] uses Fano's inequality to derive a corresponding lower bound on the minimax risk. Comparing it with Proposition 5.3, we see that our result improves the bound by a factor close to 8 for large n.

Appendices
A Proof of Lemma 2.1

Consider two hypotheses on the pair (X, Y) ∈ 𝒳 × 𝒴:

  H_0 : (X, Y) ∼ Q,    H_1 : (X, Y) ∼ P,

where we assume that P, Q ≪ μ for some measure μ, so that dP = p(x, y)dμ and dQ = q(x, y)dμ.
We use this result to complete the proof of Lemma 2.1.

Proof of Lemma 2.1. As in [2, Theorem 26], let ǫ_M and ǫ′_M denote the average error probabilities over channels P_{Y|θ} and Q_{Y|θ} = Q_Y, respectively, for a channel code with M equiprobable codewords. Given (θ, Y), the result [2, Theorem 26] describes a (sub-optimal) hypothesis test based on the channel decoder to distinguish between H_0 : (θ, Y) ∼ Q_{θY} = P_θ Q_Y and H_1 : (θ, Y) ∼ P_{θY} = P_θ P_{Y|θ}. Let T ∈ {0, 1} denote the output of this test. It is shown in the proof of that theorem that the probability of Type I error, i.e., Q[T = 1], is 1 − ǫ′_M, and the probability of Type II error is P[T = 0] = ǫ_M. Applying Lemma A.1 to this hypothesis test yields that, for any γ > 0,

  ǫ_M ≥ P[ p(θ, Y)/q(θ, Y) ≤ γ ] − γ (1 − ǫ′_M).

The result follows by observing that when Q_{Y|θ} = Q_Y, any channel decoder has average error probability ǫ′_M ≥ 1 − 1/M, so that 1 − ǫ′_M ≤ 1/M, which gives exactly (11).

B Recovering Fano's Inequality from Theorem 1
Here we show how to obtain Fano's inequality from Theorem 1. From the variational representation of the lower bound in (13), for any λ, γ > 0 we have the bound (44). We first establish a general converse result involving mutual information (equation (52)), and then obtain Fano's inequality from it.
For λ > 0, we write the last term on the RHS of (44) in terms of a Hellinger divergence of order 1 + λ as follows.
C Proof of Lemma 3.1

Since q is the uniform density and each p_{θ_i} corresponds to an f_τ^n, for each value of i, we can express the integral in (56) as

  ∫_𝒴 (p_{θ_i}(y))² (q(y))^{−1} μ(dy) = [ ∫_0^1 f_τ(y)² dy ]^n.

For any τ, the bracketed term on the RHS of (57) can be evaluated explicitly; here equality (a) is obtained from (18) by noting that g_j(y)g_k(y) ≡ 0 for j ≠ k, and by (17).

D Proof of Lemma 4.1
Proof of Lemma 4.1. We can express the key ratio in the LHS of (27) as a product over r, where the inner integral I_n can be bounded using the fact that the Rényi divergence of order (1 + λ) between two Bernoulli random variables with parameters η_τ(u_n) and η_0(u_n), respectively, is

  (1/λ) log( η_τ(u_n)^{1+λ} η_0(u_n)^{−λ} + (1 − η_τ(u_n))^{1+λ} (1 − η_0(u_n))^{−λ} ).

Recalling that u_n ∈ [0, 1]^d, let us denote the dth coordinate of u_n by u_{n,d}. Also recall that β_m = LHm^{−α}. The construction of the hypercube class of functions F_m in [7, p.2350] ensures that, for any τ, τ′ ∈ {0, 1}^{m^{d−1}}, the following properties hold.

Lemma 4.1. For a given active learning algorithm described by the conditional densities p_{U_r|(U,V)⁻_r}, r = 1, ..., n, the bracketed term in (10) satisfies the bound (27).

Consider the M codewords chosen from the packing set P_{M+1,β_m/8}, as described in Remark 4.1, with corresponding Bayes classifiers G*_1, ..., G*_M. The minimum pairwise set distance between these Bayes classifiers is at least β_m/8. Equating the minimum distance of the packing set, given by 2Aψ_n in Theorem 1, to β_m/8 and taking A = 1, we obtain ψ_n = β_m/16 = LHm^{−α}/16.