Discussion of “Hypothesis testing by convex optimization”

Testing composite statistical hypotheses is a very difficult area of mathematical statistics, and optimal solutions are found only in rare cases. It is precisely in this respect that the paper "Hypothesis testing by convex optimization" brings new insight and a powerful contribution. The optimality of a solution depends strongly on the criterion adopted for measuring the risk of a statistical procedure. In our opinion, the novelty here lies in the introduction of a new criterion, different from the usual one. In the present discussion, we give more precise details on the main results, in order to highlight the strengths and the limits of the new theory.


Introduction
We congratulate the authors for this very stimulating paper. Testing composite statistical hypotheses is a very difficult area of mathematical statistics, and optimal solutions are found only in rare cases. It is precisely in this respect that the present paper brings new insight and a powerful contribution. The optimality of a solution depends strongly on the criterion adopted for measuring the risk of a statistical procedure. In our opinion, the novelty here lies in the introduction of a new criterion, different from the usual one (compare criteria (2.1) and (2.2) below). With this new criterion, a minimax optimal solution can be obtained for rather general classes of composite hypotheses and for a vast class of statistical models. This solution is nearly optimal with respect to the usual criterion. The most remarkable results are contained in Theorem 2.1 and Proposition 3.1 and are illustrated by numerous examples.
In what follows, we give more precise details on the main results, in order to highlight the strengths and the limits of the new theory.

Theorem 2.1
In this paper, the authors consider a parametric experiment $(\Omega, (P_\mu)_{\mu \in M})$ where the parameter set $M$ is a convex open subset of $\mathbb{R}^m$. From one observation $\omega$, it is required to build a test deciding between two composite hypotheses $H_X : \mu \in X$ and $H_Y : \mu \in Y$, where $X, Y$ are convex compact subsets of $M$. The assumptions on the subsets $X, Y$ are thus quite general and the problem is treated as symmetric (no distinction is made between the hypotheses, such as choosing $H_0$ versus $H_1$). We come back to this point later on.

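A test $\varphi$ is identified with a random variable $\varphi : \Omega \to \mathbb{R}$, and the decision rule is based on the sign of $\varphi(\omega)$: $H_X$ is accepted when $\varphi(\omega) \ge 0$ and $H_Y$ is accepted when $\varphi(\omega) < 0$. The theory does not consider randomized tests, which are found to be optimal tests in the classical theory for discrete observations.
The usual risk of a test is based on the following two errors:
$$\varepsilon_X(\varphi) = \sup_{x \in X} P_x(\varphi(\omega) < 0), \qquad \varepsilon_Y(\varphi) = \sup_{y \in Y} P_y(\varphi(\omega) \ge 0), \qquad R(\varphi) = \max\{\varepsilon_X(\varphi), \varepsilon_Y(\varphi)\}. \tag{2.1}$$
We use the notation $\varepsilon_X(\varphi)$, $\varepsilon_Y(\varphi)$ to stress the dependence on $\varphi$. An optimal minimax test with respect to the classical criterion would be a test achieving $\min_\varphi R(\varphi)$. The authors introduce another risk $r(\varphi)$, larger than $R(\varphi)$ by the simple Markov inequality:
$$r(\varphi) = \max\Big\{\sup_{x \in X} \mathbb{E}_x\big[e^{-\varphi(\omega)}\big],\ \sup_{y \in Y} \mathbb{E}_y\big[e^{\varphi(\omega)}\big]\Big\} \ge R(\varphi). \tag{2.2}$$
The main result contained in Theorem 2.1 is that there exists an optimal minimax solution with respect to the augmented criterion: a triplet $(\varphi_*, x_*, y_*)$ exists that achieves the minimax value
$$\varepsilon_* = \min_{\varphi \in F} r(\varphi). \tag{2.3}$$
Such a triplet can be explicitly computed in classical examples (discrete, Gaussian, Poisson), together with the value $\varepsilon_*$. Moreover, the usual error probabilities satisfy $\varepsilon_X(\varphi_*) \le \varepsilon_*$ and $\varepsilon_Y(\varphi_*) \le \varepsilon_*$.
Nevertheless, conditions on the parametric family are required and the optimal test is to be found in a specific class $F$ of tests. The whole setting (conditions 1.-4.) must be "a good observation scheme", which roughly states:
• First, the parametric family must be an exponential family with its natural parameter set.
• The class $F$ of possible tests is a finite-dimensional linear space of continuous functions, and contains the constants and all Neyman-Pearson statistics $\log(p_\mu(\cdot)/p_\nu(\cdot))$. This is not a surprising assumption, but it means that the class of tests among which an optimal solution is to be found is exactly the class of affine functions of the sufficient statistic.
• Condition 4. is more puzzling: the class $F$ must be such that, for all $\varphi \in F$, the function $F_\varphi(\mu) = \log \mathbb{E}_\mu\big[e^{\varphi(\omega)}\big]$ is defined and concave in $M$.
Because of (2.4) and (2.5), the computation of $F_\varphi(\mu)$ is easy and the solution of (2.3) is obtained by solving a simple optimization problem. Theorem 2.1 provides the solution $\varphi_* = \tfrac{1}{2} \log(p_{x_*}/p_{y_*})$, and a remarkable result is that $\varepsilon_* = \rho$, where $\rho$ is the Hellinger affinity of $P_{x_*}$, $P_{y_*}$.
An important point too concerns the translated detectors $\varphi_*^a := \varphi_* - a$. Equation (4), p. 4, shows that by a translation, the probabilities of wrong decision can be upper-bounded as follows:
$$\varepsilon_X(\varphi_*^a) \le e^{a}\, \varepsilon_*, \qquad \varepsilon_Y(\varphi_*^a) \le e^{-a}\, \varepsilon_*.$$
Therefore, by an appropriate choice of $a$, one can easily break the symmetry between the hypotheses, choose $H_0$ and $H_1$, and reduce one error while increasing the other. Examples (2.3.1, 2.3.2, 2.3.3) are particularly illuminating. All computations are easy and explicit.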
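As an illustration of the detector, of the Hellinger-affinity value and of the effect of a translation, here is a minimal sketch in the simplest Gaussian setting. It assumes $\omega \sim N(\mu, I)$ and two simple hypotheses $X = \{x\}$, $Y = \{y\}$ with $x \ne y$ (so that $x_* = x$, $y_* = y$), and the convention that $H_X$ is accepted when the detector is nonnegative; under that convention the bounds $e^{a}\varepsilon_*$ and $e^{-a}\varepsilon_*$ follow from the Markov inequality. The function names are hypothetical and are not taken from the paper.

```python
import math


def normal_cdf(t):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def gaussian_detector(x, y):
    """Return (phi, eps_star) for omega ~ N(mu, I), testing mu = x against mu = y.

    phi(omega) = 0.5 * log(p_x(omega) / p_y(omega)) is affine in omega, and
    eps_star = exp(-|x - y|^2 / 8) is the Hellinger affinity of N(x, I) and N(y, I).
    """
    diff = [xi - yi for xi, yi in zip(x, y)]
    const = 0.25 * (sum(yi * yi for yi in y) - sum(xi * xi for xi in x))
    dist2 = sum(d * d for d in diff)

    def phi(omega):
        return 0.5 * sum(d * w for d, w in zip(diff, omega)) + const

    return phi, math.exp(-dist2 / 8.0)


def shifted_errors(x, y, a=0.0):
    """Exact errors of the shifted detector phi - a, with their Markov-type bounds.

    With d2 = |x - y|^2 > 0, phi(omega) is N(d2/4, d2/4) under mu = x
    and N(-d2/4, d2/4) under mu = y.
    """
    d2 = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    mean, sd = d2 / 4.0, math.sqrt(d2) / 2.0
    eps_star = math.exp(-d2 / 8.0)
    return {
        "err_x": normal_cdf((a - mean) / sd),      # P_x(phi - a < 0)
        "bound_x": math.exp(a) * eps_star,
        "err_y": normal_cdf(-(a + mean) / sd),     # P_y(phi - a >= 0)
        "bound_y": math.exp(-a) * eps_star,
    }


# Usage: a positive shift a increases the error under H_X and decreases it under H_Y.
for shift in (0.0, 0.3):
    print(shift, shifted_errors([1.0, 0.0], [-1.0, 0.0], shift))
```

In this toy case the exact errors are available in closed form, so the gap between the usual errors and the bound $\varepsilon_*$ can be read off directly.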
The extension of the theory to repeated observations is relatively straightforward due to the exponential structure of the parametric model and we will not discuss it.

Proposition 3.1
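Another very powerful result concerns the case where one has to decide not only between a pair of hypotheses but among more than two hypotheses. The testing of unions is an impressive result. The problem is to decide between
$$H_X : \mu \in X = \bigcup_i X_i \qquad \text{and} \qquad H_Y : \mu \in Y = \bigcup_j Y_j$$
on the basis of one observation $\omega \sim p_\mu$. Here, $X_i$, $Y_j$ are subsets of the parameter set. The authors consider tests $\varphi_{ij}$ available for the pair $(H_{X_i}, H_{Y_j})$ whose risk, in the sense of the criterion $r$, is bounded by $\varepsilon_{ij}$ (for instance the optimal tests of Theorem 2.1, but any other test would suit) and define the matrices $E = (\varepsilon_{ij})$ and
$$H = \begin{pmatrix} 0 & E \\ E^\top & 0 \end{pmatrix}.$$
An explicit test $\varphi$ for deciding between $H_X$ and $H_Y$ is built using an eigenvector of $H$, and the risk $r(\varphi)$ of this test is precisely controlled: it is smaller than $\|E\|_2$, the spectral norm of $E$.
A very interesting application, which is illustrated by numerous examples, is when $H_X = H_0 : \mu \in X_0$ is one hypothesis (not a union) and $H_Y = \bigcup_{j=1}^n H_j : \mu \in \bigcup_{j=1}^n Y_j$ is a union. One simply builds, for $j = 1, \dots, n$, the optimal tests $\varphi_*(H_0, H_j) := \varphi_{0j}$ with errors bounded by $\varepsilon_{0j}$ (obtained, for instance, by Theorem 2.1). The matrix $E$ is the $(1, n)$ matrix $E = [\varepsilon_{01} \dots \varepsilon_{0n}]$. As $E^\top E$ has rank 1, its only non-null eigenvalue is equal to $\mathrm{Tr}(E^\top E) = \sum_{j=1}^n \varepsilon_{0j}^2 = (\|E\|_2)^2$. The eigenvector of $H$ is given, up to normalization, by $[1, \varepsilon_{01}/\|E\|_2, \dots, \varepsilon_{0n}/\|E\|_2]^\top$. Then, the optimal test for $H_0$ versus the union $\bigcup_{j=1}^n H_j$ is explicitly given by
$$\varphi = \min_{1 \le j \le n} \big\{\varphi_{0j} + \log(\|E\|_2/\varepsilon_{0j})\big\}.$$
Then, given a value $\varepsilon$, if it is possible to tune the test $\varphi_{0j} := \varphi_{0j}(\varepsilon)$ of $H_0$ versus $H_j$ to have a risk less than $\varepsilon/\sqrt{n}$, then the resulting test for the union, $\varphi(\varepsilon)$, has risk less than $\varepsilon$. This is remarkable: if we consider instead the test $\min_{1 \le j \le n} \{\varphi_{0j}\}$ for $H_0$ versus the union $\bigcup_{j=1}^n H_j$, we only have
$$r\Big(\min_{1 \le j \le n} \varphi_{0j}\Big) \le \sum_{j=1}^n \varepsilon_{0j}.$$
To get a risk bounded by $\varepsilon$, one would then have to tune the test $\varphi_{0j}$ of $H_0$ versus $H_j$ to have a risk less than $\varepsilon/n$.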
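The gain from $\varepsilon/n$ to $\varepsilon/\sqrt{n}$ in the case of one hypothesis against a union of $n$ hypotheses can be made concrete with a small numerical sketch. It assumes, as the $\varepsilon/n$ requirement suggests, that the risk of the plain minimum of the pairwise tests is controlled by the sum of the individual bounds, whereas the eigenvector-based test is controlled by the spectral norm of the $(1, n)$ matrix $E$. The function names are hypothetical.

```python
import math


def spectral_norm_row(eps):
    """Spectral norm of the 1 x n matrix E = [eps_01, ..., eps_0n]."""
    return math.sqrt(sum(e * e for e in eps))


def sum_bound(eps):
    """Bound obtained by adding up the individual risk bounds."""
    return sum(eps)


def per_test_targets(eps_total, n):
    """Risk each pairwise test must reach so that the aggregated bound is eps_total."""
    return {"spectral": eps_total / math.sqrt(n), "sum": eps_total / n}


# Usage: with n = 100 alternatives and a target risk of 0.05, compare the two requirements.
targets = per_test_targets(0.05, 100)
print(targets)
print(spectral_norm_row([targets["spectral"]] * 100), sum_bound([targets["sum"]] * 100))
```

Both aggregated bounds come out at the target value, but the requirement on each pairwise test is weaker by a factor $\sqrt{n}$ under the spectral-norm bound.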

Discussion
• The theory is restricted to exponential families of distributions with natural parameter space. As noted by the authors in the introduction, this is the price to pay for having very general hypotheses. The problem of finding more general statistical experiments which would fit in the theory is open and worth investigating. The new risk criterion seems to be a more flexible one for finding new optimal solutions.
• The fact that the sets $X$, $Y$ must be compact is restrictive. We wonder whether this constraint can be weakened.
• The combination of conditions 3. and 4. on the class of tests implies a reduction of the class of tests. Consider the case of exponential distributions, where $p_\mu(\omega) = \mu e^{-\mu\omega} \mathbf{1}_{\omega > 0}$, $\mu > 0$. By condition 3., $F$ contains $\log p_\mu/p_\nu$ for all $\mu, \nu > 0$, hence $F$ contains all tests $\varphi(\omega) = a\omega + b$, $a, b \in \mathbb{R}$. For condition 4., to compute $F_\varphi(\mu) = \log \mathbb{E}_\mu\big[e^{\varphi(\omega)}\big]$ the condition $\mu > a$ is required and $F_\varphi(\mu) = b + \log \mu - \log(\mu - a)$. As $F_\varphi''(\mu) = a(2\mu - a)/(\mu^2 (\mu - a)^2)$, $F_\varphi$ is concave if and only if $a \le 0$. Therefore, condition 4. restricts the class of tests to $\{\varphi(\omega) = a\omega + b,\ a \le 0,\ b \in \mathbb{R}\}$. This raises a contradiction: condition 3. states that $F$ must contain all $\log p_\mu/p_\nu$, thus all $\varphi = a\omega + b$, $a, b \in \mathbb{R}$. We wonder if condition 4. could be stated differently so as to avoid this contradiction.
• Generally, when testing statistical hypotheses, one chooses a hypothesis $H_0$ and builds a test of $H_0$ versus $H_1$. One wishes to control the error of rejecting $H_0$ when it is true, and the other error does not really matter. A discussion of this point is lacking in relation with Theorem 2.1, formula (4).
• Randomized tests are not considered here. However, they arise as optimal solutions in the fundamental Neyman-Pearson lemma. In estimation theory, randomized estimators are of no use because, due to the convexity of loss functions, a non-randomized estimator is better. In the setting of the paper, is there such a reason to eliminate randomized tests?
• The criterion $r(\varphi)$ is larger than the usual one, entailing a loss. The notion of "provably optimal test" introduced in this study is not commented on in the text, so it is difficult to understand or quantify it. Maybe more comments on Theorem 2.1 (ii) would help. At this point, let us notice that the notation $\varepsilon_X$ without dependence on the test statistic $\varphi$ is a bit misleading. It should be $\varepsilon_X(\varphi)$, so that in formula (4) of Theorem 2.1, we would read $\varepsilon_X(\varphi_*)$. Another point is the comparison between $\varepsilon_X(\varphi_*)$ and $\varepsilon_*$. Apart from the Markov inequality, would it be possible to quantify $\varepsilon_* - \varepsilon_X(\varphi_*)$?
• […] $(H_0, H_j)$) or as $\varepsilon_{X_0}(\varphi_*(H_0, H_j))$? The numerical illustrations were also difficult to follow.
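The computation behind the exponential-distribution example in the discussion above can be checked numerically. The sketch below assumes the density $\mu e^{-\mu\omega}$ on $\omega > 0$ and tests of the form $\varphi(\omega) = a\omega + b$, for which $\log \mathbb{E}_\mu[e^{\varphi(\omega)}] = b + \log\mu - \log(\mu - a)$ when $\mu > a$; it compares the closed-form second derivative $a(2\mu - a)/(\mu^2(\mu - a)^2)$ with a finite-difference approximation. The function names are hypothetical.

```python
import math


def log_mgf(mu, a, b=0.0):
    """log E_mu[exp(a * omega + b)] for omega ~ Exponential(rate mu); needs mu > max(a, 0)."""
    if mu <= max(a, 0.0):
        raise ValueError("requires mu > max(a, 0)")
    return b + math.log(mu) - math.log(mu - a)


def second_derivative_closed_form(mu, a):
    """Closed-form second derivative in mu of log_mgf."""
    return a * (2.0 * mu - a) / (mu ** 2 * (mu - a) ** 2)


def second_derivative_numeric(mu, a, h=1e-4):
    """Central finite-difference approximation of the second derivative in mu."""
    return (log_mgf(mu + h, a) - 2.0 * log_mgf(mu, a) + log_mgf(mu - h, a)) / h ** 2


# Usage: the sign of the second derivative follows the sign of a (concave only for a <= 0).
for slope in (-1.0, 1.0):
    print(slope, second_derivative_closed_form(2.0, slope), second_derivative_numeric(2.0, slope))
```

A positive slope $a$ gives a strictly convex function of $\mu$, which is the source of the tension between conditions 3. and 4. noted in the discussion.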

Concluding remarks
To conclude, the criterion $r(\varphi)$, defined in (2.2), provides a powerful new tool to build nearly optimal tests for multiple hypotheses. The large number of concrete and convincing examples is impressive. We believe that this paper offers new insight into the theory of testing statistical hypotheses and will surely inspire new research and new developments.