Discussion of “Hypotheses testing by convex optimization”

I was very happy to read this paper by Sasha, Anatoli and Arkadi, not only because it is an exciting paper about testing problems that I have been interested in for many years but also for some more personal reasons. It reminded me of various problems about testing between sets that I began to consider and work on almost forty years ago and in which I have remained interested ever since. It also brought back to my memory many exchanges and discussions that I had in the late seventies and early eighties with Lucien Le Cam, as well as many seminars and talks about robustness, which was at that time a very fashionable subject, and many friends that I made then, like Tadeusz Bednarski, Gabor Tusnády and Piet Groeneboom, among others.

Historical remarks
The general problem of testing between two non-trivial sets received a lot of attention in the seventies along two different streams of research. One was initiated much earlier, actually in the fifties, by Le Cam and his collaborators, and a milestone paper was Kraft (1955) about the consistency of tests. It contains (Theorem 5) a fundamental, previously unpublished result by Le Cam, which I shall comment on below and which provides the performance of a best test between two convex sets of probabilities. Other important results about the performance of tests are provided by Le Cam (1973), in a paper directed towards the use of tests in order to derive "universal" estimators under some dimensionality restrictions.
Le Cam's work was about the possibility of testing efficiently between two sets of probabilities, with application to estimation, while the theory of robustness was about a different problem that I could summarize as "improving the stability" of statistical procedures. When applied to testing between two simple hypotheses, say $\{P\}$ and $\{Q\}$, it amounts to finding tests that are more or less equivalent to the classical likelihood ratio (Neyman-Pearson) tests between $P$ and $Q$ but with errors that do not increase too much when the truth is actually slightly different from either $P$ or $Q$. This amounts to finding tests between some vicinities $\mathcal{P}$ and $\mathcal{Q}$ of $P$ and $Q$ respectively, the result depending on the topology that is used. Various results in this direction appeared in the sixties and seventies after the milestone paper by Huber (1965). It would take much too long to cite them all, but Huber showed how to find explicit optimal tests between $L_1$-balls, among other vicinities, and more generally between sets that are dominated by two-alternating capacities (Huber and Strassen, 1973).
There are indeed many available results about tests between convex sets but a large part of them is of a purely theoretical nature (existence results) and does not provide explicit tests that perform as predicted by the theory. It is a great merit of this paper to provide such tests in some interesting and useful statistical frameworks.

Kraft and Le Cam's results
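Let us begin with some elementary facts and notations. To test with a random variable $X$ between two probability sets $\mathcal{P}$ and $\mathcal{Q}$, we use a test function $\phi$ with values in $\{-1, 1\}$, deciding $\mathcal{P}$ when $\phi(X) = 1$ and $\mathcal{Q}$ otherwise. This results in errors of the form
$$\sup_{P\in\mathcal{P}} P[\phi(X) = -1] \quad\text{and}\quad \sup_{Q\in\mathcal{Q}} Q[\phi(X) = 1].$$
For a given test $\phi$ these errors do not change if we replace both $\mathcal{P}$ and $\mathcal{Q}$ by their convex hulls, hereafter denoted by $\mathrm{Co}(\mathcal{P})$ and $\mathrm{Co}(\mathcal{Q})$ respectively. Le Cam was interested in what he called the "testing affinity"
$$\pi(\mathcal{P},\mathcal{Q}) = \inf_{\phi}\left\{\sup_{P\in\mathcal{P}} P[\phi(X) = -1] + \sup_{Q\in\mathcal{Q}} Q[\phi(X) = 1]\right\}$$
(where the infimum runs over all test functions $\phi$), which measures, in a sense, the performance of a best test between $\mathcal{P}$ and $\mathcal{Q}$. In particular, denoting by $dP$ and $dQ$ the densities of $P$ and $Q$ with respect to any dominating measure,
$$\pi(P,Q) = \int \min\{dP, dQ\} = 1 - D(P,Q),$$
where $D(P,Q)$ denotes the "variation distance":
$$D(P,Q) = \sup_{A} |P(A) - Q(A)| = \frac{1}{2}\int |dP - dQ|.$$
The fundamental theoretical result of Le Cam which appeared in Kraft (1955) says:
Theorem 1 If $\mathcal{P}$ and $\mathcal{Q}$ are two sets of probabilities, then
$$\pi(\mathcal{P},\mathcal{Q}) = \sup_{P\in\mathrm{Co}(\mathcal{P}),\,Q\in\mathrm{Co}(\mathcal{Q})} \pi(P,Q) = 1 - \inf_{P\in\mathrm{Co}(\mathcal{P}),\,Q\in\mathrm{Co}(\mathcal{Q})} D(P,Q). \tag{2.1}$$
In words, the testing affinity between two convex sets is determined by their variation distance.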
Although quite precise, this result is actually very difficult to use for analyzing concrete testing problems, for two reasons. First, the proof of Theorem 1 is based on the Hahn-Banach Theorem and therefore does not provide the construction of an optimal test. Second, many problems involve i.i.d. observations $X_1, \dots, X_n$ with joint distribution $R^{\otimes n}$, with $R$ belonging either to $\mathcal{P}$ or $\mathcal{Q}$, so that the application of Theorem 1 requires computing $\pi(S, T)$ for $S$ and $T$ in the convex hulls of $\mathcal{P}^{\otimes n} = \{P^{\otimes n}, P \in \mathcal{P}\}$ and $\mathcal{Q}^{\otimes n} = \{Q^{\otimes n}, Q \in \mathcal{Q}\}$. Unfortunately, there is no direct relationship between $\pi(\mathcal{P}^{\otimes n}, \mathcal{Q}^{\otimes n})$ and $\pi(\mathcal{P}, \mathcal{Q})$.
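A major idea of Kraft and Le Cam, in view of solving the second problem, was the introduction of the Hellinger distance $h$ and affinity $\rho$ as substitutes for $\pi$ and $D$:
$$h^2(P,Q) = \frac{1}{2}\int\left(\sqrt{dP} - \sqrt{dQ}\right)^2 \quad\text{and}\quad \rho(P,Q) = \int\sqrt{dP\,dQ} = 1 - h^2(P,Q).$$
They can be related to $\pi$ or $D$ via the following inequalities:
$$1 - \sqrt{1 - \rho^2(P,Q)} \le \pi(P,Q) = 1 - D(P,Q) \le \rho(P,Q). \tag{2.2}$$
The main advantage of $\rho$ over $\pi$ derives from the fact that $\rho(P^{\otimes n}, Q^{\otimes n}) = \rho^n(P,Q)$, which makes it possible to deal with i.i.d. samples. Let us now define, for two probability sets $\mathcal{P}$ and $\mathcal{Q}$,
$$\rho(\mathcal{P},\mathcal{Q}) = \sup_{P\in\mathrm{Co}(\mathcal{P}),\,Q\in\mathrm{Co}(\mathcal{Q})} \rho(P,Q) \quad\text{and}\quad h^2(\mathcal{P},\mathcal{Q}) = 1 - \rho(\mathcal{P},\mathcal{Q}).$$
It then follows from (2.1) and (2.2) that $\pi(\mathcal{P},\mathcal{Q}) \le \rho(\mathcal{P},\mathcal{Q})$ and $\pi(\mathcal{P}^{\otimes n}, \mathcal{Q}^{\otimes n}) \le \rho(\mathcal{P}^{\otimes n}, \mathcal{Q}^{\otimes n})$.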
Moreover, the following fundamental result holds for ρ.
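Theorem 2 If $\mathcal{P}$ and $\mathcal{Q}$ are two sets of probabilities, then
$$\rho(\mathcal{P}^{\otimes n}, \mathcal{Q}^{\otimes n}) \le \rho^n(\mathcal{P}, \mathcal{Q}).$$
Putting everything together, we can conclude that if $\mathcal{P}$ and $\mathcal{Q}$ are convex, then
$$\pi(\mathcal{P}^{\otimes n}, \mathcal{Q}^{\otimes n}) \le \rho^n(\mathcal{P}, \mathcal{Q}) \le \exp\left[-n h^2(\mathcal{P}, \mathcal{Q})\right].$$
This shows that, as soon as two convex sets of probabilities $\mathcal{P}$ and $\mathcal{Q}$ are separated (i.e. $h(\mathcal{P}, \mathcal{Q}) > 0$), the errors of a best test between $\mathcal{P}$ and $\mathcal{Q}$ decrease exponentially fast with the number $n$ of observations. But this does not solve the problem of finding explicit tests with such a performance.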
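To make the exponential decay concrete, here is a worked example in the simplest possible case of two simple hypotheses, $P = N(a, 1)$ and $Q = N(b, 1)$. This Gaussian pair is an illustration chosen here, not an example taken from the text. Completing the square in the exponent gives the affinity in closed form, and the product property of $\rho$ then bounds the testing affinity for $n$ observations.

```latex
\rho(P, Q)
  = \int \frac{1}{\sqrt{2\pi}}
      \exp\left[-\frac{(x-a)^2 + (x-b)^2}{4}\right] dx
  = \exp\left[-\frac{(a-b)^2}{8}\right],
\qquad
\pi\left(P^{\otimes n}, Q^{\otimes n}\right)
  \le \rho^n(P, Q)
  = \exp\left[-\frac{n (a-b)^2}{8}\right].
```

For instance, with $|a - b| = 1$ and $n = 16$ the bound is $e^{-2} \approx 0.135$, while the exact testing affinity is $2\Phi(-2) \approx 0.046$, where $\Phi$ denotes the standard normal distribution function. The bound is therefore not sharp, but it captures the exponential rate.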

An alternative point of view
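The point of view that underlies the paper by Sasha, Anatoli and Arkadi derives from the following observations. The best test between $P$ and $Q$ is the likelihood ratio test given by $\phi(X) = 1$ if $dP(X) > dQ(X)$ and $\phi(X) = -1$ otherwise, and the sum of its errors is $\pi(P, Q)$. But it is actually much more fruitful to proceed differently in order to bound both errors of this test separately. The following sequence of inequalities is straightforward but nevertheless enlightening.
$$P[\phi(X) = -1] = P\left[\sqrt{\frac{dQ}{dP}}(X) \ge 1\right] \le E_P\left[\sqrt{\frac{dQ}{dP}}(X)\right] \le \rho(P, Q)$$
and also
$$Q[\phi(X) = 1] = Q\left[\sqrt{\frac{dP}{dQ}}(X) > 1\right] \le E_Q\left[\sqrt{\frac{dP}{dQ}}(X)\right] \le \rho(P, Q).$$
This actually leads to the suboptimal bound $\pi(P, Q) \le 2\rho(P, Q)$, but this is unimportant in the very interesting case of a small value of $\rho(P, Q)$. Here we have an example of a function $\psi$, namely $\psi = \sqrt{dQ/dP}$, which satisfies, with $\mathcal{P} = \{P\}$ and $\mathcal{Q} = \{Q\}$,
$$\sup_{P\in\mathcal{P}} E_P[\psi(X)] \le \rho(\mathcal{P}, \mathcal{Q}) \quad\text{and}\quad \sup_{Q\in\mathcal{Q}} E_Q\left[\frac{1}{\psi(X)}\right] \le \rho(\mathcal{P}, \mathcal{Q}). \tag{3.1}$$
Such functions $\psi$ are indeed extremely useful to control the performance of tests between $\mathcal{P}$ and $\mathcal{Q}$ based on the detector $\varphi = \log\psi$, according to the following (trivial) lemma, which already appears (hidden in the proofs) in Birgé (1984), but also in earlier works like Chernoff (1952) about large deviations.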
Lemma 1 Let $\mathcal{P}_i$ and $\mathcal{Q}_i$, $1 \le i \le n$, be sets of probabilities on measurable spaces $\mathcal{X}_i$ and let $\psi_i$ be a positive function on $\mathcal{X}_i$; then, for all $y \in \mathbb{R}$ and independent random variables $X_i \in \mathcal{X}_i$ for $1 \le i \le n$,
$$P\left[\sum_{i=1}^n \log\psi_i(X_i) \ge y\right] \le e^{-y}\prod_{i=1}^n \sup_{P\in\mathcal{P}_i} E_P[\psi_i(X_i)] \quad\text{if the distribution of } X_i \text{ belongs to } \mathcal{P}_i \text{ for all } i,$$
$$P\left[\sum_{i=1}^n \log\psi_i(X_i) \le y\right] \le e^{y}\prod_{i=1}^n \sup_{Q\in\mathcal{Q}_i} E_Q\left[\frac{1}{\psi_i(X_i)}\right] \quad\text{if the distribution of } X_i \text{ belongs to } \mathcal{Q}_i \text{ for all } i.$$
It follows from this lemma that, once one has found a set of functions $\psi_i$, one can easily derive tests between the convex envelopes of $\mathcal{P}$ and $\mathcal{Q}$ with controlled errors and play with $y$ in order to balance between them. Finding suitable functions $\psi_i$ is therefore essential. This is the approach that I considered in Birgé (1984) and more recently in Birgé (2013). But again, given two sets $\mathcal{P}$ and $\mathcal{Q}$, the problem of finding $\psi$ satisfying (3.1) has no explicit solution in general, in contrast to Huber's results that provide explicit tests between some particular vicinities of $P$ and $Q$. Unfortunately his results do not apply to Hellinger vicinities, which are not generated by two-alternating capacities. Fortunately, although abstract in the general case, my results apply to Hellinger balls and provide in this particular case some explicit tests which are actually likelihood ratio tests between the closest points in the balls.
This is exactly the type of result that Sasha, Anatoli and Arkadi get for their "good observation schemes". One can readily see from their illustrations (a), (b) and (c) that the "least favorable" pair $(P_{x_*}, P_{y_*})$ is actually a pair that minimizes the Hellinger distance between the two hypotheses and the "nearly optimal" test is a likelihood ratio test between them.
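The way a detector controls both errors, and the role of the threshold y, can be checked on a toy example. The sketch below is only an illustration under assumptions made here: two simple hypotheses P and Q on a three-point space with positive probabilities, n i.i.d. observations, the function psi = sqrt(dQ/dP), and the rule "decide Q when the sum of log psi(X_i) is at least y". It computes both error probabilities exactly by enumeration and compares them with the Markov-type bounds exp(-y) rho^n under P and exp(y) rho^n under Q. The function name and the distributions are hypothetical.

```python
import itertools
import math

def detector_test_errors(p, q, n, y):
    """Exact errors of the rule 'decide Q iff sum_i log psi(X_i) >= y',
    with psi = sqrt(q / p), for n i.i.d. draws from p or from q.
    Assumes all entries of p and q are positive."""
    log_psi = [0.5 * math.log(b / a) for a, b in zip(p, q)]
    rho = sum(math.sqrt(a * b) for a, b in zip(p, q))
    err_p = 0.0  # probability of deciding Q when P is true
    err_q = 0.0  # probability of deciding P when Q is true
    for xs in itertools.product(range(len(p)), repeat=n):
        stat = sum(log_psi[x] for x in xs)
        if stat >= y:
            err_p += math.prod(p[x] for x in xs)
        else:
            err_q += math.prod(q[x] for x in xs)
    bound_p = math.exp(-y) * rho ** n  # bound on err_p
    bound_q = math.exp(y) * rho ** n   # bound on err_q
    return err_p, err_q, bound_p, bound_q

# Illustrative distributions (not taken from the paper).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
for y in (-0.5, 0.0, 0.5):
    print(y, detector_test_errors(p, q, n=5, y=y))
```

Increasing y makes it harder to decide Q, which lowers the error under P and raises the error under Q; this is the balancing between the two errors mentioned above.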
As we have seen, finding pairs of sets $\mathcal{P}$ and $\mathcal{Q}$ for which one can explicitly compute a function $\psi$ that satisfies (3.1) is a major issue in testing theory. The interest and importance of the present paper is that it solves this problem for particular sets $\mathcal{P}$ and $\mathcal{Q}$ connected to some classical parametric statistical models. On the one hand, the results only apply to some very specific cases (the good observation schemes); on the other hand, they make it possible to derive concrete and practical tests with excellent performance, which is definitely more useful than a mere existence theorem. As we often say in French, "on ne peut avoir le beurre et l'argent du beurre !" (no free lunch!). Moreover, in view of Lemma 1, these tests apply to i.i.d. samples and make it possible to balance between the two kinds of errors. Although dealing with some particular parametric models, they nevertheless apply to many interesting situations, as illustrated by the authors in their Section 4.

Combining elementary tests
An important part of the paper (namely Section 3) is devoted to various ways of combining detectors $\varphi = \log\psi$, where $\psi$ satisfies (3.1), in order to test between more complex hypotheses. Given a family $H_1, \dots, H_M$ of hypotheses (with $H_j$ corresponding to $P \in \mathcal{P}_j$) for which one can find tests between pairs $(H_i, H_j)$, $1 \le i < j \le M$, it is not obvious how to design good tests between $\cup_{j=1}^{m} H_j$ and $\cup_{j=m+1}^{M} H_j$. I am personally not fond of considering this problem when some of the hypotheses overlap or, more generally, when some of them are almost indistinguishable. Le Cam (1973) (and this also follows from the inequalities of the previous Section 2) shows that if $nh^2(P, Q)$ is too small, no test can correctly distinguish between $P^{\otimes n}$ and $Q^{\otimes n}$. I therefore believe that when combining tests between various hypotheses $H_j$, one should not try to test between $H_i$ and $H_j$ when these hypotheses are too close (in Hellinger distance). In such a case, one knows that there is no hope of getting small errors, so a good solution is to proceed as indicated in Section 3.2.1.
Considering the problem of choosing between $M$ hypotheses (but this remark is also valid for testing, when $M = 2$), I believe that it has to be put in the classical decision-theoretic framework with a decision function $\delta(X)$ with values in $\{1, \dots, M\}$. Since the assumption is that the true distribution satisfies at least one of the hypotheses, the decision $\delta$ should choose one. This rules out any strategy, like some that appear in Section 3, which may reject all hypotheses and therefore lead to no decision. A simple solution, which would not increase the rejection rate and therefore the errors, would be, for instance, to decide at random when the multiple testing procedure rejects all hypotheses.
Clearly, the procedure of Section 3.2.1 tends to improve the situation. It can be viewed as the choice of a family of "pseudo"-distances between the different hypotheses, the "distance" between $H_i$ and $H_j$ being zero when $(i, j) \in C$ (assuming symmetry: $(i, j) \in C$ is equivalent to $(j, i) \in C$) and one otherwise. For a given $i$, one only tests against $H_j$ if the "distance" between $H_i$ and $H_j$ is one. One could actually adopt a more sophisticated strategy with mutual distances between the hypotheses being arbitrary nonnegative numbers. One could for instance take, as the "distance" between $H_i$ and $H_j$, minus the logarithm of the error of a best test between them. To build the final decision function, one should not only consider the various tests involved but also look at the mutual distances between the hypotheses and decide according to this combined information. The idea is to decide for an $i$ such that no test for which $H_j$ is far from $H_i$ rejects it, based on the fact that two hypotheses that are close cannot be properly distinguished while two hypotheses that are far apart lead to a test with small errors. I actually used this argument in Birgé (1983) and Birgé (2006) to derive estimators from families of tests. I am convinced that the method that I used to build T-estimators can be adapted to deal with multiple hypotheses via the function $D_X$ provided by (4.5) of Birgé (2006). One could analogously use a suitable version of $D_X(k)$ as a criterion of the "credibility" of the hypothesis $H_k$ (the larger $D_X(k)$, the less credible $H_k$) and finally decide for the minimizer over $k$ of $D_X(k)$. Various modifications of this procedure are certainly possible.
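One possible reading of this selection rule can be sketched in a few lines of Python. Everything below is an assumption made for illustration: the credibility score of H_k is taken to be the largest "distance" to a hypothesis H_j whose pairwise test rejects H_k (and zero if no test rejects it), which is only one candidate for a "suitable version" of D_X(k) and is not claimed to be the definition used in Birgé (2006). The names `select_hypothesis`, `dist` and `rejects` are hypothetical.

```python
def select_hypothesis(num_hyp, dist, rejects):
    """Pick the hypothesis with the smallest credibility score.

    dist[k][j]    : nonnegative 'distance' between H_k and H_j (assumed given).
    rejects(k, j) : True if the pairwise test between H_k and H_j,
                    computed from the data, rejects H_k (assumed given).
    The score of H_k is the largest distance to a hypothesis that
    rejects it, or 0.0 when no test rejects H_k.
    """
    def score(k):
        return max(
            (dist[k][j] for j in range(num_hyp) if j != k and rejects(k, j)),
            default=0.0,
        )

    scores = [score(k) for k in range(num_hyp)]
    best = min(range(num_hyp), key=lambda k: scores[k])
    return best, scores

# Toy usage with three hypotheses and made-up test outcomes.
dist = [[0.0, 1.0, 3.0],
        [1.0, 0.0, 2.0],
        [3.0, 2.0, 0.0]]
rejected_pairs = {(0, 1), (0, 2), (2, 1)}  # (k, j): H_k rejected against H_j
best, scores = select_hypothesis(3, dist, lambda k, j: (k, j) in rejected_pairs)
print(best, scores)
```

By construction such a rule always returns a decision, and a rejection by a nearby hypothesis costs less than a rejection by a distant one, in line with the remark that close hypotheses cannot be properly distinguished.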

Conclusion
I definitely find this paper exciting and hope it will open a new research trend towards finding explicit detectors for testing between two hypotheses for other observation schemes. I hope that more examples will be found in the future. I also greatly appreciated the applications of Section 4, but discussing this aspect is not really in my field of expertise. As to the problem of handling many hypotheses, I believe that the point of view developed in Section 3.2.1 is the most fruitful and promising one and that the authors should pursue this direction.