Near-optimal Bayesian Active Learning with Correlated and Noisy Tests

We consider the Bayesian active learning and experimental design problem, where the goal is to learn the value of some unknown target variable through a sequence of informative, noisy tests. In contrast to prior work, we focus on the challenging, yet practically relevant setting where test outcomes can be conditionally dependent given the hidden target variable. Under such assumptions, common heuristics, such as greedily performing tests that maximize the reduction in uncertainty of the target, often perform poorly. In this paper, we propose ECED, a novel, computationally efficient active learning algorithm, and prove strong theoretical guarantees that hold with correlated, noisy tests. Rather than directly optimizing the prediction error, at each step, ECED picks the test that maximizes the gain in a surrogate objective, which takes into account the dependencies between tests. Our analysis relies on an information-theoretic auxiliary function to track the progress of ECED, and utilizes adaptive submodularity to attain the near-optimal bound. We demonstrate strong empirical performance of ECED on two problem instances, including a Bayesian experimental design task intended to distinguish among economic theories of how people make risky decisions, and an active preference learning task via pairwise comparisons.


Introduction
Optimal information gathering, i.e., selectively acquiring the most useful data, is one of the central challenges in machine learning. The problem of optimal information gathering has been studied in the context of active learning (Dasgupta, 2004a; Settles, 2012), Bayesian experimental design (Chaloner & Verdinelli, 1995), policy making (Runge et al., 2011), optimal control (Smallwood & Sondik, 1973), and numerous other domains. In a typical set-up for these problems, there is some unknown target variable Y of interest, and a set of tests which correspond to observable variables defined through a probabilistic model. The goal is to determine the value of the target variable with a sequential policy (one that adaptively selects the next test based on previous observations), such that the cost of performing these tests is minimized.
Deriving the optimal testing policy is NP-hard in general (Chakaravarthy et al., 2007); however, under certain conditions, some approximation results are known. In particular, if test outcomes are deterministic functions of the target variable (i.e., in the noise-free setting), a simple greedy algorithm, namely Generalized Binary Search (GBS), is guaranteed to provide a near-optimal approximation of the optimal policy (Kosaraju et al., 1999). On the other hand, if test outcomes are noisy, but the outcomes of different tests are conditionally independent given Y (i.e., under the Naïve Bayes assumption), then the most informative selection policy, which greedily selects the test that maximizes the expected reduction in the uncertainty of the target variable (quantified in terms of Shannon entropy), is guaranteed to perform near-optimally (Chen et al., 2015a).
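The most informative selection policy described above can be sketched in a few lines. The prior, the two toy tests, and all names below are illustrative, not taken from the paper; the sketch scores each binary test by its expected reduction in the Shannon entropy of Y under the Naïve Bayes assumption:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (bits) of a discrete distribution."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def information_gain(prior, likelihoods):
    """Expected reduction in H(Y) from observing one binary test.

    prior:       shape (n_y,), the current belief P[Y = y].
    likelihoods: shape (n_y,), P[X_e = 1 | y]  (tests are conditionally
                 independent given Y under the Naive Bayes assumption).
    """
    p_x1 = np.dot(prior, likelihoods)               # P[X_e = 1]
    post1 = prior * likelihoods / p_x1              # P[Y | X_e = 1]
    post0 = prior * (1 - likelihoods) / (1 - p_x1)  # P[Y | X_e = 0]
    expected_posterior_entropy = p_x1 * entropy(post1) + (1 - p_x1) * entropy(post0)
    return entropy(prior) - expected_posterior_entropy

# Greedy "most informative" selection: pick the test with maximal gain.
prior = np.array([0.5, 0.5])
tests = {"noisy": np.array([0.5, 0.5]),   # uninformative: outcome independent of Y
         "clean": np.array([0.9, 0.1])}   # discriminative test
best = max(tests, key=lambda e: information_gain(prior, tests[e]))
```

With these toy numbers the purely noisy test has zero gain, so the greedy policy picks the discriminative one.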
However, in many practical problems, due to the effect of noise or complex structural assumptions in the probabilistic model (beyond Naïve Bayes), we only have access to tests that are indirectly informative about the target variable Y (i.e., test outcomes depend on Y through another hidden random variable; see Fig. 1). As a consequence, the test outcomes become conditionally dependent given Y. Consider a medical diagnosis example, where a doctor wants to predict the best treatment for a patient by carrying out a series of medical tests, each of which reveals some information about the patient's physical condition. Here, the outcomes of medical tests are conditionally independent given the patient's condition, but are not conditionally independent given the treatment, which is chosen based on the patient's condition. It is known that in such cases, both GBS and the most informative selection policy (which myopically maximizes the information gain w.r.t. the distribution over Y) can perform arbitrarily poorly. Golovin et al. (2010) formalize this problem as the equivalence class determination problem (see §2.1), and show that if the tests' outcomes are noise-free, then one can obtain near-optimal expected cost by running a greedy policy based on a surrogate objective function. Their results rely on the fact that the surrogate objective function exhibits adaptive submodularity (Golovin & Krause, 2011), a natural diminishing returns property that generalizes the classical notion of submodularity to adaptive policies. Unfortunately, in the more general setting where tests are noisy, no efficient policies are known to be provably competitive with the optimal policy.
Our Contribution. In this paper, we introduce Equivalence Class Edge Discounting (ECED), a novel algorithm for practical Bayesian active learning and experimental design problems, and prove strong theoretical guarantees with correlated, noisy tests. In particular, we focus on the setting where the tests' outcomes depend on the target variable only indirectly (and hence are conditionally dependent given Y), and we assume that the outcome of each test can be corrupted by some random, persistent noise (§2). We prove that when the test outcomes are binary, and the noise on the test outcomes is mutually independent, ECED is guaranteed to attain near-optimal cost, compared with an optimal policy that achieves a lower prediction error (§3). We develop a theoretical framework for analyzing such sequential policies, in which we leverage an information-theoretic auxiliary function to reason about the effect of noise, and combine it with the theory of adaptive submodularity to attain the near-optimality bound (§4). The key insight is to show that ECED effectively makes progress in the long run as it picks more tests, even if the myopic choices of tests yield no immediate gain in terms of reducing the uncertainty of the target variable. We demonstrate the compelling performance of ECED on two real-world problem instances: a Bayesian experimental design task intended to distinguish among economic theories of how people make risky decisions, and an active preference learning task via pairwise comparisons (§5). To facilitate understanding, we provide detailed proofs, illustrative examples, and a third application on pool-based active learning in the supplemental material.

Preliminaries and Problem Statement
The Basic Model. Let Y be the target random variable whose value we want to learn. The value of Y, which ranges over the set Y = {y_1, ..., y_t}, depends deterministically on another random variable Θ ∈ supp(Θ) = {θ_1, ..., θ_n} with some known distribution P[Θ]. Concretely, there is a deterministic mapping r : supp(Θ) → Y that gives Y = r(Θ). Let X = {X_1, ..., X_m} be a collection of discrete observable variables that are statistically dependent on Θ (see Fig. 1).
We use e ∈ V ≜ {1, ..., m} as the indexing variable of a test. Performing each test X_e produces an outcome x_e ∈ O (here, O encodes the set of possible outcomes of a test), and incurs a unit cost. We can think of Θ as representing the underlying "root-cause" among a set of n possible root-causes of the joint event {X_1, ..., X_m}, and Y as representing the optimal "target action" to be taken for root-cause Θ. Also, each of the X_e's is a "test" that we can perform, whose observation reveals some information about Θ. In our medical diagnosis example (see Fig. 2(a)), the X_e's encode tests' outcomes, Y encodes the treatment, and Θ encodes the patient's physical condition.
Crucially, we assume that the X_e's are conditionally independent given Θ, i.e., P[X_1, ..., X_m | Θ] = ∏_e P[X_e | Θ], with known parameters. Note that noise is implicitly encoded in our model: we can equivalently assume that the X_e's are first generated by a deterministic mapping of Θ, and then perturbed by some random noise. As an example, if test outcomes are binary, then we can think of X_e as resulting from flipping the deterministic outcome of test e given Θ with some probability, where the flipping events of different tests are mutually independent.
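Under this flip-noise view, the likelihood of a binary outcome is fully determined by the noise-free outcome and the flip probability. A minimal sketch (function names and toy numbers are ours, not the paper's):

```python
def likelihood(x, noise_free_outcome, eps):
    """P[X_e = x | theta] under the flip-noise model: the deterministic
    outcome of test e given theta is kept w.p. 1 - eps, flipped w.p. eps."""
    return 1.0 - eps if x == noise_free_outcome else eps

def marginal(x, prior, noise_free, eps):
    """P[X_e = x] = sum_theta P[theta] * P[X_e = x | theta], for a prior
    over root-causes and a map theta -> noise-free outcome of test e."""
    return sum(p * likelihood(x, noise_free[t], eps) for t, p in prior.items())
```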
Figure 2: (a) shows an illustrative example of the medical diagnosis problem. In (b), we initialize EC², drawing edges between all pairs of root-causes (diamonds) that are mapped into different treatments (circles). In (c), we run EC² and remove all the edges incident to root-causes θ_2 = [0, 0, 0] and θ_5 = [0, 1, 0] if we observe X_1 = 1. (d) ECED, instead, discounts the edge weights accordingly.
The expected error probability after running policy π is then defined as p_ERR(π) ≜ E_{ψ_π}[p_ERR^MAP(ψ_π)]. In words, p_ERR(π) is the expected error probability w.r.t. the posterior, given the final outcome of π. Let the (worst-case) cost of a policy π be cost(π) ≜ max_{ψ_π} |ψ_π|, i.e., the maximum number of tests performed by π over all possible paths it takes. Given some small tolerance δ ∈ [0, 1], we seek a policy with minimal cost, such that upon termination, it achieves expected error probability less than δ. Denote such a policy by OPT(δ). Formally, we seek

    OPT(δ) ∈ arg min_π cost(π),  s.t.  p_ERR(π) ≤ δ.    (2.1)
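For concreteness, the MAP error probability for a given posterior over Y is simply one minus the largest posterior mass. The helper name and numbers below are illustrative:

```python
def map_error(posterior):
    """p_ERR^MAP(psi) = 1 - max_y P[Y = y | psi]: the probability that the
    MAP prediction of Y is wrong under the posterior induced by psi."""
    return 1.0 - max(posterior.values())
```

A policy then terminates once the expectation of this quantity over its observation paths drops below the tolerance δ.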

Special Case: The Equivalence Class Determination Problem
Note that computing the optimal policy for Problem (2.1) is intractable in general. When δ = 0, the problem reduces to the equivalence class determination problem (Golovin et al., 2010; Bellala et al., 2010). Here, the target values are referred to as equivalence classes, since each y ∈ Y corresponds to the subset of root-causes in supp(Θ) that (equivalently) share the same "action".
If tests are noise-free, i.e., ∀e, P[X_e | Θ] ∈ {0, 1}, this problem can be solved near-optimally by the equivalence class edge cutting (EC²) algorithm (Golovin et al., 2010). As illustrated in Fig. 2, EC² employs an edge-cutting strategy based on a weighted graph G = (supp(Θ), E), where vertices represent root-causes, and edges link root-causes that we want to distinguish between. Formally, E ≜ {{θ, θ'} : r(θ) ≠ r(θ')} consists of all (unordered) pairs of root-causes corresponding to different target values (see Fig. 2(b)). We define a weight function w : E → R≥0 by w({θ, θ'}) ≜ P[θ] · P[θ'], i.e., as the product of the probabilities of the edge's incident root-causes. We extend the weight function to sets of edges E' ⊆ E as the sum of the weights of all edges {θ, θ'} ∈ E', i.e., w(E') ≜ Σ_{{θ,θ'}∈E'} w({θ, θ'}). Performing test e ∈ V with outcome x_e is said to "cut" an edge, if at least one of its incident root-causes is inconsistent with x_e (see Fig. 2(c)); let E(x_e) denote the set of edges cut by observing x_e. The EC² objective (which is greedily maximized per iteration of EC²) is then defined as the total weight of edges cut by the current partial observation ψ_π: f_EC²(ψ_π) ≜ w(⋃_{(e,x_e)∈ψ_π} E(x_e)).

The EC² objective function is adaptive submodular and strongly adaptively monotone (Golovin et al., 2010). Formally, let ψ_1, ψ_2 ∈ 2^{V×O} be two partial realizations of tests' outcomes. We call ψ_1 a subrealization of ψ_2, denoted ψ_1 ⪯ ψ_2, if every test seen by ψ_1 is also seen by ψ_2, and the two agree on the outcomes of all tests seen by both. Let ∆(X_e | ψ) ≜ E[f(ψ ∪ {(e, X_e)}) − f(ψ) | ψ] denote the expected marginal gain of test e given ψ. A function f is adaptive submodular w.r.t. P, if ∆(X_e | ψ_1) ≥ ∆(X_e | ψ_2) whenever ψ_1 ⪯ ψ_2 (i.e., "adding information earlier helps more"). Further, f is called strongly adaptively monotone w.r.t. P, if for all ψ, test e not seen by ψ, and x_e ∈ O, it holds that f(ψ) ≤ f(ψ ∪ {(e, x_e)}) (i.e., "adding new information never hurts"). For sequential decision problems satisfying adaptive submodularity and strong adaptive monotonicity, the policy that greedily, upon having observed ψ, selects the test e* ∈ arg max_e ∆(X_e | ψ) is guaranteed to attain near-minimal cost (Golovin & Krause, 2011).
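The edge construction and cut-weight bookkeeping above can be sketched as follows. The three root-causes and their prior mirror the illustrative example in §3; everything else (names, the particular consistency pattern) is ours:

```python
from itertools import combinations

def ec2_edges(prior, target):
    """Edges link root-causes mapped to different target values; the
    weight of an edge is the product of its endpoints' probabilities."""
    return {(a, b): prior[a] * prior[b]
            for a, b in combinations(prior, 2) if target[a] != target[b]}

def cut_weight(edges, consistent):
    """Total weight of edges cut by an observation: an edge is cut if at
    least one of its endpoints is inconsistent with the outcome."""
    return sum(w for (a, b), w in edges.items()
               if not consistent[a] or not consistent[b])

prior  = {"t1": 0.2, "t2": 0.4, "t3": 0.4}
target = {"t1": "y1", "t2": "y1", "t3": "y2"}
edges = ec2_edges(prior, target)   # edges (t1,t3) and (t2,t3); no edge inside y1
```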
In the noisy setting, however, we can no longer attain 0 error probability (or, equivalently, cut all the edges constructed for EC²), even if we exhaust all tests. A natural approach to solving Problem (2.1) for δ > 0 would be to pick tests greedily maximizing the expected reduction in the error probability p_ERR. However, this objective is not adaptive submodular; in fact, as we show in the supplemental material, such a policy can perform arbitrarily badly if there are complementarities among tests, i.e., if the gain of a set of tests can be far greater than the sum of the individual gains of the tests in the set. Therefore, motivated by the EC² objective in the noise-free setting, we would like to optimize a surrogate objective function that captures the effect of noise, while remaining amenable to greedy optimization.

The ECED Algorithm
We now introduce ECED for Bayesian active learning under correlated, noisy tests, which strictly generalizes EC² to the noisy setting, while preserving the near-optimality guarantee.
EC² with Bayesian Updates on Edge Weights. In the noisy setting, the test outcomes are not necessarily deterministic given a root-cause, i.e., we may have P[X_e | θ] ∈ (0, 1). Therefore, one can no longer "cut away" a root-cause θ by observing x_e, as long as P[X_e = x_e | θ] > 0. In such cases, a natural extension of the edge-cutting strategy is, instead of cutting edges, to discount the edge weights through Bayesian updates: after observing x_e, we can discount the weight of an edge {θ, θ'} by multiplying the probabilities of its incident root-causes with the likelihoods of the observation: w({θ, θ'} | x_e) := P[x_e | θ] · P[x_e | θ'] · w({θ, θ'}). This gives us a greedy policy that, at every iteration, picks the test with the maximal expected reduction in total edge weight. We call this policy EC²-Bayes. Unfortunately, as we demonstrate later in §5, this seemingly promising update scheme is not ideal for solving our problem: it tends to pick tests that are very noisy, which do not help differentiate among different target values. Consider a simple example with three root-causes distributed as P[θ_1] = 0.2, P[θ_2] = P[θ_3] = 0.4, and two target values r(θ_1) = r(θ_2) = y_1, r(θ_3) = y_2. We want to evaluate two tests: (1) a purely noisy test X_1, i.e., ∀θ, P[X_1 = 1 | θ] = 0.5, and (2) a noiseless test X_2, i.e., P[X_2 | θ] ∈ {0, 1}. One can easily verify that by running EC²-Bayes, one actually prefers X_1 (with expected reduction in edge weight 0.18, as opposed to 0.112 for X_2).

The ECED Algorithm. The example above hints at an important principle for designing proper objective functions for this task: as the noise rate increases, one must take reasonable precautions when evaluating the informativeness of a test, so that the undesired contribution of noise is accounted for. Suppose we have performed test e and observed x_e. We call a root-cause θ "consistent" with observation x_e, if x_e is the most likely outcome of X_e given θ (i.e., x_e ∈ arg max_x P[X_e = x | θ]); otherwise, we say θ is inconsistent. Now, instead of discounting the weight of all root-causes by the likelihoods P[X_e = x_e | θ] (as EC²-Bayes does), we choose to discount the root-causes by the likelihood ratio λ_{θ,x_e} ≜ P[X_e = x_e | θ] / max_x P[X_e = x | θ]. Intuitively, this is because we want to "penalize" a root-cause (and hence the weight of its incident edges) only if it is inconsistent with the observation (see Fig. 2(d)). When x_e is consistent with root-cause θ, then λ_{θ,x_e} = 1 and we do not discount θ; otherwise, if x_e is inconsistent with θ, we have λ_{θ,x_e} < 1. When a test is not informative for root-cause θ, i.e., P[X_e | θ] is uniform, then λ_{θ,x_e} = 1, which neutralizes the effect of such a test in terms of edge weight reduction. Formally, given observations ψ_π, we define the value of observing x_e as the total amount of edge weight discounted, and take ∆_ECED(X_e | ψ_π) to be the expected amount of edge weight that is effectively reduced by performing test e. We call the algorithm that greedily maximizes ∆_ECED the Equivalence Class Edge Discounting (ECED) algorithm, and present the pseudocode in Algorithm 1.
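The discounting step can be sketched as follows. The per-observation value is written here as Σ_edges w · (1 − λ_θ · λ_θ'), one natural reading of "the total amount of edge weight discounted"; the paper's exact expression lives in Algorithm 1, so treat this form, and all names, as assumptions:

```python
def discount_ratio(lik_of_observed, lik_all_outcomes):
    """lambda_{theta, x_e} = P[X_e = x_e | theta] / max_x P[X_e = x | theta].
    Equals 1.0 when x_e is the most likely outcome under theta (consistent,
    no penalty) and for purely uninformative tests; < 1.0 otherwise."""
    return lik_of_observed / max(lik_all_outcomes)

def eced_value(edges, lam):
    """Edge weight effectively discounted by observing x_e, assuming the
    weight of edge {theta, theta'} is scaled by lam[theta] * lam[theta']."""
    return sum(w * (1.0 - lam[a] * lam[b]) for (a, b), w in edges.items())
```

Note that a purely noisy test yields λ = 1 for every root-cause and hence zero value, which is exactly the neutralization property discussed above.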
Similar to EC², the efficiency of ECED (in terms of computational complexity as well as query complexity) depends on the number of root-causes. Let ε_{θ,e} ≜ 1 − max_x P[X_e = x | θ] be the noise rate for test e under root-cause θ. As our main theoretical result, we show that under the basic setting where test outcomes are binary, and the test noise is independent of the underlying root-causes (i.e., ∀θ ∈ supp(Θ), ε_{θ,e} = ε_e), ECED is competitive with the optimal policy that achieves a lower error probability for Problem (2.1):

Theorem 1. Fix δ ∈ (0, 1). To achieve expected error probability less than δ, it suffices to run ECED for O((k/c_ε)(log(kn/δ) · log(n/δ))²) steps, where n ≜ |supp(Θ)| denotes the number of root-causes, c_ε ≜ min_{e∈V} (1 − 2ε_e)² characterizes the severity of noise, and k ≜ cost(OPT(δ_OPT)) is the worst-case cost of the optimal policy that achieves expected error probability δ_OPT = O(δ/(log n · log(1/δ))²).

Note that a pessimistic upper bound for k is the total number of tests m, and hence the cost of ECED is at most O((log(mn/δ) · log(n/δ))²/c_ε) times the worst-case cost of the optimal algorithm, which achieves a lower error probability O(δ/(log n · log(1/δ))²). Further, observe that the upper bound on the cost of ECED degrades as the maximal noise rate of the tests increases. When c_ε = 1, we have ε_e = 0 for all tests e, and ECED reduces to the EC² algorithm. Theorem 1 thus implies that running EC² for O(k (log(kn/δ) · log(n/δ))²) steps in the noise-free setting is sufficient to achieve p_ERR ≤ δ.
Finally, notice that by construction ECED never selects any non-informative test. Therefore, we can always remove purely noisy tests (i.e., tests e such that P[X_e | θ] is uniform for all θ) in advance, so that c_ε > 0, and the upper bound in Theorem 1 becomes non-trivial.
Theoretical Analysis

A common strategy for bounding the performance (in terms of the target objective function) of a greedy policy relative to the optimal policy is to show that the one-step gain of the greedy policy always makes effective progress towards the cumulative gain of OPT over k steps. One powerful tool facilitating this is the theory of adaptive submodularity, which lower-bounds the one-step greedy gain against the optimal policy, provided the objective function in question exhibits a natural diminishing returns condition. Unfortunately, in our context, the target function to optimize, i.e., the expected error probability of a policy, does not satisfy adaptive submodularity. Furthermore, it is nontrivial to directly relate the two objectives: the ECED objective of (3.1), which we utilize for selecting informative tests, and the reduction in error probability, which we use for evaluating a policy.
We circumvent these problems by introducing surrogate functions as a proxy to connect the ECED objective ∆_ECED with the expected reduction in the error probability p_ERR. Ideally, we aim to find some auxiliary objective f_AUX, such that tests with maximal ∆_ECED also have a high gain in f_AUX; meanwhile, f_AUX should also be comparable with the error probability p_ERR, such that minimizing f_AUX is itself sufficient for achieving low error probability.
We consider the function f_AUX : 2^{V×O} → R≥0, defined in Equation (4.1), where c is a constant that will be made concrete shortly (in Lemma 3). Interestingly, we show that f_AUX is intrinsically linked to the error probability:

Lemma 2. Consider the auxiliary function defined in Equation (4.1). Let n ≜ |supp(Θ)| be the number of root-causes, and p_ERR^MAP(ψ) be the error probability given partial realization ψ. Then f_AUX(ψ) bounds p_ERR^MAP(ψ) both from above and from below (up to problem-dependent factors).
Therefore, if we can show that running ECED effectively reduces f_AUX, then by Lemma 2 we can conclude that ECED also makes significant progress in reducing the error probability p_ERR^MAP.
Bounding the Gain w.r.t. the Auxiliary Function. It remains to understand how ECED interacts with f_AUX. For any test e, we define ∆_AUX(X_e | ψ) ≜ f_AUX(ψ) − E[f_AUX(ψ ∪ {(e, X_e)}) | ψ] to be the expected gain of test e in f_AUX. Let ∆_EC²,ψ(X_e) denote the gain of test e in the EC² objective, assuming that the edge weights are configured according to the posterior distribution P[Θ | ψ]. Similarly, let ∆_ECED,ψ(X_e) denote the ECED gain, with the edge weights configured according to P[Θ | ψ]. We prove the following result:

Lemma 3. Let ε_e be the noise rate associated with test e ∈ V. Fix η ∈ (0, 1). We consider f_AUX as defined in Equation (4.1), with c = 8 (log(2n²/η))². It holds that ∆_AUX(X_e | ψ) ≥ c'_ε · ∆_EC²,ψ(X_e) − η, where c'_ε ≜ (1 − 2ε_e)²/16.
Lemma 3 indicates that the tests selected by ECED effectively reduce f_AUX.
Lifting the Adaptive Submodularity Framework. Recall that our general strategy is to bound the one-step gain in f_AUX against the gain of an optimal policy. In order to do so, we need to show that our surrogate exhibits, to some extent, the diminishing returns property. By Lemma 3 we can relate ∆_AUX(X_e | ψ_π), i.e., the gain in f_AUX under the noisy setting, to ∆_EC²,ψ(X_e), i.e., the expected weight of edges cut by the EC² algorithm. Since f_EC² is adaptive submodular, this allows us to lift the adaptive submodularity framework into the analysis. As a result, we can now relate the 1-step gain w.r.t. f_AUX of a test selected by ECED, to the cumulative gain w.r.t. f_EC² of an optimal policy in the noise-free setting. Further, observe that the EC² objective at ψ satisfies p_ERR^MAP(ψ) ≤(a) Σ_y p_y(1 − p_y), i.e., the error probability is upper bounded by the total weight of the edges not yet cut, with edge weights configured according to the posterior P[Θ | ψ] (Equation (4.2)). Hereby, step (a) is due to the fact that the error probability of a MAP estimator always lower bounds that of a stochastic estimator (one drawn randomly according to the posterior distribution of Y). Suppose we want to compare ECED against an optimal policy OPT. By adaptive submodularity, we can relate the 1-step gain of ECED in f_EC²,ψ to the cumulative gain of OPT. Combining Equation (4.2) with Lemma 2 and Lemma 3, we can bound the 1-step gain in f_AUX of ECED against the k-step gain of OPT, and consequently bound the cost of ECED against OPT for Problem (2.1). We defer a more detailed proof outline and the full proof to the supplemental material.

Experimental Results
We now demonstrate the performance of ECED on two real-world problem instances: a Bayesian experimental design task intended to distinguish among economic theories of how people make risky decisions, and an active preference learning task via pairwise comparisons. Due to space limitations, we defer a third case study on pool-based active learning to the supplemental material.
Baselines. The first baseline we consider is EC²-Bayes, which uses Bayes' rule to update the edge weights when computing the gain of a test (as described in §3). Note that after observing the outcome of a test, both ECED and EC²-Bayes update the posteriors on Θ and Y according to Bayes' rule; the only difference is the strategy they use for selecting the next test. We also compare with two commonly used sequential information gathering policies: Information Gain (IG) and Uncertainty Sampling (US), which greedily pick the test that maximizes the reduction of entropy over the target variable Y and over the root-causes Θ, respectively. Last, we consider myopic optimization of the decision-theoretic value of information (VOI) (Howard, 1966). In our problems, the VOI policy greedily picks the test maximizing the expected reduction in the prediction error on Y.

Preference Elicitation in Behavioral Economics
We first conduct experiments on a Bayesian experimental design task that aims to distinguish among economic theories of how people make risky decisions. Several theories have been proposed in behavioral economics to explain how people make decisions under risk and uncertainty. We test ECED on six theories of subjective valuation of risky choices (Wakker, 2010; Tversky & Kahneman, 1992; Sharpe, 1964), namely (1) expected utility with constant relative risk aversion, (2) expected value, (3) prospect theory, (4) cumulative prospect theory, (5) weighted moments, and (6) weighted standardized moments. Choices are between risky lotteries, i.e., known distributions over payoffs (e.g., the monetary value gained or lost). A test e ≜ (L_1, L_2) is a pair of lotteries, and root-causes Θ correspond to parametrized theories that predict, for a given test, which lottery is preferable. The goal is to adaptively select a sequence of tests to present to a human subject, in order to distinguish which of the six theories best explains the subject's responses. We employ the same set of parameters used in Ray et al. (2012) to generate tests and root-causes; in particular, we generated ∼16K tests. Given root-cause θ and test e = (L_1, L_2), one can compute the values of L_1 and L_2, denoted by v_1 and v_2; the probability that root-cause θ favors L_1 is then modeled as a function of v_1 and v_2.

Results. Fig. 3(a) demonstrates the performance of ECED on this data set. The average error probability is computed across 1000 random trials for all methods. We observe that ECED and EC²-Bayes behave similarly on this data set; however, the performance of the US algorithm is much worse. This can be explained by the nature of the data set: it has a more concentrated distribution over Θ, but not over Y. Since tests only provide indirect information about Y through Θ, what the uncertainty sampling scheme actually optimizes is the uncertainty over Θ, and hence it performs quite poorly.

Preference Learning via Pairwise Comparisons
The second application considers a comparison-based movie recommendation system, which learns a user's movie preference (e.g., the favorite genre) by sequentially showing her pairs of candidate movies, and letting her choose which one she prefers. We use the MovieLens 100k dataset (Herlocker et al., 1999), which consists of a matrix of 1-to-5 ratings of 1682 movies from 943 users, and adopt the experimental setup proposed in Chen et al. (2015b). In particular, we extract movie features by computing a low-rank approximation of the user/rating matrix through singular value decomposition (SVD). We then simulate the target "categories" Y that a user may be interested in, by partitioning the set of movies into t (non-overlapping) clusters in the Euclidean space. A root-cause Θ corresponds to the user's favorite movie, and tests are given in the form of movie pairs, i.e., e ≜ (m_a, m_b), where a and b are the embeddings of movies m_a and m_b in the Euclidean space. Suppose the user's favorite movie is represented by θ; then test e is realized as 1 if a is closer to θ than b, and 0 otherwise. We simulate the effect of noise through a response model based on the distances d(θ, a) and d(θ, b), where d(·, ·) is the distance function, and λ controls the level of noise in the system.
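The exact noise formula is garbled in this copy of the paper, so the sketch below assumes a logistic response in the distance gap, which at least matches the described behavior (near-deterministic for large λ, purely random as λ → 0); the function names and this particular form are our assumptions:

```python
import math

def dist(u, v):
    """Euclidean distance between two movie embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def response_prob(theta, a, b, lam):
    """Assumed model for P[X_e = 1 | theta]: the user prefers m_a with a
    probability that grows with how much closer theta is to a than to b;
    lam controls the noise level (large lam -> near-deterministic)."""
    gap = dist(theta, b) - dist(theta, a)   # > 0 when a is the closer movie
    return 1.0 / (1.0 + math.exp(-lam * gap))
```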
Results. Fig. 3(b) shows the performance of ECED compared with the other baseline methods, when we fix the size of Y to 20 and λ to 10. We compute the average error probability across 1000 random trials for all methods. We can see that ECED consistently outperforms all other baselines. Interestingly, EC²-Bayes performs poorly on this data set. This may be due to the fact that the noise level is still high, misguiding the heuristic into selecting noisy, uninformative tests. Fig. 3(c) shows the performance of ECED as we vary λ. When λ = 100, the tests become close to deterministic given a root-cause, and ECED achieves 0 error with ∼12 tests. As we increase the noise rate (i.e., decrease λ), ECED needs many more queries for the prediction error to converge. This is because with a high noise rate, ECED discounts the root-causes more uniformly, so individual tests are hardly informative about Y; this comes at the cost of performing more tests, and hence a lower convergence rate.

Related Work
Active learning in statistical learning theory. In most of the theoretical active learning literature (e.g., Dasgupta (2004b); Hanneke (2007, 2014); Balcan & Urner (2015)), sample complexity bounds are characterized in terms of the structure of the hypothesis class, as well as additional distribution-dependent complexity measures (e.g., the splitting index (Dasgupta, 2004b), the disagreement coefficient (Hanneke, 2007), etc.). In comparison, in this paper we seek computationally efficient approaches that are provably competitive with the optimal policy. Therefore, we do not seek to bound how the optimal policy behaves, and hence we make no assumptions on the hypothesis class.
Persistent noise vs. non-persistent noise. If tests can be repeated with i.i.d. outcomes, the noisy problem can be effectively reduced to the noise-free setting (Kääriäinen, 2006; Karp & Kleinberg, 2007; Nowak, 2009). While modeling noise as non-persistent may be appropriate in some settings (e.g., if the noise is due to measurement error), it is important to consider persistent noise in many other applications, where repeating tests is impossible, or where repeating a test produces identical outcomes. For example, it could be unrealistic to replicate a medical test in practical clinical treatment. Despite some recent developments in dealing with persistent noise in simple graphical models (Chen et al., 2015a) and under strict noise assumptions (Golovin et al., 2010), the more general settings, which we focus on in this paper, are much less understood.

Conclusion
We have introduced ECED, which strictly generalizes the EC² algorithm, for solving practical Bayesian active learning and experimental design problems with correlated and noisy tests. We have proved that ECED enjoys strong theoretical guarantees, by introducing an analysis framework that draws upon adaptive submodularity and information theory. We have demonstrated the compelling performance of ECED on two (noisy) problem instances: an active preference learning task via pairwise comparisons, and a Bayesian experimental design task for preference elicitation in behavioral economics. We believe that our work is an important step towards understanding the theoretical aspects of complex, sequential information gathering problems, and that it provides useful insight into how to develop practical algorithms that address noise.
Tversky, Amos and Kahneman, Daniel. Advances in prospect theory: Cumulative representation of uncertainty. Journal of Risk and Uncertainty, 5(4):297-323, 1992.

A Table of Notations Defined in the Main Paper
We summarize the notations used in the main paper in Table 1.

B The Analysis Framework
In this section, we provide the proofs of our theoretical results in full detail. Recall that for the theoretical analysis, we study the basic setting where test outcomes are binary, and the test noise is independent of the underlying root-causes (i.e., given a test e, the noise rate on the outcome of test e is only a function of e, and not a function of θ).

B.1 The Auxiliary Function and the Proof Outline
The general idea behind our analysis is to show that by running ECED, the one-step gain in learning the value of the target variable is significant, compared with the cumulative gain of an optimal policy over k steps (see Fig. 4). The key idea behind our proof is to show that the greedy policy ECED, at each step, makes effective progress (in the long run) in reducing the expected prediction error, compared with OPT.
In Appendix §C, we show that if tests are greedily selected to optimize the (reduction in) expected prediction error, we may fail to pick tests that have negligible immediate gain in terms of error reduction, but are very informative in the long run. ECED bypasses this issue by selecting tests that maximally distinguish root-causes with different target values. In order to analyze ECED, we need to find an auxiliary function that properly tracks the "progress" of the ECED algorithm; meanwhile, this auxiliary function should allow us to connect the heuristic by which we select tests (i.e., ∆_ECED) with the target objective of interest (i.e., the expected prediction error p_ERR).
We consider the auxiliary function defined in Equation (4.1). For brevity, we suppress the dependence on ψ where it is unambiguous. Further, we use p_θ, p_θ', and p_y as shorthand for P[θ | ψ], P[θ' | ψ], and P[y | ψ].

We illustrate the outline of our proofs in Fig. 5. Our goal is to bound the cost of ECED against the cost of OPT (Theorem 1; proof provided in Appendix §B.6). As explained earlier, our strategy is to relate the one-step gain of ECED, ∆_AUX(X_{e_{ℓ+1}} | ψ_ℓ), with the gain of OPT over k steps (Appendix §B.5, Lemma 8). To achieve that, we divide our proof into three parts:

1. We show that the auxiliary function f_AUX is closely related to the target objective function p_ERR. More specifically, we provide both an upper bound and a lower bound on p_ERR in terms of f_AUX (Lemma 2).

2. To analyze the one-step gain of ECED, we introduce another intermediate auxiliary function. For the test e_{ℓ+1} chosen by ECED, we relate its one-step gain in the auxiliary function, ∆_AUX(X_{e_{ℓ+1}} | ψ_ℓ), to its one-step gain in the EC² objective, ∆_EC²(X_{e_{ℓ+1}} | ψ_ℓ) (Lemma 3; detailed proof provided in Appendix §B.3). The reason we introduce this step is that the EC² objective is adaptive submodular, which lets us relate the one-step gain of a greedy policy to the k-step gain of an optimal policy OPT in the EC² objective.
3. To close the loop, it remains to connect the gain of an optimal policy OPT in the EC² objective with the gain of OPT in the auxiliary function. We show how to achieve this connection in Appendix §B.4. To make the proof more accessible, we insert the annotated color blocks from Fig. 5 into the subsequent subsections of Appendix §B, so that readers can easily relate different parts of this section to the proof outline. Note that we use these annotated color blocks only for positioning the proofs; readers can ignore their notation, as it may differ slightly from the notation used in the proofs.

B.2 Proof of Lemma 2: Relating f_AUX to p_ERR
To prove the second part, we write p_{y_i} = P[Y = y_i | ψ] for all y_i ∈ Y. W.l.o.g., we assume p_{y_1} ≥ p_{y_2} ≥ ⋯ ≥ p_{y_t}; then p_ERR^MAP = 1 − p_{y_1}. We distinguish two cases. 1. p_{y_1} ≤ 1/2. We further have the bound above, where step (a) is by the inequality ln x ≥ 1 − 1/x, which holds for all x > 0.
2. p_{y_1} > 1/2. Since Σ_{i>1} p_{y_i} = 1 − p_{y_1}, the claimed bound follows, which completes the proof.

B.3 Proof of Lemma 3: Bounding ∆_AUX against ∆_EC², ∆_ECED

In this section, we analyze the one-step gain in the auxiliary function, ∆_AUX(X_{e_{ℓ+1}} | ψ_ℓ), for any test e ∈ V. By the end of this section, we will show that it is lower bounded by the one-step gain in the EC² objective, ∆_EC²(X_{e_{ℓ+1}} | ψ_ℓ).
Recall that we assume test outcomes are binary for our analysis; for clarity, in the rest of this section we write the outcome x_e of test e as {+, −} instead of {0, 1}. The notation is illustrated in Fig. 6 and summarized as follows: α, β are the total probability mass of positive/negative root-causes; α_i, β_i are the probability mass of positive/negative root-causes associated with target y_i; μ_i ≜ α_i/α and ν_i ≜ β_i/β (defined in §B.3.5); and an edge {θ, θ′} ∈ E indicates r(θ) ≠ r(θ′), i.e., root-causes θ and θ′ do not share the same target value.

B.3.1 Notations and the Intermediate Goal
For brevity, we first define a few shorthand notations to simplify our derivation. Let p, q be two distributions on Θ, and let h = h₊ p + h₋ q be their convex combination, where h₊, h₋ ≥ 0 and h₊ + h₋ = 1.
In fact, we use p and q to refer to the posterior distributions over Θ after we observe the (noisy) outcome of some binary test e, and h to refer to the distribution over Θ before we perform the test, i.e., p_θ ≜ P[θ | X_e = +], q_θ ≜ P[θ | X_e = −], and h_θ ≜ P[θ] = h₊ p_θ + h₋ q_θ, where h₊ = P[X_e = +] and h₋ = P[X_e = −]. For y_i ∈ Y, we use p_i ≜ Σ_{θ: r(θ)=y_i} p_θ to denote the probability of y_i under distribution p, and q_i ≜ Σ_{θ: r(θ)=y_i} q_θ to denote the probability of y_i under distribution q.
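The identities above can be checked numerically. The following sketch (with a made-up likelihood table and target mapping; all names are illustrative, not from the paper) computes the two posteriors p, q for a binary test and verifies that the prior h is their convex combination, as well as the marginals p_i over target values.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 5                                  # number of root-causes theta (assumed)
prior = rng.dirichlet(np.ones(n))      # h_theta = P[theta]
lik_pos = rng.uniform(0.1, 0.9, n)     # P[X_e = + | theta], illustrative
r = np.array([0, 0, 1, 1, 2])          # mapping r: theta -> target y_i

h_plus = float(prior @ lik_pos)        # h_+ = P[X_e = +]
h_minus = 1.0 - h_plus                 # h_- = P[X_e = -]

p = prior * lik_pos / h_plus           # p_theta = P[theta | X_e = +]
q = prior * (1 - lik_pos) / h_minus    # q_theta = P[theta | X_e = -]

# The prior is the convex combination of the two posteriors:
# h_theta = h_+ * p_theta + h_- * q_theta.
assert np.allclose(prior, h_plus * p + h_minus * q)

# Marginals over target values: p_i = sum over {theta : r(theta)=y_i} of p_theta.
p_i = np.array([p[r == i].sum() for i in range(3)])
assert np.isclose(p_i.sum(), 1.0)
```

This is only a mechanical check of the Bayesian bookkeeping used throughout §B.3, not part of the proof.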
Further, given a test e, we define Θ⁺_i, Θ⁻_i to be the sets of root-causes associated with target y_i whose favorable outcome of test e is + (for Θ⁺_i) and − (for Θ⁻_i). We then define Θ⁺ ≜ ∪_{i∈{1,...,t}} Θ⁺_i and Θ⁻ ≜ ∪_{i∈{1,...,t}} Θ⁻_i to be the sets of "positive" and "negative" root-causes for test e, respectively. Let α_i, β_i be the probability mass of the root-causes in Θ⁺_i and Θ⁻_i, respectively, i.e., α_i ≜ Σ_{θ∈Θ⁺_i} h_θ and β_i ≜ Σ_{θ∈Θ⁻_i} h_θ. Then, by the definitions of h₊, h₋, p_i, q_i, p_θ, q_θ, the decomposition above is easy to verify. In the following, we derive lower bounds for the two resulting terms respectively.
Combining the above four equations, we obtain a lower bound on Part 1. 4. (θ, θ′) ∈ U^(−,+): θ maps x to −, and θ′ maps x to +. By symmetry, we obtain the analogous bound. Combining the above four equations, we obtain a lower bound on Part 2.

B.3.3 A Lower Bound on Term 2
Now we move on to analyze Term 2 of Equation (B.6). By strong concavity of f, and plugging in the definitions of p_i, q_i from Equation (B.4), we get the bound above. Now, combining Equations (B.7), (B.8), and (B.9), we obtain a lower bound for ∆_AUX, and we can rewrite Equation (B.10) accordingly. Next, we will show that the term LB1 is lower bounded by a factor of ∆_EC², while LB2 cannot be much less than 0. At the end of this subsection, we combine these results to connect ∆_AUX(X_{e_{ℓ+1}} | ψ_ℓ) with ∆_EC²(X_{e_{ℓ+1}} | ψ_ℓ) (see Equation (B.18)).
LB1 vs. ∆_EC². We expand the EC² gain ∆_EC² as above. To prove the stated inequality, we consider the following two cases. Observe that the sum over y_i ∈ Y can be bounded as shown; rearranging the terms in the resulting inequality, we get the claimed bound on LB1. A lower bound on LB2. In the following, we analyze LB2.
For brevity, define μ_i ≜ α_i/α and ν_i ≜ β_i/β. We can simplify the above equation accordingly. Denote the summand on the RHS of the above equation by LB2_i. If we can lower bound LB2_i for every y_i ∈ Y, we can then bound the whole sum. Fix i; w.l.o.g., assume μ_i ≥ ν_i. Then the bound above follows, where step (a) is due to the fact that f(x) = x log(n/(βx)) is monotone increasing for n ≥ 3. When n < 3, we have μ_i = 1 and ν_i = 0 (otherwise, there is no uncertainty left in Y), and hence the problem becomes trivial.
In this case, we cannot replace p_i, q_i with μ_i or ν_i; however, we can use the bound on max{μ_i(1 − μ_i), ν_i(1 − ν_i)} noted above. To further simplify notation, we denote γ₁ ≜ 8c − 6 and γ₂ ≜ 8 log(n²/β). Then the above equation can be rewritten as shown. In this case, we have the stated bound, where step (a) is due to the fact that 1/n < 1/2 and therefore n ≥ 3.
Putting the above cases together, with γ₁ = 8c − 6 and γ₂ = 8 log(n²/β), we obtain the stated bound. That is, if β ≥ η, we have LB2_i ≥ 0 for all i ∈ {1, . . ., t}.

B.3.6 Bounding ∆ AUX against ∆ ECED
To finish the proof of Lemma 3, it remains to bound ∆_AUX against ∆_ECED, which we do in this subsection. Recall that ε is the noise rate of test e. Let ρ = 1 − ε be the discount factor for inconsistent root-causes. By the definition of ∆_ECED in Equation (3.1), we first expand the expected offset value of performing test e. Denote γ = 1 − ρ². Then we can expand ∆_ECED as (initial total edge weight) − (offset value) − (expected remaining weight after discounting). With the results from Appendix §B.3.5 and §B.3.6, we complete the proof of Lemma 3.

B.4 Bounding the error probability: Noiseless vs. Noisy setting
Now that we have seen how ECED interacts with our auxiliary function in terms of the one-step gain, it remains to understand how one can relate the one-step gain to the k-step gain of an optimal policy in the auxiliary function. In this subsection, we take an important step towards this goal.
Specifically, we provide Lemma 7. Lemma 7. Consider a policy π of length k, and assume that we use a stochastic estimator (SE). Let p_E be the error probability of the SE before running policy π, p⊥_{E,noisy} be the average error probability of the SE after running π in the noisy setting, and p⊥_{E,noiseless} be the average error probability of the SE after running π in the noiseless setting. Then the stated inequality holds. Proof of Lemma 7. Recall that a stochastic estimator predicts the value of a random variable by randomly drawing from its distribution. Let π be a policy. We denote by p_E(π_φ) the expected error probability of a stochastic estimator after observing π_φ, where φ ∈ 2^{V×O} denotes a set of test-outcome pairs, and π_φ denotes the path taken by π given that it observes φ.
Now, let us see what happens in the noiseless setting: we run π exactly as it is, but in the end compute the error probability of the noiseless setting (i.e., as if we knew which test outcomes were corrupted by noise). Denote the noise put on the tests by Ξ, and the realized noise by ξ. We can view the noiseless setting in the following equivalent way: we run the same policy π exactly as in the noisy setting, but upon completion of π we reveal what Ξ was. We thus have Equation (B.20). The error probability upon observing π_φ and Ξ = ξ is as given, and the expected error probability in the noiseless setting after running π follows, where (a) is by Jensen's inequality and the fact that f(x) = x(1 − x) is concave. Combining with Equation (B.20), we complete the proof.
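The concavity step (a) can be sanity-checked numerically: for the concave f(x) = x(1 − x), averaging over the noise realizations inside f can only increase the value, i.e., E[f(X)] ≤ f(E[X]). A minimal sketch with made-up posteriors (all values are illustrative, not from the paper):

```python
# Jensen's inequality check for the concave function f(x) = x(1 - x),
# as used in step (a) of the proof of Lemma 7.

def f(x):
    # SE error contribution for a class with posterior mass x.
    return x * (1.0 - x)

# Hypothetical posteriors P[y | path, xi] for one class y under different
# noise realizations xi, with weights P[xi] (illustrative numbers).
post = [0.9, 0.4, 0.7]
w = [0.5, 0.3, 0.2]          # P[xi]; sums to 1

mean_post = sum(wi * pi for wi, pi in zip(w, post))   # P[y | path]
lhs = sum(wi * f(pi) for wi, pi in zip(w, post))      # E_xi[f(P[y | path, xi])]
rhs = f(mean_post)                                    # f(E_xi[P[y | path, xi]])

assert lhs <= rhs + 1e-12    # Jensen: concave f implies E[f(X)] <= f(E[X])
```

In words: revealing the noise after the fact (the noiseless accounting) can only lower the stochastic estimator's expected error, which is exactly the direction Lemma 7 needs.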
Essentially, Lemma 7 implies that, in terms of the reduction in the expected prediction error of the SE, running a policy in the noise-free setting yields a higher gain than running the exact same policy in the noisy setting. This result is important to us, since analyzing a policy in the noise-free setting is often easier. We use Lemma 7 in the next subsection to relate the gain of an optimal policy OPT in the EC² objective (which assumes tests are noise-free) with its gain in the auxiliary function (which allows noisy test outcomes).

B.5 The Key Lemma: One-step Gain of ECED vs. k-step Gain of OPT

Now we are ready to state our key lemma, which connects the one-step gain ∆_AUX(X_{e_{ℓ+1}} | ψ_ℓ) to the k-step gain of OPT in the auxiliary function. Lemma 8 (Key Lemma). Fix η, τ ∈ (0, 1). Let n = |supp(Θ)| be the number of root-causes, t = |Y| be the number of target values, OPT(δ_opt) be the optimal policy that achieves p_ERR(OPT(δ_opt)) ≤ δ_opt, and ψ_ℓ be the partial realization observed by running ECED with cost ℓ. We denote by f_AUX^avg(ℓ) := E_{ψ_ℓ}[f_AUX(ψ_ℓ)] the expected value of f_AUX(ψ_ℓ) over all paths ψ_ℓ at cost ℓ. Assume that f_AUX^avg(ℓ) ≤ δ_g. We then have the stated bound, where k = cost(OPT(δ_opt)), c_{η,ε} ≜ 2t(1 − 2ε)²/η, c_δ ≜ (6c + 8) log(n/δ_g), and c ≜ 8 log(2n²/η). Here, by f_EC²,ψℓ we mean the initial EC² objective value given partial realization ψ_ℓ, and by the OPT term we mean the expected gain in f_EC² when we run OPT(δ_opt). Note that OPT(δ_opt) has worst-case length k. Now, imagine that we run the policy OPT(δ_opt), and upon completion of the policy we observe the noise. We consider the gain of such a policy in f_EC², which is at least p_E − p⊥_{E,noiseless}. Step (a) holds because the error probability of the stochastic estimator upon observing ψ_ℓ, i.e., p_E, equals the total edge weight at ψ_ℓ, i.e., f_EC²,ψℓ. Step (b) holds because under the noiseless setting (i.e., assuming we have access to the noise), the EC² objective is always a lower bound on the error probability of the stochastic estimator (due to normalization). Thus the stated inequality follows, where p_{E,ψℓ} denotes the error probability under P[Y | ψ_ℓ], and p⊥_{E,noiseless,ψℓ} denotes the expected error probability of running OPT(δ_opt) after ψ_ℓ in the noise-free setting. By Lemma 7 we get the next inequality, where p⊥_{E,noisy,ψℓ} denotes the expected error probability of running OPT(δ_opt) after ψ_ℓ in the noisy setting. By (the lower bound in) Lemma 4, we know that p_{E,ψℓ} = p_E(ψ_ℓ) ≥ p_ERR^MAP(ψ_ℓ), and hence the bound follows. Taking the expectation with respect to ψ_ℓ, and using (the upper bound in) Lemma 2, we obtain the stated result, where (a) is by Jensen's inequality.
Suppose we run ECED and achieve expected error probability δ_g; then clearly, before ECED terminates, we have E_{ψ_ℓ}[p_ERR^MAP(ψ_ℓ)] ≥ δ_g. Assuming E_{ψ_ℓ}[p_ERR^MAP(ψ_ℓ)] ≤ 1/2, we obtain Equation (B.25). Combining Equation (B.25) with Equation (B.22) completes the proof.
B.6 Proof of Theorem 1: Near-optimality of ECED

We now put together the pieces from the previous subsections to prove our main theoretical result (Theorem 1).
Proof of Theorem 1. In the following, we use both OPT[k] and OPT(δ_opt) to denote the optimal policy that achieves prediction error δ_opt with worst-case cost (i.e., length) k. Define S(π, φ) to be the (partial) realization seen by policy π under realization φ.
(d) Test set V₃ = {e^seq_1, . . ., e^seq_t}. Figure 8: A problem instance where the maximally informative policy, and the myopic policy that greedily maximizes the reduction in expected prediction error, perform significantly worse than ECED (EC²).
Suppose we want to solve Problem (2.1) for δ = 1/3. Similarly to §C.1, the problem is equivalent to finding a minimal-cost policy π that achieves 0 prediction error, because once the error probability drops below 1/3, we know precisely which target value is realized.
There are three sets of tests, all with binary outcomes and unit cost. The first set V₁ := {e₀} contains a single test e₀, which tells us the value of o of the underlying root-cause θ_{i,o}; hence for all i, Θ = θ_{i,o} ⇒ X_{e₀} = o (see Fig. 8(b)). The second set of tests is designed to help us quickly discover the index of the target value via binary search if we have already run e₀, but to offer no information whatsoever (in terms of expected reduction in the prediction error, or expected reduction in the entropy of Y) if e₀ has not yet been run. There are s tests in the second set V₂ := {e₁, e₂, . . ., e_s}. For z ∈ {1, . . ., t}, let b_k(z) be the k-th least-significant bit of the binary encoding of z, so that z = Σ_{k=1}^s 2^{k−1} b_k(z). Then, if Θ = θ_{i,o}, the outcome of test e_k ∈ V₂ is X_{e_k} = 1{b_k(i) = o} (see Fig. 8(c)). The third set of tests allows us to do a (comparatively slow) sequential search over the indices of the target values. Specifically, we have V₃ := {e^seq_1, . . ., e^seq_t}, such that Θ = θ_{i,o} ⇒ X_{e^seq_k} = 1{i = k} (Fig. 8(d)).
Now consider running the maximally informative policy π (the same analysis also applies to the value-of-information policy, which we omit from the paper). Note that in the beginning, no single test from V₁ ∪ V₂ results in any change in the distribution over Y, as it remains uniform no matter which test is performed. Hence, the maximally informative policy only picks tests from V₃, which have non-zero (positive) expected reduction in the posterior entropy of Y. In the likely event that the test chosen is not the index of Y, we are left with a residual problem in which tests in V₁ ∪ V₂ still have no effect on the posterior: the only difference is that there is one less class, but the prior remains uniform. Hence our previous argument still applies, and π will repeatedly select tests in V₃ until a test has outcome 1. In expectation, the cost of π is at least cost(π) ≥ (1/t) Σ_{z=1}^t z = (t+1)/2. On the other hand, a smarter policy π* selects test e₀ ∈ V₁ first, and then performs a binary search by running tests e₁, . . ., e_s ∈ V₂ to determine b_k(i) for all 1 ≤ k ≤ s (and hence the index i of Y). Since the tests have unit cost, the cost of π* is cost(π*) = s + 1.
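The gap between the two policies is easy to instantiate numerically. The sketch below simply evaluates the two expected costs derived above, with t = 2^s target values (s = 10 is an arbitrary illustrative choice):

```python
# Expected cost of the myopic / maximally informative policy (sequential
# search over V_3) versus the smarter policy (e_0 followed by binary search).

s = 10            # number of binary-search tests in V_2 (illustrative)
t = 2 ** s        # number of target values, indexed by tests in V_3

# Sequential search: the target index is uniform, so the expected number of
# tests until outcome 1 is (1/t) * sum_{z=1}^{t} z = (t + 1) / 2.
cost_myopic = sum(range(1, t + 1)) / t
assert cost_myopic == (t + 1) / 2

# Smarter policy: one run of e_0, then s binary-search tests from V_2.
cost_smart = s + 1

print(cost_myopic, cost_smart)   # 512.5 vs 11
```

So for t = 1024 targets, the myopic policy needs about 512.5 tests in expectation while binary search after e₀ needs only 11, matching the (t+1)/2 versus s+1 comparison above.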

Figure 3: Experimental results: ECED outperforms most baselines on both data sets.

Figure 4: On the left, we show a sequential policy in the form of its decision-tree representation. Nodes represent tests selected by the policy, and edges represent outcomes of tests. At step ℓ, a policy maps the partial realization ψ_ℓ = {(e₁, x_{e₁}), . . ., (e_ℓ, x_{e_ℓ})} to the next test e_{ℓ+1} to be performed. In the middle, we show the tests selected by an optimal policy OPT of length k. On the right, we illustrate the change in the auxiliary function as ECED selects more tests: running OPT at any step of the execution of ECED brings f_AUX below some threshold (represented by the red dotted line). The key idea behind our proof is to show that the greedy policy ECED, at each step, makes effective progress (in the long run) in reducing the expected prediction error, compared with OPT.

relating the gain in the EC² objective to the expected reduction in prediction error, and further, in §B.5, applying the upper bound Ub_{p_ERR^MAP} provided in §B.2.

Figure 6: Performing binary test e on Θ and Y. Dots represent root-causes θ ∈ supp(Θ), and circles represent values of the target variable y ∈ Y. The favorable outcome of X_e for the root-causes shown as solid dots is +; the favorable outcome for the root-causes shown as hollow dots is −. We also illustrate the shorthand notation used in §B.3: p, q (the posterior distributions over Y and Θ), h (the prior distribution over Y and Θ), and α, β (the probability mass of solid and hollow dots, respectively, before performing test e).
(a) Root-causes and their associated target values.

Problem Statement. We consider sequential, adaptive policies for picking the tests. Denote a policy by π. In words, a policy specifies which test to pick next, as well as when to stop picking tests, based on the tests picked so far and their corresponding outcomes. After each pick, our observations so far can be represented as a partial realization Ψ ∈ 2^{V×O} (e.g., Ψ encodes which tests have been performed and what their outcomes are). Formally, a policy π : 2^{V×O} → V is defined to be a partial mapping from partial realizations Ψ to tests. Suppose that running π until termination returns a sequence of test-observation pairs of length k, denoted by ψ_π, i.e., ψ_π ≜ {(e_{π,1}, x_{e_{π,1}}), (e_{π,2}, x_{e_{π,2}}), . . ., (e_{π,k}, x_{e_{π,k}})}. This can be interpreted as a random path taken by policy π. Once ψ_π is observed, we obtain a new posterior on Θ (and consequently on Y). After observing ψ_π, the MAP estimator of Y has error probability p_ERR^MAP(ψ_π).

Algorithm 1: The Equivalence Class Edge Discounting (ECED) Algorithm. Input: [λ_{θ,x}]_{n×m} (or conditional probabilities P[X | Θ]), prior P[Θ], mapping r : supp(Θ) → Y.

Further, we call a test e non-informative if its outcome does not affect the distribution of Θ, i.e., for all θ, θ′ ∈ supp(Θ) and x_e ∈ O, P[X_e = x_e | θ] = P[X_e = x_e | θ′]. Obviously, performing a non-informative test does not reveal any useful information about Θ (and hence Y). Therefore, we should augment our basic value function δ_BS so that the value of a non-informative test is 0. Following this principle, we define δ_OFFSET(x_e | ψ_π) ≜ Σ_{{θ,θ′}∈E} P[θ, ψ_π] P[θ′, ψ_π] · (1 − max_θ λ²_{θ,x_e}) as the offset value for observing outcome x_e. It is easy to check that if test e is non-informative, then δ_BS(x_e | ψ_π) = δ_OFFSET(x_e | ψ_π).
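The role of the offset can be illustrated with a small numeric sketch. The discount coefficients below are a hypothetical choice made only for illustration (we take λ_{θ,x_e} = P[X_e = x_e | θ] / max_{θ′} P[X_e = x_e | θ′]; the paper's exact coefficients may differ), and the formula for δ_BS follows the edge-discounting description schematically. The point being checked is only the design principle stated above: for a non-informative test, δ_BS equals δ_OFFSET, so its net ECED value is 0.

```python
import itertools

# Hypothetical setup: 4 root-causes, 2 target values (illustrative names).
prior = {"t1": 0.3, "t2": 0.2, "t3": 0.3, "t4": 0.2}   # P[theta]
target = {"t1": "y1", "t2": "y1", "t3": "y2", "t4": "y2"}

# Likelihoods P[X_e = + | theta] of a NON-informative test: identical rows.
lik_pos = {th: 0.6 for th in prior}

def eced_value(x):
    """delta_BS - delta_OFFSET for outcome x, with the illustrative
    discount lambda_theta = P[x | theta] / max_theta' P[x | theta']."""
    lik = {th: (lik_pos[th] if x == "+" else 1 - lik_pos[th]) for th in prior}
    lam = {th: lik[th] / max(lik.values()) for th in prior}
    bs = offset = 0.0
    for th, th2 in itertools.combinations(prior, 2):
        if target[th] == target[th2]:
            continue  # edges only connect root-causes with different targets
        w = prior[th] * prior[th2]
        bs += w * (1 - lam[th] * lam[th2])          # discounted edge weight
        offset += w * (1 - max(lam.values()) ** 2)  # offset term
    return bs - offset

# A non-informative test has zero net ECED value for either outcome.
assert abs(eced_value("+")) < 1e-12
assert abs(eced_value("-")) < 1e-12
```

Since a non-informative test gives all root-causes the same likelihood, every λ_θ equals 1, so both δ_BS and δ_OFFSET vanish and the test is (correctly) assigned no value.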

Table 1: A reference table of notations used in the main paper.
- δ_BS(x_e | ψ): the "basic" component of the ECED gain from observing x_e, having observed ψ
- δ_OFFSET(x_e | ψ): the "offset" component of the ECED gain from observing x_e, having observed ψ
- ∆_ECED(X_e | ψ): the ECED gain, which is myopically optimized at each iteration of the ECED algorithm
- ∆_ECED,ψ(X_e): suppose we have observed ψ, and re-initialize the EC² graph so that the total edge weight is f_EC²,ψ(∅); then ∆_ECED,ψ(X_e) is the expected reduction in edge weight from performing test e and discounting edge weights according to ECED. It is the renormalized version of ∆_ECED(X_e | ψ), i.e., ∆_ECED,ψ(X_e) = ∆_ECED(X_e | ψ)/P[ψ]²
- ∆_EC²,ψ(X_e): the expected gain in f_EC²,ψ from performing test e and cutting edge weights according to EC². It can be interpreted as ∆_ECED,ψ(X_e) as if the test's outcome were noise-free, i.e., ε_{θ,e} = 0 for all θ
- t: |Y|, the number of possible target values
- n: |supp(Θ)|, the number of root-causes
- π: a policy, i.e., a (partial) mapping from observation vectors to tests
- Ψ: random variable encoding a partial realization, i.e., a set of test-observation pairs
- ψ_π: the partial realization, i.e., the set of test-observation pairs, observed by running policy π
- p_ERR(π): E_{ψ_π}[p_ERR^MAP(ψ_π)], the expected error probability of running policy π
- OPT: the optimal policy for Problem (2.1)
- G: G = (supp(Θ), E), the (weighted) graph constructed for the EC² algorithm
- w({θ, θ′}): weight of edge {θ, θ′} ∈ E in the EC² graph G
- f_EC²: the EC² objective function, with f_EC²(∅) := Σ_{{θ,θ′}∈E} P[θ] P[θ′]
- f_EC²,ψ: the EC² objective function, with f_EC²,ψ(∅) := Σ_{{θ,θ′}∈E} P[θ | ψ] P[θ′ | ψ]
- λ_θ,e: discount coefficient of root-cause θ, used by ECED when computing ∆_ECED
- ε_θ,e: 1 − max_{x_e} P[X_e = x_e | θ], the noise rate for a test e
- δ: parameter of f_AUX; it is only used for the analysis of ECED
- ∆_AUX(X_e | ψ): the expected gain in f_AUX from performing test e, conditioned on partial realization ψ
- c_{η,ε}, c_δ, c: constants required by Lemma 3
- λ: parameter controlling the error rate of tests (see §5)

Lemma 2: Relating f_AUX to p_ERR. We define p_E(ψ) as the prediction error of a stochastic estimator upon observing ψ, i.e., the probability of mispredicting y if we make a random draw from P[Y | ψ]: p_E(ψ) ≜ Σ_{y∈Y} P[y | ψ](1 − P[y | ψ]). We show in Lemma 4 that p_ERR^MAP(ψ) is within a constant factor of p_E(ψ). Proof of Lemma 4. We can always lower bound p_E by p_ERR^MAP, since by definition p_ERR^MAP(ψ) = 1 − max_y P[y | ψ]. Next, we provide lower and upper bounds on the second term on the RHS of Equation (B.1). Lemma 5. 2 p_ERR^MAP ≤ Σ_{y∈Y} H₂(p_y) ≤ 3(H₂(p_ERR^MAP) + p_ERR^MAP log n). Proof of Lemma 5. We first prove the inequality on the left. Expanding the middle term involving the binary entropy of p_y, we get Σ_{y∈Y} H₂(p_y) = Σ_{y∈Y} p_y log(1/p_y) + Σ_{y∈Y}(1 − p_y) log(1/(1 − p_y)). Now, we are ready to state the upper bound Ub_{p_ERR^MAP}. Proof of Lemma 2.
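The constant-factor relation between p_ERR^MAP and p_E can be checked numerically. The factor 2 used in the assertion below is our own elementary bound (p_E = 1 − Σ_y p_y² ≤ 1 − p_max² ≤ 2(1 − p_max)), stated here only as a sketch consistent with the "within a constant factor" claim in the text:

```python
import random

random.seed(0)

def errors(p):
    """MAP error and stochastic-estimator error of a distribution p over Y."""
    p_map = 1.0 - max(p)                      # p_ERR^MAP = 1 - max_y p_y
    p_se = sum(x * (1.0 - x) for x in p)      # p_E = sum_y p_y (1 - p_y)
    return p_map, p_se

for _ in range(1000):
    t = random.randint(2, 8)
    raw = [random.random() for _ in range(t)]
    p = [x / sum(raw) for x in raw]           # a random distribution over Y
    p_map, p_se = errors(p)
    # Constant-factor relation: p_MAP <= p_E <= 2 * p_MAP.
    assert p_map - 1e-12 <= p_se <= 2 * p_map + 1e-12
```

Equality in the lower bound is attained, e.g., by the uniform distribution, where both errors equal 1 − 1/t.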
Clearly, Σ_{{θ,θ′}∈E} p_θ p_θ′ log(1/·) can be bounded as above. Combining with Lemma 5 and Lemma 6, we get f_AUX(ψ) ≤ 3c · (H₂(p_ERR^MAP) + p_ERR^MAP log n) + 4(H₂(p_E) + p_E log n) ≤ (3c + 4) · (H₂(p_ERR^MAP) + p_ERR^MAP log n), which completes the proof.

Proof of Lemma 8. Let ψ_ℓ be a path ending at level ℓ of the greedy algorithm. Recall that ∆_EC²(X_e | ψ_ℓ) denotes the gain in f_EC² if we perform test e assuming it to be noiseless (i.e., we perform edge cutting as if the outcome of test e were noiseless), conditioned on partial observation ψ_ℓ. Further, recall that ∆_AUX(X_e | ψ_ℓ) denotes the gain in f_AUX if we perform noisy test e after observing ψ_ℓ and perform a Bayesian update on the root-causes. Let e* = arg max_e ∆_ECED(X_e | ψ_ℓ) be the test chosen by ECED, and ê = arg max_e ∆_EC²(X_e | ψ_ℓ) be the test that maximizes ∆_EC²; then by Lemma 3 we obtain the stated bound, with constant c(1 − 2ε)²/16.