Non-marginal Decisions: A Novel Bayesian Multiple Testing Procedure

In this paper we consider the problem of multiple testing when the hypotheses are dependent. In most of the existing literature, either Bayesian or non-Bayesian, the decision rules mainly focus on the validity of the test procedure rather than actually utilizing the dependency to increase efficiency. Moreover, the decisions regarding different hypotheses are marginal in the sense that they do not depend upon each other directly. However, in realistic situations, the hypotheses are usually dependent, and hence it is desirable that the decisions regarding the dependent hypotheses are taken jointly. In this article we develop a novel Bayesian multiple testing procedure that coherently takes this requirement into consideration. Our method, which is based on new notions of error and non-error terms, substantially enhances efficiency by judicious exploitation of the dependence structure among the hypotheses. We prove that our method minimizes the posterior expected loss associated with an additive "0-1" loss function; we also prove theoretical results on the relevant error probabilities, establishing the coherence and usefulness of our method. The optimal decision configuration is not available in closed form, and we propose a novel and efficient simulated annealing algorithm for the purpose of optimization, which is also generically applicable to binary optimization problems. Numerical studies demonstrate that in dependent situations, our method performs significantly better than some existing popular conventional multiple testing methods, in terms of accuracy and power control. Moreover, application of our ideas to a real, spatial data set associated with radionuclide concentration in Rongelap islands yielded insightful results.

Here we discuss the multiple comparison problem in a Bayesian decision-theoretic framework, given data D (say). Let us first define the required quantities. In Guindani, Müller & Zhang (2009), the authors considered an additive loss function which combines the number of true positives with the number of false positives as follows:

L(d, r) = λ Σ_{i=1}^m d_i (1 − r_i) − Σ_{i=1}^m d_i r_i,   (1)

where λ > 0 is the Lagrangian multiplier.
The above loss function may be interpreted as the loss function which maximizes the number of true positives for each fixed level of the number of false positives. The optimal decision rule is that decision rule which minimizes the posterior risk function given by

R(d) = Σ_{i=1}^m d_i {λ(1 − v_i) − v_i},   (2)

where v_i = E(r_i | D) = Pr(r_i = 1 | D) = Pr(H_1i is true | D), i = 1, · · · , m.
Clearly, the decision rule d = (d_1, · · · , d_m) which minimizes the above posterior risk (the oracle rule) is given by (see also Guindani et al. (2009))

d*_i = 1{v_i > λ/(1 + λ)}, i = 1, · · · , m.   (3)

The choice of the Lagrange multiplier λ has not been discussed in Guindani et al. (2009). However, λ can be chosen such that the posterior expectation of the number of false positives is less than or equal to α, where α ∈ (0, 1) is the appropriate level of significance. In other words,

Σ_{i=1}^m d*_i (1 − v_i) ≤ α.   (4)

Note that Σ_{i=1}^m d*_i (1 − v_i) is a non-increasing function of λ. Hence, we can choose λ to be the minimum value such that (4) is satisfied. Guindani et al. (2009) assumed that θ_1, · · · , θ_m are a priori independent, with θ_i having the prior distribution π_i(·); however, the development of their Bayesian decision theory does not require the assumption of independence. Independent priors fail to account for the multiplicity effect, which entails that the decisions associated with the individual hypotheses when they are tested in isolation are the same as the decisions when other hypotheses are also tested simultaneously. Indeed, Abramovich & Angelini (2006) mention "from the Bayesian perspective, placing independent priors on θ_i and using additive losses (e.g., a sum of "0 −1" individual losses for each test) does not imply any multiplicity correction: the evident optimal Bayes rule in this case simply applies the corresponding individual Bayes testing rule to each one of m tests "ignoring", therefore, the multiplicity effect." A dependent prior on (θ_1, . . . , θ_m) corrects for the effect of multiplicity; see Appendix A for the details.
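The oracle rule (3) and the choice of λ via (4) can be sketched numerically. The posterior probabilities v below are hypothetical illustrative values, not taken from the paper; the grid scan exploits the fact that Σ d*_i (1 − v_i) is non-increasing in λ, so the first λ satisfying (4) is the minimal one.

```python
import numpy as np

# Hypothetical marginal posterior probabilities v_i = Pr(H_1i | D);
# in practice these come from the fitted Bayesian model.
v = np.array([0.95, 0.80, 0.40, 0.99, 0.10])
alpha = 0.5  # bound on the posterior expected number of false positives

def oracle_decisions(v, lam):
    """Oracle rule (3): d*_i = 1{v_i > lam / (1 + lam)}."""
    return (v > lam / (1.0 + lam)).astype(int)

def choose_lambda(v, alpha):
    """Smallest lambda on a fine grid satisfying condition (4):
    sum_i d*_i (1 - v_i) <= alpha.  Works because the left-hand side
    is non-increasing in lambda."""
    for lam in np.arange(1e-3, 100.0, 1e-3):
        d = oracle_decisions(v, lam)
        if np.sum(d * (1.0 - v)) <= alpha:
            return lam, d
    raise ValueError("no lambda on the grid satisfies (4)")

lam, d = choose_lambda(v, alpha)  # rejects the nulls with v_i above lam/(1+lam)
```

Note that the resulting decisions are marginal: each d_i depends only on its own v_i, which is precisely the limitation discussed next.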
But note that, in the case of Guindani et al. (2009), although the posterior probability does not change with the number of hypotheses, the choice of the Lagrange multiplier λ of the loss function (1) depends on the number of hypotheses being tested. Further clarification on this is available if, instead of a single Lagrange multiplier λ, a separate λ_i is employed for each term d_i (1 − r_i).
Then in (3), λ_i replaces λ, and for m tests, one can choose λ_i to be the smallest value such that d*_i (1 − v_i) ≤ α/m. But for large m, this results in very conservative tests with low power.
However, even if multiplicity may be accounted for by dependent priors, for additive loss functions such as (1), for each i, the decision d*_i is the marginal decision associated with the i-th hypothesis, involving just the marginal posterior probability of H_1i. In other words, even though it is expected in multiple testing that the decision regarding the i-th hypothesis should functionally depend upon the decisions on every other hypothesis (that is, d*_i should be a function of the joint posterior probability of all the hypotheses, not just the marginal posterior probability of H_1i), particularly if the θ_i are dependent, optimal decision rules such as (3) still force marginal decisions, even when the θ_i are dependent. Abramovich & Angelini (2006) attempt non-marginal decisions by considering a "0-1" loss function for multiple testing, but were required to place a prior on the number of true alternatives. This extra level of prior assumption is separate from the underlying model, and the decisions on the hypotheses may not be robust with respect to different choices of this prior.

NEW PROPOSAL TO OBTAIN NON-MARGINAL DECISIONS
Suppose we have a set of initial decisions, say, (d_1^{(0)}, · · · , d_m^{(0)}). Starting from this initial decision, we update the decisions sequentially, based on the remaining ones, using the same principle as the Gibbs sampler (Gelfand & Smith (1990)). That is, for each i = 1, . . . , m, we update d_i conditional on d_{−i} = (d_1, . . . , d_{i−1}, d_{i+1}, . . . , d_m). The updating will take place deterministically by minimizing some error criterion. In the optimization literature, optimizing the desired function by sequentially optimizing each argument, keeping the other arguments fixed, is known as "block relaxation" (see, for example, de Leeuw (1994), Lange (2010)). At each step the objective function improves: it increases for a maximization problem and decreases for a minimization problem. Although the principle is deterministic, in our case we will draw connections with Gibbs sampling and use Markov chain theory to determine the number of iterations necessary for the decisions to converge to a unique set of decision rules.
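A generic sketch of block relaxation over binary decisions follows, assuming an arbitrary user-supplied objective f. The toy objective f_demo is hypothetical and serves only to illustrate the coordinate-wise sweeps; for a general f such sweeps may stop at a local optimum, whereas the criteria considered in this paper have unique optima.

```python
import numpy as np

def block_relaxation(f, m, d0=None, max_sweeps=100):
    """Coordinate-wise optimization over {0,1}^m: sweep through the
    coordinates, setting each d_i to the value in {0, 1} that maximizes f
    with the remaining coordinates held fixed; stop when a full sweep
    changes nothing (a fixed point of the updates)."""
    d = np.zeros(m, dtype=int) if d0 is None else np.asarray(d0, dtype=int).copy()
    for _ in range(max_sweeps):
        changed = False
        for i in range(m):
            old = d[i]
            vals = []
            for val in (0, 1):
                d[i] = val
                vals.append(f(d))
            d[i] = int(np.argmax(vals))  # ties resolve to 0
            changed = changed or (d[i] != old)
        if not changed:
            break
    return d

def f_demo(d):
    # Hypothetical toy objective with an interaction between d_0 and d_1,
    # so the optimal d_0 genuinely depends on d_1 (a non-marginal situation).
    return 2 * d[0] + 2 * d[1] - 3 * d[0] * d[1] + d[2]

d_opt = block_relaxation(f_demo, 3)
```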
2.1 Illustration of block relaxation in the case of Guindani et al. (2009)

Before introducing our criterion for updating the d_i, we illustrate the principle using the risk function (2). For this risk function, suppose we set (d_1^{(0)}, · · · , d_m^{(0)}) = (0, · · · , 0) as our initial decision, that is, we initially accept all the null hypotheses. We then start updating the decisions based on the remaining decisions by minimizing the posterior risk R(d) given by (2), keeping the remaining decisions fixed.
Keeping (d_2^{(0)}, · · · , d_m^{(0)}) = (0, · · · , 0) fixed, the posterior risk, as a function of d_1 only, becomes d_1{λ(1 − v_1) − v_1}, which is minimized by d_1^{(1)} = 1{v_1 > λ/(1 + λ)}. We then keep (d_1^{(1)}, 0, · · · , 0) fixed, and now the posterior risk, as a function of d_2 only, becomes d_2{λ(1 − v_2) − v_2}, which would be minimized if we put d_2 = 1{v_2 > λ/(1 + λ)}. Thus we get d_2^{(1)} = 1{v_2 > λ/(1 + λ)}. Continuing in this manner, we finally obtain d_i^{(1)} = 1{v_i > λ/(1 + λ)}, for i = 1, · · · , m. Any further attempt to update does not change these decisions. Hence, these are the converged, optimal set of d*_i's. It is easy to verify that starting from any other set of initial decisions, the same set of optimal decisions is reached.

New error based criterion
Define, for i = 1, . . . , m,

z_i = Π_{j≠i} {d_j r_j + (1 − d_j)(1 − r_j)},   (5)

so that z_i = 1 if and only if all the decisions other than the i-th are correct. Now consider the term

TP = Σ_{i=1}^m d_i r_i z_i.   (6)

This is the number of cases for which {d_i = 1, r_i = 1, z_i = 1}; in words, the number of cases for which the i-th decision correctly accepts H_1i, and all other decisions, which may accept either H_0j or H_1j, j ≠ i, are correct. This is the number of true positives, and we maximize its posterior expectation with respect to d. But there are also errors to be controlled, for example, E_1 = Σ_{i=1}^m d_i (1 − r_i) z_i and E_2 = Σ_{i=1}^m d_i (1 − r_i)(1 − z_i). Here E_1 is the number of cases for which {d_i = 1, r_i = 0, z_i = 1}, that is, the number of cases for which H_1i is wrongly accepted while the remaining decisions are correct; E_2 is the number of cases for which H_1i is wrongly accepted and at least one decision regarding the other hypotheses is also wrong; E_3 is the number of cases for which the i-th decision is correct but at least one of the other decisions is wrong. The complete set of terms corresponding to errors and correct decisions is given in Appendix B; the relevant error terms are combined into a single quantity E, which we will control, subject to maximizing TP. The reason we do not attempt to control all the error terms is that controlling too many error terms will result in over-penalization, which in turn results in low power.
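The quantities z_i, TP and the example error terms can be computed directly from binary vectors d (decisions) and r (true hypothesis indicators). The small vectors below are hypothetical, and E_3 is coded from its verbal description (i-th decision correct, at least one other decision wrong).

```python
import numpy as np

def z_vector(d, r):
    """z_i as in (5): z_i = prod_{j != i} { d_j r_j + (1 - d_j)(1 - r_j) },
    i.e. z_i = 1 iff every decision other than the i-th is correct."""
    correct = d * r + (1 - d) * (1 - r)  # 1 where decision j is correct
    return np.array([np.prod(np.delete(correct, i)) for i in range(len(d))])

def error_breakup(d, r):
    """TP and the example error terms; E3 follows its verbal description:
    i-th decision correct but at least one of the other decisions wrong."""
    z = z_vector(d, r)
    correct = d * r + (1 - d) * (1 - r)
    TP = int(np.sum(d * r * z))              # correct acceptances of H_1i, others all correct
    E1 = int(np.sum(d * (1 - r) * z))        # H_1i wrongly accepted, others all correct
    E2 = int(np.sum(d * (1 - r) * (1 - z)))  # H_1i wrongly accepted, some other decision wrong
    E3 = int(np.sum(correct * (1 - z)))      # i-th decision correct, some other decision wrong
    return TP, E1, E2, E3

d = np.array([1, 1, 0])  # hypothetical decisions
r = np.array([1, 0, 0])  # hypothetical truth indicators
```

Note how the wrong decision d_2 = 1 sets z_i = 0 for every other i, which is exactly the coupling that makes the decisions non-marginal.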
Unlike Guindani et al. (2009), ours is not an approach based on loss functions. We will minimize the posterior expectation of −TP, with TP given by (6), subject to controlling the posterior expectation of E; but we do not seek to give a loss function interpretation to the resulting function to be minimized after incorporating the Lagrange multiplier.
With E to be controlled, the function to be minimized is given by

Σ_{i=1}^m d_i {λ(1 − v_{i|d_j, j≠i}) − v_{i|d_j, j≠i}},   (11)

where v_{i|d_j, j≠i} = P(r_i = 1, z_i = 1 | D). Note that (11) is of the same form as the risk function (2); however, unlike (2), which is linear with respect to d_1, . . . , d_m, here v_{i|d_j, j≠i} itself depends upon the remaining decisions through z_i, so that our maximizing function is non-linear in d_1, . . . , d_m.
As a rule of thumb, the Lagrange multiplier λ can be chosen to be the minimum value such that Σ_{i=1}^m d_i (1 − v_{tr,i}) ≤ α for an appropriate level of significance α ∈ (0, 1), where, for i = 1, . . . , m, v_{tr,i} is an easily computable surrogate for v_{i|d_j, j≠i}, the suffix "tr" standing for "thumb rule".
For notational simplicity, we will write v_i instead of v_{i|d_j, j≠i}. Abusing notation, we also denote the resulting objective function by f(d). We then need to maximize f(d) with respect to d. In the next section we present our algorithm for this purpose.

Maximization of f using block relaxation algorithm
For implementation, we can select any suitable set of initial decisions (d_1^{(0)}, · · · , d_m^{(0)}). In Appendix C we show that f(d) is identifiable, that is, f(d^{(1)}) ≠ f(d^{(2)}) for any two distinct sets of decisions d^{(1)} and d^{(2)}. Hence, the function f must have a unique maximizer, and so our algorithm, based on block relaxation, converges to this unique maximizer.
To obtain a better understanding of what the algorithm entails, we need to analyse the function (15), where τ is either t or t − 1, {j_1, . . . , j_L} ⊂ {1, . . . , m}, and d^{(τ)} denotes the current set of decisions. To see this, suppose, if possible, that v^{(t)}_{j_ℓ, d_i=0} > λ/(1 + λ) for some ℓ. In conjunction with (16) and (17), this would force P(r_j = 1, z_j = 1 | D) to exceed 1, which is impossible since it is a probability. Hence, (18) must hold. The above arguments imply that for λ ≥ 1, if (17) holds, then d_i^{(t)} = 1; on the other hand, if (17) does not hold, it is not necessarily the case that d_i^{(t)} = 1, and if v^{(t)}_{j_ℓ, d_i=0} > λ/(1 + λ) for some ℓ, then using (16) it can be seen that d_i^{(t)} = 0. Finally, note that since the probability w_{−i} does not depend upon d_i, we can divide f in (15) by w_{−i} while minimizing with respect to d_i. Then all the above arguments continue to hold with λ/(1 + λ) replaced by λ/{w_{−j}(1 + λ)} for j = 1, . . . , m.

Special case: independent data and prior
If, for i = 1, . . . , m, X_i | θ_i ∼ f(· | θ_i) and θ_i ∼ π_i(·) independently, then (12) holds. In this case, the conditional probabilities (22) and (23) reduce to P(H_{d_i,i} | X_i) and P(H_{1,i} | X_i), respectively. Hence, the optimum set of decisions is a function of these marginal posterior probabilities associated with the hypotheses. However, the threshold λ/{w_{−i}(1 + λ)} still depends upon the remaining decisions through w_{−i}. Hence, we still need the iterative algorithm to obtain the optimal set of decisions, unlike in the case of Guindani et al. (2009).

ORDER BASED CRITERION
Now let us assume that the hypotheses (H_0i, H_1i); i = 1, . . . , m, are ordered in some manner, for example, by ordering the Bayes factors (see, for example, Chen & Sarkar (2004)). Then, as an alternative to (5), we can define z_i as follows:

z_1 ≡ 1 and, for i > 1, z_i = Π_{j<i} {d_j r_j + (1 − d_j)(1 − r_j)},   (25)

so that z_i = 1 if and only if all the decisions on the hypotheses ordered before the i-th are correct. The number of true positives, TP, and the error term E remain of the same form as before, with only the definition of z_i replaced with (25). That is, we now minimize (26), where, for i > 1, v_{i|1,...,i−1} = P(r_i = 1, z_i = 1 | D), and for i = 1, v_{1|0} = v_1 = P(r_1 = 1, z_1 = 1 | D) = P(H_11 | D).
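Under the ordered definition (25), z reduces to a cumulative product over the preceding decisions; a small sketch with hypothetical vectors follows.

```python
import numpy as np

def ordered_z(d, r):
    """Ordered alternative to (5): z_1 = 1 and, for i > 1,
    z_i = prod_{j < i} { d_j r_j + (1 - d_j)(1 - r_j) },
    so z_i depends only on decisions for hypotheses ordered before i."""
    correct = d * r + (1 - d) * (1 - r)
    z = np.ones(len(d), dtype=int)
    z[1:] = np.cumprod(correct)[:-1]
    return z

# Hypothetical example: the second decision is wrong, so z_i = 0 for all
# hypotheses ordered after it.
z = ordered_z(np.array([1, 1, 0]), np.array([1, 0, 0]))
```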

Maximization procedure
Let the m hypotheses be ordered in terms of the marginal Bayes factors BF_1, . . . , BF_m such that (29) holds. Note that if, for some k ∈ {1, . . . , m}, BF_k is sufficiently large, so that there is significantly more evidence in favour of H_{1,k} in comparison with H_{0,k}, then because of the ordering (29) the same is true of every hypothesis ordered before the k-th. Suppose we initialize by setting all the decisions equal to 1. Note that, to update d_m in the next iteration, it is sufficient to maximize the only term in (26) involving d_m, namely, the term indexed by i = m; this is maximized by d_m^{(1)} = 1{v_{m|1,...,m−1} > λ/(1 + λ)}. Given d_m^{(1)}, we then update d_{m−1}, and so on. More generally, assume that at some iteration t, d_i^{(t)} = 1. Then, with λ ≥ 1, v^{(t)}_{j_L|1,...,j_L−1, d_i=1} ≥ λ/(1 + λ), and it is clear that this implies d_k^{(t)} = 1 for k = 1, . . . , i − 1 as well.
The above arguments show that, in general, given that all the initial decisions were set equal to 1 and the decisions are updated in the sequence d_m, d_{m−1}, . . . , d_1, if at iteration t we have d_j^{(t)} = 1 for some j ∈ {2, . . . , m}, then d_k^{(t)} = 1 for k = 1, . . . , j − 1. Interestingly, this procedure can be seen to generalize the step-down procedure of Chen & Sarkar (2004).
As an alternative to our above procedure of initializing by setting all decisions equal to 1 and updating the decisions in the sequence d_m, d_{m−1}, . . . , d_1, we can also initialize by setting all decisions equal to 0 and update in the sequence d_1, d_2, . . . , d_m. Note that it is not at all necessary to update the decisions in the manners described above. In fact, it is perfectly fine to update in any order, given any set of initial values; the underlying block relaxation algorithm will ensure convergence in any case. However, the above procedures can be much more efficient, in that far fewer iterations may be needed for convergence.
Moreover, computation time is also expected to be much smaller, since once some decision attains the value 1 (while updating in the sequence d_m, d_{m−1}, . . . , d_1) or the value 0 (while updating in the sequence d_1, d_2, . . . , d_m), all the remaining decisions must also be 1 or 0, respectively.
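The computational shortcut just described can be sketched as follows; the callback update(i, d), the probabilities v and the 0.5 threshold are all hypothetical stand-ins for the paper's conditional optimization of d_i.

```python
import numpy as np

def stepdown_sweep(update, m):
    """Update d_m, d_{m-1}, ..., d_1 starting from all-ones decisions;
    once some d_j is kept at 1, all lower-indexed decisions must also be 1
    (by the monotonicity argument above), so they are set directly."""
    d = np.ones(m, dtype=int)
    for i in range(m - 1, -1, -1):
        d[i] = update(i, d)
        if d[i] == 1:
            d[:i] = 1  # shortcut: no need to visit the remaining coordinates
            break
    return d

# Hypothetical stand-in for the conditional optimization of d_i:
# a fixed threshold on posterior quantities ordered by decreasing evidence.
v = np.array([0.9, 0.8, 0.6, 0.4, 0.2])
d = stepdown_sweep(lambda i, d: int(v[i] > 0.5), len(v))
```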

EXPECTED NUMBER OF ITERATIONS TO ACHIEVE CONVERGENCE OF OUR BLOCK RELAXATION ALGORITHMS
Note that all our block relaxation algorithms for updating the decisions can be viewed as discrete Gibbs samplers, with the full conditional distributions being Bernoulli distributions; in other words, for a set S of the form either {d_j : j ≠ i} or {d_j : j < i}, d_i given S has a Bernoulli distribution. Since we know that the block relaxation algorithms converge to a unique point d* = (d*_1, . . . , d*_m) from any starting point d^{(0)} = (d_1^{(0)}, . . . , d_m^{(0)}), the point d* can be viewed as an absorbing state, and the expected number of iterations required for convergence is the expected time until absorption. The solution is available using well-known results of absorbing Markov chain theory. Indeed, any arbitrary absorbing Markov chain P has the following canonical form:

P = [ Q  R ]
    [ 0  I ],

where Q is the matrix of probabilities of transitions from the transient states to the transient states, R is the matrix of probabilities of transitions from the transient states to the absorbing states, and 0 and I are the null matrix and the identity matrix, respectively. It is well known that the matrix I − Q is invertible. Let N = (I − Q)^{−1}. Then the following well-known theorem gives the desired expected times to absorption.
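The fundamental-matrix computation is standard; below is a minimal numerical sketch with a hypothetical two-transient-state chain, where the row sums of N = (I − Q)^{−1} give the expected numbers of steps to absorption.

```python
import numpy as np

# Canonical form P = [[Q, R], [0, I]] of an absorbing Markov chain:
# Q holds transient-to-transient transition probabilities.  This Q is a
# hypothetical two-transient-state example, not derived from the paper.
Q = np.array([[0.50, 0.25],
              [0.25, 0.50]])

# Fundamental matrix N = (I - Q)^{-1}; its (i, j) entry is the expected
# number of visits to transient state j when starting from transient state i.
N = np.linalg.inv(np.eye(2) - Q)

# Row sums of N give the expected number of steps before absorption.
expected_steps = N @ np.ones(2)
```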

SET-UPS
Assuming the set-up X_i | θ_i ∼ f(· | θ_i); (θ_1, . . . , θ_m) ∼ π, the posterior probability of {θ_i ∈ Θ_1i}, given D = (X_1, . . . , X_m), is given by (A.1). In the case of independent priors, π(θ_{−i} | θ_i) = π(θ_{−i}), which implies that (A.1) reduces to the individual posterior probability of the event {θ_i ∈ Θ_1i}, based on X_i alone. Hence, in this case, the posterior probability associated with testing H_0i in isolation is the same as the posterior probability associated with testing H_0i when multiple hypotheses are tested simultaneously along with H_0i. In other words, "multiplicity control" is not employed.
On the other hand, if θ_1, . . . , θ_m are not independent a priori, then it is clear from (A.1) that the posterior involves the entire data set D, not just X_i as in the independent case. Since the j-th hypothesis involves θ_j, which, in turn, is associated with X_j, an increase in the number of hypotheses also entails an increase in the size of the data set D. Hence, as the number of hypotheses to be simultaneously tested changes, so does the posterior probability associated with H_0i. Thus, for dependent priors, multiplicity correction is automatically accounted for. Since all Bayesian multiple testing procedures rely on the individual posterior probabilities, multiplicity correction is inherent in the Bayesian paradigm.

B. BREAK-UP OF THE NUMBER OF HYPOTHESES BEING TESTED INTO ERROR AND NON-ERROR TERMS
We denote the error terms by E and the non-error terms by NE.
(1) NE_1 = Σ_{i=1}^m d_i r_i z_i, equalling #{i : d_i = 1, r_i = 1, z_i = 1}. In words, the term corresponds to the number of cases where H_1i is correctly accepted, and all other decisions are also correct.
(2) NE_2 = Σ_{i=1}^m (1 − d_i)(1 − r_i) z_i, equalling #{i : d_i = 0, r_i = 0, z_i = 1}. In words, the term corresponds to the number of cases where H_1i is correctly rejected, and all the remaining decisions are correct.
(3) E_1 = Σ_{i=1}^m d_i (1 − r_i) z_i, equalling #{i : d_i = 1, r_i = 0, z_i = 1}. In words, the term corresponds to the number of cases where H_1i is wrongly accepted, but all the remaining decisions are correct.
(4) E_2 = Σ_{i=1}^m d_i (1 − r_i)(1 − z_i), equalling #{i : d_i = 1, r_i = 0, z_i = 0}. In words, the term corresponds to the number of cases where H_1i is wrongly accepted, and at least one of the remaining decisions is incorrect.
(5) E_3 = Σ_{i=1}^m d_i r_i (1 − z_i), equalling #{i : d_i = 1, r_i = 1, z_i = 0}. In words, the term corresponds to the number of cases where H_1i is correctly accepted, but at least one of the remaining decisions is incorrect.
(6) E_4 = Σ_{i=1}^m (1 − d_i)(1 − r_i)(1 − z_i), equalling #{i : d_i = 0, r_i = 0, z_i = 0}. In words, the term corresponds to the number of cases where H_1i is correctly rejected, but at least one of the remaining decisions is incorrect.
(7) E_5 = Σ_{i=1}^m (1 − d_i) r_i z_i, equalling #{i : d_i = 0, r_i = 1, z_i = 1}. In words, the term corresponds to the number of cases where H_1i is wrongly rejected, but all the remaining decisions are correct.
(8) E_6 = Σ_{i=1}^m (1 − d_i) r_i (1 − z_i), equalling #{i : d_i = 0, r_i = 1, z_i = 0}. In words, the term corresponds to the number of cases where H_1i is wrongly rejected, and at least one of the remaining decisions is incorrect.
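As a check on this break-up, the eight terms classify each of the m hypotheses by the triple (d_i, r_i, z_i), so they must always sum to m. A quick numerical verification on randomly generated (hypothetical) decision and truth vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 20
d = rng.integers(0, 2, m)  # hypothetical decisions
r = rng.integers(0, 2, m)  # hypothetical truth indicators

correct = d * r + (1 - d) * (1 - r)
z = np.array([np.prod(np.delete(correct, i)) for i in range(m)])

# The eight terms NE_1, NE_2, E_1, ..., E_6, in the order listed above.
terms = [
    np.sum(d * r * z),                    # NE_1: {d=1, r=1, z=1}
    np.sum((1 - d) * (1 - r) * z),        # NE_2: {d=0, r=0, z=1}
    np.sum(d * (1 - r) * z),              # E_1:  {d=1, r=0, z=1}
    np.sum(d * (1 - r) * (1 - z)),        # E_2:  {d=1, r=0, z=0}
    np.sum(d * r * (1 - z)),              # E_3:  {d=1, r=1, z=0}
    np.sum((1 - d) * (1 - r) * (1 - z)),  # E_4:  {d=0, r=0, z=0}
    np.sum((1 - d) * r * z),              # E_5:  {d=0, r=1, z=1}
    np.sum((1 - d) * r * (1 - z)),        # E_6:  {d=0, r=1, z=0}
]
total = int(sum(terms))
```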