Optimal exact tests for multiple binary endpoints

In confirmatory clinical trials with small sample sizes, hypothesis tests based on asymptotic distributions are often not valid and exact non-parametric procedures are applied instead. However, the latter are based on discrete test statistics and can become very conservative, even more so, if adjustments for multiple testing as the Bonferroni correction are applied. We propose improved exact multiple testing procedures for the setting where two parallel groups are compared in multiple binary endpoints. Based on the joint conditional distribution of test statistics of Fisher's exact tests, optimal rejection regions for intersection hypotheses tests are constructed. To efficiently search the large space of possible rejection regions, we propose an optimization algorithm based on constrained optimization and integer linear programming. Depending on the optimization objective, the optimal test yields maximal power under a specific alternative, maximal exhaustion of the nominal type I error rate, or the largest possible rejection region controlling the type I error rate. Applying the closed testing principle, we construct optimized multiple testing procedures with strong familywise error rate control. Furthermore, we propose a greedy algorithm for nearly optimal tests, which is computationally more efficient. We numerically compare the unconditional power of the optimized procedure with alternative approaches and illustrate the optimal tests with a clinical trial example in a rare disease.


Introduction
In small confirmatory clinical trials, asymptotic hypothesis tests often do not provide strict control of the type I error rate. For many testing problems, exact tests based on conditional inference and permutation tests have been proposed. For such tests, the null distribution of the test statistic is typically discrete and the nominal significance level is not fully exhausted. Furthermore, in many settings more than one hypothesis is tested and a multiple testing procedure is applied to control the familywise type I error rate (FWER) [1]. However, multiple testing procedures that do not take into account the discreteness of the elementary exact tests, often become even more conservative. One remedy are randomized tests. They, however, are not accepted in practical applications such as clinical trials and will not be further considered here.
In this paper we construct optimized exact tests of intersection hypotheses and construct multiple testing procedures for elementary hypotheses based on the closed testing principle [2], focusing on multiple Fisher's exact tests for binary endpoints.
Several approaches to account for the discreteness of test statistics in multiple testing procedures have been proposed. For example, the Bonferroni test can be improved, using the fact that for discrete tests a lower bound for the elementary p-values exists. Therefore, one can exclude hypotheses from testing (and multiplicity adjustment) for which this lower p-value bound exceeds the nominal familywise significance level [3]. This idea was refined by Tarone [4], who noticed that tests whose lower p-value threshold is larger than the corresponding adjusted level of the Bonferroni test, do not contribute to the type I error rate. While Tarone's test controls the FWER, it is not α-consistent. A test is said to be α-consistent, if the rejection of the null hypothesis at some significance level α implies that it can be rejected at all significance levels greater than α. Hommel and Krummenauer [5] and Roth [6] derived a more powerful and α-consistent procedure by applying Tarone's procedure for all levels α ≤ α and rejecting a null hypothesis if Tarone's procedure rejects for any α ≤ α. Type I error rate control of all these procedures holds by the Bonferroni inequality. Hence, they rely on the marginal distributions of the test statistics only and do not take into account their joint distribution. Even though the latter is typically unknown, in the important case of between-group comparisons, the conditional joint distribution of the test statistics can be found by permutation of the group labels. The resulting distribution is conditional on the observed data.
Inference on the global null hypothesis of all marginal null hypotheses being true may be performed by combining the marginal test statistics or marginal p-values into a univariate statistic according to a pre-specified function. The null distribution of this statistic may be determined from the joint permutation distribution. See [7] for a review on combination of dependent tests in a multivariate permutation setting.
The permutation approach is also used in the well known minP test [8], that is also applicable to discrete tests [9]. In the minP test, for each permutation of the group labels, the minimum across the p-values of the marginal hypothesis tests is calculated. The global null hypothesis is rejected at level α if the proportion of permutations with a minimum p-value less or equal the actually observed minimum p-values is less or equal α. This results in rejection regions of a particular shape that can be described as the complement of a hypercube in the space of observable test statistics or p-values. In the case of discrete marginal tests, the probability mass of these rejection regions is typically below α. Additional events defined on the joint distribution of test statistics (or p-values) may be added to such regions, still maintaining a probabiltiy mass below or equal α. Rom proposed one particular extension, allowing for rejection if the minimum observed p-value is larger than the usual minP threshold, given that the remaining ordered p-values are below appropriate thresholds [10]. A combination of the approach underlying Tarone's test and the minP test was suggested in [11].
In this paper we generalize these tests and consider general rejection regions for multivariate exact test statistics. The rejection regions may have arbitrary shapes that are only constrained by certain regularity conditions. We determine optimized rejection regions to either maximize exhaustion of the nominal type I error rate, the power under a specific alternative point hypothesis or simply the number of elements in the region. To efficiently search for the optimal rejection region we propose a numerical optimization algorithm.
The idea to consider general rejection regions for discrete tests has been used before. Paroush proposed a test of a single, simple null hypothesis versus a simple alternative based on a univariate discrete test statistic. The rejection region that maximizes the power under the alternative and controls the type I error rate under the null hypothesis can be found by linear integer programming [12]. Gutman and Hochberg extended this approach to the multidimensional case [11] and defined discrete multivariate rejection regions for the vector of marginal test statistics to test a global intersection null hypothesis. They derived rejection regions that optimally exhaust the nominal type I error rate and discuss also optimization of the power. However, the application of the approach in a closed testing procedure showed low power to reject elementary hypotheses. The authors attributed this to the lack of consonance of the procedure. However, it may rather be due to the fact that the used algorithm does not guarantee test decisions which are monotone in the marginal test statistics, such that a rejection with a certain observed effect does not imply rejection when a more extreme effect is observed. Because of the potential non-monotonicity of the resulting rejection regions, exhaustion of the local type I error rate may not translate into high power.
In this paper we extend the approach in [11] in several ways: (i) we introduce a monotonicity constraint in the optimization framework to guarantee that the rejection regions are monotone in the marginal test statistics; (ii) we show how the monotonicty constraint facilitates to efficiently solve the optimization problem numerically with a branch and bound algorithm; (iii) we consider optimization for additional objective functions, including the power and the size of the rejection regions; (iv) as generalization of the approaches based on Tarone's method, we construct optimally weighted Bonferroni tests; (v) we propose greedy algorithms as a computationally less demanding alternative to full optimization.
The paper is organized as follows. In Section 2 we introduce the optimization framework for general, discrete, multivariate test statistics. We distinguish the case of a known joint null distribution and the case where only the marginal distributions are known. We construct optimized tests for intersection hypotheses and derive multiple testing procedures for elementary hypotheses based on the closed testing principle [2]. In Section 3, we apply the optimization framework to Fisher's exact tests for the comparison of two parallel groups in multiple binary endpoints and in Section 4 we illustrate the procedure with a numeric example [13]. In Section 5, the unconditional power of the optimal exact tests for multiple binary endpoints is compared to alternative procedures in a range of scenarios. We close with a discussion on the proposed procedures for multiple binary endpoints and give examples for the application of the optimization framework to other multiple testing problems.

Optimal rejection regions for discrete tests
In this section we develop a general framework to determine optimal multivariate rejection regions for exact tests of an intersection null hypothesis based on a vector of k discrete marginal test statistics. In Section 2.1.1, we study the case of a known joint distribution of the k test statistics. We describe the construction of rejection regions for a global intersection null hypothesis that are optimal with respect to a given optimization criterion. We consider in particular three optimization criteria: (i) exhaustion of the nominal type I error rate, (ii) maximizing the number of elements in the multivariate rejection region and (iii) maximizing the power of the resulting test under a specific alternative. In Section 2.1.2 we study the optimal selection of critical thresholds in weighted Bonferroni tests, which only requires knowledge of the exact marginal distributions of the test statistics. In both cases, the optimal solutions are found through methods of numeric optimization. As an alternative, we propose in Section 2.1.3 greedy algorithms that provide approximately optimal solutions. In Section 2.2 we discuss the construction of multiple testing procedures based on the application of locally optimal intersection hypothesis tests in a closed testing scheme.

Optimal rejection regions for intersection hypotheses 2.1.1 Optimal general rejection regions based on the joint distribution of the test statistics
Consider hypothesis tests of k elementary null hypotheses H i , i = 1, . . . , k with discrete test statistics T = (T 1 , . . . , T k ) taking values in a finite set V = V 1 ×. . .×V k ⊆ N k such that larger values are in favor of the alternative. We assume that the marginal distribution under H i of each T i and the joint distribution of (T 1 , . . . , We aim to construct optimal rejection regions R ⊆ V for T to test the intersection hypothesis H 0 = ∩ k i=1 H i versus the alternative that at least one H i is false at a prespecified level α. Optimization is performed over all valid rejection regions, defined as subsets R ⊆ V that satisfy where P H 0 is the probability under the intersection null hypothesis H 0 . Condition (i) establishes type I error rate control and condition (ii) is a monotonicity condition that ensures that whenever the test rejects for test statistics taking the values t = (t 1 , . . . , t k ) it will also reject if one (or several) values of the elementary test statistics are increased.
Let f : {R : R ⊆ V } → R define an objective function that assigns a real number to every rejection region R ⊆ V . Let R denote the set of all rejection regions R ⊆ V that satisfy conditions (i) and (ii). Then the optimal rejection regions with respect to f are given by In the numerical examples we consider three objective functions: To obtain the test that best exhausts the nominal type I error rate we choose the objective function An alternative objective function is the number of elements in the rejection region where | · | denotes the cardinality of a set. Optimizing the exhaustion of the nominal type I error rate or the number of elements does not necessarily translate to an optimal power. To maximize the power under a specific alternative hypothesis we consider the objective function where P H A denotes the joint distribution of T under a specified alternative H A .
An algorithm to determine optimal rejection regions based on the joint distribution of test statistics The optimization problem (1) can be written as a binary integer program that can be solved with a branch and bound algorithm. The algorithm below determines an optimal solution if the objective function f satisfies f (R) ≤ f (R ) for all sets R, R such that R ⊆ R . This is, e.g., the case for the objective functions considered above. We index the (vector valued) elements of the set V such that V = {t i , i = 1, . . . , m}, where m denotes the cardinality of V . Then we can represent a rejection region R ⊆ V as a binary vector R and x i = 0 otherwise. Therefore, the objective function f can also be defined as function on {0, 1} m and we use both definitions interchangeably.
To solve (1) we apply a branch and bound algorithm [14]. For the algorithm, denote current partial solution vectors x, with x i ∈ {−1, 0, 1}, where x i = −1 indicates that the algorithm has not yet decided whether the corresponding point belongs to the optimal rejection region, x i = 0 denotes that the point does not belong to the optimal region, and x i = 1 that the point belongs to it. The nodes of the branch and bound algorithm are now given by a partial solution vector x, together with a lower and upper bound of the value of the objective function f . Furthermore, we denote the current best lower bound for the value of the objective function f by L. Then the optimization algorithm is given by: 1. Initialize the set S containing a single node with the solution x = (−1, −1, . . . , −1), lower bound f (∅) and upper bound f (V ) and set the current best lower bound to L = f (∅).
2. Of all nodes in S with not fully determined solution, let N denote the node with the largest lower bound and x its current partial solution. (If there are several such nodes, any can be chosen.) Furthermore, letî denote the index of the first entry of x equal to -1.
3. Remove N from S and add two modified copies of N to S: in the first, set xˆi = 0, in the second set xˆi = 1. For each of the two new nodes: (a) Check if, to satisfy condition (ii), the chosen value for xˆi determines the value of other entries of x that are currently equal to −1. If so, set the required values in x.
(b) Let x (resp. x ) denote copies of x where all components equal to −1 are set to 0 (resp. 1). Set the lower bound of the node to f (x ) and the upper bound to f (x ).
(c) If i:x i =1 P H 0 (t i ) > α remove the node from S.

Update
L to the maximum of the lower bounds of all nodes in S.

5.
Remove all nodes from S with an upper bound < L.
6. Repeat Steps 2 to 5 until S contains only nodes with fully determined solutions. These are optimal solutions.
P-values for tests with an optimized rejection region R can be defined as follows. Let r = |R| and as a starting point for the iteration let R r = R. We have to distinguish two cases: (a) If the observed test statistic t obs ∈ R, iterate R s−1 = R s \t s , where t s = argmax{P H 0 (t) : t ∈ R s , R s \t meets condition (ii)} (and \ denotes 'without'). Stop if t s = t obs and set the p-value to Stop if t s = t obs and set the p-value to P H 0 (R s+1 ).
Some comments on the optimization algorithm (a) Note that Step 3(a) is not part of the standard branch and bound algorithm and has a double impact. It ensures that the solutions satisfy the monotonicity condition (ii) and at the same time it simplifies the optimization problem by reducing the number of solutions that need to be considered. Computationally, this step can be implemented by computing an m × m look up matrix D = (d ij ), where d ij = 1 if t j ≥ t i (where t j ≥ t i iff t j,l ≥ t j,l , l = 1, . . . , m) and 0 otherwise. Then, in Step 3(a), if xˆi = 1, x j is set to 1 for all indices j = 1, . . . , m where dˆi j = 1. Similarly, if xˆi = 0, x j is set to 0 for all j = 1, . . . , m where d jî = 1. Note that the algorithm can be further improved by pre-processing steps in which the search space is reduced by excluding points from V according to simple necessary conditions following from conditions (i) and (ii) (see Appendix A). An R implementation of the branch and bound algorithm, the pre-processing and further functions to caclulate the optimal exact tests we describe, is provided in the online supplement.
(b) Note that the constraints (i) and (ii) are linear functions of x. If the objective function f can also be written as a weighted sum m i=1 w i x i with appropriate weights w i , the optimization problem can be formulated alternatively as a linear program, similar as in [12] and [11] (see Appendix B). Therefore, in principle standard LP solvers, as, for example, lpsolve [15], which can be accessed through R [16,17], can be used to solve the optimization problem. However, when using lpsolve on different numeric examples for optimizing rejection regions, we occasionally observed numeric issues resulting in nonoptimal solutions or extremely long run times. According to personal communication with the maintainers of lpsolve, these may result from the involved probability values ranging across several orders of magnitude, hence proper scaling in the underlying simplex algorithm may be difficult. For the numeric calculations presented in this paper, our own implementation of the branch and bound algorithm was used.
(c) In general, the optimization problem can have more than one solution. This can occur, for example, if the joint null distribution (and the distribution under the alternative) is symmetric in the endpoints. To reduce the set of solutions, optimization criteria can be combined and applied in a lexicographical order.
(d) If the search space V is large and many points in V have very small probability mass there are a large number of close to optimal solutions resulting in long computation times. In Appendix C we show how an approximately optimal solution can be found at substantially reduced computational cost if points with very small probability under the null hypothesis are handled separately in the algorithm.
(e) Due to step 3(a), the proposed branch and bound algorithm only searches across potential solutions satisfying condition (ii). If the search space is a k-dimensional hyper- such solutions. This number gives an upper bound for the number of points the branch and bound algorithm visits. For comparison, the standard branch and bound algorithm visits up to 2 d with d = k i=1 |V i | points.

Optimal rejection regions based on the marginal distribution of test statistics
The above optimization relies on the joint null distribution of the test statistics. If this distribution is unknown, Bonferroni-type optimal multiple tests for H i , i = 1, . . . , k based on the exact marginal distributions can be derived that control the FWER at level α in the strong sense. The Bonferroni test rejects denotes the probability that the test statistic for the i-th endpoint is equal or exceeds t under the marginal null hypothesis H i .
While for the unweighted Bonferroni test the c i are chosen such that S i (c i ) ≤ α/k, i = 1, . . . , k, for the weighted test, the level α can be allocated across the hypotheses more flexibly. The framework of weighted Bonferroni tests includes for example Tarone's test [4], in which for some hypotheses c i can be chosen such that S i (c i ) = 0 and not all hypotheses are tested. It also includes a method proposed by Westfall and Troendle in which a common value as small as possible for all c i is used, i.e. c i = min{c : k i=1 S i (c) ≤ α} [18].
In general, the critical boundaries c i can be chosen to meet some optimization criterion over the set of marginal rejection regions. Let V i ⊆ N denote the set of values T i can take. We assume |V i | < ∞ for all i = 1, . . . , k. Then the marginal rejection regions are given by R i = {t ∈ V i : t ≥ c i } and are defined by a vector of critical boundaries c ∈ V = V 1 ×. . .×V k . Let g i : V i → R denote marginal objective functions that depend on the marginal rejection regions only and define the overall objective function g = k i=1 g i . Then the critical values optimizing g are given by The size of the search space for this optimization problem is bounded by Thus, the search space is much smaller than for the optimization based on the joint distributions discussed above and the solution can be found either by an exhaustive search or by integer linear programming (see Appendix D).
Objective functions for the marginal tests corresponding to the criteria (2) and (4) or the expected number of rejections under marginal alternatives H These are upper bounds on the type I error rate and power to reject the intersection hypothesis H 0 .

Greedy optimization algorithms
The above algorithms find an optimal solution, but they can be computationally demanding. As an alternative, greedy algorithms can be used to obtain approximate solutions that satisfy (i) and (ii). To determine a rejection region based on the joint null distribution of T , define an objective function f as in Section 2.1.1. Start with the empty set R 0 . Choose an operator opt ∈ {argmax, argmin}. In an iterative manner, define If no such point can be found, stop and the rejection region R is given by the current R s . The choice of the operator opt as argmax or argmin depends on whether maximal or minimal increments of f (R) are aimed at. In the numeric examples the algorithm is applied with f (R) = P H 0 (T ∈ R) and the argmin operator, attempting to obtain a region with good exhaustion of the nominal level and a large number of elements. In contrast, f (R) = P H A (T ∈ R) and the argmax operator could be used when the objective is to maximize the power under H A . The greedy algorithm results in α-consistent tests by construction.
Similarly, a Bonferroni-type test as in Section 2.1.2 with close to optimal exhaustion of the nominal level can be found by a greedy algorithm. Here, in each iteration an element is added to the rejection region of that marginal test, for which the resulting increment in the objective function is smallest (largest) and the overall level α is still controlled.

Multiple testing procedures
In the small sample setting, the rejection of intersection hypotheses can be an important trial objective because the power to reject specific elementary hypotheses may be insufficient. However, in many applications rejection of elementary hypotheses will be of interest and we extend the optimal tests for intersection hypotheses to multiple testing procedures for the elementary hypotheses H i , i ∈ I = {1, . . . , k} that control the FWER in the strong sense.
To derive a multiple testing procedure we construct optimal local level α tests for all intersection null hypotheses H J = ∩ i∈J H i , J ⊆ I and then apply the closed testing principle [2] to test the elementary hypotheses. The closed test rejects an elementary null hypothesis H i if all intersection hypotheses H J with J ⊆ I, i ∈ J are rejected by the respective local level α test.
Multiplicity adjusted p-values for intersection or elementary hypotheses H J in a closed test are defined by p * (J) = max{p(J ) : J ⊆ J}, where J ⊆ I is an index set and p(J ) is the local p-value for the intersection hypothesis ∩ i∈J H i .
Even if each of the intersection hypothesis tests satisfies an optimality criterion, this does not imply that some optimality property holds for the overall closed testing procedure (see e.g. [19]). In particular, Gutman and Hochberg [11] noted that closed tests with locally optimal discrete rejection regions need not be consonant such that the rejection of an intersection hypothesis does not necessarily imply the rejection of at least one of the elementary hypotheses [20]. They attributed the observed low power to reject elementary hypotheses observed for their procedure to the lack of consoncance. However, as we do not observe a similar drop in power (see the numeric results below), we conjecture the low power might be due to the lack of a monotonicity constraint like condition (ii) in their procedure.
Still, for a non-consonant test the power to reject at least one elementary hypothesis is lower than the power to reject the global null hypothesis. The derivation of consonant optimized closed tests for general testing problems is complex due to the large number of intersection hypotheses that need to be considered. For the case of two hypotheses, though, a minor modification of the optimization algorithm to derive the optimal test for H 1 ∩ H 2 is sufficient to ensure consonance: Let R i , i = 1, 2 be the one-dimensional rejection region of the marginal test for H i . Then B = {V 1 \R 1 × V 2 \R 2 } ∩ V is the set of points in the multivariate search space V where no elementary hypothesis is rejected. For a consonant procedure, the search space for an optimal rejection region must therefore be restricted to V \B. Maximizing objective functions (2) or (4) over this restricted search space results in tests with maximal exhaustion of the FWER, or maximal power to reject at least one elementary hypothesis, respectively.
For the weighted Bonferroni tests, consonance is achieved for the general case of k hypotheses if, starting from the global intersection hypothesis test, the critical boundaries for each marginal test statistic are non-increasing [21]. Formally, for all J ⊂ J and i = 1, . . . , k, c is the critical boundary for the test statistic of the i-th endpoint in the test for H J . This additional constraint can be easily implemented when optimizing the critical boundaries for the marginal tests, simply by reducing the search space accordingly. It does not affect the power of the global test. As a consequence, the power to reject at least one elementary hypothesis is equal to the power to reject the global intersection hypothesis. When the critical boundaries for the local tests are found by the greedy algorithm for Bonferroni tests, the closed testing procedure is consonant by construction of the greedy algorithm.

Optimized Fisher's exact tests for multiple binary endpoints
We will now apply the algorithms of Section 2 to construct optimal testing procedures for multiple binary endpoints, making use of the permutation joint distribution of the vector of multiple Fisher's exact test statistics.
Consider a treatment (T rt) and control (Ctr) group with n g subjects in group g ∈ {T rt, Ctr}. The observations on the subjects are assumed to be independent within and between the groups. In the case of a comparison of these groups with respect to a single binary endpoint, the observations are independently Bernoulli distributed with success probability p T rt in the treatment group and p Ctr in the control group. The observed data can be aggregated in a 2×2 cross-table and Fisher's exact test provides condititional exact inference on the null hypothesis H : p T rt ≤ p Ctr [22,23]. One of the four entries in the 2 × 2 table, say the number of successes in the treatment group, is chosen as test statistic T . Conditional on the table margins, T has a hypergeometric null distribution and large values of T are in favor of the alternative p T rt > p Ctr . Making the inference conditional on the observed margins removes the influence of the unknown nuisance parameter (e.g. p Ctr , depending on the parametrization) and allows for an exact test for H. Under the point null hypothesis p T rt = p Ctr the hypergeometric distribution of T is equivalent to the permutation distribution of T that results from all permutations of the group labels.
For the case of k binary endpoints, consider the null hypotheses H i : p i,T tr ≤ p i,Ctr , i = 1, . . . , k, where p i,g is the marginal success rate in the i-th endpoint in group g. We are interested in one sided alternatives p i,T tr > p i,Ctr to establish a higher success rate of the new treatment compared to control.
Regarding the joint observations in one patient with respect to all k endpoints, there are d = 2 k possible outcome categories. We can formally define these categories by a set of index vectors S = {(s 1 , . . . , s k ) : s i ∈ {0, 1}, i = 1, . . . , k} such that s i = 1 if the particular patient had a success in endpoint i. E.g. with k = 2 endpoints, S = {(1, 1), (1, 0), (0, 1), (0, 0)}, indicating a success in both endpoints, endpoint 1 only, endpoint 2 only or neither endpoint. To simplify notation, these d different outcome categories will be indexed by s = 1, . . . , d in the following equations if not indicated otherwise.
The observations on the j-th patient, j = 1, . . . , n g , in group g ∈ {T rt, Ctr} can be written as vector g,s = 1 indicates that the patient is in the s-th outcome category. Let q g,s = P (Y (j) g,s = 1) be the probability for this event. The distribution of Y (j) g is characterized by the vector q g = (q g,1 , . . . , q g,d ) with 0 < q g,s < 1 and d s=1 q g,s = 1. The data resulting from this model can be aggregated without loss of information in a d × 2 contingency table with columns Ctr (see Table 1 below for an example), and row margins M = Y T rt + Y Ctr .
As for marginal Fisher's exact tests, define T i = (s 1 ,...,s k ):s i =1 Y T rt,s 1 ...s k as the number of subjects in the treatment group with a sucess in endpoint i, and let T = (T 1 , . . . , T k ). Thus, T is a linear function h of Y T rt . The distribution of T conditional on M will be used for exact inference about H 0 = ∩ k i=1 H i . This distribution is a function of the conditional distribution of Y T rt given M =m where W = {y ∈ N d : d s=1 y s = n T rt and ∃z ∈ N d : y + z =m} is the set values of y T rt that are possible given the table margins. The conditional distribution of Y T rt given M =m is a multivariate (non-central) hypergeometric distribution (see Appendix E) with normalizing constant N = y T rt ∈W d s=1 When q T rt,s = q Ctr,s for all s = 1, . . . , d, (9) is equivalent to the permutation distribution of Y T rt givenm that results from performing all possible permutations of the group labels and (8) is equivalent to the corresponding permutation distribution of T . Similar to the minP approach [8,9], we use this permutation distribution of T as null distribution under H 0 . Optimal multivariate rejection regions are then determined by applying the optimization procedures of Section 2 over the search space V = h(W ).
Analogously, for a test of H J = ∩ i∈J H i , J ⊂ {1, . . . , k} the joint permutation distribution of (T i : i ∈ J), J ⊂ {1, . . . , k} and the respective search space and optimal rejection regions are determined using only the data on endpoints i ∈ J.
When testing the intersection hypotheses H J by tests based on these permutation distributions, the closed testing procedure controls the FWER under the following additional exchangeability assumption, which is similar to the marginals-determine-the-joint condition given in [24].
(iii) Assumption. Let K ⊆ {1, . . . , k} be the index set of all true null hypotheses. The joint distribution of the observations on endpoints i ∈ K is assumed to be identical in both treatment groups.
FWER control follows, because under (iii) the conditional null distribution of (T i : i ∈ K) given the respective table margins for the data on the endpoints i ∈ K,m K , is its permutation distribution. So the type I error rate of the permutation test for H K is controlled conditional onm K , and since this holds for any realization ofm K it also holds unconditionally. Type I error rate control of the test for H K is sufficient for FWER control of the closed test. See [18,25,26] for further discussion on testing multiple hypotheses using permutation tests and involved assumptions and [27] for a general treatment of the permutation principle applied to contingency tables.
Calculating optimal Bonferroni-type tests, as described in Section 2.1.2, for multiple Fisher's exact tests is straight forward using the known marginal hypergeometric distribution of the individual test statistics T i , i = 1, . . . k. Assumption (iii) is not required for FWER control with the Bonferroni-type tests. Note that, for both procedures, only the observed table marginsm are required to compute the conditional null distribution.

A clinical trial example
For illustration, consider a trial to show superiority of ibuprofen compared to indomethacin in the treatment of patent ductus arteriosus in preterm infants, similar to the study described in [13]. The primary endpoint in this study was ductal closure and a major secondary endpoint was low urine output. For our example we will consider both binary endpoints as primary. As study outcome consider the observed frequencies given in Table  1, that for either treatment group entail the same observed marginal success rates as in the original study (no information on the joint distribution is reported in [13]).
Assume the aim of this study is to show superiority of the new treatment compared to control in at least one of the two endpoints, at a familywise significance level of 2.5%. The elementary null hypotheses are H urine : p urine,T rt = p urine,Ctr and H duct. : p duct.,T rt = p duct.,Ctr , and the global intersection null hypothesis is H 0 = H urine ∩ H duct. . We assume exchangeability under H 0 , according to (iii). To test the global intersecion hypothesis H 0 at level α = 0.025, we consider the test statistics from the marginal Fisher's exact tests for each endpoint. For this vector of test statistics, we calculate the permutation distribution and find rejection regions with optimal exhaustion of the nominal level, maximal number of elements and optimal power under an assumed alternative of independent endpoints with true success rates of p urine,T rt = p duct.,T rt = 0.9 and p urine,Ctr = p duct.,T rt = 0.75 (following the assumptions made for sample size planning in the original study). Analogous regions were caclulated under the additional constraint of providing a consonant closed test. In addition, the multivariate rejection region resulting from the greedy algorithm with the argmin operator was found. In the results Table 2 the respective tests are referred to as (consonant) optimal alpha, (consonant) optimal area, (consonant) optimal power and greedy algorithm.
Further, critical boundaries for optimal consonant Bonferroni-type tests with objective functions (6) (Bonferroni optimal alpha) and (7) (Bonferroni optimal power) were calculated, as well as boundaries resulting from the Bonferroni greedy algorithm. For comparison, the unweighted Bonferroni test, the Hommel-Krummenauer variant of Tarone's test (HKT) and the minP test were included. For the minP test, the minimum p-value across the marginal Fisher's exact tests was used as test statistic. The null distribution of this statistic was derived from the joint permutation distribution of the marginal test statistics, matching the permutation approach described in [9].
For the included tests and conditional on the row margins of Table 1, Table 2 shows the actual type I error rate under H 0 , the power under the assumed alternative and the number of elements the rejection regions contain. For the Bonferroni-type tests and for the minP test, the critical boundaries are included in the table. The rejection regions for the proposed optimal tests based on the joint distribution are visualized in Figures 1 and  2. Here, the tests that maximize alpha exhaustion or power are not consonant and Figures  1 and 2 include the respective rejection regions, when the additional constraint of consonance is imposed. In this example conservatism is greatly reduced and the conditional power increased when using optimal tests as compared to a basic Bonferroni test.
The marginal Fisher's exact tests reject at local level 2.5% if T urine ≥ 91 and T duct. ≥ 85. The observed values for the test statistics in the example are t urine = 93 and t duct. = 81. The point (93, 81) is contained in all considered rejection regions (see Figures 1 and 2), and so all examined tests reject H 0 . Following the application of the closed testing principle all tests also reject H urine , concluding that the proportion of patients with low urine output is lower under the new medication.
The p-value for H 0 , calculated according to the suggestion in Section 2.1.1 after rejection of H 0 at the level α = 0.025, is approximately 0.0002 for the greedy algorithm test and the tests optimizing alpha exhaustion and area of the rejection region, with and without the consonance constraint. The marginal one-sided p-values using Fisher's exact test are p urine = 0.0005 and p duct. = 0.3361, and as both are larger than 0.0002 the multiplicity adjusted p-values take the same values, respectively. For the test optimizing the power under the specified assumption on the alternative, the p-value for the global test is 0.0006 without the consonance constraint and 0.0017 with the consonance constraint. When using these approaches, the multiplicity adjusted p-value for low urine output is equivalent to the p-value for the global test, the adjusted p-value for ductal closure again is 0.3361.
The results on the example data serve as illustration and apply only conditional on the specific observed margins. A study of the unconditional properties of the proposed tests is given in the next section. | and |V (2) | are the number of elements in the original search space, initially and after the first and second pre-processing step, respectively (see Appendix A). The number of iterations that were required in the branch and bound algorithm to identify an optimal solution is given in the last column.

Unconditional power of the optimal procedures
The unconditional power of different closed testing procedures based on either of the proposed optimal intersection hypothesis tests is studied for the setting of k = 2 and k = 3 binary endpoints. These intersection hypothesis tests are the optimally weighted consonant Bonferroni tests with objective function (6) (labelled as Bonferroni optimal alpha) or objective function (7) (Bonferroni optimal power), the Bonferroni greedy algorithm test with the argmin operator (Bonferroni greedy algorithm), optimal tests using the joint permutation distribution with the objective functions (2) (Optimal alpha), (3) (Optimal area) and (4) (Optimal power) and the greedy algorithm test for joint distributions, using the argmin operator (Greedy algorithm). For the case of two endpoints, the corresponding optimal joint distribution-based tests with the additional constraint of consonance (Cons. opt. alpha, Cons. opt. area., Cons. opt power) are also studied. These procedures are compared to closed testing procedures based on testing the local intersection hypotheses via the Bonferroni test, the Hommel and Krummenauer improvement of Tarone's test (HKT) or the minP test as described in Section 4. In all cases, one-sided Fisher's exact tests are applied for the local elementary hypothesis tests and the vector of the elementary test statistics was used as multivariate test statistic in the intersection hypothesis tests. The between-groups differences in these scenarios are parametrized by the marginal success rates p i,T rt , i = 1, . . . , k in the treatment group and p i,Ctr , i = 1, . . . , k in the control group. Further, a common product-moment correlation ρ between the binary observations within each subject is assumed. Per-group sample sizes are n T rt = n Ctr = n ∈ {5, 10, 15, 20} for two endpoints and n = 10 for three endpoints. Table 3 shows the settings for all considered scenarios. p i,T rt and p i,Ctr were chosen such that for endpoints where the alternative holds (p i,T rt + p i,Ctr )/2 = 0.5 and such that the power to reject the hypothesis at an unadjusted level of 2.5% using a single Fisher's exact test is approximately 0.6 (two endpoints) or 0.41 (three endpoints). Furthermore, one scenario with three endpoints with p 1,T rt = 0.8, p 2,T rt = 0.7, p 3,T rt = 0.6 and p i,Ctr = 0.2 for all i = 1, 2, 3, was considered. There, the corresponding local power values of Fisher's exact tests are 0.64, 0.43 and 0.25.
For the tests that directly aim to maximize the power, assumptions on the alternative need to be specified. For each scenario with two endpoints the optimal power tests are calculated under three different assumptions matching the three overall scenarios of an effect in one endpoint, an effect in both endpoints with ρ = 0 and an effect in both endpoints with ρ = 0.5. In this way, the true alternative is always included, and in addition the characteristics of the tests under assumptions that deviate from the truth can be assessed. For three endpoints the power was optimized under the true alternative.
The unconditional power was calculated numerically for the scenarios with two end-points and by simulation for the scenarios with three endpoints. See the supplemental material for technical details. Table 3: Simulation scenarios for two and three binary endpoints. #endpoints n p 1,T rt p 2,T rt p 3,T rt p 1,Ctr p 2,Ctr p 3,Ctr ρ α

Numerical results
The results for a selected scenario with a treatment effect in two uncorrelated endpoints and a sample size of n = 15 per group are shown in Table 4. All results on scenarios with two endpoints are tabulated in the supplemental tables S1 to S19. The results for the scenario with three correlated endpoints with unequal effect sizes are shown in Table 5. The results for the remaining scenarios with three endpoints are covered in the supplemental tables S20 to S25. Overall, the unweighted Bonferroni procedure had the lowest power in the considered scenarios. The Hommel and Krummenauer improvement of Tarone's test (HKT) differs from the Bonferroni test only for constellations where some tests cannot become significant at certain Bonferroni-adjusted levels ≤ α. This happens frequently only for very small sample sizes, therefore the HKT test was substantially more powerful than the Bonferroni test only in scenarios with the very small sample size of n = 5. For larger sample sizes the differences were small.
Conservatism was reduced and the power was notably increased by an order of 10 percentage points when the local boundaries in weighted Bonferroni tests were chosen according to one of the proposed optimization criteria or using the greedy algorithm. When the treatment effects in both endpoints were equal, there was almost no difference in the performance of these optimization approaches. In the case with an effect in one endpoint only, however, optimizing power according to the objective function (7) under the assumption of the true effects provided some additional advantage.
The minP test had very similar power values as the weighted Bonferroni tests. This is of particular interest, as the Bonferroni tests do not require Assumption (iii) while the minP test does require the assumption to allow for unambiguous interpretation of tests on marginal effects.
The optimal tests based on the joint distribution provided a substantial improvement over HKT, the Bonferroni-type tests and over the minP test in the order of another 10 percentage points. By definition, optimizing the power under the true alternative results in the largest power, which can serve as a benchmark for the other tests. When there was an effect in only one endpoint, optimizing exhaustion of the nominal level or optimizing power under different alternative hypotheses in some scenarios resulted in power similar to that of the Bonferroni test, though. In contrast, maximizing the number of points in the rejection region gave more robust results, with power values above those of the weighted Bonferroni tests.
For the optimal tests using the joint distribution, enforcing consonance in the scenarios with two endpoints did not lead to a notable improvement of the power to reject at least one elementary hypothesis, but at the same time decreased the power to reject the global intersection null hypothesis.
The tests obtained through the greedy algorithms, both for the Bonferroni approach and the joint distribution rejection region, performed surprisingly well. In most scenarios these tests had power similar to or above that of the respective other tests based on marginal or joint distributions.
Differences in the power characteristics between the testing procedures were mostly observed for the test of the global intersection hypothesis. For the elementary hypotheses, all tests other than Bonferroni and HKT showed similar power values, with a few exceptions in scenarios assuming an effect in just one endpoint. The probability to reject all elementary hypotheses simultaneously was almost identical for most of the studied tests. This is another consequence of the discreteness of the elementary tests, by which the set of values for the multivariate test statistic, that lead to the local rejection of all two or three elementary null hypotheses simultaneously, is often entirely contained in the optimal rejection regions. For an illustration see the numeric example of Section 4. There, the set {(t urine , t duct. ) : t urine ≥ 91, t duct. ≥ 85} is contained in all optimal rejection regions. Even the simple Bonferroni rejection region almost completely contained this set, missing only the single point (T urine = 91, T duct. = 85).
For all tests other than unweighted Bonferroni and HKT, the power to reject a specific elementary hypothesis was typically very close to the local power that a single Fisher's exact test would have under the chosen marginal success rates. The power to reject at least one elementary hypothesis was even larger, with the exception of scenarios with an effect in only one endpoint. This observation implies that carefully accounting for the discreteness of the tests allows one to greatly reduce the cost of multiple testing. Thus, in the studied scenarios, the mulitplicity adjustment when testing two or three hypotheses does not reduce the power compared to the test of a single null hypothesis. To quantify the number of required iterations, the median (q50), the 90 % quantile (q90) and the maximum of the number of iterations is shown in the last three columns. A value of 0 means that the optimization was finished by the pre-processing.
For tests optimizing the power, the assumed alternative is indicated in brackets next to the test label. There, "all" and "EP 1" refer to an assumed effect identical to the true effect in a scenario with effect in both endpoints or one endpoint, respectively. ρ there indicates the assumed correlation between the endpoints. solution was found by the branch and bound algorithm for all four intersection hypothesis tests in the closed testing procedure is given in the column labelled C(%). To further quantify the number of required iterations, the median (q50), the 90 % quantile (q90) and the maximum of the number of iterations is shown in the last three columns. A value of 5 · 10 5 indicates that the respective quantity to achieve an optimal solution would be above 5 · 10 5 , but is here bounded by the maximal number of iterations. Test

Discussion
The analysis of small clincial trials is often challenging [28]. Asymptotic methods may lack type I error rate control when the sample size is small, or when there are few events in the case of binary endpoints. Exact tests guarantee type I error rate control, but they are typically overly conservative due to discreteness. At the same time it is required to make best use of the available information, if the overall number of observations is limited, which favors the analysis of multiple endpoints. This was a motivation for the investigation of optimal rejection regions for multivariate exact tests described in this work.
Most multiple testing adjustments lead to multivariate rejection regions of a certain restricted shape, e.g. the complement of a hypercube in the case of Bonferroni tests or the minP test. Within a class of shapes, optimization can be performed, however any such shape-restriction leads to increased conservatism when applied to discrete distributions, and as a consequence it has the potential to reduce the power of the test. This limitation can be avoided by allowing for arbitrarily shaped rejection regions. Still, some constraints are required to allow for unambiguous interpretation of the results. In contrast to earlier suggestions for optimal tests with discrete statistics [11,12], we require that the test decision is monotonic in the value of the test statistic, leading to an additional constraint in the optimization framework. This condition rules out testing procedures where a (relatively) small observed effect results in rejecting a null hypothesis while a larger effect does not. Further, the simulation results of this work compared to those in [11] suggest that the monotonicity constraint (or some similar constraint) is required to obtain a powerful closed testing procedure based on optimal tests for intersection hypotheses.
To control the FWER, it is important to pre-specify the rejection region. Especially, the rejection region must be defined before information on the treatment effect estimates is revealed. In a blinded experiment this means to define the rejection region before unblinding the treatment allocation. In addition, all steps of the procedure that define the optimized rejection region should be specified in the study protocol.
The choice of the optimization objective function may be based on assumptions about the effect sizes under the alternative. Rejection regions optimized for the true alternative can result in a far more powerful test than those resulting from other optimization criteria. However, as the true alternative is unknown, optimizing power under some assumed alternative is sensitive to having guessed wrongly. Still, the optimal power test can serve as a useful benchmark to judge the performance of other optimal tests. Furthermore, a prior distribution on the effect sizes can be specified and the power averaged over this prior distribution can be optimized.
The reduced power of discrete tests is often attributed to the conservativeness of the test decision under the null hypothesis. Therefore, an obvious choice would be to maximize exhaustion of the type I error rate. This approach indeed results in type I error rates close to the nominal one, but it does not necessarily result in high power. This may be due to the fact that optimizing type I error rate exhaustion does not necessarily lead to rejection regions with a high probability under the alternative. A useful alternative to the tests derived with the branch and bound algorithm are tests based on the greedy algorithm for multivariate rejection regions with the argmin operator proposed in Section 2.1.3. In the scenarios included in the numeric power calculations, this test is often close to optimal.
Of note, the assumption of exchangeability under the null hypothesis, which was discussed in Section 3, is required to guarantee FWER control for the procedures relying on the joint conditional distribution of the test statistics. If this assumption is not satisfied, the optimally weighted Bonferroni tests should be preferred over tests using the joint permutation distribution. Among these Bonferroni-type tests, the test based on the greedy algorithm showed robust performance, good power, is α-consistent and provides consonant procedures by construction and can therefore be recommended as a good general choice.
While the focus of this paper is on multiple binary endpoints, the proposed theoretical framework for optimal exact tests is more general. It can be applied to multiple hypothesis tests whenever the exact joint distribution of the involved test statistics is known, or, in case of the weighted Bonferroni tests, when the marginal distributions are known. Consider, e.g., the comparison of k treatment groups to a common control with respect to a continuous endpoint, with the aim of showing superiority for at least one treatment versus control. Rank-sum tests may be used as exact marginal tests and optimal rejection regions may be defined for the joint permutation distribution of the k rank-sum statistics. As further example consider testing a treatment effect in k disjoint populations, using k exact tests. The joint distribution of the test statistics then follows from the known marginal distribution and from independence between observations from different populations, and a closed test with optimal local rejection regions can be derived. Similarly, in an analysis involving a full population and a sub-population, the distribution of exact test statistics is given by the known marginal distributions and the correlation structure determined by the proportion of subjects belonging to the sub-population.
In summary, optimizing the rejection region for multivariate exact tests offers a notable advantage over simpler methods in terms of the power to reject a global intersection null hypothesis and, to a lesser extent, the power to reject some elementary null hypothesis. In the small sample setting, where this approach can have the greatest impact, numeric solutions of the discrete optimization problems are found within short computation times. Application of the optimal exact tests may require an additional effort at the planning stage, which is worthwhile if the aim is to make best use of multiple exact hypotheses tests from a small data sample.

Appendix A -Pre-processing
In the first pre-processing step, condition (i) is used to remove all points from the search space V that would inevitably lead to a rejection region with a level greater than α. If a point t = (t 1 , . . . , t k ) ∈ V was selected to be part of the rejection region, condition (ii) implies that the set {(s 1 , . . . , s k ) ∈ V : s 1 ≥ t 1 , . . . , s k ≥ t k } is part of the rejection region. Thus the type I error rate of a rejection region containing t is at least P ({(s 1 , . . . , s k ) ∈ V : s 1 ≥ t 1 , . . . , s k ≥ t k }). Therefore all points t ∈ V for which P ({(s 1 , . . . , s k ) ∈ V : s 1 ≥ t 1 , . . . , s k ≥ t k }) > α are removed from the search space. Denote the set of the remaining points by V (1) .
In the second pre-processing step we identify points that are definitely contained in an optimal level α rejection region, regardless of the optimality criterion. For each point t ∈ V (1) , we calculate an upper bound for the probability mass under H 0 of all possible rejection regions that do not contain t. If t / ∈ R, the set A(t) = {(s 1 , . . . , s k ) ∈ V (1) : s 1 ≤ t 1 , . . . , s k ≤ t k } / ∈ R, because otherwise condition (ii) would be violated. The upper bound for the level when point t is not included in the rejection region is α max,−t = P H 0 (V \A(t)). If α max,−t < α, even the largest rejection region not containing t could possibly be made larger, and the only way to do so is adding t, because of condition (ii). So if α max,−t + P H 0 (t) ≤ α, t must be included in each optimal rejection region. Denote the set of the points still remaining after this step by V (2) . It is then sufficient to perform the optimization on the remaining search space V (2) for a significance level of α − P H 0 (V (1) \V (2) ).
The two pre-processing steps are illustrated in Figure 3. . . , s k ) ∈ V : s 1 ≥ t 1 , . . . , s k ≥ t k }) > α from the search space as including these points will result in a level > α. Right panel: For all points t ∈ V get the largest region A(t) which satisfies condition (ii) and does not include t. If P H 0 (A(t)) + P H 0 (t) ≤ α, t must be contained in any optimal rejection region.
ensures that x contains exactly one entry equal to 1 for each search vector v i , indicating the chosen c i . Further, let a i be the vector of contributions to the type I error rate for the i-th endpoint with elements a i,j = S i (v i,j ). Let a = (a T 1 , . . . , a T k ) T . Then the constraint guarantees type I error control by the Bonferroni inequality. The contributions of possible choices for c i , i = 1, . . . , k to the objective function are formalized similarly in terms of a stacked vector w = (w T 1 , . . . , w T ) T . Here w i,j is the contribution of the i-th test to the objective function if c i = v (i) j is selected. Then the objective function is of the type g = w T x, with w = a for objective function (6) and w i,j = P H (1) i (T i ≥ v i,j ) for objective function (7). Thus, the linear integer program is constituted by g = w T x, the constraints (10) and (11) and the further constraint that the elements of x are in {0, 1}.
Appendix E -The conditional distribution of Y T rt P (Y T rt = y T rt |Y T rt + Y Ctr =m) = P (Y T rt = y T rt , Y T rt + Y Ctr =m) P (Y T rt + Y Ctr =m) = = P (Y T rt = y T rt , Y Ctr =m − y T rt ) P (Y T rt + Y Ctr =m) = P (Y T rt = y T rt )P (Y Ctr =m − y T rt ) P (Y T rt + Y Ctr =m) = (q T rt,s ) y T rt,s y T rt,s ! n Ctr !