Closed testing using surrogate hypotheses with restricted alternatives

Introduction The closed testing principle provides strong control of the type I error probabilities of tests of a set of hypotheses that are closed under intersection such that a given hypothesis H can only be tested and rejected at level α if all intersection hypotheses containing that hypothesis are also tested and rejected at level α. For the higher order hypotheses, multivariate tests (> 1df) are generally employed. However, such tests are directed to an omnibus alternative hypothesis of a difference in any direction for any component that may be less meaningful than a test directed against a restricted alternative hypothesis of interest. Methods Herein we describe applications of this principle using an α-level test of a surrogate hypothesis H˜ such that the type I error probability is preserved if H⇒H˜ such that rejection of H˜ implies rejection of H. Applications include the analysis of multiple event times in a Wei-Lachin test against a one-directional alternative, a test of the treatment group difference in the means of K repeated measures using a 1 df test of the difference in the longitudinal LSMEANS, and analyses within subgroups when a test of treatment by subgroup interaction is significant. In such cases the successive higher order surrogate tests can be aimed at detecting parameter values that fall within a more desirable restricted subspace of the global alternative hypothesis parameter space. Conclusion Closed testing using α-level tests of surrogate hypotheses will protect the type I error probability and detect specific alternatives of interest, as opposed to the global alternative hypothesis of any difference in any direction.



Introduction
The closed testing principle of Marcus, Peritz and Gabriel [1] provides strong control of the type I error probability, the so-called family-wise error rate (FWER), over a set of tests of multiple hypotheses. The basic principle is that a given elemental null hypothesis can be tested and rejected at level α only if all higher order intersection hypotheses containing it have also been tested and rejected at level α. In this case the type I error probability for the set of hypotheses, both elemental (i.e. simple) and joint (i.e. intersections), will be protected at level α provided that each hypothesis is tested using an α-level test, meaning that the type I error probability associated with a given test of a given hypothesis is no greater than α, multiple testing aside. Hsu [2] describes various applications. Henning and Westfall [3] provide a review of historical and recent developments.

The most common application of closed testing is pairwise tests of group differences in a trial with K > 2 groups, in which we wish to test the equality of the K groups by conducting K(K − 1)/2 pairwise comparisons with strong control of the type I error probability for the set of tests. Let $\mu_j$ denote the expected value of the outcome (mean, proportion, etc.) for the jth group, $1 \le j \le K$. Consider the case of K = 4 groups with 6 pairwise tests. In this case we start with a test of the joint null hypothesis $H_{0,1234}: \mu_1 = \mu_2 = \mu_3 = \mu_4$ (the highest order intersection hypothesis) against the alternative $H_{1,1234}: \mu_j \ne \mu_k$ for at least one pair of groups among $1 \le j < k \le K = 4$.
Closed testing can also be applied to tests of the difference between two groups for multiple outcomes. Let $\theta_j$ refer to the difference between the two groups for the jth outcome and assume that we wish to test the individual hypotheses $H_{0,j}: \theta_j = 0$, $j = 1, \dots, K$, with control of the type I error probability for the set of K tests. Consider a test of the hypothesis $H_{0,1}: \theta_1 = 0$. This hypothesis can be rejected at level α if it and all intersection hypotheses containing it are also rejected at level α. This entails testing the set of hypotheses presented in Table 1 starting with the K-level intersection hypothesis. This is a simple testing tree.
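The testing tree for one elemental hypothesis can be sketched in a few lines. The example below is an illustration only: the function names `closure_containing` and `reject_elemental` and the p-values are hypothetical, and a hypothesis is taken to be rejected when its p-value is at most α.

```python
from itertools import combinations

def closure_containing(j, K):
    """All index sets (as sorted tuples) of {1..K} that contain j, highest order first."""
    others = [i for i in range(1, K + 1) if i != j]
    sets = []
    for size in range(K - 1, -1, -1):          # number of other indices joined with j
        for combo in combinations(others, size):
            sets.append(tuple(sorted((j,) + combo)))
    return sets

def reject_elemental(j, K, pvalues, alpha=0.05):
    """Reject H_{0,j} only if every intersection hypothesis containing j is rejected."""
    return all(pvalues[s] <= alpha for s in closure_containing(j, K))

# Hypothetical p-values for K = 3 outcomes, purely for illustration.
pvalues = {
    (1, 2, 3): 0.010,
    (1, 2): 0.020, (1, 3): 0.030, (2, 3): 0.200,
    (1,): 0.040, (2,): 0.030, (3,): 0.300,
}

print(closure_containing(1, 4))    # the 8 hypotheses that must be rejected to reject H_{0,1}
print([reject_elemental(j, 3, pvalues) for j in (1, 2, 3)])
```

In this hypothetical example the second elemental hypothesis has a nominal p-value of 0.03 but is not rejected, because the order-2 intersection hypothesis for outcomes 2 and 3 is not rejected.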
For K = 4 outcomes, the parameter estimates $\hat{\theta} = [\hat{\theta}_1\ \hat{\theta}_2\ \hat{\theta}_3\ \hat{\theta}_4]^T$ are jointly asymptotically normally distributed with expectation $\theta$ and a consistently estimable covariance matrix $\Sigma$. Then the order 4 hypothesis $H_{0,1234}: \theta = 0$ could be tested using a $T^2$-like test of the form

$$X^2 = \hat{\theta}^T \hat{\Sigma}^{-1} \hat{\theta} \tag{2}$$

that is asymptotically distributed under $H_{0,1234}$ as chi-square on 4 df. Then an order 3 joint … [4] also describe application to other tests of the differences between means for multiple quantitative outcomes, such as the O'Brien [5] Ordinary Least Squares (OLS)-based test based on the sum of the mean differences over the set of K measures. These and other α-level tests are also shown to provide strong control of the type I error probability. Wassmer et al. [6] also provide an overview of procedures for analysis of multiple, principally quantitative, outcomes that contrasts omnibus versus directional alternatives. … Since $H^{*}_{T}$ is always the first true null to be tested, and since $\Pr(\text{reject } H^{*}_{T}) \le \alpha$, the cumulative probability of all further type I errors cannot exceed α.
Closed testing typically employs an efficient (e.g. UMP) test of each null hypothesis against a global alternative hypothesis, such as the $T^2$-like test of $H_{0,1234}: \theta = 0$ of joint equality against the alternative $H_{1,1234}: \theta \ne 0$ that the group difference for at least one of the outcomes is unequal to zero. However, from (4), the only requirement for closed testing to control the family-wise error rate at the desired level α is that each test employed be an α-level test [3], meaning that the type I error probability of a test does not exceed the desired level α under that null hypothesis. Thus, closed testing can also be applied using a test directed towards a restricted alternative hypothesis, such as the one-directional or one-sided alternative hypothesis $H_{1,1234}: \theta > 0$ where positive values of $\theta$ are considered beneficial. In this case the test is directed to a restricted alternative hypothesis that represents a region of the parameter space of greater interest than would be provided by the usual multiple df omnibus test of $H_0$.
More generally, closed testing can also be employed using a surrogate test of a surrogate hypothesis. Let $H$ be a null hypothesis of interest. We will say that a hypothesis $\tilde{H}$ is a surrogate where $H \Rightarrow \tilde{H}$, so that rejection of $\tilde{H}$ implies rejection of $H$. For example, consider a test of $H_{0,12}: \theta_1 = \theta_2 = 0$ in Table 1 against the alternative $H_{1,12}: \theta_1 \ne 0$ and/or $\theta_2 \ne 0$. A surrogate test could be conducted using $\tilde{H}_{0,12}: \theta_1 = \theta_2$ against the alternative $\tilde{H}_{1,12}: \theta_1 \ne \theta_2$. Clearly $H_{0,12} \Rightarrow \tilde{H}_{0,12}$ and rejection of $\tilde{H}_{0,12}$ implies rejection of $H_{0,12}$. Even though the efficiency of the test of $\tilde{H}$ may differ from that of the usual test of $H$, the test of $\tilde{H}$ is still an α-level test and this testing strategy preserves the type I error probability at $\le \alpha$ for the set of tests closed under intersection.

We now present specific applications, starting with the analysis of multiple event-time outcomes (e.g. MACE in a cardiovascular trial) following a one-directional Wei-Lachin multivariate test of a combination of outcomes, with a computational example. This is followed by a description of tests of treatment group differences in means of K repeated measures over time where the tests of intersection hypotheses are conducted using tests of the longitudinal LSMEANS rather than $T^2$-like MANOVA omnibus tests. We then describe testing the treatment difference between two groups within multiple subgroups following a test of treatment by subgroup interaction (i.e. homogeneity). This is accompanied by the computation of the operating characteristics of the traditional closed testing and the surrogate closed testing for this application.
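Returning to the surrogate hypothesis defined above, the reason its test keeps the level can be written in one line. The notation here is introduced only for illustration and is not the paper's: $\Theta_H$ and $\Theta_{\tilde H}$ denote the parameter sets satisfying $H$ and $\tilde H$, and $R$ is the rejection region of the α-level test of $\tilde H$.

```latex
H \Rightarrow \tilde{H}
\;\Longleftrightarrow\;
\Theta_{H} \subseteq \Theta_{\tilde{H}},
\qquad\text{so}\qquad
\sup_{\theta \in \Theta_{H}} \Pr\nolimits_{\theta}(R)
\;\le\;
\sup_{\theta \in \Theta_{\tilde{H}}} \Pr\nolimits_{\theta}(R)
\;\le\; \alpha .
```

For instance, the point $\theta_1 = \theta_2 = 0$ lies on the line $\theta_1 = \theta_2$, so a test that rejects homogeneity with probability at most α everywhere on that line also rejects with probability at most α under the joint null.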

Components of the MACE composite outcome
We first apply closed testing using surrogate hypotheses to the assessment of the significance of treatment group differences for elements of a composite time-to-event outcome such as a Major Adverse Cardiovascular Event (MACE) using the times to one or more of a set of possible component events such as cardiovascular (CV) death, non-fatal myocardial infarction (MI), non-fatal stroke or non-fatal congestive heart failure, so-called 4-point MACE. Herein we compare traditional closed testing using $T^2$-like "MANOVA" omnibus tests on multiple df to surrogate closed testing using Wei-Lachin [7] 1 df tests against one-directional restricted alternatives, and also to the commonly used time-to-first-event analysis.
Let $\beta_j$ denote the log hazard ratio for treatment versus control for a Cox PH model analysis of the time to the jth of K different types of events including multiple types for a given patient, e.g. time to the first non-fatal MI and time to CV death for a patient who experiences both types of event. The K separate models generate a vector of coefficient estimates $\hat{\beta} = (\hat{\beta}_1 \dots \hat{\beta}_K)^T$ that is asymptotically normally distributed with expectation $\beta = (\beta_1 \dots \beta_K)^T$ and with a covariance matrix $\Sigma$ with elements $\sigma_j^2 = V(\hat{\beta}_j)$ and $\sigma_{jk} = \mathrm{Cov}(\hat{\beta}_j, \hat{\beta}_k)$, $1 \le j < k \le K$. Estimates of the covariances $\{\hat{\sigma}_{jk}\}$ can be provided by partitioning the model-based information sandwich as described in Lachin and Bebu [7], or using the method of Wei, Lin and Weissfeld [8] that employs the Lin and Wei [9] estimate of the observed information that is robust to departures from the proportional hazards assumption. Both approaches may also be adjusted for other covariates, and provide the estimate of the joint covariance matrix $\hat{\Sigma}$ of the treatment group coefficients.
Typically, traditional closed testing of the group differences for the K outcomes would start with a test of the global K-order null hypothesis versus the global or omnibus alternative hypothesis:

$$H_0: \beta = 0 \quad\text{versus}\quad H_{1O}: \beta \ne 0 \tag{7}$$

that tests for any difference or combination of differences between groups in any direction, such as where the treatment is beneficial for some outcomes but harmful for others. Using a consistent estimate $\hat{\Sigma}$, the $T^2$-like Wald test of $H_0$ versus the global alternative $H_{1O}$ is provided by

$$X_O^2 = \hat{\beta}^T \hat{\Sigma}^{-1} \hat{\beta} \tag{8}$$

that is asymptotically distributed as chi-square on K df. If this K-order test is significant at level α, then one can continue to conduct the K − 1 order tests, etc. The traditional closed testing structure would entail tests of the set of hypotheses presented in Table 1.
Alternately, surrogate closed testing of such a multivariate or composite outcome could be conducted using a test that is directed to a one-directional alternative hypothesis. Assume that $\beta_j < 0$ represents a beneficial effect of treatment for the jth outcome. For the K-order test the one-directional alternative hypothesis specifies that

$$\beta_j \le 0 \ \text{ for all } j = 1, \dots, K \quad\text{and}\quad \sum_{j=1}^{K} \beta_j < 0. \tag{9}$$

This surrogate hypothesis specifies that the experimental treatment has a beneficial or neutral effect on each component event ($\beta_j \le 0$) and is superior for one or more outcomes ($\sum_{j=1}^{K} \beta_j < 0$). Thus, this restricted alternative hypothesis is directed to regions in the K-dimensional parameter space where there is a preponderance of benefit for the set of K outcomes, though not necessarily to the same degree, with no overt harm for any outcome.
Recently, Lachin and Bebu [7] described the application of the 1 df Wei-Lachin robust one-directional test to such data. The test is based on the simple sum, or equivalently the unweighted mean, of the Cox PH model coefficients, or log hazard ratios representing the treatment group difference for each component event, where different types of events in the same subject are included in the analysis of the different outcomes.
The K-order Wei-Lachin test is provided by

$$Z_S = \frac{J'\hat{\beta}}{\sqrt{J'\hat{\Sigma} J}} = \frac{\bar{\beta}}{\sqrt{\hat{V}(\bar{\beta})}}$$

where $J$ is the unit vector of K ones and $\bar{\beta}$ is the mean of the K coefficients. Frick [10,11] showed that this test is maximin efficient provided that $J'\hat{\Sigma} > 0$, which will almost always apply. Then, the joint null hypothesis in (7) can be replaced by the surrogate hypothesis $\tilde{H}_0: \bar{\beta} = 0$, thus satisfying the conditions in (5). For an intermediate order test the unit vector $J$ is modified to include a 1 only for those components tested, 0 otherwise. For example, if K = 4 and we wish to test the 2-order hypothesis $H_{0,24}$, the test would employ the corresponding vector $J_{24} = (0\ 1\ 0\ 1)^T$ in the like expressions

$$Z_{S,24} = \frac{J_{24}'\hat{\beta}}{\sqrt{J_{24}'\hat{\Sigma} J_{24}}} = \frac{\bar{\beta}_{24}}{\sqrt{\hat{V}(\bar{\beta}_{24})}}$$

where $\bar{\beta}_{24}$ is the mean of the coefficients tested. Then let $D_{24} = \mathrm{diag}(J_{24})$. The corresponding maximin condition is $J_{24}'(D_{24}'\hat{\Sigma} D_{24}) > 0$ for those elements with a corresponding value 1 in $J_{24}$.
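Then the elemental hypothesis for the first component $H_{0,1}: \beta_1 = 0$ would be rejected if the tests of $\tilde{H}_{0,1234}$, $\tilde{H}_{0,123}$, $\tilde{H}_{0,124}$, $\tilde{H}_{0,134}$, $\tilde{H}_{0,12}$, $\tilde{H}_{0,13}$, $\tilde{H}_{0,14}$ and $H_{0,1}$ were all nominally significant at level α. A similar testing tree would apply to the other elemental hypotheses.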
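The Wei-Lachin statistic for any subset of components is a ratio of a sum of coefficients to the square root of a quadratic form, and can be sketched as follows. This is a minimal illustration: the function name `wei_lachin_z`, the coefficient values and the covariance matrix are all hypothetical and are not the PEACE estimates.

```python
import math
from statistics import NormalDist

def wei_lachin_z(beta_hat, sigma_hat, components):
    """Z = J' beta_hat / sqrt(J' Sigma_hat J), with J = 1 for tested components, else 0."""
    K = len(beta_hat)
    J = [1.0 if (i + 1) in components else 0.0 for i in range(K)]
    numerator = sum(J[i] * beta_hat[i] for i in range(K))
    variance = sum(J[i] * sigma_hat[i][k] * J[k] for i in range(K) for k in range(K))
    return numerator / math.sqrt(variance)

def two_sided_p(z):
    """Two-sided p-value from the standard normal distribution."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Hypothetical log hazard ratios and covariance matrix (illustration only).
beta_hat = [-0.10, -0.05, -0.20, -0.25]
sigma_hat = [
    [0.010, 0.002, 0.001, 0.001],
    [0.002, 0.012, 0.001, 0.002],
    [0.001, 0.001, 0.015, 0.003],
    [0.001, 0.002, 0.003, 0.011],
]

z_all = wei_lachin_z(beta_hat, sigma_hat, {1, 2, 3, 4})   # order-4 test
z_24 = wei_lachin_z(beta_hat, sigma_hat, {2, 4})          # uses J_24 = (0 1 0 1)^T
print(z_all, two_sided_p(z_all))
print(z_24, two_sided_p(z_24))
```

Dividing numerator and denominator by the number of tested components gives the same value, which is why the statistic can be read either as a test of the sum or of the unweighted mean of the coefficients.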
For illustration we use data from the Prevention of Events with Angiotensin Converting Enzyme Inhibition (PEACE) study [12] that assessed whether treatment with ACE inhibition with trandolapril (ACEi, n = 4158) versus placebo (n = 4132), when added to standard therapy, would reduce the risk of cardiovascular outcomes. Table 2 presents the numbers of subjects (cases) with each type of event, the hazard ratio, the two-sided confidence limits and p-value, nominally, with no adjustment for multiple tests. There is a slight benefit with ACEi versus placebo for CV death, but none for non-fatal MI. However, there is a barely non-significant (two-sided) benefit with ACEi for non-fatal stroke, and a barely significant benefit for congestive heart failure. This pattern of differences between groups represents the type of results that would fall under the one-directional alternative hypothesis (9).
The traditional closed testing procedure would start with a $T^2$-like omnibus K-order test as in (8). For the set of 4 PEACE study outcomes, this yields $X_O^2 = 7.39$ on 4 df with p = 0.117, and no difference between groups can be declared to reach significance. Table 3 then presents the surrogate closed testing (two-sided) using the Wei-Lachin test for orders 2 through 4. Test results that do not reach significance at the 0.05 level, or are included … respectively. These are the two order-3 hypotheses that include intersections with $\tilde{H}_{0,34}$. This hypothesis can then be tested and indeed is significant at p = 0.011, indicating a treatment group difference in the joint (bivariate) event-time distributions of non-fatal stroke and CHF. Thus, by surrogate closed testing we can conclude that ACEi significantly reduced the risk of non-fatal stroke and CHF jointly, but are not able to demonstrate a beneficial effect on either outcome separately. In addition, neither would be significant had the Holm or Hochberg procedure been applied to the set of 4 component tests.
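The omnibus p-value quoted above can be checked with a few lines of arithmetic. The sketch below uses the closed-form upper tail of the chi-square distribution, which is valid only for even degrees of freedom; the function name `chi2_sf_even_df` is an illustrative choice.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable; closed form for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i          # builds (x/2)^i / i!
        total += term
    return math.exp(-half) * total

p_omnibus = chi2_sf_even_df(7.39, 4)
print(round(p_omnibus, 3))   # 0.117
```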
The most common method of analysis of such a composite outcome is a simple 1 df test of the difference between the treatment versus control groups using a logrank or Cox PH model test of the time to the first event (TTFE). This could also be viewed as providing a test of a different surrogate hypothesis that the distribution of the minimum event time does not differ between groups. This approach, however, does not include other events following the initial event, such as a CV death that occurs after an initial non-fatal MI. Lachin and Bebu [7] also show that the Wei-Lachin test can be more powerful than the TTFE analysis.
For the PEACE study, the analysis of the MACE + CHF composite outcome using the TTFE yields an estimated hazard ratio of 0.90 with a 95% confidence interval of (0.79, 1.02) with p = 0.12 two-sided. Thus, closed testing of the PEACE outcomes using either the omnibus or the TTFE test fails to declare any significant difference between groups.
Further, a note of caution. Bebu and Lachin [13] also show that the TTFE may not provide an unbiased α-level test of the joint null hypothesis that the hazard or survival functions do not differ between groups, i.e. of $H_0: \beta = 0$. Let $\hat{\beta}$ denote the log(HR) for the time-to-first event. They show that the distribution of the estimate $\hat{\beta}$ can differ substantially among groups even when $H_0$ in (7) is true, and conversely that there may be no difference between groups in the distribution of $\hat{\beta}$ even though $H_0$ is false. These discrepancies occur when there is a difference between groups in the correlation structure of the component event times. Unfortunately, there is no general method to assess this difference in correlations; however, Bebu and Lachin [13] describe an estimate of the correlation of event times under a bivariate exponential distribution.

Longitudinal repeated measures
Consider the case of K repeated measures over time where it is desired to conduct a test of the difference between the group means at each of the K points in time, post-randomization. Let $\mu_{ij}$ denote the mean of the observations in the ith group at the jth time, and $\theta_j = \mu_{1j} - \mu_{2j}$ denote the mean difference at the jth time. The K differences could be tested using a Bonferroni-type procedure, such as that of Holm. Alternately, a traditional closed testing procedure could be conducted starting with an overall omnibus K df "MANOVA" test using a $T^2$-test, with successive sub-order $T^2$ tests.
However, another possible order-K test is the overall group effect on 1 df in a longitudinal model that compares the "LSMEANS" of the two groups, these being the model-estimated average of the means over time in the two groups. Again, consider the case of K = 4 where $\hat{\theta}_j = \hat{\mu}_{1j} - \hat{\mu}_{2j}$ and the $\hat{\mu}_{ij}$ in the ith group at the jth time are obtained from a repeated measures longitudinal model. Then the estimated LSMEAN of the 4 repeated measures combined in the ith group is the unweighted mean $\bar{\mu}_{i,1234}$ and the estimated LSMEAN difference is $\bar{\theta}_{1234} = \bar{\mu}_{1,1234} - \bar{\mu}_{2,1234}$. Thus, at order K, the 1 df test of the difference in the LSMEANS of the K repeated measures is employed, which provides a test of the surrogate hypothesis $\tilde{H}_{0,1234}: \bar{\theta}_{1234} = 0$. At order K − 1, the LSMEANS of a given set of K − 1 means is employed, such as a test of $\tilde{H}_{0,123}: \bar{\theta}_{123} = 0$; and so on. Then at order 1 the difference between groups in the means at the jth time could be tested using a simple t-test provided that all of the intersection hypotheses of LSMEANS containing the jth mean difference are significant at level α. This approach would be directed to alternative hypotheses where the mean differences over time were all in the same direction, i.e. the mean profiles did not cross, analogous to the alternative hypothesis in (9).
For example, an analysis of the group differences in K = 4 repeated measures can be conducted … Also note that since the test of the LSMEANS is a test of the unweighted average of the time-specific means, this is the same as a Wei-Lachin one-directional test. Lachin [14] also describes the details of the application of the Wei-Lachin test to multiple mean differences. This test is efficient when the groups tend to differ in the same direction, but not necessarily of the same magnitude, over time.
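The equivalence noted above can be seen by writing the LSMEAN difference as a scaled sum. The sketch below assumes that $\hat{\Sigma}$ denotes the estimated covariance matrix of the K time-specific differences $\hat{\theta}$ from the longitudinal model and that $J$ is the K-vector of ones, both introduced here by analogy with the Wei-Lachin notation; the factor $1/K$ cancels.

```latex
\bar{\theta} = \frac{1}{K}\, J'\hat{\theta},
\qquad
\widehat{V}(\bar{\theta}) = \frac{1}{K^{2}}\, J'\hat{\Sigma} J,
\qquad
Z = \frac{\bar{\theta}}{\sqrt{\widehat{V}(\bar{\theta})}}
  = \frac{J'\hat{\theta}}{\sqrt{J'\hat{\Sigma} J}} .
```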
To illustrate, consider an analysis of the systolic blood pressure values recorded every 6 months over the first 2 years of follow-up in the subset of 1371 subjects with diabetes in the PEACE study. Had the full cohort of 8290 subjects been employed, virtually every method of analysis would produce extremely significant differences. The following are the treatment group within time LSMEANS and the LSMEAN differences (placebo − ACEi): … Table 4 then shows that all tests of the higher order intersection hypotheses are significant at the 0.05 level, so that the elementary hypotheses can also be tested at the 0.05 level, and all are significant.

In comparison, had the 4 elementary hypotheses been tested using the Holm procedure, all would also have been significant at the 0.05 level; the adjusted p-values for months 6, 12, 18 and 24 (ranked in that order) are <0.0004, 0.0006, 0.0268 and 0.0383.
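For reference, Holm-adjusted p-values of the kind quoted above can be obtained with the step-down rule sketched below. The function name `holm_adjust` and the unadjusted p-values are hypothetical; the PEACE unadjusted values are not reproduced here.

```python
def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])   # indices from smallest p upward
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * pvalues[i])
        running_max = max(running_max, candidate)        # enforce monotonicity
        adjusted[i] = running_max
    return adjusted

# Hypothetical unadjusted p-values for four visits (illustration only).
raw = [0.010, 0.040, 0.030, 0.005]
print(holm_adjust(raw))
```

The smallest p-value is multiplied by 4, the next by 3, and so on, with each adjusted value forced to be at least as large as the one before it.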

Closed testing of group differences within subgroups
Consider the case where pre-specified analyses of the differences between groups are conducted within K = 2 subgroups of the study population defined by a subgroup factor, such as the comparison of treatment group differences separately among men and among women (later generalized to $K \ge 2$ subgroups). It is generally recommended that analyses within subgroups only be conducted when a test for a group by subgroup factor interaction, or a test for homogeneity of effects among subgroups, is significant [15], such as a test that the treatment group difference among males equals that among females. If significant, then the tests of significance within each subgroup often employ an alpha adjustment for the 2 tests, such as a Bonferroni correction (or its generalizations). However, a correction is unnecessary under the surrogacy principle described above.
Let $\{\theta_j\}$ denote the treatment group difference within the jth subgroup, j = 1, 2, defined by the gender of each subject, where $\theta_1$ is the treatment group difference among males and $\theta_2$ the difference among females. Then $\hat{\theta} = (\hat{\theta}_1\ \hat{\theta}_2)^T$ is asymptotically normally distributed with expectation $\theta = (\theta_1\ \theta_2)^T$ and with a covariance matrix $\Sigma = \mathrm{diag}(\sigma_1^2\ \sigma_2^2)$ with covariance $\sigma_{12} = 0$ since the two subgroups are independent.
Table 4. The sequence of tested hypotheses for the longitudinal analysis of systolic blood pressure in the subset of diabetic subjects in the PEACE study. The model is adjusted for the baseline systolic blood pressure and the group differences tested using a t-test with 1288 df. Shown is the tested hypothesis for each intersection hypothesis ($\bar{\theta}$), the difference in the LSMEANS for placebo minus ACEi, the SE and the two-sided p-value for the test of the difference between groups. For example, the test of $\bar{\theta}_{124}$ is testing that the average of the group means at visits 1, 2 and 4 (6, 12 and 24 months) is the same in the two groups.

The objective is to determine whether the treatment group difference within either subgroup is statistically significant when there is heterogeneity of the treatment group differences among the two subgroups. Thus, the elemental null hypotheses to be tested are $H_{0,1}: \theta_1 = 0$ and $H_{0,2}: \theta_2 = 0$. One approach is to use a Bonferroni correction for the two tests. Another is to use traditional closed sequential testing that would start with a $T^2$-like Wald test of the joint null hypothesis $H_{0,12}: \theta_1 = \theta_2 = 0$ against the global or omnibus alternative $H_{1,12}: \theta_1 \ne 0$ and/or $\theta_2 \ne 0$ of a group difference in either direction within either subgroup. With a consistent estimate $\hat{\Sigma}$, this order 2 test is provided by

$$X_O^2 = \hat{\theta}^T \hat{\Sigma}^{-1} \hat{\theta} = \frac{\hat{\theta}_1^2}{\hat{\sigma}_1^2} + \frac{\hat{\theta}_2^2}{\hat{\sigma}_2^2}. \tag{12}$$

Under $H_{0,12}$, $X_O^2$ is distributed as chi-square on 2 df. If significant at level α, each of the elemental hypotheses $H_{0,1}$ and $H_{0,2}$ is rejected if the corresponding Z-test value is likewise significant at level α.
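The traditional two-step procedure for two independent subgroups can be sketched as below. Because the covariance is zero, the Wald statistic reduces to the sum of the two squared Z-values, and the 2 df chi-square upper tail has the closed form exp(−x/2). The function name `traditional_closed_test` and the input values are hypothetical.

```python
import math
from statistics import NormalDist

def traditional_closed_test(theta1, theta2, se1, se2, alpha=0.05):
    """Order-2 omnibus test on 2 df, then the within-subgroup Z-tests (two-sided)."""
    z1, z2 = theta1 / se1, theta2 / se2
    x2 = z1 ** 2 + z2 ** 2                  # Wald statistic when the covariance is zero
    p_omnibus = math.exp(-x2 / 2.0)         # chi-square upper tail, exact for 2 df
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    omnibus = p_omnibus <= alpha
    return omnibus, omnibus and abs(z1) >= z_crit, omnibus and abs(z2) >= z_crit

# Hypothetical estimates expressed in standard-error units (se = 1).
print(traditional_closed_test(5.0, 1.0, 1.0, 1.0))   # (True, True, False)
print(traditional_closed_test(1.9, 1.9, 1.0, 1.0))   # (True, False, False)
```

The second call shows that the omnibus test can reject while neither within-subgroup test reaches significance.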
However, the alternative hypothesis parameter space ($H_{1,12}$) for this order 2 test includes cases where $\theta_1 = \theta_2 \ne 0$, i.e. where there is a homogeneous non-zero treatment group difference within the two subgroups. Such values do not represent any heterogeneity among subgroups or a treatment by subgroup interaction. Thus, the order 2 omnibus test is not specifically directed to detecting cases where there is a treatment by subgroup interaction.
Rather, we only wish to assess the treatment effect within subgroups when there is evidence that the variation among subgroups is greater than would be expected by chance, i.e. a treatment by subgroup interaction exists. So in this case we are interested in first testing the surrogate null hypothesis $\tilde{H}_{0,12}: \theta_1 = \theta_2$ against $\tilde{H}_{1,12}: \theta_1 \ne \theta_2$. A simple test is provided by

$$Z_S = \frac{\hat{\theta}_1 - \hat{\theta}_2}{\sqrt{\hat{\sigma}_1^2 + \hat{\sigma}_2^2}}. \tag{13}$$

Asymptotically $Z_S \sim N(0, 1)$ under $\tilde{H}_{0,12}$ and the test rejects $\tilde{H}_{0,12}$ in favor of $\tilde{H}_{1,12}$ when $Z_S \ge Z_{1-\alpha}$ for an upper-tail one-sided test at level α, or when $|Z_S| \ge Z_{1-\alpha/2}$ at level α two-sided. If that test is significant, we can then test the treatment difference within each subgroup at level α (two-sided) with strong control of the type I error probability, without the need for a correction for two tests.
Again, note that $H_{0,12} \Rightarrow \tilde{H}_{0,12}$ and rejection of $\tilde{H}_{0,12}$ $\Rightarrow$ rejection of $H_{0,12}$. In this case, the order 2 joint hypothesis ($H_{0,12}$) of no difference in both subgroups implies that both subgroups have the same null effect ($\tilde{H}_{0,12}$). However, if we reject $\tilde{H}_{0,12}$, the no-interaction hypothesis, this implies that the joint null hypothesis $H_{0,12}$ is false because $\theta_1 \ne \theta_2$ implies that $\theta_1$ and $\theta_2$ cannot both equal zero.
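The surrogate procedure for two independent subgroups can be sketched in the same way: test homogeneity first, and test within subgroups at level α only if homogeneity is rejected. The function name `surrogate_closed_test` and the input values are hypothetical, with estimates expressed in standard-error units.

```python
import math
from statistics import NormalDist

def surrogate_closed_test(theta1, theta2, se1, se2, alpha=0.05):
    """Test homogeneity (theta1 = theta2) first; if rejected, test each subgroup."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_s = (theta1 - theta2) / math.sqrt(se1 ** 2 + se2 ** 2)
    heterogeneous = abs(z_s) >= z_crit
    reject1 = heterogeneous and abs(theta1 / se1) >= z_crit
    reject2 = heterogeneous and abs(theta2 / se2) >= z_crit
    return heterogeneous, reject1, reject2

print(surrogate_closed_test(5.0, 1.0, 1.0, 1.0))   # (True, True, False)
print(surrogate_closed_test(3.0, 3.0, 1.0, 1.0))   # (False, False, False)
```

In the second call both subgroup estimates are three standard errors from zero, yet nothing is rejected because the two estimates are identical. This is the price paid for directing the order-2 test at interaction rather than at any difference.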
This can also be generalized to the case of more than 2 subgroups. Suppose K = 3 with the vector of estimated treatment group differences within the three subgroups $\hat{\theta} = [\hat{\theta}_1\ \hat{\theta}_2\ \hat{\theta}_3]^T$. Since the subgroups are independent, the covariance matrix of the treatment group estimates within the three subgroups is $\hat{\Sigma} = \mathrm{diag}[\hat{\sigma}_1^2\ \hat{\sigma}_2^2\ \hat{\sigma}_3^2]$. In this case the traditional 3-order test of $H_{0,123}$ would be replaced by a 2 df test of homogeneity of the three subgroup differences $\tilde{H}_{0,123}: \theta_1 = \theta_2 = \theta_3$ using a $T^2$-like statistic of the form in (2) with contrast matrix

$$C = \begin{bmatrix} -1 & 1 & 0 \\ -1 & 0 & 1 \end{bmatrix}$$

with subgroup 1 as the reference for the 2:1 and 3:1 pairwise subgroup differences. Then the test of the elemental hypothesis $H_{0,1}$, for example, would be declared significant at level α if it and the intersection hypotheses $\tilde{H}_{0,12}$, $\tilde{H}_{0,13}$, and $\tilde{H}_{0,123}$ were all rejected at level α. The other elemental hypotheses can likewise be tested at level α provided that the relevant higher order intersection hypotheses are also rejected at level α.
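A minimal sketch of the 2 df homogeneity test for three independent subgroups follows. It assumes contrasts of subgroups 2 and 3 against subgroup 1 and inverts the resulting 2×2 covariance matrix explicitly; the function name `homogeneity_test_3` and the input values are hypothetical.

```python
import math

def homogeneity_test_3(theta_hat, variances):
    """2 df Wald test of theta1 = theta2 = theta3 for independent subgroups.

    Uses contrasts against subgroup 1: d2 = theta2 - theta1, d3 = theta3 - theta1.
    """
    t1, t2, t3 = theta_hat
    v1, v2, v3 = variances
    d2, d3 = t2 - t1, t3 - t1
    # C Sigma C' = [[v1 + v2, v1], [v1, v1 + v3]]; invert the 2x2 matrix explicitly.
    a, b, c = v1 + v2, v1, v1 + v3
    det = a * c - b * b
    x2 = (c * d2 * d2 - 2.0 * b * d2 * d3 + a * d3 * d3) / det
    p_value = math.exp(-x2 / 2.0)   # chi-square upper tail, exact for 2 df
    return x2, p_value

# Hypothetical subgroup estimates with unit variances.
print(homogeneity_test_3([0.0, 1.0, 2.0], [1.0, 1.0, 1.0]))
```

With equal variances the statistic equals the sum of squared deviations of the three estimates from their mean divided by the common variance, which gives 2 in this example.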

Numerical computations
Computations were conducted for the case of two (independent) subgroups to compare the operating characteristics of the traditional closed testing approach for subgroup analyses versus analyses using the test of the surrogate hypothesis of homogeneity. Computations also included tests within 2 subgroups using a Holm (improved Bonferroni) correction that were virtually identical to the traditional closed testing and are omitted herein. To simplify, we assume that the variance of the observations is 1 with sample size n per treatment group in both subgroups so that the standard error of the mean difference within each subgroup is $\hat{\sigma}_j = \sqrt{2/n}$. The traditional closed testing approach employs a 2 df omnibus $T^2$-like test of the order-2 hypothesis $H_{0,12}: \theta_1 = \theta_2 = 0$ shown in (12). Under $H_{0,12}$ the test statistic $X_O^2$ has a large-sample central chi-square distribution with 2 df. The null is rejected at the α = 0.05 level if the statistic is greater than the distribution's 95th percentile. If significant, both $H_{0,1}$ and $H_{0,2}$ can be tested at the 0.05 level, either one or two-sided. Herein all tests are conducted two-sided at the 0.05 level.
Alternately, at order 2 we could employ the 1 df test of the surrogate hypothesis of homogeneity $\tilde{H}_{0,12}: \theta_1 = \theta_2$. Under $\tilde{H}_{0,12}$, the contrast test statistic $Z_S = (\hat{\theta}_1 - \hat{\theta}_2)/\sqrt{\hat{\sigma}_1^2 + \hat{\sigma}_2^2} = (\hat{\theta}_1 - \hat{\theta}_2)/\sqrt{4/n}$ from (13) has a large-sample standard normal distribution. If this test of homogeneity is significant at level α two-sided, then both $H_{0,1}$ and $H_{0,2}$ can be tested at level α = 0.05 one or two-sided.
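Under these simplifying assumptions the power of the homogeneity test has a closed form. The sketch below assumes that the variance of the contrast is the sum of the two subgroup variances, 2/n + 2/n = 4/n; the function name `power_homogeneity` is an illustrative choice.

```python
import math
from statistics import NormalDist

def power_homogeneity(theta1, theta2, n, alpha=0.05):
    """Two-sided power of Z_S when each subgroup difference has variance 2/n."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1.0 - alpha / 2.0)
    delta = (theta1 - theta2) / math.sqrt(4.0 / n)   # mean of Z_S; its variance is 1
    return nd.cdf(delta - z_crit) + nd.cdf(-delta - z_crit)

print(round(power_homogeneity(0.0, 0.0, 50), 3))   # 0.05
print(round(power_homogeneity(0.5, 1.0, 50), 3))   # 0.424
```

The second value corresponds to a subgroup difference of 0.5 with n = 50 and agrees with the power reported for the surrogate order-2 test in that setting.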
Figs 1 and 2 describe the difference between the traditional and surrogate testing procedures. Fig 1 illustrates the rejection region for the traditional method starting with the 2 df omnibus test of $H_{0,12}: \theta_1 = \theta_2 = 0$ at level α = 0.05, followed by 1 df tests of $H_{0,1}$ and $H_{0,2}$, two-sided. The omnibus test rejection region at α = 0.05 consists of points $(\hat{\theta}_1, \hat{\theta}_2)$ outside of the circle. If this test is significant, the hypotheses $H_{0,1}: \theta_1 = 0$ and/or $H_{0,2}: \theta_2 = 0$ for each subgroup may be rejected at α = 0.05 (two-sided) when $|Z_j|$ exceeds $Z_{1-\alpha/2} = Z_{0.975}$, j = 1, 2. For the test of $H_{0,1}$ the rejection region falls outside a vertical band with a small crescent piece removed from the left and right sections. These represent values that fail to reject the joint hypothesis, for which $H_{0,1}$ is not tested. Likewise, the rejection region for the test of $H_{0,2}$ falls outside a horizontal band with a small crescent removed from the upper and lower sections. Also note that there are 4 small triangular areas that fall within the rejection region for the joint test but for which the test of $H_{0,1}$ or $H_{0,2}$ would not be significant.

Fig 2 illustrates the rejection region for the surrogate test method starting with the 1 df contrast test of homogeneity of the subgroup mean differences $\tilde{H}_{0,12}: \theta_1 = \theta_2$ at level α = 0.05 two-sided, followed by 1 df tests of the difference within each subgroup, two-sided. The 1 df test of homogeneity rejects the null hypothesis for points $(\hat{\theta}_1, \hat{\theta}_2)$ outside of a diagonal band about the line of equality $\hat{\theta}_1 = \hat{\theta}_2$. Outside of this band the difference between $\hat{\theta}_1$ and $\hat{\theta}_2$ is large enough to reject $\tilde{H}_{0,12}$. Then the hypothesis $H_{0,1}$ for the first subgroup mean difference is rejected at α = 0.05 (two-sided) when $|\hat{\theta}_1|$ exceeds $Z_{1-\alpha/2} = Z_{0.975}$. This corresponds to a vertical band symmetric about $\theta_1 = 0$. Likewise, for the test of $\theta_2$ there would be a horizontal band intersecting the diagonal band that defines the rejection region.
For example, the point $(\hat{\theta}_1, \hat{\theta}_2) = (5, 1)$ falls outside of the diagonal band and therefore would indicate rejection of the test of homogeneity (rejection of $\tilde{H}_{0,12}: \theta_1 = \theta_2$). Then the test of significance of $H_{0,1}: \theta_1 = 0$ would be declared significant but not the test of $H_{0,2}: \theta_2 = 0$. Also, the two small triangular areas represent values that would lead to rejection of the surrogate hypothesis of homogeneity but for which neither test within subgroups would be significant.

Table 5 then presents the operating characteristics (rejection probabilities) for tests using traditional closed-testing and surrogate closed-testing for illustrative values of $\theta_1$ and $\theta_2$ with sample sizes of n = 25 or 50 within each cell. These were computed using numerical integration, see the Appendix. Scenarios include values of $\theta_1$ and $\theta_2$ satisfying $H_{0,12}$ and/or $\tilde{H}_{0,12}$ and the respective alternatives.
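One way such a rejection probability could be obtained by numerical integration is sketched below for the surrogate procedure. This is not the Appendix method, which is not shown here, but a simple one-dimensional midpoint rule under the same assumptions (unit variance, n per group, independent subgroups). The function name and the number of integration steps are illustrative choices.

```python
import math
from statistics import NormalDist

def prob_reject_subgroup2_surrogate(theta1, theta2, n, alpha=0.05, steps=20000):
    """P(reject homogeneity AND reject H_{0,2}) by one-dimensional midpoint integration.

    Works in standardized units: Z_j ~ N(theta_j / sqrt(2/n), 1), independent,
    with Z_S = (Z_1 - Z_2) / sqrt(2).
    """
    nd = NormalDist()
    z = nd.inv_cdf(1.0 - alpha / 2.0)
    se = math.sqrt(2.0 / n)
    d1, d2 = theta1 / se, theta2 / se
    half_width = z * math.sqrt(2.0)          # |Z_1 - Z_2| must be at least this large
    lo, hi = d2 - 10.0, d2 + 10.0            # integration range for Z_2
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        z2 = lo + (i + 0.5) * h
        if abs(z2) < z:                      # H_{0,2} is not rejected at this value
            continue
        # Probability that Z_1 falls inside the band around z2 (homogeneity retained).
        inside = nd.cdf(z2 + half_width - d1) - nd.cdf(z2 - half_width - d1)
        total += nd.pdf(z2 - d2) * (1.0 - inside) * h
    return total

print(round(prob_reject_subgroup2_surrogate(0.0, 1.0, 50), 3))
```

The call above uses θ1 = 0, θ2 = 1.0 and n = 50, so the result can be compared with the rejection probability of about 0.942 reported for the surrogate test of the second subgroup in that setting.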
Fig 1. The omnibus two degree-of-freedom test of H 0,12 : θ 1 = θ 2 = 0 will reject the null hypothesis at level α for values (θ̂ 1 , θ̂ 2 ) outside the circle. If the omnibus test is significant at level α, the test of H 0,1 : θ 1 = 0 then rejects outside of the green bar, and that of H 0,2 : θ 2 = 0 rejects outside of the red bar. Note the four small near-triangles in which the omnibus test is rejected but neither of the two elementary tests is significant. https://doi.org/10.1371/journal.pone.0219520.g001
Fig 2. The test of homogeneity H̃ 0,12 : θ 1 = θ 2 will reject the null hypothesis at level α for values (θ̂ 1 , θ̂ 2 ) outside of the black diagonal band. If the surrogate test is significant at level α, the test of H 0,1 : θ 1 = 0 then rejects outside of the green bar, and that of H 0,2 : θ 2 = 0 rejects outside of the red bar. https://doi.org/10.1371/journal.pone.0219520.g002
For each sample size, under the joint null hypothesis H 0,12 : θ 1 = θ 2 = 0 in scenario 1, all tests have a type I error probability ≤ 0.05, with that for the surrogate tests within each subgroup being less (more conservative) than traditional closed testing. Under the surrogate joint null hypothesis H̃ 0,12 : θ 1 = θ 2 = 0.5 or 1.0 (scenarios 2-3), the rejection probabilities for the surrogate tests of the elementary hypotheses (the type I error probabilities for these tests) are ≤ 0.05. However, scenarios 2 and 3 also fall under the global alternative H 1,12 for which, as would be expected, the traditional closed testing procedures provide increasing power as the common value for θ increases. This is also reflected by the power of the 2 df test of H 0,12 under the joint null compared to the nominal type I error probabilities of the 1 df test of the surrogate hypothesis H̃ 0,12 .
Scenarios 4-6 fall under both the global alternative hypothesis H 1,12 and the surrogate alternative hypothesis H̃ 1,12 where 0 ≤ θ 1 < θ 2 . In scenarios 4 and 5 where θ 1 = 0, all procedures preserve the type I error probability for the test of H 0,1 and the traditional closed testing procedure provides slightly greater power for the test of H 0,2 than does the surrogate test (approximately 0.996 versus 0.942 when θ 2 = 1.0 for n = 50). However, in scenario 6 where θ 1 = 0.5 and θ 2 = 1.0, since the difference between subgroups is smaller than in scenario 5 (0.5 versus 1.0), the surrogate test of H̃ 0,12 is less powerful than the traditional omnibus test of H 0,12 (0.424 versus nearly 1.0 for n = 50), and as a result, the tests of the elementary hypotheses are less powerful under the surrogate versus traditional closed testing.
Note that scenarios 2-3 fall under the global alternative H 1,12 whereas they fall under the surrogate null hypothesis H̃ 0,12 . Thus, the traditional tests have greater "power". Scenarios 4-6 fall under both alternatives. In all cases the traditional closed tests have higher rejection probabilities. That is because they are rejecting H 0,12 in situations that do not fall in the surrogate alternative H̃ 1,12 parameter space.
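The dependence of the surrogate gate-keeping test on the between-subgroup difference can be illustrated with a closed-form power calculation for the two-sided 1 df test of homogeneity. The sketch below assumes a known unit standard deviation, n observations per treatment group within each subgroup, and independent subgroup estimates, so that each θ̂ j has variance 2/n and their difference has variance 4/n. These are assumptions made for illustration, not a restatement of the Appendix, and `homogeneity_power` is a hypothetical name.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def homogeneity_power(theta1, theta2, n, sigma=1.0, alpha=0.05):
    """Rejection probability of the two-sided 1 df test of theta1 = theta2.

    Assumes Var(theta_hat_j) = 2*sigma^2/n in each of two independent
    subgroups (an assumption), so Var(theta_hat_2 - theta_hat_1) = 4*sigma^2/n.
    """
    se_diff = sigma * sqrt(4.0 / n)
    delta = (theta2 - theta1) / se_diff          # mean of the Z statistic
    crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # P(Z < -crit) + P(Z > crit) for Z ~ N(delta, 1)
    return STD_NORMAL.cdf(-crit - delta) + 1.0 - STD_NORMAL.cdf(crit - delta)


for theta1, theta2 in [(0.0, 0.0), (0.5, 0.5), (0.0, 1.0), (0.5, 1.0)]:
    power = homogeneity_power(theta1, theta2, n=50)
    print(f"theta1={theta1}, theta2={theta2}: {power:.3f}")
```

Under these assumptions the rejection probability stays at the nominal 0.05 whenever θ 1 = θ 2 , and for n = 50 it is close to the values quoted above for a difference of 1.0 (about 0.94) and a difference of 0.5 (about 0.42).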
To show this, consider the following 2×2 table for scenario 4 and n = 50, which displays the joint and marginal probabilities that the elementary test within stratum 2 would be significant at the 0.05 level using either the traditional or the surrogate closed testing procedures.
Marginally, the traditional closed testing procedure has a higher rejection probability than does the surrogate closed testing (0.586 versus 0.395). However, the probability that both reject is 0.374, meaning that with probability 0.212 (= 0.586 − 0.374) the traditional test would reject in cases where the surrogate test does not, i.e. in cases where the test of homogeneity is not significant. Further, significance of the surrogate test (with probability 0.395) is highly concordant with that of the traditional test (joint probability 0.374), meaning that the probability of the traditional test failing to be significant when the surrogate test is significant is small (0.021 = 0.395 − 0.374).
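The joint and marginal probabilities just described can be approximated by a simple numerical integration over the plane of standardized estimates. The sketch below uses a midpoint rule on a grid and assumes, as an illustration only, unit standard deviation, n per treatment group within each subgroup, independent normal estimates with variance 2/n, and scenario 4 taken as θ 1 = 0, θ 2 = 0.5. The integration scheme and the name `joint_rejection_table` are assumptions of this sketch and need not match the method in the Appendix.

```python
from math import exp, log, pi, sqrt
from statistics import NormalDist

ALPHA = 0.05
Z_CRIT = NormalDist().inv_cdf(1 - ALPHA / 2)
CHI2_2DF_CRIT = -2.0 * log(ALPHA)      # chi-square(2 df) critical value
BAND = sqrt(2.0) * Z_CRIT              # homogeneity gate: |z1 - z2| > BAND


def joint_rejection_table(theta1, theta2, n, sigma=1.0, step=0.02, half_width=7.0):
    """Joint probabilities that the test of H_0,2 is significant under the
    traditional and surrogate closed testing procedures (midpoint rule)."""
    se = sigma * sqrt(2.0 / n)          # assumed SE of each subgroup estimate
    mu1, mu2 = theta1 / se, theta2 / se
    cells = int(round(2.0 * half_width / step))
    offsets = [-half_width + (i + 0.5) * step for i in range(cells)]
    weights = [exp(-0.5 * u * u) / sqrt(2.0 * pi) * step for u in offsets]
    table = {"both": 0.0, "traditional_only": 0.0,
             "surrogate_only": 0.0, "neither": 0.0}
    for u2, w2 in zip(offsets, weights):
        z2 = mu2 + u2
        elementary = abs(z2) > Z_CRIT   # 1 df test of H_0,2
        for u1, w1 in zip(offsets, weights):
            z1 = mu1 + u1
            trad = elementary and z1 * z1 + z2 * z2 > CHI2_2DF_CRIT
            surr = elementary and abs(z1 - z2) > BAND
            if trad and surr:
                key = "both"
            elif trad:
                key = "traditional_only"
            elif surr:
                key = "surrogate_only"
            else:
                key = "neither"
            table[key] += w1 * w2
    return table


table = joint_rejection_table(theta1=0.0, theta2=0.5, n=50)
for key, prob in table.items():
    print(f"{key:>17}: {prob:.3f}")
print("traditional marginal:", round(table["both"] + table["traditional_only"], 3))
print("surrogate marginal:  ", round(table["both"] + table["surrogate_only"], 3))
```

If the assumptions match those behind Table 5, the marginal and joint probabilities should fall close to the values quoted above (0.586, 0.395 and 0.374). The fourth cell, in which neither procedure rejects, is simply the remainder.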
Table 5. Probabilities of rejection of the Order-2 gate-keeping tests and the tests of the elemental hypotheses using the traditional closed-testing procedure and the extended surrogate closed-testing procedure for n of 25 or 50 per group within each subgroup and with homogeneous or heterogeneous treatment effects θ 1 and θ 2 within each of the two subgroups. All tests at the 0.05 level two-sided.
In summary, all procedures preserve the type I error probability under the null for either or both elementary tests (scenarios 1-3). Under the surrogate alternative H̃ 1,12 (scenarios 4-6), the traditional testing procedure provides greater "power" than the surrogate testing owing to a higher probability of rejection in cases where H̃ 0,12 is not rejected, i.e. where the treatment group differences are not demonstrably dissimilar. Thus, the rejection regions for the traditional versus surrogate closed testing procedures differ, as well as the probabilities of rejection over the parameter space.
To display this, the probability of rejection of the different tests was computed by numerical integration for θ 1 = −1(0.1)1 and θ 2 = −1(0.1)1, i.e. from −1 to 1 in steps of 0.1. The values of θ 1 and θ 2 for which power equaled a specific value were then plotted (power contours). Fig 3 displays the power contours over the parameter space for the tests of the elementary hypotheses H 0,1 and H 0,2 , respectively, for the traditional closed testing procedure for n = 50. These power contours are close to straight vertical or horizontal lines, respectively, as would be the case for a simple test with no adjustment for multiplicity. Fig 4 then displays the power contours for these same tests using the surrogate closed testing procedure. The regions in which the test of H 0,1 has high power, such as 0.7 or greater, are characterized by vertical lines in the upper left and lower right quadrants that "bend" away from the diagonal acceptance region for the surrogate test of H̃ 0,12 .
The same pattern is obtained for the test of H 0,2 when the labels of the axes are interchanged. Thus, these contours describe regions of the parameter space where the values of θ 1 and θ 2 within the two subgroups differ substantially, and where there is a high probability that a test of θ 1 and/or θ 2 would also be significant.
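The shape of these contours can be explored with a small grid computation of the power of the surrogate test of H 0,1 over θ 1 , θ 2 = −1(0.1)1. The sketch below integrates over z1 numerically and handles z2 in closed form. It assumes unit standard deviation, n = 50 per treatment group within each subgroup, and independent normal estimates with variance 2/n. These assumptions and the function name `surrogate_power_h01` are illustrative and are not taken from the Appendix.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

PHI = NormalDist().cdf
Z_CRIT = NormalDist().inv_cdf(0.975)
BAND = sqrt(2.0) * Z_CRIT  # half-width of the diagonal acceptance band


def surrogate_power_h01(theta1, theta2, n=50, sigma=1.0, step=0.02, half_width=6.0):
    """P(homogeneity gate rejects and |Z1| > Z_CRIT) under the assumed model."""
    se = sigma * sqrt(2.0 / n)
    mu1, mu2 = theta1 / se, theta2 / se
    cells = int(round(2.0 * half_width / step))
    total = 0.0
    for i in range(cells):
        z1 = mu1 - half_width + (i + 0.5) * step
        if abs(z1) <= Z_CRIT:
            continue  # elementary test of H_0,1 not significant
        density = exp(-0.5 * (z1 - mu1) ** 2) / sqrt(2.0 * pi)
        # Probability that Z2 falls inside the band |z1 - Z2| <= BAND
        inside_band = PHI(z1 + BAND - mu2) - PHI(z1 - BAND - mu2)
        total += density * (1.0 - inside_band) * step
    return total


grid = [round(-1.0 + 0.1 * k, 1) for k in range(21)]
power = {(t1, t2): surrogate_power_h01(t1, t2) for t1 in grid for t2 in grid}

# For a few values of theta2, list the theta1 values with power of at least 0.7.
for t2 in (-1.0, 0.0, 1.0):
    high = [t1 for t1 in grid if power[(t1, t2)] >= 0.7]
    print(f"theta2={t2:+.1f}: power >= 0.7 at theta1 in {high}")
```

Under these assumptions the high-power region for H 0,1 lies where θ 1 and θ 2 differ substantially. For example, power is close to 1 at (θ 1 , θ 2 ) = (1, −1) but cannot exceed 0.05 anywhere on the line θ 1 = θ 2 , since there the gate-keeping test of homogeneity rejects with probability 0.05.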

Discussion
Herein we describe applications of the closed testing principle using α-level tests of higher order surrogate hypotheses that are directed to testing different null versus alternative hypotheses than those employed in traditional closed-testing procedures. The type I error probability is protected provided that all hypotheses are tested using an α-level test. We present three applications directly relevant to the analysis of clinical trial results. Clearly there are others. The advantage of the surrogate testing approach is that it provides a test that is directed to detect specific alternatives of interest, as opposed to the global alternative hypothesis of any difference in any direction.
The first two examples both employ surrogate hypotheses that are directed towards regions of the parameter space where one group has a preponderance of benefit for the set of outcomes considered, the so-called one-directional alternative hypothesis (9). This alternative is specified in terms of one group being more beneficial than the other, such as the experimental treatment being beneficial relative to placebo. However, there may be situations, such as a study of comparative effectiveness, where it is of interest to determine whether either treatment A is superior to B or vice versa, in which case a two-sided alternative hypothesis and two-sided test would be employed. A two-sided analysis can also be employed to meet regulatory requirements to establish effectiveness in a placebo controlled trial.