Graphical approaches for the control of generalized error rates

When simultaneously testing multiple hypotheses, the usual approach in the context of confirmatory clinical trials is to control the familywise error rate (FWER), which bounds the probability of making at least one false rejection. In many trial settings, these hypotheses will additionally have a hierarchical structure that reflects the relative importance and links between different clinical objectives. The graphical approach of Bretz et al (2009) is a flexible and easily communicable way of controlling the FWER while respecting complex trial objectives and multiple structured hypotheses. However, the FWER can be a very stringent criterion that leads to procedures with low power, and may not be appropriate in exploratory trial settings. This motivates controlling generalized error rates, particularly when the number of hypotheses tested is no longer small. We consider the generalized familywise error rate (k-FWER), which is the probability of making k or more false rejections, as well as the tail probability of the false discovery proportion (FDP), which is the probability that the proportion of false rejections is greater than some threshold. We also consider asymptotic control of the false discovery rate, which is the expectation of the FDP. In this article, we show how to control these generalized error rates when using the graphical approach and its extensions. We demonstrate the utility of the resulting graphical procedures on three clinical trial case studies.

The increase in multiplicity in clinical trials also tends to go hand-in-hand with an increase in the complexity of the objectives and structure of the hypotheses tested. A key setting where this occurs is when measuring multiple endpoints to answer distinct (but related) clinical questions. The corresponding hypotheses often fit naturally within a hierarchical structure that reflects the relative importance and links between the clinical questions that the trial aims to answer. For example, in a trial with both a primary and secondary hypothesis, the trialist may only wish to test the secondary hypothesis if the primary hypothesis is first rejected. More complex hierarchical structures can be formed as the number of hypotheses increases.
Many methods have been developed for FWER control that respect complex trial objectives and multiple structured hypotheses. A highly flexible framework for doing so is the graphical approach to hypothesis testing, as proposed independently by Bretz et al 3 and Burman et al. 4 In the framework of Bretz et al, 3 vertices represent the null hypotheses and weights represent the local significance levels, which are propagated through weighted, directed edges. The resulting multiple testing procedure can be tailored to structured families of hypotheses with arbitrary dependence between the hypotheses, and allows the visualization of complex decision strategies in an easily communicable way. Many well-known procedures for FWER control are special cases of the graphical approach, such as the fixed sequence (or hierarchical) test, 5 the Holm procedure, 6 the Hochberg procedure, 7 and several gatekeeping procedures. [8][9][10]

However, controlling the FWER is a very stringent criterion, especially as the number of hypotheses increases. By controlling the probability of even a single type I error, the power of FWER-controlling procedures can be very low, with little chance of any of the individual hypotheses being rejected. While strong FWER control is appropriate in confirmatory contexts, in exploratory trial settings such a strict criterion may not be necessary. Indeed, as reflected in the FDA (2017) guidance on multiple endpoints in clinical trials, 1 exploratory analyses can be included in a trial to explore and generate new hypotheses. Since such exploratory hypotheses will often be followed up with confirmatory testing, strict FWER control at the exploratory stage is no longer necessary.
Westfall and Bretz 11 expand on this argument, by categorizing the hypothesis tests in a typical clinical trial into families of "efficacy," "safety," and "exploratory" tests. For the efficacy family, the primary endpoints and main secondary endpoints are the basis of regulatory approval and labeling, and hence require strong FWER control. However, there may also be "lesser interest tests" (eg, multiple time point analyses), where FWER controlling methods are not needed. Nonetheless, the authors note that some form of multiplicity adjustment would strengthen the claims made for this set of tests. For the safety family, serious and known treatment-related adverse events (AEs) do not require multiplicity adjustment (since type II errors are of much greater concern). However, for all other AEs, the authors state that there is a clear need to recognize the multiplicity problem, and note that the use of the false discovery rate (FDR) may be more appropriate here. Finally, for the family of exploratory tests (which may include both safety and efficacy tests), the authors state that "standard multiplicity adjustment here seems unreasonable, as power will be very low," and again recommend the use of FDR controlling methods.
All this demonstrates that outside of the context of testing the primary and main secondary endpoints for regulatory approval and labeling, strong FWER control may not be needed, even in confirmatory trials. Less stringent error rates can then be used, where one or more false rejections are acceptable in order to increase the power of the trial. One approach is to control the generalized FWER, or k-FWER. The k-FWER is the probability of making at least k false rejections, where k ≥ 1. Clearly the FWER is a special case of the k-FWER when k = 1. A number of methods controlling the k-FWER have been proposed, including step-up procedures [12][13][14][15] and permutation-based procedures. [16][17][18] Another approach is to accept a certain proportion of false rejections, that is, to control the false discovery proportion (FDP). The FDP is closely related to the well-known FDR, 19 which is now a common error rate to control in experiments with a large number of hypotheses, such as genomic studies. The FDR is the expected value of the FDP, that is, the FDR is the expected proportion of errors among the rejected hypotheses. Although controlling the FDR controls the expectation of the FDP, in practical applications the actual FDP might be far from its expectation. 20 In the context of clinical trials with a relatively small number (< 100) of hypotheses, this motivates control of the tail probability of the FDP, which guarantees control over the probability of having a high proportion of false discoveries. Some methods for controlling the FDP have previously been proposed. 12,17,21 In general, the various procedures proposed in the literature for generalized error rate control are not suitable for structured hypothesis testing problems encountered in the context of clinical trials, as they do not respect the underlying hierarchical structure of the testing strategy.
To address this, in this article we show how to control both the k-FWER and FDP (as well as asymptotic control of the FDR) when using the graphical approach of Bretz et al 3 and its extensions. We achieve this by modifying and applying the methodology for k-FWER and FDP control given by van der Laan et al 22 and Romano and Wolf 18 to the graphical framework. The performance of the resulting procedures is compared analytically and through simulations in the context of various case studies.

FIGURE 1 Graph showing a possible testing strategy for a diabetes trial with a primary and secondary endpoint that tests two doses of a drug against a placebo, as given in Maurer et al 23
The rest of the article is structured as follows. In Section 2, we introduce the basic notation and the graphical approach to hypothesis testing. Section 3 shows how to modify the graphical approach to control the k-FWER, while Section 4 gives a further modification of the graphical approach to control the FDP as well as (asymptotic) control of the FDR. Section 5 shows how to use the proposed procedures for a number of extensions to the graphical approach. We illustrate the proposed methods using three case studies in Section 6, and conclude with a discussion in Section 7.

GRAPHICAL APPROACH TO HYPOTHESIS TESTING
Consider simultaneously testing multiple null hypotheses H 1 ,… ,H m which are related in some way and so can be thought of as a family of hypothesis tests. Since we are jointly testing multiple hypotheses, there is a resulting multiplicity problem that we wish to take account of in the testing procedure. The standard approach for confirmatory clinical trials is to control the FWER (in the strong sense) at some prespecified level α, where α ∈ (0,1). That is, P(V > 0) ≤ α under any configuration of true and false null hypotheses, where V denotes the number of false rejections made. We consider testing H 1 ,… ,H m using the corresponding P-values p 1 ,… ,p m . Let M = {1,… ,m} denote the associated index set and assume that the P-values associated with the true null hypotheses satisfy P(p i ≤ u) ≤ u for any u ∈ [0,1]. We now describe the graphical approach to hypothesis testing introduced by Bretz et al, 3 which controls the FWER. In this approach, the hypotheses H 1 ,… ,H m are represented by vertices, with associated weights denoting the significance levels. Any two vertices H i and H j are connected by a directed edge with weight g ij , which indicates the fraction of the significance level α i which is propagated from H i to H j if H i is rejected. If g ij = 0 then there is no propagation of the significance levels, and the edge can be dropped for convenience from the graphical visualization. These g ij form an m × m transition matrix G = (g ij ), which fully characterizes the propagation of significance levels.
As an example, consider a trial in diabetes patients that compares two doses (a low dose and a high dose) of an experimental drug against placebo, in terms of both a primary and secondary clinical endpoint. Since the primary endpoint is more important than the secondary one, the trialist tests the primary hypothesis first; only if this is rejected is the secondary hypothesis then tested. Assuming that both doses are equally important, a possible testing strategy is shown in the graph in Figure 1, as given in Maurer et al. 23

Bretz et al 24 proposed a graphical weighting strategy which allows the computation of the set of weights for any intersection hypothesis H J = ⋂ j∈J H j , J ⊆ M. The graphical weighting strategy requires the specification of initial weights w i (M), i ∈ M, for the global null hypothesis H M and the transition matrix G, with entries g ij satisfying the regularity conditions

$$0 \le g_{ij} \le 1, \quad g_{ii} = 0 \quad \text{and} \quad \sum_{j=1}^{m} g_{ij} \le 1 \quad \text{for all } i, j \in M. \tag{1}$$

Algorithm 7 in Appendix A1 reproduces the algorithm given in Bretz et al 24 for calculating the weights w j (J), j ∈ J, which can then be used for testing the intersection hypothesis H J .
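As an illustration of the weighting strategy, the following Python sketch checks the regularity conditions on G and computes the weights w j (J) of an intersection hypothesis by removing the hypotheses outside J one at a time, using the published update rule of the Bonferroni-based graphical approach. The function names, and the requirement that the initial weights are non-negative and sum to at most one, are assumptions made for this sketch; it is not the Algorithm 7 listing from Appendix A1. The example graph is the Holm procedure with m = 3 (equal initial weights and g ij = 1/(m − 1)), not the diabetes trial graph.

```python
def check_graph(weights, G, tol=1e-12):
    """Check the initial weights (assumed: non-negative, sum <= 1) and the
    regularity conditions on G (0 <= g_ij <= 1, g_ii = 0, row sums <= 1)."""
    if any(w < 0 for w in weights) or sum(weights) > 1 + tol:
        return False
    for i, row in enumerate(G):
        if row[i] != 0 or sum(row) > 1 + tol:
            return False
        if any(g < 0 or g > 1 for g in row):
            return False
    return True


def remove_node(weights, G, j, active):
    """Update the weights and G after hypothesis j leaves the active set."""
    m = len(weights)
    rest = [l for l in active if l != j]
    new_w = [0.0] * m
    new_G = [[0.0] * m for _ in range(m)]
    for l in rest:
        # j passes the fraction g_jl of its weight on to l
        new_w[l] = weights[l] + weights[j] * G[j][l]
        denom = 1.0 - G[l][j] * G[j][l]
        for k in rest:
            if k != l and denom > 0:
                new_G[l][k] = (G[l][k] + G[l][j] * G[j][k]) / denom
    return new_w, new_G


def intersection_weights(weights, G, J):
    """Weights w_j(J), j in J, found by removing every hypothesis not in J."""
    active = list(range(len(weights)))
    w, g = list(weights), [row[:] for row in G]
    for j in [i for i in active if i not in J]:
        w, g = remove_node(w, g, j, active)
        active.remove(j)
    return {j: w[j] for j in active}


# Hypothetical example: Holm graph with m = 3 hypotheses (indices 0, 1, 2).
m = 3
holm_w = [1 / m] * m
holm_G = [[0.0 if i == j else 1 / (m - 1) for j in range(m)] for i in range(m)]
print(check_graph(holm_w, holm_G))
print(intersection_weights(holm_w, holm_G, {0, 1}))
```

For the Holm graph the sketch returns equal weights 1/|J| within each intersection, which is the pattern expected for that procedure.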
Given these weights, a weighted multiple testing procedure can then be applied to each intersection hypothesis H J , such as a weighted Bonferroni test, or a weighted parametric test if the joint distribution of the P-values is known. 24 Applying a weighted Bonferroni test is the simplest option, and leads to the original Bonferroni-based graphical approach for FWER control based on a shortcut procedure where the m hypotheses can be tested sequentially, and hence requires at most m steps of the algorithm 3 (see also Algorithm 8 in Appendix A1).
Adjusted P-values can also be calculated when using this graphical approach, which then allow the hypothesis tests to be easily performed at any significance level α. More formally, the adjusted P-value P adj j for hypothesis H j is the smallest significance level at which one can reject the hypothesis using the given multiple test procedure. 3 Algorithm 9 in Appendix A1 reproduces the algorithm given in Bretz et al 3 for calculating adjusted P-values.
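The sequential test and the adjusted P-values can be sketched in Python as follows. This is a minimal sketch based on the published update rule of the Bonferroni-based graphical approach, not the listings of Algorithms 8 and 9 in Appendix A1; the function names, the tie-breaking by smallest index, and the capping of adjusted P-values at 1 are assumptions. The two example graphs (Holm and fixed sequence, each with three hypotheses) and the P-values are hypothetical.

```python
def remove_node(w, G, j, active):
    """Update the weights and G after hypothesis j is rejected."""
    m = len(w)
    rest = [l for l in active if l != j]
    new_w = [0.0] * m
    new_G = [[0.0] * m for _ in range(m)]
    for l in rest:
        new_w[l] = w[l] + w[j] * G[j][l]
        denom = 1.0 - G[l][j] * G[j][l]
        for k in rest:
            if k != l and denom > 0:
                new_G[l][k] = (G[l][k] + G[l][j] * G[j][k]) / denom
    return new_w, new_G


def graphical_test(p, weights, G, alpha):
    """Reject hypotheses one at a time while some p_i <= w_i * alpha."""
    active = list(range(len(p)))
    w, g = list(weights), [row[:] for row in G]
    rejected = []
    while True:
        eligible = [i for i in active if w[i] > 0 and p[i] <= w[i] * alpha]
        if not eligible:
            return rejected
        j = min(eligible, key=lambda i: (p[i] / w[i], i))
        w, g = remove_node(w, g, j, active)
        active.remove(j)
        rejected.append(j)


def adjusted_p_values(p, weights, G):
    """Smallest level at which each hypothesis would be rejected (capped at 1)."""
    active = list(range(len(p)))
    w, g = list(weights), [row[:] for row in G]
    adjusted, p_max = [None] * len(p), 0.0
    while active:
        ratio = {i: p[i] / w[i] if w[i] > 0 else float("inf") for i in active}
        j = min(active, key=lambda i: (ratio[i], i))
        p_max = max(ratio[j], p_max)  # adjusted values never decrease
        adjusted[j] = min(1.0, p_max)
        w, g = remove_node(w, g, j, active)
        active.remove(j)
    return adjusted


# Hypothetical graphs with three hypotheses (indices 0, 1, 2).
holm_w = [1 / 3] * 3
holm_G = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
fixed_w = [1.0, 0.0, 0.0]
fixed_G = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]

p = [0.01, 0.04, 0.03]
print(graphical_test(p, holm_w, holm_G, alpha=0.05))
print(adjusted_p_values(p, holm_w, holm_G))
```

In this sketch a hypothesis is rejected at level α exactly when its adjusted P-value is at most α, so the adjusted values summarize the test at every level at once.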
The R package gMCP 25 provides functions and a graphical user interface to perform all of the calculations described above.

GRAPHICAL APPROACHES FOR K-FWER CONTROL
Controlling the k-FWER at a prespecified level α means that P(V ≥ k) ≤ α, where V is the number of false rejections. The generalized Bonferroni procedure controls the k-FWER: 12,26 Reject any H i for which p i ≤ kα/m. Assuming known positive weights w i that sum to one (∑ i∈M w i = 1), Romano and Wolf 18 introduced the weighted generalized Bonferroni procedure: Reject any H i for which p i ≤ w i kα. If w i = 1/m for all i then this is equivalent to the unweighted version.
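The two single-step rules can be written in a few lines of Python. This is a minimal sketch with hypothetical function names, P-values, and weights; it only applies the thresholds kα/m and w i kα, where α is the significance level.

```python
def generalized_bonferroni(p, k, alpha):
    """Reject H_i when p_i <= k * alpha / m."""
    m = len(p)
    return [i for i, p_i in enumerate(p) if p_i <= k * alpha / m]


def weighted_generalized_bonferroni(p, weights, k, alpha):
    """Reject H_i when p_i <= w_i * k * alpha (positive weights summing to one)."""
    return [i for i, (p_i, w_i) in enumerate(zip(p, weights))
            if p_i <= w_i * k * alpha]


# Hypothetical P-values and weights for four hypotheses.
p = [0.01, 0.03, 0.02, 0.024]
print(generalized_bonferroni(p, k=2, alpha=0.05))
print(weighted_generalized_bonferroni(p, [0.4, 0.4, 0.1, 0.1], k=2, alpha=0.05))
print(weighted_generalized_bonferroni(p, [0.25] * 4, k=2, alpha=0.05))
```

With equal weights 1/m the weighted rule returns the same rejections as the unweighted rule.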
In order to extend the graphical approach for controlling the k-FWER, it is tempting to simply replace α by kα, in analogy to the modification made for the generalized Bonferroni procedures. However, in general this does not control the k-FWER for k > 1. As a counterexample, consider the Holm procedure with m hypotheses, which can be represented as a graph with initial weights w i (M) = 1/m and g ij = 1/(m − 1) for all i,j ∈ M, i ≠ j. Using the graphical weighting strategy (Algorithm 7), we have w i (I) = 1/|I| for all i ∈ I and I ⊆ M. Replacing α by kα in the graphical approach (Algorithm 8) is hence equivalent to a stepdown procedure where the ith smallest P-value is compared with the significance level α i = kα/(m + 1 − i). However, since kα/(m + 1 − i) > kα/(m + k − i) for k > 1, the result of Theorem 2.3 in Lehmann and Romano 12 shows that this procedure does not control the k-FWER. Hence, we turn to alternative procedures for k-FWER control.
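The gap between the two sets of stepdown constants is easy to tabulate. The sketch below compares the naive constants kα/(m + 1 − i) with the Lehmann and Romano stepdown constants, which are assumed here to be kα/m for i ≤ k and kα/(m + k − i) for i > k (the text above quotes only the second expression). Function names and the values of m, k, and α are hypothetical.

```python
def naive_constants(m, k, alpha):
    """Stepdown constants obtained by replacing alpha with k * alpha in Holm."""
    return [k * alpha / (m + 1 - i) for i in range(1, m + 1)]


def lehmann_romano_constants(m, k, alpha):
    """Assumed form: k*alpha/m for i <= k, and k*alpha/(m + k - i) for i > k."""
    return [k * alpha / m if i <= k else k * alpha / (m + k - i)
            for i in range(1, m + 1)]


m, k, alpha = 4, 2, 0.05
for i, (a, b) in enumerate(zip(naive_constants(m, k, alpha),
                               lehmann_romano_constants(m, k, alpha)), start=1):
    print(f"i={i}: naive={a:.4f}  stepdown={b:.4f}")
```

For k = 1 the two lists coincide, which is the ordinary Holm procedure; for k > 1 the naive constants are at least as large everywhere and strictly larger for i ≥ 2.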

Augmented graphical approach for k-FWER control
We first consider a simple method of controlling the k-FWER described in van der Laan et al [22,Procedure 1], which can be applied to give a graphical approach for k-FWER control. The original method starts with an initial procedure that controls the usual FWER, and then augments this by additionally rejecting the hypotheses associated with the smallest k − 1 remaining (unrejected) P-values. These k − 1 additionally rejected hypotheses can be freely chosen, and so we aim to respect the hierarchical structure of the underlying multiple testing problem and to avoid rejecting hypotheses with large P-values. This results in the following augmented graphical approach for k-FWER control.
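The original (non-graphical) augmentation step can be sketched as follows: starting from the rejections of any FWER-controlling procedure, it adds the k − 1 smallest remaining P-values. The function name, the tie-breaking by smallest index, and the example values are hypothetical; the graphical version described in this section instead chooses the additional rejections by continuing along the graph.

```python
def augment_rejections(p, initial_rejections, k):
    """Add the k - 1 smallest unrejected P-values to an initial rejection set."""
    initial = set(initial_rejections)
    remaining = sorted((i for i in range(len(p)) if i not in initial),
                       key=lambda i: (p[i], i))
    return sorted(initial | set(remaining[:k - 1]))


# Hypothetical P-values; suppose the FWER-controlling procedure rejected H_0 only.
p = [0.01, 0.03, 0.02, 0.024]
print(augment_rejections(p, [0], k=2))
print(augment_rejections(p, [0], k=3))
```

Note that this plain version ignores any hierarchy among the hypotheses, which is the motivation for the graphical variant.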
Algorithm 1 (Augmented graphical approach for k-FWER control).

Here α′ ≥ 0 determines how many of the "free" rejections we use, and can be set larger than α. In fact, we can even set α′ large enough to ensure that k − 1 additional hypotheses are rejected, regardless of the observed P-values (see below). Of course, this comes at the potential cost of rejecting hypotheses with P-values close to 1 that are likely to be null. Conversely, low values of α′ mean that we are only willing to reject a hypothesis if it has reasonably substantial evidence against it.
In step (ii) of Algorithm 8, there may be a choice as to which of the hypotheses j ∈ I to reject. Since there can only be a maximum of k − 1 additional rejections in step (ii) of Algorithm 1, the order in which hypotheses are rejected does matter here. One sensible choice is to set j = arg min i∈I {p i /w i (I)}, which we use in the remainder of the article. The choice of α′ can be data-dependent to ensure that (up to) k − 1 additional rejections are made. More explicitly, we can increase α′ so that one additional rejection is made, then if necessary increase α′ further until another additional rejection is made, and so on. This allows an alternative formulation of the augmented graphical approach based on adjusted P-values, which does not depend on an explicit choice of α′.

If there are ties in the ordering in step (iii), they can be broken by choosing the hypothesis with the smallest index, for example. Algorithm 1 will give the same rejections as Algorithm 2 for α′ large enough. In addition, the R package gMCP 25 can straightforwardly be used to implement Algorithm 1 in two stages corresponding to steps (i) and (ii). Hence, we focus on Algorithm 1 in the rest of the article.
Example 1 (Example of the augmented graphical approach for k-FWER control). Consider the graph of the diabetes trial given in Figure 2, where we control the k-FWER for k = 2 with α = .05. Suppose also that the P-values are given by P 1 = .01, P 2 = .03, P 3 = .02, P 4 = .024.
In step (i) of Algorithm 1, the usual Bonferroni-based graphical procedure for FWER control would only reject H 1 . The updated graph (ie, removing node H 1 and propagating the local significance levels) is then used in step (ii) of Algorithm 1, with α replaced by α′. Supposing that α′ = 0.5, we would then reject H 2 . At this point, we have made k − 1 additional rejections, and so we stop testing having rejected H 1 and H 2 . Figure 2 demonstrates each step of the augmented procedure graphically.

Generalized graphical approach for k-FWER control
As an alternative approach, we focus on Algorithm 7, which gives weights w j (J), j ∈ J, for any J ⊆ M. As shown in Bretz et al, 3 these weights satisfy the monotonicity condition

$$w_j(J) \le w_j(J') \quad \text{for all } J' \subseteq J \subseteq M \text{ and } j \in J'. \tag{2}$$

Hence, we can apply the generic stepdown method for k-FWER control described in Romano and Wolf [18, Algorithm 4.1] with these weights to give the following generalized Bonferroni-based algorithm. Essentially, we simply set the critical constants ĉ n,K,i (1 − α, k) in their algorithm equal to w i (K), where K is used in step (iv) of Algorithm 3 below to index the subsets including k − 1 of the previously rejected hypotheses. In what follows, we refer to this as the generalized graphical approach for k-FWER control.
If no such H i exists then stop; otherwise let R′ be the indices of these rejected hypotheses.
(v) Update the sets I and R as follows:

In Algorithm 3, at each step R is simply the set of indices of all the hypotheses that have been rejected previously, and I is the set of indices of the remaining hypotheses M⧵R. The algorithm is in a similar spirit to the graphical weighting strategy, 24 in the sense that there is a separation between the weighting strategy and the graphical test procedure which allows the generalization to k-FWER control.
In step (iii) of Algorithm 3, if |R| < k − 1 and |R| ≠ |I| then we can freely reject additional hypotheses so that a total of (up to) k − 1 rejections are made, while still controlling the k-FWER, since the algorithm will stop at this step. In order to respect the hierarchical structure of the underlying multiple testing procedure, and to avoid rejecting hypotheses with large P-values, we propose the following subprocedure in step (iii) if |R| < k − 1:
1. Set I → I⧵R and follow steps (ii) to (iv) of the usual Bonferroni-based graphical procedure for FWER control (Algorithm 8) with α replaced by α′, until up to k − 1 additional rejections have been made.
As before, α′ ≥ 0 determines how many of the "free" rejections we use, and hence can be set larger than α or made data-dependent so that (up to) k − 1 additional rejections are made.
Looking at Algorithm 3 as a whole, if k = 1, then once a hypothesis is rejected, it no longer plays a further role and step (iv) above reduces to rejecting any H i , i ∈ I, for which p i ≤ w i (I)α. Hence, Algorithm 3 is equivalent to Algorithm 8 (the usual Bonferroni-based graphical approach for FWER control) in that both algorithms will lead to exactly the same rejections when k = 1, assuming the same initial weights. When k > 1, however, the algorithm becomes more complex and involves maximizing over subsets including k − 1 of the previously rejected hypotheses in step (iv). As noted by Romano and Wolf, 18 intuitively this is because when considering a set of unrejected hypotheses in Algorithm 3, we may have already rejected (hopefully at most) k − 1 true null hypotheses. We do not know which of the rejected hypotheses are true, and so we maximize over subsets including at most k − 1 of those hypotheses previously rejected. In Appendix B1, we discuss the computational challenges of using Algorithm 3 for large values of m, and show how to streamline and operationalize the algorithm. However, in general these modified procedures only give asymptotic control of the k-FWER as the sample size of the trial increases.
In Appendix B2, we give some examples of using the generalized graphical approach. We show how it reduces to previous algorithms for k-FWER control as special cases, but also how it can have undesirable properties when the testing procedure has a hierarchical structure. The main problem (as demonstrated analytically in Example 4 of Appendix B2) is that if a hypothesis H j has fewer than k donors, its initial significance level will never increase, except for up to k − 2 hypotheses via the subprocedure in step (iii). Here, the donors of a hypothesis H j are the hypotheses that donate (or propagate) their significance levels to H j if they are rejected. Hence, the generalized graphical approach cannot effectively propagate the significance levels through the graph. We will see further examples of this in the case studies given in Section 6.
Example 2 (Example of the generalized graphical approach for k-FWER control). We again consider the graph of the diabetes trial given in Figure 1. We control the k-FWER for k = 2 with α = .05, with the P-values this time given by P 1 = .01, P 2 = .03, P 3 = .02, P 4 = .024. Applying the generalized graphical approach for k-FWER control gives the following: Here w i (I) are simply the initial weights and so w 1 (I) = w 2 (I) = 0.5 and w 3 (I) = w 4 (I) = 0. Since p 1 < α, p 2 < α, H 1 and H 2 are rejected at this step.

Existing power comparisons
Romano and Wolf 16 argue that the augmented procedure is suboptimal compared with their generic stepdown method for k-FWER control, since it can reject at most k − 1 more hypotheses than a usual FWER-controlling procedure, whereas Algorithm 3 can reject substantially more hypotheses. In their simulation study [16, Section 6], they considered testing the means of a multivariate normal distribution with common correlation ρ, where the number of hypotheses M = 50 or M = 400. They compared a number of different procedures for k-FWER control, but the relevant power comparison for our context of graphical approaches is the one between the generalized Holm procedure and the augmented Holm procedure. Their simulation results showed that when M = 400, k = 10 and ρ ≤ 0.5, the generalized Holm procedure can make a substantially higher number of rejections (up to twice as many) compared with the augmented Holm procedure. However, when M = 50 and k = 3, the augmented Holm procedure almost always had a higher number of rejections than the generalized Holm procedure. These findings are corroborated by the simulation results of Dudoit et al. 27 They also considered testing the means of a multivariate normal distribution, with the number of hypotheses M = 24 or M = 400. Through simulation, they compared the augmented and generalized Holm and Bonferroni procedures, concluding that the augmented approach tends to be more powerful than the generalized approach "for a broad range of models" [27, Section 6.2.1]. The largest gains in power were when the number of hypotheses was small and a large proportion of the null hypotheses were true. However, for a large number of hypotheses (M = 400) and when was relatively large, the generalized approaches were more powerful than the augmented approaches. In many clinical trials, we would be in the setting with a smaller number of hypotheses, and so the augmented approach would be expected to be more powerful.
In our case studies in Section 6, we consider power comparisons beyond Bonferroni or Holm based methods.

GRAPHICAL APPROACHES FOR FDP AND (ASYMPTOTIC) FDR CONTROL
In this section, we consider how to extend the graphical approach for FDP and (asymptotic) FDR control. More formally, the FDP is defined as FDP = V / max(R, 1), where R denotes the total number of rejections. The FDR is then the expectation of the FDP. A multiple testing procedure controls the tail probability of the FDP at level α if P(FDP > γ) ≤ α, where γ ∈ [0,1) is a prespecified bound. This is also known as the tail probability for the proportion of false positives 27 or the false discovery exceedance. 28 Note that setting γ = 0 results in control of the FWER at level α. In what follows, when we refer to FDP control, we mean controlling this tail probability of the FDP, where we suppress the dependence on γ for notational convenience.
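The difference between the FDR and the tail probability of the FDP can be seen in a small Python sketch. The function names and the four replicate outcomes are hypothetical and purely illustrative; in practice these quantities would be estimated from many simulated trials. Here `gamma` stands for the FDP bound.

```python
def false_discovery_proportion(rejected, true_nulls):
    """FDP = V / max(R, 1), with V false rejections among R rejections."""
    rejected, true_nulls = set(rejected), set(true_nulls)
    v = len(rejected & true_nulls)
    return v / max(len(rejected), 1)


def summarize(fdps, gamma):
    """Empirical FDR (mean FDP) and empirical tail probability P(FDP > gamma)."""
    fdr = sum(fdps) / len(fdps)
    tail = sum(f > gamma for f in fdps) / len(fdps)
    return fdr, tail


# Hypothetical outcomes of four replicates; hypotheses 2 and 3 are true nulls.
true_nulls = {2, 3}
replicates = [{0, 1}, {0, 1, 2}, set(), {2}]
fdps = [false_discovery_proportion(r, true_nulls) for r in replicates]
print(fdps)
print(summarize(fdps, gamma=0.5))
```

The mean FDP can be moderate even when a single replicate has an FDP of 1, which is the situation that tail probability control is designed to limit.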

Augmented approach for FDP and FDR control
A simple method of controlling the FDP based on a FWER-controlling procedure is given by van der Laan et al. 22

Here α′ ≥ 0 is a constant controlling how many additional rejections are made. As before, α′ may be greater than α, and can be set very large so that all D additional hypotheses are rejected. The choice of α′ can also be data-dependent, giving an alternative algorithm based on adjusted P-values, which does not depend on an explicit choice of α′.

If there are ties in the ordering in step (iii), they can be broken by choosing the hypothesis with the smallest index. Algorithm 4 will give the same rejections as Algorithm 5 for α′ large enough. In addition, the R package gMCP 25 can straightforwardly be used to implement Algorithm 4 in two stages corresponding to steps (i) and (iii). Hence, we focus on Algorithm 4 in the remainder of the article.
Example 3 (Example of the augmented graphical approach for FDP control). We continue the example of the diabetes trial displayed in Figure 1, where this time we aim to control the FDP with α = .05 and α′ = 0.5. Suppose this time the P-values are given by P 1 = .01, P 2 = .015, P 3 = .02, P 4 = .024. In step (i), the Bonferroni-based graphical procedure for FWER control would reject H 1 and H 2 . We then reject up to D additional hypotheses in step (iii), where D is the largest integer satisfying D/(D + 2) ≤ γ. Hence if 0 ≤ γ < 1/3 we make D = 0 additional rejections, if 1/3 ≤ γ < 1/2 we make D = 1 additional rejection (reject H 3 ), and if γ ≥ 1/2 we make D = 2 additional rejections (reject H 3 and H 4 ).
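The number of additional rejections in this example can be checked with a short Python sketch. It finds the largest integer D with D/(D + r) ≤ γ, where r is the number of hypotheses rejected by the FWER-controlling step and γ is the FDP bound; capping D at the number of hypotheses still unrejected is an assumption of the sketch, as is the function name. Exact fractions are used so that boundary cases such as γ = 1/3 are not affected by rounding.

```python
from fractions import Fraction


def max_additional_rejections(n_rejected, n_remaining, gamma):
    """Largest D <= n_remaining with D / (D + n_rejected) <= gamma."""
    gamma = Fraction(gamma)
    d = 0
    # D / (D + r) increases with D, so stop at the first violation.
    while d < n_remaining and Fraction(d + 1, d + 1 + n_rejected) <= gamma:
        d += 1
    return d


# Two hypotheses rejected by the FWER step, two remaining, as in Example 3.
for gamma in (Fraction(1, 4), Fraction(1, 3), Fraction(2, 5), Fraction(1, 2)):
    print(gamma, max_additional_rejections(2, 2, gamma))
```

The output reproduces the three cases of the example: no additional rejections for γ below 1/3, one for γ in [1/3, 1/2), and two for γ of at least 1/2.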
Although our focus in this article is on controlling the tail probability of the FDP, we note in passing that the augmented procedure for FDP control at level α automatically gives asymptotic control of the FDR at level 2α. This follows directly from van der Laan et al [22, Theorem 3]. Hence, applying the augmented graphical approach for FDP control given in Algorithm 4 at prespecified level α asymptotically controls the FDR at level 2α. Lehmann and Romano 12 showed that FDP control at level α also implies FDR control at level α* = α(1 − γ) + γ. Hence, if α* < 2α, which is equivalent to γ < α/(1 − α), this bound can be used instead, while also yielding finite sample FDR control.
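For completeness, the comparison between the two FDR bounds can be worked through in one line, writing α for the level of FDP control and γ for the FDP bound, and taking the finite-sample bound to be α* = α(1 − γ) + γ:

```latex
\alpha^{*} < 2\alpha
\;\Longleftrightarrow\; \alpha(1-\gamma) + \gamma < 2\alpha
\;\Longleftrightarrow\; \gamma(1-\alpha) < \alpha
\;\Longleftrightarrow\; \gamma < \frac{\alpha}{1-\alpha}.
```

As a numerical illustration under these assumptions, α = 0.05 gives α/(1 − α) ≈ 0.0526, so the finite-sample bound is the sharper of the two only when γ is below roughly 5%.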

Generalized graphical approach for asymptotic FDP control
As an alternative method to control the FDP, we can directly apply the generic method for FDP control in Romano and Wolf [18, Algorithm 8.1] to give the following graphical approach.
Algorithm 6 (Generalized graphical approach for asymptotic FDP control).
(i) Let j = 1 and k 1 = 1.
(ii) Apply the k j -FWER procedure given in Algorithm 3, and let R j denote the index set of the hypotheses it rejects.
(iii) If |R j | < k j /γ − 1, stop and reject all hypotheses rejected by the k j -FWER procedure. Otherwise, let j = j + 1 and k j = k j−1 + 1, then return to step (ii).
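The control flow of steps (i) to (iii) can be sketched in Python as below. To keep the sketch self-contained, the single-step generalized Bonferroni rule (reject when p i ≤ kα/m) is used as a stand-in for the k-FWER procedure; this substitution, the function names, the guard that stops once k reaches the number of hypotheses, and the example P-values are all assumptions, so the sketch illustrates the loop only and makes no claim about error control. Here `alpha` is the significance level and `gamma` the FDP bound.

```python
def generalized_bonferroni(p, k, alpha):
    """Stand-in k-FWER procedure: reject H_i when p_i <= k * alpha / m."""
    m = len(p)
    return {i for i, p_i in enumerate(p) if p_i <= k * alpha / m}


def stepwise_fdp(p, alpha, gamma, kfwer_procedure=generalized_bonferroni):
    """Increase k until the number of rejections falls below k / gamma - 1."""
    k = 1
    while True:
        rejected = kfwer_procedure(p, k, alpha)
        # gamma == 0 corresponds to plain FWER control: stop after k = 1.
        if gamma == 0 or len(rejected) < k / gamma - 1 or k >= len(p):
            return rejected
        k += 1


# Hypothetical P-values for four hypotheses.
print(stepwise_fdp([0.001, 0.02, 0.03, 0.5], alpha=0.05, gamma=0.5))
print(stepwise_fdp([0.001, 0.02, 0.03, 0.5], alpha=0.05, gamma=0))
```

In the first call the loop moves from k = 1 to k = 2 and gains one rejection before stopping; with γ = 0 it stops at k = 1.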
This algorithm was only proven in Romano and Wolf 18 to give asymptotic FDP control, but they showed empirically that it had good finite-sample control of the FDP. However, since Algorithm 6 is based on the k-FWER generalized graphical approach, the same potential problems as described in Appendix B2 will also apply. Finally, we again note in passing that the result of Lehmann and Romano 12 shows that this procedure gives (asymptotic) FDR control at level α* = α(1 − γ) + γ.

Example 4 (Example of the generalized graphical approach for FDP control). We continue the example of the diabetes trial displayed in Figure 1, where we aim to control the FDP with α = .05 and α′ = 0.5. Suppose the P-values are given by P 1 = .01, P 2 = .015, P 3 = .02, P 4 = .024. In step (ii), applying the FWER procedure results in the rejection of H 1 and H 2 .

Existing power comparisons
Romano and Wolf 16 argue that the augmented procedure for FDP control is suboptimal compared with their generalized method for FDP control, given that both are based on the k-FWER controlling procedures. In the simulation results for FDP controlling procedures given in Dudoit et al 27 and Romano and Wolf, 16 the augmented and generalized approaches as given above (Algorithms 4 and 6) are not directly compared for Holm (or Bonferroni) based procedures. However, given their simulation results for k-FWER control, we might also expect the augmented approach to have a higher power than the generalized approach when the number of hypotheses is small or when the proportion of true null hypotheses is high. We consider such power comparisons in our case studies in Section 6.

EXTENSIONS TO THE GRAPHICAL APPROACH
The original Bonferroni-based graphical approach of Bretz et al 3 has been extended in a number of ways. 29 These extensions can be used in the augmented and generalized procedures for k-FWER and FDP control.

Entangled graphs
First, we consider the setting where it is desirable for the graphical procedures to have memory, in the sense that the propagation of significance levels depends on their origin. To achieve this, we can define individual graphs for each relationship and combine them afterward. This is known as an entangled graph, and the algorithm presented in Maurer and Bretz 30 gives an entangled Bonferroni-based graphical approach. Hence, we can straightforwardly modify the augmented graphical approaches for k-FWER and FDR control for use with entangled graphs. To do so, simply replace Algorithm 8 with the algorithm of Maurer and Bretz. 30 For the adjusted augmented graphical approaches, replace Algorithm 9 with the algorithm of Maurer and Bretz, 31 which shows how to calculate adjusted P-values for the entangled graph setting. Maurer and Bretz 30 also showed how to calculate the weights for any intersection hypothesis H J , J ⊆ M, and this weighting strategy satisfies the monotonicity condition given in Equation (2). Hence, we can directly apply this weighting strategy to the generalized graphical approaches for k-FWER and FDP control. We give an example of the use of entangled graphs in the case study described in Section 6.2.

Weighted parametric tests
All the procedures so far have been based on weighted Bonferroni tests, which can be conservative. As an alternative, weighted parametric tests can be used if the joint distribution of the P-values p j , j ∈ J, is known for the intersection hypothesis H J . In this case, a weighted min-p test can be defined. 32,33 This test rejects H J if there exists a j ∈ J such that p j ≤ c J w j (J)α, where c J is the largest constant satisfying

$$P_{H_J}\Big( \bigcup_{j \in J} \{ p_j \le c_J w_j(J) \alpha \} \Big) \le \alpha.$$

If only some of the multivariate distributions of the P-values are known, then Bretz et al 24 and Xi et al 34 showed how to derive conservative upper bounds on this rejection probability, and hence determine a value for c J . The monotonicity condition in this setting is

$$c_J w_j(J) \le c_{J'} w_j(J') \quad \text{for all } J' \subseteq J \subseteq M \text{ and } j \in J', \tag{3}$$

which implies that rejection thresholds are always more liberal when fewer hypotheses are included in the set. In practice, this condition is often violated when using weighted parametric tests. 24 If this is the case, then it may be possible to modify the weighting scheme so that Equation (3) holds. 24,34 If the monotonicity condition does hold, then we can use the weighted parametric tests directly for the augmented and generalized approaches for k-FWER and FDP control, with the only change being that w i (I) is replaced by c I w i (I). For the adjusted augmented graphical approach, adjusted P-values can be constructed for weighted parametric tests. 34
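To make the role of c J concrete, the sketch below computes it in the simplest possible case. It assumes that c J is defined as the largest constant for which the probability, under H J , that at least one p j falls below c J w j (J)α is at most α, and it further assumes independent, uniformly distributed P-values, so that this probability equals 1 − ∏ j (1 − c J w j (J)α). Both assumptions, the bisection approach, and the function names are illustrative only; correlated test statistics would need the appropriate multivariate distribution instead.

```python
import math


def rejection_probability(c, weights, alpha):
    """P(at least one p_j <= c * w_j * alpha) for independent uniform P-values."""
    return 1.0 - math.prod(1.0 - min(1.0, c * w * alpha) for w in weights)


def min_p_constant(weights, alpha, iterations=200):
    """Largest c with rejection probability <= alpha, found by bisection."""
    lo, hi = 0.0, 1.0 / (max(weights) * alpha)  # at hi the probability is 1
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if rejection_probability(mid, weights, alpha) <= alpha:
            lo = mid
        else:
            hi = mid
    return lo


# Hypothetical intersection with two equally weighted hypotheses.
c = min_p_constant([0.5, 0.5], alpha=0.05)
print(c, c * 0.5 * 0.05)
```

Under independence the constant is slightly above 1, so each threshold c J w j (J)α is a little more liberal than the weighted Bonferroni threshold w j (J)α.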

Group sequential designs
The graphical approach can also be extended to group sequential designs with one or more interim analyses. Under mild monotonicity conditions, Maurer and Bretz 35 proposed a graphical testing procedure for multiple hypotheses and multiple interim analyses. More formally, consider testing H 1 ,… ,H m in a group sequential trial at time points t = 1,… ,h. Each H i has an associated error spending function a i , which depends on the significance level and on the information fraction y. The nominal significance levels for H i at time point t are the interim decision boundaries. We assume that these nominal levels are nondecreasing in the significance level (ie, the rejection boundaries are always higher when the total error rate of the design is higher). These conditions hold for many spending functions, including O'Brien-Fleming and Pocock boundaries. 35 The algorithm presented in Maurer and Bretz 35 gives a Bonferroni-based graphical test procedure for group sequential designs. The augmented graphical approaches for k-FWER and FDP control can hence be extended to apply to group sequential designs: simply replace Algorithm 8 with the algorithm in Maurer and Bretz. 35 For the adjusted augmented graphical approach, replace Algorithm 9 with the algorithm of Maurer and Bretz, 31 which shows how to calculate adjusted P-values for the group sequential design setting.

CASE STUDIES
In this section, we compare and contrast the use of the algorithms for k-FWER and FDP control on three clinical case studies covering a broad range of clinical trial applications. In Section 6.1 we revisit an exploratory pharmacodynamic clinical trial to investigate the effect of drug activity at the GABA-A receptor in the brain. In Section 6.2, we revisit a proof-of-concept trial investigating three doses of a new drug against a placebo on multiple biological endpoints related to acute heart failure. Finally, in Section 6.3 we illustrate the proposed approaches for the comparison of three therapies in a confirmatory clinical trial for heart failure patients.

FIGURE 3 Graph representing the testing strategy for the pharmacodynamic study described in Ferber et al, 36 with modified initial weights. Here contrast T i D j compares the change from baseline under dose j (j = 1,2,3) at time point i (i = 1,… ,5) to the corresponding change under placebo.

Pharmacodynamic study
Our first case study is motivated by the exploratory pharmacodynamic clinical study reported by Ferber et al, 36 which explored the effect of drug activity at the GABA-A receptor in the brain as measured using a quantitative electroencephalogram (qEEG). Three doses of the drug (0.25, 0.5, and 1 mg) were tested as well as a placebo. During the first 15 minutes after the drug was given to each patient, qEEG measurements were taken and afterward subdivided into five time slices of 3 minutes duration. The analysis strategy used a mixed effect linear model to obtain 15 contrasts to formally test. Contrast T i D j compared the change from baseline under dose j (j = 1,2,3) at time point i (i = 1,… ,5) to the corresponding change under placebo. Figure 3 shows the graph representing the hierarchical testing strategy used for these 15 hypotheses (with modified initial weights, see below), and Table 1 gives the unadjusted P-values from the mixed effects linear model for the 15 hypotheses. Figure 3 shows that the hypotheses T 4 D 3 and T 5 D 2 each only have a single donor hypothesis (T 5 D 3 ). Hence if they have initial weights of zero (as in the original graph 36 ), then they cannot be rejected by the generalized graphical approach for k-FWER control with k = 2. This then means that no hypotheses can be rejected except for T 5 D 3 . Therefore, we first set the initial weights for T 4 D 3 ,T 5 D 2 , and T 5 D 3 to 1/3, with all other weights set equal to zero.

TABLE 2 Rejected hypotheses for the pharmacodynamic study of Ferber et al, 36 with initial weights of 1/3 for T 4 D 3 , T 5 D 2 , and T 5 D 3

Table 2 shows the resulting rejections for the generalized and augmented graphical k-FWER and FDP controlling procedures, with = 1. Looking first at the k-FWER procedures, for k = 1 the generalized and augmented graphical procedures both reject the same eight hypotheses, as would be expected. For k = 2 and k = 3, the augmented procedure rejects 9 and 10 hypotheses, respectively. However, the generalized graphical procedure rejects fewer hypotheses when k = 2 and k = 3, with only three rejections in both cases. For k = 3 this is because all hypotheses have fewer than three donors and hence only those hypotheses with nonzero initial weights can be rejected. This is still the case when k = 2, even though all hypotheses (except for T 5 D 3 ) have two donors, showing that the generalized graphical procedure cannot effectively propagate the significance levels through the graph.
There is a similar pattern for the FDP controlling procedures, which is expected given that they are based on the k-FWER controlling procedures. For an FDP threshold of 0.1, the generalized and augmented graphical procedures give the same eight rejections, which are also the same as those of the k-FWER controlling procedures when k = 1. For thresholds of 0.2 and 0.3, the augmented procedure rejects 10 and 11 hypotheses, respectively. However, the generalized graphical procedure again rejects fewer hypotheses for the larger thresholds of 0.2 and 0.3, with only three hypotheses rejected. These are the same rejections as the generalized k-FWER controlling procedure for k > 1, because k j > 1 in Algorithm 6.
We also consider the setting where all 15 hypotheses have initial weight of 1/15. Table 3 shows the resulting rejections for the generalized and augmented graphical k-FWER and FDP controlling procedures, with = 1. With these new initial weights, the k-FWER controlling procedures both reject the same seven hypotheses when k = 1 and the same eight hypotheses with k = 2. This shows how the generalized graphical procedure can benefit with nonzero initial weights. However, for k = 3 the augmented procedure rejects one more hypothesis (T 5 D 1 ) than the generalized graphical procedure. This is because all hypotheses have fewer than three donors, and hence the weights for the generalized graphical procedure cannot increase-that is, there is no propagation of the significance levels.
Similarly, the FDP controlling procedures both reject the same seven hypotheses when the FDP threshold is 0.1 and the same eight hypotheses when it is 0.2, which are also the same rejections as the k-FWER controlling procedures when k = 1 and k = 2, respectively. However, for a threshold of 0.3 the augmented procedure rejects two more hypotheses than the generalized graphical procedure, while the latter only gives the same rejections as when the threshold is 0.2. This is because when the threshold is 0.3, k j > 2 in Algorithm 6 and there is no propagation of the significance levels.

The Pre-RELAX-AHF trial
Our second case study is a proof-of-concept trial called the Preliminary study of RELAXin in Acute Heart Failure (Pre-RELAX-AHF). 37 The trial compared three doses of relaxin against a placebo on multiple biological endpoints related to acute heart failure. Given that this was a proof-of-concept trial, less stringent error rates can be used when adjusting for multiplicity. One criterion for recommending the treatment for further testing is to show an effect on the majority of multiple endpoints. Following Davison et al, 38 we consider a subset of nine endpoints. We focus on the 30 μg/kg/day dose of relaxin treatment, which showed efficacy on six of these endpoints when compared with placebo, using one-sided (uncorrected) P-values at a significance level of .1. In what follows, we call the 30 μg/kg/day dose of relaxin treatment the experimental treatment, and the placebo the control treatment.
Since the experimental treatment was declared efficacious in six out of nine endpoints in the Pre-RELAX-AHF trial, we consider a trial design where it is required to reject at least six out of nine hypotheses to declare success. Calling these the primary hypotheses, we then add a hierarchical structure to this trial by supposing that we also test secondary hypotheses if at least six out of the nine primary hypotheses were rejected. Hence, we have a family of primary hypotheses  1 = (H 1 , … , H 9 ) corresponding to testing the experimental treatment against the control across the nine endpoints, and a family of secondary hypotheses  2 .
We can represent this six out of nine gatekeeping procedure using entangled graphs, which were described in Section 5.1. More precisely, we can define gatekeeping graphs for all hypotheses and then entangle them. 30 We perform a Holm procedure  l on six hypotheses for each of the 84 subsets of size 6, which we denote J l , l = 1,… ,84. The full significance level is passed on to  2 if all six hypotheses in  1l = {H i ∶ i ∈ J l } are rejected. The testing procedure is given by the entangled graph  (c,  l ; l = 1, … , 84) where c i = 1/84 for i = 1,… ,84. This is equivalent to the following testing strategy: the usual Holm procedure is performed on the nine hypotheses in  1 at level α until any six of these hypotheses are rejected. The remaining primary and secondary hypotheses are then tested using the weights given in Table 4, which depend on the number |I| of unrejected hypotheses in  1 . For simplicity, in what follows we suppose that  2 consists of a single hypothesis H 10 (which could, eg, represent a composite safety endpoint). We can then use the weights given in Table 4 in the k-FWER and FDP controlling graphical procedures.

TABLE 4 Table of weights for the entangled graph procedure used to analyse the trial based on Pre-RELAX-AHF

In our simulation study, for the primary hypotheses  1 we follow Delorme et al 39 and take the empirical means and standard errors of the endpoints as the true parameter values for the experimental (E) and control (C) treatments. The numerical values of the means μ C , μ E and standard deviations σ C , σ E are given in Appendix C. We assume that the observed means of the endpoints for the experimental and control treatments follow a multivariate normal distribution.
Here diag(σ G ) is a diagonal matrix with the ith diagonal element equal to σ G i , and Σ(ρ) is a correlation matrix with ones on the diagonal and ρ on all off-diagonal terms. The test statistic for endpoint i is the standardized difference between the observed means under the experimental and control treatments, which is compared with a t-distribution. The estimator of the variance of the difference between the means, as well as the appropriate degrees of freedom for the t-distribution, are given by Delorme et al 39 and implemented in their R package rPowerSampleSize. 40 For the secondary hypothesis H 10 , for simplicity we assume that the test statistic T 10 follows a normal distribution with mean 3 and variance 1, and is independent of the test statistics for  1 . Table 5 gives the marginal power to reject each hypothesis H 1 ,… ,H 10 , calculated using 10^4 trial replications, with α = .1 and = 1. The results show that in all scenarios, the augmented procedure has an equal or higher power to reject each of the hypotheses H 1 ,… ,H 10 . For the primary hypotheses H 1 ,… ,H 5 and H 8 , this is especially noticeable for the k-FWER controlling procedures when k = 2 and k = 3. For hypothesis H 9 , the augmented procedures have a substantially higher power compared with the generalized graphical procedure (except for when controlling the usual FWER). However, H 9 is actually a true null hypothesis (with μ C 9 = μ E 9 = 0.07) and so this implies a higher type I error rate for H 9 when using the augmented procedure. In fact the type I error rate for H 9 is below or equal to the nominal 10% in all scenarios for the generalized graphical procedures. Finally, for the secondary hypothesis H 10 (which has an initial weight of zero), we see that the power decreases as k and the FDP threshold increase for the k-FWER and FDP controlling generalized graphical procedures, respectively (in particular, the power is only 6% when the threshold is 0.3 for the latter procedure).
Again this shows that in contrast to the augmented procedures, the generalized graphical approaches do not effectively propagate the significance levels when there is a hierarchical structure in the hypotheses.
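The count of 84 component graphs in the six out of nine gatekeeping strategy is simply the number of ways of choosing six of the nine primary hypotheses. The short Python sketch below enumerates these subsets and recovers the equal entangling weights of 1/84; the variable names are hypothetical and the snippet only illustrates the combinatorics, not the entangled graph procedure itself.

```python
from itertools import combinations
from math import comb

# Labels for the nine primary hypotheses H1, ..., H9.
primary = [f"H{i}" for i in range(1, 10)]

# One Holm component graph per subset of six primary hypotheses.
subsets = list(combinations(primary, 6))
n_graphs = len(subsets)

# Equal entangling weights, one per component graph.
entangling_weight = 1.0 / n_graphs

# Number of component graphs in which a given hypothesis (here H1) appears.
n_containing_h1 = sum("H1" in s for s in subsets)
```

Each primary hypothesis appears in the same number of component graphs, which is why the strategy treats the nine endpoints symmetrically.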

ATMOSPHERE study
Our final case study is motivated by the confirmatory ATMOSPHERE study 41 in patients with heart failure. As described in Maurer and Bretz, 31 the trial compared three therapies: aliskiren monotherapy (A), enalapril monotherapy (E), and aliskiren/enalapril combination therapy (C

FIGURE 4 The graph on the left-hand side was used for the ATMOSPHERE study, as presented in Maurer and Bretz. 31 The graph on the right-hand side is the updated graph at the start of step (ii) in the augmented graphical approach for either k-FWER or FDP control, after H 2 ,H 51 , and H 52 have been rejected. FDP, false discovery proportion; FWER, familywise error rate

The graph on the left-hand side in Figure 4 shows the graphical test procedure used in Maurer and Bretz 31 to analyse the trial. Note that if all individual null hypotheses in  4 or  5 are rejected, the local significance level is propagated to the remaining hypotheses. For simplicity, we apply a Holm procedure within each of the two secondary families  4 and  5 . Following Reference 31, suppose we observe the (hypothetical) unadjusted P-values P 1 = .1, P 2 = .007, P 3 = .05, P 41 = .0015, P 42 = .04, P 51 = .0031, and P 52 = .001.
Consider first controlling the k-FWER with k = 2 and = .025. For the augmented graphical approach (given in Algorithm 1), in step (i) the Bonferroni-based graphical procedure for FWER control would reject H 2 , H 51 , and H 52 . The updated graph used at the start of step (ii) is shown in the right-hand side of Figure 4, where has been replaced by .
Supposing that = 0.5, step (ii) of the algorithm rejects H 3 . Since we have made one additional (augmented) rejection, at this point we stop. As for the generalized graphical approach for k-FWER control (given in Algorithm 3), in step (ii) we would only reject H 2 . Since the number of rejections |R| = k − 1, we stop at this point.
Now consider controlling the FDP with a threshold of 0.3. For the augmented graphical approach (given in Algorithm 4), in step (i) we reject H 2 , H 51 , and H 52 as before. In step (ii), we can reject one additional hypothesis, and hence we reject H 3 and then stop. Finally, for the generalized graphical approach (given in Algorithm 6), we first apply the usual Bonferroni-based graphical procedure for FWER control, which rejects H 2 , H 51 , and H 52 . Since |R 1 | > 1/0.3 − 1, we then apply the 2-FWER procedure which (as above) only rejects H 2 . Since |R 2 | < 2/0.3 − 1, we stop and only reject H 2 .
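The number of additional rejections permitted in step (ii) of the augmented approaches can be checked with a few lines of Python. The sketch below assumes that Algorithms 1 and 4 use the usual augmentation counts of van der Laan: 22 k − 1 extra rejections for k-FWER control, and the largest j with j/(j + r) no greater than the FDP threshold when r hypotheses were rejected in step (i). The function names are hypothetical, and the counts are upper limits, since the graph must still allow the extra hypotheses to be rejected.

```python
from fractions import Fraction


def extra_rejections_kfwer(k):
    """Maximum number of augmented rejections for k-FWER control."""
    return k - 1


def extra_rejections_fdp(n_fwer_rejections, threshold):
    """Largest j with j / (j + r) <= threshold, r = step (i) rejections."""
    threshold = Fraction(str(threshold))  # exact arithmetic, avoids float ties
    if not 0 <= threshold < 1:
        raise ValueError("threshold must lie in [0, 1)")
    r = n_fwer_rejections
    j = 0
    while Fraction(j + 1, j + 1 + r) <= threshold:
        j += 1
    return j


# ATMOSPHERE example: three FWER rejections, FDP threshold 0.3.
atmosphere_extra = extra_rejections_fdp(3, 0.3)
```

With three step (i) rejections and a threshold of 0.3 this gives one extra rejection, in line with the ATMOSPHERE walk-through. With eight step (i) rejections it gives two and three extra rejections for thresholds of 0.2 and 0.3, matching the totals of 10 and 11 reported for the pharmacodynamic study.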

DISCUSSION
In this article, we have shown how to generalize the graphical approach to hypothesis testing 3 so that the k-FWER or the FDP can be controlled. By applying the methodology of Romano and Wolf 18 and van der Laan, 22 we have proposed generalized and augmented graphical approaches for both k-FWER and FDP control (as well as an augmented procedure for asymptotic FDR control). Crucially, these approaches respect the hierarchical structure of the underlying multiple testing procedure given by the graphical weighting strategy. We have also applied the proposed graphical approaches to three real-life case studies covering a broad range of clinical trial applications.
Our recommendation is that the augmented graphical approaches should be used instead of the generalized graphical approaches. First, the generalized graphical approach for k-FWER control has the undesirable property that if a hypothesis H j has fewer than k donors, its initial significance level will not increase. Hence, the generalized graphical approach cannot effectively propagate the significance levels through the graph. The case studies in Section 6 show how this can have a detrimental effect on the power of the generalized graphical approach: the power to reject hypotheses with fewer than k donors can actually decrease as k increases. Since the generalized graphical approach for FDP control is based on the generalized graphical approach for k-FWER control, a similar problem occurs.
By contrast, the augmented graphical approach is able to propagate significance levels to all hypotheses that have fewer than k donors. As a consequence, the power of the augmented graphical approach for k-FWER control and FDP control increases as k and the FDP threshold increase (respectively). Importantly, in all of the case studies in Section 6, the augmented graphical approach had a higher power (or rejected at least as many hypotheses) compared with the generalized graphical approach. These results are backed up by existing power comparisons for the generalized and augmented Holm procedure 16,27 when testing a relatively small number of hypotheses.
The research for this article was motivated by clinical trial applications ranging from early to late drug development, as illustrated by the case studies in Section 6. Outside of the context of clinical trials and the graphical weighting strategy of Bretz et al, 3 another area of application is testing hypotheses in a directed acyclic graph (DAG) for use in gene set analysis, as proposed by Meijer and Goeman. 42 The authors presented a top-down method that strongly controls the FWER. By considering the genes and gene sets as nodes in a DAG, the method allows gene sets and individual genes to be tested for significance simultaneously. The testing procedure starts with an initial weight for each of the leaf nodes (ie, nodes without any descendants), and an iterative weighting procedure is used to update the weights for all the other nodes in the graph. These weights also satisfy the monotonicity condition given in Equation (2), and so suitably modified versions of the augmented and generalized graphical approaches could be used in this setting.
As future work, it would be desirable to derive adjusted P-values for all of the proposed procedures, especially for the augmented graphical approaches. This would involve extending the results of van der Laan, 22 who showed how to calculate adjusted P-values for their augmented approach. Finally, the initial motivation for this article came from considering the generalized closure principle, 43 which was applied to derive stepup procedures for k-FWER control. The usual graphical approach for FWER corresponds to defining a shortcut closed testing procedure. 3 It would be interesting to formalize a similar link between the generalized graphical approach for k-FWER control and the generalized closure principle.

DATA AVAILABILITY STATEMENT
All of the data that support the findings of this study are available within the article itself. Code to reproduce the results of Section 6 can be found at https://github.com/dsrobertson/graphical-approach.

The weights w j (J), j ∈ J, generated by this procedure are unique, 3 and in particular do not depend on which order the hypotheses H j , j ∈ J c are removed in Algorithm 7.
(i) Set I = M.
(ii) Select a j ∈ I such that p j ≤ w j (I)α and reject H j ; otherwise stop.
(iv) Update the graph:
The streamlined version only gives asymptotic control of the k-FWER, but involves no minimization over any subsets. In order to get closer to the original, exact algorithm while still retaining computational feasibility, as a compromise we can use the operative method proposed in Romano and Wolf [18, Remark 3.3]. Consider that to compute the critical value in step (iv) of Algorithm 3, one has to evaluate $\binom{|R|}{k-1}$ weights in order to choose the minimum. The operative method maximizes over subsets not necessarily of the entire index set R of previously rejected hypotheses, but only of the B least significant hypotheses rejected so far. More precisely, we have the following algorithm:
Algorithm 11 (Operative graphical approach for k-FWER control). Pick a user-specified number N max and let B be the largest integer for which $\binom{B}{k-1} \le N_{\max}$.
When B ≥ |R| we maximize over all subsets of R of size k − 1 like in the original algorithm, while the streamlined algorithm is a special case of the operative method where N max = 1 and hence B = k − 1.
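The Bonferroni-based graphical procedure whose steps are outlined above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the weight and transition-weight updates are the standard rules of the graphical approach, 3 which are assumed here to match the update step of the algorithm, and the function and variable names are hypothetical.

```python
def graphical_bonferroni(p, w, g, alpha):
    """Return (indices rejected, in order of rejection; final weights)."""
    m = len(p)
    w = list(w)                      # local weights w_j(I)
    g = [list(row) for row in g]     # transition weights g_ij
    active = set(range(m))           # index set I of unrejected hypotheses
    rejected = []
    while active:
        # Step (ii): look for a hypothesis meeting its local level.
        candidates = [j for j in sorted(active) if p[j] <= w[j] * alpha]
        if not candidates:
            break
        j = candidates[0]
        active.remove(j)
        rejected.append(j)
        # Update the graph: pass the weight of H_j along its outgoing edges.
        for l in active:
            w[l] += w[j] * g[j][l]
        w[j] = 0.0
        # Reconnect the remaining vertices so that paths bypass H_j.
        new_g = [[0.0] * m for _ in range(m)]
        for l in active:
            for q in active:
                denom = 1.0 - g[l][j] * g[j][l]
                if l != q and denom > 0.0:
                    new_g[l][q] = (g[l][q] + g[l][j] * g[j][q]) / denom
        g = new_g
    return rejected, w


# Holm procedure for m = 3 written as a graph (equal weights, full connection).
m = 3
holm_weights = [1.0 / m] * m
holm_graph = [[0.0 if i == j else 1.0 / (m - 1) for j in range(m)]
              for i in range(m)]
rejected, _ = graphical_bonferroni([0.01, 0.02, 0.04], holm_weights,
                                   holm_graph, alpha=0.05)
```

The order in which eligible hypotheses are rejected is arbitrary in this sketch (lowest index first), which is harmless for the final rejection set because the weights satisfy the monotonicity condition.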

B.2 Examples of the generalized graphical approach
Example 5 (Generalized weighted Bonferroni). Suppose each vertex on the graph is unconnected, that is, g ij = 0 for all i,j ∈ M. Algorithm 7 implies that w j (J) = w j (M), j ∈ J for all J ⊆ M. Hence the inequality in step (iv) of Algorithm 3 is simply p i ≤ w i (M)kα and so there is no further testing after step (ii), unless > 0 and |R| < k − 1. Thus when = 0, Algorithm 3 is exactly the same as the generalized weighted Bonferroni procedure in Section 2 with w i = w i (M).
Example 6 (Generalized Holm). To represent the Holm procedure with m hypotheses, we set the initial weights w i (M) = 1/m and g ij = 1/(m − 1) for all i, j ∈ M, i ≠ j. Hence using Algorithm 7, we have w i (I) = 1/|I| for all i ∈ I and I ⊆ M, and the inequality in step (iv) of Algorithm 3 is simply $p_i \le k\alpha / (|I| + k - 1)$. For = 0, this gives identical rejections to the generalized Holm procedure as given in Lehmann and Romano. 12
Example 7 (Hierarchical testing: fixed sequence test and fallback procedure). In a fixed sequence test, the hypotheses are tested in a prespecified order. This allows each hypothesis to be tested at the full level α while controlling the FWER, with the proviso that if any hypothesis is not rejected then no further testing is allowed. Suppose the prespecified ordering for testing m hypotheses is H 1 → H 2 →… →H m . Hence, we have g ij = 1 for i = 1,… ,m − 1 if j = i + 1 and g ij = 0 otherwise.
If we follow the usual fixed sequence test and set w 1 (M) = 1 and w i (M) = 0, i = 2,… ,m, then for k > 1 only H 1 can be rejected in step (ii) of Algorithm 3 and hence the algorithm will never proceed to step (iv) since |R| < k. Hence, a more natural generalization of the fixed sequence test is to set the initial weights as w i (M) = 1/k for i = 1,… ,k and w i (M) = 0 otherwise. This means that the first k hypotheses will be tested at full level α. However, assume that the first k hypotheses are all rejected (otherwise we proceed to the subprocedure of step (iii) and can only reject up to the first k − 1 hypotheses). Since w i ({1,… ,k − 1,k + 1,… ,i,… ,m}) = 0 for i = k + 1,… ,m, step (iv) of the algorithm implies that no further hypotheses can then be rejected. For k > 1, this generalization of the fixed sequence test has the undesirable property that only the first k hypotheses can ever be tested, even when using the subprocedure of step (iii) with > 0.
A similar issue occurs when generalizing the fallback procedure, 44 which is a modification of the fixed sequence procedure where the initial weights w i (M) > 0 for all i ∈ M. Applying Algorithm 3, suppose (without loss of generality, by relabeling the hypotheses) that the hypotheses H 1 ,… ,H k are all rejected at step (ii). However, since w i (I ∪ {1, … , k − 1}) = w i (M) for all i ∈ I and I ⊆ {k + 1,… ,m}, step (iv) implies that the hypotheses H i , i = k + 1,… ,m, are also tested at significance level w i (M)kα. So for k > 1, this generalization of the fallback procedure has the undesirable property that rejecting hypotheses does not lead to an increase in the significance levels of the remaining hypotheses, except via the subprocedure of step (iii) when > 0 (but even then, the propagation is limited to at most k − 1 hypotheses).
Example 8 (Hypotheses with fewer than k donors). We can generalize the previous example to any graph in which some hypothesis has fewer than k donors, where the donors of a hypothesis H j are the hypotheses that donate (or propagate) their significance levels to H j if they are rejected. More formally, we denote the donors of hypothesis H j by do(H j ) = {H i : g ij > 0}. Note that two hypotheses H i and H j can be donors to each other.
If a hypothesis H j in a graph has fewer than k donors, then applying the generalized graphical approach has the undesirable property that the initial significance level for H j can never increase, even if all its donors are rejected (except for up to k − 2 hypotheses via the subprocedure in step (iii)). To see this, suppose = 0 and all donors of H j have been rejected (and |R| ≥ k, or else there is no propagation). In step (iv) of Algorithm 3, since |do(H j )| ≤ k − 1 then $\min_{J \subseteq R, |J| = k-1} w_j(I \cup J) = w_j(M)$. Hence H j is tested using the initial weights in step (iv). In particular, this means that if a hypothesis with fewer than k donors has an initial weight of zero, then it can never be rejected (except possibly via the subprocedure in step (iii)). This can be an undesirable property to have in a testing procedure which has a hierarchical structure, as we will see further in the case studies in Section 6.
As an example, consider the graph for the diabetes trial shown in Figure 1, and suppose we wish to control the k-FWER for k = 2. Since the secondary hypotheses H 3 and H 4 only have one donor each (hypotheses H 1 and H 2 , respectively) and start with a weight of zero, they will never be rejected even if both of the primary hypotheses H 1 and H 2 are rejected.
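The donor sets do(H j ) = {H i : g ij > 0} are easy to compute from the transition matrix, which gives a quick check of which hypotheses cannot benefit from propagation under the generalized graphical approach. The Python sketch below uses a hypothetical four-hypothesis graph with two primary and two secondary hypotheses; it is chosen to mimic the donor structure described for the diabetes trial and is not a transcription of Figure 1. The function names are also hypothetical.

```python
def donors(g, j):
    """Indices i of the donors of H_j, that is, those with g[i][j] > 0."""
    return {i for i in range(len(g)) if i != j and g[i][j] > 0}


def never_rejected(w, g, k):
    """Hypotheses with zero initial weight and fewer than k donors."""
    return [j for j in range(len(w))
            if w[j] == 0 and len(donors(g, j)) < k]


# Hypothetical graph (an assumption): H1 -> H3 -> H2 and H2 -> H4 -> H1,
# with the full weight split between the two primary hypotheses.
initial_weights = [0.5, 0.5, 0.0, 0.0]
transition = [
    [0.0, 0.0, 1.0, 0.0],   # H1 passes its level to H3
    [0.0, 0.0, 0.0, 1.0],   # H2 passes its level to H4
    [0.0, 1.0, 0.0, 0.0],   # H3 passes its level to H2
    [1.0, 0.0, 0.0, 0.0],   # H4 passes its level to H1
]

blocked_for_k2 = never_rejected(initial_weights, transition, k=2)
```

Running such a check before fixing k can flag hypotheses that would need a nonzero initial weight, as was done for T 4 D 3 and T 5 D 2 in the pharmacodynamic case study.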

APPENDIX C. PARAMETER VALUES FOR THE PRE-RELAX-AHF TRIAL
In our simulation study, for the primary hypotheses H 1 ,… ,H 9 we take the empirical means and standard errors of the endpoints as the true parameter values for the experimental (E) and control (C) treatments. The numerical values of the means μ C , μ E and standard deviations σ C , σ E are given in Delorme et al. 39