Bayesian Optimal Designs for Multi-Arm Multi-Stage Phase II Randomized Clinical Trials with Multiple Endpoints

There is a growing need to evaluate multiple competing drugs in phase II trials, where the number of patients is often limited and simultaneous assessment of both efficacy and toxicity is crucial. To avoid wasting research resources, it is more efficient to screen multiple drugs at once in a platform phase II setting. We aim to adapt the Bayesian optimal phase II (BOP2) design to multi-arm trials in both uncontrolled and controlled settings. The binary efficacy and toxicity endpoints are modeled as a vector of four joint outcomes following a Dirichlet distribution. Posterior marginal distributions at each analysis are used to derive a monitoring threshold that varies over the course of the trial. We control the family-wise Type I error rate for multiple comparisons against a common reference value or a shared control. We conduct simulation studies under both uncontrolled and controlled settings to evaluate the operating characteristics of the proposed design. Our simulations demonstrate that the design exhibits better operating characteristics than a design using a constant threshold and is less sensitive to deviations of the accrual rate from what was planned. The design had promising operating characteristics and could be used in phase II oncology clinical trials to evaluate multiple drugs at a time.


Introduction
In oncology, phase II trials have long employed single-arm designs to assess the efficacy of new drugs. However, over the past few decades, the development of drugs in oncology, particularly immunotherapy agents, has been on the rise, creating a growing need for enhanced evaluation of their toxicity and efficacy. Indeed, the evaluation of multiple new treatments separately poses challenges in estimating the relative effect of each treatment, due to "treatment-trial" confounding (Estey and Thall 2003): trials for different drugs are often designed using distinct eligibility criteria, standards of care, and outcome measures, which can confound the estimation of the relative treatment effect. Moreover, this one-at-a-time evaluation of treatments in early phases appears to lack specificity in detecting effective drugs, as evidenced by the reported low success rates in phase II and III trials and subsequent market approvals of cancer therapeutics (Sutter and Lamotta 2011).
With the aim of avoiding confounding and shortening the duration of drug development, platform trials that evaluate multiple drugs (either simultaneously or at different times) have been proposed (Yu, Hubbard-Lucey, and Tang 2019; Franklin et al. 2022). Platform trials extend the concept of randomized phase II trials that include a control group, a concept first proposed in the early 90s (Simon, Thall, and Ellenberg 1994). They offer the well-documented advantages of randomized over single-arm trials (Sharma, Stadler, and Ratain 2011; Wason, Stecher, and Mander 2014; Hobbs, Chen, and Lee 2018). One notable benefit is the ability to screen several candidate treatments allocated through randomization and identify the most promising ones. More specifically, multi-arm multi-stage (MAMS) designs (Hobbs, Chen, and Lee 2018) establish early decision rules based on sequential analyses of the effect of multiple treatments compared to the control arm. These designs aim to control the false positive rate for the entire trial, not just for each arm separately (Jaki, Pallmann, and Magirr 2019). Additionally, MAMS designs address both ethical and economic concerns by allowing for the early termination of treatments with evidence of futility and/or excessive toxicity, as well as the early graduation of promising treatments.
The literature proposing Bayesian approaches for adaptive platform trials is expanding (Berry 2006; Berger, Wang, and Shen 2014; Ryan et al. 2019). Among these approaches, the Bayesian Optimal design for Phase II clinical trials (BOP2) allows for the sequential evaluation of multiple endpoints, although it is restricted to the single-arm setting (Zhou, Lee, and Yuan 2017). In brief, the BOP2 design aims to assess a new treatment in comparison to reference values of m binary outcome measures. The resulting 2^m distinct and mutually exclusive outcomes define a multinomial random variable. Inference on the model parameters is conducted within a Bayesian framework, using a Dirichlet prior, chosen for its conjugacy with the multinomial distribution. Due to the aggregation properties of the Dirichlet distribution, the marginal Beta distribution of each of the m binary outcomes can be computed easily, allowing the derivation of stopping rules for futility or excessive toxicity (Zhou, Lee, and Yuan 2017).
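The aggregation step can be sketched in a few lines (all prior parameters, counts, and reference rates below are illustrative, not taken from any trial): summing the posterior Dirichlet parameters over the cells where an endpoint equals 1 yields its marginal Beta distribution, from which posterior probabilities against fixed reference values follow directly.

```python
from scipy.stats import beta

# Hypothetical prior over the four joint (efficacy, toxicity) outcomes:
# (E=1,T=1), (E=1,T=0), (E=0,T=1), (E=0,T=0)
prior = [0.15, 0.30, 0.15, 0.40]   # Dirichlet parameters (illustrative)
counts = [4, 10, 3, 13]            # observed counts among n = 30 patients

# Conjugacy: posterior is Dirichlet(prior + counts)
post = [a + x for a, x in zip(prior, counts)]

# Aggregation property: the marginal of any grouping of cells is Beta
a_eff = post[0] + post[1]          # total mass where E = 1
b_eff = post[2] + post[3]          # total mass where E = 0
a_tox = post[0] + post[2]          # total mass where T = 1
b_tox = post[1] + post[3]          # total mass where T = 0

# Posterior probabilities against fixed reference rates (uncontrolled setting)
pr_eff = beta.sf(0.40, a_eff, b_eff)   # Pr(p_E > 0.40 | data)
pr_tox = beta.cdf(0.45, a_tox, b_tox)  # Pr(p_T < 0.45 | data)
print(round(pr_eff, 3), round(pr_tox, 3))
```

The two marginal Beta distributions share the same total mass (prior weight plus sample size), which is a quick sanity check on the aggregation.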
We aimed to extend the BOP2 design to accommodate multiple treatment arms (or multiple doses of the same treatment), all evaluated concurrently for both efficacy and toxicity within the same trial. To illustrate this extension, we retrospectively applied the proposed design to the AZA-PLUS trial (NCT01342692), conducted in patients with high-risk myelodysplastic syndrome (MDS) (Ades et al. 2018).
The article is organized as follows: first, we present the models and design algorithms. Next, we conduct a simulation study to assess the performance of the designs. We then illustrate the proposed designs retrospectively using the AZA-PLUS trial (NCT01342692). Finally, we provide a discussion.

Motivating Example: AZA-PLUS Trial
In the treatment of adult patients with high-risk (defined by an intermediate-2 or high IPSS score) myelodysplastic syndrome, the AZA-PLUS trial (NCT01342692) aimed to evaluate whether the efficacy and toxicity of the standard-of-care azacitidine could be improved by adding a new drug (Ades et al. 2018). The trial was initially planned to assess the combination of azacitidine with lenalidomide or valproic acid, and later with idarubicin, using Jung's two-stage design (Jung 2008). The trial was designed with a Type I error rate of 0.15 and a Type II error rate of 0.20. It was scheduled to recruit 80 patients in each arm, with an interim analysis conducted after 40 patients per arm.
Our focus was on the three-arm randomized trial, which compared the control (AZA, n = 81) against both AZA+LEN (n = 80) and AZA+VPA (n = 80). We considered two binary outcome measures: the efficacy endpoint was the overall response rate (ORR), defined by the achievement of complete, partial, or medullary remission, and hematological improvement, after 6 treatment cycles (a 6-month period). The toxicity endpoint was defined as treatment discontinuation due to any reason other than progression or relapse. Based on the terminal analysis of the 241 enrolled patients, the pooled ORR was estimated at 41.1%, with arm-specific estimates of 42.0%, 41.2%, and 40.0% in the AZA, AZA+VPA, and AZA+LEN arms, respectively. The toxicity rates ranged from 59.3% in the AZA arm to 65.0% in the AZA+VPA arm and 67.5% in the AZA+LEN arm. To assess whether the generalized BOP2 design for multi-arm multi-stage trials could have allowed us to stop the trial earlier, we retrospectively applied the proposed design to the trial data.

Methods
We extended the BOP2 design to a multi-arm multi-stage trial, where patients are randomized to K experimental arms in an uncontrolled setting, or to K experimental arms plus a control arm in a controlled setting. For simplicity, we assumed balanced randomization across the investigational arms. We considered only two binary outcomes, (Y_T, Y_E), where Y_T = 1 indicates toxicity (0 otherwise) and Y_E = 1 indicates efficacy (0 otherwise). At each interim analysis, conducted once n_k patients have been treated and assessed for both efficacy and toxicity endpoints, the stopping decisions for futility and/or toxicity are made using the following rules:

stop arm k for futility if Pr(p_{k,E} ≤ p_{0,E} | D_n) > C_n,
stop arm k for toxicity if Pr(p_{k,T} ≥ p_{0,T} | D_n) > C_n,

where D_n = {x_{k,E}, x_{0,E}, x_{k,T}, x_{0,T}, n_k, n_0} is the observed data at the interim analysis for arm k, k = 0, …, K, and C_n is the common decision threshold as defined in Section 3.3. Computation of the above two probabilities is carried out through integration, as follows (Jacob et al. 2016; Hobbs, Chen, and Lee 2018):

Pr(p_k > p_0 | D_n) = ∫_0^1 F(x; a_0, b_0) f(x; a_k, b_k) dx,

where F(·) is the Beta cumulative distribution function and f(·) is the Beta density function, each evaluated at the corresponding posterior Beta parameters, and Pr(p_k ≤ p_0 | D_n) = 1 − Pr(p_k > p_0 | D_n).
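The integral above can be evaluated directly by numerical quadrature. The following sketch does so (`pr_superiority` is a hypothetical helper name, and the Beta parameters are illustrative posteriors, not trial values):

```python
from scipy.stats import beta
from scipy.integrate import quad

def pr_superiority(a_k, b_k, a_0, b_0):
    """Pr(p_k > p_0) for independent Beta posteriors, computed as the
    integral of F(x; a_0, b_0) * f(x; a_k, b_k) over [0, 1]."""
    val, _ = quad(lambda x: beta.cdf(x, a_0, b_0) * beta.pdf(x, a_k, b_k), 0, 1)
    return val

# Illustrative posteriors: experimental arm vs. control on efficacy
p = pr_superiority(a_k=18.3, b_k=12.7, a_0=12.3, b_0=18.7)
print(round(p, 3))
```

A useful sanity check is that identical posteriors give a superiority probability of exactly one half.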

Choice of the Decision Threshold
When planning the multi-arm trial, the hypotheses regarding the multinomial distribution of Y_k for the efficacy and toxicity of the new treatment, as well as the maximum acceptable Family-Wise Error Rate (FWER), are first elicited with the clinicians. Following the standard BOP2 design of Zhou, Lee, and Yuan (2017), the null hypothesis H_0 corresponds to an inadmissible treatment (ineffective and overly toxic), while the alternative hypothesis H_1 corresponds to a promising treatment (effective and not excessively toxic). Additional discussion of the choice of the null hypothesis for testing co-primary toxicity and efficacy endpoints is provided at the end of this section.
Unlike in a single-arm trial, we must address the presence of K distinct experimental arms when optimizing the decision threshold. In this context, the FWER is defined as the probability of claiming an inadmissible arm promising, that is, concluding efficacy and acceptable toxicity for at least one experimental arm while all arms are inefficacious and toxic (i.e., the global null hypothesis, where each arm is at H_0). The (least) power is defined as the probability of concluding efficacy and acceptable toxicity for the efficacious and safe arm when that arm is the only efficacious and safe one and all others are inefficacious and toxic. This scenario is also referred to as the Least Favourable Configuration (LFC), where only one arm is at H_1 and all others are at H_0 (Thall, Simon, and Ellenberg 1989; Jaki, Pallmann, and Magirr 2019).
To define the decision boundaries, we first considered the same threshold function as the original BOP2 design (Zhou, Lee, and Yuan 2017). This function is represented as C_n = 1 − λ(n/N)^γ, where λ (0 < λ < 1) and γ are positive design parameters, n is the current sample size, and N is the maximum sample size per arm. In cases where there is only one experimental arm, the values of λ and γ can be optimized by grid search to control the false positive rate and maximize power, in both uncontrolled and controlled settings (Zhou, Lee, and Yuan 2017; Zhao et al. 2023). We further constrained γ ≤ 1, ensuring a convex shape for the decision boundary. The shape of the threshold function reflects the principle that early stopping based on sparse data should be avoided: the threshold decreases as data accumulate, so that stopping an arm requires overwhelming evidence early in the trial and progressively less extreme evidence as information accrues (Jennison and Turnbull 1999).
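As a quick sketch, the threshold values implied by this function at the looks used later in the simulation study can be tabulated (λ = 0.63 and γ = 1 are borrowed from the case study purely for illustration):

```python
def bop2_threshold(n, N, lam, gamma):
    """BOP2-type decision threshold C_n = 1 - lam * (n / N) ** gamma."""
    return 1 - lam * (n / N) ** gamma

# Thresholds at interim looks of 15, 30, 45 and the final look of 60 per arm
N = 60
for n in (15, 30, 45, 60):
    print(n, round(bop2_threshold(n, N, lam=0.63, gamma=1.0), 3))
```

The threshold is strictly decreasing in n, consistent with the convex boundary described above.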
We then defined a threshold specifically designed for the multi-arm trial, denoted hereafter as C_n^m. We employed the same threshold function, but the design parameters (λ, γ) used to define C_n^m were optimized through simulation, via a grid search that accounts for multiple experimental arms. Among all pairs of (λ, γ) satisfying the prespecified FWER constraint (Magirr, Jaki, and Whitehead 2012; Bratton et al. 2016; Jaki, Pallmann, and Magirr 2019) under the global null hypothesis, we selected the one that maximizes power under the LFC. The grid search can be performed based on simulations; see Section 2.3 of Zhou, Lee, and Yuan (2017) for details. It is worth noting that, as the cutoff function was optimized under the global null and depends only on the sample size, this approach also maintained the arm-specific Type I error rate.
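The calibration loop can be sketched as follows. This is a deliberately simplified stand-in, not the authors' implementation: it monitors a single binary efficacy endpoint against a fixed reference rate in an uncontrolled three-arm setting, and the response rates, Jeffreys-type prior, grid values, and helper names (`arm_promising`, `operating_chars`) are all hypothetical:

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(2024)

K, N, looks = 3, 60, (15, 30, 45, 60)
p_null, p_alt, phi = 0.20, 0.40, 0.20   # hypothetical rates and reference value
n_sim, fwer_cap = 500, 0.10             # small n_sim, for illustration only

def arm_promising(responses, lam, gam):
    """Run one arm through all looks; True if it is never stopped for futility.
    Rule: stop if Pr(p <= phi | data) > C_n, with C_n = 1 - lam * (n/N)**gam."""
    for n in looks:
        x = responses[:n].sum()
        c_n = 1 - lam * (n / N) ** gam
        if beta.cdf(phi, 0.5 + x, 0.5 + n - x) > c_n:   # Jeffreys-type prior
            return False
    return True

def operating_chars(p_arms, lam, gam):
    """(Pr(at least one arm declared promising), Pr(first arm declared promising))."""
    any_go = first_go = 0
    for _ in range(n_sim):
        go = [arm_promising(rng.random(N) < p, lam, gam) for p in p_arms]
        any_go += any(go)
        first_go += go[0]
    return any_go / n_sim, first_go / n_sim

best = None
for lam in (0.90, 0.95, 0.99):
    for gam in (0.5, 1.0):
        fwer, _ = operating_chars([p_null] * K, lam, gam)        # global null
        if fwer > fwer_cap:
            continue                                             # FWER constraint
        _, power = operating_chars([p_alt] + [p_null] * (K - 1), lam, gam)  # LFC
        if best is None or power > best[0]:
            best = (power, lam, gam)

print(best)
```

The two-step structure (screen the grid under the global null, then rank the admissible pairs by power under the LFC) is the point of the sketch; a real calibration would use the joint efficacy/toxicity model, a much finer grid, and far more replicates.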
Alternatively, another decision threshold, C_n^{m,a}, has been proposed; it depends on the number a of ongoing active experimental arms (those still open to patient accrual) at the time of the interim analysis, through the quantity η = K + 1 − a. This correction relative to a leads to a less stringent threshold, particularly when multiple arms are truly promising, as the threshold function is an increasing function of a. To optimize the parameters λ and γ, mimicking the procedure used for C_n^m, we aimed to control the FWER under the global null and maximize power under the LFC. However, due to the construction of C_n^{m,a}, merely controlling the FWER is insufficient to maintain a desired arm-specific Type I error rate, especially when only one experimental arm is unpromising while the rest are promising (because many arms, including the unpromising one, can pass the monitoring criteria). Two approaches can address this issue. The first is to simultaneously control the FWER under the global null and the arm-specific Type I error rate under the scenario where only one arm is unpromising; however, because of the additional arm-specific constraint, this approach requires a larger parameter search within a well-defined search region, making it time-consuming. The second approach is more straightforward: by taking the final threshold as the minimum between C_n^{m,a} and the single-arm threshold C_n^s of Zhou, Lee, and Yuan (2017), it is trivial to show that the design based on (C_n^{m,a}, C_n^s) not only maintains the FWER but also controls the arm-specific Type I error rate. We chose the latter approach, as it requires fewer computational resources.
Lastly, due to the formulation of the null hypothesis H_0 (i.e., treatment being ineffective and overly toxic), the optimal multi-arm design may not maintain tight control over the false positive rate when only one of the co-primary endpoints is met, that is, when the treatment is safe but ineffective, or effective yet overly toxic. This issue is evident in scenarios 6 and 7 of Tables 1 and 2, where the arm-specific Type I error rate or the FWER may exceed the nominal level. To address this concern, a more rigorous calibration process involving two null hypotheses, H_01 and H_02, could be implemented: H_01 would consider the treatment as safe but ineffective, whereas H_02 would view it as effective but overly toxic. The initial step of the calibration process outlined above could then be extended to identify all feasible (λ, γ) pairs meeting the predefined FWER constraint under both H_01 and H_02 (as well as the arm-specific Type I error constraint for the C_n^{m,a} approach). This method, while potentially less powerful, offers improved control over false positives.

Simulation Settings
We conducted various simulation studies to assess the operating characteristics of the proposed generalized BOP2 design in both uncontrolled and controlled settings. A total of K = 3 experimental treatment arms were considered, with a maximum sample size of n = 60 patients in each group. We assessed the uncontrolled (3 arms) and controlled (thus, 4 arms) settings separately, resulting in a maximum total number of included patients of N = 180 and N = 240, respectively. Three interim analyses were planned, performed after every 15 additional patients were enrolled in each arm, plus the terminal analysis. Of note, we also examined various sample sizes. As expected, with a smaller planned sample size (e.g., 30 patients per arm, with interim analyses at 10 and 20 patients), power decreased under fixed scenarios; desirable performance could be achieved with a larger planned effect size, in both uncontrolled and controlled settings (data not shown). All arms were assumed to be of equal size in the main simulation study. A sensitivity analysis based on different accrual rates among the arms was also conducted, as shown below.
A total of 13 different scenarios were constructed similarly for both uncontrolled and controlled settings (see Supplementary Table S1). These scenarios were derived from two real clinical trials: one evaluating the efficacy and safety of lenalidomide combined with rituximab in the treatment of recurrent non-follicular lymphoma (Sacchi et al. 2016), in an uncontrolled setting, and the second comparing TAS-102, a nucleoside analogue, with topotecan/amrubicin for the treatment of refractory small cell lung cancer, in a controlled setting (Scagliotti et al. 2016). These motivating examples allowed us to derive realistic hypotheses for efficacy and toxicity targets. In both settings, scenario 1 corresponded to the global null hypothesis H_0, while scenario 2 represented the LFC. Scenarios 6, 7, 10, and 11 explored undesirable treatments, including combinations of no efficacy with toxicity, efficacy with toxicity, and no efficacy with no toxicity. Scenarios 5 and 8 involved treatments with more efficacy or less toxicity than expected. Finally, scenarios 12 and 13 illustrated treatments with intermediate efficacy and toxicity.
We compared the performance of the proposed decision thresholds (C_n^m, C_n^{m,a}) with the approach of Thall, Simon, and Estey (1995), similar to that of Hobbs, Chen, and Lee (2018), which uses constant boundaries applied to a multi-arm trial, denoted hereafter ϵ^m, and with the original single-arm BOP2 threshold C_n^s.
The prior distributions of the designs were set to reflect the null hypotheses (H_0) described in Section 3, resulting in a so-called "skeptical" prior approach (Spiegelhalter, Abrams, and Myles 2004). Consistent with the two real trials that were used to define realistic scenarios, the prior was set to Dir(0.15, 0.30, 0.15, 0.40) for the uncontrolled setting and to Dir(0.30, 0.30, 0.10, 0.30) for the controlled design.
For each scenario, in both the uncontrolled and controlled settings, we conducted 10,000 independent repetitions of each trial, each with K = 3 experimental arms. We computed various performance metrics, including the percentage of selection for both efficacy and non-toxicity, globally and for each arm separately, the percentage of correct selection, the empirical FWER of the designs (under the null scenario 1), and the power (under the LFC of scenario 2). Additionally, we calculated the percentage of early stopping, defined as any stopping decision regarding arm k before the terminal analysis, and recorded the reason for stopping (toxicity or futility). The number of repetitions was chosen to ensure a Monte Carlo standard error of 0.003 for a 0.1 Type I error rate and 0.004 for a power of 0.8, following previous work (Koehler, Brown, and Haneuse 2009; Morris, White, and Crowther 2019).
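The stated Monte Carlo standard errors follow from the usual binomial formula sqrt(p(1 − p)/n_rep) for an estimated proportion, which can be checked directly:

```python
from math import sqrt

def mcse(p, n_rep):
    """Monte Carlo standard error of an estimated proportion p over n_rep replicates."""
    return sqrt(p * (1 - p) / n_rep)

print(round(mcse(0.10, 10_000), 4))  # 0.003 for a 10% Type I error rate
print(round(mcse(0.80, 10_000), 4))  # 0.004 for 80% power
```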
Results for the uncontrolled setting are reported in Table 1; and results for the controlled setting are provided in Table 2.
We conducted additional analyses to assess the robustness of our results with respect to the total sample size, the accrual rate, and imbalanced sample sizes across arms at interim analyses. First, to simulate lower accrual than expected, we reduced the maximum sample size in each arm, ranging from 20 to 60, while using the threshold optimized for 60 patients per arm; interim analyses were performed every N/4 patients in each arm. Importantly, the thresholds were optimized based on the prespecified sample size of 60, rather than the actual sample size, to mimic a real-world situation where the accrual rate is slower than anticipated. Second, to simulate fluctuations in the accrual rate, we performed interim analyses after every 60 × (j/4)^ψ enrolled patients, with j representing the jth interim analysis and ψ varying from 0.25 to 1.75. A value of ψ = 1 represented the planned accrual, with 15 patients recruited in each arm between analyses; lower values of ψ corresponded to a fast accrual rate at the beginning, while higher values indicated a slower accrual rate at the beginning. Lastly, to assess the impact of imbalances in the number of patients recruited at each interim analysis, we conducted interim analyses after the recruitment of 15 + {−u, …, 0, …, u} patients, with u ranging from 0 to 5; here, 2u represents the maximum imbalance across arms.
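The accrual-fluctuation schedule can be reproduced with a short sketch (`interim_sizes` is a hypothetical helper name, and rounding to whole patients is an assumption):

```python
def interim_sizes(N, n_looks, psi):
    """Cumulative per-arm sample size at look j: round(N * (j / n_looks) ** psi)."""
    return [round(N * (j / n_looks) ** psi) for j in range(1, n_looks + 1)]

# psi = 1 reproduces the planned schedule of 15 patients per arm between looks;
# psi < 1 front-loads accrual, psi > 1 delays it.
for psi in (0.25, 1.0, 1.75):
    print(psi, interim_sizes(60, 4, psi))
```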

Operating Characteristics
The operating characteristics of the different approaches in the uncontrolled setting are reported in Table 1. In scenario 1, the FWER was controlled with all three decision thresholds, while ϵ^m had the lowest proportion of false positives. In general, the higher false positive rate of C_n^{m,a} was due to its less stringent threshold. However, its arm-specific Type I error rate remained under 10%.
The operating characteristics of the different approaches in the controlled setting are displayed in Table 2. Similar to the uncontrolled setting, the empirical FWER was controlled, at 8.75% for C_n^m, 9.46% for C_n^{m,a}, and 9.62% for ϵ^m. In conclusion, ϵ^m had a higher proportion of early stoppings, while C_n^{m,a} resulted in a lower proportion of early stoppings.
When the sample size was smaller than planned, the estimated FWER remained close to the prespecified level, reaching no more than 12.32% even with only one third of the total expected sample size (20 patients per arm instead of 60; see Supplementary Table S2). Figure 2A depicts the proportion of conclusions of efficacy and acceptable toxicity in each arm under scenario 13 for different sample sizes (see also Supplementary Table S2 for complete results with lower sample sizes in each arm). Scenario 13 exemplified a situation covering the design hypotheses across the 3 experimental arms (arm A: H_0, arm B: H_1, arm C: intermediate efficacy and toxicity). C_n^m, C_n^{m,a}, and ϵ^m behaved similarly: the proportion of correct selection decreased with a lower sample size, while the empirical FWER remained around the specified level. The power and the percentage of correct selection were higher for C_n^m and C_n^{m,a} than for ϵ^m.
When assessing values of ψ ranging from 0.25 to 1.75, the C_n^m and C_n^{m,a} designs had stable operating characteristics across these different recruitment rates (see Figure 2B). In contrast, ϵ^m was more sensitive to variations in accrual rates that deviated from the planned rate. It is also worth noting that the percentage of correct selection using C_n^{m,a} slightly increased with higher values of ψ. This is because, with larger values of ψ, the sample size at the interim analysis is smaller than expected; due to the sparser data, the probability of early termination of an arm decreases, allowing a less stringent cutoff for C_n^{m,a}. As a result, fewer arms were terminated incorrectly at interim analyses, leading to an increase in the percentage of correct selection. Performing interim analyses at fixed time points across arms, instead of after a fixed number of patients, had limited impact on the proportion of conclusions regarding efficacy and acceptable toxicity, as well as on the proportion of correct selections in each arm. The empirical FWER slightly increased when an imbalance appeared across arms (u = 1, 2), but it remained stable regardless of the magnitude of imbalance (see Supplementary Table S2).

The AZA-PLUS Trial
To illustrate the design, we retrospectively planned three interim analyses (at least 20, 40, and 60 patients in each arm) along with the terminal analysis (assuming a maximum of 80 patients in each arm), using the AZA-PLUS trial as an example. We used the following probabilities θ_k for the multinomial distribution to calibrate the design: (0.15, 0.25, 0.15, 0.45) under inefficacy/toxicity, and (0.15, 0.40, 0.05, 0.40) under efficacy/non-toxicity. The inefficacy/toxicity hypothesis H_0 was used as the prior for the endpoint distributions, and decision rules were based on comparisons with the control arm (azacitidine). Consequently, the prior distribution was Dir(0.15, 0.25, 0.15, 0.45). The optimized parameters for the multi-arm threshold C_n^m were λ = 0.63 and γ = 1. These parameters controlled the FWER under 15% as planned (actual = 14.84%) and corresponded to a simulated power of 73.78%.
The interim analyses had slightly different sample sizes across arms due to randomization (Figure 3). The retrospective analysis using the proposed design suggested stopping both the Azacitidine + Lenalidomide and Azacitidine + Valproic acid arms after the 2nd analysis.
Compared to the initial trial, this would have reduced the actual sample size by 120 patients overall (40 fewer patients in each arm). Of note, given the trial population, the accrual rate was approximately 20 patients per year per arm. Assuming that both efficacy and toxicity criteria would require a follow-up of 6.5 months before the go/no-go decision, the proposed design would have led to an approximately 2-year reduction in trial duration compared to the original design.

Discussion
We have proposed a Bayesian design to control the FWER in multi-arm multi-stage phase II clinical trials with a joint assessment of efficacy and toxicity. We adapted the BOP2 design (Zhou, Lee, and Yuan 2017) to this setting, using group-sequential decision boundaries that depend on the fraction of enrolled patients, either solely or also accounting for the number of active arms at a given analysis. The proposed decision thresholds demonstrated good operating characteristics, with increased power compared to constant thresholds, as previously shown in single-arm trials (Zhou, Lee, and Yuan 2017). This finding is consistent with the work of Jiang et al. (2020), who found that boundaries varying over the course of the study outperformed constant thresholds in terms of power. Additionally, their approach used separate constant thresholds for efficacy and toxicity, which may sometimes result in thresholds that are more stringent for one endpoint than the other; it might be beneficial to consider additional constraints for these constant thresholds.
Two functions were defined for the decision boundaries, with the one depending on the number of active arms at the time of interim analyses outperforming the one depending only on the fraction of included patients. Allowing less stringent boundaries while some unpromising arms remained active ensured greater power for truly promising arms. However, this came at the cost of a slightly increased false positive rate in trials that included truly promising arms alongside ineffective or toxic ones.
The design appeared robust to departures from the planned setting, including variations in maximum sample size, accrual rate, and balance across arms at each analysis. The design, however, encountered challenges when dealing with a drug exhibiting a discordant profile of efficacy and toxicity (i.e., efficacious but toxic, or inefficacious but safe). This could be attributed to the computation of the decision thresholds, which are optimized to control the FWER under the global null hypothesis, where all treatments are assumed to be inefficacious and toxic. Such situations may arise, for example, when evaluating treatment combinations that could lead to antagonistic interactions. Additionally, the method demonstrated robustness with respect to the choice of prior. A pessimistic prior, as in the original BOP2 design, was chosen to ensure that interim analysis conclusions are robust, and a change of prior with a small effective sample size does not alter the results significantly. One should nevertheless exercise caution regarding the weight of the prior relative to the interim data, although the calibration process can further diminish the influence of the prior distributions. Similar to the BOP2 design, the proposed thresholds are optimized considering the entire joint distribution of efficacy and toxicity, thereby taking into account the correlation between the two endpoints. This provides adaptability, allowing the correlation between efficacy and toxicity to be adjusted. However, one should exercise caution when choosing this correlation, as an overestimated correlation may result in a slight inflation of the FWER (results not shown).
Future research directions can be explored in conjunction with the proposed method. Jiang et al. (2021) recently proposed a seamless phase I/II design in which patients are split into indication-specific parallel subgroups for the phase II part, similar to an uncontrolled basket trial, relying on Bayesian hierarchical modeling to borrow information across subgroups. Further evaluation is needed to quantify the benefit of such an approach in our controlled setting. Furthermore, while we have only implemented futility stopping rules, efficacy stopping rules could also be considered for the multi-arm multi-stage trial, where a promising treatment showing signals of efficacy without toxicity could graduate early (Blenkinsop, Parmar, and Choodari-Oskooei 2019). Of note, the proposed method could be considered a selection design with multiple treatment candidates for a common indication. Since there is no strong consensus on Type I and II error rates for phase II trials, emphasis should be placed on the FWER or the power according to the purpose and setting of the study (Stallard 2012), compromising somewhat between the risks of false positives and false negatives prior to formal efficacy assessment in phase III.
In conclusion, the proposed design for multi-arm multi-stage trials has demonstrated promising operating characteristics and could be employed to screen multiple treatments in phase II trials. Accounting for the available fraction of information and for the number of active arms allowed for an improvement in power, particularly in situations with multiple promising treatments. The R package implementing the methods proposed in this article is available at https://github.com/GuillaumeMulier/multibrasBOP2.

Supplementary Material
Refer to Web version on PubMed Central for supplementary material.

Figure 2.
Results for scenario 13 in a controlled setting, calibrated for a maximum sample size of 60 patients per arm. Panel A: percent of conclusions regarding efficacy and absence of toxicity in each arm according to the maximum number of enrolled patients. Panel B: proportion of correct selection relative to the accrual rate. The table below the plot shows the additional number of enrolled patients at each interim analysis.

Figure 1.
Figure 1. Decision thresholds used during the trial at the interim and terminal analyses, either for futility (A, C) or (over-)toxicity (B, D), based on the uncontrolled design (plots A and B) or the controlled design (plots C and D). C_n^m stands for the threshold in the same form as BOP2 applied in a multi-arm setting, C_n^{m,a} represents the multi-arm threshold dependent on the number of remaining ongoing arms, while ϵ^m is the multi-arm constant threshold maintained throughout the trial.

Figure 3.
Figure 3. AZA-PLUS trial with the multi-arm C_n^m threshold: posterior probabilities of efficacy and no toxicity, along with decision rules, at the 3 interim analyses and the final analysis.


Table 1 .
Operating characteristics for each arm under the 13 scenarios (Sc) for the uncontrolled designs: Family-Wise Error Rate (FWER), percent of conclusion of efficacy and no toxicity (ENT), percent of early stopping (ES), and mean sample size (SS).
Stat Biopharm Res. Author manuscript; available in PMC 2024 September 19.

Table 2 .
Operating characteristics for each experimental (Exp.) arm under the 13 scenarios (Sc) for the controlled designs: Family-Wise Error Rate (FWER), percent of conclusion of efficacy and no toxicity (ENT), percent of early stopping (ES), and mean sample size (SS).