Control of Type I Error Rates in Bayesian Sequential Designs

Bayesian approaches to phase II clinical trial designs are usually based on the posterior distribution of the parameter of interest and the calibration of certain thresholds for decision making. If the posterior probability is computed and assessed in a sequential manner, the design involves the problem of multiplicity, which is often a neglected aspect of Bayesian trial designs. To effectively maintain the overall type I error rate, we propose solutions to the problem of multiplicity for Bayesian sequential designs and, in particular, to the determination of the cutoff boundaries for the posterior probabilities. We present both theoretical and numerical methods for finding the optimal posterior probability boundaries with α-spending functions that mimic those of the frequentist group sequential designs. The theoretical approach is based on the asymptotic properties of the posterior probability, which establishes a connection between the Bayesian trial design and the frequentist group sequential method. The numerical approach uses a sandwich-type search algorithm, which immensely reduces the computational burden. We apply least-squares fitting to find the α-spending function closest to the target. We discuss the application of our method to single-arm and double-arm cases with binary and normal endpoints, respectively, and provide a real trial example for each case. MSC 2010 subject classifications: Primary 62C10; secondary 62P10.


Introduction
Along with the frequentist method, one of the popular paradigms in clinical trial design is the Bayesian approach, where samples are treated as fixed and the parameter of interest is assigned a prior probability distribution to represent the uncertainty about its value; see, e.g., Berry (2006, 2011), Berger and Berry (1988), Efron (1986, 2005) and Yin (2012). The posterior distribution of the parameter is continuously updated with regard to the accrued samples. Bayesian approaches allow incorporating useful information into the prior distribution and are usually more efficient provided that the prior distribution is sensible. Inferences are made based on the posterior distribution of the parameter of interest, which can be updated as the trial accumulates more data. Along this direction, Thall and Simon (1994) proposed a Bayesian single-arm phase II clinical trial design that continually evaluates the posterior probability that the experimental drug is superior to the standard of care, where the response rate of the new treatment is compared with a fixed cutoff boundary at each interim analysis during the trial. Because the comparison is made multiple times during the study, the design involves the problem of multiple testing, and a failure to make proper adjustment for multiplicity is known to induce potential inflation of the type I error rate. As an illustration of multiple testing problems in the Bayesian setting, consider a random sample $\{y_1, \ldots, y_n\}$ from $N(\theta, 1)$, where we are interested in $H_0: \theta \le 0$ versus $H_1: \theta > 0$. From a frequentist viewpoint, the test statistic at the $k$th interim analysis is $\bar{y}_k = \sum_{i=1}^{n_k} y_i / n_k$, where $n_k$ is the cumulative sample size up to stage $k$. The decision rule for the O'Brien-Fleming type (O'Brien and Fleming, 1979) group sequential design is
$$\sqrt{n_k}\, \bar{y}_k > C_{OF}(K, \alpha)\sqrt{K/k}, \quad k = 1, \ldots, K,$$
where $K$ is the total number of analyses planned for the trial and $C_{OF}(K, \alpha)$ is the critical constant for the design.
By contrast, assuming a flat prior distribution for $\theta$, $f(\theta) \propto 1$, the posterior at interim analysis $k$ under the Bayesian approach is $\theta \mid \bar{y}_k \sim N(\bar{y}_k, 1/n_k)$. If we employ the decision rule that the posterior probability of $H_1$ should be greater than $1 - \alpha$, the decision boundary would be $\sqrt{n_k}\,\bar{y}_k > \Phi^{-1}(1-\alpha)$ at every analysis, and thus there is no penalty for multiple testing in the Bayesian setting. To effectively control the overall type I error rate in Bayesian sequential designs, we study the problem of multiplicity, specifically, how the cutoff boundaries for the posterior probabilities should be determined.
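The inflation induced by applying this fixed flat-prior boundary at every look can be checked by simulation. The following sketch uses hypothetical settings (K = 5 looks, 20 subjects per group, α = 0.05) and applies the unadjusted rule $\sqrt{n_k}\,\bar{y}_k > \Phi^{-1}(1-\alpha)$ under $H_0$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
alpha, K, m, n_sims = 0.05, 5, 20, 20_000
z_fixed = norm.ppf(1 - alpha)           # same cutoff at every look

rejections = 0
for _ in range(n_sims):
    y = rng.normal(0.0, 1.0, K * m)     # data under H0: theta = 0
    for k in range(1, K + 1):
        n_k = k * m
        if np.sqrt(n_k) * y[:n_k].mean() > z_fixed:
            rejections += 1             # posterior Pr(H1 | data) > 1 - alpha
            break
print(round(rejections / n_sims, 3))
```

The empirical rejection rate comes out well above the nominal α = 0.05, illustrating the absence of any multiplicity penalty.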
The issue of multiplicity involves either testing multiple hypotheses simultaneously or testing a single hypothesis repeatedly over time. For the former, extensive research has been conducted under the Bayesian paradigm; see, e.g., Gopalan and Berry (1998), Berry and Hochberg (1999), Scott and Berger (2006), Labbe and Thompson (2007), Guindani et al. (2009), and Guo and Heitjan (2010). For the latter, various Bayesian clinical trial designs involving sequential testing of a single hypothesis have been proposed; see, e.g., Thall and Simon (1994), Lee and Liu (2008), Thall et al. (1995), Heitjan (1997), Rosner and Berry (1995), and Gsponer et al. (2014). However, a comprehensive and unified approach to controlling the overall type I error rate while accounting for multiplicity is rarely discussed. With a focus on the binary endpoint, Zhu and Yu (2017) adopted a numerical search method for calibrating the operating characteristics of a Bayesian sequential design in terms of the α-spending function. Murray et al. (2016) developed a computational algorithm for calibrating the spending of the type I error rate for utility-based sequential trials with multinomial endpoints. Murray et al. (2017) adopted a simulation-based approach to calibrating the empirical α-spending function for a Bayesian design with two co-primary semicompeting time-to-event endpoints, which incorporates three interim analyses.
In the frequentist group sequential design, multiplicity is explicitly considered to control the overall type I error rate; see, e.g., Pocock (1977), O'Brien and Fleming (1979), Wang and Tsiatis (1987), Eales and Jennison (1992), and Barber and Jennison (2002). By contrast, limited research has been conducted on multiplicity adjustment for Bayesian sequential designs. Ideally, a Bayesian multiple testing procedure would be conducted in a fully decision-theoretic framework, where a loss function and the related parametric assumptions are explicitly specified; see, e.g., Lewis and Berry (1994), Christen et al. (2004), Müller et al. (2007), and Ventz and Trippa (2015). However, in clinical trials, regulatory bodies (e.g., the Food and Drug Administration) often require explicit evidence that the frequentist error rates are well maintained. As a result, it is common practice to evaluate the frequentist properties of a Bayesian design by simulation, which requires a large number of simulated repetitions of the trial conduct, and different trial designs may require different simulation setups. Our goal is to provide a more unified framework to directly control the type I error rate for Bayesian sequential designs. We propose both theoretical and numerical approaches to maintaining the overall type I error rate for Bayesian designs that involve multiple comparisons using posterior probabilities, such that the designs' operating characteristics mimic those of the commonly used group sequential methods. By carefully calibrating the design parameters, Bayesian methods can effectively maintain the frequentist type I and type II error rates at the nominal levels. Although such a calibrated Bayesian design has operating characteristics similar to its frequentist counterpart, Bayesian designs bring more flexibility to the trial conduct, e.g., adaptive randomization based on the posterior distribution, or prediction of trial success using posterior predictive distributions.
Moreover, when historical data are available, Bayesian approaches allow the incorporation of historical information via informative priors, which would lead to savings in the sample size.
The rest of this article is organized as follows. In Section 2, we describe a motivating example where a Bayesian design may inflate the type I error rate if no adjustment is made to account for multiplicity. In Section 3, we develop Bayesian sequential designs using posterior probabilities and describe the methods for maintaining the frequentist error rates for single- and double-arm designs. Section 4 extends our methods to trials with normal endpoints, and Section 5 compares the operating characteristics of our methods with those of a Bayesian continuous monitoring scheme. Section 6 presents examples of design applications. Section 7 concludes the article with some remarks.

Motivating Example

Thall and Simon (1994) proposed a Bayesian single-arm design for phase II trials. The design continually evaluates the efficacy of the experimental treatment by monitoring the binary outcomes and makes adaptive decisions throughout the trial. Let $p_E$ denote the response rate of the experimental drug, and let $p_S$ denote that of the standard drug. We are interested in testing the hypotheses
$$H_0: p_E \le p_S \quad \text{versus} \quad H_1: p_E > p_S.$$
We assume beta prior distributions for the two response rates, $p_E \sim \mathrm{Beta}(a_E, b_E)$ and $p_S \sim \mathrm{Beta}(a_S, b_S)$, where the prior mean of $p_S$ is set equal to $p_{\mathrm{null}}$. Historical information on the standard treatment is often available, and we may set $p_{\mathrm{null}}$ to be the estimate from the historical data. We inflate the prior variance of $p_S$ to account for the uncertainty due to between-trial effects, i.e., the differences between the historical trials and the current trial. The beta prior for $p_E$ is usually much more diffuse, with a large variance reflecting the fact that little is known about the experimental drug. For example, we may assume a vague prior distribution for the experimental drug, $p_E \sim \mathrm{Beta}(0.2, 0.8)$, which is often considered to be equivalent to the information of only one subject. For the standard drug, suppose that we have observed 200 responses among 1000 subjects in historical studies; we may then set $p_{\mathrm{null}} = 0.2$, the historical sample proportion, while discounting the historical information by enlarging the prior variance. For example, we may assume $p_S \sim \mathrm{Beta}(20, 80)$, which contains an amount of information equivalent to 100 patients (Morita et al., 2008).
Suppose the trial has accrued $n$ subjects on the experimental treatment, and we observe $y$ responses among them. Let $D$ denote the observed data $(n, y)$. Due to the conjugacy of the beta prior with the binomial likelihood, the posterior distribution of $p_E$ is still beta, $p_E \mid D \sim \mathrm{Beta}(a_E + y, b_E + n - y)$. Let $f(p; a, b)$ and $F(p; a, b)$ denote the probability density function and the cumulative distribution function of a $\mathrm{Beta}(a, b)$ distribution, respectively. We compute the posterior probability that the experimental response rate is larger than the standard rate,
$$\Pr(p_E > p_S \mid D) = \int_0^1 \{1 - F(p;\, a_E + y,\, b_E + n - y)\}\, f(p;\, a_S, b_S)\, dp. \qquad (1)$$
Let $\theta_U$ and $\theta_L$ denote the upper and lower boundaries for the posterior probability. At each step of the trial, we compute $\Pr(p_E > p_S \mid D)$ and claim the experimental drug promising if it is larger than $\theta_U$, unpromising if it is smaller than $\theta_L$, and proceed to enroll the next subject if it lies between the two. At the end of the trial, when the prespecified maximum sample size $N_{\max}$ is exhausted, the drug is concluded to be promising if $\Pr(p_E > p_S \mid D) > \theta_U$, and unpromising otherwise.
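Under the beta priors assumed above, (1) is a one-dimensional integral that can be evaluated by quadrature. A sketch with the priors from the example and hypothetical interim data (10 responses in 30 subjects):

```python
import numpy as np
from scipy import integrate
from scipy.stats import beta

# Priors from the example: p_E ~ Beta(0.2, 0.8), p_S ~ Beta(20, 80)
aE, bE = 0.2, 0.8
aS, bS = 20.0, 80.0
n, y = 30, 10   # hypothetical interim data: 10 responses in 30 subjects

# Pr(p_E > p_S | D) = integral of {1 - F(p; aE+y, bE+n-y)} f(p; aS, bS) dp
integrand = lambda p: (1 - beta.cdf(p, aE + y, bE + n - y)) * beta.pdf(p, aS, bS)
prob, _ = integrate.quad(integrand, 0.0, 1.0)
print(round(prob, 3))
```

The same quantity can be approximated by drawing from the two posteriors and computing the fraction of draws with $p_E > p_S$; quadrature is simply faster and deterministic here.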
The design by Thall and Simon (1994) in fact suffers from the issue of multiplicity, as the same hypothesis $p_E > p_S$ is tested repeatedly and the drug can be declared promising at any interim look that produces a posterior probability larger than $\theta_U$. We define the type I error rate as the probability of rejecting the null hypothesis when $p_E = p_{\mathrm{null}}$, and assess the degree of its inflation under $N_{\max} = 60$ and $\theta_U = 0.9$ across different prior sample sizes of $p_S$. The prior distribution of $p_E$ has a mean equal to that of $p_S$ and a prior sample size of 1. We simulate one million trials by generating random samples from the $\mathrm{Bernoulli}(p_E)$ distribution, and calculate the empirical type I error rate as the proportion of times the trial results lead to a positive conclusion. Figure 1 shows the type I error rates under different prior sample sizes of $p_S$, for Bayesian sequential designs without futility stopping and with futility stopping $\theta_L$, respectively. A more informative prior for $p_S$ induces a larger type I error rate. The intuition behind this pattern is that with a larger prior sample size, the prior distribution of $p_S$ is more concentrated at $p_{\mathrm{null}}$, which makes it easier for the trial results to reach a high posterior probability of $p_E > p_S$. Moreover, a larger value of $p_{\mathrm{null}}$ appears to be associated with a higher type I error rate. A possible reason is that for smaller values of $p_{\mathrm{null}}$, the posterior probability decision boundaries tend to be more conservative and more difficult to reach at $p_E = p_{\mathrm{null}}$ when the cumulative sample size is relatively low. As an example, when the cumulative sample size is 10 and the prior sample size is 1000, the probability of reaching the posterior probability boundary under $p_{\mathrm{null}} = 0.1$ is 0.07, whereas that under $p_{\mathrm{null}} = 0.6$ is as high as 0.17.
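A reduced version of this simulation (20,000 rather than one million trials, no futility stopping) can be sketched as follows, for the illustrative configuration $p_{\mathrm{null}} = 0.2$ with $p_S \sim \mathrm{Beta}(20, 80)$ and $p_E \sim \mathrm{Beta}(0.2, 0.8)$. Because the posterior probability is monotone in the response count, the boundary is precomputed once per sample size:

```python
import numpy as np
from scipy.stats import beta
from scipy.integrate import trapezoid

# Settings from the text: N_max = 60, theta_U = 0.9, p_null = 0.2,
# p_E ~ Beta(0.2, 0.8), p_S ~ Beta(20, 80) (prior sample size 100)
N_max, theta_U, p_null = 60, 0.9, 0.2
aE, bE, aS, bS = 0.2, 0.8, 20.0, 80.0

grid = np.linspace(0.0, 1.0, 2001)
fS = beta.pdf(grid, aS, bS)

def post_prob(n, y):
    """Pr(p_E > p_S | y responses in n subjects), by quadrature of (1)."""
    return trapezoid((1 - beta.cdf(grid, aE + y, bE + n - y)) * fS, grid)

# Smallest response count at each n with posterior probability > theta_U
bound = np.array([next((y for y in range(n + 1) if post_prob(n, y) > theta_U),
                       n + 1) for n in range(1, N_max + 1)])

rng = np.random.default_rng(7)
n_sims = 20_000
y_cum = np.cumsum(rng.random((n_sims, N_max)) < p_null, axis=1)
rate = (y_cum >= bound).any(axis=1).mean()   # empirical type I error
print(round(rate, 3))
```

The empirical rate lands well above the nominal $1 - \theta_U = 0.1$, consistent with the pattern in Figure 1.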
The type I error rate inflation is slightly ameliorated with a futility stopping scheme by setting $\theta_L = 0.1$. In all cases, the type I error rate exceeds the nominal level of $1 - \theta_U = 0.1$, and in some extreme cases the type I error rate can be inflated up to 0.5. Therefore, it is recommended that for Bayesian sequential designs where the same hypothesis is tested multiple times, the decision boundaries be carefully adjusted and calibrated to prevent inflation of the overall type I error rate, particularly when the design involves a strong degree of information borrowing from historical trials, i.e., an informative prior distribution or a large prior sample size. Similar findings on the type I error rate inflation are noted in Jennison and Turnbull (2000), Chapter 18.

Figure 1: Type I error rates under different prior sample sizes for $p_S$ (a) without futility stopping and (b) with futility stopping $\theta_L = 0.1$, under the Bayesian single-arm continuous monitoring design of Thall and Simon (1994).

Single-Arm Design
We propose a Bayesian sequential design based on posterior probabilities for single-arm phase II trials with binary outcomes. Our goal is not only to maintain the overall type I error rate, but also to calibrate the decision boundaries such that the design's operating characteristics mimic those of the commonly used group sequential designs. Let $K$ be the total number of analyses to be conducted throughout the trial and let $m$ be the number of samples in each group, i.e., we conduct one analysis every time $m$ additional subjects are enrolled. Let $c_k$ be the efficacy decision boundary at stage $k$, $k = 1, \ldots, K$. At the $k$th interim stage, the posterior probability of the experimental response rate being larger than the standard rate is
$$P(H_1 \mid D_k) = \Pr(p_E > p_S \mid D_k),$$
where $D_k$ is the cumulative data up to stage $k$. If $P(H_1 \mid D_k) > c_k$, we stop the trial and declare treatment efficacy; otherwise we enroll the next group of $m$ patients and conduct another analysis at stage $k + 1$, or, if $k = K$, i.e., we have reached the end of the trial, we declare treatment futility. We define the type I error rate to be the probability of declaring treatment efficacy when $p_E = p_{\mathrm{null}}$. The amount of type I error rate spent at stage $k$, denoted as $\alpha_k$, is defined to be the probability of reaching the efficacy boundary at stage $k$ (and not earlier). Let $b(y; n, p)$ denote the binomial probability mass function and let $I(\cdot)$ denote the indicator function. Writing $z_j$ for the number of responses among the $j$th group of $m$ patients, so that $D_k$ is determined by $(z_1, \ldots, z_k)$, we have
$$\alpha_k = \sum_{z_1=0}^{m} \cdots \sum_{z_k=0}^{m} \prod_{j=1}^{k-1} I\{P(H_1 \mid D_j) \le c_j\}\; I\{P(H_1 \mid D_k) > c_k\} \prod_{j=1}^{k} b(z_j;\, m,\, p_{\mathrm{null}}).$$
The overall type I error rate is thus
$$\alpha = \sum_{k=1}^{K} \alpha_k.$$
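The stagewise spending $\alpha_k$ can be computed exactly by a forward recursion over the distribution of cumulative responses among trials still ongoing, rather than by enumerating all paths. A sketch, stated in terms of response-count boundaries $u_k$ (the monotone translation of the $c_k$ made explicit in the next subsection; all numbers here are hypothetical):

```python
import numpy as np
from scipy.stats import binom

def error_spending(u, m, p_null):
    """Exact alpha_k spent at each stage when efficacy is declared at
    stage k if the cumulative response count y_k exceeds u[k]."""
    prob = np.array([1.0])          # distribution of y among ongoing trials
    inc = binom.pmf(np.arange(m + 1), m, p_null)   # one group of m subjects
    alpha = []
    for u_k in u:
        prob = np.convolve(prob, inc)              # fold in the next group
        y = np.arange(len(prob))
        alpha.append(prob[y > u_k].sum())          # newly crossed boundary
        prob[y > u_k] = 0.0                        # crossing paths stop
    return alpha

alphas = error_spending(u=[8, 14, 20, 26], m=15, p_null=0.3)
print([round(a, 4) for a in alphas], round(sum(alphas), 4))
```

The convolution step is the distribution of $y_k = y_{k-1} + z_k$ with $z_k \sim \mathrm{Binomial}(m, p_{\mathrm{null}})$; zeroing out the mass above $u_k$ enforces the "and not earlier" condition in the definition of $\alpha_k$.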

Posterior Probability Boundaries: Numerical Method
Our goal is to search for the optimal set $\{c_k : k = 1, \ldots, K\}$ that yields the closest fit to a prespecified target α-spending function while controlling the overall type I error rate. The search space is evidently of high dimension, and full enumeration would be computationally intensive. To overcome this issue, we shrink the search space to the region where the optimal solution most probably lies, so that a numerical approach becomes feasible. We first establish that $P(H_1 \mid D_k)$ is an increasing function of $y_k$, the cumulative number of responses at stage $k$, because in the integration of (1) it holds that $F(p; a_E + y + 1, b_E + n - y - 1) < F(p; a_E + y, b_E + n - y)$ by the two lemmas in Thall and Simon (1994). Based on this monotonic relationship, we can avoid the computationally intensive integration when calculating $P(H_1 \mid D_k)$ and directly calibrate $u_k$, the boundary for the cumulative number of responses at stage $k$; i.e., if $y_k > u_k$, we declare the drug promising. In other words, we translate the boundary from the probability domain, $c_k$, to the response-count domain, $u_k$. As the spending of the type I error rate at stage $k$ is a function of all the upper boundaries up to stage $k$, we denote it as $\alpha_k(U_k)$, where $U_k = (u_1, \ldots, u_k)^T$ is a vector of design parameters. Let $\alpha(k)$ denote the target amount of type I error rate to be spent at stage $k$ and let $\alpha$ denote the overall type I error rate. We propose a sandwich-type search algorithm that immensely reduces the computational burden; the details are as follows.
1. At step k = 1, we iterate j from 0 to m.
(i) For each j, we compute the amount of type I error rate spent at stage 1 when u 1 = j, denoted as α 1 (U 1j ), where U 1j is a scalar equal to j, corresponding to an upper boundary value of j at the end of the first stage.
(ii) We find the pair $(u^*_{1j}, u^*_{1j} + 1)$ such that $\alpha_1(U^*_{1j})$ and $\alpha_1(U^\dagger_{1j})$ bracket the target $\alpha(1)$, where $U^*_{1j}$ and $U^\dagger_{1j}$ are two scalars equal to $u^*_{1j}$ and $u^*_{1j} + 1$, respectively. Let $A_1$ be the set consisting of $U^*_{1j}$ and $U^\dagger_{1j}$.
2. At step k, 1 < k < K, we iterate through each vector in A k−1 .
(i) Let U k−1 denote the design vector consisting of the values of upper boundaries up to the (k − 1)th interim analysis, and denote its last element as n min . Fixing U k−1 , we iterate j from n min to km.
(ii) We find the pair (u * kj , u * kj + 1), such that α k (U * kj ) < α(k) < α k (U † kj ), where U * kj and U † kj can be obtained by appending u * kj and u * kj + 1 to the current design vector U k−1 , respectively. The vectors U * kj and U † kj represent the two sets of upper boundaries up to stage k whose amounts of cumulative error rate spending are closest to the target; the error rates under the design vector U * kj are under-spent while those under U † kj are over-spent. Let A k be the set consisting of vectors U * kj and U † kj .
3. At step k = K, we iterate through each vector in A K−1 .
(i) For each vector in A K−1 , denoted as U K−1 , we calculate the type I error rate spent up to the (K − 1)th stage as α * = 4. Among all the obtained vectors U K in the last step, we choose the one that yields the smallest L 2 -distance to the target α-spending function, i.e., minimizing where U k consists of the first k elements in U K . Based on the increasing relationship between P (H 1 |D k ) and y k , we can then find the corresponding c k such that Steps 1 to 3 identify the sets of upper boundaries under which the amounts of cumulative type I error rate spending are closest to the target, and step 4 selects the best set of boundaries with the smallest L 2 -distance to the target function. In the first step, only one pair of design vectors are identified. In each subsequent step k, further pairs are identified and appended to the design vector in the set A k−1 from the previous step. The total number of design vectors assessed at step k is 2 k . We can also minimize the maximum difference between α k (U k ) and α(k) in the last step, which would give similar results.
As the proposed numerical algorithm seeks to minimize the squared distance between the empirical spending function and the target, it is robust, accurate and flexible: it can accommodate any type of α-spending function used in group sequential methods, including the commonly used Pocock, O'Brien-Fleming, and Wang-Tsiatis types (Jennison and Turnbull, 2000). The specification of the target function depends on the preference for how the spending of the type I error rate is distributed over the interim analyses.

Posterior Probability Boundaries: Theoretical Method
There exists an asymptotic connection between the Bayesian approach based on posterior probabilities and the frequentist method using p-values. Dudley and Haughton (2002) studied the asymptotic normality of the posterior probability of half-spaces. In particular, let $\Theta$ be an open subset of a Euclidean space $R^d$. A half-space $H$ is a set satisfying a linear inequality,
$$H = \{\theta \in \Theta : a^{T}\theta > b\},$$
where $a \in R^d$ and $b$ is a scalar, and let $\partial H$ represent the boundary hyperplane of $H$,
$$\partial H = \{\theta \in \Theta : a^{T}\theta = b\}.$$
Examples of half-spaces in the context of clinical trials are $\{p_E : p_E > p_{\mathrm{null}}\}$ for a single-arm trial and $\{(p_E, p_S) : p_E - p_S > 0\}$ for a double-arm trial. Let $y_i$ denote the observed data whose probability density function is $f(y_i, \theta)$, for $i = 1, \ldots, n$. The likelihood ratio statistic for testing the null hypothesis $H_0 : \theta \in \partial H$ is
$$\Delta_n = 2\{\ell_n(\hat{\theta}) - \ell_n(\tilde{\theta})\},$$
where $\ell_n(\theta) = \sum_{i=1}^{n} \log f(y_i, \theta)$, and $\hat{\theta}$ and $\tilde{\theta}$ are the maximum likelihood estimates for $\theta \in \Theta$ and $\theta \in \partial H$, respectively. Let $S_n$ denote the signed root likelihood ratio statistic, i.e., $S_n = \sqrt{\Delta_n}$ if $\hat{\theta} \in H$, and $S_n = -\sqrt{\Delta_n}$ otherwise. Let $\pi_n(H)$ denote the posterior probability of the half-space given the data with sample size $n$.

Theorem 1. Under the regularity conditions in Dudley and Haughton (2002), we have

(i) If $H_n$ is a sequence of the same half-space, indexed by the cumulative sample sizes, then as $n \to \infty$, $\pi_n(H_n)/\Phi(S_n) \to 1$ almost surely, where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal random variable.

(ii) For cumulative sample sizes $n_1, \ldots, n_K$, the joint statistics $(S_{n_1}, \ldots, S_{n_K})$ asymptotically follow a multivariate normal distribution.

The proof of the first part of the theorem can be found in Dudley and Haughton (2002), and the second part follows from the continuous mapping theorem and the argument on the joint canonical distribution of $(S_{n_1}, \ldots, S_{n_K})$ in Jennison and Turnbull (1997), Scharfstein et al. (1997), and Jennison and Turnbull (2000), Chapter 11.2.
Based on these theoretical results, we propose a method to find the set of $c_k$ by connecting the Bayesian and the frequentist group sequential designs. Specifically, let $\{z_k; k = 1, \ldots, K\}$ denote a series of critical constants obtained from the frequentist group sequential method and, without loss of generality, assume that all the $z_k$'s are positive. We set $c_k = \Phi(z_k)$, because the decision rules based on the posterior probabilities of $H_1$ exceeding $c_1, \ldots, c_K$ are asymptotically equivalent to $S_{n_1}, \ldots, S_{n_K}$ being greater than $z_1, \ldots, z_K$, respectively, which yields the correct type I error rate spending $\alpha_1, \ldots, \alpha_K$ based on the canonical distribution in the group sequential design.

Double-Arm Design

Design Specification
In addition to the single-arm Bayesian sequential design, we also study the properties of a double-arm Bayesian sequential design that uses the posterior probability at the interim analyses. Consider a double-arm clinical trial with dichotomous outcomes, let $p_E$ denote the response rate of the experimental drug, and let $p_S$ denote that of the standard drug. Let $K$ denote the total number of analyses and let $m$ denote the sample size per arm in each group. For the one-sided hypothesis test, we are interested in examining whether the experimental drug is superior to the standard of care,
$$H_0: p_E \le p_S \quad \text{versus} \quad H_1: p_E > p_S.$$
Under the Bayesian framework, we assume beta prior distributions for $p_E$ and $p_S$, i.e., $p_E \sim \mathrm{Beta}(a_E, b_E)$ and $p_S \sim \mathrm{Beta}(a_S, b_S)$. At the $k$th interim analysis, the cumulative number of patients accrued in each arm is $km$. If the numbers of responses in the experimental and standard arms are $y_E$ and $y_S$ respectively, the binomial likelihoods are $L(p_E \mid y_E) \propto p_E^{y_E}(1 - p_E)^{km - y_E}$ and $L(p_S \mid y_S) \propto p_S^{y_S}(1 - p_S)^{km - y_S}$. The posterior distributions of $p_E$ and $p_S$ are given by
$$p_E \mid y_E \sim \mathrm{Beta}(a_E + y_E,\, b_E + km - y_E), \qquad p_S \mid y_S \sim \mathrm{Beta}(a_S + y_S,\, b_S + km - y_S),$$
whose density functions are denoted by $f(p_E \mid y_E)$ and $f(p_S \mid y_S)$, respectively. Let $c_k$ be a prespecified cutoff probability boundary at stage $k$. Based on the posterior probability, we can construct a Bayesian sequential testing procedure, so that the experimental treatment is declared promising if
$$\Pr(p_E > p_S \mid y_E, y_S) > c_k.$$
Otherwise, we fail to declare treatment efficacy.
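The posterior probability $\Pr(p_E > p_S \mid y_E, y_S)$ has no simple closed form, but because the two beta posteriors are independent it is easy to evaluate by posterior simulation. A sketch with hypothetical vague priors and interim counts:

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(5)
aE, bE, aS, bS = 0.5, 0.5, 0.5, 0.5     # hypothetical vague priors
km = 40                                  # cumulative per-arm sample size
yE, yS = 22, 10                          # hypothetical interim responses

# Draws from the two independent beta posteriors
pE = beta.rvs(aE + yE, bE + km - yE, size=500_000, random_state=rng)
pS = beta.rvs(aS + yS, bS + km - yS, size=500_000, random_state=rng)

post_prob = (pE > pS).mean()             # Pr(p_E > p_S | y_E, y_S)
print(round(post_prob, 3))
```

The trial would declare the treatment promising at stage $k$ whenever this quantity exceeds the calibrated $c_k$.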
To control the overall type I error rate, we may adopt the theoretical method that connects the Bayesian sequential design with the frequentist group sequential method by setting $c_k = \Phi(z_k)$, where $\{z_k; k = 1, \ldots, K\}$ is a series of critical constants obtained from the frequentist group sequential method.

Extension to Biomarker Design

Wason et al. (2015) proposed a Bayesian adaptive design for analyzing the relationships between biomarkers and the experimental treatment effects. Consider a biomarker trial with $L$ biomarkers, a total of $J$ experimental treatment arms, and one control arm for the standard drug. When a patient is enrolled into the study, a test is conducted to obtain his/her biomarker profile. Let $X_i = (X_{i1}, \ldots, X_{iL})$ denote the biomarker profile of the $i$th patient, where $X_{il} = 1$ if the expression of biomarker $l$ is positive for the $i$th patient, and $X_{il} = 0$ otherwise. Patients are equally randomized to all treatment arms, and we denote $T_i = (T_{i1}, \ldots, T_{iJ})$ as the treatment assignment vector, where $T_{ij} = 1$ if the $i$th patient is allocated to the $j$th experimental treatment; when all entries of $T_i$ are zero, the patient is assigned to the control arm. Let $Y_i$ denote the binary endpoint for the $i$th patient and let $p_i$ denote the probability of treatment success. The design utilizes a Bayesian logistic regression model to characterize the effects of the treatment, the biomarker and their interaction:
$$\mathrm{logit}(p_i) = \beta_0 + \sum_{j=1}^{J} \beta_j T_{ij} + \sum_{l=1}^{L} \gamma_l X_{il} + \sum_{j=1}^{J}\sum_{l=1}^{L} \delta_{jl} T_{ij} X_{il},$$
where $\beta_0$ is the intercept, $\beta_j$ represents the effect of the $j$th experimental treatment, $\gamma_l$ represents the effect of the $l$th biomarker, and $\delta_{jl}$ represents the effect of the treatment-biomarker interaction. Noninformative normal prior distributions are specified for all regression coefficients.
A total of $J(L + 1)$ hypotheses are tested at the interim or final analysis. In particular, the set of alternative hypotheses is $\{H_1^{(j,l)} : \beta_j + \delta_{jl} > 0,\ j = 1, \ldots, J,\ l = 1, \ldots, L\}$ together with $\{H_1^{(j,0)} : \beta_j > 0,\ j = 1, \ldots, J\}$. At the $k$th analysis, we compute the posterior probability of each alternative hypothesis, $\Pr(H_1^{(j,l)} \mid D_k)$, and if it is larger than $c_k$, the superiority of the experimental treatment can be declared for the corresponding subgroup of patients. The theoretical method for controlling the overall type I error rate can be adopted for such a Bayesian sequential design, i.e., we may set $c_k = \Phi(z_k)$.

Numerical Evaluation
We apply the theoretical posterior probability boundaries for controlling the type I error rate in a double-arm sequential biomarker design. We assume that there are $J = 2$ experimental treatments and $L = 2$ biomarkers, and the trial involves $K = 4$ analyses with a total sample size of 480. Patients are equally randomized to the two treatment arms and the control arm. Wason et al. (2015) recommended controlling the familywise error rate (FWER) in a range of 0.4 to 0.5. As a total of 6 hypotheses are to be tested, we adopt Bonferroni's method for controlling the FWER at 0.48, i.e., we set the significance level for each hypothesis test to be $0.48/6 = 0.08$.
To examine the effectiveness of controlling the type I error rate, we consider a null case where $\beta_j$ and $\delta_{jl}$ are all zero for $j = 1, 2$ and $l = 1, 2$, and $\beta_0 = \gamma_1 = \gamma_2 = 0.1$. Based on 1000 trial replications, we compute the empirical type I error rates spent at the interim analyses using the theoretical posterior probability boundaries with the Pocock type and O'Brien-Fleming type α-spending functions, as exhibited in Table 1. Due to symmetry, we only show results for the two alternative hypotheses $H_1^{(1,1)} : \beta_1 + \delta_{11} > 0$ and $H_1^{(1,0)} : \beta_1 > 0$. Because the endpoint is binary, a slight deviation between the total empirical type I error rate and the target level of 0.08 is observed. Nevertheless, the theoretical method controls the type I error rate spending in accordance with the specified α-spending function.

Design Specification
Consider a single-arm trial with a continuous endpoint from the normal distribution $N(\mu, \sigma^2)$. Let $y_i$ denote the observed outcome for the $i$th subject in the experimental arm and let $n$ denote the number of observations. We assign a prior distribution $N(\mu_0, \sigma_0^2)$ to the mean $\mu$, and for simplicity we assume the variance $\sigma^2$ to be known. The likelihood can be expressed as $\prod_{i=1}^{n} \phi(y_i; \mu, \sigma^2)$, where $\phi(\cdot; \mu, \sigma^2)$ denotes the normal density function with mean $\mu$ and variance $\sigma^2$. Based on the conjugacy of a normal prior distribution under a normal likelihood, the posterior distribution of $\mu$ is $N(\mu_*, \sigma_*^2)$, where
$$\sigma_*^2 = \left(\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}\right)^{-1}, \qquad \mu_* = \sigma_*^2\left(\frac{\mu_0}{\sigma_0^2} + \frac{\sum_{i=1}^{n} y_i}{\sigma^2}\right).$$
We formulate the null and alternative hypotheses as
$$H_0: \mu \le \delta \quad \text{versus} \quad H_1: \mu > \delta,$$
where $\delta$ is the minimum value of $\mu$ that warrants further investigation. We reject the null hypothesis if $\Pr(\mu > \delta \mid D) > c$, $D = \{y_1, \ldots, y_n\}$, which is equivalent to $(\mu_* - \delta)/\sigma_* > \Phi^{-1}(c)$, and can be further expressed as $\bar{y} > Q(c; \mu_0, \sigma_0, \sigma)$, where
$$Q(c; \mu_0, \sigma_0, \sigma) = \frac{\sigma^2}{n}\left\{\frac{\delta + \sigma_* \Phi^{-1}(c)}{\sigma_*^2} - \frac{\mu_0}{\sigma_0^2}\right\}$$
is a nondecreasing function of $c$. To control the overall type I error rate for a series of sequential tests, we can equate $Q(c; \mu_0, \sigma_0, \sigma)$ to the corresponding critical constant in the group sequential design.
Under the group sequential methodology, we reject the null at an interim analysis if $\sqrt{n}(\bar{y} - \delta)/\sigma > z$, or $\bar{y} > z\sigma/\sqrt{n} + \delta$, where $z$ is a known critical constant from the interim boundaries in the group sequential test with the same specification of the overall type I error rate, power and α-spending function. The value of $c$ can be solved by setting
$$Q(c; \mu_0, \sigma_0, \sigma) = \frac{z\sigma}{\sqrt{n}} + \delta.$$
In the case of a two-arm trial, we are interested in comparing the means of the endpoints between the experimental and control arms, denoted by $\mu_E$ and $\mu_S$ respectively. Under a normal likelihood function with normal prior distributions on the means, the posterior distributions of $\mu_E$ and $\mu_S$ are both normal. The posterior distribution of $\mu_E - \mu_S$ is also normal and, as a result, the decision boundary $c$ can be derived along similar lines.
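Solving for $c$ amounts to evaluating the posterior mean at the frequentist boundary value of $\bar{y}$ and converting back through $\Phi$. A sketch with illustrative inputs (the $z$ value 2.361 is roughly the Pocock constant for $K = 4$, two-sided $\alpha = 0.05$; the prior and sample sizes are hypothetical):

```python
import numpy as np
from scipy.stats import norm

def posterior_prob_boundary(z, n, delta=0.0, sigma=1.0, mu0=0.0, sigma0=10.0):
    """Posterior-probability cutoff c equivalent to the frequentist rule
    sqrt(n) * (ybar - delta) / sigma > z, under a N(mu0, sigma0^2) prior."""
    var_post = 1.0 / (1.0 / sigma0**2 + n / sigma**2)   # sigma_*^2
    ybar = z * sigma / np.sqrt(n) + delta               # ybar on the boundary
    mu_post = var_post * (mu0 / sigma0**2 + n * ybar / sigma**2)
    return norm.cdf((mu_post - delta) / np.sqrt(var_post))

# Map a flat set of interim constants to posterior-probability boundaries
for k in range(1, 5):
    print(k, round(posterior_prob_boundary(2.361, n=50 * k), 4))
```

With a diffuse prior the resulting $c$ is essentially $\Phi(z)$, consistent with the asymptotic equivalence of the theoretical method; an informative prior shifts $c$ away from $\Phi(z)$.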

Commensurate Prior
One of the advantages of Bayesian trial designs is the ability to incorporate useful historical information into the prior distribution, which, if done appropriately, leads to higher power and savings in sample size. Hobbs et al. (2011) proposed several classes of commensurate prior distributions for normal endpoints. Commensurate priors adaptively adjust the amount of information borrowed from the historical data according to the degree of commensurability between the data in the historical trials and the current one.
We consider a class of commensurate prior distributions proposed by Hobbs et al. (2011) called the location commensurate prior. Let $\mu_S$ and $\mu_H$ be the mean parameters for the current and historical data respectively, and let $D_H$ denote the historical data. The location commensurate prior is a hierarchical construct: we first specify a prior distribution $p(\tau)$ for the commensurability parameter $\tau > 0$, which serves as the primary mechanism for adjusting the influence of the prior information relative to its commensurability with the data in the current trial. Conditional on $\tau$, we center the prior of $\mu_S$ at the historical mean $\mu_H$, i.e., a normal distribution with mean $\mu_H$ and precision $\tau$ (i.e., variance $1/\tau$), multiply it by the historical likelihood, and integrate over $\mu_H$, which results in a prior of the form
$$p(\mu_S \mid D_H, \tau) \propto \int L(\mu_H \mid D_H)\, \sqrt{\tau}\, \exp\!\left\{-\frac{\tau}{2}(\mu_S - \mu_H)^2\right\} d\mu_H \; p_0(\mu_S),$$
where $p_0(\mu_S)$ denotes the initial prior for $\mu_S$. As $\tau \to 0$, $p(\mu_S \mid D_H, \tau)$ approaches $p_0(\mu_S)$, so that the historical data are completely ignored due to noncommensurability; as $\tau \to \infty$, $p(\mu_S \mid D_H, \tau)$ approaches $L(\mu_S \mid D_H)\, p_0(\mu_S)$, leading to full exchangeability between the historical and current data, in which case the two data sources are equally weighted and can simply be merged.
Assume that the historical data follow a normal distribution, $N(\mu_H, \sigma_H^2)$, with sample size $n_H$, and the current data in the standard arm follow $N(\mu_S, \sigma_S^2)$. Let $\bar{y}_H$ denote the historical sample mean, and let $\hat{\sigma}_H^2$ denote the maximum likelihood estimator of $\sigma_H^2$. We specify $p(\tau)$ to be a $\mathrm{Gamma}(\nu\bar{\tau}, \nu)$ distribution with mean $\bar{\tau}$ and variance $\bar{\tau}/\nu$, $p(\mu_S \mid \mu_H, \tau)$ a normal distribution with mean $\mu_H$ and precision $\tau$, and $p_0(\mu_S) \propto 1$. The location commensurate prior for $\mu_S$ under such hierarchical models can be derived as
$$p(\mu_S \mid D_H) \propto \int_0^{\infty} \phi\!\left(\mu_S;\ \bar{y}_H,\ \hat{\sigma}_H^2/n_H + 1/\tau\right) p(\tau)\, d\tau,$$
where $\phi(\cdot; \mu, \sigma^2)$ denotes the normal density with mean $\mu$ and variance $\sigma^2$.

We apply the theoretical approach to determining posterior probability boundaries in a Bayesian double-arm sequential trial that utilizes the commensurate prior in the standard arm. For the experimental arm, we assume a vague prior $N(\mu_0, \sigma_0^2)$ for $\mu_E$. At the $k$th interim analysis, if the posterior probability $\Pr(\mu_E > \mu_S \mid D, D_H) > c_k$, where $D$ denotes the data in the current trial, we terminate the trial and declare treatment superiority; otherwise, we continue to recruit the next group of patients.
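The marginal location commensurate prior above is a continuous mixture of normals over the Gamma prior on $\tau$, and can be evaluated pointwise by quadrature. A sketch with a hypothetical historical summary ($n_H = 200$, $\bar{y}_H = 0.5$, $\hat{\sigma}_H^2 = 1$) and the weak-borrowing hyperparameters $\bar{\tau} = \nu = 1$:

```python
import numpy as np
from scipy.stats import norm, gamma
from scipy.integrate import quad

# Hypothetical historical summary and hyperparameters
n_H, ybar_H, sigma2_H_hat = 200, 0.5, 1.0
tau_bar, nu = 1.0, 1.0                   # Gamma(nu * tau_bar, nu) prior on tau

def prior_density(mu_S):
    """Location commensurate prior: N(ybar_H, sigma2_H_hat/n_H + 1/tau)
    mixed over the Gamma prior on the commensurability parameter tau."""
    f = lambda tau: (norm.pdf(mu_S, ybar_H,
                              np.sqrt(sigma2_H_hat / n_H + 1.0 / tau))
                     * gamma.pdf(tau, nu * tau_bar, scale=1.0 / nu))
    val, _ = quad(f, 0.0, np.inf)
    return val

print(round(prior_density(0.5), 3), round(prior_density(1.5), 4))
```

With $\bar{\tau} = \nu = 1$ the mixture keeps substantial mass on small $\tau$, so the prior on $\mu_S$ is much more diffuse than $N(\bar{y}_H, \hat{\sigma}_H^2/n_H)$, reflecting weak borrowing; a large $\bar{\tau}$ concentrates the prior near the historical mean.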

Numerical Evaluation
We conduct two simulation studies to assess the trial performance with informative priors: one for a single-arm design and the other for a double-arm design. In the single-arm study, we compute the empirical type I error rate based on normal endpoints with mean μ and variance 1. We are interested in testing whether μ is greater than δ = 0. The sample size is 200 and a total of K = 4 analyses are planned. The desired type I error rate is α = 0.1, and we set the prior distribution of μ to be N(0, 100). We simulate one million trials by generating random samples from the standard normal distribution N(0, 1), and the proportion of times the null hypothesis is rejected is taken as the empirical type I error rate. Figure 2 shows the target α-spending functions versus the empirical ones for the Pocock type and O'Brien-Fleming type boundaries, respectively. Clearly, the proposed method maintains the type I error rate under the nominal level, and the empirical pattern of the type I error rate spent at each stage is close to that of the target α-spending function.
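The mechanics of this single-arm simulation can be sketched as follows, with 10,000 rather than one million replications for speed. The cutoff values passed in are illustrative placeholders, not the calibrated boundaries of the paper.

```python
import numpy as np
from scipy.stats import norm

def empirical_type1(cutoffs, n_total=200, K=4, prior_var=100.0,
                    n_sim=10000, seed=1):
    """Empirical type I error rate of a Bayesian sequential design with
    normal endpoints: data are simulated under H0 (mu = 0, sigma = 1), the
    prior on mu is N(0, prior_var), and efficacy is claimed at analysis k
    whenever Pr(mu > 0 | data) exceeds cutoffs[k-1]."""
    rng = np.random.default_rng(seed)
    group = n_total // K
    rejections = 0
    for _ in range(n_sim):
        y = rng.standard_normal(n_total)          # one trial under H0
        for k in range(1, K + 1):
            n_k = k * group
            # Conjugate normal posterior given the first n_k observations
            post_prec = 1.0 / prior_var + n_k
            post_mean = y[:n_k].sum() / post_prec
            # Pr(mu > 0 | data) = Phi(post_mean / post_sd)
            if norm.cdf(post_mean * np.sqrt(post_prec)) > cutoffs[k - 1]:
                rejections += 1
                break
    return rejections / n_sim
```

With no multiplicity adjustment, e.g., all cutoffs fixed at 0.9, the empirical error rate inflates well above 0.1, echoing the motivating example; raising the cutoffs brings it back down.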
For the double-arm trial, we apply the theoretical method for calculating the posterior probability boundaries by setting c_k = Φ(z_k), where the z_k's are the critical constants from the frequentist group sequential designs. The commensurate prior is adopted for the standard arm to facilitate information borrowing from the historical data with sample size n_H = 200. As the variances in the current and historical trials are not of direct interest, for simplicity we assume that the variances are known to be 1 in both trials. Four interim analyses are planned and the sample size in each arm is 200. As we are interested in the influence of the commensurability parameter τ and the historical mean parameter μ_H on the operating characteristics of the current trial, we consider various values of μ_H and various commensurate prior distributions for τ. In particular, we consider cases where μ_H = μ_S and μ_H = μ_S ± 0.05, and we specify p(τ) to be a Gamma(ντ̄, ν) distribution with τ̄ = ν = 1 and τ̄ = ν = 1000, i.e., respective prior means of 1 and 1000 with prior variance 1, which correspond to weak and strong degrees of information borrowing. Based on 1000 trial replications, we compute the empirical type I error rates spent under the null where μ_S = μ_E = 0.5, and the power under the alternative where μ_S = 0.5 and μ_E = 0.8.

Table 2: Type I error rate spending and power in Bayesian sequential designs with normal endpoints using the commensurate priors and the theoretical posterior probability boundaries with the Pocock type and O'Brien-Fleming type α-spending functions.

Table 2 shows the type I error rate spending and power under various historical means, commensurate priors and α-spending functions. Compared with the cases with a weak degree of information borrowing (τ̄ = ν = 1), those with highly informative priors (τ̄ = ν = 1000) have higher power, but may suffer from inflation of, or over-stringency in, the type I error rate when μ_H deviates from μ_S. When the historical mean is over- or under-estimated, the design has higher or lower power, respectively. Clearly, under the Pocock type and O'Brien-Fleming type α-spending functions, the pattern of type I error rate spending matches the desired target. It is worth emphasizing that under the theoretical posterior probability boundaries, the overall type I error rate might not be controlled exactly at the target level, particularly when a complex and informative prior distribution is adopted. To achieve exact control of the type I error rate, we may compute the empirical type I error rate by simulating a large number of trials and adjust the posterior probability boundaries to c_k = Φ(z_k) + ζ_k, where the value of ζ_k can be easily calibrated via a grid search or bisection search so that the empirical type I error rate is maintained at the desired level. For example, we may set ζ_k = {1 − Φ(z_k)} × u, where 0 ≤ u ≤ 1, and numerically calibrate the value of u. The middle part of Table 2 shows the type I error rate spending where u is calibrated to be 0.141 and 0.237 under the null case (μ_H = 0.5) with the Pocock type and the O'Brien-Fleming type boundaries, respectively.

Wathen and Thall (2008) proposed a Bayesian doubly optimal group sequential design, abbreviated as "BDOGS", which optimizes an expected utility function under frequentist constraints. As a comparison, we consider the BDOGS design for the binary endpoint, which takes an interim look every time a new outcome is observed. At each interim analysis, the posterior probability is updated and compared with a boundary function P_U(n) = a_U + b_U (n/N_max)^{c_U}, where n is the cumulative sample size and N_max is the maximum sample size, to decide whether the trial should be stopped for efficacy.
The design calibration involves finding the optimal values of the parameters (a_U, b_U, c_U) such that the expected utility is optimized under constraints on the type I error rate and power. The expected utility under the BDOGS design is specified as the average of the expected sample sizes under the null and alternative hypotheses.

Design Comparison
Calculations of the expected utility, type I error rate and power are conducted via a forward simulation approach (Carlin et al., 1998), and a simple grid search is used to find the optimal parameters (a_U, b_U, c_U).
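A forward-simulation sketch of this evaluation for the binary single-arm setting is given below. It adopts the boundary function P_U(n), the Beta(0.2, 0.8) prior, and the parameter values reported later in this section, with monitoring after every outcome; details of the actual BDOGS implementation, such as a minimum sample size before monitoring starts, may differ.

```python
import numpy as np
from scipy.stats import beta

def success_prob(p_true, a_U=0.985, b_U=-0.015, c_U=0.600, N_max=45,
                 p_null=0.2, a0=0.2, b0=0.8, n_sim=4000, seed=3):
    """Forward-simulate a single-arm binary trial monitored after every
    outcome: stop for efficacy once Pr(p > p_null | data) exceeds
    P_U(n) = a_U + b_U * (n / N_max)**c_U."""
    rng = np.random.default_rng(seed)
    n = np.arange(1, N_max + 1)
    bound = a_U + b_U * (n / N_max) ** c_U        # efficacy boundary P_U(n)
    # Pre-tabulate Pr(p > p_null | s successes in m patients) for all (m, s)
    post = np.zeros((N_max + 1, N_max + 1))
    for m in range(1, N_max + 1):
        s = np.arange(m + 1)
        post[m, : m + 1] = beta.sf(p_null, a0 + s, b0 + m - s)
    stops = 0
    for _ in range(n_sim):
        s = 0
        for m in range(1, N_max + 1):
            s += rng.random() < p_true            # accrue one new outcome
            if post[m, s] > bound[m - 1]:
                stops += 1                        # efficacy stop
                break
    return stops / n_sim
```

Evaluating `success_prob` under the null response rate gives the simulated type I error rate, and under the alternative gives the power, the two constrained quantities in the grid search.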
The proposed Bayesian group sequential designs incorporating the Pocock type and O'Brien-Fleming type α-spending functions are compared with the BDOGS design, under the constraints that the type I error rate is at most 0.1 and the power is at least 0.8. Considering a binary endpoint, the single-arm design aims to test the hypotheses H_0: p_E ≤ 0.2 versus H_1: p_E > 0.4. For the Bayesian group sequential designs with K = 4 analyses, the minimally required group sizes are 11 for the Pocock type design and 9 for the O'Brien-Fleming type design. For the BDOGS design, we specify N_max = 45, and the optimal parameters are found to be (a_U, b_U, c_U) = (0.985, −0.015, 0.600). The Supplementary Material (Shi and Yin, 2018) shows the stopping boundaries for the three designs under comparison, and Figure 2 therein exhibits the distributions of the type I error rate spending. Except for the first interim analysis under the O'Brien-Fleming type design, the boundary values of the Bayesian group sequential designs are smaller than those of the BDOGS design, as the latter requires many more interim looks. In terms of the type I error rate spending, the Bayesian group sequential design allows the pattern of the spending distribution to be specified, whereas the BDOGS design is less flexible and the majority of the spending is concentrated at the first few analyses. The expected sample sizes are similar across the three designs: 30.7, 32.7 and 30.1 for the BDOGS, Pocock type and O'Brien-Fleming type designs, respectively.

Thall and Simon (1994) described a single-arm clinical trial of fludarabine + ara-C + granulocyte colony stimulating factor (G-CSF) in the treatment of acute myeloid leukemia. The study aimed to assess whether the addition of G-CSF to the standard therapy (fludarabine + ara-C) can improve the clinical outcomes of the patients. Complete remission of the disease is the binary endpoint of the study. We provide an illustrative trial example where the proposed design is adopted. The trial has a maximum sample size of N_max = 160, and a total of K = 4 analyses are to be carried out. We take the type I error rate α = 0.1, p_null = 0.2, and a noninformative prior p_E ∼ Beta(0.2, 0.8). For the Pocock design and the O'Brien-Fleming design, the α-spending functions are respectively given by

α_P(t) = α log{1 + (e − 1)t}  and  α_OF(t) = 2{1 − Φ(z_{α/2}/√t)},

where t ∈ [0, 1] denotes the information fraction, taking values 0.25, 0.5, 0.75 and 1 in our case, and z_{α/2} denotes the 100(1 − α/2)th percentile of the standard normal distribution. The target type I error rate to be spent at the kth analysis is thus α(k/4) − α{(k − 1)/4}. We provide Bayesian sequential designs whose empirical type I error spending functions are calibrated towards those of the two classical group sequential designs, respectively. Table 3 shows the values of the posterior probability cutoff and the α-spending function at each interim analysis for the Bayesian Pocock type and O'Brien-Fleming type sequential designs. It is worth emphasizing that because P(H_1 | D_k) is discrete and takes a finite number of values, the upper cutoff c_k can take any value within a certain interval while satisfying the type I error constraint. For example, in the Pocock type sequential design, the first cutoff c_1 can be any value within the interval (0.923, 0.963). Figures 3 and 4 show the posterior probability cutoff intervals and the empirical type I error spending functions versus the target for the Pocock and O'Brien-Fleming designs, respectively. Because the endpoint is binary, exact calibration to the target function is not possible, and the empirical spending function under the proposed methods deviates slightly from the target. For the Pocock type design, the numerical method (dashed) and the theoretical method (dot-dashed) yield similar solutions, although the former produces a closer fit to the target α-spending function.
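The two α-spending functions can be evaluated directly; the sketch below computes the per-analysis spending increments at the information fractions used in this trial, using the standard Lan-DeMets forms that mimic the Pocock and O'Brien-Fleming boundaries with α = 0.1.

```python
import math
from statistics import NormalDist

ALPHA = 0.1

def pocock_spend(t, alpha=ALPHA):
    # Lan-DeMets spending function mimicking Pocock boundaries
    return alpha * math.log(1 + (math.e - 1) * t)

def obf_spend(t, alpha=ALPHA):
    # Lan-DeMets spending function mimicking O'Brien-Fleming boundaries
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / math.sqrt(t)))

fracs = [0.25, 0.5, 0.75, 1.0]
# Type I error spent at each of the K = 4 analyses: alpha(t_k) - alpha(t_{k-1})
pocock_inc = [pocock_spend(t) - pocock_spend(s)
              for s, t in zip([0] + fracs, fracs)]
obf_inc = [obf_spend(t) - obf_spend(s) if s else obf_spend(t)
           for s, t in zip([0] + fracs, fracs)]
```

Both functions spend the full α = 0.1 by t = 1, with the O'Brien-Fleming form spending far less at the early analyses, consistent with its more stringent early cutoffs.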
Similar to the constant critical values in the frequentist Pocock design, the posterior probability cutoffs are also constant, with a value of 0.958 under the theoretical calibration, while the critical intervals under the numerical approach are close to one another with substantial overlap. For the O'Brien-Fleming type design, the theoretical method produces posterior probability cutoffs of 0.998, 0.977, 0.948 and 0.920, while the numerical method leads to cutoff intervals that tend to decline over the course of the trial. The O'Brien-Fleming sequential design imposes more stringent posterior probability cutoffs at the early stages of the trial and then gradually relaxes the cutoffs as the trial progresses.

Jennison and Turnbull (1989) provided a formulation of repeated confidence intervals across interim analyses under a group sequential design. As a counterpart in the Bayesian paradigm, a similar notion of the repeated credible interval can be naturally developed. In particular, we obtain the posterior distribution of the parameter of interest and adopt the highest posterior density interval repeatedly, based on the type I error rate spent at each interim analysis. Figure 3 in the Supplementary Material shows the repeated credible intervals when the number of responses attains the efficacy boundary, i.e., the minimum value of y_k such that P(H_1 | D_k) > c_k is satisfied, which can be back-solved from the monotonic relationship between y_k and the posterior probability P(H_1 | D_k). As more data are accumulated at each interim analysis, the width of the repeated credible interval decreases.

Maki et al. (2007) described a phase II randomized study comparing the efficacy of gemcitabine alone with that of the gemcitabine-docetaxel combination in the treatment of metastatic soft tissue sarcoma. Based on the binary outcomes of tumor response, the study aimed to determine whether the addition of docetaxel could improve the efficacy of gemcitabine.
For illustrative purposes, we applied the proposed design to the trial and examined the empirical type I error rates under various types of sequential boundaries. We experimented with sample sizes of 50 and 500 per arm, and for both cases K = 5 interim analyses were considered. We took noninformative prior distributions, p_S ∼ Beta(0.2, 0.8) and p_E ∼ Beta(0.2, 0.8), and the type I error rate was controlled at α = 0.1. The type I error rates were computed as the probabilities of trial success with p_E = p_S, at different values of p_E (or p_S). Figure 5 shows the type I error rates under different types of boundaries. Due to the finite sample size, the joint distribution of the test statistics at the interim analyses may deviate from the canonical multivariate normal distribution under the frequentist group sequential framework. As a result, the type I error rates can differ from the nominal level. As expected, when no adjustment is made to account for the multiplicity, i.e., when the posterior probability boundaries are all set equal to 1 − α = 0.9 throughout the trial, the type I error rate is inflated up to the level of 25%. Both the Pocock type and O'Brien-Fleming type boundaries work well in the large-sample case, but suffer from slight inflation of the type I error rate when the response rate is very low in the small-sample case.

Figure 5: Type I error rates under different types of boundaries in Bayesian double-arm sequential designs with binary endpoints and K = 5: (a) sample size of 50 subjects per arm and (b) sample size of 500 subjects per arm. The solid line represents the type I error rate with a fixed posterior probability cutoff of 0.9 throughout the trial, and the dashed and dotted lines correspond to the posterior probability cutoffs calibrated using the Pocock type and O'Brien-Fleming type α-spending functions.
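The posterior probability of treatment superiority monitored in this double-arm design, Pr(p_E > p_S | D), can be approximated by sampling from the two independent Beta posteriors. A minimal sketch under the Beta(0.2, 0.8) priors follows; the response counts and sample sizes passed in are illustrative, not trial data.

```python
import numpy as np

def prob_superiority(y_e, n_e, y_s, n_s, a0=0.2, b0=0.8,
                     n_draws=200_000, seed=11):
    """Monte Carlo estimate of Pr(p_E > p_S | data), where each arm has an
    independent conjugate Beta(a0 + y, b0 + n - y) posterior."""
    rng = np.random.default_rng(seed)
    p_e = rng.beta(a0 + y_e, b0 + n_e - y_e, n_draws)
    p_s = rng.beta(a0 + y_s, b0 + n_s - y_s, n_draws)
    return (p_e > p_s).mean()
```

At each interim analysis this quantity would be compared against the calibrated cutoff c_k; with identical data in the two arms the estimate sits near 0.5, and it approaches 1 as the experimental arm pulls ahead.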

Discussion
Controlling the type I error rate for clinical trial designs that involve multiple interim assessments of the posterior probability is an often neglected aspect of Bayesian sequential designs. Although Bayesian methods can serve as a useful and flexible alternative to conventional frequentist designs, it is crucial to understand their frequentist properties. As shown in the motivating example, a failure to account for the multiplicity in a Bayesian trial may lead to severe inflation of the type I error rate. The proposed method connects the multiple testing aspect of Bayesian designs with that of the frequentist group sequential method. Although the theoretical method is primarily applied to the binary case with an assumed beta prior distribution on the response rate, it can be used under a more general family of prior and posterior models, as long as the regularity conditions for the asymptotic properties of the posterior probability are satisfied.
We consider both binary and normal endpoints, and both single- and double-arm trials. We develop a numerical approach as well as establish a connection, based on the asymptotic properties of the posterior probability, between the Bayesian sequential design and its frequentist counterpart. The numerical method involves the calculation of the exact type I error rate. When the number of analyses and the group size are large, the summation of the products of binomial probabilities can be computationally intensive. To overcome this issue, simulation-based computation or a normal approximation to the binomial distribution might be preferred.
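The exact computation referred to above can be organized as a recursion over the continuation states, so the products of binomial probabilities are accumulated stage by stage rather than enumerated over all sample paths. The boundary values used in the example below are hypothetical response cutoffs for illustration only.

```python
import numpy as np
from scipy.stats import binom

def exact_type1(bounds, group, p_null=0.2):
    """Exact type I error rate of a single-arm binary group sequential
    design, where bounds[k] is the minimum cumulative number of responses
    needed to stop for efficacy at analysis k and `group` patients accrue
    per stage. The distribution of the cumulative response count is
    propagated over the continuation states under the null rate p_null."""
    pmf = binom.pmf(np.arange(group + 1), group, p_null)
    cont = {0: 1.0}   # Pr(reach current stage with s responses, no stop yet)
    error = 0.0
    for b in bounds:
        new = {}
        for s, pr in cont.items():
            for j in range(group + 1):
                t, q = s + j, pr * pmf[j]
                if t >= b:
                    error += q   # crosses the efficacy boundary: type I error
                else:
                    new[t] = new.get(t, 0.0) + q
        cont = new
    return error
```

Because the number of continuation states grows only with the cumulative sample size, the recursion stays tractable for moderate designs; for many analyses with large groups, the simulation-based alternative mentioned above becomes preferable.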
For the numerical approach, the sandwich-type algorithm can be generalized to more complex model settings. The error rate formulation is similar to that in (2) except for the binomial distribution, which can be approximated using a simulation-based approach. A more general formulation of the error rates can be implemented by first setting up a null parametric model representing H_0 and then simulating a large number of trials; the proportion of trials that reach the decision boundary at each interim analysis serves as an approximation to the type I error rate. Based on the prespecified error rates, suitable design parameters can be calibrated by either a grid search or a bisection approach to yield the desired pattern of type I error rate spending. Examples of such calibration methods can be found in Murray et al. (2016) and Murray et al. (2017).
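As a concrete instance of such calibration, the sketch below bisects on the single parameter u in c_k = Φ(z_k) + {1 − Φ(z_k)}u, introduced earlier, against a simulated type I error rate for the normal single-arm setting. The base cutoffs are illustrative placeholders rather than the paper's calibrated values, and a fixed seed makes the simulated error deterministic so the bisection is well behaved.

```python
import numpy as np
from scipy.stats import norm

def empirical_error(cutoffs, n_total=200, K=4, n_sim=20000, seed=7):
    # Simulated type I error under H0 (mu = 0, sigma = 1) with a vague
    # N(0, 100) prior; the fixed seed reuses the same trials for every call.
    rng = np.random.default_rng(seed)
    y = rng.standard_normal((n_sim, n_total))
    group = n_total // K
    rejected = np.zeros(n_sim, dtype=bool)
    for k in range(1, K + 1):
        n_k = k * group
        post_prec = 0.01 + n_k                    # prior precision 1/100 plus n_k
        # Pr(mu > 0 | data) = Phi(sum(y) / sqrt(post_prec))
        post_prob = norm.cdf(y[:, :n_k].sum(axis=1) / np.sqrt(post_prec))
        rejected |= post_prob > cutoffs[k - 1]
    return rejected.mean()

def calibrate_u(base, target=0.1, tol=1e-4):
    """Bisection on u in c_k = base[k] + (1 - base[k]) * u so the simulated
    type I error matches the target; base[k] plays the role of Phi(z_k)."""
    lo, hi = 0.0, 1.0          # error(lo) >= target >= error(hi)
    while hi - lo > tol:
        u = (lo + hi) / 2
        c = [b + (1 - b) * u for b in base]
        if empirical_error(c) > target:
            lo = u             # error too large: raise the cutoffs
        else:
            hi = u
    return (lo + hi) / 2
```

The same scheme applies with a grid search in place of bisection, or with the forward-simulated error of a more complex null model substituted for `empirical_error`.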
For the theoretical approach, we assume a noninformative prior distribution when assessing the design's operating characteristics, and show that the type I error rates can be well maintained under such a prior assumption. It should be emphasized that the theorem in Dudley and Haughton (2002) holds only asymptotically, and simulation studies of the finite-sample performance of the design might be necessary for assessing the adequacy of the theoretical boundaries. In settings where the prior distribution is highly informative or the sample size is relatively small, the empirical performance of the decision boundaries under the theoretical approach might not be satisfactory. It is then advisable to adopt the numerical approach, either by explicitly formulating the error rates as discussed in this paper, or by averaging the number of error cases via computer simulation.