Vaccine clinical trials with dynamic borrowing of historical controls: Two retrospective studies

Traditional vaccine efficacy trials usually use fixed designs with fairly large sample sizes. Recruiting a large number of subjects requires longer time and higher costs. Furthermore, vaccine developers are more than ever facing the need to accelerate vaccine development to fulfill the public's medical needs. A possible approach to accelerate development is to use the method of dynamic borrowing of historical controls in clinical trials. In this paper, we evaluate the feasibility and the performance of this approach in vaccine development by retrospectively analyzing two real vaccine studies: a relatively small immunological trial (typical early phase study) and a large vaccine efficacy trial (typical Phase 3 study) assessing prophylactic human papillomavirus vaccine. Results are promising, particularly for early development immunological studies, where the adaptive design is feasible, and control of type I error is less relevant.


| INTRODUCTION
The use of methods for the dynamic borrowing of historical information in clinical trials is increasingly accepted by regulatory authorities. These methods allow historical information to be used to a degree commensurate with its similarity to the current information. Various approaches for dynamic borrowing have been proposed in the literature. Pocock 1 proposed the first Bayesian method for combining Historical Control (HC) data with control data in a new randomized clinical trial. Many methods have been proposed in the last 20 years, such as the power prior, 2 the modified power prior, 3,4 the meta-analytic predictive (MAP) 5 method, and the commensurate prior. 6 More recently, Schmidli et al. 7 proposed an extension of the MAP approach, called robust MAP prior, in which the MAP prior is mixed with a vague component, to account for the possibility of drift between historical and current data (where the drift is the distance between the current and the historical parameter). The performance of some of these methods has been compared by simulations (see, for example, references [8][9][10]). For vaccine efficacy (VE) trials, Jin et al. 11 considered borrowing information for treatment effects and proposed a Bayesian framework under the exact conditional binomial test 12 for vaccine trial sample size determination.
In this paper, we consider dynamic borrowing of HC for vaccine trials, where HC data is borrowed using the robust mixture prior (RMP) method which corresponds to the robust MAP prior 7 in case of only one historical study, where the heterogeneity among historical controls from different studies is ignored. We also propose and consider a new version of RMP with significance level adjusted for the observed drift (drift-adjusted alpha RMP). We apply this approach retrospectively to two real Human Papillomavirus (HPV) vaccine trials: a relatively small immunological trial, for which we use an adaptive HC design, and a large VE trial, for which we use a fixed design.
HPV is the most common viral infection of the reproductive tract and is the cause of a range of conditions in both men and women, including precancerous lesions that may progress to cancer. In women, persistent infection with specific HPV types (most frequently HPV-16 and HPV-18) may lead to precancerous lesions, which, if untreated, may progress to cervical cancer. 13 Cervical cancer is the fourth most common cancer among women worldwide, with around 569,000 new cases and 311,000 deaths reported in 2018. 14 Approximately 70% of cervical cancer cases are attributable to high-risk HPV-16 and -18, with HPV-31, -33, -35, -45, -51, -52, and -58 contributing to an additional 20% of cases. 15 Three prophylactic HPV vaccines are currently available and marketed in many countries worldwide for the prevention of HPV-related disease: the HPV-6/11/16/18 vaccine (quadrivalent, Gardasil, Merck) was first licensed in 2006, the AS04-adjuvanted HPV-16/18 vaccine (AS04-HPV-16/18, bivalent, Cervarix, GSK) in 2007 and the HPV-6/11/16/18/31/33/45/52/58 vaccine (nonavalent, Gardasil 9, Merck) in 2014. For women aged 15 years and older, three doses are recommended, according to a 0-, 1- and 6-month schedule for the bivalent vaccine 16 or a 0-, 2- and 6-month schedule for the quadrivalent and nonavalent vaccines. 17 Several clinical trials have been conducted to assess the safety, efficacy and immunogenicity of these vaccines. Generally, persistent infection and/or high-grade cervical intraepithelial neoplasia (CIN) caused by HPV types are used as disease endpoints in the VE trials, and antibody concentrations against the HPV types are used as immunological endpoints. 18 To evaluate the methodology in an immunogenicity trial setting, we considered the data from two randomized, controlled, observer-blind trials (NCT01031069 19 and NCT00423046 20 ) comparing the immunogenicity of bivalent vaccine versus quadrivalent vaccine in young women.
To evaluate the methodology in a VE trial setting, we considered the data from two large randomized, controlled, double-blind phase 3 trials (NCT00122681 21 and NCT00128661 22 ) evaluating the efficacy of bivalent vaccine in young women aged 15-26 years. The two examples represent two different settings in vaccine clinical development: one is a typical early phase study and the other is a confirmatory Phase 3 study. These two settings have different goals and different requirements from the authorities. Furthermore, the two examples correspond to cases of continuous and binary endpoints.

| Notations and definitions
In this section, we introduce the basic notations and definitions that are used throughout the paper. For ease of notation, we are not using the hat operator to denote the estimated values.

| Immunogenicity trials
Endpoints based on immune response, such as antibody levels, are often used to make decisions in early vaccine development (e.g., determination of the vaccine dose, schedule). Furthermore, immunogenicity outcomes play a critical role throughout all the phases of vaccine development (e.g., bridging trials, and lot-to-lot consistency trials).
Let us denote by $y_{g,i}$ the $\log_{10}$-transformed (humoral or cellular) immune readout value observed for the $i$th subject in treatment group $g$, with $g = v$ for the vaccine group or $g = c$ for the control group. Let us assume that $Y_{g,i} \sim N(\mu_g, \sigma_g^2)$; $N_v$ and $N_c$ represent the sample sizes of the $v$ and $c$ groups, respectively. A classical measure of immunological vaccine effect is the geometric mean ratio (GMR). The log of the GMR corresponds to a mean difference on the log scale:

$$\log_{10}(\mathrm{GMR}) = \mu_v - \mu_c,$$

which can be estimated using linear regressions, potentially adjusting for covariates. The one-sided hypothesis (superiority test) for assessing immunogenicity is represented as follows:

$$H_0: \mathrm{GMR} \le \mathrm{GMR}_0 \quad \text{versus} \quad H_1: \mathrm{GMR} > \mathrm{GMR}_0,$$

where $\mathrm{GMR}_0$ represents the superiority margin.
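As a minimal sketch of the superiority test described above, the following Python code estimates $\log_{10}(\mathrm{GMR})$ as a difference of group means and compares the lower confidence limit with the margin. It assumes an unadjusted two-group comparison with a normal-approximation confidence limit, whereas the text mentions linear regression with possible covariate adjustment; the function name `gmr_superiority`, its arguments, and the toy data are hypothetical.

```python
import numpy as np
from statistics import NormalDist


def gmr_superiority(y_v, y_c, gmr0=1.0, alpha=0.025):
    """Estimate the GMR from log10 readouts and test H1: GMR > gmr0.

    Assumption: unadjusted comparison with a normal-approximation
    confidence limit (an illustration, not the authors' model).
    """
    y_v = np.asarray(y_v, dtype=float)
    y_c = np.asarray(y_c, dtype=float)
    diff = y_v.mean() - y_c.mean()  # log10(GMR) = mu_v - mu_c
    se = np.sqrt(y_v.var(ddof=1) / y_v.size + y_c.var(ddof=1) / y_c.size)
    z = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    lower = 10 ** (diff - z * se)  # lower limit back on the GMR scale
    return {"gmr": 10 ** diff, "lower": lower, "reject_h0": bool(lower > gmr0)}


# Toy log10 readouts: the vaccine group is shifted by 0.5 log10 units.
y_c = np.array([4.1, 4.3, 4.4, 4.6, 4.2, 4.5])
y_v = y_c + 0.5
result = gmr_superiority(y_v, y_c)
```

With real data, a t-based limit or a regression model would normally replace the normal approximation used here.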

| Vaccine efficacy trials
In a classical phase 3 vaccine efficacy (VE) trial with allocation ratio $k{:}1$, the clinical endpoint of the study is compared between the vaccine group ($v$) and the control group ($c$). Let us assume that the clinical outcome is binary (e.g., $y_{g,i} = 1$ if the $i$th subject in group $g = v, c$ is a case; $y_{g,i} = 0$ otherwise). $N_v$ and $N_c$ represent the sample sizes of the $v$ and $c$ groups of the phase 3 study, respectively. Let us assume $Y_{g,i} \sim \mathrm{Bin}(1, p_g)$ and denote by $X_g \sim \mathrm{Bin}(N_g, p_g)$ the number of cases in group $g = v, c$. A classical measure of VE is one minus the relative risk:

$$VE = 1 - \frac{p_v}{p_c}.$$

A classical null hypothesis of a phase 3 VE trial is rejected when the lower limit of the $(1 - \alpha)\%$ confidence interval (CI) of the VE is above the superiority margin $VE_0$. The one-sided hypothesis for assessing VE is represented as follows:

$$H_0: VE \le VE_0 \quad \text{versus} \quad H_1: VE > VE_0.$$

Different Bayesian VE models have been proposed in the literature 11,23,24 (see Appendix A for more details). We performed a small simulation study to compare different models in the setting of our case study (see Appendix B). Based on our simulation results, the approach with the best frequentist performance is the model with two independent Jeffreys beta priors ($p_c \sim \mathrm{Beta}(0.5, 0.5)$; $p_v \sim \mathrm{Beta}(0.5, 0.5)$). The good frequentist performance of this model was also highlighted by Agresti and Min. 25 We considered this model for the analysis of the case study.
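The model with two independent Jeffreys beta priors can be sketched by Monte Carlo sampling from the two conjugate beta posteriors. The code below is an illustration under that assumption, not the authors' implementation; the function names, the number of draws, and the seed are hypothetical. The counts in the usage lines are the ATP counts of the efficacy current study reported in the data description.

```python
import numpy as np


def ve_posterior_jeffreys(x_v, n_v, x_c, n_c, draws=200_000, seed=1):
    """Draw from the posterior of VE = 1 - p_v / p_c.

    Each risk has an independent Beta(0.5, 0.5) prior, so the posteriors
    are Beta(0.5 + cases, 0.5 + non-cases).
    """
    rng = np.random.default_rng(seed)
    p_v = rng.beta(0.5 + x_v, 0.5 + n_v - x_v, size=draws)
    p_c = rng.beta(0.5 + x_c, 0.5 + n_c - x_c, size=draws)
    return 1.0 - p_v / p_c


def ve_summary(ve_draws, ve0=0.25, alpha=0.025):
    """Posterior median, equal-tailed credible limits, and success flag."""
    lower, median, upper = np.quantile(ve_draws, [alpha, 0.5, 1 - alpha])
    return {
        "median": float(median),
        "lower": float(lower),
        "upper": float(upper),
        "success": bool(lower > ve0),  # lower limit above the margin VE_0
    }


# ATP counts of the efficacy current study: 1/2635 (vaccine), 10/2677 (control).
ve_draws = ve_posterior_jeffreys(x_v=1, n_v=2635, x_c=10, n_c=2677)
summary = ve_summary(ve_draws)
```

The resulting summary can be compared with the analysis without historical controls reported later in the paper.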

| Robust mixture prior model
Based on our case study, we suppose that only one historical controlled clinical trial is available. The outcome of the historical control group ($h$) with $N_h$ subjects is distributed as $Y_h \sim N(\mu_h, \sigma_h^2)$ for normal outcomes (immunogenicity trials) and $X_h \sim \mathrm{Bin}(N_h, p_h)$ for binary outcomes (vaccine efficacy trials).
The RMP for a parameter of interest $\theta_c$ is

$$p_{\mathrm{RMP}}(\theta_c) = w\, p_H(\theta_c) + (1 - w)\, p_{\mathrm{vague}}(\theta_c),$$

where $p_H(\theta_c)$ is the component synthesizing the information from the HC arm, $p_{\mathrm{vague}}(\theta_c)$ is a vague distribution and $w$, $0 \le w \le 1$, is the (pre-specified) prior probability that the new trial does not differ systematically from the historical trial. When $w = 1$, the RMP prior is just the historical component, therefore leading to full borrowing. At the other extreme, when $w = 0$, the historical data are discarded. Values $0 < w < 1$ lead to partial borrowing of the historical information. The pre-specified weight $w$ and the vague prior parameters may have a strong impact on the results. The choice of $w$ would depend on clinical/scientific judgment of the similarity of the historical and current controls. In this paper we use the weight value of $w = 60\%$ and the "unit information priors". 7 In the case of normally distributed data (immunological studies; $\theta_c = \mu_c$), we use the following RMP:

$$p_{\mathrm{RMP}}(\mu_c) = w\, N\!\left(\bar{y}_h, \frac{\sigma_h^2}{N_h}\right) + (1 - w)\, p_{\mathrm{vague}}(\mu_c), \qquad (3)$$

where $\bar{y}_h = \sum_{i=1}^{N_h} y_{h,i} / N_h$ is the average of the $\log_{10}$-transformed immunological historical control data, $\sigma_h^2 / N_h$ is its estimated variance, and $p_{\mathrm{vague}}(\mu_c)$ is the unit-information normal prior. It should be noted that Equation (3) is a simplification, as the HC prior is approximated by a normal distribution.
For binary endpoints (VE studies; $\theta_c = p_c$), we considered the Jeffreys beta priors RMP:

$$p_{\mathrm{RMP}}(p_c) = w\, \mathrm{Beta}(a + x_h,\ b + N_h - x_h) + (1 - w)\, \mathrm{Beta}(a_0, b_0),$$

where $a = b = a_0 = b_0 = 0.5$ and $x_h$ is the observed number of HC cases. RMP is closely related to the robust MAP prior approach. 7 The robust MAP prior is a mixture of the MAP prior coming from multiple historical trials and a vague prior. In our setting, since only one historical trial is available, we replaced the MAP prior by the posterior of the historical control group. Callegaro et al. 26 showed the relationship between RMP and the Pocock prior. 1

| Posterior distribution and updated weights
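Since the robust prior $p_{\mathrm{RMP}}(\theta_c)$ is a mixture of conjugate priors, the posterior is also a mixture of conjugate posteriors, with updated mixture weights ($\tilde{w}$).
For normally distributed data (assuming $\sigma_c = \sigma_h = \sigma$) the posterior is given by

$$p(\mu_c \mid y_c) = \tilde{w}\, N\!\left(\frac{N_h \bar{y}_h + N_c \bar{y}_c}{N_h + N_c},\ \frac{\sigma^2}{N_h + N_c}\right) + (1 - \tilde{w})\, p_{\mathrm{vague}}(\mu_c \mid y_c),$$

where $\bar{y}_c$ is the mean of the current control data and $p_{\mathrm{vague}}(\mu_c \mid y_c)$ is the conjugate update of the vague component. For binary data the posterior is given by

$$p(p_c \mid x_c) = \tilde{w}\, \mathrm{Beta}(a + x_h + x_c,\ b + N_h - x_h + N_c - x_c) + (1 - \tilde{w})\, \mathrm{Beta}(a_0 + x_c,\ b_0 + N_c - x_c),$$

where $x_c$ is the observed number of cases in the current control group. The updated weight ($\tilde{w}$) is a function of the weight $w$ and the marginal likelihoods of the new data under the historical ($f_h$) and vague ($f_{\mathrm{vague}}$) models,

$$\tilde{w} = \frac{w f_h}{w f_h + (1 - w) f_{\mathrm{vague}}},$$

7 and may give some indication of the relationship between historical and current data.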
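For the binary case, the marginal likelihoods $f_h$ and $f_{\mathrm{vague}}$ are beta-binomial probabilities, so the updated weight can be computed in closed form. The sketch below assumes the Jeffreys beta components described above and works on the log scale for numerical stability; the function names are hypothetical. The historical counts (97/7305) are the efficacy HC controls; the current counts are illustrative (10/2677 are the full ATP controls of the current study, and 36/2677 is a made-up count close to the historical rate).

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_betabinom(x, n, a, b):
    """Log marginal probability of x cases out of n under a Beta(a, b) prior."""
    log_choose = lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
    return log_choose + log_beta(a + x, b + n - x) - log_beta(a, b)


def updated_weight(w, x_c, n_c, x_h, n_h, a=0.5, b=0.5, a0=0.5, b0=0.5):
    """Posterior mixture weight of the historical component."""
    if w in (0.0, 1.0):
        return float(w)  # degenerate mixtures keep their weight
    log_f_h = log_betabinom(x_c, n_c, a + x_h, b + n_h - x_h)
    log_f_vague = log_betabinom(x_c, n_c, a0, b0)
    m = max(log_f_h, log_f_vague)  # rescale before exponentiating
    num = w * exp(log_f_h - m)
    return num / (num + (1 - w) * exp(log_f_vague - m))


# Current control rate far below the historical rate: the weight drops.
w_drift = updated_weight(0.6, x_c=10, n_c=2677, x_h=97, n_h=7305)
# Hypothetical current control rate close to the historical rate: the weight rises.
w_close = updated_weight(0.6, x_c=36, n_c=2677, x_h=97, n_h=7305)
```

The two calls illustrate the dynamic behaviour of the prior: the historical component is down-weighted when the current control data conflict with it and up-weighted when they agree.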

| Drift-adjusted alpha RMP
The RMP is a flexible and well-established approach for dynamic borrowing of HC. One consequence of the Bayesian reasoning framework of this approach (and other methods similar in spirit to RMP) is that type I error is not controlled for moderate values of the drift. One possibility to control the type I error of RMP is to use a significance level that depends on the drift (drift-adjusted alpha). We will explore this approach on the VE case study. We derive the adjusted alpha ($\alpha_\delta$) by simulations. More specifically, for a certain value of the drift $\delta$ we (1) simulated clinical trials under the null hypothesis ($VE = VE_0$) with $p_c = p_h + \delta$, and (2) derived $\alpha_\delta$ in such a way that the type I error (measured as the proportion of simulated trials with the lower limit of the $(1 - 2\alpha_\delta)\%$ credible interval above $VE_0$) was equal to $\alpha$. An approach similar in spirit was considered by Cuffe 27 for the Pocock model. 1 When the current data are observed, the adjustment is based on the observed drift ($\hat{\delta} = \hat{p}_c - \hat{p}_h$) to control the type I error rate conditional on the estimated value of $\delta$. The rationale for this approach is that $\hat{\delta}$ is an unbiased estimator of $\delta$. A more conservative procedure may be considered by choosing the smallest $\alpha_\delta$ observed in the plausible range of $\delta$ values. The advantages of this second approach are the following: (i) it guarantees protection of the type I error irrespective of the (unknown) value of $\delta$; (ii) a single critical value is chosen in advance of the trial. Unfortunately, this approach is too conservative in our context, undermining the benefit of the historical data. For this reason, we consider instead the other approach (conditioned on the observed drift). Even if it is based on strong assumptions, it tackles the problem of controlling the type I error while keeping historical controls potentially useful.
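The two simulation steps above can be sketched as follows. The lower limit of the equal-tailed $(1 - 2\alpha_\delta)\%$ credible interval is above $VE_0$ exactly when the posterior probability $P(VE \le VE_0 \mid \text{data})$ is below $\alpha_\delta$, so $\alpha_\delta$ can be taken as the empirical $\alpha$-quantile of that probability over trials simulated under the null. This is an illustration, not the authors' code: it assumes the Jeffreys beta RMP for the controls, a Jeffreys prior for the vaccine arm, $p_h$ fixed at the observed historical rate, $0 < w < 1$, and small simulation sizes; all function names and the sample sizes in the usage line are hypothetical.

```python
import numpy as np
from math import exp, lgamma


def _log_betabinom(x, n, a, b):
    """Log beta-binomial probability of x cases out of n."""
    lb = lambda p, q: lgamma(p) + lgamma(q) - lgamma(p + q)
    choose = lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
    return choose + lb(a + x, b + n - x) - lb(a, b)


def _updated_weight(w, x_c, n_c, x_h, n_h):
    """Posterior weight of the historical component (requires 0 < w < 1)."""
    lf_h = _log_betabinom(x_c, n_c, 0.5 + x_h, 0.5 + n_h - x_h)
    lf_v = _log_betabinom(x_c, n_c, 0.5, 0.5)
    m = max(lf_h, lf_v)
    num = w * exp(lf_h - m)
    return num / (num + (1 - w) * exp(lf_v - m))


def prob_null(x_v, n_v, x_c, n_c, x_h, n_h, w, ve0, rng, draws=2000):
    """Monte Carlo estimate of P(VE <= ve0 | data) under the RMP posterior."""
    wt = _updated_weight(w, x_c, n_c, x_h, n_h)
    hist = rng.beta(0.5 + x_h + x_c, 0.5 + n_h - x_h + n_c - x_c, draws)
    vague = rng.beta(0.5 + x_c, 0.5 + n_c - x_c, draws)
    p_c = np.where(rng.random(draws) < wt, hist, vague)  # mixture draws
    p_v = rng.beta(0.5 + x_v, 0.5 + n_v - x_v, draws)
    return float(np.mean(1 - p_v / p_c <= ve0))


def drift_adjusted_alpha(delta, x_h, n_h, n_v, n_c, w=0.6, ve0=0.25,
                         alpha=0.025, n_sims=300, seed=0):
    """Calibrate alpha_delta so that the simulated type I error is about alpha."""
    rng = np.random.default_rng(seed)
    p_c = x_h / n_h + delta   # step (1): true control risk with drift delta
    p_v = (1 - ve0) * p_c     # null hypothesis: VE = VE_0
    probs = [
        prob_null(rng.binomial(n_v, p_v), n_v, rng.binomial(n_c, p_c), n_c,
                  x_h, n_h, w, ve0, rng)
        for _ in range(n_sims)
    ]
    return float(np.quantile(probs, alpha))  # step (2)


# Illustrative sizes only: 3000 vaccinees, 1800 controls, drift of -0.3%.
alpha_delta = drift_adjusted_alpha(delta=-0.003, x_h=97, n_h=7305,
                                   n_v=3000, n_c=1800, n_sims=200)
```

The resolution of the calibrated value is limited by the number of posterior draws and simulated trials, so a real calibration would need far larger values of `draws` and `n_sims` than this sketch uses.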

| Historical controls designs
In this paper, we illustrate the application of different historical control designs (fixed and adaptive) by retrospectively re-analyzing two existing studies as if they had been designed to borrow external controls. In the next sections, we introduce the two kinds of HC designs and the two existing studies.

| Fixed design
The HC information is used as a prior to reduce the sample size of the current control group. The reduced sample size is given by $\kappa N_c$, where $N_c$ is the planned sample size of the study without historical controls and $\kappa$ is the proportion of the planned controls recruited. To achieve the same power as the classical 1:1 trial, $\kappa$ should be chosen in such a way that the effective sample size (ESS) 28 of the RMP posterior (which is unknown) is equal to $N_c$. Therefore, assumptions about the drift $\delta$ must be made to determine $\kappa$. Safety requirements could be considered as well when determining $\kappa$. The RMP model is used to estimate the effect of interest.

| Adaptive design
Let $N_c$ and $N_v$ be the desired ESS at the end of the trial for control and vaccine, respectively. Schmidli et al. 7 proposed a two-stage design where, at the end of stage I (when $N_{c,I}$ and $N_{v,I}$ subjects are recruited), the Bayesian model is used to estimate the posterior ESS based on the first-stage control data ($\mathrm{ESS}_I$). The number of subjects recruited after interim (stage II) in the vaccine group is $N_{v,II} = N_v - N_{v,I}$, while the number of subjects recruited in the control group is $\max(N_c - \mathrm{ESS}_I, N_{c,\min})$, where $N_{c,\min}$ is a pre-specified minimum sample size of stage II. The $\mathrm{ESS}_I$ determines how many controls will be recruited in the second stage (at least $N_{c,\min}$).
The adaptive design described above can be implemented in immunological studies, while it is often not feasible in VE trials because the clinical outcome is often measured with long follow-up (3-4 years) while the recruitment is usually faster. It follows that stopping the recruitment to observe a long follow-up at the end of stage I is not feasible or convenient from an operational point of view.
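The stage II sample size rule of the adaptive design is a short calculation once the interim ESS is available. The sketch below only encodes the formulas above; the function names are hypothetical, rounding up a non-integer ESS difference is an added assumption, and the estimation of $\mathrm{ESS}_I$ itself is not shown.

```python
import math


def stage2_controls(n_c_target, ess_interim, n_c_min):
    """Controls to recruit in stage II: max(N_c - ESS_I, N_c,min)."""
    # Assumption: a fractional shortfall is rounded up to a whole subject.
    return max(math.ceil(n_c_target - ess_interim), n_c_min)


def stage2_vaccine(n_v_target, n_v_stage1):
    """Vaccinees to recruit in stage II: N_v,II = N_v - N_v,I."""
    return n_v_target - n_v_stage1


# Interim ESS above the target: only the pre-specified minimum is recruited.
few = stage2_controls(n_c_target=140, ess_interim=178, n_c_min=10)
# Interim ESS well below the target: the shortfall is recruited.
many = stage2_controls(n_c_target=140, ess_interim=50.4, n_c_min=10)
```

The first call uses the values of the immunological case study ($N_c = 140$, $\mathrm{ESS}_I = 178$, $N_{c,\min} = 10$); the second uses a hypothetical interim ESS.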

| DATA DESCRIPTION
3.1 | Immunological case study
NCT01462357 (referred to as immunogenicity HC study) is a phase 3b observer-blind, randomized, multicenter primary immunization study to evaluate the immunogenicity and safety of Cervarix (GSK) vaccine and Gardasil (Merck). The primary objective of the study was to compare the immune responses to HPV-16 and -18 induced by Cervarix and Gardasil, in terms of HPV-16 and -18 geometric mean concentrations (GMCs) measured by ELISA, at Month 7, and the secondary objective of the study was to assess the immune responses by pseudovirion-based neutralization assay (PBNA), at Month 7. 29 At Month 7, the according-to-protocol (ATP) cohort for immunogenicity included 92 subjects from the Cervarix group and 93 from the Gardasil group with antibody titre results available for HPV-16. The geometric mean titres (GMTs) for HPV-16 were 51043.8 (95% CI: 42567.9; 61078.3) in the Cervarix group and 21377.9 (95% CI: 16900.1; 27042.2) in the Gardasil group.
NCT01031069 (referred to as immunogenicity current study in this paper) is a phase IV, observer-blind, randomized, controlled, multicenter study to assess the safety and immunogenicity of Cervarix in human immunodeficiency virus-infected (HIV+) female subjects aged 15-25 years, as compared to Gardasil (control group).
The primary objective of the study was to demonstrate non-inferiority of Cervarix in terms of GMTs against HPV-16 and HPV-18 measured by pseudovirion-based neutralization assay (PBNA), at Month 7.
Although the primary objective of the study was to compare the immune response in the HIV+ female subjects, HIV-negative (HIV−) female subjects were also enrolled in both treatment groups as safety controls, and immunogenicity in HIV− females was evaluated as a secondary objective in the study. For this work, we used the data from HIV− female subjects.
At Month 7, the ATP cohort for immunogenicity included 77 HIV− subjects in the Cervarix group and 80 in the Gardasil group. The GMTs for HPV-16 were 74825.7 (95% CI: 58340.8; 95968.6) in the vaccine group and 27415.7 (95% CI: 21695.2; 34652.9) in the control group.
3.2 | Efficacy case study
NCT00122681 (referred to as efficacy HC study in the paper) is a phase 3, double-blind, randomized, controlled study with two parallel groups. The study was conducted in Asia, Europe, Latin America and North America. Women aged 15 to 25 years (N ≈ 18,000) were enrolled in the study to receive three doses of vaccine (Cervarix) or control (hepatitis A virus [HAV] vaccine) at 0, 1, and 6 months according to their random assignment (1:1 randomization).
The primary objective of the study was to assess VE in the prevention of grade 2 or 3 cervical intraepithelial neoplasia, adenocarcinoma in situ of the cervix, or invasive cervical cancer (CIN2+) associated with HPV-16 or HPV-18 cervical infection. 13 The ATP cohort for efficacy included 7338 subjects in the vaccine group and 7305 subjects in the control group. At the end of the study, 5 cases of CIN2+ were observed in the vaccine group and 97 cases were observed in the control group in the ATP cohort for efficacy, leading to a VE of 94.9% (95% CI: 87.7%; 98.4%). 33
NCT00128661 (referred to as efficacy current study) is a phase 3, double-blind, randomized, controlled study with two parallel groups. The study was conducted in a single center in Costa Rica with seven satellite sites. Women aged 18 to 25 years (N ≈ 7,000) were enrolled in the study to receive three doses of Cervarix or control (HAV vaccine) at 0, 1, and 6 months according to their random assignment (1:1 randomization).
The primary objective of the study was to evaluate efficacy of the vaccine in preventing CIN2+ associated with incident HPV-16/18 cervical infections. The ATP cohort for efficacy included 2635 subjects in the vaccine group and 2677 subjects in the control group. At the end of the study, 1 case of CIN2+ was observed in the vaccine group and 10 cases were observed in the control group in the ATP cohort for efficacy, leading to a VE of 89.8% (95% CI: 39.5%; 99.5%). 14 In both studies, the VE is defined as VE = 1 − RR, where RR = incidence rate in vaccine group/incidence rate in control group.
A high-level summary of the case studies is shown in Table 1.
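The VE point estimates quoted for the two efficacy studies can be checked from the ATP counts. The short sketch below assumes that the incidence rate in each group is approximated by cases divided by ATP subjects (the studies may have used a different denominator, such as person-time); the function name is hypothetical.

```python
def ve_point_estimate(x_v, n_v, x_c, n_c):
    """VE = 1 - RR, with each risk approximated by cases / subjects."""
    return 1.0 - (x_v / n_v) / (x_c / n_c)


# Efficacy HC study: 5/7338 (vaccine) versus 97/7305 (control).
ve_hc = ve_point_estimate(5, 7338, 97, 7305)
# Efficacy current study: 1/2635 (vaccine) versus 10/2677 (control).
ve_current = ve_point_estimate(1, 2635, 10, 2677)

print(f"HC study VE: {100 * ve_hc:.1f}%")
print(f"Current study VE: {100 * ve_current:.1f}%")
```

Under this approximation the two values round to the reported 94.9% and 89.8%.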

| Evaluation of comparability criteria
For the selected vaccine trials, an evaluation of Pocock's criteria 1 for the comparability of historical and current controls is as follows:
P1: The HC must have received a precisely defined standard treatment that must be the same as the treatment for the randomized controls. For both studies, both arms have received the same treatment.
P2: The HC must have been part of a recent clinical study that contained the same requirements for patient eligibility. The immunological historical control study enrolled girls aged 9 through 14 years at the time of first vaccination and the immunogenicity current study enrolled women aged 15 through 25 years. For this analysis, it was assumed that the GMTs were similar in these age groups. Quality-control results showed that the assay used in the two immunological trials did not change over time (data not shown). The historical vaccine trial was a recent clinical study with similar patient eligibility.
P3: The methods of treatment evaluation must be the same. We used the same immunological and the same efficacy endpoints, and these outcomes were measured in the same way in all trials.
P4: The distributions of important patient characteristics in the HC should be comparable with those in the new trial. Comparability was confirmed by a demographic characteristics analysis.

We retrospectively applied the adaptive design described above. 7 The sample size of the current study was $N_v = N_c = 140$ subjects in each treatment group ($H_1$: GMR > 1), with a non-evaluable rate of 20% and $\sigma = 0.6$. We denote by $\pi = (N_{c,I} + N_{c,\min})/N_c$ the minimum proportion of planned controls. Figure 1 shows the operating characteristics of the adaptive design (Adaptive RMP) with $N_{c,\min} = 10$ as a function of the true value of $\mu_c$.
To show the impact of the weight on the results, we considered two versions of the adaptive design: one with weight equal to 60% (Adaptive RMP [π = 50%, w = 60%]; black dashed line) and one with weight equal to 90% (Adaptive RMP [π = 50%, w = 90%]; black dotted line).
Figure 1A shows the type I error as a function of the true parameter $\mu_c$. We can see that the adaptive design with w = 60% and the fixed design using RMP (140:70) have similar type I error, which is not well controlled for moderate values of the drift. The control of the type I error is more problematic for the adaptive design with w = 90%. As expected, the type I error is well controlled by the designs without historical controls (No HC; gray lines). Figure 1B,C show the power as a function of the true parameter $\mu_c$ in the case of $\log_{10}(\mathrm{GMR}) = 0.2$ and $\log_{10}(\mathrm{GMR}) = 0.3$, respectively. When the drift is large, the adaptive design is more powerful than the fixed RMP and its power is similar to the power of the fixed design with 1:1 randomization (No HC [140:140]) because all additional controls are recruited after the interim. This is expected, as it corresponds to the situation in which the historical controls are not very useful and we fall back to the situation where the maximum number of controls still has to be enrolled in the study. Figure 1D shows the number of recruited controls as a function of the true parameter $\mu_c$. When the historical and current controls are similar, we can see that the adaptive designs result in considerable savings in sample size compared to the classical fixed design with 1:1 randomization (No HC [140:140]). Figure 1D,E show the bias and the root mean-squared error (rMSE) of the posterior mean of the control group as a function of the true parameter $\mu_c$. We can see that the bias and rMSE of the adaptive design with smaller weight (w = 60%) are quite similar to those of the fixed RMP design, while the adaptive design with larger weight (w = 90%) is more biased and has larger rMSE for moderate drifts.

| Criterion | Immunological case study | Efficacy case study |
|---|---|---|
| P3 | Same method of treatment evaluation. | Same method of treatment evaluation. |
| P4 | Comparable subjects characteristics (slightly different age groups). | Comparable subjects characteristics (but multiregional historical study; Costa Rica current study). |
| P5 | Studies performed in the same organization. | Studies not performed by the same organization. |
| P6 | No other indications leading one to expect differing results. | No other indications leading one to expect differing results. |
In summary, the results in Figure 1 suggest that the adaptive design with w = 60% has good robustness properties. Based on these results and other considerations (clinical/scientific judgment of the similarity of the historical and current controls), we pre-specified the adaptive design with w = 60% for the analysis of the real data.
For the retrospective analysis of the real data, subjects at interim (stage I) were selected based on the date of vaccination. The interim (stage I) analysis was performed with $N_{c,I} = 70 - N_{c,\min} = 60$ recruited subjects, corresponding to $N^*_{c,I} = 32$ subjects in the ATP cohort. Table 3 shows summary data (N, N*, $\log_{10}$[GMT], SE[$\log_{10}$(GMT)]). N* represents the sample size of the ATP cohort (number of observed immunological values), which is much smaller than the number of recruited subjects. We can see that historical and current control GMTs at interim are very similar. At interim, the updated weight was $\tilde{w} = 88\%$ and $\mathrm{ESS}_I = 178$, which is larger than the planned sample size. Since $\max(N_c - \mathrm{ESS}_I, N_{c,\min}) = N_{c,\min}$, with $N_{c,\min} = 10$, we recruited 10 additional controls at stage II and the remaining subjects in the vaccine group. At the end of the adaptive trial we estimated the following effect: GMR = 3.24 (95% CI 2.24; 4.57). The lower limit was larger than 1, so the adaptive design was successful. Table 4 shows that similar results are obtained using the other designs considered. In this case, the adaptive design is equivalent to the fixed RMP (140:70) design because they recruited the same controls.
A key aspect of the adaptive design is the time of the interim analysis (expressed as π). In our example, we performed the interim analysis when about half of the samples were available (π = 0.5). For illustration, Figure 2 shows the results of the adaptive design as a function of π. Figure 2 shows that (i) the updated weight starts to decrease when π < 0.3 (panel A); (ii) additional controls are recruited after interim only if π < 0.3 (panel B); (iii) the conclusions of the trial do not change (panel C).

| Retrospective analysis of the VE case study (fixed design)
The sample size of the current study ($N_v = N_c \approx 3000$) was derived to test $H_0$: VE ≤ 25% versus $H_1$: VE > 25%, with α = 2.5% (one-sided) and power of 90% under the assumption that VE = 80%, $p_c = 1\%$ and a drop-out rate of 10%. Figure 3 shows the type I error and the power of the design (with and without HC) as a function of the true value of $p_c$ and of the proportion of the planned controls included in the study κ (κ = 0.5, 1). Figure 3A shows the type I error of the two designs (RMP and no HC) as a function of the true parameter $p_c$. It is known from the literature that RMP (black lines) does not control type I error for moderate values of the drift. In our specific VE setting, we can see that type I error is always inflated when $p_c < 1\%$, and in particular when $0.25\% < p_c < 1\%$. The maximum type I error increases when κ decreases. As an example, at $p_c = 0.7\%$ the type I error of RMP is about 20% and 40% when κ is 100% and 50%, respectively. As expected, the type I error is well controlled by the design without historical controls (No HC; gray lines). Figure 3B,C show the power as a function of the true parameter $p_c$ in the case of VE = 50% and VE = 80%, respectively. Results are surprising: we can see that the RMP design with κ = 50% is more powerful than the RMP design with κ = 100% when $p_c < 1\%$. This is due to the fact that the type I error of RMP is more inflated when κ = 50%. The gain in power of RMP with respect to the design without HC (no HC) seems to be mainly driven by the inflation of the type I error. To disentangle the effect of type I error inflation on the power, it may be useful to consider the alpha-adjusted RMP approach, where the RMP significance level depends on the observed drift (RMP Adj). Figure 4A shows the drift-adjusted alpha to control α = 2.5% for our case study as a function of κ.
The drift-adjusted alpha is larger than 2.5% when $1.3\% < p_c < 2.5\%$ (where standard RMP is conservative), and it is smaller than 2.5% for all values of $p_c < 1.3\%$ (where standard RMP is liberal). The drift-adjusted alpha is smaller than 0.5% when $0.4\% < p_c < 1\%$. Figure 4B shows that RMP using the drift-adjusted alpha (RMP Adj) controls the type I error. Figure 4C,D show the power of RMP with drift-adjusted alpha. Power results of RMP Adj are more easily interpretable: (i) RMP Adj using all controls (κ = 100%) is more powerful than RMP Adj using half of the controls (κ = 50%); (ii) the gain in power of RMP Adj with respect to the design without HC (No HC) is centered around the parameter estimated in the HC (vertical gray line).

TABLE 4 Immunological study results: recruited current controls ($N_c$); current controls in the according-to-protocol (ATP) cohort ($N^*_c$); GMR and 95% CI.

Based on these results (and additional considerations such as safety minimum sample size), κ = 60% (N c = 1800) was chosen/pre-specified for the retrospective design. Controls were selected using the date of vaccination. Table 5 shows summary data (ATP sample size, number of cases, attack rate). We can see that the probability of the event in the current trial is different from the probability of the historical data (large drift). Table 6 shows the results of different VE designs, without HC (no HC [κ = 1] and no HC [κ = 0.6]) and using historical controls with and without type-I-error adjustment (RMP Adj [κ = 0.6] and RMP [κ = 0.6]).
The updated weight is $\tilde{w} = 29\%$ (the pre-specified value was w = 60%), with ESS = 3179, and the effect estimated by RMP (κ = 0.6) is VE = 0.92 (95% CI 0.51; 0.99). Results of the full trial without historical controls (no HC [κ = 1]) are VE = 0.88 (95% CI 0.44; 0.99), so, even though the updated weight is small (suggesting that historical and current controls are quite different), the RMP design using only κ = 60% of the planned controls led to the same conclusion as the full (unobserved) trial. A different conclusion was obtained by using RMP Adj (κ = 60%), where the $(1 - 2\alpha_\delta)\%$ CI = (−0.07; 1), because the estimated lower limit is much smaller than 0.25. This result is due to the fact that the incidence rate estimated in the current control group ($\hat{p}_c \times 1000 = 3.73$) is in the region with the strongest alpha adjustment ($\alpha_\delta = 0.0013$) (see Figure 4A). Results of the RMP design strongly depend on the pre-specified proportion of recruited controls (κ). For illustration, Figure 5 shows the results of the HC design as a function of κ. Figure 5 shows the number of cases observed in the control group (panel A), the updated weight (panel B), the ESS (panel C) and the RMP estimated VE (panel D) as a function of the proportion of controls recruited (κ). We can see that the lower limit of the credible interval of RMP is above 25% for all values of κ > 40%. In contrast, the lower limit of the alpha-adjusted RMP (red dots) is slightly above 25% only for κ > 90%. As explained above, this is due to the fact that the estimated incidence rate in the current controls is in the region with strong alpha adjustment. The drift-adjusted alpha values at κ = 0.5, 0.6, 0.7, 0.8, 0.9, 1 are $\alpha_\delta$ = 0.0005, 0.0014, 0.0023, 0.0023, 0.0036, 0.0072.

| DISCUSSION
In this paper, we retrospectively re-analyzed two vaccine studies by using HC: one relatively small immunological study and one large phase 3 VE study. Historical trials were selected to meet the Pocock criteria. 1 We used the dynamic borrowing approach of the Robust Mixture Prior (RMP), which corresponds to the Robust MAP Prior 7 when only one historical trial is available and the historical between-trial variability is ignored. Other methods proposed in the literature for dynamic borrowing can be used as well. 1,3,4,9 Key prior parameters of the RMP approach are (i) the weight (w) and (ii) the definition of the vague prior. Results may change dramatically when different w or vague priors are used. For this reason, it is critical to pre-specify the prior in the protocol. In this paper, we pre-specified weight w = 60% and the "unit information priors".
Parameter w represents the prior probability that the new trial does not differ systematically from the historical trial. The chosen value of w should be discussed with the authorities, and results should be supported by sensitivity analysis (tipping point) and simulation studies showing the impact of w on type I error and power. For the immunological adaptive design, we showed that larger weights are more problematic in terms of type I error (in case of moderate drift). For the VE study we assessed the operating characteristics of the design with weights w = 60% and w = 80%. For the sake of brevity, we showed only results for the selected weight (w = 60%). As for the immunological study, and in agreement with simulation results of Schmidli et al., 7 type I error is less controlled using larger weights. For example, the maximum type I error of RMP (κ = 0.5) with w = 60% is less than 40% (see Figure 3A), while it is about 50% with w = 80% (data not shown).
For the immunological study we applied an adaptive design 7 in which the total number of recruited controls is adapted at interim on the basis of the similarities between the current and HC. Immunological studies are ideal to apply these adaptive designs because the immunological endpoint can be measured in a short time, which makes it possible to stop the recruitment and perform the interim analysis soon. In our case study, the two studies were very similar, and so it was not necessary to recruit additional controls after interim. To speed up and simplify the adaptive trial, one possibility is to start the trial with a 2:1 randomization ratio and perform the final analysis at interim if the estimated ESS of controls is large enough. In this way, there is no need to recruit the additional vaccinated subjects after interim.
One key aspect of the vaccine immunological studies is the stability of assays over time. The assay measuring the immune response is usually changing (improving) during drug development. It is, therefore, possible that the assay in the two trials is not exactly the same. If the assay changed, then one possible solution is to predict the historical data values on the scale of the new assay. Deming regression (with parameters estimated using quality control data) is often used to make these kinds of transformations/predictions. A consequence of this approach is that the variability of the prediction will be added to the variability of the historical study. Regulatory authorities may have reservations about these types of transformations in the context of historical controls, especially for confirmatory trials. Hence, engaging regulatory authorities during the setup of the study is recommended.
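As an illustration of the transformation step, the sketch below fits a Deming regression on paired quality control measurements and maps historical values onto the scale of the new assay. It is a minimal example under assumptions: the function names and the data are hypothetical, `delta` is taken as the ratio of the error variance of the new assay to that of the old assay, and the extra prediction variability mentioned above is not propagated here.

```python
from math import sqrt


def deming_fit(x, y, delta=1.0):
    """Deming regression of y on x, with both variables measured with error.

    delta is the assumed ratio var(error in y) / var(error in x);
    delta = 1 gives orthogonal regression. Returns (intercept, slope).
    """
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x) / (n - 1)
    syy = sum((yi - my) ** 2 for yi in y) / (n - 1)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
    if sxy == 0:
        raise ValueError("zero covariance: the slope is not identifiable")
    d = syy - delta * sxx
    slope = (d + sqrt(d * d + 4.0 * delta * sxy * sxy)) / (2.0 * sxy)
    intercept = my - slope * mx
    return intercept, slope


def to_new_assay_scale(values, intercept, slope):
    """Predict old-assay values on the scale of the new assay."""
    return [intercept + slope * v for v in values]


# Hypothetical paired QC samples (e.g., log10 titers) on both assays.
old_assay = [1.0, 1.5, 2.1, 2.8, 3.4, 4.0]
new_assay = [1.2, 1.8, 2.3, 3.1, 3.6, 4.4]
b0, b1 = deming_fit(old_assay, new_assay, delta=1.0)

historical_values = [2.0, 2.5, 3.0]  # hypothetical historical control data
print(to_new_assay_scale(historical_values, b0, b1))
```

In practice the choice of `delta` would come from the known precision of the two assays, and the uncertainty of the fitted line would need to be carried into the historical prior.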
For the VE case study the adaptive design was not feasible. The clinical endpoint was measured with a long follow-up (4 years), while the recruitment was about 1 year. For this reason, we used a fixed design, in which we recruited only a pre-specified proportion (κ = 60%) of the planned controls, and for which the final analysis was done using adaptive borrowing of the historical controls. In our case study, current controls and HC were quite different, and so only a small amount of the historical information was borrowed (w̃ = 29%). This did not change the conclusions of the trial if RMP was used. Conclusions changed instead if the proposed alpha-adjusted RMP approach was used or if a smaller sample size for the current study had been chosen.
The proposed alpha-adjustment has the limitation that it depends on the estimated value of the drift, and so the adjustment is only approximate (because the real value of δ is unknown). In general, it is complex to estimate the type I error by simulations because there are many scenarios potentially compatible with the null hypothesis. A correct and simple way to control the type I error is to use the smallest adjusted alpha in the plausible range of δ values. Unfortunately, this approach is too conservative in our context, undermining the benefit of the historical data. All these aspects together emphasize the complexity of controlling the type I error when using Bayesian dynamic borrowing of historical controls in VE studies.
In the future, we will consider the adjustment for covariates (e.g., Bayesian approaches with propensity scores 34 ) to reduce the heterogeneity between studies. Different Bayesian models have been proposed to estimate the VE. To choose the model for our case study we performed a small simulation study (see Appendices A and B). The model with two independent Jeffreys Beta priors showed the best frequentist performance. For simplicity, we did not consider correlated priors. With correlated priors, it is possible that the log odds-ratio models outperform the models with independent Beta priors. A large simulation study to compare Bayesian models to estimate VE is an interesting line of research for future work.
Adapting these methods to safety endpoints may be interesting for safety signal detection, as control of the type I error may also be less of an issue. However, very small incidences for these endpoints may increase the variability of the results given the proportion of planned current controls recruited. Furthermore, the approach proposed in this paper could be implemented for life cycle management when multiple clinical trials have been performed. Using previously collected data will allow smaller studies to be designed, to the benefit of both patients and companies.
In conclusion, we retrospectively implemented two HC designs in two HPV vaccine trials, one immunological study (for which we applied an adaptive design) and one VE study (for which we applied a fixed design). Importantly, the design and the prior of the Bayesian model were fully pre-specified.
From our point of view, results from these two retrospective studies show the feasibility and the potential role of dynamic borrowing of HC in classical vaccine development. In particular, the approach seems to be promising for (early development) immunological studies, where the adaptive design is feasible and control of type I error is less relevant.

AUTHOR CONTRIBUTIONS
All authors contributed equally to the design, analysis and interpretation of the study and to the writing of the article.

APPENDIX A: Bayesian VE
For vaccine efficacy (VE) trials, the event (particularly in the vaccine group when the VE is high) can be rare. In studies with rare events, the data contain limited information, and the information of the prior distribution is expected to contribute to the posterior distribution. Different Bayesian methods have been considered in the literature.
We may consider a model with two independent Beta priors (p_g ~ Beta[a, b], g = v, c), where the priors can be, for example, uniform (a = b = 1) or weakly informative (a = b = 1/2). Similarly, we may consider that the logits of the two proportions are independent and normally distributed.
An assumption of these approaches is that the two priors are independent. To take the correlated prior beliefs into consideration, one can use correlated priors on the proportions or model the incidence in the control group (p c ) and the relative risk (RR) as independent. 23 For example, Hampson et al. 24 used a normal prior for the log odds-ratio (log[OR]) and an independent beta prior for p c .
Chu and Halloran 23 proposed a lognormal prior for the log odds of the control group log(p_c/[1 − p_c]) and an independent lognormal prior for the log(OR).
Jin et al. 11 used the Poisson approximation framework. 12 Given the total number of events X = X_c + X_v, the number of cases in the vaccine group is X_v ~ Bin(Y | θ), where θ depends on the VE and on the randomization ratio k. They considered two different priors for θ: a normal and a Beta prior.
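The link between θ and the VE in this conditional framework can be sketched as follows. This is an illustration under assumptions rather than the exact parametrization of Jin et al.: it takes k = N_v/N_c (vaccine to control allocation), Poisson event counts with rates proportional to group size, and a Beta(a, b) prior on θ. Function names and counts are hypothetical.

```python
import random


def theta_from_ve(ve, k=1.0):
    """Probability that a case belongs to the vaccine group.

    Assumes k = N_v / N_c, so theta = k(1 - VE) / (1 + k(1 - VE)).
    """
    rr = 1.0 - ve
    return k * rr / (1.0 + k * rr)


def ve_from_theta(theta, k=1.0):
    """Inverse map: VE = 1 - theta / (k (1 - theta))."""
    if theta >= 1.0:
        return float("-inf")
    return 1.0 - theta / (k * (1.0 - theta))


def ve_credible_interval(x_v, x_total, k=1.0, a=1.0, b=1.0,
                         draws=20000, seed=0):
    """Monte Carlo 95% credible interval for VE.

    With a Beta(a, b) prior on theta and x_v vaccine cases among x_total
    cases, the posterior of theta is Beta(a + x_v, b + x_total - x_v).
    """
    rng = random.Random(seed)
    ve = sorted(
        ve_from_theta(rng.betavariate(a + x_v, b + x_total - x_v), k)
        for _ in range(draws)
    )
    return ve[int(0.025 * draws)], ve[int(0.975 * draws) - 1]


# Hypothetical counts: 2 vaccine cases among 22 cases, 1:1 allocation.
low, high = ve_credible_interval(x_v=2, x_total=22, k=1.0)
print(f"theta under VE = 0.25: {theta_from_ve(0.25):.4f}")
print(f"95% credible interval for VE: ({low:.2f}; {high:.2f})")
```

Because the VE is a decreasing function of θ, the null hypothesis VE ≤ 0.25 corresponds to θ at or above `theta_from_ve(0.25, k)`, so the test can be carried out directly on the binomial parameter.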

APPENDIX B: Frequentist performance of Bayesian VE (simulation study)
We performed a small simulation study to compare different models in the setting of our case study. In this simulation study we consider (i) Beta-Beta (BB) model with two independent non-informative Beta priors (p_g ~ Beta
To assess the frequentist performance of the different Bayesian models described above, we simulated 5000 trials under the null hypothesis (VE = 0.25) and 1000 under the alternative hypothesis (VE = 0.8; VE = 0.95). The number of events for each trial was simulated using a binomial distribution with input parameters coming from our case study: N_c = N_v = 2,700, p_c = 0.4%, p_v = (1 − VE) · p_c, with one-sided hypothesis H_0: VE ≤ VE_0 versus H_1: VE > VE_0 and VE_0 = 0.25. Table A1 shows simulation results for the different models described above under different values of VE.
Let us first look at the results under the null hypothesis (VE = 0.25). The log(OR)-normal prior (τ_lOR = 1.25) was calibrated under the null hypothesis (to control the type I error). This is the reason why the type I error is exactly 0.025. A slightly larger (smaller) value of the variance will lead to a liberal (conservative) test. With the exception of the log(OR)-normal prior (which should not be compared to the others under the null hypothesis), the model with the best frequentist performance in terms of type I error (true VE = 25%) is the Bayesian model using two independent Jeffreys priors (p_c ~ Beta[0.5, 0.5]; p_v ~ Beta[0.5, 0.5]). The frequentist model using the score approach to estimate the confidence intervals 30-32 performs well (only slightly conservative). In contrast with the Bayesian models considered here, the frequentist point estimate is not defined when y_c = 0. A classical approach to estimate the frequentist VE in the case of two-by-two tables with zero cells is to add a constant 0.5 to all cells ("continuity correction"), which is similar in spirit to the Jeffreys prior model.
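The simulation for the Beta-Beta Jeffreys model can be sketched along the following lines. This is a simplified illustration, not the code used for Table A1: it assumes that the null hypothesis is rejected when the Monte Carlo 2.5% posterior quantile of VE = 1 − p_v/p_c exceeds VE_0 = 0.25, and it uses far fewer trials and posterior draws than a real study would. Function names are hypothetical.

```python
import random


def draw_binomial(rng, n, p):
    """Binomial draw as a sum of Bernoulli trials (standard library only)."""
    return sum(rng.random() < p for _ in range(n))


def ve_lower_limit(y_v, n_v, y_c, n_c, rng, draws=1000, a=0.5, b=0.5):
    """Monte Carlo 2.5% posterior quantile of VE = 1 - p_v / p_c under
    independent Beta(a, b) priors (Jeffreys when a = b = 0.5)."""
    ve = []
    for _ in range(draws):
        p_v = rng.betavariate(a + y_v, b + n_v - y_v)
        p_c = rng.betavariate(a + y_c, b + n_c - y_c)
        ve.append(1.0 - p_v / p_c if p_c > 0 else float("-inf"))
    ve.sort()
    return ve[int(0.025 * draws)]


def rejection_rate(true_ve, n_trials=200, n=2700, p_c=0.004,
                   ve0=0.25, seed=1, draws=1000):
    """Proportion of simulated trials in which H0: VE <= ve0 is rejected.

    With true_ve = ve0 this estimates the type I error; with a larger
    true_ve it estimates the power.
    """
    rng = random.Random(seed)
    p_v = (1.0 - true_ve) * p_c
    rejected = 0
    for _ in range(n_trials):
        y_c = draw_binomial(rng, n, p_c)
        y_v = draw_binomial(rng, n, p_v)
        if ve_lower_limit(y_v, n, y_c, n, rng, draws=draws) > ve0:
            rejected += 1
    return rejected / n_trials


if __name__ == "__main__":
    print("estimated type I error (VE = 0.25):", rejection_rate(0.25))
    print("estimated power (VE = 0.80):", rejection_rate(0.80))
```

With so few replicates the estimates carry substantial Monte Carlo error, so the sketch is only meant to show the structure of the computation; other priors can be compared by changing `a` and `b` or by replacing `ve_lower_limit`.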
Let us now consider results under the alternative hypothesis (VE = 0.80, VE = 0.95). We can see that the Jeffreys prior Beta-Beta model is slightly less biased than the uniform prior Beta-Beta model. The Jeffreys prior Beta-Beta model is slightly more powerful than the other models. The power of the frequentist approach (confidence intervals estimated using the score test) is very similar to the power of the uniform prior model and of the log(OR) model. The log(OR) model (calibrated under the null) does not perform well in terms of coverage and MSE when the VE is very large (VE = 0.95).
Based on our simulation results, we selected the Beta-Beta Jeffreys prior model for the estimation of the VE in the retrospective case study. The good frequentist performance of this model was also highlighted by Agresti and Min. 25