Bayesian borrowing for basket trials with longitudinal outcomes

Basket trials are a novel clinical trial design in which a single intervention is investigated in multiple patient subgroups, or “baskets.” They offer the opportunity to share information between subgroups, potentially increasing power to detect treatment effects. Basket trials offer several advantages over running a series of separate trials, including reduced sample sizes, increased efficiency, and reduced costs. Primarily, basket trials have been undertaken in Phase II oncology settings, but could be a promising design in other areas where a shared underlying biological mechanism drives different diseases. One such area is chronic aging‐related diseases. However, trials in this area frequently have longitudinal outcomes, and therefore suitable methods are needed to share information in this setting. In this paper, we extend three Bayesian borrowing methods for a basket design with continuous longitudinal endpoints. We demonstrate our methods on a real‐world dataset and in a simulation study where the aim is to detect positive basketwise treatment effects. Methods are compared with standalone analysis of each basket without borrowing. Our results confirm that methods that share information can improve power to detect positive treatment effects and increase precision over independent analysis in many scenarios. In highly heterogeneous scenarios, there is a trade‐off between increased power and increased risk of type I errors. Our proposed methods for basket trials with continuous longitudinal outcomes aim to facilitate their applicability in the area of aging related diseases. Choice of method should be made based on trial priorities and the expected basketwise distribution of treatment effects.


INTRODUCTION
In recent years, new clinical trial designs have emerged which aim to increase efficiency by combining what would have traditionally been separate trials into one under a "master protocol." Master protocols aim to investigate multiple hypotheses simultaneously, with the goal of streamlining clinical development while identifying subgroups of patients for which a treatment is most beneficial. Whereas a traditional trial may investigate a single primary hypothesis, master protocols aim to investigate multiple treatments and/or patient subgroups at the same time within a single trial.
Master protocols can be stratified into three main designs: umbrella, basket, and platform trials. 1 Broadly speaking, umbrella trials aim to investigate multiple treatments for a single disease with multiple subgroups. Conversely, basket trials aim to investigate a single therapy or intervention in multiple diseases. Platform trials are a more flexible trial design, in which multiple therapies for a single disease are investigated in a continuous manner, with arms of the trial allowed to enter or leave on the basis of a prespecified decision algorithm. 2 Basket trials begin by enrolling patients who share a particular disease feature. Patients are then stratified into "baskets" based on their disease subgroup. The feature, common to all disease subgroups within the trial, is targeted with the new therapy. The trial can then evaluate multiple questions regarding the overall effect of the treatment and the differences in responses between different subgroups.
Basket trials offer the opportunity for increased efficiency through statistical borrowing of information across complementary baskets, helping to reduce overall sample size, time and costs. There are several different approaches to borrowing of information between subgroups, ranging from no borrowing to complete pooling, with many authors proposing using hierarchical models as a compromise between the two.
The vast majority of basket trials conducted so far have been undertaken within oncology whereby the aim is to recruit patients on the basis of a common biomarker or genetic mutation. Patients are then stratified into baskets based on their cancer histology. Woodcock and Lavange considered applying basket trials in other areas, 2 where instead of stratifying patients by cancer location, they are stratified by biological characteristics, such as disease type/stage, specific genetic changes, or demographic characteristics. One promising area for basket trial development is in aging related diseases, where diseases may share features which respond similarly to a targeted therapy. For example, immune-mediated inflammatory diseases (IMIDs) are a group of distinct conditions which share common inflammatory pathways. Grayling et al have recently considered innovative approaches to IMID trials, including the use of basket trials. 3 IMIDs are associated with chronic morbidity affecting quality of life and leading to premature death, and there is potential for novel trial designs to improve personalised treatment strategies in this area.
Chronic disease trials often investigate the effect of an intervention over time through repeated observations on individuals. The use of basket trials in this area therefore raises some statistical issues; namely, how to handle information sharing of continuous longitudinal data within a basket trial framework. There are currently no methods for basket trials that we are aware of which allow for information sharing between repeated measures over time, and this will therefore be the focus of this paper.
Overviews of longitudinal data analysis in clinical trials have been discussed by various authors; for instance, the seminal paper by Laird and Ware, 4 as well as Albert, 5 Matthews et al, 6 Senn et al, 7 and De Livera et al. 8 From a statistical perspective, longitudinal studies are attractive because they usually increase the precision of estimated treatment effects (since there are repeated measurements on the same person, and so a larger amount of available data), thus increasing the power to detect such effects. We focus on methods where the dependent variable is continuous and where the primary research question is the effect of treatment on the disease course over time. Methods for longitudinal data for different data types (eg, discrete data, recurrent event data) are discussed for example in Reference 5. Longitudinal data analysis requires special consideration of several important factors, such as how to account for correlation between repeated measurements on the same individual and how to handle missing data (common in trials due to loss to follow up, or missed study visits due to intercurrent illness). These missing data can result in information from different individuals available at different time points, that is, an unbalanced design. 8 There have been extensive discussions in the literature of methods of dealing with missing values in longitudinal data for different analysis methods, for example, References 9-12.
An increasingly common method for longitudinal data analysis is hierarchical modeling. This approach goes by various names including multilevel modeling, random coefficients modeling and mixed effects modeling (henceforth referred to as hierarchical modeling), and is described in more detail in Section 2.1. These models are highly flexible and aim to account for the complex correlation structure of longitudinal data via "random" effects at different levels. They handle missing data under a missing at random (MAR) assumption, 13 which may be attractive because all of the available data can be used to estimate model parameters. This is in contrast to other methods for longitudinal analysis (such as repeated measures ANOVA) which exclude any individuals with missing values. We note that the mixed effects model is able to handle MAR only when the main model is correctly specified and any covariates related to dropout are included in the model. In a typical two-level hierarchical model, correlation between observations (within subjects) is modeled at level 1, with correlation between subjects modeled at level 2. If necessary, further levels can be added (eg, as we will do, a level 3 effect for subgroup or cluster). For a comparison of hierarchical methods with more traditional multivariate methods for longitudinal data, see References 14 and 15. Another class of models that can be employed in longitudinal data analysis is generalized estimating equations (GEE). GEE may be preferable to hierarchical linear models if the response variable is not continuous, as they require fewer distributional assumptions. The focus of GEE is on estimating marginal effects (as opposed to conditional effects in hierarchical linear models). See References 16 and 5 for details. Optimal model selection when dealing with missing data in this setting is discussed in Reference 12.
Hierarchical models can be used to address two major questions in a longitudinal clinical trial setting: (1) whether the overall mean response differs between two or more treatment groups (and how much by), and, (2) whether the overall rate of change of response over time differs between two or more treatment groups (and how much by). Appropriate choice of summary measure to address the question depends on the way in which the treatment effect changes over time. For example, in the treatment of asthma, regular therapy with a beta-agonist is likely to produce a near constant bronchodilation over time and so a mean outcome measure is suitable. Conversely, hormone replacement therapy in osteoporosis may have an effect on bone mineral density which increases with time so that a slope estimate is more appropriate. 17 In the context of aging-related diseases, in order to demonstrate a beneficial novel treatment, it may be valuable to compare rates of cognitive decline in Alzheimer's patients between treatment groups (as in Reference 18).
In this paper we propose hierarchical linear growth models to examine treatment effects via comparing rates of change of a continuous response variable between treatment groups. Often individual trajectories can be well approximated by a linear model (however, non-linear extensions are possible at the cost of increased complexity). In addition to elegantly accounting for the correlation structure of the data, formulation of the model in this way allows it to be easily incorporated into the Bayesian hierarchical model framework, which is how information borrowing is facilitated.
While the standard Bayesian hierarchical model (BHM) is useful in many situations, extensions have been developed in an attempt to move away from the assumption that parameters are exchangeable.
In an exchangeable model, each parameter is modeled as coming from a common distribution with common hyper-parameters. For example, in a basket trial with k = 1, … , K baskets, the basket-specific treatment effects θ_k may be assumed to come from a normal distribution with mean μ and variance τ², that is, θ_k ∼ N(μ, τ²). In an exchangeable model, basket-specific treatment effect estimates, θ_k, are shrunk toward a global mean, μ, via the shrinkage parameter τ², which facilitates information borrowing across arms. Crucially, the amount of shrinkage is influenced by the between-basket variance, where a smaller variance leads to more borrowing. In extreme cases where τ² → 0 or τ² → ∞ the BHM reduces to the situations of complete pooling or no pooling, respectively. The exchangeability assumption can prove challenging in two ways in a basket trial scenario.
Firstly, with a small number of subgroups (as has commonly been the case so far in basket trials), the between-basket variance is difficult to estimate, even if the numbers within each basket are large. Therefore, estimates of τ² can be sensitive to the choice of prior (and prior hyper-parameters). This issue is well-known in hierarchical modeling and has been investigated extensively in the literature. Relatedly, the second issue is that problems can occur when treatment effects differ vastly between baskets; in particular, when the treatment is highly efficacious for some baskets and not others. This is due to the shrinkage induced by the exchangeability assumption; in a highly heterogeneous scenario, true small (or absent) treatment effects can be inflated, while true large treatment effects can be muted toward a global mean (it is possible to mitigate this effect somewhat through careful selection of the prior distribution and prior parameters). These two issues have motivated the development of more sensitive information-sharing methods for basket trials, such as the EXNEX and Hellinger distance (HD) methods that we consider here.
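To make the role of the between-basket variance concrete, the following sketch (plain Python; the basket-wise estimates, standard errors, and the treatment of the global mean as a precision-weighted average are illustrative assumptions, not the paper's model) shows how the normal-normal shrinkage factor pulls basket-wise estimates toward the global mean, and recovers the two limiting cases:

```python
def shrunken_estimates(theta_hat, se, tau2):
    """Conditional posterior means in a normal-normal exchangeable model.

    theta_hat: independent basket-wise estimates; se: their standard errors;
    tau2: between-basket variance. The global mean mu is taken here as the
    precision-weighted average of the estimates (a simplification).
    """
    w = [1.0 / (s ** 2 + tau2) for s in se]  # precision weights for mu
    mu = sum(wi * t for wi, t in zip(w, theta_hat)) / sum(w)
    # Shrinkage factor per basket: smaller tau2 (or larger se) => more pooling.
    B = [tau2 / (tau2 + s ** 2) for s in se]
    return [b * t + (1 - b) * mu for b, t in zip(B, theta_hat)]

theta_hat = [0.9, 0.1, 0.5, -0.2]  # hypothetical basket-wise estimates
se = [0.3, 0.3, 0.2, 0.4]

print(shrunken_estimates(theta_hat, se, tau2=1e6))   # tau2 large: ~ no pooling
print(shrunken_estimates(theta_hat, se, tau2=1e-6))  # tau2 small: ~ complete pooling
```

With τ² large the estimates are returned essentially unchanged (no pooling); with τ² near zero all baskets collapse to the common mean (complete pooling), mirroring the two extremes described above.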
In this paper, three basket trial borrowing methodologies are extended for a longitudinal setting, motivated by potential application to chronic disease trials in which the primary interest is in treatment effect over time.
Our methods could also be applied in other trial settings where repeated measures are taken and where information sharing between subgroups is desirable. To our knowledge, no other authors have proposed extensions of these methods for repeated measures basket designs.

METHODS
Our paper builds on previous work by Zheng and Wason 22 in which they propose using a commensurability measure (the Hellinger distance) to build basket-specific priors for treatment effects in the case of a cross-sectional randomized basket trial. Zheng and Wason compare their proposed borrowing method with three commonly used Bayesian analysis methods. In our extension of their work, we extend the same four methods to the novel case of a longitudinal randomized basket trial: (1) Stratified Bayesian analysis (no borrowing), where treatment effects in each basket are estimated independently.
(2) The standard BHM, in which treatment effect parameters are considered fully exchangeable. The BHM was originally proposed for clinical trials by Thall et al. 23 (3) The EXNEX method, proposed by Neuenschwander et al, 24 which probabilistically splits subgroups into a set of exchangeable (EX) and nonexchangeable (NEX) baskets. (4) The HD method, proposed by Zheng and Wason, 22 which does not rely on any assumption of exchangeability on treatment effect parameters. Instead this method quantifies the similarity between each pair of baskets using the "Hellinger distance," a measure of distributional discrepancy.

Stratified BHM
We begin with specification of our stratified model, which also forms the basis for all three borrowing models. Models are formulated to account for the nested structure of the data (repeated measures within patients within baskets), and to facilitate information sharing. Our models do not include covariates, although these could be incorporated straightforwardly by including them as independent variables in the model. Suppose we have k = 1, … , K indications (baskets) sharing a common disease feature at which a new therapy is targeted, with j = 1, … , n_k patients recruited from each. Patients are randomized within each basket to receive either the new treatment or a placebo. T_jk is a binary treatment assignment indicator, set to 1 for patients assigned to treatment and 0 for patients assigned to placebo. On each patient, repeated measurements of a continuous outcome measure are taken at timepoint i = 0 (baseline/pretreatment), and (posttreatment) timepoints i = 1, … , I_j.
For the stratified model (ie, for each basket separately), at level 1, suppose repeated observations over time within each patient are modeled using a linear regression, such that:
Y_ijk = β_0jk + β_1jk t_ijk + R_ijk, (1)
where Y_ijk is a continuous endpoint measured at timepoint i for individual j in basket k. β_0jk is a random intercept for individual j in basket k (ie, response at baseline, when t_0 = 0). t_ijk is the (continuous) time from randomization for the ith observation on the jth subject in the kth basket. β_1jk is a random slope for individual j in basket k. R_ijk is a random effect for each person at each timepoint. Within each basket, R_ijk ∼ N(0, σ²_k) i.i.d., where σ²_k is the within-patient (within basket) variability in observations over time.
At level 2, suppose each parameter from (1) can further be modeled such that each individual's intercept and slope is a linear function of basket-specific parameters, dependent on treatment assignment:
β_0jk = θ_00k + θ_01k T_jk + u_0jk, (2)
β_1jk = θ_10k + θ_11k T_jk + u_1jk, (3)
where (u_0jk, u_1jk)ᵀ ∼ N(0, Σ_k), σ²_0k is the between-individual variation in intercepts, σ²_1k is the between-individual variation in slopes, and ρ_k σ_0k σ_1k is the covariance between individuals' intercepts and slopes.
For patient j in basket k receiving treatment (T_jk = 1), Equations (2) and (3) become,
β_0jk = θ_00k + θ_01k + u_0jk,
β_1jk = θ_10k + θ_11k + u_1jk.
For patient j in basket k receiving control (T_jk = 0), Equations (2) and (3) become,
β_0jk = θ_00k + u_0jk,
β_1jk = θ_10k + u_1jk.
This defines our longitudinal stratified model (which is essentially a two-level BHM), where all baskets are considered independently of each other.
The overall model becomes,
Y_ijk = θ_00k + θ_01k T_jk + θ_10k t_ijk + θ_11k T_jk t_ijk + u_0jk + u_1jk t_ijk + R_ijk. (4)
In each of our models, we include a basketwise effect for the mean difference in measured outcomes between treatment groups at baseline, θ_01k. As noted in the following section, any difference between treatment groups at baseline is not attributable to treatment, and, because of the randomization, we would expect no difference (other than due to chance). However, since baseline values of an outcome are highly related to follow-up measurements of the same variable, even small (random) differences in baseline values between the two groups can have a (strong) confounding effect. 25 Therefore, it is advised to adjust for the difference between groups in the baseline value of the outcome variable.
In the Bayesian framework, priors over each parameter complete the model specification. In this case, we require priors for the "fixed" population level effects, namely θ_00k, θ_01k, θ_10k, θ_11k, as well as the "random" effects at each level, namely the σ_k parameters and the level 2 variance-covariance parameters (σ_0k, σ_1k, ρ_k). Priors (and prior hyper-parameters) can be chosen in various ways dependent on trial objectives, knowledge of possible ranges for parameter values, and/or existing knowledge, and can range from "sceptical" to very informative (see Reference 26 for a discussion on choosing priors in Bayesian clinical trials). For this reason there are no recommended default priors; the best approach to prior specification (including hyper-parameter selection) is entirely dependent on the situation. In the context of a real trial further investigations into the sensitivity of priors and hyper-parameters may be necessary at the design stage.

Estimating treatment effectiveness
It is a well-known problem of causal inference that we can never obtain the set of results for which different treatments are independently tested on each participant at the same time. This is known as the fundamental problem of causal inference. [27][28][29] We instead lean on the potential outcomes framework 30,31 widely adopted in the literature to identify causal interpretation of population-level estimates while stating the necessary assumptions. Under the stable unit treatment value assumption (SUTVA) 32 causal effects are defined based on the comparison of functions of potential outcomes under two different actions made on the same object or group of objects. 33 In each basket we assume randomization will produce comparable groups, such that the only (known and unknown) systematic difference between groups is treatment assignment. In taking the difference in the expected value of potential outcomes Y_ijk from Equation (4) for treatment (T_jk = 1) vs control (T_jk = 0), we construct an estimand for the average causal effect (ACE) of treatment in basket k at time t_i:
E(Y_ijk | T_jk = 1) − E(Y_ijk | T_jk = 0) = E(θ_01k) + t_i E(θ_11k).
This can be thought of as the expected difference in outcomes between treatment groups at time t_i in basket k. Here, E(θ_01k) is the expected difference in potential outcomes at baseline (t_0 = 0). Assuming the randomization procedure produces comparable groups in all (known and unknown) factors that may affect outcomes, any difference between treatment groups at baseline is not attributable to treatment, therefore E(θ_01k) = 0. Thus our attention is focused on comparing methods for sharing information on θ_11k, and our treatment effect estimands are:
δ_k = E(θ_11k), k = 1, … , K.
If δ_k > 0, or some prespecified clinically significant amount, δ_u, then on average over the course of the trial, the trajectory of the treatment arm is "steeper" than that of the control arm, and we can conclude (with a level of evidence) that treatment is better than control (assuming that higher response values correspond to "better" outcomes).
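As a point of reference, the estimand δ_k has a simple frequentist analogue outside the Bayesian models considered in this paper: fit a least-squares slope to each patient's measurements and take the difference in mean slopes between arms. A minimal sketch, with hypothetical helper names:

```python
def ols_slope(times, ys):
    """Least-squares slope of y on t for one patient's repeated measurements."""
    n = len(times)
    tbar = sum(times) / n
    ybar = sum(ys) / n
    num = sum((t - tbar) * (y - ybar) for t, y in zip(times, ys))
    den = sum((t - tbar) ** 2 for t in times)
    return num / den

def slope_difference(patients):
    """patients: list of (treatment_indicator, times, ys) tuples.
    Returns mean slope in the treated arm minus mean slope in the control arm,
    a crude estimate of the per-unit-time treatment effect delta_k."""
    tr = [ols_slope(t, y) for T, t, y in patients if T == 1]
    ct = [ols_slope(t, y) for T, t, y in patients if T == 0]
    return sum(tr) / len(tr) - sum(ct) / len(ct)

# Two hypothetical patients measured at 0, 3, 6, 12 months:
patients = [(1, [0, 3, 6, 12], [0.0, 3.0, 6.0, 12.0]),   # treated, slope 1.0
            (0, [0, 3, 6, 12], [0.0, 1.5, 3.0, 6.0])]    # control, slope 0.5
print(slope_difference(patients))
```

Unlike the hierarchical models above, this crude estimator ignores the within-basket correlation structure and weights all patients equally, but it conveys what δ_k measures: a difference in rates of change per unit of time.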
In the Bayesian hypothesis testing framework, which directly models P(δ_k | data), the decision-making process can be framed as a null and alternative hypothesis:
H_0: δ_k ≤ δ_u versus H_1: δ_k > δ_u.
In early phase trials this can be framed as a Go/No-Go decision. If the probability that δ_k > δ_u is greater than a prespecified evidence level, a Go decision can be allocated for that basket (ie, continue clinical development), otherwise No Go (halt clinical development).
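The Go/No-Go rule can be computed directly from posterior draws of δ_k; the draws and the evidence level below are illustrative:

```python
import random

def go_no_go(delta_samples, delta_u=0.0, evidence=0.975):
    """Go/No-Go rule: 'Go' if the posterior probability that delta_k exceeds
    delta_u, estimated from posterior draws, is greater than the prespecified
    evidence level."""
    p = sum(d > delta_u for d in delta_samples) / len(delta_samples)
    return ("Go" if p > evidence else "No Go"), p

# Hypothetical posterior draws for one basket's delta_k:
rng = random.Random(0)
draws = [rng.gauss(0.3, 0.1) for _ in range(10000)]
decision, prob = go_no_go(draws, delta_u=0.0, evidence=0.975)
print(decision, round(prob, 3))
```

In practice the draws would come from the MCMC output of whichever model (stratified, BHM-L, EXNEX-L, or HD-L) is being used for that basket.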

Note on estimands
It is important to clarify explicitly the estimand in our models. The methods we propose assume that the primary research hypothesis is to establish whether there is a (basketwise) causal effect of treatment over placebo over the trial duration via a difference in linear trajectories of the responses between treatment groups, under an assumption of no systematic difference at baseline. This allows researchers to answer the question of whether a novel treatment is improving the overall outcome over time (over and above placebo) in each subgroup. For each basket, our estimands, δ_k, are not a prediction of mean difference in (basketwise) treatment response over and above control at a particular timepoint compared to baseline (as they typically would be in a cross-sectional basket trial). Rather, they represent the average treatment effect per unit of time, and so the δ_k are the difference in slopes (difference in rates of change over time) between treatment groups (see Section 2.2). The units of δ_k are the units of the response variable divided by the units of the time variable. Furthermore, in the language of the estimand framework, our models operate under a hypothetical treatment strategy. 33 That is, our models estimate the treatment effect in the hypothetical scenario where all patients adhere to treatment, since our analysis methods allow all patients to be included in the analysis and for each patient a mean linear trajectory is implicitly imputed, even if they only have one baseline measurement.

Longitudinal BHM method (BHM-L)
In our three-level longitudinal BHM method (BHM-L), we use a single model for all the data (in contrast to formulating independent models for each basket as in the stratified BHM). At level 1,
Y_ijk = β_0jk + β_1jk t_ijk + R_ijk,
where R_ijk is a random effect for each person at each timepoint. For all baskets, R_ijk ∼ N(0, σ²) i.i.d.
In our BHM-L, each basket level parameter from level 2 is further modeled as a mixture of "fixed" population level terms, and "random" basket-specific V terms, such that, at level 3,
θ_00k = θ_00 + V_00k,
θ_01k = θ_01 + V_01k,
θ_10k = θ_10 + V_10k,
θ_11k = θ_11 + V_11k, (5)
where V_00k ∼ N(0, τ²_00), V_01k ∼ N(0, τ²_01), V_10k ∼ N(0, τ²_10), and V_11k ∼ N(0, τ²_11). Covariances of 0 are assumed between level 3 parameters. This is a simplifying but realistic assumption. Simplifying, since we wish to isolate estimation of the treatment effect parameters in order to compare our borrowing methods. Furthermore, in most applications, the parameters defining the level 3 covariance structure will not be of direct interest, and overparameterization can lead to inefficiencies. As already discussed, it is well known that in a basket trial estimation of the between basket variance parameter(s) can be challenging when the number of baskets is small. Indeed, this is the primary motivation for subsequent developments of the standard BHM method. It is also a realistic assumption, since, as should be noted again, the model already explicitly accounts for all important correlations via the nested structure. The overall model becomes,
Y_ijk = (θ_00 + V_00k) + (θ_01 + V_01k) T_jk + (θ_10 + V_10k) t_ijk + (θ_11 + V_11k) T_jk t_ijk + u_0jk + u_1jk t_ijk + R_ijk. (6)
Again, priors over each parameter complete the model specification. In this case, we require priors for the "fixed" population level effects, namely θ_00, θ_01, θ_10, θ_11, as well as the "random" effects at each level, namely the σ, σ_0, σ_1, and τ parameters.
In the same way outlined in Section 2.2, in taking the expected value of Y_ijk from Equation (6) for treatment (T_jk = 1) vs control (T_jk = 0), the average causal effect of treatment in basket k at time t_i is:
E(Y_ijk | T_jk = 1) − E(Y_ijk | T_jk = 0) = E(θ_01 + V_01k) + t_i E(θ_11 + V_11k).
Again, there should be no difference (other than due to chance) between baseline measurements, and so our treatment effect estimands are:
δ_k = E(θ_11 + V_11k), k = 1, … , K.
All baskets share the common θ_11 parameter (overall mean treatment effect), with information sharing between the δ_k achieved via partial pooling. The shrinkage parameter τ²_11 from (5) determines the degree of borrowing between the V_11k (along with the within-basket sample size, with smaller groups borrowing more than larger groups). In extreme cases where τ²_11 → 0 or τ²_11 → ∞, the BHM-L reduces to the situations of complete pooling or no pooling (stratified BHM), respectively.

Longitudinal exchangeable-nonexchangeable method (EXNEX-L)
We have extended the EXNEX method proposed by Neuenschwander et al 24 via incorporation into our two- and three-level longitudinal framework. EXNEX is a mean-driven method which allows strong heterogeneity between baskets by grouping treatment effects into cohorts of exchangeable and nonexchangeable baskets. Each basketwise treatment effect is assigned a fixed probability of being exchangeable, π_k, or not exchangeable, (1 − π_k) (probabilities are set by the user).
As with the BHM-L, at level 1: Y_ijk = β_0jk + β_1jk t_ijk + R_ijk. At level 2: β_0jk = θ_00k + θ_01k T_jk + u_0jk and β_1jk = θ_10k + θ_11k T_jk + u_1jk.
At level 3, as with the stratified BHM and BHM-L, the treatment effect parameter on which we wish to share information is again the θ_11k term, which will be estimated at either level 2 (as in the [nonexchangeable] stratified model) or level 3 (as in the [exchangeable] BHM-L).
For exchangeable (EX) cohorts, the θ_11k are included in the level 3 model as in the BHM-L defined in Section 2.4, that is,
θ_11k = θ_11 + V_11k,
where θ_11 is an overall mean and the V_11k are random effects such that V_11k ∼ N(0, τ²_11). For nonexchangeable (NEX) cohorts, the θ_11k are modeled as in the stratified model defined in Section 2.1, that is, each θ_11k is given its own prior distribution.
This leads to a mixture model for each θ_11k, such that,
θ_11k ∼ π_k N(θ_11, τ²_11) + (1 − π_k) p_k(θ_11k),
where p_k denotes the basket-specific (NEX) prior distribution.
The EXNEX model covers the special cases of fully exchangeable or stratified analysis by setting π_k = 1 or π_k = 0 for all baskets.
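The mixture structure can be illustrated by sampling basket-wise effects from an EXNEX-style prior; here the nonexchangeable component is taken, for illustration only, to be a normal distribution with hypothetical parameters m_nex and s_nex (the choice of NEX prior is left to the analyst):

```python
import random

def sample_exnex_prior(pi_k, mu_ex, tau_ex, m_nex, s_nex, n=10000, seed=0):
    """Draw basket-specific treatment effects from an EXNEX-style mixture:
    with probability pi_k the effect is exchangeable, N(mu_ex, tau_ex^2);
    otherwise it comes from a basket-specific (NEX) prior, here assumed
    N(m_nex, s_nex^2) for illustration."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        if rng.random() < pi_k:
            draws.append(rng.gauss(mu_ex, tau_ex))   # EX component
        else:
            draws.append(rng.gauss(m_nex, s_nex))    # NEX component
    return draws

# pi_k = 1 recovers the fully exchangeable case; pi_k = 0 the stratified case.
draws = sample_exnex_prior(pi_k=0.5, mu_ex=0.5, tau_ex=0.1, m_nex=0.0, s_nex=1.0)
```

In the fitted EXNEX-L model the component membership is of course inferred jointly with the other parameters rather than sampled independently as in this sketch.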

Longitudinal HD method (HD-L)
We have extended the HD method proposed by Zheng and Wason 22 via incorporation into our longitudinal framework.
The advantage of this variance-driven method is that it avoids any assumption of exchangeability of treatment effects, and instead uses a distributional distance measure, the HD, 34 to inform borrowing only from the most similar, or most "commensurate," basket(s).
The HD-L model is defined in the same way as the stratified (independent) BHM in Section 2.1, the only difference being in allowing the sharing of information on the treatment effect parameters, θ_11k = δ_k, using the HD method explained briefly below.
The HD-L method is split into two main stages.
In the first stage, we apply the stratified BHM from Section 2.1 to obtain estimates of the independent δ_k parameter distributions. We label the means of the independent (posterior) distributions δ̃_k and the variances σ̃²_k, so that for the independent estimates from the first stage, δ_k ∼ N(δ̃_k, σ̃²_k). In the second stage, for each pair of baskets, consisting of the contemporary basket under consideration (k*) and each pairwise basket (k ≠ k*), a prior distribution is specified for the similarity between them. In Reference 22 this is called the "commensurate predictive prior" (CPP). As parameters for this prior distribution we use the (independent) posterior mean of the pairwise basket, δ̃_k (from the first stage), and shrinkage parameter 1/σ²_kk*:
δ_k* | δ̃_k, σ²_kk* ∼ N(δ̃_k, σ²_kk*). (7)
Here σ²_kk* parameterizes the similarity between δ_k* and δ̃_k and determines the degree of (pairwise) information borrowing. A prior distribution, g_k, is placed on the shrinkage parameter, which is set to a function of the HD, d_kk*. As discussed in Reference 22, g_k is chosen as a spike and slab distribution (the main reason being that it allows for strong borrowing given a basket with sufficiently consistent treatment effect, and weak borrowing otherwise). The spike and slab prior requires a metric by which to quantify the degree of pairwise similarity between baskets. To express the degree of (in)commensurability between pairwise baskets, the HD 34 is utilized.
The HD between two independent posterior distributions, with densities f_k and f_k*, is given by:
d²_kk* = 1 − ∫ √(f_k(δ) f_k*(δ)) dδ.
The CPP in (7) essentially uses the HD to apply a (normalized) weight p_kk* to each δ̃_k (k ≠ k*), where the weights are between 0 and 1 and sum to 1. Taking into account each pairwise basket, we then take the sum of all weighted CPPs to form a "marginal predictive prior" (MPP). 22
In this way, the total marginal estimate of the new δ_k* is gained from combining information from each k ≠ k* basket according to the similarity of the information. The MPP is then updated via Bayes' rule with the contemporary basket data to a robust posterior estimate for δ_k*:
π(δ_k* | y_(−k*), y_k*) ∝ L(y_k* | δ_k*) × MPP(δ_k* | y_(−k*)).
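For normal first-stage posteriors the HD has a well-known closed form, which makes the pairwise similarity computation cheap. The sketch below uses that closed form; the weighting function (1 − d, normalized) is a simplification for illustration and is not the exact p_kk* or spike-and-slab construction of Reference 22:

```python
import math

def hellinger(mu1, s1, mu2, s2):
    """Hellinger distance between N(mu1, s1^2) and N(mu2, s2^2), closed form."""
    bc = (math.sqrt(2 * s1 * s2 / (s1 ** 2 + s2 ** 2))
          * math.exp(-((mu1 - mu2) ** 2) / (4 * (s1 ** 2 + s2 ** 2))))
    return math.sqrt(1.0 - bc)

def pairwise_weights(post_means, post_sds, k_star):
    """Normalized borrowing weights for the contemporary basket k_star:
    larger weight for more similar (smaller-HD) baskets. The similarity
    function 1 - d is an illustrative choice only."""
    sims = {k: 1.0 - hellinger(post_means[k_star], post_sds[k_star],
                               post_means[k], post_sds[k])
            for k in range(len(post_means)) if k != k_star}
    total = sum(sims.values())
    return {k: v / total for k, v in sims.items()}

# Hypothetical first-stage posterior summaries for three baskets:
w = pairwise_weights([0.0, 0.1, 3.0], [0.5, 0.5, 0.5], k_star=0)
print(w)  # basket 1 (similar to basket 0) receives more weight than basket 2
```

The distance is 0 for identical posteriors and approaches 1 as the posteriors separate, which is what allows it to gate the amount of pairwise borrowing.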

MOTIVATING EXAMPLE-BICARB TRIAL
The BiCARB study investigated the effect of oral sodium bicarbonate therapy on function and quality of life in older patients with chronic kidney disease (CKD) and low-grade acidosis. Participants in the trial received a total of 24 months of either bicarbonate or placebo, with the results providing a test of the overall clinical and cost-effectiveness of this commonly used therapy in older patients with severely reduced kidney function. 35 For the purpose of demonstrating our methods on a real dataset, a subset of data from BiCARB was modeled and analysed as if it were a longitudinal basket trial with continuous outcomes, according to each of the methods we describe in Section 2. In contrast to the original trial, which used ordinal scores in the Short Physical Performance Battery as the primary outcome measure, we chose as our continuous outcome measure serum bicarbonate concentration, measured at 0, 3, 6, and 12 months. Patients were stratified into baskets based on the underlying cause of their CKD (assigned hierarchically where patients have more than one cause), as per Table 1. Our (contrived) research question of interest is treatment effect over time, whereby a single drug (oral sodium bicarbonate therapy) targeting a common disease feature (serum bicarbonate concentration) is investigated in different diseases. It would appear that individual trajectories are well approximated by a linear model (see Appendix A). As we are primarily interested in the timewise trend of treatment, we aim to estimate δ_k, the average basketwise difference in trajectories of serum bicarbonate concentration for the treatment group compared to placebo. Our motivation to use borrowing methods is to increase power to detect positive treatment effects since there are small sample sizes in some baskets, and, since the drug targets the same disease feature in all patients, we expect some level of similarity in responses between the different diseases.

Prior settings
For each model, weakly informative priors and hyper-parameters were chosen in order that fixed effects (apart from the overall intercept, θ_00k) are centered at zero and to formalize the belief that large treatment differences are unlikely. 26 The use of half-normal variance priors is consistent with Reference 22 and the recommendation by Reference 21. The parameter hierarchy for each of our models is given in Appendix C.
The following settings were used in each model. (2) In our BHM-L, a half-normal prior, HN(1), was set for all variance parameters, apart from τ_11. The prior for τ_11 was set as τ_11 ∼ HN(0.125), to facilitate more information sharing on treatment effects (prior median 0.084, 95% credible interval (0.004, 0.280)). Priors for population level parameters were set as below (with credible intervals the same as in (1)).
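The quoted summaries of the HN(0.125) prior can be checked from the half-normal quantile function (assuming HN(s) denotes a half-normal distribution with scale s):

```python
from statistics import NormalDist

def halfnormal_quantile(p, scale):
    """Quantile of a half-normal distribution with the given scale:
    if X ~ N(0, scale^2), then |X| has quantile scale * Phi^{-1}((1 + p) / 2)."""
    return scale * NormalDist().inv_cdf((1 + p) / 2)

# Summaries of the HN(0.125) prior on tau_11:
median = halfnormal_quantile(0.5, 0.125)
lo = halfnormal_quantile(0.025, 0.125)
hi = halfnormal_quantile(0.975, 0.125)
print(round(median, 3), round(lo, 3), round(hi, 3))
```

This reproduces the stated prior median of 0.084 and 95% credible interval (0.004, 0.280).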

Basic settings
The main characteristics of the longitudinal basket trials we simulate are based on the motivating BiCARB trial, as well as scenarios presented in Reference 22. These scenarios are presented in Table 2. In particular, mean parameter values were set to the same values as originally used in Reference 22 (although the parameters here have a different interpretation, as discussed in Section 2.3). The nine scenarios considered feature varying mean treatment effect sizes and different degrees of heterogeneity between baskets. As detailed in Reference 22, all sets of the "true" mean values of θ_k are realizations from distinct multivariate normal distributions. These scenarios represent an interesting variety of situations, ranging from the global null to highly heterogeneous treatment effects between baskets. Six baskets (K = 6) are considered, with unequal numbers of patients in each basket (n_k = (10, 10, 14, 16, 20, 20)). Patients were randomized equally to treatment or control by subtrial; however, our models can accommodate unequal randomization.
Continuous endpoints were simulated for each patient at 0, 3, 6, and 12 months. Timepoints were assumed to be the same for each patient. In a real trial, actual visits would occur in windows around the planned timepoints and would not all be the same; however, one of the benefits of structuring the model in a hierarchical manner is that there is no need for all patients to be measured at the same times.
Random effects parameter realizations and outcome data Y_ijk were generated according to the three-level BHM-L in Section 2.4. For each of the simulated trials the same fixed parameters were used. The residual (level 1) variance was set at σ² = 0.5², the level 2 intercept and slope variances as σ²_0 = σ²_1 = 0.4² (the correlation between intercepts and slopes was set to ρ = 0), and the level 3 intercept/slope variances, split by treatment allocation, as σ²_00 = σ²_01 = σ²_10 = 0.3². Fixed population-level parameters were set as β_00 = 20, β_01 = 0, β_10 = 0.5, β_11 = 0.
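The data-generating step can be sketched as follows. This is our schematic reading of the three-level structure with the parameter values above; the exact parameterization of Section 2.4 (in particular the treatment-specific level 3 split) may differ, and `simulate_trial` is a hypothetical helper, not the paper's code:

```python
import random

# Population fixed effects and variance components from the simulation setup
BETA = dict(b00=20.0, b01=0.0, b10=0.5, b11=0.0)
SD_L3, SD_L2, SD_L1 = 0.3, 0.4, 0.5   # level 3 / level 2 / residual SDs
TIMES = [0, 3, 6, 12]                  # visit months
N_K = [10, 10, 14, 16, 20, 20]         # patients per basket (K = 6)

def simulate_trial(seed=1):
    rng = random.Random(seed)
    data = []
    for k, n in enumerate(N_K):
        # level 3: basket-specific intercept/slope deviations, split by arm
        u = {arm: (rng.gauss(0, SD_L3), rng.gauss(0, SD_L3)) for arm in (0, 1)}
        for i in range(n):
            arm = i % 2                                    # 1:1 randomization
            b0, b1 = rng.gauss(0, SD_L2), rng.gauss(0, SD_L2)  # level 2
            u0, u1 = u[arm]
            intercept = BETA["b00"] + u0 + b0 + (BETA["b01"] if arm else 0.0)
            slope = BETA["b10"] + u1 + b1 + (BETA["b11"] if arm else 0.0)
            for t in TIMES:
                y = intercept + slope * t + rng.gauss(0, SD_L1)  # level 1
                data.append((k, i, arm, t, y))
    return data

rows = simulate_trial()
print(len(rows))  # 90 patients x 4 timepoints = 360 rows
```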
For all nine scenarios, 10 000 simulations were conducted and models fitted in R version 4.0.3. Gibbs sampling was completed in JAGS as called from R using the R2jags package. 36 Within each simulation replicate, 30 000 iterations were run on four parallel chains, and the first 13 000 iterations were discarded as "burn-in." JAGS code to implement each model is available at https://github.com/longitudinal-baskettrials/Bayesian-analysis-methods.
Each of the 10 000 simulated trials in each scenario was analyzed by each of the four methods outlined in Section 2. For each analysis model, the same priors and hyper-parameters as outlined in Section 3.1 were used.
Comparison between methods is in terms of the precision of the posterior means for θ_k, measured by average bias, variance, and mean squared error (MSE). Furthermore, true and false positive rates are reported by basket (analogous to frequentist statistical power and type I error rate). In particular, a Go decision was allocated to a basket if the posterior probability of the mean treatment effect size being larger than a prespecified value was above our decision threshold. Here, as in Reference 22, our prespecified ("clinically meaningful") mean treatment effect is 0.25. However, unlike Zheng and Wason, who use a posterior probability threshold of 0.975, we opt for a more relaxed threshold of 0.80. Therefore, a true Go was recorded when P(θ_k > 0.25) > 0.80 and θ_k(true) > 0. An analog of the overall type I error rate, or family-wise error rate (FWER), is also reported. The FWER is computed as the proportion of simulated trials with an erroneous Go decision made for one or more baskets. Finally, methods are compared in terms of their "calibrated power." Here, we calibrate a (single) posterior probability decision threshold for each model such that the maximum FWER (across the considered scenarios) is 5%. We then examine the corresponding power (ie, the calibrated power) at these decision thresholds. This allows us to compare the relative power of each method when the FWER is limited to 5% across the range of scenarios considered. This type of calibration may be desirable at the design stage for basket trials wishing to incorporate borrowing, in order to compare the operating characteristics of different analysis methods.

TABLE 2 Simulation scenarios: specification of "true" mean treatment effects, θ_k.

BiCARB analysis

Figure 1 displays estimates of the treatment effect parameters, θ_k, and corresponding 95% credible intervals obtained by analysing the BiCARB dataset via our proposed methods. An overall (pooled) estimate allows us to visualize the general motivation behind multilevel models, whereby they offer a compromise between fully pooled and stratified models. We see that all intervals straddle zero, which suggests that in each disease there is little difference between treatment groups in the change in serum bicarbonate concentration over time. These results demonstrate the benefits of borrowing information in terms of increasing the precision of parameter estimates, whereby intervals are narrower for borrowing methods than for stratified analysis. We see in Figure 1 that for all borrowing methods, treatment effect estimates tend toward the fully pooled estimate. In hierarchical models, the ratio of the within-level variation to the between-level variation has a strong influence on the degree of pooling at that level. Therefore, Figure 1 would seem to indicate that there is a relatively small amount of heterogeneity in the level 3 (between-basket) difference in slopes between treatment groups, relative to the corresponding heterogeneity between individual patient slopes at level 2. Examining the BHM-L parameter estimates, this is indeed the case; variability between θ_k estimates was estimated as σ̂²_11 = 0.03², while variability between patient slopes was estimated as σ̂²_1 = 0.10².

FIGURE 1 Posterior mean parameter estimates and 95% credible intervals for BiCARB data in a (contrived) basket trial analyzed under different models. An overall (pooled) treatment effect estimate is also displayed.

We can also see the differences between the methods in basket 6, which has the most heterogeneous stratified estimate. Here, BHM-L is most heavily shrunk toward the pooled estimate, while EXNEX-L and HD-L offer more of a compromise. This illustrates the motivation behind the EXNEX-L and HD-L methods, whereby, in moving away from the exchangeability assumption, they allow for more heterogeneity in the estimation of treatment effects.
It should also be noted that the degree of borrowing is influenced by the within-group sample size, with larger groups borrowing less. Again, we see this in basket 6, which has a small sample size relative to the other groups. Compared to the stratified estimate, the borrowing methods have allowed a large amount of information from other baskets to be shared due to the relatively small number of patients in this group.

Simulation study results

Stratified analysis has almost no bias across all scenarios. Among the borrowing methods, EXNEX-L and HD-L generally have lower bias than BHM-L, with EXNEX-L the lowest (see scenarios 1, 2, 3, 4, 7, and 8). This is because EXNEX-L and HD-L borrow information more cautiously than BHM-L; in these methods, the treatment effect estimate within each basket is less affected by treatment effects in other baskets. In scenarios with homogeneous treatment effects (scenarios 5, 6, and 9), all methods have almost zero bias. Stratified analysis has larger variance in parameter estimates than the other methods, in contrast to BHM-L, which has consistently low variance across all scenarios. This is the classic bias-variance trade-off. In general, all three borrowing methods have smaller MSE than stratified analysis, apart from in scenarios with highly heterogeneous treatment effects (see scenarios 2, 4, and 8), where BHM-L has higher MSE than the other methods for certain baskets. Again, this is due to BHM-L sharing information more indiscriminately than the other methods. HD-L and EXNEX-L offer similar performance in terms of MSE across all scenarios. Figure 5 compares the models in terms of basketwise power for each scenario. Generally, BHM-L has increased power over the other methods, due to its increased capacity for upward shrinkage when treatment is effective in all or most baskets (see scenarios 1-5, and 8).
However, in scenario 7 (Global Mixed 1) the null treatment effects in baskets 1-4 "overpower" the ability of BHM-L (and the other methods) to detect the small but positive treatment effects in baskets 5 and 6. Stratified analysis performs best in terms of power in scenario 7, since treatment effect estimates in baskets 5 and 6 are not dampened by the null effects in baskets 1 to 4. HD-L does well in many scenarios. Where it underperforms relative to BHM-L (eg, scenario 2, baskets 1, 2, and 6, or scenario 3, baskets 1, 5, and 6), the true effect is just above the clinical relevance threshold.
Type I error rates per basket and overall are displayed in Table 3 (type I error rates by basket are also displayed graphically in Appendix B). We give the family-wise error rate (FWER) as a measure of the overall type I error rate. This is the probability of one or more false positives per trial. An analog of the false discovery rate (FDR) is also shown. The FDR is defined as the percentage of "false positive" baskets out of all "positive" baskets (ie, the proportion of selected or recommended treatments that are actually ineffective). In the case where all null hypotheses are true, the FDR is equivalent to the FWER. If used as an error rate control metric, for all scenarios apart from the global null, control of the FDR is more liberal (and therefore more powerful) than control of the FWER. For this reason, some authors have proposed the FDR as a better error rate metric for basket trials (see, eg, Reference 37). In terms of FWER, stratified analysis is at risk of high type I error rates in all scenarios, but in particular in the global null situation (scenario 9), where the overall FWER is over 26%. Conversely, the information borrowing methods maintain low error rates under the global null.
In scenario 8 (Global Mixed 2), which is the most heterogeneous scenario, we see that the BHM-L error rates (by basket and FWER) are high. This demonstrates the benefits of using more sensitive information sharing methods in highly heterogeneous scenarios. Interestingly, even though the FWER is 26%, the corresponding FDR is relatively low at just over 6%. Therefore, even though the probability of making one or more false discoveries is high, in the long run these will be only 6% of all discoveries (ie, 94% of all discoveries are true discoveries). This may be an acceptable compromise to researchers in some cases.
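The two error metrics can be made concrete with a small sketch. Here `error_rates` is our own hypothetical helper operating on a trials-by-baskets matrix of Go decisions, not part of the paper's code:

```python
def error_rates(go, active):
    """FWER and FDR analogs from basketwise Go decisions.
    go: list of trials, each a list of boolean Go decisions per basket.
    active: boolean flags for truly active baskets."""
    n_trials = len(go)
    # FWER analog: proportion of trials with at least one false Go
    fwer = sum(any(g and not a for g, a in zip(row, active))
               for row in go) / n_trials
    # FDR analog: false Go decisions as a share of all Go decisions
    false_go = sum(g and not a for row in go for g, a in zip(row, active))
    all_go = sum(g for row in go for g in row)
    fdr = false_go / all_go if all_go else 0.0
    return fwer, fdr

# toy example: 4 simulated trials, 3 baskets, only basket 1 truly active
go = [[True, False, False],
      [True, True, False],
      [False, False, False],
      [True, False, True]]
fwer, fdr = error_rates(go, active=[True, False, False])
print(fwer, fdr)  # 0.5 0.4
```

Two of the four toy trials contain a false Go (FWER analog 0.5), while only two of the five Go decisions overall are false (FDR analog 0.4), illustrating how the FDR can be much lower than the FWER.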
In Figure 6, we give a comparison between methods in terms of their calibrated power. The calibrated power uses posterior probabilities, P(θ_k > 0.25), to tune separate decision thresholds for each method such that the overall FWER for that method is at most 5%. In this setting, HD-L stands out as performing well compared to the other methods, maintaining the highest calibrated power across all scenarios.
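The calibration step can be sketched as a grid search for the smallest threshold whose worst-case FWER over all scenarios is at most 5%. The helper names and toy data below are our own illustration, not the paper's implementation:

```python
def fwer_at(threshold, probs, active):
    """FWER analog at a given Go threshold.
    probs: trials x baskets matrix of posterior probabilities P(theta_k > 0.25)."""
    hit = sum(any(p > threshold and not a for p, a in zip(row, active))
              for row in probs)
    return hit / len(probs)

def calibrate(post, active, grid=None):
    """Smallest threshold on the grid with max-over-scenario FWER <= 5%."""
    grid = grid or [i / 1000 for i in range(500, 1000)]
    for c in grid:
        if max(fwer_at(c, post[s], active[s]) for s in post) <= 0.05:
            return c
    return None

# toy data: one null scenario, 4 simulated trials x 2 (inactive) baskets
post = {"null": [[0.90, 0.10], [0.60, 0.20], [0.30, 0.40], [0.20, 0.10]]}
active = {"null": [False, False]}
print(calibrate(post, active))  # 0.9
```

In the toy data, the largest null-basket posterior probability is 0.90, so the smallest admissible threshold is 0.9; with real simulation output, each method would receive its own calibrated threshold in this way.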

DISCUSSION
In this paper we have developed four analysis methods for basket trials with continuous longitudinal outcomes (via extension of methods used in the cross-sectional setting). Basket trials are an innovative study design used primarily in phase II oncology trials; however, a promising new area for the application of basket trials is chronic aging-related diseases, since conditions in this area often share common features. We demonstrate how this could be achieved by applying our methods to various simulation scenarios as well as to real longitudinal data from a chronic disease trial. The lessons that can be drawn from the BiCARB dataset are limited to some extent, as it was not designed as a basket trial and analyzing it as one is somewhat contrived. Future work should apply the methods described in this paper to longitudinal, continuous outcomes collected in actual basket trials (this paper could help in the design of those trials). We have shown in our simulation study that methods which incorporate longitudinal borrowing can improve trial efficiency in the chronic disease setting by increasing power to detect treatment effects in many scenarios. This is particularly the case when all baskets are similarly active. In addition, in the global null scenario, borrowing methods maintain low false positive rates compared to stratified analysis. However, in highly heterogeneous scenarios with a mixture of active and inactive baskets, the trade-off between power and type I error rates for different methods becomes apparent. In general, assuming similar sample sizes within baskets and equal randomization, when the proportion of active baskets outweighs the inactive baskets, borrowing can increase power over stratified analysis at the risk of also increasing type I error rates.
Conversely, when the proportion of inactive baskets outweighs the active baskets, borrowing can lower type I error rates when compared with stratified analysis at the cost of lowering the power. Therefore, the most suitable method depends on trial priorities as well as the expected level of heterogeneity in treatment effects between baskets.
In early phase trials, in which basket trials have primarily been used so far, the goal is often to detect promising baskets for further investigation (ie, whether the treatment is promising for any indication). Here, where only a small proportion of baskets are anticipated to respond to treatment, maintaining power to detect promising treatment effects is a priority. We can liken this to scenario 7 (Global Mixed 1) of our simulation study where there are two active and four inactive baskets. In this case, stratified analysis is the most powerful, at the cost of an increased risk of false positives in null baskets. BHM-L has the lowest risk of false positives, but active basket estimates are overly shrunk downwards and therefore lose power. Here, the (over) sharing of information has led to significant down-weighting of the true treatment effects. In this setting, EXNEX-L and HD-L have higher power than BHM-L while having a lower risk of false positives than stratified analysis, and therefore may offer an appealing compromise between the two extremes.
In later phase trials, a higher proportion of baskets may be anticipated to respond positively to treatment. If there is some evidence or expectation that the treatment might be effective in all indications, BHM-L can be a good model choice, since power to detect a treatment effect in each basket can be significantly increased via more generous information sharing than under the other methods. This is demonstrated, for example, by the power plots for scenarios 1 to 5, where all baskets are active (see Figure 5).
However, as discussed above, when there is a small proportion of null baskets, BHM-L is at risk of false positives, since the exchangeability assumption does not hold. We see this in scenario 8 (Global Mixed 2), where there are only two null baskets and a mixture of large and small effect sizes in the other four baskets (however, BHM-L also has the highest power here). Therefore, in the case of a confirmatory trial where the consequences of a false positive are more serious, other methods may be more suitable. EXNEX-L and HD-L have lower type I error rates than both BHM-L and stratified analysis in scenario 8 (Global Mixed 2), while maintaining similar power, bias, and MSE. Furthermore, the HD-L method offers more robust performance overall, maintaining higher power than both EXNEX-L and stratified analysis across a wider range of potential scenarios (see Figure 5), and so may be the preferable method. In most cases, sharing of information leads to increased power to detect treatment effects over stratified analysis.
Power could further be increased in various ways. One possibility would be to use more informative priors based on phase II results (assuming this is acceptable from a regulatory standpoint). The method recently proposed in Reference 38 could be applied, with suitable adaptations for longitudinal endpoints. Methods to combine one or more historical longitudinal data sources to inform a current trial have also been considered by Qi et al. 39 In all cases, we recommend careful consideration of these issues at the design stage.
As a final point, we wish to emphasize that the models and corresponding correlation structure presented in this paper were constructed to answer the research question: is there a difference in mean response trajectories between treatment groups in each basket over the course of the study, and if so, to what extent? The main advantage of choosing the class of random-effects (hierarchical) models over multivariate methods for this purpose is that they efficiently facilitate the implementation of longitudinal borrowing between groups (via the nesting structure), while accounting for between- as well as within-individual and cluster correlation over time (each individual/cluster has its own intercept and slope). This also leads to a clear interpretation of the estimated parameters in terms of the research question. An assumption of the class of random-effects models is that, at the lowest level, the error structure conditional on the random effects is independent, 5 with correlations modeled by way of random effects at each level. While this may be a limitation in some cases, we feel it offers particular benefits in a basket trial setting for the reasons already outlined. Furthermore, modeling the time variable as continuous accommodates the possibility of individuals being measured at different timepoints. It is also simple to extend the model to any number of timepoints and any number of levels of clustering. In contrast, if an unstructured time-series correlation structure were instead utilized at the first stage (coming under the class of multivariate methods 5 ), time would then be modeled as a discrete factor with n levels (n = number of timepoints). Methods such as this may be useful in analysing clinical trial data when clustering effects are of less interest and the primary outcome is observed at (relatively few) regularly spaced intervals and is rarely missing.
For a large number of timepoints and/or missing data, this approach becomes unattractive since many of the (between-timepoint) covariance parameters will be poorly estimated. 4

DATA AVAILABILITY STATEMENT
Software in the form of JAGS code together with R functions is available on GitHub (see: https://github.com/longitudinalbasket-trials/Bayesian-analysis-methods). All code is adapted from code at https://github.com/BasketTrials/Bayesiananalysis-models. 22

APPENDIX A

Prior to fitting our models, we recommend plotting response over time by subject. If this relationship appears to be highly non-linear, then alternative models should be considered. In Figure A1, a random selection of individual trajectories from the BiCARB dataset confirms that an approximately linear trajectory is an adequate representation of the change over time within individuals.

FIGURE A1 A random selection of 20 individuals' bicarbonate concentrations versus month, with a simple linear regression line fit to each. The (nonidentifiable) subject ID is given in the strip above each panel.
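The recommended visual check can also be performed numerically: fit an ordinary least-squares line to each subject's trajectory and inspect the fits. A minimal sketch with a hypothetical subject (not BiCARB data; `ols_line` is our own helper):

```python
def ols_line(t, y):
    """Closed-form OLS intercept and slope for one subject's trajectory."""
    n = len(t)
    tbar, ybar = sum(t) / n, sum(y) / n
    slope = (sum((ti - tbar) * (yi - ybar) for ti, yi in zip(t, y))
             / sum((ti - tbar) ** 2 for ti in t))
    return ybar - slope * tbar, slope  # (intercept, slope)

t = [0, 3, 6, 12]             # visit months, as in BiCARB
y = [21.0, 22.4, 23.1, 25.9]  # one hypothetical subject's bicarbonate levels
intercept, slope = ols_line(t, y)
print(round(intercept, 6), round(slope, 6))  # 21.0 0.4
```

Large residuals from such per-subject lines (or visibly curved trajectories, as in the plots) would signal that the linear-trajectory assumption should be revisited.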

APPENDIX B. TYPE I ERROR RATE BY BASKET
In Figure B1, a plot of the analogue of the type I error rate (false positive rate) by basket is given. This is a visual representation of the type I error rates by basket given in Table 3.

FIGURE B1 Type I error rate (analogue of) by basket.

APPENDIX C. MODEL PARAMETER HIERARCHIES

Figure C1 shows the hierarchies of parameters for the stratified and HD-L models, where there are K separate models (a separate model for each basket). Figure C2 shows the hierarchy of parameters for the BHM-L model, in which all baskets are included under a single model. EXNEX-L is a mixture model (as described in Section 2.5), with basketwise parameter hierarchies according to either Figure C1 or Figure C2, depending on the probability of exchangeability.

FIGURE C1 Stratified and longitudinal Hellinger distance method model parameter hierarchy (separate models for each basket).