Developing more generalizable prediction models from pooled studies and large clustered data sets

Prediction models often yield inaccurate predictions for new individuals. Large data sets from pooled studies or electronic healthcare records may alleviate this with an increased sample size and variability in sample characteristics. However, existing strategies for prediction model development generally do not account for heterogeneity in predictor‐outcome associations between different settings and populations. This limits the generalizability of developed models (even from large, combined, clustered data sets) and necessitates local revisions. We aim to develop methodology for producing prediction models that require less tailoring to different settings and populations. We adopt internal‐external cross‐validation to assess and reduce heterogeneity in models' predictive performance during the development. We propose a predictor selection algorithm that optimizes the (weighted) average performance while minimizing its variability across the hold‐out clusters (or studies). Predictors are added iteratively until the estimated generalizability is optimized. We illustrate this by developing a model for predicting the risk of atrial fibrillation and updating an existing one for diagnosing deep vein thrombosis, using individual participant data from 20 cohorts (N = 10 873) and 11 diagnostic studies (N = 10 014), respectively. Meta‐analysis of calibration and discrimination performance in each hold‐out cluster shows that trade‐offs between average and heterogeneity of performance occurred. Our methodology enables the assessment of heterogeneity of prediction model performance during model development in multiple or clustered data sets, thereby informing researchers on predictor selection to improve the generalizability to different settings and populations, and reduce the need for model tailoring. Our methodology has been implemented in the R package metamisc.

Large clustered data sets are increasingly available for prediction research, for example from individual participant data meta-analysis (IPD-MA) of multiple studies or from electronic healthcare records (where the data are clustered by region, hospital, practice, etc). 1 Such data sets are frequently used to develop prediction models, to predict a current health status to aid in diagnosis, or a future health outcome to provide a prognosis that may inform clinical decision making. [2][3][4] Well-known examples are PHASES, 5 INTERCHEST, 6 S 2 TOP-BLEED, 7 and EuroSCORE, 8 all of which were developed using data from multiple centers or studies. Unfortunately, prediction model studies that are based on IPD-MA or electronic healthcare records (EHR) rarely account for the potential presence of between-cluster heterogeneity (eg, EuroSCORE 8 ). 9,10 Sometimes, parameters that capture the baseline risk are stratified by cluster (eg, INTERCHEST 6 ), but then usually no guidance is provided on how to use the prediction model in new patients.
Although random effects models are generally recommended for dealing with the presence of clustering and heterogeneity of predictor effects, their implementation during prediction model development hampers the applicability of the estimated regression coefficients. In particular, random effects modeling does not indicate which parameter values (for the random intercept and predictor coefficients) should be used when the model is applied in new settings and populations. Typically, a single value (eg, the mean) is used for these parameters when making predictions.
When new observations or even whole new clusters become available, the common and random effects of the model may be updated through the use of dynamic random effects methods. 11 This can improve the predictive performance of the prediction model in a validation setting. However, when the model is used to provide a prediction for an individual who is not part of a known cluster, there is no information that can be used to update the model, which eliminates the option of applying a dynamic method.
In general, a developed prediction model cannot generate accurate predictions in new patients when the true value of its parameters (eg, the intercept term) varies across the targeted settings and populations, especially when the true value of certain parameters is zero or has a reversed sign in some clusters. This heterogeneity may arise from differences in observed and unobserved patient characteristics, differences in patients' treatment and management strategies, differences in predictor and outcome definitions, and differences in measurement methods across clusters. These differences may be temporal, geographic, or domain-related in nature. A temporal validation, where the validation study is typically performed with a similar study design and protocol by the same authors but in individuals selected from a later time period than the development study, is considered to be the weakest form of external validation. Whereas in a geographic validation small differences in the study population or quality of care may be present, in a domain validation the target population or targeted outcome can be different. For instance, in a domain validation one may investigate whether a prediction model developed in primary care is transportable to secondary or tertiary care. 12 The impact of heterogeneity in predictive associations (ie, the effects of the predictors included in the model) has been well documented in the literature (Table 1). 13,14 Many developed prediction models perform more poorly than anticipated and require local revisions prior to implementation. 15 These revisions may involve a simple intercept update, a recalibration of the linear predictor (ie, rescaling all regression coefficients by a single value), the re-estimation of all individual regression coefficients, or even the inclusion of new predictors. 12,[16][17][18] Unfortunately, revisions are rarely generalizable to other settings and populations; several reviews have found that prediction model performance varies substantially across validation studies. 19,20 Therefore, such revisions, including recalibration and predictor selection, are preferably performed during prediction model development. We consider situations where it is not possible (or very difficult) to define a priori the potential source(s) of between-cluster heterogeneity, such as temporal calibration drift. 21,22 We aimed to develop methodology that is universally applicable when the causes of poor transportability cannot be directly addressed a priori.
The identification of between-cluster heterogeneity is not possible when data are available from only a single setting or (sub)population. For this reason, the use of clustered data during prediction model development and its subsequent validation offers a critical opportunity to inspect whether this heterogeneity would actually be a concern were the model to be implemented in clinical practice. 9,10,13,14,[23][24][25][26][27][28] However, actually resolving the presence of between-cluster heterogeneity (and thus ensuring model predictions are accurate for all clusters) remains a difficult challenge for which limited guidance is available. 29 For this reason, we here explore an alternative approach that aims to reduce this between-cluster heterogeneity in prediction model performance and to minimize the need for estimating setting-specific model parameters, thereby improving the model's generalizability.
Recently, internal-external cross-validation (IECV) has been introduced to assess the presence of heterogeneity of a model's performance during its development. 10,23,25 IECV is a special case of cross-validation; available data are split nonrandomly in a natural manner by iteratively taking each cluster (or study) as a hold-out sample. In each iteration, a model is developed on the retained clusters, and then the model is tested in the hold-out cluster. A key advantage of this is that it allows the transportability (ie, the generalizability to other populations and settings) of the model to be assessed multiple times.

TABLE 1 Types of heterogeneity

• Within-cluster heterogeneity of (measured and unmeasured) covariate values. This heterogeneity is necessary to allow for individual risk prediction
• Between-cluster heterogeneity in the distribution (eg, mean and SD) of predictor values. This heterogeneity may appear when clusters represent different geographic regions, countries, studies, etc
• Between-cluster heterogeneity in baseline risk (eg, outcome incidence). This is often related to between-cluster heterogeneity in eligibility criteria (eg, randomized trial versus observational study, or between settings with differences in patient management), or to between-cluster heterogeneity in unmeasured predictor values
• Between-cluster heterogeneity of predictor-outcome associations (and thus in the estimated model coefficients). These differences may, for instance, appear when combining data from studies with differences in patient management, variable and outcome definitions, etc
• Between-cluster heterogeneity of prediction model performance. The "true" performance of a prediction model may vary between clusters due to differences in their underlying predictor-outcome associations, or due to differences in their distribution of (measured or unmeasured) predictor values 13,14,31

TABLE 2 Highlights

• Many studies have shown that overfitting can lead to poor performance when predicting the outcome in new participants, but also that this can be resolved by utilizing large data sets [35][36][37]
• Prediction models may still perform poorly even when overfitting is not an issue. This situation can occur when predictor effects do not transport well across different settings. 10,38,39 SIECV aims to address this issue by incorporating this transportability in the model building process
• SIECV is designed for situations where it is difficult to locally update model predictions. It aims to yield a global model with stable performance across the relevant settings and (sub)populations, and may therefore have suboptimal performance within (some of the) individual clusters, particularly when they differ much from one another
• SIECV can help to find an appropriate balance between a model's internal and external validity, and to limit the need for local revisions (by avoiding heterogeneity in predictor-outcome associations)
• SIECV does not address miscalibration-in-the-large. This can be resolved using a relatively simple update of the model's intercept term, which will often remain necessary. 18,40 Finally, access to clustered data allows one to estimate multiple intercept terms, and thus to ascertain under which conditions an intercept update may be necessary 9,13

In this article, we will first revisit the IECV framework for the assessment of model performance in large clustered data sets (Section 2). For the highlights of this article, see Table 2. We then extend the IECV framework to inform predictor selection during prediction model development in Section 3, in order to identify and reduce the impact of heterogeneity on the model's performance within and across clusters in the large combined data set. We then apply the methods in our motivating examples in Sections 4 and 5. Finally, we provide a discussion in Section 6. Our methodology can be applied using the metapred function in the R package metamisc. 30
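As a minimal, hedged sketch of how such an analysis might be set up, the call below uses metamisc::metapred; the data frame ipd and its column names are hypothetical, and the argument names (strata, scope, family) reflect our reading of the package interface and should be checked against ?metapred in the installed version of metamisc.

```r
# Hedged sketch (not a definitive call): internal-external cross-validation of a
# prespecified logistic model with metamisc::metapred. 'ipd' is a hypothetical
# data.frame with binary outcome 'y', cluster identifier 'study', and candidate
# predictors x1, x2, x3.
library(metamisc)

f <- y ~ x1 + x2 + x3

fit <- metapred(
  data    = ipd,
  strata  = "study",  # variable identifying the clusters (studies/centers)
  formula = f,        # model to be evaluated
  scope   = f,        # per our reading: scope equal to formula means no predictor selection
  family  = binomial  # passed on to the underlying glm() fits
)
summary(fit)
```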

INTERNAL-EXTERNAL CROSS-VALIDATION FOR MODEL VALIDATION
Resampling procedures allow the optimal use of the available data, as all data can be used for model development and subsequent evaluation. Traditionally in cross-validation procedures, the data are iteratively split into a development and a validation set by randomly sampling without replacement. In each iteration, a model is estimated on the development sample and predictions are made for the random validation sample. The performance of these predictions in the validation samples is then averaged across iterations, thereby giving an estimate of the reproducibility of model performance. When data are clustered across different studies or settings, traditional resampling procedures that do not account for clustering cannot directly be applied. 32 For this reason, several extensions have been proposed that preserve the clustering within and the heterogeneity across the generated samples. In the so-called internal-external cross-validation approach, the data are split by cluster, which may represent the studies in an IPD-MA or the centers in data from EHR. 23,25,26 A model is then iteratively fitted in K − 1 clusters (Section 2.1) and its performance is calculated in the remaining cluster (Section 2.2). This is repeated K times, so that, provided that sufficient data are available in the development and validation clusters, a performance estimate and its SE are available for each of the clusters. Thus, IECV is cross-validation where the hold-out samples are nonrandom, in the presence of between-cluster heterogeneity. IECV therefore allows the transportability of a developed model to be studied multiple times. Note that if all patients are exchangeable across clusters, IECV corresponds to traditional cross-validation and assesses model reproducibility (rather than transportability). 10 In contrast to traditional cross-validation, estimates of the performance in the hold-out samples cannot simply be averaged, as the variation within and across clusters needs to be taken into account. This can be achieved by adopting a (fixed effect or random effects) meta-analysis of the performance estimates (Section 2.2), 33 or by weighting the performance estimates by the number of events in each cluster. 34 As the data are split nonrandomly, this allows the transportability (ie, the generalizability to other populations and settings) of the model to be assessed.
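To make the procedure concrete, the following illustrative R sketch runs the basic IECV loop for a binary outcome. The simulated data, the variable names (cluster, x1, x2, y), and the choice of the Brier score are ours for illustration; this is not the metapred implementation.

```r
# Illustrative IECV loop for a binary outcome (sketch, not the metapred implementation).
# Each cluster is held out once; a model is fitted on the remaining clusters and
# its predictions are evaluated in the hold-out cluster.
set.seed(1)

K <- 5
ipd <- do.call(rbind, lapply(seq_len(K), function(k) {
  n  <- 400
  x1 <- rnorm(n); x2 <- rbinom(n, 1, 0.3)
  lp <- -1 + (0.8 + 0.2 * k) * x1 + 0.5 * x2   # predictor effect varies by cluster
  data.frame(cluster = k, x1 = x1, x2 = x2,
             y = rbinom(n, 1, plogis(lp)))
}))

iecv <- lapply(seq_len(K), function(h) {
  dev   <- ipd[ipd$cluster != h, ]             # K - 1 development clusters
  val   <- ipd[ipd$cluster == h, ]             # hold-out cluster
  fit   <- glm(y ~ x1 + x2, family = binomial, data = dev)
  pr    <- predict(fit, newdata = val, type = "response")
  brier <- mean((val$y - pr)^2)                # MSE of the predicted probabilities
  se    <- sd((val$y - pr)^2) / sqrt(nrow(val))  # approximate SE within the hold-out cluster
  data.frame(holdout = h, brier = brier, se = se)
})
do.call(rbind, iecv)
```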

Model fitting
The development phase of IECV may involve a one-stage or a two-stage IPD-MA approach. In the two-stage approach, the prediction model is fitted separately in each cluster. The model coefficients estimated in each of the K − 1 development clusters are then combined using standard meta-analysis techniques. In the one-stage approach, a generalized linear model (GLM) is estimated in each of the K development samples consisting of K − 1 clusters. This model may account for clustering by including random intercepts and/or slopes for within-cluster covariates. 13,14,29,34 A disadvantage of the one-stage approach in IECV is that the data from each cluster need to be used K − 1 times for model fitting. In the two-stage IECV approach, on the other hand, the data from each cluster only need to be used for model fitting once, as the second stage comprises meta-analysis of different combinations of coefficients and their SEs. The two-stage approach may therefore substantially reduce the computation time. However, the two-stage approach may not be feasible when clusters are relatively small, as parameters then become difficult to estimate. For this reason, the two-stage approach appears beneficial when most clusters (studies) in the meta-analysis are not small, and we adopt this approach in our article.

Let $x_{p,k,j}$ be the value of a prespecified predictor $p$, $p = 1, \ldots, P$ (or function thereof) measured in individual patient $j$, $j = 1, \ldots, N$ in cluster $k$, $k = 1, \ldots, K$. Then their outcomes $y_{k,j}$ may be modeled as follows:

$$f\left(\mathrm{E}\left[y_{k,j}\right]\right) = \alpha_k + \sum_{p=1}^{P} \beta_{p,k}\, x_{p,k,j}, \qquad (1)$$

where $f(\cdot)$ is a link function, $\alpha_k$ is a cluster-specific intercept and $\beta_{p,k}$ is a cluster-specific coefficient. Here, we propose to estimate $\alpha_k$ and $\beta_{p,k}$ in each cluster separately. Subsequently, the estimates can be summarized using traditional meta-analytic methods. We here use univariate random effects meta-analysis, where each of the estimated coefficients is summarized separately:

$$\hat{\beta}^{\,\mathrm{MA}}_{p,(h)} = \frac{\sum_{k \neq h} w_{p,k}\, \hat{\beta}_{p,k}}{\sum_{k \neq h} w_{p,k}}, \qquad (2)$$

where $w_{p,k}$ is the weight attributed to $\hat{\beta}_{p,k}$ estimated in cluster $k$, and $\hat{\beta}^{\,\mathrm{MA}}_{p,(h)}$ is the meta-analytic estimate of the coefficient estimated on data from all clusters except hold-out cluster $h$. In the random effects model the weights are given by

$$w_{p,k} = \frac{1}{\mathrm{var}\left(\hat{\beta}_{p,k}\right) + \hat{\tau}_p^2}, \qquad (3)$$

where $\hat{\tau}_p^2$ is the estimated between-cluster heterogeneity (variance) of the coefficient across clusters. A confidence interval (CI) for $\hat{\beta}^{\,\mathrm{MA}}_{p,(h)}$ can be constructed as $\hat{\beta}^{\,\mathrm{MA}}_{p,(h)} \pm t_{Q-1,\,1-\alpha/2}\, \sqrt{\widehat{\mathrm{var}}^{*}\!\left(\hat{\beta}^{\,\mathrm{MA}}_{p,(h)}\right)}$, where $\widehat{\mathrm{var}}^{*}(\cdot)$ is a modified variance estimate and $Q = K - 1$ as one cluster is held out for validation. [41][42][43][44][45] The extent of heterogeneity of a predictor effect can be explored by quantifying a prediction interval (PI), which estimates the interval of probable predictor effects in a new individual cluster, and can be calculated approximately as

$$\hat{\beta}^{\,\mathrm{MA}}_{p,(h)} \pm t_{Q-2,\,1-\alpha/2}\, \sqrt{\hat{\tau}_p^2 + \widehat{\mathrm{var}}\!\left(\hat{\beta}^{\,\mathrm{MA}}_{p,(h)}\right)}. \quad {}^{46,47}$$

A wide prediction interval for the predictor effect indicates that the predictor effect may be very different in a new cluster, which makes it unlikely that the predictor will improve the model's predictions for individuals in a new cluster.
The random effects meta-analysis model is preferably estimated with REML or the Paule-Mandel method. [48][49][50][51] When fewer than 10 clusters are included in the meta-analysis, or when some clusters are small or the outcome is rare, the between-cluster heterogeneity cannot be reliably estimated by any currently available method. 51 The estimated coefficients could also be summarized using multivariate meta-analysis methods, 33,52 which may be helpful in the presence of collinearity and missing parameter estimates. The necessary within-cluster covariances can then be estimated directly from the IPD set at hand. However, univariate and multivariate meta-analysis methods usually give very similar results when all of the parameter estimates of interest are available for all clusters, even when correlations are large. 53 In IECV, all parameters can be estimated from the data at hand, meaning that univariate meta-analysis will usually suffice.
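The sketch below illustrates the two-stage idea for a single coefficient: per-cluster logistic regressions followed by a random effects meta-analysis of that coefficient. For brevity it uses a DerSimonian-Laird heterogeneity estimate rather than the REML or Paule-Mandel estimators recommended above (which are available, eg, in the metafor package), and it reuses the simulated ipd data from the earlier sketch.

```r
# Two-stage pooling of the coefficient of x1 (sketch). In IECV, this pooling is
# repeated for each set of K - 1 development clusters, leaving each cluster out in turn.

# Stage 1: estimate the coefficient and its SE in each cluster separately.
stage1 <- t(sapply(unique(ipd$cluster), function(k) {
  fit <- glm(y ~ x1 + x2, family = binomial, data = ipd[ipd$cluster == k, ])
  c(beta = unname(coef(fit)["x1"]), se = sqrt(vcov(fit)["x1", "x1"]))
}))

# Stage 2: random effects meta-analysis (DerSimonian-Laird tau^2 for illustration).
pool_re <- function(beta, se) {
  w_fixed <- 1 / se^2
  mu_f    <- sum(w_fixed * beta) / sum(w_fixed)
  Q       <- sum(w_fixed * (beta - mu_f)^2)
  k       <- length(beta)
  tau2    <- max(0, (Q - (k - 1)) / (sum(w_fixed) - sum(w_fixed^2) / sum(w_fixed)))
  w       <- 1 / (se^2 + tau2)              # random effects weights
  mu      <- sum(w * beta) / sum(w)         # pooled coefficient
  se_mu   <- sqrt(1 / sum(w))
  # Approximate 95% prediction interval for the coefficient in a new cluster
  pi_half <- qt(0.975, df = k - 2) * sqrt(tau2 + se_mu^2)
  c(pooled = mu, se = se_mu, tau2 = tau2, pi_lb = mu - pi_half, pi_ub = mu + pi_half)
}
pool_re(stage1[, "beta"], stage1[, "se"])
```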

Assessing external model validity
In each iteration of the IECV, the developed model is validated in individuals from the hold-out cluster by applying the model (as developed in the other clusters) using the observed predictor values of individuals. If the developed model contains random (or stratified) intercept terms or predictor effects, this also requires choices about which parameter values are to be used when applying the developed model. It may be fruitful to select an intercept or predictor effect that best fits the validation cluster, 13 provided that sufficient data are available for the validation cluster. Alternatively, given that the prediction model is mean centered and the overall frequency of the outcome is known for the population or cluster that the model is applied to, the appropriate intercept can be restored. 13 When no such options are available, one may choose to apply the mean of the random intercept or predictor effects. Recent work shows that in terms of calibration the latter outperforms marginal models and models that ignore clustering during model development. 54 When comparing the risk predictions for the hold-out cluster with the observed outcomes, several performance measures such as the c-statistic, calibration slope, calibration-in-the-large and/or mean square error can be calculated. 10,27,55 This process is repeated until each cluster has been used as a hold-out cluster once, yielding a set of performance statistics for each IECV iteration. The corresponding estimates can then be pooled across the hold-out clusters using random effects meta-analysis methods, though some statistics and their SEs may require transformation first. 29,33,56 Similar to the predictor effects, a prediction interval can then be constructed for the performance estimates, which provides an interval of likely values that the performance statistic will have in a new cluster.
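As an illustration, the following sketch computes the three performance measures mentioned above for a single hold-out cluster from a vector of predicted probabilities. The function name and the data again follow the earlier simulated example; for brevity the model is fitted on the stacked development clusters, whereas in the two-stage approach the pooled coefficients from the meta-analysis above would be used to generate the predictions.

```r
# Performance of a developed model in one hold-out cluster (sketch).
# 'p_hat' are predicted probabilities for the hold-out individuals, 'y' their outcomes.
holdout_performance <- function(y, p_hat) {
  lp <- qlogis(p_hat)                            # linear predictor (log odds)

  # c-statistic: probability that an event receives a higher prediction than a non-event
  cstat <- mean(outer(p_hat[y == 1], p_hat[y == 0], ">")) +
           0.5 * mean(outer(p_hat[y == 1], p_hat[y == 0], "=="))

  # Calibration-in-the-large: intercept of a logistic model with the linear predictor as offset
  citl  <- coef(glm(y ~ 1, offset = lp, family = binomial))[1]

  # Calibration slope: coefficient of the linear predictor
  slope <- coef(glm(y ~ lp, family = binomial))[2]

  c(c_statistic = cstat, calibration_in_the_large = unname(citl),
    calibration_slope = unname(slope))
}

# Example: evaluate the model developed on clusters 2..K in hold-out cluster 1.
# The resulting estimates (and their SEs) would then be pooled across hold-out
# clusters with random effects meta-analysis, as described above.
dev <- ipd[ipd$cluster != 1, ]; val <- ipd[ipd$cluster == 1, ]
fit <- glm(y ~ x1 + x2, family = binomial, data = dev)
holdout_performance(val$y, predict(fit, newdata = val, type = "response"))
```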
Besides allowing one to obtain an average estimate of performance, meta-analysis is particularly helpful for investigating the presence of heterogeneity of prediction model performance and any possible causes thereof. 10,31,33 Prediction model performance may vary across clusters due to imprecision or bias of the regression coefficients or performance estimates, or due to the variation in population characteristics. Disentangling these various sources of variation is necessary when inferring on the model's potential generalizability to different settings and populations.
Finally, if the average performance and the heterogeneity of the performance of the prediction model are deemed adequate, that is, it is considered likely that performance will be adequate in a new cluster, a so-called global model may be developed by estimating the coefficients for the predictors on the data of all available clusters. In this final step no clusters are left out, in order to minimize the variability of the coefficient estimates. 25

Motivating example: Diagnosis of deep vein thrombosis
Patients with a deep vein thrombosis (DVT) have an increased risk of post-thrombotic syndrome and pulmonary embolism, which can be fatal. 57 In the majority of patients in whom DVT is suspected, no DVT is present on advanced (reference) testing. 58 For illustrative purposes, we here consider the diagnosis of DVT in patients suspected of having DVT, and use the IPD of 10 014 patients from 11 studies, 59 where each study is considered one cluster (Tables 3 and 4). In each cluster separately, we estimated a binary logistic regression model with three prespecified predictors: history of malignancy (yes/no), calf difference (difference in circumference of the calves ≥ 3 cm), and recent surgery (yes/no). Preferably, a continuous predictor such as calf difference should not be dichotomized, as this leads to a loss of information; however, the continuous predictor was not available in the data at hand. As some clusters were small, we applied Firth's correction, 60 which reduces the small-sample bias of the maximum likelihood estimates of the coefficients and SEs, 61 and adjusted the intercept post hoc by re-estimating it with an unpenalized GLM. 62 We then applied IECV and adopted a two-stage approach for prediction model development. The pooled regression coefficients (including the intercept term) from the development clusters were used for generating predictions in the hold-out cluster. Although Firth's correction still yielded estimates with high variance for the predictor coefficients in some clusters, this was mitigated by performing a meta-analysis of the regression coefficients. Results in Table 4 reveal that estimates for the predictor effects were very heterogeneous across the included clusters. As shown in Table 5, this also resulted in heterogeneous model performance estimates across hold-out clusters. Although calibration was good on average, it was highly variable in individual clusters. For instance, whereas the summary calibration-in-the-large equaled 0.03 (95% CI: −0.33 to 0.39), meaning that calibration-in-the-large was very good on average, its approximate 95% prediction interval (PI) ranged from −1.22 to 1.27, thereby indicating the presence of heterogeneity of the prediction model's calibration. Similarly, the calibration of the linear predictors was very good on average, as the calibration slope (also estimated with Firth's correction) equaled 1.00 (95% CI: 0.83 to 1.16), whereas the approximate 95% PI for the calibration slope ranged from 0.53 to 1.46. Further, the c-statistic equaled 0.68 (95% CI: 0.65 to 0.71) and was also substantially heterogeneous across clusters (approximate 95% PI: 0.60 to 0.75).
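The following hedged sketch illustrates the per-cluster estimation step described above, using the logistf package for Firth's penalized logistic regression and an unpenalized GLM to re-estimate the intercept. The simulated data and the variable names (dvt, malign, calfdif3, surg, study) are placeholders and do not correspond to the actual IPD.

```r
# Sketch: Firth-corrected logistic regression per cluster, with the intercept
# re-estimated post hoc by an unpenalized GLM (all data below are placeholders).
library(logistf)

set.seed(2)
dvt_ipd <- data.frame(
  study    = rep(1:3, each = 150),
  malign   = rbinom(450, 1, 0.10),
  calfdif3 = rbinom(450, 1, 0.40),
  surg     = rbinom(450, 1, 0.15)
)
dvt_ipd$dvt <- rbinom(450, 1, plogis(-1.5 + 0.7 * dvt_ipd$malign +
                                     1.1 * dvt_ipd$calfdif3 + 0.4 * dvt_ipd$surg))

fit_cluster <- function(dat) {
  # Firth's penalized likelihood reduces small-sample bias in the coefficient estimates
  firth_fit <- logistf(dvt ~ malign + calfdif3 + surg, data = dat)
  beta      <- coef(firth_fit)[-1]

  # Re-estimate the intercept with the penalized slopes held fixed via an offset
  dat$lp_no_int <- as.vector(as.matrix(dat[, c("malign", "calfdif3", "surg")]) %*% beta)
  intercept <- coef(glm(dvt ~ 1, offset = lp_no_int, family = binomial, data = dat))[1]

  c("(Intercept)" = unname(intercept), beta)
}

# Per-cluster coefficients; these (and their SEs) would then be pooled with
# random effects meta-analysis, as in Section 2.1.
lapply(split(dvt_ipd, dvt_ipd$study), fit_cluster)
```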
Overall, the IECV showed that this modeling strategy was unlikely to yield a prediction model with good generalizability. Substantial revision would be necessary to improve the model's average discrimination performance and to reduce the heterogeneity of its calibration and discrimination performance. A possible approach would be to refine the original modeling strategy by altering the set of included predictors and by considering interaction effects and/or nonlinear terms. It is generally recommended to define the set of predictors, and functions and interactions thereof, beforehand. Recommendations for this are given elsewhere. 2,[63][64][65] Subsequently, the revised model should be validated again, after which other revisions may be decided upon, and so forth. Clearly, this strategy is very time consuming and may lead to arbitrary choices in predictor selection. For these reasons, we propose a formal framework for predictor selection in the context of heterogeneity of performance across clusters in the next section. We address methods that aim to reduce the heterogeneity of performance, to improve the average performance, and a combination thereof. The code used to apply our methodology as presented in this article is available on Github (https://github.com/VMTdeJong/SIECV-DVT).

STEPWISE INTERNAL-EXTERNAL CROSS-VALIDATION FOR MODEL DEVELOPMENT
In the previous section, we described the purpose of IECV to assess the generalizability of a prediction model that is generated by a predefined modeling strategy. Here, we propose to extend IECV to optimize model generalizability during its development. We here describe the algorithm for the situation in which IECV is used to expand an empty (intercept-only) model by iteratively adding predictors, functions of predictors, and interaction effects.
We assume that these candidate predictors (and functions thereof) have already been selected through clinical reasoning and evidence from previous studies, such as reviews of prediction models and prognostic factor studies. The proposed methodology is not intended for dimensionality reduction (eg, variable selection in settings with large P, small N), but is used to assess the heterogeneity of the predictive performance resulting from including these in the prediction model.
The approach readily generalizes to the expansion or reduction of a given model. In this stepwise IECV (SIECV) for prediction model development, models are estimated, validated in external data sets, and updated in an iterative process, as follows.
Denote the data from the $k$th cluster by $S_k$, and the data from the set of clusters excluding cluster $h$ by $S_{(h)}$. Let $p$, $p = 0, 1, \ldots, P$ be indicators that denote the candidate predictors (or functions thereof), where $p = 0$ indicates none. The algorithm consists of up to $I$ model adaptation cycles, where $I$ generally equals $P$, the number of predictors available for inclusion in the model. Then, let $Pr(i)$ denote the set of candidate predictors for inclusion in cycle $i$, where $Pr(1) = \{1, 2, \ldots, P\}$ and $Pr(0) = \{0\}$.

Further, let $M_{i,p}$ denote the model in cycle $i$ with added predictor $p$ in the stepwise process, and let $M_{i,p,(h)}$ denote that model estimated on data from all clusters excluding $S_h$. Let $\hat{Z}_{i,p,h}$ be an estimate of the performance (ie, a loss function) of model $M_{i,p,(h)}$ in cluster $h$, such as the mean squared error. Let $\hat{A}_{i,p}$ be the estimate of a loss function (ie, an aggregated loss function or an estimate of heterogeneity, further described in Section 3.2) in cycle $i$ for a model extended with predictor $p$, and let $c$ indicate the predictor $p$ with minimal $\hat{A}_{i,p}$, such that $M_{i,c}$ is the model with the best estimated generalizability in cycle $i$. Then, the algorithm is defined as follows and starts at cycle $i = 0$:

1. For each candidate predictor $p \in Pr(i)$:
   (a) Define model $M_{i,p}$ by extending the model selected in the previous cycle with predictor $p$ (in cycle $i = 0$, this is the intercept-only model).
   (b) For each cluster $h = 1, \ldots, K$: i. estimate $M_{i,p,(h)}$ on $S_{(h)}$; ii. use $M_{i,p,(h)}$ to predict the outcomes of the individuals in hold-out cluster $S_h$; iii. estimate the performance $\hat{Z}_{i,p,h}$ of these predictions in cluster $h$.
   (c) Compute the aggregated loss $\hat{A}_{i,p}$ from $\hat{Z}_{i,p,1}, \ldots, \hat{Z}_{i,p,K}$.
2. Identify the predictor $c$ with minimal $\hat{A}_{i,c}$.
3. The first condition that is satisfied:
   (a) If $i = 0$, set $i = 1$ and continue to step 1.
   (b) If $\hat{A}_{i,c} < \hat{A}_{i-1,c'}$, where $c'$ denotes the predictor selected in the previous cycle (ie, the estimated generalizability has improved), add predictor $c$ to the model, remove it from the candidate set ($Pr(i+1) = Pr(i) \setminus \{c\}$), set $i = i + 1$ and continue to step 1.
   (c) Otherwise, stop: the model selected in cycle $i - 1$ is retained, and a global model may be obtained by estimating its coefficients on the data from all available clusters. 25

This global model, however, is at risk of overfitting as a result of small sample bias, unless the sample is sufficiently large and the event rate sufficiently high, even if no selection of predictors were applied. 35,37,66 To account for this, the prediction model could be fitted with penalized regression, such as Firth's regression. To reduce the variance of the estimated regression coefficients, the ridge penalty could be applied instead, or one could opt for a fully Bayesian approach.
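To make the cycle structure concrete, the following R sketch implements a bare-bones forward SIECV with the across-cluster mean of the Brier score as the aggregated loss (other aggregated loss functions from Section 3.2 can be plugged in via the agg argument). It is an illustrative simplification (one-stage fits, no shrinkage), not the metapred implementation, and it reuses the simulated ipd data from Section 2; all function and variable names are ours.

```r
# Bare-bones forward SIECV (sketch): in each cycle, every remaining candidate
# predictor is evaluated by IECV, and the one that most improves the aggregated
# loss is added; the algorithm stops when no candidate improves it.
iecv_loss <- function(formula, data, clusters, agg = mean) {
  Z <- sapply(clusters, function(h) {
    dev <- data[data$cluster != h, ]
    val <- data[data$cluster == h, ]
    fit <- glm(formula, family = binomial, data = dev)
    p   <- predict(fit, newdata = val, type = "response")
    mean((val$y - p)^2)                      # Brier score in hold-out cluster h
  })
  agg(Z)                                     # aggregated loss across hold-out clusters
}

siecv_forward <- function(data, outcome = "y", candidates, agg = mean) {
  clusters <- unique(data$cluster)
  selected <- character(0)
  # Cycle i = 0: baseline loss of the intercept-only model
  best <- iecv_loss(reformulate("1", outcome), data, clusters, agg)

  repeat {
    if (length(candidates) == 0) break
    losses <- sapply(candidates, function(p)
      iecv_loss(reformulate(c(selected, p), outcome), data, clusters, agg))
    c_best <- names(which.min(losses))
    if (min(losses) >= best) break           # no candidate improves generalizability
    best       <- min(losses)
    selected   <- c(selected, c_best)
    candidates <- setdiff(candidates, c_best)
  }
  # Global model: the selected specification refitted on all clusters
  glm(reformulate(if (length(selected)) selected else "1", outcome),
      family = binomial, data = data)
}

siecv_forward(ipd, outcome = "y", candidates = c("x1", "x2"))
```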
By considering the candidate predictors for inclusion, however, the prediction model is at further risk of overfitting. 2,64 A straightforward adjustment for overfitting could be achieved with the calibration slope and calibration-in-the-large. 67 In step 1(b)iii of the final cycle these could be estimated, and summary meta-analysis estimates could then be computed. The final model coefficients (excluding the intercept) would then be multiplied by the summary calibration slope, whereas the summary calibration-in-the-large would subsequently be estimated and added to the global model's intercept, thereby yielding a final model. Ideally, however, the entire model selection procedure is performed within an additional bootstrap or cross-validation procedure, 68 as this would account for any overfitting introduced by the SIECV itself. Alternatively, heuristic shrinkage, which shrinks the coefficients by a function of the number of predictors considered, may be applied. 2,64,69
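As a small worked sketch of this recalibration step (the two summary values are made up; in practice they would be the pooled estimates from the meta-analysis in the final cycle):

```r
# Sketch: shrink the global model's coefficients by the summary calibration slope,
# then add the summary calibration-in-the-large to the intercept.
global_fit    <- glm(y ~ x1 + x2, family = binomial, data = ipd)
summary_slope <- 0.92   # hypothetical pooled calibration slope
summary_citl  <- 0.05   # hypothetical pooled calibration-in-the-large of the shrunken model

adjusted      <- coef(global_fit)
adjusted[-1]  <- adjusted[-1] * summary_slope   # rescale all predictor coefficients
adjusted[1]   <- adjusted[1] + summary_citl     # correct the intercept
adjusted
```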

Extensions
Throughout this article, we work from the perspective that an entirely new prediction model is to be developed. However, our proposed framework readily encompasses model redevelopment, including the addition and removal of predictor terms. Predictors that are known from previous research to have a strong predictive effect can then be forced into the model instead of applying a selection strategy to these predictors, as is generally recommended. 65 Selection of predictor effects may then also be performed with a backward procedure, starting with all candidate predictors and their transformations or interactions, rather than forward. However, this may cause estimation issues when many predictor effects are considered, especially when random effects are applied. In a simulation study (described in online supporting information A, code on Github [https://github.com/VMTdeJong/SIECV-sim-DVT]), the forward and backward SIECV methods had similar out-of-sample performance, both in terms of average performance and the generalizability thereof. Further, similar to IECV for model validation, we may adopt a one- or two-stage approach (Section 2.1) for model estimation. To enhance the interpretability of a prediction model, it is generally recommended to include main effects for variables that are included in interaction effects. Hence, the algorithm may be adjusted so that this is always the case, by forcing the main effect to remain in the model whenever an interaction with another variable is included.

Quantifying model generalizability
The SIECV algorithm requires specification of an aggregated loss function ($\hat{A}_{i,p}$) that is to be minimized, in order to optimize the generalizability of performance across clusters. Here, we consider parametric and nonparametric aggregated loss functions that vary with respect to the importance they place on the average and the heterogeneity of performance.

Ignoring heterogeneity
As a first step, we consider a naive estimator of predictive performance across hold-out data sets from different clusters that ignores variation within and across clusters. This approach may be reasonable when the clusters are very large and of similar size, and when the effect of clustering is negligible (that is, when the intraclass correlation coefficient is near zero). The overall performance is then given by the mean performance across clusters. For instance, when optimizing the mean square error (MSE, or Brier score for categorical outcomes), we can apply the following aggregated loss function:

$$\hat{A}^{\,M}_{i,p} = \frac{1}{K} \sum_{h=1}^{K} \hat{Z}_{i,p,h}. \qquad (4)$$

Weighted meta-analysis
To incorporate the uncertainty of the predictive performance estimates into an aggregated loss function, it may be more appropriate to adopt a weighting procedure. The meta-analysis framework (see Section 2.1) therefore appears an appealing choice. A straightforward extension to Equation (4) would be to apply the weighting procedure described in Equations (2) and (3). This allows one to minimize the prediction error in an "average" cluster, but still does not attempt to optimize the stability of predictions across clusters. As a result, it is possible that developed models perform well on average, but require substantial local revisions before implementation. To reduce the need for local revisions, the aggregated loss function should account not only for the average performance, but also for its variation across clusters. For this reason, we propose an extension that combines both sources of error:

$$\hat{A}^{\,\mathrm{RE}}_{i,p,\lambda} = \lambda\, \hat{Z}^{\,\mathrm{RE}}_{i,p} + (1 - \lambda)\, \hat{\tau}_{i,p}, \qquad (5)$$

where $\lambda$ is a hyperparameter that defines the impact of the random effects meta-analysis summary estimate of performance $\hat{Z}^{\,\mathrm{RE}}_{i,p}$ and the heterogeneity estimate $\hat{\tau}_{i,p}$ on the aggregated loss function $\hat{A}^{\,\mathrm{RE}}_{i,p,\lambda}$. This parameter is to be chosen beforehand, and its value should depend on the relative importance of average and heterogeneity of performance. In the simplest case we let $\lambda = 1$, such that the estimate of generalizability is given by the mean of the distribution of performance.

Similar to the meta-analysis of prediction model coefficients discussed in Section 2.1, it is important to consider the estimation of between-cluster heterogeneity in prediction model performance. For this reason, we recommend a random effects meta-analysis to summarize estimates of prediction model performance. As discussed in Section 2.1, the random effects meta-analysis model is preferably estimated with REML or the Paule-Mandel method. Estimates of between-cluster heterogeneity can then be used to derive an approximate prediction interval that provides information on the likely values that model performance may take in a new cluster. A sufficient number of clusters, each with sufficient observations, as well as an adequate estimation method are necessary to make this approximate prediction interval informative.

Variability of performance across data sets
In the meta-analysis approach, the evidence from small clusters is downweighted to attain an estimate of the mean of the distribution of performance. Yet this distribution might not be of central importance. Instead, all clusters might be considered of equal importance. Then we may instead apply a measure of variability directly to the performance estimates, for instance the SD:

$$\hat{A}^{\,\mathrm{SD}}_{i,p} = \mathrm{SD}\left(\hat{Z}_{i,p,1}, \ldots, \hat{Z}_{i,p,K}\right).$$

Alternatively, if no assumptions can be made on the distribution of the predictive performance statistics, we may apply a nonparametric measure. For example, when using Gini's Mean Difference we have: 70,71

$$\hat{A}^{\,\mathrm{Gini}}_{i,p} = \frac{1}{K(K-1)} \sum_{k=1}^{K} \sum_{l \neq k} \left|\hat{Z}_{i,p,k} - \hat{Z}_{i,p,l}\right|.$$
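The aggregated loss functions above are simple to compute from the vector of hold-out performance estimates (and, for the meta-analytic version, their SEs). The sketch below follows our reconstruction of Equation (5) for the combined mean-and-heterogeneity loss and again uses a DerSimonian-Laird heterogeneity estimate for brevity; the function names and example values are ours.

```r
# Aggregated loss functions applied to hold-out performance estimates Z (sketch).
# Z: per-cluster performance estimates (eg, Brier scores); se: their standard errors.

A_mean <- function(Z) mean(Z)                        # ignores clustering and uncertainty

A_re <- function(Z, se, lambda = 1) {                # weighted meta-analytic version
  w_f  <- 1 / se^2
  mu_f <- sum(w_f * Z) / sum(w_f)
  Q    <- sum(w_f * (Z - mu_f)^2)
  k    <- length(Z)
  tau2 <- max(0, (Q - (k - 1)) / (sum(w_f) - sum(w_f^2) / sum(w_f)))  # DL estimate
  w    <- 1 / (se^2 + tau2)
  mu   <- sum(w * Z) / sum(w)                        # random effects summary
  lambda * mu + (1 - lambda) * sqrt(tau2)            # combine mean and heterogeneity
}

A_sd <- function(Z) sd(Z)                            # variability across clusters

A_gini <- function(Z) {                              # Gini's mean difference
  d <- abs(outer(Z, Z, "-"))
  sum(d) / (length(Z) * (length(Z) - 1))
}

# Example with five hold-out estimates (made-up values)
Z  <- c(0.18, 0.21, 0.19, 0.25, 0.20)
se <- c(0.010, 0.012, 0.011, 0.015, 0.010)
c(A_M = A_mean(Z), A_RE_1 = A_re(Z, se, 1), A_RE_0.5 = A_re(Z, se, 0.5),
  A_SD = A_sd(Z), A_Gini = A_gini(Z))
```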

MOTIVATING EXAMPLE 2: UPDATING A MODEL FOR DIAGNOSING DVT
The prediction model developed in Section 2.3 had rather heterogeneous performance across validation clusters and lacked average discrimination performance. This heterogeneity of performance implies that although the outcome may be predicted well for individuals in some clusters, which may be helpful in diagnosis, predictions may be unsatisfactory for individuals in other clusters. The heterogeneity of prediction model performance across the 11 clusters may be explained by differences in (measured and unmeasured) predictor distributions and in true predictor effects. Therefore, we here consider whether additional predictors and interaction effects might explain such differences. Whereas individual clusters may lack the sample size to detect nonlinear effects, or may yield highly variable estimates of predictor effects, detecting such effects is more feasible in pooled studies such as an IPD-MA and in large healthcare databases.
Briefly, we considered the following ten additional candidate predictors to extend the model from Section 2.3: sex, absence of leg trauma, absence of leg trauma × recent surgery (ie, an interaction effect), vein distension, log of duration of symptoms, age/25 (ie, divided by 25, to increase the absolute value of its coefficient), age/25 squared, age/25 × malignancy, abnormal d-dimer value and abnormal d-dimer × sex. As we developed this model for illustrative purposes only, we applied multiple imputation for missing data using a joint model with random effects, but have used only one imputed data set for analyses. 30,72 An overview of the missing data per variable per cluster is given in online supporting information B.
We recommend that the inclusion of each candidate predictor (or transformation thereof) be carefully considered with respect to the improvements in generalizability of the model performance on the one hand, and the costs and harms relating to the measurement of the predictor on the other. Further, it is generally recommended that predictors that are known from previous research to have a strong predictive effect should be forced into the model instead of applying a selection strategy to these predictors. 65 Accordingly, we force malignancy, calf difference, and surgery into the model as predictors in this example. We apply our methodology to illustrate how each of the strategies regarding heterogeneity of performance leads to a different model specification, and thereby to differing average and heterogeneity of performance. To assess the generalizability of prediction models that use these predictor functions, we follow the SIECV strategy for model development that we developed in Section 3, apply the MSE (ie, Brier score) to the predicted probabilities in the hold-out clusters, and apply the aggregated loss functions (measures of average and/or heterogeneity of performance) to the MSE estimates and their SEs, to select predictors as outlined in Section 3.2.
The six applied aggregated loss functions led to models with four different predictor function specifications (Table 6), as the strategy that ignored clustering when estimating generalizability of performance ($A^{M}$) led to the same model specification as the strategy that optimized the meta-analytic mean of performance ($A^{RE}_{1}$), and the meta-analysis strategy that placed equal importance on heterogeneity of performance and average performance ($A^{RE}_{1/2}$) led to the same model specification as the $A^{SD}$ strategy. As the SIECV allows for the estimation of any performance statistic, we assessed discriminatory performance with the c-statistic, and calibration with the calibration slope and calibration-in-the-large, for the final model of each aggregated loss function. Subsequently, we summarized the performance and the heterogeneity thereof with univariate random effects meta-analyses. In terms of calibration slopes (also estimated with Firth's correction), all strategies showed some overfitting (summary calibration slope < 1), though to varying degrees (Table 7, Figure 1). Slopes < 1 imply that the estimated coefficients were too large (the log odds ratios deviated too far from 0), which yielded predictions for individuals that were too extreme. The linear predictors in the $A^{SD}$ strategy and the meta-analytic strategy that combined heterogeneity of performance and average performance ($A^{RE}_{1/2}$) were the worst calibrated (calibration slope of 0.85), and the $A^{Gini}$ strategy the best (0.94).

TABLE 7 Meta-analysis summary estimates of SIECV performance of six strategies for predicting DVT
There was substantial heterogeneity in the estimated calibration slopes, especially for the predefined model with no predictor selection. For all strategies, the prediction interval for the calibration slope also included values > 1, which implies that for some (future) clusters the log odds ratios will probably not deviate far enough from 0 and that predictions for individuals will probably not be extreme enough. The heterogeneity of the calibration slope decreased for all strategies, as compared to the predefined model with no added predictors. This means that for the resulting models there was a decreased need for extensive local updating.
In terms of average calibration-in-the-large, all strategies achieved a reasonable calibration-in-the-large, that is, close to zero (Table 7, Figure 2). This means that on average the incidence was predicted accurately. On the other hand, the meta-analysis of the calibration-in-the-large showed that the heterogeneity of calibration-in-the-large had increased for all modeling strategies, as compared to the predefined model with no added predictors. Hence, a trade-off occurred between (heterogeneity of) calibration slopes and calibration-in-the-large, and local updating of the intercept will remain necessary. All strategies that applied SIECV achieved an internally-externally validated c-statistic value of 0.81 (Table 7, Figure 3). There were small differences in heterogeneity of discrimination performance. The $A^{SD}$ strategy and the strategies that included the meta-analytic estimate of heterogeneity had smaller values for the heterogeneity of the internally-externally validated c-statistic than the strategies that focused on the mean performance alone ($A^{M}$ and $A^{RE}_{1}$). Further, the $A^{Gini}$ strategy, a non-meta-analytic strategy that focuses on heterogeneity of performance, yielded larger heterogeneity of discrimination performance.

FIGURE 3 Forest plots of SIECV estimates of c-statistics of six strategies for predicting DVT
As a final step, one must choose which modeling strategy is most likely to yield adequate performance when applied to individuals in a new cluster, if any. Although heterogeneity in the slopes had decreased substantially for all strategies, the prediction intervals still indicated that updating may be necessary. Further, the models resulting from all strategies are likely to need an intercept update. In terms of calibration, it may therefore not be advisable to develop a global model, that is a model developed on all available clusters (without leaving any out). Finally, although the discrimination for all models improved substantially, the diagnostic utility would have to be put into a clinical perspective.

MOTIVATING EXAMPLE 3: PREDICTING ATRIAL FIBRILLATION
Patients with atrial fibrillation (AF) are at an increased risk for stroke. 73 Although stroke is usually not fatal, it often results in neurological deficiencies. 74 In patients with AF the incidence of stroke as well as the incidence of death from stroke can be greatly reduced by oral anticoagulation. 75 For illustrative purposes, we here consider the development and validation of a binary logistic prediction model to estimate the probability that atrial fibrillation is present in an individual patient. Previously, Audigier et al prepared a simulated data set to mimic the patients from 28 cohorts (clusters, from hereon) of the GREAT consortium. 76,77 This data set comprises a total of 11 685 patients of which 3335 have AF.
Because some of the clusters are very small and may thereby cause estimation issues during model development or validation, we removed a total of eight clusters in which fewer than 50 patients had the outcome or fewer than 50 patients did not have the outcome. Missing values were imputed using a joint model with random effects. 30,78 As we apply the methodology for illustrative purposes only, we have used only the first imputed data set for analyses. We subsequently modeled the probability of the presence of atrial fibrillation in 10 873 patients from the remaining 20 clusters (Table 8). To further prevent overfitting, we applied Firth's correction, 60 and re-estimated the intercepts with unpenalized maximum likelihood. 62 We considered seven candidate predictors, consisting of gender (binary) and six continuous predictors: body mass index (BMI), age, systolic blood pressure (SBP), diastolic blood pressure (DBP), heart rate (HR), and brain natriuretic peptide (BNP). BMI, age, and HR were divided by 25, and SBP and DBP by 100, to increase the absolute values of their coefficients. For each of the continuous predictors, we considered linear and quadratic terms and applied centering (within clusters) before applying the quadratic function. This was necessary to stabilize the coefficients and to ensure that positive coefficients for quadratic terms represent an increased probability of the presence of AF for values that deviate from the mean value. Again, we stress that in practice the inclusion or exclusion of predictor functions should not be based on a statistical criterion alone, but should be carefully considered as discussed in Sections 4 and 6.1.

TABLE 8 Clinical characteristics of AF data. Abbreviations: AF, atrial fibrillation; BMI, body mass index; BNP, brain natriuretic peptide; DBP, diastolic blood pressure; HR, heart rate; SBP, systolic blood pressure; SIECV, stepwise internal-external cross-validation.
Here, we implement the proposed predictor selection procedures to illustrate their impact on average performance as well as on generalizability across the different clusters. We follow the SIECV strategy for model development as described in Section 3, apply the MSE to the predicted probabilities in the hold-out clusters, and apply the aggregated loss functions to the MSE estimates and their SEs, to select predictor functions as outlined in Section 3.2.
The six applied aggregated loss functions led to four different model specifications (Table 9), as the strategy that ignored clustering when quantifying generalizability ($A^{M}$) and the strategy that optimized the meta-analytic mean of performance ($A^{RE}_{1}$) led to the same model specification. Further, both strategies that directly quantified the variability of performance ($A^{SD}$ and $A^{Gini}$) led to the same model. Again, we assessed performance with the calibration slope, calibration-in-the-large, and c-statistic, and summarized these and the heterogeneity thereof with univariate random effects meta-analyses.
In terms of summary calibration slopes (estimated with Firth's correction and then pooled in a meta-analysis), all strategies were rather well calibrated, showing only minor overfitting (Table 10, Figure 4). However, there was substantial heterogeneity of the calibration slopes. The approximate 95% prediction intervals of the calibration slopes were widest for the $A^{M}$ and $A^{RE}_{1}$ models, ranging from −0.24 to 2.20, and for the $A^{RE}_{1/2}$ model, ranging from −0.26 to 2.18. These negative values for the lower bound imply that the predictive effects of the models might be reversed in some clusters: individuals with AF would receive lower predicted probabilities of AF than individuals without AF in these clusters. This means that for each of these models there was still a need for extensive updating or model redevelopment.
Calibration-in-the-large was (near) perfect (Table 10, Figure 5). The $A^{RE}_{0}$, $A^{SD}$, and $A^{Gini}$ strategies all achieved a calibration-in-the-large of 0.00 (95% CI: −0.28 to 0.29), whereas those of $A^{M}$ and $A^{RE}_{1}$ were hardly different at 0.04 (95% CI: −0.24 to 0.33), and neither was the interval for $A^{RE}_{1/2}$ at 0.04 (95% CI: −0.25 to …). However, there was large heterogeneity in calibration-in-the-large, as shown by the approximate 95% prediction intervals of the calibration-in-the-large. This means that for each of these models there was still a need for intercept updating before they may be used. The $A^{M}$, $A^{RE}_{1}$, and $A^{RE}_{1/2}$ strategies attained somewhat better discrimination, with c-statistics of 0.62 (95% CI: 0.58 to 0.65), than the other strategies, which all attained c-statistics of 0.58 (95% CI: 0.56 to 0.60) (Table 10, Figure 6). There was considerable heterogeneity in the c-statistics for all strategies. The discrimination was worse than random (c-statistic < 0.50) in at least one cluster for each of the strategies. Indeed, the approximate 95% prediction intervals show that this is most likely to occur in a new cluster for the $A^{M}$ and $A^{RE}_{1}$ strategies.

TABLE 10 Meta-analysis summary estimates of SIECV performance of six strategies for predicting AF
For all of the considered strategies, the SIECV shows that although calibration was adequate on average and overall discrimination was modest, it is not very likely that any of the developed models will perform well at predicting AF in new clusters without local updating. The effects of the included predictors vary substantially across clusters and hence heterogeneity of performance could not be resolved by considering nonlinear terms. The code used for these analyses can be found on Github (https://github.com/VMTdeJong/SIECV-AF).

GENERAL DISCUSSION
We proposed a methodology for improving the generalizability of prediction models developed across different settings and populations when individual participant data from multiple studies or electronic health records are available. Our methodology is aimed at situations where the goal is to produce a single model that can be used globally across a wide variety of settings without having to update or recalibrate predictor effects (although it still allows intercepts to be updated). This methodology leverages the information from multiple clusters (eg, studies) by iteratively using all but one cluster for model development and assessing performance in the remaining cluster. The overall predictive performance and its variation across clusters can then be used to quantify a model's generalizability across clusters, and to inform the selection of predictors. As we have demonstrated in our motivating examples, the selection of predictors based on the proposed aggregated loss functions can lead to differing model specifications. Each model specification may perform differently in predicting the outcome for individuals in different clusters, leading to differing average and heterogeneity of calibration and discrimination. Trade-offs may occur between discrimination and calibration, as well as between average and heterogeneity of performance. These may be quantified in the SIECV algorithm during model development, by which the need for extensive or local model updating may be assessed immediately. For instance, before validation in the hold-out cluster, the intercept may be updated, which will inform the researcher on the generalizability of performance of the model after local updates.
Although it remains unlikely that SIECV can completely resolve the need for an intercept update, it may reduce the necessity of the local tailoring of prediction models (particularly with respect to the calibration slope), which is highly prevalent in the implementation of today's prediction models. In particular, we aimed to reduce the need for re-estimation of individual predictor effects. Evidently, the potential impact of SIECV strongly depends on the availability of patient-level covariates that may explain heterogeneity in predictive associations.
We discussed a variety of aggregated loss functions for quantifying the generalizability of a developed model. These ranged from producing an average performance estimate to quantifying its dispersion across clusters. Our framework also allows for assessing the average and heterogeneity of performance across data sets (clusters) through meta-analysis, and formalizes the selection of predictor functions by both of these simultaneously. This requires specifying the relative importance of average performance and the heterogeneity thereof beforehand. For the meta-analysis strategy, this either requires the prespecification of $\lambda$, or finding an optimal value for $\lambda$ through a resampling method. When applied in the SIECV model development process, these measures lead to different model specifications with different average performance and heterogeneity thereof, as each places different importance thereon. Therefore, the researcher will have to choose whether to optimize average calibration (calibration-in-the-large close to 0, slope close to 1) and discrimination (high c-statistic, and/or high positive and negative predictive values for the clinical problem at hand), or the heterogeneity of one or both of these across clusters.

Limitations and future directions
Our main focus was on informing the selection of predictors in the analysis of multiple studies or EHR, and not on the estimation of the corresponding predictor effects. Hence, the impact of our methodology may be limited in cases where the number of available predictors is low or when the available predictors are not related to heterogeneity in predictor-outcome associations. However, even in cases with few predictors, it may still prove worthwhile to consider nonlinear terms and interaction effects through the use of SIECV. Improvements in one prediction model performance measure may come at the cost of deteriorations in another measure. For instance, an improvement in calibration may result in a deterioration in discrimination. In our motivating example, the predictor d-dimer had a large predictive effect and substantially contributed to discrimination performance. However, d-dimer was a dichotomized predictor and was measured using different methods across clusters. 58,79 This resulted in substantial heterogeneity in its regression coefficients and in heterogeneity of its diagnostic accuracy (see Reference 80).
Differences in predictor distributions will have an impact on the discrimination of an estimated model: 31 in fact, the c-statistic is directly related to the SD of the linear predictor. 81 Differences in predictor distributions should not have any impact on calibration, 31 as long as all predictors have been measured and included (with the right functional form) in the prediction model, all of the regression coefficients are correct, and there is no heterogeneity in the true regression coefficients. However, differences in the distributions of unmeasured predictors will have an impact on the model's calibration. If these predictors are correlated with measured predictors, this will appear as miscalibration of the linear predictor (ie, the calibration slope) and possibly the intercept. Including such a predictor in the prediction model, if it were measured, would then remove this miscalibration, thereby improving performance. If the unmeasured predictor is not related to any included predictors, the prediction model's miscalibration will appear solely as miscalibration of the intercept or calibration-in-the-large. In other words, if the distribution of a predictor changes, the predicted incidence of the outcome should change accordingly. If that predictor is not measured, it will not affect the predictions, which leads to miscalibration. Imprecise regression coefficients may impact each aspect of performance. If all of the predictor coefficients are too large for the validation sample, the calibration slope will be < 1, but discrimination will be unaffected. If some are too large and some too small, discrimination will also deteriorate. If the model's intercept is incorrect, or if it cannot account for differences in predictor effects, the calibration-in-the-large will be affected, but not the discrimination.
To gain an understanding of the possible values for performance the prediction model may have in a new sample, the use of benchmark values has been proposed. 31,82 Most importantly, the case-mix corrected c-statistic disentangles the effects of case-mix and incorrect coefficients between clusters. It assumes the predictor coefficients are valid and adjusts for the case-mix of the validation sample. 31 An alternative approach is to quantify the relatedness of the different clusters using a membership model. 14 In this article, we have focused on improving the generalizability of a model's predictive performance through the addition or removal of predictors or functions of predictors. However, this may lead to instability in the estimation process and overfitting of the predictor coefficients if not enough data are available or the outcome is too rare. We stress that the proposed methodology is only likely to lead to stable estimates for the predictor effects when large data sets are available. A future step may be to incorporate heterogeneity of performance into the estimation process. For instance, in penalized maximum likelihood estimation a penalty for heterogeneity of predictor coefficients or predictive performance across clusters could be applied. Such a penalty could readily produce more generalizable models, without the need for a stepwise selection process.
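A hedged sketch of one common way to obtain such a benchmark is given below: outcomes are simulated in the validation cluster from the model's own predicted risks, so the resulting c-statistic reflects only the case-mix of that cluster under the assumption that the model's coefficients are correct. The function and variable names are ours, and the sketch reuses the simulated ipd data from Section 2.

```r
# Sketch: case-mix corrected (benchmark) c-statistic for a validation cluster.
# Outcomes are simulated from the model's predicted risks, so discrimination
# reflects only the case-mix of the validation sample, assuming valid coefficients.
cstat <- function(y, p) {
  mean(outer(p[y == 1], p[y == 0], ">")) + 0.5 * mean(outer(p[y == 1], p[y == 0], "=="))
}

casemix_cstat <- function(p_hat, n_sim = 200) {
  mean(replicate(n_sim, cstat(rbinom(length(p_hat), 1, p_hat), p_hat)))
}

# Example: benchmark discrimination in hold-out cluster 1 of the simulated data
val   <- ipd[ipd$cluster == 1, ]
fit   <- glm(y ~ x1 + x2, family = binomial, data = ipd[ipd$cluster != 1, ])
p_hat <- predict(fit, newdata = val, type = "response")
c(observed = cstat(val$y, p_hat), casemix_benchmark = casemix_cstat(p_hat))
```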
We have not performed a simulation study to compare the aggregated loss functions. Such an endeavor might not be fruitful, as it should be obvious that each of these measures for predictor selection would lead to different estimates of both average and heterogeneity of performance, that is, there is no gold standard. As each of the aggregated loss functions serve a different goal, it is unclear what the evaluation criteria in such a simulation study would be. Nevertheless, it may still be helpful to assess under what circumstances certain generalizability measures may be preferred.
Whereas methodology has been developed for dealing with missing data in model validation in general (eg, see Reference 83), more guidance is needed for its application to SIECV. In general, the use of multilevel imputation methods has been recommended in IPD-MA and in the analysis of other types of clustered data. 76,[84][85][86] These methods can account for variables that are sporadically missing within one or more clusters, but can also impute variables that are not measured at all in certain clusters. 85 The adoption of multilevel imputation methods therefore seems paramount when adopting SIECV in data sets with missing values. Further, in multiple imputation it is imperative that the imputation model includes all of the variables used in the analysis model, including the outcome, as well as any other predictive variables. 87 For this reason, we recommend including all candidate predictors of the SIECV in the imputation model, and allowing for random effects for all of these. Recently, several authors evaluated approaches for implementing multiple imputation in an internal validation procedure. Notably, Schomaker and Heumann showed that the bootstrap and multiple imputation can be applied in sequence to obtain valid SEs, and give recommendations on the implementation. 88 Further, Wahl et al recommend that the multiple imputation procedure be performed within the development and validation sets separately, for each split in a k-fold cross-validation, 89 which could readily be extended to IECV. Extension to SIECV, however, would require either an appropriate synthesis of performance estimates across imputed data sets within studies before calculation of the aggregated loss function and predictor selection, or an appropriate synthesis of the predictors selected in different imputed data sets. For the latter, a voting strategy may be applied. 90,91 When data are missing sporadically and a two-stage meta-analysis approach is taken, the imputation procedure might be performed within clusters before applying SIECV. A formal comparison of these strategies is still needed.

CONCLUSION
The IECV methodology for model validation can inform the researcher on the need for updating a prediction model to adapt it to particular populations and settings, when IPD from multiple clusters or studies are available. Our SIECV methodology extends this framework to quantify and reduce the impact of any heterogeneity on prediction model performance. This can inform the researcher and may reduce the need for tailoring the prediction model to specific populations and settings.