Machine learning algorithms performed no better than regression models for prognostication in traumatic brain injury

Objective: We aimed to explore the added value of common machine learning (ML) algorithms for prediction of outcome after moderate and severe traumatic brain injury. Study Design and Setting: We performed logistic regression (LR), lasso regression, and ridge regression with key baseline predictors in the IMPACT-II database (15 studies, n = 11,022). ML algorithms included support vector machines, random forests, gradient boosting machines, and artificial neural networks and were trained using the same predictors. To assess generalizability of predictions, we performed internal, internal-external, and external validation on the recent CENTER-TBI study (patients with Glasgow Coma Scale < 13, n = 1,554). Both calibration (calibration slope/intercept) and discrimination (area under the curve) were quantified. Results: In the IMPACT-II database, 3,332/11,022 (30%) died and 5,233 (48%) had unfavorable outcome (Glasgow Outcome Scale less than 4). In the CENTER-TBI study, 348/1,554 (29%) died and 651 (54%) had unfavorable outcome. Discrimination and calibration varied widely between the studies and less so between the studied algorithms. The mean area under the curve was 0.82 for mortality and 0.77 for unfavorable outcomes in the CENTER-TBI study. Conclusion: ML algorithms may not outperform traditional regression approaches in a low-dimensional setting for outcome prediction after moderate or severe traumatic brain injury. Similar to regression-based prediction models, ML algorithms should be rigorously validated to ensure applicability to new populations. © 2020 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
Traumatic brain injury (TBI) is a common disease with a significant societal burden [1]: TBI is estimated to be responsible for around 300 hospital admissions and 12 deaths per 100,000 persons per year in Europe [2]. TBI is a heterogeneous disease in terms of phenotype and prognosis [3]. Therefore, prognostic models, which predict outcome for a patient given a particular combination of baseline characteristics, are important: they may give us insight into mechanisms of disease that lead to poor outcome and allow for risk-based stratification of patients for logistic, research, and clinical reasons.
A large number of prediction models have been developed to predict outcome for patients with TBI, mostly using traditional regression techniques [4]. However, these models have not yet been widely implemented in clinical practice. In recent years, more flexible machine learning (ML) algorithms have attracted enthusiasm as potentially promising techniques to improve outcome prognostication [5]. Frequently used methods are support vector machines (SVMs) [6], deep neural networks (NNs) [7], random forests (RFs) [8], and gradient boosting machines (GBM) [9]. Some of these algorithms have been used to develop prediction models on small data sets (<200 events) [10–12]. Because ML algorithms are more prone to overfitting [13], it remains unclear what impact these novel techniques have on prognostication.
Although the incremental value of flexible ML methods has been previously assessed, these comparisons were potentially subject to bias [14]. The incremental value of ML algorithms is potentially overrated because studies up to this point mainly focused on the ability of the methods to discriminate between patients with good and poor outcomes [15–19]. Performance of prediction models is, however, commonly measured across at least two dimensions: calibration and discrimination [20,21]. Calibration refers to the agreement between the predicted probabilities of a model and the observed outcomes (e.g., "if the risk of death is x%, do x% of the patients with this prediction actually die?"). Poor calibration of prediction models may lead to harmful decision-making when applying these models [22–24].
One of the more thoroughly validated prediction models with good performance exists in the field of TBI: the IMPACT model [25]. This model comprises baseline clinical characteristics, presence of secondary insults, imaging findings, and laboratory characteristics. Using the variables of this model, the present study aims to fairly assess the potential incremental value of flexible ML methods beyond classical regression approaches.

Methods
This study was reported in accordance with the TRIPOD guidelines [23].

Study population
We included 15 studies from the IMPACT-II database. These comprise four observational studies and eleven randomized controlled trials on moderate to severe TBI (Glasgow Coma Scale [GCS] ≤ 12), which were conducted between 1984 and 2004 [26]. Furthermore, we validated the models in the patients with moderate to severe TBI (GCS ≤ 12) from the CENTER-TBI core study, a recent prospective study which included patients from 2014 to 2018 [27]. Data for the CENTER-TBI study were collected through the Quesgen e-CRF (Quesgen Systems Inc, Burlingame, CA, USA), hosted on the INCF platform, and extracted via the INCF Neurobot tool (INCF, Sweden). Version 1.0 of the CENTER-TBI data was used for this analysis.

Model specification
The predicted outcomes were 6-month mortality and unfavorable outcome (Glasgow Outcome Scale ≤ 3, or Glasgow Outcome Scale-Extended < 5).
The predictors included in the models were the 11 predictors of the IMPACT laboratory model [25]. Continuous variables were included as continuous variables in the model (no categorization). An overview of the included variables and their specifications is shown in Table 1. The baseline GCS score was defined as the last GCS in the emergency department ("poststabilization"). If this score was missing, the nearest earlier GCS was used. In total, eleven predictors were included, representing 19 parameters (or degrees of freedom [df]). For mortality, an average of 3,491 events (or 184 events per parameter) was available in our database for each training run. The variables were normalized or one-hot encoded, as is standard practice for training algorithms that use gradient descent optimization.
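As a minimal illustration of this preprocessing step (not the authors' code; the variables and values below are hypothetical placeholders), categorical predictors can be one-hot encoded and continuous predictors normalized with the caret package:

```r
# Minimal sketch of the described preprocessing, with hypothetical variables:
# categorical predictors are one-hot encoded and all columns are normalized.
library(caret)

df <- data.frame(
  age     = c(25, 60, 43, 71),                         # continuous predictor
  glucose = c(6.1, 8.3, 5.4, 9.9),                     # continuous predictor
  pupils  = factor(c("both", "one", "none", "both"))   # categorical predictor
)

enc      <- dummyVars(~ ., data = df, fullRank = FALSE)   # one-hot encoding
X        <- predict(enc, newdata = df)
prep     <- preProcess(X, method = c("center", "scale"))  # normalization
X_scaled <- predict(prep, X)
```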

Regression techniques
The regression techniques compared with the ML algorithms included standard logistic regression, but also penalized regression: lasso and ridge regression [28]. These algorithms were developed to improve the performance of logistic regression models by shrinking the coefficients during estimation [29,30]. The objective is to obtain models that are less prone to making too extreme predictions (overfitting). The glmnet function from the glmnet package was used (alpha = 0 for ridge, and alpha = 1 for lasso). No nonlinear or interaction terms were included in the regression models.
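A minimal sketch of these penalized logistic regressions is shown below; the simulated data and the use of cv.glmnet to select the penalty are illustrative assumptions, not necessarily the authors' exact setup:

```r
# Sketch of ridge (alpha = 0) and lasso (alpha = 1) logistic regression with glmnet.
# Simulated data and the cross-validated choice of lambda are illustrative assumptions.
library(glmnet)

set.seed(1)
n <- 500
X <- matrix(rnorm(n * 19), n, 19)                         # 19 encoded parameters
y <- rbinom(n, 1, plogis(0.5 * X[, 1] - 0.3 * X[, 2]))    # simulated binary outcome

ridge <- cv.glmnet(X, y, family = "binomial", alpha = 0)
lasso <- cv.glmnet(X, y, family = "binomial", alpha = 1)

# Predicted probabilities at the cross-validated penalty
p_ridge <- predict(ridge, newx = X, s = "lambda.min", type = "response")
p_lasso <- predict(lasso, newx = X, s = "lambda.min", type = "response")
```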

Machine learning algorithms
All analyses were performed using R (R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria). The script can be found at https://github.com/bgravesteijn/ML_baseline_pred_code.
The flexible ML algorithms that were compared with logistic regression were SVM, NN, RF, and GBM. All these algorithms have so-called "hyperparameters," which need to be tuned for the algorithms to perform optimally. To select the optimal hyperparameters, the framework of the caret package was used. The best combination of hyperparameters for each algorithm was chosen based on the highest log likelihood, averaged over 10 repetitions of tenfold cross-validation (Fig. 1). For a detailed description of the algorithms used and the hyperparameters considered, see Appendix B.
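A hedged sketch of this tuning procedure with caret follows: 10 repeats of 10-fold cross-validation, with caret's built-in log loss (the negative mean log-likelihood) used here as the closest equivalent selection metric. The GBM example, its default tuning grid, and the simulated data are assumptions, not the authors' exact configuration:

```r
# Sketch of hyperparameter tuning via caret: 10 x 10-fold cross-validation,
# selecting the hyperparameter combination with the best (log-)likelihood,
# here via caret's mnLogLoss summary. The gbm example is illustrative only.
library(caret)

set.seed(2)
n <- 500
X <- matrix(rnorm(n * 10), n, 10)
y <- factor(ifelse(rbinom(n, 1, plogis(X[, 1])) == 1, "died", "survived"))
dat <- data.frame(X, outcome = y)

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10,
                     classProbs = TRUE, summaryFunction = mnLogLoss)

gbm_fit <- train(outcome ~ ., data = dat, method = "gbm",
                 metric = "logLoss", maximize = FALSE,
                 trControl = ctrl, verbose = FALSE)

gbm_fit$bestTune    # selected hyperparameter combination
```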
The included flexible ML methods, just like regression, do not allow for missing values. Unlike regression, however, they are not readily compatible with multiple imputation: not every algorithm uses weights as core operators, and for the algorithms that do, there is no implementation for pooling these weights over multiple data sets using Rubin's rules [31]. Therefore, multiple imputation with the mice package was performed [32], but only one imputed data set was used to train the models. The outcome and all predictors were included in the imputation model. To check the stability of the results, a sensitivity analysis was performed with a different imputed data set.
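A minimal sketch of this imputation strategy is shown below; the simulated data and missingness pattern are illustrative, while the actual imputation model included all predictors and the outcome:

```r
# Sketch: multiple imputation with mice, but a single completed data set is
# used for model training, as described in the text.
library(mice)

set.seed(3)
dat <- data.frame(
  age     = rnorm(200, 45, 15),
  glucose = rnorm(200, 8, 2),
  died    = rbinom(200, 1, 0.3)
)
dat$glucose[sample(200, 40)] <- NA                # introduce missing values

imp       <- mice(dat, m = 5, printFlag = FALSE)  # outcome and predictors included
train_dat <- complete(imp, 1)                     # use one imputed data set only
```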

Cross-validation
The models were validated using three different strategies. First, they were cross-validated per study: the algorithms were trained on all but one study, and calibration and discrimination were assessed by applying the models to the study not used at model development. This procedure has been referred to as 'internal-external cross-validation' [33,34]. For an overview of the analytical steps of internal-external cross-validation, see Fig. 1. Second, internal validation was performed in the IMPACT-II database using 10 times 10-fold random cross-validation: the data were randomly divided into ten parts, the model was developed on 9/10 and validated on the remaining 1/10 of the data, and this process was repeated until all patients had been used once as validation sample. Finally, a fully external validation was performed, with training of the models in the IMPACT-II database and validation in CENTER-TBI.
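Conceptually, the internal-external (leave-one-study-out) strategy can be sketched as the loop below; the data frame, fitting function, and performance function are hypothetical placeholders, not the authors' code:

```r
# Conceptual sketch of internal-external cross-validation: train on all studies
# except one, validate on the held-out study, and repeat for every study.
# `dat` (with a `study` column and a binary `died` outcome), `fit_fun`, and
# `perf_fun` are hypothetical placeholders.
leave_one_study_out <- function(dat, fit_fun, perf_fun) {
  results <- list()
  for (s in unique(dat$study)) {
    train <- dat[dat$study != s, ]    # development data: all other studies
    test  <- dat[dat$study == s, ]    # validation data: the held-out study
    fit   <- fit_fun(train)
    pred  <- predict(fit, newdata = test, type = "response")
    results[[as.character(s)]] <- perf_fun(obs = test$died, pred = pred)
  }
  do.call(rbind, results)             # one row of performance measures per study
}
```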
Performance was assessed in three domains. First, calibration was examined graphically and quantified using the calibration slope and the calibration intercept: the calibration test proposed by Cox [35]. Second, discrimination was quantified using the c-statistic, also known as the area under the receiver operating characteristic curve. Confidence intervals for the c-statistic were obtained with the method of DeLong et al. [36], using the ci.auc function from the pROC package. Third, as a measure of overall performance, the Brier score was calculated [37]. More extensive descriptions of these metrics can be found in Appendix B. The estimates and 95% confidence intervals were plotted in forest plots to visually inspect the variation. To obtain estimates per model and outcome, the estimates (and standard errors) in every validation were pooled using a random effects meta-analysis with the DerSimonian and Laird estimator for τ² [38]. Because the CENTER-TBI database is a recent study, unlike the IMPACT-II studies, the estimates obtained from validation in this study are presented separately.
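A sketch of these performance measures for a vector of observed outcomes and predicted probabilities is given below; the logistic recalibration models for the slope and intercept are standard formulations assumed here, not necessarily the authors' exact code:

```r
# Sketch of the validation metrics: calibration slope and intercept
# (logistic recalibration), c-statistic with DeLong CI (pROC), and Brier score.
# `obs` is a 0/1 outcome vector, `pred` the predicted probabilities.
library(pROC)

performance <- function(obs, pred) {
  lp        <- qlogis(pred)                                    # logit of predictions
  slope     <- unname(coef(glm(obs ~ lp, family = binomial))[2])
  intercept <- unname(coef(glm(obs ~ offset(lp), family = binomial))[1])
  auc_ci    <- ci.auc(roc(obs, pred, quiet = TRUE))            # DeLong method by default
  brier     <- mean((obs - pred)^2)
  c(slope = slope, intercept = intercept,
    auc = auc_ci[2], auc_lo = auc_ci[1], auc_hi = auc_ci[3], brier = brier)
}
```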
To assess whether the observed variation in the performance measures was attributable to differences across study populations or to the type of model used, we used mixed effects linear regression within the internal-external validation framework. The performance measure was the dependent variable, and two random intercepts were included in the model: one for the algorithm used and one for the study in which the models were validated. These random intercepts were assumed to follow a normal distribution with mean 0 and variance τ². The percentage of variation in performance attributable to the study in which the model was validated was calculated by dividing the τ² of study by the total variance (the sum of the variances of the random intercepts of study and algorithm, and the residual variance): τ²_study / (τ²_study + τ²_algorithm + τ²_residual). Similarly, the percentage of variation in performance attributable to the algorithm trained was calculated.
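A sketch of this variance partition with the lme4 package follows; the simulated performance data are illustrative, and the specific mixed-model implementation is an assumption:

```r
# Sketch of the variance partition: random intercepts for validation study and
# algorithm, with the share of variance attributed to each component.
library(lme4)

set.seed(4)
perf <- expand.grid(study = paste0("study", 1:15),
                    algorithm = c("LR", "SVM", "RF", "GBM", "NN"))
perf$cstat <- 0.80 +
  rnorm(15, 0, 0.030)[as.integer(perf$study)] +      # between-study variation
  rnorm(5,  0, 0.005)[as.integer(perf$algorithm)] +  # between-algorithm variation
  rnorm(nrow(perf), 0, 0.010)                        # residual variation

m    <- lmer(cstat ~ 1 + (1 | study) + (1 | algorithm), data = perf)
vc   <- as.data.frame(VarCorr(m))                    # tau^2 per component + residual
tau2 <- setNames(vc$vcov, vc$grp)

tau2["study"]     / sum(tau2)   # proportion of variation attributable to study
tau2["algorithm"] / sum(tau2)   # proportion of variation attributable to algorithm
```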

Results

Patient characteristics
The baseline characteristics differed substantially between the IMPACT-II and the CENTER-TBI data. In the IMPACT-II database, patients were younger (35 vs. 47.4 years), less often had traumatic subarachnoid hemorrhage (4,016 [45%] vs. 759 [74%]), and presented less often with a motor GCS of one (1,565 [16%] vs. 615 [45%]). However, the Glasgow Outcome Scale distributions were similar: in the IMPACT-II database, 3,332 (30%) died and 5,233 (48%) had an unfavorable outcome, and in the CENTER-TBI study, 348 (29%) died and 651 (54%) had an unfavorable outcome (Table 2). For an overview of the patient characteristics per study in IMPACT-II and CENTER-TBI, see Table A1.

Discrimination
At internal-external validation, the difference between the maximum and minimum c-statistic of the algorithms was only 0.02 for both mortality and unfavorable outcome. The discriminatory performance of the RF implementation was suboptimal: the median (IQR) c-statistic of the RF was 0.79 (0.77–0.82) for mortality (overall average 0.81) and 0.79 (0.76–0.81) for unfavorable outcome (overall average 0.80). The discriminative performance varied substantially per study (Fig. A2 and Table 3). At internal validation in IMPACT-II, a similar pattern was seen, but the c-statistics were somewhat higher. For example, the GBM showed a c-statistic of 0.81 (0.79–0.83) at internal-external validation and 0.83 (0.82–0.84) at internal validation. At external validation in CENTER-TBI, this pattern was also seen: the RF showed a c-statistic (95% CI) of 0.81 (0.78–0.84) for mortality (overall average 0.82) and 0.76 (0.74–0.79) for unfavorable outcome (overall average 0.77). Similar results were observed with a different imputed data set (Table A5).

Calibration
At internal-external validation, the average calibration intercepts across the algorithms did not vary substantially: the range of calibration intercepts was −0.08 to −0.02 for mortality, and for unfavorable outcome the calibration intercepts were 0.02 (Fig. 2B and Table A2). The range of calibration slopes was larger: 0.85–1.05 for mortality and 0.89–1.06 for unfavorable outcome (Fig. 2C and Table A3). The RF made too extreme predictions, with a median (95% CI) calibration slope of 0.85 (0.77–0.93) for mortality, whereas the overall mean was 0.97, and 0.89 (0.82–0.96) for unfavorable outcome, whereas the overall mean was 0.99. At internal validation in IMPACT-II, calibration slopes and intercepts were similar. At external validation in CENTER-TBI, the RF again had a too low calibration slope (0.88, 95% CI: 0.77–0.99 for mortality).
The calibration intercept for mortality was generally low in CENTER-TBI: the overall mean was −0.58, indicating that 6-month mortality was lower than expected in CENTER-TBI.

Overall predictive ability
The Brier score was very similar across internal-external, internal, and external validation for both outcomes (Table A4). The Brier score was somewhat higher at external validation but consistent across all methods (e.g., 0.19 vs. 0.18 for logistic regression predicting unfavorable outcome).

Explained heterogeneity
At internal-external validation, variation in c-statistic, calibration intercept, and Brier score was mainly attributable to the study in which the algorithm was validated (Table 4): for mortality, the variation in c-statistic was 97% attributable to the study in which the algorithm was validated (vs. 2.0% to the algorithm used), the variation in calibration intercept was 98% attributable to the validation study (vs. 0.3% to the algorithm used), and the variation in Brier score was 96% attributable to the validation study (vs. 2.0% to the algorithm used). Variation in calibration slope was slightly more attributable to the algorithm used than the other metrics (Fig. A1): for mortality, the variation in calibration slope was 11% attributable to the algorithm used and 86% attributable to the study in which the algorithm was validated. This was mostly caused by the low calibration slope of the RF algorithm, which displayed the worst calibration slope, as indicated in Fig. 2C. For unfavorable outcome, the results were similar.

Nonadditivity and nonlinearity
To explore whether nonadditive and nonlinear effects were frequently appropriate to assume in our data, we performed a post hoc analysis. Per study, logistic regression models allowing for nonadditivity and nonlinearity were compared, using likelihood ratio tests (omnibus tests), with the model that did not relax those assumptions [20]. The model predicting mortality had a better fit when nonlinearity was allowed for in 7 (44%) studies. Less often did the assumption of nonadditivity improve the model fit (Table A6).
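A hedged sketch of such an omnibus likelihood ratio test is shown below; the simulated data, the variable names, and the use of natural splines from the splines package are illustrative assumptions:

```r
# Sketch of the post hoc comparison: an additive, linear logistic model vs. a
# model allowing nonlinearity (splines) and nonadditivity (interaction),
# compared with an omnibus likelihood ratio test.
library(splines)

set.seed(5)
n       <- 400
age     <- rnorm(n, 45, 15)
glucose <- rnorm(n, 8, 2)
died    <- rbinom(n, 1, plogis(-3 + 0.03 * age + 0.10 * glucose))

m_linear <- glm(died ~ age + glucose, family = binomial)
m_flex   <- glm(died ~ ns(age, 3) * ns(glucose, 3), family = binomial)

anova(m_linear, m_flex, test = "LRT")   # omnibus test of nonlinearity/nonadditivity
```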

Discussion
This study aimed to compare flexible ML algorithms to more traditional logistic regression in contemporary patient data. We trained the algorithms to obtain a model with both high discrimination and good calibration. This was achieved by optimizing the log-likelihood for both regression and ML algorithms. All models and algorithms were developed and validated in large data sets, including the recent prospective cohort study CENTER-TBI [27]. Performance was assessed in terms of both discrimination and calibration, which are both important characteristics to be assessed in algorithm validation [22,24,39]. Similar performance of most methods was found across a large number of studies from different time periods.
The algorithm that relatively underperformed was the RF: its discrimination was somewhat lower, but it clearly underperformed in terms of calibration. In particular, the RF showed a calibration slope far below one. This indicates overfitting, a problem often arising in small data sets [37]. According to theoretical arguments, the RF algorithm should not overfit [40]. The discrepancy between this theory and the empirical evidence of our study should be explored further. There could be a role for the selection of hyperparameters, in particular the number of random variables at each split and the fraction of observations in the training sample [41]. Because the RF shows signs of overfitting, even in large data sets, its discriminative performance should be interpreted with caution: due to optimism, the discrimination in new data sets can be lower [21]. In contrast, this method was one of the better performing methods in other studies [15,42], which, however, did not assess calibration. Because calibration is a crucial step before implementation of a prediction model in clinical practice [20,39,43], our study encourages the use of modeling techniques other than RFs for outcome prediction.
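As an illustration of the hyperparameters mentioned here (an assumption for exposition, not the authors' analysis), the ranger implementation of the RF exposes both the number of candidate variables per split and the per-tree sampling fraction:

```r
# Illustrative sketch (not the authors' analysis): the random forest
# hyperparameters discussed above, fitted with the ranger package on simulated data.
library(ranger)

set.seed(6)
dat <- data.frame(matrix(rnorm(500 * 10), 500, 10))
dat$died <- factor(rbinom(500, 1, 0.3))

rf <- ranger(died ~ ., data = dat, probability = TRUE,
             num.trees = 500,
             mtry = 3,                 # number of candidate variables per split
             sample.fraction = 0.6)    # fraction of observations sampled per tree
```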
The variation in observed performance was explained more by the cohorts in which the algorithms were validated than by which algorithm was used. This implies that prediction models need continuous updating and validation because their performance is often worse in new cohorts [44]. This is a limitation which needs to be addressed to effectively use these models in clinical practice [45]. This finding also raises concerns about the validity of individual patient data meta-analysis in the context of prediction modeling.
A recent systematic review compared flexible ML methods to traditional statistical techniques in relatively small data sets (median sample size 1,250) and did not find incremental value [14]. This was perhaps to be expected because modern ML methods are known to be data hungry compared with classical statistical techniques [13,46]. However, owing to increased data sharing, international collaborations, and the availability of data from electronic health records and other sources of routinely collected data, data sets are becoming increasingly large [47–49].
Our study shows that even in this setting, flexible ML methods do not improve outcome prognostication.
A limitation of our study is that we only used a linear kernel function for the SVM. Other kernels could have increased the performance of the algorithm because its performance depends substantially on its hyperparameters [41]. Unfortunately, the computation time increased drastically when a nonlinear kernel was implemented (the expected running time for one series of cross-validation was 21 days). Because the first six iterations did not show a substantial increase in discriminative performance, we decided to use the linear kernel instead.
Second, we only considered a relatively small number of predictors (11 predictors, with 19 df). The reason for not including more predictors is that there were no other common data elements shared by all databases. This potentially limits the performance of ML techniques because it has been suggested that flexible ML techniques perform better than traditional regression techniques when a large number of predictors are being considered, that is, high-dimensional data [50,51]. A reason for such presumed superiority is the flexibility of these algorithms, enabling them to capture complex nonlinear and interaction effects. An example is shown in the supplemental material (Fig. 1). It should be noted that regression-based techniques can also be extended with nonlinear and interaction effects [20]. Given that the ML algorithms did not outperform regression, these effects are not likely to be essential in the field of outcome prediction in patients with TBI. Our study was not able to fully use the potential benefit of multidimensional data because of a phenomenon that is expected in big data research: larger volumes of data for better models may come at the price of less detailed or lower quality data. We do believe that, although we could perhaps not exploit the full potential performance of ML algorithms, our comparison is just as relevant. Published ML-based prediction algorithms often include a large number of predictors, sometimes with the suggestion that this results in high discriminative performance [52,53]. We note that external validation of these high-dimensional prediction algorithms is challenging because the availability of predictors may differ from one setting to the other. For prediction with genomics data, this may be feasible if sufficient standardization and harmonization has been performed [54]. However, clinical variables often have different definitions, notations, or units, which complicates the validation procedure with a large number (say n > 50) of predictors. External validation remains an essential step before implementing prediction algorithms in clinical practice. To train and validate high-dimensional data, a sophisticated IT environment is necessary [55]. Therefore, we believe that the low-dimensional setting, as in our study, might be more relevant for clinical practice, also for the near future. Powerful predictions of outcome after TBI can apparently be made with linear effects captured by simple algorithms.
Finally, this study should be replicated in fields other than TBI to ensure the generalizability of our findings, again from a largely neutral perspective [54]. Preferably, a wide range of studies should be used, representing different settings in terms of study design (randomized controlled trials vs. observational studies), geography (different countries), types of centers (level I trauma centers vs. other), and so forth. Most studies that compared algorithms used only one or a limited number of study populations [15–19]. Because performance relies heavily on the study population, comparing the methods in multiple populations is recommended.

Conclusion
In a low-dimensional setting, flexible ML algorithms do not perform better than more traditional regression models for outcome prediction after moderate or severe TBI. This is potentially explained by the most important prognostic effects acting as independent, linear effects. Predictive performance depends more on the population in which the model is applied than on the type of algorithm used. This finding has strong implications: continuous validation and updating of both ML algorithms and regression-based models is necessary to ensure applicability to new populations. To improve prognostication for TBI, future studies should extend current prognostic models with new predictors (biomarkers, imaging, genomics) with strong incremental value, for the reliable identification of patients with poor vs. good prognosis.

Fig. 1. Overview of the experimental setup. Step 1 is selecting a study as the validation study. Step 2 is selecting the optimal hyperparameters through 10 times 10-fold cross-validation; if the algorithm did not require hyperparameters, this step was skipped. Step 3 is training the final model with the optimal hyperparameters on the full training data. The model of step 3 was validated in step 4 on the study that was left out of the training set. Step 5 is repeating steps 1–4 until all studies have been used once as validation study. Finally, step 6 is the presentation of the results and the pooling of the results over the different studies.

Fig. 2. Results of the internal-external cross-validation for mortality. Panel A shows the c-statistic/area under the ROC curve, panel B the calibration intercept, panel C the calibration slope, and panel D the Brier score. The validation results are displayed per study (left: observational, right: randomized controlled trials) and per algorithm. Abbreviations: LR, logistic regression; SVM, support vector machine; RF, random forest; NN, neural network; GBM, gradient boosting machine; ROC, receiver operating characteristic.

Table 2.
Baseline characteristics of the CENTER-TBI and IMPACT-II databases. Abbreviations: CT, computed tomography; GCS, Glasgow Coma Scale; SD, standard deviation; IQR, interquartile range.

Table 3.
Results for discriminative performance of all algorithms, in all three validation strategies: internal-external (per-study CV), internal (10-fold CV), and external (CENTER-TBI) validation

Table 4.
Percentage of variation in performance attributable to what study the algorithms were validated in