Systematic review identifies the design and methodological conduct of studies on machine learning-based prediction models

ABSTRACT

Objective. We sought to summarize the study design, modelling strategies, and performance measures reported in studies on clinical prediction models developed using machine learning techniques.

Study Design and Setting. We searched PubMed for articles published between 01/01/2018 and 31/12/2019 describing the development, or the development with external validation, of a multivariable prediction model using any supervised machine learning technique. No restrictions were made based on study design, data source, or predicted patient-related health outcome.

Results. We included 152 studies: 58 (38.2% [95%CI 30.8-46.1]) were diagnostic and 94 (61.8% [95%CI 53.9-69.2]) were prognostic studies. Most studies reported only the development of a prediction model (n=133, 87.5% [95%CI 81.3-91.8]), focused on binary outcomes (n=131, 86.2% [95%CI 79.8-90.8]), and did not report a sample size calculation (n=125, 82.2% [95%CI 75.4-87.5]). The most commonly used algorithms were support vector machines (n=86/522, 16.5% [95%CI 13.5-19.9]) and random forests (n=73/522, 14.0% [95%CI 11.3-17.2]). Values for the area under the receiver operating characteristic curve ranged from 0.45 to 1.00. Calibration metrics were often missing (n=494/522, 94.6% [95%CI 92.4-96.3]).

Conclusions. Our review revealed that focus is required on the handling of missing values, methods for internal validation, and reporting of calibration to improve the methodological conduct of studies on machine learning-based prediction models.


INTRODUCTION
Clinical prediction models aim to improve healthcare by providing timely information for shared decision-making between clinicians and their patients, risk stratification, changes in behaviour, and counselling of patients and their relatives.1 A prediction model can be defined as the (weighted) combination of several predictors to estimate the likelihood or probability of the presence or absence of a certain disease (diagnostic model), or of the occurrence of an outcome over a time period (prognostic model).2 Traditionally, prediction models were developed using regression techniques, such as logistic or time-to-event regression. In the past decade, however, attention to and use of machine learning approaches for developing clinical prediction models have grown rapidly.

Machine learning can be broadly defined as the use of computer systems that fit mathematical models able to capture non-linear associations and complex interactions. Machine learning has a wide range of potential applications in different healthcare pathways. For example, it is applied in stratified medicine, triage tools, image-driven diagnosis, online consultations, and medication management, and to mine electronic medical records.3 Most of these applications use supervised machine learning, whereby a model is fitted to learn the conditional distribution of the outcome given a set of predictors, with few assumptions about data distributions, non-linear associations, and interactions. The fitted model can later be applied to other, related individuals to predict their (yet unknown) outcome (see the sketch at the end of this section). Support vector machines (SVM), random forests (RF), and neural networks (NN) are some examples of these techniques.4

The number of studies on prediction models published in the biomedical literature increases every year.5,6 With more healthcare data being collected and increasing computational power, we expect studies on clinical prediction models based on (supervised) machine learning techniques to become even more popular. Although numerous models are being developed and validated for various outcomes, patient populations, and healthcare settings, only a minority of these published models are successfully implemented in clinical practice.7,8

The use of appropriate study designs and prediction model strategies to develop or validate a prediction model could improve their transportability into clinical settings.9 However, there is currently a dearth of information about which study designs, modelling strategies, and performance measures studies on clinical prediction models report when choosing machine learning as the modelling approach.10-12 Therefore, our aim was to systematically review and summarise the characteristics of study design, modelling steps, and performance measures reported in studies of prediction models using supervised machine learning.
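As a concrete illustration of this supervised workflow, the following minimal sketch fits a random forest on simulated development data and applies it to new individuals; the data, algorithm choice, and parameters are ours, purely for illustration, and are not drawn from any reviewed study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Simulated development data: 200 individuals, 5 candidate predictors,
# and a binary outcome (e.g., disease present vs. absent).
X_dev = rng.normal(size=(200, 5))
y_dev = (X_dev[:, 0] + 0.5 * X_dev[:, 1] + rng.normal(size=200) > 0).astype(int)

# Fit the model: it learns the conditional distribution of the outcome
# given the predictors without pre-specifying linearity or interactions.
model = RandomForestClassifier(n_estimators=500, random_state=42)
model.fit(X_dev, y_dev)

# Apply the fitted model to new, related individuals to estimate their
# (yet unknown) outcome probabilities.
X_new = rng.normal(size=(3, 5))
print(model.predict_proba(X_new)[:, 1])  # predicted risk per individual
```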

METHODS

We followed the PRISMA 2020 statement in reporting this systematic review.13

Eligibility criteria
We searched PubMed (search date: 19 December 2019) for articles published between 1 January 2018 and 31 December 2019 (Supplemental file 1). We focused on primary studies that described the development and/or validation of one or more multivariable diagnostic or prognostic prediction model(s) using any supervised machine learning technique. A multivariable prediction model was defined as a model aiming to predict a health outcome using two or more predictors (features). We considered a study to be an instance of supervised machine learning when it reported a non-regression approach to model development. If a study reported machine learning models alongside regression-based models, it was included. We excluded studies reporting only regression-based approaches, such as unpenalized regression (e.g., ordinary least squares or maximum likelihood logistic regression) or penalized regression (e.g., lasso, ridge, elastic net, or Firth's regression), regardless of whether they referred to them as machine learning. Any study design, data source, study population, predictor type, or patient-related health outcome was considered.
We excluded studies investigating a single predictor, test, or biomarker. Similarly, we excluded studies using machine learning or AI to enhance the reading of images or signals rather than to predict health outcomes in individuals, and studies that used only genetic traits or molecular ('omics') markers as predictors. Furthermore, we excluded reviews, meta-analyses, conference abstracts, and articles for which no full text was available via our institution. The selection was restricted to studies in humans published in English. Further details about the eligibility criteria can be found in our protocol.14

Screening and selection process

Titles and abstracts were screened for potentially eligible studies by two independent reviewers from a group of seven (CLAN, TT, SWJN, PD, JM, RB, JAAD). After selection of potentially eligible studies, full-text articles were retrieved and reviewed for eligibility by two independent researchers: one researcher (CLAN) screened all articles, and six researchers (TT, SWJN, PD, JM, RB, JAAD) collectively screened the same articles for agreement. In case of disagreement during screening and selection, a third reviewer was asked to read the article in question and resolve the disagreement.

Extraction of data items
We selected several items from existing methodological guidelines for the reporting and critical appraisal of prediction model studies to build our data extraction form (CHARMS, TRIPOD, PROBAST).15-18 Per study, we extracted the following items: characteristics of study design (e.g., […]). We kept our evaluations at the study level. We did not perform a quantitative synthesis of model performance (i.e., meta-analysis), as this was beyond the scope of our review. Data analysis and synthesis were presented overall. Analyses were performed using R (version 4.[…]).

RESULTS

Overall, 1,429 prediction models were developed (mean: 9.4 models per study; IQR: 2-8; range: 1-156). As we set a limit of 10 models per article for data extraction, we evaluated 522 models. The most commonly applied modelling techniques were support vector machines (n=86/522, 16.5% [95%CI 13.5-19.9]) and random forests (n=73/522, 14.0% [95%CI 11.3-17.2]).
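The article does not state how its 95% confidence intervals for these frequencies were computed; they are, however, reproducible with Wilson score intervals (e.g., 58/152 diagnostic studies yields 38.2% [30.8-46.1], matching the abstract). A minimal sketch under that assumption:

```python
# Sketch: Wilson score interval for a proportion (assumed method; the
# article itself does not report how its 95% CIs were obtained).
from math import sqrt

def wilson_ci(k, n, z=1.96):
    """95% Wilson score interval for k successes out of n trials."""
    p = k / n
    centre = p + z**2 / (2 * n)
    margin = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    denom = 1 + z**2 / n
    return (centre - margin) / denom, (centre + margin) / denom

lo, hi = wilson_ci(58, 152)
print(f"{58/152:.1%} [95%CI {lo:.1%}-{hi:.1%}]")  # 38.2% [30.8%-46.1%]
```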

Class imbalance
In our sample, 27/152 (17.8% [95%CI 12.5-24.6]) studies applied at least one method to purportedly address class imbalance, that is, when one class of the outcome outnumbers the other. […] The most common was the Synthetic Minority Oversampling Technique (SMOTE), a method that combines oversampling the minority class with undersampling the majority class.19 […] 19 reported 1,000 iterations and one did not report the number of iterations. For further details, see […].
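As an illustration of the correction described above, a minimal sketch using the imbalanced-learn package; the class ratios, sampling strategies, and variable names are our own choices, not taken from any reviewed study.

```python
# Sketch: SMOTE oversampling of the minority class combined with random
# undersampling of the majority class (imbalanced-learn; illustrative only).
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.05).astype(int)  # ~5% minority class

# Oversample the minority class up to half the majority class size...
X_res, y_res = SMOTE(sampling_strategy=0.5, random_state=0).fit_resample(X, y)
# ...then undersample the majority class down to a 1:1 ratio.
X_res, y_res = RandomUnderSampler(sampling_strategy=1.0,
                                  random_state=0).fit_resample(X_res, y_res)
print(np.bincount(y), "->", np.bincount(y_res))  # class counts before/after
```

Note that such resampling changes the outcome prevalence in the training data, which is why, as discussed later in this review, predicted risks typically require recalibration afterwards.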

DISCUSSION

Principal findings
In this study, we evaluated the study design, data sources, modelling steps, and performance measures in studies on clinical prediction models using machine learning. The methodology varied substantially between studies, including the modelling algorithms, sample sizes, and performance measures reported. Unfortunately, longstanding deficiencies in reporting and methodological conduct previously seen in studies with a regression-based approach were also extensively found in our sample of studies on machine learning models.9,23

The spectrum of supervised machine learning techniques is quite broad.24,25 In this study, the most popular modelling algorithms were tree-based methods (RF in particular) and SVM. RF is an ensemble of random trees trained on bootstrapped subsets of the dataset.26 SVM, on the other hand, first maps each data point into a feature space and then identifies the hyperplane that separates the data items into two classes while maximizing the marginal distance for both classes and minimizing the classification errors.27 Several studies also applied regression-based methods (LR in particular) as a benchmark against which to compare the predictive performance of machine learning-based models.

Various other well-known methodological issues in prediction model research need to be discussed further. Our reported estimate of EPV is likely to be an overestimate, given that we were unable to calculate it based on the number of parameters and instead used only the number of candidate predictors. A simulation study concluded that modern modelling techniques such as SVM and RF might require even 10 times more events.28 Hence, the sample size in most studies on prediction models using machine learning remains relatively low. Furthermore, splitting datasets persists as a method for internal validation (i.e., testing), which further reduces the actual sample size for model development and increases the risk of overfitting.29,30 Whilst the AUC was a frequently reported metric of predictive performance, calibration of predictions or prediction error was often overlooked.31 Moreover, a quarter of the studies in our sample corrected for class imbalance without reporting recalibration, although recent research has shown that correcting for class imbalance may lead to poor calibration and thus to prediction errors.32 Finally, therapeutic interventions were rarely considered as predictors in the prognostic models, although these can affect the accuracy and transportability of models.33

Variable importance scores, tuning of hyperparameters, and data preparation (i.e., data pre-processing) are items closely related to machine learning prediction models. We found that most studies reporting variable importance scores did not specify the calculation method. Data preparation steps (i.e., data quality assessment, cleaning, transformation, reduction) were often not described in transparent detail. Complete-case analysis remains a popular method to handle missing values in machine learning-based models. Detailed description and evaluation of […]. One third of models reported their hyperparameter settings, which are needed for reproducibility purposes.
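A common way to summarise the calibration so often missing in these studies is the calibration intercept ("calibration-in-the-large") and calibration slope, obtained by logistic recalibration of the predicted risks. The sketch below is our own statsmodels-based illustration, not an implementation taken from any reviewed study; perfectly calibrated predictions should yield an intercept near 0 and a slope near 1.

```python
# Sketch: calibration intercept and slope from outcomes y and predicted risks p.
import numpy as np
import statsmodels.api as sm

def calibration_metrics(y, p):
    """Return (calibration intercept, calibration slope)."""
    lp = np.log(p / (1 - p))  # linear predictor = logit of predicted risk
    # Slope: logistic regression of the outcome on the linear predictor.
    slope = sm.Logit(y, sm.add_constant(lp)).fit(disp=0).params[1]
    # Intercept: same regression with the linear predictor as a fixed offset.
    intercept = sm.Logit(y, np.ones_like(lp), offset=lp).fit(disp=0).params[0]
    return intercept, slope

# Simulated perfectly calibrated predictions: intercept ~0, slope ~1.
rng = np.random.default_rng(1)
p = rng.uniform(0.05, 0.95, size=5000)
y = rng.binomial(1, p)
print(calibration_metrics(y, p))
```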
Comparison to previous studies

Regression methods were not our focus (as we did not define them to be machine learning methods), but other reviews including both approaches show similar issues with methodological conduct and reporting.12,35-37 Missing data, sample size, calibration, and model availability remain largely neglected aspects.7,12,37-40 A review of trends in prediction models using electronic health records (EHR) observed an increase in the use of ensemble models from 6% to 19%.41 Another detailed review of prediction models for hospital readmission showed that the use of algorithms such as SVM, RF, and NN increased from none to 38% over the last five years.10 Methods to correct for class imbalance in EHR datasets increased from 7% to 13%.41

Strengths and limitations of this study
In this comprehensive review, we summarized the study design, data sources, modelling strategies, and reported predictive performance in a large and diverse sample of studies on clinical prediction models. We considered all types of studies on clinical prediction models rather than a specific type of outcome, population, clinical specialty, or methodological aspect. We appraised studies published almost three years ago, and it is therefore possible that further improvements have since arisen. However, improvements in methodology and reporting are usually small and slow, even over longer periods.42 Hence, we believe that the results presented in this comprehensive review still largely apply to the current situation of studies on machine learning-based prediction models. Given the limited sample, our findings should be considered a representative rather than an exhaustive description of studies on machine learning models.

Our data extraction was restricted to what was reported in the articles. Unfortunately, few articles reported the minimum information required by reporting guidelines, thereby hampering data extraction.23 Furthermore, terminology differed between papers. For example, the term 'validation' was often used to describe tuning as well as testing (i.e., internal validation), an issue already observed in a previous review of studies on deep learning models.43 This shows the need to harmonize the terminology for critical appraisal of machine learning models.44 Our data extraction form was based mainly on the items and signaling questions from TRIPOD and PROBAST. Although both tools were primarily developed for studies on regression-based prediction models, most items and signaling questions were largely applicable to studies on machine learning-based models as well.
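To illustrate the terminological distinction noted above, the following sketch (our own; the model, parameter grid, and split fraction are arbitrary) separates tuning, where cross-validation within the training data selects hyperparameters, from testing (internal validation) on held-out data:

```python
# Sketch: 'validation' as tuning vs. 'validation' as testing.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(7)
X = rng.normal(size=(600, 5))
y = (X[:, 0] + rng.normal(size=600) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=7)

# Tuning: cross-validation *within* the training data selects hyperparameters.
tuner = GridSearchCV(RandomForestClassifier(random_state=7),
                     param_grid={"max_depth": [2, 4, 8]},
                     cv=5, scoring="roc_auc")
tuner.fit(X_train, y_train)

# Testing (internal validation): performance on data never used for fitting
# or tuning the model.
auc = roc_auc_score(y_test, tuner.predict_proba(X_test)[:, 1])
print(tuner.best_params_, round(auc, 2))
```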
Implications for researchers, editorial offices, and future research

In our sample, it is questionable whether studies ultimately aimed to improve clinical care.45 Aim, clinical workflow, outcome format, prediction horizon, and clinically relevant performance metrics […] reporting in studies on prediction models has been intensively and extensively stressed by guidelines and meta-epidemiological studies.46-48 Researchers can benefit from TRIPOD and PROBAST, as these provide guidance on best practices for the design, conduct, and reporting of prediction model studies regardless of the modelling technique.16,17,46,47 However, special attention is required to extend these recommendations to areas such as data preparation, tunability, fairness, and data leakage. In this review, we have provided evidence on the use and reporting of methods to correct for class imbalance, data preparation, data splitting, and hyperparameter optimization. PROBAST-AI and TRIPOD-AI, extensions to artificial intelligence (AI) or machine learning-based prediction models, are underway.44,49 As machine learning continues to emerge as a relevant player in healthcare, we recommend that researchers and editors reinforce a minimum standard of methodological conduct and reporting to ensure further transportability.16,17,46,47

We identified that studies covering the general population (e.g., for personalized screening), primary care settings, and time-to-event outcomes are underrepresented in current research.