Human and machine learning pipelines for responsible clinical prediction using high-dimensional data

This protocol aims to develop, validate, and deploy a prediction model using high-dimensional data by both human and machine learning. It is intended for clinical prediction in healthcare providers, including but not limited to those using medical histories from electronic health records. This protocol applies diverse approaches to improve both predictive performance and interpretability while maintaining the generalizability of model evaluation. However, some steps require expensive computational capacity; without it, they will take longer to run. The key stages consist of designs of data collection and analysis, feature discovery and quality control, and model development, validation, and deployment.


Introduction
However, prognostication with causal reasoning is warranted for prevention purposes.[7] Machine learning can learn a data distribution for prognostication but cannot yet infer causality, i.e., what may happen if the distribution differs from that in the dataset. But human learning can infer causality, although our comprehension of big data is limited.[8] Therefore, applying both human and machine learning is necessary to achieve responsible clinical prediction.[11][12]
The first building block of this pipeline is a systematic human-learning algorithm.[9] Based on hypothetico-deductive reasoning, human learning involves collecting prior knowledge to construct a causal diagram as a central assumption for hypothesis testing. Subsequently, statistical methods are used to verify the assumption. Solely using statistical analysis on available data without contextual knowledge can introduce data-driven bias. However, humans need machines to deal with enormous amounts of data. A causal diagram is expected to systematically mitigate this type of bias when a human interprets patterns from machine learning results.[13]
The second building block of this pipeline is optional and intended for those using electronic medical records. We provide a data extraction protocol for medical histories using Kaplan-Meier estimators, the so-called historical rates.[10] These are sensible quantities that change over time and affect a future health state. This protocol also makes medical histories comparable across healthcare providers, since this type of data is often isolated within each provider. With population-level historical rates, it is possible to utilize the isolated data across healthcare providers without accessing the respective databases.
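The historical rates above rest on the Kaplan-Meier product-limit estimator. The following is a minimal, hypothetical Python sketch (the protocol's own implementation is the medhist R package); the data and day values are illustrative only.

```python
# Minimal Kaplan-Meier estimator behind a population-level "historical rate":
# the cumulative probability of having encountered a given code by day t is
# 1 - S(t). Hypothetical toy data; not the medhist implementation.

def km_survival(times, events):
    """Return {t: S(t)} at distinct observed times, product-limit form."""
    surv, s = {}, 1.0
    for t in sorted(set(times)):
        at_risk = sum(1 for x in times if x >= t)
        deaths = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        if deaths:
            s *= 1.0 - deaths / at_risk
        surv[t] = s
    return surv

# Days from cohort entry to first occurrence of the code (event = 1)
times = [1, 1, 2, 3]
events = [1, 1, 1, 1]
s = km_survival(times, events)
# Historical rate at day 1 = 1 - S(1) = 1 - 0.5 = 0.5
```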
The third building block of this pipeline deals with high-dimensional data. We proposed a resampling protocol for dimensional reduction resulting in a few latent variables.[11] Most prediction models that used latent variables, if not all, conducted the dimensional reduction without either resampling or data partition, which exposes them to a risk of optimistic bias and is not robust for samples beyond the training set. This is because resampling and data partition are better known in predictive modeling and supervised machine learning than in dimensional reduction, which is typically used for statistical inference and unsupervised machine learning.
The fourth building block of this pipeline is predictive modeling by machine learning. For machine learning, we applied competition-winning, state-of-the-art algorithms for tabular data: random forest (RF) and gradient boosting machine (GBM). For prognostication, these algorithms were found to outperform others for similar outcomes.[1] We also developed a deep-insight visible neural network (DI-VNN) protocol,[12] according to recent studies.[14,15] But, unlike any of those studies, our protocol allows a human to deeply explore the 'subconscious mind' of the machine learning prediction model to gain insights and to identify bias exploited for predictions.
We applied this pipeline to several studies that were parts of a DI-VNN project. These studies applied our DI-VNN algorithm to a variety of predicted outcomes, with comparisons to models applying human learning with statistical methods and predictive modeling with other machine learning algorithms. Ethical clearance was waived by the Taipei Medical University Joint Institutional Review Board (TMU-JIRB number: N202106025). We followed guidelines for developing and reporting machine learning predictive models in biomedical research.[16] We also conducted a self-assessment following the PROBAST guidelines.[17] Those guidelines are specific to multivariable prediction models for making individualized prognostic or diagnostic predictions, instead of those applying multivariable modeling to identify risk or prognostic factors.[17] To ensure the clinical suitability of our models, we also fulfilled a clinician checklist for assessing the suitability of machine learning applications in healthcare.[18] We also followed other guidelines to find comparable models to evaluate success criteria.[19] This protocol aimed to provide several building blocks that can be applied either completely or partially to develop and validate clinical prediction using high-dimensional data by both human and machine learning.

Reagents and Equipment
We used the R 4.0.2 programming language (R Foundation, Vienna, Austria) to conduct most steps of the data analysis. For steps related to the DI-VNN, we also used the Python 3.6.3 programming language (Anaconda Inc., Austin, TX, USA). The integrated development environment software was RStudio 1.3.959 (RStudio PBC, Boston, MA, USA). To ensure reproducibility, we used Bioconductor 3.11;[20] thus, versions of the included R packages were all in sync according to this Bioconductor version. For all models except the DI-VNN, we used the R package caret 6.0.86, which wraps R packages for the modeling algorithms: glmnet 4.1, Rborist 0.2.3, and gbm 2.1.8. For the DI-VNN, we used the keras 2.3.0 and tensorflow 2.0.0 Python libraries via the R packages reticulate 1.16, keras 2.3.0.0, and tensorflow 2.0.0. We also created R packages and a Python library for many steps in the data analysis, including the DI-VNN: medhist 0.1.0, gmethods 0.1.0, rsdr 0.1.0, clixo 0.1.1, and divnn 0.1.3 (both an R package and a Python library). All of these packages/libraries are available for download from this repository: https://github.com/herdiantrisufriyana. For model deployment, we used Shiny Server 1.4.16.958 and Node.js 12.20.0. Details on other R package versions and all of the source code (vignettes) for the data analysis are available (Table 6 in Supplementary Information).
To reproduce our work, a set of hardware requirements may be needed. We used a single machine for all models, except the DI-VNN, with 16 logical processors for the 2.10 GHz central processing unit (CPU) (Xeon® E5-2620 v4, Intel®, Santa Clara, CA, USA), 128 GB RAM, and 11 GB graphics processing unit (GPU) memory (GeForce GTX 1080 Ti, NVIDIA, Santa Clara, CA, USA). Parallelism was applied for CPU computing. Meanwhile, the DI-VNN required a higher GPU capability than that provided by the previous specification. For hyperparameter tuning and training, we used multiple machines in a cloud with 90 GB RAM and 32 GB GPU memory (Tesla V100-SXM2, NVIDIA, Santa Clara, CA, USA). For predictions, the DI-VNN only needed a CPU in a local machine, or one in a cloud machine for the web application.

Procedure

1. Choose the study design.
Either a prospective or retrospective cohort paradigm was recommended to prevent temporal bias for prognostic prediction, and a case-control design should be avoided.[21] The latter study design may introduce collider-stratification or selection bias.[13] For diagnostic prediction, a cross-sectional design was recommended by PROBAST.[17] These guidelines also warn against utilizing randomized controlled trials for prediction studies. The study design should reflect a similar situation at the population level, according to whether we develop prognostic or diagnostic prediction. In our example, a retrospective design was applied to select subjects from the database.

2. Define the target population by selection criteria.
Selection criteria for subjects with any outcome should be defined. These should not be defined separately for each outcome group, i.e., case and control; otherwise, we implicitly apply a case-control design. In our example, all 12-55-year-old females were included, specifically those whose visits were recorded within the dataset period. This represented the population of visitors to the healthcare providers, since we intended to use our models for this population. Each pregnancy period of the same subject should be treated as a different subject. The medical history before or during the first pregnancy was retained for the second instance.
All visits after delivery should be excluded within the dataset period. Codes of diagnosis and procedure for determining delivery or immediate post-delivery care should be determined and listed in supplementary information.
3. Determine the type of prediction task.
Machine learning prediction tasks may be either classifying a categorical outcome or estimating a numerical outcome, and either prognostic or diagnostic prediction. The estimation task is also commonly known as a regression task. We avoid this word to prevent confusing it with regression algorithms, which can serve both classification and estimation tasks. In our example, our tasks were prognostic classification of prelabor rupture of membranes (PROM) and prognostic estimation of the time of delivery. In the context of electronic health records, as shown in our example, we assigned one or more codes of diagnosis and/or procedure to an outcome based on the International Classification of Diseases, 10th revision (ICD-10) coding system. A subject was otherwise assigned as a nonevent. In the context of pregnancy, we may utilize the same codes to determine the end of the pregnancy period. The estimated outcome would be an infinite set of numbers. In the context of pregnancy, as shown in our example, we did not have information about gestational age. But we may infer the number of days from a visit, at which the prediction is made, to the last visit, at which the true outcome occurs. We used the same codes as in the classification task to determine the visits encountering events and nonevents.
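The labeling described above can be sketched in a few lines of Python. Everything here is a hypothetical illustration: the code set, visit days, and three-character prefix matching are assumptions, not the study's actual extraction rules.

```python
# Hypothetical sketch: label a subject for the classification task (any
# outcome ICD-10 code -> event) and derive the estimation target as days
# from each visit to the last visit, taken as the time of the true outcome.
OUTCOME_CODES = {"O42"}  # hypothetical PROM-related ICD-10 prefix set

visits = [  # one subject's visits: (day, set of ICD-10 codes at that visit)
    (0, {"Z34"}),
    (30, {"Z34"}),
    (250, {"O42", "O80"}),  # delivery-period visit carrying the outcome code
]

event = any(c[:3] in OUTCOME_CODES for _, codes in visits for c in codes)
last_day = max(day for day, _ in visits)
days_to_outcome = [last_day - day for day, _ in visits]
# event -> True; days_to_outcome -> [250, 220, 0]
```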

Preserve censoring information in the dataset.
Censored outcomes should not be excluded at first. For example, we may not know whether the subjects would become pregnant, or whether those who were pregnant would deliver, by the end of the study period. Instead of removing instances with censored outcomes, we assigned censoring labels. These were taken into account for causal inferences, and for weighting uncensored outcomes over both censored and uncensored outcomes when training the models later. Indeed, none of the censored outcomes would be included for developing the prediction models. This preserves distributions of outcomes similar to those in the target population and resolves the class imbalance problem by inverse probability weighting.
6. Set priority based on assessment of practical costs of under- and over-prediction.
To evaluate a prediction model, the practical costs of prediction errors should be considered.[16] We need to relate the potential consequences of under- and over-prediction, then choose which one would likely be more frequent or of larger magnitude. A priority on dealing with under-prediction requires good calibration and a higher true positive rate, while that on over-prediction also requires good calibration but a higher specificity. For the estimation task, we need to set a limit on the maximum error that is considered safe for using a prediction model. In our example, we set priority on dealing with under-prognosis and a limit of error within 2 to 4 weeks of the true time of delivery.

7. Determine candidate predictors.
We need to collect data for variables classified into two groups. The first group contains baseline variables, i.e., demographics. Meanwhile, the other variables are candidate predictors. We avoid using baseline variables as candidate predictors, particularly those which require private data and reflect social and economic background. However, we need to understand the characteristics of our population based on the baseline variables. Using these variables, we may find whether future, unobserved data have similar characteristics, since this describes how well the predictive performance of our models would likely hold for those new instances. Baseline variables were indeed still included for causal inferences. We also avoided maternal age, although this variable is often a strong predictor. The reason is that machine learning models often memorize age non-linearly. In turn, a large weight is often assigned to maternal age, shrinking the weights of other predictors. All of these demographic variables were assigned either 0 or 1 for no or yes in each category of each variable.
Meanwhile, we extracted ICD-10 codes for diagnoses or procedures from medical histories.

8. Define and verify causal factors as parts of candidate predictors.
We made gmethods 0.1.0, an R package, that allows future investigators to conduct statistical tests for causal inference. Please kindly follow the protocol.[9] After verifying causal factors, we only included those in a prediction model that applied logistic regression with a shrinkage method, as recommended by PROBAST, instead of using a stepwise selection method.[17] We chose ridge regression (RR), which applies L2-norm or beta regularization, because this method retains all causal factors within the model after weights are updated by training.[22] We understood that this model would not necessarily be the best model, because of the use of causal factors. Predictive modeling normally exploits confounders to achieve better performances, while causal models cannot explain all variation among individuals, to which confounding factors contribute.[13] However, by comparing a predictive model to one that uses only causal factors, we could estimate how much confounder effects were exploited to improve the predictive performance by machine learning algorithms. This, in turn, can warn a human user of machine learning algorithms to conduct a critical appraisal of the internal properties of a machine learning model. We followed the same procedures for hyperparameter tuning and parameter fitting (training) as those for machine learning, as described in the following sections. Nonetheless, we viewed tuning and training of this prediction model as already a part of machine learning, since fewer interventions are required by human users.

9. Remove candidate predictors with perfect separation problems.
We identified candidate predictors that took a single value or had zero variance. We removed all candidate predictors which were positive (value of 1) in only one of the outcomes in a training set. This is a perfect separation problem.[16] A predictor may be exclusive to one of the outcomes by chance due to sampling error. To prevent such bias, we removed perfect-separation candidate predictors. Details of the candidate predictors and selection should be described in supplementary information.

10. Remove candidate predictors that may cause outcome leakage.
We need to identify these kinds of candidate predictors. In the context of electronic health records, these would be any variables indicated by codes of diagnosis and procedure which explicitly or implicitly follow the outcome definition. In our example, these referred to maternal or baby diagnosis/procedure codes that typically occur only during the delivery or post-delivery period. Otherwise, these codes would unexpectedly leak outcome information. The excluded codes should be listed and reported in supplementary information.
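The zero-variance and perfect-separation filters in step 9 can be sketched as follows. This is a minimal, hypothetical Python illustration with toy data, not the protocol's R implementation; column names are invented.

```python
# Sketch of step 9: drop candidate predictors with zero variance, or whose
# positive values (1) are confined to a single outcome class in the training
# set (perfect separation). Toy data; names are hypothetical.
X = {                       # predictor -> values per instance
    "constant":   [1, 1, 1, 1],
    "separating": [1, 1, 0, 0],   # positive only among events
    "ok":         [1, 0, 1, 0],
}
y = [1, 1, 0, 0]            # outcome per instance

def keep(values):
    if len(set(values)) < 2:                       # zero variance
        return False
    classes = {yi for v, yi in zip(values, y) if v == 1}
    return len(classes) == 2                       # positives in both classes

kept = [name for name, v in X.items() if keep(v)]
# kept -> ["ok"]
```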

11. Filter out redundant candidate predictors.
To conduct this step, we need to compute pairwise Pearson correlation coefficients. One of any pair of candidate predictors with a high correlation (i.e., >~0.70) should be removed. This may be done by using only one of the paired candidate predictors, or by unifying both under a single definition. In our example, we had pairs of candidate predictors that were highly correlated, but the coefficients were borderline (i.e., ~0.7), and those were causal factors and some of the codes defining the factors themselves. We judged that their meanings were not substantially the same; thus, we retained those candidate predictors.
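The correlation filter above can be sketched as follows. This is a hypothetical Python illustration with toy values; the 0.7 threshold comes from the text, but the "keep the first of the pair" rule is an assumption for the sketch.

```python
# Sketch of step 11: flag pairs of candidate predictors with pairwise
# Pearson |r| > 0.7 and drop one of each pair. Toy data.
import math

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

X = {"a": [1.0, 2.0, 3.0, 4.0],
     "b": [1.1, 2.0, 2.9, 4.2],   # nearly duplicates "a"
     "c": [4.0, 1.0, 3.0, 2.0]}

names = list(X)
dropped = set()
for i, u in enumerate(names):
    for v in names[i + 1:]:
        if u not in dropped and v not in dropped:
            if abs(pearson(X[u], X[v])) > 0.7:
                dropped.add(v)       # keep the first of the pair (assumption)
kept = [n for n in names if n not in dropped]
# kept -> ["a", "c"]
```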
12. Construct provider-wise datasets for model development and validation.
We used medical histories and causal factors as candidate features/predictors. Instead of using nationwide medical histories, we need to extract provider-wise ones by estimation. We only considered the medical history of a subject recorded in a single healthcare provider and treated that from other providers as separate medical histories, as in a real-world setting.
Electronic health records across healthcare providers are unlikely to be connected. We need to have nationwide historical (Kaplan-Meier, KM) rates for each code, derived from the training set only. All medical histories in days were transformed into historical rates. This technique allowed generalization of individual data based on nationwide, population-level data, without the need to access data from other providers. We made medhist 0.1.0, an R package, that allows future investigators to implement this historical rate. Please kindly follow the protocol.[10]

13. Conduct dimensional reduction by resampling for candidate features of the machine-learning prediction models.
We made rsdr 0.1.0, an R package, that allows future investigators to apply resampled dimensional reduction. Please kindly follow the protocol.[11] Resampling is important to prevent overfitting in machine learning.[8] But it is commonly applied in supervised rather than unsupervised machine learning; thus, this procedure is expected to deal with this problem.

15. Consider the number of events per variable when choosing the number of candidate predictors for each model.
The minimum number is 20 events per variable (EPV), as recommended for logistic regression by the prediction model risk of bias assessment tool (PROBAST) guidelines.[17] Dimensional reduction can be conducted to get latent variables fewer than the original candidate predictors. Furthermore, pre-selection may be applied by a regression model, which needs the least EPV. Then, we picked fewer latent variables with the highest absolute, non-zero weights as candidate predictors for the models with higher EPV requirements, e.g., 200 EPV for random forest and gradient boosting machine.[23]

16. Choose modeling approaches to be compared.
Five modeling approaches were applied for supervised machine learning, each consisting of a set of procedures covering feature representation, feature selection, hyperparameter tuning, and training strategies. For all models, we applied a grid search method for hyperparameter tuning with a minimum of 10 alternatives for each modeling approach. The best hyperparameters or tuning parameters were used for training the model or fitting the parameters.

17. Compute outcome weights to overcome the class imbalance problem.
For the classification task, the outcome (Y) was weighted by half of the inverse probability/prevalence (wi), including the censored outcome (∅). For example, if the prevalence of Y = 1 is 0.2, then wi is 1 ÷ 0.2 × 0.5 = 2.5 for the outcome of Y = 1. The sum of the three probabilities is equal to 1. The weight formula is shown in Figure 1. Weights were plugged into a general equation of the loss function in this study (Figure 2), in which training was generally conducted to estimate parameters θj in a model f(xij, θj) that minimizes L, where n is the number of visits, p is the number of candidate predictors, xij is the value for each ith instance and jth candidate predictor, and α and λ are regularization factors. Meanwhile, no weighting was applied in the estimation task (wi = 1 for any i). In addition, a specific training strategy was applied for the last modeling approach, the DI-VNN, a model using the pipeline we developed in a previous protocol.[12]

18. Determine hyperparameters for the human-learning prediction model.
The first model was the causal RR. We used all causal factors except those that included demographics. This model applied a filter method for feature selection by verifying assumptions based on domain knowledge with a statistical test for the causal inference, as described in the previous section. The RR was applied as the parameter-fitting algorithm. The tuning parameter was λ = {10^-9, 10^-8, ⋯, 10^0} while keeping α = 0. These values were plugged into the loss function (Figure 2).

19. Determine hyperparameters for the machine-learning prediction model using a regression algorithm.
The second model was the PC-ENR (elastic net regression on principal components). We used all PCs for the second model since the EPV was already >20 using the training set. This number, instead of 10, is recommended when applying a regression algorithm.[17] This model also applied a shrinkage method for feature selection. Tuning parameters were combinations of α = [0, 0.25, ⋯, 1] and λ = [10^-9 + g(10^0 - 10^-9)/G, ⋯, 10^0] for g = [1, 2, ⋯, G] and G = 5; thus, the best tuning parameters were searched over 5 × 5 alternatives. These values were plugged into the loss function (Figure 2), where α = 0 means this model becomes an RR but using PCs instead of the original candidate predictors. But if α > 0, some candidate predictors x(j) might be zeroed after being multiplied by the θj which minimizes the training loss. Each time, this was applied to different α and λ values to find the pair of values that minimized the validation loss.

20. Conduct further pre-selection of candidate features for machine-learning prediction with a larger number of events per variable.
The 3rd and 4th models respectively applied the PC-RF (random forest) and PC-GBM (gradient boosting machine). These models also applied the wrapper method for feature selection. This means that candidate predictors were pre-selected by a model before being used for these models. The wrapper model was the PC-ENR. We expected a smaller number of PCs to be pre-selected; thus, the EPV was ≥200 using the training set. This is achieved by either using the wrapper method only or subsequently applying a filter method. The filter method was conducted by ranking the PCs in descending order based on the corresponding θj. We selected l PCs, where l is a number such that the EPV is 200. Unlike regression algorithms, modern machine learning algorithms are data hungry; the tested algorithms were the classification and regression tree (CART), RF, support vector machine, and shallow neural network.[23]

21. Determine hyperparameters for the machine-learning prediction models using ensemble algorithms.
RF and GBM are ensemble algorithms. This means that both use the prediction results of multiple models that apply the same algorithm, i.e., CART. However, the RF ensembles models in parallel, while the GBM ensembles the models sequentially.[24,25] Parallel ensemble means predictions are respectively the majority and average of CART predictions for classification and estimation tasks. Meanwhile, sequential ensemble means a simple CART is made to predict the classification or estimation error of an earlier CART model. Tuning parameters for the RF were combinations of the number of random candidate predictors used to build a CART each time [5 + g(45 - 5)/G, ⋯, 45] and the number of minimum unique instances per node for G = 5, while 500 CARTs were built for each of the candidate models. For the GBM, we set tuning parameters as combinations of the number of CARTs [100, 200, ⋯, 2500] and a shrinkage factor or L2-norm regularizer [0.0005 + g(0.05 - 0.0005)/G, ⋯, 0.05] for G = 25, while the minimum sample number per node (tree) was 20 and only 1 random predictor was used to build a CART each time. Since exhaustively comparing all possible combinations is time consuming, if not impossible, all numbers for both ensemble models were determined based on common practices for simplicity. However, we considered the sample size, the number of candidate predictors, and the diversity of approaches between the two algorithms to heuristically optimize the hypothesis search. The prediction results of these models were plugged into the loss function (Figure 2) with α = 0 and λ = 0 to compute the errors that were minimized by training.
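The grid formulas above can be written out explicitly. The following Python sketch is one plausible reading of those formulas (whether g starts at 0 or 1 is an assumption in places); it only builds the grids, not the models.

```python
# Sketch of the grid-search spaces described in the text (one plausible
# reading of the bracketed formulas; endpoints match the stated ranges).
G = 5
rf_mtry = [5 + g * (45 - 5) // G for g in range(G + 1)]   # 5 .. 45
gbm_trees = list(range(100, 2501, 100))                   # 100 .. 2500

G2 = 25
gbm_shrinkage = [0.0005 + g * (0.05 - 0.0005) / G2 for g in range(1, G2 + 1)]
# rf_mtry -> [5, 13, 21, 29, 37, 45]; 25 tree counts; 25 shrinkage values
```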

22. Develop and validate the DI-VNN model.
This model works like neurons in the brain. Predictors are fed as inputs into neurons, and the outputs follow an all-or-none activation function. These inputs are arrayed in a minimum of two dimensions, like an object projected onto the retina. We call this an ontology array. A predictor that is correlated with another predictor has a closer position than a less-correlated predictor. From the receptive field on the retina, the signals propagate to the primary visual cortex and then the visual association cortex. A signal pathway of the neural network is dedicated to specific parts of the array. We call this an ontology network. By seeing similar objects with uncertain variations, the neural network is maintained at specific activation thresholds and weights. This creates visual memory in the association cortex to recognize a particular object by segmenting it into several parts. We made divnn 0.1.3, an R package and Python library, that allows future investigators to develop and validate this model. Please kindly follow the protocol.[12]

23. Evaluate all prediction models.

a. Calculate the 95% confidence interval from all resampling subsets.

We needed to evaluate our models in terms of both classification and estimation tasks. The 95% confidence interval (CI) was used to express all evaluation metrics as estimates. That interval was calculated from the metrics of multiple resampling subsets for model validation.
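One common way to turn resampled metrics into a 95% interval is the percentile method. This Python sketch uses 30 hypothetical bootstrap AUROC values; whether the study used percentiles or a normal approximation is not stated, so treat this as an assumption.

```python
# Sketch of step 23.a: a 95% interval from evaluation metrics computed on
# multiple resampling subsets (30 hypothetical bootstrap AUROCs), using the
# percentile method with linear interpolation.
metrics = [0.70 + 0.01 * (i % 11) for i in range(30)]  # stand-in values

def percentile_ci(xs, p_lo=0.025, p_hi=0.975):
    s = sorted(xs)
    def q(p):
        k = p * (len(s) - 1)
        f, c = int(k), min(int(k) + 1, len(s) - 1)
        return s[f] + (k - f) * (s[c] - s[f])
    return q(p_lo), q(p_hi)

ci_lo, ci_hi = percentile_ci(metrics)
```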

b. Assess whether a model is well-calibrated for the classification task.
Calibration measures were evaluated by fitting the predicted probabilities to true probabilities using a univariable linear regression; thus, the calibration measures were the intercept and slope of this linear regression. We also produced a calibration plot. A model is well-calibrated if the intervals of the calibration intercept and slope respectively fall close to 0 and 1, and the calibration plot approximately hugs the reference line.
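The intercept and slope above come from an ordinary least-squares fit of observed rates on predicted probabilities. This Python sketch uses synthetic, perfectly calibrated bins (an assumption for illustration), so it recovers slope 1 and intercept 0.

```python
# Sketch of step 23.b: univariable linear regression of observed event rates
# on predicted probabilities. A well-calibrated model gives an intercept near
# 0 and a slope near 1. Synthetic, perfectly calibrated bins.
pred = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
obs  = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]  # observed rate per bin

n = len(pred)
mx, my = sum(pred) / n, sum(obs) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(pred, obs))
         / sum((x - mx) ** 2 for x in pred))
intercept = my - slope * mx
# slope -> 1.0, intercept -> 0.0 for this perfectly calibrated example
```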

c. Compare discriminative ability among well-calibrated models for the classification task.
Meanwhile, the discriminative ability of a model was quantified by an interval estimate of the area under the receiver operating characteristic curve (AUROC). A non-overlapping, higher interval of the AUROC determined the best model among the most calibrated ones.

d. Compute the prediction error for the estimation task.
For estimation, we applied the root mean squared error (RMSE) to train the models. However, since a longer interval is reasonably more difficult to predict, a single evaluation by RMSE might be insufficient to evaluate the estimation performance. In our example, we also evaluated the estimation plot between the predicted and true times of delivery, binned per week.
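The RMSE and the per-week binning behind the estimation plot can be sketched as follows; the day values are hypothetical and the floor-division weekly binning is an assumption for illustration.

```python
# Sketch of step 23.d: RMSE plus per-week binning of true times of delivery
# by predicted time (hypothetical predicted/true values, in days).
import math

true = [14, 20, 30, 45]
pred = [10, 22, 33, 40]

rmse = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

bins = {}
for p, t in zip(pred, true):
    bins.setdefault(p // 7, []).append(t)   # bin by predicted week
# bins -> {1: [14], 3: [20], 4: [30], 5: [45]}
```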
e. Consider differentiating the prediction error of estimation based on the predicted classification.
Different lines in the estimation plot were provided for positive and negative predictions by the best classification model. This would be reasonable for clinical applications, giving an interval estimate of days for the true time of delivery conditional on the predicted one. In our example, we also limited this evaluation to an estimated time of 42 weeks or less, which is the maximum duration of pregnancy.

f. Compare predictive performance for the estimation task considering clinical relevance.
To determine the best estimation model, we computed the proportion of weeks in which each predicted time, in weeks, was included within an interval estimate of the true one. The interval had to be at maximum ± x weeks when predicting x weeks. In our example, if a model predicted that a woman would deliver in 6 weeks, then the interval estimate of the true time of delivery should be at maximum ± 6 weeks. For any women predicted to deliver in 6 weeks, the true time should fall into that interval.

g. Limit the minimum and maximum values that can be estimated precisely.
In our example, for the best estimation model, we determined the minimum and maximum predicted times of delivery with acceptable precision for each predicted outcome of PROM, based on a visual assessment using internal validation. We also computed the RMSE within this acceptable range.

24. Assess fulfillment of the success criteria of the model development.
The success criteria of the modeling should include at least a threshold-agnostic evaluation metric that is better than those of recent models of a similar outcome using similar predictors. Consider adding one or more criteria to prevent overfitting, as recommended by the PROBAST guidelines.[17] If there are no previous studies to compare with, an AUROC of 0.8 or more can be considered as a success criterion.[18]

25. Compare to models from previous studies.

a. Determine the context of model comparison across studies.
Model comparison should be within the context of its clinical application. This is represented in the eligibility criteria used to select studies for comparison.

b. Set up comparison guidelines.
Evaluation of the success criteria needs comparator models. To find these models, one can apply the PRISMA 2020 guidelines; thus, the models are comparable and the comparison is also fair. Since a systematic review and meta-analysis were not the main purposes of the study in our example, we only applied all items in the methods section of the guidelines, except item numbers 11 and 14 regarding the risk of bias assessment and item numbers 13 and 15 regarding the synthesis method and certainty. This was because we applied the guidelines only to facilitate fair comparisons to previous studies, and did not draw conclusions as to how validly and accurately PROM could be predicted.

c. Define study selection criteria.
The eligibility criteria should follow the PICOTS framework: (1) population, targeted subjects that may or may not have the outcome; (2) index, the prediction model of interest; (3) comparator, previous prediction models to be compared with; (4) outcome, the definition and data scale of the outcome, with or without an additional criterion on the sample size (e.g., events per variable); (5) time, prognostic or diagnostic prediction with a specified time interval; and (6) setting, the healthcare setting such as primary care, hospital, inpatient, outpatient, et cetera. In our example, only original articles and full papers of conference abstracts were included. No grouping of the studies was needed because we would not conduct a meta-analysis for data synthesis.

d. Develop the search strategy.
A search date interval and literature databases should be defined. We also need to describe the keywords. Any configuration for searching studies should also be described, e.g., limiting the publication date and language.
e. Develop the review strategy.
We need to describe which author performs each step of the review. If there is a conflicting decision among reviewers, how the final decision is reached should be described.

f. Choose the data to extract.
In our example, from each study, we extracted the evaluation metrics of the best model whose outcome definition was most similar to that of our study. Any data needed to briefly assess the potential risk of bias were also extracted. These included the study design, population, setting, outcome definition, sample size, details on events and nonevents, number of candidate predictors, EPV, predictors in the best final model, and the most recommended validation techniques (external over internal validation; bootstrapping over cross-validation, or cross-validation over test split). If no AUROC was reported by a previous study but sensitivity and specificity were, then we computed the AUROC from both metrics using the trapezoidal rule (Figure 3).
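The trapezoidal-rule AUROC from a single reported sensitivity/specificity pair can be sketched as follows. This assumes the ROC curve is approximated by the points (0,0), (1-specificity, sensitivity), and (1,1), one plausible reading of Figure 3, under which the area reduces to (sensitivity + specificity)/2.

```python
# Sketch: trapezoidal-rule AUROC from a single (sensitivity, specificity)
# point, taking the ROC curve as (0,0) -> (1-spec, sens) -> (1,1).
def auroc_trapezoid(sens, spec):
    x = 1.0 - spec                        # false positive rate at the point
    return 0.5 * x * sens + 0.5 * (1.0 - x) * (sens + 1.0)

auc = auroc_trapezoid(0.8, 0.7)  # -> 0.75, i.e., (0.8 + 0.7) / 2
```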

g. Define the model evaluation method.
Plots of evaluation should be overlaid with those of our models. In our example, we plotted a point at the sensitivity and specificity of each of our models at the optimum threshold on each ROC curve. AUROCs were also compared with the models in our example (with or without the 95% CI).

26. Calibrate the predicted probability of any model for the classification task.
A model can be calibrated by training another model to estimate the true probability of an outcome for a classification task, given the probability predicted by the model. In our example, we applied a general additive model using locally weighted scatterplot smoothing (GAM-LOESS). Instead of using precalibrated models, the final comparison was made among models calibrated using the predicted probabilities from each of the developed models. This calibration was intended to obtain more-linear predicted probabilities. No calibration was conducted for the estimation models.

27. Validate the predictive performances using several data partition methods.

a. Conduct data partition for non-randomized, external validation.
We split the dataset into subsets intended for internal and external validation. For all external validation sets, evaluation metrics were resampled by bootstrapping 30 times to obtain interval estimates. As recommended, we split out datasets for external validation by geographical, temporal, and geotemporal splitting. Geographical splitting was conducted by stratified random sampling of cities, taking a proportion p of the cities in each state. Temporal splitting was conducted by stratified random sampling of days, taking a proportion p of the days in each subtropical season. Visits of subjects from the sampled cities or days were respectively excluded as a geographical or temporal subset. Then, we further excluded a proportion p of the cities in the geographical subset and of the days in the temporal subset. Visits of subjects belonging to both of these excluded subsets were overlapped as a new geotemporal subset; neither the geographical nor the temporal subset included the visits or subjects in this geotemporal subset. We randomly tried different proportions p for geographical, temporal, and geotemporal splitting such that approximately 20% of visits belonged to any of the three subsets. These external validation sets were used for stress tests of our models, since the distributions of predictors and outcome were conceivably uncertain among the excluded cities, days, or overlaps. This reflects the situation in some real-world settings, but it might not reflect common situations nationwide.
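The non-random splitting described above can be sketched as follows. The helper names, the tuple layout of a visit, and a single fixed proportion p are illustrative assumptions, not the protocol's exact implementation:

```python
import random

def stratified_sample(items_by_stratum, p, rng):
    """Sample a proportion p of items within each stratum
    (e.g., cities per state, or days per subtropical season)."""
    sampled = []
    for stratum_items in items_by_stratum.values():
        k = max(1, round(p * len(stratum_items)))
        sampled.extend(rng.sample(stratum_items, k))
    return set(sampled)

def geotemporal_split(visits, cities_by_state, days_by_season, p, seed=0):
    """Split visits into geographical, temporal, and geotemporal subsets.

    `visits` is a list of (visit_id, city, day) tuples. Visits from sampled
    cities form the geographical subset and visits on sampled days the
    temporal subset; a further proportion p of the sampled cities and days
    defines the geotemporal overlap, which is removed from the other two.
    """
    rng = random.Random(seed)
    geo_cities = stratified_sample(cities_by_state, p, rng)
    tmp_days = stratified_sample(days_by_season, p, rng)
    # Further exclude a proportion p of the sampled cities and days.
    overlap_cities = set(rng.sample(sorted(geo_cities), max(1, round(p * len(geo_cities)))))
    overlap_days = set(rng.sample(sorted(tmp_days), max(1, round(p * len(tmp_days)))))
    geotemporal = [v for v in visits if v[1] in overlap_cities and v[2] in overlap_days]
    gt_ids = {v[0] for v in geotemporal}
    geographical = [v for v in visits if v[1] in geo_cities and v[0] not in gt_ids]
    temporal = [v for v in visits if v[2] in tmp_days and v[0] not in gt_ids]
    return geographical, temporal, geotemporal
```

In practice, p would be retried with different random seeds until the three subsets together cover roughly 20% of the visits, as described above.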

b. Conduct data partition for randomized, external validation.
To estimate predictive performance nationwide, ~20% of the remaining set was randomly split out, thus leaving ~64% of the original sample size. This random subset was more representative for external validation to estimate the predictive performance of our models nationwide. After splitting out this random subset, the remaining samples were intended for internal validation.

c. Conduct data partition for calibration.
We split out ~20% of the internal validation set to calibrate each model. The final predictive performance of internal validation came from this calibration subset. We resampled it by bootstrapping 30 times to compute interval estimates. To compute the EPV, we only used the remaining ~80%, that is, the pre-calibrated subset. This subset was used to train the models.

d. Choose resampling techniques for any validation.
For hyperparameter tuning, we applied 5-fold cross-validation, while the final training was conducted using the best-tuned parameters by bootstrapping 30 times. Both tuning and training used the same pre-calibrated set by applying cross-validation for all models except the DI-VNN; please follow the protocol of the DI-VNN.12

28. Consider assessing the best time window for the best model of prognostic prediction.
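The nested partition proportions described above (~20% non-random external, ~20% of the remainder for random external, and ~20% of the internal set for calibration) can be sketched as simple arithmetic; the function name and dictionary keys are illustrative only:

```python
def partition_fractions(n_total: int) -> dict:
    """Nested data partition following the protocol's approximate proportions.

    ~20% for non-random external validation, then ~20% of the remainder for
    random external validation; of the internal set, ~20% is split out for
    calibration and ~80% is pre-calibrated, used for training and the EPV.
    """
    nonrandom_external = round(0.20 * n_total)
    remaining = n_total - nonrandom_external
    random_external = round(0.20 * remaining)
    internal = remaining - random_external       # ~64% of the original sample
    calibration = round(0.20 * internal)
    training = internal - calibration            # ~51% of the original sample
    return {
        "nonrandom_external": nonrandom_external,
        "random_external": random_external,
        "calibration": calibration,
        "training": training,
    }
```

For 10,000 visits this gives roughly 2,000 / 1,600 / 1,280 / 5,120 samples per subset, summing back to the original total.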
The best time window should be evaluated using the best model for prognostic prediction. This provides additional insight when interpreting the best model. In our example, using the internal validation set, we grouped samples by binning the days to the end of pregnancy into 4-week intervals. An interval estimate of the AUROC was computed for each bin from all resampled subsets. By observing the plot, the best time window should mostly cover interval estimates greater than an AUROC of 0.5, which represents prediction by simple guessing.
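A minimal sketch of this binning and per-bin interval estimation, assuming one AUROC value per bootstrap resample and a simple percentile interval (the data layout and percentile rule are assumptions, not the protocol's exact procedure):

```python
def four_week_bin(days_to_outcome: int) -> int:
    """Assign a sample to a 4-week (28-day) bin by days to the end of pregnancy."""
    return days_to_outcome // 28

def auroc_interval_by_bin(samples, alpha=0.05):
    """Group bootstrap AUROCs per 4-week bin and return percentile intervals.

    `samples` is an iterable of (days_to_outcome, auroc) pairs, one AUROC per
    bootstrap resample; the structure is illustrative only.
    """
    by_bin = {}
    for days, auroc in samples:
        by_bin.setdefault(four_week_bin(days), []).append(auroc)
    intervals = {}
    for b, aurocs in sorted(by_bin.items()):
        aurocs.sort()
        lo = aurocs[round(alpha / 2 * (len(aurocs) - 1))]
        hi = aurocs[round((1 - alpha / 2) * (len(aurocs) - 1))]
        intervals[b] = (lo, hi)
    return intervals
```

A bin would pass the "better than simple guessing" check described above when the lower bound of its interval exceeds 0.5.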
29. Deploy the best model for each type of task.

a. Determine input for the models.
The best model can be deployed for both classification and estimation tasks. A model can be too complex to apply without a software application; thus, a web application needs to be provided for using the best models. This can be made using R Shiny. In our example, a user needs to upload a deidentified, two-column comma-separated values (CSV) file of admission dates and ICD-10 codes of the medical history of a subject. A use case should be described with an example dataset for immediate demonstration of how to use the web application.
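A sketch of parsing such an upload; the header names (`admission_date`, `icd10`) and the ISO date format are assumptions for illustration, and the deployed R Shiny app may expect a different layout:

```python
import csv
from datetime import date
from io import StringIO

def parse_history(csv_text: str):
    """Parse a deidentified two-column CSV of admission dates and ICD-10 codes
    into a chronologically sorted medical history."""
    reader = csv.DictReader(StringIO(csv_text))
    history = []
    for row in reader:
        history.append((date.fromisoformat(row["admission_date"]),
                        row["icd10"].strip().upper()))
    return sorted(history)  # chronological order by admission date

# Hypothetical example upload for demonstration.
example = "admission_date,icd10\n2020-03-01,O14.1\n2019-11-20,E11.9\n"
```

The sorted history can then be fed to the feature-extraction step (e.g., the historical-rate features) before prediction.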

b. Design user interface.
For responsible clinical prediction, a threshold is expected to be flexible, set by a user. For classification performance, a threshold should be able to be chosen under expected population-level performances of the evaluation metrics of interest. For estimation performance, a true value should also be estimated based on a subpopulation with the same estimated value. This may also be conditioned on the same classification result. All population- and subpopulation-level metrics are expressed as interval estimates. In our example, the population data for the classification task were those of the calibration split, while those for the estimation task were data of the internal validation set, which consisted of both pre-calibrated and calibrated subsets. In addition, for the estimation task, a timeline should also be shown to visualize each predicted value with the true interval estimate at the subpopulation level. The timeline should show the time of prediction and the code entries that were used as features; thus, a medical history of a subject is visualized using this timeline. All results can be saved online and/or downloaded as a report. An example of the report should be shown. Inference duration should also be measured 10 times for a use case and reported as interval estimates.
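One way a user-facing threshold control could work is to let the user state a minimum acceptable sensitivity and return the operating point with the best specificity under that constraint. This selection rule and the data layout are illustrative assumptions, not the protocol's prescribed interface logic:

```python
def choose_threshold(roc_points, min_sensitivity=0.95):
    """Pick the threshold with the highest specificity among operating points
    meeting a user-chosen minimum sensitivity.

    `roc_points` is a list of (threshold, sensitivity, specificity) tuples
    computed on the population data (e.g., the calibration split).
    """
    eligible = [p for p in roc_points if p[1] >= min_sensitivity]
    if not eligible:
        return None  # no operating point satisfies the constraint
    return max(eligible, key=lambda p: p[2])[0]
```

The same population-level table of operating points can also back the expected interval estimates shown next to the chosen threshold.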

Troubleshooting

Anticipated Results
We may expect better generalization with the data partition technique in this protocol. The predictive performance would reasonably be higher in the random split by chance, since, compared to the non-random splits, that split includes more variation across healthcare facilities in a country. Some models apply an overfitting strategy (e.g., random forest and gradient boosting machine) to outperform others while having lower generalization. This can lead to a false decision in model selection. It may be prevented by evaluating models on the calibration set, which is a different subset from that used for training the models by the algorithm of interest.
Although the causal RR uses causal factors, this does not mean this model would have the best predictive performance. A prediction model normally exploits confounders to predict all variations in the dataset.
Still, this model would be a good reference model for the others. We may also use this model to interpret important predictors in other models.
We may or may not find any well-calibrated model, but the most calibrated one(s) should be chosen among the prediction models, preferably those with distributions of predicted probabilities from 0 to 1. A wide range of predicted probabilities helps clinicians widely adjust the threshold based on the local data distribution. The optimal threshold for the best model(s) should be given as the initial threshold. In addition, distributions of predicted probabilities may or may not be visually differentiable between events and nonevents.
A well-calibrated prediction model may be either more sensitive or more specific based on receiver operating characteristic (ROC) curves. This should be assessed by internal validation (the calibration split). A more-sensitive model (i.e., high sensitivity) at 95% specificity is considerably better for screening a disease from a population-level standpoint. However, sensitivity and specificity at the optimal threshold also need to be considered.
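One common definition of the optimal threshold mentioned above is the point maximizing Youden's J (sensitivity + specificity − 1); this is an assumption here, since the protocol does not fix the rule:

```python
def youden_threshold(roc_points):
    """Return the threshold maximizing Youden's J = sensitivity + specificity - 1.

    `roc_points` is a list of (threshold, sensitivity, specificity) tuples.
    Youden's J is one common choice of optimal threshold; the protocol's
    exact rule may differ.
    """
    return max(roc_points, key=lambda p: p[1] + p[2] - 1.0)[0]
```

Reporting both the constrained operating point (e.g., at 95% specificity) and this optimum gives a fuller picture of a model's screening behavior.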
Non-random splitting may not describe generalization, but it can confirm the robustness of the best model. The best model, chosen by internal validation, is expected to be the best in most, if not all, of the non-random splits of external validation. Meanwhile, the random split may reflect common situations nationwide. We also need to ensure that the best model has interval estimates of predictive performance better than simple guessing, i.e., the AUROC interval should be higher than 0.5 for all the non-random splits. This can be a safety factor for implementing the best model before re-calibration using local data.
Comparison to prediction models from previous studies is important to assess successful model development. However, we may or may not find comparable models following the methods in the preferred reporting items for systematic reviews and meta-analyses (PRISMA) 2020 expanded checklist.19 In this situation, we may need to consider more literature databases or another keyword strategy. Furthermore, we may also consider modification of the eligibility criteria. If we find previous models, we need to compare them using a threshold-agnostic evaluation metric (e.g., AUROC) along with additional metrics requiring a threshold and the sample size for each class of outcome. In addition, it is also important to point out whether the predictors are those available in low- or high-resource settings.
For the estimation task, we need to consider clinical acceptance of the prediction error for each range of estimated values. Models may or may not fulfill the clinical acceptance for each of the ranges. The DI-VNN may not estimate the true time of delivery in our example, mainly due to the differential analysis, which filters predictors based on categorical outcomes only. If classification and estimation tasks are related in a circumstance, as demonstrated by our example, we need to stratify the estimated value for each of the classes predicted by the best classification model. As for the classification task, the best model for the estimation task should also be confirmed to consistently outperform the other models using the external validation sets.
Since an estimation model may not fulfill the clinical acceptance for each of the ranges, we need to determine the precise estimation window of the best estimation model using an internal validation set. This estimation window should be determined for each of the predicted classes. In addition, different estimated values for each of the classes may give insights into the outcome disease.

4. Determine outcome definition.
4.a. Define outcome for the classification task.
4.b. Define outcome for the estimation task.

27. Validate the predictive performances using several data partition methods.
27.a. Conduct data partition for non-randomized, external validation.

Steps 1 to 6 — Solution: consider widening the range of each period, or obtaining another dataset.
(Truncated row) — Solution: …complexity, or design an interface that shows the progress of prediction in multiple, short steps.

Time taken
All steps: approximately 1 hour to 8 days (pre-computed)
Steps 1 to 6: approximately 1 to 5 minutes (pre-computed)
Step 7: approximately 5 to 15 minutes (pre-computed)
Step 8: approximately 25 minutes to 3 hours per causal factor

Figures