Predicting community mortality risk due to CoVID-19 using machine learning and development of a prediction tool

Background: The recent pandemic of CoVID-19 has emerged as a threat to global health security. There are very few prognostic models on CoVID-19 using machine learning. Objectives: To predict mortality among confirmed CoVID-19 patients in South Korea using machine learning and deploy the best performing algorithm as an open-source online prediction tool for decision-making. Materials and methods: Mortality for confirmed CoVID-19 patients (n=3,022) between January 20, 2020 and April 07, 2020 was predicted using five machine learning algorithms (logistic regression, support vector machine, K nearest neighbor, random forest and gradient boosting). Performance of the algorithms was compared, and the best performing algorithm was deployed as an online prediction tool. Results: The gradient boosting algorithm was the best performer in terms of discrimination (area under ROC curve=0.966), calibration (Matthews Correlation Coefficient=0.656; Brier Score=0.013) and predictive ability (accuracy=0.987). The best performing algorithm (gradient boosting) was deployed as the online CoVID-19 Community Mortality Risk Prediction tool named CoCoMoRP (https://ashis-das.shinyapps.io/CoCoMoRP/). Conclusions: We describe the framework for the rapid development and deployment of an open-source machine learning tool to predict mortality risk among CoVID-19 confirmed patients using publicly available surveillance data. This tool can be utilized by potential stakeholders such as health providers and policy makers to triage patients at the community level in addition to other approaches.


Introduction
A novel coronavirus disease 2019 (CoVID-19), originating in Wuhan, China, was reported to the World Health Organization in December 2019. [1] Since then, this novel coronavirus has spread to almost all major nations in the world, resulting in a major pandemic. As of April 18, 2020, it has contributed to more than 2 million confirmed cases and about 150,000 deaths. [2] The first CoVID-19 case was diagnosed in South Korea on January 20, 2020. According to the Korea Centers for Disease Control and Prevention (KCDC), there have been 10,653 confirmed cases and 232 deaths due to CoVID-19 as of April 18, 2020. [3] In healthcare, accurate prognosis is essential for efficient management of patients while prioritizing care for those most in need. To aid in prognosis, several prediction models have been developed using various methods and tools, including machine learning. [4,5] Machine learning is a field of artificial intelligence where computers simulate the processes of human intelligence and can synthesize complex information from huge data sources in a short period of time. [6] Though there have been a few prediction tools on CoVID-19, only a handful have utilized machine learning. [7] To the best of our knowledge, there is so far no publicly available CoVID-19 prognosis prediction model from the general population of confirmed cases using machine learning. We attempt to apply machine learning to the publicly available community-level CoVID-19 data from South Korea to predict mortality.
Our study had two objectives: (1) predict mortality among confirmed CoVID-19 patients in South Korea using machine learning, and (2) deploy the best performing algorithm as an open-source online prediction tool for decision-making.
. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted May 3, 2020.

Patients
Patients for this study were selected from the data shared by the Korea Centers for Disease Control and Prevention (KCDC). [3] The timeframe of this study was from the detection of the first case (January 20, 2020) through April 07, 2020. The dataset contained a total of 3,128 patients. Our inclusion criteria were confirmed CoVID-19 cases with availability of sociodemographic, exposure and diagnosis confirmation features along with the outcome. We excluded patients who had missing features, namely sex (n=94) and age (n=12); thus, 3,022 patients were included in the final analysis.
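The exclusion step can be illustrated with a minimal Python sketch. It assumes each record is a dictionary with the hypothetical keys `sex` and `age` (the actual KCDC column names may differ) and that no patient was missing both features, which is consistent with the reported counts.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def apply_exclusions(records: List[Record]) -> List[Record]:
    """Keep only records in which both sex and age are present."""
    return [r for r in records if r.get("sex") and r.get("age")]


# Toy records for illustration only (not the KCDC dataset).
toy = [
    {"sex": "female", "age": "50s"},
    {"sex": None, "age": "20s"},   # missing sex: excluded
    {"sex": "male", "age": None},  # missing age: excluded
    {"sex": "male", "age": "80s"},
]
kept = apply_exclusions(toy)

# Counts reported in the text: 3,128 total, 94 missing sex, 12 missing age.
n_final = 3128 - 94 - 12
```

The final line only checks the arithmetic of the reported counts against the stated analysis sample of 3,022.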

Outcome variable
The outcome variable was mortality, and it was binary: "yes" if the patient died, or "no" otherwise.

Predictors
The predictors were individual patient-level socio-demographic and exposure features. They were age group, sex, province, date of diagnosis, and exposure. There were ten age groups as follows below 10 years, 10 Apr 2020. Patients were exposed in several settings, such as nursing home, hospital, religious gathering, call center, community center, shelter and apartment, gym facility, overseas inflow, contact with patients and others.

Descriptive Analysis
We performed descriptive analyses of the predictors by respective sub-groups and present the results as numbers and proportions. Potential correlations between predictors were tested with Pearson's correlation coefficient.
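As an illustration of the correlation check, Pearson's coefficient between two numerically coded predictors can be computed as in the sketch below. The integer coding of the categorical predictors is an assumption made for this example; the text does not state how the predictors were coded.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical integer codes for two predictors (toy values, not study data).
age_code = [0, 1, 2, 3, 4]
sex_code = [0, 1, 0, 1, 0]
r_age_sex = pearson_r(age_code, sex_code)
```

A coefficient near zero, as in this toy example, would suggest that the two predictors carry largely independent information.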

Predictive Analysis
We applied machine learning algorithms to predict mortality among CoVID-19 confirmed cases. Machine learning is a branch of artificial intelligence where computer systems can learn from available data and identify patterns with minimal human intervention. [8] Typically, in machine learning several algorithms are tested on data and performance metrics are used to select the best performing algorithm. We tested five supervised machine learning algorithms commonly used in healthcare research (logistic regression, support vector machine, K nearest neighbor, random forest and gradient boosting) to compare their performance. Logistic regression is best suited for a binary or categorical output. It tries to describe the relationship between the output and predictor variables. [9] In the support vector machine (SVM) algorithm, the data is classified into two classes based on the output variable over a hyperplane. [9] The algorithm tries to maximize the distance between the hyperplane and the nearest data points of each class. K Nearest Neighbors (KNN) is a non-parametric approach that decides the output classification by the majority class among its neighbors. [10] The number of neighbors can be altered to arrive at the best fitting KNN model. For our model, we selected 20 nearest neighbors. The random forest algorithm uses a combination of decision trees. [11] Decision trees are generated by recursively partitioning the predictors. New attributes are sequentially fitted to predict the output. We used an ensemble of 501 decision trees with the trees extended up to a maximum depth of 10. The gradient boosting algorithm also uses a combination of decision trees. [12] Each decision tree learns from its predecessor and passes on the improved function to the next tree. Finally, the weighted combination of these trees provides the prediction.
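The idea that each tree in gradient boosting learns from its predecessor can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the implementation used in the study: it assumes squared-error loss, one-split trees (stumps) and a fixed learning rate, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

# A stump is (feature index, threshold, value if <= threshold, value if > threshold).
Stump = Tuple[int, float, float, float]
Model = Tuple[float, float, List[Stump]]  # (base prediction, learning rate, stumps)


def fit_stump(X: Sequence[Sequence[float]], residuals: Sequence[float]) -> Stump:
    """Find the single split that minimizes squared error on the residuals."""
    mean_all = sum(residuals) / len(residuals)
    best: Stump = (0, float("inf"), mean_all, mean_all)  # fallback: no split
    best_err = float("inf")
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if err < best_err:
                best_err, best = err, (j, t, lm, rm)
    return best


def stump_predict(stump: Stump, row: Sequence[float]) -> float:
    j, t, left_value, right_value = stump
    return left_value if row[j] <= t else right_value


def fit_boosting(X, y, n_rounds: int = 20, learning_rate: float = 0.5) -> Model:
    """Each round fits a stump to the residuals left by the previous rounds."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps: List[Stump] = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, row)
                 for p, row in zip(preds, X)]
    return base, learning_rate, stumps


def predict(model: Model, row: Sequence[float]) -> float:
    """Weighted sum of all stumps added to the base prediction."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, row) for s in stumps)


# Toy data: one numeric feature, binary outcome (not study data).
X_toy = [[0.0], [1.0], [2.0], [3.0]]
y_toy = [0, 0, 1, 1]
model = fit_boosting(X_toy, y_toy)
```

Production libraries typically use a classification loss and deeper trees; the sketch keeps only the residual-fitting loop that gives the method its name.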

Evaluation of the performance of the algorithms
We split the data into training (80 percent) and validation (20 percent) cohorts. Initially, the algorithms were trained on the training cohort and then validated on the validation cohort to determine predictions. The data was passed through a 10-fold cross validation where the data was split into training and validation cohorts at an 80/20 ratio randomly ten times. The final prediction came out of the cross-validated estimate. As our data was imbalanced (only 2% of outputs were with the condition against 98% without), we applied an oversampling technique called synthetic minority oversampling technique (SMOTE) to enhance the learning on the test data. [13,14] The performance of the algorithms was evaluated for discrimination, calibration and overall performance. Discrimination is the ability of the algorithm to separate patients with the mortality risk from those without, whereas calibration is the agreement between observed and predicted risk of mortality. An ideal model should have the best of both discrimination and calibration. We tested discrimination with the area under the receiver operating characteristic curve (AUC) and calibration with accuracy and Matthews correlation coefficient. A receiver operating characteristic (ROC) curve plots the true positive rate on the y-axis against the false positive rate on the x-axis. [15] AUC is a score that measures the area under the ROC curve, and it ranges from 0.50 to 1.0, with higher values meaning higher discrimination. Accuracy is a measure of correct classification of death cases as death and survived cases as survived. [15] Matthews correlation coefficient (MCC) is a measure that takes into account all four predictive classes: true positive, true negative, false positive and false negative. [16] It is considered a better measure than accuracy for unbalanced data. The Brier score simultaneously accounts for discrimination and calibration. [15] A smaller Brier score indicates better performance. In addition, the gradient boosting algorithm was used to estimate the relative contributions of the predictors and draw the variable importance plot. [17] The statistical analyses were performed using Stata Version 15 (StataCorp LLC, College Station, TX), Python programming language Version 3.7 (Python Software Foundation, Wilmington, DE, USA) and R programming language Version 3.6.3 (R Foundation for Statistical Computing, Vienna, Austria). The web application was built using the Shiny package for R and deployed with Shiny server.
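The four reported metrics can be written out directly from their definitions. The following Python sketch uses only the standard library and toy values; it is meant to show what each score measures, not to reproduce the study's computations. The 0.5 classification threshold and all variable names are assumptions.

```python
import math
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (true positives, true negatives, false positives, false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """MCC uses all four confusion-matrix cells; defined as 0 when the denominator is 0."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


def roc_auc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Probability that a random positive is scored above a random negative (ties count half)."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy imbalanced example (not study data): 8 survivors, 2 deaths.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.05, 0.10, 0.10, 0.20, 0.30, 0.05, 0.60, 0.15, 0.70, 0.40]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # assumed 0.5 threshold

# A classifier that always predicts survival, for comparison.
always_negative = [0] * len(y_true)
```

On this toy data the always-negative classifier matches the accuracy of the probabilistic one while its MCC is zero, which illustrates why MCC is preferred over accuracy for unbalanced outcomes.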

Patient profile
The profile of the patients is presented in Table 1. Out of 3,022 confirmed patients, slightly more than half were females (56.3%). Among the age groups, the maximum patients were from

Using the gradient boosting algorithm, we estimated the relative importance of the predictors (figure 1). Province was the most important predictor, followed by age, date, exposure and sex.

Table 2 presents the performance metrics of all algorithms: logistic regression, support vector machine, K nearest neighbor, random forest and gradient boosting. The accuracy of all algorithms was very similar, with gradient boosting performing the best (0.987) and KNN scoring the lowest (0.979). Similarly, gradient boosting performed the best on Matthews correlation coefficient (highest score) and Brier score (lowest score). Further, figure 2 shows the area under the receiver operating characteristic curve (AUC) for all algorithms. The AUC ranged from 0.831 to 0.966, with the best score for the gradient boosting algorithm. Considering all the performance metrics, gradient boosting was the best performing algorithm.

Online CoVID-19 mortality risk prediction tool: CoCoMoRP
The best performing model, gradient boosting, was deployed as the online mortality risk prediction tool named "CoVID-19 Community Mortality Risk Prediction", or CoCoMoRP (https://ashis-das.shinyapps.io/CoCoMoRP/). Figure 3 presents the user interface of the prediction tool. The web application is optimized for convenient use on multiple devices such as desktops, tablets, and smartphones. The user selects one option from each of the input feature boxes and clicks the submit button to estimate the CoVID-19 mortality risk probability as a percentage. For instance, the tool gives a CoVID-19 mortality risk prediction of 26.3% for a male patient aged between 80 and 89 years from Busan province with exposure in a nursing home whose diagnosis was confirmed during the week of 17-23 February 2020.

Discussion
The CoVID-19 pandemic is a threat to global health and economic security. Evidence on this new disease is still evolving across various clinical and socio-demographic dimensions. [18,19,20] Simultaneously, health systems across the world lack the resources to deal efficiently with this pandemic. We describe the rapid development and deployment of an open-source artificial intelligence tool to predict mortality risk among CoVID-19 confirmed patients using publicly available surveillance data. This tool can be utilized by potential stakeholders such as health providers and policy makers to triage patients at the community level in addition to other approaches.
One major limitation of this tool is the unavailability of crucial clinical information on symptoms, risk factors and clinical parameters. Recent research has identified certain symptoms, preexisting illnesses and clinical parameters as strong predictors of prognosis and severity of progression for CoVID-19. [20,21,22] These crucial pieces of information are so far not publicly available in the surveillance data, so the tool could not be tested with these features included. Inclusion of these additional features may improve the reliability and relevance of the tool. Therefore, we urge users to balance the predictions from this tool against their own and/or their health provider's clinical expertise and other relevant clinical information.
To the best of our knowledge, our CoVID-19 community mortality risk prediction tool is the first of its kind. Our tool offers an additional approach to informing decision making for CoVID-19 patients. We believe our experience of rapidly developing a mortality risk prediction tool during a crisis using limited data will guide future development of similar approaches using locally available data during epidemics and other disasters.

Authors' contributions
Conceived and designed this study: Ashis Kumar Das, Shiba Mishra, Saji Saraswathy Gopalan. Analyzed and explained the data: Ashis Kumar Das, Shiba Mishra, Saji Saraswathy Gopalan. All authors contributed to the writing and approved the final manuscript.
outside of the authors' organizational affiliations.
medRxiv preprint (not certified by peer review), this version posted May 3, 2020. doi: https://doi.org/10.1101/2020.04.27.20081794