A descriptive study of random forest algorithm for predicting COVID-19 patients outcome

Background The outbreak of coronavirus disease 2019 (COVID-19) that occurred in Wuhan, China, has become a global public health threat. It is necessary to identify indicators that can be used as optimal predictors for clinical outcomes of COVID-19 patients.
Methods Clinical information from 126 patients diagnosed with COVID-19 was collected from Wuhan Fourth Hospital. Specific clinical characteristics, laboratory findings, treatments and clinical outcomes were analyzed for patients who were hospitalized for treatment from 1 February to 15 March 2020 and subsequently died or were discharged. A random forest (RF) algorithm was used to predict the prognoses of COVID-19 patients and identify the optimal diagnostic predictors for patients’ clinical prognoses.
Results Seven of the 126 patients were excluded because their clinical endpoints were unknown; of the remaining 119 patients, 103 were discharged (alive) and 16 died in the hospital. A synthetic minority over-sampling technique (SMOTE) was used to correct the imbalanced distribution of patient outcomes. Recursive feature elimination (RFE) was used to select the optimal subset for analysis. Eleven clinical parameters, Myo, CD8, age, LDH, LMR, CD45, Th/Ts, dyspnea, NLR, D-Dimer and CK, were chosen with AUC approximately 0.9905. The RF algorithm was built to predict the prognoses of COVID-19 patients based on the best subset, and the area under the ROC curve (AUC) of the test data was 100%. Moreover, two optimal clinical risk predictors, lactate dehydrogenase (LDH) and myoglobin (Myo), were selected based on the Gini index. The univariable logistic analysis revealed a substantial increase in the risk for in-hospital mortality when Myo was higher than 80 ng/ml (OR = 7.54, 95% CI [3.42–16.63]) and LDH was higher than 500 U/L (OR = 4.90, 95% CI [2.13–11.25]).
Conclusion We applied an RF algorithm to predict the mortality of COVID-19 patients with high accuracy and identified LDH higher than 500 U/L and Myo higher than 80 ng/ml to be potential risk factors for the prognoses of COVID-19 patients in the early stage of the disease.


INTRODUCTION
In December 2019, an outbreak of acute respiratory syndrome coronavirus (CoV) pneumonia occurred in Wuhan, Hubei Province, China (Phelan, Katz & Gostin, 2020), and attracted intense attention worldwide. The World Health Organization (WHO) named the disease coronavirus disease 2019 (COVID-19) after the virus was identified from a patient's pharyngeal swab sample (World Health Organization, 2020). SARS-CoV-2 is a species of CoV, a family comprising the largest enveloped, single-stranded, positive-sense RNA viruses (Su et al., 2016). The scientific community and infection control agencies face enormous challenges in controlling the increasing intensity of the COVID-19 pandemic. Despite control efforts, the disease has spread rapidly around the world. By 26 June 2020, COVID-19 had affected 213 countries, with over 9,621,470 confirmed cases and 487,295 deaths worldwide (COVID-19 CoV-Update https://virusncov.com/, accessed 26 June 2020). The least absolute shrinkage and selection operator (LASSO) regression has been used to identify the important factors of severity transition in COVID-19 patients. Critically ill patients exhibited respiratory failure, acute respiratory distress syndrome, heart failure, and septic shock, which increased the mortality of COVID-19 patients (Zhou et al., 2020).
Previous studies showed that patients who were elderly and had diabetes, cardiovascular disease, chronic respiratory diseases, or cancer presented an increased risk for COVID-19-related mortality worldwide (Huang et al., 2020b; Ji et al., 2020; Wang et al., 2020; Wu & McGoogan, 2020; Zhou et al., 2020). However, few models have been used to predict the mortality of COVID-19. Therefore, an effective and robust model is urgently required to predict the mortality of COVID-19 based on routine laboratory assessments and demographic information from COVID-19 patients. Timely detection of high-risk patients is of great significance and may contribute to optimizing the use of limited resources and delivering proper care.
Machine learning is widely used in medical diagnosis, and feature selection is an integral part of accurate data processing (Guyon & Elisseeff, 2003). Random forest (RF) is a type of machine learning that can analyze complex interactions between clinical characteristics and provide high classification accuracy using a set of decision trees (Touw et al., 2013). Therefore, we used an RF risk prediction model based on the outcomes of COVID-19 patients to predict the likelihood of recovery or continued deterioration and to assess their prognoses. Corresponding disease control strategies need to be emphasized to protect these patients against SARS-CoV-2.

MATERIALS AND METHODS
Study design and participants
This was a retrospective cohort analysis that included 126 patients, aged 27-87 years, from Wuhan Fourth Hospital. These patients were diagnosed with COVID-19 based on the World Health Organization's interim guidelines. Of the 126 patients, seven were excluded due to a lack of known clinical endpoints. The remaining 119 patients in this study were hospitalized for treatment from 1 February to 15 March 2020. This study was approved by the Ethics Committee of Wuhan Fourth Hospital (KY 2020-032-01). The Hospital Ethics Committee waived the requirement for informed consent from the study participants because of the high transmissibility of the disease.
Among the patients, the criteria used for discharge were as follows. The highest temperature returned to normal for more than three days. Chest CT imaging revealed significant inflammation absorption, and the respiratory symptoms had substantially improved. Two consecutive nucleic acid tests from throat swabs were negative, and the time interval between testing was at least one day. Finally, after evaluation and a unanimous decision by the expert team, a patient was discharged.

Data collection
Clinical information for all patients was obtained from electronic medical records in Wuhan Fourth Hospital by three independent researchers. Patient information, including exposure history, demographics, medical history, laboratory findings, co-morbidities and clinical outcomes, was collected and analyzed. If data were missing from the medical records, we obtained data from attending doctors or directly from the patients. Access was granted by the director of the hospital.

Statistical analysis
Descriptive data were summarized using medians and quartiles, and the χ² test or Fisher's exact test was used to analyze categorical data. The Kolmogorov-Smirnov test was used to analyze the normality of the data from discharged patients (n = 103). The Shapiro-Wilk test was employed to analyze the normality of the data from patients who died (n = 16). Subsequently, normally distributed laboratory results were compared using independent-samples t-tests. The nonparametric Mann-Whitney-Wilcoxon test was used for data that did not exhibit a normal distribution. Univariable logistic analysis was used to estimate the mortality risk associated with two variables (Myo and LDH) in surviving and non-surviving patients. All data were analyzed using IBM SPSS, Version 26.0.
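For readers unfamiliar with Fisher's exact test, the following is a minimal Python sketch of the two-sided test for a 2x2 table. It is an illustration only, not the SPSS procedure used in this study; the function name and the example counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum the probabilities of every table that is no more likely than
    # the observed one (small tolerance guards against rounding error).
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Hypothetical counts, e.g. symptom present/absent by outcome group.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```

The exact test is usually preferred over the χ² test when expected cell counts are small, as can happen with a deceased group of only 16 patients.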

Variable selection and model construction
The flowchart for the research design is shown in Fig. S1. Because of the imbalanced distribution of discharged versus deceased COVID-19 patients (103:16), a SMOTE procedure was employed to adjust the data to a final ratio of 1:1 (103:103). Spearman correlation was used to calculate the correlations among the essential variables, which were chosen based on statistical analysis. RFE with 10-fold cross-validation was used to select the discriminative subset of COVID-19 patient clinical characteristics and to avoid redundant information. An RF classification model was then used to predict the mortality of COVID-19 patients with 5-fold cross-validation; the mean accuracy was highest, at 0.846, for mtry = 1. A bagging algorithm was used to randomly sample the clinical characteristics a total of 500 times, the Gini index was the split criterion, and the nodesize of the RF classification model was 1. The clinical data were divided into a training set and a test set with a ratio of approximately 4:1 (166:40). All data were processed using RStudio (R 3.6.3), and the entire workflow was run with the caret package (http://CRAN.R-project.org/package=caret) to keep the model construction and validation consistent.
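The idea behind SMOTE can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions (numeric features only, Euclidean distance, hypothetical function and parameter names), not the R implementation used in this study. Raising the 16 deceased patients to 103 would require 87 synthetic rows.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class rows.

    minority: list of numeric feature vectors (at least two rows).
    k: number of nearest minority neighbours considered for each base row.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority rows by squared Euclidean distance to base.
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: sum((u - v) ** 2 for u, v in zip(base, row)))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([u + gap * (v - u) for u, v in zip(base, neighbour)])
    return synthetic


# Toy minority class with two features (hypothetical values).
toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_rows = smote(toy_minority, n_new=87, k=3)
```

In practice features are usually rescaled before distances are computed, and categorical variables such as dyspnea need special handling; both are omitted here.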

Correlation analysis and assessment of accuracy
A partial dependence correlation analysis was employed to provide a graphical depiction of the marginal effect of a variable on the COVID-19 patients' outcome during the calculation process (Greenwell, 2017). The function being plotted was defined as:
$$\tilde{f}(x) = \frac{1}{n} \sum_{i=1}^{n} f(x, x_{iC})$$
where $x$ is the variable corresponding to the chosen clinical characteristic, $n$ is the number of observations, and $x_{iC}$ represents the other variables in the clinical information. The summand was the predicted logits (log of a fraction of votes) for classification:
$$f(x) = \log p_k(x) - \frac{1}{K} \sum_{j=1}^{K} \log p_j(x)$$
where $K$ is the number of classes, $k$ is the class of interest, and $p_j$ is the proportion of votes for class $j$.
The final diagnostic capability of the RF classification algorithm on the test group was assessed using AUC, which was also applied to choose the optimal mtry and the best subset of clinical characteristics for the RF model.
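The partial dependence calculation described above can be sketched in Python as follows. The `predict` callback stands in for a fitted model and is hypothetical, as are the function names; the centred logit requires every vote fraction to be strictly positive.

```python
from math import log


def centred_logit(vote_fractions, k):
    """Log vote fraction of class k minus the mean log vote fraction over classes."""
    logs = [log(p) for p in vote_fractions]  # every fraction must be > 0
    return logs[k] - sum(logs) / len(logs)


def partial_dependence(predict, rows, feature, grid):
    """Average predict() over all rows while forcing one feature to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[feature] = value  # fix the chosen variable, keep the others
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


# Toy model and data (hypothetical): the prediction rises with feature 0.
toy_rows = [[0.0, 1.0], [0.0, 3.0]]
toy_curve = partial_dependence(lambda r: 2 * r[0] + r[1], toy_rows, 0, [0.0, 1.0])
```

Plotting the returned curve against the grid gives the partial dependence plot for the chosen variable.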
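As a reminder of what AUC measures, the sketch below computes it as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name and the example scores are hypothetical.

```python
def auc(scores, labels):
    """AUC via pairwise comparison of positive (1) and negative (0) scores."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities of death and true outcomes.
perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
imperfect = auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

An AUC of 1.0 means every deceased patient was scored above every discharged patient; with a small test set this can occur by chance more easily than with a large one.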

RESULTS
Clinical demographics and outcomes of COVID-19 patients
We described a cohort of 126 patients hospitalized at the Wuhan Fourth Hospital between 1 February and 15 March 2020, of whom approximately half were classified as severely ill or critically ill. The patient clinical demographics and outcomes are shown in Table 1. We found that 48 patients (38.1%) were older than 65, and the median patient age was 60 years (IQR 53-69.5), which showed the prevalence of COVID-19 in older adults. The incidence of COVID-19 infection was gender-neutral in that the proportions of male and female patients were nearly identical. COVID-19 patients typically exhibited fever (92.0%), and 39 (34.8%) patients had peak temperatures above 39 °C. The most frequently observed symptoms of COVID-19 patients on admission were cough (75.4%), followed by fatigue (58.7%) and dyspnea (55.6%). In addition, many patients suffered from co-morbidities, with hypertension (34.9%) being the most common, followed by diabetes (16.7%) and cardiovascular and macrovascular disease (11.9%). During treatment, 83 patients (65.9%) used nasal cannulas for supplemental oxygen, indicating that a nasal cannula was useful for COVID-19 patients. Two additional respiratory support strategies were noninvasive mechanical ventilation (NMV) (27.8%) and invasive mechanical ventilation (IMV) (4.0%). The criteria used to determine illness severity were based on the Novel CoV Pneumonia Prevention and Control Program

Laboratory findings of COVID-19 patients at admission
The initial laboratory findings included complete blood count, serum biochemical tests, coagulation profiles, and myocardial enzymes. All patients were assessed to determine whether they deviated significantly (p < 0.05) from a normal range to evaluate the status of important organ functions. As seen in Table 2, more than 80% of the patients exhibited lymphopenia, especially for CD4+ and CD8+ T lymphocytes (91.3%), which was consistent with a previous report that SARS-CoV-2 infection damages the immune system (Huang et al., 2020a). Approximately half of the patients exhibited decreased Th/Ts ratios, which differed from Middle East respiratory syndrome (Park et al., 2017). C-reactive protein (CRP) was elevated in 85.6% of the patients, and procalcitonin (PCT) was slightly increased. COVID-19 infection also impaired coagulation functions in some patients. In this study, the prothrombin time was prolonged in approximately half of the patients, fibrinogen (FIB) was increased in two-thirds of the patients, and D-dimer was increased in 76.2% of the patients. The severely infected COVID-19 patients displayed a trend towards reduced platelet counts, a higher D-dimer level, and a higher rate of DIC occurrence.
The myocardial enzymes showed that myocardial cell injury occurred in some patients, as 60% exhibited elevated B-type natriuretic peptide (BNP) and patients with elevated LDH accounted for 76.2% of the total.

Comparison of clinical characteristics between discharged and deceased patients
A comparison of clinical characteristics between discharged (alive) and deceased patients revealed the significant features that most likely caused deterioration in the deceased patients. As seen in Table 3, the patients in the deceased group were older than those in the discharged group (p < 0.001), and the majority were males (75%). The proportion of patients who experienced dyspnea from the onset of illness was higher in the deceased group (p = 0.018). The deceased patients were more susceptible to respiratory failure, and the arterial blood gas parameters, PCO2, PO2, SO2, and oxygenation at admission, were significantly reduced in the deceased group. The laboratory analysis in this study revealed that more patients in the deceased group exhibited elevated neutrophils and lymphopenia, which indicated a "divergence" between these two variables. Therefore, the neutrophil to lymphocyte ratio (NLR) was used to indicate the severity of illness in the patients. The NLR was remarkably elevated in the deceased group, while the lymphocyte to monocyte ratio (LMR) was decreased in the deceased group. Immune system damage was a risk factor for unfavorable outcomes of the disease. T lymphocytes significantly decreased in the deceased group, especially CD4+ and CD8+, as did the Th/Ts ratio. Compared to discharged patients, the deceased patients experienced myocardial cell injury more frequently, as parameters reflecting heart function, including Myo, CK and LDH, were significantly increased in the deceased group. Moreover, the inflammation-related indices, CRP (p = 0.080) and PCT (p = 0.009) were significantly higher in the deceased group.

Correlation between clinical characteristics
Because imbalanced data distribution affects the prediction accuracy of the RF model, and the ratio of discharged versus deceased patients was 103:16, a SMOTE algorithm was used to balance the data to select a more representative and informative subset of parameters for COVID-19 patients. SMOTE adjusted the ratio between these two groups to 1:1 (103:103). Moreover, redundant information is also likely to decrease the prediction performance of the RF classification model, so the correlations between variables should be taken into account in the process of feature selection (Paul et al., 2017). A Spearman correlation coefficient test was used to analyze the correlation between clinical characteristics of the COVID-19 patients (Spearman, 2010). A heatmap showing the correlations between variables in the form of a matrix is presented in Fig. 1. Each element in the matrix was the correlation coefficient between two variables, and its value in the range [−1, 1] was used to evaluate the degree of correlation between them. When the correlation coefficient was greater than 0.8 and the p-value was smaller than 0.05, the correlation was determined to be strong (Paul et al., 2017), indicating that the factors were redundant variables. The analysis revealed that CD45 and CD4 had a high correlation of 0.84 (p < 0.01), and NLR and neutrophils also had a correlation of 0.84 (p < 0.01). Given the redundancy of the clinical features, further processing was required to select the best subset for the RF model.
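The redundancy screen described above can be illustrated with a short Python sketch. It computes Spearman's coefficient as the Pearson correlation of average ranks and flags pairs whose coefficient exceeds 0.8; the p-value condition is omitted for brevity, and all names and example values are hypothetical.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for m in range(i, j + 1):
            result[order[m]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    return pearson(ranks(x), ranks(y))


def redundant_pairs(columns, threshold=0.8):
    """Return variable pairs whose Spearman coefficient exceeds the threshold."""
    names = list(columns)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if spearman(columns[a], columns[b]) > threshold
    ]


# Hypothetical toy columns: "b" is a monotone function of "a".
toy_columns = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, 3, 2, 0]}
flagged = redundant_pairs(toy_columns)
```

Flagged pairs are candidates for removal, but the choice of which member to drop is left to a later step such as RFE.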

Variable selection and RF classification model construction
RFE with 10-fold cross-validation was used to select the optimal subset of clinical features. RFE can eliminate redundant and irrelevant information from the COVID-19 patient data and enhance the performance of the RF classification model (Darst, Malecki & Engelman, 2018). The procedure selected 11 clinical characteristics, Myo, CD8, age, LDH, LMR, CD45, Th/Ts, dyspnea, NLR, D-Dimer and CK, with the highest accuracy at 0.9905 (Fig. 2A), which revealed the optimal complexity of the feature subset. Next, the RF classification model was used to predict the prognoses of COVID-19 patients based on the best subset. Five-fold cross-validation was used to identify the optimal mtry for the RF classification model, and classification accuracy was highest at mtry = 1 (Fig. 2B), with a corresponding mean AUC of 0.846. Moreover, the out-of-bag (OOB) error, the proportion of misclassified samples, was used to represent the generalization ability of the RF. In Fig. 2C, the OOB error gradually decreased and stabilized as the forest size increased, and it finally fell below 0.05 when the tree number reached 500. Meanwhile, death and survival errors gradually reduced to the same level as the OOB error. The final diagnostic capability of the RF classification model was assessed using the test group's accuracy, which was 100% (Fig. 2D); the threshold for the test data ROC was 0.385.
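The elimination loop at the heart of RFE can be sketched in Python as follows. This is a simplified illustration, not the caret procedure used here: `rank_importance` and `score` are hypothetical callbacks standing in for model-based importance and cross-validated performance, and one feature is removed per round.

```python
def recursive_feature_elimination(features, rank_importance, score):
    """Drop the least important feature each round; return the best-scoring subset.

    rank_importance(subset) -> dict mapping each feature to its importance.
    score(subset) -> float, for example a cross-validated accuracy.
    """
    current = list(features)
    best_subset, best_score = list(current), score(current)
    while len(current) > 1:
        importance = rank_importance(current)
        weakest = min(current, key=lambda f: importance[f])
        current.remove(weakest)
        candidate_score = score(current)
        if candidate_score >= best_score:  # ties favour the smaller subset
            best_subset, best_score = list(current), candidate_score
    return best_subset, best_score


# Toy callbacks (hypothetical): only "a" and "b" are informative,
# and every extra feature carries a small penalty.
toy_weights = {"a": 4, "b": 3, "c": 2, "d": 1}


def toy_importance(subset):
    return {f: toy_weights[f] for f in subset}


def toy_score(subset):
    return len({"a", "b"} & set(subset)) - 0.1 * len(subset)


chosen, chosen_score = recursive_feature_elimination(
    ["a", "b", "c", "d"], toy_importance, toy_score
)
```

In a real analysis both callbacks would refit the model inside each resampling fold, so that feature ranking never sees the held-out data.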

Identification of the important predictors for clinical outcomes
The result of the RF classification model was obtained by combining the predictions of 500 decision trees, with the Gini index as the split criterion. RF-Gini has been reported to be among the best-performing methods for feature ranking, especially for the top five features (Menze et al., 2009). The larger the Gini importance of a variable, the more information that variable contributes. As shown in Fig. 3A, the variables that were ranked as important included Myo, age, LDH, CD8, CK, LMR, CD45, NLR, Th/Ts, D-dimer and dyspnea. The top five variables were Myo, age, LDH, CD8 and CK; among them, we chose Myo and LDH as two laboratory parameters to assess risk and indicate the prognoses for COVID-19 patients. The accuracy of the variables screened by the RF model is shown in Fig. 3B; Myo ranked the highest, followed by age and NLR.
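To make the Gini split criterion concrete, the sketch below computes the Gini impurity of a node and the impurity decrease produced by a candidate split. Gini importance for a variable is typically accumulated from such decreases over all splits that use the variable; the function names and toy labels are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure or empty node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def gini_decrease(labels, goes_left):
    """Impurity of the parent minus the size-weighted impurity of the two children."""
    left = [y for y, m in zip(labels, goes_left) if m]
    right = [y for y, m in zip(labels, goes_left) if not m]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


# Toy outcome labels (hypothetical): 0 = discharged, 1 = deceased.
toy_labels = [0, 0, 1, 1]
perfect_split = gini_decrease(toy_labels, [True, True, False, False])
useless_split = gini_decrease(toy_labels, [True, False, True, False])
```

A split that separates the classes completely removes all impurity, while a split that leaves both children as mixed as the parent contributes nothing.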

Relationship between clinical characteristics and survival in COVID-19 patients
To further analyze the role of LDH and Myo in affecting the survival of COVID-19 patients, we compared the mortality of patients who exhibited different levels of LDH and Myo. Using univariate logistic analysis, a substantial increase in the risk of in-hospital mortality with increased levels of Myo and LDH was observed (Fig. 4A). Patients with increased Myo (≥80 ng/ml) exhibited a 7.54-fold (95% CI [3.42-16.63]) increase in mortality compared to patients with low Myo (<80 ng/ml). Similarly, patients with increased LDH (≥500 U/L) exhibited a 4.90-fold (95% CI [2.13-11.25]) increase in mortality compared to patients with low LDH (<500 U/L). The levels of LDH and Myo were compared in discharged and deceased groups (Fig. 4B). The median and IQR for these two variables in the deceased group were higher than in the discharged group (p < 0.001). The partial dependence plot showed the marginal effect of Myo and LDH on survival in the RF classification, averaged over the other variables. As Fig. 4C shows, there was a significant negative correlation between survival and the levels of LDH or Myo. Specifically, increased levels of LDH and Myo were precursors to a poor prognosis for COVID-19 patients. To test the ability of LDH and Myo to predict the outcome of COVID-19 patients, ROC curves were constructed (Fig. 4D); the LDH ROC curve threshold was 327.5 U/L. These clinical features all had high accuracy for prognoses prediction of COVID-19 patients, but their accuracy was lower than that produced by the RF classification model.
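For a single dichotomized predictor such as Myo ≥80 ng/ml, the odds ratio from a univariable logistic model equals the cross-product ratio of the 2x2 table, and a Wald 95% confidence interval follows from the standard error of the log odds ratio. The Python sketch below illustrates this; the counts are hypothetical and are not the study data.

```python
from math import exp, log, sqrt


def odds_ratio_ci(exposed_died, exposed_survived,
                  unexposed_died, unexposed_survived, z=1.96):
    """Odds ratio and Wald confidence interval from a 2x2 table (all cells > 0)."""
    a, b, c, d = exposed_died, exposed_survived, unexposed_died, unexposed_survived
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: high-marker group (10 died, 5 survived)
# versus low-marker group (5 died, 10 survived).
toy_or, toy_lower, toy_upper = odds_ratio_ci(10, 5, 5, 10)
```

Because the interval is symmetric on the log scale, the product of its limits equals the squared odds ratio, which is a quick consistency check on reported values.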

DISCUSSION
The virus that causes COVID-19, which emerged in Wuhan, China, is highly contagious, and a large number of exposed people have become critically ill. This study provided a comprehensive description of the demographics, comorbidities, and laboratory findings of COVID-19 patients. Among the laboratory results, it was observed that lymphocytes, including CD4+ and CD8+, were decreased in 91.3% of patients, which confirmed that SARS-CoV-2 infection injured the human immune system. Thus, subsequent immune responses to this virus may exacerbate the disease response (Huang et al., 2020a). Approximately 40% of the COVID-19 cases were severe, and the disease resulted in a 13.4% mortality rate. The mortality rate observed in this study was higher than the average rate observed in Wuhan, which was 4% by 24 March 2020. This difference was likely due to the fact that only severe patients could be transferred to the hospitals designated to treat COVID-19 patients. During the patients' hospitalization for treatment, we found that early COVID-19 symptoms are insidious, but the disease progression is fast. Therefore, early prediction of COVID-19 patients' outcomes and adoption of appropriate treatment are urgently required. In this study, clinical features that differed significantly (p < 0.05) between discharged and deceased patients were identified using statistical analysis. These clinical features were used to build the RF classification model, together with SMOTE and the feature reduction technique RFE, to predict the mortality of COVID-19 patients. The AUC of the RF model reached 100%, demonstrating its robust prediction ability. Moreover, Myo and LDH were identified as two optimal predictors of COVID-19 patients' outcomes with the Gini index.
Our study suggested that the deceased patients were susceptible to multiple organ failure, especially heart and respiratory failure. There are several potential reasons for myocardial cell injury in COVID-19 patients, including systemic inflammatory responses, ACE2-targeted SARS-CoV-2 attacks on myocardial and lung cells, adverse effects of some anti-virus drugs (Clerkin et al., 2020), and some underlying myocardial-damaging co-morbidities, such as diabetes and hypertension. Previous studies have reported that heart injury was common in patients with pneumonia (Marrie & Shariatzadeh, 2007). It was reported that elevated concentrations of Myo in venous blood could predict the severity of COVID-19 (McRae et al., 2020). Myo is a significant myocardial marker that is used in the clinical detection of patients with severe pneumonia. In this study, we found that 75% of the non-survivors, whose Myo concentration was higher than 80 ng/mL, exhibited hypertension, which might have accelerated myocardial injury in patients with COVID-19. The partial dependence analysis also indicated that as the Myo concentrations increased, survival decreased. Moreover, based on univariate logistic analysis, a Myo level above 80 ng/mL correlated with a high mortality rate (61.5%) from COVID-19, and the risk of mortality was increased 7.54-fold (95% CI [3.42-16.63]), suggesting that increased concentrations of Myo potentially led to poor outcomes.
Lactate dehydrogenase is another indicator that reflects the degree of tissue damage caused by the virus and disease severity, including damage to myocardial (Mamas, Fraser & Neyses, 2008; Warren-Gash, Smeeth & Hayward, 2009), muscle, and lung cells. On the one hand, COVID-19 patients have severely reduced lung ventilation, which leads to hypoxia and carbon dioxide retention (Yang et al., 2020b), which damages tissues (Yang et al., 2020a). On the other hand, microcirculation disorders caused by the infection and insufficient tissue perfusion also result in tissue damage. Both processes lead to LDH accumulation in circulating blood. In our current research, the LDH values of 96 (76.2%) patients were higher than the normal reference range, and the average level of LDH for non-survivors was higher than that for survivors. LDH concentrations higher than 500 U/L were associated with high mortality risk (OR = 4.90, 95% CI [2.13-11.25]). Moreover, in severely affected patients, abnormal elevation of LDH often indicates rapid disease progression and acute respiratory failure. Therefore, an increase in LDH was a significant risk factor for COVID-19 patient mortality.
Given the above research, our findings suggest that strategies to protect organ function should be emphasized to improve the patients' survival. Besides ensuring protection against infection, doctors should evaluate the cardiac condition and the degree of complications of patients before and during treatment to avoid adverse side-effects from drugs used for COVID-19 therapies. Moreover, some ACE inhibitors (ACEi) or angiotensin receptor blockers (ARB), which are used to treat comorbidities in COVID-19 patients, might increase the level of ACE2 in myocardial cells, theoretically leading to elevated risk for cardiac injury. Therefore, improving the method for patients' treatment should be considered in COVID-19 therapy.
This study has several limitations. First, due to the inclusion and exclusion of a large number of patients, it was inevitable that some important variables were omitted, such as smoking, a history of allergies, and others. Second, we only studied a few patients who exhibited relatively severe illness due to limited medical resources during the epidemic. Third, only patients with clear endpoints were included in the research; some patients who were still in hospital (alive) for treatment were not incorporated, which may result in statistical biases. Last but not least, about 5% of the information was missing from the COVID-19 patient records; the missing count variables were supplemented by the median, and the missing categorical variables were supplemented by the mode, which may have biased the collected clinical characteristics.
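The median and mode imputation mentioned among the limitations can be illustrated with a minimal Python sketch. Missing values are represented by `None`; the function name and example values are hypothetical.

```python
from statistics import median, mode


def impute(column, categorical=False):
    """Fill None entries with the mode (categorical) or the median (numeric)."""
    observed = [v for v in column if v is not None]
    fill = mode(observed) if categorical else median(observed)
    return [fill if v is None else v for v in column]


# Hypothetical columns with one missing entry each.
numeric_filled = impute([1.0, None, 3.0, 10.0])
categorical_filled = impute(["yes", None, "no", "yes"], categorical=True)
```

Single-value imputation of this kind shrinks the variance of the filled variable, which is one way it can bias downstream estimates.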

CONCLUSION
In this study, we described the clinical characteristics of COVID-19 patients during hospitalization and innovatively used an RF classification model with these clinical characteristics to predict COVID-19 prognoses. Moreover, we found that LDH concentrations higher than 500 U/L and Myo concentrations higher than 80 ng/ml could be identified as two potential risk factors for mortality of COVID-19 patients. Finally, appropriate treatment should be considered for patients with cardiac and tissue injury.