An Effective Machine Learning Approach for Identifying Non-Severe and Severe Coronavirus Disease 2019 Patients in a Rural Chinese Population: The Wenzhou Retrospective Study

This paper has proposed an effective intelligent prediction model that can well discriminate and specify the severity of Coronavirus Disease 2019 (COVID-19) infection in clinical diagnosis and provide a criterion for clinicians to weigh scientific and rational medical decision-making. With indicators as the age and gender of the patients and 26 blood routine indexes, a severity prediction framework for COVID-19 is proposed based on machine learning techniques. The framework consists mainly of a random forest and a support vector machine (SVM) model optimized by a slime mould algorithm (SMA). When the random forest was used to identify the key factors, SMA was employed to train an optimal SVM model. Based on the COVID-19 data, comparative experiments were conducted between RF-SMA-SVM and several well-known machine learning algorithms performed. The results indicate that the proposed RF-SMA-SVM not only achieves better classification performance and higher stability on four metrics, but also screens out the main factors that distinguish severe COVID-19 patients from non-severe ones. Therefore, there is a conclusion that the RF-SMA-SVM model can provide an effective auxiliary diagnosis scheme for the clinical diagnosis of COVID-19 infection.


I. INTRODUCTION
On December 31, 2019, the Health Commission of Hubei Province of China notified the World Health Organization (WHO) of a rapidly evolving outbreak of unexplained The associate editor coordinating the review of this manuscript and approving it for publication was Jenny Mahoney. viral pneumonia, which is now believed to have begun since December 1, 2019 [1]. On February 11, 2020, this viral pneumonia has been named 'coronavirus disease 2019'  by the WHO [2]. At the same time, according to the recommendations from the International Committee on Taxonomy of Viruses (ICTV), the 2019 novel coronavirus was named as 'severe acute respiratory syndrome coronavirus 2' (SARS-CoV-2) [3]. By Mar 23th, 2020, a total of 332,930 laboratory-confirmed cases of SARS-CoV-2 infection have been reported, including 14,510 deaths [4]. Therefore, COVID-19 has been described as a global pandemic [5], [6].
COVID-19 is characterized by a combination of clinical features that include fever, dry cough, sputum, myalgia, tachypnea, fatigue, lymphopenia and abnormal chest computed tomography (CT) imaging [1], [7], [8]. With the improvement of accuracy on the diagnosis of patients with COVID-19, prediction of outcome in critically ill patients with COVID-19 is becoming a more and more important and urgent issue for treatment decisions and resource allocation at high-risk infection regions. A growing number of clinical studies have shown 26.1-32% of patients with COVID-19 develop severe COVID-19, and severe COVID-19 is characterized by rapid progression, severe acute respiratory disease syndrome (ARDS), multiple organs dysfunction, and high mortality [1], [7]. The previous study has described that elderly patients are more susceptible to COVID-19 infection, especially those with diabetes, hypertension, chronic pulmonary diseases, and cerebral infarction [9]. Likewise, a high fatality rate (61.5%) was observed among 52 critically ill adult patients with COVID-19 in Wuhan, China [10]. Due to COVID-19 rapid progression, a demand for intensive care unit beds is rising in a short period of time when the resource is constrained. Hence, compared to rescue and treatment for critically ill patients, early identification and intervention play a more indispensable role in preventing COVID-19 from the mild to severe. If effective novel prognostic model systems are established in time, would help to surveil the status of patients with COVID-19, with the potential for mortality reduction.
So far, it is recognized that COVID-19 is a disease involving multiple situations. Its clinical manifestation is complicated and changeable, and the critically ill cases rapidly develop, especially in patients with acute respiratory disease syndrome [11]. However, the survival rate of critically ill cases remains low, with the rising number of critically ill patients, and the focus on controlling of the infection source and cutting off the route of transmission in health and epidemic prevention work has shifted to timely detection and effective assessment of critically ill patients, and early intervention before fatal events appear. Currently, the diagnosis of critically ill and fatal cases is usually based on a large number of clinical laboratory parameters and clinical characteristics through judgment calls. Therefore, misdiagnosis and missed diagnosis easily exist because of physicians' fatigue, pressure, lack of experience, inconsistent judgments and lack of prior knowledge. Continuously, COVID-19 outbreaks are operating, worldwide, and meanwhile the rates of misdiagnosis and missed diagnosis are becoming the world's most closely watched numbers.
In the present study, the data was retrospectively collected from patients with COVID-19 pneumonia who were admitted to the Affiliated Yueqing Hospital of Wenzhou Medical University from Jan 21 to Mar 10, 2020, in a single-center retrospective study. Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was detected by real-time reverse transcriptase-polymerase chain reaction (RT-PCR). RT-PCR assay kit was provided by Huada Genomics Co., Ltd. Throat swabs from all 51 patients tested positive for SARS-CoV-2 by RT-PCR. According to the ''the New Coronavirus Pneumonia Prevention and Control'' promulgated by the National Health Commission of the People's Republic of China, in 2020, all patients fulfilled diagnostic criteria of COVID-19 pneumonia. To simplify the analysis, the 51 COVID-19 patients were divided the 51 COVID-19 patients into two groups: severe COVID-19 (n = 21) and non-severe COVID-19 (n = 30).
Severe COVID-19 diagnosis should meet at least 1 of the following criteria: (1) Respiratory distress and the respiratory rate over 30 breaths/min; (2) Resting oxygen saturation lower than 93%; (3) Oxygenation index (OI) below 300 mmHg. The research plan was permitted by the Ethics Committees of the Affiliated Yueqing Hospital of Wenzhou Medical University (Yueqing, China) (Ethical approval code: 202000002). This study was in compliance with the Helsinki declaration. Blood samples were collected from all COVID-19 patients and blood routine examination was conducted, using the whole blood by a Mindray BC-5300 automatic blood cell analyzer (Mindray, Shenzhen, China). In this study, the basic information of the COVID-19 patients and 23 blood routine parameters (features) are listed in Table 1.
In recent years, the development and application of machine learning in term of infectious disease have been beneficial for the prevention and control of infectious diseases. Machine learning methods play more and more significant roles in in a wide fields of medical science [12], [13], which is widely performed in disease diagnosis, development of prediction models and identification of significant risk factors [14]. Undoubtedly, machine learning methods have some advantages including improving health professionals' diagnostic capacities, reducing much of the workload of radiologists and anatomic/clinical pathologists, enhancing the diagnostic accuracy [15]. Therefore, machine learning has emerged as an indispensable tool for physicians to lead a better understanding for patient personalized treatments. In the meantime, it is hoped that machine learning methods based on the abundant clinical data will support researches into COVID-19 epidemiology, which can cope with the current disease outbreaks [16]. Therefore, machine learning based on clinical characteristics is a promising method for the prediction of survival rate and classification of disease severity. To this end, a new method with the random forest (RF) and slime mould algorithm (SMA) optimized support vector machines (SVM) model involved was established, to analyze data of 51 patients confirmed with COVID-19 pneumonia who were admitted to the Affiliated Yueqing Hospital of Wenzhou Medical University (Yueqing, China).
Since then, several machine learning methods have been taken to tackle the related problems of COVID-19. For example, Wang et al. [17] firstly used a Respiratory Simulation Model (RSM) to fill the gap between large amounts of training data and scarce real-world data. A GRU neural network with bidirectional and attentional mechanisms (BI-AT-GRU) was then operated to classify 6 key clinically respiratory patterns for screening large-scale individuals infected with COVID-19. Hu et al. [18] developed an improved stacked autoencoder for modeling epidemic transmission dynamics and applying this model to predicting COVID-19 confirmed cases in real-time across China. Butt et al. developed a three-dimensional deep learning model by using lung CT image sets to segment candidate infection areas, and then utilizing a location attention classification model to classify these separated images into COVID-19, influenza A viral pneumonia, and healthy cases. Wu et al. [19] performed a calibration of logistic growth model, and generalized logistic growth model, growth model, and Richards model on the number of reported infection cases in the severely affected areas. Different models provide predictions with different upper and lower limits. Shan et al. [20] developed a deep learning-based segmentation system to automatically quantify possible infected areas and their volume ratios in the lungs. At the same time, a human-in-the-loop (HITL) strategy was used to assist the radiologist in segmenting the infected area, thereby shortening the total segmentation time to 4 minutes. Yang et al. [21] integrated population migration data and the latest COVID-19 epidemiological data into the Susceptible-Exposed-Infectious-Removed (SEIR) model to obtain the epidemic curve, and also used artificial intelligence (AI) derived from 2003 SARS data ways to predict the next outbreak. Vivanco-Lira [22] tried to summarize the biological characteristics of COVID-19 virus and the possible pathophysiology of its disease, as well as a random model that characterizes the probability distribution of cases in Mexico and the number of Mexican cases estimated by differential equation models. Gozes et al. [23] proposed a method that leverages powerful 2D and 3D deep learning models to modify and adjust existing AI models, and combined them with clinically understood systems, and used 3D volumetric examinations to assess each patient's disease progression over time.
In this study, it aims to propose efficient models by using the SMA, which is employed to train an effective SVM model. Then, it is the first time that the optimized SMA-SVM is employed to diagnose the severity of COVID-19. The active model is built by using the info about patients' basic information and 26 blood routine indicators. In the  b) The proposed RF-SMA-SVM was the first time to apply in the diagnosis of the severity of COVID-19.
(c) It is the first time that the age, Neutrophil-to-Lymphocyte ratio (N/L) and Platelet-to-lymphocyte ratio (P/L) combined together for predicting the prognosis of patients with COVID-19.
d) The established RF-SMA-SVM model outperforms other competitors.
The structure of the paper is as bellow. Section 2 describes the data and proposed the RF-SMA-SVM model in detail. The experimental setup and results are presented in Section 3. Section 4 presents the discussions. The conclusions and future works are given in Section 5.

A. DATA COLLECTION
The boxplot of total 28 indexes are shown in Figure 1. Statistical analyses were performed by using SPSS version 21.0 (IBM, Somers, NY, US). Continuous variables (age and blood VOLUME 9, 2021 routine parameters) were analyzed by an independent sample t-test. Results were considered statistically significant with p < 0.05. Detailed results of the statistical analysis are shown in Table 2.
The main flow of the proposed method is illustrated in Figure 5. It consists of four steps: data acquisition with standardization, feature selection, model training, and classification. The first step is to normalize the acquired data to the range of [−1, 1], and then use the random forest algorithm to select the feature set with the highest score from the data set and SMA for training the parameters (C, γ ) in the SVM. The third step uses the optimized parameters and the feature set selected by the random forest to train the SVM again. The last step is to use the trained SVM for classification of new data. When evaluating the trained model, K-fold cross-validation (CV) is used to obtain less biased experimental results. In this experiment, K is set to 10.

1) FEATURE SELECTION VIA RANDOM FOREST
Random forest (RF) [24], is a combined classification model consisting of multiple decision tree models. It employs bootstrap resampling to take multiple samples from the original sample and models a decision tree for each bootstrap sample. Then, these decision trees are combined together, and the final classification or prediction result is obtained by voting. In addition, a large number of theoretical and empirical studies have shown that RF not only has high prediction accuracy and excellent tolerance to outliers and noise, but also has the ability to calculate the importance of a single feature and avoid easily overfit.
The steps of calculating the importance of a feature by using RF is shown in Figure 2.

2) PARAMETER OPTIMIZATION BY SMA
Swarm intelligent algorithms can always generate better solutions than the traditional gradient descent-based methods [25]- [27], SMA as a new member of the swarm intelligent algorithms, it has shown great effectiveness in many fields. SMA mainly simulates the behaviour mode of Physarum polycephalum in slime mould [28]. SMA mainly uses the weighting factor to simulate the changes of the vein structure and contraction pattern of slime mould during foraging. When the concentration of nearby food is large, the diameter of slime bacteria will increase accordingly. When the concentration of food decreases, the diameter of slime bacteria will also shrink, forming a tendency to approach food. The steps can be described by the following steps: Step 1: Initialization parameters and individual positions. Such as population number N , dimension D, parameter z and randomly distribute the search agents in the solution space.
where X i represents the i − th search agent, rand represents a random number with a value between 0 and 1, and LB and UB represent the lower and upper boundaries of the search space, respectively.
Step 2: Calculate fitness value and weight. First calculate the fitness value of each search agent, sort all fitness values F to get S and I , and get the best fitness values bF and wF from them. Then calculate the weight W .
Step 3: Update the global optimal fitness value DF and the best search agent position BP (parameter to be optimized). If the fitness value obtained during the current iteration process is better than the global optimal fitness value, the global optimal fitness value is updated to the currently obtained fitness value and the best search agent position is updated to the currently obtained position.
Step 4: Update search agent location (parameter value to be optimized). First update the following parameters: where T is the current number of iterations and Max t is the maximum number of iterations. Then update the search agent location according to the following formula: where A and B are two individuals randomly selected from the group, and the formula for p, − → vb and − → vc are as follows: (2.9) Figure 3 shows the flowchart of SMA.

3) CLASSIFICATION BASED ON SVM
SVM is a popular machine learning method that has shown effectiveness in theory and applications [29]- [36]. SVM uses the principle of non-linear mapping: : R n → H , maps nonlinear eigendata to high dimensional space, thus using linear methods to solve nonlinear problems. For a load sample data set D = {x i , y i } , i ∈ 1, 2, . . . , n, x i ∈ R n , y i ∈ R where y i is the load value of the output and x i is the eigenvector corresponding to the output value, SVM maps the eigenvector x i to high dimensional space by function ϕ(x). Establish the regression function: where ω is the weight, b is the intercept, x = (x 1 , x 2 , . . . , x n ) is the eigenvector, y = (y 1 , y 2 , . . . , y n ) is the output load value.
To find a suitable set of ω and b, the search process was transformed into an optimization problem according to the principle of structural risk minimization. The objective function and constraints are as follows: where C is the penalty factor, ξ i and ξ * i are relaxation coefficients, allowing a certain fitting error, ε is the maximum error, and is a linear insensitive loss function parameter.
For the sake of enhancing the efficiency of optimization, Lagrange multiplier method is applied to transform it into dual problem: where a i and a * i are Lagrange multipliers, K x i , x j is a kernel function. Gaussian radial basis kernel function was employed: where γ is a kernel parameter. Eventually, the nonlinear function was obtained.
In SVM, the penalty factor C and the kernel parameter γ have a crucial impact on the learning process, and the prediction effects obtained by different parameters are quite diverse. C indicates the generalization ability and error size of SVM. Larger C will make the model error larger and the precision is not enough. Smaller C will lead to weaker generalization ability. γ affects the fit of the model. In order to find a better set of C, γ and improve the classification accuracy of SVM, this paper combined SMA and SVM to produce a new classification model SMA-SVM and applied the model to the prediction problem in reality.

4) PROPOSED SMA-SVM
To explore the potential of SVM, its parameters are tuned via the SMA. Here, we show the flowchart of parameter optimization of the proposed SMA-SVM model.
In Figure 4, adjust the position of the searched individual and calculate the fitness of all search agents using SVM classifier with the parameter pair according to the following formula:

III. EXPERIMENTAL SETUP AND RESULTS
The In this experiment, the RF method was first time to use in assessment of the significant features, and the specific results are shown in Figure 6. As can be seen from the figures, the individual features exhibit different levels of importance, with the top eight most vital features being Age, LY%, LY, N/L, NEU%, P/L, RDW, EOS%, respectively. Sorting this feature sequence in descending order of importance yields with multiple feature combinations, was employed to be evaluated one by one. Considering the number of features, here we select a set of eight incremental features consisting of the first eight features for the experiment. The results of the are shown in Table 3. It can be seen from the table that the best classification accuracy is achieved when the features consisting of the first 6 features.
The detailed results of RF-SMA-SVM model are shown in Table 4. In order to ensure the accuracy of the results, we repeated 10 times, taking the average of the 10 results as the final result. The results showed that the classification accuracy (ACC) of RF-SMA-SVM was 90.10%, the Matthews correlation coefficient (MCC) was 82.18%, the sensitivity was 88.17%, the specificity was 91.41%, and the variances were 0.1245, 0.2247, 0.2210 and 0.1788, respectively. In addition, it can be observed that the SMA can automatically obtain the optimal parameters of SVM on the optimal feature space obtained from RF, which demonstrates the strong search ability of SMA. the strong search ability of SMA.
To verify the validity of this approach, a comparative study presents with five other valid machine learning models, including RF-HHO-SVM, RF-MFO-SVM, RF-PSO-SVM,    The effect of original SMA in test function is far better than PSO and MFO [28], which indicates that SMA has strong exploration and development ability, and these abilities can also play an important role in medical data analysis. Figure 7 and figure 8 prove that RF-SMA-SVM is superior to RF-HHO-SVM, RF-MFO-SVM and RF-PSO-SVM. Although RF-SMA-SVM is slightly inferior to RF-HHO-SVM in sensitivity, it does not affect the superiority of RF-SMA-SVM. Furthermore, 1 is the upper limit to measure these indicators. If you want to be close to 1, the performance requirements of the algorithm are multiplied. Therefore, RF-SMA-SVM seems to have little difference from  other comparison algorithms, but in fact there is a certain gap.
To describe the convergence of the proposed RF-SMA-SVM method, we also document the trend of the accuracy of the four SVM models with iteration. In Figure 8, it is found that after several iterations, SMA-SVM has the worst effect, and the accuracy of SMA-SVM does not improve significantly with iterations, so it is easy to fall into the local optimum, while RF-SMA-SVM model can quickly and continuously jump out of the local optimum to reach the optimal accuracy, which shows that RF-SMA-SVM method has strong local search ability and global search ability. The accuracy of RF-MFO-SVM and RF-PSO-SVM is slightly lower than that of RF-SMA-SVM, and the accuracy increases less significantly with iteration, which makes it easier to fall into local optimization.   Table 5.
The calculation time of RF-PSO-SVM is the shortest, so the relative calculation time of RF-PSO-SVM is recorded as 1. The relative calculation time of other algorithms is calculated according to the calculation time of RF-PSO-SVM. In Table 5, although RF-SMA-SVM has the longest computation time, its accuracy is better than other algorithms.

IV. DISCUSSIONS
In the present study, COVID-19 patient datasets were screened by using the random-forest based feature selection method. Importantly, several key features were screened, including age, NEU%, LY%, LY, N/L and P/L. Subsequently, a SMA-SVM model is constructed to accurately assess the severity of the COVID-19 and early monitor COVID-19 progression. Therefore, we believe that RF-SMA-SVM model may help inform clinical decision-making. Some studies demonstrate that advanced age is a strong predictor for prognosis of Severe Acute Respiratory Syndrome (SARS) and Middle East Respiratory Syndrome (MERS). For instance, Choi and colleagues used multivariable analysis to identify age older as independent predictors of mortality [37]- [39]. In line with these findings from clinical studies, studies on macaques model revealed that older macaques inoculated with SARS coronavirus (SARS-CoV) exhibited stronger host antiviral response compared to younger macaques, which is tightly related to inflammation-related genes and type I interferon [40]. Similarly to these studies, Zhou et al. found that non-survivor with COVID-19 had a significantly older median age, than did the survivor with COVID-19 [41]. We also found in our study that patients with severe COVID-19 had significantly higher (1.45-fold) age than those with a non-severe COVID-19 (P = 0.00), suggesting that disease severity may be predicted by age in COVID-19 patients. In addition, numerous clinical studies have demonstrated that advanced age is positively associated with the severity of infection and sepsis [42], [43]. The probable reason for this was that lymphocyte function decrease with increasing age may increase the risk of infection and overwhelming secretion of type 2 cytokines may produce an overstimulation of viral replication, eventually leading to body function impairment [44]- [46]. All in all, age is an important predictor of rehabilitation in patients with COVID-19.
White blood cells (WBCs) are blood cells that mediate the immune system and resist exotic microorganisms. There are five types of WBCs, including neutrophils, lymphocytes, monocytes, eosinophils, and basophils [47]. Neutrophilic granulocyte is the chief phagocytic white blood cell of the blood, comprising 50-70% of the total leukocytes in the blood [48]. Neutrophilic granulocytes play a momentous role in the unspecific immune response and VOLUME 9, 2021 inflammation [49], [50]. Neutrophilic granulocytes are the first line to defense microbial pathogens, especially in pyogenic bacteria. Lymphocytes originate from the spleen, tonsils and lymph nodes are the smallest of the white blood cells, comprising 20%−40% of the total leukocytes in blood. Lymphocytes are a necessary component of the immune response of human body, which is involved in the synthesis and distribution of antibodies in the blood [51]. The B-lymphocytes play a core role in humoral immunity or antibody immunity, and the T-lymphocytes are responsible for cellular immunity. A number of clinical observations have suggested that a reduced number of lymphocytes and an increased number of neutrophils commonly occur in COVID-19 patients [1], [52], [53]. As an example, a population-based study of more than 100 patients suggested that severe COVID-19 patients had higher neutrophil count and lower lymphocyte count during the severe phase [54]. Liu and colleagues also found that lymphopenia and neutrophil count increased in patients with COVID-19 upon hospital admission may indicate more severe acute lung injury. Therefore, lymphopenia, neutrophil count increases may be a strong predictor of disease severity [55]. There is reason to believe that severe COVID-19 patients have immune disorders and excessive inflammation. Thus, patients with COVID-19 were rapidly screened with excessive inflammation using routine laboratory parameters in order to enhance survival rates. Hence, in recent years, N/L has been a simple parameter to reflect easily patient's inflammatory [56]. Meanwhile, N/L was widely used as a simple and inexpensive marker for monitoring the infectious diseases, including patients with acute-on-chronic hepatitis B, pulmonary tuberculosis and bacterial community-acquired 45498 VOLUME 9, 2021 pneumonia [57], [58]. Furthermore, N/L also has been considered to be a strong independent prognostic indicator for a variety of diseases including malignant tumors, cardiovascular diseases, sepsis, chronic obstructive pulmonary disease, and infectious disease [59]- [62]. Elevated N/L was associated with poor clinical prognosis [63]- [65]. Elevated N/L has been used as a marker in the determination of increased inflammation in acute exacerbations of chronic obstructive pulmonary disease, which is similar to C-reactive protein [66]. N/L seemed to be a safe, simple and promising marker in predicting bacteremia and in grading community-acquired pneumonia, and N/L was more accurate and objective than C-reactive protein [67]. A large population-based retrospective cohort study from London reported that more than 3,500 people living with human immunodeficiency virus cases proved elevated N/L was an independent predictor of the risk of cardiovascular disease rather than a risk traditional risk factor for cardiovascular disease [68]. In line with these findings, there are several studies describing the relationship between N/L and COVID-19 patients. Lagunas-Rangel et al. revealed that N/L was significantly higher in the severe COVID-19 patients compared with the non-severe group, and the elevation of N/L is closely associated with poor prognosis in COVID-19 patients [69]. Liu and colleagues also demonstrated that the incidence of severe patients aged 50 years or older and N/L ≥ 3.13 was up to 50%. Similar to the findings of their study, we also found that N/L was significantly higher in severe COVID-19 patients displaying about 2.9 times higher levels than in non-severe COVID-19 patients. Taken together, N/L was shown to be of significant predictive value for the discrimination of COVID-19 patients' status and might predict COVID-19 progression.
Platelets are viewed as important immune cells in the human body. Platelets have important physiological functions in the human body. Platelets have multiple physiological functions in the human body, including hemostasis, coagulation, the maintenance of vascular integrity, angiogenesis, innate immunity, inflammatory responses, viral infection, and malignant tumor. The number of platelets and platelet activity is strongly associated with multiple diseases [70]. Platelets are the smallest cell in peripheral blood cells, generating from megakaryocytes in the bone marrow [71] In recent years, research has shown that platelets and lymphocytes play a key role in the inflammatory response process. P/L as a novel type of inflammation factor has attracted more and more attention from researchers. P/L could serve as an indicator of inflammation that may reflect the level of systemic inflammation in patients [72]. P/L serves a role in a wide spectrum of diseases, including acute ischemic stroke, urothelial carcinoma, acute pulmonary embolism and Pulmonary large cell neuroendocrine carcinoma [73]- [76]. Thus, P/L might be a reliable prognostic marker in the progression and prognosis of many diseases. Wang et al carried out a retrospective study that included 134 adenosquamous cell lung cancer and found that high P/L was closely related to shorter disease-free survival and lower overall survival rates [77]. Wang and colleagues revealed that elevated P/L was an independent predictor for poor prognosis of patients with hepatocellular carcinoma [78]. Wang et al. also reported that of the 695 lung cancer cases, lower P/L was significantly associated with a lower tumor-node-metastasis (TNM) stage and a low incidence rate of surgery [79]. Based on the results of previous research, there is a reason to believe the relationship between the P/L and COVID-19. Qu and colleagues revealed that high P/L was positively correlated with COVID-19 severity and hospital lengths of stay. They believed that P/L may provide a useful indicator for monitoring the patients with COVID-19. In this study, we found that P/L in severe COVID-19 patients was about 2.0 times higher compared to non-severe COVID-19 patients. Thus, P/L has the potential to be used as a predictive marker for COVID-19 disease severity. Until now, there are very few reports describing the cost-effective markers N/L and P/L to joint predict the severity and prognosis of the infectious disease. To the best of our knowledge, this is the first time that the combined of age, N/L and P/L for predicting the prognosis of patients with COVID-19 using a machine learning method.
However, there are some limitations in this study. First of all, our data set was based on a single-center and the sample size was not large enough. In order to improve diagnostic accuracy, we will enlarge the sample size in future research. Second, multi-center researches will be necessary to carry out external validation for the RF-SMA-SVM model. Third, we plan to include more blood indexes in our future research.

V. CONCLUSION AND FUTURE WORKS
In this study, a validated RF-SMA-SVM model was developed to differentiate the severity of COVID-19 based on factors such as basic patient information and 26 blood routine indicators. The main innovations of this paper are: on the one hand, it is proposed for the first time that the severity of COVID-19 can be distinguished by using the combination of age, N/L and P/L indicators, and on the other hand, it is the first time that the SMA is used to train an optimal SVM classifier. Based on the experimental results, the proposed method exhibits higher prediction accuracy and more stable performance on the COVID-19 severity prediction problem than the SVM method based on other optimization algorithms, while screening out the key factors rich in discriminatory power.
In future work, we will attempt to apply the proposed RF-SMA-SVM model to the COVID-19 pre-diagnosis problem and then attempt to apply it to other disease prediction problems.