An Overview of Supervised Machine Learning Methods and Data Analysis for COVID-19 Detection

Methods. Our analysis and machine learning algorithm are based on the two most cited clinical datasets in the literature: one from San Raffaele Hospital in Milan, Italy, and the other from Hospital Israelita Albert Einstein in São Paulo, Brazil. The datasets were processed to select the features that most influence the target, and almost all of them turned out to be blood parameters. Exploratory data analysis (EDA) methods were applied to the datasets, and a comparative study of supervised machine learning models was carried out, after which the support vector machine (SVM) was selected as the one with the best performance. Results. Being the best performer, SVM is used as our proposed supervised machine learning algorithm. An accuracy of 99.29%, a sensitivity of 92.79%, and a specificity of 100% were obtained with the dataset from Kaggle (https://www.kaggle.com/einsteindata4u/covid19) after applying optimization to SVM. The same procedure was applied to the dataset from San Raffaele Hospital (https://zenodo.org/record/3886927#.YIluB5AzbMV). Once more, SVM presented the best performance among the machine learning algorithms considered, with 92.86%, 93.55%, and 90.91% for accuracy, sensitivity, and specificity, respectively. Conclusion. The results obtained are superior to others from the literature based on these same datasets, leading us to conclude that our proposed solution is reliable for COVID-19 diagnosis.


Introduction
The novel coronavirus known as SARS-CoV-2 (Severe Acute Respiratory Syndrome Coronavirus 2), responsible for the COVID-19 pandemic, belongs to the large family of coronaviruses. The infection causes fever, cough, dyspnea, and muscle pain, while imaging frequently reveals bilateral pneumonia [1][2][3]. Although the WHO validated an anti-COVID-19 vaccine [4], vaccination alone cannot reduce the spread of the virus. The standard diagnostic method is real-time reverse transcription-polymerase chain reaction (RT-PCR), which detects viral nucleic acid in samples obtained from oropharyngeal swabs, nasopharyngeal swabs, bronchoalveolar washes, or tracheal aspirates [5][6][7]. Due to the constraints imposed by RT-PCR, several health centers are opting for immunological or antibody tests as an alternative [8]. However, these tests do not detect the presence of the virus, but rather the presence of IgM (Immunoglobulin M) and IgG (Immunoglobulin G) antibodies produced to fight the virus. It is almost impossible to detect these antibodies before fourteen days after infection, which can lead to false-negative results [9,10]. Faced with these limitations, health specialists have called on scientists to develop faster, more efficient, more accessible, and more pleasant technological solutions.
Much research is focusing on artificial intelligence (AI) technologies, machine learning (ML), and deep learning (DL) to deal with COVID-19 [11][12][13][14]. For example, ML algorithms have been used to detect COVID-19 from CT scan images of the lung [15]. In [16,17], the authors have shown that chest CTs are highly sensitive for the diagnosis of COVID-19. However, due to the radiation dose, the relatively small number of available devices, and the associated operating costs, CT scan imaging can hardly be used for screening tasks. Furthermore, this method shows obvious abnormalities when the lungs are inflamed or have tissue lesions [18]. A similar article on chest X-rays, a less expensive and lower-dose examination, was recently published with encouraging statistical performance [19]. However, it has been found that almost 60% of the chest X-rays taken from patients diagnosed with symptomatic COVID-19 are normal, and a system based on this examination needs to be thoroughly verified in a real environment [20,21]. Despite being encouraging, these results still warrant scrutiny. Most of the other works have not yet been peer-reviewed: a recent important survey reported that all surveyed studies may have a high risk of bias and overfitting and largely fail to comply with reporting and reproducibility standards [22,23]. Because of the aforementioned limitations of CT scan, RT-PCR, and immunological or antibody test methods, there is an urgent need for a more efficient and faster method for the detection of COVID-19. In this paper, we propose an alternative testing method based on data analysis (DA) and ML algorithms that is rapid, accessible, simple to use, low cost, and accurate. Our solution is designed to quickly and reliably predict whether or not an individual is infected by SARS-CoV-2 based on clinical data from individuals who have undergone PCR tests.
To perform this work, the datasets are first transformed into a suitable format by using DA methods, and the features best correlated with the target are then retained using ML. Secondly, a suitable model on which the data will be trained is determined, and finally, the model is optimized so as to achieve the best performance. The rest of our work is organized as follows. Section 2 presents the state of the art of the related works. Section 3 deals with the DA and ML methods used to carry out our work. Section 4 presents the obtained results, together with discussions and comparisons with related works. This work ends in Section 5 with the conclusion and suggested future work.

Related Works
Several works based on AI, along with ML and DL, have been carried out over the last two years in the context of diagnosis and detection of COVID-19 infections. In this section, we present some related works, including the models and methods that the authors used and their results, and we show the difference between these works and our proposed work.
Brinati et al. [23] proposed a feasibility study on the detection of COVID-19 infection from blood exams with ML algorithms. The authors developed two ML classifiers based on hematochemical values (routine blood exams) from two hundred and seventy-nine (279) records from [24]. The proposed ML classifiers discriminated between patients who are either negative or positive to SARS-CoV-2: their accuracy ranged between 82% and 86% and their sensitivity between 92% and 95% relative to the gold standard. In 2020, Soares et al. [25] proposed a novel artificial intelligence-based method to identify COVID-19 cases using simple blood exams. They developed a machine learning classifier that takes widely available simple blood exams as input and classifies samples as likely to be positive (having SARS-CoV-2) or negative (not having SARS-CoV-2). Based on this initial classification, positive cases can be referred for further highly sensitive testing (e.g., CT scan or specific antibodies). They used publicly available data from the Albert Einstein Hospital in Brazil covering 5,644 patients. Focusing on simple blood exam figures as main predictors, they selected the 599 subjects that had the fewest missing values for 16 common exams. Of these 599 patients, 81 tested positive for SARS-CoV-2 (determined by RT-PCR). Based on the reduced dataset, they built an artificial intelligence classification framework, ER-CoV, aimed at determining whether suspect patients arriving in the ER were likely to be negative for SARS-CoV-2. The primary goal of that investigation was to develop a classifier with high specificity and high negative predictive value, with reasonable sensitivity.
Banerjee et al. [26] proposed the use of artificial intelligence (AI) along with ML to predict COVID-19 from blood samples. They collected SARS-CoV-2 RT-PCR samples with anonymized full blood count results from Hospital Israelita Albert Einstein in São Paulo, Brazil. They found that, with full blood counts, random forest, shallow learning, and artificial neural network models predict SARS-CoV-2 status with high accuracy both for patients on regular wards (AUC = 94-95%) and for those not admitted to the hospital or in the community (AUC = 80-86%) [26]. In 2020, Moraes Batista et al. [27] investigated ML to diagnose and predict COVID-19 for emergency patients. The authors based their investigation on the same dataset as [26] and trained five ML algorithms (neural networks, gradient boosting trees, random forests, support vector machines, and logistic regression). Their best predictive model was obtained with the SVM algorithm (AUC: 0.85; sensitivity: 0.68; specificity: 0.85; Brier score: 0.16), which is not very reliable. Freitas Barbosa et al. [28] also relied on blood tests to develop an intelligent system for COVID-19 diagnosis. They tested several ML methods and achieved high classification performance: an overall accuracy of 95.159% ± 0.693, a sensitivity of 0.968 ± 0.007, a kappa index of 0.903 ± 0.014, a specificity of 0.936 ± 0.011, and a precision of 0.938 ± 0.010. Their best results were achieved using Bayes Network and low-computational-cost classifiers.
Soltan et al. [29] applied extreme gradient boosted trees, random forests, and multivariate logistic regression to distinguish emergency department presentations and admissions due to COVID-19 from prepandemic controls. They investigated the stepwise addition of clinical feature sets and assessed performance using stratified 10-fold cross-validation. Models were calibrated during training to achieve sensitivities of 70, 80, and 90% for identifying patients with COVID-19. They generated test sets with varying prevalence rates of COVID-19 and assessed predictive values to simulate real-world performance at different stages of the epidemic. Kukar et al. [30] proposed an ML-based COVID-19 diagnosis from routine blood tests. They constructed an ML predictive model that was built and cross-validated on the routine blood tests of 5,333 patients with various bacterial and viral infections. They selected an operational ROC point at a specificity of 97.9% and a sensitivity of 81.9%, and the AUC was 0.97. According to the feature importance scoring of the XGBoost algorithm, the authors presented the five most useful routine blood parameters for COVID-19: prothrombin, albumin, eosinophil count, INR, and MCHC.
In 2021, AlJame et al. [31] used routine blood tests and proposed an ensemble learning model for COVID-19 diagnosis. For data preparation, they used a K-Nearest Neighbors algorithm to deal with null values in the dataset and an isolation forest method to remove outliers. The proposed model was trained and evaluated by using publicly available data from [32]. The ensemble model achieved outstanding performance with an overall accuracy of 99.88%. Alves et al. [33] also proposed an ML model to diagnose COVID-19 from blood tests. The authors tested different ML models on the public dataset from [32] as well. After data wrangling, this dataset had 608 patients, of which 84 were positive for COVID-19 as confirmed by RT-PCR. With random forest (RF) as their best ML algorithm, they achieved a good result (accuracy 0.88, F1-score 0.76, sensitivity 0.66, specificity 0.91, and AUROC 0.86).
Li et al. [34] also investigated COVID-19 detection by using ML algorithms. They found several novel associations between clinical variables, including the association between men and higher levels of serum lymphocytes and neutrophils. They found that COVID-19 patients can be divided into subtypes based on the serum levels of immune cells, gender, and reported symptoms. Finally, they trained an XGBoost model that can distinguish COVID-19 patients from influenza patients with a sensitivity of 92.5% and a specificity of 97.9%. Many other works have applied ML to blood samples in order to detect COVID-19 [35][36][37][38][39][40][41][42][43]. Others [44][45][46][47] explain how ML and DA can be applied to blood samples. Table 1 summarizes the performance and description of related works. It can be observed in this table that the datasets from [24,32] are widely used in the literature; that is why we used these datasets in our study and why, at the end, we compare our results with other results from the literature obtained on the same datasets.
Despite the encouraging results observed in Table 1, there are some concerns about the reliability, efficiency, and accuracy of these results. We also notice that the ML models differ from one author to another, and that no single model performs well on every dataset. Moreover, none of the authors in the literature has used DA and ML along with SVM to reach a very good performance in terms of rapidity, accuracy, specificity, and sensitivity. In this paper, therefore, we propose a method based on DA and ML techniques to analyze and select the best features for our ML algorithm. We then optimize the SVM algorithm to obtain a performance superior to that of the algorithms found in the literature using the same datasets.

Proposed Approach
In this section, we give a detailed presentation of the different steps and methods used to carry out our work. We first present our proposed pipeline. Afterward, we present the methods used for data analysis and exploration, data preprocessing, and data modeling. Finally, the optimization of the chosen model is presented. Figure 1 presents our proposed pipeline, which contains the steps involved in the realization of our solution.

Data Description.
Our analysis is based on the dataset from [32]. This dataset contains the data of 5644 patients who underwent a PCR test. These data are the parameter values obtained after analysis of the patients' blood and tests for the presence of already known viruses. In total, we have 111 features, and the target is represented by the variable SARS-CoV-2 exam result, which contains the results of the COVID-19 test carried out on the different patients.

Deep Analysis of the Data Set.
We divided the features into two different categories: blood (the features obtained from a blood test) and viral (the features obtained from a virological test). To visualize our data set before performing the analysis, we plotted some graphs. Figure 2 shows the distribution of four features in our dataset, Figure 3 represents the relationship between the target and four viral features, and Figure 4 shows the relationship between the blood features and the target.
(1) Distribution of Continuous Variables. Blood variables: the majority of float variables follow a standard normal distribution (zero mean and unit variance). It is possible that they were standardized beforehand in order to facilitate predictions.
(2) Features-Target Relations. Viral-target relation: Looking at these figures, there are very few cases of double infection (people infected with both the SARS-CoV-2 virus and other viruses). On the other hand, the number of double negative cases is high (cases where patients are infected neither with SARS-CoV-2 nor with other types of viruses). This suggests that if a patient is not infected with any of these other viruses, then it is highly likely that the patient is not infected with the SARS-CoV-2 virus.

(3) Blood-Target Relation.
From the previous figures, we can distinguish the distributions of the positive and negative cases for each feature. The represented features have a great impact on the target. This suggests that blood features have a great influence on the prediction of SARS-CoV-2 infection [35][36][37].

Data Preprocessing.
The preprocessing starts by cleaning the dataset to select the best features. Figure 5 shows the pipeline of the preprocessing step.
(i) Cleaning: It consists of deleting variables that have at least 90% of missing values. The resulting data set has dimensions (5644, 32) and contains 10% positive cases and 90% negative cases.
(ii) Encoding: Here, the aim is to associate each qualitative value with a numerical value.
(v) Feature selection: It consists of determining, using statistical methods, the ten feature variables that have the greatest impact on the target (SARS-CoV-2 exam result). We use the ANOVA (Analysis of Variance) statistical test to score the relationship between each feature and the target [38][39][40].
In the ANOVA F score, $n_{\text{groups}}$ is the number of groups; in our case it is 2, because we calculate the ANOVA F score between each feature and the target, and therefore between two elements. $n_{\text{obs}}$ is the number of observations in each feature; in our case, it is identical to the number of observations in the target. $\bar{x}$ and $\bar{y}$ are the averages of the observations in any feature $x$ and in the target $y$, respectively; $\mu$ is the average of the set made up of the observations of $x$ and $y$; and $x_i$, $y_i$ are the observations of any feature $x$ and of the target $y$. Figure 6 shows the importance of each feature according to the ANOVA test.
We selected the ten highest-scoring features to train and evaluate the models.
The data set treatment phase is now complete; what remains is to submit the data to the different machine learning models to obtain the predictions.
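The cleaning, encoding, and feature selection steps described above can be sketched as follows. This is a minimal illustration and not the authors' code: the toy columns, the label strings `"positive"`/`"negative"`, and the helper names `drop_sparse_columns` and `select_top_features` are hypothetical. It assumes scikit-learn's `f_classif`, which computes the one-way ANOVA F statistic of each feature across the classes of the target; this may differ in detail from the computation used in the paper. Imputation and standardization are omitted.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif


def drop_sparse_columns(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Cleaning: drop columns whose fraction of missing values is >= threshold."""
    missing_fraction = df.isna().mean()
    return df.loc[:, missing_fraction < threshold]


def select_top_features(X: pd.DataFrame, y: pd.Series, k: int = 10) -> list:
    """Feature selection: names of the k features with the highest ANOVA F score."""
    selector = SelectKBest(score_func=f_classif, k=min(k, X.shape[1]))
    selector.fit(X, y)
    return list(X.columns[selector.get_support()])


# Hypothetical toy data standing in for the real dataset.
rng = np.random.default_rng(0)
n = 200
target_raw = pd.Series(np.where(rng.integers(0, 2, size=n) == 1, "positive", "negative"))

# Encoding: map each qualitative value to a numerical value.
target = target_raw.map({"negative": 0, "positive": 1})

toy = pd.DataFrame({
    "informative": target + rng.normal(0, 0.3, size=n),             # tracks the target
    "noise": rng.normal(0, 1, size=n),                               # unrelated to the target
    "mostly_missing": np.r_[np.full(190, np.nan), np.ones(10)],      # 95% missing
})

cleaned = drop_sparse_columns(toy, threshold=0.9)    # removes "mostly_missing"
selected = select_top_features(cleaned, target, k=1)  # keeps the best-scoring feature
```

On the real data, `k=10` would correspond to the ten retained features; the remaining missing values would have to be imputed before calling `f_classif`, which does not accept NaN.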

Data Modeling.
Data modeling can be seen as the process of creating an ML model for our dataset. Here, modeling starts with the choice of the training algorithm, followed by the metric evaluation. Based on the metric evaluation, we can choose the best algorithm for optimization. Figure 7 shows the pipeline of the modeling step.

Models.
We chose five high-performance classification models for small data sets (fewer than 100,000 rows): the KNeighbors classifier, bagging classifier, boosting classifier, SVM, and random forest classifier.

Training.
80% of the data set is used as the training set.

Evaluation.
The remaining 20% constitutes the test set, used for evaluation. The evaluation criteria are accuracy, precision, and recall.
In the accuracy expression, $n_{\text{samples}}$ is the number of samples, $y^{\text{pred}}_i$ is the predicted value of the $i$-th sample, and $y_i$ is the corresponding true value.
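For reference, the three criteria can be written in their usual textbook form. The accuracy expression uses the symbols defined above; the counts $TP$, $FP$, and $FN$ (true positives, false positives, false negatives) are notation introduced here for illustration and may differ from the notation of the original equations.

```latex
\begin{align}
\text{accuracy}\left(y, y^{\text{pred}}\right) &= \frac{1}{n_{\text{samples}}} \sum_{i=1}^{n_{\text{samples}}} \mathbf{1}\!\left(y^{\text{pred}}_i = y_i\right), \\
\text{precision} &= \frac{TP}{TP + FP}, \\
\text{recall (sensitivity)} &= \frac{TP}{TP + FN}.
\end{align}
```

Here $\mathbf{1}(\cdot)$ is the indicator function, equal to 1 when the prediction matches the true value and 0 otherwise.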
At the end of these 3 stages, the best model is selected, i.e., the one with the best performance.
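The three stages above (model choice, training on 80% of the data, evaluation on the remaining 20%) can be sketched with scikit-learn as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the data are synthetic (`make_classification` with roughly 10% positives), the hyperparameters are library defaults, AdaBoost stands in for the boosting classifier, and the stratified split and the feature scaling for KNeighbors and SVM are choices made here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed data: 10 features, about 10% positives.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, weights=[0.9, 0.1], random_state=0
)

# 80% training set, 20% test set (stratification is an assumption of this sketch).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The five candidate classifiers, with default hyperparameters.
models = {
    "KNeighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Bagging": BaggingClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "RandomForest": RandomForestClassifier(random_state=0),
}


def evaluate(model):
    """Fit on the training set and report the three criteria on the test set."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
    }


scores = {name: evaluate(model) for name, model in models.items()}

# Select the model with the highest test accuracy (one possible selection rule).
best_name = max(scores, key=lambda name: scores[name]["accuracy"])
```

Ranking by accuracy alone is only one option; with about 10% positives, recall is often the more informative criterion.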

Optimization of the Best Model.
Optimization aims at improving the performance of the best model using the GridSearchCV technique. Figure 8 shows the pipeline of optimization.
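A cross-validated grid search over SVM hyperparameters might look like the sketch below. The grid values, the 5-fold setting, the accuracy scoring, and the synthetic data are assumptions chosen for illustration; the paper does not state which ranges were searched.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed data.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling inside the pipeline keeps each cross-validation fold free of leakage.
pipeline = Pipeline([("scale", StandardScaler()), ("svm", SVC(kernel="rbf"))])

# Hypothetical search ranges for the regularization strength C and the RBF width gamma.
param_grid = {
    "svm__C": [0.1, 1, 10, 100],
    "svm__gamma": ["scale", 0.01, 0.1, 1],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_params = search.best_params_               # best combination found on the grid
test_accuracy = search.score(X_test, y_test)    # accuracy of the refitted best model
```

`GridSearchCV` refits the best combination on the whole training set by default, so `search` can be used directly as the optimized model.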

Journal of Healthcare Engineering
After giving a range of values to the hyperparameters of our best model, we train it with the GridSearchCV method. GridSearchCV is a technique that searches, within a given range of hyperparameter values of a model, for the combination of values that yields the best performance. The search is carried out with cross-validation [41,42]. After training, the hyperparameters have their optimum values. We then have an optimized best model, and we apply the evaluation criteria to obtain its performance.
According to [48,49], given a training dataset $S = \{(x_1, y_1), \ldots, (x_p, y_p)\}$ of data points $x_j \in X \subseteq \mathbb{R}^n$ with matching labels $y_j \in Y = \{-1, +1\}$, the task of classification is to learn a mapping from $X$ to $Y$ that assigns to each point $x$ its label $y$.
A good classifier/model should guarantee the best possible generalization performance (minimum error on unseen examples) [48][49][50]. In SVM, the hyperplane found in the feature space corresponds to a nonlinear decision boundary in the input space.
Let us consider in this case a mapping $\phi: I \subseteq \mathbb{R}^n \to F \subseteq \mathbb{R}^n$ from the input space $I$ to the feature space $F$. In the learning step, the algorithm finds the hyperplane defined by the equation $\langle w, \phi(x) \rangle = b$ such that the margin

$$\gamma = \min_{1 \le j \le p} y_j \left( \langle w, \phi(x_j) \rangle - b \right) = \min_{1 \le j \le p} y_j h(x_j) \qquad (3)$$

is maximized, where $\langle \cdot, \cdot \rangle$ denotes the inner product, $w$ is a p-dimensional vector of weights, and $b$ is a threshold. The quantity $(\langle w, \phi(x_j) \rangle - b)/\|w\|$ represents the distance of the sample $x_j$ from the hyperplane. When multiplied by the label $y_j$, it gives a positive value for a correct classification and a negative value for an incorrect one. A new data point $x$ is assigned a label by evaluating the decision function given in equation (4). In this paper, we work on the blood sample dataset and investigate how to build on it a model able to detect whether someone has COVID-19 or not. For that, we need to maximize the margin.
For linearly separable classes, there exists a separating hyperplane $(w, b)$, given by equation (5). By taking $\|w\|^2 = 1$, choosing a hyperplane that maximizes the margin is equivalent to the optimization problem (6). Problem (6) can be rewritten by using the Lagrange multipliers $\alpha_j$, $j = 1, \ldots, p$, in the dual form (7). Problem (7) is a quadratic optimization task. Moreover, its solution $\alpha^*$ satisfies the Karush-Kuhn-Tucker (KKT) conditions, which ensure that only a subset of training examples is associated with nonzero $\alpha_j$, $j = 1, \ldots, p$. This property, called the sparseness of SVM, is crucial in our blood sample classification for COVID-19 detection.
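As a reference point, the standard textbook hard-margin formulation can be written with the symbols used above. This is a sketch and not a reproduction of equations (4)-(7): it assumes the common canonical normalization in which the smallest functional margin is fixed to 1 and $\|w\|$ is minimized, which is equivalent to fixing the norm of $w$ and maximizing the margin $\gamma$.

```latex
\begin{align}
f(x) &= \operatorname{sign}\bigl(\langle w, \phi(x)\rangle - b\bigr)
  && \text{(decision function)} \\
\min_{w,\,b}\ & \tfrac{1}{2}\|w\|^2
  \quad \text{s.t.}\quad y_j\bigl(\langle w, \phi(x_j)\rangle - b\bigr) \ge 1,\ \ j = 1, \ldots, p
  && \text{(primal problem)} \\
\max_{\alpha}\ & \sum_{j=1}^{p} \alpha_j
  - \tfrac{1}{2} \sum_{i=1}^{p} \sum_{j=1}^{p} \alpha_i \alpha_j\, y_i y_j\, \langle \phi(x_i), \phi(x_j) \rangle
  \quad \text{s.t.}\quad \sum_{j=1}^{p} \alpha_j y_j = 0,\ \ \alpha_j \ge 0
  && \text{(dual problem)} \\
w^* &= \sum_{j=1}^{p} \alpha_j^*\, y_j\, \phi(x_j)
  && \text{(stationarity in } w\text{)}
\end{align}
```

The KKT complementarity condition $\alpha_j^* \bigl[ y_j(\langle w^*, \phi(x_j)\rangle - b^*) - 1 \bigr] = 0$ forces $\alpha_j^* = 0$ for every point that lies strictly outside the margin, which is the sparseness property mentioned above. The dual depends on the data only through the inner products $\langle \phi(x_i), \phi(x_j) \rangle$, which is what allows a kernel function to replace them.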
In the solution $\alpha^*$, often only a subset of training examples is associated with nonzero $\alpha_j$, $j = 1, \ldots, p$. These are called support vectors and correspond to the points that lie closest to the separating hyperplane (Fig.). For the maximal margin hyperplane, the weight vector $w^*$ is a linear function of the training points, given by equation (8). Based on equation (8), the decision function of equation (4) can be expressed as equation (9). For a support vector $x_j$, we have $\langle w^*, \phi(x_j) \rangle - b = y_j$, from which the optimum bias $b^*$ can be computed.
To choose the best kernel function in SVM for practical problems, we have the following options [43]:
(i) Select the kernel function based on the prior knowledge of experts.
(ii) Adopt cross-validation: different kernel functions are tried in turn, and the kernel function with the smallest error is taken as the best one.
In this paper, we implement the SVM with the RBF kernel in our algorithm.
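Option (ii) above, selecting the kernel by cross-validation, can be sketched as follows. The candidate kernel list, the 5-fold setting, the default hyperparameters, and the synthetic data are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed data.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=0)

# Candidate kernels available in scikit-learn's SVC.
kernels = ["linear", "poly", "rbf", "sigmoid"]

cv_error = {}
for kernel in kernels:
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    # Mean accuracy over 5 folds; the cross-validation error is its complement.
    accuracy = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    cv_error[kernel] = 1.0 - accuracy

# The kernel with the smallest cross-validation error is retained.
best_kernel = min(cv_error, key=cv_error.get)
```

Which kernel wins depends on the data; on the real datasets the hyperparameters of each kernel would normally be tuned as well before comparing them.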

Modeling Results.
After training our models, we obtain the learning curves of the different models, as done in [43,44]. These include the following:
(i) A training curve, which gives the score on the training sample (Figure 9)
(ii) A validation curve, which gives the score on the validation sample (Figure 9)
The first remark is that there is no convergence between the training and validation curves. The random forest, bagging, and AdaBoost classifiers overfit [51]: the predictions are perfect on training (blue curve) but poor on validation (orange curve). To resolve this, we can cross-validate them on different splits. The curves of SVM seem to converge, although more training data would be needed to confirm this.
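Training and validation curves of the kind discussed above can be computed with scikit-learn's `learning_curve`. The sketch below is illustrative only: it uses synthetic data, a random forest with hypothetical settings, and five training-set sizes; plotting is left out.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the preprocessed data.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)

# Scores are computed for 5 increasing training-set sizes, each with 5-fold CV.
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X,
    y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

train_mean = train_scores.mean(axis=1)  # training curve (one point per size)
val_mean = val_scores.mean(axis=1)      # validation curve (one point per size)

# A gap that stays large as the training size grows is a sign of overfitting.
gap = train_mean - val_mean
```

Plotting `train_mean` and `val_mean` against `train_sizes` gives the two curves; curves that approach each other as the size grows indicate convergence.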
In the next step, we observe the performance of each classifier after evaluation.

Results after Evaluation on the Test Set (1) Performance Criteria of the Five Models.
The performance criteria of the different models are obtained by computing the value of each metric. Table 2 presents the values of the metrics for each estimator.
After observing these values, we can say that the model with the best performance is SVM. Let us better appreciate this by observing the accuracy, precision, and recall curves in Figure 10.
We can notice that whatever the performance criteria, the SVM model has the highest score, in terms of accuracy, sensitivity, or specificity, which makes it the best model. All that remains now is to optimize it.

Training Results.
After training the best model with the GridSearchCV method, we obtain the learning curves presented in Figure 11.
We notice that there is no difference between the learning curves of this new model and those of the former one. Perhaps the hyperparameters of the former model are already the best, in which case there is no need to modify it. This will be verified after observing the new confusion matrix.

Results after Evaluation of the Optimized Best Model.
In ML, the confusion matrix (also called the error matrix) is a specific table layout that visualizes the performance of an algorithm, here the SVM, by counting the samples predicted as negative and as positive against their true classes. The confusion matrix of our optimized model is displayed in Table 3. The model has very few false negatives (0.71%) and no false positives (0%); it makes very little confusion between the two classes. This explains its high performance. Moreover, there is no difference between the nonoptimized and optimized models: accuracy = 0.992, sensitivity = 0.927, and specificity = 1. So, our SVM model is very reliable and efficient. Figure 12 highlights the performance of our solution and that of the works cited in the literature review that used the same dataset [32] as ours.
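The link between a confusion matrix and the reported metrics can be made concrete with a short calculation. The counts below are hypothetical: they are not taken from Table 3, but were chosen to be consistent with the reported percentages under the assumption of a test set of 1129 records (about 20% of 5644) containing 111 positives.

```python
def classification_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Compute standard metrics from the four cells of a binary confusion matrix."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,          # share of correct predictions
        "sensitivity": tp / (tp + fn),          # recall on the positive class
        "specificity": tn / (tn + fp),          # recall on the negative class
        "false_negative_share": fn / total,     # false negatives over all samples
    }


# Hypothetical counts (assumption, see the paragraph above).
metrics = classification_metrics(tn=1018, fp=0, fn=8, tp=103)
```

With these counts, 8 false negatives out of 1129 samples is about 0.71% of the test set, yet 8 missed cases out of 111 positives is enough to bring the sensitivity down to about 92.8%, which illustrates how a small positive class makes sensitivity the most fragile of the three metrics.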

Comparison with the Performance of Related Works.
As can be observed, the performance of our model is almost the highest in terms of accuracy and specificity. This performance may be due to the way our final model was chosen, which started with the evaluation of several models followed by the selection of the best one. On the other hand, if we take a look at the other results, we realize that there are solutions that perform better than ours, mostly in terms of accuracy and sensitivity, even though we did not work on the same data set. This is the case of the solution resulting from the work [31], which reached 99.88%, 98.72%, and 99.99% in terms of accuracy, sensitivity, and specificity, respectively.
We can confirm that our model is very efficient but not perfect. In particular, perfection is not achieved at the sensitivity level, which also affects the accuracy and prevents it from reaching the value 1. Indeed, this level of sensitivity (below 95%) can be explained by the low number of patients testing positive (only 10%) in our data set. This implies that the model was not trained on a large sample of positive cases, which affects the predictions of positive cases and lowers their performance. Sensitivity, being the ability to find all positive results, is therefore degraded.
In order to check whether our DA and SVM method generalizes, we carried out the same study on another dataset, taken from [24], that contains one more parameter, CRP (C-reactive protein). The SVM model was once more the best model in terms of accuracy, sensitivity, and specificity. Figure 13 represents the accuracy, specificity, and sensitivity of the models, so that the best model for each metric can be easily identified.
According to Figures 13(a)-13(c), the SVM model has the best performance compared to the others. It achieved 92.86% accuracy, 93.55% sensitivity, and 90.91% specificity. We then compared our result with the result from [23], which is based on the same dataset, and Figure 14 presents the difference between the models. Although the sensitivity of [23] is higher than ours, its accuracy is 82-86%, which is below the accuracy of our model. This means that our model makes fewer errors in its predictions than the model from [23].
From these discussions, we confirm that the results of the SVM model are good on both the first and the second dataset. Nevertheless, it is important for us to find ways to improve the performance of our solution, especially its sensitivity.
In a nutshell, we have presented the major results of the proposed solution, obtained during the modeling and evaluation stages. Based on performance (accuracy, precision, and sensitivity), we selected the best model among the five initially considered, and then, we improved its performance to be the best possible. We obtained very high performance on the test set: 99.29%, 92.79%, and 100% for accuracy, sensitivity, and specificity, respectively, concerning the first dataset (data set from [32]) and 92.86%, 93.55%, and 90.91% for accuracy, sensitivity, and specificity, respectively, concerning the second dataset (data set from [24]). By using our model, we can now perform a cheap COVID-19 test within less time. Furthermore, we can try to improve our model with some big data analysis techniques and tools used in biomedical engineering and presented in [52][53][54].

Conclusion
This study focused on the implementation of a solution to predict quickly and reliably whether or not an individual is infected with SARS-CoV-2, based on DA and an ML model as well as clinical data from patients who have undergone PCR tests. To this end, we first presented some diagnostic works on COVID-19 already carried out. Then, we presented in detail the approach used to achieve this solution. It began with an analysis and exploration of the data in order to understand our data set in depth. After understanding our data, we processed it in order to put it in a suitable format for machine learning. This processing consisted of encoding, imputation, standardization, and selection of the 10 best variables. The next step was modeling, in which we presented the five models to be trained and evaluated according to well-defined evaluation criteria, with the aim of selecting the best model. Finally, the last step was optimization, in which we used the GridSearchCV method, an optimization technique to increase the performance of the selected model. In the last part of this work, we highlighted the results obtained after the modeling and optimization phases, together with extensive discussions. After training and evaluation of the different models, we selected the SVM as the best model, and then we optimized it. At the end of the optimization, we observed that the performance remained the same: an accuracy of 0.99, a recall of 0.93, and a perfect specificity of 1. We did the same work with another dataset taken from [24]. Once more, the SVM presented the best performance: 92.86%, 93.55%, and 90.91% for accuracy, sensitivity, and specificity, respectively. At this point, we can say that blood parameters are a very good option to predict SARS-CoV-2 infection rapidly and at low cost.
Our solution has several advantages, namely:
(i) Absence of costs related to the manufacture and transport of the tests
(ii) Low dependence on qualified professionals for its use
(iii) More pleasant for patients compared to the PCR test
(iv) Accessibility to any location
(v) Fast and high-performance testing
(vi) Low cost
In future work, we want to develop an application that uses our model to perform the COVID-19 test. We also intend to adapt this solution to other diseases, pandemics, or epidemics.

Ethical Approval
This article does not contain any studies with human participants and/or animals performed by any of the authors.

Consent
This study was performed with free and available data from two datasets. No written consent has been obtained from the patients as there is no patient identifiable data included in this case report/series.

Conflicts of Interest
The authors declare that they have no conflicts of interest.