Using Artificial Intelligence for COVID-19 Detection in Blood Exams: A Comparative Analysis

COVID-19 is an infectious disease that was declared a pandemic by the World Health Organization (WHO) in early March 2020. Since its early development, it has challenged health systems around the world. Although more than 12 billion vaccine doses have been administered, at the time of writing, more than 623 million confirmed cases and more than 6 million deaths have been reported to the WHO. These numbers continue to grow, calling for further research efforts to reduce the impact of such a pandemic. In particular, artificial intelligence techniques have shown great potential in supporting the early diagnosis, detection, and monitoring of COVID-19 infections from disparate data sources. In this work, we aim to contribute to this field by analyzing a high-dimensional dataset containing blood sample data from over forty thousand individuals recognized as infected or not with COVID-19. Encompassing a wide range of methods, including traditional machine learning algorithms, dimensionality reduction techniques, and deep learning strategies, our analysis investigates the performance of different classification models, showing that accurate detection of COVID-19 infections from blood data can be obtained. In particular, an F-score of 84% was achieved by the artificial neural network model we designed for this task, with a rate of 87% correct predictions on the positive class. Furthermore, our study shows that the dimensionality of the original data, i.e. the number of features involved, can be significantly reduced to gain efficiency without compromising the final prediction performance. These results pave the way for further research in this field, confirming that artificial intelligence techniques may play an important role in supporting medical decision-making.


I. INTRODUCTION
COVID-19 is an infectious disease caused by the Severe Acute Respiratory Syndrome CoronaVirus 2 (SARS-CoV-2) [1], declared a pandemic by the World Health Organization (WHO) at the beginning of March 2020 [2].
The pandemic came in several waves, putting health systems into crisis. Hospitals, for example, have been particularly affected by this emergency, with intensive care cases becoming a serious concern. Moreover, as of October 21, 2022, over 623 million confirmed cases and more than 6.5 million deaths have been reported to the WHO [2]. For these reasons, and even though more than 12 billion vaccine doses have been administered to date [2], continuous monitoring and early detection of COVID-19 positive cases remain critical to prevent the spread of the virus and to provide the most appropriate treatment for severe cases.
The associate editor coordinating the review of this manuscript and approving it for publication was Yiming Tang.
According to the National Institute of Allergy and Infectious Diseases (NIAID), complications brought on by a coronavirus may include symptoms such as cough, breathing difficulties, fever, and kidney illnesses. In the worst cases, the disease may lead to death [3]. For these reasons, governments took preventive actions and invested in research to tackle this problem. In particular, in the field of artificial intelligence (AI), many researchers have studied and employed several machine learning (ML) and deep learning (DL) techniques to support the early diagnosis and monitoring of COVID-19. For many years, indeed, machine learning algorithms have played a crucial role in the medical field for clinical decision-making [4], [5], [6], being able to help experienced doctors, speed up the analysis process, and improve the reliability of results [7], [8], [9], [10].
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
More than ever, these techniques have proved to be extremely important during the pandemic. An example is the adoption of deep learning techniques in computer vision (CV) tasks which, in some cases, even generate new data for further investigation through models called generative adversarial networks (GANs) [11], [12], [13].
As witnessed by recent literature [14], [15], there is a growing demand for automated systems that can support healthcare professionals in extracting actionable knowledge from the increasing amount of digitized clinical data. There are many types of medical data, e.g., those obtained through radiography, computed tomography, and magnetic resonance imaging [16]. Similarly, the publicly available data related to COVID-19 include chest X-ray (CXR) images, computed tomography (CT) scans [17], [18], [19], cough sound recordings, and many others [20], including demographic and routine clinical data.
In this context, this work aims to investigate the potential of machine learning and deep learning methods for detecting COVID-19 in blood sample data, which can potentially complement other screening and diagnostic approaches. More specifically, this work focuses on the exploration and analysis of a relatively recent dataset containing more than 40 000 instances, each described by more than 12 000 features derived from the digitization of collected blood samples.
With the purpose of building models capable of discriminating between infected and non-infected subjects, several classification methods have been applied, including Bayesian classifiers, rule-based classifiers, tree-based classifiers, instance-based classifiers, Support Vector Machines, and state-of-the-art ensemble methods (Random Forest and XGBoost). Several artificial neural network architectures have also been explored, leading to a final deep learning model with satisfactory performance. Furthermore, given the high dimensionality of the problem, i.e. the large number of features involved, our comparative study has explored the use of automatic feature selection techniques. They were applied in conjunction with the best-performing classification methods to provide valuable insights into which approaches may be most suitable for analyzing this type of data.
In summary, the main contributions of our research are the following:
1) We studied and compared the classification performance of different families of machine learning classifiers.
2) We studied and compared three deep learning classifiers recently proposed for the classification of tabular data.
3) We proposed a new artificial neural network (ANN) to handle the task at hand.
4) We investigated the extent to which feature selection can be beneficial with respect to the top-performing machine and deep learning algorithms, i.e. Random Forest and our proposed ANN.
5) We proposed an extensive comparative analysis in a domain that has not yet been fully explored and an effective pipeline to solve the task at hand.
Encompassing a large variety of methods, such a broad experimental investigation can provide valuable hints to researchers and health professionals in this field, paving the way for further, more in-depth research.
The rest of the manuscript is structured as follows. Section II provides a general review of the use of artificial intelligence techniques for COVID-19 detection. Section III presents the considered dataset and gives a brief description of the machine learning and deep learning techniques and the feature selection methods used in this work. The evaluation metrics and the leading technologies adopted are also explained. Section IV shows the experiments conducted regarding both machine learning and deep learning approaches, used alone as well as in conjunction with different feature selection techniques, with a discussion of the main findings. Finally, Section V provides final considerations on this work and outlines future research directions.

II. RELATED WORK
Over the past two years, a great deal of research has been conducted related to the diagnosis and detection of COVID-19 infections. Below we provide a general overview of previous work that, from different perspectives, sought to contribute to the battle against COVID-19 by exploiting artificial intelligence techniques.
Among the papers summarizing relevant contributions in the field, Bhattacharya et al. [21] describe several applications of deep learning in the context of COVID-19 study and analysis, including outbreak prediction, monitoring the virus spread, diagnosis and treatment, vaccine development, and drug testing.
In the work of Rasheed et al. [20], the authors illustrate how various deep learning techniques have been applied in the field of computer vision, focusing on the analysis of X-ray images. They discuss the role of AI from three perspectives: analysis, prognosis, and case tracking of COVID-19.
Another example is the work of Shorten et al. [22]. The authors studied the main applications of artificial intelligence algorithms to deal with the pandemic. They analyzed DL applications to natural language processing (NLP), life sciences, computer vision, and epidemiology, explaining how the availability of big data affects both the construction and application of learning models.
Some approaches have also investigated hybrid methods based on integrating clinical data with features extracted from CXR images, either handcrafted or automatically learned by convolutional neural networks [44]. The reported experimentation, conducted on patients admitted to Italian hospitals during the first wave of the pandemic, aimed at devising reliable tools for the identification of patients at risk of severe outcomes, like intensive care or death. Despite the inherent difficulty of such a complex task, the authors provided a baseline performance reference to foster further research in this direction.
Overall, the studies reported in the literature point out that the automatic detection of COVID-19 from any data source is quite a difficult task. Typically, methods in the computer vision domain exploit the information that can be inferred from imaging tools, often leading to very high performance for specific groups of patients. However, they may not be suitable for every type of COVID-19-related diagnostic scenario. On the other hand, methods based on routine clinical data, including blood sample data, may have broader applicability for larger groups of patients and can be potentially suitable for large-scale (and low-cost) screenings. This kind of data, however, is often acquired in less controlled and heterogeneous settings, and no clear guidelines are available on the best features to consider for the analysis. In general, no single artificial intelligence approach can be optimal for each type of COVID-19-related task, motivating the exploration of different approaches that may be complementary to each other.
In this context, our work focuses on a public dataset much less explored than others but still very interesting for the considerable amount of data collected through large-scale blood tests involving more than forty thousand people. As will be presented in Section III-A, it is a challenging benchmark with a high number of features derived from the digitization of the collected blood samples. Such high dimensionality makes it particularly difficult to induce accurate detection models. In addition, it does not allow direct comparison with approaches taken in previous work based on blood data.
The study that used data most similar to those employed in our work was proposed by Ribeiro et al. [45]. Indeed, their experiments were based on digitized blood samples but with lower dimensionality than the ones considered here. Specifically, the authors proposed a multilayer perceptron (MLP) with a hidden layer of 450 neurons to devise a diagnostic system with high sensitivity and specificity. Our experiments, encompassing an extensive range of learning methods, confirm the suitability of artificial neural network models in this task, as discussed below.

III. MATERIALS AND METHODS
This section presents all the materials and methods involved in our comparative analysis, including the high-dimensional dataset containing the digitized blood samples (subsection III-A), the artificial intelligence approaches adopted (subsection III-B), the evaluation metrics employed for the experimental evaluation (subsection III-C) and the chosen implementation setup (subsection III-D). Notably, our study encompasses a wide range of classification techniques, which have been used both alone and in conjunction with different feature selection methods in order to investigate the extent to which the final classification performance varies depending on the data dimensionality.

A. DATASET
The employed dataset contains data extracted from blood samples collected through the blood scanner represented in Fig. 1. It was released by Hilab, a laboratory company from Brazil, which has thousands of blood scanner points distributed throughout the country, mostly in hospitals and pharmacies.
Given the technology of the equipment, where blood samples are digitized, and the high number of exams, enough data has become available to build a significant benchmark for machine learning and deep learning tasks. Such a benchmark has been recently used for a competition entitled COVID19 Detection in Blood Exams. It is publicly available and consists of 40 044 instances uniformly distributed into two classes representing samples positive for COVID-19 or not, as labeled by expert biomedicians. Each sample is described by 12 210 numerical features, which make the classification problem very high-dimensional and challenging.
The dataset providers split the data into seven different fragments through a stratified sampling procedure. Each fragment is composed of approximately 5 500 instances. Table 1 shows the number of instances for each data fragment.

B. CLASSIFICATION AND SELECTION METHODS
We employed machine and deep learning approaches to induce classification models from the considered COVID-19 detection benchmark. Furthermore, as mentioned earlier, we explored the use of different feature selection methods given the high dimensionality of the data at hand. A brief description of the methods adopted is provided below.

1) MACHINE LEARNING METHODS
For our comparative study, we exploited the following machine learning methods as representatives of different families of classifiers:
• Bayesian Network (BayesNet);
• Naïve Bayes (NB);
• Support Vector Machine (SVM);
• k-Nearest Neighbor (k-NN);
• Ripper (JRip);
• One Rule (OneR);
• Decision Tree (J48);
• Random Forest (RF);
• eXtreme Gradient Boosting (XGBoost).
Bayesian Networks are probabilistic models that represent, in the form of a directed acyclic graph (DAG), the conditional dependence relationships among the variables of the problem at hand (namely, in our context, the target class, and the features). The probabilistic parameters are encoded in a set of tables, one for each node of the graph, in the form of local conditional distributions of a node (variable) given its parent nodes. Once the DAG structure and the probability values have been induced from a training set of labeled examples, new instances can be classified by properly computing the posterior probability of each class value [46].
In the family of Bayesian Network classifiers, a straightforward yet effective approach is the Naïve Bayes method, which assumes conditional independence among the problem features, given the value of the target class. Despite this strong assumption, the Naïve Bayes approach has been shown to be competitive across different classification tasks [47].
Support Vector Machines are state-of-the-art classifiers that can effectively model different types of decision boundaries and are known to scale well to high-dimensional feature spaces. In particular, the linear SVM approach involves searching for an optimal hyperplane that maximizes the width of the margin between the classes [48]. The soft margin formulation and the so-called kernel trick allow for extending the approach to non-linearly separable problems.
The k-NN algorithm is a popular classification method in the family of instance-based learners [49] that assigns the class to unknown instances based on their similarity to the training records. Specifically, given a new instance to classify, the algorithm finds the k training records closest to it (namely, its k nearest neighbors) and makes a prediction based on a majority voting decision. A common variant is to weight the k nearest neighbors based on their distance from the unknown instance, giving higher weights to the closest neighbors [46].
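As a minimal sketch of the voting scheme just described, the following pure-Python function classifies a query by majority vote among its k nearest neighbors, optionally weighting each vote by inverse distance. The Euclidean metric, the inverse-distance weights, and the function name `knn_predict` are assumptions of this illustration, not details taken from the Weka implementation used in the experiments.

```python
import math
from collections import defaultdict

def knn_predict(train_X, train_y, query, k=5, weighted=False):
    """Classify `query` by an (optionally distance-weighted) vote of its k nearest neighbors."""
    # Euclidean distance from the query to every training record, closest first
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = defaultdict(float)
    for d, label in dists[:k]:
        # Inverse-distance weighting (assumed scheme); epsilon avoids division by zero
        votes[label] += 1.0 / (d + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)
```

With three neighbors at distances 0.1 (class 1), 2.9 and 3.0 (class 0), the unweighted vote returns class 0, whereas the weighted vote returns class 1 because the closest neighbor dominates.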
Repeated Incremental Pruning to Produce Error Reduction (RIPPER) is a rule-based classifier that relies on a sequential covering approach [50] to induce an ordered list of prediction rules. Each rule is built greedily, starting with an empty rule antecedent and repeatedly adding the conjunct that maximizes FOIL's information gain measure. The resulting rules are then refined using an incremental reduced error pruning technique. In more detail, a validation set is used to estimate the predictive performance of each rule based on a metric that is monotonically related to the rule's accuracy. Pruning is done starting from the last conjunct added to the rule: the conjunct is removed if the performance metric improves after pruning. This style of pruning has proven to be quite effective in raising predictive accuracy in noisy domains.
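For illustration, the gain measure used to grow a rule can be sketched as below. The formula is the standard textbook definition of FOIL's information gain; the function name and the coverage counts in the usage note are hypothetical and do not come from our experiments.

```python
import math

def foil_gain(p0, n0, p1, n1):
    """FOIL's information gain when a rule covering p0 positives and n0 negatives
    is extended by one conjunct into a rule covering p1 positives and n1 negatives.
    Assumes p0 > 0 and p1 > 0."""
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))
```

For example, extending a rule that covers 100 positives and 400 negatives into one that covers 4 positives and 1 negative yields a gain of 4 · (log2 0.8 − log2 0.2) = 8.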
In the family of rule-based classifiers, One Rule is another well-known approach [51]. Basically, the algorithm constructs a rule by considering the most frequent class for each input feature's value (in the case of numerical features, they are properly discretized). Therefore, each rule is simply a set of feature values bound to their majority class. The rule with the lowest training error is finally used for prediction.
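The procedure can be sketched in a few lines of Python. This is an illustrative version that assumes already discretized (categorical) feature values; the function name `one_rule` and its return format are hypothetical choices, not the OneR implementation available in Weka.

```python
from collections import Counter, defaultdict

def one_rule(rows, labels):
    """Return (feature_index, {value: class}, training_errors) for the single
    feature whose value-to-majority-class mapping makes the fewest training errors."""
    best = None
    for j in range(len(rows[0])):
        # Class frequencies observed for each value of feature j
        counts = defaultdict(Counter)
        for row, y in zip(rows, labels):
            counts[row[j]][y] += 1
        # Bind each feature value to its majority class
        rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(rule[row[j]] != y for row, y in zip(rows, labels))
        if best is None or errors < best[2]:
            best = (j, rule, errors)
    return best
```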
Among the tree-based classifiers, we considered the J48 algorithm, which builds a decision tree model according to the approach originally proposed by Quinlan [52]. At each node of the tree, the attribute with the highest information gain ratio is used to split the data into purer subsets. In order to reduce the risk of overfitting, the size of the tree is controlled by a post-pruning strategy based on a pessimistic error estimate made on the training data itself.
Finally, two ensemble methods were used, i.e. Random Forest and XGBoost. The Random Forest classifier relies on multiple decision trees built from different bootstrap samples of the training data [53]. In order to introduce as much diversity as possible among the ensemble components, each tree is built by selecting, for each internal node, the best splitting attribute among a set of candidate features chosen at random. Such an approach has shown a robust behavior in high dimensional spaces and, compared to other ensemble approaches, turns out to be computationally more efficient.
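The two sources of randomness mentioned above, bootstrap sampling of the training records and random choice of candidate features at each node, can be sketched as follows. The helper names and the toy sizes (1 000 records, 100 features, 7 candidates) are illustrative assumptions; tree induction itself is omitted.

```python
import random

def bootstrap_sample(n_rows, rng):
    """Draw n_rows record indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def candidate_features(n_features, n_candidates, rng):
    """Pick, without replacement, the random feature subset examined at one tree node."""
    return rng.sample(range(n_features), n_candidates)

rng = random.Random(0)  # fixed seed, only to make the sketch reproducible
rows = bootstrap_sample(1000, rng)        # indices used to grow one tree
feats = candidate_features(100, 7, rng)   # candidates for one split
```

Because sampling is done with replacement, some records appear several times in `rows` and others not at all, which is what makes the trees of the ensemble differ from each other.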
XGBoost [54] is an eXtreme Gradient Boosting tree algorithm that belongs to the family of Gradient Boosted Decision Trees (GBDT) methods, as described by Grari et al. [55]. As an ensemble model, XGBoost builds new trees on the residuals of the previous ones and combines them to obtain the final prediction. When new trees are added, a gradient descent procedure minimizes the loss. Therefore, each tree learns from its predecessors by fitting, and thus reducing, their residual errors.
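The residual-fitting idea can be illustrated with a deliberately simplified sketch: plain gradient boosting with a squared-error loss and one-split regression stumps on a single feature. This is not XGBoost itself, which additionally relies on second-order gradient information and regularization; all names and default values below are assumptions of the example.

```python
def fit_stump(x, residuals):
    """Best single-threshold regression stump on a 1-D feature (squared error).
    Assumes x contains at least two distinct values."""
    best = None
    for t in sorted(set(x))[:-1]:  # the largest value is skipped so both sides are non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def boost(x, y, n_rounds=20, learning_rate=0.5):
    """Additive model: each new stump is fitted to the residuals of the current ensemble."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For the squared loss, the negative gradient is simply the residual
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)
```

On a toy problem such as `x = [1, 2, 3, 4, 5, 6]`, `y = [0, 0, 0, 1, 1, 1]`, the residuals are halved at every round, so the ensemble prediction converges to the target values.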

2) FEATURE SELECTION METHODS
Feature selection, also known as variable selection or attribute selection, is a widely employed technique for reducing the original data dimensionality [56]. It involves selecting the most relevant features for the task at hand with the aim of improving the efficiency and the understandability of the induced models without degrading their performance significantly. The literature contains several approaches to formalize the concept of feature relevance and quantify the degree of relevance [57]. Nevertheless, there are no clear and standard guidelines to follow for a specific problem [58].
When used in the context of classification tasks, feature selection methods are usually categorized into three groups [59], [60]: i) filters, which assess the degree of correlation among the features and the target class by only relying on the intrinsic data characteristics, without interacting with the classification algorithm that will be used in the construction of the final model; ii) wrappers, which use a specific classifier to evaluate different candidate subsets of features (built through proper search strategies, e.g., a greedy stepwise search or an evolutionary search) and choose the one that leads to the best performance; iii) embedded approaches, which are based on the intrinsic ability of some classification algorithms to assign weights to the features without requiring a systematic comparison among different candidate subsets.
Due to their lower computational cost, filters are the primary choice in very high-dimensional problems, such as the one considered here. Specifically, in this work, we adopted a ranking-based selection approach, in which the features are ordered from the most important to the least important based on the strength of their correlation with the target class. Only a predefined number of highly ranked features is then used for model induction, as discussed in Section IV.
In particular, the following ranking methods, widely employed in different application domains [61], including the analysis of high-dimensional biomedical data [62], [63], [64], [65], were chosen for our experiments:
• Pearson's Correlation (Corr);
• Information Gain (InfoG);
• Gain Ratio (GainR);
• Symmetrical Uncertainty (SU);
• Mutual Information (MI).
Pearson's Correlation is a well-known criterion to evaluate the linear correlation between two variables [66]. In the context of feature selection, it assesses the worth of each attribute by evaluating the extent to which its values are linearly correlated with the class: the higher the correlation, the more relevant the attribute is for the classification task at hand.
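A correlation-based ranking of this kind can be sketched as follows. The code assumes a binary class encoded as 0/1 and ranks features by the absolute value of the correlation; both choices, as well as the function names, are assumptions made for illustration and may differ from the Weka evaluator used in our experiments.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient between two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    # A constant variable has no defined correlation; it is scored 0 here
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def rank_by_correlation(columns, target):
    """Feature indices ordered by decreasing |correlation| with the class."""
    scores = [abs(pearson(col, target)) for col in columns]
    return sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
```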
Information Gain, Gain Ratio, and Symmetrical Uncertainty rely on the information-theoretical concept of entropy [46]. Specifically, InfoG computes a weight for each feature by measuring the extent to which the entropy of the class decreases when the value of that feature is known. GainR and SU adopt a similar approach but introduce proper normalization factors to compensate for the bias of InfoG toward features with more values.
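The three entropy-based scores can be computed as in the sketch below, which uses their usual definitions: InfoG = H(Y) − H(Y|X), GainR = InfoG / H(X), and SU = 2 · InfoG / (H(X) + H(Y)). The code assumes that the feature has already been discretized; the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def info_scores(feature, target):
    """Return (InfoG, GainR, SU) of a discrete feature with respect to the class."""
    n = len(target)
    h_y, h_x = entropy(target), entropy(feature)
    # Conditional entropy H(Y|X): class entropy within each feature value, weighted by frequency
    h_y_given_x = 0.0
    for v, c in Counter(feature).items():
        subset = [y for x, y in zip(feature, target) if x == v]
        h_y_given_x += (c / n) * entropy(subset)
    info_gain = h_y - h_y_given_x
    gain_ratio = info_gain / h_x if h_x > 0 else 0.0
    sym_unc = 2 * info_gain / (h_x + h_y) if (h_x + h_y) > 0 else 0.0
    return info_gain, gain_ratio, sym_unc
```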
In turn, Mutual Information is an entropic criterion to measure the degree of dependency between two variables. The specific implementation adopted here relies on a nonparametric approach based on entropy estimation from k-nearest neighbors' distances, as described in [67] and [68].

3) DEEP LEARNING METHODS
Deep learning is a branch of machine learning that focuses on Artificial Neural Networks (ANNs), i.e. complex computational systems that attempt to emulate biological neural systems and employ this metaphor to learn from data.
An artificial neural network comprises a set of layers, each consisting of a collection of processing units called nodes or neurons, which are connected to each other via properly weighted directed links. Each neuron includes an activation function that determines the node's output based on the inputs received through the incoming links. The weights of the links (i.e. the network parameters) represent a fundamental aspect as the system's predictive ability depends on them.
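The computation performed by a layer of neurons can be sketched in plain Python as below. The weights, biases, and activation functions are arbitrary illustrative values, not parameters of the network proposed in this work.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer(inputs, weights, biases, activation):
    """One fully connected layer: each neuron applies `activation`
    to the weighted sum of its inputs plus a bias."""
    return [
        activation(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
        for neuron_w, b in zip(weights, biases)
    ]

# Two inputs -> hidden layer with two ReLU neurons -> one sigmoid output neuron
hidden = dense_layer([1.0, 2.0], [[0.5, 0.25], [-1.0, 0.25]], [0.0, 0.0], relu)
output = dense_layer(hidden, [[1.0, 1.0]], [-1.0], sigmoid)
```

Here the hidden layer outputs `[1.0, 0.0]` and the output neuron returns `sigmoid(0) = 0.5`; training consists of adjusting the weights and biases so that such outputs match the class labels.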
ANN systems provide a powerful way of representing features at different levels of abstraction. In fact, at the various layers of the network, more complex features are defined starting from the raw attributes of the input dataset [66], [69]. In contrast to ''shallow'' networks that involve only a small number of hidden layers, deep neural networks are characterized by multiple layers, i.e. multiple levels of abstraction, with the ability to model very complex decision boundaries.
In order to train such complex models, adequate computational resources and advanced algorithmic procedures are required due to various factors that come into play. In particular, regularization methods play a crucial role in reducing the risk of overfitting. Further, depending on the data characteristics, proper architectural solutions need to be adopted [70]. Successful applications of such a computational paradigm are increasingly reported in the literature, across different real-world domains [71], [72], [73], [74], [75], including COVID-19 detection [25], [26], [27], [28], [44], [76].
In this work, we explored different network models to investigate the potential of deep learning in the diagnostic task at hand. The specific solution adopted, with its design choices and settings, is detailed in Section IV. Basically, it involves several intermediate layers through which the dimensionality of the input is gradually reduced. Such a solution was also compared with existing state-of-the-art deep learning methods for tabular data. More specifically, the comparison included the following algorithms:
• TabNet;
• Neural Oblivious Decision Ensembles (NODE);
• 1D Convolutional Neural Network (1D-CNN).
TabNet [77] is a transformer-based model for tabular data. It comprises multiple subnetworks processed sequentially and hierarchically, like a decision tree. In particular, each subnetwork corresponds to a decision step and receives the current batch of data as input. At each decision step, TabNet first applies a sparse feature mask to perform feature selection; the results of all decision steps are then aggregated to obtain the final prediction.
NODE, proposed by Popov et al. [78], is instead a fully differentiable model, which permits end-to-end training and inference with gradient descent optimizers. It is an ensemble of differentiable oblivious decision trees [79], i.e. trees that use the same splitting function for all nodes on the same level. Being based on decision tree ensembles, it requires no preprocessing or data transformation.
Convolutional Neural Networks (CNNs) are rarely used on tabular data because the feature ordering has no locality characteristics. Nevertheless, a method based on a 1D convolutional neural network recently achieved the best single model performance in a Kaggle competition with tabular data [80]. More precisely, the main idea is to take advantage of the ability of CNNs to extract features. To this end, a fully connected layer first creates a large set of features with locality characteristics, which is then processed by multiple 1D convolutional layers.

C. EVALUATION METRICS
The metrics taken into account for the final evaluation are the following:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$
$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$
$$\text{TPR or Recall} = \frac{TP}{TP + FN} \tag{3}$$
$$\text{FPR} = \frac{FP}{FP + TN} \tag{4}$$
$$\text{F-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{5}$$
where TP, TN, FP, FN represent true positives, true negatives, false positives, and false negatives, respectively. In more detail, the accuracy indicates the overall percentage of correctly classified records, as shown in Eq. (1). The precision represents the fraction of correct predictions among all instances assigned to the positive class; clearly, it depends on the number of false positives and is maximum when there are no false positives (Eq. (2)). Instead, the True Positive Rate (TPR), also known as recall, is the fraction of positive instances correctly classified as positive; it depends on the number of false negatives, with the maximum reached when there are no false negatives (Eq. (3)). A proper tradeoff between precision and recall is provided by the F-score (or F-measure), which is defined as the harmonic mean of precision and recall (Eq. (5)) and takes both false positives and false negatives into account.
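As a small worked example, the metrics described above can be computed from the four confusion-matrix counts as follows. The function name and the counts used in the usage note are hypothetical and are not results of our experiments.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall (TPR), FPR and F-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    # Harmonic mean of precision and recall
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "fpr": fpr,
        "f_score": f_score,
    }
```

For instance, with made-up counts TP = 40, TN = 40, FP = 10, FN = 10, accuracy, precision, recall, and F-score all equal 0.8, while the FPR is 0.2.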
Another common metric is the Area Under the ROC Curve (AUC), which is a valuable criterion for comparing different classifiers [66]. Basically, the ROC (Receiver Operating Characteristic) curve is a graph that plots the True Positive Rate (TPR, Eq. (3)) against the False Positive Rate (FPR, Eq. (4)) at different probability thresholds for the positive class. Lowering the probability threshold classifies more items as positive, thus increasing both true and false positives. The area under the ROC curve provides a single score to summarize the classifier's performance on a given domain.
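The threshold sweep behind the ROC curve can be sketched as below: the decision threshold is lowered over all observed scores, an (FPR, TPR) point is recorded at each step, and the area is then approximated with the trapezoidal rule. This is an illustrative pure-Python version that assumes 0/1 labels with at least one instance per class; it is not the routine used by the libraries adopted in this work.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by lowering the decision threshold over all scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under a piecewise-linear curve (trapezoidal rule)."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )
```

A classifier that ranks every positive above every negative obtains an AUC of 1, while one that misorders one of four positive-negative pairs obtains 0.75.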

D. TECHNOLOGIES AND SETUP
All the experiments have been conducted on the same machine with the following configuration: Intel(R) Xeon(R) Gold 6136 CPU @ 3.00GHz and Tesla P6 16 GB GPU.
Moreover, we used the following software:
• The Weka machine learning library [46], which contains a variety of functions for classification, including the different machine learning methods described in Section III-B. It also provides functions for data preprocessing, including various feature extraction and feature selection techniques, which have been used in this work to reduce the dimensionality of the considered benchmark.
• Keras [81], a Python library that provides extensive support for deep learning; it was exploited to compare different artificial neural network architectures and implement our final deep learning model.
• Scikit-learn [82], a Python library that supports machine learning investigations and contains some feature selection techniques, particularly the MI implementation adopted in this work.

IV. EXPERIMENTS AND RESULTS
This section presents a summary of the experimental results obtained. First, we present an investigation with different ML techniques in Section IV-A. In Section IV-B, we then explore deep learning techniques by introducing a custom artificial neural network architecture designed and implemented for this task. Finally, we discuss the results of our comparative study.

A. MACHINE LEARNING APPROACH
Here, we provide details of the experiments conducted with the machine learning approach. In particular, we have divided the experimental investigation into two main phases:
1) A preliminary analysis, in which we exploited some fragments of the dataset (see details on the data subdivision in Table 1). The results of this phase are reported in Table 2 and Table 3.
2) A detailed analysis of the entire dataset, the results of which are presented in Fig. 2 and Fig. 3. In particular, the behavior of the best performing classifier was studied in conjunction with different feature selection approaches by introducing multiple levels of dimensionality reduction.
In the preliminary analysis, we ran several tests with the considered machine learning algorithms using only the first four fragments of the dataset. In this way, we tried to get an introductory view of the data to see whether some subdivisions could produce better performance or, in general, whether any of them had more meaningful information than others. Specifically, the following configurations were considered:
• training set: fragment 1, test set: fragment 3;
• training set: fragment 3, test set: fragment 1;
• training set: fragment 2, test set: fragment 4;
• training set: fragment 4, test set: fragment 2.
The employed classifiers, described in Section III-B, were mainly trained with their default parameters. In particular, we used a linear kernel for the SVM method, while the Random Forest classifier was implemented with 100 trees and $\log_2(n) + 1$ random features. In the case of the k-NN approach, with and without instance weighting, the default value of k (i.e. the number of nearest neighbors) was changed to 5; lower values, indeed, can increase the risk of over-fitting.
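To give a sense of scale for the Random Forest setting mentioned above, the short computation below evaluates the number of candidate features per split, assuming that n denotes the 12 210 input features of the dataset and that the result is truncated to an integer. Both assumptions are ours and are stated only for illustration.

```python
import math

n_features = 12210  # number of features in the dataset (Section III-A)
# Assumed reading of the setting: integer part of log2(n) + 1
n_candidates = int(math.log2(n_features) + 1)
print(n_candidates)
```

Under these assumptions, each split considers only 14 of the 12 210 features, which helps explain why the method remains computationally tractable on such high-dimensional data.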
On the other hand, the configuration adopted for the experiments on the entire dataset was the following:
• Training set: merge of fragments 1, 2, 3, 4, and 5 (≈70%);
• Validation set: fragment 7 (≈14%);
• Test set: fragment 6 (≈16%).
As presented in Table 1, the composition of the dataset allowed us to take advantage of stratified sampling to divide it into training, validation, and test sets, while maintaining a balanced distribution of classes.
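This subdivision can be expressed compactly as in the following sketch, where `fragments` is a hypothetical dictionary mapping each fragment number (1 to 7) to its list of samples; how the fragments are actually loaded is left out.

```python
def split_by_fragment(fragments):
    """Build the training, validation, and test sets from the seven data fragments."""
    train = [s for i in (1, 2, 3, 4, 5) for s in fragments[i]]  # merged fragments 1-5
    validation = list(fragments[7])
    test = list(fragments[6])
    return train, validation, test
```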

1) RESULTS OF THE PRELIMINARY ANALYSIS
The results of the first phase of our comparative analysis are summarized in Table 2, where the accuracy, F-score, and AUC values are reported for the different configurations considered (involving fragments 1 and 3 as well as fragments 2 and 4, as explained above). The mean and standard deviation for all three metrics have also been calculated across the different configurations, as shown in Table 3. They gave more insights into the performance and behavior of each algorithm.
As seen from the tables, the results are not satisfactory overall and express the difficulty of analyzing the considered high-dimensional benchmark. In each case, the Random Forest obtained the best results (emphasized in bold), with the highest values of accuracy, F-score, and AUC. This confirms the effectiveness of this ensemble approach that has proven to be a ''best of class'' learner in several tasks [83], including the analysis of high-dimensional biomedical data [62], [84], [85], [86].
These findings prompted us to focus on the Random Forest classifier for our detailed analysis of the entire dataset, as explained below.

2) RESULTS WITH FEATURE SELECTION
As previously mentioned, all the 40 044 available samples, properly divided into training, validation, and test sets, were used in this analysis phase. The experiments were carried out considering the learning method that worked best in our preliminary investigations (see Section IV-A1), i.e. the Random Forest.
More in detail, the analysis was conducted using all the original features and considering reduced feature spaces of different dimensionalities. For feature selection, the ranking techniques introduced in Section III-B2 were employed, i.e. Corr, InfoG, GainR, SU, and MI. Since each technique outputs a list in which the features appear in decreasing order of relevance, we cut this list at a proper threshold point to select the desired number of features.
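The rank-and-cut procedure can be illustrated with a minimal, self-contained sketch based on mutual information. It is not the implementation used in our experiments: the function names are hypothetical, the features are assumed to be already discretized (continuous blood parameters would first need binning or a dedicated estimator), and the toy data are invented for illustration.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def rank_features(columns, labels):
    """Feature indices in decreasing order of relevance to the labels."""
    scores = [mutual_information(col, labels) for col in columns]
    return sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)


def select_top_k(ranking, k):
    """Cut the ranked list at the chosen threshold point."""
    return ranking[:k]


# Toy data: 3 discretized features (one list per feature), 8 samples.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [0, 1, 0, 1, 0, 1, 0, 1],  # independent of the label
    [0, 0, 0, 0, 1, 1, 1, 1],  # identical to the label
    [0, 0, 0, 1, 1, 1, 1, 0],  # partially informative
]

ranking = rank_features(columns, labels)
selected = select_top_k(ranking, 2)
print(ranking, selected)
```

Because each feature is scored independently of the others, the cost grows linearly with the number of features. This makes the approach practical for 12 210 features, but it also means that redundant features receive similar scores and may be selected together.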
Specifically, Fig. 2 shows the performance of a Random Forest model trained on increasing numbers of selected features. Different colors are used in the chart to distinguish the outcome of the different selection methods. We can see that 100 features are sufficient to achieve an F-score above 0.65. By increasing the number of selected features, the classification performance gradually improves and tends to stabilize for feature subsets containing more than 2 000 features.
Overall, the different selection methods lead to similar results, with a slight superiority of the MI approach. With a reduced subset of 2 000 features, in fact, it leads to an F-score of almost 0.75, the same achieved on the original feature set (containing 12 210 features).
A comparison of the confusion matrices obtained with and without feature selection is provided in Fig. 3, which shows the Random Forest performance over the whole feature set (left) as well as using a reduced set of 2 000 features (right), as selected by MI. We can see that the rate of correct predictions on the positive class is the same, while the rate of correct predictions on the negative class is slightly higher using MI. Therefore, even with about 84% fewer features (2 000 instead of 12 210), the Random Forest matched the results obtained with the full feature set.
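To make the quantities read from the confusion matrices concrete, the following sketch computes the per-class rates and the F-score from the four cells of a binary confusion matrix, together with the relative feature reduction. The counts are invented for illustration and are not those of Fig. 3; we also assume here that the F-score refers to the positive class.

```python
def confusion_rates(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Per-class rates and F-score from a binary confusion matrix."""
    recall = tp / (tp + fn)        # correct predictions on the positive class
    specificity = tn / (tn + fp)   # correct predictions on the negative class
    precision = tp / (tp + fp)
    f_score = 2 * precision * recall / (precision + recall)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f_score": f_score,
    }


def feature_reduction(n_selected: int, n_total: int) -> float:
    """Fraction of the original features discarded by the selection step."""
    return 1.0 - n_selected / n_total


# Hypothetical counts, for illustration only.
toy = confusion_rates(tp=30, fn=10, fp=5, tn=55)

# 2 000 features retained out of the 12 210 original ones.
reduction = feature_reduction(2_000, 12_210)
print(toy, round(reduction * 100))
```

With 2 000 of 12 210 features retained, the reduction evaluates to roughly 84%, consistent with the figure quoted above.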

B. DEEP LEARNING APPROACH
As an additional solution, deep learning techniques were also explored. In this regard, we conducted a large-scale preliminary investigation by implementing and testing several artificial neural network classifiers, which were trained both on all the original features and on feature spaces of reduced dimensionality. The adopted configuration, the final result of these preliminary experiments, is depicted in Fig. 4. It was also compared with some state-of-the-art deep learning methods proposed for tabular data.
As can be seen, we introduced a number of intermediate layers where the dimensionality of the input layer is gradually reduced, which allowed for the extraction, through the architecture itself, of progressively fewer features (of a higher level) to be used for the final class assignment. In particular, our preliminary investigation led us to a significant initial reduction in the number of neurons, going from the input to the first hidden layer, with a more gradual dimensionality decrease through the subsequent layers. In turn, the optimal dimensionality of the input layer was also explored by introducing a preliminary feature selection step, as detailed below.
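The effect of a sharp initial reduction can be appreciated by counting the trainable parameters of a fully connected network with progressively narrower layers. The sketch below is purely illustrative: only the input sizes (12 210 and 3 000 features) come from the text, while the hidden-layer widths are hypothetical and reproduce neither the architecture of Fig. 4 nor its actual parameter count.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases of a fully connected network."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# Hypothetical hidden widths; only the input sizes come from the text.
HIDDEN_AND_OUTPUT = [256, 128, 64, 1]

full_input = [12_210] + HIDDEN_AND_OUTPUT
reduced_input = [3_000] + HIDDEN_AND_OUTPUT

total_full = dense_param_count(full_input)
total_reduced = dense_param_count(reduced_input)

# Share of the parameters held by the first (input-to-hidden) layer.
first_layer = full_input[0] * full_input[1] + full_input[1]
first_layer_share = first_layer / total_full

print(total_full, total_reduced, round(first_layer_share, 3))
```

In such a funnel-shaped design, almost all parameters sit between the input and the first hidden layer, so reducing the input dimensionality through feature selection shrinks the model almost proportionally.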
The grid search led to the best hyperparameters summarized in Table 4. In particular, Leaky ReLU was adopted on all hidden layers, while the sigmoid function was used only in the output layer. In addition, a dropout of 10% was added on all but the last hidden layer to avoid overfitting, which resulted in improved generalization performance. Moreover, for all models, an early stopping criterion based on the validation loss was applied. Adam was chosen as the optimizer, with a learning rate lr = 1e-5 and exponential decay rates β1 = 0.9 and β2 = 0.999. Finally, a batch size of 128 was used for 1 000 total epochs. Our final ANN model's performance was significantly better than that previously obtained with the machine learning approaches, including the Random Forest classifier. Indeed, using all 12 210 features of our benchmark, we got an F-score of 0.84, which is a good outcome compared to the results obtained in the competition for which the dataset was initially released (see Section III-A).
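The training settings listed above can be collected in a small configuration, together with a sketch of an early stopping rule driven by the validation loss. This is a framework-independent illustration, not the code used in our experiments: the class and key names are hypothetical, and the patience value, the Leaky ReLU slope, and the sequence of validation losses are assumptions, since they are not reported in the text.

```python
import math
from dataclasses import dataclass

# Settings reported in the text (key names are hypothetical).
TRAIN_CONFIG = {
    "optimizer": "adam",
    "lr": 1e-5,
    "beta1": 0.9,
    "beta2": 0.999,
    "batch_size": 128,
    "max_epochs": 1000,
    "dropout": 0.10,
}


def leaky_relu(x: float, slope: float = 0.01) -> float:
    """Hidden-layer activation; the slope value is an assumption."""
    return x if x > 0 else slope * x


def sigmoid(x: float) -> float:
    """Output-layer activation, mapping a score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


@dataclass
class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` epochs."""
    patience: int = 20           # assumption: not reported in the text
    best: float = float("inf")
    bad_epochs: int = 0

    def step(self, val_loss: float) -> bool:
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Invented validation losses, for illustration only.
val_losses = [0.9, 0.8, 0.7, 0.71, 0.72, 0.73, 0.6]
stopper = EarlyStopping(patience=3)
stopped_epoch = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss):
        stopped_epoch = epoch
        break

print(stopped_epoch, stopper.best)
```

In this toy run, training halts after three consecutive epochs without improvement, so the later, lower loss is never reached. This illustrates the trade-off involved in choosing the patience.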
The performance obtained by the proposed ANN was also compared with other deep learning methods (see Section III-B3). The results are reported in Table 5. As can be seen, our approach outperformed the other approaches, even though the 1D-CNN model turned out to be promising as well.
We have also reported a time comparison between the methods (see Table 6) and briefly state some considerations regarding the execution time. Using the machine setup described in Section III-D, the proposed architecture needed 27 minutes to accomplish 1 000 epochs of training, while the inference time is 2.42 seconds. More specifically, Table 6 shows that our proposed architecture can obtain competitive accuracy together with time-efficient (and energy-efficient) performance. In fact, in terms of efficiency, it outperformed all the other deep learning methods except for TabNet, which is composed of only 26k trainable parameters.
Furthermore, as in our previous experiments, we explored the extent to which the original data dimensionality can be reduced without compromising the final classification performance. In particular, Fig. 5 shows the F-score obtained with the designed ANN classifier for different numbers of input features, as selected by the MI, GainR, InfoG, SU, and Corr ranking methods.
As can be seen, MI emerged again as the best selection technique, as also observed in the previous section for the machine learning approach. In particular, using 3 000 features selected by MI, the ANN classifier is able to reach the same performance achieved over the entire feature set, as also detailed in Fig. 6. Specifically, Fig. 6a shows the confusion matrix obtained with the proposed artificial neural network trained on all 12 210 features. In comparison, the matrix obtained by training the network on only 3 000 features (as selected by MI) is shown in Fig. 6b.
Fig. 6a shows the results without feature selection, while Fig. 6b shows the results over a subset of 3 000 features, as selected by the Mutual Information (MI) method.
Let us finally discuss the robustness of the proposed method from two points of view: the possibility of using it as a real-time application in the healthcare domain and the results obtained. First, we consider the proposed architecture suitable for deployment in real-time systems since it is relatively compact, consisting of 3 756 386 trainable parameters. Second, we have shown that the method's performance with a reduced number of features, as selected by a proper feature selection technique, is as high as that obtained using all features. Unfortunately, as anticipated in Section II, a direct comparison of the results obtained in our work with the state of the art is not possible due to the different characteristics of the datasets used. However, taking as a reference some works based on the analysis of datasets containing hematological features, it is possible to point out that the results we have obtained are better than or in line with them. For example, Brinati et al. [31] reported an accuracy between 82% and 86% on a dataset of blood parameters they proposed [40]; Alves et al. [35] obtained an F-score of 76% on the dataset presented in [41], which consists of SARS-CoV-2 RT-PCR parameters and blood tests.
Compared with these results, we believe that the performance obtained by our proposed ANN can be deemed satisfactory, considering the high dimensionality of the data explored and the intrinsic difficulty of the related classification task.

V. CONCLUDING REMARKS AND FUTURE RESEARCH DIRECTIONS
The goal of this work was to make a contribution to the fields of machine and deep learning for the detection of COVID-19 from blood test data. To this end, we investigated and tested several approaches on a recently proposed public dataset, which proved very challenging.
First, we provided a comparative analysis of several machine learning algorithms in terms of different performance metrics. Second, deep learning techniques were also explored, leading to the proposal of an ANN architecture specifically designed for this task. Third, several feature selection techniques were investigated to reduce the dimensionality of the considered benchmark, thus allowing the construction of more efficient prediction models.
As previously discussed, Random Forest turned out to be the best-performing machine learning technique, with a rate of 73% correct predictions on the COVID-19 positive class. The proposed deep learning strategy offered a significant improvement over the machine learning approach, correctly classifying 87% of the positive instances. Our analysis also revealed, for both Random Forest and ANN models, that the original number of features can be significantly reduced, through a preliminary feature selection step, without compromising the final classification performance. In particular, among the considered feature selection techniques, Mutual Information performed consistently better in our experiments.
Based on the different investigations conducted, we firmly believe that AI-based approaches have great potential to provide even better results in this context. This could be achieved through a deeper analysis in multiple directions. A wider range of learning algorithms can be considered, as well as further architectural solutions for the deep learning approach. Furthermore, a deeper understanding of the interdependencies and correlations among the features could help improve the final classification results.
Indeed, the ranking methods considered here are the most efficient choice to reduce the data dimensionality, but they are not designed to capture the relationships among the features and cannot handle feature redundancy. More sophisticated selection strategies could be adopted, even relying on different selection algorithms at different stages of the selection process (e.g., initially reducing the data dimensionality through an efficient ranking approach and then further refining the search through a wrapper approach capable of optimizing the performance of a given classifier) [87], [88]. Ensemble selection methods have also recently been investigated in high-dimensional settings with promising results [89], [90].
From a broader perspective, the explored case study highlights the challenges that still need to be addressed in the context of artificial intelligence applied to COVID-19 diagnosis. In particular, the intrinsic difficulty of building high-performing classifiers from a single type of data, such as the blood sample data here considered, prompts the development of multimodal machine learning models that can process and fuse information from different data sources [34].
Finally, although artificial intelligence techniques have demonstrated remarkable performance in many diagnostic tasks, it is important to consider that medical applications require, more than others, a high level of accountability and transparency. Therefore, explanations for algorithm decisions and predictions are increasingly needed to justify their reliability and offer high interpretability for the end users [91], [92]. We also intend to explore these aspects in our future work.