ECOVNet: a highly effective ensemble based deep learning model for detecting COVID-19

The goal of this research is to develop and implement a highly effective deep learning model for detecting COVID-19. To achieve this goal, in this paper, we propose an ensemble of Convolutional Neural Network (CNN) based on EfficientNet, named ECOVNet, to detect COVID-19 from chest X-rays. To make the proposed model more robust, we have used one of the largest open-access chest X-ray data sets named COVIDx containing three classes—COVID-19, normal, and pneumonia. For feature extraction, we have applied an effective CNN structure, namely EfficientNet, with ImageNet pre-training weights. The generated features are transferred into custom fine-tuned top layers followed by a set of model snapshots. The predictions of the model snapshots (which are created during a single training) are consolidated through two ensemble strategies, i.e., hard ensemble and soft ensemble, to enhance classification performance. In addition, a visualization technique is incorporated to highlight areas that distinguish classes, thereby enhancing the understanding of primal components related to COVID-19. The results of our empirical evaluations show that the proposed ECOVNet model outperforms the state-of-the-art approaches and significantly improves detection performance with 100% recall for COVID-19 and overall accuracy of 96.07%. We believe that ECOVNet can enhance the detection of COVID-19 disease, and thus, underpin a fully automated and efficacious COVID-19 detection system.


Introduction
Coronavirus disease 2019 (COVID-19) is a contagious disease caused by the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2). The disease was first detected in Wuhan City, Hubei Province, China in December 2019; it was linked to contact with a seafood wholesale market and quickly spread to all parts of the world [1]. The World Health Organization (WHO) declared the COVID-19 outbreak a pandemic on March 11, 2020. As of September 20, 2020, the virus has overwhelmed the world and affected millions of lives: there have been 30,675,675 confirmed COVID-19 cases and 954,417 confirmed deaths [2]. To limit the spread of the infection, affected countries pursue many strategies, such as encouraging people to maintain social distancing and a hygienic lifestyle, strengthening infection screening through multi-functional testing, and seeking mass vaccination to curb the pandemic ahead of time. Reverse transcription-polymerase chain reaction (RT-PCR) is a molecular diagnostic method; however, it has certain limitations: accurate detection of suspected patients is delayed because the testing procedure requires strict clinical laboratory conditions [3], and false-negative results may have a greater impact on the prevention and control of the disease [4].
To make up for the shortcomings of RT-PCR testing, researchers around the world are seeking a fast and reliable diagnostic method to detect COVID-19 infection. The WHO and Wuhan University Zhongnan Hospital each issued quick guides [5,6], suggesting that, in addition to detecting clinical symptoms, chest imaging can be used to evaluate the disease in order to diagnose and treat COVID-19. In [7], the authors contributed a detailed guideline for medical practitioners on using chest radiography and computed tomography (CT) to screen and assess the disease progression of COVID-19 cases. Although CT scans have higher sensitivity, they also have drawbacks, such as high cost and the need for high doses of radiation during screening, which exposes pregnant women and children to greater radiation risks [8]. On the other hand, diagnosis based on chest X-ray appears to be a promising solution for COVID-19 detection and treatment. In [9], Ng et al. remarked that the pulmonary manifestation of COVID-19 infection is clearly delineated by chest X-ray images. Moreover, in the case of artificial intelligence (AI)-based disease recognition systems, medical practitioners have already emphasized chest X-rays for exploring potential symptoms of COVID-19 infection, such as opaque patterns in the lungs [10].
The purpose of this study is to improve the accuracy of COVID-19 detection from chest X-ray images. In this context, we consider a CNN-based architecture, since CNNs are well known for their top-notch recognition performance in image classification and detection. For medical image analysis, high detection accuracy together with clinically meaningful findings is a primary goal, and in recent years CNN-based architectures have been widely used to surface critical findings in medical imaging; this is why we built the proposed architecture on CNNs. To achieve this purpose, this paper presents a novel CNN-based architecture called ECOVNet, exploiting the cutting-edge EfficientNet [11] family of CNN models together with ensemble strategies. The pipeline of the proposed architecture starts with data augmentation, then optimizes and fine-tunes the pre-trained EfficientNet models, creating snapshots of each model. After that, the generated model snapshots are integrated into an ensemble, using soft voting and hard voting, to make predictions. The motivation for using EfficientNets is that they are known for their high accuracy while being smaller and faster than the best existing CNN architectures. Moreover, ensemble techniques have proven effective in prediction, since they produce a lower error rate than the prediction of a single model. Owing to the limited number of COVID-19 images currently available, diagnosing COVID-19 infection is challenging, so a visual explanation approach is applied for further analysis. In this regard, we use a gradient-based class activation mapping algorithm, Grad-CAM [12], to provide explanations of the predictions and to identify relevant features associated with COVID-19 infection. The key contributions of this paper are as follows:
• We propose a novel CNN-based architecture that includes front-end pre-trained EfficientNets for feature extraction and model snapshots to detect COVID-19 from chest X-rays.
• Following the assumption that the decisions of multiple radiologists are considered in a final diagnosis, we incorporate an ensemble into the proposed architecture to make predictions, thus making the evaluation of the system credible and fair.
• We visualize a class activation map through Grad-CAM to explain the prediction as well as to identify the critical regions in the chest X-ray.
• Finally, we compare our architecture with state-of-the-art architectures through empirical observations to highlight the effectiveness of the proposed architecture in detecting COVID-19.
The remainder of the paper is arranged as follows: Section 2 discusses related work. Section 3 explains the details of the data set and the proposed network architecture, as well as its adaptation to the detection of COVID-19 infection. The results of our experimental evaluation are presented in Section 4. Finally, Section 5 concludes the paper and highlights future work.

Related Works
Due to the need to identify COVID-19 infections faster, applications of CNN-based AI systems that can speed up the analysis of various medical images are booming. Chest X-ray screening is a well-established technique with a long history in image-based diagnosis systems for detecting pneumonia [13]. In addition, pneumonia and COVID-19 share certain infection characteristics (such as the occurrence of severe lung infections). This has inspired researchers around the world to explore the use of chest X-rays with various feature extraction methods, especially CNN-based approaches, to detect COVID-19, which can help when the health care system is exhausted by the pandemic.
An in-depth survey of the application of CNN technology in COVID-19 detection and automatic lung segmentation is given in [14], with a focus on analysis using X-ray and computed tomography (CT) images. Halgurd et al. [15] tested a modified CNN model as well as a modified pre-trained AlexNet [16] on their own chest X-ray and CT scan data set, reporting an accuracy of up to 98% with the modified pre-trained model and 94.1% with the modified CNN. Narin et al. [17] achieved a highest accuracy of 98% by using three pre-trained models with ImageNet [18] weights (ResNet50 [19], Inception v3 [20], and Inception-ResNet v2 [21]), taking into account two types of images, i.e., COVID-19 and normal images. A completely new CNN framework named COVID-Net and a large chest X-ray benchmark data set, COVIDx, were introduced by Wang et al. [22]. The proposed COVID-Net obtained a best test accuracy of 93.3%, and the authors studied how COVID-Net makes predictions using an interpretability method. In [23], state-of-the-art CNN architectures (such as VGG19 [24], MobileNetV2 [25], Inception [20], Xception [26], and Inception-ResNet v2 [21]) pre-trained on ImageNet were trained using transfer learning, and different neural network architectures were used on top of each architecture. The results produced by the fine-tuned models demonstrated the proof-of-principle for using CNNs with transfer learning to extract radiological features.
The authors of [27] prepared a dataset of 5,000 chest X-rays from publicly available datasets, and a subset of their benchmark was used to develop a model by fine-tuning four popular pre-trained CNNs (ResNet18 [19], ResNet50, SqueezeNet [28], and DenseNet121 [29]). The proposed model was evaluated on the remaining images and produced promising results in terms of sensitivity and specificity. Eduardo et al. [30] proposed a new deep learning framework that extends the EfficientNet [11] series, which is well known for its excellent prediction performance and lower computational cost. Their experimental evaluation showed noteworthy classification performance, especially in COVID-19 cases. Next, Farooq et al. [31] proposed a method called COVID-ResNet, which uses a three-step technique (gradually adjusting the image size, automatic learning rate selection, and then fine-tuning the pre-trained ResNet50 architecture) to improve model performance. A CNN model called DarkCovidNet [32] was proposed for the automatic detection of COVID-19 from chest X-ray images; the method carried out two types of classification, a binary classification (COVID and No-Findings) and a multi-class classification (COVID, No-Findings, and pneumonia). Finally, the authors provided an intuitive explanation through a heat map, so that it can assist the radiologist in finding the affected area on the chest X-ray. In another study, Ucar et al. [33] proposed a fine-tuned lightweight SqueezeNet, in which the hyper-parameters were obtained through Bayesian optimization, and the performance of the proposed network was superior to some of the existing CNNs for detecting COVID-19 cases.
Another study [34] proposed DeepCOVIDExplainer, an explainable CNN-based method that relies on a neural ensemble technique followed by highlighting of class-discriminating regions, for the automatic detection of COVID-19 cases from chest X-ray images. Afshar et al. [35] contributed a COVID-19 detection system using a Capsule Network (CapsNet) [36] based CNN architecture, and claimed that their system is effective not only in statistical performance but also in having fewer trainable parameters than its counterparts. Asif et al. [37] proposed a model named CoroNet that uses the Xception architecture pre-trained on the ImageNet dataset and trained on a benchmark created from two publicly available data sets; they measured classification performance for three and four classes, obtaining overall accuracies of 95% and 89.6%, respectively. In [38], Mohammad et al. proposed a CNN-based model called CovXNet, which uses depthwise dilated convolution. The model was first trained on non-COVID pneumonia images, and the acquired learning was then transferred, with some additional fine-tuning layers, to a second training stage on a smaller number of chest X-rays related to COVID-19 and other pneumonia cases. Since features are extracted from different resolutions of X-rays, a stacking algorithm is used in the prediction process, and for multi-class classification the accuracy of CovXNet is 90.3%. In another study, Haghanifa et al. [39] prepared a new benchmark by amassing the largest public dataset of COVID-19 chest X-ray images from diverse sources and developed a fine-tuned model based on DenseNet121 using CheXNet [40] weights, reporting statistical performance along with a visual marker to localize the critical region of COVID-19 cases. Another CNN-based modular architecture, named PDCOVIDNet (parallel-dilated convolution-based COVID-19 detection network), was proposed by Nihad et al. [41]; it consists of several blocks, such as a parallel stack of multi-layer filter blocks in a cascade with a classification and visualization …
It can be seen from the literature review that most methods make prediction decisions based on the output of a single model, and only a few methods [34,38] rely on an ensemble. An ensemble brings a benefit: it can reduce prediction errors, thus making the model more versatile. One of the previous studies [34] used an ensemble of heterogeneous models, i.e., VGG19, ResNet18, and DenseNet161, but that approach has limitations: each model requires a separate training session, and each individual model has many parameters to train. Another method [38] performs an ensemble on a single model architecture but uses various image resolutions; for each image resolution it creates a separate model and stacks them for prediction, which incurs computational overhead. In contrast to the ensemble approaches, an advanced custom CNN architecture, COVID-Net [22], was implemented and tested on a large COVID-19 benchmark, but due to its large number of parameters, the computational overhead of this model is high. To address these problems, we use a lightweight but effective model, EfficientNet, since it is 8.4 times smaller and 6.1 times faster than the best existing CNN [11]. In addition, to mitigate the computational cost of training multiple deep learning models for ensemble prediction, we force large changes in the model weights through a cyclic learning rate, creating model snapshots within the same training run, and then apply an ensemble to make the proposed architecture more robust.

Methodology
In this section, we briefly discuss our approach. First, we present the benchmark data set and the data augmentation strategy used in the proposed architecture. Next, we outline the proposed ECOVNet architecture, including network construction using a pre-trained EfficientNet, the training method, and the model ensemble strategies. Finally, to make disease detection more acceptable, we integrate decision visualizations that highlight the pivotal regions with visual markers.

Dataset
In this sub-section, we briefly introduce the benchmark data set, named COVIDx [22], that is used in our experiments.
To the best of our knowledge, this data set is one of the largest open-access benchmark data sets in terms of the number of COVID-19 infection cases, with a total of 14,914 images for training and 1,579 images for testing, comprising three categories: COVID-19, normal, and pneumonia. Figure 1 shows sample images from the benchmark dataset, including COVID-19, normal, and pneumonia. Table 2 depicts the distribution of images in the training and testing sets. To generate COVIDx, the authors [22] used five different publicly accessible data repositories:
• From the COVID-19 Image Data Collection [42], they gathered non-COVID19 pneumonia and COVID-19 cases.
• The Radiological Society of North America (RSNA) Pneumonia Detection Challenge dataset [46] was employed for normal and non-COVID19 pneumonia cases.

Data Augmentation
Data augmentation is a process performed on the fly during training to expand the training set. As long as the semantic information of an image is preserved, transformations of the images in the training data set can be used for data augmentation. Data augmentation can improve the performance of the model by mitigating overfitting, thereby greatly improving generalization. Although CNN models have properties such as partial translation invariance, augmentation strategies such as translating images can often considerably enhance generalization capabilities [56]. Data augmentation strategies provide various alternatives, each of which presents the images in a different way so that important features are exposed, thereby improving the performance of the model. We considered the following transformations for augmentation during training: horizontal flip, rotation, shear, and zoom.

Proposed ECOVNet Architecture
In this section, we briefly describe the proposed ECOVNet architecture. After augmenting the COVIDx dataset, we used a pre-trained EfficientNet as a feature extractor. This step ensures that the pre-trained EfficientNet can extract and learn useful chest X-ray features and can generalize well. EfficientNets are a family of models obtained from a base model, i.e., EfficientNet-B0. In the description of the proposed architecture we illustrate EfficientNet-B0; however, during the experimental evaluation, we also considered other models. The output features from the pre-trained EfficientNet are fed to our proposed custom top layers, which consist of two fully connected layers, each integrated with batch normalization, activation, and dropout. We generated several snapshots in a single training session, and then combined their predictions into an ensemble prediction. In addition, a visualization approach, which can qualitatively analyze the relationship between input examples and model predictions, was incorporated into the final part of the proposed model. Fig. 2 shows a graphical presentation of the proposed ECOVNet architecture using a pre-trained EfficientNet.
EfficientNets are a series of models (namely EfficientNet-B0 to B7) that are derived from the baseline network (often called EfficientNet-B0) by scaling it up. The advantages of EfficientNets are twofold: they not only provide higher accuracy, but also improve the efficiency of the model by reducing the number of parameters and FLOPs (floating-point operations). By adopting a compound scaling method across all dimensions of the network, i.e., width, depth, and resolution, EfficientNets have attracted attention due to their superior prediction performance. Here, width refers to the number of channels in a layer, depth refers to the number of layers in the CNN, and resolution refers to the size of the input image. The intuition behind compound scaling is that scaling any single dimension of the network (width, depth, or image resolution) can increase accuracy, but for larger models the accuracy gain diminishes. To scale the dimensions of the network systematically, compound scaling uses a compound coefficient that controls how many more resources are available for model scaling, and the dimensions are scaled by the compound coefficient in the following way [11]:

$$\text{depth: } d = \alpha^{\phi}, \qquad \text{width: } w = \beta^{\phi}, \qquad \text{resolution: } r = \gamma^{\phi} \tag{1}$$

where φ is the compound coefficient, and α, β, and γ are the scaling coefficients of each dimension, which can be determined by a grid search. After the scaling coefficients are determined, they are applied to the baseline network (EfficientNet-B0) to obtain the desired target model size. For instance, in the case of EfficientNet-B0, when φ = 1 is set, a grid search yields the optimal values α = 1.2, β = 1.1, and γ = 1.15, under the constraint α · β² · γ² ≈ 2 [11]. By changing the value of φ in Equation 1, EfficientNet-B0 can be scaled up to obtain EfficientNet-B1 to B7.
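As a rough illustration of the compound scaling rule in Equation 1, the following Python sketch computes the depth, width, and resolution multipliers for a few values of φ, using the coefficients quoted above. The function names are hypothetical, and the sketch only illustrates the rule; the multipliers actually used by the released EfficientNet-B1 to B7 variants may differ from these raw powers.

```python
# Sketch of compound scaling (Equation 1). Helper names are hypothetical.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution coefficients from the text


def compound_scale(phi, alpha=ALPHA, beta=BETA, gamma=GAMMA):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return alpha ** phi, beta ** phi, gamma ** phi


def cost_factor(phi, alpha=ALPHA, beta=BETA, gamma=GAMMA):
    """Approximate growth in FLOPs: (alpha * beta^2 * gamma^2) ** phi."""
    return (alpha * beta ** 2 * gamma ** 2) ** phi


if __name__ == "__main__":
    for phi in range(4):
        d, w, r = compound_scale(phi)
        print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
              f"resolution x{r:.2f}, cost x{cost_factor(phi):.2f}")
```

Because α · β² · γ² is close to 2, each unit increase of φ roughly doubles the computational cost in this sketch.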
The feature extractor of the EfficientNet-B0 baseline architecture is composed of several mobile inverted bottleneck convolution (MBConv) [25,57] blocks with built-in squeeze-and-excitation (SE) [58], batch normalization, and Swish activation [59], as integrated into EfficientNet. Compared with conventional convolution, EfficientNet's core building block, MBConv, has been shown to be more accurate in image classification while reducing parameters and FLOPs by an order of magnitude. Table 3 shows the detailed information of each layer of the EfficientNet-B0 baseline network.
EfficientNet-B0 consists of a total of 16 MBConv blocks that vary in several aspects, for instance, kernel size, feature map expansion ratio, and reduction ratio. The complete workflows of the MBConv1,k3×3 and MBConv6,k3×3 blocks are shown in Figure 3. Both MBConv1,k3×3 and MBConv6,k3×3 use depthwise convolution with a kernel size of 3×3 and a stride of s. In both blocks, batch normalization, activation, and convolution with a kernel size of 1×1 are integrated. A skip connection and a dropout layer are also incorporated in MBConv6,k3×3, but not in MBConv1,k3×3. Furthermore, the feature map expansion in MBConv6,k3×3 is six times that of MBConv1,k3×3, and the same holds for the reduction ratio in the SE block: for MBConv1,k3×3 and MBConv6,k3×3, r is fixed to 4 and 24, respectively. Note that MBConv6,k5×5 performs the same operations as MBConv6,k3×3, except that it applies a kernel size of 5×5 instead of 3×3.
The rationale for using pre-trained weights is that the imported model already has sufficient knowledge of the broader aspects of the image domain. As has been shown in several studies [17,60], using pre-trained ImageNet weights in state-of-the-art CNN models remains promising even when the problem area (namely COVID-19 detection) is considerably different from the one in which the original weights were obtained. The optimization process fine-tunes the initial pre-trained weights in the new training phase so that the pre-trained model can be fitted to a specific problem domain, such as COVID-19 detection.

Classifier
The final output of the EfficientNet architecture is a globally averaged feature vector, which is followed by a classifier. To perform the classification task, we used a two-layer MLP (usually called fully connected (FC) layers), which processes the EfficientNet features through two neural layers (each with 512 nodes). Between the FC layers, we included batch normalization, activation, and a dropout layer. Batch normalization greatly accelerates the training of deep networks and increases the stability of neural networks [61]. It makes the optimization process smoother, resulting in more predictable and stable gradient behavior, thereby speeding up training [62]. As the activation function, we chose Swish, which is defined as [59]:

$$f(x) = x \cdot \sigma(x) \tag{2}$$

where $\sigma(x) = (1 + \exp(-x))^{-1}$ is the sigmoid function. Swish has been reported to consistently outperform other activation functions, including the Rectified Linear Unit (ReLU) [63], the most successful and widely used activation function, on deep networks applied to a variety of challenging fields such as image classification and machine translation. Swish has several characteristics, such as one-sided boundedness at zero, smoothness, and non-monotonicity, which play an important role in its improved performance [59]. After the activation, we integrated a Dropout [64] layer, which is one of the preeminent regularization methods for reducing overfitting and making better predictions. This layer randomly drops certain FC layer nodes, which means removing the selected nodes along with all their incoming and outgoing weights. Nodes in each layer are dropped with a probability p, independently of other layers, where p can be chosen using a validation set or simply set to a default value (e.g., p = 0.5). In this study, we used a dropout rate of 0.3. Next, the classification layer uses the softmax activation function to convert the activations from the previous FC layers into class scores that determine the class of the input chest X-ray image as COVID-19, normal, or pneumonia. The softmax activation function is defined in the following way:

$$s(y_i) = \frac{\exp(y_i)}{\sum_{j=1}^{C} \exp(y_j)} \tag{3}$$

where C is the total number of classes. This normalization constrains the outputs to sum to 1, so the softmax output $s(y_i)$ can be interpreted as the probability that the input belongs to the i-th class. In the training process, we apply the categorical cross-entropy loss function, which uses the softmax output of the classification layer to measure the loss between the true probability of a category and the predicted probability of that category. The categorical cross-entropy loss function is defined as

$$L(y, \hat{y}) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\,\log(\hat{y}_{ij}) \tag{4}$$

where $y_{ij}$ is 1 if sample i belongs to class j and 0 otherwise, and $\hat{y}_{ij}$ is the predicted (softmax) probability that sample i belongs to class j. The total number of input samples is denoted as N, and C is the total number of classes, that is, C = 3 in our case.
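To make the definitions of Swish, softmax, and categorical cross-entropy concrete, here is a minimal pure-Python sketch. The function names are hypothetical, the clipping constant `eps` and the max-subtraction in the softmax are numerical-stability assumptions, and the sketch is not the Keras implementation used in the experiments.

```python
import math


def swish(x):
    """Swish activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def softmax(logits):
    """Softmax over a list of class scores (max-subtracted for stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy over N samples; y_true rows are one-hot vectors."""
    loss = 0.0
    for true_row, pred_row in zip(y_true, y_pred):
        # Clip probabilities at eps (an assumption) to avoid log(0).
        loss -= sum(t * math.log(max(p, eps)) for t, p in zip(true_row, pred_row))
    return loss / len(y_true)


if __name__ == "__main__":
    probs = softmax([2.0, 1.0, 0.1])   # scores for COVID-19, normal, pneumonia
    print(probs, sum(probs))
    print(categorical_cross_entropy([[1, 0, 0]], [probs]))
```

With C = 3 classes, a uniform prediction gives a loss of log 3, which is a handy reference point when reading training curves.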

Model Snapshots and Ensemble Prediction
The main idea behind building model snapshots is to train one model while continuously reducing the learning rate so that it reaches a local minimum, and then to save a snapshot of the current model's weights. The learning rate is then sharply increased to escape from the current local minimum. This process is repeated until the specified number of cycles is completed. One of the prominent methods for creating model snapshots for CNNs is to collect multiple models during a single training run with cyclic cosine annealing [65]. The cyclic cosine annealing method starts from the initial learning rate, gradually decreases it to the minimum, and then rapidly increases it again. The learning rate of cyclic cosine annealing in each epoch is defined as:

$$\alpha(t) = \frac{\alpha_0}{2}\left(\cos\left(\frac{\pi \,\mathrm{mod}(t-1, \lceil T/M \rceil)}{\lceil T/M \rceil}\right) + 1\right) \tag{5}$$

where α(t) is the learning rate at epoch t, $\alpha_0$ is the initial learning rate, T is the total number of training iterations, and M is the number of cycles. The weights at the bottom of each cycle are regarded as the weights of a snapshot model. The following learning rate cycle starts from these weights, but allows the learning algorithm to converge to a different solution, thereby generating diverse snapshot models. After completing M cycles of training, we obtain M model snapshots $s_1, \ldots, s_M$, each of which can be used in the ensemble prediction.
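The schedule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `t` is treated as a 1-based epoch index, the function names are hypothetical, and the example values (25 epochs, 5 cycles, initial rate 1e-4) are the settings reported in the hyper-parameter sub-section.

```python
import math


def cyclic_cosine_lr(t, total_epochs, n_cycles, lr0):
    """Learning rate at 1-based epoch t under cyclic cosine annealing."""
    cycle_len = math.ceil(total_epochs / n_cycles)
    position = (t - 1) % cycle_len          # position inside the current cycle
    return lr0 / 2.0 * (math.cos(math.pi * position / cycle_len) + 1.0)


def snapshot_epochs(total_epochs, n_cycles):
    """Epochs at the end of each cycle, where a snapshot would be saved."""
    cycle_len = math.ceil(total_epochs / n_cycles)
    return [e for e in range(1, total_epochs + 1) if e % cycle_len == 0]


if __name__ == "__main__":
    for epoch in range(1, 11):
        print(epoch, cyclic_cosine_lr(epoch, total_epochs=25, n_cycles=5, lr0=1e-4))
    print(snapshot_epochs(25, 5))
```

In this sketch the rate restarts at the initial value at the first epoch of every cycle and is lowest at the last epoch of the cycle, which is where the snapshot is taken.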
An ensemble of model snapshots is more effective than a structure based on a single model: compared with the prediction of a single model, the ensemble prediction reduces the generalization error, thereby improving prediction performance. We experimented with two ensemble strategies, i.e., hard ensemble and soft ensemble, to consolidate the predictions of the snapshot models and classify chest X-ray images as COVID-19, normal, or pneumonia. Both the hard ensemble and the soft ensemble use the softmax outputs of the last m (m ≤ M) models, since these models tend to have the lowest test error. We also apply class weights when obtaining the softmax scores, before applying the ensemble. Let $O_i(x)$ be the softmax score of the test sample x for the i-th snapshot model. Using the hard ensemble, the prediction of the i-th snapshot model is defined as

$$H_i(x) = \arg\max\, O_i(x) \tag{6}$$

The final ensemble then aggregates the votes for the classification labels (i.e., COVID-19, normal, and pneumonia) across the snapshot models and predicts the category with the most votes. The soft ensemble, on the other hand, averages the predicted class probabilities of the last m snapshot models, defined as

$$S(x) = \frac{1}{m}\sum_{i=0}^{m-1} O_{M-i}(x) \tag{7}$$

Finally, the class label with the highest averaged probability is used as the prediction.
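The difference between the two strategies can be illustrated with a small Python sketch. The function names and the example scores are hypothetical, ties in the hard vote are broken toward the lowest class index (an assumption, since the text does not specify tie-breaking), and class weighting is omitted for brevity.

```python
from collections import Counter

CLASSES = ["COVID-19", "normal", "pneumonia"]


def argmax(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


def hard_ensemble(snapshot_scores):
    """Majority vote over the per-snapshot predicted labels."""
    votes = Counter(argmax(scores) for scores in snapshot_scores)
    top = max(votes.values())
    # Assumption: ties go to the lowest class index.
    return min(label for label, count in votes.items() if count == top)


def soft_ensemble(snapshot_scores):
    """Average the softmax scores, then take the most probable class."""
    m = len(snapshot_scores)
    n_classes = len(snapshot_scores[0])
    mean = [sum(s[c] for s in snapshot_scores) / m for c in range(n_classes)]
    return argmax(mean)


if __name__ == "__main__":
    # Hypothetical softmax outputs of the last m = 3 snapshots for one image.
    scores = [[0.90, 0.05, 0.05], [0.40, 0.50, 0.10], [0.40, 0.50, 0.10]]
    print("hard:", CLASSES[hard_ensemble(scores)])
    print("soft:", CLASSES[soft_ensemble(scores)])
```

In this made-up example the two strategies disagree: two of three snapshots vote for "normal", while the averaged probabilities favour "COVID-19", because the soft ensemble takes each snapshot's confidence into account.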

Hyper-Parameters Adjustment
Fine-tuned hyper-parameters have a great impact on the performance of the model because they directly govern its training. Moreover, well-tuned hyper-parameters can help avoid overfitting and yield a generalized model. Since we are dealing with an imbalanced data set, the proposed architecture is likely to face the problem of overfitting. To address it, we use L1L2 weight regularization with coefficients 1e-5 and 1e-3 in the FC layers. Dropout is another successful regularization technique that has been integrated into the proposed architecture, specifically in the FC layers with p = 0.3, to suppress overfitting. In the experiments on the proposed architecture, we used the Adam [66] optimizer, which can converge faster. When creating snapshots, we set the number of epochs to 25, the mini-batch size to 8, the initial learning rate to 1e-4, and the number of cycles to 5, thus providing 5 snapshots for each model, on which we build the ensemble prediction.

Visual Explanations using Grad-CAM
Although CNN-based modular architectures provide encouraging recognition performance for image classification, it remains challenging to reveal why and how they produce such impressive results. Due to their black-box nature, it is sometimes problematic to apply them in a medical diagnosis system, where we need an interpretable system, i.e., visualization as well as an accurate diagnosis. Despite these challenges, researchers are still seeking an efficient visualization technique, since it can bring the most critical facts in the health-care setting into focus, assist medical practitioners in distinguishing correlations and patterns in imaging, and make data analysis more effective. In the field of detecting COVID-19 from chest X-rays, some early studies focused on visualizing the behavior of CNN models in distinguishing between different categories (such as COVID-19, normal, and pneumonia), so that they can produce explanatory models. In our proposed model, we applied a gradient-based approach named Grad-CAM [12], which measures the gradients of the feature maps in the final convolution layer of a CNN model for a target image, to highlight the critical regions as class-discriminative saliency maps. In Grad-CAM, the gradients flowing back to the final convolutional layer of a CNN model are globally averaged to calculate the target class weight of each filter. The Grad-CAM heat map is a weighted combination of feature maps, followed by a ReLU activation. The class-discriminative saliency map $L^c$ for the target image class c is defined as follows [12]:

$$L^{c}_{i,j} = \mathrm{ReLU}\left(\sum_{k} w^{c}_{k} A^{k}_{i,j}\right) \tag{8}$$

where $A^{k}_{i,j}$ denotes the activation map for the k-th filter at spatial location (i, j), and ReLU keeps only the features that have a positive influence on the target class. The target class weight of the k-th filter is computed as:

$$w^{c}_{k} = \frac{1}{Z}\sum_{i}\sum_{j}\frac{\partial Y^{c}}{\partial A^{k}_{i,j}} \tag{9}$$

where $Y^c$ is the probability of classifying the target category as c, and Z denotes the total number of pixels in the activation map.
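The two Grad-CAM equations can be traced on a toy example. The sketch below assumes the activation maps and the gradients of the class score with respect to them are already available as nested lists (in practice they would come from the deep learning framework); the function name and the numbers are hypothetical.

```python
def grad_cam(activations, gradients):
    """Compute Grad-CAM filter weights and heat map.

    activations[k][i][j]: activation of filter k at location (i, j).
    gradients[k][i][j]:   gradient of the class score w.r.t. that activation.
    """
    n_filters = len(activations)
    height, width = len(activations[0]), len(activations[0][0])
    z = height * width
    # Filter weights: global average of the gradients.
    weights = [sum(sum(row) for row in gradients[k]) / z for k in range(n_filters)]
    # Heat map: ReLU of the weighted sum of activation maps.
    heatmap = [
        [max(0.0, sum(weights[k] * activations[k][i][j] for k in range(n_filters)))
         for j in range(width)]
        for i in range(height)
    ]
    return weights, heatmap


if __name__ == "__main__":
    acts = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]]]
    grads = [[[0.5, 0.5], [0.5, 0.5]], [[-1.0, -1.0], [-1.0, -1.0]]]
    w, cam = grad_cam(acts, grads)
    print(w)    # per-filter weights
    print(cam)  # non-negative heat map
```

In a real pipeline the resulting low-resolution heat map is typically upsampled to the input size and overlaid on the chest X-ray.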

Experiments and Results
In this section, we present the results and consider several experimental settings to analyze the results of the proposed ECOVNet and to explore the robustness of the model. The performance of the proposed model on the three-class classification problem is compared with some state-of-the-art methods. The three-class classification problem is to determine whether a chest X-ray image belongs to the COVID-19, normal, or pneumonia category. All our programs are written in Python, and the software stack is composed of Keras with the TensorFlow backend and scikit-learn.

Data Set Settings
The benchmark data set and the augmentation approach used in the experiments are briefly described in sub-section 3.1 and sub-section 3.2. We configure two test sets, namely the imbalanced and the balanced test set. The imbalanced test set is the original test set that comes with COVIDx, while the balanced test set is also drawn from the COVIDx test set, but with 100 randomly chosen images each for normal and pneumonia and the fixed set of 100 COVID-19 test images. During training, we set the training and validation ratios to 90% and 10%, respectively. The entire image distribution of training, validation, and testing is shown in Table 4. We use the pre-trained EfficientNet as a feature extractor; as described in sub-section 3.3, EfficientNet is a series of models obtained by choosing different scaling factors. In our experiments, we consider the EfficientNet B0 to B5 base models; however, their input shapes differ. Table 5 lists the input shape of each base model as well as the total number of parameters during training.

Evaluation Metrics
In order to evaluate the performance of the proposed method, we considered the following evaluation metrics: accuracy, precision, recall, F1 score, confidence interval (CI), receiver operating characteristic (ROC) curve and area under the curve (AUC). The definitions of accuracy, precision, recall and F1 score are as follows:

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{10} \]

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{11} \]

\[ \mathrm{Recall} = \frac{TP}{TP + FN} \tag{12} \]

\[ F_1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{13} \]

where $TP$ stands for true positive, while $TN$, $FP$, and $FN$ stand for true negative, false positive, and false negative, respectively. Since the benchmark data set is not balanced, the F1 score may be a more substantial evaluation metric. For example, COVID-19 has 589 images, while the non-COVID classes, that is, normal and pneumonia, have 8,851 and 6,053 images, respectively. In addition, a 95% CI is considered, as it is a more practical metric compared with specific performance indicators. It can increase the level of statistical significance and reflect the reliability of the results in the problem domain. Finally, we plotted the ROC curve to display the results and measured the area under the ROC curve (usually called AUC) to provide information about the effectiveness of the model. The ROC curve plots the True Positive Rate (TPR), i.e., recall, against the False Positive Rate (FPR), and FPR is defined as

\[ \mathrm{FPR} = \frac{FP}{FP + TN} \tag{14} \]
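The metrics above can be computed from a confusion matrix in a one-vs-rest fashion, as in the following sketch. Two points are assumptions rather than details from the paper: the confusion matrix is a toy example, and the 95% CI uses the normal approximation to the binomial, acc ± 1.96·sqrt(acc·(1 − acc)/n), since the paper does not state which interval was used.

```python
import numpy as np

def per_class_metrics(cm, cls):
    """Precision, recall, F1 and FPR for one class (rows = true, cols = predicted)."""
    cm = np.asarray(cm, dtype=float)
    tp = cm[cls, cls]
    fp = cm[:, cls].sum() - tp      # predicted as cls, but another true class
    fn = cm[cls, :].sum() - tp      # true cls, but predicted as another class
    tn = cm.sum() - tp - fp - fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)
    return precision, recall, f1, fpr

def accuracy_with_ci(cm, z=1.96):
    """Overall accuracy with a normal-approximation confidence interval (assumed method)."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    acc = np.trace(cm) / n
    half_width = z * np.sqrt(acc * (1.0 - acc) / n)
    return acc, acc - half_width, acc + half_width

# Toy 3-class confusion matrix (illustrative values, not results from the paper).
toy_cm = np.array([[90, 5, 5],
                   [2, 95, 3],
                   [8, 0, 92]])
precision, recall, f1, fpr = per_class_metrics(toy_cm, cls=0)
acc, lo, hi = accuracy_with_ci(toy_cm)
# The same error rate on ten times as many samples gives a narrower interval.
acc_big, lo_big, hi_big = accuracy_with_ci(10 * toy_cm)
```

Under this approximation the interval width shrinks with the square root of the sample size, which is why a smaller test set yields a wider interval at the same accuracy.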

Prediction performance of proposed ECOVNet
Table 6 shows the predictions of the proposed ECOVNet without any ensemble. In this comparison, ECOVNet with EfficientNet-B5 pre-trained weights yields better results than the other base models both with and without image augmentation. This reflects the fact that feature extraction using an optimized model that considers three aspects, namely higher depth, higher width, and a larger image resolution, can capture more and finer details, thereby improving classification accuracy. Without augmentation, ECOVNet's accuracy reaches 96.26% on the imbalanced test set, and its performance is slightly better on the balanced test set, reaching 96.33% accuracy. On the other hand, with augmentation, ECOVNet has the same best accuracy, i.e., 94.68%, for both the imbalanced and balanced test sets. Moreover, in Table 6, we used a 95% CI for accuracy as the measure to analyze the uncertainty inherent in ECOVNet. A tight CI means higher precision, while a wide CI indicates the opposite. As we can see, for the imbalanced test set, the CI is within a narrow range, but for the balanced case, the CI is wider because it considers a smaller amount of test data. Furthermore, Figure 4 shows the training loss of ECOVNet considering EfficientNet-B5. Note that a value in bold indicates that the method has statistically better performance than the other methods.
We implement two ensemble strategies, hard ensemble and soft ensemble, and each ensemble considers a total of 5 model snapshots that are generated during a single training. Table 7 and Table 8 show the classification results for the different evaluation metrics without augmentation and with augmentation, respectively, for both the ensemble methods and no ensemble. As shown in Table 7, in handling COVID-19 cases, the ensemble methods are significantly better than the no-ensemble method. More specifically, the recall reaches its maximum value, which is 100%, and this result largely demonstrates the robustness of our proposed architecture. In addition, for the balanced test set, the soft ensemble appears to be the preferred method because its precision, recall, and F1 score are all 100%. Comparing the two ensembles, since the average softmax score of each category affects the direction of the desired result, the soft ensemble performs better than the hard ensemble. Owing to the uneven distribution of the imbalanced test set, the F1 score may be more reliable than the accuracy. It can be clearly seen from Table 7 that, for the imbalanced test set, the ensemble methods improve the F1 score of COVID-19 compared with no ensemble; the F1 scores of the hard ensemble and soft ensemble are 95.57% and 96.15%, respectively. With augmentation, in Table 8, we see that the ensemble methods present better results than no ensemble, with the exception that the hard ensemble is slightly better than the soft ensemble. However, both with and without augmentation, the accuracy is estimated with more precision on the imbalanced test set than on the balanced test set; the confidence interval is tighter when computed from the imbalanced test set since it covers a larger sample.
It can be seen from Figure 5 that, since the ensemble methods combine the predictions from the model snapshots, they tend to improve the classification accuracy of the proposed ECOVNet. In addition, it is clear that when we consider deeper base models, the classification accuracy of the proposed ECOVNet increases. More specifically, in the case of the soft ensemble, the base models EfficientNet-B4 and EfficientNet-B5 provide the same accuracy, which is also the highest, that is, 97%. Meanwhile, when the base model is moderately deep, the hard ensemble and the soft ensemble have comparable results. On the other hand, when the model is deeper, the soft ensemble shows its superiority. For COVID-19 cases, Figure 6 shows the precision, recall, and F1 score of ECOVNet on the balanced test data with the soft ensemble. When comparing the precision of ECOVNet, we see that, except for EfficientNet-B0, almost all base models show significantly better performance. In terms of recall, the value gradually increases as we consider deeper base models, although it decreases by 4% from ECOVNet-B0 to ECOVNet-B1. The same observation holds for the F1 score, with a drop of 1% from ECOVNet-B0 to ECOVNet-B1.
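The difference between the two ensemble strategies can be made concrete with a small sketch. It assumes the softmax outputs of the model snapshots are stacked in an array of shape (snapshots, samples, classes); the function names, the use of 3 snapshots instead of 5, the toy probabilities, and the tie-breaking rule of the hard vote (lowest class index wins) are illustrative assumptions, not details from the paper.

```python
import numpy as np

def soft_ensemble(probs):
    """Average the softmax scores over snapshots, then take the best class."""
    return probs.mean(axis=0).argmax(axis=1)

def hard_ensemble(probs):
    """Majority vote over the per-snapshot predicted labels."""
    votes = probs.argmax(axis=2)                 # shape (snapshots, samples)
    n_classes = probs.shape[2]
    # One-hot encode each vote and count the votes per class for every sample.
    counts = np.eye(n_classes)[votes].sum(axis=0)   # shape (samples, classes)
    return counts.argmax(axis=1)                 # ties go to the lowest class index

# Toy softmax outputs: 3 snapshots, 2 samples, 3 classes (illustrative only).
toy_probs = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.2, 0.7]],
    [[0.4, 0.6, 0.0], [0.1, 0.2, 0.7]],
    [[0.4, 0.6, 0.0], [0.1, 0.2, 0.7]],
])
hard_pred = hard_ensemble(toy_probs)
soft_pred = soft_ensemble(toy_probs)
print(hard_pred, soft_pred)
```

For the first toy sample, two snapshots weakly prefer class 1 while one snapshot strongly prefers class 0, so the hard vote selects class 1 and the averaged scores select class 0. This shows how the averaged softmax score can change the outcome relative to a plain vote.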
It is often useful to analyze the ROC curve to reflect the classification performance of a model, since the ROC curve summarizes the trade-off between the true positive rate and the false positive rate across different probability thresholds. In Figure 7, the ROC curves show the micro-average, macro-average, and class-wise AUC scores obtained by the proposed ECOVNet, where each panel refers to the ROC curves of an individual model snapshot. The AUC scores of all categories are consistent, indicating that the prediction of the proposed model is stable. However, the AUC scores in the third and fourth snapshots are better than those of the other snapshots. It is evident from Figure 7 that the area under the curve is relatively similar for all classes, but COVID-19's AUC is higher than that of the other classes, i.e., 1. Figure 8 shows the confusion matrices of the proposed ECOVNet considering the base model EfficientNet-B5. In Figure 8, it is clear that for COVID-19, the ensemble methods provide much better results than those without ensemble. For the balanced and imbalanced test sets, these methods provide results that are 3-4% better than those without ensemble. ECOVNet is also able to detect normal and pneumonia chest X-rays: for the imbalanced test set, it provides the same performance with or without an ensemble, although it shows a slightly better performance when classifying the balanced test set with no ensemble. Finally, we can say that ECOVNet is an effective architecture for detecting COVID-19 cases from chest X-ray images, because it focuses on the features that help distinguish COVID-19 from other types (such as normal and pneumonia).
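A class-wise ROC curve and its AUC can be obtained by treating one class as positive and the rest as negative, then sweeping the decision threshold over the predicted scores. The sketch below does this with plain NumPy on toy scores; the function names and data are hypothetical. The paper's stack includes scikit-learn, whose `sklearn.metrics.roc_curve` and `roc_auc_score` provide equivalent functionality, though the paper does not state which routines were used.

```python
import numpy as np

def roc_points(y_true, scores):
    """FPR and TPR at every distinct score threshold (one-vs-rest, binary labels)."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    fpr, tpr = [0.0], [0.0]
    # Sweep thresholds from the highest score to the lowest.
    for t in np.unique(scores)[::-1]:
        pred = scores >= t
        tpr.append((pred & y_true).sum() / y_true.sum())
        fpr.append((pred & ~y_true).sum() / (~y_true).sum())
    return np.array(fpr), np.array(tpr)

def auc_trapezoid(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy example 1: the positive class is perfectly separated from the rest.
auc_perfect = auc_trapezoid(*roc_points([1, 1, 1, 0, 0, 0],
                                        [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]))
# Toy example 2: one negative sample is scored above one positive sample.
auc_partial = auc_trapezoid(*roc_points([1, 0, 1, 0],
                                        [0.9, 0.8, 0.7, 0.1]))
print(auc_perfect, auc_partial)
```

An AUC of 1 for a class means every positive sample receives a higher score than every negative sample, so some threshold separates that class without error.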

Comparison between ECOVNet and the other models
Table 9 compares the proposed method with recent methods for detecting COVID-19 from chest X-rays, and we see that the proposed method is superior to the other methods. Some previous methods (namely COVID-Net [22], EfficientNet-B3 [30], and DeepCOVIDExplainer [34]) used ImageNet weights and the COVIDx data set; one of them, DeepCOVIDExplainer, also considered two ensemble strategies. On the other hand, CovXNet [38] used an ensemble method and a transfer learning scheme from non-COVID chest X-rays, while using training and testing data sets other than COVIDx. One of the previous methods [30] showed comparable performance to our proposed method in terms of accuracy because it can reach 100%. Another method, called PDCOVIDNet [41], achieved an accuracy of 96.5%, which falls short of our proposed method by a small margin. We have observed that the proposed approach consistently exhibits better classification accuracy across different combinations of ensemble with an imbalanced and a balanced set of test data, considering a larger number of COVID-19 chest X-rays. When comparing the results of the two ensemble methods, we observed that the soft ensemble showed impressive results in classifying COVID-19, and the accuracy and recall were both 100%.

Visualization using Grad-CAM
In our evaluation, we applied the Grad-CAM visual interpretation method to depict the salient areas that ECOVNet emphasizes in the classification decision for a given chest X-ray image. Accurate and definitive salient region detection is crucial for analyzing classification decisions as well as for assuring the trustworthiness of the results. In order to locate the salient area, the feature weights, with intensities related to feature importance, are used to create a two-dimensional heat map that is superimposed on the given input image. Figure 9 shows the Grad-CAM localization results of ECOVNet for each model snapshot. The salient area marks the region in the lung that has been identified when a given image is classified as COVID-19, normal, or pneumonia. As shown in Figure 9, for COVID-19, a ground-glass opacity (GGO) occurs along with some consolidation, thereby partially covering the markings of the lungs. Hence, it leads to lung inflammation in both the upper and lower zones of the lung. When examining the heat maps generated from the COVID-19 chest X-ray, it can be seen that the heat maps created from snapshot 2 and snapshot 3 point to the salient area (such as GGO). However, in the case of the normal chest X-ray, no lung inflammation is observed, so there is no significant area, which makes it easily distinguishable from the other classes, i.e., COVID-19 and pneumonia. Likewise, it can be observed from the chest X-ray for pneumonia that there are GGOs in the middle and lower parts of the lungs. The heat maps generated for the pneumonia chest X-ray are localized in the salient regions with GGO, but the 4th snapshot model appears to fail to identify the salient regions, as its heat map highlights areas outside the lung. Accordingly, we believe that the proposed ECOVNet provides sufficient information about the inherent causes of the COVID-19 disease through an intuitive heat map, and this type of heat map can help AI-based systems interpret the classification results achieved by the proposed architecture.

Conclusion and Future Work
In this paper, we proposed a new CNN-based modular architecture, ECOVNet, which can effectively detect COVID-19, together with class activation maps, from one of the largest publicly available chest X-ray data sets, i.e., COVIDx. In this work, a highly effective CNN structure (the EfficientNet base model with ImageNet pre-trained weights) is used as the feature extractor, while the pre-trained weights are fine-tuned for the COVID-19 detection task. Also, ensemble predictions improve performance by exploiting the predictions obtained from the proposed ECOVNet model snapshots. From empirical evaluations, it is observed that the soft ensemble of the proposed ECOVNet model snapshots outperformed the other state-of-the-art methods. Finally, we performed a visualization study to locate significant areas in the chest X-ray through the class activation map, which is used to classify the chest X-ray into its expected category. Moreover, we believe that our findings will make a useful contribution to the control of COVID-19 infection and to the widespread acceptance of automated applications in medical practice.
While this work helps reduce the effort of health professionals' radiological assessment, our further plan is to extend it into a fully functional application using the guidelines of the design research paradigm [67,68]. Such a modern methodological lens could offer further directions both for developing innovative clinical solutions and for contributing associated knowledge to the body of relevant literature. Furthermore, we will develop a mobile application that can predict whether the disease will become deadly or not by analyzing a patient's short-term historical chest X-ray pattern if the patient manifests any clinical symptoms related to COVID-19. Therefore, this might be a new way to prevent and stop the spread of the COVID-19 pandemic.

Figure 1: Some image labels available in the benchmark dataset [22]

Figure 2: Graphical representation of the proposed ECOVNet architecture

Figure 3: The basic building block of EfficientNet-B0. All MBConv blocks take the height, width, and channel of h, w, and c as input. C is the output channel of the two blocks. (Note that MBConv = Mobile Inverted Bottleneck Convolution, DW Conv = Depth-wise Convolution, SE = Squeeze-Excitation, Conv = Convolution)

Figure 5: Comparison between ensemble and no ensemble of the proposed ECOVNet in terms of accuracy for the balanced test data.

Figure 6: Precision, Recall, F1 score of the proposed ECOVNet for the balanced test data with soft ensemble considering COVID-19 cases

Figure 8: Confusion matrices of the proposed ECOVNet considering EfficientNet-B5 as a base model. In the confusion matrices, the predicted labels, such as COVID-19, Normal, and Pneumonia, are marked as 0, 1, and 2, respectively.

Figure 9: Grad-CAM visualization for the proposed ECOVNet considering the base model EfficientNet-B5. A total of 5 (five) model snapshots were generated during the training process.

Table 3:
Instead of randomly initializing the network weights, we instantiate the EfficientNet model with ImageNet pre-trained weights, thereby accelerating the training process. Transferring ImageNet pre-trained weights has proven highly successful in the field of image analysis, since ImageNet comprises more than 14 million images covering diverse classes.

Table 4: Image partition of the training, validation, and testing sets for the balanced and imbalanced tests

Table 5: Image resolution and total number of parameters of ECOVNet considering the base models (B0 to B5) of EfficientNet

Table 6: Prediction performance of the proposed ECOVNet without using ensemble. (a) w/o aug. = without augmentation; (b) w/ aug. = with augmentation

Table 9: Comparison of the proposed ECOVNet with other state-of-the-art methods on COVID-19 detection. (a) Imbalanced test set; (b) Balanced test set