A deep learning segmentation-classification pipeline for X-ray-based COVID-19 diagnosis

Over the past year, the AI community has constructed several deep learning models for diagnosing COVID-19 based on the visual features of chest X-rays. While deep learning researchers have commonly focused much of their attention on designing deep learning classifiers, only a fraction of these same researchers have dedicated effort to including a segmentation module in their system. This is unfortunate since other applications in radiology typically require segmentation as a necessary prerequisite step in building truly deployable clinical models. Differentiating COVID-19 from other pulmonary diseases can be challenging as various lung diseases share common visual features with COVID-19. To help clarify the diagnosis of suspected COVID-19 patients, we have designed our deep learning pipeline with a segmentation module and ensemble classifier. Following a detailed description of our deep learning pipeline, we present the strengths and shortcomings of our approach and compare our model with other similarly constructed models. While doing so, we focus our attention on widely circulated public datasets and describe several fallacies we have noticed in the literature concerning them. After performing a thorough comparative analysis, we demonstrate that our best model can successfully obtain an accuracy of 91 percent and sensitivity of 92 percent.


Introduction
The artificial intelligence (AI) research community has recently invested considerable time and resources into developing deep learning models based on chest radiographs for the purpose of diagnosing coronavirus disease 2019 (COVID-19). Many medical institutions are finding themselves in difficult positions when faced with large numbers of patients presenting with symptoms of the illness, and new diagnostic models are needed to relieve this pressure. Recently, deep learning techniques have come to permeate "the entire field of medical image analysis" [1]. With deep learning methodologies, AI researchers have made considerable progress in improving the quality of automated diagnostic medical imaging systems. Because of their pioneering work, many promising directions are now opening up that could potentially help diagnose COVID-19. There are several kinds of COVID-19 tests currently on the market. Molecular tests (polymerase chain reaction tests), antigen tests (rapid tests), and antibody tests (blood tests) have seen widespread use. Of these three tests, the real-time reverse transcription-polymerase chain reaction (RT-PCR) test is considered the present gold standard for diagnosing COVID-19 [2]. RT-PCR tests are not perfect, however, and problems have been reported concerning the tests' overall sensitivity [3]. Luo et al. [4], in a study including 4653 participants, found that RT-PCR tests have a sensitivity of around 71%. Kucirka et al. [5], in a Johns Hopkins study, reported that an RT-PCR test's sensitivity varies widely over the 21 days after a patient is first exposed. They also noted that "although the false-negative rate is minimized 1 week after exposure, it remains high at 21%" [5]. Kucirka et al. [5] therefore ultimately found that it takes about a week from the time of exposure for RT-PCR testing to deliver its lowest false-negative rate.
This leaves room for other tests that may work better during the period when RT-PCR tests are less accurate. Radiological testing is a leading contender in the research community for such a scenario. Research has shown it to be useful during the period in which a patient has obtained a negative RT-PCR test [6]. It can therefore be used in conjunction with other tests and possibly give more clarity regarding a patient's current diagnosis.
Many researchers have focused on using computerized tomographic (CT) scanners in diagnosing COVID-19 because of their ability to analyze three-dimensional information. As a modality for COVID-19 testing, however, CT scanners are expensive resources to employ. For a system to be practical during a pandemic, a cheaper and faster solution needs to be available to deal with the sheer number of patients waiting for a test. Chest X-rays (CXRs) are the alternative modality typically employed by radiologists in imaging thoracic illnesses such as COVID-19. Some advantages of chest X-rays for this particular application include the portability of an X-ray scanner, the requirement of only cleaning a single surface when reusing it on patients, the speed of the diagnostic measurements required, and the overall expense of the procedure. Given these significant advantages, it is entirely practical for researchers to explore the use of X-ray technology in COVID-19 testing.
Before discussing how a proposed deep learning pipeline can diagnose COVID-19 in suspected patients, we first need to understand the features in a patient's lungs that require imaging. Rousan et al. [7], in a study involving 88 patients, found that ground-glass opacities (GGO) were the most frequent finding in COVID-19 X-rays. The chest X-rays of normal patients generally show a black background within a patient's lungs. In chest X-rays with GGOs, radiologists find lighter colored patches of haziness that are indicative of a possible pathology. Rousan et al. [7] also found that consolidation increases in severity in the X-rays of many COVID-19 patients up until approximately the second week of the illness. This aligns well with another study performed by Song et al. [8], who found that consolidations do indeed increase as the disease progresses. Consolidation in radiography represents areas of a patient's lungs that are filled with extraneous liquids (pus, blood, and water) and solid materials (stomach contents or cells) that do not exist in healthy lungs. In comparing the number of COVID-19 X-rays with consolidation vs. GGOs, consolidation tends to occur less frequently. It is still, however, the second most frequent visual cue mentioned in the radiological literature. Fig. 1 shows the chest X-rays of two older patients with COVID-19 showing the aforementioned symptoms.
Many deep learning X-ray studies up until now have solely focused on classification in diagnosing COVID-19 in X-rays. While excellent research has occurred in this space, the number of articles dealing with COVID-19 X-ray segmentation has been quite limited. Segmentation is an important preprocessing technique that can shield a classifier from unnecessary pixel information when categorizing an image. In this way, many imaging-based studies in other computer vision applications have found that proper segmentation has increased the overall accuracies of their classifiers [9][10][11]. It is vital, therefore, to employ segmentation when training a COVID-19 classifier. The following lists the main contributions of our work:
• Our pipeline employs an advanced segmentation network (ResUnet [12])
• We have made available a COVID-19 X-ray classification dataset that is larger than all similar datasets we have found in the literature
• Our overall pipeline makes use of majority voting and weighted average ensembles
• We have included a thorough comparative analysis that benchmarks our model's performance against other deep learning models in the literature
Our work begins in Section 2 with an overview of various research studies that have constructed segmentation-classification deep learning pipelines to diagnose COVID-19. In Section 3, we thereafter present our proposed deep learning pipeline's architecture, showing the internal details of our segmentation and classification modules. Following a discussion of our pipeline's architecture, in Section 4 we present the experimental results of our overall system. In Section 4, we additionally present a detailed comparative analysis of our pipeline versus other well-constructed models in the literature. Concluding in Section 5, we discuss potential future directions for this research.

Related works
There are many papers in the literature that use deep learning classification and segmentation for making medical predictions [13][14][15][16][17]. Our main focus in this review, however, is on COVID-19 X-ray articles that combine a segmentation unit and classifier [18][19][20][21][22][23][24][25][26]. We did so in order to see how our deep learning pipeline compares with the studies that are the most related to our own. Several public datasets for segmenting chest X-rays are cited in the articles below. These articles also mention a number of public and private datasets that were prepared specifically for COVID-19 classification. The works below are all studies that influenced how we ultimately implemented our final system.
Rajaraman et al. [18] created a segmentation-classification deep learning pipeline to diagnose COVID-19 that included an ensemble of iteratively pruned CNNs. The authors trained several CNN models (VGG-16/VGG-19 [27], Inception-V3 [28], Xception [29], DenseNet-201 [30], etc.) after their dataset had been preprocessed by a U-Net [31] segmentation module that included a Gaussian dropout layer [32]. The authors tried many different ensemble strategies and, in the end, found that weighted averaging produced the best results. The authors unfortunately listed Kermany et al.'s [33] dataset as being contained in their dataset, which likely contributed to exaggerated evaluation metrics. A dataset is biased when only certain of its categories contain images of children's lungs.
Fig. 1. Lungs of 2 older COVID-19 patients revealing (a) bilateral consolidation and (b) ground glass opacities (white arrows) and linear opacity (black arrow) [7].
Alom et al. [19] designed an X-ray-based system that diagnoses COVID-19 with a NABLA-N segmentation network [34] and an Inception Residual Recurrent Convolutional Neural Network (IRRCNN). Their X-ray model is first trained on a normal vs. pneumonia dataset, as more images are publicly available for making such a comparison. After obtaining acceptable performance on this separate task, they fine-tune their model on a smaller COVID-19 dataset. This segmentation-classification pipeline ultimately achieves a final test accuracy of 84.67 percent. The authors of this paper, unfortunately, used Paul Mooney's chest X-ray dataset on Kaggle [35] to obtain pneumonia and normal images for training their classifiers. This contains images from Kermany et al.'s dataset [33] of children's lungs. Their classifier was intended for identifying COVID-19 in adult lungs. Training a classifier intended for adult lungs on images of children's lungs is incorrect, however, and caused Alom et al.'s [19] classifier to be biased. They used normal images from children but COVID-19 images from adults in their dataset. Their normal vs. COVID-19 classifier could therefore incorrectly use the features of adult lungs to identify COVID-19.
Yeh et al. [20] combined several public datasets as well as datasets from several private medical institutions when training their segmentation-classification pipeline. Unlike the two previous studies, the authors of this work appear to have constructed an unbiased dataset. They do, however, reference several private datasets that are unavailable to the research community. It is therefore impossible to directly compare our pipeline against their work. They initially trained a U-Net segmentation model [31] as a preprocessing step to exclude non-informative regions of CXRs from their model. Yeh et al. [20] trained this segmentation unit on the Montgomery County X-ray Set and the Shenzhen Hospital X-ray Set [36]. After training their segmentation unit, they obtained a dice similarity coefficient (DSC) of 88 percent. Following this preprocessing step, they trained a DenseNet-121 [30] classifier on segmented images and obtained a COVID-19 sensitivity of 83.33% on their validation set. Their hold-out test set contained 306 COVID-19 images, and their final COVID-19 sensitivity on this test set was 81.8 percent.
Horry et al. [37] developed a segmentation-classification deep learning pipeline for diagnosing COVID-19 that was trained and tested on a relatively small preprocessed dataset. While Horry et al.'s [37] final curated dataset was not biased, it contained only 100 COVID-19 images, so it is difficult to know how well their work would translate to a larger number of images. Horry et al. [37] additionally removed images from their dataset which contained features they believed their model would have difficulty classifying. The authors' segmentation model was not based on a deep learning model. They simply used OpenCV's GrabCut function and reasoned "that the lung area could be considered the foreground of the X-ray image" [37]. After preprocessing, they trained five base models with their segmented images (VGG-16 [27], VGG-19 [27], Inception-V3 [28], Xception [29], and ResNet-50 [38]). Their best base model (VGG-19 [27]) ultimately achieved an F1-score of 81 percent.
Wehbe et al. [21] published a deep learning pipeline that was trained on the largest COVID-19 X-ray dataset we have found reported in the literature. The authors developed their pipeline by working in collaboration with a private US medical institution. Their large classification dataset is therefore inaccessible to the public at this time. This dataset also does not appear to have been biased by the inclusion of inappropriate data. The authors were aware of the need to divide their training and test sets by patient number. The authors chose to train their U-Net-based segmentation module [31] on the Montgomery [36] and JSRT [39] datasets. Wehbe et al. [21] in their study also created an ensemble model to detect COVID-19. Their final model contained a weighted average of 6 popular CNNs (Inception [28], Inception-ResNet [40], Xception [29], ResNet-50 [38], DenseNet-121 [30], and EfficientNet-B2 [41]). An important reason to include this paper in our discussion is that the authors performed an interesting study that we have not yet seen reproduced elsewhere. The authors commissioned a study involving five radiologists to determine the effectiveness of experts in the field in differentiating COVID-19 from other illnesses. This is important when trying to approximate Bayes error prior to building a deep learning model. Wehbe et al. [21] compared the results of their model with the performance of expert radiologists and found that their model slightly outperforms them. Their final binary weighted average model obtained a final accuracy of 82% on their test set. The expert radiologists manually obtained a consensus accuracy of 81% on the same images. These final results agree closely with one another.
Tabik et al. [22] created a dataset dubbed the "COVID-GR-1.0" dataset which was used in training their "COVID-SDNet" model in diagnosing COVID-19. Their dataset was divided in a novel fashion whereby COVID-19 positive patients were subdivided into four risk categories (normal-PCR+, mild, moderate, and severe). The authors created this dataset to see how many weak COVID-19 cases would be analyzed correctly by a prospective classifier. More often than not, COVID-19 datasets contain a disproportionate number of severe COVID-19 patients. Typically, patients who end up undergoing a radiological examination are patients experiencing increased complications. COVID-GR-1.0 is a small but well-curated dataset that has utility in that it can be employed to determine a classifier's efficacy on weak COVID-19 images. Tabik et al.'s [22] pipeline consisted of a segmentation module and a classification module that performs "inference based on the fusion of CNN twins" [22]. The authors used a U-Net [31] segmentation module and trained it on the Montgomery County X-ray dataset [36], the Shenzhen Hospital X-ray dataset [36], and the RSNA Pneumonia CXR challenge dataset [42]. They calculated the smallest rectangle around each segmented image and added a border containing 2.5% of the pixels around each rectangle to obtain their final masked images. The X-rays they segmented were, therefore, never fully masked. The authors did not want to exclude regions of these images that could contain useful diagnostic information. After performing binary classification on their segmented COVID-GR-1.0 dataset, Tabik et al.'s [22] classifier obtained a COVID-19 sensitivity of 72.59%.
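The rectangle-plus-border cropping described above can be illustrated with a short NumPy sketch. This is not Tabik et al.'s [22] code: the function name crop_to_lungs is hypothetical, and we assume here that the 2.5% border means 2.5% of the image height and width added on each side of the rectangle, which is only one possible reading of their description.

```python
import numpy as np

def crop_to_lungs(image, mask, margin=0.025):
    """Crop `image` to the smallest rectangle enclosing `mask`, plus a border.

    Assumption: the border is `margin` times the image height/width,
    added on every side and clipped to the image boundaries.
    """
    rows = np.flatnonzero(mask.any(axis=1))  # rows containing lung pixels
    cols = np.flatnonzero(mask.any(axis=0))  # columns containing lung pixels
    if rows.size == 0:
        return image  # empty mask: nothing to crop against
    pad_y = int(round(margin * image.shape[0]))
    pad_x = int(round(margin * image.shape[1]))
    top = max(rows[0] - pad_y, 0)
    bottom = min(rows[-1] + 1 + pad_y, image.shape[0])
    left = max(cols[0] - pad_x, 0)
    right = min(cols[-1] + 1 + pad_x, image.shape[1])
    return image[top:bottom, left:right]

# Toy example: a 200 x 200 "X-ray" with a rectangular lung mask.
image = np.arange(200 * 200).reshape(200, 200)
mask = np.zeros((200, 200), dtype=bool)
mask[50:100, 60:120] = True
cropped = crop_to_lungs(image, mask)
```

Because the crop is rectangular, pixels between and around the lungs survive, which matches the observation that these X-rays were never fully masked.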
Teixeira et al. [23] designed a segmentation-classification pipeline used to diagnose COVID-19 that consisted of a U-Net [31] and an InceptionV3 [28] CNN. Their U-Net [31] segmentation module was trained on images and masks that were hand-picked from a mixture of public datasets [36,39,43]. The number of image and mask pairings they chose from the Darwin V7 Labs [43] segmentation dataset (489) was significantly lower than the total number of pairings available in that dataset (6504). This approach appears to have allowed them to train their U-Net [31] to a higher dice similarity coefficient (0.982) than other segmentation units we have seen in the literature for this task. For classification they used the RYDLS-20 dataset [44]. They had developed this dataset in a previous work and added further images to it to create a new "RYDLS-20-v2" dataset. They tried several classifiers but ultimately found that an InceptionV3 [28] CNN gave them their best overall multiclass performance metrics.
Oh et al. [24] published a novel "patch-based deep neural network architecture with random patch cropping" [24] for detecting COVID-19. Their model begins with a preprocessing step whereby a fully convolutional DenseNet-103 segments incoming chest X-rays. The authors thereafter use a ResNet-18 on the segmented images for classification. The authors generate 100 randomly cropped patches from the previously segmented chest X-rays and feed those patches through ResNet-18s as well. In this process, the authors have selected a sufficient number of lung patches to ensure that the entire surface area of the segmented lungs is covered. The authors of this paper unfortunately selected images from Kermany et al. [33] to include in their work and thereby biased their classifier.
Abdullah et al. [25] implemented a segmentation-classification pipeline that used a unique segmentation unit and ensemble model for classification. Their segmentation unit, the Res-CR-Net, is a new kind of segmentation model the authors introduced in a previous study [45] that does not contain the same encoder-decoder structure that the popular U-Net [31] contains. According to the authors, the Res-CR-Net [45] "combines residual blocks based on separable, atrous convolutions [46,47] with residual blocks based on recurrent NNs [48]" (Abdallah et al. [45]). The authors trained their Res-CR-Net [45] on several open-source sets of masks and images [36,39,43]. They acquired their classification dataset from the Henry Ford Health System (HFHS) hospital in Detroit. This private dataset contained 1417 COVID-negative patients and 848 COVID-positive patients. The authors used this dataset to train a unique hybrid convnet called the "CXR-Net" that contains a Wavelet Scattering Transform (WST) block [49,50], an attention block containing two MultiHeadAttention layers [51,52], and several convolutional residual blocks. This segmentation-classification pipeline ultimately achieved an accuracy of 79.3% and an F1 score of 72.3% on their test set.
Wang et al. [26] created a deep learning segmentation-classification pipeline for COVID-19 detection and severity assessment. After a CXR standardization module, the authors included a common thoracic disease module that was used to determine whether a patient is suffering from pneumonia. This is followed by segmentation and classification modules. Wang et al.'s [26] lung segmentation network was trained on X-ray scans that were resized to 512 × 512 images. They chose to use a DeepLabv3 segmentation architecture [47] after additionally training a U-Net [31] and a Fully Convolutional Network [53]. Their DeepLabv3 segmentation architecture obtained a DSC of 0.873 in the left lung field, 0.910 in the right lung field, 0.864 in the periphery of the left lung field, and 0.893 in the periphery of the right lung field. Across all categories this averages out to a total DSC of 0.885. Following this segmentation operation, the authors performed COVID-19 detection and severity assessments. Their COVID-19 detection module was trained on 1407 COVID-19 X-rays, 5515 viral pneumonia X-rays, and 10,961 "other" pneumonia X-rays. They evaluated their model on a test set with 164 COVID-19 CXRs and 630 other pneumonia CXRs. In the task of differentiating between COVID-19 and other X-rays, they ultimately obtained an accuracy of 91% and a COVID-19 sensitivity of 92%.
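Since the dice similarity coefficient (DSC) is the segmentation metric quoted throughout this review, a minimal NumPy sketch of it may be helpful. For a predicted mask A and a ground-truth mask B, DSC = 2|A ∩ B| / (|A| + |B|). The function name and the small smoothing term eps are our own illustrative choices; the second half simply reproduces the averaging of the four regional DSC values reported above.

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice similarity coefficient between two binary masks.

    `eps` (an assumption of this sketch) avoids division by zero
    when both masks are empty.
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

# Regional DSC values reported for the DeepLabv3 network of Wang et al. [26].
region_dsc = {
    "left lung field": 0.873,
    "right lung field": 0.910,
    "left periphery": 0.864,
    "right periphery": 0.893,
}
mean_dsc = sum(region_dsc.values()) / len(region_dsc)
print(f"mean DSC = {mean_dsc:.3f}")
```

The unweighted mean of the four regions is 0.885, in agreement with the figure quoted in the text.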

Segmentation dataset
To train our segmentation model, we looked at the datasets used in our literature review and decided to use the Darwin V7 Labs dataset [43]. We opted in favor of this dataset for three reasons. The first reason was its overall size. The Darwin V7 Labs dataset [43] is significantly larger (6504 images/masks) than most lung segmentation datasets. This being the case, we were able to train a robust segmentation unit that could accurately operate on a wide range of chest X-rays. Our second reason for using the dataset involved considerations involving the regions of the chest X-rays that its masks cover. Most masks in popular lung segmentation datasets include only the lungs. The Darwin V7 Labs [43] masks, however, included space next to the lungs. This left room for the heart to not be excluded. Initially, we did not give the heart and its size any consideration. Eventually, we came to realize, however, that cardiomegaly (an enlarged heart) is found in 29.9% of COVID-19 patients [54]. This symptom would not show up with most general-purpose lung segmentation masks. Our third reason for using the Darwin V7 Labs dataset [43] was that its masks were created for patients with a variety of conditions. Some masks were created for normal patients and others were created for patients exhibiting a variety of lung pathologies including COVID-19, bacterial pneumonia, viral pneumonia, Pneumocystis pneumonia, fungal pneumonia, and Chlamydophila pneumonia.
Some preprocessing was required on the Darwin V7 Labs dataset [43] so that it would work correctly with the segmentation unit we later built. The segmentation unit we chose for this study was a ResUnet [12], and this segmentation unit was designed for 256 × 256 images/masks. We needed to perform some data wrangling using the JSON files that were included with the dataset to ensure that images smaller than 256 × 256 were excluded. The JSON files provided with the Darwin V7 Labs dataset [43] had a field indicating which kind of X-ray each image was. We were therefore able to automate a process whereby we removed all of the lateral X-rays that were sparsely scattered throughout the dataset. Our dataset, therefore, solely contained posteroanterior (PA) X-rays. After preprocessing, we were left with 6377 mask/image pairings. We finally divided this preprocessed Darwin V7 Labs dataset [43] into the 80% training / 20% validation split shown in Table 1.
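The filtering and splitting steps above can be sketched in a few lines of Python. The record layout is an assumption made purely for illustration: the keys "width", "height", and "view" are hypothetical and do not reflect the actual schema of the Darwin V7 Labs JSON files, and the shuffle seed is arbitrary.

```python
import random

MIN_SIZE = 256  # the segmentation unit expects 256 x 256 inputs

def keep_record(record):
    """Keep non-lateral X-rays that are at least MIN_SIZE pixels on each side.

    The keys used here are hypothetical stand-ins for the dataset's JSON fields.
    """
    large_enough = record["width"] >= MIN_SIZE and record["height"] >= MIN_SIZE
    return large_enough and record["view"] != "lateral"

def train_val_split(records, val_fraction=0.2, seed=0):
    """Shuffle the records and return (training, validation) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

# Toy metadata: 10 usable PA images, one image that is too small, one lateral.
records = [{"id": i, "width": 1024, "height": 1024, "view": "PA"} for i in range(10)]
records.append({"id": 10, "width": 200, "height": 1024, "view": "PA"})
records.append({"id": 11, "width": 1024, "height": 1024, "view": "lateral"})

kept = [r for r in records if keep_record(r)]
train_records, val_records = train_val_split(kept)
```

In practice the records would be read from the dataset's JSON files with json.load rather than built in memory.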

Classification datasets
In medical imaging, the ability of a model to generalize to new examples typically is limited by the size of the training set. Because research into imaging COVID-19 is relatively recent, there is only approximately a year's worth of images that have been collected for classification purposes. For this reason, most published studies cannot present a model that can be deployed in a clinical setting. This study is no different, although in the work presented here we have taken significant steps forward in remediating several mistakes we have witnessed in the datasets of most papers.
When we first started gathering data, we realized that publicly available datasets generally have very little metadata available. That being the case, we decided to build a classifier that works on images alone. While doing so, we came to realize that the classification datasets in many studies have been incorrectly assembled. The majority of papers that have focused on differentiating COVID-19 from similar illnesses have cited using Kermany et al.'s [33] images in their dataset. As we have previously mentioned in our related works section, this dataset is composed of children who are suffering from various forms of bacterial and viral pneumonia. Since the lungs of small children have different features than adult lungs, we realized these images should not be included in our final classification dataset. This dataset likely poses more of a problem in biasing classifiers that are trained on nonsegmented images. The bones of adults are fused and the bones of children are not fused. This feature can easily be picked up by a CNN. Kermany et al.'s [33] dataset, however, would still pose an issue even with a segmentation unit, as the spatial features of adult lungs would differ from those of children's lungs. The classifiers in studies that include this dataset, therefore, can pick up features both internal and external to the lungs that are inconsistent between adults' and children's lungs. This has, unfortunately, led to the unfair biasing of several COVID-19 classifiers in the literature.
Another difficulty facing many studies is the lack of metadata accompanying images. At least some metadata is required alongside images to ensure that X-rays from individual patients do not get mixed in the training and test/validation sets. This problem of data leakage, we believe, is an issue in some studies we have reviewed. We find it disconcerting that most studies do not mention how they ensured the separation of patients' X-ray scans between training and test sets. An enthusiasm for finding the most images possible has resulted in a large number of images being harvested from medical research papers. Wang et al. [55] last year released a popular "COVIDx5" dataset [55] that has been able to avoid this pitfall. They also did not include Kermany et al.'s dataset [33] in their COVIDx5 dataset [55], and thereby avoided the improper bias that affects the classifiers of many studies. We additionally used this dataset because it was larger than many existing datasets and included 14,258 CXR images. In total, this consisted of 617 COVID-19 images, 8066 normal images, and 5575 pneumonia images. We added more COVID-19 images to the COVIDx5 dataset [55] because of the large COVID-19 class imbalance that existed within it. We hoped it would help to reduce overfitting in our classifier. We therefore added 922 COVID-19 images from the MIDRC-RICORD-1C database [56] and 2474 images from the BIMCV dataset [57]. In total, we constructed a dataset that contains 4013 COVID-19 images, 8066 normal images, and 5445 pneumonia images. The images from the COVIDx5 dataset [55] had the metadata needed to allow us to split these images into three sets (80% training / 10% validation / 10% test) without creating data leakage. The MIDRC-RICORD-1C dataset [56] and BIMCV dataset [57] were released long after the COVIDx5 dataset [55], and none of these datasets had any relation with one another.
It was therefore possible to split the COVID-19 images within these datasets into three sets without creating data leakage between them. The BIMCV [57] COVID-19 images were used entirely in the training set, and the COVIDx5 [55] COVID-19 images were split evenly between the validation and test sets. The MIDRC-RICORD-1C [56] COVID-19 images were used in all three sets. The MIDRC-RICORD-1C [56] images came with metadata. Fortunately, the metadata allowed us to divide the images from the MIDRC-RICORD-1C [56] dataset by patient between our training and validation/test sets. In this way we were able to create the datasets shown in Tables 2 and 3. We created both multiclass (3-class) and binary datasets to later compare our segmentation-classification pipeline with models that are reported in various other papers. It was important to produce our large COVID-19 dataset with both validation and test sets to help mitigate concerns that have been brought up by Wehbe et al. [21] concerning overfitting.
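To make the idea of a leakage-free split concrete, the following sketch divides images by patient rather than by image, so that all X-rays from one patient land in exactly one set. It is a generic illustration under stated assumptions and not the exact procedure used for the datasets above: the function name, the (image_id, patient_id) input format, and the seed are hypothetical, and the 80/10/10 fractions are applied to patients, so the image-level proportions are only approximate.

```python
import random
from collections import defaultdict

def split_by_patient(images, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split (image_id, patient_id) pairs into train/val/test lists of image ids.

    Patients, not images, are shuffled and partitioned, so no patient
    contributes images to more than one set.
    """
    by_patient = defaultdict(list)
    for image_id, patient_id in images:
        by_patient[patient_id].append(image_id)

    patients = sorted(by_patient)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(patients)

    n_train = int(len(patients) * fractions[0])
    n_val = int(len(patients) * fractions[1])
    groups = (
        patients[:n_train],
        patients[n_train:n_train + n_val],
        patients[n_train + n_val:],  # remaining patients form the test set
    )
    return tuple([img for p in group for img in by_patient[p]] for group in groups)

# Toy example: 20 patients with one to three X-rays each.
images = [(f"img{p}_{k}", f"patient{p}") for p in range(20) for k in range(1 + p % 3)]
train_ids, val_ids, test_ids = split_by_patient(images)
```

A quick sanity check after any such split is to confirm that the sets of patient identifiers in the three partitions are pairwise disjoint.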
In addition to the above dataset that we created, we also directly tested our model on another dataset that was used in Tabik et al.'s [22] study. We wanted to test our segmentation-classification pipeline against Tabik et al.'s [22] pipeline because their model worked on many of the same principles ours did. Their model used a segmentation algorithm that leaves more pixels surrounding the lungs in the images they segment. It has been difficult to find segmentation-classification pipelines like our own with unbiased and correctly constructed datasets. We were unable to find a study to directly compare ourselves against that uses a segmentation-classification pipeline and has a larger public dataset. Tabik et al.'s [22] study used a very conservative dataset that was meant to measure the performance of a deep learning model on weaker COVID-19 cases. Their "COVID-GR-1.0" binary dataset has 426 COVID-19 patients and 426 normal patients. The authors originally split this dataset into an 80% training / 20% test split. The dataset split in this format is shown in Table 4.

System design
We set out to construct our deep learning segmentation-classification pipeline by first choosing an appropriate segmentation module to preprocess our classification dataset. We tested the preprocessed Darwin V7 Labs dataset [43] on a host of different segmentation modules including the popular U-Net [31], the ResUnet [12], the ResUNet-a [58], the TransResUNet [59], and U-Nets containing VGG and DenseNet backbones. Before training, we required the images in our preprocessed V7 Labs dataset [43] to undergo additional preprocessing in the form of image augmentation. During augmentation, we set the rotation range to 180 degrees, width/height shift ranges to 30%, shear range to 20%, zoom range to 20%, and set horizontal flips to true. We ultimately found that our best results on the preprocessed Darwin V7 Labs dataset [43] were obtained using Zhang et al.'s ResUnet [12]. We therefore decided to move forward using this segmentation module in our pipeline. The ResUnet [12] on our preprocessed V7 Labs dataset ultimately obtained a dice similarity coefficient of 95.04% after 45 epochs. This segmentation module uses a 7-level architecture shown in Fig. 2 and Table 5. Its architecture can be understood by dividing it conceptually into three main parts. The first part of the architecture is an encoder that compresses the images input to the module into smaller and more compact representations. The last part of this architecture is the decoder, which "recovers the representations to a pixel-wise categorization, i.e., semantic segmentation" [12]. The middle part of the architecture serves as a bridge between the encoder at the ResUnet's [12] input and the decoder at the ResUnet's [12] output.
Having discussed the segmentation portion of the deep learning pipeline, we now move on to discussing the models that we have constructed for classifying COVID-19 images. All of our models were trained in TensorFlow 2.5. We ran our algorithms on an Intel Xeon CPU (2.30 GHz) using 26 GB RAM and a Tesla P100-PCIE-16GB GPU. We trained a DenseNet-201 [30], a ResNet-152 [38], and a VGG-19 [27] on our preprocessed multiclass training set. Each of these models was initialized with pretrained ImageNet weights. While designing each of these models, we added an extra dense layer and dropout layer to the end of each model. The extra dense layer of the VGG-19 [27] contained 4096 neurons. Each of the activation functions in these dense layers was set to a ReLU activation. The dropout layer added to the end of each model was set to a dropout rate of 10 percent. This helped each model to avoid overfitting and deal with the limited size of our dataset. We constructed both binary and multiclass versions of all of these classifiers. For the binary version of each classifier, we replaced the final softmax layer of each classifier with a single neuron containing a sigmoid activation function. For the multiclass version of each of these classifiers, our final layers contained three neurons each and had a softmax activation function.
Table 4
Number of images in the COVID-GR-1.0 training and test sets [22].

Prior to training our DenseNet-201 [30], ResNet-152 [38], and VGG-19 [27] CNNs, we noticed that a class imbalance existed in our multiclass and binary datasets. There were fewer COVID-19 images than images in the other categories of our datasets. We therefore needed to weight the loss functions of our classifiers to correct for this imbalance. We did this because we wanted to be sure that all of our categories were evenly represented. Prior to training our classifiers, we additionally used image augmentation on the segmented images from our ResUNet [12] to prevent overfitting in our classifiers. There is often limited data in most medical imaging problems, and we noticed this helped us to improve the accuracy of our classifiers. Using Keras's ImageDataGenerator class, we set the rotation range to 15%, the width/height range to 15%, the shear range to 15%, the zoom range to 15%, and horizontal flips to true. Our training and test set batch sizes were set to 32. In addition to segmenting and augmenting our classification datasets, we also normalized our data. In doing so, we ensured that the scaled data in each batch had a mean of zero and a standard deviation of one. After our initial preprocessing steps, we trained the final fully connected layers of each classifier alone for five epochs. We used the ADAM optimizer during this training and kept the ADAM optimizer set to its default settings. After performing this training, for each classifier we progressively unfroze each model's layers and fine-tuned our models at a fixed learning rate of 1 × 10⁻⁵ until each model reached its highest validation accuracy. Prior to unfreezing progressive layers in our models, we froze the moving mean and moving variance of the batches in our models' batch normalization layers to keep these parameters fixed to their pretrained ImageNet values. After training each of our CNNs to their optimal validation accuracies, we constructed a majority voting ensemble and a weighted average ensemble that combined all of our classifiers together. We constructed both a binary version and a multiclass version of each type of ensemble classifier. An illustration showing our overall deep learning pipeline can be observed in Fig. 5. The ensembles used in our deep learning pipeline are illustrated in Figs. 3 and 4.
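The loss weighting used to counter the class imbalance described above can be sketched as follows. The text does not specify the exact weighting scheme, so this example assumes the common inverse-frequency ("balanced") heuristic, in which a class weight equals the number of samples divided by the number of classes times the class count. The function name and the label counts are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Per-class loss weights that grow as a class becomes rarer.

    Assumed heuristic: weight = n_samples / (n_classes * class_count).
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical label list in which COVID-19 is the minority class.
labels = ["covid"] * 2 + ["normal"] * 4 + ["pneumonia"] * 6
weights = inverse_frequency_weights(labels)
for cls in sorted(weights):
    print(cls, round(weights[cls], 3))
```

A dictionary of this shape, keyed by integer class index rather than by name, is the form accepted by the `class_weight` argument of Keras's `Model.fit`.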

Performance evaluation
Within the COVID-19 deep learning literature, we have found that most studies report common evaluation metrics. To compare our models against the literature we have reviewed, we have chosen to report the accuracy, sensitivity, specificity, F1-Score, precision, recall, negative predictive value (NPV), positive predictive value (PPV), and area under the receiver operating characteristic curve (AUC-ROC) of our deep learning pipeline.
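For reference, the threshold-based metrics listed above can all be derived from the four cells of a binary confusion matrix. The sketch below uses hypothetical counts rather than our results, and it omits guards against empty rows or columns. Note that sensitivity and recall are the same quantity, as are precision and PPV.

```python
def binary_metrics(tp, fp, tn, fn):
    """Common diagnostic metrics from binary confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # also called recall
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)           # also called precision
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "recall": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "precision": ppv,
        "npv": npv,
        "f1": f1,
    }


# Hypothetical counts for a COVID-19 vs. non-COVID-19 test set.
metrics = binary_metrics(tp=90, fp=10, tn=80, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```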
We first set out to train our multiclass and binary DenseNet-201 [30], ResNet-152 [38], and VGG-19 [27] models for five epochs. On each model, we obtained a validation accuracy that ranged between 70 and 80 percent. This largely mirrored the performance of expert radiologists who had their expertise measured in a research study led by Wehbe et al. [21]. We performed this initial work using our multiclass and binary training sets before moving on to test ourselves against Tabik et al.'s [22] model (which was trained on the "COVID-GR-1.0" dataset). During this initial stage, we worked toward increasing the accuracy of all three of these classifiers by unfreezing each model during training progressively.
On our multiclass dataset, we obtained final validation set accuracies of 82.16% on our DenseNet-201 [30], 84.25% on our ResNet-152 [38], and 81.09% on our VGG-19 [27]. Likewise, on our multiclass dataset, we obtained final test set accuracies of 82.42% on our DenseNet-201 [30], 81.84% on our ResNet-152 [38], and 77.53% on our VGG-19 [27]. The test accuracies of our ResNet-152 [38] and VGG-19 [27] were 2%-4% lower than their corresponding validation set accuracies, while the test accuracy of our DenseNet-201 [30] was marginally higher than its validation set accuracy. When we ensembled all three classifiers into majority voting and weighted average ensembles, we saw an increase in performance on our validation and test sets. For our weighted average ensemble, we obtained a validation set accuracy of 87.40% and a test set accuracy of 84.07%. For our majority voting ensemble, we obtained a validation set accuracy of 87.14% and a test set accuracy of 84.00%. In both instances, we found that the test set accuracies of both ensembles outperformed our best individual classifier (DenseNet-201 [30]) by more than 1.5%. The overall performance of our three classifiers and our ensembles on our multiclass validation and test sets can be seen in Table 6. Our binary classifiers were trained in the same way as our multiclass classifiers. The overall performance of our three classifiers and our ensembles on our binary validation and test sets can be seen in Table 7. Tables 8-11 show a larger suite of statistics generated on the multiclass and binary test sets using both our weighted average and majority voting ensembles. Figs. 6-9 show the corresponding confusion matrices generated by our weighted average and majority voting ensembles on our multiclass and binary test sets. Fig. 10 shows the AUC-ROC curves generated by our weighted average ensembles.
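The two ensembling schemes compared above can be sketched for a single image as follows. This is an illustrative sketch only: the per-model probabilities, the ensemble weights, the class ordering, and the tie-breaking rule are all assumptions, since none of these values are stated in the text.

```python
from collections import Counter


def weighted_average_ensemble(prob_lists, weights):
    """Average the per-model class probabilities using the given model weights."""
    total = sum(weights)
    n_classes = len(prob_lists[0])
    averaged = [
        sum(w * probs[c] for w, probs in zip(weights, prob_lists)) / total
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=lambda c: averaged[c])
    return label, averaged


def majority_vote_ensemble(prob_lists):
    """Each model votes for its most probable class; the most common vote wins."""
    votes = [max(range(len(p)), key=lambda c: p[c]) for p in prob_lists]
    counts = Counter(votes)
    top = max(counts.values())
    # Assumed tie-break: the lowest class index among the tied classes.
    return min(c for c, n in counts.items() if n == top)


# Hypothetical softmax outputs; assumed class order: 0 = COVID-19, 1 = Normal, 2 = Pneumonia.
densenet = [0.6, 0.3, 0.1]
resnet = [0.2, 0.5, 0.3]
vgg = [0.1, 0.5, 0.4]
models = [densenet, resnet, vgg]

wa_label, wa_probs = weighted_average_ensemble(models, weights=[0.6, 0.2, 0.2])
mv_label = majority_vote_ensemble(models)
print("weighted average:", wa_label, [round(p, 2) for p in wa_probs])
print("majority vote:", mv_label)
```

In this toy case the two schemes disagree: two of the three models vote for class 1, but the heavily weighted first model pulls the weighted average toward class 0. This shows why the two ensembles can report slightly different accuracies on the same test set.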
Table 5
ResUnet architecture [12].

After training and testing our segmentation-classification pipeline on our datasets, we also tested our binary pipeline directly against Tabik et al.'s [22] [...] of this dataset. These two datasets have been designed to deal with separate problems and a detailed discussion concerning these differences is presented in the following section. Table 12 shows how our models compared against Tabik et al.'s [22] COVID-SDNet model.
Every deep learning expert working in computer vision understands that it is necessary to validate the final version of a classifier after it has been trained. In medical imaging, saliency maps are widely employed on computer vision models to ensure that these models are correctly identifying important features in an image. In radiology, it is common for deep learning models to incorrectly focus on necklaces, medical devices, and the text within X-ray scans. The reason we included a segmentation unit in our study was to ensure that our model's CNNs were rejecting unnecessary image details outside of the boundaries of the lungs. We used Grad-CAM [60] in this study to ensure that our segmentation module was doing its job correctly in assisting our models to pick up the correct features of COVID-19. Grad-CAM [60] functions by using the final feature maps in the last convolutional layer of a CNN to signal regions of importance within an image. We were interested in studying our CNNs that were trained on segmented images. We therefore devised a plan to compare them with CNNs that were trained on nonsegmented images. Fig. 11 shows the performance of our DenseNet-201 [30] after being trained on segmented and nonsegmented X-rays. Our DenseNet-201 [30] was one of the three CNNs that we used in constructing our majority voting and weighted average ensembles. Part (b) of Fig. 11 shows the performance of our DenseNet-201 [30] on a test image after it was trained without a segmentation module. 
The red parts of the heatmap indicate the primary parts of the image that the DenseNet-201 [30] focused on when determining that a patient has COVID-19. The orange/yellow portions of the heatmap represent areas of medium importance. The green/blue areas of the Grad-CAM [60] heatmap represent areas that were the least important diagnostically in determining that a patient is COVID-19 positive. Unfortunately, portions of the red and orange/yellow parts of the heatmap in part (b) of Fig. 11 are focused on areas outside of the lungs. The area that the Grad-CAM [60] partially focused on in the upper right-hand side of the image was a problem. This area should have been irrelevant to a COVID-19 diagnosis. When our DenseNet-201 [30] was trained on segmented images, however, its behavior improved, as is shown in part (d) of Fig. 11. We monitored the performance of our model in this way to ensure that our model was picking up the features of COVID-19 that we highlighted in Section 1.
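The heatmap computation described above can be sketched in plain Python. In Grad-CAM, each feature map of the last convolutional layer is weighted by the spatial average of the gradient of the class score with respect to that map, the weighted maps are summed, and negative values are clipped to zero. The tiny arrays below are hypothetical stand-ins for real activations and gradients, and the final rescaling to the range 0 to 1 is a common visualization step that is assumed here rather than taken from the text.

```python
def grad_cam(feature_maps, gradients):
    """Grad-CAM heatmap from K feature maps and their gradients (each H x W nested lists)."""
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    # Channel importance: global average of each channel's gradient.
    channel_weights = [
        sum(sum(row) for row in grad) / (height * width) for grad in gradients
    ]
    # Weighted sum of feature maps followed by ReLU.
    heatmap = [
        [
            max(0.0, sum(w * fmap[i][j] for w, fmap in zip(channel_weights, feature_maps)))
            for j in range(width)
        ]
        for i in range(height)
    ]
    # Assumed visualization step: rescale so the hottest pixel equals 1.
    peak = max(max(row) for row in heatmap)
    if peak > 0:
        heatmap = [[value / peak for value in row] for row in heatmap]
    return heatmap


# Hypothetical 2-channel, 2 x 2 activations and gradients.
activations = [[[1, 2], [3, 4]],
               [[4, 3], [2, 1]]]
grads = [[[1, 1], [1, 1]],
         [[-1, -1], [0, 0]]]
heatmap = grad_cam(activations, grads)
for row in heatmap:
    print([round(v, 3) for v in row])
```

In a real pipeline the heatmap would then be upsampled to the input resolution and overlaid on the X-ray, which is what produces the red, orange/yellow, and green/blue regions discussed above.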

Discussion
Wehbe et al. [21] conducted an important study that measured the performance of practicing radiologists on a private COVID-19 vs. non-COVID-19 dataset. In our work, we took it upon ourselves to build a COVID-19 dataset of comparable size. We wanted to measure our pipeline's ability to compete with the radiologists in their study and their model. We were more specifically interested in comparing our pipeline's COVID-19 sensitivity with that of the radiologists in Wehbe et al.'s [21] study, given the problems concerning RT-PCR test sensitivity we have read about in scientific journals. The radiologists' consensus sensitivity in Wehbe et al.'s study [21] was 70%. All of our ensembles, including those trained on the weaker images in the "COVID-GR-1.0" dataset [22], obtained a higher COVID-19 sensitivity. The COVID-19 sensitivity of the five expert radiologists in Wehbe et al.'s [21] study versus that of our ensembles can be seen in Table 13.
As can be seen in Table 13, when we compare our ensemble models with the performance of the radiologists in Wehbe et al.'s [21] study, we outperform even the best radiologist's COVID-19 sensitivity. In Table 13, another item that stands out is the difference in sensitivity between the ensemble we trained on our binary dataset versus the ensemble we trained on the COVID-GR-1.0 dataset [22]. This discrepancy can be explained by the higher number of weak COVID-19 images that were intentionally placed by radiologists in the "COVID-GR-1.0" dataset [22]. Tabik et al. [22] created the "COVID-GR-1.0" dataset to measure the performance of their classifier on COVID-19 images that are more difficult to classify. Even after we trained our ensemble model on this extremely conservative dataset, we still managed to obtain a higher sensitivity than the radiologists in Wehbe et al.'s [21] study. This demonstrated the robustness of our technique. The COVID-GR-1.0 dataset intentionally contained a larger proportion of COVID-19 positive images that were difficult for radiologists to identify correctly. Many of the datasets currently available in the literature are constructed from the images of hospitalized patients. The COVID-19 severity of X-rays from patients who have been hospitalized is often worse than the severity seen in X-rays from patients who have not been hospitalized. Many COVID-19 X-ray datasets in the literature, therefore, have a larger proportion of severe COVID-19 images. These datasets may not always be representative of the population at large. That was an issue Tabik et al.'s [22] dataset was attempting to correct for. Our final results after training with Tabik et al.'s [22] dataset showed that our overall pipeline maintained good performance when working with a more conservative dataset.
When we constructed our binary dataset, we built it so as to respond to a criticism that Wehbe et al. [21] mentioned in their paper concerning the size of public datasets. Wehbe et al.'s [21] study found that the consensus accuracy and sensitivity of expert radiologists are 81% and 70% respectively. After training their ensemble model, Wehbe et al. [21] found that their system achieved a test accuracy of 82% and test sensitivity of 75%. Many other studies, however, have reported performance metrics that are much higher than this. Wehbe et al. [21] explained this by showing how models with extremely high metrics often have very small COVID-19 datasets. They posited that if the number of COVID-19 images in these other studies increased, these models would see a correction. They believed that early COVID-19 deep learning models were overfitting on small COVID-19 datasets. We therefore set out to construct a larger COVID-19 dataset than any other public COVID-19 dataset we have seen in the literature thus far. We felt that it was additionally important to create separate validation and test sets in order to ensure that overfitting does not occur. To protect against overfitting, we also ensured that each of the CNNs in our pipeline had a dropout layer as its second-to-last layer.
Wehbe et al.'s [21] criticism of small public datasets was not the only concern we discovered when using public datasets. We later realized that many public datasets include images from Kermany et al.'s [33] dataset, which contains the chest X-rays of young children suffering from various forms of pneumonia. It is incorrect to take a model that was trained on children's X-rays and deploy it on adult X-rays. When we attempted to use such a dataset for training one of our CNNs, we obtained extremely high performance metrics (accuracy/sensitivity between 98% and 100%). We noticed that several deep learning segmentation-classification pipelines [18,19,24] made this mistake. In addition to this, we have come to discover that some authors may have unintentionally biased their classifiers by mixing multiple images from individual patients in their training and test sets. This ultimately results in an incorrect biasing of a deep learning model, as the image in the test set often has similar features to the image in the training set that was derived from the same patient. If this biasing occurs, deep learning models often lock onto more closely related features than they would have otherwise been trained to recognize. To summarize, the following three main issues are, therefore, sometimes found with COVID-19 datasets in the literature:
1. COVID-19 datasets have often been too small, which has caused overfitting to occur in deep learning models.
2. Many datasets have been constructed with pneumonia X-rays collected from children. Models based on these datasets were then later deployed on adult lungs.
3. Some datasets may contain separate images from the same patients in both the training and test sets.
In Table 14 we compare our work with other segmentation-classification pipelines that have not made the mistake of incorrectly biasing their datasets.
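The third issue listed above, images from one patient appearing in both the training and test sets, can be avoided by splitting on patient identifiers rather than on individual images. The following is a minimal sketch under the assumption that each image record carries a patient identifier. The record layout, file names, and function name are hypothetical and do not describe how any of the cited datasets are organized.

```python
import random


def split_by_patient(records, test_fraction=0.2, seed=0):
    """Split (patient_id, image_path) records so that no patient spans both sets."""
    patient_ids = sorted({pid for pid, _ in records})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)
    # Hold out at least one whole patient for testing.
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_patients = set(patient_ids[:n_test])
    train = [r for r in records if r[0] not in test_patients]
    test = [r for r in records if r[0] in test_patients]
    return train, test


# Hypothetical records: some patients contribute several X-rays.
records = [
    ("p1", "p1_a.png"), ("p1", "p1_b.png"),
    ("p2", "p2_a.png"),
    ("p3", "p3_a.png"), ("p3", "p3_b.png"), ("p3", "p3_c.png"),
    ("p4", "p4_a.png"),
    ("p5", "p5_a.png"),
]
train_set, test_set = split_by_patient(records)
train_patients = {pid for pid, _ in train_set}
test_patients = {pid for pid, _ in test_set}
print("shared patients:", train_patients & test_patients)
```

Because the split is made over patients, the image-level test fraction will only approximate `test_fraction` when patients contribute different numbers of images.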
Our best three-class and two-class ensemble models should only be compared against the first four classifiers in Table 14. Our three-class and two-class ensembles were trained on a dataset that we built after gathering as many COVID-19 images as possible. The authors of the first four papers in Table 14 composed their datasets in the same way. The classifier trained on the COVID-GR-1.0 dataset [22], however, was intentionally trained on weak COVID-19 images and should be treated in isolation. In comparing our segmentation unit with Yeh et al.'s [20] U-Net [31] segmentation model, our ResUNet [12] achieved a Dice similarity coefficient that was 7 percent higher. In terms of dataset size, our COVID-19 dataset contained over 3000 more COVID-19 images. Yeh et al. [20] therefore had a smaller dataset and were more likely to have overfit their model. Our model was, therefore, more likely to face downward pressure in our performance metrics. Our three-class model, however, was still capable of obtaining the same COVID-19 sensitivity as Yeh et al.'s [20] model. It likely was able to do so with the help of better segmentation and the use of a majority voting ensemble. This indicates that on datasets that are constructed with as many COVID-19 images as possible, a three-class model (COVID-19 vs. Normal vs. Pneumonia) can reasonably achieve a COVID-19 sensitivity of 82%. Our two-class weighted average ensemble outperformed Wehbe et al.'s [21] classifier by a substantial margin. This may have been caused by a difference in our approach to segmentation. Wehbe et al.'s [21] classifier was trained to crop out the smallest rectangle that a patient's lungs can fit within. Our segmentation unit was trained on a set of masks that removed more pixels than Wehbe et al.'s [21] segmentation unit. We chose to not segment out the pixels showing the heart. Cardiomegaly (an enlarged heart) is a common symptom of COVID-19. 
Leaving the heart in our classified images allowed us to pick up this feature and likely assisted us to increase the performance metrics of our classifier. Our weighted average ensemble also outperformed Abdullah et al.'s [25] model despite our having a segmentation unit that under-performed Abdullah et al.'s Res-CR-Net [45] by one percent. We obtained a two-class accuracy that was 12 percent better than Abdullah et al.'s [45] classification model. We believe this is a result of our having constructed an extremely robust weighted average classification ensemble. Our best 2-class pipeline's accuracy and sensitivity matched [...]

Fig. 10. AUC-ROC graphs of (a) our multiclass weighted average ensemble trained on our multiclass training set and (b) our binary weighted average ensemble trained on our binary training set.

Table 12
Our binary models vs. COVID-SDNet on the COVID-GR-1.0 dataset [22].

It should be noted that there are instances where using a segmentation unit can reduce a model's accuracy. While segmentation units should generally always help a classifier's accuracy, we have noticed in our work that classifiers without a segmentation unit can lock onto features of an image that are external to the lungs. Sometimes this helps to increase a CNN's ability to classify particular images. For instance, if one category of images has more text than another you might notice the Grad-CAM [60] heatmaps for that category focusing on text. Our segmentation unit removed this possibility from happening and ultimately allowed us to boost our model's accuracy in a more honest fashion. Our Grad-CAM [60] heatmaps in Fig. 11 additionally showed an improvement in discovering relevant COVID-19 features when we used our segmentation unit.
The approach to creating datasets that is followed by the vast [...]

Fig. 11. Example of a segmented and non-segmented Grad-CAM heatmap produced by our DenseNet-201.

Table 13
The COVID-19 sensitivity of five expert radiologists in Wehbe et al.'s study [21] vs. our classifiers.

[...] [22] dataset was the only dataset that we could obtain that allowed us to directly compare our pipeline with another author's segmentation-classification pipeline. It has been difficult to find publicly available datasets such as Tabik et al.'s [22] where the authors have made clear how they segmented and classified their images. Tabik et al. [22] did not report a Dice similarity coefficient because they segmented their images in such a way so as to create a small cropped rectangle around the lungs. This is similar in principle to how we segmented our images. We chose the Darwin V7 Labs dataset [43] for training our segmentation unit because the masks in this dataset left more room around the lungs to show the heart. We believe that if a segmentation unit were to remove these pixels, COVID-19 symptoms like cardiomegaly could go unobserved by a classifier. We believe that our weighted average ensemble is ultimately what allowed us to achieve an improved accuracy and improved COVID-19 sensitivity when comparing our model with Tabik et al.'s [22] model. Our segmentation unit likely helped as well, as it rejected a greater number of superfluous pixels around the lungs in comparison to Tabik et al.'s [22] segmentation methodology.
Unfortunately, at this time, the public COVID-19 datasets that have been made available are somewhat incomplete. Public COVID-19 datasets are composed of images that previously came with corresponding positive RT-PCR tests. We know, however, that there are occasionally false-positive images, depending on when individual RT-PCR tests are performed. Sometimes, if a patient obtains a negative RT-PCR test, they will come back later and obtain a positive test. We, therefore, have datasets with RT-PCR-positive patients, but each image's COVID-19 status has not been perfectly validated. There are occasional errors. This may have affected our work and the work of other papers we have reviewed. Our classifiers' results, therefore, while promising, perhaps should not be clinically deployed until better external labeling processes have been followed in building COVID-19 datasets. Many deep learning models perform well in the lab before being deployed in a clinical setting. Our models would need to be tested alongside other administered COVID-19 tests in order to compare their efficacy against competing technologies.

Conclusion
The two-class and three-class datasets that we have constructed contain the largest number of publicly available COVID-19 images that we have found in the literature. In training our segmentation-classification pipeline we were ultimately able to design several ensembles that generated promising results. Our best two-class weighted average ensemble ultimately achieved a 91 percent COVID-19 accuracy and 92 percent COVID-19 sensitivity. We were also able to out-compete a segmentation-classification pipeline that we directly compared our pipeline against [22]. While our models show promising characteristics in terms of our Grad-CAM heatmaps and performance metrics, our models are still not ready to be implemented in a clinical setting.
For a deep learning pipeline such as ours to be advanced into a clinical setting, the medical community and AI experts require further collaboration. To the best of our knowledge, no study has been performed whereby every single incoming patient at a medical facility was tested for COVID-19 with an X-ray and RT-PCR test simultaneously. The COVID-19 images that can be found in public datasets tend to come from patients that were showing increased complications in relation to their illness. In private datasets, the same problem likely exists as well since radiological evaluations are typically reserved for patients showing a concerning trend in the development of their illness. It is important to find out the proportion of incoming patients at a medical clinic that are COVID-19 positive after blind X-rays get administered to every patient. Anyone wanting to clinically implement a deep learning system such as ours may also benefit from blindly administering competing molecular tests (RT-PCR tests), antigen tests, and antibody tests on the same patients during this data-gathering stage. In our future work, we aim to extend our pipeline with categorical and numerical data to improve the ability of our pipeline to diagnose COVID-19. This additional metadata concerning each patient's age, sex, and relevant background details could really help to improve the performance metrics of our deep learning model. We also hope to eventually construct a deep learning pipeline capable of discovering the prognosis of COVID-19 patients. We believe that our pipeline is a promising step forward towards radiologically automating the detection of COVID-19. With a little more time and resources invested in these data-gathering processes, we believe that a clinically viable deep learning model is possible that allows for a truly better standard of care.

Dataset and code availability
We have made our dataset and scripts used in training our pipeline available at https://www.kaggle.com/roberthertel/covid-xray-datasetwith-segmentation-ensembles.