Indirect supervision applied to COVID-19 and pneumonia classification

Coronavirus disease 2019 (COVID-19) continues to have a devastating effect around the globe, leading many scientists and clinicians to seek new techniques for tackling the disease. Modern machine learning methods have shown promise in supporting the healthcare industry through data- and analytics-driven decision making, inspiring researchers to develop new angles to fight the virus. In this paper, we aim to develop a CNN-based method for the detection of COVID-19 from patients' chest X-ray images. Building upon convolutional units, the proposed method makes use of indirect supervision based on Grad-CAM. This technique is used in the training process, where Grad-CAM's attention heatmaps support the network's predictions. Despite recent progress, scarcity of data has thus far limited the development of a robust solution. We extend existing work by combining publicly available data from 5 different sources and carefully annotating the images across three categories: normal, pneumonia, and COVID-19. To achieve a high classification accuracy, we propose a training pipeline based on indirect supervision of traditional classification networks, where the guidance is directed by an external algorithm. With this method, we observed that widely used, standard networks can achieve an accuracy comparable to models tailor-made for COVID-19, with one network in particular, VGG-16, outperforming the best of the tailor-made models.


Introduction
Since its introduction into the human population in late 2019, COVID-19 continues to have a devastating effect on the global populace with the number of infected individuals steadily rising [1]. With widely available treatments still outstanding and the continued strain placed on many healthcare systems across the world, efficient screening of suspected COVID-19 patients and their subsequent isolation is of paramount importance to mitigate further spread of the virus. Presently, the accepted gold standard for patient screening is reverse transcriptase-polymerase chain reaction (RT-PCR) where the presence of COVID-19 is inferred from the analysis of respiratory samples [2].
Despite its success, RT-PCR is a highly involved manual process with slow turnaround times, with results becoming available up to several days after the test is performed. Furthermore, its variable sensitivity, lack of standardized reporting, and widely ranging total positive rate [3][4][5] call for alternative screening methods.
Chest radiography imaging (such as X-ray or computed tomography (CT) imaging) has gained traction as a powerful alternative, where the diagnosis is administered by expert radiologists who analyze the resulting images and infer the presence of COVID-19 through subtle visual cues [6][7][8][9][10]. Of the two imaging methods studied, X-ray imaging has distinct advantages with regards to accessibility, availability, and rate of testing [11]. Furthermore, the existence of portable X-ray imaging systems does not require patient transportation or physical contact between healthcare professionals and suspected infected individuals, thus allowing for efficient virus isolation and a safer testing methodology. Despite its obvious promise, the main challenge facing radiography examination is the scarcity of trained experts that could conduct the analysis at a time when the number of possible patients continues to rise. As such, a computer system that could accurately analyze and interpret chest X-ray images could significantly alleviate the burden placed on expert radiologists and further streamline patient care. Image identification techniques are readily adopted in Artificial Intelligence (AI) and could prove to be a powerful solution to the problem at hand.
Deep learning models, such as convolutional neural networks (CNNs), have gained traction in the field of medical imaging [12,13] and here we train 10 promising CNNs for the purpose of COVID-19 classification in chest X-ray images. To assist the models, we utilize a purpose-built extraction of a soft mask as part of a three-stage procedure. To better quantify the performance of our proposed framework we benchmark our results against recently developed COVID-Net models [14]. To ensure consistency, we utilize our dataset to output predictions across an array of different COVID-Net models.
The structure of the rest of this paper is as follows: section "Related Work" briefly discusses some of the existing work used to diagnose COVID-19 in radiographic imaging; section "Data" summarizes the data collected from the 5 most studied datasets; section "Methods" describes the proposed three-stage workflow using an indirect attention mechanism; section "Results" displays the results obtained during all 3 stages, outlines further improvements of the proposed workflow, its advantages over other models and showcases possible implementations; section "Conclusion" represents a synthesis of key points of the developed model based on the indirect attention mechanism.

Related Work
The authors in Ref. [34] propose DeepCOVID-XR, an ensemble of CNNs, to detect the presence of COVID-19 on frontal chest radiographs with an accuracy of 82% reported on a test set of 300 images (194 of which were from COVID-19 infected patients). Studying 5,090 images (1,979 of which were COVID-19 positive), the authors in Ref. [33] were able to achieve a binary classification accuracy of 99.5% by making use of HOG + CNN architecture for feature extraction and VGG for classification. In Ref. [32], pre-trained CNN models VGG-16, VGG-19, MobileNet, and Inception ResNet V2, are used to achieve a classification accuracy of at least 90.8% across 545 images (181 of which are COVID-19 positive).
Patients diagnosed with COVID-19 present symptoms consistent with pneumonia in their X-ray images, necessitating the ability to distinguish between COVID-19 and non-COVID-19 based pneumonia findings. Mahmud et al. [29] introduced CovXNet, a CNN-based model that makes use of depthwise convolution with varying dilation rates. The model is trained in two stages, first, on images corresponding to normal and viral/bacterial pneumonia. The model is then trained to distinguish COVID-19 from other forms of pneumonia, with a multi-class accuracy of 90.2% when trained (second stage) on 305 images in each class. Abbas et al. [31] developed a deep CNN called DeTrac to achieve an accuracy of 93.1% when detecting COVID-19 in 196 images across three categories: normal, severe acute respiratory syndrome (SARS), and COVID-19. Toraman et al. [38] developed Convolutional CapsNet (capsule neural network) to distinguish COVID-19 from normal and pneumonia X-ray images. The authors reported an accuracy of 97.2% and 84.2% for binary and multi-class classification, respectively, when making use of 2,331 images (231 of which were COVID-19 positive). Mansour et al. [39] introduced an unsupervised deep-learning-based variational autoencoder model for COVID-19 prediction, with resultant accuracies of 98.7% and 99.2% for binary and multi-class classification respectively. The authors tested their model against the X-ray dataset found in Ref. [40], split across normal, COVID-19, SARS, and ARDS classes. Khan et al. [41] developed CoroNet, a CNN model based on the Xception architecture. When tasked with classifying X-ray images as either normal, COVID-19, bacterial pneumonia, or viral pneumonia, the model achieved an accuracy of 89.6%, based on a dataset consisting of 1,251 images (284 of which belonged to COVID-19 positive cases). Chandra et al. [42] introduced an automatic COVID-19 screening system that uses a two-phase classification approach (normal vs abnormal and then COVID-19 vs pneumonia). 
The implemented classifier ensemble makes use of majority voting across five benchmark classification algorithms. Using 2,346 X-ray images (782 of which were COVID-19 positive), the authors report accuracies of 98.1% and 91.3% for the two phases, respectively. Ozturk et al. [30] developed DarkCovidNet, a model that obtained an accuracy of 87.0% when distinguishing between COVID-19, normal, and pneumonia in 1,127 images (127 of which are from COVID-19 positive patients). Wang et al. [14] developed a state-of-the-art model, called COVID-Net, that attains an accuracy of 93.3% when classifying a patient's image across three categories: normal, pneumonia, and COVID-19.
Despite recent progress in the development of CNN-based algorithms, several fundamental challenges remain: the scarcity of publicly available data, overfitting of models, and model sizes that make their adoption within a healthcare setting cumbersome. We extend upon existing works by combining various publicly available data sources and carefully annotate the images across three classes: normal, pneumonia, and COVID-19. The data is then divided into training, validation, and testing subsets with an 8:1:1 split respectively, with a strict class balance maintained across all sets. Furthermore, we make use of widely adopted CNNs whose size is a fraction of some purpose-built models.

Data
We collected data from different publicly available sources to train a high-precision classifier and to estimate its generalization properties. At the time of publication, we identified the following five datasets: COVID Chest X-Ray Dataset (CCXD) [40,43], Actualmed COVID-19 Chest X-Ray Dataset (ACCD) [44], Figure 1 COVID-19 Chest X-Ray Dataset (FCCD) [45], COVID-19 Radiography Database (CRD) [46,47], and RSNA Pneumonia Detection Dataset (RSNA) [48]. Since the datasets use different labels for their findings, we reassigned the labels to maintain consistency across the combined dataset. We assigned viral and bacterial cases of pneumonia to the "Pneumonia" label; SARS, MERS-CoV, COVID-19, and COVID-19 (ARDS) to the "COVID-19" label; and "no findings" and "normal" diagnoses to the "Normal" label. Table 1 summarizes the statistical information of the study dataset.
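The relabeling described above amounts to a lookup from source findings to the three study labels. The following is a minimal sketch; the raw finding strings and the function name `harmonize_label` are hypothetical, since the exact metadata values of each dataset are not listed here.

```python
# Hypothetical raw finding strings; the real metadata values may differ.
LABEL_MAP = {
    "viral pneumonia": "Pneumonia",
    "bacterial pneumonia": "Pneumonia",
    "sars": "COVID-19",
    "mers-cov": "COVID-19",
    "covid-19": "COVID-19",
    "covid-19 (ards)": "COVID-19",
    "no findings": "Normal",
    "normal": "Normal",
}


def harmonize_label(raw_label):
    """Map a source-dataset finding to one of the three study labels."""
    key = raw_label.strip().lower()
    if key not in LABEL_MAP:
        # Fail loudly instead of silently assigning an unknown finding.
        raise KeyError(f"Unmapped finding: {raw_label!r}")
    return LABEL_MAP[key]
```

Raising on unmapped findings, rather than defaulting to a class, keeps unexpected source labels from leaking into the merged dataset unnoticed.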
It should be noted that the RSNA dataset includes only normal and pneumonia cases. Originally, this dataset consisted of 20,672 normal cases and 9,555 cases of pneumonia. In order to keep class balance in our dataset, we incorporated a total of 800 normal and 700 pneumonia cases. It is worth noting that normal and pneumonia cases from the CRD dataset were excluded because they duplicated images from the CCXD dataset.
The final dataset includes images acquired in the anteroposterior (AP) and posteroanterior (PA) directions only. Lateral CXR has no clinical applicability for distinguishing COVID-19 patients [49]. For network training, validation, and testing, the dataset was split in an 8:1:1 ratio, i.e. the training subset includes 2,122 images (80%), the validation subset 242 images (10%), and the testing subset 267 images (10%). The split of data across the training, validation, and testing subsets was performed according to the distribution shown in Table 2.
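A class-balanced 8:1:1 split of this kind can be sketched as follows. This is an illustration under assumptions, not the authors' code: the function name, the seed, and the toy label counts are hypothetical, and the actual per-class distribution is the one given in Table 2.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.8, 0.1, 0.1), seed=0):
    """Return train/val/test index lists, splitting each class by `ratios`."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(ratios[0] * len(indices))
        n_val = round(ratios[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to testing
    return train, val, test


# Toy labels only; these counts are not the study's class counts.
toy_labels = ["Normal"] * 50 + ["Pneumonia"] * 30 + ["COVID-19"] * 20
train_idx, val_idx, test_idx = stratified_split(toy_labels)
```

Splitting within each class, rather than over the pooled images, is what preserves the class proportions in all three subsets.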

Methods
The proposed workflow in this study is divided into three stages. First, we utilized the transfer learning approach based on 10 industry-standard networks: MobileNet V2, DenseNet-121, EfficientNet B0, EfficientNet B1, EfficientNet B3, EfficientNet B5, VGG-16, ResNet-50 V2, Inception V3, and Inception ResNet V2. The weights of the feature extractors (network bodies) were frozen and only the classifier heads were trained. During the second stage, we chose the 4 most accurate networks to advance to full training. Here, the weights of the whole network were unfrozen, such that both the feature extractor and the classifier were trained. Finally, the networks were trained with an indirect attention mechanism. This indirect supervision mechanism is based on the Grad-CAM approach [50], whose output is used to focus the classifier on the lung area of an image. Indirect supervision is used in the training process since Grad-CAM's attention heatmaps reflect the areas of an input image supporting the network's prediction. In this regard, the prediction is based on the areas on which we expect the network to focus, while indirect supervision forces networks to focus on the desired object in the image rather than its surroundings. The training workflow of the model is shown in Fig. 1. All three stages are described in more detail in Section "Description of the workflow stages". It should also be noted that different COVID-Net models [14] are considered in this study. To date, COVID-Net models are state-of-the-art models used for distinguishing COVID-19 and pneumonia cases. All COVID-Net models are abbreviated to CXR in the remainder of the paper.

Description of the workflow stages
As mentioned previously, 10 deep learning networks were selected to determine which network architectures are most effective in recognizing COVID-19 and pneumonia. All networks vary in the number of weights, architecture topology, data processing, etc. Additionally, CXR models are used for comparison purposes. In order to compare the investigated networks, we provide an overview of the networks used during the first stage in Table 3.
To train the aforementioned networks, we used the bodies of these networks with frozen ImageNet weights. The best version of each model was obtained through a series of training jobs performed on the collected dataset using Amazon SageMaker. Having performed hyperparameter tuning based on a Bayesian optimization strategy, we found the set of hyperparameter values of the best performing model, as measured by the validation accuracy. We chose the following pool of hyperparameters for the investigation:
• The number of blocks, where each block is constructed of densely connected, activation, and dropout layers, was chosen to vary from 1 to 5.
It is worth noting that the architecture including 3 densely connected and 2 dropout layers was an optimal solution for all networks. However, the number of neurons varied slightly from network to network. The optimal number of neurons for the first and second densely connected layers varied from 112 to 136 and from 56 to 72, respectively. A similar situation was observed for the dropout rate, which varied from 0.05 to 0.15 for the first dropout layer and from 0.05 to 0.10 for the second dropout layer. In this regard, we chose a common architecture for all network classifiers, consisting of the following layers:
• Densely connected layer with 128 neurons and ELU activation;
• Dropout layer with dropout rate equal to 0.10;
• Densely connected layer with 64 neurons and ELU activation;
• Dropout layer with dropout rate equal to 0.05;
• Densely connected layer with 3 neurons;
• Softmax activation layer.
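To give a sense of how small this classifier head is, the sketch below counts its trainable parameters. It assumes that a pooled feature vector feeds the first densely connected layer; the 1280-dimensional input (the pooled output size of MobileNet V2) and the helper name are illustrative assumptions, since the pooling used between body and head is not stated here.

```python
# Head from the text: Dense(128) -> Dropout -> Dense(64) -> Dropout -> Dense(3).
# Dropout and activation layers add no trainable parameters.
HEAD_UNITS = [128, 64, 3]


def dense_head_parameters(num_features, units=HEAD_UNITS):
    """Count weights and biases of the stacked densely connected layers."""
    total = 0
    fan_in = num_features
    for fan_out in units:
        total += fan_in * fan_out + fan_out  # weight matrix + bias vector
        fan_in = fan_out
    return total


# Assumption: a 1280-dimensional pooled feature vector feeds the head.
print(dense_head_parameters(1280))  # 163,968 + 8,256 + 195 = 172,419
```

Under this assumption the head contributes roughly 0.17 million parameters, which is small next to the feature extractors it sits on.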
It is important to note that for the first stage, only the classification heads were trained, with the body weights frozen. According to the results of the hyperparameter tuning procedure, the stochastic gradient descent (SGD) optimizer with a learning rate equal to $10^{-4}$ proved to be optimal. Having trained several state-of-the-art networks, we found that most of them diverged. As a result, L2-regularization with λ of 0.001 was applied to all trained networks. All networks were trained with a batch size equal to 32. To avoid overfitting during network training, we applied Early Stopping regularization, monitoring validation loss with a patience equal to 10 epochs. For training networks in both the first and second stages, we used the cross-entropy loss, calculated as follows:
$$L_{cls} = -\sum_{i=1}^{c} y_i \log(p_i + \varepsilon) \quad (1)$$
where $c$ is the number of classes (3 in our study), $y_i$ is the ground-truth label (ternary indicator), $p_i$ is the softmax probability for the $i$-th class, and $\varepsilon$ is a small positive constant used for avoiding an undefined case of $\log(0)$.
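The cross-entropy described above, including the small constant that guards against log(0), can be sketched for a single sample as follows. The value of `eps` is an assumption, as the constant used in the study is not reported.

```python
import math


def cross_entropy(y_true, p_pred, eps=1e-7):
    """Cross-entropy for one sample: -sum_i y_i * log(p_i + eps).

    eps is an assumed value; it only prevents log(0).
    """
    return -sum(y * math.log(p + eps) for y, p in zip(y_true, p_pred))


# One-hot target for the second of three classes and an illustrative softmax output.
loss = cross_entropy([0, 1, 0], [0.1, 0.7, 0.2])
```

With a one-hot target only the term of the true class contributes, so the loss reduces to the negative log-probability assigned to that class.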
During the second stage, we took the four best performing networks with their trained heads from the first stage, namely MobileNet V2, EfficientNet B1, EfficientNet B3, and VGG-16, unfroze their body weights (the weights of the feature extractors), and retrained them using the SGD optimizer with a learning rate of $10^{-5}$. That is, we decreased the learning rate by a factor of 10 compared to that used in the first stage. It is important to lower the learning rate at this stage since a larger model with more unfrozen weights is trained, and this requires the readaptation of the pre-trained weights. Otherwise, unfreezing all weights without changes in the training policy may lead to quick model overfitting.
Once the performance and accuracy metrics of all networks were estimated, four networks that showed the best results during the first stage were chosen for fine-tuning. Besides training both bodies and heads of the networks, we introduced an indirect supervision mechanism for the considered networks. We were inspired by Ref. [59], where the authors proposed a framework that provides guidance on the attention maps generated by a weakly supervised deep learning neural network. The attention block in our pipeline is based on the usage of Grad-CAM preceded by a classification block. Usually, attention maps only cover small discriminative regions of the object of interest when the network is purely supervised by the classification loss. In order to overcome this issue and use attention maps as more reliable priors, both classification and attention blocks share weights between each other. The latter acts as a regularizer, imposing constraints on the attention maps. While the classification block is targeted to search for regions used in the recognition of classes, the attention block ensures that all regions that can contribute to the classification decision will be included in the network's attention. Such an iterative process aids both classification and attention blocks in finding reliable priors and making a correct decision.
With the usage of the indirect supervision mechanism, the network learns to extend the focus area of an input image contributing to the recognition of the target class as much as possible, such that the attention maps are tailored towards the task of interest. In this regard, during network training in Stage III, the loss differs from that of Stage I and Stage II and is calculated as follows:
$$L_{total} = \alpha L_{cls} + \beta L_{attn} \quad (2)$$
where $L_{cls}$ is the classification loss, i.e. the cross-entropy loss defined in Eq. (1), $L_{attn}$ is the attention loss, and $\alpha$ and $\beta$ are the coefficients used to scale the two components of the total loss. In order to obtain the attention map and compute the attention loss $L_{attn}$ for a given image $I$, we compute the neuron importance weights by applying the global average pooling (GAP) operation to the gradient of the score $s^c$ with respect to the activation maps $f_{l,k}$:
$$w^c_{l,k} = \mathrm{GAP}\left(\frac{\partial s^c}{\partial f_{l,k}}\right)$$
The network weights are not updated when $w^c_{l,k}$ are computed on the backward pass. Since $w^c_{l,k}$ represents the importance of the activation map $f_{l,k}$ (the activation of unit $k$ on the $l$-th layer) in assisting the prediction of class $c$, the indirect mechanism uses the weights matrix $w^c$ and applies a two-dimensional convolution over the activation maps $f_l$, integrating all of them. Then the ReLU operation allows us to obtain the attention map $A^c$, computed as follows:
$$A^c = \mathrm{ReLU}\left(\mathrm{conv}(f_l, w^c)\right)$$
where $l$ is the representation from the last convolutional layer, whose features have the best compromise between high-level semantics and detailed spatial information. The attention map $A^c$ has the same size as the convolutional feature maps (see the column with the size of the output feature matrix in Table 3).
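The attention-map computation can be illustrated on plain nested lists. This sketch assumes that the convolution of the activation maps with the weights w^c acts as a 1 × 1 kernel, i.e. a channel-wise weighted sum; the gradients would normally come from a framework's backward pass and are written by hand here, and the function names are hypothetical.

```python
def global_average_pool(channel):
    """Mean over all spatial positions of one H x W map."""
    values = [v for row in channel for v in row]
    return sum(values) / len(values)


def attention_map(activations, gradients):
    """ReLU of the importance-weighted sum of activation maps.

    activations, gradients: K maps of size H x W from the last convolutional
    layer (the activation maps and the gradients of the class score with
    respect to them).
    """
    weights = [global_average_pool(g) for g in gradients]  # one weight per map
    height, width = len(activations[0]), len(activations[0][0])
    return [
        [
            max(0.0, sum(w * f[i][j] for w, f in zip(weights, activations)))
            for j in range(width)
        ]
        for i in range(height)
    ]


# Toy example: 2 channels of size 2 x 2 with hand-written gradients.
f = [[[1.0, 2.0], [0.0, 1.0]], [[3.0, 0.0], [1.0, 1.0]]]
g = [[[0.5, 0.5], [0.5, 0.5]], [[-1.0, -1.0], [-1.0, -1.0]]]
A = attention_map(f, g)  # [[0.0, 1.0], [0.0, 0.0]]
```

In the toy example the second channel has a negative importance weight, so positions where it dominates are clipped to zero by the ReLU.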
Using the trainable attention map $A^c$, we generate a soft mask that is applied to an input image. This procedure allows us to obtain regions $I^{*c}$ which are beyond the network's current attention for class $c$ and are calculated as follows:
$$I^{*c} = I - \left(T(A^c) \odot I\right)$$
where $I$ is an input image, $T(A^c)$ is a masking function based on the thresholding operation, and $\odot$ denotes element-wise multiplication. Since standard thresholding is not differentiable, $T(A^c)$ is approximated using a sigmoid function:
$$T(A^c) = \frac{1}{1 + \exp\left(-\omega (A^c - M_\sigma)\right)}$$
where $M_\sigma$ is the thresholding matrix filled with $\sigma$ values and $\omega$ is a scale parameter ensuring that $T(A^c)_{i,j}$ is approximately equal to 1 when $A^c_{i,j}$ is larger than $\sigma$ and to 0 otherwise.
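The soft masking step can be sketched as below. The values of `sigma` and `omega` are illustrative assumptions, not values reported in this study, and the attention map is assumed to be already resized to the image resolution and scaled to [0, 1].

```python
import math


def soft_mask(attention, sigma=0.5, omega=10.0):
    """Sigmoid approximation of thresholding: 1 / (1 + exp(-omega * (A - sigma)))."""
    return [
        [1.0 / (1.0 + math.exp(-omega * (a - sigma))) for a in row]
        for row in attention
    ]


def masked_image(image, attention, sigma=0.5, omega=10.0):
    """Compute I - T(A) * I element-wise, suppressing currently attended pixels."""
    mask = soft_mask(attention, sigma, omega)
    return [
        [pixel - m * pixel for pixel, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


# A pixel with high attention is almost erased; one with low attention is kept.
result = masked_image([[1.0, 1.0]], [[1.0, 0.0]])
```

A larger `omega` makes the sigmoid closer to a hard threshold at `sigma`, at the cost of steeper gradients near the threshold.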
Having obtained the soft-masked regions $I^{*c}$, the attention block of the pipeline uses them to compute the prediction scores $s^c$ for all classes. Since the indirect supervision mechanism is used to guide the network to focus its attention on all parts of a given class, $I^{*c}$ has to contain as few features belonging to the target class as possible: regions beyond the high-responding area of the attention map should not include even single-pixel areas that can trigger the network to recognize the object of class $c$. The attention loss function is therefore designed to minimize the prediction score $s^c$ of $I^{*c}$ and is calculated as follows:
$$L_{attn} = \frac{1}{n} \sum_{c} s^c\left(I^{*c}\right)$$
where $n$ is the number of ground-truth class labels for an input image $I$.
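The attention loss and its combination with the classification loss can be sketched as follows. The default values of `alpha` and `beta` are placeholders, since the coefficients used in the study are not reported here; the scores would come from a second forward pass on the masked images.

```python
def attention_loss(masked_scores):
    """Mean prediction score on the masked images over the ground-truth classes."""
    return sum(masked_scores) / len(masked_scores)


def total_loss(cls_loss, attn_loss, alpha=1.0, beta=1.0):
    """Weighted sum alpha * L_cls + beta * L_attn (placeholder coefficients)."""
    return alpha * cls_loss + beta * attn_loss


# Single-label case (n = 1): score of the true class on its masked image.
l_attn = attention_loss([0.2])
l_total = total_loss(0.4, l_attn, alpha=1.0, beta=0.5)
```

In a three-class, single-label setting such as this one, each image has one ground-truth class, so the mean reduces to a single score.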

Visual model validation
While modern neural networks enable superior performance, their lack of decomposability into intuitive and understandable components makes them hard to interpret. In this regard, achieving model transparency is useful for explaining their predictions. Class Activation Map (CAM) is a modern technique used for model interpretation [60]. Though CAM is a good technique for demystifying the workings of CNNs, it suffers from several drawbacks. For example, CAM requires feature maps to directly precede the softmax layers, so it applies only to a particular kind of network architecture that performs global average pooling over convolutional maps immediately before prediction. Such architectures may achieve inferior accuracies compared to general networks on some tasks or simply be inapplicable to new tasks. Deeper representations of a CNN capture the best high-level features. Furthermore, CNNs naturally retain spatial information, which is lost in fully connected layers, so we expect the last convolutional layer to have the best tradeoff between high-level semantics and detailed spatial information. In this regard, a popular technique known as Grad-CAM, published in Ref. [50], aims to address the shortcomings of CAM and claims to be compatible with any kind of architecture. The technique does not require any modifications to the existing model architecture, which allows its application to any CNN-based architecture. Unlike CAM, Grad-CAM uses the gradient information flowing into the last convolutional layer of a CNN to understand the importance of each neuron for a decision of interest. Grad-CAM improves on its predecessor by providing better localization and clear class-discriminative saliency maps. As such, we created heatmap images using the following equations:
$$\alpha^c_k = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A^k_{ij}}, \qquad L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\left(\sum_k \alpha^c_k A^k\right)$$
where the algorithm takes the gradient of the output $y^c$ with respect to a feature map $A^k$ and averages the result over the $Z$ spatial positions to get the weight $\alpha^c_k$ of each feature map. Finally, Grad-CAM takes a linear combination of the weights $\alpha^c_k$ and the feature maps $A^k$, followed by a ReLU, which gives us the heatmaps.

Stage I
Having trained 10 neural networks, we found that two networks tend to overfit more than others. This is likely connected with their normalization layers. Networks such as MobileNet V2 and VGG-16 do not have Batch/Instance/Layer/Group Normalization layers in their architecture. In this regard, these networks start overfitting (MobileNet V2) or hit a validation loss/accuracy plateau (VGG-16) after approximately 100 epochs, while the training accuracy keeps increasing. Popular regularization techniques such as Lasso Regression (L1 Regularization), Ridge Regression (L2 regularization), ElasticNet (L1-L2 regularization), Dropout, and Early Stopping may help to avoid this problem. In this regard, we applied Ridge Regression, Dropout layers, and Early Stopping in our training pipeline. As for the remaining networks, they did not suffer from overfitting; however, they could not reach better validation loss/accuracy values. When a given model reached its best validation loss, we saved the associated model weights using a saving callback. Fig. 2 demonstrates how the accuracy dynamics of the networks evolved during the first training stage. Blue asterisks reflect the best value of the accuracy on the validation subsets.
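The interplay of Early Stopping and checkpointing at the best validation loss can be sketched framework-independently. The function below is a hypothetical helper that replays a validation-loss history; the study's training ran with a patience of 10 epochs.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return (best_epoch, stop_epoch) for a validation-loss history.

    best_epoch is where the weights would be checkpointed; stop_epoch is the
    last epoch run before training halts (0-based indices).
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch  # save weights here
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch  # no improvement for `patience` epochs
    return best_epoch, len(val_losses) - 1


# Short illustrative history with a reduced patience of 2 epochs.
print(early_stopping_epoch([1.0, 0.8, 0.9, 0.85, 0.95], patience=2))  # (1, 3)
```

The weights kept for evaluation are those of `best_epoch`, not of the epoch at which training stops.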
Since loss values are hard to interpret, we compared commonly used network metrics such as accuracy and F1-score. Table 4 and Table 5 summarize these metrics estimated during the first stage. As seen, MobileNet V2, EfficientNet B1, EfficientNet B3, and VGG-16 achieved better results than the other networks. Additionally, we provide all obtained metrics (Accuracy, F1-score, Precision, and Recall), computed over different subsets, classes, and stages, in Appendix A.
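For reference, accuracy and per-class F1-scores can be computed from predictions as sketched below. Averaging the per-class F1-scores without weighting (macro averaging) is an assumption here; the averaging scheme behind the reported tables is not stated in this section.

```python
def accuracy_and_macro_f1(y_true, y_pred, classes):
    """Compute overall accuracy and the unweighted mean of per-class F1-scores."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return correct / len(y_true), sum(f1_scores) / len(classes)


# Toy predictions using the class abbreviations Norm, PNA, and COV.
acc, f1 = accuracy_and_macro_f1(
    ["Norm", "Norm", "PNA", "COV"],
    ["Norm", "PNA", "PNA", "COV"],
    classes=["Norm", "PNA", "COV"],
)
```

In the toy example the accuracy is 0.75 while the macro F1-score is 7/9, showing how a single misclassification affects two classes at once.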

Stage II
Based on the results of the first stage, MobileNet V2, EfficientNet B1, EfficientNet B3, and VGG-16 demonstrated their ability to distinguish COVID-19 and pneumonia on X-ray images much better than the other networks. During the second stage, we chose these four most accurate networks to advance to full training. The weights of each network were unfrozen, such that both the feature extractor and the classifier were trained. Having obtained the accuracy dynamics, we compare, in Fig. 3, how fully-trained networks differ from the networks fine-tuned in the first stage. The blue asterisks in this figure reflect the best value of the accuracy reached on the validation subset.
Having compared the accuracy and F1-score values obtained in the first (Tables 4 and 5) and second stages (Tables 6 and 7), we can state that MobileNet V2 and VGG-16 have a larger boost in accuracy than the EfficientNet models. Once full training was performed, MobileNet V2 and VGG-16 showed a +6% and +9% accuracy change on the validation subset and a +1% and +4% accuracy change on the testing subset, respectively. On the other hand, EfficientNet B1 and EfficientNet B3 displayed a +2% and +3% accuracy change on the validation subset and a −1% and +6% accuracy change on the testing subset, respectively. It should also be noted that the largest boost in the classification of COVID-19 was achieved by VGG-16. This network had an +11% boost, while MobileNet V2, EfficientNet B1, and EfficientNet B3 reached +2%, 0%, and +6%, respectively.

Stage III
Once the networks were fine-tuned and fully trained, we trained the best four networks using the proposed pipeline based on indirect supervision. Having trained the chosen networks according to the pipeline described in Section "Description of the workflow stages", we compared them on the validation and testing subsets, as reflected in Fig. 4 and Appendix B. Based on the obtained results, we established that the proposed pipeline allows for boosting of the model accuracy. VGG-16 and MobileNet V2 showed the best accuracy on the validation and testing subsets. It is worth noting that the VGG-16 network outperformed the best CXR model (CXR-4A) on these subsets. The performance of other CXR models is additionally shown in Appendix A. The VGG-16 (S3) network trained with the proposed pipeline has a +9% and +1% accuracy boost on the validation subset compared to VGG-16 (S1) and VGG-16 (S2), respectively. Similar positive dynamics of using our pipeline are observed for the other models as well. It should be noted that CXR-4A and the lightweight MobileNet V2 have almost the same accuracy, while the complexity of the latter is about 11 times lower: the MobileNet V2 network includes 3.5 million weights, while CXR-4A includes 40.2 million weights.
In general, the network that produces the best results is VGG-16, having consistently high values in every metric. We assume that VGG-16 could achieve such a high accuracy because of its high complexity and large number of parameters (138.4 million) compared to the other studied networks. Additionally, we found that a plain network architecture is more suitable for the classification of indistinct lung areas such as COVID-19 and pneumonia-affected regions. Both VGG-16 and MobileNet V2 are based on a straight-line architecture, including, at most, a few skip-connections. In contrast, the EfficientNet, ResNet, Inception, and Inception ResNet network families are based on complex architectures, including a wide variety of skip-connections such as identity/projection shortcuts (ResNet and Inception ResNet) and inception modules (Inception and Inception ResNet). It is worth noting that networks such as Inception V3 and Inception ResNet V2 integrate multiple kernels of different sizes (1 × 1, 3 × 3, and 5 × 5), which should assist in detecting area-specific features. However, the 3 × 3 convolutional kernels integrated into VGG-16 and MobileNet V2 turned out to provide a better solution, giving the networks a better generalization ability and a better ability to distinguish healthy patients from those diagnosed with COVID-19 or pneumonia.

Model validation using Grad-CAM
As we mentioned in Section "Visual model validation", although deep learning models have facilitated unprecedented accuracy in image classification, one of their biggest drawbacks is limited interpretability, which is a core component in understanding and debugging a model. We used the Grad-CAM technique to validate the models and their ability to make predictions and to verify which neurons were activated in the forward pass during the prediction. For the sake of visualization, we chose several patients with different findings: pneumonia and COVID-19. Source images of these findings with their ground truth heatmaps and the heatmap dynamics over the three stages are shown in Fig. 5 and Fig. 6.
Using Grad-CAM, we validated where our four best networks (MobileNet V2, EfficientNet B1, EfficientNet B3, VGG-16) are focusing, verifying that they are looking at the correct patterns in the image and activating around those patterns. The Grad-CAM technique uses the gradients flowing into the final convolutional layer to produce a coarse localization heatmap, highlighting the important regions in the image for predicting the target concept, i.e. COVID-19 or pneumonia areas. However, the localization heatmaps may differ from the output of traditional localization techniques such as segmentation masks or bounding boxes. In this regard, these heatmaps are used for the sake of approximate localization. In order to interpret the models, Figs. 5 and 6 show the visualization of gradient class activation maps. Additional cases of the networks' heatmaps are shown in Appendix C and Appendix D. Due to the nature of the task at hand, we utilize Grad-CAM for training and visualization purposes only. As we do not segment the COVID-19 affected regions, we have insufficient image information to compute associated metrics such as the Dice coefficient or the Jaccard distance. However, based on the obtained results, we may state that training the models using soft masks obtained by the indirect supervision mechanism (Stage III) has a positive effect on the search for correct patterns by the models. Networks such as MobileNet V2 (Figs. 5c and 6) and VGG-16 (Figs. 5f and 6f) identify affected areas correctly, despite the inaccuracies in the location of the heatmaps. On the other hand, interpretation of the EfficientNet networks showed that they are not activating around the proper patterns of the image. This allows us to assume that EfficientNet B1 and EfficientNet B3 have not properly learned the underlying patterns in our dataset and/or that we may need to collect additional data for more complex training.

Conclusion
In this study, we demonstrated a training pipeline based on indirect supervision for neural networks. This supervision forces the neural networks to pay attention to the areas obtained by the external algorithm. Having trained a set of deep learning models, we found that the proposed pipeline allows for an increased classification accuracy. This pipeline was used for the detection of COVID-19 and distinguishing its presence from that of pneumonia. Of the obtained results, MobileNet V2 performed comparably to the tailor-made CXR model CXR-4A, despite being 11 times less complex. According to the performed experiments, the networks trained based on the proposed pipeline perform comparably to practicing radiologists when it comes to the classification of multiple thoracic pathologies in chest X-ray radiographs. Our pipeline may have the potential to improve healthcare delivery and increase access to chest radiograph expertise for the detection of a variety of acute diseases.

Author contributions
Y.G., S.S., and O.T. conceived the idea of the study. V.D., Y.G., and O.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Abbreviations: Norm, normal (no findings); PNA, pneumonia; COV, COVID-19.