Utilisation of deep learning for COVID-19 diagnosis

The COVID-19 pandemic that began in 2019 has resulted in millions of deaths worldwide. Over this period, the economic and healthcare consequences of COVID-19 infection in survivors of the acute illness have become apparent. During the course of the pandemic, computer analysis of medical images and data has been widely used by the medical research community. In particular, deep-learning methods, which are artificial intelligence (AI)-based approaches, have been frequently employed. This paper provides a review of deep-learning-based AI techniques for COVID-19 diagnosis using chest radiography and computed tomography. Thirty papers published from February 2020 to March 2022 that used two-dimensional (2D)/three-dimensional (3D) deep convolutional neural networks combined with transfer learning for COVID-19 detection were reviewed. The review describes how deep-learning methods detect COVID-19, and several limitations of the proposed methods are highlighted.


Introduction
The SARS-CoV-2 (COVID-19) virus, detected in December 2019, 1 has, as of April 2022, infected over 507 million people across the world and resulted in a global death toll of over 6 million people. 2 Effective and efficient primary screening for COVID-19 infection has been a cornerstone of management. 3 The standard primary screening tool for COVID-19 has been the reverse transcription polymerase chain reaction (RT-PCR) test, 4 in which the ribonucleic acid (RNA) of the COVID-19 virus is identified in sputum samples obtained from the upper respiratory tract; however, early in the pandemic, several studies highlighted a variable sensitivity of RT-PCR tests, which was influenced by the time of collection of the specimen relative to the time of infection. 5,6 During the acute phase of the pandemic, laboratory services were often overwhelmed by the volume of testing required, resulting in diagnostic delays. As a result, an additional diagnostic screening method considered alongside RT-PCR was the examination of chest radiographs (CXR). 7,8 CXRs were favoured as the necessary equipment is easy to access, they are fast to perform and interpret, and portable systems can markedly reduce the chances of virus transmission 9,10; however, a major challenge in CXR screening during the pandemic was the limited number of expert radiologists available for interpreting imaging data. 11 In some European centres, computed tomography (CT) imaging was used to screen patients for COVID-19. 6,12,13 CT imaging was shown to be more sensitive than CXR in diagnosing COVID-19, particularly in cases where the diagnosis was incidental, for example, in the work-up of elective surgical patients. 14,15 CT has also been valuable in assessing the lungs of patients with worsening respiratory complications and in patients with negative RT-PCR test results where COVID-19 infection remained in the differential.
Yet the specificity of CT for diagnosing COVID-19 (as with CXRs) is limited, 16,17 making radiological interpretation of imaging for COVID-19 diagnosis challenging. 18–20 Known workforce shortages of radiologists, combined with the low specificity of chest imaging methods in diagnosing COVID-19, led many research groups to develop AI-based algorithms to support clinicians and radiologists diagnosing COVID-19. 21 The present review provides an overview of previously proposed AI methods using deep-learning (DL) algorithms for diagnosing COVID-19 using CXR and CT.

Application of DL in COVID-19 detection
DL in COVID-19 detection using CXR images
As mentioned above, the convenience and ubiquity of CXRs in the assessment of COVID-19 led to an exponential growth in the acquisition of CXR data. The resulting large datasets were leveraged by AI researchers to develop automated DL algorithms for COVID-19 detection. COVID-19 diagnosis has typically been considered as a two-group or three-group classification challenge. When considered a two-group problem, computer algorithms were trained to distinguish between COVID-19 imaging and imaging acquired from healthy controls. Classification in this way, however, does not allow inference of how an algorithm would interpret a CXR containing non-COVID-19 pneumonia. A three-group classification task aims to distinguish between COVID-19 pneumonia, non-COVID-19 pneumonia, and normal CXR imaging.
The method proposed by Apostolopoulos et al. 22 is one of the earliest AI-based approaches proposed for COVID-19 detection. In this paper, the authors used state-of-the-art convolutional neural networks (CNNs) including VGG19, 23 MobileNet v2, 24 Inception, 25 Xception, 26 and Inception ResNet v2. 25 The process of transfer learning was used for COVID-19 diagnosis. In DL analyses, transfer learning is a commonly applied process where computer models previously trained for a specific task (e.g., classification of two image classes such as cats and dogs) reuse the stored knowledge gained in the initial task and apply it to a new but related task (e.g., classification of two image classes such as COVID-19 and non-COVID-19 CXRs). In the study by Apostolopoulos et al., 22 the best accuracies achieved for two-group (normal, COVID-19) and three-group (normal, bacterial pneumonia, COVID-19) classification were 98.75% and 93.48%, respectively. Moreover, they tested the proposed model on a separate dataset with additional viral pneumonia cases, and the reported performance was 96.78% and 94.72% for two-group and three-group classification, respectively.
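The idea of transfer learning described above can be illustrated with a deliberately small sketch. It is not the pipeline of any reviewed study: the "pre-trained" backbone is a hypothetical fixed set of weights (`FROZEN_WEIGHTS`), the data are synthetic two-value inputs rather than CXRs, and only a new logistic-regression head is fitted for the target task.

```python
import math
import random

# Hypothetical stand-in for a pre-trained backbone: these weights are fixed
# and are never updated while the new task is learned.
FROZEN_WEIGHTS = [[0.9, -0.4], [-0.3, 0.8], [0.5, 0.5]]


def frozen_features(x):
    """Map a 2-value input to 3 features using the frozen weights (ReLU units)."""
    return [max(0.0, w[0] * x[0] + w[1] * x[1]) for w in FROZEN_WEIGHTS]


def predict(head, bias, feats):
    """Probability of class 1 from the trainable head."""
    z = sum(h * f for h, f in zip(head, feats)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train_head(samples, labels, epochs=500, lr=0.5):
    """Fit only a new logistic-regression head on top of the frozen features."""
    head, bias = [0.0, 0.0, 0.0], 0.0
    feats = [frozen_features(x) for x in samples]
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            err = predict(head, bias, f) - y  # gradient of the log-loss w.r.t. z
            head = [h - lr * err * fi for h, fi in zip(head, f)]
            bias -= lr * err
    return head, bias


random.seed(0)
# Synthetic "target task" data (assumption, not imaging data):
# class 1 has a large first value, class 0 a large second value.
samples = [(random.uniform(0.6, 1.0), random.uniform(0.0, 0.4)) for _ in range(20)]
samples += [(random.uniform(0.0, 0.4), random.uniform(0.6, 1.0)) for _ in range(20)]
labels = [1] * 20 + [0] * 20

head, bias = train_head(samples, labels)
accuracy = sum(
    (predict(head, bias, frozen_features(x)) >= 0.5) == bool(y)
    for x, y in zip(samples, labels)
) / len(samples)
print(f"training accuracy of the new head: {accuracy:.2f}")
```

In practice the frozen part would be a deep CNN trained on a large image collection, and some of its later layers are often fine-tuned as well; the sketch only shows the division between reused and newly trained parameters.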
Ozturk et al. 28 designed a 19-layer CNN known as DarkCOVIDNet, which was trained and tested in two-group and three-group classification tasks. They specifically visualised the heat maps of the proposed model using the Grad-CAM 33 approach. A heat map or saliency map is a graphical two-dimensional representation of CNN information that uses a colour-coding system to represent areas of differing importance within the image. An example of a CXR heat map produced using the Grad-CAM method can be seen in Fig 1. The outputs of the heat maps in the study by Ozturk et al. 28 were assessed qualitatively by an expert radiologist. The optimal performance obtained by the DarkCOVIDNet was 98.08% and 87.02% for two-group and three-group classification tasks, respectively.
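As a rough illustration of how a Grad-CAM heat map is formed, the sketch below applies the standard Grad-CAM recipe to made-up numbers: each feature map is weighted by the spatial average of the gradient of the class score with respect to that map, the weighted maps are summed, and negative values are clipped. The tiny 2 x 2 feature maps, the gradients, and the final rescaling to the range 0 to 1 (added here only for display) are assumptions, not values from any reviewed model.

```python
def grad_cam(activations, gradients):
    """Grad-CAM from K feature maps and their gradients (each H x W nested lists)."""
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight = spatial average of the gradient for that feature map.
    weights = [sum(sum(row) for row in g) / (height * width) for g in gradients]
    heat = [[0.0] * width for _ in range(height)]
    for w, fmap in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                heat[i][j] += w * fmap[i][j]
    # ReLU keeps only regions with a positive influence on the class score.
    heat = [[max(0.0, v) for v in row] for row in heat]
    # Rescale to [0, 1] for display (an illustrative choice).
    peak = max(max(row) for row in heat)
    if peak > 0:
        heat = [[v / peak for v in row] for row in heat]
    return heat


# Made-up 2 x 2 feature maps and gradients for two channels.
activations = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 3.0], [1.0, 0.0]]]
gradients = [[[0.5, 0.5], [0.5, 0.5]], [[-1.0, -1.0], [-1.0, -1.0]]]
heat_map = grad_cam(activations, gradients)
print(heat_map)  # [[0.5, 0.0], [0.0, 1.0]]
```

In a real model the low-resolution map is upsampled to the size of the input image and overlaid on it with a colour scale, which is what produces the warm and cold regions seen in published heat maps.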
Rahimzadeh et al. 29 proposed a twin CNN (TCNN) architecture, using two well-known CNNs, Xception 26 and ResNet, 34 to extract parallel deep features from an image. A deep or latent feature is the consistent model response or output at the last node or layer. The extracted latent features were combined for the final prediction. The accuracies for the two-group and three-group classification tasks were 99.05% and 94.40%, respectively. The results suggested that a TCNN approach could boost algorithm performance for COVID-19 prediction. Similar to Rahimzadeh et al., 29 Ouchicha et al. 30 proposed a TCNN with shared layers known as CVDNet. The designed architecture was tested on a three-group classification task, producing an accuracy of 96.69%.
A challenging area for computer algorithms when assessing and classifying diseases on CXRs lies in the region of the diaphragm. The diaphragm contains areas of high density that can confuse algorithms and, in the context of COVID-19 classification, result in misdiagnosis. Heidari et al. 27 proposed a pre-processing algorithm that can boost the performance of a CNN by identifying and removing the diaphragmatic area with multi-stage image processing algorithms. The results in this paper suggest that using their proposed processing pipeline can improve model performance in COVID-19 detection to 98.1% and 94.5% for two-group and three-group classification tasks, respectively.
SqueezeNet is a well-designed CNN proposed for natural image classification, which uses fewer parameters than other models. 35 In the study by Ucar et al., 31 the building blocks of SqueezeNet were used to design a new network called COVIDiagnosis-Net. To perform tuning of model hyperparameters, they used a Bayesian optimisation 36 algorithm. Moreover, multi-scale offline image augmentation was performed to overcome imbalances in data classes. The output accuracy of COVIDiagnosis-Net for the three-group classification task was reported as 98.30%.
One of the most successful strategies used to improve classification performance is Ensemble Learning (EL). 37 EL combines the output of several independent deep-learning models, each of which may have individual and distinct strengths. The expectation is that the various models will show complementary performance when combined, making the ensemble greater than the sum of its parts and more robust to unseen data. Rajaraman et al. 38 combined the predictions of nine individual DL-based models for two- and three-group classification tasks using several well-known ensemble strategies including max voting, averaging, weighted averaging, and stacking. Moreover, they also proposed an iterative pruning strategy to fine-tune the model. The iterative pruning strategy was able to identify the optimal number of layers for a given network and, in so doing, decreased the complexity of the model without compromising model performance. The accuracy reported for the three-group classification task was 98% without pruning and 99% when the pruning strategy was employed.
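Three of the ensemble strategies named above (max voting, averaging, and weighted averaging) can be sketched in a few lines. The class probabilities and model weights below are invented for illustration, stacking is omitted because it requires training a further model on the base-model outputs, and none of this reproduces the implementation of Rajaraman et al.

```python
from collections import Counter

CLASSES = ["normal", "non-COVID-19 pneumonia", "COVID-19"]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


def max_voting(prob_sets):
    """Each model votes for its top class; the most frequent class wins."""
    votes = Counter(argmax(p) for p in prob_sets)
    return votes.most_common(1)[0][0]


def weighted_averaging(prob_sets, weights=None):
    """Average class probabilities; equal weights give plain averaging."""
    weights = weights or [1.0] * len(prob_sets)
    total = sum(weights)
    n_classes = len(prob_sets[0])
    mean = [
        sum(w * p[c] for w, p in zip(weights, prob_sets)) / total
        for c in range(n_classes)
    ]
    return argmax(mean), mean


# Hypothetical class probabilities from three models for one image.
model_outputs = [
    [0.10, 0.50, 0.40],
    [0.20, 0.45, 0.35],
    [0.05, 0.05, 0.90],
]
vote = CLASSES[max_voting(model_outputs)]
avg_index, avg_probs = weighted_averaging(model_outputs)
weighted_index, _ = weighted_averaging(model_outputs, weights=[1.0, 1.0, 3.0])
print(vote, "|", CLASSES[avg_index], "|", CLASSES[weighted_index])
```

With these invented numbers, voting and averaging disagree: two models narrowly prefer non-COVID-19 pneumonia, but the third is confident enough in COVID-19 to dominate the averaged probabilities. This is one reason the choice of combination rule matters.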
Other CNN-based models for COVID-19 detection include multi-dilation CNN; 39 CoroNet, 21 which uses the Xception 26 architecture; COVID-CAPS 40 and Convolutional capsnet, 41 which use a capsule network-based framework; 42 and algorithms based on ResNet 43 and MobileNet 44 architectures.
Training DL models requires a considerable amount of data; however, the main challenge faced by the models mentioned previously in this review lay in the small sample size of the datasets available for model training. This was essentially a consequence of difficulties in data sharing in the early stages of the pandemic and the focus on acute clinical care in the emergency setting. Data limitations can result in early models demonstrating poor generalisability for unseen or out-of-distribution data. To overcome the handicap of limited data, most of the recently proposed algorithms based on DL approaches utilise transfer learning to boost performance; however, the main disadvantage of transfer learning is that of negative transfer, which occurs when the initial and target tasks are not similar enough to allow satisfactory training of the model. To avoid this, the authors in 32 took the novel approach of generating synthetic CXR images (using an algorithm called COVIDGAN) using generative adversarial networks (GANs). 45 Specifically, they used an auxiliary classifier generative adversarial network (ACGAN) 46 architecture to enhance COVID-19 detection. The results suggested that the synthetic images produced by COVIDGAN can increase COVID-19 detection accuracy by up to 10%.

Figure 1 Representation of CNN outputs on a heat map using Grad-CAM. 33 The first column shows the original CXR images from the NCCID dataset 72 and the second column shows the heat-map representation of the model output for the corresponding images. 73 Warmer colours show strong signals (higher values) and colder colours show weak signals (lower values).
A summary of the proposed methods using CXR images for COVID-19 detection can be seen in Table 1.

DL in COVID-19 detection using CT
In several countries, such as The Netherlands, 47,48 CT was the primary imaging methodology used to assess the lung to identify COVID-19 infection. The improved spatial resolution of CT over CXR can enhance the sensitivity of COVID-19 detection. This is particularly valuable in detecting subtle early disease or identifying ground-glass densities hidden by the heart or hemidiaphragm on frontal CXRs 6,49; however, workforce limitations in radiology departments during the pandemic meant that reading vast numbers of CT studies to detect COVID-19 became a challenging task. 50 These pressures, coupled with reader intra- and interobserver variability, led AI experts to propose automated algorithms to diagnose COVID-19 on CT.
An early method proposed by Chen et al. 51 was based on a 2D Unet++ 52 pre-trained on ImageNet. 53 The model was initially trained to segment regions of interest within the lungs on CT and then predict suspicious lesions within these regions. The best accuracy reported by this approach was 92.59%. Similarly, Gunraj et al. 54 proposed a COVID-19 detection algorithm based on a pre-trained 2D CNN combined with optimised neural network architecture, informed by human prerequisites for model sensitivity and positive predictive value for COVID-19 detection. Specifically, they benefited from a machine-driven design exploration strategy proposed in 55 to design the network architecture automatically and identify the designed architecture patterns/blocks. The proposed model, called COVIDNet-CT, had a reported accuracy for a three-group (normal, non-COVID-19 pneumonia, and COVID-19) classification task of 99.10%.
Another study used the COVNet TCNN model, 56 which is based on two parallel ResNet50s (CNNs that are 50 layers deep) 34 with weight sharing. Similar to the previously mentioned models, this method first identified the lungs using a 2D Unet 57 and then trained the TCNN to detect features of COVID-19 within the lungs. The authors reported an area under the receiver operating characteristic curve (AUROC) value of 96.50%. Utilising the same concept, Wang et al. 58 designed a TCNN architecture based on 3D networks. They used a 3D U-Net 59 for lung segmentation and then combined two 3D-ResNets for COVID-19 detection. They utilised a prior-attention mechanism and proposed a new prior-attention residual learning block to boost the model's performance. The reported accuracy and AUROC were 93.30% and 97.30%, respectively. Song et al. 60 proposed an architecture based on three parallel 2D ResNets with weight sharing to extract different levels of lung CT features, including global features, detailed local features, and rational features. They directly extracted global features from the whole lung area as an initial processing step using ResNet50. A feature pyramid network algorithm was utilised to segment the lung area and generate sub-region/images at different scales. Then, based on the defined sub- [...]

Acar et al. 61 proposed a pipeline based on GAN 45 algorithms to increase the performance of the previously developed CNNs for COVID-19 detection. They initially extracted the lungs using a deep network called BDCU-Net. 62 They then used a GAN model to synthesise new lung CT images, thereby increasing the number of samples available for training. For their final step, they utilised several well-known pre-trained 2D CNNs to diagnose COVID-19 and demonstrated that using GANs in the baseline models could increase COVID-19 diagnostic accuracy by up to 9%.
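Several of the CT studies above report AUROC alongside, or instead of, accuracy. The sketch below computes both on a handful of invented scores to show how they differ: AUROC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, and so does not depend on a decision threshold, whereas accuracy does. The scores, labels, and the 0.5 threshold are assumptions for illustration.

```python
def auroc(scores, labels):
    """Probability that a random positive case outscores a random negative one."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


def accuracy(scores, labels, threshold=0.5):
    """Fraction of cases classified correctly at a fixed threshold."""
    correct = sum((s >= threshold) == (y == 1) for s, y in zip(scores, labels))
    return correct / len(labels)


# Invented model scores for six scans (1 = COVID-19, 0 = not COVID-19).
scores = [0.9, 0.8, 0.45, 0.6, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]

print(f"AUROC    = {auroc(scores, labels):.3f}")     # 8 of 9 pairs ranked correctly
print(f"accuracy = {accuracy(scores, labels):.3f}")  # 4 of 6 correct at threshold 0.5
```

The pairwise loop is quadratic in the number of cases, which is fine for a worked example; library implementations use a sorting-based calculation that gives the same value.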
One of the main challenges in training AI models for COVID-19 detection using CT is having good quality and abundant manual labels of regions of COVID-19-affected lung parenchyma. A strategy that has been used to mitigate limited labels has been to use weakly supervised DL algorithms 63 that can efficiently work on datasets without labels but can maintain sufficient accuracy for COVID-19 detection. Inspired by the work of Yang et al., 64 Hu et al. 63 initially proposed a 2D Unet multi-view segmentation model with additional attention layers to extract the lungs from the CT. Later they modified the VGG 23 network, adding a weakly supervised multi-scale learning algorithm to increase COVID-19 detection performance. The accuracy and AUROC for this model were 87.4% and 89.60%, respectively. Similarly, Wang et al. 65 proposed a weakly supervised framework based on a 2D Unet and a 3D CNN. As a first stage, they used a 2D Unet model to segment the lungs on CT images. Then they used the segmented mask to extract the lung area and combined the slices to create a 3D lung mask. They then trained a 3D CNN encoder called DeCoVNet by concatenating the original 3D scans with the lung masks and combined this with a weakly supervised COVID-19 lesion localisation algorithm to classify the scans. The accuracy and AUROC for this approach were 90.10% and 95.90%, respectively.
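The mask-based pre-processing described above can be sketched on toy data. The example applies a binary lung mask to each slice, stacks the slices into a volume, and pairs each CT voxel with its mask value. It is not the DeCoVNet code: the function names, the 2 x 2 slices, the background fill value, and the choice to represent the concatenation as a second channel per voxel are all assumptions.

```python
def apply_mask(ct_slice, mask_slice, background=0):
    """Keep CT values inside the lung mask; set everything else to `background`."""
    return [
        [value if inside else background for value, inside in zip(ct_row, mask_row)]
        for ct_row, mask_row in zip(ct_slice, mask_slice)
    ]


def with_mask_channel(ct_volume, mask_volume):
    """Pair every CT voxel with its mask voxel, giving a D x H x W x 2 input."""
    return [
        [
            [(value, inside) for value, inside in zip(ct_row, mask_row)]
            for ct_row, mask_row in zip(ct_slice, mask_slice)
        ]
        for ct_slice, mask_slice in zip(ct_volume, mask_volume)
    ]


# Two invented 2 x 2 CT slices (arbitrary intensity values) and their 2D lung masks,
# as a segmentation model might produce them slice by slice.
ct_slices = [[[-850, 40], [35, -820]], [[-840, 45], [30, -810]]]
mask_slices = [[[1, 0], [0, 1]], [[1, 0], [0, 1]]]

# Lung-only volume: masked slices stacked along the depth axis.
lung_volume = [apply_mask(ct, mask) for ct, mask in zip(ct_slices, mask_slices)]
# Two-channel volume: the original scan alongside the stacked 3D mask.
paired_volume = with_mask_channel(ct_slices, mask_slices)

print(lung_volume[0])          # [[-850, 0], [0, -820]]
print(paired_volume[0][0][0])  # (-850, 1)
```

Supplying the mask as an extra channel, rather than only blanking the background, lets a network see the full scan while still being told where the lungs are; which of the two a given study used should be checked against the original paper.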
Other similar approaches based on CNNs for COVID-19 detection include methods based on Unet architectures, 66,67 methods that utilised attention-based networks, 68 and other CNN architectures focused on segmentation and classification. 20,50,69,70 A summary of the various CNN methods that used CT can be seen in Table 2.

Discussion
The present review described the application of DL-based AI models to detect and diagnose COVID-19 on CXR and CT. In total, 30 studies performed within the time frame of the review were assessed. Sixteen studies with pioneering approaches to COVID-19 diagnosis are described, detailing the model architectures and their reported performance. (The remaining 14 studies used similar approaches, and their model architectures are summarised briefly.) Yet, comparing performance between models is an essentially futile exercise as none of the models described examined a common standardised dataset, thereby making an unbiased definitive comparison impossible.
Based on the present review, it is clear that there are several challenges related to the application of AI for COVID-19 detection. Invariably when image analysis of CXR or CT images for COVID-19 detection is performed, the ground truth for diagnosis is the RT-PCR test. Yet, solely using a molecular test to confirm the presence of lung disease is flawed. A patient may have COVID-19 infection with no lung manifestations. The CXR and CT may be clear, but the AI model will be forced to classify the lungs as COVID positive. A consequence is that the model will be trained on faulty ground truth data. A comparable scenario arises in times of endemic infection, when a patient with non-COVID-19 pneumonia may acquire nosocomial COVID-19 that does not affect the lungs. The model will again be forced to learn that the lung abnormalities are those of COVID-19 even though the underlying aetiology might be a different infectious agent. False-negative RT-PCR results have also been reported not infrequently. These cases can result in spurious training data where the AI model will be forced to learn that basic COVID-19 features on CXR or CT should be classified as not being COVID-19. In all these examples, the flawed assumption that the RT-PCR test is a surrogate for lung infection is a major constraint to AI model performance.
Despite the good accuracy and performance quoted by the various AI models described in this review, there has been almost no evaluation as to how these models perform on real-world CT and CXR examinations. A recent study 71 highlighted quite dramatically that when some of these algorithms were re-implemented, the regions of the image that were the key determinant of how the image was classified came from features outside of the lungs. On CXR images, non-pathological features in the images, such as laterality markers (identifying the left or right side of the patient on the image), image edges, the diaphragm, and the cardiac silhouette, strongly influenced predictions of COVID-19 status. These features, instead of emphasising COVID-19-related damage, are more likely to reflect differences across training datasets where the CXR acquisition (anterior–posterior [AP] versus posterior–anterior [PA]) or patient position differed between centres contributing data to the imaging database. 71 These observations underscore the need for interpretability and transparency of AI model outputs so that human readers can be confident in the logic by which AI models come to their conclusions. These requirements are essential for safety and trust in AI systems.
It is also worth considering that the high accuracy reported for several of the AI methods can result from undesired bias within the datasets used, such as training and testing the models on the same dataset. For example, the AI model proposed by Acar et al. 61 reported the highest accuracy among all the proposed AI methods that analysed CT scans; however, when their model was tested on an external dataset, the accuracy decreased by 8.5%, suggesting that the model was overfitted to the training dataset.
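One common safeguard against the bias described above is to hold out all data from one contributing centre and use it only for a final, external test. The sketch below shows such a split on invented records; the field names (`id`, `site`, `label`) and site names are hypothetical, and the reviewed studies did not necessarily organise their data this way.

```python
def split_by_site(records, external_site):
    """Hold out every record from one site so it is never seen during training."""
    development = [r for r in records if r["site"] != external_site]
    external = [r for r in records if r["site"] == external_site]
    return development, external


# Invented image records from three hypothetical centres.
records = [
    {"id": "img-001", "site": "hospital_a", "label": 1},
    {"id": "img-002", "site": "hospital_a", "label": 0},
    {"id": "img-003", "site": "hospital_b", "label": 1},
    {"id": "img-004", "site": "hospital_b", "label": 0},
    {"id": "img-005", "site": "hospital_c", "label": 1},
    {"id": "img-006", "site": "hospital_c", "label": 0},
]

development_set, external_set = split_by_site(records, external_site="hospital_c")

# Sanity check: no image and no site appears on both sides of the split.
shared_ids = {r["id"] for r in development_set} & {r["id"] for r in external_set}
shared_sites = {r["site"] for r in development_set} & {r["site"] for r in external_set}
print(len(development_set), len(external_set), shared_ids, shared_sites)
```

The development set can then be divided further into training and validation portions, while the external set is used once at the end. A large gap between internal and external performance is the kind of signal that suggests overfitting to the development data.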
Of the studies reviewed in Tables 1 and 2, performance was better for methods that used CT images; however, models that use CT and 3D processing algorithms utilise more complex algorithmic pipelines and need greater computational resources. They also have an increased likelihood of overfitting the model to their training data as the number of parameters used in the 3D DL networks is higher. Furthermore, as mentioned previously, creating manual labels of imaging features on CT images is an expensive process requiring the input of experts who are typically in short supply. Models that use CXR images are intrinsically less complex as they are based on 2D DL networks with single end-to-end pipelines. As CXR datasets are typically much larger than CT datasets, the training data have a better chance of being more representative of the various disease manifestations in the lungs; however, as previously described, relying on RT-PCR results as the ground truth for lung damage is a major constraint to the models.
According to the literature, most of the proposed algorithms analysing COVID-19 focus on distinguishing COVID-19 from non-COVID pneumonia; however, clinically, there are other differentials of COVID-19 on CXR, including other viral or bacterial infections, aspiration pneumonia, etc. The main limitation of the proposed methods is that they ignore other types of lung damage/disease, limiting the ability of an AI algorithm to differentiate COVID-19 from other kinds of illness. Moreover, a patient may simultaneously have COVID-19 and a bacterial or viral pneumonia.
There are limitations to this review. The focus was on the image analysis aspects of these studies, and studies where clinical data were also built into AI models were not explored. The content of this review describes a rapidly evolving research field. Papers published until March 2022 were reviewed, but since this time, several new AI models have been released and were beyond the remit of this review.

Declaration of interests
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Joseph Jacob reports financial support was provided by Wellcome Trust.