Deep Learning for Chest X-ray Analysis: A Survey

Recent advances in deep learning have led to promising performance in many medical image analysis tasks. As the most commonly performed radiological exam, chest radiography is a particularly important modality for which a variety of applications have been researched. The release of multiple large, publicly available chest X-ray datasets in recent years has encouraged research interest and boosted the number of publications. In this paper, we review all studies using deep learning on chest radiographs, categorizing works by task: image-level prediction (classification and regression), segmentation, localization, image generation and domain adaptation. Commercially available applications are detailed, and a comprehensive discussion of the current state of the art and potential future directions is provided.


Introduction
A cornerstone of radiological imaging for many decades, chest radiography (chest X-ray, CXR) remains the most commonly performed radiological exam in the world, with industrialized countries reporting an average of 238 erect-view chest X-ray images acquired per 1,000 population annually (United Nations, 2008). In 2006, an estimated 129 million CXR images were acquired in the United States alone (Mettler et al., 2009). The demand for, and availability of, CXR images may be attributed to their cost-effectiveness and low radiation dose, combined with a reasonable sensitivity to a wide variety of pathologies. The CXR is often the first imaging study acquired and remains central to screening, diagnosis, and management of a broad range of conditions (Raoof et al., 2012).
Chest X-rays may be divided into three principal types, according to the position and orientation of the patient relative to the X-ray source and detector panel: posteroanterior, anteroposterior, and lateral. The posteroanterior (PA) and anteroposterior (AP) views are both considered frontal views, with the X-ray source positioned behind or in front of the patient, respectively. The AP image is typically acquired from patients in the supine position, while the patient is usually standing erect for the PA acquisition. The lateral image is usually acquired in combination with a PA image and projects the X-ray from one side of the patient to the other, typically from right to left. Examples of these image types are depicted in Figure 1.
The interpretation of the chest radiograph can be challenging due to the superimposition of anatomical structures along the projection direction. This effect can make it very difficult to detect abnormalities in particular locations (for example, a nodule posterior to the heart in a frontal CXR), to detect small or subtle abnormalities, or to accurately distinguish between different pathological patterns. For these reasons, radiologists typically show high inter-observer variability in their analysis of CXR images (Quekel et al., 2001; Balabanova et al., 2005; Young, 1994).
The volume of CXR images acquired, the complexity of their interpretation, and their value in clinical practice have long motivated researchers to build automated algorithms for CXR analysis. Indeed, this has been an area of research interest since the 1960s when the first papers describing an automated abnormality detection system on CXR images were published (Lodwick et al., 1963;Becker et al., 1964;Meyers et al., 1964;Kruger et al., 1972;Toriwaki et al., 1973). The potential gains from automated CXR analysis include increased sensitivity for subtle findings, prioritization of time-sensitive cases, automation of tedious daily tasks, and provision of analysis in situations where radiologists are not available (e.g., the developing world).
In recent years, deep learning has become the technique of choice for image analysis tasks and made a tremendous impact in the field of medical imaging (Litjens et al., 2017). Deep learning is notoriously data-hungry and the CXR research community has benefited from the publication of numerous large labeled databases in recent years, predominantly enabled by the generation of labels through automatic parsing of radiology reports. This trend began in 2017 with the release of 112,000 images from the NIH clinical center (Wang et al., 2017b). In 2019 alone, more than 755,000 images were released in 3 labelled databases (CheXpert (Irvin et al., 2019), MIMIC-CXR (Johnson et al., 2019), PadChest (Bustos et al., 2020)). In this work, we demonstrate the impact of these data releases on the number of deep learning publications in the field.
There have been previous reviews on the field of deep learning in medical image analysis (Litjens et al., 2017; van Ginneken, 2017; Sahiner et al., 2018; Feng et al., 2019) and on deep learning or computer-aided diagnosis for CXR (Qin et al., 2018; Kallianos et al., 2019; Anis et al., 2020). However, recent reviews of deep learning in chest radiography are far from exhaustive in terms of the literature and methodology surveyed, the description of the public datasets available, or the discussion of future potential and trends in the field. The literature review in this work includes 295 papers, published between 2015 and 2021, and categorized by application. A comprehensive list of public datasets is also provided, including numbers and types of images and labels as well as some discussion and caveats regarding various aspects of these datasets. Trends and gaps in the field are described, important contributions are discussed, and potential future research directions are identified. We additionally discuss the commercial software available for chest radiograph analysis and consider how research efforts can best be translated to the clinic.
The initial selection of literature for this review was obtained as follows. A first set of papers was identified using a PubMed search with the following query:

chest and ("x-ray" or xray or radiograph) and ("deep learning" or cnn or "convolutional" or "neural network")

A systematic search of the titles of conference proceedings from SPIE, MICCAI, ISBI, MIDL and EMBC was also performed, using the same search terms listed above. In the case of multiple publications of the same paper, only the latest publication was included. Relevant peer-reviewed articles suggested by co-authors and colleagues were added. The last search was performed on March 3rd, 2021.
This search strategy resulted in 767 listed papers. Of these, 61 were removed as duplicates of others in the list. A further 261 were excluded because their subject matter did not relate to deep learning for CXR, they were commentary or evaluation papers, or they were not written in English. Eight publications that were not peer-reviewed were also excluded. Finally, during the review process 142 papers were excluded because the scientific content was considered unsound, as detailed further in Section 6, leaving 295 papers in the final literature review.
The remainder of this work is structured as follows: Section 2 provides a brief introduction to the concept of deep learning and the main network architectures encountered in the current literature. In Section 3, the public datasets available are described in detail, to provide context for the literature study. The review of the collected literature is provided in Section 4, categorized according to the major themes identified. Commercial systems available for chest radiograph analysis are described in Section 5. The paper concludes in Section 6, with a comprehensive discussion of the current state of the art for deep learning in CXR as well as the potential for future directions in both research and commercial environments.

Overview of Deep Learning Methods
This section provides an introduction to deep learning for image analysis, and particularly the network architectures most frequently encountered in the literature reviewed in this work. Formal definitions and more in-depth mathematical explanations of fully-connected and convolutional neural-networks are provided in many other works, including a recent review of deep learning in medical image analysis (Litjens et al., 2017). In this work, we provide only a brief overview of these fundamental details and refer the interested reader to previous literature.
Deep learning is a branch of machine learning, the general term for algorithms that learn from data. The model underpinning all deep learning methods is the neural network, in this case constructed with many hidden layers ('deep'). These networks may be built in many ways, with different types of layers included, and the overall construction of a network is referred to as its 'architecture'. Sections 2.3 to 2.6 describe commonly used architectures, categorized by type of application in the CXR literature.

Convolutional Neural Networks
In the 1980s, networks using convolutional layers were first introduced for image analysis (Fukushima and Miyake, 1982), and the idea was formalized over the following years (LeCun and Bengio, 1998). These convolutional layers now form the basis for all deep learning image analysis tasks, almost without exception. Convolutional layers use neurons that connect only to a small 'receptive field' from the previous layer. These neurons are applied to different regions of the previous layer, operating as a sliding window over all regions, and effectively detecting the same local pattern in each location. In this way, spatial information is preserved and the learned weights are shared.
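The sliding-window weight sharing described above can be made concrete with a minimal NumPy sketch (illustrative only; real convolutional layers add padding, strides, multiple channels and learned kernels):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide one shared kernel over the image (valid padding, stride 1).

    Every output position reuses the same weights: this weight sharing
    is what distinguishes a convolutional layer from a fully-connected
    one, and it is why the same local pattern is detected everywhere.
    """
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            receptive_field = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(receptive_field * kernel)
    return out

# A vertical-edge kernel responds wherever intensity changes left-to-right.
edge_kernel = np.array([[1.0, -1.0],
                        [1.0, -1.0]])
image = np.zeros((4, 4))
image[:, 2:] = 1.0  # toy "image": right half bright
response = conv2d(image, edge_kernel)
```

The response is nonzero only in the column where the receptive field straddles the intensity edge, illustrating how a single shared filter localizes a pattern while preserving spatial layout.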

Transfer Learning
Transfer learning investigates how to transfer knowledge extracted from one domain (source domain) to another (target) domain. One of the most commonly used transfer learning approaches in CXR analysis is the use of pre-training.
With the pre-training approach, the network architecture is first trained on a large dataset for a different task, and the trained weights are then used as an initialization for fine-tuning on the subsequent task (Yosinski et al., 2014). Depending on data availability in the target domain, all layers can be re-trained, or only the final (fully connected) layer. This approach allows neural networks to be trained for new tasks using relatively small datasets, since useful low-level features are learned from the source domain data. It has been shown that pre-training on the ImageNet dataset (for classification of natural images) is beneficial for chest radiography analysis (Baltruschat et al., 2019b), and this type of transfer learning is prominently used in the research surveyed in this work. ImageNet pre-trained versions of many architectures are publicly available as part of popular deep learning frameworks. Pre-trained architectures may also be used as feature extractors, in combination with more traditional methods such as support vector machines or random forests. Domain adaptation is another subfield of transfer learning and is discussed thoroughly in Section 2.7.
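The "freeze the backbone, re-train the final layer" recipe can be sketched schematically in NumPy. In this toy example the frozen random projection stands in for an ImageNet-pretrained backbone (a deliberate simplification; in practice the frozen weights come from pre-training, not random initialization), and only a logistic-regression head is trained:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained backbone: a frozen projection plus ReLU.
# In a real setting these weights would come from ImageNet pre-training;
# here they are random and purely illustrative.
W_frozen = rng.normal(size=(64, 16)) / np.sqrt(64)

def extract_features(x):
    """Frozen feature extractor: its weights are never updated."""
    return np.maximum(x @ W_frozen, 0.0)

def bce(p, y):
    """Binary cross-entropy, minimized by the new classification head."""
    return float(np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p)))

# A small toy target-domain task, as in the low-data fine-tuning setting.
X = rng.normal(size=(200, 64))
y = (X.mean(axis=1) > 0).astype(float)

F = extract_features(X)  # features computed once; backbone stays frozen

# Re-train only the final (fully connected) layer by gradient descent.
w, b = np.zeros(F.shape[1]), 0.0
initial_loss = bce(1.0 / (1.0 + np.exp(-(F @ w + b))), y)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(F @ w + b)))
    w -= 0.1 * (F.T @ (p - y) / len(y))   # update head weights only
    b -= 0.1 * float(np.mean(p - y))
final_loss = bce(1.0 / (1.0 + np.exp(-(F @ w + b))), y)
```

Only `w` and `b` receive gradient updates; `W_frozen` is untouched, mirroring the option of re-training just the final layer when target-domain data is scarce.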

Image-level Prediction Networks
In this work we use the term 'image-level prediction' to refer to tasks where prediction of a category label (classification) or continuous value (regression) is made by analysis of an entire CXR image. These methods are distinct from those which make predictions regarding small patches or segmented regions of an image. Classification and regression tasks are grouped together in this work since they typically use the same types of architecture, differing only in the final output layer. One of the early successful deep convolutional architectures for image-level prediction was AlexNet (Krizhevsky et al., 2012), which consists of 5 convolutional layers followed by 3 fully connected layers. AlexNet became extremely influential when it beat all other competitors in the ILSVRC (ImageNet) challenge (Deng et al., 2009) by a large margin in 2012. Since then, many deep convolutional neural network architectures have been proposed. The VGG family of models (Simonyan and Zisserman, 2014) uses 8 to 16 convolutional layers followed by 3 fully-connected layers. The Inception architecture was first introduced in 2015 (Szegedy et al., 2015), using multiple convolutional filter sizes within layered blocks known as Inception modules. In 2016, the ResNet family of models (He et al., 2016) began to gain popularity and improve upon previous benchmarks. These models define residual blocks consisting of multiple convolution operations, with skip connections which typically improve model performance. After the success of ResNet, skip connections were widely adopted in many architectures. DenseNet models (Huang et al., 2017), introduced in 2017, also use skip connections between blocks, but additionally connect all layers to each other within blocks. A later version of the Inception architecture also added skip connections (Inception-ResNet) (Szegedy et al., 2017).
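The residual block at the heart of ResNet-style architectures is simple to state: the block computes y = x + F(x), where F is a stack of convolutions. A minimal NumPy sketch (dense layers stand in for convolutions; all names are ours, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_block(x, W1, W2):
    """A schematic residual block: y = x + F(x).

    The skip connection adds the input back onto the block's output, so
    the layers only have to learn a residual correction; this is what
    makes very deep stacks of such blocks easier to optimize.
    """
    h = np.maximum(x @ W1, 0.0)   # stand-in for convolution + ReLU
    return x + h @ W2             # identity skip connection

d = 8
x = rng.normal(size=(d,))

# With all weights at zero the block reduces exactly to the identity,
# so adding more blocks cannot degrade the signal path.
y_identity = residual_block(x, np.zeros((d, d)), np.zeros((d, d)))

# With learned (here random) weights the block applies a correction.
y_res = residual_block(x, rng.normal(size=(d, d)), rng.normal(size=(d, d)))
```

The identity-at-zero property is the intuition usually given for why residual networks train well at depths where plain networks degrade.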
The Xception network architecture (Chollet, 2017) builds upon the Inception architecture but separates the convolutions performed in the 2D image space from those performed across channels. This was demonstrated to improve performance compared to Inception V3.
The majority of works surveyed in this review use one or more of the model architectures discussed here with varying numbers of hidden layers.

Segmentation Networks
Segmentation is a task where each pixel is assigned a category label, and can thus be considered a pixel-level classification. In natural image analysis, this task is often referred to as 'semantic segmentation' and frequently requires every pixel in the image to have a specified category. In the medical imaging domain these labels typically correspond to anatomical features (e.g., heart, lungs, ribs), abnormalities (e.g., tumor, opacity) or foreign objects (e.g., tubes, catheters). It is typical in the medical imaging literature to segment just one object of interest, essentially assigning the category 'other' to all remaining pixels.
Early approaches to segmentation using deep learning used standard convolutional architectures designed for classification tasks (Chen et al., 2018b). These were employed to classify each pixel in a patch using a sliding-window approach. The main drawback of this approach is that neighboring patches overlap almost entirely, so the same convolutions are repeated many times. Treating each pixel separately additionally makes the method computationally expensive and only applicable to small images or patches from an image.
To address these drawbacks, fully convolutional networks (FCNs) were proposed, replacing fully connected layers with convolutional layers (Shelhamer et al., 2017). This results in a network which can take larger images as input and produces a likelihood map output instead of an output for a single pixel. In 2015, a fully convolutional architecture known as the U-Net was proposed (Ronneberger et al., 2015) and this work has become the most cited paper in the history of medical image analysis. The U-Net consists of several convolutional layers in a contracting (downsampling) path, followed by further convolutional layers in an expanding (upsampling) path which restores the result to the input resolution. It additionally uses skip connections between the same levels on the contracting and expanding paths to recover fine details that were lost during the pooling operation. The majority of image segmentation works in this review employ a variant of the FCN or the U-Net.
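The contract-then-expand structure with skip connections can be sketched in a few lines of NumPy. This is a deliberately minimal, single-channel caricature of one U-Net level (no convolutions or learned weights), showing only the pooling, upsampling and skip concatenation:

```python
import numpy as np

def max_pool2x2(x):
    """Contracting path: halve the spatial resolution by 2x2 max pooling."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

def upsample2x2(x):
    """Expanding path: restore spatial resolution (nearest neighbour)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def unet_level(x):
    """One schematic U-Net level: pool, (convolutions would go here),
    upsample, then concatenate the skip connection so that fine detail
    lost in pooling is available again on the expanding path."""
    encoded = max_pool2x2(x)
    decoded = upsample2x2(encoded)
    # Skip connection: stack encoder features with decoder features,
    # as U-Net concatenates feature maps at matching resolutions.
    return np.stack([x, decoded], axis=0)

x = np.arange(16.0).reshape(4, 4)
out = unet_level(x)  # two "channels": skip features and upsampled features
```

The upsampled map alone is blocky (each pooled value is repeated 2x2); the concatenated skip features carry the original fine detail, which is exactly the role of the U-Net's skip connections.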

Localization Networks
This survey uses the term localization to refer to identification of a specific region within the image, typically indicated by a bounding box, or by a point location. As with the segmentation task, localization, in the medical domain, can be used to identify anatomical regions, abnormalities, or foreign object structures. There are relatively few papers in the CXR literature reviewed here that deal specifically with a localization method, however, since it is an important task in medical imaging, and may be easier to achieve than a precise segmentation, we categorize these works together.
In 2014, the R-CNN (Region-based Convolutional Neural Network) was introduced (Girshick et al., 2014), identifying candidate regions of interest in the image and using a CNN architecture to extract features from these regions. A support vector machine (SVM) was used to classify the regions based on the extracted features. This method involves several stages and is relatively slow. It was later superseded by Fast R-CNN (Girshick, 2015) and subsequently by Faster R-CNN (Ren et al., 2017), which streamlined the processing pipeline, removing the need for external region proposals or SVM classification, and improving both speed and performance. In 2017, a further extension was added to Faster R-CNN to additionally enable a precise segmentation of the item identified within the bounding box. This method is referred to as Mask R-CNN (He et al., 2017). While this is technically a segmentation network, we mention it here as part of the R-CNN family. Another architecture which has been popular in object localization is YOLO (You Only Look Once), first introduced in 2016 (Redmon et al., 2016) as a single-stage object detection method, and improved in subsequent versions in 2017 and 2018 (Redmon and Farhadi, 2017, 2018). The original YOLO architecture, using a single CNN and an image grid to specify outputs, was significantly faster than its contemporaries but not quite as accurate. The improved versions leveraged both classification and detection training data and introduced a number of training improvements to achieve state-of-the-art performance while remaining faster than competing methods. A final localization network that features in the medical imaging literature is RetinaNet (Lin et al., 2017). Like YOLO, this is a single-stage detector; it introduces the concept of a focal loss, forcing the network to concentrate on more difficult examples during training. Most of the localization works included in this review use one of the architectures described above.
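The focal loss used by RetinaNet is easy to state and implement: standard cross-entropy is scaled by a factor (1 - p_t)^gamma, so confidently correct (easy) examples contribute little. A small NumPy illustration:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """Focal loss (Lin et al., 2017) for binary labels.

    p   : predicted probability of the positive class
    y   : ground-truth label (0 or 1)
    The modulating factor (1 - p_t)**gamma down-weights well-classified
    examples, focusing training on hard ones. With gamma = 0 this
    reduces to ordinary cross-entropy.
    """
    p_t = np.where(y == 1, p, 1.0 - p)  # probability of the true class
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

# An easy example (p_t = 0.9) is down-weighted far more than a hard
# one (p_t = 0.1), relative to plain cross-entropy.
easy = focal_loss(np.array([0.9]), np.array([1]))
hard = focal_loss(np.array([0.1]), np.array([1]))
ce_easy = -np.log(0.9)  # plain cross-entropy on the easy example
```

With gamma = 2, the easy example's loss is scaled by (1 - 0.9)^2 = 0.01 of its cross-entropy value, while the hard example keeps most of its loss; this is what "forcing the network to concentrate on more difficult examples" amounts to.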

Image Generation Networks
One of the tasks deep learning has been commonly used for is the generation of new, realistic images, based on information learned from a training set. There are numerous reasons to generate images in the medical domain, including generation of more easily interpretable images (by increasing resolution, or removal of projected structures impeding analysis), generation of new images for training (data augmentation), or conversion of images to emulate appearances from a different domain (domain adaptation). Various generative schemes have also been used to improve the performance of tasks such as abnormality detection and segmentation.
Image generation was first popularized with the introduction of the generative adversarial network (GAN) in 2014 (Goodfellow et al., 2014). The GAN consists of two network architectures, an image generator, and a discriminator which attempts to differentiate generated images from real ones. These two networks are trained in an adversarial scheme, where the generator attempts to fool the discriminator by learning to generate the most realistic images possible while the discriminator reacts by progressively learning an improved differentiation between real and generated images.
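The adversarial objective can be written down compactly. In the sketch below (NumPy, illustrative; the networks themselves are omitted) `d_real` and `d_fake` denote the discriminator's probability outputs on real and generated samples:

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """The discriminator maximizes log D(x) + log(1 - D(G(z)));
    equivalently, training minimizes the negative of that quantity."""
    return -(np.log(d_real) + np.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Non-saturating generator loss: the generator improves by raising
    the discriminator's score D(G(z)) on its generated samples."""
    return -np.log(d_fake)

# Early in training the discriminator easily spots fakes (D(G(z)) low),
# so the generator loss is large; as fakes improve, it falls.
g_early = generator_loss(0.1)
g_late = generator_loss(0.9)
```

At the idealized equilibrium the discriminator cannot tell real from fake (D = 0.5 everywhere), which gives the discriminator a loss of 2 log 2; the alternating updates described above push the two players toward this point.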
The training process for GANs can be unstable with no guarantee of convergence, and numerous researchers have investigated stabilization and improvement of the basic method (Salimans et al., 2016; Heusel et al., 2017; Karras et al., 2018; Arjovsky et al., 2017). GANs have also been adapted to conditional data generation by incorporating class labels (Mirza and Osindero, 2014; Odena et al., 2017), to image-to-image translation (conditioned on an image in this case) (Isola et al., 2017), and to unpaired image-to-image translation (CycleGAN; Zhu et al., 2017).
GANs have received a lot of attention in the medical imaging community and several papers were published for medical image analysis applications in recent years (Yi et al., 2019b). Many of the image generation works identified in this review employed GAN based architectures.

Domain Adaptation Networks
In this work we use the term 'domain adaptation', a subfield of transfer learning, to cover methods addressing the issue that architectures trained on data from a single 'domain' typically perform poorly when tested on data from other domains. The term 'domain' is weakly defined; in medical imaging it may refer to data from specific hardware (scanner), a set of acquisition parameters, a reconstruction method or a hospital. Less frequently, it may also refer to characteristics of the population included, for example the gender, ethnicity or age of the subjects, or even the strain of some pathology included in the dataset.
Domain adaptation methods consider a network trained for an image analysis task on data from one domain (the source domain), and how to perform this analysis accurately on a different domain (the target domain). These methods can be categorized as supervised, unsupervised, and semi-supervised, depending on the availability of labels from the target domain, and they have been investigated for a variety of CXR applications from organ segmentation to multi-label abnormality classification. There is no specific architecture that is typical for domain adaptation; rather, architectures are combined in various ways to achieve the goal of learning to analyze images from unseen domains. The approaches to this problem can be broadly divided into three classes (following the categorization of Wang and Deng (2018)): discrepancy-based, reconstruction-based and adversarial-based.
Discrepancy-based approaches aim to induce alignment between the source and target domain in some feature space by fine-tuning the image analysis network and optimizing a measurement of discrepancy between the two domains. Reconstruction-based approaches, on the other hand, use an auxiliary encoder-decoder reconstruction network that aims to learn domain invariant representation through a shared encoder. Adversarial-based approaches are based on the concept of adversarial training from GANs, and use a discriminator network which tries to distinguish between samples from the source and target domains, to encourage the use of domain-invariant features. This category of approaches is the most commonly used in CXR analysis for domain adaptation, and consists of generative and non-generative models. Generative models transform source images to resemble target images by operating directly on pixel space whereas non-generative models use the labels on the source domain and leverage adversarial training to obtain domain invariant representations.
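For the discrepancy-based family, one commonly used discrepancy measure is the maximum mean discrepancy (MMD). A minimal NumPy sketch with a linear kernel (synthetic features; real methods use richer kernels and optimize the measure jointly with the task loss):

```python
import numpy as np

def mmd(source, target):
    """Maximum mean discrepancy with a linear kernel: the squared
    distance between the mean feature vectors of the two domains.
    Discrepancy-based domain adaptation adds such a term to the
    training loss so that source and target features are pulled
    into alignment in feature space."""
    delta = source.mean(axis=0) - target.mean(axis=0)
    return float(delta @ delta)

rng = np.random.default_rng(0)
source_feats = rng.normal(loc=0.0, size=(500, 16))   # source-domain features
shifted_feats = rng.normal(loc=1.0, size=(500, 16))  # a simulated domain shift
aligned_feats = rng.normal(loc=0.0, size=(500, 16))  # well-aligned features
```

A network fine-tuned to minimize task loss plus `mmd(source_feats, target_feats)` is driven toward representations under which the two domains are statistically indistinguishable, which is the stated aim of the discrepancy-based class.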

Datasets
Deep learning relies on large amounts of annotated data. The digitization of radiological workflows enables medical institutions to collate and categorize large sets of digital images. In addition, advances in natural language processing (NLP) algorithms mean that radiological reports can now be automatically analyzed to extract labels of interest for each image. These factors have enabled the construction and release of multiple large labelled CXR datasets in recent years. Other labelling strategies have included the attachment of the entire radiology report and/or labels generated in other ways, such as radiological review of the image, radiological review of the report, or laboratory test results. Some datasets include segmentations of specified structures or localization information.
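To give a flavor of report-based labeling, the toy labeler below marks a finding positive if it is mentioned without a nearby negation cue. This is our own drastically simplified sketch, not the labeler used by any of the datasets discussed; real systems use far richer rule sets or learned models, and their failure modes are discussed in Section 3.1:

```python
import re

NEGATIONS = ("no ", "without ", "negative for ")

def label_report(report, findings=("pneumothorax", "consolidation", "effusion")):
    """Toy rule-based report labeler (hypothetical, for illustration).

    A finding is labeled positive (1) if it appears in the report and
    is not preceded by a simple negation cue within a short window.
    The crudeness of such rules is one source of label noise in
    report-derived datasets."""
    text = report.lower()
    labels = {}
    for finding in findings:
        labels[finding] = 0
        for match in re.finditer(finding, text):
            window = text[max(0, match.start() - 20):match.start()]
            if not any(neg in window for neg in NEGATIONS):
                labels[finding] = 1
    return labels

labels = label_report("There is a small right effusion. No pneumothorax.")
```

Even this simple example shows why NLP-derived labels are imperfect: double negations, hedged language ("cannot exclude"), acronyms and findings mentioned only in the clinical history all defeat such rules.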
In this section we detail each public dataset that is encountered in the literature included in this review as well as any others available to the best of our knowledge. Details are provided in Table 1. Each dataset is given an acronym which is used in the literature review tables (Tables 2 to 7) to indicate that the dataset was used in the specified work.
1. ChestX-ray14 (C) is a dataset consisting of 112,120 CXRs from 30,805 patients (Wang et al., 2017b).
3. MIMIC-CXR (M) was automatically labeled from radiology reports using the same rule-based labeler system as CheXpert (Johnson et al., 2019). A second version (V2) of MIMIC-CXR was later released, including the anonymized radiology reports and DICOM files.
4. PadChest (P) is a dataset consisting of 160,868 CXRs from 109,931 studies and 67,000 patients (Bustos et al., 2020). The CXRs were collected at San Juan Hospital (Spain) from 2009 to 2017. The images are stored as 16-bit grayscale images at full resolution. 27,593 of the reports were manually labeled by physicians; using these labels, an RNN was trained and used to label the rest of the dataset from the reports. The reports were used to extract 174 findings, 19 diagnoses, and 104 anatomic locations. The labels conform to a hierarchical taxonomy based on the standard Unified Medical Language System (UMLS) (Bodenreider, 2004).
5. PLCO (PL) is a screening trial for prostate, lung, colorectal and ovarian (PLCO) cancer (Zhu et al., 2013).
6. Open-i (O) images are distributed as anonymized DICOMs. The radiological findings obtained by radiologist interpretation are available in MeSH format.
7. Ped-Pneumonia (PP) is a dataset consisting of 5,856 pediatric CXRs (Kermany, 2018). The CXRs were collected from Guangzhou Women and Children's Medical Center, Guangzhou, China. The images are distributed as 8-bit grayscale images at various resolutions. The labels include bacterial and viral pneumonia as well as normal.
8. JSRT (J) consists of 247 images with a resolution of 2048 × 2048, 0.175 mm pixel size and 12-bit depth (Shiraishi et al., 2000). It includes nodule locations (on 154 images) and diagnosis (malignant or benign). The reference standard heart and lung segmentations of these images are provided by the SCR dataset (van Ginneken et al., 2006) and we group these datasets together in this work.
9. RSNA-Pneumonia (RP) is a dataset consisting of 30,000 CXRs with pneumonia annotations (RSNA, 2018). These images are taken from ChestX-ray14 and are 8-bit grayscale with 1024 × 1024 resolution. Annotations were added by radiologists using bounding boxes around lung opacities and three classes indicating normal, lung opacity, or not normal.
10. Shenzhen (S) is a dataset consisting of 662 CXRs collected at Shenzhen No. 3 Hospital, Shenzhen, China (Jaeger et al., 2014).
14. COVIDGR (CG) is a dataset consisting of 852 PA CXR images, half of which are labeled as COVID-19 positive based on corresponding RT-PCR results obtained within at most 24 hours (Tabik et al., 2020). This dataset was collected from Hospital Universitario Clínico San Cecilio, Granada, Spain, and the level of severity of positive cases is provided.
15. SIIM-ACR (SI) was released for a Kaggle challenge on pneumothorax detection and segmentation (ACR, 2019). Researchers have determined that at least some (possibly all) of the images are from the ChestX-ray14 dataset, although the challenge organizers have not confirmed the data sources. The images are supplied in 1024 × 1024 resolution as DICOM files. Pixel segmentations of the pneumothorax in positive cases are provided.
16. CXR14-Rad-Labels (CR) supplies additional annotations for a subset of ChestX-ray14 data (Majkowska et al., 2019). It consists of 4 labels for 4,374 studies and 1,709 patients, collected by the adjudicated agreement of 3 radiologists. These radiologists were selected from a cohort of 11 radiologists for the validation split (2,412 studies from 835 patients), and 13 radiologists for the test split (1,962 studies from 860 patients). The individual labels from each radiologist as well as the agreement labels are provided.
17. COVID-CXR (CV) is a dataset consisting of 930 CXRs at the time of writing (the dataset remains in continuous development) (Cohen et al., 2020c). The CXRs are collected from a large variety of locations using different methods, including screenshots from papers researching COVID-19. Available labels vary accordingly, depending on what information is available from the source where the image was obtained. Images do not have a standard resolution and are published as 8-bit PNG or JPEG files.
18. NLST (N) is a dataset of publicly available CXRs collected during the NLST screening trial (National Lung Screening Trial Research Team et al., 2011). This trial aimed to compare the use of low-dose computed tomography (CT) with CXRs for lung cancer screening in smokers. The study had 26,732 participants in the CXR arm and a part of this data is available upon request.
19. Object-CXR (OB) is a dataset of 10,000 CXR images from hospitals in China with foreign objects annotated on the images. The download location (https://jfhealthcare.github.io/object-CXR/) is no longer available at the time of writing. Further detail is not provided since it cannot be verified from the image source.
20. Belarus (BL) is included since it is used in a number of reviewed papers; however, the download location (http://tuberculosis.by) is no longer available at the time of writing. The dataset consisted of approximately 300 frontal chest X-rays with confirmed TB. Further detail is not provided since it can no longer be verified from the image source.
The rapid increase in the number of publicly available CXR images in recent years has positively impacted the number of deep learning studies published in the field. Figure 2 illustrates this trend.

Public Dataset Caution
Publication of medical image data is extremely important for the research community in terms of advancing the state of the art in deep learning applications. However, there are a number of caveats that should be considered and understood when using the public datasets described in this work. Firstly, many datasets make use of natural language processing (NLP) to create labels for each image. Although this is a fast and inexpensive method of labeling, it is well known that there are inaccuracies in labels acquired this way (Irvin et al., 2019; Oakden-Rayner, 2020). There are a number of causes for such inaccuracies. Some visible abnormalities may not be mentioned in the radiology report, depending on the context in which it was acquired (Olatunji et al., 2019). Further, the NLP algorithm can be erroneous in itself, interpreting negative statements as positive, failing to identify acronyms, etc. Finally, many findings on CXR are subtle or doubtful, leading to disagreements even among expert observers (Olatunji et al., 2019). Acknowledging some of these issues, Irvin et al. (2019) include labels for uncertainty or no-mention on the CheXpert dataset. One particular cause for concern with NLP labels is the issue of systematic or structured mislabeling, where an abnormality is consistently labeled incorrectly in the same way. An example of this occurs in the ChestX-ray14 dataset, where subcutaneous emphysema is frequently labeled as (pulmonary) 'emphysema' (Calli et al., 2019; Oakden-Rayner, 2020).
It has been demonstrated that deep neural networks can tolerate reasonable levels of label inaccuracy in the training set without a significant effect on model performance (Calli et al., 2019; Rolnick et al., 2018). Although such labels can be used for training, for an accurate evaluation and comparison of models it is desirable that the test dataset is accurately labelled. In the literature reviewed in this work, many authors rely on labels from NLP algorithms in their test data, while others use radiologist annotations, laboratory tests and/or CT verification for improved test set labelling. We refer to data that uses these improved labelling techniques as gold standard data (Table 1).
The labels defined in the public datasets should also be considered carefully and understood by the researchers using them. Many labels have substantial dependencies between them. For example, some datasets supply labels for both 'consolidation' and 'pneumonia'. Consolidation (airspace filled with fluid or other material instead of air) is an indicator of pneumonia, so there is significant overlap between these labels. A further point for consideration is that, in practice, not all labels can be predicted from a CXR image alone. Pneumonia is rarely diagnosed by imaging alone; other clinical signs or symptoms are required to suggest that pneumonia is the cause of a visible consolidation.
Many public datasets release images at a lower quality than is used for radiological reading in the clinic. This may cause decreased performance in deep learning systems, particularly for more subtle abnormalities. The reduction in quality usually stems from a decrease in image size or bit depth prior to release, typically carried out to reduce the overall download size of a dataset. In some cases, however, CXR data has been collected by acquiring screenshots from the online literature, which results in an unquantifiable degradation of the data. In the clinical workflow, DICOM files are the industry standard for storing CXRs, typically using 12 bits per pixel and image dimensions of approximately 2 to 4 thousand pixels in each of the X and Y directions. When data is post-processed before release, a precise description of all steps should be provided so that researchers can reproduce them when combining datasets.
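The quality reduction described above can be sketched as follows (a minimal numpy example, assuming a raw 12-bit pixel array already extracted from a DICOM file; `prepare_cxr` and its parameters are illustrative):

```python
import numpy as np

def prepare_cxr(pixels_12bit: np.ndarray, out_bits: int = 8, factor: int = 2) -> np.ndarray:
    """Reduce bit depth and resolution of a raw CXR pixel array.

    Mimics the lossy post-processing many public datasets apply before
    release: rescaling the 12-bit range to 8 bits and block-average
    downsampling. Real pipelines read the pixels from DICOM files.
    """
    img = pixels_12bit.astype(np.float64)
    img = img / 4095.0 * (2 ** out_bits - 1)        # 12-bit -> out_bits range
    h, w = img.shape
    h2, w2 = h - h % factor, w - w % factor         # crop to a multiple of factor
    img = img[:h2, :w2].reshape(h2 // factor, factor, w2 // factor, factor).mean(axis=(1, 3))
    return np.round(img).astype(np.uint8)

raw = np.random.default_rng(0).integers(0, 4096, size=(2048, 2048))
small = prepare_cxr(raw)
print(small.shape, small.dtype)  # (1024, 1024) uint8
```

Both steps are irreversible, which is why a precise description of them matters when combining datasets processed differently.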

Deep Learning for Chest Radiography
In this section we survey the literature on deep learning for chest radiography, dividing it into sections according to the type of task that is addressed (Image-level Prediction, Segmentation, Image Generation, Domain Adaptation, Localization, Other). For each of these sections a table detailing the literature on that task is provided. Some works with an equal focus on two tasks appear in both tables. For Segmentation and Localization, only studies that quantitatively evaluate their results are included in those categories. Figure 3 shows the number of studies for each of the tasks.

Image-level Prediction
Image-level prediction refers to the task of predicting a label (classification) or a continuous value (regression) by analyzing an entire image. Classification labels may relate to pathology (e.g. pneumonia, emphysema), information such as the subject gender, or orientation of the image. Regression values might, for example, indicate a severity score for a particular pathology, or other information such as the age of the subject.
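As a concrete sketch of the two output types, the following Python snippet (all names and values hypothetical) contrasts a multi-label classification head, with one independent sigmoid per finding, against a regression output trained with an L2-type loss:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical logits from a network's final layer for one image,
# one logit per (non-mutually-exclusive) finding.
findings = ["cardiomegaly", "effusion", "pneumothorax"]
logits = np.array([2.0, -1.0, 0.5])

probs = sigmoid(logits)  # independent probability per label
predicted = [f for f, p in zip(findings, probs) if p >= 0.5]
print(predicted)  # ['cardiomegaly', 'pneumothorax']

# A regression head instead outputs a raw continuous value, e.g. a
# severity score or the subject's age, trained with a squared-error loss.
def l2_loss(pred, target):
    return float((pred - target) ** 2)

print(l2_loss(62.3, 60.0))  # squared error between predicted and true age
```

The sigmoid-per-label formulation matters because CXR findings are not mutually exclusive, unlike the softmax setting common in natural-image classification.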
We classified 187 studies, fully detailed in Table 2. Most use off-the-shelf deep learning models to predict a pathology, metadata information or a set of labels provided with a dataset. The number of studies for each label is provided in Figure 4; studies that work specifically with a dataset and its full label set are grouped together at the bottom. Since each paper may study more than one label, the counts sum to more than 187.
The most commonly studied image-level prediction task is predicting the labels of the ChestX-ray14 dataset (31 studies). For example, Baltruschat et al. (2019a) compares the performance of various approaches to classify the 14 disease labels provided by the ChestX-ray14 dataset. Rajpurkar et al. (2018) compares the performance of an ensemble of deep learning models to board-certified and resident radiologists, showing that their models achieve a performance comparable to expert observers in most of the 14 labels provided by ChestX-ray14. Following this, pneumonia is the second most studied subject (26 studies).
Of the 26 studies that worked with pneumonia, 12 studied pediatric chest X-rays and 11 of those used the Ped-Pneumonia dataset for training and evaluation (Rajaraman et al., 2018a; Yue et al., 2020; Liang and Zheng, 2020; Behzadikhormouji et al., 2020; Elshennawy and Ibrahim, 2020; Ureta et al., 2020; Mittal et al., 2020; Shah et al., 2020; Qu et al., 2020; Ferreira et al., 2020; Anand et al., 2020). Lakhani and Sundaram (2017) applied deep learning to the detection of tuberculosis. The performance of a deep learning model, and how the assistance of this model improves radiologist performance, is studied by Rajpurkar et al. (2020). This study in particular evaluates the use of extra clinical information, such as age, white blood cell count, patient temperature and oxygen saturation, to assist the deep learning model. Diagnosis or evaluation of COVID-19 from CXR is another topic that has attracted substantial research interest (17 studies). For example, Cohen et al. (2020a) predicts disease severity, Li et al. (2020a) predicts disease progression by comparing an exam with the previous exams of the patient, and Tartaglione et al. (2020) detects COVID-19 using a very limited amount of data. Beyond these most common tasks, there are many studies using deep learning to make image-level predictions from CXRs; other commonly utilized labels are illustrated in Figure 4 and listed in Table 2. Simple pre-processing steps are employed in several studies (Gozes and Greenspan, 2019; Baltruschat et al., 2019a), while more sophisticated pre-processing to improve model performance includes bone suppression (Baltruschat et al., 2019b; Zhou et al., 2020b) and lung cropping.
Some studies bring methodological novelty by making use of methods that are known to improve model performance elsewhere. For example, it is well established that an ensemble of many models outperforms a single model (Dietterich, 2000); studies that make use of this include Rajpurkar et al. (2018); Rajaraman et al. (2019a); Rajaraman and Antani (2020); Zhang et al. (2021c). Attention mining (also called object-region mining or attention-based) models are also found in the literature (Wei et al., 2018). These models aim to improve performance and add localization capabilities to an image-level prediction model; studies making use of them include Cai et al. (2018); Saednia et al. (2020). Multiple-instance learning (multi-instance learning, MIL) is another method used to add localization capabilities to image-level prediction models. MIL breaks the input image into smaller parts (instances), makes individual predictions for those instances and combines this information into a prediction for the whole image; studies that make use of MIL include Crosby et al. (2020c); Schwab et al. (2020). Other topics within the literature include model uncertainty (Ul Abideen et al., 2020; Ghesu et al., 2019), quality of the CXR (McManigle et al., 2020; Moradi et al., 2019a; Takaki et al., 2020) and defence against adversarial attacks (Anand et al., 2020; Xue et al., 2019).
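The MIL idea described above can be sketched in a few lines (a toy numpy example; the patch classifier `bright_patch` is a stand-in for a trained instance model):

```python
import numpy as np

def mil_predict(image: np.ndarray, patch: int, instance_model):
    """Multiple-instance learning sketch: split the image into patches
    ("instances"), score each, and pool into an image-level prediction.
    `instance_model` stands in for a trained patch classifier."""
    h, w = image.shape
    scores = np.zeros((h // patch, w // patch))
    for i in range(h // patch):
        for j in range(w // patch):
            tile = image[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
            scores[i, j] = instance_model(tile)
    # Max pooling: the image is positive if any instance is positive.
    return scores.max(), scores  # scores double as a coarse localization map

def bright_patch(tile):
    # Toy instance model: flags bright patches (an assumption for illustration).
    return float(tile.mean() > 0.7)

img = np.zeros((8, 8))
img[4:6, 2:4] = 1.0  # a single "abnormal" region
image_score, heatmap = mil_predict(img, patch=2, instance_model=bright_patch)
print(image_score)           # 1.0
print(np.argwhere(heatmap))  # [[2 1]] -> grid cell of the positive instance
```

The per-instance score map is what gives MIL its localization capability on top of the image-level label.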
The different properties of datasets are also utilized to improve model capabilities or performance. Many of the public datasets make use of labels that are not mutually exclusive. This has resulted in a number of papers addressing the dependencies among abnormality labels (Pham et al., 2020; Chen et al., 2020b; Chakravarty et al., 2020). Since many of the labels are common between datasets from different institutes, there has been investigation of the issues related to domain and/or label shift in images from different sources (Luo et al., 2020; Cohen et al., 2020b). The effect of dataset sizes is evaluated by Dunnmon et al. (2019). Semi-supervised learning methods combine a small set of labeled and a large set of unlabeled data to train a model (Gyawali et al., 2019, 2020; Wang et al., 2019; Unnikrishnan et al., 2020).
Most of the studies working on image-level prediction tasks deal with frontal CXR images. The importance of lateral chest X-rays, and models that can handle multiple views, is evaluated by Bertrand et al. (2019).

Segmentation
Segmentation is one of the most commonly studied subjects in CXR analysis (58 papers) and includes literature focused on the identification of anatomy, foreign objects or abnormalities. The segmentation literature reviewed for this work is detailed fully in Table 3. Anatomical segmentation of the heart, lungs, clavicles or ribs on chest radiographs is a core part of many computer-aided detection (CAD) pipelines. It is typically used as an initial step of such pipelines to define the region of interest for subsequent image analysis tasks, improving performance and efficiency (Baltruschat et al., 2019b; Wang et al., 2020e; Rajaraman et al., 2019b; Heo et al., 2019; Liu et al., 2019; Mansoor et al., 2016). Further, the segmentation itself can be used to quantify clinical parameters based on shape or area measurements. For example, the cardiothoracic ratio, a clinically used measurement to assess heart enlargement (cardiomegaly), can be directly calculated from heart and lung segmentations (Sogancioglu et al., 2020). Organ segmentation has, for these reasons, become one of the most commonly studied CXR segmentation tasks, as seen in Figure 5. Another application found in the CXR literature is foreign object segmentation (e.g., catheters, tubes and lines), for which high performance levels have been reported using deep learning (Frid-Adar et al., 2019; Sullivan et al., 2020). Interestingly, only a small number of works addressed segmentation of abnormalities. Hurt et al. (2020) focused on segmentation of pneumonia, and Tolkachev et al. (2020) developed a method to segment pneumothorax. Both of these works used recently published challenge datasets (hosted by Kaggle), namely RSNA-Pneumonia and SIIM-ACR. In general, the determination of abnormal locations on CXR is dominated by methods which address this as a localization task (i.e. via bounding-box annotations) rather than exact delineation of abnormalities through segmentation. This is likely attributable to the difficulty of precise annotation on a projection image and to the high annotation cost of precise segmentations.
A small number of works tackled the segmentation task using a patch-based CNN, trained to classify the center pixel of each patch as foreground or background by means of a sliding-window approach. However, this approach is generally considered inefficient for segmentation, and most works use fully convolutional networks (FCN) (Shelhamer et al., 2017), which take larger, arbitrarily sized images as input and produce a similarly sized, per-pixel likelihood map in a single forward pass. In particular, the U-Net architecture (Ronneberger et al., 2015), a type of FCN, dominates the field, with 50% of the segmentation works in the literature (29/58) employing it or a similar variant. Successful applications were built with this architecture to segment organs (Novikov et al., 2018; Furutani et al., 2019; Kitahara et al., 2019), pneumonia and foreign objects (Frid-Adar et al., 2019). For example, Novikov et al. (2018) compared three U-Net variant architectures for multi-class segmentation of the heart, clavicles and lungs on the JSRT dataset. Using regularization to prevent over-fitting and a weighted cross-entropy loss to balance the dataset, they outperformed the human observer at heart and lung segmentation. This result is in line with other works (Bortsova et al., 2019; Arsalan et al., 2020) employing FCN-type architectures, which also achieved very high performance levels on this dataset.
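Performance in these segmentation works is typically reported as the Dice overlap coefficient; a minimal numpy implementation with a toy example:

```python
import numpy as np

def dice(pred: np.ndarray, ref: np.ndarray, eps: float = 1e-7) -> float:
    """Dice overlap between two binary masks: 2|A∩B| / (|A|+|B|).
    The standard metric reported for CXR segmentation results."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    inter = np.logical_and(pred, ref).sum()
    return float(2.0 * inter / (pred.sum() + ref.sum() + eps))

ref = np.zeros((10, 10), int); ref[2:8, 2:8] = 1    # 36-pixel reference "organ"
pred = np.zeros((10, 10), int); pred[3:9, 3:9] = 1  # prediction shifted by one pixel
print(round(dice(pred, ref), 3))  # 2*25/(36+36) ≈ 0.694
```

Even a one-pixel shift of a small structure costs a large fraction of the score, which is why the Dice values around 0.97 reported for lung fields represent near-perfect overlap.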
One commonly encountered challenge is that many algorithms produce noisy segmentation maps. In order to tackle this, several works employed post-processing techniques. Lee et al. (2018) used a probabilistic Hough line transform to remove false positives and produce a smoother segmentation of peripherally inserted central catheters (PICC). Groza and Kuzin (2020) used a heuristic approach to average cross-fold predictions, with an optimized binarization threshold and a dilation technique, for pneumothorax segmentation. Some authors proposed to learn the post-processing by training an independent network that receives segmentation predictions for refinement, rather than using conventional methods. For example, Larrazabal et al. (2020) used denoising autoencoders, trained to produce anatomically plausible segmentations from the initial predictions. Similarly, Souza et al. (2019) used an FCN to refine segmentation predictions; the final segmentation was achieved by combining the initial and reconstructed segmentation results.
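A common conventional post-processing step of this kind is to keep only the largest connected foreground component; a self-contained numpy sketch (illustrative, not any cited paper's exact method):

```python
import numpy as np
from collections import deque

def largest_component(mask: np.ndarray) -> np.ndarray:
    """Keep only the largest 4-connected foreground component,
    discarding small spurious islands left by a noisy segmenter."""
    mask = mask.astype(bool)
    labels = np.zeros(mask.shape, int)
    current, best, best_size = 0, 0, 0
    for seed in zip(*np.nonzero(mask)):
        if labels[seed]:
            continue  # pixel already assigned to a component
        current += 1
        queue, size = deque([seed]), 0
        labels[seed] = current
        while queue:  # breadth-first flood fill of one component
            y, x = queue.popleft()
            size += 1
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = current
                    queue.append((ny, nx))
        if size > best_size:
            best, best_size = current, size
    if best_size == 0:
        return np.zeros(mask.shape, bool)  # empty mask stays empty
    return labels == best

noisy = np.zeros((8, 8), int)
noisy[1:5, 1:5] = 1  # main structure (16 px)
noisy[6, 6] = 1      # isolated false positive
clean = largest_component(noisy)
print(int(clean.sum()))  # 16: the island is removed
```

In practice libraries such as scipy provide equivalent connected-component labeling, but the principle is the same.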
A number of researchers used a multi-stage training strategy, where network predictions are refined in several steps during training (Wessel et al., 2019; Souza et al., 2019; Xue et al., 2018d, 2020). For example, Xue et al. (2018d) employed Faster R-CNN to produce coarse segmentation results, which were used to crop the images to a region of interest that was then provided to a U-Net trained to predict the final segmentation. Similarly, Souza et al. (2019) employed two networks, where the second network received the predictions of the first to refine the segmentation results. Wessel et al. (2019) trained separate networks, based on Mask R-CNN, for the segmentation of each rib in chest radiographs; the predicted segmentation of the rib above was fed to each network as an additional input.
Although most of the works in the literature harnessed FCN architectures, a few authors employed recurrent neural networks (RNN) for segmentation tasks (Yi et al., 2019a; Milletari et al., 2018; Mathai et al., 2019) and report good performance. Milletari et al. (2018) proposed a novel architecture in which the decoding component was a long short-term memory (LSTM) network, enabling multi-scale feature integration. The proposed approach achieved a Dice score of 0.97 for lung segmentation on the Montgomery dataset. Similarly, Yi et al. (2019b) developed a scale-recurrent network, an encoder-decoder architecture with recurrent modules, for segmentation of catheters and tubes on pediatric chest X-rays.
The high cost of obtaining segmentation annotations motivates the development of segmentation systems which incorporate weak labels or simulated datasets, with the aim of reducing annotation costs (Frid-Adar et al., 2019; Ouyang et al., 2019; Lu et al., 2020b; Yi et al., 2019a). Several works addressed this using weakly supervised learning approaches (Ouyang et al., 2019). Lu et al. (2020b) proposed a graph convolutional network based architecture which required only one labeled image and leveraged large amounts of unlabeled data (one-shot learning) through newly introduced contour-based loss functions. Ouyang et al. (2019) proposed a pneumothorax segmentation framework which incorporated both images with pixel-level annotations and weak image-level annotations. The authors trained an image classification network, ResNet-101, with weakly labeled data to derive attention maps. These attention maps were then used to train a segmentation model, Tiramisu, together with pixel-level annotations.

Localization
Localization refers to the identification of a region of interest using a bounding box or point coordinates rather than a more specific pixel segmentation. In this section we discuss only the CXR localization literature which provides a quantitative evaluation of this task. It should be noted that there are many other works which train networks for an image-level prediction task and provide some examples of heatmaps (e.g., saliency map or GradCAM) to suggest which region of the image determines the label. While this may be considered as a form of localization, these heatmaps are rarely quantitatively evaluated and such works are not included here. Table 4 details all the reviewed studies where localization was a primary focus of the work.
The majority of CXR analysis papers performing localization focus on identifying abnormalities rather than objects (e.g., catheters) or anatomy (e.g., ribs). Localization of nodules, tuberculosis and pneumonia are commonly studied applications in the literature, as illustrated in Figure 6. In recent years, a variety of dedicated architectures (e.g., YOLO, Mask R-CNN, Faster R-CNN) have been designed in computer vision research with the aim of developing more accurate and faster algorithms for localization tasks. Such state-of-the-art architectures have been rapidly adapted for CXR analysis and shown to achieve high performance. For example, Park et al. (2019) demonstrated that the (original) YOLO architecture was successful at identifying the location of pneumothorax on chest radiographs. The model was evaluated on an external dataset with CXRs from 1,319 patients obtained after percutaneous transthoracic needle biopsy (PTNB) for pulmonary lesions; it achieved an AUC of 0.898 and 0.905 on 3-h and 1-day follow-up chest radiographs, respectively. Similarly, other studies (Schultheiss et al., 2020; Takemiya et al., 2019; Kim et al., 2019) harnessed architectures like RetinaNet, Mask R-CNN and R-CNN for localization of nodules and masses. Kim et al. (2020) trained RetinaNet and Mask R-CNN for the detection of nodules and masses and investigated the optimal input size, showing that, using a square image with an edge length of 896 pixels, RetinaNet and Mask R-CNN achieved FROC scores of 0.906 and 0.869, respectively.
A number of papers adapted classification architectures (e.g., ResNet, DenseNet) to directly regress landmark locations for CXR localization tasks (Hwang et al., 2019b; Cha et al., 2019). One common approach is to adapt the networks to produce heatmap predictions and draw boxes around the areas producing the highest signals. For example, Hwang et al. (2019b) tailored a DenseNet-based classifier to produce heatmap predictions for each of four types of CXR abnormalities. The network was trained with a pixel-wise cross entropy between the predictions and annotations. Similarly, Cha et al. (2019) adapted ResNet-50 and ResNet-101 architectures for localization of nodules and masses on CXR. Other studies (Xue et al., 2018c; Li et al., 2020c) tackled this problem using patch-based approaches, commonly referred to as multiple instance learning, creating patches from chest X-rays and evaluating them for the presence of abnormalities.
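The heatmap-to-box step described above can be sketched as follows (a minimal numpy example; the threshold value is an assumption):

```python
import numpy as np

def heatmap_to_box(heatmap: np.ndarray, threshold: float = 0.5):
    """Draw a box around the thresholded high-activation region of a
    predicted heatmap. Returns (y0, x0, y1, x1) with an exclusive
    bottom-right corner, or None when nothing exceeds the threshold."""
    ys, xs = np.nonzero(heatmap >= threshold)
    if ys.size == 0:
        return None  # no region responded strongly enough
    return int(ys.min()), int(xs.min()), int(ys.max()) + 1, int(xs.max()) + 1

hm = np.zeros((6, 6))
hm[2:4, 1:5] = 0.9  # strong response over a suspected abnormality
print(heatmap_to_box(hm))  # (2, 1, 4, 5)
```

Real systems typically take connected components of the thresholded map and emit one box per component, but the principle is as above.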
One challenge in building robust deep learning localization systems is the collection of large annotated datasets. Such annotations are time-consuming and costly to collect, which has motivated researchers to build systems incorporating weaker labels during training. This research area, referred to as weakly supervised learning, has been investigated by numerous works (Hwang et al., 2019b; Nam et al., 2019; Pesce et al., 2019; Taghanaki et al., 2019b) for localization of a variety of abnormalities in CXR. Most of these works (Hwang et al., 2019b; Pesce et al., 2019; Nam et al., 2019) leveraged weak image-level labels by adapting a CNN architecture to create two branches for localization (heatmap predictions) and classification. A hybrid loss function was used, combining localization and classification losses, which enabled training of the networks on images without localization annotations.
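A hybrid objective of this general form might be sketched as follows (illustrative Python, not any specific paper's loss; the weighting `alpha` is an assumption):

```python
import numpy as np

def bce(p, y, eps=1e-7):
    """Binary cross entropy averaged over all elements."""
    p = np.clip(p, eps, 1 - eps)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

def hybrid_loss(cls_prob, cls_label, heatmap_pred, heatmap_ref, alpha=0.5):
    """Classification BCE on every image, plus a pixel-wise localization
    BCE only when a localization annotation exists for the image."""
    loss = bce(np.array([cls_prob]), np.array([cls_label]))
    if heatmap_ref is not None:  # localization term only for annotated images
        loss += alpha * bce(heatmap_pred, heatmap_ref)
    return loss

ref = np.zeros((4, 4)); ref[1:3, 1:3] = 1.0  # annotated abnormal region
pred = np.full((4, 4), 0.25)                 # predicted heatmap

strong = hybrid_loss(0.9, 1, pred, ref)   # image with a localization annotation
weak = hybrid_loss(0.9, 1, pred, None)    # image-level label only
print(weak < strong)  # True: the weakly labeled image contributes no map loss
```

Skipping the localization term for unannotated images is what lets a single network train on a mixture of strongly and weakly labeled data.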

Image Generation
There are 35 studies identified in this work whose main focus is Image Generation, as detailed in Table 5. Image generation techniques have been harnessed for a wide variety of purposes, including data augmentation (Salehinejad et al., 2019), visualization (Bigolin Lanfredi et al., 2019; Seah et al., 2019), abnormality detection through reconstruction (Tang et al., 2019c; Wolleb et al., 2020), domain adaptation (Zhang et al., 2018) and image enhancement techniques.
The generative adversarial network (GAN) (Goodfellow et al., 2014; Yi et al., 2019b) has become the method of choice for image generation in CXR, and over 50% of the works reviewed here used GAN-based models.
A number of works focused on CXR generation to augment training datasets (Moradi et al., 2018b; Zhang et al., 2019a; Salehinejad et al., 2019) by using unconditional GANs, which synthesize images from random noise. For example, Salehinejad et al. (2019) trained a DCGAN model, similar to Moradi et al. (2018b), independently for each class, to generate chest radiographs with five different abnormalities. The authors demonstrated that this augmentation process improved the abnormality classification performance of DCNN classifiers (ResNet, GoogleNet, AlexNet) by balancing the dataset classes. Another work (Zhang et al., 2019a) proposed a novel GAN architecture to improve the quality of generated CXRs by forcing the generator to learn different image representations. The authors proposed SkrGAN, in which a sketch prior constraint is introduced by decomposing the generator into two modules, generating a sketched structural representation and the CXR image, respectively. Abnormality detection is another task which has been addressed through a combination of image generation and one-class learning methods (Tang et al., 2019c; Mao et al., 2020).
The underlying idea of these methods is that a generative model trained to reconstruct healthy images will have a high reconstruction error when abnormal images are input at test time, allowing them to be identified. Tang et al. (2019c) harnessed GANs, employing a U-Net type autoencoder to reconstruct images (as the generator), together with a CNN-based discriminator and encoder. The discriminator received both reconstructed and real images, providing a supervisory signal for realistic reconstruction through adversarial training.
Similarly, Mao et al. (2020) proposed an autoencoder for abnormality detection which was trained only with healthy images. In this case the autoencoder was tailored to not only reconstruct healthy images but also produce uncertainty predictions. By leveraging uncertainty, the authors proposed a normalized reconstruction error to distinguish abnormal CXR images from normal ones.
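The reconstruction-error idea, including the uncertainty normalization, can be sketched as follows (a toy numpy example with synthetic "images"; the noise levels and the normalization form are illustrative, in the spirit of Mao et al. (2020) rather than their exact formulation):

```python
import numpy as np

def anomaly_score(image, reconstruction, uncertainty=None, eps=1e-6):
    """One-class scoring sketch: mean per-pixel squared reconstruction
    error, optionally normalized by a predicted per-pixel uncertainty.
    High scores flag inputs unlike the healthy training distribution."""
    err = (image - reconstruction) ** 2
    if uncertainty is not None:
        err = err / (uncertainty + eps)  # down-weight known-noisy regions
    return float(err.mean())

rng = np.random.default_rng(1)
healthy = rng.normal(0.5, 0.01, size=(32, 32))
# An autoencoder trained on healthy images reconstructs them well...
good_recon = healthy + rng.normal(0, 0.01, size=(32, 32))
# ...but cannot reproduce an unseen abnormality, so the error spikes.
abnormal = healthy.copy()
abnormal[10:20, 10:20] += 0.5  # synthetic "lesion"

print(anomaly_score(healthy, good_recon) < anomaly_score(abnormal, good_recon))  # True
```

Thresholding this score separates normal from abnormal inputs without any abnormal training data.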
The most widely studied subject in the image generation literature is image enhancement. Several researchers investigated bone suppression (Liu et al., 2020a; Matsubara et al., 2020; Zarshenas et al., 2019; Gozes and Greenspan, 2020; Lin et al., 2020; Zhou et al., 2020b) and lung enhancement (Gozes and Greenspan, 2020) techniques to improve image interpretability. A number of works (Liu et al., 2020a; Zhou et al., 2020b) employed GANs to generate bone-suppressed images. For example, Liu et al. (2020a) leveraged additional input to the generator to guide the dual-energy subtraction (DES) soft-tissue image generation process: bones, edges and clavicles were first segmented by a CNN model, and the resulting edge maps were fed to the generator with the original CXR image as prior knowledge. Building a deep learning model for bone-suppressed CXR generation requires paired dual-energy (DE) images, which are not always abundantly available. Several other studies (Gozes and Greenspan, 2020) addressed this by leveraging digitally reconstructed radiographs (DRRs) to enhance the lungs and bones in CXR; for instance, one of these works trained an autoencoder for generating CXRs with bone suppression and lung enhancement, with the knowledge obtained from DRR images integrated through the encoder.

Domain Adaptation
Most of the papers surveyed in this work train and test their method on data from the same domain. This finding is in line with previously reported studies (Kim et al., 2019; Prevedello et al., 2019) and highlights an important concern: most of the performance levels reported in the literature might not generalize well to data from other domains (Zech et al., 2018).
Several studies (Zech et al., 2018; Cohen et al., 2020b) demonstrated a significant drop in performance when deep learning systems were tested on datasets outside their training domain, for a variety of CXR applications. For example, Yao et al. (2019) investigated the performance of a DenseNet model for abnormality classification on CXR images using 10 datasets that differ in location and patient distribution. The authors empirically demonstrated a substantial drop in performance when a model was trained on a single dataset and tested on the other domains. Zech et al. (2018) observed a similar finding for pneumonia detection on chest radiographs.
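The external-validation protocol that reveals this drop can be illustrated with a toy experiment. Everything below is a synthetic stand-in, not any published setup: the "datasets" are one-dimensional Gaussian features, the "model" is a simple threshold, and the domain gap is a fixed offset mimicking, for example, a different scanner calibration:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_domain(n, shift):
    """Synthetic two-class 'dataset': class 1 has larger feature values;
    `shift` mimics a domain offset (e.g., a different scanner calibration)."""
    y = rng.integers(0, 2, n)
    x = rng.normal(loc=y * 2.0 + shift, scale=1.0, size=n)
    return x, y

# Source-domain training and test sets, plus a shifted external test set.
x_tr, y_tr = make_domain(1000, shift=0.0)
x_in, y_in = make_domain(1000, shift=0.0)
x_ext, y_ext = make_domain(1000, shift=1.5)

# Minimal classifier: threshold halfway between the class means seen in training.
thr = (x_tr[y_tr == 0].mean() + x_tr[y_tr == 1].mean()) / 2

acc_in = ((x_in > thr).astype(int) == y_in).mean()
acc_ext = ((x_ext > thr).astype(int) == y_ext).mean()
print(f"in-domain accuracy: {acc_in:.2f}, external accuracy: {acc_ext:.2f}")
```

The in-domain accuracy substantially overstates how the same decision rule performs on the shifted domain, which is the pattern reported by Yao et al. (2019) and Zech et al. (2018) on real CXR data.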
Domain adaptation (DA) methods investigate how to improve the performance of a model on a dataset from a different domain than the training set. In CXR analysis, DA methods have been investigated in three main settings: adaptation between CXR images acquired with different hardware, adaptation of pediatric to adult CXR, and adaptation of digitally reconstructed radiographs (generated by average intensity projections from CT) to real CXR images. All domain adaptation studies, and studies on generalization, reviewed in this work are detailed in Table 6.
[Table 6: overview of the domain adaptation and generalization studies reviewed in this work (study, one-line method summary, and task, label and dataset codes).]

Most of the research on DA for CXR analysis harnessed adversarial DA methods, which use either generative models (e.g., CycleGANs) or non-generative models to adapt to new domains in a variety of ways. For example, Dong et al. (2018) investigated unsupervised domain adaptation based on adversarial training for lung and heart segmentation. In this approach, a discriminator network (a ResNet) learned to discriminate between segmentation predictions (heart and lung) from the target domain and reference-standard segmentations from the source domain. This forced the FCN-based segmentation network to learn domain-invariant features and produce realistic segmentation maps. A number of works (Chen et al., 2018a; Zhang et al., 2018; Oliveira and dos Santos, 2018) addressed unsupervised DA using CycleGAN-based models to transform source images to resemble those from the target domain. For example, Zhang et al. (2018) used a CycleGAN-based architecture to adapt CXR images to digitally reconstructed radiographs (DRR, generated from CT scans) for anatomy segmentation in CXR: a CycleGAN-based model converted the CXR image appearance while a U-Net variant simultaneously segmented the organs of interest. Similarly, CycleGAN-based models were adapted to transfer DRR images to resemble CXR images for bone segmentation (Oliveira et al., 2020a) and to transform adult CXR to pediatric CXR for pneumonia classification (Tang et al., 2019c). Unlike most studies, which utilized DA methods in an unsupervised setting, a few works considered supervised and semi-supervised approaches to adapt to the target domain. Oliveira et al. (2020b) employed a MUNIT-based architecture (Huang et al., 2018) to map target images to resemble source images, subsequently feeding the transformed images to the segmentation model. The authors investigated both unsupervised and semi-supervised approaches, where some labels from the target domain were available. Another work, by Lenga et al. (2020), studied several recently proposed continual learning approaches, namely joint training, elastic weight consolidation and learning without forgetting, to improve performance on a target domain while effectively mitigating catastrophic forgetting on the source domain. The authors evaluated these methods on two publicly available datasets, ChestX-ray14 and MIMIC-CXR, for a multi-class abnormality classification task and demonstrated that joint training achieved the best performance.
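The adversarial alignment idea underlying these methods can be reduced to a minimal NumPy sketch: a logistic domain discriminator is trained to separate source features from transformed target features, while an affine "generator" applied to the target is updated to fool it. The one-dimensional features, the affine map, and all hyperparameters below are illustrative assumptions, not the architecture of any cited work:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# One-dimensional stand-ins for deep features from two domains.
x_src = rng.normal(0.0, 1.0, 400)    # source-domain features
x_tgt = rng.normal(3.0, 1.0, 400)    # shifted target-domain features

a, b = 1.0, 0.0    # affine 'generator' applied to target features
u, c = 0.0, 0.0    # logistic domain discriminator D(z) = sigmoid(u*z + c)
lr = 0.05

gap_before = abs((a * x_tgt + b).mean() - x_src.mean())

for _ in range(400):
    z_tgt = a * x_tgt + b
    # Discriminator steps: push D toward 1 on source, 0 on transformed target.
    for _ in range(5):
        p_s = sigmoid(u * x_src + c)
        p_t = sigmoid(u * z_tgt + c)
        u -= lr * (((p_s - 1) * x_src).mean() + (p_t * z_tgt).mean())
        c -= lr * ((p_s - 1).mean() + p_t.mean())
    # Generator step: update (a, b) so transformed target features fool D.
    p_t = sigmoid(u * (a * x_tgt + b) + c)
    a -= lr * ((p_t - 1) * u * x_tgt).mean()
    b -= lr * ((p_t - 1) * u).mean()

gap_after = abs((a * x_tgt + b).mean() - x_src.mean())
print(f"feature-mean gap before: {gap_before:.2f}, after: {gap_after:.2f}")
```

After training, the mean of the transformed target features moves toward the source mean, which is the domain-invariance effect the adversarial losses in the cited works aim for, achieved there with deep segmentation networks and image-to-image generators rather than this toy affine map.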

Other Applications
In this section we review articles whose primary application does not fit into any of the categories detailed in Sections 4.1 to 4.5 (14 studies). These works are detailed fully in Table 7. Image retrieval is a task investigated by a number of authors (Anavi et al., 2015, 2016; Conjeti et al., 2017; Chen et al., 2018c; Silva et al., 2020; Owais et al., 2020; Haq et al., 2021). The aim of image retrieval tools is to search an image archive for cases similar to a particular index image. Such algorithms are envisaged as a tool for radiologists in their daily workflow. Chen et al. (2018c) proposed a ranked feature extraction and hashing model, while Silva et al. (2020) proposed to use saliency maps as a similarity measure.
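At its core, feature-based retrieval ranks archive entries by their similarity to the query in an embedding space. The sketch below uses random vectors as stand-ins for CNN features; the archive size, the 128-dimensional embedding, and the near-duplicate query are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical CNN feature vectors for an archive of 500 CXR studies,
# plus a query whose features nearly duplicate archive entry 42.
archive = rng.normal(size=(500, 128))
query = archive[42] + rng.normal(scale=0.01, size=128)

# Cosine similarity between the query and every archive entry.
a_norm = archive / np.linalg.norm(archive, axis=1, keepdims=True)
q_norm = query / np.linalg.norm(query)
sims = a_norm @ q_norm

top5 = np.argsort(sims)[::-1][:5]   # indices of the most similar studies
print("top-5 retrieved:", top5)
```

Hashing approaches such as Chen et al. (2018c) replace the exact similarity scan with compact binary codes so that this ranking scales to very large archives.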
Another task that does not belong to the previously defined categories is out-of-distribution detection. Studies working on this (Márquez-Neila and Sznitman, 2019; Çallı et al., 2019; Bozorgtabar et al., 2020) aim to verify whether a test sample belongs to the distribution of the training dataset, as model performance is otherwise expected to be sub-optimal. Çallı et al. (2019) propose using training dataset statistics at different layers of a deep learning model and applying the Mahalanobis distance to measure how far a sample lies from the training distribution. Bozorgtabar et al. (2020) approach the problem differently and train an unsupervised autoencoder; they then use the feature encodings extracted from CXRs to define a database of known encodings and compare new samples to this database.

Report generation is another task which has attracted interest in deep learning for CXR (Yuan et al., 2019; Syeda-Mahmood et al., 2020; Xue et al., 2018a). These studies aim to partially automate the radiology workflow by evaluating the chest X-ray and producing a text radiology report. For example, Syeda-Mahmood et al. (2020) first determine the findings to be reported and then make use of a large dataset of existing reports to find a similar case; this case report is then customized to produce the final output.
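A stripped-down version of the Mahalanobis-distance check described above, using a single feature set rather than the per-layer statistics of Çallı et al. (2019), might look as follows; the Gaussian features and the distribution shift are synthetic assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical penultimate-layer features: an in-distribution training set,
# an in-distribution test batch, and a shifted out-of-distribution batch.
train = rng.normal(0.0, 1.0, size=(2000, 16))
id_test = rng.normal(0.0, 1.0, size=(100, 16))
ood_test = rng.normal(2.5, 1.0, size=(100, 16))

# Fit the training feature distribution: mean and (inverse) covariance.
mu = train.mean(axis=0)
prec = np.linalg.inv(np.cov(train, rowvar=False))

def mahalanobis(x):
    """Per-sample Mahalanobis distance to the training feature distribution."""
    d = x - mu
    return np.sqrt(np.einsum("ij,jk,ik->i", d, prec, d))

print("ID mean distance :", mahalanobis(id_test).mean())
print("OOD mean distance:", mahalanobis(ood_test).mean())
```

Samples far from the training distribution receive clearly larger distances, so a threshold on this score can flag inputs for which the model's prediction should not be trusted.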
One other task of interest is image registration (Mansilla et al., 2020). This task aims to find the geometric transformation that anatomically aligns a CXR with another CXR image or with a statistically defined shape. The clinical goal is typically to illustrate interval change between two images. Detecting new findings, tracking the course of a disease, and evaluating the efficacy of a treatment are among the many uses of image registration (Viergever et al., 2016). To that end, Mansilla et al. (2020) aim to create an anatomically plausible registration by using heart and lung segmentations to guide the registration process.
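In its simplest rigid form, registration reduces to searching for the transformation that best aligns the two images. The translation-only toy below, with a bright rectangle standing in for anatomy and a brute-force sum-of-squared-differences search, is an illustrative simplification; methods such as Mansilla et al. (2020) learn dense deformable transformations instead:

```python
import numpy as np

# Toy 'CXR': a bright rectangle on a dark background, and a follow-up
# image in which the anatomy is shifted by a known offset.
fixed = np.zeros((64, 64))
fixed[20:40, 24:44] = 1.0
true_shift = (5, -3)
moving = np.roll(fixed, true_shift, axis=(0, 1))

# Translation-only registration: pick the shift that minimizes the
# sum of squared differences (SSD) against the fixed image.
best, best_err = None, np.inf
for dy in range(-8, 9):
    for dx in range(-8, 9):
        err = ((np.roll(moving, (dy, dx), axis=(0, 1)) - fixed) ** 2).sum()
        if err < best_err:
            best, best_err = (dy, dx), err

print("recovered shift:", best)   # undoes true_shift
```

Once the two images are aligned, a subtraction image (the difference of fixed and warped moving image) makes interval change directly visible, which is the mechanism behind the commercial temporal-subtraction products discussed later.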

Commercial Products
Computer-aided analysis of CXR images has been researched for many years; in fact, CXR was one of the first modalities for which a commercial product for automatic analysis became available, in 2008. In spite of this promising start, and of the advances in the field achieved by deep learning, translation to clinical practice, even as an assistant to the reader, has been relatively slow. There are a variety of legal and ethical considerations which may partly account for this (Recht et al., 2020; Strohm et al., 2020); however, there is growing acceptance that artificial intelligence (AI) products have a place in the radiological workflow, and attempts are underway to understand and address the issues to be overcome. In this section we examine the currently available commercial products for CXR analysis.
An up-to-date list of commercial products for medical image analysis (Grand-challenge, 2021; van Leeuwen et al., 2021) was searched for products applicable to chest X-ray. One product was excluded because it is not specifically a CXR diagnostic tool but a texture analysis product for many modalities. The 21 remaining products are listed in Table 8. A number of these products have already been evaluated in peer-reviewed publications, as shown in Table 8; it is beyond the scope of this work to assess their performance. All of the listed products are CE marked (Europe) and/or FDA cleared (United States) and are thus available for clinical use (Grand-challenge, 2021; van Leeuwen et al., 2021).
The commercial products include applications for a wide range of abnormalities, with 6 of them reporting results for more than 5 (and up to 30) different labels. The most commonly addressed task is pneumothorax identification (8 products), followed by pleural effusion (7), nodules (6) and tuberculosis (4). In contrast with the literature, which is dominated by image-level prediction algorithms, 17 of the 21 products in Table 8 claim to provide localization of one or more of the abnormalities they are designed to detect, usually visualized with heatmaps or contouring of abnormalities. Two further products are designed for the generation of bone suppression images, one for interval change visualization, and one for the identification and reporting of healthy images. Products contribute differently to the workflow of the radiologist. Five products focus on detecting acute cases to prioritize the worklist and speed up time to diagnosis. Draft reports are produced by five other products, either for the normal (healthy) cases only or for all cases. The production of draft reports, like workflow prioritization, is aimed at optimizing the speed and efficiency of the radiologist.

Discussion
In this work we have detailed datasets, literature and commercial products relevant to deep learning in CXR analysis. It is clear that this area of research has thrived on the release of multiple large, public, labeled datasets in recent years, with 209 of 295 publications reviewed here using one or more public datasets in their research. The number of publications in the field has grown consistently as more public data becomes available, as demonstrated in Figure 2. However, although these datasets are extremely valuable, there are multiple caveats to be considered in relation to their use, as described in Section 3. In particular, the caution required in the use of NLP-extracted labels is often overlooked by researchers, especially for the evaluation and comparison of models. For accurate assessment of model performance, the use of 'gold-standard' test data labels is recommended. These labels can be acquired through expert radiological interpretation of CXRs (preferably with multiple readers) or via associated CT scans, laboratory test results, or other appropriate measurements.
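The impact of noisy test labels on reported metrics is easy to quantify in the simplest case. The formula below assumes symmetric, model-independent label errors, which real NLP-extracted labels do not satisfy, so it only illustrates the ceiling such noise imposes on apparent performance:

```python
# Under symmetric, model-independent label errors, a model with true
# accuracy `true_acc`, evaluated against labels that are wrong with rate
# `label_error`, appears to score true_acc*(1-e) + (1-true_acc)*e.
def measured_accuracy(true_acc, label_error):
    return true_acc * (1 - label_error) + (1 - true_acc) * label_error

# A 95%-accurate model evaluated with 10% label noise appears ~86% accurate;
# even a perfect model cannot appear better than 90% accurate.
print(measured_accuracy(0.95, 0.10))
print(measured_accuracy(1.00, 0.10))
```

This is one reason gold-standard test labels matter: with noisy labels, genuinely better models become indistinguishable from mediocre ones near the noise ceiling.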
Other important factors to be considered when using public data include the image quality (if it has been reduced prior to release, is this a limiting factor for the application?) and the potential overlap between labels. Although a few publications address label dependencies, this is most often overlooked, frequently resulting in the loss of valuable diagnostic information.
While the increased interest in CXR analysis following the release of public datasets is a positive development for the field, a secondary consequence of this readily available labeled data is the appearance of many publications from researchers with limited experience or understanding of deep learning or CXR analysis. The literature reviewed during the preparation of this paper was very variable in quality. A substantial number of the included papers offer limited novel contribution, although they are technically sound. Many of these studies report experiments predicting the labels on public datasets using off-the-shelf architectures, without regard to the label inaccuracies and overlap, or to the clinical utility of such generic image-level algorithms. A large number of works (142) were excluded for reasons of poor scientific quality. In 112 of these, the construction of the dataset gave cause for concern, the most common example being that the training dataset was constructed such that images with certain labels came from different data sources, meaning that the images could be differentiated by factors other than the label of interest. In particular, a large number of papers (61) combined adult COVID-19 subjects with pediatric (healthy and other-pneumonia) subjects in an attempt to classify COVID-19. Other reasons for exclusion included the presentation of results optimized on a validation set (without a held-out test set), and the inclusion of the same images multiple times in the dataset prior to splitting train and test sets. This latter issue has been exacerbated by the publication of several COVID-19 related datasets which combine data from multiple public sources in one location, and are then themselves combined by authors building deep learning systems.
Such concerns about dataset construction for COVID-19 studies have been discussed in several other works (López-Cabrera et al., 2021; DeGrave et al., 2020; Cruz et al., 2021; Maguolo and Nanni, 2020; Tartaglione et al., 2020).

Although a broad range of off-the-shelf architectures are employed in the literature surveyed for this review, there is little evidence to suggest that one architecture outperforms another for any specific task. Many papers evaluate multiple architectures for their task, but the differences between results are typically small, proper hyperparameter optimization is rarely performed, and statistical significance or the influence of data selection is rarely considered. Many such evaluations also use inaccurate NLP-extracted labels, which muddies the waters even further.
While it is not possible to suggest an optimal architecture for a specific task, it is observed that ensembles of networks typically perform better than individual models (Dietterich, 2000).
At the time of writing, most of the top-10 submissions in the public challenges (CheXpert (Irvin et al., 2019), SIIM-ACR (ACR, 2019), and RSNA-Pneumonia (RSNA, 2018)) consist of network ensembles. There is also promise in the development of self-adapting frameworks such as nnU-Net (Isensee et al., 2021), which has achieved excellent performance in many medical image segmentation challenges. This framework adapts specifically to the task at hand by selecting the optimal choice for a number of steps such as preprocessing, hyperparameter optimization and architecture, and it is likely that a similar optimization framework would perform well for classification or localization tasks, including those for CXR images.
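Why ensembling helps is easiest to see when members make decorrelated errors. The stylized example below constructs three classifiers that each fail on a disjoint third of the cases, so majority voting recovers every case; real ensemble gains are smaller because member errors are correlated:

```python
import numpy as np

# Stylized illustration: three classifiers that each err on a different,
# disjoint third of the cases; majority voting then corrects every error.
y_true = np.tile([0, 1], 150)                 # 300 ground-truth labels

preds = np.tile(y_true, (3, 1))
for m in range(3):
    wrong = np.arange(m, 300, 3)              # member m errs on every third case
    preds[m, wrong] = 1 - preds[m, wrong]

member_acc = (preds == y_true).mean(axis=1)   # each member: 2/3 correct
majority = (preds.sum(axis=0) >= 2).astype(int)
ensemble_acc = (majority == y_true).mean()
print("members:", member_acc, "ensemble:", ensemble_acc)
```

Challenge-winning ensembles average class probabilities across differently trained networks rather than hard votes, but the principle is the same: independent errors cancel.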
In spite of the pervasiveness of CXR in clinics worldwide, translation of AI systems for clinical use has been relatively slow. Apart from legal and ethical considerations regarding the use of AI in medical decision making (Recht et al., 2020;Strohm et al., 2020), a discussion which is outside the scope of this work, there are still a number of technical hurdles where progress can be made towards the goal of clinical translation. Firstly, the generalizability of AI algorithms is an important issue which needs further work. A large majority of papers in this review draw training, validation and test samples from the same dataset. However, it is well known that such models tend to have a weaker performance on datasets from external domains. If access to reliable data from multiple domains remains problematic then domain adaptation or active learning methods could be considered to address the generalization issue. An alternative method to utilize data from multiple hospitals without breaching regulatory and privacy codes is federated learning, whereby an algorithm can be trained using data from multiple remote locations (Sheller et al., 2019). Further research is required to determine how this type of system will work in clinical practice.
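The core aggregation step of federated learning can be sketched as FedAvg-style weighted averaging of locally trained weights. Here a closed-form least-squares fit stands in for local training, and the three "hospitals", their dataset sizes, and the single communication round are illustrative assumptions rather than any deployed system:

```python
import numpy as np

rng = np.random.default_rng(4)

def local_fit(X, y):
    """Closed-form least-squares fit, standing in for local model training."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Three 'hospitals' share the same underlying relationship y = X @ w_true
# but never share raw data; each fits a model locally.
w_true = np.array([2.0, -1.0, 0.5])
sizes = [200, 500, 300]
local_weights = []
for n in sizes:
    X = rng.normal(size=(n, 3))
    y = X @ w_true + rng.normal(scale=0.1, size=n)
    local_weights.append(local_fit(X, y))

# FedAvg-style aggregation: average local weights, weighted by dataset size.
total = sum(sizes)
w_global = sum(n / total * w for n, w in zip(sizes, local_weights))
print("aggregated weights:", np.round(w_global, 2))
```

Only the fitted weights leave each site, yet the aggregate recovers the shared relationship; real federated systems repeat this exchange over many rounds of gradient-based training (Sheller et al., 2019).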
A final issue for deep learning researchers to consider is frequently referred to as 'explainable AI'. Systems which produce classification labels without any indication of reasoning raise concerns of trustworthiness for radiologists. It is also significantly faster for experts to accept or reject the findings of an AI system if there is some indication of how the finding was reached (e.g., identification of a nodule location with a bounding box, or identification of cardiac and thoracic diameters for cardiomegaly detection). Every commercial product for the detection of abnormalities in CXR provides a localization feature to indicate the abnormal location; however, the literature is heavily focused on image-level predictions, with relatively few publications in which localization is evaluated.
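One simple, model-agnostic way to produce such localization evidence is occlusion sensitivity: mask each image region in turn and record how much the model's score drops. The toy "model" below, which scores an image by the brightness of one fixed region, is a hypothetical stand-in for a trained classifier:

```python
import numpy as np

# Toy 'model': scores an image by the brightness of a fixed region,
# mimicking a classifier that relies on one local finding.
LESION = (slice(16, 24), slice(40, 48))
def model_score(img):
    return img[LESION].mean()

img = np.full((64, 64), 0.2)
img[LESION] = 1.0            # the 'finding' the model responds to
base = model_score(img)

# Occlusion sensitivity: zero out each 8x8 patch and record the score drop.
heat = np.zeros((8, 8))
for i in range(8):
    for j in range(8):
        occluded = img.copy()
        occluded[8 * i:8 * (i + 1), 8 * j:8 * (j + 1)] = 0.0
        heat[i, j] = base - model_score(occluded)

hot = np.unravel_index(np.argmax(heat), heat.shape)
print("most influential patch (row, col):", hot)
```

The resulting heatmap highlights exactly the region the model relies on, giving the reader something concrete to accept or reject; gradient-based methods such as class activation mapping serve the same purpose more efficiently.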
Beyond the resolution of technical issues, researchers aiming to produce clinically useful systems need to consider the workflow and requirements of the end-user, the radiologist or clinician, more carefully. At present, in the industrialized world, it is expected that an AI system will act, at least initially, as an assistant to (not a replacement for) a radiologist. As a 2D image, the CXR is already relatively quickly interpreted by a radiologist, and so the challenge for AI researchers is to produce systems that will save the radiologist time, prioritize urgent cases or improve the sensitivity/specificity of their findings. Image-level classification for a long list of (somewhat arbitrarily defined) labels is unlikely to be clinically useful. Reviewing such a list of labels and associated probabilities for every CXR would require substantial time and effort, without a proportional improvement in diagnostic accuracy. A simple system with bounding boxes indicating abnormal regions is likely to be more helpful in directing the attention of the radiologist and has the potential to increase sensitivity to subtle findings or in difficult regions with many projected structures. Similarly, a system to quickly identify normal cases has the potential to speed up the workflow as identified by multiple vendors and in the literature (Dyer et al., 2021;Dunnmon et al., 2019;Baltruschat et al., 2020).
To further understand how AI could assist with CXR interpretation, we first must consider the current typical workflow of the radiologist, which notably involves a number of additional inputs beyond the CXR image that are rarely considered in the research literature. In most scenarios (excluding bedside/AP imaging), both a frontal and a lateral CXR are acquired as part of the standard imaging protocol, to reduce the interpretation difficulties associated with projected anatomy. Very few studies included in this review made use of the lateral image, although there are indications that it can improve classification accuracy (Hashir et al., 2020). Furthermore, the reviewing radiologist has access to the clinical question being asked, the patient history and symptoms and, in many cases, other supporting data from blood tests or other investigations. All of this information assists the radiologist to not only identify the visible abnormalities on CXR (e.g., consolidation), but to infer likely causes of these abnormalities (e.g., pneumonia). Incorporation of data from multiple sources along with the CXR image information will almost certainly improve sensitivity and specificity and avoid an algorithm erroneously suggesting labels which are not compatible with data from external sources. Another extremely important and time-consuming element in the radiological review of CXR is comparison with previous images from the same patient, to assess changes over time. Interval change is a topic studied by very few authors and addressed by only a single commercial vendor (by provision of a subtraction image). Innovative AI systems for the visualization and quantification of interval change with one or more previous images could substantially improve the efficiency of the radiologist. Finally, the radiologist is required to produce a report as a result of the CXR review, which is another time-consuming process addressed by very few researchers and just a handful of commercial vendors.
A system which can convert radiological findings to a preliminary report has the potential to save time and cost for the care provider.
In many areas of the world, medical facilities that do perform CXR imaging do not have access to radiological expertise. This presents a further opportunity for AI to play a role in diagnostic pathways, as an assistant to the clinician who is not trained in the interpretation of CXR. Researchers and commercial vendors have already identified the need for AI systems to detect signs of tuberculosis (TB), a condition which is endemic in many parts of the world, and frequently in low-resource settings where radiologists are not available. While such regions of the world could potentially benefit from AI systems to detect other conditions, it is important to identify in advance what conditions could be feasibly both detected and treated in these areas where resources are severely limited.
The findings of this work suggest that while the deep learning community has benefited from large numbers of publicly available CXR images, the direction of the research has been largely determined by the available data and labels, rather than the needs of the clinician or radiologist. Future work, in data provision and labelling, and in deep learning, should have a more direct focus on the clinical needs for AI in CXR interpretation. More accurate comparison and benchmarking of algorithms would be enabled by additional public challenges using appropriately annotated data for clinically relevant tasks.