Deep learning approaches to automatic radiology report generation: A systematic review



Background
Radiology tests based on modalities such as X-ray, ultrasound (US), computed tomography (CT), and magnetic resonance imaging (MRI) provide detailed insights into patients' bodies without the need for invasive exploratory surgery. They support screening for and diagnosis of medical conditions, as well as monitoring of the response to treatment. As such, radiology tests remain the most common types of imaging tests. As many as 45.2 million imaging tests were reported in England between September 2018 and September 2019; the top four were the radiology tests above, which accounted for 96% of all imaging tests [1]. Before the pandemic, in September 2019, more than 9000 people had been waiting at least six weeks for a CT or MRI test. In just a year, this number became almost ten times higher [2]. The volume of imaging referrals is continually increasing, to the point that many departments promote the idea of 30-60 min turn-around times in order to stay competitive [3]. However, there were only 12.8 radiologists per million population in Europe in 2020, and the corresponding number in the UK was even smaller [2]. The issues of enormous daily diagnostic needs and the lack of radiologists are further aggravated by problems such as diagnostic errors [4][5][6] and interpretation discrepancies between radiologists and physicians [7].
An imaging report represents the most important means of communication between a radiologist and the referring medical professional, both of whom work to provide high-quality patient care [8]. Some of the most important skills any radiologist needs are observational skills to identify abnormalities, analytical skills to relate observed abnormalities to the underlying pathology, and communicative skills to convey their interpretation clearly to both clinicians and patients [8]. A widespread shortage of these skills [9] emphasizes the need for automation in this area, which leads us to the task of Automatic Radiology Report Generation (ARRG).

The goal of ARRG
ARRG is a specific application of Automatic Image Captioning (AIC) in the medical radiology domain. As a form of image-to-sequence generation, AIC relies on understanding image content such as scenes, objects, object properties, and their interactions, and on producing textual descriptions that are both syntactically and semantically sound [10]. ARRG focuses solely on radiology images, with an emphasis on recognizing normal and abnormal appearances and describing them accurately and comprehensively. According to Kaur et al. [7], a high-quality automated radiology report should 1) be clear, concise, and structured for the referring clinician to read effortlessly; 2) be complete regarding positive and negative observations; 3) prioritize the observations; 4) use uniform language (medical terminology) to prevent ambiguity; 5) follow the conventional report format used by radiologists.
To achieve this, attempts in this field involve generating narrative reports [11][12][13][14] and structured reports [15][16][17], and retrieving appropriate sentences from pre-constructed retrieval databases [18][19][20][21], as shown in Figs. 1 and 2. We define narrative radiology reports as direct descriptions of all relevant information and diagnostic impressions provided by radiologists, with free or structured layouts. This is the most common report type in publicly available radiology image/text datasets. Narrative reports vary widely in language, length and style, which may affect their clarity and hence the referring clinicians' decision-making [22]. These issues gave rise to the idea of structured reporting, which has the potential to improve the clarity of radiology reports. Structured radiology reports use uniform, standardized, organized terms to describe the medical content without being affected by the reporting style of radiologists [23]. These terms are regarded as structured report entities that can be easily converted into natural-sounding sentences via templates [24]; examples include sets of tuples comprising anatomy, anatomy qualifier, observation, observation qualifier, certainty and negation [25,26], and common data elements for radiology [27,28]. However, the existing surveys of ARRG have either reviewed only a few studies [29][30][31] or focused on a narrower scope [7], and none of them considered structured report generation, leaving room for a more comprehensive review of this area.

Deep learning approaches to ARRG
One of the earliest studies towards ARRG dates back to 2015, when Shin et al. [32] trained a deep Convolutional Neural Network (CNN) to generate keywords based on CT/MRI images. Later, Shin et al. [33] went on to design the first ARRG system, which could generate five keywords from chest X-rays concerning disease location, severity, and the affected anatomical sites. In 2018, research on ARRG systems started to gain widespread attention [12,14,20,34]. Further details about the evolution of relevant techniques are provided in Fig. 1.
ARRG models typically rely on Deep Learning (DL) approaches, which have shown promising results in AIC [10]. DL techniques enable the models to capture complex patterns and relationships directly from raw data. In contrast, non-DL approaches, such as conventional machine learning, require manually engineered features to develop mathematical or statistical models for pattern recognition [35]. The design and extraction of such features require domain knowledge [36], which creates multi-disciplinary knowledge barriers. Today, non-DL approaches are rarely used as standalone methods in ARRG [37]. Instead, they are often combined with DL methods, which extract the features and thereby avoid time-consuming manual feature engineering [18][19][20][21].
On the other hand, a growing body of literature shows that DL plays an important role in many single-modality tasks in healthcare, such as understanding and interpreting health records [38,39] and handling various medical imaging tasks (e.g. image segmentation) [40]. However, as a multi-modality generative task that involves computer vision [41], Natural Language Processing (NLP) [42] and medical image analysis [43], ARRG is a computationally challenging problem. To comprehensively discuss the DL approaches in ARRG, this review is organized around three key aspects: training datasets, model designs and evaluation methods, which correspond to the targets, approaches, and outcomes of the model training, respectively.

Materials and Methods
This review was conducted in accordance with the Preferred Reporting Items for Systematic Review and Meta-Analyses (PRISMA) 2020 statement [44]. The main aim of this study is to systematically review DL approaches to ARRG in order to answer the following Research Questions (RQs).
RQ1: What data have been used to train and evaluate DL approaches to ARRG?
RQ2: How is DL used to generate the reports from radiology images?
RQ3: How are DL approaches to ARRG evaluated?
RQ1 aims to identify the key properties of data used to train and evaluate DL approaches to ARRG. These properties include the imaging modalities used, the anatomical sites involved, the internal structure of the reports, their labels and basic statistical properties. RQ2 describes the methods used to solve the problem of ARRG. This question is concerned with the way in which ARRG can be represented formally by mapping it onto a set of relevant computational problems (e.g. object detection, multi-label classification, text generation), so that future research on ARRG can draw on literature that is useful for, but not strictly limited to, ARRG. More importantly, it addresses the way in which these problems are integrated into a DL framework to support ARRG. Finally, RQ3 focuses on the evaluation methods for ARRG. The multimodal nature of this problem makes it difficult to compare the effectiveness of different approaches, thus highlighting the need for a comprehensive evaluation framework.
The scope of the review is defined by a set of inclusion and exclusion criteria described in Table 1. Relevant studies were identified from two scientific databases: Web of Science [45], which comprises 171 million citations in various academic disciplines, and PubMed [46], which indexes more than 33 million citations on the subject of biomedical and life sciences. The process of constructing the search query is shown in Table 2. The query is built on three facets: "deep learning", "text generation", and "radiology". They correspond to the method, target, and application area of ARRG, respectively. The three facets were combined into a Boolean search query using the AND operator.
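As an illustration of the facet-based query construction, the following minimal Python sketch joins facets with AND. The synonym lists and the use of OR within a facet are assumptions made for the example; the actual search terms are those given in Table 2.

```python
def build_query(facets):
    """Join the terms of each facet with OR, then join the facets with AND."""
    groups = []
    for terms in facets.values():
        quoted = [f'"{term}"' for term in terms]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)


# Hypothetical synonym lists; only the three facet names come from the text.
FACETS = {
    "deep learning": ["deep learning", "neural network"],
    "text generation": ["report generation", "text generation"],
    "radiology": ["radiology", "chest X-ray"],
}

query = build_query(FACETS)
print(query)
```

Because each facet must match at least one of its terms, adding synonyms widens recall within a facet while the AND operator keeps the result set restricted to studies covering all three.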

Study selection
The search conducted on November 3, 2021, returned a total of 1443 records. All titles and abstracts were screened by two independent reviewers (Y.L., I.S.), achieving high interrater agreement measured by Cohen's kappa coefficient (κ = 0.844, n = 1443) [47]. All disagreements were resolved by the third independent reviewer (H.L.). One reviewer (Y.L.) reviewed the full text of the 34 candidate studies. The inspection of their references revealed 21 additional studies, which were added to the pending list for full-text reviewing. Any uncertainties were resolved by discussion (Y.L., I.S., H.L.). Ultimately, a total of 41 studies were included. The selection process and its outcomes are summarized in Fig. 3. Data were extracted by one reviewer (Y.L.). After deeming meta-analysis not applicable to this review due to the heterogeneity of training data and evaluation methods, we conducted a narrative synthesis of the main findings.
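For readers unfamiliar with the agreement statistic, Cohen's kappa compares the observed agreement p_o with the agreement p_e expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below computes it for two reviewers; the ten screening decisions are invented for illustration and are not the review's data.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("both raters must label the same non-empty set of items")
    categories = set(labels_a) | set(labels_b)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# Invented toy decisions for ten records (not the review's screening data).
reviewer_a = ["include"] * 4 + ["exclude"] * 6
reviewer_b = ["include"] * 3 + ["exclude"] * 7
kappa = cohens_kappa(reviewer_a, reviewer_b)
print(round(kappa, 3))
```

In this toy case the reviewers agree on nine of ten records (p_o = 0.9) and chance agreement is 0.54, so kappa is lower than the raw agreement rate.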

RQ1: Data
The quality of training data is an important factor affecting the performance of DL models. Table 3 summarizes publicly available datasets used to train the ARRG systems. The imaging data were generated using four prevalent modalities: X-ray, CT, MRI, and US. Some of the reports have no specific structure, some are semi-structured into sections by means of headings, while others are fully structured as tuples, which can be expanded into full reports by either rule-based or DL approaches [48]. Fig. 4 provides an example of a chest X-ray study, where the findings and impression are typically the primary targets for ARRG. The findings section describes a radiologist's observations regarding different regions in the image, whereas the impression section summarizes these observations. A comprehensive list of headings is provided in Fig. 5. Some of the datasets used in the reviewed studies were not publicly available or were found not to be appropriate for the ARRG tasks when used on their own. For example, ChestX-ray8/ChestX-ray14 [77] and CheXpert [78] contain only disease labels rather than reports, and were typically used to pre-train [20,21,51,54,56,62,65,68] or fine-tune [61] a model, or used in combination with other datasets [34,58]. Other datasets used in the included studies were not publicly available, including MRI datasets [17,79,80], US datasets [11,81-83], a chest X-ray dataset [84], and a mammography dataset [71].

ARRG is a branch of AIC applied in radiology. The majority of ARRG models were derived from one or more AIC models and were further enhanced to meet specific application requirements. This section gives a broad overview of the fundamental frameworks, architectures, and techniques involved in the ARRG models to help researchers build a general picture of ARRG. We refer the reader to a recent AIC review for the details [10].

Encoder-decoder framework.
The encoder-decoder architecture is a basic framework for developing DL-based, end-to-end AIC models [10]. It was originally introduced for sequence-to-sequence generation [85] and later adopted for image-to-sequence generation in the AIC field [86]. This framework consists of two core components: a visual encoder that extracts image features, and a textual decoder that learns the mapping from the image representation to the text representation and consequently generates sequences. In ARRG, the generated sequences may consist of narrative words [11-14,20,21,34,50-52,54-65,69,71,81-83] or structured report entities [15-17,33,53,68,73,75,79,80,87,88]. Moreover, the encoder and decoder can be implemented by different network architectures, as shown in Fig. 1. For example, the CNN-RNN architecture [11,33,34,53,65,68,71,81-83] is one such implementation, in which a CNN serves as the encoder and an RNN serves as the decoder.
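The control flow of the encoder-decoder framework can be sketched without any DL library. In the toy Python example below, `encode` stands in for a CNN and `token_scores` for one step of an RNN or transformer decoder; both functions, the four-word vocabulary and the brightness rule are hypothetical placeholders chosen only to make the greedy decoding loop runnable.

```python
BOS, EOS = "<bos>", "<eos>"
VOCAB = ["no", "opacity", "present", EOS]


def encode(image):
    """Toy visual encoder: one feature (mean intensity) per image row."""
    return [sum(row) / len(row) for row in image]


def token_scores(features, prefix):
    """Toy decoder step: score every vocabulary token given features and prefix.

    A trained decoder would compute these scores with learned weights; here a
    fixed script is used so that the example is deterministic.
    """
    bright = sum(features) / len(features) > 0.5
    script = ["opacity", "present", EOS] if bright else ["no", "opacity", EOS]
    target = script[min(len(prefix) - 1, len(script) - 1)]
    return {tok: 1.0 if tok == target else 0.0 for tok in VOCAB}


def generate(image, max_len=10):
    """Greedy autoregressive decoding: encode once, then emit token by token."""
    features = encode(image)
    tokens = [BOS]
    for _ in range(max_len):
        scores = token_scores(features, tokens)
        best = max(scores, key=scores.get)  # greedy choice
        if best == EOS:
            break
        tokens.append(best)
    return tokens[1:]


print(generate([[0.9, 0.8], [0.7, 0.6]]))
print(generate([[0.1, 0.2], [0.0, 0.1]]))
```

The point of the sketch is the interface: the image is encoded once, and every decoding step is conditioned on both the image features and the tokens generated so far.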

Retrieval framework.
A less commonly seen framework in ARRG is the retrieval framework, which has no fixed architecture [18][19][20][21]. The key concern when using this framework is the design of retrieval methods that match the extracted image features to corresponding sentence templates, which raises two further considerations: the methods for visual feature extraction and the construction of template databases for retrieval. We identified four retrieval methods in ARRG: computing cosine similarity between the visual embeddings of images and choosing the corresponding sentences [19]; aligning the visual and semantic features and computing the visual-semantic similarities via an attention-weighted sum of squared l2-normalized Euclidean distances [18]; treating sentence selection as a multi-label classification problem [21]; and training an agent to retrieve sentences via reinforcement learning [20].
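The first of these retrieval methods, nearest-neighbour lookup by cosine similarity, can be sketched as follows. The two-dimensional embeddings and the stored sentences are invented for illustration; in practice the embeddings would come from a trained visual encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query_embedding, database):
    """Return the sentence stored with the most similar image embedding."""
    best = max(database, key=lambda item: cosine(query_embedding, item[0]))
    return best[1]


# Hypothetical database of (image embedding, report sentence) pairs.
DATABASE = [
    ([1.0, 0.0], "No acute cardiopulmonary abnormality."),
    ([0.0, 1.0], "Right lower lobe opacity."),
]

print(retrieve([0.1, 0.9], DATABASE))
```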

Recurrent neural network (RNN).
RNNs and their variants (e.g. LSTM [89] and GRU [90]) can maintain long-range sequential information in their hidden state, ensuring that each word is generated according to its context. In ARRG, they are typically integrated with CNNs and are responsible for textual decoding, forming the basic CNN-RNN architecture [11,33,34,53,65,68,71,81-83]. Moreover, there are two modified branches of the CNN-RNN architecture: CNN-SRNN [20,55,88] and CNN-HRNN [12,14,52,54,56-61,64]. CNN-SRNN employs stacked RNNs (SRNN) as the decoder. Compared with general RNNs, an SRNN has multiple recurrent hidden layers stacked on top of each other, which allows it to capture sequence inputs at different time scales and thus to represent sequential text more naturally [91,92]. The SRNN structure can also be integrated into the more complex CNN-HRNN architecture [14,57,58]. CNN-HRNN, in turn, uses a hierarchical RNN (HRNN) as the decoder. An HRNN stacks multiple RNNs in a way that models the hierarchical structure of a text sequence, enabling it to better capture linguistic features and generate longer texts [93]. Generally, text is hierarchically structured into sentences and words. Therefore, the decoding process typically starts with a sentence decoder that generates sentence-level semantic features (also known as topic vectors), followed by a word decoder that parses each topic vector into a sequence of words [12,52,54,60,64]. In addition, the radiology report structure also gives rise to a hierarchy that an HRNN can leverage [14,57,58].

Transformer.
The transformer architecture is based on the encoder-decoder framework and exclusively employs self-attention units [94]. This design makes the training parallelizable, leading to greater computational efficiency, and enables better capture of long-range dependencies in sequences. Although the transformer encoder and decoder can be split and combined with other architectures, such as a CNN encoder or an RNN decoder [10], the entire transformer architecture is usually connected to a CNN encoder in ARRG [13,62,63,69]. More details about the self-attention unit are discussed in Section 3.3.1.3.3.

Generative Adversarial Network (GAN).
GAN [95] aims to train a generator network that can generate new data resembling a given training dataset. This is accomplished through an adversarial training process, where an additional discriminator network is introduced to work against the generator network. Specifically, the generator network is trained to generate realistic data that can evade detection by the discriminator network, while the discriminator network is trained to distinguish between the generated data and the real data. In ARRG, GAN is typically employed for the segmentation of the spinal structures in lumbar spine MRI [17,79]. Nevertheless, one study innovatively used the inverse mapping of the GAN's generator instead of the traditional CNN encoder for visual extraction [50].

Fig. 6. The visual extraction processes of using CNNs. The upper two paths indicate the global feature encoding. The lower path shows the regional feature encoding.
Y. Liao et al.

The graph is a useful structure for explicitly representing the relationships between entities, and consists of nodes and edges. Both image and text can be encoded as a graph [21]. For example, an image can be converted into a graph by treating pixels as nodes and linking adjacent pixels with edges. Similarly, in text, individual words can be treated as nodes and the relationships among words can be represented by edges. In ARRG, Graph Neural Networks (GNNs) such as the Graph Convolutional Network (GCN) and Graph Transformer (GTR) have been utilized to leverage the graph structure [21,56]. Fig. 7 shows a simplified framework that incorporates GNNs for learning graph representations. In ARRG, graph structures are typically designed based on prior knowledge of radiology ontology and constructed from corresponding reports [21,56]. Conversely, in AIC, graphs are constructed using object detection and relationship prediction [10,96] or directly by off-the-shelf scene graph parsers [97,98]. Graph structures have also been used to improve the consistency of spinal structure classification [17]. In this case, prior knowledge of spinal structures was converted into a graph and embedded into the model to enable reasoning capabilities.
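To make the graph-based representation more concrete, the sketch below performs one graph-convolution step on a three-node graph using the widely used GCN propagation rule (self-loops plus symmetric degree normalization, followed by a linear map and ReLU). It is a generic illustration and not the exact layer of any reviewed model; the node names, adjacency, features and weights are all invented.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def normalized_adjacency(adj):
    """Add self-loops, then scale entry (i, j) by 1 / sqrt(deg_i * deg_j)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(adj, features, weights):
    """One propagation step: ReLU(A_norm @ features @ weights)."""
    aggregated = matmul(normalized_adjacency(adj), features)
    projected = matmul(aggregated, weights)
    return [[max(0.0, x) for x in row] for row in projected]


# Hypothetical graph: node 0 = "lung", node 1 = "opacity", node 2 = "heart";
# only "lung" and "opacity" are linked.
ADJ = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# One-hot features and identity weights make the mixing of neighbours visible.
out = gcn_layer(ADJ, IDENTITY, IDENTITY)
print(out)
```

After one step, the "lung" and "opacity" nodes share information with each other, while the unconnected "heart" node keeps only its own feature.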

Reinforcement learning (RL).
RL is a machine learning paradigm that aims to train an agent to interact with an environment using optimal actions [99]. In supervised learning, ARRG is commonly performed by minimizing the cross-entropy loss via gradient descent, allowing the model to fit the data. However, this approach may not necessarily allow the model to optimize toward a specific metric of interest. In contrast, RL can directly use metrics as rewards and optimize the model by policy gradient, alleviating the discrepancy between the model training goal and a given evaluation metric. In ARRG, RL has been combined with different architectures, including the CNN-HRNN architecture [20,59,61] and the CNN-transformer architecture [51], in which the REINFORCE algorithm [100] is the most commonly used policy-gradient method. When RL is introduced, the ARRG problem is redefined as follows: the agent is the ARRG model; the environment is the input of the model (i.e. the visual features and the input sequences); the policy is defined by the model's parameters; the model's outputs form a sequence of actions taken by the agent under the current policy; the ground-truth reports define the optimal sequences of actions; and the rewards are obtained by comparing the agent's actions with the optimal actions via the target metrics.

Fig. 7. Overview of the encoder-decoder framework integrated with a GNN for graph embedding. The construction processes of graphs are demonstrated in three cases. Case 1 is training a GNN to learn to generate graphs for ARRG [21]. Cases 2 and 3 are found in general image captioning. Case 2 uses pre-defined predictors to extract the semantic and spatial relationships from the detected objects and forms them into graph structures [96]. Case 3 uses an off-the-shelf graph parser to generate scene graphs [97].

4.3.1.3.3. Attention mechanism.
The attention mechanism is a method that can combine the elements of distinct feature embeddings with different weights according to element-wise correlation, rather than relying solely on a fixed representation [101,102]. In ARRG, this method can be divided into cross-model attention (CMA) [12,14,17,18,20,34,52,54,56-61,64,65,68,69,88] and intra-model attention (IMA) [13,21,34,51,62,63,69]. The major difference is that the feature embeddings in CMA come from distinct models, whereas those in IMA are the same embedding from a single model. Furthermore, CMA is typically equipped with the soft attention mechanism [101,102] to establish dynamic associations between visual features and linguistic features [14,18,34,52,54,56-59,61,64,65,68,69,88]. Several soft-attention improvements have been proposed in ARRG to capture more information from various aspects, such as local semantic attention, which discloses the dependency between local visual features and distinct symbolic nodes [17]; multi-attention, which refines the visual attention process into channel- and position-level processing [52]; and co-attention, which enables the linguistic features to simultaneously attend to visual features and predicted label features [12,60]. On the other hand, IMA is typically combined with CMA, following the transformer architecture [13,21,51,62,63,69]. In the transformer, both types of attention are based on the scaled dot-product attention mechanism (also known as self-attention) [94]; IMA captures the internal dependencies of the feature embeddings in the encoder and the decoder, and the resulting weighted feature representations are then correlated by CMA. In addition to the sequential usage in the transformer architecture, IMA can also be used in parallel with CMA, in which case both weighted features are concatenated and used for auxiliary classification [34]. Fig. 8 illustrates the common usages of CMA and IMA in ARRG.
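The policy-gradient idea can be illustrated with the per-sample REINFORCE loss, −(r − b) · Σ_t log π(a_t), where r is the reward of a sampled report and b an optional baseline. In the sketch below, `unigram_precision` is a deliberately simple stand-in for metrics such as CIDEr or BLEU-4, and the token probabilities and reports are invented.

```python
import math


def unigram_precision(generated, reference):
    """Toy reward: fraction of generated tokens that occur in the reference."""
    if not generated:
        return 0.0
    reference_tokens = set(reference)
    return sum(tok in reference_tokens for tok in generated) / len(generated)


def reinforce_loss(token_probs, reward, baseline=0.0):
    """REINFORCE loss for one sampled sequence.

    token_probs holds the probability the policy assigned to each sampled token.
    Minimizing the loss raises the likelihood of sequences whose reward exceeds
    the baseline and lowers it for sequences that fall below the baseline.
    """
    log_prob = sum(math.log(p) for p in token_probs)
    return -(reward - baseline) * log_prob


# Invented example: a two-token sampled report, each token drawn with p = 0.5.
sampled = ["no", "opacity"]
reference = ["no", "focal", "opacity"]
reward = unigram_precision(sampled, reference)
loss = reinforce_loss([0.5, 0.5], reward)
print(reward, loss)
```

Subtracting a baseline does not change the expected gradient but can reduce its variance, which is why a sample whose reward equals the baseline contributes zero loss.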
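The distinction between CMA and IMA can be made concrete with scaled dot-product attention, which the transformer uses for both. In the sketch below the same function serves either role: for CMA the query would come from the textual decoder and the keys and values from the visual encoder, whereas for IMA all three would come from the same embedding. The vectors are invented, and this is a single-head illustration without learned projections.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(query, keys, values):
    """Return (context vector, attention weights) for a single query."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    context = [
        sum(w * value[j] for w, value in zip(weights, values))
        for j in range(len(values[0]))
    ]
    return context, weights


# Hypothetical CMA use: one text-side query attending over two image regions.
text_query = [1.0, 0.0]
region_keys = [[4.0, 0.0], [0.0, 4.0]]
region_values = [[2.0, 0.0], [0.0, 2.0]]
context, weights = scaled_dot_product_attention(text_query, region_keys, region_values)
print(weights, context)
```

The weights sum to one, so the context vector is a convex combination of the region values, dominated here by the region whose key aligns with the query.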

Targeting the report generation
Existing studies have addressed the ARRG problem by transforming it into specific DL tasks that cater to different objectives and requirements based on the application scenarios (training datasets). These approaches have resulted in the development of various ARRG models, which can be broadly classified into three categories, as depicted in Fig. 9. The most salient features of these ARRG models are illustrated in Appendix A.

Narrative report generation is the most prevalent objective of ARRG models. However, the expected form of the generated reports affects the choice of the model architecture. In the case of generating US reports, which tend to be short descriptions or voice-over captions explaining the image, the corresponding ARRG models are designed using a simple CNN-RNN architecture [11,81-83]. CNN-RNN-Merging [83] simply concatenated the feature vectors of the CNN encoder and RNN decoder and passed them to a fully connected layer to predict the following words. HCNN-RNN [82] proposed an ensemble of multiple CNNs to cope with their multi-class dataset. FRCNN-RNN [81] and SFNet [11] captured the location and semantic information of focus areas, producing overall representations that were subsequently concatenated with text features for report generation. Notably, SFNet fused the features of focus areas at different time nodes, achieving better accuracy of pathological information when generating reports.

Long coherent report.
The need to generate longer coherent reports is more commonly seen in ARRG. For this reason, it is necessary to improve the simple CNN-RNN architecture designed for AIC. To achieve this, CNN-MSRNN [55] proposed using three stacked LSTMs in place of the general RNN decoder. This model performed better in generating reports for normal samples than for abnormal samples. Furthermore, CNN-HRNN-MultiAtt [52] adopted the CNN-HRNN architecture and proposed a multi-step attention mechanism, which decomposed the single-step visual attention into channel- and position-level processing.
Apart from the traditional encoder-decoder architecture, recent studies have explored the use of transformer-based methods. Both MemoryDrivenTR [13] and DSTR [69] implemented a transformer encoder for the secondary encoding of visual features, followed by a transformer decoder for report generation. Moreover, MemoryDrivenTR used a memory mechanism to enhance the transformer decoder, while DSTR utilized extracted disease labels to fine-tune the model and improve the clinical coherence of the report. CDGPT2 [62] and ConsecutiveTR [63] directly employed large language models pre-trained using the transformer architecture as the decoders. Additionally, ConsecutiveTR added an intermediate process to perform an abstract transformation from image features to high-level reporting context. The Natural Language Generation (NLG) scores reported in the original studies [13,62,63,69] suggested that ConsecutiveTR and MemoryDrivenTR performed similarly to each other, and that CDGPT2 and DSTR also performed similarly to each other but worse than the former two.
Other studies have combined RL with the above architectures to address this task. CMAS [61] and CNN-HRNN-RL [59] were based on the CNN-HRNN architecture, while RTMIC [51] used a CNN-transformer architecture. Among the RL-based models, HRGR-Agent [20] and RTMIC used CIDEr as the reward, while CMAS used BLEU-4 as the basis for its rewards. CNN-HRNN-RL also incorporated a novel reward regarding clinical efficacy.

Utilizing auxiliary classification.
To enable the CNN-RNN architecture to be utilized for long report generation, ARRG models often incorporate classifiers alongside the traditional CNN-RNN architecture. For example, TieNet [34] performed disease classification and report generation simultaneously, utilizing two attention mechanisms to highlight essential words and image areas over the outputs. However, it might sacrifice classification performance for better generation performance [65]. Therefore, Vispi [65] proposed performing disease classification and report generation in sequence, thereby utilizing the former to enhance the latter. Moreover, Vispi's classification module not only predicted disease labels but also located the lesion areas. Hence, the generation module could separately generate overall abnormal findings, fine-grained abnormal findings, and normal findings. In addition, FCN-MLC-LSTM [71] proposed using U-Net's down-sampling portion [103] as the CNN encoder backbone to identify and classify lesions in mammography. The corresponding label was then transformed into a semantic embedding and passed to the decoder.
Moreover, this approach can also be seen in other architectures. In the CNN-HRNN architecture, CNN-HRNN-CoAtt [12] and CNN-HRNN-AttF [54] performed visual feature extraction and label prediction during the encoding stage; the former jointly represented the features and labels through the co-attention mechanism, and the latter directly passed them to separate decoders. In addition to leveraging the detected tags, CNN-HRNN-GLP [60] proposed embedding the outputs of the decoders into the same semantic space and augmenting the training data by using similarity matching. The diversity of the generated sentences was consequently improved. To mitigate data imbalance, CNN-HRNN-Dual [64] proposed dual word-level LSTMs with a sentence predictor, which processed normal findings and abnormal findings, respectively. Notably, CNN-HRNN-GLP introduced a novel pooling approach for its classification module, which achieved higher recall and precision than traditional global feature pooling. The other models involved are SentSAT-KG [56], which used graph structures to embed prior knowledge into its CNN-HRNN-based model to improve generation performance, with the graph embedding module pre-trained via multi-label classification; and GAN-ARAE [50], which proposed using a GAN's generator to extract image features and using the decoder of the Adversarially Regularized Autoencoder [104] to generate a diagnosis label and text simultaneously.

4.3.2.1.4. Leveraging report hierarchy.
The final method is explicitly designed for chest X-rays to generate two-section reports (i.e. findings and impression). This method leverages the CNN-HRNN architecture and takes into account the report hierarchy rather than the linguistic hierarchy. In this regard, the findings and impression sections, which contain the detailed descriptions of images and the corresponding summaries, are usually handled by different modules and trained jointly or separately. In CNN-HRNN-RecAtt [14], the visual features were passed to an RNN decoder to generate the impression. Subsequently, a sequence-to-sequence model was employed to extract the semantic features of the impression, which were combined with the regional visual features to generate the findings recurrently. CNN-HRNN-IDC [58] used the same structure as CNN-HRNN-RecAtt. However, the semantic features of the impression were combined with visual features as additional constraints to initialize the decoder of the findings module, making the generated sentences conform to the topic of the entire report. Additionally, STS [57] combined the ideas of Vispi and CNN-HRNN-RecAtt. It first utilized a binary classifier to distinguish normal image samples from abnormal cases. A model with the CNN-SRNN architecture was then used to generate findings from the classified images. Finally, a summarization module based on a sequence-to-sequence architecture was employed to summarize the generated findings into impressions.

Structured report generation.
This task aims to generate structured reports comprising a large number of labels that refer to anatomical sites and lesions, together with the corresponding intensity, location, shape, size, etc. The intuitive solution is to use a CNN-based classification framework. For instance, LesaNet [15] and its predecessor, pre-LesaNet [73], used CNNs to predict 171/145 predefined structured report entities. These detailed descriptive entities can be easily converted into narrative reports, for example by the ontological mapping used in CNN-FFL [16] or by the decision trees used in MultusRadBot [80,87]. CNN-SVM [75] decomposed the task into 43 independent classification questions with close-ended answers and open-ended numerical answers, each associated with its own CNN.
The CNN-RNN architecture is also suitable for this task. Cascade CNN-RNN [33] proposed recurrently learning different levels of information through weight reuse. However, such a design was prone to error propagation and reduced the model's generalization ability. To overcome this, Sequence CNN-RNN [53] changed the timing and manner of passing image feature embeddings to the decoder, so that the model could maintain the correlation between the image features and the generation of structured report entities at the time-step level, resulting in better generation performance. CNN-SAT [68] employed the classic AIC model [102] and showed that introducing additional patient data could effectively increase the percentage of correctly generated structured diagnostic report sentences. In addition to the CNN-RNN architecture, CNN-SRNN-Att [88] replaced the 1-layer RNN decoder with two stacked LSTMs to generate structured report entities, which were then expanded into reports by templates.
We also found that ARRG models tend to generate structured report entities for lumbar spine MRI reporting [17,79,80,87]. However, the structural correlations of the lumbar spine are an important basis for the reporting, necessitating a segmentation of the disc regions. In this regard, MultusRadBot employed DeepSPINE [105] in its segmentation module, while RGAN-PL [79] and NSL-AGNet [17] applied GANs to segment and classify the lumbar spine structures. The classification results were expanded into structured report entities and compiled into report templates by logical reasoning methods and rule-based methods. Moreover, NSL-AGNet converted the prior knowledge of spinal structures into graph structures and embedded them into the model to improve the consistency of spinal structure identification.
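The rule-based expansion of structured report entities into sentences can be sketched with a simple template. The tuple fields below follow the entity structure mentioned earlier in this review (anatomy, anatomy qualifier, observation, observation qualifier, certainty and negation); the `Entity` class, the template wording and the example values are assumptions made for illustration and are not taken from any reviewed model.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    """One structured report entity (hypothetical schema)."""
    anatomy: str
    observation: str
    anatomy_qualifier: str = ""
    observation_qualifier: str = ""
    certainty: str = ""
    negation: bool = False


def render(entity):
    """Fill a fixed sentence template from one entity."""
    site = " ".join(p for p in (entity.anatomy_qualifier, entity.anatomy) if p)
    finding = " ".join(p for p in (entity.observation_qualifier, entity.observation) if p)
    if entity.negation:
        return f"No {finding} in the {site}."
    prefix = f"{entity.certainty.capitalize()} " if entity.certainty else ""
    sentence = f"{prefix}{finding} in the {site}."
    return sentence[0].upper() + sentence[1:]


# Invented lumbar spine examples.
entities = [
    Entity(anatomy="disc", observation="herniation",
           anatomy_qualifier="L4-L5", observation_qualifier="mild",
           certainty="possible"),
    Entity(anatomy="disc", observation="herniation",
           anatomy_qualifier="L5-S1", negation=True),
]
report = " ".join(render(e) for e in entities)
print(report)
```

Because every sentence is produced from the same template, the wording stays uniform regardless of which model predicted the entities, which is the main appeal of structured reporting noted above.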

Retrieval-based approach and hybrid retrieval-generation approach
Unlike the traditional generation-based approach, the retrieval-based approach does not generate new text. Instead, it fetches the most relevant data from an existing database. Generally, the data are unified into a joint embedding space to compute their similarity, and the results are stored in a database for later use. For example, CNN-CVSE [18] proposed a metric learning-based method to learn visual-semantic embeddings, whereby the fine-grained similarity between the lesion regions and abnormal findings could be measured. However, although it successfully mitigated the weaknesses of the generation-based approach, namely producing repetitive sentences and a bias toward normal findings, it did not achieve satisfactory NLG scores. On the other hand, RTEx [19] compared the visual features of the input images and the database images using cosine similarity and assigned the diagnostic sentences of the most similar image to the target image. It achieved high clinical correctness through retrieval constraints on image tags and priority training on abnormal exams.
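The retrieval step described above can be illustrated with a minimal sketch. It assumes that image features have already been extracted into fixed-length vectors; the function names, the toy feature vectors and the example sentences are hypothetical and are not taken from RTEx [19] or any other reviewed system.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)

def retrieve_report(query_features, database):
    """Return the report text of the most similar database image.

    `database` is a list of (feature_vector, report_text) pairs
    (a hypothetical in-memory stand-in for a real report database).
    """
    best_features, best_report = max(
        database, key=lambda item: cosine_similarity(query_features, item[0])
    )
    return best_report, cosine_similarity(query_features, best_features)

# Toy data for illustration only.
database = [
    ([1.0, 0.0, 0.0], "No acute cardiopulmonary abnormality."),
    ([0.0, 1.0, 0.0], "Right pleural effusion."),
    ([0.6, 0.8, 0.0], "Mild cardiomegaly."),
]
report, score = retrieve_report([0.5, 0.9, 0.0], database)
```

In this sketch the retrieved text is copied verbatim from the nearest neighbour, which is why such approaches avoid generating new text but depend on the coverage of the database.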
In addition, several studies used the retrieval-based approach to complement report generation. One example is HRGR-Agent [20], which used a CNN-SRNN architecture to combine sentence retrieval and report generation. It utilized RL to train the model with CIDEr as a reward, in which an agent determined, based on the topic states of a sentence decoder, whether to use a word decoder to generate sentences or to retrieve them directly from the database. Another study, KERP [21], utilized graph structures to represent reports as intermediate states in the process of image-to-report generation. These states were then used to perform disease classification and retrieve template sentences. KERP introduced a Graph Transformer to process multi-domain graph structure data and rewrite the template sentences into final reports. It outperformed HRGR-Agent in BLEU and ROUGE scores, while its CIDEr score was lower.

RQ3: evaluation
The ability to generate high-quality radiology reports that are both readable and accurate is the key consideration when developing an effective ARRG model. The included studies employed a wide range of methods to evaluate the outputs of ARRG quantitatively or qualitatively. The quality of automatically generated text can be measured as a function of how closely it resembles natural language as it is used by humans. A range of automatic measures were devised in the machine translation field and have since spread to natural language generation (NLG). However, these common NLG metrics are not specifically designed for the ARRG task. Thus, complementary metrics were proposed for measuring the quality of radiology reports with respect to their clinically relevant content. Nonetheless, finer-grained judgements of the quality of automatically generated text can currently only be provided by humans. Some of the included studies adopted human expert evaluation. Typically, a Likert scale is used to elicit responses from medical experts, and various statistics are used to measure the reliability of these responses. The full range of methods used to evaluate ARRG is summarized in Table 4.

Quantitative evaluation
We hoped to compare all ARRG models directly by referring to specific values of the relevant metrics. Unfortunately, the reported results of many ARRG models and other studies' baseline experiments proved inconsistent. First, there is no benchmark that would allow the models to be evaluated on a common dataset. Second, different metrics are used to report the results. To provide an overview of the performance, we restricted the comparison between models to the baseline experiments of a single study and exhibited them in a performance matrix, as shown in Fig. 10. More specifically, we designed a simple metric score merging algorithm that uses percentages to show the performance gap between models. Given any target model t and baseline model b, the performance gap p(t, b): where i ∈ {BLEU, ROUGE, METEOR, CIDEr}, m_i is a specific metric value, and num(i) is the number of metrics used. If i = BLEU, then: Note that for each row in Fig. 10, the red colour with a positive value indicates that the target model outperformed the baseline model and vice versa. From the perspective of columns, the target models are compared by benchmarking against the same baseline model. However, such ranking results vary with the choice of baseline model, which also reflects the inconsistency issue.
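The displayed formulas for p(t, b) are not reproduced in this text, so the following is only a sketch of one plausible reading, not the authors' definition. It assumes that the gap is the mean relative difference between target and baseline over the metrics reported for both models, expressed as a percentage, and that BLEU is first collapsed into a single value by averaging the reported BLEU-n scores. Both assumptions, as well as the function names and the numbers in the example, are illustrative only.

```python
def collapse_bleu(scores):
    """Merge BLEU-1..BLEU-n into one 'BLEU' entry by averaging (an assumption)."""
    bleu_values = [v for k, v in scores.items() if k.startswith("BLEU")]
    merged = {k: v for k, v in scores.items() if not k.startswith("BLEU")}
    if bleu_values:
        merged["BLEU"] = sum(bleu_values) / len(bleu_values)
    return merged

def performance_gap(target_scores, baseline_scores):
    """Mean relative gap (in percent) over the metrics both models report.

    Assumed form: p(t, b) = 100 / num(i) * sum_i (m_i^t - m_i^b) / m_i^b.
    A positive value means the target model scored higher than the baseline.
    """
    t = collapse_bleu(target_scores)
    b = collapse_bleu(baseline_scores)
    shared = [k for k in t if k in b and b[k] != 0]
    if not shared:
        raise ValueError("no shared, non-zero metrics to compare")
    return 100.0 * sum((t[k] - b[k]) / b[k] for k in shared) / len(shared)

# Made-up metric values, purely for illustration (not results from any study).
target = {"BLEU-1": 0.44, "BLEU-4": 0.16, "ROUGE": 0.33}
baseline = {"BLEU-1": 0.40, "BLEU-4": 0.10, "ROUGE": 0.30}
gap = performance_gap(target, baseline)  # BLEU: +20 %, ROUGE: +10 %, mean: about +15 %
```

Expressing the gap relative to the baseline keeps metrics with very different scales, such as BLEU-4 and CIDEr, from dominating the merged score, although it also makes the result sensitive to the choice of baseline.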

Qualitative evaluation
We reviewed eight relevant studies that presented human expert evaluation. Although some studies claimed that their models can generate more accurate and reasonable reports than baseline models, the gap between these generated reports and expert-written reports was not clearly evaluated [19][20][21]. In contrast, several studies highlighted the need for further advancement of ARRG models to obtain more reliable outputs [62,80,83,87]. In particular, CDGPT2 [62] pointed out that their model was capable of generating correct reports for 99% of normal samples. However, the generated reports for abnormal samples suffered from missing information and incorrect diagnoses and often lacked the details necessary to describe the abnormalities present in the images. In CNN-SRNN-Att [88], the evaluation noted that clinicians preferred the combination of visual interpretation and "human style" textual explanations for pelvic X-rays. The reliability assessment of MultusRadBot [80,87] disclosed a small "opinion discrepancy" between the generated reports and expert-written reports in lumbar spine MRI, and such discrepancies in reporting can significantly impact subsequent clinical decisions. In GAN-ARAE [50], the model's usefulness was confirmed in helping radiologists achieve more accurate diagnoses and, perhaps, quicker decision-making for edge samples.

Current state of affairs and challenges
Radiology images, especially ultrasound images, have relatively low resolution and blurred boundaries between the foreground and background. Radiology reports, on the other hand, tend to be lengthy, complex, and heterogeneous, covering descriptions of findings, impressions, and other patient-related information. They also contain expressions that convey negation and uncertainty. Furthermore, the structure and style of radiology reports may vary significantly between institutions or individual radiologists, raising concerns about interobserver variability in the training data. During the clinical reporting process, the radiologist's wording could have been influenced by affective (unconscious emotional reaction) and cognitive (distortions of thinking) biases [87]. When the data originate from a small number of institutions, they may not be representative, which may lead to overfitting [116]. Moreover, the available open-source datasets are often limited in size and unbalanced in the distribution of normal and abnormal samples, making it even more difficult to train a robust model. Across all datasets identified in this review, we argue that only the MIMIC-CXR dataset meets three key conditions: large-scale, publicly available and containing original reports.

Keywords Accuracy [14]: The ratio of the number of diagnostic keywords in the generated reports to the number of all diagnostic keywords among the ground truth references. Used in [14] (1 study).
Clinical Efficacy [59]: Measures the accuracy, precision, and recall of disease labels extracted by CheXpert from the ground truth references and the generated reports. Used in [13,59,63,69] (4 studies).
MeSH Accuracy [52]: The ratio of the number of MeSH terms correctly generated by a model to the number of all MeSH terms in the ground truth references. Used in [52] (1 study).
Anatomical Relevance Score [83]: Matches the words in GRs against the terminology of the anatomical class of interest. Used in [83] (1 study).
Medical Abnormality Terminology Detection Accuracy [20]: Compares the average precision and average false positives of the 10 most frequent medical abnormality terminologies in the ground truth references and the generated reports.
Qualitative evaluation
Human expert evaluation: E.g. average score, average preference percentage, etc. Used in [20,21,62,88] (4 studies).
Likert scale [110]: A rating system used to measure the opinion of medical experts regarding the quality of generated reports.
Regarding the assessment of the generated reports, many ARRG models were benchmarked using ordinary AIC metrics, such as BLEU, ROUGE, METEOR and CIDEr, which are based on n-gram overlap and focus more on language fluency. However, n-gram overlap is neither necessary nor sufficient for two sentences to convey the same meaning [117]. It is widely believed that accurate detection of pathology should take precedence over language fluency when evaluating the generated reports. Hence, various clinical efficacy metrics based on the accuracy, precision, and recall of disease labels were designed for the ARRG task [14,20,52,59,83]. Nevertheless, these metrics might not be sufficient for evaluating a report, since the pathological description not only concerns specific disease labels but also involves their qualifiers, certainties and negations. Although MIQRI [7] has taken all these attributes into consideration, its effectiveness is yet to be fully proven.
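The contrast between n-gram overlap and label-based clinical metrics can be made concrete with a small sketch. The unigram precision below is a deliberately simplified stand-in for BLEU-style overlap (no brevity penalty, no higher-order n-grams), and the disease labels are assigned by hand rather than extracted by a labeller such as CheXpert; all names and example sentences are hypothetical.

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Fraction of candidate tokens that also occur in the reference (clipped counts)."""
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    matched = sum(min(n, ref_counts[tok]) for tok, n in Counter(cand_tokens).items())
    return matched / len(cand_tokens)

def label_precision_recall(predicted_labels, true_labels):
    """Precision and recall over sets of positive disease labels."""
    predicted, true = set(predicted_labels), set(true_labels)
    true_positives = len(predicted & true)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(true) if true else 0.0
    return precision, recall

# A dropped negation flips the clinical meaning while every candidate word still matches.
reference = "there is no pleural effusion"
candidate = "there is pleural effusion"
overlap = unigram_precision(candidate, reference)

# Hand-assigned positive labels for a second, hypothetical pair of reports.
precision, recall = label_precision_recall(
    predicted_labels={"cardiomegaly", "pleural effusion"},
    true_labels={"cardiomegaly"},
)
```

Here the overlap score is perfect although the candidate asserts the opposite finding, whereas the label-based scores penalize the spurious positive label. Neither score, however, captures qualifiers or degrees of certainty.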
Due to the discrepancy between general image/caption datasets and radiology image/report datasets, using conventional AIC models on the ARRG task might only produce reports that look real but are not clinically correct. Therefore, it is necessary to tailor DL approaches specifically for the ARRG task. To summarize, retrieval-based methods leverage the similarity of data features, which might generate fewer repetitive sentences and mitigate the bias toward generating normal findings [18][19][20][21]. RL can train an ARRG model to generate reports that optimize specific metrics of interest, such as pathological accuracy [59]. Some studies enhanced the generation process by integrating auxiliary classifiers [11,12,21,34,54,56,57,60,64,65] or taking into account the report hierarchy, while other studies replaced the conventional generation process by generating more informative structured report entities [15][16][17]33,53,68,73,75,79,80,87,88]. In order to provide radiologists with more interpretable information, researchers offered many ideas, including applying GANs to generate similar images [50], employing class activation maps and saliency maps (e.g. attention maps) to provide visual interpretation [12,13,18,21,34,50,59,62,65,68,88], or using the pathology labels predicted by the auxiliary classifier as a supplement to the generated reports. It has also been demonstrated that introducing patient background information could have a positive impact [52,68]. There were also studies that proposed separate processing of normal and abnormal samples to address the data imbalance problem [57,64,65].
It can also be observed that most of the ARRG systems were designed for X-ray tests, and the difficulty of data access might be the leading reason why researchers preferred to report results on the IU X-ray dataset rather than the larger MIMIC-CXR dataset. For CT scans, the proposed ARRG systems were largely based on image classification. Current MRI datasets were all focused on the lumbar spine. Due to the structural correlation of the lumbar spine and the particular report format, the proposed systems tend to treat the task as a segmentation and classification problem and rely on non-DL approaches to compile the predicted labels into reports. Considering that US and X-ray datasets are intrinsically similar, even though US reports tend to be shorter, the corresponding ARRG systems could be easily repurposed.

Future research directions
ARRG systems certainly have the potential to streamline clinical workflow. However, the current state of the art in ARRG has yet to generate high-quality reports. We propose exploring the following aspects for further improvement.
First, language models pre-trained on large datasets can help alleviate the data scarcity issue by allowing ARRG models to use smaller training datasets for fine-tuning. In addition, the capability of large language models to capture semantically equivalent contexts can help develop better evaluation metrics. Despite their great success in a wide range of NLP tasks [118,119], their full potential in the ARRG domain is yet to be realized owing to the lag in transferring techniques across domains. Therefore, with the adoption of large language models, which are commonly built on transformer architectures, we can expect ARRG to follow the trend observed in NLP of shifting away from RNNs towards transformers.
Second, the automatically generated text sometimes deviates from the ground-truth text in terms of structure, coverage, and lexical content. This can be attributed to the interobserver variability within the training data. This variability is caused by the inherent diversity of natural language, which allows the same information to be expressed in numerous ways. These issues can be mitigated by adopting structured reports, which use uniform language and structure to describe radiology findings accurately [120]. Even though structured reporting is increasingly being used, especially in abdominal and neuroradiological CT and MRI reports [121], the cultural and technological shift required will inevitably delay its widespread adoption. Instead, we propose resorting to NLP approaches to automatically structure the legacy radiology reports and use them not only to reduce interobserver variability, but also to train DL approaches to automatically generate structured reports.
Third, the metrics for evaluating the generated radiology reports require further investigation, as the current metrics cannot comprehensively assess the quality of the generated report. Advances in measuring semantic similarity for general image description can be seen in SPICE [122], a concept-based AIC metric that uses a semantic graph to capture the meaning of two captions. This measurement approach has been widely adopted in recent AIC models. Although MIQRI made the first attempt in ARRG to develop a graph-based clinical semantic measurement, its effectiveness remains undetermined. We believe future research can draw inspiration from SPICE and MIQRI to develop an evaluation system that captures the correctness of pathological information and the relationships between pathological attributes, and to verify its effectiveness through thorough baseline experiments and manual evaluation. Finally, radiology images other than chest X-rays are rarely available publicly and at scale. Therefore, appropriate domain-specific evaluation metrics together with large-scale publicly available datasets are required to both deepen and broaden the existing research into ARRG.

Fig. 1. A diagram showing the hierarchical relationships between ARRG methods, frameworks, and architectures (left) and a timeline of the milestones of the technologies involved in ARRG (right).

Fig. 3. PRISMA flow diagram of search strategy and study selection.

Fig. 5. The common sections (the inner circle) and a few corresponding headings (the outer circle) in semi-structured reports. Extracted from the MIMIC-CXR dataset [66].

Fig. 8. The general structures of ARRG models that utilize attention mechanisms. We use the encoder-decoder framework as examples: (a) without attention mechanism; (b) with cross-model attention; (c) with intra-model attention; (d) with both cross- and intra-model attention.

Fig. 9. Fine-grained classification of ARRG systems. The systems are further distinguished by colour and symbols according to different features. (For interpretation of the references to colour in this figure legend, the reader is referred to the Web version of this article.)

Fig. 10. Model performance matrix. Each row is a set of comparisons of model performance gaps computed based on the baseline experiments of the target model. Results are limited to the IU X-ray dataset (top) and the MIMIC-CXR dataset (bottom). Some common AIC baseline models are used in addition to the ARRG models, including #NIC [86], #SAT [102], #Att2in [113], #LRCN [114] and #AdaAtt [115]. The target models are sorted in descending order of performance compared to the CNN-HRNN-CoAtt [12] and #NIC models.

Table 1
Inclusion and exclusion criteria.
2 Studies that apply data fusion of radiology images and radiology reports as part of training.
No. Exclusion Criteria
1 The text of the radiology report (either input or output) is written in a language other than English.
2 Studies that are not original research, e.g. a review.
3 Studies that have not undergone scrutiny procedures, e.g. peer review.

Table 2
Search queries for PubMed and Web of Science.

Table 3
A summary of publicly available radiology image/text datasets used in training the ARRG systems.

Table 4
The quantitative and qualitative metrics used for evaluating ARRG.