Understanding transfer learning for chest radiograph clinical report generation with modified transformer architectures

The image captioning task is increasingly prevalent in artificial intelligence applications for medicine. One important application is clinical report generation from chest radiographs. The clinical writing of unstructured reports is time consuming and error-prone. An automated system would improve standardization, reduce errors and reporting time, and broaden medical accessibility. In this paper we demonstrate the importance of domain specific pre-training and propose a modified transformer architecture for the medical image captioning task. To accomplish this, we train a series of modified transformers to generate clinical reports from chest radiograph image input. These modified transformers include: a meshed-memory augmented transformer architecture with a visual extractor using ImageNet pre-trained weights, a meshed-memory augmented transformer architecture with a visual extractor using CheXpert pre-trained weights, and a meshed-memory augmented transformer whose encoder is passed the concatenated embeddings produced using both ImageNet pre-trained weights and CheXpert pre-trained weights. We use BLEU(1-4), ROUGE-L, CIDEr, and the clinical CheXbert F1 scores to validate our models and demonstrate scores competitive with state of the art models. We provide evidence that ImageNet pre-training is ill-suited for the medical image captioning task, especially for less frequent conditions (e.g., enlarged cardiomediastinum, lung lesion, pneumothorax). Furthermore, we demonstrate that the double feature model improves performance for specific medical conditions (edema, consolidation, pneumothorax, support devices) and overall CheXbert F1 score, and should be further developed in future work. Such a double feature model, incorporating both ImageNet pre-training and domain specific pre-training, could be used in a wide range of image captioning models in medicine.


Introduction
In the medical domain, a task that appears in almost all specialties is the generation of reports from medical imaging. Whether the imaging consists of simple 2D chest radiographs or 3D time series of functional brain activity mappings, even experienced clinicians generating many such reports daily are prone to error. In the medical setting, such errors could prove fatal. Advances in deep learning based image captioning allow for the potential automation of such clinical tasks.
The release of the MIMIC-CXR dataset [1] inspired multiple efforts in chest radiograph image captioning. The field existed prior to the 2019 release; however, it was limited to datasets with only a few thousand matched radiograph-report examples. With the release of MIMIC-CXR, containing 371,920 chest radiographs with 227,943 imaging reports, more advanced models making use of novel transformer architectures rapidly advanced performance in the growing field.
Deep convolutional neural networks [2] have revolutionized the field of computer vision, in part due to the discovery that supervised pre-training for an auxiliary task, followed by fine-tuning on the desired task, significantly improves performance [3,4]. This process, known as transfer learning, generally involves training on a large-scale dataset such as ImageNet [5] followed by a target task with less training data. Medical applications of transfer learning have enabled high classification performance across disease types, including stroke [6], skin cancer [7], and carotid artery disease [8,9]. Further attempts to improve performance with pre-training have used even more data, up to 3000x the size of ImageNet [10,11]. However, recent research has challenged this conventional wisdom of 'pre-training and fine-tuning' in computer vision: He et al. [12] reported that ImageNet pre-training does not improve performance on the COCO object detection and instance segmentation tasks compared to random initialization, given enough training iterations to properly converge. Mathis et al. [13] similarly showed that even for small datasets, while ImageNet pre-training is helpful for in-domain tasks, it does not necessarily improve out-of-domain generalization to unrelated tasks.
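As a concrete illustration of this pre-train-and-fine-tune recipe, the following is a minimal sketch (not this paper's pipeline) that loads an ImageNet pre-trained backbone from torchvision and replaces its classification head for a downstream task; the 14-class head is an illustrative placeholder.

```python
# Minimal transfer-learning sketch (illustrative, not this paper's pipeline):
# load an ImageNet pre-trained DenseNet-121 and adapt it to a new task.
import torch.nn as nn
from torchvision import models

model = models.densenet121(pretrained=True)  # ImageNet pre-trained weights

# Optionally freeze the backbone so only the new head is trained at first.
for param in model.parameters():
    param.requires_grad = False

# Replace the 1000-class ImageNet head with one sized for the target task
# (14 here is a placeholder for the downstream label count).
num_target_classes = 14
model.classifier = nn.Linear(model.classifier.in_features, num_target_classes)
```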
Most medical image captioning models use ImageNet pre-trained models as radiograph image feature extractors, sometimes without fine-tuning, or forgo pre-training entirely [14-18]. However, recent work has demonstrated that ImageNet pre-training may provide little information for medical applications. ImageNet pre-training was shown to transfer to medical classification tasks, but not to segmentation tasks, because medical images are largely homogeneous and carry limited morphological information [19]. Recent work focused on the domain gap between natural and medical images has even attempted to develop unsupervised pre-training strategies for radiography, substituting domain specific methods for ImageNet pre-training [20]. Thus, it becomes crucial to determine the suitability of ImageNet pre-training for the image captioning task of clinical report generation from chest radiographs.
A significant advantage of studying the appropriateness of ImageNet pre-training in the domain of chest radiographs is the existence of CheXpert, a large dataset of labeled chest radiographs [21]. A competition on this task directed by the Stanford Machine Learning Group resulted in the development of state of the art convolutional neural network (CNN) based models that predict the presence of 14 observations from chest radiographic image inputs, achieving an area under the receiver operating characteristic curve (AUC) of 0.930 [22]. Such highly domain specific models trained on CheXpert have only been used by a few groups working to generate radiology reports [23]. Here we systematically investigate model performance, particularly clinical success on each of the 14 radiological observations, according to whether the feature extractor is pre-trained on ImageNet, CheXpert, or both (using a novel double feature architecture).

Image to text radiology clinical report generation
Past approaches to clinical report generation include long short-term memory (LSTM) and other recurrent neural network (RNN) methods, but the most promising recent works in this domain use the transformer model [24]. Chen et al. [15] use a meshed-memory transformer, which has shown promise in image captioning tasks by using a multi-level representation of the relationships between image regions. Xiong et al. [25] further apply reinforcement learning to the output text generation to improve generated report quality, treating next-word selection as a task performed by a reinforcement learning (RL) agent. Their contribution is notable because word selection is a discrete task and thus non-differentiable without additional tricks; using an RL agent bypasses this restriction. Liu et al. [26] extend this work by adding a further RL reward for clinical coherence to improve the medical relevance of the generated reports. Meanwhile, Syeda-Mahmood et al. [27] simply use existing reports as templates to generate new ones. Significantly, Miura et al. [23] use the meshed-memory transformer as their base architecture and supplement it with two rewards in a reinforcement learning system, demonstrating that optimizing traditional natural language generation (NLG) metrics does not maximize clinical F1 success. Following state of the art work from Chen et al. and Miura et al., we use the meshed-memory transformer for our architecture in this analysis.

ImageNet pre-training versus domain specific visual extractors
While pre-training on ImageNet has been found to provide a significant performance boost for chest radiograph interpretation, the boost was found to be small for larger model architectures, such as those used for the medical image captioning task [28]. In the medical domain, Raghu et al. [29] find that ImageNet pre-training does not significantly benefit performance on medical imaging tasks, especially when compared to simple, lightweight models. Surprisingly, Kornblith et al. [30] show that the performance of pre-trained models on ImageNet (implicitly used as a predictor of how well a model will transfer) can actually correlate negatively with performance on other vision tasks. These results suggest that not only is out-of-domain pre-training on ImageNet often unhelpful, but ImageNet performance may also give a misleading intuition about how well transfer learning will work.

Double feature transformer
We perform clinical note generation via a modified transformer architecture. The choice of the transformer architecture was largely influenced by the fact that state of the art models in the field, including those of Chen et al. and Miura et al. [15,23], are transformer based. Furthermore, transformers avoid recurrence and allow for larger and more information-rich inputs, relying instead on multi-head attention and positional embeddings, which is critical for the medical image to report generation task.
Rather than using a single CNN backbone to extract image features, we use two: an ImageNet pre-trained encoder, as is standard in the literature, and a chest radiograph specific CNN trained on CheXpert labels (ranked 5th on the CheXpert leaderboard, AUC = 0.929) [31]. Each CNN backbone yields a grid of feature vectors (e.g. 8x8x1024), which is then projected to the embedding dimension of the transformer. For the double feature model, we use two separate CNN backbones to encode the image features; their outputs are concatenated, and a linear layer is applied to reduce the dimensionality before feeding them to the encoder for further processing. As is standard, the transformer decoder outputs a softmax probability distribution over the vocabulary to predict the next word, and negative log-likelihood (NLL) loss is used. This model architecture is described in Fig. 1. To further improve performance, we also use a meshed-memory transformer as used by Chen et al. [15]. Since the meshed-memory transformer architecture is similar to a regular transformer, no additional modifications beyond those described above are needed to adopt a double-feature architecture.
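To make the fusion step concrete, the following is a minimal PyTorch sketch of the double-feature front end described above. The class and variable names are ours, and the backbones are assumed to return spatial feature maps of shape (batch, channels, height, width); this is a sketch of the described design, not the released implementation.

```python
import torch
import torch.nn as nn

class DoubleFeatureExtractor(nn.Module):
    """Sketch of the double-feature front end: two CNN backbones
    (ImageNet and CheXpert pre-trained), whose feature grids are
    concatenated and linearly projected to the transformer width."""

    def __init__(self, imagenet_cnn, chexpert_cnn, feat_dim=1024, d_model=512):
        super().__init__()
        self.imagenet_cnn = imagenet_cnn   # e.g. DenseNet-121, ImageNet weights
        self.chexpert_cnn = chexpert_cnn   # e.g. DenseNet-121, CheXpert weights
        # Linear layer reduces the concatenated features to the model dimension.
        self.proj = nn.Linear(2 * feat_dim, d_model)

    def forward(self, images):
        # Each backbone yields a feature map, e.g. (B, 1024, 8, 8).
        f1 = self.imagenet_cnn(images)
        f2 = self.chexpert_cnn(images)
        # Reshape each map into a sequence of region vectors: (B, 64, 1024).
        f1 = f1.flatten(2).transpose(1, 2)
        f2 = f2.flatten(2).transpose(1, 2)
        # Concatenate along the feature axis and project down: (B, 64, d_model).
        return self.proj(torch.cat([f1, f2], dim=-1))
```

The single-feature variants follow the same pattern with one backbone and a projection from feat_dim rather than 2 * feat_dim.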
For feature extraction, two CNNs (ImageNet and CheXpert pre-trained, respectively) are used rather than directly feeding projected image patches to the transformer. Extracting image features with a CNN before encoding with the transformer is consistent with state of the art models in this field, including Miura et al. [23]. We use these pre-trained CNNs to extract feature maps for input to the transformer specifically to allow an experimental evaluation of the pre-training strategy for image encoding. Using the pre-trained CNNs independently for image feature extraction lets us determine with certainty which features (ImageNet, CheXpert, or both) the transformer model uses and how they impact its report generation. Approaches that feed projected image patches directly to the transformer, such as that of Chen et al. [15] and recent state of the art vision transformers, do not permit this isolation of the pre-training strategy's effect, which is the purpose of this study.
To construct a generated report, the image is first passed through the pre-trained CNN based feature extractor. The meshed decoder then attends to the encoded image with masked self attention and predicts the first word of the generation, following the method described in Miura et al. [23]. At each iteration, until an end of sentence token is predicted, the decoder considers the encoded image together with self attention over the growing generation to predict the next word, again following Miura et al. [23].
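This generation loop can be sketched as standard greedy autoregressive decoding. In the sketch below, `model.encode` and `model.decode_step` are hypothetical method names standing in for the meshed-memory transformer's interface; beam search or sampling could replace the argmax step.

```python
import torch

@torch.no_grad()
def generate_report(model, image, bos_id, eos_id, max_len=100):
    """Greedy autoregressive decoding sketch: the encoded image is fixed,
    and each step feeds the growing token sequence back into the decoder.
    `model.encode` / `model.decode_step` are hypothetical method names."""
    memory = model.encode(image)                  # CNN features -> encoder output
    tokens = [bos_id]
    for _ in range(max_len):
        seq = torch.tensor([tokens])              # (1, current_length)
        logits = model.decode_step(seq, memory)   # (1, length, vocab_size)
        next_id = logits[0, -1].argmax().item()   # most probable next word
        tokens.append(next_id)
        if next_id == eos_id:                     # stop at end-of-sentence token
            break
    return tokens
```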

Training details
We train three models: two single-encoder models using ImageNet and CheXpert pre-trained CNN backbones, respectively, and one double-feature model. Both CNN backbones use a DenseNet-121 architecture. The weights for the CheXpert pre-trained model were made available by [23]. The transformer models use a meshed-memory size of 40 and a hidden dimension of 512. Models are trained with a batch size of 24 for 32 epochs using the Adam optimizer. Training occurs on an AWS instance with an NVIDIA Tesla T4 graphics card. In the following sections we refer to the three trained models by their backbone pre-training datasets: ImageNet, CheXpert, or double-feature.
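For concreteness, the stated hyperparameters translate into a setup along the following lines; the learning rate and padding index are assumptions (they are not reported above), not values from our experiments.

```python
import torch
import torch.nn as nn

# Hyperparameters stated in the text; LEARNING_RATE is an ASSUMPTION
# (a common Adam setting), not a value reported in this paper.
BATCH_SIZE = 24
EPOCHS = 32
MEMORY_SIZE = 40       # meshed-memory slots
HIDDEN_DIM = 512       # transformer hidden dimension
LEARNING_RATE = 1e-4   # assumed, not reported

def build_optimizer(model: nn.Module) -> torch.optim.Optimizer:
    """Adam optimizer, as stated in the training details."""
    return torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

# Per-token negative log-likelihood over the vocabulary, as described;
# ignoring a padding index of 0 is an assumption about preprocessing.
criterion = nn.NLLLoss(ignore_index=0)
```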

Evaluation metrics
To evaluate the generated reports, we compare them to ground truth clinical reports using BLEU(1-4), ROUGE-L, and CIDEr scores. These metrics have been shown to be domain agnostic and to reward grammatically correct but clinically irrelevant text [32,33]. As the field of radiology begins to move towards structured reports, a metric for the correct labeling of common conditions and findings in the reports is required. To accomplish this we use the CheXbert labeler to calculate the CheXbert clinical F1 metric.
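As an illustration of how the n-gram overlap metrics are computed in practice, the following sketch uses NLTK's corpus-level BLEU implementation; the whitespace tokenization and example sentences are assumptions, not our exact preprocessing.

```python
# Sketch of BLEU-1..4 computation with NLTK (whitespace tokenization is
# an assumption; the exact tokenizer used in our pipeline is not shown).
from nltk.translate.bleu_score import corpus_bleu

references = [["no acute cardiopulmonary abnormality".split()]]  # ground truth
hypotheses = ["no acute cardiopulmonary process".split()]        # generated

for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))  # uniform weights -> BLEU-n
    score = corpus_bleu(references, hypotheses, weights=weights)
    print(f"BLEU-{n}: {score:.3f}")
```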

CheXbert clinical metric
We use the CheXbert labeler model [34] to extract diagnosis labels from both the ground-truth and generated report, and calculate the F1 score between the two labels. The CheXbert model gives labels Positive, Negative, Blank, or Uncertain for a series of 14 medical conditions; we treat only the "Positive" label as a positive prediction for the purposes of calculating the F1 score. We denote this score the CheXbert F1 score.
Given ground-truth and predicted reports, we extract labels for the 14 medical conditions using CheXbert, then calculate the CheXbert F1 score, formulated as F1 = 2 × (precision × recall) / (precision + recall), where precision and recall are computed over the extracted positive labels. Additionally, a CheXbert F1 score is calculated for each of the 14 conditions separately to better inform pre-training choices based on the desired medical application in future work.
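A minimal sketch of this scoring step, assuming CheXbert labels have already been extracted as strings for each report and condition; micro-averaging the overall score across conditions is our assumption for illustration.

```python
# Sketch: binarize CheXbert outputs (only "Positive" counts as positive)
# and compute F1 per condition plus an overall score. Micro-averaging the
# overall score is an assumption made for this illustration.
import numpy as np
from sklearn.metrics import f1_score

CONDITIONS = 14  # number of CheXbert medical conditions

def to_binary(chexbert_labels):
    """Map Positive/Negative/Blank/Uncertain to 1/0: only Positive is 1."""
    return (np.asarray(chexbert_labels) == "Positive").astype(int)

def chexbert_f1(gt_labels, gen_labels):
    gt = to_binary(gt_labels)    # shape: (num_reports, 14)
    gen = to_binary(gen_labels)  # shape: (num_reports, 14)
    per_condition = [f1_score(gt[:, c], gen[:, c], zero_division=0)
                     for c in range(CONDITIONS)]
    overall = f1_score(gt.ravel(), gen.ravel(), zero_division=0)
    return per_condition, overall
```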

Results
The model results, listed in Table 1, demonstrate that both the CheXpert and double-feature models outperform the ImageNet pre-trained model. The double-feature model achieves the highest BLEU(1-3) scores and the highest ROUGE-L score, with BLEU4 and CIDEr scores approximately equal to those of the CheXpert model. The double-feature model also achieves the highest CheXbert F1 score; however, it is important to note that the differential between the CheXpert model and the ImageNet model is far greater than the differential between the double-feature model and the CheXpert model, in clinical as well as standard generation metric performance. Furthermore, observe that the relative gap between the double-feature model and the ImageNet model grows from BLEU1 through BLEU3 and plateaus at BLEU4.
The metrics listed in Table 1 validate each of the three models' performance, as the resulting BLEU(1-4) scores approximately equal or surpass the performance of Chen et al., who use a similar meshed-memory transformer. For comparison, the metrics of the state of the art Miura et al. model exceed our models' performance; their CheXbert metric is 0.567 and their BLEU4 score is 11.4. The Miura et al. model has the same architecture as our single-feature models but additionally uses reinforcement learning to improve semantic quality and clinical relevance, so it is expected to exceed the metrics achieved in our experiments. Our models are not additionally trained with reinforcement learning, in order to better isolate the effect of the pre-training strategy for the visual feature extractor.
CheXbert derived F1 scores by model and pre-training type are included in Fig. 2. A common effect is seen across the 14 conditions: for less frequent conditions, there is often a more significant gap between the ImageNet model's performance and that of the CheXpert or double feature model. Note that the ImageNet model does achieve the greatest F1 score of any model on cardiomegaly, atelectasis, and pleural effusion. The double feature model achieves the best performance of all models on edema, consolidation, pneumothorax, support devices, and overall F1 score. Conditions with high F1 scores commonly also show high ImageNet model performance; however, in all these cases the CheXpert and double feature models achieve similarly high performance. Given the high variability in the models' performance across conditions, these results suggest that the choice of pre-training could be determined by the specific task.

Discussion
The medical image captioning task suffers from an unclear pre-training strategy. While most work employs pre-training on ImageNet, some work has provided evidence suggesting it is not suited to medical tasks beyond classification [19]. By using CNNs to extract visual features from medical imaging (chest radiographs) for an image captioning task (report generation), we studied the effect of the pre-training strategy for these visual extractors when applied to the medical image captioning task. The CNN output was fed into the transformer's encoder so that the transformer saw only the visual feature information derived from the CNN, allowing us to study pre-training on ImageNet versus CheXpert (a domain specific dataset) versus both together. The results suggest three conclusions: first, ImageNet pre-training provides significantly less overall knowledge for chest radiograph report generation; second, the double feature model is a promising architecture for future medical image captioning tasks; and third, the choice of what to pre-train on is likely more task dependent within the larger medical domain than previously thought.

The single feature model pre-trained on ImageNet yields lower BLEU(1-4), CIDEr, and ROUGE-L scores than the models using CheXpert pre-training. Additionally, the ImageNet pre-trained model yields a lower CheXbert clinical F1 score than the models using CheXpert pre-training. Since it scores lower across all metrics, both those such as BLEU(1-4) that have been shown to be domain agnostic and to reward grammatical outputs [32,33] and clinical metrics (the CheXbert F1 metric) that reward correct identification of the fourteen medical labels, we provide evidence suggesting that the ImageNet pre-trained model's lower report generation performance is due to its pre-training on ImageNet. This suggests that ImageNet pre-training does not generalize well to chest radiograph captioning. Fig. 2 demonstrates that when ImageNet pre-training results in a higher performing model for a specific condition by CheXbert F1 score, the condition almost always appears more frequently in radiographs, as demonstrated by cardiomegaly, atelectasis, and pleural effusion. However, for less frequent conditions such as enlarged cardiomediastinum, lung lesion, consolidation, pneumonia, pneumothorax, and fracture, the ImageNet pre-trained model scores significantly lower in CheXbert F1 than the two models which incorporate CheXpert pre-training. This is likely due to ImageNet pre-training being unable to identify motifs or segment useful areas for rare conditions. Furthermore, ImageNet has been shown in prior work not to generalize well to medical segmentation tasks [19]; thus, for models intended to be used on images lacking clear morphological boundaries and features, ImageNet pre-training is likely an especially poor choice. However, for conditions with greater morphological information, such as cardiomegaly, ImageNet pre-training may be a suitable choice.
The results of our work suggest that the double feature model may be a potential architectural improvement for report generation from medical images. However, the same benefit may be realized simply by pre-training on CheXpert images. While future studies should investigate the difference in benefit between these two approaches, our results clearly demonstrate that ImageNet pre-training leads to worsened model performance for all generation metrics and for all conditions except cardiomegaly, atelectasis, and pleural effusion. The double feature architecture provides both domain specific information and the morphological and classification benefits of ImageNet. The double feature model achieved the highest BLEU(1-3), ROUGE-L, and clinical CheXbert F1 scores in our experiments. The model's benefit is further realized in its performance on conditions, such as edema, that couple physical morphological features with chest domain specific features; the double feature model achieves the highest CheXbert F1 score on this condition. Applying such models to other medical image captioning tasks will rely on the successful development of image-to-label classification algorithms for the specific domain. In the chest radiograph space this has already been accomplished by CheXpert models. Recent weakly supervised techniques for harnessing pseudo-labeled data could catalyze the development of such models in other medical domains.
Limitations of our approach include that other pre-training methods using other datasets could be studied for potential performance effects on this task. Another limitation is that applying reinforcement learning could affect the choice of pre-training strategy; this should be studied in future work. Future experiments should consider multiple runs of each model training, reporting mean performance with standard deviations and p-values in order to determine the statistical significance of one model versus another. This would be especially useful for further analysis of the proposed double feature model versus the CheXpert-only model.
In Table 2 we include three radiology reports demonstrating the ImageNet pre-trained model and double feature model results on radiographs with negative findings, a pacemaker present with cardiomegaly, and lung volume loss with pleural effusion, respectively. They are included to contrast generations with and without CheXpert pre-training alongside ImageNet pre-training, to better illustrate the effect of including CheXpert information.

CRediT authorship contribution statement
Ethan Schonfeld; Edward Vendrow: Conceived and designed the experiments; Performed the experiments; Analyzed and interpreted the data; Contributed reagents, materials, analysis tools or data; Wrote the paper.

Code availability
The code used for this project is available at https://github.com/evendrow/ifcc.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.