Medical visual question answering via corresponding feature fusion combined with semantic attention

Abstract: Medical visual question answering (Med-VQA) aims to leverage a pre-trained artificial intelligence model to answer clinical questions raised by doctors or patients regarding radiology images. However, owing to the high professional requirements in the medical field and the difficulty of annotating medical data, Med-VQA lacks sufficient large-scale, well-annotated radiology images for training. To address this problem, researchers have mainly focused on improving the model's visual feature extractor. However, few studies have focused on textual feature extraction, and most of them have underestimated the interactions between corresponding visual and textual features. In this study, we propose a corresponding feature fusion (CFF) method to strengthen the interactions of specific features from corresponding radiology images and questions. In addition, we design a semantic attention (SA) module for textual feature extraction, which helps the model focus on the meaningful words in various questions while reducing the attention spent on insignificant information. Extensive experiments demonstrate that the proposed method achieves competitive results on two benchmark datasets and outperforms existing state-of-the-art methods in answer prediction accuracy. The experimental results also indicate that our model is capable of semantic understanding during answer prediction, which is advantageous in Med-VQA.


Introduction
In the medical field, medical imaging is a mandatory course for every doctor. Different types of imaging techniques, such as computed tomography (CT), magnetic resonance imaging (MRI), and X-ray, play an irreplaceable role in the clinical diagnosis of patients [1−5]. Neural network technology has been gradually introduced into health informatics with the continuous advancement of medical empowerment [6]. Additionally, the effectiveness of this technology has been proven in radiology image analysis [7], and deep learning models have been utilized in the detection and analysis of various diseases, such as lung diseases [8] and chest cancer [9]. However, obtaining visual information exclusively from radiology images has the disadvantages of limited interactive channels and fixed interactive scenes.
In recent years, visual question answering (VQA) [10] has gained ever-increasing attention as a challenging multimodal task. VQA combines the two disciplines of computer vision and natural language processing. A VQA task takes an image and a related question as inputs and outputs the correct answer to the question through a series of processes. Most VQA methods [11,12] are based on the framework of supervised learning, which requires large-scale, well-annotated multimodal data to train the model. For VQA tasks, Malinowski et al. proposed the DAQUAR dataset [13] in 2014, and Ren et al. constructed the COCO-QA dataset [14] in 2015 based on the MSCOCO image database. Nevertheless, these datasets were small in scale, and the question-answer pairs were machine generated, which led to a high repetition rate; in addition, the cluttered image contents made the questions difficult to answer. Subsequently, the Visual Genome dataset [15] proposed by Krishna et al. and the Visual7W dataset [16] proposed by Zhu et al. were introduced. These datasets contained a large amount of data, and the images and question-answer pairs were manually annotated and screened by volunteers. However, owing to the uneven distribution of answers and biases in the questions, the generalization performance of the models trained on these datasets was mediocre. Goyal et al. proposed the VQA 2.0 dataset [17] in 2017 based on MSCOCO image data. VQA 2.0 contains 240,721 pictures and 1,105,904 question-answer pairs. The scale of VQA 2.0 is sufficiently large, and it overcomes the unbalanced answer distribution. Therefore, VQA 2.0 has been widely used in current studies on general-domain VQA tasks.
Medical VQA (Med-VQA) aims to improve the quality and efficiency of modern medical diagnosis and alleviate the pressure on the currently strained medical resources. An example of Med-VQA is shown in Figure 1. Different types of radiology images are accompanied by annotations (such as Body Region and Modality) and corresponding clinical question-answer pairs. Each of these radiology images may correspond to several different questions and answers; however, we only list one of them in Figure 1. The Med-VQA task is to predict the true answer from the provided radiology images and questions. Med-VQA technology can help patients find possible abnormalities in their bodies and, in combination with radiology images, help them easily understand the disease they are suffering from. Additionally, it can assist outpatient doctors with clinical diagnosis and simultaneously indicate abnormalities that may be overlooked in radiology images.
Unlike in the general domain, VQA in the medical domain is confronted with a lack of large-scale annotated datasets for model training. On the one hand, there are only a few ways to obtain well-labeled radiology images; annotating a radiology image is difficult and requires the cooperation of experienced doctors. On the other hand, the medical domain requires highly accurate and professional datasets, and different doctors have different ways of generating questions and using words, all of which make it challenging to produce Med-VQA datasets. To the best of our knowledge, ImageCLEF [18] first began to host challenges in Med-VQA in 2018. VQA-RAD [19] is the earliest benchmark dataset proposed for Med-VQA, and it has been representative and well recognized over the years. It was sampled from MedPix (https://medpix.nlm.nih.gov/), which is a publicly available database of medical radiographic imaging and medical teaching cases. The question-answer pairs in VQA-RAD were generated through natural communication among professional clinical practitioners, and these questions are closer to those exchanged between doctors and patients in real life than those generated from a template. SLAKE [20] is a more recently proposed bilingual dataset for Med-VQA.
For Med-VQA, researchers [21,22] first leveraged transfer learning methods to pre-train the model with a large amount of annotated data from the general VQA domain. Then, they migrated the model to the medical domain for further fine-tuning. However, owing to the significant differences between these two domains, the performance of the model migrated from the general domain was not impressive. Subsequently, many studies [23−25] turned to unlabeled radiology images. They pre-trained the visual feature extractor through unsupervised or self-supervised learning methods and then transferred it to Med-VQA for fine-tuning. This achieved better performance in answer prediction. From another point of view, previous researchers paid much attention to improving visual feature extraction through various approaches, while neglecting that textual feature extraction is equally indispensable in Med-VQA. Furthermore, few works have emphasized the importance of the interaction between visual features and corresponding textual features, as well as the specific semantic information contained in different questions.
Based on the aforementioned factors, we briefly summarize our contributions as follows:
• Considering the interaction between visual features and corresponding semantic features, we propose a novel corresponding feature fusion (CFF) method to integrate multimodal features and build a semantic attention (SA) module to enable our model to focus on important information contained in different clinical questions.
• Extensive experimental results illustrate the effectiveness of our proposed method on two benchmark datasets. Compared with previous state-of-the-art methods, our model achieves competitive performance in Med-VQA.

Medical visual question answering
The structure of a VQA model in the medical domain is similar to that in the general domain. Generally, in a VQA framework, as shown in Figure 2, the following are required: 1) a visual feature extraction module to obtain the image feature representation, 2) a textual feature extraction module to obtain the question feature representation, and 3) a feature fusion module to fuse the multimodal inputs and feed them into a final classifier for answer prediction. Most of the current methods [26−32] use a CNN-based neural network such as ResNet or VGGNet for visual feature extraction. In [33−35], researchers used recurrent neural network (RNN)-based models such as long short-term memory (LSTM) [36] and the gated recurrent unit (GRU) [37], or transformer-based models such as BERT [38] and BioBERT [39], to extract the textual features. Simultaneously, classical models such as stacked attention networks (SAN) [40], bilinear attention networks (BAN) [41], and multimodal compact bilinear pooling (MCB) [42] are commonly used for multimodal feature fusion to learn joint visual and textual feature representations.
In the past few years, methods such as meta-learning and transfer learning have been introduced in modern few-shot tasks. Nguyen et al. designed the mixture of enhanced visual features (MEVF) [43] method, which uses a large number of unannotated radiology images together with model-agnostic meta-learning (MAML) [44] and a convolutional denoising autoencoder (CDAE) [45] to initialize the model weights for visual feature extraction. Zhan et al. [46] added a conditional reasoning (CR) module on the basis of MEVF; questions were divided into two categories, "Open" and "Closed", according to the manner in which they were asked, to analyze them further. Khare et al. [47] proposed to pre-train a multimodal medical BERT on the ROCO dataset, with masked language modeling introduced as a pretext task to learn richer feature representations. Do et al. [48] improved MAML [44] in meta-learning and proposed the multiple meta-model quantifying method, which does not use external datasets for training; it increases the meta-data by auto-annotation and utilizes the features output from meta-models for Med-VQA.

Multimodal learning
Robust feature representation is a condition that a model must fulfill to correctly predict the answer in Med-VQA. Feature extraction during the multimodal learning process is particularly critical. In [49], the authors showed through extensive experiments that pre-training can greatly improve model performance on a domain-specific task. Recently, Allaouzi et al. [23] proposed to use an external chest dataset [50] to pre-train a DenseNet-based neural network for visual feature extraction. Liu et al. [24] noticed that the brain, chest, and abdomen are the regions mainly involved in the current radiological benchmark datasets; they pre-trained three visual feature extraction models targeting these three body regions through a contrastive learning method to obtain better feature representations. Gong et al. [25] used a multitask method to pre-train CNN-based neural networks on three external unlabeled radiology image datasets corresponding to brain MRI [51], chest X-ray (https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia), and abdomen CT (https://www.synapse.org/#!Synapse:syn3193805/wiki/217753) to extract visual features. The above methods made certain progress in the visual channel of the multimodal input. However, these methods focused extensively on learning better visual feature representations and overlooked the importance of textual features as well as the interactions between the textual and visual channels in the multimodal learning process. Building on the previous studies, we made further progress on the connection between specific visual features and corresponding semantic features while guiding the model to learn the pivotal information contained in various questions in a more targeted manner.

Materials and methods
In view of the latest studies on Med-VQA [24,25] and the benchmark datasets [19,20], the radiology images mainly focus on three categories of human body regions: abdomen, brain, and chest. Motivated by this observation, as shown in Figure 3, we utilize a type classifier to classify each pair of multimodal inputs (radiology images and clinical questions) into the given categories. A semantic attention (SA) module is built to help the model focus on the semantic features of questions during the feature extraction stage. Thereafter, fusion is performed on the visual and textual features from the same category for the final answer prediction. Figure 3 presents an overview of our proposed method, which is introduced in further detail in this section.

Radiology image and question classification
During the production stage of the Med-VQA dataset, doctors prepared questions based on the visual information presented by radiology images. Regarding a chest X-ray, doctors were more likely to ask a question such as "what abnormalities are observed within the lungs?" rather than "where are the brain lesions located?". Considering this, for a chest X-ray, we hope to fuse its visual features with the corresponding textual features and then send the result to a classifier for answer prediction. It would be confusing for the model if the visual features of a chest radiology image were combined with the textual features from a question that asks about brain diseases.
Based on the above considerations, we propose the CFF method. The first step of this method is to classify the input images and questions into specific categories. We first preprocess the radiology images: we set the number of channels and the image size to $3 \times 224 \times 224$ to be consistent with the scale of the input images used in the pre-training stage of the visual feature extractors, which we elaborate on in Section 3.2. Within one batch, we denote the batch size as $b$ and the images as $v \in \mathbb{R}^{b \times 3 \times 224 \times 224}$. For the questions, we set the maximum length of each question to 12; questions with a length less than 12 are zero-padded to ensure that the tensor dimensions are the same in the subsequent operations, and we use $q \in \mathbb{R}^{b \times 12}$ to denote the input questions. Considering the scale of the datasets and to prevent overfitting, we designed a lightweight CNN-based type classifier module, as shown in Figure 4, to classify the input radiology images together with their related questions. First, the module extracts visual features of the input images through two convolution and average pooling layers; subsequently, it sends the extracted visual features into a three-layer multilayer perceptron (MLP) for a nonlinear transformation. Finally, after the Softmax and Sigmoid layers, the classifier outputs the three-category prediction scores of the input images, $s \in \mathbb{R}^{b \times 3}$, where $s$ represents the scores of each category. The highest score determines the category of an input image and its related question. In Eq (1), the input images and questions in each batch are divided into three categories; the classified radiology images $v_t \in \mathbb{R}^{b_t \times 3 \times 224 \times 224}$ and the questions $q_t \in \mathbb{R}^{b_t \times 12}$, where $t \in \{\mathrm{abd}, \mathrm{brain}, \mathrm{chest}\}$ and $b_t$ is the number of inputs assigned to category $t$, are used in the subsequent steps.
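The padding and routing steps described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the category order, the helper names `pad_question` and `route_by_category`, and the truncation of questions longer than 12 tokens are assumptions made for illustration, and strings stand in for image tensors.

```python
from collections import defaultdict

MAX_LEN = 12  # maximum question length stated in the text
CATEGORIES = ("abdomen", "brain", "chest")  # assumed ordering of the score columns


def pad_question(token_ids, max_len=MAX_LEN, pad_id=0):
    """Truncate (assumption) or zero-pad a list of token ids to max_len entries."""
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))


def route_by_category(scores, images, questions):
    """Group (image, padded question) pairs by the highest-scoring category."""
    groups = defaultdict(list)
    for score_row, image, question in zip(scores, images, questions):
        best = max(range(len(CATEGORIES)), key=lambda k: score_row[k])
        groups[CATEGORIES[best]].append((image, pad_question(question)))
    return dict(groups)


# Hypothetical batch: image names stand in for 3 x 224 x 224 tensors.
scores = [[0.1, 0.2, 0.7], [0.8, 0.1, 0.1], [0.2, 0.3, 0.5]]
images = ["img0", "img1", "img2"]
questions = [[5, 9, 2], [7, 1], list(range(1, 15))]
batches = route_by_category(scores, images, questions)
```

Each per-category group can then be passed to the feature extractors for that category, so that image and question features from the same body region are fused together.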

Visual feature extraction
Following the previous work [25], we send the classified images $v_t$ from the different categories into three ResNet-34 models, which are pre-trained separately on external radiology image databases of brain MRIs, chest X-rays, and abdominal CTs; the pre-trained visual feature extractors are utilized to extract the specific visual features contained in the input images from the different categories:
$$f_v^t = \mathrm{ResNet34}_t(v_t) \qquad (2)$$
where $t \in \{\mathrm{abd}, \mathrm{brain}, \mathrm{chest}\}$, and $f_v^t$ represents the extracted feature representations of the abdomen, brain, and chest images.

Textual feature representation
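We chose to use 200-D BioWordVec [52], which is pre-trained on PubMed and MeSH (two open-source databases in the medical domain), to obtain the word embedding of each word contained in the question:
$$w_t = \mathrm{BioWordVec}(q_t) \qquad (3)$$
where $q_t \in \mathbb{R}^{b_t \times 12}$ denotes the questions of category $t$, $b_t$ is the number of questions in that category, and $w_t \in \mathbb{R}^{b_t \times 12 \times 200}$. Right after word embedding, in Eq (4), the 1024-D LSTM network was leveraged to extract textual features from the input questions and obtain the preliminary textual feature representations $f_q^t$ of questions in different categories:
$$f_q^t = \mathrm{LSTM}(w_t) \qquad (4)$$
where $t \in \{\mathrm{abd}, \mathrm{brain}, \mathrm{chest}\}$.

Figure 5. Framework of our proposed SA module.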

SA module
For each question, we hope that the model is able to distinguish the specific pathological nouns and the questioning methods contained in different question categories during the learning process, as humans are able to do. For example, in a clinical question such as "What are the abnormal cranial nerves?", the word "abnormal" implies that the question may be enquiring about a certain disease. Combining this with the phrase "cranial nerves" indicates that it is a brain-type question that may relate to some brain pathologies; this instructs the model not to focus on answers regarding lung or abdominal pathologies. Furthermore, the word "What" suggests that the answer to the question is possibly an open-type answer rather than a limited answer. Considering the aforementioned factors, distinguishing these specific semantic features will not only help the model learn better feature representations but also strengthen its understanding of questions.
To make our model focus more on this type of specific information, we designed an SA module to further process the textual feature representations of different questions. This step was inspired by [53]. The structure of SA is shown in Figure 5.
To begin with, we utilized an average pooling layer, in Eq (5), to initially obtain the global feature of questions from different categories. Then, as shown in Eq (6), the global feature was sent into a three-layer MLP for nonlinear transformation; meanwhile, the ReLU activation function was used to connect the layers. Subsequently, another average pooling layer was applied in Eq (7); in the SA equations, "⊙" indicates the dot product.
After the SA processing, we obtained the final semantic feature representations $f_s^t$ of the different questions, which correspond to the visual features extracted before, where $t \in \{\mathrm{abd}, \mathrm{brain}, \mathrm{chest}\}$.
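To make the idea of re-weighting words more concrete, the following is a minimal sketch of one possible gating design in plain Python. It is an assumption-laden illustration rather than the SA module itself: it assumes a squeeze-and-excitation-style gate (pool each word's features, pass the pooled vector through three dense layers connected by ReLU, squash with a sigmoid, and scale the word features), it omits the second pooling step of Eq (7), and the function names and toy weights are hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def linear(x, weight, bias):
    """Dense layer: weight is a list of rows, one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def sigmoid(x):
    return [1.0 / (1.0 + math.exp(-v)) for v in x]


def semantic_attention(features, layers):
    """features: one feature vector per word; layers: three (weight, bias) pairs."""
    # Squeeze: average-pool each word's feature vector into one scalar.
    hidden = [mean(word) for word in features]
    # Excite: dense layers connected by ReLU (no ReLU after the last layer).
    for index, (weight, bias) in enumerate(layers):
        hidden = linear(hidden, weight, bias)
        if index < len(layers) - 1:
            hidden = relu(hidden)
    gates = sigmoid(hidden)  # one weight in (0, 1) per word
    # Re-weight: scale every word's features by its gate.
    reweighted = [[g * v for v in word] for g, word in zip(gates, features)]
    return reweighted, gates


# Toy example: three words with 2-D features and identity layers (hypothetical).
features = [[2.0, 4.0], [0.0, 0.0], [-1.0, -3.0]]
identity = ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0, 0.0])
reweighted, gates = semantic_attention(features, [identity, identity, identity])
```

In a trained model the layer weights would be learned, so that informative words receive larger gates than filler words or padding; the toy identity weights here only demonstrate the data flow.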

Feature fusion and loss calculation
We obtained the corresponding features in the above steps of the CFF method. Next, each pair of corresponding visual and semantic features from the different categories was sent into the fusion module. After fusing the visual features $f_v^t$ with their corresponding semantic features $f_s^t$, we sent the joint feature representations into the VQA classifier for answer prediction. Meanwhile, we utilized cross-entropy for the loss calculation of answer prediction:
$$L_t = -\sum_{i=1}^{b_t} \sum_{j=1}^{N} y_{ij} \log \hat{y}_{ij} \qquad (11)$$
where $t \in \{\mathrm{abd}, \mathrm{brain}, \mathrm{chest}\}$; $y$ represents the real answer targets of the different categories and $\hat{y}$ represents the predicted answer targets; $N$ indicates the total number of candidate answers the model needs to classify; and $b_t$ is the batch size of category $t$. For all multimodal inputs in one batch, the model traverses the candidate answers for each input and calculates the cross-entropy between the predicted answer targets and the real answer targets. The sum of the cross-entropy is the loss of answer prediction.
Notably, the total loss of answer prediction is the sum of the per-category losses $L_{\mathrm{abd}}$, $L_{\mathrm{brain}}$, and $L_{\mathrm{chest}}$:
$$L_{vqa} = L_{\mathrm{abd}} + L_{\mathrm{brain}} + L_{\mathrm{chest}} \qquad (12)$$
In addition, the type classifier module participates in gradient backpropagation; therefore, we need to calculate its classification loss and update the model parameters. This loss, given in Eq (13), compares the real category targets of the input images with the predicted category targets, which are calculated in Section 3.1.
In Eq (13), the loss is computed over the total batch size that has not yet been classified and over the number of categories to be classified.
Lastly, we combined the losses of answer prediction and type classification into the final loss for model training through a balancing approach (Eq (14)), where λ is leveraged to balance the two losses.
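The loss calculation described in this subsection can be sketched in plain Python as follows. Several points are assumptions, not statements of the paper's equations: the type-classification loss is taken to be a cross-entropy, the final loss is taken to be the answer loss plus λ times the type loss, and the default `lam=0.5` and all example values are hypothetical.

```python
import math


def cross_entropy_sum(targets, predictions, eps=1e-12):
    """Summed cross-entropy between one-hot targets and predicted probabilities."""
    total = 0.0
    for target_row, pred_row in zip(targets, predictions):
        for y, y_hat in zip(target_row, pred_row):
            total -= y * math.log(y_hat + eps)  # eps guards against log(0)
    return total


def total_loss(per_category, type_targets, type_scores, lam=0.5):
    """per_category maps a category name to a (targets, predictions) pair."""
    # Answer-prediction loss: sum of the per-category cross-entropies.
    answer_loss = sum(cross_entropy_sum(t, p) for t, p in per_category.values())
    # Type-classification loss (assumed here to be a cross-entropy as well).
    type_loss = cross_entropy_sum(type_targets, type_scores)
    # Assumed balancing form: answer loss plus lambda times the type loss.
    return answer_loss + lam * type_loss


# Hypothetical batch with two candidate answers and three categories.
per_category = {
    "chest": ([[1, 0]], [[0.5, 0.5]]),
    "brain": ([[0, 1]], [[0.25, 0.75]]),
}
type_targets = [[0, 0, 1], [0, 1, 0]]
type_scores = [[0.1, 0.1, 0.8], [0.2, 0.6, 0.2]]
loss = total_loss(per_category, type_targets, type_scores, lam=0.5)
```

Other balancing forms, such as a convex combination of the two losses, would fit the description equally well; only the weighting role of λ is taken from the text.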

Datasets
Our model was validated on VQA-RAD [19] and a filtered version of SLAKE [20]. VQA-RAD is a relatively well-recognized dataset in previous benchmarks (see Figure 6). SLAKE is a recently proposed bilingual dataset for Med-VQA, which contains 642 radiology images and 7032 question-answer pairs. Some of its clinical questions are based on a knowledge graph and on radiology images that are not within the scope of our research; we therefore followed the data distribution of VQA-RAD and screened out 488 radiology images. As shown in Figure 6(b), SLAKE* represents the dataset after filtering. Chest X-rays (177) form the largest image category, followed closely by abdomen CTs (173), and the brain MRIs constitute nearly 28% (138) of the radiology images. Figure 7 shows the comparative statistics of questions in different categories from the two datasets. We calculated the number of questions corresponding to the three categories and the number of "Open"/"Closed" questions in these two datasets. Here, "Open" and "Closed" refer to whether the question is answered with free-form text or with limited options such as yes/no. Thus, the questions are divided into two categories: (1) closed-ended questions and (2) open-ended questions. In particular, SLAKE* uses the original data split with reference to VQA-RAD, and there are a total of 8392 question-answer pairs generated by clinicians in these two datasets, which cover more than 10 aspects such as "Plane", "Modality" and "Organ System".
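As a quick arithmetic check of the SLAKE* image statistics quoted above, the category counts can be summed and converted to percentages. This short Python snippet uses only the numbers given in the text; the variable names are illustrative.

```python
# Image counts per category in SLAKE*, as stated in the text.
slake_filtered = {"chest X-ray": 177, "abdomen CT": 173, "brain MRI": 138}

total_images = sum(slake_filtered.values())
shares = {
    name: round(100 * count / total_images, 1)
    for name, count in slake_filtered.items()
}
```

The counts add up to the 488 screened images, and the brain MRI share is consistent with the "nearly 28%" stated above.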

Evaluation metrics
Accuracy is generally used in Med-VQA experiments to evaluate the model performance and is calculated as follows:
$$\mathrm{Accuracy} = \frac{N_{c}}{N_{total}}$$
where $N_{c}$ represents the number of correctly answered questions and $N_{total}$ refers to the total number of questions.
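The metric can be computed per question type as in the following minimal Python sketch. The record layout, the function names, and the use of exact matching between predicted and reference answers are assumptions for illustration; the result is expressed as a percentage to match how accuracies are reported in the tables.

```python
def accuracy(predictions, answers):
    """Percentage of predictions that match the reference answers."""
    if not answers:
        raise ValueError("no questions to evaluate")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def accuracy_by_type(records):
    """records: (question_type, prediction, answer) triples."""
    report = {}
    for name in ("Open", "Closed"):
        subset = [(p, a) for kind, p, a in records if kind == name]
        if subset:
            preds, refs = zip(*subset)
            report[name] = accuracy(preds, refs)
    report["Overall"] = accuracy(
        [p for _, p, _ in records], [a for _, _, a in records]
    )
    return report


# Hypothetical predictions for four closed-ended and two open-ended questions.
records = [
    ("Closed", "yes", "yes"),
    ("Closed", "no", "no"),
    ("Closed", "yes", "no"),
    ("Closed", "no", "no"),
    ("Open", "lung", "lung"),
    ("Open", "left", "right"),
]
report = accuracy_by_type(records)
```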

Experimental settings
All of our experiments were conducted on the Ubuntu 16.04 operating system with an NVIDIA GeForce RTX 2080 Ti graphics card; the deep learning framework was PyTorch 1.6.0 with CUDA 10.2, running on Python 3.7.0; we selected Adamax as the gradient descent optimizer. Before training, as shown in Table 1, we conducted an experiment on the selection of hyperparameters; the table also identifies the eventual parameter settings of our proposed model. The left side of the table shows the hyperparameters, and the right side presents their effects on the prediction accuracy of the model. Notably, "RNN" indicates the network we use for textual feature extraction. Furthermore, we utilize the warm-up learning method to speed up model convergence, where the gradual warm-up steps indicate the learning rate setting during the warm-up period; the decay rate represents the ratio of the learning rate to that of the previous decay period, and the decay step is the number of epochs contained in each decay period. Notably, we calculated the classification accuracy of the proposed type classifier in each epoch, and the current classifier parameters are saved for subsequent training only when the classification accuracy exceeds the previous best result.
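The warm-up, decay, and checkpointing rules described above can be sketched as a small plain-Python schedule. This is an illustration under stated assumptions: the function names, the plateau between warm-up and decay, the epoch at which decay starts, and every numeric value in the example are hypothetical and are not the settings of Table 1.

```python
def learning_rate(epoch, base_lr, warmup_steps, decay_start, decay_step, decay_rate):
    """Learning rate for a 0-indexed epoch: warm-up, plateau, then step decay."""
    if epoch < len(warmup_steps):
        # Gradual warm-up: each entry is a multiplier applied to base_lr.
        return base_lr * warmup_steps[epoch]
    if epoch < decay_start:
        return base_lr
    # Multiply by decay_rate once per decay period of decay_step epochs.
    periods = (epoch - decay_start) // decay_step + 1
    return base_lr * decay_rate ** periods


def is_new_best(current_accuracy, best_accuracy):
    """Save the type classifier only when its accuracy exceeds the best so far."""
    return current_accuracy > best_accuracy


# Hypothetical settings: two warm-up epochs, decay from epoch 4 every 2 epochs.
schedule = [
    learning_rate(epoch, 0.01, [0.5, 1.0], decay_start=4, decay_step=2, decay_rate=0.5)
    for epoch in range(8)
]
```

With these illustrative values the rate rises during warm-up, holds at the base value, and is then halved at the start of each two-epoch decay period.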

Comparison with the State-of-the-Art
As shown in Table 2, we compared our model with five other state-of-the-art methods from different periods on VQA-RAD and SLAKE*.
We briefly review the previous methods. Kim et al. [41] proposed a method to extract joint feature representations from multimodal inputs through low-rank bilinear pooling while reducing the cost of learning attention distributions for each pair of multimodal input channels. MEVF + BAN [43] used model-agnostic meta-learning (MAML) [44] and a convolutional denoising autoencoder (CDAE) [45] to initialize the visual feature extractor and utilized the proposed MEVF framework to extract image features while combining BAN for feature fusion. MEVF + BAN + CR [46], on the basis of the previous work [43], added the CR module to process the open-ended and closed-ended tasks. Liu et al. [24] proposed a contrastive pre-training and representation distillation (CPRD) method that used contrastive learning to pre-train visual feature extraction networks on an open-source database and distilled the model for adaptability to small-scale datasets. In [25], Gong et al. considered the compatibility and applicability of the pre-trained features and proposed a cross-modal self-attention (CMSA) multimodal feature fusion method combined with a pre-trained visual feature extraction network for answer prediction. The experimental results show improvements on both datasets after employing the CFF method combined with the SA module, building on the former work. As shown in Table 2, our model achieves increases of 2.2, 1.1, and 3.0% in accuracy for predicting "Overall", "Open", and "Closed" questions, respectively, compared to the current optimal method on VQA-RAD. On SLAKE*, our method achieves increases of 0.5 and 0.4% in the prediction accuracy of the "Overall" and "Open" questions, respectively, despite a 0.3% decrease in the prediction accuracy of "Closed" questions. These results demonstrate the effectiveness of our proposed model. Furthermore, our model could be combined with a CR module [46] for even better performance in Med-VQA.

Case study
We compare our model with the current optimal method [25] in more detail to further demonstrate the advantages of our proposed method. As shown in Figure 8, we selected five image-question pairs to compute the model's attention on the specific words contained in the questions during the training period. The figure presents a comparison of our method and CMSA in the semantic comprehension of questions. The attention map in Figure 8 represents the attention paid to each word in the question; CMSA [25] is presented on the left side while our method is on the right. Notably, a darker red shade indicates that more attention is paid to the word, and vice versa; "[padding]" means zero padding, which carries no meaning in the question. It can be seen from the comparison that our model focuses better on the necessary information contained in a question and is more sensitive to information such as pathology, interrogative pronouns, and orientation. It neglects "the", "[padding]", and other unnecessary information within a question. This behavior is in line with a human understanding of a question and thus can help the model predict answers more accurately and reasonably. Comparing the two models, it can also be seen that, with increasing training iterations, our proposed model achieves a higher answer prediction accuracy much earlier than CMSA; this indicates the advantages that our proposed method has in Med-VQA.

Ablation study
VQA-RAD has been widely cited and recognized by previous studies. It is also more representative in Med-VQA. Therefore, to verify the effectiveness of our proposed model, we conducted experiments on the VQA-RAD dataset.
First, for fairness, we replaced CMSA with BAN as the multimodal feature fusion module and compared our model with the current state-of-the-art methods that also employ BAN as the feature fusion module. As shown in Table 3, our method outperformed all other methods on the answer prediction accuracy of "Open" and "Closed" questions, with a prediction accuracy on the "Overall" questions that is only 0.2% lower than that of CPRD + BAN + CR. The experimental results verify that our proposed method can still achieve competitive performance when combined with BAN for feature fusion. Second, as shown in Table 4, ablation experiments were conducted on our model to further analyze the proposed CFF method and the SA module. We calculated the prediction accuracy of the questions from different categories for a more intuitive comparison. Meanwhile, BAN and CMSA were each combined with our method to compare the applicability of our model.
Comparing BAN with BAN + CFF in the table, it can be seen that the CFF method significantly improves the answer prediction accuracy of BAN [41]. The same enhancement also occurs in combination with CMSA [25], even though the prediction accuracy of "Open" questions shows a slight decline. After introducing the SA module, the model shows a certain degree of improvement in the answer prediction performance on almost all types of questions compared to the former method, as observed from the table; the prediction accuracy of questions from the different categories (abdomen, brain, and chest) is further improved, which affirms that the SA module has a positive impact by exploiting the specific features contained in the questions from different categories. In addition, it can be seen from the comparison of BAN and CMSA that our model achieves higher prediction accuracy when combined with CMSA; this shows that our method has better adaptability. In conclusion, the experiments show that our proposed method has a positive impact on the model and yields better answer prediction results in Med-VQA.

Conclusions
In this study, we propose a CFF method to strengthen the interactions between radiology images and questions from different categories in Med-VQA, utilizing a CNN-based type classifier to classify multimodal inputs and subsequently perform feature fusion for the corresponding image-question pairs. Notably, considering the specific semantic information contained in different questions, we propose an SA module to help our model continuously learn these specific semantic features during the training process and deepen the model's understanding of each question. In addition, extensive experiments were conducted on the benchmark dataset VQA-RAD and a recently proposed bilingual dataset SLAKE to verify the effectiveness of our proposed method. Compared with previous state-of-the-art methods, our model surpasses several others in answer prediction and achieves better performance in Med-VQA. However, current methods for Med-VQA, including ours, still have certain limitations. The questions that can be answered by the model are only intuitive questions raised according to the content of clinical images. There are some shortcomings, such as limited interaction channels, fixed interaction scenes, and a narrow description range, which cannot meet the interaction needs of diversified channels in real clinical diagnosis. To solve this problem, we intend to introduce knowledge-based question answering (KBQA) [54] into Med-VQA in our future work. On the one hand, for a variety of clinical questions, the model can provide answers in combination with the vast medical information provided by external knowledge bases. On the other hand, a knowledge graph contains a large number of structured triples of medical knowledge [55], which can help the model answer multi-hop questions as well as reasoning questions. If knowledge-based visual question answering methods [56] could be leveraged in the medical domain, they would better serve the needs of doctors and patients in real life and help to promote the realization and application of intelligent inquiry in clinical diagnosis.

Figure 1. Example of medical visual question answering (Med-VQA) (radiology images with annotations and corresponding question-answer pairs).

Figure 2. A basic model design for VQA. Visual and textual features are extracted separately and sent into the feature fusion module. The joint feature representations are then sent into the classifier for answer prediction.

Figure 3. Overview of our proposed corresponding feature fusion (CFF) method. Classified images and questions are processed separately. Corresponding features are fused and then sent into the classifier for answer prediction.

Figure 9. Comparison of the Loss/Accuracy curves. (a) The Loss/Accuracy curves of our model. (b) The Loss/Accuracy curves of CMSA.

Figure 9(a),(b) show the training loss and validation accuracy of our proposed model and CMSA, respectively. The training loss decreased with increasing training iterations and gradually converged to a stable value, indicating the convergence of the model. Comparing these two figures, we find that the training loss of our proposed method converges faster. At approximately 20 epochs, the training loss has dropped to approximately 0.5, and at the 127th epoch, it has dropped to approximately 0.0572; however, more epochs are needed for CMSA.

Table 3. Ablation study of the feature fusion module.

Table 4. Ablation study of our proposed methods.