MRChexNet: Multi-modal bridge and relational learning for thoracic disease recognition in chest X-rays

Abstract: While diagnosing multiple lesion regions in chest X-ray (CXR) images, radiologists usually apply pathological relationships in medicine before making decisions. Therefore, a comprehensive analysis of label relationships in different data modalities is essential to improve the recognition performance of the model. However, most automated CXR diagnostic methods that consider pathological relationships treat different data modalities as independent learning objects, ignoring the alignment of pathological relationships among different data modalities. In addition, some methods that use undirected graphs to model pathological relationships ignore the directed information, making it difficult to model all pathological relationships accurately. In this paper, we propose a novel multi-label CXR classification model called MRChexNet that consists of three modules: a representation learning module (RLM), a multi-modal bridge module (MBM) and a pathology graph learning module (PGL). RLM captures specific pathological features at the image level. MBM performs cross-modal alignment of pathology relationships in different data modalities. PGL models directed relationships between disease occurrences as directed graphs. Finally, the designed graph learning block in PGL performs the integrated learning of pathology relationships in different data modalities. We evaluated MRChexNet on two large-scale CXR datasets (ChestX-Ray14 and CheXpert) and achieved state-of-the-art performance. The mean area under the curve (AUC) scores for the 14 pathologies were 0.8503 (ChestX-Ray14) and 0.8649 (CheXpert). MRChexNet effectively aligns pathology relationships in different modalities and learns more detailed correlations between pathologies. It demonstrates high accuracy and generalization compared to competing approaches. MRChexNet can contribute to thoracic disease recognition in CXR.


Introduction
Thoracic diseases are diverse and imply complex relationships. For example, extensive clinical experience [1,2] has demonstrated that pulmonary atelectasis and effusion often lead to infiltrate development, and pulmonary edema often leads to cardiac hypertrophy. This strong correlation between pathologies, known as label co-occurrence, is a common phenomenon in clinical diagnosis and is not coincidental [3], as shown in Figure 1. Radiologists need to look at the lesion area at the time of diagnosis while integrating the pathologic relationships to arrive at the most likely diagnosis. Therefore, diagnosing a massive number of chest X-ray (CXR) images is a time-consuming and laborious reasoning task for radiologists. This has inspired researchers to utilize deep learning techniques to automatically analyze CXR images and reduce the workloads of radiologists. Multiple abnormalities may be present simultaneously in a single CXR image, making the clinical chest radiograph examination a classic multi-label classification problem. Multi-label classification means that a sample can belong to multiple categories (or labels) and that different categories are related. Relationships between pathology labels are expressed differently in different data modalities. As Figure 1 shows, pathology regions appearing simultaneously in the image reflect label relationships as features. In the word embedding of pathology labels, the label relationship is implicit in the semantic information of each label. In recent years, several advanced deep learning methods have been developed to solve this task [4-9]. According to our survey, the existing methods are divided into two classes: 1) label-independent learning methods and 2) label-correlation learning methods. The label-independent learning method transforms the multi-label CXR recognition task into multiple independent nonintersecting binary recognition tasks. The primary process is to train a separate binary classifier for each label on
the sample to be tested. Early on, some researchers [2,10-12] used convolutional neural networks and their variants on this task with some success by designing elaborate network structures to improve recognition accuracy. Despite their efforts and breakthroughs in this field, there is still room for improvement. Since this label-independent learning method treats each label as an independent learning object, training results are susceptible to situations such as missing sample labels and sample mislabeling. Additionally, this class of methods uses only the sample image as the main carrier of the learning object. The image, as a single modal form of label relationships, implies a particular limitation. These methods have yet to consider interlabel correlations and ignore the representation of label relationships in other data modalities.
Subsequently, clinical experience has shown that some abnormalities in CXR images may be strongly correlated. The literature [3] suggests that this is not a coincidence but rather a label relationship that can be called co-occurrence. The literature [1] found that edema in the lungs tends to trigger cardiomegaly. The literature [2] indicates that lung infiltrates are often associated with pulmonary atelectasis and effusion. This label relationship inspires the application of deep learning techniques to the CXR recognition task. In addition, this interdependent information can be used to infer missing or noisy labels from co-occurrence relationships, which improves the robustness of the model and its recognition performance.
Existing label-correlation learning methods are mainly categorized into two types: image-based unimodal learning methods and methods that additionally consider textual modal data while learning images. First, the most common technique in image-based unimodal learning methods is attention guidance. These attention-guided methods [13-15] focus on the most discriminating lesion-area features in each sample CXR image. These methods capture the interdependence between labels and lesion regions implicitly, i.e., by designing attention models with different mechanisms to establish the correlation between lesion regions and the whole region. However, the above methods only locally establish label correlations on the imaging modality, ignoring the global label co-occurrence relationship. Another class of approaches that consider textual modal data when learning images is categorized as Recurrent Neural Network (RNN)-based and Graph Convolutional Network (GCN)-based. These RNN-based methods [1,16,17] rely on state variables to encode label-related information and use the RNN as a decoder to predict anomalous sequences in sample images. However, this approach often requires complex computations. In addition, some researchers [18,19] extract valuable textual embedding information from radiology reports to assist in classification. In contrast, GCN-based methods [6,20-22] represent label-correlation information, such as label co-occurrence, as undirected graph data. These methods treat each label as a graph node and use semantic word embeddings of labels as node features. However, while the above methods learn the label relations in additional modalities, they ignore the alignment between the label-relation representations of different modalities, as shown on the right side of Figure 1. Moreover, these methods model pathological relationships with undirected graphs and thus discard directed information; it is difficult to represent all pathological relationships accurately in an undirected graph.
In this paper, we propose a multi-label CXR classification model called MRChexNet that integrally learns pathology information in different modalities and models interpathology correlations more comprehensively. It consists of a representation learning module (RLM), a multi-modal bridge module (MBM), and a pathology graph learning module (PGL). In RLM, we obtain image-level pathology-specific representations for the lesion regions in every image. In MBM, we fully bridge the pathology representations in different modalities: the image-level pathology-specific representations from RLM are aligned with the rich semantic information in pathology word embeddings. In PGL, we first model the undirected-graph pathology correlation matrix containing all pathology relations in a data-driven manner. Second, by considering the directed information between nodes, we construct an in-degree matrix and an out-degree matrix as directed graphs, taking the in-degree and out-degree of each node as the study object, respectively. Finally, we design a graph learning block in PGL that integrates the study of pathological information in multiple modalities. The front end of the block is a graph convolution block with a two-branch symmetric structure for learning the two directed graphs containing label relations in different directions. The back end of the block stacks graph attention layers, and all label relations are comprehensively learned on the undirected-graph pathology correlation matrix. Finally, the framework is optimized using a multi-label loss function to complete end-to-end training.
In summary, our contributions are fourfold: 1) A new RLM is proposed to obtain image-level pathology-specific representation and global image representation for image lesion regions.
2) A novel MBM is proposed that aligns pathology information in different modal representations.
3) In the proposed PGL, more accurate pathological relationships are modeled as directed graphs by considering directed information between nodes on the graph.An effective graph learning block is designed to learn the pathology information of different modalities comprehensively.

4) We evaluated the framework on two large-scale CXR datasets (ChestX-ray14 [2] and CheXpert [23]), obtaining average AUC scores of 0.8503 and 0.8649, respectively, over 14 pathologies. Our method achieves state-of-the-art performance in terms of classification accuracy and generalizability.

Related work
This section presents a summary of the relevant literature in two aspects.First, previous works on the automatic analysis of CXR images are introduced.Second, several representative works related to cross-modal fusion are presented.

Multi-label chest X-ray image recognition
To improve efficiency and reduce the workloads of radiologists, researchers are beginning to apply the latest advances in deep learning to chest X-ray analysis. In the early days of deep learning techniques applied to CXR recognition, researchers divided the CXR multi-label recognition task into multiple independent disjoint binary labeling problems: an independent binary classifier is trained for each anomaly present in the image. Wang et al. [2] used classical convolutional neural networks and transfer learning to predict CXR images. Rajpurkar et al. [10] improved the network architecture based on DenseNet-121 [11] and proposed CheXNet for anomaly classification in CXR images, which achieved good performance in detecting pneumonia. Li et al. [24] performed thoracic disease identification and localization with additional location annotation supervision. Shen et al. [12] designed a novel network training mechanism for efficiently training CNN-based automatic chest disease detection models. To dynamically capture more discriminative features for thoracic disease classification, Chen et al. [25] used a dual asymmetric architecture based on ResNet and DenseNet. However, as mentioned above, these methods do not account for the correlation between the labels.
When diagnosing, the radiologist needs to view the lesion area while integrating pathological relationships to make the most likely diagnosis. This necessity inspired researchers to start considering label dependencies. For example, Wang et al. [16] used an RNN to model label relevance sequentially. Yao et al. [1] treated multi-label classification as a fixed-length sequence prediction task; they employed long short-term memory (LSTM) [26] and presented initial results indicating that utilizing label dependency can enhance classification performance. Ypsilantis et al. [17] used an RNN-based bidirectional attention model that focuses on information-rich regions of an image and samples the entire CXR image sequentially. Moreover, some approaches have attempted to use different attention mechanisms to correlate labels with attended areas. The works of Zhu et al. [13] and Wang et al. [14] both use an attention mechanism that only addresses a limited number of local correlations between regions of an image. Guan et al. [15] used CNNs to learn high-level image features and designed attention-learning modules to provide additional attention guidance for chest disease recognition. It is worth mentioning that as the graph data structure has become a hot research topic, some approaches use graphs to model label relationships. Chen et al. [22] introduced a workable framework in which every label represents a node, the word vector of each label acts as the node feature, and a GCN is implemented to learn the connections among labels in an undirected graph. Li et al. [27] developed the A-GCN, which captures label dependencies by creating an adaptive label structure and has demonstrated exemplary performance. Lee et al. [20] described label relationships using a knowledge graph, which enhances image representation accuracy. Chen et al.
[6] employed an undirected graph to represent the relationships between pathologies.They designed CheXGCN by using the word vectors of labels as node features of the graph, and the experiments showed promising results.

Cross-modal fusion
To fuse cross-modal features, researchers often use concatenation or elementwise summation of the different modal features. Fukui et al. [28] proposed taking the outer product of two vectors from different modalities to fuse multi-modal features via bilinear models. However, this method yields high-dimensional fusion vectors. Hu et al. [29] used data within 24 hours of admission to build simpler machine-learning models for early acute kidney injury (AKI) risk stratification and obtained good results. Xu et al. [30] encouraged data of both attribute and imaging modalities to be discriminative to improve attribute-image person reidentification. To reduce the high-dimensional computation, Kim et al. [31] designed a method that achieves performance comparable to the work of Fukui et al. by performing the Hadamard product between two feature vectors, but with slow convergence. It is worth mentioning that Zhou et al. [32] introduced a new method with stable performance and accelerated model convergence for fusing image features and text embeddings. Chen et al. [22] used ResNet to learn the image features, a GCN to learn the semantic information in the label word embeddings, and finally fused the two using a simple dot product. Similarly, Wang et al. [33] designed a sum-pooling method to fuse the vectors of the two modalities after learning the image features and the semantic information of the label word embeddings; it not only reduces the dimensionality of the vectors but also increases the convergence rate of the model.

Materials and methods
This section presents our multi-label CXR recognition framework, MRChexNet, consisting of three main modules: the representation learning module (RLM), multi-modal bridge module (MBM), and pathology graph learning module (PGL). We first introduce the general framework of our model in Figure 2 and then detail the workflow of each of the three modules. Finally, we describe the datasets, implementation details, and evaluation metrics.

Representation learning module
Theoretically, we can use any CNN-based model to learn image features. In our experiments, following [1,6,25], we use DenseNet-169 [11] as the backbone for fair comparisons. For an input image I with a 224 × 224 resolution, we obtain 1664 × 7 × 7 feature maps from the "Dense Block 4" layer of DenseNet-169. As shown in Figure 2, we perform global average pooling to obtain the image-level global feature x = f_GAP(f_backbone(I)), where f_GAP(·) represents the global average pooling (GAP) [34] operation. We first set up a multilayer perceptron (MLP) that learns from x to produce an initial diagnostic score for the image, Y_MLP. Specifically, the MLP here consists of one fully connected (FC) layer followed by a sigmoid activation:

Y_MLP = f_MLP(x; θ_MLP) = σ(θ_MLP x), (3.1)

where f_MLP(·) represents the MLP layer and θ_MLP ∈ R^(C×D) is its parameter. We use the parameter θ_MLP as a diagnoser for each disease and filter a set of disease-specific features from the global feature x. Each diagnoser θ^c_MLP ∈ R^D extracts the information related to disease c and predicts the likelihood that disease c appears in the image. Then, the pathology-related feature F_pr is disentangled by Eq (3.2):

F_pr = θ_MLP ⊙ x, (3.2)

where ⊙ denotes the Hadamard product (x is broadcast over the C rows of θ_MLP). Adjusting the global feature x in this way allows it to capture more relevant information for each disease.
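The RLM head above can be sketched in a few lines. The following is a minimal NumPy illustration (not the training code), assuming x is the GAP feature of DenseNet-169 (D = 1664), theta plays the role of θ_MLP ∈ R^(C×D), and C = 14 pathologies:

```python
import numpy as np

def rlm_disentangle(x, theta):
    """Sketch of the RLM head: a one-layer 'diagnoser' per disease.

    x     : (D,)   global image feature from GAP over the backbone maps
    theta : (C, D) per-disease classifier weights (theta_MLP in the paper)
    Returns the initial scores Y_MLP (C,) and the disentangled
    pathology-related features F_pr (C, D).
    """
    logits = theta @ x                      # one raw score per disease
    y_mlp = 1.0 / (1.0 + np.exp(-logits))   # sigmoid activation
    f_pr = theta * x[None, :]               # Hadamard product, x broadcast over C
    return y_mlp, f_pr

rng = np.random.default_rng(0)
x = rng.standard_normal(1664)               # DenseNet-169 feature size
theta = rng.standard_normal((14, 1664)) * 0.01
y, f = rlm_disentangle(x, theta)
```

The broadcast in the last line mirrors Eq (3.2): row c of F_pr is the global feature reweighted by the c-th diagnoser.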

Multi-modal bridge module
In this section, we design the MBM to efficiently align the disease image features with the disease semantic word embeddings. As Figure 3 shows, the MBM is divided into two phases: alignment + fusion, and squeeze. The fixed input of the MBM consists of two parts: modal_1 ∈ R^(D_1), the image features, and modal_2 ∈ R^(D_2), the word embeddings. First, we use two FC layers to convert the two inputs into a common dimension D_3, obtaining M_1, M_2 ∈ R^(D_3). We design a separate dropout layer for M_2 to prevent redundant semantic information from causing overfitting. After obtaining the two same-dimensional inputs M_1 and M_2, the initial bilinear pooling [35] is defined as

F_i = M_1^T S_i M_2, (3.3)

where F ∈ R^o is the output fusion feature of the MBM and S_i ∈ R^(D_3×D_3) is the bilinear mapping matrix with bias terms included. Each S_i can be factorized into two low-rank matrices, S_i = u_i v_i^T with u_i, v_i ∈ R^(D_3×G):

F_i = M_1^T u_i v_i^T M_2. (3.4)

Therefore, Eq (3.4) can be rewritten as

F_i = 1^T (u_i^T M_1 ∘ v_i^T M_2), (3.5)

where G is the factor, or latent dimension, of the two low-rank matrices, 1 ∈ R^G is an all-one vector, and ∘ denotes elementwise multiplication. To obtain the final F, two three-dimensional tensors u ∈ R^(D_3×G×o) and v ∈ R^(D_3×G×o) need to be learned. Preserving the generality of Eq (3.5), the two learnable tensors u, v are reshaped into two-dimensional matrices, namely, u → ũ ∈ R^(D_3×Go) and v → ṽ ∈ R^(D_3×Go); then Eq (3.5) simplifies to

F = f_GroupSum(ũ^T M_1 ∘ ṽ^T M_2, G), (3.6)

where the function f_GroupSum(vector, G) sums every g consecutive elements of vector into one group and outputs the G resulting group sums, so that F ∈ R^G. Furthermore, a dropout layer is added after the elementwise multiplication layer to avoid overfitting. Because of the elementwise multiplication, the magnitudes of the output neurons can change drastically, and the model can converge to an unsatisfactory local minimum. Therefore, the normalization layer (F ← F/∥F∥) and power normalization layer (F ← sign(F)|F|^0.5) are appended. Finally, F is copied C times through the operation f_Repeat(·), giving F ∈ R^(C×G) as the final MBM output. These are the details of the MBM process.
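The alignment + fusion and squeeze phases can be illustrated with a small NumPy sketch of Eqs (3.5) and (3.6). Dimensions are toy values for readability (the paper uses D_3 = 14,336, G = 1024, g = 14); U and V stand for the reshaped factors ũ, ṽ, the dropout layers are omitted, and the two normalizations are applied in the common power-then-L2 order:

```python
import numpy as np

def group_sum(v, G):
    """f_GroupSum: sum every g = len(v) // G consecutive elements into one of G groups."""
    return v.reshape(G, -1).sum(axis=1)

def mbm_fuse(m1, m2, U, V, G):
    """Low-rank bilinear fusion sketch of Eqs (3.5)-(3.6).

    m1, m2 : (D3,) image feature and word embedding after the two FC layers
    U, V   : (D3, G * g) reshaped low-rank factors (u~, v~ in the paper)
    """
    joint = (U.T @ m1) * (V.T @ m2)          # elementwise (Hadamard) product
    F = group_sum(joint, G)                  # squeeze to G dimensions
    F = np.sign(F) * np.sqrt(np.abs(F))      # power normalization
    F = F / (np.linalg.norm(F) + 1e-12)      # L2 normalization
    return F
```

A usage pass with D_3 = 6, G = 3, g = 2 yields a unit-norm 3-dimensional fusion vector.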

Pathology graph learning module
Our PGL module is built on top of graph learning. The node-level output of traditional graph learning techniques is the predicted score of each node. In contrast, the final output of our designed graph learning block serves as the classifier for the corresponding label in our task. We use the fused features of the MBM output as the node features for graph learning. Furthermore, the graph structure (i.e., the correlation matrix) is typically predefined in other tasks. However, it is not provided in the multi-label CXR image recognition task, so we need to construct the correlation matrix ourselves. Therefore, we devise a new method for constructing the correlation matrix that considers the directed information of graph nodes.
First, we capture the pathological dependencies based on the label statistics of the entire dataset and construct the pathology correlation matrix A_pc. Specifically, we count the number of occurrences T_i of the i-th pathology label L_i and the number of simultaneous occurrences of L_i and L_j (T_ij = T_ji). The label dependency can then be expressed by the conditional probability

P_ij = T_ij / T_j,

where P_ij denotes the probability that L_i occurs under the condition that L_j occurs. Note that since the conditional probabilities between two objects are asymmetric, P_ij ≠ P_ji. The element value A_pc(i, j) at each position in this matrix is equal to P_ij. Then, considering the directed information of the graph structure, we split A_pc into an in-degree matrix A^in_pc and an out-degree matrix A^out_pc, obtained by taking the in-degree and the out-degree of each node as the study object, respectively. The dual-branch learning of the graph learning block in our PGL is then defined as

Z_in = f_gc(F, A^in_pc; θ^in_gc), Z_out = f_gc(F, A^out_pc; θ^out_gc),

where Z_in and Z_out are the outputs of the in-degree branch and the out-degree branch, respectively, f_gc(·) denotes the graph convolutional operation, and θ_gc denotes the corresponding trainable transformation matrix.
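The data-driven construction of A_pc can be sketched as follows, assuming a multi-hot label matrix as input; the split into the in-degree and out-degree matrices then operates on this asymmetric matrix:

```python
import numpy as np

def pathology_correlation(labels):
    """Build the directed pathology correlation matrix A_pc.

    labels : (N, C) binary matrix; labels[n, i] = 1 if pathology L_i
             appears in sample n.
    Returns A_pc with A_pc[i, j] = P_ij = T_ij / T_j, the probability
    that L_i occurs given that L_j occurs.
    """
    T = labels.T @ labels                    # T[i, j] = co-occurrence count T_ij
    occ = np.diag(T).astype(float)           # T[j, j] = occurrence count T_j
    return T / np.maximum(occ[None, :], 1)   # normalize column j by T_j
```

Because T_ij = T_ji but T_i and T_j generally differ, the resulting matrix is asymmetric (P_ij ≠ P_ji), which is exactly the directed information that PGL exploits.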
To learn more about the correlations between different pathological features, we use a graph attention network (GAT) [36] to consider Z_in and Z_out jointly, taking Z_all = f′(Z_in + Z_out) as the input feature to the graph attention layers, where f′(·) denotes batch normalization followed by the LeakyReLU nonlinear activation. The graph attention layer transforms the implicit features of the input nodes and aggregates neighborhood information into the next node to strengthen the correlation between the information of the central node and its neighbors. The input Z_all to the graph attention layer is a set of node features X_i ∈ R^d, where d is the number of feature dimensions of each node. The attention weight coefficient e_(i,j) between node i and each neighbor j ∈ NB_i is computed with a learnable linear transformation matrix W applied to all nodes, as shown in Eq (3.12):

e_(i,j) = a([W X_i ∥ W X_j]), (3.12)

where ∥ is the concatenation operation, W ∈ R^(d×d) and a are learnable parameters, and d denotes the dimensionality of the output features. The graph attention layer allows each node to attend to every other node. e_(i,j) uses LeakyReLU as the nonlinear activation function and is normalized by the sigmoid function:

α_(i,j) = σ(LeakyReLU(e_(i,j))).

To stabilize the learning process of the graph attention in the PGL module, we extend it with a multi-headed self-attention mechanism:

Y_PGL = (1/K) Σ_(k=1..K) α^(k) Z_all W_k,

where Y_PGL denotes the output features incorporating the pathology-correlated features, K denotes the number of attention heads, α^(k) denotes the normalized k-th attention weight coefficient matrix, and W_k denotes the trainable weight matrix of the k-th attention head. Finally, the output features of the heads are averaged and passed to the next node.
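A single graph attention head of the kind used in the graph learning block can be sketched as below. This is an illustrative NumPy version over a fully connected pathology graph; it uses the standard GAT softmax normalization rather than the sigmoid variant described above:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_head(Z, W, a, slope=0.2):
    """One single-head graph attention layer over a fully connected graph.

    Z : (C, d)  node features (Z_all)
    W : (d, dp) shared linear transformation
    a : (2*dp,) attention vector, split into the parts applied to the
                central node and to its neighbor
    """
    H = Z @ W                                     # transformed node features
    dp = H.shape[1]
    a1, a2 = a[:dp], a[dp:]
    # e[i, j] = LeakyReLU(a^T [W X_i || W X_j])
    e = leaky_relu((H @ a1)[:, None] + (H @ a2)[None, :], slope)
    e = e - e.max(axis=1, keepdims=True)          # numerical stability
    alpha = np.exp(e) / np.exp(e).sum(axis=1, keepdims=True)
    return alpha @ H                              # aggregate neighborhood info
```

Averaging the outputs of K such heads (each with its own W_k) gives the multi-headed aggregation used at the back end of the PGL block.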
We show through empirical studies that PGL can detect potentially strong correlations between pathological features, which improves the model's ability to learn implicit relationships between pathologies.
After obtaining Y_MLP and Y_PGL, we set the final output of our model to Y_Out = Y_MLP + Y_PGL and then feed it into the loss function to calculate the loss. Finally, we update the entire network end-to-end using the MultiLabelSoftMargin loss (called multi-label loss) function [37]. The training loss is

Loss = −(1/C) Σ_(j=1..C) [ L_j log σ(Y_Out^j) + (1 − L_j) log(1 − σ(Y_Out^j)) ],

where Y_Out and L denote the predicted pathology vector and the true pathology vector of the sample image, respectively, and Y_Out^j and L_j denote the j-th elements of the predicted and actual pathology vectors.
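The multi-label loss above corresponds to PyTorch's MultiLabelSoftMarginLoss; a minimal NumPy re-implementation of the formula, useful as a sanity check, is:

```python
import numpy as np

def multilabel_soft_margin(y_out, labels):
    """Multi-label soft-margin loss, averaged over the C classes.

    y_out  : (C,) raw outputs Y_Out = Y_MLP + Y_PGL (pre-sigmoid)
    labels : (C,) binary ground-truth pathology vector L
    """
    p = 1.0 / (1.0 + np.exp(-np.asarray(y_out, float)))  # sigmoid
    labels = np.asarray(labels, float)
    return float(-np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p)))
```

With all-zero logits the loss equals log 2 per class, a handy sanity check for an untrained network.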

Experiments
In this section, we report and discuss the results on two benchmark multi-label CXR recognition datasets. Ablation experiments were also conducted to explore the effects of different parameters and components on MRChexNet. Finally, a visual analysis was performed.

Datasets
ChestX-Ray14 is a large CXR dataset. It contains 78,466 training images, 11,220 validation images, and 22,434 test images. On average, approximately 1.6 pathology labels from 14 semantic categories are applied to each patient image. Each image is labeled with one or more pathologies, as illustrated in Figure 4. We strictly follow the official splitting standards of ChestX-Ray14 provided by Wang et al. [2] to conduct our experiments so that our results are directly comparable with most published baselines. We use the training and validation sets to train our model and then evaluate the performance on the test set.
CheXpert is a popular dataset for recognizing, detecting and segmenting common chest and lung diseases. There are 224,616 images in the database, covering 12 pathology labels and two nonpathology labels (no finding and support devices). Each image is assigned one or more disease observations, and the observations are labeled as positive, negative or uncertain, as illustrated in Figure 4; if no positive disease is found in the image, it is labeled as 'no finding'. Uncertain labels in the images can be treated as positive (CheXpert 1s) or negative (CheXpert 0s). On average, each image has 2.9 pathology labels under CheXpert 1s and 2.3 under CheXpert 0s. Since the data for the test set are still not published, we redivided the dataset into a training set, a validation set, and a test set at a ratio of 7:1:2. As described earlier, the proposed PGL module involves the global modeling of all pathologies on the basis of co-occurrence pairs, the result of which is the identification of potential pathologies present in each image. As shown in Figure 5, many pathology pairs with co-occurrence relationships were obtained by counting the occurrences of all pathologies in both datasets separately. For example, lung disease is frequently associated with pleural effusion, and atelectasis is frequently associated with infiltration. This phenomenon serves as a basis for constructing the pathology correlation matrix A_pc and provides initial evidence of the feasibility of the proposed PGL module.
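The two uncertainty policies (CheXpert 1s / CheXpert 0s) amount to a simple relabeling step during preprocessing. A sketch, assuming the -1 encoding of 'uncertain' used in the released CheXpert label files:

```python
def resolve_label(raw, policy="ones"):
    """Resolve a CheXpert observation into a binary training label.

    raw    : 1 (positive), 0 (negative or blank), or -1 (uncertain)
    policy : "ones"  -> uncertain mapped to positive (CheXpert 1s)
             "zeros" -> uncertain mapped to negative (CheXpert 0s)
    """
    if raw == -1:
        return 1 if policy == "ones" else 0
    return raw
```

Applying this mapping per observation yields the two training variants evaluated in Tables 2 and 3.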

Implementation details
All experiments were run on an Intel 8268 CPU and an NVIDIA Tesla V100 32 GB GPU, implemented in the PyTorch framework. First, we resize all images to 256 × 256 and normalize them with the mean and standard deviation of the ImageNet dataset. Then, random cropping to 224 × 224, random horizontal flipping, and random rotation were applied, as some images may have been flipped or rotated within the dataset. The output feature dimension D_1 of the backbone was 1664. In the PGL module, we designed a graph learning block consisting of 1-1 symmetrically structured GCN layers stacked with 2 graph attention layers (2 attention heads within each layer). The numbers of GCN output channels were 1024 and 1024, respectively. We used a 2-layer GAT model: the first layer uses K = 2 attention heads, each head computing 512 features (1024 features in total), followed by an exponential linear unit (ELU) [46] nonlinearity; the second layer does the same, averaging these features, followed by a logistic sigmoid activation. In addition, we used LeakyReLU with a negative slope of 0.2 as the nonlinear activation function in the PGL module. The input pathology-label word embedding was a 300-dimensional vector generated by the GloVe model pretrained on the Wikipedia dataset. When a pathology label comprised multiple words, we used the average vector of all its words as the pathology-label word embedding. In the MBM, we set D_3 = 14,336 to bridge the vectors of the two modalities. Furthermore, we set G = 1024 with g = 14 to complete the GroupSum method. The ratios of dropout1 and dropout2 were 0.3 and 0.1, respectively. The whole network was updated by AdamW with betas of (0.9, 0.999) and a weight decay of 1e-4. The initial learning rate of the whole model was 0.001, decayed by a factor of 10 every 10 epochs.
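The learning-rate schedule described above (divide by 10 every 10 epochs) can be written as a small helper; the function name is ours, not from the paper:

```python
def step_lr(epoch, base_lr=1e-3, step=10, factor=10.0):
    """Step schedule: divide the learning rate by `factor` every `step` epochs."""
    return base_lr / (factor ** (epoch // step))
```

In PyTorch the same behavior is provided by a step scheduler with step size 10 and decay factor 0.1 wrapped around the AdamW optimizer.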
In our experiments, we used the AUC value [38] (the area under the receiver operating characteristic (ROC) curve) for each pathology and the mean AUC value across all pathologies to measure the performance of MRChexNet. There was no data overlap between the training and testing subsets. The true label of each image was given as L = [L_1, L_2, . . ., L_C], with the number of labels C = 14 for both CXR datasets; each element L_c indicates the presence (1) or absence (0) of the c-th pathology. For each image, a label was predicted as positive if its confidence was greater than 0.5.
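Per-pathology AUC can be computed without any ML library via the rank-based (Mann-Whitney) formulation; a self-contained sketch, included here only to make the metric concrete:

```python
import numpy as np

def auc(scores, labels):
    """AUC via the Mann-Whitney U statistic, with tied scores given average ranks."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels, int)
    order = np.argsort(scores)
    sorted_s = scores[order]
    r = np.arange(1, len(scores) + 1, dtype=float)
    i = 0
    while i < len(scores):                  # average the ranks of tied scores
        j = i
        while j + 1 < len(scores) and sorted_s[j + 1] == sorted_s[i]:
            j += 1
        r[i:j + 1] = r[i:j + 1].mean()
        i = j + 1
    ranks = np.empty_like(r)
    ranks[order] = r
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    # positive-rank sum minus its minimum possible value, normalized
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

The mean AUC reported in the tables is simply the average of this quantity over the 14 pathology columns.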

Comparison with existing methods
In this section, we conduct experiments on ChestX-Ray14 and CheXpert to compare the performance of MRChexNet with existing methods.
Table 2. AUC comparisons of MRChexNet with previous baseline on CheXpert 1s.
Table 3. AUC comparisons of MRChexNet with the previous baseline on CheXpert 0s.

Ablation experiments and discussion
MRChexNet with its different components on ChestX-Ray14: We evaluated the performance of the components of MRChexNet; the results are shown in Table 4. In baseline + PGL, we use a simple elementwise summation instead of the MBM to fuse the visual feature vectors of pathology and the semantic word vectors of pathology. The resulting simple fusion vectors are used as the node features of the graph learning block. Compared to the baseline DenseNet-169, the mean AUC score of baseline + PGL was significantly higher by 3.6% (0.782 → 0.818), especially in atelectasis (0.775 → 0.820), cardiomegaly (0.879 → 0.920), effusion (0.826 → 0.888) and nodule (0.689 → 0.769), exceeding the vanilla DenseNet-169 by an average of 5.7% on those pathology labels. The experimental results show that the proposed PGL module is crucial in mining the global co-occurrence between pathologies. Note that in the baseline + MBM model, the fixed input modal_2 to the MBM is the set of 14 pathology-label word vectors carrying initial semantic information. We learn the output of the resulting cross-modal fusion vectors from one FC layer by aligning the visual features of pathology with the semantic word vectors of pathology. Compared to the DenseNet-169 baseline, the mean AUC score of baseline + MBM was significantly higher by 2.7% (0.782 → 0.809), especially in atelectasis (0.775 → 0.800), effusion (0.826 → 0.860), pneumothorax (0.823 → 0.859), and mass (0.766 → 0.856), exceeding the vanilla DenseNet-169 by an average of 4.6% on those pathology labels. With the addition of both the MBM and PGL modules, MRChexNet significantly improved the mean AUC score by 6.8%. In particular, the AUC score improvement was significant for atelectasis (0.775 → 0.824), pneumothorax (0.823 → 0.888), and emphysema (0.838 → 0.920). This indicates that the MBM and PGL modules in our framework reinforce and complement each other to make MRChexNet perform at its best. Testing time for different
components in MRChexNet: We measured the inference time for each component of MRChexNet; the results are shown in Table 5. Inference time is reported in seconds, defined as the time to infer one image. We first tested an image using the baseline and took the obtained time as a base. After measuring the duration of baseline + MBM and baseline + PGL on an image, the base inference duration of the baseline is subtracted to obtain the exact inference duration of each module. According to the results, MBM and PGL increase the inference time of the model by 20.3 × 10^−6 and 33.7 × 10^−6 s, respectively. It is worth mentioning that the interaction of the two achieves satisfactory recognition performance, which is an acceptable trade-off compared to the manual reasoning time of a radiologist.
MRChexNet under different types of word embeddings: By default we use GloVe [40] token representations as input to the multi-modal bridge module (MBM). In this section, we evaluate the performance of MRChexNet under other types of popular word representations. Specifically, we investigate different word embedding methods, including GloVe [40], FastText [41], and simple one-hot word embedding. Figure 7 shows the results using different word embeddings on ChestX-Ray14 and CheXpert. As shown, thoracic disease recognition accuracy is not significantly affected when different word embeddings are used as inputs to the MBM. Furthermore, the observations (especially the results with one-hot embeddings) demonstrate that the accuracy improvement achieved by our approach does not come entirely from the semantics produced by the word embeddings. Nevertheless, using powerful word embeddings led to better performance. One possible reason is that word embeddings learned from a large text corpus maintain some semantic topology; that is, semantically related concept embeddings are close in the embedding space. Our model can employ these implicit dependencies to further benefit thoracic disease recognition.

Figure 7. Effects of different pathology word embedding approaches. It is clear that different pathology word embeddings have little effect on accuracy. This shows that our improvements are not necessarily due to the semantic meanings derived from the pathology word embeddings but rather to our MRChexNet.
Groups G and elements g in GroupSum: In this section, we evaluate the performance of the MBM in MRChexNet using different numbers of groups G and numbers of elements g within a group. With the GroupSum in the MBM, each D₃-dimensional vector is converted into a G-dimensional vector. We use a set of (G, g) ∈ {(2048, 7), (1024, 14), (512, 28), (256, 56), (128, 112)} to generate a low-dimensional bridging vector. As shown in Figure 8, MRChexNet obtains better performance on ChestX-Ray14 when G = 1024 and g = 14 are chosen, while the change in the mean AUC is very slight on CheXpert. We believe that the original semantic information between the pathology word embeddings is better expressed with G = 1024 and g = 14. Other values of (G, g) bring similar results and do not affect the model much.
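The GroupSum reduction can be sketched directly. Note that every (G, g) pair above satisfies G × g = 14336, so D₃ = 14336 is implied; the assumption below that the g elements of each group are consecutive is ours, made for illustration:

```python
import numpy as np

def group_sum(x, G, g):
    """Collapse a (G*g)-dim vector to G dims by summing g consecutive elements."""
    assert x.shape[-1] == G * g, "vector length must equal G * g"
    return x.reshape(*x.shape[:-1], G, g).sum(axis=-1)

x = np.arange(14336, dtype=np.float64)  # D3 = G * g = 14336 in every setting
for G, g in [(2048, 7), (1024, 14), (512, 28), (256, 56), (128, 112)]:
    y = group_sum(x, G, g)
    assert y.shape == (G,)

# Total mass is preserved regardless of the grouping chosen.
assert np.isclose(group_sum(x, 1024, 14).sum(), x.sum())
```

The choice of (G, g) therefore trades off how coarsely the D₃-dimensional representation is pooled against the dimensionality of the bridging vector, which is consistent with the small AUC differences observed in Figure 8.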
Different numbers of GCN layers and GAT layers of the graph learning block in PGL: Since the front end of our graph learning block is a GCN with a dual-branch symmetric structure, we mainly discuss the number of GCN layers on each branch. We place the graph attention layer at the end of the graph learning block. To maintain the symmetry of the graph learning block structure, we kept the number of GAT layers the same as the number of attention heads within each layer. Table 6 shows the performance results for different numbers of GCN layers. For the 1-1 layer GCN model, in each branch, the output dimension of the single layer is 1024. For the 2-2 layer GCN model, in each branch, the output dimensions of the sequential layers are 1024 and 1024. For the 3-3 layer GCN model, in each branch, the output dimensions of the sequential layers are 1024, 1024 and 1024. We aligned the number of graph attention layers with the number of attention heads. Specifically, for the 1-layer GAT model, the layer uses K = 1 attention head, which computes 1024 features (1024 features in total). For the 2-layer GAT model, the first layer uses K = 2 attention heads, each computing 512 features (1024 features in total), and the second layer does the same. As the table shows, the pathology recognition performance on both datasets decreased when the numbers of GCN and GAT layers increased. This degradation is due to the accumulation of information transfer between nodes when more GCN and GAT layers are used, leading to oversmoothing.
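The head/feature bookkeeping described above (K heads each computing 1024/K features, concatenated back to 1024) follows the standard multi-head GAT scheme. A minimal numpy sketch with toy dimensions (N = 14 nodes for the 14 pathologies, small feature sizes; weights are random stand-ins, not the learned parameters):

```python
import numpy as np

def multi_head_gat(H, A, Ws, a_vecs):
    """One GAT layer: per-head attention over the graph, heads concatenated."""
    outs = []
    for W, a in zip(Ws, a_vecs):
        Z = H @ W                                   # (N, F') per-head projection
        N = Z.shape[0]
        e = np.zeros((N, N))
        for i in range(N):                          # e_ij = LeakyReLU(a^T[z_i||z_j])
            for j in range(N):
                e[i, j] = np.concatenate([Z[i], Z[j]]) @ a
        e = np.where(e > 0, e, 0.2 * e)             # LeakyReLU
        e = np.where(A > 0, e, -1e9)                # mask non-edges (directed graph)
        alpha = np.exp(e) / np.exp(e).sum(axis=1, keepdims=True)
        outs.append(alpha @ Z)                      # attention-weighted aggregation
    return np.concatenate(outs, axis=1)             # concatenate head outputs

rng = np.random.default_rng(0)
N, F, K, Fp = 14, 8, 2, 4                           # K = 2 heads of Fp features each
H = rng.standard_normal((N, F))
A = np.eye(N) + (rng.random((N, N)) > 0.7)          # toy directed adjacency + self-loops
Ws = [rng.standard_normal((F, Fp)) for _ in range(K)]
a_vecs = [rng.standard_normal(2 * Fp) for _ in range(K)]
out = multi_head_gat(H, A, Ws, a_vecs)
assert out.shape == (N, K * Fp)                     # 2 heads × 4 = 8 concatenated features
```

In the paper's configuration the same arithmetic gives K = 2 heads of 512 features each, concatenated to the 1024-dimensional layer output.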
Figure 8. The change in the mean AUC using different values of (G, g).

Visualization of lesion areas for qualitative assessment
In Figure 9, we visualize the original images and the corresponding label-specific activation maps obtained by our proposed MRChexNet. It is clear that MRChexNet can capture the discriminative semantic regions of the images for the different chest diseases. Figure 10 illustrates a visual representation of multi-label CXR recognition. The top-eight predicted scores for each test subject are given, sorted top-down by the magnitude of the predicted score values. As shown in Figure 10, compared with the vanilla DenseNet-169 model, the proposed MRChexNet enhances multi-label CXR recognition performance. By fully considering and modeling global label relationships, MRChexNet can effectively raise the confidence scores of associated pathologies and suppress those of nonassociated pathologies. For example, in column 1, row 2, MRChexNet fully considers the pathological relationship between effusion and atelectasis. In the presence of effusion, the corresponding confidence score for atelectasis rose from 0.5210 to 0.9319, an improvement of approximately 0.4109 over vanilla DenseNet-169. For the weakly correlated labels, effusion ranked first in column 2, row 3 in terms of the DenseNet-169 score, whereas under MRChexNet, which fully considers the global interlabel relationships, its confidence score does not reach the top 8. To some extent, this demonstrates the ability of our model to suppress the confidence scores of nonrelevant pathologies.
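Label-specific activation maps of the kind shown in Figure 9 are commonly computed in the CAM style: each channel of the last convolutional feature maps is weighted by the classifier weight for the target label. The paper does not specify its exact visualization procedure, so the following is a generic hedged sketch with hypothetical shapes:

```python
import numpy as np

def label_activation_map(feature_maps, class_weights):
    """CAM-style map: weight each channel by the classifier weight for one label."""
    # feature_maps: (C, H, W) last conv features; class_weights: (C,) for the label
    cam = np.tensordot(class_weights, feature_maps, axes=1)  # -> (H, W)
    cam = np.maximum(cam, 0)                                 # keep positive evidence
    return cam / (cam.max() + 1e-8)                          # normalize to [0, 1]

# Toy inputs: 16 channels of 7x7 features, random weights for one pathology.
feats = np.random.rand(16, 7, 7).astype(np.float32)
w_label = np.random.rand(16).astype(np.float32)
cam = label_activation_map(feats, w_label)
assert cam.shape == (7, 7) and cam.min() >= 0.0 and cam.max() <= 1.0
```

Upsampling the resulting map to the input resolution and overlaying it on the CXR yields the per-pathology heatmaps used for qualitative assessment.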

Conclusions
Improving the performance of multi-label CXR recognition algorithms in clinical environments requires considering the correspondence between pathology labels in different modalities and capturing the correlations between related pathologies; it is therefore vital to align pathology-relationship representations across modalities and to learn the relationship information of pathologies within each modality. In this paper, we propose a multi-modal bridge and relational learning method named MRChexNet to do both. Specifically, our model first extracts pathology-specific feature representations in the imaging modality through a practical RLM. Then, an efficient MBM aligns pathological word embeddings with image-level pathology-specific feature representations. Finally, a novel PGL comprehensively learns the correlations of pathologies within each modality. Extensive experimental results on ChestX-Ray14 and CheXpert show that the proposed MBM and PGL effectively enhance each other, significantly improving the model's multi-label CXR recognition performance with satisfactory results. In the future, we will introduce a relation weight parameter in pathology relation modeling to learn more accurate pathology relations and further improve multi-label CXR recognition performance.
In the future, we will also extend the applicability of the proposed method to other imaging modalities, such as optical coherence tomography (OCT). OCT is a noninvasive optical imaging modality that provides histopathology images with microscopic resolution [42][43][44][45], and extending the proposed method to OCT-based pathology image analysis is one of our next research directions. In addition, the interpretability and readability of models has been a hot research topic in making deep learning techniques applicable to clinical diagnosis; another direction is therefore to make our model more understandable and credible for clinicians.

Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.

Figure 1 .
Figure 1. Illustration of pathology relationships and alignment problems in different data modalities. Left: the pathology correlations within each modality. Right: we align the representation of pathologies across modalities. The arrows in the figure indicate directed relationships: "Pathology A → Pathology B" means that when Pathology A appears, Pathology B is likely to have occurred, but the converse does not necessarily hold.

Figure 2 .
Figure 2. The overall framework of our proposed MRChexNet.

Figure 4 .
Figure 4. Example images and the corresponding labels in the ChestX-Ray14 and CheXpert datasets. Each image is labeled with one or more pathologies. In CheXpert, the uncertain pathologies are marked in red.

Figure 6 .
Figure 6. ROC curves of MRChexNet on ChestX-Ray14 and CheXpert, respectively. The corresponding AUC scores are given in Tables 1−3.

Figure 10 .
Figure 10. Visualization results of our model scoring the highest pathologies on the images to be tested in the ChestX-Ray14 dataset. We present the top-eight predicted pathology labels and the corresponding probability scores. The ground truth labels are highlighted in red.

Table 4 .
Comparison of AUC of MRChexNet with its different components on ChestX-Ray14.

Table 5 .
Comparison of the test time of MRChexNet with its different components.

Table 6 .
Performance with different numbers of GCN layers and GAT layers in the graph learning block of PGL.