Multimorbidity Content-Based Medical Image Retrieval and Disease Recognition Using Multi-Label Proxy Metric Learning

Content-based medical image retrieval is an important diagnostic tool that improves the explainability of computer-aided diagnosis systems and provides decision-making support to healthcare professionals. A common approach to content-based image retrieval is learning a distance metric by transforming images into a feature space where the distance between samples is a similarity measure. Proxy metric learning methods are effective at learning this transformation due to the use of proxy feature vectors that enable efficient learning. Training with a distance-based classification loss enables a single proxy model to be suitable for both retrieval and classification. However, these methods are designed only for single-label data, making them unsuitable for multimorbidity medical images. Addressing this, we propose a novel multi-label proxy metric learning method for content-based image retrieval and classification. Unlike existing proxy-based methods, training samples assign to multiple proxies that span multiple class labels. This results in a feature space that encodes the complex relationships between diseases. We introduce negative proxies to better encode the relationships between samples without detected diseases. The efficacy of our approach is demonstrated experimentally on two multimorbidity radiology datasets. Results show that our method outperforms state-of-the-art image retrieval systems and baseline approaches. Our method is clinically significant as it improves on two key factors shown to affect medical professionals’ willingness to use computer-aided diagnosis systems: accuracy and interpretability.

reports [2] and help to reduce the high inter-observer variability that occurs when analysing radiology images [3].
Deep metric learning methods learn a feature space in which distance is a measure of similarity. As such, metric learning is well suited to the problem of content-based image retrieval. The feature spaces learned by metric learning approaches have been shown to generalise well to unseen samples [4], [5], [6], [7], with demonstrated efficacy for retrieval [8], [9]. Conventional deep classification models often require vast quantities of high-quality annotated data in order to be accurately trained [10]. Metric learning models can reduce over-fitting [11], resulting in better performance in few-shot learning [12], [13] and data-limited scenarios [7]. This is particularly important in the medical domain, as some diseases may be rare and annotating data has a high associated cost.
One family of metric learning approaches is proxy-based methods [14], [15]. Proxies are trainable model parameters that are used to approximate the real distribution of training set feature vectors. Using proxies enables efficient model training, as training samples need only be compared to the relatively small number of proxies, rather than to one another. Proxy methods are able to learn faster than other metric learning methods and result in a more generic feature space than commonly used triplet-based methods [14]. However, existing proxy methods are not designed for multi-label data and cannot be directly applied to multimorbidity CBIR.
In this paper, we propose a novel proxy-based metric learning method for multimorbidity computer-aided diagnosis. Existing proxy approaches generally allow a sample to assign to just a single proxy during training. Unlike these approaches, our method is explicitly designed for multi-label data by allowing X-rays with multiple disease findings to assign to all relevant proxies. This results in a metric feature space that encodes the complex interactions between different diseases. Unlike existing methods that tend to define a single proxy per class label, we propose the use of several proxies per disease class, allowing the semantic variations that exist within the distribution of a single disease to be encoded in the feature space. In this way, our method supports both intra-class and inter-class multi-proxy assignment. Further, we propose the use of negative proxies. By treating negative samples (i.e. those with no positive labels) as their own training class with proxies, a feature space is learned that better encodes the relationships between X-ray samples with no disease findings.
While many metric learning approaches do not perform well for classification [16], our method optimises a classification loss directly on the feature vectors extracted by a deep neural network. This results in a trained model that can be successfully applied to both pathology classification and content-based image retrieval. The major contributions of this paper can be summarised as follows:
• We propose a novel proxy-based metric learning algorithm for multi-label data (Section III-C).
• We introduce negative class proxies for encoding the important relationships between X-ray samples with no disease findings (Section III-D).
• We propose defining multiple proxies for each class label and demonstrate the performance benefit (Sections III-C and IV-G).
• We demonstrate that our method outperforms conventional deep classification models (Section IV-E).
• We show that our approach achieves state-of-the-art CBIR performance on multimorbidity radiology datasets (Sections IV-D and IV-E).

II. RELATED WORK
In this section we contextualise our proposed method in terms of relevant literature in the fields of content-based medical image retrieval and metric learning. A comparative summary of the related literature and our approach is shown in Table 1.
A. CONTENT-BASED MEDICAL IMAGE RETRIEVAL
In the deep learning domain, multi-label metric learning and hashing is a common approach [3], [23], [24], [25]. Chen et al. [3] propose a method that optimises a combination of ranking loss and multi-label classification loss, while Conjeti et al. [23] introduce a Deep Residual Hashing network that incorporates a retrieval loss and regularisation techniques to improve CBIR performance. A pair-wise Deep Supervised Hashing method is proposed by Liu et al. [24], whereby images are mapped to discrete codes, allowing Hamming distance to be used as a measure of the similarity between samples. Further discussion on multi-label metric learning is found in Section II-B3.

TABLE 1. Summary of related literature. The strengths and limitations of literature methods are compared to our approach in terms of the methodological characteristics required for multimorbidity CBIR and recognition. The comparisons are made relative to our approach, allowing the method types to be evaluated with a simple yes or no for each characteristic. Note that Inter-class and Intra-class Encodings refer to a method's ability to learn rich representations that encode the relationships between and within classes. Detailed discussion of the literature is found in Section II.
Taking a different deep learning approach, Haq et al. [2] leverage a conventional Convolutional Neural Network (CNN) multi-label classifier trained with binary cross-entropy. A community-based graph structure is proposed for efficient search in large retrieval databases. As demonstrated in our experiments (Section IV), a conventional CNN classifier is limited in its ability to encode rich semantic relationships in the feature vector space. This results in poorer CBIR performance compared to our multi-label metric learning approach that directly optimises the feature space.

B. METRIC LEARNING
The aim of metric learning is to learn a feature space in which standard distance measures, such as Euclidean distance, can be used as a measure of similarity. For example, one would expect the feature vectors belonging to sample images with similar semantic content to be located nearby, while those from images with dissimilar semantic content to be located further apart. Metric feature spaces have applications including retrieval [38], ranking [8], out-of-distribution detection [39] and novel class image generation [40].

1) TRIPLET METHODS
One of the main approaches to metric learning is derived from Siamese networks [41] with contrastive loss [42], [43], whereby positive pairs of images (with matching semantic content) and negative pairs (with non-matching semantic content) are passed through the same network. The network parameters are updated such that the feature vectors of positive pairs are pulled together, while those of negative pairs are pushed apart. An improvement on pairwise metric learning methods are triplet-based methods [44], which construct trios of training images containing two samples with matching semantic content and one sample with differing semantic content. Triplet loss attempts to pull the anchor and positive samples closer together than the anchor and negative samples, by a set margin.
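For concreteness, the margin-based triplet objective described above can be sketched in a few lines of NumPy (a toy illustration with 2-D features, not the exact formulation of [44]):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss: the anchor-positive distance must undercut the
    anchor-negative distance by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])          # anchor
p = np.array([0.1, 0.0])          # matching semantic content
n = np.array([1.0, 0.0])          # non-matching semantic content
assert triplet_loss(a, p, n) == 0.0             # constraint already satisfied
assert abs(triplet_loss(a, n, p) - 1.1) < 1e-9  # violated: loss is active
```

When the margin constraint is already satisfied, the hinge zeroes the loss, so only violating triplets contribute gradients; this is what makes mining of "hard" triplets important.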
Metric learning literature often aims to improve triplet methods by performing intelligent mining of ''hard'' triplets [6], [27]. Other streams of literature aim to generalise triplet loss, for example by allowing multiple comparisons within a single mini-batch [5] or by employing a lifted structured embedding [4] that allows computation between every positive and negative pair in the batch. Local triplet loss can also be combined with a global loss term to improve performance [28]. Triplet-based metric learning has shown good performance in embedding learning problems and extreme classification, but is often poorer at regular classification problems than conventional neural network classifiers [16]. Further, triplet methods suffer from a computational bottleneck in triplet mining.

2) NEIGHBOURHOOD AND PROXY METHODS
Analysing a larger neighbourhood of samples at each training iteration can allow for more efficient updates to the model and a more robust distance metric [7]. Neighbourhood Component Analysis (NCA) [45] considers all nearby samples, minimising a probabilistic loss based on a sum of Gaussian distances within the neighbourhood. As the feature vectors of every sample change after each training iteration, it is computationally infeasible to minimise this loss exactly with a deep neural network. To make training practical, a cache of training feature vectors can be stored and periodically updated, allowing an approximation of NCA loss to be optimised [7]. Another approach to making neighbourhood methods computationally feasible is to employ proxy feature vectors [14]. Proxy features are trainable model parameters that are assigned a class label and used as a proxy for real training feature vectors belonging to that same class. By minimising a proxy-based version of NCA (Proxy-NCA [14], [15]), training features need only be compared to the proxy features, rather than to each other. As the number of proxies is generally set to equal the number of class labels, this significantly reduces the computational complexity. Kim et al. [29] propose a loss function with advantages of both pair-wise methods and proxy-based methods by assigning samples to proxies using sample-to-sample relations. Similarly, N-pair loss is implemented using proxies by Aziere and Todorovic [30], while SoftTriple loss [31] improves the representation of intra-class variations using a method analogous to proxies. In this work, we propose a novel multi-label proxy metric learning method that allows training samples to assign to multiple inter-class and intra-class proxies. This results in a model that can be applied to both multi-label classification and multi-label image retrieval.
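A minimal NumPy sketch of the single-label probabilistic Proxy-NCA objective follows; this is one common formulation of [14], [15] (the exact temperature and scaling terms vary between works):

```python
import numpy as np

def proxy_nca_loss(v, proxies, label):
    """Probabilistic Proxy-NCA for a single-label sample: pull the feature
    towards its class proxy and away from all other class proxies."""
    d2 = np.sum((proxies - v) ** 2, axis=1)  # squared distance to each proxy
    a = np.exp(-d2)                          # Gaussian similarity to each proxy
    return float(-np.log(a[label] / a.sum()))

proxies = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])  # one proxy per class
v = np.array([0.9, 0.1])
# The loss is smaller when the assigned proxy is the nearby one:
assert proxy_nca_loss(v, proxies, 0) < proxy_nca_loss(v, proxies, 2)
```

Because the sample is compared only against the (few) proxies rather than against all other training samples, each update is cheap, which is the efficiency argument made above.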
Similar to proxy methods, prototypical networks [32] perform neighbourhood metric learning by approximating the feature space distribution. This is done by minimising the distance between training samples and their corresponding ''prototype'' feature. Prototypes are calculated from the feature space using statistical methods. For example, Snell et al. [32] find a per-class prototype feature by calculating the mean of the training features belonging to that class. Prototypical networks can also be used for self-supervised clustering [46], with weighted mean prototype features calculated based on the probability of training features belonging to a given prototype cluster. Scattering loss is used to encourage the alignment of two sets of prototypes obtained from different views. This allows self-supervised representations to be learned without any training annotations.
Unlike prototype features, proxy features are trainable model parameters and are not calculated from the training feature means. This allows proxy methods to avoid an inefficient update step at each training epoch, which typically involves extra forward passes of training data [32] and, in the self-supervised case, running the k-means clustering algorithm [46]. Further, defining proxies as training parameters results in a natural method of multi-proxy assignment, which we demonstrate in Section III-C.

3) MULTI-LABEL METRIC LEARNING
Beyond the discussed medical CBIR approaches [3], [23], [24], [25] (Section II-A), other multi-label metric learning approaches include the multi-label extension of triplet loss by Sumbul et al. [33]. A two-step triplet sampling algorithm is proposed that uses multi-label similarity to select a diverse set of triplets for a training mini-batch. Annarumma and Montana [34] propose a multi-label triplet method that selects several positive examples for each training sample, such that each of the sample's positive class labels are present in the constructed positive pairs. As multi-label extensions of triplet learning, these methods inherit the inefficiencies of triplet metric learning discussed in Section II-B1, in terms of high computational complexity, limited ability to encode complex intra-class and inter-class relationships, and poor classification performance. To avoid these known problems with triplet loss, our work focuses on computationally efficient proxies and directly minimises a distance-metric based classification loss that is effective for both classification and content-based image retrieval. Beyond multi-label triplet methods, Li et al. [35] optimise a two-way distance metric loss between image and label embeddings, both extracted by neural networks. This method uses the entire neighbourhood of feature vectors to compute the loss. Avoiding such inefficient neighbourhood analysis is a primary motivator behind our use of proxies.
In addition to not being proxy methods, the discussed existing approaches do not give special consideration to negative samples (i.e. examples with no positive labels). Such samples are extremely common in medical data (e.g. a radiology image with no disease finding). Recognising this, our method is explicitly designed for such data via the introduction of negative proxies (Section III-D). Additionally, our approach was designed for the consolidated dual use case of medical image retrieval and pathology classification, while most existing multi-label metric learning methods were designed only for a single use case. We quantitatively evaluate our approach against existing multi-label metric learning methods [3], [24], [25] in Section IV-D.

4) USING MULTIPLE PROXIES PER CLASS
Further to enabling multi-label proxy metric learning, we propose to improve the representation of intra-disease variations by defining multiple proxies for each disease class. Although existing works have proposed to define multiple proxies for a single class [16], [31], [47], these methods differ significantly from our approach. Qian et al. [31] propose the use of multiple centroids for each class, but a single weighted per-class centroid is used in the loss function, limiting the intra-class variations that can be captured. Conversely, our loss function incorporates all proxies (both intra-class and inter-class) independently. Similar to approximate NCA methods [7], Liu et al. [47] couple a proxy with each data instance, which is by definition different to the dataset-approximating proxies in our method. The proxies used in our proposed approach are trained model parameters that are efficient approximations of the data distribution. Although not strictly proxies, Rippel et al. [16] use k-means clustering and penalise overlapping clusters from different classes, with multiple k-means centres defined for each class. In this sense, the centres can be considered proxies of the data distribution. However, neither the approach of Rippel et al. [16] nor that of Liu et al. [47] has the efficiency benefits of the parameterised proxies used in our paper.

FIGURE 2.
Overview of the training and querying of our system. (a) Training samples are paired with their ground truth multi-label disease annotations and a deep neural network extracts a feature vector from the image. Additional model parameters, known as proxy feature vectors, are defined in the feature space. Proxy features are assigned a class label and act as a ''proxy'' for the real distribution of training features belonging to that class. Although multiple proxies may be assigned to a single class, only one is shown in the figure for simplicity. During training, the feature extractor and proxies are updated to reduce the distance between training samples and their matching proxies, while increasing the distance between samples and their non-matching proxies. (b) At test time, the user inputs a query image to the trained system, and the deep neural network extracts a feature vector. The query feature is compared against a database of feature vectors. The closest k database features are found and the corresponding images are returned to the user. Our system also performs disease recognition by returning a per-class confidence score for each of the disease labels present in the training set.

III. METHOD
An overview of our approach is shown in Fig. 2. Given a query image, such as a chest X-ray, our method is able to predict the presence of multiple diseases, as well as return a set of semantically similar X-rays from a retrieval database to the user. The similarity of samples is measured by Euclidean distance in a feature vector space. Features are extracted by a convolutional neural network that is trained with our novel multi-label metric learning method.
A. PROBLEM DEFINITION
Let X = {x_1, x_2, x_3, . . . , x_n} be a set of n images from a multi-label dataset containing c class labels. The corresponding set of ground truth label vectors is Y = {y_1, y_2, y_3, . . . , y_n}, where y_i ∈ {0, 1}^c is the label vector for the i-th image. A value of 1 at location k in y_i indicates the presence of the k-th class in image x_i. We aim to learn a distance metric d, such that:

d(x_i, x_j) < d(x_i, x_k),  (1)

whenever image x_j is semantically more similar to x_i than image x_k is. Such a distance metric is achieved by learning a transformation from the image space to a feature vector space in which the Euclidean distance between features can be used as a measure of similarity, i.e.:

d(x_i, x_j) = ∥f(x_i) − f(x_j)∥_2,  (2)

where f represents a neural network encoder and f(x_i) is the feature vector that is output from the network for image x_i. Given a well learned metric feature vector space, we would expect that the distance between similar samples in the feature space will be small, while the distance between dissimilar samples will be large.
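As a toy illustration of this formulation, with a stand-in encoder in place of a trained network:

```python
import numpy as np

def f(x):
    """Toy stand-in for the trained encoder: image -> feature vector."""
    return x.astype(np.float32).ravel()

def d(x_i, x_j):
    """Learned metric: Euclidean distance between extracted features."""
    return float(np.linalg.norm(f(x_i) - f(x_j)))

x1 = np.array([[0, 1], [1, 0]])
x2 = np.array([[0, 1], [1, 1]])   # near-identical content
x3 = np.array([[9, 9], [9, 9]])   # dissimilar content
assert d(x1, x2) < d(x1, x3)      # similar pairs lie closer in feature space
```

In the actual system, f is the trained DenseNet encoder, so the ordering of distances reflects learned semantic similarity rather than raw pixel similarity.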

B. PROXY FEATURE VECTORS
The feature vector extracted from image x by the model f is defined as:

v = f(x),  (3)

where v ∈ R^d and d is the dimensionality of the feature vector space, realised by the model f. In general, the feature space may have any dimensionality; in our experiments f produces features with 1024 dimensions. A proxy p ∈ R^d is defined as a trainable feature vector that is assigned a class label and represents the set of, or a subset of, the real training sample feature vectors that belong to that class. During training, sample feature vectors are compared with the proxies rather than with the much larger number of other training samples.

FIGURE 3. Comparison of proxy relationships in Proxy-NCA [14] and our proposed method. The circles represent a proxy's Gaussian window, while the colour represents the training class. In Proxy-NCA, training samples can only assign to a single proxy, forcing all proxies to be located far apart. In our method, samples assign to multiple proxies, resulting in interactions between proxies. This allows the complex relationships between diseases to be better encoded in the feature space.
In this way, these trainable parameters act as a proxy for the real distribution of training samples.
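The definitions above can be sketched as follows; the class count, proxies-per-class count and feature dimensionality match the settings reported later in Section IV-A, but the sketch itself is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
c, m, dim = 9, 2, 1024          # class count, proxies per class, feature size
# Proxies are extra trainable model parameters, not statistics of the data:
proxies = rng.normal(size=(c, m, dim)).astype(np.float32)
proxies /= np.linalg.norm(proxies, axis=-1, keepdims=True)  # unit-sphere constraint

v = rng.normal(size=dim).astype(np.float32)
v /= np.linalg.norm(v)
# Each training feature is compared with c*m proxies rather than n-1 samples:
d2 = np.sum((proxies - v) ** 2, axis=-1)   # (c, m) squared distances to all proxies
```

With c·m = 18 proxies standing in for tens of thousands of training features, the per-sample comparison cost is independent of the dataset size.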

C. MULTI-LABEL METRIC LEARNING WITH PROXIES
Existing proxy-based metric learning methods [14], [15] are designed for datasets with a single positive class label per sample. Generally, these methods define a single proxy for each class label, and training samples are only able to be assigned to one proxy. Since medical imaging data is often multimorbidity, we propose a novel proxy-based metric learning approach for multi-label datasets. In this approach, training samples are able to assign to multiple proxies spanning multiple class labels. We also generalise our method to allow for the definition of multiple proxies per class. Having more than one proxy for a single class may allow for intra-class variations to be better captured by the model, as well as more complex interactions between class labels. This means that a sample's multi-proxy assignment may occur both inter-class and intra-class. Figure 3 illustrates this difference compared to traditional proxy methods. We define a set of m proxies for each of the c class labels as:

P = {p_i,j | 1 ≤ i ≤ m, 1 ≤ j ≤ c},

where each p_i,j ∈ R^d corresponds to the i-th proxy feature for the j-th class label. During training, we optimise a multi-label proxy-based loss that allows training features to assign to multiple proxies. The multi-proxy assignment happens both intra-class (when m > 1) and inter-class, when a training sample has more than one positive class label. The loss function L for sample x with labels y is defined in (4). To deal with the large imbalance of positive and negative occurrences of classes that is common in medical imaging data [48], [49], per-class positive and negative weights are used in the loss function. With n_j+ being the total number of positive class j samples and n_j− being the number of negative class j samples, the positive and negative weights for class j are defined as w_j+ = n_j− /(n_j+ + n_j−) and w_j− = n_j+ /(n_j+ + n_j−), respectively. Before calculating the loss, feature vectors and proxies are normalised:

L(x, y) = − Σ_{j=1..c} [ w_j+ · y_j · log(a_j) + w_j− · (1 − y_j) · log(1 − a_j) ],  (4)

where a_j = max_{1≤i≤m} e^{−∥v − p_i,j∥² / (2σ²)} is the Gaussian window score of the sample's nearest class-j proxy.
The hyperparameter σ is a fixed value that sets the width of the Gaussian windows in our multi-label proxy loss function. This value is the same for all proxies.
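Since the loss is built from per-class weighted terms over Gaussian proxy windows, one plausible NumPy reading of this section (assuming the nearest-proxy score of Section III-F inside weighted cross-entropy-style terms; the exact printed form may differ) is:

```python
import numpy as np

def class_scores(v, proxies, sigma=0.7):
    """Gaussian window of the nearest proxy of each class (cf. Section III-F)."""
    d2 = np.sum((proxies - v) ** 2, axis=-1)           # (c, m) squared distances
    return np.exp(-d2 / (2 * sigma ** 2)).max(axis=1)  # (c,) nearest-proxy score

def multilabel_proxy_loss(v, proxies, y, w_pos, w_neg, sigma=0.7, eps=1e-7):
    """Class-weighted multi-label proxy loss: positive labels pull the feature
    towards their nearest class proxy, negatives push it away from all proxies
    of that class. A sketch of one plausible reading, not the exact equation."""
    s = np.clip(class_scores(v, proxies, sigma), eps, 1 - eps)
    return float(-np.sum(w_pos * y * np.log(s) + w_neg * (1 - y) * np.log(1 - s)))

# Two classes, one unit-norm proxy each; the sample is positive for class 0 only.
proxies = np.array([[[1.0, 0.0]], [[0.0, 1.0]]])       # shape (c=2, m=1, d=2)
y, w = np.array([1.0, 0.0]), np.array([0.5, 0.5])
near = multilabel_proxy_loss(np.array([1.0, 0.0]), proxies, y, w, w)
far = multilabel_proxy_loss(np.array([0.0, 1.0]), proxies, y, w, w)
assert near < far   # sitting on the positive proxy gives a lower loss
```

Because the per-class terms are independent, a sample with several positive labels is simultaneously pulled towards proxies of all of those classes, which is the inter-class multi-proxy assignment described above.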

D. NEGATIVE PROXIES
Training samples that are negative for all labels, e.g. healthy chest X-rays with no disease findings, can be considered as their own class for model training purposes. Formally, an X-ray is considered a negative sample when ∥y∥_0 = 0, where the 0-norm calculates the number of non-zero elements in y.
An additional element can be concatenated to the label vector to represent negative samples, i.e. ỹ = y ⊕ δ(∥y∥_0 = 0), where ⊕ denotes concatenation and δ(.) ∈ {0, 1} is an indicator function. As negative samples are now represented as a class label, proxies must also be defined for this new class. Training with additional proxies for negative samples can result in a better structured feature space, as negative samples will cluster together around those proxies. Without such proxies, the only training constraint for negative samples is for them to be located far away from positive class proxies. However, there is no loss term that encourages negative samples to be located nearby one another, despite their semantic similarity. The inclusion of proxies for negative samples, which we name negative proxies, introduces a training constraint that negative samples should be located nearby in the feature space (Figure 4). To our knowledge, negative proxies have not been used in prior proxy metric learning literature. As shown in our experiments in Section IV-H, negative proxies result in both better classification and CBIR performance.

Algorithm 1 Training Procedure
Require: Model f with parameters θ_f; Proxies P; Dataset X, Y.
1: while not converged do
2:   Sample (x, y) ∼ X, Y ▷ Sample training data.
3:   v ← f(x) ▷ Extract feature vector.
4:   v ← v / ∥v∥_2 ▷ Normalise feature vector.
5:   for each p_i,j ∈ P do
6:     p_i,j ← p_i,j / ∥p_i,j∥_2 ▷ Normalise proxy.
7:   ỹ ← y ⊕ δ(∥y∥_0 = 0) ▷ Concatenate (⊕) label to represent negative samples.
8:   L ← loss of (v, ỹ) according to (4)
9:   Update θ_f and P using Adam [50]

E. TRAINING ALGORITHM
The training procedure for our approach is shown in Algorithm 1. Image and label pairs are sampled from the training dataset and feature vectors are extracted from the images. Both the feature vectors and the proxies are then constrained to a unit sphere using L2-normalisation. The label vector is modified to include an extra dimension that represents the negative (healthy) class. This extra label is set to a value of one when all other elements of the label vector are zero, otherwise the negative label is set to a value of zero. The extra label dimension is needed due to the inclusion of the negative proxies, as discussed in Section III-D. The loss is then calculated according to (4), and model parameters and proxies are updated using the Adam optimisation algorithm [50].

F. MULTI-LABEL CLASSIFICATION
We perform multi-label classification by analysing the similarity between a sample's feature vector and each of the proxy feature vectors (Figure 5). The classification score for disease label j is:

score_j = max_{1≤i≤m} e^{−∥v − p_i,j∥² / (2σ²)},  (5)

where 0 ≤ score_j ≤ 1. For each class j, we calculate the distance between the feature vector v and each of the proxies belonging to class j. The classification score is then calculated based on the distance to the nearest proxy of that class, where a score close to 1 indicates a high likelihood that disease j is present in the sample image, while a score close to 0 indicates a low likelihood. A prediction pred_j for class j can then be made by comparing the classification score to a discrimination threshold t_j for that class, as shown in (6):

pred_j = δ(score_j ≥ t_j).  (6)
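A sketch of this scoring and thresholding procedure for a single class (assuming the nearest-proxy Gaussian score described above):

```python
import numpy as np

def score(v, class_proxies, sigma=0.7):
    """Per-class confidence: Gaussian window of the nearest class proxy."""
    d2 = np.sum((class_proxies - v) ** 2, axis=-1)
    return float(np.exp(-d2.min() / (2 * sigma ** 2)))

def predict(v, class_proxies, t=0.5, sigma=0.7):
    """Binary prediction by thresholding the score, as in (6)."""
    return int(score(v, class_proxies, sigma) >= t)

p_j = np.array([[1.0, 0.0], [0.0, 1.0]])          # two proxies for class j
assert predict(np.array([0.99, 0.05]), p_j) == 1  # near a class-j proxy
assert predict(np.array([-1.0, 0.0]), p_j) == 0   # far from all class-j proxies
```

Taking the maximum over the m class proxies means a sample only needs to match one intra-class mode (e.g. one visual presentation of a disease) to receive a high score.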
G. CONTENT-BASED IMAGE RETRIEVAL
Given a well structured metric feature space, we expect the feature vectors of samples with similar pathology information to be located nearby. As such, we perform content-based image retrieval for a query image q by returning the k database images corresponding to the k feature vectors that are nearest to the query sample's feature vector (Figure 5). The image retrieval database, i.e. the set of images from which samples are retrieved based on a query, is constructed from the labelled training samples. The image retrieval procedure is outlined in Algorithm 2. The feature vector distance between the query sample and each of the samples in the database is calculated. The database samples are then sorted based on the distance to the query sample, in ascending order. Finally, the k images that are most similar to the query image are returned to the user as the output of the CBIR system.
Algorithm 2 Content-Based Image Retrieval
Require: Trained model f; Retrieval database; Query image q; Number of images to retrieve k.
1: v_q ← f(q) ▷ Extract query feature vector.
2: for each database image x_i do
3:   d_i ← ∥v_q − f(x_i)∥_2 ▷ Distance to database sample.
4: Sort database samples by d_i in ascending order
5: R ← ∅
6: i ← 1
7: while i ≤ k do
8:   R ← R ∪ {x_i} ▷ Add next similar image to R.
9:   i ← i + 1
10: return R
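The retrieval procedure amounts to a k-nearest-neighbour search in the feature space, e.g.:

```python
import numpy as np

def retrieve(v_q, db_features, k):
    """Return indices of the k database features nearest to the query feature."""
    dists = np.linalg.norm(db_features - v_q, axis=1)  # Euclidean distances
    return np.argsort(dists)[:k]                       # ascending: most similar first

db = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
assert retrieve(np.array([0.9, 0.1]), db, 2).tolist() == [1, 0]
```

In practice the database features would be extracted once by the trained encoder and cached, so each query costs a single forward pass plus the distance computation.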

H. BASELINES FOR EVALUATION
We compare our proposed approach to seven state-of-the-art CBIR methods from the literature [2], [3], [18], [19], [20], [24], [25], including CNN classifier methods, feature-based deep learning methods and hashing methods. For further evaluation, we train appropriate baseline models to benchmark our approach against. These baselines are described below. For fairness, our method uses the same base network architecture as all baseline methods, as well as the highest performing evaluated method from the literature [2].

1) MULTI-LABEL CLASSIFIER (DENSENET W/ BCE)
This method is a conventional CNN classifier, trained with multi-label binary cross entropy loss. The head of the network is a fully connected layer that outputs class-wise prediction scores. To extract feature vectors for image retrieval, we bypass the final fully connected layer, resulting in a feature vector with the same dimension as v, described in Section III-C. This method is the state-of-the-art literature approach proposed by Haq et al. [2] in terms of model architecture and training, but does not include the nearest neighbour graph.
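The split between retrieval features and classification scores in this baseline can be sketched with a stand-in trunk (random weights here; in the baseline the trunk is a trained DenseNet121):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, c = 1024, 9                               # feature size; 9 CheXpert classes
P = rng.normal(size=(16, dim)) * 0.1           # stand-in trunk (random projection)
W, b = rng.normal(size=(c, dim)) * 0.1, np.zeros(c)   # final fully connected head

def features(x):
    """Retrieval path: the final fully connected layer is bypassed."""
    return np.tanh(x @ P)

def class_scores(x):
    """Classification path: full forward pass with element-wise sigmoid."""
    z = features(x) @ W.T + b
    return 1.0 / (1.0 + np.exp(-z))

x = rng.normal(size=16)
assert features(x).shape == (dim,) and class_scores(x).shape == (c,)
```

The key point is that both paths share the same 1024-d feature vector; the baseline's feature space is shaped only indirectly, via the cross-entropy loss on the head's outputs.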

2) MULTI-LABEL PROXY-NCA (ML-ProxyNCA)
The standard Proxy-NCA [14], [15] loss function can be naively extended to the multi-label case by optimising the following loss function:

L(x, y) = − Σ_{j=1..c} y_j · log( a_j / Σ_{i=1..c} a_i ),  (7)

where a_i = e^{−∥v−p_i∥² / (2σ²)} and p_i is the proxy for the i-th class. In the multi-label case, the outer sum over classes in (7) allows a training feature vector to pull towards the proxies belonging to all positive labels. In the case where each sample only has a single positive label, the equation in (7) becomes the standard probabilistic Proxy-NCA loss function [15].
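A NumPy reading of this naive extension (with the Gaussian similarities a_i defined as above):

```python
import numpy as np

def ml_proxy_nca(v, proxies, y, sigma=0.7):
    """Naive multi-label Proxy-NCA: sum the per-class probabilistic NCA
    terms over every positive label (a single proxy p_i per class)."""
    a = np.exp(-np.sum((proxies - v) ** 2, axis=1) / (2 * sigma ** 2))
    return float(-np.sum(y * np.log(a / a.sum())))

proxies = np.array([[1.0, 0.0], [0.0, 1.0]])   # one proxy per class
v = np.array([1.0, 0.0])
l_pos = ml_proxy_nca(v, proxies, np.array([1.0, 0.0]))  # positive for nearby class
l_neg = ml_proxy_nca(v, proxies, np.array([0.0, 1.0]))  # positive for distant class
assert 0 < l_pos < l_neg
```

Note the structural weakness this exposes: the softmax denominator forces the positive-class proxies to compete with one another, so a sample with several positive labels cannot be close to all of its proxies at once.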

IV. EXPERIMENTS

A. IMPLEMENTATION
We use a DenseNet121 architecture [51] as our feature extractor network f, producing a feature vector v with 1024 dimensions. The Adam optimisation algorithm [50] is used for model training, with coefficients β_1 = 0.9 and β_2 = 0.999. Our proposed method is trained for 50 epochs, with a learning rate of 10^−4 and a batch size of 48. The loss function hyperparameter σ is set to 0.7, and unless otherwise stated, two proxies are used per class. Hyperparameter values were tuned using a withheld validation set. During training, images are first resized to 270 × 270 and then randomly cropped to 224 × 224. Finally, images are normalised to the range [-1, 1]. For evaluation images (retrieval database and queries), the random cropping is replaced by a centre crop. For evaluating content-based image retrieval, we compute the metrics using the k-nearest database samples for each query. For the CheXpert and NIH datasets, k is set to 10 and 100, respectively. This evaluation protocol follows the set-ups used in Haq et al. [2] and Chen et al. [3]. In the interest of fairness, the same backbone network, data augmentation and data preprocessing are used for the baseline methods.
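The cropping and normalisation pipeline can be sketched in NumPy (resizing omitted; the input is assumed already resized to 270 × 270):

```python
import numpy as np

def preprocess(img, crop=224, train=True, rng=None):
    """Crop a resized 270x270 image (random crop for training, centre crop for
    evaluation) and scale uint8 pixel values to the range [-1, 1]."""
    h, w = img.shape[:2]
    if train:
        rng = rng or np.random.default_rng()
        top, left = rng.integers(0, h - crop + 1), rng.integers(0, w - crop + 1)
    else:
        top, left = (h - crop) // 2, (w - crop) // 2
    patch = img[top:top + crop, left:left + crop]
    return patch.astype(np.float32) / 127.5 - 1.0

white = preprocess(np.full((270, 270, 3), 255, dtype=np.uint8), train=False)
assert white.shape == (224, 224, 3) and white.max() == 1.0
```

Using a deterministic centre crop at evaluation time keeps the retrieval database and query features reproducible across runs.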

B. DATASETS
CheXpert [48] comprises frontal and lateral chest X-rays from 67,740 individual patients. Following Haq et al. [2], we use the nine most common diseases found in the dataset. The diseases and their shorthand abbreviations used in this paper are: Enlarged Cardiomediastinum (EC), Cardiomegaly (CM), Lung Opacity (LO), Edema (ED), Consolidation (CS), Pneumonia (PNA), Atelectasis (AT), Pneumothorax (PTX) and Pleural Effusion (PE). Each sample is labelled as positive, negative or uncertain for each disease. Uncertain labels are ignored during training, but treated as positive for CBIR purposes, following the set-up used by Haq et al. [2]. The NIH Chest X-ray Dataset [49] consists of frontal-view X-ray images from 30,805 patients. Again following Haq et al. [2], we use the 13 most common disease labels in our experiments.

C. EVALUATION METRICS
For evaluating CBIR performance, we analyse the three retrieval metrics described below.

1) NORMALISED DISCOUNTED CUMULATIVE GAIN (nDCG)
Discounted Cumulative Gain is the sum of the graded relevance of all of the retrieved images based on their rank position:

DCG = Σ_{n=1..k} r_n / log_2(n + 1),  (8)

where k is the number of images that the system is retrieving and r_n is the graded relevance value of the n-th retrieved image. Each graded relevance value is the number of common positive labels between the query and retrieved image. Each value is discounted logarithmically in proportion to the rank position n of the retrieved image. This means that a highly relevant retrieved image will be penalised if it has a low rank. The normalised DCG (nDCG) is defined as:

nDCG = DCG / iDCG,  (9)

where the ideal DCG (iDCG) is the maximum possible DCG that can be achieved based on the dataset.
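A NumPy sketch of these two measures, using the common log_2(n + 1) rank discount and a per-query ideal ranking for iDCG (an assumption where the text does not pin down the exact normalisation):

```python
import numpy as np

def dcg(rels):
    """Discounted cumulative gain of a ranked list of graded relevances."""
    ranks = np.arange(1, len(rels) + 1)
    return float(np.sum(np.asarray(rels, dtype=float) / np.log2(ranks + 1)))

def ndcg(rels):
    """Normalise by the ideal ranking of the same relevances (iDCG)."""
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

assert ndcg([3, 2, 1]) == 1.0     # already ideally ranked
assert 0 < ndcg([1, 2, 3]) < 1.0  # relevant images ranked too low are penalised
```

The discount means swapping a highly relevant image from rank 1 to rank 3 costs more than any change among the tail ranks, which is exactly the ranking sensitivity nDCG is meant to capture.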

2) AVERAGE CUMULATIVE GAIN (ACG)
The ACG is defined as:

ACG = (1/k) Σ_{n=1..k} s_n,  (10)

where s_n is the graded similarity value of the n-th retrieved image. Graded similarity is defined as the ratio of the number of common positive labels between the query image and the n-th retrieved image, and the total number of positive labels in the query.

3) PRECISION (PREC.)
The CBIR precision is the ratio of the number of relevant retrieved images to the total number of retrieved images, k. Each retrieved image that has at least one common positive label with the query image is considered to be a relevant image. Precision is defined in (11), where δ(.) ∈ {0, 1} is an indicator function:

Prec = (1/k) Σ_{n=1..k} δ(r_n > 0).  (11)
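Both of these remaining retrieval metrics reduce to simple means over the k retrieved images:

```python
import numpy as np

def acg(sims):
    """Average cumulative gain: mean graded similarity over the k retrievals."""
    return float(np.mean(sims))

def precision(rels):
    """Fraction of retrieved images sharing at least one positive query label."""
    return float(np.mean(np.asarray(rels) > 0))

# Query with 2 positive labels; retrieved images share 2, 1 and 0 labels with it.
common = np.array([2, 1, 0])
assert acg(common / 2) == 0.5          # graded similarity = common / query labels
assert abs(precision(common) - 2 / 3) < 1e-9
```

Unlike nDCG, neither measure is rank-aware: shuffling the k retrieved images leaves both ACG and precision unchanged.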
In order to evaluate pathology classification performance, we report the Area Under the Receiver Operating Characteristic curve (AUC). The receiver operating characteristic curve is a plot of the classifier's True Positive Rate (TPR) against its False Positive Rate (FPR), produced by sweeping the classifier's discrimination threshold. The AUC lies between 0 and 1, where a higher value indicates a more robust classifier that achieves a higher TPR across the range of FPRs.
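Equivalently, the AUC is the probability that a randomly chosen positive sample is scored above a randomly chosen negative one; a minimal sketch of this rank-based computation (our own illustration, not the paper's evaluation code):

```python
import numpy as np

def auc_score(scores, labels):
    """AUC via the Mann-Whitney U statistic: the fraction of
    (positive, negative) pairs in which the positive sample is
    scored higher, counting ties as half."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=bool)
    pos, neg = s[y], s[~y]
    # Compare every positive score against every negative score.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (pos.size * neg.size))

# Perfect separation of positives from negatives gives AUC = 1.
print(auc_score([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # → 1.0
```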

D. NIH LITERATURE COMPARISON
We compare our approach to state-of-the-art image retrieval methods from the literature on the NIH dataset. All of the compared methods were designed for multi-label data, and three of them are multi-label metric learning approaches [3], [24], [25]. We follow the experimental set-ups from Haq et al. [2] and Chen et al. [3]. For training, 12,000 images are used, each containing at least one positive label from one of the 13 most common diseases in the dataset. As such, negative proxies are not used in this experiment. For testing, a further 1,000 samples are selected. As seen in the retrieval nDCG results in Table 2, our method significantly outperforms the existing methods on the CBIR task.

E. CheXpert EVALUATION
To further evaluate our method, we compare both classification and CBIR results on the CheXpert dataset to the baseline methods detailed in Section III-H. As training sample efficiency is important in the medical domain, due to the difficulty of obtaining high-quality annotations, we evaluate performance across a range of training set sizes, with a particular interest in the smaller sizes. In these experiments, both positive samples (with at least one positive label from one of the nine most common diseases) and negative samples (no positive labels) are used. As such, negative proxies are leveraged in these experiments. Table 3 shows all classification and CBIR metrics using a training set size of 4096 samples. Our approach outperforms the baseline methods across all classification and CBIR metrics. The comparatively poor performance of the naive multi-label extension of Proxy-NCA (ML-ProxyNCA) in Tables 2 and 3 demonstrates that multi-label proxy metric learning is non-trivial to achieve. The naive extension performs comparatively poorly at both pathology classification and CBIR, while our novel formulation of proxy metric learning for multi-label data is highly effective for both tasks. Fig. 6 compares the performance of our method to the classifier baseline (DenseNet w/ BCE) across a wide range of dataset sizes (from 1,024 to 65,536 samples). These results show a consistent and significant performance advantage for our method in both pathology classification and CBIR. The advantage holds across both small and large training set sizes, and is largest when training data is limited. This shows the suitability of our method to data-constrained medical problems.
These particular samples demonstrate a common failure mode of the literature method [2], wherein retrieved images often contain no disease findings or only a small subset of the diseases present in the query. Note that the literature method results are from our trained version, which does not contain the approximate nearest neighbour search step, making it equivalent to the DenseNet w/ BCE baseline; that is, the literature method with perfect nearest neighbour search.

F. QUALITATIVE EVALUATION
Example image retrieval results for sample query X-rays are shown in Figs. 1 and 7. Query images are from a test set that is withheld during training. Disease annotations are indicated by the shorthand names defined in Section IV-B. In general, there is a strong similarity between the disease labels of the query samples and the retrieved samples. Fig. 9 uses t-SNE visualisations [52] of the retrieval database feature vector space to show some of the relationships between diseases learned by the model. Each visualisation selects two disease labels and indicates by colour the samples that have only one of those labels positive or both labels positive. Samples are co-located in the feature space based on their combination of positive disease labels. Fig. 8 highlights a common failure mode of both the top-performing literature method [2] and the DenseNet w/ BCE baseline: both methods often retrieve images with no disease findings for certain query samples containing multiple disease labels. Our method avoids this failure mode.

G. NUMBER OF PROXIES PER CLASS
Fig. 10 shows the effect that varying the number of proxies defined for each class label has on the classification and CBIR performance. There is a benefit to having multiple proxies per class, with two providing the best results. Interestingly, as the number of proxies per class passes eight, the performance begins to drop below the level of a single proxy per class. This is likely due to over-parameterisation of the feature space, resulting in overfitting and proxies that do not generalise as well to unseen samples. The extreme case would be one proxy for each training sample, where, without additional regularisation, the model would not necessarily need to generalise.

H. NEGATIVE PROXIES
We analyse the effect of using negative proxies both quantitatively and qualitatively. Table 3 shows that excluding negative proxies during training results in a performance drop for both classification and CBIR. The t-SNE visualisation [52] of the feature vector space in Fig. 11 shows that without negative proxies, more positive samples are peppered throughout the primary negative sample cluster, compared to when negative proxies are used. Negative proxies are important for accurately encoding the semantic content of samples with no disease findings, and allow for better discrimination between negative and positive samples.

A. CLINICAL BENEFITS OF METHOD
Computer-aided diagnosis systems have a high barrier to clinical acceptance; clinicians must trust the systems being used in their day-to-day work. A survey of 758 practising clinicians and medical trainees by Chen et al. [53] identified seven factors associated with clinicians' willingness to use medical artificial intelligence systems. These factors were: accuracy, efficiency, ease of use, existing widespread adoption, cost-effectiveness, interpretability and privacy protection capability. Our paper focuses on improving computer-aided diagnosis systems' accuracy and interpretability for multimorbidity data. Chen et al. [53] found that accuracy affected the willingness of 88.26% of respondents to use clinical artificial intelligence, while interpretability affected the willingness of 64.38%. These results indicate that computer-aided diagnosis systems must be both accurate and explainable for successful widespread clinical adoption. In addition to not gaining clinician trust, an inaccurate computer-aided diagnosis system may even increase the clinical workload and worsen patient outcomes, in direct contradiction with the aims of such systems. Explainable systems that are able to justify or provide context for diagnostic predictions help to open the "black box" of computer-aided diagnosis.
Our method advances the state of both discussed factors. The proposed model jointly performs disease recognition and content-based image retrieval. The results in Tables 2 and 3 show that our model performs more accurate disease recognition and retrieval than previous methods. As a joint recognition and retrieval model, our approach can explain and justify diagnostic predictions made by the model by way of the retrieved samples. In other words, our model is able to open the prediction black box by returning the database samples that contextualise a certain pathological prediction. In practice, such a system may also return the pathology reports of database samples, providing further context. This has the additional advantage of being a useful training tool.
Many of the compared state-of-the-art retrieval methods are not joint recognition systems, and therefore do not provide context to any computerised diagnostic system, but rather exist as stand-alone CBIR models. Furthermore, attaching a nearest neighbour classifier to a retrieval system that was not explicitly trained for classification often results in poor performance [16].
The qualitative results in Figure 8 show scenarios where our method avoids a common failure mode of the top existing joint recognition and retrieval method [2]. For samples where the existing approach returns images with limited or no disease findings, our method retrieves samples that are pathologically similar to the queries. This improvement is of clinical significance as it advances the discussed factors for clinical acceptance: accuracy and interpretability. The model has higher retrieval accuracy and consequently provides better context and explanations for given predictions by retrieving similar database samples. Conversely, the existing literature method often returns samples with no disease findings, which would result in confusing feedback to the clinician.

B. LIMITATIONS AND FUTURE WORK
Although our method outperforms existing CBIR systems, certain limitations can be improved in future work. Our method is highly efficient during training due to the use of proxies. However, when querying the system, the database samples that are nearest to the query are found by performing brute-force nearest neighbour search. With a high-dimensional feature space, exact nearest neighbour search does not scale well to very large database sizes and can become a computational bottleneck. This can be improved by employing efficient approximate nearest neighbour search methods, such as in the work by Haq et al. [2]. Such algorithms are a scalable solution to the bottleneck, but achieve this by sacrificing some search accuracy.
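The brute-force retrieval step described above can be sketched as follows (a minimal NumPy illustration with toy data; at scale, an approximate index such as that used by Haq et al. [2] would replace the exhaustive comparison):

```python
import numpy as np

def retrieve_top_k(query, database, k):
    """Exact nearest-neighbour search by exhaustive Euclidean comparison.

    query:    (d,) feature vector produced by the embedding model
    database: (N, d) matrix of database feature vectors
    Returns the indices of the k closest database samples, nearest first.
    Cost is O(N * d) per query, which is the bottleneck discussed above.
    """
    dists = np.linalg.norm(database - query[None, :], axis=1)
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 128))          # toy database of 1000 embeddings
q = db[42] + 0.01 * rng.normal(size=128)   # query very close to sample 42
print(retrieve_top_k(q, db, k=5)[0])       # → 42
```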
Proxies have the limitation of simply being approximations of the true feature space distribution. This allows us to achieve training simplicity and efficiency; however, it also means that some of the benefits of more fine-grained neighbourhood metric learning methods [7] may be lost. Such approaches, which allow comparisons between a training sample and a large set of nearest neighbours, excel at encoding fine-grained relationships in the feature space. A direction for future research to address this limitation is the utilisation of intra-batch pairwise comparisons. A hybrid proxy and pairwise method for multi-label data may be able to achieve some of the benefits of fine-grained neighbourhood methods while maintaining training efficiency. Similar approaches have succeeded in the single-label domain [29], [30].
Defining multiple proxies for each disease resulted in improved classification and CBIR performance. This leads to a research question: can defining a variable number of proxies across diseases help to alleviate the effects of unbalanced data? Such unbalanced data is common in the medical domain, where particular disease annotations may be scarce due to the rarity of the disease. Setting per-disease proxy counts with consideration of the disease distribution may help to improve model training on unbalanced data. We leave this as a promising future research direction.
Self-supervised methods utilising contrastive learning enable models to extract meaningful features without manually annotated training samples [54]. This is often achieved by performing heavy transformations on an unlabelled training image, resulting in a new image that is visually different to the original but contains the same semantic content. The original and transformed images can then be used as a positive training pair in the contrastive learning algorithm. Such an approach could be incorporated into our supervised training algorithm, resulting in a semi-supervised method that can learn from large amounts of unlabelled samples and a smaller set of labelled samples. Our loss function can be augmented with a scaled contrastive loss term, with the mini-batch sampled jointly from the labelled and unlabelled datasets. Semi-supervised methods are of particular interest in the medical domain due to the high cost of obtaining accurate annotations. Leveraging large, digitised medical imaging databases to improve the performance of our model without any annotation cost is a promising research direction.
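The proposed semi-supervised extension could be sketched as follows, assuming a standard NT-Xent-style contrastive term [54] scaled by a weight and added to the supervised proxy loss. This is a hypothetical NumPy sketch of the contrastive component only, not the paper's implemented objective; all names are ours:

```python
import numpy as np

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent contrastive loss for a batch of paired embeddings.

    z1, z2: (B, d) L2-normalised embeddings of two augmented views of the
    same unlabelled images. Each view's positive is its counterpart in the
    other view; all other samples in the batch act as negatives.
    """
    z = np.concatenate([z1, z2], axis=0)   # (2B, d) stacked views
    sim = z @ z.T / temperature            # scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)         # exclude self-similarity
    b = z1.shape[0]
    targets = np.concatenate([np.arange(b, 2 * b), np.arange(b)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(2 * b), targets].mean())

# Toy check: with identical views, each positive pair has maximal
# similarity, so the loss is small (bounded by log(2B - 1) for B = 4).
rng = np.random.default_rng(0)
v = rng.normal(size=(4, 16))
v /= np.linalg.norm(v, axis=1, keepdims=True)
loss = nt_xent(v, v.copy())
print(0.0 < loss <= np.log(7))  # → True
```

A combined objective would then take the hypothetical form total_loss = proxy_loss + lam * nt_xent(view1, view2), with the contrastive term computed on the unlabelled portion of the mini-batch.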

VI. CONCLUSION
Computer-aided diagnosis systems can help to reduce the workload of healthcare professionals, potentially resulting in improved patient outcomes [2]. In this paper, we presented a novel model that can be jointly used for pathology classification and content-based image retrieval. Our multimorbidity metric learning approach uses the power of proxies to efficiently learn a feature vector space that encodes the relationships between disease labels. We showed the efficacy of our approach on two chest X-ray datasets, demonstrating a performance advantage over the baseline and state-of-the-art methods in classification and retrieval.