Decomposing Normal and Abnormal Features of Medical Images into Discrete Latent Codes for Content-Based Image Retrieval

In medical imaging, the characteristics purely derived from a disease should reflect the extent to which abnormal findings deviate from the normal features. Indeed, physicians often need corresponding images without abnormal findings of interest or, conversely, images that contain similar abnormal findings regardless of normal anatomical context. This is called comparative diagnostic reading of medical images, which is essential for a correct diagnosis. To support comparative diagnostic reading, content-based image retrieval (CBIR), which can selectively utilize normal and abnormal features in medical images as two separable semantic components, will be useful. Therefore, we propose a neural network architecture to decompose the semantic components of medical images into two latent codes: normal anatomy code and abnormal anatomy code. The normal anatomy code represents normal anatomies that should have existed if the sample is healthy, whereas the abnormal anatomy code attributes to abnormal changes that reflect deviation from the normal baseline. These latent codes are discretized through vector quantization to enable binary hashing, which can reduce the computational burden at the time of similarity search. By calculating the similarity based on either normal or abnormal anatomy codes or the combination of the two codes, our algorithm can retrieve images according to the selected semantic component from a dataset consisting of brain magnetic resonance images of gliomas. Our CBIR system qualitatively and quantitatively achieves remarkable results.


Introduction
Abnormalities in medicine can be viewed as deviations from a normal feature in a healthy population.In medical images, such abnormalities appear as local or global deviations from an original normal anatomy.These changes should reflect a process where abnormalities occur under some initiation mechanisms from the normal anatomy and, with disease progression, existing normal tissues are eventually replaced by pathological (Mototaka Miyake), masataka@ncc.go.jp ( Masamichi Takahashi), 7bmm1342@cc.u-tokai.ac.jp (Akiko Nakagawa), harada@mi.t.u-tokyo.ac.jp (Tatsuya Harada), rhamamot@ncc.go.jp (Ryuji Hamamoto) ones.Indeed, when evaluating images of conditions to be diagnosed, physicians often need corresponding images without abnormal findings of interest or, conversely, images that contain similar abnormal findings regardless of normal anatomical context.This is called comparative diagnostic reading of medical images, which is essential for a correct diagnosis.For example, by comparing abnormal images of interest with normal images of the same anatomical site, physicians can diagnose the presence and extent of the disease.Meanwhile, images with similar abnormal features can also be useful as a reference because diseases with the same imaging phenotypes often have similar prognoses and treatment responses, somewhat regardless of the origin of the diseases.However, it is quite laborious to find similar images from a large database by focusing only on either normal or abnormal features of a query image by hands.Hence, we aimed to establish a content-based image retrieval (CBIR) system to support comparative diagnostic reading.
CBIR is an important application to retrieve a set of the most similar images to a submitted patient's image from large databases to support clinical decision making.A CBIR system basically comprises two subsystems: At the first feature extraction stage, it converts images in a database to a set of features associated with image content, and at the next similarity search stage, similar images are retrieved from the database based on the feature similarity with respect to a query image.Traditionally, the feature extraction employed various types of handcrafted descriptors, such as shape, text, color, and texture; however, there was difficulty in reducing the "semantic gap" between low-level image features captured by the handcrafted descriptors and high-level visual concepts (Kumar et al., 2013;Zheng et al., 2018).Recently, with the success in many image processing tasks, the deep-learning-based approach has gained increasing attention in the field of CBIR.Since the deep neural network can automatically learn complex features in hierarchical levels of abstraction, it can simplify the feature extraction process.A common neural network architecture used in CBIR systems is the autoencoder, which enables to map each image into a latent representation that can store compressed image information (Guo et al., 2015;Liu et al., 2016;Song et al., 2018;Xu and Fang, 2016;Zhang et al., 2016).Even though deep-learning-based approach often outperformed classical algorithm of CBIR systems (Anavi et al., 2016;Sun et al., 2017), there is no specific method that can measure the distance between either normal or abnormal features of medical images as separated semantic components.To achieve our goals, we need to devise a novel feature extraction method that can decompose normal and abnormal features of medical images.We also aimed to establish a computationally effective method to retrieve similar images based on decomposed latent representations.
In this study, we define the two-tiered semantic nature of normal and abnormal anatomy as compositionality of medical images, as presented in Fig. 1.Hereinafter, "normal" anatomy means counterfactual structures that should have existed if the sample is healthy, whereas "abnormal" anatomy attributes to disease changes that reflect the deviation from the normal baseline.When the sample is free from abnormality, "normal" anatomy corresponds to the entire sample image, and "abnormal" anatomy should not indicate any condition.Then, we consider how to decompose a given image into two lowdimensional representations in a manipulable manner, where the latent codes representing normal and abnormal anatomy should be mutually exclusive and collective in terms of the reconstruction of the original image.By measuring similarities based on decomposed normal or abnormal semantic components of medical images, it can be expected to retrieve images that have the same normal anatomical context or similar abnormal findings, respectively, while ignoring the other component.
Factorizing compositionality into distinct and informative factors in data is a fundamental requirement of representation learning (Bengio et al., 2013).The majority of studies has sought to capture purely independent factors of variation that contributes to generation of data, which are called disentangled representations (Higgins et al., 2018).To obtain disentanglement of features, two approaches have exploited implicit or explicit supervision to impose strong inductive biases on autoencoders (Tschannen et al., 2018).One approach is to collect a large amount of data to encompass sufficient variation for each factor and then apply appropriate regularization such that disentanglement can be implicitly performed (Achille and Soatto, 2018;Alemi et al., 2017;Burgess et al., 2018;Chen et al., 2018;Esmaeili et al., 2019;Higgins et al., 2017;Kim and Mnih, 2018;Kumar et al., 2017;Lample et al., 2017;Lopez et al., 2018;Louizos et al., 2016;Zhao et al., 2019).Another approach is to force a model to acquire separate representations by explicitly imposing modeling assumptions (Charakorn et al., 2020;Cheung et al., 2015;Eslami et al., 2016;Kulkarni et al., 2015;Shanahan et al., 2020).If distinct factors of variations explaining the characteristics of data can be separately represented using independent latent units, it can be useful in downstream tasks by providing interpretability and manipulability to the data.
Therefore, disentangled representations can decompose normal and abnormal semantic components of medical images.Indeed, there has been an increasing number of studies focusing on feature disentanglement in medical imaging, including performing pseudo-healthy synthesis (Liao et al., 2020;Tang et al., 2021;Vorontsov et al., 2019;Xia et al., 2020), learning generalizable features across domains (Meng et al., 2021;Meng et al., 2020) or imaging modalities (Chartsias et al., 2019;Chartsias et al., 2021;Chen et al., 2019), and evaluating the reliability of individual annotators from true segmentation label distributions (Zhang et al., 2020).However, with respect to the etiology of diseases in medical images, it is noteworthy that the assumption of disentanglement, where purely independent low-dimensional latent features can mimic the generation of high-dimensional data, can be too simple to be generalizable.For example, brain tumors originate as focal changes, and with their progression, cause compressional deformations and infiltration in adjacent structures, making it difficult to clearly define the boundary between normal and abnormal tissues.Hence, we consider decomposition of latent representation, which encompasses the notion of disentangled representation as its special case and allows much richer properties of latent spaces such as intricate structured relationships (Mathieu et al., 2019).
In this study, we first devised a neural network architecture, called feature decomposing network, to decompose normal and abnormal semantic components of medical images in a manipulable manner.Then, we demonstrated a novel CBIR framework by utilizing the decomposed latent codes to support comparative diagnostic reading.Given an input image, the feature decomposing network is trained to map it into two latent codes, normal anatomy code and abnormal anatomy code.It indicates that the original latent space for representing the whole image is divided into one portion as a normal anatomy code corresponding to normal anatomy and the remaining portion as an abnormal anatomy code corresponding to abnormal anatomy.After training the feature decomposing network, latent codes become representative of targeted semantic features in medical images.We also investigate the effectiveness of a method called distribution matching by utilizing Wasserstein generative adversarial networks (GANs) (Arjovsky et al., 2017) with gradient penalty (Gulrajani et al., 2017).Distribution matching is imposed on the latent distribution to minimize the overlap in the semantic contents between normal and abnormal anatomy codes.Furthermore, by constructing the two latent codes to be discrete through vector quantization (van den Oord et al., 2017), we can reduce computational burden at the time of similarity search by binary hashing.The utility of these decomposed latent codes for CBIR applications is shown based on a large dataset containing brain magnetic resonance (MR) images of gliomas.By performing nearest neighbor search utilizing either normal or abnormal anatomy codes or the combination of the two codes, our CBIR system can retrieve images according to selected semantic components while ignoring the other, if necessary.
The main contributions of this study can be summarized into the following: • We propose a feature decomposing network that can explicitly decompose semantic features of medical images into normal and abnormal anatomy codes in a manipulable manner for downstream tasks.
• To enhance computational efficiency at the time of similarity search, latent spaces are configured using vector quantization to be discrete, rather than continuous.
• By employing the decomposed latent codes, we present a novel CBIR application that can search for similarities in images based on a selected latent code or the combination of the two latent codes, enabling retrieval of images viewed as any semantic components while ignoring the other, if necessary.
The proposed method is most closely related to our conference extended abstract (Kobayashi et al., 2020), a neural network architecture to decompose normal and abnormal features of medical images with its application to CBIR.However, the presented work has significantly improved the learning method with extensive experimental settings.We further validated the performance of the CBIR application with more appropriate metrics.
The rest of the manuscript is organized as follows: Section 2 reviews the literature on disentangled representation, imageto-image translation in medical imaging, and CBIR.Section 3 describes our proposed method with technical backgrounds.Section 4 presents experimental settings and evaluation methods.Section 5 provides the results.Section 6 presents the discussion and conclusions.

Related work
Here, we briefly review literature related to disentangled representation learning, especially in the field of computer vision.Thereafter, as a research interest more related to our purpose, we review recent progress that mainly apply the imageto-image translation technique based on GANs for pseudohealthy synthesis of medical images.We also introduce current progress in the CBIR system in medical imaging.

Disentangled representation learning
Learning disentangled representation attempts to separate the underlying factors of sample variations in a way that each factor exclusively represents one type of sample attributes.There are several benefits in learning disentangled representation from data because models that produce feature disentanglement can provide explainability of the model function and manipulability in the data generation process.One approach is to combine deep generative models, such as GANs (Goodfellow et al., 2014) and variational autoencoders (VAEs) (Kingma and Welling, 2014;Rezende et al., 2014) with regularization techniques to acquire disentanglement in an implicit manner (Achille and Soatto, 2018;Alemi et al., 2017;Chen et al., 2018;Esmaeili et al., 2019;Higgins et al., 2017;Kim and Mnih, 2018;Kumar et al., 2017;Lample et al., 2017;Lopez et al., 2018;Louizos et al., 2016;Zhao et al., 2019).For example, β-VAE can automatically discover the independent latent factors of variation by imposing a limit on the capacity of latent information, which facilitates factorization of representations (Higgins et al., 2017).However, it is still fundamentally difficult to learn disentangled features without any supervision.It is also argued that acquired disentangled representations sometimes mismatch human's predefined concepts (Locatello et al., 2019).Another approach is to explicitly factorize representations into a component that is related to or independent of classes based on labeled data (Charakorn et al., 2020;Cheung et al., 2015;Eslami et al., 2016;Kulkarni et al., 2015;Shanahan et al., 2020).Label information or attribute annotation can serve as strong supervision for feature disentanglement; hence, the performance of disentanglement can be optimized and guaranteed.Therefore, by designing a network with an appropriate structure and exploiting segmentation labels indicating abnormal regions as supervision, we aim to obtain the desired decomposition of latent representations.

Image-to-image translation
Image-to-image translation has been exploited in medical imaging to obtain disentangled representation.CycleGAN, which performs bidirectional translation between image domains, has been widely used in image-to-image translation (Zhu et al., 2017).As an extended architecture of CycleGAN, the unsupervised image-to-image translation (UNIT) framework proposes a shared latent space assumption, where a pair of corresponding images in two different domains can be mapped (Liu et al., 2017).More recently, multimodal UNIT decomposes an image into a content code that is domain invariant and a style code that represents domain-specific features (Huang et al., 2018).In line with studies focusing on medical imaging, Xia et al. demonstrated pseudo-healthy synthesis by creating a subject-specific healthy image from a pathological one by extending the learning framework of CycleGAN (Xia et al., 2020).Similarly, Liao et al. proposed an artifact disentanglement network using the image-to-image translation architecture, achieving comparable performance in image restoration to existing supervised models (Liao et al., 2020).Vorontsov et al. also applied the same concept to improve semi-supervised training for semantic segmentation with autoencoding (Vorontsov et al., 2019).To enhance realistic synthesis of chest radiographic images, Tang et al. proposed a disentangled generative model for disease decomposition and demonstrated that disease residual maps can indicate underlying abnormal regions (Tang et al., 2021).Since one of our goals is to decompose medical images into normal and abnormal semantic components, the basic idea is somewhat similar to pseudo-healthy synthesis exploiting image-to-image translation techniques.However, previous approaches focused on transforming rather superficial appearance of images and did not evaluate the accessibility and validity of latent representations acquired inside the models.Thus, our feature decomposing network has a bottleneck where imaging features are compressed, enabling to handle latent representations of targeted semantic component for the downstream task.

Content-based image retrieval
CBIR is an important application in retrieving a set of the most similar images from large databases, given a query image.Similarity measurements based on various information, such as shape, text, color and texture, and features acquired inside convolutional neural networks are used to resolve the semantic gap between imaging features and high-level visual concepts (Kumar et al., 2013;Zheng et al., 2018).Even though several studies utilizing state-of-the-art techniques of deep neural networks have been introduced to CBIR (Haq et al., 2021;Mohd Zin et al., 2018), to the best of our knowledge, there are few studies that exploit disentangled representation from the viewpoint of image retrieval.Recently, Havaei et al. proposed a neural network architecture to ensure content-style disentanglement of medical images and demonstrated how the inferred style and content features are disentangled from each other by utilizing CBIR as an evaluation method (Havaei et al., 2020).Since CBIR is one of essential technologies that assist physicians in diagnosis, a methodology that directly seeks to develop CBIR based on disentangled representation is worth considering.Therefore, in addition to the concept of decomposing normal and abnormal features of medical images, we particularly employ latent codes as to be discrete through vector quantization.By applying vector quantization for the latent space, subspaces can be fixed and transversed by the Hamming distance rearrangement, which is favorable in a large-scale CBIR by reducing the computational cost at the time of similarity search (Gao et al., 2019).

Material and methods
In this study, we propose a method to decompose two-tiered semantic components of medical images into normal and abnormal anatomical codes for the application of CBIR that can selectively focus on semantic components.The proposed method is presented in two stages: First, we describe a network architecture, which we call feature decomposing network, to decompose normal and abnormal features in medical images and its learning strategy.Then, we present how to establish a CBIR system that can support comparative diagnostic reading by utilizing these decomposed features.

Feature decomposing network
The feature decomposing network is used to decompose semantic components of medical images into normal and abnormal anatomy codes.It is trained based on a dataset containing input images x and ground-truth segmentation labels y.The feature decomposing network consists of an encoder and two decoders, segmentation decoder, and image decoder.A pair of discrete latent spaces with latent codebooks exists at the bottom of the network, each of which produces normal and abnormal anatomy codes.The overview of the network architecture is shown in Fig. 2. It is noteworthy that the feature decomposing network has the latent space as its bottleneck, where no bypass connection between the encoder and decoders, such as skip connections (Drozdzal et al., 2016), is implemented.Therefore, we can expect that the information processed by the encoder can be compressed in latent spaces (Razavi et al., 2019;Wu and Flierl, 2019).

Feature encoding
The encoder uses a two-dimensional (2D) medical image x ∈ R C×H×W as an input, where C is the number of channels and z + e correspond to the semantic features of normal and abnormal anatomies, respectively.We use z ∓ e to represent both features.Subsequently, vector quantization is used to discretize z ∓ e .Namely, each elemental vector z ∓ e i ∈ R D is replaced with the closest code vector in each codebook e ∓ ∈ R D×K that comprises D-dimensional K code vectors.Details in this vector quantization process are presented in the next subsection.We denote the quantized vector of z ∓ e as z ∓ q .Hereinafter, z − q is referred to as normal anatomy code and z + q as abnormal anatomy code.

Vector quantization
The latent space has two codebooks, e − = {e − k |k = 1, . . ., K} ∈ R K×D and e + = {e + k |k = 1, . . ., K} ∈ R K×D , corresponding to the normal and abnormal semantic features, respectively.The vector quantization process is similar to that of a vector-quantized (VQ) VAEs (van den Oord et al., 2017).An i-th elemental vector of z ∓ e , denoted as z ∓ e i ∈ R D , is quantized by executing a nearest-neighbor lookup on the codebook as follows: . (1) Thereafter, an i-th elemental vector of z ∓ e is replaced by the k ∓th code vector in the codebook as follows: This replacement is performed for all (H × W ) elemental vectors of z ∓ e to collectively form a quantized vector z ∓ q .To optimize this process, the encoder and codebooks are updated to minimize an objective, which is referred to as latent loss L lat as follows: where sg indicates a stop-gradient operator, which serves as an identity function at the forward computation time and has zero partial derivatives, and β is a balancing hyperparameter.During training, the first term in the abovementioned equation updates the codebook variables by delivering the code vectors to the encoder output.Meanwhile, the second term encourages the encoder output to move closer to the targeted code vectors.We use the exponential moving average to train the codebook (Kaiser et al., 2018).

Feature decoding
The segmentation decoder uses an abnormal anatomy code z + q as an input and outputs a S -class segmentation label ŷ ∈ R S ×H×W that corresponds to the ground-truth label y.A loss function for the segmentation output L seg is a composite of the generalized Dice (Sudre et al., 2017) and focal (Lin et al., 2017) losses as follows: where ỹ indicates the softmax logits of the segmentation decoder, N(= H × W) is the number of pixels, and w k is determined as w k = 1 ( N |y s |) 2 to mitigate the class imbalance problem (Sudre et al., 2017).
Meanwhile, the image decoder f performs conditional image generation using the spatially adaptive normalization (SPADE) (Park et al., 2019).SPADE is designed to propagate semantic layouts to the process of synthesizing images (Fig. 3).The image decoder uses a normal anatomy code z − q as its primary input.When the image decoder is used to reconstruct the entire input image x+ , the softmax logits of the segmentation decoder ỹ is transmitted to each layer of the image decoder via the SPADE modules ( f (z − q , ỹ) = x+ ).When null information, where ỹ is filled with 0s, is propagated to the SPADE modules, a normalappearing reconstruction x− is generated by the image decoder ( f (z − q , 0) = x− ).
To enforce different characteristics between the two types of generated images, x− and x+ , we apply a pixel-wise reconstruction loss depending on the region of abnormality.Suppose M + ∈ {0, 1} C×H×W defines a mask, indicating that pixels with any abnormality labels are set to 1 and 0 otherwise, and M − = 1 − M + is a complementary set of M + .Briefly, M + presents the abnormal anatomy region, and M − indicates the normal anatomy region.Using these masks, image reconstruction loss L rec is defined as follows: where SSIM indicates structural similarity (Zhou Wang et al., 2004), which is added to the L2 loss as a constraint owing to its empirical effect to stabilize the image generation process.

Distribution matching
It is quite important to ensure that each decomposed feature, normal anatomy code z − q and abnormal anatomy code z + q , corresponds to targeted semantic content in the images.For example, when some code vectors of normal anatomy codes convey not only features corresponding to normal anatomies but also those related to abnormal anatomies, the feature decomposition can be "leaky," losing its reliability for downstream tasks.Particularly, this can occur at normal anatomy codes because, when a pathological image is provided, it is fundamentally impossible to obtain a paired normal counterpart that can be a ground-truth for the normal-appearing reconstructions x− .Therefore, we utilize unpaired healthy images and employ distribution matching technique to minimize the discrepancy of the distributions between normal anatomy codes from healthy images and those from disease images.We consider this discrepancy as the Wasserstein distance d W and minimize it using Wasserstein GAN (Arjovsky et al., 2017) with gradient penalty (Gulrajani et al., 2017) that has a critic network g.
Here, we consider that the set of input images X can be divided into a set consisting of healthy images X h and a set consisting of diseased images X d .Suppose the distribution of normal anatomy codes Z − q can also be split into those originating from healthy images Z − h and those originating from diseased images Z − d .When a batch {x ∼ X} containing both healthy images {x h ∼ X h } and diseased images {x d ∼ X d } is fed into the encoder, a corresponding batch of normal anatomy codes {z − q ∼ Z − q } can be split into those originating from healthy im- If there is no leakage of semantic content of abnormality into the normal anatomy codes, the two distributions, Z − h and Z − d , should be identical with each other, irrespective of the presence of abnormality in the input images.Then, distribution matching imposes two types of loss functions, L critic and L reg , as follows: where is the Wasserstein distance, λ gp is the balancing term for the gradient penalty, and z is to enforce the Lipschitz constraint by sampling a variable along straight lines between pairs of points sampled from the two latent distributions (Gulrajani et al., 2017).When optimizing L critic , only the critic network is trained, not propagating gradients to the modules prior to z − h and z − d .Both the encoder and codebook for normal anatomy codes are used to bring the two distributions closer together in the process of optimizing L reg .
Note that, unlike usual GANs for image synthesis, our goal is to achieve an alignment between two latent distributions, Z − h and Z − d .Since distribution matching is applied to the batch containing quantized latent codes, it is expected that certain constraints will be added to the process of vector quantization.More specifically, it can be expected to regularize code vectors in the codebook for normal anatomy codes not to be too representative even for abnormal imaging features.

Overall learning objectives
In summary, we define several loss functions: latent loss L lat for optimization of the encoder and codebooks; segmentation loss L seg for the segmentation decoder, encoder, and codebook for abnormal anatomy codes; reconstruction loss L rec for the image decoder, encoder, and codebook for normal anatomy codes; and distribution matching loss for the critic network L critic and that for the encoder and codebook for normal anatomy codes L reg .The overall learning algorithm is shown in Algorithm 1.

Modeling of content-based image retrieval
Here, we formulate the problem of CBIR to find the closest feature vector from a reference database containing N Ddimensional database vectors {r n } N n=1 , given a query vector q, as follows: where D is a distance function such as Euclidean distance.
Starting with the abovementioned equation, our consideration into the CBIR system is twofold.First, to enhance computational efficiency in the distance calculation, we propose to binarize latent representations by leveraging the latent spaces that are constructed as to be discrete rather than continuous.Then, a novel CBIR framework by utilizing the decomposed latent codes to retrieve images with targeted semantic components is introduced.
, L seg (ŷ, y), and L rec ( x− , x+ , x).Split the batch of {z − q } into subbatches of {z − h } and {z − d } according to the presence of abnormal label in each input image.

Binary hashing based on separating hyperplanes
Here, we consider how to binarize code vectors in a codebook e = {e k |k = 1, . . ., K} ∈ R K×D , which is learned in the feature decomposing network.The goal is to find subspaces that can be transversed by Hamming distance rearrangement.Since each code vector e k ∈ R D has a fixed position in the latent space, the latent space can be divided by K 2 separating hyperplanes that perpendicularly bisect a line segment connecting any two code vectors.Given two code vectors, e i and e j , points x located on the separating hyperplane can be formulated as follows: Therefore, the position of another code vector e k can be binarized according to the side of the separating hyperplanes it is on: where Then, by considering the positional relationship between a total of K 2 separating hyperplanes, the continuous codebook e ∈ R K×D can be converted into a binarized codebook b = {b k |k = 1, . . ., K} ∈ R K×E , where E = K 2 .Experimentally, we confirmed that there was no code vector that is exactly located on any separating hyperplane, which would make Eq.( 15) equal to zero.Since the binary representation corresponding to the location of each vector is uniquely determined, it can be regarded as binary hashing.

Optimization of the binarized vector length
When naively performing the abovementioned binarization, the length of each binarized code vector is K 2 = 512 2 = 130, 816, which is extremely long to obtain the computational benefit from the Hamming distance calculation.Therefore, we consider optimization for the length of the binarized code vectors.At the time of performing nearest neighbor search, a local sensitivity is primarily required; that is, the relationship between the closest code vectors should remain the same.Meanwhile, it does not necessarily guarantee the positional relationship between distance code vectors.Based on these ideas, we can optimize the length of the binarized code vector by removing each element one by one to investigate whether the proximity between the closest vectors changes or not.This optimization algorithm is shown in Appendix A. Empirically, the length of the code vectors can be reduced to < 1%, allowing the calculation of Hamming distance to be much faster than that of L2 distance using continuous code vectors.Hereinafter, we denote b * = {b * k |k = 1, . . ., K} ∈ R K×E * as a binarized codebook after optimization.

Image retrieval based on the decomposed latent codes
When performing the CBIR framework, a query image is decomposed into a normal anatomy code and abnormal anatomy code through the feature decomposing network (Fig. 4).Then, the similarity using either the latent codes or a combination of both is calculated between the query image and images in the reference dataset.Particularly, when the similarity is viewed as counterfactual normal images as they should be, the similarity measurement uses only normal anatomy codes to be compared S normal (q, r).In contrast, when viewed by focusing only on abnormal areas, the similarity measurement uses only abnormal anatomy codes S abnormal (q, r).Moreover, to calculate similarities of the whole imaging features between the query and reference images, we define S sum (q, r) for the summation of both measurements as follows: S sum (q, r) = S normal (q, r) + S abnormal (q, r). ( These similarity measurements can be calculated based on different distance definitions.Here, for comparison, three types In contrast, by calculating the similarity of abnormal anatomy codes, images with similar tumor regions can be retrieved (S abnormal ).Besides, the similarity retrieval based on the whole imaging features can be applied by calculating the similarity combining both normal and abnormal anatomy codes (S sum ).Note that these similarity measurements can be calculated based on different distance definitions, such as Euclidean distance, angular distance, and Hamming distance.
of distances, that is, Euclidean distance D E , angular distance D A , and Hamming distance D H , are calculated for each similarity measurement.Note that the continuous codebook e is the basis for Euclidean distance D E and angular distance D A , while the optimized binarized codebook b * is the basis for Hamming distance D H .

Dataset
We used brain MR images of gliomas from the 2019 BraTS Challenge (Bakas et al., 2017a,b,c;Menze et al., 2015), containing a dataset of 355 patients for training (MIC-CAI BraTS Training), a dataset of 125 patients for validation (MICCAI BraTS Validation), and a dataset of 167 patients (MICCAI BraTS Testing) for testing.For each patient, T1, Gd-enhanced T1, T2, and fluid-attenuated inversion recovery (FLAIR) sequences were obtained.MIC-CAI BraTS Training contains three segmentation labels of abnormality: Gd-enhancing tumor (ET), peritumoral edema (ED), and necrotic and non-enhancing tumor core (NET).Conversely, MICCAI BraTS Validation and MICCAI BraTS Testing do not have any segmentation label.There is no ground-truth that represents normal anatomical structures.Therefore, we segmented all three datasets into six normal anatomical labels (left cerebrum, right cerebrum, left cerebellum, right cerebellum, left ventricle, and right ventricle).Moreover, the abnormal regions (ET, ED, and NET) in MICCAI BraTS Validation and MICCAI BraTS Testing were segmented in the same manner with those in MICCAI BraTS Training.The annotation process is described in detail in Appendix B.
For the training of the feature decomposing network, we concatenated both MICCAI BraTS Validation and MIC-CAI BraTS Testing as a training dataset.
Then, the feature decomposing network was evaluated on MIC-CAI BraTS Training as a test dataset, which is also utilized in the demonstration of the CBIR system based on the decomposed latent codes.

Preprocessing
All four sequences, T1, Gd-enhanced T1, T2, and FLAIR, were concatenated into a four-channel MR volume X ∈ R 4×240×240×155 .Then, a preprocessing pipeline, including axial image resizing to 256 × 256 and Z-score normalization, was performed.Subsequently, each three-dimensional (3D) MR volume was decomposed into a collection of 2D axial slices {x 1 , x 2 , . . ., x 155 ∈ R 4×256×256 }.The training and test datasets underwent the same preprocessing process.Data augmentation, such as random rotation and random horizontal flip, was applied to each image in the training dataset to train the feature decomposing network.

Implementation
All experiments were implemented in Python 3.7 with Py-Torch library 1.2.0 (Paszke et al., 2019) using an NVIDIA Tesla V100 graphics processing unit and CUDA 10.0.For all networks, Adam optimization (Kingma and Ba, 2015) was used for the training.Network initialization was performed using the method described by He et al. (He et al., 2015).

Feature decomposing network
When using a quantized latent space, such as VQ-VAE, the size of latent representation (i.e., width and height of the feature maps) exerts a significant effect on the quality of image generation (Razavi et al., 2019).Therefore, for comparison, we configured several architectures of the feature decomposing network depending on the size of the latent space.Hereinafter, we denote a feature decomposing network with the latent representation of the size of H × W as FDN H ×W .In this study, we compared FDN 4×4 , FDN 8×8 , FDN 16×16 , and FDN 32×32 with or without the distribution matching for the normal anatomy codes.
As described in Section 3.1, the feature decomposing network comprises the encoder, segmentation decoder, and image decoder.See Appendix C for the detail of the architectures of these neural networks.The input for the encoder is required to be a four-channel 2D image with the size of 4 × 256 × 256 (= channel × height × width).The encoder has a common trunk for obtaining an input image and extracting low-level features of the image and then bifurcates into two branches with the same architecture, one of which is for the normal anatomy code and the other is for the abnormal anatomy code.The number of a repeated structure, consisting of residual blocks (He et al., 2016) with a strided convolution (ResBlock + StridedConv), was adaptively set for each size of the latent representation.For example, in FDN 8×8 , the encoder utilized 32 − 64 − 128 − 128 − 128 − 128 filter kernels in each layer.The encoder outputs two latent representations corresponding to normal and abnormal semantic components, z − e and z + e , with the size of 64 × H × W .These latent representations are subsequently quantized to z − q and z + q through the vector quantization.The dimension of code vectors D and number of code vectors K in the codebooks were fixed as follows: D = 64, and K = 512.
The image and segmentation decoders have almost the same architecture, except for the normalization layer, where the image decoder particularly utilizes SPADE module for the propagation of softmax logits ỹ from the segmentation network.The number of a repeated structure, which comprises upsampling module with a bilinear interpolation function followed by residual block with or without SPADE [Upsample + (SPADE-)ResBlock], was adopted for the size of the latent representation.For example, in FDN 8×8 , the image and segmentation decoders utilized 128 − 128 − 128 − 128 − 64 − 32 filter kernels in each layer.The segmentation decoder takes z + q as an input and outputs a segmentation map ŷ with the size of 4 × 256 × 256 (= channel × height × width), the channel number of which corresponds to the number of abnormality labels (ET, ED, and TC) plus a background label.Besides, a softmax logit ỹ with the size of 4 × 256 × 256 is retained for the collateral input for the image decoder.The image decoder utilizes z − q as primary input.Depending on the presence or absence of the softmax logit ỹ through the SPADE modules, it generates either a normal-appearing reconstruction x− or entire reconstruction of the input image x+ , respectively.
When using distribution matching to regularize normal anatomy codes (see Section 3.1.4),the critic network is trained to approximate the Wasserstein distance d W of distributions between the normal anatomy codes originating from healthy images Z − h and those originating from diseased images Z − d .The detailed network architecture is described in Appendix C. The number of inner iterations in the training of the critic network m was set to 5, and the balancing term for the gradient penalty λ gp was 10.0.To balance the magnitudes of the loss functions, λ 4 and λ 5 were set to 1.0 × 10 −4 for FDN 4×4 and FDN 8×8 and 5.0 × 10 −4 for FDN 16×16 and FDN 32×32 .The larger values of λ 4 and λ 5 for each configuration tended to fail in mode collapse, where the encoder learned only a few modes of data in the sample distribution.The other hyperparameters were shared across the configurations as follows: batch size = 112, number of training epochs = 400, learning late = 1.0 × 10 −4 , weight decay = 1.0 × 10 −5 , λ 1 = 0.25, λ 2 = 5.0, λ 3 = 5.0, and γ = 2.0.

Preparation of codebooks
The CBIR framework was constructed on a per-image basis; that is, each MR volume was separated into slices along the axial axis, which was in the same manner with the training of the feature decomposing network.By straightforwardly applying the trained feature decomposing network, every single image in the test dataset was decomposed into normal anatomy code z − q and abnormal anatomy code z + q .These latent codes can be considered as a set of code vectors extracted from the continuous codebooks e ∓ , which is subjective to the calculation of Euclidean distance D E and angular distance D A .Then, according to the methods described in Section 3.2.1 and Section 3.2.2,we prepared a set of latent codes in a manner of binarized code vectors.First, code vectors with L2 norm < 1.0 × 10 −5 were rounded to a zero vector.Then, each code vector was binarized with optimized length to obtain the optimized binarized codebooks b ∓ * .Binarized latent codes were acquired by searching the optimized binarized codebooks based on the same indices of the corresponding latent codes in the continuous codebooks.The similarity between binarized code vectors was evaluated based on Hamming distance D H .

Evaluation
Since end-to-end learning of the whole modules cannot be performed, we evaluated individual component to find the optimal combination for the CBIR framework.The evaluation process comprised three stages.First, to find the optimal configuration of the feature decomposing network, we assessed the error of image reconstruction, performance of abnormality segmentation, "leakiness" of feature decomposition, and compactness of the codebooks.Second, based on the feature decomposing network with a selected configuration, we observed the extent to which the distance relationship between code vectors was changed through the binarization.Lastly, we demonstrated the effectiveness of the proposed CBIR framework based on the decomposed latent codes from qualitative and quantitative perspectives.

Image reconstruction error
The image decoder performs conditional image generation while switching the entire reconstructions x+ and normalappearing reconstructions x− (see Section 3.1.3).At the training stage, the entire reconstructions are similar to the input images.Meanwhile, normal-appearing reconstructions learn to match the region of the input images that excludes the abnormal area, which can be indicated by the mask matrix M − .Therefore, the reconstruction error of the entire reconstructions and that of normal-appearing reconstructions can be calculated as x − x+ 1 and M − x − M − x− 1 , respectively, where denotes pixel-wise summation of the residual errors.

Segmentation performance
The segmentation decoder predicts a segmentation output ŷ, which should be close to the ground-truth label y (see Section 3.1.3).The segmentation performance was evaluated based on Dice score (Dice, 1945) with respect to the abnormality labels (ET, ED, and NET).The segmentation outputs in each 2D axial slice {x 1 , x 2 , . . ., x 155 } were concatenated into a volume to evaluate the Dice score based on the volume.

Leakiness of feature decomposition
As described in Section 3.1.4,distribution matching is introduced to ensure that each decomposed latent code corresponds to targeted semantic content of the input images without being "leaky" to other features.Especially, this is important for normal anatomy codes because there is no ground-truth for normalappearing reconstructions originating from diseased images.However, observation of the latent space cannot provide information on whether the representations contained therein are decomposed in a desirable manner.Hence, we observe reconstructed images generated from these latent codes through the image decoder.In the evaluation, a batch containing only the normal-appearing reconstructions { x− } was fed into a classification network to predict whether each normal-appearing reconstruction was derived from healthy images X h or diseased images X d .The architecture of the classification network and training settings based on the training dataset are described in Appendix D. When the normal-appearing reconstructions do not contain any abnormal characteristics, the classification network should not be possible to identify their origin, even if they are derived from diseased images.Otherwise, it indicates that there are "leaked" abnormal imaging features in the normalappearing reconstructions, which can be distinguishable for the origin from the diseased images.Positive predictive value (PPV) of the classification network with respect to the origin from the diseased images was evaluated as an indicator of leakiness in the test dataset.

Compactness of the codebooks
In some configurations of the feature decomposing network, it was empirically observed that a portion of the K code vectors stored in a codebook exhibited small norms close to zero (see Appendix E).It suggested that these code vectors with small norms did not encode distinguishable features with zero vector, allowing approximation as the zero vector.Therefore, in that case, the number of code vectors that are actually significant was < K. Hence, we define the compactness of the codebooks to be the ratio of insignificant code vectors that showed small L2 norms below a threshold values of 1.0 × 10 −5 to the total number of code vectors K in a codebook.When the value of the compactness is large, it implies that the codebook can be considered as compact with less than initial K code vectors.This is preferential for the computational cost at the time of similarity search because we can approximately reduce the number of code vectors to be computed.

Validity of the binarization process
The validity of the binarization process, which is introduced in Section 3.2.1 and Section 3.2.2, was evaluated from three perspectives: concordance of the distance relationship of the code vectors before and after the binarization process, compression ratio of the code vector length, and comparison of computational time.To demonstrate concordance, we investigated whether the nearest neighbor relationship among binarized code vectors changed with respect to the continuous ones.Since the fundamental element of the latent space composed a fixed set of code vectors, the distance relationship as a whole can be regarded as almost unchanged, given that the distance relationship between each code vector is consistent.Therefore, the distance relationship of each code vector with the other remaining K − 1 code vectors was assessed and compared between the two types of codebooks.In the evaluation, for each code vector, the indices of the top-Q closest code vectors were obtained.The distance calculation was performed using Hamming distance D H for the optimized binarized codebook b * and Euclidean distance D E for the continuous codebook e.Then, the concordance was calculated with respect to the agreement by the Jaccard similarity coefficient between two sets of indices.The compression ratio of the code vectors owing to the optimization process was calculated as follows: E * E .See Appendix F for the methodological details in the comparison of computational time in the distance calculation.

CBIR performance based on the decomposed latent codes
To quantify the image retrieval performance using the decomposed latent codes (see Section 3.2.3),images containing the largest area of abnormality were selected from each MR volume in the test dataset based on ground-truth labels.We refer to these representative images as query images.For each query image, a reference dataset that comprised the rest of the images in the test dataset except for the images in the same MR volume of the query image was constructed.The images in the reference dataset are called reference images.Then, for each MR volume in the reference dataset, the image with the closest latent code to that of the query image was obtained.The obtained images from each MR volume were sorted with respect to the similarity.Lastly, top-Q closest images, each of which belonged to different MR volume, were provided as retrieved images.This MR volume-wise retrieval can be appropriate in evaluating variable retrieved images by a single query image because one MR volume usually contains several images with similar appearance in continuous axial slices.
To report retrieval performance, the mean Dice coefficient based on two types of ground-truth labels (see Section 4.1), one is for six categories of normal anatomy and the other is for three categories of abnormality, was assessed between each query image and top-Q closest images from the reference dataset.The mean Dice coefficient was calculated for all query images and then averaged.As shown in Fig. 4, when using S normal , the mean Dice coefficients based on the categories of normal anatomy should be high because the similarity is evaluated based on the normal anatomy codes that correspond to normal-appearing reconstructions.Conversely, image retrieval using S abnormal should be accompanied by high mean Dice coefficients in the categories of the abnormalities because it evaluates the similarity based on the abnormal anatomy codes relevant to tumor segmentation labels.Therefore, the mean Dice coefficient based on the ground-truth labels of the normal or abnormal anatomical categories was reported for the image retrieval using S normal or S abnormal , respectively.When it comes to the similarity retrieval using S sum , the mean Dice coefficients based on the ground-truth labels of both normal and the abnormal anatomical categories were averaged because the similarity measurement should represent the features of the whole images.
For comparison, a brutal search to directly maximize Dice overlap was performed for each query image to retrieve top-Q closest images.In comparing the retrieval performance using S normal or S abnormal , the brutal search retrieved images by maximizing Dice coefficients between the ground-truth labels of the normal or abnormal anatomical categories of a query image and those of reference images, respectively.For the similarity measurement using S sum , the brutal search maximized the simple summation of Dice coefficients of the ground-truth labels of the normal and abnormal anatomical categories.Similar image retrieval using the brutal search was conducted in the same MR volume-wise manner.While the brutal search requires a significant computational time and ground-truth labels for all query and reference images, the mean Dice coefficients obtained can be used as an oracle (technical upper bound).

Results
Here, we first present an example training process of the feature decomposing network.Then, several configurations of the feature decomposing network, particularly focusing on the size of the latent representation and use of distribution matching, are compared to select a model to be exploited for the downstream task.Subsequently, using the feature decomposing network with the selected configuration, we confirm the validity of binary hashing.Lastly, the CBIR system utilizing decomposed latent codes is qualitatively and quantitatively evaluated.

Example training results of the feature decomposing network
Among several configurations of the feature decomposing network, an example of the training results of FDN 8×8 with distribution matching at epoch 400 is shown in Fig. 5.The first, second, and third rows demonstrate the input images x, corresponding entire image reconstructions x+ , and normalappearing reconstructions x− , respectively.We can observe that there is a clear distinction between two types of reconstructions, x+ and x− , especially when evaluating abnormal regions, which existed in the input images x and entire reconstructions x+ but disappeared in the normal-appearing reconstructions x− .Note that the regions where the abnormality had existed originally were replaced by imaging characteristics of normal neuroanatomy in the normal-appearing reconstructions x− .This indicates the complementary capacity derived from normal anatomy codes.Besides, due to the shared normal anatomy codes between the two types of reconstructions, x+ and x− , the appearances outside the region of abnormality seems to be almost identical with each other.Furthermore, the fourth and fifth rows in Fig. 5 indicate ground-truth segmentation labels for the abnormality (ET, TC, and NET) and prediction for the labels through the segmentation decoder, respectively.The output of segmentation labels tended to be rounded, without precisely recovering the detailed shape of each tumor region.This can be expected because the compressed representation of the latent space, where the spatial resolution is only Entire input images were reconstructed x+ (second row) based on both normal and abnormal anatomy codes, whereas normal-appearing reconstructions x− (third row) were generated on normal anatomy code only.A clear distinction can be observed between x+ and x− at abnormal regions, which existed in both x and x+ but not in x− .The fourth and fifth rows indicate ground-truth segmentation labels y representing the tumor region and prediction for the labels ŷ, respectively.The output of segmentation labels tended to be rounded and did not recover the detailed shape of each region.We consider this as a natural consequence since the compressed representation in the latent codes, which is advantageous for the computational cost of similarity search, did not have sufficient capacity to preserve the detailed feature in the input image as a tradeoff.ET, Gd-enhancing tumor; ED, peritumoral edema; NET, necrotic and non-enhancing tumor core.
8 × 8, is advantageous for computational efficiency at the time of similarity search.For the same reason, we did not pursue the generation quality of the reconstructed images as a primary purpose.Although the detailed part of the textual appearance as realistic MR images was not perfectly reproduced, we consider that it is sufficient for recognizing anatomical location and presence of abnormality in the reconstructed images.The learning process of this example model is demonstrated in Appendix G.

Comparison between several configurations of the feature decomposing network
To evaluate the effects of the size of the latent representation and use of the distribution matching, we compared several configurations of the feature decomposing network, such as FDN 4×4 , FDN 8×8 , FDN 16×16 , and FDN 32×32 , with or without distribution matching (see Section 4.3.1).See Fig. 6 for visual results, where two types of reconstructed images, entire image reconstructions x+ and normal-appearing reconstructions x− , according to an input image are shown for each configuration.As the resolution of the latent space increased from 4 × 4 to 32 × 32, the quality of the reconstructed images was improved, showing that fine textures of the brain MR images were reproduced more realistically.As for the difference between entire reconstructions x+ and normal-appearing reconstructions x− , it is expected that there should be differences in the areas that correspond to the abnormal sites.Thus, the abnormal areas should appear only in the entire reconstructions x+ and should be diminished in the normal-appearing reconstructions x− .When the resolution of the latent representation is relatively low (i.e., FDN 4×4 and FDN 8×8 ), the difference was clear even without distribution matching.Conversely, particularly in FDN 16×16 and FDN 32×32 , the abnormal regions, which exhibited as low-intensity area in Gd-enhanced T1 sequence and high-intensity area in FLAIR sequence, were partly reproduced even in the normal-appearing reconstructions x− especially when distribution matching was not utilized (see arrows in Fig. 6).As mentioned in Section 4.4.3,we call this failure in decomposing representations as "leakiness" of abnormal features into normal anatomy codes z − q .Note that the distribution matching imposed on the normal anatomy codes mitigated this leakiness and encouraged normal-appearing reconstructions to replace the region of abnormality with normal imaging features that would have existed therein if the sample is healthy (indicated by arrowheads in Fig. 6).
Subsequently, we quantitatively compared several configurations based on the reconstruction error (see Section 4.4.1),Fig. 6: Comparison of reconstructed images according to different configurations of the feature decomposing network.As the spatial resolution of the latent representation increases from FDN 4×4 to FDN 32×32 , visual fidelity of the reconstructed images tends to improve.Note that there are partially reconstructed abnormal regions (arrows) even in the normal-appearing images x− through the models with relatively high resolutions (i.e., FDN 16×16 and FDN 32×32 ).These "leaky" appearances can be alleviated by imposing the distribution matching (arrowheads).DM, distribution matching.performance of abnormality segmentation (see Section 4.4.2),leakiness of feature decomposition (see Section 4.4.3), and compactness of the codebooks (see Section 4.4.4).As shown in Fig. 7 and Fig. 8, the performances of both image reconstruction and segmentation were improved as the resolution of the latent representation increased, which is demonstrated by the reduced reconstruction errors and increased Dice coefficients, respectively.Since the distribution matching is expected to impose some regularization effects on the codebooks, the performance of image reconstruction and segmentation could be degraded by training with distribution matching; however, the data also indicate that it did not cause any negative effect on both image reconstruction and segmentation performance.For the leakiness evaluated in Fig. 9, without the distribution matching, the tendency that the normal-appearing reconstructions unintentionally contained distinguishable features of abnormality became apparent when the latent resolution increased.For example, this can be evident at FDN 32×32 , where PPV for the origin of diseased images of the classification network trained based on normal-appearing reconstructions (see the orange bar in Fig. 9) was as high as those of the classification network trained based on entire reconstructions (see the yellow and gray bars in Fig. 9).However, the leakiness of abnormal features could be mitigated to some extent by introducing distribution matching.Note that PPVs for the origin of diseased images of the classification network trained on normal-appearing reconstructions (see the orange bars in Fig. 9) consistently decreased by imposing distribution matching (see the blue bars in Fig. 9).Regarding the compactness of the codebooks shown in Table 1, models with lower latent resolutions showed higher value.Besides, it is noteworthy that distribution matching had the effect of rendering the codebooks more compact with the higher ratio of insignificant code vectors, which can be observed except for the abnormal anatomy codes of FDN 16×16 with distribution matching.See Appendix E for the distribution of norms of code vectors in each configuration, which also supports that models with low latent space resolution had more code vectors with small norms.
In summary, there is a tradeoff between the performance of image reconstruction and segmentation and leakiness of feature decomposition and compactness of the codebooks according to the spatial resolution of latent representations.Note that, when the length and width of the latent space doubles (e.g., from 8×8 to 16 × 16), the computational complexity of the code vectors quadruples.Hereinafter, we selected FDN 8×8 with distribution matching for the following evaluation because the model exhibited intermediate characteristics, enabling good feature decomposition and simple latent representation with compact codebooks, which will be favorable at the time of similarity calculation.

Assessment of the binarization process
Using FDN 8×8 with distribution matching as the feature decomposing network, we validated the process of the binarization of code vectors (see Section 4.4.5).Based on the techniques described in Section 3.2.2, the length of the optimized binarized vector was shortened from 13, 816 to 789 for the normal anatomy codes and 292 for the abnormal anatomy codes.The compression ratios E * E for the normal and abnormal anatomy codes were 0.60% and 0.22%, respectively.Fig. 10 demonstrates an example result on the relationship between Hamming distance D H and Euclidean distance D E before (Fig. 10a) and after (Fig. 10b) optimization of the vector length.Note that, even though the optimization process considered only the nearest neighbor relationship, the global distance relationship was also maintained.The concordances of the top 1, 5, and 10 closest relationships between the Hamming distance calculation using the optimized binarized codebook b * and the Euclidean distance calculation using the continuous codebook e were 0.81, 0.89, and 0.90 for normal anatomy codes and 0.81, 0.88, and 0.90 for abnormal anatomy codes, respectively.Notably, the relationships between the binarized codebook b before the optimization and the continuous codebook e were at similar levels, demonstrating that the concordances of top 1, 5, and 10 closest relationships were 0.82, 0.86, and 0.89 for normal anatomy codes and 0.84, 0.87, and 0.89 for abnormal anatomy codes, respectively.Therefore, the optimization process of the binarized vector length did not hinder the distance relationship with respect to the original Euclidean distance.Hamming distance calculation using the optimized binarized codebook b * reduced the computational time by 48.3% and 64.5% for normal and abnormal anatomy codes, respectively, compared to the Euclidean distance calculation using the continuous codebook e (see Appendix F).

Retrieval results based the decomposed latent codes
An example of the CBIR results showing the top 5 images with the closest latent codes based on S normal , S abnormal , and S sum is presented in Fig. 11.Similarity calculation based on normal anatomy codes S normal retrieved images with similar normal anatomical labels irrespective of gross abnormalities (Fig. 11a).Similarity calculation based on abnormal anatomy codes S abnormal retrieved images with similar abnormal anatomical labels (Fig. 11b).Note that the variety of normal anatomical contexts of the retrieved images accompanied similar abnormal lesions.In the similarity retrieval using S sum , as formulated in Eq. ( 17), the similarity measurements between the normal and abnormal anatomy codes were summed.As shown in Fig. 11c, the whole imaging features of the retrieved images using sum resemble those of the query image.In this example, Euclidean distance D E was employed based on the continuous codebooks e.
According to the method described in Section 4.4.6,we performed top 10 nearest neighbor search between query images and reference dataset for quantitative evaluation.Here, the query images are the slices with the largest tumor area in each MR volume, and the reference dataset contains the rest of images in the test dataset, except for the images in the same MR volume of each query image.The similarity measurements utilized S normal , S abnormal , and S sum .Mean Dice coefficients were assessed according to the semantics exploited in the similarity measurement: Dice coefficients between ground-truth labels of the normal anatomical categories were averaged when using S normal , those between ground-truth labels of the abnormal anatomical categories were averaged when using S abnormal , and those between ground-truth labels of the normal anatomical categories and those between ground-truth labels of the abnormal anatomical categories were averaged when using S sum .The three types of distance calculations, Euclidean distance D E , angular distance D A , and Hamming distance D H , were calculated for each similarity measurement.The results are summarized in Table 2.The technical upper bound was provided by the brutal search, which retrieved images to maximize Dice coefficients directly computing the overlap between ground-truth labels.Note that the brutal search is performed based on the ground-truth labels for all query and reference images, which is usually not given in real clinical practice.

Discussion and conclusions
Comparative diagnostic reading is critical in correct diagnosis by comparing an image of the condition to be diagnosed with corresponding normal images without abnormal findings or images that contain similar abnormal findings.To the best of our knowledge, this is the first study that proposes a deeplearning-based algorithm specifically designed to support the comparative diagnostic reading.The fundamental contribution of our study is to extend the idea of disentangled representation into a CBIR application in medical imaging.By leveraging the feature decomposing network, a medical image can be decomposed into a normal and abnormal anatomy code, each of which represents the targeted semantic component in the image.Note that the codebooks located at the bottom of the network allows decomposed latent codes to be manipulable for the CBIR framework, which can switch the semantic components to be focused in the retrieval according to users' purposes.
The proposed CBIR framework demonstrated notable results in both qualitative and quantitative evaluation (see Section 5.4).However, because of the two-staged approach to establish the CBIR framework, there are possible error origins for the final image retrieval performance, such as errors in image reconstruction, segmentation, and vector quantization in the codebooks.Moreover, the leakiness of abnormal imaging features into the normal anatomical codes can hinder retrieval performance such that the retrieval based on D normal unintentionally accompanies images with significant amount of abnormalities, which can be alleviated by distribution matching.Because it is quite overloaded as an experiment, directly comparing the image retrieval performance according to the different configurations of the feature decomposing network could not be performed.Instead, we precisely evaluated these possible origins of errors (see Section 5.2) and selected FDN 8×8 trained with distribution matching for reporting the CBIR performance owing to its preferable features for the image retrieval.
Even though there is no benchmark available for the image retrieval based on the dataset, we consider that the performance difference from the brutal search (see Table 2) can be acceptable for clinical use.Under the condition that the leakiness of abnormal imaging features is controlled, it is quite natural to speculate that the performance of the image reconstruction and segmentation has a positive relationship between the image retrieval performance using the decomposed latent codes.Therefore, if higher retrieval performance is desired and computational resources are sufficient, a model with higher latent space resolution can be used.Because the brutal search is an ideal setting, where ground-truth labels are given to all query and reference images, our CBIR framework is more versatile and useful, and furthermore, different configurations can be used for different situations.
Distance calculations using Euclidean distance D E , angular distance D A , and Hamming distance D H can be implemented according to different situations, taking into account the tradeoff between accuracy and computational efficiency at the time of similarity search.Note that, with respect to S sum , angular distance calculation D A provided a higher value than others.This can be because the magnitude of the distance values is not normalized in the Euclidean distance calculation D E and Hamming distance calculation D H , making it difficult to evenly weight the two terms, S normal (q, r) and S abnormal (q, r), in Eq. (17).It is also noteworthy that, even though the proposed binarization technique is simple, Hamming distance calculation D H approximated the Euclidean distance calculation D E with notable accuracy (see Section 5.3).Since the picture archiving and communication systems in hospitals usually contain a huge amount of medical images, our proposal of the binary hashing based on the discrete latent codes can also be effective even in the context of large-scale image search.
One may argue that the distribution matching should be applied in reconstructed images instead of latent codes.It seems to be one alternative method to decompose features of medical images, but there are some concerns.One is an issue called "posterior collapse," which is caused by a powerful decoder ignoring latent codes, which can be observed in many VAE models (van den Oord et al., 2017).Because our primary purpose is to acquire good latent representation that can be faithfully representative for corresponding imaging features, we intentionally avoided imposing additional learning objective for decoders.Another concern is that the distribution matching for images with a resolution of 256 × 256 was computationally expensive and required a long time for training of feature decomposing networks.We also encountered a more unstable training process, which is one of the intrinsic problems of GANs.As shown in Fig. 6, where the constraints on the latent space are reflected in the differences in imaging features in the reconstructed images, we consider that distribution matching on the latent spaces should be more appropriate when it comes to CBIR utilizing latent codes.
We found that SPADE modules can effectively propagate the semantic layout obtained at the final layer of the segmentation decoder into the image reconstruction process of the image decoder.The success of the conditional image generation of the image decoder depending on the collateral input through the SPADE modules can also be observed in Fig. 9, where PPVs for the origin from diseased images of the classification network trained using the entire reconstructions were consistently > 0.9 (see yellow and gray bars in Fig. 9).However, future technological challenges may lie in this regard.Ideally, the la-tent codes representing normal and abnormal semantic components of medical images should be distributed in a decomposable manner in a single space where they can be linearly computed with each other.Because the current study employed an architecture that holds two separated latent spaces for decomposed latent codes, the similarity calculation according to the whole imaging feature (Eq.( 17)) might be deemed arbitrary.Even so, because simple autoencoders can be sufficient to calculate the similarity based on whole imaging features, we believe that our proposal is innovative to enable the CBIR to selectively utilize either normal or abnormal components of medical images to support comparative diagnostic reading.Note that sparsity of the code vectors was remarkable at the lower resolution of the latent space.Moreover, there is a tendency that distribution matching enhanced such sparsity in each level of the latent resolution.

Fig. 1 :
Fig. 1: Concept of the compositionality of medical images.It indicates that the entire medical image can be decomposed into normal and abnormal anatomy codes relevant to normal and abnormal semantic components in the image, respectively.

Fig. 2 :
Fig. 2: Shared architecture of the feature decomposing networks.Input image x is mapped to a pair of latent representations, z − e and z + e .Vector quantization is performed based on two codebooks, e − and e + , to produce normal anatomy code z −q and abnormal anatomy code z + q , respectively.The segmentation decoder predicts the semantic segmentation of abnormality ŷ from z + q .Depending on the collateral input of the softmax logit of abnormality ỹ, the image decoder reconstructs either the entire input image x+ or normal-appearing image x− .Several loss functions in training the network, including latent, reconstruction and segmentation losses, are shown, except for those related to distribution matching.

Fig. 3 :
Fig. 3: Overview of spatially adaptive normalization (SPADE) module.Softmax logit ỹ multiplied by the segmentation mask ŷ is further downsampled to achieve resolutions corresponding to those of each layer in the image decoder.SPADE module propagates the semantic layout of abnormalities into the image generation process.Each convolution block comprises a convolution function and bias parameter.

Fig. 4 :
Fig.4: Overview of the content-based image retrieval framework.First, a query image and reference images are decomposed into normal and abnormal anatomy codes.Then, by calculating the similarity of normal anatomy codes between the query image and reference images (S normal ), similar images when viewed as normal anatomies without any abnormality can be retrieved.In contrast, by calculating the similarity of abnormal anatomy codes, images with similar tumor regions can be retrieved (S abnormal ).Besides, the similarity retrieval based on the whole imaging features can be applied by calculating the similarity combining both normal and abnormal anatomy codes (S sum ).Note that these similarity measurements can be calculated based on different distance definitions, such as Euclidean distance, angular distance, and Hamming distance.

Fig. 5 :
Fig. 5: Example results of model training based on FDN 8×8 with distribution matching.The first row indicates the input images x.Entire input images were reconstructed x+ (second row) based on both normal and abnormal anatomy codes, whereas normal-appearing reconstructions x− (third row) were generated on normal anatomy code only.A clear distinction can be observed between x+ and x− at abnormal regions, which existed in both x and x+ but not in x− .The fourth and fifth rows indicate ground-truth segmentation labels y representing the tumor region and prediction for the labels ŷ, respectively.The output of segmentation labels tended to be rounded and did not recover the detailed shape of each region.We consider this as a natural consequence since the compressed representation in the latent codes, which is advantageous for the computational cost of similarity search, did not have sufficient capacity to preserve the detailed feature in the input image as a tradeoff.ET, Gd-enhancing tumor; ED, peritumoral edema; NET, necrotic and non-enhancing tumor core.

Fig. 7 :
Fig. 7: Image reconstruction results.Mean ± standard deviation of the image reconstruction error (see Section 4.4.1) is shown for each configuration of the feature decomposing network.As the resolution of the latent representation increased from FDN 4×4 to FDN 32×32 , the quality of image reconstruction was improved with decreased reconstruction errors.Note that the distribution matching did not cause any negative effect.FDN, feature decomposing network; DM, distribution matching.

Fig. 8 :
Fig.8: Segmentation results.Mean ± standard deviation of the averaged Dice scores for the abnormality labels (see Section 4.4.2) is shown for each configuration of the feature decomposing network.As the resolution of the latent representation increased from FDN 4×4 to FDN 32×32 , the segmentation performance was improved.Note that the distribution matching did not cause any negative effect.FDN, feature decomposing network; DM, distribution matching.

Fig. 9 :
Fig. 9: Evaluation of the leakiness in the normal-appearing reconstructions.A classification network was trained based on four different reconstructions, such as entire reconstructions, entire reconstructions with distribution matching, normal-appearing reconstructions, and normal-appearing reconstructions with distribution matching, to classify whether the original images included the abnormality or not (see Section 4.4.3).Here, leakiness of feature decomposition means the tendency that the normal-appearing images unintentionally contained distinguishable features of abnormality.Mean ± standard deviation of positive predictive values (PPVs) for the origin from diseased images measured in the test dataset are shown as an indicator of the leakiness.As the latent resolution increased, the PPVs increased (orange bars), especially without the distribution matching.This leakiness of abnormal features was alleviated by imposing the distribution matching (blue bars).FDN, feature decomposing network; DM, distribution matching.

Fig. 10 :
Fig. 10: Relationship between Hamming distance and Euclidean distance.The distances between one code vector and all other code vectors were calculated by Hamming distance (horizontal axis) and Euclidean distance (vertical axis).(a) Distance relationship between the continuous codebook and binarized codebook before optimization.(b) The same relationship after the optimization of the binarized codebook length.Note that not only the nearest relationship but also the global relationship is maintained through the optimization process.

Fig. 11 :
Fig. 11: Example results of CBIR based on the decomposed latent codes.Image retrieval was performed based on volume, comparing the closest images from each MR volume.Each similarity was calculated according to Euclidean distance D E based on the continuous codebooks e. Retrieved images are arranged from left to right, starting with the closest one to the query vector.(a) Similarity calculation based on normal anatomy codes S normal retrieved images with similar normal anatomical labels, irrespective of gross abnormalities.Query and retrieved images with ground-truth labels of the normal anatomical categories are shown.(b) Similarity calculation based on abnormal anatomy codes S abnormal retrieved images with similar abnormal anatomical labels.Query and retrieved images with ground-truth labels of the abnormal anatomical categories are shown.Note the variety of normal anatomical contexts of the retrieved images.(c) Similarity calculation using S sum retrieved images with combined features, as shown by the similar patterns of both normal and abnormal anatomical labels among the retrieved images.Query and retrieved images with ground-truth labels of the normal and abnormal anatomical categories are shown.ET, Gd-enhancing tumor; ED, peritumoral edema; NET, necrotic and non-enhancing tumor core.

Fig
Fig. E.1: Distribution of norms in the codebooks.For each configuration with or without distribution matching (DM), the magnitude of norms (vertical axis) along with a total of 512 (= K) code vectors (horizontal axis) is indicated.Note that sparsity of the code vectors was remarkable at the lower resolution of the latent space.Moreover, there is a tendency that distribution matching enhanced such sparsity in each level of the latent resolution.

Table 1 :
Compactness of the codebooks.The compactness, which is the ratio of insignificant code vectors showing small L2 norms below a threshold to the total number of the code vectors (see Section 4.4.4), is shown for each configuration of the feature decomposing network.FDN, feature decomposing network; DM, distribution matching.

Table 2 :
Quantitative evaluation of the image retrieval.Mean ± standard deviation of the averaged Dice coefficients for ground-truth labels with respect to the semantic components exploited by the similarity evaluation is shown.