Decomposing Normal and Abnormal Features of Medical Images for Content-based Image Retrieval

Medical images can be decomposed into normal and abnormal features, a property we refer to as compositionality. Based on this idea, we propose an encoder-decoder network that decomposes a medical image into two discrete latent codes: a normal anatomy code and an abnormal anatomy code. Using these latent codes, we demonstrate similarity retrieval focusing on either the normal or abnormal features of medical images.


Introduction
In medical imaging, the characteristics purely derived from a disease should reflect the extent to which abnormal findings deviate from the normal features that would otherwise have existed. Indeed, physicians often need corresponding normal images without the abnormal findings of interest or, conversely, images that contain similar abnormal findings regardless of normal anatomical context. This is called comparative diagnostic reading of medical images, which is essential for a correct diagnosis. To support comparative diagnostic reading, content-based image retrieval (CBIR) utilizing either normal or abnormal features will be useful.
Here, we define this two-tiered nature of normal and abnormal features as the compositionality of medical images. Accordingly, we consider a method to decompose a medical image into two low-dimensional representations, where the two latent codes representing normal and abnormal anatomies should collectively suffice to reconstruct the original image (Figure 1).
To the best of our knowledge, few studies have focused on the compositionality of medical images. Recently, image-to-image translation techniques, mainly derived from CycleGAN (Zhu et al., 2017), have been exploited to disentangle the domain-specific variations of medical images (Xia et al., 2020; Liao et al., 2020; Vorontsov et al., 2019). For example, Xia et al. successfully transformed an input image with pathology into a normal-appearing image by leveraging several cycle-consistency losses (Xia et al., 2020). However, these approaches did not treat the compositionality in a separable manner, i.e., their latent spaces were not explicitly designed for downstream tasks.
In this paper, we propose an encoder-decoder network to project a medical image into a pair of latent spaces, which produce a normal anatomy code and an abnormal anatomy code, respectively. Using these latent codes, our CBIR framework can retrieve images by focusing on either normal or abnormal features, showing promising performance in qualitative evaluations.

Methodology
The proposed network consists of an encoder and two decoders, a segmentation decoder and an image decoder (Figure 2). A pair of discrete latent spaces at the bottleneck produces the normal and abnormal anatomy codes separately. See Appendix A for the details of the network architecture.
Feature encoding: The encoder takes a two-dimensional medical image $x \in \mathbb{R}^{C \times H \times W}$ as input and maps it into two latent representations, $z_e^- \in \mathbb{R}^{D \times H' \times W'}$ and $z_e^+ \in \mathbb{R}^{D \times H' \times W'}$, where $z_e^-$ and $z_e^+$ correspond to the features of normal and abnormal anatomies, respectively. We use $z_e^{\mp}$ to denote both features. Subsequently, vector quantization is used to discretize $z_e^{\mp}$: each elemental vector $z_{e_k}^{\mp} \in \mathbb{R}^D$ is replaced with the closest code vector in the corresponding codebook $e^{\mp} \in \mathbb{R}^{D \times K}$ comprising $K$ code vectors. The codebooks are updated as in VQ-VAEs (van den Oord et al., 2017; Razavi et al., 2019). We denote the quantized counterpart of $z_e^{\mp}$ as $z_q^{\mp}$. Here, $z_q^-$ is referred to as the normal anatomy code and $z_q^+$ as the abnormal anatomy code.
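As a concrete illustration, the nearest-neighbor lookup that replaces each spatial feature vector with a codebook entry can be sketched as follows. This is a minimal NumPy sketch; the function name and array layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def quantize(z_e, codebook):
    """Replace each D-dimensional vector in z_e with its nearest codebook entry.

    z_e:      (D, H, W) encoder output (normal or abnormal feature map)
    codebook: (K, D) array of K code vectors
    Returns (z_q, indices), where z_q has the same shape as z_e.
    """
    D, H, W = z_e.shape
    flat = z_e.reshape(D, -1).T                       # (H*W, D) spatial vectors
    # Squared L2 distance from every spatial vector to every code vector
    dists = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = dists.argmin(axis=1)                        # nearest code per location
    z_q = codebook[idx].T.reshape(D, H, W)
    return z_q, idx.reshape(H, W)
```

In training, the straight-through estimator would copy gradients from `z_q` back to `z_e`, since the argmin itself is not differentiable.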
Feature decoding: The segmentation decoder takes the abnormal anatomy code $z_q^+$ as input and outputs a segmentation label $y \in \mathbb{R}^{C' \times H \times W}$. Meanwhile, the image decoder $f$ performs conditional image generation using spatially-adaptive normalization (SPADE) (Park et al., 2019). SPADE is designed to propagate semantic layouts into the image synthesis process (Appendix B). The image decoder takes the normal anatomy code $z_q^-$ as its primary input. When the image decoder is encouraged to reconstruct the entire input image $\hat{x}^+$, the logit of the segmentation decoder $\tilde{y}$ is transmitted to each layer of the image decoder via the SPADE modules ($f(z_q^-, \tilde{y}) = \hat{x}^+$). When null information, where $\tilde{y}$ is filled with zeros, is propagated to the SPADE modules, the image decoder generates a normal-appearing image $\hat{x}^-$ ($f(z_q^-, 0) = \hat{x}^-$).

Learning objectives: We defined several loss functions (see Appendix C for details): a latent loss $\mathcal{L}_{lat}$ for optimizing the encoder and the codebooks, a discrimination loss $\mathcal{L}_{dis}$ for the encoder to identify the presence of abnormality, a segmentation loss $\mathcal{L}_{seg}$ for the segmentation decoder, and a reconstruction loss $\mathcal{L}_{rec}$ and residual loss $\mathcal{L}_{res}$ for the conditional image generation performed in coordination between the two decoders. The overall objective can be summarized as $\mathcal{L}_{total} = \mathcal{L}_{lat} + \lambda_1 \mathcal{L}_{dis} + \lambda_2 \mathcal{L}_{seg} + \lambda_3 \mathcal{L}_{rec} + \lambda_4 \mathcal{L}_{res}$, where the $\lambda$s balance the terms. The learning configuration and an example of the model training result are presented in Appendix D and Appendix E, respectively.

Content-based image retrieval: After training, the encoder has learned to decompose medical images into normal and abnormal anatomy codes. We defined three distance measurements according to the types of latent codes: $D^-_{normal}$ for normal anatomy codes, $D^+_{abnormal}$ for abnormal anatomy codes, and $D_{concat}$ for the concatenation of the two codes.
The L2 distance was calculated between the query and reference latent codes (see Appendix F for the overview of the proposed CBIR method).
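The retrieval step itself reduces to a nearest-neighbor search over flattened latent codes. A minimal sketch follows; the function name, shapes, and brute-force search are assumptions for illustration, not the authors' code.

```python
import numpy as np

def retrieve(query_code, reference_codes, top_k=5):
    """Rank reference images by L2 distance between flattened latent codes.

    query_code:      array of a fixed shape (e.g., a quantized code map)
    reference_codes: list of arrays with the same shape as query_code
    Returns the indices of the top_k closest reference images.
    """
    q = query_code.ravel()
    dists = np.array([np.linalg.norm(q - r.ravel()) for r in reference_codes])
    return np.argsort(dists)[:top_k].tolist()
```

For $D_{concat}$, the normal and abnormal codes of the query (possibly taken from two different images) would be concatenated before the distance computation.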

Dataset
We used brain magnetic resonance (MR) images with gliomas from the 2019 BraTS Challenge (Menze et al., 2015; Bakas et al., 2017, 2017a,b), containing a training dataset with 355 patients, a validation dataset with 125 patients, and a test dataset with 167 patients. Among the T1, gadolinium (Gd)-enhancing T1, T2, and FLAIR sequences, only the Gd-enhanced T1-weighted sequence was used. The training dataset contains three segmentation labels of abnormality: Gd-enhancing tumor (ET), peritumoral edema (ED), and necrotic and non-enhancing tumor core (TC). We used the training dataset to train the networks. Furthermore, each image in the validation and test datasets was segmented into six normal anatomical labels (left and right cerebrum, cerebellum, and ventricles) and three abnormal labels (ET, ED, and TC). The validation and test datasets were used as query and reference datasets, respectively, for the performance evaluation of CBIR.

Results
Example CBIR results showing the five images with the closest latent codes based on $D^-_{normal}$, $D^+_{abnormal}$, and $D_{concat}$ are presented in Figure 3. Distance calculation based on normal anatomy codes $D^-_{normal}$ retrieved images with similar normal anatomical labels irrespective of gross abnormalities (Figure 3a). Distance calculation based on abnormal anatomy codes $D^+_{abnormal}$ retrieved images with similar abnormal anatomical labels (Figure 3b); note the variety of normal anatomical contexts among the retrieved images. In the calculation using $D_{concat}$, the query latent code was formed by combining the normal anatomy code of the left image (CBICA ANK 1) and the abnormal anatomy code of the right image (WashU W047 1) (Figure 3c). Note that the normal and abnormal anatomies of the retrieved images resemble those of the left and right query images, respectively.

Figure 3: Example results of content-based image retrieval based on similarities of decomposed latent codes. Images are shown with the normal or abnormal anatomical labels that correspond to the semantics utilized in the retrieval. Although the retrieval employed latent codes instead of label information, the retrieved images were accompanied by labels similar to those of the query images. Patient identifiers with slice numbers are noted.

Conclusion
We demonstrated a CBIR framework focusing on the semantic composition of medical images. This application can be useful for supporting comparative diagnostic reading, which is essential for a correct diagnosis. We will further evaluate the quantitative performance of the proposed method.

Acknowledgments
We are grateful to Dr.

The detailed architecture of the SPADE module is shown in Figure B.1.

Appendix C. Details of Learning Objectives
Several loss functions were designed for the training. Hereinafter, the superscript $\mp$ indicates that a particular term applies to either the path based on the normal anatomy code ($-$) or the abnormal anatomy code ($+$).

Latent loss: In the learning framework of the VQ-VAE (van den Oord et al., 2017; Razavi et al., 2019), the latent loss $\mathcal{L}_{lat}$ is optimized for acquiring latent embeddings for data samples. We define $\mathcal{L}_{lat}$ as the sum of $\mathcal{L}^-_{lat}$ and $\mathcal{L}^+_{lat}$ for the normal and abnormal anatomy codes, respectively:

$\mathcal{L}^{\mp}_{lat} = \| \mathrm{sg}[z_e^{\mp}] - z_q^{\mp} \|_2^2 + \beta \| z_e^{\mp} - \mathrm{sg}[z_q^{\mp}] \|_2^2,$

where $\mathrm{sg}$ represents the stop-gradient operator, which serves as an identity function at forward computation time and has zero partial derivatives. During training, the codebook loss (the first term) updates the codebook variables by moving the selected code vectors toward the output of the encoder, while the commitment loss (the second term) encourages the output of the encoder to move closer to the selected code vectors.

Discrimination loss: Because the input images do not always convey abnormal findings, the encoder must be able to distinguish the abnormalities. To implement a discriminative function in the encoder, we extend the commitment loss for the abnormal anatomy code: the encoder is trained to minimize the commitment loss when abnormalities exist, whereas for normal input images the encoder is encouraged to increase this term up to a threshold $\pi$, a positive scalar. We define this loss function as the discrimination loss for the path to the abnormal anatomy code.

Segmentation loss: The segmentation decoder infers the segmentation labels, which are classified into the $K (= 3)$ abnormal segmentation categories of the training dataset. The loss function for the output of the segmentation decoder is a composite of the generalized Dice loss (Sudre et al., 2017) and the cross-entropy loss, where $\tilde{y}$ indicates the logit output of the segmentation decoder, $N$ is the number of pixels, and the class weight $w_k = 1 / (\sum_n^N y_{kn})^2$ mitigates the class imbalance problem.

Figure B.1: Logit $\tilde{y}$, masked with the segmentation label $\hat{y}$, is further downsampled to achieve resolutions corresponding to those of each layer in the image decoder. The SPADE module propagates the semantic layout of abnormalities into the image generation process.
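The hinge-like behavior of the discrimination loss can be sketched as follows. This is one plausible reading of the description above, with illustrative names; the authors' exact formulation may differ.

```python
import numpy as np

def commitment(z_e, z_q):
    """Mean squared distance between encoder output and its quantized code."""
    return float(((z_e - z_q) ** 2).mean())

def discrimination_loss(z_e_plus, z_q_plus, has_abnormality, pi=1.0):
    """For abnormal inputs, minimize the commitment term directly; for normal
    inputs, push the term up toward the threshold pi via a hinge."""
    c = commitment(z_e_plus, z_q_plus)
    return c if has_abnormality else max(0.0, pi - c)
```

With this form, a normal image is penalized only while its abnormal-path commitment stays below $\pi$, so the encoder learns to keep $z_e^+$ away from the abnormal codebook unless an abnormality is present.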
Reconstruction loss: To guarantee a difference between the two types of generated images, $\hat{x}^-$ and $\hat{x}^+$, we applied a pixel-wise reconstruction loss based on the region of abnormality. Let $M \in \{0, 1\}^{C \times H \times W}$ be the mask in which pixels with any abnormality label are set to 1 and all others to 0, and let $\overline{M}$ be the complement of $M$. Briefly, $M$ represents the region of abnormality, and $\overline{M}$ the region of normal anatomy. Using these masks, the reconstruction loss $\mathcal{L}_{rec}$ combines an L2 term with a structural similarity (SSIM) term (Wang et al., 2004), with the SSIM added to the L2 loss as a constraint.
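The masked, pixel-wise part of such a reconstruction objective can be sketched as follows. Only the L2 term is shown; the SSIM constraint is omitted for brevity, and the function name is illustrative.

```python
import numpy as np

def masked_l2(x, x_hat, mask):
    """Mean squared error restricted to pixels where mask == 1.

    x, x_hat: arrays of the same shape (target and reconstruction)
    mask:     binary array of the same shape (e.g., M or its complement)
    """
    sq = (x - x_hat) ** 2 * mask
    return float(sq.sum() / max(mask.sum(), 1))
```

Evaluating this with `mask = M` penalizes errors only inside the abnormal region, which is how the loss can enforce a difference between $\hat{x}^+$ and $\hat{x}^-$ there while leaving normal regions to other terms.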
Residual loss: The image outside the region of abnormality must be the same between the two types of images, $\hat{x}^-$ and $\hat{x}^+$, generated by the image decoder, to preserve the identity between corresponding regions. Therefore, we added the residual loss $\mathcal{L}_{res}$ to guarantee the similarity between $\hat{x}^-$ and $\hat{x}^+$ over the normal regions indicated by $\overline{M}$.

Figure E.1: Results of model training. Entire input images $\hat{x}^+$ (second row) were reconstructed from both the normal and abnormal anatomy codes, whereas reconstruction as pseudo-normal images $\hat{x}^-$ (third row) relied only on the normal anatomy code. A clear distinction can be observed between $\hat{x}^+$ and $\hat{x}^-$ at abnormal regions, which exist in both $x$ and $\hat{x}^+$ but not in $\hat{x}^-$. The fourth and fifth rows show the ground-truth segmentation label $y$ for abnormality (ET, TC, and ED) and the predicted labels $\hat{y}$, respectively. The output segmentation labels tended to be spherical and did not recover the detailed shape of each region. We regard this as a natural consequence: the compressed representation in the latent codes, which is advantageous for the computational cost of similarity search, lacks the capacity to preserve the detailed features of the input image as a trade-off.

Figure F.1: Content-based image retrieval was performed on a per-image basis, i.e., each magnetic resonance volume was separated into slices along the axial axis. From the 2019 BraTS Challenge, the validation and test datasets were used as query and reference datasets, respectively. Every image in the reference dataset was decomposed into normal and abnormal anatomy codes in advance. A query image was likewise decomposed into two latent codes, and the reference images with the most similar latent codes were then retrieved from the reference database.