MSCDA: Multi-level Semantic-guided Contrast Improves Unsupervised Domain Adaptation for Breast MRI Segmentation in Small Datasets

Deep learning (DL) applied to breast tissue segmentation in magnetic resonance imaging (MRI) has received increased attention in the last decade; however, the domain shift that arises from different vendors, acquisition protocols, and biological heterogeneity remains an important but challenging obstacle on the path towards clinical implementation. In this paper, we propose a novel Multi-level Semantic-guided Contrastive Domain Adaptation (MSCDA) framework to address this issue in an unsupervised manner. Our approach incorporates self-training with contrastive learning to align feature representations between domains. In particular, we extend the contrastive loss by incorporating pixel-to-pixel, pixel-to-centroid, and centroid-to-centroid contrasts to better exploit the underlying semantic information of the image at different levels. To resolve the data imbalance problem, we utilize a category-wise cross-domain sampling strategy that samples anchors from target images and builds a hybrid memory bank to store samples from source images. We validate MSCDA on a challenging task of cross-domain breast MRI segmentation between datasets of healthy volunteers and invasive breast cancer patients. Extensive experiments show that MSCDA effectively improves the model's feature alignment between domains, outperforming state-of-the-art methods. Furthermore, the framework is shown to be label-efficient, achieving good performance with a smaller source dataset. The code is publicly available at \url{https://github.com/ShengKuangCN/MSCDA}.


Introduction
Breast cancer is the most commonly diagnosed cancer in women and contributes to 15% of cancer deaths in women worldwide, ranking as a leading cause of death in many countries (Francies et al. (2020); Sung et al. (2021)). Rising mortality rates of breast cancer, especially in developing countries and low-income regions, increase the burden on patients, their families, and society, highlighting the need for early detection and intervention (Azamjah et al. (2019)). In the past decades, breast magnetic resonance imaging (MRI) has been recommended to supplement conventional mammography and ultrasound techniques to screen women at a high risk of breast cancer and determine the extent of breast cancer after diagnosis (Saslow et al. (2007); Lowry et al. (2022); Sardanelli et al. (2017)). Deep learning (DL) methods have shown promise for automating breast MRI segmentation, but a major concern is performance degradation due to large inhomogeneities present in MRI datasets, leading to differing imaging feature distributions between training (source domain) and testing (target domain) datasets, also known as the domain shift problem. Inhomogeneities in MRI datasets primarily stem from two factors: acquisition heterogeneity and biological heterogeneity. Acquisition heterogeneity refers to variations in acquisition protocols, machine vendors, contrast-agent enhancement, and reconstruction algorithms, while biological heterogeneity encompasses differences in breast size and density, menstrual cycle effects, and stage of disease progression. Additionally, factors such as patient positioning, motion artifacts, and imaging artifacts may contribute to dataset inhomogeneities. These inconsistencies within the images may lead to unstable performance of DL models, as highlighted in recent studies (Granzier et al. (2022)).
Although this problem could be addressed by acquiring large and varied datasets of accurately annotated target images for training, this exercise would be labor-consuming and expensive, and is further hindered by legal and ethical considerations regarding the sharing of patient data. Thus, recent published studies (Hoffman et al. (2018); Hoyer et al. (2022); Hoffman et al. (2016)) focus on developing unsupervised domain adaptation (UDA) methods to mitigate the distribution discrepancy without target labels.
Contrastive learning explicitly computes the inter-category similarity between pixel representation pairs (Zhong et al. (2021); Zhao et al. (2021a)), referred to as pixel-to-pixel (P2P) contrast, aiming to learn an invariant representation in feature space. However, it still suffers from two major concerns that have not been taken into account: (i) P2P contrast skips the structural context of adjacent pixels, so it does not extensively exploit the semantic information present in the MRI scan. To alleviate this problem, we propose a method to integrate different levels of semantic information into a contrastive loss function. More specifically, the mean of the pixel representations of a specific category, i.e., the centroid, should be similar to the pixels contained in the region. Likewise, centroids, regardless of whether they are from the same domain, should be close to centroids of the same category and far away from centroids of other categories. We denote these two relations as pixel-to-centroid (P2C) and centroid-to-centroid (C2C), respectively. (ii) A common practice to perform inter-category contrast is to generate positive and negative pairs by sampling partial pixel representations in a mini-batch (Chaitanya et al. (2020)). However, the imbalanced proportion between background and regions of interest (ROIs) in breast MRIs makes it challenging to obtain adequate pairs during training. To address this problem, we build a hybrid memory bank and optimize the sampling strategy to ensure enough cross-domain positive and negative pairs, especially for highly imbalanced mini-batches. Additionally, we explore the impact of anchors and samples from different domains on model performance.
In summary, we extend the contrastive UDA framework for breast segmentation to further mitigate the domain shift problem. To the best of our knowledge, this is the first attempt to apply contrastive UDA in breast MRI. We briefly provide the novel contributions of our work as follows: 1. To solve the domain shift problem in breast MRI, we develop a novel Multi-level Semantic-guided Contrastive Domain Adaptation (MSCDA) framework for crossdomain breast tissue segmentation. 2. To exploit the semantic information present in source labels, we propose a method that combines pixel-to-pixel, pixel-to-centroid and centroid-to-centroid contrasts into the loss function. 3. To resolve the data imbalance problem, we develop a hybrid memory bank that saves both category-wise pixel and centroid samples. We further investigate a category-wise cross-domain sampling strategy to form adequate contrastive pairs. 4. To validate the performance of the UDA framework, we replicate our experiment under multiple source datasets of different sizes. The results show robust performance and label-efficient learning ability. We further show that our framework achieves comparable performance to supervised learning.

Semantic Segmentation
Semantic segmentation is a fundamental topic in computer vision, concerned with automatically categorizing each pixel (or voxel) into one or more categories. In recent years, convolutional neural networks (CNNs) have shown significant results in multiple fields. The fully convolutional network (FCN) (Long et al. (2015)), one of the most remarkable early-stage segmentation architectures, demonstrated the pixel-level representation learning ability of CNNs. However, early CNNs left considerable room for improvement in terms of accuracy and efficiency, and many mechanisms have been proposed to improve segmentation performance. For instance, U-Net (Ronneberger et al. (2015)) introduced skip connections in an encoder-decoder design to alleviate the vanishing gradient problem; DeepLab v3+ proposed Atrous Spatial Pyramid Pooling (ASPP) to capture more context information over multi-scale receptive fields. Meanwhile, inspired by the effectiveness of residual blocks, ResNet (He et al. (2016)) has been adopted as the backbone in many encoder-decoder segmentation frameworks (Wu et al. (2019); Zhang et al. (2018a); He et al. (2019)) to provide deep feature representations.

Contrastive Learning
Contrastive learning (CL) was introduced as a self-supervised learning framework, allowing the model to learn representations without labels (Oord et al. (2018); He et al. (2020); Chen et al. (2020b,a); Grill et al. (2020)). An essential step of early CL methods is to build a pretext task, such as instance discrimination (Wu et al. (2018); He et al. (2020); Chen et al. (2020a)), to discriminate a positive pair (two augmented views of an identical image) from negative pairs (augmented views of other images). Based on this pioneering approach, many subsequent mechanisms have been proposed to improve the representation learning ability. For example, MoCo v1 (He et al. (2020)) and MoCo v2 (Chen et al. (2020b)) combined a momentum encoder with a first-in-first-out queue serving as a memory bank to maintain more negative samples. This improves classification performance on benchmarks such as ImageNet (Deng et al. (2009)) and enables training the network on commodity graphics processing units (GPUs). Afterwards, the projection head (Chen et al. (2020a)) and the prediction head (Grill et al. (2020)) were introduced to improve classification accuracy on downstream tasks.
For semantic segmentation tasks, recent CL works leverage the pixel-level labels as supervised signals (Zhao et al. (2021a); Zhong et al. (2021); Hu et al. (2021); Wang et al. (2021); Chaitanya et al. (2020)). The underlying idea is to group pixel representations from the same category and to separate pixel representations from different categories. Zhao et al. (2021a) introduced a label-efficient two-stage method that pre-trains the network using a P2P contrastive loss and then fine-tunes it using the cross-entropy (CE) loss (Bishop and Nasrabadi (2006)). PC2Seg (Zhong et al. (2021)) improved this method in a one-stage semi-supervised learning (SSL) approach by jointly updating the network weights with a pixel contrastive loss and a consistency loss. ContrastiveSeg (Wang et al. (2021)) combined a pixel-to-region contrastive loss to explicitly leverage context relations across images. It also shows that storing samples from recent batches can boost segmentation performance, especially when the training batch size is limited by device memory. Similar to Zhong et al. (2021); Wang et al. (2021), the authors in Chaitanya et al. (2020) validated the effectiveness of sampling strategies for contrastive learning on multiple medical MRI segmentation tasks. Furthermore, they suggest that a sampling strategy involving cross-image negative sampling can yield additional performance improvements. Although CL has shown great potential in segmentation tasks, its performance in domain adaptation problems remains largely unexplored.

Unsupervised Domain Adaptation
Unsupervised Domain Adaptation (UDA) is used to generalize learned knowledge from a labeled source domain to an unlabeled target domain. The key challenge of UDA is domain shift, i.e., the inconsistent data distribution across domains, which usually causes performance degradation of models. Early machine learning methods utilized different feature transformations or regularizations to overcome this problem (Kouw and Loog (2018); Mehrkanoon and Suykens (2017); Mehrkanoon (2019)).
A number of existing DL methods address the domain shift problem using adversarial learning or self-training-based approaches. Adversarial learning utilizes generative adversarial networks (GANs) (Goodfellow et al. (2014)) to align the distributions of the feature space (Tzeng et al. (2017); Guan and Liu (2021)). CycleGAN-based approaches, in particular, are popular because of their ability to translate the 'style' of the source domain to the target domain in an unpaired way. While CycleGAN-based unsupervised domain adaptation (UDA) methods have shown promising results, they are known to require a large amount of data to learn effective mappings between domains, and can be prone to mode collapse, leading to limited output variations.
Self-training, frequently used in SSL, uses the predictions of the target domain as pseudo-labels and retrains the model iteratively. A typical self-training network (Tarvainen and Valpola (2017)) generates pseudo-labels from a momentum teacher network and distills knowledge to the student network by using a consistency loss. The authors in Perone et al. (2019); Perone and Cohen-Adad (2018) improved the self-training method by aligning the geometrical transformation between the student and teacher networks. DART (Shanis et al. (2019)) and MT-UDA (Zhao et al. (2021b)) combined self-training with adversarial learning in different ways, both achieving promising results. For imbalanced datasets, different denoising methods and sampling strategies have been proposed to improve the quality of pseudo-labels (Zhang et al. (2021); Hoyer et al. (2022); Xie et al. (2022)). Recent self-training approaches, such as those described in Xie et al. (2022); Zhang et al. (2022), have followed the paradigm of Chaitanya et al. (2020) to align the features, achieved by sampling or merging contrastive features across categories. This demonstrates that the integration of CL can improve the alignment of features at the pixel level. Additionally, the use of a memory bank to expand negative samples has been shown to enhance performance in unsupervised domain adaptation tasks, while enabling training on a normal device. Inspired by the above-mentioned studies, we integrate three kinds of contrastive losses and a category-wise cross-domain sampling strategy to accomplish the UDA segmentation task for breast MRI.

Table 1. Notations used in this paper.

x_s, x_t, y_s, ŷ_s: source image, target image, source image ground truth, and its one-hot representation, respectively;
p_s, p_t: student network probability maps of the source and target images, respectively;
p′_s, p′_t: teacher network probability maps of the source and target images, respectively;
z_t: student network feature embedding of the target image;
z′_s: teacher network feature embedding of the source image;
ŷ_t, ŷ′_t: one-hot pseudo-labels of p_t and p′_t, respectively (ŷ = argmax(p));
v_s^k, v_t^k: pixel feature embeddings of category k of the source and target images, respectively;
c_s^k, c_t^k: centroid feature embeddings of category k of the source and target images, respectively;
Q_pixel, Q_centroid: pixel queue and centroid queue in the memory bank.

Problem Definition
Source domain data and target domain data are the two sets of data used in the domain adaptation problem. The source domain data X_s = {x_s^i}_{i=1}^M have pixel-level labels, whereas the target domain data X_t = {x_t^i}_{i=1}^N are unlabeled. We aim to develop a method that learns from the labeled source domain and can be applied to the target domain. In particular, the learned network is used to classify each pixel of a target domain image into K categories. A direct approach is to train the network in a supervised manner on the source domain and apply it directly to the target domain. However, the performance of the network often drops because of the aforementioned domain gap between the source and target domains. To address this concern, we propose a new domain adaptation approach, named MSCDA, based on the combination of self-training and contrastive learning.

Overall Framework
The proposed domain adaptation framework is depicted in Fig. 1. It consists of a student network and a momentum teacher network. The student network consists of four main components: a feature encoder f_e, a feature decoder f_d, a projection head f_proj, and an additional prediction head f_pred. These components are correspondingly mapped in the teacher network, with the only exception of the last component (i.e., the prediction head). The three components of the teacher network are denoted f′_e, f′_d and f′_proj. The important notations are listed in Table 1.
In the student network, the feature encoder f_e maps the input MRI image x ∈ R^{H×W×1} into a high-dimension feature map h ∈ R^{H′×W′×C}. Next, h is transferred into a segmentation probability map p ∈ R^{H×W×K} and a low-dimension feature embedding z ∈ R^{H′×W′×D} through two forward passes, hereafter referred to as the segmentation and contrast paths, respectively. In the first forward pass (segmentation path), the decoder f_d generates the segmentation probability map p from the input h. In the second forward pass (contrast path), the projection head f_proj and prediction head f_pred jointly reduce the feature map into a low-dimension projected feature embedding z = f_pred(f_proj(h)). Similar steps are conducted in the teacher network, yielding the momentum probability map p′ and feature embedding z′. Finally, the probability maps p and p′ are used for self-training, while the projected feature embeddings z and z′ are used for semantic-guided contrastive learning to diminish the discrepancy between the two domains. The overall loss function is given by:

L = L_seg + λ_1 L_con + λ_2 L_ctr,    (1)

where L_seg is the supervised segmentation loss, L_con is the consistency loss, L_ctr is the contrastive loss, and λ_1 and λ_2 are the regularization coefficients of the corresponding losses. The sum of the segmentation and consistency losses is henceforth referred to as the self-training loss. We elaborate the self-training loss in Section 3.3 and our proposed contrastive loss in Section 3.4.

Self-training
Following the self-training paradigm (Perone et al. (2019)), two optimization goals are established. The first is to perform supervised learning on the student network using the source image labels. The second is for the student network to learn from the pseudo-labels generated by the teacher network, thereby distilling knowledge from the target images. Only the weights in the segmentation path of both networks are updated in this phase.

Supervised Learning
In supervised learning, we employ a hybrid segmentation loss (Isensee et al. (2018)) that combines the Dice loss (Sudre et al. (2017)) and the CE loss, formulated as:

L_seg = L_Dice(p_s, ŷ_s) + L_CE(p_s, ŷ_s),    (2)

where ŷ_s is the one-hot ground truth and p_s is the probability map of the source domain image in the student network.

Fig. 1. The student network is trained using a supervised segmentation loss, an inter-network consistency loss, and a multi-level contrastive loss, while the teacher network updates its weights using an exponential moving average (EMA). The training procedure is detailed in Sections 3.3 and 3.4.
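As a concrete illustration, the hybrid Dice + CE loss above can be sketched in NumPy (a minimal sketch, not the paper's implementation; it assumes a softmax probability map of shape (H, W, K) and a one-hot ground truth of the same shape, and the function names are ours):

```python
import numpy as np

def dice_loss(probs, onehot, eps=1e-6):
    # Soft Dice loss averaged over the K categories.
    inter = (probs * onehot).sum(axis=(0, 1))
    union = probs.sum(axis=(0, 1)) + onehot.sum(axis=(0, 1))
    return float(1.0 - (2.0 * inter / (union + eps)).mean())

def ce_loss(probs, onehot, eps=1e-12):
    # Pixel-wise cross-entropy against the one-hot ground truth.
    return float(-(onehot * np.log(probs + eps)).sum(axis=-1).mean())

def hybrid_seg_loss(probs, onehot):
    # L_seg = Dice loss + CE loss, in the spirit of Isensee et al. (2018).
    return dice_loss(probs, onehot) + ce_loss(probs, onehot)

# Toy example: a 2x2 image with K=2 categories and a confident prediction.
onehot = np.eye(2)[np.array([[0, 1], [1, 0]])]          # shape (2, 2, 2)
probs = np.clip(onehot, 0.05, 0.95)
probs = probs / probs.sum(axis=-1, keepdims=True)        # valid softmax map
loss = hybrid_seg_loss(probs, onehot)
```

A perfect prediction drives both terms to (numerically) zero, while any disagreement increases the loss.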

Distilling Knowledge from Pseudo Labels
The pseudo-label of the target image is generated iteratively by the segmentation path of the momentum teacher network:

ŷ′_t = argmax(p′_t),    (3)

where p′_t is the probability map of the target domain image in the teacher network. To distill knowledge from the pseudo-label, an extra consistency loss is added between the two networks; that is, the target image segmentation p_t produced by the student network is guided by the pseudo-label ŷ′_t. The consistency loss is formulated as:

L_con = −(1/(H·W)) Σ_i Σ_k ŷ′_t^(i,k) log p_t^(i,k),    (4)

where i is the pixel index of the image and k is the category.
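The pseudo-labeling and consistency step can be sketched as follows (a minimal NumPy sketch under the assumption that the consistency loss is a pixel-wise cross-entropy against the teacher's argmax pseudo-label; `pseudo_label` and `consistency_loss` are hypothetical names):

```python
import numpy as np

def pseudo_label(teacher_probs):
    # y' = argmax(p'), returned as a one-hot map of shape (H, W, K).
    k = teacher_probs.shape[-1]
    return np.eye(k)[teacher_probs.argmax(axis=-1)]

def consistency_loss(student_probs, teacher_probs, eps=1e-12):
    # Cross-entropy between the student prediction and the (stop-gradient)
    # teacher pseudo-label, averaged over pixels.
    y = pseudo_label(teacher_probs)
    return float(-(y * np.log(student_probs + eps)).sum(axis=-1).mean())

# Toy example: the student agrees with the teacher on every pixel.
teacher = np.array([[[0.9, 0.1], [0.2, 0.8]]])   # H=1, W=2, K=2
student = teacher.copy()
loss = consistency_loss(student, teacher)
```

A student that contradicts the teacher's pseudo-labels receives a strictly larger loss than one that agrees with them.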
Here, we update the weights of the student network by means of back-propagation. In the teacher network, however, a stop-gradient operation is applied, and the network weights are updated by an exponential moving average (EMA):

Θ′ ← αΘ′ + (1 − α)Θ,    (5)

where Θ and Θ′ are the weights of the student network and teacher network respectively, and α ∈ (0, 1) is the momentum coefficient.
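A minimal sketch of the EMA update, with plain dictionaries standing in for the network weights (in PyTorch one would iterate over the two `state_dict`s under `torch.no_grad()`):

```python
# EMA teacher update: Theta' <- alpha * Theta' + (1 - alpha) * Theta,
# applied parameter by parameter.
def ema_update(teacher, student, alpha=0.99):
    return {name: alpha * teacher[name] + (1.0 - alpha) * student[name]
            for name in teacher}

teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
teacher = ema_update(teacher, student, alpha=0.9)
# teacher["w"] -> 0.9, teacher["b"] -> 0.1
```

With α close to 1, the teacher changes slowly and provides a temporally smoothed target for the student.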
Combining data augmentation with self-training has been shown to improve domain adaptation performance (Tarvainen and Valpola (2017); Chen et al. (2020a)). The student network receives strongly-augmented images, and the teacher network receives weakly-augmented images during training. Random resized cropping is used as the weak augmentation, while random brightness, contrast and Gaussian blur are used as strong augmentations. The strongly-augmented path learns a robust feature representation from the weakly-augmented path, which has less disruption.

Semantic-guided Contrastive Loss
To further improve the performance of our UDA framework, we incorporate multi-level semantic-guided contrast into the self-training framework. The idea is to leverage the ground truth of the source domain as a supervised signal to enforce the encoder to learn a well-aligned feature representation that mitigates the domain discrepancy. A common way is to categorize the feature embedding and conduct contrastive learning using pixels or centroids between domains. In our approach, we develop the contrastive loss at the P2P, P2C and C2C levels to directly utilize multi-level semantic information to guide the feature alignment. The data flow of our proposed contrastive loss is depicted in Fig. 2.

Preliminaries
In unsupervised contrastive segmentation approaches, the contrast is performed using a randomly selected sample (called the anchor) v, a positive sample v+ and n negative samples {v−_j}_{j=1}^n. The aim is to learn a feature representation that yields high similarity in positive pairs (v, v+) and low similarity in negative pairs (v, v−). Following He et al. (2020); Chen et al. (2020b); Zhong et al. (2021), we utilize the InfoNCE loss, which is given as follows:

ℓ_NCE(v, v+, {v−}) = −log [ exp(v·v+/τ) / (exp(v·v+/τ) + Σ_{j=1}^{n} exp(v·v−_j/τ)) ],    (6)

where n is the number of negative samples per anchor, '·' is the dot product between two samples, and τ is a temperature hyperparameter that controls the gradient penalty of hard negative samples, empirically set to 0.07 (He et al. (2020)). Samples are drawn from the D-dimensional feature embedding after l2-normalization.
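The InfoNCE loss above can be sketched in NumPy as follows (a self-contained sketch, not the paper's code; `info_nce` is our name):

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def info_nce(anchor, positive, negatives, tau=0.07):
    # -log( exp(v.v+/tau) / (exp(v.v+/tau) + sum_j exp(v.v-_j/tau)) )
    anchor, positive = l2_normalize(anchor), l2_normalize(positive)
    negatives = l2_normalize(negatives)
    pos = np.exp(anchor @ positive / tau)
    neg = np.exp(negatives @ anchor / tau).sum()
    return float(-np.log(pos / (pos + neg)))

rng = np.random.default_rng(0)
anchor = rng.normal(size=8)
close = anchor + 0.01 * rng.normal(size=8)   # positive similar to the anchor
far = rng.normal(size=(16, 8))               # random negatives
loss_easy = info_nce(anchor, close, far)
loss_hard = info_nce(anchor, -anchor, far)   # positive opposite to the anchor
```

The loss is small when the positive is close to the anchor relative to the negatives, and grows as the positive drifts away.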

Feature Categorization
Feature categorization is a necessary step for supervised contrastive learning in the feature space. To utilize the semantic information effectively, we categorize the feature embeddings from both domains. For the source image, the feature embedding in the teacher network and its ground truth are required. Given the l2-normalized teacher network feature embedding of a source image z′_s ∈ R^{H′×W′×D} and the one-hot ground truth ŷ_s ∈ R^{H×W×K}, we first down-sample the one-hot ground truth into ȳ_s ∈ R^{H′×W′×K} to fit the embedding size, then assign the category label index k ∈ {0, ..., K−1} of ȳ_s to each pixel of z′_s (Fig. 2(a)). Similarly, the target image embedding z_t can be categorized using the pseudo-label ŷ_t. Based on the categorized feature embedding, we further compute the category-wise mean of the pixel embeddings as the centroid C = {c^k}_{k=0}^{K−1}, which is given as follows:

c^k = (1/|Y_k|) Σ_i 1[ȳ^(i,k) = 1] z^i,    (7)

where 1[·] is an indicator function that returns 1 when the condition holds and 0 otherwise, z^i is the i-th pixel of the feature embedding, ȳ^(i,k) is the down-sampled label belonging to the i-th pixel and category k, and Y_k is the set of pixels labeled with category k.
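Feature categorization and centroid computation can be sketched as follows (a NumPy sketch assuming nearest-neighbour down-sampling of the one-hot mask; the function names are ours):

```python
import numpy as np

def downsample_onehot(onehot, hp, wp):
    # Nearest-neighbour down-sampling of a (H, W, K) one-hot mask
    # to (hp, wp, K), to match the embedding resolution.
    h, w, _ = onehot.shape
    rows = (np.arange(hp) * h) // hp
    cols = (np.arange(wp) * w) // wp
    return onehot[rows][:, cols]

def centroids(embedding, onehot_small, eps=1e-12):
    # Per-category mean of the pixel embeddings (the centroids).
    # embedding: (H', W', D); onehot_small: (H', W', K) -> (K, D).
    counts = onehot_small.sum(axis=(0, 1))                    # |Y_k|
    sums = np.einsum('hwk,hwd->kd', onehot_small, embedding)
    return sums / (counts[:, None] + eps)

# Toy example: 2x2 embedding with D=3, two categories split left/right.
emb = np.arange(12, dtype=float).reshape(2, 2, 3)
labels = np.array([[0, 1], [0, 1]])
onehot = np.eye(2)[labels]
c = centroids(emb, onehot)      # c[k] is the mean embedding of category k
```

Each centroid is simply the average of the embedding vectors of the pixels assigned to that category.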

Memory Bank & Sampling Strategy
The adequacy of negative samples plays a critical role in learning feature representations (He et al. (2020)). However, the imbalanced ratio between foreground and background pixels in breast MRI segmentation tasks may result in an insufficient number of negative pairs in each batch. To tackle this issue, increasing the batch size or employing a memory bank that saves samples from recent batches are natural solutions. Nevertheless, GPU memory limitations make a large batch size, such as 1024, impractical on typical devices. Therefore, we adopted the design presented in Wang et al. (2021); Xie et al. (2022). Specifically, we utilize two category-wise first-in-first-out queues as a memory bank in the teacher network to preserve the pixel and centroid samples extracted from source images. By using category-wise queues, one for foreground samples and another for background samples, we can save enough negative samples for the contrastive loss, while also ensuring a balanced distribution of samples in each queue. Therefore, we employ a strategy of uniformly sampling a fixed number of pixels from each category of the feature embedding into the pixel queue (Fig. 2(b,c)). This under-sampling approach enables the queue to maintain a sufficient number of balanced pixel samples, while avoiding redundancy. The pixel queue Q_pixel and the centroid queue Q_centroid can be represented as:

Q_pixel = {Q_pixel^k}_{k=0}^{K−1},  Q_pixel^k = {v_(s,i)^k}_{i=1}^{B_p},    (8)

Q_centroid = {Q_centroid^k}_{k=0}^{K−1},  Q_centroid^k = {c_(s,i)^k}_{i=1}^{B_c},    (9)

where Q_pixel^k is the pixel queue of category k, v_(s,i)^k is the i-th source pixel sample of category k, Q_centroid^k is the centroid queue of category k, c_(s,i)^k is the i-th source centroid sample of category k, and B_p and B_c are the sizes of the respective queues.
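The category-wise FIFO memory bank can be sketched with `collections.deque`, whose `maxlen` argument discards the oldest samples automatically (a simplified stand-in, not the paper's implementation):

```python
from collections import deque
import random

class MemoryBank:
    # Category-wise FIFO queues for pixel and centroid samples,
    # a simplified stand-in for Q_pixel and Q_centroid.
    def __init__(self, num_categories, pixel_size=4096, centroid_size=1024):
        self.pixel = [deque(maxlen=pixel_size) for _ in range(num_categories)]
        self.centroid = [deque(maxlen=centroid_size)
                         for _ in range(num_categories)]

    def enqueue_pixels(self, k, samples):
        # Oldest samples are evicted automatically once maxlen is reached.
        self.pixel[k].extend(samples)

    def sample_pixels(self, k, n):
        # Uniformly sample up to n stored pixels of category k.
        pool = list(self.pixel[k])
        return random.sample(pool, min(n, len(pool)))

bank = MemoryBank(num_categories=2, pixel_size=4)
bank.enqueue_pixels(0, ["a", "b", "c", "d"])
bank.enqueue_pixels(0, ["e"])       # "a" is evicted (first-in-first-out)
```

Keeping one bounded queue per category guarantees that both foreground and background samples remain available even when a mini-batch is dominated by background pixels.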

Pixel-to-pixel Contrast
We perform the pixel-to-pixel (P2P) contrast to align the cross-domain feature representations of the same category. To this end, we first sample m anchors from each category of the target feature embedding z_t in the student network, denoted as the set V_t^k. Then, for each anchor v_t^k ∈ V_t^k with category label k, we sample a source pixel of the same category from the pixel queue Q_pixel to form a positive pair (v_t^k, v_s^k+), and sample n source pixels of categories q ∈ K\{k} to form n negative pairs (v_t^k, v_s^q−). Based on these positive and negative pairs, the InfoNCE loss of a single target anchor is computed using Eq. (6). Overall, the P2P loss is defined as:

L_P2P = (1/|V_t|) Σ_{k=0}^{K−1} Σ_{v_t^k ∈ V_t^k} ℓ_NCE(v_t^k, v_s^k+, V_s^q−),  with V_t = ∪_k V_t^k,    (10)

where |·| is the number of elements in a set, and V_s^q− is the set of negative source pixels. Note that the number of pixels labeled as foreground categories might be less than m (or even 0) if the model predicts few (or no) breast tissue labels in a mini-batch. Nevertheless, benefiting from the category-wise memory bank, the contrastive loss can still be computed even if all pixels in a mini-batch belong to the same category.
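The P2P sampling and loss can be sketched as follows (a simplified NumPy sketch: one positive per anchor drawn from the same-category queue and all other-category queue entries as negatives; the names are ours):

```python
import numpy as np

def l2n(x, eps=1e-12):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def nce(anchor, pos, negs, tau=0.07):
    p = np.exp(anchor @ pos / tau)
    n = np.exp(negs @ anchor / tau).sum()
    return -np.log(p / (p + n))

def p2p_loss(anchors_by_cat, queue_by_cat, tau=0.07):
    # anchors_by_cat[k]: target anchors of category k, shape (m, D)
    # queue_by_cat[k]:   source pixel queue of category k, shape (B_p, D)
    losses, cats = [], list(anchors_by_cat)
    for k in cats:
        negs = np.concatenate([l2n(queue_by_cat[q]) for q in cats if q != k])
        for v in l2n(anchors_by_cat[k]):
            pos = l2n(queue_by_cat[k])[0]   # one positive from Q_pixel
            losses.append(nce(v, pos, negs, tau))
    return float(np.mean(losses))

# Toy example: two well-separated category clusters.
rng = np.random.default_rng(1)
center = {0: rng.normal(size=4), 1: rng.normal(size=4)}
anchors = {k: center[k] + 0.05 * rng.normal(size=(3, 4)) for k in (0, 1)}
queue = {k: center[k] + 0.05 * rng.normal(size=(8, 4)) for k in (0, 1)}
loss = p2p_loss(anchors, queue)
```

Swapping the category queues (so each anchor is pulled toward the wrong cluster) produces a larger loss, which is the behaviour the contrast is designed to penalize.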

Pixel-to-centroid Contrast
Due to the under-sampling strategy used in selecting anchors and updating the memory bank, the network may suffer from inadequate semantic knowledge and thereby struggle to converge. This issue is addressed by adding P2C and C2C contrasts to the P2P contrast.
For P2C contrast, we force the pixel representation to learn a more general representation under the guidance of the centroid (Xie et al. (2022)). Specifically, a pixel and a centroid from the same category are considered a positive pair (v^k, c^k+), while a pixel and a centroid from different categories are considered a negative pair (v^k, c^q−). We reuse the anchors from the P2P contrast and sample all positive and negative centroids from the centroid queue Q_centroid. Similar to the P2P loss, the P2C loss is defined as:

L_P2C = (1/|V_t|) Σ_{k=0}^{K−1} Σ_{v_t^k ∈ V_t^k} ℓ_NCE(v_t^k, c_s^k+, C_s^q−),    (11)

where C_s^q− is the set of negative source centroids.
Centroid-to-centroid Contrast

For C2C contrast, the ideal situation is that centroids from the same category are located near one another, whereas centroids from different categories are located far apart. Unlike P2C contrast, the total number of centroids p (BK ≤ p ≤ 2BK, for mini-batch size B) is much smaller than the number of pixels in a mini-batch. Besides, calculating centroids is computationally efficient. Therefore, the centroids of the whole mini-batch can be fully involved as anchors in C2C contrast. Similar to the P2P and P2C contrasts, the positive pairs (c^k, c^k+) and negative pairs (c^k, c^q−) are defined according to whether the centroids are from the same category. Thus, the C2C loss is defined as:

L_C2C = (1/|C_t|) Σ_{k=0}^{K−1} Σ_{c_t^k ∈ C_t} ℓ_NCE(c_t^k, c_s^k+, C_s^q−),    (12)

where C_t is the set of target centroid anchors. Finally, we take the weighted sum of the three above-mentioned contrasts (Fig. 2(d)) as our proposed multi-level semantic-guided contrastive loss:

L_ctr = λ_P2P L_P2P + λ_P2C L_P2C + λ_C2C L_C2C,    (13)

where λ_P2P, λ_P2C and λ_C2C are the regularization coefficients of the corresponding contrasts. The overall training process of our proposed MSCDA is presented in Algorithm 1.

Datasets
Dataset 1. Dataset 1 consists of test-retest breast T1-weighted (T1W) and T2-weighted (T2W) MRI images and corresponding right-breast masks of eleven healthy female volunteers, as described in Granzier et al. (2022). The images of each subject were collected in two separate sessions (interval < 7 days), during which three 3D scans were acquired. Subjects were asked to lie in the prone position and remain still in the MRI scanner while both modalities were sequentially acquired. All images were acquired with an identical 1.5T MRI scanner (Philips Ingenia, Philips Healthcare, Best, the Netherlands) using a fixed clinical breast protocol without contrast. The acquisition parameters are listed in Table 2.

Algorithm 1: MSCDA for Breast MRI
Input: Source domain image x_s and label y_s; target domain image x_t.
1: Initialize the weights of the student network Θ_e, Θ_d with pre-trained weights, and Θ_proj, Θ_pred via He et al. (2015); initialize the teacher network by copying weights from the student network and applying stop-gradient; initialize the memory bank Q_pixel and Q_centroid;
2: for epoch = 1, ..., E_max do
3:   foreach mini-batch do
4:     Apply weak and strong data augmentation;
5:     Forward propagate the strongly-augmented batch in the student network to get p_s, p_t and z_t;
6:     Forward propagate the weakly-augmented batch in the teacher network to get p′_t and z′_s;
7:     Compute the segmentation loss L_seg using p_s and y_s;
8:     Compute the consistency loss L_con using p_t and the pseudo-label ŷ′_t;
9:     Categorize the feature embeddings z′_s and z_t;
10:    Compute the contrastive loss L_ctr and update the memory bank;
11:    Update the student network by back-propagation and the teacher network via EMA;
Output: Weights of the student network Θ_e and Θ_d.
Dataset 2. Dataset 2 consists of images from 134 subjects with histologically confirmed invasive breast cancer, imaged between 2011 and 2017 at Maastricht University Medical Center+ and collected retrospectively (Granzier et al. (2020, 2021)). The images comprise breast dynamic contrast-enhanced T1W (DCE-T1W) and T2W MRIs and corresponding right-breast masks. Similar to Dataset 1, each subject underwent the examinations on 1.5T MRI scanners (Philips Intera and Philips Ingenia (idem)) in a prone position. In particular, DCE-T1W images were acquired before and after the intravenous injection of the gadolinium-based contrast agent Gadobutrol (Gadovist, Bayer Healthcare, Berlin, Germany (EU)) with a volume of 15 cc and a flow rate of 2 ml/s. The acquisition parameters are also listed in Table 2. We apply the same image pre-processing as in Dataset 1. In total, Dataset 2 contains 21793 T2W and 28540 T1W slices, split into three folds of 45, 45 and 44 subjects for the cross-validation described in Section 4.2.

Experiment Setup
As shown in Table 2, the subject population, machine vendor and acquisition parameters between the two datasets are heterogeneous, indicating the common domain shift problem in clinical practice. In particular, T1W and T2W are two different types of MRI sequences, with T1W images typically used for observing anatomical structures, while T2W images provide information on tissue composition. In breast MRI, T1W images help identify the location and size of lesions, while T2W images can detect edema or inflammation (Mann et al. (2019)).
We set up experiments on Datasets 1 and 2 to transfer the knowledge of breast segmentation from healthy women to patients. Specifically, the experiment consists of two scenarios: (1) T2W-to-T1W: using the T2W images of Dataset 1 as the source domain and the T1W images of Dataset 2 as the target domain; (2) T1W-to-T2W: using the T1W images of Dataset 1 as the source domain and the T2W images of Dataset 2 as the target domain. In each scenario, we establish three tasks with different numbers of subjects in the source domain to validate the label-efficient learning ability of our framework. The three tasks contain four, eight and eleven (i.e., the whole dataset) randomly selected subjects, and are denoted as S4, S8 and S11, respectively. To further verify the robustness of the UDA performance, we split the target domain into three folds to perform three-fold cross-validation. In each run of the cross-validation, two folds are used as the target domain for training and the remaining fold for testing.

Model Evaluation
The Dice similarity coefficient (DSC) is used as the main evaluation metric. Additionally, we use the Jaccard similarity coefficient (JSC), precision (PRC) and sensitivity (SEN) as auxiliary evaluation metrics. These metrics are formulated as follows:

DSC = 2TP / (2TP + FP + FN),
JSC = TP / (TP + FP + FN),
PRC = TP / (TP + FP),
SEN = TP / (TP + FN),

where TP, FP and FN are the numbers of true positive, false positive and false negative pixels of the prediction, respectively. Note that we report the mean value of each metric over the three-fold cross-validation.

Segmentation Network. We adopt DeepLab-v3+ with ResNet-50 (He et al. (2016)) as backbone. Benefiting from the encoder-decoder architecture, the encoder and decoder of DeepLab-v3+ are adopted in our framework. Specifically, the hidden dimensions of ResNet-50 are set to (16, 32, 64, 128), yielding a 512-dimension feature map.
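The four metrics can be computed directly from binary masks, e.g. (a minimal NumPy sketch; `seg_metrics` is our name):

```python
import numpy as np

def seg_metrics(pred, truth):
    # DSC, JSC, PRC and SEN from binary masks (1 = breast tissue).
    tp = float(np.logical_and(pred == 1, truth == 1).sum())
    fp = float(np.logical_and(pred == 1, truth == 0).sum())
    fn = float(np.logical_and(pred == 0, truth == 1).sum())
    return {
        "DSC": 2 * tp / (2 * tp + fp + fn),
        "JSC": tp / (tp + fp + fn),
        "PRC": tp / (tp + fp),
        "SEN": tp / (tp + fn),
    }

pred = np.array([[1, 1], [0, 0]])
truth = np.array([[1, 0], [1, 0]])
m = seg_metrics(pred, truth)
# tp=1, fp=1, fn=1 -> DSC=0.5, JSC=1/3, PRC=0.5, SEN=0.5
```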
Projection/Prediction Head. The projection head $f_{proj}$ is a shallow network that contains two $1 \times 1$ convolutional layers with BatchNorm and ReLU. It projects the 512-dimensional feature map into a 128-dimensional $\ell_2$-normalized feature embedding. The prediction head $f_{pred}$ shares the same architecture as $f_{proj}$, with the exception that $f_{pred}$ does not change the dimension of the features.
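A head of this shape can be sketched as below; the hidden width and class name are assumptions, since the text only specifies two $1 \times 1$ convolutions with BatchNorm/ReLU and an $\ell_2$-normalized 128-d output:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Two 1x1 conv layers with BatchNorm/ReLU, mapping 512-d feature maps
    to l2-normalized 128-d embeddings (illustrative sketch)."""
    def __init__(self, in_dim=512, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_dim, in_dim, kernel_size=1),
            nn.BatchNorm2d(in_dim),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_dim, out_dim, kernel_size=1),
        )

    def forward(self, x):
        z = self.net(x)
        return F.normalize(z, p=2, dim=1)  # unit norm along the channel axis
```

The prediction head would follow the same pattern with `out_dim == in_dim`.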
Memory bank. The sizes of the pixel queue and the centroid queue of each category are set to 4096 and 1024, respectively. In each mini-batch, we randomly sample eight pixels per category from each feature embedding into the queue and discard the oldest samples. The number of pixel anchors for the P2P loss is set to 32. The number of negative pairs for the P2P contrast is set to 4096, equal to the size of the pixel queue; the number of negative pairs for the P2C and C2C contrasts is set to 1024, equal to the size of the centroid queue. The regularization coefficients in Eq. (1) and Eq. (13) are all set to 1 by default.
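The first-in-first-out behavior of a category-wise queue can be sketched as follows (class and method names are illustrative, not taken from the released code):

```python
import torch

class PixelQueue:
    """FIFO memory bank for one category: newly sampled embeddings
    overwrite the oldest entries once the queue is full (sketch)."""
    def __init__(self, size=4096, dim=128):
        self.size = size
        self.bank = torch.zeros(size, dim)
        self.ptr = 0
        self.full = False

    @torch.no_grad()
    def enqueue(self, emb):
        """emb: (n, dim) pixel embeddings sampled from one mini-batch."""
        for row in emb:
            self.bank[self.ptr] = row
            self.ptr = (self.ptr + 1) % self.size
            if self.ptr == 0:
                self.full = True

    def negatives(self):
        """Return all currently stored embeddings as negative candidates."""
        return self.bank if self.full else self.bank[:self.ptr]
```

The centroid queue would work identically, with `size=1024` and per-category centroids enqueued instead of raw pixels.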

Training Settings
To accelerate the training procedure, we pre-train the DeepLab-v3+ on the source domain and then use its weights to initialize the encoder $f_e$ and decoder $f_d$ of our UDA framework. Additionally, the projection and prediction heads are initialized using He initialization (He et al. (2015)). The Adam optimizer (Kingma and Ba (2014)) is used to train the framework for $E_{max}=100$ epochs with a fixed learning rate of 0.01 and a batch size of 24. Note that only $f_e$ and $f_d$ participate in inference, while $f_{proj}$, $f_{pred}$, $f'_e$, $f'_d$, $f'_{proj}$ and $Q_{p/c}$ are discarded after training. All networks are implemented in Python 3.8.8 and PyTorch 1.7.1 and are trained on an NVIDIA GeForce GTX 2080Ti GPU.
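Self-training frameworks of this kind typically maintain the teacher branch ($f'_e$, $f'_d$) as an exponential moving average (EMA) of the student; a minimal sketch under that assumption (the momentum value here is illustrative, not the paper's):

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.99):
    """Move each teacher parameter toward the corresponding student
    parameter: p_t <- m * p_t + (1 - m) * p_s (mean-teacher style sketch)."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)
```

Because the teacher receives no gradients, only the student is passed to the Adam optimizer; the teacher is refreshed with `ema_update` after each optimization step.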

Quantitative Comparison with Other State-of-the-Art Approaches
The performance of our proposed MSCDA is depicted in Table 3 and Fig. 3. We compared our proposed method with two state-of-the-art UDA approaches frequently used for medical images: CyCADA (Hoffman et al. (2018)), which uses adversarial learning, and SEDA (Perone et al. (2019)), which uses self-training. Additionally, we trained models with the two types of domain labels, i.e., source domain labels only (denoted "Src-Only") and target domain labels (denoted "Supervised"). In summary, we compare MSCDA to four methods, each with two different backbones (U-Net (Ronneberger et al. (2015)) or DeepLab v3+), yielding eight combinations. Note that plain U-Net is not applicable to our method because the very small (e.g., 8 × 8) resolution in latent space leads to inaccurate classification of embeddings.
The influence of domain shift on the performance of segmentation models can be quantified by comparing the DSC between the supervised and Src-Only methods. For instance, in the T2W-to-T1W scenario on task S4 with DeepLab v3+ as the backbone, the supervised method achieved a DSC of 95.8%, while the Src-Only method only reached 54.9%, a performance degradation of 40.9%. Similarly, in tasks S8 and S11, Src-Only experienced performance losses of 25.8% and 17.1%, respectively, compared to the supervised method. Fig. 3 also shows the performance degradation at the subject level: the median DSC of the supervised method is substantially higher than that of Src-Only, and Src-Only demonstrates a larger interquartile range (IQR), indicating a wider distribution of DSC across subjects.
After applying UDA methods, MSCDA outperforms the other examined methods on the same task. More specifically, the DSC reaches over 83% in task S4 in both the T2W-to-T1W and T1W-to-T2W scenarios (T2W-to-T1W: 87.2%, T1W-to-T2W: 83.4%), while the DSC of the other methods remains below 76% (e.g., CyCADA, T2W-to-T1W: 64.0%, T1W-to-T2W: 67.6%; SEDA, T2W-to-T1W: 71.4%, T1W-to-T2W: 75.5%). This result is supported by the other evaluation metrics, such as JSC and SEN. As can be seen at the bottom of Table 3, in both scenarios MSCDA achieves better results on all evaluated metrics except PRC, which nevertheless exceeds 92%. For the other two tasks (S8 and S11), the proposed method in general outperforms the other approaches. The box plot (see Fig. 3) also indicates that MSCDA not only performs better but also has a smaller IQR than Src-Only and the other two methods.
From Table 3, one can observe that when comparing performance across tasks (i.e., S11, S8 and S4), MSCDA shows strong label-efficient learning ability. More precisely, the DSC of our method in the T2W-to-T1W scenario only drops 2.0%, from 89.2% to 87.2%, while CyCADA and SEDA drop 16.0% and 10.3%, respectively; the DSC of our method in the T1W-to-T2W scenario remains relatively stable, with a difference of only 0.9% across the three tasks. In contrast, the performance of the other methods drops significantly as the number of source subjects decreases. The obtained results therefore show that our method is less sensitive to the size of the source domain than the other UDA methods. Notably, the performance of our method is very close to that of supervised learning (MSCDA: DSC=89.2%, JSC=81.0%, PRC=89.3%, SEN=89.9%; supervised: DSC=95.8%, JSC=92.8%, PRC=98.0%, SEN=94.7%) when training with eight source subjects (task S8) in the T2W-to-T1W scenario, demonstrating the potential of combining contrastive representation learning with a self-training framework.

Qualitative Segmentation Comparison with Other State-of-the-Art Approaches
To qualitatively assess model performance, we plot the segmentation results and corresponding uncertainty maps in Fig. 4. The uncertainty map reflects the confidence of the model in each pixel and is generated by test-time dropout (Loquercio et al. (2020)) with the number of Monte Carlo samples set to 20. In the T2W-to-T1W scenario in Fig. 4, the performance degradation of Src-Only manifests mainly as a large number of under-segmented regions; the model shows high uncertainty at the boundary of the segmentation results but low uncertainty in the under-segmented regions. Applying SEDA and CyCADA alleviates the under-segmentation, where the uncertain area is reduced by SEDA but remains in CyCADA. MSCDA generates segmentations that closely resemble those of the supervised model and recovers regions that remain under-segmented by SEDA and CyCADA. Meanwhile, the uncertainty of MSCDA occurs mainly close to the pectoral muscles, which are more difficult to segment than the breast-air boundary. In the T1W-to-T2W scenario, however, we observed some under-segmented regions near the breast-air boundary, which is likely attributable to the substantial difference between the marginal fat and FGT tissue in T2W images. This difference probably makes it challenging to align the feature space of fat with the source T1W images.
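The test-time dropout procedure behind the uncertainty maps can be sketched as follows: the network is put in evaluation mode, dropout layers alone are re-enabled, and the per-pixel standard deviation over repeated stochastic forward passes serves as the uncertainty (function name and binary-sigmoid output are assumptions for illustration):

```python
import torch
import torch.nn as nn

def mc_dropout_uncertainty(model, x, n_samples=20):
    """Monte Carlo dropout sketch: mean and std of the predicted
    foreground probability over n_samples stochastic forward passes."""
    model.eval()
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.train()  # keep dropout stochastic at inference time
    with torch.no_grad():
        probs = torch.stack(
            [torch.sigmoid(model(x)) for _ in range(n_samples)]
        )
    return probs.mean(dim=0), probs.std(dim=0)
```

High standard deviation marks pixels where the model's prediction is unstable, which is what the intensity of the uncertainty maps encodes.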

Effect of Loss Function & Augmentation
In order to investigate the contribution of augmentation and the different loss functions, we conduct an ablation experiment by removing/adding each component separately. We test the network on scenario 1, task S4, fold 1 with combinations of self-training, data augmentation, and the P2P, P2C and C2C contrasts. All networks are trained under the same experimental settings as in Section 4.4. As illustrated in Table 4, adding data augmentation (see case 2) to self-training increases the DSC by 21.3% compared to plain self-training (see case 1). Combining case 2 with the P2P (see case 3) or P2C (see case 4) contrast increases the DSC to 80.2% and 76.0%, respectively. However, when adding the C2C contrast to case 2 (see case 5), performance deteriorates to a DSC of 67.3%, indicating that centroid-level contrastive learning alone does not benefit the feature embeddings in our breast segmentation task. Nonetheless, this shortcoming is canceled out by adding the P2P or P2C contrast, as shown in cases 6 and 8, which confirms that the C2C contrast is not as effective as the P2P or P2C contrast in our task. When integrating all contrasts (see case 9), the DSC reaches its highest score of 82.2%, an increase of 31.9% compared to the simple case 1. Overall, by adding data augmentation and the P2P, P2C and C2C contrasts, MSCDA improves the self-training framework to achieve better segmentation performance. However, we also find that not all types of contrast are equally effective; hence, we perform an ablation study on the regularization coefficients of the three contrasts in Section 5.3.2.

Effect of Coefficients between Contrasts
To investigate the effect of the contrast coefficients, we conduct an ablation study by varying the regularization coefficient of each contrast in Eq. (13) from 0 to 1. As shown in Fig. 5, increasing $\lambda_{P2P}$ generally improves segmentation performance. We also observe that increasing $\lambda_{P2C}$ and $\lambda_{C2C}$ improves model performance only when $\lambda_{P2P}$ is set to a large value (i.e., 0.75 or 1). This finding implies that the P2C and C2C contrasts may be more effective when the P2P contrast is heavily weighted. We also observe several combinations with comparable performance when $\lambda_{P2P}$ is set to 1, which indicates that there may be multiple ways to achieve optimal performance. Therefore, all coefficients are set to 1 by default in our training settings. Our result is consistent with the findings of Alonso et al. (2021), who showed that increasing the weight of the P2P contrast from a low value can improve performance in a similar semi-supervised setting. Moreover, our study provides additional insight into the sensitivity of the model's performance to different coefficient combinations of contrasts.

Effect of Coefficients between Consistency Loss and Contrastive Loss
We also conduct an ablation study of the coefficients in Eq. (1) to investigate the best combination of the consistency loss ($\lambda_1$) and contrastive loss ($\lambda_2$). Table 5 shows that setting $\lambda_1$ to 0.5 or 1 with $\lambda_2$ set to 1 achieves the best performance, with a DSC of 82.2%. It is worth noting that setting $\lambda_1$ to a smaller value (e.g., $\lambda_1$=0.2) still results in relatively good performance, with a DSC of around 80%. However, setting $\lambda_1$ and $\lambda_2$ to larger values can decrease performance. Overall, this study shows the importance of finding an appropriate balance between the consistency and contrastive losses in UDA tasks.
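The weighting schemes explored in these ablations can be written compactly; the exact composition of Eq. (1) and Eq. (13) is inferred from the surrounding text, so the term names below are assumptions:

```python
def contrastive_loss(l_p2p, l_p2c, l_c2c, w_p2p=1.0, w_p2c=1.0, w_c2c=1.0):
    """Eq. (13) sketch: weighted sum of the pixel-to-pixel, pixel-to-centroid
    and centroid-to-centroid contrasts (all coefficients default to 1)."""
    return w_p2p * l_p2p + w_p2c * l_p2c + w_c2c * l_c2c

def total_loss(l_seg, l_cons, l_ctr, lam1=1.0, lam2=1.0):
    """Eq. (1) sketch: segmentation loss plus the consistency term weighted
    by lambda_1 and the contrastive term weighted by lambda_2."""
    return l_seg + lam1 * l_cons + lam2 * l_ctr
```

With the ablation's best setting, $\lambda_1 \in \{0.5, 1\}$ and $\lambda_2 = 1$, all three contrast coefficients are kept at their default of 1.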

Effect of Contrast Between Domains
As mentioned in Section 3.4, we compute three types of contrast between the student and teacher networks. In particular, only the target feature embeddings in the student network are sampled as anchors, while only the source feature embeddings in the teacher network are sampled to update the memory bank. To further justify this choice, we conduct an additional, complementary ablation study in which different domains are selected for computing the contrasts; all other experimental settings remain unchanged. As shown in Table 6, the best candidate (see case 7, DSC=82.2%) is the combination of the target samples in the student network and the source samples in the teacher network. More specifically, we adopt source samples from the teacher network to build the memory bank and to guide the target samples from the student network. As expected, when adding target samples to the memory bank (see case 5), the performance shows a minor decrease of 0.2%, indicating that the pseudo-labels introduce uncertainty into the model. It is worth noticing that we observe a degradation of 5.7% when additionally adopting source samples as anchors (see case 2), which might be due to overfitting of the model on the source domain.

[Fig. 4 caption: Segmentation results of MSCDA and previous methods. The leftmost subplot in each scenario shows the ground truth (GT), followed by predictions from Src-Only, SEDA, CyCADA, MSCDA and supervised training, respectively. Segmentation results are visualized as red contours, with the corresponding uncertainty map below each subplot; higher intensities indicate greater uncertainty. All methods utilize DeepLab v3+ as the backbone.]

Effect of Size of Memory Bank/Negative Samples
The size of the memory bank is a critical factor in our proposed contrastive learning method, since it determines the number of negative pairs in the P2P, P2C and C2C contrasts. To investigate its effect, we conducted an ablation study on scenario 1, task S4, fold 1, in which the sizes of the pixel queue $B_p$ and centroid queue $B_c$ were varied from 512 to 8192 and from 32 to 4096, respectively. As presented in Table 7, the model's performance generally improved with an increase in the sizes of $B_p$ and $B_c$. The best-performing combinations were $B_p$=4096 with $B_c$=1024 and $B_p$=2048 with $B_c$=2048, both achieving a DSC of 82.2%. However, the performance improvement saturated or declined beyond a certain size, which could be due to an excessive number of negative samples causing collisions between them (Ash et al. (2021)), hurting the representation learning quality. Awasthi et al. (2022) and Ash et al. (2021) suggest that an appropriate trade-off should be made in selecting the number of negative pairs. We therefore chose $B_p$=4096 and $B_c$=1024 as the default settings for our training. In conclusion, a sufficiently large memory bank is crucial for improving the model's performance, but increasing its size beyond a certain limit leads to diminishing returns due to the collision-coverage trade-off in our tasks.

Visualization of Feature Alignment
To visualize the effect of our proposed method on domain shift, we plot the learned features from the source and target testing images with t-SNE (Van der Maaten and Hinton (2008)). The learned features are obtained using DeepLab v3+ as the backbone. At the pixel level (Fig. 6), when no domain adaptation method is applied, the breast pixels of Src-Only highly overlap with non-breast pixels (Fig. 6(a)), making them indistinguishable. Compared to Src-Only, self-training (Fig. 6(b)) aligns part of the breast pixels between domains but fails to separate them from non-breast pixels. Incorporating the P2P contrast (Fig. 6(c)) largely aligns the breast pixels; however, a number of breast pixels are contaminated by non-breast pixels, which may increase the error. In contrast to the above-mentioned methods, our method nicely aligns the breast pixels and separates them from non-breast pixels. The visualization at the centroid level in Fig. 7 further illustrates the effect of our method on the feature space. Compared to the pixel level, the uneven distribution caused by the imbalanced dataset is alleviated at the centroid level, making the visualization clearer. We observe that the learned centroids of the different categories are linearly separable for all methods. Before self-training, the centroids of the same category are completely separated by domain, as can be observed in Fig. 7(a). When self-training is applied (Fig. 7(b)), the non-breast centroids cluster together while the breast centroids are still not aligned. The P2P contrast (Fig. 7(c)) improves the centroid alignment between domains, but the centroids still do not fully overlap. In our method (Fig. 7(d)), the centroids of the same category share a well-aligned, tight representation space. In summary, the t-SNE visualization demonstrates the effect of domain shift in the feature space, an effect that can be mitigated by applying our method.
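A projection of this kind can be produced with scikit-learn's t-SNE implementation; a minimal sketch (the perplexity value and function name are assumptions, not the paper's settings):

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_embed(features, perplexity=30.0, seed=0):
    """Project high-dimensional pixel or centroid embeddings to 2-D
    for scatter-plot visualization (illustrative sketch)."""
    tsne = TSNE(n_components=2, perplexity=perplexity,
                init="random", random_state=seed)
    return tsne.fit_transform(np.asarray(features))
```

The 2-D output is then scatter-plotted with one color per (domain, category) pair to reveal how well the two domains overlap.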

Quantitative Analysis of Feature Alignment
To further quantitatively analyze the feature alignment in our proposed MSCDA, we employ the cluster centroid distance (CCD) (Luo et al. (2021)), which measures the distance between the distributions of feature embeddings in the source and target domains. We first obtain the feature embeddings of each domain as in Section 5.4.1 and then calculate the CCD between the centroids of each category. Moreover, we follow the normalization of Luo et al. (2021) so that the CCD of the baseline method Src-Only is always 1; a smaller CCD value indicates better feature alignment. The results depicted in Fig. 8 show that MSCDA achieves better feature alignment in both categories compared to the other methods. In the 'Non-Breast' category, our proposed MSCDA method exhibits a CCD of 0.599, a slight improvement over the P2P contrast method (0.601). By contrast, in the foreground 'Breast' category, the CCD of MSCDA (0.297) is significantly lower than that of the other methods (Src-Only=1, self-training=0.728, P2P=0.597), demonstrating a substantial enhancement in feature alignment. These results support the hypothesis that multi-level contrastive learning can better exploit deeper semantic information in UDA, leading to higher discrimination of the model towards target images.
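A normalized CCD of the kind reported here can be sketched as follows: the Euclidean distance between the source and target centroids of one category, divided by the same distance for the Src-Only baseline so that the baseline scores 1 (function and argument names are illustrative):

```python
import numpy as np

def ccd(src_feats, tgt_feats, src_feats_base, tgt_feats_base):
    """Cluster centroid distance for one category, normalized by the
    Src-Only baseline (sketch after the description of Luo et al. (2021)).

    Each argument is an (n, d) array of feature embeddings; the centroid
    is the per-dimension mean over the n samples."""
    d = np.linalg.norm(src_feats.mean(axis=0) - tgt_feats.mean(axis=0))
    d_base = np.linalg.norm(
        src_feats_base.mean(axis=0) - tgt_feats_base.mean(axis=0))
    return d / d_base
```

By construction, values below 1 mean the method pulled the two domains' centroids closer together than the baseline did.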

Conclusion
In this paper, a novel multi-level semantic-guided contrastive UDA framework for breast MRI segmentation, named MSCDA, is introduced. We found that by combining self-training with a multi-level contrastive loss, semantic information can be further exploited to improve segmentation performance on the unlabeled target domain. Furthermore, we built a hybrid memory bank for sample storage and proposed a category-wise cross-domain sampling strategy to balance the contrastive pairs. The proposed model shows robust and clinically relevant performance in a cross-sequence, label-sparse scenario of breast MRI segmentation. The code of our MSCDA model is available at \url{https://github.com/ShengKuangCN/MSCDA}.