Cross-Modality Medical Image Segmentation via Enhanced Feature Alignment and Cross Pseudo Supervision Learning

Given the diversity of medical images, traditional image segmentation models face the issue of domain shift. Unsupervised domain adaptation (UDA) methods have emerged as a pivotal strategy for cross modality analysis. These methods typically utilize generative adversarial networks (GANs) for both image-level and feature-level domain adaptation through the transformation and reconstruction of images, assuming the features between domains are well-aligned. However, this assumption falters with significant gaps between different medical image modalities, such as MRI and CT. These gaps hinder the effective training of segmentation networks with cross-modality images and can lead to misleading training guidance and instability. To address these challenges, this paper introduces a novel approach comprising a cross-modality feature alignment sub-network and a cross pseudo supervised dual-stream segmentation sub-network. These components work together to bridge domain discrepancies more effectively and ensure a stable training environment. The feature alignment sub-network is designed for the bidirectional alignment of features between the source and target domains, incorporating a self-attention module to aid in learning structurally consistent and relevant information. The segmentation sub-network leverages an enhanced cross-pseudo-supervised loss to harmonize the output of the two segmentation networks, assessing pseudo-distances between domains to improve the pseudo-label quality and thus enhancing the overall learning efficiency of the framework. This method’s success is demonstrated by notable advancements in segmentation precision across target domains for abdomen and brain tasks.


Introduction
The commendable accomplishments of deep convolutional neural networks (CNNs) in image segmentation have been extensively documented [1,2].The anticipation is that deep learning models should adeptly handle the diverse medical images originating from various imaging modalities, scanning protocols, or demographic factors.However, in practical healthcare applications, a notable disparity often arises between the data used for training and that encountered during testing, which is attributed to differences in acquisition parameters or imaging modalities [3].A considerable degradation in the performance of the trained model may be caused by this gap.As illustrated in Figure 1, differences in the image data obtained from abdominal images using magnetic resonance imaging (T2-SPIR MRI) and computed tomography (CT) modalities are evident.This phenomenon, where algorithm performance is degraded due to differences in data distribution during crossmodality image processing tasks, is referred to by researchers as the domain shift.Therefore, the model's capability to generalize across various domain shifts needs to be enhanced.Models trained on the source domain cannot be directly migrated to the target domain (e.g., MR to CT) for effective segmentation due to the presence of significant domain offsets.To mitigate cross-modality performance degradation, a plain and intuitive approach is to use labeled images on the target domain for fine tuning.However, this approach necessitates extensive target domain annotation, and pixel-level annotation on medical images is found to be very labor-intensive and time-consuming.
Unsupervised domain adaptation (UDA) techniques perform a core idea to deal with cross-modal analysis [4,5].In the UDA setting, given a labeled source domain dataset and an unlabeled target domain dataset, the goal is to train a robust model capable of achieving high segmentation performance on the target domain.Currently, most advanced UDA segmentation methods are at the image level [6][7][8], feature level [9][10][11], or joint image and feature level [12,13], with adversarial learning mechanisms being used to train and optimize the model.The fundamental approach at the image level involves using GANs to generate source domain images into the target domain.These generated images are then utilized to train models for the target domain, aiming to enhance the performance of the target domain model.On the other hand, feature-level methods focus on mapping data from both source and target domains to a domain-invariant space, which are then subsequently applied to downstream tasks such as image segmentation.Among them, a pivotal work is CFDnet [14], which focuses on minimizing the distance of distribution feature functions for feature alignment.
In the context of cross-modal image generation, these methods commonly face the following challenges: the quality of generated images in the target domain significantly influences the performance of downstream segmentation networks.If features fail to align effectively, it may lead to noticeable fluctuations during network training, making the training process complex and susceptible to the collapse phenomenon.Additionally, such methods encounter issues of subpar individual segmentation results in the target domain, contributing to a substantial variance in performance across multiple model training iterations.The fundamental cause of these problems can be primarily attributed to ineffective feature alignment, resulting in the inability of transformed cross-modal images to be effectively used for training segmentation networks.
In addressing the challenges outlined, this study introduces a novel unsupervised domain adaptation image segmentation framework, as illustrated in Figure 2. The primary function of the feature alignment sub-network is to effectively align the distributions of the source and target domains, performing image generation and reconstruction tasks.Building upon the network introduced by Han et al. [13], a self-attention module [15] is incorporated to effectively model both long-range and multi-level dependencies among image regions, allowing the model to capture information pertinent to geometric and structural consistency more accurately.Furthermore, to reciprocally promote feature alignment and facilitate the training of downstream segmentation networks inspired by Chen et al. [16], a dual-stream segmentation sub-network is proposed with consistent architectures, wherein the two streams are initialized differently.The outputs of both streams are constrained by a cross-pseudo-supervised loss [16], encouraging consistency in the outputs of the dual-stream network to promote cross-modal feature alignment.Given the significant differences between our task and traditional single-modal multicenter scenarios, where there is considerable inter-modal variation, we introduce a cross-domain distance awareness module to enhance the quality of pseudo-labels and alleviate the pronounced fluctuations in segmentation loss.Our model represents a distinct paradigm from a segment anything model (SAM) [17], focusing on domain adaptation for medical image segmentation with full automation and minimal reliance on target domain labels, thus contrasting SAM's semi-automated, generalization-based approach that necessitates broad dataset training and manual fine tuning.An overview of our proposed approach.The whole network consists of a cross-modal, shared common encoder (E), two domain-specific discriminators (Ds, Dt), two domain-specific private decoders (Gs, Gt), and two segmentation networks (S(θ)).Among them, the shared encoder and the domain-specific decoder are used for feature alignment.The cross-pseudo-supervised, dual-stream segmentation sub-network consists of two segmentation networks (S(θ 1 ), S(θ 2 )) and a distance-aware, pseudo-supervision module.Self-attention modules are added to the last convolutional layer of the encoder, which is not shown in the diagram.In the inference phase, we only need to test the images from the target domain.We took the ensemble of the two networks as the final prediction.
Our innovative points are summarized as follows: • We propose a novel, unsupervised domain adaptation framework for medical image segmentation, comprising a feature alignment sub-network and a pseudo-supervised, dual-stream segmentation sub-network.The feature alignment sub-network facilitates the alignment of features between the source and target domains, while the pseudosupervised, dual-stream segmentation sub-network achieves segmentation of crossmodal images and promotes the learning process of the feature sub-network; • For the first time, cross-pseudo-supervision is introduced to mitigate the impact of lowquality pseudo-labels, thereby insulating the segmentation network from fluctuations in the feature alignment sub-network; • Given that medical images present clear structures and shapes, incorporating a selfattention module into the feature alignment sub-network significantly improves the learning process; • Our method was evaluated on two challenging tasks, and the results indicate that it is increasingly approaching the performance of fully supervised models.

Cross-Modality Medical Image Segmentation
Cross-modality medical image segmentation [18][19][20] involves the transfer and application of knowledge from a source domain to enhance segmentation in a target domain.Kang Li et al. [20] proposed an approach using online mutual knowledge distillation.Over years of development, researchers have proposed various types of methods to address cross-domain problems.
Domain adaptation technology has significantly advanced in addressing domain discrepancy issues such as cross-modality medical image segmentation.The journey began with Single-Source Domain Adaptation [5,21,22], focusing on transferring knowledge learned from one source domain to the target domain.This was followed by multi-source domain adaptation [23][24][25], which enhances the effectiveness of transfer learning by integrating data from multiple source domains to capture a broader set of domain knowledge.Pei et al. [23] proposed a framework that utilizes multi-level adversarial learning and a multi-model consistency loss to adapt features and transfer knowledge from multiple source domains to the target domain.
With the challenge of scarce source labels, semi-supervised domain adaptation methods [26][27][28] emerged, utilizing both limited labeled data and a large amount of unlabeled data for improved domain adaptation.Zhao et al. [26] introduced MT-UDA, a labelefficient framework where a student model, trained with limited source labels, learns from unlabeled data in two domains through two teacher models in a semi-supervised manner.This method distills intra-domain semantic knowledge and exploits inter-domain anatomical information, thereby improving cross-modality segmentation performance by integrating underlying knowledge and addressing source label scarcity.
Recently, source-free domain adaptation [29][30][31] has gained attention, where the aim is to achieve domain adaptation using only a pre-trained source domain model without access to source domain data.This approach faces the challenge of effective knowledge transfer without direct support from the source data.M. Bateson et al. [29] utilized a novel loss function that combines Shannon entropy and KL divergence to match class ratios with anatomical priors, and it is validated on cardiac, prostate, and intervertebral disk segmentation tasks.
The evolution of these technologies demonstrates the ongoing depth and innovation within the domain adaptation field, and it is aimed at overcoming data distribution differences and improving model generalization in new domains.

Single-Source Domain Adaptation
Single-Source Domain Adaptation, as explored in the works of Wang et al. [21], Saito et al. [4], and Long et al. [32], constitutes a vital aspect of transfer learning.It finds extensive applications in addressing performance degradation issues when the distribution of test data significantly deviates from that of the training data [33].In the realm of medical images, domain adaptation holds even greater potential.Pei et al. [22] proposed an innovative unsupervised domain adaptation framework tailored for cross-modal heart image segmentation.Their approach segregates domain-invariant and domain-specific features for each domain, augmenting the representation of domain-invariant features through the incorporation of self-attentive modules and a zero-loss strategy.Ouyang et al. [34] introduced a data-efficient, unsupervised domain adaptation method that fuses variational, self-encoder-based features prior to matching with domain adversarial training.This hybrid approach effectively learns a shared domain-invariant hidden space.Building upon a self-training strategy and adversarial learning, Xie et al. [5] enhanced target-domain image segmentation performance.Chen et al. [8] leveraged generative adversarial networks to transform image appearances across modalities.Their approach incorporates deep collaborative image and feature alignment, and it is complemented by self-attentive mechanisms and knowledge distillation losses to elevate segmentation quality.Han et al. [13] devised a novel deep symmetric adaptation network for cross-modal medical image segmentation.The architecture comprises a segmentation sub-network and two symmetric source and target domain transformation sub-networks.These sub-networks implement a bi-directional alignment strategy by sharing an encoder and two private decoders.The use of original and transformed images for training the segmentation classifier effectively diminishes interdomain differences, leading to improved segmentation performance.For cross-modality segmentation, the mainstream method is still used to perform feature alignment and to then apply the domain-invariant features to the training of the segmenter.Our network also includes these two parts.

Feature Alignment
The concept of feature alignment is a prevalent theme in various unsupervised domain adaptation (UDA) frameworks.Many UDA methods concentrate on aligning the distribution in the feature space by minimizing the distance metric between features extracted from the source and target domains.Adversarial learning is a prominent approach in this context [35,36].Ganin et al. [36] leveraged adversarial training to obtain domain-invariant features between the encoder and decoder, resulting in the learning of domain-invariant representations.Liu et al. [37] introduced a semantic segmentation branch with a domain discriminator to address the domain gap at the context level.Han et al. [13] proposed a bidirectional alignment strategy that trains the segmentation classifier using both original and transformed images, effectively reducing inter-domain discrepancies.Inspired by Han et al. [13], we incorporate analogous ideas into our framework to enhance feature alignment.

Cross-Pseudo-Supervision
Cross-pseudo-supervision (CPS) was introduced by Chen et al. [16] in 2021.The fundamental concept involves employing two networks with identical structures but different initializations, imposing constraints to ensure that the outputs of both networks for the same sample exhibit similarity.Ouali et al. [38] extended the CPS framework to propose a cross-consistency training method for cross-modal, semi-supervised semantic segmentation.This method incorporates data enhancement and feature alignment techniques for images from different modalities.Ibrahim et al. [39] presented a self-correcting network approach aimed at enhancing the quality of pseudo-labeling and the robustness of the model.This is achieved by introducing a correction module and a confidence module to dynamically update pseudo labels and loss weights.Feng et al. [40] focused on improving the generalization ability of the model within the context of CPS.They introduced a dynamic thresholding strategy and a class-balancing strategy to effectively select high-quality pseudo labels and balance the difficulty differences between different classes.In our experiments, we observed that the semi-supervised approach of CPS proves effective in harnessing unlabeled data and facilitating the learning of feature-aligned networks.For unsupervised domain adaptation, which can be viewed as a special case of semi-supervised learning, CPS demonstrates its ability to enhance unsupervised domain adaptation capabilities.

Datasets
In our study on cross-modality segmentation, we consider tasks involving abdominal organ segmentation and brain tumor segmentation due to their clinical significance and the challenges they present in terms of imaging modalities.
For abdominal organ segmentation, we utilized a comprehensive dataset that included 30 CT scans from the BTCV [41] challenge training set and 20 T2-SPIR MRI scans from the ISBI 2019 CHAOS challenge [42] training set.These datasets present a variety of volume sizes and resolutions: the BTCV, with slice thicknesses ranging from 2.5 mm to 5.0 mm; and the CHAOS with ISDs, varying from 5.5 mm to 9 mm and averaging at 7.84 mm, alongside x-y spacing that ranges from 1.36 mm to 1.89 mm, averaging at 1.61 mm.To streamline consistency, we cropped the CT images, initially encompassing 11 organs, to focus solely on the abdominal area, including the liver, both kidneys, and the spleen, and we retained the annotations of these four organs from the BTCV dataset.
The brain tumor segmentation, derived from the Brats2018 Brain Tumor challenge, consisted of 75 instances pf patient data from the LGG category, including the following four modalities: T1, T1CE, T2, and FLAIR.Targeting whole tumor segmentation, domain adaptation experiments were conducted with T2 as the source domain and the other three modalities as the target domain.Across all datasets, in line with SIFA [8] and SymDA [13], 80% of patient data were randomly selected for training, with the remaining 20% reserved for testing.

Overall Framework Architecture
We denote the labeled dataset from the source domain as X S = x i s N s i=1 and the unlabeled dataset from the target domain as , where X S = x i s N s i=1 has a gold standard partition Y S = y i s N s i=1 .They were both collected randomly from both domains.Here, N s represents the quantity of data in the source domain, while N t denotes the quantity of data in the target domain.For simplicity, the proposed framework is denoted as FACPS, an abbreviation for feature alignment and cross-pseudo-supervision.
The overall framework proposed is illustrated in Figure 2, which provides an overview of our framework and consists of two parts.The first part is the Cross-Modal Feature Alignment Sub-Network, comprising a shared encoder E and two domain-specific decoders, D s and D t .The shared encoder and domain-specific decoders are utilized for image reconstruction and cross-modal image generation.We align feature distributions in both directions using the Cross-Modal Feature Alignment Sub-Network to ensure that the synthesized images maintain their original structures via employing a consistency loss method [43].The second part is the Pseudo-Supervised Dual-Stream Segmentation Sub-Network, which is composed of two identical segmentation sub-networks.The prediction of one network serves as pseudo-labels to supervise the optimization process of the other network.Due to the data's diverse origins and significant span, there exist non-overlapping distributions, making direct supervision prone to training instability.To stabilize the training process and prevent collapse, we propose the Cross-Domain Distance Perception Module, significantly improving the quality of supervised labels based on cross-domain distances.In the following sections, we will provide detailed descriptions of our method from these two perspectives.

Cross-Modality Feature Alignment Sub-Network
In order to achieve better feature alignment, we developed a novel feature alignment network, where two translation sub-networks are dedicated to transforming images from the source domain to the target domain and vice versa.These networks are based on adversarial losses and incorporate reconstruction losses to better preserve structural information.Additionally, we share the encoder network between the Feature Alignment Sub-Network and the Segmentation Sub-Network, facilitating feature alignment.This approach differs from the methods proposed by Chen et al. [8] and Han et al. [13].Chen et al. first mapped images to common feature space and then aligned feature distributions through decoders.Han et al. focused on distinguishing between the data originating from the target domain and the source domain to enforce bi-directional feature alignment.In contrast to these methods, our approach emphasizes expressing domain-specific forms through independent decoders for both the source and target domains, thereby compelling the shared encoder to learn domain-invariant feature representations exclusively.
As shown in Figure 2, the feature alignment network consists of a shared encoder (E) and a domain-specific decoder (G).The encoder maps the source image (x s ) to a latent space.Then, the target decoder (G t ) maps the features from the potential space to the target domain to generate a fake target domain image (x s→t ) and tries to deceive the target discriminator (D t ), which is responsible for determining the authenticity of the fake target domain image (x s→t ) and the real target domain image (x t ).D t is updated through the adversarial loss L t adv as Here, D t needs to maximize L t adv so that it can recognize real or fake images, while E tries to minimize L t adv so that x s→t looks like the real target image.To ensure the effectiveness of the encoder and decoder and the content of the source-domain image remains unchanged, and we performed a reverse image transformation on the generated fake image and reconstructed the original input image as x s→s and x t→t .Specifically, we obtained x s→s = G s (E(x s )) and x t→t = G t (E(x t )).The reconstructed image should be consistent with the original image; therefore, we can obtain a consistency loss for the target domain.
Similarly, the source-domain Discriminator D s also performs corresponding operations to better convert x t to a source-like image.Additionally, we conducted inverse image transformations on the generated fake images and reconstructed the original input images.Specifically, we obtained adversarial losses and cycle consistency losses as follows: Up to this point, we obtained the optimization losses for the shared encoder and domainspecific decoders in both the target domain and the source domain.The computation formulas are as follows: where λ recon and λ adv are hyperparameters used to adjust the weights of each loss term.In our experiments, the corresponding values were set to 1.0 and 0.02, respectively.

Self-Attention Module
To effectively model long-range dependencies within images, inspired by the method proposed by Zhang et al. [15], a self-attention module was introduced.By computing the correlation between two positions in the image, an attention map is generated for each pixel in the image matching the size of the image, which reflects the semantic correlation between this point and other points.For each pixel in the image, the self-attention mechanism encourages the model to pay more attention to regions highly correlated with that pixel.Introducing this semantic correlation not only benefits better feature alignment in the model, but it also promotes the learning efficiency of the segmentation model.Although self-attention increases model complexity and training duration, its computational demand is largely contingent on the input image size.To offset this, we uniformly preprocessed inputs to 256 × 256, thereby minimizing computational overhead.The framework of the self-attention is illustrated in Figure 3.The features from the previous layer f ∈ R C×H×W were first passed through two convolutional kernels C 1 and C 2 , generating feature maps C 1 ( f ) and C 2 ( f ), respectively.At that stage, it was {C 1 ( f ), C 2 ( f )} ∈ R C×H×W .The feature map was then used as the input to the So f tmax layer, and the resulting output was the Attention Map A, which can be expressed as Simultaneously, another feature map C 3 ( f ) was obtained through convolutional kernel C 3 .
Then, following the same operations as described above, the features were reshaped and multiplied with the attention map calculated using the formula mentioned earlier.The resulting matrix was then passed through the final convolutional kernel C 4 to obtain the self-attention feature map f att , which can be expressed as Finally, by performing a residual connection between the output of the self-attention mechanism and the input features, we obtained the final output, which is represented as where λ att is a learnable parameter, initially initialized to 0, guiding the model to explore local spatial information first, and this is then followed by refinement using the selfattention mechanism.Following the strategy of Zhang et al. [15], self-attention modules were added to the last convolutional layer of the encoder.Each self-attention module consisted of four convolutional layers, with a kernel size of 1 × 1 and a stride set to 1.

Cross-Pseudo-Supervised Dual-Stream Segmentation Sub-Network
Previous methods based on generative adversarial networks have shown promising results in domain adaptation.However, for cross-modal image segmentation tasks, there are often issues such as poor instance segmentation performance and unstable training.Since we lack labels in the target domain, unsupervised domain adaptation can be considered as semi-supervised learning.Chen et al. [16] proposed a method where the output of one network is used to supervise the training of another network to utilize the unlabeled data in the target modality as much as possible (source domain data can also be considered as unlabeled data).This method is suitable when the data come from the same medical-imaging modality (such as CT) but different centers, with relatively small variations compared to cross-modal differences, thus usually producing high-quality supervision.However, in our task, the data came from different medical-imaging modalities (such as CT and MR), and the significant differences in the training samples from different modalities may have led to low-quality pseudo-labels.In this work, inspired by Yao et al. [44] and Sutter et al. [45], we propose a Cross-Domain Distance Perception Cross Pseudo-Supervision mechanism to improve the quality of labels and reduce the impact of low-quality labels.This mechanism provides cross-domain pseudo-variance by measuring the cross-domain pseudo-distance between the target domain and the source domain, thereby enhancing the supervision quality of pseudo-labels.Specifically, the image similar to x s→t was passed through a shared encoder and then fed into the Dual-Stream Segmentation Sub-Network S(θ).This segmentation sub-network consists of S(θ 1 ) and S(θ 2 ), which have the same structure but different initialization.For each segmentation network, we obtained predictions for the original image x s and the image similar to the target domain x s→t in S(θ 1 ) by Similarly, we passed the original image x s and the image similar to the target domain x s→t into the segmentation network S(θ 2 ), resulting in After obtaining predictions p 2 o and p 2 c from the original image x s and the transformed target-domain image x s→t , respectively, we calculated the average of p 2 o and p 2 c to obtain the ensemble prediction result p1 avg : Cross-pseudo-supervision is suitable for data that come from the same imaging modality but different centers, where the variation is relatively small compared to inter-modalities.For cross-modal data, such as CT and MRI, the significant differences between training samples from different modalities lead to low-quality pseudo-supervision.Inspired by Yao et al. [44], we propose a cross-domain distance awareness module to enhance crosspseudo-supervision in order to improve the quality of cross-pseudo-supervision.By quantifying the cross-domain pseudo-distance between the source and target domains, the negative impact of low-quality labels on model training is reduced.If there is a large discrepancy between two predictions, the calculated distance value will also be large, reflecting the relatively low quality of the two predictions.As such, the KL divergence is used to measure the discrepancy between two predictions, and the pseudo-distance can be represented as follows: The cross-pseudo-supervision scheme offers dual advantages.It ensures consistency in predictions from various networks for the same image, targeting low-density decision boundaries.Later, it stabilizes pseudo-segmentation, enhancing accuracy beyond conventional supervised learning.This acts as an effective data expansion, boosting the training quality of the segmentation network.After obtaining the prediction distance from the two networks, we propose the enhanced cross-domain distance-aware loss function L dacps .Since the input images are x s and x s→t , we named this loss function L s→t dacps : L s→t dacps = E e −v1 Lce p2 avg , where P 1 and P 2 are one-hot-encoded pseudo-labels generated from predicted probability maps p1 avg and p2 avg ; L ce represents the cross-entropy loss; and E represents the mathematical expectation.If the input images are x t and x t→s , then, similarly, the cross-domain distance perception loss L t→s dacps can be obtained.In the supervised learning part, we used the Dice loss to optimize the segmentation sub-network.The loss function can be represented as follows: Our method leverages the strengths of CPS -ensuring consistent predictions across networks for a given image and mimicking data expansion.Coupled with the benefits of KL divergence, it suppresses the impact of low-quality pseudo labels, making it robust to the initial quality of these labels, which is applicable to cross-modal tasks.

Total Loss
Combining the discussions from the above two sections, the overall loss function can be represented as follows: where λ t gen , λ s gen , λ s ,λ st , and λ ts are hyperparameters balancing the weights of each module loss.In the experiments, each training iteration involved multiple network update steps.Firstly, L t gen was used to generate images similar to the target domain and update the shared parameter encoder E, aligning the feature distributions of the source domain and the target domain while transforming the images.Secondly, L s gen + L s was introduced to generate images similar to the source domain and to begin training the segmentation network, extracting features for segmentation results, and further enforcing deep feature alignment.Finally, the source-domain images, target-domain images, generated target-domain similar images, and source-domain similar images were fed into the Dual-Stream Segmentation Sub-Network.The proposed L s→t docps + L t→s dacps was used to optimize the network, improving the quality of pseudo-labels and performing feature alignment again.In our experiments, the corresponding values were set as λ t gen = 1.0, λ s gen = 1.0, λ s = 1.0, λ st = 2.0, and λ ts = 2.0.We used the Adam optimizer, where the learning rate was set for the segmentation sub-networks S(θ), the domain-specific decoders G s and G t was set to 1 × 10 −3 , and the learning rate for the shared encoder E, as well as the discriminators D s and D t , was set to 2 × 10 −4 .During each training iteration, we sequentially updated different modules.In the inference stage, we used the shared parameter encoder E to extract features and then feed them into the segmentation sub-networks.The outputs of the two segmentation sub-networks were used as the final segmentation results.

Experimental Setup
The experiments were conducted on the Ubuntu 20.04 operating system, with code design and implementation carried out using Python and relevant development packages.The deep-learning framework employed was PyTorch, and the iterations were performed on two NVIDIA RTX 4090 GPUs with a memory size of 48 GB using a batch size of 4.During the data preprocessing stage, each slice was resampled to 256 × 256.To expedite network convergence, Z-score was performed using the method wherein the mean intensity was subtracted and the result was divided by the standard deviation.Subsequently, the data were shifted to the range [−1, 1].In the training phase of the network, random cropping and rotation augmentation operations were additionally applied.It is worth noting that, for a fair comparison, we did not consider the impact of noise and artifacts, and no denoising algorithms were used during the preprocessing stage.
For the shared parameter encoder of the Feature Alignment Sub-Network, following the construction strategy of Han et al. [13], we used the encoder of Deeplab-ResNet50.For the domain-specific decoders G s and G t of the Feature Alignment Sub-Network, they had the same structure but did not share parameters.Each consisted of three residual blocks and three deconvolution layers.The discriminators D s and D t contained four convolutional layers but did not share parameters.For the Dual-Stream Segmentation Sub-Networks S(θ 1 ) and S(θ 2 ), we implemented two segmentation networks using DeepLabv3+ [46] and ResNet50 [47] as backbones, respectively.The backbone networks were initialized with weights trained on ImageNet [48], while the other layers of the segmentation networks were initialized with random noise.
Our experiments used three evaluation metrics.For the abdominal organ segmentation, the Dice coefficient and Average Surface Distance (ASD) were employed as evaluation metrics.For the brain tumor segmentation, the Dice coefficient and Hausdorff Distance (HD) were utilized as evaluation metrics.Furthermore, to better assess the performance and the stability of model retraining, results were obtained using three different model parameter initializations, and the experimental outcomes were presented in terms of both the mean and standard deviation.

Comparison Results
This section primarily validates the proposed FACPS algorithm on the abdominal multi-organ dataset and the Brats2018 dataset and visualizes the segmentation results.We mainly compared the following methods: (1) Supervised Learning (ST)-a segmentation model that is trained using annotated target domain data, and its results can serve as the upper bound for unsupervised domain adaptation methods.(2) No Adaptation (NA)-a segmentation model that is trained with annotated source domain data, which are then directly applied to this model for target-domain data prediction.(3) Adaoutput [11]-one of the domain adaptation methods that performs an adversarial task between source prediction mapping and target prediction mapping.(4) CycleGAN [43]-initially, a CycleGAN is trained with source-and target-domain data, followed by using CycleGAN for modality transfer from the source domain to the target domain.Subsequently, a segmentation model is trained with the synthetic target domain data and are then directly applied to target-domain data prediction.( 5) SIFA [8]-this method aligns the data distribution of the two domains at both feature and image levels, and it is used for cross-modality heartsegmentation tasks.(6) SymDA [13]-this method proposes the idea of deep symmetric adaptation, and it is applicable to multiple medical image segmentation tasks.In addition, to better evaluate the performance, we reported the mean and standard deviation of the three different model initializations.

Results on Abdominal Organ Segmentation
We performed bidirectional domain adaptation for the CT and MR data.For convenience, domain adaptation from MR as the source domain to CT as the target domain was abbreviated as MR-CT; conversely, the CT as the source domain to MR as the target domain was abbreviated as CT-MR.The experimental results are presented in Tables 1 and 2.
Firstly, in both MR-CT and CT-MR, compared to the Supervised Training (ST) method, the performance of the No Adaptation (NA) method significantly decreased, with the average Dice score dropping by 83.85% and 88.21% and the Average Surface Distance (ASD) coefficient increasing by 51.39 and 52.56, respectively.Particularly in the spleen segmentation, the Dice score decreased by 90.32% and 87.5%.The substantial performance decrease highlights the significant impact of the domain gap between modalities on the segmentation model's performance, indicating that the NA method essentially loses its segmentation ability when faced with data from an unknown domain.Compared to the NA method, FACPS improved the Dice score by 79.09% and 84.45%, respectively.In comparison with the existing methods, FACPS achieved the best results in the segmentation of four different organs and was the optimal method in both Dice and ASD metrics, proving the effectiveness of the proposed approach.Secondly, FACPS outperformed the existing methods that use feature alignment.As shown in Tables 1 and 2, the proposed method significantly surpassed CycleGAN by 6.44% and 3.8% in average Dice scores for MR-CT and CT-MR, respectively.This further proves the effectiveness of the proposed method.Moreover, compared to the classical feature-alignment domain adaptation work SIFA, FACPS improved by 3.93% (MR-CT) and 1.45% (CT-MR).For SymDA, the proposed method outperformed by 1.12% in MR-CT.The superior performance of FACPS was primarily due to the introduction of a pseudo-supervision mechanism based on cross-domain distance awareness for training the dual-stream segmentation network, which helps to eliminate the domain gap, thus avoiding poor feature alignment conversion results that affect the segmentation network and allowe the network to absorb more semantic knowledge.Notably, the proposed method showed lower variance in the Dice score than existing methods, indicating more stable network performance.The two-dimensional visualization results comparing FACPS on the abdominal dataset with existing methods are shown in Figure 4.

Ablation Study
We performed ablation experiments on the abdominal multi-organ dataset to verify the effect of various constraints on the whole network.In this chapter, we discuss how the proposed FACPS method involves three main modules to enhance the model's performance, including the feature alignment sub-network, the dual-stream segmentation network with cross-pseudo-supervision, and the self-attention module.To validate their effectiveness individually, an ablation study was conducted using the abdominal multi-organ data for CT-MR domain adaptation.The settings for the ablation study are shown in Table 4, where "✓" indicates the selection of a loss or module.Specifically, the study began with training the segmentation network using only the segmentation loss, denoted as SegOnly; next, the target-domain segmentation loss L st was introduced to validate the effectiveness of the feature alignment network and was denoted as FA; then, the loss L cps was added to validate the effectiveness of the cross-pseudo-supervision mechanism and was denoted as FA+CPS; subsequently, loss L cps , which includes both L s→t dacps and L t→s dacps , was used to validate the effectiveness of the cross-domain distance awareness module and was denoted as FA+DACPS; and, finally, the there was the introduction of the self-attention module, which was abbreviated as FACPS.The results of the ablation study, as shown in Table 5, reveal that, without domain adaptation, the performance of SegOnly is very poor with the lowest average Dice score (i.e., 3.75%).FA: By aligning the source and target domains through one of the feature alignment sub-networks' two branches and simultaneously transforming the images of both the target and source domains, performance significantly increased by 66.69% compared to SegOnly.This indicates that the feature alignment sub-network is highly beneficial for domain adaptation.
FA+CPS: By further employing cross-pseudo-supervision to train the network, the overall feature alignment is enforced and the consistency of the network's output is constrained.In this way, combining the images generated by the feature alignment subnetwork with CPS for network training improved the segmentation network's performance by 10.85%.
FA+DACPS: Introducing the cross-domain pseudo-distance module to optimize the traditional CPS loss effectively avoided potential issues with traditional CPS, thereby further enhancing performance and model stability.As the experimental results show, the domain adaptation capability significantly increased from 81.29% to 88.35%.
FACPS: Further incorporating the self-attention module allows the network to learn tht information related to structural consistency, which is beneficial to the learning process of the feature alignment sub-network and enhances the performance of the segmentation network.As shown in the experimental results, the Dice score increased to 89.94%.

Discussion
This study presents the FACPS framework, a robust approach to cross-modal, medicalimage segmentation through enhanced feature alignment and cross-pseudo-supervision learning.Our method demonstrates significant improvements in segmentation precision when adapting from one domain to another, as evidenced by the substantial increase in Dice scores and the reduction in ASD and HD measures across various organ-segmentation tasks.Despite the promising results, we identified several areas for future enhancement when primarily focusing on adapting to the unique challenges of datasets and optimizing computational performance.
Regarding dataset challenges, these include issues with small datasets, class imbalance, and data diversity.To address the challenge of small datasets, we will employ advanced data augmentation and meta-learning strategies to enhance our model's generalizability and robustness against data scarcity.For class imbalance, we will explore specialized techniques to balance class representation during the training process, ensuring equitable model learning and consistent performance across various classes.Additionally, our method's effectiveness on multi-organ and brain-tumor segmentation tasks suggests a promising potential for generalization across various datasets.To ensure diversity in datasets, we plan to expand our experiments to encompass a broader range of imaging modalities and conditions, thereby better reflecting the breadth of clinical scenarios and enhancing our model's adaptability.
In terms of computational efficiency optimization, especially for edge devices, our model's current evaluation on a 4090 GPU indicates an average case prediction time of 2.896 s, which may still be considerable for devices with less computational power.To this end, we are prioritizing optimizations that will expedite our model's inference speed for real-time applications without compromising accuracy.This includes investigating model compression techniques and optimizing the inference pipeline.

Conclusions
This paper introduces a domain-adaptation framework for cross-modal, medicalimage segmentation named FACPS.Existing feature alignment methods utilize generative adversarial networks for modality conversion and reconstruction of images.However, features sometimes fail to align well, leading to the converted cross-modal images being ineffective for training the segmentation network and potentially introducing erroneous guidance information.Therefore, this paper proposes FACPS, which can better eliminate the domain gap while stabilizing the training process.By measuring the pseudo-distance between the source and target domains, this network effectively improves the quality of pseudo-labels, avoiding the adverse effects of low-quality labels and thereby promoting the learning process of the entire framework.The experiments conducted on the two challenging tasks verified that, after domain adaptation, the medical-image segmentation results in the target domain were significantly improved.This paper introduces a domain-adaptation framework for cross-modal medical-image segmentation named FACPS.Existing feature alignment methods typically utilize generative adversarial networks for modality conversion and image reconstruction.However, challenges arise when features do not align well, resulting in ineffective cross-modal images for training the segmentation network and potentially introducing misleading guidance information.FACPS incorporates a mechanism to measure the pseudo-distance between the source and target domains.By doing so, the network enhances the quality of pseudo-labels, mitigating the negative impact of low-quality labels and facilitating the learning process within the entire framework.The experimental results from two challenging tasks demonstrate that, following domain adaptation, the medical-image segmentation outcomes in the target domain show significant improvement.

Figure 1 .
Figure 1.Illustration of the cross-modal medical image challenge.(a) Comparison of the modal gap between CT and MR images in abdominal images.The liver CT image is on the left and the MRI image is on the right.(b) Comparison of the models trained on abdominal MRI images, which were evaluated on MRI and CT images using the Dice score and are denoted as "supervised" and "no adaptation", respectively.

Figure 2 .
Figure 2.An overview of our proposed approach.The whole network consists of a cross-modal, shared common encoder (E), two domain-specific discriminators (Ds, Dt), two domain-specific private decoders (Gs, Gt), and two segmentation networks (S(θ)).Among them, the shared encoder and the domain-specific decoder are used for feature alignment.The cross-pseudo-supervised, dual-stream segmentation sub-network consists of two segmentation networks (S(θ 1 ), S(θ 2 )) and a distance-aware, pseudo-supervision module.Self-attention modules are added to the last convolutional layer of the encoder, which is not shown in the diagram.In the inference phase, we only need to test the images from the target domain.We took the ensemble of the two networks as the final prediction.

Figure 3 .
Figure 3. Schematic diagram of the self-attention module.

Figure 4 .
Figure 4. Illustration of the segmentation results by different methods for abdominal CT images (top two rows) and MR images (bottom two rows).

4. 2 . 2 .
Results on Brain-Tumor SegmentationThe experimental results on brain tumor segmentation are shown in Table3.Similar to the abdominal multi-organ dataset, we first discussed the results of the Supervised Training (ST) and No Adaptation (NA) experiments.Due to differences in patterns and distributions, the Dice score decreased by 63.96%, while the Hausdorff Distance (HD) increased by 36.53.All adaptation methods exhibited similar trends on this dataset as on the abdominal multi-organ dataset.

Figure 5 .
Figure 5. Brain-tumor segmentation results obtained from different methods in the unsupervised domain adaptation task.The T2 modality was used as the source domain and then transferred to the FLAIR, T1, and T1CE modalities (from top to bottom), respectively.

Table 1 .
Comparison of different methods on the abdominal multi-organ dataset.Bold number highlight the best performance.

Table 2 .
Comparison of different Methods on the abdominal multi-organ dataset.Bold numbers highlight the best performance.

Table 3 .
Comparison of different methods on the Brats2018 dataset.Bold numbers highlight the best performance.performance variance, further verifying the stability of the proposed method in network performance and its resilience against model collapse. smaller

Table 5 .
Experimental results of the ablation study.