DiCyc: GAN-based deformation invariant cross-domain information fusion for medical image synthesis

Cycle-consistent generative adversarial networks (CycleGAN) have been widely used for cross-domain medical image synthesis tasks, particularly because they can deal with unpaired data. However, most CycleGAN-based synthesis methods cannot achieve good alignment between the synthesized images and data from the source domain, even with additional image alignment losses. This is because the CycleGAN generator network can encode the relative deformations and noise associated with different domains. This can be detrimental for downstream applications that rely on the synthesized images, such as generating pseudo-CT for PET-MR attenuation correction. In this paper, we present a deformation-invariant cycle-consistency model that can filter out these domain-specific deformations. The deformation is globally parameterized by thin-plate-spline (TPS), and locally learned by modified deformable convolutional layers. Robustness to domain-specific deformations has been evaluated through experiments on multi-sequence brain MR data and multi-modality abdominal CT and MR data. Experimental results demonstrate that our method can achieve better alignment between the source and target data while maintaining superior image quality, compared to several state-of-the-art CycleGAN-based methods.


Introduction
Multi-modal medical imaging, i.e. acquiring images of the same organ or structure using different imaging techniques (or modalities) that are based on different physical phenomena, is increasingly used towards improving clinical decision-making. However, collecting data from the same patient using different imaging techniques is often impractical, due to limited access to different imaging devices, the additional time needed for multiple scanning sessions, and the associated cost. This makes cross-domain medical image synthesis a technology that is gaining popularity. We use the term ''domain'' herein to refer to different imaging modalities, contrasts and parametric configurations, for example, for magnetic resonance imaging (MRI). We present a method, called DiCyc, that can perform cross-domain medical image synthesis by learning from non-paired data, thus taking advantage of multiple sources of images, while, owing to its new network architecture, remaining invariant to domain-specific deformations.

Fig. 1.
Example of cross-domain synthesis using vanilla CycleGAN. The first row shows the results obtained from cross-modality abdominal MR-CT data; the second row shows the results of multi-sequence brain data with a synthesized deformation. Both cases demonstrate a reproduction of ''domain-specific deformation'' in the synthesized output.
A motivating application is PET-MR attenuation correction: unlike CT, MR signal intensity does not reflect the attenuation of x-rays in tissue. To overcome this, pseudo-CT generated from corresponding MR can be used to compute a map of linear attenuation coefficients (μ-map) for attenuation correction of the PET data acquired on a PET-MRI scanner [20]. This requires mapping of the geometric correspondences between CT and MR.
Learning a contextual correspondence between domains requires not only paired, but well-aligned training data. Such data can be generated by a reliable automatic or manual registration algorithm. As a result, the vast majority of cross-modality image synthesis methods are solely applicable to, or evaluated on, brain image data [1,2,4,8,13,16–19,21–29], due to the low geometric variance across different imaging modalities for this particular organ. For other organs, most methods require that the data be aligned by affine transformations or small deformations [3,9,25,30–33]. However, very distinct geometric variances may occur among these data. Nonlinear geometric variances are often associated with different modalities, such as those caused by the shape of the imaging bed, the field of view and the axial location planning (captured in Fig. 1). We refer to these as ''domain-specific deformations'', the presence of which can compromise the quality of the synthesis. This depends on whether the network can learn the mapping while being invariant to the presence of deformations (which depends on landing on an ideal local minimum of the loss), or whether pre-processing has removed the deformation through successful registration (which is not always feasible and cannot deal with large field-of-view differences).
Methods that allow training with unregistered or unpaired data have recently been proposed [34]. Most state-of-the-art methods use deep convolutional neural networks (CNN) as the image generator within a generative adversarial network (GAN) framework [35]. GAN can represent sharp and even intractable probability densities through a nonparametric approach. It has been widely used in medical image analysis, especially for data augmentation and multi-modality image translation, due to its ability to deal with domain shift [36]. A popular direction for cross-domain image synthesis is to leverage CycleGAN [37] in the training process. Previous studies have shown that CycleGAN can be trained with unpaired brain data [22,28]. However, CycleGAN can mistakenly encode domain-specific deformations as domain-specific features and reproduce the deformations in the synthesized output. Fig. 1 demonstrates two examples. The first row shows a synthesis performed between abdominal CT and T2*-weighted MR, while the second row gives an example of T2-weighted and proton density brain MR with a simulated deformation. In both cases, the deformations specific to the input sources are reproduced by CycleGAN in the output. For applications such as attenuation correction, where voxel-wise attenuation coefficients are computed, domain-specific deformations should be discarded whilst contextual information relating to the cross-domain appearance of anatomical features and organs is retained.
Recently, several modifications of the vanilla CycleGAN have been proposed to enhance the alignment between data from the source and target domains using an additional image alignment measure [30,32,38]. However, the additional image alignment loss conflicts with the original loss function in CycleGAN. Synthesized data in which the domain-specific deformations are reproduced will lead to a lower adversarial loss (of the discriminator in GAN). At the same time, the reproduced deformations harm the alignment between the source and the synthesized data, which leads to a higher alignment loss. As a result, the synthesized data cannot be aligned particularly well to the source data while maintaining a good quality of signal. To address this issue, we propose the deformation invariant CycleGAN model, or DiCyc. Fig. 2 presents the structural differences between the vanilla CycleGAN and the proposed DiCyc generator networks. We introduce a global transformation model and modified layers of the deformable convolutional network (DCN) into the CycleGAN image generator, and propose the use of a novel image alignment loss based on normalized mutual information (NMI). We evaluate the proposed method using both a publicly available multi-sequence brain MR dataset and our private multi-modality (CT, MR) abdominal dataset. DiCyc displayed a better ability to handle disparate imaging domains and to generate synthesized images aligned with the source data, whilst keeping output quality comparable to state-of-the-art models. Furthermore, the ablation experiment demonstrated that, unlike in the state-of-the-art models, the image alignment loss and the GAN loss were minimized together during training without conflict in DiCyc.
Fig. 2.
Structural differences between the vanilla CycleGAN and the proposed DiCyc generator networks. The global non-linear distortion is modeled using a thin-plate-spline (TPS) generated by a spatial transformation subnetwork. Details of the modified deformable convolution are shown in Fig. 4(c). The blue arrows represent the CycleGAN forward pass; the additional forward pass introduced by the deformable convolutional layers is represented by red arrows. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

The main contributions of this paper are as follows:
1. We propose a novel DiCyc architecture using a global transformation network and modified deformable convolution layers in between normal convolution layers to address the problem of domain-specific deformations. The deformable layers are modified to have fewer parameters and offer faster convergence.
2. Rather than the classical ''1 forward pass, 1 backward pass'' training routine, we designed a new expectation-maximization training procedure where each training iteration includes two distinct forward passes (shown as the blue and red arrows in Fig. 2b) and one single backward pass.
3. We designed a novel cycle-consistency loss and an image alignment loss for information fusion. These losses, together with the new training procedure, address the conflict observed between the image alignment loss and the discriminative loss of GAN.
4. We visualized and quantitatively assessed the influence of domain-specific deformations. We demonstrated the negative effects of the conflict between the image alignment loss and the GAN loss in experiments using simulated brain data and realistic abdominal data, and visualized these effects on model convergence in our ablation study.
The paper is organized as follows. Section 2 reviews previous and related techniques. Section 3 gives details of the DiCyc network architecture and the associated loss function. Experiments and datasets used are described in Section 4. The results and discussion are presented in Section 5. Conclusions are given in Section 6.

Non-CycleGAN models
Typically, most image synthesis methods build a mapping function from a source to a target domain using paired and pre-registered data. The mapping can be constructed by learning a regression or a dictionary from a collection of patches or feature examples, as in [16,17,19,39,40]. Another conventional approach is to build an atlas for each domain using registration, such as modality propagation [41–43]. The prediction is given by mapping between atlases. With the rise of deep learning in recent years, neural networks have been used as the cross-domain regressor. For example, the Location Sensitive Deep Network (LSDN) [27] uses a CNN to map the location-dependent patch information between domains. In [26,29], a GAN framework is used to learn the mapping function with a context-aware measure based on a gradient difference loss. Similarly, [44] uses a conditional GAN to synthesize lung histology images. An early method using unpaired data was proposed in [34], where training with unpaired data was addressed as an unsupervised problem. It uses mutual information (MI) to select the best corresponding image patches from unpaired cross-domain data, and maximizes a mixture of global MI and local spatial consistency to synthesize multi-sequence brain MR data. This work uses a preprocessing procedure [41] which includes a registration step. Another approach similar to [34] is to construct a dictionary from patches or image pairs [19,39]. In [45], an algorithm using Weakly-coupled And Geometry (WAG) co-regularized joint dictionary learning is proposed, which learns the patch correspondence from partially unpaired data. Yet, this method was only evaluated using brain images with small geometric variances. A natural strategy in current deep learning based medical image synthesis methods is to model the latent features with arbitrary distributions. For example, [46] assumes the latent features follow a mixed Gaussian distribution, but this method was only evaluated on multi-contrast CT images for segmentation tasks. This paper concentrates on more general synthesis problems between multi-sequence MR data or multi-modal MR and CT data.

CycleGAN-based methods
CycleGAN was first applied to cross-domain medical image synthesis in [31] and [28] for co-synthesis of CT and MR cardiac and brain data, respectively. Both works hint at the influence of deformation affecting results, and so removed such artifacts by regularizing the problem through adding additional information (e.g. segmentation masks) [31], and by co-registration [28]. Similarly, in [11,25,30,31,47], the performance of image synthesis networks was enhanced by jointly training for segmentation tasks. However, these models require extra manual annotations or registration. Without this requirement, many methods integrate image similarity measures into the GAN loss for matching the same structure across different domains. For example, [48] introduced a structure-consistency loss based on the modality independent neighborhood descriptor (MIND) [49]. It has been demonstrated that this structure-constrained CycleGAN can deal to some extent with unregistered multi-modal MR and CT brain data. A similar gradient-consistency loss, based on the normalized gradient cross correlation (GCC), is used in [32] for the same purpose. This method has been evaluated using unpaired but pre-registered multi-modal MR and CT hip images. However, as discussed in Section 1, there is a conflict between the image similarity based losses and the CycleGAN discriminative loss. One potential solution to this problem is to factorize the latent representations into domain-independent semantic features and domain-dependent appearance features, and explicitly filter out the relative spatial deformation between the source and target data [50–52]. This work extends this idea to larger deformations and a wider range of domains.

Notation and background
Our goal is to generate synthesized CT or MR data to help post-processing of the source data, for example, a pseudo-CT map applicable to PET-MR attenuation correction without registering the synthesized data to the source.
We assume that we have images $x_A \in \mathcal{A}$ from domain $\mathcal{A}$, and images $x_B \in \mathcal{B}$ from domain $\mathcal{B}$. For a source image $x_A$, a generator $G_{A\to B}$ is trained to generate a synthesized image $\hat{x}_B = G_{A\to B}(x_A)$. Following the GAN setup, $G_{A\to B}$ and a discriminator $D_B$ are trained to solve the min-max problem of the GAN loss. For brevity, we let $\mathcal{L}_{GAN}^{A\to B}$ denote this GAN loss.

$G_{A\to B}$ maps the data from $\mathcal{A}$ to $\mathcal{B}$ while $D_B$ is trained to distinguish whether an image is real or synthesized. Accordingly, for synthesis from $\mathcal{B}$ to $\mathcal{A}$, there are a generator $G_{B\to A}$, a discriminator $D_A$, and a GAN loss $\mathcal{L}_{GAN}^{B\to A}$. The vanilla CycleGAN framework consists of two symmetric generators $G_{A\to B}$ and $G_{B\to A}$ acting as mapping functions applied to a source domain, and two discriminators $D_A$ and $D_B$ to distinguish real and synthesized data for a target domain [37]. The cycle-consistency loss $\mathcal{L}_{cyc}$ is used to keep the cycle-consistency between the two sets of networks [37]. This gives CycleGAN the ability to deal with unpaired data. The loss of the whole CycleGAN framework is then
$$\mathcal{L}_{CycleGAN} = \mathcal{L}_{GAN}^{A\to B} + \mathcal{L}_{GAN}^{B\to A} + \lambda \mathcal{L}_{cyc}.$$
Recent improvements of CycleGAN [32,48] add an image alignment term $\mathcal{L}_{align}$ to $\mathcal{L}_{CycleGAN}$, which becomes
$$\mathcal{L} = \mathcal{L}_{CycleGAN} + \gamma \mathcal{L}_{align},$$
where $\gamma$ is the weight used to balance the effects of $\mathcal{L}_{align}$ and $\mathcal{L}_{CycleGAN}$. As discussed in Section 1, this causes the conflict between the quality of synthesized images and source-target image alignment. The later parts of this section present a detailed analysis of this problem and our DiCyc solution.
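For concreteness, a minimal sketch of how these loss terms combine in a PyTorch training step is shown below; the tiny convolutional stand-ins and the L1 alignment term are placeholders of our own, while the LSGAN form follows the implementation details reported later in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for G_{A->B}, G_{B->A}, D_B, D_A (real networks are ResNet/PatchGAN).
g_ab = nn.Conv2d(1, 1, 3, padding=1)
g_ba = nn.Conv2d(1, 1, 3, padding=1)
d_b = nn.Conv2d(1, 1, 3, padding=1)
d_a = nn.Conv2d(1, 1, 3, padding=1)

x_a, x_b = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)
lam, gamma = 10.0, 0.9  # weights following the paper's reported setup

fake_b, fake_a = g_ab(x_a), g_ba(x_b)
# LSGAN-style generator losses
l_gan = F.mse_loss(d_b(fake_b), torch.ones_like(d_b(fake_b))) \
      + F.mse_loss(d_a(fake_a), torch.ones_like(d_a(fake_a)))
# Cycle-consistency: A -> B -> A and B -> A -> B should recover the inputs
l_cyc = F.l1_loss(g_ba(fake_b), x_a) + F.l1_loss(g_ab(fake_a), x_b)
# Additional alignment term as in [32,48]; here a trivial L1 placeholder
l_align = F.l1_loss(fake_b, x_a) + F.l1_loss(fake_a, x_b)
loss = l_gan + lam * l_cyc + gamma * l_align
loss.backward()
```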

DiCyc architecture
Adding the alignment loss $\mathcal{L}_{align}$ makes cross-domain image synthesis a multi-task learning problem: $G_{A\to B}$ is trained for image synthesis while aligning the source and synthesized images. Because the relative deformation $T$ between the source and target training images is partially domain-specific, this information is encoded by the discriminator $D_B$. Note that $\mathcal{L}_{GAN}$ and $\mathcal{L}_{align}$ in existing methods [32,48] both work on the source image $x_A$ and the synthesized image $G_{A\to B}(x_A)$. Assuming $G_{A\to B}(x_A)$ is well aligned to $x_A$, and $\hat{x}_B = G_{A\to B}(x_A) \circ T$ is identical to the target image, then even when both images have the same image quality it is always true that
$$D_B^{*}\left(G_{A\to B}(x_A) \circ T\right) > D_B^{*}\left(G_{A\to B}(x_A)\right)$$
for an optimal discriminator $D_B^{*}$. At the same time,
$$\mathcal{L}_{align}\left(x_A, G_{A\to B}(x_A)\right) < \mathcal{L}_{align}\left(x_A, G_{A\to B}(x_A) \circ T\right).$$
As a result, $\mathcal{L}_{GAN}$ and $\mathcal{L}_{align}$ lead to gradients with opposite directions:
$$\nabla_{\theta}\mathcal{L}_{GAN} \cdot \nabla_{\theta}\mathcal{L}_{align} < 0,$$
where $\theta$ denotes the network parameters. Any choice of the hyperparameter $\gamma > 0$ or data augmentation for $T$ will cause a trade-off between image quality and data alignment.
To solve the problem of inverse gradients, we model the deformation using a separate set of parameters $\phi$. For example, in the $A\to B$ process, $G_{A\to B}$ outputs two synthesized images: one undeformed image aligned to the source,
$$\tilde{x}_B = G_{A\to B}(x_A \mid \theta),$$
and one deformed image that is identical to the target,
$$\hat{x}_B = \tilde{x}_B \circ T = G_{A\to B}(x_A \mid \theta, \phi).$$
As shown in Fig. 1, the relative deformation between the source and target domains can be seen as a combination of a global and a local transformation, thus $T = T_{glb} \circ T_{loc}$. The corresponding transformation parameters $\phi = \{\phi_{glb}^{A\to B}, \phi_{loc}^{A\to B}, \phi_{glb}^{B\to A}, \phi_{loc}^{B\to A}\}$ are modeled in different subnetworks of the DiCyc generator (Fig. 2b).
We split the generator into three subnetworks: an encoder $Enc$, a decoder $Dec$ and a transformer $STN$. $STN$ estimates the global transformation $T_{glb}$, parameterized by $\phi_{glb}$. Previous CycleGAN-based methods parameterize $Enc$ and $Dec$ with the image synthesis parameters $\theta$ only. In our DiCyc model, the generator also estimates the local deformations, parameterized by $\phi_{loc}$, which are introduced by a series of deformable convolutional layers. As a result, $Enc$ also produces two versions of the latent features: the undeformed feature map $f(x_A) = Enc(x_A \mid \theta)$ and the locally deformed feature map $\hat{f}(x_A) = f(x_A) \circ T_{loc} = Enc(x_A \mid \theta, \phi_{loc})$.

Global deformation
The global transformer has a similar structure to the thin-plate-spline (TPS) based STN. As shown in Fig. 2b, in the $A\to B$ process, the global deformation is calculated by:
$$T_{glb} = STN\left(f_{A\to B} \oplus f_{B\to A}\right),$$
where $f_{A\to B}$ and $f_{B\to A}$ are the latent features given by the encoders $Enc_{A\to B}$ and $Enc_{B\to A}$, and $\oplus$ represents the concatenation operation. Specifically, a regular grid of $6 \times 6$ control points $P = \{p_i \mid i \in \{1,\dots,36\}\}$ is placed on the latent feature maps of $x_A$. $STN$ outputs the coordinates of the corresponding points $P' = \{p'_i\}$ on the features of $x_B$. TPS maps the deformation decided by $P$ and $P'$ using an interpolation function $\Phi$, which has the form:
$$\Phi(q) = Aq + b + \sum_{i=1}^{36} w_i\, U\!\left(\lVert q - p_i \rVert_2\right),$$
where $q$ is the regular image grid and $w_i$ are the weights assigned to the control points. $A$ and $b$ define the affine transformation between $P$ and $P'$. $U$ is a radial basis kernel of the form:
$$U(r) = r^2 \log r^2.$$
Note that the transformer uses a normalized grid where the coordinates $q \in [-1, 1]^2$.
It has been proved that this form of interpolation function minimizes the bending energy of a surface [53], so it has minimal effect on image quality. Based on this analysis, for better synthesis quality, we wish to keep the local deformation to a minimal level within a small spatial area. When the local deformation $T_{loc}$ is ignored, the whole DiCyc model is as shown in Fig. 3.
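As an illustration, below is a minimal PyTorch sketch of building a TPS sampling grid for `F.grid_sample` from control points $P$ and $P'$, using the radial basis kernel above; all function and variable names are ours, and the linear-solver details are an assumption rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def tps_u(r2, eps=1e-9):
    # Radial basis kernel U(r) = r^2 log r^2, with U(0) = 0
    return r2 * torch.log(r2 + eps)

def tps_grid(ctrl_src, ctrl_dst, out_h, out_w):
    """Normalized [-1, 1] sampling grid warping ctrl_src onto ctrl_dst.
    ctrl_src, ctrl_dst: (K, 2) control points in [-1, 1], (x, y) order."""
    k = ctrl_src.shape[0]
    # Solve the TPS system [[U, P], [P^T, 0]] [w; a] = [ctrl_dst; 0]
    u = tps_u(torch.cdist(ctrl_src, ctrl_src) ** 2)          # (K, K)
    p = torch.cat([torch.ones(k, 1), ctrl_src], dim=1)       # (K, 3)
    lhs = torch.cat([torch.cat([u, p], 1),
                     torch.cat([p.t(), torch.zeros(3, 3)], 1)], 0)
    rhs = torch.cat([ctrl_dst, torch.zeros(3, 2)], 0)
    params = torch.linalg.solve(lhs, rhs)                    # (K+3, 2)
    w, a = params[:k], params[k:]
    # Evaluate Phi(q) = a0 + A q + sum_i w_i U(||q - p_i||) on a regular grid
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, out_h),
                            torch.linspace(-1, 1, out_w), indexing="ij")
    q = torch.stack([xs, ys], dim=-1).reshape(-1, 2)         # (H*W, 2)
    phi = tps_u(torch.cdist(q, ctrl_src) ** 2) @ w \
        + torch.cat([torch.ones(q.shape[0], 1), q], 1) @ a
    return phi.reshape(out_h, out_w, 2)

# Usage: warped = F.grid_sample(image, tps_grid(p_src, p_dst, h, w).unsqueeze(0))
```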

Local deformation
We use a modified DCN structure in the encoder to model the deformation in a local neighborhood after the latent features $f_{A\to B}$ and $f_{B\to A}$ are globally aligned. A deformable convolutional layer interpolates the input feature maps through an ''offset convolution'' operation, followed by a normal convolutional layer [54]. This architecture separates the information about local spatial deformation and image context into two forward passes, which further removes the conflict introduced by $\mathcal{L}_{align}$. As shown in Fig. 2b, we add an offset convolutional layer (displayed in cyan) before the input convolution layer, the two down-sample convolution layers and the stack of Resnet blocks. This leads to a ''lasagne-like'' structure consisting of interleaved ''offset convolution'' and conventional convolution operations, so that the spatial deformation is gradually encoded through each layer. The red and blue arrows in Fig. 2b display the computation flows for generating $\hat{f}(x)$ and $f(x)$ in the forward passes. Fig. 4 demonstrates details of the deformable convolution and our modified version used in this work. The deformable convolution can be viewed as an atrous convolution kernel with trainable dilation rates, as shown in Fig. 4a. This dilation rate varies across different locations of the input feature maps. As shown in Fig. 4b, the offset of each point in the N-channel input feature maps is learned by a standard convolutional operation, outputting 2N ''offset maps'' (a 2-D deformation for each input feature map is represented by one ''x'' and one ''y'' offset map) [54]. The N input feature maps are then interpolated using the 2N offset feature maps. These operations together are termed an ''offset convolution''. A standard convolution layer is then applied to the interpolated feature maps. Put together, these operations form a deformable convolution. Designed originally for object recognition tasks, the deformable convolution deforms each input feature map independently. Instead, to adapt this operation to cross-domain image synthesis, our modified deformable convolution generates a single 2-D deformation that is valid for all input feature maps (Fig. 4c). This is equivalent to directly applying a deformation to the input image and passing it forward through the vanilla CycleGAN generator, and it reduces the number of parameters in the DCN to a minimum. Fig. 4d shows our implementation of the ''offset convolution''.
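A minimal PyTorch sketch of this modified offset convolution is shown below, implementing the shared 2-D offset as a flow field applied with `grid_sample` before a standard convolution; the module name and the `deform` flag are our own illustrative choices, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedOffsetDeformConv(nn.Module):
    """Offset convolution with a single 2-D offset field shared by all
    input channels, followed by a standard convolution (cf. Fig. 4c/4d)."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        # Predicts 2 offset channels (x and y) instead of 2N as in vanilla DCN
        self.offset_conv = nn.Conv2d(in_ch, 2, k, padding=k // 2)
        nn.init.zeros_(self.offset_conv.weight)  # start from the identity warp
        nn.init.zeros_(self.offset_conv.bias)
        self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)

    def forward(self, x, deform=True):
        if not deform:               # undeformed pass (blue arrows in Fig. 2b)
            return self.conv(x)
        n, _, h, w = x.shape
        # Offsets in normalized [-1, 1] units, one (x, y) pair per location
        offset = self.offset_conv(x).permute(0, 2, 3, 1)     # (N, H, W, 2)
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=x.device),
            torch.linspace(-1, 1, w, device=x.device), indexing="ij")
        base = torch.stack([xs, ys], dim=-1).expand(n, -1, -1, -1)
        warped = F.grid_sample(x, base + offset, align_corners=True)
        return self.conv(warped)     # deformed pass (red arrows in Fig. 2b)
```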
Combined with the global transformation, $\hat{x}_B = Dec_{A\to B}(Enc_{A\to B}(x_A) \circ T_{A\to B})$ is then taken by the corresponding discriminator to compute the GAN losses, while $\tilde{x}_B = Dec_{A\to B}(Enc_{A\to B}(x_A))$ is expected to be aligned with $x_A$. The DiCyc training loss involves the traditional GAN loss and the cycle-consistency loss used in the original implementation of CycleGAN [37], as well as an image alignment loss and an additional cycle-consistency loss introduced by the auxiliary outputs obtained from our two separate forward passes. We detail these below.

GAN loss
For the GAN loss $\mathcal{L}_{GAN}^{A\to B}$, the minmax game of $G_{A\to B}$ and $D_B$ is represented as:
$$G_{A\to B}^{*}, D_B^{*} = \arg\min_{G_{A\to B}}\max_{D_B} \mathcal{L}_{GAN}^{A\to B},$$
where $G_{A\to B}^{*}$ and $D_B^{*}$ represent the optimal generator and discriminator. Theoretically, in our DiCyc model, the loss function of $D_B$ is:
$$\mathcal{L}_{D_B} = \mathbb{E}_{x_B \sim p_{\mathcal{B}}}\left[\log D_B(x_B)\right] + \mathbb{E}_{x_A \sim p_{\mathcal{A}}}\left[\log\left(1 - D_B(\hat{x}_B)\right)\right],$$
where $p_{\mathcal{A}}$ and $p_{\mathcal{B}}$ represent the data distributions in domains $\mathcal{A}$ and $\mathcal{B}$. The GAN loss of generator $G_{A\to B}$ is:
$$\mathcal{L}_{G_{A\to B}} = \mathbb{E}_{x_A \sim p_{\mathcal{A}}}\left[\log\left(1 - D_B(\hat{x}_B)\right)\right].$$
Similarly, the GAN loss of $G_{B\to A}$ is defined with $D_A$ and $\hat{x}_A$ in place of $D_B$ and $\hat{x}_B$.

Image alignment loss
Eq. (11) can be rewritten as:
$$\mathcal{L}_{D_B} = \mathbb{E}_{x_B \sim p_{\mathcal{B}}}\left[\log D_B(x_B)\right] + \mathbb{E}_{\hat{x}_B \sim p_{\hat{\mathcal{B}}}}\left[\log\left(1 - D_B(\hat{x}_B)\right)\right],$$
where $p_{\hat{\mathcal{B}}}$ is the distribution of synthesized domain-$\mathcal{B}$ images. $D_B$ is then trained to discriminate the distributions $x_B \sim p_{\mathcal{B}}$ and $\hat{x}_B \sim p_{\hat{\mathcal{B}}}$ [35]. In the minmax game of the original GAN model, it has been proved that the optimal discriminator is $D_B^{*} = p_{\mathcal{B}} / (p_{\mathcal{B}} + p_{\hat{\mathcal{B}}})$. Substituting this into $\mathcal{L}_{D_B}$, it can be rewritten as:
$$\mathcal{L}_{D_B} = 2\,JS\!\left(p_{\mathcal{B}} \,\Vert\, p_{\hat{\mathcal{B}}}\right) - \log 4,$$
where $JS$ is the Jensen-Shannon divergence, defined via the Kullback-Leibler divergence $KL$ as $JS(p \Vert q) = \frac{1}{2} KL\!\left(p \,\Vert\, \frac{p+q}{2}\right) + \frac{1}{2} KL\!\left(q \,\Vert\, \frac{p+q}{2}\right)$.
Let $s_A$ and $s_B$ be the spatial poses of the two images, with $s_A \sim p(s_A)$ and $s_B \sim p(s_B)$. For a pair of training images, the relation between $s_A$ and $s_B$ is:
$$s_B = s_A \circ T, \qquad s_A = s_B \circ T^{-1},$$
where $\circ\, T^{-1}$ represents the inverse transformation and $T \circ T^{-1}$ represents the identity transformation. With training data suffering from domain-specific deformation, optimally trained $D_B^{*}$ and $G_{A\to B}^{*}$ will inevitably predict that $p(s \mid \hat{x}_B, G_{A\to B}) = p(s_B)$ and $p(s \mid \tilde{x}_B, G_{A\to B}) \neq p(s_B)$, even when $\tilde{x}_B$ has quality comparable with $\hat{x}_B$. As the GAN losses are calculated using $\hat{x}_A$ and $\hat{x}_B$, a new discriminative loss is required to predict which pose distribution the data is sampled from. Based on the infoGAN theory [55], we can maximize the mutual information (MI) between $x_A$ and $\tilde{x}_B$. However, MI yields values from 0 to $+\infty$, which makes it difficult to scale and combine with other losses. Here we propose an image alignment loss based on NMI:
$$\mathcal{L}_{align} = -NMI(x_A, \tilde{x}_B) = -\frac{H(x_A) + H(\tilde{x}_B)}{H(x_A, \tilde{x}_B)},$$
where $H(\cdot)$ denotes the (joint) entropy. Because the deformations are modeled by a separate set of parameters, this image alignment loss can be adopted with any similarity measure suitable for image registration, such as normalized mutual information (NMI) [56], the normalized GCC used in [32], or MIND as in [48] and [49].
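For illustration, a minimal histogram-based NMI in PyTorch is sketched below. Note it is not differentiable through the hard binning (a trainable loss would use a soft, Parzen-window histogram); the helper is our own, not the paper's implementation.

```python
import torch

def nmi(x, y, bins=32, eps=1e-8):
    """Normalized mutual information NMI = (H(x) + H(y)) / H(x, y),
    estimated from a joint intensity histogram."""
    x, y = x.flatten().float(), y.flatten().float()
    xi = ((x - x.min()) / (x.max() - x.min() + eps) * bins).long().clamp(0, bins - 1)
    yi = ((y - y.min()) / (y.max() - y.min() + eps) * bins).long().clamp(0, bins - 1)
    joint = torch.zeros(bins * bins)
    joint.index_add_(0, xi * bins + yi, torch.ones_like(x))  # joint histogram
    pxy = (joint / joint.sum()).reshape(bins, bins)
    px, py = pxy.sum(1), pxy.sum(0)
    h = lambda p: -(p[p > 0] * p[p > 0].log()).sum()         # Shannon entropy
    return (h(px) + h(py)) / (h(pxy.flatten()) + eps)
```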

Cycle-consistency losses
The cycle-consistency loss plays a critical role in the improved performance of CycleGAN compared to a single GAN network, as it forces $G_{A\to B}$ and $G_{B\to A}$ to learn mutually recoverable information from distinct domains. As each DiCyc generator produces an undeformed and a deformed version of the synthesized data, both should be cycle-consistent to encode an optimal representation. This results in two cycle-consistency losses in our DiCyc model. The undeformed cycle-consistency loss is defined as:
$$\mathcal{L}_{cyc} = \mathbb{E}_{x_A \sim p_{\mathcal{A}}}\left[\left\lVert G_{B\to A}(\tilde{x}_B) - x_A \right\rVert_1\right] + \mathbb{E}_{x_B \sim p_{\mathcal{B}}}\left[\left\lVert G_{A\to B}(\tilde{x}_A) - x_B \right\rVert_1\right],$$
and the deformation-invariant cycle-consistency loss is:
$$\mathcal{L}_{dicyc} = \mathbb{E}_{x_A \sim p_{\mathcal{A}}}\left[\left\lVert G_{B\to A}(\hat{x}_B) - x_A \right\rVert_1\right] + \mathbb{E}_{x_B \sim p_{\mathcal{B}}}\left[\left\lVert G_{A\to B}(\hat{x}_A) - x_B \right\rVert_1\right],$$
where the reverse generators are evaluated through their undeformed forward pass.
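A sketch of these two terms, assuming hypothetical generator callables that return the (undeformed, deformed) output pair; the pairing of deformed outputs with the undeformed reverse pass is our reading of the equations above.

```python
import torch.nn.functional as F

def cycle_losses(x_a, x_b, g_ab, g_ba):
    """g_ab / g_ba are assumed to return (undeformed, deformed) outputs;
    index [0] selects the undeformed pass of the reverse generator."""
    tilde_b, hat_b = g_ab(x_a)   # A -> B
    tilde_a, hat_a = g_ba(x_b)   # B -> A
    l_cyc = F.l1_loss(g_ba(tilde_b)[0], x_a) + F.l1_loss(g_ab(tilde_a)[0], x_b)
    l_dicyc = F.l1_loss(g_ba(hat_b)[0], x_a) + F.l1_loss(g_ab(hat_a)[0], x_b)
    return l_cyc, l_dicyc
```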

Training procedure
Based on the discussion above, the overall loss of our DiCyc model is:
$$\mathcal{L}_{DiCyc} = \mathcal{L}_{GAN}^{A\to B} + \mathcal{L}_{GAN}^{B\to A} + \lambda_1 \mathcal{L}_{cyc} + \lambda_2 \mathcal{L}_{dicyc} + \gamma \mathcal{L}_{align}.$$
Treating the cycle-consistency losses as a kind of regularization, training the DiCyc model can be seen as a maximum likelihood estimation (MLE) with the image pose $s$ as a latent variable:
$$\theta^{*}, \phi^{*} = \arg\max_{\theta,\phi} \sum \log \sum_{s} q(s)\, \frac{p(x, s \mid \theta, \phi)}{q(s)},$$
where $q(s)$ is an unknown distribution of the image poses. Based on Jensen's inequality, as $\log(\cdot)$ is a concave function,
$$\log \sum_{s} q(s)\, \frac{p(x, s \mid \theta, \phi)}{q(s)} \geq \sum_{s} q(s) \log \frac{p(x, s \mid \theta, \phi)}{q(s)},$$
which gives a lower bound of the maximum likelihood. For the equality to hold, $p(x, s \mid \theta, \phi)/q(s) = c$, where $c$ is a constant. Thus the distribution $q(s)$ is:
$$q(s) = \frac{p(x, s \mid \theta, \phi)}{\sum_{s} p(x, s \mid \theta, \phi)} = p(s \mid x, \theta, \phi).$$
This MLE learning can be performed through an expectation-maximization (EM) training procedure. The ''E'' step estimates the distribution $q(s)$ by fitting the pose posterior $p(s \mid x, \theta, \phi)$, where $T^{-1}$ is decided by the sampled training data. For learning optimal global transformations, we fix the parameters of $Enc$ and $Dec$ while only updating the STN. In other words, only the parameters $\phi_{glb}$ are updated. In the ''M'' step, the two synthesized images $\tilde{x}$ and $\hat{x}$ are calculated through two forward passes, and the parameters $\theta$ and $\phi_{loc}$ are updated based on $\mathcal{L}_{DiCyc}$.
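The alternating updates can be sketched as follows; the modules, losses and shapes are toy placeholders of our own, meant only to show the E/M split and the two-forward-pass, one-backward-pass pattern, not the paper's actual networks.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the DiCyc subnetworks; names and shapes are illustrative.
enc = nn.Conv2d(1, 8, 3, padding=1)      # encoder (theta, phi_loc live here)
dec = nn.Conv2d(8, 1, 3, padding=1)      # decoder (theta)
stn = nn.Conv2d(16, 2, 3, padding=1)     # global pose subnetwork (phi_glb)
opt_pose = torch.optim.Adam(stn.parameters(), lr=2e-4)
opt_synth = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=2e-4)

x_a, x_b = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)

# "E" step: estimate the global pose only; encoder/decoder stay frozen.
with torch.no_grad():
    feats = torch.cat([enc(x_a), enc(x_b)], dim=1)
e_loss = stn(feats).pow(2).mean()        # placeholder pose-fitting objective
opt_pose.zero_grad(); e_loss.backward(); opt_pose.step()

# "M" step: two forward passes, one backward pass.
f = enc(x_a)
tilde_x_b = dec(f)                       # undeformed pass -> alignment loss
hat_x_b = dec(f)                         # deformed pass (warping omitted here)
m_loss = (tilde_x_b - x_a).abs().mean() + (hat_x_b - x_b).abs().mean()  # placeholders
opt_synth.zero_grad(); m_loss.backward(); opt_synth.step()
```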

Datasets and preprocessing
IXI dataset: We selected two datasets for the multi-sequence MR and cross-modality MR-CT data synthesis tasks. The first was the Information eXtraction from Images (IXI) dataset (http://brain-development.org/ixi-dataset/), which provides co-registered multi-sequence skull-stripped 1.5T and 3T MR images collected from multiple sites. We used 66 proton density (PD-) and T2-weighted volumes, each volume containing 116 to 130 2D slices. For training and testing, 38 pairs and 28 pairs were used, respectively. Our image generators take 2D axial-plane slices of the volumes as inputs. All volumes were resampled to a voxel size of 1.8 × 1.8 × 1.8 mm³, then cropped to a size of 128 × 128 pixels. As each resampled volume contains 94 to 102 slices, over 6000 pairs of IXI images were used in our experiments. As the generators in both CycleGAN and DiCyc are fully convolutional, the predictions are performed on uncropped images. All the images were bias-field corrected and normalized with their mean and standard deviation.

MA3RS dataset: We used a dataset containing 40 pairs of multi-modality abdominal T2*-weighted and CT images collected from 20 patients with abdominal aortic aneurysm. Example images are shown in Fig. 1, where domain-specific deformations can be observed. The data were collected as part of the MA3RS clinical trial [57]. All images were resampled to a voxel size of 1.56 × 1.56 × 5 mm³, and the axial-plane slices trimmed to 192 × 192 pixels. We used 30 volumes for training and 10 volumes for testing. Each resampled volume contains 24 to 40 slices, which gives over 1200 pairs of slices for our experiments.

Evaluation metrics
Ideally, alignment between the data and the quality of the synthesized images could be evaluated by segmentation-based metrics such as the Dice index. However, it is difficult to generate segmentation masks on synthesized data, which can also introduce extra errors into the evaluation. Following the previous image synthesis works discussed in earlier sections, we use three metrics to evaluate the performance of image synthesis, as typically used by other CycleGAN-based methods: mean squared error (MSE), peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM). Given a synthesized volume $\hat{y}$ and a target volume $y$ with $N$ voxels, the MSE is computed as
$$MSE = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2,$$
the PSNR as
$$PSNR = 10 \log_{10} \frac{\max^2(y)}{MSE},$$
and the SSIM as
$$SSIM = \frac{\left(2\mu_{\hat{y}}\mu_{y} + c_1\right)\left(2\sigma_{\hat{y}y} + c_2\right)}{\left(\mu_{\hat{y}}^2 + \mu_{y}^2 + c_1\right)\left(\sigma_{\hat{y}}^2 + \sigma_{y}^2 + c_2\right)},$$
where $\mu$ and $\sigma^2$ are the mean and variance of a volume, and $\sigma_{\hat{y}y}$ is the covariance between $\hat{y}$ and $y$. $c_1$ and $c_2$ are two variables to stabilize the division with a weak denominator [58]. A larger PSNR and SSIM, or a smaller MSE, indicate better performance of a synthesis algorithm. These metrics were used to identify the best performing CycleGAN-based method, which we subsequently refer to as the baseline method. We then evaluated the performance of the proposed DiCyc method against this baseline. A paired t-test was used to assess the difference in mean MSE, PSNR and SSIM values between DiCyc and the selected baseline. For the ablation experiment, a paired t-test was performed on metrics arising from each DiCyc model and its CycleGAN-based counterpart. Differences in performance were considered statistically significant when the p-value resulting from the t-test was less than 0.05.
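These metrics and the paired t-test are available in standard scientific Python packages; a small helper (ours, for illustration) could look like:

```python
import numpy as np
from scipy.stats import ttest_rel
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_volume(pred: np.ndarray, target: np.ndarray) -> dict:
    """MSE / PSNR / SSIM between a synthesized and a target volume."""
    data_range = float(target.max() - target.min())
    return {
        "MSE": float(np.mean((pred - target) ** 2)),
        "PSNR": peak_signal_noise_ratio(target, pred, data_range=data_range),
        "SSIM": structural_similarity(target, pred, data_range=data_range),
    }

# Paired t-test over per-volume scores of two methods (placeholder arrays):
p_value = ttest_rel(np.random.rand(10), np.random.rand(10)).pvalue
```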

Experimental setup
We present three experiments using the two datasets. In the first and second, the performance of our DiCyc model was compared to the vanilla CycleGAN [28] and state-of-the-art CycleGAN models with image alignment losses [32,48]. For all experiments, we applied random affine transformations, including translation, rotation, scaling, shearing and flipping, to the input data as augmentation in the training stage, and we manually set each epoch to contain 6000 iterations for better network convergence. After comparing the performance of the proposed DiCyc with the selected state-of-the-art methods, an ablation study was performed to reveal the influence of the DiCyc architecture and learning procedure.
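Such an augmentation pipeline can be sketched with torchvision; the specific parameter ranges below are our assumptions, as the paper does not report them.

```python
from torchvision import transforms

# Random affine augmentation: translation, rotation, scaling, shearing, flipping.
augment = transforms.Compose([
    transforms.RandomAffine(degrees=10, translate=(0.05, 0.05),
                            scale=(0.9, 1.1), shear=5),
    transforms.RandomHorizontalFlip(),
])
```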
Simulated IXI to identify the influence of domain-specific deformation: As brain structures are mainly rigid and rarely suffer from non-linear deformations, the ground truth obtained from the registered PD- and T2-weighted image pairs allows evident quantitative assessment. When trained on the registered data, all the methods obtained better performance than when trained on unaligned and unpaired data. This provided an upper limit of performance for all the tested methods. To assess the ability of the selected methods to deal with domain-specific deformations, we applied a simulated nonlinear transformation to each T2-weighted image. We performed synthesis experiments using the undeformed PD-weighted images and the deformed T2-weighted images to generate undeformed T2-weighted data and deformed PD-weighted data. Minibatches of the input data were sampled from randomly selected patients and slices. When using deformed T2-weighted images to generate synthesized PD data, the ground truth was generated by applying the same nonlinear deformation to the source PD images. Similarly, the ground truth for the synthesized T2-weighted data was the original undeformed T2-weighted data provided in IXI. Values for the three evaluation metrics were computed between the synthesized images and the ground truths. We also qualitatively evaluated the synthesized images using error images, as in prior works [26,29].

MA3RS data: After being evaluated on the simulated dataset with given ground truths, the methods were further evaluated using realistic data from our MA3RS dataset. Due to ''domain-specific deformations'', the multi-modality images cannot be affinely registered; in particular, the multiple organs in a pair of images can hardly be aligned at the same time. Furthermore, as non-rigid registration remains an ill-posed problem and lacks a gold standard, we did not non-rigidly register the images to generate ground truth for synthesis. However, several objects, such as the aorta and spine, are relatively rigid compared to other surrounding soft tissues such as the lower gastrointestinal tract organs. These objects can be separately registered with affine transformations. As a result, the performance of synthesis should be assessed by the alignment of multiple organs, as well as by quantitative analysis of image quality. In this work, for each volume in the MA3RS dataset, the anatomy of the aorta was manually segmented (as described in [59]). Multi-modality data acquired from the same patient were affinely registered so that the segmented aortas were well aligned. The manual registration and segmentation were performed by 4 clinical researchers. The signal of the synthesized images was evaluated within the aorta segmentation using the three metrics described above. Image alignment between the source and synthesized data was visually assessed within both the aorta and spine regions. To sum up, a method with better performance should generate images showing better alignment in both the aorta and spine regions while achieving lower MSE, higher PSNR, and higher SSIM. In the training stage, the input minibatch was sampled from the same patient but randomly selected slices, as described in [48]. The data were augmented with transformations similar to those applied to the IXI dataset.
Ablated models with different alignment losses: The CycleGAN-based models do not handle the conflict between the additional image alignment losses and the discriminative GAN loss, and thus cannot achieve good data alignment without sacrificing the quality of the synthesized data. By contrast, the architecture and associated training algorithm of DiCyc handle the geometric deformation and the contextual correspondence between the domains separately. This property plays a key role in generating synthesized data that are aligned with the source data while maintaining good contextual synthesis. To prove this argument, it is necessary to analyze the different behaviors of an image alignment loss when used in the CycleGAN and DiCyc frameworks. Furthermore, current CycleGAN-based models use GCC and MIND, whereas we use an NMI-based alignment loss given in Eq. (19). To verify our proposed alignment loss, it is necessary to compare the performance of GCC, MIND and NMI under the same architecture and training procedure.
With these motivations in mind, we performed an ablation experiment using the IXI dataset, where different image alignment losses were integrated within both the CycleGAN and DiCyc models. Specifically, we replaced the NMI-based alignment loss used in the proposed model with the GCC- and MIND-based alignment losses to build a GCC-DiCyc and a MIND-DiCyc. Similarly, our NMI-based alignment loss was added to the CycleGAN loss to build an NMI-CycleGAN. The performance of the DiCyc models with different alignment losses was then compared to their CycleGAN-based counterparts. We performed a paired t-test on the evaluation metrics for each pair of CycleGAN and DiCyc models with the same alignment loss to evaluate any improvement in performance introduced by our new architecture. Any improvements introduced by the NMI-based alignment loss can be seen by comparing the performance of the DiCyc models using different alignment losses. The evolution of the loss values and synthesis results was also visually assessed throughout the training process.

Implementation details
We used image generators with 6 Resnet blocks, and 70 × 70 PatchGAN [60] as the discriminator networks. Based on the default setup of CycleGAN, we use the LSGAN loss to compute $\mathcal{L}_{GAN}$. Experiments were implemented in PyTorch and paired t-tests were performed using the SciPy library. All parameters of, or inherited from, vanilla CycleGAN are taken from the PyTorch implementation of the original paper. The first convolutional layer uses 7 × 7 kernels, all others use 3 × 3 kernels. The first convolution outputs 64 channels of feature maps, followed by layers with 128 and 256 channels. All the convolutions in the Resnet blocks have 256 channels.
For DiCyc, we set $\lambda_1 = \lambda_2 = 10$ and $\gamma = 0.9$. The models were trained with the Adam optimizer [61] with a fixed learning rate of 0.0002 for the first 100 epochs, followed by 100 epochs with a linearly decreasing learning rate. Here we apply a simple early-stopping strategy: in the first 100 epochs, when $\mathcal{L}_{DiCyc}$ stops decreasing for 10 epochs, the training moves to the learning-rate decay stage; similarly, this tolerance is set to 20 epochs in the second 100 epochs. For the selected benchmark CycleGAN-based models, unless mentioned above, the hyper-parameter setup follows the original publications. Experiments were performed with NVIDIA Tesla K80 GPUs provided by the Amazon AWS EC2 cloud computing platform.
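The learning-rate schedule can be sketched as follows; the Adam betas are an assumption carried over from the CycleGAN defaults, and the early-stop switch is omitted for brevity.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(1, 1, 3, padding=1)  # placeholder for the DiCyc networks
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.5, 0.999))

def lr_lambda(epoch: int) -> float:
    # Flat for the first 100 epochs, then linear decay to zero over 100 epochs.
    return 1.0 if epoch < 100 else max(0.0, 1.0 - (epoch - 100) / 100.0)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
for epoch in range(200):
    # ... one training epoch of 6000 iterations ...
    scheduler.step()
```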

Results and discussion
This section presents the performance of all the models assessed. For each experiment, we visualize the data from the source domain and the synthesized results. Quantitative results are shown in terms of MSE, PSNR and SSIM.

Fig. 5. Examples of synthesis from the IXI dataset: an arbitrary deformation was applied to the T2-weighted images, and the ground truth of the synthesized proton density (PD) weighted image was generated by applying the same deformation.

Fig. 5 shows an example of the synthesized images generated by the methods we tested, along with the error images calculated between the synthesized data and the corresponding ground truth. For a fair visual comparison, here we present the results obtained by all the compared baselines with the same non-linear deformation. As the simulated ''domain-specific deformation'' was applied to the T2-weighted data, the synthesized PD-weighted data should display the same deformation, aligned with the source data. Similarly, the synthesized T2-weighted data should be aligned with the source PD-weighted data without showing the simulated deformation. However, as shown in Fig. 5, the vanilla CycleGAN model reproduced the simulated deformation in the synthesized T2-weighted image and did not show the simulated deformation in the synthesized PD-weighted image. Although GCC-CycleGAN and MIND-CycleGAN reduce the misalignment effect of the simulated deformation, the synthesized and source data are still not well aligned. Furthermore, the synthesis results generated by the three CycleGAN-based models are blurry and show visible artifacts. In contrast, our DiCyc model gave the best alignment between the source and synthesized data and also led to better image quality when assessed visually.

DiCyc versus CycleGAN-based models on IXI
The quantitative evaluation of multi-sequence MR synthesis using the IXI dataset is shown in Table 1, where the best result for each metric is shown in bold and the optimum baseline method chosen for the paired t-test is highlighted by a gray background. Vanilla CycleGAN trained on paired and registered images (without simulated deformation) gave the best results, with PSNR > 24.3, SSIM > 0.817 and MSE ≤ 0.036. This is considered the upper bound of synthesis performance. Trained with unpaired data that have simulated deformations, the vanilla CycleGAN gave a lower-bound baseline of performance. With additional image alignment losses, the GCC-CycleGAN and MIND-CycleGAN methods led to improvements in terms of PSNR. However, because these two models are still affected by the simulated domain-specific deformation, their performance remained comparable to vanilla CycleGAN.
In contrast, the proposed DiCyc model led to at least an 18% improvement in MSE, and 8% and 12% performance gains in terms of PSNR and SSIM on the IXI data. The results were statistically significant based on the paired t-tests (p-value < 0.05). Table 2 shows the quantitative assessment of the four models based on the same metrics used for the IXI data. The vanilla CycleGAN had slightly better performance compared to the GCC- and MIND-CycleGAN models. The only exception is that the MIND-CycleGAN model obtained a higher PSNR in the ''T2*→CT'' synthesis. Our DiCyc model outperformed the other three methods according to all the metrics. Note that in the ''CT→T2*'' synthesis, DiCyc led to a 20% performance gain in terms of MSE, and achieved 22.8% higher SSIM. Differences between the performance achieved by the DiCyc model and the best baseline methods were statistically significant. The quantitative results shown in Table 2 can be affected by both the quality of the synthesized images and the alignment between the source and synthesized data. As discussed above, some objects in the images can be affinely registered independently, for example the anatomy of the aorta and spine. However, these two objects cannot be affinely aligned at the same time as a result of domain-specific deformations. This leads to lower PSNR and SSIM, and higher MSE values within the segmented region of the aorta.

DiCyc versus CycleGAN-based models on MA3RS
For better assessment of the effects of the domain-specific deformation, the synthesis results of the compared baselines and our TPS-based DiCyc model are displayed in Fig. 6 using a checkerboard visualization. As shown in Fig. 6, when the region of the aorta is affinely aligned, the CycleGAN-based methods either achieved worse alignment in the spine area (for example, the synthesized CT produced by CycleGAN and GCC-CycleGAN, and the synthesized T2*-weighted image given by GCC-CycleGAN), or generated significant artifacts (for example, in the aorta area of the synthesized CT output by CycleGAN and MIND-CycleGAN). Our DiCyc model is the only model that produces synthesized images where both the aorta and spine are simultaneously aligned. Although the synthesized T2*-weighted images look slightly blurred, our DiCyc model generated fewer artifacts.

Fig. 7 presents the synthesized images produced by the ablated models using different alignment losses, and the quantitative evaluation results are shown in Table 3. As shown in Fig. 7, all the DiCyc-based models achieved better alignment between the source and synthesized data. This is consistent with the quantitative results shown in Table 3, where in most cases the ablated DiCyc models achieved lower MSE and higher PSNR and SSIM values. However, using GCC- and MIND-based alignment losses within the DiCyc framework caused a shift of intensities in the synthesized data. The most obvious example is the synthesized T2-weighted image produced by MIND-DiCyc, which looks more like the source PD-weighted data than the target T2-weighted data. As a result, the MIND-DiCyc model gave higher MSE and lower PSNR values in the ''PD→T2'' synthesis. By contrast, this intensity shift was not observed in the synthesized data generated by our proposed NMI-based DiCyc model. The proposed NMI-DiCyc model outperformed the ablated GCC-DiCyc and MIND-DiCyc models, as well as the state-of-the-art CycleGAN-based methods.

Ablation study
Figs. 8 and 9 demonstrate the evolution of the compared image alignment losses and the synthesis results in the CycleGAN and DiCyc frameworks during the training process. Comparing the synthesis results produced by the CycleGAN-based methods (Figs. 8a, 8c and 8e) with those generated by the DiCyc models (Figs. 8b, 8d and 8f), we can see that the CycleGAN methods can achieve good data alignment within the first 20 epochs of training. However, as the training algorithm continues to minimize the CycleGAN losses, the domain-specific deformation is gradually reproduced. As the DiCyc framework trains the image alignment loss and the CycleGAN loss separately in two forward passes, the relative deformation between the source and target domains is removed. As shown in Figs. 9a, 9b and 9c, in the CycleGAN framework each alignment loss was minimized at a certain point of the training process, but then kept increasing as it started to conflict with the GAN discriminative losses. In our DiCyc framework, the alignment losses kept decreasing throughout the whole training process. Comparing the results shown in Figs. 8b, 8d and 8f, we can see that the ablated GCC- and MIND-DiCyc models reproduced the appearance of the PD-weighted data in the synthesized T2-weighted data. This means GCC and MIND are still more domain-dependent measures compared to NMI, although they have been widely used in multi-modality registration methods. However, GCC and MIND are easily vectorized and their associated backward passes are easier to implement, with lower computational complexity.

Model complexity
For the CycleGAN-based baselines compared above, each generator network has 34.52M trainable parameters, and each discriminator network has 2.76M. As a result, in the training stage, a CycleGAN-based model has 74.56M trainable parameters and each forward pass consists of 37.98G multiply-add operations (MACs) when processing 128 × 128 image data. For our DiCyc model, the local and global transformation modules introduce 8.15M and 4.31M trainable parameters, respectively, and each forward pass consists of 66.36G MACs. As a result, it takes 75% more time and 33% extra memory to train a DiCyc model. However, once trained, prediction of the synthesized images is performed only by the image generator, without the global and local deformation modules. In other words, in the testing stage, the proposed DiCyc model has the same temporal and spatial complexity as the CycleGAN-based methods (34.52M trained parameters, 18.24G MACs per forward pass).
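Parameter and MAC counts of this kind can be reproduced with a profiling utility such as the thop package; the network below is a toy stand-in of our own, not the actual DiCyc generator.

```python
import torch
import torch.nn as nn
from thop import profile  # pip install thop

# Toy stand-in; replace with the actual generator to reproduce the counts above.
toy_generator = nn.Sequential(
    nn.Conv2d(1, 64, 7, padding=3), nn.ReLU(inplace=True),
    nn.Conv2d(64, 1, 7, padding=3),
)
macs, params = profile(toy_generator, inputs=(torch.randn(1, 1, 128, 128),))
print(f"{params / 1e6:.2f}M parameters, {macs / 1e9:.2f}G MACs per forward pass")
```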

Conclusion
We introduced DiCyc, a cross-domain medical image synthesis model that is resilient to domain-specific deformations. We integrated a modified deformable convolutional layer into the network architecture, and proposed the associated deformation-invariant cycle-consistency loss and NMI-based alignment loss function. Experiments were performed for the synthesis of multi-sequence MRI data with simulated deformations and of multi-modality CT and MRI data suffering from actual domain-specific deformations. We compared our method to the vanilla CycleGAN method and two state-of-the-art methods with additional alignment losses. Our DiCyc method achieved better alignment between the source and synthesized data while maintaining the signal quality of the synthesized data, outperforming the state-of-the-art methods. To reveal the mechanism by which DiCyc separately encodes the information about spatial deformation in the synthesis process, we also performed an ablation study by integrating popular image similarity metrics into DiCyc and comparing them with their CycleGAN-based counterparts. It showed that the DiCyc model avoids the conflict between the CycleGAN loss and the image alignment losses. Our NMI-based image alignment loss also demonstrated better robustness for synthesis of images from different domains.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.