Deformation equivariant cross-modality image synthesis with paired non-aligned training data

Cross-modality image synthesis is an active research topic with multiple clinically relevant medical applications. Recently, methods allowing training with paired but misaligned data have started to emerge. However, no robust and well-performing methods applicable to a wide range of real world data sets exist. In this work, we propose a generic solution to the problem of cross-modality image synthesis with paired but non-aligned data by introducing new loss functions that encourage deformation equivariance. The method consists of joint training of an image synthesis network together with separate registration networks and allows adversarial training conditioned on the input even with misaligned data. The work lowers the bar for new clinical applications by allowing effortless training of cross-modality image synthesis networks for more difficult data sets.


Introduction
Image-to-image translation is an active area of research in computer vision because of its various applications such as image synthesis, segmentation, restoration, style transformation and pose estimation. After the advent of deep learning, medical imaging, as a cardinal application area, has seen an increasing interest in the use of image-to-image translation. In histopathology, image-to-image translation has been used, e.g., for cross-stain translation (Liu et al., 2021; Xu et al., 2019), for replacing chemical staining by a digitally generated mask (Valkonen et al., 2019), for tissue color normalization (de Bel et al., 2019; de Bel et al., 2021), and for virtual staining of label-free or unstained tissue images (Bayramoglu et al., 2017; Rana et al., 2020; Rivenson et al., 2019). In radiology, it has been used for pseudo CT generation and cross-modality MRI synthesis, and the synthesized images have been shown to be useful for downstream tasks, e.g., segmentation (Boulanger et al., 2021; Spadea et al., 2021; Xie et al., 2022). Image-to-image translation methods are primarily divided into two categories: supervised methods that rely upon aligned image pairs and unsupervised methods that do not require aligned image pairs. In the medical setting, including supervision tends to improve the results (Jin et al., 2019; Klages et al., 2020; Li et al., 2020; Peng et al., 2020; Fard et al., 2022).
Image-to-image translation is called differently depending on the application, and in this work we will call it cross-modality image synthesis, which is often used in the medical imaging context. We use the term modality broadly to refer to any distinct image types capturing different characteristics of the underlying anatomy.
©2023. This manuscript version is made available under the CC-BY-NC-ND 4.0 license https://creativecommons.org/licenses/by-nc-nd/4.0/
Figure 1: Basic setting. Only non-aligned pairs of $x^{(i)}$ and $\tilde{y}^{(i)}$ are available but the task is to learn $F$ for transforming $x^{(i)}$ into $y^{(i)}$. The images are from the synthetic "multimodal" data sets built using the COCO (Lin et al., 2014) data set.
In the medical domain, images of the same subject from different modalities are usually not anatomically aligned. To solve this, the images are typically registered, or in other words aligned anatomically, before training a network. Deep learning registration methods have gained popularity (Fu et al., 2020), with the best methods performing close to classical registration algorithms, e.g., in the Learn2Reg multitask medical image registration challenge (Hering et al., 2021) or in the histopathology ANHIR competition (Borovec et al., 2020).
Methods combining the two, cross-modality image synthesis and cross-modality registration, have also started to surface. In registration, a synthesized image can be used as a bridge to generate a cross-modality similarity metric (Lu et al., 2021). However, some methods combine these two into a unified architecture, solving both problems at the same time. Such methods have been published from both the registration (Arar et al., 2020; Chen et al., 2022) and the image synthesis viewpoint (Joyce et al., 2017; Kong et al., 2021; Wang et al., 2018, 2019, 2021a).
In this paper, we propose a new architecture for cross-modality image synthesis which is robustly trainable with misaligned training data, using a novel strategy that couples registration and image synthesis networks during training. Firstly, we suggest training the image synthesis network directly for deformation equivariance, which refers to the property that applying a deformation before or after the image synthesis should result in the same image. Secondly, we develop a strategy allowing adversarial training conditioned on the input images despite using misaligned training data, which earlier methods could not do robustly. Conditioning adversarial training on the input is especially important in the medical domain as it results in more reliable predictions. In addition to the better-quality predictions, the method is applicable to a wider range of data sets than earlier similar methods. In the experiments we show that the method improves upon earlier methods trainable on non-aligned data and that it also surpasses the standard approach where the images are registered before training, assuming that no significant manual effort is put into the registration.
Assuming that the images are continuous, they can be seen as mappings $x^{(i)}: \mathbb{R}^n \to \mathbb{R}^{m_1}$ and $y^{(i)}, \tilde{y}^{(i)}: \mathbb{R}^n \to \mathbb{R}^{m_2}$, where $n$ is the dimensionality of the image (e.g., $n = 2$ for two-dimensional images) and $m_1$ and $m_2$ are the numbers of channels in the input and target images, respectively (e.g., 3 for RGB images). The deformations are then mappings $d^{(i)}: \mathbb{R}^n \to \mathbb{R}^n$ connecting the image coordinates. A coordinate transformation of an image $x^{(i)}$ by a deformation $d^{(i)}$ equals the function composition $x^{(i)} \circ d^{(i)}$, which can be written using the pullback notation as $d^{(i)*} x^{(i)} := x^{(i)} \circ d^{(i)}$, where $d^{(i)*}$ can be seen as a mapping acting on images.
Following the notation, we have the relationship $d^{(i)*} y^{(i)} = \tilde{y}^{(i)}$ between the aligned and non-aligned targets. In practice, the images are not continuous; only samples of the images, i.e., the pixels or voxels, are available. Hence, in reality, applying the mapping $d^{(i)*}$ amounts to interpolating the image at the locations defined by the deformation, and we use linear interpolation.
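As an illustration of the pullback acting on sampled images, the following is a minimal NumPy sketch of bilinear interpolation for a single-channel two-dimensional image. The function names `pullback` and `identity_grid`, the (row, column) coordinate convention, and the clamping of out-of-image locations to the border are assumptions made for this example only and are not taken from our implementation.

```python
import numpy as np

def pullback(image, deformation):
    """Sample a 2D image at deformed coordinates with bilinear interpolation.

    image:       array of shape (H, W)
    deformation: array of shape (2, H, W); deformation[:, r, c] is the
                 (row, col) location d(r, c) at which the image is sampled.
    Locations outside the image are clamped to the border (an assumption).
    """
    h, w = image.shape
    rows = np.clip(deformation[0], 0, h - 1)
    cols = np.clip(deformation[1], 0, w - 1)
    r0 = np.floor(rows).astype(int)
    c0 = np.floor(cols).astype(int)
    r1 = np.minimum(r0 + 1, h - 1)
    c1 = np.minimum(c0 + 1, w - 1)
    fr = rows - r0  # fractional offsets inside the pixel cell
    fc = cols - c0
    top = (1 - fc) * image[r0, c0] + fc * image[r0, c1]
    bottom = (1 - fc) * image[r1, c0] + fc * image[r1, c1]
    return (1 - fr) * top + fr * bottom

def identity_grid(h, w):
    """Identity deformation: every pixel samples its own location."""
    return np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij")).astype(float)

image = np.arange(16, dtype=float).reshape(4, 4)
grid = identity_grid(4, 4)
# Deformation that samples half a pixel to the right of each location.
shifted = pullback(image, grid + np.array([0.0, 0.5]).reshape(2, 1, 1))
```

With the identity deformation the image is returned unchanged, and the half-pixel shift averages horizontally neighbouring pixels.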
In this work, we study a setting where we are trying to learn a function $F$, which is a neural network, such that $F(x^{(i)}) = y^{(i)}$. To do this, we simultaneously try to learn a second neural network, or, as it turns out, multiple networks, for predicting $d^{(i)}$. However, at test time we still only want to use the network $F$.

Cross-Modality Image Synthesis
Cross-modality medical image synthesis has gained a lot of attention in recent years with multiple proposed clinical applications (Wang et al., 2021b). The conditional GAN-based architecture pix2pix (Isola et al., 2017) is widely used when paired and aligned data are available, as it is based on an assumption of pixel-to-pixel correspondence between training images of different modalities. On the other hand, CycleGAN (Zhu et al., 2017) can be used without paired or aligned data.
Paired training images in the medical context are typically not aligned, and hence for pixel-to-pixel training they are registered into the same coordinate system. However, registration is never perfect, and pixel-to-pixel losses are very sensitive to registration errors, which reduces the synthesis quality especially in difficult-to-register areas with large internal anatomic motion, such as the pelvic area (Wang et al., 2021b).
In the pixel-to-pixel setting, different approaches have been proposed to mitigate the remaining registration errors (Chen et al., 2020a; Joyce et al., 2017; Kazemifar et al., 2019; Leynes et al., 2018; Yu et al., 2019). Most similarly to our work, Kong et al. (2021) combine a cross-modality image synthesis network with a registration network to enable training with non-aligned data. However, we argue that their method does not robustly mitigate registration errors, especially with real world data sets, as will be discussed in Section 4. Additionally, they use unconditional adversarial training, which has been shown to be inferior to conditioning the discriminator on the input images. In this work we aim to solve both of these problems.
While performing worse than pix2pix when paired and aligned data are available, the unsupervised CycleGAN is more robust against misalignments due to its cycle consistency loss (Kaji and Kida, 2019; Wang et al., 2021b). However, if the misalignments between the modalities are systematic and severe, CycleGAN can also fail to produce geometrically aligned predictions. Approaches similar to the ones used with pix2pix have also been employed with CycleGAN (Zhang et al., 2018; Hiasa et al., 2018; Kida et al., 2019). Wang et al. (2018, 2019, 2021a) propose network architectures combining cross-modality image synthesis and registration, together with a mutual information loss between the input and the prediction to promote similar geometry.

Cross-Modality Registration
Deformable medical image registration using deep learning has also gained popularity recently (Fu et al., 2020). With a stationary velocity field parametrization, one can generate diffeomorphic deformations (Arsigny et al., 2006; Ashburner, 2007). This kind of methodology was applied to deep learning by Dalca et al. (2018). Some architectures, such as the one by De Vos et al. (2019), combine affine or rigid registration with a separate deformable registration, resulting in a multi-stage registration approach.
Among cross-modality registration methods, the method by Arar et al. (2020) is closest to our work. They train a cross-modality registration network by simultaneously training a cross-modality image synthesis network which they encourage to be equivariant to deformations predicted by the registration network. This is done by applying the predicted deformation both before and after the image synthesis network and comparing both of the results to the target. We instead use simulated deformations for encouraging deformation equivariance. We argue that this leads to a more robust approach, which we verify in the experiments. Our method of encouraging deformation equivariance is similar to the method by Pielawski et al. (2020), where they train their network for rotational equivariance.
Very recently, Chen et al. (2022) use a contrastive learning based loss for promoting geometric (or shape) similarity of the image synthesis in an otherwise similar setting to Arar et al.

Methods
To teach the network $F$ to predict $y^{(i)}$ from $x^{(i)}$, Kong et al. (2021) train an additional network $G$ aimed at learning the deformation $d^{(i)}$. They train both of the networks $F$ and $G$ simultaneously with a similarity loss (which we label the default similarity loss, Equation (1)) together with a regularization loss, in which Reg is some operator penalizing non-smooth deformations, and an unconditional adversarial loss with the intent of training the distribution of $F(x^{(i)})$ to match the distribution of $\tilde{y}^{(i)}$.
Figure 2: Proposed core architecture. When using the equivariance similarity loss instead of the default similarity loss, the commutation loss is optional. The architecture presented here is further refined by adding an adversarial loss and a separate cross-modality registration network for first registering $\tilde{y}^{(i)}$ to $x^{(i)}$. Deformation $t$ is a random deformation sampled on the fly individually for each training image pair. The images are from the synthetic "multimodal" data sets built using COCO (Lin et al., 2014).
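One way to write the default similarity loss and the regularization loss that is consistent with the description above is sketched below. The sketch assumes that $G$ receives the prediction and the non-aligned target and outputs the deformation that is applied to the prediction, and it leaves the norm $\|\cdot\|$ unspecified (an L1 norm would be one possible choice); both are assumptions made for illustration.

```latex
\mathcal{L}_{\text{def-sim}} :=
  \left\| G\big(F(x^{(i)}), \tilde{y}^{(i)}\big)^{*} F(x^{(i)}) - \tilde{y}^{(i)} \right\|,
\qquad
\mathcal{L}_{\text{reg}} :=
  \operatorname{Reg}\Big( G\big(F(x^{(i)}), \tilde{y}^{(i)}\big) \Big)
```

Under these assumptions the similarity term vanishes when $F(x^{(i)}) = y^{(i)}$ and $G$ predicts $d^{(i)}$, since $d^{(i)*} y^{(i)} = \tilde{y}^{(i)}$.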
In their work, Kong et al. view the deformations between inputs and targets as noise and assume the same underlying physical distribution for both the inputs and the targets. In that setting, the adversarial training objective they use is justified. However, images in the target domain might often be systematically geometrically different from the images in the input domain, e.g., when patients are lying differently within different medical imaging equipment. In that case, matching the distribution of $F(x^{(i)})$ with the distribution of $\tilde{y}^{(i)}$ is not desirable.
To fix this we first omit the adversarial loss altogether, although we will develop a revised adversarial training strategy later. However, without the adversarial loss the optimization problem is very unstable, since the network $F$ is not in any way constrained to preserve the geometry of the input images, that is, the predictions are not guaranteed to be anatomically aligned with the inputs. If $F$ is a convolutional neural network, it has an inductive bias towards this kind of behaviour, but there is no guarantee that the convolutional network does not, e.g., shift the predictions while the registration network compensates for the shift. An example of a possible failure mode is shown later in Figure 8. It is also noteworthy that while having the adversarial training in a setting similar to the work by Kong et al. (2021) will definitely stabilize the training, there is no fundamental theoretical reason why it should result in $F$ preserving the geometry of its inputs.
The property of $F$ preserving the geometry of an input can be formulated as deformation equivariance. Any movement in the underlying anatomy of the input image should be reflected similarly in the output image. Assuming a set of anatomically possible geometric deformations $T^{(i)}$ for each input image $x^{(i)}$, a network $F$ is deformation equivariant over the deformations if for all $t \in T^{(i)}$ it holds that
$t^{*} F(x^{(i)}) = F(t^{*} x^{(i)})$. (2)
In other words, $F$ should commute with the deformations of $x^{(i)}$.
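The equivariance property in Equation (2) can be checked numerically on toy examples. The sketch below uses hypothetical stand-ins for $F$ (a pointwise intensity mapping, and the same mapping followed by a circular shift) and deformations that are exact on the pixel grid (a 90 degree rotation and a circular translation), so no interpolation is involved. All names are illustrative only.

```python
import numpy as np

def equivariance_error(f, t, image):
    """Mean absolute difference between t(f(x)) and f(t(x))."""
    return float(np.mean(np.abs(t(f(image)) - f(t(image)))))

# Two toy "synthesis" functions (hypothetical stand-ins for a network F).
def pointwise(image):
    return 1.0 - image                       # intensity mapping, geometry preserved

def shifting(image):
    return np.roll(1.0 - image, 2, axis=1)   # same mapping plus a 2-pixel shift

# Deformations that are exact on the pixel grid.
def rotate90(image):
    return np.rot90(image)

def translate(image):
    return np.roll(image, 3, axis=1)

rng = np.random.default_rng(0)
x = rng.random((8, 8))

err_pointwise_rot = equivariance_error(pointwise, rotate90, x)  # zero
err_shift_trans = equivariance_error(shifting, translate, x)    # zero
err_shift_rot = equivariance_error(shifting, rotate90, x)       # positive
```

The shifting function commutes with translations but not with rotations, which illustrates why testing equivariance only against translations cannot detect a prediction that is shifted relative to its input.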
In practice we do not aim for the relation to hold exactly but instead propose to promote the property implicitly by modifying the default similarity loss given in Equation (1). The modification is similar to the one by Pielawski et al. (2020), although they use it in a different contrastive learning setting. We label the resulting loss the equivariance similarity loss (Equation (3)). Here $t$ is seen as a random variable sampled from some distribution. The loss can be zero only if $F$ is equivariant to all $t$ and inputs. Note that we first compose the deformations $G(F(x), \tilde{y})$ and $t^{-1}$ and after that deform the prediction $F(t^{*} x)$. This way we avoid multiple interpolations of the same image. The same strategy of composing the deformations first is always used in this paper when applying multiple deformations to an image.
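A form of the equivariance similarity loss that matches this description is sketched below. The norm is left unspecified and the exact arguments of $G$ follow the text above; the expression should be read as a reconstruction under these assumptions rather than as the definitive definition.

```latex
\mathcal{L}_{\text{eq-sim}} :=
  \mathbb{E}_{t} \left\|
    \Big( t^{-1} \circ G\big(F(x^{(i)}), \tilde{y}^{(i)}\big) \Big)^{*}
    F\big(t^{*} x^{(i)}\big) - \tilde{y}^{(i)}
  \right\|
```

To see why the loss can vanish for an equivariant $F$, suppose $F(x^{(i)}) = y^{(i)}$, Equation (2) holds, and $G$ predicts $d^{(i)}$. Then $F(t^{*} x^{(i)}) = t^{*} y^{(i)} = y^{(i)} \circ t$, and the pullback by the composed deformation gives $y^{(i)} \circ t \circ t^{-1} \circ d^{(i)} = y^{(i)} \circ d^{(i)} = \tilde{y}^{(i)}$.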
The loss in Equation (3) also acts as a data augmentation method by randomly transforming the inputs fed into the network $F$. However, in contrast to traditional data augmentation, we do not transform the target image with the same deformation as the input.
Optionally, one can promote the equivariance more directly by training with an additional objective which we label the commutation loss (Equation (4)). When the commutation loss is used, the equivariance similarity loss is optional and the default similarity loss can be used as well. Without the commutation loss, the equivariance similarity loss is needed. As a result we have three possible configurations.
Arar et al. (2020) also encourage deformation equivariance, but only for deformations predicted by their registration network. That is not always enough, e.g., with perfectly aligned training data the network $F$ could still introduce any translation, which the registration network could compensate for since translations commute. Also, with subtle systematic deformations the network $F$ might easily overlearn the deformations from the data, resulting in the registration network predicting zero deformation. The limitations of their method are also discussed in the supplementary materials of their paper, where they conclude that their method works only with relatively small image synthesis networks. That is not a large problem in their work, which focuses on image registration, but it is a significant limitation in difficult cross-modality image synthesis tasks. Later, in Figure 10, we visualize a situation where our method succeeds in producing spatially aligned output but the method by Arar et al. (2020) (NeMAR) fails.
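Since the commutation loss is meant to promote Equation (2) directly, one natural form penalizes the difference between the two sides of that equation for randomly sampled deformations $t$. The sketch below assumes an unspecified norm and does not indicate whether gradients are halted on either side; both details are assumptions.

```latex
\mathcal{L}_{\text{com}} :=
  \mathbb{E}_{t} \left\| t^{*} F(x^{(i)}) - F\big(t^{*} x^{(i)}\big) \right\|
```

This expression is zero exactly when Equation (2) holds for all sampled $t$.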
The core architecture presented so far is visualized in Figure 2, and is in itself trainable. However, in addition to the core architecture, we will be looking at adding a conditional adversarial loss for training the model to improve the prediction quality further. In order for the adversarial training to converge even in the presence of systematically different geometries between the input and target domains, it turns out that we require two registration networks: one for registering targets to inputs and the other for registering predictions to possibly imperfectly registered targets.
Before advancing further, we introduce an additional notation. In case a variable should be treated as a constant from the optimization point of view even if it is an output of a neural network, we overline the variable, e.g., $\overline{x^{(i)}}$ vs. $x^{(i)}$. In the neural network context, this means halting the backward pass during back-propagation.

Selecting a Set of Simulated Deformations
The equivariance similarity loss and the commutation loss require some way to simulate anatomically realistic deformations for calculating them. For promoting equivariance, an ideal set would be the set of all anatomically realistic deformations for each sample, but anatomically realistic non-affine deformations are very difficult to simulate. A natural question is then whether a significantly smaller set of deformations would be sufficient, especially considering the inductive biases of the used architectures.
Given that $F$ is a convolutional network, it is (roughly) translation equivariant. Hence, intuitively, simulating only globally affine deformations might be enough, since any diffeomorphic deformation is "locally affine". Also, affine deformations are easy to sample and the same distribution of deformations can be used for each sample. Applying affine deformations to images is also computationally efficient and numerically accurate. For these reasons we used only affine transformations in our experiments. It is left for future work to test whether elastic deformations could increase the performance further.
Affine transformations can be further divided into translation, rotation, scaling, shearing, and flipping. Translations and rotations should be reasonable to use in any practical situation, and flipping can be used when it does not affect the distribution of the imaged anatomy. Scaling and shearing are more difficult, since anatomically stretching a tissue could in principle affect its appearance under different imaging techniques. In other words, even in an ideal situation the deformation equivariance property would no longer hold exactly. In the experiments we conduct two studies on two data sets on using different types of affine transformations.
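As a sketch of how such affine deformations could be sampled, the following NumPy example composes the five components into a single 3x3 homogeneous matrix for the two-dimensional case. The function name `sample_affine`, the parameter ranges, the composition order, and the use of normalized coordinates are illustrative assumptions; the distributions used in the experiments are not reproduced here.

```python
import numpy as np

def sample_affine(rng, max_shift=0.1, max_angle=np.pi / 12,
                  max_log_scale=0.1, max_shear=0.1, allow_flip=True):
    """Sample a random 2D affine transformation as a 3x3 homogeneous matrix.

    All ranges are illustrative assumptions, not values from the paper.
    """
    tx, ty = rng.uniform(-max_shift, max_shift, size=2)
    angle = rng.uniform(-max_angle, max_angle)
    sx, sy = np.exp(rng.uniform(-max_log_scale, max_log_scale, size=2))
    shear = rng.uniform(-max_shear, max_shear)
    flip = -1.0 if (allow_flip and rng.random() < 0.5) else 1.0

    translation = np.array([[1, 0, tx], [0, 1, ty], [0, 0, 1]], dtype=float)
    rotation = np.array([[np.cos(angle), -np.sin(angle), 0],
                         [np.sin(angle), np.cos(angle), 0],
                         [0, 0, 1]])
    scaling = np.diag([sx, sy, 1.0])
    shearing = np.array([[1, shear, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
    flipping = np.diag([flip, 1.0, 1.0])
    # Matrix products compose the components into one deformation.
    return translation @ rotation @ scaling @ shearing @ flipping

rng = np.random.default_rng(0)
t = sample_affine(rng)
t_inv = np.linalg.inv(t)  # affine maps can be inverted exactly
```

Because affine maps are represented by matrices, both the inverse and compositions with other affine maps are available in closed form, which is part of what makes them numerically accurate to apply.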

Adversarial Training
In addition to the losses presented so far, we want to incorporate an adversarial loss into the training in order to improve the appearance and also the clinical quality of the predictions. In adversarial training, an additional discriminator network is trained to classify whether an image fed to the network is real or fake, and it can be used for guiding the generator network responsible for generating synthetic images. We want to employ a conditional adversarial training setting, similar to pix2pix (Isola et al., 2017), wherein the input image is also fed to the discriminator. This is different from the approach taken by Kong et al. (2021), where they feed only the prediction or the target. Conditioning the discriminator on input images has potential for better prediction quality (Isola et al., 2017).
Let now $D$ be the discriminator network receiving an input image as the first argument and either a target or a prediction as the second argument. The conditional adversarial learning objective is defined over these two arguments. The discriminator is trained to maximize the loss and the generator is trained to minimize it, with the training executed in turns while holding the weights of the other network constant. Placeholder texts are used for the arguments, as we are yet to derive the optimal way to feed the data to the discriminator. Misaligned training data will require care in how that is done. To be more precise, the following three points need to be taken into account:
1. Predictions and targets have to be fed in the same coordinate system, since the input domain and the target domain might have systematic geometric differences which would encourage the predictions to be misaligned with the inputs.
2. Inputs and their corresponding predictions cannot be fed in exactly the same coordinate system, since even if the predictions are registered to the targets or vice versa, the targets will not be exactly aligned with the inputs, especially in the beginning of the training. That would encourage misaligned predictions.
3. Interpolation acts as a low-pass filter, especially in areas where the image is stretched. As a result, predictions registered to targets cannot be directly compared with the targets, as the discriminator will learn to notice the missing high frequencies.
Points 1) and 2) would suggest feeding targets and predictions in the target coordinates, but 3) would suggest that we cannot deform the predictions to the targets either. As a solution we propose to train two separate registration networks: one for cross-modality registration of targets to inputs and one for intra-modality registration of predictions to possibly imperfectly registered targets. The adversarial comparison can then be done between the registered targets and the predictions registered to the registered targets. The proposed approach solves all three problems mentioned above:
1) The comparison will be done in the same coordinate system.
2) If the target and the input are imperfectly aligned, the prediction can still be separately registered to the registered target, hence removing the incentive for misaligned predictions.
3) The predictions registered to the registered targets can contain high frequency information to at least a similar extent as their counterparts, the registered targets, since as the training progresses the registration of the targets to the inputs should account for most of the registration movement.
Let us now denote the predicted cross-modality deformation (approximately) mapping coordinates of $\tilde{y}^{(i)}$ to $x^{(i)}$ as $d^{(i)}_{\text{cross}}$, and the predicted intra-modality deformation (approximately) mapping coordinates of $F(x^{(i)})$ to $d^{(i)*}_{\text{cross}} \tilde{y}^{(i)}$ as $d^{(i)}_{\text{intra}}$. We will look at how these are obtained in Section 4.3.
From the adversarial loss perspective, we want to treat $d^{(i)}_{\text{cross}}$ as constant, since the cross-modality registration would not benefit from the adversarial loss and might even result in unexpected optima. This approach corresponds to the normal GAN training where only the second term is used in updating the generator. The proposed adversarial loss is therefore computed between the registered targets and the predictions registered to the registered targets. The loss function can be improved even further by employing a similar idea to the equivariance similarity loss. To simultaneously prevent the discriminator from over-fitting and implicitly promote equivariance to a set of deformations, we propose to further modify the loss into a form which we label the equivariance adversarial loss. Here, the same deformations $t$ can be used as for the equivariance similarity loss and the commutation loss. The core idea is to augment the inputs to the discriminator with the deformations $t$ while taking into account the unaligned nature of the training data.

Registration Architecture
As discussed, the registration will be divided into cross-modality registration for registering targets to inputs and intra-modality registration for registering predictions to the registered targets. While the cross-modality registration receives pairs of different modality as inputs, it is trained with an intra-modality loss based on the synthesized image $F(x^{(i)})$, similarly to the intra-modality registration.
In principle, the registration networks can predict the deformation in any suitable form. We split the cross-modality registration into rigid registration and elastic registration. The two-stage architecture makes it significantly easier for the model to handle large deformations. For intra-modality registration, we do not use the two-stage architecture, as the cross-modality registration should take care of most of the registration movement.
We generate elastic deformations from stationary velocity fields to promote diffeomorphic deformations and to allow inverting the deformations. From a stationary velocity field, the final diffeomorphic deformation is obtained by integrating the field over itself over a unit time. In group theory, this can be seen as exponentiation of a member of a Lie algebra (Arsigny et al., 2006), and hence we denote the integration by exp. Exponentiation can be estimated efficiently by the scaling and squaring method (Arsigny et al., 2006; Dalca et al., 2018). The velocity fields are predicted in the same resolution as the images.
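To make the scaling and squaring idea concrete, the following is a minimal one-dimensional NumPy sketch. It is an illustration only: the function name `exp_velocity_1d`, the number of squaring steps, and the use of `np.interp` (which clamps at the grid boundary) are assumptions, and our networks operate on two- or three-dimensional fields.

```python
import numpy as np

def exp_velocity_1d(velocity, grid, steps=7):
    """Approximate the displacement of exp(v) for a 1D stationary velocity field.

    velocity: samples of v on `grid` (increasing 1D coordinates).
    Returns the displacement u with exp(v)(x) = x + u(x).
    """
    # Scaling: start from the small deformation x + v(x) / 2**steps.
    displacement = velocity / (2 ** steps)
    # Squaring: compose the deformation with itself `steps` times,
    # phi <- phi o phi, i.e. u(x) <- u(x) + u(x + u(x)).
    for _ in range(steps):
        displacement = displacement + np.interp(grid + displacement, grid, displacement)
    return displacement

grid = np.linspace(-1.0, 1.0, 201)
velocity = -0.5 * grid                      # v(x) = a x has the exact flow x * exp(a)
forward = exp_velocity_1d(velocity, grid)   # displacement of exp(v)
inverse = exp_velocity_1d(-velocity, grid)  # displacement of exp(-v)
```

Negating the velocity field yields an approximation of the inverse deformation, which is the property that makes stationary velocity fields convenient when both directions of a deformation are needed. In this example the inverse is only accurate away from the boundary, because the expanding deformation samples outside the grid near the edges.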

Cross-Modality Registration
Let the neural network predicting the rigid deformation for cross-modality registration be $H_{\text{rig}}$ and the neural network predicting the stationary velocity field for elastic cross-modality registration be $H_{\text{svf}}$. Both networks $H_{\text{rig}}$ and $H_{\text{svf}}$ could in principle be trained with a single loss. However, it is possible that the rigid registration network could first unnecessarily shift the target image and the elastic registration network could then shift the image back, and to prevent this we add a separate rigid registration loss.
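The overall predicted cross-modality deformation $d^{(i)}_{\text{cross}}$ is then obtained by combining the rigid and elastic components, where $r^{(i)}_{\text{cross}} := H_{\text{rig}}(x^{(i)}, \tilde{y}^{(i)})$ and $v^{(i)}_{\text{cross}} := H_{\text{svf}}(x^{(i)}, \tilde{y}^{(i)})$. Halting the gradients for $r^{(i)}_{\text{cross}}$ is not necessary but makes loss function balancing more straightforward by separating the rigid registration altogether.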
We train the rigid registration network $H_{\text{rig}}$ and the elastic registration network $H_{\text{svf}}$ each with its own similarity loss.
Figure (caption partially recovered): $y^{(i)}_{\text{reg}}$ and $v^{(i)}_{\text{cross}}$ are forwarded for intra-modality registration. Regularization is applied to both inverse and forward elastic deformation, which is not explicitly shown here. The images are from the synthetic "multimodal" data sets built using the COCO (Lin et al., 2014) data set.
We halt the gradients of the backward pass through $F(x^{(i)})$, as we do not want the imperfectly registered target image, especially after only rigid cross-modality registration, to affect the image synthesis network.
Additionally, we need to regularize the deformation. The regularization term can be applied only to the elastic component, as we do not penalize rigid deformations. We use the non-rigidity penalty by Staring et al. (2007), denoted Rig, and apply it to both the inverse and forward deformations. Details of the regularization used can be found in the supplementary materials.

Intra-Modality Registration
The intra-modality registration network receives the triplets $(F(x^{(i)}), y^{(i)}_{\text{reg}}, \exp(-v^{(i)}_{\text{cross}}) - I)$ as inputs, where $I$ is the identity mapping. The third argument represents the displacement field of the inverse elastic deformation with which $y^{(i)}_{\text{reg}}$ has been deformed, and allows the network to optimize the regularity of the concatenated overall deformation, as we regularize based on that.
Outputs of the cross-modality registration stage are treated as constants by the intra-modality registration losses. In this way we prevent the networks from finding undesired optima where, e.g., difficult-to-synthesize regions are made smaller.
Let now the function predicting the stationary velocity field for intra-modality registration be $G_{\text{svf}}$. Then, the predicted intra-modality deformation $d^{(i)}_{\text{intra}}$ is obtained by exponentiating the negated velocity field predicted by $G_{\text{svf}}$. Here, we use the negative sign for the velocity field to emphasize that the direction is different from the cross-modality registration. As the training progresses and cross-modality registration and cross-modality image synthesis improve, $d^{(i)}_{\text{intra}}$ should approach the identity mapping.
The loss function for the intra-modality registration also guides the cross-modality image synthesis. Hence, we use a deformation equivariance encouraging loss function $L^{\text{intra}}_{\text{eq-sim}}$ following Equation (3). We also experiment with a setting where it is replaced with the default similarity loss $L^{\text{intra}}_{\text{def-sim}}$ following Equation (1). For regularization, we again use the concatenated overall elastic deformation in both directions. Using the concatenated overall deformation is logical, as that is the deformation we are actually trying to learn and hence regularize.
Having the separate intra-modality registration network in addition to the cross-modality registration allows the prediction from the image synthesis network to directly affect the predicted deformation. As a result, the cross-modality image synthesis network is efficiently optimized for generating predictions with the lowest deformation regularization loss, which can be seen as a meaningful selection criterion among all the possible geometry preserving versions of the synthesized image.

Masking
Throughout the architecture, images are resampled by deformations, but the sampled locations might be outside the image. We connect each image with a mask that can initially represent invalid regions in the image. Each time an image is resampled, the mask is updated with the regions resampled from outside the image. In similarity losses, we then compare images only within intersections of the masks. Masks contain discrete values and no gradients flow through the masks during the backward pass, preventing the optimization of the masks themselves. The same procedure is also applied to the deformations, as they are also resampled by other deformations.
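The comparison within the intersection of the masks can be sketched as follows with NumPy. The function name `masked_l1`, the boolean mask representation, and the choice to return zero for an empty intersection are assumptions for illustration.

```python
import numpy as np

def masked_l1(prediction, target, prediction_mask, target_mask):
    """Mean absolute error over pixels that are valid in both images."""
    valid = prediction_mask & target_mask   # intersection of the masks
    if not valid.any():
        return 0.0                          # nothing to compare (assumed convention)
    return float(np.abs(prediction - target)[valid].mean())

prediction = np.ones((2, 3))
target = np.array([[1.0, 2.0, 5.0],
                   [1.0, 0.0, 5.0]])
prediction_mask = np.ones((2, 3), dtype=bool)
# The last column of the target was resampled from outside the image.
target_mask = np.array([[True, True, False],
                        [True, True, False]])

loss = masked_l1(prediction, target, prediction_mask, target_mask)
```

In this example the large differences in the invalid last column do not contribute to the loss, which is computed over the four remaining pixels only.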
Invalid region masks are also fed to the registration networks, since for invalid regions only the regularization should affect the generated deformation.
Additionally, the masks of the registered targets and the predictions registered to the registered targets might be systematically different, which the discriminator could use for separating the images. We mitigate that by multiplying each image fed to the discriminator by the intersection of the masks of the images compared.

Overall Loss Function
In the overall loss function, $L^{\text{intra}}_{\text{sim}}$ is either $L^{\text{intra}}_{\text{eq-sim}}$ or $L^{\text{intra}}_{\text{def-sim}}$, and $\lambda, \gamma, \delta \in \mathbb{R}$ are loss function weights. Each loss component affects only the weights of the sub-networks written in the curly braces. Note that $D$ is trained to maximize the loss whereas the other networks are trained to minimize it. We use two optimizers, one for the discriminator and one for all the other components. This is different from the works by Arar et al. (2020) and Kong et al. (2021), which use a separate optimizer for the registration network.
Actual values used for the loss function weights are given in the supplementary materials.

Experiments
To evaluate our method, we conduct experiments on four diverse data sets, of which two are real world medical imaging data sets, one is a semi-synthetic data set, and one is a synthetically constructed "multi-modal" data set. We perform an ablation study of the losses proposed and compare the method against multiple baselines. Additionally, we evaluate on two data sets the performance when using different distributions of affine transformations with the commutation loss. The aim of the experiments is to establish to a reasonable extent:
1. The performance of our method against earlier cross-modality image synthesis methods which are trainable on non-aligned data.
2. The performance of our method against the standard pipeline where the image pairs are registered before training, assuming that no significant manual effort is put into registering the images.
3. Which types of affine transformations are the most suitable for the equivariance encouraging losses and whether the choice has a large effect on the performance.
On the two real world data sets we use clinically relevant metrics for establishing the best performance.

Ablation Study
We will use the following naming conventions to reflect loss terms to be included in different experimental configurations:
• EqSim: The equivariance similarity loss from Equation (11) was included.
• Com: The commutation loss from Equation (4) was included.
• DefUncondAdv: The default unconditional adversarial loss from pix2pix (Isola et al., 2017) defined directly between unmodified predictions and targets was included.
• NoReg: Only the cross-modality image synthesis component F with L1 similarity loss directly between predictions and unmodified training targets was included.
• Aug: Traditional data augmentation was used for each training input using the same distribution of deformations as what would have been used for the equivariance similarity loss, the commutation loss, and the equivariance adversarial loss.
The cross-modality registration related similarity loss and both of the regularization losses were used in all the trainings except with NoReg setup.
In Section 4 three variants of our developed method were proposed: EqSim, DefSim + Com, and EqSim + Com. Optionally, EqAdv can be combined with any of them. Training with any of the variants should result in a stable convergence and the experiments aim at measuring their relative performance.
• RegGAN: The method proposed by Kong et al. (2021), trainable on paired unaligned data. We use the NICEGAN (Chen et al., 2020b) variant which uses L2 adversarial loss as it performed the best.
• NeMAR: The method proposed by Arar et al. (2020), trainable on paired unaligned data, originally suggested for image registration.
We use the official implementations and modify them for our data sets.
For pix2pix and NeMAR we additionally train variants which use the same losses as the official implementations but our components and optimizers. We denote these variants by adding "our components" in parentheses after the method name. For details, see Section S.IV of the supplementary materials. Note also that the method DefSim + DefUncondAdv + Aug corresponds to training our architecture with losses similar to RegGAN, although RegGAN uses a separate optimizer for the registration network.
Our proposed equivariance losses additionally act as data augmentation.To ensure that the reason for our methods performing better is not simply the effect of seeing more data, we augment the inputs for the baseline methods with the same distribution of deformations as is used for the equivariance similarity loss, the commutation loss, and the equivariance adversarial loss.

Synthetic
We performed an ablation study on very simple synthetic "multimodal" data sets created using images from the COCO data set (Lin et al., 2014) with unmodified images as input images. Target images were generated by circularly swapping the RGB color channels of the input images and by deforming them with simulated deformations. The simulated deformations were generated by a composition of rotation, translation, and an elastic deformation component generated by exponentiation of a stationary velocity field defined by parameters µ, σ, m ∈ R^n using the formula where x is the spatial coordinate and i ∈ {1, ..., n} is the dimension (n = 2 or 3). Four data sets were generated: LR (Large Random), SR (Small Random), LC (Large Constant), and SC (Small Constant). The deformation parameters used are displayed in Table 1. We centrally cropped all the images to resolution (400, 400) to avoid the need to extrapolate values from outside the original image when synthetically deforming the images; for details see Section S.V of the supplementary materials. Training, validation, and test sets all contained 4113 images. With this data set, all the experiments were conducted without the adversarial loss to study separately the effects of the deformation equivariance encouraging losses. The six models trained using each of the data sets are listed in Table 2.
Compositions of the following transformations were used as simulated deformations for the loss functions and data augmentation:
1. Rotations in range (−15°, 15°),
2. Orthogonal rotations of either 0°, 90°, 180°, or 270°,
3. Random flips over any axis.
No model was trained with aligned data as it would be easily learned perfectly in this setup.
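As an illustration of the transformation distribution listed above, the following is a minimal 2D sketch. The uniform sampling of the angle, the independent flip probability of 0.5 per axis, and the function name `sample_rigid_transform` are assumptions made for the example and are not specified in the text.

```python
import math
import random


def sample_rigid_transform(rng):
    """Sample a 2x2 matrix: small rotation, orthogonal rotation, random flips."""
    small = math.radians(rng.uniform(-15.0, 15.0))        # rotation in (-15, 15) degrees
    ortho = math.radians(rng.choice([0, 90, 180, 270]))   # orthogonal rotation
    angle = small + ortho  # 2D rotations compose by adding their angles
    c, s = math.cos(angle), math.sin(angle)
    matrix = [[c, -s], [s, c]]
    # Assumption: each axis is flipped independently with probability 0.5.
    # Negating a row mirrors the corresponding output coordinate.
    for axis in range(2):
        if rng.random() < 0.5:
            matrix[axis] = [-value for value in matrix[axis]]
    return matrix
```

Every sampled matrix is orthogonal with determinant +1 or −1, so the composition never scales or shears the image.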

Semi-Synthetic Cross-Modality Brain MRI Synthesis
In recent years, a significant amount of research has emerged on applying deep learning to cross-modality brain MRI synthesis. Synthetically generated modalities have many possible downstream use cases such as segmentation, classification, detection and diagnosis (Xie et al., 2022). We used brain images from Information eXtraction
To generate the unaligned data set, we applied simulated deformations to the target images. The deformations were generated by a composition of rotation, translation, and an elastic deformation component. Translations were sampled from range (2.0 mm, 10.0 mm), rotations from range (2.0°, 10.0°), and for elastic deformations we sampled white noise with a mean of 10 mm and a standard deviation of 200 mm followed by Gaussian smoothing with a standard deviation of 10 mm. The distribution is intentionally skewed to make the non-desired outcome of over-learning the deformation already in F more attractive. We refer to the unaligned data set as "unaligned".
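The elastic component described above can be sketched in one dimension as follows. This is only an illustration under stated assumptions: the white noise is taken to be Gaussian, the grid spacing is taken to be 1 mm, the kernel is truncated at three standard deviations and renormalized at the borders, and all function names are hypothetical.

```python
import math
import random


def gaussian_kernel(sigma, radius):
    """Normalized, truncated Gaussian kernel."""
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(signal, sigma):
    """Gaussian smoothing with renormalization near the borders."""
    radius = int(3 * sigma)
    kernel = gaussian_kernel(sigma, radius)
    n = len(signal)
    smoothed = []
    for i in range(n):
        acc, norm = 0.0, 0.0
        for k, w in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < n:
                acc += w * signal[j]
                norm += w
        smoothed.append(acc / norm)
    return smoothed


def sample_elastic_displacement(n_points, rng, mean=10.0, std=200.0, smooth_std=10.0):
    """Displacements in mm: white noise (mean 10, std 200) smoothed with std 10."""
    noise = [rng.gauss(mean, std) for _ in range(n_points)]
    return noise, smooth(noise, smooth_std)
```

Because the noise mean is non-zero, the smoothed field is biased in one direction, which is what makes the distribution skewed.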
We re-register the synthetically deformed images using the popular deformable registration method elastix (Klein et al., 2009; Shamonin et al., 2014) to compare our method with the standard approach of using registration as a pre-processing step. We refer to this data set as "registered".
Additionally we train an oracle model with the original aligned data set and refer to that data set as "aligned".For the oracle we use the same generator architecture as for our other methods.Its performance should provide a good upper bound on the performance of our methods.
For our methods we use 3D models and train them by sampling random image patches of size (64, 64, 64) from the whole training data set. For 2D baseline models we used randomly sampled axial slices. Additionally, the inputs were augmented with low-amplitude noise during the training. We also conducted a small experiment on the validation set to determine which types of simulated affine deformations are the most suitable for this problem. The experiment included translation, rotation, scaling and shearing. Flipping was not considered since the human anatomy is not symmetric.

Virtual Histopathology Staining
Virtual histopathology staining using deep learning has emerged as an active research topic in recent years, and has been primarily driven by GAN-based methods (Bayramoglu et al., 2017; Rivenson et al., 2019; Rana et al., 2020; Koivukoski et al., 2023). However, a majority of the methods require elastic registration of inputs and targets. Our method simplifies the data pre-processing by eliminating the need to elastically register image pairs explicitly.
We used a public data set containing unstained and stained tissue whole slide image (WSI) pairs (Khan et al., 2023). These are essentially ultra high resolution gigapixel images, and virtually staining the unstained tissue WSIs is a highly non-trivial task. Preclinical murine prostate tissue samples were prepared at the University of Eastern Finland, Kuopio. Material used was surplus tissue from previous studies (Latonen et al., 2017; Valkonen et al., 2017) where all animal experimentation and care procedures were carried out in accordance with guidelines and regulations of the national Animal Experiment Board of Finland, and were approved by the board of laboratory animal work of the State Provincial Offices of South Finland (licence number ESAVI/6271/04.10.03/2011). The tissue samples were first scanned without staining. This was followed by hematoxylin and eosin (H&E) staining of the unstained tissue samples, and then the stained samples were scanned again. The samples were scanned using a Thunder Imager 3D Tissue slide scanner (Leica Microsystems, Wetzlar, Germany) equipped with a DMC2900 camera at 40X magnification level with a pixel size of 0.353 µm. A total of 17 WSI pairs were included in the data set, each with a resolution of approximately 40k × 40k, of which 9 were used for training, 1 for validation, and 7 for testing.
Inputs and targets were coarsely registered and the alignment seems superficially good. However, upon closer inspection clear misalignments are present. We additionally registered the images using an open source cross-modality whole slide image registration tool called wsireg. The WSI pairs were registered in two steps, first rigidly for global alignment and then elastically for more granular correspondence between the modalities. We refer to the coarsely registered original data set as "unaligned" and to the more finely registered data set as "registered". No oracle model was trained as we did not have ground truth registrations for this data set.
All of the models were trained by sampling random image patches of size 512 × 512 from the whole training data set. Additionally, the inputs were augmented with low-amplitude noise during the training. On this data set we used the same set of transformations for the equivariance encouraging loss functions as was used in the synthetic experiment. We did not consider scaling or shearing for this data set as the misalignments on a single patch level are essentially rigid. Flipping was included as on the microscopic level the distribution should not be affected by that.

Head MRI to CT synthesis
Pseudo CT images are CT-like images generated from MRI images and are mostly used for replacing CT images in external beam radiation therapy (EBRT). Both predicting correct CT-values and geometrical accuracy of the generated images are important. In recent years, deep learning has emerged as a strong option for pseudo CT generation (Owrangi et al., 2018). For the experiments we used the CERMEP-IDB-MRXFDG data set, which is freely available for research use (Mérida et al., 2021). The data set consists of 37 rigidly registered CT and T1 MRI head scan pairs. We resampled all the images to the MRI resolution of 160 × 192 × 192. Pre-processing of the T1 images was similar to the one done for the cross-modality MRI synthesis data set, as we applied N4 bias correction (Tustison et al., 2010) using Advanced Normalization Tools (ANTs) software (Avants et al., 2009) and normalized the brain white matter to the mean value of one using the implementation by Reinhold et al. (2019) together with the implementation by Iglesias et al. (2011) for brain mask extraction. We additionally removed any external objects from the CT images using a series of morphological operations. While the skull and brain regions are relatively rigid, the image volumes extend to the neck region with significant registration mismatches, as can be seen in Figure 5. We divided the data set into 20 cases for training, 5 for validation, and 12 for testing.
We refer to the default data set as "unaligned".We additionally registered the images elastically using elastix (Klein et al., 2009;Shamonin et al., 2014) with the hyperparameters by Leibfarth et al. (2013) and refer to the data set as "registered".
The training setup was very similar to that of the cross-modality MRI synthesis experiment. We used 3D models and trained them by sampling random image patches of size (64, 64, 64) from the whole training data set, and for 2D baseline models we used randomly sampled axial slices. Additionally, the inputs were augmented with low-amplitude noise during the training. As in the cross-modality MRI synthesis experiment, we conducted an experiment on the validation set to determine which types of simulated affine deformations are the most suitable. Flipping was again not considered since the human anatomy is not symmetric.

Evaluation
For all the experiments we measure structural similarity index (SSIM) (Wang et al., 2004), peak signal-to-noise ratio (PSNR), and normalized mutual information (NMI) (Studholme et al., 1999). NMI was applied between inputs and predictions as opposed to inputs and aligned targets as it is used here for measuring the geometric similarity of the predictions to the corresponding inputs. A detailed description of the pixel-wise metrics is given in the supplementary materials.
Additionally, we measured the visual appearance of the synthesised images with the Fréchet inception distance (FID) metric (Heusel et al., 2017; Seitzer, 2020). While the visual appearance of the images is not usually clinically relevant, we still considered the comparison to be interesting enough to be included. For 3D data sets we computed FID over image slices over all three axes.
Evaluation of the virtual staining and pseudo CT experiments required more careful approaches described in the subsections 5.4.1 and 5.4.2.

Virtual Histopathology Staining Evaluation
We computed the pixel-wise metrics for the virtual staining data set separately for each predicted batch. As no ground truth was available, we affinely registered each stained patch to the corresponding unstained patch using Advanced Normalization Tools (ANTs) software (Avants et al., 2008, 2009). We did not directly use the registered data set for computing the pixel-wise metrics, to prevent the models trained with that data set from benefiting too much by being able to learn the exact registration dynamics (including possible systematic registration errors).
To further evaluate the quality of the virtually stained images, we conducted a comparative analysis of the virtual staining methods for nuclei reproducibility, a downstream validation approach similar to that of Khan et al. (2023). For that analysis we used the images from the registered data set as the ground truth. We employed the nuclei detection method by Valkonen et al. (2020) to output nuclei center coordinates in all the WSIs in the test set. First, nuclei were detected for the ground truth, followed by nuclei detections in the virtually stained WSIs generated by each of the compared methods. F1-scores were computed to compare the detected nucleus coordinates of all outputs against those of the ground truth WSIs, using Euclidean distance with a tolerance of 5 µm radius derived experimentally and through prior knowledge of typical nucleus dimensions (Lammerding, 2011; Valkonen et al., 2020). We defined a true positive as a nucleus center in the virtually stained WSI for which there was a corresponding nucleus center in the ground truth WSI within the 5 µm radius. A false positive was defined as a nucleus center in the virtually stained WSI for which there was no corresponding nucleus center in the ground truth WSI within the 5 µm radius. False negatives were nuclei centers in the ground truth WSI for which there were no matches in the virtually stained WSI within the 5 µm radius.
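The definitions above can be turned into a small sketch. It assumes a greedy one-to-one matching in which each ground truth nucleus can be matched at most once; the matching strategy is not specified in the text, so this choice, like the function names, is an assumption made for illustration. Coordinates are in µm.

```python
import math


def match_nuclei(predicted, reference, tolerance=5.0):
    """Count true positives, false positives and false negatives."""
    unmatched = list(reference)
    true_positives = 0
    for center in predicted:
        best_index, best_distance = None, tolerance
        for index, candidate in enumerate(unmatched):
            distance = math.dist(center, candidate)  # Euclidean distance
            if distance <= best_distance:
                best_index, best_distance = index, distance
        if best_index is not None:
            # Assumption: a ground truth nucleus is consumed once matched.
            unmatched.pop(best_index)
            true_positives += 1
    false_positives = len(predicted) - true_positives
    false_negatives = len(unmatched)
    return true_positives, false_positives, false_negatives


def f1_score(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN), defined as 0 when there is nothing to match."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0
```

For example, with three predicted and four ground truth centers of which two pairs lie within the tolerance, the counts are TP = 2, FP = 1, FN = 2, giving F1 = 4/7.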

Head MRI to CT Synthesis Evaluation
For pseudo CT evaluation we additionally computed pixel-wise mean absolute error (MAE) and mean error (ME) as they are very widely used for pseudo CT evaluation and are better predictors of the resulting radiation dose differences than SSIM or PSNR (Boulanger et al., 2021). ME refers to the mean signed error over the data set, and can also be negative. ME largely ignores geometrical misalignments between inputs and predictions but on the other hand is robust to registration errors between inputs and targets used for evaluation.
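The difference between the two metrics is easy to see in a minimal sketch. The sign convention (prediction minus target) and the function name `mae_and_me` are assumptions made for the example.

```python
def mae_and_me(prediction, target):
    """Return (mean absolute error, mean signed error) over paired values."""
    # Assumed sign convention: positive error means the prediction is too high.
    errors = [p - t for p, t in zip(prediction, target)]
    mae = sum(abs(e) for e in errors) / len(errors)
    me = sum(errors) / len(errors)
    return mae, me
```

Errors of opposite sign cancel in ME but not in MAE, which is why ME can stay near zero for a prediction that is locally wrong, for instance because of a small geometric shift.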
We concluded the registered data set to be inadequate for accurate evaluation and had to develop a more nuanced approach for registering the images, although still using the elastix software (Klein et al., 2009; Shamonin et al., 2014). We noticed that the registration results improved by registering only part of the image at a time, probably since that way the rigid registration phase was able to account for a larger part of the total deformation. We ended up randomly sampling 20 masks with a radius of 10 centimeters from each image and registered the image pairs over each of the masks. The evaluation metrics were computed over all the registrations with Gaussian weighting such that the highest weight was given to the coordinates at the center of the registration mask. Additionally, we generated bone masks from the images by thresholding and applied the non-rigidity penalty (Staring et al., 2007) over those regions, improving the bone registration. That allowed us to use a lower regularization value for the soft tissue regions, improving the registration for those regions as well. We also manually masked out any regions with clear artefacts from the images. With these changes the quality of the registrations improved significantly based on visual evaluation. However, some registration errors will always remain, which has to be taken into account in interpreting the results. To mitigate registration errors in the body outline, which easily end up dominating the metrics, we constructed body masks for both the MRI and CT images using morphological operations and ignored in the evaluation the regions where the body masks did not match.

Synthetic
Results for the synthetic data set experiment can be seen in Table 2.
The models using the deformation equivariance encouraging losses systematically outperformed the models not using them. Four out of eight trainings with the registration component but without the deformation equivariance encouraging losses did not converge at all to a meaningful optimum. An example prediction of such a training is shown in Figure 6. The performance of the models without the deformation equivariance losses varied a lot and in a few cases the performance was even quite good. However, when using either of the deformation equivariance losses the trainings always converged robustly to a meaningful optimum. Models using both the equivariance similarity loss and the commutation loss performed the best in terms of the similarity metrics. However, the models having only either the equivariance similarity loss or the commutation loss for encouraging deformation equivariance also performed very well, and it is questionable whether the differences in performance when using this kind of synthetic data set will be relevant in real world applications.

Cross-Modality Brain MRI synthesis
Results for the study on a validation set comparing different distributions of affine transformations for equivariance encouraging losses can be seen in Table 3. Based on the study we used a combination of all four transformation types in the main experiment, for which the results with a test set can be seen in Table 4.
All three proposed variants of our method performed very well compared to the oracle model trained with aligned data and outperformed all the baselines by a statistically significant margin on the voxel-wise metrics. NeMAR, which also encourages deformation equivariance, coupled with our 3D architecture is the only baseline trained on non-aligned data that came close to our method. RegGAN performed significantly worse and converged to produce severely misaligned predictions as visualized in Figure 8. The pix2pix model trained on the registered data set coupled with our 3D architecture also performed well but was still clearly behind our methods while surpassing all the other models trained on non-aligned data. CycleGAN was also unable to converge to a meaningful optimum due to the large misalignments. While rarely directly relevant in a clinical context, our method also performed very well in terms of the FID score.

Virtual Histopathology Staining
Results for the virtual histopathology staining experiment can be seen in Table 5.
Nucleus reproduction from unstained brightfield images to virtually stained H&E images is a particularly challenging task, as confirmed by the comparison of nucleus detection results between real and virtually stained images. This is in line with the conclusion from the earlier literature that the locations of all nuclei are simply not available in the unstained input images (Khan et al., 2023). Additionally, the data is not very uniform, which also affects the evaluation metrics as training and test distributions do not match.
In terms of F1-score, which can be considered the main metric, RegGAN performed better than the other methods with statistical significance, and two of our methods were close behind, outperforming other baselines. We suspect that RegGAN benefited from having a generator with 24 times more parameters than our generator (1.1 billion vs. 46 million). Also, with this data set the distributions of the desired predictions F(x(i)) and the training targets ỹ(i) are very close to each other, i.e. there are no systematic geometrical differences. In such settings the RegGAN can
Two of our configurations, EqSim + EqAdv and DefSim + Com + EqAdv, beat both pix2pix variants trained on registered data in terms of F1-score by a statistically significant margin (p-values 0.016 and 0.00020). The result is more significant for the pix2pix model with our components as it was trained with an architecture identical to our method. However, it is also noteworthy that the differences in loss function weightings affect the precision-recall balance, which in turn can affect the F1-score, e.g. our training of the vanilla pix2pix had a different loss function balance affecting the precision-recall balance compared to the one trained with our components. For our proposed variants it seems that increasing deformation equivariance increases precision at the cost of recall. The model EqSim + EqAdv, which did not have the commutation loss, was more inclined to guess nuclei whereas the two other models with the commutation loss placed them in more certain locations. We suspect this is due to the commutation loss more directly promoting deformation equivariance, which requires the shape of the nuclei to be known.

NeMAR adversarial training did not converge meaningfully with this data set due to the discriminator being easily able to distinguish between fake and real target images. We suspect that it was due to high frequency components present in this data set which the NeMAR architecture could not replicate for the discriminator due to the issues discussed in Section 4.2. The generator size of the unmodified NeMAR was also far too limited for the task.

Head MRI to CT Synthesis
Results for the study on the validation set comparing different distributions of affine transformations for equivariance encouraging losses can be seen in Table 6. Based on the study we used only translation and rotation in the main experiment, for which the results can be seen in Table 7. Note that the differences in metrics when using different affine transformation types are very small and might not be statistically significant.
Our proposed method performed very well in terms of the MAE, which can be considered the most important metric for pseudo CT generation due to the strongly linear relationship between CT values and radiation absorption in radiation therapy. DefSim + Com + EqAdv beat all of the baselines by a statistically significant margin (p-value 0.0024 compared to NeMAR with our components). EqSim + EqAdv also beat all of the baselines, but when using the threshold of 0.05 in p-value for statistical significance the difference in MAE is narrowly not significant (p-value 0.057). Encouraging too much equivariance seems to be detrimental for correct CT-value estimation since EqSim + Com + EqAdv performed slightly worse, although still not worse than any of the baselines.
While close to our method metric-wise, under visual inspection the images generated by NeMAR with our components contained more easily visible alignment mistakes than the images generated by our models. Probably the easiest mistake to notice was its inaccuracy in predicting the body outline at the neck region, where the data set contains the largest systematic deformation differences. An example of such a case is shown in Figure 10. More subtle mistakes included soft tissue boundaries being placed slightly off. Geometric accuracy of tissue boundaries is important as the pseudo CT images might also be used for positioning at the linear accelerator. The result is in line with the paper introducing NeMAR (Arar et al., 2020), as in the supplementary materials they conclude that their image-to-image translation network produces geometrically accurate results only when the image synthesis generator model is significantly smaller than the one used here.
FID values have to be looked at with caution since they were calculated with respect to the unaligned data set whose distribution differs from that of the desired unavailable aligned CT images.However, based on visual inspection the NeMAR model with our components indeed produced the most realistic looking texture.

Implementation
Implementation of our method in the PyTorch framework and all the evaluation implementations can be found at https://github.com/honkamj/non-aligned-i2i. The code base also contains all of the data pre-processing and allows for easily reproducing the results.

Conclusions
In this work, we have developed a generic method for training a network for cross-modality image synthesis with paired but misaligned training data by promoting equivariance with respect to simulated deformations. The method is applicable to a wider range of data sets than earlier methods and has the best overall performance across three different cross-modality image synthesis tasks. On

S.I. Loss Function Weighting
The following loss function weights were used for all the variations of our method:
• The similarity loss terms were given the weight 1.0.
• The commutation loss term was given the weight 1.0.
• The adversarial loss terms were given the weight 0.0001.
• The deformation regularization terms were given the weight 1.0 in the cross-modality brain MRI synthesis and MRI to CT synthesis experiments and 100.0 in the virtual staining experiment.In the synthetic COCO experiment we used the deformation regularization weight 0.1 but the internal non-rigidity penalty weights were also different which is discussed in Section S.VII.

S.II. Training Details
The Adam optimizer (Kingma and Ba, 2015) was used for training the models. In the synthetic data set experiment a learning rate of 2e-4 was used. In the other experiments a learning rate of 1e-4 was used, with a separate optimizer for the discriminator network with a learning rate of 4e-4.
We used the validation set for selecting the best epoch from the last six training epochs for each model. Selection was based on the L1 metric between the intra-modality registered prediction d(i)_intra ∗ F(x(i)) and the cross-modality registered training target ỹ(i). For models without the registration component the L1 metric between the prediction and the training label was used.
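The epoch selection rule can be sketched as follows, assuming the validation L1 metric has been stored per epoch in a list indexed by epoch; the function name `select_best_epoch` is hypothetical.

```python
def select_best_epoch(validation_l1, n_last=6):
    """Return the index of the lowest validation L1 among the last n_last epochs."""
    start = max(0, len(validation_l1) - n_last)
    candidates = range(start, len(validation_l1))
    return min(candidates, key=lambda epoch: validation_l1[epoch])
```

Earlier epochs are never selected, even if their validation metric happens to be lower.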
In experiments which used image patches for training, the patches were sampled randomly from the images.The training images contained invalid regions defined by masks and only patches containing a fully valid region in the input image were fed to the model.Our model can also handle partially invalid regions.However, because the baselines could not handle invalid regions, we used this training procedure for fair comparison.
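A minimal sketch of such patch sampling is given below. It is written for a 2D mask stored as nested lists and uses simple rejection sampling; the rejection strategy, the `max_tries` limit and the function name are assumptions made for the example, and the 3D case is analogous.

```python
import random


def sample_valid_patch_origin(mask, patch_size, rng, max_tries=1000):
    """Sample the top-left corner of a patch lying fully inside the valid mask."""
    height, width = len(mask), len(mask[0])
    patch_h, patch_w = patch_size
    if patch_h > height or patch_w > width:
        return None  # the patch does not fit into the image at all
    for _ in range(max_tries):
        row = rng.randint(0, height - patch_h)  # randint is inclusive
        col = rng.randint(0, width - patch_w)
        fully_valid = all(
            mask[row + i][col + j]
            for i in range(patch_h)
            for j in range(patch_w)
        )
        if fully_valid:
            return row, col
    return None  # no fully valid patch found within max_tries
```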

S.III. Network Architecture Details
For F, H_svf and G_svf, U-Net (Ronneberger et al., 2015) style convolutional networks with skip connections were used. Double convolution with kernel size 3 was applied on each resolution followed by a downsampling layer or an upsampling layer. For downsampling we used strided convolutions with stride 2 and kernel size 2, and for upsampling corresponding transposed convolutions. For F, we used group normalization (Wu and He, 2018) instead of typical batch normalization, while H_svf and G_svf had no normalization layers.
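The effect of the downsampling layers on the spatial size can be checked with the standard convolution output size formula. The sketch below assumes no padding in the strided convolutions; the number of resolution levels in the example is hypothetical and not taken from the text.

```python
def conv_output_size(size, kernel=2, stride=2, padding=0):
    """Standard output size of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_sizes(size, n_levels):
    """Spatial sizes after each strided convolution (stride 2, kernel 2)."""
    sizes = [size]
    for _ in range(n_levels):
        sizes.append(conv_output_size(sizes[-1]))
    return sizes
```

With kernel size 2 and stride 2 each downsampling exactly halves an even spatial size, so a patch of side 64 shrinks to 32, 16, 8, and so on.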
For H_rig and D, ResNet (He et al., 2016) style encoders were used with global average pooling after a certain number of downsamplings followed by a linear mapping. For the discriminator D we additionally used spectral normalization (Miyato et al., 2018) to make the training more stable. Also, we fed normalized image coordinates to the rigid registration network to handle rotations correctly despite the global average pooling.
Leaky ReLU activation (Maas et al., 2013) with slope 0.01 was used as non-linearity for F and D to avoid dead gradients as their optimization goal was suspected to
Multiple inputs were concatenated over the channel dimension before feeding them to the networks.
In the cross-modality brain MRI synthesis and head MRI to CT synthesis experiments, 3D architectures were used. The architectures were otherwise similar to the architectures used in the other experiments except that all 2D layers were replaced with corresponding 3D layers.

S.IV. Baseline Network Architecture Details
For two baselines, NeMAR (Arar et al., 2020) and pix2pix (Isola et al., 2017), we additionally trained variants for which the overall architecture was kept the same but subcomponents were replaced with the components used for our method. To be more specific, this included the following changes for any given model:
• Generator and discriminator networks were replaced with our architectures described for each experiment in Section S.III (F and D).
• For NeMAR, the registration network was replaced with the architecture of G svf described in Section S.III (together with the svf exponentiation).
• The optimizer described in Section S.II was used.
• Loss function weights for each component were chosen to be identical with the loss function weights used for training our models.This included weights for the adversarial loss, the similarity loss, and for NeMAR also the deformation regularization loss.

S.V. Synthetic Data Generation Details
For two data sets we generated non-aligned target images by applying synthetic deformations. In such a process the values extrapolated from outside the image volume are usually set to zero. However, the resulting border could be used by the registration network for partially inferring the applied deformation, and with real world data sets no such border naturally exists. That would unnaturally benefit deep learning registration based methods such as our method. For the COCO data set we mostly avoid this problem by centrally cropping the images to 400 × 400 resolution. However, for some COCO images and for the cross-modality brain MRI synthesis images, values were still extrapolated from outside the original images. For that reason we masked the synthetically deformed target images such that we computationally found the largest interior rectangle within the available image values and set everything else to zero.

S.VI. Evaluation Metrics
The evaluation metrics used are explained here.

S.VI.A. Peak-Signal-to-Noise-Ratio (PSNR)
PSNR is calculated as $10 \log_{10}\left(\mathrm{MAX}^2 / \mathrm{MSE}\right)$ where MAX is the maximum possible image intensity and MSE is the mean squared error between the images. For the synthetic COCO data set and the virtual staining data set MAX = 255, and for the cross-modality brain MRI synthesis data set we used the maximum value over the whole data set since MRI images do not have a definite upper bound. The same value was used in all the experiments.
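A minimal sketch of the PSNR computation follows, assuming the images are given as flat lists of intensities. Returning infinity for identical images is a convention chosen for the example, and the function name `psnr` is hypothetical.

```python
import math


def psnr(image_a, image_b, max_value):
    """Peak signal-to-noise ratio in decibels: 10 * log10(MAX^2 / MSE)."""
    mse = sum((a - b) ** 2 for a, b in zip(image_a, image_b)) / len(image_a)
    if mse == 0:
        return math.inf  # identical images (convention chosen for this sketch)
    return 10.0 * math.log10(max_value ** 2 / mse)
```

For instance, with MAX = 10 and a mean squared error of 1 the PSNR is 10 log10(100) = 20 dB.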

Figure 3: Cross-modality registration architecture. The outputs y(i)_reg and v(i)_cross are forwarded for intra-modality registration. Regularization is applied to both inverse and forward elastic deformation, which is not explicitly shown here. The images are from the synthetic "multimodal" data sets built using the COCO (Lin et al., 2014) data set.

Figure 4: Example images from the semi-synthetic cross-modality MRI synthesis data set. Only one sagittal slice of the 3D volumes is visualized. The training target ỹ(i) has been deformed with a random deformation after which we have cropped it such that the top and bottom edges are straight.

Figure 6: Example failure mode when training without deformation equivariance encouraging losses. The prediction is shifted towards the top-left direction and also has a non-desired pattern, both of which are compensated by the registration networks. Images are from the synthetic experiment with data set SR and model DefSim.

Figure 7: Example prediction from the synthetic experiment with data set LR and model EqSim + Com. The image is from the test set. In addition to the synthesized image, the deformation is accurately reproduced.

Figure 8: A prediction produced by the RegGAN model in the cross-modality brain MRI synthesis experiment. The model has converged to produce severely misaligned predictions.

Figure 10: Predictions produced by different variants of our method and the two best performing baselines in the head MRI to CT synthesis experiment. Only our method is capable of generating the body outline at the lower neck region correctly. The region is highlighted with red circles. T1 MRI © Copyright CERMEP - Imagerie du vivant, www.cermep.fr and Hospices Civils de Lyon. All rights reserved.

Table 1: Deformation parameters for synthetic data sets

Table 2: Results for the synthetic data set experiments

Table 4: Results for the cross-modality brain MRI synthesis experiment

Table 5: Results for the virtual histopathology staining experiment. Unmodified NeMAR was not included in the nuclei reproducibility study as it failed to converge to a meaningful optimum.

Table 6: Results for the MRI to CT synthesis experiment on the validation set comparing different types of affine transformations for the commutation loss. The experiment was performed using the "DefSim + Com + EqAdv" setup. Translations were sampled from the range [−8 mm, 8 mm], and rotations from the range [−25°, 25°]. Scales and shears were sampled by exponentiating a symmetric matrix whose elements were each sampled from a zero-mean Gaussian distribution with a standard deviation of 0.08. For generating scales, the off-diagonal values were set to zero. Only translation and rotation were used for the main experiment, as that resulted in the best MAE, although the difference to additionally using scaling and shearing is not very large. Note that ME largely ignores geometrical misalignments between inputs and predictions.
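The scale and shear sampling described in this caption can be sketched as follows. This is an illustrative reading, not the code used in the experiments: it assumes that the upper triangle of the matrix is sampled and mirrored to obtain symmetry, and it approximates the matrix exponential with a truncated Taylor series, which is adequate only because the sampled entries are small. All function names are hypothetical.

```python
import math
import random


def sample_symmetric_matrix(rng, dim=3, std=0.08, diagonal_only=False):
    """Symmetric matrix with zero-mean Gaussian entries (upper triangle mirrored)."""
    m = [[0.0] * dim for _ in range(dim)]
    for i in range(dim):
        for j in range(i, dim):
            if diagonal_only and i != j:
                continue  # pure scaling: off-diagonal values stay zero
            m[i][j] = m[j][i] = rng.gauss(0.0, std)
    return m


def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def matrix_exp(m, terms=20):
    """Matrix exponential via a truncated Taylor series (assumes small entries)."""
    n = len(m)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]  # current series term, starts at the identity
    for k in range(1, terms):
        term = [[v / k for v in row] for row in matmul(term, m)]  # m^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


def sample_scale_shear(rng, dim=3, std=0.08, scale_only=False):
    """Sample a scale (and optionally shear) matrix as exp of a symmetric matrix."""
    return matrix_exp(sample_symmetric_matrix(rng, dim, std, diagonal_only=scale_only))
```

Exponentiating a symmetric matrix always yields a symmetric positive definite matrix, so the sampled transformation is invertible and never flips orientation.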

Table 7: Results for the MRI to CT synthesis experiment. Note that ME largely ignores geometrical misalignments between inputs and predictions.

On cross-modality brain MRI synthesis and head MRI to CT synthesis, the method outperformed all of the baselines, and on the virtual staining task the performance was close to the best performing baseline, even though the baseline had a significantly larger network size. Based on the experiments, while the EqSim + Com + EqAdv configuration worked well on the synthetic data, our recommended configurations are EqSim + EqAdv and DefSim + Com + EqAdv, as they performed the best on more realistic data sets.