SynthMorph: Learning Contrast-Invariant Registration Without Acquired Images

We introduce a strategy for learning image registration without acquired imaging data, producing powerful networks agnostic to contrast introduced by magnetic resonance imaging (MRI). While classical registration methods accurately estimate the spatial correspondence between images, they solve an optimization problem for every new image pair. Learning-based techniques are fast at test time but limited to registering images with contrasts and geometric content similar to those seen during training. We propose to remove this dependency on training data by leveraging a generative strategy for diverse synthetic label maps and images that exposes networks to a wide range of variability, forcing them to learn more invariant features. This approach results in powerful networks that accurately generalize to a broad array of MRI contrasts. We present extensive experiments with a focus on 3D neuroimaging, showing that this strategy enables robust and accurate registration of arbitrary MRI contrasts even if the target contrast is not seen by the networks during training. We demonstrate registration accuracy surpassing the state of the art both within and across contrasts, using a single model. Critically, training on arbitrary shapes synthesized from noise distributions results in competitive performance, removing the dependency on acquired data of any kind. Additionally, since anatomical label maps are often available for the anatomy of interest, we show that synthesizing images from these dramatically boosts performance, while still avoiding the need for real intensity images. Our code is available at https://w3id.org/synthmorph.

many neuroimaging pipelines involving data acquired across time, subjects, and modalities. Magnetic resonance imaging (MRI) uses pulse sequences to obtain images with contrasts between soft tissue types. Different sequences can produce dramatically different appearance even for the same anatomy. For neuroimaging, a range of contrasts is commonly acquired to provide complementary information, such as T1-weighted contrast (T1w) for inspecting anatomy or T2-weighted contrast (T2w) for detecting abnormal fluids [1]. Registration of such images is critical when combining information across acquisitions, for example to gauge the damage induced by a stroke or to plan a brain-tumor resection. While rigid registration can be sufficient for aligning within-subject images acquired with the same sequence [2], images acquired with different sequences can undergo differential distortion due to effects such as eddy currents and susceptibility artifacts, requiring deformable registration [3]. Deformable registration is also important for morphometric analyses [4]- [6], which hinge on aligning images with an existing standardized atlas that typically has a different contrast [7]- [9]. Given the central importance of registration tasks within and across contrasts, and within and across subjects, the goal of this work is a learning-based framework for registration agnostic to MRI contrast: we propose a strategy for training networks that excel both within contrasts (e.g. between two T1w scans) as well as across contrasts (e.g. T1w to T2w), even if the test contrasts are not observed during training.
Classical registration approaches estimate a deformation field between two images by optimizing an objective that balances image similarity with field regularity [10]- [16]. While these methods provide a strong theoretical background and can yield good results, the optimization needs to be repeated for every new image pair, and the objective and optimization strategy typically need to be adapted to the image type. In contrast, learning-based registration uses datasets of images to learn a function that maps an image pair to a deformation field aligning the images [17]- [24]. These approaches achieve subsecond runtimes on a GPU and have the potential to improve accuracy and robustness to local minima. Unfortunately, they are limited to the MRI contrast available during training and therefore do not generally perform well on unobserved (new) image types. For example, a model trained on pairs of T1w and T2w images will not accurately register T1w to proton-density weighted (PDw) images. With a focus on neuroimaging, we remove this constraint of learning methods and design an approach that generalizes to unseen MRI contrasts at test time. A. Related work 1) Classical methods: Deformable registration has been widely studied [11], [12], [15], [16], [25]. Classical strategies implement an iterative procedure that estimates an optimal deformation field for each image pair. This involves maximizing an image-similarity metric, that compares the warped moving and fixed images, and a regularization term that encourages desirable deformation properties such as preservation of topology [10], [13]- [15]. Cost function and optimization strategies are typically chosen to suit a particular task. Simple metrics like mean squared error (MSE) or normalized crosscorrelation (NCC) [12] are widely used and provide excellent accuracy for images of the same contrast [26].
For registration across MRI contrasts, metrics such as mutual information (MI) [27] and correlation ratio [28] are often employed, although the accuracy achieved with them is not on par with the within-contrast accuracy of NCC and MSE [29]. For some tasks, e.g. registering intra-operative ultrasound to MRI, estimating even approximate correspondences can be challenging [30], [31]. While they are not often used in neuroimaging, metrics based on patch similarity [32]- [36] and normalized gradient fields [37]- [39] outperform simpler metrics, e.g. on abdominal computer-tomography (CT). Other methods convert images to a supervoxel representation, which is then spatially matched instead of the images [40], [41]. Our work also employs geometric shapes, but instead of generating supervoxels from input images, we synthesize arbitrary patterns (and images) from scratch during training to encourage learning contrast-invariant features for spatial correspondence.
2) Learning approaches: Learning-based techniques mostly use convolutional neural networks (CNNs) to learn a function that directly outputs a deformation field given an image pair. After training, evaluating this function is efficient, enabling fast registration. Supervised models learn to reproduce simulated warps or deformation fields estimated by classical methods [21], [22], [24], [42]- [44]. In contrast, unsupervised models minimize a loss similar to classical cost functions [17], [45]- [47] such as normalized MI (NMI) [48] for cross-contrast registration. In another cross-contrast registration paradigm, networks synthesize one contrast from the other, so that within-contrast losses can be used for subsequent nonlinear registration [29], [49]- [53]. These methods all depend on having training data of the target contrast. If no such data are available during training, models generally predict inaccurate warps at test time: a model trained on T1w-T1w pairs would fail when applied within unseen contrasts (e.g. T2w-T2w) or across unseen contrast combinations (e.g. T1w-T2w).
Recent approaches also use losses driven by label maps or sparse annotations (e.g. fiducials) for registering different imaging modalities labeled during training, such as T2w MRI and 3D ultrasound within the same subject [54], [55], or aiding existing formulations with auxiliary segmentation data [17], [56]- [58]. While these label-driven methods can boost registration accuracy compared to approaches using intensity-based loss functions, they are dependent on the limited annotated images available during training. Consequently, these approaches do not perform well on unobserved MRI contrasts.
Data-augmentation strategies expose a model to a wider range of variability than the training data encompasses, for example by randomly altering voxel intensities or applying deformations [59]- [62]. However, even these methods still need to sample training data acquired with the target contrast. Similarly, transfer learning can be used to extend a trained network to new contrasts but does not remove the need for training data with the target contrast [63]. Given the continuing development of new and improved MRI contrast types at ever higher field strengths, the reduction in accuracy evidenced by existing methods in the presence of novel image contrast becomes a limiting factor.

B. Contribution
In this work we present SynthMorph, a general strategy for learning contrast-agnostic registration ( Fig. 1). At test time, it can accurately register a wide variety of acquired images with MRI contrasts unseen during training. SynthMorph enables registration of real images both within and across contrasts, learning only from synthetic data that far exceed the realistic range of medical images. During training we synthesize images from label maps, whereas registration requires no label maps at test time. First, we introduce a generative model for random label maps of variable geometric shapes. Second, conditioned on these maps, or optionally given other maps of interest, we build on recent methods to synthesize images with arbitrary contrasts, deformations, and artifacts [64]. Third, the strategy enables us to use a contrast-agnostic loss that measures label overlap, instead of an image-based loss. This leads to two SynthMorph network variants (sm) that yield substantial generalizability, both capable of registering any contrast combination tested without retraining: sm-shapes trains without acquired data of any kind, matches classical state-of-the-art registration of neuroanatomical MRI, and outperforms learning baselines at cross-contrast registration. Variant sm-brains trains on images synthesized from brain segmentations only and substantially outperforms all classical and learning-based baselines tested.
This work builds on and extends a preliminary conference paper [65] presented at the IEEE International Symposium on Biomedical Imaging (ISBI) 2021. The extension includes a series of new experiments, new analyses of the framework and regularization, and a substantially expanded discussion. We also show that networks trained within the SynthMorph strategy generalize to new image types with MRI contrasts unseen at training. Our contribution focuses on neuroimaging but provides a general learning framework that can be used to train models across imaging applications and machinelearning techniques. Our code is freely available as part of the VoxelMorph library [66] and at https://w3id.org/synthmorph.

A. Background
Let m and f be a moving and a fixed 3D image, respectively. We build on unsupervised learning-based registration frameworks and focus on deformable (non-linear) registration. These frameworks use a CNN h θ with parameters θ that outputs the deformation At each training iteration, the network h θ is given a pair of images {m, f }, and parameters are updated by optimizing a loss function L(θ; m, f, φ θ ) similar to classical cost functions, using stochastic gradient descent. Typically, the loss contains an image dissimilarity term L dis (m • φ θ , f ) that penalizes differences in appearance between the warped image and the fixed image, and a regularization term L reg (φ) that encourages smooth deformations: where φ θ = h θ (m, f ) is the network output, and λ controls the weighting of the terms. Unfortunately, networks trained this way only predict reasonable deformations for images with contrasts and shapes similar to the data observed during training. Our framework alleviates this dependency.

B. Proposed method overview
We strive for contrast invariance and robustness to anatomical variability by requiring no acquired training data, but instead synthesizing arbitrary contrasts and shapes from scratch .., J}) are sampled from a standard distribution, then warped by random deformations φ j to cover a range of scales and shapes. We synthesize a label map s from the warped imagesp j = p j • φ j : for each voxel k of s, we assign label j corresponding to imagẽ p j where k has the highest intensity j, i.e. s k = arg max j ([p j ] k ).
The example uses J = 26. ( Fig. 1). We generate two paired 3D label maps {s m , s f } using a function g s (z) = {s m , s f } described below, given random seed z. However, if anatomical labels are available, we can use these instead of synthesizing segmentation maps. We then define another function g I (s m , s f ,z) = {m, f } (described below) that synthesizes two 3D intensity volumes {m, f } based on the maps {s m , s f } and seedz. This generative process resolves the limitations of existing methods as follows. First, training a registration network h θ (m, f ) using the generated images exposes it to arbitrary contrasts and shapes at each iteration, removing the dependency on a specific MRI contrast. Second, because we first synthesize label maps, we can use a similarity loss that measures label overlap independent of image contrast, thereby obviating the need for a cost function that depends on the contrasts being registered at that iteration. In our experiments, we use the (soft) Dice metric [67] where s j represents the one-hot encoded label j ∈ {1, 2, ..., J} of label map s, and and ⊕ denote voxel-wise multiplication and addition, respectively. While the framework can be used with any parameterization of the deformation field φ, in this work we use a stationary velocity field (SVF) v, which is integrated within the network to obtain a diffeomorphism [11], [45], [68], that is invertible by design. We regularize φ using L reg (φ) = 1 2 ∇u 2 , where u is the displacement of the deformation field φ = Id + u. C. Generative model details 1) Label maps: To generate input label maps with J labels of random geometric shapes, we first draw J smoothly varying noise images p j (j ∈ {1, 2, ..., J}) by sampling voxels from a standard distribution at lower resolution r p and upsampling to full size (Fig. 2). Second, each image p j is warped with a random smooth deformation field φ j (described below) to obtain imagesp j = p j • φ j . Third, we create an input label map s by assigning, for each voxel k of s, the label j corresponding to imagep j that has the highest intensity, i.e. s k = arg max j ([p j ] k ).
Given a selected label map s, we generate two new label maps. First, we deform s with a random smooth diffeomorphic transformation φ m (described below) using nearest-neighbor interpolation to produce the moving segmentation map s m = s • φ m . An analogous process yields the fixed map s f . Alternatively, if segmentations are available for the anatomy of interest, such as the brain, we randomly select and deform input label maps instead of synthesizing them (Fig. 3). To generate two different images, we could start by using a single segmentation twice, or two separate ones. In this work, we sample separate brain label maps as this captures more realistic variability in the correspondences that the registration network has to find. In contrast, for sm-shapes training, we use a single label map s as input twice to ensure that topologically consistent correspondences exist.
2) Synthetic images: From the pair of label maps {s m , s f }, we synthesize gray-scale images {m, f } building on generative models of MR images used for Bayesian segmentation [69]- [72] (Fig. 3). We extend a publicly available model [64] to make it suitable for registration, which, in contrast to segmentation, involves the efficient generation of pairs of images (Section II-D.3). Given a segmentation map s, we draw the intensities of all image voxels that are associated with label j as independent samples from the normal distribution N (µ j , σ 2 j ). We sample the mean µ j and standard deviation (SD) σ j for each label from continuous distributions U(a µ , b µ ) and U(a σ , b σ ), respectively, where a µ , b µ , a σ , and b σ are hyperparameters. To simulate partial volume effects [73], we convolve the image using an anisotropic We further corrupt the image with a spatially varying intensity-bias field B [74], [75]. We independently sample the voxels of B from a normal distribution N (0, σ 2 B ) at lower resolution r B relative to the full image size (described below), where σ B ∼ U(0, b B ). We upsample B to full size, and take the exponential of each voxel to yield non-negative values before we apply B using element-wise multiplication. We obtain the final images m and f through min-max normalization and contrast augmentation through global exponentiation, using a single normally distributed parameter γ ∼ N (0, σ 2 γ ) for the entire image such that m =m exp(γ) , wherem is the normalized moving image, and similarly for the fixed image (Fig. 3).
3) Random transforms: We obtain the transforms φ j (j = 1, 2, ..., J) for noise image p j by integrating random SVFs v j [11], [45], [46], [68]. We draw each voxel of v j as an independent sample of a normal distribution N (0, σ 2 j ) at lower resolution r p , where σ j ∼ U(0, b p ) is sampled uniformly, and each SVF is integrated and upsampled to full size. Similarly, we obtain the transforms φ m and φ f based on hyperparameters r v and b v for sm-brains. For sm-shapes, we sample several SVFs v m ∼ N (0, σ 2 v ) at resolutions r v ∈ {1:8, 1:16, 1:32}, drawing a different σ v for each to synthesize a more complex deformation, since the fixed and moving images are based on the same input label map. The upsampled SVFs are then combined additively, and a similar procedure yields v f . D. Implementation details 1) Hyperparameters: The generative process requires a number of parameters. During training, we sample these based on the hyperparameters presented in Table I. Their values are not chosen to mimic realistic anatomy or a particular MRI contrast. Instead, we select hyperparameters visually to yield shapes and contrasts that far exceed the range of realistic medical images, to force our networks to learn generalizable features that are independent of the characteristics of a specific contrast [59]. We thoroughly analyze the impact of varying hyperparameters in our experiments.
2) Architecture: The models implement the network architecture used in the VoxelMorph library [17], [45]: a convolutional U-Net [60] predicts an SVF v θ from the input {m, f }. As shown in Fig. 4, the encoder has 4 blocks consisting of a stride-2 convolution and a LeakyReLU layer (parameter 0.2), that each halve the resolution relative to the inputs. The decoder features 3 blocks that each include a stride-1 convolution, an upsampling layer, and a skip connection to the corresponding encoder block. We obtain the SVF v θ after 3 further convolutions at half resolution, and the warp φ θ after integration and upsampling.
All convolutions use 3 × 3 × 3 kernels. We use a default network width of n = 256 unless stated otherwise. While the last layer of all networks employs n = 3 filters, we reduce the width to n = 64 for the parameter sweeps of Section III-G and the analysis of feature maps in Fig. 10 and Fig. 12, to lower the U-Net architecture of φ θ = h θ (m, f ). Each block of the encoder features a 3D convolution with n = 256 filters and a LeakyReLU layer (0.2). Stride-2 convolutions each halve the resolution relative to the input. In the decoder, each convolution is followed by an upsampling layer and a skip connection (long arrows). The SVF v θ is obtained at half resolution, yielding the warp φ θ after integration and upsampling. All kernels are of size 3 × 3 × 3. The final layer uses n = 3 filters. computational burden and memory requirements and thereby enable us to perform the analyses within our computational resources. We expect the results to be generally applicable as we use the same synthesis and registration architecture, while higher network capacities typically improve accuracy as long as the training set is large enough.
3) Implementation: We implement our networks using Ten-sorFlow/Keras [76]. We integrate SVFs using a GPU version [45], [46] of the scaling and squaring algorithm with 5 steps [11], [68]. Training uses the Adam optimizer [77] with a batch size of one registration pair and an initial learning rate of 10 −4 , that we decrease to 10 −5 in case of divergence. We train each model until the Dice metric converges in the synthetic training set, typically for 4 × 10 5 iterations.
To generate pairs of images with high variability for registration, we extend a model [64] implemented for a singleinput segmentation task. First, we improve efficiency to meet the increased computational demand. For example, we replace smoothing operations based on 3D convolutions by 1D convolutions with separated Gaussian kernels. We also integrate spatial augmentation procedures such as random axis flipping into a single deformation field, enabling their application as part of one interpolation step. We also implement an interpolation routine with fill-value-based extrapolation on the GPU. The fill value enables extrapolating with zeros instead of repeating voxels where the anatomy extends to the edge of the image, making the spatial augmentation more realistic.
Second, we add to the data augmentation within the model by expanding random axis flipping to all three dimensions, and by drawing a separate smoothing kernel for each dimension of space enabling randomized anisotropic blurring. We implement a more complex warp synthesis that generates and combines SVFs at multiple spatial resolutions. We also extend most augmentation steps to vary across batches, thereby increasing variability. Third, we simplify the code to improve its maintainability and reusability. We use the external VoxelMorph and Neurite libraries to avoid code duplication. We update the model to support the latest TensorFlow version to benefit from the full set of features including batch profiling and debugging in eager execution mode.

III. EXPERIMENTS
We evaluate network variants trained with the proposed strategy and compare their performance to several baselines. The test sets include a variety of image contrasts and levels of processing to assess method robustness. Our goal is for SynthMorph to achieve unprecedented generalizability to new contrasts among neural networks, matching or exceeding the accuracy of all classical and learning approaches tested.

A. Data
While SynthMorph training involves data synthesized from label maps that vary widely beyond realistic ranges, all tests and method comparisons use only acquired MRI data.
Multi-FA, multi-TI. We compile a series of spoiled gradientecho (FLASH) [86] images for flip angles (FA) varied between 2 • and 40 • in 2 • steps. For each of 20 subjects, we obtain contrasts ranging from PDw to T1w using the steady-state signal equation with acquired parametric maps (T1, T2 * , PD) and sequence parameters: repetition time (TR) 20 ms, echo time (TE) 2 ms. Equivalently, we compile a series of MPRAGE images for inversion times (TI) varied between 300 ms and 1000 ms in steps of 20 ms. For each of 20 subjects, we fit MPRAGE contrasts based on MP2RAGE [87] echoes acquired with parameters: TR/TE 5000/2.98 ms, TI 1 /TI 2 700/2500 ms, FA 4 • . Fig. 9 shows typical examples of these data.
Buckner40. We derive 40 distinct-subject segmentations with brain and non-brain labels from T1w MPRAGE scans of the Buckner40 dataset [88], a subset of the fMRIDC structural data [89].
Cardiac MRI. We gather cine-cardiac MRI datasets from 33 subjects [90]. Each frame is a stack of thick 6-13 mm slices with ∼(1.5 mm) 2 in-plane resolution. The data include manually drawn contours outlining the endocardial and epicardial walls of the left ventricle. Fig. 15 shows representative frames.
2) Processing: As we focus on deformable registration, we map all brain images into a common 160 × 160 × 192 affine space [4], [91] at 1 mm isotropic resolution. Unless manual segmentations are available, we derive brain and non-brain labels for skull-stripping and evaluation using the contrastadaptive SAMSEG [6] method.
For each subject of the multi-FA and multi-TI datasets, we derive brain labels from a single acquired T1w image using FreeSurfer [4], ensuring identical labels across all MRI contrasts obtained for the subject.
We resample all cardiac frames to 256 × 256 × 112 volumes with isotropic 1-mm voxels and transfer the manual contours into the same space.
3) Dataset use: Training. We use the Buckner40 label maps for data synthesis (Fig. 5) and SynthMorph training. For the learning baselines, we use T1w and T2w images from 100 HCP-A subjects, and all T1w images from GSP and UKBB.
Validation. For hyperparameter tuning and monitoring model training, we use 10 registration pairs for each of the OASIS, HCP-A and BIRN contrast pairings described below. These subjects do not overlap with the training set.
Test. Table II provides an overview of the contrast combinations compiled from OASIS, HCP-A, and BIRN. Except for the 8 PDw BIRN images, the subjects do not overlap with the training or validation sets. We also use the multi-FA, multi-TI and cardiac images for testing; none of these data are used in training or validation.

B. Baselines
We test classical registration with ANTs (SyN) [12] using recommended parameters [92] for the NCC similarity metric within contrast and MI across contrasts. We test NiftyReg [13] with the default cost function (NMI) and recommended parameters, and we enable its diffeomorphic model with SVF integration as in our approach. Both ANTs and NiftyReg are optimized for neuroimaging applications, leading to appropriate parameters for our tasks. We also run the deedsBCV [93] patch-similarity method, which we tune for neuroimaging. To match the spatial scales of brain structures, we reduce the default grid spacing, search radius and quantization step to 6 × 5 × 4 × 3 × 2, 6 × 5 × 4 × 3 × 2, and 5 × 4 × 3 × 2 × 1, respectively, improving registration in our experiments.
As a learning baseline, we train VoxelMorph (vm), using an image-based NCC loss and the same architecture as SynthMorph, on 100 skull-stripped T1w images from HCP-A that do not overlap with the validation set. Similarly, we train another model with NMI as a loss on random combinations of 100 T1w and 100 T2w images. This exposes each model to 9900 different cross-subject registration pairs, and vm-nmi to T1w-T1w, T1w-T2w and T2w-T2w contrast pairings (both contrasts were acquired from the same 100 subjects). Following the original VoxelMorph implementation [17], we train these baseline networks without data augmentation, with the exception of randomized axis flipping.
While we compare to learning baselines following their original implementation [17], we also investigate if the performance of these methods can be further improved. First, we

Moving Fixed
Subject pairs Experiment HCP-A HCP-A 50 1 retrain the baseline model adding a further 7000 T1w images from UKBB and GSP to the training set to evaluate whether the original finding that 100 images are sufficient [17] holds true in our implementation, or whether the greater anatomical variability would promote generalizability across contrasts or datasets (vm-ncc-7k). Second, we explore to what extent augmentation can improve accuracy, by retraining vm-ncc with 100 T1w images, while augmenting the input images with random deformations as for sm-brains training (vm-ncc-aug).
Third, we train a new hybrid method using extreme contrast augmentation to explore if more variability in the training contrasts would help the network generalize (Fig. 5). At every iteration, we sample a registration pair from 100 T1w images and pass it to the similarity loss, while the network inputs each undergo an arbitrary gray-scale transformation: we uniformly sample a random lookup table (LUT) from U(0, 255), remapping the intensities {0, ..., 255} to new values of the same set. We smooth this LUT using a Gaussian kernel L(σ L = 64).
Fourth, the synthesis enables supervised training if the moving and fixed label maps {s m , s f } are generated from the same input label map, so that the net warp is known. We analyze whether knowledge of the synthetic net warp can improve accuracy, by training models with the same architecture using an MSE loss between the synthesized and predicted SVFs v (sup-svf) or deformation fields φ (sup-def), respectively. As for sm-shapes, we draw the SVFs {v m , v f } at several resolutions r v ∈ {1:8, 1:16, 1:32} to synthesize a more complex deformation since we use a single brain segmentation map as input to ensure that a topologically consistent spatial correspondence exists.

C. SynthMorph variants
For image-data and shape-agnostic training (sm-shapes), we sample {s m , s f } by selecting one of 100 random-shape segmentations s at each iteration, and synthesizing two separate image-label pairs from it. Each s contains J = 26 labels that we include in the loss L dis . Since brain segmentations are often available, even if not for the target MRI contrast, we train another network on the Buckner40 anatomical labels instead of shapes (sm-brains). In this case, we sample {s m , s f } from two distinct label maps at each iteration and further deform them using synthetic warps. We optimize the J = 26 largest brain labels in L dis , similar to what VoxelMorph does for validation [17] (see below).

D. Validation metrics
To measure registration accuracy, we propagate the moving labels using the predicted warps and compute the Dice metric D [94] across a representative set of brain structures: amygdala, brainstem, caudate, ventral DC, cerebellar white matter and cortex, pallidum, cerebral white matter (WM) and cortex, hippocampus, lateral ventricle, putamen, thalamus, 3 rd and 4 th ventricle, and choroid-plexus. We average scores of bilateral structures. In addition to volumetric Dice overlap, we evaluate the mean symmetric surface distance S (MSD) between contours of the same moved and fixed labels. We also compute the proportion of voxels where the warp φ folds, i.e. det(J φ ) ≤ 0 for voxel Jacobian J φ .
E. Experiment 1: baseline comparison 1) Setup: For each contrast, we run experiments on 50 heldout image pairs, where each image is from a different subject, except for T1w-PDw pairs, of which we have eight. To assess robustness to non-brain structures, we evaluate registration within and across datasets, with and without skull-stripping, using datasets of the same size. We determine whether the mean differences between methods are significant using paired two-sided t-tests.
2) Results: Fig. 5 shows examples of SynthMorph training data, and Fig. 6 shows typical registration results. Fig. 7 compares registration accuracy across structures to all baselines, in terms of Dice overlap (Fig. 7a) and MSD (Fig. 7b). By exploiting the anatomical information in a set of brain labels, sm-brains achieves the best accuracy across all datasets, even though no real MR images are used during training. First, sm-brains outperforms classical methods on all tasks by at least 2.4 Dice points, and often much more (p < 0.0003 for T1w-PDw, p < 4 × 10 −15 for all other tasks). Second, it exceeds the state-of-the-art accuracy of vm-ncc for T1w-T1w registration, which is trained on T1w images, by at least 0.6 Dice points (p < 6 × 10 −6 ). Importantly, across contrasts sm-brains outperforms all other methods, demonstrating its ability to generalize to contrasts. Compared especially to baseline learning methods, which cannot generalize to contrasts unseen during training, sm-brains leads by up to 45.1 points (p < 6 × 10 −7 for all cross-contrast tasks). Compared to classical methods, the proposed method outperforms by 2.9 or more points (p < 0.0003 for T1w-PDw, p < 2 × 10 −17 for other cross-contrast tasks).
The shape and contrast-agnostic network sm-shapes matches the performance of the best classical method for each dataset except T1w-T1w registration, where it slightly underperforms (p < 8 × 10 −11 ), despite never having been exposed to either imaging data or even neuroanatomy. Like sm-brains, sm-shapes generalizes well to multi-contrast registration, matching or exceeding the accuracy of all baselines, and by significant margins compared to learning baselines (p < 8 × 10 −7 for T1w-PDw, p < 2 × 10 −17 otherwise). The baseline learning methods vm-ncc and vm-nmi perform well and clearly match or outperform classical methods at contrasts similar to those used in training. However, as expected, these approaches break down when tested on a pair of new contrasts that were not sampled during training, such as T1w-PDw. Similarly, vm-ncc and vm-nmi achieve slightly lower accuracy on image pairs that are not skull-stripped.
While MSD can be more sensitive than Dice overlap at structure boundaries, our analysis of surface distances yields a similar overall ranking between methods (Fig. 7b). Importantly, sm-brains achieves the lowest MSD for all contrasts, typically 0.7 mm or less, which is below the voxel size. Within contrasts, sm-brains outperforms classical methods by at least 0.06 mm (p < 2 × 10 −9 ), surpassing all baselines tested across contrasts (p < 0.04 for T1w-PDw, p < 10 −10 for the other tasks).
Exposing the baseline models to a much larger space of deformations at training does not result in a statistically significant increase of accuracy for T1w-to-T1w registration within OASIS (Fig. 8a). For vm-ncc-aug, accuracy across T1w datasets (OASIS-HCP, p < 0.007) and T2w-to-T2w accuracy (p < 0.03) decrease by 0.1 Dice point relative to vm-ncc. For vm-ncc-7k, accuracy across T1w datasets increases by 0.1 point (p < 0.04), with no significant change for T2w-to-T2w registration, but overall these 0.13% changes are negligible. Similar to vm-ncc, these models do not generalize to unseen pairings across contrasts, under-performing sm-brains by 42.9 or more points (Fig. 8a, p < 10 −8 ).
Augmenting T1w image contrast using random LUTs (Fig. 5)   the supervised models by up to 6.1 Dice points (p < 0.009 for T1w-PDw, p < 3 × 10 −15 for all other tasks). However, the increased contrast robustness comes at the expense of a drop of 0.5-1. 9 Dice points within contrasts relative to vm-ncc (p < 0.0002), while sm-brains outperforms hybrid by at least 2.4 points within (p < 6 × 10 −17 ) and 4.5 points across contrasts (p < 10 −5 for T1w-PDw, otherwise p < 10 −23 ). We also investigate lower kernel widths σ L < 64, but find these to negatively impact accuracy and therefore do not include them in the graph: reducing σ L introduces noise in the image, indicating the importance of LUT smoothness. Finally, the supervised networks sup-def and sup-vel achieve the lowest accuracy for within-contrast registration (p < 0.02) and consistently under-perform their unsupervised counterpart sm-brains by 6.8-10.7 points across all contrast combinations (p < 0.0001 for T1w-PDw, p < 3 × 10 −26 for all other tasks). As for the main baseline comparison, measurements of the mean surface distance in Fig. 8b result in a similar ranking between method variations, at comparable significance levels.
In our experiments, learning-based models require less than 1 second per 3D registration on an Nvidia Tesla V100 GPU. Using the recommended settings, NiftyReg and ANTs typically take ∼0.5 h and ∼1.2 h on a 3.3-GHz Intel Xeon CPU, respectively, whereas deedsBCV requires ∼3 min.

F. Experiment 2: contrast invariance
In this experiment we evaluate registration accuracy as a function of gradually varying MRI contrast and measure robustness to new image types by analyzing the variability of network features across these contrasts. 1) Setup: To assess network feature invariance to MRI contrast, we perform the following procedure for 10 pairs of separate subjects, where each subject is only considered once and, thus, registered to a different fixed image. Given each such pair, we run a separate registration between each of the multi-FA contrasts for the moving subject and the most T1wlike contrast (FA 40 • ) of the fixed subject. For each pair of  subjects, we measure accuracy with all tested methods as well as the variability of the features of the last network layer, before the SVF is formed, across input pairs. Specifically, we compute the root-mean-square difference d (RMSD) between the layer outputs of the first and all other contrast pairs over space, averaged over contrasts, features, and subjects. For efficiency, we restrict the moving images for this analysis to the subsets of FAs and TIs that undergo the largest changes in contrast, i.e. FAs from 2 to 30 • (4 • steps) and TIs from 300 to 600 ms (40-ms steps).
2) Results: Fig. 11 compares registration accuracy as a function of the moving-image MRI contrast for baseline methods and SynthMorph. In both the multi-FA and the multi-TI data, we obtain broadly comparable results for all methods when the moving and fixed image have T1w-like contrast. However, the performance of ANTs, NiftyReg and learning baselines decreases with increasing contrast differences, whereas SynthMorph remains largely unaffected. Fig. 12 shows the variability of the response of each network layer to varying MRI contrast of the same anatomy (shown in Fig. 9). Compared to VoxelMorph, the feature variability within the deeper layers is significantly lower for the SynthMorph models. Fig. 10 illustrates this result, containing example feature maps extracted from the last network layer before the SVF is formed.
Overall, SynthMorph models exhibit substantially less variability in response to contrast changes than all other methods tested, indicating that the proposed strategy does indeed encourage contrast invariance.    Fig. 11. Accuracy as a function of moving-image contrast across 10 realistic (a) FLASH and (b) MPRAGE image pairs. In each registration, the fixed image has the same T1w contrast. The moving image becomes decreasingly T1w towards the right. Being comparable across methods, error bars are shown for ANTs only and indicate the standard error of the mean over subjects.
smoothness b K , number of features n per layer (network width), bias-field range b B , gamma-augmentation strength σ γ and relative resolutions r. Third, for the case that brain segmentations are available (sm-brains), we analyze the effect of training with full-head labels, brain labels only, or a mixture of both. Unless indicated, we test all hyperparameters using n = 64 convolutional filters per layer. For comparability, both SynthMorph variants use SVFs {v m , v f } sampled at a single resolution r v . 2) Results: Fig. 13 shows registration performance for various training settings. Variant sm-brains performs best at low deformation strength b v , when label maps s from two different subjects are used at each iteration (Fig. 13a), likely because the differences between distinct subjects already provide significant variation. For {s m , s f }, a larger value of b v = 3 is optimal due to the lacking inter-subject deformation, since we generate {s m , s f } from a single segmentation s.
Random blurring of the images {m, f } improves robustness   Feature variability for registration across (a) FLASH and (b) MPRAGE contrasts from 10 distinct subject pairs. We use normalized RMSD d between each contrast and the most T1w-like, averaged over contrasts, features, subjects. All models use the same architecture with n = 64 filters per layer. SynthMorph variants exhibit the least variability in the deeper layers (red boxes). Error bars show the standard error of the mean over features.
to data with different smoothing levels, with optimal accuracy at b K ≈ 1 (Fig. 13b). Higher numbers of filters n per convolutional layer boost the accuracy at the cost of increasing training times (Fig. 13c), indicating that richer networks better capture and generalize from synthesized data. We identify the optimum bias-field cap and gamma-augmentation SD as b B = 0.3 and σ γ = 0.25, respectively. We obtain the highest accuracy when we sample the SVF and bias field at relative resolutions r v = 1:16 and r B = 1:40, respectively (Fig. 13f). Finally, training on full-head as compared to skull-stripped images has little impact on accuracy (not shown). Fig. 14a shows that with decreasing regularization, accuracy increases for the large structures used in L dis . When we include smaller structures, the mean overlap D reduces for λ < 1, as the network then focuses on optimizing the training structures. This does not apply to sm-shapes, which is agnostic to anatomy since we train it on all synthetic labels present in the random maps. Fig. 14b shows a small proportion of locations where the warp field folds, decreasing with increasing λ. For test results, we use λ = 1, where the proportion of folding voxels is below 10 −6 at our numerical precision. At fixed λ = 1, increasing the number of integration steps reduces voxel folding, about 6-fold for 10 instead of 5 steps, after which further increases have no effect.

H. Experiment 4: cine-cardiac application
In this experiment we test SynthMorph and VoxelMorph on cine-cardiac MRI to assess how these models transfer to a domain with substantially different image content. The goal is to analyze whether already trained models extend beyond neuroimaging, rather than claiming their outperformance over methods specifically developed for the task. We choose the dataset because the trained networks assume affine registration of the input images, which can be challenging in non-brain applications, whereas cardiac frames from the same subject are largely aligned. This provides an opportunity for testing registration of images with structured background within contrast; we test cross-contrast registration in Section III-E and Section III-F. 1) Setup: Non-rigid registration of cardiac images from the same subject is an important tool that can help assess cardiovascular health. Some approaches choose an end-diastolic frame as the fixed image, as it is easily identified [95], [96]. Thus, we pair an end-systolic with an end-diastolic frame for each of 33 subjects, corresponding to maximum cardiac contraction and expansion. For 3D registration of these pairs, we use already trained SynthMorph and VoxelMorph models without optimizing for the new task.
2) Results: Table III compares the effect on mean symmetric surface distance for the best-performing SynthMorph (sm-shapes) and VoxelMorph (vm-ncc) models. Registration with sm-shapes reduces MSD between the epicardial contours by ∆S/S = (11.6 ± 1.5)% on average, improving MSD for 85% of pairs (lower MSD is better). The mean reduction for vm-ncc is only ∆S/S = (4.3±1.4)%. While the pairs that do not improve appear visually unchanged, MSD increases slightly: for example, the most substantial decrease for sm-shapes is 35.4%, but the least accurate registration only results in a 3.9% increase. While the performance gap between the models is smaller for endocardial MSD, sm-shapes still outperforms vm-ncc. The models sm-brains and vm-nmi underperform sm-shapes and vm-ncc in terms of MSD, respectively. Fig. 15 shows exemplary cardiac frames before and after registration with sm-shapes along with the displacement fields, illustrating how SynthMorph leaves most anatomy intact while focusing on dilation of the heart to match its late-diastolic shape.

IV. DISCUSSION
We propose SynthMorph, a general strategy for learning contrast-invariant registration that does not require any imaging data during training. We remove the need for acquired data by synthesizing images randomly from noise distributions.

A. Generalizability
A significant challenge in the deployment of neural networks is their generalizability to image types unseen during training. Existing learning methods like VoxelMorph achieve good registration performance but consistently fail for new MRI contrasts at test time. For example, vm-ncc is trained on T1w pairs and breaks down both across contrasts (e.g. T1w-T2w) and within new contrasts (e.g. T2w-T2w). The SynthMorph strategy addresses this weakness and makes networks resilient to contrast changes by exposing them to a wide range of synthetic images, far beyond the shapes and contrasts typical of MRI. This approach obviates the need for retraining to register images acquired with a new sequence. Training conventional VoxelMorph with a loss evaluated on T1w images while augmenting the input contrasts enables the transfer of domain-specific specific knowledge to cross-contrast registration tasks. However, the associated decrease in within-contrast performance indicates the benefit of SynthMorph: learning to match anatomical features independent of their appearance in the gray-scale images.
The choice of optimum hyperparameters also is an important problem for many deep learning applications. While the grid search of Fig. 13 illustrates the dependency of accuracy on hyperparameter values, SynthMorph performance is robust over the ranges typical of medical imaging modalities, e.g. smoothing kernels with SD σ K ∈ [0, 2].
We select SynthMorph hyperparameters for all experiments based on the analysis of Fig. 13, using validation data that do not overlap with the test sets. The chosen parameters (Table I) enable robust registration across six different test sets in Section III-E and over a landscape of continually evolving MRI contrasts in Section III-F, demonstrating their generalizability across datasets.

B. Baseline comparison
Networks trained within the SynthMorph framework do not have access to the MRI contrasts of the test set nor indeed to any MRI data at all. Yet sm-shapes matches state-ofthe-art classical performance within contrasts and provides substantial improvements in cross-contrast performance over ANTs and NiftyReg, all while being substantially faster.
Registration accuracy varies with the particular contrast pairings, likely because anatomical structures appear different on images acquired with different MRI sequences. There is no guarantee that a structure will have contrast with neighboring structures and can be registered well to a scan of a particular MRI contrast (e.g. PDw). Nevertheless, SynthMorph outperforms both classical and learning-based methods across contrasts, demonstrating that it can indeed register new image types, to the extent permitted by the intrinsic contrast. If brain segmentations are available, including these in the image synthesis enables the sm-brains network to outperform all methods tested by a substantial margin-at any contrast combination tested-although this model still does not require any acquired MR images during training.
Visual inspection of typical deformation fields in Fig. 6 provides an interesting insight: the sm-brains network appears to learn to identify the structures of interest optimized in the loss. Thus, it focuses on registering these brain regions and their close neighbors, while leaving the background and structures such as the skull unaffected. This anatomical knowledge enables registration of skull-stripped images to data including the full head. While the resulting deformations may appear less regular than those estimated by classical methods, our analysis of the Jacobian determinant demonstrates comparable field regularity across methods.

C. Dice-loss sensitivity
When training on synthesized structures with arbitrary geometry, the network learns to generally match shapes based on contrast. The sm-shapes model does not learn to register specific human anatomical structures or sub-structures since we never expose it to specific neuroanatomy and instead sample random shapes of all sizes during training. In the experiment trained on brain anatomy, the model matches substructures within labels if they manifest contrast. If substructures are not discernible, the smooth regularization yields reasonable predictions. This can be observed with sm-brains for smaller structures that are not included in the dissimilarity loss L dis but for which we obtain competitive validation Dice scores, e.g. the 3 rd and 4 th ventricle.

D. Supervised or unsupervised?
Since ground-truth deformation fields are available for sm-shapes, we also train baseline models in a supervised manner. This approach consistently under-performs its unsupervised counterpart, for which we propose three possible explanations. First, several different deformations can result in the same warped brain, which has the potential to introduce a level of ambiguity into the registration problem that makes it challenging to train a reliable predictor. Second, related to this point, image areas with little intensity variation such as the background or central parts of the white matter offer no guidance for the supervised network to match the arbitrary ground-truth deformation, compared to unsupervised models, that are driven by the regularization term in those areas. Third, the synthesized transforms may not represent an exact identifiable mapping between the source and target image because of errors introduced by nearest-neighbor interpolation of the input label maps and further augmentation steps including image blurring and additive noise.

E. Further work
While SynthMorph addresses important drawbacks of within and between-contrast registration methods, it can be expanded in several ways.
First, we plan to extend our framework to incorporate affine registration [47], [54], [55], [97]. We will explore whether the simultaneous estimation of affine and deformable transforms can improve accuracy and thoroughly investigate the appropriateness of architectures for doing this in heterogeneous data. In the current work, the input images {m, f } need prior affine alignment for optimal results. Although this preprocessing step is beyond the focus of our current contribution, the code we make available includes an optimization-based affine solution, thus providing full registration capabilities independent of third-party tools. The optimization estimates 12 affine parameters for each new pair of 3D images in ∼10 seconds, with accuracy comparable to ANTs and NiftyReg.
Second, our approach promises to be extensible to unprocessed images acquired with any MRI sequence, of any body part, possibly even beyond medical imaging. While this is an exciting area of research, the present work focuses on neuroimaging applications since the breadth of the analyses required is beyond the scope of a single solid contribution.
Third, an obvious extension is to combine the simulation strategy with existing image data that might already be available. We plan to investigate whether including real MRI scans would aid, or instead bias the network and reduce its ability to generalize to unseen contrast variations.

F. Invariant representations
We investigate why the SynthMorph strategy enables substantial improvements in registration performance. In particular, we evaluate how accuracy responds to gradual changes in MRI contrast and show that the deep layers of SynthMorph models exhibit a greater degree of invariance to contrast changes than networks trained in a conventional fashion. We present qualitative and quantitative analyses demonstrating that the enhanced contrast invariance leads to highly robust registration across wide spectra of MR images simulated for two commonly used pulse sequences, FLASH and MPRAGE.

G. Cardiac registration
The cine-cardiac experiment demonstrates the viability and potential of SynthMorph applied to a domain with substantially different image content than neuroimaging. While we do not claim to outperform dedicated cardiac registration methods, sm-shapes reduces the MSD metric between the fixed and moving frames in the majority of subjects, to a greater extent than any of the sm-brains, vm-ncc, and vm-nmi models. The network achieves this result without any optimization for the anatomy or image type considered, using weights obtained with generation hyperparameters tuned for isotropic 3D brain registration. In contrast, the cardiac data are volumes resampled from stacks of slices with thicknesses exceeding the voxel dimension of our neuroimaging test sets by 9-fold on average. Although sm-shapes is not an optimized registration tool for cardiac MRI, its weights provide a great choice for initializing networks when training applicationspecific registration, since the model produces reasonable results and is unbiased towards any particular anatomy.

H. Domain-specific knowledge
The comparisons between sm-brains and sm-shapes in neuroimaging datasets indicate that SynthMorph performs substantially better when exploiting domain-specific knowledge. For the cardiac application, this could be achieved in the following ways. First, if the amplitude of cardiac motion exceeds the deformations sampled during sm-shapes training, increasing hyperparameter b v will be beneficial. Second, a lower regularization weight λ may be favorable for cardiac motion, which is characterized by considerable displacements within a small portion of space. Third, anatomical segmentations in fields other than neuroimaging often include fewer different labels. To overcome this challenge and synthesize images complex enough for networks to learn anatomy-specific registration, these label maps could be augmented by including arbitrary geometric shapes as diverse backgrounds.
Qualitatively, our experience is that generation hyperparameters represent a trade-off between (1) sampling from a distribution large enough to include the features of a target dataset while promoting network robustness by exposure to broad variability, and (2) ensuring that the network capacity is adequate for capturing the sampled variation. As an alternative to making domain-specific informed changes to the generation hyperparameters and retraining networks, recent work suggests to optimize hyperparameter values efficiently at test time using hypernetworks [98]. In addition to a registration pair, such hypernetworks take as input a set of hyperparameters and output the weights of a registration network, thus modeling a continuum of registration networks each trained with different hyperparameter values.

I. Data requirements for registration
The baseline comparison reveals that neither augmenting nor adding data in VoxelMorph training boosts performance. While counter-intuitive to intuitions about deep learning in classification tasks, this result is consistent with recent findings confirming that large datasets are not necessary for tasks like deformable registration and segmentation, that have sizable input and output spaces [58], [59], [99]: in effect, every image voxel can be thought of as a data sample, although these are, of course, not independent. For example, reasonable segmentation performance can be achieved with only a handful of annotated images [58]. For registration, our analysis shows that SynthMorph training with label maps from only 40 subjects enables outperformance of all other methods tested.
We train the VoxelMorph baseline using images from 100 subjects, randomly flipping the axes of each input pair, which already gives rise to 79,200 different cross-subject image combinations. An analysis in the VoxelMorph paper [17] comparing training sets of size 100 and 3231 without randomly flipping axes provides further evidence that larger datasets do not necessarily lead to significant performance gains.

V. CONCLUSION
Our study establishes the utility of training on synthetic data only and indicates a novel way of thinking about feature invariance in the context of registration. SynthMorph enables users to build on the strengths of deep learning, including rapid execution, increased robustness to local minima and outliers, and flexibility in the choice of loss functions, by now having the previously-missing ability to generalize to any MRI contrast at test time. This leads us to believe the strategy can be broadly applied to networks to limit the need for training data while vastly improving applicability.