Anatomy-aware and acquisition-agnostic joint registration with SynthMorph

Abstract Affine image registration is a cornerstone of medical-image analysis. While classical algorithms can achieve excellent accuracy, they solve a time-consuming optimization for every image pair. Deep-learning (DL) methods learn a function that maps an image pair to an output transform. Evaluating the function is fast, but capturing large transforms can be challenging, and networks tend to struggle if a test-image characteristic shifts from the training domain, such as the resolution. Most affine methods are agnostic to the anatomy the user wishes to align, meaning the registration will be inaccurate if algorithms consider all structures in the image. We address these shortcomings with SynthMorph, a fast, symmetric, diffeomorphic, and easy-to-use DL tool for joint affine-deformable registration of any brain image without preprocessing. First, we leverage a strategy that trains networks with widely varying images synthesized from label maps, yielding robust performance across acquisition specifics unseen at training. Second, we optimize the spatial overlap of select anatomical labels. This enables networks to distinguish anatomy of interest from irrelevant structures, removing the need for preprocessing that excludes content which would impinge on anatomy-specific registration. Third, we combine the affine model with a deformable hypernetwork that lets users choose the optimal deformation-field regularity for their specific data, at registration time, in a fraction of the time required by classical methods. This framework is applicable to learning anatomy-aware, acquisition-agnostic registration of any anatomy with any architecture, as long as label maps are available for training. We analyze how competing architectures learn affine transforms and compare state-of-the-art registration tools across an extremely diverse set of neuroimaging data, aiming to truly capture the behavior of methods in the real world. 
SynthMorph demonstrates high accuracy and is available at https://w3id.org/synthmorph, as a single complete end-to-end solution for registration of brain magnetic resonance imaging (MRI) data.


Introduction
Image registration is an essential component of medical image processing and analysis that estimates a mapping from the space of the anatomy in one image to the space of another image (Cox, 1996; Fischl et al., 2002, 2004; Jenkinson et al., 2012). Such transforms generally include an affine component accounting for global orientation differences, such as different head positions, which are typically not of clinical interest. Transforms often include a deformable component that may represent anatomically meaningful differences in geometry (Hajnal and Hill, 2001), and many techniques such as voxel-based morphometry (Ashburner and Friston, 2000; Whitwell, 2009) analyze these further.
Iterative registration has been extensively studied, and the available methods can achieve excellent accuracy both within and across MRI contrasts (Ashburner, 2007; Cox and Jesmanowicz, 1999; Friston et al., 1995; Jiang et al., 1995; Lorenzi et al., 2013; Rohr et al., 2001; Rueckert et al., 1999). Approaches differ in how they measure image similarity and the strategy chosen to optimize it, but the fundamental algorithm is the same: fit a set of parameters modeling the spatial transformation between an image pair by iteratively minimizing a dissimilarity metric. While classical deformable registration can take tens of minutes to several hours, affine registration optimizes only a handful of parameters and is generally faster (Hoffmann et al., 2015; Jenkinson and Smith, 2001; Modat et al., 2014; Reuter et al., 2010). For these reasons, classical affine methods are still widely used both within analysis pipelines and for more specialized applications such as correcting head motion during image acquisition (Gallichan et al., 2016; Thesen et al., 2000; Tisdall et al., 2012). However, these approaches solve an optimization problem for every new image pair, which is inefficient: depending on the algorithm, affine registration of higher-resolution structural MRI, for example, can easily take 5-10 minutes (Table 3). Further, iterative pipelines can be laborious to use. The user typically has to tailor the optimization strategy and choose a similarity metric appropriate for the image appearance (Pustina and Cook, 2017). Often, images require preprocessing such as intensity normalization or removal of structures that the registration should exclude. These shortcomings have motivated work on deep-learning (DL) based registration.
Recent advances in DL have enabled registration with unprecedented efficiency and accuracy (Balakrishnan et al., 2019; Dalca et al., 2018; Eppenhof and Pluim, 2019; Krebs et al., 2017; Li and Fan, 2017; Rohé et al., 2017; Sokooti et al., 2017; Yang et al., 2016, 2017). In contrast to classical approaches, DL models learn a function that maps an input registration pair to an output transform, and evaluating this function on a new pair of images is fast. However, most existing DL methods focus on the deformable component. Affine registration of the input images is often assumed (Balakrishnan et al., 2019; de Vos et al., 2017) or incorporated ad hoc, and thus given less attention than deformable registration (De Vos et al., 2019; Hu et al., 2018; Mok and Chung, 2022; Zhao et al., 2019b,d). Although state-of-the-art deformable algorithms are capable of compensating for sub-optimal affine alignment to some extent, it can be challenging to estimate locally accurate deformable transforms while dedicating a substantial portion of the model capacity to affine alignment. Further, any inaccuracy in the affine transform will make it harder to interpret the deformable component (Bookstein, 2001; Ou et al., 2014), which will now include an undesired affine residual.
The learning-based models encompassing both affine and deformable components usually do not consider network generalization to modality variation (De Vos et al., 2019; Shen et al., 2019; Zhao et al., 2019b,d; Zhu et al., 2021). That is, networks trained on one type of data, such as T1-weighted (T1w) MRI, tend to inaccurately register other types of data, such as T2-weighted (T2w) scans. Even for similar MRI contrast, the domain shift caused by unseen noise or smoothness levels alone has the potential to reduce accuracy at test time. In contrast, learning frameworks capitalizing on generalization techniques and domain adaptation often do not incorporate the fundamental affine transform (Chen et al., 2017; Iglesias et al., 2013; Qin et al., 2019; Tanner et al., 2018).
A separate challenge for affine registration consists in accurately aligning specific anatomy of interest in the image while ignoring irrelevant content. Any undesired structure that moves independently or even deforms nonlinearly will reduce the accuracy of the anatomy-specific transform unless an algorithm has the ability to ignore it. For example, neck and tongue tissue can confuse rigid brain registration when they deform non-rigidly (Andrade et al., 2018; Fein et al., 2006; Fischmeister et al., 2013; Hoffmann et al., 2020).
Finally, identifying an optimal architecture for affine registration and formulating the problem in a differentiable manner will enable embedding and jointly learning affine registration with other tasks, for example creating conditional template images representing a subject population from non-aligned input images (Dalca et al., 2019a; Ding and Niethammer, 2022; Sinclair et al., 2022).

Contribution
In this work we present a single, easy-to-use DL tool for end-to-end affine and deformable brain registration of images right off the MRI scanner, without preprocessing (Figure 1). The tool performs robustly across MRI contrasts, intensity scales, and resolutions. We address the domain dependency and anatomical non-specificity of affine registration: while invariance to acquisition specifics will enable networks to generalize to new image types without retraining, our anatomy-specific training strategy alleviates the need for segmentation to remove distracting image content prior to registration, for example with skull-stripping (Eskildsen et al., 2012; Hoopes et al., 2022; Iglesias et al., 2011; Salehi et al., 2017; Smith, 2002).
Our work builds on ideas from DL-based registration, affine registration, and a recent synthesis-based training strategy that promotes data independence by exposing networks to arbitrary image contrasts (Billot et al., 2020; Hoffmann et al., 2021; Hoopes et al., 2022).
First, we rigorously analyze three fundamental network architectures to provide insight into how DL models learn and best represent the affine component in Section 4.4, using a broad collection of images that capture the diversity of real-world data. Second, we select and optimize a suitable architecture and train the network with synthetic data only, making it robust across a landscape of acquired image types since it has never been exposed to any real images during training. Third, we test the resulting model on a range of real datasets and compare its performance to readily available affine algorithms in Section 4.5, to thoroughly assess the registration accuracy achievable with off-the-shelf implementations on images unseen at training. Fourth, we combine the affine model with a deformable network to create an end-to-end registration tool, and evaluate its performance against popular toolboxes in Section 4.6.

Related work
There is substantial work on medical image registration. While this section provides an overview of successful and widely adopted strategies, more in-depth review articles are available (Fu et al., 2020; Oliveira and Tavares, 2014; Wyawahare et al., 2009).

Classical registration
Classical registration is driven by an objective function, which measures similarity in appearance between the moving and the fixed image. A simple and effective choice for images of the same contrast is the mean squared error (MSE). Normalized cross-correlation (NCC) is also widely used, because it provides excellent accuracy independent of the intensity scale (Avants et al., 2008). Registration of images across contrasts or modalities generally employs objective functions such as normalized mutual information (NMI) (Maes et al., 1997; Wells III et al., 1996) or the correlation ratio (Roche et al., 1998), as these do not assume similar appearance of the input images. Although rarely used in neuroimaging, metrics based on patch similarity (Heinrich et al., 2012) can sometimes outperform simpler metrics across modalities (Hoffmann et al., 2021).
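For readers implementing these objectives, the two same-contrast metrics above can be sketched in a few lines. This is a minimal numpy illustration, not the implementation used by any of the cited packages.

```python
import numpy as np

def mse(a, b):
    """Mean squared error: suitable when both images share an intensity scale."""
    return float(np.mean((a - b) ** 2))

def ncc(a, b, eps=1e-8):
    """Normalized cross-correlation: invariant to global intensity scale and offset."""
    a = a - a.mean()
    b = b - b.mean()
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

rng = np.random.default_rng(0)
img = rng.random((32, 32))
scaled = 3.0 * img + 7.0  # same structure, different intensity scale
```

For the rescaled copy, MSE grows large while NCC remains close to 1, which is why NCC is preferred when intensity scales differ between inputs.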
To improve computational efficiency and avoid local minima, many classical techniques perform multiresolution searches (Hellier et al., 2001; Nestares and Heeger, 2000).
First, this strategy coarsely aligns smoothed, downsampled versions of the input images. This initial solution is subsequently refined at higher resolutions, until the original images align precisely (Avants et al., 2011; Modat et al., 2014; Reuter et al., 2010). Additionally, an initial grid search over a set of rotation parameters can help constrain this scale-space approach to a neighborhood around the global optimum (Jenkinson and Smith, 2001; Jenkinson et al., 2012).
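A minimal sketch of such a scale-space pyramid, using simple average pooling as a stand-in for the Gaussian smoothing that production tools typically apply before subsampling:

```python
import numpy as np

def downsample(img):
    """Average-pool by a factor of 2 along each dimension (crude smoothing + subsampling)."""
    s = (img.shape[0] // 2, 2, img.shape[1] // 2, 2)
    return img[: s[0] * 2, : s[2] * 2].reshape(s).mean(axis=(1, 3))

def pyramid(img, levels=3):
    """Coarse-to-fine image pyramid, coarsest level first.

    A multiresolution search aligns the coarsest pair first, then refines
    the estimate at each successively finer level."""
    out = [img]
    for _ in range(levels - 1):
        out.append(downsample(out[-1]))
    return out[::-1]

levels = pyramid(np.ones((64, 64)), levels=3)
shapes = [l.shape for l in levels]
```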
Instead of optimizing image similarity, another registration paradigm detects landmarks and matches these across the images (Myronenko and Song, 2010). Early work relied on user assistance to identify fiducials (Besl and McKay, 1992; Meyer et al., 1995). More recent computer-vision approaches automatically extract features (Machado et al., 2018; Toews and Wells III, 2013), for example from entropy (Wachinger and Navab, 2010, 2012) or difference-of-Gaussians images (Lowe, 2004; Rister et al., 2017; Wachinger et al., 2018), and the performance of the strategy depends on the invariance of landmarks across viewpoints and intensity scales (Matas et al., 2004).

Deep-learning registration
Analogous to classical registration, unsupervised deformable DL methods fit the parameters of a deep neural network by optimizing a loss function that measures image similarity, but across many image pairs (Balakrishnan et al., 2019; Dalca et al., 2019b; De Vos et al., 2019; Guo, 2019; Hoffmann et al., 2021; Krebs et al., 2019). In contrast, supervised DL strategies (Eppenhof and Pluim, 2019; Krebs et al., 2017; Rohé et al., 2017; Sokooti et al., 2017; Yang et al., 2016, 2017) train a network to reproduce ground-truth transforms, for example obtained with classical tools, and tend to underperform relative to their unsupervised counterparts (Hoffmann et al., 2021; Young et al., 2022), although warping features at the end of each U-Net (Ronneberger et al., 2015) level can close the performance gap (Young et al., 2022).

Affine deep-learning registration
Similar to the deformable case, affine registration strategies can be supervised or unsupervised but require different network architectures. A straightforward option combines a convolutional encoder with a fully connected (FC) layer to predict the parameters of an affine transform in one shot (Shen et al., 2019; Zhao et al., 2019b,d; Zhu et al., 2021). A series of convolutional blocks successively halves the spatial dimensions, such that the output of the final convolution has substantially fewer voxels than the input images. This facilitates the use of the FC layer with the desired number of output units while preventing the number of network parameters from becoming intractably large. Networks typically concatenate the input images before passing them through the encoder. To benefit from weight sharing, Siamese networks pass the fixed and moving images separately through the same encoder and connect their outputs at the end (Chen et al., 2021; De Vos et al., 2019).
As affine transforms have a global effect on the image, some architectures replace the locally operating convolutional layers with vision transformers (Dosovitskiy et al., 2020; Mok and Chung, 2022). These models subdivide their inputs into patch embeddings and pass them through the transformer, before a multi-layer perceptron (MLP) outputs a transformation matrix. Multiple such modules in series can successively refine the affine transform if each module applies its output transform to the moving image before passing it on to the next stage (Mok and Chung, 2022). Composition of the transforms from each step produces the final output matrix.
Another affine DL strategy (Moyer et al., 2021; Yu et al., 2022) derives an affine transform without requiring any MLP or FC layers, similar to the classical feature extraction and matching approach (Section 2.1). This method separately passes the moving and the fixed image through a single convolutional encoder to detect two corresponding sets of feature maps. Computing the barycenter of each feature map yields moving and fixed point clouds, and a least-squares (LS) fit provides a transform aligning them.
The approach is robust across large transforms (Yu et al., 2022), and removing the FC layer frees the architecture from its dependence on a single, specific image size.
In this work we will thoroughly test these fundamental DL architectures and extend them to build an end-to-end solution for joint affine and deformable registration that is aware of the anatomy of interest.

Robustness and anatomical specificity
Indiscriminate registration of images as a whole can limit the accurate alignment of specific substructures, such as the brain in whole-head MRI. One group of classical methods avoids this problem by down-weighting image regions that cannot be mapped accurately with the chosen transformation model, for example using an iteratively re-weighted least-squares (LS) algorithm (Billings et al., 2015; Gelfand et al., 2005; Modat et al., 2014; Nestares and Heeger, 2000; Puglisi and Battiato, 2011; Reuter et al., 2010). Few approaches focus on specific anatomical features, for example by restricting the registration to regions of an atlas with high prior probability of belonging to a particular tissue class (Fischl et al., 2002). The affine registration tools most commonly used in neuroimage analysis (Cox, 1996; Friston et al., 1995; Jenkinson and Smith, 2001; Modat et al., 2014) instead expect, and require, that distracting image content be removed from the input data as a preprocessing step for optimal performance (Eskildsen et al., 2012; Iglesias et al., 2011; Klein et al., 2009; Smith, 2002). Similarly, many DL registration algorithms assume intensity-normalized and skull-stripped input images (Balakrishnan et al., 2019; Yu et al., 2022; Zhao et al., 2019d), limiting their applicability to diverse and unprocessed data.

Domain generalizability
The adaptability of neural networks to out-of-distribution data generally presents a challenge to their deployment (Sun et al., 2016; Wang and Deng, 2018). Mitigation strategies include augmenting the variability of the training distribution, for example by adding random noise or applying geometric transforms (Chaitanya et al., 2019; Perez and Wang, 2017; Shorten and Khoshgoftaar, 2019; Zhao et al., 2019a). Transfer learning adapts a trained network to a new domain by fine-tuning deeper layers on the target distribution (Kamnitsas et al., 2017; Zhuang et al., 2020). These methods require training data from the target domain. By contrast, within medical imaging, a recent strategy synthesizes unrealistically variable training images to promote data independence. The resulting networks generalize beyond dataset specifics and perform with high accuracy on tasks including segmentation (Billot et al., 2020), deformable registration (Hoffmann et al., 2021), and skull-stripping (Hoopes et al., 2022). We build on this technology to achieve end-to-end registration incorporating the affine component.

Method
3.1 Background

Learning-based registration
Let m be a moving and f a fixed image in N-dimensional (ND) space. We train a deep neural network h_θ with learnable parameters θ to predict a global transform T_θ : Ω → R^N that maps the spatial domain Ω of f onto m, given images {m, f}. The transform T_θ = h_θ(m, f) is a matrix

$$T_\theta = \begin{pmatrix} A_\theta & v_\theta \\ 0 & 1 \end{pmatrix}, \qquad t_\theta = \begin{pmatrix} A_\theta & v_\theta \end{pmatrix} \in \mathbb{R}^{N \times (N+1)}, \tag{1}$$

where matrix A_θ ∈ R^{N×N} represents rotation, scaling, and shear, and v_θ ∈ R^{N×1} is a vector of translational shifts. We fit the network weights θ to training set D subject to

$$\hat\theta = \arg\min_\theta \, \mathbb{E}_{(m,f) \sim D} \big[ L(T_\theta, m, f) \big], \tag{2}$$

where we define the loss L(T_θ, m, f) = L_sim(m • T_θ, f), L_sim measures the loss of similarity in appearance between two images, and m • T_θ means m transformed by T_θ.
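To make the parameterization concrete, the sketch below builds the homogeneous matrix T_θ from an illustrative A_θ and v_θ and applies it to a coordinate. The values are arbitrary stand-ins, not outputs of any trained network.

```python
import numpy as np

N = 3
rng = np.random.default_rng(0)
A = np.eye(N) + 0.05 * rng.standard_normal((N, N))  # rotation/scale/shear part A_theta
v = rng.standard_normal((N, 1))                     # translation vector v_theta

t = np.hstack([A, v])                       # N x (N+1) parameter matrix t_theta
T = np.vstack([t, np.r_[np.zeros(N), 1.0]])  # (N+1) x (N+1) homogeneous matrix T_theta

# Mapping a coordinate x through T amounts to A @ x + v.
x = np.array([10.0, -4.0, 2.0])
x_h = np.append(x, 1.0)  # homogeneous coordinate (x 1)
y = (T @ x_h)[:N]
```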

Synthesis-based training
A recent strategy (Billot et al., 2020; Hoffmann et al., 2021; Hoopes et al., 2022) achieves robustness to preprocessing and acquisition specifics by training networks exclusively with synthetic images generated from label maps. From a set of label maps {s_m, s_f}, we synthesize corresponding wildly variable images {m, f} as network inputs, and we optimize spatial label overlap with a (soft) Dice loss (Milletari et al., 2016) independent of image appearance,

$$L(T_\theta, s_m, s_f) = -\frac{1}{|J|} \sum_{j \in J} \frac{2 \sum_{x \in \Omega} (s_m^j \bullet T_\theta)(x) \, s_f^j(x)}{\sum_{x \in \Omega} \big[ (s_m^j \bullet T_\theta)(x) + s_f^j(x) \big]}, \tag{3}$$

where s^j represents the one-hot encoded label j ∈ J of label map s, defined at the voxel locations x ∈ Ω in the discrete spatial domain Ω of f. Requiring only a few input label maps, the generation produces a stream of diverse training images, helping the model accurately generalize to real medical images of any contrast at test time, which can then be registered without needing label maps.
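A minimal soft Dice loss over a label subset J might look as follows. This is an illustrative numpy version on one-hot arrays, not the loss implementation used by the authors' training framework.

```python
import numpy as np

def soft_dice_loss(sm, sf, labels, eps=1e-6):
    """Mean soft Dice loss over a label subset J of one-hot (or probabilistic) maps.

    sm, sf: arrays of shape (num_labels, *spatial), e.g. warped moving and fixed.
    Returns a value in [-1, 0]; -1 means perfect overlap of every label in J."""
    dices = []
    for j in labels:
        inter = 2.0 * np.sum(sm[j] * sf[j])
        denom = np.sum(sm[j]) + np.sum(sf[j]) + eps
        dices.append(inter / denom)
    return -float(np.mean(dices))

# Toy 1D example with two one-hot labels over four voxels.
sm = np.array([[1, 1, 0, 0], [0, 0, 1, 1]], float)
loss_same = soft_dice_loss(sm, sm, labels=[0, 1])  # perfect overlap
```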

Anatomy-aware affine registration
As we build on our recent work on deformable registration, SynthMorph (Hoffmann et al., 2021), we only provide a high-level overview, focusing on what is new. Figure 2 illustrates our learning setup for affine registration.

Label maps
Every training iteration, we draw a pair of moving and fixed brain segmentations. We apply random spatial transformations to each of them to augment the range of head orientations and anatomical variability in the training set. Specifically, we construct an affine matrix from random translation, rotation, scaling, and shear, as detailed in Appendix A. We compose the affine transform with a randomly sampled and randomly smoothed deformation field (SynthMorph) and apply the resulting composite transform in a single interpolation step. Finally, we simulate acquisitions with a partial field of view (FOV) by randomly cropping the label map content, yielding label maps {s_m, s_f}.
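The affine part of this augmentation can be sketched as below for the 2D case. The parameter ranges are illustrative placeholders rather than the hyperparameters of Table B.1, and the composition order is one of several possible conventions.

```python
import numpy as np

def random_affine_2d(rng, max_shift=30.0, max_rot=45.0, max_scale=0.1, max_shear=0.1):
    """Compose a random 2D affine from translation, rotation, scaling, and shear."""
    tx, ty = rng.uniform(-max_shift, max_shift, 2)
    theta = np.deg2rad(rng.uniform(-max_rot, max_rot))
    sx, sy = 1.0 + rng.uniform(-max_scale, max_scale, 2)
    k = rng.uniform(-max_shear, max_shear)

    T = np.array([[1, 0, tx], [0, 1, ty], [0, 0, 1.0]])              # translation
    R = np.array([[np.cos(theta), -np.sin(theta), 0],
                  [np.sin(theta),  np.cos(theta), 0], [0, 0, 1.0]])  # rotation
    S = np.array([[sx, 0, 0], [0, sy, 0], [0, 0, 1.0]])              # scaling
    H = np.array([[1, k, 0], [0, 1, 0], [0, 0, 1.0]])                # shear
    return T @ R @ S @ H  # one possible composition order

M = random_affine_2d(np.random.default_rng(0))
```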

Anatomical specificity
Let K be the complete set of labels in {s_m, s_f}. To encourage networks to register specific anatomy while ignoring irrelevant image content, we propose to recode {s_m, s_f} such that they include only a subset of labels J ⊂ K. For brain-specific affine registration, we merge brain structures such that J consists of the tissue classes gray matter (GM), white matter (WM), and cerebrospinal fluid (CSF). The loss L optimizes only the overlap of J, whereas we synthesize images from the complete set of labels K, as illustrated in Figure 2.
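A toy example of such label recoding, with hypothetical integer label codes that are not the actual codes used by the training segmentations:

```python
import numpy as np

# Hypothetical codes: 0 background, 2 WM, 3 GM, 4 CSF, 7 skull, 8 neck.
full_map = np.array([0, 7, 2, 3, 4, 8, 3, 0])

# Keep only the subset J = {WM, GM, CSF} for the loss, mapping every
# non-brain label (skull, neck, ...) to background so the loss ignores it.
recode = {2: 1, 3: 2, 4: 3}  # WM -> 1, GM -> 2, CSF -> 3
loss_map = np.array([recode.get(int(v), 0) for v in full_map])
```

Images are still synthesized from the full label map, so the network sees non-brain content at its input while the loss rewards only brain overlap.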

Image synthesis
Given label map s_m, we generate image m with random contrast, noise, and artifact corruption (and similarly f from s_f). Following SynthMorph, we first sample a mean intensity for each label j ∈ K in s_m and set all voxels of m associated with label j to this value. Second, we corrupt m by randomly applying additive Gaussian noise, anisotropic Gaussian blurring, a multiplicative spatial intensity bias field, intensity exponentiation with a global parameter (gamma), and downsampling along randomized axes. In aggregate, these steps produce widely varying intensity distributions within each anatomical label (Figure 3).
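The synthesis steps can be sketched as follows. The distributions and parameter ranges here are illustrative stand-ins for the hyperparameters of Table B.1, and the downsampling step is omitted for brevity.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def synthesize(label_map, rng):
    """Turn an integer label map into a randomly corrupted grayscale image.

    Simplified sketch: per-label mean intensities, blur, additive noise,
    a smooth multiplicative bias field, and global gamma exponentiation."""
    img = np.zeros(label_map.shape, float)
    for j in np.unique(label_map):
        img[label_map == j] = rng.uniform(0, 1)          # mean intensity per label
    img = gaussian_filter(img, sigma=rng.uniform(0, 2))  # random blurring
    img += rng.normal(0, 0.05, img.shape)                # additive Gaussian noise
    bias = gaussian_filter(rng.normal(0, 0.3, img.shape), sigma=8)
    img *= np.exp(bias)                                  # smooth intensity bias field
    img = np.clip(img, 0, None)
    img = img ** np.exp(rng.normal(0, 0.25))             # global gamma exponentiation
    return (img - img.min()) / (img.max() - img.min() + 1e-8)

labels = np.zeros((64, 64), int)
labels[16:48, 16:48] = 1
image = synthesize(labels, np.random.default_rng(0))
```

Drawing a fresh image per iteration from the same label maps is what exposes the network to an endless stream of contrasts and artifacts.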

Generation hyperparameters
We choose the affine augmentation range such that it encompasses real transforms measured across public datasets: Figure C.1 (Appendix C) shows the distribution of these transforms. We adapt all other values from prior work, which thoroughly analyzed their impact on registration accuracy (Hoffmann et al., 2021). Table B.1 (Appendix B) lists the hyperparameters for label-map augmentation and image synthesis.

Joint registration
For joint registration, we combine the affine model h_θ with the deformable SynthMorph architecture as shown in Figure 4. Let g_η be a deformable model with trainable parameters η. We move the image m based on the affine transform T_θ = h_θ(m, f), and g_η predicts the warp field φ_η = g_η(m • T_θ, f), yielding the total transform ψ_θη = T_θ • φ_η. We add a regularization term L_reg to the Dice loss L of Equation (3) to encourage smooth deformations. For joint registration, the loss becomes

$$L_{joint} = L(\psi_{\theta\eta}, s_m, s_f) + \lambda \, L_{reg}(\phi_\eta), \tag{4}$$

where λ controls the weighting of the terms. We choose L_reg(φ) = ½‖∇u‖² with λ = 1, where u is the displacement of the deformation φ = id + u, and id is the identity field.

Affine architectures
Estimating an affine transform T from a pair of medical images in ND requires reducing a large input space of the order of 100k-10M voxels to only N(N+1) output parameters. We analyze three competing network architectures (Figure 5) that represent state-of-the-art methods (Balakrishnan et al., 2019; De Vos et al., 2019; Moyer et al., 2021; Shen et al., 2019; Yu et al., 2022; Zhu et al., 2021).

Parameter encoder
We first build on networks combining a convolutional encoder with an FC layer (Shen et al., 2019; Zhu et al., 2021) whose N(N+1) output units we interpret as parameters for translation, rotation, scale, and shear. We refer to a cascade of C such subnetworks h_i, with i ∈ {1, 2, ..., C}, as Encoder. Each h_i outputs a matrix constructed from the affine parameters as shown in Appendix A, to incrementally update the total transform. We obtain transform T_i by matrix multiplication after invoking subnetwork h_i,

$$T_i = h_i(m_{i-1}, f) \, T_{i-1}, \tag{5}$$

where m_i = m • T_i is the moving image transformed by T_i, and T_0 = I_N is the identity matrix. As the subnetworks h_i are architecturally identical, weight sharing is possible, and we evaluate versions of the model with and without weights shared across cascades.
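The incremental update rule can be illustrated with a stand-in subnetwork that always predicts the same small translation; in the real model, each h_i is a convolutional encoder with an FC head.

```python
import numpy as np

def fake_subnetwork(m_i, f):
    """Stand-in for subnetwork h_i: always returns a fixed 1-voxel x-shift.

    A trained h_i would instead predict an update from the current
    moving image m_{i-1} and the fixed image f."""
    step = np.eye(3)
    step[0, 2] = 1.0
    return step

T = np.eye(3)  # T_0 = identity
m = f = None   # images omitted in this sketch
for i in range(4):  # C = 4 cascades
    update = fake_subnetwork(m, f)
    T = update @ T  # incremental update of the total transform
```

Because each cascade refines the previous estimate, small per-stage corrections compose into a large total transform.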
For balanced gradient steps, we complete each subnetwork with a layer involving learnable local rescaling weights applied to the affine parameters before matrix construction.

Warp decomposer
We propose a second architecture building on deformable registration models (Balakrishnan et al., 2019; De Vos et al., 2019). Decomposer estimates a dense deformation field φ_θ with corresponding non-negative voxel weights ω_θ that we decompose into the affine output transform T_θ = h_θ(m, f) and a (discarded) residual component δ_θ, i.e. φ_θ = δ_θ • T_θ. The voxel weights ω_θ enable the network h_θ to focus the decomposition on the anatomy of interest. Both (φ_θ, ω_θ) are the output of a single fully convolutional network and thus benefit from weight sharing. We decompose φ_θ in a WLS sense over the discrete spatial domain Ω of f, using the definition of t from Equation (1) as the submatrix of T excluding the last row:

$$\hat t = \arg\min_t \sum_{x \in \Omega} \omega_\theta(x) \, \big\lVert (x \;\; 1) \, t^{\mathsf T} - \phi_\theta(x)^{\mathsf T} \big\rVert^2, \tag{6}$$

where t^T is the matrix transpose of t. Denoting W = diag(ω_θ), and by X and y the matrices whose corresponding rows are (x 1) and φ_θ(x) for each x ∈ Ω, respectively, the closed-form solution of Equation (6) is

$$\hat t^{\mathsf T} = (X^{\mathsf T} W X)^{-1} X^{\mathsf T} W y. \tag{7}$$
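The closed-form weighted least-squares solve is easy to verify numerically. The sketch below recovers a known transform t from noiseless synthetic correspondences; the data are arbitrary, standing in for the field φ_θ and weights ω_θ a network would predict.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.hstack([rng.standard_normal((n, 3)), np.ones((n, 1))])  # rows (x 1)
t_true = rng.standard_normal((4, 3))                           # (N+1) x N, i.e. t^T
y = X @ t_true                                                 # rows phi(x)
w = rng.uniform(0.1, 1.0, n)                                   # voxel weights omega(x)

# Closed-form WLS fit: t^T = (X^T W X)^{-1} X^T W y.
W = np.diag(w)
t_fit = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
```

With noiseless data any positive weighting recovers t exactly; with a residual field present, the weights let the fit favor the anatomy of interest.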

Feature detector
Third, we extend a recent architecture (Moyer et al., 2021; Yu et al., 2022) that takes as input a single image and predicts a set of k non-negative spatial feature maps F^i, where i ∈ {1, 2, ..., k}, to support full affine transforms (Yu et al., 2022) and WLS (Moyer et al., 2021).

[Figure 5 caption: Decomposer predicts a one-shot displacement field (no activation) with corresponding voxel weights (ReLU), which we decompose in a weighted least-squares (WLS) sense to estimate affine transform T. Detector outputs ReLU-activated feature maps for a single image; we compute their centers of mass (COM) and weights separately for m and f, to fit a transform T that aligns these point sets. Parentheses specify filter numbers. We LeakyReLU-activate the output of unnamed convolutional blocks (parameter α = 0.2). Blue blocks smaller than their predecessor indicate subsampling by a factor of 2.]
Following a series of convolutions, we compute the center of mass a_i and channel power p_i for each feature map F_m^i of the moving image, and separately center of mass b_i with channel power q_i for each F_f^i of the fixed image. We interpret the sets {a_i} and {b_i} as corresponding moving and fixed point clouds and refer as Detector to a network h_θ which returns the affine transform t_θ = h_θ(m, f) aligning these point clouds subject to

$$\hat t_\theta = \arg\min_t \sum_{i=1}^{k} \omega_i \, \big\lVert (a_i \;\; 1) \, t^{\mathsf T} - b_i^{\mathsf T} \big\rVert^2, \tag{8}$$

where we define the normalized weight ω_i as

$$\omega_i = \frac{p_i \, q_i}{\sum_{l=1}^{k} p_l \, q_l}. \tag{9}$$

Let X and y be matrices whose i-th rows are (a_i 1) and b_i, respectively. With W = diag({ω_i}), Equation (7) yields the closed-form solution for t_θ as above.
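A numerical sketch of the Detector fitting step, using synthetic feature maps whose fixed versions are shifted copies of the moving ones. The product-of-powers weight normalization here is one plausible choice for combining the channel powers, not necessarily the paper's exact definition.

```python
import numpy as np

def center_of_mass(fm):
    """Barycenter of a non-negative feature map, and its channel power (sum)."""
    p = fm.sum()
    grid = np.indices(fm.shape).reshape(fm.ndim, -1)
    return grid @ fm.ravel() / (p + 1e-8), p

rng = np.random.default_rng(0)
k = 8
# Synthetic correspondence: fixed maps are the moving maps shifted by (2, 3).
maps_m = [np.pad(rng.random((8, 8)), ((0, 4), (0, 4))) for _ in range(k)]
maps_f = [np.roll(fm, (2, 3), axis=(0, 1)) for fm in maps_m]

A = np.array([center_of_mass(fm)[0] for fm in maps_m])  # moving point cloud {a_i}
B = np.array([center_of_mass(fm)[0] for fm in maps_f])  # fixed point cloud {b_i}
p = np.array([center_of_mass(fm)[1] for fm in maps_m])  # channel powers p_i
q = np.array([center_of_mass(fm)[1] for fm in maps_f])  # channel powers q_i
w = p * q / (p * q).sum()  # normalized weights (illustrative choice)

# Weighted least-squares fit of an affine transform mapping {a_i} to {b_i}.
X = np.hstack([A, np.ones((k, 1))])
W = np.diag(w)
t = np.linalg.solve(X.T @ W @ X, X.T @ W @ B)  # (N+1) x N
```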

Implementation
Except for the final convolutions indicated in Figure 5, each model uses convolutional blocks with w filters; the network width w does not vary within a model. Unless stated otherwise, we activate the output of each block with LeakyReLU (parameter α = 0.2) and downsample by a factor of 2 using max pooling. All kernels are of size 3^N. For computational efficiency, our 3D models downsample the network inputs {m, f} by a factor of 2 using max pooling, while we evaluate the loss on full-size segmentations {s_m, s_f}. We min-max normalize network inputs such that the image intensities fall in the interval [0, 1]. Affine coordinate transforms operate in a zero-centered index space, and we have Encoder predict rotation parameters in degrees. This parameterization ensures that varying the rotation angles has an effect of similar magnitude to varying the translations in millimeters, at the scale of the brain, which we find helps networks converge faster in our experiments. Appendix A includes further details.

Optimization
We fit model parameters with stochastic gradient descent using Adam (Kingma and Ba, 2014), choosing a learning rate of l = 10^-4 that we reduce to l = 10^-5 in case of divergence, and a batch size of 1. All models train until the loss visually converges, but at least for 10^5 iterations of 100 batches each. To avoid non-invertible matrices M = X^T W X at the start of training, we pretrain Decomposer for 500 iterations, temporarily replacing the output transform with the field T_θ = φ_θ ⊙ ω_θ, where ω_θ are the voxel weights predicted by the network (Section 3.3.2), and ⊙ denotes voxel-wise multiplication. We initialize the rescaling weights of Encoder to 1 for translations and rotations, and to 0.05 for scaling and shear, which we find favorable to faster convergence at training.
For joint registration, we freeze the parameters θ of the trained affine submodel h_θ to train only subnetwork g_η, optimizing the loss of Equation (4). We choose this optimization setup over joint training, as it ensures that the affine and deformable subnetworks do not compete at estimating affine components.

Experiments
In a first experiment, we thoroughly analyze the performance of the different architectures across a broad range of variants and transformations, to understand how networks learn and best represent the affine component. In a second experiment, we select and train an architecture with synthetic data only. This experiment shifts the focus from network architectures to building a readily usable tool, and we assess the resulting accuracy in various affine registration tasks. In a third experiment, we complete the affine model with a deformable network to produce a joint registration solution and compare its performance to well-established baselines.

Data
The training-data synthesis and analyses use 3D brain MRI scans from a very diverse collection of public data, aiming to truly capture the behavior of the methods facing the diversity of real-world images. While users of SynthMorph do not need to preprocess their data, our experiments use images conformed to the same isotropic 256×256×256 1-mm voxel space by resizing with trilinear interpolation, and cropping and zero-padding symmetrically. We rearrange the voxel data to produce gross left-inferior-anterior (LIA) orientation with respect to the volume axes. Experiments conducted in 2D use mid-sagittal slices extracted from 3D images and label maps.

Generation label maps
For training-data synthesis, we compose a set of 100 whole-head tissue segmentations, each derived from T1w acquisitions with isotropic ∼1-mm resolution (although our experiments do not use the T1w images). The latter include 30 locally scanned adult FSM subjects (Greve et al., 2021), 30 participants of the cross-sectional Open Access Series of Imaging Studies (OASIS) dataset (Marcus et al., 2007), 30 teenage subjects from the Adolescent Brain Cognitive Development (ABCD) study (Casey et al., 2018), and 10 infant subjects scanned at Boston Children's Hospital at age 0-18 months (de Macedo Rodrigues et al., 2015; Hoopes et al., 2022).
We compute brain label maps from the conformed T1w scans using SynthSeg (Billot et al., 2020). We emphasize that inaccuracies in the segmentations have little impact on our procedure, as the images synthesized from the segmentations will still be in perfect voxel-wise registration with the labels by construction. To facilitate the synthesis of spatially complex image signal outside the brain, we add non-brain labels to each label map using a simple thresholding procedure. The procedure sorts non-zero image voxels outside the brain into one of six intensity bins, equalizing bin sizes on a per-image basis. These added labels do not necessarily represent distinct or meaningful anatomical structures but expose the networks to non-brain image content.

Analysis and training images
For architecture analysis, we use T1w training images from 5000 adult participants of the UK Biobank (UKBB) study (Alfaro-Almagro et al., 2018; Sudlow et al., 2015). We also randomly pool 1000 distinct registration pairs from OASIS subjects and another distinct 1000 pairs of ABCD subjects to analyze typical transforms.
Our experiments use the held-out test images listed in Table 1. For monitoring and model validation, we use a handful of images pooled from the same datasets, which do not overlap with the test subjects. We do not consider QIN at validation and validate performance in pediatric data with held-out ABCD subjects.
To measure registration accuracy, we compute anatomical brain label maps individually for each conformed im- age volume using SynthSeg (Billot et al., 2020).We also skull-strip a copy of the GSP, IXI and UKBB data with SynthStrip (Hoopes et al., 2022), to test registering images that have undergone common preprocessing steps.

Labels
The training segmentations encompass a set K of 38 different labels, 32 of which correspond to anatomical brain structures or background. We use all labels K to synthesize training images but optimize the overlap of brain-specific labels J ⊂ K based on Equation (3).
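As a minimal sketch, restricting the loss to J ⊂ K amounts to recoding each label map through a lookup table that maps labels of interest to loss classes and everything else to zero; the label IDs below are illustrative, not the actual 38-label protocol:

```python
import numpy as np

def recode(labels, lut):
    """Map each original label in K to a loss class in J (0 = ignored).
    `lut` is a dict {original_label: loss_class}; labels absent from
    the table are dropped from the loss."""
    out = np.zeros_like(labels)
    for k, j in lut.items():
        out[labels == k] = j
    return out
```

Recoding before the overlap loss (rather than editing the segmentations themselves) lets the same label maps drive both the synthesis over all of K and the anatomy-specific optimization over J.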

Baselines
We test affine and joint classical registration with ANTs (Avants et al., 2011) version 2.3.5 using recommended parameters (Pustina and Cook, 2017) for the NCC metric within and MI across MRI contrasts. We test NiftyReg (Modat et al., 2014) version 1.5.58 with the NMI metric and enable SVF integration for joint registration, as in our approach. We also run the patch-similarity method Deeds (Heinrich et al., 2012, 2013) (version from 2022-07-23). For a rigorous baseline assessment, we reduce the grid spacing to 6 × 5 × 4 × 3 × 2 to match the spatial scales of brain structures when estimating the deformable component. As in prior work (Hoffmann et al., 2021), this modification results in a 1-2% accuracy increase for most datasets. We test affine-only registration with mri_robust_register (Robust) from FreeSurfer 7.3 (Fischl, 2012) using its robust cost functions (Reuter et al., 2010), as only these can down-weight the contribution of regions that deform non-linearly. However, we highlight that the robust-entropy metric for cross-modal registration is experimental. We use Robust with up to 100 iterations and initialize the affine registration with a rigid run.
We compare DL model variants covering popular registration architectures in Section 4.4. This analysis uses the same capacity and training set for each model. For our final synthesis-based tool in Sections 4.5 and 4.6, we consider readily available machine-learning baselines pre-trained by their respective authors, to assess their generalization to the diverse data we have gathered. This strategy evaluates what level of accuracy a user can expect from off-the-shelf methods without retraining, as retraining is generally challenging for users (see Section 5.4). We test: KeyMorph (Yu et al., 2022) and C2FViT (Mok and Chung, 2022) models trained for pairwise affine registration, and the 10-cascade Volume Tweening Network (VTN) (Zhao et al., 2019b,d) trained for joint affine and deformable registration. Each network receives inputs with the expected image orientation, resolution, and intensity normalization.

Evaluation metrics
To measure registration accuracy, we propagate the moving label map s_m using the predicted transform T to obtain the moved label map s_m • T and compute its (hard) Dice overlap D (Dice, 1945) with the fixed label map s_f. We use paired two-sided t-tests to determine whether differences in mean scores between methods are significant.
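In code, the hard Dice overlap over a set of labels can be sketched as follows; skipping labels absent from both maps is our convention, and the helper name is ours:

```python
import numpy as np

def hard_dice(seg_a, seg_b, labels):
    """Mean hard Dice overlap across the given labels. Labels absent
    from both maps are skipped (an assumed convention)."""
    scores = []
    for l in labels:
        a, b = seg_a == l, seg_b == l
        denom = a.sum() + b.sum()
        if denom:
            scores.append(2.0 * np.logical_and(a, b).sum() / denom)
    return float(np.mean(scores))
```

Identical maps score 1, disjoint maps score 0, matching the definition D = 2|A ∩ B| / (|A| + |B|) per label.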
To assess the extent to which a deformable transform described by deformation field φ includes an undesirable affine component, we fit the affine matrix t_δ that best describes φ in an LS sense, using Equations (6) and (7) with W = I_|Ω|. We compute the deformation field φ_δ equivalent to t_δ over spatial domain Ω, and define the affine residual δ as the mean voxel-wise magnitude of the left-over displacement:

δ = (1/|Ω|) Σ_{x∈Ω} ‖φ(x) − φ_δ(x)‖. (11)
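A sketch of this residual computation with W = I, fitting the affine over a dense 3D coordinate transform by ordinary least squares; treating δ as the mean displacement magnitude per voxel follows the per-voxel values reported in the results, but the exact normalization is our assumption:

```python
import numpy as np

def affine_residual(phi):
    """Fit the affine matrix that best explains a dense coordinate
    transform phi (shape (*vol, 3), mapping voxel x to phi[x]) in an
    ordinary LS sense, then return the mean voxel-wise magnitude of
    the left-over displacement. Sketch only; 3D is assumed."""
    shape = phi.shape[:-1]
    grid = np.stack(np.meshgrid(*map(np.arange, shape), indexing='ij'), -1)
    x = grid.reshape(-1, 3).astype(float)
    xh = np.c_[x, np.ones(len(x))]               # homogeneous coordinates
    y = phi.reshape(-1, 3)
    t, *_ = np.linalg.lstsq(xh, y, rcond=None)   # (4, 3) affine fit
    resid = y - xh @ t                           # left-over displacement
    return float(np.linalg.norm(resid, axis=1).mean())
```

For a purely affine field the residual vanishes; any deformable content the affine fit cannot explain contributes to δ.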

Experiment 1: network analysis
In the first experiment, we rigorously analyze variants and ablations of the three competing architectures from Section 3.3. Assuming a network capacity of ∼250k learnable parameters, our goal is to identify an optimal architecture for robust and general affine registration, which we will later train using the synthesis-based strategy. First, we compare Encoder variants across numbers of cascades C, with subnetworks that either share weights or keep them separate. Second, we compare Decomposer variants that fit T in an OLS sense, i.e. using weights ω_θ(x) = 1 ∀x ∈ Ω, or in a WLS sense. For both models, we assess how increasing the resolution of the field φ_θ relative to f affects performance, by upsampling by a factor of 2 after each of the first n ∈ {0, 1, 2, 3} convolutional blocks following the encoder, using skip connections where possible, such that the relative resolution is 1/2^(4−n).
Finally, we select a suitable configuration per architecture and analyze its performance across a range of transformation magnitudes. We investigate how models adapt to larger transforms by fine-tuning pretrained weights to twice the affine augmentation amplitudes of Table B.1 until convergence, and we also repeat the experiment with doubled capacity. We test on copies of the UKBB validation set, each injected with random affine transforms of maximum strength γ ∈ [0, 2] relative to the augmentation range of Table B.1. For example, at a given γ, we uniformly sample a rotation angle r ∼ U(−γα, γα) for each of the 200 moving and fixed images, where α = 45°, and similarly for all other degrees of freedom (DOF), which we apply by resampling the image (Appendix A).
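Sampling the test transforms at strength γ can be sketched as below. The 45° rotation limit is stated in the text; the remaining maxima (30 mm shift, 0.1 scale and shear) are inferred from the γ = 1.2 values quoted in the results and stand in for Table B.1:

```python
import numpy as np

def sample_affine_params(gamma, rng, max_rot=45.0, max_shift=30.0,
                         max_scale=0.1, max_shear=0.1):
    """Draw one set of per-axis affine parameters at relative strength
    gamma, each uniformly from [-gamma * max, gamma * max]. The maxima
    are placeholders inferred from the text, not Table B.1 itself."""
    u = lambda m: rng.uniform(-gamma * m, gamma * m, size=3)
    return dict(rotation=u(max_rot),        # degrees, per axis
                shift=u(max_shift),         # mm, per axis
                scale=1.0 + u(max_scale),   # multiplicative, per axis
                shear=u(max_shear))
```

At γ = 0 the sampler returns the identity parameters; at γ = 1.2 it reproduces the ranges quoted in the results (54°, 36 mm, 0.12).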

Results
Figure 6 compares the architecture variants. Encoder accuracy improves with the number of cascades C, with diminishing returns after about C = 4 and at the cost of substantially longer training times that roughly scale with C. Although keeping subnetwork weights separate might enable each h_i to specialize in increasingly fine adjustments to the final transform, in practice we observe no benefit in distributing capacity over the subnetworks compared to weight sharing. Decomposer shows a clear trend towards lower output resolutions improving accuracy. While decomposing the field φ_θ in a WLS sense boosts performance by 0.6-1.3 points over OLS, the model still lags behind the other architectures while requiring 2-3 times more training iterations to converge. For Detector, there is little difference across numbers k of output feature maps, and choosing WLS over OLS results in a minor increase in accuracy.
Figure 7 shows network robustness across a range of maximum transform strengths γ, where we compare Encoder with C = 4 subnetworks sharing weights to the WLS variants of Decomposer without upsampling and Detector with k = 32 output channels, to balance performance and efficiency. Detector proves most robust to large transforms, remaining unaffected up to γ ≈ 1.2, i.e. shifts and rotations up to 36 mm and 54° for each axis, respectively, and scale and shear up to 0.12. In contrast, accuracy declines substantially for Encoder and Decomposer after γ ≈ 0.8, corresponding to maximum transforms of 24 mm and 36° (blue). Doubling the affine augmentation extends Encoder and Decomposer robustness to γ ≈ 1.2 but comes at the cost of a drop of 1 and 2 Dice points for all γ < 1.2, respectively (orange). Decomposer performance is capacity-bound, as doubling the number of parameters restores ∼50% of the drop in accuracy, whereas increasing capacity does not improve Encoder accuracy for γ < 1.2 (green). Detector optimally benefits from the doubled affine augmentation, which enables the network to perform robustly across the entire test range (orange). Doubling its capacity has no effect (green).
In conclusion, the marginal lead of Encoder only manifests for small transforms and at C = 16 cascades. The 16 interpolation steps of this variant render it intractably inefficient for 3D applications. In contrast, Detector performs with high accuracy across transformation strengths, making it a more suitable architecture for a general registration tool.

Experiment 2: end-to-end tool performance
Based on the accurate performance of Detector and its robustness across transform strengths in Section 4.4, we train the WLS-based architecture in 3D using the generative strategy of Figure 2, leading to "affine SynthMorph".
We focus on the development of an anatomy-aware affine registration tool that generalizes across MRI acquisition protocols and image characteristics while enabling brain registration without requiring the user to preprocess data.

Setup
First, to give the reader an idea of the accuracy achievable with off-the-shelf algorithms for data unseen at training, we compare affine SynthMorph to baseline DL methods pretrained by the respective authors. We test registration within and across MRI contrasts, with and without skull-stripping, for a variety of imaging resolutions and populations, including adults, children, and patients with glioblastoma. Each test involves 50 held-out image pairs from separate subjects. Affine SynthMorph implements Detector (Figure 5) with w = 256 convolutional filters, as this network width proved adequate for learning registration from synthetic data only (Hoffmann et al., 2021). Using 3D convolutions, we choose k = 64 output feature maps, shown to yield best performance for large 3D transforms (Yu et al., 2022). We train SynthMorph solely with synthetic images generated from label maps based on the hyperparameter ranges of Table B.1.
Second, we analyze the effect of thick-slice acquisitions on SynthMorph accuracy compared to classical affine baselines. This experiment retrospectively reduces the through-plane resolution of 50 GSP-IXI T1 pairs to produce stacks of axial slices with thickness ∆z ∈ {1, 2, ..., 10} mm. At each ∆z, we simulate partial voluming (Kneeland et al., 1986; Simmons et al., 1994) by smoothing the images in the slice-normal direction with a 1D Gaussian kernel of full-width at half-maximum (FWHM) ∆z and by extracting slices ∆z apart using linear interpolation. Finally, we restore the initial volume size by upsampling the stack through-plane.
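The partial-voluming simulation can be sketched as follows, assuming a 1-mm isotropic input so that ∆z is expressed in voxels; the function name is ours:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d, zoom

def simulate_thick_slices(vol, dz, axis=2):
    """Reduce through-plane resolution to slice thickness dz: blur with
    a 1D Gaussian of FWHM dz along the slice-normal axis, extract every
    dz-th slice by linear interpolation, and upsample back to the
    original volume size. Sketch of the procedure described above."""
    sigma = dz / (2 * np.sqrt(2 * np.log(2)))    # convert FWHM to sigma
    low = gaussian_filter1d(vol.astype(float), sigma, axis=axis)
    factors = [1.0] * vol.ndim
    factors[axis] = 1.0 / dz
    low = zoom(low, factors, order=1)            # extract thick slices
    factors[axis] = vol.shape[axis] / low.shape[axis]
    return zoom(low, factors, order=1)           # restore volume size
```

Because the blur matches the slice thickness, each extracted slice aggregates signal from roughly a ∆z-thick slab, mimicking partial voluming in thick-slice acquisitions.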

Results
Figure 8 shows representative registration examples for the tested dataset combinations, while Figure 9 compares affine registration accuracy across GM, WM, and CSF to each baseline method in terms of Dice overlap.
Although affine SynthMorph has not seen any real MRI data at training, it achieves the highest affine Dice score for every dataset tested. For the skull-stripped T1w data that all baselines except Deeds specialize in, SynthMorph exceeds the best-performing baseline by 0.3 points (GSP SS →IXI T1,SS , p < 9 × 10⁻⁴). In contrast, SynthMorph accuracy leads by at least 4.4 points for all other datasets (p < 3 × 10⁻⁹), and often by much more, demonstrating its ability to generalize to acquisition specifics and to accurately register the brain independent of whether brain tissue is the only image content.
In fact, there is no difference in SynthMorph performance between skull-stripped (GSP SS →IXI T1,SS ) and full-head image pairs (GSP→IXI T1 ), whereas most other methods do worse. The classical baselines drop by 4.2-8.9 points, except for Deeds, which performs substantially worse than the other classical methods for GSP SS →IXI T1,SS but improves by 6.2 points when tested on GSP→IXI T1 . Surprisingly, Robust accuracy also drops by 6.2 points despite its ability to down-weight image regions that deform non-linearly. These results demonstrate how label recoding in the loss enables SynthMorph to forgo preprocessing, and skull stripping in particular (Section 3.2.2).
As expected, DL-baseline performance breaks down for full-head images, because these methods were trained using skull-stripped data. Even for GSP SS →IXI T1,SS , the DL baselines fall short of the classical methods. This discrepancy is likely due to a domain shift from the training distribution, such as a different smoothness or noise level, and evidences the training-data dependency of existing DL methods. The affine cascade of VTN yields the lowest accuracy across the skull-stripped images. In contrast to all other methods, its preprocessing constrains each input image to a bounding box sized and centered according to a corresponding label map.
NiftyReg is the most robust baseline across varied image content and acquisition protocols, outperforming all other baselines.In particular, NiftyReg performs reasonably for MASi→HCP-D and QIN→IXI T1 : these tasks are challenging because the MASi data are defaced, while QIN includes prominent contrast-enhanced pathology.
Figure 10a shows how registration accuracy evolves with increasing GSP→IXI T1 slice thickness ∆z. SynthMorph performance is the most robust, remaining invariant for ∆z ≤ 5 mm and reducing by less than 1% at ∆z = 10 mm. The least affected classical baselines are ANTs and Robust: their initial Dice score drops by up to 3%, while ANTs is invariant for ∆z ≤ 4 mm. NiftyReg and Deeds are most susceptible to resolution changes, decreasing to 94% and below 90% of their respective starting accuracy.
Table 3 lists the registration time required by each affine method on a 2.2-GHz Intel Xeon Silver 4114 CPU using a single computational thread (table footnotes: 1, timed on the GPU as the device is hard-coded; 2, implementation performs joint registration only). The values shown reflect averages over n = 10 uni-modal runs. Classical runtimes range between 2 and 27 minutes, with Deeds being the fastest and Robust the slowest, although we highlight that we substantially increased the number of Robust iterations. Complete single-threaded DL runtimes are about 1 minute, including model setup. However, inference only takes a few seconds and reduces to well under a second on an NVIDIA V100 GPU.

Experiment 3: joint registration
Motivated by the affine performance of SynthMorph, we complete the model with a deformable module to achieve 3D joint affine-deformable registration (Figure 4). Our focus is on building a complete and readily usable tool that generalizes across scan protocols while requiring minimal data preparation.

Setup
First, we test registration using 50 held-out subject pairs for each of the dataset combinations considered in Section 4.5. We compare joint SynthMorph performance to classical baselines and VTN, the only joint DL baseline pretrained by the original authors that is available to us, as we seek to gauge the accuracy achievable with off-the-shelf algorithms for data unseen at training. Second, we analyze the effect of reducing through-plane resolution ∆z on SynthMorph performance compared to classical baselines, following the steps outlined in Section 4.5.
Third, we measure the affine residual δ in the deformation field φ across 50 registration pairs from each of the T1w testsets. In this experiment, our goal is to analyze how much of the physical affine transform between registration pairs each algorithm captures, assuming that the deformable registration will pick up the left-over affine component δ defined by Equation (11).

Results
Figure 13 shows typical joint registration examples for each method, and Figure 11 quantitatively compares registration accuracy across testsets in terms of mean Dice overlap D over anatomical structures.
Although SynthMorph trains with intentionally unrealistic synthetic images only, it achieves the highest score for every testset. For the adult T1w testsets GSP SS →IXI T1,SS and GSP→IXI T1 , SynthMorph outperforms the best classical baseline, ANTs, by at least 1.5 Dice points (p < 10⁻⁹ for a paired two-sided t-test). For all other testsets, SynthMorph performance remains largely invariant, whereas ANTs struggles and yields the lowest scores among classical methods; there, SynthMorph leads by at least 4.2 Dice points over the highest baseline score (GSP→IXI T2 , p < 3 × 10⁻¹⁸) and often by much more. Crucially, the distribution of SynthMorph scores for isotropic data is substantially narrower than the baseline scores. This indicates an absence of gross registration failures (Figure 13), such as pairs with D < 20 that ANTs and VTN produce for isotropic cross-contrast pairings.
The most robust classical baseline is Deeds, which ranks third at adult T1w registration. Its performance degrades the least for the cross-contrast and clinical testsets, where it produces the highest Dice overlap after SynthMorph. In fact, Deeds is the only baseline tested that performs reasonably for the challenging tasks MASi→HCP-D and QIN→IXI T1 . These results confirm prior findings (Hoffmann et al., 2021) focusing on deformable instead of joint registration and indicate that the deformable component estimated by Deeds often compensates for its relatively inaccurate affine transforms (Figure 9).
The only joint DL baseline with pretrained weights that we had access to, VTN, yields relatively low accuracy across all testsets. This was expected for the full-head and cross-contrast pairings, since the model was trained with skull-stripped T1w data, reconfirming the data dependency of DL methods. However, VTN also lags behind the worst-performing classical baseline for skull-stripped T1w GSP SS →IXI T1,SS data (∆D = 8.4, p < 8 × 10⁻¹²), likely due to a domain shift as in the affine case.
Figure 10b assesses the dependency of registration performance on slice thickness ∆z. Similar to the affine case, deformable accuracy decreases for thicker slices, albeit faster. SynthMorph performs most robustly: its accuracy remains almost unchanged up to ∆z ≤ 3 mm and reduces by less than 5% at ∆z = 10 mm. ANTs is the most robust classical method, but its accuracy drops considerably faster than SynthMorph's. Deeds and NiftyReg are most affected at reduced resolution, performing at less than 95% accuracy for ∆z ≥ 5 mm and ∆z ≥ 4.5 mm, respectively.
Figure 12 compares the affine residuals δ measured across deformable T1w transforms. SynthMorph outperforms all baselines for GSP→IXI T1 and MASi→HCP-D registration pairs, achieving the lowest mean displacement δ = (0.6 ± 0.0) mm per voxel (± SD). Importantly, SynthMorph performance is invariant to the level of preprocessing, as the method produces similarly low residuals for skull-stripped GSP SS →IXI T1,SS pairs. In this testset, ANTs and NiftyReg produce even lower residuals of δ = (0.2 ± 0.0) mm and δ = (0.3 ± 0.1) mm, respectively. However, their residuals increase 5- to 17-fold when we do not use skull-stripping, indicating that their deformable registration attempts to compensate for sub-optimal affine alignment. The mean Deeds and VTN residuals exceed the values for SynthMorph at least 15-fold across testsets, as they build on the least accurate affine transforms among the methods tested (Figure 9).
Deformable registration often requires substantially more time than affine registration (Table 3). On the GPU, SynthMorph takes less than 8 seconds per image pair for registration, IO, and resampling. One-time model setup requires about 1 minute, after which the user can register any number of image pairs without reinitializing the model. On the CPU, the fastest classical method (Deeds) requires only about 6 minutes in single-threaded mode, whereas ANTs takes almost 5 hours. While VTN's joint runtime is 1 minute, SynthMorph needs about 15 minutes for deformable registration on a single thread.

Discussion
We present an easy-to-use DL tool for end-to-end affine and deformable brain-specific registration. SynthMorph achieves robust performance across acquisition characteristics such as imaging contrast, resolution, and pathology, enabling accurate registration for brain scans right off the scanner, without preprocessing. The SynthMorph strategy alleviates the dependency on acquired training data by generating widely variable images from anatomical label maps; there is no need for these during inference.

Architectures
We performed a rigorous analysis of popular affine architectures. The comparison shows that Encoder is an excellent network architecture if the expected transforms are small, especially at a number of cascades C ≥ 4. For medium to large transforms, Encoder accuracy suffers.
While our experiments indicate that the reduction in accuracy can be mitigated by simultaneously optimizing a separate loss for each cascade, doing so substantially increases training times compared to the other architectures. Another drawback of Encoder is the image-size dependence introduced by the FC layer. We find that Detector is a more flexible alternative that remains robust for medium to large transforms. Vision transformers (Dosovitskiy et al., 2020) are another popular approach to overcoming the local receptive field of convolutions with small kernel sizes, querying information across distributed image patches. In practice, however, this sophisticated architecture is often unnecessary for many computer-vision tasks (Pinto et al., 2022): simple small-kernel U-Nets generally perform well, as their multi-resolution convolutions effectively widen the receptive field (Liu et al., 2022b), and increasing the kernel size can boost the performance of convolutional networks beyond that achieved by vision transformers across multiple tasks (Ding et al., 2022; Liu et al., 2022a).

Anatomy-specific registration
Accurate registration of the specific anatomy of interest requires ignoring or down-weighting the contribution of irrelevant image content to the optimization metric. For instance, the presence of neck and tongue tissue in MRI can reduce brain registration accuracy, as these structures move independently of the brain and can deform non-linearly. Many existing classical and DL methods cannot distinguish relevant from irrelevant image features, and thus have to rely on a separate segmentation step that removes distracting content prior to registration, such as skull-stripping in neuroimage processing (Eskildsen et al., 2012; Hoopes et al., 2022; Iglesias et al., 2011; Salehi et al., 2017; Smith, 2002).
SynthMorph learns which anatomy is pertinent to the task, as the optimization focuses on aligning only select labels of interest, removing the dependency of DL registration methods on complex preprocessing. This general strategy can be applied to other anatomy, as long as label maps are available for training. The strategy can also apply weights to the labels to force networks to prioritize alignment of some anatomical structures over others, if, for example, registration of a structure such as the hippocampus is particularly important across subjects.

Baseline performance
Networks trained with the SynthMorph strategy have access neither to the MRI contrasts of the testsets nor, in fact, to any MRI data at all. Yet SynthMorph outperforms classical and DL-baseline performance on all of the real-world datasets tested, while being substantially faster than the classical methods. For deformable registration, for example, the fastest classical method (Deeds) requires 6 minutes, while SynthMorph takes about one minute for one-time model setup and just under 8 seconds for each subsequent registration.
The DL baselines break down when the input data does not undergo the expected preprocessing, that is, skull-stripping, or for contrast pairings unobserved at training. In contrast, SynthMorph performs robustly, with its accuracy relatively unaffected by changes in imaging contrast, resolution, or subject population. These results demonstrate that the SynthMorph strategy produces powerful networks that can register new image types unseen at training. We emphasize that our focus is on leveraging the training strategy to build a robust and accurate registration tool. It is possible that the pretrained DL baseline architectures tested in this work would perform equally well if trained using our proposed strategy.
Although Robust down-weights the contribution of image regions that cannot be mapped with the linear transformation model of choice, its accuracy dropped by several points for data without skull-stripping. The poor performance in cross-contrast registration may be due to the experimental nature of its robust-entropy cost function. We initially experimented with the recommended NMI metric, but registration failed for a number of cases as Robust produced non-invertible matrix transforms, and we hoped that the robust metrics would deliver accurate results in the presence of non-brain image content, which the NMI metric cannot ignore during optimization.

Challenges with retraining baselines
Retraining DL baselines to improve performance for specific user data frequently involves substantial practical challenges. For example, users have to reimplement the architecture and training setup from scratch if code is not available, which is often the case. If code is available, the user may be unfamiliar with the specific programming language or machine-learning library, and building on the original authors' implementation typically requires setting up an often complex development environment with matching package versions. In our experience, not all authors make this version information readily available, such that users may have to resort to trial and error. Additionally, the user's hardware might not be on par with the authors'. If a network exhausts the memory of the user's GPU, avoiding prohibitively long training times on the CPU necessitates reducing model capacity, which can affect performance.
In principle, users could retrain DL methods despite these challenges. However, in practice the burden is usually large enough that users of these technologies will turn to methods that distribute pre-trained models. For this reason, we specifically compare DL baselines pretrained by the respective authors, to gauge the performance attainable without retraining. We hope the broad applicability of SynthMorph may help alleviate the historically limited reusability of DL methods.

Joint registration
The joint baseline comparison highlights that it can be challenging to achieve accurate deformable registration when building on sub-optimal affine alignment. An exception is Deeds, which compensates for inaccurate affine registration across several testsets. However, unlike all other joint baselines tested, Deeds does not integrate an SVF to construct a diffeomorphic deformation field, potentially permitting the transform to evolve in a less constrained fashion during optimization.
In Section 4.5, the affine subnetwork of the 10-cascade VTN model produces sub-optimal brain alignment even for the skull-stripped T1w image type it trained with. We highlight that the authors of VTN do not independently tune or compare the affine component to baselines and instead focus on joint affine-deformable accuracy (Zhao et al., 2019b,d). While VTN presents the affine cascade as an Encoder architecture (C = 1, Section 3.3) terminating with an FC layer (Zhao et al., 2019d), the public implementation omits the FC layer (Zhao et al., 2019c); some of our early experiments with this architecture indicated that the FC layer is critical to competitive performance.

An ill-posed problem
Compared to deformable and within-subject registration, cross-subject affine registration is an ill-posed optimization problem. The human cortex, for example, exhibits complex folding patterns that vary considerably between individuals, and transforms limited to translation, rotation, scaling, and shear can only establish a gross match between subjects. This implies that a transform maximizing NCC may not necessarily lead to optimal Dice scores, and vice versa. The Dice metric in particular may have a difficult optimization landscape when images are far apart, or when some structures have no initial overlap. It is thus important to consider which labels to merge and optimize in the loss function. We choose to align the tissue classes WM, GM, and CSF, but other brain-label combinations might work equally well or perform even better.

Degrees of freedom
While we estimate the full affine transformation matrix with 12 DOF in 3D space, some applications require fewer DOF. An example is the correction of head motion during neuroimaging with MRI (Gallichan et al., 2016; Tisdall et al., 2012; White et al., 2010), where the bulk motion and its possible mitigation through pulse-sequence adjustments are physically constrained to 6 DOF accounting for translation and rotation. The Detector architecture used by SynthMorph can support such lower-DOF applications in several ways. First, the WLS problem (9) has closed-form solutions for various numbers of DOF, and Moyer et al. (2021) propose a Detector model implementing a 6-DOF solution. Second, we can decompose the affine matrix t of Equation (7) into translation, rotation, scaling, and shearing parameters for each axis of space. Passing to the loss (3) a modified transform t_vr, reconstructed from translations v_i and rotation parameters r_i only, for example (Appendix A), will encourage the network to detect features optimal for rigid alignment.
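Extracting a 6-DOF rigid component from a fitted affine can be sketched via a QR factorization of the 3 × 3 block, which separates a rotation from an upper-triangular scale-shear factor. This decomposition route is our assumption (it presumes a positive determinant and the V R Z E ordering of Appendix A), not necessarily the exact procedure used by the tool:

```python
import numpy as np

def rigid_part(t):
    """Extract the rigid (translation + rotation) component of a 4x4
    affine t: keep the translation, and factor the 3x3 block as
    rotation times an upper-triangular scale-shear matrix via QR.
    Sketch only; assumes det(t[:3, :3]) > 0."""
    A, v = t[:3, :3], t[:3, 3]
    Q, U = np.linalg.qr(A)
    # Flip signs so the scales on diag(U) are positive and Q is a
    # proper rotation; Q @ U is unchanged by this.
    s = np.sign(np.diag(U))
    Q, U = Q * s, U * s[:, None]
    out = np.eye(4)
    out[:3, :3], out[:3, 3] = Q, v
    return out
```

The discarded factor U carries the per-axis scales on its diagonal and the shears above it, so the same factorization also yields the full parameter decomposition.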

Rotational range
At training, SynthMorph sees registration pairs rotated apart by angles in excess of |r_i| = 90° about any axis i, due to the combined effect of spatial augmentation (Table B.1) and the rotational offset already present between any two input label maps. While prior work (Yu et al., 2022) and the analysis of Section 4.4 demonstrate that Detector can effectively capture rotations up to |r_i| = 180°, many applications will not require this full range. For example, the rotational ranges measured within OASIS and ABCD do not exceed |r_i| ≤ 43.1° (Figure C.1). Even in situations where rotations up to |r_i| ≤ 180° can occur, the registration problem often reduces to an effective 45° range. This is because the pixel data of the moving image can be transposed to closely match that of the fixed image, either manually or using prior knowledge as encoded in the spatial transformation matrix typically stored with medical images.
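The gross reorientation can be sketched by snapping the relative rotation between the stored voxel-to-world matrices to its nearest signed axis permutation; the helper below is illustrative and assumes the rotation is unambiguous (i.e., not close to 45° about an axis):

```python
import numpy as np

def gross_reorient(rot):
    """Snap a relative 3x3 rotation (e.g. derived from the
    voxel-to-world matrices in the image headers) to the nearest
    signed axis permutation. Applying the returned permutation and
    flips to the moving voxel data leaves only a residual rotation
    within roughly +/-45 degrees per axis. Illustrative sketch."""
    perm = np.argmax(np.abs(rot), axis=1)        # dominant axis per row
    signs = np.sign(rot[np.arange(3), perm])
    P = np.zeros((3, 3))
    P[np.arange(3), perm] = signs
    return P  # realize via np.transpose / np.flip on the voxel array
```

Because P is exactly representable as array transposes and flips, the moving image can be reoriented losslessly before registration.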

Future work
We plan to expand our work in several ways. First, we will provide a trained 6-DOF model for rigid registration, as many applications require translations and rotations only, and the most accurate rigid transform does not necessarily correspond to the translation and rotation encoded in the most accurate affine transform. Second, we will employ the proposed strategy and affine architecture to train specialized models for within-subject registration for navigator-based real-time motion correction of neuroimaging with MRI (Tisdall et al., 2012). These models need to be efficient for real-time use but do not have to be invariant to MRI contrast or resolution when employed to track head-pose changes between navigators acquired with a fixed protocol. However, the brain-specific registration made possible by SynthMorph will improve motion tracking, and thus correction accuracy, in the presence of distracting jaw movement, for example (Hoffmann et al., 2020). Third, another application that can dramatically benefit from anatomy-specific registration is fetal neuroimaging with MRI, where the fetal brain is surrounded by features such as amniotic fluid and maternal tissue. We plan to tackle registration of the fetal brain, which is challenging partly due to its small size and currently relies on brain extraction prior to registration to remove confounding image content (Gaudfernau et al., 2021).

Conclusion
We present an easy-to-use DL tool for end-to-end registration of images right off the scanner, without any preprocessing. Our study demonstrates the feasibility of training accurate affine and joint registration networks that generalize to image types unseen at training, outperforming established baselines across a landscape of image contrasts and resolutions. In a rigorous analysis approximating the diversity of real-world data, we find that our networks achieve invariance to protocol-specific image characteristics by leveraging a strategy that synthesizes wildly variable training images from label maps; there is no need for label maps during inference.
Importantly, optimizing the spatial overlap of select anatomical labels enables anatomy-specific registration without the need for segmentation to remove distracting content from the input images. We believe this independence from complex preprocessing holds great promise for time-critical applications, such as real-time motion correction of MRI.

Declaration of interest
Bruce Fischl has a financial interest in CorticoMetrics, a company whose medical pursuits focus on brain imaging and measurement technologies. Massachusetts General Hospital and Mass General Brigham review and manage this interest in accordance with their conflict of interest policies. The authors have no other known financial and personal interests that could have inappropriately influenced the work reported in this paper.

A Affine parameterization
We use ∆d_i = (d_i − 1)/2, placing the center of rotation at the center of f. Let T : Ω → R^N be the affine coordinate transform of Equation (1), which maps a moving image m onto the domain of f. We parameterize T as the matrix product T = V R Z E, where V, R, Z, E are matrices describing translation, rotation, scaling, and shear, respectively. Denoting v_i the translation and z_i the scaling parameter along axis i, we define V as the homogeneous matrix translating by (v_1, ..., v_N) and Z = diag(z_1, ..., z_N). For rotations and shear, we distinguish between the 2D and 3D case. Let r_i be the angle of rotation about axis i, where the direction of rotation follows the right-hand rule. We abbreviate c_i = cos(r_i) and s_i = sin(r_i).
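As a concrete illustration of this parameterization, the sketch below builds the homogeneous component matrices and composes them. The function names and the composition order are our reading of the text above, not code from the SynthMorph release.

```python
import numpy as np

def translation(v):
    """Homogeneous 4x4 translation matrix V for the offset vector v."""
    V = np.eye(4)
    V[:3, 3] = v
    return V

def rotation(r):
    """Intrinsic 3D rotation R = R1 @ R2 @ R3 from per-axis angles r (radians)."""
    c, s = np.cos(r), np.sin(r)
    R1 = np.array([[1, 0, 0], [0, c[0], -s[0]], [0, s[0], c[0]]])
    R2 = np.array([[c[1], 0, s[1]], [0, 1, 0], [-s[1], 0, c[1]]])
    R3 = np.array([[c[2], -s[2], 0], [s[2], c[2], 0], [0, 0, 1]])
    R = np.eye(4)
    R[:3, :3] = R1 @ R2 @ R3
    return R

def scaling(z):
    """Diagonal scaling matrix Z with factor z_i along axis i."""
    return np.diag(np.append(z, 1.0))

def shear(e):
    """Upper-triangular shear matrix E from 3 off-diagonal parameters."""
    E = np.eye(4)
    E[0, 1], E[0, 2], E[1, 2] = e
    return E

def affine(v, r, z, e):
    """Compose the full affine transform from its component matrices."""
    return translation(v) @ rotation(r) @ scaling(z) @ shear(e)
```

With all parameters at their neutral values (zero translation, rotation, and shear; unit scaling), the composition reduces to the identity, which is a quick sanity check on the construction.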

A.3 Transforming coordinates
With the notation introduced in Equation (1), we transform the coordinates of an N-D point x = (x_1, x_2, ..., x_N) ∈ Ω as x′ = Ax + v (A.10), or, using a single matrix product in homogeneous coordinates, (x′, 1)^T = T (x, 1)^T (A.11).
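The equivalence of the two forms can be checked numerically. The matrix T below is an arbitrary example of ours, not a transform from the paper:

```python
import numpy as np

def transform_point(T, x):
    """Apply a homogeneous affine T to an N-D point x via a single
    matrix product, equivalent to A @ x + v."""
    x_h = np.append(x, 1.0)  # homogeneous coordinates (x, 1)
    return (T @ x_h)[:-1]

# Illustrative transform: identity linear part plus a translation.
T = np.array([[1.0, 0.0, 0.0, 5.0],
              [0.0, 1.0, 0.0, -2.0],
              [0.0, 0.0, 1.0, 0.5],
              [0.0, 0.0, 0.0, 1.0]])
A, v = T[:3, :3], T[:3, 3]
x = np.array([1.0, 2.0, 3.0])

# The two forms of Equations (A.10) and (A.11) agree.
assert np.allclose(transform_point(T, x), A @ x + v)
```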

C Transform analysis
In this experiment, we analyze the range of real-world transforms a registration tool may have to cope with. We register 1000 distinct and randomly pooled subject pairs from OASIS and another 1000 pairs from ABCD. The estimated transformation matrix T decomposes into the translation, rotation, scaling, and shearing parameters defined in Appendix A. Therefore, we augment input label maps at training with affine transforms drawn from the ranges of Table B.1, to ensure that SynthMorph covers the transformation parameters measured across public datasets.
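One way to recover such parameters from an estimated matrix is via polar factorization of the linear part. This is a sketch of a standard approach, not necessarily the decomposition used for Appendix C:

```python
import numpy as np
from scipy.linalg import polar

def decompose_affine(T):
    """Split a homogeneous 4x4 affine T into translation, rotation, and a
    stretch factor bundling scaling and shear, using the polar decomposition
    A = R K, with R orthogonal and K symmetric positive-definite."""
    A, v = T[:3, :3], T[:3, 3]
    R, K = polar(A)              # scipy.linalg.polar: right polar decomposition
    scales = np.diag(K).copy()   # per-axis scaling; off-diagonals of K hold shear
    return v, R, scales, K
```

For a transform built from a pure rotation, uniform scaling, and translation, the decomposition recovers each component exactly, since the polar factorization of a nonsingular matrix is unique.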

Figure 1 .
Figure 1. Representative examples of SynthMorph affine 3D brain registration, showing the moving brain transformed onto the T1-weighted fixed image as a red overlay. Trained with extremely variable synthetic data, SynthMorph generalizes across a diverse array of real-world imaging contrasts, resolutions, and subject populations.

Figure 2 .
Figure 2. Training strategy for anatomy-specific affine registration. At each iteration, we augment a pair of moving and fixed label maps {s_m, s_f} and synthesize images {m, f} from them. The acquisition-agnostic CNN h_θ predicts an affine transform T, from which we compute the moved label map s_m ∘ T. The Dice loss L recodes the labels in {s_m, s_f} to optimize the overlap of select anatomy of interest only: in this case WM, GM, and CSF.
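A minimal version of the label-recoded Dice loss might look as follows. The hard (non-differentiable) formulation and the label indices are illustrative stand-ins for the one-hot, differentiable loss used at training:

```python
import numpy as np

def dice_loss(moved_labels, fixed_labels, select=(1, 2, 3)):
    """Mean Dice loss over a chosen subset of labels (e.g. WM, GM, CSF).
    Labels outside `select` are ignored, making the loss anatomy-specific."""
    scores = []
    for k in select:
        m = (moved_labels == k).astype(float)
        f = (fixed_labels == k).astype(float)
        overlap = 2.0 * (m * f).sum()
        denom = m.sum() + f.sum() + 1e-8   # small epsilon avoids division by zero
        scores.append(overlap / denom)
    return 1.0 - np.mean(scores)
```

Perfectly aligned label maps yield a loss near zero, while label maps with no overlap on the selected labels yield a loss of one.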

Figure 3 .
Figure 3. Synthetic 3D training data with arbitrary contrast, resolution, and artifact level, generated from brain label maps. The image characteristics exceed the realistic range to promote network generalization across acquisition protocols. All examples are based on the same label map. In practice, we use label maps from several different subjects.
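The synthesis step can be sketched as drawing one random intensity per label, then blurring and corrupting with noise. The hyperparameter ranges below are illustrative defaults of ours, whereas training samples them from Table B.1:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def synthesize(label_map, rng, blur_max=2.0, noise_sd=0.05):
    """Synthesize a training image from an integer label map by assigning a
    random mean intensity to each label, smoothing, and adding noise."""
    means = rng.uniform(0.0, 1.0, size=label_map.max() + 1)
    image = means[label_map]                             # per-label intensity lookup
    image = gaussian_filter(image, sigma=rng.uniform(0.0, blur_max))
    image += rng.normal(0.0, noise_sd, size=image.shape)
    return np.clip(image, 0.0, 1.0)
```

Repeated calls with the same label map but different random states produce images of wildly different appearance, which is the mechanism behind the contrast variability shown in the figure.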

Figure 4 .
Figure 4. Training strategy for anatomy-specific joint registration. As in Figure 2, CNN h_θ predicts an affine transform T between moving and fixed images {m, f} synthesized from label maps {s_m, s_f}. The moved image m ∘ T and f are inputs to the acquisition-agnostic CNN g_η, which predicts a diffeomorphic warp field φ. We form the joint transform ψ = T ∘ φ by composition and compute the moved label map s_m ∘ ψ. The Dice loss L recodes the labels of the label maps {s_m, s_f} to optimize the overlap of select anatomy of interest: in this case brain labels only.
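Under one common convention, composing the affine with the warp into a single displacement field can be sketched as below. This is our own sketch of the composition step, not the paper's implementation, and the ordering (warp first, then affine) is an assumption:

```python
import numpy as np

def compose_affine_warp(T, phi):
    """Compose a homogeneous 4x4 affine T with a dense displacement field phi
    of shape (3, *volume) into a single displacement field psi, where
    psi(x) = T(x + phi(x)) - x."""
    grid = np.stack(np.meshgrid(*[np.arange(d) for d in phi.shape[1:]],
                                indexing="ij")).astype(float)
    warped = grid + phi                      # coordinates after the warp
    flat = warped.reshape(3, -1)
    moved = T[:3, :3] @ flat + T[:3, 3:]     # then apply the affine
    return moved.reshape(phi.shape) - grid   # back to a displacement field
```

With a zero warp and a pure translation, the composed field reduces to that constant translation at every voxel, which makes the convention easy to verify.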
Figure 5. Affine registration architectures. A recurrent Encoder estimates refinements to the current transform T_i from the moved image m_i = m ∘ T_i and fixed image f. Decomposer predicts a one-shot displacement field (no activation) with corresponding voxel weights (ReLU), which we decompose in a weighted least-squares (WLS) sense to estimate the affine transform T. Detector outputs ReLU-activated feature maps for a single image. We compute their centers of mass (COM) and weights separately for m and f, to fit a transform T that aligns these point sets. Parentheses specify filter numbers. We LeakyReLU-activate the output of unnamed convolutional blocks (parameter α = 0.2). Blue blocks smaller than their predecessor indicate subsampling by a factor of 2.
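The WLS step that Decomposer and Detector rely on reduces to a weighted linear least-squares fit of an affine map to point correspondences. The sketch below fits T from weighted point pairs; the actual networks operate on dense displacement fields and COM coordinates:

```python
import numpy as np

def fit_affine_wls(x, y, w):
    """Weighted least-squares fit of y ≈ A x + v from corresponding points
    (rows of x and y) with non-negative per-point weights w. Returns the
    homogeneous transform T with [A | v] in its top rows."""
    x_h = np.hstack([x, np.ones((len(x), 1))])   # homogeneous coordinates
    W = np.sqrt(w)[:, None]                      # sqrt weights for the normal equations
    M, *_ = np.linalg.lstsq(W * x_h, W * y, rcond=None)
    T = np.eye(x.shape[1] + 1)
    T[:-1, :] = M.T
    return T
```

When the point pairs are generated by an exact affine map, the fit recovers it to numerical precision, so low-weight outlier points are the only source of deviation in practice.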

Figure 8.
Figure 8. Representative affine 3D registration examples showing the image moved by each method, overlaid with the fixed brain mask (red). Each row is an example from a different dataset. Subscripts indicate MRI contrast.

Figure 10 .
Figure 10. Dependency of 3D (a) affine and (b) joint affine-deformable registration accuracy on slice thickness. Error bars indicate the standard error of the mean and are comparable across baselines, except for Robust, which is similar to SynthMorph. Higher scores are better.

Figure 11.
Figure 11. Joint affine-deformable 3D registration accuracy (mean Dice scores D over |J| = 23 bilateral brain regions). Each violin shows the distribution of D across 50 image pairs from separate subjects. Subscripts indicate MRI contrast, SS denotes skull-stripped images, and downward arrows indicate median scores D < 50.

Figure 13 .
Figure 13. Joint affine-deformable 3D registration examples comparing the moved image m ∘ ψ and the total deformation field ψ = T ∘ φ across methods. Each row is an example from a different dataset, corresponding to the image pairs of Figure 8, which we picked at random. Subscripts indicate MRI contrast.

Let f be a fixed N-D image of side lengths d_i, where i ∈ {1, ..., N} indexes the right-handed axes of the spatial image domain Ω. This work uses zero-centered index voxel coordinates x ∈ Ω. That is, x_i ∈ {−∆d_i, 1 − ∆d_i, ..., d_i − 1 − ∆d_i}. (A.1) We consider intrinsic 3D rotations represented as the matrix product R = R_1 R_2 R_3, where R_i denotes the right-handed rotation by angle r_i about axis i.

Figure C. 1 .
Figure C.1.Absolute affine transformation range across n = 1000 registration pairs randomly selected from OASIS or ABCD subjects.Each panel pools parameters relative to all axes i ∈ {1, 2, 3} of 3D space.Horizontal bars indicate median values.Circles represent parameters farther than 1.5 inter-quartile ranges from the median.

Table 1 .
Acquired test data spanning a range of MRI contrasts, resolutions (res.), and subject populations. QIN are contrast-enhanced clinical stacks of thick slices from patients with glioblastoma, whereas the other acquisitions use 3D sequences. While HCP-D and MASi include pediatric data, the remaining datasets sample adult populations.

Table 2 .
Network capacity for model comparison. Each model uses a width w held constant across its convolutional layers to reach a target capacity of 250k or 500k parameters, up to a small deviation.
compares registration accuracy for the tested model configurations. Encoder achieves the highest accuracy, surpassing the best Detector configuration by up to 0.4 and the best Decomposer by up to 1 Dice point. Using more subnetworks improves Encoder performance, albeit

Figure 7.
Figure 7. Network robustness across affine transform strengths γ relative to the augmentation range of Table B.1. At a given γ, we resample each image of the validation set with affine parameters drawn from a uniform distribution U modulated by γ, for example rotation angle r ~ U(−γα, γα), where α = 45°. We also test networks trained with doubled augmentation (aug) and doubled capacity (cap). Dice scores represent averages over 100 UKBB cross-subject 2D registration pairs. Shaded areas indicate the standard error of the mean.
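The γ-modulated sampling described in the caption can be sketched as below. Only α = 45° is stated above; the remaining ranges are hypothetical stand-ins for the values in Table B.1:

```python
import numpy as np

def sample_affine_params(rng, gamma, alpha=45.0, trans=30.0, scale=0.1, shear=0.1):
    """Draw per-axis affine augmentation parameters from uniform ranges
    scaled by the transform strength gamma, e.g. rotation r ~ U(-gamma*alpha,
    gamma*alpha). Ranges other than alpha are illustrative stand-ins."""
    u = lambda a: rng.uniform(-gamma * a, gamma * a, size=3)
    return dict(rotation_deg=u(alpha), translation=u(trans),
                scaling=1.0 + u(scale), shear=u(shear))
```

At γ = 1, sampling reproduces the full augmentation range; at γ = 0, all parameters collapse to the identity transform, which matches the sweep shown in the figure.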

Table 3 .
Single-threaded runtimes on a 2.2-GHz Intel Xeon Silver 4114 CPU, averaged over n = 10 runs. Errors indicate standard deviations. On an NVIDIA V100 GPU, all affine and deformable DL runtimes (below midline) are about 1 minute, including setup.

Table B.1 lists the generation hyperparameter ranges that SynthMorph training uses for label-map augmentation and image synthesis.
Table B.1. Uniform hyperparameter sampling ranges [a, b] for synthesizing training images from source segmentation maps. We abbreviate standard deviation (SD), full width at half maximum (FWHM), and field of view (FOV).