A Symmetric Prior for the Regularisation of Elastic Deformations: Improved anatomical plausibility in nonlinear image registration

Nonlinear registration is critical to many aspects of Neuroimaging research. It facilitates averaging and comparisons across multiple subjects, as well as reporting of data in a common anatomical frame of reference. It is, however, a fundamentally ill-posed problem, with many possible solutions which minimise a given dissimilarity metric equally well. We present a regularisation method capable of selectively driving solutions towards those which would be considered anatomically plausible by penalising unlikely lineal, areal and volumetric deformations. This penalty is symmetric in the sense that geometric expansions and contractions are penalised equally, which encourages inverse-consistency. We demonstrate that this method is able to significantly reduce local volume changes and shape distortions compared to state-of-the-art elastic (FNIRT) and plastic (ANTs) registration frameworks. Crucially, this is achieved whilst simultaneously matching or exceeding the registration quality of these methods, as measured by overlap scores of labelled cortical regions. Extensive leveraging of GPU parallelisation has allowed us to solve this highly computationally intensive optimisation problem while maintaining reasonable run times of under half an hour.


Introduction
Nonlinear registration is commonly used in neuroimaging to deform images of individual brains into some common space (normalising both size and shape). Often the registration is based on structural, e.g., T1weighted, images. The resulting warp is subsequently applied to both structural images as well as images depicting some function of interest such as BOLD or diffusion-derived connectivity. The rationale behind this is typically to facilitate statistical analysis of those functional data across subjects and populations.
An important distinction in this context is between volume-and surface-based methods. The former attempt to find the inter-subject mappings in the original (3D) space. The latter, in contrast, are typically only concerned with the cortex and attempt to inflate the folded 2Dmanifold (the cortex) onto a sphere before finding the (2D) inter-subject mapping on that sphere (e.g., Fischl (2012); Robinson et al. (2014)). In this paper we present a novel volume based method.
Existing volume based methods can broadly be categorised based on: Construction of warps: As part of the iterative process a warp can be updated by adding an update, or by warping the previous warps by an update. The former is often referred to as an elastic or a small deformation framework, as opposed to a plastic or large deformation framework for the latter (Miller et al., 2003). Note that for infinitesimally small displacements, composition of warps can be extended to the concept of integrating a velocity field (Beg et al., 2005). Regularisation of warps: Warps can be regularised, i.e., have smoothness imposed on them, by adding a penalty term to the costfunction, or by explicitly smoothing the updates.
Similarity measure: This is a scalar measure that assesses the similarity/difference between the images and whose maximisation/minimisation drives the registration.
The present paper is concerned mainly with the second of these, the regularisation, though we will touch upon the construction of warps and the effect of similarity metric choice as well. A thorough review of these categorisations can be found in Sotiras et al. (2013).
Historically, for the small deformation framework, one important aspect of the regularisation was to try to ensure that the warps stayed one-to-one and onto i.e., that each point in image A mapped to a unique point in image B, and that each point in image B could be reached from image A. This also means that the mapping is invertible. This was often attempted by adding a penalty term to the cost-function such that one simultaneously minimised the difference between the images (dissimilarity) and some function of the warps (regularisation). That function would often be borrowed from mechanics, such as bending energy (Bookstein, 1997), membrane energy (Amit et al., 1991) or linear-elastic energy (Miller et al., 1993) (whose formulations are summarised in ). However, none of those functions had any specifically biological relevance to the problem of mapping individual brains to each other. Moreover, they do not explicitly guarantee invertibility. So it was common practice to either ignore the issue of invertibility, or to empirically calibrate the relative weights of the image difference and the warp penalty so that the resulting warps were (almost) always invertible. But that typically meant a quite high weight for the regularisation term which resulted in an inability to properly model large deformations.
This limitation led to the development of the large deformation framework, where warps are updated by warping (resampling) the previous warp by an update warp (equivalent to a composition of multiple small warps). If both the previous and the update warps are invertible, this guarantees that the composed warp is also invertible. Therefore, as long as one ensures that each update warp is invertible the end result will be too. Furthermore, certain methods (such as LDDMM (Beg et al., 2005) and Geodesic Shooting (Ashburner and Friston, 2011)) seek solutions where the overall set of composed warps (or integrated velocity field) are as "smooth" (according to some criterion) as possible, rather than greedily smoothing at each step. The result is then that these methods are able to model large deformations while remaining invertible by taking many small update steps, leading to better matching than could be achieved within the small deformation framework.
However, invertibility is not the be-all and end-all for regularising warps. It does restrict the space of allowed warps, but some form of regularisation is still needed for choosing the optimal warp within that space. Moreover, it is no more clear in the large than in the small deformation framework how exactly that regularisation of the warps/ velocity fields should be performed. Even when maintaining invertibility one can, and frequently does, still obtain warps that correspond to implausibly large local changes in volume and/or shape (Ashburner and Ridgway, 2013;Mang and Biros, 2016;Coalson et al., 2018). For a more detailed discussion of some of the tools which address invertibility, see Section 4.2.
In this paper we suggest using a small deformation framework together with a penalty function that explicitly penalises changes in both volume and shape. Furthermore, this penalty approaches infinity as relative volumes approach zero (which would break invertibility), meaning that it can be given a small weight in order to allow large deformations and still guarantee diffeomorphic warps. Hence, it addresses the same issue of invertibility as LDDMM based methods as well as the issue of how to find a particular set of warps within that space. It should be stressed at this point that the penalty we propose is not the same, or even particularly similar to, previously suggested penalties based on some function of the Jacobian determinant (see for example Rohlfing et al. (2003) or Leow et al. (2007)). We should also be clear that even though we refer to this as an elastic deformation, the prior is not based on any linear elastic model with the ensuing small deformation assumption. We will refer to our penalty as the Symmetric Prior for the Regularisation of Elastic Deformations (SPRED). The mathematical formulation and theoretical basis of the penalty are explained in Section 2.3. It has been used in the past (Ashburner et al., , 2000, but the computational cost is such that at the time it was not practically feasible. With the advent of general-purpose computing on graphics processing units (GPGPUs) that is no longer the case, and we describe our implementation using NVIDIA's CUDA framework (NVIDIA, 2019) of a registration algorithm based on that penalty function in Section 2.4.
The rationale behind the suggested penalty function is twofold.
To minimise geometric distortions, i.e. changes in both shape and volume, and to ensure the warps are one-to-one and onto. To be symmetric in the sense that a geometric (lineal, areal or volumetric) change by a factor of 2 is penalised the same as one by a factor of 0.5.
The first desideratum comes from the notion that brain tissue implements function, be that function integration of signal in the grey matter or transmission of signal in the white matter. The implementation of that function will necessarily occupy some physical space and it would seem unlikely that a given function can be implemented in a very small fraction of space in one brain compared to in another. A penalty should therefore favour no volume changes at all. It should also rapidly increase as the relative volumes become unrealistic, approaching infinity as the volume changes get close to breaking the one-to-one and onto criteria, thereby never allowing negative relative volumes. But it should also prevent unrealistic shape changes. Given what we know about the structure of both grey and white matter it is hard to picture a situation where a function implemented in a 1 Â 1 Â 1 mm 3 cube in one brain might occupy a 0.01 Â 10 Â 10 mm 3 sheet or a 0.1 Â 0.1 Â 100 mm 3 stick in another. Note that this can all be reduced to penalising lineal changes since that would automatically also penalise shape and volume changes.
The second property, symmetry, comes from the desire that a registration algorithm should be inverse consistent, i.e., that the transform that is obtained from registering image A to image B is the inverse of that obtained from registering B to A (Christensen, 1999). A necessary, but not sufficient, condition for this is a prior that is symmetric in terms of penalising an expansion from A to B equally to the corresponding contraction from B to A.

Registration framework
We now broadly describe the registration framework within which our SPRED penalty is implemented in order that specific details of the implementation are more readily understandable. SPRED forms part of the development of our new MultiModal Registration Framework (MMORF) tool. MMORF is a volumetric registration tool whose underlying transformation model is a 3D cubic Bspline parametrised free-form deformation. The parameter space over which the optimisation is performed therefore represents the spline coefficients of three displacement fields (x, y and z-warps). A mean-squares (MSQ) similarity (or rather dissimilarity) metric is the data-consistency term optimised during registration. Regularisation penalties are calculated in the reference image space. A multi-smoothing, multi-resolution approach is employed to encourage convergence towards a globally optimal solution. From the outset MMORF has been designed to leverage GPU parallelisation, which is a largely why SPRED is computationally tractable within this framework.
The preferred optimisation strategy is the Levenberg variant of Gauss-Newton (Levenberg, 1944;Press et al., 2007). In this scheme, the Gauss-Newton Hessian H GN is replaced with H L , where: λ L is used to ensure that the update step always leads to a decrease in the cost function. If an update would successfully reduce the cost function λ L is decreased, otherwise it is increased and the update step re-evaluated. Additionally, the same strategy is used to ensure no update step is ever taken which would lead to non-diffeomorphic warps. This is achieved by checking that the resulting Jacobian determinants are all positive before accepting any update step, and modifying λ L accordingly. Due to memory constraints, the Hessian can only be stored on the GPU until a warp resolution of 5 mm isotropic. Beyond this resolution our implementation offers a choice between two optimisation methods that do not require F.J. Lange et al. NeuroImage 219 (2020) 116962 explicit representation of the Hessian. One is the Scaled Conjugate Gradient (Moller, 1993) method, which is related to other quasi-Newton methods and which uses the history of cost-function changes from earlier iterations to calculate a step-length along the next direction, thereby eliminating the need for line minimisations. The other is the Majorise-Minimisation (MM) method (Hunter and Lange, 2004) which replaces the Hessian with a diagonal matrix where the ith value on the diagonal is the sum of absolute values of the elements in the ith column of the Gauss-Newton Hessian.

Intuition
In the following section we will detail how to calculate the suggested penalty, and why it is an ideal, albeit challenging, candidate for Single Instruction Multiple Data (SIMD) style parallelisation on a GPU (NVIDIA, 2019). For the estimation of the warp parameters we use the Gauss-Newton method, so we will also need to calculate the gradient of the penalty with respect to the spline coefficients of the warps and a Gauss-Newton approximation to the Hessian.
We attempt now to give an intuitive outline of the technical/mathematical formalism which follows in Section 2.3. At each voxel one can calculate a Jacobian matrix using the values of the surrounding warp Bspline coefficients. The singular values of this matrix represent the orthogonal scalings of that voxel due to the warp. The SPRED penalty (described in Section 2.3.2) is a function of these singular values, and we can therefore calculate each voxel's contribution to the total penalty. Those calculations are a perfect match for SIMD parallelisation with one computational thread per voxel, where each thread performs identical calculations but on different data. The individual voxel contributions are subsequently summed (a reduction in GPU parlance) to obtain the total penalty for the warp.
In order to calculate the gradient of the penalty, one needs to apply the chain rule, which says that: Remembering that for example g might be a vector-valued function of a vector-valued argument x which would make df dg a vector and dg dx a matrix. In our case the contribution to an element of the gradient from a given voxel is a scalar valued function of the vector of singular values of the local Jacobian matrix. These in turn are each a function of the vector of elements of the Jacobian matrix. Finally, these elements are functions of the vector of spline-coefficients whose support includes that voxel. The first two factors are again calculated using one thread per voxel, while the third factor uses one thread per spline-coefficient. Similarly each element of the total gradient is calculated by one thread per spline-coefficient.
To understand how the Hessian matrix is calculated consider a matrix A where each row corresponds to a voxel and each column to a warp spline-coefficient, and where A ij is the rate of change of the penalty contribution from voxel i with respect to coefficient j. The Gauss-Newton approximation to the Hessian then becomes A T A. However, for practical reasons, A is never explicitly calculated or stored and instead the Hessian H is calculated directly. In this instance we use one thread per non-zero element of H for the final calculation, but there are a number of preceding steps which employ the same one thread per voxel approach as in the gradient calculations.

Theory
We now formalise the concepts introduced in Section 2.2 in terms of the mathematics required for incorporating the SPRED penalty into the optimisation strategy of our registration framework.

Jacobian matrix
Consider the generic problem of registering some 3D moving image g to a reference image f, where both are defined on R 3 . Let x, y and z be the 3 orthogonal directions in R 3 . We define a transformation t ! on the domain of f with parameters w ! , such that gð w ! Þ becomes our transformed moving image. At each point ðx; y; zÞ : Where ∂di ∂j is the partial derivative of the i component of d JThen describes how the image is locally scaled and rotated by the action of t ! . These actions can be decoupled by applying singular value decomposition (SVD) to J, such that: Uand V are unitary matrices describing the rotational effect of J, and will not be considered further as they do not geometrically distort the image in any way. S is a diagonal matrix containing the singular values and represents the orthogonal scaling effect of J. It is the individual elements s i of S which will be penalised in our regularisation, and we will refer to them as the Jacobian Singular Values (JSVs).

Penalty function
The penalty is based on the prior belief that the JSVs are drawn from a lognormal distribution. Such a penalty meets the biological plausibility arguments set out in Section 1 as: Now, we define Cð w ! Þ as the penalty (or cost) associated with t ! : Where F.J. Lange et al. NeuroImage 219 (2020) 116962 C ¼ total cost/penalty w ! ¼ transformation parameters From Equation (10) we recognise that the log 2 s i term represents our lognormal prior on the JSVs. However this prior is based on the distribution of singular values only in f. Therefore we have also included a term vð1 þjJ n jÞ to ensure that our penalty is truly symmetric in g by accounting for the total volume in both images being penalised.
In practice, computation of the logarithm in Equation (9) (and its associated effect on the gradient and Hessian) may add a significant computational overhead (Ashburner et al., 2000). We therefore apply the following approximation (derived in Appendix A): Recognising that: and that: where λ i ðA T AÞ denotes the ith eigenvalue of A T A and where s i ðAÞ denotes the ith singular value of A, we may then write: An important benefit of this approximation is that our cost function is now an explicit function of only the elements of J, and therefore we never need to calculate the actual SVD of J.

Transformation model
The previous sections have used a generic transformation t ! , but at this point it becomes necessary to define the actual transformation model used. As in Equation (5), we utilise a displacement field d ! where each of the three directional components (d x , d y and d z ) in R 3 is composed of a basis set of uniformly spaced, 3D, cubic B-splines (B x , B y and B z ). Our transformation parameters w ! are then defined to be the coefficients of each of the B-splines in our basis set. If M B-splines are required to fully cover the N voxels in our reference image f, then we require 3M parameters to fully describe d ! (i.e. M parameters for each displacement direction).

Gradient and Hessian
The penalty function as defined in Section 2.3.2 will be included into a Gauss-Newton style optimisation framework, and therefore we require the gradient and the Gauss-Newton approximation to the Hessian of Equation (14). We note that for a valid application of Gauss-Newton (i.e., for being able to use the Gauss-Newton approximation to the Hessian) the function being minimised must be of the form y ¼ 1 2 aðxÞ 2 , which Equation (15) is not (Chen, 2011). However, by making the substitution: it becomes of that form. This means that: and therefore that the Gauss-Newton Hessian approximation becomes: i.e., it introduces a factor 1 2cnð w ! Þ compared to how one would ''normally'' calculate the Gauss-Newton Hessian. The gradient rCð w ! Þ is a 3M Â 1 column vector (where M is the number of B-splines used to represent one directional component of the warp-field), whose mth element is of the form: Where : where n 2 V m means that the summation is over all voxels n for which the B-spline m has support, and where J ! n is a vectorised version of the Jacobian at the nth voxel and s ! n is a vector with the three singular values of that Jacobian.
Similarly the Gauss-Newton Hessian is a 3M Â 3M matrix whose jk th element is: Where : where n 2 fV j \V k g denotes summation over all voxels n for which both B-spline j and k have support. It should be noted that although 3M Â 3M can be a very large number, especially for high warp-resolutions/small knot-spacings, the vast majority of elements are zero because most spline combinations jk have no overlap in support. We will demonstrate the practicalities of actually calculating these entities in the sections which follow.

Implementation
We now present how we achieved computational tractability of the SPRED penalty, its gradient, and GN Hessian.

Parallelising calculation of gradient and Hessian
As with all but the simplest optimisation strategies, calculating the value of the SPRED penalty itself is not the computationally costly part of this algorithm. Rather, it is the corresponding gradient and Hessian calculations which are massively computationally complex. See, for example, Equation (19), where we have multiple applications of the chain rule to vector valued functions of vectors, leading to intermediate matrices. These intermediates then need to be multiplied together, further increasing the complexity. It is easy to see that this complexity is compounded even further in Equation (20) when calculating the Hessian. If these calculations were all performed sequentially the problem would become computationally intractable, and the suggested form of regularisation would be a mathematical nicety with no practical impact.
However we will demonstrate how the problem can be framed in terms of the application of multiple massively parallelised SIMD operations, rendering it realistic to use.
As an example, let us consider how one might parallelise the gradient calculation in Equation (19). Firstly we note that the part: depends only on entities dependent on which voxel n is currently being considered. Each of these 1 Â 9 vectors can therefore be calculated and stored independently by one GPU thread. We represent these precalculated vectors as an N Â 9 matrix (where N is the total number of voxels in image f) which we denote ∂c ∂J !
where the nth row is: The columns of ∂c ∂J ! can be thought of as partial derivative images and can be visualised as demonstrated in Appendix B. Note also that in these calculations one can use the approximation given in Equation (15) to avoid explicit computation of the JSVs. The next part: is ''constant'' in the sense that its elements are given by the spatial derivatives of the B-spline basis-functions and can be calculated (analytically) once and for all for a given knot-spacing. If we denote a columnvectorised version of a single 3D B-spline basis-function (whose dimensions are given by the knot-spacing) by B we can create a matrix ∂J ∂w ! containing all the elements we need as: where B ð*Þ denotes the spline basis-function spatially differentiated in the ith direction and N S denotes the total size of the 3D spline (determined by the knot-spacing).
Finally each element rC m of the gradient can be calculated by independent threads as: Where : and if M < m 2M it will select the columns of ∂c ∂J ! corresponding to the second row etc. The ' superscripts on n' and k' indicate transformed versions of n and k such that n' takes into account the offset of the mth spline into the image, and k' will select the corresponding column (regardless of row) in J.
Equation (25) can also serve as an illustration of some of the challenges of SIMD parallelisation. We have chosen a strategy where each element of rC is calculated by one thread. That choice means that we avoid having to synchronise writes to rC or having to perform reductions, both of which reduce occupancy and speed. The downside of that choice is that the different threads within a warp will access global memory in a non-coalesced fashion. This is because each thread will access a 3D sub-volume in each of the nine "partial derivative images" defined in equation (22), where these sub-volumes are offset by an integer multiple of the knot-spacing in one or more dimensions. In the supplementary material there are timings and more detailed examples of the challenges of parallelising the proposed regulariser.
So, in summary the calculation of the gradient can be divided into three parts: One which is separable over voxels (or more generally ''sample points'') where the calculations for one voxel can be performed and stored independently by one GPU thread. One which is ''constant'' and can be performed once and for all and stored where the results can be accessed by all threads. One which is separable over warp parameters (spline coefficients) where calculations for one warp parameter can be performed and stored independently by one GPU thread.
Following this line of reasoning, we may similarly divide the Hessian calculation of Equation (20) into portions which are constant, separable over samples, and separable over warp parameters.
As it is this process which is central to how using the SPRED penalty is made tractable, we provide an intuitive understanding of what implementing the parallelisation of equation (19) actually looks like in practice by considering a simple 2D example. This can be found in Appendix B.

Testing
As assessing the performance of a registration tool is non-trivial, we here try to strike a balance between considering some measure of dataconsistency, and some measure of anatomical plausibility. It is known that image similarity is a poor choice for measuring registration accuracy as it is just a proxy for what we are truly interested in, i.e., alignment of common anatomical structures (Rohlfing, 2012). Image-wide tissue maps are not a good option either, as they are global in nature and only really test tissue classification rather than anatomical alignment (Rohlfing, 2012). We therefore choose the overlap of manually segmented cortical regions as our data-consistency measure, which is an established method by which to assess the quality of registration (Klein et al., 2009).
The focus of this work is not primarily on assessing our method's ability to maximise data consistency. However, if a set of warps was unable to produce high quality overlap results then it would not be of interest irrespective of how plausible the deformation is. High registration accuracy therefore simply allows us to evaluate the anatomical plausibility of the warps in a meaningful way.
For any method, regardless of the specifics of the regulariser, there is a trade-off between the image similarity term and regularisation term. That trade-off is determined by their relative weights, something that is typically calibrated empirically (however see Simpson et al. (2012) for an attempt to determine it probabilistically). Therefore, in order to compare two methods in terms of plausible warps it is crucial to calibrate both methods so that they achieve (close to) identical overlap scores. We ensured a high registration accuracy by using ANTs (Avants et al. (2008)) with similar settings to those used in Klein et al. (2009) as a yardstick for registration accuracy. To make sure warp comparisons were meaningful we calibrated the other methods to yield very similar overlap scores. If we could not achieve that for a method we did not include that in the comparison of warps.
We use two main indices for assessing what we consider to be anatomical plausibility. One is the Jacobian determinant that quantifies local volume change. We consider both the histogram of Jacobiandeterminants within the brain, which provides information about how aggressively the warp is squeezing or expanding the volume on average, and how the Jacobian-determinants are spatially distributed, i.e., whether the deformation appears to be spatially sensible.
The second index is the cube-volume aspect ratio (CVAR, Smith and Wormald (1998)) which quantifies shape changes. In three dimensions it is the cube root of the ratio of the volume of the smallest possible enclosing cube to the actual volume. Similarly to the Jacobian determinant above we consider both the histograms and the spatial distributions of the CVAR.

Registration Tools
In order to make a meaningful and relevant evaluation of the effect of the SPRED penalty on registration performance, four volumetric registration tools were included in this comparison, namely: FLIRT (Jenkinson and Smith, 2001) We chose to include a linear registration method in order to provide a yardstick by which to compare the overlap scores, and in particular the differences in overlap, of the other methods.
ANTs (Avants et al., 2008(Avants et al., , 2014 ANTs has been shown to perform well in a direct comparison with other methods (Klein et al. (2009);Ou et al. (2014)). A method performing significantly worse would not be of any practical interest, so it provides a minimum accuracy bar for the other methods in the comparison. We ran ANTs both with cross correlation (ANTs-CC) (Avants et al. (2011)) and mean sum of squares (ANTs-MSQ) image similarity metrics. In both cases the greedy SyN transformation model was used.
FNIRT (Andersson et al., 2007) We included FNIRT because it uses the same image similarity and warp representation as MMORF, but employs a different strategy to ensure invertibility (Karacali and Davatzikos, 2004) and a different regulariser (bending energy).
MMORF Implements the regulariser (SPRED) that we propose in the present paper. In order to demonstrate just how different to a Jacobian Determinant regulariser (for example Rohlfing et al. (2003) or Leow et al. (2007)) SPRED is, we also implemented a regulariser based on the sum of squares of log-determinants of the Jacobians in the exact same framework.
Whilst there are myriad other tools that may also have been included in this comparison (e.g., SPM Unified Segmentation (Ashburner and Friston, 2005), DARTEL (Ashburner, 2007), elastix (Klein et al., 2010), MIRTK (Rueckert et al., 1999;Schnabel et al., 2001), Diffeomorphic Demons , ART (Ardekani et al., 2005)), the chosen methods represent a broad enough overview of the types of tools available for the purposes of characterising the performance of MMORF.

Registration parameter selection
An important aspect when comparing registration algorithms is ensuring that the user-selectable parameters are as optimal as possible (Klein et al., 2009).
FNIRT was run with an optimised strategy proposed by the tool's author. This consisted of a multi-resolution, multi-smoothing level registration, to a final knot spacing of 1 mm isotropic, and is explained in detail in Andersson et al. (2019).
ANTs-CC was run using the antsRegistrationSyn.sh script, but with the gradient step and smoothing values changed to match those supplied by the tool's author for use in previously published comparative studies (Klein et al., 2009). ANTs-MSQ was run with hand tuned parameters. Both versions of ANTs employed a multi-resolution, multi-smoothing level approach, to a final resolution of 1 mm isotropic.
MMORF was run using parameters that were shown to perform well during the development of the tool. Again, a multi-resolution, multismoothing level approach was used, but with a final knot spacing of 1.25 mm.
Note that whilst significant effort has been made to ensure that each tool performs as well as possible, the overarching goal was not to definitively classify which tool performed best in terms of overlap scores. But rather to investigate how the SPRED penalty affects the nature of the deformations when compared to similar performing tools. In other words, the main goal was to calibrate their respective performances in terms of overlap accuracy so that they were not significantly different. Further details regarding parameter selection can be found in Appendix C.

Test dataset
We have chosen to use the Non-rigid Image Registration Evaluation Project (NIREP) dataset (Christensen et al., 2006) in testing the registration methods. This dataset consists of the T1-weighted scans of 16 healthy subjects, eight male (average age 32.5 years) and eight female (average age 29.8 years). Each subject's scan has been brain extracted, bias corrected, and 32 cortical regions (16 left hemisphere, 16 right) have been expertly hand-segmented. The segmented regions provide a ground truth for anatomical correspondence of a finer granularity than simple tissue maps.

Evaluation strategy
The registration tools were evaluated using pairwise registrations of each combination of the 16 subjects (240 warps in total).
Each of the 32 segmented cortical regions were transformed through these warps and various overlap metrics calculated, namely: Jaccard Coefficient, Dice Coefficient, Specificity, Sensitivity. The distributions of these overlaps in each region were then compared between registration methods.
Additionally, distributions of Jacobian determinants and of cubevolume aspect ratios (CVAR, see Appendix D for a definition) were calculated in order to gain some insight into how aggressive each method is in terms of volume and shape changes. These were inspected and compared as histograms, spatial maps and by summary statistics such as min, max, mean, 5th and 95th percentiles. The comparisons were performed only for the methods for which we achieved overlap scores that equalled those of ANTs-CC.

Overlap scores
Combined left/right hemisphere overlap scores are shown in Fig. 1. All overlap measures showed similar trends, and therefore we focus only on the Jaccard Coefficient as our measure of choice. FLIRT, as expected, achieves the lowest levels of overlap, with mean overlaps varying depending on the region being considered. The change in overlap between FLIRT and the nonlinear methods also provides a useful yardstick against which to gauge the differences between those methods. All of the nonlinear methods, again as expected, improve the degree of overlap considerably. ANTs-MSQ has the lowest average overlap of the nonlinear methods. FNIRT has the second lowest overlap scores, although in certain areas it is outperformed by ANTs-MSQ. Note that the FNIRT results are significantly better than in previously reported comparisons (e.g. were unsuited for skull stripped data), but are in line with those from the tool's author (Andersson et al., 2019). ANTs-CC and MMORF perform the best and equally well on average, but one or the other performs better in individual areas. These observations are supported when one considers the overall distributions of overlap scores in Fig. 2 which display the same trends.
It is perhaps surprising that the two versions of ANTs produce such different results, given that they differ only in their choice of similarity metric. However, this is in line with what the creators of ANTs report themselves (Avants et al. (2011)) on the LPBA40 data set, so we believe this to be an accurate representation of their respective performances.
Because the best overlap performance we could achieve with ANTs-MSQ was significantly worse than that of MMORF and ANTs-CC we did not consider it meaningful to include it further in the comparison of warp metrics (Jacobian determinant and CVAR).
We did include FNIRT in one of the warp-comparisons, even though it did perform slightly worse than ANTs-CC and MMORF. That was because the difference was considerably less than for ANTs-MSQ, and also because it used the same image similarity metric and warp-construction as MMORF, differing only with respect to the strategy for ensuring invertibility and regularisation.
Our main interest was in the comparisons between ANTs-CC, being a very accurate and widely used method, and MMORF. The fact that we managed to calibrate the two methods so that they have almost identical overlap scores means that the warp distributions can be compared in a fair and meaningful way.
The joint distribution of overlap scores for MMORF and ANTs-CC for all regions and warps shown in Fig. 3 confirms that the two methods are on average very comparable. Fig. 4 shows the Jacobian determinant distribution of the warps generated with MMORF, ANTs-CC and FNIRT for a randomly selected image pair. We show both the non-log and log distributions since they highlight slightly different behaviours. It should be noted that although these warps represent a single pair of subjects, we could have chosen any of the 240 warps and the relationship would remain essentially the same.

Jacobian Determinant distributions
The Jacobian determinant distribution for FNIRT has a shape that is quite distinct from that of MMORF and ANTs-CC. It is seen very clearly in the non-log distribution where instead of tapering off smoothly towards zero there is an almost linear drop. This is most likely due to FNIRT projecting its warps onto the nearest B-Spline field which has no negative Jacobian determinants (Karacali and Davatzikos, 2004). The shapes of the distributions for MMORF and ANTs-CC are much more similar, but with ANTs-CC having a notably greater dispersion.
Of the three methods that had the most similar overlap scores it is clear that FNIRT causes substantially greater volume distortions, in addition to having slightly lower overlap scores. We therefore drop FNIRT from further analysis at this stage and focus on MMORF and ANTs-CC.
From the log distribution, the range of the 5th to 95th percentile for each method can be calculated. This was done for every warp, and the results are summarised in Fig. 5. It can be seen that both the mean and the dispersion of the ranges are considerably greater for ANTs-CC than for MMORF. Fig. 6 plots for each of the 240 registrations the log 5th-95th range for ANTs-CC against that of MMORF. In all but 4 cases the percentile range is smaller for MMORF. The FLIRT result provides a baseline for evaluating the improvement in overlap between linear and nonlinear methods. Overall ANTs-CC and MMORF perform similarly. In some areas ANTs-CC produces better results (e.g., cingulate gyrus), and in some area MMORF performs better (e.g., inferior parietal lobule). In all cases, both methods outperformed ANTs-MSQ and, to a lesser extent, FNIRT.

CVAR distributions
Unlike the Jacobian determinant, the mean (across sample points within the brain) of the CVAR is a meaningful statistic. We therefore use that as our summary measure of shape distortions for a given warp. Fig. 7 demonstrates that the mean is both lower on average, and has smaller dispersion for MMORF compared to ANTs-CC.

Spatial maps
In the previous section we showed that of the methods with comparable overlap scores ANTs-CC caused substantially more volume distortions than MMORF. In order to better understand the source of that difference we look at the spatial maps of the Jacobian determinants and of the CVAR. Fig. 8 shows maps of Jacobian determinants and Fig. 9 of CVAR for the same randomly selected pair of subjects as Fig. 4. There are clear visual differences between the results of the two methods. In areas where the T1-weighted signal intensity has strong contrast and carries information about the tissue type, such as along the cortex, both methods produce similar looking maps with varying amounts of expansion and compression. Where they differ in appearance is predominantly within areas displaying a relatively flat T1-weighted signal profile, such as in white matter. Here MMORF produces smoothly changing values with magnitudes close to 1. ANTs-CC by contrast displays highly oscillatory behaviour, with a higher proportion of values deviating away from 1.
As a T1-weighted volume contains minimal information within the white matter, there is little reason to believe that the deformation should be anything other than smooth within these regions. Clearly this has not detrimentally affected the performance of ANTs-CC in terms of achieving high cortical overlap scores. However, if this warp was used to transform a modality rich in white-matter information (such as diffusion tensor imaging) the results could potentially be very deleterious.
By contrast, the MMORF result is more balanced. In areas where there is a large amount of information to drive the deformation (tissue boundaries) the warps are correspondingly larger and more variable, Fig. 1, with MMORF and ANTs-CC performing best, followed by FNIRT and then ANTs-MSQ. The MMORF and ANTs-CC distributions are remarkably similar, despite the slight region-to-region differences in performance.  whereas areas with minimal information are left relatively unchanged. When looking at Fig. 9 it should be noted that the particular (randomly selected) registration we chose for displaying spatial maps happens to be one of 65 (out of 240) for which the mean CVAR was greater for MMORF than for ANTs-CC.

Discussion
We have implemented and investigated a previously suggested method for regularising warps by penalising volume and shape distortions  and Ashburner et al. (2000)). The regulariser has some very attractive properties, but is computationally very expensive. This meant that the implementation in the original paper used a voxel-by-voxel optimising scheme rather than a global optimiser, which can potentially cause order effects and convergence to a non-optimal minimum. We implemented the regulariser on a GPU, using a B-spline basis to represent the warps and a Gauss-Newton optimiser, resulting in run-times of under half an hour for a full multi-level Fig. 4. Jacobian determinant distributions -Jacobian determinant and log-Jacobian determinant distribution of a randomly selected registration pair from the NIREP dataset. Only values within the brain itself have been considered. As desired, all three methods produce Jacobian determinants with few to no negative values. There are distinct differences in appearance though. FNIRT is the most different, with a visibly heavier negative log-Jacobian tail. This is an effect of projecting the Bending-Energy regularised registration onto a field without negative Jacobian determinants. ANTs-CC displays the next widest log-Jacobian distribution, but with more evenly weighted tails.  Fig. 4. The distribution of this percentile range over all 240 registrations is shown for the two best performing methods. ANTs-CC displays a significantly higher range on average, as well as a greater dispersion of ranges when compared to MMORF. Fig. 6. Log-Jacobian determinant 5th to 95th percentile range -Scatter plot of MMORF and ANTs-CC log-Jacobian determinant 5th to 95th percentile ranges. The key feature here is that ANTs-CC produces systematically wider ranges compared to MMORF, with only 4 of the 240 registrations favouring ANTs-CC. This is despite ANTs-CC and MMORF producing comparable Jaccard coefficients overall, as shown in Fig. 3. registration. This has allowed us to run a large number of registrations and compare the results to one of the best performing algorithms in popular use (Avants et al. (2008); Klein et al. (2009)). We have been able to show that it performs as well as ANTs in terms of overlap scores, and that it achieves that with significantly less volume and shape distortions. Pennec et al. (2005) suggested a very similar penalty function based on the concept of a Riemannian manifold for the allowed (diffeomorphic) warps. Other regularisers (Burger et al. (2013)) with similarities to that investigated by us are based on models for hyperelastic materials. These are models that describe the stress-strain relationships of materials, for example rubber, which are not well described by linear elastic models. These models have also been used as regularisation for correcting susceptibility artefacts in diffusion-weighted MRI (Ruthotto et al. (2012)). Similar hyperelastic models have additionally been used to model cortical growth in the developing brain (Knutsen et al., 2010) and for regularisation of surface based cortical registration (Robinson et al., 2018).

Relationship to earlier work
It should be noted that neither the present work, nor that in Ashburner et al. (2000), is explicitly based on hyperelastic models. Nor do we believe that hyperelastic models are necessarily meaningful descriptive generative models for explaining anatomical differences between subjects. Any such model (see for example Van Essen, 1997) is unlikely to be simple enough to be described by a few material constants. The reasoning behind our regularisation function is purely empirical, based on fulfilling our criteria for a useful function and on proving to produce smooth and Fig. 7. Mean CVAR distributions -Distribution of mean CVAR within the brain over all 240 registrations is shown for the two best performing methods. ANTs-CC displays a significantly higher mean CVAR on average, as well as greater variance when compared to MMORF. Fig. 8. Jacobian determinant spatial maps -Spatial distribution of Jacobian determinants for MMORF and ANTs. Note the distinct visual differences within the white matter for ANTs-CC. The SPRED penalty has resulted in smooth changes within this region, whilst ANTs-CC has produced high frequency variations. As a T1 image contains little information within the white matter, the anatomical plausibility of those rapid variations is not clear. By maintaining smoothness within the white matter, MMORF produces a result that is more anatomically believable given the information present within the images being registered. Importantly, the irregularity introduced by ANTs-CC is not necessary for achieving high overlap scores, as Figs. 1 and 3 show. Finally, whilst this may not have any deleterious effect on overlap scores of grey matter regions, the effect of transforming a modality rich in white matter information through such a warp could be significant. plausible warps in parts of images with little or no anatomical information.

Beyond enforcing diffeomorphic warps
It is widely agreed that invertibility is a desirable property in a nonlinear spatial transform. There are two principle ways in which this can be enforced.
Warp Construction: Any warp that is a composition (integration in the limit) of diffeomorphic warps is itself diffeomorphic. Hence, methods that estimate warps as compositions of many small updates will by construction ensure that the end result is invertible.
Warp Penalisation: A highly nonlinear penalty function that goes to infinity as the local Jacobian determinant approaches zero will allow large deformations while maintaining invertibility. It should be noted that functions such as membrane energy or bending energy do not fall into this category.
For completeness both of these approaches will be discussed in more detail below. However, diffeomorphic warps are not the be all and end all. A warp can be invertible and yet highly unrealistic in that it causes very big volume and/or shape distortions. For this reason most methods that enforce invertibility using one of the methods mentioned above will combine that with one or more "traditional" regularisers that enforce smoothness. Ashburner and Ridgway (2013) have shown very convincingly that even within the space of diffeomorphic warps one will obtain very different solutions depending on the exact details and weights of the additional regularisers.
Hence, what we aim to achieve with our choice of regulariser goes much beyond just ensuring diffeomorphic warps. The aim is to find the maximally plausible, in terms of volume and shape distortions, of all possible warps within the space of diffeomorphisms. And to do this with a single penalty function (and a single weight) that achieves the joint objective of ensuring invertibility and simultaneously minimising volume and shape distortions.
We recognise that a judicious choice of forms and weights of additional penalty functions within inherently diffeomorphic frameworks, such as LDDMM and viscous fluid based methods, could potentially find an equally advantageous solution. But it is nevertheless the case that when comparing our regulariser to a state-of-the-art diffeomorphic method we were able to obtain invertible warps with equally good overlap scores and significantly less volume and shape distortions. Furthermore, there is no intrinsic superiority of inherently diffeomorphic methods over any other method that also guarantees invertibility, but achieves that with equally good, or better, registration accuracy.

Diffeomorphism by Warp construction
Methods such as ANTs fall into the category of inherently diffeomorphic warps. In particular, ANTs is an example of a greedy approximation of the general LDDMM method. LDDMM methods seek to find a time varying velocity field which minimises a metric based on total path length in the space of diffeomorphisms. Originally introduced by Beg et al. (2005), these methods guarantee that the total deformation remains diffeomorphic when the velocity field is integrated over sufficiently small timesteps. A downside to these methods is the large number of parameters which need to be estimated at each timepoint. Tools such as DARTEL (Ashburner, 2007) seek to overcome this by instead parametrising a stationary velocity field, thereby greatly reducing the parameter space, but at the expense of potentially larger path lengths. Subsequently, Geodesic Shooting (Ashburner and Friston, 2011) based methods have been able to reformulate the original LDDMM problem such that instead of estimating the entire sequence of time-varying velocity fields, only the initial velocity need be estimated, thereby greatly Fig. 9. CVAR spatial maps -Spatial distribution showing the Cube-Volume Aspect Ratio (CVAR) for MMORF and ANTs-CC. CVAR is a measure of shape change which extends the concept of aspect ratio to arbitrary dimensions, and is described in detail in Appendix D. Values above 1 correspond to increasing deviations from the original shape. Note that the appearance for MMORF and ANTs-CC is very similar to that in Fig. 8. reducing both the parameter space and convergence time of the optimisation. Whilst all of these methods are capable of ensuring warps remain diffeomorphic, they differ from our method in that the prior on which they are based is that deformations follow the shortest path-lengths, rather than deformations conserving the shape of underlying anatomy. As such, diffeomorphism is an intrinsic rather than a controlled property of the transformation model. It is certainly not impossible to include explicit regularisation of shape changes in these models, however it is not the norm.

Diffeomorphism by Warp Penalisation
An alternative method of enforcing diffeomorphism is by using a penalty function that goes to infinity as the local Jacobian determinant approaches zero.
Most commonly, the Jacobian determinant itself is used as a hard constraint that bounds the allowed range (Haber and Modersitzki, 2004;Sdika, 2008;2013;Haber et al., 2010), or fixes it to a particular value (Mansi et al., 2011), or a smoothly varying function of the Jacobian determinant is used as a soft constraint (Borzì et al., 2003;Yanovsky et al., 2007;Leow et al., 2007;Heyde et al., 2016;Mang et al., 2018). It should be noted that when modelling the warps as a velocity field the local divergence can be used as a proxy for the Jacobian determinant.
The crucial difference between our penalty and those based on a function of the Jacobian determinant is that the latter only penalises volume distortions. Not only does that mean that shape changes are not penalised, in practice such a function will result in a method where very large shape changes are used to circumvent the volume change limitations. In Appendix E we show the results obtained with a Jacobian determinant penalty function ( P n vð1 þ jJ n jÞlog 2 jJ n j) where the weight was calibrated so as to yield the same overlap score as our SPRED penalty. It can be seen that the Jacobian determinant penalty yields equally good overlap scores (Fig. E.1 and E.2) and results in warps that are diffeomorphic, with a very narrow range of Jacobian determinants (Fig. E.3 and E.5). However, this has been achieved by completely ignoring shape changes, and has resulted in much larger CVARs (Fig. E.4 and E.6).
Looking at a randomly selected warp (Fig. E.7) it is very clear that the Jacobian determinant penalty has resulted in lots of gratuitous distortions. This is the reason that many groups have combined a hard or soft constraint on volume changes with an additional regulariser (see for example Haber et al. (2010), Yanovsky et al. (2007), Leow et al. (2007) or Mang et al. (2018)). While that may work well, it means that an additional weight factor needs to be empirically determined.
There is another group of algorithms (Loeckx et al., 2004;Ruan et al., 2006;Staring et al., 2007;Modersitzki, 2008;Mang and Biros, 2016) that use functions that penalise deviations from local rigidity (i.e. anything beyond local translation and rotation). In that respect they have similarities to our penalty, but they have mostly been used to enforce near-total rigidity in selected parts of images containing a mix of rigid tissue (e.g., bone) and non-rigid tissue (e.g., muscle).
Within this category of registration methods, Loeckx et al. (2004) and Mang and Biros (2016) are probably the most closely related to ours of which we are aware. Mang and Biros (2016) acknowledge the issue of excessive shape changes and seek to combat it by explicitly penalising the shear of the velocity field. In Appendix A we show the relationship between our SPRED penalty and the local rigidity penalty of Loeckx et al. (2004) ( P n v J T J À I F ). We present the results of using this penalty in Appendix E, with the weighting calibrated so as to yield the same overlap score as SPRED. It can be seen that the local rigidity penalty yields equally good overlap scores (Fig. E.1 and E.2), but it is unable to preserve diffeomorphism. Whilst the CVAR values are mostly well controlled ( Fig. E4 and E6), looking at a randomly selected warp (Fig. E.7) demonstrates that topology is not being preserved in the cortex.

Other parallelised algorithms
It is increasingly common for registration tools to employ some form of parallelisation, and it is worth contextualising our approach in comparison to some of these methods.
Some of the Insight Segmentation and Registration Toolkit (Yoo et al., 2002) methods use CPU multithreading on a single device to accelerate the portions of their code which solve the linear system of equations necessary to compute update steps. This is a simple method to implement, but gains tend to be modest.
A recently described (Mang et al., 2018) parallelised version of Mang and Biros, 2016 utilises distributed-memory parallelism on multiple CPU nodes. It uses a Gauss-Newton optimisation strategy, and solves for a stationary velocity field transformation. The primary focus of the parallelism is in the inversion, rather than the calculation, of the Hessian.
Possibly the most closely related method to ours is NiftyReg (Modat et al., 2010) which employs a B-spline transformation and GPU parallelisation. NiftyReg differs in using a first order optimisation strategy and therefore does not require calculation of the Hessian. Reported performance improvements are very similar to ours, indicating comparable levels of code optimisation. Their impressive sub-minute runtimes are aided by a computationally simpler bending energy regularisation metric. Therefore a direct comparison of runtimes to our method using SPRED is not meaningful.
As discussed in Eklund et al., 2013, most applications of GPU parallelisation to medical image registration focus on speeding up existing methods. However, an interesting alternative is the development of more advanced registration methods which might otherwise have been rejected on the basis of computational complexity.

Registration accuracy
As we have previously stated, absolute registration accuracy is not the primary focus of this work. Instead it stands as a reference point for our discussion of anatomical plausibility, in that only two methods which are comparable in terms of registration accuracy can be meaningfully differentiated based on anatomical plausibility.
Based on the overlap scores in Section 3.1 we see that MMORF and ANTs-CC are clearly the best performing tools in terms of registration accuracy. Additionally there is very little to differentiate between their performance as a whole. We note that the results for ANTs-CC are slightly better than those which have been previously reported (Ou et al., 2014) for the same validation data set. We therefore believe that the way we have used ANTs-CC has yielded a close to optimal performance on this dataset.

Anatomical plausibility
Based on the summary 5th to 95th percentile log-Jacobian determinant range metric, MMORF is significantly more anatomically plausible than ANTs-CC. Fig. 6 supports this argument, by showing that this is true for the vast majority of the registrations. In other words, for a given registration accuracy one is almost guaranteed to have smaller volumetric changes when using MMORF over ANTs-CC. Another way of framing this result is to say that larger volume distortions are not a necessary trade-off for achieving high registration accuracy.
A striking feature of the Jacobian-range (Fig. 8) and CVAR maps (Fig. 9) are the seemingly gratuitous warps in white matter. This could potentially be a particular problem if one intends to use the structural registration to transform diffusion data. We believe this to be a strong case for extending the notion of anatomical plausibility beyond that of simply maintaining diffeomorphism and for building that notion into the regularisation.
Why this behaviour is observed in ANTs-CC is an interesting question in and of itself. At first this might be thought to be a case of insufficient regularisation, however the deformations generated by ANTs-MSQ (data not shown) did not display this same behaviour. Therefore we must posit that this is due to the somewhat scale-free nature of the CC metric. In other words, the gradient of CC does not depend on the amount of signal contrast present, only on the local correlation of that signal. What this means in practice is that regularisation on the level of the gradient cannot alter this behaviour, rather regularisation of the metric itself would be required. How to achieve this is beyond the scope of this work, however it highlights the importance of taking a considered approach to evaluating the believability of a deformation.

Limitations of the current work
The aim of the present paper is to introduce the SPRED penalty and to show that it can be used to yield large deformation, diffeomorphic and anatomically plausible warps. It is not yet a finished registration package that deals with differences in contrast, receive bias-field or B1 inhomogeneities. These will be the subject of future work. Neither have we applied it to data where very large deformations are needed, such as when registering images of severely atrophied brains or inter-species registration. Finally, whilst our regularisation penalty is symmetric, our similarity measure is not. Thus, for the overall algorithm to be truly symmetric we would ideally either modulate our similarity measure by ð1 þjJjÞ (Tagare et al., 2009), or simultaneously register both images to a mid-space (Avants et al., 2008).

Performance summary
The SPRED penalty allows MMORF to overcome the oft-quoted limitations of small deformation frameworks and achieve levels of registration accuracy comparable to state-of-the-art large deformation tools such as ANTs-CC. MMORF achieves this whilst at the same time producing warps which are consistently less aggressive in terms of both volume and shape distortions. Finally, given the information present in a structural scan, the spatial distribution of the log-Jacobian determinants MMORF produces appear more plausible compared to ANTs-CC, which produces variations that are not obviously necessary.
Overall we conclude that SPRED as implemented within MMORF produces warps which can be used with a high level of confidence. MMORF's registration accuracy is on par with state-of-the-art volumetric methods, and therefore no sacrifice need be made in terms of maximising data consistency.

Conclusion and outlook
We have demonstrated that our SPRED penalty is capable of matching the registration accuracy (as measured using overlap scores of manually segmented cortical regions) of the most well established largedeformation (ANTs-CC) framework available today. Additionally, we find that the results from using the SPRED penalty are consistently more anatomically plausible both in terms of Jacobian determinants and CVARs. We leverage GPU parallelisation in order to make optimisation of the penalty computationally tractable, allowing this method to be of practical benefit as part of a useable registration framework.
We have shown that using a regularisation function such as this can overcome the shortcomings of elastic small deformation frameworks, yielding the best of both worlds: large deformation, diffeomorphic warps with minimal distortions and high registration accuracy. Future work will focus on extending MMORF to include the simultaneous registration of scalar and tensor modalities (see, e.g., Irfanoglu et al. (2016)). Furthermore we wish to investigate the effect of spatially varying the weighting of the SPRED penalty based on prior information regarding tissue variability (e.g., allowing larger deformations within the ventricles).

Declaration of competing interest
We have no conflicts of interest to declare.
∴xðzÞ ¼ e z (A.4) Then: yðzÞ ¼ e z þ e Àz À 2 (A.5) Now, taking the Taylor expansion of yðzÞ about z ¼ 0: And finally, using Equation  For singular values between 0.25 and 4 the approximation is accurate to within 17%, and within 42% between 0.125 and 8. Note that the approximation maintains the desirable symmetry for expansions and contractions. Note too that the approximation majorises the exact penalty (i.e., the approximation is always greater than or equal to the exact penalty), and therefore maintains the exact penalty's ability to ensure diffeomorphism.

Appendix A.2. Interpretation
As shown in Equation (15) of the main text, utilising the approximation leads to a penalty in terms of the local Jacobian containing the term: Which may be rearranged as follows: where k k 2 F is the squared Frobenius norm.
This penalty is minimised to 0 when J T ¼ J À1 ⇔ J is an orthogonal matrix. This provides an interesting and intuitive alternative interpretation of the approximate SPRED penalty as one that penalises transformations for which the local deformation deviates from a pure rotation. Under that interpretation, the similarity to the Local Rigidity penalty jJ T J À Ij F (A.14) as used by Loeckx et al. (2004); Staring et al. (2007) becomes apparent.
The primary difference between these two methods is that Equation A.13 includes J À1 , and consequently tends to ∞ as jJj tends to 0. The SPRED approximation is therefore able to preserve diffeomorphism, whereas Local Rigidity is not.

Appendix B. 2D Example
We begin by defining a displacement warp field and its values in the underlying image space. The image space is a square with dimensions 21 Â 21. The cubic B-splines are placed with their control points (knots) 5 samples apart. Each spline has a spatial support radius of 2 knots (10 samples), and therefore we place an extra ring of splines around the perimeter of the image space, such that each point in the image has an equal number of splines with support there. As a result, our x-warp and y-warp parameters are of dimension 7 Â 7. We initialise the warp with a random set of parameters. Fig. B1 shows the warp parameters, the resulting warp in image space, and the gradient of the warp in image space.
Based on the spatial gradients of the displacement fields in Fig. B1, the local Jacobian matrix can be calculated at each point in space according to Equation (7).
The elements of J for each point in the sample space, as well as the resulting Jacobian determinant, jJj, are shown in Fig. B2, which now contains all of the information necessary to calculate the voxelwise part (Equation (22)) of the calculations for the gradient (Equation (19)).
Based on the elements of J, Fig. B3(a) shows the resulting contribution to the total cost from each point in the image.  (22)). Each point in the four partial derivative images can be calculated completely independently of every other point. The calculation of these images therefore constitutes the first parallelised step of our algorithm. We now move onto visualising the result of Equation (25). Note that Equation (24) is constant across the image and identical over the support of every spline. Therefore, when calculating the gradient in a parallel-per-parameter regime, the operation conforms exactly to the SIMD paradigm and may be very efficiently implemented on GPUs. In practice, we perform one such calculation for each combination of J element gradient image ( Fig. B3(b) to B3(e)) and warp direction (x or y), the results of which are shown in Fig. B4(b) to B4(e). Finally, we sum over these combinations for each warp direction, and the result is then the gradient of our penalty with respect to our warp parameters. This is shown in Fig. B.4(f) and B.4(g) .   Fig. B.1. 2D B-spline displacement field and spatial derivatives -Example of a warp (or technically displacement) field for a random set of parameters (cubic Bspline coefficients). The positive x-direction is defined from left to right, and the positive y-direction from top to bottom. The x-warp parameters w x (a) and y-warp parameters w y (b) are arranged on a regular 7 Â 7 grid. The resulting x-displacement field D x (c) and y-displacement field D y (d) corresponding to w x and w y are evaluated on a regular grid with knot spacing of 5 Â 5. Note that only those parts of the warps with full support of the surrounding B-splines are shown (i.e., the area outside of the green square is considered to be outside of the image we are warping). This leaves a square of 21 Â 21 samples in image space. The spatial partial derivatives of the x-warp in the x-direction (e), y-warp in the x-direction (f), x-warp in the y-direction (g) and y-warp in the y-direction (h) are presented, as these values are used in determining the local Jacobian matrices of the warp field. Cubic B-spline parametrisation ensures that both the warps and their spatial derivatives have analytical solutions and are smooth-continuous.   Fig. B2 a cost field C (a) can be evaluated at each position ½x; y of the image. C then represents the contribution to the total cost from each point in image space. The partial derivatives of C with respect to the elements of J are shown in sub-figures (b) to (e). These partial derivative fields (which together make up the overall gradient ∂C ∂J ), can be calculated independently for each position ½x; y as they depend only on the values of J at ½x; y, making this step ideal for a parallel-per-voxel approach.

Appendix C. Parameter Selection
Parameter selection in image registration is non-trivial and often leads to discussions regarding whether or not the particular parameters selected were optimal. As such, the aim of parameter selection in this work was not necessarily to find the absolutely optimal set of parameters for each method, but rather to find parameters which made the various methods equal in one metric (such as accuracy in terms of overlap scores), in order so that differences in another metric (such as aggressiveness in terms of Jacobian determinant distributions) can be meaningfully compared.

Appendix C.1. MMORF
We fixed the number of steps in the warp resolution pyramid (40, 20, 10, 5, 2.5 and 1.25 mm isotropic), and then optimised over three parameters, namely: Smoothing as function of warp resolution (λ s ) Initial regularisation (λ ri ) Rate of decrease of regularisation (λ rd ) The FWHM of the smoothing kernel was defined as the current warp resolution divided by λ s . The initial regularisation λ ri is in arbitrary units and its exact value is not of interest. The total regularisation at each step of the resolution pyramid was λ ri Â ðλ rd Þ it where it is the current position in the resolution pyramid (i.e. 0 to 5).
The parameter values were chosen via a grid search of the parameter space. Half of the NIREP dataset (8 subjects, 56 warps) were used, and overlap scores calculated as per Section 2.5.4. The parameters which resulted in the best overlap scores were then used for the rest of the analysis. The final parameters were: These parameters are relatively aggressive, but as we were aiming for maximum overlap scores this is to be expected. . 2D cost field and warp parameter gradient -The final, and most computationally expensive, step is calculating the gradient of the total SPRED penalty (being the sum of C over all x and y) with respect to the warp parameters. Given that we have the gradient of C with respect to J, knowing the gradient of the resulting partial derivative fields (i.e., Fig. B3 (b)-(e)) with respect to the warp parameters would allow us to apply the chain rule to calculate this. We proceed by recognising that the piece-wise polynomial nature of B-splines means that the gradient of J with respect to each warp parameter is constant. Additionally, the gradient at position ½x; y is exactly 0 for any B-spline which does not have support at that position. Thus, we can calculate the total gradient with respect to each warp parameter by summing the result of the chain rule across each position ½x; y where that spline has support. In (b) to (e) we can see this illustrated in terms of the contributions to the overall gradient due to the effect of the warp parameters w x and w y on each element of the J fields. These partial derivatives can be calculated independently for each warp parameter, making this step ideal for a parallel-per-parameter approach. The end result of the process is then the gradient of the total penalty with respect to w x (f) and w y (g).

Appendix E.2. Local Rigidity Penalty
In Appendix A we show that our approximation to the SPRED penalty can be interpreted as promoting transformations which are locally rigid. However, the SPRED penalty is considerably more computationally complex than alternative methods which also promote local rigidity. We therefore implemented the following much simpler penalty in order to determine whether the additional complexity is warranted or not: We registered all subjects of the NIREP dataset to subject NA01 only (i.e. 15 warps in total) using the three different regularisation methods. The regularisation weighting was chosen such that the average overlap scores were comparable across all methods. We then calculated overlap scores as in Section 2.5.4 of the main text. For each warp we then calculated the minimum, maximum, mean, 5th and 95th percentile for both the Jacobian determinant and CVAR metrics. Overlap scores are presented in Fig. E1 and E2. Jacobian determinant ranges are presented in Fig. E3, and mean CVAR in Fig. E4. Spatial maps of both Jacobian determinants and CVAR are presented in Fig. E5 and E6. Finally, Fig. E7 shows a grid deformed under each regularisation scheme.
Fig. E.1. Jaccard coefficients for all subjects to NA01 -Jaccard coefficients of 32 cortical segmentations for registration to subject NA01 of the NIREP dataset. No one method was consistently superior or inferior across all regions leading to comparable results between the methods in terms of overlap scores.  Log-Jacobian determinant range distributions -5th to 95th percentile range of Jacobian determinants for warps from all subjects to NA01 of the NIREP dataset. As would be expected penalising the log-Jacobian determinant directly leads to a much tighter distribution of Jacobian determinants about 1 than penalising the Jacobian singular values. The local rigidity penalty performs worst in this regard, with consistently more extreme Jacobian determinants.
Fig. E.4. Mean CVAR distributions -Distribution of mean CVAR within the brain for warps from all subjects to NA01 of the NIREP dataset. We observe that SPRED clearly best preserves the original shape of the underlying anatomy during deformation. The local rigidity penalty, while not quite as good as SPRED, still consistently outperforms the log-Jacobian determinant penalty.
Fig. E.5. Jacobian determinant spatial maps -Spatial distribution of Jacobian determinants for SPRED, logjJj, and local rigidity regularisation. Penalising the volumetric changes directly clearly leads to the flattest Jacobian determinant maps. As the local rigidity penalisation does not guarantee diffeomorphism we observe a number of areas where the Jacobian determinant is either very close to 0, or negative, particularly in and around the cortex. SPRED is somewhere between the other two methods, which agrees with our observations in Fig. E3.

E.4. Discussion
From Figs. E1 and E2 we see that the choice of regularisation metric need not affect the overall registration accuracy. From Fig. E3 we note that, as expected, penalising the Jacobian determinant directly leads to warps with less volumetric changes as compared to both SPRED and local rigidity. However, Fig. E4 shows that this comes at the cost of increased shape distortion as measured by CVAR, where both SPRED and local rigidity are far more adept at controlling implausible shape changes. Fig. E5 and E6 provide a clearer understanding of the information in Fig. E3 and E4. The extent to which volumetric and shape changes are traded off by the different regularisation schemes is clearly evident. Finally, from Fig. E7 we see the extent to which extreme CVAR and Jacobian determinant values translate into more convoluted warps with reduced spatial smoothness as compared to SPRED.
Fig. E.6. CVAR spatial maps -Spatial distribution of CVAR values for SPRED, logjJj, and local rigidity regularisation. It is immediately clear that the flat Jacobian determinant maps for the logjJj penalty are accompanied by a marked increase in CVAR, confirming that volumetric control has come at the cost of extreme shape changes. Local rigidity penalisation leads to excellent preservation of shape within the white matter, but shows extreme distortion within some areas the cortex. SPRED performs considerably better than either of the other methods, confirming our observations in Fig. E4. Penalising the Jacobian determinant alone has led to extremely convoluted warps, whereas penalising the Jacobian singular values results in smoothly varying warps which are far more plausible. Interestingly, penalising local rigidity has led to a very similar deformation to SPRED, with the exception of areas where topological folding or collapse has occurred. These areas are particularly evident within the cortex and at tissue boundaries.
By combining the log-Jacobian determinant and local rigidity penalties, in a similar manner to Staring et al. (2007), it may be possible to overcome the limitations of each penalty. However, this would introduce additional regularisation weights to tune. In this respect the SPRED penalty is far more elegant, as it effectively controls both volumetric and shape changes, preserves topology, and requires only a single tuning parameter.
In conclusion, this example shows that simply penalising the Jacobian determinant is insufficient to ensure plausible deformations, and that the additional complexity of SPRED compared to a simpler local rigidity penalty is justified.