Probabilistic non-linear registration with spatially adaptive regularisation

This paper introduces a novel method for inferring spatially varying regularisation in non-linear registration. This is achieved through full Bayesian inference on a probabilistic registration model, where the prior on the transformation parameters is parameterised as a weighted mixture of spatially localised components. Such an approach has the advantage of allowing the registration to be more flexibly driven by the data than a traditional globally defined regularisation penalty, such as bending energy. The proposed method adaptively determines the influence of the prior in a local region. The strength of the prior may be reduced in areas where the data better support deformations, or can enforce a stronger constraint in less informative areas. Consequently, the use of such a spatially adaptive prior may reduce unwanted impacts of regularisation on the inferred transformation. This is especially important for applications where the deformation field itself is of interest, such as tensor based morphometry. The proposed approach is demonstrated using synthetic images, and with application to tensor based morphometry analysis of subjects with Alzheimer's disease and healthy controls. The results indicate that using the proposed spatially adaptive prior leads to sparser deformations, which provide better localisation of regional volume change. Additionally, the proposed regularisation model leads to more data driven and localised maps of registration uncertainty. This paper also demonstrates for the first time the use of Bayesian model comparison for selecting different types of regularisation. © 2015 The Authors. Published by Elsevier B.V.


Introduction
Non-linear image registration is a fundamental tool in medical image analysis with a great many applications (Sotiras et al., 2013). One widely explored application of non-linear registration is the analysis of human brain morphology from structural magnetic resonance (MR) images. In this context, non-linear image registration has been used to accurately quantify localised cross-sectional differences between populations, such as subjects with Alzheimer's disease (AD) compared to normal ageing. It has also been used to measure longitudinal changes within individuals. Differences in morphology between populations can be identified using approaches such as tensor based morphometry (TBM) (Ashburner and Friston, 2000; Chung et al., 2001), where statistical analysis is performed on the Jacobian tensor of deformation fields calculated from registering individual subjects to a common space. TBM offers a whole brain approach to statistical analysis, and has the potential to extract rich features that accurately summarise anatomical differences.
TBM features are wholly defined by the registration process, which is complicated by the fact that non-linear registration is an ill-posed problem. In a typical structural MR image there are more than one million voxels in the human brain, where the intensity of a voxel is a noisy surrogate of tissue type. As such, there is a great deal of ambiguity in matching intensities, making it implausible for a unique voxelwise mapping to be determined purely from the image data.

Regularisation
As no unique mapping can be determined purely from the data, a "reasonable" mapping between images is sought. This is achieved through the use of a data matching term and regularisation, which maximises the similarity of image appearance whilst maintaining a plausible deformation, i.e. with an appropriate magnitude of displacement and spatial smoothness. Regularisation can be considered as a prior on the set of expected deformations, which reduces the space of potential solutions and hence limits the variance of any estimated solution. The form of bias induced by the prior is generally selected based on some physical model of deformation, such as linear elasticity (Miller et al., 1993) or thin-plate spline bending energy (Bookstein, 1997).
Regularisation models are commonly described as having the same effect across the image. However, such models may well be unreasonable in brain registration for two reasons: Firstly, different regions of the image contain different amounts of information. Uninformative image areas should be strongly influenced by the priors as they contain little information, whereas feature-rich regions should be given more freedom. Furthermore, the magnitude of anatomical mis-correspondence is likely to be variable across space, and some regions will require more complex deformations than others to allow an adequate mapping. Therefore, the use of a global spatial regularisation prior may introduce either an unwanted or insufficient bias on the deformation in certain image regions. This could have substantial adverse effects on an application, such as TBM, which directly relies on the interpretability of the deformation field.

Previous approaches to spatially varying regularisation in registration
There have been several previous works on the use of spatially varying regularisation in non-linear registration. These include approaches that vary based on tissues or structures derived from segmentations (Lester et al., 1999; Davatzikos, 1997; Staring et al., 2007; Schmah et al., 2013). These approaches are ideal in cases when an informative deformation prior is known for a specific region or tissue type, which can be robustly defined. However, in the majority of registration applications, this is unlikely to be the case.
More data driven approaches have been proposed, which include anisotropic smoothing of image similarity gradients according to image information (Hermosillo et al., 2002;Papież et al., 2013). Alternative approaches include weighting similarity gradients based on measures of local image reliability (Tang et al., 2010). These approaches allow the image information to affect the local regularisation strength, although are still somewhat ad-hoc, being dependent on the definition of a heuristic weighting between regularisation and data fidelity.
Inference of geometric deviation from an estimated atlas is an alternative approach to defining regularisation priors. Allassonnière et al. (2007) proposed a small deformation Bayesian framework for atlas estimation and registration. Gori et al. (2013) proposed a Bayesian approach for estimating an atlas and structure-specific regularisation terms for a registration model based on the metric of currents. A recently published approach by Xu et al. (2014) derives an average atlas and a spatial distance metric based on the geometric variability of the atlas. Zhang et al. (2013) proposed a generative registration model using geodesic shooting for atlas and regularisation estimation; this work was extended to sparsely estimate the principal geodesic modes of variation (Zhang and Fletcher, 2014). Durrleman et al. (2013) also estimate sparse parametrisations of variability from an estimated atlas.
Most similarly to this work, Risholm et al. (2010b, 2013) presented a Bayesian inference scheme that allows linear elastic parameters to be inferred from the data. These parameters can also vary spatially, as demonstrated by Risholm et al. (2011b). This approach does not require the definition of strong heuristics, although informative priors are required for the elastic model parameters. The limitation of the framework lies in the numerical integration inference strategy, which incurs a vast computational cost. Modern sampling techniques may help alleviate this burden (Zhang et al., 2013).

Contribution of this paper
This paper proposes a novel non-linear registration model and Bayesian inference scheme that allows for data-driven spatially varying regularisation. This approach alleviates the difficulties associated with previous attempts at spatially varying regularisation. Firstly, it is fully data driven, requiring no segmentations or informative priors. Secondly, the trade-off of data fidelity and regularisation is inferred directly from the data and finally, inference is tractable.
This work follows from our previous conference paper (Simpson et al., 2013b), with a second-order inference scheme for the regularisation parameters, a full mathematical derivation and broader validation. Additionally, this paper investigates objective Bayesian model comparison and the effects of the spatially varying prior on registration uncertainty. The proposed framework describes registration using a hierarchical probabilistic model, with a transformation prior that is parameterised by a set of hyper-parameters. Each hyperparameter influences a spatially localised region of the prior. Through the use of full Bayesian inference, posterior distributions of hyperparameter weights can be inferred alongside the transformation. This allows the effects of the prior to be locally determined during the registration.
This approach is demonstrated through an application of TBM on synthetic images, as well as comparing subjects with AD to healthy controls. Our results demonstrate the strength of our approach in terms of reducing false positive results, which may improve interpretability. We also highlight additional benefits of the proposed framework including: objective comparison of regularisation models, and more reasonable uncertainty estimates of the deformation fields.

Model
Image registration can be described in a probabilistic manner using a generative model of the target image, y, which is predicted by the deformed source image, t(x, w). Here, t is a transformation model, x is the source image and w parametrises the transformation. In this paper, a cubic B-spline free form deformation model (Rueckert et al., 1999;Andersson et al., 2007) is used for t, with w corresponding to the control point displacement. However, in principle any deformation model could be used.
The generative model also contains an additive noise term, e, which describes the error in model fit. In this work, e is modelled as independently and identically distributed across voxels and follows a normal distribution:

e ∼ N(0, (αφ)⁻¹I), (1)

where I is an identity matrix the size of the number of voxels, N_v. φ corresponds to the noise precision (inverse variance) of the additive Gaussian noise under the assumption of being independently distributed. α corresponds to the virtual decimation factor (Groves et al., 2011), which is a data driven term used to compensate for spatial covariance in the residual, weakening the assumption of independent noise. The assumption of identically distributed noise could also be relaxed in this approach, as in Simpson et al. (2012a). The full generative model for registration is therefore given as:

P(y | w, φ) = N(y; t(x, w), (αφ)⁻¹I). (2)
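As a concrete check of this noise model, the sketch below simulates the generative process and verifies that the residual variance matches (αφ)⁻¹. The linear "warp" and all parameter values are illustrative stand-ins, not the paper's settings:

```python
import numpy as np

# Toy simulation of the generative model y = t(x, w) + e, with
# e ~ N(0, (alpha * phi)^-1 I).  The scalar "warp" standing in for
# t(x, w) and the parameter values are purely illustrative.
rng = np.random.default_rng(0)

n_vox = 100_000
phi = 4.0        # noise precision (inverse variance)
alpha = 0.5      # virtual decimation factor down-weights the effective precision

source = rng.normal(size=n_vox)          # flattened source image x
warped = 0.9 * source                    # placeholder for t(x, w)

noise_sd = 1.0 / np.sqrt(alpha * phi)    # effective std dev, (alpha*phi)^(-1/2)
target = warped + rng.normal(scale=noise_sd, size=n_vox)

residual = target - warped
print(residual.var())                    # close to 1/(alpha*phi) = 0.5
```

With α < 1 the effective precision is reduced, mimicking the down-weighting of spatially correlated residuals described in the text.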

Prior distributions
Prior information is used to constrain the parameters of the model to plausible values. The noise in model fit, φ, is well defined by the data, so an uninformative Gamma distribution prior can be used, P(φ) = Ga(a₀, b₀), where a₀ = 10¹⁰ and b₀ = 10⁻¹⁰. As motivated in Section 1.1, for the problem of non-linear registration an informative prior on the transformation parameters, p(w), is required to ensure a reasonable result.

Priors on transformation parameters
Spatial regularisation for non-linear registration can be encoded as a prior on the transformation parameters. Commonly such priors penalise deviation from the identity transformation, functioning as an elastic type of regularisation. Here, the prior on w is described using a multivariate normal distribution:

p(w) = N(w; 0, Σ).

The mean of the prior is set to 0, representing the identity transformation. Σ describes the expected variance and covariance of the transformation parameters. This definition allows the specification of highly complex and rich priors. Most commonly, bending or linear elastic energy priors have been encoded in such a form (Ashburner and Friston, 1999). Simpler constraints, such as penalising the magnitude of the deformation parameters, could also be straightforwardly included.

Multiple sparse priors
In this work, the multiple sparse priors (MSP) approach of Friston et al. (2008) is adopted to allow spatially varying regularisation for non-linear registration. The MSP model was previously demonstrated for use in the M/EEG inverse problem. Friston et al. define the prior covariance matrix to be a weighted mixture of n covariance components:

Σ = ∑_{i=1}^{n} exp(λ_i) Σ_i,

where each component is chosen to have limited spatial support, making Σ_i a spatially localised covariance component. The number and form of these components is optional. λ_i is a scalar weight associated with each covariance component that is inferred from the data. λ_i appears within an exponential to ensure a positivity constraint on the weighting factor for each Σ_i. As in Friston et al., the prior covariance components, Σ_i, are constructed from columns of a spatial coherence prior, G. Here, G is a squared exponential Gaussian process (GP) prior (Rasmussen and Williams, 2006), which can equivalently be considered as the Green's function of a discrete diffusion process (Harrison et al., 2007). The graph encoding the distance between nodes is an adjacency matrix, A, where A_ij = 1 when transformation parameters w_i and w_j are spatially adjacent, and 0 elsewhere. G can be written as:

G ≈ ∑_{m=0}^{4} (σ^m A^m) / m!

The parameter σ controls the local coherence between adjacent control points, and takes values between 0 (independence of parameters) and 1 (maximally correlated). This approximation to the Green's function only accounts for 4th order neighbouring control points, as defined by the maximum value of m, which allows sparse priors with compact spatial support. For non-linear registration, the consideration of 4th order covariance neighbours provides an adequate balance between connectedness and sparsity. For a given prior component, Σ_i = g_i g_iᵀ, where g_i is the ith column of G. Each prior component, Σ_i, strongly controls the variance of a control point displacement, in a given direction, and the covariance with neighbouring control points, with a weaker influence on these neighbours' variance.
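The prior construction above can be sketched for a small one-dimensional chain of control points; the chain adjacency, σ value and λ weights below are illustrative assumptions:

```python
import math
import numpy as np

# Sketch of the multiple sparse priors construction on a 1-D chain of
# control points.  G approximates the Green's function of a discrete
# diffusion process by a truncated matrix exponential of the adjacency
# matrix A; each component is built from a column of G, and the prior
# covariance is their exp(lambda_i)-weighted mixture.  The chain layout,
# sigma value and lambda weights are illustrative assumptions.
n = 8
A = np.zeros((n, n))
for i in range(n - 1):                      # chain adjacency: immediate neighbours
    A[i, i + 1] = A[i + 1, i] = 1.0

sigma = 0.1                                 # local coherence between adjacent points
G = sum(np.linalg.matrix_power(sigma * A, m) / math.factorial(m)
        for m in range(5))                  # truncated at m = 4

lam = np.full(n, -2.0)                      # log-weights, one per component
components = [np.outer(G[:, i], G[:, i]) for i in range(n)]
Sigma = sum(np.exp(l) * C for l, C in zip(lam, components))

print(Sigma.shape)                          # prior covariance over all control points
```

Because each rank-one component is positive semi-definite and the exponentiated weights are positive, the mixture Σ is guaranteed to be a valid covariance matrix.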
The scale of this component is dictated by the exponential of its control parameter, λ i , which is inferred from the data.
Due to the exponential parametrisation of λ_i, this effectively functions as a log-normal hyperprior on the weight of each Σ_i. The selection of P(λ) is discussed in Section 2.5, and the rationale for choosing a normal prior, as opposed to a Gamma distribution, which was used as a prior on a single regularisation parameter, is discussed in Section 2.3.1.

Model inference
The generative model and priors defined in the previous sections describe a hierarchical probabilistic model that is described graphically in Fig. 2. Bayesian inference is used to infer the unobserved random variables in this hierarchical model. Numerical integration approaches, such as Markov chain Monte Carlo, are often computationally prohibitive in problems with many parameters. For this reason, mean-field variational Bayes (VB) (Attias, 2000) was chosen as the inference strategy. VB allows tractable, approximate full Bayesian inference, and has been previously demonstrated for use in high resolution non-linear registration (Simpson et al., 2012b).
VB approximates the posterior distribution of model parameters using parametric distributions. In this work, mean-field VB is used, hence the posterior distribution on the model parameters is approximated as:

q(w, φ, {λ}) = q(w) q(φ) ∏_i q(λ_i). (6)

The variational Bayesian cost function is the negative variational free energy, F, which is a lower bound on the log model evidence (Beal, 2003). As F = log P(y) − KL, where KL is the always-positive Kullback-Leibler distance between the unknown true posterior and our approximate posterior distributions, the maximisation of F leads to the minimisation of KL. The derivation of F for this model is given in Appendix A, and a condensed form is given in Eq. (14).
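The decomposition F = log P(y) − KL can be checked numerically on a conjugate one-dimensional Gaussian toy model, where both the evidence and the true posterior are available in closed form; the observation value is arbitrary:

```python
import numpy as np

# Numerical illustration of F = log P(y) - KL(q || posterior) for the
# conjugate 1-D model y ~ N(w, 1), w ~ N(0, 1).  With a Gaussian q(w),
# F is maximised exactly when q matches the true posterior, at which
# point KL = 0 and the bound is tight.  The observed value is arbitrary.
y = 1.3
post_mean, post_var = y / 2.0, 0.5                            # exact posterior N(y/2, 1/2)
log_evidence = -0.5 * np.log(2 * np.pi * 2.0) - y**2 / 4.0    # marginal y ~ N(0, 2)

def kl_gauss(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ), closed form."""
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def F(m, v):                                # negative free energy of q = N(m, v)
    return log_evidence - kl_gauss(m, v, post_mean, post_var)

print(F(post_mean, post_var) - log_evidence)   # 0: bound is tight at the posterior
print(F(0.0, 1.0) < log_evidence)              # any other q gives a strictly lower F
```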
Typically, the functional forms of the approximate posterior distributions can be derived algebraically from the model formulation. In this case:

q(w) = N(μ, Υ), q(φ) = Ga(a, b),

where μ is the mean of the posterior distribution on the transformation parameters, and Υ describes the posterior covariance of these parameters. a and b are the shape and scale parameters of q(φ), respectively.
Through the calculus of variations, iterative analytic updates can be found for the parameters of the approximate posterior distributions q(w) and q(φ). Briefly, the nature of these updates involves finding the zero-derivative of the functional F with respect to a particular parameter group. As an example, the optimal value of q(w) would be found conditional on the approximate posterior distributions of the other model parameters, q(φ) ∏_i q(λ_i).

Regularisation parameters
Unlike the single regularisation hyper-parameter case described in previous work (Simpson et al., 2012b), where q(λ) can also be derived as following a Gamma distribution, the spatially localised hyper-parameters cannot be algebraically determined as following a particular distribution. This is because λ i appears within a matrix inverse in F (see Appendix B), which also complicates the marginalisation of these parameters.
To allow inference, and marginalisation, of these parameters within a tractable framework, two further approximations are required. Firstly, the Laplace approximation is used to assume a normal posterior form for q(λ_i) = N(λ̄_i, σ_i²). Secondly, it is assumed that the prior covariance matrix only depends on the first order moments of λ_i, which greatly simplifies the marginalisation of q(λ_i) and the estimation of σ_i². The expectation of the prior covariance matrix, ⟨Σ⟩, can now be written as:

⟨Σ⟩_{q(λ)} ≈ ∑_i exp(λ̄_i) Σ_i, (9)

where the angular brackets correspond to an expectation of the encompassed term with respect to the subscript.
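A quick numerical check of the first-order-moment assumption, comparing exp(λ̄) with the exact log-normal mean of exp(λ); the values below are illustrative:

```python
import numpy as np

# The first-order-moment assumption replaces <exp(lambda_i)> under
# q(lambda_i) = N(lam_bar, s2) with exp(lam_bar).  The exact log-normal
# mean is exp(lam_bar + s2/2), so the approximation is accurate when the
# posterior variance s2 is small.  The values below are illustrative.
rng = np.random.default_rng(1)
lam_bar, s2 = -2.0, 0.01

samples = rng.normal(lam_bar, np.sqrt(s2), size=200_000)
mc_mean = np.exp(samples).mean()            # Monte Carlo estimate of <exp(lambda)>

approx = np.exp(lam_bar)                    # first-order form used in Eq. (9)
exact = np.exp(lam_bar + s2 / 2.0)          # closed-form log-normal mean

print(approx, exact)                        # differ by ~0.5% for this s2
```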

Inference of transformation and noise parameters
The updates for the transformation and noise parameters are derived in the same way as in Simpson et al. (2012b), taking the expectation of the prior covariance matrix with respect to ∏_i q(λ_i), as given in Eq. (9). As t(x, w) is non-linear with respect to the transformation parameters, w, a first order Taylor series approximation is used to locally linearise the function about the current mean estimate. This requires the calculation of the matrix of partial derivatives, J, of t(x, w) with respect to w about the current mean μ_old. The transformation mean, μ, and covariance Υ are updated by:

Υ⁻¹ = αφ̄ JᵀJ + ⟨Σ⟩⁻¹
μ_new = μ_old + Υ(αφ̄ Jᵀk − ⟨Σ⟩⁻¹μ_old),

where k is the vector representing the residual image y − t(x, w).
μ_new describes the current estimated transformation parameters, and is dependent on the old estimated values, μ_old. φ̄ = ab, which is the expectation of the estimated noise precision.
The posterior parameters of q(φ) are updated by:

a = a₀ + αN_v/2
b⁻¹ = b₀⁻¹ + (α/2)(kᵀk + tr(ΥJᵀJ)),

where N_v is the count of voxels within the masked region.
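The alternating updates for q(w) and q(φ) can be sketched on a toy linear model, where J is exact rather than a local linearisation; the prior covariance, Gamma hyper-parameters and data below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

# One loop of alternating VB updates for the transformation and noise
# parameters, on a toy linear model (so J is exact rather than a local
# Taylor approximation).  Sigma_prior stands in for <Sigma>; a0, b0,
# alpha and the data are illustrative assumptions.
rng = np.random.default_rng(2)
n_vox, n_par = 500, 4
J = rng.normal(size=(n_vox, n_par))          # dt/dw about the current mean
w_true = np.array([1.0, -0.5, 0.25, 0.0])
y = J @ w_true + rng.normal(scale=0.1, size=n_vox)   # true precision phi = 100

alpha = 1.0                                  # virtual decimation factor
a0, b0 = 1e-3, 1e3                           # broad Gamma prior on phi (illustrative)
Sigma_prior = np.eye(n_par)                  # stand-in for the expected prior covariance
Sigma_inv = np.linalg.inv(Sigma_prior)
mu = np.zeros(n_par)
phi_bar = 1.0                                # current E[phi] = a * b

for _ in range(10):                          # alternate q(w) and q(phi) updates
    k = y - J @ mu                           # residual about the current mean
    Ups = np.linalg.inv(alpha * phi_bar * J.T @ J + Sigma_inv)
    mu = mu + Ups @ (alpha * phi_bar * J.T @ k - Sigma_inv @ mu)

    k = y - J @ mu
    a = a0 + alpha * n_vox / 2.0
    b_inv = 1.0 / b0 + 0.5 * alpha * (k @ k + np.trace(Ups @ J.T @ J))
    phi_bar = a / b_inv                      # E[phi] = a * b

print(np.round(mu, 2), phi_bar)
```

The recovered mean approaches the generating weights and φ̄ approaches the generating noise precision, illustrating how the data fidelity weight is inferred rather than hand-tuned.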

Inference of regularisation parameters
A different but consistent inference mechanism is required to infer the spatial prior parameters, {λ}, from the data. As described in Section 2.3.1, the Laplace approximation uses a Taylor series expansion of F to estimate a normal distribution for q(λ). Based on this approximation, these parameters can be inferred through Newton's method updates with respect to the variational Bayesian cost function, F. Given the mean-field approximation in Eq. (6), and the resulting F described in Appendix A, the optimisation of {λ} purely involves terms from the minimisation of the Kullback-Leibler distance between the prior and posterior distributions of w, as {λ} is a component of the prior on w (see Eq. (9)), and between the prior and posterior of λ_i. The terms from F that contain {λ}, or ⟨Σ⟩, are:

F = −½ ln|⟨Σ⟩| − ½ tr(⟨Σ⟩⁻¹(Υ + μμᵀ)) − ∑_i (λ̄_i − η)²/(2ρ) + const[{λ}, ⟨Σ⟩], (14)

where const[{λ}, ⟨Σ⟩] contains all terms that are constant with respect to {λ} and ⟨Σ⟩.
The derivation of the 1st and 2nd order partial derivatives of Eq. (14) is given in full in Appendix B. The derivative of F with respect to the mean of each local regularisation control parameter, λ̄_i, can be expressed as:

∂F/∂λ̄_i = ½ exp(λ̄_i)[tr(⟨Σ⟩⁻¹Σ_i⟨Σ⟩⁻¹(Υ + μμᵀ)) − tr(⟨Σ⟩⁻¹Σ_i)] − (λ̄_i − η)/ρ. (15)

The second partial derivative, taking advantage of the approximation that ∂²⟨Σ⟩/∂λ̄_i² = 0, is simply written as:

∂²F/∂λ̄_i² ≈ −½ exp(2λ̄_i) tr(⟨Σ⟩⁻¹Σ_i⟨Σ⟩⁻¹Σ_i) − 1/ρ. (17)

As such, q(λ_i) can be updated according to the derivatives in Eqs. (15) and (17), where

σ_i² = −(∂²F/∂λ̄_i²)⁻¹,

and the posterior mean λ̄_i is updated by:

λ̄_i ← λ̄_i + σ_i² ∂F/∂λ̄_i.
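A minimal sketch of the Newton update for a single covariance component, Σ = exp(λ)C; here the curvature used is the exact second derivative of this scalar special case, and C, S and the prior values are illustrative assumptions:

```python
import numpy as np

# Newton's method on the lambda-dependent terms of F for a single
# covariance component, Sigma = exp(lambda) * C: a scalar special case
# of the regularisation update.  S stands in for Ups + mu mu^T; C, S
# and the prior N(eta, rho) are illustrative assumptions.  With a weak
# prior the optimum is close to log(tr(C^-1 S) / n) in closed form.
n = 6
C = np.eye(n)
S = 0.05 * np.eye(n)
eta, rho = -6.0, 40.0
T = np.trace(np.linalg.inv(C) @ S)          # tr(C^-1 S)

def grad(lam):
    # dF/dlam = 0.5 * (exp(-lam) * tr(C^-1 S) - n) - (lam - eta) / rho
    return 0.5 * (np.exp(-lam) * T - n) - (lam - eta) / rho

def hess(lam):
    # exact curvature of this scalar objective (always negative here)
    return -0.5 * np.exp(-lam) * T - 1.0 / rho

lam = 0.0
for _ in range(50):
    lam -= grad(lam) / hess(lam)            # Newton step

post_var = -1.0 / hess(lam)                 # Laplace posterior variance of lambda
print(lam)                                  # near log(tr(C^-1 S)/n) when rho is large
```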

Model comparison
The negative variational free energy, F, provides an objective means of comparing models without requiring ground truth or gold standard information. F summarises the fit of the data, and the deviation of the model parameters from their prior distributions. Unlike the Bayesian information criterion, F only penalises model parameters that deviate from the prior, and the cost of a parameter that retains the same distribution as the prior is zero. In the case of the proposed model, this means that additional λ parameters that simply retain their prior distribution incur no additional cost.
Although F has been previously used for model comparison in the medical image analysis domain (Groves et al., 2009; Penny et al., 2005; Friston et al., 2008), to the best of the authors' knowledge it has never been used in medical image registration. However, previous attempts at probabilistic model selection have appeared using the minimum description length (Van Leemput, 2009; Marsland et al., 2008), and information theoretic model selection approaches include Schnabel et al. (2001), Rohde et al. (2003) and Hansen et al. (2008).

Selection of p(λ)
The prior on the regularisation control parameters, p(λ), has an important effect. If there is little information from the data to suggest a value for these parameters, then they will tend to take the values of the prior. As described previously, p(λ) = N (η, ρ). As our interests lie in a more interpretable formulation of registration, we therefore only wish to see deformations that are reliably driven by the data. As such, a low value for η would be preferable, such that in the absence of information to suggest otherwise, transformation parameters would tend towards the identity transformation. Conversely, we want the value of λ to be strongly driven by the data, hence, we choose a large value for ρ. The influence of p(λ) can be thought of as selecting the prior probability of different scales of deformations being allowable.
In this work a weakly informative prior is chosen, where η = −6 and ρ = 40.

Implementation and initialisations
This algorithm was implemented within the FMRIB Non-linear Image Registration Tool (FNIRT) (Andersson et al., 2007), which provides the facility for efficient calculation of the Hessian of the transformation parameters, JᵀJ. The algorithm uses a 3 level multi-resolution scheme where the image is down-sampled, initially by a factor of 4, then 2, then full-resolution. The B-spline knots are supersampled through interpolation at each new level to yield a higher resolution grid. The final spacing is given in the experimental description. The original regularisation model is bending energy, described as an inverse covariance matrix, the scale of which is either adaptively inferred, as in Simpson et al. (2012b), or manually selected.
In terms of initialisation, at the first multi-resolution level, {λ} are set to give an initial control point variance of 2 mm. The first three updates at the first level perform a global scaling of the initial prior matrix. Subsequent iterations treat each λ independently.
Between multi-resolution levels, {λ} is interpolated using trilinear interpolation. A maximum of 20 iterations was run for each multi-resolution level, with convergence assessed using kᵀk plus the deviation of the transformation mean from the prior, i.e. the sum of squared differences and the prior deviation term, instead of F, for computational convenience.

Synthetic experiments
Synthetic 2D images were created to demonstrate the effects of this algorithm; see the top row of Fig. 3. 10 instances of two 2D phantom images, 30 × 30 pixels, were created with varying SNR. The reference image is a circle with a radius of 10 pixels, and the floating image is two pixels thinner on one side. An ideal transformation linking these two images should be spatially localised to the area of shrinkage, with very high confidence in the transformation parameters at all other locations.
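A sketch of how such phantoms might be constructed; the centre, intensity values and noise level are illustrative assumptions:

```python
import numpy as np

# Construction of a 30x30 phantom pair: a reference circle of radius 10,
# and a floating image that is two pixels thinner on one side, with
# Gaussian noise added to set the SNR.  The centre, intensities and
# noise level are illustrative assumptions.
rng = np.random.default_rng(3)
size, radius = 30, 10
yy, xx = np.mgrid[0:size, 0:size]
cx = cy = (size - 1) / 2.0

reference = ((xx - cx) ** 2 + (yy - cy) ** 2 <= radius ** 2).astype(float)

# Shrink by two pixels on the right-hand side only: the horizontal
# radius is reduced where x > cx, leaving the left half unchanged.
rx = np.where(xx > cx, radius - 2, radius)
floating = (((xx - cx) / rx) ** 2 + ((yy - cy) / radius) ** 2 <= 1.0).astype(float)

snr = 10.0
noise_sd = reference.std() / snr
ref_noisy = reference + rng.normal(scale=noise_sd, size=reference.shape)

print(reference.sum(), floating.sum())   # floating covers fewer pixels
```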

Visualisation of uncertainty
The distributions of the posterior transformation parameters q(w) and of the transformation prior p(w) are multivariate normal. In order to display the uncertainty of the posterior, or the support of the prior, the variance in each direction is summed and the square root of the result taken, giving an uncertainty value in pixels/mm. This is approximated using the variance at each of the knot points and interpolated over the image using the B-spline basis set.
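For a single knot, this reduction can be sketched as follows; the covariance values are illustrative:

```python
import numpy as np

# Collapsing the posterior covariance of one control point's 3-D
# displacement into a single uncertainty value: sum the variances over
# the spatial directions and take the square root, giving a value in
# pixels/mm.  The 3x3 covariance below is an illustrative assumption.
cov_knot = np.array([[0.04, 0.01, 0.00],
                     [0.01, 0.09, 0.00],
                     [0.00, 0.00, 0.01]])   # per-direction (co)variances

uncertainty = np.sqrt(np.trace(cov_knot))   # sqrt of summed directional variances
print(uncertainty)                          # sqrt(0.14) ~= 0.374
```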

Example registration
An example set of images and registration results at two SNRs is given in Fig. 3. The log Jacobian maps show that when using the proposed prior the deformation is well localised to the region of change, as opposed to using an adaptive bending energy prior as in Simpson et al. (2012b), where the deformation propagates across the entire circle, despite there being no local image information to support this. The reason for this localisation is that the spatial prior only supports deformation within certain areas. Consequently, this provides a more interpretable estimation of registration uncertainty, where the uncertain regions are only in the areas of change rather than across the image.

Model comparison
Bayesian model selection can be used to objectively choose model parameters that cannot be inferred directly from the data. Here, we investigate the effects of the number of transformation parameters, in terms of B-spline knot spacing, as well as the form of the spatial prior, on F at two SNRs. This is plotted in Fig. 4. For both SNRs, using the proposed prior leads to an improvement in F over bending energy. Interestingly, a slightly higher value of σ is preferable at lower SNR, which leads to greater spatial covariance in the prior. A 5 pixel knot spacing seems to provide the best balance of complexity and data fitting at both SNRs for this example.

Fig. 3. The top row shows the synthetic reference and floating image at two signal to noise ratios (SNRs). The second row shows the resulting log Jacobian map, illustrating expansion or contraction, when using the proposed prior or an adaptive level of bending energy. The third row illustrates the standard deviation of the proposed spatial prior, which is well localised to the region of deformation. The final row shows the uncertainty of the posterior distribution of transformation parameters using either the proposed prior or bending energy. The illustrated settings were chosen as they provide relatively good results at both SNRs in terms of F (see Fig. 4).

Materials
Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 by the National Institute on Ageing (NIA), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), the FDA, private pharmaceutical companies and non-profit organisations, as a $60 million, 5-year public-private partnership. The primary goal of ADNI has been to test whether serial MRI, positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD). Determination of sensitive and specific markers of very early AD progression is intended to aid researchers and clinicians in developing new treatments and monitoring their effectiveness, as well as lessening the time and cost of clinical trials. ADNI is the result of the efforts of many co-investigators from a broad range of academic institutions and private corporations, and subjects have been recruited from over 50 sites across the U.S. and Canada.
60 structural MR images acquired on 3 T scanners were taken from the ADNI database, 30 of these subjects suffered from AD, the other 30 are healthy controls (HC). There were 18 males with AD and 12 male HC. The age means and standard deviations were 74.3 (8.4) for AD and 70.1 (13.95) for HC. The AD subjects were taken from 10 different sites and the HC from 7.

Cross sectional TBM
A single high-resolution representative atlas was constructed for use in the tensor based morphometry experiments. Having a common atlas allows direct comparison of the TBM results from the different regularisation approaches. To prevent bias towards a particular regularisation strategy, an entirely different approach was used to create the atlas. The atlas was created by first probabilistically segmenting the images into grey and white matter, followed by co-registering these probability maps into a common space using the geodesic shooting approach (Ashburner and Friston, 2011) within SPM12 beta. The bias corrected images were then resampled into the atlas space and averaged to create the atlas.
Each of the bias field corrected subject images was rigidly registered to the template image using FLIRT (Jenkinson and Smith, 2001). Subsequently, each image was non-linearly registered to the atlas space using one of six regularisation strategies: a fixed level of bending energy (Andersson et al., 2007); a globally adaptive level of bending energy, where the level is inferred from the data as in Simpson et al. (2012b); a global GP prior; and the proposed prior with σ = {0.05, 0.1, 0.15}. All registrations were run to a 10 mm B-spline knot spacing. 10 mm was selected for computational reasons, as the current implementation does not provide an efficient mechanism for the inversion of sparse matrices. Following registration, the logarithm of the voxelwise determinant of the Jacobian of the mean transformation, μ, is calculated. This provides a measure of local expansion or contraction.
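The log Jacobian determinant computation can be sketched in 2D, using a synthetic uniform-scaling field as an illustrative stand-in for the mean transformation:

```python
import numpy as np

# Voxelwise log Jacobian determinant of a deformation field, the
# quantity analysed in TBM.  The field maps each 2-D grid position to a
# new position; np.gradient supplies the partial derivatives.  The
# uniform 10% expansion field is an illustrative assumption.
n = 16
yy, xx = np.mgrid[0:n, 0:n].astype(float)

scale = 1.1                                  # 10% expansion everywhere
def_x, def_y = scale * xx, scale * yy        # deformed coordinates

dxdx = np.gradient(def_x, axis=1)
dxdy = np.gradient(def_x, axis=0)
dydx = np.gradient(def_y, axis=1)
dydy = np.gradient(def_y, axis=0)

jac_det = dxdx * dydy - dxdy * dydx          # 2x2 determinant per voxel
log_jac = np.log(jac_det)

print(log_jac.mean())                        # log(1.1^2) for uniform scaling
```

Positive log Jacobian values indicate local expansion and negative values contraction, which is why the maps can be compared voxelwise across subjects.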
For the proposed method, p(λ) = N ( − 6, 40). For the proposed model σ was selected to be 0.1 based on the model comparison described in Section 4.2.1. For the global GP prior, different σ values were tested, but σ = 0.1 gave the highest score in terms of F so is presented in all experiments. Two example registrations are given in Figs. 5 and 6.

Model comparison
Model comparison can be used to find the ideal value of σ . In this case we compared the F for an adaptive level of bending energy, a global GP prior with σ = 0.1, and the proposed prior with σ = 0.05, σ = 0.1 and σ = 0.15. The results of this model comparison are illustrated in Fig. 7. σ = 0.1 was chosen for illustration as it generally outperformed σ = 0.15 and adaptive bending energy, with less variability than σ = 0.05.

Jacobian analysis
The distinction between the proposed prior and a global prior can be seen in terms of the distribution of local volume change as given by the log Jacobian, an example histogram of which is given in Fig. 8. The proposed prior prohibits much displacement in uninformative regions, and thus leads to large regions of no volume change. Furthermore, in informative regions the registration is free to follow the data completely, leading to more substantial volume changes, which are seen in the tails of the distributions. This emphasises the well supported signal from the data, and reduces other effects. This can be measured using the kurtosis of the log Jacobian distribution, where higher kurtosis implies a more peaked distribution with heavier tails. Fig. 9 shows a boxplot of the kurtosis of the log Jacobian maps across the population.

Fig. 5. An example slice illustrating a 3D registration where the substantial volume changes are quite sparsely distributed. In this case, the three methods produce quite different log Jacobian maps. The adaptive global bending energy infers an inflexible transformation prior, as insufficient information globally suggests more flexibility is needed. The fixed level of bending energy produces many changes across the brain, the causes of some of which are not immediately apparent from visual inspection of the data. The global GP prior, which does not encourage particularly strong spatial smoothness, performs similarly. Conversely, using the proposed prior leads to a sparser set of volume changes that subjectively seem more reasonable, and contain fewer false positives.

Fig. 6. An example slice illustrating a 3D registration where there are changes distributed across the whole brain. As can be seen, all four methods produce similar log Jacobian maps. The proposed spatial prior shows fairly wide flexibility across the image, with more flexibility in the anterior, as there are more substantial changes there.
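The kurtosis comparison can be sketched numerically; the two synthetic "log Jacobian" distributions below are illustrative assumptions:

```python
import numpy as np

# Kurtosis of a log Jacobian map as a summary of sparsity: a peaked,
# heavy-tailed distribution (many near-zero values plus a few large
# changes) has higher kurtosis than a diffuse Gaussian-like one.  The
# two synthetic maps are illustrative assumptions.
rng = np.random.default_rng(4)

def kurtosis(x):
    """Pearson (non-excess) kurtosis: E[(x - mean)^4] / var^2."""
    x = x - x.mean()
    return (x ** 4).mean() / (x ** 2).mean() ** 2

diffuse = rng.normal(scale=0.05, size=100_000)          # spread-out changes
sparse = np.where(rng.random(100_000) < 0.05,           # 5% informative voxels
                  rng.normal(scale=0.2, size=100_000),
                  0.0)

print(kurtosis(diffuse))   # ~3 for a Gaussian
print(kurtosis(sparse))    # much larger: mass at zero plus heavy tails
```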
This illustrates that the proposed prior is appropriate even in cases where the changes are widely distributed. The spatial uncertainty is much lower and more focal than when using either of the adaptive global priors.
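The kurtosis measure used above can be sketched as follows, assuming a dense displacement field rather than the B-spline parameterisation used in the paper; the `log_jacobian` helper and the toy sparse/diffuse fields are illustrative only:

```python
import numpy as np
from scipy.stats import kurtosis

def log_jacobian(disp):
    """Voxel-wise log Jacobian determinant of phi(x) = x + disp(x),
    where disp has shape (X, Y, Z, 3)."""
    J = np.empty(disp.shape[:3] + (3, 3))
    for i in range(3):
        # spatial gradient of the i-th displacement component
        for a, g in enumerate(np.gradient(disp[..., i])):
            J[..., i, a] = g
    J += np.eye(3)  # Jacobian of the full transformation, I + grad(disp)
    return np.log(np.linalg.det(J))

# Sanity check: the identity transformation has log|J| = 0 everywhere
assert np.allclose(log_jacobian(np.zeros((8, 8, 8, 3))), 0.0)

# A sparse field (one localised bump) vs. a diffuse field (global ripple):
# the sparse field's log-Jacobian distribution has much higher kurtosis.
x = np.linspace(-1, 1, 16)
X, Y, Z = np.meshgrid(x, x, x, indexing="ij")
sparse = np.zeros((16, 16, 16, 3))
sparse[..., 0] = 0.2 * np.exp(-(X**2 + Y**2 + Z**2) / 0.02)
diffuse = np.zeros((16, 16, 16, 3))
diffuse[..., 0] = 0.05 * np.sin(3 * np.pi * X)

k_sparse = kurtosis(log_jacobian(sparse).ravel(), fisher=False)
k_diffuse = kurtosis(log_jacobian(diffuse).ravel(), fisher=False)
```

With `fisher=False` a normal distribution has kurtosis 3, matching the convention of the values reported for Fig. 8.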

Population statistics
The log Jacobian maps were analysed using a general linear model, where statistical differences were evaluated between the subject groups. The Jacobian maps were not smoothed prior to analysis. All analyses were performed using tools from the FSL library (www.fmrib.ox.ac.uk/fsl). Age and total intracranial volume (TIV), estimated by combining the white matter, grey matter and CSF maps from SPM, were used as co-regressors. Fig. 10 shows the results of these statistical analyses.
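A minimal sketch of such a voxel-wise GLM (not the FSL implementation; the simulated data and the `glm_group_tstat` helper are hypothetical) might look like:

```python
import numpy as np

def glm_group_tstat(Y, group, age, tiv):
    """Voxel-wise t-statistic for a two-group difference, with age and
    TIV as co-regressors. Y: (n_subjects, n_voxels) log-Jacobian values."""
    n = len(group)
    # Design matrix: intercept, group label, mean-centred covariates
    X = np.column_stack([np.ones(n), group,
                         age - age.mean(), tiv - tiv.mean()])
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ beta
    sigma2 = (resid ** 2).sum(axis=0) / (n - X.shape[1])
    # Variance of the group coefficient: sigma^2 * [(X'X)^-1]_{11}
    c_var = np.linalg.inv(X.T @ X)[1, 1]
    return beta[1] / np.sqrt(sigma2 * c_var)

# Simulated data: a true group effect at voxel 0, none at voxel 1
rng = np.random.default_rng(3)
n = 40
group = np.repeat([0.0, 1.0], n // 2)
age = rng.uniform(60, 85, n)
tiv = rng.normal(1500, 100, n)
Y = rng.normal(0, 0.1, size=(n, 2))
Y[:, 0] += 2.0 * group
t = glm_group_tstat(Y, group, age, tiv)
```

Mean-centring the covariates keeps the intercept interpretable without changing the group t-statistic.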

Discussion
This paper has demonstrated that a spatially adaptive transformation prior can be estimated alongside the non-linear registration parameters from a pair of images. The current framework was implemented using a B-spline FFD transformation model, but the method itself is independent of the transformation model. The inferred spatial prior aims to reduce the Kullback-Leibler distance between the prior and posterior distributions of the transformation parameters, and consequently derives information from the data in terms of the level of local image information and the areas where deformations occur. In other areas, the spatial prior has very low variance, allowing little displacement to occur. This can lead to sparse deformations, as shown in Fig. 5, where the registration is very free in informative areas, allowing larger volume changes, and constrained in other areas, prohibiting volume change. This leads to distributions of log Jacobians that have higher kurtosis. We postulate that this may reduce weaker false positives and emphasise true volume changes in the data. This model can be thought of as equivalent to a sparse deformation model, where the hyper-parameters controlling regularisation, {λ}, can effectively switch off transformation parameters in non-informative regions; the deformation in those locations cannot be uncertain, as it is not being estimated. For applications other than TBM, a map of inactive regions may be useful, as the alignment of these regions cannot be deemed trustworthy; an intuition for these locations can be seen in the proposed prior maps in Figs. 5 and 6.

Fig. 7. Bayesian model comparison of the different regularisation strategies for population-to-atlas registration. F was significantly lower for σ = 0.15 than for all other methods (paired t-test, p < 0.05). σ = 0.05 and σ = 0.1 are fairly similar, and weakly significantly better than the adaptive bending energy regulariser (paired t-test, p < 0.06) and the global GP prior (paired t-test, p < 0.12). As σ = 0.1 has a smaller inter-quartile range and a similar median to σ = 0.05, it was used in subsequent experiments.

Fig. 8. Histograms of the log Jacobian values from the registrations in Fig. 6. The left plot shows the overall distributions, whereas the right plot focuses on the tails of the same distributions. As can be seen, the proposed prior leads to substantially heavier tails. In this case, the kurtosis varies from 7.7 for adaptive bending energy, to 8.6 for the fixed level of bending energy, 9.1 for the global GP prior (which encourages less smooth deformations than bending energy), and 15.0 for the proposed regularisation prior.
In computational terms, the current implementation is quite expensive, which limits the B-spline knot resolution at which the method has been tested. The computational bottleneck lies in the numerical inversion of the matrices and ϒ⁻¹. Future work will seek more efficient means of inverting these matrices, possibly using a sparse Cholesky decomposition that allows updating, or by separating the matrix into blocks as in Harrison et al. (2008).
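One way to exploit block structure, as a sketch of the direction suggested above (not the implementation, and with an arbitrary partition size `k`), is inversion via the Schur complement:

```python
import numpy as np

def block_inverse(M, k):
    """Invert a symmetric positive-definite matrix via the Schur complement,
    partitioning M into blocks [[A, B], [B.T, D]] with A of size k x k."""
    A, B, D = M[:k, :k], M[:k, k:], M[k:, k:]
    Ainv = np.linalg.inv(A)
    S = D - B.T @ Ainv @ B               # Schur complement of A
    Sinv = np.linalg.inv(S)
    TL = Ainv + Ainv @ B @ Sinv @ B.T @ Ainv
    TR = -Ainv @ B @ Sinv
    return np.block([[TL, TR], [TR.T, Sinv]])

# Check against a direct inverse on a random SPD matrix
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 6))
M = X @ X.T + 6 * np.eye(6)
```

Only the two smaller blocks are ever inverted directly, which is the property a block-wise scheme would exploit at scale.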
Ideally, a regularisation strategy would not enforce sparsity on the covariance matrix. Instead, it may be more appropriate to construct a spatially adaptive prior as a mixture of precision, rather than covariance, components. This would permit longer-range covariance in the prior, which cannot occur in the proposed work. A difficulty with such an approach is learning a suitable set of prior components to use, and ensuring that the resulting prior matrix is positive-definite.

Fig. 10. Population t-statistics (uncorrected) comparing the populations with AD and HC. As can be seen, the fixed level of bending energy and the global GP prior lead to more widespread changes, particularly in the white matter visible in the bottom row. These may be false positive effects caused by higher global variance than the other methods, or by lower spatial smoothness in the case of the global GP prior. The proposed prior leads to focal contractions of high significance in the grey matter and expansion of the ventricles, which may be more plausible.
In the current implementation, the subject images were registered to the atlas to allow the deformation fields (and therefore the Jacobian maps) to be in a common space. However, in a generative model such as this, it would be more appropriate to register the smooth atlas image to the subject for estimating the deformation field. As we are currently using a small deformation model, the inverse is not always well defined and therefore such an approach may not be ideal. Future work will implement this model within a large deformation transformation model, such as a stationary velocity field.
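A stationary velocity field can be exponentiated by scaling and squaring; the following 2-D sketch (with an illustrative `exp_velocity_2d` helper, not part of the current implementation) integrates a velocity field into a displacement field:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def exp_velocity_2d(v, n_steps=6):
    """Exponentiate a stationary 2-D velocity field v of shape (H, W, 2)
    by scaling and squaring, returning the displacement of exp(v)."""
    disp = v / (2.0 ** n_steps)  # small initial step: exp(v/2^n) ~ id + v/2^n
    grid = np.stack(np.meshgrid(np.arange(v.shape[0]), np.arange(v.shape[1]),
                                indexing="ij"), axis=-1).astype(float)
    for _ in range(n_steps):
        # Square the transformation: disp <- disp + disp o (id + disp)
        pts = (grid + disp).reshape(-1, 2).T
        warped = np.stack(
            [map_coordinates(disp[..., c], pts, order=1,
                             mode="nearest").reshape(v.shape[:2])
             for c in range(2)], axis=-1)
        disp = disp + warped
    return disp

# A constant velocity field integrates to a pure translation by that vector
v = np.zeros((12, 12, 2))
v[...] = [0.7, -0.3]
d = exp_velocity_2d(v)
```

Because the exponentiated field is diffeomorphic (for sufficiently smooth v), its inverse is obtained simply by exponentiating -v, addressing the ill-defined inverse of the small deformation model noted above.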
A straightforward extension of this work would investigate the use of a population prior distribution of p(λ) that has a variable mean and variance across the image. Furthermore, local covariance components could be merged together where appropriate as in Friston et al. (2008).
Registration uncertainty has been demonstrated to be useful in improving hippocampal subfield segmentations (Iglesias et al., 2013), estimating dose delivery in radiotherapy (Risholm et al., 2011a), assisting neurosurgical decision making (Risholm et al., 2010a) and improving classification (Simpson et al., 2013a). Future work could also investigate the use of posterior deformation distributions to identify whether an individual belongs to a sub-population of the data, either globally or for a specific structure. This work demonstrates how strongly the registration uncertainty depends on the prior information, as well as on the local image information. The use of a global spatial prior leads to a global variance contribution, which is modified based on the local image information. Conversely, with an adaptive spatial prior, informative areas are given the freedom to move, but because they are informative, they consequently lead to low variance. In contrast, uninformative areas are given little freedom in the prior, and therefore have a tight posterior distribution, as there is no evidence to suggest that they should move.
We believe that this paper provides the first example of Bayesian model comparison for non-linear registration, as demonstrated for choosing the form of the regularisation model. Future work will also investigate finding an optimal B-spline knot spacing or transformation model for a given application.

Conclusions
This paper has described a spatially adaptive regularisation prior model and inference scheme for non-linear registration. The components are optimised using the variational Bayesian cost function, which aims to reduce the Kullback-Leibler distance between the prior and posterior distribution of transformation parameters. This approach leads to better feature localisation and a reduction of false positives in tensor based morphometry, through having a spatial prior that adapts to the local data.

Appendix A. Derivation of the variational free energy
The negative variational free energy, F, is a lower bound on the log model evidence, and is the measure that VB seeks to maximise (Beal, 2003). Maximisation of F is equivalent to minimisation of the Kullback-Leibler distance between the true and approximate posterior distributions. For a model with parameters θ, F is composed of two terms:

F = L_av − D_KL(q(θ)||P(θ)) (A.1)

where L_av is the marginal value of the log likelihood with respect to the approximate posterior distribution, q(θ), and D_KL is the Kullback-Leibler distance between the approximate posterior and prior distributions.
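The tightness of this bound can be demonstrated on a toy conjugate model (a sketch unrelated to the registration model itself): for y_i ~ N(w, 1) with prior w ~ N(0, 1), F equals the log evidence exactly when q(w) is the exact posterior, and is strictly smaller for any other q:

```python
import numpy as np
from scipy.stats import multivariate_normal

def free_energy(y, q_mean, q_var, prior_var=1.0, noise_var=1.0):
    """F = <log p(y|w)>_q - KL(q(w)||p(w)) for the toy model
    y_i ~ N(w, noise_var), w ~ N(0, prior_var), q(w) = N(q_mean, q_var)."""
    n = len(y)
    # E_q[(y_i - w)^2] = (y_i - q_mean)^2 + q_var
    e_sq = np.sum((y - q_mean) ** 2) + n * q_var
    L_av = -0.5 * n * np.log(2 * np.pi * noise_var) - 0.5 * e_sq / noise_var
    # KL between two univariate Gaussians
    kl = 0.5 * (np.log(prior_var / q_var)
                + (q_var + q_mean ** 2) / prior_var - 1.0)
    return L_av - kl

y = np.array([0.5, 1.0, 1.5])
# Exact conjugate posterior: precision = 1/prior_var + n/noise_var
post_var = 1.0 / (1.0 + len(y))
post_mean = post_var * y.sum()
F_exact = free_energy(y, post_mean, post_var)
# Marginal likelihood: y ~ N(0, noise_var*I + prior_var*11^T)
log_ev = multivariate_normal.logpdf(y, mean=np.zeros(3),
                                    cov=np.eye(3) + np.ones((3, 3)))
```

At the exact posterior the bound is tight; any mismatched q (e.g. the prior itself) yields a strictly smaller F.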
The mean-field approximation assumes independence between groups of parameters; for the model in question, q(θ) = q(w) q(φ) ∏_i q(λ_i). Therefore, for the proposed model, L_av is calculated as the expectation of the log likelihood with respect to these approximate posterior distributions. The resulting expression for the marginal likelihood involves the digamma function, ψ.
Similarly, D_KL comprises the integral of the second term of Eq. (A.1). Due to the mean-field approximation, D_KL is split across the approximate posterior parameter groups:

D_KL(q(θ)||P(θ)) = D_KL(q(w)||P(w)) + D_KL(q(φ)||P(φ)) + ∑_{i=1}^{I} D_KL(q(λ_i)||P(λ_i)) (A.5)

These are the standard Kullback-Leibler distances between either normal or Gamma distributions, and can be found in the literature (Roberts and Penny, 2002).
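For reference, the Gamma-Gamma case (rate parameterisation) can be written in closed form and checked against direct numerical integration; this is the generic expression, not specific to the paper's prior:

```python
import numpy as np
from scipy.special import gammaln, digamma
from scipy.stats import gamma
from scipy.integrate import quad

def kl_gamma(aq, bq, ap, bp):
    """KL( Gamma(shape=aq, rate=bq) || Gamma(shape=ap, rate=bp) )."""
    return ((aq - ap) * digamma(aq) - gammaln(aq) + gammaln(ap)
            + ap * (np.log(bq) - np.log(bp)) + aq * (bp - bq) / bq)

# Cross-check against direct numerical integration of q * log(q/p)
aq, bq, ap, bp = 3.0, 2.0, 2.0, 1.0
q = gamma(aq, scale=1.0 / bq)   # scipy uses scale = 1/rate
p = gamma(ap, scale=1.0 / bp)
num, _ = quad(lambda x: q.pdf(x) * (q.logpdf(x) - p.logpdf(x)), 0, np.inf)
```

The digamma and log-gamma terms arise from E[log x] = ψ(a) − log b under a Gamma(a, b) distribution.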
Closed-form updates for the parameters of the approximate posterior distributions can be derived using the calculus of variations. This involves finding the derivative of the functional F with respect to a set of model parameters, given the current posterior distributions of the conditionally independent model parameters. In practical terms, this involves equating the log likelihood and log prior probabilities, marginalised over the independent posterior distributions, with the approximate log posterior distribution. For example, if:

M = log p(y|x, w, φ) + log p(w) + log P(φ) + log P(λ) (A.6)

then the updated distribution for q(w) can be found as:

log q(w) = ⟨M⟩_{q(φ)q(λ)} + const (A.7)

where the angled brackets correspond to taking the expectation of the bracketed term with respect to the subscripted distributions. The full derivations of the updates for q(w) and q(φ) are not given here, but can be found in previous work (Simpson et al., 2012b).

Appendix B. Regularisation parameters
The terms of F that relate to the prior covariance matrix show that this matrix appears twice within a matrix inverse. As {λ} parameterises the covariance rather than its inverse, q(λ) does not have an algebraically defined posterior distribution. Instead, the Laplace approximation is used to assume a normal posterior distribution, by taking a Taylor series expansion of F around the current mean. Furthermore, it is assumed that the covariance depends only on the first-order moments of λ, as described in Eq. (9).

B.1. First order derivative
Each of these terms can be differentiated analytically with respect to the posterior mean of a given regularisation parameter, λ̄_i, using the identity ∂log|X|/∂x = Trace(X⁻¹ ∂X/∂x). The derivative of the inverse prior covariance matrix with respect to λ̄_i can also be calculated analytically, using the identity ∂A⁻¹/∂x = −A⁻¹ (∂A/∂x) A⁻¹. Combining the derivatives of the remaining terms gives the complete derivative of F with respect to λ̄_i.
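Both matrix identities can be verified numerically with finite differences on a random symmetric positive-definite matrix; this check is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 4))
A = X @ X.T + 4 * np.eye(4)   # symmetric positive-definite
dA = rng.normal(size=(4, 4))
dA = dA + dA.T                # symmetric perturbation direction
eps = 1e-6

# Identity 1: d log|A| / dx = Trace(A^{-1} dA/dx)
lhs = (np.linalg.slogdet(A + eps * dA)[1]
       - np.linalg.slogdet(A)[1]) / eps
rhs = np.trace(np.linalg.inv(A) @ dA)

# Identity 2: d A^{-1} / dx = -A^{-1} (dA/dx) A^{-1}
lhs2 = (np.linalg.inv(A + eps * dA) - np.linalg.inv(A)) / eps
rhs2 = -np.linalg.inv(A) @ dA @ np.linalg.inv(A)
```

The finite-difference approximations agree with the analytic expressions to first order in eps.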

B.2. Second order derivatives
The second-order derivative of F with respect to λ̄_i can be used to set the step size of each parameter update; its calculation uses the product-rule identity ∂(XY) = (∂X)Y + X(∂Y).
This work assumes that the covariance depends only on the first-order moment of λ̄_i. This means that the second derivative of the inverse covariance with respect to λ̄_i is zero, which leads to a simplification of Eq. (B.9), as given in Eq. (17).
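The resulting update has the form of a one-dimensional Newton step, λ̄_i ← λ̄_i − F′/F″. A generic sketch of such a scheme on a toy concave objective (not the paper's actual objective):

```python
def newton_maximise(grad, hess, x0, tol=1e-10, max_iter=50):
    """Maximise a 1-D objective with Newton steps x <- x - f'(x)/f''(x),
    the form used for the regularisation hyper-parameter updates."""
    x = x0
    for _ in range(max_iter):
        step = grad(x) / hess(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Toy concave objective f(x) = log(x) - x, maximised at x = 1
x_star = newton_maximise(lambda x: 1.0 / x - 1.0,
                         lambda x: -1.0 / x ** 2, x0=0.5)
```

For a concave objective the Hessian is negative, so the step moves uphill; convergence near the optimum is quadratic.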