A Framework for Comparison and Evaluation of Nonlinear Intra-Subject Image RegistrationAlgorithms

Performance validation of nonlinear registration algorithms is a difﬁcult problem due to the lack of a suitable ground truth in most applications. However, the ill-posed nature of the nonlinear registration problem and the large space of possible solutions makes the quantitative evaluation of algorithms extremely important. We argue that ﬁnding a standardized way of performing evaluation and comparing existing and new algorithms currently is more important than inventing novel methods. While there are already existing evaluation frameworks for nonlinear inter-subject brain registration applications [8, 2], there is still a lack of protocols for intra-subject studies or soft tissue organs. In this work we present such a framework which is designed in an ”open-source” and ”open-data” manner around the Insight Segmentation & Registration Toolkit to aid in intra-subject, intra-modality registration evaluation. The goal of our work is to provide the research community with the basis framework that should be extended by interested people in a community effort to gain importance for evaluation studies. We demonstrate our proposed framework on a sample evaluation and release its implementation and associated tools to the public domain.


Introduction
Nonlinear (non-rigid) image registration is an important field for medical computer vision applications [4].Since it resembles an ill-posed problem, it is also a very complex topic from the theoretical point of view.Unfortunately, validation of nonlinear image registration algorithms tends to be quite difficult as well, so we are facing a major challenge if quantitative statements about performance and comparison of algorithms are required.
On the one hand, the problem of validating nonlinear registration stems from the lack of ground-truth data which in general can not be derived from realistic, clinical data sets under investigation.On the other hand, the definition of suitable quantitative measures to compare different algorithms or an algorithm result to a synthetic ground truth also is a challenge.Many quantitative similarity metrics (which are frequently used in publications to compare registration results), like e.g. the sum-of-squared differences, are only judging outcomes of the registration process and they are often biased to algorithms which use the same or a related similarity measure in the cost function of their underlying optimization scheme.
Due to the problems outlined above and the inherently available influence of noise, partial volume effect, limited numerical precision, interpolation schemes, etc. on quantitative measurements we refer to algorithm evaluation (in accordance with [8]), i.e. measuring "relative" algorithm performance, in contrast to validation or the measurement of "absolute" performance.
We argue that the current state of the art in nonlinear registration algorithm evaluation is inadequate.For registration algorithms it is only possible to gain practical, clinical acceptance if the research community builds a standardized protocol for evaluation.Therefore, in this contribution we propose a modular evaluation framework for objectively comparing algorithms which is designed with the spirit of open-source, open-access, open-data and open protocols in mind, according to the basic ideas and principles from the Insight Segmentation & Registration Toolkit (ITK [9]) and the Insight Software Consortium.
The building blocks of our framework are: • Freely available data sets, e.g. from the NLM data collection project [12], the MIDAS project from the Insight Software Consortium [3] or the NCIA project [11].
• A number of freely available algorithms for nonlinear registration, like e.g. the ITK is able to provide and continuously gathers.
• A set of quantitative measures on which the research community has agreed upon, with public domain implementations residing in a central repository.
• A pool of synthetic deformation models, again with the agreement of the research community, either defined analytically or derived by using the existing pool of registration algorithms.
• An evaluation framework in an easily customizable scripting language that forms the glue for the rest of the components.
In the paper we focus on the description of the generic framework and we show an example implementation that focuses on intra-subject, intra-modality nonlinear CT image registration focusing on differences in breathing states.However, our framework can (and should!) easily be extended to more fields of application, be it inter-modal, inter-subject or any kind of inter-modal registration different from CT, by implementing appropriate components in addition to the existing ones.

Related Work
Given the current state of the art in nonlinear intra-subject, intra-modality registration we consider it more important to perform research on the evaluation of algorithms than inventing a novel one.In literature one finds many publications on algorithms which do not focus on a thorough evaluation and there are only few publications which are entirely dedicated to evaluation.
A common approach to perform evaluation is to identify landmark or region correspondences independently of the registration and to measure how well they are aligned after registration.While this is a sufficient method for rigid registration [26] and also for some nonlinear registration applications [8] it obviously has drawbacks in the latter case since the error in regions far from the landmarks in general can not be measured.
In literature we can find several similar attempts to create evaluation projects.Examples are the retrospective rigid registration evaluation project from West et al. [25], the VALMET project for evaluation of segmentation results [6] or more recently the NIREP project for evaluation of nonlinear inter-subject registration algorithms [2].Additionally the success of community-based evaluation efforts are also obvious in related computer vision fields like stereo correspondence algorithms [14] and multi-view 3D reconstruction [16].However, currently there are no projects that focus on a comparison of the large number of available algorithms for intra-subject nonlinear registration.
Next we give some examples for works that deal with nonlinear registration evaluation.In [8] an evaluation study on brain data sets is presented where six inter-subject registration methods are investigated and compared using a number of local and global measures.The main focus lies on the correct alignment of those brain landmarks which are assumed to be present in all individuals.In [15] a finite element method is used to create a physically plausible synthetic deformation model that might be used for comparison with registration results.Their focus lies on the evaluation of B-Spline grid based methods [13] applied to mammography MR input data.The work of [24] presents an evaluation of the Demons [19] algorithm applied to prostate CT images.They use synthetic experiments to argue on the suitability of Demons for this specific application.The focus of [17] is on the comparison of different similarity measures in the context of nonlinear registration.Their protocol includes a number of optimization-independent, statistical measures describing capture range, number, location and extent of local optima as well as the accuracy and distinctiveness of optima of a cost function derived from a similarity measure.In [2] a framework for the evaluation and comparison of inter-subject brain registration algorithms is presented.Their project NIREP (Nonlinear Image Registration Evaluation Project) is most related to our work, however, its dedication to inter-subject registration and its incompatibility with the idea of "open-source" and "open-data" are the major restrictions.

A Nonlinear Intra-Subject Registration Evaluation Framework
The main problem in nonlinear image registration evaluation is the lack of ground truth information from clinical data sets.Researchers therefore have to restrict themselves to define practically relevant synthetic experiments.There are several groups of synthetic experiments possible, i.e. (1) synthetically deformed synthetic data, (2) synthetically deformed phantom data and (3) synthetically deformed real data.In this paper we concentrate on the third kind of experiments, however, the framework allows the usage of the first two input types as well.The choice of synthetic deformations is crucial in an evaluation procedure.If the chosen transformation is too simple or restrictive, then the derived quantitative measures are restricted to a very coarse approximation of the originally investigated problem.This fact especially complicates the task of nonlinear registration.
Our approach is to use a combination of simple synthetic transformation methods to derive quantitative measures on the performance of several nonlinear registration algorithms.We assume that the calculation of many different synthetic deformations, with each of them testing different behavior, along with a large number of different evaluation measures leads to a method for thorough evaluation and comparison.We further hypothesize in this paper that one can make use of a "pool" of different, sufficiently dissimilar methods to create a number of synthetic deformations.If this "pool" of methods is large enough one can assume that the evaluation converges to a "bronze standard" evaluation, a term which was borrowed from a publication of Glatard et al. [7] who describe a way of establishing a "ground truth" in rigid registration evaluation by combining a large number of data sets with a number of algorithms and analyzing all possible combinations of registration results.One of our most important assumptions is, that a community effort is needed to increase the number of algorithms, measures and synthetic transformations in order to arrive at a practically relevant evaluation protocol.
We have built an evaluation framework to assist in the quantitative comparison of different algorithms over a variety of synthetic transformations.This framework uses Python as high-level scripting language to call the C++ methods that perform the synthetic transformation calculation, call the registration algorithms and the computation of the quantitative evaluation measures.Input images are synthetically transformed with the help of a configuration file.A list of nonlinear registration algorithms along with its parameterizations may be specified to perform the computations.Finally the synthetic ground-truth deformation fields and the original moving images are compared with the outcomes of the registration algorithms and log files with the quantitative evaluation results are written.Note, that all of these processes are performed in an easily customizable fully automatic fashion.Another important aspect this evaluation framework allows, is to automatically test algorithms with different parameterizations to determine optimal parameters with respect to the evaluation measures.
Our basic strategy for the evaluation is in all cases identical.We take an original image, apply a synthetic transformation and store the synthetically warped image and the resulting displacement field.The displacement field will be our ground truth to compare to.Now each of the investigated nonlinear registration algorithms gets the synthetically warped image as fixed input and the original image as moving input, i.e. we try to find a displacement field that warps the original image to the synthetically transformed.In this way we can finally compare the warped image with the synthetically warped image and the ground truth displacement field with the calculated displacement field.Note that only this way it can be guaranteed that the displacement fields represent transformations in the same directions (from fixed to moving image).Compare Figure 1 for an overview of this strategy.This figure also clearly shows that the nonlinear registration algorithms have no insights on the working of the synthetic transformations, making their behavior completely independent from these transformations.A possible bias that one method might have concerning a certain synthetic transformation may be removed by increasing the number of investigated synthetic deformations.

Compared Algorithms
In this section we list the nonlinear registration algorithms from the ITK that are currently used in the framework.We shortly describe each method and the parameters that have to be chosen for the evaluation

Evaluation
Step procedure.There are two parameters that are required in all of the algorithms, the number of levels in the Gaussian pyramid approach and the number of iterations per pyramid level.
Demons Registration -"demons" is an approximation to the elastic or fluid deformation model, where the computation of the similarity metric and the regularization of the displacement field are decoupled and the latter part is approximated by a Gaussian smoothing [19].The only additionally required parameter is the standard deviation sigma of the displacement field smoothing step.
Symmetric Demons Registration -"symdemons" builds upon the same principles as Demons, however, the gradient term in the similarity metric computation uses information from both images to be registered, thereby increasing robustness.The only required parameter is the standard deviation sigma of the displacement field smoothing step.
Level Set Motion Registration -"levelsetmotion" is a very efficient registration algorithm that interprets the image intensity as isocontours of a level set and tries to match those iso-contours [21].It gains its efficiency from the fact that no regularization is used.
Curvature Registration -"curvature" uses a regularization term that penalizes the norms of the second derivatives of the displacement field components to perform registration [5].In this implementation the variational formulation for combining similarity and regularization term is used [10] and numerically solved using a fast fourier transform.Additional parameters are the time-step τ of the semi-implicit scheme and the regularization weight α.
We propose to add two more algorithms to our framework, which currently are not included in the official ITK release, but were submitted to the 2007 MICCAI Open Science Workshop.
Here obviously the open manner of this workshop shows its clear advantage compared to traditional workshops.Fast Block Matching Registration -"fastblock" is a 2007 MICCAI Open Science workshop contribution by Suarez-Santana et al [18].It is a simple algorithm relying on a pyramidal block-matching scheme and a local entropy-based similarity measure.There are two parameters required in this algorithm which define regularization and similarity.
Diffeomorphic Demons Registration -"diffdemons" is another 2007 MICCAI Open Science workshop contribution by Vercauteren et al [23,22].It describes a diffeomorphic extension of the Demons algorithm in order to increase its suitability for inter-subject registration.Although inter-subject registration is not in the focus of this work, we chose to include this paper in order to enlargen the pool of available algorithms.There are two additionally required parameters, the standard deviation sigma of the displacement field smoothing and the maximum step length of displacement field updates.
The reader may note that we decided not to include the B-Spline based algorithm using a MeanSquares metric or the Finite Element based elastic registration.The B-Spline method in our experience requires too much computation time, we have not yet managed to bring down the computational effort near the times of the other ITK algorithms.The decision to ignore the Finite Element based registration was made due to the very complex way this algorithm has to be set up.We invite the community to add these implementations to the framework.

Synthetic Deformations
We are using several different kinds of synthetic transformations.Note that in this paper the choice of synthetic deformations is guided by the intended application of evaluating thoracic soft-tissue data sets.However, the framework certainly allows to add and remove any kind of synthetic transformation to tailor the evaluation to the intended application area.
Our first synthetic model is a very simple simulated breathing transformation.It depends on two parameters t vertical and t inward which provide a means for a simple approximation of diaphragm and rib-cage movement.Further, a slight intensity variation is applied to the interior of the lung, to simulate the partial volume effect.
A more detailed description of this transformation can be found in [20].We will denote it simbr.Examples for this transformation are given in Figure 2, for the experiments we use two instances of this transformation (simbr-25-10, simbr-55-25).
The second synthetic transformation consists of a regular grid with random deformations, interpolated by a thin plate spline.There are two parameters that may be varied, the grid size of the points to be placed and the possible extent of the random deformation that is applied per grid-point.It will be denoted grid from now on.The parameters of this transformation are gridSize and maxDeviation.Figure 3 gives some examples for this transformation, later on we will be using two instances of this transformation (grid-32-4, grid-32-8).The third synthetic transformation is a uniform periodic one which we took from [27].Here the displacement field follows a cosinus distribution with a given amplitude and a given phase.We use a single instance of this transformation with amplitude 5 and phase 32 (uniform-5-32).
Our final form of synthetic transformations makes use of the pool of nonlinear registration algorithms described above.Here, given two input images that differ in a nonlinear fashion, we calculate the nonlinear registration of these two data sets with each of the algorithms.For n algorithms this gives us n displacement fields.Now it is possible to use each of these n displacement fields as synthetic deformations to synthetically warp one of the input images.Thus we get a pair of images to be registered by all of the available algorithms.
Note that we currently do not simulate noise in these synthetic deformation models, however, this could easily be incorporated by modifying the programs for calculating the synthetic deformations.

Quantitative Evaluation Measures
Two kinds of measures are used in our evaluation framework.The first one is used to compare displacement field ground truth data with displacement field results calculated during the registration.The second kind of measures compares two sets of images, before and after registration.This comparison step is performed on fixed and moving input image and on fixed and warped moving image.After successful registration, warping the moving image should result in an increase of similarity with the fixed image.
We always compute the evaluation measures only on the overlapping regions of the input image data sets and the deformation fields, i.e. background regions in either of the data set are excluded from the computation.This is especially important in the case of synthetic transformations, where certain areas might vanish due to a shrinking like behavior, we always mark these regions with special values during synthetic transformation and we omit those marked regions from the evaluation process.
We have to remark here that all of these measures possess one inherent shortcoming.Data sets with a size of 256 × 256 × 256 consist of a total around 16 million voxels and each measure of interest will try to reduce this huge amount of information to one single number.Although some of our measures are inspired by robust statistics, their information value has to be at least discussed and also questioned.One should always keep this in mind when looking at quantitative evaluation results.

Displacement Field Measures
To assess the similarity of a synthetic ground-truth deformation field ϕ syn and a deformation field computed by a nonlinear registration algorithm ϕ reg , we use: Root-Mean-Square of Displacement Field RMS disp The Root-Mean-Square of the displacement field interprets the two displacement fields of interest as feature vectors of dimension 3×N 1 ×N 2 ×N 3 , where N i is the number of voxels on the input image grid in the according dimension.The measure reads Median Absolute Deviation of Displacement Field MAD disp This measure is similar to the Root-Mean-Square, however it is a more robust variant in the presence of outliers.The median absolute deviation is defined as Maximum Deviation of Displacement Field MAX disp The maximum deviation of the displacement field states the maximal difference of all components of the two displacement fields ϕ syn and ϕ reg .We use a more robust maximum, i.e. our maximum difference is defined as the difference that is larger than 95% of all other values.
Jacobian of Displacement Field JAC disp The Jacobian of the displacement field gives information about the transformations consistency.Especially it identifies possible singularities of the field.The Jacobian is defined as the determinant of the first partial derivatives of the transformation (compare the definition in [8]), negative values of the determinant resemble singularities, i.e. foldings.The measure JAC disp scans the deformation field for negative values and reports its number as a percentage of the total number of voxels, ideally it would be 0.

Image Similarity Measures
To assess the similarity of images (fixed image I F (x), moving image I M (x)) before and after registration we use: Root-Mean-Square of Intensity Differences RMS int The Root-Mean-Square of the intensity differences needs as its input the pixel-wise intensity differences over the overlapping region of the image domain.
It is defined as We decided to clamp intensity differences that are larger than 100 Hounsfield Units (a number that was chosen empirically) and assign the absolute difference of 100 Hounsfield Units to all of these differences.This makes the measure more sensitive to lower intensity differences, the unmodified Root-Mean-Square measure would weight outliers too strong.Note that we perform this clamping consistently for comparing all image pairs, either original fixed and original moving or original fixed and warped moving.
Median Absolute Deviation of Intensity Differences MAD int This measure is similar to the Root-Mean-Square, however it is a more robust variant in the presence of outliers.Therefore no clamping is necessary within this measure.The median absolute deviation is defined as Maximum Intensity Difference MAX int The maximal intensity differences on the overlap region of the image grid.We use a robust maximum, i.e. our maximum intensity difference is defined as the intensity difference that is larger than 95% of all other values.Here of course no clamping (like for the Root-Mean-Square measure) of the intensity differences is performed.
Normalized Mutual Information NMI int The normalized mutual information is a measure from information theory, which relates the information content of two images by probability distributions of the gray values.We use the formulation from the ITK normalized mutual information histogram metric that divides twice the mutual information by the sum of the individual entropies resulting in a measure between 0 and 1.
Edge Overlap EDGE int We are using a simple scheme to extract strong gradients from both images, based on the Canny edge detector [1].The Canny edge detector is used with a σ of 3 times the voxel spacing, the result is a binary image with edges marked with a 1.The sum of the absolute differences divided by the number of voxels in the overlap region is used to compare the binary images.This function lies between 0 and 1 with 0 denoting optimal overlap.

A Sample Evaluation
To show the applicability of our evaluation framework we now present an evaluation example.The intention is to test different algorithms for the purpose of registering intra-modality thoracic CT data sets under the influence of breathing differences.Therefore, we are using two different publicly available data sets.First, we took the data set "NormalChestCTNoContrast" from the NLM (National Library of Medicine) Data Collection Project [12].This data set was used for the anlytically defined synthetic deformation experiments (see Figure 4a) and is abbreviated nlm.Second, we downloaded a pair of data sets from the NCIA (National Cancer Imaging Archives) repository [11] that differ by some breathing motion (see Figure 4b,c).We used it for the creation of synthetic deformation fields from the different registration algorithms.We refer to this data as ncia, it can be found in the LIDC section of the archives with the identifier "30047".
All of the data sets were resampled to a resolution of 256 × 256 × 256 voxels to limit the run-time of some of the algorithms to a reasonable time-span.We performed our experiments on 64 bit AMD Opteron processors with 2.4GHz and 8GB of RAM running Linux.Since algorithms tend to need a large amount of system memory the choice to use a 64 bit system was necessary.
The necessary parameters for the different algorithms were chosen as follows.For the two algorithms that were taken from the MICCAI Open Science workshop contributions we directly took the proposed default arguments without tuning.The other algorithms were set up with 4 shrink levels in the Gaussian pyramid, and a number of iterations that was calculated from two parameters iter1 = 35 and iter2 = 25 as iter = iter1 * shrinkLevel + iter2 in order to perform many more iterations on higher levels in the pyramid where calculation is cheap.The sigma parameter for demons and symmetric demons was chosen to be 1 voxel.τ and α in the curvatrue based algorithm were both set to 1.0.After this choice all of the parameters remained fixed during the evaluations.

Discussion of the Evaluation Results
The results of our evaluation can be found in Tables 2, 3, 4, 5. Due to the large number of quantitative measures resulting from the framework we decided to discuss it only in an overall manner.
What one can immediately see from these results, is that there are two algorithms which are not performing well on the synthetic data.These are the level set motion registration and the fast block matching registration.
We assume that the level set motion method fails due to its lack of regularization and therefore do not suggest that it should be used.Especially the large percentage of negative values of the jacobian determinant of the displacement field are a problem.The problems are also visible on result images like in Figure 5c.The problems of the fast block matching algorithm are in the border region (see Figure 5a).We are not yet sure if this is an implementation problem or a problem of the method.This algorithm should be adapted and evaluated anew.
The overall best results on the evaluation experiments is given by the diffeomorphic Demons algorithm.Also the classical Demons algorithm performs very well, however on the important displacement field comparisons diffeomorphic Demons behaves better.The percentage of negative jacobian determinants is zero for diffeomorphic Demons as could be expected.The curvature algorithm wins some of the performance measures as well, however there is no clear benefit compared to diffeomorphic Demons and the run-times are approximately equal.As for the symmetric Demons implementation, it performs quite well, however, there are some situations where the quantitative measures grow very high.This corresponds to misregistered data sets (see Figure 5b), we suppose that there might be an implementation problem in the itkFastSymmet-ricDemonsFunction.
We omit presenting output images of well registered data sets, since the difference to the original image is seldom visible with the eye.However, we invite the reader to perform the same experiments using our implementation of the framework.
The run-times of the different algorithms on the 256 × 256 × 256 can be found in Table 1.Note that these are only approximate values.We can see that the practically quite well performing Demons algorithm also Table 1: Computational effort of the various investigated algorithms.
offers fast computation times.
To conclude we currently would suggest using Demons for time-critical registration of thoracic CT data and diffeomorphic Demons for the most accurate registration.

Conclusion & Future Work
In this paper we have presented an evaluation framework for evaluating and comparing nonlinear registration algorithms with the special intent on intra-subject applications.We have successfully shown how it is possible to automatically compare open-source algorithms on public domain data using our tools which we provide in conjunction with this submission.Our main goal is to create a standardized way of evaluating algorithms maintained at a centralized and open repository, e.g. the Insight Consortium server structure.In order to gain importance we invite the research community to contribute and/or discuss algorithms, quantitative measures and synthetic transformations.We know that our sample evaluation is not yet representative, but we hope to converge towards some kind of "bronze standard" when collecting more and more modules.
Further work beside the implementation of the framework on a central repository should definitely include to upgrade the Python based framework to automatically distribute the computations among different computers in a cluster.The inclusion of methodologies to provide a framework for real data experiments or different setups ( inter-modal, inter-subject) should be a straight-forward task by implementing additional registration methods, synthetic deformations and accompanying similarity measures.Finally in accordance to a fruitful reviewer comment, different noise models should be incorporated into the synthetic deformations.

Figure 1 :
Figure 1: The basic setup for the synthetic transformation evaluation experiments.The synthetic transformations are applied to the original image, the nonlinear registration of original and synthetically transformed image leads to a transformation and a warped image which can be compared.

Figure 5 :
Figure 5: Problems of some registration algorithms.Top row is the target image and bottom row the registration result of a) fast block matching, b) symmetric Demons and c) level set motion algorithm.Algorithm demons symdemons levelset curvature fastblock diffdemons Runtime [s] 800 2800 850 3500 1700 3100

Table 2 :
Registration results for the analytically defined synthetic experiments.Displacement field measures.

Table 3 :
Registration results for the synthetic experiments using a pool of algorithms.Displacement field measures.

Table 4 :
Registration results for the analytically defined synthetic experiments.Intensity difference measures.

Table 5 :
Registration results for the synthetic experiments using a pool of algorithms.Intensity difference measures.