Accurate MR Image Registration to Anatomical Reference Space for Diffuse Glioma

To summarize the distribution of glioma location within a patient population, registration of individual MR images to anatomical reference space is required. In this study, we quantified the accuracy of MR image registration to anatomical reference space with linear and non-linear transformations using estimated tumor targets of glioblastoma and lower-grade glioma, and anatomical landmarks at pre- and post-operative time-points using six commonly used registration packages (FSL, SPM5, DARTEL, ANTs, Elastix, and NiftyReg). Routine clinical pre- and post-operative, post-contrast T1-weighted images of 20 patients with glioblastoma and 20 with lower-grade glioma were collected. The 2009a Montreal Neurological Institute brain template was used as anatomical reference space. Tumors were manually segmented in the patient space and corresponding healthy tissue was delineated as a target volume in the anatomical reference space. Accuracy of the tumor alignment was quantified using the Dice score and the Hausdorff distance. To measure the accuracy of general brain alignment, anatomical landmarks were placed in patient and in anatomical reference space, and the landmark distance after registration was quantified. Lower-grade gliomas were registered more accurately than glioblastoma. Registration accuracy for pre- and post-operative MR images did not differ. SPM5 and DARTEL registered tumors most accurately, and FSL least accurately. Non-linear transformations resulted in more accurate general brain alignment than linear transformations, but tumor alignment was similar between linear and non-linear transformation. We conclude that linear transformation suffices to summarize glioma locations in anatomical reference space.


INTRODUCTION
Tumor location is important when comparing treatment and outcome between populations of patients with diffuse glioma. The distribution of tumor locations can be summarized and compared across patients by transforming tumor segmentations from the individual patient space to a common anatomical reference space. Tumor location, biopsy decision, and residual tumor after surgery can then be summarized into tumor, biopsy, or resection probability maps, respectively. These summaries can provide answers to research questions on glioma-genesis (Steed et al., 2016), location preference for molecular sub-types (Ellingson et al., 2012, 2013), differences in surgical intervention (De Witt Hamer et al., 2013; Müller et al., 2019), survival prediction (Liu et al., 2016), and location of neuropsychological domains (Hendriks et al., 2018).
Patient images are brought into spatial alignment with the anatomical reference space through image registration. There is a large number of different approaches to image registration, which differ in the registration paradigm (e.g., the degrees of freedom), similarity metrics, regularization, and the choices in optimization (Viergever et al., 2016). Due to the diversity in implementation of these characteristics, various (publicly available) registration packages provide different solutions to the same problem.
The accuracy of inter-subject brain MRI registration was studied for several public software packages with publicly available data of eighty healthy individuals, showing the added value of high-degrees-of-freedom (non-linear) registrations. However, registration becomes more challenging when transforming images of brain tumors, lesions, atrophy, or deformed brain to an anatomical reference space (Crinion et al., 2007; Ripollés et al., 2012). For post-operative recurrent brain tumor patients specifically, this was studied in a group of eight patients by Ou et al. (2014).
Glioma registration has two main challenges. First, the normal tissue contrast is altered by the tumor components and surrounding edema. And second, mass effect from the tumor and from surgery alters the volume and shape of normal tissue. These challenges potentially diminish the registration accuracy, resulting in less reliable summary maps.
The determination of image registration accuracy is neither standardized nor well defined. Previous work determined the accuracy of different registration algorithms by comparing extrinsic registration (i.e., based on bone-implanted fiducials) with intrinsic registration (i.e., based on intensity values in the images) (West et al., 1997). Intrinsic registration has since become dominant in the literature (Viergever et al., 2016), where the ground truth of patient-attached fiducials is lacking. Most commonly, registration results are checked visually, a qualitative approach which is rater dependent. Manually segmenting structures in the images allows for quantitative comparison (Caviness et al., 1996; Shattuck et al., 2008), but is time-consuming and subject to rater variability. Another quantitative method is placing landmarks in the images (Chollet et al., 2014). Accuracy is most reliably measured when several forms of manual annotation are combined (Rohlfing, 2012).
In this study we compared the registration accuracy to anatomical reference space for tumor alignment and for anatomical landmark alignment of 20 patients diagnosed with glioblastoma and 20 with lower-grade glioma, between pre- and post-operative MRI time-points, between six publicly available registration packages, and between linear and non-linear registrations.

Patients
Patients were randomly selected from a cohort treated at the Neurosurgical Center of Amsterdam UMC, location VUmc, between 2009 and 2013, and were previously reported on in an inter-rater agreement study. The MR data of 40 glioma patients was used, consisting of 20 patients with histopathologically confirmed glioblastoma and 20 patients with lower-grade glioma.
Following clinical protocol at the treatment site, the pre-operative MRI was made within 1 week before resection. The MRI after surgery was made within 72 h after resection for glioblastoma and about 4 months after resection for lower-grade glioma.
Only post-contrast T1-weighted MRI (T1c) images were used for registrations in this study, and were registered individually to the anatomical reference space, for which the symmetric Montreal Neurological Institute 09a (ICBM2009a space) atlas was used (Fonov et al., 2009, 2011).

Manual Annotations
A single rater, a neurosurgeon with 20 years of clinical experience, segmented the gliomas in MRI at both the pre- and post-operative time-point for all 40 patients. Segmentations in patient space were made with the semi-automatic SmartBrush tool (BrainLab, Feldkirchen, Germany). Radiological presentation of glioma can be divided into MR non-enhancing and contrast-enhancing tumors. Although exceptions exist, many non-enhancing gliomas are histopathologically diagnosed as lower-grade glioma (WHO grades II or III), and many contrast-enhancing gliomas as glioblastoma (Scott et al., 2002). For glioblastoma, the tumor segmentation was defined as enhancing tumor elements including enclosed necrosis. The lower-grade gliomas were segmented on T2/FLAIR images, which were coregistered with the T1c in the BrainLab Elements software suite, placing all segmentations in the patient T1c space. The segmentation protocol and rater performance for this imaging set are discussed in more detail in Visser et al. (2019).
To measure tumor registration accuracy a tumor target volume in ICBM2009a space was defined, which corresponds with the tumor volume in patient space. The same rater who performed tumor segmentations in patient space also defined the tumor target volumes in ICBM2009a space. This was done by identifying the healthy tissue structures and salient edges surrounding the tumor and finding the analogous areas in the ICBM2009a template. For each segmentation in patient space a corresponding target volume of healthy tissue was delineated in ICBM2009a space as ground truth for optimal registration, see Figures 1A,B. Furthermore, anatomical landmarks were placed in patient space and in anatomical reference space. A set of landmarks and their corresponding intra-rater agreement was described in Chollet et al. (2014). From these landmarks, we selected a subset of 20 landmarks based on the coverage of the supratentorial compartment and low intra-rater variation (≤2 mm), shown in Figure 2. Specific locations of the landmarks are given in Supplementary Table S1. Segmentations of target volumes in ICBM2009a space and landmark placements were performed in ITK-SNAP (Yushkevich et al., 2016).
FIGURE 1 | Example of manual tumor annotations followed by pre-operative linear and non-linear registration for glioblastoma and lower-grade glioma in the left hemisphere. Manual segmentation of tumor in patient space is shown in blue, the corresponding estimated target volume in anatomical reference space in green, the tumor segmentation after linear transformation in orange, and the tumor segmentation after non-linear transformation in red. The top row shows imaging of a glioblastoma patient, and the bottom row shows imaging of a lower-grade glioma patient. From left to right: patient space (A), anatomical reference space (B), patient space linearly transformed to anatomical reference space (C), and patient space non-linearly transformed to anatomical reference space (D).
To be able to review the intra-rater consistency, the same rater delineated the tumor target volumes again in three randomly selected enhancing and three non-enhancing gliomas on both sessions. This was done a year after the initial delineation and the rater was blinded to the initial results.

Preprocessing
DICOM data of the T1c images were converted to NIfTI with the dcm2niix tool (Li et al., 2016). Standard pre-processing was used for each method. Therefore, the pre-processing was different for FSL, ANTs, Elastix, and NiftyReg than for SPM5 and DARTEL.

FSL, ANTs, Elastix, and NiftyReg
T1c images were pre-processed with N4-bias correction to correct for low-frequency intensity non-uniformity (Tustison et al., 2010), anisotropic diffusion smoothing to remove noise while respecting the edges in the images (Perona and Malik, 1990), and a brain extraction routine from the ANTs image processing toolbox to obtain a brain mask in patient space (Avants et al., 2011). Of note, the brain mask in patient space was the union of healthy tissue and tumor and/or surgical cavity. A brain mask for the anatomical atlas was provided with the atlas.

SPM5 and DARTEL
SPM5 and DARTEL registrations use the Unified segmentation approach that incorporates similar operations as above together with tissue segmentation (Ashburner and Friston, 2005). We used the CAT12 implementation in SPM12 (Wellcome Centre for Human Neuroimaging, London, United Kingdom) that has its own bias-field correction, non-local means noise filtering (Manjón et al., 2010), and brain masking (Gaser, 2009). Then T1c images were segmented to gray and white matter for later use during registration instead of the original T1c images.

Registration Scheme
Pre- and post-operative MRI were processed independently to allow for comparison in registration performance. Tumor segmentations and landmarks in patient space were transformed to ICBM2009a space with linear transformations (Figure 1C) and non-linear transformations (Figure 1D). Due to differences in available options in each registration package, the parameter settings per package were not designed to match those of other packages but were based on settings from previous work (van der Lijn et al., 2009; De Witt Hamer et al., 2013; Ou et al., 2014; Mutsaerts et al., 2018; Bartel et al., 2019) and are provided in the Supplementary Material. Furthermore, non-linear registrations were initialized with the linear transformations from the corresponding registration package.
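As an illustration of how a linear transformation moves a landmark from patient space to ICBM2009a space, the following minimal Python sketch applies a 4 × 4 homogeneous affine matrix to a single coordinate. The matrix values and the function name apply_affine are hypothetical; the matrices estimated by the registration packages are not given in the text, and the direction convention of a stored transformation differs between packages, so the matrix is assumed here to map patient coordinates to reference coordinates.

```python
def apply_affine(matrix, point):
    """Apply a 4x4 homogeneous affine matrix to a 3D point (x, y, z) in mm."""
    x, y, z = point
    homogeneous = (x, y, z, 1.0)
    # Only the first three rows contribute to the transformed coordinate
    return tuple(
        sum(row[k] * homogeneous[k] for k in range(4))
        for row in matrix[:3]
    )


# Hypothetical affine: isotropic scaling by 2 plus a translation of (1, 2, 3) mm
example_matrix = [
    [2.0, 0.0, 0.0, 1.0],
    [0.0, 2.0, 0.0, 2.0],
    [0.0, 0.0, 2.0, 3.0],
    [0.0, 0.0, 0.0, 1.0],
]

landmark_reference = apply_affine(example_matrix, (1.0, 1.0, 1.0))
print(landmark_reference)
```

A non-linear transformation would replace the single matrix by a spatially varying displacement, which is why it has many more degrees of freedom.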

FSL, ANTs, Elastix, and NiftyReg
Linear registrations were performed without using masks in ICBM2009a space, nor in patient space. Non-linear registration was driven by the brain masks for either selecting voxels in ICBM2009a space or cost-function masking in patient space. This resulted in two registrations to ICBM2009a space for each T1c image per package: a linear and a non-linear registration. The entire registration scheme is shown in Figure 3.

SPM5 and DARTEL
Intermediate and final templates in the SPM format are needed for the registration to the ICBM2009a space. SPM5 and DARTEL use the ICBM152 template (Mazziotta et al., 2001), and the 1.5 mm × 1.5 mm × 1.5 mm IXI-database template (Gaser, 2009), respectively. These two templates were additionally aligned with the ICBM2009a template using geodesic shooting registration (Ashburner and Friston, 2011). The resulting transformation was applied to all SPM5 and DARTEL results to transform them to the ICBM2009a space for comparison. Preparing dedicated registration priors from the ICBM2009a template would be a possible alternative that avoids the intermediate transformation; however, the errors introduced by the extra transformation were minimal.
We used a hybrid implementation of the DARTEL registration (Mutsaerts et al., 2018). In the region within and around the tumor we used the results of SPM5. In the surrounding tissue, the original DARTEL transformation was used. In the margin around the tumor that encompassed the maximal difference in the transformation fields on the tumor border, we have combined the SPM5 and DARTEL fields with cubic weighting depending on the distance from the lesion to ensure continuity of the merged transformation fields.
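The merging of the two transformation fields in the margin can be sketched as follows. This is a minimal illustration only: the exact cubic weighting function is not specified in the text, so a cubic smoothstep is assumed, and the names cubic_weight, blend_displacement, and margin_mm are hypothetical.

```python
def cubic_weight(distance_mm, margin_mm):
    """Weight of the SPM5 field: 1 at the lesion border, 0 at the end of the margin.

    Assumption: a cubic smoothstep profile; the text only states "cubic weighting".
    """
    t = min(max(distance_mm / margin_mm, 0.0), 1.0)
    return 1.0 - (3.0 * t ** 2 - 2.0 * t ** 3)


def blend_displacement(spm5_vec, dartel_vec, distance_mm, margin_mm):
    """Blend two displacement vectors (in mm) for a voxel at a given distance from the lesion."""
    w = cubic_weight(distance_mm, margin_mm)
    return tuple(w * s + (1.0 - w) * d for s, d in zip(spm5_vec, dartel_vec))


# Halfway through a hypothetical 5 mm margin both fields contribute equally
print(blend_displacement((2.0, 0.0, 0.0), (0.0, 0.0, 0.0), 2.5, 5.0))
```

Because the weight and its slope vary smoothly from one end of the margin to the other, the merged field has no step at the transition between the SPM5 and DARTEL regions.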

Rater Consistency
The locations of post-operative target volumes should be consistent with their respective pre-operative target volume locations in ICBM2009a space. We define rater consistency as the percentage of post-operative voxels with a distance < 3 mm outside the pre-operative target volume. The voxels of the postoperative target volumes were assumed to be located close to the edge and preferably inside of the pre-operative target volumes, as this would be the most probable region where the surgeon might have left residual tumor.
To evaluate the intra-rater consistency on the tumor target volumes in the ICBM2009a space, the Dice score and the modified Hausdorff distance were calculated (see the following section). Additionally, the generalized conformity index (GCI = |regROI ∩ tarROI| / |regROI ∪ tarROI|) was calculated to allow comparison with our previous study on intra-rater agreement in glioma segmentation.
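The two consistency measures defined in this section can be sketched in Python as below. This is an illustration under stated assumptions, not the implementation used in the study: volumes are represented as sets of voxel index tuples, voxels are assumed isotropic with a hypothetical spacing_mm, and distances are measured between voxel centers.

```python
import math


def consistency_percentage(post_voxels, pre_voxels, spacing_mm=1.0, limit_mm=3.0):
    """Percentage of post-operative voxels closer than limit_mm to the pre-operative volume."""
    pre = list(pre_voxels)
    within = 0
    for voxel in post_voxels:
        # A voxel inside the pre-operative volume has distance 0 and therefore counts
        nearest_mm = min(math.dist(voxel, p) for p in pre) * spacing_mm
        if nearest_mm < limit_mm:
            within += 1
    return 100.0 * within / len(post_voxels)


def conformity_index(roi_a, roi_b):
    """Intersection over union of two voxel sets, as in the GCI expression for two volumes."""
    return len(roi_a & roi_b) / len(roi_a | roi_b)


pre_op = {(x, 0, 0) for x in range(5)}
post_op = {(4, 0, 0), (6, 0, 0), (9, 0, 0)}
print(consistency_percentage(post_op, pre_op))
print(conformity_index({(0, 0, 0), (1, 0, 0)}, {(1, 0, 0), (2, 0, 0)}))
```

The brute-force nearest-voxel search is adequate for a small example; for full-resolution volumes a distance transform of the pre-operative mask would be the more practical choice.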

Tumor Registration
The overlap between the segmentations and target volume was determined with the Dice score as:
$$\mathrm{Dice} = \frac{2\,|\mathrm{regROI} \cap \mathrm{tarROI}|}{|\mathrm{regROI}| + |\mathrm{tarROI}|}$$
where |regROI|, |tarROI|, and |regROI ∩ tarROI| are the volumes of the registered tumor segmentation, the target volume in ICBM2009a space, and their intersection, respectively. Dice scores below 0.4 were considered poor agreement, 0.4-0.6 as reasonable, 0.6-0.7 as good, and 0.7-1 as excellent (Bartko, 1991; Cicchetti, 1994). The average distance between the borders of the registered tumor segmentation and target volume in ICBM2009a space was determined by the modified Hausdorff distance proposed for object matching in Dubuisson and Jain (1994):
$$d_{modH} = \max\left(\frac{1}{N_R}\sum_{r \in R} d(r, T),\ \frac{1}{N_T}\sum_{t \in T} d(t, R)\right)$$
where $N_R$ and $N_T$ are the number of vertices describing the border of the registered tumor segmentation and target volume in ICBM2009a space, respectively, $r$ and $t$ give a specific vertex from the border of the registered tumor segmentation and target volume in ICBM2009a space, respectively, and $d$ defines the minimal distance of a point from an object, $d(r, T) = \min_{t \in T} \lVert r - t \rVert$.
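The Dice score and the modified Hausdorff distance of Dubuisson and Jain (1994) described above can be sketched in Python as follows. The representation is an assumption made for illustration: volumes are sets of voxel index tuples and borders are lists of vertex coordinates in mm. The function names are hypothetical, and the handling of the category boundaries (lower bound inclusive) is also an assumption, since the text does not specify it.

```python
import math


def dice_score(reg_roi, tar_roi):
    """Dice overlap of two voxel sets: 2|A and B| / (|A| + |B|)."""
    return 2.0 * len(reg_roi & tar_roi) / (len(reg_roi) + len(tar_roi))


def mean_min_distance(points, other):
    """Average over points of the minimal distance to any vertex in other."""
    return sum(min(math.dist(p, q) for q in other) for p in points) / len(points)


def modified_hausdorff(border_r, border_t):
    """Maximum of the two directed average minimal distances."""
    return max(mean_min_distance(border_r, border_t),
               mean_min_distance(border_t, border_r))


def dice_category(dice):
    """Agreement category; boundary handling is assumed, not taken from the text."""
    if dice < 0.4:
        return "poor"
    if dice < 0.6:
        return "reasonable"
    if dice < 0.7:
        return "good"
    return "excellent"


reg = {(0, 0, 0), (1, 0, 0), (2, 0, 0)}
tar = {(1, 0, 0), (2, 0, 0), (3, 0, 0)}
print(dice_score(reg, tar), dice_category(dice_score(reg, tar)))
print(modified_hausdorff([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (3, 0, 0)]))
```

Unlike the classical Hausdorff distance, which takes the single largest minimal distance, the modified version averages the minimal distances in each direction and is therefore less sensitive to isolated outlying vertices.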
FIGURE 3 | Registration flowchart. Images were bias corrected and smoothed and subsequently passed to the linear registration. Brain masks were used to drive the non-linear registration step, which was initiated with the results of the linear registration.

Landmark Registration
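General brain alignment accuracy was determined by the average distance in mm between the registered patient space landmarks and the corresponding landmarks in ICBM2009a space:
$$d_{avL} = \frac{1}{N_L}\sum_{i=1}^{N_L} d\left(L_i^{reg}, L_i^{ref}\right)$$
where $d_{avL}$ is the average distance for a single patient, $N_L$ is the number of labels, and $d(L_i^{reg}, L_i^{ref})$ is the distance between the i-th registered patient space landmark and its corresponding landmark in ICBM2009a space.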
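The average landmark distance for a single patient can be sketched in Python as below. The function name and the example coordinates are hypothetical, and the Euclidean distance between paired landmarks is assumed.

```python
import math


def mean_landmark_distance(registered, reference):
    """Mean Euclidean distance in mm between paired landmarks of one patient."""
    if len(registered) != len(reference):
        raise ValueError("landmark lists must have the same length")
    total = sum(math.dist(a, b) for a, b in zip(registered, reference))
    return total / len(registered)


# Two hypothetical landmark pairs: a perfect match and a 5 mm error
registered_landmarks = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
reference_landmarks = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
print(mean_landmark_distance(registered_landmarks, reference_landmarks))
```

The pairing is by position in the list, so both lists must hold the landmarks in the same anatomical order.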

Statistical Analysis of Accuracy Measures
Linear mixed models were used to evaluate the results with Dice, $d_{modH}$, or $d_{avL}$ as a response variable. Type of registration package, pathology, time-point, and the type of transformation were set as fixed-effects to examine their association with accuracy results. The patient ID was set as a random effect with a random intercept, and the time-point was nested inside the patient ID. Because six registration packages were tested, the differences between all possible combinations of packages were tested with Tukey's honestly significant difference test. A p-value less than 0.05 was considered significant.

Patient Characteristics
Of the lower-grade glioma patients 12 had astrocytoma WHO grade II, four oligodendroglioma WHO grade II, three oligoastrocytoma WHO grade II, and one anaplastic astrocytoma WHO grade III (Louis et al., 2007). Of the 20 glioblastoma patients, 10 were female; median age was 65.7 years (IQR 52.1-71.1 years); and 8 patients had a tumor in the left hemisphere. Median and interquartile range for pre-operative tumor volume was 29.6 mL (IQR 10.3-55.3 mL), and for post-operative 0.8 mL (IQR 0.3-3.5 mL). Of the 20 lower-grade glioma patients, 8 were female; median age was 36.7 years (IQR 28.9-45.7); and 9 patients had a tumor in the left hemisphere. Median and interquartile range for pre-operative tumor volume was 55.2 mL (IQR 33.8-72.1 mL), and for post-operative 10.1 mL (IQR 5.5-16.9 mL).

Rater Consistency
For glioblastoma and lower-grade glioma, 73 and 87% of the voxels defined as post-operative target volume in ICBM2009a space were within 3 mm of the pre-operative target volume, respectively. The mean Dice score, modified Hausdorff distance, and GCI for the intra-rater agreement of the delineation in the ICBM2009a space were 0.87, 1.27 mm, and 0.77 for the pre-operative and 0.48, 2.21 mm, and 0.32 for the post-operative enhancing gliomas; and 0.87, 1.04 mm, and 0.76 for the pre-operative and 0.67, 1.21 mm, and 0.51 for the post-operative non-enhancing gliomas.

Comparison of Accuracy Between Glioblastoma and Lower-Grade Glioma
Examples of typical tumor registration accuracy are shown in Figure 4.
Summarizing the results from all registration packages, the median Dice scores at pre- and post-operative time-points were 0.4 and 0.0 for glioblastoma, and 0.7 and 0.3 for lower-grade glioma. The median Hausdorff distances at pre- and post-operative time-points were 5.7 and 5.8 mm for glioblastoma, and 3.8 and 4.5 mm for lower-grade glioma. The median landmark distance at pre- and post-operative time-points was 7.5 and 6.7 mm for glioblastoma, and 6.7 and 6.4 mm for lower-grade glioma. All accuracy scores are shown in Figure 5.

Effect of Registration Software Packages
For registration accuracy measured with the Dice score, SPM5, DARTEL, and NiftyReg did not differ significantly from each other, and SPM5 and DARTEL were significantly more accurate than FSL, ANTs, and Elastix. For all significant differences the estimated coefficients were ≤0.04.
When measured with the Hausdorff distance, ANTs, SPM5, DARTEL, and NiftyReg did not differ from each other, and SPM5 and DARTEL were significantly more accurate than FSL and Elastix. For all significant differences the estimated coefficients were ≤−0.78 mm.
For registration accuracy measured with the landmark distance, SPM5, DARTEL, and NiftyReg did not differ from each other, and were significantly more accurate than FSL, ANTs, and Elastix. For all significant differences the estimated coefficients were ≤−0.85 mm.
Excluding the tumor from the cost-function of the non-linear registration for FSL, ANTs, Elastix, and NiftyReg had some influence on the location and shape of the registered tumor for all registration packages, most strongly for FSL (Supplementary Figure S1). No difference was observed in accuracy results for non-linear registration with either including or excluding the tumor from the cost-function (Supplementary Figure S2).
FIGURE 5 | Accuracy of the six packages for both linear and non-linear registration of the pre- and post-operative time-points. Left column shows accuracy for glioblastoma, and the right column for lower-grade glioma. For Dice score higher values indicate better accuracy, for Hausdorff and mean landmark distances smaller values indicate better accuracy. The dashed line indicates the average intra-rater agreement as reported previously (Chollet et al., 2014), multiplied by two to account for error in placement in both patient and ICBM2009a space. Each dot is a data point for an accuracy measure for a single patient. Dot transparency is controlled by the volume of the tumor connected to that data point divided by the maximum tumor volume in that distribution (more transparent is smaller relative volume). The whiskers extend to the most extreme data points not considered outliers, outliers are plotted individually using the '+' symbol, boxes show the interquartile range, and the contained red line shows the median of each respective distribution.

DISCUSSION
The results of this study showed that registrations with non-linear transformations were more accurate for general brain alignment, but similar to linear transformation for tumor alignment. Furthermore, registrations of lower-grade glioma were more accurate than those of glioblastoma, but registrations of pre- and post-operative MRI were similar in accuracy. SPM5 and DARTEL were slightly more accurate and FSL was slightly less accurate than the other registration packages. The manually created target volume estimates were shown to be consistent in regard to localization of pre- and post-operative tumor.
Compression of the mass effect in glioblastoma proved particularly challenging (Figure 1), which might explain why lower-grade glioma registered with higher accuracy, as these tumors typically have more infiltrative than expansive growth. Dice scores were reasonable for pre-operative glioblastoma registrations, and good for lower-grade glioma. Dice scores were poor for post-operative registration of both glioblastoma and lower-grade glioma. This was most likely a volume effect in the sense that the Dice score is more sensitive to misalignment of small volumes, since Hausdorff distances were comparable to those of pre-operative registration results (Figure 5). Upon visual inspection of the non-linear registrations, these were considered to provide a reasonable representation of the normal brain structures distant from the tumor without gross deformation. The accuracy of the tumor registrations depended on the tumor volume and extent of mass effect, which is unsurprising since regions in the patient images without analog in the reference space cannot, by definition, be aligned correctly. Another source of visual registration imprecision was ventricular enlargement, which was observed in a few cases. Furthermore, unlike the source images used to create the ICBM2009a atlas, the patient images in this study were acquired after gadolinium administration, which sometimes resulted in erroneous alignment of the superior sagittal sinus with (sub)cutaneous structures at the vertex.
Although SPM5 and DARTEL often outperformed the other methods, the estimated coefficients were small. When compared to FSL the estimated coefficient was largest, and FSL was also outperformed by the other packages by a smaller margin. Therefore, these results do not indicate a preference for any of the registration packages.
An important observation here is that non-linear transformation did not improve the accuracy of tumor alignment compared to linear transformation (Figure 6). These findings indicate that non-linear transformation is not required for summaries of patient populations in probability maps of tumor segmentations. Additional advantages of omitting this step would be a reduction of processing time and avoidance of registration instability in the sense that non-linear registration is more sensitive to parameter settings and variation in input.
Lesion overlap scores for lower-grade glioma described in this study are comparable to those reported in Ou et al. (2014), who also studied ANTs and FSL. The lesion was defined as the union of the cavity and recurrent tumor. They used a similar approach, but unfortunately did not specify the tumor type and specific landmarks used. Landmark distances observed in this study are slightly higher for ANTs (∼1 mm), and lower for FSL (∼2 mm) than those reported in Ou et al. (2014), which could be a result of different landmarks used, possible difference in pathology, and difference in time-point of treatment. A similarity between our findings is that we also found lower median Hausdorff distances than median landmark distances, which is counterintuitive. The largest landmark registration errors in this study arose from registering the frontal poles of the brain and the posterior horns of the lateral ventricles, which could be explained by possibly larger inherent differences between individuals for these structures.
Often lesions are excluded from the registration process through cost function masking to improve alignment (Brett et al., 2001), which requires the creation of lesion masks in patient space prior to registration. To justify including the tumor in the brain mask an analysis was performed for non-linear transformations that either excluded or included the tumor in the brain mask. Excluding the tumor from the registration cost function did not improve tumor overlap, nor did it improve landmark alignment. An alternative to lesion masking for improving registration performance is to perform lesion filling (Popescu et al., 2014) prior to registration.
FIGURE 6 | Differences between linear and non-linear registration accuracy (non-linear minus linear). Left column shows differences in accuracy for glioblastoma, and the right column for lower-grade glioma. For Dice score positive difference values indicate better non-linear performance, for Hausdorff and mean landmark distance negative difference values indicate better non-linear performance. Each dot is a data point for an accuracy measure for a single patient. Dot transparency is controlled by the volume of the tumor connected to that data point divided by the maximum tumor volume in that distribution (more transparent is smaller relative volume). The whiskers extend to the most extreme data points not considered outliers, outliers are plotted individually using the '+' symbol, boxes show the interquartile range, and the contained red line shows the median of each respective distribution.
The image data we collected was of a clinical nature, and therefore had some variety in scanner models and field strength. Previous work showed that variation in scanners and acquisition parameters leads to volumetric variation in segmentations, although registration based methods were less susceptible to these differences by correcting for differences in image geometry (Takao et al., 2011;Durand-Dubief et al., 2012). The MRI acquisition differences in this study did not appear to influence the landmark position or the tumor registration, and no systematic differences in registration accuracy were observed due to either the scanner model, or the field strength.
Recommended imaging practice for glioma patients is to obtain a T1c (Thust et al., 2018), which is often the highest resolution scan, making this the preferable scan for registration. However, T1c images are different from the T1-weighted ICBM2009a template, which does not contain contrast enhanced elements such as the blood vessels. This difference between the T1c and the ICBM2009a template adds to the challenge of successful registration of a patient scan to ICBM2009a space. Differences between patient anatomy and the ICBM2009a template may further complicate the subject-to-atlas registration. Obtaining deformation statistics of anatomical variability, and incorporating this into the registration, could be beneficial to the registration (Commowick et al., 2005).
Another challenge is the inherent difference between an image of a brain with diffuse glioma and anatomical reference space, which is often created from healthy individuals. A method for dealing with the lack of correspondence between patient image and anatomical reference space is to perform tumor seeding in the anatomical reference space. This will create a simulated tumor in the anatomical reference space, making it more similar to the patient image with tumor (Dawant et al., 2002; Mohamed et al., 2006; Prastawa et al., 2009). However, the applicability of such methods depends on how well the tumor growth model mimics the actual tumor, and the required manual input that must be provided. Furthermore, the implementations of the individual tumor simulations require specific data input and are therefore not easily generalizable.
Registration accuracy might also be improved by using additional information from multiple MRI sequences or image-derived information such as the image gradient. This might address, for example, the ventricular enlargement often seen in glioblastoma patients. Access to different similarity metrics for the different components or phases of the registration could also increase accuracy by allowing the user to fine-tune the registration on a step-wise basis. Furthermore, the registration accuracy can be affected by the performance of the pre-processing steps. Differences due to noise removal and bias correction are expected to be minimal between different methods. However, other pre-processing methods such as skull-stripping and identifying tissue probabilities can yield larger differences between methods and thereby have a more profound effect when used for subsequent registration schemes.
Rater consistency of segmentations in the ICBM2009a space was evaluated on a small sub-sample. It showed good agreement pre-operatively in both enhancing and non-enhancing tumors. GCI was comparable to inter-rater GCI of segmentations done by experts in patient space. Post-operatively, the overlap was much lower due to small tumor size, especially in enhancing tumors, thus creating a skewed perception of reproducibility. Our earlier inter-rater study showed similarly low GCI values in these settings. Moreover, the modified Hausdorff distance was rather low, showing good intra-rater consistency despite low GCI and Dice scores of these small ROIs. Overall, Dice scores were much higher for intra-rater consistency than when comparing the manual and automatic segmentations, and the modified Hausdorff distance was several times lower. That and the fact that we have compared the registration methods on a relative scale with the same ground truth lead us to the conclusion that the quality of segmentations in the ICBM2009a space did not have a negative impact on the results.
Defining a framework for registration success is a complex task, due to the lack of a ground truth. Extensive manual annotations provide the most rigorous framework (Rohlfing, 2012), but come at the cost of manual labor. Overlap or distance measures alone do not provide a comprehensive registration comparison, especially when high-degrees-of-freedom algorithms are used. Therefore, we used the combination of Dice overlap, Hausdorff distance, and landmark distance to quantify our registration results. Labeling and using specific tumor regions might offer an even more detailed evaluation of the registration. However, this would require more complex manual labeling of the tumor structure and is out of the scope of the current work. Furthermore, our approach is intrinsically different from approaches that involve registrations with the ultimate aim of tumor segmentation, such as Prastawa et al. (2004), Stefanescu et al. (2004), and Parisot et al. (2012), where performance can be tested with only a tumor segmentation in patient space. Agreement between raters for tumor segmentations in patient space was determined in Visser et al. (2019), but the rater agreement for the estimated target volume creation in ICBM2009a space is unknown. In this study a single rater created all target volume estimates, for which we showed that the rater was quite consistent in the creation of pre- and post-operative target volumes in ICBM2009a space. Perhaps using segmentations of multiple raters might yield more generalizable results, although this would significantly increase the workload of the study.
Accurate normalization to anatomical reference space is not only relevant to the study of glioma, but also to fMRI studies, and the study of other pathologies such as stroke which faces similar challenges in registration as glioma (Crinion et al., 2007). To incorporate registration uncertainty into summary maps of a glioma patient population, the Hausdorff accuracy results reported here can be used as an indication of the amount of regularization needed. The uncertainty in registration accuracy found here is not trivial since median Hausdorff distances range between 3.8 and 5.8 mm. The smoothing of small tumor objects, particularly surgical residue, will increase the required amount of data to perform reliable statistics.

CONCLUSION
Linear transformation suffices to summarize glioma locations in anatomical reference space. The obtained summary maps should be regularized to account for a registration uncertainty of 3.8-5.8 mm. Registration is more accurate for lower-grade glioma than for glioblastoma. Registration of pre- and post-operative MR scans was similarly accurate. No single registration software package stands out.

DATA AVAILABILITY STATEMENT
Registration settings are provided in the Supplementary Material of this manuscript. The image data used in this study is available at https://doi.org/10.17026/dans-zg9-nhrj (Hamer, 2019).

ETHICS STATEMENT
The institutional review board at the VU University Medical Center Amsterdam approved this study (2014.336). All patients provided written informed consent for use of their clinical data for medical research. The MRI data was retrospectively collected from the hospital's Picture Archiving and Communication System. All images were analyzed after anonymization in accordance with the General Data Protection Regulation.

AUTHOR CONTRIBUTIONS
MV wrote the manuscript, edited the parameter settings for FSL, ANTs, Elastix, and NiftyReg, processed the data, analyzed the data, and plotted the figures. JP edited the parameter settings for SPM5 and DARTEL, analyzed the data, and revised the manuscript. DM provided additional cavity segmentations, analyzed the data, and revised the manuscript. RE managed the digital framework, analyzed the data, and revised the manuscript. EH provided additional cavity segmentations for cost-function masking and revised the manuscript. MW, FB, MH, HM, and HV analyzed the data and revised the manuscript. JM initially supervised the study and revised the manuscript. PD provided the tumor segmentations, tumor targets and landmarks, designed the statistical analysis, analyzed the data, and revised the manuscript.