CNN-based lung CT registration with multiple anatomical constraints

Deep-learning-based registration methods have emerged as a fast alternative to conventional registration methods. However, these methods often still cannot achieve the same performance as conventional registration methods because they are either limited to small deformations or fail to handle a superposition of large and small deformations without producing implausible deformation fields that contain foldings. In this paper, we identify important strategies of conventional registration methods for lung registration and successfully develop their deep-learning counterparts. We employ a Gaussian-pyramid-based multilevel framework that solves the image registration optimization in a coarse-to-fine fashion. Furthermore, we prevent foldings of the deformation field and restrict the determinant of the Jacobian to physiologically meaningful values by combining a volume change penalty with a curvature regularizer in the loss function. Keypoint correspondences are integrated to focus on the alignment of smaller structures. We perform an extensive evaluation to assess the accuracy, the robustness, and the plausibility of the estimated deformation fields, as well as the transferability of our registration approach. We show that it achieves state-of-the-art results on the COPDGene dataset compared to conventional registration methods, with a much shorter execution time. In our experiments on the DIRLab exhale-to-inhale lung registration, we demonstrate substantial improvements (TRE below 1.2 mm) over other deep learning methods. Our algorithm is publicly available at https://grand-challenge.org/algorithms/deep-learning-based-ct-lung-registration/.


Introduction
Image registration is the process of aligning two or more images to achieve point-wise spatial correspondence. This is a fundamental step for many medical image analysis tasks and has been an active field of research for decades (Maintz and Viergever, 1998; Sotiras et al., 2013). Various approaches and tailored solutions have been proposed for a wide range of problems and applications. Typically, image registration is phrased as an optimization problem with respect to a spatial mapping that minimizes a suitable cost function, and common approaches estimate solutions by applying iterative optimization schemes. Unfortunately, solving such an optimization problem is computationally demanding and consequently slow.
While deep learning has become the methodology of choice in many areas, relatively few deep-learning-based image registration algorithms have been proposed. One reason for this is the lack of ground truth and the large variability of plausible deformations that can align corresponding anatomies. Therefore, the problem is much less supervised than, for example, image classification or segmentation. Nevertheless, several methods have been presented in recent years which aim to mimic the process of conventional image registration by training a neural network to predict the non-linear deformation function for two new unseen images. As a trained neural network can process images in real time, this has immense potential for time-sensitive applications such as image guidance in radiotherapy, tracking, or shape analysis through multi-atlas registration.
In this paper, we target the challenging task of lung registration. The complexity of this registration task is manifold, as the occurring motion is a superposition of respiratory and cardiac motion. Moreover, the sliding motion between the lung and rib cage during breathing - more precisely, between pleura visceralis and pleura parietalis - is an additional challenge. The scale of the motion within the lungs can often be larger than the structures (vessels and airways) that are used to guide the optimization process. This may cause a registration algorithm to get trapped in a local minimum (Heinrich et al., 2013; Polzin et al., 2013), which makes the problem even more difficult. Therefore, a registration method needs to be able to estimate a displacement field that accounts for substantial breathing motion but also aligns small structures like individual pulmonary blood vessels precisely.

Deep-learning-based methods have emerged as a fast alternative for image registration and have been applied to different registration applications including brain MR (Balakrishnan et al., 2019; Yang et al., 2017b), cardiac MR (de Vos et al., 2017), cardiac MR-CT (Hering et al., 2019c), prostate MR-US (Hu et al., 2018a), thorax-abdomen CT (Heinrich, 2019), thorax CT (de Vos et al., 2019; Eppenhof et al., 2019; Hering et al., 2019a; Sentker et al., 2018) and CT-CBCT registration (Kuckertz et al., 2020). Existing approaches can be classified as supervised, unsupervised, and weakly-supervised techniques based on how much supervision is available.
Supervised methods use ground-truth deformation fields for training. The ground truth can be generated in different ways. In Sokooti et al. (2017), for example, the network is trained on synthetic random transformations. A drawback is that the randomly generated ground truth is artificial and may not be able to reproduce all possible deformations. Alternatively, conventional registration methods can be used to produce deformations by registering images (Sentker et al., 2018; Yang et al., 2017a) or other image features like landmarks or segmentations (Rohé et al., 2017). Another way to create a ground truth is to combine simulations with existing algorithms (Krebs et al., 2017). Consequently, the performance of all these approaches is upper-bounded by the quality of the initial registration algorithm or the realism of the synthetic deformations.
In contrast, unsupervised methods - also called self-supervised methods - do not require any ground truth. The idea is to use the cost function of conventional image registration (similarity measure and regularization term) as the loss function to train the neural network. An important milestone for the development of these methods was the introduction of the spatial transformer network (Jaderberg et al., 2015) to differentiably warp images. This differentiable warping has actually been part of most conventional registration methods for a long time (e.g. König et al. (2018); Modersitzki (2004, 2009)). The concept of an unsupervised deep-learning-based registration method was first introduced with the DIRNet (de Vos et al., 2017) for 2D image registration, using the normalized cross-correlation image similarity measure as loss function. In Li and Fan (2018), the approach was extended by adding diffusion regularization to the loss function, forcing smooth deformations; the method has successfully been demonstrated for registration of 3D brain subvolumes. The idea of unsupervised deep-learning-based image registration has been further evolved in several works (e.g. de Vos et al., 2019; Ferrante et al., 2018; Krebs et al., 2019).
Weakly-supervised methods do not rely on ground-truth deformation fields either but training is still supervised with prior information. In Hu et al. (2018a) and Hu et al. (2018b), a set of anatomical labels is used in the loss function. The labels of the moving image are warped by the deformation field and compared with the fixed labels. All anatomical labels are only required during training. In Balakrishnan et al. (2019) and Hering et al. (2019a,c), the complementary strengths of global semantic information and local distance metrics were combined to improve the registration accuracy.
In conventional registration approaches, multilevel continuation and scale-space techniques have proven very effective for avoiding local minima during the optimization of the cost function, for reducing topological changes or foldings, and for speeding up runtimes (Bajcsy and Kovačič, 1989; Haber and Modersitzki, 2004; Kabus and Lorenz, 2010; Schnabel et al., 2001) - explaining the popularity of multilevel strategies in conventional registration methods. As many deep-learning-based registration methods are built on top of a U-Net (e.g. Balakrishnan et al. (2019); Hering et al. (2019b); Rohé et al. (2017)), they are multilevel in nature. The first half of the "U" (the encoder) generates features on different scales, starting at the highest resolution and reducing the resolution through pooling operations. In this procedure, however, only feature maps on different levels are calculated; neither are different image resolutions used nor deformation fields computed. Only a few approaches implement a multi-resolution or hierarchical strategy in the sense of the multilevel strategies associated with conventional methods. In Hu et al. (2018a), the authors proposed an architecture that is divided into a global and a local network, which are optimized together. In Eppenhof et al. (2019), a multilevel strategy is incorporated into the training of a U-Net by growing and training it progressively level-by-level. In de Vos et al. (2019), a patch-based approach is presented, where multiple CNNs (ConvNets) are combined additively into a larger architecture performing coarse-to-fine registration of patches. The results from the patches are then combined into a deformation field warping the whole image. Another patch-based multilevel approach is presented in Fu et al. (2020), where the multilevel framework consists of a CoarseNet and a FineNet which are trained jointly.
During training, the estimated deformation fields of the CoarseNet and the FineNet are not combined; instead, the moving patch is transformed twice. During inference, if the mean absolute difference between the deformed image patch and the fixed image exceeds a predefined threshold, the FineNet is applied again. This leads to a variable number of deformation field patches, which are combined additively. Although previous deep-learning-based registration works (e.g. de Vos et al. (2019); Sentker et al. (2018)) have contributed many efforts to improving the registration accuracy for lung registration, there is still a misalignment of smaller structures in the lung, which leads to a high target registration error at the landmarks.

Contribution
We previously introduced an end-to-end deep-learning multilevel registration method that can handle large deformations by computing deformation fields on different scales and functionally composing them (Hering et al., 2019a). This initial study, despite its limited evaluation, proved this to be a valid strategy for improving the alignment of vessels and airways - though a gap to the best conventional registration methods regarding the target registration error of landmarks remained. Building on this previous work and addressing its limitations, we were able to further close that gap.
Our key contributions are as follows:

• We present multiple anatomical constraints to incorporate anatomical priors into the registration framework and obtain more realistic results. We integrate the lung lobe mask to consider the global context. Moreover, keypoint correspondences are used to improve the alignment of airways and vessels.

• We introduce a novel constraining method to control volume change and thereby avoid foldings inside the deformation field. While the idea of volume change control is not new in conventional registration, we are the first to present a version suitable for deep-learning-based image registration.

• We perform comprehensive experiments on three different datasets - the multicenter COPDGene study (Regan et al., 2011), the DIRLab challenge dataset (Castillo et al., 2009), and the EMPIRE10 challenge dataset (Murphy et al., 2011) - to assess the accuracy, plausibility, robustness, and transferability of our method. We achieve results comparable to state-of-the-art registration approaches.

Variational registration approach
Let ℱ, ℳ: ℝ³ → ℝ denote the fixed image and the moving image, respectively, and let Ω ⊂ ℝ³ be a domain modeling the field of view of ℱ. Registration methods aim to compute a deformation y: Ω → ℝ³ that aligns the fixed image ℱ and the moving image ℳ on the field of view Ω such that ℱ(x) and ℳ(y(x)) are similar for x ∈ Ω. The deformation is defined as a minimizer of a suitable cost function that typically takes the form

J(ℱ, ℳ, y) = D(ℱ, ℳ(y)) + α ℛ(y)

with a so-called distance measure D that quantifies the similarity of the fixed image ℱ and the deformed moving image ℳ(y), and a so-called regularizer ℛ that forces smoothness of the deformation, typically by penalizing spatial derivatives. Typical examples for the distance measure are the squared L² norm of the difference image (SSD), normalized cross-correlation (NCC), or mutual information (MI). The cost function can be extended by additional penalty terms to enforce desired properties or to incorporate additional knowledge in the form of anatomical constraints (Rühaak et al., 2017). As illustrated in Fig. 1, our method inputs both the fixed and the moving image into the network, which predicts a dense displacement field. The loss function uses all available information - input images, segmentation masks and keypoints - with additional regularization, in the form of a smoothness prior and a volume consistency constraint, to prevent foldings.
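As a concrete example of such a regularizer, the curvature regularizer used in this work penalizes second spatial derivatives of the displacement u, where y(x) = x + u(x). A minimal NumPy sketch, assuming the standard curvature form ℛ(y) = ½ ∫_Ω Σᵢ (Δuᵢ)² dx, with the Laplacian approximated by repeated central differences and the integral by a voxel mean (the exact discretization used in the paper is not specified here):

```python
import numpy as np

def curvature_regularizer(u):
    """Curvature regularizer R(y) = 1/2 * integral of sum_i (laplace u_i)^2
    for y(x) = x + u(x), with u of shape (3, D, H, W) in voxel units.
    The Laplacian is built from repeated central differences and the
    integral is replaced by a mean over voxels (a simplifying sketch)."""
    total = 0.0
    for c in range(3):
        # sum of second derivatives along each axis approximates the Laplacian
        lap = sum(np.gradient(np.gradient(u[c], axis=a), axis=a) for a in range(3))
        total += (lap ** 2).mean()
    return 0.5 * total
```

Affine displacement fields have zero curvature, so this regularizer does not penalize global scaling or shearing, only local bending.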

Loss function
3.2.1. Normalized gradient field distance measure-One of the main challenges of lung registration is the varying intensity change that occurs due to the altered density of lung tissue during breathing. This leads to a violation of the intensity constancy assumption between corresponding points on which the classic sum of squared differences (SSD) distance measure is built. However, the lung exhibits a rich structure of bronchi, fissures, and especially vessels that can be exploited for registration, making distance measures that focus on image edges rather than intensities better suited. We follow the approach of Rühaak et al. (2017) and Hering et al. (2019a) and use the normalized gradient fields (NGF) distance measure (Haber and Modersitzki, 2006)

D(ℱ, ℳ(y)) = ∫_Ω 1 − ⟨∇ℳ(y(x)), ∇ℱ(x)⟩²_ϵ / (‖∇ℳ(y(x))‖²_ϵ ‖∇ℱ(x)‖²_ϵ) dx,

with ⟨f, g⟩_ϵ := Σ³_{j=1} f_j g_j + ϵ² and ‖f‖_ϵ := √⟨f, f⟩_ϵ. The edge hyper-parameter ϵ > 0 is used to suppress small image noise without affecting image edges. Therefore, a good strategy is to choose its value relative to the average gradient. In Haber and Modersitzki (2006), the automatic choice

ϵ = (ν / V) ∫_Ω ‖∇ℱ(x)‖ dx

is suggested, where ν is the estimated noise level in the image and V is the volume of the domain Ω. For CT images, a value in the range of [0.1, 10] is mostly a good choice.
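A minimal NumPy sketch of the NGF distance above (central-difference gradients via np.gradient, unit voxel spacing, and the integral replaced by a mean over voxels are simplifying assumptions; the optional mask argument mirrors the restriction of Ω to the lung):

```python
import numpy as np

def ngf_distance(fixed, warped_moving, eps=1.0, mask=None):
    """Normalized gradient fields distance (Haber & Modersitzki, 2006).
    Returns the mean of 1 - <grad M, grad F>_eps^2 / (|grad M|_eps^2 |grad F|_eps^2),
    which is 0 where edges are perfectly aligned and approaches 1 where
    edges disagree."""
    gf = np.stack(np.gradient(fixed.astype(np.float64)))           # shape (3, D, H, W)
    gm = np.stack(np.gradient(warped_moving.astype(np.float64)))
    inner = (gf * gm).sum(axis=0) + eps ** 2                       # <f, g>_eps
    norm_f_sq = (gf * gf).sum(axis=0) + eps ** 2                   # |f|_eps^2
    norm_m_sq = (gm * gm).sum(axis=0) + eps ** 2
    pointwise = 1.0 - inner ** 2 / (norm_f_sq * norm_m_sq)
    if mask is not None:
        pointwise = pointwise[mask > 0]
    return pointwise.mean()
```

By the Cauchy-Schwarz inequality (applied to the gradient vectors augmented with ϵ), each pointwise term lies in [0, 1], and the distance is exactly zero for identical images.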
Since we focus on accurate registration inside the lungs and to avoid misalignment artifacts due to sliding motion at the pleura, we restrict Ω to the support of the lung mask of the fixed image.

Volume change control-Although the curvature regularization from Equation (2) prefers smooth deformations, foldings may still occur, which is obviously physically impossible. More formally, foldings happen when the Jacobian determinant of the deformation field becomes negative. To avoid any foldings, we therefore aim to minimize the distance measure D and the regularizer ℛ while keeping the Jacobian determinant positive for every voxel in Ω. Formally, this can be written as the constrained optimization problem: minimize D(ℱ, ℳ(y)) + α ℛ(y) subject to det ∇y(x) > 0 for all x ∈ Ω. To achieve this, Rühaak et al. (2017) introduced a Volume Change Control (VCC) penalty that can be integrated into the overall objective,

VCC(y) = ∫_Ω ψ(det ∇y(x)) dx,    (3)

where

ψ(z) = (z − 1)² / z for z > 0, and ψ(z) = ∞ otherwise.    (4)
For the sake of simplicity, the input to ψ in Eq. 3 is substituted with z = det ∇y(x) in Eq. 4. Notice that ψ(z) is minimized when z = 1 (see Fig. 2 (a)). Therefore, the regularizing effects of the VCC are twofold: i) it prevents the formation of foldings by keeping the determinants positive, and ii) it limits both shrinkage and expansion by biasing the optimization towards volume preservation.
The method that Rühaak et al. (2017) used falls into the category of interior-point methods.
Such methods became very popular in constrained optimization (Boyd et al., 2004) as they do not require the expensive primal-dual updates of traditional Lagrangian optimization: the infinite penalty acts as a "barrier", preventing the optimization from going out of bounds.
To be used, interior-point methods require a feasible starting point: all constraints need to be strictly satisfied before starting the optimization procedure. This is usually done in a pre-optimization step (called Phase I) before the actual optimization of Phase II is performed. We can see it as finding a valid initial guess and then refining it.
In the context of deep neural networks, standard Lagrangian methods are not feasible due to their expensive primal-dual updates, which would require retraining a neural network (from scratch) at each iteration. Interior-point methods are also not applicable, as solving Phase I requires solving a constrained optimization problem in the first place. Kervadec et al. (2019, 2020) proposed a parametric log-barrier extension (illustrated in Fig. 2 (b)) that does not require an initial feasible solution: t is a hyper-parameter controlling the slope of the barrier. By starting with a small initial value and increasing it as training progresses, one is able to "raise" the barrier, eventually closing it.
We propose to keep the property of Eq. (4) of symmetrically penalizing local shrinkage and expansion and to make it applicable for neural networks by using the barrier formulation of Eq. (5) for z < t with t → 0 over time (illustrated in Fig. 2 (c)):

ψ_t(z) = (z − 1)² / z for z ≥ t, and ψ_t(z) = ψ_t(t) + ψ_t′(t)(z − t) for z < t,    (6)

with t > 0 a hyper-parameter controlling the slope of the linear barrier for z < t. This barrier can be raised during training by decreasing the value of t to penalize foldings more strongly. Note that the linear part for z < t is chosen such that ψ_t is continuously differentiable provided t > 0. In our experiments, we set t = 0.2 for the first level of our multilevel architecture and decrease it by a factor of 2 for each further level. For z ≥ t, we symmetrically penalize local shrinkage and expansion, i.e., ψ(z) = ψ(1/z).

Lobe mask penalty-Since the segmentation masks are used in the loss function, they are only required during training and not for the registration of unseen images. We integrate the segmentation masks by using an SSD loss between the warped moving lobe mask and the fixed lobe mask.

Keypoint loss-Previous work (e.g. Rühaak et al. (2017)) has shown that the integration of sparse keypoints during the optimization of the deformation field yields better registration results. In contrast to conventional registration approaches, keypoints can be integrated into the loss function and are therefore, similar to the segmentation masks for the mask penalty, only needed for training but not during inference. In general, there are several ways to integrate keypoints into an intensity-based registration approach (e.g. Fischer and Modersitzki (2003a); Rühaak et al. (2017); Papenberg et al. (2009)). We choose to integrate the keypoint information through a least-squares penalty into our model by directly comparing the transformed keypoints of the fixed image with the corresponding moving keypoints:

Author Manuscript with the moving keypoint k ℳ i and the warped fixed keypoint y k ℱ i for all K keypoints.
In general, manually annotated landmarks or automatically generated keypoints can be integrated with this loss function. However, since the manual annotation of landmarks is time-consuming, we use the keypoint detection algorithm described in Rühaak et al. (2017) to generate a large number of corresponding keypoints.
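The least-squares keypoint penalty amounts to a mean squared Euclidean distance over keypoint pairs. A minimal NumPy sketch (keypoints as K × 3 coordinate arrays; the function name is illustrative):

```python
import numpy as np

def keypoint_loss(warped_fixed_kpts, moving_kpts):
    """Least-squares keypoint penalty: mean squared Euclidean distance
    between the transformed fixed keypoints y(k_i^F) and the
    corresponding moving keypoints k_i^M, both given as (K, 3) arrays."""
    diff = (np.asarray(warped_fixed_kpts, dtype=np.float64)
            - np.asarray(moving_kpts, dtype=np.float64))
    return (diff ** 2).sum(axis=1).mean()
```

In training, y(k_i^ℱ) would be obtained by sampling the predicted displacement field at the fixed keypoint locations.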
The hyper-parameters α, β, γ and δ have to be chosen manually. However, our experiments showed that changes in their order of magnitude lead to only slight changes in the results.
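The volume change control penalty with the linear barrier (Eq. (6)) can be sketched as follows; the junction value and slope follow from ψ(z) = (z − 1)²/z, and the default t = 0.2 matches the first level of our schedule:

```python
import numpy as np

def vcc_penalty(z, t=0.2):
    """Volume change control with a linear barrier: psi(z) = (z - 1)^2 / z
    for z >= t, continued linearly below t so that the penalty is finite
    and continuously differentiable even for negative Jacobian
    determinants (foldings). Symmetric on the smooth branch:
    psi(z) == psi(1/z)."""
    z = np.asarray(z, dtype=np.float64)
    psi_t = (t - 1.0) ** 2 / t         # penalty value at the junction z = t
    dpsi_t = 1.0 - 1.0 / t ** 2        # derivative of (z - 1)^2 / z at z = t
    # np.maximum avoids division by z <= 0 in the branch that is unused there
    smooth = (z - 1.0) ** 2 / np.maximum(z, t)
    return np.where(z >= t, smooth, psi_t + dpsi_t * (z - t))
```

Decreasing t over the levels steepens the linear part, penalizing foldings more and more strongly as training progresses.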

Baseline architecture
Our CNN is based on a U-Net (Ronneberger et al., 2015) which takes the concatenated 3D moving and fixed images as input and predicts a 3D dense displacement field with the same resolution as the input images. The U-Net consists of three levels starting with 16 filters in the first layer, which are doubled after each downsampling step. We apply 3D convolutions in both the encoder and the decoder path with a kernel size of 3, followed by an instance normalization and a ReLU layer. In the encoder path, the feature map downsampling steps use 2 × 2 × 2 average pooling with a stride of 2. In the decoder path, the upsampling steps use transposed convolutions with 2 × 2 × 2 filters and half as many filters as the previous step. The final layer uses a 1 × 1 × 1 convolution filter to map each 16-component feature vector to a three-dimensional displacement.
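A compact PyTorch sketch of this baseline follows. Three pooling steps are assumed (matching the divisibility constraint discussed for the multilevel setting); the exact number of convolutions per level and the skip-connection handling are assumptions where the text leaves them open:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # 3x3x3 convolution followed by instance normalization and ReLU
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.InstanceNorm3d(out_ch),
        nn.ReLU(inplace=True),
    )

class RegistrationUNet(nn.Module):
    """Sketch of the baseline: fixed and moving image concatenated as a
    2-channel input, 16 filters on the first level (doubled after each
    average-pooling step), transposed-convolution upsampling with half
    the filters, and a final 1x1x1 convolution to a 3-channel
    displacement field."""
    def __init__(self):
        super().__init__()
        ch = [16, 32, 64, 128]
        self.enc = nn.ModuleList([conv_block(2, ch[0]),
                                  conv_block(ch[0], ch[1]),
                                  conv_block(ch[1], ch[2]),
                                  conv_block(ch[2], ch[3])])
        self.pool = nn.AvgPool3d(2, stride=2)
        self.up = nn.ModuleList([nn.ConvTranspose3d(ch[3], ch[2], 2, stride=2),
                                 nn.ConvTranspose3d(ch[2], ch[1], 2, stride=2),
                                 nn.ConvTranspose3d(ch[1], ch[0], 2, stride=2)])
        self.dec = nn.ModuleList([conv_block(2 * ch[2], ch[2]),
                                  conv_block(2 * ch[1], ch[1]),
                                  conv_block(2 * ch[0], ch[0])])
        self.head = nn.Conv3d(ch[0], 3, kernel_size=1)  # 16 features -> displacement

    def forward(self, fixed, moving):
        x = torch.cat([fixed, moving], dim=1)
        skips = []
        for enc in self.enc[:-1]:
            x = enc(x)
            skips.append(x)
            x = self.pool(x)
        x = self.enc[-1](x)
        for up, dec, skip in zip(self.up, self.dec, reversed(skips)):
            x = dec(torch.cat([up(x), skip], dim=1))
        return self.head(x)
```

The output has the same spatial size as the input and three channels, one per displacement component.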

Multilevel architecture
In conventional image registration, multilevel continuation has proven very effective for avoiding local minima, reducing topological changes or foldings, and speeding up runtimes (Bajcsy and Kovačič, 1989; Haber and Modersitzki, 2004; Kabus and Lorenz, 2010; Schnabel et al., 2001). Recent deep-learning-based approaches (de Vos et al., 2019; Fu et al., 2020; Hering et al., 2019a; Jiang et al., 2020; Mok and Chung, 2020) have shown that, besides carrying over these properties, a multilevel scheme helps overcome the limitation of deep-learning-based registration approaches in properly dealing with small and local deformations.
As in our previous work (Hering et al., 2019a), we follow the ideas of standard multilevel registration and compute coarse grid solutions that are prolongated and refined on the subsequent finer level. Our multilevel framework is illustrated in Fig. 3 with L = 3 levels.
The registration starts on the coarsest level L where the deformation y L is computed from the input images that have been Gaussian-smoothed and downsampled by a factor of 2 L −1 . On all finer levels ℓ < L, we incorporate the deformations from all preceding coarse levels as an initial guess by combining them by functional composition and warping the moving image. Subsequently, the fixed and warped moving images are downsampled.
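The functional composition of deformation fields can be sketched with NumPy/SciPy. Assumptions: displacement fields u with y(x) = x + u(x), shape (3, D, H, W) in voxel units, both already resampled to the same grid (in the paper, the coarse field is additionally prolongated to the finer grid first); the fine field is applied first, i.e. the result is y_coarse ∘ y_fine:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def compose_displacements(u_coarse, u_fine):
    """Functional composition of two dense displacement fields: with
    y(x) = x + u(x), (y_coarse o y_fine)(x) = x + u_fine(x)
    + u_coarse(x + u_fine(x)). The coarse field is resampled with
    trilinear interpolation at the points displaced by the fine field."""
    shape = u_fine.shape[1:]
    grid = np.stack(np.meshgrid(*[np.arange(s, dtype=np.float64) for s in shape],
                                indexing="ij"))
    pts = grid + u_fine                                  # x + u_fine(x)
    u_c_resampled = np.stack([
        map_coordinates(u_coarse[c], pts, order=1, mode="nearest") for c in range(3)
    ])
    return u_fine + u_c_resampled
```

Composing (rather than adding) the fields is what lets a small correction on a fine level refine a large coarse deformation without re-estimating it.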
The number of levels used is a hyper-parameter which should be chosen depending on the task and the data. The maximal number of levels is limited by the GPU memory and the image size. Since the images are downsampled by a factor of two in the multilevel setting and the image features are additionally downsampled three times in the U-Net, the number may be chosen at most so that the image size is divisible by 2^(3 + L − 1). Our experiments (cf. Section 4.9) have shown that a three-level scheme works best in our application and fits on a 12 GB GPU. In our experiments, we therefore use a three-level scheme (L = 3, Fig. 1). The three networks are trained progressively. First, the network on the coarsest level is trained for a fixed number of epochs. Afterwards, the parameters of the middle network are learned while the coarsest network stays fixed and is only used to produce the initial deformation field. The same procedure is repeated on the finest level. The same architecture is used on all levels. The network parameters on the coarsest level are initialized with Xavier uniform initialization (Glorot and Bengio, 2010), whereas all other networks are initialized with the learned parameters of the previous network. Note that the receptive field in voxels is the same for all networks; however, due to the decreased resolution on the coarse levels, the receptive field in mm is much larger.

Experiments
We perform several experiments to assess the accuracy, plausibility, robustness, transferability, and speed of our weakly-supervised deep-learning-based registration approach.

Data
We train and validate our method on data from the COPDGene study (Regan et al., 2011). To prove the robustness and transferability of our method and to compare it with other registration approaches, we additionally evaluate our registration approach on the publicly available DIRLab dataset (Castillo et al., 2009) and on the EMPIRE10 challenge. On the COPDGene dataset, the evaluation is based on the lobe segmentation masks; on both of the other datasets, annotated landmarks are available on which we evaluate the target registration error.

4.1.1. COPDGene Dataset-Training, validation, and testing data were acquired from the COPDGene study, a large multi-center clinical trial with over 10,000 subjects with chronic obstructive pulmonary disease (COPD) (Regan et al., 2011). The COPDGene study includes clinical information, blood samples, and chest CT scans. The image dataset was acquired across 21 imaging centers using a variety of scanner makes and models. Each patient received two breathhold 3D CT scans, one on full inspiration (200mAs) and one at the end of normal expiration (50mAs). About five years later, follow-up images were acquired from about 6000 subjects. In our study, we use the inspiration and expiration scans of 1000 patients. We split these patients into 750, 50, and 200 patients for training, validation, and testing, respectively. The original images have sizes in the range of 512 × 512 × {341, … , 974} voxels. The in-plane resolution of the axial slices varied from 0.5mm to 0.97mm per voxel with a slice thickness of 0.45mm to 0.7mm.
The human lungs are subdivided into five lobes that are separated by visceral pleura, the pulmonary fissures. An exemplary inspiration and expiration scan of one patient with the lobe segmentation overlay is shown in Fig. 4. For all scans, segmentations of the lobes are available; these were computed automatically, then manually corrected and verified by trained human analysts.

DIRLab
Challenge-This dataset consists of ten thoracic 4D CT images acquired as part of the radiotherapy planning process for the treatment of thoracic malignancies. In our study, we only use the inspiration and expiration phases of each 4D image, i.e., two of the ten images per 4D scan. The in-plane resolution of the 512 × 512 axial slices varied from 0.97mm to 1.16mm per voxel with a slice thickness of 2.5mm. Each scan pair contains 300 manually annotated corresponding landmarks in the lung on which we evaluate the target registration error.

EMPIRE10
Challenge-The EMPIRE10 challenge (Murphy et al., 2011) consists of 30 scan pairs from six different categories: breathhold inspiration scan pairs, breathhold inspiration and expiration scan pairs, 4D data scan pairs, ovine data scan pairs, contrast-noncontrast scan pairs, and artificially warped scan pairs. Further information on each category can be found in the challenge paper (Murphy et al., 2011). Each scan pair contains 100 annotated corresponding landmarks.

Preprocessing
In this work, we focus on non-rigid, non-linear deformations, and for that reason we perform a linear prealignment of fixed and moving image as preprocessing. The same preprocessing is used for all methods. We subsequently warp and resample the moving image to the field of view and resolution of the fixed image, which yields a pre-registered moving image ℳ. Lung regions are automatically cropped for each CT and resized to volumes of dimension 192 × 160 × 192 as the network input. Although the deformation field is computed from this low-resolution input, during inference it is upsampled to the original image resolution using trilinear interpolation, and the overall evaluation is performed at the full resolution of the original images. We do not perform any further preprocessing such as intensity normalization, because the CT images are already in a standardized range (Hounsfield units). On the training data, we use the keypoint detection algorithm described in Rühaak et al. (2017) to automatically compute keypoints inside the lung. These keypoints can be considered noisy labels with residual errors of 1-2mm.

Implementation details
We implemented our method in PyTorch. Each network was trained for 25 epochs on an NVIDIA Titan Xp using the ADAM optimizer with a learning rate of 10⁻³. The training of all three networks takes about 20 hours. We empirically chose the loss weighting parameters α = 10, β = 1, γ = 0.01. For the coarsest level, the keypoint weighting parameter δ was set to zero such that the network can focus on the coarse alignment of larger structures. On the subsequent levels, we chose δ = 10⁷. For the edge parameter of the NGF distance measure, we chose ϵ = 1.

Accuracy
We evaluate our method using the propagated lobe segmentation and the fixed lobe segmentation. If a deformation field represents accurate correspondences, the lobe segmentation of the fixed image b_ℱ and the warped lobe segmentation of the moving image b_ℳ(y) should overlap well. In contrast to a lung segmentation overlap, the lobe segmentation overlap provides information about inner lung structures. A good alignment of the lobes was shown to be indicative of a good alignment of the fissures, which the evaluation of registration quality in Murphy et al. (2011) showed to be indicative of the overall performance of a registration approach.
We measure the overlap of the lobes with the Dice coefficient

DSC(X, Y) = 2 |X ∩ Y| / (|X| + |Y|),

where X is the propagated segmentation b_ℳ(y) and Y is the segmentation of the fixed image b_ℱ. Moreover, we evaluate the average symmetric surface distance

ASD(X, Y) = (Σ_{x ∈ ∂X} d(x, ∂Y) + Σ_{y ∈ ∂Y} d(y, ∂X)) / (|∂X| + |∂Y|),

where ∂X and ∂Y denote the segmentation boundaries and d(·, ∂Y) is the distance to the closest boundary point.

We compare our proposed method to the conventional approach of Rühaak et al. (2017), which is currently ranked first in the EMPIRE10 challenge (https://empire10.grand-challenge.org/Home/). This method performs a discrete keypoint detection and matching, which are integrated into a dense continuous optimization framework.
In addition to the keypoint penalty, the method uses an NGF distance measure, a curvature regularizer, a volume change penalty, and a mask alignment of the lung segmentations. Note that for this method the lung segmentation has to be available for each pair of images to be registered. This is in contrast to our method, which also uses a mask-based loss (Eq. 7) but requires the masks only during training, not during testing.
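The two overlap metrics used in this evaluation can be sketched with NumPy/SciPy (binary masks; the surface extraction via a one-voxel erosion and the voxel-spacing handling are simplifying assumptions):

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def dice(x, y):
    """Dice overlap DSC = 2 |X and Y| / (|X| + |Y|) of two binary masks."""
    x, y = x.astype(bool), y.astype(bool)
    return 2.0 * np.logical_and(x, y).sum() / (x.sum() + y.sum())

def average_surface_distance(x, y, spacing=(1.0, 1.0, 1.0)):
    """Average symmetric surface distance between two binary masks,
    using distance transforms to the respective boundary voxels."""
    def surface(m):
        return m & ~binary_erosion(m)   # boundary voxels of the mask
    sx, sy = surface(x.astype(bool)), surface(y.astype(bool))
    dist_to_y = distance_transform_edt(~sy, sampling=spacing)
    dist_to_x = distance_transform_edt(~sx, sampling=spacing)
    return (dist_to_y[sx].sum() + dist_to_x[sy].sum()) / (sx.sum() + sy.sum())
```

For per-lobe evaluation, these functions would be applied to the binarized mask of each lobe label in turn and the results averaged.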

Robustness
To analyze the robustness of our method, we evaluate the 30% of cases with the lowest Dice scores. This gives a good overview of the hardest cases and of how well our method registers them.

Plausibility of the deformation field
Besides accurately and robustly transferring anatomical annotations, medical image registration should also provide plausible deformations and therefore should not generate deformations with foldings. Hence, we evaluate the Jacobian determinant, as it is a local measure of volume change and in particular of (local) changes of topology. If det ∇y > 1, a volume expansion occurred; if det ∇y < 1, the volume decreased; and if det ∇y ≤ 0, there is a folding.
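This folding evaluation can be sketched as follows (displacement field in voxel units, derivatives via central differences; y(x) = x + u(x)):

```python
import numpy as np

def jacobian_determinant(u):
    """Voxel-wise Jacobian determinant of y(x) = x + u(x) for a
    displacement field u of shape (3, D, H, W). J[i, j] = dy_i/dx_j is
    built from central differences of u plus the identity."""
    J = np.stack([np.stack(np.gradient(u[c]), axis=0) for c in range(3)])
    J = J + np.eye(3)[:, :, None, None, None]   # dy/dx = I + du/dx
    return (J[0, 0] * (J[1, 1] * J[2, 2] - J[1, 2] * J[2, 1])
            - J[0, 1] * (J[1, 0] * J[2, 2] - J[1, 2] * J[2, 0])
            + J[0, 2] * (J[1, 0] * J[2, 1] - J[1, 1] * J[2, 0]))

def folding_fraction(u):
    """Fraction of voxels with det(grad y) <= 0, i.e. foldings."""
    return float(np.mean(jacobian_determinant(u) <= 0.0))
```

For example, a uniform 10% expansion (u = 0.1 x) yields a determinant of 1.1³ ≈ 1.331 everywhere and no foldings.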

Applicability
In a clinical setting, the registration of two scans has to be available quickly in order not to slow down the routine workflow. In other situations, such as screening, the large number of required registrations demands efficient deformable image registration methods. In both cases, the runtime of the algorithm is a crucial factor. For the conventional registration method, we measured the time of the registration without the time needed to load and warp the images. For the network, we measured the time of one forward pass through the cascade of networks. Both measurements were run on the same system with an Intel(R) Core(TM) i7-770K CPU and an NVIDIA Titan Xp GPU.

Transferability and comparison to state-of-the-art
To show the transferability of our method to other datasets and to compare it with other registration methods, we apply our trained network as-is to the ten image pairs of the DIRLab 4DCT dataset. To evaluate the registration accuracy, the target registration error of the landmarks was computed. Moreover, we evaluate the impact of the dataset used to train the registration network. To this end, we train the widely used VoxelMorph (Balakrishnan et al., 2019) framework on the COPDGene data. We slightly adapt the public implementation by choosing a higher regularization weight λ = 2 to obtain smooth deformation fields. Furthermore, we applied our trained model to the 30 scan pairs of the EMPIRE10 challenge and submitted the displacement fields to the organizers, who performed the evaluation, which includes lung boundary alignment, fissure alignment, landmark distances, and singularities (foldings).

Ablation study
In an ablation study, we examine the impact of the individual components of the proposed method. We investigate the influence of the number of levels in the multilevel framework on the accuracy and plausibility of the registration results. For all experiments, the overall number of epochs was 75. Furthermore, we explore the effect of the additional penalty terms in our loss function by setting their weighting parameters to zero and comparing the results with the proposed loss function.

Results
The results of our experiments on the COPDGene dataset are summarized in Table 1. We performed one-sided Wilcoxon signed-rank tests, which show that the improvement over the method of Rühaak et al. (2017) is statistically significant for the Dice score, average surface distance (ASD), Hausdorff distance (HD), and the runtime on the CPU. In the following subsections, we describe the results of each experiment in more detail.

Accuracy
Our proposed method achieves, on average, significantly better Dice scores than the conventional registration method (Fig. 5).

Robustness
On the 30% of cases with the lowest Dice Scores, our method achieves an average Dice Score of 0.93 ± 0.01 within a range of [0.85, 0.94]. Compared to the method of Rühaak et al. (2017) with an average Dice Score of 0.86 ± 0.03 within a range of [0.78, 0.90], our method propagates the lobes more robustly.

Plausibility
For our proposed method, on average fewer than 0.1% of the voxel positions of the deformation field showed a negative Jacobian determinant and therefore a folding. The full elimination of foldings, as in the conventional registration method of Rühaak et al. (2017), is not guaranteed. However, the percentage of foldings is within an acceptable range. Fig. 6 shows four exemplary Jacobian determinant colormaps overlaid on the fixed image. The volume changes are smooth and mostly around 1 (yellow overlay). Due to large motion in the upper right case, some foldings (dark red overlay) occur at the left inferior border.

Applicability
The proposed method requires on average 0.75 seconds to register an image pair when executed on a GPU and 32 seconds on the CPU. In contrast, the conventional method takes on average 160 seconds on a CPU. Moreover, only 4 GB of GPU memory are required for execution, so our method could also be used on standard computers with less powerful GPUs. The prediction is instantaneous and requires no further manual tuning of parameters. This makes our proposed method highly applicable to clinical tasks.

Qualitative results
To illustrate the registration results, we show the difference images ℱ − ℳ y of four exemplary cases in Fig. 8. In all cases, the respiratory motion was successfully recovered and most inner structures are well aligned. The first row shows an example of a more accurately registered image pair in terms of the average Dice score (after affine: 0.85, after registration: 0.96) and keypoint distance (after affine: 8.51 mm, after registration: 0.99 mm). The last row shows the worst case regarding the Dice score. In this case, the average Dice score improved from 0.69 to 0.85 and the keypoint distance was reduced from 13.57 mm to 1.9 mm, showing that our registration method also works robustly for cases with large deformations. Even though the distance measure is masked to the region inside the lung, the surrounding tissue is mostly well aligned. During training, the model learned to align edges, and because no lung mask is given during inference, it also aligns edges outside the lung.

Ablation study
We provide an ablation study to further verify the effectiveness of the proposed components of our method. Results of this ablation experiment on the COPDGene data are presented in Table 2. The multilevel experiment shows that increasing the number of levels from L = 1 to L = 2 and L = 3 results in an increasing Dice score from 0.927 to 0.939 to 0.946, a decreasing TRE from 3.95 mm to 2.22 mm to 2.00 mm, and a decreasing percentage of foldings from 0.1% to 0.09% to 0.06%. The mask-alignment loss not only improves the alignment of the pulmonary lobes, resulting in a higher Dice score (0.93 vs. 0.946), but also improves the TRE from 2.16 mm to 2.00 mm. By integrating our volume change loss, the percentage of foldings is reduced from 0.3% to 0.06%; furthermore, it also improves the TRE from 2.16 mm to 2.00 mm. To further enhance the alignment of smaller structures such as vessels and smaller airways, we incorporate keypoint correspondences into the loss function. This decreases the TRE from 4.59 mm to 2.00 mm; however, the percentage of foldings slightly increases from 0% to 0.06%. Fig. 7 compares the target registration errors on the DIRLab 4DCT dataset for all compared settings, after affine registration, and initially.

Transferability and comparison to state-of-the-art
In Table 4, quantitative results on the DIRLAB 4DCT dataset for deep-learning-based and conventional registration methods are reported. On average, the target registration error (TRE) of our method was 1.14 ± 0.76 mm, which is better than the currently best deep-learning-based method of Fu et al. (2020). In cases 6, 8, and 10, which have a large initial landmark error, our method achieves much better registration results. Training Voxelmorph on the large COPDGene dataset results in a lower TRE than training it by leave-one-out on the DIRLAB dataset (1.71 mm vs. 3.65 mm). The best conventional registration method of Rühaak et al. (2017) still has a lower TRE; however, it needs about 5 minutes to compute the deformation field, whereas our method needs less than a second. A detailed evaluation of all ten cases for different deep-learning-based registration methods is given in Table 5. On the EMPIRE10 challenge data, our method achieves a target registration error of 1.01 mm over all cases and a TRE of 0.91 mm if the ovine data is excluded. A summary of the results is shown in Table 3; a more detailed evaluation is available on the challenge website.

Discussion
This paper presents a coarse-to-fine multilevel framework for deep-learning-based image registration that can handle and compensate for large deformations by computing deformation fields at different scales. Our method shares many elements with the conventional registration method of Rühaak et al. (2017). We have identified key strategies of this method and successfully developed a deep-learning counterpart. The advantage of our deep learning approach is that the expensive annotation and detection of the lobe masks and keypoints is only necessary for the training data. This important knowledge is then embedded in our model, and therefore inference is cheap and fast.
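The coarse-to-fine mechanics can be sketched as follows. This is a simplified NumPy sketch, not the paper's implementation: it shows only the functional composition of displacement fields across levels and the upsampling of a coarse field to the next-finer grid (fields of shape (3, D, H, W) in voxel units; function names are illustrative):

```python
import numpy as np
from scipy.ndimage import map_coordinates, zoom

def compose_displacements(u_prev, u_new):
    """Functional composition of two displacement fields on the same grid:
    u(x) = u_new(x) + u_prev(x + u_new(x)), i.e. first apply u_new, then u_prev.
    """
    D, H, W = u_new.shape[1:]
    grid = np.stack(np.meshgrid(np.arange(D), np.arange(H), np.arange(W),
                                indexing="ij")).astype(float)
    pts = (grid + u_new).reshape(3, -1)          # where u_new maps each voxel
    u_prev_at = np.stack([
        map_coordinates(u_prev[i], pts, order=1, mode="nearest").reshape(D, H, W)
        for i in range(3)])                      # u_prev sampled at warped positions
    return u_new + u_prev_at

def upsample_field(u, factor=2):
    """Bring a coarse-level field to the next-finer grid; displacement
    magnitudes (in voxel units) scale with the resolution."""
    return factor * np.stack([zoom(u[i], factor, order=1) for i in range(3)])
```

In a multilevel loop, the field predicted at a coarser level would be upsampled and used as the initial guess that the next level's prediction is composed with, as described above.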
We employ a Gaussian-pyramid-based multilevel framework that can solve the image registration optimization in a coarse-to-fine fashion. To prevent foldings of the deformation field and restrict the determinant of the Jacobian to physiologically meaningful values, we combine the curvature regularizer with a volume change penalty in the loss function. Furthermore, we also integrate weak keypoint correspondences into the loss function to focus more on the alignment of smaller structures. The keypoints are computed automatically and can be considered noisy labels with residual errors of 1–2 mm. However, we showed that the use of these noisy labels is nevertheless advantageous and leads to a better alignment of vessels and smaller airways, and therefore also to a better target registration error on the DIRLab dataset.
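The regularization terms can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the paper's loss: the curvature term is the standard discrete squared-Laplacian energy, while the squared-log penalty on the Jacobian determinant is one common choice of volume change penalty (the paper's exact form is defined in its methods section):

```python
import numpy as np
from scipy.ndimage import laplace

def curvature_regularizer(u):
    """Mean squared Laplacian over the three displacement components of
    u (shape (3, D, H, W)) -- the discrete curvature (second-order) energy."""
    return float(np.mean([np.mean(laplace(u[i]) ** 2) for i in range(3)]))

def volume_change_penalty(detJ, eps=1e-6):
    """Penalize deviations of det(grad y) from 1. The squared-log form grows
    without bound as detJ -> 0, so near-foldings are punished far more
    severely than by a smoothness term alone (illustrative choice)."""
    return float(np.mean(np.log(np.clip(detJ, eps, None)) ** 2))
```

In a training loop these terms would be weighted and added to the image distance and keypoint losses; here they only illustrate why combining both restricts the Jacobian determinant to plausible values.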
We validated our framework on the challenging task of large-motion inspiration-to-expiration registration using image data from the multi-center COPDGene study. To assess the accuracy of our network, we performed an extensive evaluation on 200 pulmonary CT scan pairs from the large-scale COPDGene study and demonstrated that our method can perform accurate registration between two affinely pre-aligned images. Especially for the task of lobe propagation, we could show that our method performs better than conventional approaches. It achieves higher Dice scores and lower surface and Hausdorff distances (0.95, 1.72 mm, and 26.8 mm) compared to conventional registration (0.92, 1.97 mm, and 27.2 mm, respectively). This better performance can be explained by the use of the mask-alignment loss. As demonstrated in previous studies (e.g. Balakrishnan et al. (2019); Hering et al. (2019c)), combining the complementary strengths of global semantic information (weakly-supervised learning with segmentation labels) and local distance metrics improves the registration performance during inference. In contrast to conventional registration methods, such additional information only needs to be available in the training dataset.
Furthermore, we have evaluated the proposed method on the DIRLab and EMPIRE10 datasets and showed that we achieve excellent TREs of 1.14 mm and 1.01 mm, respectively. Note that our network was not trained on those datasets. This is strong evidence that our network generalizes well. Although previous works contribute much to improving the registration accuracy, there is still a misalignment of smaller structures, which leads to a high TRE. To focus more on the alignment of vessels, Fu et al. (2020) introduced a preprocessing step to enhance the vessels in the input images by segmenting vascular structures and increasing the intensity values inside the vessel mask. In their paper, they demonstrated the efficiency of this preprocessing step. Since this step is performed on the input images, it is also required during application. To avoid this problem and thus not increase the execution time, we integrate additional information on smaller structures using the keypoint loss. The advantage of this procedure is that the keypoints, as well as the masks of the boundary loss, are only needed during training. Nevertheless, the best conventional registration methods still achieve a lower TRE than our method. One reason for this might be that conventional registration methods mostly work on the original image data. In contrast, for the deep-learning approaches, the input images have to be downsampled due to memory restrictions on the GPU. Especially for smaller structures and small errors (we are speaking about a TRE difference of 0.2–0.4 mm), it is easily imaginable that this resolution is not high enough. Moreover, the training data used also influences the performance. Our network was trained on inspiration-expiration scan pairs from humans. In the EMPIRE10 dataset, a variety of lung registration tasks, including ovine lung registration, has to be performed.
Although our method does not register the ovine data perfectly, we achieve a TRE of 1.69 mm on the ovine data, which shows that our method is capable of generalizing well. We would assume that with a wider variety of training data, the performance of deep-learning-based registration methods can further improve. We showed this effect when training the Voxelmorph network. By using the larger COPDGene dataset to train the Voxelmorph network, the target registration error on the DIRLab dataset improved from 3.65 mm to 1.71 mm compared to a leave-one-out training on the DIRLab dataset. This illustrates the large impact of the training dataset. Since Voxelmorph and our framework are very similar, this experiment also shows that the addition of more loss functions and a multilevel approach is beneficial.
Besides accurately transferring anatomical annotations, medical image registration should also provide plausible transformations and therefore should not generate deformations with foldings. In conventional registration methods, this is achieved by using a regularizer in the cost function. Recently, deep-learning-based methods such as Eppenhof et al. (2019) and Jiang et al. (2020) have also integrated a regularizer into their loss functions to enforce a smooth deformation field, resulting in an acceptable amount of foldings (0.42% and 0.1%, respectively). In our work, we additionally use a volume change control that penalizes occurring foldings more severely than the regularizer does, resulting in on average fewer than 0.1% and 0.0005% of voxel positions in the deformation field with foldings on the COPDGene and DIRLab datasets, respectively. Without the volume change control penalty, the deformation fields produced by our method show on average 0.30% of voxel positions with foldings, which is comparable to the values of other deep learning registration methods. This shows that the addition of the volume change control mitigated the occurrence of foldings. The higher number of foldings on the COPDGene dataset can be explained by the noise difference between the expiration and inspiration scans due to different doses during acquisition (see Fig. 8 for some example images). The full elimination of foldings as in some conventional registration methods is not guaranteed. Another way to reduce the number of foldings was presented in the works of Dalca et al. (2018), Krebs et al. (2019), and Qiu et al. (2021), who use the scaling and squaring algorithm (Arsigny et al., 2006) to integrate the predicted stationary velocity field. With a sufficient number of integration steps, these methods should theoretically guarantee a diffeomorphic transformation.
However, in the presented works they reported "nearly no negative Jacobian voxels" and 0.023% to 0.151% of voxels with a negative Jacobian determinant (Qiu et al., 2021). As discussed in Qiu et al. (2021), this has two major causes. First, the velocity field may not be sufficiently smooth. This can be addressed by increasing the regularization weight; however, doing so often causes a drop in registration accuracy. Second, the chosen number of integration steps may be too small. Increasing it can reduce the number of foldings that occur, but increases the computational cost as well. In summary, the scaling-and-squaring approach and the presented volume-change-control penalty achieve similar results in preventing foldings. In addition, our approach regulates volume changes.
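For reference, the scaling-and-squaring integration discussed above can be sketched as follows. This is a minimal NumPy sketch of the algorithm of Arsigny et al. (2006), not the implementation of the cited works; the velocity field is assumed to have shape (3, D, H, W) in voxel units:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def scaling_and_squaring(v, steps=7):
    """Integrate a stationary velocity field v to a displacement field.

    Start with u = v / 2**steps, then square `steps` times via
    u(x) <- u(x) + u(x + u(x)), approximating the exponential map exp(v).
    More steps reduce foldings but increase the computational cost.
    """
    D, H, W = v.shape[1:]
    grid = np.stack(np.meshgrid(np.arange(D), np.arange(H), np.arange(W),
                                indexing="ij")).astype(float)
    u = v / (2.0 ** steps)
    for _ in range(steps):
        pts = (grid + u).reshape(3, -1)
        u_at = np.stack([
            map_coordinates(u[i], pts, order=1, mode="nearest").reshape(D, H, W)
            for i in range(3)])          # u sampled at its own warped positions
        u = u + u_at                     # composition of the half-step with itself
    return u
```

The trade-off mentioned in the text is visible here: each squaring step requires a full interpolation of the field, so the cost grows linearly with the number of integration steps.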
In our experiments, we focused on the complex task of CT lung registration, as the registration results can be evaluated more accurately than only with an overlap of a larger structure. However, our method could also be trained for a different task or on a different modality. Except for keypoint detection, no component is lung-specific and the keypoint loss can be used with landmarks in different organs.
In future studies, we will investigate the impact of instance optimization to fine-tune the deformation field for those image pairs for which the registration result is not yet satisfactory.

Conclusion
This paper presents a deep-learning-based registration approach for deformable image registration, targeting in particular the challenging task of lung registration. We introduce a keypoint matching term and a volume change penalty to improve the alignment of smaller structures, to prevent foldings, and to restrict the deformation field to physiologically meaningful values. Our multilevel registration framework equipped with these components achieves state-of-the-art registration accuracy on the COPDGene and DIRLab datasets with a very short execution time.

Figure: Schematic representation of the training process. In the loss function, we compare the fixed image, pulmonary lobe mask, and keypoints to the deformed moving image, pulmonary lobe mask, and keypoints, respectively. To enforce smoothness and to prevent foldings, a regularizer and a volume change penalty are integrated into the loss function. During inference, only the fixed and moving images are required to estimate the deformation field. For better visualization, we have placed the windowed CT image in the background of the used keypoints. Best viewed in color.

Figure: Overall scheme of the proposed multilevel framework, where ℱ denotes the fixed image, ℳ the moving image, y the deformation field, and ℳ y the warped image. Each CNN is trained separately for a fixed number of epochs, and its weights stay fixed afterwards. The deformation fields from all preceding coarser levels are used as an initial guess by combining them via functional composition and warping the moving image at the highest resolution. Subsequently, the warped moving image is downsampled to the current image level. The dotted lines illustrate the initialization of the network weights with the learned parameters of the previous level.

Figure: Cumulative distribution of the target registration error (TRE) in millimeters for all variations of our loss function, after affine registration, and initially, on all landmark pairs of the DIRLAB 4DCT dataset. In addition, the dotted lines visualize the 75th percentiles of the TRE: 1.40 mm (our VIRNet), 1.51 mm (noVCC), 1.64 mm (noKP), 1.54 mm (noMask), 2.22 mm (singleLevel), 1.54 mm (twoLevel), 4.46 mm (affine), 12.55 mm (initial).

Figure: Example coronal slices extracted from four exemplary cases: input images ℱ and ℳ, the warped moving image ℳ y , the difference image ℱ − ℳ (fourth column), and the difference image ℱ − ℳ y after registration with the proposed method (fifth column). In all cases, the respiratory motion was successfully recovered and most inner structures are well aligned. Due to the altered density of lung tissue during breathing, intensity changes occur, and therefore higher values in the difference images appear even without registration errors.

Table: Registration results of Rühaak et al. (2017) and our method on the COPDGene dataset. We performed a one-sided Wilcoxon signed-rank test to test whether the improvements over the method of Rühaak et al. (2017) are statistically significant.

Table: Results of the ablation study. To demonstrate the impact of each loss function term, each penalty weight was set to zero once while the remaining parameters were fixed to their empirically determined optimal values. The registration performance is evaluated using the Dice score, the target registration error of the keypoints (TRE KP), and the percentage of foldings on the COPDGene dataset. Moreover, the target registration error on the DIRLab dataset is compared. We performed a one-sided Wilcoxon signed-rank test to test whether the improvements over all other settings are statistically significant, using a Bonferroni correction due to multiple testing.

Table: Results of the EMPIRE10 challenge for the method of Rühaak et al. (2017) and our proposed method. The average score over all 30 cases for the lung boundaries, fissure alignment, landmark error, and singularities is shown. Detailed results can be found on the challenge website.

Table: Comparison on the DIRLAB 4DCT dataset: Hansen and Heinrich (2021) 1.39 ± 1.29 mm, 0.02% foldings, 2 s; ours 1.14 ± 0.76 mm, < 0.0005% foldings, 0.75 s. All results were extracted from the original papers, except Voxelmorph*, which was reported in Hansen and Heinrich (2020), and Voxelmorph**, which we trained on the COPDGene data. Since the runtimes were not measured on the same hardware, they cannot be compared directly; however, they give an impression of the speed.