3D deformable registration of longitudinal abdominopelvic CT images using unsupervised deep learning

This study investigates the use of the unsupervised deep learning framework VoxelMorph for deformable registration of longitudinal abdominopelvic CT images acquired in patients with bone metastases from breast cancer. The CT images were refined prior to registration by automatically removing the CT table and all other extra-corporeal components. To improve the learning capabilities of VoxelMorph when only a limited amount of training data is available, a novel incremental training strategy is proposed based on simulated deformations of consecutive CT images. In a 4-fold cross-validation scheme, the incremental training strategy achieved significantly better registration performance compared to training on a single volume. Although our deformable image registration method did not outperform iterative registration using NiftyReg (considered as a benchmark) in terms of registration quality, the registrations were approximately 300 times faster. This study showed the feasibility of deep learning based deformable registration of longitudinal abdominopelvic CT images via a novel incremental training strategy based on simulated deformations.


Introduction
Deformable medical image registration problems can be solved by optimizing an objective function defined on the space of transformation parameters [1].
Traditional optimization-based methods typically achieve accurate registration results but suffer from being computationally expensive, especially in the case of deformable transformations of high-resolution, three-dimensional (3D) images.
Deep learning based registration methods, however, can perform registration in a single-shot, which is considerably faster than using iterative methods [2]. Due to the recent successes of deep learning for a wide variety of medical image analysis tasks [3], and the advances in Graphics Processing Unit (GPU) computing that have enabled the training of increasingly large three-dimensional (3D) networks [4], the number of studies using deep learning for medical image registration has increased considerably since 2016 [5].
Although deep learning could have a major impact on the field of medical image registration, there is still a gap between proof-of-concept technical feasibility studies and the application of these methods to "real-world" medical imaging scenarios. It remains unclear to which extent deep learning is suited for challenging co-registration tasks with large inter- and intra-patient variations and potential outliers or foreign objects in the Volume of Interest (VOI).
Moreover, deep learning based methods typically require large amounts (i.e., thousands) of well-prepared, annotated 3D training images, which are rarely available in clinical settings [6].
The present study focuses on the registration of abdominopelvic CT images since these are widely acknowledged to be difficult to register [7]. In abdominopelvic imaging, the conservation-of-mass assumption is typically not valid and, although local-affine diffeomorphic demons have been used in abdominal CT images [8], the transformation is typically not a diffeomorphism.
For instance, bladder-filling or bowel peristalsis in the abdomen may vary between images. More specifically, we consider a longitudinal abdominopelvic CT dataset that comprises several images of each patient acquired at distinct time points. From a clinical perspective, deformable image registration of longitudinal datasets is a necessary step toward automated, quantitative, and objective treatment-response assessment [9][10][11].
The aim of this study is to explore and quantify the applicability of one of the most used unsupervised single-shot deep learning frameworks (VoxelMorph [12]) for deformable registration of longitudinal abdominopelvic CT images. We assessed the maximum displacements that can be learned by the VoxelMorph framework and the impact of extra-corporeal structures, such as the CT table, clothing and prostheses on the registration performance. In addition, the VoxelMorph framework was compared against iterative registration using the NiftyReg [13] toolbox that was selected because of its excellent performance on abdominal CT images in a comparative study [14].
The novelties of this work are:
• demonstrating the impact of removing extracorporeal structures before deformable image registration;
• using simulated deformations to characterize the limitations of the VoxelMorph framework for the deformable registration of abdominopelvic CT images;
• introducing a novel incremental training strategy tailored to longitudinal datasets that enables deep learning based image registration when limited amounts of training data are available.
This paper is structured as follows. Section 2 outlines the background of medical image registration, with a particular focus on deep learning based methods. Section 3 presents the characteristics of our longitudinal abdominopelvic CT dataset, as well as the deformable registration framework, the proposed incremental training strategy, and the evaluation metrics used in this study. Section 4 describes the experimental results. Finally, Section 5 provides a discussion and concluding remarks.

Related work
This section introduces the basic concepts of medical image registration and provides an overview of the state-of-the-art in deformable registration using deep learning.

Medical image registration
Medical image registration methods aim to estimate the best solution in the parameter space Ω ⊂ R N which corresponds to the set of potential transformations used to align the images, where N is the number of dimensions.
Typically, N ∈ {2, 3} in biomedical imaging. Each point in Ω corresponds to a different estimate of the transformation that maps a moving image to a fixed image (target). This transformation can be either parametric, i.e., parameterized by a small number of variables (e.g., six in the case of a 3D rigid-body transformation or twelve for a 3D affine transformation), or non-parametric, i.e., when the displacement of every image element is sought. For most organs in the human body, particularly in the abdomen, many degrees of freedom are necessary to deal with non-linear or local soft-tissue deformations.
In a deformable transformation, the number of parameters encoded in a Displacement Vector Field (DVF) φ is typically large, e.g., several thousands. Therefore, two-step intensity-based registration approaches are commonly employed, in which the first step is a global affine registration and the second step is a local deformable registration using, for example, B-splines [15].
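To make the distinction concrete, the following sketch contrasts the parameter counts of a parametric 3D affine transformation with those of a dense DVF (a minimal numpy illustration; the volume size matches the 160 × 160 × 256 grid used later in this study):

```python
import numpy as np

# Parametric: a 3D affine transformation has 12 parameters
# (a 3x3 matrix plus a 3-vector translation).
A = np.eye(3) * 1.05                      # slight isotropic scaling
t = np.array([2.0, -1.0, 0.5])            # translation
p = np.array([10.0, 20.0, 30.0])          # a point in voxel coordinates
p_affine = A @ p + t                      # -> array([12.5, 20.0, 32.0])
n_params_affine = A.size + t.size         # 12

# Non-parametric: a dense DVF stores one displacement vector per voxel.
shape = (160, 160, 256)                   # volume size used in this study
n_params_dvf = int(np.prod(shape)) * 3    # ~19.7 million parameters
```

The roughly six orders of magnitude between the two parameter counts is what motivates the two-step (affine, then deformable) strategies mentioned above.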
Traditional medical image registration methods often use iterative optimization techniques based on gradient descent to find the optimal transformation [1,15,16]. Deformable registration can be performed using demons [17], typically based on diffeomorphic transformations parameterized by stationary velocity fields [18]. In addition, global optimization techniques that leverage evolutionary algorithms [15] and swarm intelligence meta-heuristics can be useful to avoid local minima [19]. Several off-the-shelf, open-source toolboxes are available for both parametric and non-parametric image registration in biomedical research, such as: elastix [20], NiftyReg [13], Advanced Normalization Tools (ANTs) [21], and Flexible Algorithms for Image Registration (FAIR) [22].

Deep learning based registration
Since 2013, the scientific community has shown an increasing interest in medical image registration based on deep learning [5]. Early unsupervised deep learning based registration approaches leveraged stacked convolutional neural networks (CNNs) or autoencoders to learn the hierarchical representations for patches [23,24].
Fully-supervised methods, such as [25], have focused on learning a similarity metric for multi-modal CT-MRI brain registration based on patch-based correspondences. Another supervised method, based on the Large Deformation Diffeomorphic Metric Mapping (LDDMM) model and called Quicksilver, was proposed in [26] and tested on brain MRI scans. In this context, Eppenhof et al. [27] introduced the simulation of ground-truth deformable transformations during training to overcome the need for manual annotations in the case of a pulmonary CT dataset. Very recently, in [28], a graph CNN was used to estimate global key-point locations and regress the relative displacement vectors for sparse correspondences.
Alternatively, several studies have focused on weakly-supervised learning.
For example, Hu et al. [29] proposed a weakly-supervised framework for 3D multimodal registration. This end-to-end CNN approach aimed to predict displacement fields that align multiple labeled corresponding structures for individual image pairs during training, while only unlabeled image pairs were used as network input for inference. Recently, generative deep models have also been applied to unsupervised deformable registration. Generative Adversarial Networks (GANs) can be exploited as an adversarial learning approach to constrain CNN training for deformable image registration, such as in [30] and [31]. In [32], spatial correspondence problems in MRI-CT deformable registration, caused by different acquisition conditions (e.g., inhale-exhale states), led the adversarial learning to synthesize spurious changes; this was addressed by reducing the size of the discriminator's receptive fields. In addition, Krebs et al. [33] proposed a probabilistic model for diffeomorphic registration that leverages Conditional Variational Autoencoders.
The current trend in deep learning based medical image registration is moving towards unsupervised learning [5]. The CNN architecture proposed in [2], called RegNet, directly estimates the displacement vector field from a pair of input images, in contrast to earlier work; it integrates image content at multiple scales by means of a dual path, allowing contextual information to be taken into account.
Traditional registration methods optimize an objective function independently for each pair of images, which is time-consuming for large-scale datasets. To overcome this, the differentiable Spatial Transformer Layer (STL) was introduced, enabling CNNs to perform global parametric image alignment without requiring supervised labels [34].
Recently, De Vos et al. [35] proposed a Deep Learning Image Registration (DLIR) framework for unsupervised affine and deformable image registration.
This framework consists of a multi-stage CNN architecture for coarse-to-fine registration over multiple levels and image resolutions, and achieved performance comparable to conventional image registration while being several orders of magnitude faster. A progressive training method for end-to-end image registration based on a U-Net [36] was devised in [37], which gradually progressed from coarse-grained to fine-grained resolution data. The network was progressively expanded during training by adding higher-resolution layers, allowing the network to learn fine-grained deformations from higher-resolution data.
The starting point of the present work was the VoxelMorph framework that was recently introduced for deformable registration of brain Magnetic Resonance Imaging (MRI) images and is considered state-of-the-art [12]. The VoxelMorph framework is fully unsupervised and allows for a clinically feasible real-time solution by registering full 3D volumes in a single-shot. From a research perspective, the framework is flexible to modifications and extensions of the network architecture. VoxelMorph formulates the registration as a parameterized function g θ (·, ·) learned from a collection of volumes in order to estimate the DVF φ.
This parameterization θ is based on a CNN architecture similar to U-Net [36], which allows for the combination of low- and high-resolution features, and is estimated by minimizing a loss function over a training set. The initial VoxelMorph model was evaluated on a dataset of 7829 T1-weighted brain MRI images acquired from eight different public datasets. As extensions of this model, Kim et al. [38] integrated cycle-consistency [39] into VoxelMorph, showing that even image pairs with severe deformations can be registered while improving topology preservation. In addition, the combination of VoxelMorph with FlowNet [40] for motion correction of respiratory-gated Positron Emission Tomography (PET) scans was proposed in [41].

Dataset description
The dataset used in this study comprised consecutive CT images of patients with bone metastases originating from primary breast cancer. Breast cancer frequently presents with a mixture of lytic and sclerotic bone metastases, where lytic metastases appear as areas of low Hounsfield Unit (HU) attenuation in the bones, and sclerotic metastases are more densely calcified than normal bone and have a higher HU attenuation. Treatment response often causes increasing sclerosis, especially in lytic metastases. However, increasing sclerosis can also be a sign of disease progression, especially in patients with mixed or purely sclerotic metastases at diagnosis, thus causing a diagnostic dilemma [42]. Quantitative assessment of bone metastases and the associated changes in attenuation and bone texture over time thus holds the potential to improve treatment response assessment [9][10][11]. To enable such assessments, accurate and preferably real-time deformable registration of the consecutive CT images is an important prerequisite.
After informed consent, patients with metastatic breast cancer were recruited into a study designed to characterize the disease at the molecular level, using tissue samples and serial samples of circulating tumor DNA (ctDNA) [43].
CT imaging of the chest, abdomen, and pelvis was acquired according to clinical request every 3-12 months to assess response to standard-of-care treatment. A subset of 12 patients with bone metastases only was selected, resulting in 88 axial CT images of the abdomen and pelvis. The CT images were acquired using one of three clinical CT scanner models (the SOMATOM Emotion 16, the SOMATOM Definition AS(+), and the SOMATOM Sensation 16) manufactured by Siemens Healthineers (Erlangen, Germany).
On axial images reconstructed with a slice thickness of 2 mm and a pixel spacing ranging from 0.57-0.97 mm using bone window settings, all vertebral bodies of the thoracic and lumbar spine that were depicted completely were segmented semi-automatically by a board-certified radiologist with ten years of experience in clinical imaging, using Microsoft Radiomics (project InnerEye, Microsoft, Redmond, WA, USA). Thus, a series of closely neighboring VOIs was created that spanned the majority of the superior-inferior extent of each scanning volume and was used subsequently to assess the performance of the registration approach. The total number of VOIs delineated for the analyzed dataset was 805 (mean VOIs per scan: 9.15).

Dataset preparation and training set construction
Abdominopelvic CT image pre-processing

CT table removal. In a manner similar to the commonly used data preparation procedure for brain MR images called "skull-stripping" [44], we refined our abdominopelvic CT images to facilitate deformable registration. The CT table could bias the learning process and lead the registration to overfit on the patient table region. Therefore, we developed a fully automatic approach based on region-growing [45] to remove the CT table from the CT images, as well as all extra-corporeal components, such as breast prostheses, clothes, and metal objects. Our slice-by-slice approach automatically initialized the growing region, R_G, with a 50 × 50-pixel square seed region at the center of each slice, assuming that the body was positioned at the center of the CT scanner.
Considering an image I, Eq. (1) defines the homogeneity criterion, P, in terms of the mean value of the region, µ_(R_G) [45]:

P(p_B): |I(p_B) − µ_(R_G)| ≤ T_G, (1)

where p_B ∈ B denotes a pixel belonging to the candidate list B of the boundary pixels of the growing region R_G, while T_G is the inclusion threshold. In particular, during the iterations, the 8-neighbors of the current pixel p_B that do not yet belong to R_G are included into the candidate list B. The similarity criterion, P, was thus based on the absolute difference between the value of the candidate pixels I(p) and the mean intensity of the pixels already included in R_G.

After table removal, the volumes were further prepared by min-max intensity normalization, i.e., x̂_i = (x_i − x_min)/(x_max − x_min) for i ∈ {x_min, x_min + 1, ..., x_max}, followed by downsampling with a factor of 2 to 160 × 160 × 256 (1 mm³ voxels) and cropping to the convex hull (box) enclosing all volumes.
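The region-growing step above can be sketched as follows on a toy 2D slice (a simplified illustration of the homogeneity criterion of Eq. (1), not the actual implementation; the intensities, array sizes, and threshold T_G = 300 are assumptions for the example):

```python
from collections import deque
import numpy as np

def region_grow(image, seed, t_g):
    """Grow a region from `seed`, including 8-neighbors whose absolute
    difference from the current region mean is at most `t_g` (Eq. (1))."""
    grown = np.zeros(image.shape, dtype=bool)
    grown[seed] = True
    total, count = float(image[seed]), 1       # running mean of the region
    frontier = deque([seed])
    while frontier:
        y, x = frontier.popleft()
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if (0 <= ny < image.shape[0] and 0 <= nx < image.shape[1]
                        and not grown[ny, nx]
                        and abs(image[ny, nx] - total / count) <= t_g):
                    grown[ny, nx] = True
                    total += float(image[ny, nx])
                    count += 1
                    frontier.append((ny, nx))
    return grown

# Toy slice: body (~0 HU) at the center, air (-1000 HU) around it,
# and a "table" strip (500 HU) that is disconnected from the body.
slice_ = np.full((32, 32), -1000.0)
slice_[8:24, 8:24] = 0.0       # body
slice_[28:30, :] = 500.0       # CT table
body_mask = region_grow(slice_, seed=(16, 16), t_g=300.0)
```

Everything outside the grown body mask (air, table, extra-corporeal objects) can then be masked out before registration.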

Generation of simulated DVFs
It was not possible to directly train a network to register the longitudinal abdominopelvic CT images in our dataset due to the limited amount of available transformation pairs (see Section 3.1), large inter-patient variations, and the often non-diffeomorphic nature of the transformations, e.g., due to changes in the appearance of normal structures in consecutive CT images caused by bowel peristalsis or bladder filling. Therefore, we developed a simulator that generates random synthetic DVFs and correspondingly deforms abdominopelvic CT images, in a manner similar to that of Sokooti et al. [2] and Eppenhof et al. [27]. The resulting deformed CT images can subsequently be used to train or evaluate deep learning based image registration methods.
The synthetic DVF generator randomly selects P initialization points, d_i (with i = 1, 2, ..., P), from within the patient volume of a CT image with a minimum distance, d_P, between these points. In the present study, all DVFs were generated using P = 100 and d_P = 40. Each point, d_i, is composed of three random values between −δ and δ that correspond to the x, y, and z components of the displacement vector in that point. To ensure that the simulated displacement fields were as realistic as possible, we set δ = 6 to mimic the typical displacements found between the pre-registered images in our abdominopelvic CT dataset. In addition, we generated a series of DVFs with larger maximum displacements to assess the largest deformations that the VoxelMorph framework could learn (see Section 4).

The VoxelMorph framework

The VoxelMorph model consists of a CNN that takes a fixed and a moving volume as input, followed by an STL that warps the moving volume using the deformation yielded by the CNN (Fig. 3). The model can be trained with any differentiable loss function. Let F and M be two image volumes defined over an N-dimensional spatial domain, Ω ⊂ R^N. We consider CT images, thus N = 3 in our study. More specifically, F and M were the fixed and moving images, respectively.
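A minimal sketch of the DVF generator described above is given below (assumptions: the accepted control-point displacements are splatted into a zero-initialized dense field, the interpolation/smoothing between control points is omitted, and the toy parameters are smaller than the P = 100, d_P = 40 used in the study):

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_dvf(shape, p=100, d_p=40, delta=6.0):
    """Pick up to `p` random control points with minimum mutual distance
    `d_p` (rejection sampling) and assign each a random displacement in
    [-delta, delta] per axis. Returns the points and a dense DVF in which
    only the control points are set (dense interpolation omitted here)."""
    points, tries = [], 0
    while len(points) < p and tries < 10000:
        c = rng.integers(0, shape, size=3)
        if all(np.linalg.norm(c - q) >= d_p for q in points):
            points.append(c)
        tries += 1
    dvf = np.zeros(shape + (3,))
    for c in points:
        dvf[tuple(c)] = rng.uniform(-delta, delta, size=3)
    return np.array(points), dvf

# Toy example on a small volume (the study used 160 x 160 x 256).
points, dvf = simulate_dvf((64, 64, 64), p=10, d_p=20, delta=6.0)
```

In practice, the sparse control-point displacements would be interpolated (e.g., with a smooth kernel) into a dense, smooth field before warping a CT volume.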
Let φ be a transformation operator defined by a DVF u that denotes the offset vector from F to M for each voxel: φ = Id + u, where Id is the identity transform. We used the following unsupervised loss function:

L(F, M, φ) = L_sim(F, M • φ) + λ L_smooth(φ),

where L_sim aims to minimize differences in appearance and L_smooth penalizes local spatial variations in φ, acting as a regularizer weighted by the parameter λ. The employed L_sim is the local cross-correlation between F and M • φ, which is more robust to intensity variations found across scans and datasets [46]. Let F̂(p) and [M̂ • φ](p) denote local mean intensity images:

F̂(p) = (1/ω³) Σ_(p_i ∈ N(p)) F(p_i),

where p_i iterates over a local neighborhood, N(p), defining an ω³ volume centered on p, with ω = 9 in our experiments. The local cross-correlation of F and M • φ is defined as:

NCC(F, M • φ) = Σ_(p ∈ Ω) [ Σ_(p_i) (F(p_i) − F̂(p)) ([M • φ](p_i) − [M̂ • φ](p)) ]² / ( Σ_(p_i) (F(p_i) − F̂(p))² · Σ_(p_i) ([M • φ](p_i) − [M̂ • φ](p))² ).

A higher NCC indicates a better alignment, yielding the loss function:

L_sim(F, M, φ) = −NCC(F, M • φ).

Minimizing L_sim encourages M • φ to approximate F, but might yield a non-smooth φ that is not physically realistic. Thus, a smoother displacement field φ is achieved by using a diffusion regularizer on the spatial gradients of the displacement u:

L_smooth(φ) = Σ_(p ∈ Ω) ||∇u(p)||²,

where the spatial gradients are approximated via differences between neighboring voxels.

The network follows an encoder-decoder architecture similar to U-Net [36]. 3D convolutions were applied in both the encoder and decoder paths using a kernel size of 3 and a stride of 2. Each convolution was followed by a LeakyReLU activation layer.
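The two loss terms can be sketched numerically as follows (a simplification: a single global window is used instead of the local ω = 9 windows, and this is an illustration rather than the VoxelMorph implementation):

```python
import numpy as np

def ncc(f, m, eps=1e-8):
    """Squared normalized cross-correlation between two volumes.
    (The paper uses a local, windowed version with omega = 9; a single
    global window is used here for brevity.)"""
    fc, mc = f - f.mean(), m - m.mean()
    return float((fc * mc).sum() ** 2 / ((fc ** 2).sum() * (mc ** 2).sum() + eps))

def smoothness(u):
    """Diffusion regularizer: sum of squared forward differences of the
    displacement field u (shape X x Y x Z x 3), approximating sum ||grad u||^2."""
    return float(sum((np.diff(u, axis=a) ** 2).sum() for a in range(3)))

def loss(f, m_warped, u, lam=1.0):
    """Unsupervised loss: L_sim + lambda * L_smooth, with L_sim = -NCC."""
    return -ncc(f, m_warped) + lam * smoothness(u)
```

A perfectly aligned pair yields NCC close to 1 (hence a minimal L_sim), while any spatial variation in the displacement field increases the smoothness penalty.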

Parameter settings and implementation details
In the present study, the optimized hyperparameter settings suggested by Balakrishnan et al. [12] served as a starting point. We investigated the effect of the LeakyReLU α parameter on the stability of the training process and found that α = 0.5 was optimal for registering abdominopelvic CT images. In all experiments, the regularization parameter, λ, was set to 1.0. One training epoch consisted of 100 steps and took approximately five minutes. The models described in Section 4.1 were trained until convergence (1000 epochs) using a learning rate of 10^-4, whereas the incrementally trained models used the learning-rate schedule described below.

Training the network directly on pairs of consecutive CT images was hampered by the large deformations between scans, which resulted in vanishing gradients. Since such small amounts of training data are typical in clinical settings, we devised a novel approach to enforce and improve the learning phase by using an incremental training strategy combined with simulated deformed CT scans (see Section 3.2.2). Each training volume was deformed using K randomly generated DVFs, φ_k. The resulting set of deformed volumes for the i-th patient was used to incrementally train the network such that, in each training iteration, the network was trained on a mini-batch containing all deformed volumes, S_(i,j). The deformed volumes in the set V* = {P*_(T+1), P*_(T+2), ..., P*_(T+V)} were randomly divided into two equal, independent parts. One part was kept aside for evaluation, and the other part was used to monitor the training process to avoid concept drift (i.e., changes in the data distribution) between the mini-batches over time. After each training iteration, the network weights that resulted in the best performance on this second part of V* were reloaded to initiate the next iteration. If the network did not converge during a certain iteration, the network weights of the previous iteration were reloaded, thereby ensuring that the overall training process could continue and remain stable. To reduce forgetting, the learning rate was decreased linearly from 10^-4 (first iteration) to 10^-6 (last iteration) [48].
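The iteration logic described above can be sketched as follows (`train_step` and `evaluate` are hypothetical placeholders standing in for VoxelMorph training and validation; only the reload-best-weights and linear learning-rate decay logic follow the text):

```python
import numpy as np

def learning_rates(n_iterations, lr_first=1e-4, lr_last=1e-6):
    """Learning rate decreased linearly over the incremental iterations,
    as done to reduce forgetting between mini-batches."""
    return np.linspace(lr_first, lr_last, n_iterations)

def incremental_train(train_step, evaluate, mini_batches):
    """Skeleton of the incremental loop: each iteration starts from the
    best weights seen so far; iterations that do not converge are skipped
    so that training can continue from a stable state."""
    best_weights, best_score = None, -np.inf
    for batch, lr in zip(mini_batches, learning_rates(len(mini_batches))):
        weights, converged = train_step(batch, lr, best_weights)
        if not converged:
            continue                      # keep the previous iteration's weights
        score = evaluate(weights)
        if score > best_score:            # best weights initiate the next iteration
            best_weights, best_score = weights, score
    return best_weights
```

Here `evaluate` plays the role of the held-out half of V* used to monitor concept drift between mini-batches.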

Evaluation methodology
This section describes the evaluation metrics used to quantify the registration performance of the incrementally trained VoxelMorph framework and the NiftyReg toolbox [13] that served as a benchmark in this study.

NiftyReg
All deformed abdominopelvic CT images were also registered using the Fast Free-Form Deformation (F3D) algorithm for non-rigid registration in the NiftyReg toolbox (version 1.5.58) [13]. All options were set to their defaults: the image similarity metric was Normalized Mutual Information (NMI) with 64 bins, and the optimization was performed using a three-level multi-resolution strategy.
Dataset split. The whole dataset D was split into two disjoint training, T = {P_1, P_2, ..., P_T}, and validation, V = {P_(T+1), P_(T+2), ..., P_(T+V)}, sets (with T + V = D). Our medical image registration framework was trained only using the set T, while testing was performed on the unseen data included in the validation set V. The incremental training strategy is schematized in Fig. 3.
A k-fold cross-validation scheme was used, partitioning the whole dataset D into k = 4 folds. Before the splitting, the patients included in D were randomly shuffled to obtain a randomized permutation. Afterwards, during the k cross-validation rounds, each fold was internally reshuffled to increase data variability during training/testing. Cross-validation strategies allow us to better estimate the generalization ability of a given model with respect to the hold-out method, which uses only one partitioning of the dataset into training/test sets [50]. By so doing, the statistical validity increases, with lower variance and dependence on the initial dataset partition, by averaging the results over all k cross-validation rounds. k-fold cross-validation is the most common choice for reliable generalization results, minimizing the bias associated with the random sampling of the training/test sets [50], even though this statistical practice is computationally expensive due to the k-times-repeated training from scratch [51]. Moreover, the results could underestimate the actual performance, allowing for conservative analyses [52]; thus, we chose 4-fold cross-validation for reliable and fair training/testing phases, according to the number of available patients.
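The patient-level partitioning described above can be sketched in pure Python (the patient identifiers and the seed are illustrative assumptions, not the study's actual shuffling):

```python
import random

def kfold_splits(patients, k=4, seed=0):
    """Patient-level k-fold partitioning with an initial random shuffle,
    mirroring the setup above (k = 4). IDs and seed are illustrative."""
    rng = random.Random(seed)
    shuffled = list(patients)
    rng.shuffle(shuffled)                                # randomized permutation
    folds = [shuffled[i::k] for i in range(k)]           # k disjoint folds
    return [([p for j, f in enumerate(folds) if j != i for p in f], folds[i])
            for i in range(k)]                           # (train, test) rounds

rounds = kfold_splits([f"P{i}" for i in range(1, 13)], k=4)
```

Splitting at the patient level (rather than the scan level) ensures that no scan of a test patient leaks into the training folds.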

Note that the F3D algorithm in the NiftyReg toolbox does not support GPU acceleration, in contrast to the Block Matching algorithm for global (affine) registration that was used to pre-align the CT images in this study (see Section 3.2.1).

Evaluation metrics
To quantify image registration performance, we relied on highly accurate delineations performed by a board-certified radiologist. The rationale for considering the VOIs that covered the vertebral bodies of the spine to determine registration performance was that they spanned the majority of the scanning volume in the superior-inferior direction and were of clinical relevance because of the underlying study on bone metastases.
As an evaluation metric, we used the DSC, which is often used in medical image registration [12]. DSC values were calculated using the gold standard regions delineated on the fixed scans (R_F) and the corresponding transformed regions of the moving scans (R_M) after application of the estimated DVF φ:

DSC(R_F, R_M ∘ φ) = 2 |R_F ∩ (R_M ∘ φ)| / (|R_F| + |R_M ∘ φ|).   (6)

Since DSC is an overlap-based metric, the higher the value, the better the registration results.
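A minimal sketch of Eq. (6) on binary masks (assuming both masks live on a common grid after warping):

```python
import numpy as np

def dice(r_f, r_m_warped):
    """Dice Similarity Coefficient of Eq. (6):
    DSC = 2 |R_F ∩ φ(R_M)| / (|R_F| + |φ(R_M)|)."""
    a, b = r_f.astype(bool), r_m_warped.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

# Toy masks: |a| = 4, |b| = 6, overlap = 4, so DSC = 8 / 10 = 0.8
a = np.zeros((4, 4), bool); a[1:3, 1:3] = True
b = np.zeros((4, 4), bool); b[1:3, 1:4] = True
```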
For completeness, we also calculated the Structural Similarity Index (SSIM).
This metric is commonly used to quantify image quality perceived as variations in structural information [49]. Let X and Y be two images (in our case, F was compared with either M or D for the evaluation); SSIM combines three relatively independent terms:

• the luminance comparison l(X, Y) = (2 μ_X μ_Y + κ_1) / (μ_X² + μ_Y² + κ_1);
• the contrast comparison c(X, Y) = (2 σ_X σ_Y + κ_2) / (σ_X² + σ_Y² + κ_2);
• the structural comparison s(X, Y) = (σ_XY + κ_3) / (σ_X σ_Y + κ_3);

where μ_X, μ_Y, σ_X, σ_Y, and σ_XY are the local means, standard deviations, and cross-covariance for the images X and Y, while κ_1, κ_2, κ_3 ∈ R+ are regularization constants for the luminance, contrast, and structural terms, respectively, exploited to avoid instability in the case of image regions characterized by a local mean or standard deviation close to zero. Typically, small non-zero values are employed for these constants; according to [49], an appropriate setting is κ_1 = (0.01 L)² and κ_2 = (0.03 L)², where L is the dynamic range of the pixel values in F. SSIM is then computed by combining the components described above:

SSIM(X, Y) = l(X, Y)^α · c(X, Y)^β · s(X, Y)^γ,

where α, β, γ > 0 are weighting exponents. As reported in [49], if α = β = γ = 1 and κ_3 = κ_2 / 2, the SSIM becomes:

SSIM(X, Y) = (2 μ_X μ_Y + κ_1)(2 σ_XY + κ_2) / ((μ_X² + μ_Y² + κ_1)(σ_X² + σ_Y² + κ_2)).

Note that the higher the SSIM value, the higher the structural similarity, implying that the co-registered image, D, and the original image F are quantitatively similar.
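As an illustration, the simplified SSIM with α = β = γ = 1 and κ_3 = κ_2/2 can be evaluated globally over two arrays (a sketch only; [49] computes the statistics over local windows rather than the whole image):

```python
import numpy as np

def ssim_global(x, y, dynamic_range=1.0):
    """Simplified SSIM (alpha = beta = gamma = 1, kappa3 = kappa2 / 2),
    with kappa1 = (0.01 L)^2 and kappa2 = (0.03 L)^2 as in [49].
    Global statistics are used here for brevity."""
    k1 = (0.01 * dynamic_range) ** 2
    k2 = (0.03 * dynamic_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + k1) * (2 * cov + k2)) / \
           ((mx ** 2 + my ** 2 + k1) * (vx + vy + k2))

img = np.random.default_rng(0).random((32, 32))
```

An image compared with itself yields SSIM = 1, the upper bound of the metric.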

Impact of CT table removal
To assess the impact of the CT table removal, all original and refined testing CT images were registered using the trained VoxelMorph networks (Fig. 7a). As a benchmark, all original and refined testing CT images were also registered using NiftyReg (Fig. 7b). Statistical analysis was performed using a paired Wilcoxon signed-rank test, with the null hypothesis that the samples came from continuous distributions with equal medians. In all tests, a significance level of 0.05 was used [50]. The SSIM of the refined images registered using the VoxelMorph framework was higher for both local deformations and global patient shifts (both p < 0.0005). No difference between original and refined CT images was observed in the DSC values of registrations performed using NiftyReg (Fig. 7b), although the SSIM of the refined images registered using NiftyReg showed significant improvements over the original images (both p < 0.0005). Table 1 shows the computational times required to register one image pair using the VoxelMorph framework and NiftyReg. The CT table removal resulted in slightly shorter registration times when using the NiftyReg toolbox (on average 105 s) compared to registering original images with and without a patient shift (on average 106 s and 109 s, respectively).
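The paired Wilcoxon signed-rank test used above can be reproduced with SciPy; the DSC values below are purely illustrative stand-ins, not the study's data:

```python
from scipy.stats import wilcoxon

# Illustrative paired DSC values (NOT the study's data): one value per test
# registration, before and after CT table removal.
dsc_original = [0.70, 0.72, 0.68, 0.75, 0.71, 0.69, 0.73, 0.74]
dsc_refined  = [0.71, 0.74, 0.71, 0.79, 0.76, 0.75, 0.80, 0.82]

stat, p = wilcoxon(dsc_original, dsc_refined)  # paired, two-sided by default
significant = p < 0.05                         # significance level of the study
```

The test is non-parametric and paired, matching the setting here where the same image pairs are registered with and without the CT table.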

Quantitative evaluation of the incremental training strategy
In the proposed incremental training strategy, a network was trained on all deformed volumes included in a mini-batch S i,j until its performance on the validation set V * no longer improved, after which the best performing network weights were reloaded to initiate the next training iteration. Fig. 8 shows the resulting best training and validation errors achieved during each training iteration of the different cross-validation rounds. Although the training errors sometimes varied greatly between iterations, the network performance on the validation set, V * , gradually improved during incremental training.
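The incremental strategy described above can be sketched as a generic training loop; all function arguments (train_step, validate, the weight accessors) and the patience/tolerance values are assumptions standing in for the actual VoxelMorph training code:

```python
def incremental_training(minibatches, train_step, validate,
                         get_weights, set_weights, patience=3, tol=1e-6):
    """Generic sketch of the incremental strategy: train on each mini-batch
    S_ij until the validation loss on V* stops improving, then reload the
    best-performing weights before moving on to the next mini-batch."""
    best_val = float("inf")
    best_w = get_weights()
    history = []
    for batch in minibatches:
        stale = 0
        while stale < patience:
            train_step(batch)          # one optimization step on this mini-batch
            val = validate()           # loss on the fixed validation set V*
            if val < best_val - tol:   # improvement: remember these weights
                best_val, best_w, stale = val, get_weights(), 0
            else:
                stale += 1
        set_weights(best_w)            # reload best weights for the next batch
        history.append(best_val)
    return best_val, history
```

Reloading the best weights between mini-batches is what lets the network accumulate features across patients without drifting on a single hard batch.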
Another interesting phenomenon that can be observed in Fig. 8 is that the best training errors achievable when training on a specific mini-batch tended to differ between patients. For example, training errors increased when training on simulated deformed scans of patients P5 or P9. Since both of these patients were, by chance, included in the validation and test sets of round 2, this also explains why the validation errors in round 2 were generally higher (Fig. 8) and the registration performance was lower (see Figs. 9 and 10).

Deformable registration performance
Since the VoxelMorph network did not converge when training on either the whole dataset D or on the simulated dataset T*, the effectiveness of the proposed incremental training strategy in learning features from multiple patients was compared to training a network on 1000 simulated deformed scans derived from a single volume (V9,1 from P9). All trained networks were subsequently used to register simulated deformed scans from the independent test set back onto their original volumes (Fig. 9). In all cross-validation rounds, the incrementally trained networks outperformed the network trained on a single volume, while iterative registration using NiftyReg remained about 300 times slower than one forward pass through the VoxelMorph framework (Table 1).
To evaluate the impact of the inter- and intra-patient variations in the longitudinal abdominopelvic CT dataset, D, on the registration performance, all trained networks were also used to register real scan pairs, i.e., mapping sequential time-points back onto the reference scan (time-point 0). Fig. 10 shows the DSC and SSIM values between the real scan pairs before registration, after registration using the VoxelMorph framework trained on a single volume or incrementally, and after registration using NiftyReg. The differences between the scan pairs before registration varied greatly between patients, with DSC and SSIM values ranging from 0.567 to 0.920 and from 0.693 to 0.918, respectively. Although the VoxelMorph networks were trained using only simulated deformations, the incrementally trained networks improved the DSC between the real scan pairs for 6 out of the 12 patients, whereas the network trained on a single volume improved the DSC for 4 out of the 12 patients (Fig. 10). Furthermore, all VoxelMorph-based models improved the SSIM between the real scan pairs for all patients except patient P6. However, it should be noted that none of the networks trained in this study achieved registration results comparable to NiftyReg.

Large displacements
In addition to variations between patients, mapping large displacements may also form a challenge for deep learning based deformable registration methods.
In order to evaluate the effect of the size of the displacements on the registration performance of the networks trained in this study, an additional test set was created by simulating K DVFs φk (k = 1, …, K) with maximum displacements ranging from 0 mm (i.e., no deformation) to 25 mm (i.e., structures moving across the entire abdominal and pelvic regions) in steps of 1 mm, with K = 30 at each step. These DVFs were used to deform the same volume that was used to generate the training data of the single-volume network (V9,1 from P9), after which the deformed images were mapped back onto the original volume using the trained VoxelMorph networks and NiftyReg.
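A simple way to simulate smooth DVFs with a prescribed maximum displacement, and to warp an image with them, can be sketched with SciPy (an illustration of the idea, not the paper's exact DVF generator):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def random_dvf(shape, max_disp_vox, smooth=8.0, seed=0):
    """Simulate a smooth random DVF by Gaussian-blurring white noise and
    rescaling so that the largest displacement magnitude equals
    max_disp_vox (in voxels)."""
    rng = np.random.default_rng(seed)
    dvf = np.stack([gaussian_filter(rng.standard_normal(shape), smooth)
                    for _ in range(len(shape))])
    mag = np.sqrt((dvf ** 2).sum(axis=0)).max()
    return dvf * (max_disp_vox / mag) if mag > 0 else dvf

def warp(image, dvf):
    """Warp an image with a dense DVF using linear interpolation."""
    grid = np.stack(np.meshgrid(*[np.arange(s) for s in image.shape],
                                indexing="ij")).astype(float)
    return map_coordinates(image, grid + dvf, order=1, mode="nearest")

# 2-D toy example; the same functions work unchanged for 3-D volumes.
img = np.zeros((32, 32)); img[12:20, 12:20] = 1.0
warped = warp(img, random_dvf(img.shape, max_disp_vox=3.0))
```

Sweeping max_disp_vox over a range of values reproduces the kind of displacement-size experiment described above.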

Discussion and conclusions
In recent years, an increasing number of studies have focused on using deep neural networks for deformable image registration, because such methods offer fast or nearly real-time registration [2,12,26,27,29,35,37]. However, their application to abdominopelvic CT images remains limited because of the large intra- and inter-patient variations, the not fully diffeomorphic nature of the deformations, and the limited availability of large numbers of well-annotated images for training.
In the present study, we demonstrated that removing extracorporeal structures aids deformable registration of abdominopelvic CT images when using both traditional and deep learning approaches. Along with the registration of multiple CT scans over time, in which the table design and shape may differ and affect the registration process, the devised method based on region-growing [45] could also be valuable for multimodal image registration tasks, because the scanner table is not visible on MRI and PET [51]. Another practical use case could be radiation treatment planning, in which the CT table influences the dose distribution, since the table used during imaging typically has different beam attenuation characteristics compared to the treatment table [52].
To address the remaining challenges of our abdominopelvic CT dataset, we generated training data for our network by synthetically deforming the CT images. Such synthetically deformed images can be employed for different purposes: (i) training a neural network for deformable image registration on a relatively small clinical dataset; and (ii) evaluation, e.g., testing the ability of a network to register increasingly large displacements. Simulated DVFs have already been successfully used for supervised learning of deformable image registration [2,27]. As a future development, we plan to introduce an additional penalty term into the loss function of our registration method to exploit the known simulated DVFs during training, which would allow the training process to gradually transition from semi-supervised to unsupervised learning.
To exploit the longitudinal nature of our dataset and enable training on small amounts of training data, we propose a novel incremental strategy to train the VoxelMorph framework. Incremental learning has shown potential for image classification [53] and medical image segmentation [54], although the so-called catastrophic forgetting [55] still remains a challenge. The incremental training of neural networks for longitudinal image registration could, therefore, benefit from introducing a penalty term into the loss function to balance the registration performance on new images while minimizing forgetting of previous images.
In the long term, parameter-efficient, single-shot deep learning solutions for deformable image registration would enable the development of novel end-to-end approaches, such as task-adapted image reconstruction [56]. From a clinical perspective, automated registration of longitudinal imaging data is a prerequisite for exploiting the full potential of standard-of-care CT images for treatment response assessment in patients with bone metastases. A successful approach might find potential applications in the most frequently occurring malignancies that have a tendency to metastasize to bone, i.e., prostate, lung, and breast cancer [57].