Residual Aligner Network

Image registration is important for medical imaging, the estimation of the spatial transformation between different images. Many previous studies have used learning-based methods for coarse-to-fine registration to efficiently perform 3D image registration. The coarse-to-fine approach, however, is limited when dealing with the different motions of nearby objects. Here we propose a novel Motion-Aware (MA) structure that captures the different motions in a region. The MA structure incorporates a novel Residual Aligner (RA) module which predicts the multi-head displacement field used to disentangle the different motions of multiple neighbouring objects. Compared with other deep learning methods, the network based on the MA structure and RA module achieve one of the most accurate unsupervised inter-subject registration on the 9 organs of assorted sizes in abdominal CT scans, with the highest-ranked registration of the veins (Dice Similarity Coefficient / Average surface distance: 62\%/4.9mm for the vena cava and 34\%/7.9mm for the portal and splenic vein), with a half-sized structure and more efficient computation. Applied to the segmentation of lungs in chest CT scans, the new network achieves results which were indistinguishable from the best-ranked networks (94\%/3.0mm). Additionally, the theorem on predicted motion pattern and the design of MA structure are validated by further analysis.


Introduction
Alignment of two images, also known as image registration [23], is an important task in computer vision applications.In medical imaging, image registration enables comparison between different acquisitions over time (longitudinal analysis) or between different types of scanners (multi-modal registration).
Image alignment can be defined as estimation of the spatial transformation ϕ : R n → R n , represented by a corresponding parameters or a series of displacements denoted by ϕ[x] ∈ R d at the coordinate x ∈ Z d of a target image I t ∈ R n from a source image I s ∈ R n , where n is the size of a 3D image defined as n = H × W × T , and d, T, H, W denoting the image dimension, thickness, height, and width, respectively.
Originally, image registration was solved as an optimization problem by minimization of a dissimilarity metric D and a regularization term S: φ = argmin ϕ D(ϕ(I s ), I t ) + λS(ϕ, I t ) (1) where φ denotes the estimated spatial transform, λ denotes the weight of the regularization.Several methods including Demons [26] or Free Form Deformations [21] have been proposed to solve Eq. ( 1), however they can get trapped in the the local optimum and their computational performance is limited due to iterative optimization of highly dimensional, non-convex problem.
More recently, the alignment is performed via (convolution) neural networks R using the feature maps F s , F t ∈ R c×n extracted from I s and I f respectively, (and c denotes a number of feature channels) by directly regressing (DR) the spatial transformation [1,17]: with the training process based on minimizing the loss function (e.g.given in Eq. ( 1)) with the trainable weights w (w are omitted in the remaining part of the paper to simplify the equation).However the direct regression of spatial transformations via convolution neural networks could suffer due to limited capture range of the receptive field of convolution layers when dealing with large or complex motion such as sliding motion.
One solution is to perform coarse-to-fine alignment via multi-scale feature maps or feature pyramid [19,24,28,30,4,16].In coarse-to-fine approach, the residual transformation φ k between the target feature map F t k and the warped source feature map based on previous level k − 1 registration ϕ k−1 (F s k ) is accumulated following the coarse-to-fine image resolution (thus expanding the receptive field from large to small): where • denotes the composition of two spatial transformations, and ϕ 0 is initialized as the identity transform.However, it has two limitations.First the coarse-to-fine strategy in the above-mentioned methods is performed on pyramid representation of feature maps, leading to the limited accessible range of neighbouring pixel's motion, and thus may not estimate accurately different motions of two nearby objects [29] or organs [18].Secondly, those spatial transforms from different scales are usually directly combined at each position with equal weight [32,30,19,4], leading to the lack of flexible balancing between similarity measure and deformation rationality.Alternatively, Cross Attention (CA) mechanism [27] is used in [14,25,8] to obtain the global receptive field and use so-called indicator matrices to quantify the relationship between each pair of pixels from two images, and the usage of multiple indicator matrices is called multi-head.However, calculation of the indicator matrix has O(n 2 ) computational and memory complexity, which could be prohibitive for 3D image registration.
In this paper, we propose the Residual Aligner Network (RAN) based on a novel Motion-Aware (MA) structure (Fig. 3) and a new Residual Aligner (RA) module (Fig. 4) for efficient, motion-aware, coarse-to-fine image registration.Our contributions are as follows.
-A new MA structure employing dilated convolution [5] with high-resolution feature maps is introduced to benefit the network on predicting different motion pattern (Sec.2.3); -Our RA module utilizes confidence and multi-head mechanism based on the semantic information of the image (Sec.3); -The above proposed components constitute an novel RAN that performs efficient, coarse-to-fine, motion-aware unsupervised registration achieving state-of-the-art accuracy on publicly available lung and abdomen Computed Tomography (CT) data in Sec.4; -We also investigate and quantify the capture range (Sec.2.1) and motion patterns (Sec.2.2) predicted in coarse-to-fine registration by recursively warping feature maps.

Related Works
Voxelmorph [2] is an early deep learning method using a convolution neural network, U-net [20], for deformable medical image registration.However, the capture range of large motions by the DR based learning methods is usually limited by the receptive field in the convolution networks between two images which thus requires a pre-alignment.DIRnet [28] was thus proposed with multi-stage (MS) networks for coarse-to-fine registration with each network trained for the specific resolution and searching range of registration, using B-spline for interpolation on the sparse prediction, but require extra time cost on training.Several end-to-end training multi-stage networks [32,12,22,31] were thus proposed for coarse-to-fine image registration by recursively warping images.However, the sub-network of each stage is fed with the directly warped images but not well processed feature maps, and lack connection between different stages, leading to calculation and parameters consuming on extracting the repeated features.
To efficiently employed the features, Feature Pyramid (FP) was employed for registration in Dual-PRNet (DPRn) [11] for unsupervised registration.Multiple spatial transforms are predicted in multi-scale feature domain, to gradually refine the registration based on a sequence of feature maps extracted from a compacted structure [11].Furthermore, Edge-Aware Pyramidal Network [3] was designed for unsupervised registration with an extra edge image of the original input to enhance the texture structure features.A new bilevel, self-tune framework [15] was also proposed for training a pyramidalbased registration network with contextual regularization.However, they still simply add or concatenate the predictions from multi-scale equally weighted features without quantifying the predictions' reliability via the information density to flexibly balance between the similarity of registered images and rationality of motions.

Network Design for Motion-Aware Structure
In this section, we describe the proposed coarse-to-fine image registration network with MA based Feature Extractor to extract the feature maps, and Feature Aligner including

Feature Aligner
Fig. 1.The architecture of RAN.Two Motion-Aware feature extractor networks results in feature maps (see more in Fig. 3) and stacked Residual Aligner modules (see more in Fig. 4) aligns and connects the data streams from the input images.
stacked RA modules to find the correspondence and estimate the Dense Displacement Field (DDF) as shown in Fig. 1.The proposed RA modules are stacked for coarse-tofine image alignment, and each of them takes two feature maps from two images, align and feed back the warped feature map to enhance the feature extraction, and forward the coarse registration to the next RA module for refining registration, where the details of RA module are described in Sec. 3. A pair of Fully Convolution Networks (FCN), with shared weight for efficient training, is used here as the feature extractor to extract two sets of feature maps, {F 1 k } K k=1 and {F 2 k } K k=1 , which take turns as the source and target feature maps.The RA module takes the two feature maps from the two streams of FCN, retrieves one (key) on another (query) and then feeds back the aligned feature maps respectively to reinforce the next feature extraction.The requirement of RA module for global range attention is described in Sec.2.1.The separability of motions is defined in Sec.2.2 to quantify the bottleneck of separating different motions within a certain range region in the coarse-tofine registration.A new type of Motion-Aware FCN is proposed to perform alignment at the higher resolution feature maps comparing with Pyramidal FCN in Sec.2.3.

Capture Range in Coarse-to-fine Registration
Definition 1. (Accessible Motion Range) : The radius of capture range of the the k th RA module is defined as the smallest upper bound of its accessible DDF: where sup(•) denotes the upper bound with varying inputs and the trainable weights of networks, ∥ • ∥ ∞ denotes the L-∞ norm, x denotes one coordinate entry of the images or DDFs.The accessible motion range can be approximated based on the module's receptive field: , where s k denotes the original-resolution size of effective receptive field on the input feature maps: where r k denote the dilation rates of convolution layers in the k th level registration, p k is the corresponding pool size of the k th feature maps' one pixel on the original image with p k ≤ p k−1 , ∀k > 0, and the convolutions are all assumed with kernel size less or equal than 3 to minimize the computation cost.Thus, the part (i) and (ii) in Eq. ( 5) are respectively dependent on pool size and dilation.
In the case of global registration on the whole image, the hyper-parameters p 1 , r 1 are set to enable a 1 to reach the whole image: and thus accessible motion range covers the whole image.

Motion Separability
In the feature pyramid approach, the typical convolution without dilation and the feature pyramid is employed: which fixes the the Eq. ( 5)(ii) and relies on downsampling to enlarge receptive field with only O(n) complexity to reach the whole image.However, as shown in Fig. 2(a), the DDF predicted on low-resolution feature map could form a bottleneck of estimated motion's Degree of Freedom (DoF).The only one predicted displacement is occupied by point a instead of point b and thus point b is not retrieved until finer resolution but with too smaller receptive field, where points a and b could be at the edges of different objects or even just two tiny objects.To quantify this phenomenon, we define the separability of the motion prediction: Definition 2. (Separability of Predicted Motion) : The separability of different motions of the k th RA module is defined as the bottleneck of the upper bound of the difference of its predictable DDF between two locations x, y ∈ Z d : where p denotes the L-∞ distance between the two pixels.
The reason of the problem in Fig. 2(a) is residual registration based on feature pyramid is suffering from the limited range of predicted motion difference with respect to the capture range and the pool size: Theorem 1. (Regional Dependency) The upper boundary of motion difference is related to a k and p k : where k ′′ , k, x, y denote two recursive numbers and two coordinate entries of images/DDFs.
Following Theorem 1, ∆ ∞ (p) is defined as: to describe the limitation on the multi-objects' motion difference.More details of Theorem 1 are given in Appendix.

Motion-Aware Structure
According to Theorem 1, the smaller pool size releases higher range of motion difference.Here, we design a new structure, called MA FCN, to achieve high DoF of DDF but still with the same capture range using dilation convolution on upsampled feature maps as shown in Fig. 3(a).Different from the conventional Feature Pyramid based FCN, the shortcut feature maps from the encoder part are upsampled and concatenated to a specific high-resolution feature map as the input to the decoder part with p k = 2 K−q , ∀k ≤ q and p k = 2 K−k , ∀q < k ≤ K, where q denotes the layer number with MA pattern in the decoder part.The q could be adjusted considering for the balance between the DoF of the predicted DDF and computational cost.The complexity required is O(n log(n)) using fully MA-layer structure q = K and is still O(n) using   4),( 9) (unit: pix/vox).
fully feature pyramid q = 0. To keep the same receptive field of MA structure as FP structure, the dilation rate is set to ∥r k (q > 0)∥ 1 ≥ 2 q−k ∥r k (q = 0)∥ 1 , ∀k ≤ q as suggested by Eq. ( 4) and Eq. ( 5).As shown in Fig. 2(b), with the same receptive field, the MA structure releases the higher resolution before alignment and thus avoid loss of the DoF of DDF.The capture ranges and the difference ranges of DDF for varying setting are illustrated in Fig. 3(b)(c)(d) based on the calculation of Eq. ( 4) and (9), where the new design achieves larger area under ∆ ∞ (p) with almost the same a k .

Residual Aligners
The RA module as shown in Fig. 4 aims to establish spatial transform ϕ between two images via recursively warping feature map of one towards the others, with extra attributes map θ comparing with Eq. ( 3) to restore the auxiliary information related to alignment of each pixel: for the RA modules cascade number k = 1, ..., K.The k th RA module first takes the input feature maps from source images and target images F s k , F t k ∈ R c k ×n k and use the Regressor R k to regress a m-head residual DDF φ k ∈ R dm×n k and the incremental attributes ϑ k ∈ R m×n k (Sec.3.1).Then the Accumulator Network T computes the confidence and M-H masks, describing the prediction's reliability and the semantic properties of each pixel, and fuse the m-head DDF weighted based on the attribute maps θ k ∈ R m×n k as in Sec.3.2.The warping function performed by the resampler is implemented following the work [13].

Regressor
The function of Regressor R k in RAN is to regress the residual transform φ k between the target feature map F t k and the source feature map warped by the previous alignment ϕ k−1 (F s k ), with the incremental attribute map ϑ k to restore the auxiliary information for the inter-scale refinement in the coarse-to-fine registration.As shown in Fig. 4, Regressor concatenates the input feature maps and feeds them into the subsequent series of dilated convolution and activation layers.Referring to Sec. 2.3, the dilation rate vector r k is set to enlarge the capture range of alignment and raise the feature resolution as introduced in Sec. 2. Then two shallow convolution networks are respectively used to predict the M-H DDF and the incremental attributes raised from this level's alignment.

Accumulator
The task of Accumulator T k is to refine the DDF with the previous coarse DDF by interpolating and fusing those spatial transform representations from varying scales and different heads in terms of the contextual information, such as the alignment reliability of the neighbouring pixels and their semantic attributes.The calculation of Accumulator shown in Fig. 4 can be written as: where ϕ k−1 and φ k are the weighted DDF and residual DDF: {m} : R d×m×n k → R d×n k denotes the head-dimension sum, ⊗ : R d×n k ×R m×n k → R d×m×n k denotes the tensor product, ⊙ : R d×m×n k × R 1×n k → R d×m×n k denotes the element-wise product for the last two dimensions.Here C 1 , C 2 , C 3 , C 4 are fitted by convolution networks with activation layers, respectively for the mapping of confidence weight projection, interpolation, attribute fusion and the DDF fusion.
Confidence of Correspondence Simple composition of DDFs from different levels [28,30] could accumulate errors at the points which failed in previous alignment.Thus, the confidence values are respectively quantified by C 1 (ϑ k ), C 1 (θ k−1 ) in Eq. ( 12) for residual M-H DDF φ k and previous DDF ϕ k−1 to weight the following filtering for interpolation with neighbouring prediction value.Here the confidence is implicitly regressed from ϑ k and θ k−1 (contrary to the confidence of occlusion probability in [14]) with general representation aiming to provide higher accuracy.
Multi-Head Mechanism Inspired by the Multi-Head attention [27], the corresponding M-H mask are regressed by softmax(θ k ) to extract the varying motion patterns of multiple objects from the different candidate predictions at M-H residual DDF as shown in Eq. ( 12).This process could be regarded as the combination for the optimal transformation selected by the M-H masks, with preserving discontinuities in the DDF and the trend of motions [9].

Datasets
We evaluated the RAN on unsupervised deformable registration on two public available datasets with segmentation annotations on 9 small organs in abdomen CT and lung in chest CT: Unpaired abdomen CT: The dataset is provided by [6].The ground truth segmentations of spleen, right kidney, left kidney, esophagus, liver, aorta, inferior vena cava, portal, splenic vein, pancreas of all scans are provided. .The inter subject registration of the abdominal CT imaging is considered as challenging due to large inter-subject variations and great variability in organ volume, from 10 milliliters (esophagus) to 1.6 liters (liver).Each volume is resized to 2 × 2 × 2mm 3 in the pre-processing.From totally 30 subjects, 23 and 7 are respectively used for training and testing, for 506 and 42 different pairing cases.

Unpaired chest (lung) CT:
The dataset is provided by [10].The CT scans are all acquired at the same time point of the breathing cycle and here we performe inter-subject registration.The scanner is a Philips Brilliance 16P with a slice thickness of 1.00 mm and slice spacing of 0.70 mm.Pixel spacing in the X-Y plane varies from 0.63 to 0.77 mm with an average value of 0.70 mm.The ground truth lung segmentation of all scans are provided.Each volume is resized to 1 × 1 × 1mm 3 in the pre-processing.From the total of 20 subjects, 12 and 8 are respectively used for training and testing, for 132 and 56 different pairing cases.

Training details
We normalize the input image within 0-1 range and augment the training data by randomly cropping input images during training.For the experiments on inter-subject registration of abdomen and lung CT, the models are first pre-trained for 50k iteration on synthetic DDF combining rigid spatial transformation and deformation (detail in Appendix).Then the models are trained on real data for 100k iterations with the loss function: where normalized cross correlation and mean squared error are used in abdomen and lung CT respectively for D following [2].The whole training takes one week, including the data transfer, pretraining and fine-tuning.With a training batch size of 3, the initial learning rate is 0.001.The model was end-to-end trained with Adam optimizer.

Implementation and Evaluation
Implementation: The code for inter-subject image registration tasks were developed based on the framework of [1] in Python using Tensorflow and Keras.It has been run on Nvidia Tesla P100-SXM2 GPU with 16GB memory, and Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz.The backbone feature pyramid network we used is U-net [20] based on residual structure [7] with four downsampling blocks and four upsampling blocks.Since the most motion difference ranges between 0-15 as shown in following Fig. 7, two models RAn 3 and RAn + 4 are thus selected as our representative models with q = 3, 4 as suggested by the effect in Fig. 3(d).The detail of those structures are illustrated in Appendix.
Comparison: We compared Residual Aligner Network with the relevant state-of-the-art methods.The Voxelmorph [2] is adopted as the representative method of direct regression (DR).The composite network combing CNN (Global-net) and U-net (Local-net) following to [12], as well as recursive cascaded network [31] are also adopted into the framework as the relevant baselines representing multi-stage (MS) networks.Dualstream Pyramidal network (DPRn) [11] is selected as the baseline for feature pyramidal (FP) networks.Additionally, we also replace RA-module (in Fig. 1) with cross attention (Attn) [25] to compare the performance at module-level.
Evaluation metrics: Following [28], we calculate the Dice Coefficient Similarity (DSC), Hausdorff Distance (HD), and Average Surface Distance (ASD) on annotated mask for the performance evaluation of nine organs in abdomen CT and one organ (lung) in chest CT, with the negative number of Jacobian determinant in tissues' region (detJ) for rationality evaluation on prediction.The model size, computation complexity and running time for comparison with previous methods on inter-subject registration of lung and abdomen are shown in Tab. 1.

Performance on registration:
The comparison between RAN with other methods on abdomen and chest CT using all 10 organs is shown in Tab. 1, and the results illustrate Table 1.Avg of Dice Similarity Coefficient (DSC), Hausdorff Distance (HD), Average Surface Distance (ASD) and negative number of Jacobian determinant in tissues' region (detJ) for unsupervised inter-subject registration of abdomen and chest CT using the Voxelmorph (VM1) [2] and its enhanced version with double channels (VM2), convolution networks cascaded with U-net (Cn+Un) [12], 5-recursive cascaded network based on the structure of VM1 and VM2 (RCn1,RCn2) as described in [31], Cross Attention [27] w/ Feature Pyramid (CA/P), Dual-stream pyramidal registration network (DPRn) [11], our RA network with q = 3, 4 (RAn3,RAn our network achieved one of the best performance in this task with fewer parameters and lower computational cost.The Fig. 5 illustrate the improvement at the area containing multi-organs and at the edges of organs.In terms of registration types, DR networks (VM2) require more parameters for better results, and MS networks (RCn1, RCn2) need much more computation, while the FP based network (DPRn) balances between them, and our MA based RAn further improve it.The separate evaluation of the 9+1 organs (abdomen+chest) for five models, as shown in Fig. 6, illustrates our RAn achieves best accuracy in the registration of small organs (veins) and one of the best accuracy in other

organs' registration.
Ablation Study: To validate the effect of each component on the performance, we also tried several combination on the confidence weight (CW), multi-heads (MH) and motion-aware pattern number (q) on experiments of abdomen and lung CT as shown in Tab. 2. For a fair comparison, the channel numbers are tuned to keep the trainable parameter numbers similar to each others, except RAn + 4 with larger model size for higher accuracy.Fig. 5 and 6 show our RAn + 4 with q = 4 is better than RAn 3 on smaller tissues' prediction but worse on larger one's (lung).Separability of the predicted motions: Beside implicitly validated by the better results of higher q, more visual validation of MA design is illustrated in Fig. 7(a) including the probability density distributions of the pairs of correct motion prediction with varying voxel distance and motion difference for varying q.Based on the difference between them in Fig. 7(b), It shows RAn with higher q obtain more correction hits at the left-top area, and thus the better motion separability, matching the expectation in Fig. 3(d) and validating the improvement by the design principle described in Sec 2.3.

Discussion and Conclusion
The novel RAN design is proposed based on the MA mechanism with a new RA module.It achieves the best registration of the veins in abdominal CT scans and comparable registrations with other state of the art networks in the other tissues in abdominal and chest CT with fewer parameters and less computation.Additionally, RANs based on MA structure achieved the improving separability of predicted motion as shown in Fig. 7, which also validate the proposed design principle on MA mechanism.These results demonstrate the efficiency and the potential of RAn performing on relevant tasks including multi-object registration, which could also be further applicable to other relevant computer vision tasks, such as optical flow, stereo matching and motion tracking.

Fig. 2 .
Fig. 2. Illustration of the problem of coarse-to-fine alignment of two neighbouring points, a (×) and b (△), with differing motion.(a) failed capture of point b due to the low resolution feature pyramid.(b) Our proposed solution, utilizing an upsampling layer and dilated convolution in the Motion-Aware structure (Fig. 3), while maintaing the same receptive field.

Fig. 3 .
Fig. 3.The design and theoretical analysis of Fully Convolution Networks (FCN) for feature extraction.(a) Motion-Aware Structures designed with varying number of motion-aware layersq, where the feature maps from encoder part are upsampled and concatenated to the decoder part, (b) with different hyper-parameter setting, showing that a higher q, (c) with almost the same a k , achieves higher area under ∆∞(p) (d), referring to Eq. (4),(9) (unit: pix/vox).

Fig. 4 .
Fig. 4. The architecture of the k th Residual Aligner (RA) module.The Regressor section regresses the residual Multi-Head (M-H) Dense Displacement Field (DDF) φ k and each pixel's attribute ϑ k ,while the Accumulator refines the DDF ϕ k via interpolation and fusion of the M-H predictions weighted by the confidence and M-H mask.

Fig. 5 . 1 HDFig. 6 .
Fig.5.Qualitative example in abdomen and chest CT shows our network achieves plausible registration, with the improvement at the areas between different organs, such as liver, inferior vena cava and right kidney, as well as the edge area of the lung.

Table 2 .
Ablation study on RA module by inter-subject image registration of abdomen CT and lung CT using, with varying setting of head number m, motion-aware pattern of feature maps (q = 0, 3, 4) and confidence weights (CW).