Arbitrary Order Total Variation for Deformable Image Registration

In this work, we investigate image registration in a variational framework and focus on regularization generality and solver efficiency. We first propose a variational model combining the state-of-the-art sum of absolute differences (SAD) and a new arbitrary order total variation regularization term. The main advantage is that this variational model preserves discontinuities in the resultant deformation while being robust to outlier noise. It is however non-trivial to optimize the model due to its non-convexity, non-differentiabilities, and generality in the derivative order. To tackle these, we propose to first apply linearization to the model to formulate a convex objective function and then break down the resultant convex optimization into several point-wise, closed-form subproblems using a fast, over-relaxed alternating direction method of multipliers (ADMM). With this proposed algorithm, we show that solving higher-order variational formulations is similar to solving their lower-order counterparts. Extensive experiments show that our ADMM is significantly more efficient than both the subgradient and primal-dual algorithms, particularly when higher-order derivatives are used, and that our new models outperform state-of-the-art methods based on deep learning and free-form deformation. Our code, implemented in both Matlab and PyTorch, is publicly available at https://github.com/j-duan/AOTV.


Introduction
Image registration is a process of matching two or more images of the same or similar object from different times, viewpoints, or imaging modalities. It has been widely used in many areas, such as art, astronomy, biology, chemistry, criminology, physics, remote sensing, etc. In medical imaging, registration enables direct comparison of images taken at different stages of progression of a disease, which is essential for disease diagnosis, treatment, and monitoring.
In the case of mono-modal registration, this task can be approached using a variational framework. A common method of estimating geometric transformations (deformations) is to minimise the sum of squared differences (SSD) between the warped source image (aka the moving or template image) and the target image (aka the fixed or reference image). SSD assumes that the underlying noise is Gaussian, an assumption that breaks down in the presence of outliers.
However, there remain limitations in current regularizations: (1) Most regularizations were developed in a quadratic form [7,8,10-12,15-18]. They may be able to generate a globally smooth and satisfactory local deformation field, but cannot preserve discontinuities in the deformation. Such discontinuities often appear in medical imaging at organ boundaries, primarily due to respiratory motion but also intensity inhomogeneity, pathological tissues, or heavy noise [9]. Although some regularizations [3-6,9] allow discontinuities, they are based on first-order derivatives, and the smoothness induced may not be strong enough to regularize the resulting deformation. Furthermore, according to [11,12], first-order methods are more dependent on initial alignments between images and are more suitable when an affine linear pre-registration is available; this is not the case for higher-order models because they do not penalize linear transformations. As such, it is challenging to design a variational model that induces both smoothness and discontinuity whilst being less sensitive to initialization. (2) Some higher-order regularizations are non-convex [13-17]. Together with an almost unavoidably non-convex data term, it becomes drastically difficult to devise effective and efficient numerical algorithms to optimize such an energy functional involving dual non-convexities. (3) Almost all higher-order models are second order [10-12,15-18]. As such, it is unclear how the numerical behaviour changes when a derivative order higher than two is used. (4) Non-local and fractional methods [20,21,24] define derivatives using many more pixel points, making their implementation non-trivial and expensive.
The minimization of a variational registration model entails the calculus of variations, by which one obtains an Euler-Lagrange equation with respect to the deformation; this is normally a coupled, higher-order, non-linear partial differential equation (PDE). Various numerical algorithms have been proposed to solve such equations. Among them, most use time-dependent gradient descent, introducing an artificial time variable and then determining the steady-state solution of the PDE [1,11-14,26]. These algorithms are explicit methods based on time-marching schemes and are unfortunately quite slow. Others handle the PDEs directly, with common choices being semi-implicit fixed-point iteration [27,28], Newton-type methods [29,30], multigrid methods [5,18], etc. These algorithms however require advanced knowledge about discretization and can be difficult to implement for higher-order PDEs (e.g. over fourth order). Furthermore, the fast primal-dual algorithm was proposed in [3,4,31] to handle the total variation regularization of displacements. However, due to the iterative nature of these methods, they require a sufficiently large number of iterations to achieve high accuracy (e.g. machine precision), leading to slow convergence. To overcome the limitations of existing image registration methods, in this paper we make four novel contributions by proposing new models and algorithms:

• We propose a new variational model for image registration that uses an arbitrary order total variation regularization. Such a regularization is convex and integrates over the magnitude of an arbitrary order spatial derivative derived from the binomial theorem (see Table 1). Moreover, the data term is a sum of absolute differences (SAD). By combining the SAD data consistency term with the arbitrary order total variation regularization, the proposed model is capable of capturing discontinuities in the resultant deformation while being robust to outlier noise. However, it is non-trivial to minimize such a model due to the non-convexity of the data term, the generality of the regularization order, and the non-differentiabilities of both the data and regularization terms.

• We propose an effective and efficient algorithm to optimize the proposed variational model. We first tackle its non-convexity by linearizing the data term with the first-order Taylor theorem. As a result, the original problem is converted into a convex optimization, for which we propose a fast ADMM solver based on variable splitting. We then break down this convex problem into three subproblems, of which two can be solved by simple thresholding equations (which overcome the non-differentiabilities) and one by the fast discrete cosine transform (DCT) (which handles the generality). All resultant solutions are point-wise and closed-form without inner iterations, and are therefore accurate and efficient. Finally, an extra over-relaxation step is introduced to further accelerate the convergence of ADMM without increasing its overall computational complexity.

• We provide rigorous derivations and proofs in Appendix A regarding how DCT can be used to analytically solve a linear PDE of arbitrary order. To the best of our knowledge, this is the first time that DCT is used in image registration to solve arbitrary order PDEs. We also implement a subgradient method in Appendix D as well as a primal-dual method in Appendix B, which serve as two baselines against our proposed ADMM. Our finding is that ADMM becomes increasingly more efficient than these competing algorithms when higher-order derivatives are used. Furthermore, in Appendix C we propose a novel method, based on primal-dual, to derive the closed-form solution (8) associated with the SAD data term in our model. The method also applies to deriving the soft thresholding (12) associated with the regularization term.

• We perform extensive experiments on three challenging MRI datasets to explain and understand different models and algorithms. Specifically, we design proper numerical experiments to: show the impact of using different higher-order regularizations, compare the robustness of the SAD (L_1) and SSD (L_2) data terms, demonstrate the convergence rate and computational efficiency of different algorithms, select reasonable built-in model parameters for speed and accuracy, and compare our method with state-of-the-art (SOTA) methods, including a second-order free-form deformation (FFD) [10] and four unsupervised deep learning methods [2,32,57-59].

Arbitrary order total variation for image registration
We first propose a non-linear variational registration model using a general, isotropic nth-order regularization, given by

$$\min_{\mathbf{u}} \int_{\Omega} |I_0(\mathbf{x}) - I_1(\mathbf{x}+\mathbf{u})| \, d\mathbf{x} + \lambda \sum_{p=1}^{2} \int_{\Omega} |\nabla^n u_p(\mathbf{x})| \, d\mathbf{x}, \tag{1}$$

where u = (u_1, u_2) : Ω → R² is the displacement field, λ is the regularization weight, and I_0, I_1 : Ω → R are a pair of input images. Here, we use the intensity consistency constraint, which assumes the intensity at point x in the target image is the same as that at point x + u in the source image, i.e.
I_0(x) = I_1(x + u). In addition, ∇^n u_p(x) : Ω → R^{2^n} is the nth-order distributional derivative of u_p(x), whose definition can be found in Table 1. Theoretically, n can be chosen as an arbitrary integer. However, for larger n the subsequent computations become increasingly challenging. We will investigate and analyze the impact of n in Section 4.1.
The objective here is to find u* that minimizes (1). In the data term of (1), the non-linearity of I_1(x + u) with respect to u poses a challenge for optimization. To benefit from closed-form solutions, we borrow the idea of the Gauss-Newton algorithm to handle (1). Specifically, we expand I_1(x + u) using the first-order Taylor theorem around the current displacement field u^ω, i.e.

$$I_1(\mathbf{x}+\mathbf{u}) \approx I_1(\mathbf{x}+\mathbf{u}^\omega) + \langle \nabla I_1(\mathbf{x}+\mathbf{u}^\omega), \mathbf{u}-\mathbf{u}^\omega \rangle, \tag{2a}$$

which allows us to approximate (1) around u^ω with the convex problem

$$\min_{\mathbf{u}} \int_{\Omega} |\rho_{\mathbf{u}^\omega}(\mathbf{u})| \, d\mathbf{x} + \lambda \sum_{p=1}^{2} \int_{\Omega} |\nabla^n u_p| \, d\mathbf{x}, \tag{2b}$$

where, in (2b), ρ_{u^ω}(u) is the data term linearized with respect to u around u^ω:

$$\rho_{\mathbf{u}^\omega}(\mathbf{u}) = I_1(\mathbf{x}+\mathbf{u}^\omega) + \langle \nabla I_1(\mathbf{x}+\mathbf{u}^\omega), \mathbf{u}-\mathbf{u}^\omega \rangle - I_0(\mathbf{x}). \tag{3}$$

In (2a) and (3), I_1(x + u^ω) (or I_1^ω for short) is the image geometrically transformed/warped from I_1(x) using the current iteration of the deformation field u^ω, ∇ is the gradient operator, and ∇I_1(x + u^ω) represents the spatial derivatives of I_1^ω along the vertical and horizontal directions in the image. ⟨·, ·⟩ denotes the inner product. To solve (1) we need to iterate between (2a) and (2b). Furthermore, (2b) alone, the linearized version of (1), also needs to be solved iteratively; the resulting numerical implementation therefore consists of two nested loops.

Table 1. Detailed forms of ∇^n u and |∇^n u| as n varies from 1 to s. When n = s, (s p) denotes 's choose p', a binomial coefficient. Note that computing the variational derivative of a respective |∇^n u| results in a PDE of order 2n. |·| represents the Euclidean norm, and a superscript t means we repeat the respective differential operator t times.
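To make the linearization step concrete, below is a minimal NumPy sketch of computing the warped image I_1^ω, its spatial gradient, and the constant part of ρ_{u^ω} in (3). This is our own illustration, not the released AOTV code; the function name and the use of scipy.ndimage.map_coordinates for bilinear warping are our choices.

import numpy as np
from scipy.ndimage import map_coordinates

def linearize_data_term(I0, I1, u):
    # Warp I1 by the current displacement u = (u_y, u_x) and return the pieces of
    # rho_{u^w}(u) = rho_c + <g, u - u^w>, cf. (3).
    H, W = I0.shape
    yy, xx = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    I1w = map_coordinates(I1, [yy + u[0], xx + u[1]], order=1, mode="nearest")  # I_1^w
    g = np.stack(np.gradient(I1w))      # spatial derivatives (vertical, horizontal)
    rho_c = I1w - I0                    # residual evaluated at u = u^w
    return I1w, g, rho_c

Each outer ("warping") iteration recomputes I1w, g and rho_c, after which the convex problem (2b) is solved for u with these quantities held fixed.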
There are three significant properties of the variational model (1). First, the data term uses SAD, which is based on robust estimation [33]. With this term, the resultant displacement field is less sensitive to outlier noise in the images to be aligned; see our experiments in Section 4.2 for evidence of this claim. Second, the regularization term uses an arbitrary order total variation (i.e. integration over the magnitude of an nth-order distributional derivative), which is capable of recovering discontinuities (e.g. edges) in the final displacement field. Third, the regularization allows reconstruction of a piecewise polynomial function of arbitrary degree for the estimated displacement field. For example, setting n = 1, n = 2, and n = 3 will cause the displacement field to exhibit piecewise constant, piecewise linear, and piecewise quadratic behaviour, respectively. However, it is non-trivial to find an optimal solution to (2b) due to the non-differentiabilities of both the data and regularization terms, as well as the generality in the derivative order. To benefit from the advantages of the model, in the next section we develop an effective and efficient algorithm to minimize (1). Using the proposed algorithm, we show that solving higher-order variational formulations can be as easy as solving their lower-order counterparts.
The proposed arbitrary order total variation regularization in (1) and (2b) has a close relationship to 1D L_1 trend filtering in signal processing. More technically, when we discretize our regularization using the forward finite difference (as in this paper) and discard the second dimension (e.g. the y axis), our regularization is exactly 1D L_1 trend filtering (for an evenly-spaced grid). Trend filtering was independently proposed by Steidl et al. [34], Poschl et al. [35], and Kim et al. [36], and has recently been studied systematically by Tibshirani [37]. Trend filtering has a rich history dating back about 100 years and was studied much earlier than total variation denoising [38]. However, apart from the two tasks (signal processing versus image processing) being fundamentally different, there are other distinctive technical differences. For example, in 1D the gradient is a scalar, but in 2D it is a vector whose dimension increases exponentially with the order of the derivative. We also find that the components of a general 2D gradient follow the binomial theorem (see our Table 1), which is not the case in 1D. Further, in 1D discretization is straightforward, but in 2D we need to consider spatial relations between pixels and pay special attention to boundary conditions. For this, we need the Kronecker products detailed in our Appendix A, which are not needed in 1D. To the best of our knowledge, the proposed general regularization has not been studied in image registration. For these reasons, we consider our regularization a novel contribution.
Our regularization contains only a single term and hence is simpler than other higher-order regularizations that hybridize first- and second-order derivatives, such as the total generalized variation (TGV) [39,40] and those proposed in [41-43]. This simplicity allows us to study the numerical behaviour of very high-order regularizations (e.g. fourth order). With hybrid regularizations, on the other hand, we would end up with additional regularization parameters as well as more penalty weights (if one decides to use our ADMMs), for which it may be challenging to find an optimal combination. The overall implementation would also be more complicated than our Algorithm 1, especially for very high-order derivatives. Moreover, first-order methods are often dependent on initial alignments and are more suitable when an affine pre-registration is available (see [11,12] or our Fig. 4). As such, in image registration where large deformations persist, it may not be beneficial to mix first- with higher-order derivatives.

Faster ADMM with over-relaxation
We first focus on the minimization problem (2b) with ρ_{u^ω}(u) defined in (3). To solve this convex problem, we discretize (2b) and (3) as functions of the pixels in the image domain rather than the continuous domain Ω:

$$\min_{\mathbf{u}} \sum_{i,j} |\rho_{\mathbf{u}^\omega}(\mathbf{u})_{i,j}| + \lambda \sum_{p=1}^{2} \sum_{i,j} |(\nabla^n u_p)_{i,j}|. \tag{4}$$

Here n denotes the derivative order, M × N represents the image size, and 2^n is the number of components in ∇^n u_1 or ∇^n u_2 (see the 3rd column of Table 1). Depending on the value of n, the definition of |(∇^n u_p)_{i,j}|, ∀p ∈ {1, 2}, can be found in the 4th column of Table 1.
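As an illustration of the discrete operators, the sketch below assembles all 2^n components of ∇^n by repeated forward differences, together with the per-pixel magnitude |(∇^n u)_{i,j}|, mirroring Table 1. The helper names are ours, and the replicated-boundary handling is one common choice consistent with Neumann conditions.

import numpy as np
from itertools import product

def forward_diff(a, axis):
    # Forward finite difference; replicating the last slice gives a zero
    # difference on the boundary (Neumann-style)
    return np.diff(a, axis=axis, append=np.take(a, [-1], axis=axis))

def grad_n(u, n):
    # All 2^n components D_{a_1}...D_{a_n} u with each a_k in {y, x}; grouping
    # the components by their number of x-derivatives recovers the binomial
    # coefficients of Table 1
    comps = []
    for axes in product([0, 1], repeat=n):
        d = u
        for ax in axes:
            d = forward_diff(d, ax)
        comps.append(d)
    return np.stack(comps)              # shape (2**n, H, W)

def grad_n_magnitude(u, n):
    # Per-pixel Euclidean norm |(grad^n u)_{i,j}| used in the regularizer
    return np.sqrt((grad_n(u, n) ** 2).sum(axis=0))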
Following the philosophy of ADMM for convex problems [44], we introduce the auxiliary splitting variables v = (v_1, v_2) and w = (w_1, w_2) and rewrite (4) as the constrained problem

$$\min_{\mathbf{u},\mathbf{v},\mathbf{w}} \sum_{i,j} |\rho_{\mathbf{u}^\omega}(\mathbf{u})_{i,j}| + \lambda \sum_{p=1}^{2} \sum_{i,j} |(w_p)_{i,j}| \quad \text{s.t.} \quad v_p = u_p, \; w_p = \nabla^n v_p, \; p \in \{1,2\}. \tag{5}$$

The introduction of the first two constraints decouples u in the regularization from that in the data term, so that a multi-channel nth-order total variation denoising problem can be explicitly formulated. The last two constraints effectively handle the non-differentiability, non-linearity and generality in the regularization.
To guarantee an optimal solution, the constrained problem (5) can be converted into a saddle-point problem solved by ADMM. Its augmented Lagrangian (in scaled form) reads

$$\mathcal{L}(\mathbf{u},\mathbf{v},\mathbf{w};\mathbf{b},\mathbf{d}) = \sum_{i,j}|\rho_{\mathbf{u}^\omega}(\mathbf{u})_{i,j}| + \lambda\sum_{p}\sum_{i,j}|(w_p)_{i,j}| + \frac{\theta_1}{2}\sum_{p}\|u_p - v_p + b_p\|_2^2 + \frac{\theta_2}{2}\sum_{p}\|\nabla^n v_p - w_p + d_p\|_2^2, \tag{6}$$

where b = (b_1, b_2) and d = (d_1, d_2) are Lagrangian multipliers (dual variables), and θ_1 and θ_2 are penalty weights which affect how fast the minimization converges; they will be studied numerically in Section 4.4. In (6), we have three sets of primal variables (u, v, w) and two sets of dual variables (b, d). Saddle points are computed by minimizing (6) with respect to the primal variables while maximizing it with respect to the dual variables. As per convex optimization [45], a saddle point of the augmented Lagrangian (6) gives a minimizer of the constrained minimization problem (5).

Solving subproblems w.r.t. model variables
To optimize (6), we need to decompose it into three subproblems with respect to u, v and w, and then update the Lagrangian multipliers b = (b_1, b_2) and d = (d_1, d_2) until the process converges.
1) u-subproblem: this is an affine linear L_1 problem which can be solved by considering the following minimization problem

$$\mathbf{u}^{k+1} \leftarrow \arg\min_{\mathbf{u}} \sum_{i,j}|\rho_{\mathbf{u}^\omega}(\mathbf{u})_{i,j}| + \frac{\theta_1}{2}\sum_{p}\|u_p - v_p^k + b_p^k\|_2^2, \tag{7}$$

which has a closed-form, point-wise solution given by the following thresholding equation. Letting q = v^k − b^k and g = ∇I_1^ω,

$$\mathbf{u}^{k+1} = \mathbf{q} + \begin{cases} \mathbf{g}/\theta_1 & \text{if } \rho_{\mathbf{u}^\omega}(\mathbf{q}) < -|\mathbf{g}|^2/\theta_1, \\ -\mathbf{g}/\theta_1 & \text{if } \rho_{\mathbf{u}^\omega}(\mathbf{q}) > |\mathbf{g}|^2/\theta_1, \\ -\rho_{\mathbf{u}^\omega}(\mathbf{q})\,\mathbf{g}/(|\mathbf{g}|^2+\epsilon) & \text{otherwise}, \end{cases} \tag{8}$$

where the third case applies when abs(ρ_{u^ω}(q)) ≤ |g|²/θ_1, and abs(·) denotes the absolute value of a scalar input. (We use |·| and |·|² to denote the Euclidean norm and squared Euclidean norm for vectors defined at each pixel, and ‖·‖_1 and ‖·‖²_2 to denote the L_1 norm and squared L_2 norm for vectors defined in the image space.) Note that for simplicity we omit the subscripts i, j on each variable in the point-wise solution (8), and that ε is a small positive value to avoid division by zero. In Appendix C, we propose a novel primal-dual method to derive (8) in detail.
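A NumPy sketch of (8) for the whole image at once; this follows the standard three-case analysis for linearized L_1 data terms, with our own variable names (q is v^k − b^k, while rho_c and g come from the linearization sketch above).

import numpy as np

def u_update(q, g, rho_c, u_w, theta1, eps=1e-9):
    # Point-wise thresholding (8): q, g, u_w have shape (2, H, W); rho_c is (H, W)
    rho_q = rho_c + (g * (q - u_w)).sum(axis=0)   # linearized residual at q, cf. (3)
    g2 = (g ** 2).sum(axis=0)                     # |g|^2 at every pixel
    step = np.where(rho_q < -g2 / theta1,  1.0 / theta1,
           np.where(rho_q >  g2 / theta1, -1.0 / theta1,
                    -rho_q / (g2 + eps)))         # the 'otherwise' case
    return q + step * g                           # broadcast over the two channels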
2) Over-relaxation: one approach to accelerating the convergence of ADMM is to account for past values when computing subsequent iterates. This technique is called relaxation and amounts to replacing u^{k+1} with

$$\hat{\mathbf{u}}^{k+1} = \alpha\,\mathbf{u}^{k+1} + (1-\alpha)\,\mathbf{v}^k \tag{9}$$

in the subsequent v- and dual-variable updates. The parameter α ∈ (0, 2) is called the relaxation parameter. Note that letting α = 1 recovers the plain ADMM iterations without acceleration. Empirical studies show that over-relaxation, i.e., letting α > 1, is often advantageous, and the guideline α ∈ [1.5, 1.8] has been proposed [46]. In [47], the authors have also shown it is beneficial to use over-relaxation. In the experiments, we will examine the impact of different values of α and select the one which accelerates convergence most.

3) v-subproblem: this is a linear L_2 problem that can be solved by optimizing the following problem

$$\min_{v_p} \frac{\theta_1}{2}\|\hat{u}_p^{k+1} - v_p + b_p^k\|_2^2 + \frac{\theta_2}{2}\|\nabla^n v_p - w_p^k + d_p^k\|_2^2,$$

where p ∈ {1, 2}. The minimization of this subproblem entails the following equation with respect to v_p:

$$\big(\theta_1 I + \theta_2 (-1)^n \mathrm{div}^n \nabla^n\big)\,v_p = \theta_1\big(\hat{u}_p^{k+1} + b_p^k\big) + \theta_2 (-1)^n \mathrm{div}^n\big(w_p^k - d_p^k\big),$$

which is a 2nth-order PDE and has a closed-form solution that can be obtained using the discrete cosine transform (DCT) under the assumption of Neumann boundary conditions (normal derivative equal to zero on the boundaries). We note that in many recent image restoration works [45,48-51] the discrete Fourier transform (DFT) was frequently used to solve similar PDEs under periodic boundary conditions. Imposing periodicity, however, introduces discontinuities (and therefore artefacts) at the boundaries. We therefore propose to use DCT instead of DFT. We note that DCT has been used in various imaging problems. For example, Ng et al. [52] used DCT to diagonalize block Toeplitz-plus-Hankel matrices derived from a quadratic image deblurring problem. Strang et al. [53] studied eight variants of DCTs (DCT-1 to DCT-8) and applied DCT-2 to an image compression problem. Setzer et al. [42] used DCT to tackle a mixed first- and second-order image denoising problem, which is similar to TGV [39] and infimal convolution [41]. However, DCT has not been used to solve our arbitrary order PDEs.

Algorithm 1: Over-relaxed ADMM for (1)
for each scale (coarse to fine) do
    resize− the images to the current scale
    while ω < N_warp or (13) does not hold do        # Taylor expansions
        warp I_1 with u^ω; initialize variables and Lagrangian multipliers
        while k < N_iter or (14) does not hold do    # iterations
            update u^{k+1} using (8) with I_1^ω and ρ_{u^ω}   # closed-form, point-wise solution
            update û^{k+1} using (9)                          # accelerated over-relaxation step
            update v^{k+1} using (10)                         # closed-form, point-wise solution
            update w^{k+1} using (12)                         # closed-form, point-wise solution
            update b^{k+1} and d^{k+1}
        end
    end
    resize+ u (upsampling with factor 2)
end
With DCT, we have the closed-form solution to the linear PDE above:

$$v_p^{k+1} = \mathcal{D}^{-1}\Big( \mathcal{D}\big(\theta_1(\hat{u}_p^{k+1} + b_p^k) + \theta_2(-1)^n \mathrm{div}^n(w_p^k - d_p^k)\big) \oslash \big(\theta_1 + \theta_2 \Lambda\big) \Big), \tag{10}$$

where D and D^{−1} respectively denote the DCT and inverse DCT; ∇^n is the nth-order differential operator discretized via the forward finite difference; (−1)^n div^n is the nth-order adjoint operator discretized via the backward finite difference; div^n(∇^n) is the 2nth-order differential operator; and ⊘ stands for the point-wise division of matrices. The entries of the general coefficient matrix Λ are

$$\Lambda_{r,q} = \Big(4 - 2\cos\frac{r\pi}{H} - 2\cos\frac{q\pi}{W}\Big)^{n}, \tag{11}$$

where H and W stand for the image height and width, and r ∈ [0, H) and q ∈ [0, W) are frequency indices. Such a coefficient matrix essentially consists of the eigenvalues of (−1)^n div^n(∇^n), whose eigenvectors are DCT basis functions. In Appendix A, we prove how to derive a solution like (10) and (11) from a general linear PDE.

4) w-subproblem: this subproblem has the form

$$\mathbf{w}^{k+1} \leftarrow \arg\min_{\mathbf{w}} \lambda\sum_{p}\sum_{i,j}|(w_p)_{i,j}| + \frac{\theta_2}{2}\sum_{p}\|\nabla^n v_p^{k+1} - w_p + d_p^k\|_2^2,$$

which is a linear L_1 problem. Its solution is the following analytical, point-wise generalized soft thresholding equation

$$w_p^{k+1} = \max\big(|y_p| - \lambda/\theta_2,\, 0\big)\,\frac{y_p}{|y_p|}, \tag{12}$$

with the convention 0 · (0/0) = 0 and y_p = ∇^n v_p^{k+1} + d_p^k. Here, for simplicity we omit the subscripts i, j on each variable in the point-wise solution (12). Note that (12) can also be derived using the primal-dual method developed in Appendix C.

5) b and d updates: at each iteration, one also needs to update the augmented Lagrangian multipliers b^{k+1} and d^{k+1}, as shown in Algorithm 1.
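The two closed-form updates above are short enough to sketch directly. Below, the DCT solve of the 2nth-order PDE follows (10)-(11) (with SciPy's orthonormal DCT-II, which matches the Neumann discretization in Appendix A), and the shrinkage implements (12); the right-hand side rhs is assumed to have been assembled from û^{k+1}, b^k, w^k and d^k, and the names are ours.

import numpy as np
from scipy.fft import dctn, idctn

def solve_2nth_order_pde(rhs, n, theta1, theta2):
    # Solve (theta1*I + theta2*(-1)^n div^n grad^n) v = rhs under Neumann BCs, cf. (10)
    H, W = rhs.shape
    r = np.arange(H)[:, None]
    q = np.arange(W)[None, :]
    lam = (4.0 - 2.0 * np.cos(np.pi * r / H) - 2.0 * np.cos(np.pi * q / W)) ** n  # (11)
    return idctn(dctn(rhs, norm="ortho") / (theta1 + theta2 * lam), norm="ortho")

def soft_threshold(y, thresh):
    # Generalized soft thresholding (12); y has shape (C, H, W) with C = 2^n
    mag = np.sqrt((y ** 2).sum(axis=0, keepdims=True))
    scale = np.maximum(mag - thresh, 0.0) / np.maximum(mag, 1e-12)  # 0/0 -> 0 convention
    return scale * y

In the ADMM loop, one would call soft_threshold(grad_n(v, n) + d, lam_reg / theta2) for each channel p, with grad_n as sketched earlier.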

Stopping criteria
In Algorithm 1, two stopping criteria are defined to automatically break the loops. The first is

$$\frac{|E(\mathbf{u}^{\omega+1}) - E(\mathbf{u}^{\omega})|}{E(\mathbf{u}^{\omega})} < \epsilon_1, \tag{13}$$

which says that the relative change of the data term E (without linearization) between two consecutive iterations should be smaller than a positive threshold ε_1. The second stopping criterion, the relative residual associated with the displacement field u = (u_1, u_2), is defined as

$$\frac{\|\mathbf{u}^{k+1} - \mathbf{u}^{k}\|_2}{\|\mathbf{u}^{k}\|_2} < \epsilon_2. \tag{14}$$

In summary, to handle the original non-linear, non-differentiable and non-convex minimization problem (1), we break it down into a warping problem (2a) and a linearized, convex minimization problem (2b). We then develop an accelerated, ADMM-based iterative algorithm that decomposes (2b) into an over-relaxation step and three simple subproblems. Each subproblem has a closed-form, point-wise solution and can be efficiently handled using existing numerical methods (i.e. DCT and soft thresholding). As such, the algorithm is very accurate and efficient, as confirmed in Section 4.3.
Since the Taylor expansion is used to linearize the intensity difference, we can only recover small displacements by minimizing (2b). As such, Algorithm 1 needs an extra warping operation (via u^ω). With warping, we convert a relatively large displacement into N_warp small ones, each of which can be solved iteratively and optimally. In the algorithm, resize− indicates that we downsample the image at integer locations, while resize+ denotes that the image is upsampled via bilinear interpolation, which is also used in the warping operation. Since minimizing a large displacement field is a non-convex problem, in our implementation we also embed Algorithm 1 into a multi-scale (pyramid) scheme to avoid convergence to local minima. When an N_scale pyramid is used, the total number of iterations in the algorithm becomes N_scale × N_warp × N_iter.
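A self-contained sketch of the multi-scale wrapper and the two stopping tests (13)-(14); the inner solver (the N_warp × N_iter loops of Algorithm 1) is passed in as a function, and SciPy's zoom stands in for the paper's resize−/resize+ operations. All names here are our own.

import numpy as np
from scipy.ndimage import zoom

def relative_change(e_new, e_old, eps1=0.02):
    # Stopping criterion (13): relative change of the non-linearized data term
    return abs(e_new - e_old) / max(abs(e_old), 1e-12) < eps1

def relative_residual(u_new, u_old, eps2=1e-5):
    # Stopping criterion (14): relative residual of the displacement field
    return np.linalg.norm(u_new - u_old) / max(np.linalg.norm(u_old), 1e-12) < eps2

def multiscale(I0, I1, solve_warp_admm, n_scale=3):
    # Coarse-to-fine pyramid: solve at each scale, then upsample the displacement
    # (doubling its magnitude) before refining at the next finer scale (resize+)
    u = None
    for s in reversed(range(n_scale)):
        J0 = zoom(I0, 1 / 2 ** s, order=1)   # resize-: downsample by 2^s
        J1 = zoom(I1, 1 / 2 ** s, order=1)
        if u is None:
            u = np.zeros((2,) + J0.shape)
        else:
            u = 2.0 * np.stack([zoom(c, 2, order=1) for c in u])
            u = u[:, :J0.shape[0], :J0.shape[1]]     # crop to this level's size
        u = solve_warp_admm(J0, J1, u)               # N_warp x N_iter inner loops
    return u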

Experimental results
To evaluate the performance of different models and algorithms, we present five subsections: impact of derivative orders, robustness against outliers, algorithm efficiency, parameter selection, and comparison with the SOTA. First, we investigate and analyze how an increased derivative order n affects the numerical results and model performance. Second, we compare the robustness and performance of the L_1 (SAD) and L_2 (SSD) data terms. Third, we show the impact of the over-relaxation parameter α and compare our accelerated ADMM with a subgradient method and a popular primal-dual algorithm for non-smooth optimization. Next, we explain the role each built-in parameter of the algorithm plays and show how to tune reasonable values for these parameters in order to maximize both accuracy and speed. Finally, we introduce the datasets and quantitative metrics used for the experimental comparison, give the implementation details of the competing methods, and study all methods quantitatively and qualitatively.

Fig. 2. A piecewise quadratic displacement field (2nd image) is simulated to warp the source (1st image) to the target (3rd image). The 4th image is the displacement field in the x direction, and the 5th image is the middle cross-section profile of the 4th image. The source and target are used to compute the results in Fig. 3.

Impact of derivative orders
To understand the numerical behaviour of varying derivative orders in our proposed model, we analyze the L_2 smoothing problem

$$\min_{\mathbf{u}} \frac{1}{2}\int_\Omega |\mathbf{u}-\mathbf{f}|^2\,d\mathbf{x} + \frac{\lambda}{2}\sum_{p=1}^2 \int_\Omega |\nabla^n u_p|^2\,d\mathbf{x},$$

which is a simplified version of the v-subproblem in Section 3.1. As such, the numerical analysis established below may help in understanding the L_1 formulation (2b), since the v-subproblem is one of its subproblems. Here, f(x) and u(x) : Ω → R², and λ is a smoothing parameter. Note that this problem can be tackled explicitly by DCT, and the optimal solution is

$$\mathbf{u}^* = \mathcal{D}^{-1}\big(M \cdot \mathcal{D}(\mathbf{f})\big), \quad M_{r,q} = \frac{1}{1 + \lambda\big(4 - 2\cos\frac{r\pi}{H} - 2\cos\frac{q\pi}{W}\big)^n},$$

where · denotes point-wise multiplication. In u*, M is a soft mask whose values are determined by λ and n.
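A short sketch of this closed-form smoother (our illustration); soft_mask reproduces M above, and experimenting with it shows that matching mask sizes across different n requires exponentially larger λ.

import numpy as np
from scipy.fft import dctn, idctn

def soft_mask(H, W, lam, n):
    # DCT-domain soft mask M determined by lambda and the derivative order n
    r = np.arange(H)[:, None]
    q = np.arange(W)[None, :]
    return 1.0 / (1.0 + lam * (4 - 2*np.cos(np.pi*r/H) - 2*np.cos(np.pi*q/W)) ** n)

def l2_smooth(f, lam, n):
    # Closed-form minimizer u* = D^{-1}(M . D(f)) applied to one channel of f
    M = soft_mask(*f.shape, lam, n)
    return idctn(M * dctn(f, norm="ortho"), norm="ortho")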
In Fig. 1 (left), we plot different Ms by varying λ and n. In Fig. 1 (right), we show the corresponding smoothed results obtained using these Ms. From Fig. 1, we can see that as n goes higher, λ must increase exponentially to shrink the soft mask so that the corresponding deformation is smoothed; in other words, a large n makes the problem less sensitive to λ. Numerically speaking, the use of a big λ results in ill-conditioned systems for inversion, thus making the overall computation difficult. We verify this claim in Table 2. It is clear that the deformations become increasingly smooth as the soft masks get smaller. This is because, after a deformation is transformed into the DCT space, the corner locations contain mostly low-frequency information; applying a soft mask is equivalent to low-pass filtering the deformation.

Fig. 3. 1st-4th rows: 1st-order TV, 2nd-order TV, 3rd-order TV, and 4th-order TV, respectively. Clearly the best results are acquired by using 3rd- and 4th-order TVs, since they are capable of modeling piecewise quadratic signals.

As can be seen from Fig. 3, setting n = 1, n = 2, and n = 3 makes the displacement field exhibit piecewise constant, piecewise linear, and piecewise quadratic behaviour, respectively. It is also obvious from Figs. 3 and 4 that higher-order models induce stronger smoothness and are more robust to larger deformations. These two simple examples (Figs. 3 and 4) suggest that higher-order TVs can be more effective than their lower-order counterparts when the underlying displacement field possesses high-order behaviour.

Robustness against outliers
We now examine whether the L_1 data term (SAD) in (1) leads to more robust performance against outlier noise in images. For this, we compare the L_1 data term with its L_2 counterpart (SSD), both implemented with the 1st-order TV regularization, on noisy images. The results are given in Fig. 5, from which it is evident that L_1 is more robust than L_2 against noise.
Technically, the use of the L_2 norm between prediction and observation is known as least squares regression, whilst the use of the L_1 norm is known as least absolute deviations regression. The latter is a robust fitting method, meaning that least absolute deviations regression is insensitive to outlier data points [33]. More specifically, L_2, which squares the error/residual, increases the importance of larger errors; the resulting fit is focused especially on minimizing these large errors, which are often due to outliers in the data. In other words, L_2 tends to overfit to outliers in a dataset. In contrast, by using the absolute error (L_1), we treat negative and positive errors equally but do not exaggerate the importance of large errors/residuals, which leads to a robust fit that is insensitive to outlier noise, as demonstrated in Fig. 5.

Fig. 4. Top row: the 1st image (target) is first locally deformed to the 2nd image (source 1), which is then further affine-transformed to the 3rd image (source 2). Bottom row: results using 1st-order TV between source 1 and target, results using 1st-order TV between source 2 and target, and results using 3rd-order TV between source 2 and target, respectively. 1st-order methods are more dependent on initial alignments between images, so they are more suitable when an affine linear pre-registration is available; this is not the case for higher-order methods.

Fig. 5. Performance comparison between L_2 and L_1 data terms. 1st row: source image and ground-truth displacement field between the source and target images. 2nd row: target image and its corrupted version with 50% salt-and-pepper noise. 3rd row: L_2 results (left: displacement field between source and target images; right: displacement field between source and corrupted target images). 4th row: L_1 results (left: displacement field between source and target; right: displacement field between source and corrupted target).

Fig. 7. Segmentation masks versus multi-scale images. 1st column: segmentation masks of the images in the 2nd column. 2nd-5th columns: images at different scales; from left to right the image is downsampled 1, 2, 4, and 8 times, respectively. In the masks, the white region represents the right ventricle cavity (RV), the light gray region denotes the left ventricle cavity (LV), and the dark gray region indicates the LV myocardium (LVM). 1st-2nd rows: short-axis images; 3rd-4th rows: long-axis images.
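A toy illustration (ours, not from the paper) of the robustness argument above: when fitting a constant to data, the L_2 minimizer is the mean and the L_1 minimizer is the median, and only the former is dragged away by a few gross outliers.

import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(1.0, 0.1, 95), np.full(5, 50.0)])  # 5% gross outliers

l2_fit = x.mean()      # least squares estimate: pulled toward 50 (about 3.4)
l1_fit = np.median(x)  # least absolute deviations estimate: stays near 1.0
print(l2_fit, l1_fit)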

Algorithm efficiency
In Fig. 6, we compare the computational speed of the accelerated ADMM (α = 1.8), the standard ADMM (α = 1), the primal-dual, and the subgradient method on an image pair called 'Yosemite' from the Middlebury dataset. For all algorithms, we set the common parameters (λ = 0.1, ε_1 = ×, ε_2 = 2^{−23}, N_warp = 1, N_iter = ×, scale = {1}), where '×' means we do not use that stopping criterion to break the loop. For both ADMMs, we set (θ_1 = 1, θ_2 = 0.1). We also use the optimal parameters for both the primal-dual and subgradient methods (see Appendix B and Appendix D). Among all algorithms, the subgradient method is the slowest, which is not surprising because subgradient methods have a convergence rate of O(1/√k) [54]. In contrast, primal-dual has a rate of O(1/k) [31]. It is also clear that the accelerated ADMM is faster than the standard ADMM. Our accelerated ADMM is the fastest algorithm, especially for higher-order cases.
In Table 3, we further compare the runtime of the accelerated ADMM and primal-dual on three other image pairs from the dataset using different λs. For this comparison, we used θ_1 = 1 and θ_2 = 0.1 for ADMM, σ = τ = 1/(λ√(8^n)) for primal-dual, and (ε_1 = ×, ε_2 = 10^{−3}, N_warp = 5, N_iter = ×, scale = {4, 2, 1}) for both algorithms. From the table, we observe that for n = 1 ADMM has comparable speed to primal-dual, but for n = 2 ADMM becomes much faster. In essence, primal-dual is a proximal gradient method and its convergence speed is determined by the step sizes σ and τ. For higher-order models, the step sizes must be decreased exponentially to guarantee convergence, drastically slowing the method down. This comparative experiment was done in Matlab 2020 on a Dell laptop with a 2.60 GHz Intel i7-9850H CPU (16 GB RAM). In addition, we have also implemented our ADMM on a V100 GPU in PyTorch with the same model parameters, which boosts the speed by a factor of 10.57 on average compared to that in Table 3.

Parameter selection
Now we explain how to select the built-in parameters (n, α, θ_1, θ_2, λ, ε_1, ε_2, N_warp, N_iter, scale) in Algorithm 1 for the final comparative experiments that follow:

• n ∈ {1, 2, 3, 4} will be studied separately, and α, as per the last section, is set to 1.8 for faster convergence.
• θ_1 and θ_2 are convergence parameters, different combinations of which affect the convergence rate but may not change the minimiser of (2b) due to its convexity. In Table 4, we verify this claim using the short-axis image pair in Fig. 7 from the UK Biobank (UKBB) [55]. We set (α = 1.8, ε_1 = ×, ε_2 = 2^{−23}, N_warp = 5, N_iter = ×, scale = {4, 2, 1}) with an optimal λ for each model. As is evident, within each model the mean Dice and the mean square error are almost identical for different combinations of θ_1 and θ_2. However, the number of iterations varies drastically. By selecting suitable θ_1 and θ_2, we may achieve fast convergence while still benefiting from the same numerical accuracy.
• λ is a smoothing parameter, controlling the smoothness and accuracy of the final deformations. To find the optimal λ for each model, we perform parameter sweeps over a set of λs, using both image pairs in Fig. 7 as an example. Specifically, first, we define a set of λs, and for each λ, together with (α = 1.8, ε_1 = ×, ε_2 = 2^{−23}, N_warp = 5, N_iter = ×, scale = {4, 2, 1}) and the optimal θs identified in Table 4, we use Algorithm 1 to compute the deformation between the source and target images in Fig. 7 for each model. Second, we warp the source segmentation mask using the deformation. Third, for each region (i.e. LV, LVM and RV), we compute a Dice value between the warped source mask and the target mask. The optimal λ for each model on both image pairs is identified by the peak point, as shown in the 2nd and 4th rows of Fig. 8. It is clear that as the regularization order goes higher, we need a larger λ to reach comparable accuracy, which is largely in line with the observation in Fig. 1.

Table 4. Numerical impact of using different combinations of θ_1 and θ_2 on the image pair in Fig. 7 for each model.

• ε_1 and ε_2 are two relative residuals, given in (13) and (14), which determine the actual number of warpings and iterations in Algorithm 1, respectively. In the final comparative experiments, we assign 2% to ε_1 and 10^{−5} to ε_2. We also set N_warp = 5 and N_iter = 500 to guarantee the algorithm stops at some point.

Comparison with the state-of-the-art
Knowing the behavior of the built-in parameters, we then compare performance with SOTA methods on three MR image datasets, both quantitatively and qualitatively. The first dataset used is UKBB-1 [55], from which we randomly select 220 healthy subjects; for each subject we have a corresponding 4D (3D+t) volume with t = 50 representing a complete cardiac cycle. For each volume, the in-plane resolution of each image slice is 1.8 × 1.8 mm² per pixel and the through-plane resolution is 10 mm. Due to the large gaps between slices, it is physiologically implausible to do 3D registration. The second dataset comes from ACDC [56], which is composed of 100 patients with four types of pathology. In this dataset, the image slice thickness ranges from 5 mm to 10 mm and the spatial resolution varies from 1.34 mm to 1.68 mm per pixel. We resample the in-plane resolution of all images to 1.8 × 1.8 mm² before experiments. In addition to the first two datasets, which contain only short-axis cardiac images, we use a third dataset, UKBB-2. This dataset contains long-axis images from 220 subjects, and the resolution of each image is 1.8 × 1.8 mm² per pixel. For all three datasets, we perform experiments on images at the end-diastolic (ED) and end-systolic (ES) frames and crop the images to 128 × 128 pixels. For the short-axis datasets, we only use mid-ventricular slices. As an example, we show a pair of short-axis images (top) and long-axis images (bottom) in Fig. 7.
To evaluate the accuracy of the resulting deformations, we use Dice and the Hausdorff distance (HD) as quantitative metrics. For Dice, the segmentation masks of LV, LVM and RV at the ED (target) and ES (source) frames are used, which are available in all datasets. Dice ranges from 0 to 1, with high values corresponding to a better match. HD is computed on an open-ended scale, with smaller values implying a better result. To compute these metrics for paired images, we estimate the deformation by registering the ES image to its respective ED image. With the deformation, we deform the segmentation mask at ES to ED and measure its overlap and contour distance to the ED mask using Dice and HD, respectively.
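For reference, a minimal sketch of the two metrics on binary masks; this is our own illustration, and note the paper measures HD as a contour distance, so one would extract boundary points first rather than use all mask pixels as below.

import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice(mask_a, mask_b):
    # Dice overlap between two binary masks (1 = perfect match)
    inter = np.logical_and(mask_a, mask_b).sum()
    return 2.0 * inter / (mask_a.sum() + mask_b.sum())

def hausdorff(mask_a, mask_b):
    # Symmetric Hausdorff distance between the point sets of two binary masks
    pa, pb = np.argwhere(mask_a), np.argwhere(mask_b)
    return max(directed_hausdorff(pa, pb)[0], directed_hausdorff(pb, pa)[0])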
We compare our models with the following five SOTA approaches. Note that for the learning-based methods, we split UKBB-1 into 100/20/100, ACDC into 40/20/40, and UKBB-2 into 100/20/100 for training, validation, and test. The regularization weights in the learning methods are selected to maximize performance on the validation sets. All methods are compared on the test sets only.

Table 5. Comparison of image registration performance using different methods. 'All' means that Dice or HD is computed by averaging that of LV, LVM and RV over all subjects in the test set. Mean values are reported for UKBB-1, ACDC, and UKBB-2 separately. Note that apart from FFD, which is tested on a CPU, we use an A100 GPU (40 GB RAM) to run the other methods.

• The first is an optimization-based method using B-spline free-form deformation (FFD) [10], which ranks in the top three among 14 non-rigid registration methods [60] and is therefore a strong baseline. For this method, we use the official MIRTK implementation (https://github.com/BioMedIA/MIRTK). In terms of parameters, we use SSD as the similarity and bending energy as the regularization. Moreover, a three-level multi-scale approach (similar to ours) is used, where the spacing of B-spline control points at the highest scale is set to 8 mm.

• The second is a learning-based method called VoxelMorph [2] (https://github.com/voxelmorph/voxelmorph). For this method, we first pass a pair of ED and ES images through a 2D deterministic U-Net [61], whose output prediction is a displacement field. The ES image is then warped using the displacement field in a bilinear spatial transformer layer [62]. The loss function is a combination of similarity (MSE between the warped ES image and the ED image) and smoothness (first-order diffusion regularization). We note that a recent work [63] has shown that using U-Net as the backbone is still very competitive and can be more effective than transformer-based methods [64].
• The third method we compare is from [32,57]. Its idea differs from VoxelMorph in two aspects: (1) the U-Net is replaced with a Siamese-style multi-scale network; (2) the loss controlling smoothness is replaced with the Huber loss [65].

• The fourth method we compare is SYM-Net from [58]. The backbone of this network is a U-Net, which predicts a stationary velocity field. Scaling and squaring is then applied to this velocity field to produce the final deformation. We use the MSE and diffusion regularization losses for SYM-Net.

• The fifth method we compare is Bspline-Net from [59]. The backbone of this network is a U-Net, which predicts stationary velocities at the control points. The deformation is obtained by first evaluating B-spline functions at the control points and then integrating via scaling and squaring. We use the MSE and diffusion regularization losses for Bspline-Net.
In Table 5, we show the quantitative results of the different methods on UKBB-1, ACDC, and UKBB-2. We observe that the best results on UKBB-1, highlighted in bold, come from either SYM-Net or our proposed methods, among which 2ndO-TV performs best while both 3rdO-TV and 4thO-TV are competitive. On ACDC, FFD and 4thO-TV are better in terms of HD, while 1stO-TV, 2ndO-TV and 4thO-TV are better in terms of Dice. On UKBB-2, our 2ndO-TV consistently outperforms the other methods on most anatomies in terms of both Dice and HD. Overall, our methods are superior to the compared methods in most cases, indicating the effectiveness of model-based approaches for this unsupervised task. When considering only the inference time in Table 5, deep learning-based methods outperform our proposed methods, but this does not account for the significant amount of training time they require. One advantage of our methods is their overall efficiency (runtime up to a second), owing to the fact that they are model-driven and free of training.

Fig. 9. Comparing visual results obtained by different registration methods on one representative example from UKBB-1. 1st column shows original images (1st and 3rd slots), segmentation masks (2nd and 4th slots), and absolute differences (5th slot) between corresponding source and target; 2nd-10th columns show FFD, VoxelMorph, Siamese, SYM-Net, Bspline-Net, 1stO-TV, 2ndO-TV, 3rdO-TV, and 4thO-TV results, respectively; 1st-5th rows (excluding the 1st column) show warped sources, warped source masks (with contours from the target mask superimposed), deformation grids, displacement fields, and absolute differences between corresponding warped sources and target, respectively.
In Figs. 9 and 10, we compare the visual results of different methods. FFD, built on an L_2 regularization, over-smooths the displacement field, leading to an under-estimated mask warping. Our methods (last 4 columns of each figure), which use an L_1 regularization, are able to preserve discontinuities. As such, the resultant displacement fields are more dense and clustered (indicated by red arrows in Fig. 10), and their warped masks are therefore closer to their respective ground truths. As for the displacement fields, 1stO-TV produces staircase artifacts, and the smoothness it induces does not appear strong enough to regularize the deformation grid. In contrast, the higher-order models (last 3 columns of each figure) are able to reconstruct piecewise smooth deformations. We also observe that the learning-based methods perform less accurately than the model-based methods. The improved accuracy of our methods is prominent in the first row of Fig. 10, where the mitral valve (indicated by red circles) in the warped source images can be seen to be significantly closer to its position in the target image when our methods are applied. Additionally, an improved correspondence to the target mask can be observed in the second row of Fig. 10; again, this is particularly obvious in the region close to the mitral valve. Collectively, Figs. 9 and 10, together with Table 5, provide ample evidence that our methods are capable of registering images robustly and accurately.

Conclusion
In this paper, we have proposed a new variational model, where the regularization is a term that involves a derivative of arbitrary order. We have then proposed a point-wise, closed-form and over-relaxed ADMM solver, providing an easy way to solve this general model accurately and efficiently. We have also presented comprehensive derivations, theorems, proofs and experiments for the relevant formulations and claims. Extensive experiments have shown that the proposed approaches outperform state-of-the-art methods, including the subgradient method, primal-dual, FFD, and learning-based methods. The proposed regularization and accelerated ADMM solver can be generalized to many other image processing problems.
The experimental results in Fig. 3 and Table 5 deliver two important messages: (1) the best performance is not always achieved by one specific model (e.g. 2nd-order TV); (2) depending on the application, one should select a suitable model carefully. For example, when the underlying distribution of a displacement field is piecewise constant, the 1st-order TV should be used; it is, however, not suitable for large displacements compared to its higher-order counterparts. As the displacement field in Fig. 3 is piecewise quadratic, we expect the 3rd-order and 4th-order TVs to outperform the 1st-order and 2nd-order TVs, because both the 3rd-order and 4th-order regularizers can represent quadratic behaviour. The reason why in Table 5 most of the best quantitative results come from the 2nd-order TV is that the underlying displacement field most likely follows a piecewise linear distribution.
One drawback of our proposed methods is that they can produce folding (singular, non-invertible points) in the resultant deformation grids (see Figs. 9 and 10), which may lead to numerical instability or warping artifacts. This problem can be eased by using a large regularization parameter λ; a good example can be seen in Fig. 1, where the deformation grid gradually becomes disentangled as the value of λ increases. However, setting λ too large reduces the deformation magnitude and therefore loses the ability to model large deformations. To tackle this dilemma, in future research we will combine diffeomorphisms [7,66,67], which are capable of computing smooth, invertible spatial transformations between images, with our discontinuity-preserving regularizations.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability
Data will be made available on request.

Appendix A
In this section, we show in detail how to derive the solution of the v-subproblem in Section 3.1. At its core, this requires solving a Poisson-like equation. In the following, we start from a 1D problem and then derive the respective 2D problem; the general cases are given at the end of this section. Note that our derivations for the 1D problem below are inspired by Section 4.5.1 in [68]. However, for the Neumann boundary conditions, their derivations do not line up directly with the DCT coefficients (eigenvalues) and bases (eigenvectors).

A1. One-dimensional problem
The 1D Poisson-type equation is given as

$$U(x) - U''(x) = F(x), \quad x \in (a, b). \tag{A.1}$$

Under the Neumann boundary conditions, we have U′(a) = 0 and U′(b) = 0. On a regular grid, discretizing the above differential equation with a finite difference method leads to a linear system:

$$(I_n - T_n)\,\mathbf{u} = \mathbf{f}. \tag{A.2}$$

Here, u and f are respectively the discrete vectors of U(x) and F(x), each of length n; I_n is an identity matrix of size n × n; and T_n is a tridiagonal coefficient matrix of size n × n, which has the form

$$T_n = \begin{pmatrix} -1 & 1 & & & \\ 1 & -2 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & 1 & -2 & 1 \\ & & & 1 & -1 \end{pmatrix},$$

where the first and last diagonal entries are −1 because of the Neumann boundary conditions. Of note, T_n has a highly structured eigensystem that is closely related to a fast trigonometric transform; in other words, T_n can be efficiently diagonalized by its eigenvectors. The goal here is therefore to establish these eigensystem/transform connections and show how they can be used to design a very fast and effective Poisson solver. Next, we examine the eigenstructure of this matrix.
Let θ be any real number and, for any integer k, let c_k = cos((k + 1/2)θ) and s_k = sin((k + 1/2)θ). Consider the following two trigonometric identities:

$$c_{k+1} = \cos\big((k+\tfrac{1}{2})\theta + \theta\big) = c_k\cos\theta - s_k\sin\theta,$$
$$c_{k-1} = \cos\big((k+\tfrac{1}{2})\theta - \theta\big) = c_k\cos\theta + s_k\sin\theta.$$

By adding the first identity to the second, we have

$$c_{k-1} + c_{k+1} = 2\cos(\theta)\,c_k, \tag{A.4}$$

which will be used to compute the eigenvectors and eigenvalues of T_n.
Focusing on the first component in the right-hand side of the equation above, together with the identity (A.4) , we have the following simple derivation which verifies that (A.5) holds.
We will use this lemma to identify important properties (i.e. eigenvectors and eigenvalues) of the matrix T_n that arises in solving (A.2): the eigenvalues are those defined in (A.6), namely λ_k = 2cos(kπ/n) − 2, and the eigenvectors are the columns of V with entries C_{k,j} in (A.7). Now we are ready to seek a solution to the discrete differential equation (A.2). First, with the eigensystem (A.7) we can derive

$$T_n = V D_n V^{T}, \tag{A.8}$$

where D_n = diag(λ_0, ..., λ_{n−1}). By using the fast eigensystem (A.8) of T_n, we are finally able to solve the linear system (I_n − T_n)u = f as follows:

$$\mathbf{u} = V\big( (V^{T}\mathbf{f}) \oslash \mathrm{diag}(I_n - D_n) \big),$$

i.e., transform f, divide point-wise by 1 − λ_k, and transform back. By definition, one knows that multiplication of a vector by V^T is equivalent to performing a discrete cosine transform (DCT), and multiplication by V to an inverse DCT, with some scaling weights (in DCT, the scaling weights are applied to make sure that the norm of each eigenvector is unit) on a 1D grid.
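A small NumPy check (our illustration) that the orthonormal DCT-II diagonalizes T_n with eigenvalues 2cos(kπ/n) − 2, and that the resulting O(n log n) solve of (A.2) matches a dense solve.

import numpy as np
from scipy.fft import dct, idct

n = 64
T = (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
     + np.diag(np.ones(n - 1), -1))
T[0, 0] = T[-1, -1] = -1.0                            # Neumann boundary rows

f = np.random.default_rng(0).standard_normal(n)
lam = 2.0 * np.cos(np.pi * np.arange(n) / n) - 2.0    # eigenvalues (A.6)

# DCT solve of (I - T) u = f: transform, divide by 1 - lambda_k, transform back
u_dct = idct(dct(f, norm="ortho") / (1.0 - lam), norm="ortho")
u_ref = np.linalg.solve(np.eye(n) - T, f)
print(np.allclose(u_dct, u_ref))                      # True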

A2. Two-dimensional problem
We now extend the Poisson problem from 1D to 2D. For the latter case, we are given a function F(x, y) defined on a rectangular domain and seek U(x, y) satisfying

$$U(x,y) - \Delta U(x,y) = F(x,y), \tag{A.9}$$

subject to the Neumann boundary conditions. On a 2D Cartesian grid of size m × n, discretizing the above second-order PDE (using a 2D finite difference method) leads to a linear system:

$$M\mathbf{u} = \mathbf{f}, \quad \text{with} \quad M = I_m \otimes I_n - (T_m \otimes I_n + I_m \otimes T_n). \tag{A.10}$$

In the linear system, ⊗ denotes the Kronecker product. u and f are two vectors resulting from the discretization of U(x, y) and F(x, y) respectively, and the length of each vector is mn. In practice, one needs to concatenate all the columns (or rows) of a discrete rectangular matrix (i.e. an image) to form a long vector, like u or f. M ∈ R^{mn×mn} above is a sparse matrix. The following theorem tells us that M can be diagonalized efficiently.

Theorem. Suppose V_m^T T_m V_m = D_m, V_n^T T_n V_n = D_n, and f ∈ R^{mn}; then the solution to Mu = f is given by

$$\mathbf{u} = (V_m \otimes V_n)\Big( \big((V_m \otimes V_n)^{T}\mathbf{f}\big) \oslash \mathrm{diag}\big(I_{mn} - (D_m \otimes I_n + I_m \otimes D_n)\big) \Big).$$

Proof. According to the Kronecker mixed-product property, the following identities hold:

$$(V_m \otimes V_n)^T (I_m \otimes I_n)(V_m \otimes V_n) = I_{mn},$$
$$(V_m \otimes V_n)^T (T_m \otimes I_n)(V_m \otimes V_n) = D_m \otimes I_n,$$
$$(V_m \otimes V_n)^T (I_m \otimes T_n)(V_m \otimes V_n) = I_m \otimes D_n.$$

Combining these three identities, we have

$$(V_m \otimes V_n)^T M (V_m \otimes V_n) = I_{mn} - (D_m \otimes I_n + I_m \otimes D_n),$$

from which the theorem follows.
Putting this in array format, we have the following framework that formally solves the linear system (A.10):

$$U = \mathcal{D}^{-1}\Big( \mathcal{D}(F) \oslash \big(1 - (2\cos\tfrac{q\pi}{m} + 2\cos\tfrac{r\pi}{n} - 4)\big) \Big),$$

where r ∈ [0, n) and q ∈ [0, m) are integer indices, and U ∈ R^{m×n} is equivalent to u after reshaping back to the m × n grid. Here 1 − (2cos(qπ/m) + 2cos(rπ/n) − 4) is the coefficient matrix resulting from solving the discrete second-order PDE of (A.9).
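The 2D framework is easy to verify numerically; the sketch below (ours) solves (A.10) via SciPy's 2D DCT and checks the answer against a dense Kronecker assembly of M.

import numpy as np
from scipy.fft import dctn, idctn

def neumann_T(n):
    # Tridiagonal second-difference matrix with Neumann boundary rows
    T = (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
         + np.diag(np.ones(n - 1), -1))
    T[0, 0] = T[-1, -1] = -1.0
    return T

m, n = 12, 10
F = np.random.default_rng(1).standard_normal((m, n))

# DCT solve: divide by the coefficient matrix 1 - (2cos(q*pi/m) + 2cos(r*pi/n) - 4)
q = np.arange(m)[:, None]
r = np.arange(n)[None, :]
coeff = 1.0 - (2*np.cos(np.pi*q/m) + 2*np.cos(np.pi*r/n) - 4.0)
U = idctn(dctn(F, norm="ortho") / coeff, norm="ortho")

# Dense reference: M u = f with M = I - (T_m kron I_n + I_m kron T_n), row-major vec
M = np.eye(m*n) - (np.kron(neumann_T(m), np.eye(n)) + np.kron(np.eye(m), neumann_T(n)))
u_ref = np.linalg.solve(M, F.reshape(-1))
print(np.allclose(U.reshape(-1), u_ref))      # True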

B1. General primal dual
Similar to ADMM, primal-dual tackles saddle-point problems and usually considers the form

$$\min_{x \in X}\max_{y \in Y} F(x) + \lambda\langle Kx, y\rangle - G(y), \tag{B.1}$$

where X and Y are two finite-dimensional real vector spaces, K : X → Y is a continuous linear operator, and F : X → [0, +∞) and G : Y → [0, +∞) are convex functions. λ is usually a smoothing parameter. Given the adjoint operator K*, the primal-dual algorithm to optimise such an objective function is listed in Algorithm 2.

Algorithm 2: General Primal-Dual Algorithm
Input variables: x^0 ∈ X, y^0 ∈ Y, and x̄^0 = x^0
Input parameters: λ > 0, σ > 0, τ > 0
while not converged do
    y^{k+1} = (I + σ∂G)^{-1}(y^k + λσ K x̄^k)
    x^{k+1} = (I + τ∂F)^{-1}(x^k − λτ K* y^{k+1})
    x̄^{k+1} = x^{k+1} + θ(x^{k+1} − x^k)

Note that the x-update and y-update in the algorithm respectively use the proximal operators of F and G, which correspond to the following two minimization problems:

$$(I + \tau\partial F)^{-1}(z) = \arg\min_{x} \frac{\|x - z\|^2}{2\tau} + F(x), \qquad (I + \sigma\partial G)^{-1}(z) = \arg\min_{y} \frac{\|y - z\|^2}{2\sigma} + G(y).$$

Note that λ and the step sizes σ and τ need to satisfy λστ < 1/r(K*K) for the algorithm to converge, where r(K*K) denotes the spectral radius of the matrix K*K. Also note that the x̄-update in Algorithm 2 is an extrapolation prediction step of the form x̄^{k+1} = x^{k+1} + θ(x^{k+1} − x^k), where θ = 1 has been used in convex problems such as motion estimation [31]. It has been analyzed for θ ∈ [−1, 1] in [69]. Some authors have also suggested adaptive step sizes [70]. In this study, we set θ = 1 to gain acceleration.

B2. Primal dual for (2b)

The linearized problem (2b), restated in discrete form as

$$\min_{\mathbf{u}} \|\rho_{\mathbf{u}^\omega}(\mathbf{u})\|_1 + \lambda\sum_{p=1}^{2}\|\nabla^n u_p\|_1, \tag{B.2}$$

can be converted equivalently to a saddle-point problem by writing the regularization term as a maximization, i.e.
With these, we can define a convex set Q over which the dual maximization in (B.3) is taken, which yields a saddle-point problem of the form (B.1) with x = u, y = q, F = ‖ρ_{u^ω}(u)‖_1, G = δ_Q(q), X = (R^{M×N})², Y = Q, K = ∇^n and K* = (−1)^n div^n.
To apply Algorithm 2, we need to specify the point-wise proximal operators in the y-update and x-update, which respectively correspond to the q-update and u-update below; the closed-form solution of the u-update can be found in (8). Finally, the primal-dual algorithm to optimize (2b) is given in Algorithm 3.

Algorithm 3: Primal-Dual Algorithm for (2b)
Input variables: u^0 = 0, q^0 = 0, and ū^0 = u^0
Input parameters: λ > 0, σ > 0, τ > 0, λστ < 1/8^n, and θ = 1
while not converged do
    q^{k+1} = (I + σ∂G)^{-1}(q^k + λσ ∇^n ū^k)
    u^{k+1} = (I + τ∂F)^{-1}(u^k − λτ (−1)^n div^n q^{k+1})
    ū^{k+1} = u^{k+1} + θ(u^{k+1} − u^k)

We remark that the spectral radius (largest absolute eigenvalue) of the matrix K*K can be computed very efficiently by applying DCT to the operator, i.e. see (11). Depending on the derivative order n, we have λστ < 1/8^n, which can be used to select λ, σ and τ in Algorithm 3. The proof of this inequality is straightforward, as the upper bound of (11) is 8^n. Also note that Algorithm 3 needs to be combined with Algorithm 1 to minimize (1); that is, to compute u^{k+1}, one simply replaces the ADMM updates in the third loop of Algorithm 1 with the primal-dual updates of Algorithm 3.
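As a quick numerical check (ours) of this bound: the DCT coefficients (11) of (−1)^n div^n ∇^n stay below 8^n and approach it as the image grows, so step sizes chosen from the bound are safe at any resolution.

import numpy as np

def spectral_radius_bound(H, W, n):
    # Largest DCT-domain eigenvalue of (-1)^n div^n grad^n, cf. (11)
    r = np.arange(H)[:, None]
    q = np.arange(W)[None, :]
    lam = (4.0 - 2.0*np.cos(np.pi*r/H) - 2.0*np.cos(np.pi*q/W)) ** n
    return lam.max()

lam_reg = 0.1                                   # example regularization weight
for n in (1, 2, 3):
    rho = spectral_radius_bound(128, 128, n)
    print(n, rho, 8.0**n)                       # rho approaches 8^n from below
    # One step-size choice satisfying lam*sigma*tau < 1/8^n with a small margin:
    sigma = tau = 0.99 / np.sqrt(lam_reg * 8.0**n)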