A Bayesian Formulation of Coherent Point Drift

Coherent point drift is a well-known algorithm for solving point set registration problems, i.e., finding corresponding points between shapes represented as point sets. Despite its advantages over other state-of-the-art algorithms, theoretical and practical issues remain. Among theoretical issues, (1) it is unknown whether the algorithm always converges, and (2) the meaning of the parameters concerning motion coherence is unclear. Among practical issues, (3) the algorithm is relatively sensitive to target shape rotation, and (4) acceleration of the algorithm is restricted to the use of the Gaussian kernel. To overcome these issues and provide a different and more general perspective to the algorithm, we formulate coherent point drift in a Bayesian setting. The formulation brings the following consequences and advances to the field: convergence of the algorithm is guaranteed by variational Bayesian inference; the definition of motion coherence as a prior distribution provides a basis for interpretation of the parameters; rigid and non-rigid registration can be performed in a single algorithm, enhancing robustness against target rotation. We also propose an acceleration scheme for the algorithm that can be applied to non-Gaussian kernels and that provides greater efficiency than coherent point drift.

The goal of point set registration is to find pairs of corresponding points between shapes represented as point sets. Algorithms for finding shape correspondences have been actively studied in computer vision and computer graphics. Why do we need to find corresponding points between shapes? This is because shapes with correspondence information become a foundation to organize, reconstruct, and synthesize a broader class of shapes. We begin with a summary of some of the active research fields in which point set registration arises.
Shape Reconstruction. A goal of computer vision and graphics is to reconstruct the entire 3D surface of an object from range images, i.e., 2D images with depth information, captured at different camera positions. To this end, range images are typically converted into point sets, also called point clouds. A naïve conversion of range images embeds the corresponding point clouds into local coordinate systems. If every point cloud partially overlaps with some of the others, point set registration finds the overlaps between the point clouds and aligns them in a global coordinate system.

Shape Retrieval. As a result of advances in computer graphics, a vast number of 3D shapes have been generated. One aim is to rapidly retrieve 3D shapes that resemble a query object from a database. One approach to performing this task is to define a measure of similarity between shapes and return the shapes that have high similarity with the query object. Here, we refer to shapes with and without correspondence information as structured and unstructured shapes, respectively. Defining an adequate measure of similarity is, however, a non-trivial task because typical 3D shapes are unstructured. Additionally, each shape model might be embedded in a local coordinate system. Let us assume that shapes are structured and located in a global coordinate system. Then, a naïve similarity measure based on the Euclidean distance, for example, can be a reasonable, computationally efficient similarity measure. Point set registration converts unstructured shapes into structured ones.
Shape Model Learning. If we have a class of structured shapes such as human faces, we can learn a shape model that summarizes shape variations in the class using a statistical learning technique, e.g., principal component analysis (PCA). This is because each element in a structured shape vector can be interpreted as a statistical variable with a specific meaning, e.g., the tip of a nose. Point set registration converts points in unstructured shapes into statistical variables. The shape model after learning can be used for shape blending and shape completion, for example.

a) Shape Blending. If 3D shapes can be automatically synthesized, the efforts of graphic designers can be reduced significantly. A PCA-based shape model, for example, allows us to generate an infinite number of 3D shapes endowed with the characteristics of the training shapes. Shape generation based on structured shapes is called shape blending.

b) Shape Completion. A 3D surface generated by the scan of a real object typically contains missing regions due to occlusion. The missing regions can be recovered naturally by fitting a PCA-based watertight shape model, i.e., a shape model without defective holes or gaps, to the scanned surface. The interpolation of missing regions in a scan is called shape completion.
Types of Registration Algorithms. Various registration algorithms have been developed to solve shape correspondence problems. Typically, registration algorithms are categorized as either rigid or non-rigid according to the transformation models that deform shapes and align them. Rigid registration algorithms find a map that preserves the distance between every pair of points, i.e., a map defined as rotation and translation. The reconstruction of an entire 3D surface from partial scans typically requires rigid registration.
In contrast, non-rigid registration algorithms find a map that does not necessarily preserve the distance between points. Non-rigid registration algorithms can be divided into two groups involving either non-linear maps or linear maps, e.g., similarity transformations and affine transformations. We note that non-rigid registration algorithms based on similarity transformations are sometimes categorized into rigid registration algorithms owing to their simplicity. Shape model learning and shape retrieval typically involve non-rigid registration with a non-linear map.
Coherent Point Drift. Coherent point drift (CPD) [1], [2] is a collection of three different registration algorithms based on a Gaussian mixture model (GMM). Among them, the best-known algorithm is a non-rigid algorithm with a non-linear map, which is based on the motion coherence theory [3]. We hereafter refer to it as CPD unless otherwise specified. CPD is considered a state-of-the-art non-rigid registration algorithm because of its registration performance and scalability to large point sets [4]. A primary reason for its stability is its estimation of shape correspondence in a soft-matching manner; point-to-point correspondences are not assumed to be one-to-one. Additionally, the sequential update of the residual variance in the GMM automatically shrinks the set of candidate points corresponding to each source point during the optimization [5].
Our Contribution. We formulate CPD in a Bayesian setting; we replace the motion coherence theory, i.e., the theoretical basis of CPD, with Bayesian inference; we introduce motion coherence using a prior distribution of displacement vectors. We also derive an algorithm that overcomes several issues of CPD on the basis of variational Bayesian inference (VBI). The algorithm is endowed with the following characteristics: convergence of the algorithm is guaranteed; the parameters regarding motion coherence can be interpreted intuitively; the algorithm combines the rigid CPD and the non-rigid CPD; and the algorithm can be accelerated with non-Gaussian kernel functions. In numerical studies, we evaluate the registration performance of the algorithm by comparing it with state-of-the-art algorithms. The remainder of the paper is organized as follows: In Section 2, we review point set registration methods related to our study. In Section 3, we follow the derivation of the non-rigid CPD algorithm and summarize several issues to be addressed. In Section 4, we present our methodological contributions with a summary of VBI. In Section 5, we evaluate the registration performance of our algorithm. In Section 6, we conclude this study. Appendices and Supplementary Video 1 can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2020.2971687. The implementation of the algorithm is available at the author's GitHub repository: https://github.com/ohirose/bcpd.

PRELIMINARY
In this section, we follow the derivation of the CPD algorithm [2] to clarify the difference in formulation between CPD and our Bayesian variant. CPD is derived by first defining a GMM, which implicitly encodes the unknown correspondences between points, and then minimizing the negative log-likelihood function using the EM algorithm.

Notation 1. The goal of point set registration is to find the map $T$ that transforms the geometric shape represented as a point set $Y = \{y_1, \dots, y_M\}$ so that the transformed shape matches the target shape represented as the other point set $X = \{x_1, \dots, x_N\}$. Here, we refer to the point set to be deformed as the source point set, and we refer to the point set that remains fixed as the target point set. The set of $T(y) - y$ for any $y \in \mathbb{R}^D$ is called a displacement field, as it defines the displacement of each point in $Y$. We denote the displacement field by $v(y)$ or $v$. Additionally, we refer to any point that is irrelevant to the actual object geometry as an outlier. Throughout this paper, we use the following notation: $N$, $M$: the numbers of points in the target and source point sets, respectively. $D$: the dimensionality of the space in which a point set is embedded.
$\mathrm{Tr}(\cdot)$, $|\cdot|$, $d(\cdot)$: the trace of a matrix, the determinant of a matrix, and the operation that converts a vector into its corresponding diagonal matrix, respectively. $I_D$, $I_M$: the identity matrices of sizes $D$ and $M$. $\mathbf{1}_D$, $\mathbf{1}_M$, $\mathbf{1}_N$: the vectors of all 1s of sizes $D$, $M$, and $N$. We also use the symbols of the point sets $X$ and $Y$ as matrix symbols, defined as $X = (x_1, \dots, x_N)^T$ and $Y = (y_1, \dots, y_M)^T$. The notation used throughout this paper is summarized in Appendix E, available online.

Gaussian Mixture Model for Point Set Registration
We begin with the definition of a GMM, which is the basis of CPD. In the framework of CPD, the set of points representing a deformed shape, i.e., $T(Y)$, is assumed to be the centroids of a GMM, and a target point $x_n$ is a data point generated by the GMM. The $m$th component distribution of the GMM is defined as follows:
$$ p(x_n \mid m) = \frac{1}{(2\pi\sigma^2)^{D/2}} \exp\Big( -\frac{\|x_n - T(y_m)\|^2}{2\sigma^2} \Big), $$
where $T(y_m) = y_m + v(y_m)$ is the transformation model that moves source points, and $\theta' = (v, \sigma^2)$ is the set of parameters.
With the outlier distribution $p_{\mathrm{out}}(x_n) = 1/N$, the GMM is defined as follows:
$$ p(x_n) = \omega\, p_{\mathrm{out}}(x_n) + (1 - \omega) \sum_{m=1}^{M} \frac{1}{M}\, p(x_n \mid m), $$
where $\omega$ is the prior probability that $x_n$ is an outlier. By the construction of the probabilistic model, point set registration is reduced to the estimation of $\theta'$ based on the minimization of the negative log-likelihood of the GMM.
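The mixture just defined can be evaluated numerically. Below is a minimal sketch (the function name is ours, not from the paper's implementation) that computes the GMM density with the outlier component at a single target point:

```python
import numpy as np

def mixture_density(x, ty, sigma2, omega, p_out):
    """Evaluate the CPD mixture at a target point x.

    x: (D,) target point; ty: (M, D) transformed source points T(Y);
    sigma2: residual variance; omega: outlier probability;
    p_out: value of the outlier density (1/N in the original formulation).
    """
    D = x.shape[0]
    sq = np.sum((x - ty) ** 2, axis=1)                       # ||x - T(y_m)||^2
    phi = np.exp(-sq / (2.0 * sigma2)) / (2.0 * np.pi * sigma2) ** (D / 2.0)
    # Equal mixing weights 1/M for the shape components, plus the
    # outlier component weighted by omega.
    return omega * p_out + (1.0 - omega) * np.mean(phi)
```

Setting `omega` to zero recovers a plain GMM over the deformed source points; setting it to one leaves only the outlier density.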

EM Algorithm and the Q-Function
Owing to the lack of a closed-form expression for $\theta' = (v, \sigma^2)$ that minimizes the negative log-likelihood of the GMM, a sequential optimization technique called the EM algorithm is applied to the minimization. The CPD algorithm is obtained as the result of applying the EM algorithm, and it seeks a local minimum of the negative log-likelihood by iterating the following procedure: (1) computation of the matching probability between $x_n$ and $y_m$, denoted by $p_{mn}$, and (2) minimization of an objective function, called the Q-function, given $p_{mn}$. The former and the latter are called the E-step and the M-step, respectively. Suppose $\theta'_0 = (v_0, \sigma_0^2)$ is the current estimate of the parameter set $\theta'$. From the theory of the EM algorithm, the Q-function is derived as follows:
$$ Q(\theta', \theta'_0) = \frac{1}{2\sigma^2} \sum_{n=1}^{N} \sum_{m=1}^{M} p_{mn} \|x_n - T(y_m)\|^2 + \frac{\hat{N} D}{2} \ln \sigma^2, $$
where $\hat{N} = \sum_{n=1}^{N} \sum_{m=1}^{M} p_{mn} \le N$ is the estimated number of current matching points. The posterior matching probability $p_{mn}$ is derived as follows:
$$ p_{mn} = \frac{(1 - \omega) \frac{1}{M}\, \phi(x_n; T_0(y_m), \sigma_0^2 I_D)}{\omega\, p_{\mathrm{out}}(x_n) + (1 - \omega) \frac{1}{M} \sum_{m'=1}^{M} \phi(x_n; T_0(y_{m'}), \sigma_0^2 I_D)}, $$
where $\phi(\cdot; \mu, \Sigma)$ denotes the multivariate normal density and $T_0(y_m) = y_m + v_0(y_m)$. The matching probability $p_{mn}$ depends only on $\theta'_0$, and therefore the update of $p_{mn}$ and the minimization of the Q-function regarding $\theta'$ can be alternated. We note that the Q-function can be regarded as an error function of the shape deformation under the current alignment $\theta'_0$. The E-step and M-step can therefore be interpreted as follows: the E-step aligns the point sets $X$ and $Y$ in a soft-matching manner, whereas the M-step finds the shape that is closest to the target shape given the current alignment $\theta'_0$.
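The E-step above is a vectorized computation over all pairs of points. A minimal sketch (helper name ours) of the matching-probability matrix:

```python
import numpy as np

def e_step(x, ty, sigma2, omega, p_out):
    """Posterior matching probabilities p_mn under the current alignment.

    x: (N, D) target points; ty: (M, D) currently deformed source points.
    Returns P of shape (M, N); column sums fall below one when the
    outlier component absorbs part of the posterior mass.
    """
    M, D = ty.shape
    sq = np.sum((x[None, :, :] - ty[:, None, :]) ** 2, axis=2)   # (M, N)
    phi = np.exp(-sq / (2.0 * sigma2)) / (2.0 * np.pi * sigma2) ** (D / 2.0)
    num = (1.0 - omega) * phi / M
    den = omega * p_out + num.sum(axis=0, keepdims=True)
    return num / den
```

With `omega = 0`, each column of `P` is a proper categorical distribution over source points, i.e., a soft assignment of $x_n$ to the $M$ centroids.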

Regularization and the Motion Coherence Theory
If no constraint is imposed on the displacement field $v(y)$, the minimization of the Q-function given $\theta'_0$ is ill-posed; for example, the Q-function is trivially minimized if the displacement vectors satisfy $x_1 = y_m + v(y_m)$ for all $m$ given a fixed $\sigma^2$. A regularization term that avoids such trivial solutions is incorporated into the Q-function as follows:
$$ Q'(\theta', \theta'_0) = Q(\theta', \theta'_0) + \frac{\lambda}{2} \int_{\mathbb{R}^D} \frac{|\tilde{v}(s)|^2}{\tilde{G}(s)}\, ds, \quad (1) $$
where $\lambda$ is a trade-off parameter, the tilde symbol indicates the Fourier transform, and $\tilde{G}$ is a positive function that tends to zero as $\|s\| \to \infty$. We note that $1/\tilde{G}$ is a high-pass filter, and therefore the penalization of the displacement field increases as the field becomes rougher. Myronenko et al., the authors of the CPD algorithm, used the Gaussian function $G_\beta(z) = \exp\big( -\frac{\|z\|^2}{2\beta^2} \big)$ as the choice of $G$. One reason for this choice is that the regularization term is then equivalent to that of the motion coherence theory [3], which penalizes non-coherent motions of neighboring points. This regularization therefore implies that motion coherence is imposed on the source points moved during registration.
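In matrix form, the Gaussian choice of $G$ enters the algorithm through its Gram matrix evaluated at the source points. A small sketch (helper name ours) that builds it and checks positive definiteness, the property relied on in the next subsection:

```python
import numpy as np

def gaussian_gram(y, beta):
    """Gram matrix G = (g_mm') with g_mm' = exp(-||y_m - y_m'||^2 / (2 beta^2)).

    beta sets the length scale over which point motions are coupled: a
    larger beta couples more distant points and yields smoother fields.
    """
    sq = np.sum((y[:, None, :] - y[None, :, :]) ** 2, axis=2)
    return np.exp(-sq / (2.0 * beta ** 2))

y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
G = gaussian_gram(y, beta=1.0)
```

For distinct points, the Gaussian kernel yields a symmetric positive-definite Gram matrix with unit diagonal.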

Minimization of the Q-Function
The remaining task in deriving the CPD algorithm is to minimize $Q'(\theta', \theta'_0)$ with respect to $\theta' = (v, \sigma^2)$ given the current alignment $\theta'_0 = (v_0, \sigma_0^2)$. Owing to the non-linear dependence between $v$ and $\sigma^2$, a closed-form expression for their simultaneous update is not available. Therefore, $v$ and $\sigma^2$ are updated separately. To derive the equation for updating $v$, we first assume that $\sigma^2$ is given. The current parameter set $\theta'_0$ defines the current soft matching, and therefore the minimization of $Q'$ with respect to $v$ can be interpreted as a function-fitting problem based on the regularization theory [54]. The solution of the problem is guaranteed to be of the form
$$ v(y) = \sum_{m=1}^{M} w_m\, G(y - y_m) + \varphi(y), $$
where $w_m \in \mathbb{R}^D$ is a coefficient vector, and $\varphi(y)$ is a function that spans the null space of the functional $\int_{\mathbb{R}^D} |\tilde{v}(s)|^2 / \tilde{G}(s)\, ds$ [55]. Specifically, if the function $G$ is positive definite, $\varphi(y)$ is guaranteed to be zero. The Gaussian function $G_\beta$ as the choice of $G$ guarantees $\varphi(y) = 0$ owing to the positive definiteness of $G_\beta$. Suppose $\hat{Y} = (\hat{y}_1, \dots, \hat{y}_M)^T \in \mathbb{R}^{M \times D}$ with $\hat{y}_m = y_m + v(y_m) \in \mathbb{R}^D$ is the matrix that represents the deformed shape. By using the general solution of the Euler-Lagrange equation related to Eq. (1), the equation for updating $v(y)$ is obtained through the coefficient matrix $W = (w_1, \dots, w_M)^T \in \mathbb{R}^{M \times D}$ as follows:
$$ W = \big( G + \lambda \sigma^2 d(\nu)^{-1} \big)^{-1} \big( d(\nu)^{-1} P X - Y \big), \qquad \nu = P \mathbf{1}_N, $$
where $P = (p_{mn}) \in [0, 1]^{M \times N}$ and $G = (g_{mm'}) \in \mathbb{R}^{M \times M}$ with $g_{mm'} = G_\beta(y_m - y_{m'})$. After the update of the displacement field $v(y)$, the residual variance $\sigma^2$ is updated. By equating the corresponding derivative of $Q'$ to zero, the equation for updating $\sigma^2$ is obtained as follows:
$$ \sigma^2 = \frac{1}{\hat{N} D} \Big\{ \mathrm{Tr}\big( X^T d(\nu') X \big) - 2\, \mathrm{Tr}\big( (PX)^T \hat{Y} \big) + \mathrm{Tr}\big( \hat{Y}^T d(\nu)\, \hat{Y} \big) \Big\}, $$
where $\hat{Y} = Y + GW$ after the update of $v$ and $\nu' = P^T \mathbf{1}_M$. The CPD algorithm performs iterative updates of $P$, $W$, and $\sigma^2$ under a suitable initialization on the basis of the EM algorithm.
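The two M-step updates above can be sketched compactly (the function name is ours; the linear system and the variance formula follow the standard CPD forms):

```python
import numpy as np

def m_step(x, y, P, G, lam, sigma2):
    """One CPD M-step: solve for W, deform Y, then update sigma^2.

    x: (N, D) target; y: (M, D) source; P: (M, N) matching probabilities;
    G: (M, M) Gaussian Gram matrix; lam: trade-off parameter lambda;
    sigma2: current residual variance.
    """
    N, D = x.shape
    nu = P.sum(axis=1)                       # nu = P 1_N
    nu_p = P.sum(axis=0)                     # nu' = P^T 1_M
    N_hat = nu.sum()
    d_inv = np.diag(1.0 / nu)
    # (G + lambda sigma^2 d(nu)^{-1}) W = d(nu)^{-1} P X - Y
    W = np.linalg.solve(G + lam * sigma2 * d_inv, d_inv @ (P @ x) - y)
    ty = y + G @ W                           # deformed shape Y + GW
    sigma2_new = (np.sum(nu_p * np.sum(x ** 2, axis=1))
                  - 2.0 * np.sum((P @ x) * ty)
                  + np.sum(nu * np.sum(ty ** 2, axis=1))) / (N_hat * D)
    return W, ty, sigma2_new
```

The three trace terms expand to the weighted residual $\sum_{m,n} p_{mn} \|x_n - \hat{y}_m\|^2$, so the returned variance is non-negative whenever $P$ is a valid soft matching.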

Several Issues
Despite the advantages of CPD over state-of-the-art algorithms, several issues remain in both theoretical and practical aspects. One theoretical issue of CPD is that it is unknown whether the algorithm always converges, as suggested by Myronenko et al. [2]. If we find $\hat{\theta}' = (\hat{v}, \hat{\sigma}^2)$ satisfying $Q'(\hat{\theta}', \theta'_0) \le Q'(\theta'_0, \theta'_0)$ at the minimization step, convergence is guaranteed by the theory of the EM algorithm. However, it is unknown whether the updates of $v$ and $\sigma^2$ in the M-step satisfy this condition, because the equations for updating them are obtained by the partial minimization of $Q'$ with respect to either $v$ or $\sigma^2$ given the other. Another theoretical issue of CPD is the difficulty of interpreting the tuning parameters $\lambda$ and $\beta$. Both parameters control the degree of motion coherence, i.e., the smoothness of the displacement field, and their difference is unclear. Therefore, the tuning parameters tend to be controlled in a non-intuitive manner. On the practical side, CPD is relatively sensitive to the rotation of a target. This is because (1) CPD captures the target's rotation using the non-rigid displacement term $v(y)$, and (2) the rotation of a source shape for registration is penalized by the regularization in the same way as its non-rigid deformation. Another practical issue is that the acceleration scheme of CPD is limited to the case of the Gaussian function, as reviewed in Section 4.6.
Through a Bayesian formulation of CPD, we address both theoretical and practical issues.

METHODS
We propose a Bayesian formulation of CPD and derive an algorithm called Bayesian coherent point drift (BCPD). In Section 4.1, we review VBI. In Section 4.2, we formulate point set registration in a Bayesian setting without the motion coherence theory. In Section 4.3, we derive the BCPD algorithm using VBI, which guarantees convergence of the algorithm. In Section 4.4, we summarize the differences between CPD and BCPD, and we show that CPD is recovered from BCPD. In Section 4.5, we provide a statistical view of motion coherence. In Section 4.6, we present an acceleration scheme for BCPD that can be applied to non-Gaussian kernel functions.

Variational Bayesian Inference
One important task of Bayesian inference is to estimate a set of unobserved variables, $\theta$, from a set of observations, $z$. To this end, we need to compute the maximum mode of the posterior distribution $p(\theta|z)$ or the expectation of $\theta$ over $p(\theta|z)$, because they are reasonable estimates of $\theta$. However, this task is often hampered by the difficulty of computing the estimates; e.g., the analytic form of the maximum mode is unavailable because of the multimodality of the posterior distribution, or the computational cost of evaluating the expectation is prohibitively large because of the existence of many discrete variables. VBI relaxes the difficulty by approximating $p(\theta|z)$ using an alternative distribution $q(\theta)$ with the property that the maximum mode or the expectation is easily computed. The distribution $q(\theta)$ itself is generally unknown; therefore, finding reasonable estimates of $\theta$ is replaced with finding the distribution $q(\theta)$ that approximates $p(\theta|z)$ as closely as possible. Typically, VBI is defined as the minimization of the Kullback-Leibler (KL) divergence between $q(\theta)$ and $p(\theta|z)$, as follows:
$$ \hat{q}(\theta) = \arg\min_{q} \Big\{ -\int q(\theta) \ln \frac{p(\theta|z)}{q(\theta)}\, d\theta \Big\}. \quad (2) $$
The equivalent definition, which is suitable for deriving the general solution, is the maximization of the lower bound $L(q)$ on the log model evidence $\ln p(z)$, where $L(q)$ is defined as follows:
$$ L(q) = \int q(\theta) \ln \frac{p(z, \theta)}{q(\theta)}\, d\theta, $$
i.e., the negative KL divergence between $q(\theta)$ and $p(z, \theta)$. If no constraint is imposed on $q(\theta)$, the computations of the expectation and the maximum mode remain intractable because the solution of this optimization problem is $\hat{q}(\theta) = p(\theta|z)$. Here, we suppose that $\theta$ can be divided into $J$ groups, i.e., $\theta = (\theta_1, \dots, \theta_J)$, and we denote the marginal distribution of $\theta_j$ with respect to $q(\theta)$ by $q_j(\theta_j)$. To find a $q(\theta)$ that relaxes the difficulty of computing the estimates of $\theta$, the posterior distribution $q(\theta)$ is constrained to be the product of its marginals, i.e., $q(\theta) = \prod_{j=1}^{J} q_j(\theta_j)$. This factorization splits the original problem, Eq. (2), into subproblems. Here, we assume that only $q_i$ is unknown among the factorized distributions $\{q_1, \dots, q_J\}$. Then, the maximization of the lower bound $L(q)$ is replaced by one that focuses on $q_i$:
$$ L(q) = \int q_i(\theta_i)\, E_{\setminus i}[\ln p(z, \theta)]\, d\theta_i - \int q_i(\theta_i) \ln q_i(\theta_i)\, d\theta_i + \mathrm{const}, \quad (3) $$
where $E_{\setminus i}[\ln p(z, \theta)] = \int \ln p(z, \theta) \prod_{j (\ne i)} q_j(\theta_j)\, d\theta_j$. We note that the addition of a constant inside the integral does not change the solution. As the integral term forms a negative KL divergence, the general solution is obtained as follows:
$$ \ln \hat{q}_i(\theta_i) = E_{\setminus i}[\ln p(z, \theta)] + \mathrm{const}, \quad (4) $$
where the constant term corresponds to the normalization constant of $\hat{q}_i(\theta_i)$. Suppose $\hat{q}(\theta) = \hat{q}_i(\theta_i) \prod_{j (\ne i)} q_j(\theta_j)$. Then, the inequality of the lower bounds $L(\hat{q}) \ge L(q)$ always holds and is tight if and only if $\hat{q}_i(\theta_i) = q_i(\theta_i)$ for any $\theta_i$. This is because the subproblem, Eq. (3), is replaced by the minimization of a KL divergence; the KL divergence $D_{\mathrm{KL}}(p_1 \| p_2)$ is non-negative and becomes zero if and only if $p_1(x) = p_2(x)$ for any $x$ [53]. Therefore, a solution of the optimization problem, Eq. (2), is obtained by the following procedure: 1) Initialize $q_i$ for all $i \in \{1, \dots, J\}$. 2) Cycle the update of $q_i$ with the other $q_j$ fixed for each $i \in \{1, \dots, J\}$ using Eq. (4). 3) Repeat step 2 until convergence. Convergence is guaranteed because $L(q)$ is monotonically increasing during the updates and bounded above by $\ln p(z)$.
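To make the cyclic procedure concrete, here is a minimal toy example (not from the paper): mean-field VBI applied to a two-dimensional Gaussian posterior with correlated components. Each pass applies Eq. (4) to one factor with the other fixed; for a Gaussian target, the factorized means converge to the true posterior mean.

```python
import numpy as np

# Toy posterior p(theta | z) = N(mu, Lam^{-1}) with correlated components.
mu = np.array([1.0, -2.0])
Lam = np.array([[2.0, 0.8],
                [0.8, 1.0]])                 # precision matrix

# Mean-field factorization q(theta) = q1(theta1) q2(theta2).
# Eq. (4) yields Gaussian factors; only their means m need iterating:
# m_i = mu_i - Lam_ii^{-1} * Lam_ij * (m_j - mu_j).
m = np.zeros(2)                              # non-informative start
for _ in range(100):
    m[0] = mu[0] - Lam[0, 1] / Lam[0, 0] * (m[1] - mu[1])
    m[1] = mu[1] - Lam[1, 0] / Lam[1, 1] * (m[0] - mu[0])
```

The loop is exactly the coordinate-wise cycle of step 2; monotone growth of $L(q)$, bounded by $\ln p(z)$, is what guarantees termination.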

Bayesian Formulation of CPD
We formulate point set registration in a Bayesian manner; that is, we first define a probabilistic model that generates a target point set $X$ from a source point set $Y$ with unobserved random variables $\theta$, and we then estimate $\theta$ given $X$ and $Y$. In this section, we focus on the construction of the generative model of $X$. The key difference between our formulation and the original CPD formulation is that we define motion coherence using a prior distribution instead of the regularization term. We suppose that $T(y_m)$ is the function that deforms the shape represented as $Y$ to match it with the shape represented as $X$.
Assumptions. As in the definition of Myronenko et al., we assume that a target point set $X$ is generated under the following assumptions: 1) A target point $x_n$ is selected as either a point forming a target shape or an outlier with the outlier probability $\omega$. 2) If a target point $x_n$ is selected as an outlier, $x_n$ is generated from an outlier distribution $p_{\mathrm{out}}(x_n)$. 3) If $x_n$ is a non-outlier, an index $m \in \{1, \dots, M\}$, indicating that $x_n$ corresponds to $y_m$, is sampled with probability $\alpha_m$, where $\sum_{m=1}^{M} \alpha_m = 1$. 4) Then, $x_n$ is generated from a $D$-dimensional multivariate normal distribution with mean vector $T(y_m)$ and covariance matrix $\sigma^2 I_D$. 5) A target point set $X$ is generated by an $N$-fold repetition of 1, 2, 3, and 4. The original formulation of CPD constrains $\alpha_m$ to be $1/M$ for all $m \in \{1, \dots, M\}$. On the basis of these assumptions, we define the generative model of $X$.

Notation 2. Here, we define the notation required for the definition of the generative model as follows: $v = (v_1^T, \dots, v_M^T)^T \in \mathbb{R}^{DM}$: the vector representation of the displacement vectors that characterize a non-rigid transformation of a source shape. $c = (c_1, \dots, c_N)^T \in \{0, 1\}^N$: the vector of indicator variables, where $c_n$ takes 1 if the $n$th target point is a non-outlier and 0 otherwise. $e = (e_1, \dots, e_N)^T \in \{1, \dots, M\}^N$: the vector of index variables representing that the $n$th target point corresponds to the $m$th source point if $e_n = m$. $\alpha = (\alpha_1, \dots, \alpha_M)^T$: the vector of mixing coefficients, where $\alpha_m$ represents the probability of the event $e_n = m$ for any $n$. $\rho = (s, R, t)$: the set of random variables that defines the similarity transformation $T$. $\phi(z; \mu, \Sigma)$: the multivariate normal distribution of $z$ with mean vector $\mu$ and covariance matrix $\Sigma$. We note that $y \in \mathbb{R}^{DM}$ and $v \in \mathbb{R}^{DM}$ are hereafter used as vector symbols, which differ from the symbols defining the displacement field $v(y) : \mathbb{R}^D \to \mathbb{R}^D$ in Section 3.
Transformation Model. We begin with the definition of the non-rigid transformation model $T$ that deforms a source shape. Unlike in CPD, we define it as a combination of a similarity transformation and a non-rigid transformation as follows:
$$ T(y_m) = sR(y_m + v_m) + t, $$
where $s \in \mathbb{R}$ is a scale factor, $R \in \mathbb{R}^{D \times D}$ is a rotation matrix, $t \in \mathbb{R}^D$ is a translation vector, and $v_m \in \mathbb{R}^D$ is the displacement vector that characterizes the non-rigid transformation.
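Read row-wise over the source matrix, the combined model is a one-liner. A minimal sketch (helper name ours), assuming the similarity transformation is applied to the displaced points as in the update equations of Section 4.3:

```python
import numpy as np

def transform(y, v, s, R, t):
    """Combined transformation applied row-wise:
    T(y_m) = s R (y_m + v_m) + t.

    y, v: (M, D) arrays; s: scalar; R: (D, D) rotation matrix; t: (D,).
    """
    return s * (y + v) @ R.T + t
```

Setting `v` to zero leaves a pure similarity transformation, which is how the rigid and non-rigid registrations coexist in a single model.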
Gaussian Mixture Model. We define a GMM that generates a point in a target point set and plays the role of a likelihood function for solving registration problems. Under assumption 4, we define the distribution of $x_n$ for the event $e_n = m$ as a multivariate normal distribution as follows:
$$ \phi(x_n; T(y_m), \sigma^2 I_D) = |2\pi\sigma^2 I_D|^{-1/2} \exp\Big( -\frac{\|x_n - T(y_m)\|^2}{2\sigma^2} \Big), $$
where $|\cdot|$ represents the determinant of a square matrix. Unlike in the original CPD paper, we define the GMM with an explicit definition of the unknown correspondences, i.e., $c_n$ and $e_n$. Under assumptions 1, 2, 3, and 4, we define the joint distribution of $(x_n, e_n, c_n)$ given $(y, v, \alpha, \rho, \sigma^2)$ as a combination of a two-component mixture distribution and an $M$-component mixture distribution as follows:
$$ p(x_n, e_n, c_n \mid y, v, \alpha, \rho, \sigma^2) = \big\{ \omega\, p_{\mathrm{out}}(x_n) \big\}^{1 - c_n} \Big\{ (1 - \omega) \prod_{m=1}^{M} \big( \alpha_m\, \phi_{mn} \big)^{\delta_m(e_n)} \Big\}^{c_n}, $$
where $\phi_{mn}$ is an abbreviation of $\phi(x_n; T(y_m), \sigma^2 I_D)$ and $\delta_m(e_n)$ is the indicator function, with a value of 1 if $e_n = m$ and 0 otherwise. Notation with respect to corresponding points is shown in Fig. 1.
Prior Distributions. We define motion coherence as a prior distribution of displacement vectors. Suppose $G = (g_{mm'}) \in \mathbb{R}^{M \times M}$ with $g_{mm'} = K(y_m, y_{m'})$ is a positive-definite matrix, where $K(\cdot, \cdot)$ is a positive-definite kernel. We define the prior distribution of $v$ as follows:
$$ p(v \mid y) = \phi(v; 0, \lambda^{-1} G \otimes I_D), $$
where $\lambda$ is a positive constant and $\otimes$ denotes the Kronecker product. The prior distribution encodes motion coherence because the displacement vectors $v_m$ and $v_{m'}$ correlate with each other if the affinity value between $y_m$ and $y_{m'}$, i.e., $\lambda^{-1} g_{mm'}$, is sufficiently large. Additionally, we define $p(\alpha)$ as a Dirichlet distribution:
$$ p(\alpha) = \mathrm{Dir}(\alpha \mid \kappa \mathbf{1}_M), $$
where $\kappa > 0$ is the parameter that controls the shape of the Dirichlet distribution and $\mathbf{1}_M$ is the vector of all 1s of size $M$. For the simplicity of the generative model, we do not introduce prior distributions over the source point set $y$, the residual variance $\sigma^2$, or the set of variables $\rho$ that defines the similarity transformation $T$.
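A draw from this prior illustrates what the coupling means in practice: because the covariance is $\lambda^{-1} G \otimes I_D$, each of the $D$ coordinates of $v$ is an independent $N(0, G/\lambda)$ vector over the $M$ points, which can be sampled with a Cholesky factor of $G$. A minimal sketch (the jitter term is a numerical convenience, not part of the model):

```python
import numpy as np

rng = np.random.default_rng(0)
M, D, beta, lam = 50, 2, 0.3, 1.0
y = np.linspace(0.0, 1.0, M)[:, None]            # source points on a line
sq = (y - y.T) ** 2
G = np.exp(-sq / (2.0 * beta ** 2))              # Gaussian kernel Gram matrix
L = np.linalg.cholesky(G + 1e-8 * np.eye(M))     # jitter for numerical stability
# L z has covariance ~ G, so each column below is one coordinate of a
# displacement-field sample drawn from N(0, G / lam).
v = (L @ rng.standard_normal((M, D))) / np.sqrt(lam)
```

With a small `beta`, nearby source points receive nearly identical displacements while distant ones decorrelate, which is exactly the coherence the prior encodes.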
Full Joint Distribution. Incorporating the prior distributions into the GMM with assumption 5, we define the full joint distribution $p(x, y, \theta)$ as follows:
$$ p(x, y, \theta) \propto p(v \mid y)\, p(\alpha) \prod_{n=1}^{N} p(x_n, c_n, e_n \mid y, v, \alpha, \rho, \sigma^2), $$
where $\theta = (v, \alpha, c, e, \rho, \sigma^2)$ is the set of latent variables that mediate the generation of $x$ given $y$. This definition of the joint distribution converts point set registration into probabilistic density estimation. In the next section, we derive an algorithm for estimating a reasonable $\theta$ using VBI.

Derivation of the BCPD Algorithm
In this section, we derive a registration algorithm based on the Bayesian formulation described in Section 4.2. One approach to solving a registration problem is to compute a reasonable estimate of $\theta$, such as the expectation of $\theta = (v, \alpha, c, e, \rho, \sigma^2)$ over $p(\theta \mid x, y)$ or the maximum mode of $p(\theta \mid x, y)$. The exact computation of the estimate is, however, intractable for large $M$ and $N$; the number of evaluations of $p(\theta \mid x, y)$ is roughly $(M + 1)^N$ when a naïve method is applied. Therefore, we use VBI as described in Section 4.1.
To relax the difficulty of computing the estimate, we approximate $p(\theta \mid x, y)$ by a distribution $q(\theta)$ that is endowed with the factorization
$$ q(\theta) = q_1(v, \alpha)\, q_2(c, e)\, q_3(\rho, \sigma^2). $$
We also assume $q_3(\rho, \sigma^2) = \delta(\rho, \sigma^2)$ for simplification, where the symbol $\delta$ with no subscript represents the Dirac delta function, i.e., $q_3$ is a distribution with a point mass at $(\rho, \sigma^2)$. This assumption means that $q_3$ is characterized by its first moment only, and the second and higher moments can be represented as products of the first moment. Through VBI, the maximum mode or the expectation regarding $q$ becomes available after performing iterative updates of $q_1$, $q_2$, and $q_3$ until convergence. Furthermore, VBI guarantees convergence of the algorithm.
Notation 3. Here, we list useful symbols for describing the closed-form expressions of $\hat{q}_1$, $\hat{q}_2$, and $\hat{q}_3$ as follows: $K(\cdot, \cdot)$: a positive-definite kernel. $P = (p_{mn}) \in [0, 1]^{M \times N}$: the probability matrix, where $p_{mn} = E[c_n \delta_m(e_n)]$ represents the posterior probability that $x_n$ corresponds to $y_m$. $\nu = (\nu_1, \dots, \nu_M)^T = P \mathbf{1}_N$: the vector whose $m$th element estimates the number of target points matching $y_m$. $\nu' = (\nu'_1, \dots, \nu'_N)^T = P^T \mathbf{1}_M$: the vector whose $n$th element represents the posterior probability that $x_n$ is a non-outlier. $\hat{N} = \sum_{n=1}^{N} \sum_{m=1}^{M} p_{mn} \le N$: the estimated number of matching points between $X$ and $Y$. $\phi_{mn}$: the abbreviation of $\phi(x_n; T(y_m), \sigma^2 I_D)$. For convenience, we simplify the notation concerning the Kronecker product by attaching a tilde symbol to a matrix or a vector, e.g., $\tilde{P} = P \otimes I_D$ and $\tilde{d}(\nu) = d(\nu) \otimes I_D$. As with the above notation, we denote the augmented form of the similarity transformation by $\tilde{T}$, i.e., $\tilde{T}(y) = s(I_M \otimes R)y + (\mathbf{1}_M \otimes t)$. Hereafter, we provide the closed-form expressions of $\hat{q}_1$, $\hat{q}_2$, and $\hat{q}_3$ without derivations, which are available in Appendices A, B, and C in the online supplemental material.

Update of q 1 ðv; aÞ
The update of $q_1(v, \alpha)$ corresponds to the update of the non-linear shape deformation. Suppose $q_2(c, e)$ and $q_3(\rho, \sigma^2)$ are given. From Eq. (4), the closed-form expression of $\hat{q}_1(v, \alpha)$ is obtained as the product of a Dirichlet distribution and a multivariate normal distribution as follows:

Proposition 1. The approximated posterior distribution $\hat{q}_1(v, \alpha)$ is factorized into its marginals, i.e., $\hat{q}_1(v, \alpha) = \hat{q}_\alpha(\alpha)\, \hat{q}_v(v)$. Furthermore, $\hat{q}_\alpha(\alpha)$ is a Dirichlet distribution and $\hat{q}_v(v)$ is a multivariate normal distribution, defined as follows:
$$ \hat{q}_\alpha(\alpha) = \mathrm{Dir}(\alpha \mid \kappa \mathbf{1}_M + \nu), \qquad \hat{q}_v(v) = \phi(v; \hat{v}, \tilde{\Sigma}), \qquad \hat{v} = \frac{s^2}{\sigma^2}\, \tilde{\Sigma}\, \tilde{d}(\nu) \big( \tilde{T}^{-1}(\hat{x}) - y \big), \qquad \Sigma = \Big( \lambda G^{-1} + \frac{s^2}{\sigma^2}\, d(\nu) \Big)^{-1}, $$
where $\hat{x} = \tilde{d}(\nu)^{-1} \tilde{P} x \in \mathbb{R}^{DM}$ is the shape for which $\tilde{T}^{-1}(\hat{x})$ is interpreted as the inverse alignment from $x$ to $y$, and $\tilde{\Sigma} = \Sigma \otimes I_D$. For reference, we list the vectors of size $DM$ that are related to $\hat{q}_v(v)$ and used hereafter: $\hat{u} = y + \hat{v}$, the deformed shape before the similarity transformation, and $\hat{y} = \tilde{T}(\hat{u})$, the deformed shape after the similarity transformation.

Fig. 1. Notation with respect to corresponding points. The point sets colored red and blue represent the source point set $Y$ and the target point set $X$, respectively. The variable $c_n \in \{0, 1\}$ indicates whether or not $x_n$ is a non-outlier, whereas the variable $e_n \in \{1, \dots, M\}$ specifies the index of the point in $Y$ that corresponds to $x_n$. Target points $x_1$, $x_2$, and $x_3$ represent a point that corresponds to $y_m$, a non-outlier that does not correspond to $y_m$, and an outlier, respectively.

This proposition provides some insight into motion coherence. Because $T^{-1}(\hat{x}_m) - y_m$ is interpreted as the residual for the $m$th source point, the meaning of the mean vector $\hat{v} = \frac{s^2}{\sigma^2} \tilde{\Sigma} \tilde{d}(\nu)(\tilde{T}^{-1}(\hat{x}) - y)$ can be explained intuitively as follows: if $\lambda$ is large, $\hat{v}_m$ is computed by smoothing the weighted residual vectors $\nu_m (T^{-1}(\hat{x}_m) - y_m)$ with the basis functions $g_{m1}, \dots, g_{mM}$. In contrast, if $\lambda$ is close to zero, $\hat{v}_m$ is approximated as $\hat{v}_m \approx T^{-1}(\hat{x}_m) - y_m$, i.e., $\hat{v}_m$ is estimated as the residual vector $T^{-1}(\hat{x}_m) - y_m$ without smoothing. In Section 4.5, we further discuss motion coherence from a statistical aspect.
Remark. The proposition allows us to compute $\langle \alpha_m \rangle = \exp(E[\ln \alpha_m])$ and $\langle \phi_{mn} \rangle = \exp(E[\ln \phi_{mn}])$, which are required to update $q_2(c, e)$. By using standard results regarding the Dirichlet distribution and Remark 1 in Appendix D, available in the online supplemental material, $\langle \alpha_m \rangle$ and $\langle \phi_{mn} \rangle$ are updated as follows:
$$ \langle \alpha_m \rangle = \exp\big\{ \psi(\kappa + \nu_m) - \psi(\kappa M + \hat{N}) \big\}, \quad (6) $$
$$ \langle \phi_{mn} \rangle = \phi(x_n; \hat{y}_m, \sigma^2 I_D) \exp\Big\{ -\frac{s^2}{2\sigma^2}\, \mathrm{Tr}(\sigma_m^2 I_D) \Big\}, \quad (7) $$
where $\psi(\cdot)$ is the digamma function, also known as the psi function, and $\sigma_m^2$ is the $m$th diagonal element of $\Sigma$.

Update of q 2 ðc; eÞ
Next, we proceed to the update of $q_2(c, e)$, which encodes the shape correspondence between $X$ and $Y$. Suppose $q_1(v, \alpha)$ and $q_3(\rho, \sigma^2)$ are given. From Eq. (4), we obtain the closed-form expression of $\hat{q}_2(c, e)$ as follows:

Proposition 2. The approximated posterior distribution $\hat{q}_2(c, e)$ is a combination of a Bernoulli distribution and a categorical distribution, and is defined as follows:
$$ \hat{q}_2(c, e) = \prod_{n=1}^{N} \hat{q}_2^{(n)}(c_n, e_n), \qquad \hat{q}_2^{(n)}(c_n, e_n) = (1 - \nu'_n)^{1 - c_n} \Big\{ \prod_{m=1}^{M} p_{mn}^{\delta_m(e_n)} \Big\}^{c_n}, $$
where $p_{mn}$ and $\nu'_n$ are defined as
$$ p_{mn} = \frac{(1 - \omega) \langle \alpha_m \rangle \langle \phi_{mn} \rangle}{\omega\, p_{\mathrm{out}}(x_n) + (1 - \omega) \sum_{m'=1}^{M} \langle \alpha_{m'} \rangle \langle \phi_{m'n} \rangle}, \qquad \nu'_n = \sum_{m=1}^{M} p_{mn}. \quad (8) $$
From this proposition, we see that $\hat{q}_2(c, e)$ is factorized as $\prod_{n=1}^{N} \hat{q}_2^{(n)}(c_n, e_n)$. Additionally, the proposition provides the following observations, some of which are consistent with the description in Notation 3: The definition of $p_{mn}$ in Proposition 2 is consistent with the one in Notation 3 because $E[c_n \delta_m(e_n)] = \hat{q}_2^{(n)}(c_n = 1, e_n = m) = p_{mn}$. The posterior marginal distribution of $c_n$ is the Bernoulli distribution with probability $\nu'_n$, and thus the posterior probability of $x_n$ being a non-outlier is $\nu'_n$. The posterior conditional distribution of $e_n$ given $c_n = 1$ is the categorical distribution with the $M$ category probabilities $\{p_{mn}/\nu'_n\}_{m=1}^{M}$. The number of target points matching $y_m$ can be estimated as $\nu_m$ because $E[\sum_{n=1}^{N} c_n \delta_m(e_n)] = \nu_m$. The number of matching points between the target and source point sets, $X$ and $Y$, can be estimated as $\hat{N}$ because $E[\sum_{n=1}^{N} \sum_{m=1}^{M} c_n \delta_m(e_n)] = \hat{N}$. This proposition guarantees that the update of $P = (p_{mn})$ with Eq. (8) increases the lower bound. The terms relating to $P$, i.e., $\nu = P \mathbf{1}_N$, $\nu' = P^T \mathbf{1}_M$, and $\hat{N} = \nu^T \mathbf{1}_M$, are updated in turn. We also see that the expectations required by the update of $q_2$ are summarized in $\langle \alpha_m \rangle$ and $\langle \phi_{mn} \rangle$, which are computed by Eqs. (6) and (7).
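Given the two expectations, the update of Eq. (8) is elementwise over the $M \times N$ grid. A minimal sketch (function name ours; the $\langle \alpha_m \rangle$ and $\langle \phi_{mn} \rangle$ arrays are passed in precomputed):

```python
import numpy as np

def update_q2(alpha_exp, phi_exp, omega, p_out):
    """Update p_mn, nu, nu', and N_hat from <alpha_m> and <phi_mn>.

    alpha_exp: (M,) values <alpha_m>; phi_exp: (M, N) values <phi_mn>;
    omega: outlier probability; p_out: outlier density value.
    """
    num = (1.0 - omega) * alpha_exp[:, None] * phi_exp
    den = omega * p_out + num.sum(axis=0, keepdims=True)
    P = num / den
    nu = P.sum(axis=1)          # nu_m: expected matches per source point
    nu_p = P.sum(axis=0)        # nu'_n: probability that x_n is a non-outlier
    return P, nu, nu_p, nu.sum()
```

With `omega = 0`, every target point is a non-outlier and each `nu_p` entry is one; increasing `omega` transfers posterior mass to the outlier component.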

Update of q₃(ρ, σ²)
The update of q₃(ρ, σ²) corresponds to the update of the similarity transformation T, parameterized by ρ = (s, R, t). Suppose q₁(v, α) and q₂(c, e) are given. From the assumption that q₃ is a Dirac delta function, q₃ is characterized only by its mode. Therefore, maximizing the lower bound L(q) without resorting to Eq. (4) allows us to find the mode of q₃(ρ, σ²) that increases the lower bound. Let us define x̄ and ū as the weighted means of the back-transformed target points T⁻¹(x_n) and the deformed source points û_m = y_m + v̂_m, with weights ν′_n and ν_m, respectively, and let S_xu and S_uu denote the corresponding weighted (cross-)covariance matrices, where σ_m² is the mth diagonal element of Σ. Using this notation, we obtain the following proposition:

Proposition 3. Suppose the approximated posterior distribution q₃(ρ, σ²) is a Dirac delta function. Given q₁(v, α) and q₂(c, e), the lower bound is maximized by the following equations:

R̂ = Φ d(1, …, 1, |ΦΨᵀ|) Ψᵀ,   ŝ = Tr(R̂ᵀS_xu) / Tr(S_uu),   t̂ = x̄ − ŝR̂ū,

where Φ and Ψ are the orthogonal matrices of size D × D obtained by the singular value decomposition of S_xu, i.e., S_xu = ΦS′_xuΨᵀ with the diagonal matrix of singular values S′_xu ∈ R^{D×D}. We note that the update of σ² is important, as it gradually shrinks the radius that defines the candidate set of target points corresponding to each source point [5].
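Under these assumptions, the mode of q₃ is a weighted Procrustes solution; a sketch follows (names are illustrative, and the scale update ŝ = Tr(R̂ᵀS_xu)/Tr(S_uu) is our reconstruction of the standard solution):

```python
import numpy as np

def similarity_from_svd(S_xu, x_bar, u_bar, S_uu_trace):
    """Rotation from the SVD of S_xu, projected onto SO(D) by the sign
    correction d(1, ..., 1, |Phi Psi^T|); then scale and translation."""
    D = S_xu.shape[0]
    Phi, sv, PsiT = np.linalg.svd(S_xu)
    d = np.ones(D)
    d[-1] = np.sign(np.linalg.det(Phi @ PsiT))   # enforce det(R) = +1
    R = Phi @ np.diag(d) @ PsiT
    s = np.trace(R.T @ S_xu) / S_uu_trace        # assumed scale update
    t = x_bar - s * R @ u_bar                    # t = x_bar - s R u_bar
    return s, R, t
```

Applying a known similarity transform to a point cloud and recomputing (s, R, t) from the resulting moments recovers the original parameters.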

Outlier Distribution
Myronenko et al. used the uniform distribution defined as p_out(x_n) = 1/N. This distribution does not satisfy one of the conditions required of a probability density function: the integral of p_out(x_n) over R^D is not one. If the nonzero region of p_out(x_n) is bounded, the integral approaches zero as N becomes larger; otherwise, the integral diverges. Suppose V is the volume of the minimum bounding box that covers the target point set X. To avoid this normalization issue, we use an estimate of the D-dimensional uniform distribution: p_out(x_n) = 1/V if x_n is inside the bounding box and p_out(x_n) = 0 otherwise.
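A minimal sketch of this outlier density (names are illustrative):

```python
import numpy as np

def outlier_density(X):
    """p_out(x) = 1/V inside the minimum bounding box of the target set X,
    and 0 outside, where V is the box volume."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    V = np.prod(hi - lo)
    def p_out(x):
        inside = np.all((x >= lo) & (x <= hi), axis=-1)
        return np.where(inside, 1.0 / V, 0.0)
    return p_out
```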

Initialization
VBI requires the initialization of the expectations of the random variables, as described in Section 4.1. We initialize the expectations in a non-informative manner; we set v̂ = 0 and t̂ = 0. The initial guess of the residual variance σ² controls the randomness of point matching during the early stage of the optimization. This is suggested by the fact that p_mn approaches 1/M for any m and n if the outlier probability ω is set to zero and the residual variance σ² in Eqs. (7) and (8) goes to infinity. CPD uses σ²_ini = (1/(DMN)) Σ_{m=1}^M Σ_{n=1}^N ||x_n − y_m||² as an initial guess. For our Bayesian variant of CPD, the use of σ²_ini is often too informative, especially when the target shape is largely rotated, and the solution often falls into local optima. Therefore, we use γσ²_ini as our initial guess of σ², where γ is a positive constant. As γ increases, point matching during the early stage of the optimization becomes more random, and a moderately large γ often stabilizes registration. The BCPD algorithm is summarized in Fig. 2.
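The initial guess can be sketched as follows (names are ours; expanding the double sum reduces it to separate sums over X and Y, the reordering trick that makes the initialization an O(M + N) computation):

```python
import numpy as np

def sigma2_init(X, Y, gamma=1.0):
    """gamma * (1/(D M N)) * sum_{m,n} ||x_n - y_m||^2, evaluated in
    O(M + N) time by expanding the double sum."""
    N, D = X.shape
    M = Y.shape[0]
    sx, sy = X.sum(axis=0), Y.sum(axis=0)
    total = M * (X ** 2).sum() + N * (Y ** 2).sum() - 2.0 * sx @ sy
    return gamma * total / (D * M * N)
```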

Relation to the Original CPD
In this section, we show that CPD is recovered from BCPD. First, we summarize the differences between CPD and BCPD: 1) The transformation model T(y_m) is a combination of non-rigid and similarity transformations, where the non-rigidity is derived from the displacement vector v. 2) The conditional distribution p(v|y) = φ(v; 0, λ⁻¹G̃) and the prior distribution p(α) = Dir(α|κ1_M) are imposed on the displacement vector v and the mixing coefficients α, respectively. 3) The optimization is based on VBI instead of the EM algorithm used in CPD. 4) The initial guess of σ² is parameterized by γ, which controls the randomness of point matching during the early stage of the optimization.
The CPD algorithm can be recovered by removing these differences, i.e., by imposing the following constraints on our Bayesian formulation: 1) The parameter κ is set to infinity.
2) The similarity transformation T is replaced by the identity function.
3) The posterior distribution q_v(v) is assumed to be a Dirac delta function. 4) The outlier distribution is defined as p_out(x_n) = 1/N. 5) The parameter γ is set to 1.

Condition 1 constrains all mixing coefficients to be equivalent, i.e., ⟨α_m⟩ = 1/M for all m, and recovers the definition of CPD with respect to the mixing coefficients. Condition 2 removes the update of ρ = (s, R, t) and replaces T and T⁻¹ with the identity function. Condition 3 means that the second moment of q_v(v), denoted by E[vvᵀ], is equivalent to the product of the first moments, i.e., v̂v̂ᵀ. The condition drops the terms involving the posterior covariance Σ, such as exp{−(s²/(2σ²)) Tr(σ_m² I_D)}, from the equations for updating ⟨φ_mn⟩, S_uu, and σ², respectively. Condition 4 replaces p_out(x_n) with 1/N. Therefore, the conditions provide the same equation for updating p_mn as in the E-step of the CPD algorithm:

p_mn = exp{−||x_n − (y_m + v̂_m)||²/(2σ²)} / ( c′ + Σ_{m′=1}^M exp{−||x_n − (y_{m′} + v̂_{m′})||²/(2σ²)} ),   (9)

where c′ = (ω/(1 − ω))(M/N)|2πσ²I_D|^{1/2}. Let us denote the expectation of the displacement vector, v̂, by the product between G̃ = G ⊗ I_D ∈ R^{DM×DM} and a vector w ∈ R^{DM}, i.e., v̂ = G̃w. From the definitions of v̂ and Σ, we obtain the same equations for updating w and σ² as in the M-step of the CPD algorithm:

w = (G̃ + λσ² d(ν̃)⁻¹)⁻¹ (x̂ − y),

where x̂ = d(ν̃)⁻¹P̃x. Therefore, the CPD algorithm is recovered by imposing the conditions on the Bayesian formulation of CPD, i.e., BCPD is a generalization of CPD. Here, we note some details regarding the algorithms. Convergence of CPD, as well as BCPD, is guaranteed by VBI. If we replace condition 2 with v = 0, BCPD becomes the rigid CPD, meaning that the non-rigid CPD and the rigid CPD are combined into a single algorithm. We also note that Eq. (9) suggests a connection between the CPD algorithm and the iterative closest point (ICP) algorithm [7], which is one of the best-known registration algorithms. We describe this connection below to provide insight into BCPD.

[Fig. 2 box, condensed: Initialization — σ² ← γσ²_ini with σ²_ini = (DMN)⁻¹ Σ_{m,n} ||x_n − y_m||², and G = (g_mm′) with g_mm′ = K(y_m, y_m′). Optimization — repeat until convergence: update P = (p_mn) and related terms, including ⟨φ_mn⟩ = φ(x_n; ŷ_m, σ²I_D) exp{−(s²/(2σ²)) Tr(σ_m² I_D)}; update Σ, v̂, û, and ⟨α_m⟩ for all m; update s, R, t, σ², ŷ = T(y + v̂), and related terms, with t = x̄ − sRū.]

Fig. 2. The BCPD algorithm. The tilde symbol attached to a matrix denotes the Kronecker product between the matrix and I_D, e.g., G̃ = G ⊗ I_D and P̃ = P ⊗ I_D, whereas the tilde symbol attached to a vector denotes the Kronecker product between the vector and 1_D, e.g., ν̃ = ν ⊗ 1_D and ν̃′ = ν′ ⊗ 1_D. The symbol ψ represents the digamma function. The mth diagonal element of Σ is denoted by σ_m². The singular value decomposition is denoted by "svd." The determinant and trace of a square matrix are denoted by |·| and Tr(·), respectively. Unlike CPD, BCPD simultaneously estimates the variables of non-rigid and similarity transformations. The acceleration of the computations of G, Σ, and P is discussed in Section 4.6.
Relation to ICP. We show that ICP is a special case of BCPD regarding the estimation of corresponding points. ICP solves the registration problem by repeatedly finding, for each point in the target shape, the closest point in the current deformed shape, and updating the deformed shape using the current pairs of closest points. From the description in Section 3.2, we see that the first and second steps of ICP are associated with the E-step and M-step of CPD, respectively. To describe the connection between BCPD and ICP, we focus on the E-step of CPD, i.e., the computation of matching probabilities. From Eq. (9), the matching probability p_mn becomes a so-called softmax function for a fixed n if ω is zero and the points in Y are distinct from one another. Here, we assume that σ² is infinitesimal. Under these conditions, we have p_mn = 1 if y_m is the closest point to x_n among {y_1, …, y_M} and p_m′n = 0 for all m′ ≠ m. This corresponds to the first step of ICP, i.e., finding the closest point in Y to x_n. Therefore, we see that the first step of ICP is a special case of the E-step of the CPD algorithm, which means that BCPD is a generalization of ICP concerning correspondence estimation.
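The limiting behavior can be illustrated numerically (a sketch with ω = 0 and zero displacements; names are ours): the column-wise softmax collapses to the nearest-neighbor indicator as σ² → 0, and flattens to 1/M as σ² → ∞.

```python
import numpy as np

def matching_probabilities(X, Y, sigma2):
    """Column-wise softmax of -||x_n - y_m||^2 / (2 sigma2), i.e., an
    Eq.-(9)-style E-step with the outlier term and displacements set to zero."""
    d2 = ((Y[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # (M, N)
    logits = -d2 / (2.0 * sigma2)
    logits -= logits.max(axis=0, keepdims=True)               # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=0, keepdims=True)
```

The σ² → ∞ limit also illustrates the initialization argument above: a very large initial σ² (large γ) makes the early matching nearly uniform over the M source points.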

Statistical View of Motion Coherence
The Bayesian formulation in Section 4.2 provides a clear distinction between the tuning parameters that control motion coherence. Here, we show that the directional correlation between displacement vectors depends only on the Gram matrix G, regardless of λ. As shown in Fig. 3, if G is the Gaussian affinity matrix parameterized by β, i.e., G = (g^(β)_mm′) with g^(β)_mm′ = exp(−||y_m − y_m′||²/(2β²)), we will see that 1) β controls the directional correlation between displacement vectors, and 2) λ controls the expected length of displacement vectors. We begin with the case in which G is non-Gaussian. From the definition of p(v|y), i.e., Eq. (5), the prior covariance matrix between v_m and v_m′ is defined as

Cov(v_m, v_m′) = λ⁻¹ g_mm′ I_D   (10)

for all m, m′ ∈ {1, …, M}. From this equation, the Pearson correlation between v_md and v_m′d is obtained as

Corr(v_md, v_m′d) = g_mm′ / √(g_mm g_m′m′).

The correlation does not depend on the index d, and thereby defines a measure of directional correlation between displacement vectors. Intuitively, the correlation represents the deviation of the angle between displacement vectors; the angle becomes zero without randomness if the correlation is 1, whereas the angle becomes random if the correlation is 0. We note that the correlation does not depend on λ; λ does not contribute to the deviation of the angle between displacement vectors. We also note that the correlation always becomes 1 if the two points y_m and y_m′ are identical and the corresponding g_mm′ is nonzero. Next, we assume that G is Gaussian. We begin with the role of β. From Eq. (10), the Pearson correlation between the dth elements of v_m and v_m′ is obtained as

Corr(v_md, v_m′d) = exp(−||y_m − y_m′||²/(2β²))   (11)

for all d because g^(β)_mm is always 1 for all m. As discussed previously, the correlation represents a measure of the directional correlation between v_m and v_m′. The correlation depends only on β, regardless of λ. Therefore, β is interpreted as the parameter that controls the directional correlation between displacement vectors.
From this equation, we also see that the correlation increases as the distance between y m and y m 0 decreases, and it becomes 1 if the distance is 0. This means that the displacement field for registering point sets is smooth, i.e., displacement vectors gradually become parallel as the distance between two points decreases.
We then proceed to the role of λ when G is Gaussian. From Eq. (10), we see that λ||v_m||² follows the χ²-distribution with D degrees of freedom. Therefore, the square root of the expected squared norm of v_m, denoted by l_m, is obtained as follows:

l_m = √(E[||v_m||²]) = √(D/λ),   (12)

which does not depend on m. From this equation, we see that λ is the parameter that controls the expected length of displacement vectors, regardless of β. We note that Eqs. (11) and (12) hold for a non-Gaussian kernel if the kernel function parameterized by β is normalized, i.e., g^(β)_mm = 1 for all m. We finally note that BCPD reduces to a rigid registration technique for sufficiently large λ because l_m approaches zero for all m as λ goes to infinity.
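These roles of β and λ can be checked empirically (an illustrative sketch; the sampler and its names are ours): draw displacement fields from the prior v ~ N(0, λ⁻¹ G ⊗ I_D) and verify that the sample correlation matches the Gaussian affinity and that the mean squared norm matches D/λ.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_displacements(Y, beta, lam, n_samples):
    """Draw v ~ N(0, lam^{-1} G kron I_D) with the Gaussian kernel G,
    using a Cholesky factor of G applied independently per coordinate axis."""
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    G = np.exp(-d2 / (2.0 * beta ** 2))
    L = np.linalg.cholesky(G + 1e-10 * np.eye(len(Y)))
    D = Y.shape[1]
    z = rng.standard_normal((n_samples, len(Y), D))
    # v[s, m, d] = sum_k L[m, k] z[s, k, d] / sqrt(lam)
    return np.einsum('mk,skd->smd', L / np.sqrt(lam), z)
```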

Acceleration of the BCPD Algorithm
In this section, we present an acceleration scheme for the BCPD algorithm that can be applied even if the Gram matrix G is non-Gaussian. The bottleneck in BCPD lies in the computations of the Gram matrix G, the posterior matching probability P, and the covariance matrix Σ = (λG⁻¹ + (s²/σ²)d(ν))⁻¹, whose computational costs are O(M²), O(MN), and O(M³), respectively. The basis of our acceleration scheme is the combination of the Nyström method [56] and the KD tree search [57], which allows the computation of G, P, and Σ in at most O((M + N)log(M + N)) time. We note that the initialization of σ² is not a bottleneck, as it can be computed in O(M + N) time by changing the order of the summation and multiplication.
By using the fast Gauss transform (FGT) [58], the original CPD accelerates the eigendecomposition of G at the initialization step and the computation of P at the E-step. Owing to this acceleration, CPD is scalable to large point sets, yet (1) the eigendecomposition of G is still time-consuming [59], and (2) the computation of P requires the evaluation of truncated Gaussian distributions near convergence [2], the computational cost of which is O(MN). Dupej et al. relaxed (1) by computing the approximate eigendecomposition of G using the Nyström method and the improved fast Gauss transform (IFGT) [59]. The IFGT is an improved version of the FGT, and it efficiently approximates the sum of Gaussian functions in high-dimensional spaces [60]. Lu et al. relaxed (2) by reducing the number of EM iterations using the squared iterative EM algorithm, combined with a variant of the IFGT [61]. A problem with using the FGT and IFGT is that the Gram matrix G must be Gaussian. To address this issue, we accelerate the computation of G using only the Nyström method. In addition, to relax (2), we accelerate the computation of P by the Nyström method during the early stage of the optimization and by the KD tree search near convergence, neither of which requires O(MN) computation.

Fast Computation of G and Σ
The Nyström method generates a low-rank approximation of a Gram matrix using random sampling in O(M) time, and it can also be applied to the approximate computation of the rank-restricted eigendecomposition in O(M) time [56]. Suppose K is the number of random samples. We also suppose that the approximate eigendecomposition of G is denoted by G ≈ QΛQᵀ, where Q ∈ R^{M×K} is a column-orthonormal matrix and Λ ∈ R^{K×K} is a diagonal matrix whose kth diagonal element is the kth approximate eigenvalue of G. The use of the eigendecomposition and the Woodbury matrix identity replaces the computation of Σ as follows:

Σ ≈ (σ²/s²) { d(ν)⁻¹ − d(ν)⁻¹ Q ( (s²/(λσ²)) Λ + Qᵀ d(ν)⁻¹ Q )⁻¹ Qᵀ d(ν)⁻¹ }.

Here, we note that evaluating all elements of the approximated Σ, which would cost O(M²), is not required; the computations involving Σ that are actually needed are the displacement vector v̂ = (s²/σ²) Σ̃ d(ν̃)(T⁻¹(x̂) − y) and the diagonal elements of Σ for computing ⟨φ_mn⟩, S_uu, and σ². The computational cost of evaluating all of them is O(M) if we assume that K is a constant.
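The Woodbury route can be sketched and checked numerically as follows (an illustrative sketch assuming the low-rank substitution G⁻¹ ≈ QΛ⁻¹Qᵀ; with K = M and the exact eigendecomposition, it reproduces the direct inverse):

```python
import numpy as np

def sigma_woodbury(Q, Lam, nu, lam, c):
    """Sigma = (lam G^{-1} + c d(nu))^{-1} with G ~= Q diag(Lam) Q^T and
    c = s^2 / sigma^2, via the Woodbury identity; only K x K systems are solved."""
    dinv = 1.0 / (c * nu)                  # (c d(nu))^{-1}, elementwise
    B = Q * dinv[:, None]                  # (c d(nu))^{-1} Q
    core = np.diag(Lam) / lam + Q.T @ B    # lam^{-1} diag(Lam) + Q^T (c d(nu))^{-1} Q
    return np.diag(dinv) - B @ np.linalg.solve(core, B.T)
```

In practice only the diagonal of Σ and matrix-vector products with Σ are formed, keeping the cost O(M) for constant K; the dense return value here is only for verification.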

Fast Computation of P
The first step in accelerating the evaluation of P is to replace it with the evaluation of the Gaussian affinity matrix between Y and X, denoted by K_YX ∈ R^{M×N}, where the mnth element of K_YX is defined as exp{−||x_n − ŷ_m||²/(2σ²)}. Suppose x⁽ᵈ⁾ = (x_1d, …, x_Nd)ᵀ ∈ R^N is the vector constructed from the dth elements of x_1, …, x_N. Then, the terms involving P are replaced by those involving K_YX on the basis of the following observations: 1) All computations involving P can be expressed as one of the products P̃x, ν′ = Pᵀ1_M, or ν = P1_N. 2) The vector P̃x can be computed through the products Px⁽¹⁾, …, Px⁽ᴰ⁾. 3) The computations of Px⁽ᵈ⁾, Pᵀ1_M, and P1_N can be replaced by products involving K_YX. The first observation is shown in Fig. 2. The third observation is obtained as follows:

Px⁽ᵈ⁾ = b ⊙ {K_YX (x⁽ᵈ⁾ ⊘ u)},   Pᵀ1_M = (K_YXᵀ b) ⊘ u,   P1_N = b ⊙ {K_YX (1_N ⊘ u)},   u = c″1_N + K_YXᵀ b,

where ⊙ and ⊘ represent the elementwise product and division, respectively, c″ = (ω/(1 − ω)) p_out(x_n) |2πσ²I_D|^{1/2} is a constant, and b ∈ R^M represents a vector whose mth element is b_m = ⟨α_m⟩ exp{−(s²/(2σ²)) Tr(σ_m² I_D)}. Therefore, if the computation of terms involving K_YX is accelerated, the evaluation of terms involving P is also accelerated. We accelerate them separately for large σ and small σ. a) For Large σ. The matrix-vector products involving K_YX can be approximated in O(M + N) time using the Nyström method. Suppose V is a point set constructed by randomly sampling points from the union of X and Y, and suppose J is the number of elements in V. The Nyström method guarantees

K_YX ≈ K_YV K_VV⁻¹ K_XVᵀ,

where K_YV ∈ R^{M×J}, K_XV ∈ R^{N×J}, and K_VV ∈ R^{J×J} are the Gaussian affinity matrices between Y and V, between X and V, and for V itself, respectively. Therefore, if D and J are assumed to be constants much smaller than M and N, the approximations of the products involving K_YX can be computed in O(M + N) time. b) For Small σ.
One issue with using the Nyström method is the inaccuracy that originates from the approximation, which often hampers convergence. To avoid this convergence issue without sacrificing computing time, we replace the Nyström method for computing P with the KD tree search if σ² is sufficiently small. This approach is reasonable because, as the optimization approaches convergence, σ² is typically small and almost all elements of K_YX become nearly zero. The nonzero elements of K_YX can be found by the KD tree search in O((M + N)log(M + N)) time. BCPD with the KD tree search often registers point sets as accurately as BCPD with exact computations while reducing the computational costs. The computational costs for the bottleneck computations are summarized in Table 1.
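Both regimes can be sketched as follows (illustrative helpers; V is a landmark set, and `scipy.spatial.cKDTree` stands in for the KD tree search — the paper's actual implementation is in C):

```python
import numpy as np
from scipy.spatial import cKDTree

def gaussian_affinity(A, B, sigma2):
    """Dense Gaussian affinity matrix exp(-||a - b||^2 / (2 sigma2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma2))

def nystrom_matvec(Y, X, V, sigma2, x_vec):
    """Large sigma: approximate K_YX @ x_vec as K_YV K_VV^{-1} K_VX @ x_vec,
    which costs O(M + N) for a fixed number of landmarks J = |V|."""
    K_VV = gaussian_affinity(V, V, sigma2)
    rhs = gaussian_affinity(V, X, sigma2) @ x_vec
    return gaussian_affinity(Y, V, sigma2) @ np.linalg.solve(
        K_VV + 1e-10 * np.eye(len(V)), rhs)

def kdtree_pairs(Y_hat, X, radius):
    """Small sigma: indices (m, n) with ||x_n - y_hat_m|| <= radius; the
    remaining entries of K_YX are treated as zero near convergence."""
    tree = cKDTree(X)
    return [(m, n) for m, y in enumerate(Y_hat)
            for n in tree.query_ball_point(y, radius)]
```

When the landmark set V contains all points of X and Y, the Nyström product is exact, which makes the approximation easy to sanity-check on small examples.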

EXPERIMENTS
In this section, we evaluate the registration performance of BCPD. In Section 5.1, we describe the datasets used for numerical evaluations. In Section 5.2, we evaluate the accuracy of non-rigid registration methods. In Section 5.3, we evaluate the registration performance of the accelerated BCPD. In Section 5.4, we apply BCPD to datasets with non-artificial deformations. In Section 5.5, we evaluate the accuracy of rigid registration methods.

Datasets and Demonstrations
We begin with the datasets used in Sections 5.2 and 5.3, shown in Fig. 4. We also provide several demonstrations.
Datasets. The first dataset is the bunny dataset, which is included in the CPD software. The dataset is composed of two bunny shapes: one is the same shape as that distributed in the 3D Scanning Repository (http://graphics.stanford.edu/data/3Dscanrep), and the other is deformed in a non-rigid manner. We downloaded the armadillo and dragon shapes from the 3D Scanning Repository website, and we obtained the monkey shape from Blender. For each shape, we created the corresponding deformed shape using a non-rigid deformation tool. The numbers of points in the bunny, monkey, dragon, and armadillo data were 8,171, 7,958, 437,645, and 106,289, respectively. We note that the ground truth of corresponding points is known for these datasets, and thus exact registration errors can be evaluated.
Demonstrations. Figs. 5 and 6 show demonstrations of BCPD using the bunny and armadillo datasets with artificial disturbance. We used Gaussian and inverse multiquadric kernel functions for the bunny and armadillo datasets, respectively. Even when the kernel was non-Gaussian, BCPD successfully registered the armadillo dataset, which has more than 100,000 points. We summarized several demonstrations of BCPD in Supplementary Video 1, available online.
Generation of Artificial Disturbance. To evaluate robustness against rotation, outliers, and clustered outliers, we generated target point sets with artificial disturbance by repeating the following procedure. Five hundred points are randomly subsampled from both the source and target point sets in the bunny, monkey, and dragon datasets so that registration methods without an acceleration scheme can handle the generated point sets. The subsampled target point sets are modified by one of the following operations: rotation, the addition of outliers that follow a uniform distribution, or the addition of clustered outliers that follow a Gaussian distribution. We changed the angle of rotation, the number of outliers, and the number of clustered outliers as described hereafter. a) Rotation. We changed the rotation angles from −120 to 120 degrees at intervals of five degrees. The number of generated target point sets was 49 in total for each dataset.

Table 1 (note). The computations of terms involving G and Σ are accelerated by the Nyström method throughout the optimization, whereas those involving P are accelerated by the Nyström method for large σ and by the KD tree search for small σ.

Fig. 5. Optimization trajectories for the bunny data with rotation and clustered outliers. The points in the target and deformed shapes are colored blue and red, respectively. The leftmost column shows the initial point sets, and the optimization proceeds from left to right.

Fig. 6. Optimization trajectories for the armadillo data with rotation using the inverse multiquadric kernel. Even if the kernel is non-Gaussian, BCPD is scalable to point sets containing more than 100,000 points.

Fig. 7. Evaluation of robustness against the rotations of target point sets: bunny (left), monkey (middle), and dragon (right). The non-rigid CPD with or without pre-alignment by the rigid CPD is denoted by "CPD (r+n)" or "CPD (n)," respectively.

b) Addition of outliers.
We generated outliers that follow a uniform distribution over the bounding box surrounding a target point set. The length of the edge of the bounding box along the dth axis was set to 1.2 l_d, where l_d = max_n{x_nd} − min_n{x_nd}. We changed the number of outliers from 50 to 1,000 at intervals of 50, i.e., from 0.1 to 2.0 at intervals of 0.1 in the outlier/non-outlier ratio. For each ratio, we repeated the generation 100 times. The number of generated target point sets was 2,000 in total for each dataset. c) Addition of clustered outliers. We generated clustered outliers based on an isotropic Gaussian distribution with standard deviation 0.1 after normalizing a target point set. The center of the distribution was randomly selected as a point in the target point set. We changed the number of outliers from 25 to 250 at intervals of 25, i.e., from 0.05 to 0.5 at intervals of 0.05 in the outlier/non-outlier ratio. For each ratio, we repeated the generation 100 times. The number of generated target point sets was 1,000 in total for each dataset.
Evaluation. We used the root-mean-squared distance (RMSD) to measure registration errors. Except for the datasets with rotation, we computed the median RMSD among 100 trials because the generation of outliers and clustered outliers was dependent on random numbers.
Results. Here, we summarize the results of the evaluation. a) Robustness against rotation. Fig. 7 shows the result of the performance evaluation regarding robustness against target rotation. We added the evaluation of the non-rigid CPD after pre-alignment by the rigid CPD to compare BCPD with a naïve combination of non-rigid and rigid registration. For all datasets, the range of angles within which BCPD registered correctly was the largest among the competitors; BCPD outperformed the naïve combination of the rigid CPD and non-rigid CPD, suggesting an advantage of the simultaneous optimization of rigid and non-rigid transformations. b) Robustness against outliers. Fig. 8 shows the result of the performance evaluation regarding robustness against outliers that follow a uniform distribution. BCPD outperformed CPD for both ω = 0.1 and ω = 0.2 despite the similarity between the algorithms. This suggests that the difference between the outlier distributions of BCPD and CPD affected their performance. TPS-RPM was more robust against outliers than BCPD with ω = 0.1; however, BCPD was comparable with TPS-RPM when ω = 0.2. c) Robustness against clustered outliers. Fig. 9 shows the result of the performance evaluation regarding robustness against clustered outliers. The registration accuracies of BCPD and CPD were moderately similar for both λ = 2.0 and λ = 100 compared with case b). TPS-RPM was more robust against clustered outliers than BCPD with λ = 2.0; however, BCPD was comparable with TPS-RPM when λ = 100.

Performance of the Accelerated BCPD
In this section, we evaluate the registration performance of the accelerated BCPD to show that the acceleration maintains registration accuracy and reduces computing times.

Acceleration Parameters Versus Registration Accuracy
We evaluated the registration accuracy of BCPD with respect to K and J, i.e., the parameters of the Nyström method for accelerating the computations of {G, Σ} and P, respectively. Further, we evaluated the registration accuracy of the accelerated CPD algorithm for comparison. Dataset and Parameters. We used the bunny dataset shown in Fig. 4. We normalized the dataset before registration so that the mean and variance of the elements in a point set vector were 0 and 1, respectively. For both methods, we used the Gaussian kernel and (β, λ, ω) = (2, 2, 0). For BCPD, we used (γ, κ) = (1, ∞), where γ = 1 and κ → ∞ lead to the same assumptions on the initial σ² and the mixing coefficients as in CPD. The maximum number of iterations was set to 500. When the KD tree search option was chosen, we switched from the Nyström method to the KD tree search with search radius min(0.15, 7σ) at σ = 0.2. To accelerate CPD, we used the following parameters: opt.method = 'nonrigid_lowrank'; opt.fgt = 2; opt.eigfgt = 1; opt.viz = 0; opt.corresp = 0; opt.normalize = 1; opt.tol = 1e-10; opt.max_it = 100, where the first three options specify non-rigid registration with the use of the fast Gauss transform, the eigendecomposition of G, and the computation of truncated Gaussian distributions.
Results. First, we investigated the influence of J, i.e., the number of Nyström points for accelerating P, on the registration performance. We accelerated the computations of G and Σ by setting K = 70 because the direct inversion of Σ was severely time-consuming and memory-inefficient. We summarize the result in Fig. 10; the second figure is an enlarged view of the first one. The RMSDs with the Nyström method decreased as J increased; however, the RMSD was relatively large even for J = 600, suggesting the need for more accurate computations near convergence. Even when the use of the KD tree search was allowed, it was not activated for J = 50, 100, and 150 because the corresponding values of σ were always greater than the cutoff, i.e., 0.2. For J ≥ 200, the KD tree search was always activated. The corresponding RMSDs were very close to the RMSD with the direct computation of P and were below 0.0344, the RMSD of the accelerated CPD. These results suggest that the accelerated calculation of P was sufficiently accurate for the bunny dataset.
The third figure in Fig. 10 shows the influence of K on the registration performance. We calculated RMSDs in the same way as in the previous case; however, we fixed J at 300 and varied K from 10 to 150. The minimum RMSD was obtained at K = 50, suggesting that the approximation of G with K = 50 is sufficiently accurate for the bunny dataset.

Computing Time Versus Registration Accuracy
We evaluated the computing time and registration accuracy of BCPD and CPD; the latter was the only method among the competitors in Section 5.2 scalable to point sets with 10⁵ points. We used the implementation developed by Myronenko et al. because the faster implementations of CPD [59], [61] were not publicly available.
Dataset and Parameters. Using the armadillo dataset shown in Fig. 4, we generated target and source point sets by randomly extracting 10,000, 20,000, …, 100,000 points. We set the parameters shared by BCPD and CPD to (β, λ, ω) = (3, 20, 0). For BCPD, we used the acceleration parameters (J, K) = (300, 100), and we set the remaining parameters of BCPD and CPD to the same values as in Section 5.3.1.
Computational Environment. We used a MacBook Pro (15-inch Retina display, Early 2013, OS X El Capitan 10.11.6) with a 2.4 GHz Intel Core i7 CPU and 16 GB RAM as our computational environment. We implemented the BCPD algorithm in C and used GCC 6.0 as a compiler. We also implemented a parallelization option using OpenMP. We note that CPU time is roughly the same as wall-clock time if the parallelization is disabled.
Results. We first disabled the parallelization option implemented in our software for a fair comparison. The left panel of Fig. 11 shows the evolution of computing time and registration accuracy. BCPD achieved a shorter computing time and a slightly smaller registration error than CPD. In particular, for the point sets containing 10⁵ points, CPD repeated the E-step and M-step 100 times, i.e., the maximum number of iterations, without satisfying its convergence condition. The corresponding computing time and RMSD were 9389.8 s and 0.0079, respectively. BCPD repeated 62 cycles of VBI until convergence, and the corresponding computing time and RMSD were 398.0 s and 0.0070, respectively.
We then measured the computing times of BCPD for different kernel functions with parallelization disabled. Using the point sets composed of 10⁵ points, we compared the Gaussian, inverse multiquadric, and rational quadratic kernels. All of these kernel functions are known to be positive definite [62]. The definitions of the kernel functions are listed in Table 2. The second panel of Fig. 11 shows the evolution of computing time and registration accuracy until convergence. As shown in the figure, the difference in computing times for the three kernels was quite small. This suggests that our acceleration scheme can be applied even if the kernel function is non-Gaussian.
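For reference, the three kernels can be sketched as follows (since Table 2 is not reproduced here, these are standard textbook forms of the named kernels and may differ from the exact parameterization used in the experiments; r² denotes the squared distance ||y − y′||²):

```python
import numpy as np

# Assumed standard forms of the three positive-definite kernels.
def gaussian(r2, beta):
    return np.exp(-r2 / (2.0 * beta ** 2))

def inverse_multiquadric(r2, beta):
    return 1.0 / np.sqrt(r2 + beta ** 2)

def rational_quadratic(r2, beta):
    return 1.0 - r2 / (r2 + beta ** 2)
```

All three decrease monotonically with the squared distance, so each induces a motion-coherence prior in which nearby source points move together.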
Finally, we activated the parallelization option of our software. We measured the computing times of BCPD, changing the number of points in both the target and source point sets from 10⁴ to 10⁵ at intervals of 10⁴. The result is shown in the third panel of Fig. 11. CPU times increased owing to the multithreading overhead caused by parallelization; however, wall-clock times decreased by at least 50 percent. These results suggest that BCPD has an advantage over CPD in terms of computing time.

Fig. 10. Registration error versus the numbers of Nyström points, J and K, used for approximating P and G, respectively. The registration error was measured by the root-mean-squared distance (RMSD) using the bunny dataset. The symbol σ represents the square root of the residual variance after the optimization, i.e., an estimate of the registration error. The second figure is an enlarged view of the first figure, and it additionally includes RMSDs computed by the BCPD that exactly calculated P and by the accelerated CPD.

Real Data Examples
In this section, we report numerical experiments performed using the space-time faces [63] and SHREC'19 human-matching [64] datasets to demonstrate that BCPD can handle non-artificial deformations.

Performance Evaluation Using Space-Time Faces
We evaluate the registration performance of BCPD using a dataset of space-time faces, composed of a 3D+t sequence of facial expressions with the ground truth of corresponding points. The number of points in each face is 23,728. We used the ith and (i + 1)th faces as the source and target point sets for i = 1, …, 80. An example of source and target point sets is shown in Fig. 12. We evaluated the registration accuracy of BCPD and CPD because GMM-REG and TPS-RPM were not scalable enough to register point sets in this dataset. We set the shared parameters of CPD and BCPD to the same values, i.e., (β, λ, ω) = (0.3, 10⁴, 0), and the remaining parameters of BCPD were set to (γ, κ) = (0.1, ∞). The acceleration parameters of CPD were set to the same values as those in Section 5.3.1, whereas those of BCPD were set to (J, K) = (300, 150), and the Nyström method was switched to the KD tree search with search radius min(0.15, 7σ) if σ < 0.2. Registration errors were measured using the RMSD from the ground truth. The right panel of Fig. 12 shows the boxplot of RMSDs after taking the logarithm. The RMSDs of BCPD were smaller than those of CPD, although the difference in RMSDs was quite small.

Application to SHREC'19 Data
Here, we provide an example of "shape transfer" as an application of BCPD. The dataset was taken from a SHREC'19 track called "matching humans with different connectivity" [64]. Among the 44 human body shapes, we used shape no. 1 as the target point set and shape no. 42 as the source point set, as shown in Fig. 13. After resampling 10,000 points for both point sets using voxel grid filtering, we applied BCPD with the Gaussian kernel; we used the registration parameters (λ, β, ω, γ, κ) = (2, 2, 0, 10, ∞) and the same acceleration parameters as those in Section 5.3.1. Shape (c) in the figure shows shape no. 42 after the registration, clearly approaching shape no. 1.
The rightmost shape in Fig. 13 shows the shape after the second registration; we applied BCPD to shapes (b) and (c), where we set (c) as the source shape. We used the registration parameters (λ, β, ω, γ, κ) = (2, 1.2, 0, 0.1, 1); that is, we relaxed the motion coherence compared with the first registration, and we initialized σ² so that the initial matching was a "moderately hard" one. The deformed shape became closer to shape no. 1 as a whole, although some parts of the body, e.g., the hands and arms, deformed unnaturally. This suggests that a tightly fitting shape can be obtained by a second registration with weak motion coherence and a small γ.
As described above, we applied BCPD twice with different registration parameters. This was because a single execution of the BCPD algorithm failed to transfer shape (a) to shape (b) under the weak motion coherence corresponding to the second registration. This suggests that running the algorithm twice helped avoid a local optimum that would have resulted in an incorrect registration.

Rigid Registration
BCPD can be applied to rigid registration problems owing to the similarity transformation incorporated into its transformation model, as shown in Fig. 14. We evaluated the rigid registration performance of BCPD by comparing it with methods implementing rigid registration techniques: CPD [2], GMM-REG [21], and Go-ICP [14].
Datasets. The first dataset we used was the ASL dataset [65]. Among its eight types of data, we used four: Apartment, Stairs, Mountain Plain, and Wood in Summer. The second dataset we used was the set of 3D surfaces with partial overlaps [66] included in the UWA dataset. We used all four objects in the dataset: Chef, Parasaurolophus, T. Rex, and Chicken. We note that these datasets were acquired with different sensors: the Hokuyo UTM-30LX scanning rangefinder and the Konica Minolta Vivid 910 3D scanner, respectively.
Pre-Processing. To reduce computing time for CPD and GMM-REG, we decreased the number of points in each point set in both datasets to approximately 2,000 using the voxel grid filter implemented in GMM-REG. For CPD and GMM-REG, we used the internal normalization methods implemented in each software package. For BCPD, we used the same normalization method as CPD. For Go-ICP, we reduced the number of points in each target point set to 1,000 using its internal downsampling scheme; we normalized the point sets to be inside [−1, 1]³ by following its manual, because an internal normalization scheme was not available in the software.
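The voxel grid filtering used throughout the pre-processing steps replaces all points falling into the same cubic voxel with their centroid. A minimal sketch of this idea follows; it is an illustrative re-implementation under a fixed leaf size, not the filter shipped with GMM-REG:

```python
import numpy as np

def voxel_grid_filter(points, voxel_size):
    """Downsample an (N, 3) point set by averaging the points
    inside each occupied cubic voxel of edge length voxel_size."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    voxels = {}
    for key, p in zip(map(tuple, keys), points):
        voxels.setdefault(key, []).append(p)
    # One centroid per occupied voxel.
    return np.array([np.mean(v, axis=0) for v in voxels.values()])

# Toy example: two clusters of two points each, two voxels apart,
# collapse to two centroids when the voxel is larger than a cluster.
pts = np.array([[0.0, 0.0, 0.0], [0.1, 0.1, 0.0],
                [2.0, 0.0, 0.0], [2.1, 0.1, 0.0]])
print(voxel_grid_filter(pts, 1.0).shape)  # (2, 3)
```

In practice the leaf size is chosen (often by trial) so that the filtered set has roughly the desired number of points, e.g., approximately 2,000 here.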
Evaluation. We registered the ith and the (i+1)th point sets for i = 1, ..., 10 for each of the eight datasets because all such pairs of point sets partially overlap. We evaluated the goodness of a rigid registration by the angle error against the ground truth. The angular difference in degrees between rotation matrices R1 and R2 in R³, denoted by θ, is computed by the following formula: θ = (180/π) arccos( (1/2){Tr(R1 R2ᵀ) − 1} ). The registration accuracy of a method was defined as the ratio of correct registrations to the 40 registration problems. A trial was defined as a success if the angular difference from the ground truth was less than a cutoff value. We varied the cutoff value from 0.25 to 5.0 degrees at intervals of 0.25.
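The angle-error formula above can be implemented in a few lines; the clipping step below is a standard numerical guard (an implementation detail we assume, not stated in the text) against round-off pushing the cosine slightly outside [−1, 1]:

```python
import numpy as np

def angle_error_deg(R1, R2):
    """Angular difference in degrees between rotation matrices
    R1 and R2, i.e., (180/pi) * arccos((Tr(R1 R2^T) - 1) / 2)."""
    c = 0.5 * (np.trace(R1 @ R2.T) - 1.0)
    c = np.clip(c, -1.0, 1.0)  # guard against round-off error
    return np.degrees(np.arccos(c))

# A 30-degree rotation about the z-axis vs. the identity:
t = np.radians(30.0)
Rz = np.array([[np.cos(t), -np.sin(t), 0.0],
               [np.sin(t),  np.cos(t), 0.0],
               [0.0,        0.0,       1.0]])
print(angle_error_deg(Rz, np.eye(3)))  # 30.0
```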
Results. Fig. 15 shows the results for the ASL dataset (left) and the UWA dataset (right). For the ASL dataset, BCPD outperformed the competitors when the cutoff value defining a successful registration was less than one degree. The accuracy of GMM-REG was 1.0 when the cutoff was greater than three degrees. The registration accuracy of BCPD and CPD differed clearly despite the similarity of their algorithms, suggesting that the differences in their acceleration schemes and outlier distributions affected their registration performance. For the UWA dataset, we observed that the registration performance of Go-ICP was almost perfect; we therefore used the result of Go-ICP as the ground truth to compare the remaining methods. As shown in the right panel of Fig. 15, BCPD outperformed CPD and GMM-REG, suggesting that its registration performance was the closest to that of Go-ICP for the UWA dataset.

CONCLUSION
In this paper, we formulated CPD in a Bayesian setting; we introduced motion coherence through the prior distribution of displacement vectors rather than through the motion coherence theory. We also derived the BCPD registration algorithm, which can be interpreted as a generalization of the CPD algorithm, using VBI. The Bayesian formulation clarified the distinct roles of the tuning parameters controlling the motion coherence. We also proposed an acceleration scheme that can be applied to non-Gaussian kernels.
In the numerical studies, we showed that our acceleration method successfully reduced computing time without losing registration accuracy. We compared BCPD with well-known registration methods for both non-rigid and rigid registration problems. The results showed that BCPD was at least comparable to the competitors in all experiments, suggesting its usefulness as a general-purpose registration technique.

ACKNOWLEDGMENTS
The author would like to deeply thank the anonymous referees who provided helpful, constructive, and detailed comments on earlier versions of the manuscript. This work was supported by JSPS KAKENHI Grant Number 17K12712.

Fig. 15. Comparisons of rigid registrations using the ASL data (left) and the UWA data (right). The y-axis represents the rate of correct registrations over 40 pairs of point sets. A trial is defined as a success if the angular difference from the ground truth is less than the value specified by the x-axis. For the UWA dataset, the ground truth was defined based on the results of Go-ICP, owing to its almost perfect performance in finding partial overlaps in the dataset.
Osamu Hirose (Member, IEEE) is currently an assistant professor with the Institute of Science and Engineering, Kanazawa University. His recent research interests focus on developing machine learning techniques for automating the analysis of digital images and movies, especially those related to biology. His research interests also include 3D shape modeling and registration.