A Mixture Model for Robust Point Matching under Multi-Layer Motion

This paper proposes an efficient mixture model for establishing robust point correspondences between two sets of points under multi-layer motion. Our algorithm starts by creating a set of putative correspondences which can contain a number of false correspondences, or outliers, in addition to the true correspondences (inliers). Next we solve for correspondence by interpolating a set of spatial transformations on the putative correspondence set based on a mixture model, which involves estimating a consensus of inlier points whose matching follows a non-parametric geometrical constraint. We formulate this as a maximum a posteriori (MAP) estimation of a Bayesian model with hidden/latent variables indicating whether matches in the putative set are outliers or inliers. We impose non-parametric geometrical constraints on the correspondence, as a prior distribution, in a reproducing kernel Hilbert space (RKHS). MAP estimation is performed by the EM algorithm which by also estimating the variance of the prior model (initialized to a large value) is able to obtain good estimates very quickly (e.g., avoiding many of the local minima inherent in this formulation). We further provide a fast implementation based on sparse approximation which can achieve a significant speed-up without much performance degradation. We illustrate the proposed method on 2D and 3D real images for sparse feature correspondence, as well as a public available dataset for shape matching. The quantitative results demonstrate that our method is robust to non-rigid deformation and multi-layer/large discontinuous motion.


Introduction
Establishing reliable correspondence between two images is a fundamental problem in computer vision and it is a critical prerequisite in a wide range of applications including structurefrom-motion, camera self-calibration, tracking, image retrieval, and object recognition [1]. In this paper, we formulate it as a matching problem between two sets of discrete points where each point is an image feature, extracted by a feature detector, and has a local image descriptor (e.g., SIFT [2]). The matching problem is ill-posed and is typically regularized by imposing two types of constraints: (i) a descriptor similarity constraint, which requires that points can only match points with similar descriptors, and (ii) geometric constraint, which requires that the matches satisfy an underlying geometrical requirement, which can be either parametric (e.g., rigid transformations) or non-parametric (e.g., nonrigid). Even after regularization there remain an exponential number of possible matches between the two sets and efficient algorithms are required to obtain the best solution by removing the false matches. The difficulty of the matching problem is typically made harder by the presence of unmatched points in the two images (due to occlusion or failures of the feature detectors).
A popular strategy for solving the matching problem is to use a two stage process. In the first stage, a set of putative correspondences are computed by using a similarity constraint to reduce the set of possible matches. This putative correspondence set typically includes most of the true matches, the inliers, but also a large number of false matches, or outliers, due to ambiguities in the similarity constraints (particularly if the images contain repetitive patterns). The second stage is designed to remove the outliers and estimate the inliers and the geometric parameters [3][4][5]. This strategy is commonly used for situations where the geometrical constraints are parametric, such as requiring that corresponding points lie on epipolar lines [1]. Examples of this strategy include the RANSAC algorithm [3] and analogous robust hypothesizeand-verify methods [4,6,7]. Although these methods are very successful in many situations they have had limited success if the geometrical constraints are non-parametric, for example if the real correspondence is non-rigid, and they also tend to degrade badly if the proportion of outliers in the putative correspondence set becomes large [5].
Recently, some new non-parametric model-based methods have also been developed to deal with the non-rigidity, such as identifying point correspondences by correspondence function (ICF) [5], vector field consensus (VFC) [8,9], mismatch removal via coherent spatial mapping [10], as well as L 2 -Minimizing Estimate-based method (RPM-L 2 E) [11]. These methods works well when the scene contains some deformable objects. However, since they fit a smooth transformation for the scene motion, when the motion contains large discontinuities or multi-layer, for example, large depth discontinuities or motion inconsistencies, the smoothness prior will be violated and hence these methods will be badly degraded. Some graph matching based methods such as dual decomposition method [12] and graph shift (GS) [13] have also been proposed to capture different layers of motions, and hence robust to the motion with large discontinuities. However, these methods could not handle the 3D case.
In this paper, we generalize the former non-parametric modelbased methods to deal with large discontinuities. Rather than these methods which interpolates a global transformation, our approach uses a mixture model and fits the correspondences with a set of transformations. This enables us to capture multi-layer motion, and hence robust to large discontinuous motion. To illustrate the main idea of our approach, we show a simple example in Fig. 1. Given two sets of interest points extracted from an image pair, we want to match them to establish their point-wise correspondence. We first compute a set of putative correspondences based on their SIFT features as shown in the left of Fig. 1, which contains a number of outliers. Due to the motions of the fox and the ground are quite different in the scene, it is hard to fit a global transformation smooth enough on all the inlier correspondences. Therefore, the non-parametric model-based methods such as VFC [8] will only preserve the majority of the inliers which locate on the ground, as shown in the middle of Fig. 1. However, our approach fits multiple transformations, which can capture multi-layer motion, and hence both the inliers on the fox and the ground will be preserved, as shown in the right of Fig. 1.

Related Work
This section briefly reviews the background material that our work is based on. This includes methods for establishing a set of putative correspondences and methods like RANSAC which use robust criteria for performing correspondence assuming parametric geometric constraints. Next we discuss approaches for solving matching problems which solve for a correspondence matrix between point sets.
There has been considerable study of robust estimation in the statistics literature [14,15]. This work shows, for example, that maximum likelihood estimator of parameters using quadratic L 2 norms are not-robust and highly sensitive to outliers. By contrast, methods which minimize L 1 norm are more robust and capable of resisting a larger proportion of outliers. A particularly robust method is the redescending M-estimator [14]. It can be shown that this estimator results from using an explicit variable to indicate whether data is an outlier or an inlier (this indicator variable must be estimated).
The RANSAC algorithm matches two point sets by first computing a putative set and then using robust methods to impose parametric geometric constraints [3]. RANSAC uses a hypothe-size-and-verify framework. It proceeds by repeatedly generating solutions estimated from a small set of correspondences randomly selected from the data, and then tests each solution for support from the complete set of putative correspondences. RANSAC has several variants such as MLESAC [4], LO-RANSAC [16] and PROSAC [6]. MLESAC adopts a new cost function using a weighted voting strategy based on M-estimation and chooses the solution that maximizes the likelihood rather than the inlier count. RANSAC is also enhanced in LO-RANSAC with a local optimization step based on how well the measurements satisfy the current best hypothesis. Alternatively, prior beliefs are assumed in PROSAC about the probability of a point being an inlier to modify the random sampling step of the RANSAC. A detailed comparative analysis of RANSAC techniques can be found in [7].
In the recent past, some new non-parametric model-based methods have also been developed, such as ICF [5], VFC [8,17], RPM-L 2 E [11]. The ICF rejects outliers by learning a correspondence function pair which maps points in one image to their corresponding points in another. While the VFC converts the outlier rejection problem into a robust vector field interpolation problem which interpolates a non-parametric smooth motion field to fit the potential inliers. Similar to VFC, the RPM-L 2 E also fits a non-parametric spatial transformation, and the difference is that it uses a robust estimator to deal with outliers rather than explicitly modeling the outlier distribution in VFC.
Another strategy for point correspondences is to formulate this problem in terms of a correspondence matrix between points (in the two datasets) together with a parametric, or non-parametric, geometric constraint [18][19][20][21][22]. These approaches relate closely to earlier work on mathematical models of human perception of long-range motion. This includes Ullman's minimal mapping theory [23] and Yuille and Grzywacz's motion coherence theory [24] which formulate correspondence in terms of vector field interpolation and use Gaussian kernels. The iterated closest point (ICP) algorithm [18] is one of the best known point correspondence/registration approaches. It uses nearest-neighbor relationships to assign a binary correspondence, and then uses estimated correspondence to refine the transformation. Efficient versions of ICP use sampling processes, either deterministic or based on heuristics [25]. The nearest point strategy of ICP can be replaced by soft assignments within a continuous optimization framework, e.g., the TPS-RPM [20,26]. In the recent past, the point registration is typically solved by probabilistic methods [21,22,27,28]. The kernel correlation based method [27] models each one of the two point sets by two probability distributions and measures the dissimilarity between the two distributions. It was later improved in [22]. In [21] as well as in [22] and [28], the Gaussian mixture model is used to recast the point-to-point assignment problem into that of estimating the parameters of a mixture. This is done within the framework of maximum likelihood and the expectation-maximization (EM) algorithm [29].
Point correspondence has also been formulated as a graph matching problem, such as the dual decomposition (DD) [12], Spectral Matching (SM) [30], and graph shift (GS) [13,31]. The DD approach formulates the matching task as an energy minimization problem by defining a complex objective function of the appearance and the spatial arrangement of the features, and then minimizes this function based on the dual decomposition approach. The SM method uses an efficient spectral method for finding consistent correspondences between two sets of features. Based on the SM method, the GS method constructs an affinity graph for the correspondences, and the maximal clique of the graph is viewed as spatially coherent correspondences. The SIFTflow algorithm [32] builds a dense correspondence map between two arbitrary images with a particular advantage for matching two scenes; it does not explicitly deal with the outliers and may not be able to produce the accuracy for the precise matching for problems like structure-from-motion. Note that this type of graph matching formulation can in some cases be mathematically equivalent to the methods with correspondence variables and geometric constraints [24].

Problem Formulation
Suppose now we are given a set of putative image point correspondences S~f(x n ,y n ) : n[ g, which may be perturbed by noise and outliers. The non-parametric model-based methods [8,11] aim to fit a non-parametric transformation f to the underlying inliers, i.e., y n~f (x n ) for any inlier (x n ,y n ), and hence remove outliers. The transformation f here is continuous and smooth, which is not able to handle multi-layer motion. In this case, a more robust method is desirable to provide stable results. To this end, we consider a mixture model and fit the correspondences with a set of transformations ff k : k[ g instead of just one global transformation.
We make the assumption that, for the inliers, the noise is Gaussian on each component with zero mean and uniform standard deviation s; for the outliers, the output space is a bounded region of (D is the dimension of the data, typically D~2 or 3), and the distribution is assumed to be uniform 1 a with a being a constant. Let us introduce a set of latent variables fz n : n[ g, where z n has an 1-of-K+1 representation in which a particular element z nk is equal to 1 and all other elements are equal to 0. More specifically, the notation z nk~1 (k[ ) means that the n-th correspondence is an inlier and it corresponds to the transformation f k , while z n,Kz1~1 means that the n-th correspondence is an outlier. The marginal distribution over z n is specified in terms of the mixing coefficients p k , such that where the parameters fp k : k[ Kz1 g satisfy 0ƒp k ƒ1 together with in order to be valid probabilities. The likelihood of a correspondence (x n ,y n ) corresponding to the transformation f k is drawn from a Gaussian distribution with mean 0 and covariance s 2 I: where h~ff,s 2 ,p k g includes a set of unknown parameters. We denote the set of all observed data by X and Y, in which the n-th rows represent x T n and y T n respectively, and similarly we denote the set of all latent variables by Z. By making the i.i.d. data assumption, the likelihood is a mixture model given by distribution function is nonzero only in a bounded region (here we omit the indicator function for clarity). We want to recover the transformations ff k ,k[ g from the data S. Taking a probabilistic approach, we assume f k to be a realization of a random field with a known prior probability distribution p(f k ). The prior is used to impose constraints on f k , assigning significant probability only to those functions that satisfy those constraints. We consider the slow-and-smooth model [33] which has been shown to account for a range of motion phenomena, the prior of f k then has the form: where (f k ) is a smoothness functional and l a positive real number (we will discuss the details of f k later). By applying Bayes rule, we seek a MAP solution of h, i.e., This is equivalent to seeking the minimal energy The set of transformations ff k ,k[ g will be directly obtained from the optimal solution h Ã , and the latent variables fz n : n[ N g determine the inliers. In the next section, we show how to solve the estimation problem using an EM approach.

The EM Algorithm
There are several ways to estimate the parameters of the mixture model, such as EM algorithm, gradient descent, and variational inference. The EM algorithm [29] is a general technique dealing with the existence of latent variables. It alternates with two steps: an expectation step (E-step) and a maximization step (M-step). We follow standard notations [34] and omit some terms that are independent of h. Denote p(z nk~1 jx n ,y n ,h) by c(z nk ), considering the negative log posterior function, i.e. equation (7), the complete-data log posterior is then given by ð8Þ E-step: We use the current parameter values h old to find the posterior distribution of the latent variables, which can be found by applying Bayes' rules where k[ K and n[ N . The posterior probability c(z nk ) indicates to what degree the n-th sample agrees with the current estimated transformation f k . M-step: We determine the revised parameter estimate h new as follows: h new~a rg max h Q(h,h old ). Taking derivative of Q(h) with respect to p k and s 2 , and setting them to zero, together with equation (2), we obtain where N k~P N n~1 c(z nk ). Next we consider the terms of Q(h) that are related to f k . We obtain a regularized risk functional as [8]: We model f k by requiring it to lie within a specific functional space H, namely a vector-valued reproducing kernel Hilbert space (RKHS) [35]. We define the RKHS H by a diagonal Gaussian kernel [8]: C(x i ,x j )~e {bExi{xj E 2 I. For the smoothness functional (f), we use the square norm, i.e., (f)~EfE 2 H . Therefore, we have the following representer theorem [36]. Theorem 1. The optimal solution of the regularized risk functional (12) is given by where C[ N|N is the kernel matrix with the ij-th entry , P k~d iag(c(z 1k ), Á Á Á ,c(z nk )) is a diagonal matrix, and C k~( c 1k , Á Á Á ,c nk ) T [ N|D .
Proof. The kernel C has the following reproducing property, for all c[Y and x[X where C x c~C( : ,x)c. For any given reproducing kernel C, a unique RKHS can be defined as For further details about the reproducing property we refer the readers to [36][37][38].
Let H \ N be a subspace of H, Form the reproducing property, i.e. equation (15), Therefore, the optimal solution of the regularized risk functional (12) comes from the space H N , and hence has the form (13). To solve for the coefficients, we rewrite the regularized risk functional in the following matrix form: where C[ N|N is the kernel matrix with the ij-th entry the Frobenius norm, and tr( : ) denotes the trace. Taking the derivative of the last equation with respect to C k and setting it to zero, we obtain the linear system in equation (14). Thus the coefficient set fc nk : n [ N g of the optimal solution f k is determined by the linear system (14).
Once the EM algorithm converges, we then obtain the set of transformations ff k : k[ K g. Besides, with a predefined thresh- old t we obtain the inliers I as well: I~fn : We summarize our method in Table 1. Since our robust point matching method is based on a mixture model, we named our method RPM-MM.

Fast Implementation
Solving the transformation f k merely requires to solve the linear system (14). However, for large values of N, it may pose a serious problem due to heavy computational (e.g. scales as O(N 3 )) or memory (e.g. scales as O(N 2 )) requirements, and, even when it is implementable, one may prefer a suboptimal but simpler method. To address this problem, in this section we provide a fast implementation based on a similar kind of idea as the subset of regressors method [17,39].
Rather than searching for the optimal solution in H N i.e., equation (16) [17]. There, it was found that simply selecting an arbitrary subset of the training inputs performs no worse than more sophisticated methods. According to the sparse approximation, we search a solution with the form with the coefficients fc m k : m[ M g determined by a linear system where In contrast to the optimal solution given by the representer theorem, which is a linear combination of the basis functions fC( : ,x n ) : n [ N g, the suboptimal solution is formed by a linear combination of arbitrary M-tuples of the basis functions. Generally, this sparse approximation will yield a vast increase in speed and decrease in memory requirements with negligible decrease in accuracy. Compared with the original algorithm shown in Algorithm 1, the fast version solves a different linear system (24) in Line 8.

Computational Complexity
For the RPM-MM algorithm, the corresponding kernel matrix C is of size N|N. According to Theorem 1, we need to solve a linear system (14) for each transformation f k . The time complexity is O(DN 3 ), which is the most time-consuming step in the algorithm. As a result, the total time complexity of our algorithm is O (mKDN 3 ), where m is the number of EM iterations. In our current implementation, we just use the Matlab ''\'' operator, which implicitly uses Cholesky decomposition to invert a matrix. The space complexity of RPM-MM scales like O(N 2 ) due to the memory requirements for storing the kernel matrix C.
For the fast implementation, the corresponding kernel matrix is of size M|M, where M is the number of basis functions used for sparse representation. Then the time complexity is reduced to O(mKDM 2 NzmKDM 3 ), and the space complexity is reduced to O(MNzM 2 ). Typically, in point matching problems, the number of the point matches N is in the order of 10 3 , and the required number of basis function M is in the order of 10 1 . Therefore, both the time and space complexities can be simply written as O(N). This is significant for large datasets. Our experiments demonstrate that the fast version is much faster than the original RPM-MM algorithm with negligible performance degradation. Table 1. Algorithm 1: The RPM-MM Algorithm.

Extension to Non-Rigid Point Set Registration
Point set registration aims to align two point sets fx n g N n~1 (the model point set) and fy l g L l~1 (the target point set). Typically, in the non-rigid case, it requires estimating a non-rigid transformation f which warps the model point set to the target point set. Moreover, for point sets with multi-layer motion, it may need multiple transformations to achieve satisfying results. Recall that our RPM-MM method is able to generate a set of non-rigid transformations with adherence to a set of point correspondences. Therefore, it could be used to recover the transformation(s) between two point sets with a set of putative correspondences. Next we discuss how to establish initial correspondences.
Typically, for a pure point matching problem, the appearance information is not available. In general, if the two point sets have similar shapes, the corresponding points have similar neighborhood structures which could be incorporated into a feature descriptor. Thus finding correspondences between two point sets is equivalent to finding for each point in one point set (e.g., the model) the point in the other point set (e.g., the target) that has the most similar feature descriptor. Fortunately, the initial correspondences need not be very accurate, since our method is robust to noise and outliers. Inspired by these facts, we use shape context [40] as the feature descriptor, using the Hungarian method for matching with the x 2 test statistic as the cost measure.
After we get the rough correspondences between two point sets based on their shape features, we fit a set of transformations and use them to warp the model points. To this end, we need to determine the attribution of each model point (i.e., which component of the mixture model a model point belongs to). First, according to the matching results based on the rough correspondences, we can determine the attributions of the inliers; for the rest of the model points, we attribute them to their nearest neighbors which already have attributions. The two steps of estimating correspondences and transformations are iterated to obtain a reliable result. In this paper, we use a fixed number of iterations, typically 10 but more when there are large degradations on the data. We summarize our non-rigid point set registration method in Table 2.

Implementation Details
The performance of point matching algorithms typically depends on the coordinate system in which points are expressed; here we use data normalization to control for this. More specifically, we perform a linear re-scaling of the correspondences so that the points in the two sets both have zero mean and unit variance. Furthermore, we define the transformation f k as the initial position plus a displacement function v k : f k (x)~xzv k (x) [11], and solve for v k instead of f k . This can be achieved simply by setting the output y n to be y n {x n . The use of displacement function achieves more robustness.
The EM algorithm is well known to converge to a local maximum. To initialize the EM iteration, we first use the K-means Table 3. Performance comparison on 2D image pairs in Fig. 2 Table 4. Performance comparison on 3D surface pairs in Fig. 2  algorithm to cluster the correspondences into K 0 clusters. To this end, we convert the correspondences into a set of motion field samples fy n {x n g N n~1 and do clustering on them. The first K largest clusters are considered as inliers of the K-component mixture model in our approach, and then are used to initialize fc(z nk ) [ f0,1g : n [ N ,k [ K g. In our evaluation, we set K 0~1 0, and K can be set adaptively according to the cardinalities of the clusters. For instance, if the ratio of the cardinalities between a certain cluster and the largest cluster is bigger than a threshold, e.g., 0:2, then it will be preserved as initial inliers, typically K~2 or 3.
There are mainly three parameters in our algorithm: b, l and t. Parameters b and l both reflect the amount of smoothness regularization. Parameter b determines how wide the range of interaction between correspondences. Parameter l controls the trade-off between the closeness to the data and the smoothness of the solution. Parameter t is a threshold, which is used for deciding the correctness of a correspondence. In general, we find our method to be very robust to parameter changes. We set b~0:1, l~1, and t~1=K throughout the experiments. The constant a in the uniform distribution is set to the volume of the output space (e.g., the bounding box of the points) after data normalization, typically a~10. Besides, for the fast implementation, the number M of basis functions used for sparse approximation is fixed to 15.

Experimental Results
In order to evaluate the performance of our algorithm, we conducted two types of experiments: i) sparse image feature correspondence on 2D images and 3D surfaces; ii) non-rigid shape matching on synthetical 2D datasets.

Results on Sparse Feature Correspondence
We present a few representative matching results on both 2D image pairs and 3D surface pairs, as shown in Fig. 2. For the 2D images, the first two pairs (i.e., Tree and Church) are wide baseline images which are public available [41]. The third pair (i.e., Books) contains some shared content which is taken by ourselves. It often occurs in image or object retrieval. The 3D surface pairs (i.e., Person, Centaur and Dog) are objects with different amounts of nonrigid deformation, which come from a surface correspondence benchmark [42]. In our evaluation, we use the SIFT [2] and MeshDOG/MeshHOG [43] to establish putative correspondences for 2D and 3D cases, respectively.
The match correctness is determined as follows. For the 2D images, a method combining subjectivity and objectivity is considered. We first fit the epipolar geometry (e.g., fundamental matrix) by RANSAC and use it to determine the match correctness. We further confirm them artificially. Although the judgment of correct match and mismatch seems arbitrary, we make the benchmark before performing experiments to ensure objectivity. For the 3D surfaces, the ground truth correspondences are supplied by the dataset.
The experimental results are evaluated by precision and recall, where the precision is defined as the ratio of the preserved inlier number and the preserved correspondence number, and the recall is defined as the ratio of the preserved inlier number and the inlier number contained in the putative correspondences. We compare our RPM-MM algorithm with other four methods which remove outliers from given putative point correspondences, such as RANSAC [3], ICF [5], GS [13], and VFC [8]. We implement ICF and tune all parameters accordingly to find optimal settings. For RANSAC, GS and VFC, we implement them based on the publicly available codes. Throughout all the experiments, five algorithms' parameters are all fixed.
The results of our RPM-MM are presented in Fig. 2, we see that for an image pair with relatively simple structures, such as Tree or Person which involves a small amount of rotation, viewpoint change, or non-rigid deformation, the mixture model in our algorithm then degenerates and contains only one component. In this case Our RPM-MM algorithm is equal to a normal nonparametric model-based method, e.g., VFC [8]. For images with large view point change or non-rigid deformation, the mixture  model will contain multiple components to capture multi-layer motion, such as Church, Books, Centaur and Dog. Note that in the image pair of Church, our RPM-MM preserves the correspondences on the sky; this is very useful since removing inliers outstanding in depth tend to make the recovery of epipolar geometry unstable and ill conditioned [44]. Moreover, the matching result of Books shows our method's capability in image retrieval.
We further quantitatively compare our RPM-MM to four stateof-the-art point matching algorithms: RANSAC, ICF, GS and VFC. Tables 3 and 4 report the 2D and 3D results respectively. As shown in Table 3, ICF and VFC have low recalls when the scene contains large discontinuities. In fact, they tend to preserve typically one major component (i.e., the matches marked by blue lines in Fig. 2) or two components of the correspondences. RANSAC has satisfactory performance when the relationship of correspondence is rigid, e.g., epipolar geometry. But it can not work in the non-rigid case, e.g., Books. The graph matching based method GS generally can obtain better performance than ICF and VFC in case of large discontinuity, e.g., Books. But its recalls are still relatively low compared to our RPM-MM. In Table 4 we only use VFC for comparison since the other three methods are not applicable for either the 3D case or non-rigid deformation. We again observe that VFC fails to keep most of the inliers under large non-rigid deformations. Our RPM-MM in general has the best precision-recall trade-off, and it is not affected by large non-rigid deformation or multi-layer motion. In addition, we also test the fast version of our method on these six image pairs, as shown in the last rows of Tables 3 and 4. The average number of putative correspondences on these six image pairs is about 630, and the average run times of our RPM-MM and its fast implementation are about 6 s and 0.2 s on an Intel Core 2.0 GHz PC with Matlab code. We see that the use of sparse approximation leads to an essential speedup without much performance degradation.

Results on Shape Matching
We next evaluate our RPM-MM on the non-rigid point set matching problem, more specifically, shape matching. For the dataset, we choose the same synthesized data as in [20]. The data consists of two different shape models, where the first model consists of 96 points representing a fish shape, and the second model is a more complex pattern consisting of 108 points representing a Chinese character. We combine the two shapes to generate a model point set, i.e., the blue pluses as shown in Fig. 3. To get a target point set, we apply two randomly generated non-rigid transformations on the model point set and warp the two shape models respectively, i.e., the red circles as shown in Fig. 3. The model is warped through progressively larger degrees of nonrigid warpings, and we generate 100 samples in total. Fig. 3 presents some registration results of our RPM-MM and four other state-of-the-art registration algorithms: GMM-L 2 [22], CPD [21], TPS-RPM [20] and VFC [8], which are implemented using publicly available codes. In the first row, the data contains a relatively slight deformation; in this case, it is possible to search a single transformation which approximates the two-layer motion well. Therefore, all the five algorithms are able to generate satisfying results. However, the matching performance degrades gradually as the degree of deformation increases. In the second row, the data contains a relatively large deformation; in this case, just one transformation cannot capture the two-layer motion, and hence the matching performance degrades. By contrast, our RPM-MM uses a mixture model which interpolates a set of transformations, and hence is robust to the two-layer motion. Therefore, it still can produce an almost perfect alignment, as shown in the last column of Fig. 3.
To provide a quantitative comparison, we report the registration results of the five algorithms on all the 100 samples. We compute the recall as the metric used in [22]. Here the recall, or true positive rate, is defined as the proportion of true positive correspondences to the ground truth correspondences and a true positive correspondence is counted when the pair falls within a given accuracy threshold in terms of pairwise distance, e.g., the Euclidean distance between a point in the warped model and the corresponding point in the target. Fig. 4 plots the recall curves of the five algorithms on all the 100 samples. We see that VFC and TPS-RPM perform much better than CPD and GMM-L 2 , and our RPM-MM is far better than the other four methods. This is not surprise since our RPM-MM interpolates multiple transformations to capture multi-layer motion while the other four algorithms interpolate just a global transformation. That is to say, for point sets with multi-layer motion, it is necessary to interpolate multiple local transformations (as in our RPM-MM) rather than only one global transformation to achieve accurate results. We also tested the fast version of our algorithm on this dataset and obtained almost the same recall curve as the original version, here we omit it in the figure for clarity.

Conclusion
Within this paper, we have proposed and studied a new robust point matching algorithm based on a mixture model (RPM-MM). It interpolates a set of transformations to fit different layers of the motion correspondence by an iterative EM algorithm, and hence establish reliable correspondence between two images. Quantitative comparisons on both sparse feature correspondence and shape matching demonstrate that our algorithm outperforms state-ofthe-art point matching methods, especially when the motion of the scene contains large discontinuities.