An Accurate and Efficient Voting Scheme for a Maximally All-Inlier 3D Correspondence Set

We present a highly accurate and efficient, yet simple, two-stage voting scheme for distinguishing inlier 3D correspondences by densely assessing and ranking their local and global geometric consistencies. The strength of the proposed method stems from both the novel idea of a post-validated voting set and single-point superimposition transforms, which are computationally cheap and avoid rotational ambiguities. Using a well-known dataset consisting of various 3D models and numerous scenes that include different occlusion rates, the proposed scheme is evaluated against state-of-the-art 3D voting schemes, in terms of both the correspondence PR (precision-recall) AUC (area under curve) and the execution time. A total of 374 experiments were conducted for each method, which involved a combination of four models, 50 scenes, and two down-samplings. The proposed scheme outperforms the state-of-the-art 3D voting schemes in terms of both accuracy and speed. Quantitatively, the proposed scheme scores $97.0\% \pm 12.9\%$ on the PR AUC metric, averaged over all of the experiments, while the two state-of-the-art schemes score $74.2\% \pm 22.2\%$ and $78.3\% \pm 26.4\%$. Furthermore, the proposed scheme requires only $24.1\% \pm 6.0\%$ of the time consumed by the fastest state-of-the-art scheme. The proposed voting scheme also demonstrates high robustness against occlusions and scarce inliers.

Excluding spurious matches from high-outlier-rate correspondences is still an open problem featuring high-dimensionality issues [39]; yet, little research has been done in this area. One of the early popular proposals was random sample consensus (RANSAC) [40], which is based on random sampling, as the name implies, and thus suffers from repeatability issues. Notably, Optimal RANSAC [41] is an extension of the random sampling algorithm, which addresses the repeatability issue to a high extent. Notwithstanding, sampling methods in general are still sensitive to high outlier rates and require a substantially large number of samples for robust estimation, which is time-consuming.
Simultaneously recovering high quality and abundant correspondences is indispensable in obtaining a plausible hypothesis set for proper model fitting [43]. Accordingly, voting-based schemes have gained momentum recently, due to their balanced performance.
More to the point, several studies have incorporated correspondence consistency voting [23], [24], [51], [52], [53] to increase the correspondence inlier rate, either by truncating inconsistent correspondences or utilizing the voting ranks to weight the model parameters. Some voting schemes [23], [24] have adapted the nearest-neighbor similarity ratio (NNSR) [30], which was one of the earliest techniques to detect spurious correspondences formed by indistinct features. However, the NNSR was originally proposed for high-dimensional intensity-based local-image features, and thus its quality is questionable for low-dimensional geometric features. Similarly, the local rigidity constraint (LRC) has been employed in some voting schemes [23], [54] to ensure compatible euclidean distances in the surrounding neighborhoods of the two corresponding points. Per contra, a rigidity constraint is not sufficient to ensure the rotational compatibility of neighboring correspondences. It is worth noting, however, that neighborhood measurements are unavoidable in voting schemes, and it is believed that Ref. [24] made an error in claiming that k-nearest neighbor (k-NN) queries are avoided in their method, while in reality they employed a local reference frame (LRF) estimation method [55], which internally and unavoidably relied on k-NN queries. Although LRFs [55], [56], [57] have been recently utilized in voting schemes [23], [24] to assess the global consistency of correspondences, LRFs suffer from noise and eigenvector sign ambiguities. Even after resolving these ambiguities by following certain conventions [58], using LRFs for global verification is still debatable, as they were originally intended for local feature description. To the best of our knowledge, both the accuracy and robustness of voting schemes remain challenging problems at present.
Accordingly, the problem of rejecting outliers to find a maximal plausible correspondence set still persists, which has been set as the objective of this manuscript. Similar to the state-of-the-art 3D correspondence voting schemes [23], [24], we follow a two-stage scheme concept: in the first stage, a voting set is elected based on the top coarsely-estimated likelihoods; in the later stage, the correspondences are validated against the voting set and their fine-tuned likelihoods are estimated accordingly. Nonetheless, it is challenging to come up with criteria for each stage that maximize the accuracy without affecting the efficiency. Our approach for the first voting stage involves utilizing the LRC, similar to Ref. [23], to obtain coarse inlier ranking scores. However, unlike Ref. [23], the resulting LRC scores are not subject to a hard-threshold and are not combined with external scores. Instead, they are utilized to guide the global scoring stage, rather than constraining it and limiting the overall performance.
Importantly, the strength of our proposed method stems from the second voting stage, in which contamination is minimized in both the voting set and its elementwise hypotheses. While previous methods [23], [24] utilized the putative correspondences in composing the global-stage hypotheses (from both the correspondence source and destination sides, in the case of Ref. [23], and from the source correspondence side in the case of Ref. [24]), we opted to compose our hypotheses solely from the voting set, to minimize outlier effects. Furthermore, the voting set is post-validated in the second stage of our scheme, which was not carried out in previous methods [23], [24]. Our proposed method also differs from both previous methods [23], [24], as it does not rely on their underlying criteria of NNSR or LRFs, which lack local-rigidity checks or suffer from rotational ambiguity, respectively. Instead, our proposed single-point superimposition transforms (1PSTs) are rotational ambiguity-free and computationally cheap, compared to the noisy and usually ambiguous LRFs.
Briefly, we propose a voting scheme that is: highly accurate and extremely efficient, deterministic and rigorously repeatable, and simple to implement. The remainder of this manuscript is organized as follows. Section 2 describes the proposed scheme, while Section 3 explains the experimental setup, the dataset, and the performance metrics that were utilized. The results are discussed in Section 4, and Section 5 presents the conclusions and future work.

METHODOLOGY
Let $P, P' \subset \mathbb{R}^3$ be the model and scene point clouds (multiple rigid objects), and let $C = \{(\mathbf{p}, \mathbf{p}') : \mathbf{p} \in P, \mathbf{p}' \in P'\} \subseteq P \times P'$ be the initially-provided correspondence set. The goal is to compute a likelihood set, $S \subset [0, 1]$, where each individual element $s(\mathbf{c}_i) \in S$ represents the likelihood that a correspondence $\mathbf{c}_i \in C$ is valid (i.e., an inlier).
In the first voting phase, a voting set $C_l \subset C$ is elected, based on the top-ranked elements of a local coarse ranking set, $L$, that is estimated through verification of the LRC (local rigidity constraint, Section 2.1). In the second voting phase, the voting set $C_l$ and the putative correspondence set $C$ are assessed against each other. The targeted likelihood set $S$ is estimated by calculating the elementwise covariance of the putative set with single-point superimposition transforms, which are derived from the voting set (Section 2.2). A schematic representation of our proposed method is shown in Fig. 1.

First Voting Stage: Voting Set Election
This section is aimed at electing a voting set $C_l \subset C$. Its cardinality, $|C_l| = k_l$, is the first free parameter of our proposed method. The voting set elected in this stage is utilized to assess the putative correspondences $C$ in the next voting stage. To elect the voting set, the local rigidity constraint is utilized, which asserts the mutual compatibility of two correspondences $\mathbf{c}_i \overset{\mathrm{def}}{=} (\mathbf{p}_i, \mathbf{p}'_i) \in C$ and $\mathbf{c}_j \overset{\mathrm{def}}{=} (\mathbf{p}_j, \mathbf{p}'_j) \in C$. These two correspondences are said to have a high compatibility likelihood, $l(\mathbf{c}_i, \mathbf{c}_j) \approx 1$, when their corresponding domain and co-domain Euclidean distances are approximately equal, $\|\mathbf{p}_j - \mathbf{p}_i\|_2 \approx \|\mathbf{p}'_j - \mathbf{p}'_i\|_2$, thus complying with the rigidity constraint. While several studies have formulated the likelihood of the pairwise rigidity constraint as a minimum between two ratios (e.g., see Ref. [23]), we opt to formulate it as a Gaussian function, in order to capture the physical aspects of the acquisition sensor:
$$l(\mathbf{c}_i, \mathbf{c}_j) = \exp\left(-\frac{\left(\|\mathbf{p}_j - \mathbf{p}_i\|_2 - \|\mathbf{p}'_j - \mathbf{p}'_i\|_2\right)^2}{2\sigma_a^2}\right), \tag{1}$$
where $\sigma_a$ is the standard deviation of the acquisition accuracy, which constitutes the second free parameter of the proposed scheme. Due to the pairwise nature of this constraint, one needs to pair each correspondence with several others to accumulate a decent disjoint probability estimation. Nonetheless, excessive verification might lead to a combinatorial explosion. Without loss of generality, we exploit the fact that inlier correspondences tend to appear in groups [23], [59], in order to improve the probability estimation while avoiding computational issues. Although this assertion might reduce the inlier recall, it is compensated for in the global voting stage (Section 2.2).
Accordingly, for every matching keypoint of the model, $P_c = \{\mathbf{p} : (\mathbf{p}, \mathbf{p}') \in C\}$, the k-NN originating from the same keypoint set, $N_k(\mathbf{p}) \subset P_c$, is utilized, where $k$ defines the size of the neighborhood (i.e., $|N_k|$). If applicable, reusing such neighborhood information from the feature computation phase will save some computational power; otherwise, an approximate k-NN method [60] can be utilized for fast estimation. Upon the availability of neighborhood information, the local coarse ranking set $L = \{\mathcal{L}_l(\mathbf{c}_i) : \mathbf{c}_i \in C\} \subset \mathbb{R}$ is estimated as the summation of the neighborhood pairwise likelihoods:
$$\mathcal{L}_l(\mathbf{c}_i) = \sum_{\mathbf{p}_j \in N_k(\mathbf{p}_i)} l(\mathbf{c}_i, \mathbf{c}_j). \tag{2}$$
Notably, we arrived at the same conclusion as Ref. [23]: the most efficient spatial neighborhood size $|N_k| = k$ is actually given by the cardinality of the voting set, $|C_l| = k_l$. That is, the accuracy of the final likelihood scores decreases linearly with neighborhood sizes smaller than the cardinality, and begins to saturate for larger ones. Subsequently, the putative set $C$ is sorted in descending order, according to its local coarse ranking scores $L$, and is denoted as $C_L$, from which the top $k_l$ elements constitute the voting set $C_l$:
$$C_l = \{\mathbf{c}_{(i)} \in C_L : 1 \le i \le k_l\}, \tag{3}$$
where $\mathbf{c}_{(i)}$ denotes the $i$-th element of the sorted set. Thus, the first voting stage is concluded by the election of the voting set $C_l$, which is utilized in the next section for the assessment of the putative correspondences $C$.
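The first voting stage can be illustrated with a minimal NumPy sketch. It assumes the Gaussian rigidity likelihood described above, a brute-force k-NN search in place of the approximate method [60], and self-including neighborhoods; the function names `lrc_likelihood` and `elect_voting_set` are hypothetical and not taken from the authors' implementation.

```python
import numpy as np

def lrc_likelihood(d_model, d_scene, sigma_a):
    """Gaussian rigidity likelihood of paired model/scene distances."""
    return np.exp(-(d_model - d_scene) ** 2 / (2.0 * sigma_a ** 2))

def elect_voting_set(P, Q, k, k_l, sigma_a):
    """Return (indices of the voting set, local coarse ranking scores).

    P, Q: (n, 3) arrays; row i of P (model) is matched to row i of Q (scene).
    """
    # Brute-force pairwise distances among the matched model keypoints
    # (an assumption; an approximate k-NN structure would replace this).
    D_model = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
    # k nearest neighbours of each model keypoint (self included, distance 0).
    nbrs = np.argsort(D_model, axis=1)[:, :k]
    d_model = np.take_along_axis(D_model, nbrs, axis=1)
    # Distances between the corresponding scene points of the same pairs.
    d_scene = np.linalg.norm(Q[:, None, :] - Q[nbrs], axis=2)
    # Sum of the pairwise likelihoods over each neighbourhood.
    L = lrc_likelihood(d_model, d_scene, sigma_a).sum(axis=1)
    # Sort in descending order and keep the top k_l correspondences.
    order = np.argsort(-L, kind="stable")
    return order[:k_l], L
```

Because the self-pair contributes a constant likelihood of 1 to every score, including it does not change the ranking.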

Second Voting Stage: Post-Validation and Scoring
In the first voting stage, we elected a voting set $C_l$ from the top-ranked elements of a local coarse ranking set $L$ based on local neighborhood measurements and support. However, inliers surrounded by contaminated neighborhoods would not have received enough support at that stage. In this stage, we address this particular issue by assessing both the voting set $C_l$ and the putative correspondence set $C$ against each other, and measure the covariance to global single-point superimposition transforms derived from the voting set. Unlike previous work [23], [24], which utilized LRFs, 1PSTs are computationally cheap and, more importantly, rotational ambiguity-free. Moreover, we construct the hypotheses solely from the voting set and evaluate both the putative and voting sets against each other, while previous work [23], [24] involved a contaminated putative set in forming their hypotheses and only evaluated the putative set without post-validating the voting set. Initially, each voting element $\mathbf{c}_i \in C_l$ is assigned a rotational ambiguity-free and computationally cheap single-point superimposition transform, $T(\mathbf{c}_i) = [R(\mathbf{c}_i) \mid \mathbf{t}(\mathbf{c}_i)] \in SE(3)$, as a candidate hypothesis, where $R(\mathbf{c}_i) \in SO(3)$ is a rotation matrix and $\mathbf{t}(\mathbf{c}_i) \in \mathbb{R}^3$ is a translation vector. While the translation vector can be given by $\mathbf{t}(\mathbf{c}_i) = \mathbf{p}'_i - R(\mathbf{c}_i)\,\mathbf{p}_i$, the rotation matrix is more involved. For the purpose of computing $R(\mathbf{c}_i)$, we utilize a method for superimposition transform estimation [61] and borrow some concepts from another method [38] originally proposed for LRF estimation. Briefly, a covariance matrix between the corresponding model and scene points is computed and then decomposed to estimate a superimposition transform, as per Ref. [61]. Furthermore, the covariance weights and centroids are computed in a similar manner to Ref. [38]. The covariance matrix $\mathbf{C}(\mathbf{c}_i) \in \mathbb{R}^{3 \times 3}$ is given by Eq. (4), which includes a weighting term, similar to Ref. [38], to provide robustness against both clutter and occlusions. Unlike Ref. [38], the weighting term $\omega(\mathbf{c}_i, \mathbf{c}_j)$ is a bilateral kernel, which provides robustness against within-neighborhood inconsistencies and partially depends on the local rigidity pairwise consistency $l(\mathbf{c}_i, \mathbf{c}_j)$ defined in Eq. (1). Additionally, our weighting-term formulation includes $\sigma_r$ as the standard deviation of the neighborhood radius, as well as the power term $p \in \mathbb{R}$ to adjust the standard accuracy deviation $\sigma_a$ without recomputing $l(\mathbf{c}_i, \mathbf{c}_j)$. Furthermore, as per Ref. [38], the covariance centroids are approximated by the neighborhood center points, in order to speed up the computation.
Fig. 1. An overview of the proposed voting scheme. Our proposed scheme takes a set of putative correspondences (the yellow-colored dense lines between the teal-colored source model and the tan-colored destination scene) as input and processes them in two stages. In the first stage, Section 2.1, a local voting set, $C_l$ (shown as yellow-colored sparse lines), is elected based on the local rigidity constraint. Each element in the voting set represents a global hypothesis, and their elementwise support by the putative correspondences is utilized to post-validate the voting set at the second voting stage, Section 2.2. The top supported hypotheses form the global voting set, $C_g$. In the case shown above, only a single element is selected (shown as the single green-colored line). Using the post-validated voting set, the likelihood scores, $S$, for the putative correspondences being inliers are computed, according to their covariance with the post-validated voting set, where green-colored lines correspond to the inliers, and magenta-colored lines correspond to the outliers. It is up to the high-level application to decide whether to truncate the scores, based on some threshold, or to utilize them all in a weighted model.
However, it is worth noting that the covariance matrix in our formulation is not normalized, as this has no effect on the underlying rotation. Additionally, the difference vectors, their Euclidean distances, and the neighborhoods $N_k(\mathbf{p})$, previously computed in Eq. (2), are reutilized in Eq. (4), which contributes to the computational efficiency of our approach.
Moreover, to further speed up the computations, the power term $p = 1/0.16^2 \approx 39$ is empirically set to adjust the standard accuracy deviation $\sigma_a$ to 16 percent of its original value without recomputing $l(\mathbf{c}_i, \mathbf{c}_j)$. Additionally, it is sufficient to only consider the first $k_r = 18$ neighbors of the self-including k-NN neighborhoods, while setting the standard deviation of the radius to just one half of the point cloud resolution, $\sigma_r = \tfrac{1}{2}$ voxel. These parameters are believed to be suitable for datasets beyond our experiments.
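The effect of the power term can be verified with a short derivation. It assumes that the pairwise likelihood is the Gaussian of Eq. (1), with $d$ standing for the difference between the model-side and scene-side distances (a shorthand introduced here for illustration):

```latex
l^{p}
  = \left[\exp\!\left(-\frac{d^{2}}{2\sigma_a^{2}}\right)\right]^{p}
  = \exp\!\left(-\frac{d^{2}}{2\left(\sigma_a/\sqrt{p}\right)^{2}}\right),
\qquad
\frac{\sigma_a}{\sqrt{p}} = 0.16\,\sigma_a
\;\Longleftrightarrow\;
p = \frac{1}{0.16^{2}} = 39.0625 \approx 39 .
```

Raising the already-computed likelihood to the power $p$ is therefore equivalent to evaluating the same Gaussian with a narrower deviation, which is why no recomputation is needed.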
According to Ref. [61], the rotation matrix $R(\mathbf{c}_i)$ is then formed by multiplying the left and right singular-vector matrices of the covariance matrix, $U$ and $V$ (Eq. (5)), where the determinant $\det(\cdot)$ in the middle diagonal matrix is utilized to negate reflection cases. Notably, state-of-the-art methods [23], [24] have employed LRF algorithms for hypothesis estimation, which only depend on $\mathbf{p}$ and $V$, while our hypothesis depends on $\mathbf{p}'$ and $U$, as well. Although they eventually compose the final hypotheses for some $P \times P'$ LRF set, their hypotheses are contaminated by the inclusion of non-voting-set LRFs. Moreover, the LRFs suffer from sign ambiguities of the three eigenvectors [58]. Even when these ambiguities are resolved by convention, there is no guarantee that the convention is correct. See Fig. 2 for a graphical demonstration.
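The construction of a single hypothesis can be sketched as follows. This is a generic weighted SVD-based superimposition with reflection correction, in the spirit of Ref. [61], and not the authors' exact Eq. (4) and Eq. (5): the weights `w` are passed in by the caller because the precise form of the bilateral kernel is not reproduced here, and the name `one_point_transform` is hypothetical.

```python
import numpy as np

def one_point_transform(p, q, nbr_p, nbr_q, w):
    """Estimate (R, t) for one correspondence (p, q) from its neighbours.

    p, q:         (3,) model point and its matched scene point.
    nbr_p, nbr_q: (k, 3) neighbouring model points and their matched scene points.
    w:            (k,) non-negative weights (assumed given by the caller).
    """
    # Difference vectors centred on the correspondence itself, i.e. the
    # centroids are approximated by the neighbourhood centre points.
    A = nbr_p - p
    B = nbr_q - q
    # Unnormalised weighted covariance: sum_j w_j * a_j * b_j^T.
    H = (w[:, None] * A).T @ B
    U, _, Vt = np.linalg.svd(H)
    # The determinant in the middle diagonal matrix negates reflections.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    # Translation that maps p exactly onto q under R.
    t = q - R @ p
    return R, t
```

Since the rotation and translation come out of one decomposition, no per-point eigenvector sign convention is involved.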
Despite having estimated the single-point superimposition transforms $\{T(\mathbf{c}) : \mathbf{c} \in C_l\}$, it is essential to exclude invalid transformations of the voting set before imposing their dubious assessment on the putative correspondence set, $C$.
To address this chicken-and-egg problem, we consider the global compatibility likelihood of both the putative and voting sets, in an almost identical manner to Eqs. (1), (2), and (3). Consequently, the global pairwise likelihoods $g(\mathbf{c}_i, \mathbf{c}_j)$ between the putative correspondence set $\mathbf{c}_i \in C$ and the voting set $\mathbf{c}_j \in C_l$ are formulated as
$$g(\mathbf{c}_i, \mathbf{c}_j) = \exp\left(-\frac{\|T(\mathbf{c}_j)\,\mathbf{p}_i - \mathbf{p}'_i\|_2^2}{2\sigma_e^2}\right), \tag{6}$$
where $\sigma_e$ is the standard deviation of the error tolerance, which led to compelling discriminative likelihoods in our experiments when set to four times the acquisition accuracy (i.e., $\sigma_e = 4\sigma_a$). From an efficiency point of view, the global pairwise likelihoods $g \in \mathbb{R}^{|C| \times |C_l|}$ should be computed once, and then reused in the following steps. Similar to Eq. (2), the global coarse ranking set $G = \{\mathcal{L}_g(\mathbf{c}_j) : \mathbf{c}_j \in C_l\} \subset \mathbb{R}$ is estimated as the sum of the global pairwise likelihoods over the entire putative correspondence set:
$$\mathcal{L}_g(\mathbf{c}_j) = \sum_{\mathbf{c}_i \in C} g(\mathbf{c}_i, \mathbf{c}_j). \tag{7}$$
Then, similar to Eq. (3), the voting set $C_l$ is sorted in descending order, according to the global coarse ranking scores $G$, and is denoted as $C_G$. The top $k_g$ elements constitute the post-validated voting set $C_g$:
$$C_g = \{\mathbf{c}_{(j)} \in C_G : 1 \le j \le k_g\}, \tag{8}$$
where $\mathbf{c}_{(j)}$ denotes the $j$-th element of the sorted set and $k_g$ has some relation to the expected number of multi-structures in the scene. In this work, we set $k_g = 1$, as the manuscript's scope is limited to single-structure rigid-body correspondences. At any rate, $k_g \ll k_l$ should be maintained for proper likelihood estimation, as demonstrated in Fig. 3.
Fig. 2. The first column depicts a single correspondence, shown as a red line, between both the source and destination points, $\mathbf{p}$ and $\mathbf{p}'$, for which the elliptical dotted spheres surrounding them depict their neighborhoods. The second column shows the basis vectors formed using the corresponding technique over a sphere surface for all correspondences within a voting set. Since LRF, as the name implies, computes the basis transformation of each source and destination point separately, $T_i \in SO(3)$ and $T_j \in SO(3)$, the transformation basis for the correspondence is obtained by composing the inverse of its destination point transform and the source point transform, $T = T_j^{-1} T_i$. However, there are two issues with the LRF technique: (1) due to the eigenvectors' sign ambiguity in each point transformation, there is no guarantee that their composed transform is rotational ambiguity-free, even after following certain conventions to resolve them; (2) the impurities within the neighborhoods of each point are not accounted for. As a result, the voting-set transforms of previous methods [23], [24] that utilize LRF are very chaotic, as shown over the sphere surface in the first row. On the other hand, 1PST takes into consideration the neighboring correspondences (shown as gray lines) to filter out impurities, and computes the correspondence transform, $T \in SE(3)$, from both the translation, $\mathbf{t} \in \mathbb{R}^3$, and rotation, $R \in SO(3)$, in a single pass to avoid sign ambiguities. As a result, the voting set transformation is accurate, as shown over the sphere surface in the second row.
Finally, the fine-tuned likelihoods $S = \{s(\mathbf{c}_i) : \mathbf{c}_i \in C\}$ are estimated for each putative correspondence $\mathbf{c}_i \in C$, by averaging their pairwise likelihoods over the post-validated voting set:
$$s(\mathbf{c}_i) = \frac{1}{k_g} \sum_{\mathbf{c}_j \in C_g} g(\mathbf{c}_i, \mathbf{c}_j). \tag{9}$$
This concludes the description of our proposed methodology. For the sake of simplicity and ease of comparability with the state-of-the-art methods, we denote all methods as functions $f(\cdot, \cdot)$ of two arguments: the first denoting the sorting/trimming stage technique and the second denoting the scoring stage technique. As the sorting stage in our proposed method is based on the LRC of the putative correspondences, and the scoring stage is based on the 1PST of such correspondences, we refer to our proposed method as $f(\mathrm{LRC}, \mathrm{1PST})$.
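The second voting stage, from the global pairwise likelihoods to the final scores, can be sketched in a few lines of NumPy. The sketch assumes that the hypotheses are supplied as stacked rotation matrices and translation vectors, and that the global likelihood is a Gaussian of the residual between the transformed model point and its matched scene point; `score_correspondences` is a hypothetical name.

```python
import numpy as np

def score_correspondences(P, Q, Rs, ts, sigma_e, k_g=1):
    """Post-validate the hypotheses and score every putative correspondence.

    P, Q:   (n, 3) putative pairs (model, scene).
    Rs, ts: (m, 3, 3) rotations and (m, 3) translations, one per voting element.
    Returns (scores of shape (n,), indices of the k_g retained hypotheses).
    """
    # Apply every hypothesis to every model point: result is (m, n, 3).
    mapped = np.einsum("mij,nj->mni", Rs, P) + ts[:, None, :]
    err2 = np.sum((mapped - Q[None, :, :]) ** 2, axis=2)   # (m, n)
    # Global pairwise likelihoods, computed once and reused: (n, m).
    g = np.exp(-err2 / (2.0 * sigma_e ** 2)).T
    # Support of each hypothesis over the entire putative set.
    G = g.sum(axis=0)
    # Keep the k_g best-supported hypotheses (post-validated voting set).
    best = np.argsort(-G, kind="stable")[:k_g]
    # Final likelihoods: average over the post-validated voting set.
    S = g[:, best].mean(axis=1)
    return S, best
```

Computing `g` as one dense array mirrors the remark that the pairwise likelihoods should be computed once and reused for both the post-validation and the scoring.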

EXPERIMENTAL SETUP
In Section 2, we have proposed a two-stage voting scheme, in which a voting set is elected in the first stage and is filtered further in the second stage, before utilizing it to estimate the inlier likelihoods of the input putative correspondences. To demonstrate the accuracy and efficiency of this proposal, we performed experiments to compare our approach with the state-of-the-art. This section is dedicated to describing the experimental setup. First, the utilized dataset is introduced, and we explain how the putative correspondences were obtained from it. After that, the performance indices are discussed and, then, the compared methods and their parameters are briefly introduced.
All compared methods were implemented as single-threaded Python™ [62] scripts, and were evaluated using a laptop computer with a 2.7 GHz processor and 8 GB of available memory. Internally, the highly optimized NumPy package [63] was utilized for the linear algebra operations, while the 2D graphs and 3D graphics were generated using the Matplotlib [64] and the MayaVi [65] packages, respectively.

Dataset
The UWA 3D object recognition (U3OR) dataset [57], [66] features various real-world scanned objects, shown in numerous scenes with different occlusion and clutter rates (Fig. 4), and was utilized in our comparative experiments. The dataset consists of four models ('Chef', 'Chicken', 'Parasaurolophus', and 'T-Rex') and 50 scenes (RS1 to RS50). Each scene includes partial information about several models, with a total of 188 model-scene combinations and two different down-samplings; 374 of these combinations were utilized in this manuscript, as two pairs had no inlier correspondences after sampling their point clouds.
In order to generate the putative correspondence set, the dataset models and scenes were down-sampled to new resolutions, 2 and 5 mm, represented by 1 voxel hereafter. Additionally, since some related studies [23], [24] utilize LRF, which depends on surface normals, the datasets' surface normals were estimated using a principal component analysis (PCA)-based method [67]. Concisely, the normal vector is the eigenvector corresponding to the smallest eigenvalue of the covariance matrix constructed with the k-NN of a point. We limited the surface-normals neighborhood size to $k = 30$ with a radius of 2 voxel, to neutralize both over-sampled and distant points. After that, fast point feature histogram (FPFH) features [36] were computed for the down-sampled point clouds, which resulted in a 33-dimensional vector for each point, representing the point's spatial features. Finally, the feature-space vectors of the models and scenes were matched together to form the putative correspondences $C$ using an approximate k-NN method [60] with $k = 1$. See Fig. 5 for a visualized conceptualization.
The ground truth of a correspondence was constructed based on the ground-truth relative pose transformations of the model-scene pairs, which were part of the dataset as well. A correspondence was considered an inlier if it varied covariantly with the ground-truth superimposition transform $T_{gt}$ within an acceptable tolerance; that is, $C_{gt} = \{\mathbf{c}_i \in C : \|T_{gt}\,\mathbf{p}_i - \mathbf{p}'_i\|_2 < \delta_e\}$. The tolerance was set as twice the resolution of the point cloud, $\delta_e = 2$ voxel. Additionally, the dataset has an occlusion-rate ground truth, which we utilized along with the inlier fraction $|C_{gt}|/|C|$ to evaluate the robustness of the methods against these two challenges. For conciseness, only the means and standard deviations of the conducted experiments are reported.
Fig. 3. The effect of the $k_g = |C_g|$ parameter on the accuracy of the ranking processes, from which it is apparent that $k_g = 1$ is sufficient for our purposes. The green-colored lines indicate inlier correspondences and the magenta-colored ones indicate outliers, while $C_g$ is the post-validated voting set [Eq. (8)].
Fig. 4. Sample scans of the U3OR (UWA 3D object recognition) dataset [57], [66], which was utilized in our comparative evaluation. The dataset consists of four models (shown in teal color, namely: 'Chef', 'Chicken', 'Parasaurolophus', and 'T-Rex') and 50 scenes (the first ten are shown in tan color).
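The ground-truth labeling rule can be written as a short NumPy sketch. It assumes that the ground-truth pose is available as a rotation matrix and a translation vector; the function name `ground_truth_inliers` is hypothetical.

```python
import numpy as np

def ground_truth_inliers(P, Q, R_gt, t_gt, delta_e):
    """Boolean mask of correspondences consistent with the ground-truth pose.

    P, Q: (n, 3) matched model and scene points.
    delta_e: tolerance in the same unit as the points (e.g. 2 voxel).
    """
    # Residual between the transformed model point and its matched scene point.
    residual = np.linalg.norm(P @ R_gt.T + t_gt - Q, axis=1)
    return residual < delta_e
```

The mean of the returned mask gives the inlier fraction of the putative set.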

Performance Metrics
As for the performance metrics, in order to evaluate both the accuracy and efficiency, we measured the precision-recall (PR) area under curve (AUC) and the execution time of each evaluated method. While it is standard, in the context of retrieval and binary classification problems, to utilize the PR criterion, such a criterion represents only a single operating point of the voting scheme at a specific threshold.
In other words, one must choose a score threshold $0 \le s \le 1$ to form a selected correspondence set $C_s = \{\mathbf{c} : S(\mathbf{c}) \ge s, \mathbf{c} \in C\}$, and thus compute the precision $p(s) = |C_{gt} \cap C_s| / |C_s|$ and the recall $r(s) = |C_{gt} \cap C_s| / |C_{gt}|$. On the other hand, the aim is to evaluate the operating characteristics of the voting scheme for any threshold; hence, PR AUC is the most appropriate single-number criterion capturing the parametric PR behavior throughout the entire range of $s$. It integrates the precision $p(s)$ over that range, weighted by the recall derivative $r'(\cdot)$.
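A discrete approximation of the PR AUC can be sketched as follows. It assumes a rank-wise Riemann sum of the precision over the recall increments (equivalent to the common average-precision estimate), with tied scores resolved by input order and at least one ground-truth inlier present; the authors' exact numerical integration may differ, and `pr_auc` is a hypothetical name.

```python
import numpy as np

def pr_auc(scores, is_inlier):
    """Area under the precision-recall curve over all score thresholds."""
    scores = np.asarray(scores, dtype=float)
    is_inlier = np.asarray(is_inlier, dtype=bool)
    # Lowering the threshold admits correspondences in descending score order.
    order = np.argsort(-scores, kind="stable")
    hits = is_inlier[order].astype(float)
    tp = np.cumsum(hits)                           # true positives so far
    precision = tp / np.arange(1, len(scores) + 1)
    recall = tp / hits.sum()
    # Recall increments play the role of the recall derivative.
    dr = np.diff(np.concatenate(([0.0], recall)))
    return float(np.sum(precision * dr))
```

A perfect ranking, in which every inlier scores above every outlier, yields a value of 1.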

The Baseline Methods
Nearest-neighbor distance (NND) scores putative correspondences according to their feature-space distances, while nearest-neighbor similarity ratio (NNSR) scores correspondence distinctiveness using the second-to-first feature-space distance ratio. These two methods are not expected to have significant scores, as they only depend on feature-based measurements. Local rigidity constraint (LRC), however, verifies the rigidity constraint of a correspondence in its local neighborhood, as described in Section 2.1, where its score is the normalized value of Eq. (2). This method is expected to perform quite well in comparison to the previous two, which is why it forms part of our proposed method.
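The two feature-based baselines can be illustrated with a brute-force sketch. The mapping from distances to scores is an assumption made here for illustration (negated distance for NND, and one minus the first-to-second distance ratio for NNSR, so that larger means better in both cases); `nnd_nnsr_scores` is a hypothetical name, and an approximate k-NN search would replace the dense distance matrix in practice.

```python
import numpy as np

def nnd_nnsr_scores(F_model, F_scene):
    """Match each model feature to its nearest scene feature and score it.

    F_model: (n, d) model descriptors; F_scene: (m, d) scene descriptors, m >= 2.
    Returns (match indices, NND scores, NNSR scores).
    """
    # Dense feature-space distance matrix (brute force, for illustration).
    D = np.linalg.norm(F_model[:, None, :] - F_scene[None, :, :], axis=2)
    order = np.argsort(D, axis=1)
    rows = np.arange(len(F_model))
    d1 = D[rows, order[:, 0]]          # distance to the nearest scene feature
    d2 = D[rows, order[:, 1]]          # distance to the second nearest
    nnd = -d1                          # closer matches rank higher
    nnsr = 1.0 - d1 / np.maximum(d2, 1e-12)   # distinctive matches rank higher
    return order[:, 0], nnd, nnsr
```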

The State-of-the-Art Methods
The existing method of Ref. [23] initializes a voting set using both NNSR and LRC for inlier sorting, and both LRC and local reference frames are utilized for the final scoring. Accordingly, we denote the method of Ref. [23] by $f(\mathrm{NNSR} + \mathrm{LRC}, \mathrm{LRC} + \mathrm{LRF})$. Similarly, the more recent existing method [24] is denoted by $f(\mathrm{NNSR}, \mathrm{LRF})$, as it sorts the voting set in the local phase using NNSR scores, and then utilizes LRFs to construct $SO(3)$ hypotheses for global verification and scoring. Refer to the paragraph just above Section 3 for the interpretation of $f(\cdot, \cdot)$.

Optimal RANSAC
Random sample consensus is a randomized method [40], in which tentative sample correspondences are iteratively drawn at random, and a superimposition hypothesis $T \in SE(3)$ is elected if it receives sufficient support from the putative set. Optimal RANSAC [41], which we compare our approach against, improves upon the standard RANSAC algorithm's robustness by resampling the tentative correspondence set, which improves repeatability. The scores are obtained according to the formula $s(\mathbf{c}_i) = \exp\left(-\|T\mathbf{p}_i - \mathbf{p}'_i\|_2^2 / 2\sigma_e^2\right)$, which resembles Eqs. (6) and (9) in our proposed method.

Parameters
As for the parameters, we set the cardinality of the voting set $|C_l|$ (and, thus, the local rigidity neighborhood size) to $k_l = 100$ in all of our experiments. The deviation of the acquisition accuracy, introduced in Eq. (1), was set to one fourth of the point cloud resolution, $\sigma_a = \tfrac{1}{4}$ voxel. Thus, $\sigma_e = 4\sigma_a$ in Eq. (6) became 1 voxel. The LRC and Optimal RANSAC methods were assigned the same values of the $\sigma_a$ and $\sigma_e$ parameters. It is also worth noting that the reported parameter of Ref. [24], $\delta_r = 10$ voxel, did not seem well-tuned and placed their method in a bad light. We therefore tuned it to 1 voxel for better performance, as shown in Fig. 6. The remaining parameters of both state-of-the-art methods [23], [24] were as suggested by their corresponding manuscripts.
Fig. 5. The dataset models (shown in teal color) and scenes (shown in tan color) are down-sampled, their surface normals and point features are computed, and finally, they are matched together to generate the putative correspondence set (the yellow-colored dense lines), which forms an input to the evaluated methods.
Fig. 6. Updating the parameter of a related study. Based on our experimental results, we updated the parameter $\delta_r$ of the related method $f(\mathrm{NNSR}, \mathrm{LRF})$ [24] from 10 voxel to 1 voxel to achieve better performance. A higher PR curve indicates better accuracy. Note that NNSR (nearest-neighbor similarity ratio) and LRF (local reference frame) denote the underlying techniques utilized within the related method [24].

RESULTS AND DISCUSSION
The proposed voting scheme in Section 2 was evaluated on the U3OR dataset [57], [66], after computing its putative correspondences (Section 3.1). The accuracy and efficiency of the results were interpreted in terms of the PR AUC and execution time criteria (Section 3.2) for the proposed method and several other methods, including the state-of-the-art methods [23], [24] (Section 3.3). This section begins with the quantitative accuracy and efficiency results, while the qualitative results follow in later parts.
The proposed method $f(\mathrm{LRC}, \mathrm{1PST})$ outperforms all compared methods, scoring $97.0\% \pm 12.9\%$ on the PR AUC metric, as shown in Fig. 7. This substantially high score in terms of precision and recall is attributed to the voting set post-validation (Eq. (8)) by collecting the putative correspondence support. Indeed, there is a similarity between the proposed method's operating characteristics and those of Optimal RANSAC, which utilizes the closely related concept of hypothesis support and scores the nearest to the proposed method: $95.2\% \pm 20.4\%$. On the other hand, it is worth noting that RANSAC is a randomized algorithm; thus, its repeatability cannot match our deterministic voting scheme.
Additionally, in Fig. 7, the baseline methods NND, NNSR, and LRC have PR AUC values of $5.7\% \pm 2.8\%$, $20.1\% \pm 10.4\%$, and $73.4\% \pm 22.0\%$, respectively. NND has the lowest operating characteristics: its precision remains below the correspondence inlier fraction $|C_{gt}|/|C|$. We believe that such worse-than-guessing performance is related to the indistinctness and low dimensionality of the FPFH features. These feature properties also affect the NNSR scores, to some extent, and might explain why its PR curve exponentially decays and approaches the inlier fraction.
As for the state-of-the-art methods, f f(NNSR + LRC; LRC + LRF) [23] and f f(NNSR; LRF) [24] score $74.2\% \pm 22.2\%$ and $78.3\% \pm 26.4\%$, respectively. The LRC baseline thus performs on par with the state-of-the-art methods, despite its simplicity, and proves to be a highly competent criterion for voting scheme initialization. We believe that there are two reasons why f f(NNSR + LRC; LRC + LRF) [23] does not score considerably higher than LRC, despite using LRC as part of the method. The major reason is that the local and global scores, LRC and LRF, are multiplied together at the scoring stage, instead of utilizing a weighted summation or relying solely on the global scores (see the following paragraph for further details). The other reason is that it applies hard thresholding in several steps of the scheme, causing some loss of information. Overall, the related methods have some operating characteristic issues, and they are all outperformed by our proposed scheme, including the two state-of-the-art methods and the randomized one.
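The effect of multiplying local and global scores, as opposed to summing them with weights, can be illustrated with a toy example. The score values below are invented purely for illustration and are not taken from any of the evaluated methods; the function names and the equal weighting are likewise assumptions.

```python
def product_score(local_score, global_score):
    """Combine scores by multiplication: one low factor suppresses the result."""
    return local_score * global_score

def weighted_sum_score(local_score, global_score, w_local=0.5):
    """Combine scores by a weighted sum: a low factor only lowers the result partially."""
    return w_local * local_score + (1.0 - w_local) * global_score

# Hypothetical (local, global) scores in [0, 1].
# The inlier has weak local support (e.g., near an occlusion boundary)
# but strong global support; the outlier is mediocre on both.
inlier = (0.1, 0.9)
outlier = (0.4, 0.4)

for name, combine in (("product", product_score), ("weighted sum", weighted_sum_score)):
    ranked_first = "inlier" if combine(*inlier) > combine(*outlier) else "outlier"
    print(f"{name}: inlier={combine(*inlier):.2f}, "
          f"outlier={combine(*outlier):.2f}, ranked first: {ranked_first}")
```

Under these made-up numbers, the product ranks the outlier above the inlier, whereas the weighted sum ranks the inlier first. This is one plausible mechanism by which a multiplicative combination could cap the attainable precision.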
To gain in-depth insight into the novelty and performance of the proposed method f f(LRC; 1PST), especially in comparison to the f f(NNSR + LRC; LRC + LRF) method [23] and Optimal RANSAC [41], several hybrid combinations of the existing methods and the proposed one were studied. The results are shown in Fig. 8. First, despite the fact that the LRC stage is utilized in the existing method [23], the originality and effectiveness of our formulation are apparent in Fig. 8a. The first curve corresponds to Ref. [23], while the second, f f(NNSR + LRC; LRC + 1PST), is a hybrid method that replaces LRF with the proposed 1PST. In these two curves, LRC is utilized in the scoring phase, which drastically limits the performance, regardless of which other scoring technique (i.e., LRF or 1PST) it is combined with. This is the major issue with the method of Ref. [23], and it can be alleviated by considering a weighted summation or by simply avoiding LRC in the scoring phase. However, scoring solely with LRF does not distinctively outperform the original approach [23], as demonstrated by the third curve, f f(NNSR + LRC; LRF), due to the shortcomings of LRF. Indeed, no significant performance gain can be observed until 1PST is used solely in the scoring phase, as per the fourth curve, f f(NNSR + LRC; 1PST). Similarly, combining NNSR with LRC in the sorting phase harms the performance, rather than benefiting it, as shown by the proposed method's curve, f f(LRC; 1PST), compared to the former ones.
Fig. 7. Accuracy of the seven methods, in terms of their precision and recall means (lines) and standard deviations (shades) over 374 experiments. To enhance clarity, the nth means and deviations are denoted by the points and the error bars. The horizontal axis corresponds to the recall metric, while the vertical axis is the precision metric. In this parametric plot, the higher a curve is above the horizontal axis, the better its results are. The proposed scheme f f(LRC; 1PST) remains accurate over a large recall range and outperforms all the related methods, including the two state-of-the-art methods and a robust randomized method. Refer to Section 3.3 for the details of the compared methods.
Fig. 8. Demonstration of the impact of the proposed stages, by comparing the proposed method to variants of competing methods combined with one of the proposed stages, in terms of their precision and recall means (lines) and standard deviations (shades). To enhance clarity, the nth means and deviations are denoted by the points and the error bars. The horizontal axis corresponds to the recall metric, while the vertical axis is the precision metric. In this parametric plot, the higher a curve is above the horizontal axis, the better its results are. Note that only the data of 186 experiments (5 mm resolution) are utilized in these graphs, to avoid memory issues in the combined methods: (a) the f f(NNSR + LRC; LRC + LRF) method [23] with and without the proposed 1PST (single-point superimposition transform) and (b) Optimal RANSAC [41] with and without LRC (local rigidity constraint).
Second, as shown in Fig. 8b, although coupling LRC with Optimal RANSAC [41] enables the truncation of some outliers and, thus, an improvement of the PR AUC score, it is still outperformed by f f(LRC; 1PST), thanks to the proposed 1PST scoring stage. These in-depth experiments indicate the importance of both the LRC sorting and 1PST scoring stages, as no other combination outperforms our approach, which underlines the care taken in formulating the proposed method and its significance. It is worth noting that some hybrid combinations ran out of memory when utilizing the 2 mm resolution data; thus, only the 5 mm resolution data were utilized, for which the results are slightly different from, but consistent with, the rest of the figures in this manuscript.
As shown in Fig. 9, the PR AUC score is inversely correlated with the occlusion ratio and directly correlated with the inlier fraction; however, in both cases, the proposed method remains robust. Notably, the state-of-the-art method [24], denoted by f f(NNSR; LRF), exhibits a lack of robustness against a high level of occlusion (Fig. 9a). This mostly originates from the fact that this method does not perform local neighborhood rigidity checks, but rather uses methods that depend solely on feature-based measurements (i.e., NND and NNSR). Furthermore, LRC and f f(NNSR + LRC; LRC + LRF) degrade, to some extent, and the most robust methods are the proposed method f f(LRC; 1PST) and Optimal RANSAC. In Fig. 9b, only two methods demonstrate robustness against scarce inliers, namely, the proposed method and Optimal RANSAC, which is mostly due to their hypothesis-support strategies. In summary, only the proposed method and Optimal RANSAC exhibit high robustness to both scarce inliers and large occlusions.
As for the computational efficiency analysis, aside from Optimal RANSAC, the time complexity of the remaining algorithms, including the proposed method, is $O(|C|)$, as the voting sets of all these algorithms are fixed in size. This can also be observed empirically from the linear relationships between problem size and execution time. Fig. 10 shows this relationship, but on a logarithmic scale to accommodate all results. In comparison to our proposal, as per Fig. 10a, both of the state-of-the-art methods [23], [24], f f(NNSR + LRC; LRC + LRF) and f f(NNSR; LRF), have higher slope coefficients, which indicates reduced efficiency. This is mostly because the hypothesis transforms are computed for the entire putative set $C$, and because additional computation is spent on resolving the LRF ambiguities. On the other hand, in the proposed method we compute the hypotheses only for the voting set $C_l \subset C$, and the 1PSTs have no ambiguities to resolve, so no such overhead is incurred.
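The empirical check described above can be sketched as follows: a least-squares line through (problem size, execution time) pairs gives the slope coefficient used to compare methods, and the slope in log-log space indicates whether the growth is indeed close to linear. The timings below are synthetic placeholders, not measurements from our experiments, and the helper names are hypothetical.

```python
import numpy as np

def fit_linear_cost(sizes, times):
    """Least-squares fit of times ~ slope * sizes + intercept."""
    sizes = np.asarray(sizes, dtype=float)
    times = np.asarray(times, dtype=float)
    slope, intercept = np.polyfit(sizes, times, deg=1)
    return float(slope), float(intercept)

def loglog_exponent(sizes, times):
    """Slope of log(time) against log(size); close to 1 for linear-time growth."""
    exponent, _ = np.polyfit(np.log(sizes), np.log(times), deg=1)
    return float(exponent)

# Synthetic timings (seconds) for two hypothetical linear-time methods.
sizes = np.array([1000, 2000, 4000, 8000])
times_fast = 2e-5 * sizes   # assumed cost per correspondence: 20 microseconds
times_slow = 8e-5 * sizes   # assumed cost per correspondence: 80 microseconds

slope_fast, _ = fit_linear_cost(sizes, times_fast)
slope_slow, _ = fit_linear_cost(sizes, times_slow)
print("slope ratio (fast / slow):", slope_fast / slope_slow)
print("log-log exponent:", loglog_exponent(sizes, times_fast))
```

Two methods with the same asymptotic complexity can therefore still differ by a constant factor, which appears as a vertical offset between parallel lines on a logarithmic plot.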
Importantly, as per Fig. 10b, despite the robustness of Optimal RANSAC, it takes a great deal of time when the inlier fraction is less than 10 percent, which limits its applicability to real-time scenarios. On the other hand, our method consumes a constant time, much less than that of the state-of-the-art methods, and this time can be tuned further based on the desired application. Remarkably, the proposed method's execution time is only $24.1\% \pm 6.0\%$ of the time taken by f f(NNSR + LRC; LRC + LRF) [23], and $18.0\% \pm 4.1\%$ of the time taken by f f(NNSR; LRF) [24].
Finally, Fig. 11 shows a qualitative evaluation on some of the dataset's more challenging scenes with respect to the four models. These qualitative results show similar tendencies to the conclusions from the PR AUC scores. It is apparent that only Optimal RANSAC and the proposed method f f(LRC; 1PST) perform adequately. Nevertheless, Optimal RANSAC also has failure cases for all of the given qualitative examples, despite its high repeatability and robustness, which indicates the superior robustness and accuracy of the proposed method.
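The sensitivity of randomized sampling to the inlier fraction can be illustrated with the textbook RANSAC iteration bound, $N = \lceil \log(1 - p) / \log(1 - w^s) \rceil$, where $w$ is the inlier fraction, $s$ the sample size, and $p$ the desired confidence. This is only a sketch of the classical bound; it is not the stopping rule of Optimal RANSAC [41], and the sample size of three correspondences and the confidence of 0.99 are assumptions.

```python
import math

def ransac_iterations(inlier_fraction, sample_size=3, confidence=0.99):
    """Classical bound on the draws needed to hit one all-inlier sample.

    sample_size=3 and confidence=0.99 are illustrative assumptions.
    """
    p_all_inliers = inlier_fraction ** sample_size
    if p_all_inliers <= 0.0:
        return math.inf            # no all-inlier sample can ever be drawn
    if p_all_inliers >= 1.0:
        return 1                   # every sample is all-inlier
    # log1p keeps precision when p_all_inliers is tiny.
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-p_all_inliers))

# Inlier fractions spanning the approximate range reported for the dataset.
for w in (0.21, 0.10, 0.015):
    print(f"inlier fraction {w:.3f}: {ransac_iterations(w)} iterations")
```

Because the bound scales roughly with $w^{-s}$, dropping the inlier fraction by an order of magnitude inflates the iteration count by about three orders of magnitude for $s = 3$, which is consistent with the steep rise in execution time at low inlier fractions.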
We strongly urge researchers to address the scarcity of unified benchmarks for correspondence evaluation, as well as the lack of consensus on appropriate evaluation criteria. We observed that the reported scores of any given method were often not directly comparable to those of another, as the proposed methods usually utilize different down-samplings and different features for computing the correspondences, and even follow different criteria for evaluation.
Fig. 9. Robustness analysis of competing methods against occlusions and scarce inliers in terms of the PR AUC means (lines) and standard deviations (shades), which correspond to the vertical axis. To enhance clarity, the nth means and deviations are denoted by the points and the error bars. The horizontal axes correspond to (a) the occlusion ratio and (b) the inlier fraction. The occlusion ratio denotes the ratio between observed surface points and the total surface points of the scene, which falls in the approximate range of 62%-93%, where the higher the occlusion ratio, the more challenging the problem. Similarly, the inlier fraction $|C_{gt}|/|C|$ denotes the ratio between the count of ground-truth inliers and the correspondence set size, which falls in the approximate range of 1.5%-21%, where the lower the inlier fraction, the more challenging the problem. In both cases, the higher a curve is, the better its robustness. Refer to Section 3.3 for the details of the compared methods.
Fig. 10. Analysis of the computational efficiency of the seven methods, in terms of their execution time means (lines) and standard deviations (shades) over 374 experiments. To enhance clarity, the nth means and deviations are denoted by the points and the error bars. The vertical axis denotes the elapsed time, while the horizontal axes correspond to (a) the correspondence set cardinality $|C|$ and (b) the inlier fraction $|C_{gt}|/|C|$. Refer to Section 3.3 for the details of the compared methods.
Fig. 11. Qualitative performance matching four models (shown in teal color) with some challenging sample scenes (shown in tan color), all from the U3OR (UWA 3D object recognition) dataset [57], [66] down-sampled to 5 mm. The input correspondences and their ground truths are shown in the first and last rows, respectively. The remaining rows show the qualitative results for each method, which consist of (from top to bottom) three baseline methods, two state-of-the-art voting schemes, a robust randomized method, and the proposed voting scheme. The green-colored lines indicate inlier correspondences, while the magenta-colored ones indicate outliers. Only top-ranked correspondences are shown for each method, with an equal count to their corresponding ground truth. Refer to Section 3.3 for the details of the compared methods.

CONCLUSION
Rejection of spurious correspondences is an essential step for proper geometric modeling, high-level computer vision, and image processing tasks, such as motion estimation, recognition, and reconstruction, among others. Despite all the progress in the field during recent decades, the correspondence problem remains an open one, with few methods tackling it. Most of these methods are either too complex and slow, or lack adequate accuracy. For these reasons, we proposed an extremely efficient, robust, accurate, and simple voting scheme for correspondence scoring. Our proposed method consists of two stages. In the first stage, a voting set is elected based on local rigidity consistency. In the second stage, this voting set is post-validated by enumerating its element-wise global support from the putative correspondence set. The post-validated voting set is then utilized to score the putative correspondence set, based on their pairwise covariances. While the proposal is simple, its novelty lies in the careful formulation of the post-validation to solve this chicken-and-egg problem. It is worth noting that the method is very flexible with respect to the utilized correspondence estimation method, as it takes the raw correspondences as input without any dependency on the detector, descriptor, matching algorithm, or any additional information (such as correspondence quality scores).
The proposed scheme was evaluated on the U3OR dataset [57], [66]. It demonstrated a high level of accuracy, with an average of $97.0\% \pm 12.9\%$ for the PR AUC criterion over a total of 374 experiments. On the other hand, the state-of-the-art methods [23], [24] scored $74.2\% \pm 22.2\%$ and $78.3\% \pm 26.4\%$; thus, they were outperformed by our proposed method. The proposed method also demonstrated adequate robustness against occlusions and scarce inliers, and a high efficiency. It seems that the local rigidity constraint is the major contributor to occlusion robustness, while collecting hypothesis support improves performance when there are scarce inliers. The efficiency of our proposed method comes from limiting the hypothesis computation to a relatively small voting set; thus, its execution time is only $24.1\% \pm 6.0\%$ of the time consumed by the fastest state-of-the-art method. Overall, the proposed method exceeded all the compared methods in all aspects.
However, the proposed scheme is not universal, as it is currently limited to single-structure rigid-body geometric fitting. Thus, addressing multi-structure geometric modeling would be an interesting extension of this proposal. Moreover, our proposed scheme picks the most highly supported hypothesis without resampling. Resampling, however, seems to make the estimation more robust [41], [68], [69], and thus constitutes one of the future directions. Another future direction is to borrow the concept of higher-than-minimal subset sampling [70], [71] into the voting schemes, perhaps by clustering correspondences [72] or poses [73]. Moreover, only down-sampled point clouds were employed, so involving complete point clouds in hypothesis election or fine-tuning is expected to improve the results further. Finally, while the method is proposed for the 3D scenario, extending it to other scenarios (e.g., 2D), by adapting its scoring and transformation estimation techniques, seems feasible and could have broad implications.
Hamdi Sahloul received the BSc degree in computer and control engineering from Sana'a University, Sana'a, Yemen, in 2010, and the ME and PhD degrees in precision engineering from the University of Tokyo, Tokyo, Japan, in 2016 and 2019, respectively. Between 2010 and 2014, he worked as a machine-to-machine systems designer. In 2019, he started working for a robotics automation company as a computer vision engineer. His research interests include robotics automation, computer vision, and machine learning.
Shouhei Shirafuji received the PhD degree in information science from Osaka University, Osaka, Japan, in 2014. He was a JSPS research fellow from 2014 to 2015. From 2015 to 2018, he was a postdoctoral researcher with the University of Tokyo, Japan. Since 2018, he has been an assistant professor with the Research into Artifacts, Center for Engineering, University of Tokyo, Japan. His main research interests include mechanical design, robotics, and bio-mechanics.
Jun Ota received the BE, ME, and PhD degrees from the Faculty of Engineering, University of Tokyo, Tokyo, Japan, in 1987, 1989, and 1994, respectively. He is currently a professor with Research into Artifacts, Center for Engineering (RACE), University of Tokyo, Japan. From 1989 to 1991, he worked with Nippon Steel Corporation. In 1991, he was a research associate with the University of Tokyo. He became a lecturer and an associate professor in 1994 and 1996, respectively. From 1996 to 1997, he was a visiting scholar with Stanford University. In April 2009, he became a professor with the Graduate School of Engineering, University of Tokyo. In June 2009, he became a professor of RACE, University of Tokyo. Since 2015, he has been a guest professor with the South China University of Technology. He received a Fellowship from the Robotics Society of Japan in 2016. His research interests include multi-agent robotic systems, embodied-brain systems science, design support for large-scale production/material handling systems, and human behavior analysis and support.