A Survey of 2D and 3D Image Matching in Industrial Object Alignment

Abstract Image matching is a practical tool in the computer vision domain. 2D image matching can be area based or feature based, while 3D image matching can be view based or point cloud based. Intensive practical research has been done on both 2D and 3D image matching, and the methods are generally designed to meet the requirements of industrial alignment applications, covering aspects such as full rotation, scaling, and the handling of multiple objects. The aim of this study is to present a review of 2D and 3D image matching and to provide a comprehensive reference source for those involved in image matching.


Introduction
Recently, there has been a significant increase in the demand for both two-dimensional (2D) and three-dimensional (3D) image matching. In particular, image matching is often used in industrial alignment, such as Automated Optical Inspection (AOI) or industrial automation. This requires that real-world objects be automatically recognized and localized in a digital image by a computer; in addition, the pose of each object must usually be estimated.
The classic key ideas of 2D and 3D image matching were published within the last 20 years. In this paper, we do not go into the details of particular algorithms or describe the results of comparative experiments. Instead, we summarize the papers proposed within the last 15 years and point out the interesting aspects of 2D and 3D image matching methods, respectively. Several applications are listed below to illustrate the most important requirements that need to be fulfilled in real-world applications.
An inspection application is illustrated in Figure 1. Here, the task is to count the number of Light-Emitting Diode (LED) dies on the printed circuit board (PCB), and also to check the matched dies for defects. Before defect detection can be done, the pose of the dies in the image must be determined using an image matching approach. In this case, a 'golden' sample die image (an image with no defects) is used to build a model for the matching process. In order to detect defective dies, the matching approach must detect all the dies mounted on the PCB. After the pose of the dies has been determined by the matching approach, defect detection can be performed using a method such as image subtraction.
The application illustrated in Figure 2 introduces further important aspects to be considered in image matching. Here, the metal parts on the conveyor shown in Figure 3 must be picked by a robot. In this example, the matching program should be able to recognize several different instances of the object in an image simultaneously. The different metal parts may overlap, so the matching approach must be capable of handling a certain degree of occlusion. In addition, non-uniform illumination, in this case caused by an incorrectly mounted light source (see Figure 3), needs to be taken into account. Uniform illumination of constant brightness is very important for most applications, but it is sometimes hard to achieve without an expensive setup, so the matching method must also tolerate variations in illumination. The matching program should also be resistant to noise, although noise is fortunately a minor issue in most industrial applications. After the metal parts have been recognized in this pick-and-place example, the world coordinates of the pick point are transmitted to the robot.
Another pick-and-place application is illustrated in Figure 4. This example shows an object with arbitrary rotation about the x, y, and z axes. In this application, the metal parts are picked by a 6-DOF robot arm, so the matching method must be capable of recognizing the object image at any rotation about the three axes.
The points below summarize the requirements of an image matching approach [1]:
(1) The object must be matched in real time.
(2) The model representation of an object should be computed from a golden sample.
(3) The image matching approach should be general with regard to the type of object.
(4) Image noise and clutter should not interfere with image matching.
(5) An object under rigid motion should be recognized.
(6) The pose parameters must be of high accuracy.
(7) All instances of an object should be found in the image.
The rest of the paper is organized as follows: Section 2 provides a review of 2D image matching based on area and features. Section 3 reviews 3D image matching. The conclusions are set out in Section 4.
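As a minimal illustration of the image-subtraction defect check mentioned for the inspection example, the sketch below compares an already-aligned inspection image against the golden sample using NumPy. The threshold value and array sizes are assumptions chosen for the example, not values from the cited work:

```python
import numpy as np

def detect_defects(golden, inspected, threshold=30):
    """Flag pixels whose absolute intensity difference from the golden
    (defect-free) sample exceeds a threshold. Assumes both images were
    already aligned by the image matching step."""
    diff = np.abs(golden.astype(np.int16) - inspected.astype(np.int16))
    return diff > threshold

# A defect-free die and a copy with a bright blemish at pixel (2, 3).
golden = np.full((5, 5), 100, dtype=np.uint8)
inspected = golden.copy()
inspected[2, 3] = 200

mask = detect_defects(golden, inspected)
print(mask.sum())          # 1 defective pixel
print(np.argwhere(mask))   # [[2 3]]
```

In practice the alignment step is the hard part; once the matched pose is known, the subtraction itself is trivial.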

Approaches to 2D image matching
2D image matching techniques are particularly useful and widely used for many applications. They are simple and can handle many different types of objects. Existing 2D image matching approaches can be divided into two broad categories [1]: area based and feature based. In the following subsection, we adopt this classification and review a large number of existing 2D image matching approaches.

Area based
The area-based method is sometimes called the correlation-like or template matching method, and the basic concept has long been popular. In this approach, a small template is applied to a large inspection image by sliding the template window pixel by pixel, and a correlation value is computed between the template and each compared image block. The maximum values or peaks of the computed correlation values indicate the degree of match between the template and the compared image blocks in the search image. Simple similarity measures such as the sum of absolute differences (SAD) or the sum of squared differences (SSD) are frequently used in template matching applications. However, these measures are not immune to the brightness and contrast variations that occur in practical situations. The most popular and robust correlation-based similarity measure, normalized cross-correlation (NCC), is robust against illumination changes. [2][3][4] For template matching in AOI, the optimal matching result is indicated by the maximum NCC correlation, and the NCC coefficient is confined to the range −1 to 1. The NCC coefficient is equal to 1 when there is perfect matching and the two compared images are the same. NCC is computationally intense because, to find an object in a scene, the template must slide over the image pixel by pixel before the NCC coefficient can be computed. Template matching using a full-search correlation metric is exhaustive and time-consuming, especially if rotation and multiple objects are to be detected. The efficient and rapid location of the best match in an image has therefore always been an important issue in the matching process. [5]
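The sliding-window correlation search described above, with NCC as the similarity measure, can be sketched as follows. This is a naive full-search illustration in NumPy, not an optimized implementation; the image sizes and the brightness/contrast change applied to the scene are assumptions for the example:

```python
import numpy as np

def ncc(patch, template):
    """Normalized cross-correlation between two equal-size blocks.
    Invariant to linear brightness and contrast changes."""
    p = patch - patch.mean()
    t = template - template.mean()
    denom = np.sqrt((p * p).sum() * (t * t).sum())
    return (p * t).sum() / denom if denom > 0 else 0.0

def match_template(image, template):
    """Exhaustive pixel-by-pixel search; returns the top-left corner of
    the window with the highest NCC score, plus that score."""
    ih, iw = image.shape
    th, tw = template.shape
    best, best_pos = -2.0, (0, 0)
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            score = ncc(image[y:y + th, x:x + tw].astype(float),
                        template.astype(float))
            if score > best:
                best, best_pos = score, (y, x)
    return best_pos, best

# Embed the template at (3, 4), then change the scene's brightness and
# contrast; NCC still finds it because it normalizes mean and variance.
rng = np.random.default_rng(0)
template = rng.integers(0, 255, (4, 4)).astype(np.uint8)
image = rng.integers(0, 255, (10, 12)).astype(np.uint8)
image[3:7, 4:8] = template
pos, score = match_template(image.astype(float) * 0.8 + 20, template)
print(pos)   # (3, 4), with a score close to 1.0
```

The two nested loops make the cost obvious: every window position requires a full correlation, which is exactly the burden the acceleration techniques in the next subsection try to remove.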

Improving real-world application
To meet the requirements of AOI applications for computation time and full pose estimation, numerous approaches have recently been proposed to reduce the time taken for 2D image matching. The improved methods are listed in Table 1, which shows the numerous techniques aimed at speeding up the image alignment process. There are two main strategies for enhancing performance: accelerating the computation itself and skipping unnecessary computation. The main goal of the computation enhancement strategy is the rapid calculation of key values, such as the image mean, image variance, edge gradients, etc. The integral image technique is very useful here and needs only three addition and three subtraction operations to summarize the value of a specific area. Another powerful way to achieve parallel computing is to use modern CPUs for single instruction multiple data (SIMD) operations [20,21], where one SIMD instruction is simultaneously applied to multiple data points; this can dramatically reduce the computational burden. The other strategy is to exclude unnecessary computation by discarding impossible candidates as early as possible, or by discarding impossible regions of the search space during the matching process. Image searches can be made much more efficient using the image pyramid technique [22] in a coarse-to-fine strategy. Here, a full exhaustive search is first applied to find candidates at the top of the image pyramid; the search is then repeated downwards, level by level, until the candidate is confirmed at the lowest level. Search time is dramatically reduced because the search space at each level is inherited from the candidate's parent. Methods based on bounded conditions, elimination strategies, and the winner-update scheme skip impossible candidates by estimating an index over a partial region between the template and the compared image. In the bounded condition and elimination strategy methods, the index is calculated using bounded or partial correlation; it shows whether the compared image is a possible candidate, and if not, the matching process skips that location. In the invariant descriptor method, an invariant feature can be obtained using ring projection and the Fourier transform, and these features can be used to reduce the computational burden of the matching process.
The NCC method with rotation was introduced to satisfy the full pose estimation requirements of AOI applications needing full rotation and scaling. [56,57] A pre-computed score set relating rotated templates to the original template is used to recognize the target object, skipping unnecessary computation in a first step, and the rotation angle of the target object is then estimated in a refinement step. [56,57] Here, piecewise linear models are used to evaluate the rotation angle with straight lines. Furthermore, the use of bi-cubic interpolation of the matching scores to obtain the subpixel-level location and rotation angle in both steps was proposed by Cui et al. [57]. An accelerated method for matching an object with full rotation and scaling was proposed by Chen et al. [28]; it combines the image pyramid search scheme with SIMD parallel computation and can quickly find objects with rotation, translation, and scaling in both monochrome and color images.
In contrast to the correlation-based methods, a gradient-based NCC metric for the detection of texture-less objects has been presented by several authors. [12,43,44] Here, the efficient template matching method considers only the image gradient to detect an object, and makes further use of a binary representation of the gradients. SIMD units are then used to reduce the computational burden of the gradient-based NCC metric, and the iterative closest point (ICP) algorithm is employed to optimize the pose estimation. [42]
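The integral image trick mentioned above can be sketched as follows: after one pass to build the summed-area table, the sum over any rectangle is obtained from just four table entries. This is a minimal NumPy sketch; the example array is arbitrary:

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero top row and left column,
    so window sums never need boundary checks."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def area_sum(ii, y, x, h, w):
    """Sum of any h-by-w window in constant time, using only
    additions and subtractions of four table entries."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

img = np.arange(1, 13).reshape(3, 4)   # rows: 1..4, 5..8, 9..12
ii = integral_image(img)
print(area_sum(ii, 0, 0, 2, 2))   # 1+2+5+6 = 14
print(area_sum(ii, 1, 1, 2, 3))   # 6+7+8+10+11+12 = 54
```

Because window sums (and hence means and variances) become constant-time lookups, the per-window cost of measures such as NCC drops substantially.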

Feature based
In contrast to the area-based method, the feature-based approach does not work directly with image intensity values. Instead, a search is made for pairwise correspondences between interest points, edge points, or other feature descriptors in the template and the search image. The features represent information at a higher level, which makes the feature-based method suitable for scenarios where changes in illumination are expected. To use feature-based matching, basic features (regions, edges, or interest points) must first be extracted. Regions are often represented by their center of gravity, which is invariant with respect to rotation, scaling, and skewing and is also stable under random noise and gray-level variation; the general criterion of closed boundary regions is prevalent. The Maximally Stable Extremal Regions (MSER) method, based on the homogeneity of image intensities, has been proposed to obtain region features. [6] For line feature detection, the standard edge detection method of Canny [7] is employed. To detect interest points, the powerful SIFT [8] and SURF [9] methods are used. A planar object detection method was introduced [10] for matching objects, in which SIFT and SURF are used to obtain the key points and a nearest neighbor search is employed to find corresponding points in the two images. The use of edge features, where the sum of the normalized dot products of the gradient direction vectors of the model edge points is calculated, has been proposed by Steger [11]. Hinterstoisser et al. [12] suggested real-time template matching based on the dominant gradient orientations of edges. Furthermore, Huttenlocher et al. [13][14][15] applied the directed Hausdorff distance in several efficient algorithms for image alignment. The main problem with feature-based methods is that the objects must contain discriminative texture. The rotation-invariant descriptor method, in contrast to features of regions, edges, and interest points, transforms a 2D gray-level image into a 1D ring-projection space. [16] Tsai and Chiang [17] further used ring projection to represent a target pattern in wavelet-decomposed subimages; specifically, they used only pixels with high wavelet coefficients at low resolution to compute and compare the NCC between two patterns. Ullah and Kaneko [18] proposed a rotation-invariant template method based on gradient information in the form of orientation codes; however, the method involves histogram computation and is time-consuming. In addition, the Zernike moments method [19] is useful for building the pattern features of a template image and provides invariance to rotation, translation, and scale.
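The ring-projection transform mentioned above, which turns a 2D gray-level pattern into a rotation-invariant 1D profile, might be sketched as follows. This is a simplified NumPy version; binning pixels by integer radius is an implementation choice for the example, not the cited authors' exact formulation:

```python
import numpy as np

def ring_projection(img):
    """Average gray level on concentric rings around the image centre.
    Rotating the pattern about the centre only moves pixels within
    their rings, so the resulting 1D profile is rotation invariant."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.indices((h, w))
    r = np.sqrt((ys - cy) ** 2 + (xs - cx) ** 2).astype(int)  # ring index
    sums = np.bincount(r.ravel(), weights=img.ravel().astype(float))
    counts = np.bincount(r.ravel())
    return sums / counts

# A small asymmetric pattern: the profile is unchanged by a 90-degree turn.
img = np.zeros((9, 9))
img[4, 4] = 10.0
img[4, 6] = 5.0
profile = ring_projection(img)
print(profile[0])                                            # 10.0
print(np.allclose(profile, ring_projection(np.rot90(img))))  # True
```

Matching two such 1D profiles (e.g. with NCC) is far cheaper than matching the rotated 2D patterns directly, which is why ring projection is used to prune candidates before a full 2D comparison.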

Summary
A survey of 2D image matching has been presented in the sections above, and several practical methods proposed for use in real-world applications have been discussed. Area-based methods are useful in applications where the window contains images with distinguishable features, but mismatching will occur if the window images contain smooth areas without any prominent detail. Feature-based methods are typically applied when the local structure is more significant than the information carried by image intensity. Despite their limitations, area- and feature-based 2D image matching methods are still in common use because they are easy to implement.

3D image matching approaches
While 2D image matching has been extensively investigated and is a relatively mature research area, 3D image matching has several advantages: (1) range images provide depth information; (2) features extracted from range images are not affected by illumination; (3) an estimated 3D pose of the object is more accurate than a pose estimated from 2D images. However, 3D image matching is difficult, and there are many problems, such as scaling, viewpoint variation, illumination changes, partial occlusion, and background clutter. Modeling techniques that use 3D point clouds are becoming increasingly popular. 3D image matching can be defined as the problem of finding all instances of the models in a database in an arbitrary scene and then determining the pose of the detected objects. Several reviews related to the 3D image matching problem can be found in the literature. Point clouds are typically registered using the ICP algorithm because it is popular and simple, and the Point Cloud Library (PCL) is a good example of a state-of-the-art implementation of point cloud techniques. [58] The existing 3D image matching schemes that use point clouds are: local feature-based, global feature-based, and view-based methods.
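A minimal sketch of the ICP registration step mentioned above: brute-force nearest-neighbor correspondences plus the SVD-based (Kabsch) rigid transform, iterated. Real implementations such as PCL's are far more elaborate (k-d trees, outlier rejection, convergence tests); the synthetic cloud and transform below are assumptions for the example:

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Least-squares rotation and translation mapping src onto dst
    (Kabsch algorithm via SVD)."""
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cd - R @ cs

def icp(src, dst, iters=20):
    """Minimal ICP loop: pair each source point with its nearest target
    point, solve for the rigid motion, apply it, and repeat."""
    cur = src.copy()
    for _ in range(iters):
        d2 = ((cur[:, None, :] - dst[None, :, :]) ** 2).sum(axis=2)
        R, t = best_rigid_transform(cur, dst[d2.argmin(axis=1)])
        cur = cur @ R.T + t
    return cur

# A model cloud and a slightly rotated/translated copy as the "scene".
rng = np.random.default_rng(1)
model = rng.normal(size=(40, 3))
a = np.radians(5)
R_true = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a),  np.cos(a), 0.0],
                   [0.0,        0.0,       1.0]])
t_true = np.array([0.1, -0.05, 0.08])
scene = model @ R_true.T + t_true

aligned = icp(model, scene)
print(np.linalg.norm(aligned - scene) < np.linalg.norm(model - scene))
```

Because ICP only refines an initial guess, the feature-based and view-based methods below are typically used to supply the coarse pose that ICP then polishes.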

Local feature-based methods
In local feature methods, local features are extracted from around specific key points. These methods generally handle occlusion and clutter better than the global feature methods.
To describe the important points in a point cloud, Point Feature Histograms (PFH) were introduced by Rusu et al. [61,62], in which a PFH two-point descriptor is built for all neighboring points of the reference point. Four features, the angles between the points' normals and the distance vector between them, are measured and categorized using a 16-bin histogram. However, defining local invariant features depends very much on local surface information, which is directly related to the quality and resolution of the acquired and model data. A more efficient method based on PFH, Fast Point Feature Histograms (FPFH), was later proposed by Rusu et al. [63]; it retains most of the discriminative power of PFH, and full descriptions of both PFH and FPFH are given there. For greater efficiency, a fast voting scheme based on features describing pairs of points was proposed by Drost et al. [60]. The pair features are defined using surface points together with their normal vectors, which is more discriminative than individual primitives, and the pair features also reduce the complexity of the matching task. In this work, the model is represented off-line by a set of pair features stored in a hash table for fast retrieval. In the on-line phase, all the points in the scene are paired with a reference point to create point pair features, and these features are efficiently matched to those of the model using the hash table. An industrial part handling system that makes use of this efficient voting scheme was proposed by Skotheim et al. [64]: fast voting based on the pair features is employed to recognize the object at a coarse level, and the ICP algorithm is applied to obtain results at the fine level. Choi et al. [65] added three further pair features, described by oriented surface and boundary points and boundary lines, to increase the robustness and efficiency of pose estimation. In addition, a comparative study of local feature descriptors has been carried out by Morell-Gimenez et al. [66].
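The point pair feature used in Drost et al.'s voting scheme can be sketched as follows: the distance between two oriented surface points plus three angles, quantized so that similar pairs hash to the same bucket. This is a simplified version; the quantization step sizes are assumptions for the example:

```python
import numpy as np

def angle(a, b):
    """Angle between two vectors, in radians."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def point_pair_feature(p1, n1, p2, n2):
    """Drost-style point pair feature: the distance between two surface
    points and three angles built from their normals. It is invariant to
    rigid motion, so it can index a hash table of model pairs."""
    d = p2 - p1
    return (np.linalg.norm(d), angle(n1, d), angle(n2, d), angle(n1, n2))

def quantize(f, dist_step=0.05, angle_step=np.radians(12)):
    """Discretize the feature so similar pairs share a hash bucket."""
    return (int(f[0] / dist_step), int(f[1] / angle_step),
            int(f[2] / angle_step), int(f[3] / angle_step))

p1, n1 = np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])
p2, n2 = np.array([0.1, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
f = point_pair_feature(p1, n1, p2, n2)
print(quantize(f))   # (2, 7, 7, 7): d = 0.1, all three angles 90 degrees
```

Off-line, every model pair's quantized feature becomes a dictionary key; on-line, each scene pair looks up matching model pairs in constant time and casts a vote for a candidate pose.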

View-based methods
An alternative way to determine the 3D pose of an object is to estimate the projection of the object location in 3D space onto a 2D camera image. [70] Existing methods manage with just a single camera for the estimation of this so-called view-based 3D-to-2D mapping transformation. [71][72][73] The objective of these methods is the determination of the position and orientation of a camera, given its intrinsic parameters and a set of correspondences between 3D points and their 2D projections. In traditional Perspective-n-Point (PnP) methods, the pose is extracted from the homography matrix that exists between two projections of points on a 3D plane. Several algorithms have been proposed for the estimation of the homographic relationship between two images of a planar scene [74,75], and the corresponding pose can then be estimated by decomposing the homography matrix. [76] In that investigation, different methods based on Singular Value Decomposition (SVD) [77] are thoroughly demonstrated. Recently, many authors have pointed out that iterative methods are time-consuming for large point sets because they are computationally complex. [71] Therefore, an efficient non-iterative PnP algorithm has been proposed; it has linear complexity in n and expresses the solution as a weighted sum of eigenvectors. In addition, an optimal PnP method that applies Gauss-Newton optimization [72] has been established, and a robust solution to a subsequent PnP problem has also been introduced. [73] In this work, a fourth-order polynomial is used to describe the cost function, and a linearization technique is then applied to minimize it. The results show that RPnP is suitable for applications that need to handle both small and large sets of points, such as those used in feature point-based image matching applications.
A considerable number of studies have been undertaken on template matching schemes related to the pose estimation techniques discussed above. Normalized edge-based templates relying on the distance transform, which is used for matching and pose refinement, together with real-time planar object detection and 3D pose estimation algorithms, are presented in some detail by Holzer et al. [78]. The matching result is useful for initial rough pose estimation, and the Lucas-Kanade algorithm [59,60] can be applied to refine the pose. Edge-based template matching and pose estimation based on CAD models are detailed by Ulrich et al. [80] and Wiedemann et al. [81]. In this method, 2D matching based on edges [79] is used to find the 2D pose. Models of different views are created within predefined bounds for 3D pose estimation. For matching, the edge amplitude is extracted from a model and projected into the image plane using the specific camera pose. To speed up the process and improve efficiency, the model is created on multiple levels of an image pyramid; computation is done separately for each level of the pyramid, and the matching results (the corresponding edges) are used to estimate the refined 3D pose. This pose is obtained with robust iterative non-linear optimization using the Levenberg-Marquardt algorithm. [80] Other methods for 3D image matching also use 2D edge-based template matching to find the 2D pose [44,81] and the ICP method to estimate the 3D pose. These pose estimation methods perform well, are efficient, have a high degree of discrimination, and are suitable for time-critical applications. The authors further proposed a method based on similarity template matching for matching instances of a 3D object with scaling. [82]
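The homography estimation step underlying the planar pose methods above can be sketched with the standard Direct Linear Transform (DLT). This is a minimal NumPy version without the usual point normalization that real implementations add for numerical conditioning; the example transform and point set are assumptions:

```python
import numpy as np

def estimate_homography(src, dst):
    """Direct Linear Transform: recover the 3x3 homography H with
    dst ~ H @ src (in homogeneous coordinates) from >= 4 point
    correspondences, via the SVD null space of the constraint matrix."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.array(rows, dtype=float))
    H = Vt[-1].reshape(3, 3)     # null-space vector = stacked H entries
    return H / H[2, 2]           # fix the overall scale

def apply_h(H, pts):
    """Apply a homography to 2D points (homogeneous normalization)."""
    ph = np.c_[pts, np.ones(len(pts))] @ H.T
    return ph[:, :2] / ph[:, 2:3]

# A known homography, recovered from four noise-free correspondences.
H_true = np.array([[1.2,  0.1,   5.0],
                   [-0.05, 0.9,  -3.0],
                   [0.001, 0.002, 1.0]])
src = np.array([[0.0, 0.0], [100.0, 0.0], [100.0, 80.0], [0.0, 80.0]])
dst = apply_h(H_true, src)
H = estimate_homography(src, dst)
print(np.allclose(H, H_true))   # True
```

Decomposing such an H (given the camera intrinsics) yields the rotation and translation of the plane, which is the pose extraction step the PnP literature above builds on.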

Global feature-based methods
In contrast to local feature matching, the global feature-based methods process the object as a whole. They define a set of global features that effectively and concisely describe the entire 3D model. This means they are not suitable for matching a partially hidden object in a cluttered scene. [59] Rusu et al. [67] describe an extension descriptor for FPFH, the Viewpoint Feature Histogram (VFH), in which additional statistics are computed between the viewpoint direction and the estimated normal at each point; the results show a high degree of discrimination. To keep the advantages of VFH while enhancing robustness against occlusion, the clustered VFH (CVFH) was proposed [68], in which a partial view of an object is described as a cluster of CVFH descriptors. The results show better performance than those obtained with VFH. A novel global registration method, an oriented bounding box (OBB) regional area-based descriptor, was proposed for accurate matching. [69] The descriptor consists of two parts, the OBB information and the surface areas of the subdivided boxes, and it is used to estimate the similarity of two scanned object data sets using the NCC method. The results show better accuracy than the local feature-based methods.
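The viewpoint statistic that VFH adds on top of the FPFH component, a histogram of angles between the viewpoint direction and the surface normals, might be sketched as follows. This is a simplified, whole-cloud illustration, not the full VFH descriptor; the bin count and the synthetic normals are assumptions:

```python
import numpy as np

def viewpoint_component(normals, centroid, viewpoint, bins=8):
    """Histogram of angles between the viewpoint-to-centroid direction
    and each surface normal. One histogram describes the whole cloud,
    which is what makes descriptors of this kind 'global'."""
    v = viewpoint - centroid
    v = v / np.linalg.norm(v)
    n = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    cos = np.clip(n @ v, -1.0, 1.0)
    hist, _ = np.histogram(np.arccos(cos), bins=bins, range=(0.0, np.pi))
    return hist / hist.sum()   # normalize so clouds of any size compare

# Normals of a flat patch facing the camera: every angle lands in bin 0.
normals = np.tile(np.array([0.0, 0.0, 1.0]), (50, 1))
h = viewpoint_component(normals, centroid=np.zeros(3),
                        viewpoint=np.array([0.0, 0.0, 1.0]))
print(h)   # [1. 0. 0. 0. 0. 0. 0. 0.]
```

Two such normalized histograms can be compared directly (e.g. with NCC or a chi-square distance) to rank candidate model views, which is how global descriptors are typically used for recognition.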

Disclosure statement
No potential conflict of interest was reported by the authors.

Summary
It is clear that the main difficulties encountered in 3D image matching are occlusion, clutter, and noise. To overcome these challenges, numerous approaches have been proposed and discussed. Table 2 gives a summary of 3D image matching. Point cloud-based methods are the most popular for 3D image matching, but their computation is still highly complex, and very efficient algorithms are needed to make it practical. Nowadays, view-based methods are the most frequently used in real-world applications because of their cost and efficiency, despite their limited viewpoint accuracy.

Conclusions
This paper has presented a survey of state-of-the-art 2D and 3D image matching methods. The intensive research that has been done to improve performance in real-world applications has been discussed and analyzed. Although there have been many improvements in approaches to image matching, open problems remain, such as how to improve the efficiency and accuracy of 3D image matching when the object appears with viewpoint variation, partial occlusion, or a cluttered background.