Feature extraction and matching combined with depth information in visual simultaneous localization and mapping

Estimating the camera trajectory is crucial to the performance of visual simultaneous localization and mapping. However, visual simultaneous localization and mapping systems based on RGB images are generally not robust in complex situations such as low texture or large illumination variations. To address this problem, additional environmental information is introduced through depth images, and a feature extraction and matching algorithm that incorporates depth information is proposed. In this article, the intrinsic mechanism by which depth images support the extraction and matching of feature points is first discussed. Depth information and appearance information are then considered jointly to extract and describe feature points. Finally, the matching of feature points is transformed into a regression and classification problem, for which a matching model is built in a data-driven way. Experimental results show that our algorithm achieves better distribution uniformity and matching accuracy and can effectively improve the trajectory accuracy and reduce the drift of the simultaneous localization and mapping system.


Introduction
Visual simultaneous localization and mapping (SLAM) has developed greatly over the past decades. Klein et al. 1 proposed the parallel tracking and mapping (PTAM) algorithm in 2007, and since then many visual SLAM systems have been proposed. These systems can be divided into direct methods and feature-based methods according to how they use image information. The former match pixel intensities directly and estimate the pose from the gradients and directions of local pixels, such as large-scale direct monocular SLAM (LSD-SLAM) 2 and direct sparse odometry (DSO). 3 The latter extract feature points and build descriptors from the 2D image, and then compute and optimize the camera pose by matching the feature points, such as real-time single camera SLAM (MonoSLAM), 4 oriented FAST (Features from Accelerated Segment Test 5 ) and rotated BRIEF (Binary Robust Independent Elementary Features 6 ) SLAM (ORB-SLAM), 7 and 3D mapping with an RGB-D camera (RGBD-SLAM). 8 Feature-based methods are highly robust to complex scenes and are currently a relatively mature solution; 9 representative features include the scale-invariant feature transform (SIFT), 10 speeded-up robust features (SURF), 11 and oriented FAST and rotated BRIEF (ORB). 12 These features are generally computed on gray images and perform stably when the environmental contrast is obvious. 13 However, feature points cannot be extracted in the low-texture and blurred regions that often appear in practical applications, where the extracted points tend to cluster and distribute unevenly. Moreover, the descriptors of feature points are indistinguishable when object structures are similar, which easily leads to mismatches. 9

It is therefore important to develop more robust feature point extraction and matching algorithms for visual SLAM, and researchers have approached this issue from different viewpoints. Points and line segments are combined to increase the mutual constraints between features in Zhou et al., 14 yielding a more accurate camera pose estimate. To obtain more environmental information, a feature extraction method based on color images is applied to visual SLAM in Wang et al. 15 In Kim et al., 16 a matching risk is assigned to each feature point, and only feature points with lower matching risk are used in triangulation to avoid the mismatching caused by excessive environmental changes. However, most of these methods consider only appearance information and pay no attention to depth information, which is insensitive to environmental changes.
Depth estimation is usually fundamental in a SLAM system. 17 Many monocular and stereo SLAM systems exploit camera parallax across multiple images to estimate depth, for example by triangulation. Since Microsoft introduced the Kinect camera, RGB-D cameras have gradually been applied to SLAM systems. Besides the appearance of the scene, a depth camera also provides depth information containing rich environmental cues, which simplifies 3D reconstruction. The use of an RGB-D camera for three-dimensional reconstruction of indoor environments was first proposed in Henry et al., 18 which extracts SIFT features from color images and looks up the corresponding depth values in depth images. Newcombe et al. 19 obtained the camera pose by minimizing a per-pixel distance measure over the depth images captured by a Kinect. Kerl et al. 20 simply combined the intensity error and the depth error of each pixel into a single error function to obtain the optimal camera pose.
Although depth is widely used in visual SLAM, it is usually employed only to extend 2D points to 3D points, and its role is not considered as comprehensively as that of appearance information; it is used even less in feature-based methods. When appearance information cannot be effectively distinguished or is insufficient for feature extraction and matching, adding depth information may offer another solution. Depth information and appearance information are combined to construct descriptors in Liu et al., 21 but the Hamming distances of the two types of descriptors are simply added linearly during matching, and depth information is not considered when extracting feature points. Building on a thorough study of the intrinsic mechanism of depth information, this article applies depth information to the entire process of feature point extraction and matching. The high speed of the ORB algorithm has made it widely used in visual SLAM systems, so this article adds depth information on the basis of ORB and proposes a new feature extraction and matching algorithm. The main contributions of this article can be summarized as follows:

1. Depth information and appearance information are considered jointly during the extraction and description of feature points, which solves the problem that feature points cannot be extracted in blurry and low-texture regions of RGB images and yields a more uniform distribution.

2. An ORB feature matching model is established that transforms the matching of feature points into a regression and classification problem, improving the accuracy of feature point matching.
Our algorithm is integrated into the visual odometry of ORB-SLAM2, and its performance is tested on the Mikolajczyk data set 22 and the RGB-D data set. 23 The results show that our algorithm can effectively improve the performance of ORB-SLAM2. 24

ORB algorithm
The ORB 12 algorithm consists of two parts: a feature point extraction algorithm (oFAST, oriented features from accelerated segment test 5 ) and a feature descriptor algorithm (rBRIEF, rotated binary robust independent elementary features 6 ).
oFAST: FAST keypoint orientation. The ORB algorithm uses FAST-9 (circular radius of 9) to detect keypoints because of its good performance and orders the keypoints by their Harris 25 corner response. To compensate for FAST's lack of scale invariance, the ORB algorithm builds a scale pyramid and extracts FAST features (filtered by the Harris measure) at each pyramid level. When N keypoints are required, the ORB algorithm first sets a lower threshold to obtain more than N keypoints and then keeps the top N according to the Harris corner ranking.
Because FAST keypoints are not directional, the ORB algorithm adds an efficiently computed orientation, using the intensity centroid 26 to achieve rotation invariance. In a local image patch around a feature point, the moments of the patch are defined as

$$m_{uv} = \sum_{x, y} x^u y^v I(x, y) \tag{1}$$

where $x, y$ are the coordinates of a pixel, the values of $u, v$ are 0 or 1, and $I(x, y)$ is the grayscale value of the image at $(x, y)$.

With these moments the centroid can be defined as

$$C = \left( \frac{m_{10}}{m_{00}}, \frac{m_{01}}{m_{00}} \right) \tag{2}$$

The direction vector $\overrightarrow{OC}$ is obtained by connecting the geometric center $O$ of the image patch with its centroid $C$, so the orientation of the patch is

$$\theta = \operatorname{atan2}(m_{01}, m_{10}) \tag{3}$$

rBRIEF: rotation-aware BRIEF. The BRIEF descriptor is built from binary tests on a smoothed image patch $P$:

$$\tau(P; x, y) = \begin{cases} 1, & R(x) < R(y) \\ 0, & R(x) \ge R(y) \end{cases} \tag{4}$$

where $R(x)$ is the gray value of the image patch at point $x$. A binary description vector can be generated by selecting $m$ pairs of test points and is defined as

$$f_m(P) = \sum_{i=1}^{m} 2^{i-1} \tau(P; x_i, y_i) \tag{5}$$

The matching performance of BRIEF falls off sharply as the degree of in-plane rotation increases. rBRIEF therefore collects the $m$ test-point pairs of the $m$-dimensional binary sequence into a $2 \times m$ matrix

$$S = \begin{pmatrix} x_1 & \cdots & x_m \\ y_1 & \cdots & y_m \end{pmatrix} \tag{6}$$

Rotating $S$ toward the main direction $\theta$ with the rotation matrix $R_\theta$ gives the description matrix

$$S_\theta = R_\theta S \tag{7}$$

The rotated BRIEF descriptor can then be expressed as

$$g_m(P, \theta) = f_m(P) \mid (x_i, y_i) \in S_\theta \tag{8}$$
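As a concrete illustration, the following Python sketch computes the intensity-centroid orientation of formulas (1)-(3) and rotates a set of BRIEF test points as in formula (7). It is a minimal sketch: `patch` is assumed to be a square grayscale patch centred on the keypoint, and the function names are ours, not those of the ORB implementation.

import numpy as np

def patch_orientation(patch):
    """Intensity-centroid orientation theta = atan2(m01, m10), formulas (1)-(3)."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    xs -= (w - 1) / 2.0           # place the geometric centre O at the origin
    ys -= (h - 1) / 2.0
    m10 = np.sum(xs * patch)      # moment with u = 1, v = 0
    m01 = np.sum(ys * patch)      # moment with u = 0, v = 1
    return np.arctan2(m01, m10)

def rotate_test_points(S, theta):
    """Rotate the 2 x m test-point matrix S by theta, giving S_theta (formula (7))."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return R @ S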

Qtree_ORB algorithm
In visual SLAM, to address the uneven distribution of feature points in the standard ORB algorithm, a quadtree is used to homogenize the feature points, together with an adaptive threshold, in Mur-Artal et al., 7 which we call the Qtree_ORB algorithm. The specific steps are as follows (a sketch of steps 2-3 is given after this list):

1. Construct an eight-layer image pyramid.
2. Divide each pyramid level into grids.
3. Extract FAST-9 corners in each grid cell with the initial threshold (iniTHFAST). If no feature points are found in a cell, lower the threshold and extract FAST-9 corners with the minimum threshold (minTHFAST).
4. Select the required number of FAST-9 corners uniformly with a quadtree; within each node, the feature point with the highest Harris response is retained as the final feature point.
5. Calculate the intensity centroid.
6. Calculate the rBRIEF descriptor of each feature point.
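The sketch below runs OpenCV's FAST detector per grid cell with the two-threshold fallback of steps 2-3. The cell size and threshold values are illustrative, and the quadtree selection of step 4 is omitted.

import cv2

def grid_fast(gray, cell=30, iniTHFAST=20, minTHFAST=7):
    """Extract FAST corners per grid cell, retrying with the minimum threshold."""
    det_hi = cv2.FastFeatureDetector_create(threshold=iniTHFAST)
    det_lo = cv2.FastFeatureDetector_create(threshold=minTHFAST)
    keypoints = []
    h, w = gray.shape
    for y0 in range(0, h, cell):
        for x0 in range(0, w, cell):
            roi = gray[y0:y0 + cell, x0:x0 + cell]
            kps = det_hi.detect(roi, None) or det_lo.detect(roi, None)
            for kp in kps:        # shift cell coordinates back to the image
                kp.pt = (kp.pt[0] + x0, kp.pt[1] + y0)
            keypoints.extend(kps)
    return keypoints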

Depth information prediction or completion
Depth information is not as easy to obtain as appearance information: monocular and stereo SLAM systems must use camera parallax across multiple images to estimate depth, and the data collected by depth cameras often contain errors and missing values. As deep neural networks have progressed, with better backbone networks, huge RGB-D data sets, richer training labels, and more sophisticated loss functions, the data-driven approach provides another option for obtaining depth information. In Eigen and Fergus, 27 a supervised deep neural network is designed to estimate depth directly from a single image. In Xie et al., 28 a deep neural network estimates the disparity between views, which is then applied to 2D-to-3D video conversion. In addition, Zoran et al., 29 Chen et al., 30 and others have proposed various depth estimation networks that significantly improve depth perception from a single RGB image. Meanwhile, a variety of depth completion algorithms have been developed to complete the incomplete depth maps collected by depth cameras, such as the joint bilateral filter method, 31 the median filter method, 32 joint rendering of color and depth, 33 and machine learning methods. 34 These methods can effectively complete incomplete depth images.
Sparse or incomplete depth images cannot be used to extract feature points, so it is necessary to predict the depth of RGB images, or to complete the incomplete depth images, before feature extraction and matching. In this article, Aleotti's method 35 is used to predict depth from an RGB image, and Alexandru's method 36 is used to fill in incomplete depth images. In Figure 1, (a) is an RGB image and (c) is the depth image predicted from it; (b) is an incomplete depth image collected by a depth camera, and (d) is the completed depth image obtained by feeding (b) into the depth completion model.
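The completion model of ref. 36 is a learned network; as a simple stand-in for experimentation, missing depth pixels (value 0) can be filled with classical inpainting, as in the hedged sketch below. This is explicitly not the method of refs 35 and 36.

import cv2
import numpy as np

def complete_depth(depth):
    """Fill zero-valued holes in a depth image; a rough substitute for ref. 36."""
    hole_mask = (depth == 0).astype(np.uint8)
    scale = 255.0 / max(float(depth.max()), 1.0)
    d8 = cv2.convertScaleAbs(depth, alpha=scale)        # inpaint needs 8-bit input
    filled = cv2.inpaint(d8, hole_mask, 5, cv2.INPAINT_TELEA)
    return filled.astype(np.float64) / scale            # back to original units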

Feature extraction with depth information
The feature point extraction and matching algorithm is divided into three parts, as shown in Figure 2. The "Depth information prediction or completion" section introduced the method of acquiring depth information. This section introduces feature extraction combined with depth information, and the corresponding feature matching algorithm is detailed in the "Feature matching with depth information" section.

Depth information preprocessing
The relationship between the value measured by the depth camera and the true depth is not linear (pinhole camera model). 37 The mapping function can be obtained through the camera model 38

$$f(d) = \frac{Z_r}{1 + \dfrac{Z_r}{fb}\, d} \tag{9}$$

where $f(d)$ represents the actual depth from the camera to the measured object, $Z_r$ is the depth from the camera to the reference plane, $b$ is the length of the baseline, $f$ is the focal length of the camera, and $d$ is the disparity measured by the camera. The pixel value range of an RGB image is 0-255 ($0$ to $2^8 - 1$), but the depth recovered with formula (9) may exceed 255, so it must be scaled into the range 0-255. The measurement range of ordinary commercial depth cameras is limited, and their accuracy degrades beyond a certain distance, so only values within a fixed range need to be converted. This article sets the maximum distance to 4 m; that is, a depth of 4 m maps to a pixel value of 255, and pixels deeper than 4 m are filled with the surrounding pixel values.
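A minimal sketch of this preprocessing, assuming the true depth f(d) is already available in metres; the median-filter fill-in for pixels beyond 4 m is our simple approximation of "filled with the surrounding pixel values".

import cv2
import numpy as np

MAX_DEPTH_M = 4.0

def depth_to_uint8(depth_m):
    """Scale depth in [0, 4] m to pixel values 0-255; fill farther pixels."""
    valid = depth_m <= MAX_DEPTH_M
    out = np.where(valid, depth_m / MAX_DEPTH_M * 255.0, 0.0).astype(np.uint8)
    blurred = cv2.medianBlur(out, 5)        # local values around the far pixels
    out[~valid] = blurred[~valid]
    return out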
The ORB algorithm uses the Harris corner response to score each feature point: the higher the score, the more distinguishable the feature point and the better the subsequent matching. We randomly select 200 ORB feature points extracted from an RGB image, a completed depth image, and a predicted depth image, and calculate their Harris response values, as shown in Figure 3. The initial threshold (iniTHFAST) is set to 20. Most of the Harris responses of the predicted depth image and the RGB image lie in the interval 20-80, indicating relatively strong distinguishability. The Harris responses of the completed depth image lie mainly in the interval 120-160, which is even more distinguishable.
Since the Harris response distributions of the three types of image are not uniform, we define a normalization coefficient

$$\gamma = \frac{\sum_{n=1}^{N} G_n}{\sum_{n=1}^{N} h_n} \tag{10}$$

where $h_n$ is the Harris response of the $n$th feature point on the depth image and $G_n$ is the Harris response of the $n$th feature point on the gray image. The Harris response of each feature point of the depth image is multiplied by the normalization coefficient to obtain the final Harris response

$$H_n = \gamma\, h_n \tag{11}$$

The descriptor bit means of the three types of image are also calculated. Figure 4 shows the spread of bit means of the 256-bit rBRIEF pattern over 1k sample keypoints. For all three image types, most values fall between 0.05 and 0.2, with 0.05 accounting for the largest number of bits, indicating that the bit means of the feature points are very close to 0.5. A bit mean close to 0.5 gives the descriptor high variance and thus better discriminability.

Extraction
In reality, many regions have low texture or blur. If feature points are extracted only from RGB images, large areas may yield no feature points at all, leaving the extracted points overly clustered (as shown in Figure 5(a)). After adding depth information, feature points can be extracted wherever the depth changes even if the gray level changes little, making the distribution of feature points more uniform (as shown in Figure 5(b)). Moreover, the feature points with higher normalized Harris response values are retained during the quadtree selection, as in step 4 of Qtree_ORB. A minimal sketch of this combined extraction is given below.
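The sketch detects ORB keypoints on both the gray image and the preprocessed depth image, rescales the depth-side Harris responses with the normalization coefficient of formula (10), and merges the two sets before quadtree selection (not shown). This is our reading of the pipeline, not the reference implementation.

import cv2
import numpy as np

orb = cv2.ORB_create(nfeatures=1000)

def extract_combined(gray, depth8):
    kps_g = orb.detect(gray, None)
    kps_d = orb.detect(depth8, None)
    if kps_g and kps_d:
        # gamma of formula (10): make the two response distributions comparable
        gamma = (np.mean([k.response for k in kps_g]) /
                 np.mean([k.response for k in kps_d]))
        for k in kps_d:
            k.response *= gamma                 # formula (11)
    return list(kps_g) + list(kps_d)            # then homogenize with a quadtree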

Descriptor
According to formula (8), the rBRIEF descriptor calculated on the depth image can be characterized as

$$g_m(P, \beta) = f_m(P) \mid (x_i, y_i) \in S_\beta \tag{12}$$

In this article, a feature point therefore has two descriptors, calculated on the RGB image (formula (8)) and on the depth image (formula (12)), where $\theta$ is the main direction of the feature point calculated on the gray image, $\beta$ is the main direction calculated on the depth image, $f_n(P)$ and $f_m(P)$ are the feature descriptors on the grayscale image and the depth image, and $S_\theta$ and $S_\beta$ are the corresponding rotated description matrices.
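A sketch of the dual descriptors: each keypoint is described once on the gray image with its stored direction theta, and once on the depth image with a direction beta recomputed from the depth patch (using `patch_orientation` from the earlier sketch). The helper names are ours.

import cv2
import numpy as np

orb = cv2.ORB_create()

def describe_both(gray, depth8, keypoints):
    _, desc_g = orb.compute(gray, keypoints)    # uses theta stored in kp.angle
    kps_d = []
    for k in keypoints:
        x, y = int(k.pt[0]), int(k.pt[1])
        patch = depth8[max(y - 15, 0):y + 16, max(x - 15, 0):x + 16]
        beta = float(np.degrees(patch_orientation(patch)) % 360.0)
        kps_d.append(cv2.KeyPoint(k.pt[0], k.pt[1], k.size, beta))
    _, desc_d = orb.compute(depth8, kps_d)      # uses beta as the main direction
    return desc_g, desc_d

In practice the two descriptor sets should be re-associated by keypoint, since OpenCV's compute may drop points near the image border.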

Matching
The traditional feature point matching algorithm calculates the Hamming distance between feature descriptors, takes as initial matches the pairs with the smallest Hamming distance below a preset threshold, and then votes to eliminate feature points with large rotation-angle differences to obtain the final matches. In this process, the preset Hamming distance threshold strongly affects the matching result: when the environment changes greatly, a fixed threshold may eliminate correct matches, while under small environmental changes it may let through more false matches. Moreover, using the rotation angle to prune initial matches may also remove correct matches, because a pair with a large rotation-angle difference can still be correct. It is also unclear how much the Hamming distance and the rotation angle each contribute to deciding whether two feature points match. Therefore, this article transforms the matching of two feature points into a classification and regression problem and learns the contribution of each cue in a data-driven way. Since depth information is added, there are four input variables: the Hamming distance of the descriptors on the gray image ($H_g$), the Hamming distance of the descriptors on the depth image ($H_d$), the difference in descriptor rotation angle on the gray image ($A_g$), and the difference in descriptor rotation angle on the depth image ($A_d$).
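For a candidate pair, the four variables can be computed directly from the two descriptor pairs and the stored orientations; a minimal sketch (function names ours):

import cv2

def angle_diff(a, b):
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)          # wrap-around rotation difference

def pair_features(dg1, dg2, dd1, dd2, ag1, ag2, ad1, ad2):
    """[Hg, Hd, Ag, Ad] for one candidate pair of feature points."""
    Hg = cv2.norm(dg1, dg2, cv2.NORM_HAMMING)   # gray-image Hamming distance
    Hd = cv2.norm(dd1, dd2, cv2.NORM_HAMMING)   # depth-image Hamming distance
    return [Hg, Hd, angle_diff(ag1, ag2), angle_diff(ad1, ad2)]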
The random forest is a classification and regression algorithm with fast training, strong robustness, high accuracy, and simple implementation. In this article, a random forest is used to build a mathematical model that outputs the matching probability of two feature points. The random forest algorithm uses the Gini index to measure the purity of the sample data at a node

$$\operatorname{Gini}(D) = 1 - \sum_{n=1}^{N} p_n^2 \tag{13}$$

where $D$ is the data set, $N$ is the number of sample categories, and $p_n$ is the proportion of the $n$th category in $D$.
The training sample set used to build the matching model comes from the RGB-D data set and the KITTI data set and is divided into two parts. The first part consists of 100,000 point pairs marked as correct: images from the two data sets are transformed with varying degrees of rotation, viewpoint, illumination, blur, and so on, and each feature point before and after the transformation is marked as a correct pair. The other part consists of 100,000 point pairs marked as wrong: matching experiments are performed on the two data sets with the traditional algorithm, and the mismatched pairs are marked as wrong pairs.
Adjusting the number and depth of the decision trees can make the random forest model faster and more accurate. This article uses a forward search to find the optimal number of trees; the depth of the decision trees can be ignored because there are few variables. Fitting performance is evaluated by the accuracy on the test sample set, with results shown in Figure 6. Weighing the per-frame prediction time against the prediction accuracy, the final model is built from 150 decision trees. Its prediction accuracy is 96.13%, and the prediction time per frame is about 20 ms, which meets real-time requirements. Algorithm 2 shows the steps of establishing the matching model (a minimal sketch is given below).
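A hedged scikit-learn sketch of the model described by Algorithm 2: 150 trees trained on the four variables with binary labels, and a probability cutoff corresponding to point B of the ROC curve discussed below. The file names and the cutoff parameter are illustrative assumptions.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.load("pair_features.npy")   # hypothetical (N, 4) array of [Hg, Hd, Ag, Ad]
y = np.load("pair_labels.npy")     # 1 = correct pair, 0 = wrong pair

model = RandomForestClassifier(n_estimators=150, n_jobs=-1)
model.fit(X, y)

def is_match(features, cutoff):
    """Accept a pair if its predicted match probability exceeds the cutoff."""
    prob = model.predict_proba(np.asarray(features).reshape(1, -1))[0, 1]
    return prob >= cutoff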
The area under the receiver operating characteristic curve (ROC curve) in Figure 7 is 0.94, indicating that the model has high sensitivity and accuracy and can effectively judge whether two feature points match. On the ROC curve, although point A (false positive rate 0.61%, true positive rate 84.3%) has the largest sum of sensitivity and specificity, point B (false positive rate 1.63%, true positive rate 94.54%) is selected as the optimal cutoff point, because increasing the false positive rate by only about 1% raises the true positive rate by about 10%. In the feature importance histogram in Figure 7, the importance ratios of the four variables ($H_g$, $A_g$, $H_d$, $A_d$) are 0.31, 0.22, 0.26, and 0.20, respectively. Although the feature descriptors on RGB images still carry the highest importance, the overall differences are small, so it is reasonable to give the Hamming distance and the rotation angle the same priority.
The matching model is tested on the RGB-D data set; Algorithm 3 shows the process of matching two feature points. In Figure 8, (a) is the matching result of the traditional algorithm, which contains many mismatches, with an accuracy of 85.06%; (b) is the matching result of our algorithm, with an accuracy of 93.48%. This shows that our algorithm eliminates many of the incorrect matches produced by the traditional algorithm and recovers correct matches that the traditional algorithm eliminates.

Experiments
All the experiments in this article are run on a computer with an i5-10400F CPU, 8 GB of memory, a GTX960 graphics card, and Ubuntu 18.04. The data sets are the widely used RGB-D data set and the Mikolajczyk data set. Each test is run 10 times, and the results reported below are averages over those runs unless noted otherwise. Matching performance is evaluated by the correct rate

$$P = \frac{T_P}{T_P + F_P} \tag{14}$$

where $T_P$ is the number of correctly detected matches and $F_P$ is the number of detected matches whose true result is not a match.

Matching test of feature points
Monocular mode: The Mikolajczyk data set is a standard data set for image matching, commonly used to evaluate region detectors and region descriptors. It provides varying degrees of rotation, viewpoint, illumination, and blur change, together with the corresponding homography matrices, which makes testing convenient. We first use the Mikolajczyk data set to test the matching performance of our algorithm under different variations, and then use the RGB-D data set to test its matching performance in practical applications.
In the Mikolajczyk data set, each sequence changes five times with increasing severity. Figure 9 shows the matching performance on four sequences. In the tests of blur change (Bikes sequence) and illumination change (Leuven sequence), the correct rate gradually decreases as the degree of change increases, but the accuracy of our algorithm is always higher than that of the Qtree_ORB algorithm, and more correct matches are obtained. In the viewpoint-change test (Graf sequence), the Qtree_ORB algorithm quickly fails, whereas our algorithm maintains a relatively high correct rate. ORB features are sensitive to rotation, so the accuracy under rotation change (Bark sequence) fluctuates greatly; even so, adding depth information and the random forest model improves the accuracy of the traditional algorithm. The Mikolajczyk data set contains eight sequences in total, and Table 1 reports the correct rate averaged over the five degrees of change. Overall, our algorithm is more robust and accurate than the Qtree_ORB algorithm and better suited to complex, changeable real scenes.

In monocular visual SLAM, system initialization uses brute-force matching, followed by DBoW2 (bags of binary words for fast place recognition) matching. The fr2/rpy sequence of the RGB-D data set is used to test both kinds of matching: 150 pairs of adjacent frames are selected at random, with 1000 feature points extracted per frame. For brute-force matching, only the feature points at the first level of the image pyramid are matched. As shown in Figure 10, the accuracy of both the initial matching and the DBoW2 matching of our algorithm is higher than that of the traditional algorithm, indicating that our system is more robust in the same environment. With DBoW2 matching, the two algorithms obtain a similar number of matches, but our algorithm obtains more matches in the initial matching. Since system initialization requires a sufficient number of matches, our algorithm initializes more easily; on some sequences our algorithm has already initialized while the traditional algorithm is still waiting for enough correctly matched feature points, losing a great deal of information.
Integrating our algorithm into the visual odometry of ORB-SLAM2, we test the number of frames required for the two algorithms to initialize successfully in monocular mode. Table 2 shows that our algorithm always initializes faster than the traditional algorithm, saving considerable time, because it obtains more correct matches with better accuracy in the initial brute-force matching. Likewise, when the visual SLAM system loses tracking and relocalizes, our algorithm is more accurate and faster.

Depth mode: RGB-D SLAM does not require complex initial matching, so only the performance of DBoW2 matching is tested, on the fr2/360_hemisphere sequence. As shown in Figure 11, the number of matches and the correct rate of our algorithm are sometimes lower than those of the ORB-SLAM2 algorithm, but the number of correct matches is always higher.

Trajectory error
This experiment is conducted on the open-source ORB-SLAM2, with the absolute trajectory error (ATE) as the evaluation standard. Sequences from the RGB-D data set are used to evaluate the accuracy and stability of the SLAM system in dynamic scenes.
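For reference, the ATE statistics reported below can be computed as in this minimal sketch, assuming the estimated and ground-truth trajectories have already been time-associated and aligned (the TUM benchmark tools handle both steps):

import numpy as np

def ate_stats(est, gt):
    """est, gt: (N, 3) translations; returns (RMSE, SSE, STD) of the ATE."""
    err = np.linalg.norm(est - gt, axis=1)   # per-frame translation error
    sse = float(np.sum(err ** 2))
    rmse = float(np.sqrt(sse / len(err)))
    std = float(np.std(err))
    return rmse, sse, std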
As shown in Table 3, the root mean square error (RMSE), the sum of squares due to error (SSE), and the standard deviation (STD) of our algorithm on the four sequences are all smaller than those of the traditional algorithm, indicating that our trajectory is closer to the ground truth. In addition, by adding depth our algorithm effectively suppresses the maximum trajectory error, showing high accuracy and robustness, with relatively gentle fluctuations in trajectory error. Table 4 compares accuracy on a subset of sequences on which most RGB-D methods are usually evaluated. On fr2/desk, fr2/xyz, and fr2/office our algorithm brings only a slight improvement, because the ORB features of the RGB images are distinct in these sequences. But on sequences with low-texture and blurry scenes (fr1/desk, fr1/desk2, fr1/room, fr3/nst), the RMSE of our algorithm is significantly smaller. In such cases, the ORB-SLAM2 algorithm sometimes fails to extract feature points or to obtain correct matches, which leads to large drift in the pose estimation. As shown in Figure 12, on the sequence fr3/nst only a few feature points and matches are obtained from appearance information alone: 41 matches with a correct rate of 73.17%. Our algorithm obtains a sufficient number of matches: 109 matches with a correct rate of 83.49%.
To further validate the effectiveness and robustness of our algorithm in low-texture and blurred places, tracking performance is tested on the sequence fr2/pioneer_360, which contains many blurry and low-texture images. In the case shown in Figures 13 and 14, the ORB-SLAM2 algorithm obtains only a few correct matches, which results in the partial path loss depicted in Figure 15(b). Our algorithm obtains more matches by adding depth information, enough to guarantee a relatively complete path, as depicted in Figure 15(c).

Timing results
In general, if the average tracking time per frame is below the inverse of the camera frame rate for each sequence, the system is able to work in real time. 24 Table 5 shows the average tracking time per frame of our algorithm and the ORB-SLAM2 algorithm on different data sets and in different modes. Because this article uses depth information, extra time is needed to acquire and process it, since completed depth information is not always directly available. In monocular mode, the network used in this article predicts depth at 100 FPS (frames per second). In depth mode, the time to complete the depth mainly depends on how corrupted the depth image is; the smaller the incomplete part, the less time it takes. Here our algorithm consumes 1-10 ms more than the ORB-SLAM2 algorithm. Besides, processing depth information in the matching stage takes our algorithm no more than 1 ms longer than the ORB-SLAM2 algorithm. Generally speaking, our algorithm can run in real time with a 30 Hz camera. In addition, a more lightweight network can obtain depth information even faster; for example, the network proposed by Wofk et al. 39 can infer depth predictions at 178 FPS. With better hardware, the running time of our algorithm can decrease further.

Conclusion
This article proposes a feature point extraction and matching method that incorporates depth images. It consists of four steps: processing the depth image, extracting feature points, calculating feature point descriptors, and matching feature points with the learned model. Experimental results show that our algorithm achieves better uniformity and higher accuracy than traditional algorithms; especially in places with low texture and blur, it can effectively increase the stability of the visual SLAM system. Our algorithm relies on a GPU for accelerated computing, so further research could focus on reducing the difficulty of acquiring depth information and porting the method to CPUs or embedded systems. Moreover, the predicted depth image may help the triangulation of monocular SLAM, and the depth information can be exploited more deeply in the future.

Author contributions
Data processing, code writing, hardware and software preparation were handled by Yunpeng Sun. Yunpeng Sun and Xiaoli Li jointly wrote the first draft and made the second revision. All authors read and approved the final manuscript.

Code or data availability
Custom code (C, C++, Python). The data are all from public data sets.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.