Incremental Pose Map Optimization for Monocular Vision SLAM Based on Similarity Transformation

The novel contribution of this paper is to propose an incremental pose map optimization for monocular vision simultaneous localization and mapping (SLAM) based on similarity transformation, which can effectively solve the scale drift problem of monocular vision SLAM and eliminate the cumulative error through global optimization. With the method of mixed inverse depth estimation based on a probability graph, the problem of the uncertainty of depth estimation is effectively solved and the robustness of depth estimation is improved. Firstly, this paper proposes a method combining the sparse direct method based on histogram equalization with the feature point method for front-end processing, and the mixed inverse depth estimation method based on a probability graph is used to estimate the depth information. Then, a bag-of-words model based on mean-initialized K-means is proposed for closed-loop feature detection. Finally, the incremental pose map optimization method based on similarity transformation is proposed for the back end to optimize the pose and depth information of the camera. When a closed loop is detected, global optimization is carried out to effectively eliminate the cumulative error of the system. In this paper, indoor and outdoor environmental experiments are carried out using open datasets such as TUM and KITTI, which fully demonstrate the effectiveness of this method. Closed-loop detection experiments using hand-held cameras verify the importance of closed-loop detection. This method can effectively solve the scale drift problem of monocular vision SLAM and has strong robustness.


Introduction
With the continuous development of artificial intelligence technology, research on visual SLAM has also made great progress, and new achievements are constantly emerging. This technology plays an important role in visual navigation. Visual SLAM is widely used in robot autonomous navigation, augmented reality, 3D reconstruction, unmanned driving, and other fields. According to the equipment used, visual SLAM can be divided into visual SLAM based on a depth (RGB-D) camera, visual SLAM based on a binocular or structured-light camera, and visual SLAM based on a monocular camera. The difficulty of the above three methods increases in that order, and each method has its own advantages and disadvantages [1,2]. An RGB-D camera enjoys the advantage of obtaining the depth information of objects more accurately, which is conducive to the accuracy of SLAM positioning, but it is limited by a narrow field of vision, a limited measurement range, and poor anti-interference capacity, and is mostly used in indoor environments. In order to eliminate the influence of scale drift on positioning accuracy, an incremental pose map optimization method based on similarity transformation is proposed in this paper.
The system architecture of this paper is shown in Figure 1. It consists of three threads: front-end processing, closed-loop detection, and back-end processing, with different processing methods provided in each thread. In front-end processing, this paper combines the direct method with the feature point method, using the direct method within a certain threshold range and the feature point method beyond it. For the direct method, histogram-based equalization is proposed, which ensures that the assumption of gray invariance is satisfied. A mixed inverse depth estimation method based on a probability graph is used to estimate the depth information of monocular vision, which improves both accuracy and robustness. In closed-loop detection, this paper uses the appearance-based bag-of-words. In the process of dictionary generation, a K-means clustering method based on mean initialization is proposed to ensure the global optimality of the cluster centers; the data structure is analyzed with the Thiessen polygon, which is convenient for dictionary representation, and a similarity scoring function is expressed using information theory. In back-end processing, the main innovation is an incremental pose optimization method based on similarity transformation, which takes the uncertainty of scale into account to ensure the consistency of monocular vision SLAM. At the same time, incremental pose optimization facilitates the expansion of observation nodes and reduces the computational complexity. Reference frames and key frames are determined according to the number of feature points, and global optimization is carried out using the results of closed-loop detection; the detected closed-loop features can further reduce the error.
The organizational structure of this paper is as follows: Section 2 will focus on the front-end processing method based on the combination of the direct method and feature point method; Section 3 will focus on the closed-loop detection based on bag-of-words; Section 4 will mainly introduce the method of incremental pose map optimization based on similarity transformation; Section 5 will introduce indoor environment experiments, outdoor environment experiments, and closed-loop detection experiments of hand-held cameras; and Section 6 will summarize the methods provided in this paper.


Front-End Processing
In visual SLAM, the front end is called visual odometry (VO). According to the information of adjacent images, the motion of the camera is roughly estimated, and an initial value is provided for back-end processing. Common VO can be divided into two categories: (a) Direct method: The direct method estimates the motion of the camera directly from the gray information of the image, without calculating key points or descriptors, as long as there are light and shade changes in the scene. It optimizes using the gradient magnitude and direction of the local gray intensity, which removes the dependence on image features, and it remains robust when the camera image is blurred by motion. The direct method evolved from the optical flow method: the optical flow method describes motion at the pixel level, while the direct method also takes the motion model of the camera into account. The assumption of gray invariance must hold when using the direct method; that is, the grayscale value of the pixel corresponding to the same spatial point should remain unchanged in each image.
(b) Method based on feature points: The usual processing methods include the following: Firstly, distinct features, such as feature points or feature lines, are extracted from each image; secondly, invariant feature descriptors are used to match them; and then, epipolar geometry is used to restore the pose of the camera; finally, the pose is optimized by minimizing the re-projection error. The classical invariant feature extraction method is ORB feature extraction, which mainly consists of key points and descriptors. Through Features From Accelerated Segment Test (FAST) corner extraction, the obvious changes of local pixels can be obtained, and the main direction of feature points can be calculated at the same time. The Binary Robust Independent Elementary Features (BRIEF) descriptor is used to describe the image region around the extracted feature points, which is suitable for real-time image matching. The advantage of the feature point method is that it runs stably, is insensitive to light and dynamic objects, and has strong robustness. The disadvantage is that the extraction of feature points and the calculation of descriptors are time-consuming, and the processing error of images without obvious texture features is relatively large.
The feature point method is suitable for scenes in which feature points are abundant and easy to extract. Because enough feature points contain more redundant information, the calculation of the pose will be more accurate. The direct method removes the dependence on feature points and calculates the pose of the camera by using the gray changes between consecutive frames, so it can be applied to scenes with fewer feature points. In this paper, the direct method and the feature point method are combined to exploit the advantages of both and avoid their disadvantages. This not only extends the application scope and scenes of the vision front end but also improves the accuracy of front-end processing.
The results of front-end processing are taken as the initial value of the back-end optimization. The accuracy of the calculation affects the results of the back-end optimization, and the speed of the calculation affects the real-time performance of the whole SLAM system. In order to ensure both the accuracy of pose estimation and the speed of operation, a relatively simple switching method is adopted, which uses the number of feature points and the gray changes to judge whether to use the feature point method or the direct method. When the number of feature points is less than a certain threshold and the gray level changes between the two frames are not obvious, the sparse direct method based on histogram equalization is adopted; otherwise, front-end processing based on ORB features is used. According to the application scenario, the thresholds can be adjusted appropriately to obtain better optimization results.
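The switching rule above can be sketched as follows. This is an illustrative sketch, not the paper's implementation, and the threshold constants are hypothetical placeholders to be tuned per scenario:

```python
# Hypothetical thresholds; the paper leaves the exact values scenario-dependent.
MIN_FEATURES = 100         # below this, feature matching is unreliable (assumed value)
MAX_MEAN_GRAY_DIFF = 10.0  # small gray change -> brightness constancy holds (assumed value)

def choose_front_end(num_features: int, mean_gray_diff: float) -> str:
    """Return which front end to run for the current frame pair."""
    if num_features < MIN_FEATURES and mean_gray_diff < MAX_MEAN_GRAY_DIFF:
        # Few features and stable brightness: use the sparse direct method
        # (with histogram equalization as preprocessing).
        return "direct"
    # Enough features, or a large photometric change: use the ORB feature-point method.
    return "feature"
```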

Sparse Direct Method Based on Histogram Equalization
In order to satisfy the hypothesis of an invariant gray level, histogram equalization is usually used to preprocess the image to eliminate the influence of strong and weak light on the gray level characteristics. A histogram is a statistical method that uses graphics to represent the distribution of data: the abscissa usually represents the grayscale value corresponding to a pixel, and the ordinate represents the number of pixels at each grayscale value. Histogram equalization is an image enhancement method that enhances image contrast by stretching the range of image pixel values. For an image, the important feature pixels are usually concentrated in a certain gray interval; stretching that interval so that it is uniformly distributed over the whole gray range significantly enhances the contrast of the features and gives the image a strong sense of clarity. Histogram equalization compresses the gray levels occupied by few pixels and expands the gray intervals occupied by many pixels. The method has the advantages of fast operation and better reflection of image details; it effectively restrains drastic brightness changes between images, reduces the difference of grayscale values between adjacent images, and better maintains the gray consistency of images.
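As a reference, histogram equalization of an 8-bit grayscale image can be implemented in a few lines. This sketch uses the classic CDF-stretching mapping and is only illustrative of the preprocessing step, not the paper's exact code:

```python
import numpy as np

def equalize_histogram(img: np.ndarray) -> np.ndarray:
    """Histogram equalization for an 8-bit grayscale image.

    Stretches the cumulative distribution of gray levels so that frequently
    occurring levels occupy a wider range, enhancing contrast before the
    direct method is applied.
    """
    hist = np.bincount(img.ravel(), minlength=256)  # per-level pixel counts
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]                       # first non-zero CDF value
    n = img.size
    # Classic equalization mapping: rescale the CDF to [0, 255].
    lut = np.round((cdf - cdf_min) / max(n - cdf_min, 1) * 255).astype(np.uint8)
    return lut[img]
```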
In the sparse direct method, the camera pose can be estimated with fewer feature points, without calculating descriptors or matching features. The basic idea is to estimate the position of the corresponding matching points by using the current camera pose, and to optimize the camera pose by minimizing the photometric error. The error function to be optimized is:

$$ \min_{\xi} J(\xi) = \sum_{i=1}^{N} \left\| I(p_{1,i}) - I(p_{2,i}) \right\|^{2} $$

where $I(p_{1,i})$ is the grayscale value of the pixel of the i-th feature point in the first image, $I(p_{2,i})$ is the grayscale value of the corresponding pixel in the second image, and N is the number of feature points.
In the second image, the corresponding feature point, $p_{2,i}$, is obtained by the camera pose transformation. $R$ is the rotation matrix; $t$ is the translation vector; $\xi$ is the Lie algebra of $R$ and $t$; $\rho_i$ is the corresponding inverse depth; and $K$ is the camera's internal parameter matrix. The conversion formula is as follows:

$$ p_{2,i} = \rho_i K \exp(\xi^{\wedge}) P_i $$

In this paper, the perturbation model of the Lie algebra is used to solve the error. After left-multiplying the Lie algebra by a small perturbation, $\delta\xi$, we can get:

$$ e(\delta\xi \oplus \xi) = I_1(p_{1,i}) - I_2\left( \rho_2 K \exp(\delta\xi^{\wedge}) \exp(\xi^{\wedge}) P \right) $$

The coordinate of the perturbation component in the second camera coordinate system is set as $q$, namely $q = \delta\xi^{\wedge} \exp(\xi^{\wedge}) P$, and its corresponding pixel coordinate is $m$, namely $m = \rho_2 K q = \rho_2 K \delta\xi^{\wedge} \exp(\xi^{\wedge}) P$. By using the first-order Taylor formula, the error function can be transformed into:

$$ e(\delta\xi \oplus \xi) \approx e(\xi) - \frac{\partial I_2}{\partial m} \frac{\partial m}{\partial q} \frac{\partial q}{\partial \delta\xi} \, \delta\xi $$

where $\frac{\partial I_2}{\partial m}$ is the gradient of the pixels at $m$, $\frac{\partial m}{\partial q}$ is the derivative of the pixel coordinates with respect to the point in the camera coordinate system, and $\frac{\partial q}{\partial \delta\xi}$ is the derivative of the spatial point with respect to the Lie algebra. For the point $q = [X, Y, Z]^{T}$, $Z = 1/\rho$, in the camera coordinate system, the following can be obtained:

$$ \frac{\partial m}{\partial q} = \begin{bmatrix} \frac{f_{x}}{Z} & 0 & -\frac{f_{x}X}{Z^{2}} \\ 0 & \frac{f_{y}}{Z} & -\frac{f_{y}Y}{Z^{2}} \end{bmatrix}, \qquad \frac{\partial q}{\partial \delta\xi} = \begin{bmatrix} I & -q^{\wedge} \end{bmatrix} $$

Therefore, the Jacobian matrix with respect to the Lie algebra is $J = -\frac{\partial I_2}{\partial m} \frac{\partial m}{\partial \xi}$. Using the Levenberg-Marquardt (LM) algorithm to solve the error function, the optimized pose is obtained.
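The 2 × 3 block ∂m/∂q in the derivation above can be checked numerically. The following sketch (with hypothetical intrinsics fx, fy, cx, cy) gives the analytic Jacobian of the pinhole projection, which can be compared against central finite differences of the projection itself:

```python
import numpy as np

def project(q, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point q = [X, Y, Z] to pixel coordinates."""
    X, Y, Z = q
    return np.array([fx * X / Z + cx, fy * Y / Z + cy])

def dproj_dq(q, fx, fy):
    """Analytic 2x3 Jacobian dm/dq used in the direct-method chain rule.
    (Independent of cx, cy, since those are additive constants.)"""
    X, Y, Z = q
    return np.array([[fx / Z, 0.0,    -fx * X / Z ** 2],
                     [0.0,    fy / Z, -fy * Y / Z ** 2]])
```

A central-difference check of `dproj_dq` against `project` should agree to within numerical tolerance, which is a quick way to validate the sign and placement of each term.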

Front-End Processing Based on ORB Features
In front-end processing based on ORB features, the ORB features of each image are extracted first; then the Fast Library for Approximate Nearest Neighbors (FLANN) algorithm is used to match all features; and finally the position and pose are optimized by the bundle adjustment (BA) algorithm using the matched feature points.
Assuming that the coordinates of a point in three-dimensional space are $U_i = [X_i, Y_i, Z_i]^{T}$, the coordinates transformed into the camera frame are $U_i' = [X_i', Y_i', Z_i']^{T}$ with $\rho = 1/Z'$, and the projected pixel coordinates are $\mu_i = [u_i, v_i]^{T}$. The internal parameter matrix of the camera is $K$, and the Lie algebra of the camera's pose, $R$ and $t$, is recorded as $\xi$. The transformation relationship between the spatial point and the pixel coordinate system is as follows:

$$ s_i \mu_i = K \exp(\xi^{\wedge}) U_i $$

After eliminating the proportional coefficient, $s_i$, the error function can be obtained as follows:

$$ e_i = \mu_i - \frac{1}{s_i} K \exp(\xi^{\wedge}) U_i $$

The Jacobian matrix, $J_p$, of the re-projection error can be obtained by deriving with respect to the pose using the chain rule:

$$ J_p = \frac{\partial e}{\partial \delta\xi} = \frac{\partial e}{\partial U'} \frac{\partial U'}{\partial \delta\xi} $$

The first term is the derivative of the error with respect to the projected point:

$$ \frac{\partial e}{\partial U'} = -\begin{bmatrix} \frac{f_x}{Z'} & 0 & -\frac{f_x X'}{Z'^{2}} \\ 0 & \frac{f_y}{Z'} & -\frac{f_y Y'}{Z'^{2}} \end{bmatrix} $$

The second term is the derivative of the transformed point with respect to the Lie algebra:

$$ \frac{\partial U'}{\partial \delta\xi} = \begin{bmatrix} I & -U'^{\wedge} \end{bmatrix} $$

Therefore, the Jacobian matrix of the re-projection error is:

$$ J_p = -\begin{bmatrix} \frac{f_x}{Z'} & 0 & -\frac{f_x X'}{Z'^{2}} \\ 0 & \frac{f_y}{Z'} & -\frac{f_y Y'}{Z'^{2}} \end{bmatrix} \begin{bmatrix} I & -U'^{\wedge} \end{bmatrix} $$

Finally, $\min_{\xi} \sum_i \| h(x_i) - \mu_i \|^{2}$ is solved by the method of non-linear least squares, and the optimal pose is obtained.
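Chaining the two derivative terms gives the 2 × 6 Jacobian used in the BA step. A minimal sketch, assuming the error convention e = observed − projected (which produces the leading minus sign) and the perturbation ordering [translation, rotation]:

```python
import numpy as np

def skew(v):
    """Skew-symmetric (hat) matrix of a 3-vector, so that skew(v) @ w == cross(v, w)."""
    x, y, z = v
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])

def reprojection_jacobian(q, fx, fy):
    """2x6 Jacobian of the re-projection error w.r.t. a se(3) perturbation
    for a camera-frame point q = [X, Y, Z] (the U' of the derivation)."""
    X, Y, Z = q
    dm_dq = np.array([[fx / Z, 0.0,    -fx * X / Z ** 2],
                      [0.0,    fy / Z, -fy * Y / Z ** 2]])
    dq_dxi = np.hstack([np.eye(3), -skew(q)])  # [I | -q^] evaluated at zero perturbation
    return -dm_dq @ dq_dxi
```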

Mixed Inverse Depth Estimation Based on the Probability Graph
In the process of initialization, depth estimation is very important to the SLAM operation results of monocular vision. However, the accuracy and robustness of depth estimation are poor and vulnerable to external noise. In [31,32], in order to improve the accuracy and robustness, a mixed inverse depth estimation method based on a probability map is adopted. Figure 2 is the probability diagram of the Gauss-uniform mixture probability distribution model [32]. Assume that $X = \{x_1, \cdots, x_N\}$ is the inverse depth value observed by the sensor, $\rho = \{\rho_1, \cdots, \rho_N\}$ is the real inverse depth value, $\pi = \{\pi_1, \cdots, \pi_N\}$ is the proportion of good measurement data, $1 - \pi$ is the proportion of interference, and the interference signal obeys the uniform distribution $U[\rho_{min}, \rho_{max}]$, where $\rho_{min}$ and $\rho_{max}$ are the minimum and maximum inverse depth measured by the sensor, respectively. $\lambda = \{\lambda_1, \cdots, \lambda_N\}$ is the precision of the Gauss distribution, where $\lambda = 1/\tau$, in which $\tau$ is the variance of the Gauss distribution. Given the true inverse depth, $\rho$, the precision, $\lambda$, of the Gauss distribution, and the proportional coefficient, $\pi$, of the correct data, the probability distribution of the inverse depth measured by the mixed model is as follows:

$$ p(x_n \mid \rho_n, \lambda_n, \pi_n) = \pi_n N(x_n \mid \rho_n, \lambda_n^{-1}) + (1 - \pi_n)\, U(x_n \mid \rho_{min}, \rho_{max}) $$

Figure 2. Probability diagram of the Gauss-uniform mixed probability distribution model.
All potential discrete variables are recorded as $Z = \{z_{1k}, z_{2k}, \cdots, z_{nk}\}$, in which $z_{ik}$ is a binary random variable. Using the "1-of-K" expression, one element is 1 and the rest are 0. Among them, $z_{i1} = 1$ means the i-th measurement is good measurement data, and $z_{i0} = 1$ means the i-th measurement is interference data. Therefore, the distribution of the observations conditioned on the latent variables is:

$$ p(x_n \mid \rho_n, \lambda_n, \pi_n, z_{nk}) = \left( \pi_n N(x_n \mid \rho_n, \lambda_n^{-1}) \right)^{z_{n1}} \left( (1 - \pi_n)\, U(x_n \mid \rho_{min}, \rho_{max}) \right)^{z_{n0}} $$

In this paper, the conjugate prior probability distributions of the parameters $\rho$, $\lambda$, $\pi$ are introduced, in which the conjugate prior of the mixing coefficient, $\pi$, obeys a Beta distribution, and the conjugate prior distribution of the mean, $\rho$, and precision, $\lambda$, is a Gauss-Gamma distribution.
According to the Bayesian theorem, posterior ∝ likelihood × prior, the joint probability distribution of all random variables can be factorized in the form:

$$ p(X, Z, \rho, \lambda, \pi) = p(X \mid Z, \rho, \lambda, \pi)\, p(Z \mid \pi)\, p(\pi)\, p(\rho \mid \lambda)\, p(\lambda) $$

By using the method of variational inference to estimate the parameters, the recursive formula of inverse depth estimation can be obtained.
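The mixture likelihood, and the posterior probability that a measurement is a good (inlier) measurement (the expectation of the latent indicator that a variational update tracks), can be sketched as follows. This illustrates the model only and is not the paper's variational recursion:

```python
import numpy as np

def mixture_likelihood(x, rho, lam, pi, rho_min, rho_max):
    """Likelihood of an inverse-depth measurement x under the Gauss-uniform
    mixture: pi * N(x | rho, 1/lam) + (1 - pi) * U[rho_min, rho_max]."""
    gauss = np.sqrt(lam / (2 * np.pi)) * np.exp(-0.5 * lam * (x - rho) ** 2)
    uniform = 1.0 / (rho_max - rho_min)
    return pi * gauss + (1 - pi) * uniform

def inlier_responsibility(x, rho, lam, pi, rho_min, rho_max):
    """Posterior probability that measurement x is good data, i.e. E[z_1]
    for the 1-of-K latent indicator of the mixture model."""
    gauss = np.sqrt(lam / (2 * np.pi)) * np.exp(-0.5 * lam * (x - rho) ** 2)
    uniform = 1.0 / (rho_max - rho_min)
    return pi * gauss / (pi * gauss + (1 - pi) * uniform)
```

A measurement near the current depth estimate receives a responsibility close to 1, while an outlier far from it is absorbed by the uniform component, which is what makes the estimator robust to interference.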

Closed-Loop Detection
With the sequence movement of the camera, the camera pose and the depth information estimated by depth estimation will produce errors. Periodic global back-end optimization for the whole SLAM system will eliminate some errors, but there will still be some accumulated errors. In order to eliminate cumulative errors, a closed-loop detection method is usually used, which eliminates the cumulative errors in a larger time scale through similar data collected by the camera in the same place, so that the whole SLAM has a globally consistent estimation. Appearance-based bag-of-words (BoW) is a common method for the detection of a closed loop. Bag-of-words usually uses the features of an image to describe an image. A word is used to represent a feature, and all the features are composed into a dictionary. Therefore, in the matching process, only the word vectors contained in each image need to be matched. The bag-of-words improves the efficiency and accuracy of closed-loop detection.

Dictionary Generation
Dictionary generation is a clustering problem and an unsupervised machine learning method. Classification is learned automatically from the features extracted from the image, which improves the efficiency of dictionary generation. The common clustering method is the K-means method, which is simple and effective. In this paper, a K-means clustering method based on mean initialization is proposed so that the selection of cluster centers is not trapped in a local optimum, thereby ensuring its global optimality.
Assuming that D is the feature point data set and C is the cluster center set, the K-means clustering method based on mean initialization is [33]: (1) Center point initialization: The coordinates of all feature points are summed and averaged, and the feature point closest to the average is taken as the first cluster center, $c_1$, which is added to set C. If several points are equally close, the point nearest the edge is selected as the first cluster center. Then, the feature point furthest from $c_1$ is selected as the second cluster center, $c_2$, which is also added to set C. Further centers are chosen according to the following formula, in which min selects, for each feature point, the Euclidean distance to its nearest existing center, and the feature point with the largest such distance becomes the new cluster center and is added to set C:

$$ c_{new} = \arg\max_{x \in D} \; \min_{c \in C} \| x - c \| $$

(2) For each feature point, the distance to each center point is calculated, and the point is assigned to the nearest center.
(3) According to the existing classification, the average value of the feature points in each class is used to recalculate the center point, $c_j$, of each class, and then the new cumulative sum of squared distances, D, is calculated, where n is the number of all feature points:

$$ D = \sum_{i=1}^{n} \min_{j} \| x_i - c_j \|^{2} $$

(4) If the change of D is small and convergent, the algorithm has converged and the selected center points are suitable. Otherwise, return to step (2) and recompute.
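The four steps above can be sketched as follows. This is an illustrative implementation of the mean-initialized K-means, with plain Lloyd iterations for steps (2)-(4); it assumes no cluster becomes empty during the iterations:

```python
import numpy as np

def init_centers_mean(points: np.ndarray, k: int) -> np.ndarray:
    """Mean-based initialization: the first center is the point closest to the
    overall mean; each further center is the point with the largest distance
    to its nearest already-chosen center."""
    mean = points.mean(axis=0)
    centers = [points[np.argmin(np.linalg.norm(points - mean, axis=1))]]
    for _ in range(1, k):
        # Distance from every point to its nearest chosen center.
        d = np.min(np.stack([np.linalg.norm(points - c, axis=1) for c in centers]),
                   axis=0)
        centers.append(points[np.argmax(d)])
    return np.array(centers)

def kmeans(points, k, iters=50, tol=1e-6):
    """Lloyd iterations on top of the mean initialization."""
    centers = init_centers_mean(points, k)
    prev = np.inf
    for _ in range(iters):
        d = np.stack([np.linalg.norm(points - c, axis=1) for c in centers])
        labels = np.argmin(d, axis=0)                 # step (2): assign to nearest center
        centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        cost = np.sum(np.min(d, axis=0) ** 2)         # step (3): cumulative squared distance D
        if abs(prev - cost) < tol:                    # step (4): convergence test
            break
        prev = cost
    return centers, labels
```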
Through the K-means clustering algorithm, feature points can be divided into K specific classes, so as to generate K words; that is, K words represent an image. The words of all images can form a dictionary. The dictionary is sorted according to the nature of the words and a vector is generated. This is not only convenient for the generation of dictionaries but also for the word searching. According to the words provided by the dictionary, feature matching can be carried out efficiently.

Layered Thiessen Polygon Data Structure
In order to facilitate the establishment and search of dictionaries, a K-ary tree is usually used to express a dictionary, while the K-ary tree can be abstractly represented by a Thiessen polygon [34]. Figure 3 shows the generation process of the K-ary tree dictionary using the Thiessen polygon. The classification using the Thiessen polygon is mainly based on the fact that the distance from any feature point in a Thiessen polygon to the feature control point constituting the polygon is less than the distance to the feature control points of other polygons. The feature control point here is the cluster center of the K-means clustering method. This paper takes the ternary tree as an example to introduce the principle. In the figure, the green points are the control points. According to the main control points of each layer, a multi-layer Thiessen polygon can be generated; the classification at each layer uses the mean-initialized K-means clustering algorithm. In the figure, A1, A2, and A3 are the results of the first-layer classification; A11, A12, and A13 are the results of the classification of A1; and A121, A122, and A123 are the results of the classification of A12. The whole image is equivalent to the root node of the K-ary tree, the number of classifications in each layer is the branching factor of the K-ary tree, and the number of layers is the depth of the K-ary tree. According to the above method, the whole image can be divided into n regions, which are the leaf nodes of the K-ary tree. All the leaf nodes make up the n words of the whole image.
The words generated by the layered Thiessen polygon are convenient for dictionary generation and word searching, which ensures the searching efficiency of the dictionary.

Computation of the Similarity Scoring Function
Assuming that the number of occurrences of the word $W_i$ in an image is $n_i$ and the image contains $n$ words, the probability of occurrence of the word $W_i$ in the image is $P(W_i) = n_i / n$. According to information theory, each word in the same image contains a different amount of information: words with a higher occurrence frequency often contain less information, while words with a lower probability contain more. The logarithm of the inverse probability is usually used to express the amount of information contained in a word: $\log \frac{1}{P(W_i)} = -\log P(W_i) = \log \frac{n}{n_i}$. In information theory, the degree of uncertainty of a word in an image is expressed by its self-entropy as follows [35]:

$$ H(W_i) = -P(W_i) \log P(W_i) = \frac{n_i}{n} \log \frac{n}{n_i} $$

Therefore, in an image, the uncertainty vector composed of all words in the dictionary is:

$$ G_a = \left[ H(W_1), H(W_2), \cdots, H(W_m) \right]^{T} $$

where m is the number of all words contained in the dictionary. Supposing that $G_a$ represents the uncertainty vector of image a and $G_b$ that of image b, the similarity between the two images can be calculated with a similarity score function of the form:

$$ s(G_a, G_b) = 1 - \frac{1}{2} \left\| \frac{G_a}{\|G_a\|_1} - \frac{G_b}{\|G_b\|_1} \right\|_{1} $$

When the similarity score reaches a certain threshold, image a is considered similar to image b and a closed loop is constituted. With the closed-loop constraint, the cumulative error caused by the camera in the sequential motion process can be eliminated, and the positioning accuracy of visual SLAM can be greatly improved.
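A minimal sketch of the scoring pipeline, assuming natural-logarithm entropy weights and an L1-normalized score in [0, 1] (a common bag-of-words scoring form; the paper's exact function may differ):

```python
import numpy as np

def entropy_vector(word_counts: np.ndarray) -> np.ndarray:
    """Per-word self-entropy weights for one image.
    word_counts[i] is the number of occurrences of word W_i; absent words stay zero."""
    n = word_counts.sum()
    p = word_counts / n
    out = np.zeros_like(p, dtype=float)
    nz = p > 0
    out[nz] = -p[nz] * np.log(p[nz])
    return out

def similarity(ga: np.ndarray, gb: np.ndarray) -> float:
    """L1-normalized similarity score: identical vectors score 1, disjoint ones 0.
    Assumes both vectors are non-zero."""
    ga = ga / np.abs(ga).sum()
    gb = gb / np.abs(gb).sum()
    return 1.0 - 0.5 * np.abs(ga - gb).sum()
```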

Back-End Processing
The front-end processing calculates the pose of the moving object, and its error is relatively large. Usually, this calculation result provides the initial value for the back-end optimization of the pose, and key frames are selected from the image sequence for optimization. The initial value of the scale factor is provided by the inverse depth estimation method based on a probability graph. The back end of this paper optimizes with a method based on similarity transformation and considers the role of the scale factor. In short, the pipeline from front end to back end is a coarse-to-fine optimization process: the front end processes the data preliminarily, and the back end processes it more precisely, so that positioning accuracy can be continuously improved while real-time performance is ensured.
In monocular vision SLAM, one of the most important problems is scale uncertainty and scale drift. This paper presented an incremental pose map optimization method based on the similarity transformation group, which effectively solves the problems of rotation, translation, and scale drift in closed-loop detection.
Similarity transformation is a combination of the Euclidean transformation and a uniform scaling. Generally, the similarity transformation matrix is expressed as:

$$ S = \begin{bmatrix} sR & t \\ 0^{T} & 1 \end{bmatrix} \in \mathbb{R}^{4 \times 4} $$

where s is the scale factor; R is the 3 × 3 rotation matrix; and t is the 3 × 1 translation vector.
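A quick numerical check of the similarity transformation and its invariant (the ratio of two lengths, discussed below) can be sketched as:

```python
import numpy as np

def sim3_matrix(s: float, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Build the 4x4 similarity transformation [[s*R, t], [0, 1]]."""
    S = np.eye(4)
    S[:3, :3] = s * R
    S[:3, 3] = t
    return S

def transform(S: np.ndarray, p: np.ndarray) -> np.ndarray:
    """Apply a 4x4 similarity transform to a 3D point (homogeneous form)."""
    return (S @ np.append(p, 1.0))[:3]
```

Applying such a transform scales every length by s, so the ratio of any two lengths (and the angle between two lines) is preserved, which is exactly why monocular scale drift can be expressed on Sim(3).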
There are seven degrees of freedom in total. The invariants of the similarity transformation are the ratio of two lengths and the angle between two straight lines. The Lie algebra sim(3) corresponding to Sim(3) is:

$$ \zeta = \begin{bmatrix} \rho \\ \phi \\ \sigma \end{bmatrix} \in \mathbb{R}^{7} $$

where an element of the Lie algebra sim(3) is a seven-dimensional vector, ζ; ρ corresponds to the translation part; φ corresponds to the rotation part; and σ corresponds to the scale factor. The relevant formulas for the similarity transformation used in the derivations of this paper are given in Appendix A.
Figure 4 is a schematic diagram of the incremental pose map optimization method based on similarity transformation, in which each node represents a relevant random quantity. Among them, x_i represents the camera pose node; l_i is the landmark node; o_ij is the observation node of the j-th landmark by the i-th camera; f_ij is the pose error factor function, representing the error of the pose of the j-th camera relative to the i-th camera; and q_i is the re-projection error factor function, representing the re-projection error of the i-th camera. As can be seen from Figure 4, when the camera moves to a new position, x_(n+1), the pose of x_(n+1) will be optimized by the re-projection error factor function, q_(n+1).
The pose factor function, f_{n(n+1)}, relates x_{n+1} to the previous pose, x_n, and the relative pose change is determined to obtain the motion information of the camera. When the camera moves to a new position, this process is repeated. The resulting incremental motion calculation only requires computing the motion relationship between the new pose and the previous one; the global information does not need to be optimized every time, which avoids a large amount of repeated calculation, reduces the computational cost, and improves the calculation speed.
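The incremental update can be sketched with a toy one-dimensional pose chain. This is an illustrative sketch only, assuming noise-free relative measurements, and not the paper's implementation; the point is that adding a node touches only the relationship between the new pose and its predecessor:

```python
def add_pose(poses, relative_motion):
    """Incrementally extend the pose chain: the new pose x_{n+1}
    is obtained from the current pose x_n and the measured relative
    motion only; earlier poses are left untouched."""
    poses.append(poses[-1] + relative_motion)
    return poses

poses = [0.0]
for step in (1.0, 1.0, 0.5):
    add_pose(poses, step)
# Only the newest entry is computed on each call; no earlier pose
# is revisited until a global optimization is triggered.
```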
The selection of key frames and reference frames has an important impact on the results of incremental pose map optimization. In this paper, the following principles are applied to the selection of key frames: 1. at least 20 frames have passed since the last global optimization; 2. at least 100 feature points can be extracted from the image; 3. the number of matching points of the current frame has dropped by 80% compared with the reference frame. In general, a key frame that shares less than 20% of the matching points of the reference frame is taken as the new reference frame.
In order to improve the computing speed and reduce data storage, redundant key frames need to be removed. When the feature point matching degree among three consecutive key frames exceeds 90%, the middle frame is retained and the other two key frames are deleted, which reduces the number of key frames. At the same time, this reduces unnecessary closed loops and improves the efficiency of closed-loop detection.
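The key-frame selection and redundancy-removal rules above can be sketched as follows. This is an illustrative sketch; the function names and the `match_ratio` callback are assumptions, while the thresholds (20 frames, 100 features, 80% drop, 90% overlap) come from the text:

```python
def is_new_keyframe(frames_since_global_opt, num_features,
                    num_matches, ref_matches):
    """Key-frame criteria from the text:
    1. at least 20 frames since the last global optimization,
    2. at least 100 feature points in the image,
    3. matches to the reference frame dropped by 80% or more."""
    return (frames_since_global_opt >= 20
            and num_features >= 100
            and num_matches <= 0.2 * ref_matches)

def prune_redundant(keyframes, match_ratio):
    """If three consecutive key frames share more than 90% of their
    feature matches (match_ratio(a, b) > 0.9), keep only the middle
    frame and delete the two outer ones."""
    kept = list(keyframes)
    i = 0
    while i + 2 < len(kept):
        a, b, c = kept[i], kept[i + 1], kept[i + 2]
        if match_ratio(a, b) > 0.9 and match_ratio(b, c) > 0.9:
            del kept[i + 2]   # drop the last of the triple
            del kept[i]       # drop the first, retaining the middle
        else:
            i += 1
    return kept
```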
Assume that the pose error factor function obeys a zero-mean Gaussian distribution:

$$ f_{ij}(x_i, x_j) \propto \exp\!\left(-\tfrac{1}{2}\, e_{ij}^{T} \Sigma_{ij}^{-1} e_{ij}\right), \qquad e_{ij} = \ln\!\left(S_{ij}^{-1} S_i^{-1} S_j\right)^{\vee}, $$

where S_ij = S_i^{-1} S_j is the similarity transformation of pose x_j relative to pose x_i; S_i and S_j are the similarity transformations of poses x_i and x_j relative to the original pose x_0; and Σ_ij is the covariance matrix of the pose error, which is usually taken to be the identity matrix to simplify the calculation. Suppose the re-projection error factor function satisfies a Gaussian distribution of the form

$$ q_i(x_i) \propto \exp\!\left(-\frac{\left(h(x_i) - \mu_i\right)^{2}}{2\sigma_i^{2}}\right), $$

where h(x_i) is the re-projection of the i-th camera; μ_i is the corresponding pixel coordinate; h(x_i) − μ_i is the re-projection error; and σ_i^2 is the variance of the re-projection. The joint probability distribution of all variables is expressed as a product of factors:

$$ F(X) = \frac{1}{Z} \prod_{(i,j)} f_{ij}(x_i, x_j) \prod_i q_i(x_i), $$

where Z is the normalization factor ensuring that F(X) is a properly defined probability. It is difficult to compute Z exactly, and in practice its exact value is not needed. The maximum a posteriori inference takes the form X* = argmax_X F(X). Taking the logarithm and ignoring the normalization factor and other constant terms, the maximum a posteriori inference problem is transformed into a nonlinear least squares problem:

$$ X^{*} = \arg\min_{X} \sum_i \frac{\left\|h(x_i) - \mu_i\right\|^{2}}{\sigma_i^{2}} + \sum_{(i,j)} e_{ij}^{T} \Sigma_{ij}^{-1} e_{ij}. $$

This formula has an important defect: if a mismatch occurs during the calculation, its very large error will dominate the whole cost, erase the influence of the correct error terms, and produce a wrong result. In order to limit the rapid growth of the squared norm for large errors, this paper adopts the Huber kernel, replacing each squared error term with the Huber cost

$$ H_{\beta}(e) = \begin{cases} \tfrac{1}{2} e^{2}, & |e| \le \beta, \\ \beta\left(|e| - \tfrac{1}{2}\beta\right), & |e| > \beta, \end{cases} $$

where β_1 is the threshold for the re-projection error and β_2 is the threshold for the pose error.
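The Huber kernel used here can be sketched directly. This is a minimal sketch; the threshold argument `beta` plays the role of β1 or β2:

```python
def huber(e, beta):
    """Huber robust cost: quadratic for |e| <= beta, linear beyond,
    which limits the influence of gross mismatches on the least
    squares sum while leaving small errors untouched."""
    a = abs(e)
    if a <= beta:
        return 0.5 * e * e
    return beta * (a - 0.5 * beta)

# A gross outlier of magnitude 10 contributes 9.5 instead of the
# squared cost 50.0, so it no longer dominates the total error.
```

Note that the two branches meet continuously at |e| = beta, so the cost stays smooth enough for Gauss-Newton style optimization.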
The main effect of the Huber kernel function is that when the error exceeds the threshold, the cost changes from a quadratic function to a linear one, which limits the growth rate of the error and weakens the influence of mismatching on the optimization results. First, the Jacobian matrix, J_p, of the re-projection error function is obtained by the bundle adjustment (BA) method, and the re-projection function is linearized to obtain

$$ h(x_i) \approx h(x_i^{0}) + J_p\, \Delta x_i, $$

where J_p is the Jacobian matrix of the re-projection error, Δx_i = x_i − x_i^0, and x_i^0 is the initial pose.
Therefore, the error function can be expressed in this linearized form. Next, the pose error is solved. Measurements of S_ij, S_i, and S_j can be obtained from the front-end visual odometry, and the pose error function is e_ij = ln(S_ij^{-1} S_i^{-1} S_j)^∨. Derivatives are then taken with respect to ς_i and ς_j. Following the perturbation approach of Lie algebra, ς_i and ς_j are left-multiplied by the perturbations δς_i and δς_j, respectively, giving the perturbed error function. By using the adjoint property of the similarity transformation, the perturbation terms can be moved to the rightmost side of the function, so the Jacobian matrix in right-multiplied form can be derived using the Baker-Campbell-Hausdorff (BCH) formula. The error function then takes the linearized form e_ij = e_ij^0 + J_i δς_i + J_j δς_j, where e_ij^0 is the initial observation error value and J_i, J_j are the Jacobian matrices with respect to ς_i and ς_j. In the incremental pose map optimization method based on a factor graph, the usual treatment is to fix the already optimized poses and only optimize the poses of the newly added nodes. Setting δς_i = 0 in Equation (32), the error function can be simplified accordingly, and the Jacobian matrix of the pose error is J_e = J_r(e_ij)^{-1}. Therefore, the pose error can be written in this simplified form. Taking the first three rows of the Jacobian matrix of the pose error e_ij gives a 3 × 7 Jacobian matrix J_e. In order to keep the minimization equation consistent, the Jacobian matrix of the re-projection error should likewise be expanded to a 3 × 7 Jacobian matrix J_p, a one-dimensional scale variable is added to the variable ξ so that it becomes the seven-dimensional variable ς_j, and b is also extended to a seven-dimensional vector.
Therefore, the re-projection error can be written in the extended form, and the nonlinear least squares formula using the Huber kernel function is changed accordingly. In the above derivation, the problem of optimizing the pose through re-projection and similarity transformation is transformed into a least squares optimization, in which the camera pose and scale factor are optimized by the incremental pose map based on a factor graph. The selection of reference frames is significant for the optimization. In this paper, a simple method is proposed for the selection of reference frames. First, the current reference frame is matched against the feature points of its next key frame, and the number of matched feature points is counted and taken as the reference matching number, m_base, of the current reference frame. Then, the remaining sequence images are each matched against the reference frame, and the number of matches, m_other, is counted and compared with the reference matching number m_base. If m_other ≥ 0.2·m_base, the original reference frame remains unchanged. If m_other < 0.2·m_base, the reference frame needs to be changed, and the current key frame is taken as the new reference frame. These operations are carried out for all key frames of the remaining sequence images, which automatically completes the iterative update of the reference frame and improves the efficiency of the whole optimization process.
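The reference-frame update rule above can be sketched as follows. This is an illustrative sketch; `count_matches` is an assumed callback returning the number of matched feature points between two frames, and `m_base` is re-established from the first key frame that follows each newly promoted reference:

```python
def update_references(frames, count_matches, ratio=0.2):
    """Walk the key-frame sequence and promote a key frame to the
    new reference whenever its match count against the current
    reference falls below ratio * m_base (0.2 * m_base in the paper)."""
    ref = frames[0]
    refs = [ref]
    m_base = None
    for f in frames[1:]:
        m = count_matches(ref, f)
        if m_base is None:
            m_base = m               # reference matching number
        elif m < ratio * m_base:
            ref, m_base = f, None    # promote current key frame
            refs.append(ref)
    return refs
```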
In the whole process of visual SLAM, global optimization plays a very important role, because it reduces the pose error and scale drift error of the whole system, improves the measurement accuracy, and ensures more accurate navigation and positioning. The conditions for global optimization are as follows: 1. when a closed loop appears in the system, the closed-loop constraint is used to optimize the whole system; and 2. when 50 frames have passed since the last reference frame and no closed loop has appeared among them, all the information is used to carry out a global optimization.
Generally, the global optimization method uses the equation information of all the incremental pose map optimizations and solves for all pose Lie algebra elements, ς_i, as unknown variables, so as to obtain the optimal solution for all poses. This result can then be used as the initial value of the next optimization.

Experiments
The data sets used in the indoor and outdoor environment experiments in this paper mainly come from open data sets on the Internet, including the TUM data set, KITTI data set, and EuRoC MAV data set [26,27]. The method proposed in this paper was compared with the widely used ORB-SLAM2 method. In the closed-loop detection experiment, in order to ensure the sufficiency of the experiment, a USB camera was used to collect image data, and the real-time images were used for the experiment.

Indoor Environmental Experiments
The indoor environment experiment mainly adopts the TUM data set and EuRoC MAV data set. The experimental results are shown in the following figure.
In Figure 5, the black line is the track curve drawn according to the true values provided by the data set, the blue line is the track drawn according to the actually measured key frames, and the red line represents the distance error between each measured value and the corresponding true value. From the error values obtained in Figure 5, the experimental error comparison results in Table 1 can be obtained. The results in Table 1 show that the incremental pose SLAM presented in this paper outperforms ORB-SLAM2: the root mean square error, mean error, median error, standard deviation of error, minimum error, and maximum error are all reduced; therefore, the positioning accuracy is further improved. The main reason is that the incremental pose optimization based on similarity transformation not only considers the optimization of the camera's own pose but also takes into account the optimization of the relative pose based on the scale factor, which greatly improves the estimation accuracy of the camera pose. In addition, the inverse depth estimation based on a probability graph improves both the accuracy and the robustness of depth estimation.
The average tracking time in the experiment is relatively stable, as it benefits from the stable light intensity of the indoor environment. Additionally, the combination of the sparse direct method and the feature point method combines the advantages of the direct method and the feature point method, which improves the robustness of the tracking. Therefore, in the course of the experiment, there was no tracking failure.
The experimental comparison shows that the accuracy obtained on the TUM data set is better than that on the EuRoC data set. The main reason is that most of the TUM indoor data sets were obtained by hand-held cameras moving at slow speeds, so the images are relatively stable and easy to use for tracking and depth estimation, and the error is relatively small. In contrast, the EuRoC data set was collected by an unmanned aerial vehicle (UAV) with greater volatility, which is susceptible to interference; the image quality is relatively poor, which affects the depth estimation, so the error is relatively large.

Outdoor Environmental Experiments
The KITTI data set was used for the outdoor experiment. It collects real image data of urban, rural, and highway environments with a car-mounted camera, providing a large amount of outdoor environment material with which the performance of computer vision in a vehicle environment can be evaluated. In this paper, the sequence 08 data set was used to carry out the experiments. In the experiment, we used the GPS data as the benchmark to compare the experimental results of traditional ORB-SLAM2 and incremental pose SLAM. In Figure 6, the dotted line is the track drawn according to the GPS data, and the solid line is the track drawn according to the key frames in the experiment. The comparative results of the outdoor experiments are shown in Table 2.

The experimental results show that the trajectory of incremental pose SLAM is relatively smooth. Compared with GPS data, the experimental error is relatively small, and the scale drift phenomenon is relatively small. However, the relative error of traditional ORB-SLAM2 is relatively large, and the scale drift phenomenon is relatively serious. In this paper, an incremental pose map optimization SLAM based on similarity transformation is proposed to mitigate the influence of scale drift on the experimental results.
Generally speaking, the trajectories obtained in the experiments show that on urban and rural roads, the trajectories are relatively accurate and the positioning accuracy is relatively high, because the speed is relatively slow and a large number of feature points can be extracted, which is convenient for camera positioning, depth estimation, and tracking. On the freeway, because of the high speed, image processing may run slower than the vehicle moves, resulting in tracking failure; moreover, fewer feature points can be extracted on the freeway and their stability is relatively poor, which reduces the positioning accuracy.
An important characteristic of the outdoor environment relative to the indoor environment is that the outdoor environment is susceptible to the influence of light. In the same environment, the feature extraction will change obviously in the light direction and backlight, which increases the difficulty of feature matching. By comparing the results, we can see that the sparse direct method based on histogram equalization proposed in this paper can effectively reduce the influence of light on feature points, ensure the assumption of photometric invariance, and improve the robustness of feature extraction. Therefore, the tracking effect of incremental pose SLAM is more stable and can be applied to various experimental environments. When there are loops in the motion of the vehicle camera, this method can extract features effectively, and use the bag-of-words model to complete image matching, so as to optimize the whole situation to eliminate errors and improve the accuracy of global positioning.
When there are fewer feature points or moving too fast, the camera easily loses its target. The repositioning method based on the bag-of-words adopted in this paper can effectively complete the camera relocation. The combination of the sparse direct method and the feature point method can effectively process images with relatively few feature points and accurately estimate the camera pose.

Closed-Loop Detection Experiment of A Hand-Held Camera
In this paper, a 720P USB camera was used for the experiments. An IMU was fixed on the camera, the hand-held camera was moved together with the computer to capture real-time images, and the IMU data were used as the benchmark for comparison; the results of the closed-loop detection were judged according to this comparison. The experimental results are shown in the following figure.
It can be seen from Figure 7 that SLAM without closed-loop detection is prone to accumulating error: especially when the hand-held camera turns, the error increases noticeably, and it grows larger and larger as the route and running time increase. When the camera passes through the same point again, the cumulative error approaches 4 m. When closed-loop detection is added to monocular vision SLAM, however, the scale drift and cumulative error of the monocular camera are effectively reduced and the positioning accuracy is improved; the RMSE is reduced to 0.62 m. In this paper, closed-loop detection based on a bag-of-words model was mainly used. The bag-of-words model uses a dictionary to represent the features of each image, and image matching can be done by searching the dictionary of each image, thus eliminating the cumulative errors produced by the camera in motion. In the dictionary generation process, the K-means clustering method based on mean initialization can effectively prevent cluster centers from falling into a local optimum and improve the accuracy of image feature classification.
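The paper does not give the initialization formula, so the following is an assumed reading of "mean initialization": initial cluster centers are chosen deterministically from the data by sorting points by their distance to the global mean and sampling evenly, which avoids the random seeds that let standard K-means fall into different local optima on different runs. A minimal sketch under that assumption:

```python
import numpy as np

def mean_init_centers(X, k):
    """Deterministic center initialization (an ASSUMED reading of the
    paper's 'mean initialization'): sort points by distance to the
    global mean and take k evenly spaced samples along that ordering."""
    d = np.linalg.norm(X - X.mean(axis=0), axis=1)
    order = np.argsort(d, kind="stable")
    idx = order[np.linspace(0, len(X) - 1, k).astype(int)]
    return X[idx].astype(float).copy()

def kmeans(X, k, iters=20):
    """Plain Lloyd iteration on top of the deterministic seeding,
    as would be used to build a bag-of-words vocabulary tree level."""
    C = mean_init_centers(X, k)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                C[j] = X[labels == j].mean(axis=0)
    return C, labels

# Two well-separated feature clusters are recovered reproducibly.
X = np.array([[0.0, 0.0]] * 10 + [[10.0, 10.0]] * 10)
centers, labels = kmeans(X, 2)
```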
In order to test the performance of the closed-loop detection based on mean initialization proposed in this paper, many experiments were carried out, and recall-precision data were obtained. We experimented with the TUM, KITTI, and EuRoC data sets, comparing closed-loop detection based on ORB features, closed-loop detection based on the line band descriptor (LBD), and the closed-loop detection method proposed in this paper [36,37,38]. The experimental results are shown in Figure 8.
It can be seen from the experimental results that, although the experiments were carried out separately on the TUM, KITTI, and EuRoC data sets, the final conclusions are the same: the measurement accuracy ranks as Proposed > LBD > ORB. The method proposed in this paper improves the accuracy of closed-loop detection, which can effectively reduce the experimental error and improve the positioning accuracy of visual SLAM.
Through the experiments with hand-held cameras, we can also see that one of the major drawbacks of monocular SLAM is the uncertainty of scale. Without closed-loop detection, after the camera moves in a circle, there is a large offset between the starting point and the end point. In order to improve the accuracy of closed-loop detection, the inverse depth estimation method based on a probability graph is used in front-end processing to improve the accuracy of the initial depth estimation, and in back-end optimization, the incremental pose optimization method based on similarity transformation is used to improve the accuracy of camera pose estimation. When a closed loop is detected during camera motion, global optimization is carried out at the same time to further improve the accuracy of the scale factor and camera pose, so that the cumulative error of the whole system can be eliminated over a larger time scale.
The effectiveness and robustness of the proposed method were fully verified by indoor and outdoor environmental experiments using open data sets. Through the closed-loop detection experiment of the hand-held camera, the necessity of closed-loop detection in monocular vision SLAM was fully verified. Closed-loop detection reduced the overall error and ensured the accuracy of the system.

Conclusions
Visual odometry (VO) provides a preliminary estimation of the pose and depth of the camera in monocular SLAM. The quality of these initial values has a significant impact on the whole system, and with good initial values, satisfactory results can be produced through optimization. In this paper, the direct method and the feature point method were combined for pose estimation, with the choice made according to feature sparsity: when the number of feature points was below a certain threshold, the direct method was used; otherwise, the feature point method was adopted. In order to preserve the photometric invariance assumption of the direct method, histogram equalization was proposed to process the images and reduce the influence of photometric changes on the results. A mixed inverse depth estimation method based on a probability graph was used to estimate the camera depth, which improved both the estimation accuracy and the robustness.
In the closed-loop detection, the bag-of-words was used to process. A dictionary was built by extracting the features of each image, and feature matching was carried out by looking up the dictionary. Bag-of-words improved the efficiency and accuracy of image matching. In the dictionary generation stage, a K-means clustering method based on mean initialization was proposed, which ensured the global optimum of the cluster center. In the whole process of SLAM, when the closed loop was detected by using the bag-of-words, the information of all nodes stored in the back-end optimization would be used for global optimization, so as to eliminate the cumulative error of the system, improve the accuracy of positioning and mapping, and ensure the consistency of the whole process.
In the back-end optimization, an incremental pose map optimization method based on similarity transformation was proposed. During the optimization process, the scale factor was taken into account by using similarity transformation to reduce the influence of scale drift on the results. The application of incremental pose map optimization contributed to the expansion of nodes. When a new node was added, only the pose between the new node and the associated node was optimized, which greatly reduced the computational complexity and improved the computational speed.
In conclusion, the incremental pose map optimization SLAM for monocular vision based on similarity transformation can reduce the scale drift of a monocular camera, reduce the computational complexity of optimization, and improve the positioning accuracy, with strong robustness.

Appendix A

By exponential mapping, the correspondence between the Lie group Sim(3) of similarity transformations and the Lie algebra sim(3) can be obtained. The appendix gives the logarithmic mapping of Sim(3), the adjoint matrix of Sim(3), the linear expression of the product of two exponential mappings of Lie algebra elements by the Baker-Campbell-Hausdorff (BCH) formula (where J_l is the left-multiplied approximation and J_r is the right-multiplied approximation), the Lie brackets of sim(3), and the Jacobian matrix in the Lie brackets.