Research on a Binocular Visual-Inertial Algorithm Based on Point and Line Features

To address the poor performance of the binocular visual-inertial odometry system VINS-Fusion in scenes with low texture and large luminosity changes, a binocular visual-inertial odometry system, PLVINS-Fusion, is designed that integrates line feature measurement information. Line features are easier to extract in low-texture scenes and offer more robust tracking in scenes with large luminosity changes. Point and line features are extracted simultaneously in the front-end visual module, line feature residuals are added to the back-end nonlinear optimization, and a bag-of-words model combining point and line features is constructed in the loop detection module. On this basis, a real-time photometric calibration algorithm is adopted to jointly optimize the exposure time, the camera response function and the vignetting factor, and the stability of KLT optical flow tracking is improved by correcting image brightness. Experiments on benchmark datasets show that the optimized algorithm is more robust, effectively improves positioning accuracy, and meets real-time performance requirements.


Introduction
In recent years, mobile robots have been widely used in manufacturing, the Internet, artificial intelligence and other fields. SLAM, as a key technology enabling robots to localize autonomously in unknown environments, has become a research hotspot in the field of mobile robots [1]. Visual-inertial odometry has become an important direction of SLAM research because it effectively fuses the data provided by cameras and IMU sensors and improves the robustness of the system. Binocular visual-inertial odometry obtains depth information through binocular stereo matching and can estimate depth both at rest and in motion, giving it higher robustness than monocular cameras; unlike RGB-D cameras [2], it is not affected by infrared structured light.
Visual-inertial odometry methods can be divided into two categories according to how state estimation is performed: filter-based and optimization-based [3]. Filter-based methods usually use Kalman filters to fuse camera and IMU measurements, constructing an error function from the motion and observation equations and minimizing it for pose estimation [4]. Representative filter-based algorithms include MSCKF-VIO (Multi-State Constraint Kalman Filter-Visual-Inertial Odometry) [5], ROVIO (Robust Visual-Inertial Odometry) [6] and OpenVINS [7]. Optimization-based methods usually perform graph optimization in the back end by jointly optimizing the visual and IMU measurement residuals, which achieves higher accuracy at the cost of more computational resources [8].
Optimization-based methods can be further divided into point-based and line-based methods depending on the type of visual features extracted in the front end. Point features have been adopted by many visual-inertial odometry methods due to their simple description and ease of extraction, such as ORBSLAM3 (Oriented FAST and Rotated BRIEF SLAM3) [9] and VINS-Fusion (Visual-Inertial Navigation Systems-Fusion). These classical algorithms all extract point features in the front end. However, point features are difficult to detect in low-texture scenes or scenes with luminosity changes, leading to low localization accuracy and false loop closure detections [10][11]. PL-SLAM [12], based on ORBSLAM [13], is the first binocular visual SLAM algorithm to fuse point and line features. PL-VIO [14], based on VINS-Mono [15], is the first monocular visual-inertial odometry algorithm that tightly couples point-line features with inertial measurement information for higher accuracy. PL-VINS adds point-feature-based loop closure to PL-VIO. The above work is insufficiently robust to illumination changes and has difficulty detecting loop closures when point features are scarce. Studies have shown that online photometric calibration of images can effectively improve an algorithm's robustness to scenes with changing light. In 2018, Jakob Engel proposed a real-time photometric calibration method and applied it to the visual odometry method DSO [16], effectively improving the accuracy of the algorithm.
VINS-Fusion is a classic binocular visual-inertial odometry method. However, in scenes with low texture or luminosity changes, VINS-Fusion cannot use visual information for pose estimation because point features are difficult to detect, and false loop closure detections may even occur, resulting in poor performance. In this paper, we introduce line features into the VINS-Fusion framework, incorporate an online photometric calibration algorithm, and propose PLVINS-Fusion, a binocular visual-inertial odometry method combining point and line features. The main contributions of this paper are as follows: (1) On the basis of VINS-Fusion, line features are fused to construct a binocular visual-inertial odometry framework based on point-line features.
(2) An online photometric calibration algorithm is used to improve the stability of KLT optical flow tracking and line feature tracking by correcting the image brightness.
(3) A bag-of-words model combining point and line features is constructed, taking the descriptors of both point and line features into account to improve loop closure detection.
(4) Experiments on benchmark datasets show that the improved algorithm achieves higher positioning accuracy and robustness while maintaining real-time performance.

Algorithm overview
The binocular visual-inertial odometry method PLVINS-Fusion proposed in this paper adds line feature information to VINS-Fusion, so that line features contribute to the front-end measurement preprocessing, back-end sliding window optimization and loop closure modules, and adds a photometric calibration module to the front end to improve positioning accuracy in environments with photometric changes. The system flow is shown in Figure 1, with the improved parts marked in blue.
(1) Front-end measurement preprocessing: after successful left-right matching, stereo matching is used to obtain depth information. During optimization, the poses of subsequent cameras are constrained by pre-integrating the IMU measurements and aligning them with the image frames. Once the visual and IMU information are aligned, initialization is considered successful and the nonlinear optimization thread is triggered.
(2) Back-end sliding window optimization: the optimal states (poses, velocities, spatial points and lines, accelerometer and gyroscope biases) are found within a fixed-size sliding window by jointly minimizing the residual function with the LM (Levenberg-Marquardt) algorithm. The window size is maintained by marginalizing the oldest frame of the sliding window, which keeps the algorithm sustainable over time.
(3) Loop closure: after sliding window optimization, the current frame is determined to be a keyframe if the parallax between it and the last keyframe exceeds a threshold. Once the current frame is confirmed as a keyframe, the loop closure detection thread is activated to match the descriptors of point and line features against the dictionary. When a loop closure is detected, the system treats the loop closure candidate frame as the correct loop frame and performs relocalization, aligning the current sliding window maintained by the binocular visual-inertial odometry with past poses, thereby reducing the accumulated error.

Image photometric calibration
When there are strong exposure changes between images, the KLT optical flow tracking algorithm, which is based on the assumption of constant gray level, will fail. This is because the camera's automatic exposure changes the overall image luminosity. To solve this problem, a photometric calibration algorithm is used to perform photometric compensation on the image, thereby improving the stability of KLT optical flow tracking. The radiance of a scene illuminated by a light source is denoted by L.
The camera sensor receives this radiance and forms an image according to the irradiance I(x) received at position x. Since the lens barrel shades incoming light, the scene radiance falls off toward the image boundaries, producing a vignetting effect, so the irradiance can be obtained by attenuating the scene radiance L by the vignetting factor V(x):

I(x) = V(x)L

Each frame captured by an auto-exposure camera has a corresponding exposure time t, and the accumulated irradiance is obtained by integrating the irradiance over the exposure time t.
The camera response function f maps the accumulated irradiance to a pixel value, so the output intensity of the image can be expressed as

O(x) = f(tV(x)L)

When the same scene point p ∈ P is observed in multiple frames i, the photometric error can be expressed as

E = Σ_{p∈P} Σ_{i} ‖ O_i(x_p) − f(t_i V(x_p) L_p) ‖

To accurately calculate the camera response function and vignetting factor, multiple frames are required for calibration. Since the exposure time can be computed in real time while the response function and vignetting factor need multi-frame calibration, the exposure time can be decoupled from the other parameters, and the photometric error of a single image can be expressed as

E_t = Σ_{p} ‖ O(x_p) − f(tV(x_p)L_p) ‖

The flow of real-time photometric calibration is shown in Figure 2. Each time a new frame arrives, the exposure time is optimized given the current estimates of the vignetting factor and response function; after multiple frames have been accumulated, the vignetting factor and response function are updated by nonlinear optimization. To increase the efficiency of the system, only key frames are calibrated in real time.
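Once the response function, vignetting factor and exposure time are estimated, each incoming image can be corrected before feature tracking. The sketch below illustrates the correction step under the model O(x) = f(tV(x)L); the discretized inverse-response lookup table and the constant vignette map are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def correct_photometry(img, inv_response, vignette, exposure_time):
    """Photometrically correct an 8-bit image so corrected values
    approximate scene radiance: L = f^{-1}(O(x)) / (t * V(x)).

    img           : (H, W) uint8 image with pixel values O(x)
    inv_response  : length-256 lookup table for the inverse response f^{-1}
    vignette      : (H, W) attenuation map V(x) in (0, 1]
    exposure_time : estimated exposure time t for this frame
    """
    irradiance = inv_response[img]                       # undo the response f
    radiance = irradiance / (vignette * exposure_time)   # undo vignetting and exposure
    return radiance

# Toy example: linear response, uniform vignette of 0.8, t = 0.5
h, w = 4, 4
inv_response = np.arange(256, dtype=np.float64)
vignette = np.full((h, w), 0.8)
img = np.full((h, w), 100, dtype=np.uint8)
L = correct_photometry(img, inv_response, vignette, 0.5)
# each corrected value is 100 / (0.8 * 0.5) = 250
```

Feeding the corrected images to KLT tracking restores the constant-gray-level assumption across frames with different exposures.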

Spatial line description
For convenience of presentation, define the world coordinate system w, the IMU coordinate system b and the camera coordinate system c. A straight line in space L_w can be described in Plücker coordinates as

L_w = (n^T, d^T)^T ∈ R^6,  n = [p_1]× p_2,  d = p_2 − p_1

where p_1 and p_2 are two points on the line, n is the normal vector of the plane determined by the line and the origin, d is the direction vector of the line, and [·]× denotes the skew-symmetric matrix of a vector.
The Plücker coordinates are over-parametrized; therefore, an orthonormal representation with four degrees of freedom is used to represent the line:

U = [ n/‖n‖, d/‖d‖, (n×d)/‖n×d‖ ]

W = [cos φ  −sin φ; sin φ  cos φ] = (1/√(‖n‖² + ‖d‖²)) [‖n‖  −‖d‖; ‖d‖  ‖n‖]  (10)

The relations above give the transformation from the Plücker coordinates of a straight line to the orthonormal representation (U, W). Here θ denotes the three-dimensional rotation vector corresponding to U, φ denotes a one-dimensional rotation angle, and the four-degree-of-freedom orthonormal parameterization can be expressed as O = (θ^T, φ)^T. (14) Transforming L_c to the pixel plane yields the projection line l.
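The conversion in Eq. (10) can be sketched numerically as follows. This is an illustrative implementation of the standard Plücker-to-orthonormal mapping, not the paper's code; the example line through (1, 0, 0) with direction (0, 0, 1) is hypothetical.

```python
import numpy as np

def plucker_to_orthonormal(n, d):
    """Convert Plücker coordinates (n, d) of a spatial line to the
    four-DoF orthonormal representation (U, W) of Eq. (10).

    n : normal of the plane through the line and the origin
    d : direction vector of the line
    Returns U in SO(3) and W in SO(2).
    """
    n, d = np.asarray(n, float), np.asarray(d, float)
    nn, nd = np.linalg.norm(n), np.linalg.norm(d)
    cross = np.cross(n, d)
    U = np.column_stack([n / nn, d / nd, cross / np.linalg.norm(cross)])
    s = np.hypot(nn, nd)          # sqrt(||n||^2 + ||d||^2)
    W = np.array([[nn / s, -nd / s],
                  [nd / s,  nn / s]])
    return U, W

# Example: a line through (1, 0, 0) with direction (0, 0, 1)
p1, direction = np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])
n = np.cross(p1, p1 + direction)   # n = [p1]x p2 with p2 = p1 + d
U, W = plucker_to_orthonormal(n, direction)
# U and W come out as rotation matrices, confirming the orthonormality
```

During optimization, only the four parameters (θ, φ) are updated, which avoids the gauge freedom of the over-parametrized six-dimensional Plücker form.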

Line residual model and sliding window optimization
The subscripts denote, respectively, the sequence numbers of key frames, spatial points and spatial lines in the sliding window.
ρ denotes the robust kernel function used to suppress the influence of outliers.
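As a sketch of how the line residuals enter the back end, the joint sliding-window objective can be written in the standard VINS form; the symbol names below follow the usual convention and are not taken verbatim from the paper:

```latex
\min_{\mathcal{X}} \Big\{
\left\| r_{p} - H_{p}\,\mathcal{X} \right\|^{2}
+ \sum_{k \in \mathcal{B}} \left\| r_{\mathcal{B}}\big(\hat{z}^{b_k}_{b_{k+1}}, \mathcal{X}\big) \right\|^{2}_{P^{b_k}_{b_{k+1}}}
+ \sum_{(i,j) \in \mathcal{P}} \rho\Big( \left\| r_{\mathcal{P}}\big(\hat{z}^{c_i}_{p_j}, \mathcal{X}\big) \right\|^{2}_{P^{c_i}_{p_j}} \Big)
+ \sum_{(i,l) \in \mathcal{L}} \rho\Big( \left\| r_{\mathcal{L}}\big(\hat{z}^{c_i}_{L_l}, \mathcal{X}\big) \right\|^{2}_{P^{c_i}_{L_l}} \Big)
\Big\}
```

The four terms are, in order, the marginalization prior, the IMU pre-integration residuals, the point reprojection residuals, and the line reprojection residuals added by this paper; ρ is the robust kernel applied to the visual terms.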

Loop closure detection
VINS-Fusion uses a DBoW2 [17] bag-of-words model based on point features for feature retrieval in loop closure detection. On this basis, a dictionary based on line features is added, which consists of LBD descriptors, with the vocabulary tree obtained by k-means++ clustering. The flow of loop closure detection is shown in Figure 4. The system compares the parallax between each input frame and the previous keyframe to decide whether to search for a loop closure candidate. Whenever a candidate is found, similarity scores are obtained by matching the BRIEF and LBD descriptors of the frame against the bag of words, and the total similarity score s_t is computed by combining the two. We select the New College dataset for a comparative test to evaluate the performance of the loop detection algorithm, and plot the precision-recall curve shown in Figure 5. Incorrect loop closure detection can cause pose estimation to fail, so the algorithm is mainly concerned with loop closure recall at 100% precision. As can be seen in Figure 5, the loop closure recall with combined point and line features improves by 10% over that using only point features, while maintaining 100% precision.
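The fusion of the two similarity scores can be sketched as a weighted sum. The weight alpha and the acceptance threshold below are illustrative assumptions; the paper does not state the exact values here.

```python
def fuse_similarity(s_point, s_line, alpha=0.6):
    """Fuse point-feature (BRIEF/DBoW2) and line-feature (LBD) bag-of-words
    scores into a total similarity s_t. alpha is a hypothetical weight."""
    return alpha * s_point + (1.0 - alpha) * s_line

def is_loop_candidate(s_point, s_line, threshold=0.3):
    # Accept a loop candidate only when the fused score clears a threshold
    # (the threshold value here is also hypothetical).
    return fuse_similarity(s_point, s_line) >= threshold

# A frame with strong line matches can still pass when point matches are weak,
# which is the motivation for combining the two dictionaries:
print(is_loop_candidate(s_point=0.10, s_line=0.80))   # prints True
```

This is why recall improves in low-texture scenes: frames whose point-only score falls below threshold can still be recalled through their line descriptors.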

Experimental analysis
This section conducts experiments on the proposed method and presents the experimental results of PLVINS-Fusion. The performance of PLVINS-Fusion in terms of photometric calibration and localization accuracy is evaluated on the publicly available benchmark datasets TUM Mono and EuRoC [18], respectively, and the results are compared and analyzed. The experiments were implemented on ROS under Ubuntu 18.04 and run on an Intel Core i7-10710U CPU @ 1.10 GHz.

Photometric calibration experiments
We perform real-time photometric calibration on sequence 48 of the TUM Mono dataset and compare the obtained photometric parameters with the ground truth provided by the dataset. The photometric parameter calibration results are shown in Figure 6. As shown in Figure 6, the estimated camera response function, vignetting factor and exposure time are very close to the ground truth, which demonstrates the effectiveness of the photometric calibration algorithm.

EuRoC dataset experiments
The performance of line feature tracking was evaluated by counting the number of successfully tracked line features within a sliding window over a certain number of frames; a line tracked in at least five key frames is considered successfully tracked. Several sequences from the EuRoC dataset were selected for the line feature tracking experiment, and a comparison of the number of tracked line features is shown in Table 1.

To compare trajectories before and after the improvement of the visual-inertial odometry method, evo was used as the evaluation tool, with the root mean square error (RMSE) of the trajectory as the evaluation criterion. Table 2 compares the accuracy of the algorithms on multiple sequences of the EuRoC dataset, where w/o loop means loop closure detection is turned off, w/ loop means it is turned on, and the smaller errors are shown in bold. Analysis of Table 2 shows that, compared with VINS-Fusion, the algorithm incorporating line feature information reduces the trajectory error by an average of 22% with loop closure turned off, performing slightly worse than the original algorithm only on the MH_05 and V1_02 sequences; with loop closure turned on, the trajectory accuracy is better than the original algorithm on all sequences, with an average reduction of 40% in trajectory error. PL-VIO does not include loop closure detection and is therefore compared only in the no-loop case. Table 2 also shows that the trajectory accuracy of PLVINS-Fusion is higher than that of PL-VIO on all sequences, effectively improving the accuracy of the overall trajectory. The reason is that these sequences contain many low-texture and luminosity-change scenes, in which the improved algorithm can still detect loop closures effectively, further reducing the trajectory error.
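For reference, the trajectory metric reported above can be sketched as follows. This reproduces only the RMSE step on already-aligned positions; evo itself additionally performs SE(3)/Sim(3) alignment first, and the toy trajectories are illustrative.

```python
import numpy as np

def ate_rmse(est, gt):
    """Absolute trajectory error (RMSE) between an estimated trajectory and
    ground truth, both (N, 3) arrays of already-aligned positions."""
    err = np.linalg.norm(est - gt, axis=1)   # per-pose position error
    return np.sqrt(np.mean(err ** 2))        # root mean square error

gt  = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
est = np.array([[0.0, 0.1, 0.0], [1.0, -0.1, 0.0], [2.0, 0.1, 0.0]])
rmse = ate_rmse(est, gt)   # every pose is off by 0.1 m, so RMSE = 0.1
```

The percentage improvements quoted in Table 2 are ratios of such RMSE values between the compared algorithms on the same sequence.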
Figure 7(a) compares the output trajectories of the three algorithms with the ground-truth trajectory on the MH_04 sequence. Figure 7(b) compares the absolute position error over time for the three algorithms, and Figure 7(c) compares the error statistics for the entire run. Figure 7 shows that on the MH_04 sequence the improved algorithm produces smaller trajectory drift than the original algorithm, stays closer to the ground-truth trajectory, and has a smaller overall absolute position error distribution. The absolute position error is significantly smaller than that of VINS-Fusion and PL-VIO between timestamps 180 and 210, and both the root mean square error and the maximum error are smaller than those of PL-VIO and VINS-Fusion, effectively improving overall algorithm performance.

Analysis of algorithm time consumption
Whether an algorithm can run in real time is an important measure of its practicality. Table 3 shows the average computation time per frame of the different algorithms on the EuRoC dataset. The six main modules compared are photometric calibration, point feature extraction and tracking, line feature extraction, line feature tracking, sliding window optimization and loop closure detection. As shown in Table 3, the improved algorithm adds a photometric calibration module and line feature extraction and tracking modules to the front end, and line feature residuals and a point-line loop closure module to the back end, which increases the overall time consumption compared with the point-feature-only algorithm. Compared with the PL-VIO algorithm, which cannot run in real time, the proposed algorithm significantly reduces the time required for line feature extraction thanks to the improved LSD line feature extraction algorithm, and the improved line feature tracking method effectively improves the overall tracking performance.

Conclusion
Aiming at the problems of VINS-Fusion in low-texture scenes and scenes with large luminosity changes, such as decreased positioning accuracy and false loop closure detection, this paper proposes PLVINS-Fusion, a binocular visual-inertial odometry scheme that integrates point-line feature measurement information. The robustness of the algorithm to scenes with luminosity changes is improved by performing real-time photometric calibration of the images, and a loop detection method that matches point and line features simultaneously is proposed to improve the recall rate of loop detection. The comparative evaluation results on the TUM Mono dataset and the EuRoC dataset show that PLVINS-Fusion meets the real-time requirements of the system, achieves higher localization accuracy and better robustness than VINS-Fusion, and outperforms the PL-VIO algorithm of the same type in overall performance.