Robust Stereo Visual-Inertial Odometry Using Nonlinear Optimization

The fusion of visual and inertial odometry has matured greatly due to the complementarity of the two sensors. However, the use of high-quality sensors and powerful processors in some applications is difficult due to size and cost limitations, and there are also many challenges in terms of robustness of the algorithm and computational efficiency. In this work, we present VIO-Stereo, a stereo visual-inertial odometry (VIO), which jointly combines the measurements of the stereo cameras and an inexpensive inertial measurement unit (IMU). We use nonlinear optimization to integrate visual measurements with IMU readings in VIO tightly. To decrease the cost of computation, we use the FAST feature detector to improve its efficiency and track features by the KLT sparse optical flow algorithm. We also incorporate accelerometer bias into the measurement model and optimize it together with other variables. Additionally, we perform circular matching between the previous and current stereo image pairs in order to remove outliers in the stereo matching and feature tracking steps, thus reducing the mismatch of feature points and improving the robustness and accuracy of the system. Finally, this work contributes to the experimental comparison of monocular visual-inertial odometry and stereo visual-inertial odometry by evaluating our method using the public EuRoC dataset. Experimental results demonstrate that our method exhibits competitive performance with the most advanced techniques.


Introduction
In recent years, with the advancement of sparse nonlinear optimization theory, camera technology, and computing performance, Visual Simultaneous Localization And Mapping [VSLAM] technology has achieved tremendous development [1,2]. Visual SLAM approaches have been widely used and researched because of their simple equipment and remarkable effects. Real-time and robust state estimation plays a significant role in robotics. Accurate state estimation is of crucial importance in a variety of intelligent applications, such as robot autonomy, Augmented Reality [AR] and Virtual Reality (VR).
The vision-based state estimation method is able to estimate the 6-Degrees-Of-Freedom (6-DOF) state of sensors simultaneously and reconstruct a three-dimensional (3D) map of the surrounding environment. Many works based on nonlinear optimization have been reported, including SVO [3], LSD-SLAM [4], DSO [5], ORB-SLAM [6,7]. The ORB-SLAM system supports monocular cameras, stereo cameras, and depth cameras. The system is also designed based on the idea of PTAM [8], which is divided into three threads: tracking, mapping and loop closing. However, due to the existence of observation noise, the pose estimation of the feature points in space contains uncertainty, leading to localization errors in Visual Odometry. The position change between the current frame and the previous one can be calculated and continuously accumulated by Visual Odometry in order to realize VIO algorithms have been explored in other areas, such as initialization [26], online calibration [27], optimization [28,29], and camera type [30]. These solutions are either based on optimization methods or filtering methods. For the optimization-based methods, they require a powerful CPU for real-time use and cannot run on low-cost devices such as embedded devices. For the filtering-based methods, the localization accuracy may not be enough.
In this paper, we summarize our contributions as follows: • The FAST feature detector was employed for its efficiency, and features were tracked by the KLT sparse optical flow algorithm to decrease the cost of computation. • Stereo matching was performed, which can remove outliers in the stereo matching and feature tracking steps leading to reduction of the mismatch of feature points and improvement of the robustness and accuracy of the system. • A tightly coupled nonlinear optimization method was employed to combine pre-integrated IMU measurements with visual measurements of stereo cameras, and thus a highly accurate and robust visual-inertial odometry was achieved, which can run in real-time on devices such as drones.

•
The proposed method was extensively evaluated and validated in comparison to state-of-the-art open source VIO methods (including OKVIS [14], VINS-MONO [19] and S-MSCKF [16]) by using the EuRoC dataset.

Visual-Inertial Odometry Overview
The schematic depiction of the Visual-Inertial optimization framework is shown in Figure 1. The input of the system is the image acquired by the stereo cameras and the acceleration and angular velocity measured by the IMU. The output is a 6-degree-of-freedom pose with real scale, i.e., the trajectory of the inertial camera (robot) motion. The system starts with Visual-Inertial Preliminaries (Section 3), where features are extracted and tracked, and pre-integrate IMU measurements between two consecutive frames. Then the initialization preprocessing is introduced (Section 4). The inertial pose and visual pose are combined by visual inertia joint initialization to obtain the initial estimated value of the system, including pose, velocity, gyroscope bias, gravity vector. These values will iteratively and sequentially be updated by the tightly coupled visual-inertial odometry (Section 5). Finally, the 6-DOF pose can be obtained. optimization methods or filtering methods. For the optimization-based methods, they require a powerful CPU for real-time use and cannot run on low-cost devices such as embedded devices. For the filtering-based methods, the localization accuracy may not be enough.
In this paper, we summarize our contributions as follows:


The FAST feature detector was employed for its efficiency, and features were tracked by the KLT sparse optical flow algorithm to decrease the cost of computation.


Stereo matching was performed, which can remove outliers in the stereo matching and feature tracking steps leading to reduction of the mismatch of feature points and improvement of the robustness and accuracy of the system.  A tightly coupled nonlinear optimization method was employed to combine pre-integrated IMU measurements with visual measurements of stereo cameras, and thus a highly accurate and robust visual-inertial odometry was achieved, which can run in real-time on devices such as drones.


The proposed method was extensively evaluated and validated in comparison to state-of-theart open source VIO methods (including OKVIS [14], VINS-MONO [19] and S-MSCKF [16]) by using the EuRoC dataset.

Visual-Inertial Odometry Overview
The schematic depiction of the Visual-Inertial optimization framework is shown in Figure 1. The input of the system is the image acquired by the stereo cameras and the acceleration and angular velocity measured by the IMU. The output is a 6-degree-of-freedom pose with real scale, i.e., the trajectory of the inertial camera (robot) motion. The system starts with Visual-Inertial Preliminaries (Section 3), where features are extracted and tracked, and pre-integrate IMU measurements between two consecutive frames. Then the initialization preprocessing is introduced (Section 4). The inertial pose and visual pose are combined by visual inertia joint initialization to obtain the initial estimated value of the system, including pose, velocity, gyroscope bias, gravity vector. These values will iteratively and sequentially be updated by the tightly coupled visual-inertial odometry (Section 5). Finally, the 6-DOF pose can be obtained. In this work, we consider   C  to be the camera frame,   B  to be the body frame, and we regard the IMU frame as the body frame. The matrix R represents rotation. WB R represents the rotation from the body frame to the world frame. WB P is translation from the body frame to the world frame. In this work, we consider (·) C to be the camera frame, (·) B to be the body frame, and we regard the IMU frame as the body frame. The matrix R represents rotation. R WB represents the rotation from the body frame to the world frame. P WB is translation from the body frame to the world frame.

Visual-Inertial Preliminaries
This section gives the preliminary steps for stereo visual measurements and the inertial model. For visual measurements, features are tracked in real-time using the KLT optical flow algorithm. For IMU measurements, two consecutive frames are pre-integrated.

Visual Processing Frontend
In the Visual-Inertial odometry, the camera needs to track the pixel points of the captured image frame, and the pixel points need to be the feature points extracted from the previous and subsequent frames. However, there is only a small part of the overlap between successive image frames in many cases, and it would be inappropriate to match all the pixels, due to the massive demand for computing resources. Additionally, because the feature points are matched through the calculation process of the camera pose, it is unnecessary to extract many feature points. Therefore, in visual odometry, only the feature points from part of the image-i.e., the points with the most noticeable features in the image (corner points, points with sharp changes in brightness or at the edge of the contour)-are extracted and subsequently tracked.
In this paper, the feature points are extracted by the FAST [31] feature detector, and the feature tracking method is the KLT [32] optical flow algorithm. Through experiments, it is found that the KLT optical flow method takes up less CPU resources than the descriptor-based method, making the optical flow method more advantageous in robot applications. The KLT optical flow method was also used for stereo feature matching to save computing resources. Optical flow tracking is able to obtain the feature point coordinates of cam1 (the right camera of the stereo), which correspond to stereo cam0. Then, based on the principle of epipolar geometry, the following constraint can be defined as where X 1 and X 2 are the normalized plane coordinates of the feature points, E = t ∧ R is the essential matrix, and t ∧ is the antisymmetric matrix corresponding to the t vector. t and R are the translation vector and rotation matrix between the stereo cameras, respectively, which can be obtained from the initial calibration.
During the measurement, X T 2 EX 1 are not strictly equal to zero, due to the presence of noise. There is an error in this process, and if the error is greater than a certain threshold, it is marked as an outlier. X 1 , X 2 are obtained by LK optical flow tracking according to the original image, and the accuracy of optical flow tracking is verified by the constraint (relative poses R and t) between stereo cameras. Therefore, the effect of removing outliers can be achieved. Meanwhile, several grids were added to the image, which was assigned with feature points. The number of feature points attached to each grid was checked to ensure that it did not exceed the maximum value of grid feature points.
In the visual processing frontend, outlier rejection was performed using two-point RANSAC similar to (16) in temporal tracking. Assuming that the normalized coordinates of the corresponding points of the previous and current frames are X(x, y, 1), X 2 (x 2 , y 2 , 1), the following constraints can be established based on epipolar geometry: where R is obtained from the average angular velocity of the IMU. When X 1 = RX, the coordinate system is unified into one coordinate system.
Equation (3) can be expanded as According to the RANSAC principle, two matching points of the previous frame and the current frame are arbitrarily selected, and the outliers with large errors are removed.
The following three formulas are available: The values among A x , A y , A z can be calculated, whereby if A i is the minimum value, the corresponding t i = 1 and the other two can be solved by matrix inversion. Then, the obtained t can be brought into Equation (4), and the value calculated by the coordinates of each matching point are regarded as the error. If the error satisfies certain conditions, the matching point is set as inliers. Finally, an inlier set will be obtained, and all inliers in the set will be substituted into Equation (5) to recalculate the new model by the least squares method. The model will be used to compute the final sum of error. Over multiple iterations, the inliers set with the smallest sum of errors is selected, thereby achieving the effect of removing the outliers. Additionally, performing circular matching [33] further removes the outlier matching step generated in the feature tracking and stereo matching between the previous and current stereo image pairs. Through the above steps, the mismatch of feature points is reduced, thereby improving the robustness and accuracy of the system.

IMU Measurement Model and Pre-Integration
As shown in Figure 2, the measurement rate of IMU is much faster than that of the visual camera. To simultaneously optimize the constraints of vision and IMU in a single framework, it is necessary to integrate the measurements of many IMUs between two adjacent visual keyframes into one constraint. Therefore, the theory of the pre-integration formula based on SO3 manifolds (18) was used in this paper.
According to the RANSAC principle, two matching points of the previous frame and the current frame are arbitrarily selected, and the outliers with large errors are removed.
y y y y    . The following three formulas are available: The values among , , A is the minimum value, the corresponding i t = 1 and the other two can be solved by matrix inversion. Then, the obtained t can be brought into Equation (4), and the value calculated by the coordinates of each matching point are regarded as the error. If the error satisfies certain conditions, the matching point is set as inliers. Finally, an inlier set will be obtained, and all inliers in the set will be substituted into Equation (5) to recalculate the new model by the least squares method. The model will be used to compute the final sum of error. Over multiple iterations, the inliers set with the smallest sum of errors is selected, thereby achieving the effect of removing the outliers. Additionally, performing circular matching [33] further removes the outlier matching step generated in the feature tracking and stereo matching between the previous and current stereo image pairs. Through the above steps, the mismatch of feature points is reduced, thereby improving the robustness and accuracy of the system.

IMU Measurement Model and Pre-Integration
As shown in Figure 2, the measurement rate of IMU is much faster than that of the visual camera. To simultaneously optimize the constraints of vision and IMU in a single framework, it is necessary to integrate the measurements of many IMUs between two adjacent visual keyframes into one constraint. Therefore, the theory of the pre-integration formula based on SO3 manifolds (18) was used in this paper.  The IMU measures acceleration and angular velocity in three directions within the inertial system by a three-axis accelerometer and a three-axis gyroscope. The IMU measurement is often affected by Gaussian white noise and zero offsets. The noise model of IMU can be expressed by the following formula: where ω B (t) is the angular velocity measured by the gyroscope; a B (t) is the acceleration measured by the accelerometer; ω B (t) and a B (t) are the real angular velocity and the real acceleration; b and η represent the corresponding zero-bias and white Gaussian noise, respectively.
To calculate the motion of the robot from the measured values of the IMU, the following kinematic models need to be introduced.
where . R WB , . v W and . p W respectively represent the derivatives of the rotation matrix R WB , the velocity vector v W and the translation vector p W with respect to time. Equation (9) describes the change in pose and velocity of the IMU in differential form. To obtain the value of the state of the IMU at each moment, the integration to Equation (9) needs to be determined. The pose and velocity of the IMU at time interval [t, t + ∆t] can be described as follows: where ∆t is the time interval between two adjacent IMU measurements. According to Equations (7) and (8), ω B (t) and a B (t) in Equation (10) can be rewritten as the following forms.
It is worth mentioning that the subscript of the reference coordinate system in Equation (11) is hidden for the legibility of the formula.
Based on Equation (11), the positional relationship of the inertial component at the adjacent ∆t can be obtained. The sampling frequency of the IMU can be optimized if a state is added that needs to be evaluated every time interval in the collection of IMU data, which, however, will result in a slow calculation and arduous process of the IMU data. The IMU pre-integration method combines the IMU measurements between two adjacent visual keyframes i and j into a composite term that constitutes the motion constraints of two adjacent visual keyframes. Assuming that the IMU is synchronized with the visual frame detection time and the measurement data is acquired at discrete time k, the pose relationship between the two keyframes k = i and k = j can be obtained by the IMU measurement. To avoid the situation where the estimation of the initial frame of the pose estimation leads to the recalculation of the integral, this paper uses the incremental expression, which is defined as It should be noted that the IMU biases are regarded as being constants in the time interval ∆t. However, the estimated biases are changed with a small amount of δb during optimization. The Jacobians J g (·) and J a (·) are employed to account for a first-order approximation of the effect of changing the biases without explicitly recomputing the pre-integration. The pose and velocity can be described as

Visual-Inertial Initialization
The primary purpose of Visual-Inertial SLAM system initialization is to obtain the parameters necessary for the system to optimize and the initial values of the state. Since the stereo inertial tightly coupled system is a highly nonlinear system, it becomes very sensitive to specific initial values. The initialization quality directly affects the robustness and accuracy of the localization of the entire tightly coupled system. Therefore, it is necessary to initialize the system properly to provide correct parameters and initial values.
In the process of initialization, the information that needs to be initialized or estimated can be divided into two categories. One is the parameters that almost remain unchanged during the system operation, such as absolute scale and gravity acceleration. Another is the initial values of the system's starting state quantities, including the pose and velocity information of the first few frames, the position of the 3D landmark points, and the zero offset of the IMU accelerometer and gyroscope. The initialization can be separated into two processes. Firstly, the stereo visual initialization can initialize the initial frame pose and the three-dimensional landmark position information based on the sliding window. Then, the visual inertia joint initialization can initialize the absolute scale, gravity acceleration, and camera speed information.

Stereo Vision Initialization Based on a Sliding Window
In the process of stereo visual initialization based on a sliding window, an up-to-scale purely visual structure of the sliding window is constructed to restore the information of the initial frames pose and the three-dimensional landmark point position.
Two frames are first selected that have sufficient feature disparity in the sliding window. Next, the eight-point method of the polar geometry recovery pose is used to recover the essential matrix E. The scale of the translation transformation is fixed, and E is used to retrieve the motion pose and triangulate the 3D map points. After initializing a batch of 3D points, the perspective-n-point (PnP) method is employed to solve the pose information of the remaining frames in the sliding window. A global full bundle adjustment is used to minimize the total reprojection error for all feature observations in all frames. At this point, the measurement can obtain the pose information and the three-dimensional information of the feature points. Since the external parameters T CB = (R CB , p CB ) between the camera and the IMU are known, all variables can be converted to representations in the IMU coordinate system: where s is an unknown scale factor that aligns the visual structure to the metric scale, and feature positions (·) W are represented with respect to the world coordinate system.

Visual Inertial Joint Initialization
Through the visual inertia joint initialization, the absolute scale, gravity acceleration, camera state speed information, and IMU zero offset can be initialized.

Gyroscope Bias Estimation
The IMU bias is firstly initialized under the assumption that the IMU gyroscope zero offset b g is constant in the current window. Considering the adjacent kth and k+1th frames in the window, in the stereo visual initialization of the previous step, we can obtain the rotations R k and R k+1 in their relative world coordinate system. The angular velocity among the results of IMU pre-integration can be derived from Equation (12). The gyroscope bias can be predicted by minimizing the error of the term.
By solving this least squares problem, we can estimate the gyroscope zero offsets b g .

Accelerometer Bias, Gravity and Metric Scale Estimation
It is difficult to solve the accelerometer bias during the initialization process because of g and the accelerometer bias are measured simultaneously. However, the accelerometer bias b a has little effect on system stability, a rough initial guess can thus be made, and the bias will be continuously optimized during subsequent optimizations. Therefore, in the initialization step, we set the accelerometer offset b a to zero.
The remaining estimated parameters: the speed of the state, the gravity, and the scale factor are defined as the variables: where v B k is velocity in the body frame while tracking the kth image, g is the gravity vector in the world frame, and s is the scale of Visual-Inertial SLAM system to metric units. Considering two consecutive frames b k and b k+1 , we can develop the following linear measurement model from Equations (12), (14) and (15): The initial value can be guessed by solving the following least squares problem: The speed of each local frame, the gravity vector (including direction and intensity), and the scale factor can be obtained.

Tightly Coupled Stereo Visual-Inertial Odometry
After the state initialization, a highly accurate state is estimated using a sliding window estimator. A state estimator based on the combination of stereo vision and IMU is described in this section. The state estimator can be regarded as a backend based on nonlinear graph optimization. It optimizes the coordinates of the 3D landmark points and the pose, velocity and IMU bias of the inertial camera. In this section, the state vector of the system estimation is defined firstly, then the expression of the visual inertia optimization term is given. After that, the explicit calculation formulas of the visual error term and the inertia error term are provided, and finally, the system's marginalization method is presented.

System State Vector
First, the state vectors of the system are defined. The state vector to be estimated are the state variable χ of the inertial camera at the image time k, including position p B , pose R WB , velocity v B , gyroscope bias b g , and accelerometer bias b a . (20) x n = R WB , p B , v B , b g , b a (21)

IMU and Visual Error Term
In the traditional visual SLAM or visual odometers, nonlinear optimization is used to obtain the best estimate of camera pose and 3D landmark points by minimizing the reprojection error of feature points in the camera frame. Once inertial measurements are introduced, constraints are applied to the camera's continuous motion poses, the IMU speed and the estimated IMU bias for continuous time, resulting in an increase of the system state variables. The IMU error term can be defined as: where ρ is the Huber robust cost function, and I is the information matrix of the pre-integration and R is the bias random walk. Consider the feature in the j th image, and the reprojection error of feature points e proj for a given match k to be defined as follows: where x k is the location of the observation keypoint in the image, X k c is the map point in world coordinates, and p j WB represents the translation of IMU relative to the world coordinate system at jth frames, and k is the information matrix associated with the keypoint scale.

Marginalization
As time goes by, the feature points and camera poses will accumulate, and the amount of optimization calculation will increase accordingly. Because the dimension of the visual inertia system to be optimized is more than visual system, the number of variables to be optimized is large. If the estimated variables are not limited, the calculation load will surge with time. Therefore, the non-linearization is performed while the state of the estimated system is marginalized, i.e., the time-limited keyframes are removed without changing the consistency of the estimation.
The visual-inertial optimization term is a least squares problem, which can generally be solved by the Gauss-Newton iteration method. This can be defined as: Suppose δx a is the state variable to be marginalized, and δx b is the state variable to be retained. Depending on the conditional independence, we can simplify the process of marginalization as follows: δx a in the above formula is the variable for marginalization, such as a camera pose. We cannot directly delete δx b and its related landmark points, because this would reduce the constraint information. Therefore, the following Schur complement can be used to eliminate the element: where H c − H c H −1 a H b and b b − H c H −1 a b a are marginalized H matrices and error quantities, and state variables δx a are marginalized. By repeating the marginalization, as the new state variable is added, the number of state variables to be optimized remains constant, which dramatically decreases the amount of computation.

Experimental Results
To validate the effectiveness of the proposed reduced mismatch algorithm, the EuRoC dataset of images was used to test the visual front end. Two consecutive frames of images were randomly selected, a total of eight groups (among monocular image using the left camera image), keypoints were extracted for both the monocular image and stereo image, and the RANSAC algorithm was then used to reduce the mismatch. By calculating the comparison between the number of points removed and the original untreated points, it was found that the image matching accuracy after stereo processing was high, with the specific results shown in Figure 3.

Experimental Results
To validate the effectiveness of the proposed reduced mismatch algorithm , the EuRoC dataset of images was used to test the visual front end. Two consecutive frames of images were randomly selected, a total of eight groups (among monocular image using the left camera image), keypoints were extracted for both the monocular image and stereo image, and the RANSAC algorithm was then used to reduce the mismatch. By calculating the comparison between the number of points removed and the original untreated points, it was found that the image matching accuracy after stereo processing was high, with the specific results shown in Figure 3.
An experiment was performed to validate the proposed method. A comparison with the existing state-of-art visual-inertial odometry algorithms VINS-Mono and S-MSCKF through the EuRoC MAV Visual-Inertial datasets was also made. The EuRoC datasets were retrieved by a microdrone. The UAV was equipped with sensors such as stereo cameras (Aptina MT9V034, 20FPS) and IMU (ADIS16448, 200 Hz), and the real-time ground truth states were acquired by the motion capture device. The movement of the drone had a large rotation and a significant change in light intensity, which was challenging for our vision-based odometry. The tracking of the visual front end is shown in Figure 4. In these experiments, the proposed method was compared with VINS-Mono and S-MSCKF, which were tested on a desktop computer. The desktop CPU was equipped with quad-core i7-6700 3.4 Hz, the memory is 8 GB. The EuRoC datasets have eleven datasets, and four representative  An experiment was performed to validate the proposed method. A comparison with the existing state-of-art visual-inertial odometry algorithms VINS-Mono and S-MSCKF through the EuRoC MAV Visual-Inertial datasets was also made. The EuRoC datasets were retrieved by a microdrone. The UAV was equipped with sensors such as stereo cameras (Aptina MT9V034, 20FPS) and IMU (ADIS16448, 200 Hz), and the real-time ground truth states were acquired by the motion capture device. The movement of the drone had a large rotation and a significant change in light intensity, which was challenging for our vision-based odometry. The tracking of the visual front end is shown in Figure 4.  In these experiments, the proposed method was compared with VINS-Mono and S-MSCKF, which were tested on a desktop computer. The desktop CPU was equipped with quad-core i7-6700 3.4 Hz, the memory is 8 GB. The EuRoC datasets have eleven datasets, and four representative sequences, including MH_03_median, MH 05 difficult, V2_01_easy, and V2_02_median, were used in the dataset, to compare with VINS-Mono. Figure 5 shows the trajectory for the sequence  In these experiments, the proposed method was compared with VINS-Mono and S-MSCKF, which were tested on a desktop computer. The desktop CPU was equipped with quad-core i7-6700 3.4 Hz, the memory is 8 GB. The EuRoC datasets have eleven datasets, and four representative sequences, including MH_03_median, MH 05 difficult, V2_01_easy, and V2_02_median, were used in the dataset, to compare with VINS-Mono. Figure 5 shows the trajectory for the sequence MH_03_median, MH 05 difficult, V2_01_easy, and V2_02_median. Obviously, the estimated trajectory by our method is even closer to the real one. MH_03_median, MH 05 difficult, V2_01_easy, and V2_02_median. Obviously, the estimated trajectory by our method is even closer to the real one. The relationship between the translation error and the distance is shown in Figure 6. Due to space limitations, only the results of MH_03_median and MH_05_difficult are demonstrated. In the error graph, the proposed method produces the smallest translation error as the distance increases. Overall, the estimated error of this method is less than 5% on all sequences. The relationship between the translation error and the distance is shown in Figure 6. Due to space limitations, only the results of MH_03_median and MH_05_difficult are demonstrated. In the error graph, the proposed method produces the smallest translation error as the distance increases. Overall, the estimated error of this method is less than 5% on all sequences.  The relationship between the translation error and the distance is shown in Figure 6. Due to space limitations, only the results of MH_03_median and MH_05_difficult are demonstrated. In the error graph, the proposed method produces the smallest translation error as the distance increases. Overall, the estimated error of this method is less than 5% on all sequences.
In addition, we also compared our method with VINS-Mono in terms of CPU resource requirements, and the results are listed in Table 1. As can be seen, our method is comparable to the VINS-Mono on the CPU load, but the memory utilization statistics appears relatively better. Thus, from Figures 5 and 6 and Table 1, it can be concluded that our method obtains comparable accuracy at the start of the trajectory, but in a more efficient manner, and as the distance increases, the method begins to have much better performance than VINS-Mono.  The relationship between the translation error and the distance is shown in Figure 6. Due to space limitations, only the results of MH_03_median and MH_05_difficult are demonstrated. In the error graph, the proposed method produces the smallest translation error as the distance increases. Overall, the estimated error of this method is less than 5% on all sequences.
In addition, we also compared our method with VINS-Mono in terms of CPU resource requirements, and the results are listed in Table 1. As can be seen, our method is comparable to the VINS-Mono on the CPU load, but the memory utilization statistics appears relatively better. Thus, from Figures 5 and 6 and Table 1, it can be concluded that our method obtains comparable accuracy at the start of the trajectory, but in a more efficient manner, and as the distance increases, the method begins to have much better performance than VINS-Mono. To evaluate the performance of this method, the tool evo (https://github.com/MichaelGrupp/evo) is used to determine the Root Mean Square Error (RMSE) of the experimental results, and the results are shown in Table 2. Notably, stereo feature matching failed because of the continuous inconsistency in brightness between stereo images in V2_03_difficult. Therefore, V2_03_difficult sequence was not included here. In the initial stage, the accuracy of our method is not greatly superior to VINS-Mono. As the flight distance increases, however, the error of our method becomes significantly lower than that of VINS, ensuring comparably accurate results over long paths. In the end, our method maintains the best performance, with an average RMSE of 0.075 m, as demonstrated in Table 2. It should be emphasized here that our method successfully recovers the metric scale, which, however, could not be achieved just by using the monocular camera. Compared to the optimization-based monocular SLAM, our method is more accurate than VINS in most scenarios except for individual scenes. For filter-based S-MSCKF, its accuracy is much lower than our method and VINS, which proves that optimization-based methods have greater potential than filter-based methods. It is worth mentioning that VINS-Mono, slightly inferior to our proposed algorithm, has outperformed both OKVIS and MSCKF in our results as shown in Table 2. These findings may not agree with the published results in [16] as VINS-Mono in their tests are running at a relatively low frequency without loop closure detection. When playing at full strength, VINS-Mono has experimentally been proved to be state-of-the-art and perform better than the widely used VIO systems, including OKVIS and MSCKF [33]. In summary, the optimization-based method has high localization accuracy and low memory utilization, while the filter-based method has advantages in computing resources. This is a remarkable result of our optimization-based stereo method, compared to the filter-based stereo VIO (16) and the optimization-based monocular VIO (19).

Conclusions
In this paper, a robust stereo visual-inertial tightly coupled localization method was proposed. The method outputs the pose information of the drone in real-time through the onboard sensor, namely, the camera and the inertial sensing unit. The accuracy, real-time and robustness of the proposed method are validated by comparisons with the state of art open source algorithms in real-time operation in the EuRoC dataset. In most cases, this method shows better performance and lower computational cost, when running on low-computing devices. In future work, experiments will be conducted based on the drones, in combination with the proposed localization method as well as obstacle avoidance algorithms to achieve autonomous navigation.