Monocular Visual-Inertial Navigation for Dynamic Environment

: Simultaneous localization and mapping (SLAM) systems have been generally limited to static environments. Moving objects considerably reduce the location accuracy of SLAM systems, rendering them unsuitable for several applications. Using a combined vision camera and inertial measurement unit (IMU) to separate moving and static objects in dynamic scenes, we improve the location accuracy and adaptability of SLAM systems in these scenes. We develop a moving object-matched feature points elimination algorithm that uses IMU data to eliminate matches on moving objects but retains them on stationary objects. Moreover, we develop a second algorithm to validate the IMU data to avoid erroneous data from inﬂuencing image feature points matching. We test the new algorithms with public datasets and in a real-world experiment. In terms of the root mean square error of the location absolute pose error, the proposed method exhibited higher positioning accuracy for the public datasets than the traditional algorithms. Compared with the closed-loop errors obtained by OKVIS-mono and VINS-mono, those obtained in the practical experiment were lower by 50.17% and 56.91%, respectively. Thus, the proposed method eliminates the matching points on moving objects effectively and achieves feature point matching results that are realistic.


Introduction
Simultaneous location and mapping (SLAM) is the "holy grail" of autonomous driving research [1]. In an unknown environment, SLAM enables a robot to perceive its own position and posture and map the environment at the same time. It is one of the key technologies of augmented reality (AR), positioning, and navigation. Visual SLAM mainly uses visual data fused with other sensor data for positioning and mapping [2]. Depending on the type and number of cameras used, visual SLAM can be divided into RGB-D SLAM, stereo SLAM, and monocular SLAM. RGB-D SLAM directly obtains a precise feature point depth using a depth camera. However, the depth camera is short-ranged and difficult to use outdoors. Stereo SLAM obtains the depth of the feature point via two offset cameras, and the distance between them is the baseline for depth calculation. However, the computation requirements and algorithmic complexity of such systems are high. In contrast to stereo SLAM, monocular SLAM uses only one camera; therefore, it is a low-cost, lightweight, and low-power solution. It is also easier to integrate into AR helmets and other devices, making it widely applicable. However, by relying only on monocular camera information, the system cannot obtain the true scale of the motion trajectory [3]. To solve this problem, various multisensor fusion methods have been applied. Among the many different sensor combinations, the combination of a camera and inertial measurement unit (IMU) is the most robust and least expensive. The visual data obtained by the camera can reduce the drift of the IMU data, whereas the IMU data can recover the scale of the visual data. The VINS-Fusion [4][5][6][7] proposed by the Hong Kong University of Science and Technology is a SLAM system based on optimized multisensor fusion. The system is an extension of VINS-Mono [5] and supports multiple modes: monocular camera + IMU, stereo camera, and stereo camera + IMU. The system uses IMU preintegration and feature point observations to achieve precise device self-location. It supports the calibration of the extrinsic parameters between the camera and IMU and the time offset between different sensors online. MSCKF [8], proposed by Mourikis et al., is a real-time visual-inertial navigation algorithm based on an extended Kalman filter (EKF). MSCKF integrates IMU and visual information under the EKF framework. It can adapt to some violent movements of the equipment, lack of texture in experimental scenes, and changes in lighting.
Most existing SLAM methods assume that all objects in a scene are static [9]. Therefore, they are unable to determine whether objects in the environment are moving. However, in the real world, there are many moving objects, and the assumption of a static environment is impractical. With many moving objects in the environment, image feature point matching is affected; the system's location accuracy suffers, and the SLAM system is degraded. This limits the application of SLAM technology in many fields, such as robotic navigation [10], autonomous driving [11], AR [12], human following [13], and navigation assistance for the visually impaired [14]. Therefore, the construction of a SLAM system that has high precision in a dynamic environment has been the focus of many studies in recent years. Researchers have proposed many solutions that can be broadly divided into two categories: geometric constraints-based and deep learning-based.
Many researchers have attempted to develop SLAM systems for dynamic environments from the perspective of geometric constraints. Kim et al. proposed a backgroundmodel-based dense-visual odometry (BaMVO) [15], which uses an RGB-D camera in dynamic environments. First, a background model is estimated using a nonparametric model. Then, the background model is used to estimate the location and posture of the sensor to reduce the impact of moving objects. Sun et al. [16] proposed an RGB image and depth data-based method for removing moving objects. They incorporated this method into the RGB-D SLAM process in the preprocessing stage to remove data related to moving objects. Li et al. [17] proposed a real-time RGB-D SLAM system suitable for dynamic environments by using a static weighting method for edge points in the keyframe. This static weight is added to the intensity-assisted iterative closest point method used for the registration task. Tan et al. [18] presented a real-time monocular SLAM system with an online keyframe representation and updating method that reliably detects changed features from the key frames by projecting them to the current frame. In addition, they proposed a novel prior-based adaptive random sampling consensus (RANSAC) algorithm to efficiently remove match outliers. Shimamura et al. [19] introduced weighted tentative initial values of a camera pose and three-dimensional point cloud map to reduce user operation loading. Their study also distinguished outliers of feature points to stably estimate camera poses in dynamic scenes. Kim et al. [20] fused depth information into the RGB image based on the sped up robust features [21] descriptor. An IMU was used to rotate the feature points. To remove dynamic feature points and retain static feature points that are used to calculate the motion of the device (via the RANSAC algorithm) [22], they used a filtering method. Wang et al. [9] proposed a moving object detection algorithm for RGB-D SLAM based on mathematical models and geometric constraints.
Researchers have also proposed many deep learning-based methods. Yu et al. [23] proposed a robust semantic visual SLAM for dynamic environments called DS-SLAM, which combines the semantic segmentation network and moving consistency check method to reduce the impact of moving objects and improve the localization accuracy in a dynamic environment. Wang et al. [24] proposed a moving-target segmentation method that can be combined with dense SLAM. To address the problem of unsatisfactory segmentation results, Bescos et al. [25] proposed a visual SLAM system, DynaSLAM, which builds on ORB-SLAM2 [26] and adds dynamic object detection and background inpainting functions. It can run on dynamic scenes in monocular, stereo, and RGB-D modes. Further, it detects moving objects either by multiview geometry, deep learning, or both. Zhao et al. [27] Remote Sens. 2021, 13, 1610 3 of 19 proposed a dynamic environment-oriented semantic visual inertial system that uses pixellevel results of semantic segmentation for real-time locations. The system integrates feature points extraction and matching algorithms into the SLAM process and combines pixel-level features in the dynamic feature detection module. It excludes dynamic feature points by using semantic segmentation results and multiview geometric constraints. Cheng et al. [2] proposed a feature-based method called DM-SLAM, which supports monocular, stereo, and RGB-D sensor modes. The semantic segmentation module obtains the pixel segmentation results of potential dynamic objects. In the case of the RGB-D or stereo modes, the method uses a reprojection error to distinguish dynamic points. In the case of the monocular mode, the method uses epipolar constraints. Then, static feature points are inputted into the SLAM framework to obtain the position and posture of the system. Khan et al. [28] proposed a coarse-to-fine pupil localization method using a composite of machine learning and image processing algorithms. The machine learning method they developed can be applied to dynamic SLAM.
Although many current solutions for dynamic environments have achieved good results, they still have some shortcomings. For methods based on geometric constraints, most current methods use only RGB or RGB-D cameras. This makes the system behave poorly in some complex environments, limiting the system's usage scenarios. Most methods also assume that there are far more static objects than moving objects in the scene; thus, the motion model of the camera is determined by the most matched feature points. The matched feature points that do not satisfy the model are eliminated. However, this assumption is often incorrect. Though deep learning methods are superior to the traditional ones in many aspects, works utilizing them in SLAM systems are few. Besides the system tuning requirements, deep learning methods require a GPU or other hardware accelerators for real-time inference [29], which may limit their application in AR, obstacle avoidance navigation, and other fields. Additionally, because most deep learning-based methods are based on object detection, the deep learning models used must be pretrained to classify predetermined moving objects. Not only does it make the system powerless against moving objects the models were not trained for, but also not all the objects the models are trained to recognize are always in motion. Therefore, it is difficult to guarantee the accuracy of the point features matching. In summary, it is still necessary to study high-precision match elimination algorithms for moving object feature points in complex environments.
To improve the system's location accuracy in complex scenes, we need to eliminate the feature matches on moving objects and use the static scene information for system location. This paper proposes a method for eliminating feature points on moving objects using IMU data. The translation and rotation of the system in the corresponding time period of the two frames can be obtained using the IMU data and the motion state at the last moment of the device. Then, we can use the translation and rotation of the system to obtain the transformation matrix of motion during this period. The distance between the feature matches and transformation matrix is calculated one by one, and the number of matching point features supporting the transformation matrix is determined. When the number is sufficiently high, the IMU data is considered valid. In this case, the feature point matches having large distances to the transformation matrix are considered abnormal matches (located on a moving object or erroneous match) that need to be eliminated.
To summarize, the main contributions of this study are as follows: 1.
An algorithm for validating IMU data is proposed. By calculating the number of image feature matches that support the transformation matrix, we can validate the IMU data.

2.
A feature point matches elimination algorithm for dynamic scenes is proposed. By calculating the distance between the feature point matches on the RGB image and the transformation matrix, we can determine whether the feature match is on the moving object.

3.
The proposed algorithm is integrated into a visual inertial navigation system. Consequently, the number of feature points matches on moving objects in a complex environment is reduced and location accuracy is significantly improved.
The remainder of this paper is structured as follows. The Section 2 describes how to use IMU data to reduce the influence of moving objects. The Section 3 introduces the experimental plan for the public dataset and real-world experiment. The Section 4 presents the experimental process and results. Finally, the Section 5 summarizes the study.

Overview
Due to moving objects, corresponding feature points will move in the image, significantly reducing the accuracy of point features matching. To effectively reduce the impact of moving objects and improve the location accuracy of a SLAM system, we propose a method for eliminating feature point matches on moving objects in dynamic scenes. The method uses IMU data to obtain the motion transformation matrix of the camera. Then, high-precision feature point matches are obtained by using the transformation matrix under polar geometry constraints. Figure 1 depicts the SLAM system using the proposed dynamic feature matches culling algorithm. First, feature points were extracted and matched according to the RGB image. The angular velocity and acceleration can be obtained from the IMU data, which are used to calculate the translation and rotation between the two frames and the corresponding transformation matrix with the system's motion state at the last moment. Then, the distances between the feature matches and the transformation matrix were calculated one at a time to determine whether the two are consistent. If enough feature matches support the matrix, the IMU data are considered valid and abnormal matches are eliminated according to the distance between the feature matches and the transformation matrix. By contrast, if the IMU data are invalid, the RANSAC algorithm is used to eliminate the erroneous matches. Finally, the rest of the feature matches are input to the optimization equations to calculate the device's pose, motion state, and coordinates of three-dimensional feature points through bundle adjustment.

Obtaining the Transformation Matrix from IMU Data
The gyroscope measures the angular velocity of the device. By integrating the angular velocity, the rotation of the device between the two frames can be obtained. The following equation [30] was used to calculate the posture according to the last frame of the image. where: In Equation (1), q w b k represents the posture of the IMU at frame k in the world coordinate system. ω is the angular velocity of the device measured by the gyroscope. ω x , ω y , and ω z represent the angular velocities in the x, y, and z directions, respectively. The rotation of the system in the time period can be calculated according to the current posture of the system and the previous posture of the system. Remote Sens. 2021, 13, x FOR PEER REVIEW 5 of 19 Figure 1. Flowchart of the SLAM system using our proposed algorithm.

Obtaining the Transformation Matrix from IMU Data
The gyroscope measures the angular velocity of the device. By integrating the angular velocity, the rotation of the device between the two frames can be obtained. The following equation [30] was used to calculate the posture according to the last frame of the image. where: In Equation (1), represents the posture of the IMU at frame k in the world coordinate system. ω is the angular velocity of the device measured by the gyroscope. , , and represent the angular velocities in the x, y, and z directions, respectively. The rotation of the system in the time period can be calculated according to the current posture of the system and the previous posture of the system.
Similarly, according to Equation (4), the speed of the system at the current frame can be obtained by integrating the acceleration measured by the accelerometer. By integrating the speed, the translation amount of the system between two frames can be obtained using Equation (5). Similarly, according to Equation (4), the speed of the system at the current frame can be obtained by integrating the acceleration measured by the accelerometer. By integrating the speed, the translation amount of the system between two frames can be obtained using Equation (5).
is the position at frame k in the world coordinate system, and ∆t k is the time interval from t k to t k+1 .
The rotations and translations of the device are in the global coordinate system. Because the camera and the IMU are rigidly connected and the relative position is fixed, the rotation matrix and translation in the camera's coordinate system can be obtained by transforming the location and posture of the IMU using the extrinsic parameters of the camera-IMU. According to the translation and rotation of the camera and the camera's internal parameter matrix, the fundamental matrix of the camera motion between two adjacent frames can be obtained by the following equations: Here, E is the essential matrix of camera movement. t ∧ is the antisymmetric matrix of translation t. F is the fundamental matrix of camera movement. K is the internal parameter matrix of the camera.

Distances between Feature Matches and the Fundamental Matrix
Geometrically, the fundamental matrix is a projective map that maps points to straight lines. In this case, point x in the first frame determines the epipolar line on the second frame as: Therefore, if the feature point P i is in a static state, the feature point will satisfy the epipolar geometric constraint in an ideal situation [31]: For the fundamental matrix F, the distance from the corresponding feature matches to the epipolar line can be obtained as follows: where The steps for calculating the distance between the feature point matches and the fundamental matrix are as follows. First, the fundamental matrix is calculated using IMU data. According to Equation (10), the distance between the corresponding feature points in the latter image and the polar line is calculated and denoted as d 1 . Similarly, the fundamental matrix of motion from the current frame to the previous image is calculated. We calculate the distance between the feature point matches in the previous image and the corresponding polar line according to Equation (11), denoted as d 2 . The larger of the variables d 1 and d 2 is defined as the distance d from the feature matches to the fundamental matrix. Algorithm 1 describes the steps of the calculation in detail.

Eliminating Abnormal Matches Using IMU Data
Because the IMU can only measure the angular velocity and linear acceleration of the system movement, the drift of IMU data will increase over time. Additionally, as many things can interfere with IMUs, the data from IMUs can be very noisy and unreliable. Therefore, IMU data should be validated before use to constrain the feature point matching.
To determine whether the feature match is consistent with the fundamental matrix, we calculate the distance d from the feature match to the fundamental matrix. If the distance is smaller than a, a given threshold, the feature match is consistent with the fundamental matrix; otherwise, it is not consistent. The value ϕ k,i is set to 0 when the i-th feature point u k,i of the k-th frame is consistent with the fundamental matrix; otherwise, it is set to 1.
The consistency of the feature matches in the image with the fundamental matrix derived from the IMU data is calculated one by one. Given a threshold b, when the valid feature matches count is not less than b, the IMU data are considered valid; otherwise, they are considered invalid. The value of θ k is set to 0 if the feature point matches of the k-th and the k + 1-th frames are consistent with the fundamental matrix in the corresponding time period; otherwise, it is set to 1.
If the IMU data are valid (i.e., θ k = 0), the fundamental matrix calculated from the IMU data is accurate, and the IMU data is trustworthy. In this case, the distance between the feature points matches and the fundamental matrix is used to determine whether the feature points matches are correct. For each set of feature matches, if the distance to the fundamental matrix calculated by the IMU is less than c, the feature match will be consistent with the movement of the system between the two frames calculated by the IMU data and is considered correct. Otherwise, the feature match is incorrect, or the matched point is located on a dynamic object and needs to be eliminated.
When the IMU data is invalid (i.e., θ k = 1), the fundamental matrix calculated from the IMU data is inaccurate and the IMU data is untrustworthy. Therefore, the visual data is used instead. Here, the RANSAC algorithm, which uses the feature point matches of the visual data, is used to calculate the motion matrix that is supported by most feature point matches and eliminate abnormal matches.

Materials and Experimental Setup
An Intel(R) Core(TM) i7-7700HQ CPU @ 2.80 GHz computer with 16 GB RAM and Ubuntu 16.04 LTS was used for the experiment. The experiment was composed of two parts. The first part used the ADVIO public dataset [32]. A detailed introduction of the dataset is provided in Section 3.1. Select feature matches data collected when the ADVIO dataset was ran through the SLAM system were extracted to verify the performance of our algorithm. At the same time, the dataset's ground truth was used to quantitatively evaluate the location accuracy. The second part involved using self-collected data. We selected feature matches data during the experiment to verify the performance of our algorithm. By setting the origin and destination of the experiment to the same point, we can use the closed-loop error to verify the location accuracy of the system, as discussed in in Section 3.2.
We selected VINS-mono [5] proposed by Qin  optimization-based method is employed to obtain high accuracy visual-inertial odometry by fusing preintegrated IMU measurements and feature observations. ROVIO is a monocular visual-inertial odometry algorithm, which, by directly using pixel intensity errors of image patches, achieves accurate tracking performance while exhibiting extremely high robustness. OKVIS is a novel approach to tightly integrate visual measurements with readings from an IMU in SLAM.

ADVIO Dataset Experiment
The sensor data for the ADVIO dataset were collected using an iPhone 6s. The sequences collected totaled 4.5 km. The ground truth was obtained by combining a pure inertial navigation system with frequent manual positioning based on precise planes [35]. The dataset contained 23 separate recordings captured in six different locations, and all had complete six degree-of-freedom location and posture true values. The dataset was not limited to small spaces and covered a variety of indoor and outdoor scenarios, including stairs, escalators, elevators, offices, shopping malls, and subway stations. The dataset provides a variety of sensor data from cameras, accelerometers, gyroscopes, magnetic sensors, and barometers.
We selected eight data sequences (sequence 1, 2, 6, 9, 11, 13, 18, and 21) to test the location accuracy of the algorithm in different application scenarios. Sequences 1, 2, 6, and 9 were shopping mall scenes. Sequence 11 was a subway station. Sequences 13 and 18 were office areas. Sequence 21 was an outdoor scene. Sequences 1 and 9 contained escalators. Sequences 13 and 18 contained stairs. Sequence 18 contained elevators. Sequences 6 and 11 contained many pedestrians. Sequences 1 and 2 contained a medium number of pedestrians. Sequences 9, 13, and 21 contained a small number of pedestrians. Sequence 18 was devoid of pedestrians. Sequences 11 and 21 contained many moving vehicles. As the selected sequences contained different kinds of objects, we can detect the adaptability and robustness of the algorithm.

Real-World Experiment
To verify our algorithm's accuracy in the real world, an on-site experiment with equipment was conducted. The experimental data were collected at the Chinese Academy of Sciences, Datun Road, Chaoyang District, Beijing, China. The experiment was conducted at lunchtime; therefore, there were many pedestrians on the road. Hence, it was a challenging dynamic scene for the SLAM system. The experimenter held the equipment in the park along a closed-loop path to collect the data. An Intel RealSense D435i camera [36] was used for data collection, as shown in Figure 2. It is a monocular camera with an accelerometer, gyroscope, and depth camera. Only visual, accelerometer, and gyroscope data were used. The resolution of the visual data was 640 × 480 pixels with a frequency of 15 Hz. The accelerometer's frequency was 60 Hz. The frequency of the gyroscope was 200 Hz.
It was necessary to calibrate the equipment in advance. This included calibrating the internal parameters of the camera, IMU, and the camera-IMU extrinsic parameters. We used the imu_utils [37] library and the MATLAB-based calibration toolbox [38] to calibrate the IMU and camera internal parameters, respectively. The results are presented in Tables 1 and 2. The unit of f x , f y , c x and c y is pixel.

Real-World Experiment
To verify our algorithm's accuracy in the real world, an on-site experiment with equipment was conducted. The experimental data were collected at the Chinese Academy of Sciences, Datun Road, Chaoyang District, Beijing, China. The experiment was conducted at lunchtime; therefore, there were many pedestrians on the road. Hence, it was a challenging dynamic scene for the SLAM system. The experimenter held the equipment in the park along a closed-loop path to collect the data. An Intel RealSense D435i camera [36] was used for data collection, as shown in Figure 2. It is a monocular camera with an accelerometer, gyroscope, and depth camera. Only visual, accelerometer, and gyroscope data were used. The resolution of the visual data was 640 × 480 pixels with a frequency of 15 Hz. The accelerometer's frequency was 60 Hz. The frequency of the gyroscope was 200 Hz. It was necessary to calibrate the equipment in advance. This included calibrating the internal parameters of the camera, IMU, and the camera-IMU extrinsic parameters. We used the imu_utils [37] library and the MATLAB-based calibration toolbox [38] to calibrate the IMU and camera internal parameters, respectively. The results are presented in Tables 1 and 2.  Then, the Kalibr library [39] was used to calibrate the camera-IMU's extrinsic parameters. The calibration results are as follows: Then, the Kalibr library [39] was used to calibrate the camera-IMU's extrinsic parameters. The calibration results are as follows:  Figure 3 shows the track used in the experiment. The blue circle at the lower right corner of the track indicates the start and end points of the loop. The arrow indicates the direction of travel, and the red line is the approximate path traveled. The beginning of the dataset (about 1 s in length) is usually poor in quality with insufficient exposure or excessive exposure of visual photos, because the device has just been turned on. Therefore, the initial stage of the data was processed. Then we used the methods described in [5] to unify the frequencies of the accelerometer and gyroscope and align the time stamps. Subsequently, the data were used by the location system to locate the trajectory.  Figure 3 shows the track used in the experiment. The blue circle at the lower right corner of the track indicates the start and end points of the loop. The arrow indicates the direction of travel, and the red line is the approximate path traveled. The beginning of the dataset (about 1 s in length) is usually poor in quality with insufficient exposure or excessive exposure of visual photos, because the device has just been turned on. Therefore, the initial stage of the data was processed. Then we used the methods described in [5] to unify the frequencies of the accelerometer and gyroscope and align the time stamps. Subsequently, the data were used by the location system to locate the trajectory.

ADVIO Public Data Set
We selected two adjacent frames in sequence 2. Figure 4 shows the raw frame images,

ADVIO Public Data Set
We selected two adjacent frames in sequence 2. Figure 4 shows the raw frame images, where Figure 4a is the previous frame, and Figure 4b is the current frame. During the experiment, the system performed feature points extraction, matching, and abnormal feature point match elimination.

ADVIO Public Data Set
We selected two adjacent frames in sequence 2. Figure 4 shows the raw frame images, where Figure 4a is the previous frame, and Figure 4b is the current frame. During the experiment, the system performed feature points extraction, matching, and abnormal feature point match elimination.     Figure 5a shows the results without the elimination of feature matches. Figure 5b shows the results of with the elimination of feature points matches, which was realized using the RANSAC algorithm. Figure 5c shows the results of with the elimination of feature points matches, which was realized using the proposed method.
The pedestrian in motion interfered with the feature point matching accuracy. The other objects in the scene were static. Therefore, if the feature point is located on the pedestrian, the feature point matching accuracy and the system's location accuracy will be reduced. Comparing Figure 5a,b, we see that while the number of matched feature points on the pedestrian was reduced after using the RANSAC algorithm to eliminate abnormal feature matches, there are still many points on the pedestrian, especially on the pedestrian's left arm and bag. In Figure 5c, the matched feature points on the pedestrian are almost completely nil, and the feature points on the stationary objects in the scene are retained. Therefore, our algorithm is more accurate than the RANSAC algorithm. Hence, when the vision camera is disturbed by a moving object, our proposed algorithm eliminates the matched feature points on the moving object better and obtains higher-precision feature point matches.
For a comprehensive evaluation of our method's performance, we used the root mean square error (RMSE) and mean of the absolute pose error. We used the open-source library evo [40] to evaluate the location trajectory.
The dataset we selected is challenging for the SLAM system. First, because the dataset is collected via a hand-held method, the data has some shaking. Second, dynamic information is present, such as the pedestrian and vehicle in the scene. Third, the shaking of the equipment also poses a big challenge to the SLAM system. Therefore, not all SLAM systems can run perfectly in all datasets. OKVIS-mono cannot run in sequences 11 and 21, while ROVIO-mono can only run in sequences 1, 9, 13, and 21. on the previous frames; the middle images show the matched feature points on the current frames. Figure 5a shows the results without the elimination of feature matches. Figure 5b shows the results of with the elimination of feature points matches, which was realized using the RANSAC algorithm. Figure 5c shows the results of with the elimination of feature points matches, which was realized using the proposed method. The pedestrian in motion interfered with the feature point matching accuracy. The other objects in the scene were static. Therefore, if the feature point is located on the pedestrian, the feature point matching accuracy and the system's location accuracy will be  Table 3 summarizes the error comparison data of the proposed method and the three monocular state-of-the-art SLAM systems for the eight sequences of the ADVIO dataset we preselected for performance testing. The numbers in bold indicate that the proposed method performed better than the other methods, in terms of these values.  Table 3 summarizes that the proposed method performs considerably better than the other traditional methods for the eight preselected data sequences. In terms of system robustness, both OKVIS-mono and ROVIO-mono have dataset sequences that cannot be implemented, but the proposed method can run perfectly in all eight sequences, which implies that the SLAM system using the proposed method is considerably robust. In terms of system location accuracy, that of the SLAM system using the proposed method is the best for all seven dataset sequences, apart from being the second-best accuracy in the first sequence. The SLAM system integrated with the proposed algorithm had higher location accuracies than VINS-mono, which can also run perfectly in all datasets. Figure 6 shows the location trajectory and error data. The leftmost images show the absolute position errors of the four methods. The abscissa is the system's running time in seconds, while the ordinate is the system's absolute position error in meters. The middle images denote the statistics for the absolute error of the trajectory. The statistics are, from top to bottom, the maximum error value, minimum error value, standard deviation, median, average error, and RMSE. The horizontal axis is the magnitude of the errors in meters. The last column is the boxplot of the four methods. The abscissa is the four systems, while the ordinate is the system's absolute position error in meters. The figure does not show the systems that cannot run in the dataset sequence.
For sequence 1 and sequence 13, ROVIO-mono has the lowest positioning accuracy, and the other three exhibit slight differences. FOR sequence 2 and sequence 6, ROVIOmono cannot be implemented. Although OKVIS-mono can barely run, the positioning accuracy is considerably lower than that of VINS-mono and the proposed method, and the positioning accuracy of the proposed method is the best. In sequence 9, the positioning error of OKVIS-mono is considerably larger than the other three. For sequence 11, because it is in the subway station, there is a lot of motion information in the scene, such as trains and pedestrians; thus, only VINS-mono and the proposed method can run. In sequence 18 and 21, the robustness and positioning accuracy of VINS-mono and the proposed method are better than those of OKVIS-mono and ROVIO-mono.
In summary, the performance of the proposed method is the best in terms of robustness and positioning accuracy. dian, average error, and RMSE. The horizontal axis is the magnitude of the errors in meters. The last column is the boxplot of the four methods. The abscissa is the four systems, while the ordinate is the system's absolute position error in meters. The figure does not show the systems that cannot run in the dataset sequence. For sequence 1 and sequence 13, ROVIO-mono has the lowest positioning accuracy, and the other three exhibit slight differences. FOR sequence 2 and sequence 6, ROVIOmono cannot be implemented. Although OKVIS-mono can barely run, the positioning accuracy is considerably lower than that of VINS-mono and the proposed method, and the positioning accuracy of the proposed method is the best. In sequence 9, the positioning

Self-Collected Data
We selected two adjacent frames from the real-world experiment data. The raw images are shown in Figure 7, where Figure 7a is the previous frame, and Figure 7b is the current frame. During the experiment, the system performed feature point extraction, matching, and abnormal feature point matching elimination. ness and positioning accuracy.

Self-Collected Data
We selected two adjacent frames from the real-world experiment data. The raw images are shown in Figure 7, where Figure 7a is the previous frame, and Figure 7b is the current frame. During the experiment, the system performed feature point extraction, matching, and abnormal feature point matching elimination.   Figure 8a shows the results without feature points matches elimination. Figure 8b shows the results with feature points matches elimination using the RANSAC algorithm. Figure  8c shows the results with the feature points match elimination using the proposed method.   Figure 8a shows the results without feature points matches elimination. Figure 8b shows the results with feature points matches elimination using the RANSAC algorithm. Figure 8c shows the results with the feature points match elimination using the proposed method.

Self-Collected Data
We selected two adjacent frames from the real-world experiment data. The raw images are shown in Figure 7, where Figure 7a is the previous frame, and Figure 7b is the current frame. During the experiment, the system performed feature point extraction, matching, and abnormal feature point matching elimination.   Figure 8a shows the results without feature points matches elimination. Figure 8b shows the results with feature points matches elimination using the RANSAC algorithm. Figure  8c shows the results with the feature points match elimination using the proposed method. In the two adjacent frames, the elements that interfere with the accuracy of feature points matching are the two pedestrians close to the lens in the middle and on the righthand side of the frame. Although there are other pedestrians moving farther in the distance, the pixel changes from those pedestrians can be ignored as they are too far away and can be considered static objects. Thus, only the feature points match on the two pedestrians close to the camera will affect the accuracy of feature point matching and the systems' location accuracy. Figure 8a,b show that there are hardly any benefits to using the RANSAC algorithm to remove abnormal feature point matches. Figure 8c shows that our algorithm almost completely removes abnormal feature points matches on the two pedestrians while retaining the feature points on the static objects in the scene. Thus, our algorithm has an obvious removal effect on feature points matches on moving objects. In the two adjacent frames, the elements that interfere with the accuracy of feature points matching are the two pedestrians close to the lens in the middle and on the righthand side of the frame. Although there are other pedestrians moving farther in the distance, the pixel changes from those pedestrians can be ignored as they are too far away and can be considered static objects. Thus, only the feature points match on the two pedestrians close to the camera will affect the accuracy of feature point matching and the systems' location accuracy. Figure 8a,b show that there are hardly any benefits to using the RANSAC algorithm to remove abnormal feature point matches. Figure 8c shows that our algorithm almost completely removes abnormal feature points matches on the two pedestrians while retaining the feature points on the static objects in the scene. Thus, our algorithm has an obvious removal effect on feature points matches on moving objects.
To facilitate the evaluation of the location accuracy of the system, the start and end points were the same in the experiment. Therefore, the accuracy of the algorithm was evaluated by comparing the distance between the start and end points. To make datasets more challenging, we use the hand-held method to collect data; thus, vibration inevitably exists, which will cause the drastic change of accelerometer value and hinder the tracking of feature points. In such cases, ROVIO-mono cannot be implemented; thus, we only compared the proposed method with OKVIS-mono and VINS-mono.
The experimental results are shown in Figure 9. Figure 9a shows the location trajectory of OKVIS-mono, Figure 9b shows the location trajectory of VINS-mono, and Figure 9c shows the location trajectory of the SLAM system using our algorithm. The blue square is the calculated start point, and the green triangle is the calculated end point. The red line shows the trajectory of the positioning results.
In the two adjacent frames, the elements that interfere with the accuracy of feature points matching are the two pedestrians close to the lens in the middle and on the righthand side of the frame. Although there are other pedestrians moving farther in the distance, the pixel changes from those pedestrians can be ignored as they are too far away and can be considered static objects. Thus, only the feature points match on the two pedestrians close to the camera will affect the accuracy of feature point matching and the systems' location accuracy. Figure 8a,b show that there are hardly any benefits to using the RANSAC algorithm to remove abnormal feature point matches. Figure 8c shows that our algorithm almost completely removes abnormal feature points matches on the two pedestrians while retaining the feature points on the static objects in the scene. Thus, our algorithm has an obvious removal effect on feature points matches on moving objects.
To facilitate the evaluation of the location accuracy of the system, the start and end points were the same in the experiment. Therefore, the accuracy of the algorithm was evaluated by comparing the distance between the start and end points. To make datasets more challenging, we use the hand-held method to collect data; thus, vibration inevitably exists, which will cause the drastic change of accelerometer value and hinder the tracking of feature points. In such cases, ROVIO-mono cannot be implemented; thus, we only compared the proposed method with OKVIS-mono and VINS-mono.
The experimental results are shown in Figure 9. Figure 9a shows the location trajectory of OKVIS-mono, Figure 9b shows the location trajectory of VINS-mono, and Figure  9c shows the location trajectory of the SLAM system using our algorithm. The blue square is the calculated start point, and the green triangle is the calculated end point. The red line shows the trajectory of the positioning results.  Figure 9 shows that our algorithm has higher location accuracy than the traditional algorithm as the start and end points are closer to each other, and the closed-loop error is smaller.
To quantitatively evaluate the accuracies of the three methods, the difference between the location of the end point and the start point was calculated and used as an indicator of the location accuracy of the three methods.
When we evaluate the positioning accuracy, we need to consider the total positioning error of the three XYZ axes. If the total positioning error of the three directions decreases, the positioning accuracy of the system improves. The closed-loop errors of OKVIS-mono and VINS-mono are 11.52 m and 13.32 m, which are 50.17% and 56.91% higher, respectively, than that of the proposed algorithm, which is 5.74 m.
The experiment proves that the proposed abnormal feature matching elimination al-  Figure 9 shows that our algorithm has higher location accuracy than the traditional algorithm as the start and end points are closer to each other, and the closed-loop error is smaller.
To quantitatively evaluate the accuracies of the three methods, the difference between the location of the end point and the start point was calculated and used as an indicator of the location accuracy of the three methods.
When we evaluate the positioning accuracy, we need to consider the total positioning error of the three XYZ axes. If the total positioning error of the three directions decreases, the positioning accuracy of the system improves. The closed-loop errors of OKVIS-mono and VINS-mono are 11.52 m and 13.32 m, which are 50.17% and 56.91% higher, respectively, than that of the proposed algorithm, which is 5.74 m.
The experiment proves that the proposed abnormal feature matching elimination algorithm is valid and can significantly improve the location accuracy of SLAM systems in complex environments.

Conclusions
This paper proposed an algorithm for eliminating feature matches on moving objects with the help of IMU data. The proposed algorithm has two main advantages. First, the algorithm uses the distance from the feature that matches the fundamental matrix calculated by the IMU data as an indicator. Then, it uses feature point matching data to validate the IMU data. Second, with the help of the valid IMU data, the transformation matrix separates the dynamic matched feature points from the static matched feature points, reducing the influence of the moving objects on feature point matching. Therefore, the proposed algorithm can obtain feature points that are realistic.
Our experiment tested a SLAM system with our algorithm integrated on the ADVIO dataset and in the real world. The experiments show that even with moving objects in the scene, our algorithm can accurately separate dynamic elements from static elements and eliminate abnormal matches. Therefore, it can improve the accuracy of feature points matches and increase the location accuracy of SLAM systems in complex environments. This investigation would have wider implications for indoor and outdoor navigation, AR, and location services.
In the future, we plan to introduce deep learning to combine the geometric constraint and object detection methods. By detecting dynamic objects in the scene to improve moving objects recognition, further accuracy gains in feature matching and location accuracy are achievable. Acknowledgments: Thanks to Santiago Cortés, Arno Solin, Esa Rahtu, and Juho Kannala for making the ADVIO dataset available for download.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: SLAM simultaneous localization and mapping IMU inertial measurement unit RMSE root mean square error RANSAC random sampling consensus AR augmented reality EKF extended Kalman filter