A Stereo SLAM System with Dense Mapping

The development of simultaneous localization and mapping (SLAM) technology plays an important role in robot navigation and autonomous vehicles. ORB-SLAM2 is a unified SLAM solution for monocular, binocular, and RGB-D cameras that constructs a sparse feature-point map for real-time positioning. However, a sparse-map-based approach cannot effectively meet the requirements of robot navigation, environment reconstruction, and other tasks. In this paper, a dense mapping thread is added to the existing ORB-SLAM2 system. The depth map and color image obtained by stereo matching on a binocular camera are used to generate a three-dimensional point cloud for each keyframe; the point clouds are then fused by tracking and optimizing the motion trajectory of the keyframes to obtain a real-time point cloud map. Experiments conducted on the KITTI dataset and in a real environment under ROS show that the proposed system constructs a clear three-dimensional point cloud map while estimating an accurate trajectory.


I. INTRODUCTION
Simultaneous localization and mapping (SLAM) approaches receive information from a specific sensor, without prior information, in order to estimate an object's position and motion and to establish a model of the environment [1] [2]. The development of SLAM technology has a history of more than 30 years. Recently, advances in computer vision and robotics have supported numerous developments and breakthroughs in SLAM technology; these have made great contributions to robot navigation, three-dimensional (3D) reconstruction, virtual reality, and other fields [3].
Visual SLAM approaches can acquire a large amount of redundant texture information from the environment using cameras, and have a strong capability to recognize scenery. Visual odometry (VO, the front end) methods that estimate motion from images can be classified into feature-based and direct methods [4]. Feature-based methods detect and match key points; the camera motion and scene geometry are then computed using epipolar geometry. This approach has an important inherent limitation: only feature information can be used for VO and mapping. Thus, the generated map is sparse and cannot be directly used for navigation and obstacle avoidance. In contrast, direct SLAM optimizes the geometry directly from the image intensities, which enables the utilization of all the information in the images [5]. However, it is computationally expensive and vulnerable to external illumination, and when the camera moves or rotates over a large range, the image intensities cannot be tracked well.
We propose an improved hybrid SLAM method that combines feature-based VO with dense mapping by depth estimation. The approach extracts FAST corners [6] and matches them using the ORB descriptor [7] to estimate the motion between successive frames from 3D-2D and 2D-2D correspondences. After inserting and culling keyframes, we estimate the depth map using a weighted guided filtering method for stereo matching on the keyframes, on which the dense mapping is based. By adopting a multi-threaded structure, the proposed SLAM system achieves accurate localization and dense mapping in real time.
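To illustrate the descriptor-matching step, the sketch below performs brute-force mutual nearest-neighbour matching of binary (ORB-style) descriptors under the Hamming distance. The function name and the distance threshold are our own illustrative choices; the actual system follows the ORB-SLAM2 matcher.

```python
import numpy as np

def hamming_match(des1, des2, max_dist=64):
    """Mutual nearest-neighbour matching of binary descriptors
    (uint8 arrays, e.g. 32 bytes per ORB descriptor) under Hamming distance."""
    # Pairwise Hamming distances via XOR followed by a bit count.
    xor = des1[:, None, :] ^ des2[None, :, :]
    dist = np.unpackbits(xor, axis=2).sum(axis=2)
    fwd = dist.argmin(axis=1)   # best candidate in des2 for each row of des1
    bwd = dist.argmin(axis=0)   # best candidate in des1 for each row of des2
    # Keep only mutual best matches below the distance threshold (cross-check).
    return [(i, j) for i, j in enumerate(fwd)
            if bwd[j] == i and dist[i, j] <= max_dist]
```

The cross-check (mutual best match) is the same outlier-rejection idea used by feature-based front ends before pose estimation.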
The rest of this paper is organized as follows: Section 2 reviews mainstream SLAM systems and local stereo matching methods. Section 3 describes the proposed SLAM system. Section 4 presents experimental results that demonstrate the effectiveness of the proposed framework. Section 5 concludes this paper.

II. RELATED WORK
In this section, we review the related work with respect to the two fields that we integrate within our framework, i.e. methods for visual SLAM and depth estimation with the use of stereo cameras.

A. VISUAL SLAM METHODS
Research on visual SLAM can be divided into three eras: the classical era, the algorithm-analysis era, and the multi-sensor fusion era.
In the classical era, the robot pose was estimated mainly based on probability theory, using methods such as the Extended Kalman Filter (EKF), the Particle Filter, and maximum likelihood estimation.
In the algorithm-analysis era, many new algorithms were born, and the main research topics included the observability, convergence, and consistency of these algorithms. Andrew Davison's MonoSLAM [8] was the first real-time monocular SLAM system. Its back end tracks the sparse feature points of the front end using an EKF and creates sparse maps online within a probabilistic framework.
PTAM [9], proposed by Georg Klein, was the first multithreaded SLAM algorithm; it divides tracking and mapping into two separate tasks that are processed in two parallel threads. PTAM was also the first scheme to use non-linear optimization, rather than a filter, as the back end. It introduced a keyframe mechanism in which several keyframes are structured together to optimize their trajectories and maps. LSD-SLAM, proposed in 2014, is a large-scale direct monocular SLAM method that directly processes the pixels of the image [5]. It proposes an image matching algorithm for estimating the similarity transformation and scale between keyframes, and it not only estimates the camera pose but also constructs a global, semi-dense, and accurate environmental map. ORB-SLAM2 is considered to be the most successful SLAM solution at this stage. It achieves high accuracy in real time, but generates a sparse map that does not meet the needs of some robotic applications [11].
LSD-SLAM (Large-Scale Direct Monocular SLAM) is a direct visual SLAM system proposed by Engel et al. [12]. It can build a consistent map of a large-scale environment, has good real-time performance on the CPU, and can realize semi-dense reconstruction of the scene. However, because its visual odometry is a direct method, the system is sensitive to illumination changes and is not suitable for fast-moving scenarios, and it has no mechanism for handling dynamic objects, so it cannot remove moving objects from the environment. Moreover, it is a monocular system with limited scalability, and it lacks a loop-closure detection module, so it cannot relocalize when revisiting a scene and easily accumulates errors in large-scale environments.

B. DEPTH MAP ESTIMATION
Depth information plays a key role in 3D reconstruction, navigation, and obstacle avoidance, because only when the depth of a pixel is determined can its spatial location be known exactly.
Binocular cameras generally consist of left and right cameras placed horizontally. The centers of the two apertures lie on the x-axis, and the distance between them is called the baseline b. Figure 2 shows that d = uL − uR, the difference between the horizontal coordinates of the same point in the left and right images, is called the disparity; it is inversely proportional to the depth, z = f b / d, where f is the focal length. From the disparity, the distance between a scene point and the camera can therefore be estimated.
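The disparity-depth relation can be sketched directly; the function name and the values in the usage note are illustrative, not from the paper:

```python
import numpy as np

def disparity_to_depth(disparity, f, b):
    """Convert a disparity map to depth via z = f * b / d.
    f: focal length in pixels, b: baseline in metres.
    Pixels with non-positive disparity are marked invalid (depth 0)."""
    depth = np.zeros_like(disparity, dtype=np.float64)
    valid = disparity > 0
    depth[valid] = f * b / disparity[valid]
    return depth
```

For example, with a KITTI-like focal length of about 718 pixels and a 0.54 m baseline, a disparity of 10 pixels corresponds to a depth of roughly 38.8 m.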

III. PROPOSED METHOD
Our method is mainly based on the ORB-SLAM2 approach, which we extend with dense mapping. Using stereo cameras, the dense mapping thread processes the keyframes provided by the ORB-SLAM2 system to reconstruct the geometry of the environment. The outline of our method, illustrated in Figure 3 [17], is as follows: 1) The tracking component extracts the ORB features of each stereo image pair Il and Ir and matches them based on these features.
The approach continuously tracks every new camera image (i.e., the current frame Fcurr) captured by the monocular or binocular camera, and determines whether the current frame is a keyframe KFi.
2) Each keyframe KFi is processed by the local mapping thread.
In addition to the map-point decision and local BA, the local mapping thread culls redundant keyframes in two ways, for loop closing and for dense mapping, respectively.

A. TRACKING THREAD
In the front end of the SLAM, feature-based VO matches the ORB feature points extracted from image frames, and then estimates the egomotion of the camera. The ORB feature points are extremely fast to compute and match in addition to having good invariance to the viewpoint [7]. Thus, in accordance with the ORB-SLAM2, we use the ORB feature throughout the tracking thread, local mapping, and loop closing of the whole system.
Feature-based VO methods can be divided into 2D-2D, 3D-2D, and 3D-3D methods according to the dimensions of the matched features involved. These methods optimize the camera orientation R ∈ SO(3) and position t ∈ ℝ³ in order to minimize the reprojection error between consecutive frames [13] [18].
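The reprojection error minimized by these methods can be written out concretely. The sketch below evaluates it for a set of 3D points and their 2D observations under a plain pinhole model; the function name is ours, and lens distortion is ignored for simplicity:

```python
import numpy as np

def mean_reprojection_error(R, t, K, pts3d, pts2d):
    """Mean of || u_i - pi(K (R P_i + t)) || over all matched points,
    where pi denotes the perspective division.
    pts3d: (N, 3) world points, pts2d: (N, 2) observed pixels."""
    cam = pts3d @ R.T + t              # transform world points to camera frame
    proj = cam @ K.T                   # apply the intrinsic matrix
    proj = proj[:, :2] / proj[:, 2:3]  # perspective division to pixels
    return np.linalg.norm(proj - pts2d, axis=1).mean()
```

A VO back end searches for the R and t that drive this quantity toward its minimum, typically with Gauss-Newton or Levenberg-Marquardt iterations.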
As shown in Figure 3,

C. LOCAL STEREO MATCHING
To obtain a dense 3D point cloud, we use a weighted guided image filtering (WGIF) method based on local stereo matching, which is accurate and suitable for real-time applications [23]. WGIF is a novel approach for fast and accurate matching in binocular stereo systems. It consists of four main steps: matching cost computation, cost aggregation, disparity computation, and disparity refinement [24] [25]. The specific calculation process is described in [26].
Matching cost computation algorithms are similarity measures evaluated over different disparities. We propose a multi-measure stereo matching algorithm based on the fusion of absolute differences (AD), the census transform, and gradient measures.
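A minimal sketch of such a fused cost is given below. The census window size, the np.roll alignment (which wraps around at the image border), and the weight values are illustrative simplifications, not the parameters of the actual system:

```python
import numpy as np

def census(img, r=1):
    """Census transform: one bit per neighbour, set when neighbour < centre."""
    code = np.zeros(img.shape, np.uint64)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if dy == 0 and dx == 0:
                continue
            shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
            code = (code << np.uint64(1)) | (shifted < img).astype(np.uint64)
    return code

def fused_cost(left, right, d, alpha=0.3, beta=0.4, gamma=0.3):
    """Per-pixel cost at disparity d: weighted sum of AD, census Hamming
    distance, and x-gradient difference, with alpha + beta + gamma = 1."""
    right_d = np.roll(right, d, axis=1)  # align right image by d (wraps!)
    ad = np.abs(left.astype(float) - right_d.astype(float))
    xor = census(left) ^ census(right_d)
    h, w = left.shape
    ham = np.unpackbits(xor.view(np.uint8).reshape(h, w, 8), axis=2).sum(axis=2)
    grad = np.abs(np.gradient(left.astype(float), axis=1)
                  - np.gradient(right_d.astype(float), axis=1))
    return alpha * ad + beta * ham.astype(float) + gamma * grad
```

Because AD, census, and gradient respond to different image structures (intensity, local ordering, and edges), their weighted fusion is more robust than any single measure.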
where α, β, and γ are the weights of the AD, census, and gradient costs, with α + β + γ = 1. The regularization parameters are adjusted adaptively using the Canny method, which further reduces the errors: the Canny algorithm detects the edges of the image, and the edge weight is defined accordingly. The aggregated cost follows, and a disparity map is generated using the winner-take-all (WTA) strategy; this map is subsequently refined using a densification method to reduce errors. The angle between adjacent neighbour vectors in the sequence is then calculated; if the largest such angle exceeds the threshold (120°), the data point p is considered a boundary point; otherwise, it is not.
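For concreteness, the toy local matcher below aggregates plain AD costs over a box window and selects disparities by WTA. It stands in for the full AD-census-gradient pipeline with adaptive Canny weights described above, which we do not reproduce here:

```python
import numpy as np

def box_mean(img, r):
    """Mean over a (2r+1) x (2r+1) window using an integral image."""
    n = 2 * r + 1
    pad = np.pad(img, r, mode='edge')
    ii = np.zeros((pad.shape[0] + 1, pad.shape[1] + 1))
    ii[1:, 1:] = pad.cumsum(axis=0).cumsum(axis=1)
    return (ii[n:, n:] - ii[:-n, n:] - ii[n:, :-n] + ii[:-n, :-n]) / n ** 2

def wta_disparity(left, right, max_disp, r=2):
    """Winner-take-all: per pixel, the disparity with minimal aggregated cost."""
    h, w = left.shape
    cost = np.full((max_disp + 1, h, w), np.inf)
    for d in range(max_disp + 1):
        # Absolute difference between left pixels and right pixels shifted by d.
        ad = np.abs(left[:, d:].astype(float) - right[:, :w - d].astype(float))
        cost[d, :, d:] = box_mean(ad, r)   # cost aggregation over the window
    return cost.argmin(axis=0)             # WTA over the cost volume
```

The (max_disp + 1, H, W) array is the usual cost volume; the refinement and densification steps of the paper would operate on the argmin result.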
Then, the feature threshold of the point cloud is calculated using a multi-measure fusion approach, and the feature is judged. For any point p in space, the local average distance within its k-neighborhood is as follows.
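A brute-force sketch of this k-neighborhood statistic is given below (the function name is ours; a KD-tree would replace the dense distance matrix for large clouds):

```python
import numpy as np

def local_mean_distance(points, k=8):
    """For each point, the mean Euclidean distance to its k nearest
    neighbours. points: (N, 3) array; brute-force O(N^2) search."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    dist.sort(axis=1)                 # column 0 is the zero self-distance
    return dist[:, 1:k + 1].mean(axis=1)
```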
We use principal component analysis to estimate the normal vectors. The surface variation of point p within its k-neighborhood is defined as follows.
After the 3D point cloud data of the keyframes are registered in a unified world coordinate system, the overlapping image regions between keyframes lead to redundant point cloud information, resulting in redundancy and inconsistency in the data. Therefore, after registration, the point cloud data must be fused to eliminate overlapping information and establish a complete, redundancy-free 3D point cloud model. We set a relatively small distance threshold and compare the Euclidean distance between each pair of points in the two registered point clouds; if the distance is less than the threshold, the two points are replaced by their average.
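The distance-threshold fusion described above can be sketched as follows. This is a brute-force reading of the text: the function name is ours, and we assume the "average" is the midpoint of each matched pair:

```python
import numpy as np

def fuse_point_clouds(cloud_a, cloud_b, thresh=0.01):
    """Merge two registered clouds: a point of A whose nearest point in B
    lies within `thresh` is replaced by the midpoint of the pair, and the
    matched B point is dropped; all other points are kept unchanged."""
    diff = cloud_a[:, None, :] - cloud_b[None, :, :]
    dist = np.linalg.norm(diff, axis=2)          # (len(A), len(B)) distances
    nearest = dist.argmin(axis=1)
    close = dist[np.arange(len(cloud_a)), nearest] < thresh
    fused = cloud_a.copy()
    fused[close] = 0.5 * (cloud_a[close] + cloud_b[nearest[close]])
    keep_b = np.ones(len(cloud_b), dtype=bool)
    keep_b[nearest[close]] = False               # drop duplicated B points
    return np.vstack([fused, cloud_b[keep_b]])
```

In practice a KD-tree (or a voxel grid, as in common point-cloud libraries) replaces the O(N·M) distance matrix when fusing keyframe clouds online.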

IV. IMPLEMENTATION AND EXPERIMENT
The images used in this paper are collected by a binocular camera.
The CPU of the platform is an i7 running at 2.6 GHz with a GPU (the GPU is used to accelerate stereo matching). All experiments were performed on image sequences of the well-known KITTI dataset [30] and on real scenes.

A. KITTI DATASET
Note that the KITTI dataset is a challenging benchmark for stereo SLAM, as it contains fast movement. We run the stereo SLAM with local stereo matching and obtain a depth map for each stereo pair. Our method has been evaluated on the KITTI dataset, which contains stereo sequences recorded from a car in city, road, residential, and campus environments. As evidenced by Figure 5, we show a challenging trajectory that includes a loop closure, from sequence 05 of KITTI, which has a public ground truth; as shown in Figure 5(c), our system achieves a relative error below 0.5%. Table II shows the comparison between the proposed method and the state of the art in terms of the root mean square error (RMSE) of the estimated trajectory and the mean processing time.
Among the compared methods, LSD-SLAM uses a binocular camera, SVO uses a monocular camera, and RGBD-SLAM uses an RGB-D camera and directly outputs a dense map using the depth map. As can be seen from Table II, the root mean square error of the proposed method is lower than that of ORB-SLAM2, while the mean time is longer than that of ORB-SLAM2 because of the additional computation in dense mapping and stereo matching. The mean time is also longer than that of RGBD-SLAM (dense SLAM), because the latter obtains the depth map directly.
To better present the results of the 3D reconstruction, we also evaluate it on four different scenes in the dataset (city, road, residential, and campus), as shown in Figure 6. Table III shows the number of points and the storage size of the dense map.

B. REAL SCENE ONLINE EXPERIMENT
In addition to the KITTI dataset, we take indoor and outdoor scenes as experimental environments and carry out online experiments with a hand-held binocular camera. A Myntai binocular camera (standard version S1020) was used to conduct the online experiments under the ROS system.
As shown in Figure

V. CONCLUSIONS
In order to realize visual navigation and 3D reconstruction, a stereo SLAM system with dense mapping has been introduced. The map constructed by the proposed method has high accuracy and can accurately describe the spatial environment; it can fully meet the needs of autonomous positioning and navigation for mobile robots.