Visual SLAM Algorithms and their Application for AR, Mapping, Localization and Wayfinding

Visual simultaneous localization and mapping (vSLAM) algorithms use a device's camera to estimate the agent's position and reconstruct structures in an unknown environment. As an essential part of the augmented reality (AR) experience, vSLAM enhances the real-world environment through the addition of virtual objects, based on localization (location) and environment structure (mapping). From both technical and historical perspectives, this paper categorizes and summarizes some of the most recent visual SLAM algorithms proposed in research communities, while also discussing their applications in augmented reality, mapping, navigation, and localization.


Introduction
Visual SLAM, according to Fuentes-Pacheco et al. [1], is a set of SLAM techniques that uses only images to map an environment and determine the position of the observer. Compared to sensors used in traditional SLAM, such as GPS (Global Positioning System) or LIDAR [2], cameras are more affordable and are able to gather more information about the environment, such as colour, texture, and appearance. In addition, modern cameras are compact and have low cost and low power consumption. Examples of recent applications that employ vSLAM are the control of humanoid robots [3], unmanned aerial and land vehicles [4], lunar rovers [5], autonomous underwater vehicles [6] and endoscopy [7]. Depending on the camera type, there are three basic types of SLAM: monocular, stereo, and RGB-D. Stereo SLAM is a multi-camera SLAM that can achieve a certain degree of trajectory accuracy. Additionally, stereo SLAM has the advantage of being more versatile, as opposed to RGB-D SLAM, which is more sensitive to sunlight and is mainly used indoors. The last two decades have seen significant success in developing algorithms such as MonoSLAM [8], PTAM [9], PTAM-Dense [10], DTAM [11] and SLAM++ [12]. However, most systems have been developed for motionless environments, and their robustness is still a concern in dynamic environments. Because they assume that the camera is the only moving object in a stationary scene, these SLAM systems [13] are typically not applicable there. Moving objects, as a consequence, affect the system's ability to estimate camera poses. Additionally, the extra object motion introduces calculation errors and reduces the accuracy of trajectory estimation due to the increased computational load. In such environments, the SLAM algorithm is required to deal with possible errors and a certain degree of uncertainty characteristic of sensory measurements.
Moreover, for virtual objects to be properly anchored in the real environment in an AR (Augmented Reality) [14] experience, it is necessary to apply tracking techniques, that is, to dynamically determine the viewer's pose (position and orientation) in relation to the actual elements of the scene. An alternative is the application of SLAM techniques, which aim precisely at the creation and updating of a map, as well as the localization of the observer in relation to the structure of the environment. This confluence between visual SLAM and AR was the motivation for this survey. The objective of this research is to carry out a survey of the main visual SLAM algorithms, as well as their applications in AR, mapping, localization and wayfinding. The main characteristics of the visual SLAM algorithms were identified, and the main AR applications built on visual SLAM were found and analysed. As opposed to presenting a general analysis of SLAM, this survey provides an in-depth review of different visual SLAM algorithms. The survey also includes various datasets that might be considered for evaluation and different types of evaluation metrics. Existing studies in this area tend to describe only one SLAM algorithm, and some of them are rather old. To address this, a complete survey describing seminal and more recent SLAM algorithms was produced.
Even if some surveys include a description of different SLAM algorithms (e.g., [15], [16] and [17]), this survey provides an expanded overview of SLAM algorithms, including those recently developed, together with a set of datasets that could be used to evaluate multiple SLAM algorithms and a set of evaluation metrics (Table 1). Additionally, the limitations of the evaluation metrics have been identified, which will be explored further in future work. Through this article, we hope to help readers better understand different SLAM algorithms and how they might be applied in different fields.
This survey is organized as follows. SLAM applications are introduced in Section II. In Section III, various SLAM algorithms are discussed. In Section IV, a table of different SLAM features is introduced. Section V includes a discussion of various datasets that could be used to experiment with SLAM algorithms. Section VI includes a description of the two most commonly used evaluation metrics. Section VII is mainly focused on discussions about SLAM, and Section VIII concludes this survey.

SLAM algorithms make use of data from different sensors. Visual SLAM is a SLAM technique that uses only visual sensors, which may be a monocular RGB camera [18], a stereo camera [19], an omnidirectional camera (which captures images simultaneously in all 360-degree directions) [20] or an RGB-D camera (which captures depth information per pixel in addition to RGB images) [21]. In this section, Localization, Mapping, and Wayfinding, the three main categories of vSLAM applications, are described in more depth, along with some relevant algorithms applicable to each category.

Localization
Localization systems assist users in identifying their location and orientation [30]. It is then possible to correct and adjust the generated map for the accumulated noise [31].

Mapping
The vSLAM concept is fundamental to any kind of robotic application where the robot must traverse a new environment and generate a map. The technique is not limited to robots but can be used on smartphones and their cameras as well. vSLAM would be one aspect of the pipeline needed by some advanced AR use cases, for example, where virtual worlds need to be accurately mapped onto real environments.

Wayfinding
Wayfinding systems must be capable of planning and communicating effective routes.

In order to estimate the camera's pose, a bundle-adjustment algorithm [36] is used, and visual features are extracted from images and compared against each other. It is possible to estimate camera pose in real time thanks to PL-SLAM [37], which separates the tasks of tracking and mapping into two separate threads and processes them on a dual-core computer. Recent SLAM methods align the whole image rather than matching features. However, these types of methods are typically less accurate than feature-based SLAM methods for estimating pose.
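To make the role of the reprojection error that bundle adjustment minimizes concrete, the following is a minimal sketch in pure Python. The pinhole model, the camera intrinsics (fx, fy, cx, cy) and the translation-only pose are simplifying assumptions for illustration, not part of any specific algorithm surveyed here.

```python
import math

def project(point_3d, pose_t, fx, fy, cx, cy):
    """Project a 3D point through a pinhole camera whose pose is a pure
    translation pose_t (a toy simplification: no rotation)."""
    x, y, z = (p - t for p, t in zip(point_3d, pose_t))
    return (fx * x / z + cx, fy * y / z + cy)

def reprojection_error(observed_px, point_3d, pose_t,
                       fx=525.0, fy=525.0, cx=319.5, cy=239.5):
    """Pixel distance between an observed feature and the projection of
    its 3D landmark; bundle adjustment minimizes the sum of squares of
    these residuals over all poses and landmarks jointly."""
    u, v = project(point_3d, pose_t, fx, fy, cx, cy)
    return math.hypot(observed_px[0] - u, observed_px[1] - v)

# A landmark 2 m in front of the camera, observed exactly where the
# model predicts it, gives a zero residual:
err = reprojection_error((319.5, 239.5), (0.0, 0.0, 2.0), (0.0, 0.0, 0.0))
print(err)  # 0.0
```

A real bundle adjuster would stack these residuals for every camera-landmark pair and solve the resulting nonlinear least-squares problem, typically with Levenberg-Marquardt.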

SLAM Algorithms
In general, visual SLAM algorithms have three basic modules: initialization [38], tracking and mapping [39]. The initialization consists of defining the global coordinate system of the environment to be mapped, as well as reconstructing part of its elements, which are used as a reference for the beginning of tracking and mapping. This step can be quite challenging for some visual SLAM applications (Figure 1). The next section of this paper is split into three categories: monocular based, stereo focused, and monocular and stereo focused vSLAM algorithms. Each algorithm is described in detail along with its advantages and disadvantages.
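The control flow between the three modules can be sketched as a minimal skeleton. Everything here (class name, placeholder landmarks and poses) is hypothetical scaffolding to show how initialization runs once before tracking and mapping take over, not a real implementation.

```python
class MinimalVSLAM:
    """Toy skeleton of the three basic vSLAM modules."""

    def __init__(self):
        self.initialized = False
        self.map_points = []   # reconstructed 3D structure
        self.poses = []        # estimated camera trajectory

    def initialize(self, frame):
        # Define the global coordinate system from the first frame and
        # reconstruct an initial set of landmarks as the reference.
        self.map_points = [("landmark", frame)]
        self.poses = [("origin_pose", frame)]
        self.initialized = True

    def track(self, frame):
        # Estimate the new camera pose relative to the existing map.
        self.poses.append(("pose", frame))

    def map_update(self, frame):
        # Extend the map with newly observed structure.
        self.map_points.append(("landmark", frame))

    def process(self, frame):
        if not self.initialized:
            self.initialize(frame)
        else:
            self.track(frame)
            self.map_update(frame)

slam = MinimalVSLAM()
for frame_id in range(3):
    slam.process(frame_id)
print(len(slam.poses), len(slam.map_points))  # 3 3
```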

Monocular based
Monocular SLAM is a type of SLAM that relies exclusively on a monocular camera. An accurate understanding of the camera's motion is required in order to achieve this. The location of the object can then be fed into a standard 3D graphics engine, which renders the image correctly. It is important to keep in mind that pixels that do not belong to the model can negatively affect tracking quality. Photometric errors over a certain threshold must therefore be excluded from the analysis. As the least-squares method converges, this threshold is lowered with each iteration. As a result, this scheme makes observing unmodelled objects possible while tracking densely.

Superpoints are placed randomly over the RGB input image, and their range is iteratively extended as necessary. In this case, the target is not separated individually but instead subdivided. Using a fast implementation method [14], the image is segmented into superpixels by SLIC [13]. This approach can allow the semantic segmentation network to segment otherwise unrecognizable targets with greater accuracy, as well as pinpoint the motion feature point area more precisely, eliminating whole-contour feature points of targets that are due to partial joint motion. In comparison to ORB-SLAM2, PMDS-SLAM can significantly improve a low dynamic sequence's accuracy by more than 27.5%.
More than 90% of the improvement can be achieved for scenes that have high dynamics.

Any object-based detector can be used to detect semantic objects in VPS-SLAM. After moving along its path, the robot obtains new stereo images that are processed using the described method, resulting in an incremental map whose size increases with each successive iteration. The Tracker module is responsible for this functionality. A second execution thread, called Mapper, runs concurrently.

Utilizing a direct method of estimating the camera trajectory, this method reduces the time consumed by feature extraction and tracking. To remove the objects in motion, the authors propose a moving consistency check module, an alternative to the feature-based method, which measures match points by re-projection error. In Figure 8, static point extraction is followed by dynamic point removal.
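A moving consistency check of the kind just described can be sketched as follows: points whose observed image position deviates strongly from the position predicted by the estimated camera motion are flagged as dynamic. The threshold value and the 2D pixel representation are illustrative assumptions, not the exact formulation of any cited system.

```python
def classify_dynamic(matches, threshold=2.0):
    """Split matched feature points into static/dynamic by reprojection
    error. `matches` is a list of (predicted_px, observed_px) pairs,
    where predicted_px is where the point should land under the
    estimated camera motion if it were static."""
    static, dynamic = [], []
    for predicted, observed in matches:
        err = ((predicted[0] - observed[0]) ** 2 +
               (predicted[1] - observed[1]) ** 2) ** 0.5
        (dynamic if err > threshold else static).append(observed)
    return static, dynamic

matches = [
    ((100.0, 100.0), (100.4, 100.3)),  # consistent with camera motion
    ((200.0, 150.0), (212.0, 149.0)),  # large residual: likely a moving object
]
static, dynamic = classify_dynamic(matches)
print(len(static), len(dynamic))  # 1 1
```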

Monocular and Stereo Based
Monocular and stereo based vSLAM algorithms can perform mapping, tracking and wayfinding by using either a sequence of images or just feature points.
ORB-SLAM2. The ORB-SLAM2 system [43] is an integrated SLAM system for monocular, stereo and RGB-D cameras. Whenever environmental conditions do not change significantly in the long run, its localization mode can be used to enable lightweight long-term localization. In this mode, the local mapping and loop closure threads are deactivated, and the tracking process continuously relocalizes the camera if necessary.

As part of this mode, visual odometry matches are also used: 3D points created from the previous frame are matched with ORB features in the current frame. Such matches make localization robust to regions that have not been mapped, but drift may accumulate. Matching to map points, in contrast, ensures drift-free localization with respect to the existing map.
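ORB descriptors are binary strings (256 bits in the real system) compared by Hamming distance, so the matching step above can be sketched as a brute-force nearest-neighbour search. The small 8-bit descriptors and the distance cutoff here are toy values for illustration only.

```python
def hamming(d1, d2):
    """Hamming distance between two binary descriptors given as ints."""
    return bin(d1 ^ d2).count("1")

def match_descriptors(current, stored, max_dist=50):
    """Brute-force matching of current-frame binary descriptors against
    stored map-point descriptors; returns (query, match, distance)."""
    matches = []
    for i, d in enumerate(current):
        best_j, best_dist = None, max_dist + 1
        for j, m in enumerate(stored):
            dist = hamming(d, m)
            if dist < best_dist:
                best_j, best_dist = j, dist
        if best_j is not None:
            matches.append((i, best_j, best_dist))
    return matches

current = [0b10110010, 0b00001111]
stored  = [0b10110011, 0b00001110]
print(match_descriptors(current, stored))  # [(0, 0, 1), (1, 1, 1)]
```

Real systems avoid the quadratic brute-force search with bag-of-words vocabularies or grid-based search windows, but the distance criterion is the same.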
DynaSLAM. The capability of dynamic object detection and background inpainting is added to ORB-SLAM2 by DynaSLAM, whether the input is monocular, stereo or RGB-D. For every removed dynamic object, background inpainting is used to reconstruct a realistic image by taking information from previous views and painting over the occluded background. After the map has been created, the synthetic frames may be used to relocalize and track cameras, as well as for applications such as virtual and augmented reality.

Finally, the main limitation of DynaSLAM is that it is less accurate in scenes with dynamic objects.
ORB-SLAM3. Using pin-hole and fisheye lens models, ORB-SLAM3 [44] is the first system that can perform visual, visual-inertial, and multimap SLAM.
It is the first system to rely on maximum a posteriori (MAP) estimation during the initialization of the inertial measurement unit. When the camera is relocalized in a stored map, the active map is switched. If the active map is not initialized after a certain time, it becomes inactive and is stored as non-active until it can be re-initialized from scratch.
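The multimap bookkeeping described above can be modelled as a small state machine: on tracking loss the active map is shelved and a fresh one is started, and a successful relocalization inside a stored map makes that map active again. The class and method names below are hypothetical, a toy model of the behaviour rather than ORB-SLAM3's actual Atlas implementation.

```python
class Atlas:
    """Toy model of multimap bookkeeping: one active map, a store of
    non-active maps, switching on tracking loss / relocalization."""

    def __init__(self):
        self.non_active = []
        self.active = "map_0"
        self._count = 1

    def on_tracking_lost(self):
        # The active map is stored as non-active and a fresh map started.
        self.non_active.append(self.active)
        self.active = f"map_{self._count}"
        self._count += 1

    def on_relocalized(self, map_name):
        # Relocalization inside a stored map makes it active again.
        if map_name in self.non_active:
            self.non_active.remove(map_name)
            self.non_active.append(self.active)
            self.active = map_name

atlas = Atlas()
atlas.on_tracking_lost()          # map_0 shelved, map_1 active
atlas.on_relocalized("map_0")     # back in known territory: switch
print(atlas.active, atlas.non_active)  # map_0 ['map_1']
```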

ORB-SLAM3 is a monocular and stereo based vSLAM algorithm that outperforms the other systems considered here. It is the first system that can perform visual, visual-inertial, and multimap SLAM, and the first to rely on maximum a posteriori (MAP) estimation during the initialization of the inertial measurement unit, achieving two to ten times higher accuracy than other approaches in small and large, indoor and outdoor environments. Table II shows the environment in which each algorithm works, as well as each algorithm's resolution and estimation method.

Datasets
Open-source datasets that may be used to evaluate SLAM algorithms will be discussed in this section.

Relative pose error (RPE). The relative pose errors along a sequence of n camera poses are computed over a fixed time interval Δ, giving m = n − Δ individual errors. The relative pose error at time step i is E_i = (Q_i^{-1} Q_{i+Δ})^{-1} (P_i^{-1} P_{i+Δ}), where P_{1:n} is the estimated trajectory and Q_{1:n} the ground truth. Over all time indices of the translational component, the root mean squared error (RMSE) is calculated as

RMSE(E_{1:m}) = ( (1/m) · Σ_{i=1}^{m} ||trans(E_i)||² )^{1/2},

where trans(E_i) denotes the translational component of the relative pose error E_i. Instead of the RMSE, the mean error can be evaluated; it is also possible to compute the median rather than the mean, which gives outliers less influence.
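As a concrete illustration, the following is a translation-only toy version of the RPE RMSE described above: it compares the displacement accumulated over Δ steps in the estimate against the ground truth and takes the RMSE over the m = n − Δ errors. Using 2D positions instead of full SE(3) poses is a simplification for readability.

```python
import math

def rpe_rmse(gt, est, delta=1):
    """Translation-only relative pose error: RMSE over the m = n - delta
    differences between estimated and ground-truth displacements
    accumulated over `delta` steps. gt/est are lists of 2D positions."""
    n = len(gt)
    m = n - delta
    sq = 0.0
    for i in range(m):
        gt_disp = [g2 - g1 for g1, g2 in zip(gt[i], gt[i + delta])]
        est_disp = [e2 - e1 for e1, e2 in zip(est[i], est[i + delta])]
        sq += math.dist(gt_disp, est_disp) ** 2
    return math.sqrt(sq / m)

gt  = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
est = [(0.0, 0.0), (1.1, 0.0), (2.0, 0.0), (3.0, 0.0)]  # one noisy step
print(round(rpe_rmse(gt, est), 4))  # 0.0816
```

Note how the single noisy pose produces errors in the two intervals that touch it, which the RMSE then aggregates.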
This expression has quadratic computational complexity in the trajectory length. Accordingly, it was proposed [46] that it could be approximated by evaluating a fixed number of randomly sampled relative pose pairs.
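The sampling approximation can be sketched as follows, again in a translation-only 2D simplification: instead of evaluating all (index, interval) pairs, a fixed number of pairs is drawn at random. The sample count and seeded RNG are illustrative choices.

```python
import math
import random

def sampled_rpe_rmse(gt, est, num_samples=100, seed=0):
    """Approximate the all-pairs relative pose error (quadratic in the
    trajectory length) by averaging over a fixed number of randomly
    sampled (index, interval) pairs; translation-only toy version."""
    rng = random.Random(seed)
    n = len(gt)
    sq = 0.0
    for _ in range(num_samples):
        i = rng.randrange(n - 1)
        delta = rng.randrange(1, n - i)  # interval stays inside the trajectory
        gt_disp = [g2 - g1 for g1, g2 in zip(gt[i], gt[i + delta])]
        est_disp = [e2 - e1 for e1, e2 in zip(est[i], est[i + delta])]
        sq += math.dist(gt_disp, est_disp) ** 2
    return math.sqrt(sq / num_samples)

gt  = [(float(i), 0.0) for i in range(50)]
est = [(i + 0.05 * i, 0.0) for i in range(50)]  # 5% scale drift
print(round(sampled_rpe_rmse(gt, est), 3))
```

With a fixed seed the estimate is deterministic; in practice one would average over enough samples for the approximation error to be negligible.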
Absolute trajectory error (ATE). For vSLAM systems, the absolute distance between the estimated and the ground-truth trajectory is another important metric, which can be used to assess the global consistency of the estimated trajectory. Because both trajectories can be specified in arbitrary coordinate frames, they must be aligned first. With the Horn method [50], one can obtain the rigid-body transformation S that maps the estimated trajectory P_{1:n} onto the ground-truth trajectory Q_{1:n} in the least-squares sense. Given this transformation, the absolute trajectory error at time step i can be computed as

F_i = Q_i^{-1} S P_i.

For the translational components, it was proposed [47] to evaluate the RMSE over all time indices, analogously to the relative pose error.
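The full Horn method solves the 3D alignment in closed form; for a planar trajectory the optimal rotation reduces to a single angle that can be computed directly from the cross-covariance of the centred point sets. The sketch below uses this 2D analogue as a simplifying assumption, so it illustrates the align-then-RMSE structure of the ATE without a full SE(3) solver.

```python
import math

def ate_rmse(gt, est):
    """Absolute trajectory error with a closed-form 2D rigid alignment
    (the planar analogue of the Horn method): least-squares rotation and
    translation mapping the estimate onto the ground truth, then RMSE
    of the remaining translational residuals."""
    n = len(gt)
    gcx = sum(q[0] for q in gt) / n
    gcy = sum(q[1] for q in gt) / n
    ecx = sum(p[0] for p in est) / n
    ecy = sum(p[1] for p in est) / n
    # Optimal rotation angle from the cross-covariance of centred points.
    dot = cross = 0.0
    for (qx, qy), (px, py) in zip(gt, est):
        px, py, qx, qy = px - ecx, py - ecy, qx - gcx, qy - gcy
        dot += px * qx + py * qy
        cross += px * qy - py * qx
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    sq = 0.0
    for (qx, qy), (px, py) in zip(gt, est):
        # Rotate the centred estimate, re-add the ground-truth centroid.
        rx = c * (px - ecx) - s * (py - ecy) + gcx
        ry = s * (px - ecx) + c * (py - ecy) + gcy
        sq += (qx - rx) ** 2 + (qy - ry) ** 2
    return math.sqrt(sq / n)

# An estimate that is the ground truth rotated by 90 degrees and shifted
# should have zero ATE once aligned:
gt  = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
est = [(5.0, 1.0), (5.0, 2.0), (5.0, 3.0)]
print(round(ate_rmse(gt, est), 6))  # 0.0
```

This also shows why alignment matters: without it, the rigid offset between coordinate frames would dominate the error and say nothing about the estimate's quality.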