Indoor simultaneous localization and mapping based on fringe projection profilometry

Simultaneous Localization and Mapping (SLAM) plays an important role in outdoor and indoor applications ranging from autonomous driving to indoor robotics. Outdoor SLAM has been widely deployed with the assistance of LiDAR or GPS. For indoor applications, however, the LiDAR technique does not satisfy the accuracy requirement and GPS signals are lost. An accurate and efficient scene-sensing technique is therefore required for indoor SLAM. As one of the most promising 3D sensing techniques, fringe projection profilometry (FPP) offers clear opportunities for indoor SLAM, but methods to date have not fully leveraged the accuracy and speed of sensing that such systems offer. In this paper, we propose a novel FPP-based indoor SLAM method built on the coordinate transformation relationship of FPP, in which a 2D-to-3D descriptor assists mapping and localization. The correspondences generated by matching descriptors are used for fast and accurate mapping, and the transformation estimated between the 2D and 3D descriptors is used to localize the sensor. The experimental results demonstrate that the proposed indoor SLAM can achieve localization and mapping accuracy of around one millimeter.

environment. Recently, the development of LiDAR and the assistance of GPS signals have enabled outdoor SLAM applications to be solved successfully [10,11]. However, indoor SLAM cannot rely on external GPS signals, and its carrier is usually a small, low-cost robot [12]. Therefore, it is critical to seek a low-cost, small-volume, and high-accuracy sensor for indoor SLAM.
In SLAM, 3D scenes are perceived by using LiDAR [13], stereo vision [14], or time-of-flight (ToF) [15] techniques. LiDAR has been widely used in outdoor SLAM for autonomous driving due to its large working distance and high robustness. However, indoor SLAM using LiDAR is challenging because of the sparse nature of the obtained 3D data and the computational complexity, which make it difficult to provide real-time dense maps for indoor service robots [16]. Alternatively, stereo vision uses two cameras to simulate the human eyes according to the principle of bionics and reconstructs 3D data from images collected at different angles. However, the shortcomings of stereo vision in terms of accuracy, stability, and field of view limit its application [17,18]. The ToF technique obtains target depth by probing the round-trip time of light emitted and received by each pixel on the imaging chip, and has been widely used in commercial RGB-D sensors (e.g., Microsoft Kinect) [15]. However, it suffers from a difficult trade-off between miniaturization and high pixel resolution of the imaging chip, large measurement errors at close range, and slow measurement speed. In short, the existing 3D sensors used in SLAM do not satisfy the requirements of indoor millimeter (mm)-level SLAM, i.e., localizing the sensor and interacting with the indoor scene.
Optical metrology has been widely used for accurate and flexible 3D perception. Among the many state-of-the-art techniques, fringe projection profilometry (FPP) can achieve high-accuracy (i.e., micrometer-level) and high-speed (i.e., thousands of fps) indoor scene perception because of its sub-pixel-accurate camera-projector correspondences and the binary defocusing technology [19,20]. Compared with other techniques, FPP also has the inherent advantages of a simple hardware configuration, flexibility in implementation, and dense point-cloud acquisition [21].
It is appealing to utilize the FPP sensor to achieve indoor SLAM. However, two challenges stand in the way of applying the FPP sensor to indoor SLAM. First, a fast and robust multi-view registration is required for FPP to sense and map a large scene. Second, estimating the camera localization based on the FPP coordinate system, and the subsequent localization optimization, remain unaddressed.
To tackle these challenges, in this paper we propose an FPP-based indoor SLAM method. First, the FPP technique is introduced to perceive the 3D indoor scenes. Based on the perceived FPP data, a corresponding algorithm is proposed for millimeter (mm)-level indoor SLAM, which utilizes the 2D-to-3D descriptor and the coordinate transformation relationship of FPP. Mapping is achieved by solving the transformation matrix based on the correspondences generated by matching descriptors. Localization of the sensor is achieved by transformation estimation between the 2D and 3D descriptors. With our method, users or robots can simply obtain accurate indoor maps and immediate feedback during perceiving. We also show both qualitative and quantitative results relating to various aspects of our FPP-based indoor SLAM method. The rest of this paper is organized as follows. Section 2 presents the principle of the proposed method. Section 3 provides experiments. Section 4 concludes this paper.

Principle
The flowchart of the proposed FPP-based indoor SLAM method is provided in Fig. 1. It includes a front-end that uses the FPP technique to perceive the unknown scene, and a corresponding back-end algorithm that uses the 2D-to-3D descriptor and the coordinate transformation relationship of FPP to estimate the pose of the sensor and construct the map of the indoor scene. Specifically, in the back-end, the local mapping module generates local maps by minimizing errors between the transformed and original point clouds.
The corresponding location and pose of the sensor can be calculated by utilizing the coordinate map given by FPP. The optimizer then performs localization optimization to eliminate the cumulative errors and revise the local mapping. Finally, the multi-view point clouds are aligned into global maps.

The front-end: FPP technique
The procedure of the FPP technique is provided in Fig. 2. A typical FPP system consists of a projector and a camera. The former projects coded fringe patterns onto the measured object, and the latter synchronously captures the height-modulated fringe patterns. The projector can be seen as an inverse camera, and the patterns help to establish distinct and accurate camera-projector correspondences regardless of whether the surface is textured. The desired phase can be calculated from the captured fringe patterns by using transform-based [22] or phase-shifting algorithms [23]. The calculated phase is wrapped into the range (−π, π], so a phase unwrapping process is necessary to obtain the absolute phase [20]. The image correspondence between the projector and the camera can then be established from the absolute phase [24], and the 3-D shape can be reconstructed from this correspondence by the triangulation method, combining it with the system parameters calibrated beforehand [25].
As aforementioned, both the system calibration and the 3-D reconstruction rely on the retrieved phase, i.e., the wrapped phase and the absolute phase. The phase-shifting algorithm is preferred, and the captured sinusoidal fringe patterns can be described by

\[I_n(u^c, v^c) = a(u^c, v^c) + b(u^c, v^c)\cos[\phi(u^c, v^c) + \delta_n],\]

where $(u^c, v^c)$ denotes the pixel coordinate, $N$ denotes the number of phase steps, $\delta_n = 2\pi(n-1)/N$ denotes the phase-shift amount, and $a$, $b$, and $\phi$ denote the background, the fringe amplitude, and the desired phase, respectively. The desired phase can be calculated by using a least-squares algorithm:

\[\phi(u^c, v^c) = -\arctan\frac{\sum_{n=1}^{N} I_n(u^c, v^c)\sin\delta_n}{\sum_{n=1}^{N} I_n(u^c, v^c)\cos\delta_n}.\]

To obtain the correct absolute phase, the Gray-code-based phase unwrapping method is preferred due to its high robustness [26], and a set of Gray-code patterns is used to determine the fringe order of each fringe period. The absolute phase can be obtained by [27]

\[\Phi(u^c, v^c) = \phi(u^c, v^c) + 2\pi K(u^c, v^c),\]

where $K$ denotes the determined fringe order.
In this paper, both the projector and the camera are regarded as classical pinhole models, and the inverse-camera-based method is applied to reconstruct the 3D points from the absolute phase. The pinhole camera model maps 3D points in the world coordinate frame $(x_w, y_w, z_w)$ to 2D pixel coordinates $(u, v)$. This gives the equations

\[s \begin{bmatrix} u & v & 1 \end{bmatrix}^{\mathrm T} = A \begin{bmatrix} x_w & y_w & z_w & 1 \end{bmatrix}^{\mathrm T},\]

where the superscript $c/p$ (attached to $s$, $A$, $u$, and $v$) distinguishes the camera from the projector, and $s$ denotes the scaling factor. The $3 \times 4$ matrix $A$ is the product of the intrinsic and extrinsic matrices, $A = I[R\,|\,T]$, where $I$ is the $3 \times 3$ intrinsic matrix, $f_x$ and $f_y$ are the focal lengths, and $(u_0, v_0)$ is the principal point.
$[R\,|\,T]$ is known as the extrinsic matrix: $R$ denotes the $3 \times 3$ rotation matrix and $T \in \mathbb{R}^3$ denotes the translation vector; together they describe the orientation and position of the world frame.
The projector coordinate matched to each camera pixel follows from the absolute phase:

\[u^p = \frac{\lambda\,\Phi^c(u^c, v^c)}{2\pi},\]

where $\Phi^c(u^c, v^c)$ represents the transverse absolute phase and $\lambda$ is the fringe width. If $v^p$ is required, the vertical absolute phase can be used instead. Once $u^p$ is matched, the 3D coordinate $(x_w, y_w, z_w)$ can be obtained by jointly solving the camera and projector pinhole equations.
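As a concrete illustration, the N-step least-squares phase retrieval and Gray-code unwrapping described above can be sketched in NumPy. The function names and synthetic fringe parameters below are our own illustrative choices, not part of the paper's implementation:

```python
import numpy as np

def wrapped_phase(frames):
    """N-step phase-shifting: frames[n] = a + b*cos(phi + 2*pi*n/N).

    Returns the wrapped phase phi via the least-squares (arctangent)
    estimator; the background a and amplitude b cancel in the two sums."""
    N = len(frames)
    deltas = 2 * np.pi * np.arange(N) / N
    num = sum(I * np.sin(d) for I, d in zip(frames, deltas))
    den = sum(I * np.cos(d) for I, d in zip(frames, deltas))
    # sum I_n sin(d_n) = -(N/2) b sin(phi), sum I_n cos(d_n) = (N/2) b cos(phi)
    return np.arctan2(-num, den)

def absolute_phase(phi, K):
    """Unwrap with the Gray-code fringe order K: Phi = phi + 2*pi*K."""
    return phi + 2 * np.pi * K
```

For N ≥ 3 uniformly shifted patterns the cross terms in the sums vanish, which is why only these two weighted sums are needed per pixel.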
Finally, the 2D texture images are rendered point-to-point onto the point clouds.

2.2 The back-end

Basics of mapping
After obtaining high-quality data, a coarse-to-fine pairwise registration algorithm based on the 2D-to-3D descriptor is proposed for fast and accurate mapping, which is achieved by solving the transformation matrix based on the correspondences generated by matching descriptors. Specifically, we apply the coordinate transformation to convert the affine transformation between 2D feature points into a 3D transformation, which provides an initial prior for the corresponding point-cloud registration. Then, in order to improve the efficiency of registration, 3D feature points are used instead of full 3D point clouds to perform registration. The entire mapping process is shown in Fig. 3.
Step 1: extract 2D matching points with the SURF algorithm and obtain the 2D transformation matrix;
Step 2: convert the 2D transformation matrix into a 3D transformation matrix according to the coordinate transformation, and use it as the initial prior for registration;
Step 3: extract the corresponding 3D points according to the 2D feature points;
Step 4: perform ICP on the 3D points, combined with the initial prior, to obtain the transformation matrix;
Step 5: apply the transformation matrix to register the point clouds;
Step 6: return to Step 1 and repeat the above processes.
Define two adjacent 3D frames as Frame 1 and Frame 2, with corresponding 2D texture images $I_1: (u_1^c, v_1^c)$ and $I_2: (u_2^c, v_2^c)$, and 3D data $p_1: (x_1^w, y_1^w, z_1^w)$ and $p_2: (x_2^w, y_2^w, z_2^w)$. First, the 2D key points of $I_1$ and $I_2$ are detected and matched by the Speeded Up Robust Features (SURF) [28] algorithm. In order to improve the registration efficiency, we use paired 3D feature points instead of complete 3D point clouds for registration. By using Eq. 9, we record the sets of 3D matching points corresponding to the found 2D ones as $P_1$ and $P_2$. The corresponding 3D points can be used to calculate the rotation matrix $R$ and initial translation vector $T$ of the coordinate transformation. The calculation can be formulated as

\[(R, T) = \arg\min_{R, T} \sum_{i} \left\| P_1^i - (R P_2^i + T) \right\|^2.\]

The transformation obtained from the 2D matching points is thus converted into a 3D transformation to initialize the iteration. After $P_1$ and $R P_2 + T$ are coarsely aligned, they are already close to registered. To improve the registration accuracy, the result is further refined by using the ICP algorithm [29]. Then, for stitching a sequence of point clouds, the next point cloud is taken as the target data, and the target data registered in the previous step is taken as the source data. Repeating the above two-frame registration process achieves the registration of a series of point clouds.
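The closed-form estimation of $R$ and $T$ from the paired 3D feature points can be obtained with the standard SVD-based (Kabsch) solution of the least-squares problem above. This is a minimal NumPy sketch of that generic technique under our own naming, not the authors' code:

```python
import numpy as np

def estimate_rigid_transform(P1, P2):
    """Find R, T minimizing sum_i || P1_i - (R @ P2_i + T) ||^2.

    P1, P2: (N, 3) arrays of matched 3D points (target, source)."""
    c1, c2 = P1.mean(axis=0), P2.mean(axis=0)
    A, B = P2 - c2, P1 - c1              # centered source / target
    H = A.T @ B                          # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    # reflection guard: force det(R) = +1
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    T = c1 - R @ c2
    return R, T
```

With noise-free correspondences the recovery is exact; with noisy SURF matches it gives the coarse alignment that the subsequent ICP refinement then polishes.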

Basics of localization
Live camera localization involves estimating the current camera position and pose for each new frame of point data, which is achieved by transformation estimation between the 2D and 3D descriptors. The 3D-2D point matching is performed to estimate the pose of the sensor using the 3D data of the perceived indoor scene.
Specifically, we find the correspondences between camera and world coordinate points, and calculate the camera pose through optimization.

The schematic of the camera localization is shown in Fig. 4. The transformation between the world coordinate system $O_w$ and the camera coordinate system $O_c$ can be simply described as

\[P^c = R^c P^w + T^c,\]

where $P^w$ and $P^c$ denote the coordinates of the same point in the world and camera frames. The camera localization concerns the optical-center coordinates, and the optical center is the origin of the camera frame. Considering that $R^c$ is an orthogonal matrix, $(R^c)^{\mathrm T} = (R^c)^{-1}$, the formula can be transformed into

\[P^w = (R^c)^{\mathrm T}(P^c - T^c),\]

so the world coordinates of the optical center are $-(R^c)^{\mathrm T} T^c$. Thus, if we can calculate the rotation matrix and translation vector of the currently captured point data, the current camera position and pose can easily be obtained.
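Recovering the optical-center coordinates from the extrinsics is then a one-line computation; a small NumPy sketch (the helper name is ours):

```python
import numpy as np

def camera_position(Rc, Tc):
    """World coordinates of the optical center.

    The optical center is the origin of the camera frame, so
    0 = Rc @ Ow + Tc, hence Ow = -Rc.T @ Tc (Rc is orthogonal)."""
    return -Rc.T @ Tc
```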
For the FPP imaging system, the coordinates of a 3D point in real space are $(x_w, y_w, z_w)$, and its homogeneous coordinates are $(x_w, y_w, z_w, 1)$. The corresponding projected point in the camera image is $(u^c, v^c)$, with homogeneous coordinates $(u^c, v^c, 1)$. The camera's internal parameter matrix $I^c$ has been obtained through device calibration. By using Eq. 8 and Eq. 9, we can derive

\[s^c \begin{bmatrix} u^c & v^c & 1 \end{bmatrix}^{\mathrm T} = I^c\,[R^c \mid T^c] \begin{bmatrix} x_w & y_w & z_w & 1 \end{bmatrix}^{\mathrm T}.\]
Let $F = I^c[R^c \mid T^c]$, with rows $f_1^{\mathrm T}$, $f_2^{\mathrm T}$, and $f_3^{\mathrm T}$, and let $\tilde P^w = (x_w, y_w, z_w, 1)^{\mathrm T}$. By eliminating $s^c$, each set of 3D-2D matching points yields two linear equations:

\[f_1^{\mathrm T}\tilde P^w - u^c f_3^{\mathrm T}\tilde P^w = 0, \qquad f_2^{\mathrm T}\tilde P^w - v^c f_3^{\mathrm T}\tilde P^w = 0.\]

$F$ contains a total of 12 unknowns, so at least 6 sets of matching points are required.
Assuming there are $N$ sets of matching points, stacking the $2N$ equations yields the homogeneous linear system

\[A f = 0,\]

where $f$ is the 12-vector formed by the entries of $F$. When $N = 6$, the system of linear equations can be solved directly.
When $N > 6$, we can obtain the least-squares solution under the constraint $\|F\| = 1$: after the SVD decomposition $A = U\Sigma V^{\mathrm T}$, the last column of the $V$ matrix is the solution.
Since $F$ equals $I^c[R^c \mid T^c]$ up to scale, the rotation matrix $R^c$ and translation vector $T^c$ can be recovered as

\[[R^c \mid T^c] = \frac{1}{s}(I^c)^{-1}F,\]

where the scale factor $s$ is fixed by requiring the rows of $R^c$ to have unit norm.
After the rotation matrix and translation vector are calculated, the live camera localization for the current point data is complete.
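Putting the pieces together, the 3D-2D pose estimation above (build the 2N × 12 system, solve under ||F|| = 1 by SVD, then factor out the intrinsics) can be sketched as a standard DLT solver. The function name and the exact scale/sign normalization are our own assumptions:

```python
import numpy as np

def dlt_pose(K, pts3d, pts2d):
    """Estimate [R | T] from >= 6 3D-2D correspondences via DLT.

    K: 3x3 intrinsic matrix; pts3d: (N, 3); pts2d: (N, 2)."""
    n = len(pts3d)
    A = np.zeros((2 * n, 12))
    for i, ((x, y, z), (u, v)) in enumerate(zip(pts3d, pts2d)):
        Pw = np.array([x, y, z, 1.0])
        A[2 * i, 0:4] = Pw            # f1 . Pw - u * f3 . Pw = 0
        A[2 * i, 8:12] = -u * Pw
        A[2 * i + 1, 4:8] = Pw        # f2 . Pw - v * f3 . Pw = 0
        A[2 * i + 1, 8:12] = -v * Pw
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 4)          # solution under ||F|| = 1
    M = np.linalg.inv(K) @ F          # equals s * [R | T] up to sign
    s = np.linalg.norm(M[0, :3])      # rows of R have unit norm
    M /= s
    if np.linalg.det(M[:, :3]) < 0:   # resolve the SVD sign ambiguity
        M = -M
    # re-orthonormalize R against numerical noise
    U, _, Vr = np.linalg.svd(M[:, :3])
    return U @ Vr, M[:, 3]
```

With exact correspondences the null space of A recovers the pose exactly; with noisy matches the SVD gives the least-squares solution described in the text.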

Experiments
To investigate the performance of the proposed method, we conducted experiments on metrically consistent models and indoor scenes. All experiments were implemented on a PC with an AMD Ryzen 5 5600H (3 GHz) CPU and a packaged FPP sensor. The built FPP sensor is shown in Fig. 5; it contains a Basler acA1920-155uc camera (1920 × 1200 resolution), a LightCrafter 4500 projector (912 × 1140 resolution), and an interface expansion board. The interface expansion board realizes the synchronous triggering of the projector and the camera. The imaging speed of the sensor is 300 fps, the single-view measurement accuracy is 0.067 mm, and the global mapping accuracy is around one millimeter.
Firstly, a doll was measured to investigate the ability to build metrically consistent models from trajectories containing loop closures. As shown in Fig. 6, the FPP system was placed in a fixed location and the doll was placed on a turntable. The turntable was then spun through full rotations to capture point clouds of the doll from different perspectives for registration; the trajectory localized for the corresponding camera should therefore also be a circle that returns to the origin, which allows us to easily evaluate the quality of camera localization. The main motivation for our experiments is to investigate the convergence of the mapping and localization schemes. As shown in the enlarged image of Fig. 6, the captured single-view point clouds were well registered, which demonstrates the performance of our mapping method. The resulting sensor trajectories are given at the bottom of Fig. 6. Since the FPP sensor is fixed and the object is placed on the turntable, the setup is equivalent to the sensor rotating around the object. In this experiment, the turntable was rotated through 3 full turns, with 13 captures per turn. The loop-closure behavior of the localization trajectory performs well.
The camera forms a circular trajectory around the doll and returns to the original location after one rotation, which demonstrates the performance of our localization method.
Finally, to investigate the performance of our method in real scenes, we conducted a mapping and localization experiment in a corridor. The system setup of the indoor scene experiment is shown in Fig. 7; it includes a computer, our FPP sensor, a battery, and a mobile cart. The FPP sensor is mounted on the mobile cart to scan the indoor scene. The results obtained with our system are shown in Fig. 8. To generate this map of the indoor scene, we captured over 100 frames of single-view point clouds. The generated map point cloud is dense and has high-quality local geometric information. The map with texture information is shown at the top of Fig. 8; it achieves accurate registration of the multi-frame point clouds and avoids interference from cumulative errors. This also demonstrates that our global pose alignment strategy scales well to large spatial extents and long sequences.

Conclusion
The availability of accurate 3-D metrology technologies such as FPP has the potential to revolutionise the fields of indoor localization and mapping, but methods to date have not fully leveraged the accuracy and speed of sensing that such techniques offer. This paper presents a millimeter (mm)-level FPP-based indoor SLAM method, which includes a front-end that uses the FPP technique to perceive the unknown scene and a corresponding back-end that uses the 2D-to-3D descriptor and the coordinate transformation relationship of FPP for mapping and localization. Experimental results demonstrate that the proposed method achieves localization and mapping accuracy of around one millimeter.

Data Availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.