Semantic RGB-D SLAM for Rescue Robot Navigation

In this paper, we propose a semantic simultaneous localization and mapping (SLAM) framework for rescue robots, and report its use in navigation tasks. Our framework generates not only geometric maps in the form of dense point-clouds but also corresponding point-wise semantic labels produced by a semantic segmentation convolutional neural network (CNN). The semantic segmentation CNN is trained on our RGB-D dataset of the RoboCup Rescue-Robot-League (RRL) competition environment. With the help of semantic information, the rescue robot can identify different types of terrain in a complex environment, so as to avoid specific obstacles or to choose routes with better traversability. To reduce segmentation noise, our approach utilizes depth images to filter the segmentation result of each frame. The overall semantic map is then further refined in the point-cloud voxels. By accumulating the results of multiple frames in the voxels, semantic maps with consistent semantic labels are obtained. To show the advantage of having a semantic map of the environment, we report a case study of how the semantic map can be utilized in a navigation task to reduce the arrival time while ensuring safety. The experimental results show that our semantic SLAM framework is capable of generating a dense semantic map of the complex RRL competition environment, with which the arrival time of the navigation task is effectively reduced.


I. INTRODUCTION
Semantic information representing classes of objects allows robots to understand their surroundings at a higher level than geometry or appearance. With the help of semantic information, robots can perform better in tasks such as path planning and human-robot interaction. For example, in the RoboCup Rescue-Robot-League (RRL) competition, an international competition for evaluating the performance of rescue robots, the contestants need to autonomously traverse and generate maps for a maze consisting of challenging terrains, such as stairs, stepfields, elevated slopes, and steep ramps. It is crucial for the robot to understand what terrain it encounters and then act accordingly to avoid terrains that are difficult or dangerous to traverse.
In this paper, we propose a semantic simultaneous localization and mapping (SLAM) framework for rescue robots to better navigate through challenging RRL environments. Our framework combines the well-known ORB-SLAM2 [1] method and a convolutional neural network (CNN) to generate both geometric and semantic maps of dense point-clouds, using an RGB-D camera.
In order to reduce the adverse effects of segmentation errors, our method performs filtering both on single frames and across multiple frames. For single-frame filtering, the depth information is utilized to estimate whether neighboring pixels in semantic images belong to the same object, thereby reducing the number of mislabeled pixels. For multiple-frame filtering, we accumulate the local point-clouds of multiple frames into voxels and determine the most frequent semantic label for each voxel. Fig. 1 depicts an example of the mapping results.
With the point-wise semantic labels available to us, we improve the existing path planning algorithm by also considering the semantic information. For example, in the RRL competition environment, stepfields are notoriously hard to traverse as they are made up of wooden blocks with random heights. Our framework can be used to identify such regions.
We perform real-world experiments to validate the proposed method. Experimental results suggest that our approach is able to generate dense semantic maps for the complex environment. In the rest of the paper, we review related works in Sec. II and describe the proposed methods in detail in Sec. III. We evaluate the effectiveness of the neural network training, the semantic image filtering, and the semantic map optimization through experiments in Sec. IV. Finally, we conclude our work and discuss future work in Sec. V.

II. RELATED WORKS
A. SEMANTIC SEGMENTATION
With the development of CNNs, semantic segmentation algorithms are now widely based on deep learning because of its speed and accuracy. FCN [2] is the first successful approach that utilizes a CNN to accomplish semantic segmentation. After that, SegNet [3] adds an encoder-decoder structure on the basis of FCN. To improve segmentation results, EncNet [4] and the DeepLab series of algorithms [5], [6] proposed by Google consider context information when processing images. FastFCN [7] runs three times faster than EncNet without accuracy loss by replacing the dilated convolutions in ResNet-101 [8] with a Joint Pyramid Upsampling (JPU) module. Apart from the common RGB semantic segmentation algorithms mentioned above, there are also RGB-D based [9]-[13] and point-cloud based [14] approaches.

B. SIMULTANEOUS LOCALIZATION AND MAPPING
A SLAM system enables robots to build maps of their surroundings and locate their position in the environment. The appearance-based SLAM system RTAB-Map [15] achieves long-term mapping and loop closure detection by utilizing a memory management method. ElasticFusion [16] is a surfel-based dense SLAM system. To realize pose estimation, it fuses the iterative closest point (ICP) method with the direct method. However, a GPU is needed to run this algorithm. ORB-SLAM2 is a lightweight SLAM system that utilizes ORB image features to achieve fast feature extraction and matching, and precise pose estimation. Because ORB-SLAM2 has a high precision of pose estimation and can be easily deployed to robots, our approach employs it as the SLAM system.

C. SEMANTIC SLAM
Most semantic SLAM methods fuse semantic labels obtained from semantic segmentation with maps generated by a SLAM algorithm to generate 3D maps with semantic information. According to the type of sensor used, semantic SLAM algorithms can be classified as monocular camera-based [17]-[19], stereo camera-based [20], [21], LiDAR-based [22]-[24], multiple sensor-based [25], [26], and RGB-D camera-based approaches, among others. This paper mainly considers semantic SLAM algorithms based on the RGB-D camera.
An RGB-D camera can provide both color and depth information of the environment. SemanticFusion [27] uses both throughout its system: it adds the depth channel into the RGB semantic segmentation neural network, so as to improve the performance of the network. Semantic labels are integrated into maps from ElasticFusion and updated by a recursive Bayesian method. Unlike SemanticFusion, Nakajima et al. [28] use RGB images as input to the neural network, and assign class probabilities to each segmentation label instead of each element (like voxels and surfels) to improve efficiency and reduce storage complexity. To improve accuracy, Antonello et al. [29] leverage multiple views to make the semantic segmentation result of a single frame more accurate. DS-SLAM [30] is based on the semantic segmentation method SegNet [3] and the SLAM system ORB-SLAM2. In order to make this approach accurate and robust in dynamic environments, a moving consistency check is proposed to filter out moving objects in the scene. Unlike the methods mentioned above, Sünderhauf et al. [31] utilize the object detection algorithm SSD [32] to locate objects in the image before using depth images to extract boundaries and associate semantic labels of objects.

FIGURE 2. Overview of our approach. The semantic segmentation CNN Inception-v3 utilizes RGB images to generate semantic images. To optimize these semantic images, we leverage the corresponding depth images to implement flood-fill, so that we obtain refined semantic images. Meanwhile, in the process of generating the dense geometric point-cloud map, RGB images and depth images are used by ORB-SLAM2. After fusing the refined semantic images and the geometric point-cloud map to generate the semantic map, we use ''winner-takes-all'' (WTA) to split the semantic map into voxels and refine it by counting the number of each label in every voxel, so as to generate the refined semantic map.

D. SEMANTIC INFORMATION FOR PATH PLANNING
Common path planning algorithms often consider only geometric information when generating feasible paths. These methods can distinguish whether there is an obstacle in front of the robot, but they cannot tell how difficult it is to traverse the obstacle. With semantic information, robots can understand what terrain is in front of them and what cost it takes to pass through. Wang et al. [33] utilize semantic information to distinguish rooms and corridors. Since corridors will be observed many times, robots can scan rooms first, so that they do not have to go through repeated routes. Similarly, Sünderhauf et al. [34] distinguish offices and corridors as well. To avoid interrupting workers in the office during working hours, robots choose the longer path and pass through the corridor. At night, robots prefer the shortest route and pass through the office. Lin et al. [35] take terrain dangerousness into account when planning the path. Wang et al. [36] leverage semantic information to estimate the likelihood of finding target objects.
In this paper, we propose a semantic SLAM framework for rescue robots to better navigate through challenging environments that comprise complex terrains other than flat ground.

III. METHOD
To generate dense point-clouds augmented with point-wise semantic labels, our pipeline contains three parts: 1) an RGB-D SLAM frontend, 2) a convolutional neural network for semantic segmentation, and 3) filters for semantic labels and maps. Fig. 2 illustrates our pipeline.

A. SLAM FRONTEND
To obtain a dense point-cloud map and the accurate pose of our robot, we use ORB-SLAM2 as the SLAM frontend. ORB-SLAM2 runs three main threads in parallel, namely the tracking thread, the local mapping thread, and the loop closing thread. In the tracking thread, the system processes incoming images, extracts feature points, and roughly estimates the pose of the camera. The local mapping thread generates local maps and optimizes the camera pose with local bundle adjustment. The loop closing thread detects loops and rectifies the accumulated drift of the pose estimation.
We utilize the camera poses estimated by ORB-SLAM2 to generate dense point-cloud maps from consecutive RGB-D images. To be more specific, let us denote the coordinates of an RGB-D pixel as [u, v, d], where u, v are the RGB image coordinates and d is the pixel depth value. The 3D coordinates of the pixel in the camera frame, [x_c, y_c, z_c], can be recovered as

$$\begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} = \begin{bmatrix} (u - c_x)\, d / f_x \\ (v - c_y)\, d / f_y \\ d \end{bmatrix}, \tag{1}$$

where c_x, c_y, f_x, f_y are the camera intrinsic parameters. Its 3D coordinates in the world frame, [x_w, y_w, z_w], are then obtained by applying the estimated camera pose

$$T = \begin{bmatrix} R & t \\ \mathbf{0}^T & 1 \end{bmatrix}, \tag{2}$$

$$\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix} = T \begin{bmatrix} x_c \\ y_c \\ z_c \\ 1 \end{bmatrix}, \tag{3}$$

where T is the ORB-SLAM2 estimated camera pose, with R being a 3 × 3 rotation matrix (representing the orientation of the camera) and t being a 3 × 1 translation vector (representing the position of the camera).
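As an illustration, the back-projection of Eqs. (1)-(3) can be written in a few lines. This is a minimal sketch assuming NumPy; the function names are illustrative, not from the paper:

```python
import numpy as np

def backproject(u, v, d, fx, fy, cx, cy):
    """Recover the 3D point in the camera frame from pixel (u, v) with depth d, Eq. (1)."""
    x_c = (u - cx) * d / fx
    y_c = (v - cy) * d / fy
    return np.array([x_c, y_c, d])

def to_world(p_c, T):
    """Transform a camera-frame point into the world frame with the 4x4 pose T, Eq. (3)."""
    p_h = np.append(p_c, 1.0)  # homogeneous coordinates
    return (T @ p_h)[:3]
```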

B. SEMANTIC SEGMENTATION
To extract semantic labels from RGB-D images, we formulate the task as a supervised-learning problem and train a convolutional neural network based on the Inception-v3 [37] architecture. Inception-v3 is chosen based on an empirical comparison with two other models (see Sec. IV), which shows that Inception-v3 is best suited for this task and offers the best segmentation performance. Fig. 3 depicts the structure of Inception-v3. It utilizes various techniques to factorize convolutions and regularize the model, so as to improve computational efficiency and extract more features. For example, Inception-v3 uses two stacked 3 × 3 convolutions to replace a larger 5 × 5 spatial filter, and it uses asymmetric convolutions (e.g. 1 × 3 followed by 3 × 1) to factorize convolutions into smaller ones. Also, label smoothing is used for regularization. All these efforts make Inception-v3 suitable for mobile robots with limited computing resources.
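To make the factorization idea concrete, the following PyTorch sketch shows the two patterns mentioned above; the channel sizes are illustrative and this is not the exact Inception-v3 block layout:

```python
import torch.nn as nn

# A 5x5 convolution replaced by two stacked 3x3 convolutions: the same receptive
# field with fewer parameters.
stacked_3x3 = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
)

# A 3x3 convolution factorized into asymmetric 1x3 and 3x1 convolutions.
asymmetric = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=(1, 3), padding=(0, 1)),
    nn.Conv2d(64, 64, kernel_size=(3, 1), padding=(1, 0)),
)
```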
The dataset used to train the Inception-v3 network contains 5679 RGB-D images of an RRL test field, covering most of the terrains and objects of interest. The images are annotated with eight classes: stairs, stepfields, slopes, elevated slopes, steep ramps, victims, ground, and background.
The RGB-D images in the dataset are collected with a RealSense D435 camera mounted on our rescue robot. The resolution of the RGB-D images is set to 640 × 480. The file format of the dataset is the same as that of Cityscapes [38], a public and commonly used semantic segmentation dataset focused on city scenes. Fig. 4 shows some examples from our dataset.

C. FILTERING SEMANTIC LABELS AND MAPS
Segmentation results provided by the CNN often contain scattered labels of wrong classes. To filter out such labels, we perform a flood-fill operation with the depth images, similar to the work of Chen [22].
To identify erroneously labeled pixels, we first perform an operation called ''erosion''. Let us denote the semantic label of a pixel as L(x, y), where (x, y) are the pixel coordinates. If L(x, y) differs from the label of any of its nearby pixels, we re-label it as 0, meaning it belongs to no class.
Pixels labeled 0 are then corrected by a procedure called ''flood-fill'', which determines the proper label by considering the depth differences among the neighboring pixels. For a pixel S(x, y) with label 0, if the depth difference D_d between this pixel and a nearby pixel is less than a given threshold, the nearby pixel is marked. After searching all pixels near S(x, y), we set L(x, y) to the label of the marked pixel with the minimum D_d. As a result, most of the scattered wrong-class labels are filtered out. Alg. 1 shows the detailed procedure of the flood-fill process, and Fig. 5 demonstrates it. Pixels within one object do not have large depth differences, so erroneously segmented pixels inside an object are re-marked as that object. As for pixels at the edges of objects, their depth differences with pixels of the same object are often smaller than those with neighboring objects, so their semantic labels are corrected as well.
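The following is a minimal single-pass Python sketch of the erosion and flood-fill steps described above. It assumes NumPy arrays for the label and depth images; the actual Alg. 1 may differ in neighborhood size and iteration strategy:

```python
import numpy as np

def erode_labels(labels, radius=1):
    """Set a pixel's label to 0 if it disagrees with any neighbor (erosion step)."""
    eroded = labels.copy()
    h, w = labels.shape
    for y in range(radius, h - radius):
        for x in range(radius, w - radius):
            patch = labels[y - radius:y + radius + 1, x - radius:x + radius + 1]
            if np.any(patch != labels[y, x]):
                eroded[y, x] = 0
    return eroded

def flood_fill(labels, depth, thresh):
    """Re-label zero pixels with the label of the depth-closest labeled neighbor."""
    filled = labels.copy()
    h, w = labels.shape
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if labels[y, x] != 0:
                continue
            best, best_dd = 0, thresh
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    dd = abs(float(depth[ny, nx]) - float(depth[y, x]))
                    if labels[ny, nx] != 0 and dd < best_dd:
                        best, best_dd = labels[ny, nx], dd
            filled[y, x] = best
    return filled
```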
To further improve the consistency of the semantic labels, our approach performs temporal filtering by fusing the segmentation results of multiple frames. This is achieved by first voxelizing the point-cloud into a grid with a voxel size of 0.08 m × 0.08 m × 0.08 m, and then performing a ''winner-takes-all'' operation for every voxel, i.e., all points in the same voxel take the most frequent label as their label. Fig. 6 provides an example result of filtering the point-clouds of ten frames/timesteps. Objects in the blue rectangular boxes belong to a stepfield region. When simply merging the ten sub-point-clouds (as shown in Fig. 6c), the result is very noisy and contains many points labeled as background instead of ''stepfield''. The mislabeled points mostly come from the first frame, because the camera vibrates sharply when recording begins, leading to a blurry image and hence a noisy segmentation result. The camera regains steadiness after the first frame, so the segmentation accuracy of the following nine frames recovers. Fig. 6d depicts the result after the voxelization and ''winner-takes-all'' operation. Since the inlier labels outnumber the outliers, the incorrect labels are greatly reduced after the filtering.
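A compact sketch of the voxel-wise ''winner-takes-all'' filter follows, assuming NumPy arrays of points (N × 3, in meters) and integer labels (N,); names are illustrative:

```python
import numpy as np
from collections import Counter, defaultdict

def winner_takes_all(points, labels, voxel=0.08):
    """Assign every point the most frequent label within its 0.08 m voxel."""
    buckets = defaultdict(list)
    keys = np.floor(points / voxel).astype(np.int64)  # voxel index of each point
    for i, key in enumerate(map(tuple, keys)):
        buckets[key].append(i)
    out = labels.copy()
    for idx in buckets.values():
        winner, _ = Counter(labels[i] for i in idx).most_common(1)[0]
        out[idx] = winner
    return out
```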

D. PATH PLANNING WITH SEMANTIC INFORMATION
Standard path planning methods often consider trajectory length as the only criterion, i.e., the shortest path is preferred. However, in a complex environment consisting of challenging terrains, factors such as the traversability of different terrains should also be considered. This becomes possible when the semantic label of every map point is available.
To perform path planning with the semantically augmented point-cloud, we adopt the fast marching method (FMM) [39], a level set method that can generate globally optimal solutions with low computational complexity. Given a velocity map, FMM works by simulating the front propagation of a curve moving in its normal direction, where the moving speed of the curve depends only on the local position. The arrival time from the start point to every position in the map can thus be estimated.
To apply the FMM, a velocity map that determines the traversing speed everywhere in the environment is needed. Inspired by the work of Gao et al. [40], we propose an improved velocity function, which considers both the Euclidean signed distance field (ESDF) and the terrain types (i.e. semantic labels) to generate a velocity map for the RRL competition environment. To be more specific, the velocity function is set to be

$$v(d, i) = (1 - c_i)\, v_m\, \frac{\tanh(d - e) + 1}{2}, \tag{4}$$

where
• d is the distance to the nearest obstacle;
• c_i ∈ [0, 1] is the time cost factor of passing terrain of type i, which reflects the traversability of such terrain. The higher c_i is, the slower the robot moves through it;
• e is Euler's number;
• v_m is the maximum velocity of our robot;
• tanh(·) is the hyperbolic tangent function.
Given this velocity function and a voxelized point-cloud of the environment, the corresponding velocity map is determined, which in turn determines the arrival time map using FMM. Then, a path connecting a given pair of start and target points with minimal arrival time can be extracted by performing gradient descent on the arrival time map.
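As a sketch of how the velocity map of Eq. (4) and the arrival time map could be computed on a 2D grid, the snippet below assumes the scikit-fmm package and SciPy's Euclidean distance transform (both are our assumptions; the paper does not name its FMM implementation):

```python
import numpy as np
import skfmm
from scipy.ndimage import distance_transform_edt

E = np.e  # Euler's number, as in Eq. (4)

def velocity_map(occupancy, terrain_cost, cell=0.05, v_max=1.0):
    """Eq. (4): v = (1 - c_i) * v_max * (tanh(d - e) + 1) / 2 on a 2D grid.
    occupancy: bool grid, True = obstacle; terrain_cost: per-cell c_i in [0, 1]."""
    d = distance_transform_edt(~occupancy) * cell  # distance to nearest obstacle (m)
    v = (1.0 - terrain_cost) * v_max * (np.tanh(d - E) + 1.0) / 2.0
    v[occupancy] = 1e-6  # (near) zero speed inside obstacles
    return v

def arrival_time(v, start, cell=0.05):
    """Fast-marching arrival time from the start cell (row, col) under speed map v."""
    phi = np.ones_like(v)
    phi[start] = -1.0  # place the zero level set at the start cell
    return skfmm.travel_time(phi, v, dx=cell)
```

A minimum-time path can then be traced from the goal back to the start by gradient descent on the returned arrival time map.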

IV. EXPERIMENTS
Our experiments are designed to validate that our method is able to 1) generate the semantic map of the RRL competition environment, and 2) improve the path planning algorithm by using semantic information.

A. SEMANTIC SEGMENTATION
We use our RRL competition environment dataset to train the semantic segmentation CNN Inception-v3. The dataset has 3407 images for training, 1136 images for validation, and 1136 images for testing, and involves the eight classes of the RRL competition environment. The CNN model is pretrained on the Cityscapes dataset before we train it on our own dataset.
When training this model, the batch size is set to 4, the momentum rate is set to 0.9, and the learning rate is 1 × 10^-4. To avoid overfitting, we use a dropout rate of 0.4 and a weight decay rate of 5 × 10^-5 to increase the generalization ability of the model and decrease the complexity of the model weights, respectively. The training machine is a desktop with an Intel Core i7-4790K 4.00 GHz CPU, a GeForce GTX 1070 GPU, and 15.6 GB of RAM.
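For concreteness, these hyperparameters map onto an optimizer configuration roughly as follows; this is a hedged sketch assuming PyTorch with a recent torchvision and SGD with momentum (the paper does not state which framework or optimizer was used):

```python
import torch
import torchvision

# Stand-in backbone (classification variant); the paper's segmentation head is
# not specified, so this only illustrates the stated hyperparameters.
model = torchvision.models.inception_v3(weights=None, dropout=0.4)

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=1e-4,            # learning rate
    momentum=0.9,       # momentum rate
    weight_decay=5e-5,  # weight decay rate
)
```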
After being trained for 19 epochs, the model achieves 96.87% mean IoU, 98.99% mean accuracy, and 98.93% mean recall on the testing set. Meanwhile, the inference time for each frame is 0.038 seconds. These results demonstrate that Inception-v3 is capable of accurate and fast segmentation on our dataset.
We also trained ERFNet [41] and MobileNetV2 [42] on our dataset, but their results are below those of Inception-v3. For every model, we run the same number of iterations (16K in this experiment), and all models converged before the iterations stopped, so we can be confident that each achieves its best performance with the current training parameters. Table 3 shows the training parameters of every model.
The accuracy and inference time of every model on the testing set are shown in Table 4. From this table, we can see that MobileNetV2 takes too much time for inference, while the segmentation accuracy of ERFNet is slightly lower. After taking both accuracy and inference time into account, we chose Inception-v3 to train on our dataset, obtaining satisfying segmentation results. Fig. 7 shows some examples of the semantic segmentation results.

B. PATH PLANNING WITH SEMANTIC INFORMATION
To validate the effectiveness of the semantic information in path planning, we carry out an experiment in our RRL competition environment. As shown in Fig. 8a, on top of the flat ground, there are two continuous slopes next to a stepfield.
Before the experiment, a point-cloud (and its semantic labels) of the field is obtained using the proposed semantic SLAM framework (see Fig. 8c). To perform path planning on this point-cloud, we apply several preprocessing operations. First, the 3D point-cloud is downsampled by voxelization and projected onto the ground plane to generate a 2D grid map, i.e. Fig. 8d and Fig. 8e. The grid map has a cell size of 0.05 m × 0.05 m. Then, small-scale salt-and-pepper noise in the grid map is removed using the neighboring cells, leading to the refined map shown in Fig. 8g. Finally, we inflate the edges of obstacles and some terrains in the grid map to broaden their extents, so that the robot does not get too close to an edge and fall or collide. The inflation is performed automatically by leveraging the costmap_2d package of the Robot Operating System (ROS). For the slopes, the robot can only traverse the green lines and must avoid the red lines in Fig. 9a, because at the red lines the robot would fall and turn over from the top of the slopes, see Fig. 9b. To find the red lines that need to be inflated, we utilize the normals of the cells belonging to the slopes in the grid map, where the normals are obtained from the 3D semantic map. If an edge of a slope is parallel to the projection of the normals onto the horizontal plane, the edge is a red line; if it is perpendicular to the projection, it is a green line. Fig. 8h is the final semantic map on which the path planning takes place.
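A rough sketch of this red/green edge test is given below; the angle tolerance, function name, and return convention are our assumptions:

```python
import numpy as np

def classify_slope_edge(edge_dir, normal, angle_tol_deg=15.0):
    """Classify a slope edge as dangerous ('red') when it is parallel to the
    horizontal projection of the slope normal, and safe ('green') when
    perpendicular. edge_dir is a 2D edge direction; normal is the 3D cell normal."""
    n_h = np.array([normal[0], normal[1]])  # project the normal onto the horizontal plane
    n_h = n_h / (np.linalg.norm(n_h) + 1e-9)
    e = edge_dir / (np.linalg.norm(edge_dir) + 1e-9)
    cos_a = abs(float(np.dot(e, n_h)))  # |cos| of angle between edge and projection
    if cos_a > np.cos(np.radians(angle_tol_deg)):
        return "red"    # parallel: crossing this edge risks tipping over
    if cos_a < np.sin(np.radians(angle_tol_deg)):
        return "green"  # perpendicular: safe crossing direction
    return "unknown"
```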
Given the semantic grid map, a velocity map is generated using (4) with the terrain time cost factors c_i from Table 3, where the time cost factors are determined by the difficulty and the danger level of traversing the corresponding terrains. The resulting velocity map is shown in Fig. 11a, in which the maximum speed of the robot is 1 m/s. With the velocity map, the arrival time from the start to the goal can be calculated using the FMM algorithm introduced in Sec. III-D. The path with the minimum time cost is determined by following the steepest gradient descent direction. Fig. 10 shows the results of three path planning trials with different goal points. We can see that the robot is able to choose a suitable path for each goal point. If the path lengths of different routes are the same, the robot chooses path A rather than path B consisting of slopes, because of the lower time cost factor of the ground, see Fig. 10b. For the goal point in Fig. 10b, the robot would take 0.52 more seconds by traversing the slopes than by traversing path A. To reach the goal point in Fig. 10c, traversing the slopes saves 0.36 seconds because the route is shorter and the time cost factor of the slopes is relatively low.
In all three trials, the robot is aware of the terrain classes given the semantic information, and hence can make use of the fact that the slopes are neither obstacles nor flat ground, but terrain with a slightly higher time cost factor. In the first trial, it is faster for the robot to traverse path B because path B has a low time cost factor compared to the stepfield in path C. As for path A, it requires the robot to pass two more 1.2 m × 1.2 m blocks than path B. Thus, after considering the high time cost factor of path C and the long length of path A, the robot chooses path B as the route to the goal point, see Fig. 10a.
As a comparison, we conducted an experiment with the semantic information neglected, in two cases: i) both stepfields and slopes are regarded as obstacles, and ii) stepfields are regarded as obstacles, but slopes are regarded as passable paths, the same as flat ground.
In the first case, the robot cannot use the slopes as a shortcut and traverses path A with a longer arrival time, as can be seen in Fig. 11c and Fig. 11d. Table 4 shows the arrival times of the different routes, where the arrival time of the route in Fig. 11d is 8.38 seconds, 1.89 seconds longer than that of traversing the slopes (see Fig. 11b).
In the second case, the robot is too greedy and is not aware of the dangerous edges. To traverse the path with minimum time cost, the robot may cross the red line and fall with high probability, see Fig. 9b, Fig. 11e, and Fig. 11f. Although crossing the red line saves 2.22 seconds over the green line and 4.11 seconds over path A, the risk of the robot falling is high.
To sum up, combining semantic information with path planning is also beneficial from the perspective of topology. With semantic information, the slopes and the stepfields are regarded as terrains that are passable but different from the ground. In this way, the paths going through A, B, and C in Fig. 8a belong to three different homotopy classes, respectively. However, if we regard the slopes and the stepfields as obstacles, the generated paths will only belong to one homotopy class, and the other classes are completely removed. Moreover, if we regard the slopes as ground and the stepfields as obstacles, the whole environment becomes a connected graph, so there is no difference between path A and path B. In this case, the number of homotopy classes is also one, and the difference between terrains is neglected, so there is a strong possibility that the path goes through the dangerous edges of terrains, as in Fig. 9b. Thus, semantic information is invaluable in path planning.

V. CONCLUSION
In this paper, we propose a semantic RGB-D SLAM system for rescue robots. The system can generate dense point-cloud maps with semantic information by fusing a semantic segmentation CNN with an RGB-D SLAM frontend. We utilize the depth information to determine whether neighboring pixels in the semantic image belong to the same object, so as to improve the precision of the semantic segmentation. We also use ''winner-takes-all'' filtering to improve the precision of the semantic map by leveraging the point-clouds generated from multiple frames. Semantic information about the surroundings plays an important role in helping robots understand the environment and perform path planning. We demonstrate how semantic information can improve path planning and help to generate paths with minimum time cost. To validate our semantic SLAM system, we created an RGB-D semantic dataset of the RRL competition environment. Experimental results show that our system can generate accurate dense semantic maps and leverage the semantic information to improve path planning results. In future work, we plan to fuse the semantic information and the SLAM system more tightly, for example by utilizing semantic information to improve the accuracy of pose estimation and leveraging the point-cloud map to improve the semantic segmentation precision.