PointLoc: Deep Pose Regressor for LiDAR Point Cloud Localization

In this paper, we present a novel end-to-end learning-based LiDAR relocalization framework, termed PointLoc, which infers 6-DoF poses directly using only a single point cloud as input, without requiring a pre-built map. Compared to RGB image-based relocalization, LiDAR frames can provide rich and robust geometric information about a scene. However, LiDAR point clouds are unordered and unstructured, making it difficult to apply traditional deep learning regression models to this task. We address this issue by proposing a novel PointNet-style architecture with self-attention to efficiently estimate 6-DoF poses from 360° LiDAR input frames. Extensive experiments on the recently released, challenging Oxford Radar RobotCar dataset and real-world robot experiments demonstrate that the proposed method can achieve accurate relocalization performance.

Fig. 1: PointLoc results on the challenging Oxford Radar RobotCar dataset [6], [7]. We directly feed a point cloud from a single LiDAR timestamp into the neural network to predict the 6-DoF pose, without requiring pre-built maps. The estimates of PointLoc are robust regardless of weather and significantly outperform state-of-the-art DNN-based LiDAR and visual sensor relocalization methods.
to provide an accurate initial pose as coarse localization first.
Thus, most localization solutions employ a Global Navigation Satellite System (GNSS) to provide pose estimates. Unfortunately, GNSS is not always available, for example in indoor environments, and its accuracy cannot be guaranteed in areas like large cities, where high-rise buildings can block GNSS signals. To this end, Uy et al. [8] proposed a point cloud retrieval-based localization method to deal with situations where GNSS is absent. It obtains a 6-DoF pose with respect to a pre-built map in the form of a reference database. Dube et al. [9] further proposed SegMap to improve the storage efficiency of the reference database by storing data-driven descriptors of individual objects in LiDAR point clouds. However, retrieval-based approaches inherently suffer from several issues. First, the time complexity of finding the closest match between the query point cloud and the reference point clouds is O(n), where n is the number of point clouds, which is not suitable for common real-time application scenarios. Second, retrieval-based methods require a reference database, which occupies O(n) storage space and cannot be deployed on many mobile robots. Third, their recall rate is often not good enough [8].
Recently, learning-based approaches have emerged as a promising tool for building a completely end-to-end localization system. These methods do not require any reference database at runtime, and the learned features tend to be general and robust. This kind of localization approach trains a neural network to directly predict the pose. The time complexity at inference is O(1) and the only space required is the model size, which addresses the drawbacks of point cloud retrieval-based methods. Early attempts in this direction include PoseNet and its variations [10]-[12]. However, all current pose regression approaches use RGB images from visual sensors as inputs, which raises several problems. Visual sensors are sensitive to changes in the environment, resulting in suboptimal localization performance. In addition, the input images are restricted to a narrow Field-of-View (FoV). These aspects restrict the application of these approaches in the real world. Compared with RGB images, point clouds acquired by LiDAR sensors capture 360° 3-D space and provide much richer geometric information about a specific location. In addition, the features extracted from point clouds tend to be more robust than those extracted from images. However, LiDAR point clouds are unordered and unstructured, making it difficult to learn features for localization. Motivated by this, we design a neural network that uses LiDAR point clouds as input for robust and accurate localization.
In this paper, we propose a novel neural network-based 3-D pose regressor, named PointLoc, to accurately estimate the 6-DoF pose using point clouds from LiDAR sensors. The neural network directly takes a raw point cloud as input and estimates the 6-DoF pose in an end-to-end fashion. Its performance shows significant improvement over learning-based LiDAR and visual sensor relocalization methods. Fig. 1 illustrates the performance of our PointLoc approach in different environments found in the Oxford Radar RobotCar dataset.
In summary, our contributions are as follows:
• To the best of our knowledge, this is the first LiDAR sensor-based approach for deep global pose regression in an end-to-end fashion. Our proposed architecture with a self-attention module can further improve the accuracy of the predicted 6-DoF absolute poses.
• We conduct real-world robot experiments in an indoor environment. We collect and create a new indoor LiDAR-visual sensors dataset, dubbed vReLoc, and release it for studying the indoor relocalization task.
• Comprehensive experiments and an ablation study on these two new datasets have been done to evaluate our proposed method. Results demonstrate that the PointLoc model outperforms the state-of-the-art DNN-based LiDAR and visual sensor relocalization methods by a large margin.
The rest of the paper is organized as follows. Section II introduces visual sensor relocalization, learning-based localization systems, DNN-based LiDAR odometry and deep learning on LiDAR point clouds. The problem formulation is given in Section III. The detailed LiDAR sensor relocalization method is presented in Section IV. Section V describes the collected indoor dataset vReLoc. Experiments and ablation studies are presented in Section VI. Section VII concludes the work and gives future research directions.

II. RELATED WORK
In this section, we review different learning-based approaches for localization, LiDAR odometry which estimates ego-motions between consecutive point clouds, and DNN architectures on point clouds.

A. Visual Sensor Relocalization
For dealing with the drawbacks of map registration methods, recent works propose learning-based approaches to estimate the global pose directly [10]- [16]. They take images, either single or sequential, as inputs to train a neural network model for predicting absolute poses. The key to these methods is to learn a deep pose regressor, which usually comprises a feature extractor and a regressor [10], [16], [17]. For example, PoseNet related works [10], [11], [18] proved the feasibility of predicting the global pose using a single RGB image by regressing the pose directly. Brahmbhatt et al. [13] utilized the relative pose between two images as a geometric constraint to estimate the pose. Although DNN-based relocalization methods can solve the downsides of retrieval-based approaches, the performance of translation and rotation estimation is still not satisfactory enough to be applied to real-world scenarios [19], which calls for further work on learning algorithms. Our work follows this line of study, aiming to improve the accuracy of deep global pose regression with LiDAR sensors.

B. Learning-based Localization Systems
Learning-based localization systems have attracted significant interest recently. Almalioglu et al. [20] employed a recurrent neural network for robust MMWave radar-based ego-motion estimation. CellinDeep [21] adopted a DNN to capture the nonlinear relationship between the cellular signal and its location for robust and accurate indoor localization. Alshamaa et al. [22] proposed a decentralized kernel algorithm for sensor localization in indoor wireless environments. Silva et al. [23] applied transfer learning and machine learning to images from a Kinect sensor for the localization of mobile robots. Hoang et al. [24] proposed a semi-sequential probabilistic method to improve indoor localization performance, validated with extensive on-site experiments. Li et al. [25] developed a centralized indoor localization method using pseudo-labels along with federated learning. AdapLoc [26] utilized a CNN and domain adaptation for device-free WiFi localization in dynamic environments. In contrast, our work proposes to apply deep learning to LiDAR sensors for global localization.

C. DNN-based LiDAR Odometry
Recent works propose learning-based methods to estimate LiDAR odometry, which calculates relative poses between consecutive LiDAR scans. Wang et al. [27] proposed a deep parallel neural network to directly predict relative poses. Li et al. [28] developed a learning-based fusion framework with 2-D LiDAR and IMU sensors for odometry estimation. Horn et al. [29] developed a flow embedding approach to solve the fusion problem of point clouds for LiDAR odometry. 3DFeat-Net [30] was developed to learn both a 3-D feature detector and a descriptor for point cloud matching using weak supervision. Lu et al. [31] proposed a Virtual Corresponding Points method to align two point clouds accurately. Wang et al. [32] designed novel sub-network architectures to address difficulties in the ICP method. These point cloud registration approaches can be leveraged to predict LiDAR odometry. However, unlike these methods, our work focuses on LiDAR relocalization, which estimates global poses rather than relative poses. Fig. 2 illustrates the difference between these two localization tasks. LiDAR odometry estimates relative poses between consecutive point clouds, while LiDAR relocalization predicts absolute poses w.r.t. the world coordinate frame.

D. Deep Learning on Point Clouds
DNN-based feature extraction methods for point clouds [33]- [36] have gained significant success in recent years. VoxelNet [37] was developed to learn feature embeddings in voxels for object detection. PointNet++ related works [33], [34], [36] have been proposed to directly process unordered point sets and learn features from points, which showed impressive performance on tasks of 3-D object detection, part segmentation, and semantic segmentation. Detailed introductions and applications of deep learning for point clouds can be found in the recent survey paper [38].

III. PROBLEM STATEMENT
We design a DNN-based framework for performing deep global pose regression using point cloud data from a LiDAR sensor, which is LiDAR relocalization. We predict the absolute 6-DoF poses of the mobile agent within previously visited areas. A typical use case for our method would be when a mobile agent has already visited the query places before, and then has to localize itself again when it moves across the previously-visited environment. To enable a more generic and reliable relocalization system, we only consider one LiDAR sensor input at a single timestamp rather than sequential inputs.
For each timestamp t, the agent receives one point cloud $P_t = \{x_i \mid i = 1, \dots, N\}$, where each point $x_i$ is a vector describing its coordinate (x, y, z). Therefore, the shape of each $P_t$ is (N, 3). The relocalization of the agent is parameterized by a 6-DoF pose $[t, r]^T$ with respect to the world coordinate frame, where $t \in \mathbb{R}^3$ is a 3-D translation vector and $r \in \mathbb{R}^4$ is a 4-D rotation vector (quaternion). To this end, deep 3-D pose regressors learn a function F such that $F(P_t) = (t, r)^T$, where the function F is usually a neural network for DNN-based methods.
Fig. 2: The difference between LiDAR odometry and LiDAR relocalization [6], [7]. LiDAR odometry estimates relative poses between consecutive point clouds, which accumulates drift over time, while LiDAR relocalization predicts absolute poses w.r.t. the world coordinate frame, which requires that agents have previously traversed the scene. These are two different localization tasks, and this work focuses on LiDAR relocalization.

IV. DEEP POINT CLOUD RELOCALIZATION
This section introduces our proposed PointLoc, a deep 3-D pose regressor for predicting the global pose from LiDAR sensors. The overall architecture is illustrated in Fig. 3. Our system consists of point cloud pre-processing, a point cloud encoder, a self-attention module, a group all layers module, and a pose regressor. The point cloud data are down-sampled to a fixed shape (N, 3) as an input. The whole design is based on the PointNet-style structure, which can theoretically learn a critical subset of points for relocalization. We introduce each module individually.

Fig. 3: Architecture of PointLoc (diagram labels: Input, Point Cloud Encoder, Self-Attention Module, Regressor). The set abstraction (SA) layer [39] consists of a sampling layer, a grouping layer and a PointNet layer [34]. For more details about the SA layer, please refer to PointNet++ [34] and FlowNet3D [39]. The learnt point features are sent to the self-attention module to eliminate noisy features. Afterwards, these features are fed into group all (GA) layers, which down-sample them to a feature vector. Finally, the pose regressor predicts the 6-DoF pose.

A. Point Cloud Pre-Processing
The purpose of this module is to pre-process raw point clouds so that they fit the neural network. Each point cloud frame of a LiDAR scan contains a different number of points, whereas our neural network requires inputs of the same dimensions (N, 3). To tackle this problem, we adopt a random point cloud sampling strategy, which ensures that all point cloud inputs have the same shape (N, 3). N is set to 20,480 in this work, since the average number of points per point cloud in the Radar RobotCar dataset [6], [7] used in our experiments is around 21,000 and we want to retain as much information as possible.
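The random sampling step can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `sample_fixed_size` is hypothetical, and padding smaller clouds with repeated points is an assumption, since the text does not say how frames with fewer than N points are handled.

```python
import random

def sample_fixed_size(points, n, rng=None):
    """Return exactly n points drawn at random from `points`.

    Samples without replacement when enough points are available.
    Otherwise keeps every point and pads with randomly repeated ones
    (an assumption; the paper does not describe this case).
    """
    rng = rng or random.Random()
    if not points:
        raise ValueError("empty point cloud")
    if len(points) >= n:
        return rng.sample(points, n)
    padding = [rng.choice(points) for _ in range(n - len(points))]
    return list(points) + padding

# A synthetic frame with roughly the average size reported for the dataset.
cloud = [(float(i), 0.0, 0.0) for i in range(21000)]
fixed = sample_fixed_size(cloud, 20480, random.Random(0))

# A frame smaller than the target size is padded instead of truncated.
small = sample_fixed_size(cloud[:100], 256, random.Random(0))
```

With N = 20,480 and about 21,000 points per frame, only a small fraction of each scan is discarded, which matches the stated goal of retaining as much information as possible.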

B. Point Cloud Encoder
The goal of this module is to extract features from the point cloud. The feature representation extracted by the point cloud encoder plays a critical role in achieving accurate and reliable relocalization. Intuitively, human beings can utilize key points and features in a scene to identify where they are and conventional geometric methods are capable of performing precise localization by exploiting key points of the point cloud data. Inspired by this, if a neural network learns a subset of key points from the original point cloud data relevant to the localization task, we can take better advantage of these key features to identify a location. Existing literature [34], [40] has proved the critical-subset theory, i.e. for any point cloud P, a PointNet-like structure can identify a salient point subset C ⊆ P, making it a desirable choice for our relocalization task.
Specifically, PointNet exploits the multi-layer perceptron (MLP), a feature transformation module, and a max pooling layer to approximate a permutation-invariant function for point cloud classification and segmentation. In fact, it is a universal continuous set function approximator, described as:
$$f(P) = \varphi\big(\mathrm{MAX}\{h(x_i) \mid x_i \in P\}\big), \tag{1}$$
where $\varphi$ and $h$ are two continuous functions (usually instantiated as MLPs), and MAX denotes the max pooling layer [33]. PointNet++ extends PointNet by recursively capturing hierarchical features on point sets in a metric space [33]. From Eq. 1, the result of the PointNet structure is determined by $u = \mathrm{MAX}\{h(x_i) \mid x_i \in P\}$, where the MAX operation takes N vectors as input and outputs one vector of element-wise maximums. Thus, for each dimension j there exists at least one point $x_i \in P$ such that $u_j = \mu_j$, where $u_j$ is the j-th dimension of $u$ and $\mu_j$ is the j-th dimension of $h(x_i)$. These points can be aggregated into a critical subset $C \subseteq P$, where C determines $u$ and hence $\varphi(u)$ (more details can be found in [34], [40]).
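The critical-subset argument can be checked on a toy example. The sketch below is illustrative only: `critical_subset` and `toy_h` are hypothetical names, and `toy_h` stands in for the learned per-point MLP h. For each feature dimension it records which point attains the element-wise maximum, then verifies that max pooling over those points alone reproduces the same vector u.

```python
def critical_subset(points, h):
    """Return (u, C): the max-pooled feature u and the sorted indices C
    of the points that attain the maximum in at least one dimension."""
    feats = [h(p) for p in points]
    dims = len(feats[0])
    u, critical = [], set()
    for j in range(dims):
        # Index of the point whose j-th feature is largest.
        best = max(range(len(points)), key=lambda i: feats[i][j])
        u.append(feats[best][j])
        critical.add(best)
    return u, sorted(critical)

def toy_h(p):
    """Stand-in for the per-point MLP h (hypothetical, not learned)."""
    x, y, z = p
    return [x, y, z, -x]

P = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0),
     (0.5, 0.5, 0.5), (-1.0, 0.0, 3.0)]
u, C = critical_subset(P, toy_h)

# Pooling over the critical points alone yields the same global feature.
u_from_C, _ = critical_subset([P[i] for i in C], toy_h)
```

Here three of the five points determine u, so removing the other two would leave the output of the network unchanged.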
Consequently, the critical-subset theory is applicable to neural networks with the structure $\varphi(\mathrm{MAX}\{h(x_i) \mid x_i \in P\})$. The proposed PointLoc is built upon PointNet++, which has such a structure, and can thus in theory learn the critical subset from LiDAR point clouds. We design our point cloud encoder based on the set abstraction (SA) layer of PointNet++ [33], [36]. The point cloud encoder is composed of 4 consecutive SA layers. Each SA layer is composed of a sampling layer, a grouping layer and a PointNet layer [34]. The SA layer takes a feature matrix $F \in \mathbb{R}^{N \times C}$ as input, where N is the number of points and C is the feature dimension of each point, and outputs a feature matrix $F' \in \mathbb{R}^{N' \times C'}$, where $N'$ is the number of sub-sampled points and $C'$ is the new feature dimension of each point (from the size $(N_1, C_1)$ to $(N_4, C_4)$ in Fig. 3). We also leverage a multi-scale grouping strategy [33] inside the SA layer for robust feature learning. Specifically, the layer adopts farthest point sampling to sample $N'$ regions with $x'_j$ being the region centers, and for each region with radius r, it extracts local features with a symmetric function as [39]:
$$F'_j = \mathrm{MAX}_{\{i \,:\, \|x_i - x'_j\| \le r\}} \{h(F_i)\},$$
where $F_i$ is the i-th row of $F$, $F'_j$ is the j-th row of $F'$, $h : \mathbb{R}^{C} \to \mathbb{R}^{C'}$ is the MLP, and MAX is the max pooling layer.
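A single-scale version of one SA layer can be sketched as follows. This is a simplified illustration under stated assumptions: it uses one grouping radius rather than the multi-scale grouping described above, the function names are hypothetical, and `h` is an arbitrary stand-in for the learned MLP applied to each point feature.

```python
import math

def farthest_point_sampling(points, m, start=0):
    """Greedily pick m indices; each new point is the one farthest
    from all points chosen so far."""
    chosen = [start]
    dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < m:
        nxt = max(range(len(points)), key=dist.__getitem__)
        chosen.append(nxt)
        # Track each point's distance to its nearest chosen center.
        dist = [min(d, math.dist(p, points[nxt]))
                for d, p in zip(dist, points)]
    return chosen

def set_abstraction(points, feats, m, radius, h):
    """Sample m region centers, group neighbours within `radius`,
    and max-pool h(feature) over each group."""
    centers = farthest_point_sampling(points, m)
    pooled = []
    for j in centers:
        group = [h(feats[i]) for i, p in enumerate(points)
                 if math.dist(p, points[j]) <= radius]
        pooled.append([max(col) for col in zip(*group)])
    return [points[j] for j in centers], pooled

points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0),
          (10.0, 0.0, 0.0), (11.0, 0.0, 0.0)]
feats = [[1.0], [5.0], [2.0], [7.0], [3.0]]
h = lambda f: [f[0], -f[0]]  # stand-in for the MLP, maps R^1 to R^2

new_points, new_feats = set_abstraction(points, feats, m=2, radius=2.5, h=h)
```

The five input points with one feature each are reduced to two region centers with two features each, mirroring the change from (N, C) to (N', C') described in the text.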

C. Self-Attention Module
The aim of this module is to remove outliers, such as moving objects, from the previously extracted features for better relocalization performance. Prior works [16], [17] have shown that the self-attention mechanism can improve visual sensor relocalization by removing noisy features. Therefore, we also design a neural module to automatically remove dynamic features before regressing the final poses. Inspired by recent works [16], [17], [41], we introduce a self-attention module that learns a mask, which attempts to remove outlier features of moving objects from the original point features through an element-wise product between the point features and the mask.
Given a set of point features $F \in \mathbb{R}^{N \times C}$ learned by the point cloud encoder, our attention module aims to learn a mask $M \in \mathbb{R}^{1 \times C}$ for the features $F$. To achieve this, we use a shared MLP followed by a sigmoid function, which takes the features $F$ as input and directly generates the mask $M$. After that, we broadcast the mask and apply it to the features $F$, obtaining weighted features $\hat{F}$ for subsequent pose regression. Specifically, since the dimensions of the point features $F$ are $N \times C$ and the dimensions of the learned mask $M$ are $1 \times C$, we broadcast the mask from $1 \times C$ to $N \times C$. Afterwards, we multiply the features $F$ element-wise by the broadcast mask $M$ in order to remove noisy features. Formally, this self-attention module is defined as follows:
$$\hat{F} = \mathrm{dot}(F, M), \qquad M = \mathrm{Sigmoid}(\mathrm{MLP}(F)),$$
where dot denotes the element-wise product between $F$ and the broadcast $M$.
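The broadcast-and-mask step can be illustrated with plain Python lists. This sketch covers only the masking itself: the pre-sigmoid values `mask_logits` are taken as given, because the text does not specify how the shared MLP reduces the N × C features to a 1 × C mask. The function name is hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def apply_channel_mask(features, mask_logits):
    """Scale every row of an N x C feature matrix by a 1 x C sigmoid mask.

    Reusing the same mask for each row is the broadcast from 1 x C
    to N x C; the product is element-wise.
    """
    mask = [sigmoid(z) for z in mask_logits]
    if any(len(row) != len(mask) for row in features):
        raise ValueError("feature width must match mask width")
    return [[f * m for f, m in zip(row, mask)] for row in features]

F = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
# Hypothetical logits: channel 0 half-kept, channel 1 kept, channel 2 suppressed.
weighted = apply_channel_mask(F, [0.0, 50.0, -50.0])
```

A channel whose mask value is close to 0 is suppressed for every point at once, which is how a 1 × C mask can discard feature channels that respond to moving objects.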

D. Group All Layers Module
The target of this module is to aggregate features from all previous layers to generate an embedded feature vector. Specifically, as shown in Fig. 3, the input of the group all layers (GA) module is a point feature set of size $N_4 \times C_4$, which is propagated to an updated point feature set of size $N_4 \times C_5$ via an MLP, where $C_5$ is larger than $C_4$. Next, it is down-sampled to a $C_5$-dimensional feature vector through the max pooling layer. The embedded feature vector is then forwarded to an FC layer. After the FC layer, the $C_5$-dimensional feature vector is finally sent to the pose regressor to predict the translation t and rotation r.

E. Pose Regressor
The purpose of this module is to predict the final pose. It takes the $C_5$-dimensional feature vector from the previous module and predicts the translation t and the rotation r. The pose regressor is composed of two branches, each consisting of 4 consecutive fully-connected (FC) layers. The sizes of the FC layers decrease gradually to learn features. We choose Leaky ReLU as the activation function after each FC layer except the last one. The last FC layers of the two branches regress the translation and rotation separately.
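The two-branch structure can be sketched without any deep learning framework. Everything numeric here is an assumption made for illustration: the layer widths (64, 32, 16, 8), the random untrained weights, the Leaky ReLU slope, and a 3-D rotation output (chosen to match a log-quaternion target) are not taken from the paper. Only the shape of the computation follows the text: two branches, 4 FC layers each, Leaky ReLU after all but the last layer.

```python
import random

def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x

def make_fc(n_in, n_out, rng):
    """Random weights and zero bias for one fully connected layer."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def make_branch(sizes, rng):
    """One FC layer per consecutive pair of sizes."""
    return [make_fc(a, b, rng) for a, b in zip(sizes, sizes[1:])]

def run_branch(x, layers):
    for k, (w, b) in enumerate(layers):
        x = [sum(wi * xi for wi, xi in zip(row, x)) + bi
             for row, bi in zip(w, b)]
        if k < len(layers) - 1:  # no activation after the last FC layer
            x = [leaky_relu(v) for v in x]
    return x

rng = random.Random(0)
feature = [rng.gauss(0.0, 1.0) for _ in range(64)]     # hypothetical C5 = 64
trans_branch = make_branch([64, 32, 16, 8, 3], rng)    # 4 FC layers, 3-D output
rot_branch = make_branch([64, 32, 16, 8, 3], rng)      # 4 FC layers, 3-D output

t = run_branch(feature, trans_branch)
r = run_branch(feature, rot_branch)
```

Keeping the branches separate lets each one specialise after the shared embedding, and the learnable loss weights described in the next subsection balance the two outputs during training.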

F. Loss Function
Our goal is to estimate the 6-DoF pose $[t, r]^T$. Prior works [10]-[12], [42] directly predict quaternions and use an $l_1$ or $l_2$ loss, but such a representation is over-parameterized, and normalization of the output quaternion is required at the cost of worse accuracy [13]. Odometry tasks with DNNs [27], [43] usually regress Euler angles, which are also not suitable here since they wrap around $2\pi$. Consequently, we employ the loss function defined in [13], which is adapted from [11], for training our neural network. Given K training samples $G = \{P_t \mid t = 1, \dots, K\}$ and their corresponding ground-truth poses $\{[\hat{t}, \hat{r}]^T_t \mid t = 1, \dots, K\}$, the parameters of PointLoc are learned via the following loss function:
$$\mathcal{L}(G) = \|t - \hat{t}\|_1 e^{-\beta} + \beta + \|\log q - \log \hat{q}\|_1 e^{-\gamma} + \gamma,$$
where $\beta$ and $\gamma$ are balancing factors used to jointly learn translation and rotation. It is worth noting that $\beta$ and $\gamma$ are learnable during training and are initialized to $\beta_0$ and $\gamma_0$ respectively. $\log q$ is the logarithmic form of a unit quaternion $q = (u, v)$, where u is a scalar and v is a 3-D vector. It is defined as:
$$\log q = \begin{cases} \dfrac{v}{\|v\|} \cos^{-1} u, & \text{if } \|v\| \neq 0, \\ 0, & \text{otherwise.} \end{cases}$$
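The loss can be written out for a single sample as below. This sketch assumes the form used in MapNet [13]: an L1 term on translation weighted by exp(−β), an L1 term on the log-quaternion weighted by exp(−γ), plus β and γ themselves as regularizers. The function names are hypothetical, and the network is assumed to output the 3-D log-quaternion directly.

```python
import math

def qlog(q):
    """Logarithm of a unit quaternion q = (u, v1, v2, v3) as a 3-D vector."""
    u, v = q[0], q[1:]
    norm_v = math.sqrt(sum(c * c for c in v))
    if norm_v == 0.0:
        return [0.0, 0.0, 0.0]
    angle = math.acos(max(-1.0, min(1.0, u)))  # clamp against rounding
    return [c / norm_v * angle for c in v]

def pose_loss(t_pred, w_pred, t_true, q_true, beta, gamma):
    """Weighted L1 loss on translation and log-quaternion for one sample."""
    l_t = sum(abs(a - b) for a, b in zip(t_pred, t_true))
    l_r = sum(abs(a - b) for a, b in zip(w_pred, qlog(q_true)))
    return l_t * math.exp(-beta) + beta + l_r * math.exp(-gamma) + gamma

# A 90-degree rotation about the z-axis maps to (0, 0, pi/4).
q_z90 = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
w = qlog(q_z90)

# 1 m translation error, perfect rotation, beta = 0.0 and gamma = -3.0.
loss = pose_loss((1.0, 2.0, 3.0), [0.0, 0.0, 0.0],
                 (1.0, 2.0, 4.0), (1.0, 0.0, 0.0, 0.0),
                 beta=0.0, gamma=-3.0)
```

With γ = −3.0 the rotation term is multiplied by exp(3) ≈ 20, so rotation errors in radians are weighted far more heavily than translation errors in meters at the start of training. The loss value itself can be negative because β and γ enter additively.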

V. INDOOR LIDAR SENSOR DATASET FOR RELOCALIZATION
There is a lack of public indoor datasets with LiDAR sensors. In order to boost research in this area, we collected a new dataset, dubbed vReLoc, with rich sensor modalities, e.g. vision and LiDAR sensors, on a mobile robot platform. Our dataset has been released online to benefit future researchers. The experimental robot is a Turtlebot 2, mounted with a Velodyne HDL-32E LiDAR sensor and an Intel RealSense Depth Camera D435. The sensors have been carefully calibrated. The Velodyne is a lightweight pulsed-laser detection and ranging sensor, which features 32 lasers across a vertical field-of-view of over 40° and a 360° horizontal field-of-view. It runs at a frequency of 10 Hz. Each point cloud in the dataset contains ∼60,000 points. The camera was employed to capture RGB images, and the size of each image is 640 × 480 × 3. A Vicon Motion Tracker system is used to acquire accurate ground-truth 6-DoF poses. Ten Bonita B10 cameras are used in the system, installed around the area where the dataset is collected. Each Bonita B10 has a resolution of 1 megapixel at a frame rate of 250 fps and an operating range of up to 13 m. The system can track the pose of the robot with a precision of ∼1 cm.
The size of the Vicon room is about 4m × 5m. We lay out several obstacles in the scene. For the relocalization task, the scene is fixed through the whole data collection process. We utilized the Robot Operating System (ROS) for robot control and data collection. Timestamps were recorded on every frame of each sensor by the ROS, and we synchronized world timestamp across different systems from the same Network Time Protocol (NTP) server.
A total of 18 sequences of various lengths were collected. Since the Velodyne LiDAR, RealSense camera and Vicon motion tracker system run at different frequencies, we synchronized these systems so that the image of the visual sensor and the point cloud of the LiDAR sensor at each timestamp have the same 6-DoF pose. In the static scenario, there are no moving objects in the scene. In the other scenarios, people walk randomly in the scene. Sequences 01-10 come from the static environment, sequences 11-15 are the one-person moving scenario, and sequences 16-18 are the two-person moving scenario. In order to better represent real-world situations, we specifically chose challenging sequences as the training dataset in our experiments. We report our training and test sequences from the vReLoc dataset in Table I.

VI. EXPERIMENTS
In this section, we evaluate our proposed approach on the recently released outdoor Oxford Radar RobotCar [6], [7] dataset and our proposed indoor vReLoc dataset and compare to state-of-the-art methods.

A. Implementation Details
Adam [44] is applied to train our network with β 1 = 0.9 and β 2 = 0.999. We set the initial values β 0 = 0.0 and γ 0 = −3.0 of the loss function following MapNet [13] and AtLoc [17]. From our experiment, if we change these values within a short range, the results remain almost the same, which is reasonable since these two parameters are learnable and they will adjust themselves to different values during training phase. The learning rate is set to 0.001, and we train 100 epochs on both datasets. For baseline image approaches, we also used data augmentation to improve the accuracy of predictions. Following the convention of existing works [10], [11], [13], [17], we calculate the mean error for outdoor datasets and the median error for indoor datasets.
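The reported error statistics can be computed as in the sketch below. The per-frame error definitions are assumptions, since the text does not spell them out: translation error is taken as the Euclidean distance between predicted and ground-truth positions, and rotation error as the angle 2·arccos(|⟨q1, q2⟩|) between the two quaternions, a common convention in PoseNet-style evaluation. The function names and sample values are hypothetical.

```python
import math
import statistics

def translation_error(t_pred, t_true):
    """Euclidean distance between predicted and true positions (meters)."""
    return math.dist(t_pred, t_true)

def rotation_error_deg(q_pred, q_true):
    """Angle in degrees between two quaternions, invariant to sign."""
    dot = abs(sum(a * b for a, b in zip(q_pred, q_true)))
    norm = (math.sqrt(sum(a * a for a in q_pred))
            * math.sqrt(sum(b * b for b in q_true)))
    return math.degrees(2.0 * math.acos(min(1.0, dot / norm)))

identity = (1.0, 0.0, 0.0, 0.0)
q_z90 = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))

err_t = translation_error((0.0, 0.0, 0.0), (3.0, 4.0, 0.0))
err_r = rotation_error_deg(identity, q_z90)

# Mean for outdoor sequences, median for indoor ones, as stated above.
frame_errors = [1.0, 2.0, 6.0]  # hypothetical per-frame errors
mean_error = statistics.mean(frame_errors)
median_error = statistics.median(frame_errors)
```

The median is less sensitive than the mean to a few frames with very large errors, which explains why the two statistics can differ noticeably on the same sequence.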

B. Baselines
To validate the performance of the proposed PointLoc, we compare it with several state-of-the-art learning-based open-source LiDAR sensor localization and visual sensor relocalization approaches. For LiDAR sensor localization approaches, we choose PointNetVLAD [8] and Deep Closest Point (DCP) [32]. PointNetVLAD is a large-scale point cloud retrieval-based approach, which can be utilized for LiDAR sensor relocalization. We create the triplet training dataset, increase the point number from 4,096 to 8,192, and set the loss margin from 0.5 to 1.0 to improve the performance, while other hyper-parameters are kept the same as the vanilla PointNetVLAD. The validation sequence FULL 5 is chosen for building up the reference database as the localization map. DCP is a DNN-based point cloud registration approach, which employs the PointNet [34] and DGCNN [45] as the embedding network. Although we deal with different tasks, we can adapt it for relocalization. Specifically, the DCP aims at the task of point cloud odometry, which demonstrates that the feature extraction design of this neural network is effective. Our task is to estimate global 6-DoF poses (relocalization) from point clouds. Therefore, we utilized the feature extraction module of DCP in our design to compare the performance. For the visual sensor relocalization baselines, we choose PoseNet17 [11] since it outperforms PoseNet and Bayesian PoseNet in previous works. AtLoc [17] is selected for comparison since it is the state-of-the-art single image-based learning approach. We also choose LSTM-Pose [15] as the sequential baseline. Moreover, we also compare with MapNet [13] because it is the state-of-the-art sequential visual sensor approach. We note that sequential methods generally perform better than single image ones by utilizing time constraints. However, for the relocalization task, this past information is not always available as discussed before. 
We still compare with them to examine how competitive our method is. We note that we implement baseline methods and tune them for the best performance.

C. Results on the Oxford Radar RobotCar
The Oxford Radar RobotCar dataset [6] is a radar extension to the Oxford RobotCar dataset [7], providing data from dual Velodyne HDL-32E LiDARs and Grasshopper2 monocular cameras. The ground truth poses are obtained by a NovAtel SPAN-CPT ALIGN inertial and GPS navigation system (GPS/INS).

Dataset Description
The data were gathered in January 2019 over thirty-two traversals of a central Oxford route, and the duration and distance of each traversal are ∼32 min and ∼9.05 km respectively. The resolution of a captured RGB image is 1280×960, and each point cloud has ∼21,000 points. The dataset is large-scale, covers various weather conditions and contains moving objects like people and cars, all of which have a significant influence on relocalization accuracy, and it is therefore quite challenging. Since there is a timestamp misalignment between the camera and LiDAR sensors, we synchronize timestamps with scripts and interpolate GPS/INS measurements to coincide with the ground-truth poses. For time synchronization, the frame rate of the camera is 16 Hz and that of the LiDAR is 20.02 Hz. Therefore, for every timestamp of the camera images, we collect the corresponding point cloud by searching for the closest timestamp among the LiDAR point clouds. As in MapNet and AtLoc, missing GPS/INS data is handled by interpolating values from the visual odometry data provided with the Radar RobotCar dataset. We report the training and test sequences we used from the Oxford Radar RobotCar dataset in Table VI.
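The closest-timestamp search can be sketched with a binary search over the sorted LiDAR timestamps. This is an illustration rather than the authors' script: the function names are hypothetical, timestamps are simplified to integer milliseconds at roughly 16 Hz and 20 Hz, and ties are resolved in favour of the earlier scan.

```python
from bisect import bisect_left

def nearest_index(sorted_stamps, query):
    """Index of the timestamp in `sorted_stamps` closest to `query`."""
    pos = bisect_left(sorted_stamps, query)
    if pos == 0:
        return 0
    if pos == len(sorted_stamps):
        return len(sorted_stamps) - 1
    before, after = sorted_stamps[pos - 1], sorted_stamps[pos]
    # Prefer the earlier scan when both are equally close.
    return pos if after - query < query - before else pos - 1

def match_frames(camera_stamps, lidar_stamps):
    """For each camera timestamp, the index of the closest LiDAR scan."""
    return [nearest_index(lidar_stamps, t) for t in camera_stamps]

lidar_ms = [0, 50, 100, 150, 200]   # about 20 Hz
camera_ms = [0, 62, 125, 187]       # about 16 Hz
matches = match_frames(camera_ms, lidar_ms)
```

Because the LiDAR runs faster than the camera, some scans (index 3 in this example) are never matched, while every camera frame receives exactly one point cloud.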

Results
The test results on the Radar RobotCar dataset are presented in Table VII. Following the plotting style of the relocalization work [16], the trajectories of DCP, MapNet, AtLoc, and PointLoc on FULL7 and FULL8 are shown in Fig. 4. PointLoc improves on the LiDAR point cloud retrieval-based approach PointNetVLAD by 46.47% in translation and 70.45% in rotation, which demonstrates the effectiveness of our proposed method. As seen from Table VIII, PointLoc can satisfy real-time operation and its storage footprint is small. The inference time is around 0.1 s, which means that real-time localization is achievable at 10 Hz with this system. This indicates that PointLoc is better suited to relocalization than existing point cloud retrieval-based approaches. Moreover, PointLoc improves on DCP with PointNet by 29.46% in translation and 37.80% in rotation, which reveals that the proposed embedding neural network can effectively learn meaningful features for relocalization. For DCP with DGCNN, the whole neural network is difficult to train and requires large computational resources due to the large number of points and the nature of the graph architecture. Furthermore, the proposed PointLoc consistently outperforms the camera relocalization baselines by a large margin. Compared with the best-performing deep camera relocalization method, AtLoc, PointLoc improves by 66.86% in translation and 78.83% in rotation. These results demonstrate that using LiDAR point clouds instead of RGB images as the sensory input can significantly improve relocalization accuracy. Our pose predictions are even better than those of sequential approaches like MapNet. In addition, the variance of learning-based camera relocalization is much larger than that of our approach. PointLoc therefore produces stable estimates across the test dataset, which indicates that point cloud relocalization is more robust than visual relocalization.

D. Results on a Real-World Indoor Robot
We also validate our proposed PointLoc on the real-world indoor LiDAR-visual sensors dataset. Our experimental design simulates real-world scenarios of robot movement, like service robots inside a large shopping mall. The robot moved forward and backward, halting when it faced obstacles. Data was collected under three conditions: static environment, one person walking, and two persons walking. We named the collected dataset vReLoc since it was acquired in a Vicon room for the indoor relocalization task. It includes a total of 18 robot movement sequences in an indoor Vicon environment. The median errors and trajectories of test results using our PointLoc are plotted in Fig. 5. The results demonstrate that PointLoc can be successfully applied to real-world indoor scenarios for LiDAR sensor localization.
Fig. 4: Trajectories of DCP, MapNet, AtLoc and the proposed PointLoc on FULL7 and FULL8 with mean translation error (m) and rotation error (°). The dark orchid line shows the ground-truth poses, and the orange-red dotted line shows the estimated poses. Our PointLoc outperforms the existing LiDAR and camera relocalization approaches by a significant margin.

E. Ablation Study
To explore the impact of the different components of PointLoc, we report ablation studies in Table IX. In each ablation experiment, we keep the architecture the same as PointLoc except for one change: we remove the self-attention module (w/o SA), sample 4096 points from the raw point clouds (4096 Points), or use two fully-connected layers to predict 6-DoF poses directly (Pose Regressor). We report results on the Oxford Radar RobotCar dataset. Without the self-attention module (w/o SA), the performance decreases by 7.77% in translation and 19.75% in rotation. This indicates that the self-attention module indeed enhances relocalization accuracy. Meanwhile, PointLoc improves on the same architecture with 4096 points (4096 Points) by 1.4% in translation and 9.09% in rotation, which demonstrates that sampling more points can improve localization performance. Furthermore, PointLoc improves on the architecture with a two-layer fully-connected pose regressor (Pose Regressor) by 73.83% in translation and 84.88% in rotation, which reveals the effectiveness of our two-branch Multi-Layer Perceptron (MLP) design.

VII. CONCLUSION
This paper presents a novel LiDAR sensor relocalization approach, PointLoc, based on deep learning. Leveraging a point-based neural network, it achieves better relocalization accuracy than previous LiDAR and visual sensor-based relocalization approaches. The approach can be applied to large-scale relocalization and robot navigation scenarios with meter-level localization requirements. It can also be leveraged in indoor environments or urban areas full of high-rise buildings as a complement when GNSS is absent. In the future, relocalization accuracy could be improved further, for example by eliminating noisy point features from the point cloud or by exploiting intensity information.