CorAl: Introspection for Robust Radar and Lidar Perception in Diverse Environments Using Differential Entropy

Robust perception is an essential component to enable long-term operation of mobile robots. It depends on failure resilience through reliable sensor data and preprocessing, as well as failure awareness through introspection, for example the ability to self-assess localization performance. This paper presents CorAl: a principled, intuitive, and generalizable method to measure the quality of alignment between pairs of point clouds, which learns to detect alignment errors in a self-supervised manner. CorAl compares the differential entropy in the point clouds separately with the entropy in their union to account for entropy inherent to the scene. By making use of dual entropy measurements, we obtain a quality metric that is highly sensitive to small alignment errors and still generalizes well to unseen environments. In this work, we extend our previous work on lidar-only CorAl to radar data by proposing a two-stage filtering technique that produces high-quality point clouds from noisy radar scans. Thus we target robust perception in two ways: by introducing a method that introspectively assesses alignment quality, and by applying it to an inherently robust sensor modality. We show that our filtering technique combined with CorAl can be applied to the problem of alignment classification, and that it detects small alignment errors in urban settings with up to 98% accuracy, and with up to 96% if trained only in a different environment. Our lidar and radar experiments demonstrate that CorAl outperforms previous methods both on the ETH lidar benchmark, which includes several indoor and outdoor environments, and on the large-scale Oxford and MulRan radar data sets for urban traffic scenarios. The results also demonstrate that CorAl generalizes very well across substantially different environments without the need for retraining.


I. INTRODUCTION
For mobile robots to be truly resilient to possible failure causes during long-term hands-off operation in difficult environments, robust perception needs to be addressed on several levels. Robust perception depends on failure resilience as well as failure awareness. Several stages of the perception pipeline are affected: from the sensory measurements themselves (sensors should generate reliable data also under difficult environmental conditions), via algorithms for registration, mapping, localization, etc., to introspection (by which we mean self-assessment of the robot's performance).
This paper presents novel work on robust perception that addresses both ends of this spectrum, namely CorAl (from "Correctly Aligned?"); a method to introspectively measure and detect misalignments between pairs of point clouds. In particular, we show how it can be applied to range data both from lidar and radar scanners; which means that it is well suited for robust navigation in all-weather conditions.
Lidar sensing is, compared to visual sensors, inherently less affected by poor lighting conditions (darkness, shadows, strong sunlight). Radar is furthermore unaffected by low visibility due to fog, dust, and smoke. However, radar as a range sensor has rather different characteristics than lidar, and interpreting radar data (e.g., for localization) has been considered challenging due to high and environment-dependent noise levels, multi-path reflections, speckle noise, and receiver saturation. In this work we demonstrate how radar data can be effectively filtered to produce high-quality point clouds, and how even small misalignments can then be reliably detected.
Consequently, it is essential to equip these methods with failure awareness by measuring alignment quality, so that misaligned point clouds can be rejected or re-aligned. While a number of measures of alignment quality already exist, it is typically not easy to set a threshold for detecting poor alignments that can be used across different environments, as we also demonstrate in our experimental validation (see Secs. IV and V). A particular benefit of the CorAl method is that it generalizes well: if it has been trained in one environment, the same parameters can be used in other, unseen, environments.
Some examples of methods that have been used in practice to assess the alignment quality include point-to-point or point-to-plane distances [9], [10], point-to-distribution [11], [12] or distribution-to-distribution [13], [14] likelihood estimates, mean map entropy [15] or dense radar-image comparison [16]. However, except for some notable exceptions [12], [17], few studies in the literature have specifically and methodically targeted the measurement of alignment correctness.
Our method, CorAl, is well-grounded in information theory and gives an intuitive alignment correctness measure. Fig. 1 shows the general outline of the method, and Fig. 2 illustrates the output per-point alignment measure for a pair of point clouds.

Fig. 1: CorAl, depicted in blue, operates on a pair of point clouds P_a, P_b and can classify misalignment (y_pred) by comparing the differential entropy in the point clouds separately and jointly. Additionally, CorAl outputs a per-point quality measure q that highlights misaligned parts.

CorAl computes the average differential entropy in two point clouds, exploiting the difference when the entropy is computed for each point cloud separately, compared to the union of the point clouds. For well-aligned point clouds, the joint and the separate point clouds will have similar entropy. In contrast, misaligned point clouds tend to "blur" the scene, which can be measured as an increase in joint entropy as depicted in Fig. 3. A key idea is to estimate the entropy inherent in the scene from the entropy in the separate point clouds, which enables CorAl to accurately assess quality in a range of different environments. In short, our previous contribution was an intuitive and simple measure of alignment correctness between point cloud pairs that generalizes across environments and highlights regions of misalignment. This paper extends our previous work on CorAl [17] by showing how it can be applied to FMCW (frequency-modulated continuous wave) radar data, thus pushing further towards truly robust perception, and includes quantitative evaluations on two new large-scale benchmarks and four new baselines. In summary, we present the following new contributions:
• The first investigation (to our knowledge) that systematically evaluates alignment correctness classification using radar-based feature extraction and quality metrics without the aid of auxiliary sensors.
• Four new baselines based on recent research within spinning radar odometry.
• A novel radar filtering strategy that allows CorAl to operate on noisy radar data, enabling alignment classification of small errors with high accuracy compared to previous radar feature extraction methods in urban traffic settings.
• An ablation study that investigates parameter importance and how practical factors such as error magnitude and variation in distance between scans influence classification performance.
• A cross-environment study that demonstrates that CorAl generalizes to new environments without retraining.

II. RELATED WORK

A. Cost functions for scan registration
In practice, it is common to use the cost function of a scan registration method to estimate the alignment quality after registration.
One well-used alignment measure is the root-mean-squared (RMS) point-to-point distance, truncated by some outlier rejection threshold. This is also the function that is minimized by iterative closest point (ICP) registration [10], [18].
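As an illustration, the truncated RMS point-to-point measure can be sketched as follows. This is a minimal NumPy sketch; the function name, the default threshold, and the brute-force nearest-neighbor search are our own illustrative choices, not part of the referenced methods:

```python
import numpy as np

def truncated_rms_p2p(P_a, P_b, trunc=1.0):
    """Truncated RMS point-to-point distance between point clouds
    P_a, P_b (arrays of shape (n, 3)). Residuals above `trunc` are
    rejected as outliers before the RMS is taken."""
    # nearest-neighbor distance in P_a for every point in P_b (brute force)
    d = np.min(np.linalg.norm(P_b[:, None, :] - P_a[None, :, :], axis=2), axis=1)
    d = d[d < trunc]                  # outlier rejection threshold
    return np.sqrt(np.mean(d ** 2)) if d.size else np.inf
```

As discussed below, the resulting score depends strongly on the choice of `trunc` and on the environment, which is precisely why it is a weak classifier of alignment correctness.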
(a) P_a colored by entropy.
(b) P_b colored by entropy.
(c) Correctly aligned P_a ∪ P_b colored by quality measure. In contrast to (a) and (b), red color indicates misalignment.
(d) Misaligned P_a ∪ P_b colored by quality measure when a small misalignment is added.

However, this measure has been shown to be highly sensitive to the environment and to the choice of outlier threshold [12], [19] when trying to detect small errors. Consequently, it is a poor measure for alignment correctness classification. Another common alternative is the point-to-line distance [9], [10] or, as a generalization, a point-to-distribution [11], [20] or distribution-to-distribution [13] measure, where local surface descriptors are computed using the spatial distribution of points within a neighborhood. In a similar vein, Liao et al. [14] propose distribution-to-distribution registration based on fuzzy clusters, and estimate coarse alignment quality via the dispersion and disposition of points around fuzzy cluster centers. These methods have also been shown to generalize poorly for assessing 3D lidar scan alignment in different environments [17].

B. Fault detection and uncertainty estimation
Going beyond merely setting a threshold for the registration cost function, there are also some methods that have been devised specifically for fault detection and uncertainty estimation.
Huan et al. [21] presented a failure detection method based on logistic regression, using point cloud overlap, differences between 2D point cloud projections, and the mean and deviation of nearest-point distances and normal directions. They show that various metrics can be combined for increased accuracy; however, the work does not explicitly focus on detecting small errors and instead includes errors of up to 5 m.
Makadia et al. [22] use consistency between normals ("plane-to-plane") as a post-registration alignment measure, where normals are computed from voxelized versions of two point clouds. In an independent evaluation [12], this method was found to perform poorly in unstructured outdoor environments.
Nobili et al. [7] proposed a method to predict alignment risk prior to registration by combining overlap information and an alignment metric that quantifies the geometric constraints in the registration problem. Their alignment metric is based on point-to-plane residuals and has been evaluated in structured scenes with planar surfaces. In contrast, the method we propose in this paper can operate well even in unstructured environments. Additionally, our method seeks to estimate the alignment after registration has been completed to introspectively measure the registration success, as opposed to predicting the risk prior to registration.
Akai et al. [23] estimate reliability of vehicle localization from grid map and laser scan data, using a convolutional neural network (CNN). In our work, we are interested in alignment classification without prior knowledge for pairs of point clouds. Aldera et al. [24] learn detection of odometry failures in challenging conditions using weak supervision from IMU or GPS. Their method analyzes eigenvectors of a pairwise compatibility matrix which contains scores between point correspondences. Finally, Sundvall and Jensfelt [25] propose a method for fault detection with redundant positioning systems. In contrast, our method operates on lidar or radar without the need for any additional sensors during deployment and uses odometry or optionally ground truth during training.
A family of methods assessing alignment uncertainty based on an estimate of the pose covariance can be found in [11], [26]-[31]. Some of these use a Monte Carlo method and estimate uncertainty by sampling registrations in a region [27]; for mobile robotics, such sampling strategies are tedious and impractical. Other methods compute the covariance in closed form, based on the Hessian of the quality metric [11], [30], or from pose samples weighted by dense correlation [16], [31]. However, it is not generally possible to define a fixed set of covariance thresholds to distinguish good from bad alignments.
Bogoslavskyi and Stachniss [32] define a quality metric that takes into account free-space information, and use it to measure alignment error between range images of segmented 3D objects in a controlled experiment. Rather than focusing on objects, our method aims to classify the alignment quality of observed scenes in different environments. Additionally, their method operates on range images, which might not be easily available, while our method operates on unorganized point clouds.
Some methods work on the scale of a full map, rather than individual scans. Chandran-Ramesh and Newman [33] convert a point cloud map into plane patches and train a conditional random field to detect plausible and suspicious plane configurations. This method is not directly applicable for assessing pairwise point cloud alignment. Droeschel and Behnke [15] compute mean map entropy to measure the "crispness" of a point cloud map, with the aim to evaluate accuracy of scan registration in lieu of ground truth pose data. Our work is inspired by this measure but, as further detailed and demonstrated below, mean map entropy does not generalize between structured and semi-structured environments.

C. Feature extraction and quality assessment for spinning radar
Lidar is a well-investigated sensor modality in robot perception. Spinning FMCW radar is an alternative modality that has been receiving more attention in recent years, due to its resilience to low-visibility conditions. However, due to its challenging noise characteristics, how to efficiently interpret the data for robot perception is still considered an open research question. Hence, in our investigation of radar we include both feature extraction and quality assessment. Radar-based methods that use alignment quality measures can be categorized into dense methods [16], [34], [35], which operate on raw radar images and do not explicitly perform data association, and sparse methods [1], [36]-[43], which compute alignment quality using keypoint locations, shapes and descriptors over a correspondence set. Previous sparse methods use (weighted) point-to-point [39], [41], [42], point-to-distribution [43] and point-to-line [1] metrics. Keypoints can be extracted via SURF, blob detection [36], gradient-based feature detectors [41], [42], by a set of oriented surface points [1] or distributions [43] using a grid-based approach, or by semi-supervised [39] and unsupervised [39], [40] deep learning methods.
While these methods for feature extraction and alignment quality have been used as objective functions for the purpose of estimating odometry, it is currently not known to what extent these metrics can be used for alignment correctness classification. Previous work shows that the performance of deep learning-based odometry methods decreases in new environment types [16], [38]. However, some methods have been successful in estimating odometry and performing SLAM in widely different environments without deep learning, and even without parameter tuning [1], [36]. In this paper, we show that learning-free feature extraction methods can produce stable keypoints over multiple environment types, and are suitable for alignment correctness classification in diverse environments. Hence, we can forego (deep) learning for feature extraction and only use logistic regression at the classification stage to learn linear decision boundaries, which ultimately only requires two parameters to be trained.

D. Comparative studies of alignment assessment
Most of the methods above have been used in a more or less ad-hoc manner. Few systematic evaluations and comparative studies have investigated their general capability for the task of alignment classification between point cloud pairs, i.e., to detect aligned vs. misaligned ones. Two evaluations in this direction were presented by Almqvist et al. [12] and in our previous work CorAl [17], where a range of quality metrics was used to train classifiers using logistic regression. Almqvist et al. explored alignment classifiers based on point-to-point distances as well as a number of other methods [10], [11], [19], [33], [44], [45], and investigated how to combine the measures with AdaBoost into a stronger classifier. The classifiers were evaluated on two outdoor data sets, and although the best ones reached almost 90 % accuracy for the hardest cases on each data set individually, accuracy drops to around 80 % when cross-evaluating between the data sets. In their evaluations, the NDT score function [11] proved to be the best individual measure for alignment assessment. The combined AdaBoost classifier did not have significantly higher accuracy, but reduced parameter sensitivity. Our previous work confirms the finding of Almqvist et al.: detecting small alignment errors in a diverse range of environments without retraining is challenging.

III. CORAL METHOD
Our work takes inspiration from Droeschel and Behnke [15], who used differential entropy measurements to assess map quality. The differential entropy measures the uncertainty, or surprisal, of a continuous variable; in their case, 3-dimensional variables X ∈ R^3 with multivariate Gaussian distribution X ∼ N_3(µ, Σ). Distributions are approximated by the sample mean and covariance of the local point distribution within a radius r. In their work, they compute the "Mean-Map-Entropy" (MME), defined as the average per-point differential entropy over a set of point clouds, and use the measure to compare lidar map refinement algorithms without the need for ground truth measurements and without imposing planarity assumptions. As the entropies measure surprisal, everything else being equal, a lower MME can be interpreted as less surprisal, i.e., crisper point clouds, and indicates the success of map refinement. While MME is suitable for, and has been used to, quantify relative alignment improvement [46], [47], the metric additionally depends on sensor noise and sample density, and is highly dependent on scene geometry. Consequently, and as confirmed by our evaluation, the metric does not generalize over, e.g., structured and semi-structured environments. For that reason, we overcome sensitivity to variation in scene geometry by making use of dual entropy measurements, computed 1) in the point clouds separately and 2) in the joint point cloud. Intuitively, when joining a well-aligned point cloud pair, the entropy (or the blur) found in the joint point cloud should not increase compared to the entropy found in the separate point clouds, but instead remain close to constant. In contrast, when joining point clouds with small misalignments, the joint entropy (blur) tends to increase compared to the separate entropy. By making use of dual differential entropy measurements, our method can account for the scene appearance and detect small alignment errors.
Additionally, the measurements enable generalization to substantially different environments without retraining.

A. Assumptions and definitions
CorAl operates on 2D or 3D point cloud pairs P_a, P_b acquired by a range sensor such as lidar or radar. Points p_k ∈ P are given in a common fixed world frame, with p_k ∈ R^2 or p_k ∈ R^3. CorAl learns a linear decision boundary to separate aligned from misaligned point cloud pairs based on features extracted without learning. During the learning phase, point clouds are assumed to be correctly aligned by an accurate ground truth or odometry system, and to be free of motion distortion. We aim to detect small alignment errors: detecting large errors (e.g., > 1 m in large-scale urban environments, or > 0.4 m indoors) is generally not considered challenging for traditional metrics such as point-to-point or point-to-line, and is thus not the focus of our work. For notation, we define the joint point cloud P_j = P_a ∪ P_b; i.e., all points in P_a and P_b together.

B. Joint and separate entropy measurements
Fig. 4: (a) Top view of the first two aligned point clouds P_a (blue) and P_b (red) in the ETH "stairs" dataset.

In order to compute the differential entropy around a single point, we first compute the sample covariance Σ(p_k) from all points within a radius r around p_k. The differential entropy can then be computed from the determinant of the sample covariance according to

h(p_k) = 1/2 ln((2πe)^N det(Σ(p_k)))   (1)

for a multivariate normal distribution with dimension N. The property of this metric can be visually understood from Fig. 2a, where each point is colored according to the differential entropy. The metric serves as a geometric descriptor that describes the local geometry with a single value: the differential entropy. The MME can then be computed by averaging Eq. 1 over the joint point cloud P_j, i.e., computing the sum and then dividing by the number of terms, as shown in Eq. 2 and Eq. 3.
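Eq. 1 can be sketched numerically as follows. This is a minimal NumPy sketch; the default radius and the guard against degenerate neighborhoods are our own illustrative choices:

```python
import numpy as np

def differential_entropy(cloud, p_k, r=0.5):
    """Per-point differential entropy h(p_k) = 1/2 ln((2*pi*e)^N det(Sigma(p_k))),
    where Sigma(p_k) is the sample covariance of all points in `cloud`
    within radius r of p_k (Eq. 1)."""
    N = cloud.shape[1]
    nbrs = cloud[np.linalg.norm(cloud - p_k, axis=1) < r]
    if len(nbrs) <= N:               # too few points for a full-rank covariance
        return None
    det = np.linalg.det(np.cov(nbrs, rowvar=False))
    if det <= 0.0:                   # ill-conditioned neighborhood (see Sec. III-C)
        return None
    return 0.5 * np.log((2 * np.pi * np.e) ** N * det)
```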
Fig. 4: (b) CorAl measure Q(P_a, P_b), Eq. (5), visualized by color for various (x, y) displacements.

H(P) = Σ_{p_k ∈ P} h(p_k)   (2)

H_joint = H(P_j) / |P_j|   (3)

where |P_i| is the number of points in the point cloud P_i. We extend this formulation by additionally computing the entropy in P_a and P_b separately:

H_sep = (H(P_a) + H(P_b)) / (|P_a| + |P_b|)   (4)

and obtain our quality metric by subtracting the average separate differential entropy from the joint:

Q(P_a, P_b) = H_joint − H_sep   (5)

The CorAl quality metric is hence the difference in differential entropy. The CorAl quality can also be given on a per-point level by

q(p_k) = h_j(p_k) − h_i(p_k)   (6)

where i is the point cloud (a or b) that p_k originates from. While the differential entropy in Eq. 1 varies according to the shape of the environment, the per-point entropy difference in Eq. 6 has the property of being close to zero when point clouds are well aligned, and increases with misalignment as depicted in Fig. 2c. The corresponding function surface of the quality measure Q(P_a, P_b) is depicted in Fig. 4, which demonstrates how the score increases when displacing one of the point clouds in translation and orientation. Well-aligned point clouds P_a ∪ P_b acquired in structured environments have low differential entropy for most query points p_k. This is reflected by low values of the determinant of the sample covariance. As the determinant can be expressed as the product of the eigenvalues of the sample covariance, det(Σ(p_k)) = λ_1 λ_2 λ_3, we see that the measure is sensitive to an increase in the lowest of the eigenvalues when the larger eigenvalues are constant. For example, points on a planar surface are represented by a flat distribution with two large eigenvalues (λ_1, λ_2) and one small (λ_3). Misalignment changes the point distribution in the joint point cloud from flat to ellipsoidal, which can be observed as an increase of the smallest eigenvalue λ_3. This makes the measure sensitive to misalignment of planar surfaces while generalizing well to other geometries.

Fig. 3: (a) Aligned point clouds: differential entropy distributions are similar. (b) Misaligned point clouds: joint differential entropy is higher.

As shown in the evaluation, the CorAl measure can capture discrepancies between point clouds regardless of whether these are due to rigid misalignments or distortions that can occur when scanning while moving (e.g., because of vibrations or sensor velocity estimation errors). This means that the method may also be overly sensitive when used together with a registration method or odometry framework that does not compensate for motion distortion or has low accuracy.
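Putting Eqs. 1-5 together, the dual entropy comparison can be sketched as follows. This is a NumPy sketch with brute-force neighbor search; the default radius is an illustrative choice, ill-conditioned neighborhoods are simply skipped (so the averages are approximate), and the circle-shaped toy clouds mentioned below are our own example:

```python
import numpy as np

def mean_entropy(cloud, query, r=0.3):
    """Average per-point differential entropy of `query`, with covariances
    estimated from `cloud` (Eqs. 1-4); skips degenerate neighborhoods."""
    N = cloud.shape[1]
    hs = []
    for p in query:
        nbrs = cloud[np.linalg.norm(cloud - p, axis=1) < r]
        if len(nbrs) <= N:
            continue                      # too few points for a full-rank covariance
        det = np.linalg.det(np.cov(nbrs, rowvar=False))
        if det > 0.0:
            hs.append(0.5 * np.log((2 * np.pi * np.e) ** N * det))
    return float(np.mean(hs))

def coral_quality(P_a, P_b, r=0.3):
    """CorAl quality Q = H_joint - H_sep (Eq. 5): close to zero for
    well-aligned pairs, larger when misalignment blurs the joint cloud."""
    P_j = np.vstack([P_a, P_b])
    H_joint = mean_entropy(P_j, P_j, r)
    # count-weighted average of the separate entropies approximates Eq. 4
    H_sep = (len(P_a) * mean_entropy(P_a, P_a, r)
             + len(P_b) * mean_entropy(P_b, P_b, r)) / len(P_j)
    return H_joint - H_sep
```

For example, two scans of the same circular contour give a quality near zero when overlaid, while shifting one copy sideways inflates the joint entropy and hence Q, mirroring Fig. 3.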

C. Dynamic radius selection and outlier rejection
For well-aligned point clouds, the quality measure Q has values close to zero. In this case, the distributions of per-point differential entropies in the joint and separate point clouds are similar. The per-point entropy distributions are depicted in Fig. 5a for a set of aligned and misaligned point cloud pairs when using a fixed radius for computing entropies. Unfortunately, the entropy in Eq. 1 is numerically unstable when the determinant of the covariance det(Σ(p_k)) is ill-conditioned: a small increase of the determinant causes a large increase of the entropy. Accordingly, the lowest measured entropies can increase substantially (which mistakenly indicates misalignment) even when joining well-aligned point clouds, as depicted in Fig. 5a.
One case where the entropy mistakenly increases is when a 3D lidar observes floor regions with sparse ring patterns (due to the spacing between vertical beams of the sensor). In the separate point clouds, the sparse rings give rise to ellipsoidal covariances (low entropy), with high uncertainty along the ring direction and low uncertainty in the other directions. In the joint point cloud, the sparsity between rings is reduced; the computed covariances instead become planar, following the floor plane, with two high eigenvalues and one low, and therefore have a higher entropy.
Ill-conditioned covariances occur where point density is low, typically for solitary points or far from the sensor where the radius r is not large enough, e.g., to include multiple "lidar rings" within one entropy measurement, as in the example described above.
The ill-conditioned entropies can be mitigated by increasing the radius r, or by making use of the options described below. For a set of point cloud pairs, the parameters are typically well calibrated (and ill-conditioned entropies have been mitigated) if CorAl can separate between aligned and misaligned point clouds; this occurs when the joint entropies H_joint and separate entropies H_sep are linearly separable. The parameters can be calibrated by maximizing the ratio between the average quality of misaligned and aligned training pairs:

ratio = Q_misaligned / Q_aligned   (7)

A large ratio indicates that the measure is able to discriminate between aligned and misaligned point clouds.
We propose three optional strategies to address ill-conditioned covariances due to variations in sampling density originating from the sensor. Option (1) is specifically intended to address the characteristic ring pattern that produces variations in sparsity. Option (2), which uses a dynamic radius, can benefit both spinning lidar and radar, where sparsity increases with range. Option (3) is intended mainly for 3D point clouds.
(1): Eq. 1 is modified to h_i(p_k) = 1/2 ln(2πe det(Σ(p_k)) + ε), where ε limits the lowest possible entropy. This makes sure that the entropy is similar for points distributed along a line and along a plane. The improvement can be seen by comparing the first and second columns in Fig. 6.
(2): A dynamic radius enables the quality measure to include more points far from the sensor and to correctly detect alignment and misalignment at large distances, as depicted in the third column of Fig. 6 (c & g). The radius r is chosen based on the distance d between the point p_k and the sensor location, to account for the fact that point density decreases with distance. The radius is selected as r = d sin(α), clamped to the range r_min < r < r_max, where α is the vertical resolution of the sensor provided by the data sheet. For other sensor types, e.g. RGB-D, the resolution can be chosen similarly according to the angular sensing resolution.
(3): Remove the E_reject percent of points p_k with the lowest entropies. The effect is depicted in the rightmost column of Fig. 6.
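Options (1) and (2) can be sketched as follows; option (3) is a plain percentile rejection and is omitted here. The values of ε, α, r_min and r_max are illustrative placeholders, not the calibrated values used in the experiments:

```python
import numpy as np

def regularized_entropy(Sigma, eps=1e-5):
    """Option (1): entropy floor, h = 1/2 ln(2*pi*e*det(Sigma) + eps), so a
    near-singular covariance cannot drive the entropy toward minus infinity."""
    return 0.5 * np.log(2 * np.pi * np.e * np.linalg.det(Sigma) + eps)

def dynamic_radius(d, alpha_deg=2.0, r_min=0.3, r_max=2.0):
    """Option (2): grow the neighborhood radius with sensor distance d,
    r = d*sin(alpha), clamped to [r_min, r_max], where alpha approximates
    the vertical beam resolution of the sensor."""
    return float(np.clip(d * np.sin(np.radians(alpha_deg)), r_min, r_max))
```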

D. CorAl for spinning radar
Given the presented CorAl approach, this section describes the additional steps used to enable CorAl to operate on spinning radar data. It contains a short introduction to the format of spinning radar data and proposes a novel feature extraction method that produces a high-quality point cloud from radar data.
Spinning FMCW radar produces 360° sweeps in polar coordinates, as seen in Fig. 7. The raw data is represented as a matrix Z ∈ R^{N_a×N_r}. The radar outputs N_r range bins (the number of columns in Fig. 7); given the max range R, the range resolution is γ = R/N_r. Likewise, the radar provides N_a azimuth bins (rows in Fig. 7). Each pixel (a, r), with a ∈ {1..N_a}, r ∈ {1..N_r}, holds the reflected intensity and can be converted into Cartesian space as

(x, y) = (γ r cos θ, γ r sin θ)   (8)

where γ is the range resolution of the radar and the azimuth angle θ can be computed via θ = 2πa/N_a.
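The conversion can be sketched as follows; the default values of N_a and γ below are illustrative placeholders, not the specification of any particular sensor:

```python
import numpy as np

def polar_to_cartesian(a, r, N_a=400, gamma=0.05):
    """Convert a radar bin (a, r) to Cartesian coordinates in the sensor
    frame: theta = 2*pi*a/N_a, (x, y) = (gamma*r*cos(theta), gamma*r*sin(theta)).
    gamma is the range resolution in meters per bin."""
    theta = 2.0 * np.pi * a / N_a
    return gamma * r * np.cos(theta), gamma * r * np.sin(theta)
```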

E. Computation of Radar Intensity Peak (RIP)-features
From raw radar data as seen in Fig. 7, the goal is to produce accurate point clouds suitable for alignment classification. We build on top of the radar filter "k-strongest" [1], which efficiently computes a mask that removes noise and keeps significant features useful for odometry estimation. For each azimuth, over all N_r range bins, the k highest-intensity bins that additionally exceed the expected noise level z_min are selected. The noise level z_min mitigates speckle noise and uncertain or false detections in the absence of real obstacles. The limitation of k returns per azimuth complements z_min by mitigating multi-path reflections and receiver saturation, under the assumption that true landmarks have higher intensity.
This method efficiently provides a mask in regions around true landmarks and generalizes well across environment types (the same k and z_min values work well in our odometry pipeline [1] for road as well as underground applications). However, it does not accurately reconstruct landmark surface locations. We tested this representation and found that it is unsuitable for detecting small misalignments. For that reason, we propose an additional step that further analyzes masked regions and computes stable features located on Radar Intensity Peaks (RIP-features). We aim to efficiently and accurately detect surface locations by finding peaks within masked regions where intensity is consistently high over a local neighborhood. To do so, we combine non-maximum suppression on the 1D intensity-range signals with a region strength criterion. Azimuth bins are analyzed independently, without considering neighboring azimuth bins. For each range bin within the masked radar image, and neighboring range bins outside the masked region, the region strength is computed as the average intensity within a window (of size w) of neighboring range bins. Second, we select all range bins where the region strength exceeds that of neighboring bins, with the additional criterion that the region strength must exceed the expected region noise floor. The algorithm is formally described in Alg. 1, and typical behavior is depicted in Fig. 8. Structures such as buildings and vehicles appear with higher intensity and give rise to the most stable features; these are good candidates for detecting misalignment.
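A simplified single-azimuth version of this pipeline (k-strongest mask, region strength, non-maximum suppression) might look as follows. This is a sketch under our own assumptions; the parameter values are illustrative, not the calibrated ones, and details of Alg. 1 (e.g., how neighboring bins outside the mask are handled) are simplified:

```python
import numpy as np

def rip_features_1d(intensities, k=12, z_min=60.0, w=5):
    """Sketch of RIP extraction for one azimuth.
    (1) k-strongest mask: keep the k highest-intensity range bins above z_min.
    (2) Region strength S: average intensity over a centered window of w bins.
    (3) Non-maximum suppression: keep masked bins whose region strength is a
        local maximum within the window and exceeds the noise floor."""
    n = len(intensities)
    # (1) k-strongest mask
    order = np.argsort(intensities)[::-1][:k]
    K = {int(r) for r in order if intensities[r] > z_min}
    # (2) region strength via a centered moving average
    S = np.convolve(intensities, np.ones(w) / w, mode="same")
    # (3) non-maximum suppression with noise-floor check
    half = w // 2
    peaks = []
    for r in sorted(K):
        lo, hi = max(0, r - half), min(n, r + half + 1)
        if S[r] >= S[lo:hi].max() and S[r] > z_min:
            peaks.append(r)
    return peaks
```

On a synthetic azimuth with two intensity bumps, this returns one peak per bump, at the bin where the windowed average is locally maximal.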

F. Self-supervised learning of alignment classification
We learn alignment classification based on our quality measure in a self-supervised fashion from an accurate sensor pose signal. In this paper, we use either an external ground truth system or a lidar/radar odometry estimator. For each pair of scans, we compute the quality measure 1) directly and 2) after inducing a (sensor frame) offset in position and orientation on the later scan location. This allows the system to produce positive (aligned) and negative (misaligned) data points for training, and to learn classification boundaries according to the magnitude of the induced errors. To avoid overfitting, and to produce valuable insight into the alignment class separability based on the evaluated score functions, we use simple logistic regression as the classification model:

p = 1 / (1 + e^{−(β_0 + β_1 x_1 + β_2 x_2)})   (9)

where p is the class probability and x_1, x_2 are input variables to which we pass quality measures. For the CorAl quality presented here, we refrain from passing the one-variable quality measure Q = H_joint − H_sep and instead pass the joint and separate entropies as:

x_1 = H_joint,  x_2 = H_sep   (10)

This allows the model to learn the mapping from H_sep and H_joint to alignment probability p implicitly. The model parameters in Eq. 9 (β_0, β_1, β_2) are learned during training. The classification probability threshold t_h can be adjusted after training has been carried out, in order to balance sensitivity against false-positive rate. Increasing the threshold will cause fewer well-aligned pairs to be correctly classified as aligned (decreasing recall), but will reduce the rate of misaligned pairs being falsely classified as aligned (increasing precision). In our experiments we used the default threshold t_h = 0.5. During training, the weights of data points were adjusted inversely proportional to class occurrence to mitigate bias.

IV. EVALUATION ON LIDAR DATA

In this section, we present a quantitative evaluation of alignment quality classification with CorAl for 3D lidar data. An evaluation for 2D radar data follows in Sec. V.
In order to compare the method presented in this paper to previous work on 3D point cloud alignment assessment, we follow the evaluation procedure of [12]. For the results in this section, we use equal portions of aligned and misaligned point clouds, where misaligned point clouds are created by adding an offset to each point cloud pair: an angular offset (e_θ = 0.57° = 0.01 rad) around the sensor's vertical axis and a random translation (x, y) offset at a distance (e_d = 0.1 m) from the ground truth. These errors are large enough to be meaningful to detect in various environments, yet challenging to classify.
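The offset induction above can be sketched in a few lines of numpy. The rotation axis and offset magnitudes follow the text; the function name and everything else is illustrative.

```python
import numpy as np

def induce_offset(points, e_theta=0.01, e_d=0.1, rng=None):
    """Perturb an (N, 3) point cloud in the sensor frame: rotate e_theta rad
    about the vertical (z) axis and translate e_d metres in a random
    direction in the xy-plane, producing a 'misaligned' training sample."""
    rng = np.random.default_rng() if rng is None else rng
    c, s = np.cos(e_theta), np.sin(e_theta)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])
    phi = rng.uniform(0.0, 2.0 * np.pi)          # random translation direction
    t = e_d * np.array([np.cos(phi), np.sin(phi), 0.0])
    return points @ R.T + t

cloud = np.random.default_rng(1).normal(size=(100, 3))
perturbed = induce_offset(cloud, rng=np.random.default_rng(2))
```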
1) Evaluated lidar methods: The evaluated methods (MME, CorAl, CorAl-median, NDT, Rel-NDT and FuzzyQA) are briefly summarized here together with their most important parameters. For CorAl, FuzzyQA and Rel-NDT, two values (that represent the quality measure) are passed to x_1 and x_2; for MME and NDT, a single value is passed to x_1 while x_2 is set to zero. To make the comparison fair, we use a similar radius for NDT, MME and CorAl. a) MME: Mean Map Entropy (MME) as proposed by Droeschel and Behnke [47]. The quality measure corresponds to H_joint from Eq. 3. The parameter is the same radius r used for computing the per-point differential entropy. The MME is passed to the classifier as x_1 = H_joint, x_2 = 0. b) CorAl: as described in Secs. III-B and III-C and in [17]. Parameters are r_min, r_max and α, which determine the nearby-points radius, and E_reject, which sets the outlier rejection ratio and mitigates ill-conditioned entropies. The joint and separate entropies are passed separately to the classifier as x_1 = H_joint, x_2 = H_sep; an intuition is presented in Sec. III-F. c) CorAl-median: H_sep and H_joint are modified to calculate the median entropy rather than the mean entropy; we hypothesize that this modification can be more robust to outliers. Except for this modification, we used the same parameters and methodology as described in the previous paragraph. d) NDT (point-to-distribution normal-distributions transform): The NDT quality describes the likelihood of finding the points in P_b, given the NDT representation of P_a. The method uses the 3D-NDT [48] representation, which constructs a voxel grid over one point cloud and computes a Gaussian from the points in each voxel.
The likelihood of finding the points in P_b is computed as

s = (1/n) Σ_{k=1}^{n} p̄(p_k),

where n is the number of overlapping points, defined as those points that fall in an occupied NDT voxel or in a voxel that is a direct neighbor of an occupied voxel, and p̄ is the probability density function associated with the nearest overlapping NDT cell. This is similar to the "NDT3" variant in [12]. The most important parameter for NDT is the voxel size v, which is set equal to 2r in our evaluation, as this makes the volumes used for the sample covariances of NDT cells and for the entropy in CorAl comparable. The NDT quality s is passed as x_1 = s. e) Rel-NDT: as described in [17]. This variant aims to improve the re-utilization of learned NDT classification parameters when applied to new environments. The idea (similar to CorAl) is that the environment type is reflected in the average entropy of the scene, which can be combined with the NDT score to improve the classification. Similar to MME, we compute the average differential entropy. However, instead of recomputing the sample covariances around each point p_k, we use the covariances directly from the NDT representation of P_a. This is only done in overlapping regions, and no parameter beyond those of NDT is required. The NDT score s and the overlapping NDT differential entropy (NDT-entropy) are passed as x_1 = s, x_2 = NDT-entropy. f) FuzzyQA: FuzzyQA [14] measures the alignment quality by a ratio ρ = AFCCD/AFPCD, where AFCCD and AFPCD are two indexes describing the points' disposition and dispersion around fuzzy cluster centers. In [14], two point clouds are considered to be coarsely aligned if ρ < 1. In this paper, we pass AFCCD and AFPCD separately to the classifier inputs x_1 = AFCCD, x_2 = AFPCD in order to learn a generalizable separator.
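For reference, the per-point Gaussian differential entropy underlying MME (and CorAl's H_joint/H_sep) can be sketched in plain numpy. This is a simplified version of the measure in the text: neighbor search is brute force, and the dynamic radius and outlier rejection of CorAl are omitted; the synthetic clouds are purely illustrative.

```python
import numpy as np

def mean_map_entropy(points, r=0.3):
    """Mean of per-point differential entropies h = 0.5 * ln((2*pi*e)^3 |Cov|),
    where Cov is the sample covariance of each point's neighbours within
    radius r. Points with too few neighbours or near-singular covariances
    are skipped (a crude stand-in for ill-conditioned entropy rejection)."""
    entropies = []
    for p in points:
        nbrs = points[np.linalg.norm(points - p, axis=1) < r]
        if len(nbrs) < 4:                # too few points for a stable covariance
            continue
        det = np.linalg.det(np.cov(nbrs.T))
        if det > 1e-18:                  # reject ill-conditioned entropies
            entropies.append(0.5 * np.log((2.0 * np.pi * np.e) ** 3 * det))
    return float(np.mean(entropies))

# A crisp planar surface yields a lower (more negative) entropy than an
# isotropic blob of the same size: surfaces are "ordered", noise is not.
rng = np.random.default_rng(0)
plane = np.column_stack([rng.uniform(-1, 1, (500, 2)),
                         rng.normal(0, 0.005, 500)])
blob = rng.normal(0, 0.1, size=(500, 3))
```

Misalignment thickens surfaces in the joint cloud, raising H_joint relative to H_sep, which is exactly the signal CorAl exploits.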

A. Qualitative evaluation, live robot data
First, we present qualitative results from real-world data in a warehouse environment. A forklift equipped with a Velodyne HDL-32E 3D laser scanner was manually driven at fast walking speed in the warehouse depicted in Fig. 9. The environment covered in this data set varies from large and open areas with walls in line of sight to small and narrow aisles between shelf racks. To generate ground-truth alignments for the warehouse dataset, we first aligned the point clouds using a scan-to-map approach [8]. We then inspected the alignment between subsequent scans and found that at least 40/484 (8.3%) point clouds were impaired by rigid misalignments or non-rigid distortions from vibrations and motion to the extent that these could be visually located. Alignment classification was then performed on the remaining scans by inducing errors as described in Sec. IV. We used the following parameters, as they provided a relatively high value of Q_s for the first scan pair in the dataset: α = 0.92°, E_reject = 0.2, r_min = 0.2 m, r_max = 1.0 m and voxel size v = 2r_min = 0.4 m. We found that CorAl-mean, MME and NDT reached accuracies of 96%, 70% and 99%, respectively. In this case, NDT performs slightly better than CorAl. We believe that CorAl's relatively lower result is due to the experimental setup: CorAl is highly sensitive to low quality in the training data. In this case, even after removing the worst data points in the ground truth set, the level of vibrations from the worn-out wheels remains high in a large number of the data points used for training and evaluation. This makes it challenging to learn the detection of small errors using the CorAl quality measure. Whether this high sensitivity is the desired behavior depends on the application.

Fig. 9: Data acquired by a truck in a warehouse environment. The sensor trajectory is drawn in red. The environment in the figure is 50 m × 50 m and the sequence length is 102 m. In the first segment of the trajectory, starting at the bottom left, the walls are clearly visible. The final segment is located between aisles, where walls are typically out of sight and the sensor observes more complex structures such as shelves. The truck has slightly oval, worn-out wheels, which introduce a large degree of vibration; hence alignment quality is expected to be low.

B. Quantitative evaluation, ETH benchmark data set
Our main quantitative evaluation of CorAl's performance on 3D lidar point clouds uses the public ETH registration dataset [49]. It contains several sequences representing a wide variety of environments, which serve to evaluate how well CorAl generalizes across different kinds of environments. Specifically, this dataset includes 3 sequences in structured environments (Apartments, ETH Hauptgebaude, Stairs), 3 sequences in semi-structured environments (Gazebo in summer, Gazebo in winter, Mountain plain) and 2 challenging sequences in unstructured environments (Wood in summer, Wood in autumn). In Figs. 10 to 12, the training results for structured environments are shown in blue, those for semi-structured environments in brown, and those for unstructured environments in green. Each sequence contains between 31 and 47 scans. In order to make the evaluation fairer, more realistic, and applicable to real applications, we downsampled the original dense point clouds using a voxel grid of 0.08 m. In all the training experiments, CorAl was run on an Intel Core i7-7820X desktop CPU, achieving an overall run-time of 0.246 ± 0.095 seconds per point cloud pair. Also, since this dataset has less variation in sampling density than the warehouse data employed in Sec. IV-A, we used a fixed radius r = 0.3 m and set E_reject = 20% and α = 0. Finally, the NDT voxel size was set equal to two times the CorAl radius, i.e., v = 2r = 0.6 m. This way, the diameter of influence for CorAl and the width of the NDT cells are similar.
In these conditions, we carried out three different kinds of training, which are aimed at showing how the proposed CorAl method can achieve generalization. Training has been performed with increasing difficulty, as explained below.
1) Separate training: First, we evaluate the capability to learn classification in a specific type of environment. The classifiers were trained and evaluated on each sequence separately, using 5-fold cross-validation. This evaluation serves as a reference for the cross-environment evaluations below. Results are shown in Fig. 10. We found that all methods except FuzzyQA performed well in structured environments. For instance, MME scored around 90-100%, which clearly indicates that even a method that is highly influenced by the environment can successfully assess alignment quality if the environment is structured and does not change substantially. In contrast, we did not expect FuzzyQA to achieve good classification performance, since it is specifically designed to classify coarse alignment.
In the semi-structured and unstructured sequences, only CorAl and CorAl-median performed well, with consistently >90% accuracy, even in the most challenging sequences. All other methods were only slightly better than random, except in the Gazebo sequences. Rel-NDT slightly outperforms NDT, however not consistently. Both NDT methods performed decently (77-90%) in the Gazebo sequences, but poorly in the unstructured Wood sequences (60-65%), indicating that NDT requires at least some structure, or surfaces free from foliage, to be effective as an alignment correctness measure.
2) Joint training: The second test evaluates how the methods are able to learn alignment classification when trained in a variety of environments. To do that, the methods need to be versatile. The training was performed on all the ETH sequences together, and evaluation was then done on each sequence individually. The results are shown in Fig. 11. As can be expected, the accuracy of all classifiers decreased compared to the previous test. CorAl still performs best, with an accuracy of 85-100% in all cases. CorAl-median reached a slightly lower accuracy compared to CorAl. Rel-NDT performed better than NDT in most cases, however not consistently. The generally high accuracy of CorAl indicates that it is possible to find general parameters that make the method valid in a range of substantially different environments.
3) Generalization to unseen environments: The final test evaluates how the classifiers perform in environments with different characteristics than those observed in the training set. We trained and evaluated on disjoint sets of sequences and environment types: the 3 structured environments were used for training and the remaining 5 (semi-structured and unstructured) for evaluation, and vice versa. The classification accuracy is depicted in Fig. 12. When trained on structured and evaluated on semi-structured environments, CorAl performed accurately (85-98%), while the other methods performed close to random, except NDT on Gazebo summer (74%).
No method generalized well when trained on structured data and evaluated on unstructured environments. On the other hand, learning from semi-structured and unstructured environments was enough to afford very high accuracy in structured environments with CorAl, very close to what was attained with joint training on all sequences. The previous joint evaluation shows that it is possible to train a model that is simultaneously accurate in all environment types. Hence, we believe that the reason the classifier trained in a structured environment does not generalize to an unstructured environment is that the model overfits when not trained with sufficiently diverse and challenging data.

V. EVALUATION OF LARGE-SCALE RADAR DATA
In this section, we present a thorough evaluation of the problem of alignment correctness classification using data acquired by different spinning radars. In these experiments, we have employed both CTS350-X and CIR2014-H models by Navtech. We highlight multiple challenges in alignment correctness classification of radar data and consider the impact of practical challenges such as variation of parameters, distance between scans, and error magnitudes. Similar to the generalization training carried out in Sec. IV-B, we again use datasets with different characteristics for training and testing in order to understand how CorAl generalizes across different environment types. We compare our method to recently published radar-specific baselines (Sec. V-A).
Currently, there exist four public datasets for spinning radar localization research: Boreas,¹ Radiate [50], MulRan [51] and the most established Oxford Radar RobotCar dataset [52]. We selected Oxford (Fig. 14) and MulRan (Fig. 15), which both have accurate ground truth positioning and similar sensor range resolutions (0.0432 m and 0.0595 m, respectively). Additionally, MulRan contains a diverse variety of surrounding environment types, including urban, mountain, and fields, which allows us to test how the methods generalize outside of urban environments. In Sec. V-E, we used the sequence "10-12-32" from Oxford and "KAIST02" from MulRan. In all other experiments, each data point and deviation is computed over the sequences 10-12-32, 18-14-14, 18-14-46 and 18-15-20 from the Oxford dataset. Both datasets contain a variety of weather conditions and traffic from vehicles, pedestrians and bikers; hence these sequences constitute realistic urban scenarios.
To make the method evaluation more realistic for urban applications, we generate 4 misalignments symmetrically around each aligned data point: two in the longitudinal direction (forward and backward along the driving direction), where localization uncertainty due to motion and landmarks is generally higher, and two in the lateral direction (left and right), with lower uncertainty. Training weights and evaluation metrics are balanced accordingly.
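The four symmetric misalignments can be generated as in the sketch below. The heading and error magnitude are illustrative parameters, not values taken from the paper.

```python
import numpy as np

def symmetric_offsets(heading, e=0.5):
    """Four 2D translation offsets around an aligned pose: +-e metres along
    the driving direction (longitudinal) and +-e metres perpendicular to it
    (lateral). `heading` is the vehicle yaw in radians."""
    lon = np.array([np.cos(heading), np.sin(heading)])   # longitudinal axis
    lat = np.array([-np.sin(heading), np.cos(heading)])  # lateral axis
    return [e * lon, -e * lon, e * lat, -e * lat]

offsets = symmetric_offsets(0.0, e=0.3)   # heading due "east", 0.3 m errors
```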
Depending on the application and sensor, various error levels can be of interest. E.g., odometry is expected to be more accurate than loop closure and relocalization, and thus requires detection of smaller errors. We produced the position error distribution of the currently most accurate method for radar odometry estimation, CFEAR; the distribution is depicted in Fig. 13. The 0.995 quantile corresponds to a longitudinal and lateral error of 0.29 m and 0.12 m, respectively. Detecting errors above this point (> 0.29 m) would thus be meaningful for the task of odometry. These are larger error levels than those considered in our previous evaluation on lidar (0.1 m) in Sec. IV. Using the same step size between error levels as in [12], we define small, medium and large errors as 0.3 m, 0.5 m and 0.7 m, respectively. These errors are higher than the ones defined for lidar sensors, which is necessary as localization uncertainty is expected to be higher, e.g. due to the larger scale, higher motion and low spinning rate of the radar used in our experiments. For a vehicle moving at 50 km/h with a radar spinning at 4 Hz, the motion distortion from translation alone is 3.5 m between the first and last segment of a scan, unless carefully compensated for.
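The quoted distortion figure follows directly from the sensor timing: during one revolution of a 4 Hz spinning radar, a vehicle at 50 km/h travels v/f metres.

```python
# Distance travelled between the first and last azimuth of one radar sweep.
v_mps = 50.0 / 3.6        # 50 km/h converted to m/s
sweep_time = 1.0 / 4.0    # one revolution of a 4 Hz spinning radar, in s
distortion_m = v_mps * sweep_time
# distortion_m is about 3.47 m, i.e. the ~3.5 m quoted in the text.
```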

A. Evaluated radar methods
Here we give a brief introduction of all compared methods, their feature extraction, quality measures and parameters. In all cases, we have compensated for motion distortion effects as described in [1], and features within a minimum range of 2.5 m are removed in order to discard false detections located on the experimental setup itself. a) Cen2018: A method for extracting radar features and estimating ego-motion proposed by Cen and Newman [41]. We use their method for extracting features from intensity gradients and peaks, with the improved configuration described by Burnett [38]: the probability threshold is increased to z_q = 3.0 and a Gaussian filter with σ = 17 is added, as described in [38]. We do not carry out data association as described in the original publication, but instead perform a radius search and compute the point-to-point quality measure with association radius = 3 m. The sum of point-to-point distances is normalized by the number of points and passed to x_1.
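The radius-search variant of the point-to-point quality used here can be sketched as follows. The function name and test data are illustrative; only the association radius and the normalization follow the text.

```python
import numpy as np

def p2p_quality(pa, pb, radius=3.0):
    """Normalised sum of point-to-point distances: for each point in pb,
    find the distance to its nearest neighbour in pa; distances within the
    association radius are summed and divided by the number of matches.
    Lower values indicate better alignment."""
    dists = []
    for q in pb:
        d = np.linalg.norm(pa - q, axis=1).min()
        if d < radius:
            dists.append(d)
    return sum(dists) / len(dists) if dists else float("inf")

# Illustrative 2D radar point sets.
rng = np.random.default_rng(0)
pa = rng.uniform(-10.0, 10.0, size=(200, 2))
```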
b) CFEAR: the current state of the art in radar odometry estimation [1]. We use the CFEAR feature extraction method with the quality measures (P2P, P2L and P2D). Radar data is first filtered with the k-strongest filter as described in Sec. III-E, after which a grid-based approach estimates a set of oriented surface points for each grid cell that contains points. In our experiments, we used the same parameters as described in [1], except for minor changes to z_min = 60 and radius = 3 m for consistency with Cen2018. For this method, we extend the logistic regression model with a third input dimension x_3 to incorporate more of the available information. Specifically, the absolute score, the number of correspondences and the normalized score are passed to x_1, x_2 and x_3, respectively. c) CorAl-Radar (ours): We set r_min = r_max = 1 m, E_reject = 0 and α = 0. CorAl-Radar extends the previous CorAl parameter set with window size w = 2, k = 12 and z_min = 70 for computing RIP features (Sec. III-E). The latter parameters (k and z_min) have the same meaning as in the method CFEAR and are fixed in our experiments. The joint and separate entropies are passed separately to the classifier as x_1 = H_joint, x_2 = H_sep, as described in Sec. III-F.
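The k-strongest filtering referenced above (Sec. III-E) can be sketched as follows, assuming the raw scan is given as a polar intensity matrix (azimuth × range bin). Parameter names follow the text; the rest is an illustrative simplification.

```python
import numpy as np

def k_strongest(scan, k=12, z_min=60):
    """Sketch of the k-strongest radar filter: for each azimuth of a polar
    intensity image, keep the k range bins with the highest return
    intensity, discarding bins below the threshold z_min.
    Returns a list of (azimuth index, range bin) detections."""
    detections = []
    for a, row in enumerate(scan):
        strongest = np.argsort(row)[-k:]        # k highest-intensity bins
        for r in strongest:
            if row[r] >= z_min:
                detections.append((a, int(r)))
    return detections
```

Such a filter keeps only the most salient returns per azimuth, suppressing the speckle and multipath noise typical of spinning radar.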

B. Method and performance analysis
We start our evaluation by studying the radius parameter. While a low radius is required to detect the smallest alignment errors, a too conservatively chosen radius will make larger errors more challenging to detect. This is because the computation of entropy is only sensitive to the displacement of points within the radius. Hence, the capability to detect errors of different magnitudes is related to the selection of radius, as seen in Fig. 16. Generally, the radius should be chosen larger than the maximum error to detect. However, as confirmed by our evaluation, detecting very large alignment errors (> 1 m) is not a challenging task, and CorAl can therefore be complemented with a quality metric such as point-to-point or point-to-line. We found that a radius of r = 1 m provides a good trade-off and allows the detection of fairly large errors. The runtime performance of CorAl is shown in Tab. I. All timing statistics were computed on an Intel i7-11850H laptop CPU. All representations and quality metrics are efficient to compute (<5 ms single-threaded), except Cen2018, which runs at 14 ms with a multi-threaded implementation.

C. Detecting errors with various magnitudes
We aim to investigate the extent to which errors of different magnitudes, ranging from 0.05 up to 0.9 meters, can be detected. The performance is quantified using accuracy and the area under the ROC curve (AUC). The obtained results are depicted in Fig. 17. In these plots, we report results for two classes of scan pairs with varying distances between scans: scans taken at least 0 m apart (consecutive scans) and scans taken at least 10 m apart. In general, these results show that the proposed CorAl method achieves the best classification accuracy and AUC over the evaluated range when the spacing between scans is low. Under these conditions, none of the other methods reach similar performance. The next-best method is CFEAR-P2L, which robustly detects large translation errors, but only when they are greater than 0.7 meters. When scan spacing is large (10 m), CorAl is still the most accurate for small errors (less than 0.4 meters), although the accuracy achieved is not very high. A summary of these results can be found in Tab. II. Orientation errors are evaluated separately, as depicted in Fig. 18. Such errors are added in the same way as in the evaluation on lidar data, i.e., by adding an angular offset e_θ around the sensor's vertical axis. In general, the CFEAR and Cen2018 quality metrics demonstrate similar or improved performance compared to CorAl for orientation errors. We believe this is because orientation errors displace observations proportionally to the distance of observation. At large distances, the conservative data association of CorAl makes the metric less sensitive, unless combined with the optional parameter that dynamically increases the radius accordingly.
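The distance-proportional effect of orientation errors can be made concrete: a yaw error e_θ moves a landmark at range d along a chord of length 2·d·sin(e_θ/2) ≈ d·e_θ, so distant observations leave a fixed association radius long before nearby ones. A small numeric sketch (ranges and error value are illustrative):

```python
import numpy as np

def rotational_displacement(d, e_theta):
    """Chord length by which a landmark at range d metres is displaced by a
    yaw error of e_theta radians about the sensor's vertical axis."""
    return 2.0 * d * np.sin(e_theta / 2.0)

# The same 0.01 rad error barely moves a nearby point but shifts a distant
# wall by roughly half a metre, well outside a small fixed radius.
near = rotational_displacement(5.0, 0.01)    # ~0.05 m
far = rotational_displacement(50.0, 0.01)    # ~0.5 m
```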

D. Variation in distance between scans
We have also carried out another set of experiments aimed at analyzing the impact of scan spacing distance. In this case, we consider two different levels of translation error (0.3 and 0.6 meters). Detecting small errors from scans separated by large distances is challenging due to changes in scene dynamics, lower overlap, and because sensor characteristics make observed landmarks appear different from different perspectives, a phenomenon previously discussed in the literature [2]. As expected, the results in Fig. 20 show that classification accuracy is reduced for all methods when scan spacing is increased. Despite this, CorAl was able to achieve 87% accuracy for 0.6 m errors at 5 m spacing. Beyond 7 m, CFEAR-P2L surpasses CorAl, whose accuracy decreases more quickly. We believe the worse performance at larger scan spacing is due to radar-specific challenges, as depicted in Fig. 19. When spacing is large, walls appear different due to beam divergence. CorAl is more sensitive to small errors and hence not expected to perform well in this scenario.

E. Generalization across environments
Similarly to the evaluation of generalization capabilities for lidar data (Sec. IV-B.3), we are interested in comparing how the proposed approach classifies the scans when the training and test data sequences are not from the same type of environment. We used the urban Oxford dataset (Fig. 14) and the partly semi-structured MulRan dataset (Fig. 15). The accuracy for Oxford and MulRan is depicted in Fig. 22. We make the same observation as in the lidar generalization experiments in Sec. IV-B: a classifier trained on a semi-structured (more diverse) data set generalizes better.

Tab. II: CorAl generally achieves the best performance (well above 90%, even for small translation errors) when the distance between scans is low. For large scan distances, CFEAR-P2L is the most robust method for any translation error.
When training on the structured data (Oxford) and testing on the semi-structured data (MulRan), the obtained accuracy is 91%, compared with 96% when the training and testing data are switched. If the datasets are considered fully separately, the accuracies for Oxford and MulRan are 97.9% and 95.7%, respectively. ROC curves are provided in Fig. 21, which also illustrates the generalization capabilities.

VI. CONCLUSIONS
In this paper, we presented CorAl: a principled and intuitive quality measure and self-supervised system that learns to detect small alignment errors between pairs of previously aligned point clouds. CorAl uses dual entropy measurements, computed in the separate point clouds and in the joint point cloud, to obtain a quality measure that substantially outperforms previous methods on the task of detecting small alignment errors, both on a benchmarking lidar dataset and on a large-scale urban dataset for spinning radar.
In this work, we proposed a two-step filtering strategy that operates on challenging radar data and produces a high-quality point cloud. By combining our filtering method with CorAl, we were able to detect small alignment errors in urban settings using only a spinning radar. We found the method to be accurate in a wide range of environments, and it can generalize to new, unseen environments without retraining. Using a roof-mounted radar in realistic trafficked urban scenarios, we achieve up to 98% accuracy in the detection of 0.5 m errors when trained in the same environment, and up to 96% accuracy when trained on another environment type. Our experiments on both lidar and radar data demonstrate that CorAl achieves a high level of generalization between structured and semi-structured environments. We also found that learning from more challenging, less structured environments is advantageous for generalization. In our lidar experiments, we even found that CorAl was able to generalize from unstructured (woods) to structured indoor environments. However, none of the evaluated methods generalized well when trained in structured environments only and evaluated in an unstructured environment, and this remains a challenging problem. We believe that the presented system has great potential to serve as an alignment quality tool for point clouds, and that it can improve localization robustness by equipping odometry, relocalization, and loop closure systems with the capability of introspectively detecting small errors in diverse environments.

Fig. 19: Bottom (red) and top (blue) scans acquired with small separation ((a) closely located radar scans) and large separation ((b) distantly located radar scans). When the distance is large, scans incorrectly appear to be misaligned due to the high level of beam divergence in current radar sensors. The impact is higher when landmarks are observed from different angles, which occurs more often when the distance between scans is large.

Fig. 21: Comparison where methods are trained and tested on the same (intra-dataset) and separate (generalization) datasets with varied discrimination threshold. We found that CorAl is the most accurate in both environments regardless of where the method is trained. For our proposed method, the best level of generalization is achieved when training on the semi-structured dataset MulRan and testing on the structured urban Oxford dataset.