1 Introduction

Localisation is a central task for autonomous navigation in large-scale, outdoor environments, and has been a core problem for the mobile robotics community for decades. We distinguish two types of localisation: topological and metric. Topological localisation, also known as place recognition, seeks a rough estimate of the robot’s pose, often without any prior pose information. Given a coarse initial pose estimate from place recognition, metric localisation computes a refined metric pose by registering live sensor data against a prior map.

Fig. 1

Left (a to c): given an overhead image (a), we learn an occupancy image (b) that indicates how likely each pixel is to produce a point return if scanned by a lidar. Empirically, this results in high probabilities for buildings and low probabilities for streets and roads. Ray-tracing the occupancy image from near the centroid results in a pseudo point-cloud (c). Right (d to f): a lidar scan taken nearby, rotation-aligned with the overhead image for better visualisation (d). By ray-tracing the lidar image and keeping only the first return for each azimuth (e), we obtain a point-cloud (f) that can be directly compared against the pseudo point-cloud (c)

Lidar has long been a popular sensor for outdoor vehicle localisation due to its sensing range and invariance to lighting conditions. Both place recognition (Bosse & Zlot, 2013; Dubé et al., 2017; Kim & Kim, 2018; Guo et al., 2019) and metric localisation (Barsan et al., 2018; Baldwin & Newman, 2012; Wolcott & Eustice, 2015; Yew & Lee, 2018) using lidar are well-studied problems, as are methods that target global localisation (Cop et al., 2018; Kim et al., 2019; Sun et al., 2020; Yin et al., 2019) by combining place recognition and metric localisation to output a metric pose, solving the “kidnapped robot” problem end-to-end. Existing methods rely on the assumption that a reliable, up-to-date lidar map is available, which may not be the case. This paper presents an alternative approach that localises a lidar using only off-the-shelf, easily accessible overhead imagery such as Google satellite imagery, which often captures geometric entities also observable by ground lidar scans, providing cues for localisation.

The significant modality difference between overhead imagery and ground range sensors is an immediate challenge. The work in Tang et al. (2020a) tackles the modality difference between satellite imagery and radar by synthesising a radar image from a pair of proximal satellite and radar images. The synthetic image is then in the same domain as the live sensor and can be used for registration. The experimental results using the method in Tang et al. (2020a) are demonstrated for lidar metric localisation against overhead imagery in the authors’ follow-up work (Tang et al., 2020b, 2021a).

The method in Tang et al. (2020a) has two significant limitations. Firstly, it requires geometrically proximal (but not necessarily pose-aligned) pairs of satellite and radar images to address the modality difference for metric localisation; it therefore requires place recognition to have been previously solved by an external system. Secondly, the quality of the image generation in Tang et al. (2020a) is highly dependent on the initial offset, resulting in pose errors that grow disproportionately with larger initial heading offsets. While the methods in Tang et al. (2020a, 2020b, 2021a) address the modality difference by generating synthetic range sensor images from overhead imagery, we take a different approach where both overhead images and lidar scans are converted into a common representation, a collection of 2D points, allowing for the use of standard point-based place recognition and pose estimation networks.

Specifically, given an overhead image, we first learn an occupancy image that indicates how likely each pixel is to induce a range return if scanned by a ground lidar. This results in high probabilities for pixels on buildings and structures and low probabilities for pixels on “free-space” such as roads and sidewalks. We then ray-trace from near the centroid of the occupancy image along each azimuth until the first “occupied” pixel, which emulates the resulting point-cloud as if a lidar near the centroid takes a scan of its surrounding environment. This process is depicted in Fig. 1 together with an actual lidar scan taken at roughly the same position for visual reference.

To the best of the authors’ knowledge, our proposed method is the first to learn the place recognition of a ground range sensor using only publicly available overhead imagery, with no GPS or prior sensor maps at inference time. We evaluate our method for both place recognition and metric localisation on publicly available datasets, showing that we outperform the prior work in Tang et al. (2020a, 2020b) in metric localisation when the initial heading offset is large. Finally, we demonstrate that, although designed for lidar, our method can also be used for radar after applying appropriate domain transfer on the input sensor data.

This article is an extended version of our prior work (Tang et al., 2021b). The improvements include a more thorough explanation of our motivation for using a point-based approach, in contrast with prior image-based works (Tang et al., 2020a, b, 2021a), in Sect. 3.1, and a more in-depth description of our method in Sect. 4. In terms of experimental validation (Sect. 5), we include additional qualitative results and an ablation study on using reduced training data. We also show how registration errors can be consistently reduced using an improved network architecture, and present a new loss function for learning place recognition that leads to higher retrieval accuracy. We further demonstrate that simply using a sequential data stream rather than a single frame for place recognition at test time significantly improves retrieval accuracy without retraining the network. Finally, we include additional results on test data from different sequences than the training data, showing that our method generalises not only to unseen places, but also to sensor data recorded on different days.

2 Related work

2.1 Range sensor localisation against overhead imagery

While many works have been proposed for visual localisation using overhead imagery (Chu et al., 2015; Kim & Walter, 2017; Leung et al., 2008; Noda et al., 2010; Pink, 2008; Senlet & Elgammal, 2011), localising a ground range sensor remains a less explored problem due to the challenging modality difference.

Similar to dense image registration, de Paula Veronese et al. (2015) directly aligned lidar intensity maps onto satellite imagery using Normalised Mutual Information, relying on the accumulation of several scans from accurate odometry. The methods in Kümmerle et al. (2011); Dogruer et al. (2010) pre-process the overhead images by performing edge detection and semantic segmentation, respectively, before comparing against range sensor data. Hussein et al. (2013) localised a vehicle in a forest by matching tree stems detected by an on-board lidar against tree crowns observed in overhead imagery. Carle and Barfoot (2010) localised a robot equipped with a lidar within orbital elevation maps by detecting features from topographic peaks. These methods are examples of hand-crafted pipelines designed for specific setups and may not necessarily generalise to more complex scenarios.

In light of this, RSL-Net (Tang et al., 2020a) proposed to use an end-to-end learning-based strategy for solving the metric localisation between satellite imagery and a ground radar or lidar, remaining free of hand-crafted features or methods. The approach in Tang et al. (2020a) was later extended for self-supervised learning in the authors’ follow-up work (Tang et al., 2020b, 2021a). However, the methods in Tang et al. (2020a, 2020b, 2021a) remain strictly limited to metric localisation, and cannot solve place recognition. Concurrent work by Zhu et al. (2020) proposes to learn a correction and a matching probability between a lidar grid-map and satellite imagery, which are used as factors in a pose graph optimisation to correct for drift in lidar odometry, but does not directly solve the metric localisation problem alone. Recent work in Fervers et al. (2022) learns the registration between ground camera and lidar data and overhead imagery by maximising the correlation in feature space over discrete poses, similarly to Tang et al. (2020a), but requires ground camera data in addition to range sensor scans.

Unlike the prior work in Tang et al. (2020a, 2020b, 2021a); Zhu et al. (2020); Fervers et al. (2022), our proposed method does not require ground camera data, and handles both metric localisation and place recognition between a ground range sensor and overhead imagery.

2.2 Localisation using other publicly available data

Other publicly available resources have also been shown to be useful for vehicle localisation, in particular road and building information from OpenStreetMap (OSM). The methods in Brubaker et al. (2013); Floros et al. (2013) localise by matching odometry trajectories to road paths from OSM. Panphattarasap and Calway (2018) learned a light-weight descriptor to recognise intersections and gaps, and localised by comparing against descriptors of the operating environment built from building information in OSM. Yan et al. (2019) utilised a similar strategy for localisation using OSM but relied only on existing networks trained for point-cloud semantic segmentation, eliminating the need to train new networks for recognising intersections and gaps.

Our method differs from Brubaker et al. (2013); Floros et al. (2013) in that we do not need a sequence of on-board measurements to compute odometry before a solution can be found. Compared to Panphattarasap and Calway (2018); Yan et al. (2019), our method seeks to infer the geometric rather than semantic relationship between on-board sensor measurements and publicly available data, and can therefore perform geometric registration.

2.3 Learning-based point-cloud registration

After the overhead image is expressed as 2D points, we seek to solve for its registration against points from a lidar scan. Learning-based methods for point-cloud registration have been shown to be less prone to converging to local minima than classical approaches such as ICP (Besl & McKay, 1992). The method in Zeng et al. (2017) learns local RGB-D descriptors with 3D convolutions for matching. 3DFeatNet (Yew & Lee, 2018) samples point-clusters using PointNet++ (Qi et al., 2017) and learns local descriptors using a triplet loss on point-clusters for feature alignment. Given a set of feature descriptors, the method in Choy et al. (2020) learns weights for point correspondences for downstream pose estimation, acting similarly in spirit to an outlier-rejection step. DCP (Wang & Solomon, 2019) learns a per-point descriptor using DGCNN (Wang et al., 2019), finds point correspondences by matching learned descriptors, and computes the pose offset using singular value decomposition (SVD). In particular, the descriptors in Wang and Solomon (2019) are optimised using only a loss on pose; the end-to-end differentiability is made possible by taking a softmax rather than a hard max when finding point correspondences. PRNet (Wang & Solomon, 2019) builds upon DCP and allows for partial-to-partial registration by using Gumbel-Softmax for matching and iterative alignment for registration. DeepGMR (Yuan et al., 2020) learns point-to-distribution rather than point-to-point correspondences by representing a point-cloud as a Gaussian mixture during descriptor learning, greatly reducing the complexity for large numbers of points.

We utilise DCP (Wang & Solomon, 2019) for registration and learning descriptors, as it outperforms other methods for our problem in our experiments.

2.4 Deep learning for point-cloud place recognition

Several learning-based methods have been proposed for large-scale point-cloud place recognition. PointNetVLAD (Uy & Lee, 2018) combines PointNet (Qi et al., 2017) and NetVLAD (Arandjelovic et al., 2016) to learn a per-point-cloud global descriptor for retrieval. LPD-Net (Liu et al., 2019) aggregates learned local features to produce a global descriptor for place recognition, outperforming Uy and Lee (2018). PCAN (Zhang & Xiao, 2019) utilises attention to predict each local point feature’s significance and favours task-relevant features when aggregating local features into a global one for retrieval. DH3D (Du et al., 2020) learns local descriptors with a description loss using ground truth point correspondences, which can be used for registration. The local descriptors are then fed to a PointNetVLAD layer with attention to learn a global descriptor for place recognition. Barnes and Posner (2020) learned key-points from radar images for odometry, and pooled local descriptors across spatial dimensions into a global one per image for place recognition. Recently, OverlapNet (Chen et al., 2020) was proposed to detect loop closures based on the overlap between lidar images.

Our approach to learning place recognition is similar to Du et al. (2020) in that we also re-use local descriptors learned for registration when learning the global descriptor. However, rather than applying a description loss, the local descriptors in our case are learned using DCP (Wang & Solomon, 2019) with a loss on the pose. Re-using local descriptors learned for registration ensures the global descriptor is based on geometrically distinctive features, which we know provide valuable cues for place recognition.

3 Overview

We consider the problem of localising a ground lidar using overhead imagery. Specifically, we utilise satellite images from Google Maps, and project 3D lidar point-clouds onto the \(x\)-\(y\) plane from a top-down perspective, producing 2D, “birds-eye” view lidar images where pixels with a non-zero intensity indicate range returns. We discard points with \(z\) values less than 0 when creating the lidar images to remove returns on the ground. We denote satellite images and lidar images as \(\textrm{I}^S\) and \(\textrm{I}^L,\) respectively. Furthermore, we pre-process all overhead and lidar images to have the same resolution (or scale factor), so any registration between lidar data and overhead imagery is in SE(2).
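As a concrete illustration, the following is a minimal sketch (not the authors’ released code) of this birds-eye projection, assuming each scan is an N x 4 array of (x, y, z, intensity) values in metres with the sensor at the origin, and using the 0.4332 m/pixel resolution reported in Sect. 5:

```python
import numpy as np

def lidar_to_bev_image(points, image_size=256, metres_per_pixel=0.4332):
    """Project a 3D lidar scan to a top-down 2D image; non-zero pixels indicate returns."""
    # Discard likely ground returns (z < 0 in the sensor frame).
    points = points[points[:, 2] >= 0.0]

    half_extent = image_size * metres_per_pixel / 2.0
    x, y, intensity = points[:, 0], points[:, 1], points[:, 3]

    # Keep only points that fall inside the image footprint.
    keep = (np.abs(x) < half_extent) & (np.abs(y) < half_extent)
    x, y, intensity = x[keep], y[keep], intensity[keep]

    # Metric coordinates -> pixel indices, with the sensor at the image centre.
    cols = ((x + half_extent) / metres_per_pixel).astype(int)
    rows = ((half_extent - y) / metres_per_pixel).astype(int)

    image = np.zeros((image_size, image_size), dtype=np.float32)
    np.maximum.at(image, (rows, cols), intensity)  # keep the max intensity per pixel
    return image
```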

For place recognition, we consider a set of satellite images queried along known routes within an operating environment, \(\{ \textrm{I}^S_i \}, i=1, 2, \ldots , N_S,\) and lidar images taken across time-stamps during the live traversal, \(\{ \textrm{I}^L_j \}, j=1, 2, \ldots , N_L.\) For each \(\textrm{I}^L_j,\) we seek its nearest spatial neighbour from \(\{ \textrm{I}^S_i \};\) this provides a coarse estimate of the vehicle’s global position, as the latitude and longitude coordinates of each \(\textrm{I}^S_i\) are known. For simplicity, we assume the satellite images are queried along the same routes as the vehicle’s live traversal, and that \(N_S=N_L,\) such that for every \(\textrm{I}^L_j\) there is at least one correct spatial match in \(\{ \textrm{I}^S_i \},\) and vice versa. This is not a hard requirement for our algorithm at inference time (as described in detail in Sect. 4.5), since each \(\textrm{I}^L_j\) is compared against all satellite images \(\{ \textrm{I}^S_i \}\) to find the best match, regardless of the size \(N_S.\) During training, however, we require pose-aligned pairs of lidar and satellite images to form supervision, as detailed in Sect. 4.1, and therefore \(N_L\) and \(N_S\) must be the same.

For metric localisation, given spatially proximal images \(\textrm{I}^S_i \in \{\textrm{I}^S_i \}\) and \(\textrm{I}^L_j \in \{ \textrm{I}^L_j \},\) we seek to solve for their unknown initial SE(2) pose offset, expressed as \(\begin{bmatrix} x&y&\theta \end{bmatrix}.\) For simplicity, the indices i and j will be dropped in the context of metric localisation.

Though in practice a coarse pose estimate is typically found by place recognition first and then refined by metric localisation, our approach requires learning metric localisation before learning place recognition.

3.1 Image generation versus point learning

A widely used strategy for dealing with modality differences between two types of images is to apply image-to-image transfer methods such as Pix2Pix (Isola et al., 2017) or CycleGAN (Zhu et al., 2017). In a naive image-to-image transfer approach, a neural network approximates the following mapping:

$$\begin{aligned} g(\textrm{I}^S) \rightarrow \hat{\textrm{I}}^L, \end{aligned}$$
(1)

where \(\hat{\textrm{I}}^L\) is a synthetic lidar image pose-aligned with \(\textrm{I}^S\). However, standard image-to-image transfer methods are not appropriate for transferring between satellite and lidar images. Due to occlusion and a range sensor’s limited field-of-view (FOV) compared to overhead images, the resulting lidar scan depends not just on the surrounding environment, but also on the centroid at which the scan is taken. Because of this, in some cases, the same patch in satellite imagery may contain lidar returns in one satellite-lidar pair, but none in another, as shown in Fig. 2. Therefore, within the training data, the mapping g as in Eq. (1) is one-to-many at the level of image patches. Since a one-to-many mapping is not a function, the mapping g in (1) is not suitable for learning with neural networks, which are function approximators.

Fig. 2

Two satellite-lidar image pairs taken 10 s apart, where lidar measurements (white points) are overlaid on top of satellite images with ground truth pose. The resulting lidar scan depends not just on the surrounding scene, but also on the centroid where it is taken. Shown here, the areas enclosed by the white and orange rectangles are identical patches in the satellite images. Image patches enclosed by the orange rectangles have lidar returns in the first satellite image but none in the second; the reverse is true for image patches enclosed by the white rectangles

To handle such ambiguity, the image generation in RSL-Net (Tang et al., 2020a) is conditioned on both a satellite image and a live radar image, where the latter dictates the exact appearance of the synthetic image, which in turn is pixel-wise aligned to the satellite image. As CNNs are not equivariant to rotation (Lenc & Vedaldi, 2015), RSL-Net seeks to infer the rotation offset prior to image generation:

$$\begin{aligned} h(\textrm{I}^L, \textrm{I}^S) \rightarrow \textrm{I}^L_* , \quad g(\textrm{I}^L_*, \textrm{I}^S) \rightarrow \hat{\textrm{I}}^L , \end{aligned}$$
(2)

where \(\textrm{I}^L\) and \(\textrm{I}^S\) are a pair of lidar and satellite images that are spatially proximal but have an unknown SE(2) offset, \(\textrm{I}^L_*\) is \(\textrm{I}^L\) rotated to be rotation-aligned with \(\textrm{I}^S,\) and \( \hat{\textrm{I}}^L\) is a synthetic lidar image pixel-wise aligned with \(\textrm{I}^S.\) Here h and g are functions for inferring the rotation offset and generating synthetic images, respectively. In particular, the mapping g in (2) is now a one-to-one mapping (Tang et al., 2021a) and therefore can be properly approximated by a neural network. However, since the input to h and g in Eq. (2) are paired satellite and lidar data that are spatially proximal, RSL-Net requires an existing place recognition solution and cannot be used to solve place recognition on its own. This is also true in the self-supervised image-based approaches in Tang et al. (2020b, 2021a).

Furthermore, it can be immediately seen that because CNNs are not equivariant to rotation, the performance of image generation in network g of Eq. (2) is greatly dependent on the capability of network h,  as any residual rotation offset between \(\textrm{I}^L_*\) and \(\textrm{I}^S\) will adversely affect the quality of the generated images.

We show in Sect. 4 that unlike the mapping g in Eq. (1) which attempts to transform a satellite image directly into a lidar image, our method learns an occupancy image first before transforming to a domain that is comparable to lidar data. This two-step approach is inherently occlusion aware, does not introduce ambiguity as in Eq. (1), and requires no paired input data at inference time.

4 Methodology

Our proposed pipeline follows a number of steps. First, given an overhead image, we propose a novel method to represent it as a collection of 2D points. We start by learning an occupancy image using lidar data as the supervision signal, as detailed in Sect. 4.1. Points are then extracted from the occupancy image with ray-tracing, forming a pseudo point-cloud, as described in Sects. 4.2 and 4.3. Next, we train a DCP (Wang & Solomon, 2019) network to solve the registration between the pseudo point-cloud and points extracted from an actual lidar image, as in Sect. 4.4. Finally, we take the local per-point descriptors from the DCP network optimised for registration, and re-use them to learn a per-point-cloud global descriptor for place recognition with PointNetVLAD (Uy & Lee, 2018), as detailed in Sect. 4.5.

4.1 Learning lidar occupancy from overhead imagery

Fig. 3

To generate the occupancy image \(\hat{\textrm{O}},\) we use a U-Net architecture with 8 down-sample convolution blocks (each followed by Leaky ReLU and BatchNorm) and 8 up-sample convolution blocks (each followed by ReLU, BatchNorm, and dropout, except the last block), with skip connections between layers. We use \(4 \times 4\) kernels with a stride of 2. Each block indicates the number of channels after the corresponding layer. The network takes in a 6-channel input (3 from \(\textrm{I}^S\) + 3 from \(\textrm{I}^R\)), and produces a 1-channel output, followed by a sigmoid

Given a satellite image \(\textrm{I}^S,\) we learn an occupancy image \(\hat{\textrm{O}},\) which indicates how likely each pixel is to cause a range return if a lidar situated near the image centroid takes a scan. Suppose a ground truth occupancy label \(\textrm{O}\) exists, with \(\textrm{O}(i,j) \in \{0, 1\},\) where \(\textrm{O}(i,j) = 1\) indicates pixel [i, j] belongs to “occupied” space, such as on buildings, and \(\textrm{O}(i,j) = 0\) indicates “free-space”, such as on streets and side-walks. We wish to compute the probability map \(P(\textrm{O} \vert \textrm{I}^S; \Theta ),\) modelled by Bernoulli distributions over all pixels and approximated by a neural network \(f_o\) with parameters \(\Theta :\)

$$\begin{aligned} f_o(\textrm{I}^S) \rightarrow \hat{\textrm{O}}. \end{aligned}$$
(3)

To optimise the network \(f_o,\) we minimise the weighted binary cross entropy (BCE) loss:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{\textrm{BCE}}&= - \log P\left( \textrm{O} \vert \textrm{I}^S; \Theta \right) \\&= -\sum _{i,j} \log P\left( \textrm{O}(i,j) \vert \textrm{I}^S;\Theta \right) \\&= -\sum _{i,j} \omega _{ij} \left[ \textrm{O}(i,j) \cdot \log \hat{\textrm{O}}(i,j) \right. \\&\quad + \left. \left( 1-\textrm{O}(i,j)\right) \cdot \log \left( 1-\hat{\textrm{O}}(i,j) \right) \right] , \end{aligned} \end{aligned}$$
(4)

where \(\omega _{ij}\) is the scalar weight for pixel [i, j].

In practice, the pixel values in \(\hat{\textrm{O}}\) are in the range \([0, 1],\) where larger values indicate a higher probability that a pixel is on occupied space, and smaller values indicate free-space.

Training with the loss in Eq. (4) requires appropriate ground truth occupancy labels \(\textrm{O}\) and weights \(\omega _{ij},\) which we can obtain by taking into account the sensing nature of lidar. As also noted in Weston et al. (2019), in a lidar image with no ground points, by ray-tracing from the centroid along each azimuth, pixels from the centroid to immediately in front of the first range return are likely to be on free-space. Pixels with range returns are likely to be on occupied space, for example on building façades. Moreover, for pixels with no return that are situated behind the first return along their azimuth, we are uncertain whether they are on occupied or free space.

Fig. 4

Given a lidar image \(\textrm{I}^L,\) we first threshold \(\textrm{I}^L\) to produce a binary lidar image \(\textrm{I}^{L \dagger }\) (left), where \(\textrm{I}^{L \dagger }(i,j)\) is 0 or 1 for any pixel [i, j], with 1 indicating a return and 0 indicating no return. When ray-tracing \(\textrm{I}^{L \dagger },\) pixels from the centroid up to the first return along each azimuth are labelled as 1 in the corresponding certainty mask \(\textrm{M}\) (right). All pixels that are 1 in \(\textrm{I}^{L \dagger }\) are also labelled as 1 in \(\textrm{M}\). All other pixels in \(\textrm{M}\) are labelled as 0.

Using this knowledge, for each \(\textrm{I}^L\) we first create a binary lidar image \(\textrm{I}^{L \dagger },\) where \(\textrm{I}^{L \dagger } (i,j) = 1\) if \(\textrm{I}^L (i,j) > \gamma \) and \(\textrm{I}^{L \dagger } (i,j) = 0\) otherwise, with \(\gamma \) a threshold we choose. We can then construct a binary certainty mask \(\textrm{M}\) from \(\textrm{I}^{L \dagger }\) by ray-tracing along each azimuth, as shown in Fig. 4, where \(\textrm{M}(i,j) = 1\) indicates we are certain about whether pixel [i, j] is occupied or free, and \(\textrm{M}(i,j) = 0\) indicates we are uncertain. This process can be performed efficiently in parallel by transforming the lidar image to a polar (range-azimuth) representation with a finite number of discretised azimuths. In our experiments, we set the threshold \(\gamma \) to 0.2.
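As an illustration, the following is a minimal sketch of this construction, assuming the binary lidar image has already been resampled onto a polar (range-azimuth) grid; leaving azimuths with no return entirely uncertain is our own assumption, as the text does not specify that case:

```python
import numpy as np

def certainty_mask_polar(binary_polar):
    """binary_polar: (N_A, N_R) array in {0, 1}, 1 where a range return exists,
    with azimuth along rows and range along columns. Returns the certainty
    mask M in the same polar representation."""
    binary_polar = binary_polar.astype(np.int32)
    # Number of returns strictly before each range bin along its azimuth.
    returns_before = np.cumsum(binary_polar, axis=1) - binary_polar
    has_return = binary_polar.any(axis=1, keepdims=True)
    # Certain free space: bins before the first return, on azimuths that have one.
    free_before_first = (returns_before == 0) & has_return & (binary_polar == 0)
    # Range returns themselves are certain occupied space.
    mask = free_before_first | (binary_polar == 1)
    return mask.astype(np.float32)
```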

We can then use \(\textrm{I}^{L \dagger }\) as the ground truth label and \(\textrm{M}\) as the weights for training \(f_o.\) Substituting into Eq. (4), the loss for learning the occupancy image becomes

$$\begin{aligned} \begin{aligned} \mathcal {L}_{\textrm{occ}} =&- \sum _{i,j} \textrm{M}(i,j) \left[ \textrm{I}^{L\dagger }(i,j) \cdot \log \hat{\textrm{O}}(i,j) \right. \\&+ \left. \left( 1- \textrm{I}^{L\dagger }(i,j)\right) \cdot \log \left( 1- \hat{\textrm{O}}(i,j)\right) \right] .\\ \end{aligned} \end{aligned}$$
(5)

Notably, the loss only back-propagates through the small fraction of pixels where we are certain about the occupancy information (\(\textrm{M}(i,j)=1\)), rather than through all pixels, as an \(L_1\) image-generation loss does in RSL-Net (Tang et al., 2020a). By taking into account the nature of the sensor within the training procedure through the certainty mask \(\textrm{M},\) our method naturally handles the occlusion and limited FOV of range sensors, avoiding the ambiguity of learning a one-to-many mapping as described in Sect. 3.1.
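A minimal PyTorch sketch of the masked loss in Eq. (5) is given below; normalising by the number of certain pixels is our own addition, as Eq. (5) is written as an unnormalised sum:

```python
import torch

def occupancy_loss(occ_pred, lidar_binary, certainty_mask, eps=1e-7):
    """Masked BCE of Eq. (5): occ_pred is the sigmoid output of f_o,
    lidar_binary is I^{L, dagger}, certainty_mask is M (all of shape B x 1 x H x W)."""
    occ_pred = occ_pred.clamp(eps, 1.0 - eps)
    bce = -(lidar_binary * torch.log(occ_pred)
            + (1.0 - lidar_binary) * torch.log(1.0 - occ_pred))
    # Only pixels with certain occupancy (M = 1) contribute to the gradient.
    return (certainty_mask * bce).sum() / certainty_mask.sum().clamp(min=1.0)
```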

Finally, we found that overhead roadmap images, which are also publicly available from Google Maps, can provide additional, implicit semantic information that facilitates the learning of the occupancy image. Formally, we learn \(f_o\) as

$$\begin{aligned} f_o \left( \textrm{I}^S, \textrm{I}^R \right) \rightarrow \hat{\textrm{O}}, \end{aligned}$$
(6)

where \(\textrm{I}^R\) is a roadmap image queried at the same location as \(\textrm{I}^S,\) and the probability map we wish to compute becomes \(P(\textrm{O} \vert \textrm{I}^S, \textrm{I}^R, \Theta ).\) We parametrise \(f_o\) using a U-Net architecture (Ronneberger et al., 2015), shown in Fig. 3. We also experimented with other common architectures for image-to-image transfer such as variations of ResNet-based (He et al., 2016) generators, but the U-Net architecture as in Fig. 3 resulted in the best performance, and was particularly useful for preserving the location of building edges. We show in Fig. 5 that, at inference time when evaluated on unseen images, the auxiliary input \(\textrm{I}^R\) bolsters the robustness of \(f_o\) against regions in \(\textrm{I}^S\) with a strong pixel intensity gradient but not corresponding to a boundary between free and occupied space, such as shadows. We pretrain \(f_o\) for 100 epochs with random rotations on the input images for data augmentation, prior to downstream tasks.

Fig. 5

Top left: an unseen satellite image \(\textrm{I}^S\) at inference. Top right: the corresponding \(\textrm{I}^R\) queried at the same location as \(\textrm{I}^S.\) The circled region corresponds to a strong gradient in \(\textrm{I}^S\) caused by shadows, and results in some pixels in this region inferred as occupied where they should be on free-space when we learn \(f_o(\textrm{I}^S) \rightarrow \hat{\textrm{O}}\) (bottom left). The occupancy image is more robust against shadows if we use \(\textrm{I}^R\) as an auxiliary input (bottom right)

4.2 Points extraction from occupancy image

Given the learned occupancy image \(\hat{\textrm{O}},\) we can emulate the point-cloud that would result if a lidar took a scan from near the centre of \(\hat{\textrm{O}},\) which we express as N 2D points where each point is a pixel coordinate. As when constructing \(\textrm{M},\) we are confident that, along each azimuth, the first pixel labelled as occupied in \(\hat{\textrm{O}}\) is likely to cause a range return when scanned by a lidar, whereas it is uncertain whether the pixels behind the first occupied pixel would be “seen” by a lidar. As such, we ray-trace \(\hat{\textrm{O}}\) and keep only the first occupied pixel along each azimuth as a point return when forming the pseudo point-cloud. Our method for extracting a point-cloud from an occupancy image is detailed in Algorithm 1. In our experiments, we use images of size \(256 \times 256,\) and set both \(N_A\) and \(N_R,\) the number of discretised azimuths and ranges, respectively, to 256. The result is the point-cloud \(\textbf{P}^S \in \mathbb {R}^{256\times 2}.\) A secondary output of Algorithm 1 is \(\textbf{s} \in \{0, 1\}^{N_A},\) a binary score vector whose entries are 1 for azimuths that contain a return and 0 otherwise; this score will be used later in the pipeline (see Eq. 9).

Algorithm 1 Points extraction from an occupancy image via ray-tracing
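The following is an illustrative, plain-Python rendering of the ray-tracing described above; the actual Algorithm 1 is implemented with parallel, differentiable tensor operations (see Sects. 4.3 and 4.4), and the coordinate conventions and maximum range below are our own assumptions:

```python
import numpy as np

def extract_points(occupancy, origin, n_azimuths=256, n_ranges=256, gamma=0.2):
    """Ray-trace `occupancy` (H x W, values in [0, 1]) from `origin` = (row, col),
    keeping the first pixel above `gamma` along each azimuth.
    Returns P (n_azimuths, 2) in pixel coordinates and the binary score vector s."""
    h, w = occupancy.shape
    oy, ox = origin
    max_range = min(h, w) / 2.0  # assumed maximum ray length, in pixels
    points = np.zeros((n_azimuths, 2), dtype=np.float32)
    scores = np.zeros(n_azimuths, dtype=np.float32)
    for a in range(n_azimuths):
        theta = 2.0 * np.pi * a / n_azimuths
        for r in range(1, n_ranges + 1):
            rng = r * max_range / n_ranges
            i = int(round(oy - rng * np.cos(theta)))  # assumed row/col convention
            j = int(round(ox + rng * np.sin(theta)))
            if not (0 <= i < h and 0 <= j < w):
                break
            if occupancy[i, j] > gamma:
                points[a] = (i, j)   # first "occupied" pixel along this azimuth
                scores[a] = 1.0
                break
    return points, scores
```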

4.3 Finding the origin for ray-tracing

By default, Algorithm 1 ray-traces from the centroid of \(\hat{\textrm{O}}.\) A degenerate solution occurs if the centroid is on occupied space, in which case the value of the first pixel in every azimuth is larger than \(\gamma ,\) and Algorithm 1 returns a circular point-cloud with radius \(\frac{1}{2}\) for \(\textbf{P}^S.\)

Given an \(h\times h\) patch around the centroid of \(\hat{\textrm{O}},\) we can compute the distance of each pixel within the patch to its nearest occupied pixel. The pixel that is furthest from its nearest occupied pixel is then likely to be on free-space and not enclosed by occupied pixels. Denoting this pixel location as \([i^*, j^*],\) proper ray-tracing can be ensured by using \([i^*, j^*]\) as the origin rather than the image centroid of \(\hat{\textrm{O}}.\) Figure 6 depicts a \(10 \times 10\) patch where the centroid is on occupied space, tracing the distance of a particular pixel (cyan) to its nearest occupied pixel. This process of finding the origin for ray-tracing, as well as the ray-tracing in Algorithm 1, can be conducted efficiently in parallel with tensor operations.
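A minimal sketch of this origin search using a Euclidean distance transform is shown below; the use of scipy.ndimage.distance_transform_edt and the assumption that the patch contains at least one occupied pixel are ours:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def find_raytrace_origin(occupancy, patch_size=24, gamma=0.2):
    """Within an h x h patch around the image centre, pick the free-space pixel
    farthest from any occupied pixel and return it as the ray-tracing origin."""
    h_img, w_img = occupancy.shape
    ci, cj = h_img // 2, w_img // 2
    half = patch_size // 2
    patch = occupancy[ci - half:ci + half, cj - half:cj + half]
    # Distance of each pixel to the nearest occupied (> gamma) pixel;
    # distance_transform_edt measures distance to the nearest zero entry.
    dist = distance_transform_edt(patch <= gamma)
    di, dj = np.unravel_index(np.argmax(dist), dist.shape)
    return ci - half + di, cj - half + dj
```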

Fig. 6

A \(10 \times 10\) image patch where the centroid is on occupied space, which is labelled as white pixels. For a particular pixel in this patch (cyan), we can compute its distance (green) to the nearest occupied pixel. The pixel within the patch that is the furthest away from its nearest occupied pixel is chosen as the origin for ray-tracing in Algorithm 1

Fig. 7

Data flow at the forward pass. Left: for pose estimation, given a spatially proximal pair \(\hat{\textrm{O}}\) and \(\textrm{I}^L,\) we extract point-clouds \(\textbf{P}^S\) and \(\textbf{P}^L,\) respectively, using Algorithm 1. The DGCNN+Transformer module in DCP learns local descriptors for each point-cloud, which are then used for soft matching to establish correspondences. The final SVD module in DCP returns the estimated pose offset between \(\textbf{P}^S\) and \(\textbf{P}^L.\) Right: for place recognition, given local descriptors \(\textbf{D} \in \mathbb {R}^{N \times d}\) for a point-cloud \(\textbf{P} \in \mathbb {R}^{N \times 2},\) we use PointNetVLAD to produce a rotation-invariant global descriptor \(\textbf{d} \in \mathbb {R}^k,\) which can be used for retrieval

4.4 Learning pose estimation

Given a lidar image \(\textrm{I}^L\) spatially proximal to \(\textrm{I}^S,\) we can extract a point-cloud \(\textbf{P}^L \in \mathbb {R}^{N_A \times 2}\) using Algorithm 1 with \(\textrm{I}^L\) rather than \(\hat{\textrm{O}}\) as the input. In this process, only the first range return along each azimuth is kept in \(\textbf{P}^L,\) making \(\textbf{P}^L\) more compatible with \(\textbf{P}^S,\) since \(\textbf{P}^S\) has only one point per azimuth.

Given \(\textbf{P}^L\) and \(\textbf{P}^S\) spatially proximal but with an unknown SE(2) offset, we use DCP (Wang & Solomon, 2019) to solve for their pose difference. DCP was shown to outperform ICP on non-identical point sets (Wang & Solomon, 2019) and, in addition, learns descriptors that can be used later for place recognition. DCP utilises DGCNN (Wang et al., 2019) and a Transformer (Vaswani et al., 2017) module to compute a d-dimensional descriptor per point, performs a soft matching and uses SVD for pose estimation. As detailed in Sect. 4.4.1, when computing the covariance matrix in SVD, we weight each correspondence using the score vector \(\textbf{s}\) from Algorithm 1, such that azimuths in \(\hat{\textrm{O}}\) with no returns do not contribute to pose estimation.

DCP outputs an estimated rotation matrix \(\hat{\textbf{R}} \in \mathbb {R}^{2 \times 2}\) and a translation vector \(\hat{\textbf{t}} \in \mathbb {R}^2.\) We can establish a loss term using ground truth \(\textbf{R}\) and \(\textbf{t}:\)

$$\begin{aligned} \mathcal {L}_{\textrm{pose}} = \left\Vert \hat{\textbf{t}} - \textbf{t}\right\Vert _1 + \lambda \left\Vert \hat{\textbf{R}} \textbf{R}^T - \textbf{I}\right\Vert _1, \end{aligned}$$
(7)

where \(\textbf{I}\) is the identity matrix and \(\lambda \) is a relative weighting between the rotational and the translational components.

We use an \(L_1\) loss rather than the \(L_2\) loss used in Wang and Solomon (2019), since the \(L_1\) loss is more robust to the non-identical point-clouds \(\textbf{P}^L\) and \(\textbf{P}^S\) in our case. We set \(\lambda = 10\) in our experiments. The points extraction step in Algorithm 1 is fully differentiable, allowing the loss in Eq. (7) to not only optimise the DCP network, but also fine-tune the pretrained network for \(f_o,\) without needing to further apply the loss in Eq. (5). The data flow for pose estimation is shown on the left of Fig. 7.
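For reference, a minimal PyTorch sketch of the loss in Eq. (7) is given below; the batching and the mean over the batch are our own assumptions, as the equation is written for a single pair of point-clouds:

```python
import torch

def pose_loss(R_hat, t_hat, R_gt, t_gt, lam=10.0):
    """L1 pose loss of Eq. (7) for a batch of SE(2) estimates.
    R_hat, R_gt: (B, 2, 2) rotation matrices; t_hat, t_gt: (B, 2) translations."""
    eye = torch.eye(2, device=R_hat.device, dtype=R_hat.dtype)
    trans_term = (t_hat - t_gt).abs().sum(dim=-1)
    rot_term = (torch.bmm(R_hat, R_gt.transpose(1, 2)) - eye).abs().sum(dim=(-2, -1))
    # Averaging over the batch is our own choice.
    return (trans_term + lam * rot_term).mean()
```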

4.4.1 SVD for pose estimation

Here we give details on the SVD algorithm in our implementation of DCP. Specifically, our implementation differs from the original implementation in Wang and Solomon (2019) in that the point-cloud registration is in SE(2) rather than SE(3),  and that we weight the correspondences using a score vector \(\textbf{s}.\) Further details on using SVD for point-cloud registration can be found in Sorkine (2009).

Given two proximal point-clouds \(\textbf{P}^S\) and \(\textbf{P}^L,\) the DGCNN+Transformer module in DCP computes a soft correspondence \(\textbf{p}^{L'}_i\) for each \(\textbf{p}^S_i \in \textbf{P}^S,\) forming \(\textbf{P}^{L'}.\) Since DCP uses soft matching, each \(\textbf{p}^{L'}_i \in \textbf{P}^{L'}\) is a softmax-weighted combination of all point coordinates in \(\textbf{P}^L.\)

We begin by finding the centroids:

$$\begin{aligned} \bar{\textbf{p}}^S = \frac{1}{N}\sum _i \textbf{p}^S_i, \quad \quad \bar{\textbf{p}}^{L'} = \frac{1}{N}\sum _i \textbf{p}^{L'}_i, \end{aligned}$$
(8)

where each \(\textbf{p}^S_i\) and \(\textbf{p}^{L'}_i\) is a point in \(\textbf{P}^S\) and \(\textbf{P}^{L'},\) respectively.

The covariance matrix for SVD is then:

$$\begin{aligned} \textbf{H} = \sum _i \left( \textbf{p}^S_i - \bar{\textbf{p}}^S \right) s_i \left( \textbf{p}^{L'}_i - \bar{\textbf{p}}^{L'} \right) ^T, \end{aligned}$$
(9)

where \(\textbf{H}\in \mathbb {R}^{2\times 2},\) and \(s_i \in \textbf{s}\) is the score for \(\textbf{p}^S_i\) from Algorithm 1. We can then apply singular value decomposition on the covariance:

$$\begin{aligned} \textbf{U}{\varvec{\Sigma }}\textbf{V}^T = \textrm{SVD}(\textbf{H}). \end{aligned}$$
(10)

The estimated rotation is found by:

$$\begin{aligned} \hat{\textbf{R}} = \textbf{V} \begin{bmatrix} 1 &{} 0 \\ 0 &{} \textrm{det}(\textbf{V}\textbf{U}^T) \end{bmatrix} \textbf{U}^T, \end{aligned}$$
(11)

and finally the estimated translation is found by:

$$\begin{aligned} \hat{\textbf{t}} = \bar{\textbf{p}}^{L'} - \hat{\textbf{R}}\bar{\textbf{p}}^{S}, \end{aligned}$$
(12)

where \(\hat{\textbf{R}} \in \mathbb {R}^{2 \times 2}\) and \(\hat{\textbf{t}} \in \mathbb {R}^2.\)
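The steps in Eqs. (8)-(12) can be written compactly as follows; this is an illustrative sketch rather than the exact DCP implementation, and the tensor shapes are assumptions:

```python
import torch

def weighted_se2_svd(p_src, p_corr, scores):
    """Solve Eqs. (8)-(12): p_src is P^S (B, N, 2), p_corr is the soft
    correspondences P^{L'} (B, N, 2), scores is s (B, N)."""
    src_mean = p_src.mean(dim=1, keepdim=True)                  # Eq. (8)
    corr_mean = p_corr.mean(dim=1, keepdim=True)
    src_c = p_src - src_mean
    corr_c = p_corr - corr_mean
    # Score-weighted 2x2 covariance, Eq. (9).
    H = torch.einsum('bni,bn,bnj->bij', src_c, scores, corr_c)
    U, S, Vh = torch.linalg.svd(H)                              # Eq. (10)
    V = Vh.transpose(-2, -1)
    # Reflection correction via det(V U^T), Eq. (11).
    d = torch.det(torch.bmm(V, U.transpose(-2, -1)))
    D = torch.diag_embed(torch.stack([torch.ones_like(d), d], dim=-1))
    R_hat = V @ D @ U.transpose(-2, -1)
    # Translation, Eq. (12).
    t_hat = corr_mean.squeeze(1) - torch.bmm(R_hat, src_mean.transpose(1, 2)).squeeze(-1)
    return R_hat, t_hat
```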

4.5 Learning place recognition

As shown in Fig. 7, as an intermediate step, the DGCNN+Transformer module in DCP outputs local descriptors \(\textbf{D}^S\) and \(\textbf{D}^L,\) taking the point coordinates as input. After the networks are trained for pose estimation, the local descriptors are optimised to store geometric information useful for registration. Given a point-cloud \(\textbf{P} \in \mathbb {R}^{N \times 2}\) and its local descriptors \(\textbf{D} \in \mathbb {R}^{N \times d}\) (\(\textbf{D}\) can be either \(\textbf{D}^S\) or \(\textbf{D}^L\) depending on whether \(\textbf{P}\) is \(\textbf{P}^S\) or \(\textbf{P}^L\)), we use a PointNetVLAD (Uy & Lee, 2018) layer to learn a k-dimensional global descriptor \(\textbf{d},\) shown on the right of Fig. 7.

To optimise the PointNetVLAD layer, we fix the network parameters for \(f_o\) and DCP and apply a loss that encourages descriptors from positive (spatially proximal) pairs of \(\textbf{P}^S\) and \(\textbf{P}^L\) to have a smaller \(L_2\) distance than negative (spatially distant) pairs. The global descriptor is made rotation-invariant by perturbing the point-clouds with arbitrary random rotations during training. In other words, while positive pairs of \(\textbf{P}^S\) and \(\textbf{P}^L\) have proximal centre positions, they can have any arbitrary heading offset. In our prior work (Tang et al., 2021b), a standard triplet loss is used, where an anchor lidar point-cloud \(\textbf{P}^L\) forms positive and negative pairs with pseudo point-clouds \(\textbf{P}^{S+}\) and \(\textbf{P}^{S-}.\) Formally, this is

$$\begin{aligned} \mathcal {L}_{\textrm{pr}} = \left[ \left\Vert \textbf{d}^{S+} - \textbf{d}^{L} \right\Vert _2 - \left\Vert \textbf{d}^{S-} - \textbf{d}^{L}\right\Vert _2 + m \right] _+, \end{aligned}$$
(13)

where m is the triplet margin, which we set to 1, \([a]_+\) denotes \(\textrm{max}(a,0),\) and \(\textbf{d}^L\) is the global descriptor for \(\textbf{P}^L\) from a lidar image \(\textrm{I}^L.\) \(\textbf{d}^{S+}\) is the global descriptor for \(\textbf{P}^{S+},\) obtained from a pair of satellite and roadmap images \(\textrm{I}^{S+}\) and \(\textrm{I}^{R+}\) queried at a location spatially proximal to \(\textrm{I}^L.\) \(\textbf{d}^{S-}\) is the global descriptor from \(\textrm{I}^{S-}\) and \(\textrm{I}^{R-},\) which are spatially distant from and have no geometric overlap with \(\textrm{I}^L.\)

While the standard triplet loss as in Eq. (13) is widely used for descriptor learning in place recognition, several improvements to the triplet loss were introduced recently, such as the quadruplet loss (Chen et al., 2017). Building upon the loss in Eq. (13), we propose a bidirectional triplet loss:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{\textrm{pr}}&= \left[ \left\Vert \textbf{d}^{S+} - \textbf{d}^{L} \right\Vert _2 - \left\Vert \textbf{d}^{S-} - \textbf{d}^{L}\right\Vert _2 + m \right] _+ \\&\quad + \left[ \left\Vert \textbf{d}^{L+} - \textbf{d}^{S} \right\Vert _2 - \left\Vert \textbf{d}^{L-} - \textbf{d}^{S}\right\Vert _2 + m \right] _+. \end{aligned} \end{aligned}$$
(14)

In addition to the anchor lidar image \(\textrm{I}^L,\) we also randomly sample an anchor satellite and roadmap image pair \(\textrm{I}^S, \textrm{I}^R\) from the training set to form an anchor pseudo point-cloud \(\textbf{P}^S\) and its associated global descriptor \(\textbf{d}^S.\) \(\textbf{d}^{L+}\) and \(\textbf{d}^{L-}\) are the global descriptors for \(\textbf{P}^{L+}\) and \(\textbf{P}^{L-},\) which form positive and negative pairs with \(\textbf{P}^S,\) respectively. The idea behind the bidirectional triplet loss in Eq. (14) is simple: if, given a lidar point-cloud \(\textbf{P}^L,\) the network can distinguish between proximal and distant samples of \(\textbf{P}^S,\) then it should also be able to do the reverse. The loss in Eq. (14) further encourages the PointNetVLAD layer to learn globally distinctive descriptors, and leads to consistently better retrieval accuracy than the loss in Eq. (13), as shown in Sect. 5.2.
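A minimal sketch of the loss in Eq. (14), assuming all global descriptors are already computed and batched as (B, k) tensors:

```python
import torch
import torch.nn.functional as F

def bidirectional_triplet_loss(d_L, d_S_pos, d_S_neg, d_S, d_L_pos, d_L_neg, margin=1.0):
    """Bidirectional triplet loss of Eq. (14) on global descriptors of shape (B, k)."""
    # Lidar anchor against positive/negative pseudo point-cloud descriptors.
    forward = F.relu(torch.norm(d_S_pos - d_L, dim=-1)
                     - torch.norm(d_S_neg - d_L, dim=-1) + margin)
    # Pseudo point-cloud anchor against positive/negative lidar descriptors.
    backward = F.relu(torch.norm(d_L_pos - d_S, dim=-1)
                      - torch.norm(d_L_neg - d_S, dim=-1) + margin)
    return (forward + backward).mean()
```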

4.6 Implementation details

Here we provide implementation details on the network architectures, design parameters, and our training strategy.

4.6.1 Network design choices

For learning the occupancy images, we experimented with a number of network architectures for parametrising \(f_o,\) and found that a simple and well-established U-Net outperformed other architectures at preserving the locations of building edges. For learning point-cloud registration, we experimented with DCP (Wang & Solomon, 2019), PRNet (Wang & Solomon, 2019), and DeepGMR (Yuan et al., 2020), and found that DCP led to the smallest registration errors. In particular, although PRNet was shown to outperform DCP in the authors’ experiments, using Softmax (as in DCP) is more appropriate than Gumbel-Softmax (as in PRNet) for our problem, since \(\textbf{P}^S\) and \(\textbf{P}^L\) are from different data sources and there is often no perfect one-to-one point correspondence when performing point matching. Finally, we use PointNetVLAD for place recognition as it is an intuitive choice for re-using optimised local descriptors to construct a global descriptor, similarly to prior work (Du et al., 2020).

For DCP, we used the authors’ official open-source code with the default architectures for DGCNN and the Transformer, but changed the SVD algorithm from solving in SE(3) to SE(2), as described in detail in Sect. 4.4.1. Here, we set the descriptor length to 1024 and the number of heads in the Transformer module to 16. For PointNetVLAD, we followed the authors’ official open-source implementation, and set the cluster size to 8, the maximum number of samples to 256, and the output dimension to 2056. We outline the detailed architecture of our U-Net implementation in Table 1, where we make use of the following abbreviations:

  • Conv(\(C_i\), \(C_o\), ks, str): 2D convolution with input channel \(C_i\), output channel \(C_o\), kernel size ks, stride str, and zero padding

  • BN: batch normalization

  • ReLU: rectified linear unit

  • LReLU(s): leaky ReLU with negative slope s

  • Drop(d): drop out with rate d

  • ConvT(\(C_i\), \(C_o\), ks, str): transposed 2D convolution with input channel \(C_i\), output channel \(C_o\), kernel size ks, stride str, and zero padding

Table 1 Architecture of our implementation of U-Net for learning occupancy images

4.6.2 Parameter choices

An important parameter is the patch size h used to search around the centroid when finding the origin for ray-tracing, as detailed in Sect. 4.3. Intuitively, h should be large enough for an adequate ray-tracing origin to be found. However, choosing a search space that is too large introduces additional computational burden and risks placing the origin too far from the image centroid. Empirically, we set h to 24, as it was sufficiently large for a suitable origin to always be found.

Fig. 8

Trajectories are split into training (blue), validation (red), and test (green) for all sequences in RobotCar (left) and 20111003_drive0034 in KITTI (right). We discard regions near the intersections between splits (cyan) to avoid any overlap between training, validation, and test trajectories (Color figure online)

The threshold \(\gamma \) for creating the binary lidar image \(\textrm{I}^{L \dagger }\) in Sect. 4.1 and for forming the pseudo point-cloud in Algorithm 1 was set to 0.2. A trade-off exists for \(\gamma :\) a higher value ensures free space is correctly classified, but certain regions on static structures like buildings may be incorrectly classified as free space; a lower value makes static structures likely to be correctly inferred as occupied space, but certain pixels on roads may be incorrectly classified as occupied. We set the threshold low because, in our experiments, we found that preserving the edges of buildings is particularly important in ensuring the pseudo point-cloud captures geometric cues also observed in lidar scans.

Finally, the design parameters in our networks, such as the descriptor dimension, network width and depth, and the number of heads for multi-head attention in the Transformer module, were chosen by trial and error to achieve the best localisation accuracy.

4.6.3 Training details

Training proceeds in three stages. We first pretrain \(f_o\) for 100 epochs using the loss in Eq. (5), where we apply random rotations for data augmentation. We then train \(f_o\) and the DCP network end-to-end on pose estimation, using the pose loss in Eq. (7). Finally, we fix the network parameters of \(f_o\) and DCP and optimise a PointNetVLAD layer for place recognition using the bidirectional triplet loss in Eq. (14). While in theory it is possible to optimise DCP and PointNetVLAD jointly for registration and place recognition, empirically this resulted in larger metric localisation errors. As such, we resorted to training the three stages sequentially. We use a fixed learning rate of \(2\times 10^{-4}\) in stages 1 and 3, and \(1\times 10^{-4}\) in stage 2. We use Adam as the optimiser for all experiments. We train stage 2 for 160 epochs and stage 3 for 120 epochs, and choose the check-point with the best validation set performance.

Table 2 Lidar metric localisation errors for various ranges of initial offset, evaluated on RobotCar
Table 3 Lidar metric localisation errors for various ranges of initial offset, evaluated on KITTI
Table 4 Radar metric localisation errors for various ranges of initial offset, evaluated on RobotCar

5 Experimental results

We validate our method using the Oxford Radar RobotCar Dataset (Barnes et al., 2020) and the KITTI Raw Dataset (Geiger et al., 2013). The lidar data are collected using a Velodyne HDL-32E for the former and a Velodyne HDL-64E for the latter. Both datasets have longitude/latitude data for each time-stamp to query overhead images. The resolution for \(\textrm{I}^S, \textrm{I}^R,\) and \(\textrm{I}^L\) is \(0.4332~\textrm{m}/ \textrm{pixel}\) in the RobotCar dataset and \(0.4592~\textrm{m}/ \textrm{pixel}\) in KITTI. We use images of size \(256 \times 256\) in all of our experiments. The KITTI data we use were recorded in 2011, while the Oxford Radar RobotCar Dataset was collected in 2019. The satellite data for both datasets were queried in 2021, years after the collection of the corresponding lidar data. Our method achieves decent localisation accuracy despite potential geometry changes in the scene during this time span.

Our method is the first to learn the place recognition of a lidar using only overhead imagery. Thus, we benchmark against a baseline method that trains a VGG-16 network (Simonyan & Zisserman, 2015) with BatchNorm followed by a NetVLAD (Arandjelovic et al., 2016) layer directly on the lidar and overhead images, using triplet loss. We compare our pose estimation approach against recent work (Tang et al., 2020a) on learning the metric localisation between range sensors and satellite imagery. Finally, we compare against our prior work in Tang et al. (2021b).

To the best of our knowledge, the works in Tang et al. (2020a, 2020b, 2021a, 2021b) are the only methods that solve single-frame, lidar-only localisation against overhead imagery and are evaluated on the datasets we use. For a fair comparison, our choice of training, validation, and test trajectories in each dataset is made as close to the lidar experiments in Tang et al. (2020b) as possible. Specifically, for RobotCar, the training trajectories in sequences no.2, no.5, and no.6 are used for training, while we only validate and test on no.2. For KITTI, the training data consist of sequences 20110929_drive0071, 20110930_drive0028, 20111003_drive0027, and 20111003_drive0034, where 20111003_drive0034 is split into training and test trajectories. 20110926_drive0117 forms the validation set. The test trajectory lengths are approximately \(1.26~\textrm{km}\) for RobotCar and \(1.44~\textrm{km}\) for KITTI. The data splits are visualised in Fig. 8.

5.1 Metric localisation

To evaluate metric localisation, we assume place recognition is solved such that we have spatially proximal pairs of \(\textrm{I}^L\) and \(\mathrm {I^S}\) (and \(\textrm{I}^R\)) where their initial pose offset is no larger than a certain amount. Our method then registers point-clouds \(\textbf{P}^S\) and \(\textbf{P}^L\) to solve for a relative SE(2) pose offset.

Fig. 9

Satellite images \(\textrm{I}^S\) (top left and bottom left) and the corresponding pseudo point-clouds \(\textbf{P}^S\) (top right and bottom right) on narrow, straight roads show a strong symmetry

DCP (Wang & Solomon, 2019) includes a Transformer (Vaswani et al., 2017) module for learning local descriptors, which uses multi-headed attention. In our prior work (Tang et al., 2021b), the descriptor length d was set to 512, while the number of heads was 4. Here we use an improved architecture where we increase d to 1024 and the number of heads to 16. We also introduce additional random rotations to the input point-clouds for data augmentation when training the DCP network. We show that, using an improved architecture and rotation augmentation, the metric localisation results consistently outperform our prior work.

We compare against RSL-Net (Tang et al., 2020a) and our prior work (Tang et al., 2021b), where the initial pose offset is uniformly distributed in each of \(x,\) \(y,\) and \(\theta \) within a specified range. Specifically, we consider initial offsets that are large in both rotation and translation, large in rotation but small in translation, and large in translation but small in rotation. The mean metric localisation errors for various initial pose offsets are summarised in Tables 2 and 3.

RSL-Net was originally evaluated on experiments with small rotation offsets in the range of \([-22.5^\circ , 22.5^\circ ]\) (Tang et al., 2020a, b), where it shows low errors; however, we show that its accuracy degrades rapidly for larger initial offsets, where it is outperformed by our method. In our prior work (Tang et al., 2021b), the forward pass for all modules takes approximately \(0.075~\textrm{s}\) altogether on a 1080Ti GPU, running approximately \(25\%\) faster than RSL-Net, which runs at \(10~\textrm{Hz}.\) Using the improved architecture with a larger feature dimension and more heads in the Transformer module, the forward pass now takes roughly \(0.1 ~\textrm{s},\) giving approximately the same run-time performance as RSL-Net.

Fig. 10

Percentage of solutions considered “confident” and the mean \(\theta \) error of confident solutions, for various values of \(\gamma _{\textrm{cd}},\) evaluated on RobotCar

Fig. 11

Percentage of top-1 retrievals falling within a certain distance away from the true position, evaluated on RobotCar (top) and KITTI (bottom)

5.1.1 Introspection

For overhead imagery situated along a narrow, straight road, the pseudo point-cloud \(\textbf{P}^S\) can be roughly symmetrical along its principal direction, shown in Fig. 9. In such cases, our method sometimes outputs an estimated \(\hat{\textbf{R}}\) that is approximately \(180 ^\circ \) off from the actual offset as such a solution is equally “correct” for symmetrical point-clouds. The point-cloud registration problem is ill-posed if symmetry exists, in particular if the initial rotation offset can be anywhere from \(-\pi \) to \(\pi .\)

We can introspect whether \(\textbf{P}^S\) is symmetrical by computing the Chamfer distance after a rotation by \(\pi :\)

$$\begin{aligned} \textrm{d}_{\textrm{chamfer}} = \sum _i \textrm{d}_{\textrm{min}}(\bar{\textbf{p}}^S_i, \textbf{R}_\pi \bar{\textbf{P}}^S), \end{aligned}$$
(15)

where \(\bar{\textbf{P}}^S\) is \(\textbf{P}^S\) centred about its centroid, and each \(\bar{\textbf{p}}^S_i \in \mathbb {R}^2\) is a point in \(\bar{\textbf{P}}^S,\) \(\textrm{d}_{\textrm{min}} (\textbf{p}, \textbf{P})\) is the minimum distance from a point \(\textbf{p}\) to any point in point-cloud \(\textbf{P},\) and \(\textbf{R}_\pi \in \mathbb {R}^{2 \times 2}\) is a rotation by \(\pi .\) While such an introspection methodology only considers symmetries of \(\pi \), man-made structures in practice tend to result in a \(\pi \) symmetry.

Given \(\gamma _{\textrm{cd}},\) a threshold on the Chamfer distance, we can treat a solution as “confident” if \(\textbf{P}^S\) results in a Chamfer distance from Eq. (15) that is larger than \(\gamma _{\textrm{cd}},\) in which case \(\textbf{P}^S\) is regarded as non-symmetrical. For initial offsets for xy,  and \(\theta \) in the range \(\pm 10\) pixels, \(\pm 10\) pixels, and \(\pm 180 ^\circ \) respectively, Fig. 10 shows the mean \(\theta \) error for solutions considered as confident, and the percentage of solutions considered as confident, evaluated on RobotCar, for various values of \(\gamma _{\textrm{cd}}.\)
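A minimal sketch of this confidence check, directly following Eq. (15); rotating a 2D point-cloud by \(\pi \) amounts to negating its centred coordinates:

```python
import numpy as np

def is_confident(points, gamma_cd):
    """Confidence check based on Eq. (15): treat a registration result as
    confident only if the pseudo point-cloud (N, 2) is not near-symmetric
    under a rotation by pi."""
    centred = points - points.mean(axis=0)      # \bar{P}^S
    rotated = -centred                          # rotation by pi in 2D
    # Chamfer distance: for each point, distance to the closest rotated point.
    dists = np.linalg.norm(centred[:, None, :] - rotated[None, :, :], axis=-1)
    chamfer = dists.min(axis=1).sum()
    return chamfer > gamma_cd
```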

5.2 Place recognition

In this experiment, we first query the overhead images using the GPS longitude/latitude at each time-stamp during the test route traversal. In a more realistic scenario, while the routes can be known beforehand to gather overhead images ahead of time and use as a prior map, the exact driving path may be different in the live traversal, for example due to driving on different lanes along the same road. To simulate this, we add uniform errors in the range \([-5~\textrm{m}, 5~\textrm{m}]\) to each of the x and y directions when querying for overhead images.

Beyond the modality difference, a major challenge compared with standard lidar-to-lidar place recognition is the arbitrary rotation offset between lidar scans and overhead imagery. In standard lidar-based place recognition, because the vehicle will likely face the same (or opposite) direction when driving along the same route, spatially proximal lidar scans from two traversals will have little rotation offset (or an offset of approximately \(\pi \)). However, overhead images are expressed in a fixed orientation (North-up), while lidar scans in their local reference frame do not have a privileged orientation. In fact, without another sensor (e.g., a magnetometer) to measure the global heading, spatially proximal lidar and overhead image pairs can be offset by an arbitrary rotation, increasing the comparison complexity. To foster robustness to rotation, we apply random rotations in the range \([-\pi , \pi ]\) between \(\textbf{P}^S\) and \(\textbf{P}^L\) when learning local descriptors, and between \(\textbf{P}^{S+}\) and \(\textbf{P}^L\) and between \(\textbf{P}^{L+}\) and \(\textbf{P}^S\) when training the PointNetVLAD layer.

5.2.1 Top-1 accuracy

Given a database of all pairs of satellite and roadmap images \(\{ \textrm{I}^S_i, \textrm{I}^R_i \}, i = 1, \ldots , N\) queried along the test trajectory, we form a pseudo point-cloud and output a global descriptor for each sample, resulting in a database of map global descriptors \(\{ \textbf{d}^S_i\}.\) For some lidar image \(\textrm{I}^L_j\) taken along the test trajectory and its associated global descriptor \(\textbf{d}^L_j,\) the top-1 retrieval is the closest \(\textbf{d}^S_i \in \{ \textbf{d}^S_i \}\) in terms of Euclidean distance. Figure 11 shows the percentage of top-1 matches within certain distances away from the true position. We show that using the new bidirectional triplet loss in Eq. (14), our results are superior to our prior work (Tang et al., 2021b), which uses the standard triplet loss in Eq. (13). We also compare against a baseline method that feeds \(\{ \textrm{I}^S, \textrm{I}^R\}\) and \(\textrm{I}^L\) to a VGG16+NetVLAD network and learns a triplet loss, where the best baseline performance comes from representing all inputs as polar images before passing them through the network, and using two different networks for the two sources.
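The retrieval step itself reduces to a nearest-neighbour search over global descriptors; a brute-force sketch is shown below, with the choice of exhaustive search being our own assumption:

```python
import numpy as np

def top1_retrieval(d_lidar, d_sat_database):
    """Return the index of the satellite/roadmap descriptor closest (in L2) to the
    query lidar descriptor. d_lidar: (k,), d_sat_database: (N, k)."""
    dists = np.linalg.norm(d_sat_database - d_lidar[None, :], axis=-1)
    return int(np.argmin(dists))
```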

On the RobotCar Dataset, which features a city environment, our method achieves high accuracy, with over half of all top-1 retrievals falling within \(40~\textrm{m}\) of the true position. The test set for KITTI, featuring a residential area, is much more challenging, as residential areas are more structurally repetitive and have fewer geometrically distinctive places. Nevertheless, our method consistently outperforms our prior work and the baseline method.

5.2.2 Failure cases

Overhead images queried at different locations may result in structurally similar point-clouds \(\textbf{P}^S,\) leading to false top-1 retrievals, as depicted in Fig. 12.

Fig. 12

Failure cases: a lidar image \(\textrm{I}^L_j\) along the test route (left), the satellite image and the associated pseudo point-cloud corresponding to the true position (centre), and the falsely retrieved satellite image and its associated pseudo point-cloud (right), evaluated on RobotCar (top) and KITTI (bottom)

5.2.3 Precision and recall

To evaluate precision and recall, we sample a threshold \(\gamma _{\textrm{desc}}\) uniformly between the minimum and the maximum Euclidean distance between any \(\textbf{d}^S_i\) and any \(\textbf{d}^L_j.\) A pair \(\textbf{d}^S_i\) and \(\textbf{d}^L_j\) is considered a positive retrieval if their Euclidean distance is within \(\gamma _{\textrm{desc}},\) and negative otherwise. Two locations are considered a true match if they are within \(25~\textrm{m}\) of each other, and a false match if they are more than \(50~\textrm{m}\) apart. As shown in Fig. 13, we consistently outperform our prior work (Tang et al., 2021b) and the baseline.
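A sketch of this evaluation is given below; ignoring pairs whose separation lies between \(25~\textrm{m}\) and \(50~\textrm{m}\) is our assumption about the protocol, and the function name is illustrative.

```python
import numpy as np

def precision_recall_curve(desc_dist, geo_dist, n_thresholds=100):
    """Sweep a descriptor-distance threshold gamma_desc and compute precision/recall.

    desc_dist: (M,) Euclidean distances between candidate pairs (d^S_i, d^L_j).
    geo_dist:  (M,) metric distances between the corresponding locations.
    """
    keep = (geo_dist <= 25.0) | (geo_dist > 50.0)   # drop ambiguous pairs (assumption)
    desc_dist, geo_dist = desc_dist[keep], geo_dist[keep]
    is_true_match = geo_dist <= 25.0

    curve = []
    for gamma in np.linspace(desc_dist.min(), desc_dist.max(), n_thresholds):
        retrieved = desc_dist <= gamma               # positive retrievals at this threshold
        tp = np.sum(retrieved & is_true_match)
        fp = np.sum(retrieved & ~is_true_match)
        fn = np.sum(~retrieved & is_true_match)
        precision = tp / max(tp + fp, 1)
        recall = tp / max(tp + fn, 1)
        curve.append((gamma, precision, recall))
    return curve
```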

Fig. 13

Precision and recall curve for various descriptor distance thresholds

Fig. 14

Percentage of top-1 retrievals falling within a certain distance away from the true position, evaluated on RobotCar (top) and KITTI (bottom), by applying descriptor smoothing using a sequence of data

Fig. 15

Radar images (top left and bottom left) transformed into the appearance of lidar images (top right and bottom right) using the CycleGAN implementation in Weston et al. (2020)

Table 5 Lidar metric localisation errors for various ranges of initial offset, evaluated on RobotCar for various test set sequences
Table 6 Metric localisation errors for an initial offset of \(\pm 10\) pixels, \(\pm 10\) pixels, \(\pm 180 ^\circ ,\) evaluated on various datasets for ablation studies
Fig. 16

Place recognition results using radar, evaluated on the RobotCar Dataset. Top: top-1 retrieval accuracy. Centre: precision and recall curve. Bottom: top-1 retrieval accuracy with descriptor smoothing

Fig. 17

Place recognition retrieval accuracies on the RobotCar dataset for lidar localisation, evaluated for different sequences

Fig. 18

Percentage of top-1 retrievals within a certain distance away from the true position, evaluated on RobotCar (top), KITTI (centre), and RobotCar with radar localisation (bottom), for ablation studies

5.2.4 Descriptor smoothing using a sequence of data

The results presented in Sect. 5.2.1 use a single frame of lidar data for retrieval. However, the accuracy is degraded by false retrievals caused by structurally similar regions, as shown in Sect. 5.2.2. A simple yet surprisingly effective remedy is to use a sequence of lidar data, rather than a single frame, at inference time. Specifically, given the \(i{\textrm{th}}\) lidar image at inference time, \(\textrm{I}^L_i,\) we can form a set of lidar images from its K nearest neighbours (time-wise) in the lidar data stream, namely

$$\begin{aligned} \Omega ^L_i = \left\{ \textrm{I}^L_{i-\frac{K}{2}}, \textrm{I}^L_{i-\frac{K}{2}+1}, \dots , \textrm{I}^L_i, \dots , \textrm{I}^L_{i + \frac{K}{2}} \right\} , \end{aligned}$$
(16)

and their associated global descriptors

$$\begin{aligned} \Phi ^L_i = \left\{ \textbf{d}^L_{i-\frac{K}{2}}, \textbf{d}^L_{i-\frac{K}{2}+1}, \dots , \textbf{d}^L_i, \dots , \textbf{d}^L_{i + \frac{K}{2}} \right\} . \end{aligned}$$
(17)

Since satellite and roadmap images are also queried along a known route, given the \(j\textrm{th}\) pair of images \(\textrm{I}^S_j\) and \(\textrm{I}^R_j,\) we can also find a sequence of global descriptors, forming the set

$$\begin{aligned} \Phi ^S_j = \left\{ \textbf{d}^S_{j-\frac{K}{2}}, \textbf{d}^S_{j-\frac{K}{2}+1}, \dots , \textbf{d}^S_j, \dots , \textbf{d}^S_{j + \frac{K}{2}} \right\} . \end{aligned}$$
(18)

We can then take the median over the \(K+1\) descriptors in each set:

$$\begin{aligned} \tilde{\textbf{d}}^L_i = \textrm{median}(\Phi ^L_i), \quad \quad \tilde{\textbf{d}}^S_j = \textrm{median}(\Phi ^S_j), \end{aligned}$$
(19)

and use \(\tilde{\textbf{d}}^L_i\) and \(\tilde{\textbf{d}}^S_j\) for retrieval, effectively smoothing the descriptors with a stream of data. The resulting retrieval accuracies for different values of K are shown in Fig. 14. We can achieve significantly higher accuracy by using a stream of data. For \(K=40,\) on the RobotCar dataset, almost \(75\%\) of all top-1 retrievals fall within \(70~\textrm{m}\) of the true position.

Given a specific value of K, all \(\tilde{\textbf{d}}^S_j\) can be computed and stored in an offline process. However, for a live lidar data stream, a delay of at least \(\frac{K}{2}\) frames is required before \(\tilde{\textbf{d}}^L_i\) can be computed for the ith frame, making this approach impractical when delays are not acceptable. Nevertheless, the top-1 retrieval accuracies are consistently better than when using single frames of data only. We also experimented with taking the mean rather than the median for smoothing, but this resulted in slightly lower accuracies. Importantly, no additional training or retraining is needed to apply descriptor smoothing at inference time.
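A minimal sketch of the smoothing step is given below, assuming the median is taken element-wise over the descriptor dimensions and that windows are simply clipped at the ends of the sequence (both assumptions on our part).

```python
import numpy as np

def smooth_descriptors(descriptors: np.ndarray, K: int) -> np.ndarray:
    """Element-wise median over a window of K+1 consecutive global descriptors.

    descriptors: (T, D) descriptors ordered in time (lidar) or along the route (map).
    Returns smoothed descriptors of the same shape; windows are clipped at the ends.
    """
    half = K // 2
    T = descriptors.shape[0]
    smoothed = np.empty_like(descriptors)
    for i in range(T):
        lo, hi = max(0, i - half), min(T, i + half + 1)
        smoothed[i] = np.median(descriptors[lo:hi], axis=0)
    return smoothed

# The map descriptors can be smoothed once offline; the lidar descriptors require
# waiting for K/2 future frames before the smoothed descriptor of frame i is available.
```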

5.3 Extending to radar localisation using overhead imagery

Our method relies on the premise that pixels from the centroid to the first return along each azimuth can be considered free space in a lidar image. This assumption does not hold for scanning radar images, which are prone to speckle interference. However, we can reliably transform radar images to the appearance of lidar images using unpaired image-to-image translation (Zhu et al., 2017), specifically the methodology in Weston et al. (2020).

Figure 15 shows some synthetic lidar images created from radar images. Once radar images are transformed into the appearance of lidar images, they are used in this form as input to our method for training and inference. We evaluate on the RobotCar Dataset, which also provides data from a Navtech radar, and use the same sequences and splits as in the lidar experiments. The metric localisation and place recognition results are shown in Table 4 and Fig. 16.
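As a sketch only: once a radar-to-lidar generator has been trained with the recipe of Weston et al. (2020), applying it at inference time amounts to a single forward pass per radar frame. The checkpoint name and the use of a TorchScript module below are placeholders, not part of any released implementation.

```python
import torch

# Placeholder: a pretrained radar->lidar generator exported as TorchScript.
generator = torch.jit.load("radar_to_lidar_generator.pt").eval()

@torch.no_grad()
def radar_to_pseudo_lidar(radar_image: torch.Tensor) -> torch.Tensor:
    """Translate a (1, 1, H, W) radar image into a lidar-like image.

    The output is then treated exactly like a lidar image by the rest of the pipeline
    (ray-tracing to a point-cloud, descriptor extraction, registration).
    """
    return generator(radar_image)
```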

5.4 Testing on different sequences

So far, the experiments presented have been evaluated on test set trajectories collected in places not seen in the training set, as illustrated in Fig. 8. However, the training and test data are still from the same day. The RobotCar Dataset features repeated traversals of the same route over different days. The recorded on-board lidar data can vary slightly over the same route across different days, for example due to non-static objects such as cars or pedestrians, or different driving patterns depending on traffic conditions. Here we demonstrate that our method generalises not only to unseen places, but also to data collected at different times.

Our training data come from sequences no. 2, no. 5, and no. 6. For additional validation we arbitrarily selected test set trajectories from sequences no. 7 and no. 18, which were collected 1 day and 5 days after the training data, respectively. The results for metric localisation are shown in Table 5 and compared to the original results presented in Table 2, which use the test set trajectory in sequence no. 2. The mean errors for metric localisation do not change significantly across the different sequences.

Figure 17 shows place recognition results evaluated on sequences no. 7 and no. 18. The top-1 retrieval accuracy drops slightly when evaluated on these different sequences, but the degradation is not significant.

Fig. 19

Qualitative results from the test set of RobotCar (rows 1 to 4) and KITTI (rows 5 to 9). From left to right: satellite image \(\textrm{I}^S\) (a), roadmap image \(\textrm{I}^R\) (b), occupancy image \(\hat{\textrm{O}}\) (c), lidar image \(\textrm{I}^L\) (d), point-clouds \(\textbf{P}^S\) (e) and \(\textbf{P}^L\) (f). All images are pose-aligned for better visualisation

5.5 Ablation studies

The training data contain approximately 14 thousand images for each dataset, prepared as detailed in Sect. 5 and Fig. 8. Here we perform ablation studies using only a fraction of the training data. The first reduced training set is formed by uniformly sampling \(10\%\) of the training data, selecting every 10th frame. The second reduced training set is formed by selecting the first 2000 images of the original training set. A model is trained for each reduced training set, keeping the number of training epochs the same as when using the full dataset.
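The two reduced training sets can be formed by simple index slicing over the ordered list of training frames; a sketch:

```python
def every_nth(frames, n=10):
    """Uniform temporal subsampling: every 10th frame, roughly 10% of the data."""
    return frames[::n]

def first_k(frames, k=2000):
    """Contiguous prefix: more samples, but less geographic and temporal variation."""
    return frames[:k]
```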

The results for metric localisation under an initial offset of \(\pm 10\) pixels, \(\pm 10\) pixels, \(\pm 180 ^\circ \) for x, y, and \(\theta \) are shown in Table 6. The results on top-1 retrieval accuracy for place recognition are shown in Fig. 18.

As shown in Table 6, even with reduced training data, our method results in smaller metric localisation errors than RSL-Net (Tang et al., 2020a). Place recognition results also remain greatly superior to the VGG16+NetVLAD baseline when our method is trained with reduced training data. Furthermore, although selecting the first 2000 frames yields more samples than selecting every 10th frame, uniformly sampling every 10th frame forms a training set with greater variation, resulting in better performance on test data.

5.6 Additional qualitative results

Additional qualitative results are shown in Fig. 19.

6 Conclusion and future work

This article presents a method that solves both place recognition and metric localisation of lidar using only publicly available overhead imagery. Specifically, by inferring occupancy information from overhead imagery as an intermediate representation, the modality difference is handled naturally, as both data sources are represented as point sets. Our method is the first to learn place recognition of a ground lidar against overhead imagery, and it also outperforms prior image-based methods on metric localisation between lidar and overhead imagery when the initial pose offset is large. While the method detailed in this paper is standalone and requires only overhead imagery, future work will integrate it into a large-scale lidar localisation system with prior lidar maps, where solutions derived from overhead imagery serve as auxiliary signals to further improve accuracy. An interesting study would be to take the DCP network previously trained on one dataset and train PointNetVLAD for place recognition on another dataset, so that place recognition can be learned for the new dataset without re-learning metric localisation. Another interesting research direction would be to learn place recognition between a ground range sensor and overhead imagery in an unsupervised fashion, without paired ground truth data.