Appearance-Based Sequential Robot Localization Using a Patchwise Approximation of a Descriptor Manifold

This paper addresses appearance-based robot localization in 2D with a sparse, lightweight map of the environment composed of descriptor–pose image pairs. Based on previous research in the field, we assume that image descriptors are samples of a low-dimensional Descriptor Manifold that is locally articulated by the camera pose. We propose a piecewise approximation of the geometry of such Descriptor Manifold through a tessellation of so-called Patches of Smooth Appearance Change (PSACs), which defines our appearance map. Upon this map, the presented robot localization method applies both a Gaussian Process Particle Filter (GPPF) to perform camera tracking and a Place Recognition (PR) technique for relocalization within the most likely PSACs according to the observed descriptor. A specific Gaussian Process (GP) is trained for each PSAC to regress a Gaussian distribution over the descriptor for any particle pose lying within that PSAC. The evaluation of the observed descriptor in this distribution gives us a likelihood, which is used as the weight for the particle. Besides, we model the impact of appearance variations on image descriptors as a white noise distribution within the GP formulation, ensuring adequate operation under lighting and scene appearance changes with respect to the conditions in which the map was constructed. A series of experiments with both real and synthetic images show that our method outperforms state-of-the-art appearance-based localization methods in terms of robustness and accuracy, with median errors below 0.3 m and 6°.


Introduction
Visual-based localization involves estimating the pose of a robot from a query image, taken with an on-board camera, within a previously mapped environment. The widely adopted approach relies on detecting local image features (e.g., points, segments) [1,2] that are projections of 3D physical landmarks. Though this feature-based localization has achieved great accuracy in the last years [3][4][5], it presents two major drawbacks that hinder long-term localization and mapping: (i) lack of robustness against image radiometric alterations; (ii) inefficiency of 2D-to-3D matching against large-scale 3D models [6]. A much less explored alternative to feature-based localization consists in localizing the robot through the scene appearance, represented by a descriptor of the whole image. According to this framework, localization is accomplished by comparing the appearance descriptor against a map composed of descriptor-pose pairs, without any explicit model of the scene's geometric entities [7,8]. This approach turns out to be particularly robust against perceptual changes and also appropriate for large-scale localization, as demonstrated by the fact that it is included in the front-end of state-of-the-art Simultaneous Localization and Mapping (SLAM) pipelines to perform relocalization and loop closure, typically in the form of Place Recognition (PR) [3,5].
The accuracy of appearance-based localization is, however, quite limited. Good results are reported only when the camera follows a previously mapped trajectory (i.e., in one approximates the geometry of a neighborhood of the Descriptor Manifold (assumed to be locally smooth) with respect to the pose space, predicting the local likelihood p(d q |x, PSAC m ) of the observation d q in a given pose x. In this example, the descriptor-pose pairs are extracted from two previous trajectories of the robot (in red and blue).

Related Work
This section reviews two concepts that are essential for the scope of this work: Global image descriptors and Appearance-based localization.

Global Image Descriptors
A well-founded way of getting a consistent dimensionality reduction from the Image Manifold to the Descriptor Manifold is through Manifold Learning tools, like LLE [19] or Isomap [20]. Their performance, however, is limited to relatively simple IMs that result from sequences of quasi-planar motions or deformations, like face poses, person gait, or hand-written characters [21]. Unfortunately, images taken in a real 3D scene give rise to complex, highly twisted IMs, which also present discontinuities due to occlusion borders [22]. Moreover, typical Manifold Learning tools are hardly able to generalize their learned representations to images captured under different appearance settings [15]. This prevents their application to generating low-dimensional embeddings adequate for camera localization.
Nevertheless, Deep Learning (DL)-based holistic descriptors have recently proven their suitability to enclose information from complete images, effectively reducing their dimensionality, while adding invariance to extreme radiometric changes [8,12,23,24]. This feature has made DL-based descriptors highly suitable for diverse long-term robot applications [9,25], e.g., Place Recognition (PR), where the goal is to determine if a certain place has been already seen by comparing a query image against a lightweight set of images [26]. In addition, these descriptors have proven to more sensitively reflect pose fluctuations than local features [26], which is described, for example, by the equivariance property [27,28]. Since we are targeting robust robot operation under different appearance conditions, global descriptors arise as the natural choice to address appearance-based localization.

Appearance-Based Localization
Appearance-based localization is typically formulated as a two-step estimation problem: first, PR is performed to find the most similar images within the map and, subsequently, the pose of the query image is approximated from the location of the retrieved ones [29,30]. In this scenario, DL-based works have proposed to improve the second stage through Convolutional Neural Network architectures that estimate relative pose transformations between covisible images [31][32][33].
The addition of temporal and spatial sequential information to appearance-based localization methods based on single instances provides more consistency to the estimation of the pose, as it reduces the possibility of losing camera tracking due to, for instance, perceptual aliasing [26]. Following this idea, SeqSLAM [10,34] proposes a sequence-tosequence matching framework that reformulates PR with the aim to incorporate sequentiality, leading to substantial improvements under extreme appearance changes. Building upon SeqSLAM, SMART [11] integrates odometry readings to provide more consistent results. More recently, Network-Flow-based formulations have also been proposed to solve appearance-based sequence matching under challenging conditions, addressing camera localization [35,36] or position-based navigation and mapping [37]. Despite their relevant results, the nature of all these works is discrete, unlike our proposal, restricting all possible estimation to the locations present on the map.
Conversely, CAT-SLAM [38] employs image sequentiality as a source of topometric information to improve the discrete maps used by FAB-MAP [9], allowing interpolation within the sequence map through the association of continuous increments on appearance and pose. Although the estimates produced by this approach are continuous, they are restricted to the mapped trajectory. Our work overcomes this constraint by requiring multiple sequences or pose grids as a source for constructing the map PSACs. This way, we can perform localization even at unvisited map locations near the PSACs, achieving, consequently, more accuracy and reliability.
An interesting alternative to pose interpolation is the use of Gaussian Processes (GPs) regression [39], nonparametric, general-purpose tools that allow generalizing discrete representations to a continuous model, and, hence, can be adapted to perform continuous localization within discrete maps. For instance, [40,41] employed GPs to generate position estimates for omnidirectional images in indoor maps, achieving good performance although lacking applicability to robot rotations. Our approach is instead designed to work with both 2D positions and rotations for conventional cameras.
In turn, the authors of [18] proposed Gaussian Process Particle Filters (GPPFs) to solve appearance-based localization in maps of descriptor-pose pairs. The GP works as the observation model of the PF, estimating the likelihood of the observed holistic descriptor at each of the particle positions. This localization pipeline was later improved in [42] by using only the nearest map neighbors in the GP regression, allowing efficient localization within large environments. Despite being promising, both works have three major drawbacks: (i) they define a unique Gaussian Process between poses and descriptors for the whole environment, assuming that the manifold geometry has a similar shape for the entire environment, thus leading to inaccurate estimations, (ii) they do not propose a relocalization process in case of losing tracking, and (iii) they only consider localization under the same appearance than the map, lacking robustness to radiometric alterations.
Inspired by these works, we employ a GPPF to solve appearance-based localization in a continuous and sequential fashion within challenging indoor environments. We solve their first problem by locally modeling the mapping between poses and descriptors via specific GPs for each PSAC, providing refined estimates for each neighborhood. Our proposal solves the second issue through a fast and multihypothesis relocalization process based on global PR within the map. Finally, the last issue is addressed by incorporating a model of the appearance variation between the mapped and query images to the map.

System Description
This section describes our proposal for the process of appearance-based camera localization. First, we define the elements that form the appearance map, which are key contributions in this work, and then we address PR-based localization and camera tracking using a probabilistic formulation based on a GPPF.

Patches of Smooth Appearance Change
The Patches of Smooth Appearance Change (PSACs) are regions that locally model the interrelation between camera poses and image descriptors, and represent areas where the change in appearance is small.

Definition
The basic building unit of a PSAC is the pair p i = (d i , x i ), composed by the global descriptor d i ∈ R d of an image and the pose x i ∈ SE(2) where it was captured.
We assume that these pairs are extracted from any of these two environmental representations ( Figure 2): either from several robot navigation sequences (at least two) or from pose grids where the cameras have densely sampled the environment given fixed position and rotation increments (i.e., a regular grid). Optimally, a subset of these pairs should be selected so that they constitute the smallest number of samples from which the Descriptor Manifold (DM) can be approximated with sufficiently good accuracy. These key samples can be viewed as the equivalent to the key-frames in traditional, feature-based visual localization, and hence, we denote them key-pairs (KPs). Since determining such an optimal subset is a challenging issue itself, out of the scope of this work, the KPs are constantly sampled from the total collection of pairs. Each PSAC is built from a group of adjacent KPs and approximates the DM in the region that they delimit. As explained later, the robot localization takes place within these PSACs by defining a suitable observation model for the GPPF (refer to Figure 1). Formally, let the m-th PSAC be where Q ≥ 3 is the number of KPs forming the PSAC. In turn, GP m is a Gaussian Process specifically optimized for that particular PSAC that delivers a Gaussian distribution over the image descriptor for any pose nearby the PSAC (further explained in Section 3.1.2). Thus, in order to determine the closeness between a query pair p q = (d q , x q ) and a particular PSAC, we define two distance metrics as follows: • The appearance distance D a m,q from the query descriptor d q to the m-th PSAC is defined as the average of the descriptor distances to each of its constituent key-pairs: • Similarly, but in the pose space, we define the translational distance D t m,q from x q to the m-th PSAC as with t q being the translational component of the pose x q = (t q , θ q ).
Finally, the set of all PSACs covering the environment that has been sampled forms the appearance map M: with M being the number of PSACs. This way, we achieve a much more accurate approximation of the relation between the pose space and the DM within each PSAC, ultimately modeling in M, patchwise, the complete shape of the DM.

GP Regression
GPs are powerful regression tools [39] that have previously demonstrated their validity as observation models in Particle Filters [17,18,42].
In this work, we learn a specific GP for each PSAC from its vertices KPs and also using all the nearest pairs, in terms of translational distance. Then, for a certain query pose x q , the GP m delivers an isotropic Gaussian distribution N (µ µ µ m,q , σ 2 m,q I d ), where µ µ µ m,q ∈ R d and σ 2 m,q ∈ R stand for its mean and uncertainty, respectively. This distribution is finally employed to estimate the likelihood p(d q |x q , PSAC m ) of an observed image descriptor d q , given the query pose x q within the PSAC m .
For this, the GP regression employs a kernel k, which measures the similarity between two input 2-D poses (x i , x j ), with this structure: This kernel k first multiplies two Radial Basis Function (RBF) kernels k RBF (a i , a j ) = β a exp(−α a ||a i − a j || 2 2 ) (α a and β a are optimizable parameters) for the separated translational and rotational components of the evaluated poses x = (t, θ). Then, a White Noise is added, which models the variation suffered by the image descriptors taken at the same pose but under different appearances (refer to Figure 3). This is justified because, although global PR descriptors have demonstrated outstanding results in terms of invariance, such invariance is not ideal and small differences might appear. Thereby, since the construction of the map M is typically carried out considering just one particular appearance, and we aim for the robot localization to be operational under diverse radiometric settings, we propose the inclusion of a white noise distribution accounting for this circumstance in the regression. We model such descriptor variation with the variance σ 2 W , computed as the average discrepancy between the descriptor variances of pose adjacent pairs p i , under the same σ 2 i,same and different σ 2 i,diff illumination settings:  Ideally, their descriptors (blue, red, and purple dots) should be identical but, in practice, certain inaccuracies appear. As the GP learns the descriptor distribution uniquely from the appearance of the map, this variation is not considered, leading to an underestimated GP uncertainty (red area, σ 2 m,q ). The inclusion of white noise expands such uncertainty (green area, σ 2 m,q ), solving this issue.

Robot Localization
Once we have defined all the elements involved in the representation of the environment, we address here the process of localization within the appearance map M. We aim to estimate the robot pose through appearance-based continuous tracking using a Gaussian Process Particle Filter (GPPF), namely, a PF that employs the GPs described above as observation models. Being well-known robot localization tools, we do not provide a deep description of Particle Filters here but instead refer to the reader to the seminar work [43] for further information.
At time-step t, each of the P particles in the filter represent a robot pose estimation p , proportional to its likelihood. Besides, each particle is assigned to a certain region PSAC (t) m,p , as explained next.

System Initialization
When the PF starts, we perform global localization based on Place Recognition to select the most similar PSAC in M to the query descriptor according to its appearance: To account for multihypothesis initialization, we also consider as candidates those PSACs whose appearance distance is under a certain threshold proportional to D am ,q . Subsequently, the particles are uniformly assigned and distributed among all candidate PSACs, setting their initial weights to w P . Note that, if the robot tracking is lost during navigation, this procedure is launched again to reinitialize the system and perform relocalization.

Robot Tracking
Once each particle is assigned to a candidate PSAC, the robot pose estimation is carried out following the traditional propagation-weighting sequence: Propagation. First, the particles are propagated according to the robot odometry: with υ (t) ∼ N (υ (t) , Σ υ ) representing noisy odometry readings, and ⊕ being the pose composition operator in SE(2) [44]. Weighting. After the propagation, the translational distance between each particle's pose and all the PSACs is computed, so that the particle is assigned to the nearest PSAC (PSAC (t) m,p ). Then, we use the GP regressed in the PSAC to locally evaluate the likelihood of the observed descriptor d q at the particle pose x (t) p as follows: with d being the dimension of the descriptor. Finally, apart from propagation and weighting, two more operations can be occasionally applied to the particles.
Resampling: In order to prevent particle degeneracy, the GPPF resamples when the number of effective particles is too low, promoting particles with higher weights.
Reinitializing: During normal operation, the GPPF may lose the tracking of the camera, mainly due to extremely challenging conditions in the images (e.g., very strong change appearances, presence of several dynamic objects). We identify this situation by inspecting the translational distance between each particle and the centroid of its assigned PSAC, defined as the average pose of all the key-pairs forming the PSAC. If all particles are at least twice farther from the centroid of their assigned PSAC than its constituent key-pairs, the tracking is considered lost. Consequently, the PF relocalizes by following the aforementioned initialization procedure.

Experimental Results
In this section, we present three experiments to evaluate the performance of our appearance-based localization system.
First, we carry out a verification of the regression outcome in Section 4.1, with the aim to experimentally validate the hypothesis of a smooth Descriptor Manifold within the regions covered by each PSAC. In Sections 4.2 and 4.3, we test our proposal with four different state-of-the-art global descriptors in two different datasets, with a combination of setups for the map sampling. This has provided us with an insight about the error incurred by our proposal and has allowed us to determine the best configuration for localization. The second experiment, in Section 4.4, compares the resulting setup with three appearancebased localization alternatives in terms of accuracy and robustness, revealing that our system equals or improves their performance in every scenario.
It is important to highlight that this evaluation does not include comparisons with feature-based localization methods, since they are not compatible with appearance changes in the images used for both mapping and localization, which is one of the main benefits of our proposal.
We employed two different indoor datasets for the experiments: • The COLD-Freiburg database [45], which includes real images from an office gathered with a mobile robot under different appearance conditions. We have used only a representative subset of the sequences in part A of the dataset (Figure 4a). • The synthetic SUNCG Dataset [46] rendered through the virtual HOUSE3D environment [47] allows us to test the localization on a grid map (see Figure 4b).
In turn, regarding the global image representation, we have tested the following state-of-the-art appearance descriptors: Although none of these descriptors has been specifically designed to fulfill suitable properties for our pose regression approach, they have achieved promising results in terms of localization accuracy, as shown next. We used the GPy tool [50] to implement the proposed Gaussian Processes and empirically determined σ 2 W from Equation (6), for each descriptor, by randomly sampling N = 2000 adjacent pairs with diverse illumination settings from the COLD-Freiburg database.
In this evaluation, we employ as metrics the median errors in translation and rotation (to inspect our method's accuracy), as well as the percentage of correctly localized frames (which illustrates the tracking and relocalization capabilities of our method). It is worth mentioning that traditional trajectory-based evaluation metrics as Absolute Trajectory Error (ATE) or Relative Pose Errors (RPE) are not applicable to this approach since our proposal, and appearance-based localization methods in general, yields global pose estimations that are not guaranteed to belong to a trajectory, due to possible tracking losses and relocalization situations.

Corridor: Sanity Check
The main assumption of our proposal is the hypothesis of a locally smooth Descriptor Manifold with respect to the pose, on which PSACs are based. Since this assumption is not justified by previous work, we have conducted a basic test to evaluate the regression outcome of the PSACs in a simple scenario.
The proposed experiment studies the evolution of the image descriptor along a simple, linear trajectory, by comparing the observed descriptor and the mean of the descriptor distribution resulting from the regression within the PSAC. In this manner, the behavior of the descriptor can be examined along the corridor axis in order to prove its continuity and the validity of the PSAC approximation. For this, we have selected a portion of an artificially illuminated (night) sequence where the robot traveled along a ∼8 m-long corridor, as well as the NetVLAD image descriptor. For the PSACs, we used a map constructed with images with the same appearance selected every 20 frames.
In order to represent the evolution of the descriptors, we have applied Principal Component Analysis (PCA) to them and represented the first PCA element (that with larger variation). Thus, Figure 5 depicts the trajectory of the robot through the corridor along with the value of said first PCA element for both the observed descriptor and the mean of the descriptor distribution regressed by the GPs at each PSAC. The displayed results demonstrate that the descriptor has a continuous evolution along the corridor, almost lineal in the central part. Besides, the PSACs are proved to also have a continuous outcome and to approximate very accurately the values of the observed descriptor along the sampled trajectory.

COLD: Sequential Map Testing
The COLD-Freiburg database (part A) provides odometry readings and real images for two different itineraries, namely, (i) extended (∼100 m-long), which covers the whole environment; (ii) standard (∼70 m), covering a subset of the environment, both depicted in Figure 4a. The dataset also provides images gathered under three different lighting conditions: at night (with artificial illumination), and on cloudy and sunny days.
In order to create the appearance maps for the experiments, we employed the first and second night sequences of the extended itinerary, since images captured under artificial illumination do not suffer from severe exposure changes or saturation like under the remaining conditions. From here on, we will refer to these as map sequences. Specifically, key-pairs from both map sequences have been obtained through Constant Sampling (CS) every 10, 20, and 30 pairs, resulting in three different maps with diverse density (described in more detail in Table 1), using Q = 4 KPs to construct every PSAC. Finally, we have setup an extensive evaluation with six other sequences including different routes and illumination conditions: the first night; cloudy and sunny sequences of the standard part; the first cloudy and sunny sequences; and the third night sequence of the extended part. Figure 6 shows a comprehensive test study depicting the localization performance of our proposal after twenty runs for all test sequences at each map, using the median translational (top) and rotational (down) errors as metrics. Note that the number of particles for the PF has been set to P = 10 3 , as we have empirically found that increasing that number does not improve the accuracy results. The overall performance shows a median error predominantly below 0.3 m and 6°, which denotes promising results given the pure appearance nature of our approach, i.e., with no geometrical feature employed for localization. The results in Figure 6 show that scene appearance seems to be a key issue regarding the system's accuracy, as our proposal achieves better results in less-demanding lighting conditions like artificial illumination (night) or cloudy. Nevertheless, our system still demonstrates notable performance under challenging radiometric conditions, such as in sunny sequences (e.g., presence of lens flares and image saturation), hence proving its suitability for robust appearance-based localization.
On the other hand, the number of KPs that form the map is another factor influencing the performance, since the PSACs approximate the pose-descriptor interrelation the closer their KPs are. Although not particularly significant under advantageous conditions, this factor severely affects performance under challenging situations, as in sunny sequences, where localization is hindered as the sampling frequency decreases. Note that a more elaborate mapping technique than CS would improve these results, since an optimal selection of KPs would conform PSACs that achieve a more precise description of the DM geometry. Nevertheless, this will be explored in future work while, in this paper, we rely on CS to get still nonoptimal but notable results.
Finally, regarding the tested PR descriptors, the results show that in most cases, all perform similarly, with NetVLAD mostly achieving slightly better results. In turn, 1M COLD Quadruplet seems to struggle under complex illumination conditions, which might indicate that its empirically estimated white noise variance is unlikely to account adequately for these cases. The similar performance shown by all descriptors agrees with the fact that none of them has been specifically trained for appearance-based localization.

SUNCG: Grid Map Testing
The SUNCG Dataset provides a set of synthetic houses where a virtual camera can be placed at any pose. This feature allowed us to create a regular grid map in the space of planar poses with the camera and then to evaluate the impact of the map density in our proposal. Note that this dataset does not present appearance changes, and hence, the effect of such a characteristic cannot be evaluated in this experiment.
First, our dense grid map was created by selecting KPs with constant increments of 0.5 m in translation and 36°in rotation (refer to the red dots in Figure 4b). Then, we used subsampling to generate more grid maps for the evaluation, as described in Table 1, namely, the sparse-position map (subsampling half of the positions); the sparse-rotation map (subsampling half of the orientations); and the sparse-position-rotation map (subsampling both at the same time). In this case, we created PSACs with Q = 8 KPs for all maps. Additionally, we have recorded three ∼30 m-long test sequences following the trajectories shown in shades of green in Figure 4b, generating a synthetic odometry corrupted by zeromean Gaussian noise with σ u = (0.06 m, 1°).
The results of this experiment are shown in Figure 7, comparing the median errors in translation (left) and rotation (right) for all the descriptors employed in the previous experiment and for the described versions of the grid map. Again, we have set P = 10 3 particles for the PF. As can be seen, our proposal yields median errors under 0.2 m and 6°i n the dense map, while using subsampled maps hinders the process of localization. It can be noted that subsampling exclusively on rotations does not worsen the accuracy, while subsampling positions has a noticeable impact on the overall performance. Consequently, PSACs prove to handle information sparseness more efficiently in orientation than in position. Not surprisingly, subsampling in both position and orientation clearly achieves the worst localization performance due to the combined loss of information.  Regarding the holistic descriptors, all of them again demonstrate a similar behavior for each subsampling case, with 1M RobotCar Volume performing worst. ResNet-101 GeM and NetVLAD, in turn, achieve the best performance.
These results demonstrate that uniform grid sampling is a rough strategy for mapping environments, achieving results highly dependent on the sampling density. Besides, the construction of such maps with real robots becomes largely time-consuming, mostly being realizable when using virtual environments. Future work should investigate more elaborated strategies, designed to fulfill more adequate criteria concerning the map creation, ultimately pursuing an optimal approximation of the DM geometry.

Comparative Study
Finally, we compare the localization performance between our proposal and state-ofthe-art appearance-based methods of diverse nature in both datasets. For the setup of our method, we selected the NetVLAD descriptor due to its performance against appearance changes, and added the KPs every 20 pairs for the COLD dataset, as it represents a fair trade-off between accuracy and the number of KPs employed.
These are the appearance-based localization methods involved in the comparison: • Gaussian Process Particle Filter (GPPF) [42], configured with P = 10 3 particles. • The Pairwise Relative Pose estimator (PRP) presented in [31]: a CNN-based regressor that estimates the pose transform between the query and the 5 most similar map images obtained through PR. • The Network flow solution proposed in [37]: a sequential sparse localization method that includes uniform and flow-based mapping, both considered in this study. In order to make the results comparable, we modified its outcome, which is sparse, to pro-duce continuous estimations. For that, we used the following weighting after the bipartite matching: where x j is the pose of the each of the N k = 5 most contributing KPs, y ij is the flow connecting the i-th query and the j-th KP, and y i = ∑ N k j y ij represents the query flow from the nearer KPs (refer to [37] for further info). • Our approach, configured with P = 10 3 particles. Table 2 shows the compared performance between all the described algorithms after twenty runs for each scenario. Note that, apart from the median errors, we included the percentage of correctly localized frames along the trajectory, showing the tracking robustness and relocalization potential. Concretely, a frame is considered to be correctly localized when the distance between the estimate and its true pose is below (0.5 m, 10°). Table 2. Comparative median position and rotation errors, and % of correctly localized frames (m, • , (%)) of different stateof-the-art appearance-based localization methods. A frame is correctly localized when the distance between the estimate and its true pose is below (0.5 m, 10°) (L: sequences where the tracking got lost. N/A: not applicable). PRP-Pairwise Relative Pose estimator; PR-Place Recognition. In bold, best result for each sequence. As can be seen, the challenging radiometric conditions in the COLD-Freiburg database caused the GPPF method to lose tracking, while the PRP estimator achieved low accuracy as a result of not exploiting the trajectory sequentiality, performing PR at every time-step instead. In turn, the solutions based on Network flow provide very accurate estimations in general, with the best results achieved by the uniformly sampled map under favorable conditions (i.e., night and cloudy sequences) and slightly worse in the case of severe appearance changes (i.e., sunny sequences). Our proposal, in contrast, demonstrates providing consistent results regardless of the appearance settings, achieving similar results to the Network flow solution in favorable conditions and outperforming all other methods under challenging radiometric settings.
In the case of the SUNCG dataset, the formulation proposed by the Network flow is incompatible with grid maps covering multiple rotations at the same location, as they are conceived to work only with positions. In turn, PRP and GPPF obtain low performance, even worsened in subsampled maps, while our proposal achieves the best results.
Despite its similarity with our approach, GPPF has shown to be unable to achieve robust localization due to the abovementioned issues: (i) deficiencies from considering a single pose-descriptor mapping for the whole environment, (ii) the absence of a relocalization process, and (iii) its limitation to environments without radiometric changes.
In conclusion, the presented comparison proves that these state-of-the-art localization methods based on appearance cannot provide both consistent and accurate localization estimations while operating within maps of diverse nature and captured under different appearance conditions. Our method, in turn, achieves higher performance in these conditions in terms of precision and robustness, showing notable results given its pure appearance nature.

Conclusions
We have presented a system for appearance-based robot localization that provides accurate, continuous pose estimations for camera navigation within a 2D environment under diverse radiometric conditions. Our proposal relies on the assumption that image global descriptors form a manifold articulated by the camera pose that adequately approximates the Image Manifold. This way, we gather pose-descriptor pairs from a lightweight map in order to create locally smooth regions called Patches of Smooth Appearance Change (PSACs) that shape, piecewise, the Descriptor Manifold geometry. Additionally, we robustly deal with appearance changes by modeling the descriptor variations under a white noise distribution.
We implemented a sequential camera tracking system built upon a Gaussian Process Particle Filter, which allows for multihypothesis pose estimation. Thus, our system optimizes a specific GP for each PSAC, subsequently being employed to define a local observation model of the descriptor for the Particle Filter. Furthermore, our method includes a relocalization process based on PR in case of tracking loss.
A first set of experiments has shown our proposal's error baseline in different environments and for a selection of holistic descriptors, revealing the most suitable configuration for our system. Finally, we have presented a comprehensive evaluation of the localization performance, showing that our approach outperforms state-of-the-art appearance-based localization methods in both tracking accuracy and robustness, even using images with challenging illuminations, yielding median errors below 0.3 m and 6°. Consequently, we have proven that pure appearance-based systems can produce continuous estimations with promising results in terms of accuracy, while working with lightweight maps and achieving robustness under strong appearance changes.
Future work includes research about (i) building the appearance map in an optimal way and wisely selecting where to sample the Descriptor Manifold; (ii) the design of a novel holistic descriptor that is more adequate to perform pose regression while keeping high invariance to radiometric changes.