Adaptive Feature Refinement and Weighted Similarity for Deep Loop Closure Detection in Appearance Variation

Abstract: Loop closure detection (LCD), also known as place recognition, is a crucial component of visual simultaneous localization and mapping (vSLAM) systems, aiding in the reduction of cumulative localization errors on a global scale. However, changes in environmental appearance and differing viewpoints pose significant challenges to the accuracy of LCD algorithms. Addressing this issue, this paper presents a novel end-to-end framework (MetricNet) for LCD to enhance detection performance in complex scenes with distinct appearance variations. Focusing on deep features with high distinguishability, an attention-based Channel Weighting Module (CWM) is designed to adaptively detect salient regions of interest. In addition, a patch-by-patch Similarity Measurement Module (SMM) is incorporated to steer the network in handling challenging situations that tend to cause perceptual aliasing. Experiments on three typical datasets have demonstrated MetricNet's appealing detection performance and generalization ability compared to many state-of-the-art learning-based methods, with the mean average precision increased by up to 11.92%, 18.10%, and 5.33%, respectively. Moreover, the detection results on additional open datasets with apparent viewpoint variations and on the odometry dataset for localization problems have also revealed the dependability of MetricNet under different adaptation scenarios.


Introduction
Visual simultaneous localization and mapping (vSLAM), which simultaneously reconstructs camera pose and scene structure from video inputs, is one of the critical autonomous positioning and navigation technologies in various areas [1]. As an essential component of vSLAM systems, loop closure detection (LCD) is designed to detect whether the autonomous mobile robot has returned to a previously visited location. LCD is also known as a place recognition process. An accurate LCD method can help introduce additional global geometric constraints for the back-end pose optimization to reduce trajectory drift over time [2]. However, it has two major challenges: (1) the same location looks different due to illumination changes and seasonal variations, and (2) different scenes look similar due to the presence of similar objects. Therefore, to achieve lifelong localization, LCD must overcome the above challenges to provide reliable detection results.
In a vSLAM system, the LCD works as an appearance-based image matching component to compare the similarity between current and previously captured images. If the similarity score surpasses a predefined threshold, the corresponding images are identified as having achieved loop closure. In most popular approaches, LCD algorithms have two key steps: feature extraction and similarity measurement [3].
Feature extraction is always a prerequisite for tasks such as keyframe extraction, tracking, positioning, and mapping. It has a decisive influence on the performance of higher-level tasks. Traditional appearance-based methods often follow the visual bag-of-words (BoW) approach [2], where image descriptors are quantized into visual word vectors using clustering algorithms. However, the utilization of traditional handcrafted features discards geometric and structural information to some extent, making it difficult to cope with challenges like illumination changes and appearance variations.
Fortunately, deep learning (DL) networks have recently made significant breakthroughs in computer vision. Many related studies have shown that deep features adaptively learned in a data-driven manner can provide more robust image representations under changing environmental conditions [4]. Therefore, learning-based frameworks are intuitively applied to solve LCD problems [5]. Models based on convolutional neural networks (CNNs) are widely adopted for their outstanding performance and efficiency [6]. Although learning-based LCDs have shown promising performance in extracting robust features, few studies have focused on similarity measurement strategies. Moreover, many algorithms ignore the internal mutual constraints between the feature extraction and similarity measurement processes, resulting in a lack of systematic optimization of the networks.
Therefore, this paper presents a novel end-to-end LCD method designed for complex scenes with distinct appearance variations. The approach employs an AlexNet-based feature extraction network (FEN) to capture high-level visual representations of images. In comparison to hand-crafted feature extraction methods, FEN demonstrates superior capability for perceiving complex environments by mining potential geometric and structural information within the image. Additionally, a self-attention component, the Channel Weighting Module (CWM), is employed to extract higher-level features from regions of interest, effectively retaining key information for LCD. The Similarity Measurement Module (SMM) then computes an adaptively weighted similarity score from a patch-by-patch similarity matrix, which improves loop closure detection ability. Taking advantage of adaptive weighting, MetricNet focuses on the effective spatial patches for score estimation, achieving robustness for LCD in challenging environments with significant appearance variations. In summary, our main contributions are as follows:

• Dynamic feature selection with self-attention: A novel learning-based LCD framework, incorporating the Channel Weighting Module guided by the insight of inverse document frequency in BoW, is proposed to distill the distinguishable spatial cues and regions of interest.

• Adaptive weighted similarity measurement for appearance variation: To enhance detection accuracy, a weighted similarity score is generated in the Similarity Measurement Module to distinguish positive and negative pairs based on a patch-by-patch matrix.

• Comprehensive multi-dataset validation: MetricNet achieves appealing performance on three typical datasets with drastic illumination and seasonal changes. It also delivers reliable results in scenes with significant viewpoint variations and performs well in localization applications.
The rest of this paper is organized as follows: A brief introduction of related works is provided in Section 2. The theoretical derivation and implementation of the proposed method are detailed in Section 3. Experimental results are presented in Section 4, and the conclusion is drawn in Section 5.

Related Works
In vSLAM systems, many practical approaches have been proposed to exploit the similarity between keyframes to achieve correct loop closure detection. In this case, LCD is essentially an image matching problem that consists of two main steps: feature extraction and similarity measurement. Meanwhile, the challenges of LCD in long-term operation have garnered significant attention.

Hand-Crafted Feature Representation
Traditional handcrafted features are manually designed to extract specific image features, which are usually divided into two categories: local and global descriptors. Taking advantage of compact representation and computational efficiency, the methods implemented with global features can describe the image appearance using a single vector, e.g., BRIEF [7] and SURF [8]. Histogram statistics can also be used for global description, e.g., HOG [9]. In addition, BoW enables the holistic description of incoming images by aggregating quantized local features, known as visual words [10]. The work VLAD [11] combines local image descriptors into a vector representation. In addition, several classical unsupervised algorithms transform the original features into binary codes, such as ITQ [12], CBE [13], SGH [14], and UBEF [15].
In contrast, several well-established methods extract local features, such as SIFT [16] and ORB [17]. LPM [18] aims to preserve the local neighborhood structures of potential true matches. However, these handcrafted methods neglect latent geometric and structural information, making it difficult to cope with challenging situations such as intense illumination changes and dynamic environments.

Learned Feature Representation
The emergence of learning-based methods has accelerated the development of computer vision technologies. Feature extraction based on deep learning has achieved great success in image recognition, classification, and retrieval [19]. Many powerful models, such as AlexNet [20], VGG [21], and ResNet [22], have been used as base architectures in LCD [23].
NetVLAD [24] applies soft assignment of local descriptors to learned clusters, making it an end-to-end trainable architecture for visual place recognition using a triplet ranking loss function. CAE-VLAD-Net [25] can extract deep features from the input image data by utilizing the locally aggregated descriptor, stacked auto-encoders, and a teacher-student training strategy. It detects loop closures on keyframes based on the Euclidean distances between the extracted features. FILD++ [26] can jointly extract global and local convolutional features at different scales and constructs an incremental database using the global features to recommend potential loop closures, which are then evaluated using the local features. However, some experimental results have shown that FILD++ has difficulty dealing with significant background changes caused by weather changes and perceptual aliasing.
In addition, perception information about surrounding objects can significantly elevate the performance of LCD. In the cooperative perception and control-supported infrastructure-vehicle system (IVS) [27], environmental perception based on object detection and semantic information also improves the ability of connected autonomous vehicles (CAVs). Many cooperative perception methods rely on visual sensing data and output the object's state, such as location and velocity. Therefore, it is necessary for existing localization and IVS approaches to use and enhance environmental perception. In loop detection tasks, the cognitive perception of infrastructure can be achieved by detecting image landmarks derived from image patches to describe the visual data. The LCD method WASABI [28] builds a descriptor for place recognition across seasons based on the wavelet transform of the semantic edges of the image. Impressive results have also been achieved with the idea of detecting salient regions in late convolutional layers. These regions of interest can be selected by applying an attention mechanism. The work of Chen et al. [29] selects the most salient regions by extracting unique patterns based on the responses of the strongest convolutional layers.

Similarity Measurement
To quantify the confidence that the robot is observing a previously mapped area, several comparison techniques have been proposed. Based on different map representations, they can be broadly classified into two categories: image-to-image and sequence-to-sequence.

Image-to-Image Matching
Image-to-image methods rely on an individual similarity score to make the decision. This similarity score is compared with a predefined hypothesis threshold to determine whether the query image is topologically related to the older image. The sum of absolute differences (SAD) and the Euclidean or cosine distance are the most commonly used metrics to estimate the matching confidence between two image features. When global representations (either handcrafted or learned) are used, direct matching is a reasonable measure of similarity. However, when local features are extracted, probabilistic methods are employed. FAB-MAP [10], a probabilistic approach to the problem of place recognition, is developed to explicitly account for perceptual aliasing in the environment. The patch-based method STA-VPR [30] proposes an adaptive dynamic time warping (DTW) algorithm to align local features from the spatial domain while measuring the distance between two images, which realizes viewpoint-invariant and condition-invariant place recognition.
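For concreteness, the three common metrics mentioned above can be computed directly on a pair of global descriptor vectors (a minimal NumPy sketch; the descriptors here are random placeholders, not tied to any particular feature extractor):

```python
import numpy as np

def sad(a, b):
    # Sum of absolute differences: lower means more similar.
    return float(np.abs(a - b).sum())

def euclidean(a, b):
    # Euclidean (L2) distance: lower means more similar.
    return float(np.linalg.norm(a - b))

def cosine(a, b):
    # Cosine similarity: higher means more similar (1 for identical directions).
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
query, reference = rng.random(256), rng.random(256)  # placeholder global descriptors
print(sad(query, reference), euclidean(query, reference), cosine(query, reference))
```

Note that SAD and Euclidean distance are dissimilarity measures, whereas cosine similarity increases with resemblance, so thresholding logic differs between them.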

Sequence-to-Sequence Matching
Conversely, sequence-to-sequence methods are typically based on the comparison of submaps. The members of the groups with the highest similarity scores are considered the loop-closing image pairs. SeqSLAM [5] calculates the best candidate matching location within each local navigation sequence. However, it is less robust to changes in viewpoint. The sequence processing model MCN [31] adapts hierarchical temporal memory (HTM) for mobile robot place recognition. However, MCN suffers from instability and long runtimes due to its use of randomization operations. To improve on this, SMCN [32] simplifies MCN and combines it with intra-set similarity and novel temporal filters to exploit the temporal continuity in LCD.

Appearance Variation and Dynamic Environment
In LCD problems, it becomes increasingly difficult to determine whether two images were taken at the same location when faced with environments with significant variations in appearance and dynamic objects. Many researchers have proposed CNN-based loop detection methods [33] to overcome the challenges of changing environmental conditions, such as illumination and seasonal changes.
Chen et al. [34] extract high-level features from AlexNet and achieve image invariance by applying multi-scale deep feature fusion. The patch-based method SAES [35] develops a self-adaptive enhanced similarity metric to enhance the discriminative ability for appearance variations. Although it achieves competitive detection results, the image separation prior to feature extraction causes the method to ignore the spatial information between patches. CALC2.0 [36] is trained to construct a global feature space composed of local features encoding visual appearance, semantic information, and keypoint descriptions based on the residual activations of different cells in convolutional feature maps. The addition of multiple pieces of information certainly helps the model overcome appearance variations. The work of Schubert et al. [37] describes unsupervised learning methods for visual place recognition in discretely and continuously changing environments. They utilize PCA-based approaches and propose a novel clustering-based extension of statistical normalization.
In addition, semantic-based perception information can provide a higher degree of invariance. Many works have shown that the use of semantic information [33] can better address the problem in dynamic environments. PlaceNet [38] is a multi-scale deep auto-encoder network augmented with a semantic fusion layer for scene understanding, designed to handle dynamic scenes full of moving objects.

Perceptual Aliasing and Viewpoint Variation
Furthermore, the occurrence of similar object appearances in different locations is termed the "perceptual aliasing" problem [39]. Note that due to the similarity of visual words for loop measurements, BoW-based methods tend to produce false matches [40]. In this situation, instead of relying on image-to-image matching, LCD performance can be improved by incorporating multi-view information.
However, the change in viewpoint can range from minor to very dramatic. In general, ground robots tend to view the world from much the same viewpoints over repeated visits. However, the situation is much more complicated when the traversal direction is opposite. Traditional loop closure detection systems may not provide satisfactory results in such scenarios. Some novel algorithms are designed to compensate for this by associating ground-to-air information [41]. The work of Jin et al. [42] designs a novel multi-tuplet cluster loss function to extract more discriminative feature vectors that are invariant to strong state changes and viewpoint changes. Semantic-based mapping techniques are typically used to achieve greater viewpoint robustness [43] in LCD. Infrastructure-based cognitive perception [27] can provide more effective information.
Both the traditional and deep learning approaches mentioned above treat feature extraction and similarity measurement as two independent processes, relying on fixed distance metrics calculated from off-the-shelf features at the element level for matching. However, the effectiveness of similarity measurement is heavily influenced by the quality of the extracted features. Motivated by this, we propose a deep learning-based network that integrates feature extraction and similarity measurement into a tightly coupled structure.

System Model
This section introduces the proposed framework (Figure 1) in detail. The main idea of the paper is to jointly optimize the feature extraction and similarity calculation stages and to construct a similarity matrix that utilizes the spatial information of the image to improve detection performance in complex and changing scenes. The query and reference image pairs are taken as the input of our MetricNet. After processing by the feature extraction network, we obtain the generated high-level visual representations. To achieve distinguishability in image descriptions, a self-attention component called the Channel Weighting Module is designed to extract the high-level features of regions of interest. Then the Similarity Measurement Module computes the patch-by-patch similarity matrix and obtains a value between 0 and 1 as the similarity of the pair of images to determine whether the two images are collected from the same location.

Inspired by SAES [35], MetricNet focuses on the spatial patches of image information. In SAES, the visual inputs are directly split into four patches and fed into feature extraction networks. However, this results in the truncation of objects near the visual center. While the implications for localization tasks may not be immediately apparent, such changes can compromise the integrity of image information and lead to false loop detections. In contrast, in our method the learning-based feature extraction is applied to the entire input image, and the high-dimensional features are then segmented into four patches. MetricNet thus maintains information integrity, specifically accounting for the data along the boundaries of these patches.
The proposed method mainly incorporates three modules: Feature Extraction Network (FEN), Channel Weighting Module (CWM), and Similarity Measurement Module (SMM).

Deep Feature Extraction
In LCD tasks, obtaining the ground truth for a dataset entails a substantial workload. Therefore, when training data are limited, it is crucial to utilize these data efficiently to train a model with strong generalization performance. This paper introduces the idea of transfer learning and fine-tunes the pre-trained network in an end-to-end manner on the existing limited training dataset, making the network more suitable for LCD tasks. This approach not only shortens the training time, promoting faster network convergence, but also enhances the generalization ability of the model.
To select the network with the best initial performance for the LCD task, this paper pre-screened the existing pre-trained networks. The selection method involved testing multiple networks on the same dataset and choosing the one that achieved the highest recall rate at 100% precision. As shown in Figure 2, AlexNet [20] achieved the highest recall rate at 100% precision, so it was chosen as the feature extraction module. The complete AlexNet network has five combined convolutional layers (including convolution, normalization, and max-pooling), three fully connected layers, and a subsequent softmax layer. The output of each layer of the AlexNet network can be extracted separately as the global features of the image, and as the network goes deeper, the ability of the features to represent images is enhanced layer by layer. However, it has been demonstrated [6] that the features extracted from the fully connected layers cannot ideally characterize the image due to the loss of spatial information, as shown in Figure 3. Therefore, as shown in Figure 1, this paper discards the subsequent fully connected and softmax layers of AlexNet and only preserves the five combined convolutional layers (5CONVs) for feature extraction. The details of the network structure are shown in Table 1, including the network type, output dimension, filter size, stride, and padding.

Channel Weighting Distillation
In imagery, different features have varying levels of discriminative importance. Ubiquitous elements, such as textureless walls, ground, and sky, are less distinguishable as they appear repeatedly in different scenes, making it difficult to complete LCD tasks based on them. Conversely, objects that are infrequent in the environment, such as traffic signs and landmarks, exhibit higher distinctiveness. Treating features extracted from different distinguishable parts equally can result in more false positives in loop closure detection. Therefore, more attention should be given to the parts with higher distinguishability among the features extracted from the image, while the importance of parts with lower distinguishability should be reduced.
Accordingly, to emphasize the imperceptible features of highly distinctive objects, we introduce a channel weighting mechanism in our FEN. Drawing inspiration from the approach in [44], we adopt a self-adaptive Channel Weighting Module (CWM), which is analogous to the Inverse Document Frequency (IDF) used in Bag-of-Words (BoW). In this context, a feature's discriminability is inversely correlated with its frequency in the dataset. The less frequently a feature appears, the more important it becomes for classifying images. By calculating adaptive channel weights, we focus on low-frequency features that are more discriminative. The formulation of the CWM is defined as follows:

T_c = (1/(HW)) ∑_{h=1}^{H} ∑_{w=1}^{W} 1(F^c_{h,w} > 0),

where F ∈ R^{C×H×W} denotes the 3-dimensional features calculated by FEN, F_c ∈ R^{H×W} is the feature map on the c-th channel of F, and (h, w) indexes a feature pixel in F_c as F^c_{h,w}. If F^c_{h,w} > 0, this indicates a positive response of the feature pixel within the feature extraction network.
After counting, T_c represents the mean positive response of the c-th channel feature. To accentuate less conspicuous features, the self-adaptive attention mask W^m_c ∈ R of the c-th channel is constructed using a log-based inverse function:

W^m_c = log(1/(T_c + ε)),

where ε is a small constant that prevents division by zero. This approach preferentially amplifies feature maps exhibiting lower responsiveness in the channel dimension. Based on that, our methodology facilitates selective feature enhancement.
Then we can derive the weighted feature map F_weight ∈ R^{C×H×W} utilizing a self-attention mechanism across the channel dimension. The calculation process of F_weight on the c-th channel can be described as

F^c_weight = W^m_c · F_c.

Since the calculation of channel-wise attention masks completely depends on the model's response to visual features, there are no static trainable parameters in the self-attention mechanism. Therefore, CWM can dynamically obtain the channel-wise attention mask in a data-driven manner.
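The channel weighting described above can be sketched in a few lines of NumPy. Note that the log-based inverse mask shown here, log(1/(T_c + ε)), is our assumption of the exact form; the text only states that a log-based inverse function of the mean positive response is used:

```python
import numpy as np

def channel_weighting(F, eps=1e-6):
    """Self-adaptive channel weighting (CWM sketch).

    F: feature volume of shape (C, H, W) from the feature extraction network.
    Returns the channel-weighted features and the per-channel masks.
    """
    C, H, W = F.shape
    # T_c: mean positive response of each channel (fraction of pixels > 0).
    T = (F > 0).reshape(C, -1).mean(axis=1)
    # Log-based inverse mask (assumed form): rarer, low-response channels get
    # larger weights, analogous to inverse document frequency in BoW.
    W_mask = np.log(1.0 / (T + eps))
    return F * W_mask[:, None, None], W_mask

rng = np.random.default_rng(0)
F = np.maximum(rng.normal(size=(256, 6, 9)), 0)  # ReLU-like feature volume
F[0] = 0.0                                       # a channel that almost never fires
F_weight, masks = channel_weighting(F)
```

Because the masks are computed from the input features themselves, the module adds no trainable parameters, matching the data-driven behavior described above.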

Similarity Measurement
Common similarity measurement methods based on local and global pixel-level descriptors often lead to misjudgments when dealing with scenarios where the appearance varies significantly at the same location or appears similar at different locations.To alleviate the aforementioned problem, this paper aims to leverage the spatial information of images to enhance the capability to handle variations in environmental appearance due to changes in lighting, weather, or seasons.
Taking advantage of the natural properties of the CNN convolution process, in which consecutive convolutional sliding windows overlap, all the context information at the patch boundaries is preserved. Unlike SAES [35], the similarity matrix of MetricNet can be constructed without loss of information, and the data distribution in the similarity matrix determines the final similarity score between two frames. We derive the feature descriptors based on patches and calculate a patch-by-patch similarity matrix. The feature maps refined by CWM are equally divided into four patches and then flattened to vectors for similarity calculation. The corresponding process for the patch feature descriptors is

F^p_weight = [f_{p1}, f_{p2}, f_{p3}, f_{p4}]^T ∈ R^{4×(CHW/4)},

where p ∈ {1, 2} indexes the feature map pertinent to the paired input images {I_1, I_2}, and f_{pi} ∈ R^{CHW/4} is the i-th feature patch vector of matrix F^p_weight. Conventional approaches to similarity measurement typically employ a predetermined metric like Euclidean distance to gauge the resemblance between image pairs. In light of the efficacy demonstrated by our proposed methodology, we utilize the cosine similarity to appraise the similarity across all feature map patches within our Similarity Measurement Module (SMM). This procedure is delineated as follows:

M_{ij} = (f_{1i} · f_{2j}) / (‖f_{1i}‖ ‖f_{2j}‖),

where M_{ij} is the calculated similarity corresponding to the patch descriptors f_{1i} and f_{2j}. The higher the result, the greater the similarity between the feature patches. Briefly, the entire similarity matrix is the 4 × 4-dimensional matrix calculated as

M = [M_{ij}] ∈ R^{4×4}, i, j ∈ {1, 2, 3, 4}.

The similarity calculation based on the similarity matrix can be divided into two stages: constructing a 4 × 4 similarity matrix based on feature block mapping and obtaining the overall similarity from the similarity matrix. Figure 4 displays the block-based similarity matrix between two images of the same location in different seasons and two images of different locations, where grayscale values in the similarity matrix image represent the degree of similarity. Specifically, if the feature blocks are completely identical, the similarity is highest, indicated by the white color; otherwise, it is black. As shown in Figure 4d, the diagonal values are generally higher than those of the off-diagonal, whereas the values in Figure 4e show a clear difference. This means that the proposed SMM can effectively distinguish positive and negative pairs. However, directly summing the diagonal values in the similarity matrix does not leverage this characteristic of the data distribution. To distinguish between closed-loop pairs and non-closed-loop pairs based on the characteristics of the data distribution, we normalize the element values of the similarity matrix to [0, 1] and integrate the distribution pattern of the diagonal of the similarity matrix to define the overall similarity score S of the image pair as

S = α ∑_{i=1}^{4} ω_i M_{ii},

where M_{ii} is the similarity value between two patches with the same index, α represents the probability that the image pair comes from the same location, and ω_i weights the diagonal elements. Detailed parameter settings are described below.
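The patch-by-patch cosine similarity matrix described above can be sketched in NumPy (a 2×2 spatial split of the weighted feature volume into four patches is assumed here; the feature values are random placeholders):

```python
import numpy as np

def patch_descriptors(F):
    # Split a (C, H, W) feature volume into a 2x2 grid of spatial patches
    # and flatten each one into a vector (four patch descriptors in total).
    C, H, W = F.shape
    h, w = H // 2, W // 2
    patches = [F[:, i * h:(i + 1) * h, j * w:(j + 1) * w].ravel()
               for i in (0, 1) for j in (0, 1)]
    return np.stack(patches)  # shape (4, C*h*w)

def similarity_matrix(F1, F2):
    # M[i, j] = cosine similarity between patch i of image 1 and patch j of image 2.
    P1, P2 = patch_descriptors(F1), patch_descriptors(F2)
    P1 = P1 / np.linalg.norm(P1, axis=1, keepdims=True)
    P2 = P2 / np.linalg.norm(P2, axis=1, keepdims=True)
    return P1 @ P2.T  # shape (4, 4)

rng = np.random.default_rng(0)
F1 = rng.random((256, 6, 9))
M = similarity_matrix(F1, F1)  # identical inputs: diagonal entries near 1
```

Splitting the feature volume (rather than the raw image, as in SAES) is the key design point: the overlapping convolutions have already mixed context across patch boundaries, so no information is truncated.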

Adaptive Parameter Weighting
Inspired by the insights of Metric Learning [45], we measure the similarity between images such that image descriptors of different categories are less similar and descriptors of the same category are more similar. To achieve this, we implement adaptive parameter weighting in the similarity measurement. First, we provide the analysis and definition of the enhancement factor α. Based on the previous analysis, if the values on the diagonal of the similarity matrix are significantly larger than the values on the off-diagonal, there is a high probability of encountering a loop closure. Therefore, we enhance the difference between the positive and negative similarity matrices to make them easier to distinguish. The α is designed to make the positive pair similarity close to 1 and the negative similarity score as small as possible. The possibility α can be defined as

α = 1 / (1 + e^{−d}),

where e is the natural constant and d represents the difference between the diagonal and off-diagonal values, based on the average value of the diagonal similarities S_dia and the off-diagonal similarities S_off in the matrix:

d = S_dia − S_off, with S_dia = (1/4) ∑_{i=1}^{4} M_{ii} and S_off = (1/12) ∑_{i≠j} M_{ij}.

In this case, if S_dia ≫ S_off, α is close to 1, meaning that the image pair is likely to be collected from the same location. Conversely, if α is close to 0, it strongly suggests that this image pair was captured from different places. Since false positives have a significant impact on the back-end optimization in vSLAM, we further improve the prediction accuracy of the method to avoid localization failures. When S_dia < S_off, we directly set α to 0 to avoid false-positive misjudgments in localization applications.
As for the parameter ω_i, it is designed to address the situation where the same place has distinct appearance variations at different sampling times. Since the environment is constantly changing over time, images collected from the same location at different times are likely to undergo local changes. If the average value of the main diagonal elements is used directly, the overall similarity value will be low. As shown in Figure 5, two images are taken from the same location but have apparent differences in the lower left patches due to pedestrian obstruction. Therefore, the diagonal value (M_22) computed by the corresponding patches is smaller than the other off-diagonal values; we refer to M_22 as an outlier and need to reduce its weight to minimize its impact on the overall similarity. In addition, when two images collected from different locations contain a large amount of textureless information (such as walls, sky, etc.), resulting in uniformly high element values in the similarity matrix, using the average value will yield a very high similarity score, leading to false-positive misjudgments. To handle perceptual aliasing and perceptual bias, it is necessary to assign different importance to elements according to the influence of the diagonal values on the overall similarity. The ω_i weight can be formulated as

γ_i = (1/6) ∑_{j≠i} (M_{ij} + M_{ji}), k_i = M_{ii} − γ_i, ω_i = k_i / ∑_{j=1}^{4} k_j,

and all the ω_i weights satisfy the following formula:

∑_{i=1}^{4} ω_i = 1,

where γ_i is the average similarity value of the off-diagonal elements corresponding to the i-th patch (M_{ii}), k_i represents the distance between the diagonal and average off-diagonal values, and ω_i is the final weight of the diagonal elements, utilized to reduce the influence of outliers in the overall similarity.
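Putting the enhancement factor and the diagonal weights together, the overall score computation can be sketched as follows. The sigmoid form of α and the weights proportional to k_i = M_ii − γ_i are our assumptions of the exact formulas, which the text describes only qualitatively:

```python
import numpy as np

def weighted_score(M):
    """Adaptive weighted similarity score S (sketch with assumed formulas)."""
    diag = np.diag(M)
    off_mask = ~np.eye(4, dtype=bool)
    S_dia, S_off = diag.mean(), M[off_mask].mean()
    if S_dia < S_off:
        return 0.0  # reject outright to avoid false positives
    alpha = 1.0 / (1.0 + np.exp(-(S_dia - S_off)))  # assumed sigmoid form
    # gamma_i: average off-diagonal similarity in row i and column i (6 values).
    gamma = np.array([(M[i].sum() + M[:, i].sum() - 2 * M[i, i]) / 6
                      for i in range(4)])
    k = np.clip(diag - gamma, 0.0, None)  # distance of each diagonal element
    omega = k / k.sum() if k.sum() > 0 else np.full(4, 0.25)
    return float(alpha * (omega * diag).sum())

loop = np.full((4, 4), 0.2) + np.eye(4) * 0.7   # strong diagonal: same place
alias = np.full((4, 4), 0.85)                   # uniformly high: perceptual aliasing
```

The two toy matrices illustrate the intended behavior: a dominant diagonal (a true loop with one occluded patch down-weighted by ω) scores higher than a uniformly bright matrix produced by textureless, aliased scenes.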

Training
As in SAES [35], we build the training dataset by sampling image pairs from the SPED_900 dataset [46]. The RGB images are resized to 240 × 320 × 3 before being fed into the network, and the ground truth is processed into a binary classification with a label space of 0 and 1. Our method is implemented in the PyTorch framework [47] on an Intel Xeon CPU E5-2678 v3 (2.50 GHz) and an NVIDIA GeForce TITAN Xp GPU. To speed up the training process, the parameters of AlexNet, pre-trained on the ImageNet dataset [48], are adopted to initialize our feature extraction module. Adam [49], with β_1 = 0.9 and β_2 = 0.99, is utilized as the optimizer to train the network for up to 70 epochs, and the batch size is 64. The initial learning rate is set to 0.001 and multiplied by 0.1 every 30 epochs. Early stopping is used to prevent the model from overfitting.
The LCD task is to determine whether an image pair is collected from the same location, resulting in either "loop closure" or "non-loop closure" outcomes. The model needs to ensure that the prediction for loop closure pairs is as close to 1 as possible and for non-loop closure pairs as close to 0 as possible. Therefore, this can be treated as a discrete binary classification problem, with the output value representing the probability of loop closure. Binary Cross Entropy (BCE) is adopted as the loss function to train the proposed method, which can be defined as

Loss = −(1/n) ∑_{i=1}^{n} [y_i log S_i + (1 − y_i) log(1 − S_i)],

where S_i and y_i represent the predicted similarity score and the label of the i-th image pair, respectively. y_i = 1 means that the image pair is taken from the same location. Then the error function can be simplified as Loss = −(1/n) ∑_i log S_i, and the loss will be large for a small S_i. Conversely, y_i will be 0 when encountering a negative pair collected from different places. The error calculated from the simplified function Loss = −(1/n) ∑_i log(1 − S_i) will be significant if S_i is large. BCE effectively evaluates how well the model's predictions match the ground truth.
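A small NumPy check of the BCE loss and its behavior on confident right versus wrong predictions (the scores and labels here are illustrative):

```python
import numpy as np

def bce_loss(S, y, eps=1e-12):
    # Binary cross entropy averaged over n image pairs.
    S = np.clip(S, eps, 1 - eps)  # numerical safety near 0 and 1
    return float(-np.mean(y * np.log(S) + (1 - y) * np.log(1 - S)))

S = np.array([0.9, 0.8, 0.2, 0.1])  # predicted similarity scores
y = np.array([1.0, 1.0, 0.0, 0.0])  # 1 = loop closure, 0 = non-loop closure
good = bce_loss(S, y)
bad = bce_loss(1 - S, y)  # the same scores flipped: confidently wrong
```

As the simplified forms above suggest, the loss stays small when S_i agrees with y_i and grows sharply for confident mistakes, which is exactly the penalty structure a binary LCD classifier needs.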

Datasets
To evaluate the performance of the proposed system under appearance-changing conditions for LCD, we chose three widely used datasets as the test set (i.e., the Gardens Point [50], Nordland [51], and St. Lucia [52] datasets). The details of these three datasets are shown in Table 2, and the appearance variations of some examples are shown in Figure 6. The details of the datasets are as follows: The Gardens Point dataset is collected on the campus of Queensland University of Technology while crossing the walkways during the day (along both sides) and at night (along the right side only). The day-right and night-right pairs are adopted as DAY and NIGHT to test performance under significant illumination variation. The ground truth is created manually according to location.
The Nordland dataset is produced from a TV documentary that chronicles a train journey, capturing the changing appearance of four seasons. After arranging images from different seasons according to their positions, the ground truth can be constructed from image indexes. Images taken in winter and summer (denoted as WINTER and SUMMER), which contain significant appearance variations, are selected as part of the test set.
The St. Lucia dataset is captured in the suburbs at five different times (8:45, 10:00, 12:10, 14:10, and 15:45), with significant appearance changes due to illumination. The ground truth is obtained from GPS logs. For the strongest contrast, SL0845 and SL1410 (collected at 8:45 and 14:10) are selected to form the test set.

Evaluation Metrics
Evaluation techniques typically focus on precision-recall metrics [53]. Precision is defined as the ratio of accurate matches (i.e., true positives (TP)) to the total number of system detections (i.e., true positives plus false positives (FP)). Recall denotes the ratio of true positives to the total ground truth (i.e., the sum of true positives and false negatives (FN)). They can be defined as follows:

Precision = TP / (TP + FP), Recall = TP / (TP + FN).

Based on this, we adopt the Precision-Recall (PR) curve to evaluate the relationship between these metrics. A curve closer to the top right indicates better performance.
In addition, two other evaluation metrics are also adopted here: (1) the mean average precision (mAP), which refers to the area under the PR curve:

mAP = ∫₀¹ P(r) dr,

where P(r) represents the PR curve and r is the recall; and (2) the maximum recall rate at 100% precision. Note that the area under the PR curve is positively correlated with the effectiveness of an LCD method.
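The metrics above can be sketched as follows (a simplified implementation: ties in the similarity scores are ignored, and the integral over P(r) is taken with the trapezoidal rule):

```python
import numpy as np

def pr_curve(scores, labels):
    """Precision and recall at every similarity threshold (descending scores)."""
    order = np.argsort(-scores)
    labels = labels[order]
    tp = np.cumsum(labels)          # true positives accepted so far
    fp = np.cumsum(1 - labels)      # false positives accepted so far
    return tp / (tp + fp), tp / labels.sum()

def mean_average_precision(precision, recall):
    """Area under the PR curve via the trapezoidal rule."""
    return float(np.sum((recall[1:] - recall[:-1])
                        * (precision[1:] + precision[:-1]) / 2))

def max_recall_at_full_precision(precision, recall):
    """Highest recall among operating points with 100% precision."""
    full = precision == 1.0
    return float(recall[full].max()) if full.any() else 0.0
```

A ranking that places one false positive among the matches lowers both the area under the curve and the recall achievable at perfect precision, which is exactly what the two reported metrics capture.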
To test SAES, we use the model trained on the SPED_900 dataset by its authors. CALC2.0 is trained on the COCO dataset [54], and the implementation used for the experiments is the same as described in its paper. As for Place-ResNet, the pre-trained models are utilized for the comparison. The MATLAB source of NetVLAD is available from [24] along with several sets of weights. The results of SeqSLAM [5], MCN [31], SMCN [32], CAE-VLAD-Net [25], Jin et al. [42], and Schubert et al. [37] presented in this section are taken from their published studies. For a fair comparison, the versions of the models chosen for the experiments match the methodology in our paper and largely follow the settings in their published papers.
We evaluate the model's effectiveness from multiple perspectives. First, the PR curves of the methods designed for appearance variations are compared, as shown in Figure 7. It can be seen that, as the similarity threshold increases, all recall rates show a slow downward trend while precision remains high. MetricNet achieves the most promising performance, as its PR curve lies closest to the upper right corner. This demonstrates that MetricNet has relatively strong generalization ability on test sets with significant appearance variations. Compared to SAES, the backbone of our method, the improvements of MetricNet are mainly due to the channel-wise attention distillation and the adaptively weighted similarity scores. Although our method achieves slightly lower precision as the recall rate increases beyond a certain threshold on the Gardens Point and St. Lucia datasets, MetricNet retains the advantage in terms of area under the PR curve. It is noteworthy that our method performs best on the Nordland dataset, highlighting the robustness of MetricNet to seasonal variations. We attribute this to the fact that the images in this dataset were taken along a fixed train track, which minimizes changes in viewpoint. Illumination variations have long posed a challenge in visual tasks; while our method exhibits some sensitivity to them on the St. Lucia dataset, it still achieves the best overall performance.
The corresponding results of the maximum recall rate at 100% precision are shown in Figure 8. It can be seen that MetricNet achieves the highest recall rate at 100% precision, reaching 43.97%, 44.49%, and 26.71% on the three datasets, respectively. Compared to the other approaches, MetricNet improves this metric by 10-30%, implying the robustness of MetricNet in reducing false positives for true loop closures. We conclude that improving the discrimination between positive and negative samples in the Similarity Measurement Module is helpful under such conditions. To quantitatively verify the performance of MetricNet against other competitive SOTA algorithms, we compare their mean average precision (area under the PR curve), as shown in Table 3. Compared to the most related method, SAES, MetricNet outperforms it by up to 11.92%, 18.10%, and 5.33% on the Gardens Point, Nordland, and St. Lucia datasets, respectively. This demonstrates that our feature refinement and weighted similarity modules can significantly enhance the accuracy of LCD. It also shows that MetricNet and SAES achieve the two best results on St. Lucia, which verifies the robustness of the patch-based LCD architecture. The remaining entries of Table 3 (mAP on Gardens Point / Nordland / St. Lucia) are: CALC2.0 [36]: 0.7284 / 0.6348 / 0.5869; Schubert et al. [37]: 0.6900 / 0.7500 / 0.6800; Place-ResNet [23]: 0.3048 / 0.8209 / 0.6860; NetVLAD [24]: 0.7433 / 0.5320 / 0.5660; SeqSLAM [5]: 0.4300 / 0.7233 / 0.2267; MCN [31]: 0.7400 / 0.8211 / 0.6241; SMCN [32]: 0.6500 / 0.5300 / –; CAE-VLAD-Net [25]: 0.8400 / 0.7500 / –. As for the other two models designed for changing environments, CALC2.0 [36] and Schubert et al.
[37] obtain reliable results using different technical solutions. By taking semantic and geometric information into account, CALC2.0 is able to perform well under changing lighting conditions. Schubert et al.'s work also proves the importance of clustering and PCA-based descriptor standardization. However, MetricNet yields more promising and stable results. It can be seen that our method performs well without adding extra perceptual information. Additionally, incorporating a similarity measurement inspired by metric learning further enhances the strengths of the proposed method.
Although the SOTA methods based on sequence-to-sequence matching, i.e., SeqSLAM [5], MCN [31], and SMCN [32], produce promising average precision by exploiting the sequential characteristic of robotic data streams, their results are inferior to those of MetricNet when it comes to long-term operation with significant appearance variations. The results of Place-ResNet [23], NetVLAD [24], and CAE-VLAD-Net [25] demonstrate their robustness to environmental changes to some extent. The learned feature representation of CAE-VLAD-Net is more effective than the simple CNN-based network used in our method, which we attribute mainly to its combination of convolutional neural networks and stacked auto-encoders. Jin et al. [42] propose a multi-tuplet clusters loss function together with a mini-batch construction scheme, which shows advantages when dealing with changing illumination. Nevertheless, our approach yields much better performance on the seasonal changes in Nordland.
We further conduct an ablation experiment to evaluate the effectiveness of the self-adaptive parameters. Note that the Pairwise baseline is achieved by a cosine-similarity-based comparison using deep image descriptors computed by AlexNet. MetricNet_ω is the ablation model with ω_i = 1/4. MetricNet_α is the version with a constant parameter α computed from d, where d is calculated by Equation (9). It can be seen that MetricNet delivers the best estimation on the Nordland and St. Lucia datasets and the third-best on Gardens Point. Compared to MetricNet_ω, the results of MetricNet_α are much closer to those of MetricNet, yet both yield lower average precision than MetricNet. This demonstrates that the parameters ω and α both have a significant impact on loop closure detection, and that a distinct ω_i for each M_ii makes the method more robust in distinguishing positive and negative pairs. Besides, compared with Pairwise, MetricNet improves average precision by up to 47.20%, 23.60%, and 29.30% on the Gardens Point, Nordland, and St. Lucia datasets, respectively.
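The role of the per-patch weights ω_i can be illustrated with a small sketch. The MetricNet_ω ablation fixes ω_i = 1/4 for every diagonal element M_ii; the exact adaptive weighting of the full model is not reproduced here, so a softmax over the diagonal serves purely as an illustrative stand-in:

```python
import numpy as np

def weighted_score(M, omega=None):
    """Fuse the diagonal of the 4x4 patch similarity matrix M into one score.

    With omega = [1/4, 1/4, 1/4, 1/4] this is the MetricNet_omega ablation;
    when omega is None, a softmax over the diagonal is used as a hypothetical
    stand-in for the adaptive weights of the full model.
    """
    diag = np.diag(M)
    if omega is None:
        e = np.exp(diag - diag.max())   # hypothetical adaptive weights
        omega = e / e.sum()
    return float(np.dot(omega, diag))
```

Passing uniform weights reproduces the plain mean of the diagonal, while data-dependent weights let strongly matching patches dominate the final similarity score.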

Evaluation of Feature Extraction Component
To evaluate the performance of the feature extraction module, we present an ablation study comparing MetricNet with the closely related patch-matching-based method [35], which directly divides each input image into four patches for further deep feature extraction. In this experiment, we name this ablation version MetricNet_direct.
The experimental results on the Nordland dataset are shown in Figure 9. The frames winter1 (Figure 9a) and summer1 (Figure 9b) are collected at the same place, while winter2 in Figure 9c is captured from a different location. The similarity matrices of winter1 and summer1 calculated by MetricNet_direct and MetricNet are shown in Figure 9d and Figure 9f, respectively. The similarity matrices of winter2 and summer1 calculated by MetricNet_direct and MetricNet are shown in Figure 9e and Figure 9g, respectively.
It can be seen that our feature refinement in the Channel Weighting Module significantly improves the performance of MetricNet: it helps the proposed method construct a more robust similarity matrix for the subsequent similarity score calculation, which in turn provides a more stable classification basis for the similarity measurement. Although both MetricNet_direct and MetricNet achieve promising performance, MetricNet enlarges the difference between positive and negative pairs. For positive pairs, the diagonal values of MetricNet's similarity matrix are much larger than the off-diagonal ones; for negative pairs, the values of the similarity matrix tend to be more random. We can conclude that MetricNet makes the gap between positive and negative pairs more prominent, making the inputs easier to distinguish. To demonstrate this conclusion more clearly, we compare the differences between the means of the diagonal and off-diagonal values (calculated by Equation (9)) in the positive-pair and negative-pair similarity matrices. The differences d generated by MetricNet and MetricNet_direct are denoted d_1 and d_2, respectively. We plot the probability density distribution of ∆ = d_1 − d_2 for positive and negative pairs on each dataset. As shown in Figure 10, the horizontal axis represents the similarity difference ∆, and the vertical axis represents the corresponding probability density. The probability density function can be expressed as

f(∆) = (1 / (σ√(2π))) exp(−(∆ − µ)² / (2σ²)),

where µ and σ denote the mean and standard deviation of ∆, respectively. The area under the curve represents the probability of ∆. When ∆ > 0, we fill the area under the curve with orange, otherwise with medium violet red. As shown in Figure 10, on the positive pairs from all three datasets, the similarity differences obtained by MetricNet are mostly larger than those obtained by MetricNet_direct. This means that MetricNet's weighted similarity matrix makes the image descriptors more distinguishable. Besides, for the negative pairs, MetricNet tends to generate similarity matrices with more random diagonal and off-diagonal values, where the difference between these elements is evidently smaller.
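The quantity d compared above can be sketched directly from a patch similarity matrix (assuming, per the text, that Equation (9) is the mean diagonal value minus the mean off-diagonal value):

```python
import numpy as np

def diag_offdiag_gap(M):
    """d from Equation (9): mean diagonal minus mean off-diagonal entry.

    For a positive pair the patch similarity matrix should be diagonally
    dominant, so d is large; for a negative pair the entries are closer to
    random and d shrinks toward zero.
    """
    n = M.shape[0]
    diag_mean = np.trace(M) / n
    off_mean = (M.sum() - np.trace(M)) / (n * n - n)
    return float(diag_mean - off_mean)
```

Comparing d for matrices produced by two models on the same image pair gives the ∆ = d_1 − d_2 statistic plotted in Figure 10.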
All these comparisons demonstrate that attention-based channel weighting, together with a similarity matrix weighted by relative value relations, improves the discriminability of descriptors for positive pairs and reduces the differences among the similarity matrix values for negative pairs. Moreover, the visual information at the boundaries of each patch is integral to feature extraction: using overlapping convolution kernels to retain contextual information at the boundaries of feature blocks is more effective than directly segmenting the original image.

Evaluation of Similarity Measurement Component
In this section, another ablation study is conducted to evaluate the proposed similarity measurement component. We utilize three conventional distance metrics, i.e., cosine, Euclidean, and average similarity (AVE-SIMI) distance, to replace the adaptive weighted similarity matrix in MetricNet. In this experiment, the cosine and Euclidean distances are computed directly on the complete image features, without dividing them into patches, to obtain the similarity. As for the AVE-SIMI version, all the similarity matrix elements are directly averaged for the final measurement.
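The three baseline measurements can be sketched as follows (the mapping of Euclidean distance into a (0, 1] similarity is one common choice, not necessarily the exact one used in the experiments):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two whole-image descriptors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_similarity(a, b):
    """Map Euclidean distance into (0, 1], larger meaning more similar."""
    return float(1.0 / (1.0 + np.linalg.norm(a - b)))

def ave_simi(M):
    """AVE-SIMI baseline: average every element of the patch similarity matrix."""
    return float(M.mean())
```

Unlike the adaptive weighting in MetricNet, all three baselines treat every feature dimension or matrix element equally, which is exactly the contrast the ablation probes.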
The PR curves on the three datasets (i.e., Gardens Point, Nordland, and St. Lucia) are shown in Figure 11. It can be observed that both MetricNet and the ablation models provide promising precision at low recall rates; however, MetricNet performs better when the recall rate is high. It can be concluded that MetricNet correctly detects more true loops than the other versions, which reveals the effectiveness of the proposed similarity measurement. We conclude that this is mainly because the similarity between images is reflected in both the absolute distance and the direction between the feature descriptors, and the lack of spatial information in the features is compensated by constructing the similarity matrix with adaptive weighting in MetricNet. Figure 12 shows the maximum recall rate at 100% precision. On all datasets, MetricNet obtains the highest recall rate at 100% precision, while the Euclidean-distance-based method performs worst. These experimental results reveal that MetricNet yields fewer false loop detections.
The mean average precision of MetricNet and the ablation methods is shown in Table 4 and Figure 13. The proposed method achieves the best mAP on all three datasets, which indicates the considerable reliability of MetricNet in handling scenes with significant appearance variations.
The above comparisons emphasize the value of optimizing the similarity measurement of LCD networks. Comparing the Euclidean- and cosine-based methods shows that fully utilizing the image information to construct a similarity matrix partially compensates for the lack of global spatial cues from feature extraction. Meanwhile, with the adaptive weighting mechanism, MetricNet enhances the similarity gap between positive and negative pairs. MetricNet fully considers the contributions of the various features to the overall similarity calculation, which is reflected in the comparisons with AVE-SIMI.

Results for Distinct Viewpoint Variations
Although MetricNet is an LCD model designed to handle test scenes with significant appearance variations, we also introduce two additional open datasets (Oxford5K [55] and Paris6K [56]) to verify the effectiveness of the proposed method. Oxford5K contains 5062 images of buildings from 11 landmarks in Oxford; each landmark has 5 query instances, yielding 55 query groups. Paris6K contains 6412 images of Paris landmarks, classified into 12 categories. Both exhibit significant variations in viewpoint and appearance, as shown in Figure 14. The images in these two datasets are not collected sequentially and are usually used for image retrieval rather than localization tasks; therefore, sequence-based LCD algorithms cannot work under this condition. Since it would be unfair to adopt state-of-the-art algorithms specifically designed for viewpoint variations, we compare the proposed method with related works, such as CBE [13], SGH [14], ITQ [12], LPM [18], UBEF [15], and SMVF-CVT [19], as shown in Table 5.
Despite the dramatic viewpoint variations of the Oxford5K and Paris6K datasets, the proposed methods still achieve reliable performance compared to the other algorithms. The results of SMVF-CVT, which is designed for instance-level image retrieval, demonstrate that utilizing multi-view fusion features can help improve image matching performance. It can also be seen that MetricNet_3×3 is more effective on testing datasets with significant viewpoint variations, obtaining slightly higher mean average precision than the original version on both Oxford5K and Paris6K. We attribute this to splitting images into more patches, which enables the network to focus on relations between more refined feature sub-regions, thereby enhancing the model's robustness to significant viewpoint variations.
Although our network produces relatively reliable detection results, it is still not comparable to specialized methods for image retrieval tasks. We believe this limitation arises because the algorithm relies solely on a single image without considering additional perceptual information such as semantics, which limits the validity of the proposed method in the face of viewpoint variations. Moreover, the patch-to-patch design of MetricNet is not adequate for filtering and selecting the most relevant regions.

Application of Loop Closure Detection
To validate the practical effectiveness of the proposed method in the vSLAM localization problem, we integrate MetricNet with a localization method and use its estimations to optimize the VO's prediction. To ensure the reasonableness of this experiment, we only perform the optimization by adding loop closure constraints to trajectories where MetricNet predicts the existence of closed loops. We use one of the most influential outdoor VO/SLAM benchmarks, KITTI [57], as the experimental dataset. Following the commonly used train/test split in deep VO, we adopt Sequences 00, 02, 08, and 09 of the KITTI dataset to fine-tune MetricNet and utilize Sequences 05 and 06 for evaluation. ContextAVO [58] is adopted as the visual front-end, with pose graph optimization implemented in g2o [59] as the back-end.
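The effect of a loop closure constraint in the back-end can be illustrated with a deliberately simplified 1-D sketch (this is not g2o's API; real pose graph optimization works on SE(3) poses with covariance-weighted edges):

```python
import numpy as np

def correct_drift(poses, loop_i, loop_j):
    """Toy 1-D illustration of a loop closure constraint.

    If MetricNet matches frame loop_j back to frame loop_i, the two poses
    should coincide; the accumulated drift between them is spread linearly
    over the intermediate poses, mimicking what pose graph optimization
    achieves in a principled way.
    """
    drift = poses[loop_j] - poses[loop_i]
    corrected = poses.copy()
    span = loop_j - loop_i
    for k in range(loop_i, len(poses)):
        frac = min(k - loop_i, span) / span   # 0 at loop_i, 1 from loop_j on
        corrected[k] -= frac * drift
    return corrected
```

In the actual pipeline, each loop pair predicted by MetricNet adds an edge between the corresponding pose vertices in the g2o graph, and the optimizer redistributes the accumulated error across the trajectory.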
The visualization results are illustrated in Figure 15. It can be seen that the trajectories optimized based on the associations from loop closure detection effectively reduce cumulative error and drift. Despite some inconsistent information between query and matched images, e.g., dynamic objects and occlusions, the predictions of MetricNet remain promising. This suggests that MetricNet is able to provide correct and effective associations for place recognition in localization applications.

Computational Performance
The average computational cost of the two proposed methods (i.e., MetricNet and MetricNet_3×3) and SAES [35] is compared in the following experiments. We measure the computational cost of all module processes: (1) the feature extraction process of the neural networks, (2) the construction of the similarity matrices, and (3) the similarity calculation between image pairs. The Nordland dataset is utilized here because of its relatively long length. An Intel Xeon CPU E5-2678 v3 (2.50 GHz) and an NVIDIA GeForce Titan XP GPU are used.
As shown in Table 6, the feature extraction component takes more execution time than the other two processes, mainly because of the large number of computations in deep neural networks. Besides, due to the channel weighting calculation, our methods take more time in feature extraction. Compared with SAES, the two proposed methods greatly reduce the computational cost of constructing the similarity matrix and calculating the similarity score. Moreover, MetricNet outperforms MetricNet_3×3 by up to 4.4% and 5.0% in these two processes, respectively. Though both proposed methods are effective, MetricNet proves more efficient. Note that the parameters and GFLOPs of MetricNet are 2.47 M and 1.03 G, respectively.
In loop closure detection, it is important to calculate the similarity score in real time. As the number of previous images in the database increases, the computational cost of pairwise image matching gradually grows. Therefore, we further test the algorithms' computational performance as the frame index rises. As shown in Figure 16, as the database grows, MetricNet achieves higher efficiency and better real-time performance.

Conclusions
This paper proposes a novel LCD framework, MetricNet, for dramatic appearance variations, such as illumination, seasonal changes, and dynamic interference. In the proposed method, the feature extraction and similarity measurement components are trained and deployed in an end-to-end manner. The network takes the current and previous image pair as input and extracts high-dimensional visual features using the AlexNet-based feature extraction network. The self-attention Channel Weighting Module is designed to refine the image descriptors to preserve discriminative cues. The similarity score is then adaptively weighted in the Similarity Measurement Module based on the feature relationships of the patch-by-patch similarity matrix, which further improves detection performance. Extensive experiments on various datasets validate the appealing precision, generalization ability, and dependability of MetricNet in scenes with significant appearance and even viewpoint variations. Compared to many state-of-the-art learning-based methods, MetricNet improves average accuracy by 7.51%, 3.03%, and 5.33%, respectively.
Although our model achieves feasible results in LCD, MetricNet still has shortcomings in dealing with sequential input and with image retrieval problems involving significant viewpoint variations. In addition, MetricNet currently only accepts images of a fixed size; images of other sizes need to be cropped or resized before entering the network, which may discard image information relative to the original and degrade loop closure accuracy. In the future, we will further optimize the proposed method by jointly considering semantic, appearance, and geometric information; with this environmental understanding, LCD methods will provide a higher degree of invariance to environmental changes. To address the fixed-size input, we intend to introduce spatial pyramid pooling (SPP) and various feature fusion methods to improve input adaptability and image feature representation. Furthermore, to achieve long-term operation in LCD, incremental learning technologies will be investigated in our future network.

Figure 1. The pipeline of the proposed MetricNet.

Figure 2. Comparison results of different pre-trained networks.

Figure 3. Visualization of features extracted from different layers of AlexNet.

Figure 4. The input images and the corresponding similarity matrices. The winter1 in (a) and summer1 in (b) form a positive pair, while summer1 in (b) and winter2 in (c) form a negative one. (d) Similarity matrix of winter1 and summer1. (e) Similarity matrix of summer1 and winter2.

Figure 5. The frames captured at the same place with apparent appearance variations and the corresponding similarity matrix. (a) Image captured during the day. (b) Image captured at night. (c) Similarity matrix of (a) and (b).

Figure 6. The appearance variations of sample images from the Gardens Point, Nordland, and St. Lucia datasets.

Figure 7. The PR curves of different networks on the (a) Gardens Point, (b) Nordland, and (c) St. Lucia datasets.

Figure 8. The maximum recall rate at 100% precision of different networks.

Figure 9. The similarity matrices obtained by MetricNet_direct and MetricNet. (d) Similarity matrix of winter1 and summer1 obtained by MetricNet_direct. (e) Similarity matrix of winter2 and summer1 obtained by MetricNet_direct. (f) Similarity matrix of winter1 and summer1 obtained by MetricNet. (g) Similarity matrix of winter2 and summer1 obtained by MetricNet.

Figure 10. The probability density distribution of ∆, where ∆ = d_1 − d_2 (d_1 and d_2 are the differences d in Equation (9) calculated by MetricNet and MetricNet_direct, respectively).

Figure 11. The PR curves of different distance measurements on the (a) Gardens Point, (b) Nordland, and (c) St. Lucia datasets.

Figure 12. The maximum recall rate at 100% precision of different measurement methods.

Figure 13. The mean average precision comparison of different measurement methods.

Figure 14. The dramatic viewpoint variations of sample images from the Oxford5K and Paris6K datasets.

Figure 15. Trajectories before and after optimization based on the results of MetricNet (left), and examples of place recognition retrieval results for queries on the KITTI dataset (right). The yellow boxes outline the information that changed between the query and matched images.

Figure 16. Comparison of the computational cost of MetricNet and SAES in loop closure detection.

Table 1. Details of the feature extraction module.

Table 2. Description of the Testing Loop Closure Detection Datasets with Changing Environmental Conditions.

Table 3. The mAP Comparison of Different Networks.

Table 4. The mAP Comparison of Different Measurement Methods.
The best performance is in bold and the second best is underlined.

Table 5. The mAP Comparison of Different Networks.

Table 6. Computational Performances on the Nordland Dataset.