A Holistic Visual Place Recognition Approach using Lightweight CNNs for Significant ViewPoint and Appearance Changes

This paper presents a lightweight visual place recognition approach, capable of achieving high performance with low computational cost, and feasible for mobile robotics under significant viewpoint and appearance changes. Results on several benchmark datasets confirm an average boost of 13% in accuracy, and 12x average speedup relative to state-of-the-art methods.


I. INTRODUCTION
Given a query image, an image retrieval system aims to retrieve all images from a large database that contain objects similar to those in the query image. Visual Place Recognition (VPR) can also be interpreted as an image retrieval system that tries to recognize a place by matching it with the places in a stored database [1]. A place database is the simplest way to represent a particular environment: appearance-based information is stored as images with no pose-related data. Other VPR techniques use topological maps, which contain relative information about the places in an environment (for example, an ordered collection of images), or metric maps, which are more accurate in terms of the absolute scale of the environment (such as distances and landmark positions) but are difficult to build and maintain. The VPR community employs two image matching techniques: 1) single-image matching and 2) image-sequence matching. This paper focuses on a database-centric place remembering approach coupled with single-image matching; place recognition is thus based solely on appearance similarity, and image retrieval techniques are applicable [2].
As with a range of other computer vision applications, Convolutional Neural Networks (CNNs) have shown promising results for VPR and have shifted the focus from traditional hand-crafted feature descriptors [3] [4] to CNNs [5][6] [7]. Using a pre-trained CNN for VPR, there are three standard approaches to produce a compact image representation: (a) the entire image is fed directly into the CNN and responses from the convolutional layers are extracted [5]; (b) the CNN is applied to user-defined regions of the image and prominent activations are pooled from the layers representing those regions [6]; (c) the entire image is fed into the CNN and salient regions are identified by directly extracting distinguishing patterns from the convolutional layers' responses [7] [8]. Generally, global image representations from category (a) are not robust against strong viewpoint variations and partial occlusion. Image representations from category (b) usually handle viewpoint changes better but are computationally intensive. Image representations from category (c), on the other hand, address both appearance and viewpoint variations. In this paper, we focus on category (c).
The works in [7] and [8] are considered state of the art in identifying prominent regions by directly extracting unique patterns from the convolutional layers' responses. Chen et al. in [7] used the VGG-16 network [9] pre-trained on ImageNet [10] and used late convolutional activations for region identification. For region-based feature encoding, a 10k bag-of-words (BoW) [11] codebook is employed. The system is tested on five benchmark place recognition datasets with the area under the Precision-Recall curve (AUC-PR) [12] as the evaluation metric. It is reported to outperform FABMAP [13], SEQSLAM [14] and other image retrieval pooling techniques including Cross-Pooling [15], Sum/Average-Pooling [16] and Max-Pooling [17].
Despite its good AUC-PR performance, the method proposed in [7] has some shortcomings. A common strategy for improving CNN accuracy is to make the network deeper by adding more layers (provided sufficient data and strong regularization). However, increasing the network size means more computation and more memory at both training and test time (for example, for storing the outputs of intermediate layers and the parameters), which is not ideal for resource-constrained robots that are usually battery-operated. Firstly, using a 10k BoW dictionary to encode region-based features (extracted from late convolutional layers of the deep VGG-16), followed by their cross-matching, degrades real-time performance. Secondly, employing the object-centric deep VGG-16 model makes the system put more emphasis on objects than on the place itself. This is reflected in the region-based pooled features and leads to failure cases. Finally, the regional approach proposed in [7] hinders the identification of individual static place-centric regions that can be more effective under condition and viewpoint variations.
To bridge these research gaps, this paper proposes a holistic approach targeted at a CNN architecture comprising a small number of layers, pre-trained on a scene-centric image database [18] to reduce memory and computational costs. The proposed method detects novel CNN-based regional features and combines them with VLAD [19], adapted specifically for the VPR problem. The motivation for employing VLAD is its better performance than BoW [11] in various CNN-based image retrieval tasks while using a smaller visual word dictionary [19][20]. To the best of our knowledge, this is the first work that combines novel lightweight CNN-based regional features with VLAD encoding adapted for computationally efficient VPR under changing environments.
As opposed to [7], which uses the object-centric VGG-16 architecture and a cross-convolution based regional extraction approach (resembling [15]), the proposed VPR technique differs in both the identification and the extraction of regional features (discussed in detail in Section III-B). The approach presented in this paper achieves enhanced accuracy by employing a middle convolutional layer of a CNN architecture comprising a small number of layers. Evaluation on several viewpoint- and condition-variant benchmark place recognition datasets shows an average performance boost of 13% over state-of-the-art VPR algorithms in terms of the area under the Precision-Recall curves (AUC). In Figure 1, for a query image (a), our proposed system retrieved image (c) from the stored database; (b) and (d) highlight the salient regions that our methodology identified under strong visual changes.
The rest of the paper is organized as follows. Section II provides the related work for VPR and other image retrieval tasks. In Section III, the proposed methodology is presented in detail. Section IV illustrates the implementation details and performance evaluation of the proposed VPR framework on several benchmark datasets. Section V presents the conclusion.

II. LITERATURE REVIEW
This section provides an overview of major developments in VPR under simultaneous viewpoint and appearance changes using handcrafted features and CNN-based features. Other image retrieval tasks with their feature extracting and encoding approaches are further discussed and differentiated from VPR based image retrieval tasks.
FAB-MAP [13] is the first work that used handcrafted SURF feature descriptors combined with BoW encoding for VPR. It demonstrated robustness under viewpoint changes by taking advantage of the invariance properties of SURF. A sequence-based image matching technique, SEQSLAM [14], has shown remarkable performance under severe appearance changes. However, it is unable to deal with simultaneous condition and viewpoint variations.
The first CNN-based VPR system was introduced in [5], followed by [21], [6] and [22]. Chen et al. in [5] used Overfeat [23] trained on ImageNet. The Eynsham [13] and QUT datasets, with multiple traverses of the same route under environmental changes, are used for benchmarking. Test images are matched against the reference images using the Euclidean distance on the pooled layers' responses. On the other hand, the authors in [22] and [6] used landmark-based approaches coupled with pre-trained CNN models. Chen et al. in [24] introduced two CNN models for the specific task of VPR (named AMOSNet and HybridNet), obtained by training and fine-tuning the object-centric CaffeNet [10] on the 2.5 million image Specific PlacEs Dataset (SPED). The place-recognition centric SPED consists of thousands of places with severe condition variance among the same places over different times of the year. The results showed that, with Spatial Pyramidal Pooling (SPP) employed on middle and late convolutional layers, HybridNet outperformed AMOSNet, CaffeNet and PlaceNet on four publicly available datasets exhibiting strong appearance and moderate viewpoint changes [24].
Chen et al. in [7] presented a VPR approach that identifies pivotal landmarks by directly extracting prominent patterns from the responses of late convolutional layers of the deep object-centric VGG-16 model. Recently, Chen et al. in [8] introduced a context-flexible attention model and combined it with a pre-trained object-centric VGG-16 fine-tuned on SPED [24] to learn more powerful condition-invariant regional features. The system has shown state-of-the-art performance on severely condition-variant datasets. However, the efficiency of the framework may be compromised under simultaneous strong viewpoint and condition variations. Moreover, performance and efficient resource usage are two important aspects to consider in real-life robotic VPR applications.
Image retrieval tasks rely either on handcrafted features, such as local SIFT and SURF features [3] [4], or on a combination of these with the convolutional and fully connected layers of deep/shallow CNNs [2][25] [5]; Bag-of-Words (BoW) or Support Vector Machines (SVM) [26] are then employed for classification, detection and recognition [17] [15]. As alternatives to the BoW feature encoding scheme, several other approaches, including the Fisher vector [27] and the Vector of Locally Aggregated Descriptors (VLAD), have shown promising results with smaller visual word vocabularies [19]. To perform instance-level image retrieval, where objects from the same category are to be separated, Yue-Hei Ng et al. in [25] suggested combining the spatially rich features of middle convolutional layers with VLAD encoding. Kim et al. in [28] used MSER [29] for region identification, coupled with SIFT feature description within the identified regions, and described each region/bundle as a fixed-size VLAD, named PBVLAD. 2D-based localization methods generally offer efficient database management at the cost of lower accuracy, whereas 3D-based techniques are computationally complex but more reliable in localization. Sattler et al. in [30] refute this notion by combining 2D-based approaches with SfM-based post-processing, showing better performance than structure-based methods. However, such post-processing takes significantly longer run-times and is out of the scope of this work, since our proposed VPR system works as a 2D-based framework with the aim of improving retrieval performance while reducing computational complexity.
Several feature pooling techniques employed in deep CNNs, including Sum-Pooling [16], Max-Pooling [17], Spatial Max-Pooling [31] and Cross-Pooling [15], have demonstrated a performance boost in tasks requiring image classification/recognition and object detection/retrieval [17] [15]. All these pooling approaches process the convolutional layers' feature maps as a whole to pick prominent patterns; when images focus on a few objects, the feature maps are sparse and finding a single region of interest becomes relatively easy. However, such image retrieval tasks differ in nature from VPR, where a place undergoes diverse changes due to illumination, winter-summer transitions or viewpoint variance introduced by different capturing angles. Recognition is challenging because the appearance of the place changes, which makes it difficult to identify the common regions. For VPR, when CNNs pre-trained for such external tasks [10] are integrated with the above-mentioned feature pooling techniques, the convolutional layers' feature maps focus on the trained objects, such as vehicles, pedestrians and other time-varying objects, which are not suitable for place recognition [7]. Therefore, it is still questionable whether a generic VPR system can efficiently deal with simultaneous viewpoint and condition variations when employing CNN-based local features pre-trained on other image retrieval tasks.
Recently, Teichmann et al. in [32] trained the landmark detectors [33][17] with a newly introduced 1.2M Google Landmark dataset (GLD) containing 15k landmark categories (such as buildings, monuments and bridges) annotated by humans. Noticing that not all visual words get associated with feature descriptors, which results in many zero regional residuals, they proposed the R-VLAD technique, which overcomes this by normalizing the regional residuals [32]. Precisely, it down-weights all the regional residuals and stores a single aggregated regional descriptor per image. Custom landmark detectors including ASMK [34], RMACB [33], RMAC [17] and selective search [35] are incorporated for the regional search and coupled with R-VLAD on deep CNNs. We expect that integrating R-VLAD [32] could further boost our proposed VPR framework. Chen et al. in [8] have shown that state-of-the-art region-based image retrieval techniques, including Attentive Attention [36] and Fixed Context [37], are generally not efficient for VPR under strong visual changes.

III. PROPOSED TECHNIQUE
In this section, the key steps of the proposed methodology are described in detail. It starts from the idea of stacking activations of feature maps for retrieving local descriptors, followed up with the identification of distinguishing regional patterns. It then illustrates the aggregation of local feature descriptors lying under those identified salient regions. Finally, it shows how to retrieve the compact VLAD representation using the extracted CNN-based regional features, later used for determining a match between two images. The workflow of the proposed methodology is shown in Figure 2.

A. Stacking of Convolutional Layer Activations for making Descriptors
Given an image I as input to the CNN model, the output at a certain convolutional layer is a 3D tensor M of X × Y × K dimensions, where K denotes the number of feature maps and X and Y represent the width and height of each feature map (channel). Equivalently, M_k is the set of X × Y activations (responses) of the k-th feature map, where k ∈ {1, 2, ..., K}. For the K feature maps of the convolutional layer, we stack the activations at each spatial location into a K-dimensional local feature, as shown with different colours in Figure 2 (c). D^L in (1) represents the K-dimensional feature descriptors d_l at the L-th convolutional layer of the model m_c.
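The stacking step can be illustrated with a minimal Python sketch. It assumes the layer output is already available as K nested lists of size X × Y; the function name `stack_descriptors` and the toy values are illustrative only and are not taken from the original implementation.

```python
def stack_descriptors(feature_maps):
    """Stack the K activations found at each spatial location into one descriptor.

    feature_maps[k][x][y] is the activation of the k-th feature map at (x, y).
    Returns a dict mapping each location (x, y) to its K-dimensional descriptor.
    """
    K = len(feature_maps)
    X = len(feature_maps[0])
    Y = len(feature_maps[0][0])
    return {
        (x, y): [feature_maps[k][x][y] for k in range(K)]
        for x in range(X)
        for y in range(Y)
    }


# Toy layer output (hypothetical values): K = 2 feature maps of size 2 x 2.
toy_maps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
descriptors = stack_descriptors(toy_maps)
```

Each of the X × Y locations yields one descriptor, so the toy layer above produces four 2-dimensional descriptors.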

B. Identification of Regions of Interest
To extract region-based CNN features, the most prominent regions need to be identified. Two or more activations are considered connected, and represented as a region, if they are neighbours and have approximately the same value. For the K feature maps, each region is denoted by G_h, ∀ h ∈ {1, ..., H}, where H is the total number of regions identified at the L-th convolutional layer, as visualized in Figure 2; (3) denotes these as the novel regions R^L at the L-th convolutional layer. Figure 3 illustrates the top N = {50, 200, 400} novel R^L regions identified by our proposed region-based VPR system. The regions identified by our CNN-based approach concentrate strongly on static objects including buildings, trees and road signals. The local descriptors D^L in (1) that fall under the bounding boxes of the R^L regions in (3) are aggregated in (4) to retrieve the CNN-based regional features. Intuitively, each regional feature is a 1 × K dimensional vector f_t, aggregated from the descriptors D^L_q that fall under the t-th region R^L_t. For N novel regions, (5) represents the N × K dimensional region-based CNN features F^L that represent an image at the L-th convolutional layer (shown visually in Figure 2 (e) / Figure 2 (f)).
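A rough sketch of the region identification and aggregation described above is given below. Several details are assumptions made only for illustration: 4-connectivity, an absolute tolerance `tol` for "approximately the same value", ignoring non-positive activations, using the mean activation as the region energy, and summing descriptors inside each bounding box. All function names are hypothetical.

```python
from collections import deque


def find_regions(fmap, tol=0.1):
    """Group neighbouring positive activations whose values differ by at most tol."""
    X, Y = len(fmap), len(fmap[0])
    seen = [[False] * Y for _ in range(X)]
    regions = []
    for sx in range(X):
        for sy in range(Y):
            if seen[sx][sy] or fmap[sx][sy] <= 0:
                continue
            seen[sx][sy] = True
            queue, cells = deque([(sx, sy)]), []
            while queue:
                x, y = queue.popleft()
                cells.append((x, y))
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < X and 0 <= ny < Y and not seen[nx][ny]
                            and fmap[nx][ny] > 0
                            and abs(fmap[nx][ny] - fmap[x][y]) <= tol):
                        seen[nx][ny] = True
                        queue.append((nx, ny))
            if len(cells) >= 2:  # a region needs two or more activations
                regions.append(cells)
    return regions


def regional_features(feature_maps, top_n, tol=0.1):
    """Return up to top_n K-dimensional features, most energetic regions first."""
    K = len(feature_maps)
    scored = []
    for fmap in feature_maps:
        for cells in find_regions(fmap, tol):
            energy = sum(fmap[x][y] for x, y in cells) / len(cells)
            xs = [x for x, _ in cells]
            ys = [y for _, y in cells]
            scored.append((energy, (min(xs), min(ys), max(xs), max(ys))))
    scored.sort(key=lambda item: item[0], reverse=True)
    features = []
    for _, (x0, y0, x1, y1) in scored[:top_n]:
        # Sum the stacked K-dimensional descriptors inside the bounding box.
        f = [0.0] * K
        for x in range(x0, x1 + 1):
            for y in range(y0, y1 + 1):
                for k in range(K):
                    f[k] += feature_maps[k][x][y]
        features.append(f)
    return features


# Toy example (hypothetical values): two 3 x 3 feature maps, one region each.
toy_maps = [
    [[1.0, 1.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 2.0, 2.0]],
]
F = regional_features(toy_maps, top_n=2)
```

In this toy case the region in the second map has the higher mean energy, so its feature comes first in `F`.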
In comparison, the authors in [7] first identified regions, calculated their mean energies and selected the N = 200 most energetic regions. Precisely, the N regional activations at the L-th convolutional layer were mapped onto the (L−1)-th convolutional feature maps, and the modified cross-mapped region-based local descriptors at the (L−1)-th convolutional layer were aggregated for feature extraction. Note that the bounding box (area) of each cross-mapped regional feature varies for [7], depending on the number of activations per ROI at the L-th convolutional layer and on the receptive field of the filter (e.g. 3 × 3, 5 × 5) used for cross-mapping the L-th layer regions onto the (L−1)-th layer. Furthermore, Figure 4 illustrates that the ROIs identified from two feature maps (M_1 and M_2) at the L-th convolutional layer with Region-VLAD and with Cross-Region-BoW [7] differ in number and in size (activations per region). Thus, the regional mean energies computed by [7] differ from the mean energies of the regions identified by our approach. Our approach identifies 36 and 40 ROIs from feature maps M_1 and M_2 respectively, shown with different colours. Later, based on their computed mean energies, the top N energetic regions are selected from the H regions identified at the L-th convolutional layer, as visualized in Figure 3. The 8-connected component-based regional approach in Cross-Region-BoW [7] identifies 6 and 4 yellow-coloured ROIs for feature maps M_1 and M_2 respectively. As explained above, the extraction of N energetic regional features for [7] is carried out by first selecting the N energetic regions at the L-th layer (Figure 4), then mapping them onto the (L−1)-th convolutional layer and aggregating the cross-mapped region-based local descriptors at that layer (not shown in the figure). Examples of the regions identified by Cross-Region-BoW [7] and by our proposed Region-VLAD framework are shown in Figure 5.
We observe that regional patterns covering larger areas, as in [7], hinder the identification of individual place-centric instances that are vital for recognizing places under changing conditions and viewpoints.

C. Regional Vocabulary and Extraction of VLAD for Image Matching
The Vector of Locally Aggregated Descriptors (VLAD) adopts K-means [11] based vector quantization, accumulates the residues of the features quantized to each dictionary cluster and concatenates the accumulated vectors into a single feature representation. A separate dataset of 2.6k images is collected, and the region-based feature extraction described above is employed to generate a regional vocabulary. To learn a diverse vocabulary, we employed 1125 place-recognition centric images of 365 places from Query247 [38] (taken at day, evening and night times). Other images come from the benchmark place recognition dataset St. Lucia [24], with 1k frames of two traverses captured in a suburban environment at multiple times of the day. The remaining images consist of multiple viewpoint- and condition-variant traverses of urban and suburban routes collected from Mapillary (footnote 1), previously employed by [6] and [7] for capturing place recognition datasets. K-means is employed to cluster the 2600 × N × K dimensional regional features into V regions, such that o_u in (6) represents the u-th region of the codebook C^L.
Using the learned codebook, the F^L regions of the benchmark test / reference traverses are quantized in (7) to predict the clusters or labels Z^L, where α is the quantization function. Employing the region-based features F^L, the predicted labels Z^L and the regional codebook C^L, the summed residue v_u corresponding to each u-th region can be retrieved using (8).
In (8), for all the F^L regional features that fall in the u-th region of the codebook C^L, the residues between the F^L_u regions and the centre of the codebook region C^L_u are summed. Sometimes, a few regions/words appear more frequently in an image than statistically expected, which is known as visual word burstiness [39]. To counter it, the standard technique of power normalization [40] is performed in (9), where each 1 × K dimensional residue v_u undergoes the non-linear transformation γ. In (10), power normalization is followed by l2 normalization. For each image, the l2-normalized residues corresponding to the V regions are stored in (11).
Fig. 5. Sample images of ROIs identified with Cross-Region-BoW [7] and Region-VLAD are shown here. Our regional approach subdivides each image into a large number of the most contributing regional blocks.
Footnote 1: https://www.mapillary.com/
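The encoding steps (7) to (11) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions rather than the original implementation: the quantization assigns each feature to its nearest centre in Euclidean distance, the power normalization uses a signed power with a hypothetical exponent `alpha = 0.5`, and the l2 normalization is applied to each regional residue separately. The function names and toy values are hypothetical.

```python
import math


def nearest_centre(feature, codebook):
    """Index of the closest codebook centre (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(feature, c)) for c in codebook]
    return dists.index(min(dists))


def vlad(features, codebook, alpha=0.5):
    """Return V normalized residues, one K-dimensional vector per codebook region."""
    V, K = len(codebook), len(codebook[0])
    residues = [[0.0] * K for _ in range(V)]
    # Quantize each regional feature and accumulate its residue.
    for f in features:
        u = nearest_centre(f, codebook)
        for k in range(K):
            residues[u][k] += f[k] - codebook[u][k]
    encoded = []
    for v in residues:
        # Signed power normalization to reduce burstiness (alpha is assumed).
        v = [math.copysign(abs(x) ** alpha, x) for x in v]
        # l2 normalization of the regional residue; all-zero residues stay zero.
        norm = math.sqrt(sum(x * x for x in v))
        encoded.append([x / norm for x in v] if norm > 0 else v)
    return encoded


# Toy example (hypothetical values): V = 2 centres, K = 2, three regional features.
toy_codebook = [[0.0, 0.0], [10.0, 10.0]]
toy_features = [[3.0, 4.0], [1.0, 0.0], [10.0, 14.0]]
S = vlad(toy_features, toy_codebook)
```

Here the first two features fall in the first codebook region and the third in the second, so both regional residues are non-zero before normalization.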
To match a test image "A" against a reference image "B" in (12), the dot (scalar) product of their u-th regional VLAD components S^L_{A_u} and S^L_{B_u}, each of dimension 1 × K, yields an individual regional matching score j^{A,B}_u, as visualized in Figure 2 (h).
All the scalar scores j^{A,B}_u for the V regions are summed in (13) to obtain the final single matching score J^{A,B}. For each test image "A", the cosine matching in (12) is performed against all the reference images and, finally, the reference image "X" with the highest similarity score is picked as the matched image using (14).
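The matching in (12) to (14) amounts to summing per-region dot products and taking the best-scoring reference. A minimal sketch, assuming each image is already encoded as V normalized K-dimensional residues (function names and toy values are hypothetical):

```python
def match_score(vlad_a, vlad_b):
    """Sum over the V regions of the dot product between regional residues."""
    return sum(
        sum(a * b for a, b in zip(res_a, res_b))
        for res_a, res_b in zip(vlad_a, vlad_b)
    )


def best_match(query_vlad, reference_vlads):
    """Return (index, score) of the reference with the highest matching score."""
    scores = [match_score(query_vlad, ref) for ref in reference_vlads]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy example (hypothetical values): V = 2 regions, K = 2, two reference images.
query = [[1.0, 0.0], [0.0, 1.0]]
references = [
    [[0.0, 1.0], [1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
index, score = best_match(query, references)
```

Because the reference encodings can be computed and stored in advance, only these dot products are needed at query time.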

IV. DATASETS, IMPLEMENTATION DETAILS, RESULTS AND ANALYSIS
This section presents the implementation details of our proposed system and evaluates its run-time performance for real-time robotic VPR applications. The proposed method is compared with state-of-the-art VPR and image retrieval algorithms over several benchmark datasets, and the obtained results are reported. The section ends by showing correctly matched and mismatched scenarios of our proposed Region-VLAD framework, along with a discussion.

A. Benchmark Place Recognition Datasets
The challenging benchmark VPR datasets Berlin A100, Berlin Halenseestrasse and Berlin Kudamm (see [7] for a detailed introduction), collected from the crowd-sourced geotagged photo-mapping platform Mapillary, are used to evaluate the proposed VPR framework. Each dataset covers two traverses of the same route uploaded by different users. One traverse is used as the reference traverse R and the other as the test traverse T (see Table I). R' represents the reduced reference traverse that matches the test traverse T' (discussed in Section IV-E). Another dataset, Gardens Point, was captured at the QUT campus, with one traverse taken in daytime on the left side walk and the other recorded on the right side walk at night time [24]. The Synthesized Nordland dataset was recorded on a train journey, with one traverse taken in winter and the other recorded in spring. Viewpoint variance was added by cropping frames of the summer traverse to keep 75% resemblance [8]. For Berlin A100, Berlin Halenseestrasse and Berlin Kudamm, the geotagged information is used as ground truth with a frame tolerance of 0 to ±2. For Gardens Point and Synthesized Nordland, the ground truth is obtained by parsing the frames and maintaining place-level resemblance, with frame tolerances of 0 to ±3 and 0 to ±2 respectively.

B. Setup, Implementation details and Scalability
The proposed VPR framework is implemented in Python 3.6.4 and the system average runtime over 5 iterations is recorded with 1125 images. AlexNet pre-trained on Places365 dataset is employed as a CNN model for region-based features extraction with 256×256 input image size. For all the baseline experiments, we utilize middle conv3 convolutional layer only due to its better performance in various VPR approaches [6] [22].
For a single image, a forward pass takes on average about 0.305ms using Caffe on an NVIDIA P100 and 15.57ms on an Intel Xeon Gold 6134 @3.2GHz. We extract N ROIs in a total time comparable with the state-of-the-art method [7] (see Table II). The VLAD representations are retrieved and matched using N ROIs mapped onto the V-cluster dictionary C^L (trained using N ROIs per image of the 2.6k dataset). For direct comparison with [7], we use N = 200 with V = 128. Results are also reported for N = 400 with V = 256. Table II shows that for the N = {200, 400} regional settings, our average VLAD matching times are 100x and 58x faster than [7].
In real-time robotic vision applications, which include robotic agricultural devices, autonomous infrastructure, environmental monitoring equipment and other agriculture-based use cases, the size of the database can grow unbounded as new places are explored. Scalability is therefore an important factor to consider [41]. Under both regional settings, employing the GPU for the forward pass and CPUs for both feature extraction and VLAD encoding, the overall times for retrieving a single query VLAD are 396ms and 447ms, whereas the Titan X Pascal GPU in [7] takes 408ms for feature encoding per query. Figure 6 (a) further confirms that the proposed system takes on average 0.07ms (N = 200) and 0.12ms (N = 400) to match the VLAD representations of a single query and a single reference image. The total retrieval times per query against R = 750 reference images are therefore approximately 446.405ms and 533.245ms. In comparison, Cross-Region-BoW [7] takes 7ms to match the features of one test and one reference image. Its overall retrieval time against R = 750 reference images is 5.658s, which is 12x and 11x longer than for our two settings and practically inappropriate for real-time applications. Our Region-VLAD VPR technique can store the encoded VLAD representations of all the reference frames, whereas Cross-Region-BoW needs to cross-match the regions of a given query against the regions of all the reference frames at runtime and pick the mutually matched regional features. Furthermore, Figure 6 (b) evaluates our proposed system's runtime performance when more places are added to the test and reference traverses. For each PR-curve, we employed T test and R reference images. Their VLAD representations are retrieved and then matched using cosine matching while, in parallel, we record the system's performance.
We can see that as the size of the test and reference traverses increases, the AUC under the PR curves remains high, where "Time" represents the overall matching period for a single test image against the reference traverse R. This indicates that the system can handle a large number of reference/database images while maintaining performance in terms of both accuracy and retrieval time. It should be noted that [7] used a MATLAB implementation, which is practically slower than Python, but we have employed CPUs whereas [7] used a GPU.
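The per-query retrieval times quoted above follow a simple cost model: one query encoding plus one match per reference image. The sketch below reproduces that arithmetic; the function name is hypothetical. With the rounded per-match times from the text, the totals for our two settings come to about 448.5 ms and 537 ms, slightly above the reported 446.405 ms and 533.245 ms, presumably because the reported totals use unrounded timings. The Cross-Region-BoW total matches the reported 5.658 s exactly.

```python
def total_retrieval_ms(encode_ms, match_ms, num_refs):
    """Per-query time: encode the query once, then match it against every reference."""
    return encode_ms + match_ms * num_refs


NUM_REFS = 750  # R = 750 reference images, as in the text

ours_200 = total_retrieval_ms(396.0, 0.07, NUM_REFS)   # N = 200, V = 128
ours_400 = total_retrieval_ms(447.0, 0.12, NUM_REFS)   # N = 400, V = 256
cross_bow = total_retrieval_ms(408.0, 7.0, NUM_REFS)   # Cross-Region-BoW [7]

for name, ours in (("N=200", ours_200), ("N=400", ours_400)):
    print(f"{name}: {ours:.1f} ms per query, {cross_bow / ours:.1f}x faster than [7]")
```

The model also makes the scalability argument explicit: the encoding term is constant, so the per-match time dominates as the number of reference images grows.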

C. Comparison Methods
To show the advantage of our novel place-centric region finding approach coupled with VLAD encoding, we replaced VGG-16 with AlexNet365 in [7] (open-source MATLAB code can be found at [42]) and combined the regional features with VLAD and BoW encodings, named Cross-Region-VLAD and Cross-Region-BoW respectively. For a fair comparison, using the 2.6k dataset, we trained a separate regional vocabulary employing conv4 for region identification and conv3 for feature extraction. Keeping N = 200, we used V = 128 for Cross-Region-VLAD and V = 2.6k for Cross-Region-BoW. Furthermore, results are also reported for HybridNet with Spatial Pyramidal Pooling (SPP) [24] employed on conv5 of the model. We also integrated RMAC [17] on AlexNet365, performing power and l2 normalization on the retrieved regional features. Similar to [7], mutual regions are filtered using cross-matching, their scores are summed and the maximum matching score is considered for retrieval.
PR-curves for all other image retrieval approaches, including Cross-Pool, Max-Pool, Sum-Pool and Whole, and for the state-of-the-art VPR approaches FABMAP and SEQSLAM are taken from [7]. The authors in [7] employed conv5_2 of the deep object-centric VGG-16 as the feature representation. However, Cross-Region-BoW [7] with the deep VGG-16 model used conv5_3 for landmark identification and conv5_2 for feature extraction. The standard FABMAP implementation [43] and a three-sequential-frame configuration for SEQSLAM were used by [7].

D. Precision Recall Characteristics
In image retrieval tasks with a moderate to large class imbalance, meaning that positive class samples are rare compared to negative ones, Precision-Recall curves are usually employed as the evaluation metric [12]. For all the benchmark datasets, we first calculate the difference in AUC-PR performance between [7] and Region-VLAD and then average these differences, which gives an overall performance improvement of around 13%.
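As an illustration of the metric, the sketch below builds a PR-curve from per-query best-match scores and integrates it with the trapezoidal rule. It is one common way to compute AUC-PR and is an assumption, not necessarily the exact procedure used for the reported results. A retrieved match counts as correct when it lies within the frame tolerance of the ground truth, and recall is measured against all correctly retrieved matches, so matches discarded by the threshold count as false negatives. Function names and toy values are hypothetical.

```python
def is_correct(retrieved_idx, ground_truth_idx, tolerance):
    """A match is correct if it lies within +/- tolerance frames of the ground truth."""
    return abs(retrieved_idx - ground_truth_idx) <= tolerance


def pr_curve(scores, correct):
    """Sweep the score threshold from high to low and return (recall, precision) points."""
    total_correct = sum(1 for c in correct if c)
    if total_correct == 0:
        raise ValueError("at least one correct match is required")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    points, tp, fp = [], 0, 0
    for i in order:
        if correct[i]:
            tp += 1
        else:
            fp += 1
        points.append((tp / total_correct, tp / (tp + fp)))
    return points


def auc_pr(points):
    """Trapezoidal area under the PR-curve, starting from recall 0."""
    area, prev_r, prev_p = 0.0, 0.0, points[0][1]
    for r, p in points:
        area += (r - prev_r) * (p + prev_p) / 2
        prev_r, prev_p = r, p
    return area


# Toy example (hypothetical values): four queries, the second match is wrong.
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_correct = [True, False, True, True]
auc = auc_pr(pr_curve(toy_scores, toy_correct))
```

Averaging the per-dataset differences in this AUC between two methods gives a single summary figure of the kind quoted above.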
1) Berlin Halenseestrasse: In Figure 7 (a), the proposed Region-VLAD PR-curves for the Berlin Halenseestrasse dataset significantly outperform all other state-of-the-art methods. Surprisingly, the Cross-Region-VLAD PR-curve underperformed by a big margin. This indicates that the better AUC-PR performance of our proposed approach stems from our novel regional features. Furthermore, investigations on Cross-Region-VLAD suggest that, under strong viewpoint change, mapping the cross-convoluted regional patterns [7] onto the vocabulary for VLAD retrieval results in a non-uniform feature distribution. Although normalization is carried out, many zero regional residues still exist in the VLAD representation, which is reflected in the PR-curves. Cross-Region-BoW only considers the mutually matched regions and exhibits better results. Moreover, RMAC, which is state of the art in other image retrieval tasks, and SPP are both sensitive to strong viewpoint variation and thus underperformed on this dataset. Although FABMAP is robust under viewpoint variation, it still underperformed on this dataset, as did SEQSLAM, a whole-image based technique which subtracts patch-normalized sequences of frames. Cross-Pool employs a pooling idea similar to Cross-Region-BoW, so both achieve nearly the same PR-curves, whereas the other pooling techniques underperformed. It is worth noting that, even with smaller regional dictionaries, our proposed Region-VLAD framework still achieves better results than the VGG-16 based Cross-Region-BoW [7] and other methodologies. This highlights the robustness of our shallow CNN based regional features under strong viewpoint variations.
2) Berlin Kudamm: This is a challenging dataset: in its urban environment, many dynamic and confusing objects such as vehicles, trees and pedestrians, together with homogeneous scenes, lead to perceptual aliasing, which is coupled with severe viewpoint changes. Figure 8 (a) shows that our proposed Region-VLAD approach still manages to achieve better results. AlexNet365 combined with Cross-Region-BoW achieves state-of-the-art results with a V = 2.6k regional vocabulary. RMAC and SPP again underperformed. This is apparently because VPR differs from other image retrieval and recognition systems, where a single object typically covers most of the image. Therefore, Sum-Pool, Max-Pool and RMAC, which perform relatively well in such vision-based tasks, do not perform well in VPR under strong viewpoint and appearance changes.
In Figure 8 (b), due to the resemblance among the places captured in sequence, Whole and SeqSLAM, with their whole-image based approaches, show better performance. The Region-VLAD PR-curves are quite similar, with higher precision at the start and as recall increases, but cover more AUC than Whole, SeqSLAM, Cross-Pool and VGG-16 Cross-Region-BoW.
3) Berlin A100: This dataset exhibits moderate viewpoint and moderate conditional changes coupled with dynamic objects. PR-curves are displayed in Figure 9. It is evident that our Region-VLAD approach in Figure 9 (a) achieves results similar to the state-of-the-art VGG-16 Cross-Region-BoW [7]. AlexNet365 combined with the cross-regional approach of [7] achieves similar and better results for BoW and VLAD. SPP employed on HybridNet was found not very convincing. This might be because HybridNet is fine-tuned on SPED, which contains minimal dynamic instances among the same place(s) captured over multiple times of the year.
Against our approach, RMAC on AlexNet365 achieves comparable performance, and it performs better than FABMAP and pooling techniques including Sum-Pool, Max-Pool and Cross-Pool. Since condition and viewpoint variations are not very strong in this dataset, RMAC and other approaches also show better performance. A deeper analysis of the dataset reveals varied time intervals between the captured frames, due to which SEQSLAM underperformed on this dataset. Overall, our proposed Region-VLAD achieved the second best performance after VGG-16 Cross-Region-BoW [7].
4) Synthesized Nordland: Since fine-tuning the CNN model with SPED induced condition invariance in HybridNet, employing SPP on HybridNet shows superior performance on this dataset (exhibiting strong conditional changes). In comparison, the scene-centric AlexNet365 integrated with Cross-Region-BoW and Cross-Region-VLAD outperformed the deep ImageNet-centric VGG-16 based Cross-Region-BoW [7]. This highlights the importance of CNN training.
5) Gardens Point: Both Gardens Point traverses exhibit strong lighting variations with adequate temporal coherence between the frames. Figure 11 shows that our Region-VLAD approach achieves similar or better performance than Cross-Region-BoW, Cross-Region-VLAD, Whole, RMAC and SPP. Taking advantage of the sequential information, SEQSLAM shows state-of-the-art performance. Cross-Region-BoW and Cross-Region-VLAD integrated with AlexNet and VGG-16 exhibit similar performance, but approaches including Sum-Pool, Max-Pool and FABMAP relatively underperformed.

E. Matching Score Thresholding
By nature, PR curves do not consider True Negative cases (correctly rejecting non-existent events/classes) [12]. To handle such situations, we employ T test traverse and R reference traverse from all the datasets so that T − T queries can be treated as new places (see Table I). Figure 12 visualizes the results of the proposed Region-VLAD framework before and after the match score thresholding. On the basis of matching scores, the y-axis differentiates the TP, FN, FP and TN events, shown with different coloured curves, where the length of each curve along the x-axis denotes the number of images the event contains. The threshold is an average of TN scores of R reference traverses of the benchmark datasets. Due to limited space, results are reported for two datasets only. Upon thresholding in Figure 12(b), Region-VLAD for the Berlin Halenseestrasse dataset missed FN = 2 correctly matched images and successfully filtered 10 queries out of TN = 17. The same behavior is observed for the Berlin A100 dataset. In scenarios where the system comes across previously observed places as well as new places, it becomes increasingly challenging to successfully retrieve the correct matches (TPs) and discard incorrect matches (TNs) while reducing FPs (retrieved incorrect matches) and FNs (discarded correctly retrieved matches). It is evident that Region-VLAD not only boosts the area under the PR curves (AUC) but also efficiently assigns low scores to TN queries (green curves).
Fig. 12. Left column presents graphs for Berlin Halenseestrasse and Berlin A100 before thresholding and right column graphs showcase the change in TP, FP, TN and FN upon thresholding. Our proposed Region-VLAD framework assigned low score to the T-T' or TN queries.

F. Analysis
Figure 13 and Figure 14 illustrate some of the matched and mismatched scenarios. For the correct matches, taking advantage of the CNN's scene-centric training, Region-VLAD identifies the common regions, shown with different coloured boxes, under simultaneous viewpoint and appearance changes. For the mismatched scenarios, the top novel regions identified with coloured boxes (trees, lamp posts) show the areas the system focuses on: it matches the scenes but wrongly recognizes the places. We have seen that Cross-Region-BoW [7], when integrated with AlexNet365, showed comparable performance but at a high computation time. However, our Region-VLAD still outperformed Cross-Region-BoW [7] with a smaller dictionary and lower retrieval time. Also, the cross-regional approach of [7], when combined with VLAD, showed inferior results, which confirms that the performance boost in Region-VLAD stems from our novel regional approach. Datasets and results are placed at [44] and the author intends to open-source the code upon publication.
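The match score thresholding described under Matching Score Thresholding can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the example scores, and the convention that a higher score means a stronger match are all assumptions made for illustration. The sketch sets the threshold to the average score of known new-place (TN) queries, labels each query as TP, FP, TN or FN, and shows that precision and recall are computed without the TN count, which is why PR curves alone cannot reflect how well new places are rejected.

```python
from collections import Counter
from statistics import mean


def threshold_from_tn_scores(tn_scores):
    """Return the average matching score of known new-place (TN) queries."""
    return mean(tn_scores)


def classify_query(score, is_new_place, match_is_correct, threshold):
    """Label one query as TP, FP, TN or FN after thresholding its best match.

    Assumption: a higher score means a stronger match, so a match is
    accepted when score >= threshold.
    """
    accepted = score >= threshold
    if is_new_place:
        # A new place has no valid match: accepting it is an FP, rejecting it a TN.
        return "FP" if accepted else "TN"
    if accepted:
        return "TP" if match_is_correct else "FP"
    # Assumption: any rejected query of a previously seen place counts as FN.
    return "FN"


def precision_recall(counts):
    """Precision and recall use TP, FP and FN only; TN enters neither ratio."""
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical scores, not taken from the paper.
    threshold = threshold_from_tn_scores([0.25, 0.5, 0.75])
    # Each query: (best-match score, is it a new place?, is the match correct?)
    queries = [
        (0.9, False, True),
        (0.8, False, False),
        (0.4, False, True),
        (0.75, True, False),
        (0.25, True, False),
    ]
    counts = Counter(classify_query(s, n, c, threshold) for s, n, c in queries)
    print(threshold, dict(counts), precision_recall(counts))
```

Raising the threshold in this sketch turns more new-place queries into TNs at the cost of turning some correct matches into FNs, which is the trade-off visible in Figure 12.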

V. CONCLUSION
For Visual Place Recognition on resource-constrained mobile robots, achieving state-of-the-art accuracy with lightweight CNN architectures is highly desirable but challenging. This paper has taken a step in this direction and presented a holistic approach targeted at a CNN architecture comprising a small number of layers, pre-trained on a scene-centric image database, to reduce the memory and computational cost for resource-constrained mobile robots. The proposed framework detects novel CNN-based regional features and combines them with the VLAD encoding methodology, adapted specifically for the computation-efficient and environment-invariant VPR problem. The proposed method achieved state-of-the-art AUC under PR curves on benchmark place recognition datasets with significant viewpoint and condition variations.
In future, it would be useful to analyse the performance of the proposed framework on other shallow/deep CNN models individually trained/fine-tuned on place recognition-centric datasets. Furthermore, instead of employing a predefined number of novel regions, it would be interesting to investigate dynamic regional feature selection at runtime and its performance on multiple regional vocabularies.