Fine-residual VLAD for image retrieval

Neurocomputing, Volume 173, Part 3, 15 January 2016, Pages 1183-1191

https://doi.org/10.1016/j.neucom.2015.08.076

Abstract

This paper revisits the vector of locally aggregated descriptors (VLAD), which aggregates the residuals of local descriptors to their cluster centers. Since VLAD usually adopts a small-size codebook, the clusters are coarse and the residuals are not discriminative. To address this problem, this paper proposes to generate a number of residual codebooks descended from the original clusters. After quantizing local descriptors with these codebooks, we pool the resulting secondary residuals together with the primary ones to obtain fine residuals. We show that, with a two-step aggregation, the fine-residual VLAD has the same dimension as the original. Experiments on two image search benchmarks confirm the improved discriminative power of our method: we observe consistent superiority over the baseline and performance competitive with the state of the art.

Introduction

We are interested in the problem of near-duplicate image retrieval in large scale databases. Specifically, given a query image of an object or scene, we aim to retrieve similar images from a large scale database with high search accuracy, high efficiency, and small memory usage.

Many state-of-the-art image retrieval systems adopt the Bag-of-Words (BoW) model, relying on local invariant features such as SIFT [1], [2]. In the BoW model, a codebook is generated by clustering local features with the unsupervised k-means algorithm [3], [4]. Each local descriptor is then quantized to its closest visual word. Based on the quantization results, an image is represented by a high-dimensional sparse histogram. Each visual word is weighted using the TF-IDF scheme [5], [6], and inverted lists can be employed for fast retrieval.
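As a concrete illustration of this pipeline, the following Python sketch builds a codebook with k-means, quantizes descriptors into a sparse histogram, and applies TF-IDF weighting. The function names, the codebook size, and the omission of the inverted index are our own simplifications for illustration, not the implementation used in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors, k=1000, seed=0):
    """Cluster training descriptors (e.g., SIFT) into k visual words."""
    return KMeans(n_clusters=k, random_state=seed, n_init=10).fit(descriptors).cluster_centers_

def bow_histogram(descriptors, codebook):
    """Quantize each descriptor to its nearest visual word and count occurrences."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)                  # hard assignment to the closest word
    return np.bincount(words, minlength=len(codebook)).astype(float)

def tfidf_weight(histograms):
    """Weight visual words by term frequency times inverse document frequency."""
    h = np.asarray(histograms, dtype=float)
    tf = h / np.maximum(h.sum(axis=1, keepdims=True), 1e-12)
    df = (h > 0).sum(axis=0)                   # number of images containing each word
    idf = np.log(len(h) / np.maximum(df, 1))
    return tf * idf
```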

Nevertheless, for very large databases, search time and memory requirements limit the number of images that can be indexed in practice. Aggregated vectors such as the Fisher Vector [7] and VLAD [8] address this problem by encoding an image into a single vector, achieving a reasonable trade-off between search accuracy and efficiency. In both representations, a small-size codebook is used, and the accumulated residuals on each visual word are concatenated into a single vector. A residual is the difference vector between a local descriptor and its assigned visual word; in the following, we use "residual" and "difference vector" interchangeably. These methods transform a set of local descriptors into a fixed-length vector and obtain better performance than a BoW representation of the same dimension. Usually, the dimension of the VLAD vector can be reduced to 128-D by Principal Component Analysis (PCA) for fast search. Since VLAD can be seen as a simplified version of the Fisher Vector, we focus only on the former in this paper.
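The PCA step mentioned above is straightforward; a minimal sketch follows, assuming scikit-learn and one VLAD vector per row. The L2 re-normalization after projection is a common companion step we add for illustration; the excerpt does not specify it.

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_vlad(vlads, dim=128, seed=0):
    """Project VLAD vectors (one per row) down to dim-D for fast search."""
    pca = PCA(n_components=dim, random_state=seed).fit(vlads)
    reduced = pca.transform(vlads)
    # Re-normalize rows so dot products remain cosine similarities after projection.
    norms = np.maximum(np.linalg.norm(reduced, axis=1, keepdims=True), 1e-12)
    return reduced / norms, pca
```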

Due to the large quantization error of a small-size codebook, the residuals are not very discriminative, which limits the accuracy of the VLAD descriptor. Quantization error can be reduced by increasing the codebook size. However, a VLAD descriptor built on a large-size codebook is high-dimensional and requires more search time and memory; it sacrifices efficiency to improve accuracy. Besides, a high-dimensional representation suffers more from dimensionality reduction. Considering this, we aim to improve search accuracy while preserving search time and memory usage.

In our approach, we generate a number of residual codebooks descended from the original clusters. Local descriptors assigned to the same cluster are then distinguished by their residuals and divided into finer clusters. Through these codebooks, we calculate the difference vector between a primary residual and its closest visual word in the residual codebook, denoted as the secondary residual. By pooling the secondary residuals with the primary ones, we obtain fine residuals carrying more discriminative information. Furthermore, the fine residuals are aggregated into one vector through a two-step aggregation, keeping the same dimension as the original. Examples of image retrieval using fine-residual VLAD are shown in Fig. 1.
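To make the two-step idea concrete, here is a sketch of how training and encoding might look, assuming NumPy and scikit-learn. The codebook sizes K and M, all function names, and the per-aggregate normalization in the second step are our assumptions; the excerpt summarizes the pooling of primary and secondary residuals without giving exact weights, so this should be read as one plausible instantiation rather than the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_fine_residual(descriptors, K=64, M=8, seed=0):
    """Learn the primary codebook plus one residual codebook per cluster.

    K and M are hypothetical sizes; assumes each cluster receives at least M points.
    """
    km = KMeans(n_clusters=K, random_state=seed, n_init=10).fit(descriptors)
    centers = km.cluster_centers_
    residual_books = []
    for i in range(K):
        # Residual codebook i is descended from cluster i: k-means over the
        # primary residuals of the descriptors assigned to that cluster.
        res = descriptors[km.labels_ == i] - centers[i]
        residual_books.append(
            KMeans(n_clusters=M, random_state=seed, n_init=10).fit(res).cluster_centers_)
    return centers, residual_books

def encode_fine_residual(X, centers, residual_books):
    """Two-step aggregation; the output keeps the plain-VLAD size K*D."""
    K, D = centers.shape
    out = np.zeros((K, D))
    assign = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
    for i in range(K):
        r = X[assign == i] - centers[i]            # primary residuals of cluster i
        if len(r) == 0:
            continue
        out[i] = r.sum(axis=0)                     # pool the primary residuals
        B = residual_books[i]
        j = ((r[:, None, :] - B[None]) ** 2).sum(-1).argmin(1)
        for m in range(len(B)):
            # Step 1: aggregate secondary residuals r - B[m] within each finer cluster.
            s = (r[j == m] - B[m]).sum(axis=0)
            # Step 2: pool the aggregate into the parent cluster's D-dim slot.
            # The per-aggregate normalization here is our assumption; the excerpt
            # does not give the exact pooling weights.
            n = np.linalg.norm(s)
            out[i] += s / n if n > 0 else s
    v = out.ravel()
    return v / max(np.linalg.norm(v), 1e-12)       # global L2 normalization
```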

The remainder of the paper is organized as follows. In Section 2, we briefly review related works. In Section 3, we describe our method in detail. Experiments are presented in Section 4. Finally, we conclude in Section 5.


Related works

Over the past decade, considerable effort has been devoted to improving image retrieval performance. One milestone was the introduction of the BoW model using invariant local features [44] such as SIFT. There have been many notable contributions that improve the BoW representation. On one hand, many works focus on reducing the quantization error. To name a few, hierarchical k-means [4] quantizes local descriptors hierarchically through a vocabulary tree and allows a larger codebook to be used efficiently.

VLAD review

VLAD is an encoding technique that aggregates a set of local descriptors (e.g., SIFT) into a fixed-length vector. First, a codebook C = {c1, …, cK} is learned using the k-means algorithm. Then, each descriptor x is quantized to its closest cluster center ci = NN(x). In the following, we use "cluster center" and "visual word" interchangeably. For each cluster, the residuals x − ci between the descriptors x assigned to it and the center ci are accumulated, capturing the characteristics of the descriptors' distribution within the cluster.
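For reference, a minimal NumPy sketch of this baseline encoder follows; the function name and the final L2 normalization (a standard step for VLAD) are our additions.

```python
import numpy as np

def vlad_encode(X, centers):
    """Plain VLAD: accumulate residuals x - ci per visual word, concatenate,
    and L2-normalize. X: (n, D) local descriptors; centers: (K, D) codebook."""
    K, D = centers.shape
    v = np.zeros((K, D))
    # Hard-assign each descriptor to its nearest cluster center, ci = NN(x).
    assign = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
    for i in range(K):
        v[i] = (X[assign == i] - centers[i]).sum(axis=0)   # residual accumulation
    v = v.ravel()
    return v / max(np.linalg.norm(v), 1e-12)               # global L2 normalization
```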

Datasets

In this paper, we evaluate our proposed method on two public datasets, INRIA Holidays [10] and UKBench [4]. The Holidays dataset consists of 1491 personal holiday photos, 500 of which are queries. Most queries have one or two ground-truth images undergoing various changes. Retrieval accuracy is measured by mean average precision (mAP). The UKBench dataset contains 10,200 images; every four images are taken of the same object under different viewpoints and illuminations. The performance is measured by the N-S score, i.e., the average number of relevant images among the top four ranked results (4 at most).
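Both measures are easy to state in code. A minimal sketch follows; the helper names are ours, and ranked_ids is assumed to be the retrieval order returned for a single query.

```python
import numpy as np

def average_precision(ranked_ids, relevant):
    """AP for one query: precision at each relevant hit, averaged over all
    ground-truth images (unretrieved relevant images contribute zero)."""
    hits, total = 0, 0.0
    for rank, img_id in enumerate(ranked_ids, start=1):
        if img_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, ground_truths):
    """mAP: mean of per-query average precision."""
    return float(np.mean([average_precision(r, g) for r, g in zip(rankings, ground_truths)]))

def ns_score(ranked_ids, relevant):
    """UKBench N-S score for one query: relevant images among the top four (max 4)."""
    return sum(1 for i in ranked_ids[:4] if i in relevant)
```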

Conclusion

This paper introduces the fine-residual VLAD for large scale image retrieval. As the VLAD representation usually adopts a small-size codebook, the clusters are very coarse, leading to relatively low accuracy. To enhance performance, we generate residual codebooks that divide the local descriptors into finer clusters. Through the residual codebooks, we add refinement to the aggregation and obtain fine residuals, which improve the discriminative power of VLAD. Moreover, the fine-residual VLAD keeps the same dimension as the original through the two-step aggregation, preserving search time and memory usage.

Acknowledgments

This work was supported by the National High Technology Research and Development Program of China (863 Program) under Grant no. 2012AA011004 and the National Science and Technology Support Program under Grant no. 2013BAK02B04. This work was also supported in part by ARO grants W911NF-15-1-0290 and W911NF-12-1-0057 and a Faculty Research Award from NEC Laboratories of America to Dr. Qi Tian, and in part by the National Science Foundation of China (NSFC) under Grant no. 61429201.


References (43)

  • L. Paulevé et al., Locality sensitive hashing: a comparison of hash function types and querying mechanisms, Pattern Recognit. Lett. (2010)
  • D. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. (2004)
  • R. Arandjelovic, A. Zisserman, Three things everyone should know to improve object retrieval, in: Proceedings of CVPR, ...
  • J. Philbin, O. Chum, M. Isard, J. Sivic, A. Zisserman, Object retrieval with large vocabularies and fast spatial ...
  • D. Nister, H. Stewenius, Scalable recognition with a vocabulary tree, in: Proceedings of CVPR, 2006, pp. ...
  • J. Sivic, A. Zisserman, Video Google: a text retrieval approach to object matching in videos, in: Proceedings of ICCV, ...
  • L. Zheng et al., Lp-norm IDF for scalable image retrieval, IEEE Trans. Image Process. (2014)
  • F. Perronnin, Y. Liu, J. Sánchez, H. Poirier, Large-scale image retrieval with compressed Fisher vectors, in: ...
  • H. Jégou, M. Douze, C. Schmid, P. Pérez, Aggregating local descriptors into a compact image representation, in: ...
  • J. Philbin, O. Chum, M. Isard, J. Sivic, A. Zisserman, Improving particular object retrieval in large scale image ...
  • H. Jégou, M. Douze, C. Schmid, Hamming embedding and weak geometric consistency for large scale image search, in: ...
  • L. Zheng et al., Visual phraselet: refining spatial constraints for large scale image search, IEEE Signal Process. Lett. (2013)
  • X. Shen, Z. Lin, J. Brandt, S. Avidan, Y. Wu, Object retrieval and localization with spatially-constrained similarity ...
  • L. Zheng et al., Coupled binary embedding for large-scale image retrieval, IEEE Trans. Image Process. (2014)
  • C. Wengert, M. Douze, H. Jégou, Bag-of-colors for improved image search, in: ACM MM, 2011, pp. ...
  • S. Zhang, M. Yang, X. Wang, Y. Lin, Q. Tian, Semantic-aware co-indexing for near-duplicate image retrieval, in: ...
  • M. Douze, A. Ramisa, C. Schmid, Combining attributes and Fisher vectors for efficient image retrieval, in: Proceedings ...
  • L. Zheng et al., Fast image retrieval: query pruning and early termination, IEEE Trans. Multimedia (2015)
  • R. Ji et al., Location discriminative vocabulary coding for mobile landmark search, Int. J. Comput. Vis. (2012)
  • J.L. Bentley, B. Labo, K-d trees for semidynamic point sets, in: Proceedings of 6th SOCG, 1990, pp. ...
  • Y. Weiss, A.B. Torralba, R. Fergus, Spectral hashing, in: NIPS, 2008, pp. ...

Ziqiong Liu received the bachelor's degree in Information Engineering from Southeast University, Nanjing, China, in 2011. She is currently pursuing the Ph.D. degree in Electronic Engineering at Tsinghua University, Beijing, China. Her current research interests include image/video processing and large scale multimedia retrieval.

Qi Tian (M'96-SM'03) received the B.E. degree in electronic engineering from Tsinghua University, China, in 1992, the M.S. degree in electrical and computer engineering from Drexel University in 1996, and the Ph.D. degree in electrical and computer engineering from the University of Illinois at Urbana-Champaign in 2002. He is currently a Professor in the Department of Computer Science at the University of Texas at San Antonio (UTSA). He took a one-year faculty leave at Microsoft Research Asia (MSRA) during 2008-2009. Dr. Tian's research interests include multimedia information retrieval and computer vision. He has published over 280 refereed journal and conference papers. His research projects have been funded by NSF, ARO, DHS, SALSI, CIAS, and UTSA, and he has received faculty research awards from Google, NEC Laboratories of America, FXPAL, Akiira Media Systems, and HP Labs. He received the Best Paper Awards at PCM 2013, MMM 2013, and ICIMCS 2012, and was a Top 10 Candidate at PCM 2007. He received the Research Achievement Award from the College of Science, UTSA, in 2014, and the 2010 ACM Service Award. He is on the Editorial Boards of IEEE Transactions on Multimedia (TMM), IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), Multimedia Systems Journal, Journal of Multimedia (JMM), and Machine Vision and Applications (MVA). He has been a Guest Editor of IEEE Transactions on Multimedia, Journal of Computer Vision and Image Understanding, Pattern Recognition Letters, EURASIP Journal on Advances in Signal Processing, Journal of Visual Communication and Image Representation, etc.

Shengjin Wang received the B.E. degree from Tsinghua University, China, in 1985 and the Ph.D. degree from the Tokyo Institute of Technology, Tokyo, Japan, in 1997. From May 1997 to August 2003, he was a member of the research staff in the Internet System Research Laboratories, NEC Corporation, Japan. Since September 2003, he has been a Professor with the Department of Electronic Engineering, Tsinghua University. He has published more than 80 papers on image processing, computer vision, and pattern recognition, and holds ten patents. His current research interests include image processing, computer vision, video surveillance, and pattern recognition.
