Fine-residual VLAD for image retrieval

Neurocomputing, Volume 173, Part 3, 15 January 2016, Pages 1183-1191

https://doi.org/10.1016/j.neucom.2015.08.076

Abstract

This paper revisits the vector of locally aggregated descriptors (VLAD), which aggregates the residuals of local descriptors to their cluster centers. Since VLAD usually adopts a small-size codebook, the clusters are coarse and the residuals are not discriminative. To address this problem, this paper proposes to generate a number of residual codebooks descended from the original clusters. After quantizing local descriptors with these codebooks, we pool the resulting secondary residuals together with the primary ones to obtain fine residuals. We show that, with a two-step aggregation, the fine-residual VLAD has the same dimension as the original. Experiments on two image search benchmarks confirm the improved discriminative power of our method: we observe consistent superiority over the baseline and performance competitive with the state of the art.

Introduction

We are interested in the problem of near-duplicate image retrieval in large scale databases. Specifically, given a query image of an object or scene, we aim to retrieve similar images from a large scale database with high search accuracy, high efficiency, and small memory usage.

Many state-of-the-art image retrieval systems adopt the Bag-of-Words (BoW) model, relying on local invariant features such as SIFT [1], [2]. In the BoW model, a codebook is generated by clustering local features with the unsupervised k-means algorithm [3], [4]. Each local descriptor is then quantized to its closest visual word. Based on the quantization results, an image is represented by a high-dimensional sparse histogram. Each visual word is weighted using the TF-IDF scheme [5], [6], and inverted lists can be employed for fast retrieval.
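As a concrete illustration of this pipeline, the following Python sketch builds a codebook with k-means, quantizes descriptors into a sparse histogram, and applies TF-IDF weighting. The function names, the codebook size, and the omission of the inverted index are our own simplifications for illustration, not the implementation used in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors, k=1000, seed=0):
    """Cluster training descriptors (e.g., SIFT) into k visual words."""
    return KMeans(n_clusters=k, random_state=seed, n_init=10).fit(descriptors).cluster_centers_

def bow_histogram(descriptors, codebook):
    """Quantize each descriptor to its nearest visual word and count occurrences."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)                  # hard assignment to the closest word
    return np.bincount(words, minlength=len(codebook)).astype(float)

def tfidf_weight(histograms):
    """Weight visual words by term frequency times inverse document frequency."""
    h = np.asarray(histograms, dtype=float)
    tf = h / np.maximum(h.sum(axis=1, keepdims=True), 1e-12)
    df = (h > 0).sum(axis=0)                   # number of images containing each word
    idf = np.log(len(h) / np.maximum(df, 1))
    return tf * idf
```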

Nevertheless, for very large databases, search time and memory requirements limit the number of images that can be indexed in practice. Aggregated vectors such as the Fisher Vector [7] and VLAD [8] address this problem by encoding an image into a single vector, achieving a reasonable trade-off between search accuracy and efficiency. In both representations, a small-size codebook is used, and the accumulated residuals on each visual word are concatenated into a single vector. A residual is the difference vector between a local descriptor and its assigned visual word; in the following, we use "residual" and "difference vector" interchangeably. These methods transform a set of local descriptors into a fixed-length vector and obtain better performance than a BoW representation of the same dimension. Usually, the dimension of the VLAD vector can be reduced to 128-D by Principal Component Analysis (PCA) for fast search. Since VLAD can be seen as a simplified version of the Fisher Vector, we focus only on the former in this paper.
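The PCA step mentioned above is straightforward; a minimal sketch follows, assuming scikit-learn and one VLAD vector per row. The L2 re-normalization after projection is a common companion step we add for illustration; the excerpt does not specify it.

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_vlad(vlads, dim=128, seed=0):
    """Project VLAD vectors (one per row) down to dim-D for fast search."""
    pca = PCA(n_components=dim, random_state=seed).fit(vlads)
    reduced = pca.transform(vlads)
    # Re-normalize rows so dot products remain cosine similarities after projection.
    norms = np.maximum(np.linalg.norm(reduced, axis=1, keepdims=True), 1e-12)
    return reduced / norms, pca
```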

Due to the large quantization error of a small-size codebook, the residuals are not very discriminative, which limits the accuracy of the VLAD descriptor. Quantization error can be reduced by increasing the codebook size. However, a VLAD descriptor built on a large-size codebook is high-dimensional and requires more search time and memory; it sacrifices efficiency to improve accuracy. Besides, a high-dimensional representation suffers more from dimensionality reduction. Considering this, we aim to improve search accuracy while preserving search time and memory usage.

In our approach, we generate a number of residual codebooks descended from the original clusters. Local descriptors assigned to the same cluster are then distinguished by their residuals and divided into finer clusters. Through these codebooks, we calculate the difference vector between a primary residual and its closest visual word in the residual codebook, denoted as the secondary residual. By pooling the secondary residuals with the primary ones, we obtain fine residuals carrying more discriminative information. Furthermore, the fine residuals are aggregated into one vector through a two-step aggregation, keeping the same dimension as the original. Examples of image retrieval using fine-residual VLAD are shown in Fig. 1.
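To make the two-step idea concrete, here is a sketch of how training and encoding might look, assuming NumPy and scikit-learn. The codebook sizes K and M, all function names, and the per-aggregate normalization in the second step are our assumptions; the excerpt summarizes the pooling of primary and secondary residuals without giving exact weights, so this should be read as one plausible instantiation rather than the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_fine_residual(descriptors, K=64, M=8, seed=0):
    """Learn the primary codebook plus one residual codebook per cluster.

    K and M are hypothetical sizes; assumes each cluster receives at least M points.
    """
    km = KMeans(n_clusters=K, random_state=seed, n_init=10).fit(descriptors)
    centers = km.cluster_centers_
    residual_books = []
    for i in range(K):
        # Residual codebook i is descended from cluster i: k-means over the
        # primary residuals of the descriptors assigned to that cluster.
        res = descriptors[km.labels_ == i] - centers[i]
        residual_books.append(
            KMeans(n_clusters=M, random_state=seed, n_init=10).fit(res).cluster_centers_)
    return centers, residual_books

def encode_fine_residual(X, centers, residual_books):
    """Two-step aggregation; the output keeps the plain-VLAD size K*D."""
    K, D = centers.shape
    out = np.zeros((K, D))
    assign = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
    for i in range(K):
        r = X[assign == i] - centers[i]            # primary residuals of cluster i
        if len(r) == 0:
            continue
        out[i] = r.sum(axis=0)                     # pool the primary residuals
        B = residual_books[i]
        j = ((r[:, None, :] - B[None]) ** 2).sum(-1).argmin(1)
        for m in range(len(B)):
            # Step 1: aggregate secondary residuals r - B[m] within each finer cluster.
            s = (r[j == m] - B[m]).sum(axis=0)
            # Step 2: pool the aggregate into the parent cluster's D-dim slot.
            # The per-aggregate normalization here is our assumption; the excerpt
            # does not give the exact pooling weights.
            n = np.linalg.norm(s)
            out[i] += s / n if n > 0 else s
    v = out.ravel()
    return v / max(np.linalg.norm(v), 1e-12)       # global L2 normalization
```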

The remainder of the paper is organized as follows. In Section 2, we briefly review related works. In Section 3, we describe our method in detail. Experiments are presented in Section 4. Finally, we conclude in Section 5.


Related works

Over the past decade, considerable effort has been devoted to improving image retrieval performance. One milestone was the introduction of the BoW model using invariant local features [44] such as SIFT. There have been many notable contributions that improve the BoW representation. On one hand, many works focus on reducing the quantization error. To name a few, hierarchical k-means [4] quantizes local descriptors hierarchically through a vocabulary tree and allows a larger codebook to be used efficiently.

VLAD review

VLAD is an encoding technique that aggregates a set of local descriptors (e.g., SIFT) into a fixed-length vector. First, a codebook C = {c1, …, cK} is learned using the k-means algorithm. Then, each descriptor x is quantized to its closest cluster center ci = NN(x). In the following, we use "cluster center" and "visual word" interchangeably. For each cluster, the residuals x − ci between the descriptors x assigned to it and the center ci are accumulated, capturing the characteristics of the descriptors' distribution within the cluster.
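For reference, a minimal NumPy sketch of this baseline encoder follows; the function name and the final L2 normalization (a standard step for VLAD) are our additions.

```python
import numpy as np

def vlad_encode(X, centers):
    """Plain VLAD: accumulate residuals x - ci per visual word, concatenate,
    and L2-normalize. X: (n, D) local descriptors; centers: (K, D) codebook."""
    K, D = centers.shape
    v = np.zeros((K, D))
    # Hard-assign each descriptor to its nearest cluster center, ci = NN(x).
    assign = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
    for i in range(K):
        v[i] = (X[assign == i] - centers[i]).sum(axis=0)   # residual accumulation
    v = v.ravel()
    return v / max(np.linalg.norm(v), 1e-12)               # global L2 normalization
```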

Datasets

In this paper, we evaluate our proposed method on two public datasets, INRIA Holidays [10] and UKBench [4]. The Holidays dataset consists of 1491 personal holiday photos, 500 of which are queries. Most queries have one or two ground-truth images undergoing various changes. Retrieval accuracy is measured by mean average precision (mAP). The UKBench dataset contains 10,200 images; every four images are taken of the same object under different viewpoints and illuminations. The performance is measured by the N-S score, i.e., the average number of relevant images among the top four ranked results (4 at most).
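Both measures are easy to state in code. A minimal sketch follows; the helper names are ours, and ranked_ids is assumed to be the retrieval order returned for a single query.

```python
import numpy as np

def average_precision(ranked_ids, relevant):
    """AP for one query: precision at each relevant hit, averaged over all
    ground-truth images (unretrieved relevant images contribute zero)."""
    hits, total = 0, 0.0
    for rank, img_id in enumerate(ranked_ids, start=1):
        if img_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, ground_truths):
    """mAP: mean of per-query average precision."""
    return float(np.mean([average_precision(r, g) for r, g in zip(rankings, ground_truths)]))

def ns_score(ranked_ids, relevant):
    """UKBench N-S score for one query: relevant images among the top four (max 4)."""
    return sum(1 for i in ranked_ids[:4] if i in relevant)
```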

Conclusion

This paper introduces the fine-residual VLAD for large scale image retrieval. As the VLAD representation usually adopts a small-size codebook, the clusters are very coarse, leading to relatively low accuracy. To enhance performance, we generate residual codebooks that divide the local descriptors into finer clusters. Through the residual codebooks, we add refinement to the aggregation and obtain fine residuals, which improve the discriminative power of VLAD. Moreover, the fine-residual VLAD keeps the same dimension as the original through the two-step aggregation, preserving search time and memory usage.

Acknowledgments

This work was supported by the National High Technology Research and Development Program of China (863 Program) under Grant no. 2012AA011004 and the National Science and Technology Support Program under Grant no. 2013BAK02B04. This work was also supported in part by ARO grants W911NF-15-1-0290 and W911NF-12-1-0057 and a Faculty Research Award from NEC Laboratories of America to Dr. Qi Tian, and in part by the National Science Foundation of China (NSFC) under Grant no. 61429201.


References (43)

  • L. Paulevé et al., Locality sensitive hashing: a comparison of hash function types and querying mechanisms, Pattern Recognit. Lett. (2010)
  • D. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. (2004)
  • R. Arandjelovic, A. Zisserman, Three things everyone should know to improve object retrieval, in: Proceedings of CVPR, ...
  • J. Philbin, O. Chum, M. Isard, J. Sivic, A. Zisserman, Object retrieval with large vocabularies and fast spatial ...
  • D. Nister, H. Stewenius, Scalable recognition with a vocabulary tree, in: Proceedings of CVPR, 2006, pp. ...
  • J. Sivic, A. Zisserman, Video Google: a text retrieval approach to object matching in videos, in: Proceedings of ICCV, ...
  • L. Zheng et al., Lp-norm IDF for scalable image retrieval, IEEE Trans. Image Process. (2014)
  • F. Perronnin, Y. Liu, J. Sánchez, H. Poirier, Large-scale image retrieval with compressed Fisher vectors, in: ...
  • H. Jégou, M. Douze, C. Schmid, P. Pérez, Aggregating local descriptors into a compact image representation, in: ...
  • J. Philbin, O. Chum, M. Isard, J. Sivic, A. Zisserman, Improving particular object retrieval in large scale image ...
  • H. Jégou, M. Douze, C. Schmid, Hamming embedding and weak geometric consistency for large scale image search, in: ...
  • L. Zheng et al., Visual phraselet: refining spatial constraints for large scale image search, IEEE Signal Process. Lett. (2013)
  • X. Shen, Z. Lin, J. Brandt, S. Avidan, Y. Wu, Object retrieval and localization with spatially-constrained similarity ...
  • L. Zheng et al., Coupled binary embedding for large-scale image retrieval, IEEE Trans. Image Process. (2014)
  • C. Wengert, M. Douze, H. Jégou, Bag-of-colors for improved image search, in: ACM MM, 2011, pp. ...
  • S. Zhang, M. Yang, X. Wang, Y. Lin, Q. Tian, Semantic-aware co-indexing for near-duplicate image retrieval, in: ...
  • M. Douze, A. Ramisa, C. Schmid, Combining attributes and Fisher vectors for efficient image retrieval, in: Proceedings ...
  • L. Zheng et al., Fast image retrieval: query pruning and early termination, IEEE Trans. Multimedia (2015)
  • R. Ji et al., Location discriminative vocabulary coding for mobile landmark search, Int. J. Comput. Vis. (2012)
  • J.L. Bentley, B. Labo, K-d trees for semidynamic point sets, in: Proceedings of 6th SOCG, 1990, pp. ...
  • Y. Weiss, A.B. Torralba, R. Fergus, Spectral hashing, in: NIPS, 2008, pp. ...

Ziqiong Liu received the bachelor's degree in Information Engineering from Southeast University, Nanjing, China, in 2011. She is currently pursuing the Ph.D. degree in Electronic Engineering at Tsinghua University, Beijing, China. Her current research interests include image/video processing and large scale multimedia retrieval.

Qi Tian (M'96-SM'03) received the B.E. degree in electronic engineering from Tsinghua University, China, in 1992, the M.S. degree in electrical and computer engineering from Drexel University in 1996, and the Ph.D. degree in electrical and computer engineering from the University of Illinois at Urbana-Champaign in 2002. He is currently a Professor in the Department of Computer Science at the University of Texas at San Antonio (UTSA). He took a one-year faculty leave at Microsoft Research Asia (MSRA) during 2008-2009. Dr. Tian's research interests include multimedia information retrieval and computer vision. He has published over 280 refereed journal and conference papers. His research projects have been funded by NSF, ARO, DHS, SALSI, CIAS, and UTSA, and he has received faculty research awards from Google, NEC Laboratories of America, FXPAL, Akiira Media Systems, and HP Labs. He received the Best Paper Awards at PCM 2013, MMM 2013, and ICIMCS 2012, and was a Top 10 Candidate at PCM 2007. He received the Research Achievement Award from the College of Science, UTSA, in 2014, and the 2010 ACM Service Award. He is on the Editorial Boards of IEEE Transactions on Multimedia (TMM), IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), Multimedia Systems Journal, Journal of Multimedia (JMM), and Machine Vision and Applications (MVA). He has been a Guest Editor of IEEE Transactions on Multimedia, Journal of Computer Vision and Image Understanding, Pattern Recognition Letters, EURASIP Journal on Advances in Signal Processing, Journal of Visual Communication and Image Representation, etc.

Shengjin Wang received the B.E. degree from Tsinghua University, China, in 1985 and the Ph.D. degree from the Tokyo Institute of Technology, Tokyo, Japan, in 1997. From May 1997 to August 2003, he was a member of the research staff in the Internet System Research Laboratories, NEC Corporation, Japan. Since September 2003, he has been a Professor with the Department of Electronic Engineering, Tsinghua University. He has published more than 80 papers on image processing, computer vision, and pattern recognition, and holds ten patents. His current research interests include image processing, computer vision, video surveillance, and pattern recognition.
