Neurocomputing

Volume 238, 17 May 2017, Pages 399-409

SIFT Matching with CNN Evidences for Particular Object Retrieval

https://doi.org/10.1016/j.neucom.2017.01.081

Abstract

Many object instance retrieval systems are based on matching of local features such as SIFT. However, these local descriptors serve as low-level clues that are often not distinctive enough to prevent false matches. Recently, deep convolutional neural networks (CNNs) have shown promise as semantic-aware representations for many computer vision tasks. In this paper, we propose a novel approach that employs CNN evidences to improve SIFT matching accuracy, which plays a critical role in improving object retrieval performance. To weaken the interference of background noise, we extract compact CNN representations from a number of generic object regions. A query-adaptive method is then proposed to choose appropriate CNN evidence to verify each pre-matched SIFT pair. Two different visual matching verification functions are introduced and evaluated. Moreover, we investigate the suitability of fine-tuning the CNN for our approach. Extensive experiments on benchmark datasets demonstrate the effectiveness of our method for particular object retrieval. Our results compare favorably to the state-of-the-art methods with acceptable memory usage and query time.

Introduction

This paper considers the task of particular object retrieval. Given a query image in which a particular object has been selected, the retrieval system should return from its corpus the set of relevant images in which that object appears. This is a harder problem than whole-image retrieval, since the query object may appear cluttered or occluded against diverse backgrounds in the relevant images. Some examples are shown in Fig. 1.

Most object retrieval systems are based on matching of local features, which are extracted in two steps. The first step detects keypoints of interest, delineates a local patch around each keypoint, and normalizes the patch to a fixed size. The second step describes the normalized patches with algorithms such as SIFT [1]. However, directly matching the 128-dimensional SIFT descriptors is time-consuming. To speed up the matching process, the bag-of-words (BoW) model is widely used [2], [3]. The BoW model defines a visual dictionary and quantizes the local features to visual words, so that an image can be represented by a frequency histogram of visual words. Local features are matched if they are quantized to the same visual word, and the similarity between two images can then be expressed as the inner product of their BoW representations. An inverted index, which exploits the sparsity of the BoW representation, also makes the search efficient. Since matching SIFTs based only on visual words may be too coarse, Jégou et al. [4] propose Hamming embedding (HE) to improve the matching accuracy. HE attaches a compact binary signature to each SIFT during quantization; two coarsely matched SIFTs are then filtered out if the Hamming distance between their binary signatures is larger than a threshold.
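
Although the paper itself contains no code, the BoW matching and HE filtering steps described above can be summarized in a short sketch. This is a minimal illustration assuming a pre-trained vocabulary and 64-bit HE signatures; the threshold value is a placeholder, not a setting from the paper.

```python
import numpy as np

def quantize(descriptors, vocabulary):
    """Hard-assign each local descriptor to its nearest visual word."""
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def bow_histogram(word_ids, vocab_size):
    """Represent an image as an L2-normalized frequency histogram of visual words."""
    hist = np.bincount(word_ids, minlength=vocab_size).astype(np.float32)
    return hist / max(np.linalg.norm(hist), 1e-12)

def bow_similarity(hist_a, hist_b):
    """Image similarity is the inner product of the BoW histograms."""
    return float(hist_a @ hist_b)

def he_filter(matches, sig_query, sig_db, tau=24):
    """Keep a coarse (same-word) SIFT match only if the Hamming distance
    between the two binary signatures is no larger than tau."""
    kept = []
    for i, j in matches:  # (query keypoint index, database keypoint index)
        if bin(int(sig_query[i]) ^ int(sig_db[j])).count("1") <= tau:
            kept.append((i, j))
    return kept
```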

However, SIFT only describes the local gradient distribution and serves as a low-level representation, which is often not distinctive enough to prevent false matches. As shown in Fig. 2, some local patches are very similar in the SIFT feature space but depict different content. This challenging problem is mainly due to the semantic gap, so seeing the bigger picture and adopting semantic clues may be a good way to bridge it.

Recently, deep convolutional neural networks (CNNs) have achieved state-of-the-art performance in many computer vision tasks such as image classification [6], [7], object detection [8], and semantic segmentation [9]. With their deep architectures, semantic abstractions close to human cognition can be learned. A number of recent works show that CNN features trained on large and diverse datasets such as ImageNet [10] can be used to solve tasks for which they were not trained. For image retrieval in particular, many works have adopted solutions based on off-the-shelf features extracted from a pre-trained CNN, achieving promising performance on popular benchmarks [11], [12], [13], [14].

In this paper, we propose to adopt semantic-aware CNN features to improve SIFT matching accuracy, which in turn improves particular object retrieval performance. Considering that a global CNN representation is sensitive to the background clutter common in object retrieval (see Fig. 1), we extract CNN features at the object level: we detect a number of potential objects in each image and extract a CNN feature for each object. For each pre-matched SIFT pair, we choose appropriate semantic evidence from the candidate CNN features in a query-adaptive manner and use it to verify the SIFT match quality. By fusing low-level and high-level clues, high visual matching accuracy can be achieved.
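
A minimal sketch of how such query-adaptive verification could work is given below. It reflects one plausible reading of the description: among the database object proposals that cover the matched keypoint, the proposal whose CNN feature is most similar to the query's is selected as evidence. The function names, box format, and acceptance threshold are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    return x / max(np.linalg.norm(x), eps)

def choose_evidence(query_feat, proposal_feats, proposal_boxes, keypoint_xy):
    """Pick, among proposals containing the matched keypoint, the CNN feature
    most similar to the query's CNN feature (query-adaptive selection)."""
    best_sim, best_feat = -1.0, None
    x, y = keypoint_xy
    for feat, (x0, y0, x1, y1) in zip(proposal_feats, proposal_boxes):
        if x0 <= x <= x1 and y0 <= y <= y1:  # proposal covers the keypoint
            sim = float(l2_normalize(query_feat) @ l2_normalize(feat))
            if sim > best_sim:
                best_sim, best_feat = sim, feat
    return best_feat, best_sim

def verify_sift_match(query_feat, proposal_feats, proposal_boxes,
                      keypoint_xy, threshold=0.5):
    """Accept a pre-matched SIFT pair only if the chosen CNN evidence agrees."""
    _, sim = choose_evidence(query_feat, proposal_feats, proposal_boxes, keypoint_xy)
    return sim >= threshold
```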

The major contributions of this paper are summarized as follows. First, we adopt object-level CNN features to improve SIFT matching accuracy, and propose a query-adaptive method to choose appropriate CNN evidences to verify SIFT match quality. Second, two different visual matching verification functions are introduced; we evaluate both and show that their effectiveness differs across dataset types. Third, we explore the suitability of fine-tuning the CNN to obtain better semantic evidences for our method. Extensive experiments on benchmark datasets demonstrate that our results compare favorably to the state-of-the-art methods with acceptable memory usage and query time.
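
The two verification functions themselves are not spelled out in this excerpt. Purely as an assumption, a natural pair to contrast is a hard binary test against a similarity threshold and a soft weighting that rescales each match score by the CNN evidence similarity:

```python
def hard_verify(sim, threshold=0.5):
    """Binary verification: keep or discard the SIFT match outright."""
    return 1.0 if sim >= threshold else 0.0

def soft_verify(sim, alpha=3.0):
    """Soft verification: down-weight weak evidence instead of discarding it."""
    return max(sim, 0.0) ** alpha

def image_score(matches, verify):
    """Image similarity as a sum of verified SIFT match scores, where each
    match is a (base weight, CNN evidence similarity) pair."""
    return sum(weight * verify(sim) for weight, sim in matches)
```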

The remainder of the paper is organized as follows. Related work is reviewed in Section 2, and our approach is described in detail in Section 3. Experimental results and comparisons are provided in Section 4, and conclusions are drawn in Section 5.

Section snippets

Related work

The combination of the BoW model and local features is widely used in many object retrieval systems. The BoW model approximates the direct matching of local features, which makes it scalable to large-scale object retrieval. Popular local features include SIFT [1] and SURF [15]. The SIFT descriptor and its extension RootSIFT [16] have shown good performance in most applications. In the BoW model, a visual dictionary is trained on an independent set of local features. Then two local

Our approach

The framework of our method is shown in Fig. 3. The feature extraction process is similar in the off-line and on-line phases; the difference is that we detect multiple object proposals only on the database side, as described in Section 3.2. Both SIFT and CNN features are stored in the indexing structure. In the on-line phase, we first match the SIFTs using the bag-of-words model. Then, for each pre-matched SIFT pair, the appropriate CNN evidences are chosen in a query-adaptive manner. With

Datasets

We evaluate our method on three datasets for particular object retrieval: Oxford5k [20], Paris6k [41] and INSTRE [5]. All three datasets specify a query as a rectangular region delimiting the object in an image; the correct results for a query are the other images that contain this object.
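
Retrieval quality on these benchmarks is conventionally reported as mean average precision (mAP) over the query set. The sketch below computes it in the standard way; for brevity it omits the benchmarks' special handling of "junk" images, which are normally removed from the ranked list before scoring.

```python
def average_precision(ranked_ids, positives):
    """AP for one query: average of precision values at each relevant hit."""
    hits, precision_sum = 0, 0.0
    for rank, image_id in enumerate(ranked_ids, start=1):
        if image_id in positives:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / max(len(positives), 1)

def mean_average_precision(ranked_lists, positives_per_query):
    """mAP over all queries (e.g., the 55 queries of Oxford5k or Paris6k)."""
    aps = [average_precision(r, p) for r, p in zip(ranked_lists, positives_per_query)]
    return sum(aps) / len(aps)
```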

The Oxford5k dataset consists of 5063 images and the Paris6k dataset contains 6392 images. Both Oxford5k and Paris6k have 55 query images, comprising 11 different buildings. The Flickr100k [41] is

Conclusions

This paper proposes to employ CNN evidences to improve SIFT matching accuracy, which plays a critical role in improving the object retrieval performance. We decompose the image into several object regions and extract CNN features from them. A query-adaptive method is proposed to select appropriate evidence from the regional CNN features, which weakens the interference of background noise. Two different verification functions are introduced to verify the SIFT matches. Extensive experiments

Acknowledgment

This work was supported by the National Science and Technology Supporting Program of China under Grant No. 2015BAH49F01 and the Key Technology R&D Program of Beijing under Grant No. D161100005216001.

References (51)

  • K. Simonyan et al., Very deep convolutional networks for large-scale image recognition, Proceedings of the International Conference on Learning Representations (2015)
  • S. Ren et al., Faster R-CNN: towards real-time object detection with region proposal networks, Proceedings of the Conference on Neural Information Processing Systems (2015)
  • J. Long et al., Fully convolutional networks for semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
  • J. Deng et al., ImageNet: a large-scale hierarchical image database, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2009)
  • A.S. Razavian et al., CNN features off-the-shelf: an astounding baseline for recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (2014)
  • A. Babenko et al., Neural codes for image retrieval, Proceedings of the European Conference on Computer Vision (2014)
  • Y. Gong et al., Multi-scale orderless pooling of deep convolutional activation features, Proceedings of the European Conference on Computer Vision (2014)
  • L. Xie et al., Image classification and retrieval are one, Proceedings of the ACM International Conference on Multimedia Retrieval (2015)
  • H. Bay et al., SURF: speeded up robust features, Computer Vision and Image Understanding (2006)
  • R. Arandjelovic et al., Three things everyone should know to improve object retrieval, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2012)
  • D. Qin et al., Query adaptive similarity for large scale object retrieval, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2013)
  • G. Tolias et al., To aggregate or not to aggregate: selective match kernels for image search, Proceedings of the IEEE International Conference on Computer Vision (2013)
  • L. Zheng et al., Packing and padding: coupled multi-index for accurate image retrieval, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014)
  • J. Philbin et al., Object retrieval with large vocabularies and fast spatial matching, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2007)
  • W. Zhou et al., SIFT match verification by geometric coding for large-scale partial-duplicate web image search, ACM Transactions on Multimedia Computing, Communications, and Applications (2013)

Guixuan Zhang received the B.S. degree in Measurement and Control Technology from the University of Science and Technology Beijing, China, in 2012. He is currently pursuing the Ph.D. degree at the Digital Content Technology and Media Service Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include image retrieval, machine learning and pattern recognition.

Zhi Zeng received the B.S. and M.S. degrees in Computer Science in 2003 and 2006, respectively, both from Chongqing University, China, and the Ph.D. degree in Pattern Recognition from the Institute of Automation, Chinese Academy of Sciences, in 2009. He is currently a Senior Engineer at the Institute of Automation, Chinese Academy of Sciences. His research interests include information retrieval, machine learning, multimedia, and digital rights management.

Shuwu Zhang received his Ph.D. from the Chinese Academy of Sciences in 1997. He is currently a professor at the Institute of Automation, Chinese Academy of Sciences. His research interests focus on digital content analysis, digital rights management, and web-based cultural content service technologies.

Yuan Zhang received the B.S. degree in Automation from Beijing University of Posts and Telecommunications in 2009 and the Ph.D. degree in Pattern Recognition from the Institute of Automation, Chinese Academy of Sciences, in 2014. He is currently an Algorithm Engineer at Alibaba Group. His research interests include image retrieval, object detection, machine learning, and pattern recognition.

Wanchun Wu received the B.S. and M.S. degrees in Computer Science in 2003 and 2006, respectively, both from Chongqing University, China. He is currently an engineer at the Children's Hospital of Chongqing Medical University, Chongqing, China. His research interests include relational databases, medical information and computer vision.
