SIFT Matching with CNN Evidences for Particular Object Retrieval
Introduction
This paper considers the task of particular object retrieval. Given a query image in which a particular object has been selected, the retrieval system should return from its corpus a set of relevant images in which that object appears. This is a harder problem than whole-image retrieval, since the query object may appear cluttered or occluded, against diverse backgrounds, in the returned images. Some examples are shown in Fig. 1.
Most object retrieval systems are based on matching of local features. Local feature extraction involves two steps. The first detects keypoints of interest, delineates a local patch around every keypoint, and normalizes each patch to a fixed size. The second describes the normalized patches with algorithms such as SIFT [1]. However, matching the 128-dimensional SIFT descriptors directly is computationally expensive. To speed up matching, the bag-of-words (BoW) model is widely used [2], [3]. The BoW model defines a visual dictionary and quantizes the local features to visual words, so an image can be represented by a frequency histogram of visual words. Local features are matched if they are quantized to the same visual word, and the similarity between two images can then be expressed as the inner product of their BoW representations. An inverted index, which exploits the sparsity of the BoW representation, further makes the search efficient. Since matching SIFT descriptors on visual words alone may be too coarse, Jégou et al. [4] propose Hamming embedding (HE) to improve the matching accuracy. HE attaches a compact binary signature to each SIFT descriptor at quantization time; two coarsely matched descriptors are then filtered out if the Hamming distance between their binary signatures exceeds a threshold.
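The BoW pipeline above (quantization, histogram similarity, and the Hamming-embedding filter) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the tiny codebook and the threshold tau=24 are assumed values for the sketch.

```python
import numpy as np

def quantize(descriptors, codebook):
    """Assign each local descriptor (128-D for SIFT) to its nearest visual word."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def bow_histogram(word_ids, vocab_size):
    """L2-normalized frequency histogram of visual words for one image."""
    hist = np.bincount(word_ids, minlength=vocab_size).astype(float)
    return hist / (np.linalg.norm(hist) + 1e-12)

def bow_similarity(hist_a, hist_b):
    """Image similarity as the inner product of BoW representations."""
    return float(hist_a @ hist_b)

def he_filter(sig_a, sig_b, tau=24):
    """Hamming embedding: keep a coarse (same-word) match only if the
    Hamming distance between the binary signatures is at most tau."""
    return bin(sig_a ^ sig_b).count("1") <= tau
```

In a real system the codebook comes from clustering an independent SIFT set, and the inverted index stores, per visual word, the list of images (and signatures) containing it, so only same-word pairs are ever compared.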
However, SIFT only describes the local gradient distribution and serves as a low-level representation, which is often not distinctive enough to prevent false matches. As shown in Fig. 2, some local patches are very close in the SIFT feature space yet depict different content. This problem stems mainly from the semantic gap, so looking at the larger image context and exploiting semantic clues is a promising way to bridge it.
Recently, deep convolutional neural networks (CNNs) have achieved state-of-the-art performance in many computer vision tasks such as image classification [6], [7], object detection [8] and semantic segmentation [9]. With deep architectures, semantic abstractions close to human cognition can be learned. A number of recent works show that CNN features trained on large and diverse datasets such as ImageNet [10] can be transferred to tasks for which they were not trained. For image retrieval in particular, many works adopt off-the-shelf features extracted from a pre-trained CNN, achieving promising performance on popular benchmarks [11], [12], [13], [14].
In this paper, we propose to exploit semantic-aware CNN features to improve SIFT matching accuracy, which in turn improves particular object retrieval performance. Since a global CNN representation is sensitive to background clutter, which is common in object retrieval (see Fig. 1), we extract CNN features at the object level: we detect a number of potential objects in each image and extract a CNN feature for each of them. For each pre-matched SIFT pair, we choose the appropriate semantic evidence from the candidate CNN features in a query-adaptive manner and use it to verify the quality of the SIFT match. By fusing low-level and high-level clues, high visual matching accuracy can be achieved.
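As a rough illustration of the query-adaptive step (the actual verification functions are defined in Section 3), one plausible form is: among the CNN features of the database object proposals, pick the one most similar to the query region's feature, and accept the SIFT match only if that similarity clears a threshold. The cosine measure and the threshold tau are assumptions for this sketch, not values from the paper.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def verify_match(query_cnn, proposal_cnns, tau=0.5):
    """Query-adaptive verification of one pre-matched SIFT pair.

    query_cnn:     CNN feature of the query object region
    proposal_cnns: CNN features of the database object proposals
                   covering the matched database keypoint
    Returns (accepted, best_similarity): the most query-similar
    proposal supplies the semantic evidence for the match.
    """
    sims = [cosine(query_cnn, p) for p in proposal_cnns]
    best = max(sims)
    return best >= tau, best
```

The query-adaptive choice (the max over proposals) is what shields the verification from background clutter: a cluttered whole-image feature never has to represent the object.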
The major contributions of this paper are summarized as follows. First, we adopt object-level CNN features to improve SIFT matching accuracy; a query-adaptive method is proposed to choose appropriate CNN evidence for verifying the quality of each SIFT match. Second, two different visual matching verification functions are introduced; we evaluate both and show that their effectiveness differs across dataset types. Third, we explore fine-tuning the CNN to obtain better semantic evidence for the proposed method. Extensive experiments on benchmark datasets demonstrate that our results compare favorably with state-of-the-art methods at acceptable memory usage and query time.
The remainder of the paper is organized as follows. Related work is reviewed in Section 2. We describe our approach in detail in Section 3. Experimental results and comparisons are provided in Section 4, and conclusions are drawn in Section 5.
Related work
The combination of the BoW model and local features is widely used in many object retrieval systems. The BoW model approximates the direct matching of local features, which makes it scalable to large-scale object retrieval. Popular local features include SIFT [1] and SURF [15]; the SIFT descriptor and its extension RootSIFT [16] have shown good performance in most applications. In the BoW model, a visual dictionary is trained on an independent set of local features. Then two local
Our approach
The framework of our method is shown in Fig. 3. The feature extraction process is similar in the off-line and on-line phases; the difference is that we detect multiple object proposals only on the database side, as described in Section 3.2. Both SIFT and CNN features are stored in the indexing structure. In the on-line phase, we first match the SIFTs using the bag-of-words model. Then, for each pre-matched SIFT pair, the appropriate CNN evidence is chosen in a query-adaptive manner. With
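Putting the pieces together, the on-line phase described above might look like the following sketch. The per-keypoint data layout (visual word id, HE signature, proposal features) and both thresholds are illustrative assumptions; the paper's actual verification functions are given later in this section.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DbKeypoint:
    word: int            # visual word id from BoW quantization
    sig: int             # Hamming-embedding binary signature
    proposal_cnns: list  # CNN features of proposals covering the keypoint

@dataclass
class QueryKeypoint:
    word: int
    sig: int

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def score_image(query_kps, query_cnn, db_kps, tau_h=24, tau_c=0.5):
    """Count SIFT matches that survive both the Hamming-embedding
    filter and the query-adaptive CNN verification."""
    by_word = {}
    for kp in db_kps:
        by_word.setdefault(kp.word, []).append(kp)
    score = 0
    for q in query_kps:
        for d in by_word.get(q.word, []):
            if bin(q.sig ^ d.sig).count("1") > tau_h:
                continue  # coarse match rejected by HE
            if max(cosine(query_cnn, c) for c in d.proposal_cnns) >= tau_c:
                score += 1  # verified match contributes to the image score
    return score
```

In a deployed system `by_word` would be the global inverted index over the whole corpus rather than a per-image dictionary, so the loop only ever touches images sharing at least one visual word with the query.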
Datasets
We evaluate our method on three datasets for particular object retrieval: Oxford5k [20], Paris6k [41] and INSTRE [5]. All three datasets specify, as the query, a rectangular region delimiting the object in an image. The relevant results for a query are the other images that contain this object.
The Oxford5k dataset consists of 5063 images and the Paris6k dataset contains 6392 images. Both Oxford5k and Paris6k have 55 query images, comprising 11 different buildings. The Flickr100k [41] is
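Performance on these benchmarks is conventionally reported as mean average precision (mAP). A minimal sketch of that metric follows; it ignores the "junk"-image handling of the official Oxford/Paris evaluation protocol.

```python
def average_precision(ranked_ids, positives):
    """AP for one query: ranked_ids is the retrieval order over the
    corpus, positives is the set of relevant database images."""
    hits, precisions = 0, []
    for rank, img in enumerate(ranked_ids, start=1):
        if img in positives:
            hits += 1
            precisions.append(hits / rank)  # precision at each relevant hit
    return sum(precisions) / max(len(positives), 1)

def mean_ap(runs):
    """runs: list of (ranked_ids, positives) pairs, one per query."""
    return sum(average_precision(r, p) for r, p in runs) / len(runs)
```

For Oxford5k and Paris6k, `runs` would hold one entry per query image (55 queries over 11 buildings each).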
Conclusions
This paper proposes to employ CNN evidence to improve SIFT matching accuracy, which plays a critical role in improving object retrieval performance. We decompose each image into several object regions and extract CNN features from them. A query-adaptive method is proposed to select appropriate evidence from the regional CNN features, which weakens the interference of background noise. Two different verification functions are introduced to verify the SIFT matches. Extensive experiments
Acknowledgment
This work has been supported by the National Science and Technology Supporting Program of China under Grant No. 2015BAH49F01 and the Key Technology R&D Program of Beijing under Grant No. D161100005216001.
Guixuan Zhang received the B.S. degree in Measurement and Control Technology from University of Science and Technology, Beijing, China, in 2012. He is currently pursuing the Ph.D. degree at the Digital Content Technology and Media Service Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include image retrieval, machine learning and pattern recognition.
References (51)
- Fine-residual VLAD for image retrieval, Neurocomputing (2016)
- Adaptive bit allocation hashing for approximate nearest neighbor search, Neurocomputing (2015)
- Adaptive bit allocation product quantization, Neurocomputing (2016)
- Deepindex for accurate and efficient image retrieval, Proceedings of the ACM International Conference on Multimedia Retrieval (2015)
- Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. (2004)
- Video Google: a text retrieval approach to object matching in videos, Proceedings of the IEEE International Conference on Computer Vision (2003)
- Robust scalable recognition with a vocabulary tree, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2006)
- Improving bag-of-features for large scale image search, Int. J. Comput. Vis. (2009)
- INSTRE: a new benchmark for instance-level object retrieval and recognition, ACM Trans. Multimed. Comput. Commun. Appl. (2015)
- ImageNet classification with deep convolutional neural networks, Proceedings of the Conference on Neural Information Processing Systems (2012)
- Very deep convolutional networks for large-scale image recognition, Proceedings of the International Conference on Learning Representations
- Faster R-CNN: towards real-time object detection with region proposal networks, Proceedings of the Conference on Neural Information Processing Systems
- Fully convolutional networks for semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- ImageNet: a large-scale hierarchical image database, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- CNN features off-the-shelf: an astounding baseline for recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops
- Neural codes for image retrieval, Proceedings of the European Conference on Computer Vision
- Multi-scale orderless pooling of deep convolutional activation features, Proceedings of the European Conference on Computer Vision
- Image classification and retrieval are one, Proceedings of the ACM International Conference on Multimedia Retrieval
- SURF: speeded up robust features, Comput. Vis. Image Underst.
- Three things everyone should know to improve object retrieval, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Query adaptive similarity for large scale object retrieval, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- To aggregate or not to aggregate: selective match kernels for image search, Proceedings of the IEEE International Conference on Computer Vision
- Packing and padding: coupled multi-index for accurate image retrieval, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Object retrieval with large vocabularies and fast spatial matching, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- SIFT match verification by geometric coding for large-scale partial-duplicate web image search, ACM Trans. Multimed. Comput. Commun. Appl.
Zhi Zeng received the B.S. and M.S. degrees in Computer Science in 2003 and 2006, respectively, both from Chongqing University, China, and the Ph.D. degree in Pattern Recognition from the Institute of Automation, Chinese Academy of Sciences, in 2009. He is currently a Senior Engineer at the Institute of Automation, Chinese Academy of Sciences. His research interests include information retrieval, machine learning, multimedia, and digital rights management.
Shuwu Zhang received his Ph.D. from the Chinese Academy of Sciences in 1997. He is currently a professor at the Institute of Automation, Chinese Academy of Sciences. His research interests are focused on digital content analysis, digital rights management, and web-based cultural content service technologies.
Yuan Zhang received the B.S. degree in Automation from Beijing University of Posts and Telecommunications in 2009 and the Ph.D. degree in Pattern Recognition from the Institute of Automation, Chinese Academy of Sciences in 2014. He is currently an Algorithm Engineer in Alibaba Group. His research interests include image retrieval, object detection, machine learning, and pattern recognition.
Wanchun Wu received the B.S. and M.S. degrees in Computer Science in 2003 and 2006, respectively, both from Chongqing University, China. He is currently an engineer at the Children's Hospital of Chongqing Medical University, Chongqing, China. His research interests include relational databases, medical information and computer vision.