Neurocomputing

Volume 238, 17 May 2017, Pages 399-409

SIFT Matching with CNN Evidences for Particular Object Retrieval

https://doi.org/10.1016/j.neucom.2017.01.081

Abstract

Many object instance retrieval systems are based on matching of local features such as SIFT. However, these local descriptors serve as low-level clues that are often not distinctive enough to prevent false matches. Recently, deep convolutional neural networks (CNNs) have shown promise as semantic-aware representations for many computer vision tasks. In this paper, we propose a novel approach that employs CNN evidences to improve SIFT matching accuracy, which plays a critical role in improving object retrieval performance. To weaken the interference of background noise, we extract compact CNN representations from a number of generic object regions. A query-adaptive method is then proposed to choose appropriate CNN evidence to verify each pre-matched SIFT pair. Two different visual matching verification functions are introduced and evaluated. Moreover, we investigate the suitability of fine-tuning the CNN for our approach. Extensive experiments on benchmark datasets demonstrate the effectiveness of our method for particular object retrieval. Our results compare favorably to the state-of-the-art methods with acceptable memory usage and query time.

Introduction

This paper considers the task of particular object retrieval. Given a query image in which a particular object has been selected, the retrieval system should return from its corpus the set of relevant images in which that object appears. This is a harder problem than whole-image retrieval, since the query object may appear cluttered or occluded against diverse backgrounds in the relevant images. Some examples are shown in Fig. 1.

Most object retrieval systems are based on matching of local features, which are extracted in two steps. The first step detects keypoints of interest, delineates a local patch around each keypoint, and normalizes the patch to a fixed size. The second step describes the normalized patches with algorithms such as SIFT [1]. However, directly matching the 128-dimensional SIFT descriptors is time-consuming. To speed up the matching process, the bag-of-words (BoW) model is widely used [2], [3]. The BoW model defines a visual dictionary and quantizes the local features to visual words, so that an image can be represented by a frequency histogram of visual words. Local features are matched if they are quantized to the same visual word, and the similarity between two images can then be expressed as the inner product of their BoW representations. An inverted index, which exploits the sparsity of the BoW representation, also makes the search efficient. Since matching SIFTs based only on visual words may be too coarse, Jégou et al. [4] propose Hamming embedding (HE) to improve the matching accuracy. HE attaches a compact binary signature to each SIFT during quantization; two coarsely matched SIFTs are then filtered out if the Hamming distance between their binary signatures is larger than a threshold.
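
Although the paper itself contains no code, the BoW matching and HE filtering steps described above can be summarized in a short sketch. This is a minimal illustration assuming a pre-trained vocabulary and 64-bit HE signatures; the threshold value is a placeholder, not a setting from the paper.

```python
import numpy as np

def quantize(descriptors, vocabulary):
    """Hard-assign each local descriptor to its nearest visual word."""
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def bow_histogram(word_ids, vocab_size):
    """Represent an image as an L2-normalized frequency histogram of visual words."""
    hist = np.bincount(word_ids, minlength=vocab_size).astype(np.float32)
    return hist / max(np.linalg.norm(hist), 1e-12)

def bow_similarity(hist_a, hist_b):
    """Image similarity is the inner product of the BoW histograms."""
    return float(hist_a @ hist_b)

def he_filter(matches, sig_query, sig_db, tau=24):
    """Keep a coarse (same-word) SIFT match only if the Hamming distance
    between the two binary signatures is no larger than tau."""
    kept = []
    for i, j in matches:  # (query keypoint index, database keypoint index)
        if bin(int(sig_query[i]) ^ int(sig_db[j])).count("1") <= tau:
            kept.append((i, j))
    return kept
```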

However, SIFT only describes the local gradient distribution and serves as a low-level representation, which is often not distinctive enough to prevent false matches. As shown in Fig. 2, some local patches are very similar in the SIFT feature space but depict different content. This challenging problem is mainly due to the semantic gap, so seeing the bigger picture and adopting semantic clues may be a good way to bridge it.

Recently, deep convolutional neural networks (CNNs) have achieved state-of-the-art performance in many computer vision tasks such as image classification [6], [7], object detection [8], and semantic segmentation [9]. With their deep architectures, semantic abstractions close to human cognition can be learned. A number of recent works show that CNN features trained on large and diverse datasets such as ImageNet [10] can be used to solve tasks for which they were not trained. For image retrieval in particular, many works have adopted solutions based on off-the-shelf features extracted from a pre-trained CNN, achieving promising performance on popular benchmarks [11], [12], [13], [14].

In this paper, we propose to adopt semantic-aware CNN features to improve SIFT matching accuracy, which in turn improves particular object retrieval performance. Considering that a global CNN representation is sensitive to the background clutter common in object retrieval (see Fig. 1), we extract CNN features at the object level: we detect a number of potential objects in each image and extract a CNN feature for each object. For each pre-matched SIFT pair, we choose appropriate semantic evidence from the candidate CNN features in a query-adaptive manner and use it to verify the SIFT match quality. By fusing low-level and high-level clues, high visual matching accuracy can be achieved.
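
A minimal sketch of how such query-adaptive verification could work is given below. It reflects one plausible reading of the description: among the database object proposals that cover the matched keypoint, the proposal whose CNN feature is most similar to the query's is selected as evidence. The function names, box format, and acceptance threshold are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    return x / max(np.linalg.norm(x), eps)

def choose_evidence(query_feat, proposal_feats, proposal_boxes, keypoint_xy):
    """Pick, among proposals containing the matched keypoint, the CNN feature
    most similar to the query's CNN feature (query-adaptive selection)."""
    best_sim, best_feat = -1.0, None
    x, y = keypoint_xy
    for feat, (x0, y0, x1, y1) in zip(proposal_feats, proposal_boxes):
        if x0 <= x <= x1 and y0 <= y <= y1:  # proposal covers the keypoint
            sim = float(l2_normalize(query_feat) @ l2_normalize(feat))
            if sim > best_sim:
                best_sim, best_feat = sim, feat
    return best_feat, best_sim

def verify_sift_match(query_feat, proposal_feats, proposal_boxes,
                      keypoint_xy, threshold=0.5):
    """Accept a pre-matched SIFT pair only if the chosen CNN evidence agrees."""
    _, sim = choose_evidence(query_feat, proposal_feats, proposal_boxes, keypoint_xy)
    return sim >= threshold
```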

The major contributions of this paper are summarized as follows. First, we adopt object-level CNN features to improve SIFT matching accuracy, and propose a query-adaptive method to choose appropriate CNN evidences to verify SIFT match quality. Second, two different visual matching verification functions are introduced; we evaluate both and show that their effectiveness differs across dataset types. Third, we explore the suitability of fine-tuning the CNN to obtain better semantic evidences for our method. Extensive experiments on benchmark datasets demonstrate that our results compare favorably to the state-of-the-art methods with acceptable memory usage and query time.
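
The two verification functions themselves are not spelled out in this excerpt. Purely as an assumption, a natural pair to contrast is a hard binary test against a similarity threshold and a soft weighting that rescales each match score by the CNN evidence similarity:

```python
def hard_verify(sim, threshold=0.5):
    """Binary verification: keep or discard the SIFT match outright."""
    return 1.0 if sim >= threshold else 0.0

def soft_verify(sim, alpha=3.0):
    """Soft verification: down-weight weak evidence instead of discarding it."""
    return max(sim, 0.0) ** alpha

def image_score(matches, verify):
    """Image similarity as a sum of verified SIFT match scores, where each
    match is a (base weight, CNN evidence similarity) pair."""
    return sum(weight * verify(sim) for weight, sim in matches)
```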

The remainder of the paper is organized as follows. Related work is reviewed in Section 2, and our approach is described in detail in Section 3. Experimental results and comparisons are provided in Section 4, and conclusions are drawn in Section 5.

Section snippets

Related work

The combination of the BoW model and local features is widely used in many object retrieval systems. The BoW model approximates the direct matching of local features, which makes it scalable to large-scale object retrieval. Popular local features include SIFT [1] and SURF [15]. The SIFT descriptor and its extension RootSIFT [16] have shown good performance in most applications. In the BoW model, a visual dictionary is trained on an independent set of local features. Then two local

Our approach

The framework of our method is shown in Fig. 3. The feature extraction process is similar in the off-line and on-line phases; the difference is that we detect multiple object proposals only on the database side, as described in Section 3.2. Both SIFT and CNN features are stored in the indexing structure. In the on-line phase, we first match the SIFTs using the bag-of-words model. Then, for each pre-matched SIFT pair, the appropriate CNN evidences are chosen in a query-adaptive manner. With

Datasets

We evaluate our method on three datasets for particular object retrieval: Oxford5k [20], Paris6k [41] and INSTRE [5]. All three datasets specify a query as a rectangular region delimiting the object in an image; the correct results for a query are the other images that contain this object.
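
Retrieval quality on these benchmarks is conventionally reported as mean average precision (mAP) over the query set. The sketch below computes it in the standard way; for brevity it omits the benchmarks' special handling of "junk" images, which are normally removed from the ranked list before scoring.

```python
def average_precision(ranked_ids, positives):
    """AP for one query: average of precision values at each relevant hit."""
    hits, precision_sum = 0, 0.0
    for rank, image_id in enumerate(ranked_ids, start=1):
        if image_id in positives:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / max(len(positives), 1)

def mean_average_precision(ranked_lists, positives_per_query):
    """mAP over all queries (e.g., the 55 queries of Oxford5k or Paris6k)."""
    aps = [average_precision(r, p) for r, p in zip(ranked_lists, positives_per_query)]
    return sum(aps) / len(aps)
```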

The Oxford5k dataset consists of 5063 images and the Paris6k dataset contains 6392 images. Both Oxford5k and Paris6k have 55 query images, comprising 11 different buildings. The Flickr100k [41] is

Conclusions

This paper proposes to employ CNN evidences to improve SIFT matching accuracy, which plays a critical role in improving the object retrieval performance. We decompose the image into several object regions and extract CNN features from them. A query-adaptive method is proposed to select appropriate evidence from the regional CNN features, which weakens the interference of background noise. Two different verification functions are introduced to verify the SIFT matches. Extensive experiments

Acknowledgment

This work was supported by the National Science and Technology Supporting Program of China under Grant No. 2015BAH49F01 and the Key Technology R&D Program of Beijing under Grant No. D161100005216001.

References (51)

  • K. Simonyan et al., Very deep convolutional networks for large-scale image recognition, Proceedings of the International Conference on Learning Representations (2015)
  • S. Ren et al., Faster R-CNN: towards real-time object detection with region proposal networks, Proceedings of the Conference on Neural Information Processing Systems (2015)
  • J. Long et al., Fully convolutional networks for semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
  • J. Deng et al., ImageNet: a large-scale hierarchical image database, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2009)
  • A.S. Razavian et al., CNN features off-the-shelf: an astounding baseline for recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (2014)
  • A. Babenko et al., Neural codes for image retrieval, Proceedings of the European Conference on Computer Vision (2014)
  • Y. Gong et al., Multi-scale orderless pooling of deep convolutional activation features, Proceedings of the European Conference on Computer Vision (2014)
  • L. Xie et al., Image classification and retrieval are one, Proceedings of the ACM International Conference on Multimedia Retrieval (2015)
  • H. Bay et al., SURF: speeded up robust features, Computer Vision and Image Understanding (2006)
  • R. Arandjelovic et al., Three things everyone should know to improve object retrieval, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2012)
  • D. Qin et al., Query adaptive similarity for large scale object retrieval, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2013)
  • G. Tolias et al., To aggregate or not to aggregate: selective match kernels for image search, Proceedings of the IEEE International Conference on Computer Vision (2013)
  • L. Zheng et al., Packing and padding: coupled multi-index for accurate image retrieval, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014)
  • J. Philbin et al., Object retrieval with large vocabularies and fast spatial matching, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2007)
  • W. Zhou et al., SIFT match verification by geometric coding for large-scale partial-duplicate web image search, ACM Transactions on Multimedia Computing, Communications, and Applications (2013)

Guixuan Zhang received the B.S. degree in Measurement and Control Technology from the University of Science and Technology Beijing, China, in 2012. He is currently pursuing the Ph.D. degree at the Digital Content Technology and Media Service Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include image retrieval, machine learning and pattern recognition.

Zhi Zeng received the B.S. and M.S. degrees in Computer Science in 2003 and 2006, respectively, both from Chongqing University, China, and the Ph.D. degree in Pattern Recognition from the Institute of Automation, Chinese Academy of Sciences, in 2009. He is currently a Senior Engineer at the Institute of Automation, Chinese Academy of Sciences. His research interests include information retrieval, machine learning, multimedia, and digital rights management.

Shuwu Zhang received his Ph.D. from the Chinese Academy of Sciences in 1997. He is currently a professor at the Institute of Automation, Chinese Academy of Sciences. His research interests focus on digital content analysis, digital rights management, and web-based cultural content service technologies.

Yuan Zhang received the B.S. degree in Automation from Beijing University of Posts and Telecommunications in 2009 and the Ph.D. degree in Pattern Recognition from the Institute of Automation, Chinese Academy of Sciences, in 2014. He is currently an Algorithm Engineer at Alibaba Group. His research interests include image retrieval, object detection, machine learning, and pattern recognition.

Wanchun Wu received the B.S. and M.S. degrees in Computer Science in 2003 and 2006, respectively, both from Chongqing University, China. He is currently an engineer at the Children's Hospital of Chongqing Medical University, Chongqing, China. His research interests include relational databases, medical information and computer vision.
