Elsevier

Pattern Recognition Letters

Volume 34, Issue 11, 1 August 2013, Pages 1211-1220

An improvement to the SIFT descriptor for image representation and matching

https://doi.org/10.1016/j.patrec.2013.03.021

Highlights

  • The elliptical neighborhood region is normalized to enhance the invariance to viewpoint change.

  • The affine scale-space is applied to increase the scale invariance.

  • The polar histogram orientation bin is used to improve SIFT descriptor’s rotation invariance.

  • Rearranging the descriptor is used to increase the mirror reflection invariance.

Abstract

Constructing proper descriptors for interest points in images is a critical aspect of local-feature tasks in computer vision and pattern recognition. Although the SIFT descriptor has been shown to perform better than other existing local descriptors, it is not sufficiently distinctive or robust in image matching, especially under affine and mirror transformations, where many mismatches can occur. This paper presents an improvement to the SIFT descriptor for image matching and retrieval. The proposed descriptor is built in four steps: normalizing the elliptical neighboring region, transforming to the affine scale-space, improving the SIFT descriptor with polar histogram orientation bins, and integrating mirror reflection invariance. A comparative evaluation of different descriptors shows that the present approach provides better results than the existing methods.

Introduction

Local interest point matching is the task of matching corresponding points between two or more images. It has proven to be very successful in many pattern recognition and computer vision tasks such as wide baseline matching, object recognition and tracking, texture recognition, image retrieval and reconstruction, robot localization, video data mining, building panoramas, stereo correspondence, recovering camera motion, and recognition of object categories (Tuytelaars and Van Gool, 2004; Shin and Tjahjadi, 2010; Zhang and Wang, 2011; Wu and Rehg, 2011; Arican and Frossard, 2012). The two essential aspects of local interest point matching are the detection and the description of interest points (Li and Ma, 2009). Detection determines the reliable feature points that are used for matching and, at the same time, the neighboring regions from which the descriptors are computed. Description involves creating a distinctive and robust descriptor for each interest point.

For a good feature descriptor, two criteria should be considered in describing the extracted feature points (Li and Ma, 2009). The first is distinctiveness: the extracted feature should carry enough information to distinguish the interest points, so the descriptor of an interest point must be as discriminative as possible. The second is robustness to photometric and geometric deformations. Descriptors should be robust to photometric changes such as illumination direction, color, highlight, and intensity change. They should also be invariant to geometric variations such as rotation, translation, scaling, mirror reflection, and even viewpoint change.

Many different descriptors for interest points have been developed and have proven to be very successful in applications. Excellent reviews of the existing descriptors can be found in Li and Allinson (2008), Brown et al. (2011), and Florindo et al. (2012). These descriptors can be divided into five classes:

The first class is the distribution-based descriptors. These techniques use histograms to represent different characteristics of shape or appearance. Belongie et al. (2002) proposed a shape context descriptor, which at a reference shape point captures the distribution of the remaining points relative to it. Carneiro and Jepson (2002) proposed a phase-based local feature which is based on the phase and amplitude responses of complex-valued steerable filters. Lowe (2004) proposed the scale invariant feature transform (SIFT), which combines a scale invariant region detector and a descriptor based on the gradient distribution in the corresponding regions. Several attempts to improve the SIFT descriptor have been reported in the literature. The PCA-SIFT descriptor (Ke and Sukthankar, 2004) is an extension of the SIFT descriptor that reduces the dimension of the descriptor vector from 128 to 36 using PCA. GLOH (Mikolajczyk and Schmid, 2005) is also an extension of the SIFT descriptor, designed to increase its robustness and distinctiveness. Morel and Yu (2009) proposed an affine SIFT, which simulates all the distortions caused by variations in the direction of the camera’s optical axis and then applies SIFT to the simulated images. Guo et al. (2010) presented a mirror reflection invariant descriptor (MIFT), which is inspired by SIFT.

The second class is the differential-based descriptors. These descriptors employ a set of image derivatives computed up to a given order in a point neighborhood. Florack et al. (1991) proposed a descriptor based on the differential invariants, which combines components of the local derivatives to obtain rotation invariance. Schmid and Mohr (1997) described the interest points using local differential gray-level invariants, and the descriptors are invariant to scale, intensity, and rotation transformations.

The third class is the filter-based descriptors. The steerable filter descriptor (Freeman and Adelson, 1991) employs quadrature pairs of derivatives of Gaussian and Hilbert transforms to synthesize any filter of a given frequency with arbitrary phase. The Gabor filter descriptor (Lee, 1996) uses a set of Gabor filters tuned to various frequencies and orientations to represent the image patterns. Baumberg (2000) proposed a complex filter which uses the Gaussian derivatives. Moreno et al. (2009) improved the SIFT descriptor with the Gabor smoothing derivative filters. Gómez and Romero (2011) introduced a curvelet-based descriptor which is calculated from the statistical pattern of the curvelet coefficients.

The fourth class is the color-based descriptors. These descriptors exploit color invariants that are robust to varying imaging conditions. Gevers and Smeulders (1999) proposed some new color models for the recognition of multicolored objects. Diplaros et al. (2006) described a method to merge color and shape invariant information in the context of object recognition and image retrieval. Abdel-Hakim and Farag (2006) introduced CSIFT as a colored local invariant feature descriptor. Stokman and Gevers (2007) proposed a generic selection model to select and weight color invariant models for discriminatory and robust image feature detection. Verma et al. (2011) presented new color SIFT descriptors, which extend the SIFT descriptor to different color spaces.

The fifth class comprises other descriptors that do not fit the four basic types above. Ojala et al. (2002) proposed the local binary pattern (LBP), which builds statistics on local micropattern variations. Chen et al. (2010) developed the Weber local descriptor (WLD), which is based on human perception and is robust to noise. Chen and Sun (2010) presented a new image descriptor to represent the normalized region, which primarily comprises the Zernike moment (ZM) phase information.
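To make the "local micropattern" idea behind LBP concrete, the following is a minimal sketch of the basic 3×3 variant, in which each of the eight neighbours is thresholded against the centre pixel and the resulting bits form a code. The bit ordering and the helper names (`lbp_code`, `lbp_histogram`) are assumptions chosen here for illustration; the multiresolution and rotation-invariant extensions described by Ojala et al. (2002) are omitted.

```python
from collections import Counter

# Clockwise 8-neighbour offsets (row, col), starting at the top-left pixel.
# The starting point and direction are an arbitrary convention assumed here.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image, r, c):
    """Return the 8-bit LBP code of the interior pixel at (r, c)."""
    centre = image[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(NEIGHBOURS):
        # A neighbour contributes a 1-bit when it is at least as bright
        # as the centre pixel.
        if image[r + dr][c + dc] >= centre:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """Count LBP codes over all interior pixels of a 2-D list of intensities."""
    rows, cols = len(image), len(image[0])
    return Counter(lbp_code(image, r, c)
                   for r in range(1, rows - 1)
                   for c in range(1, cols - 1))
```

Because only the sign of each neighbour-minus-centre difference is kept, adding a constant to every pixel leaves the codes unchanged.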

The SIFT descriptor is one of the most successful and popular local image descriptors among all those mentioned above. Until recently, it had been shown to perform better than the other local invariant feature descriptors (Mikolajczyk and Schmid, 2005). However, the SIFT descriptor is neither mirror reflection invariant nor completely invariant to viewpoint change, and its scale and rotation invariance is only approximate for digital images. In this paper, we propose to improve the SIFT descriptor by addressing all of these disadvantages. Firstly, a normalized elliptical neighboring region is used to enhance the invariance to viewpoint change. Secondly, the affine scale-space is applied to increase the scale invariance. Thirdly, the polar histogram orientation bin is used to improve the rotation invariance. Finally, the descriptor is rearranged to ensure mirror reflection invariance.
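The reason a rearrangement can restore mirror invariance is that a horizontal flip negates the horizontal gradient component, so every gradient angle θ becomes π − θ and the orientation histogram is permuted rather than destroyed. The sketch below illustrates this on a single orientation histogram. It is not the paper's procedure, which this excerpt does not spell out; the function names, the 8 bins centred on multiples of 45°, and the central-difference gradients are all assumptions made for illustration.

```python
import math


def orientation_histogram(patch, bins=8):
    """Magnitude-weighted gradient orientation histogram over interior pixels.

    Bin k is centred on angle k * (2*pi / bins); this layout is an assumption.
    """
    hist = [0.0] * bins
    width = 2.0 * math.pi / bins
    for r in range(1, len(patch) - 1):
        for c in range(1, len(patch[0]) - 1):
            gx = patch[r][c + 1] - patch[r][c - 1]   # central differences
            gy = patch[r + 1][c] - patch[r - 1][c]
            mag = math.hypot(gx, gy)
            if mag == 0:
                continue
            angle = math.atan2(gy, gx) % (2.0 * math.pi)
            hist[round(angle / width) % bins] += mag
    return hist


def mirror(patch):
    """Flip a patch left to right."""
    return [row[::-1] for row in patch]


def rearrange_for_mirror(hist):
    """Permute bins as a horizontal flip would: bin k moves to (bins/2 - k)."""
    bins = len(hist)
    return [hist[(bins // 2 - k) % bins] for k in range(bins)]
```

Under these assumptions, `rearrange_for_mirror(orientation_histogram(p))` agrees with `orientation_histogram(mirror(p))` up to floating-point rounding. In a full SIFT-style descriptor the spatial cells would also have to be reordered left to right, a step this sketch leaves out.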

The remainder of this paper is organized as follows. Section 2 presents our proposed algorithm. Section 3 introduces the evaluation criteria and the data sets for the experiments. In Section 4 the experimental results and analysis are provided. Finally, the paper is concluded in Section 5.

Section snippets

Normalizing elliptical neighboring region

Recently, several studies have focused on making local features invariant to viewpoint change. Mikolajczyk and Schmid (2004) proposed Harris-Affine and Hessian-Affine, which obtain invariance to viewpoint change through an affine adaptation process based on the second moment matrix. Their approach seeks affine invariance at the feature detection stage. Morel and Yu (2009) proposed an affine SIFT, which simulates all the distortions caused by variations in
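As a rough illustration of what normalizing an elliptical region means, suppose the region is described by a symmetric positive-definite 2×2 matrix M, so that its boundary satisfies pᵀMp = 1. Mapping each point through M^(1/2) then sends the ellipse to the unit circle. The sketch below uses the closed-form square root of a 2×2 matrix; the function names and the convention pᵀMp = 1 are assumptions made here, since this excerpt does not give the paper's exact normalization.

```python
import math


def sqrt_spd_2x2(a, b, c):
    """Square root of the symmetric positive-definite matrix [[a, b], [b, c]].

    Uses sqrt(M) = (M + s*I) / t with s = sqrt(det M), t = sqrt(trace M + 2*s).
    Returns the three distinct entries of the (symmetric) result.
    """
    s = math.sqrt(a * c - b * b)
    t = math.sqrt(a + c + 2.0 * s)
    return ((a + s) / t, b / t, (c + s) / t)


def normalize_point(m, x, y):
    """Map a point (x, y) from the elliptical frame to the normalized frame.

    m is the tuple (a, b, c) describing M; points with p^T M p = 1 land on
    the unit circle.
    """
    ra, rb, rc = sqrt_spd_2x2(*m)
    return (ra * x + rb * y, rb * x + rc * y)
```

For example, with M = diag(4, 1) the ellipse has semi-axes 0.5 and 1, and the boundary point (0.5, 0) maps to (1, 0). Resampling an image patch would typically go the other way, using the inverse map from normalized coordinates back to image coordinates.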

Experimental setup

In this section, we first discuss the evaluation metrics used to quantify our results. Then, we introduce the image data used in the experiments.
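This excerpt does not name the metrics. The evaluation framework of Mikolajczyk and Schmid (2005), which the paper states it follows, reports recall against 1-precision, so the sketch below assumes those two quantities. The function name and argument names are hypothetical.

```python
def recall_and_one_minus_precision(correct_matches, false_matches,
                                   correspondences):
    """Return (recall, 1 - precision) for one matching threshold.

    recall        = correct matches / ground-truth correspondences
    1 - precision = false matches / (correct matches + false matches)
    Both ratios fall back to 0.0 when their denominator is zero.
    """
    recall = correct_matches / correspondences if correspondences else 0.0
    total = correct_matches + false_matches
    one_minus_precision = false_matches / total if total else 0.0
    return recall, one_minus_precision
```

Sweeping the matching threshold and plotting the resulting pairs gives one curve per descriptor, which is how descriptors are usually compared in that framework.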

Experimental results and analysis

We evaluate the performance of our proposed approach in two related tasks: (i) image region matching, and (ii) object retrieval. In the experiments, all the descriptors are computed on the regions detected with the Hessian-Affine detector (Mikolajczyk and Schmid, 2004), chosen for its higher accuracy. More than 200,000 region pairs are detected on the INRIA dataset and over 10 million on the Oxford 5K dataset.

Conclusions

This paper presents a modification to the SIFT descriptor. The proposed descriptor framework consists of the following steps: normalizing elliptical neighboring regions, transforming to affine scale-space, improving the SIFT descriptor with polar histogram orientation bins, and integrating mirror reflection invariance. To evaluate the performance of our descriptor, we use the framework of Mikolajczyk and Schmid (2005). Experimental comparisons on seven different feature descriptors

Acknowledgments

This work was partly supported by funds from the National Basic Research Program of China (973 Program No. 2007CB311002) and the National High Technology Research and Development Program of China (863 Program No. 2009AA01Z409). We are grateful to David Lowe, Krystian Mikolajczyk and Cordelia Schmid for providing the code for their detectors/descriptors.

References (37)

  • A. Baumberg. Reliable feature matching across widely separated views. IEEE Conference on Computer Vision and Pattern Recognition, Proceedings (2000)
  • S. Belongie et al. Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence (2002)
  • M. Brown et al. Discriminative learning of local image descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence (2011)
  • G. Carneiro et al. Phase-based local features. Computer Vision-ECCV (2002)
  • J. Chen et al. WLD: a robust local image descriptor. IEEE Transactions on Pattern Analysis and Machine Intelligence (2010)
  • Z. Chen et al. A Zernike moment phase-based descriptor for local image representation and matching. IEEE Transactions on Image Processing (2010)
  • A. Diplaros et al. Combining color and shape information for illumination-viewpoint invariant object recognition. IEEE Transactions on Image Processing (2006)
  • L. Florack et al. General intensity transformations and second order invariants. Proceedings of the Seventh Scandinavian Conference Image Analysis (1991)