Signal Processing

Volume 93, Issue 8, August 2013, Pages 2284-2292

Discriminative codebook learning for Web image search

https://doi.org/10.1016/j.sigpro.2012.04.018

Abstract

Given the explosive growth of Web images, image search plays an increasingly important role in our daily lives. The visual representation of images is fundamental to the quality of content-based image search. Recently, the bag-of-visual-word model has been widely used for image representation and has demonstrated promising performance in many applications. In the bag-of-visual-word model, the codebook/visual vocabulary plays a crucial role. The conventional codebook, generated via unsupervised clustering, does not embed the labeling information of images and therefore has limited discriminative ability. Although some research has been conducted to construct codebooks that take labeling information into account, very few attempts have been made to exploit the manifold geometry of the local feature space to improve codebook discriminative ability. In this paper, we propose a novel discriminative codebook learning method that introduces subspace learning into codebook construction and leverages its power to find a contextual local descriptor subspace that captures the discriminative information. The discriminative codebook construction and contextual subspace learning are formulated as an optimization problem and can be learned simultaneously. The effectiveness of the proposed method is evaluated through visual reranking experiments conducted on two real Web image search datasets.

Highlights

► We introduce subspace learning to embed discriminative information in the codebook.
► We optimize codebook construction and contextual subspace learning simultaneously.
► The discriminative ability of the codebook is measured at the image level.

Introduction

Given the explosive growth of Web images, image search plays an increasingly important role in our daily lives. Extensive research has been conducted to improve image search quality. Text-based image search leverages mature information retrieval techniques to index and search the images' associated textual information (filename, surrounding text, URL, etc.). Although text-based image search approaches are efficient for large-scale image indexing, they have their own limitations, since textual information cannot describe the rich content of images comprehensively. As a consequence, techniques that involve visual information have been proposed to build content-based image retrieval prototypes [1], [2], [3], [4] or to enrich textual descriptions via automatic image annotation/concept detection [5], [6]. In all of these methods, the visual representation of images plays a fundamental role. In recent years, the bag-of-visual-word (BOVW) model has been widely used for image visual representation and has demonstrated promising performance in image retrieval [7], [8], [9] and image categorization [10], [11], [12]. In BOVW, a visual codebook is first constructed by clustering a set of local features, such as SIFT [13], extracted from a training image set. Then, after quantizing all local descriptors into visual words in the codebook, each image can be represented as a histogram of visual word counts.

Obviously, in the BOVW model, the quality of the codebook directly affects the performance of image search. The most popular visual codebook generation method is K-means clustering [8], [14]. It divides a large set of training SIFT feature points in the high-dimensional feature space into clusters. Each cluster corresponds to a sub-space of the feature space, and the centroid of each cluster is treated as a visual word. All visual words constitute a visual codebook. Then, given a novel feature point, feature quantization assigns it the ID of its closest visual word in the space. As the size of the image database grows, a vocabulary tree with hierarchical K-means [7] is preferred for hierarchical clustering and efficient local feature quantization. Such unsupervised clustering-based codebook generation methods are easy to implement and have been widely used in many applications. However, they totally ignore the known labeling information of training images. As a consequence, when the labeling information of the training images is given, a codebook generated via unsupervised clustering cannot embed the important image category information. In other words, the semantic contexts are lost.
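As a concrete illustration of this standard pipeline, the following minimal sketch builds a flat K-means codebook and quantizes an image into a BOVW histogram. It assumes 128-dimensional SIFT descriptors and uses NumPy and scikit-learn's KMeans; the function names are illustrative and do not come from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(train_descriptors, num_words=1000, seed=0):
    """Cluster local descriptors (e.g., SIFT) into a visual codebook.

    train_descriptors: (N, 128) array of local features pooled from training images.
    Returns the fitted KMeans model; its cluster centers are the visual words.
    """
    kmeans = KMeans(n_clusters=num_words, random_state=seed, n_init=10)
    kmeans.fit(train_descriptors)
    return kmeans

def bovw_histogram(image_descriptors, kmeans):
    """Quantize one image's descriptors and return its normalized BOVW histogram."""
    word_ids = kmeans.predict(image_descriptors)  # nearest visual word per descriptor
    hist = np.bincount(word_ids, minlength=kmeans.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)            # L1-normalize the word counts
```

A hierarchical vocabulary tree [7] replaces the flat assignment above with a coarse-to-fine search down a tree of cluster centers, which keeps quantization cost logarithmic in the codebook size.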

To address this problem, several learning-based codebook construction methods [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27] have been proposed. These methods try to build supervised visual word codebooks in different ways: (1) refine/adapt an original codebook based on image semantic labels; (2) build class-specific vocabularies for image categorization; (3) learn discriminative and sparse coding models for object recognition; (4) generate a supervised codebook by minimizing the loss of mutual information; (5) unify codebook construction with classifier training to build a semantic vocabulary, etc. Although these approaches improve upon the traditional codebook, most of them construct the visual codebook from the raw local features. Very few attempts have been made to exploit the manifold geometry of the local feature space. Manifold learning has been proven an effective way to reveal the intrinsic structure of the original space and maximize the discriminative ability of data in the learned subspace. Wu et al. [21] proposed to construct a semantic-preserving codebook via distance metric learning. However, this method suffers from two disadvantages: (1) additional region-level annotation labels are required in the distance metric learning stage, but such region-level labels are usually unavailable; (2) the codebook construction and semantic distance metric learning are conducted in separate steps, which can hardly achieve a joint optimum. To tackle these problems, in this paper we propose a novel supervised discriminative codebook learning method which has the following advantages:

  • (1)

    Our method introduces subspace learning into codebook construction and leverages its power to find a contextual local descriptor subspace for embedding the discriminative information. In the expected subspace, images from different classes can be discriminated well.

  • (2)

    In our method, codebook construction and contextual subspace learning are formulated as a single optimization problem and learned simultaneously. First, a closed-form expression for the bag-of-visual-word histogram induced by the codebook in the desired new subspace is derived. Then, the distance between histograms of images from different classes is maximized while the distance between histograms of images from the same class is minimized (an illustrative form of such an objective is sketched after this list). This one-step optimization avoids the accumulation of errors introduced by separate steps.

  • (3)

    In our method, the discriminative ability of the codebook is measured at the image level, i.e., by directly requiring the histogram representations of images to be similar or dissimilar. Compared with methods that pursue discriminative ability at the local feature level [21], e.g., by distinguishing local features from different object parts, this is more reasonable since different objects may contain common local patches.
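The section snippets do not reproduce the paper's exact objective; purely as an illustrative sketch (the symbols W, C, h, λ and the pair sets S, D are our own notation, not the authors'), a joint criterion over the subspace projection and the codebook could take the following Fisher-style form:

```latex
% Illustrative only: W projects local descriptors into the contextual subspace,
% C = {c_1, ..., c_K} is the codebook, and h(I; W, C) is the BOVW histogram of
% image I computed from descriptors projected by W and quantized against C.
% D collects different-class image pairs, S same-class pairs; lambda balances the terms.
\max_{W,\,C}\;
\sum_{(I_i, I_j) \in \mathcal{D}} \bigl\lVert h(I_i; W, C) - h(I_j; W, C) \bigr\rVert_2^2
\;-\;
\lambda \sum_{(I_i, I_j) \in \mathcal{S}} \bigl\lVert h(I_i; W, C) - h(I_j; W, C) \bigr\rVert_2^2
```

Maximizing such a criterion pushes histograms of images from different classes apart while pulling histograms of same-class images together, which is the image-level behavior described in item (3).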

The rest of this paper is organized as follows. Related work is summarized in Section 2. The proposed discriminative codebook learning method is described in Section 3. The proposed discriminative codebook is applied to Web image search reranking and the experimental results are given in Section 4. The summary and conclusions are provided in Section 5.

Section snippets

Related work

The bag-of-visual-words (BOVW) model has been widely used in large-scale content-based image search applications. In general, the BOVW model contains two major components: codebook generation and image representation.

Supervised discriminative codebook learning

In this paper, we focus on codebook generation, which plays a key role in the BOVW model. A supervised discriminative codebook learning method is proposed. The framework of the proposed approach is illustrated in Fig. 1 and consists of three steps. In step 1, local descriptors/features are extracted from a set of training images belonging to L different classes. The scale-invariant feature transform (SIFT) [13] local descriptor is adopted in this paper. In step 2, a conventional unsupervised codebook
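For step 1, a minimal sketch of SIFT descriptor extraction is shown below. It assumes OpenCV's cv2.SIFT_create (available in recent opencv-python builds); the image paths and class labels are placeholders, not data from the paper.

```python
import cv2
import numpy as np

sift = cv2.SIFT_create()

def extract_sift(image_path):
    """Return the (num_keypoints, 128) SIFT descriptor matrix for one image."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    if descriptors is None:                       # image yielded no keypoints
        descriptors = np.empty((0, 128), dtype=np.float32)
    return descriptors

# Pool descriptors from labeled training images of L classes (paths/labels are placeholders).
train_images = [("img_001.jpg", 0), ("img_002.jpg", 1)]
all_descriptors = np.vstack([extract_sift(path) for path, _ in train_images])
```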

Experiments

In this paper, we evaluate the proposed discriminative codebook learning method on Web image search reranking. The rationale is that if the learned codebook captures discriminative information well, more relevant images should be reranked to the top and irrelevant images pushed to the bottom. Two other state-of-the-art codebook generation methods are selected for comparison, including one unsupervised codebook learning method (KM) and one supervised codebook learning method
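As a generic illustration only (the snippets do not spell out the paper's reranking protocol), visual reranking with BOVW histograms can be as simple as re-sorting the initial text-based results by histogram similarity to a query representation:

```python
import numpy as np

def rerank_by_query_similarity(result_hists, query_hist):
    """Reorder search results by cosine similarity of BOVW histograms to a query histogram.

    result_hists: (n_results, K) histograms of the initially ranked images.
    Returns result indices sorted from most to least visually similar.
    """
    q = query_hist / (np.linalg.norm(query_hist) + 1e-12)
    r = result_hists / (np.linalg.norm(result_hists, axis=1, keepdims=True) + 1e-12)
    similarities = r @ q
    return np.argsort(-similarities)
```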

Conclusion

In this paper, we propose a novel supervised discriminative codebook learning method, which not only finds a contextual subspace that embeds the discriminative information, but also learns the contextual subspace and the discriminative codebook simultaneously. In the learned new space, images from different classes are well separated and images from the same class are close to each other. We apply the proposed method to the Web image search reranking problem, and the experimental results on two real Web

References (35)

  • L. Duan et al., Adaptive relevance feedback based on Bayesian inference for image retrieval, Signal Processing (2005).
  • J. Tang et al., Video semantic analysis based on structure-sensitive anisotropic manifold ranking, Signal Processing (2009).
  • Y. Yang et al., Harmonizing hierarchical manifolds for multimedia document semantics understanding and cross-media retrieval, IEEE Transactions on Multimedia (2008).
  • Y. Yang et al., A multimedia retrieval framework based on semi-supervised ranking and relevance feedback, IEEE Transactions on Pattern Analysis and Machine Intelligence (2012).
  • M. Wang et al., Towards a relevant and diverse search of social images, IEEE Transactions on Multimedia (2010).
  • Y. Yang et al., Web and personal image annotation by mining label correlation with relaxed visual graph embedding, IEEE Transactions on Image Processing (2012).
  • D. Nister, H. Stewenius, Scalable recognition with a vocabulary tree, in: Proceedings of the IEEE Conference on...
  • J. Sivic, A. Zisserman, Video Google: a text retrieval approach to object matching in videos, in: Proceedings of the...
  • Z. Wu, Q.F. Ke, J. Sun, Bundling features for large-scale partial-duplicate web image search, in: Proceedings of the...
  • S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: spatial pyramid matching for recognizing natural scene...
  • C. Wang, D. Blei, L. Fei-Fei, Simultaneous image classification and annotation, in: Proceedings of the IEEE Conference...
  • Z. Si, H. Gong, Y.N. Wu, S.C. Zhu, Learning mixed templates for object recognition, in: Proceedings of the IEEE...
  • D.G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision (2004).
  • G. Csurka, C. Dance, L. Fan, J. Willamowski, C. Bray, Visual categorization with bags of keypoints, in: Proceedings of...
  • F. Moosmann et al., Randomized clustering forests for building fast and discriminative visual vocabularies, Advances in Neural Information Processing Systems (2007).
  • F. Jurie, B. Triggs, Creating efficient codebooks for visual recognition, in: Proceedings of the International...
  • F. Perronnin, C. Dance, G. Csurka, M. Bressan, Adapted vocabularies for generic visual categorization, in: Proceedings...

    This work was supported in part by start-up funding from the University of Science and Technology of China to X. Tian, in part by Research Enhancement Program (REP), start-up funding from the Texas State University, DoD HBCU/MI grant W911NF-12-1-0057, and NSF CRI 1058724 to Y. Lu.
