Representation of image content based on RoI-BoW

https://doi.org/10.1016/j.jvcir.2014.10.007

Highlights

  • A model named RoI-BoW is proposed, which is effective in image retrieval.

  • The influence of segmentation at different scales on image content representation is studied.

  • A filtering operator is suggested to identify the most important key points.

Abstract

Representation of image content is an important part of image annotation and retrieval, and it has become a prominent issue in computer vision. As an efficient and accurate image content representation model, bag-of-words (BoW) has attracted increasing attention in recent years. After segmentation, BoW treats all image regions equally. In fact, some regions of an image, such as a salient object or a region of interest, are more important than others in image retrieval. In this paper, a novel region-of-interest-based bag-of-words model (RoI-BoW) for image representation is proposed. First, the difference of Gaussians (DoG) is adopted to find key points in an image and to generate grids of different sizes as RoIs, from which visual words are constructed by the BoW model. Furthermore, we analyze the influence of different segmentation sizes on image content representation through content-based image retrieval. Experiments on Corel 5K verify the effectiveness of RoI-BoW for image content representation and show that RoI-BoW significantly outperforms the BoW model. Moreover, extensive experiments illustrate the influence of different segmentation sizes on image representation under the BoW and RoI-BoW models respectively. This work is helpful for choosing an appropriate grid size when representing image content in different situations, and is meaningful for image classification and retrieval.

Introduction

With the fast development of multimedia technology and the internet, the quantity of images is increasing explosively and retrieval requirements are becoming more complex. Therefore, efficient image annotation and retrieval have become necessary and urgent. For accurate image annotation and retrieval, representation of the image content is the most fundamental step and influences the results to a great extent.

There is no absolutely best representation. The human vision system can be considered the best representation system [1], [2]. In the basic process of human vision, a kind of information filtering takes place when humans look at an image: pixels of interest are retained, while uninteresting pixels are filtered out. This idea can be extended to regions. Motivated by this, we propose a novel and efficient method to represent an image.

The novelty and contribution of this work is a new model of image representation, named RoI-BoW (region-of-interest-based bag-of-words model). First, we investigate the region of interest (RoI), which is a selected subset of an image. Based on the RoI, we segment an input image into grids of different sizes. We then represent the image with the bag-of-words (BoW) model.

The concept of RoI is commonly used in many application areas. For example, in medical imaging [3], [4], [5], an RoI coder is used to preserve, as much as possible, the quality of specific image areas. In geographical information systems (GIS), an RoI can be taken literally as a polygonal selection from a 2D map [6]. In computer vision and optical character recognition [9], the RoI defines the borders of an object under consideration. In many applications, symbolic (textual) labels are added to an RoI to describe its content in a compact manner.

In this paper, we employ the difference of Gaussians (DoG [44]) to locate the RoI. DoG is based on the Laplacian operator. However, there are other good methods that could be used for this purpose, for example SURF [7] and MSER [8]. SURF detects interest points based on the determinant of the Hessian matrix. MSER extracts an affinely invariant stable subset of extremal regions. Mathematically, these methods share the same principle, since extremal points are checked by the Hessian matrix.
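As an illustrative sketch (not the paper's implementation), key-point detection by DoG can be reduced to finding local extrema of a single-scale difference-of-Gaussians response. The `sigma`, scale ratio `k`, and `threshold` values below are illustrative assumptions, and the full multi-octave pyramid used by SIFT-style detectors is omitted:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_keypoints(image, sigma=1.0, k=1.6, threshold=0.03):
    """Find candidate key points as local extrema of a single-scale
    difference-of-Gaussians response (simplified sketch, not the
    full multi-octave DoG pyramid)."""
    img = np.asarray(image, dtype=float)
    # difference of two Gaussian blurs approximates the Laplacian of Gaussian
    dog = gaussian_filter(img, k * sigma) - gaussian_filter(img, sigma)
    points = []
    h, w = dog.shape
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            v = dog[y, x]
            patch = dog[y - 1:y + 2, x - 1:x + 2]
            # keep strong responses that are extremal in their 3x3 neighbourhood
            if abs(v) > threshold and (v == patch.max() or v == patch.min()):
                points.append((y, x))
    return points
```

A bright isolated spot in an otherwise flat image, for instance, produces a single strong DoG extremum at its centre.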

Bag-of-words (BoW) is a promising content representation model, which is widely used in the text retrieval domain. Image representation based on the BoW model was proposed by Li and Perona [16], in which images were segmented into grids and visual features were extracted from each grid to form feature vectors. The vocabulary of visual words was then constructed by clustering the feature vectors, and images were represented by histograms of visual words. The BoW model has been widely used in image and video retrieval [17], [18], [19], [20], [21], [22], [23], [24], [25], for tasks such as object recognition [17], image categorization [21] and near-duplicate detection [22]. Wang et al. [18] represented images with the BoW model for image annotation, in which images were segmented into grids directly. Alvarez and Vanrell [25] utilized the BoW model based on irregular blobs to annotate images. The BoW approach to constructing visual words was also employed to form visual phrases [26], [27], [28], [29], [30], [31].
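The grid-based BoW pipeline described above (segment into grids, extract a feature per grid, cluster features into a vocabulary, represent the image as a word histogram) can be sketched as follows. The per-cell feature here, (mean, std) of intensities, is a hypothetical stand-in for richer descriptors, and the minimal k-means is a stand-in for a production clustering routine:

```python
import numpy as np

def build_vocabulary(features, k, iters=20, seed=0):
    """Cluster grid-level feature vectors into k visual words
    with a minimal k-means (illustrative, not production-grade)."""
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # assign each feature vector to its nearest center
        d = np.linalg.norm(features[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = features[labels == j].mean(axis=0)
    return centers

def bow_histogram(image, vocabulary, grid=8):
    """Represent an image as a normalized histogram of visual words.
    The (mean, std) cell feature is a hypothetical stand-in for
    features such as color histograms or Gabor texture."""
    h, w = image.shape
    feats = []
    for y in range(0, h - grid + 1, grid):
        for x in range(0, w - grid + 1, grid):
            cell = image[y:y + grid, x:x + grid]
            feats.append([cell.mean(), cell.std()])
    feats = np.asarray(feats)
    # map each grid cell to its nearest visual word
    d = np.linalg.norm(feats[:, None] - np.asarray(vocabulary)[None], axis=2)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()
```

The `grid` parameter makes the segmentation size explicit, which is exactly the knob whose influence the paper studies.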

In recent years, the BoW model has become an important image content representation method and has been widely used in image annotation and retrieval. In 2010, Li et al. [47] presented an approach based on probabilistic latent semantic analysis (PLSA) for automatic image annotation and retrieval, in which each image was represented as a bag of visual words. Chen et al. [48] studied methods based on the bag-of-visual-words representation to estimate pseudo-objects. In 2011, Hu et al. [51] proposed a scheme that obtains better quality, both subjectively and objectively, by using different weighting factors. In 2013, Inoue and Shinoda [45] proposed q-Gaussian mixture models (q-GMMs), which extend bag-of-visual-words (BoW) to a probabilistic framework. Raveaux et al. [46] proposed an automatic graph-based system to annotate and retrieve images. Zhang et al. [49] proposed a Laplacian affine sparse coding algorithm based on the bag-of-visual-words model. Kuanar et al. [50] proposed an automated method of video key-frame extraction using dynamic Delaunay graph clustering via an iterative edge-pruning strategy.

Although BoW can represent image content effectively, it retains only the frequency of visual words and loses location information. Many enhanced BoW models have been proposed [32], [33], [34], [35], [36], [37] for more accurate content representation. Considering that visual features with the same semantics may be classified into different clusters in Euclidean space, Wu et al. [32] proposed a semantics-preserving bag-of-words model (SPBoW), which optimizes the generation of visual words to learn a vocabulary aimed at minimizing the loss of semantics. To counter the loss of spatial information, Li et al. [34] proposed a contextual bag-of-words (CBoW) model, which introduces two kinds of contextual relations between grids: a semantic conceptual relation and a spatial neighboring relation.

In the BoW model, the segmentation size has a great influence on the accuracy of image content representation. A large grid size may lose position information and some image details, while a small grid size makes it difficult to describe regular texture information and lowers query efficiency. Hence a suitable segmentation size is very important for the BoW model and can make image representation more efficient and accurate. Huang et al. [38] constructed a hierarchical conditional random field model for image annotation, in which superpixels [39] were used to segment images. Zhang and Hu [40] segmented images into foreground and background by saliency analysis to annotate images. A multi-scale conditional random field for image labeling was proposed by He et al. [41], in which images were segmented and represented at three scales: local, regional and global. Monay and Gatica-Perez [42] represented images at two scales for annotation. At the global scale, features were extracted from the entire image without segmentation, while at the regional scale, they employed Ncut [43] to segment images and extracted features from the regions.

Image content representation by the original BoW model considers the image as a whole, and every part of the image is treated equally; the importance of different parts cannot be distinguished. In fact, the parts of an image are not equally important: the foreground or objects usually carry more significant semantic information. Hence, when we represent an image by visual features, it is necessary to emphasize its most important parts. In this paper, we propose a new image representation method, RoI-BoW, which adopts DoG to detect the region of interest (RoI) and treats the RoI as the most semantically important part of an image. The BoW model is then used to represent the content of the RoI and the Non-RoI respectively, and the whole image is represented by these two parts. In our method, we can give the RoI more weight in image content representation and stress its importance in image retrieval or annotation. Although the idea of extracting features around key points is not new, using key points to construct an RoI and representing the image by two parts is new in image content representation. The main point of our method is how to represent the whole image more effectively for image retrieval, and the experimental results prove the effectiveness of the RoI-BoW model. To study the influence of segmentation size on image retrieval, we conducted a large number of experiments with image representations at different segmentation sizes under the RoI-BoW and BoW models and obtained some interesting results.
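A minimal sketch of combining the two parts is given below, assuming a simple weighted concatenation of the RoI and Non-RoI visual-word histograms. Both the combination scheme and the weight `alpha` are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def roi_bow_descriptor(hist_roi, hist_non_roi, alpha=0.7):
    """Combine the RoI and Non-RoI visual-word histograms into a single
    descriptor, giving the RoI part more weight. The weighted
    concatenation and the value of alpha are illustrative assumptions."""
    hist_roi = np.asarray(hist_roi, dtype=float)
    hist_non_roi = np.asarray(hist_non_roi, dtype=float)
    return np.concatenate([alpha * hist_roi, (1.0 - alpha) * hist_non_roi])
```

With `alpha > 0.5`, two images whose RoI histograms agree score as more similar than two images that agree only on background, which is the intended effect of stressing the RoI.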

The remainder of this paper is organized as follows. Section 2 describes the region-of-interest-based bag-of-words (RoI-BoW) model. The algorithms and a complexity analysis are presented in Section 3. Experimental results and analysis are given in Section 4, followed by the conclusion in Section 5.

Section snippets

Region of interest based bag-of-words

Images with focused objects from the web are an important information resource. With the widespread use of digital cameras, the number of Web images is growing dramatically. The primary goal of image retrieval is to look for the same or similar images in a web database. Search performance is measured by the similarity of objects. How to extract the focused objects, or more generally focused regions, is a challenging problem. The capability of extracting focused regions can help to

Algorithms

The algorithms are shown as follows. Algorithm 1 illustrates the training process of the RoI-BoW model, and Algorithm 2 expresses image representation by RoI-BoW. Key-point filtering is described in Algorithm 3, followed by Algorithm 4, the image retrieval process.

Algorithm 1

Training process of the RoI-BoW model.

Input: Training image dataset DT = {I1, I2, …, IN};
Step 1: Detect key points by the DoG algorithm;
Step 2: Filter key points by Algorithm 3;
Step 3: Centering on the key points obtained in Step 2, regions of

Experiments

To verify the effectiveness of the RoI-BoW model for image content representation, we design an image retrieval algorithm in which image content is represented by the RoI-BoW model and content-based image retrieval is performed on benchmark databases. In the image retrieval algorithm, the Euclidean distance between the visual-word vectors of images is defined as the similarity measure. Furthermore, the influence of segmentation sizes on image content representation is also
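The retrieval step under this similarity measure can be sketched as below: database images are ranked by the Euclidean distance between their visual-word vectors and the query's. The dictionary layout and names are hypothetical:

```python
import numpy as np

def retrieve(query_vec, database):
    """Rank database images by Euclidean distance between their
    visual-word vectors and the query; a smaller distance means
    a more similar image."""
    q = np.asarray(query_vec, dtype=float)
    ranked = [(name, float(np.linalg.norm(q - np.asarray(vec, dtype=float))))
              for name, vec in database.items()]
    return sorted(ranked, key=lambda pair: pair[1])
```

An image whose vector equals the query comes back first with distance zero, so evaluation measures such as precision at rank n can be computed directly from the returned ordering.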

Conclusion

In this paper, a novel RoI-BoW model is proposed to represent image content for image retrieval, in which key points are detected by the DoG algorithm and filtered by the proposed filtering algorithm to construct grids of different sizes. Visual words are constructed by clustering two features, the HSV histogram and Gabor texture, respectively. In addition, the Non-RoI is segmented into grids and represented by the BoW model. The representations of the RoI and Non-RoI are then combined to represent the content

Acknowledgments

This research has been supported by the National Natural Science Foundation of China (Grants 61402174 and 61370174), and the Natural Science Foundation of Shanghai (11ZR1409600).

References (52)

  • T.N. Tan

    Texture edge detection by modeling visual cortical channels

    Pattern Recogn.

    (1995)
  • T.C. Song et al.

    WaveLBP based hierarchical features for image classification

    Pattern Recogn. Lett.

    (2013)
  • S. Alvarez et al.

    Texton theory revisited: a bag-of-words approach to combine textons

    Pattern Recogn.

    (2012)
  • N. Inoue et al.

    q-Gaussian mixture models for image and video semantic indexing

    J. Vis. Commun. Image Represent.

    (2013)
  • R. Raveaux et al.

    Structured representations in a content based image retrieval context

    J. Vis. Commun. Image Represent.

    (2013)
  • Z. Li et al.

    Fusing semantic aspects for image annotation and retrieval

    J. Vis. Commun. Image Represent.

    (2010)
  • K.T. Chen et al.

    Boosting image object retrieval and indexing by automatically discovered pseudo-objects

    J. Vis. Commun. Image Represent.

    (2010)
  • C. Zhang et al.

    Laplacian affine sparse coding with tilt and orientation consistency for image classification

    J. Vis. Commun. Image Represent.

    (2013)
  • S.K. Kuanar et al.

    Video key frame extraction through dynamic Delaunay clustering with a structural constraint

    J. Vis. Commun. Image Represent.

    (2013)
  • H.M. Hu et al.

    A region-based rate-control scheme using inter-layer information for H.264/SVC

    J. Vis. Commun. Image Represent.

    (2011)
  • R. Achanta, F. Estrada, P. Wils, S. Süsstrunk, Salient region detection and segmentation, in: Proc. Springer Int. Conf....
  • A.C. Bovik et al.

    Localized measurement of emergent image frequencies by Gabor wavelets

    IEEE Trans. Inform. Theory

    (1992)
  • I. Fogel et al.

    Gabor filters as texture discriminator

    Biol. Cybern.

    (1989)
  • M.R. Turner

    Texture discrimination by Gabor functions

    Biol. Cybern.

    (1986)
  • F.F. Li, P. Perona, A Bayesian hierarchical model for learning natural scene categories, in: IEEE Computer Society...
  • J. Sivic, A. Zisserman, Video Google: a text retrieval approach to object matching in videos, in: Proceedings Ninth...