1 Introduction

The number of images on the Internet has been growing explosively with the widespread use of smartphones, digital cameras, and other portable devices. With millions of photos uploaded every day, effectively organizing and retrieving the images on the Internet has become a great challenge, and it is an important research area with broad practical applications. Image annotation is a crucial technology that facilitates image retrieval by adding keywords to images. In general, new images are annotated by finding and building a relevance model that connects the high-level semantics and low-level visual features of images; the high-level semantics then become the keywords for an image. Thus, image annotation technology transforms the image retrieval problem into the more mature text retrieval problem. Users can use keywords to retrieve the images they require. If a user provides a query image, the semantics of the query image can also be constructed by searching and processing the annotated images that are similar to the query image.

“A picture is worth a thousand words.” Images are a type of complex multimedia data that contain an abundance of semantic information. Eakins [8] proposed that the semantics of images have three levels. The first level is the low-level semantics, i.e., the low-level visual features extracted from images, such as color and texture. The second level is the object semantics, i.e., the objects recognized in the image, such as tiger, apple, etc. In general, the object categories are inferred from the extracted visual features. The third level is the abstract semantics, which is a higher-level inference from the object semantics. Example words and phrases include those expressing concepts (e.g. geek, CEO, art), properties (e.g. nutrient, tasty), behaviors (e.g. resign, lose weight), etc. Currently, the majority of annotation research focuses on the object semantics based on the visual content of images, and there are still many challenges to address with regard to abstract semantics annotation. In this paper, we consider two critical problems with abstract semantics annotation.

The first problem is the lack of effective modeling methods for abstract semantics, which has given rise to a surge of interest in the extraction of precise keywords from the text or user comments associated with an image. Several algorithms have been proposed for automatically assigning keywords to images or image regions. They first select salient terms or more visible words from the associated text and calculate the frequency or co-occurrences of these words [6, 35]. An image is then annotated using the words with the maximum annotation probability. In most cases, the annotation words selected in this way represent the objects in an image, i.e., the object semantics it contains; it is still difficult to represent the abstract semantics. Thus, when users submit query keywords for images, the effective words to use will be constrained by the object semantics of the image. We believe that a more natural and personalized way of retrieving images is to submit query keywords expressing high-level semantics, e.g., a hot Internet topic. For example, there was a hot topic on the Internet about a food safety incident in which red egg yolks were stained with Sudan Red. If “Sudan Red” or “stain” are included as query keywords, users may want to retrieve images about the impact of Sudan Red on food safety. Only when abstract semantics are expressed in annotations can the related images be retrieved as desired. So, it is critical to find a way of modeling and extracting the abstract semantics of images.

The second problem with existing annotation approaches is the lack of a dynamic updating mechanism for the training set. Current approaches adopt machine learning techniques or other relevance modeling methods to learn from a static annotated training set and identify the keywords for new images based on the learned model [14]. This process can be viewed as mapping low-level features of images to high-level semantic concepts. Therefore, the annotation results are restricted by the visual features in the training set and the semantics covered by the training set annotations. Without an effective updating mechanism, the training set only provides a fixed vocabulary for annotation and cannot grow to cover newly formed semantics (e.g. new events, temporal hot topics, etc.). Furthermore, regular updates of the training set require re-computing the mapping relationship between annotations and visual features, which can be extremely time-consuming and computationally intensive for a large-scale training set.

To address the issues above, we propose a novel high-level semantic annotation method for images based on hot Internet topics. There are three main contributions in this paper. First, we propose to model high-level semantics based on hot Internet topics. We use Latent Dirichlet Allocation (LDA) to analyze the texts on the related web pages to build the topics. After that, three kinds of relevance relationships are constructed: the topic–to–topic co-occurrence relationship, the topic–to–image relevance relationship and the image–to–image visual similarity relationship. Through the modeling and clustering of a complex graph, the hot topics are formed by clustering the LDA topics of similar images, and the related images are annotated with the top words in the corresponding hot topics. Second, we propose a dynamic update mechanism for the training set based on hot Internet topics. Through the discovery and tracking of hot Internet topics, representative keywords are selected to annotate the related images in the training set. Note that the words for annotation are based on texts newly searched on the Internet, so the vocabulary for annotation can be constantly extended. In addition, our update mechanism does not require re-computing the whole mapping relationship between the annotations and the low-level visual features of all images; we only update the annotations of the related images in the original training set. This greatly reduces the update cost and makes the extension of the training set much easier. Thus, the semantic coverage is increased gradually and the dynamic update of the training set can be achieved, both of which are critical for a good image retrieval system. Third, we propose a new search-based image annotation mechanism, which uses hypergraph modeling and spectral clustering to filter out the semantically irrelevant images. Given a query image, we search for candidates in the training set according to visual similarity. A hypergraph is constructed for the candidates and several clusters are formed via the clustering process. Small clusters are considered to be outliers and are discarded. The selected cluster will have both similar visual features and consistent semantics with respect to the query image, so the annotation results from this cluster will deliver better performance.

The paper is organized as follows. Section 2 discusses previous work on image annotation, which provides readers with a general idea of related work. Section 3 explains the basic design concept. Sections 4 and 5 present the approach in detail. Section 6 contains our evaluations and Section 7 presents our conclusions.

2 Related work

Research on image annotation technology has been conducted for many years. In general, researchers have proposed knowledge modeling methods for automatic annotation, such as classification-based methods [2, 4, 15, 23], graphical model-based methods [36, 37], cross-media modeling methods [9, 11, 17], and translation model-based methods [5, 7, 12]. These methods share two key attributes: they apply different machine learning algorithms to a training set and construct mapping relationships between semantic concepts and low-level features, which are then used to annotate new images. However, the size of the training sets used by these methods is limited, so their quality and performance degrade when handling large-scale Internet data. This is because the coverage of the semantics in the training set is also limited and cannot be updated in real time. In most cases, the annotation results are simply the object semantics of the images.

With the wide use of social networks, more and more users share their images with others on the Internet. They also add comments to images to express their feelings, which is called “social tagging”. Websites supporting such tagging include Flickr, Photosig, Delicious, LabelMe, Peekaboom, etc. Although social tagging is easy to perform, it suffers from two drawbacks. First, the tags provided by Flickr users actually contain much noise [13]. Second, user tagging is ambiguous: users often apply the same common tags to different objects, so these common tags are semantically ambiguous. This is why many similar images cannot be retrieved with a keyword-based mechanism. Wu et al. [34] proposed a tag recommendation framework that takes advantage of the correlations between tags and visual content. Weak rankers are learned with this multi-modality model and combined using the RankBoost algorithm. However, these methods usually need to build a static vocabulary from the training set and calculate the mapping relationships among the tags in the vocabulary. Once a new word or a newly annotated image is added, the whole mapping relationship in the vocabulary has to be re-calculated, which is a huge computational cost for a large-scale image database.

Recently, the research focus of image annotation has shifted to large-scale Internet image annotation [28, 30]. Given a query image for annotation, its semantic contents can be extracted from similar images that have already been annotated. Thus, if similar candidates can be retrieved from the Internet, an annotation of the query image can be obtained [16, 24, 29, 31, 32]. In general, these methods require a query word when searching for similar images, and they submit the query word to a text-based search engine. The key issue that affects these methods is the algorithm used to identify accurate query words. Zhang et al. [38] proposed an image annotation method that acquires the initial query words automatically. The key concept in this method is using a CBIR technique to find similar candidates and their annotations in a database. The initial query words are then derived and used to search for similar images on the Internet after filtering out noisy words and performing a sorting process. However, these methods merely use a search-based algorithm to retrieve similar images from the Internet instead of the training set and then derive annotations from these retrieved images; they have no capability to annotate images based on hot Internet topics.

The data found on the Internet is cross-media data, so images and their associated text may have various forms of semantic relationships, which may provide an effective basis for image annotation. Zhu et al. [39] selected salient terms from the associated text for image annotation. Wu et al. [33] obtained more accurate annotations by calculating the contribution of each word to its visibility model. Monay and Gatica-Perez [21] introduced a latent topic model, probabilistic Latent Semantic Analysis (pLSA), into an image annotation algorithm and trained two pLSA models based on the visual features and the associated text, respectively. This method then merged the topics in both models by assigning weights according to the entropy of the visual word distribution. These methods combine the relationships of different features and improve the annotation accuracy. However, these modeling approaches mainly consider the semantic relationships between images and salient words, and the relationships between salient words. In general, these salient words represent the objects in an image. Thus, assigning the salient words as annotations still fails to describe the high-level abstract semantics.

3 Basic design idea

In this paper, a novel high-level semantic image annotation algorithm based on hot Internet topics is proposed. As shown in Fig. 1, the algorithm consists of two sub tasks: search-based image annotation and dynamic update of the training set based on hot Internet topics. The former aims at annotating a given image based on the annotated set, while the latter prevents the annotations in the set from becoming obsolete and keeps the set in synchronization with the latest hot topics on the Internet.

Fig. 1 Image annotation based on the two sub tasks

3.1 Dynamic update of the training set based on hot Internet topics

The volume of data on the Internet is huge and grows continuously. To discover hot topics in such a large-scale data set, we follow the four steps below.

  1. Corpus collection: For a specific accident or event, its title is used to search for related web pages with a TBIR technique. In addition, web pages containing both images and text are regularly collected to construct the corpus for hot topic discovery.

  2. Hot topic discovery: To discover hot topics in the collected corpus, we first use LDA modeling [1] to learn the topic distributions in the text corpus. Then three kinds of relationships, including image–to–image, topic–to–topic and image–to–topic relevance, are constructed. We build a complex graph based on these relationships and perform a clustering operation on the constructed graph [18]. The complex graph takes images and LDA topics as vertices and is partitioned according to the three sets of relationships. The mapping between the image clusters and the topic clusters is built via the complex graph clustering, and the images in one cluster will have similar visual features and consistent semantics. In this way, sub hot topics can be formed even when the initial title is polysemous or ambiguous.

  3. Keyword selection: With the hot topics discovered and the corresponding words representing them, the χ² statistic is used to select keywords for each hot topic so that users can understand it. The corresponding images are then annotated with the selected keywords.

  4. Database update: After the images are annotated, they are added to the training set, whose semantic coverage is thus expanded and kept in synchronization with the latest hot topics on the Internet.

The core step is the second one. In this step, complex graph clustering is performed based on the three sets of relationships, and hot topics are formed by establishing the mapping relationship between the obtained image clusters and topic clusters. Note that an image is represented as a vector of extracted visual features, while a topic is represented as a distribution over text words. They belong to essentially different feature spaces, and it is difficult to build a relevance relationship between them.

Existing algorithms generally assume that an image and its associated text are semantically relevant. Such relevance can bridge the different feature spaces and connect an image with a topic. However, because the text associated with a given image is extracted from the corresponding web page, which usually contains multiple images, some of the text content is actually not relevant to the given image at all. It is incorrect to connect the topics extracted from the irrelevant content with the given image.

In addition, considering the polysemy of words [25], the visual content of the images found by the same query keyword can be quite different (e.g., the word “apple” refers to both the fruit and Apple Inc.). Consequently, the images retrieved for one query keyword may be clustered into several sub categories.

To overcome such problems and establish appropriate relevance between the image and topic clusters, the relevance between an image and a topic, the visual similarity between images, and the co-occurrences of topics are all taken into consideration, resulting in three sets of relationships. A complex graph provides a natural way of modeling both the images and the texts simultaneously to exploit these relationships because it allows multi-type vertices to be connected together. By clustering on the complex graph, images in the same cluster will not only be visually similar but will also share the same hot topic.

3.2 Search-based image annotation

In this sub task, we follow the three steps below to construct the annotations for a query image by leveraging the annotated training set.

  1. The annotated set is searched for images visually similar to the given one. Because of the semantic gap problem, some of the retrieved images will be semantically irrelevant, so it is inappropriate to select keywords directly from their annotations.

  2. To filter out the semantically irrelevant images, a hypergraph is constructed with the retrieved images as vertices and their annotations as hyperedges. That is, if two images are annotated with the same word, the corresponding two vertices are connected by a hyperedge. An example of a hypergraph is shown in Fig. 2. Spectral clustering is then performed to partition the hypergraph into multiple clusters, and the small clusters are regarded as semantically irrelevant and discarded.

  3. The cluster of images most relevant to the query image is identified among the remaining clusters according to visual similarity. Then, for each annotation in the cluster, its relevance to the query image is computed, and the most relevant annotations are used to annotate the query image.

Fig. 2 Hypergraph and matrix H

Note that a hypergraph provides a natural and concise way of modeling the relationships between images based on their annotations, because it allows multiple vertices to be connected by a single hyperedge. It naturally expresses the fact that one annotation can be assigned to multiple images. Moreover, one vertex can be included in different hyperedges, which captures the fact that one image can have multiple annotations. Annotations are used to construct hyperedges because they represent the semantics contained in the images.

By partitioning the constructed hypergraph, images that are semantically irrelevant can be identified by considering the fact that their semantics will be quite different from that of the query image and vary among themselves. Therefore those images will only form small clusters, whereas others will form larger ones.

4 Dynamic update of the training set based on hot Internet topics

As described in Section 3, the dynamic update of the annotated training set consists of four steps. The core algorithm for discovering hot topics is discussed in detail in this section, and its pseudocode is shown in Algorithm 1.

Algorithm 1 Discovery of hot topics

1: Input: Extended image set V e , associated text set T e

2: Output: Annotations of hot topics A

3:

4: \(R_v \leftarrow \text{ComputeImageRelevance(} V_e \text{)}\)

5: \(R_t \leftarrow \text{ComputeTopicRelevance(} T_e \text{)}\)

6: \(p(z|d) \leftarrow \text{GibbsSamplingLDA(} T_e \text{)}\)

7: \(R_{vt} \leftarrow \text{ComputeImageTopicRelevance(} p(z|d), R_v \text{)}\)

8: \(G \leftarrow \text{ConstructComplexGraph(} R_v, R_t, R_{vt} \text{)}\)

9: \(S \leftarrow \text{ComplexGraphClustering(} G \text{)}\)

10: \(A \leftarrow \text{SelectMostRelevantWords(} S \text{)}\)

4.1 Image representation and similarity measurement

Three types of features are extracted to represent images in our experiment, i.e. color histogram [26], wavelet texture [20] and SIFT [19], whose dimensions are 64, 128 and 500 respectively. Note that the choice of features for image representation is not the focus of our paper, and in fact many other visual features can be used as replacements. Specifically, color histograms are computed in the LAB color space. The lightness and the two color components are each uniformly quantized into 4 bins, and the χ² distance is computed to measure the similarity between two images in terms of their color histograms. The texture feature vector of an image is computed with both pyramid- and tree-structured wavelet transforms by decomposing, at different levels, the sub-bands obtained through filtering; it consists of the means and standard deviations of all the energy distributions [3]. As for SIFT, a visual vocabulary of size 500 is constructed through k-means clustering. Euclidean distances are used to define the visual similarity between two images.
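As a concrete illustration, the following is a minimal sketch (not the authors' code; the function names and the exponential mapping from distance to similarity are our own assumptions) of the χ² distance between two 64-bin LAB color histograms and its conversion into a similarity score.

```python
# Sketch: chi-square distance between two L1-normalized 64-bin color histograms.
import numpy as np

def chi_square_distance(h1: np.ndarray, h2: np.ndarray, eps: float = 1e-10) -> float:
    """Chi-square distance; smaller values mean more similar histograms."""
    return 0.5 * float(np.sum((h1 - h2) ** 2 / (h1 + h2 + eps)))

def to_similarity(dist: float, sigma: float = 1.0) -> float:
    """One common way to turn a distance into a similarity in (0, 1]."""
    return float(np.exp(-dist / sigma))

# Usage with two random 64-bin histograms
rng = np.random.default_rng(0)
h1 = rng.random(64); h1 /= h1.sum()
h2 = rng.random(64); h2 /= h2.sum()
print(to_similarity(chi_square_distance(h1, h2)))
```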

4.2 Building three relevance relationships

Three kinds of relevance relationships are considered in the image and text modeling, i.e. the image–to–image similarity relationship, the topic–to–topic co-occurrence relationship, and the image–to–topic relevance relationship.

The topics are extracted from the associated text set T e using Latent Dirichlet Allocation (LDA) [1]. LDA is a generative model for collections of discrete data such as text corpora, and it is endowed with three layers corresponding to documents, topics, and words respectively. A document is regarded as a mixture of an underlying set of topics. This provides a representation of documents as topic distributions, allowing them to be analyzed effectively in the latent topic space that is usually a much lower dimensional one. Gibbs sampling is adopted for parameter estimation during LDA modeling [10], after which each word is assigned a topic label and the topic-document distribution p(z|d j ) can be determined. Equation (1) shows how the topic-document distributions can be estimated.

$$ p(z_k|d_j) = \frac{n_k^{(d_j)} + \alpha}{n_{\bullet}^{(d_j)} + T \alpha} $$
(1)
  • α is the Dirichlet hyperparameter for topic-document distributions

  • T is the number of topics

  • \(n_k^{(d_{j})}\) is the frequency of topic k in document d j

  • \(n_{\bullet}^{(d_j)}\) is the total frequency of all topics in document d j
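For illustration, here is a minimal sketch of Eq. (1), assuming the per-document topic counts have already been collected from the Gibbs sampler; the counts, the function name and the value of α are hypothetical.

```python
# Sketch of Eq. (1): topic-document distribution from per-document topic counts.
import numpy as np

def topic_document_distribution(topic_counts: np.ndarray, alpha: float) -> np.ndarray:
    """topic_counts[k] = n_k^(d_j); returns p(z_k | d_j) for all topics k."""
    T = len(topic_counts)                              # number of topics
    return (topic_counts + alpha) / (topic_counts.sum() + T * alpha)

counts = np.array([12, 0, 3, 7, 0])                    # hypothetical counts for one document
print(topic_document_distribution(counts, alpha=0.1))
```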

Image–to–image similarity relationship

According to Section 4.1, a similarity matrix R v can be constructed for the extended image set V e , whose element R v (i, j) represents the similarity between images i and j in the set.

Topic–to–topic co-occurrence relationship

The relevance relationship between topics can be constructed based on the LDA topic assignments of each word in all the documents. Let R t be the topic co-occurrence matrix. Its element R t (i, j) is defined in (2).

$$\begin{array}{rll} R_t(i, j) & = & \frac{C(z_i \cap z_j)}{C(z_i \cap z_j) + C(\overline{z_i} \cap \overline{z_j})} \cdot \frac{C(z_i \cap z_j)}{C(z_j)} \\ && + \frac{C(\overline{z_i} \cap \overline{z_j})}{C(z_i \cap z_j) + C(\overline{z_i} \cap \overline{z_j})} \cdot \frac{C(\overline{z_i} \cap \overline{z_j})}{C(\overline{z_j})} \end{array}$$
(2)

C(z i  ∩ z j ) is the co-occurrence count of topics z i and z j , i.e. the number of times that both topics are assigned to some words in the same document, and \(C(\overline{z_i} \cap \overline{z_j})\) is the neither-existence count of topics z i and z j , i.e. the number of times that neither of the two topics is assigned to any word in the same document.
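The following sketch shows one way Eq. (2) could be computed from per-document topic presence derived from the Gibbs-sampling assignments; the boolean presence matrix and the zero-denominator guard are our own assumptions.

```python
# Sketch of Eq. (2): topic-to-topic co-occurrence from per-document topic presence.
import numpy as np

def topic_cooccurrence(presence: np.ndarray, i: int, j: int) -> float:
    """presence[d, k] is True if topic k was assigned to at least one word of document d."""
    zi, zj = presence[:, i], presence[:, j]
    both = np.sum(zi & zj)                             # C(z_i ∩ z_j)
    neither = np.sum(~zi & ~zj)                        # C(~z_i ∩ ~z_j)
    cj, not_cj = np.sum(zj), np.sum(~zj)
    denom = both + neither
    if denom == 0 or cj == 0 or not_cj == 0:
        return 0.0
    return (both / denom) * (both / cj) + (neither / denom) * (neither / not_cj)

# Hypothetical presence matrix: 6 documents, 3 topics
presence = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1],
                     [1, 1, 0], [0, 0, 1], [1, 1, 1]], dtype=bool)
print(topic_cooccurrence(presence, 0, 1))
```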

Image–to–topic relevance relationship

The image–to–topic relevance relationship can be measured using the conditional probability of the topic given the image, i.e. p(z j |I i ), which can be decomposed by considering all the images similar to the given one as in (3).

$$\begin{array}{rll} p(z_j|I_i) & =& \sum\limits_{I_{\textrm{sim}} \in V_s} p(z_j|I_{\textrm{sim}})p(I_{\textrm{sim}}|I_i) \\ p(z_j|I_{\textrm{sim}}) & \propto& p(z_j|d_{\textrm{sim}}), \textrm{ learned from LDA} \\ p(I_{\textrm{sim}}|I_i) & \propto& \textrm{similarity}(I_i, I_{\textrm{sim}}) \end{array}$$
(3)

The decomposition leads to a helpful intuition: if images similar to the given one are relevant to a topic, then the given one should also be relevant to it. To avoid the iterative computation required to reach convergence with such a recursive definition, the corresponding text is used to stand for the image, and the topic-document distributions learned with LDA are employed to calculate the relevance values needed. The visual similarity can be viewed as a combination weight, ensuring that more similar images make greater contributions.
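A minimal sketch of Eq. (3), under the assumption that the similarity values are normalized to act as p(I_sim | I_i); all arrays and names below are hypothetical.

```python
# Sketch of Eq. (3): image-to-topic relevance accumulated over similar images.
import numpy as np

def image_topic_relevance(sim_row: np.ndarray, p_z_given_d: np.ndarray) -> np.ndarray:
    """sim_row[s]     = similarity(I_i, I_sim) for each similar image s
       p_z_given_d[s] = p(z | d_sim), the LDA topic distribution of image s's text
       Returns a relevance vector over all topics for image I_i."""
    weights = sim_row / (sim_row.sum() + 1e-12)        # p(I_sim | I_i) ∝ similarity
    return weights @ p_z_given_d                       # sum_s p(z | d_sim) p(I_sim | I_i)

sims = np.array([0.9, 0.4, 0.1])                       # three similar images
topics = np.array([[0.7, 0.2, 0.1],                    # their topic distributions
                   [0.6, 0.3, 0.1],
                   [0.1, 0.1, 0.8]])
print(image_topic_relevance(sims, topics))
```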

4.3 Modeling the three relevance relationships using a complex graph

To represent and exploit the three relationships in a unified framework, a complex graph G = {V 1, V 2, E} is constructed, which provides a natural way of modeling images and LDA topics simultaneously by allowing multi-type vertices to be connected [18].

The vertex sets V 1 and V 2 correspond to LDA topics and images respectively, and the edge set E contains connections between vertices, including those between homogeneous vertices and those between heterogeneous ones. The edge set can be written as \(E = \left\{ \left\{ S \in R_+^{|V_1| \times |V_1|} \right\}, \left\{ A \in R_+^{|V_1| \times |V_2|} \right\} \right\}\), where S represents the weights of the homogeneous edges connecting vertices in V 1 (see Eq. (2)), and A represents the weights of the heterogeneous edges connecting vertices in V 1 with those in V 2 (see Eq. (3)). Based on the constructed complex graph, the images and LDA topics can be clustered separately while being constrained by each other. The mapping relationship between the obtained image and LDA topic clusters is established according to the three kinds of relevance relationships.

The complex graph is partitioned to optimize the objective function L defined in (4) [18].

$$\begin{array}{lll} && \arg \min\limits_{{\bf C}^{(1)}, {\bf C}^{(2)}} L \\ && L = \left\| S - {\bf C}^{(1)}D({\bf C}^{(1)})^T \right\|^2 + \left\| A - {\bf C}^{(1)}B({\bf C}^{(2)})^T \right\|^2 \\ && s.t. \quad {\bf C}^{(1)} \in \left\{ 0, 1 \right\}^{|V_1| \times K_1}, {\bf C}^{(2)} \in \left\{ 0, 1 \right\}^{|V_2| \times K_2} \end{array}$$
(4)

C (1) denotes the cluster membership matrix for the vertices in V 1, and \(C_{ij}^{(1)}\) is the weight between vertex i and cluster j in V 1. C (2) denotes the cluster membership matrix for the vertices in V 2, and \(C_{ij}^{(2)}\) is the weight between vertex i and cluster j in V 2. The inter-type pattern matrix B denotes the link patterns between the vertices in V 1 and those in V 2, and B(i, j) is the link strength between cluster i in V 1 and cluster j in V 2. The intra-type cluster pattern matrix D denotes the link patterns within the same type of vertices, and D(i, j) is the link strength between clusters i and j in V 1. In general, the entries of D and B can be interpreted as link probabilities.

The solutions D and B to the optimization problem defined in (4) are computed according to (5) [18].

$$\begin{array}{l} {\bf D}^\star = \left( \left( {\bf C}^{(1)} \right)^T {\bf C}^{(1)} \right)^{-1} \left( {\bf C}^{(1)} \right)^T {\bf S} {\bf C}^{(1)} \left( \left( {\bf C}^{(1)} \right)^T {\bf C}^{(1)} \right)^{-1} \\ {\bf B}^\star = \left( \left( {\bf C}^{(1)} \right)^T {\bf C}^{(1)} \right)^{-1} \left( {\bf C}^{(1)} \right)^T {\bf A} {\bf C}^{(2)} \left( \left( {\bf C}^{(2)} \right)^T {\bf C}^{(2)} \right)^{-1} \\ \qquad\quad s.t. \quad {\bf C}^{(1)} \in \left\{ 0, 1 \right\}^{|V_1| \times K_1}, {\bf C}^{(2)} \in \left\{ 0, 1 \right\}^{|V_2| \times K_2}, \\ \qquad\quad\qquad \: \: {\bf D}^\star \in R_+^{K_1 \times K_1}, {\bf B}^\star \in R_+^{K_1 \times K_2} \end{array}$$
(5)

The complex graph clustering algorithm is described step by step below (a minimal code sketch follows the steps). For more theoretical details, please refer to [18].

Input:  A complex graph G = (V 1, V 2, E), assuming that the number of clusters in V 1 is K 1 and the number of clusters in V 2 is K 2.

Output:  The result of topic clustering C (1) and the result of image clustering C (2). Matrix P where the elements are the link patterns between the clusters in C (1) and C (2).

  1. Given the initial values of C (1) and C (2), calculate the initial values of D, B, and L, and set \(L_{\textrm{min}} = L_{\textrm{init}}\).

  2. Fix D, B, and C (2); then, row by row, set to 1 the element of C (1) that minimizes L at each update. Update \(L_{\textrm{min}}\).

  3. Fix D, B, and C (1); then, row by row, set to 1 the element of C (2) that minimizes L at each update. Update \(L_{\textrm{min}}\).

  4. Calculate D and B using Eq. (5).

  5. Repeat steps 2 to 4 until convergence is reached.

  6. Calculate the mapping relationship matrix P between the image clusters and topic clusters according to P(I|T) = P(I|T′)P(T′|T), where P(I|T′) is the B matrix and P(T′|T) is the D matrix.
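As referenced above, here is a minimal sketch of steps 1–5 under our reading of Eq. (5) and [18]; it is not the authors' implementation, and the random initialization, fixed iteration cap and pseudo-inverse safeguard are our own choices. The row updates re-evaluate the full loss for clarity rather than efficiency.

```python
# Sketch: alternating complex graph clustering with the closed forms of Eq. (5).
import numpy as np

def closed_form(C1, C2, S, A):
    P1 = np.linalg.pinv(C1.T @ C1)                     # ((C1^T C1))^-1, pinv for robustness
    P2 = np.linalg.pinv(C2.T @ C2)
    D = P1 @ C1.T @ S @ C1 @ P1                        # intra-type cluster patterns
    B = P1 @ C1.T @ A @ C2 @ P2                        # inter-type cluster patterns
    return D, B

def loss(S, A, C1, C2, D, B):
    return (np.linalg.norm(S - C1 @ D @ C1.T) ** 2
            + np.linalg.norm(A - C1 @ B @ C2.T) ** 2)  # objective L of Eq. (4)

def complex_graph_clustering(S, A, K1, K2, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    n1, n2 = S.shape[0], A.shape[1]
    C1 = np.eye(K1)[rng.integers(K1, size=n1)]         # step 1: random hard memberships
    C2 = np.eye(K2)[rng.integers(K2, size=n2)]
    D, B = closed_form(C1, C2, S, A)
    for _ in range(iters):
        for i in range(n1):                            # step 2: update C1 row by row
            costs = [loss(S, A, np.vstack([C1[:i], np.eye(K1)[k], C1[i + 1:]]), C2, D, B)
                     for k in range(K1)]
            C1[i] = np.eye(K1)[int(np.argmin(costs))]
        for i in range(n2):                            # step 3: update C2 row by row
            costs = [loss(S, A, C1, np.vstack([C2[:i], np.eye(K2)[k], C2[i + 1:]]), D, B)
                     for k in range(K2)]
            C2[i] = np.eye(K2)[int(np.argmin(costs))]
        D, B = closed_form(C1, C2, S, A)               # step 4: Eq. (5)
    return C1, C2, D, B                                # step 6: P can be derived from B and D

# Toy usage: 5 topics, 6 images, 2 clusters of each type
rng = np.random.default_rng(1)
S = rng.random((5, 5)); S = (S + S.T) / 2
A = rng.random((5, 6))
C1, C2, D, B = complex_graph_clustering(S, A, K1=2, K2=2)
```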

The complex graph clustering algorithm clusters the image vertices and topic vertices separately and produces a one–to–one mapping between the topic clusters and the image clusters. During the clustering process, the three relevance relationships affect each other. During image clustering, images with similar visual contents and close topic contents form one cluster. During topic clustering, topics associated with similar visual contents form one cluster and produce a hot topic.

To make the obtained hot topics understandable to users, keywords are identified in each hot topic. The χ² statistic is computed to select the words most relevant to a hot topic; greater values of χ² mean higher relevance to the hot topic.
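As an illustration of this keyword selection step, the sketch below scores a candidate word with the standard 2×2 χ² statistic computed from document frequencies inside and outside a hot topic; the counts and this particular variant of the statistic are our own assumptions.

```python
# Sketch: chi-square score of a word for one hot topic from a 2x2 contingency table.
def chi_square(word_in_topic, word_out_topic, docs_in_topic, docs_out_topic):
    a = word_in_topic                                  # topic docs containing the word
    b = word_out_topic                                 # other docs containing the word
    c = docs_in_topic - a                              # topic docs without the word
    d = docs_out_topic - b                             # other docs without the word
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

# Example: "gelatin" appears in 40 of 50 topic documents but only 5 of 500 others
print(chi_square(40, 5, 50, 500))
```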

4.4 Updating the annotated set

With the hot topics discovered and keywords identified, the annotated image set is then updated to cover wider range of semantics.

Different from traditional model-based annotation methods, no relevance model or mapping relationships between visual features and semantic concepts are maintained for the annotated set. Hence the computational cost of updating is much lower and acceptable. To further decrease the cost, the images in the set are grouped and organized according to hot topics, dividing the entire set into multiple semantically and visually coherent subsets. Mean feature vectors are computed to represent each subset, whose semantics are expressed with the keywords of the corresponding hot topic.

With the benefit of this data organization, the annotated set can be updated by directly adding new hot topics as new subsets. For instance, consider the situation in which many food safety accidents have been discovered that are not included in the current annotated set, e.g. red yolks stained with Sudan Red. Each accident corresponds to a hot topic, which is a set of images and annotations. The updated annotated set is then the union of the original set and these new hot topic sets. Note that the original subsets remain unchanged.

It is also possible that the update is based on an existing subset of the current annotated set. If new semantics emerges with respect to a topic, then the original topic will split into several sub-topics, and the sub-topics are used to replace the old topic. Imagine that at the very beginning, “apple” only refers to a kind of fruit, so there is only one topic of fruit apple in the annotated set. Then one day Apple Inc. is founded, and when “apple” is used as a query keyword to search the Internet, two topics will be discovered: fruit apple and Apple Inc. So the old topic of apple in the annotated set is abandoned, and two new topics of fruit apple and Apple Inc. are added.

This update mechanism keeps expanding the semantics covered by the annotated set and eliminating ambiguous topics with the minimal cost. And the form of data organization makes it quite convenient to find the images visually similar to a given one.

5 Search-based image annotation

To annotate a given image, the annotated set is searched in order to find the images that are visually similar to the given one. For the similarity measurement, please refer to Section 4.1. Then a hypergraph is constructed with the found images and their annotations, on which clustering is performed in order to identify the images semantically relevant or irrelevant to the given one. After filtering out the semantically irrelevant images, keywords are selected from the annotations of the remaining images and assigned to the given image.

5.1 Filtering out semantically irrelevant images using a hypergraph

When the set of similar images V s and the corresponding annotation set T s are obtained, a hypergraph G(V s , T s ) can be constructed with images as vertices and annotations as hyperedges, i.e., if two images share the same annotation, then the corresponding two vertices are connected by a hyperedge. The hypergraph G can be represented as a matrix H, as shown in Fig. 2; H also encodes the co-occurrence relationships between annotations. The graph is then partitioned using spectral clustering [22], forming clusters of images that have the most similar annotations. Therefore, images with different semantics will be well separated, and those that are semantically irrelevant will form small clusters, which are abandoned as analyzed in Section 3.2.
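A minimal sketch of this step, assuming scikit-learn is available: it builds the incidence matrix H from hypothetical image annotations, derives an image–image affinity from shared hyperedges, partitions it with spectral clustering, and drops clusters below a size threshold. The affinity construction and the size threshold are simplifications of the hypergraph spectral clustering in [22], not the authors' exact procedure.

```python
# Sketch: incidence matrix H and spectral clustering over shared annotations.
import numpy as np
from sklearn.cluster import SpectralClustering

annotations = {                                        # hypothetical retrieved images
    "img1": {"earthquake", "rescue"},
    "img2": {"earthquake", "ruins"},
    "img3": {"rainbow", "band"},
    "img4": {"rainbow", "band", "concert"},
}
images = sorted(annotations)
words = sorted(set().union(*annotations.values()))
H = np.array([[w in annotations[i] for w in words] for i in images], dtype=float)

affinity = H @ H.T                                     # number of shared annotations
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(affinity)

min_size = 2                                           # smaller clusters are treated as outliers
kept = [c for c in set(labels) if np.sum(labels == c) >= min_size]
print(dict(zip(images, labels)), "kept clusters:", kept)
```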

5.2 Annotating the given image

After the semantically irrelevant images are removed, the cluster that is the most visually similar to the given image can be identified, the annotations of which are selected as candidates. Let S denote the identified cluster. Then the final annotations are determined according to the relevance between the given image and candidate annotations in S, as is defined in (6). The conditional probability given the query image p(t i |I q ) is used to measure the relevance between the query image I q and the i-th candidate annotation t i in S. The candidate annotations with high relevance will be preserved and assigned to the given image.

$$\begin{array}{lll} &&p(t_i|I_q) = \sum\limits_{I_j \in S} p(t_i|I_j)p(I_j|I_q) \\ &&p(t_i|I_j) = \left\{ \begin{array}{cc} 1 & \quad \textrm{if $I_j$ is annotated with $t_i$} \\ 0 & \quad \textrm{others} \end{array} \right. \\ &&p(I_j|I_q) \propto \textrm{similarity}(I_j, I_q) \end{array}$$
(6)

similarity (I j , I q ) is the visual similarity between image I j and I q , which is described in Section 4.1.
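For illustration, a minimal sketch of Eq. (6) that scores each candidate annotation in the selected cluster S by the normalized visual similarity of the images carrying it; the dictionaries and similarity values are hypothetical.

```python
# Sketch of Eq. (6): relevance of candidate annotations to the query image.
def annotation_scores(cluster, query_similarity):
    """cluster: {image_id: set of annotation words}
       query_similarity: {image_id: similarity(I_j, I_q)}"""
    total = sum(query_similarity.values()) or 1.0
    scores = {}
    for img, words in cluster.items():
        weight = query_similarity[img] / total         # p(I_j | I_q) ∝ similarity
        for w in words:                                # p(t_i | I_j) = 1 for words of I_j
            scores[w] = scores.get(w, 0.0) + weight
    return sorted(scores.items(), key=lambda kv: -kv[1])

cluster = {"img1": {"capsule", "chromium"}, "img2": {"capsule", "poisonousness"}}
sims = {"img1": 0.8, "img2": 0.5}
print(annotation_scores(cluster, sims))
```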

6 Experiments and evaluation

We propose a new high-level semantic annotation algorithm based on hot Internet topics. It includes the dynamic update of the training set based on hot Internet topics and search-based image annotation. Through the dynamic update of the training set, the semantics in the training set are more accurate and cover more recent hot Internet topics. For the annotation process, users could get the annotation results of the query image, which contain the keywords of the hot topics. So, in this section, we conduct four groups of experiments to evaluate our overall algorithm.

Experiment 1: Performance evaluation of the search-based image annotation.

Experiment 2: Performance evaluation of the topic discovery and image annotation algorithm.

Experiment 3: Effectiveness evaluation of the dynamic update of the training set based on hot Internet topics.

Experiment 4: Effectiveness evaluation of the annotation algorithm in an Internet-like environment.

6.1 Data set selection and performance measurement

6.1.1 Data set selection

Four data sets are used in our experiments, which cover a wide range of situations.

Dataset1  The NUS-WIDE data set [3] is selected for experiment 1. It is a web image data set created by NUS’s Lab for Media Search. It contains 269,648 images from Flickr with 425,059 unique tags in total. The data set is divided into two parts: the first part contains 161,789 images for training and the second part contains 107,859 images for testing. In the training set, we remove the tags that occur fewer than 50 times; meanwhile, the tags not in WordNet are filtered out via the WordNet stemmer. The final number of unique tags is 5,018. In the testing set, the number of manually labeled tags per image varies from 2 to more than 100, with an average of about 30. The WordNet stemmer is also used for stemming, and we then retain the tags that occur more than 50 times in the testing set. Finally, the average number of manually labeled tags per image is about 10. This data set also has manual annotations as the ground truth, with 81 concepts in total belonging to different categories. The ground truth for the testing set is the processed manual annotations and related concepts.

Dataset2  This is a hot topic corpus, built from the search results (including web pages and related images) on the Internet for 34 topics. These topics are as follows:

  1. The 10 concepts with the lowest accuracy rates in experiment 1: In experiment 1, we calculate the annotation performance of every concept, and the 10 concepts with the lowest average accuracy rates among the 81 concepts are selected. They are: C1: Earthquake, C2: Statue, C3: Rainbow, C4: Running, C5: Wedding, C6: Book, C7: Castle, C8: Flags, C9: Temple, C10: Train.

  2. 20 food safety accidents from 2008 to 2012 in China: We collected the food safety accidents reported on the Internet and organized them into 20 accidents, i.e., E1: Sanlu milk powder, E2: Paraffin chafing-dish material, E3: Poisoned capsules, E4: Jinhao tea oil, E5: McDonald’s chicken, E6: Plasticizer event, E7: Poisoned yoghurt, E8: Sudan Red accident, E9: Trench oil, E10: Small lobster event, E11: Lean meat powder event, E12: Poisoned bread, E13: Maggots-orange, E14: Burst watermelon, E15: Toxic bird’s nest, E16: Turbot accident, E17: Poison bean sprouts, E18: Maggots-sausage, E19: Poisoned ginger event, E20: Deteriorating rice event.

  3. 4 visually polysemous words: We evaluate the effectiveness and performance of complex graph clustering by using four visually polysemous words as query keywords, i.e., “apple”, “tiger”, “mouse” and “shark”.
The keywords from these 34 topics are used to perform searches on the Internet, and the resulting web pages and images are downloaded. After removing duplicates, 450–1,300 text pages and 300–800 images are collected for each topic. Of these, about 150–200 images per topic are selected as testing images for experiment 2, and the remaining images are used as the training set.

Dataset3  This is an updated data set. In experiment 3, we use Dataset2 to update Dataset1, and the result is Dataset3. We repeat the annotation experiment of experiment 1 on this data set and compare the results with those on Dataset1.

Dataset4  This data set is constructed by updating the training set in Dataset1 to form 196 subtopics based on the original 81 concepts. Then 100 images are selected randomly from each subtopic, resulting in a training set of 19,600 images. The test set of 107,859 images remains unchanged, so the ratio of training to test set size is 1:5.5. Furthermore, in order to compare the performance under different ratios, we also use the entire training set (1:0.6) and a set of 10 images selected from each subtopic (1:55).

6.1.2 Performance measurement

The average precision rate (Av_P) and the average recall rate (Av_R), defined in (8) based on the per-image precision and recall in (7), are used to measure the performance of our image annotation algorithm.

$$\begin{array}{rll} P(I_i) & =& \frac{|A_i \cap G_i|}{|A_i|} \\ R(I_i) & =& \frac{|A_i \cap G_i|}{|G_i|} \end{array}$$
(7)
$$\begin{array}{rll} \textrm{Av\_P} & =& \frac{1}{N} \sum\limits_{i=1}^{N} P(I_i) \\ \textrm{Av\_R} & =& \frac{1}{N} \sum\limits_{i=1}^{N} R(I_i) \end{array}$$
(8)

N denotes the total number of images, A i denotes the set of annotations assigned to image I i by our algorithm, and G i denotes the ground truth for image I i .

Furthermore, coverage rate (Cov_rate), as defined in (9), is used to evaluate the number of keywords in the vocabulary that are used to annotate the query images.

$$ \textrm{Cov\_rate} = \frac{|\bigcup_{i=1}^{N} A_i|}{|V|} $$
(9)

where V is the annotation vocabulary of the evaluation data set.
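A minimal sketch of Eqs. (7)–(9), assuming the result annotations and the ground truth are given as per-image sets; the data and function name are hypothetical.

```python
# Sketch: average precision, average recall and coverage rate.
def evaluate(A, G, vocab):
    """A, G: lists of per-image annotation sets (results vs. ground truth); vocab: set V."""
    precisions = [len(a & g) / len(a) for a, g in zip(A, G) if a]
    recalls = [len(a & g) / len(g) for a, g in zip(A, G) if g]
    av_p = sum(precisions) / len(precisions)
    av_r = sum(recalls) / len(recalls)
    cov_rate = len(set().union(*A)) / len(vocab)       # fraction of vocabulary actually used
    return av_p, av_r, cov_rate

A = [{"earthquake", "rescue"}, {"rainbow", "sky"}]
G = [{"earthquake", "ruins", "rescue"}, {"rainbow"}]
print(evaluate(A, G, vocab={"earthquake", "rescue", "ruins", "rainbow", "sky", "band"}))
```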

6.2 Experiments on Dataset1

We evaluate the search-based image annotation task on Dataset1 and compare the results with SBIA [32] and LTA [34]. The detailed configuration of the experiments is described below.

  • SBIA: It requires both the query image and an initial correct query keyword as the input. So, we assign the 81 concepts in the Dataset1 to the corresponding query images as the initial query keywords.

  • LTA: For one query image to be annotated, it also requires several initial keywords. As with SBIA, we assign the 81 concepts in Dataset1 to the corresponding query images as the first keyword. The other two initial keywords are selected from their manual annotations sequentially.

  • Our algorithm: it requires the training set to be organized by hot topics. However, since the images in Dataset1 come from NUS-WIDE, they have not been organized according to hot topics, so we leverage the 81 concepts to organize the training set. We calculate the color histogram of each image as its visual feature and take the mean value over each concept; this mean value is used as the cluster center for the concept, and each concept is indexed by its cluster center. Then, for a query image, we find several neighbor concepts whose centers are closest to the query image, after which similar images are identified among those belonging to the selected concepts. With these images and their tags constructing a hypergraph, clustering is performed in order to annotate the query image (a minimal sketch of this organization follows this list).

  • The number of output tags is set to 10 in the experiments.
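As referenced above, the following is a minimal sketch (not the authors' code) of how the training set can be indexed by per-concept mean color histograms and how a query image is routed to its nearest concepts; all feature values and the choice of k are hypothetical.

```python
# Sketch: per-concept centroids of color histograms and nearest-concept lookup.
import numpy as np

def build_concept_centers(features_by_concept):
    """features_by_concept: {concept: array of shape (n_images, dim)}"""
    return {c: feats.mean(axis=0) for c, feats in features_by_concept.items()}

def nearest_concepts(query_feat, centers, k=3):
    """Return the k concepts whose centers are closest to the query feature."""
    dists = {c: np.linalg.norm(query_feat - center) for c, center in centers.items()}
    return sorted(dists, key=dists.get)[:k]

rng = np.random.default_rng(0)
features = {f"concept_{i}": rng.random((20, 64)) for i in range(81)}   # toy data
centers = build_concept_centers(features)
print(nearest_concepts(rng.random(64), centers, k=3))
```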

Table 1 shows the annotation performance of the three algorithms. From the results, our algorithm is clearly better than SBIA and LTA in annotation performance. Both SBIA and LTA require one or several correct initial keywords to start with, which greatly helps the quality of their final annotations. Even so, our algorithm, which does not need any correct initial keyword, still achieves better performance. Although SBIA is similar to our algorithm in its search-based mechanism and leverages visual similarity to search for candidates, it lacks the filtering process based on hypergraph clustering. We do not pick the keywords directly from the annotations of all the candidates. Instead, we leverage the semantic relevance relationship between the images and annotation words to filter out the images with irrelevant semantics. Through hypergraph clustering, several clusters of candidates are formed; the cluster of candidates with the most similar visual features to the query image is then selected, and the final annotation words are picked from this cluster, so the final annotations are much more accurate. LTA considers three relevance relationships: the tag–to–tag co-occurrence relationship based on text characteristics (TC), the tag content correlation based on visual similarity (TCC), and the image-conditioned tag correlation (ITC). For example, when calculating the TCC for the tag pair (t i , t j ), LTA collects all the images annotated with t i and t j respectively and then computes the visual similarity of these two image sets based on a VLM model; the visual similarity of the two image sets represents the visual similarity of the two tags. However, because of the well-known semantic gap problem, even if the semantics of t i and t j are the same, their visual representations may be quite different, so building the tag content correlation based on the images carrying these tags suffers from the semantic gap problem. We achieve better performance because the hypergraph clustering mechanism in our algorithm filters out the images with irrelevant semantics, and the candidate cluster has better consistency in terms of visual features and semantics. As a result, the final annotation results are more consistent and accurate.

Table 1 Results of the comparison

From Table 1, we can also see that our algorithm has a better coverage rate than SBIA and LTA. This is because SBIA and LTA both tend to select the most common keywords for the testing images. In our algorithm, we calculate the annotation probability for each keyword in the candidate cluster. The annotation probability is related to the similarity between the query image and the candidate image, so even keywords that are used less often in the cluster can be selected to annotate the images.

Figure 3 shows the distribution of the average accuracy rates over the 81 concepts for the 3 algorithms. From the distribution curve, these 10 concepts have the lowest accuracy rates with our algorithm: “earthquake”, “statue”, “rainbow”, “running”, “wedding”, “book”, “castle”, “flags”, “temple”, “train”. After analysis, we found three reasons for the low accuracy rates of these concepts. First, the training set contains fewer images for these 10 concepts than for the other concepts, so it is hard to find enough relevant candidates for query images from these concepts. Second, there are many noisy tags in these concepts, such as in the concepts of “earthquake”, “book”, “castle”, and “running”. Third, the semantic gap problem also affects some concepts and reduces the annotation quality. For example, in the “rainbow” concept, there are two types of topics in the images: rainbow-nature and rainbow-band. Both are annotated with “rainbow” but differ greatly in visual characteristics, so an effective learning model cannot be built for this concept. In experiment 2, we update the training set for these 10 concepts based on hot Internet topics, and in experiment 3 we evaluate the resulting improvement in annotation.

Fig. 3 Average precisions of the 81 concepts with the 3 approaches

6.3 Experiments on Dataset2

6.3.1 The parameters in experiment 2

As described in Section 6.1.1, we search the Internet for web pages and related images according to the 34 topics, and the search results form the hot topic corpus. For each topic, we leverage the LDA algorithm to build the topic model for the web pages. Before experiment 2, we set the number of LDA topics to 5, 8, 10, 15, 20, 30, 40, and 50 and evaluated the results of LDA training and complex graph clustering in order to find the best topic number for the LDA algorithm and the best cluster number for complex graph clustering. We found that the clustering result is best when the LDA topic number is set to 10 and the cluster number is set to 3. So, in experiment 2, the topic dimension for each web page is always set to 10 and the number of clusters to 3. The number of keywords for each hot topic is consistently set to 10.

6.3.2 Measuring the effectiveness of the three relationships

We leverage three relevance relationships in our algorithm for abstract semantic modeling: topic–to–topic co-occurrence relationship, topic–to–image relevance relationship and image-to-image similarity relationship. Through the complex graph clustering, the image set will be clustered into several clusters with different semantics and visual features. In order to evaluate the effectiveness of these three relationships, we conduct three groups of experiments to evaluate the performance of clustering based on different relationships. The Normalized Mutual Information (NMI) is used as the quantitative measurement for clustering performance evaluation.

  1. Based on the three relevance relationships proposed in this paper, complex graph clustering is performed and the NMI is calculated from the image clustering results.

  2. Based on two relevance relationships (topic–to–topic co-occurrence and topic–to–image relevance), complex graph clustering is performed and the NMI is calculated from the clustering result.

  3. The baseline for this evaluation is the NMI result of the K-means algorithm. The LDA topics of the images are used as the document features; together with the visual features of the images, K-means clustering is performed as the baseline.

NMI is a standard performance measure for clustering. Let k be the number of clusters and let λ = (λ 1, ⋯ , λ N ) be the cluster vector, where λ i  ∈ {1, ⋯ , k} and λ i  = j denotes that the i-th item belongs to cluster C j . If λ (a) and λ (b) are the clustering result and the ground truth vectors respectively, the NMI criterion Φ can be calculated as follows [27]:

$$ \Phi^{(\textrm{NMI})}\left( \lambda^{(a)}, \lambda^{(b)} \right) = \frac{\sum_{h=1}^k \sum_{l=1}^k n_{hl} \log \left( \frac{n \cdot n_{hl}}{n_h^{(a)} \cdot n_l^{(b)}} \right)}{\sqrt{\left( \sum_{h=1}^k n_h^{(a)} \log \frac{n_h^{(a)}}{n} \right) \left( \sum_{l=1}^k n_l^{(b)} \log \frac{n_l^{(b)}}{n} \right) }} $$
(10)

where \(n_h^{(a)}\) is the number of items in cluster C h in λ (a), while \(n_l^{(b)}\) denotes the number of items in cluster C l in λ (b). n hl is the number of items in both cluster C h and cluster C l . Based on this definition, the clustering result is better if \(\Phi^{(\textrm{NMI} )}(\lambda^{(a)}, \lambda^{(b)})\) is bigger. The theoretical maximum is 1 for \(\Phi^{(\textrm{NMI} )}(\lambda^{(a)}, \lambda^{(b)})\).
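For reference, Eq. (10) corresponds to mutual information with a square-root (geometric) normalization over the two entropies, which can be computed, for example, with scikit-learn; the label vectors below are hypothetical.

```python
# Sketch: NMI between predicted clusters and ground truth labels.
from sklearn.metrics import normalized_mutual_info_score

predicted = [0, 0, 1, 1, 2, 2, 2]                      # hypothetical cluster assignments
ground_truth = [0, 0, 1, 1, 1, 2, 2]
print(normalized_mutual_info_score(ground_truth, predicted,
                                   average_method="geometric"))
```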

6.3.3 Performance evaluation of semantic annotation based on hot topics

  (1) Evaluation of the annotation results for 10 concepts

As shown in Table 2, after complex graph clustering, several sub-topics are formed for each concept.

Table 2 Some of the concepts and the corresponding sub-topics

From Table 2, several keywords in the concepts are visually polysemous and semantically ambiguous, which means one concept may have multiple visual representatives. In the training set, although several images may belong to the same concept, their visual characteristics can be quite different, so it is very challenging to learn the correct relevance relationship from this kind of training set. In our algorithm, we use complex graph clustering to categorize the images in one concept into sub-topics. The images in each sub-topic then have largely consistent visual characteristics and semantics.

The NMI of the clustering results based on different relevance relationships is shown in Fig. 4. As the figure shows, clustering based on all three relationships delivers the best performance and separates the visually polysemous or ambiguous concepts into several sub-topics. When only the topic–to–topic co-occurrence and topic–to–image relevance relationships are used, the visual similarity between images is ignored; even though a cluster may then have similar semantics, the visual differences within it can still be significant.

Fig. 4 NMI of clustering with respect to the 10 concepts

Figure 5 shows examples of the hot topics and corresponding keywords, such as rainbow and running.

  (2) Evaluation of the annotation results for visually polysemous words

Fig. 5 Examples of sub-topics and the corresponding keywords

We evaluated the effectiveness and performance of the three relevance relationships using four visually polysemous words as initial query keywords, i.e., “apple”, “tiger”, “mouse” and “shark”. The experimental results are shown in Figs. 6 and 7. For the “apple” keyword, the clustering results contained three sub-topics: apple as a fruit, Apple products and Apple Inc. There were two sub-topics for “tiger”: tiger as an animal and Tiger Woods. There were two sub-topics for the “mouse” keyword: mouse as an animal and the computer mouse. There were three sub-topics for the “shark” keyword: the shark as an animal, Shaq O’Neal, and the band Shark. Thus, although the input query keywords were semantically ambiguous, we could determine the correct clustering results based on the visual content, and we also annotated each cluster correctly according to its semantics. The NMI results for clustering are shown in Fig. 8. The complex graph clustering method considers three types of relationships, so it delivered better performance than complex graph clustering with two types of relationships and the K-means algorithm with a single relationship.

  (3) Evaluation of the image annotation performance for hot Internet topics

Fig. 6 Examples for polysemous words “tiger” and “mouse”

Fig. 7 Examples for polysemous words “apple” and “shark”

Fig. 8 NMI of clustering with respect to the polysemous words

In this experiment, we collected the food safety accidents reported on the Internet and organized them into 20 categories as described in Section 6.1.1. The NMI of the clustering results with respect to the food safety accidents is shown in Fig. 9, which demonstrates that the image clusters and topic clusters were consistent in visual content and semantics. The annotation results extracted from the topic clusters conveyed the semantics of the food safety accidents correctly.

Fig. 9 NMI of clustering with respect to food safety accidents

For example, Fig. 10 shows that for image “capsule”, the extended annotation included Chromium, heavy metal, and poisonousness. For image “yoghourt”, the annotation was “gelatin” because this food safety issue was related to the illegal addition of gelatin to yoghourt. The annotations of the images for “red, yolk, and duck egg” had the abstract semantics “Sudan red”, “additive”, etc.

Fig. 10 Examples for food safety accidents

6.4 Experiments on Dataset3

In experiment 2, 47 hot topics and their keywords are constructed from the 34 topics. Now, we add the new hot topics into Dataset1 and replace the original 10 concepts with the lowest accuracy rates. The updated data set is called Dataset3. We perform the following experiments in this section.

  1. Annotation experiments on the original images in these 10 concepts.

  2. Annotation experiments based on the 20 hot topics on food safety accidents.

As shown in Fig. 11, based on the updated training set for these 10 concepts, the annotation results of SBIA, LTA and our algorithm are all improved. For our algorithm, the average precision of the 10 concepts on the updated annotated set is improved by 5.4x. The reasons are twofold. After updating, the concepts with polysemous or ambiguous images are separated into several sub-topics; in each sub-topic, the image set and the text set have more consistent semantics and visual features. Meanwhile, the keywords based on hot topics have been merged into the annotations of the related images, so the annotation accuracy of the training set is improved. Consequently, the final annotation performance is also improved. Our experiments demonstrate that an effective update mechanism for the training set is highly desirable for image annotation.

Fig. 11 Precision of the 10 topics

Figure 12 shows the annotation performance of 20 topics on food safety accidents. The average precision of all 20 topics with our approach, SBIA and LTA are 0.67, 0.55 and 0.44 respectively. The results demonstrate the effectiveness of the update and the superiority of our approach. Figure 13 shows some examples of our annotation results.

Fig. 12 Precision of 20 topics of food safety accidents

Fig. 13 Examples of our annotation results

6.5 Experiments on Dataset4

Most previous annotation algorithms construct annotation mapping models based on one annotated training set, and test images are then annotated according to these mapping models. In their experiments, the number of images in the training set is generally larger than that in the test set. This configuration does not match the real environment, since the number of annotated images on the Internet is much smaller than that of un-annotated ones. So in this experiment, the ratio of the number of images in the training set to that in the test set is made 1:5.5 in order to evaluate the annotation algorithm in an environment closer to the real one.

As shown in experiment 1, in the search-based annotation sub task, we first leverage visual features, such as the color histogram and wavelet texture, to find candidates in the training set that are similar to the query image. The keywords are then extracted from this candidate set for annotation. Because of the semantic gap problem, images with similar visual features may have different semantics, so hypergraph clustering is used to remove the inconsistent images from the candidate set in order to mitigate the semantic gap problem. With this filtering mechanism, the performance of our algorithm is not strongly correlated with the particular visual features exploited. In experiment 4, the annotation is based on the following four combinations of features.

  • Color histogram and wavelet texture

  • Color histogram and SIFT

  • SIFT and wavelet texture

  • SIFT and SIFT

The results in Table 3 show that the annotation performances based on different combinations of features are quite close. This confirms the effectiveness of our annotation algorithm: the candidates remaining after filtering have similar visual features and semantics. Moreover, compared to the test set, the number of images in the training set is very small, which resembles the real running environment on the Internet; even in such an environment, our algorithm can still achieve good results.

Table 3 Results with different combination of features

In addition, with the combination of color histogram and wavelet texture, we also performed experiments on the whole training set of 161,789 images, as well as on a subset of 1,960 images obtained by selecting 10 images from each subtopic. The precisions with training sets of size 161,789, 19,600 and 1,960 are 60.2, 56.8 and 36.0 % respectively. Note that each time the training set size decreases to approximately 10 % of the previous one, the precision drops by 5.6 % and then 36.6 % respectively. The 36.6 % decrease is mainly due to the fact that the annotations covered by the small training set (1,960 images) are not sufficient to annotate all the test images. Using the entire set (161,789 images) provides only minor improvement over the training set of 19,600 images, since the additional images in the entire set largely share the same annotations as the others. This indicates that the annotation coverage of the training set is more important than its size.

7 Conclusion

In this study, we develop a new high-level semantic annotation method for images based on hot Internet topics. This method has two sub tasks: search-based image annotation and the dynamic update of the training set based on hot Internet topics. It exploits the large-scale image resources available on the Internet for image annotation and regularly updates the training set from hot Internet topics in an efficient way. We propose a new method to model the abstract semantics of images. Three sets of relationships between topics and images are exploited, and through complex graph clustering, the hot Internet topics are extracted from images with similar visual contents. The experiments demonstrate the effectiveness of the three relationships and the complex graph clustering, which ensure that images with similar visual contents and close topics form one cluster; the keywords from this cluster are good representatives of the hot topics. The dynamic update mechanism of the training set addresses the huge computational cost of traditional update methods, which require re-calculating all the relevance relationships between tags and the visual features of images. The experiments also show that the updated training set delivers better annotation results, since it reduces the impact of the semantic gap problem for visually polysemous words. The search-based image annotation effectively filters out semantically irrelevant images via the hypergraph mechanism. We calculate the annotation probability for each keyword in the candidate cluster; this probability is related to the similarity between the query image and the candidate images, so even tags that are used less often in the cluster can be selected to annotate the images. The experiments show that the annotation performance of our method is better than that of the state-of-the-art algorithms.

In future work, we will investigate the feasibility of our approach in a large-scale data center and evaluate its computing requirements in the cloud.