Elsevier

Knowledge-Based Systems

Volume 192, 15 March 2020, 105259

Social image refinement and annotation via weakly-supervised variational auto-encoder

https://doi.org/10.1016/j.knosys.2019.105259

Abstract

The ever-increasing number of social images and their corresponding imperfect labels have made social image refinement and annotation a crucial problem in supervised learning. However, previous models based on nearest neighbors or matrix completion are limited when the social image set is huge and labels are highly sparse. Deep generative models utilize inference and generative networks to infer latent variables from an observed data variable; they can handle imperfect data, capture noisy data, and fill in missing data variables. In this paper, we propose a new social image refinement and annotation model based on the weakly-supervised variational auto-encoder generative model. First, we formulate the social image refinement and annotation problem as a joint distribution of social images and labels in a probabilistic generative model. Second, we derive a new evidence lower bound objective to handle imperfect labels. Third, we design a new multi-layer neural network including inference and generative networks to optimize the new evidence lower bound efficiently. Finally, we compare our model with other representative models on several real-world social image datasets. Experimental results on social image refinement and annotation tasks show that the proposed model is competitive with, or even better than, existing state-of-the-art models.

Introduction

Billions of social images associated with user-generated labels from sites such as Flickr are easy to collect, and their number is continuously growing. The ever-increasing number of social images, combined with the difficulty of obtaining massive labeled images, has made social image refinement and annotation one of the most important problems in the field of computer vision. Because social networking site users are mostly amateurs, the resulting social image labels will most likely be low-quality, irrelevant, and incomplete. Consequently, powerful social image refinement and annotation models with the capability of refining these labels have attracted widespread interest in recent years.

The goal of social image refinement and annotation is to refine noisy labels and automatically assign missing labels to social images. Over the past years, various models for social image refinement and annotation have been developed. Previous models proposed by [1], [2], [3], [4], [5], [6], [7] explored the visual similarity between a test image and training images and assigned labels by ranking the scores of visually similar neighbor images. This type of model utilizes image features to predict the labels of images while ignoring their inherent label information. Matrix completion models are among the most widely applied approaches to social image refinement and annotation because of their impressive performance. Such models represent the relationship between social images and user-generated labels through an image-label matrix. One successful approach among matrix completion models is to complete the image-label matrix directly with an image feature matrix as a regularization constraint [8], [9], [10]. Another successful approach is to learn effective common codes of image features and label features directly by matrix factorization from the image-label relationship matrix. The matrix factorization technique predicts which labels an image will prefer by discovering and exploiting the common latent codes across images and labels. The techniques proposed by [11], [12], [13], [14], [15] factorized the image-label matrix into two low-rank matrices with error sparsity. Li et al. [16], [17] learned the common latent representations for images and labels using a deep neural network with multiple layers of linear transformations. Extensive surveys of social image refinement and annotation are available in [18], [19]. However, the image-label matrix is often highly sparse in the real world, causing the performance of matrix completion models to degrade significantly.
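As an illustration of the matrix factorization idea described above, the following minimal sketch fits two low-rank factors to the observed entries of a toy image-label matrix by stochastic gradient descent. It is not the method of any cited work: the function names `factorize` and `predict`, the hyperparameters, and the toy data are all assumptions made for illustration, and the error-sparsity term used in [11], [12], [13], [14], [15] is omitted.

```python
import random


def factorize(R, mask, rank=2, steps=2000, lr=0.05, reg=0.01, seed=0):
    """Fit R ~ U V^T using only entries where mask[i][j] is truthy.

    R    : image-label matrix as a list of rows (1 = label present).
    mask : same shape as R; 0 marks an entry treated as unobserved.
    All hyperparameters are illustrative assumptions.
    """
    rng = random.Random(seed)
    n, m = len(R), len(R[0])
    U = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(m)]
    for _ in range(steps):
        for i in range(n):
            for j in range(m):
                if not mask[i][j]:
                    continue  # unobserved entries contribute no gradient
                pred = sum(U[i][k] * V[j][k] for k in range(rank))
                err = R[i][j] - pred
                for k in range(rank):
                    u, v = U[i][k], V[j][k]
                    # gradient step on squared error with L2 regularization
                    U[i][k] += lr * (err * v - reg * u)
                    V[j][k] += lr * (err * u - reg * v)
    return U, V


def predict(U, V, i, j):
    """Predicted score of label j for image i."""
    return sum(a * b for a, b in zip(U[i], V[j]))


# Toy example: 3 images, 3 labels, entry (1, 1) treated as missing.
R = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
mask = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]
U, V = factorize(R, mask)
print(round(predict(U, V, 1, 1), 2))  # score assigned to the missing entry
```

The score for the missing entry comes entirely from the low-rank structure shared with the other rows and columns, which is why such models weaken as the matrix becomes sparser: fewer observed entries constrain each latent code.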

Recently, deep generative models based on the variational auto-encoder have been very successful in learning latent codes [20], [21], [22], [23], [24] and have also demonstrated promising results for computer vision tasks [25], [26], [27], [28]. Wang et al. [24] utilized a variational auto-encoder to determine the common sources of variation for multi-view data in multi-view representation learning. Ramakrishna et al. [27] proposed a novel generative model called TELBO, which could create meaningful images across novel concrete and abstract visual attributes. Suzuki et al. [23] presented a joint multi-modal variational auto-encoder (JMVAE), which could bi-directionally learn a joint representation among all modalities. Tang et al. [22] demonstrated the effectiveness of their conditional variational auto-encoder (MEDL_CVAE) approach with rich contextual information in two related real-world applications.

Despite their widespread use in learning disentangled representations and generating realistic images for computer vision applications, variational auto-encoders have received little attention in the literature on social image refinement and annotation. Here, we extend the variational auto-encoder [29] to social image refinement and annotation with imperfect labels as weak supervision. Variational auto-encoder generative models utilize ‘inference’ and ‘generative’ networks to infer the common latent variables from an observed data variable. In other words, we model the joint distribution of social images and labels as a variational auto-encoder model on the input observation. Given sufficient social images and labels, it is theoretically feasible to learn a joint distribution over the images and labels by optimizing the evidence lower bound (ELBO) objective.
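For reference, the ELBO of a standard variational auto-encoder for a single observed variable x can be written as below. This is the well-known textbook form, not the new objective derived in this paper; the symbols q_φ(z|x) for the inference network with parameters φ and p_θ(x|z) for the generative network with parameters θ are notation assumed here.

```latex
\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
\;-\; \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
```

The first term rewards accurate reconstruction of the observation from the latent code, and the second keeps the approximate posterior close to the prior p(z).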

However, missing inputs in observation space require data-specific optimization strategies. For example, a missing input optimization strategy designed for label content might not be suitable for exploiting image content. The imperfect labels also degrade the robustness of learning common latent codes of images and labels. Here, we show that a carefully designed ELBO objective is well suited to handling missing and noisy labels. To make use of the missing and noisy inputs from labels and avoid over-fitting, we build a new multi-layer neural network structure with inference and generative networks in the stochastic gradient variational Bayes framework. Empirically, we confirm that employing a variational auto-encoder generative model is more robust regardless of the scarcity and noisiness of the data.
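To make the stochastic gradient variational Bayes idea concrete, the sketch below computes a one-sample Monte Carlo estimate of a standard ELBO for a label vector, using the reparameterization trick and the closed-form Gaussian KL term. Restricting the Bernoulli likelihood to observed labels through a mask is an assumption made here to illustrate one plausible way of handling missing labels; it is not the ELBO derived in this paper. The names `elbo_estimate`, `toy_decode`, and all toy values are hypothetical.

```python
import math
import random


def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def masked_bernoulli_log_lik(labels, probs, observed):
    """Bernoulli log-likelihood summed over observed labels only.

    Skipping unobserved labels is an illustrative assumption,
    not the paper's derived objective.
    """
    total = 0.0
    for y, p, o in zip(labels, probs, observed):
        if o:
            p = min(max(p, 1e-7), 1.0 - 1e-7)  # avoid log(0)
            total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def elbo_estimate(labels, observed, mu, log_var, decode, rng):
    """One-sample ELBO estimate: reconstruction term minus KL term."""
    z = reparameterize(mu, log_var, rng)
    return (masked_bernoulli_log_lik(labels, decode(z), observed)
            - gaussian_kl(mu, log_var))


def toy_decode(z):
    """Hypothetical stand-in for a generative network: one sigmoid per label."""
    return [1.0 / (1.0 + math.exp(-(z[0] + z[1]))),
            1.0 / (1.0 + math.exp(-(z[0] - z[1])))]


rng = random.Random(0)
value = elbo_estimate([1, 0], [1, 0], [0.3, -0.2], [-1.0, -1.0], toy_decode, rng)
print(value)
```

In a real model, `mu` and `log_var` would be produced by an inference network and the estimate would be maximized with a gradient-based optimizer; here they are fixed numbers so that the arithmetic can be followed by hand.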

Specifically, the main contributions of this work are summarized as follows: First, we formulate the social image refinement problem as a joint distribution of social images and labels in a probabilistic generative model. Second, we derive a new ELBO objective to handle imperfect labels, capture noisy data, and fill in missing labels. Third, we design new inference and generative networks to optimize the new ELBO efficiently. Finally, we compare our model with three previous models on several real-world social image datasets.

The remainder of this paper is organized as follows. First, we provide a summary of related work on the variational auto-encoder in Section 2. In Section 3, we present a new model and demonstrate how to extend the variational auto-encoder to create models that can handle social image refinement and annotation. Experimental results on three popular datasets and performance comparisons are presented in Section 4. We conclude with a summary of this work and a discussion of future work in Section 5.

Related work

The variational auto-encoder generative model is a useful approach to modeling high-dimensional data, such as images, especially when the data is partially observed or missing. The variational auto-encoder utilizes a latent variable to model partially observed data instead of observed variables. The distribution of the partially observed data $x$ is defined as $p(x)=\int p_\theta(x|z)\,p(z)\,dz$, where $p(z)$ is a prior on latent variable $z$, and $p_\theta(x|z)$ is a likelihood on observation $x$ parameterized by $\theta$. The

Weakly-supervised variational auto-encoder

In this section, we introduce a new weakly-supervised variational auto-encoder (WSVAE) for social image refinement and annotation. First, we formulate the social image refinement and annotation task in a probabilistic generative model based on the variational auto-encoder. Secondly, we extend the variational auto-encoder by devising a new ELBO to handle missing inputs in the testing process. Finally, we provide an overview of the model architecture of WSVAE for both training and testing

Experiments

In this section, we test the WSVAE model on three groups of datasets (Train1M-MIRFlickr, Train1M-Flickr51, and Train1M-NUS-WIDE), using the objective $\mathrm{ELBO}_{\mathrm{wsvae}}$ defined in Eq. (12). We measure the quality of the resulting models using the standard criteria and show that our proposed model works in a qualitatively reasonable manner for social image refinement and annotation.

Conclusion

Along with the recent development of various deep generative models, there is growing research interest in social image refinement and annotation tasks based on the variational auto-encoder. In this paper, we proposed a new weakly-supervised variational auto-encoder for social image refinement and annotation. We showed that the variational auto-encoder generative model is particularly well suited to modeling social images and their imperfect labels in a joint distribution. We also derived a

Acknowledgments

This work is partly supported by the National Natural Science Foundation of China under Grant No. 61502104, the Science and Technology Project A of the Education Department in Fujian Province under Grant No. JT180478, and the Scientific Research and Innovation Project on Science Technology Plan of Putian under Grant No. 2018ZP10.

References (46)

  • Wang S. et al. Penalized nonnegative matrix tri-factorization for co-clustering. Expert Syst. Appl. (2017)
  • Wang S. et al. Robust co-clustering via dual local learning and high-order matrix factorization. Knowl.-Based Syst. (2017)
  • Makadia A. et al. Baselines for image annotation. Int. J. Comput. Vis. (2010)
  • Znaidia A. Tag completion based on belief theory and neighbor voting
  • Cui C. et al. Social tag relevance learning via ranking-oriented neighbor voting. Multimedia Tools Appl. (2017)
  • Xu C. et al. Stacked autoencoder based weak supervision for social image understanding. IEEE Access (2019)
  • Guillaumin M. et al. Tagprop: Discriminative metric learning in nearest neighbor models for image auto-annotation
  • Ba Q.T. et al. Content is still king: the effect of neighbor voting schemes on tag relevance for social image retrieval
  • Li X. et al. Learning social tag relevance by neighbor voting. IEEE Trans. Multimed. (2009)
  • Wu L. et al. Tag completion for image retrieval. IEEE Trans. Pattern Anal. Mach. Intell. (2012)
  • Xu X. et al. Non-linear matrix completion for social image tagging. IEEE Access (2017)
  • Feng Z. et al. Image tag completion by noisy matrix recovery. Lecture Notes in Comput. Sci. (2014)
  • Zhu G. et al. Image tag refinement towards low-rank, content-tag prior and error sparsity
  • Ye X. et al. Image tag completion via image-specific and tag-specific linear sparse reconstructions
  • Li Z. et al. Image annotation using multi-correlation probabilistic matrix factorization
  • Li Z. et al. Weakly supervised deep matrix factorization for social image understanding. IEEE Trans. Image Process. (2017)
  • Li Z. et al. Deep collaborative embedding for social image understanding. IEEE Trans. Pattern Anal. Mach. Intell. (2019)
  • Li X. et al. Socializing the semantic gap: A comparative survey on image tag assignment. ACM Comput. Surv. (2016)
  • Wang X. et al. Visual understanding by mining social media: recent advances and challenges. Front. Comput. Sci. (2018)
  • Kingma D.P. et al. Semi-supervised learning with deep generative models
  • Sohn K. et al. Learning structured output representation using deep conditional generative models
  • Tang L. et al. Multi-entity dependence learning with rich context via conditional variational auto-encoder
  • M. Suzuki, K. Nakayama, Y. Matsuo, Joint multimodal learning with deep generative models, arXiv preprint...

    No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.knosys.2019.105259.