Elsevier

Neurocomputing

Volume 312, 27 October 2018, Pages 218-228

Boosting image sentiment analysis with visual attention

https://doi.org/10.1016/j.neucom.2018.05.104

Abstract

Sentiment analysis plays an important role in the behavioral sciences; it aims to determine the attitude of a speaker or writer regarding some topic, or the overall contextual polarity of a document. The problem is nevertheless not trivial, especially when inferring sentiment or emotion from visual contents, such as images and videos, which are becoming pervasive on the Web. Observing that the sentiment of an image may be reflected by only some spatial regions, a natural question is how to locate the attended spatial areas for enhancing image sentiment analysis. In this paper, we present Sentiment Networks with visual Attention (SentiNet-A), a novel architecture that integrates visual attention into the successful Convolutional Neural Network (CNN) sentiment classification framework and is trained in an end-to-end manner. To model visual attention, we develop multiple layers to generate the attention distribution over the regions of the image. Furthermore, the saliency map of the image is employed as prior knowledge and as a regularizer to holistically refine the attention distribution for sentiment prediction. Extensive experiments are conducted on both the Twitter and ARTphoto benchmarks, and our framework achieves superior results compared to state-of-the-art techniques.

Introduction

With the popularity of social networks and mobile devices, a huge volume of images and videos is captured by users every day and everywhere to record all kinds of activities in their lives. For example, people may share their travel experiences, their opinions towards certain events, and so on. Automatically analyzing the sentiment of such multimedia content is demanded by many practical applications, such as smart advertising, targeted marketing and political voting forecasts. Compared with text-based sentiment analysis, which infers emotional signals from short textual descriptions, visual contents such as color contrast and tone can provide more vivid clues to the sentiment behind them. Fig. 1 shows example images from Twitter. Apparently, the images in the upper row manifest positive sentiment, while those in the lower row convey negative emotion.

Image sentiment analysis is a high-level abstraction concerning the affect conveyed by an image, and needs to bridge the large affective gap between low-level visual features and high-level sentiment. In the literature, there have been several sentiment analysis techniques, including low-level visual feature-based approaches [1], [2], [3], semantic-level feature-based models [4], [5], [6] and deep learning architectures [7], [8], [9]. Though impressive results have been reported by existing image sentiment analysis approaches, these techniques often encode an entire image into a fixed-dimensional representation, leaving the regions of the image that are most indicative of the sentiment not fully exploited. This is especially important when there is a lot of clutter in an image. Take the first image in the lower row of Fig. 1 as an example: there are five main objects in the image, including a person, a car, a building, a road and a tree. To predict the sentiment of the image, we need to first locate those objects, then rule out the irrelevant ones (e.g., the tree and the building) and finally pinpoint the person region, in this case, to infer the sentiment. Therefore, in this paper we particularly investigate architectures that exploit visual attention to boost image sentiment analysis.

By consolidating the idea of visual attention into the analysis of image sentiment, we present a novel Sentiment Networks with visual Attention (SentiNet-A) architecture, as illustrated in Fig. 2. Given an input image, a Convolutional Neural Network (CNN) is exploited to produce a feature vector for each region of the image, followed by a multi-layer neural network that models the attention distribution over all the regions and locates those most informative for inferring the image sentiment. Moreover, considering that saliency generally reflects where attention is focused in an image, and inspired by the work in [10], which improves visual saliency computation with emotion intensity based on the strong relationship between saliency detection and sentiment analysis, we integrate saliency detection into visual attention learning as a regularizer to holistically ensure a correct attention distribution. Technically, to capture subtle visual contrast among multi-scale feature maps, a multi-scale Fully Convolutional Network (FCN) is employed to generate the saliency map [11]. As such, it is natural to optimize the whole architecture by simultaneously minimizing the classification loss of image sentiment and the distance between the learnt attention distribution and the saliency map. During prediction, the image representations weighted by the attention are fed into a fully-connected layer for image sentiment classification. It is worth noting that our SentiNet-A framework is trainable in an end-to-end fashion.
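
To make the joint objective concrete, the following PyTorch sketch shows one way the attention distribution and the saliency-based regularizer could be combined with the classification loss. The module name AttentionHead, the layer sizes, the mean-squared distance, and the weight lambda_sal are illustrative assumptions, not the authors' exact settings.

```python
# A minimal sketch of the joint objective described above (assumed hyper-parameters).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionHead(nn.Module):
    """Multi-layer network producing an attention distribution over image regions."""
    def __init__(self, feat_dim, hidden_dim=512):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, region_feats):                      # (B, R, D) region features from the CNN
        logits = self.score(region_feats).squeeze(-1)     # (B, R) unnormalized scores
        return F.softmax(logits, dim=-1)                  # attention distribution over regions

def joint_loss(class_logits, labels, attention, saliency, lambda_sal=0.1):
    """Sentiment classification loss plus a regularizer pulling attention towards saliency."""
    cls_loss = F.cross_entropy(class_logits, labels)
    # Normalize the (flattened) saliency map into a distribution before comparing it to attention.
    saliency = saliency / saliency.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    sal_loss = F.mse_loss(attention, saliency)            # distance between attention and saliency
    return cls_loss + lambda_sal * sal_loss
```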

The main contribution of this work is the proposal of a visual-attention-augmented architecture for image sentiment analysis. By identifying the most distinctive regions of an image from which to infer its sentiment, our work takes a further step towards enhancing image sentiment analysis. Our solution also leads to an elegant view of how visual attention should be modeled and leveraged in sentiment analysis, a problem not yet fully understood in the literature. Moreover, we comprehensively explore how to capitalize on the saliency map to holistically refine visual attention learning. Extensive experiments on two datasets validate our proposal in the context of both two-class and eight-class sentiment classification. In addition, we provide thorough discussions on good practices for training our architecture and on integrating saliency detection in different ways.

The remaining sections are organized as follows. Section 2 describes related work on image sentiment analysis and explorations of visual attention. Section 3 presents our proposed SentiNet-A architecture. Section 4 provides empirical evaluations, followed by conclusions in Section 5.


Related work

This paper mainly focuses on visual attention learning for image sentiment analysis. We briefly group the related work into two categories: visual sentiment analysis and explorations of visual attention.

Sentiment networks with visual attention

An overview of our Sentiment Networks with visual Attention (SentiNet-A) architecture is shown in Fig. 2. Specifically, the proposed SentiNet-A consists of three main components: a CNN pre-trained on an object recognition task and exploited for learning image representations, a multi-layer neural network that estimates the attention distribution (map) over all the regions towards image sentiment prediction, and a multi-scale FCN that produces a global saliency map. In the stage of sentiment prediction, the image representations weighted by the attention distribution are fed into a fully-connected layer to classify the sentiment of the image.
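
A minimal sketch of how these components could fit together at prediction time is given below, assuming for illustration a ResNet-50 backbone whose last convolutional feature map supplies the region features and reusing the AttentionHead module from the earlier sketch; the backbone choice, class SentiNetA, and the feature dimensions are assumptions, and the saliency branch is omitted here since it only contributes a training-time regularizer.

```python
# Prediction path: CNN region features -> attention -> weighted pooling -> classifier.
import torch.nn as nn
import torchvision.models as models

class SentiNetA(nn.Module):
    def __init__(self, num_classes=2, feat_dim=2048):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # keep the spatial feature map
        self.attention = AttentionHead(feat_dim)                   # multi-layer attention network (sketch above)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, images):                        # images: (B, 3, H, W)
        fmap = self.cnn(images)                       # (B, D, h, w) convolutional feature map
        regions = fmap.flatten(2).transpose(1, 2)     # (B, h*w, D) one feature vector per region
        alpha = self.attention(regions)               # (B, h*w) attention distribution over regions
        pooled = (alpha.unsqueeze(-1) * regions).sum(dim=1)  # attention-weighted image representation
        return self.classifier(pooled), alpha         # sentiment logits and the attention map
```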

Experiments

We evaluate and compare our proposed SentiNet-A with several state-of-the-art approaches on the image sentiment analysis task over two image sentiment benchmarks, i.e., the Twitter dataset [8] and the ARTphoto dataset [16]. The former is the most popular image sentiment benchmark, collected from tweets, and the latter is a public dataset of artistic photos spanning eight emotional categories.

Conclusions

We have presented the Sentiment Networks with visual Attention (SentiNet-A) architecture, which explores visual attention to enhance image sentiment analysis. Specifically, we study the problem of identifying the most informative regions for inferring the sentiment of an image. To verify our claim, a multi-layer neural network is devised and integrated into the standard CNN-based image classification framework to estimate the attention distribution. We optimize the whole architecture by simultaneously minimizing the classification loss of image sentiment and the distance between the learnt attention distribution and the saliency map.

Acknowledgment

This work was supported in part by the National Key Research and Development Program of China (2016YFC0201003), and the “Internet plus” major projects for the “Internet plus” coordinated manufacturing cloud service support platform.

References (58)

  • Y. Yang et al., How do your friends on social media disclose your emotions?, AAAI Conference on Artificial Intelligence, 2014.
  • S. Zhao et al., Predicting personalized image emotion perceptions in social networks, IEEE Trans. Affect. Comput., 2016.
  • C. Szegedy et al., Going deeper with convolutions, IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  • X. Lu et al., On shape and the computability of emotions, ACM International Conference on Multimedia, 2012.
  • S. Wang et al., Multiple emotion tagging for multimedia data by exploiting high-order dependencies among emotions, IEEE Trans. Multim., 2015.
  • D. Borth et al., SentiBank: large-scale ontology and classifiers for detecting sentiment and emotions in visual content, ACM International Conference on Multimedia, 2013.
  • J. Yuan et al., Sentribute: image sentiment analysis from a mid-level perspective, ACM International Workshop on Issues of Sentiment Discovery and Opinion Mining, 2013.
  • S. Zhao et al., Continuous probability distribution prediction of image emotions via multitask shared sparse regression, IEEE Trans. Multim., 2017.
  • V. Campos et al., Diving deep into sentiment: understanding fine-tuned CNNs for visual sentiment prediction, ACM International Workshop on Affect & Sentiment in Multimedia, 2015.
  • Q. You et al., Robust image sentiment analysis using progressively trained and domain transferred deep networks, AAAI Conference on Artificial Intelligence, 2015.
  • J. Wang et al., Beyond object recognition: visual sentiment analysis with deep coupled adjective and noun neural networks, International Joint Conference on Artificial Intelligence, 2016.
  • H. Liu et al., Improving visual saliency computing with emotion intensity, IEEE Trans. Neural Netw. Learn. Syst., 2016.
  • G. Li et al., Deep contrast learning for salient object detection, IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • M. Tkalcic et al., Affective labeling in a content-based recommender system for images, IEEE Trans. Multim., 2013.
  • K. Yadati et al., CAVVA: computational affective video-in-video advertising, IEEE Trans. Multim., 2013.
  • W. Wang et al., Image retrieval by emotional semantics: a study of emotional space and feature extraction, IEEE International Conference on Systems, Man and Cybernetics, 2006.
  • V. Yanulevskaya et al., Emotional valence categorization using holistic image features, IEEE International Conference on Image Processing, 2008.
  • J. Machajdik et al., Affective image classification using features inspired by psychology and art theory, ACM International Conference on Multimedia, 2010.
  • X. Wang et al., Interpretable aesthetic features for affective image classification, IEEE International Conference on Image Processing, 2013.
  • Y.-G. Jiang et al., Predicting emotions in user-generated videos, AAAI Conference on Artificial Intelligence, 2014.
  • T. Rao et al., Multi-scale blocks based image emotion classification using multiple instance learning, IEEE International Conference on Image Processing, 2016.
  • L. Pang et al., Deep multimodal learning for affective analysis and retrieval, IEEE Trans. Multim., 2015.
  • T. Chen, D. Borth, T. Darrell, S.-F. Chang, DeepSentiBank: visual sentiment concept classification with deep...
  • T. Rao, M. Xu, D. Xu, Learning multi-level deep representations for image emotion classification, arXiv...
  • T. Rao et al., Generating affective maps for images, Multimedia Tools and Applications, 2017.
  • S. Zhao et al., Discrete probability distribution prediction of image emotions with shared sparse learning, IEEE Trans. Affect. Comput., 2018.
  • S. Zhao et al., Learning visual emotion distributions via multi-modal features fusion, ACM International Conference on Multimedia, 2017.
  • J. Yang et al., Joint image emotion classification and distribution learning via deep convolutional neural network, International Joint Conference on Artificial Intelligence, 2017.
  • X. Zhu et al., Dependency exploitation: a unified CNN-RNN approach for visual emotion recognition, International Joint Conference on Artificial Intelligence, 2017.