Boosting image sentiment analysis with visual attention
Introduction
With the popularity of social networks and mobile devices, users capture a huge volume of images and videos every day and everywhere to record all kinds of activities in their lives. For example, people may share their travel experiences, their opinions towards certain events, and so on. Automatically analyzing the sentiment in such multimedia content is demanded by many practical applications, such as smart advertising, targeted marketing and political voting forecasts. Compared with text-based sentiment analysis, which infers emotional signals from short textual descriptions, visual content, such as color contrast and tone, can provide more vivid clues to reveal the underlying sentiment. Fig. 1 shows image examples from Twitter. Apparently, the images in the upper row manifest positive sentiment, while those in the lower row deliver negative emotion.
Image sentiment analysis is a high-level abstraction concerning the affect conveyed by an image, and must bridge the large affective gap between low-level visual features and high-level sentiment. In the literature, there have been several sentiment analysis techniques, including low-level visual feature-based approaches [1], [2], [3], semantic-level feature-based models [4], [5], [6] and deep learning architectures [7], [8], [9]. Though impressive results have been reported by existing image sentiment analysis approaches, these techniques often encode an entire image into a fixed-dimensional representation, leaving the regions of the image that are most indicative of its sentiment underexploited. This matters especially when there is a lot of clutter in an image. Take the first image in the lower row of Fig. 1 as an example: there are five main objects in the image, including a person, a car, a building, a road and trees. To predict the sentiment of the image, we need to first locate those objects, then rule out the irrelevant ones (e.g., the trees and the building) and finally pinpoint the region of the person to infer the sentiment. Therefore, in this paper we particularly investigate architectures that exploit visual attention to boost image sentiment analysis.
By consolidating the idea of visual attention into the analysis of image sentiment, we present a novel Sentiment Networks with visual Attention (SentiNet-A) architecture, as illustrated in Fig. 2. Given an input image, a Convolutional Neural Network (CNN) produces a feature vector for each region of the image, followed by a multi-layer neural network that models the attention distribution over all the regions and locates the regions most informative for inferring the image sentiment. Moreover, saliency is in general the cortical focus of attention in an image, and the work in [10] improves visual saliency computing with emotion intensity, evidencing a strong relationship between saliency detection and sentiment analysis. Inspired by this, we integrate saliency detection into visual attention learning as a regularizer to holistically ensure a correct attention distribution. Technically, to capture subtle visual contrast among multi-scale feature maps, a multi-scale Fully Convolutional Network (FCN) is employed to generate the saliency map [11]. As such, it is natural to optimize the whole architecture by simultaneously minimizing the classification loss of image sentiment and the distance between the learnt attention distribution and the saliency map. During prediction, the image representations weighted by the attention are input into a fully-connected layer for image sentiment classification. It is worth noting that our SentiNet-A framework is trainable in an end-to-end fashion.
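The attention component described above can be sketched in a few lines of numpy. This is a minimal illustration only: the tanh-activated scoring MLP, the feature dimensions and the weight shapes are our own assumptions for exposition, not the paper's exact design.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(regions, W1, w2):
    """Soft attention over K region feature vectors.

    regions: (K, D) CNN features, one vector per spatial region.
    W1: (D, H) hidden projection of the scoring network (assumed).
    w2: (H,) scoring vector mapping hidden states to scalars.
    Returns the attention distribution (K,) and the
    attention-weighted image representation (D,).
    """
    hidden = np.tanh(regions @ W1)   # (K, H) hidden states
    scores = hidden @ w2             # (K,) one relevance score per region
    alpha = softmax(scores)          # attention distribution over regions
    pooled = alpha @ regions         # (D,) weighted sum of region features
    return alpha, pooled

# toy example: 4 regions with 8-dim features
rng = np.random.default_rng(0)
regions = rng.normal(size=(4, 8))
W1 = rng.normal(size=(8, 16))
w2 = rng.normal(size=16)
alpha, pooled = attend(regions, W1, w2)
```

The pooled vector plays the role of the attention-weighted image representation that is fed to the final fully-connected classification layer.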
The main contribution of this work is the proposal of a visual attention augmented architecture for image sentiment analysis. By identifying the regions of an image most indicative of its sentiment, our work takes a further step towards enhancing image sentiment analysis. Our solution also offers an elegant view of how visual attention should be modeled and leveraged in sentiment analysis, a problem not yet fully understood in the literature. Moreover, we comprehensively explore how to capitalize on the saliency map to holistically refine visual attention learning. Extensive experiments on two datasets validate our proposal in the context of both two-class and eight-class sentiment classification. In addition, we provide thorough discussions on good practices for training our architecture and on different ways of integrating saliency detection.
The remaining sections are organized as follows. Section 2 describes related work on image sentiment analysis and the exploration of visual attention. Section 3 presents our proposed SentiNet-A architecture. Section 4 provides empirical evaluations, followed by the conclusions in Section 5.
Related work
This paper mainly focuses on visual attention learning for image sentiment analysis. We briefly group the related work into two categories: visual sentiment analysis and explorations of visual attention.
Sentiment networks with visual attention
An overview of our Sentiment Networks with visual Attention (SentiNet-A) architecture is shown in Fig. 2. Specifically, the proposed SentiNet-A consists of three main components: a CNN which is pre-trained on an object recognition task and exploited for learning image representations, a multi-layer neural network to estimate the attention distribution (map) over all the regions towards image sentiment prediction, and a multi-scale FCN to produce a global saliency map. In the stage of sentiment prediction, the region representations weighted by the attention distribution are fed into a fully-connected layer for sentiment classification.
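The joint objective that couples the three components can be sketched as a classification loss plus a saliency regularizer. This is a hedged sketch: the paper states only that the distance between the attention distribution and the saliency map is minimized, so the squared-L2 distance and the trade-off weight `lam` used here are illustrative assumptions.

```python
import numpy as np

def cross_entropy(probs, label):
    """Negative log-likelihood of the ground-truth sentiment class."""
    return -np.log(probs[label] + 1e-12)

def joint_loss(probs, label, alpha, saliency, lam=0.1):
    """Sentiment classification loss plus a saliency regularizer.

    probs:    predicted sentiment class probabilities.
    alpha:    (K,) attention distribution over K image regions.
    saliency: (K,) saliency map pooled to the same K regions and
              normalized to sum to 1.
    lam:      trade-off weight (hypothetical value, not from the paper).
    Squared L2 is used here as one illustrative choice of distance
    between the two distributions.
    """
    reg = np.sum((alpha - saliency) ** 2)
    return cross_entropy(probs, label) + lam * reg

# toy example: two-class prediction, uniform attention over 4 regions
probs = np.array([0.6, 0.4])
alpha = np.full(4, 0.25)
saliency = np.full(4, 0.25)
loss = joint_loss(probs, 0, alpha, saliency)
```

When the attention distribution matches the pooled saliency map, the regularizer vanishes and the objective reduces to the classification loss alone; otherwise the mismatch penalizes attention placed on non-salient regions.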
Experiments
We evaluate and compare our proposed SentiNet-A with several state-of-the-art approaches on the image sentiment analysis task using two image sentiment benchmarks, i.e., the Twitter dataset [8] and the ARTphoto dataset [16]. The former is the most popular image sentiment benchmark, collected from tweets, and the latter is a public dataset of artistic photos spanning eight emotional categories.
Conclusions
We have presented the Sentiment Networks with visual Attention (SentiNet-A) architecture, which explores visual attention to enhance image sentiment analysis. Specifically, we study the problem of identifying the most informative regions for inferring the sentiment of an image. To verify our claim, a multi-layer neural network is devised and integrated into a standard CNN-based image classification framework to estimate the attention distribution. We optimize the whole architecture by minimizing both the sentiment classification loss and the distance between the learnt attention distribution and the saliency map.
Acknowledgment
This work was supported in part by the National Key Research and Development Program of China (2016YFC0201003), and the “Internet plus” major projects for the “Internet plus” coordinated manufacturing cloud service support platform.
References (58)
- et al., "How do your friends on social media disclose your emotions?", AAAI Conference on Artificial Intelligence, 2014.
- et al., "Predicting personalized image emotion perceptions in social networks", IEEE Trans. Affect. Comput., 2016.
- et al., "Going deeper with convolutions", IEEE Conference on Computer Vision and Pattern Recognition, 2015.
- et al., "On shape and the computability of emotions", ACM International Conference on Multimedia, 2012.
- et al., "Multiple emotion tagging for multimedia data by exploiting high-order dependencies among emotions", IEEE Trans. Multim., 2015.
- et al., "Sentibank: large-scale ontology and classifiers for detecting sentiment and emotions in visual content", ACM International Conference on Multimedia, 2013.
- et al., "Sentribute: image sentiment analysis from a mid-level perspective", ACM International Workshop on Issues of Sentiment Discovery and Opinion Mining, 2013.
- et al., "Continuous probability distribution prediction of image emotions via multitask shared sparse regression", IEEE Trans. Multim., 2017.
- et al., "Diving deep into sentiment: understanding fine-tuned cnns for visual sentiment prediction", ACM International Workshop on Affect & Sentiment in Multimedia, 2015.
- et al., "Robust image sentiment analysis using progressively trained and domain transferred deep networks", AAAI Conference on Artificial Intelligence, 2015.
- "Beyond object recognition: visual sentiment analysis with deep coupled adjective and noun neural networks", International Joint Conferences on Artificial Intelligence.
- "Improving visual saliency computing with emotion intensity", IEEE Trans. Neural Netw. Learn. Syst.
- "Deep contrast learning for salient object detection", IEEE Conference on Computer Vision and Pattern Recognition.
- "Affective labeling in a content-based recommender system for images", IEEE Trans. Multim.
- "Cavva: computational affective video-in-video advertising", IEEE Trans. Multim.
- "Image retrieval by emotional semantics: a study of emotional space and feature extraction", IEEE International Conference on Systems, Man and Cybernetics.
- "Emotional valence categorization using holistic image features", IEEE International Conference on Image Processing.
- "Affective image classification using features inspired by psychology and art theory", ACM International Conference on Multimedia.
- "Interpretable aesthetic features for affective image classification", IEEE International Conference on Image Processing.
- "Predicting emotions in user-generated videos", AAAI Conference on Artificial Intelligence.
- "Multi-scale blocks based image emotion classification using multiple instance learning", IEEE International Conference on Image Processing.
- "Deep multimodal learning for affective analysis and retrieval", IEEE Trans. Multim.
- "Generating affective maps for images", Multimedia Tools and Applications.
- "Discrete probability distribution prediction of image emotions with shared sparse learning", IEEE Trans. Affect. Comput.
- "Learning visual emotion distributions via multi-modal features fusion", ACM International Conference on Multimedia.
- "Joint image emotion classification and distribution learning via deep convolutional neural network", International Joint Conference on Artificial Intelligence.
- "Dependency exploitation: a unified cnn-rnn approach for visual emotion recognition", International Joint Conference on Artificial Intelligence.