Information Fusion

Volume 66, February 2021, Pages 184-197

Full length article
What makes the difference? An empirical comparison of fusion strategies for multimodal language analysis

https://doi.org/10.1016/j.inffus.2020.09.005

Highlights

  • Cross-modal interactions spread across a multimodal sequence.

  • Multimodal models struggle with positive utterances.

  • Very deep representations do not always result in better multimodal embeddings.

  • The linguistic modality should work as a pivot for the nonverbal modalities.

  • The integration of modality context yields improved performance.

Abstract

Multimodal video sentiment analysis is a rapidly growing area. It combines verbal (i.e., linguistic) and non-verbal modalities (i.e., visual, acoustic) to predict the sentiment of utterances. A recent trend has been geared towards different modality fusion models utilizing various attention, memory and recurrent components. However, a systematic investigation of how these different components contribute to solving the problem, and of their limitations, is still lacking. This paper aims to fill that gap and makes the following key contributions. We present the first large-scale and comprehensive empirical comparison of eleven state-of-the-art (SOTA) modality fusion approaches on two video sentiment analysis tasks, using three SOTA benchmark corpora. An in-depth analysis of the results shows that, first, attention mechanisms are the most effective for modelling crossmodal interactions, yet they are computationally expensive. Second, additional levels of crossmodal interaction decrease performance. Third, positive sentiment utterances are the most challenging cases for all approaches. Finally, integrating context and utilizing the linguistic modality as a pivot for the non-verbal modalities improve performance. We expect that these findings will provide helpful insights and guidance for the development of more effective modality fusion models.

Introduction

Human language is inherently multimodal and is manifested via words (i.e., linguistic modality), gestures (i.e., visual modality), and vocal intonations (i.e., acoustic modality). Consequently, we need to process both verbal (e.g., linguistic utterances) and nonverbal signals (e.g., visual, acoustic utterances) to better understand human language. Verbal signals often vary dynamically in different nonverbal contexts. Even though comprehending human language is an easy task for humans, it is a non-trivial challenge for machines. Giving machines the capability to understand human language effectively opens new horizons for human–machine conversation systems [1], tutoring systems [2], and health care [3], to name a few applications.

The challenge of modelling human language lies in coordinating time-variant modalities. At its core, this research area focuses on modelling intramodal and crossmodal dynamics [4]. Intramodal dynamics refer to interactions within a specific modality, independent of other modalities; an example is word interactions in a sentence. Crossmodal dynamics refer to interactions across several modalities, for example, the simultaneous presence of a negative word, a frown, and a soft voice. Such interactions, occurring at the same time step, are called synchronous crossmodal interactions. Crossmodal interactions that span a long-range multimodal sequence are called asynchronous crossmodal interactions; for example, the negative word with the soft voice at time step t might interact with the frown at time step t+1.
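To make this distinction concrete, the following minimal PyTorch sketch (our own illustration, not code from any of the compared models) contrasts a synchronous fusion of features taken at the same time step with a crossmodal attention that lets one modality attend over all time steps of another, in the spirit of MulT-style crossmodal attention. The feature dimension, sequence length, and random tensors are placeholders.

```python
import torch
import torch.nn as nn

# Toy per-time-step features for one utterance (batch = 1, T time steps).
T, d = 5, 8
language = torch.randn(1, T, d)   # word embeddings
acoustic = torch.randn(1, T, d)   # vocal intonation features
visual   = torch.randn(1, T, d)   # facial gesture features

# Synchronous crossmodal interaction: fuse the three modalities
# at the same time step t (simple concatenation here).
sync_fusion = torch.cat([language, acoustic, visual], dim=-1)  # (1, T, 3d)

# Asynchronous crossmodal interaction: the language stream at step t
# attends over *all* visual time steps, so a word at t can interact
# with a frown that only appears at t+1.
cross_attn = nn.MultiheadAttention(embed_dim=d, num_heads=2, batch_first=True)
lang_attending_visual, weights = cross_attn(query=language, key=visual, value=visual)

print(sync_fusion.shape)            # torch.Size([1, 5, 24])
print(lang_attending_visual.shape)  # torch.Size([1, 5, 8])
print(weights.shape)                # torch.Size([1, 5, 5]); attention over visual steps
```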

Early approaches for learning multimodal representations have widely utilized conventional natural language processing (NLP) techniques in multimodal settings [5], [6], [7], [8]. A recent trend in multimodal embedding learning research is to build more complex models utilizing attention, memory, and recurrent components [9], [10], [11], [12], [13], [14], [15]. Various review papers have surveyed the advancements in multimodal machine learning [16], [17], [18], [19], [20]. In particular, they mostly provide an insightful organization of modality fusion strategies. They also identify broader challenges faced by multimodal representation learning, such as synchronization across different modalities, confidence level, and contextual information. However, none of them has conducted a comprehensive empirical study across different state-of-the-art (SOTA) fusion approaches to multimodal language analysis with the aim of providing a critical and experimental analysis. Such an extensive empirical evaluation would be useful for finding out which aspects of the SOTA approaches are the most effective in solving the problem of multimodal language analysis. This paper aims to fill this gap. In particular, we replicate and evaluate the most recent SOTA fusion approaches for modelling human language on three widely used benchmark corpora for multimodal sentiment and emotion analysis [21], [22], [23], and investigate the following Research Questions (RQ).

  • RQ1 How effective are the current machine learning-based multimodal fusion strategies for the sentiment analysis and emotion recognition tasks?

  • RQ2 How efficient are the SOTA multimodal fusion strategies, and how could the effectiveness affect efficiency, in the context of the multimodal sentiment and emotion analysis tasks?

  • RQ3 Which components/aspects in the multimodal language models and fusion strategies are the most effective?

The rest of the paper is organized as follows: Section 2 briefly reviews the related work. Section 3 describes the experiments in detail. The experimental results are presented and discussed in Section 4 (Results) and Section 5 (Discussion on key findings), respectively. Finally, Section 6 concludes the paper.

Section snippets

Related work

In this section, we provide a review of multimodal representation learning and multimodal time series for video sentiment analysis and emotion recognition.

Methodology

This section details the methodology we used for our empirical study of the most recent SOTA multimodal language fusion approaches, in the context of video sentiment and emotion analysis tasks. We first formulated the tasks on which our study was carried out. Sentiment analysis was a binary multimodal classification task inferring either positive or negative sentiment. Emotion recognition was a multimodal multilabel classification task inferring one or more emotions, e.g., happy and joyful.
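As a rough illustration of these two task formulations (not the authors' implementation), the sketch below attaches a binary sentiment head and a multilabel emotion head to a hypothetical fused utterance embedding. The embedding size and the emotion label set are placeholders chosen for the example.

```python
import torch
import torch.nn as nn

# Hypothetical utterance embeddings produced by some fusion model.
batch, d = 4, 128
fused = torch.randn(batch, d)

# Binary sentiment analysis: one logit per utterance, positive vs. negative.
sentiment_head = nn.Linear(d, 1)
sentiment_logit = sentiment_head(fused).squeeze(-1)
sentiment_labels = torch.tensor([1., 0., 1., 1.])          # 1 = positive
sentiment_loss = nn.BCEWithLogitsLoss()(sentiment_logit, sentiment_labels)

# Multilabel emotion recognition: one independent logit per emotion,
# so an utterance can carry several emotions at once.
emotions = ["happy", "sad", "angry", "neutral"]             # illustrative label set
emotion_head = nn.Linear(d, len(emotions))
emotion_logits = emotion_head(fused)
emotion_labels = torch.tensor([[1., 0., 0., 0.],
                               [0., 1., 0., 0.],
                               [1., 0., 0., 1.],
                               [0., 0., 1., 0.]])
emotion_loss = nn.BCEWithLogitsLoss()(emotion_logits, emotion_labels)
```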

Effectiveness

In Table 1, we see that attention mechanism-based approaches, namely, MulT, MMUU-BA, and RAVEN, exhibit the highest binary accuracy (between 78.2% and 78.7%) on MOSI. MulT reports just 0.1% higher accuracy than RAVEN. Yet, for Acc7, RAVEN reports an increased performance of 34.6% as compared to 33.8% for MMUU-BA and 33.6% for MulT. TFN attains the highest accuracy of 34.9% for Acc7. RAVEN and MMUU-BA report the highest correlation (Corr). Despite the low accuracy, MCTN exhibits the lowest mean
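For reference, the metrics quoted above (binary accuracy, Acc7, Corr, MAE) are commonly computed from continuous sentiment scores in [-3, 3] on MOSI, roughly as in the sketch below. This is a simplified illustration of the usual protocol, not the exact evaluation script used in the paper.

```python
import numpy as np

def sentiment_metrics(preds, labels):
    """MOSI-style metrics from continuous sentiment scores in [-3, 3] (simplified)."""
    preds, labels = np.asarray(preds, float), np.asarray(labels, float)
    # Binary accuracy: positive vs. negative sentiment (sign of the score).
    acc2 = np.mean((preds > 0) == (labels > 0))
    # Acc7: round and clip to the 7 integer sentiment classes {-3, ..., 3}.
    acc7 = np.mean(np.clip(np.round(preds), -3, 3) == np.clip(np.round(labels), -3, 3))
    # Mean absolute error and Pearson correlation of the raw scores.
    mae = np.mean(np.abs(preds - labels))
    corr = np.corrcoef(preds, labels)[0, 1]
    return {"Acc2": acc2, "Acc7": acc7, "MAE": mae, "Corr": corr}

print(sentiment_metrics([2.1, -0.4, 1.2, -2.8], [3.0, -1.0, 0.8, -2.0]))
```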

Discussion on key findings

In this paper, we replicated the most recent SOTA models for multimodal language analysis. We evaluated their effectiveness through comprehensive comparative studies, error analyses, and a series of ablation studies. The efficiency of the models was also compared in terms of three evaluation metrics, namely, parameters, training time, and validation set convergence. The results associated with the ablation studies helped us determine which components and methodologies contribute most to solving the
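The first two efficiency measures can be obtained along the lines of the generic sketch below; the model, data, and optimizer here are placeholders, not the evaluation code used in the study.

```python
import time
import torch
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Number of trainable parameters, one of the efficiency measures compared."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Placeholder stand-in for a fusion model; the real models come from the compared papers.
model = nn.Sequential(nn.Linear(300, 128), nn.ReLU(), nn.Linear(128, 1))
print(f"trainable parameters: {count_parameters(model):,}")

# Wall-clock training time for one epoch over dummy data.
data = [(torch.randn(32, 300), torch.randn(32, 1)) for _ in range(100)]
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.MSELoss()

start = time.perf_counter()
for x, y in data:
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
print(f"one epoch took {time.perf_counter() - start:.2f}s")
```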

Conclusions

We have replicated SOTA approaches for multimodal human language analysis and conducted a large-scale empirical comparison among them. We thoroughly investigated both their effectiveness and efficiency on two human multimodal affect recognition tasks and determined important components in multimodal language models. The results showed that attention mechanism approaches are the most effective for both the sentiment analysis and emotion recognition tasks, even though they are not computationally cheap.

CRediT authorship contribution statement

Dimitris Gkoumas: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Writing - original draft, Visualization. Qiuchi Li: Software. Christina Lioma: Writing - review & editing, Supervision. Yijun Yu: Writing - review & editing, Supervision. Dawei Song: Writing - review & editing, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This study is supported by the Quantum Information Access and Retrieval Theory (QUARTZ) project, which has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 721321, and Natural Science Foundation of China (grant No.: U1636203).

References (77)

  • Poria, S., et al., Context-dependent sentiment analysis in user-generated videos.

  • Wang, H., et al., Select-additive learning: Improving generalization in multimodal sentiment analysis.

  • Zadeh, A., et al., Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages, IEEE Intell. Syst. (2016).

  • Tsai, Y.H., et al., Multimodal transformer for unaligned multimodal language sequences.

  • Ghosal, D., et al., Contextual inter-modal attention for multi-modal sentiment analysis.

  • Gu, Y., et al., Multimodal affective analysis using hierarchical attention strategy with word-level alignment.

  • Liang, P.P., et al., Multimodal language analysis with recurrent multistage fusion.

  • Zadeh, A., et al., Memory fusion network for multi-view sequential learning.

  • Pham, H., et al., Found in translation: Learning robust joint representations by cyclic translations between modalities.

  • Mai, S., et al., Modality to modality translation: An adversarial representation learning and graph fusion network for multimodal fusion.

  • Atrey, P.K., et al., Multimodal fusion for multimedia analysis: a survey, Multimedia Syst. (2010).

  • Sun, S., A survey of multi-view machine learning, Neural Comput. Appl. (2013).

  • Ramachandram, D., et al., Deep multimodal learning: A survey on recent advances and trends, IEEE Signal Process. Mag. (2017).

  • Baltrusaitis, T., et al., Multimodal machine learning: A survey and taxonomy, IEEE Trans. Pattern Anal. Mach. Intell. (2019).

  • Zadeh, A., et al., MOSI: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos (2016).

  • Zadeh, A., et al., Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph.

  • Busso, C., et al., IEMOCAP: interactive emotional dyadic motion capture database, Lang. Resour. Eval. (2008).

  • Donahue, J., et al., Long-term recurrent convolutional networks for visual recognition and description.

  • Morency, L., et al., Towards multimodal sentiment analysis: harvesting opinions from the web.

  • Ghosh, S., et al., Representation learning for speech emotion recognition.

  • Antol, S., et al., VQA: Visual question answering.

  • Bokhari, M.U., et al., Multimodal information retrieval: Challenges and future trends, Int. J. Comput. Appl. (2013).

  • Poria, S., et al., MELD: A multimodal multi-party dataset for emotion recognition in conversations.

  • Noroozi, F., et al., Survey on emotional body gesture recognition, IEEE Trans. Affect. Comput. (2018).

  • D'Mello, S.K., et al., A review and meta-analysis of multimodal affect detection systems, ACM Comput. Surv. (2015).

  • Morvant, E., et al., Majority vote of diverse classifiers for late fusion.

  • Shutova, E., et al., Black holes and white rabbits: Metaphor identification with visual features.

  • Glodek, M., et al., Multiple classifier systems for the classification of audio-visual emotional states.
