An Assessment of In-the-Wild Datasets for Multimodal Emotion Recognition

Multimodal emotion recognition involves the use of different resources and techniques to identify and recognize human emotions. A variety of data sources such as faces, speech, voice, text and others have to be processed simultaneously for this recognition task. However, most techniques, which are based mainly on Deep Learning, are trained using datasets designed and built under controlled conditions, which limits their applicability under real conditions. For this reason, the aim of this work is to assess a set of in-the-wild datasets to show their strengths and weaknesses for multimodal emotion recognition. Four in-the-wild datasets are evaluated: AFEW, SFEW, MELD and AffWild2. A previously designed multimodal architecture is used to perform the evaluation, and classical metrics such as accuracy and F1-Score are used to measure training performance and to validate quantitative results. The analysis of the strengths and weaknesses of these datasets for various uses indicates that, by themselves, they are not appropriate for multimodal recognition due to their original purposes, e.g., face or speech recognition. Therefore, we recommend combining multiple datasets in order to obtain better results when new samples are processed and to achieve a good balance in the number of samples per class.


Introduction
The ability to recognize emotions is crucial in many domains of work that use human emotional responses as a signal for marketing, technology or human-robot interaction [1]. Different models have been proposed for emotion classification or categorization [2]. The most common models establish six categories, such as joy, love, surprise, sadness, anger and fear [3], or a combination including disgust in place of love [4]. One of the most widely accepted categorizations is Ekman's model, whose premise is that there are distinctive facial expressions. Its labels have been widely used in most published facial inference research over the last 50 years [5]. Some authors have also included the neutral emotion [6,7].
Automatic emotion recognition (ER) is a topic of study that has attracted a great deal of interest. In practice, it consists of identifying human emotion from signals such as facial expression, speech and text [8]. A variety of information sources is usually considered since people naturally express emotions simultaneously in different ways, such as facial gestures [9], other types of gestures with the hands or arms [10], or posture [11], all related to the environment of the interaction [12]. When different modalities for ER are used, the processing is known as multimodal ER [13]. Because each source by itself can produce an ER result, the fusion of these results can mitigate the limitations present in some single-source approaches, thus obtaining more accurate detection [14].
Deep Learning architectures have been trained for ER using multimodal signals such as facial gestures and audio, audio and written language, physiological signals and different variations of those modalities [27]. Some works focus on image (face, pose), audio (speech) and text modalities combined with a fusion method [20,23,28].
Other works consider physiological signals [21,22,24,25] and yet others include combinations of all these modalities [26]. Deep Learning techniques have progressed rapidly in computer vision applications to address detection, localization, estimation and classification issues [29]. An embedding-based Deep Learning approach for 3D cell instance segmentation and tracking, learning spatial, temporal and 3D context information simultaneously, was introduced by [30]. This method was also used for face recognition in [31] using a pseudo RGB-D (Red, Green, Blue-Depth) framework, providing data-driven ways to generate depth maps from 2D face images. The problem of extracting and curating individual subplots from compound figures was addressed in [32] with a simple compound figure separation framework that uses weak classification annotations from individual images. Deep transfer learning methods from face recognition were developed in order to explore disease identification from uncontrolled 2D face images [33]. Deep Learning has also been successfully combined with data augmentation techniques that introduce different intensities of interference into the spectrum of radio signals [34].
As shown, multimodal ER is becoming more popular in the affective computing research community in order to overcome the constraints imposed by processing just one type of data and to improve recognition robustness [22]. However, despite the progress of ER using Deep Learning techniques shown in a large number of studies, most of them use datasets built in laboratory environments, such as IEMOCAP [20,23], AMIGOS [24], RECOLA [28], DEAP [21], and SEED, SEED-IV, SEED-V and DREAMER [25]. The fundamental problem with any recognition system is the lack of real training data, which may affect its generalization to examples that have not been seen during the training process [21,35]. Furthermore, the datasets used for training ER models have been designed in controlled laboratory environments and differ significantly from real conditions in terms of brightness, noise level, etc. [36]. Most existing methods have shown good recognition accuracy on lab-controlled datasets, but they deliver much lower accuracy in real-world uncontrolled environments. In contrast to lab-controlled data, datasets from non-controlled environments are referred to as "in-the-wild" data [37]. Machine learning communities have accepted that progress in a particular application domain is considerably accelerated when a large number of datasets are collected in unconstrained conditions [38,39]. With such data, multimodal analysis can focus on spontaneous behaviors and on behaviors captured in unconstrained conditions [40].
Therefore, the aim of this work is to assess a set of in-the-wild datasets in order to show their strengths and weaknesses for multimodal emotion recognition. Four in-the-wild datasets are evaluated: AFEW, SFEW, MELD and AffWild2. The main contributions of this work are as follows:
• A review of different works that use in-the-wild datasets. This review comprises works from recent years concerning multimodal emotion recognition methods and the datasets used in their experiments. A detailed description of these datasets is also provided.
• A descriptive analysis of the four selected datasets for our study. This description includes the frequency distribution of emotions, visualization of some samples and details related to the original extraction sources.
• An evaluation in terms of performance using an ensemble of Deep Learning architectures and fusion methods. The tests include ablation experiments reporting individual results by modality and fusion results.
This work is structured in several sections. Section 2 presents a review of different studies concerning emotion recognition using in-the-wild datasets. Section 3 presents descriptions of the selected in-the-wild datasets. Section 4 provides a brief description of the ensemble of architectures with which the datasets were evaluated. Section 5 presents the results of the training process with the in-the-wild datasets using the ensemble of Deep Learning architectures. Section 6 discusses the main results and recommendations and, finally, Section 7 presents conclusions and future work.

Related Work
Affective computing is mainly a data-driven research area underpinned by self-built or public databases and datasets. Most of the available data have been collected and stored under controlled conditions in labs. There are few existing multimodal emotion databases collected in real-world conditions, and those that exist are small, typically made with a limited number of subjects and expressed in a single language [41]. This review focuses on datasets collected under in-the-wild conditions. Some projects developed in the context of facial emotion recognition [42] and multimodal approaches have used or created their own datasets for experimentation. Riaz et al. [43] conducted tests on the Facial Expression Recognition 2013 (FER-2013), Extended Cohn-Kanade (CK+) and Real-world Affective Faces Database (RAF-DB) benchmark datasets. The implementation, called eXnet (Expression Net) by its authors, consisted of a Convolutional Neural Network (CNN) architecture based on parallel feature extraction for facial emotion recognition (FER) in the wild. This method achieved higher accuracy on the studied datasets (71.67% on FER-2013, 95.63% on CK+ and 84% on RAF-DB) while using fewer parameters and less disk space than other methods. Chen et al. [41] proposed the Multi-Modal Attention module (MMA) to fuse multi-modal features adaptively on the HEU-part1 and HEU-part2 Emotion databases. The HEU Emotion database contains a total of 19,004 video clips tagged with 10 emotions (anger, bored, confused, disappointed, disgust, fear, happy, neutral, sad, surprise). For extracting raw features of face, body and audio, CNN+GRU, 3D ResNeXt and OpenSMILE were used and combined with the MMA. Comparative tests were performed on the AFEW and CK+ datasets. The best results using all modalities and the proposed MMA were 49.22% for HEU-part1 and 55.04% for HEU-part2 on the validation sets.
The Emotion Recognition in the Wild (EmotiW) challenge is a benchmarking effort run as a grand challenge of several editions of the International Conference on Multimodal Interaction. Its editions, starting in 2012, involve different tasks for ER in the wild [44]. Various research projects in affective computing, computer vision, speech processing and machine learning have been developed around it. Some of the best works presented in these areas are summarized below. Samadiani et al. [36] proposed a video ER method that combines visual and audio features on the Acted Facial Expression in the Wild (AFEW) 4.0 dataset. The authors considered the head-pose challenge in videos captured in the wild and proposed a feature-extraction method to handle it. A sparse kernel representation was applied to concatenate the features and a joint sparsity concentration index measurement was used as the decision strategy to indicate the effectiveness of modalities. A Random Forest classifier was used to classify seven basic emotions (angry, disgust, fear, happy, neutral, sad and surprise), achieving an accuracy of 39.71%. Hu et al. [45] experimented in the EmotiW 2017 audio-video-based ER sub-challenge with AFEW 7.0. For this, the authors presented a supervised scoring ensemble that provides dense supervision in diverse feature layers of a deep CNN and bridges class-wise scoring activations for second-level supervision. It extended the idea of deep supervision by adding supervision to deep, intermediate and shallow layers. Afterwards, a fusion structure concatenated class-wise scoring activations from diverse complementary feature layers. The results showed a best accuracy of 60.34% on the ER task for this sub-challenge. Li et al. [46] proposed a framework for video-based ER in the wild taking visual information from facial expression sequences and speech information from audio. This was done in the context of the EmotiW 2019 audio-video-based ER sub-challenge with AFEW 7.0.
Different deep networks were used (VGG-Face, BLSTM, ResNet-18, DenseNet-121, VGG-AffectNet) and a fusion method based on a weighted sum was used to combine the outputs of the networks. Additionally, to take advantage of the facial expression information, the VGG16 network was trained on the AffectNet dataset to learn a specialized facial expression recognition model. The best results show an accuracy of 62.78% after the fusion method. Salah et al. [47] proposed an approach for video-based ER in the wild using deep transfer learning and score fusion in the context of the EmotiW Challenges 2015 and 2016. For the visual modality, this approach used summarizing functionals of complementary visual descriptors, and for the audio modality a standard computational pipeline for paralinguistics was proposed. Both audio and visual features were combined with least squares regression-based classifiers and weighted score-level fusion. The datasets used corresponded to the AFEW corpus and FER-2013. This approach achieved accuracies of 54.55% and 52.11%, respectively. Some research on the EmotiW sub-challenges has focused on the automatic classification of a set of static images into seven basic emotions on the SFEW (Static Facial Expression in Wild) dataset. SFEW is a static subset of AFEW that addresses recognizing more spontaneous facial expressions. Yu and Zhang [48] proposed a method with a face-detection module based on an ensemble of three face detectors, followed by a classification module with an ensemble of multiple deep convolutional neural networks. The three detectors were joint cascade detection and alignment (JDA), Deep-CNN-based (DCNN) and mixtures of trees (MoT). Models were pre-trained on a larger dataset provided by the Facial Expression Recognition (FER) Challenge 2013 and afterwards fine-tuned on the training set of SFEW 2.0.
This approach achieved 55.96% and 61.29%, respectively, on the validation and test sets of SFEW 2.0, surpassing the challenge baselines of 35.96% and 39.13% with significant gains. Munir et al. [49] proposed a technique composed of three major modules: (a) preprocessing with Fast Fourier Transform (FFT) and Contrast Limited Adaptive Histogram Equalization (CLAHE) methods; (b) generation of a merged binary pattern code (MBPC) per pixel; (c) dimensionality reduction with Principal Component Analysis (PCA) and a classifier to identify the expression. The SFEW dataset was selected for experimentation. Results show 96.5% and 67.2% accuracy for the holistic and division-based approaches, respectively. Cai et al. [50] designed a novel island loss (IL-CNN) to simultaneously increase inter-class separability and intra-class compactness. The proposed method included three convolutional layers, each of which was followed by a PReLU layer and a batch normalization (BN) layer. A max pooling layer was used after each of the first two BN layers. After the third convolutional layer, two fully connected (FC) layers and a final Softmax layer were included. Ruan et al. [51] proposed an ensemble of networks for facial ER, composed of four parts: (a) the backbone network (ResNet-18), which extracts basic CNN features; (b) a Feature Decomposition Network (FDN), which decomposes the basic feature into a set of facial action-aware latent features; (c) a Feature Reconstruction Network (FRN), which learns an intra-feature relation weight and an inter-feature relation weight for each latent feature and reconstructs the expression feature; the FRN contains two modules: an Intra-feature Relation Modeling module (Intra-RM) and an Inter-feature Relation Modeling module (Inter-RM); (d) an Expression Prediction Network (EPN), which predicts an expression label.
Experimental results were obtained on both the in-the-lab databases (including CK+, MMI and Oulu-CASIA) and the in-the-wild databases (including RAF-DB and SFEW). The accuracy results on RAF-DB and SFEW were 89.47% and 62.16%, respectively.
Emotion recognition in conversations (ERC) is another challenging task that has recently gained interest due to its potential applications. Conversation in its natural form is multimodal. ERC presents several challenges such as conversational context modeling, emotion shift of the interlocutors and others that make the task more difficult to address [52].
Xie et al. [53] proposed a multimodal emotion classification architecture on the Multimodal EmotionLines Dataset (MELD), including three modalities (text, audio and face). Three separate prediction models were trained: a Generative Pre-trained Transformer (GPT) for text, WaveRNN for audio and FaceNet+GRU for images. A transformer-based fusion mechanism with Embracenet was used to provide multimodal feature fusion. The architecture considered the joint relations among the modalities and fused the different sources into a representation vector. The results showed an accuracy of 65.0% and an F1-Score of 64.0%. Ho et al. [54] presented a multimodal approach for speech emotion recognition based on a Multi-Level Multi-Head Fusion Attention mechanism and a recurrent neural network (RNN) with two modalities (audio and text). Mel-frequency cepstral coefficients (MFCCs) extracted from raw signals using the OpenSMILE toolbox were used as audio features. A pre-trained bidirectional encoder representations from transformers (BERT) model was used for embedding text information. A multi-head attention technique fused all feature representations. The experimentation was performed on three databases: Interactive Emotional Motion Capture (IEMOCAP), MELD and CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI). The results showed an accuracy of 63.26% and an F1-Score of 60.59%. Hu et al. [55] introduced a multimodal fused graph convolutional network (MMGCN) for ERC utilizing both multimodal and long-distance contextual information. The MMGCN consisted of three key components: a Modality Encoder, a Multimodal Graph Convolutional Network and an Emotion Classifier. Experiments were conducted on two public benchmark datasets, IEMOCAP and MELD. The results showed accuracies of 66.22% and 58.65% for IEMOCAP and MELD, respectively.
The First Affect-in-the-wild Challenge (AffWild Challenge) was organized in conjunction with the Computer Vision and Pattern Recognition Conference (CVPR) 2017, using the AffWild database [56]. Later this database was extended with 260 more subjects and 1,413,000 new video frames, after which it was called AffWild2 [57]. Some researchers have since experimented with AffWild2. Barros and Sciutti [58] implemented a Deep Neural Network based on VGG16, with a final feature layer that classifies arousal, valence and emotion categories. This was expanded to also process sequential data and achieved an F1-Score of 0.38. Liu et al. [59] proposed a framework that extracts facial features and applies regularization in order to focus on more evident samples, while more uncertain expressions are relabeled. This framework used ResNet-18 [60] and DenseNet [61] architectures, pretrained with the MS-Celeb-1M dataset [62], and achieved a mean accuracy of 62.78%. Yu et al. [63] proposed an ensemble learning approach, using different architectures such as ResNet, EfficientNet [64] and InceptionNet [65] and training with multiple folds of the data in order to enhance their performance. This achieved a mean F1-Score of 0.255. Zhang et al. [66] proposed a unified transformer-based multimodal framework for Action Unit detection and expression recognition. This framework is made up of three gated recurrent unit (GRU) networks and a multilayer perceptron (MLP). A transformer-based fusion module was used to integrate the static vision features and the dynamic multimodal features. The expression F1-Scores of models trained and tested on six different folds were calculated (including the original training/validation split of the AffWild2 dataset). The results showed mean F1-Scores of 39.4, 37.9, 41.1, 37.8, 37.3 and 36.1, respectively, for each fold. Table 1 summarizes the works presented in this section. Table 2 summarizes the datasets or databases used in these works.

A Selection of In-the-Wild Datasets
We selected four in-the-wild datasets for testing: AFEW [68], SFEW [68], MELD [52] and AffWild2 [57]. These datasets were selected primarily because of their accessibility. Requests for access were made directly to their authors and access was granted for academic purposes. A description of each dataset follows.
• AFEW: This is a dynamic temporal facial expressions data corpus proposed by A. Dhall et al. in 2011 [77]. It has been used as a database for the Emotion Recognition in-the-wild Challenge (EmotiW) since 2013. Different versions have appeared every year for each challenge; for the sake of simplicity, the following refers to the most recent version as AFEW. It consists of close-to-real-world instances extracted from movies and reality TV shows, including 1809 video clips of 300-5400 ms with various head poses, occlusions and lighting. The database covers a large age range of subjects, from 1 to 70 years, of various races and genders, with multiple subjects in a scene. Around 330 subjects have been labeled with information such as name, age of character, age of actor, gender, pose and individual facial expressions. AFEW consists of separate training (773), validation (383) and test (653) video clips, in which samples are tagged with discrete emotion labels: the six universal emotions (angry, disgust, fear, happy, sad and surprise) and neutral. Audio and video are in WAV and AVI formats. Modalities to explore in this database include face, audio and posture. Figure 1 shows some samples from this dataset.
• SFEW: This is a static subset of AFEW, built from frames extracted from its video clips [73]. Figure 4 shows some samples of this database. This corpus is a widely used benchmark database for facial expression recognition in the wild. The official website is the same as AFEW's. Table 3 shows the number of images available in SFEW per emotion.
• MELD: the Multimodal EmotionLines Dataset (MELD) is an extension and enhancement of EmotionLines [78]. The MELD corpus was constructed by extracting the starting and ending timestamps of all utterances from every dialog in the EmotionLines dataset, given that the timestamps of the utterances in a dialog must be in increasing order and all the utterances in a dialog have to belong to the same episode and scene.
After obtaining the timestamp of each utterance, the corresponding audiovisual clips were extracted from the source episode, followed by the extraction of the audio content from these clips. The audio files were formatted as 16-bit PCM WAV files. The final dataset includes visual, audio and textual modalities for each utterance. MELD contains about 13,000 utterances from 1433 dialogs from the TV series Friends, with different speakers participating in these dialogs. It provides multimodal sources including not only textual dialogs, but also their corresponding visual and audio counterparts. Each utterance is annotated with emotion and sentiment labels. Emotions correspond to Ekman's six universal emotions (joy, sadness, fear, anger, surprise and disgust) with an additional neutral label. For sentiments, three classes were distinguished: negative, positive and neutral [52]. Figure 5 shows an extract of this dataset. Each frame is collected from a video and most of these frames contain several people expressing different emotions (Figure 6). The dataset contains only the raw data for the complete video frames; individual faces are not cropped or tagged. Within the MELD site (https://github.com/declare-lab/MELD, accessed on 27 May 2023), the data are structured in different folders. In the data folder in particular, there are three different CSV files which correspond to the train, dev and test datasets, with 9990, 1110 and 2611 lines, respectively. Each file has the same structure, containing utterance, speaker, emotion and sentiment information, as well as data concerning the source. • AffWild2: This is an extension of the AffWild dataset, which was designed for the First Affect-in-the-Wild Challenge [79].
AffWild collected data available on video-sharing websites such as YouTube, selecting videos that display the affective behavior of people, for example, their reactions when watching a trailer, a movie, a disturbing clip or pranks. It was designed to train and test an end-to-end deep neural architecture for the estimation of continuous emotion dimensions based on visual cues [56] and was annotated in terms of the valence-arousal dimensions. A total of 298 videos displaying the reactions of 200 subjects, with a total video duration of more than 30 h, were collected. Later, AffWild2 extended the data with 260 more subjects and 1,413,000 new video frames [57]. The set contains 558 videos with 2.8 million frames in total of people reacting to events or audiovisual content or speaking to the camera. The videos were downloaded from YouTube and involve a wide range of subjects' ages, ethnicities and professions, with large variations in head pose, illumination conditions, occlusions and emotions. The videos were processed, trimmed and reformatted to MP4. Frames were annotated for three tasks (valence-arousal estimation, emotion classification and action unit detection) [80]. This dataset is also distributed as a set of cropped frames from each video, centered on the face. The number of images available is shown in Table 4. For emotion classification, each frame was annotated with one of the six main emotions (anger, disgust, fear, happiness, sadness, surprise), the label other or the neutral label. Some frames are labeled as discarded (with a "−1") for this task. The official site of AffWild2 belongs to the Intelligent Behaviour Understanding Group (iBUG), Department of Computing at Imperial College London (https://ibug.doc.ic.ac.uk/resources/aff-wild2/, accessed on 27 May 2023). Figure 7 shows some samples from the AffWild2 dataset.
Table 5 summarizes some insights of these in-the-wild datasets.

Preprocessing
In this section, we describe the steps used to preprocess the four datasets by modality group.

Face Modality
For this modality, AFEW, SFEW and AffWild2 had available cropped and centered segments of each video frame with a labeled face of the subject on screen. Figures 1, 4 and 7 show some frames extracted from their original video source with a face detection algorithm.
Because AFEW and SFEW were distributed among various folders by emotion, we created a CSV file containing the path of each frame and its label. In the AffWild2 dataset, each video has a corresponding annotation file, in which each line indicates the label of one frame. A few of these videos contain more than one person in focus. In these cases, the videos have several annotation files (one per person), each with the labels corresponding to the person it refers to. Figure 8 shows an example of this. All frames missing a category label were annotated with −1, indicating the frame should be discarded. To generate a diverse set of images for training, we applied random augmentation operations to the training data. Each transformation had an independent 50% chance of being applied to an image. Transformations include flipping, rotation in the range [−10, 10] degrees, random contrast and brightness changes and the addition of Poisson noise.
Because the VGG model used for this modality requires a fixed input size, we had to resize the original images (128 × 128 pixels in AFEW, 181 × 143 pixels in SFEW and variable sizes of 246 ± 176 × 128 ± 129 pixels in AffWild2) to 48 × 48 pixels, using the bilinear interpolation function provided by the scikit-image library in Python.
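The augmentation and resizing steps above can be sketched as follows. The 50% application chance per transform, the ±10° rotation range, the Poisson noise and the 48 × 48 bilinear resize come from the text; the contrast/brightness factors are illustrative assumptions, not the exact values used in our pipeline.

```python
# Sketch of the face-frame augmentation and resizing described above.
# Assumes images are float arrays in [0, 1].
import numpy as np
from skimage.transform import resize, rotate

rng = np.random.default_rng(0)

def augment(img):
    if rng.random() < 0.5:                       # horizontal flip
        img = img[:, ::-1]
    if rng.random() < 0.5:                       # rotation in [-10, 10] degrees
        img = rotate(img, rng.uniform(-10, 10), mode="edge")
    if rng.random() < 0.5:                       # random contrast/brightness (assumed ranges)
        img = np.clip(img * rng.uniform(0.8, 1.2) + rng.uniform(-0.1, 0.1), 0, 1)
    if rng.random() < 0.5:                       # Poisson noise
        img = np.clip(rng.poisson(img * 255) / 255.0, 0, 1)
    return img

def to_model_input(img):
    # Bilinear resize (order=1) to the 48x48 input expected by the VGG model.
    return resize(img, (48, 48), order=1, anti_aliasing=True)

frame = rng.random((128, 128, 3))                # stands in for an AFEW frame
x = to_model_input(augment(frame))
print(x.shape)                                   # (48, 48, 3)
```

Each transform is drawn independently, matching the description above; in the real pipeline the label is carried along unchanged.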

Audio Modality
For the audio data, we first extracted the audio of each original video at a 22.4 kHz sample rate using ffmpeg [81] on all datasets.
In particular, each video in AffWild2 contains a number of labeled emotion utterances, which define video segments. Audio clips were extracted from each of these segments using Librosa [82] and were labeled with the same emotion as their corresponding utterance. These segments have a mean duration of 3.88 ± 10.18 s, with lengths between ∼30 ms (one video frame) and 200 s. Figure 10 shows the length distribution for each emotion on the training and validation sets. In the other datasets, each video sample represents a single utterance; therefore, the audio can be used directly from each sample. Each audio item has a duration of up to 6.8 s (average 2.458 ± 1.017 s). On all datasets, we set the input duration for our model to 7 s. Shorter segments were padded with zeros and centered (second plot in Figure 11), whereas longer segments were cut to 7 s from a random starting position. Afterwards, the segments were converted to Mel spectrograms using Librosa (bottom-left image in Figure 11), the scale was transformed to decibels (using librosa.power_to_db()) and finally the result was normalized to the range [0, 1] and colormapped to encode amplitude information as color in an image, as suggested in the work of Lech et al. [83]. Figure 11 shows this process.
For training, each audio sample was augmented with a random displacement from the center of the padded audio and the addition of Gaussian noise with σ = 1 × 10⁻⁵.
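A minimal sketch of this length normalization and augmentation follows. The 7 s window, the 22.4 kHz sample rate and the σ = 1 × 10⁻⁵ noise come from the text; the ±1 s displacement range is an illustrative assumption.

```python
# Sketch of the audio length normalization and training-time augmentation.
import numpy as np

SR = 22400                  # sample rate from the extraction step
TARGET = 7 * SR             # fixed 7-second model input

rng = np.random.default_rng(0)

def fix_length(y, shift=0):
    """Cut long clips from a random position; pad short ones, centered + shift."""
    if len(y) >= TARGET:
        start = int(rng.integers(0, len(y) - TARGET + 1))
        return y[start:start + TARGET]
    out = np.zeros(TARGET, dtype=y.dtype)
    start = (TARGET - len(y)) // 2 + shift
    start = int(np.clip(start, 0, TARGET - len(y)))
    out[start:start + len(y)] = y
    return out

def augment(y):
    shift = int(rng.integers(-SR, SR))               # random displacement (assumed +/- 1 s)
    y = fix_length(y, shift)
    return y + rng.normal(0.0, 1e-5, size=y.shape)   # additive Gaussian noise, sigma = 1e-5

clip = rng.normal(size=int(2.5 * SR))                # stands in for a 2.5 s utterance
x = augment(clip)
print(len(x) / SR)                                   # 7.0
```

The fixed-length signal would then be converted to a Mel spectrogram (e.g., librosa.feature.melspectrogram followed by librosa.power_to_db), normalized to [0, 1] and colormapped, as described above.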

Text Modality
For the text model, we used the same audio segments that were previously extracted from the videos, without the duration normalization, and converted them to text using the Speechbrain [84] EncoderDecoderASR Transformer model. An example of the extracted text is shown in Figure 11. For the MELD dataset, a transcription of the audio of each video sample is provided within the data; thus, we used this transcription as input for our text model.
No data augmentation was applied to the text modality data.
Figure 11. Visualization of one audio sample and its preprocessing for our audio model.

A Multimodal Framework for Emotion Recognition
For assessment of the in-the-wild datasets mentioned in Section 3, we employed an architecture designed and implemented in a previous work [15] with minor modifications, detailed below. To support emotion recognition using different modalities, specifically face, audio and text, a multimodal architecture is required. This architecture is made up of three individual components: a face component based on a VGG19 network, an audio component based on ResNet50 and a text component based on XLNet, called DialogXL [85]. Each component individually recognizes an emotion (modality output). Afterwards, all of these outputs are fused, producing the final recognized emotion. Figure 12 shows the configuration of this architecture. The details are summarized below: • Face Modality processing. We used a VGG19 architecture [86] as our classifier. This model is built with 19 weight layers: 16 convolutional layers with filters of size 3 × 3, followed by 3 fully connected layers. Following 2 convolutional layers with 64 channels each, the output is reduced using a max pooling operation of size 2 × 2. This continues with an alternation of pooling and groups of 2 layers of 128 channels, 4 layers of 256 channels, 4 layers of 512 channels and 4 layers of 512 channels. After a final max pooling operation, the output goes to an MLP network with 3 dense layers of sizes 4096, 4096 and 1000 and then a final layer with a Softmax activation function. • Audio Modality processing. We used a ResNet50 architecture [60] trained from scratch, replacing the original architecture proposed by Venkataramanan and Rajamohan [87], which was used in our previous work. The expected input is an image of size 224 × 224 representing the spectrogram of the input audio sample. After a convolutional layer with a filter size of 7 × 7 and 64 channels, the input is passed through a number of residual blocks.
These residual blocks are composed of three convolutional layers with filter sizes of 1 × 1, 3 × 3 and 1 × 1; the input of the block is then added to the block output, providing residual information to higher-level features. After a number of groups, the output is max pooled, reducing its size. In ResNet50, this operation occurs after 3, 4, 6 and 3 ResNet blocks. Finally, the output is average pooled, creating a 2048-length feature vector. This vector is then passed to a dense layer and an output layer with a Softmax activation. This last output layer has a number of neurons corresponding to the number of emotions. This is shown in the audio segment of Figure 12. • Text Modality processing. We used DialogXL [85], a PyTorch implementation for Emotion Recognition in Conversation (ERC) based on XLNet. It consists of an embedding layer, 12 Transformer layers and a feed-forward neural network. DialogXL has an enhanced memory to store longer historical context and a Dialog-Aware Self-Attention component to deal with multi-party structures. The recurrence mechanism of XLNet was modified from segment-level to utterance-level in order to better model conversational data. Additionally, Dialog-Aware Self-Attention was used in place of the vanilla self-attention in XLNet to capture useful intra- and inter-speaker dependencies. Every utterance (sentence) made by a speaker is routed via an embedding layer, which tokenizes the sentence into a series of vectors. This representation is then fed into a stack of neural network layers, each of which outputs a vector that is fed into the next layer. Each layer of the stack has a Dialog-Aware Self-Attention component and an Utterance Recurrence component. The hidden state of the categorization token and the historical context are fed through a feed-forward neural network at the end of the last layer to produce the recognized emotion. • Fusion method.
Individual modalities were fused using Embracenet+, which is presented as an improvement of the Embracenet approach in [15]. The architecture involves three simple Embracenet models working to improve the modalities' correlation learning as well as the final results. Each Embracenet model used has one more linear layer and a dropout layer, which hardens the model a bit to improve learning. Figure 13 shows the Embracenet+ architecture. In it, a linear layer of 32 neurons (D1,1), a dropout layer with 0.5 decay probability and another linear layer of 16 neurons (D1,2) compose each of the altered docking layers. Additionally, a weighted sum, whose output is a vector of n probabilities (n = number of emotion categories), and a concatenation, whose output is a vector of 3n (due to the number of modalities), are used as fusion techniques. Afterwards, another Embracenet receives three vectors of 16, n and 3n values (that work as modalities). These vectors are handled by docking layers of one linear layer of 16 neurons each (d (k) ), leading to an extra linear layer of n neurons, which outputs the final prediction.
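The two auxiliary fusion operations used inside EmbraceNet+ (the weighted sum producing an n-length probability vector and the concatenation producing a 3n-length vector) can be sketched in isolation. The following is a minimal NumPy illustration with hypothetical probability vectors and weights, not the trained model:

```python
import numpy as np

def weighted_sum_fusion(modality_probs, weights):
    """Weighted sum of per-modality probability vectors -> one n-length vector."""
    stacked = np.stack(modality_probs)                   # shape (3, n)
    fused = np.average(stacked, axis=0, weights=weights)
    return fused / fused.sum()                           # renormalize to a distribution

def concat_fusion(modality_probs):
    """Concatenation of the three n-length vectors -> one 3n-length vector."""
    return np.concatenate(modality_probs)

# Hypothetical modality outputs for n = 7 emotion categories
face  = np.array([0.60, 0.10, 0.10, 0.05, 0.05, 0.05, 0.05])
audio = np.array([0.30, 0.30, 0.10, 0.10, 0.10, 0.05, 0.05])
text  = np.array([0.50, 0.20, 0.10, 0.10, 0.05, 0.025, 0.025])

fused_n  = weighted_sum_fusion([face, audio, text], weights=[0.4, 0.2, 0.4])
fused_3n = concat_fusion([face, audio, text])
```

In the full architecture, these two vectors, together with the 16-value docking output, are treated as new "modalities" by the second EmbraceNet stage.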
All modalities were trained with a batch size of 32 samples per step for 20 epochs and optimized using the Adam algorithm with a learning rate of 0.0001. The implementation was based on PyTorch 2.0 (https://pytorch.org/, accessed on 27 May 2023). All models were trained on GPU, using an Nvidia GeForce RTX 3070 with 12 GB of VRAM.
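For reference, the training setup above can be summarized as a plain configuration fragment (the field names are our own shorthand, not identifiers from the original code):

```python
# Training hyperparameters shared by the face, audio and text networks,
# as described in the text; keys are illustrative, not from the original code.
TRAIN_CONFIG = {
    "batch_size": 32,
    "epochs": 20,
    "optimizer": "Adam",
    "learning_rate": 1e-4,
    "framework": "PyTorch 2.0",
    "device": "cuda",   # Nvidia GeForce RTX 3070
}
```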
Using the described framework, tests were performed on the IEMOCAP dataset and the results showed a 79% F1-Score and 77.6% accuracy for the fusion of the three modalities. On the individual modalities, i.e., Face-F, Audio-A and Text-T, the model obtained average accuracies of 44%, 58.3% and 83.5%, respectively. Table 6 shows the results of our tests using the AFEW dataset. We achieved mean accuracies of 26.91%, 21.67% and 22.19% on the face, audio and text modalities, respectively. The best performing emotions in our test were happiness with 55.208% for faces, neutral with 33.333% for audio and anger with 43.750% for text. One of the main issues we noticed is how the architecture under-performs in certain categories despite the absence of a high imbalance of samples per emotion: disgust and surprise were the worst performers in every modality, even though AFEW is built with faces of actors whose expressions are more pronounced and should therefore be more distinguishable. Figure 14 shows the ROC curves of the three modalities using AFEW. In the face modality, happiness has an AUC of 0.61, while for audio and text, anger is the emotion with the highest AUC (0.75 and 0.68, respectively). This may be because the audio in this set, coming from professional actors, presents more detailed voice tonalities. Table 7 shows the results for the SFEW dataset. We achieved a mean accuracy of 21.81%, with happiness and anger being the most correctly classified emotions, with accuracies of 38.889% and 36.364%, respectively. Similarly to AFEW, from which SFEW was extracted, it performs poorly with surprise and disgust, and detects no instances of fear. Figure 15 presents the ROC curve of the SFEW face modality; the emotion with the highest AUC is happiness with 0.72. Table 8 shows the results on the MELD dataset. Mean F1-Scores of 15.9%, 34.6% and 53.7% were achieved for the face, audio and text modalities, respectively. Figure 16 shows the ROC curves for these results.
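The per-emotion AUC values reported here are one-vs-rest. As a reference, AUC can be computed directly from classifier scores via the rank-sum (Mann-Whitney U) formulation; the following is a minimal sketch with hypothetical scores, not the evaluation code used in our experiments:

```python
def binary_auc(scores, labels):
    """One-vs-rest ROC AUC via the rank-sum (Mann-Whitney U) statistic.

    scores: classifier scores for one emotion category;
    labels: 1 if the sample belongs to that category, else 0.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1                           # group samples with tied scores
        avg_rank = (i + 1 + j) / 2           # mean 1-based rank of the tied group
        rank_sum += avg_rank * sum(label for _, label in pairs[i:j])
        i = j
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Hypothetical scores, perfectly separated -> AUC = 1.0
auc = binary_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
```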
The modality that exhibited the best performance is text, with a mean accuracy of 57.40%, which is congruent with the fact that this dataset was designed for emotion detection in conversations. The face modality presents the worst results, due to the difficulty of building a new image dataset from the videos and relabeling it. Some of the audio clips are cut or missing the first syllable, or have extra sounds at the beginning or end; other clips sound unnatural. Even though the texts are complete in the CSV file, rebuilding them from the audio files is not possible because of these problems. Table 9 shows the results on the AffWild2 dataset. The VGG19 architecture achieved a weighted F1-Score of 41% on this dataset's original face modality. However, since the dataset was not designed for other modalities, such as audio or text, performance in these modalities was low (F1-Scores of 32% in audio and 29% in text). One of the main reasons is the source of the speech, when speech is present at all: a significant portion of the dataset is composed of reaction videos to entertainment content, so the sound in a segment might not coincide with the labeled emotion, since annotations were made with faces in mind. Figure 17 shows the ROC curves for each emotion in each modality. Disgust presents the highest AUC in both faces and audio (0.94 and 0.69, respectively), while in the text modality, happiness presents the highest AUC with 0.57. Finally, we evaluated all three modalities using the EmbraceNet+ fusion method. Table 10 shows the accuracy results using all modalities. AffWild2 performed best with a weighted accuracy of 58.64%, while AFEW reached a weighted accuracy of 18.87% and MELD an accuracy of 45.63%. In the case of AffWild2, both face and text show good results, individually and fused. The audio modality has the lowest performance, as the audio features are limited by the source data.
For the AFEW dataset, the best performance was achieved by the facial modality by itself, with 19.68%, on par with what was expected from the unimodal experiment. With MELD, only the text modality performed as expected since, as mentioned, this dataset was built for text and audio; however, the combination of audio and text achieved an accuracy of only 20.39%. These results might indicate some problems with our features: the face detector by itself has weak performance, as seen in Table 8. In comparison with similar models for each dataset, ours performs similarly or better, although there were significant differences in how those models were evaluated. For example, for AffWild2, Zhang et al. [66] reported a multimodal accuracy of 39.4%, but they added the emotional category "Other", which we did not evaluate and which is a highly represented class inside AffWild2. The work of Li et al. [46] on AFEW performs better on the audio and face tasks than ours because it combines more complex architectures for the same task. Finally, on the MELD dataset, we achieved results similar to those of Hu et al. [55].
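Several of the figures above use support-weighted averaging over the emotion classes. As a reference, the following is a minimal sketch of the "weighted" F1-Score convention (not the exact evaluation code used in our experiments):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1-Score averaged with weights equal to each class's support
    (the 'weighted' averaging convention for imbalanced label sets)."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n_cls in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n_cls - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += (n_cls / total) * f1      # weight by class support
    return score
```

Unlike the unweighted (macro) mean, this convention lets frequent classes such as neutral dominate the score, which is why weighted and per-class figures can diverge on imbalanced datasets.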

Discussion
The literature review and the large number of papers describing models under development show the interest in the area of ER and the challenges still present. However, there is still no dataset or database designed and oriented entirely toward multimodal ER. We can find datasets designed in particular for face recognition, but not for the other modalities; in these cases, there are situations where the audio and text do not correspond to the facial expression, imposing the need to relabel the data for those modalities. In other situations, where the focus of the dataset is text recognition in conversations, the facial modality takes a back seat, imposing the need for additional preprocessing. In the case of raw video data, there are group scenes in which multiple faces are identified; this implies the need to first perform face detection and cropping, with the associated difficulty of assigning the audio to the corresponding person.
Most of the datasets do not have a good class balance, so the models generated from them carry a significant bias, impacting classification by overfitting the model to one class or another. As a result, metrics such as accuracy are not the most appropriate for indicating good model performance. Many of these models also have compromised generalization capacity, because they are not able to predict correctly when faced with new cases not seen during training. Although there are multiple methods to balance the classes, e.g., discarding samples (undersampling) or generating synthetic samples (oversampling), the main limitation relates to how rarely some of these emotions are recorded. Certain expressions have different durations depending on the context, in addition to differences in gesticulation within the same emotion. Therefore, a greater diversity of examples is needed so that the model can improve its classification capacity.
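The simplest of the balancing methods mentioned above, naive random oversampling, can be sketched as follows (a minimal illustration; real pipelines would prefer augmentation or synthetic-sample generation over plain duplication, and the function name is our own):

```python
import random

def oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until every class
    matches the majority class count."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # pad the minority classes with random duplicates
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels
```

Note that duplication adds no new gesticulation or tonal diversity, which is exactly the limitation discussed above.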
AFEW and SFEW are based on images of actors who gesticulate markedly to express their emotions, which could compromise their adaptation to reality. Due to its size and variety of emotions, the AffWild2 dataset improved the performance of all networks trained with it compared to networks trained with the other datasets. However, its focus is facial, leaving inconsistencies in both the audio and text modalities. In fact, because this dataset is labeled with faces in mind, the emotion expressed in speech could differ depending on what is happening in context. Another issue is the mean length of each utterance, as most labeled samples are too short to enunciate a complete word. This might create a bias towards short vocal expressions of emotion, such as a scream or a laugh.
In AffWild2, most of the videos provided show reactions to multimedia content or interactions with other people off-camera, so the labeled emotion could differ from what is happening in the audio. For example, a child might be shown smiling while, in the background, the mother is talking about her problems. As such, particular phrases extracted from the audio might carry opposing emotions, increasing confusion in the model and lowering performance. The solution to this issue is simple yet costly: manually labeling the audio emotion in context.
In AFEW and AffWild2, transcriptions of the audio are not provided with the datasets, limiting the performance of the text model, which becomes dependent on the ability of the automatic transcription system and on its performance in in-the-wild environments. Thus, the text may not correspond to what was actually uttered by the subject. On the other hand, the MELD dataset was conceived for ER in conversations, so the facial modality requires more preprocessing: most videos include more than one person, and therefore the images we can extract from them first require the separation and cropping of faces, in addition to the corresponding labeling.
Some audio items were discarded from our tests due to low-quality sources and data corruption in the provided files, which reduced the number of samples actually available for training. Other audio issues relate to source quality: since some videos, particularly in AffWild2, were extracted from the Internet and social media platforms, some of them correspond to people reacting to Internet content, other videos, films, etc. As a result, much of the time the audio background does not match the emotion perceived in the video. Another example is videos of people interacting with other people off-camera, which increases the difficulty of emotion recognition in those videos.
The text modality generally requires an automatic transcription from audio that the datasets do not provide (except for MELD). Manual transcription is not cost-effective due to the number and length of the videos, as well as the range of accents present in the datasets.
Our multimodal strategy using the EmbraceNet+ architecture indicates that combining information from different sources can improve emotion recognition compared with the unimodal strategy. However, some of the results contradict what was expected. For example, when using the MELD dataset, which was designed for audio and text emotion recognition, we expected good performance using those modalities, but the fusion of audio and text reached a weighted accuracy of only 20.39%. Nevertheless, in the full multimodal test, audio and text reached 45.63% and 46.6%, respectively, and 43.14% and 57.4% as unimodal models.
Nevertheless, the datasets achieve results consistent with the objective for which they were designed. In the case of AFEW, the model performed better in the face modality than in the other modalities, while the considerable number of samples in AffWild2 favors its performance in the face modality. In the case of MELD, it outperforms the other datasets in text.
A possible solution to these problems seems to be the combination of multiple datasets. This would provide a greater diversity of samples in different contexts, both in acted scenes and in day-to-day living situations. Improvements to the datasets regarding modalities other than images, including correct labeling of the different data sources, could also improve performance on in-the-wild emotion recognition tasks.
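Combining datasets requires reconciling their label vocabularies first. The sketch below illustrates one way to merge samples onto a common emotion set; the vocabularies, the mapping and the function name are illustrative assumptions, not the datasets' actual label files:

```python
# Common target vocabulary (Ekman's six categories plus neutral).
COMMON_LABELS = {"anger", "disgust", "fear", "happiness", "neutral", "sadness", "surprise"}

# Hypothetical per-dataset label names mapped onto the common vocabulary.
LABEL_MAP = {
    "meld":     {"joy": "happiness", "anger": "anger", "sadness": "sadness",
                 "neutral": "neutral", "surprise": "surprise",
                 "disgust": "disgust", "fear": "fear"},
    "affwild2": {"happy": "happiness", "angry": "anger", "sad": "sadness",
                 "neutral": "neutral", "surprise": "surprise",
                 "disgust": "disgust", "fear": "fear"},
}

def merge_datasets(*datasets):
    """Merge (source_name, [(sample, label), ...]) pairs into a single list,
    remapping labels onto the common vocabulary; unmappable labels are dropped."""
    merged = []
    for source, items in datasets:
        mapping = LABEL_MAP[source]
        for sample, label in items:
            common = mapping.get(label)
            if common in COMMON_LABELS:
                merged.append((sample, common))
    return merged
```

Per-class counts on the merged list would then show whether the combination also improves the class balance, or whether oversampling is still needed.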
The proposed approach is a lightweight, relatively simple ensemble of models; therefore, the resource requirements for both training and deployment are low. Initially, the framework of Heredia et al. was designed to be executed in a human-robot interaction environment [15], whose restrictions precisely favor less complex and lighter models. However, the framework was trained on the in-lab IEMOCAP dataset, which has very different characteristics from the in-the-wild datasets reported in this work. Even though this framework has to be optimized and new components have to be tested, it establishes a starting point and allows us to evaluate the critical conditions presented by in-the-wild datasets. Additionally, the results show that extra preprocessing tasks have to be carried out in order to yield better performance.

Conclusions
A variety of datasets have been designed for emotion recognition. In this work, we have evaluated ER performance using a previously designed architecture and four in-the-wild datasets: AFEW, SFEW, MELD and AffWild2. In the literature, it is common to find architectures based on Deep Learning for emotion recognition; we used an ensemble of pre-trained networks and performance metrics such as accuracy and F1-Score. The results show that our models can effectively identify emotions using cropped images, audio and transcriptions of what is being said. However, the available datasets have not been designed for multimodal ER tasks.
Comparing the results obtained with the studied datasets, for the face modality our best performing dataset was AffWild2, with a mean accuracy of 53.98% and a mean F1-Score of 0.514. This is mostly because of the large number of image samples provided in the dataset, allowing better generalization. For the audio modality, our model also performed best with AffWild2, with a mean accuracy of 46.93% and an F1-Score of 0.473, somewhat better than MELD, which followed with a mean accuracy of 43.14% and a mean F1 of 0.356. Though MELD was designed for the audio and text modalities, the number of available examples per class is somewhat unbalanced, especially in less frequent classes such as disgust or fear, which is understandable since the original source is a comedy show. Within the text modality, the best mean accuracy was achieved with AffWild2 at 60.69%, but this result is skewed, since our model over-fitted this dataset due to the huge number of neutral and happiness samples relative to the other classes. The best dataset for this task overall was MELD, with a mean accuracy of 57.40% and an F1-Score of 0.537. Since this dataset was designed for this particular task and includes transcribed text of the sampled dialog, our model was able to classify with higher certainty than with AffWild2 and AFEW, whose transcriptions were dependent on both the stored sound quality and the speech recognition model used.
When comparing all three datasets for multimodal fusion, AffWild2 had the best overall performance, despite having more data availability biases. This could be further improved with more labeled data, especially from different sources and environments, and with data augmentation for the least represented classes.
The next steps in this research include building a new dataset from the existing in-the-wild datasets, improving the preprocessing of each data source and its labeling for multimodal tasks, and retraining our set of networks after incorporating optimization techniques, with the goal of improving both unimodal and multimodal performance.