1 Introduction

Affective computing research [27], which aims to measure and recognize an individual’s emotional state and to model emotional interactions between humans and computer systems, is growing nowadays. Among other skills, one of the most important characteristics when interacting with other people is the ability to empathize (commonly expressed by the term Theory of Mind, ToM [31]). This ability relies on the detection of the emotions that others show through their facial and verbal expressions, physiological responses, and body gestures. Among these, vocal expressiveness is particularly important, as it conveys 38% of the emotional information associated with a message [23]. This fact opens the possibility of adapting verbal human–machine interaction to the emotional state of the user, providing a better user experience. This capability has many applications, such as chatbots (e.g., help desk services), robotics (e.g., assisting elderly people), or Augmentative and Alternative Communication systems (AACs, e.g., assistive emotional learning systems for people with disabilities). Although there is no general agreement on how to define an emotion, there is some consensus on the use of a working definition that consists of a mere list of analogous terms such as ‘anger, disgust, fear, happiness, sadness, surprise’ [14], which is a categorical description of a more complex human state that includes emotional experience and regulation processes according to [45].

During the last decade, speech emotion recognition (SER) technology has matured enough to be used in noise-free and speaker-dependent practical applications such as health care [44], education [5], or robotics [21]. Most of these studies use Support Vector Machines (SVMs) [35] or recurrent and deep neural networks [1]. These solutions rely on databases such as Emo-DB [3] that are usually generated using actors to simulate the emotions. A review and comprehensive analysis of datasets can be found in [36, 43]. Moreover, despite some practical applications, SER technology is still not mature enough to recognize the six Ekman’s emotions in a speaker-independent setting, and usually a subset of emotions (e.g., sentiment classification) is targeted in order to obtain good enough classification accuracies.

Most SER systems are developed for English, and there are only a few for other languages because of the lack of public datasets. For instance, in Spanish there are only two simulated datasets recorded by a few actors, according to [43]. Due to the worldwide importance of the Spanish language, the scarcity of emotionally annotated speech databases for this language, and the need to continue exploring some of the aforementioned problems, in this article we present a new elicited (non-acted) Spanish database and its application to the problem of human speech emotion recognition (SER). As mentioned, we did not employ acted speech but speech elicited by a combination of induction procedures, similarly to DEMoS [26].

The creation of a dataset is costly because either actors are needed to simulate the emotions or a validation process is needed for elicited or real scenarios. The creation of elicited non-actor datasets demands the detection of those audios that are not properly expressed, which should be discarded. This can be done by a perception test that relies on human responses. However, multiple independent labeling actions are needed to statistically demonstrate that consensus exists for each audio sample. Crowdsourcing approaches are appealing for this task, as they have already been used successfully in other contexts [25]. We make use of this crowdsourcing approach, as explained in Section 2, to create EmoSpanishDB and EmoMatchSpanishDB. In Section 3 the experimental design is presented, including the extracted audio features used to test the two databases. In Section 4 the results are discussed and, finally, some conclusions are provided in Section 5.

2 Creation of Spanish emotional speech Corpus: EmoSpanishDB and EmoMatchSpanishDB

One of the major problems when working on emotion detection studies is the limited number of speakers available in the current databases. Speaker-specific information may play a considerable role if speech utterances of the same speaker are used for both training and testing the machine learning models. On the other hand, the developed models may produce poor results due to a lack of generality if speech utterances of different speakers are used for training and testing [17]. Figure 1 shows the whole database creation workflow that we executed.

Fig. 1

EmoSpanishDB and EmoMatchSpanishDB creation workflow

2.1 Spanish sentences selection

The first step is the selection of the Spanish sentences to be recorded. According to the ‘Real Academia Española’ (RAE) [33], there are 23 phonemes in the central area of Spain (see Table 1), and several phoneticians have studied their statistical distribution in a regular conversation. In [34] the authors reviewed and summarized the different studies, obtaining a global distribution of phonetic appearance (see Table 1). In order to replicate a regular conversation, we have used these phonetic percentages to create a total of 12 sentences that contain all these Spanish phonemes within the defined minimum and maximum ranges. Moreover, the sentences were created without emotional semantic connotation, to avoid any emotional influence on the speaker while speaking, and have similar lengths (approx. 2 s). We also analyzed the number of labeled audio samples per emotion and sentence to ensure this independence (see Table 2) and performed a non-parametric Mann-Whitney U test for every pair of sentences to verify that different sentences obtained similar results (\({H}_{0}\): the pair of sentences follows the same distribution). In all cases the p-value was higher than 0.05, so the null hypothesis could not be rejected (the lowest p-value, 0.12, was obtained between S7 and S8). The sentences used for the creation of the dataset can be consulted in the Appendix. It is worth mentioning that people from the central area of Spain pronounce the phoneme /ll/ as /y/, the letter “h” has no sound at all, and “x” corresponds to the sum of the phonemes /k/ and /s/ (see Table 1).
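
As an illustration, the pairwise check described above can be reproduced with SciPy. This is a minimal sketch, not the authors’ code: what exactly enters the test (here, the per-emotion label counts of each sentence) is an assumption, and the values below are hypothetical placeholders; the real counts are those reported in Table 2.

```python
# Minimal sketch of the pairwise Mann-Whitney U check described above.
# The per-sentence emotion counts are hypothetical placeholders.
from itertools import combinations
from scipy.stats import mannwhitneyu

# counts of labeled audio samples per emotion (7 values) for each sentence
labels_per_sentence = {
    "S1": [42, 40, 45, 43, 41, 44, 48],   # hypothetical
    "S2": [44, 39, 46, 42, 40, 45, 47],   # hypothetical
    # ... up to S12
}

for s_a, s_b in combinations(labels_per_sentence, 2):
    stat, p_value = mannwhitneyu(
        labels_per_sentence[s_a], labels_per_sentence[s_b],
        alternative="two-sided",
    )
    # H0 (same distribution) is retained whenever p >= 0.05
    print(f"{s_a} vs {s_b}: U={stat:.1f}, p={p_value:.3f}")
```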

Table 1 List of Spanish phonemes, their number of appearances within the sentences used in the study, and the theoretical percentage ranges of appearance in the Spanish language
Table 2 Number of labeled audio samples per emotion and sentence–EmoSpanishDB
Fig. 2

Noise-free professional radio studio used for the audio recordings

2.2 Audio samples recording

A total of 50 individuals were recorded playing out the 12 selected sentences seven times (once for each Ekman’s basic emotion [9], ‘anger, disgust, fear, happiness, sadness, surprise’, plus neutral). In total, 4200 audio files (raw emotional audios) were collected. The participants’ demographics are shown in Table 3. To the best of our knowledge, this database is the first in the Spanish language that contains elicited emotional voices played out by non-actors. Moreover, it is also the largest publicly available dataset compared to previous ones [13, 24] and to the Berlin emotional speech database [3], which has 800 sentences. A professional radio studio was used to record these audios (Fig. 2). The audio files were recorded noise-free in PCM format with a sampling rate of 48 kHz and a bit depth of 16 bits (uncompressed audio). The audios were then embedded in a Waveform Audio File Format container (.wav). A dynamic mono-channel cardioid microphone (Sennheiser MD421) and the AudioPlus (AEQ) software were used to record the signal, and the voice was captured at a hand’s distance from the microphone. At the beginning, the speaker receives a sheet with all the sentences and emotions that he/she has to simulate. Next, to induce the emotion, the speaker is shown a MIP (Mood Induction Procedure through watching pictures) image extracted from the Geneva Affective Picture Database (e.g., bugs or spiders) [7], combined with an empathy MIP, in a manner similar to DEMoS [26]. This empathy MIP is based on creating an empathic reaction by reading a text with emotional content [12]. Hence, before the recording, the speaker looks at an image representing the emotion and listens to a short text that induces it.
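
For readers working with the released files, the recording specification above can be checked programmatically. This is a minimal sketch assuming the soundfile package; the file name is a hypothetical placeholder.

```python
# Minimal sketch to verify that a downloaded sample matches the recording
# specification described above (48 kHz, 16-bit PCM, mono WAV).
# "sample.wav" is a hypothetical file name.
import soundfile as sf

info = sf.info("sample.wav")
assert info.format == "WAV", "expected a WAV container"
assert info.subtype == "PCM_16", "expected uncompressed 16-bit PCM"
assert info.samplerate == 48000, "expected a 48 kHz sampling rate"
assert info.channels == 1, "expected a mono recording"
print(f"{info.frames} frames, {info.frames / info.samplerate:.2f} s of audio")
```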

Table 3 Dataset Demographics

2.3 Crowdsourcing corpus label process

To ensure that the set of elicited audios does not contain noisy samples, a perception test was conducted by a set of independent individuals using a crowdsourcing approach. The goal of this crowdsourcing process is to label all the recorded audio samples with an emotion. One of the major constraints when using crowdsourcing for human computation purposes is the lack of guarantees about the expertise of the participants. Therefore, the tasks that the participants have to solve need to be small, simple, and well-formed (usually called micro-tasks). Moreover, due to human error and bias, it is mandatory to ensure that the collected responses are adequately reliable, so a quality control mechanism is needed [25]. In our case, a straightforward multiple-choice emotion questionnaire is used. Participants play the recorded raw audios and have to select the emotion they perceive among the seven proposed. They can play an audio as many times as needed until they decide to label it or skip it (the participant can quit the process at any time). The selection is then stored in a remote database for further analysis. We use the term crowd-labeling [28] to refer to the set of micro-tasks performed by people to label the recorded audios with an emotion.

It has been shown that, when a large enough number of labels are obtained per sample, the quality of the majority-vote scheme applied to responses collected with a crowdsourcing tool is at least as good as that of answers provided by individual experts [41]. If a large number of samples needs to be labeled, a large number of people is also needed to perform the micro-tasks. To minimize the number of micro-tasks performed by participants (i.e., the total number of micro-tasks needed to label all raw emotional audios), a two-sided binomial significance test is proposed as the criterion to establish consensus and assign an emotion to an audio. The goal is to assign a label to an audio sample as soon as any of the possible emotions obtains a significant \(p\)-value \(<0.05\). The two-sided test evaluates the null hypothesis that the probability of success in a Bernoulli experiment is \(p\). This value has been set to \(p=1/7\approx 0.143\), assuming that all emotions are equally probable, i.e., 1 out of 7 is the expected probability for each possible label given that we have six emotions plus neutral. Following this approach, in a majority-voting scenario and assuming that participants are non-experts, it is known that an average of 4 labels are needed to emulate expert-level label quality [41]. Based on that value, we limited the maximum number of trials to 10 to alleviate the crowdsourcing effort. Hence, the stopping criterion for an audio label depends on the number of equal labels needed to reach consensus. Given that the maximum number of labels allowed during the crowd-labeling process is 10 (otherwise the audio is discarded), at least 3 equal labels are needed for 3, 4, 5, and 6 trials, and 4 for 7, 8, 9, and 10 trials (the most relaxed cases need 3 out of 6 or 4 out of 10 equal labels). Figure 3 shows the binomial probability distribution curves for the mentioned numbers of trials, from 3 to 10. This criterion has been applied to the whole set of raw audio samples. The audios are shown randomly to the participants until they are labeled or discarded. Completing the crowd-labeling process required a total of 21,490 micro-tasks performed by 194 independent native Spanish speakers, with an average of 5 labels per audio sample, just 1 above the average reported in [41].
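
The stopping rule described above can be reproduced in a few lines with SciPy. This is a sketch of the criterion as stated in the text, not the authors’ implementation; it prints, for each number of collected labels, the smallest count of identical labels that reaches significance.

```python
# Sketch of the consensus criterion described above: for each number of
# collected labels n (3..10), find the smallest count k of identical labels
# whose two-sided binomial p-value under p0 = 1/7 falls below 0.05.
from scipy.stats import binomtest

P0 = 1 / 7          # a priori probability of each of the 7 labels
ALPHA = 0.05        # significance level used to declare consensus

for n in range(3, 11):
    for k in range(1, n + 1):
        p_value = binomtest(k, n, P0, alternative="two-sided").pvalue
        if p_value < ALPHA:
            print(f"{n} trials: consensus with {k} identical labels "
                  f"(p = {p_value:.3f})")
            break
```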

Fig. 3

Binomial probability distributions for different numbers of trials with p = 0.1428, used to evaluate consensus during the crowdsourcing process

A total of 3550 audios were labeled with an emotion and compose the EmoSpanishDB database (Footnote 1); the other 650 were discarded because they did not reach consensus (see Table 2). Among the labeled samples in EmoSpanishDB, 2020 also matched the original elicited emotion, and those audios are collected in EmoMatchSpanishDB (Footnote 2) (see the number and percentage of matches per emotion in Table 4).

Table 4 Percentage of matches between the EmoSpanishDB labels and the original elicited emotion categorized per emotion

3 Experimental design and methodology

Figure 4 shows the experimental flow. Speech feature extraction plays a key role in SER systems, as it must capture the most important emotional characteristics. The most common categorization of emotional acoustic features includes two categories: spectral and prosodic [6]. Following these categories, we selected a set of features that are adequate for the emotion classification task, as proposed in other works [39]. Note that the spectral (frequency-based) features are obtained by converting the time-domain signal into the frequency domain using the Fourier transform. Since the speech signal is constantly changing, spectral features cannot be extracted from the whole signal at once. Thus, the signal is framed into 20 ms windows to analyze its frequency content over short segments of the longer signal (this is a typical window size, but other sizes may also be valid). The spectral features extracted were: the first 13 Mel-Frequency Cepstral Coefficients (MFCCs) and their mean, standard deviation, kurtosis, and skewness; the first and second derivatives of the MFCCs (\(\Delta\)MFCC and \(\Delta\Delta\)MFCC); spectral centroid; spectral flatness; spectral contrast; and Linear Predictive Coding (LPC) coefficients. The prosodic features represent the supra-segmental elements of oral expression, i.e., elements that affect more than one phoneme and cannot be segmented into smaller units, such as accent, tone, rhythm, and intonation. Among the prosodic features, we used the fundamental frequency (\({\mathrm{F}}_{0}\)), intensity, and tempo. To obtain a common number of input features, the mean and some other statistics are calculated over all frames, resulting in a total of 140 features to be used as input for the machine learning models (Table 5 contains a full description of these features). The spectral features were extracted using Praat [15] and the prosodic ones using librosa [22]. Moreover, we also tested EmoMatchSpanishDB with two other commonly used feature sets: eGeMAPS [10] (88 features) and ComparE [38] (6373 features). These features were extracted using the openSMILE library [11].
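
As an illustration of the frame-level extraction and statistical pooling described above, the following is a minimal librosa-only sketch. It is not the authors’ pipeline (they used Praat for part of the extraction) and only approximates the 140-feature recipe of Table 5; the file name and window parameters are placeholders.

```python
# Minimal sketch of frame-level extraction plus statistical pooling,
# approximating (not reproducing exactly) the feature set of Table 5.
# "sample.wav" is a hypothetical file name.
import numpy as np
import librosa
from scipy.stats import kurtosis, skew

y, sr = librosa.load("sample.wav", sr=None)        # keep the native 48 kHz
n_fft = int(0.020 * sr)                            # 20 ms analysis window
hop = n_fft // 2

# Spectral features
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
d_mfcc = librosa.feature.delta(mfcc)
dd_mfcc = librosa.feature.delta(mfcc, order=2)
centroid = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
flatness = librosa.feature.spectral_flatness(y=y, n_fft=n_fft, hop_length=hop)
contrast = librosa.feature.spectral_contrast(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
lpc = librosa.lpc(y, order=12)                     # LPC coefficients

# Prosodic features
f0 = librosa.yin(y, fmin=65, fmax=500, sr=sr, hop_length=hop)
rms = librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop)  # intensity proxy
tempo, _ = librosa.beat.beat_track(y=y, sr=sr)

def pooled(x):
    """Pool a (bands x frames) matrix into per-band mean, std, kurtosis, skewness."""
    x = np.atleast_2d(x)
    return np.concatenate([x.mean(axis=1), x.std(axis=1),
                           kurtosis(x, axis=1), skew(x, axis=1)])

features = np.concatenate([pooled(mfcc), pooled(d_mfcc), pooled(dd_mfcc),
                           pooled(centroid), pooled(flatness), pooled(contrast),
                           lpc, pooled(f0), pooled(rms), np.atleast_1d(tempo)])
print(features.shape)
```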

Fig. 4

Speech emotion recognition experimental flow

The machine learning algorithms were validated using cross-validation (CV). In standard CV, instance partitioning is based on randomly sampling files from a pool in which all speakers are mixed across the CV partitions (i.e., it is not speaker-independent). We also adapted the CV process to validate the models in a Leave-One-Speaker-Out (LOSO) scenario. For this second approach, the data were split into 10 folds, keeping all the audios of each individual together. Therefore, fold 1 contains 1/10 of the individuals, fold 2 another different 1/10 of the individuals, and so on. This guarantees that the training and testing partitions never contain instances belonging to the same individual. We also tested two different sets of features to compare the EmoSpanishDB and EmoMatchSpanishDB datasets. The first contains only the 13 Mel frequency coefficients, and the second the extended set of 140 features described in Table 5. Moreover, we tested EmoMatchSpanishDB with the eGeMAPS and ComparE feature sets, which are commonly used in other SER studies.
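
A minimal sketch of the speaker-grouped split described above, using scikit-learn’s GroupKFold; the feature matrix, labels, and speaker identifiers below are random placeholders rather than the actual data.

```python
# Sketch of the speaker-independent split described above: 10 folds in which
# all audios of a given speaker fall into the same fold.
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(2020, 140))              # placeholder feature matrix
y = rng.integers(0, 7, size=2020)             # placeholder emotion labels
speaker_ids = rng.integers(0, 50, size=2020)  # placeholder speaker codes

cv = GroupKFold(n_splits=10)
for fold, (train_idx, test_idx) in enumerate(cv.split(X, y, groups=speaker_ids)):
    shared = set(speaker_ids[train_idx]) & set(speaker_ids[test_idx])
    assert not shared, "a speaker leaked across train and test"
    print(f"fold {fold}: {len(set(speaker_ids[test_idx]))} held-out speakers")
```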

Table 5 Description of the extracted audio features

Given a ground truth and a prediction, the machine learning models were optimized using the unweighted F1-score metric because of its robustness to the imbalance in the number of samples per category. The F1-score, defined as \(\frac{2\cdot Precision\cdot Recall}{Precision+Recall}\), is the harmonic mean of \(Precision=\frac{TP}{TP+FP}\) and \(Recall=\frac{TP}{TP+FN}\). The unweighted accuracy, \(Accuracy=\frac{TP+TN}{Total\;number\;of\;cases}\), is also measured for interpretation purposes. We applied a min–max normalization to each feature before starting the learning process. eXtreme Gradient Boosting (XGBOOST), Support Vector Machine (SVM), and Feed-Forward Deep Neural Network (FFNN) machine learning methods were selected to measure the SER performance on the presented datasets (EmoSpanishDB and EmoMatchSpanishDB).
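
In scikit-learn terms, the two reported metrics correspond to macro-averaged F1 and plain accuracy; the following minimal sketch uses dummy label vectors, not the study’s predictions.

```python
# Sketch of the evaluation metrics described above: unweighted (macro) F1
# and accuracy, computed on dummy label vectors for illustration only.
from sklearn.metrics import accuracy_score, f1_score

y_true = ["anger", "fear", "neutral", "happiness", "anger", "disgust"]
y_pred = ["anger", "fear", "neutral", "surprise", "anger", "neutral"]

macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean over classes
acc = accuracy_score(y_true, y_pred)
print(f"macro F1 = {macro_f1:.3f}, accuracy = {acc:.3f}")
```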

These models are defined by several hyper-parameters that need to be set. For XGBOOST, the following hyper-parameters were fitted, testing the listed ranges of values: minimum number of samples per leaf: 3, 5, 10; number of estimators: 5, 10, 50, 100; tree depth: from 1 to 10 in steps of 2; feature subsampling ratio: 0.5, 1; sample subsampling ratio: 0.5, 1. SVM has three main hyper-parameters: the kernel type, the gamma value, which defines the influence of each point, and the C value, which establishes how large the margin separation among classes is. The following ranges of values were tested for each hyper-parameter: kernel type: linear, polynomial, and radial; gamma values: \(1{e}^{-4},1{e}^{-3},1{e}^{-2}\); polynomial degree: 2, 3, 4, 5, 10, and 20; C values: \(1{e}^{-4},1{e}^{-2},1{e}^{-1},1,10,100\). FFNN has four main hyper-parameters: the number of hidden layers, their size, the learning rate, and the activation function. The following ranges of values were tested: number of hidden layers: 2, 4; number of neurons per layer: 2, 5, 10, 20; learning rate: 0.001, 0.01; activation function: ReLU.

In order to tune the hyper-parameters, a systematic procedure known as grid search was used. This method tries all possible combinations of hyper-parameter values: models for each combination are trained using the cross-validation procedure described above, and the best combination on the validation set is selected. Two experiments were carried out. The first is speaker-dependent, allowing audios of the same speaker to appear in the training, validation, and test sets; the second is speaker-independent, ensuring that all the samples of an individual appear in only one of the three sets. In both cases we used k = 10 folds and, to avoid bias in the test, we computed the average of the results over 10 repetitions of the same experiment (each one using different randomly selected samples). The experiments were run on EmoSpanishDB, which contains 3550 audios, and EmoMatchSpanishDB, which contains 2020 audios.
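
The tuning loop can be sketched with scikit-learn as follows. The SVM grid mirrors the ranges listed above, min–max normalization is placed inside the pipeline, and the data arrays are small random placeholders; this is an illustration under those assumptions, not the authors’ exact code.

```python
# Sketch of the grid search described above for the SVM model, with
# speaker-grouped folds for the speaker-independent experiment.
import numpy as np
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 140))               # placeholder features
y = rng.integers(0, 7, size=200)              # placeholder emotion labels
speaker_ids = rng.integers(0, 50, size=200)   # placeholder speaker codes

pipeline = Pipeline([("scale", MinMaxScaler()), ("svm", SVC())])
param_grid = {
    "svm__kernel": ["linear", "poly", "rbf"],
    "svm__gamma": [1e-4, 1e-3, 1e-2],
    "svm__degree": [2, 3, 4, 5, 10, 20],      # only used by the polynomial kernel
    "svm__C": [1e-4, 1e-2, 1e-1, 1, 10, 100],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="f1_macro",                       # unweighted F1, as in the paper
    cv=GroupKFold(n_splits=10),               # speaker-independent folds
    n_jobs=-1,
)
search.fit(X, y, groups=speaker_ids)
print(search.best_params_, round(search.best_score_, 3))
```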

4 Results and discussion

Figure 5 shows the percentage of labeled audio samples per emotion after the crowdsourcing validation process was applied and consensus was achieved, together with the percentage of those that also match the original elicited emotion. A total of 3550 labeled audios were obtained via crowd-label consensus, and their distribution by emotion is quite homogeneous, although slightly lower for disgust and slightly higher for neutral. An explanation for this fact, after observation, is the tendency of crowd-labelers to label an audio as neutral whenever they are not sure about the emotion it contains. This is especially relevant for emotions that are harder to detect, such as disgust, as shown in [19]. It can also be observed that the percentage of coincidence between the originally elicited emotion and the crowdsourcing result varies significantly across emotions. The percentage of matches ranges from 86% for neutral to 28.3% for disgust; ordered from lowest to highest: disgust, happiness, fear, sadness, surprise, anger, and neutral. It is thus more difficult for humans, considering both the expression and the recognition of an elicited emotion, to reach consensus for disgust and happiness, and easier for anger or neutral. In [19] the authors reached similar conclusions: they found that human emotion recognizers (crowd-labelers) show higher accuracy and confidence ratings when labeling anger and neutral and, in contrast, lower ones for disgust. They also found an interesting pattern whereby the categorization of surprise is more confident than that of disgust and fear, which also occurs in our case.

Fig. 5

Percentages of labeled emotions vs. original ones after the crowdsourcing process

There are different reasons why a person may have difficulty expressing or recognizing emotions. One of them is Social Anxiety Disorder (SAD) [2], which leads a person to interpret ambiguous cues as negative or threatening [42]. In prosodic emotion recognition, a bias towards the correct identification of fearful voices and a decreased identification of happy voices has also been demonstrated [32], and, more recently, the same behavior has been observed in a study of 31 SAD patients [46]. This can also be observed in Table 6, where anger has the highest match percentage, \(61.27\mathrm{\%}\) (excluding neutral, which may not be considered an emotion in itself), and happiness has the second lowest, \(48.84\mathrm{\%}\), after disgust.

Table 6 Distribution of elicited emotions vs. crowd-labeling results (columns represent the elicited emotions and rows the crowd-labeled emotions) during the creation of EmoMatchSpanishDB

From an application perspective, all the above explanations suggest that it is hard to have a single homogeneous dataset for all applications; there is always some uncertainty that must be considered part of human subjectivity and not part of the error of the model itself. This reasoning is not applicable to other datasets and studies [40] that only analyze audios simulated by actors and hence assume a ‘perfect’ emotion.

Observe that the number of audio files that reached consensus on a label different from the elicited one is quite large (it is the difference between the audio samples of EmoSpanishDB and EmoMatchSpanishDB: 3550 − 2020 = 1530 samples). As introduced above, the main reason is the tendency to label an audio sample as neutral when there is not enough confidence about the emotion it contains (note that 1130 audios were labeled as neutral, whereas only 600 were expected according to the elicitation process). Therefore, we created and compared two alternative datasets, EmoSpanishDB and EmoMatchSpanishDB, showing that the latter solves (at least partially) this problem and is more accurate, as expected.

Table 7 shows the results of SVM, XGBOOST, and FFNN for the different experiments with 13 and 140 features. In all cases, the use of EmoMatchSpanishDB improves the results over EmoSpanishDB (improvements of 19% and 10% for all samples and LOSO, respectively). This means that when an audio is expressed and recognized with the same emotion, the ML model is able to learn better than when using all the sample audios labeled by the human crowdsourcing recognition process. This reinforces the previous reasoning, where we argued that there is some uncertainty associated with different human expressiveness and recognition capabilities, which is solved (at least partially) by the EmoMatchSpanishDB alternative.

Table 7 Unweighted F1-Score and accuracy results obtained for different machine learning techniques, EmoMatchSpanishDB all samples and LOSO

The best results, using the F1-score as the metric to optimize the ML models, are always obtained with the SVM method (see Table 7 for all samples and LOSO, and Table 8 for the men and women comparison). Note that in a multi-class problem it is desirable to have a single score that gives a global overview of the performance. For this purpose, Cohen’s Kappa coefficient was used, and values of 0.573 and 0.394 were obtained for all samples and LOSO at EmoMatchSpanishDB, respectively. This difference is reasonable because LOSO uses independent speakers to test the model. Table 8 shows the results for EmoMatchSpanishDB separated by gender; no significant differences were observed in the results for the three models (p-value \(>\) 0.99). For EmoSpanishDB, splitting the dataset by gender not only does not improve the results but worsens them, as shown in Table 10. These facts argue against the need to separate the audios by gender to obtain better results, as other studies have also defended [48].
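
Cohen’s Kappa is computed directly from the true and predicted labels; a minimal sketch with dummy vectors (not the study’s predictions):

```python
# Sketch of the single-score summary used above: Cohen's Kappa between the
# ground-truth labels and the model predictions (dummy vectors for illustration).
from sklearn.metrics import cohen_kappa_score

y_true = ["anger", "fear", "neutral", "happiness", "anger", "disgust"]
y_pred = ["anger", "fear", "neutral", "surprise", "anger", "neutral"]
print(f"Cohen's Kappa = {cohen_kappa_score(y_true, y_pred):.3f}")
```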

Table 8 Unweighted F1-Score and accuracy results obtained for different machine learning techniques, EmoMatchSpanishDB women and men

Analyzing the results with respect to the number of features (13 vs. 140), it is observed that using the larger number of features yields better F1-scores and accuracies, and the best results are always obtained with 140 features in all cases (see Tables 7, 8, 9 and 10). In order to quantify this improvement, we calculated the average percentage improvement for both settings (all samples and LOSO), which is \(\approx\) 37.5%.

Table 9 Unweighted F1-Score and accuracy results obtained for different machine learning techniques, EmoSpanishDB all samples and LOSO
Table 10 Unweighted F1-Score and accuracy results obtained for different machine learning techniques, EmoSpanishDB women and men

Overall, the best technique is SVM, which obtains the best F1-scores in all the experiments, with an average improvement of \(\approx\) 16.75% over XGBOOST and \(\approx\) 14% over FFNN. Figures 6 and 7 show the precision, recall, and F1-score by category for this model with the 140-feature set. Moreover, Figs. 8 and 9 show the confusion matrices for all samples and LOSO, also using SVM with 140 features. The results obtained using EmoMatchSpanishDB improve those of EmoSpanishDB for all the emotions, as expected. In the all-samples experiment there are four emotions (disgust, anger, fear, and neutral) with an F1-score higher than 60%, which is considered good from a learning perspective. On the contrary, surprise has the lowest value (approx. 41%) and is mainly confused with happiness (notice that happiness is also confused with surprise, meaning that both emotions share common characteristics). This was also highlighted by the authors of [19]. The same behavior can be observed in the LOSO experiment (Fig. 9). The LOSO results are significantly worse than those obtained using all samples, meaning that emotions are speaker-dependent and some pre-processing to remove that personalization is needed if a speaker-independent SER system is to be developed. The emotion that suffers most from this dependency is disgust, which worsens by \(\approx\) 44%, followed by happiness with a decrease of \(\approx 31\mathrm{\%}\) and fear with \(\approx\) 28%.
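
The per-emotion confusion analysis of Figs. 8 and 9 corresponds to a standard row-normalized confusion matrix; the following is a minimal sketch with randomly generated dummy labels rather than the study’s predictions.

```python
# Sketch of the confusion analysis discussed above: a row-normalized
# confusion matrix over the seven categories (dummy labels for illustration).
import numpy as np
from sklearn.metrics import confusion_matrix

emotions = ["anger", "disgust", "fear", "happiness",
            "sadness", "surprise", "neutral"]
rng = np.random.default_rng(0)
y_true = rng.choice(emotions, size=200)
y_pred = rng.choice(emotions, size=200)

cm = confusion_matrix(y_true, y_pred, labels=emotions, normalize="true")
for emotion, row in zip(emotions, cm):
    print(f"{emotion:>9}: " + " ".join(f"{v:.2f}" for v in row))
```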

Fig. 6

Precision, Recall, and F1-score LOSO results by emotion for the best model (SVM and 140 features)

Fig. 7

Precision, Recall, and F1-score All samples results by emotion for the best model (SVM and 140 features)

Fig. 8

Confusion matrix for all samples and best model (SVM using 140 features)

Fig. 9

Confusion matrix for LOSO and best model (SVM using 140 features)

Finally, Tables 11 and 12 show the results obtained using the eGeMAPS and ComparE features for EmoMatchSpanishDB. It can be observed that the best results are always obtained with the SVM model, and the use of the ComparE features always improves the results over the other feature sets (eGeMAPS and the 13- or 140-feature sets presented above; see Tables 7 and 8). The improvement in F1-score is \(\approx 2\mathrm{\%}\), \(\approx 39\mathrm{\%}\), \(\approx 11\mathrm{\%}\), and \(\approx 14\mathrm{\%}\) for the all-samples, LOSO, men, and women experiments, respectively. It is also important to note that the best recall for LOSO is \(0.59\); this value is similar to the state-of-the-art results obtained on similar datasets for other languages (see Table 11 in [26]).
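
The eGeMAPS and ComparE functionals are typically extracted with the openSMILE Python wrapper; the following is a minimal sketch under that assumption (the file name is a placeholder, and the exact openSMILE configuration used by the authors may differ).

```python
# Sketch of eGeMAPS / ComParE functional extraction with the opensmile
# Python wrapper. "sample.wav" is a hypothetical file name.
import opensmile

egemaps = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,     # 88 functionals
    feature_level=opensmile.FeatureLevel.Functionals,
)
compare = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,   # 6373 functionals
    feature_level=opensmile.FeatureLevel.Functionals,
)

egemaps_feats = egemaps.process_file("sample.wav")   # pandas DataFrame, 1 row
compare_feats = compare.process_file("sample.wav")
print(egemaps_feats.shape, compare_feats.shape)
```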

Table 11 Unweighted F1-Score and accuracy results obtained for SVM, XGBOOST, and FFNN at EmoMatchSpanishDB all samples and LOSO (eGeMAPS and ComparE features sets)
Table 12 Unweighted F1-Score and accuracy results obtained for SVM, XGBOOST, and FFNN at EmoMatchSpanishDB women and men (eGeMAPS and ComparE features sets)

5 Conclusions

One of the main problems in speech emotion recognition is the absence of public databases, which is very relevant for all languages except English. We created and made public a Spanish elicited emotion dataset consisting of fifty subjects. The generated audio samples were curated using a crowdsourcing approach to avoid discrepancies between the elicited emotions and the emotions recognized by humans. Consensus was determined using a binomial test with an a priori probability of \(1/7\). An average of six labels was needed to complete the process; \(\approx 84\mathrm{\%}\) of the dataset was successfully classified with an emotion, and \(48\mathrm{\%}\) was classified according to the original elicited emotion. This is, to the best of the authors’ knowledge, the largest public Spanish dataset. The EmoMatchSpanishDB dataset was tested using some of the most successful machine learning models, and different sets of audio features were also compared. The results show that, using the SVM model and the ComparE feature set, up to \(65\mathrm{\%}\) accuracy can be obtained for the six Ekman’s emotions plus neutral. Finally, a Leave-One-Speaker-Out test was performed, and the results show an accuracy of \(64.2\mathrm{\%}\) and a \(0.573\) Cohen’s Kappa coefficient. These results are similar to the state of the art of other recently created elicited databases. We envision that other machine learning methods would benefit differently from the release of this dataset, and the comparative study presented here provides a good baseline for future improvements and advances in this area for the Spanish language.