Recognition of Emotion from Speech: A Review

Emotional speech recognition is an area of great interest for human-computer interaction. The system must be able to recognize the user's emotion and act accordingly. This requires a framework whose modules perform speech-to-text conversion, feature extraction, feature selection and classification of the selected features to identify the emotions. Classification involves training a model for each emotion. Another important aspect of emotional speech recognition is the database used for training the models. The features selected for classification must also be salient enough to identify the emotions correctly. Integrating all of these modules yields an application that recognizes the user's emotions and passes them to the system as input so that it can respond appropriately.


Introduction
In human interaction, information is exchanged in many ways (speech, body language, facial expressions, etc.). A spoken message in which people express ideas or communicate carries a great deal of information that is interpreted implicitly. This information may be expressed or perceived in the intonation, volume and speed of the voice and in the emotional state of the speaker, among others. The speaker's emotional state is closely related to this information. In evolutionary theory, the term "basic" is widely accepted for defining certain emotions. The most popular set of basic emotions is happiness (joy), anger, fear, boredom, sadness, disgust and neutral.
Over the last years, the recognition of emotions has become a multi-disciplinary research area that has received great interest. It plays an important role in improving human-machine interaction. Automatic recognition of the speaker's emotional state aims to achieve a more natural interaction between humans and machines. It could also be used to make the computer act according to the actual human emotion. This is useful in various real-life applications, such as real-life emotion detection using a corpus of agent-client spoken dialogues from a medical emergency call centre, detection of the emotional manifestation of fear in abnormal situations for a security application, support of semi-automatic diagnosis of psychiatric diseases, and detection of the emotional attitudes of children in spontaneous dialog interactions with computer characters. Considering the other side of a communication system, progress has been made in speech synthesis too. Bio signals (such as ECG, EEG, etc.) and face and body images are an interesting alternative for detecting emotional states. However, the methods used to record and process these signals are more invasive and complex, and they are impossible to apply in certain real applications. Speech signals are therefore a more feasible option. Standard classifiers obtain good results, but their performance may have reached a limit. Fusion, combination and ensembles of classifiers could represent a new step towards better emotion recognition systems.
This chapter aims to provide a comprehensive review of emotional speech recognition. The chapter is organized as follows. Section 2 describes the frameworks used for speech emotion recognition (SER). Section 3 gives an overview of the types of databases. Section 4 presents the acoustic characteristics of emotions. Section 5 presents feature extraction and classification. Section 6 discusses the applications of emotion recognition. Section 7 presents concluding remarks.

Basic framework for emotional recognition
Fig. 1 gives the basic framework of emotional speech recognition. The input files are speech signals. The feature extraction script extracts features that represent global statistics. The post-processing step resolves the interface mismatch between the feature extraction script and the feature selection technique. Feature selection then eliminates irrelevant features that hinder the recognition rate; it lowers the input dimensionality and saves computation time. Distribution models such as GMMs are trained using the most discriminative features. Finally, the classifiers distinguish the types of emotion.
In ECG-based emotion recognition, the ECG is recorded using an ECG sensor. The signals are preprocessed using a low-pass filter at 100 Hz. Features are then extracted from the preprocessed signal by the continuous wavelet transform (CWT) or the discrete wavelet transform (DWT). Feature selection is done using the Tabu Search (TS) algorithm, the Simba algorithm, etc. The selected features are fed into a classifier (a Fisher or K-Nearest Neighbor (KNN) classifier) to identify the type of emotion.
Galvanic Skin Response (GSR) is a measure of skin conductivity. There is a correlation between GSR and the arousal state of the body. In a GSR emotion recognition system, the GSR signal is physiologically sensed and the features are extracted using Immune Hybrid Particle Swarm Optimization (IH-PSO). The extracted features are classified using a neural network classifier to identify the type of emotion.
In facial emotion recognition, the facial expression of a person is captured as video and fed into a facial feature tracking system. Fig. 3 gives a basic framework of facial emotion recognition. In the facial feature tracking system, tracking algorithms such as wavelets or the dual-view point-based model are applied to track the eyes, eyebrows, furrows and lips and to collect all their possible movements. The extracted features are then fed into a classifier such as Naïve Bayes, TAN or HMM to classify the type of emotion.
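The "global statistics" step of the speech framework described in this section can be illustrated with a minimal sketch. It assumes that a frame-level contour (for example pitch in Hz) has already been computed; the function name `global_statistics`, the choice of statistics and the sample values are illustrative assumptions, not taken from a specific system.

```python
import statistics


def global_statistics(contour):
    """Summarize a frame-level contour (e.g. pitch or energy) as global statistics.

    The set of statistics is an assumption made for illustration; real systems
    often use many more (percentiles, slopes, and so on).
    """
    if not contour:
        raise ValueError("contour must not be empty")
    lowest, highest = min(contour), max(contour)
    return {
        "mean": statistics.fmean(contour),
        "std": statistics.pstdev(contour),  # population standard deviation
        "min": lowest,
        "max": highest,
        "range": highest - lowest,
    }


# Hypothetical pitch contour in Hz, one value per voiced frame.
pitch_contour = [110.0, 120.0, 130.0, 120.0]
features = global_statistics(pitch_contour)
```

Each utterance is thereby reduced to one fixed-length feature vector, whatever its duration, which is the form a subsequent feature selection step can work with.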

Emotional speech database
There should be some criteria that can be used to judge how well a certain emotional database simulates a real-world environment. According to some studies, the following are the most relevant factors to be considered:

Acoustic characteristics of emotions in speech
Prosodic features such as pitch, intensity, speaking rate and voice quality are important for identifying the different types of emotions. In particular, pitch and intensity seem to be correlated with the amount of energy required to express a certain emotion. When a speaker is in a state of anger, fear or joy, the resulting speech is correspondingly loud, fast and enunciated with strong high-frequency energy, a higher average pitch and a wider pitch range. Sadness, in contrast, produces speech that is slow and low-pitched, with little high-frequency energy. Table 2 gives a short overview of the acoustic characteristics of various emotional states.

Feature extraction and classification
The collected emotional data usually contain noise from the background and the "hiss" of the recording machine. Noise corrupts the signal and makes feature extraction and classification less accurate, so preprocessing of the speech signal is necessary. Preprocessing also reduces variability.
Normalization is a preprocessing technique that eliminates speaker and recording variability while preserving emotional discrimination. Two types of normalization are generally performed: energy normalization and pitch normalization. Energy normalization: the speech files are scaled such that the average RMS energy of the neutral reference database and of the neutral subset in the emotional databases is the same for each speaker. This normalization is applied separately to each subject in each database. Its goal is to compensate for the different recording settings among the databases. Pitch normalization: the pitch contour is normalized for each subject (speaker-dependent normalization). The average pitch across speakers in the neutral reference database is estimated. Then, the average pitch value for the neutral set of the emotional databases is estimated for each speaker.
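The two normalizations can be sketched as per-speaker scale factors. The energy part follows the description above. For pitch, the text stops after estimating the two averages, so scaling each speaker's contour by the ratio of the reference average to the speaker's neutral average is an assumption of this sketch. All function names, and the convention that unvoiced frames carry an F0 of 0, are also illustrative.

```python
import math


def rms(samples):
    """Root-mean-square value of one speech file given as a list of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def mean_rms(files):
    """Average RMS over a collection of speech files."""
    return sum(rms(f) for f in files) / len(files)


def energy_gain(reference_neutral_files, speaker_neutral_files):
    """Gain that makes the speaker's neutral average RMS match the reference."""
    return mean_rms(reference_neutral_files) / mean_rms(speaker_neutral_files)


def pitch_factor(reference_mean_f0, speaker_neutral_f0):
    """ASSUMPTION: ratio of the reference mean F0 to the speaker's neutral mean F0.

    Unvoiced frames are assumed to be marked with 0 and are ignored.
    """
    voiced = [f for f in speaker_neutral_f0 if f > 0]
    return reference_mean_f0 / (sum(voiced) / len(voiced))


def scale(values, factor):
    """Apply a per-speaker factor to samples or to an F0 contour."""
    return [factor * v for v in values]
```

The same gain or factor would then be applied to all of that speaker's recordings, emotional ones included, so that differences between emotions are preserved while differences between speakers and recording setups are reduced.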
Feature extraction reduces the amount of resources required to describe a large set of data accurately. When analyzing complex data, one of the major problems stems from the number of variables involved. Analysis with a large number of variables generally requires a large amount of memory and computation power, or leads to a classification algorithm that overfits the training sample and generalizes poorly to new samples. Feature extraction is a general term for methods that construct combinations of the variables to get around these problems while still describing the data with sufficient accuracy.
Although significant advances have been made in speech recognition technology, it is still difficult to design a speech recognition system for speaker-independent, continuous speech. One of the fundamental questions is whether all of the information necessary to distinguish words is preserved during the feature extraction stage. If vital information is lost during this stage, the performance of the subsequent classification stage is inherently crippled and can never measure up to human capability. Typically, in speech recognition, speech signals are divided into frames and features are extracted from each frame. Feature extraction thus turns a speech signal into a sequence of feature vectors, which is then passed to the classification stage. With dynamic time warping (DTW), for example, this sequence of feature vectors is compared with a reference data set. With hidden Markov models (HMMs), vector quantization may be applied to the feature vectors, which can be viewed as a further step of feature extraction.
In either case, information loss during the transition from speech signals to a sequence of feature vectors must be kept to a minimum. There have been numerous efforts to develop good features for speech recognition in various circumstances.
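As an illustration of framing and of DTW comparison, the following minimal sketch splits a signal into overlapping frames and computes a basic DTW distance between two sequences of feature vectors. The frame length, hop length, Euclidean local distance and unconstrained step pattern are assumptions chosen for simplicity; practical systems usually add windowing and path constraints.

```python
import math


def frame_signal(samples, frame_length, hop_length):
    """Split a signal into overlapping frames; a trailing partial frame is dropped."""
    return [
        samples[start:start + frame_length]
        for start in range(0, len(samples) - frame_length + 1, hop_length)
    ]


def dtw_distance(seq_a, seq_b):
    """Basic dynamic time warping distance between two sequences of feature vectors."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] is the best accumulated cost of aligning seq_a[:i] with seq_b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])  # Euclidean distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # seq_a advances alone
                cost[i][j - 1],      # seq_b advances alone
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]
```

Because the alignment may stretch or compress either sequence, an utterance spoken more slowly than its reference can still obtain a small distance, which a frame-by-frame comparison would not allow.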
The most commonly extracted speech characteristics fall into the following groups:
Durational and pause-related features: The duration features include the chunk length, measured in seconds, and the zero-crossing rate, used to roughly decode speaking rate. Pause is obtained as the proportion of non-speech to speech in the signal, calculated by a voice activity detection algorithm.
Zipf features: used for a better characterization of rhythm and prosody.
Hybrid pitch features: combine the outputs of two different speech-signal-based pitch marking algorithms (PMA).
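The durational and pause-related features above can be sketched as follows. The zero-crossing rate follows its usual definition. The pause measure replaces a real voice activity detector with a plain energy threshold, and reads "proportion of non-speech to speech" as a ratio of frame counts; both are simplifying assumptions of this sketch, as are the function names and the threshold value.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for prev, cur in zip(frame, frame[1:]) if (prev >= 0) != (cur >= 0)
    )
    return crossings / (len(frame) - 1)


def mean_energy(frame):
    """Average squared amplitude of a frame."""
    return sum(x * x for x in frame) / len(frame)


def pause_proportion(frames, energy_threshold):
    """Ratio of non-speech frames to speech frames.

    ASSUMPTION: a frame counts as speech when its mean energy reaches the
    threshold. This stands in for a proper voice activity detection algorithm.
    """
    speech = sum(1 for f in frames if mean_energy(f) >= energy_threshold)
    if speech == 0:
        raise ValueError("no speech frames detected")
    return (len(frames) - speech) / speech
```

A fixed threshold of this kind only behaves sensibly on clean, level-normalized recordings; on noisy data a dedicated voice activity detector would be needed.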
Feature selection determines which features are the most beneficial, because most classifiers are negatively influenced by redundant, correlated or irrelevant features. To reduce the dimensionality of the input data, a feature selection algorithm is therefore applied to choose the most significant features of the training data for the given task. Alternatively, a feature reduction algorithm such as principal component analysis (PCA) or Sequential Forward Floating Search (SFFS) can be used to encode the main information of the feature space more compactly.
Most research on SER has concentrated on feature-based and classification-based approaches. Feature-based approaches aim at analyzing speech signals and effectively estimating feature parameters that represent human emotional states. Classification-based approaches focus on designing a classifier that determines distinctive boundaries between emotions. Emotional speech detection also requires the selection of a successful classifier that allows quick and accurate emotion identification. Currently, the most frequently used classifiers are linear discriminant classifiers (LDC), k-nearest neighbor (k-NN), Gaussian mixture models (GMM), support vector machines (SVM), decision tree algorithms and hidden Markov models (HMMs). Various studies have shown that choosing the appropriate classifier can significantly enhance the overall performance of the system.
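A minimal sketch of greedy feature selection is shown here. It implements plain sequential forward selection, not the floating variant (SFFS) named above, which additionally removes features again when that improves the criterion. The criterion used, leave-one-out accuracy of a 1-nearest-neighbor rule, is an assumption chosen for brevity, and all names are illustrative.

```python
import math


def loo_1nn_accuracy(samples, labels, feature_idx):
    """Leave-one-out accuracy of a 1-nearest-neighbor rule on the chosen features."""
    correct = 0
    for i, x in enumerate(samples):
        best_dist, best_label = float("inf"), None
        for j, y in enumerate(samples):
            if i == j:
                continue
            d = math.dist([x[k] for k in feature_idx], [y[k] for k in feature_idx])
            if d < best_dist:
                best_dist, best_label = d, labels[j]
        correct += best_label == labels[i]
    return correct / len(samples)


def sequential_forward_selection(samples, labels, n_select, criterion):
    """Greedily add the feature that maximizes the criterion until n_select are chosen."""
    selected = []
    remaining = list(range(len(samples[0])))
    while remaining and len(selected) < n_select:
        best = max(remaining, key=lambda f: criterion(samples, labels, selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Since the classifier is re-evaluated for every candidate feature at every step, this wrapper approach costs more than a simple ranking filter, but it does take interactions between the already selected features and the candidate into account.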
The list below gives a brief description of each algorithm:
LDC: A linear classifier identifies the class (or group) to which an object belongs by making a classification decision based on the value of a linear combination of its feature values. The feature values are usually presented to the system in a vector called a feature vector.
k-NN: Classification locates the unknown instance in feature space, compares it with its k nearest neighbors (training examples) and assigns it the class label held by the majority of those neighbors.

GMM: A model of the probability distribution of the features measured in a biometric system, such as vocal-tract-related spectral features in a speaker recognition system. It represents the existence of sub-populations within the overall population, which are described using a mixture distribution.
SVM: A binary classifier that analyzes data and recognizes patterns; it is used for classification and regression analysis.
Decision tree algorithms: These follow a decision tree in which leaves represent the classification outcome and branches represent the conjunctions of features that lead to that outcome.

HMMs: A generalized model in which hidden variables control the components to be selected. The hidden variables are related through a Markov process. In emotion recognition, the outputs represent the sequence of speech feature vectors, from which the sequence of states that the model passed through can be deduced. The states can correspond to intermediate steps in the expression of an emotion, and each of them has a probability distribution over the possible output vectors. The state sequences allow the emotional state to be predicted. This is one of the most commonly used techniques in speech affect detection.
Boostexter: An iterative algorithm based on the principle of combining many simple and moderately inaccurate rules into a single, highly accurate rule. It focuses on text categorization tasks. An advantage of Boostexter is that it can deal with both continuous-valued input (e.g., age) and textual input (e.g., a text string).
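Of the classifiers listed above, k-NN is the simplest to illustrate. The following minimal sketch assumes Euclidean distance and a plain majority vote; the function name, the default k and the idea of using (mean pitch, mean energy) as a feature vector are illustrative assumptions.

```python
import math
from collections import Counter


def knn_classify(train_vectors, train_labels, query, k=3):
    """Label a query vector by majority vote among its k nearest training vectors."""
    # Indices of the training vectors, ordered by distance to the query.
    order = sorted(
        range(len(train_vectors)),
        key=lambda i: math.dist(train_vectors[i], query),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

An odd k avoids tied votes in two-class problems. Because the distance mixes all features, features measured on very different scales (for example Hz and normalized energy) are normally standardized first; otherwise the feature with the largest numeric range dominates the decision.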

Applications
Emotion detection is a key step in using users' speech and communications as a source of important information about their needs, desires, preferences and intentions. By recognizing the emotional content of users' communications, marketers can customize offerings to users more precisely than before. This adds an interesting dimension to the man-machine interface, with potential for marketing as well as consumer products, transportation, medical and therapeutic applications, traffic control and so on.
Intelligent Tutoring System: It aims to provide intervention strategies in response to a detected emotional state, with the goal of keeping the student in a positive affective state to maximize learning potential. The research follows an ethnographic approach to determine the affective states that naturally occur between students and computers. The multimodal inference component will be evaluated from audio recordings taken during classroom sessions. Further experiments will be conducted to evaluate the affect component and the educational impact of the intelligent tutor.
Lie Detection: A lie detector helps in deciding whether someone is lying or not. This mechanism is used particularly by organizations such as the Central Bureau of Investigation to identify criminals and by cricket councils to fight corruption. X13-VSA PRO Voice Lie Detector 3.0.1 PRO is a fully computerized voice stress analyzer, a software system that aims to detect the truth instantly.
Banking: The ATM will employ speaker recognition and authentication if needed "to ensure higher security level while accessing to confidential data." In other words, the combined deployment of speech recognition, speaker recognition and emotion detection is not designed to be spooky or invasive. "It is just one more step forward the creation of humanlike systems that speak to the clients, understand and recognize a speaker". What is different is the incorporation of emotion detection in the enrollment process, which is probably a very good idea if enrollments are going to be conducted without human assistance or supervision. The machine will be able to talk with the prospective enrollee (and later on the client) and authenticate his or her unique voiceprint while, at the same time, testing voice levels for signs of nervousness, anger or deceit.
In-Car Board System: An in-car board system should be provided with information about the emotional state of the driver so that it can initiate safety strategies, proactively provide aid or resolve errors in the communication according to the driver's emotion.

Prosody in Dialog System: Prosody has been investigated for the detection of frustration and annoyance in natural human-computer dialog. In addition to prosodic features, the contribution of language model information and speaking "style" has been examined.
Results show that a prosodic model can predict whether an utterance is neutral versus "annoyed or frustrated" with an accuracy on par with that of human interlabeler agreement. Accuracy increases when discriminating only "frustrated" from other utterances, and when using only those utterances on which labelers originally agreed. Furthermore, prosodic model accuracy degrades only slightly when using recognized rather than true words. Language model features, even if based on true words, are relatively poor predictors of frustration.
Emotion Recognition in Call Center: Call centers often have the difficult task of managing customer disputes. Ineffective resolution of these disputes can lead to customer discontent, loss of business and, in extreme cases, general customer unrest in which a large number of customers move to a competitor. It is therefore important for call centers to take note of isolated disputes and to train service representatives to handle disputes in a way that keeps the customer satisfied.
A system was designed to monitor recorded customer messages and provide an emotional assessment for more effective call-back prioritization. However, this system only provided post-call classification and was not designed for real-time support or monitoring. Current systems differ in that they aim to provide a real-time assessment that helps in handling the customer while he or she is speaking. Early warning signs of customer frustration can be detected from pitch contour irregularities, short-time energy changes and changes in the rate of speech.

Sorting of Voice Mail: Voicemail is an electronic system for recording and storing voice messages for later retrieval by the intended recipient. A potential application is to sort voice mail according to the emotion in the recorded voice, which would help the recipient respond to the caller appropriately.
Computer Games: Computer games can be controlled through the emotions in human speech. The computer recognizes the player's emotion from their speech and computes the level of the game (easy, medium, hard). For example, if the speech is aggressive, the level becomes hard. If the player is too relaxed, the level becomes easy. All other emotions map to the medium level.
Diagnostic Tool for Speech Therapists: A person who diagnoses and treats a variety of speech, voice and language disorders is called a speech therapist. By understanding and empathizing with emotional stresses and strains, the therapist can find out what the patient is suffering from. The software used for recording and analyzing the entire speech is icSpeech. Speech communication is used in healthcare to allow patients to describe their health condition to the best of their knowledge. In clinical analysis, human emotions are analyzed based on features related to prosody, the vocal tract and parameters extracted directly from the glottal waveform. Emotional expressions can be inferred from the vocal affect extracted from human speech.
Robots: Robots can interact with people and assist them in their daily routines in common places such as homes, supermarkets, hospitals or offices. To accomplish these tasks, robots should recognize human emotions in order to provide a friendly environment. Without recognizing emotion, a robot cannot interact with humans in a natural way.
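The level rule described for computer games amounts to a small lookup. The sketch below assumes that an upstream recognizer returns an emotion label as a string; the label sets `AGGRESSIVE_LABELS` and `RELAXED_LABELS` are hypothetical and would have to match whatever labels the recognizer actually emits.

```python
# Hypothetical label sets; the real labels depend on the emotion recognizer used.
AGGRESSIVE_LABELS = {"anger", "aggressive"}
RELAXED_LABELS = {"relaxed", "calm"}


def difficulty_for_emotion(emotion):
    """Map a recognized emotion label to a game level (easy, medium, hard)."""
    label = emotion.strip().lower()
    if label in AGGRESSIVE_LABELS:
        return "hard"
    if label in RELAXED_LABELS:
        return "easy"
    return "medium"  # every other emotion falls under the medium level
```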

Conclusion
Speech emotion detection requires the creation of a reliable database, broad enough to fit every need of its application, as well as the selection of a successful classifier that allows quick and accurate emotion identification. Thirty-one emotional speech databases are reviewed. Each database consists of a corpus of human speech pronounced under different emotional conditions. A basic description of each database and its applications is provided. The most common emotions searched for, in decreasing frequency of appearance, are anger, sadness, happiness, fear, disgust, joy, surprise and boredom. The complexity of the emotion recognition process increases with the number of emotions and features used within the classifier. It is therefore crucial to select only the most relevant features, both to ensure that the model can identify emotions successfully and to increase performance, which is particularly significant for real-time detection. In the last decade, SER has shifted from a side issue to a major topic in human-computer interaction and speech processing. SER has potentially wide applications. For example, human-computer interfaces could be made to respond differently according to the emotional state of the user. This could be especially important in situations where speech is the primary mode of interaction with the machine.

Fig. 1. Basic framework of SER
Bio signals such as ECG, EEG and GSR, as well as face and body images, are an interesting alternative for detecting emotional states. Fig. 2 shows the mechanism of emotion recognition using these bio signals.

Fig. 2. Framework for emotion recognition using EEG, ECG and GSR signals
EEG is one of the most useful bio signals for detecting the true emotional state of a person. The signal is recorded using electrodes that measure the electrical activity of the brain. The recorded EEG data is first preprocessed to remove serious and obvious motion artifacts. Features are then extracted from the signal using techniques such as the discrete wavelet transform or statistics-based analysis. After extraction, the emotion classifier uses classification techniques such as Fuzzy C-Means or Quadratic Discriminant Analysis to classify the different human emotions.
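As an illustration of wavelet-based feature extraction, the sketch below applies a multi-level discrete wavelet transform and uses the energy of each sub-band as a feature. The text does not say which wavelet is used; the Haar wavelet is chosen here only because it is the simplest, and the function names and the sub-band energy features are assumptions of this sketch.

```python
import math


def haar_dwt(signal):
    """One level of the orthonormal Haar DWT: returns (approximation, detail)."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    root2 = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / root2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / root2 for i in range(0, len(signal), 2)]
    return approx, detail


def subband_energies(signal, levels):
    """Energy of each detail band (finest first), then of the final approximation."""
    energies = []
    current = list(signal)
    for _ in range(levels):
        current, detail = haar_dwt(current)
        energies.append(sum(d * d for d in detail))
    energies.append(sum(a * a for a in current))
    return energies
```

Because this transform is orthonormal, the sub-band energies add up to the energy of the input signal, so the features describe how the signal energy is distributed across frequency bands.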

Frequency characteristics
- Accent shape: affected by the rate of change of the fundamental frequency.
- Average pitch: describes how high or low the speaker speaks relative to normal speech.
- Contour slope: describes the tendency of the frequency change over time; it can be rising, falling or level.
- Final lowering: the amount by which the frequency falls at the end of an utterance.
- Pitch range: measures the spread between the maximum and minimum frequency of an utterance.
- Formant: frequency components of human speech.
- MFCC: representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.
- Spectral features: measure the slope of the spectrum considered.
Time-related features
- Speech rate: describes the rate of words or syllables uttered over a unit of time.
- Stress frequency: measures the rate of occurrence of pitch-accented utterances.
- Energy: instantaneous values of energy.
- Voice quality: jitter and shimmer of the glottal pulses of the whole segment.
Voice quality parameters and energy descriptors
- Breathiness: measures the aspiration noise in speech.
- Brilliance: describes the dominance of high or low frequencies in the speech.
- Loudness: measures the amplitude of the speech waveform; translates to the energy of an utterance.
- Pause discontinuity: describes the transitions between sound and silence.
- Pitch discontinuity: describes the transitions of the fundamental frequency.
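Some of the frequency descriptors listed above can be computed directly from a fundamental frequency (F0) contour, as in the following minimal sketch. It assumes one F0 value per frame with unvoiced frames marked as 0, and it estimates the contour slope with a least-squares line fit; both conventions, like the function name, are assumptions made for illustration.

```python
def pitch_descriptors(f0_hz, frame_period_s):
    """Average pitch, pitch range and contour slope of an F0 contour.

    ASSUMPTIONS: unvoiced frames are marked with 0 and skipped, and the slope
    is the least-squares slope of F0 against time, in Hz per second.
    """
    voiced = [(i * frame_period_s, f) for i, f in enumerate(f0_hz) if f > 0]
    if not voiced:
        raise ValueError("no voiced frames in contour")
    times = [t for t, _ in voiced]
    values = [f for _, f in voiced]
    mean_t = sum(times) / len(times)
    mean_f = sum(values) / len(values)
    numerator = sum((t - mean_t) * (f - mean_f) for t, f in voiced)
    denominator = sum((t - mean_t) ** 2 for t in times)
    return {
        "average_pitch": mean_f,
        "pitch_range": max(values) - min(values),
        # Positive slope: rising contour; negative: falling; near zero: level.
        "contour_slope": numerator / denominator if denominator else 0.0,
    }
```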

Table 1. List of emotional speech databases
Columns: S. No.; Corpus name; No. of subjects (total, male, female, age, and time and days taken); Nature (acted/natural/induced, purpose, language and mode); Types of emotions (anger, disgust, fear, joy, sad, etc.); Publicly available (yes/no) and URL

Table 2. Acoustic characteristics of emotions