OPTIMIZED RECURRENT NEURAL NETWORK BASED EMOTION RECOGNITION USING SPEECH FOR ASSAMESE LANGUAGE

This paper deals with the design and development of a dataset for emotion recognition in the Assamese language and its analysis using machine learning. The dataset comprises spoken Assamese sentences expressing 7 different emotions of the speakers: calm/neutral, anger, sad, fear, disgust, happy and surprise. The paper also focuses on the analysis of the dataset with deep learning architectures and classification algorithms such as the Recurrent Neural Network (RNN). The speech samples are experimented with using different combinations of features, and a variety of results are obtained for each of them. Two experiments are performed to improve the performance of RNN training. The first is representation learning, where the model is trained on the glottal flow signal to investigate the effect of speaker- and phonetic-invariant features on classification performance. The second is transfer-learning-based RNN training on valence and activation, adapted to a 7-category emotion classification. On the Assamese dataset, the experimented approach results in performance comparable to existing state-of-the-art systems for emotion recognition using speech.


INTRODUCTION
As deep neural networks have evolved in the past decade, significant progress has been witnessed in the domain of pattern recognition as well as in various problem-solving approaches, especially in the field of paralinguistics. In order to facilitate further research, a range of neural network architectures has been proposed, such as CNNs, autoencoder networks and Long Short-Term Memory (LSTM) models [1]. A variety of research works have demonstrated selected properties of these networks for speech [2] with minimum human interaction [3]. However, most of these models utilize generic input features such as spectral features, frequency, pitch, formants and energy-related features, and then apply classification to predict the various emotions; these have displayed robustness for a wide variety of speech domains [4][5]. Some works also report converting the raw, unprocessed data directly into the input signal, which leads to improved performance. Existing approaches primarily depend upon standard acoustic features such as pitch, MFCC (Mel-Frequency Cepstral Coefficients) and energy; in addition, the temporal characteristics of the samples are extracted using statistical functions and used as segment descriptors for speech segments or for the detection of utterances of emotional speech [6]. This stage is followed by classification of the samples with various classifiers such as Support Vector Machines (SVM), Hidden Markov Models (HMM), Artificial Neural Networks and Deep Neural Networks. So far, limited effort has been devoted to achieving an effective representation learning for emotion recognition from speech, where training a neural network requires a temporal waveform or a spectrogram.
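The segment-descriptor idea above (frame-level acoustic features summarized by statistical functions) can be sketched as follows. This is a minimal illustration with hypothetical frame sizes and a plain log-energy feature, not the paper's actual feature set:

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    # Split a 1-D waveform into overlapping frames
    # (400 samples / 160-sample hop ~= 25 ms / 10 ms at 16 kHz).
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def segment_descriptor(x):
    # Frame-level log-energy, summarized by statistical functionals
    # (mean, std, min, max) into a fixed-length utterance descriptor.
    frames = frame_signal(x)
    energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)
    return np.array([energy.mean(), energy.std(), energy.min(), energy.max()])

# Example: a 1-second synthetic tone standing in for an utterance.
sig = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
desc = segment_descriptor(sig)
print(desc.shape)  # (4,)
```

In a full system, many such frame-level features (pitch, MFCCs, formants) would be summarized the same way and fed to a classifier such as an SVM or neural network.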
The features representing the signal are extracted by sampling the waveform or from its spectral representation [7]. Unseen data is generalised well without any additional requirement for feature extraction. Moreover, the entire process of data collection and annotation is expensive, which has led researchers to experiment with limited labelled data and a small number of volunteers. Various training paradigms such as transfer learning [8] and semi-supervised learning [9] have been experimented with to improve classification accuracy; even then, extensive research is needed, as the expression of emotion also depends on the language of expression, and not all classifiers behave in the same manner for different datasets even when the methodology remains the same. In this paper, a novel dataset based on the Assamese language is used. Assamese, also known as Asamiya, is the official language of the northeastern Indian state of Assam and is derived from the Indo-Aryan language family. Speech spectrogram features as well as spectrograms related to the glottal volume velocity are experimented with. It is also examined whether classification performance can be improved by removing unnecessary variation factors, such as speaker identity and phonemes, from the speech signals. For this, a Bidirectional Long Short-Term Memory (BLSTM)-RNN model is derived that deals with emotions at the utterance level and makes use of features extracted by training stacked denoising autoencoders on the spectrograms.
A transfer learning scenario is also studied, in which additional utterances that are not labelled with the emotions under investigation are leveraged by training an RNN on the activation and valence labels of all the utterances present in the dataset. The trained network is then adapted to the classification of the mentioned emotions. There is also an attempt to learn affect-salient features for Speech Emotion Recognition (SER) by making use of Convolutional Neural Networks (CNN). This involves two steps. The first step is to identify local invariant features using the unlabelled samples along with a sparse autoencoder based on reconstruction penalization. The next step uses these local invariant features for feature extraction, in order to learn affect-salient, discriminative features with an objective function that emphasizes saliency, discrimination and orthogonality. The experimental results on the new dataset show that the approach yields robust recognition performance even in complicated scenarios and proves to be better than several other approaches from this domain.
The experiments in this paper depend upon initial work in the domain of Deep Neural Network (DNN) and representation learning which typically deals with affect and emotion recognition.
Jaitly et al. [10] used datasets like Arctic and TIMIT and transformed autoencoders for learning acoustic events from them. Graves et al. [11][13] used an RNN on the IEMOCAP dataset and modelled each frame as a random variable sequence, where the improvement of the weighted accuracy compared to a DNN is around 12%. In this work it is investigated whether speaker- and phonetic-invariant representations are distinctive of emotion and whether transfer learning can improve classification performance.

MODEL

INITIAL TRAINING WITH DENOISING AUTOENCODERS
A neural network which is trained to learn a distributed, lower-dimensional representation of the input samples is known as an autoencoder [16]. A feed-forward neural network with one hidden layer is used, whose activations are

h = f(Wx + b)                                  (1)

The input dataset comprises N samples, represented by {x_i}, i = 1, ..., N. The reconstruction generated from the hidden layer is

x̂ = W'h + b'                                   (2)

The training of the autoencoder is done via backpropagation.
The training uses the sum-of-squares loss, L = Σ_i ||x̂_i − x_i||². Vincent et al. worked with denoising autoencoders, where a fraction of the input elements is set to 0, producing a corrupted input x̃; the autoencoder is then trained to reconstruct the original x from x̃. Here a distinct, pyramidal stacking approach is used for the autoencoders, in which the number of neurons is halved (or significantly reduced) from one layer to the next. The size of the output layer of each autoencoder remains the same so that the feature sets follow a similar methodology.
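The denoising and pyramidal stacking described above can be sketched as below. This is a simplified, numpy-only illustration (hypothetical layer sizes, learning rate and masking fraction; the paper's actual architecture is not specified at this level of detail):

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(X, frac=0.25):
    # Masking noise: set a random fraction of the elements to 0.
    mask = rng.random(X.shape) >= frac
    return X * mask

def train_dae(X, n_hidden, epochs=200, lr=0.1):
    # One denoising autoencoder layer: reconstruct the clean X from
    # corrupted input under the sum-of-squares loss, by gradient descent.
    n, d = X.shape
    W = rng.normal(0, 0.1, (d, n_hidden)); b = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.1, (n_hidden, d)); b2 = np.zeros(d)
    for _ in range(epochs):
        Xc = corrupt(X)
        H = np.tanh(Xc @ W + b)           # encoder, eq. (1)
        Xhat = H @ W2 + b2                # decoder, eq. (2)
        err = Xhat - X                    # gradient of squared loss
        gW2 = H.T @ err / n
        gH = err @ W2.T * (1 - H ** 2)    # backprop through tanh
        gW = Xc.T @ gH / n
        W2 -= lr * gW2; b2 -= lr * err.mean(0)
        W -= lr * gW;  b -= lr * gH.mean(0)
    return W, b

def stack_pyramid(X, sizes):
    # Pyramidal stacking: each layer roughly halves the width;
    # the output of one trained encoder feeds the next layer.
    feats = X
    for h in sizes:
        W, b = train_dae(feats, h)
        feats = np.tanh(feats @ W + b)
    return feats

X = rng.normal(size=(64, 32))     # 64 toy feature vectors of dimension 32
Z = stack_pyramid(X, [16, 8])     # 32 -> 16 -> 8, halving at each layer
print(Z.shape)  # (64, 8)
```

The resulting low-dimensional features would then be passed to the BLSTM-RNN classifier described in the next section.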

BLSTM-RNN CLASSIFICATION
RNNs are generally used for modelling temporal data and their correlations. However, they suffer from the vanishing gradient problem, which worsens with the length of the training samples. To address this problem, Hochreiter et al. [17] proposed the Long Short-Term Memory (LSTM) architecture. In every epoch, random noise is added to the input sample sequences and to the model weights, controlled by distinct noise hyperparameters.
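The per-epoch noise injection mentioned above can be sketched as follows. The noise levels (`sigma_in`, `sigma_w`) and the weight dictionary are hypothetical stand-ins for the model's actual hyperparameters and parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_epoch(X_seqs, weights, sigma_in=0.05, sigma_w=0.01):
    # Regularization applied once per epoch: Gaussian noise is added
    # both to the input sequences and to the model weights, each
    # controlled by its own noise hyperparameter.
    X_noisy = [x + rng.normal(0, sigma_in, x.shape) for x in X_seqs]
    w_noisy = {k: w + rng.normal(0, sigma_w, w.shape)
               for k, w in weights.items()}
    return X_noisy, w_noisy

# 4 toy utterances of 50 frames with 13-dimensional features each.
seqs = [rng.normal(size=(50, 13)) for _ in range(4)]
weights = {"W_f": rng.normal(size=(13, 32))}   # hypothetical weight matrix
Xn, wn = noisy_epoch(seqs, weights)
print(Xn[0].shape, wn["W_f"].shape)  # (50, 13) (13, 32)
```

Because fresh noise is drawn every epoch, the network sees slightly different inputs and parameters on each pass, which discourages overfitting on long sequences.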

DATASET AND FEATURE EXTRACTION
For the experiments in this paper, a novel dataset created by Nupur et al. [18], which builds on our prior work and is based on the Assamese language, is used; the dataset has been validated and verified.

REPRESENTATION OF THE SPECTROGRAMS AND BASELINE FEATURE EXTRACTION
The spectrograms are generated from the speech signals.
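A magnitude spectrogram of the kind used as the low-level input representation can be computed with a short-time Fourier transform. The window length, hop size and Hann window below are illustrative assumptions, not the paper's exact analysis settings:

```python
import numpy as np

def spectrogram(x, n_fft=512, hop=160):
    # Magnitude spectrogram via a Hann-windowed short-time FFT:
    # frame the waveform, window each frame, take the real FFT.
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * win
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

# 1 second of a 440 Hz tone at 16 kHz as a stand-in for speech.
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
S = spectrogram(sig)
print(S.shape)  # (97, 257): 97 frames, 257 frequency bins
```

Each row of `S` is one spectral frame; a sequence of such frames is what the stacked autoencoders compress into the features fed to the BLSTM-RNN.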

METHODOLOGY
All the samples in the dataset are split into 5 sessions, where every session comprises certain scripted sentences pertaining to the emotional categories, distributed among 3 different speakers.
The experiments in this paper follow a strategy similar to leave-one-out [14]. Figure 1 shows a snapshot of the entire approach.
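The leave-one-session-out evaluation implied by the setup above can be sketched as below; the toy session labels are illustrative only:

```python
import numpy as np

def leave_one_session_out(session_ids):
    # Yield (session, train_idx, test_idx) folds: in each fold all
    # samples of one session are held out and the rest are trained on.
    session_ids = np.asarray(session_ids)
    for s in np.unique(session_ids):
        test = np.where(session_ids == s)[0]
        train = np.where(session_ids != s)[0]
        yield s, train, test

# 10 toy utterances spread over the 5 sessions.
ids = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
folds = list(leave_one_session_out(ids))
print(len(folds))  # 5 folds, one per session
```

Averaging the test performance over all five folds gives the figures reported in the results section.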
The work that has been done is compared with the following standards: i. Deep Neural Network-Extreme Learning Machine method mentioned in [13] where experiments are done on training a Deep Neural Network with an Extreme Learning Machine.
ii. Recurrent Neural Network-Extreme Learning Machine [14] where experiments are done on training a Recurrent Neural Network with an Extreme Learning Machine.

RESULTS AND DISCUSSIONS
Classification experiments are conducted, and their performance is compared based on the representations derived from the speech spectrograms and from the spectrograms based on glottal flow. Table 1 presents the leave-one-session-out results for the various emotions and features. According to the results, the Happy and Angry categories have near-similar accuracy and are often confused, whereas the Sad category has the most accurate performance.
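The per-class confusions summarized above are read off a confusion matrix, which can be computed as follows (the toy labels are illustrative; class indices would map to the 7 emotion categories):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=7):
    # Rows: labelled emotion, columns: predicted emotion.
    # Off-diagonal entries count confusions between categories.
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy example with 3 classes and 5 utterances.
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
print(cm)
```

Normalizing each row by its sum turns the counts into per-class recall, the quantity compared across the Happy, Angry and Sad categories in the discussion.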

The authors in [14] reported the highest unweighted and weighted accuracies, obtained with Deep Neural Network-Extreme Learning Machines, as 52.13% and 57.91%. In that work the accuracies without the ELM are 56% and 57.91% respectively, which are nearly comparable with the accuracy reported in this paper. In [14] it was not clearly mentioned which speaker was utilized for testing or validation. In the experiments done in this paper, 1 speaker is allotted for validation while the other 2 are allotted for testing; the experiments are repeated by switching the speaker roles, and the average performance over all the test sets is considered for evaluation. The approach in [14] also performs experiments only on improvised utterances, whereas the current work focuses on both improvised and scripted utterances. The authors of [15] reported an acoustic approach using 10-fold leave-one-speaker-out validation accuracies on the scripted as well as improvised utterances; however, they do not specifically evaluate on a testing set. Their reported accuracy ranges from 49% weighted accuracy to 55.4% for feature fusion, and is comparable to the accuracy obtained in the validation phase here, i.e. 57%. These results indicate that classification can be done even from lower-level spectrograms. The results generated from the glottal flow signals perform better than those from the speech signals by 4.2% in the weighted category, which suggests that it is advantageous to eliminate speaker identity and phonetic information before classification. For more detailed evaluation, the confusion matrices are presented in Figure 2 and Figure 3; these represent the confusion of the labelled emotions with other emotions during prediction.
Since Happy and Angry have similar acoustic characteristics, they account for a major share of the confusion affecting accuracy. It has been observed that the glottal flow representations reduce the confusion between the Happy and Angry classes to a great extent; the improvement is 2.3% for the Happy category and 14.44% for the Angry category, consistent with the results reported in [25]. Moreover, the two classes share a similar activation level when their locations are considered in the valence-activation space. It is believed that the glottal flow representations are less affected by the Angry/Happy confusion since they capture differences along the valence dimension in a better way. Based on these findings, it can be concluded that speech-based representation learning can be improved by filtering out factors such as speaker identity and phonetics before the classification process. To address the question of whether data insufficiency can be mitigated through transfer learning using affective attributes such as valence and activation, the mentioned BLSTM-RNN is pretrained as a regression model for these attributes on the entire training set and then fine-tuned for the 7-category speech emotion recognition.
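The pretrain-then-fine-tune scheme described above can be sketched as follows. This is a deliberately simplified stand-in: random matrices replace the pretrained BLSTM body, and the fine-tuning loop itself is omitted; only the head-swapping structure is shown:

```python
import numpy as np

rng = np.random.default_rng(2)

D, H, N = 20, 16, 100                    # hypothetical feature/hidden sizes
X = rng.normal(size=(N, D))              # toy utterance-level features

# Stage 1: network body plus a 2-unit regression head for
# valence and activation (here randomly initialized to stand in
# for the pretrained BLSTM body and head).
W_body = rng.normal(0, 0.1, (D, H))
W_reg = rng.normal(0, 0.1, (H, 2))       # regression head, discarded later

# Stage 2: keep the body weights, replace the head with a 7-way
# softmax classifier for the emotion categories, then fine-tune
# (the fine-tuning loop is omitted in this sketch).
W_cls = rng.normal(0, 0.1, (H, 7))
hidden = np.tanh(X @ W_body)             # shared, pretrained representation
logits = hidden @ W_cls
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(probs.shape)  # (100, 7)
```

The key design point is that only the small classification head starts from scratch; the body's weights carry over whatever structure the valence/activation regression has already learned.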

It has been observed that retaining the BLSTM weights of the pre-trained network can improve performance on the adaptation task. In order to fine-tune the network, the hyperparameters are validated over a grid. As the best performance among

CONCLUSIONS
In this paper, extensive research has been done on representation learning using the spectrograms of speech and glottal flow signals. The experiments demonstrate that the features extracted via representation learning are discriminative for emotion classification and compare well with state-of-the-art approaches. It has also been found that inverse filtering, which filters out speaker and phonetic information, reduces the confusion between overlapping categories such as the Happy and Angry emotions. A minimal improvement has also been noticed in the performance related to transfer learning from valence and activation to the emotion categories. Overall, the findings are encouraging with regard to improving the performance of multi-classifier systems, owing to the diversity of the resulting errors. In future, more elaborate and extensive transfer learning approaches will be explored.