Optimization Methods in Emotion Recognition System



Introduction
The amount of data stored in electronic form is increasing exponentially, and it is estimated that around 80 % of this data is in text form. As a result, we are often overloaded with unstructured textual information that is difficult to manage, search and extract information from. If the data contains important information, one possible approach is to employ people to read the documents and find the information we are currently interested in. However, once the amount of data exceeds a manageable level, hiring new employees becomes expensive and the work can no longer be done by humans alone. Automating this process on computers can help significantly, since a computer can process one page of text in milliseconds. Here machine learning (especially text mining) can often be very effective and can do the processing significantly faster than a human. One important kind of information contained in texts is the writer's emotions. Unfortunately, computers have difficulty understanding some high-level information, emotions among them, especially in the case of sarcasm and irony. In practice, emotion recognition from text is used by helpdesks to monitor success rate and to identify potentially dissatisfied customers in order to prevent their churn. The motivation behind this work is to use emotions to assign the optimal priority to messages in a helpdesk. This paper deals with emotion detection and recognition from texts. The proposed method consists of 1) artificial intelligence combined with 2) optimization of token selection and 3) parameter optimization of learning model techniques. The database consists of 1673 text messages for training and another 1673 messages for testing. They originate from a real helpdesk environment, where every message was labeled independently by several people in order to increase the objectivity of the emotional label. In order to make the recognition computationally effective, many steps have to be done before the training can actually start. This paper describes the whole process of automatic development of the emotion detection system based on the training text database, machine learning, and parameter and stopword optimization techniques. Experiments have shown that training several separate models, one for each emotional class (five submodels), increased accuracy by 17 %. This paper also builds on the previous work published in a journal [1].
The main contribution of this paper is an extended method for detection and recognition of emotion from text data. We outperformed current state-of-the-art methods while using a bigger data set, which is less prone to overfitting (the training and testing data sets consist in total of 3346 labeled text messages). The method is based on a combination of well-known data mining techniques, optimization of token selection and parameter optimization based on token groups. Multiple SVM (Support Vector Machine) classifiers are used in the recognition system. A properly set optimization method increased the accuracy of emotion detection by 11.40 % compared to the standard data mining approach.
The paper is structured as follows: Sec. 2 reviews and compares related work; Sec. 3 describes the training and testing data, their sizes and the labeling process; Sec. 4 describes the experiment and its evaluation and gives a detailed description of the proposed system for emotion recognition from text; Sec. 5 describes the optimization processes, the lifecycle of learning by artificial intelligence and the implementation of the proposed method; Sec. 6 discusses the results and the impact of the proposed method on the final accuracy of the trained model. The paper is concluded in Sec. 7.

Related Work
Automatic emotion recognition from texts is a challenging task and is currently a subject of research at many workplaces all over the world. The meaning of a text changes with various factors (e.g. irony, sarcasm, politeness, writing style), and from the algorithmic point of view it is also language dependent: different languages use not only different words but also different ways of expression. Even in similar languages, there are words with a different meaning or emotional charge. There are several approaches to this problem.
A relatively simple approach is based on keywords; the approach and its results are described in [2] and [3]. Experiments with such systems are also described in [4]. The achieved precision is 81.0 % for 4 emotional classes. This approach is significantly limited by the set of defined keywords and often fails on more complex sentences and word constructions (especially with irony or sarcasm). It also requires a manually defined database of keywords in the given language for every emotion.
The majority of state-of-the-art methods for text emotion detection use machine learning, as for instance in [5] and [6]. The system for emotion recognition from newspaper headlines described in [1] reached an accuracy of 80.28 % for 6 emotional classes. This system is based on artificial intelligence and uses the SVM (Support Vector Machine) method with a linear kernel. Every sample of its training data set is a short text, a couple of words or one sentence (newspaper titles were used). This is unfortunately insufficient for helpdesk environments, where customers usually write from a few lines to a few paragraphs of text. Similar research described in [7] reached 71.6 % accuracy, and another in [5] reached an accuracy of 85.5 %. The texts came from restaurant reviews and 4 emotions were recognized. The size of the testing set was also quite limited. The paper [8] reached 86 % accuracy with a system based on k-NN (k-nearest neighbors) on an English public database with longer texts. Unlike that work, this paper does not focus on the analysis of articles but on helpdesk messages and feedback. It also uses a training database in the Czech language (contains 800 000 words, English contains approx. 40 000 words).
Unfortunately, emotion analysis is language dependent, and efforts to train a universal (language-independent) emotion recognizer have so far been less successful and less accurate. Recently, classification of two different languages (English and Chinese) reached an accuracy of 89.98 % for classification into 2 emotional classes (positive and negative). It was tested on a smaller unbalanced data set (494 positive and 606 negative samples), which is relatively easy to overfit. The bilingual system mentioned in [9] can work with only one trained model, so it does not have to decide which output is the right one when another model returns a different result. Using language-independent methods can also be a limitation.
Since many databases consist of hundreds of thousands of samples, the training phase is quite demanding in time and memory. Because of that, acceleration of machine learning (the SVM classifier in this case) has also been the aim of much research, for example in [10]. SVM accelerated by a GPU (graphics processing unit) achieved 100 times faster processing compared to standard SVM implementations. GPU acceleration is a very promising idea that has achieved significant results in the past few years. As mentioned in [11], GPU-based training can speed up the calculations of the k-NN algorithm up to 750× in comparison with a single-CPU version. With this acceleration, much bigger data can be processed, which allows much more complex experiments. It can be a relatively cheap alternative to the Big data approach [12], which usually relies on a cluster of computers.

Training and Testing Data
The text data used for training and testing was obtained from a real helpdesk environment, comments on articles in online newspapers, user reviews of products and services, and other comments. We were especially interested in recognizing the emotion the writer had while creating the text. All the collected text messages were manually labeled with emotional classes. The approach was partially inspired by the emotion acoustic model [13], see Fig. 1. In order to reduce the impact of subjectivity, every text message was labeled on average by three independent persons and a consensus vote was used as the final label. Not every text message contained an emotion; such messages were labeled as neutral. According to our experience from the labeling process, the accuracy of participants degraded over time, caused e.g. by exhaustion. Therefore every text document was labeled by several randomly selected persons (footnote 1).
Footnote 1: The total number of labels per sample is from 1 to 4. Some text messages received as many as 10 labels before the final consensus was defined.
Agreement was problematic, for example, on the text: "What do you actually support? We need a website where can be 'registered users' which can fill in a form, upload photos and send it to us. We need to communicate between themselves. It is possible? Unless is something impossible, the site is useless. Thank you for your response." (the original text was in the Czech language). This specific sample was annotated multiple times by three different (randomly selected) persons: three times as "neutral", once as "sad" and four times as "angry", so this text was labeled with the emotion "angry".
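The consensus vote on the sample above can be illustrated with a minimal sketch. It assumes the consensus is a simple plurality of the collected labels and that a tie means no consensus yet (so more labels would be collected); the function name and the tie rule are illustrative assumptions, not taken from the labeling tool used by the authors.

```python
from collections import Counter

def consensus_label(labels):
    """Return the most frequent label, or None when there is no clear winner.

    Assumption: a tie (or an empty list) means that no consensus exists yet
    and further labels would have to be collected.
    """
    ranked = Counter(labels).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the two most frequent labels
    return ranked[0][0]

# Votes from the example: 3x "neutral", 1x "sad", 4x "angry".
votes = ["neutral"] * 3 + ["sad"] + ["angry"] * 4
print(consensus_label(votes))
```

With the eight votes from the example, the plurality label is "angry", which matches the final label reported in the text.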

Compared for example with [1], where only newspaper headlines were used, many of the messages are significantly longer. Empirical experience showed that it is more challenging to extract emotion from longer texts than from short messages such as newspaper headlines or Twitter messages, where the emotion is usually concentrated in one place. Unfortunately, many words can confuse the model and result in wrong emotion recognition. In order to reduce their impact, these words were eliminated with a method based on backward attribute elimination (used in [14]). When done manually, this is a very time-demanding process, especially when a separate model is trained for each emotional class.
Before the data is used for training, every text message is preprocessed, mainly to reduce the dimensionality of the training data (the number of words), which speeds up learning. Czech Wordnet (with a synonym dictionary) has been used for this purpose. In the first step the text is tokenized (split into words). The words are checked by a spellchecker and corrected if possible. After that each word is lemmatized (i.e. the lemma of the given word is determined). In the next step insignificant words such as prepositions and auxiliary verbs are removed from the text. In this step the set of words that cause confusion (so-called stop-words) is also removed. After this step only a list of significant words remains, mostly words with an emotional charge. Finally, selected special words are detected and replaced with a group identifier word. More information about this method is provided in Sec. 5.2.
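The order of the preprocessing steps can be sketched as follows. This is only an illustration under strong assumptions: the tiny English dictionaries below are hypothetical stand-ins for the Czech spellchecker, lemmatizer, Wordnet and stopwords database used in the paper, and synonym replacement is omitted.

```python
import re

# Hypothetical toy resources (assumptions for illustration only).
SPELLING = {"terible": "terrible"}                   # spellchecker corrections
LEMMAS = {"products": "product", "is": "be", "crying": "cry"}
INSIGNIFICANT = {"the", "of", "be", "a", "i", "am"}  # prepositions, auxiliaries, ...
STOPWORDS = {"product"}                              # words found to confuse the model
TOKEN_GROUPS = {"cry": "xsadx", "sorrow": "xsadx"}   # token -> group identifier

def preprocess(text):
    """Return the list of significant tokens for one text message."""
    tokens = re.findall(r"\w+", text.lower())                # 1) tokenize
    tokens = [SPELLING.get(t, t) for t in tokens]            # 2) spell-correct
    tokens = [LEMMAS.get(t, t) for t in tokens]              # 3) lemmatize
    tokens = [t for t in tokens if t not in INSIGNIFICANT]   # 4) drop insignificant words
    tokens = [t for t in tokens if t not in STOPWORDS]       #    and confusing stop-words
    return [TOKEN_GROUPS.get(t, t) for t in tokens]          # 5) apply group identifiers

print(preprocess("The products is terible, I am crying"))
```

Keeping each step as a separate pass makes it easy to re-run the export when one resource changes, for example when a new word is added to the stopwords database.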
The samples were manually labelled with coordinates from the mentioned acoustic model. These emotions are based on two parameters plotted on the x and y axes, the rate of positivity and the rate of activeness, as shown in Fig. 1. Emotional classes that can be clearly defined for the helpdesk environment were aggregated. The negative class was divided into "angry", which is an active emotion often accompanied by vulgarisms, and "sad", which is on the contrary a passive emotion. The positive class was kept without change and named "satisfied". In addition, for practical reasons in the helpdesk environment, two new classes, "afraid" and "surprised", were added. The emotion "afraid" can be noticed when someone (in this case a customer) is afraid of some action, e.g. a bug or unexpected behavior (footnote 2), system failure etc. The same holds for the emotion "surprised". In the acoustic model this emotion can carry a positive charge; in texts, however, it is often negative, e.g. some kind of complaint.
For training and testing purposes a database of 3346 samples was used. In particular, 50 % of the database (1673 samples) was used for training and the remaining 50 % (1673 samples) for testing. There are five emotional classes (anger, fear, sadness, satisfaction and surprise) and a neutral class.
The training process was also supported by a set of selected sentences and collocations that are significant for the given emotion class. The method of using this set of sentences is described in Sec. 5.3.

Experiment Description
Accuracy of the proposed emotion recognition model was measured on the test part of the database (1673 samples).
First, we experimented with classification into two classes (positive and negative) to compare different classifiers. The accuracy was relatively high for the SVM classifier (87.00 %, see Tab. 1) on a database of 7000 samples (5000 samples for training, 2000 for testing) (footnote 3). The database was created from user ratings and comments; the label was selected automatically from the rating (the number of stars). Unfortunately, it turned out that recognizing only text valence (positivity and negativity) is not sufficient for the helpdesk environment.
The model proposed as a classification into 5 emotional classes (+ neutral class) did not achieve good accuracy on the test data set: classification accuracy was only about 38 %. With a decreased number of classes, a total accuracy of about 51 % on three different emotions was achieved. In the experiment with the data set divided automatically into two classes (positive and negative, based on product rating), an accuracy of 87.00 % was achieved, see the results in Tab. 1. These experiments have shown that training
Footnote 2: For example, an action the customer wants to perform may cause an error in a purchased product. This issue can have a negative effect in the future.
Footnote 3: The classifier comparison was performed on a different database of positive and negative texts, parsed from a website with product reviews.
An idea used in the proposal was to create one model for every mentioned emotion class (called submodels, already used in our previous work described in [15]). For this purpose it was necessary to create five subsets, where for each subset the positive part was the related emotional class and the negative part contained the messages from the other classes and the negative text. In order to use a balanced number of positive and negative samples for each emotional class, the negative part was subsampled to the size of the positive part, see the histogram in Fig. 2. The proposed submodel can then decide whether the input text contains the given emotion or not and state its confidence in the prediction. A model for the neutral emotion is not needed: the text is recognized as neutral when none of the defined submodels indicates satisfactory confidence. The structure of the system based on this solution is shown in Fig. 3. Longer texts from real communication can often contain more than one emotion. In this case the dominant emotion is sought, but that is beyond the scope of this paper.
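The construction of balanced subsets and the fallback to the neutral class can be sketched as below. The sketch is independent of the classifier and makes two assumptions that the paper does not state: negatives are drawn uniformly at random, and the "satisfactory confidence" is a single fixed threshold (0.5 here, chosen arbitrarily). The function and parameter names are hypothetical.

```python
import random

EMOTIONS = ["angry", "afraid", "sad", "satisfied", "surprised"]

def balanced_subset(samples, emotion, seed=0):
    """Split (text, label) pairs into positives for `emotion` and an
    equally sized random subsample of all remaining messages."""
    rng = random.Random(seed)  # fixed seed only to keep the sketch reproducible
    positives = [s for s in samples if s[1] == emotion]
    negatives = [s for s in samples if s[1] != emotion]
    negatives = rng.sample(negatives, min(len(positives), len(negatives)))
    return positives, negatives

def decide(confidences, threshold=0.5):
    """Pick the emotion whose submodel is most confident; fall back to
    "neutral" when no submodel reaches the (assumed) threshold."""
    if not confidences:
        return "neutral"
    emotion, confidence = max(confidences.items(), key=lambda kv: kv[1])
    return emotion if confidence >= threshold else "neutral"

# Usage with made-up submodel outputs:
print(decide({"angry": 0.81, "sad": 0.35, "afraid": 0.10}))
print(decide({"angry": 0.30, "sad": 0.22}))
```

Because the neutral class is only the absence of a confident submodel, the value of the threshold directly trades off missed emotions against false alarms and would have to be tuned on real data.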
In this configuration, the whole model for emotion recognition consists of 5 submodels. Every submodel must be trained with its own data set, and the number of input attributes differs for each submodel. Each submodel is then tested with a test set independent of the training set. This process yields the base accuracy of the model. These methods do not use any optimization introduced by this paper; they represent the well-known data mining approach.

Optimization Methods
With every change of the training database (e.g. updating the local database from a database shared with co-workers) or change in preprocessing (e.g. a definition of a new insignificant word, a correction in the lemmatization or spellcheck algorithm, etc.), the exported data set for training and testing has to be rebuilt. The training process is described by the lifecycle in Fig. 4. The first step, labelled export, is where the preprocessing methods are applied. Every such change also changes the behaviour of the system, and its impact has to be measured. This means running the simulation for every submodel and computing the final accuracy.
The accuracy of the proposed model was increased by: a) sequential elimination of attributes on the input of the model (according to the accuracies of the submodels), b) compression of tokens into token groups, and c) extending the training data sets during practical testing and final tuning. These methods modified the model training lifecycle shown in Fig. 4.
As mentioned before, the final model consists of several submodels, one for every emotion class. This setup can lead to problems with dependencies between submodels. There is only one wordnet for the whole system, so eliminating the right attribute for one submodel also changes the behaviour of the other submodels. On the other hand, this setup also has benefits, e.g. every emotion can be processed in parallel and the whole system can run faster.

Sequential Elimination of Attributes
Not every input attribute of the artificial intelligence is important, so not every word from the input text has to be used in the emotion recognition process. On the contrary, some words can confuse the trained model and decrease the final accuracy. These words can be eliminated already during preprocessing.
Figure 4 describes the lifecycle of training. All blocks run in parallel for every emotion class, except the elimination part, which runs in a single thread and is executed sequentially for every submodel. As mentioned in the previous section, the export block prepares the train and test data sets using preprocessing. The elimination block is executed for only one submodel at a time and searches for exactly one attribute. The performance measurement block uses the test data set to measure the accuracy of every submodel; this simulation again runs in parallel. So the one and only part that is executed in a single thread is the elimination part, because of the dependencies between submodels. These dependencies cannot be removed by using an individual stopwords database for every submodel, as might be expected. In the final classifier the input text is processed only once and then enters the model for every emotion. To remove the mentioned dependencies, the input text would have to be processed 5 times, which would significantly decrease classification speed. For our purposes, the final model must have a large throughput to process a big amount of data.
Elimination of attributes is based on the well-known backward feature elimination. This method also decreases redundancy in the final model, since the number of attributes on the input is reduced to the necessary minimum. In every step of elimination the one attribute (token) that confuses the current submodel most is selected. Eliminating this token can increase the accuracy of the submodel but also decrease the accuracy of another submodel. For example, eliminating the token "cena" ("price") would improve the results of the submodel for the emotion satisfied, but it can decrease the accuracy of the submodel for the emotion afraid. For this reason, every eliminated token is checked against the final model and the accuracy of the whole system is computed. If this accuracy increases by at least the minimal threshold, the eliminated token is included in the stopwords database. Otherwise, the token is removed from the stopwords database and the data sets are processed again.
The threshold for the minimal difference of accuracies (before and after elimination) is sequentially decreased, so the words eliminated in the first cycle must have the biggest impact on accuracy. With every eliminated token, the behavior of the other submodels also changes.
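The elimination loop described in this section can be sketched as follows. This is a simplified reading of the text, not the authors' implementation: the two accuracy callbacks stand in for retraining and testing the submodels, the threshold schedule is an arbitrary assumption, and the toy accuracy numbers (in percentage points) are invented purely to make the example runnable.

```python
def sequential_elimination(tokens, submodels, submodel_accuracy, system_accuracy,
                           thresholds=(2.0, 1.0, 0.5)):
    """Greedy backward elimination with a shared stopwords set.

    For one submodel at a time, pick the token whose removal helps that
    submodel most, then keep it only if the accuracy of the whole system
    rises by at least the current (decreasing) threshold.
    """
    stopwords = set()
    best = system_accuracy(stopwords)
    for threshold in thresholds:              # threshold decreases cycle by cycle
        progress = True
        while progress:
            progress = False
            for name in submodels:            # sequential, one submodel at a time
                remaining = [t for t in tokens if t not in stopwords]
                if not remaining:
                    break
                token = max(remaining,
                            key=lambda t: submodel_accuracy(name, stopwords | {t}))
                trial = system_accuracy(stopwords | {token})
                if trial - best >= threshold:  # accepted into the stopwords database
                    stopwords.add(token)
                    best = trial
                    progress = True
    return stopwords, best

# Toy stand-in for training and testing (invented numbers, assumption only).
BASE = {"satisfied": 80.0, "afraid": 80.0}
EFFECT = {"cena": {"satisfied": 3.0, "afraid": -1.0},
          "nestat": {"satisfied": 1.0, "afraid": -4.0}}

def toy_submodel_accuracy(name, stopwords):
    return BASE[name] + sum(EFFECT[t].get(name, 0.0) for t in stopwords)

def toy_system_accuracy(stopwords):
    return sum(toy_submodel_accuracy(n, stopwords) for n in BASE) / len(BASE)

print(sequential_elimination(["cena", "nestat"], list(BASE),
                             toy_submodel_accuracy, toy_system_accuracy))
```

In the toy setting, removing "cena" helps one submodel more than it hurts the other, so it is accepted once the threshold has dropped far enough, while "nestat" lowers the system accuracy and is never accepted.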

Token Groups
After all preprocessing steps (spellcheck, lemmatization, removal of insignificant words and words without emotional charge, synonym replacement), token groups are applied. This method uses a defined dictionary for every group, so the dictionaries are language dependent and must be defined manually. A group is defined by a set of tokens that are replaced by the group identification token in the input texts. For example, every vulgar word would be replaced with "xvulgarx" (vulgar words are further divided into 3 groups based on the strength of vulgarity). Every group thus consists of manually defined words with the same meaning.
For every emotional class one token group is also defined, consisting of keywords for the related class. For example, the group "xsadx" contains keywords that represent sadness: sorrow, cry, deject, etc. This corresponds to the keyword-based approach mentioned in Sec. 2.
In this way the dimensionality of the space is reduced, and words with a similar meaning (or the same emotional charge) are replaced with one general identifier. The accuracy of the final model increased from 83.66 % to 85.76 % just by adding this method to preprocessing. Seven groups were defined: 4 groups for emotional classes (afraid, sad, satisfied and surprised) and 3 groups for vulgar words based on the strength of abuse. For example, a word can be an insult in one context and mean an actual animal in another. Such words have to be separated from the other vulgarisms. The whole list of defined token groups is in Tab. 2. This method also equalizes the weights of the words defined in a group. If one word of the group is used rarely in the training set and another word with the same meaning is used more often, these words end up with the same weight, as follows from their meaning.
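The effect of token groups on dimensionality and on word weights can be shown with a short sketch. The group "xsadx" and its keywords come from the text above; the tiny corpus and the use of raw token counts as a proxy for weights are assumptions made only for illustration.

```python
from collections import Counter

# Group dictionary: group identifier -> manually defined member tokens.
GROUPS = {"xsadx": {"sorrow", "cry", "deject"}}
TOKEN_TO_GROUP = {tok: gid for gid, toks in GROUPS.items() for tok in toks}

def apply_groups(tokens):
    """Replace every token that belongs to a group by the group identifier."""
    return [TOKEN_TO_GROUP.get(t, t) for t in tokens]

# Hypothetical preprocessed messages (already tokenized and lemmatized).
corpus = [["cry", "product"], ["cry"], ["sorrow"]]

before = Counter(t for doc in corpus for t in doc)
after = Counter(t for doc in corpus for t in apply_groups(doc))
print(before)  # "cry" and "sorrow" are separate attributes with different counts
print(after)   # both are merged into the single attribute "xsadx"
```

In this toy corpus the rare word "sorrow" and the more frequent word "cry" collapse into one attribute, so the vocabulary shrinks from three tokens to two and both words contribute to the same feature.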

Extending Train Data Sets
To further increase accuracy we extended the training database with several commonly used phrases. This was necessary in practice to fine-tune the results. Even if the achieved accuracy is high, the real impression from the answers can be different, so the output of the system can in some cases differ slightly from the expected result. The train and test sets cannot cover all possible combinations of different words with different emotional charge, so in the last step of the testing process the training set can be extended with manually defined sentences.
The sentences defined in this method are based on user feedback (a comparison of the emotion output by the whole model with the real emotion recognized by a human). The suggested sentences are checked by helpdesk staff and then included in the "final tuning database" with a labelled emotion. These sentences are cases whose emotion is clearly understandable to a human and which contain only one emotion. The database collected in this step is then used in the training sets. For example, for the emotion "sad" the sentences "I have high debts" and "The quality of your product is declining" were defined and used to extend the train set of the "sad" emotion submodel.

Results and Discussion
The tests were performed on the emotion recognition system to compare the accuracy and the impact of the proposed methods. As mentioned in Sec. 3, the second half of the database (1673 samples) was used for testing. The tests of the sequential attribute elimination method were performed on a system with text preprocessing defined for the Czech language (spellcheck, lemmatization, stop-word filter, synonym replacement).
The accuracy values were captured for each iteration step and for each emotion submodel. The final accuracy value is in the last column. In this case the improvement of the final model accuracy is about 3.06 %. The highest improvement of accuracy, about 5.36 %, was in the submodel for the emotion "afraid". These values are shown graphically in Fig. 5. In the steps where the accuracy of the whole model decreased, the eliminated token was not added to the stopwords database. For example, in the 33rd iteration in Fig. 5 the accuracy was lower than in the previous iteration. So the eliminated token "nestat" (which means "not become") was not inserted into the stopwords database, because this word is important in some cases and helps to determine the right emotional charge. In cases like this, the algorithm continues with elimination on another submodel. The elimination here ran for 39 iterations. After that the accuracy no longer improved, so the process was stopped.
The second set of measurements covers the method of token groups. The impact of this method on accuracy was tested on the same data and the same model structure as before. The defined groups and the numbers of their words are listed in Tab. 2. The results are compared in Tab. 3, where an accuracy improvement of 2.20 % can be seen. This method affects each submodel and contributes to equalizing the weights of words with the same meaning.
During practical testing of the proposed emotion recognition system, the database of sentences called the "final tuning database" kept changing. After every change, the sequential elimination was run and the improvement of accuracy was measured. Tab. 4 and Tab. 5 compare the accuracy for different numbers of defined sentences (different stages). Using the proposed methods, the accuracy was improved by about 11.40 % with 320 defined sentences. Each test captured in Tab. 4 and Tab. 5 represents one state of the emotion recognition system and of the defined set of sentences (the tables give the numbers of these sentences, which relate to the last proposed optimization method, extending train data sets). The tests were run consecutively: the stopwords database was preserved and used as the initial state of the following test.
The progress of testing and the accuracy in every step of the sequential elimination are shown in Fig. 6, which includes data from Tab. 5. For every test, the sequential elimination was run and the accuracy of the model was recorded at the start and at the end. The method of "extending train data sets during practical testing" cannot be measured separately. A change of the sentence database alone in many cases decreases accuracy (see the accuracy drops of the final model in Fig. 6). The sequential elimination compensates for this decrease and also increased the final model accuracy by about 1.20 % on average. In test 4 this increase was about 3.06 %, as already shown in Fig. 5. The last step (iteration no. 130) in Fig. 6 has the best achieved accuracy, 86.89 %. After this step the accuracy only stagnated or decreased (not shown in the graph), so the configuration from step 130 was used.

Conclusion
In this paper, a system for emotion detection and recognition from unstructured textual data was described. The system uses machine learning methods, text-mining methods and text preprocessing methods (spellcheck, lemmatization, stopwords and synonym replacement). The structure is based on separate submodels for each emotion: afraid, sad, satisfied, angry and surprised. The accuracy of the system was measured on a manually labeled Czech text database created from a real helpdesk service.
The final system accuracy was increased using 3 proposed optimization methods: 1) sequential elimination of attributes based on backward elimination and adapted for the multi-model structure, 2) token groups based on manually defined dictionaries and 3) extending train data sets during practical testing. With these optimization methods the accuracy was increased by 11.40 % and the achieved accuracy of the whole system is 86.89 % for recognition of 5 emotional classes.
We outperformed current state-of-the-art methods with the combination of the system structure, the proposed optimization methods and a bigger data set (3346 manually labelled samples), which is less prone to overfitting.

Fig. 2. Histogram of analysed emotion frequency in the train and test data set.

Fig. 4. Simplified lifecycle of the model in the learning process, export is running for every single submodel individually.

Fig. 6. Progress of accuracy during the tests and final tuning.
Impact of the group words method on the accuracy.