Emotional dialog generation via multiple classifiers based on a generative adversarial network

Background: Human-machine dialog generation is an essential topic of research in the field of natural language processing. Generating high-quality, diverse, fluent, and emotional conversation is a challenging task. Based on continuing advancements in artificial intelligence and deep learning, new methods have come to the forefront in recent times. In particular, the end-to-end neural network model provides an extensible conversation generation framework that has the potential to enable machines to understand semantics and automatically generate responses. However, neural network models come with their own set of questions and challenges. The basic conversational model framework tends to produce universal, meaningless, and relatively "safe" answers.

Methods: Based on generative adversarial networks (GANs), a new emotional dialog generation framework called EMC-GAN is proposed in this study to address the task of emotional dialog generation. The proposed model comprises a generative model and three discriminative models. The generator is based on the basic sequence-to-sequence (Seq2Seq) dialog generation model, and the aggregate discriminative model for the overall framework consists of a basic discriminative model, an emotion discriminative model, and a fluency discriminative model. The basic discriminative model distinguishes generated fake sentences from real sentences in the training corpus. The emotion discriminative model evaluates whether the emotion conveyed via the generated dialog agrees with a pre-specified emotion, and directs the generative model to generate dialogs that correspond to the category of the pre-specified emotion. Finally, the fluency discriminative model assigns a score to the fluency of the generated dialog and guides the generator to produce more fluent sentences.

Results: Based on the experimental results, this study confirms the superiority of the proposed model over similar existing models with respect to emotional accuracy, fluency, and consistency.

Conclusions: The proposed EMC-GAN model is capable of generating consistent, smooth, and fluent dialog that conveys pre-specified emotions, and exhibits better performance with respect to emotional accuracy, consistency, and fluency compared to its competitors.


Introduction
Technologies related to human-machine dialog are used in several types of products, including intelligent voice assistants and online customer services. With time, the requirements and expectations of maturity of human-machine dialog have drastically increased. Several related topics have been extensively researched, such as dialog systems with commonsense knowledge [1], dialog systems with audio context [2], latent-variable task-oriented dialog systems [3], and dialog systems combining texts and images [4]. For a detailed account of the relevant research, please refer to the survey conducted by Ma et al. [5]. Currently, dialog generation primarily includes three types of methods: rule-based systems [6], information retrieval systems [7], and generation-based systems. This study is based on the last approach. The Seq2Seq model has been extensively researched in the context of the machine translation problem, including implementations based on the Recurrent Neural Network (RNN) [8], Long Short-Term Memory (LSTM) [9], and the attention mechanism [10]. Vinyals et al. were the first to apply the Seq2Seq structure to the problem of dialog generation [11]. The basic Seq2Seq model suffers from a fundamental drawback when used to generate conversation: the performance of the model is usually evaluated at the level of sentences. Since the initial application of Seq2Seq to this topic, several researchers have attempted to solve the problem by using generative adversarial networks (GANs) [12], which have achieved great success in computer vision.
Yu et al. proposed SeqGAN, a framework better suited to sequence generation based on a GAN [13]. By modeling the data generator as a stochastic policy in reinforcement learning [14,15], SeqGAN bypasses the generator differentiation problem by directly performing the gradient policy update. Li et al. proposed the use of adversarial training based on reinforcement learning for open-domain dialog generation [16]. Cui et al. proposed the Dual Adversarial Learning (DAL) framework, which improves both the diversity and the overall quality of the generated responses [17].
People with high emotional intelligence quotients are capable of identifying and expressing their emotions, identifying the emotions of others, controlling their own emotions, and using feelings and emotions to spur adaptive behavior [18]. It is equally essential to endow machines with emotion in the context of human-machine dialog. Ghosh et al. proposed an LSTM-based model to generate text endowed with emotion [19]. Rashkin et al. introduced a new dataset with emotional annotations that was used to provide retrieval candidates or fine-tune the dialog model, leading to more empathetic responses [20]. The Emotional Chatting Machine, proposed by Zhou and Zhang, is capable of generating dialogs with appropriate content as well as emotions [21]. Wang et al. proposed the SentiGAN framework, which enables models to generate diverse, high-quality texts with specific sentiment labels via penalty mechanisms [22]. In our previous work, we presented an LSTM-based model in which the training corpus was altered to incorporate emotion factors into dialogs: the input comprised the original sentence together with a sentence carrying an emotion label, and the sentence with the emotion label was used as the output [23].
In this study, we introduce a new emotional dialog generation model (EMC-GAN) based on a generative adversarial network. As it is difficult to express emotional features in dialog via the basic dialog generation model, we solve the problem by decomposing the emotional dialog generation task. Several different models are trained to generate dialogs endowed with different emotions, and each model focuses on creating one kind of emotional dialog. This modular structure excludes the interference and influence of other emotions during the generation of dialog with a specific emotion, thereby improving the accuracy of dialog generation endowed with pre-specified emotions. The proposed framework comprises a generative model and multiple discriminative models. The generative model is constructed based on the basic Seq2Seq dialog generation model [24], and the aggregate discriminative model of the framework comprises a basic discriminative model, an emotion discriminative model, and a fluency discriminative model. Together, they help to distinguish the generated text from the original text and to guide the generator to produce more fluent dialogs that convey specific emotions more accurately. The EMC-GAN model is capable of producing coherent, smooth, and fluent dialog expressing specific emotions, and performs better than existing systems with respect to emotional accuracy, coherence, and fluency (Figure 1).

Methods
The proposed emotional dialog generation framework, EMC-GAN, comprises one generative and three discriminative models. The generative model G_e(Y|X; θ_g^e) is a dialog generation model based on the basic Seq2Seq architecture. It generates coherent and fluent target sentences corresponding to a specified emotion category e, based on input source sentences. The aggregate discriminative model of EMC-GAN comprises a basic discriminative model D_e(X,Y; θ_d^e), an emotion discriminative model D_e^emotion(X,Y; θ_d^e), and a fluency discriminative model D_e^fluency(X,Y; θ_d^e). The basic discriminative model is identical to that of the general dialog generation model based on a generative adversarial network, and distinguishes generated fake sentences from real sentences in the training corpus. It also guides the generator to produce dialogs that are closer to human dialogs. The emotion discriminative model is a binary classifier of text sequences, which determines whether or not the emotion expressed by a generated dialog agrees with a specified emotion e. It outputs the confidence probability that the emotion category of the input dialog is identical to the pre-specified emotion category. The fluency discriminator assesses the fluency of the input dialog and guides the generator to create more fluent dialog.

Generative model-oriented emotional dialog
The goal of the generative model G_e(Y|X; θ_g^e) is to generate a target sequence corresponding to each input source sequence by endowing it with a pre-specified emotion e, where θ_g^e denotes the parameters of the generative model. At each time-step t, G_e(Y|X; θ_g^e) produces a sentence S_t = Y_1:t = {y^<1>, y^<2>, …, y^<t>}, where y^<t> denotes a word token in the existing vocabulary. Eq. 1 and Eq. 2 present the penalty-based loss function [22]:

V_G^e(S_t, y^<t+1>) = λ1 V_e(S_t, y^<t+1>) + λ2 V_e^emotion(S_t, y^<t+1>) + λ3 V_e^fluency(S_t, y^<t+1>)    (1)

where V_G^e(S_t, y^<t+1>) denotes the total penalty score for the sentence sequence as calculated by the multiple discriminative models; V_e(S_t, y^<t+1>) denotes the penalty score calculated by the basic discriminative model D_e(X,Y; θ_d^e); V_e^emotion(S_t, y^<t+1>) denotes the penalty score calculated by the emotion discriminative model D_e^emotion(X,Y; θ_d^e), which reflects the agreement between the generated sentence and the pre-specified emotion; V_e^fluency(S_t, y^<t+1>) denotes the penalty score calculated by the fluency discriminative model D_e^fluency(X,Y; θ_d^e), which reflects the fluency of the sentences; and λ1 + λ2 + λ3 = 1. In this study, we set λ1 = 0.5 and λ2 = λ3 = 0.25.

Thus, the loss function L(y^<t+1>) is defined as follows based on the penalty scores:

L(y^<t+1>) = G_e(y^<t+1>|S_t; θ_g^e) · V_G^e(S_t, y^<t+1>)    (2)

where G_e(y^<t+1>|S_t; θ_g^e) denotes the probability of choosing the (t+1)-th word, which is dependent on the sequence S_t. The objective of the generative model is defined as follows to minimize the loss:

min_{θ_g^e} Σ_{t=0}^{T_y−1} G_e(y^<t+1>|S_t; θ_g^e) · V_G^e(S_t, y^<t+1>)    (3)

The penalty is calculated via the following formula:

V_e(S_t, y^<t+1>) = (1/N) Σ_{n=1}^{N} [1 − D_e(X, Y_1:T_y^(n))],  Y_1:T_y^(n) ∈ MC(S_t+1; N)    (4)

where T_y denotes the maximum length of the target sequence, and N denotes the size of the Monte Carlo search samples. The penalty score of the partial sequence is calculated by averaging those corresponding to multiple samples to reduce the variance caused by sampling.
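The weighted penalty combination and its Monte Carlo averaging can be sketched in plain Python. The stub discriminators and the 1 − D(·) penalty form here are illustrative assumptions in the spirit of SentiGAN-style penalty objectives; in EMC-GAN each discriminator is a trained CNN.

```python
# Sketch of the aggregate penalty, assuming each discriminator returns a
# probability in [0, 1] that the sequence is real, so its penalty for the
# generator is 1 - D(sequence). All discriminators here are hypothetical stubs.

LAMBDA_BASIC, LAMBDA_EMOTION, LAMBDA_FLUENCY = 0.5, 0.25, 0.25  # sum to 1

def aggregate_penalty(samples, d_basic, d_emotion, d_fluency):
    """Average the weighted penalty over N Monte Carlo completions.

    samples -- list of N complete sequences obtained by rolling out the
               partial sequence S_t with the generator
    d_*     -- callables mapping a sequence to a confidence in [0, 1]
    """
    total = 0.0
    for seq in samples:
        penalty = (LAMBDA_BASIC * (1.0 - d_basic(seq))
                   + LAMBDA_EMOTION * (1.0 - d_emotion(seq))
                   + LAMBDA_FLUENCY * (1.0 - d_fluency(seq)))
        total += penalty
    return total / len(samples)

# Toy usage with constant-score stub discriminators.
stub = lambda p: (lambda seq: p)
score = aggregate_penalty(["s1", "s2"], stub(1.0), stub(0.5), stub(0.5))
# 0.5*(1-1.0) + 0.25*(1-0.5) + 0.25*(1-0.5) = 0.25 per sample
```

Averaging over several rollouts is what keeps the penalty estimate for a partial sequence stable despite the stochastic sampling.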

The baseline model of dialog generation
The basic Seq2Seq model is used as a benchmark model in this study. This model uses an encoder-decoder network with deep LSTM units as the underlying architecture for dialog generation. Adding an effective attention mechanism to the model can help extract a greater amount of corresponding information between the source sentence and the target sentence. The overall structure of this model is depicted in Figure 2. The dialog generation model proposed in this study shares its network structure with the Seq2Seq baseline model.
Both the encoder and the decoder are implemented using LSTM. At each time-step, a token of the source sequence is fed as input to the encoder network. After the input is completed, the encoder generates a semantic vector C from the inputs at all time-steps, which represents the input source sequence. The initial state of the decoder is determined by the generated semantic vector C. The decoder decodes the semantic vector and outputs a token y^<t> at each time-step. Thus, we obtain an output sequence Y = {y^<1>, y^<2>, …, y^<T_y>} using the dialog generation framework based on the encoder-decoder network.

Attention mechanism of the generative model
The fundamental function of the attention mechanism is to compute the context vector. The context vector context^<t> directs the decoder by highlighting the word tokens of the input sentence on which it should focus. Figure 3 depicts the details of the calculation of the context vector in the attention mechanism. In order to preserve the state of the hidden layer of the source sequence, the Seq2Seq model with attention mechanism uses a bidirectional LSTM network [25] to extract the hidden state of the source sequence at each time-step. The hidden state of the bidirectional LSTM network is given by Eq. 5:

a^<t> = [a→^<t>; a←^<t>]    (5)

where a^<t> comprises two parts, a→^<t> and a←^<t>, representing the forward sequence feature and the backward sequence feature, respectively. The two parts of the hidden state vector are calculated via the following formulae:

a→^<t> = LSTM_fw(x^<t>, a→^<t−1>)    (6)
a←^<t> = LSTM_bw(x^<t>, a←^<t+1>)    (7)
Besides providing the historical information of the sequence before time-step t, the intermediate state a^<t> also carries the information of the sequence after time-step t. The alignment input is constructed as

e^<t,t'> = [s^<t−1>; a^<t'>]    (8)

where the vector e^<t,t'> denotes the vector obtained by concatenating the intermediate state vector of the decoder at the (t−1)-th time-step and the intermediate state vector of the encoder at the t'-th time-step. The vector e^<t,t'> is mapped to a scalar alignment score score(e^<t,t'>) by a small feed-forward network, and the attention weights and the context vector are computed as

α^<t,t'> = exp(score(e^<t,t'>)) / Σ_{t'=1}^{T_x} exp(score(e^<t,t'>))    (9)

context^<t> = Σ_{t'=1}^{T_x} α^<t,t'> a^<t'>    (10)

As is evident from Eq. 9, the attention vector α^<t,t'> represents the degree to which, at time-step t, the decoder focuses on the intermediate state vector a^<t'> of the encoder at time-step t'. Further, as is evident from Eq. 10, the context vector context^<t> is obtained by multiplying the attention vector α^<t,t'> by its corresponding hidden state vector a^<t'> and summing the products over the time-steps from 1 to T_x.
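The attention computation above can be sketched as follows. For simplicity, the alignment score here is a plain dot product rather than the concatenation-plus-feed-forward scoring used in the model, and vectors are plain Python lists; this is an illustrative sketch, not the actual network.

```python
import math

def attention_context(decoder_state, encoder_states, score):
    """Compute context^<t> = sum over t' of alpha^<t,t'> * a^<t'>.

    decoder_state  -- decoder hidden vector s^<t-1> (list of floats)
    encoder_states -- list of encoder hidden vectors a^<t'>
    score          -- alignment function; a small feed-forward network in
                      the model, a dot product in this sketch
    """
    energies = [score(decoder_state, a) for a in encoder_states]
    # Softmax over the encoder time-steps gives the attention weights.
    m = max(energies)
    exps = [math.exp(e - m) for e in energies]
    z = sum(exps)
    alphas = [e / z for e in exps]
    # Weighted sum of encoder states over time-steps 1..T_x.
    dim = len(encoder_states[0])
    context = [sum(alphas[t] * encoder_states[t][i] for t in range(len(alphas)))
               for i in range(dim)]
    return context, alphas

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy usage: the decoder state aligns with the first encoder state.
ctx, alphas = attention_context([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], dot)
```

The weights always sum to one, so the context vector stays inside the convex hull of the encoder states.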

The discriminative model
Deep discriminative models implemented using convolutional neural networks (CNNs) [26] and recurrent convolutional neural networks (RCNNs) [27] perform well in complex sequence classification tasks. We use a CNN as the fundamental structure of the discriminative model proposed in this study. Moreover, a highway network [28] is added to the discriminative model to improve its training speed. Both the emotion discriminative model and the fluency discriminative model are pre-trained and do not participate in the adversarial training process of the model.
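As a rough illustration of the highway network [28] component, the sketch below implements a single highway layer with per-dimension scalar weights; a real layer inside the CNN discriminator uses full weight matrices, so all names and shapes here are simplifying assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def highway_layer(x, w_h, b_h, w_t, b_t):
    """One highway layer: y = T(x) * H(x) + (1 - T(x)) * x (element-wise).

    H is an affine transform with ReLU, T is the sigmoid transform gate;
    when T(x) is near 0 the layer passes x through unchanged, which is
    what eases the training of deep discriminators.
    """
    out = []
    for i, xi in enumerate(x):
        h = max(0.0, w_h[i] * xi + b_h[i])   # H(x): affine + ReLU
        t = sigmoid(w_t[i] * xi + b_t[i])    # T(x): transform gate
        out.append(t * h + (1.0 - t) * xi)   # carry gate is 1 - T(x)
    return out

# With a strongly negative transform-gate bias, the layer stays close to
# the identity, which is how highway networks are commonly initialized.
y = highway_layer([0.5, -0.3], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0], [-10.0, -10.0])
```

The near-identity behavior at initialization is the design point: gradients can flow through the carry path even before the transform weights are trained.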

Basic discriminative model
The text classification model based on CNN, originally proposed by Zhang and LeCun [29], is used as the fundamental structure of the basic discriminative model D_e(X,Y; θ_d^e), which is employed to distinguish generated fake sentences from real sentences in the training dataset. The loss function of the basic discriminative model is the standard binary cross-entropy objective of a GAN discriminator:

min_{θ_d^e} −E_{(X,Y)∼P_data}[log D_e(X,Y)] − E_{Y'∼G_e(·|X)}[log(1 − D_e(X,Y'))]
The adversarial training process of the proposed emotional dialog generation model, EMC-GAN, is presented in Table 1, and the construction of the corresponding training data is illustrated in Table 2. The real dialog with emotion category e is denoted by R = {Dialog_e}, and the fake dialog with emotion category e is denoted by F.

Emotion discriminative model
The emotion discriminator is used to discriminate between the real dialog R and the fake dialog F and assign a confidence probability to the event that the input dialog is identical to the real dialog.
This model is similar to the basic discriminative model and is trained in advance. The emotional accuracies achieved by this pre-trained classifier vary across the different emotion categories.

Fluency discriminative model
The sentence fluency evaluation algorithm proposed in this study is based on the sentence fluency evaluation method proposed by Liu [30]. The training process of sentence fluency evaluation is presented in Table 3. The algorithm employs the N-gram statistical language model [31], which uses the transition probabilities of ternary tuples (trigrams) to measure the fluency of the entire sentence. Initially, we count the binary tuples (bigrams) in the dialogs present in the dataset and their corresponding frequencies of occurrence, and save the results in n_gram2_count, adopting the binary tuple as the key of the dictionary and its occurrence frequency as the value. Then, all the ternary tuples and their corresponding occurrence frequencies are calculated in a similar manner, and the results are saved in n_gram3_count. The transition probability of each ternary tuple is calculated using the dictionaries n_gram2_count and n_gram3_count as

P(x_k | x_i, x_j) = n_gram3_count[(x_i, x_j, x_k)] / n_gram2_count[(x_i, x_j)]

where x_i, x_j, and x_k denote adjacent words in the sentence. The calculated results are saved in n_gram3_prob.
Finally, the ternary tuples are sorted in descending order of their transition probabilities, and the result is saved in the list sorted_n_gram3_prob. In general, sentences composed of n-gram tuples with higher transition probabilities are more fluent. Two threshold probabilities, reward_prob and penalty_prob, are used to decide whether or not a generated sentence is fluent. The first 40% of the ternary tuples in sorted_n_gram3_prob are considerably smoother than the rest, and the constructions of the last 20% are considerably more awkward. reward_prob denotes the minimum transition probability among the first 40% of the tuples, and penalty_prob denotes the maximum transition probability among the last 20% of the tuples.
During the evaluation of the fluency of a sentence X = {x_1, x_2, …, x_m}, an initial fluency score of 0 is assigned to fluency(X). Then, all the ternary tuples in the sentence are traversed. If the length of the sentence is less than 3, the algorithm directly sets its fluency score to 0, because we do not expect the model to use such a short sentence as a response to the input source sentence. If the transition probability of the current ternary tuple is higher than reward_prob, the ternary tuple is relatively smooth; in this case, the ratio of its transition probability to reward_prob is added to the current fluency score fluency(X). If the transition probability of the current ternary tuple is less than penalty_prob, the ternary tuple is relatively awkward; in this case, a corresponding penalty ratio is subtracted from fluency(X).
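The trigram-based fluency scoring described above can be sketched in plain Python. The exact penalty update and the indexing of the 40%/20% thresholds are assumptions where the text leaves details open; the 0.02 default for unseen trigrams follows the value mentioned in the analysis of results.

```python
def build_fluency_model(corpus):
    """Collect bigram/trigram counts and trigram transition probabilities,
    then derive reward_prob / penalty_prob thresholds (40% / 20% cuts)."""
    n_gram2_count, n_gram3_count = {}, {}
    for sent in corpus:
        for i in range(len(sent) - 1):
            pair = (sent[i], sent[i + 1])
            n_gram2_count[pair] = n_gram2_count.get(pair, 0) + 1
        for i in range(len(sent) - 2):
            tri = (sent[i], sent[i + 1], sent[i + 2])
            n_gram3_count[tri] = n_gram3_count.get(tri, 0) + 1
    # P(x_k | x_i, x_j) = count(x_i, x_j, x_k) / count(x_i, x_j)
    n_gram3_prob = {tri: c / n_gram2_count[tri[:2]]
                    for tri, c in n_gram3_count.items()}
    sorted_probs = sorted(n_gram3_prob.values(), reverse=True)
    reward_prob = sorted_probs[max(0, int(len(sorted_probs) * 0.4) - 1)]
    penalty_prob = sorted_probs[int(len(sorted_probs) * 0.8)]
    return n_gram3_prob, reward_prob, penalty_prob

def fluency(sentence, n_gram3_prob, reward_prob, penalty_prob,
            unseen_prob=0.02):
    """Score a tokenized sentence; sentences shorter than 3 tokens get 0."""
    if len(sentence) < 3:
        return 0.0
    score = 0.0
    for i in range(len(sentence) - 2):
        tri = (sentence[i], sentence[i + 1], sentence[i + 2])
        p = n_gram3_prob.get(tri, unseen_prob)
        if p >= reward_prob:
            score += p / reward_prob      # reward fluent trigrams
        elif p < penalty_prob:
            score -= penalty_prob / p     # penalize awkward ones (assumed form)
    return score

# Toy corpus: two fluent sentences and one rarer continuation.
corpus = [["i", "love", "you", "so"]] * 2 + [["you", "so", "bad"]]
n_gram3_prob, reward_prob, penalty_prob = build_fluency_model(corpus)
```

On this toy corpus, a sentence built from frequent trigrams scores positively, a two-token sentence scores 0, and a sentence of unseen trigrams is penalized.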

Experiments
During the process of emotional dialog generation, the generative model generates a target sentence corresponding to each input source sentence and a specified emotion category. It is essential for the resulting sentence sequence to be consistent, fluent, and pertain to the specified emotion category.

Datasets
The dialog dataset comprises a series of dialog pairs with emotion category labels, (X, Y), where X = {e_x, x^<1>, x^<2>, …, x^<T_x>} is the source sentence sequence, Y = {e_y, y^<1>, y^<2>, …, y^<T_y>} is the target sentence sequence (dialog response), and e_x and e_y are the emotion category labels of the source sentence and the target sentence, respectively. The generative model intends to produce a target sentence with the specified emotion corresponding to each input source sentence with any emotion. To construct the corresponding datasets for the different emotion generative models, we divide the dataset into multiple sub-datasets based on the emotion conveyed by the target sentence. All target sentence sequences belonging to the same sub-dataset share the identical emotion label e_y.
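The split into per-emotion sub-datasets can be sketched as follows, assuming (as described above) that the emotion label is stored as the first token of each sequence; the sample labels below are illustrative.

```python
def split_by_target_emotion(dialog_pairs):
    """Group (X, Y) dialog pairs into sub-datasets keyed by the emotion
    label e_y of the target sentence, so that each generator G_e is
    trained only on responses carrying its own emotion."""
    sub_datasets = {}
    for source, target in dialog_pairs:
        e_y = target[0]                 # label is the first token of Y
        sub_datasets.setdefault(e_y, []).append((source, target))
    return sub_datasets

# Toy dialog pairs: each sequence starts with its emotion label.
pairs = [
    (["happiness", "hi"], ["liking", "hello", "friend"]),
    (["other", "bye"], ["sadness", "see", "you"]),
    (["anger", "no"], ["liking", "ok"]),
]
subs = split_by_target_emotion(pairs)
```

Note that the grouping key is e_y alone: the source sentence may carry any emotion, matching the requirement that each generator accept arbitrary inputs but produce responses of one fixed emotion.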
NLPCC Weibo (NLPW): This dataset is constructed based on the conversations extracted from Weibo comments and comprises 1119200 dialog pairs with six emotion category labels (anger, disgust, happiness, liking, sadness, and other), similar to those used in our previous study [23] .
Xiaohuangji (XHJ): This dataset comprises 454130 dialog pairs in aggregate. However, its corpus does not include corresponding emotion labels. We use the open-source natural language processing tool HanLP [32] to train an emotion classification model, which is essentially a naive Bayes classifier trained on the NLPW dataset. The emotion classification model is capable of classifying sentences into the six emotion categories of the NLPW dataset.
The frequency distribution of the two datasets, NLPW and XHJ, with respect to varying emotion categories has been depicted in Figure 4. Table 4 presents the emotion distribution for sub-datasets corresponding to different emotions.

Experimental setup
The basic Seq2Seq dialog generation model is used as a benchmark during experimental comparison. The performances of the hybrid neural network-based emotional dialog generation model EHMCG [23] and the emotional dialog generation model EM-SeqGAN [33] are compared with that of the proposed EMC-GAN model. This study analyzes and evaluates the dialogs generated by the different models primarily based on three indicators: emotional accuracy, coherence, and fluency. TensorFlow [34] is used to construct our model. During the calculation of the penalty score, the discriminator coefficients λ1, λ2, and λ3, as defined in Eq. 1, are taken in the ratio 2:1:1. This adequately constrains the weights of the three evaluation parameters and appropriately guides the training of the generative model. The training iterations of the generative model and the discriminative model are taken to be 5 and 10, respectively.

Evaluation of emotional accuracy
Following the generation of dialogs, we annotate them with the corresponding emotion categories via the emotion classifier used in HanLP. If the emotion category of a generated sentence is identical to that of the target sentence, the emotion category of the generated sentence is considered to conform to the expectation. Table 5 depicts the emotional accuracies of dialogs generated by the different dialog generation models. Compared to the other models, the proposed EMC-GAN model exhibits the highest emotional accuracy on each emotion sub-dataset. In the case of the NLPW dataset, the EMC-GAN model exhibits high emotional accuracies corresponding to the emotions "Other", "Liking", "Sadness", and "Disgust", which lie in the range of 0.588-0.740, while its emotional accuracies corresponding to the emotions "Anger" and "Happiness" were observed to be only 0.392 and 0.236, respectively. In the case of the XHJ dataset, the EMC-GAN model exhibits higher emotional accuracy corresponding to every emotion, with accuracies in the range 0.701-0.870. Compared to the other emotion categories, the emotion "Other" consistently corresponds to the highest emotional accuracy on both datasets, which may be attributed to the fact that it corresponds to the largest amount of training data. As the emotion "Other" represents any emotion distinct from the listed ones, the emotion classification model tends to judge the emotion category of most input sentences to be "Other".
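The emotional-accuracy metric reduces to a simple comparison of predicted labels. The toy classifier below is a stand-in for the HanLP-based naive Bayes classifier used in the paper; its cue word and the sample sentences are purely illustrative.

```python
def emotional_accuracy(generated, targets, classify):
    """Fraction of generated sentences whose predicted emotion matches
    the predicted emotion of the corresponding target sentence.

    classify -- an emotion classifier (HanLP-based in the paper); any
                callable mapping a sentence to a label works here.
    """
    matches = sum(1 for g, t in zip(generated, targets)
                  if classify(g) == classify(t))
    return matches / len(generated)

# Toy classifier keyed on a single cue word, for illustration only.
toy = lambda s: "happiness" if "glad" in s else "other"
acc = emotional_accuracy(["i am glad", "fine"], ["glad too", "hmm"], toy)
# → 1.0 (both pairs are classified identically)
```

Using the same classifier on both the generated and the target sentences, as the paper does, keeps classifier bias from skewing the comparison between models.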

Evaluation of coherence
One of the most essential parameters in the evaluation of the performance of a dialog generation model is the coherence of the generated dialog, which represents whether or not it is in consonance with the context of the source sentence. Currently, satisfactory models capable of adequately evaluating the coherence of generated text are unavailable. Therefore, we resort to manual judgment to assess its coherence. The options used for evaluation and the corresponding evaluation scores have been summarized in Table 6. The evaluation scores range from 1 to 5 and higher evaluation scores correspond to higher coherence of the dialog.
The coherence evaluation scores of dialogs generated by the different generation models are depicted in Table 7. The proposed EMC-GAN model exhibits higher coherence evaluation scores corresponding to all emotion categories for both datasets compared to the other models. Further, EMC-GAN exhibits higher coherence evaluation scores on the XHJ dataset than on the NLPW dataset. The generated dialog texts corresponding to the emotions "Other", "Liking", and "Sadness" reflected higher coherence evaluation scores for both datasets. Notably, coherence evaluation scores of 3.407 and 3.180 were achieved in the "Other" and "Sadness" categories, respectively, by EMC-GAN. This indicates that the dialogs generated by the proposed model exhibit satisfactory coherence.

Table 5 The emotional accuracy of generated dialog

Evaluation of fluency
Besides coherence, fluency of the generated text is also an essential parameter in the evaluation of the performance of a dialog generation model. Fluency reflects the text production capability of the generator.
The fluency discriminative model evaluates the fluency of generated dialogs, and its underlying algorithm has been outlined in Algorithm 3. In addition, to improve the accuracy of the fluency evaluation, we also adopt a manual method.
A fluency score is assigned to each generated sentence by the fluency discriminative model. As is evident from the data presented in Table 8, the proposed EMC-GAN model exhibits higher fluency scores compared to other models corresponding to each dataset and emotion category. In the case of the NLPW dataset, the proposed model exhibits higher fluency scores corresponding to the emotions, "Sadness" and "Anger", and the fluency score obtained corresponding to the emotion "Other" is relatively low. The generated dialogs in the XHJ dataset corresponded to higher fluency scores than those of the NLPW dataset, and the fluency of the sentences was observed to improve palpably. In the case of the XHJ dataset, the fluency score corresponding to the emotion "Other" was also observed to be relatively low. Table 9 depicts the fluency scores of the different generation models as assigned by human judgment.
These scores are observed to be similar to the coherence evaluation scores. Compared to the other models, the EMC-GAN model exhibits higher fluency corresponding to each dataset and emotion category. Further, EMC-GAN exhibits a higher fluency score corresponding to each emotion on the XHJ dataset than on the NLPW dataset. On the XHJ dataset, EMC-GAN exhibits its highest fluency evaluation score of 4.480 corresponding to the emotion "Sadness", while the fluency evaluation scores corresponding to "Disgust" and "Anger" are observed to be relatively low at 2.835 and 2.960, respectively.

Analysis of results
The errors in emotional accuracy revealed by this experiment can primarily be attributed to mistakes in the datasets and on the part of the emotion classification model. As the accuracy of emotion classification on the NLPW dataset is observed to be 64%, some errors arose in the emotion categories assigned to the generated dialogs. Since dialogs in the XHJ dataset are not equipped with corresponding emotion category labels, the HanLP tool is used to train an emotion classifier using the NLPW dataset as a training corpus and, thereby, add emotion labels to dialogs in the XHJ dataset. Further, to reduce the influence of the emotion classification model on the emotion evaluation of dialogs, we use the same emotion classification model to classify all the generated texts.
Based on the experimental results, it can be concluded that the proposed model performs better on the XHJ dataset than on the NLPW dataset. The sentence quality of dialogs in the NLPW dataset was generally assessed to be poor. The sentences in the NLPW dataset are shorter than those in usual conversations and more awkward, which may be attributed to the irregularity of Weibo comments. In the NLPW dataset, a significant number of binary and ternary tuples occurred only once, which is not conducive to the construction of the language model and makes it more difficult for the dialog generation model to achieve convergence. Analysis of dialogs generated by the baseline model revealed that the majority of the generated sentences are sequences of random words, and their fluency is particularly poor. As is evident from Table 8, the dialogs generated by the Seq2Seq model exhibit lower fluency scores, and the fluency scores of sentences corresponding to different emotions are roughly identical. This can be attributed to the fact that, during the process of fluency evaluation, we assign a negligible transition probability (0.02) to ternary tuples whose corresponding binary tuple does not exist. Thus, as the fluency of the generated dialogs is extremely poor, the transition probabilities of most ternary tuples are close to this minimum value.
Certain samples of generated dialog were obtained from the test set. Table 10 depicts the dialogs generated by Seq2Seq and EMC-GAN. The column entitled 'Emotion' depicts the pre-specified emotion category that the dialogs were meant to be generated in. It is apparent that the responses generated by Seq2Seq are poorer in terms of emotional and content consistency, and, consequently, they do not fare well with respect to the various evaluation parameters. The proposed model is observed to perform better in terms of emotional accuracy, coherence, and fluency.

Conclusion
In this study, a new emotional dialog generation framework (EMC-GAN) was proposed, which uses multiple classifiers to generate better dialogs with respect to various evaluation parameters. The generative model generates a target sentence for each input source sentence. The basic discriminative model distinguishes generated fake sentences from real sentences in the training dataset. The emotion discriminative model evaluates whether the emotion of the generated dialog agrees with a pre-specified emotion. Finally, the fluency discriminative model assesses the fluency of the input dialog and assigns a fluency score to it. Based on the experimental results, it was concluded that EMC-GAN is capable of generating dialogs with pre-specified emotions. Compared to other models, the dialogs generated by EMC-GAN were observed to be smoother and more fluent. However, the accuracy of the emotion classifier should be improved to obtain more realistic dialogs. Further, other features of sentences (such as novelty and variability) need to be considered to make the final dialog more fluent and natural.

Declaration of competing interest
We declare that we have no conflict of interest.
