Boosting Dialog Response Generation

Neural models have become one of the most important approaches to dialog response generation. However, they still tend to generate the most common and generic responses in the corpus all the time. To address this problem, we designed an iterative training process and ensemble method based on boosting. We combined our method with different training and decoding paradigms as the base model, including mutual-information-based decoding and reward-augmented maximum likelihood learning. Empirical results show that our approach can significantly improve the diversity and relevance of the responses generated by all base models, backed by objective measurements and human evaluation.


Introduction
Sequence-to-sequence models (Sutskever et al., 2014) have become one of the most popular approaches to dialog systems because they provide a high degree of automation and flexibility. On the other hand, they are known to suffer from the "dull-response" problem (Li et al., 2015). Various attempts have been made to improve the diversity of responses generated by sequence-to-sequence models. One line of research investigates alternatives to maximum likelihood learning and decoding, which is believed to be the main cause of monotonicity. Li et al. (2015) employed a decoding objective based on mutual information between contexts and responses; Li et al. (2017a) used reinforcement learning to train the decoder to generate responses that maximize pre-defined rewards instead of minimizing perplexity; Li et al. (2017b) and Xu et al. (2017) adopted adversarial learning, in which a generator is trained to deceive a discriminator that tries to differentiate between generated responses and human responses. Besides changing training and decoding objectives, Liu et al. (2018) and Lison and Bibauw (2017) considered reweighting data points by penalizing those with overly frequent responses or by emphasizing high-quality responses. Serban et al. (2017) and Zhao et al. (2017) introduced stochastic latent variables into their models to capture discourse information at the inter-utterance level. Shao et al. (2017) experimented with a segment-based training and decoding paradigm to help mitigate the problems of redundancy and contradiction.
Yet another type of approach, boosting and ensembling, has not been investigated in the literature in the context of response generation, despite having been studied for machine translation (Xiao et al., 2010; Zhang et al., 2017). Boosting is a long-established machine learning method (Freund and Schapire, 1997): it typically involves iteratively training multiple models on instances reweighted according to the errors of the previous models, and then combining these models. The idea has recently been revived and extended to generative models and image generation, which also suffers from a diversity problem (Tolstikhin et al., 2017; Grover and Ermon, 2018). In computer vision, state-of-the-art models tend to generate a few categories of objects all the time and ignore the rest, a problem known as "missing modes". Boosting has been shown to significantly improve the coverage of image generation models.
For language generation, given the prior success of data reweighting and bootstrap approaches (Zhang et al., 2017; Liu et al., 2018), we believe dialog response generation may benefit from boosting as well. In this work, we design a principled framework for boosting response generation, based on the recently developed theory of boosting generative models. Moreover, we combine boosting with different training and decoding paradigms, and show empirically that boosting consistently improves them in both quantitative and qualitative evaluation.

Preliminaries
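For standard sequence-to-sequence approaches, both training and decoding are based on maximum likelihood estimation:

$$\hat{y} = \arg\max_{y} \log p(y \mid x) \tag{1}$$

where $x$ is the source (or context) and $y$ is the target (or response). Li et al. (2015) proposed a decoding objective based on the mutual information of $x$ and $y$ to improve diversity:

$$\hat{y} = \arg\max_{y} \left\{ \log p(y \mid x) - \lambda \log p(y) \right\} \tag{2}$$

where $\lambda$ is a hyperparameter that controls the penalty on likely responses. The conditional probability of $y$ given $x$ is estimated from sequence-to-sequence models, and the marginal probability of $y$ from a separately trained language model.

Reward-augmented maximum likelihood learning (RAML) (Norouzi et al., 2016) incorporates task rewards into maximum likelihood training. An exponential payoff distribution is defined:

$$q(y \mid y^{*}; \tau) = \frac{1}{Z(y^{*}, \tau)} \exp\left\{ \frac{r(y, y^{*})}{\tau} \right\} \tag{3}$$

where $y^{*}$ is the true target, $r$ is a pre-defined reward function, $\tau$ is the temperature parameter, and $Z(y^{*}, \tau)$ is the normalizing constant. The model is trained to minimize the KL divergence between the payoff distribution and the conditional distribution of $y$:

$$\sum_{(x, y^{*})} D_{\mathrm{KL}}\left( q(y \mid y^{*}; \tau) \,\|\, p(y \mid x) \right) \tag{4}$$

In multiplicative boosting, the density estimate at each iteration $T$ is given by:

$$Q_T = \frac{1}{Z_T} \prod_{t=1}^{T} h_t^{\alpha_t} \tag{5}$$

where $h_t$ is the $t$-th model's estimate, $\alpha_t$ is that model's weight, and $Z_T$ is the normalizing constant. The goal of boosting is to better approximate the true distribution, $P$. Grover and Ermon (2018) show that if the model at each iteration can perfectly optimize for a reweighted distribution of the following form:

$$d_t \propto \left( \frac{P}{Q_{t-1}} \right)^{\beta_t} \tag{6}$$

then the distance between the models' density estimate and the true distribution is non-increasing, that is,

$$D_{\mathrm{KL}}(P \,\|\, Q_t) \le D_{\mathrm{KL}}(P \,\|\, Q_{t-1}) \tag{7}$$

In equations (5) to (7), the density estimates are for the joint distribution of $x$ and $y$. We make the additional assumption that the sources are uniformly distributed, so that $p(x, y) = \frac{1}{n} p(y \mid x)$, where $n$ is the number of sources. This makes it easy to apply the boosting algorithm to sequence-to-sequence training.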
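The boosting recursion described above can be illustrated on a toy discrete distribution. The sketch below is only illustrative: it assumes the multiplicative form of Grover and Ermon (2018), with an ensemble density that is a normalized product of weighted model estimates and a per-round target proportional to (P/Q)^β. The function names, the four-response toy distribution, and the choice α = β = 1 are assumptions made for the example, not our experimental setup.

```python
import math

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def ensemble_density(estimates, alphas):
    """Normalized product of per-model estimates h_t raised to alpha_t."""
    n = len(estimates[0])
    unnorm = [
        math.prod(h[i] ** a for h, a in zip(estimates, alphas))
        for i in range(n)
    ]
    return normalize(unnorm)

def reweight(p, q, beta):
    """Target for the next round: proportional to (p / q) ** beta."""
    return normalize([(pi / qi) ** beta for pi, qi in zip(p, q)])

def kl(p, q):
    """KL divergence D(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy setting (hypothetical): the true distribution is uniform over four
# responses, while the first model favors a single "generic" response.
p = [0.25, 0.25, 0.25, 0.25]
h0 = [0.70, 0.10, 0.10, 0.10]

d1 = reweight(p, h0, beta=1.0)   # up-weights the three neglected responses
h1 = d1                          # idealized: the next model fits d1 perfectly
q1 = ensemble_density([h0, h1], alphas=[1.0, 1.0])
```

In this idealized case the second model exactly compensates for the first, so the ensemble recovers the uniform target and the KL divergence to it drops. With real models the fit is imperfect and the decrease is gradual.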
The true distribution $P$ is usually set to be uniform to boost the coverage of generative models. One of our innovations in this work is extending it to the exponential payoff distribution in the RAML setting. The decreasing property of the KL divergence still holds, as the theoretical analysis closely follows that of Grover and Ermon (2018).
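As a rough illustration of the exponential payoff distribution used as the boosting target, the following sketch normalizes exp(r/τ) over a small, explicit candidate set. Restricting the normalization to a finite candidate list and the token-overlap reward `toy_reward` are simplifying assumptions for the example; the reward actually used in our experiments is tf-idf based.

```python
import math

def payoff_distribution(rewards, tau):
    """Softmax of rewards / tau over a finite candidate set."""
    scaled = [r / tau for r in rewards]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def toy_reward(candidate, target):
    """Hypothetical reward: fraction of target tokens found in the candidate."""
    cand = set(candidate.split())
    tgt = target.split()
    return sum(tok in cand for tok in tgt) / len(tgt)

target = "i live in canada"
candidates = ["i live in canada", "i live near the beach", "what do you do"]
rewards = [toy_reward(c, target) for c in candidates]
q = payoff_distribution(rewards, tau=0.1)
```

A small temperature such as 0.1 concentrates almost all of the mass on the highest-reward candidates, while a larger temperature spreads it more evenly.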

Design
We discuss some practical considerations in applying the boosting framework to the response generation problem.

Data Reweighting
In the generative boosting method of (6), the weight of a data point is inversely related to the likelihood that the current model assigns to its response, so responses with high perplexity are up-weighted. However, we observed in experiments that generic responses do not always have low perplexities. If not handled properly, such responses end up being boosted and become the frequently generated responses at the next iteration.
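In search of a consistent way to penalize generic responses with high perplexities, we first considered the discriminative boosting approach introduced by Grover and Ermon (2018). A discriminator is trained to differentiate between generated responses and human responses. The weights of data after discriminative boosting are the density ratios from the discriminator. The idea is closely related to generative adversarial learning (Goodfellow et al., 2014). However, in our case it is difficult to apply such an approach. Because the generated responses are very limited, most classifiers can easily memorize all of them. The discriminators end up assigning extremely high probabilities to most of the human responses, and close-to-zero densities to generated responses. In other words, the number of negative examples is too small to train a discriminator with good decision boundaries and generalization.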
Instead, we resort to a simple rule-based discriminator. At each iteration, we maintain a list of the most frequently generated responses, $C_t$. We choose a binary function, denoted $\mathrm{sim}(y, z)$, to decide whether two responses $y$ and $z$ are similar. The discriminator is defined as

$$D_t(y) = \begin{cases} c & \text{if } \exists\, y' \in C_t : \mathrm{sim}(y, y') = 1 \\ 0.5 & \text{otherwise} \end{cases} \tag{8}$$

and the weights of data at round $t$ are given by the density ratio of this discriminator:

$$w_t(x, y) \propto \frac{D_t(y)}{1 - D_t(y)} \tag{9}$$

In our experiments, the similarity function is a predicate of whether the two responses share an n-gram with n ≥ 4. We chose to be aggressive and set $c = 0$, so responses that are similar to those frequently generated by previous models are excluded. The size of $C_t$ is chosen to be around 20, so that the amount of training data is reduced by about 10 percent at each iteration.
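The rule-based discriminator can be sketched in a few lines. This is a minimal illustration under stated assumptions: whitespace tokenization, a 4-gram overlap test for similarity (sharing any n-gram with n ≥ 4 implies sharing a 4-gram), and a data weight equal to the density ratio D / (1 − D) as in discriminative boosting. The function names and the example sentences are hypothetical.

```python
def ngrams(tokens, n):
    """Set of all n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def sim(y, z, n=4):
    """True if the two responses share at least one n-gram of length n."""
    return bool(ngrams(y.split(), n) & ngrams(z.split(), n))

def discriminator(y, frequent, c=0.0):
    """Return c if y resembles a frequently generated response, else 0.5."""
    return c if any(sim(y, z) for z in frequent) else 0.5

def data_weight(y, frequent, c=0.0):
    """Density ratio D / (1 - D): 0 when c = 0, 1 for unflagged responses."""
    d = discriminator(y, frequent, c)
    return d / (1.0 - d)

# Hypothetical list of frequently generated responses and training targets.
frequent = ["what do you do for a living"]
data = ["so what do you do for a living", "do you live near the beach"]
weights = [data_weight(y, frequent) for y in data]
```

With c = 0, the first target receives weight 0 and is dropped from the next round, while the second keeps weight 1. A value of c between 0 and 0.5 would down-weight flagged responses instead of removing them.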
In our experiments, we include bootstrapping as an additional baseline.At each iteration, 80% of the data are randomly sampled for training and validation.

Model Combination
At decoding time, due to the discrete nature of text data, searching for the response with the highest probability (or mutual information) under the ensemble is intractable, so we use the following heuristic. Candidate responses are generated from the single best model using beam search. The candidates are then scored by all models, and the one with the highest average score is chosen. The model weights $\alpha_t$ are set to be uniform.
Since each model is trained on data with different weights, their unnormalized probability density estimates may have different scales. Hence, at decoding time, the scores of each model are z-normalized with the mean and standard deviation calculated from the training data.
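The combination heuristic amounts to rescoring a candidate list with normalized, uniformly averaged scores. The sketch below assumes that the per-model scores of each candidate and the per-model mean and standard deviation are already available; `pick_response`, the candidate strings, and all numbers are hypothetical.

```python
def pick_response(candidates, model_scores, stats):
    """Choose the candidate with the highest average z-normalized score.

    candidates:   list of candidate responses from beam search
    model_scores: one list of raw scores per model, aligned with candidates
    stats:        one (mean, std) pair per model, estimated on training data
    """
    best, best_avg = None, float("-inf")
    for i, cand in enumerate(candidates):
        normalized = [
            (scores[i] - mu) / sigma
            for scores, (mu, sigma) in zip(model_scores, stats)
        ]
        avg = sum(normalized) / len(normalized)   # uniform model weights
        if avg > best_avg:
            best, best_avg = cand, avg
    return best

candidates = ["what do you do for a living", "do you live near the beach"]
model_scores = [
    [-10.0, -12.0],   # model 1 scores (hypothetical)
    [-40.0, -30.0],   # model 2 scores, on a different scale
]
stats = [(-11.0, 2.0), (-35.0, 5.0)]
choice = pick_response(candidates, model_scores, stats)
```

Normalizing before averaging keeps a model whose scores happen to have a large magnitude from dominating the vote.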

Other Details
For RAML, the reward function is based on tf-idf matching, that is, the sum of the products of term frequency and inverse document frequency of each word, divided by length. The rationale is to encourage models to include key content words in their generations. Empirically, we observed that RAML with this reward can generate better responses than the MLE baseline even without boosting. The temperature parameter $\tau$ is set to 0.1. To approximate the expectation term in the RAML objective, the three additional responses with the highest rewards are selected from the training data for each message-response pair at the beginning. We do not sample new responses at subsequent iterations, for the sake of fair comparison. We set $\beta_t$ in equation (6) to $\frac{1}{bt}$, where $b$ is between 10 and 20 and is tuned on the validation set.
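One possible reading of the tf-idf matching reward is sketched below. The details are assumptions made for illustration: only candidate words that also occur in the true response contribute, the sum is divided by the candidate length, and the idf uses add-one smoothing. The helper names and the three-sentence corpus are hypothetical.

```python
import math
from collections import Counter

def idf_table(corpus):
    """Smoothed inverse document frequency over tokenized responses."""
    n_docs = len(corpus)
    df = Counter(tok for doc in corpus for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + cnt)) + 1.0
            for tok, cnt in df.items()}

def tfidf_reward(candidate, target, idf):
    """Sum of tf * idf over candidate words that occur in the target,
    divided by the candidate length."""
    if not candidate:
        return 0.0
    tf = Counter(candidate)
    target_vocab = set(target)
    matched = sum(tf[w] * idf.get(w, 0.0) for w in tf if w in target_vocab)
    return matched / len(candidate)

corpus = [s.split() for s in [
    "i live in canada",
    "i like to ski",
    "i have a dog",
]]
idf = idf_table(corpus)
target = "i live in canada".split()
r_content = tfidf_reward("we live in canada".split(), target, idf)
r_generic = tfidf_reward("i do not know".split(), target, idf)
```

Because frequent words such as "i" have low idf, a candidate that matches only function words earns a much smaller reward than one that matches content words.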

Experiments
We evaluate our algorithm on single-turn conversations from the Persona dataset (Zhang et al., 2018). Participants are instructed to converse according to their given personalized background. In preparing the training data, persona descriptions are prepended to the sources, and all trailing punctuation is stripped from the responses.
We use a standard sequence-to-sequence architecture with an attention mechanism. Both the encoder and the decoder are LSTMs with a hidden size of 512 and an input size of 300. Attentional contexts are weighted sums of the hidden states of words in the personas. We use the Adam optimizer to train the model with a learning rate of 0.001. All model parameters, including word embeddings, are randomly initialized between −0.1 and 0.1.
In addition to the base models mentioned before, we investigate the combination of RAML and MMI, in which models are trained with RAML and decoded with MMI.

Quantitative Evaluation
We employ two standard word-overlap-based metrics, BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004). We also perform embedding-based evaluation: we embed the responses using the word averaging approach of Arora et al. (2016), and measure the cosine similarity between the embeddings of generated responses and true responses.
To measure the diversity of the responses, we perform k-means clustering on their embeddings with 10 clusters, and measure the inertia. A larger inertia indicates more diversity. We also report the number of distinct n-grams.
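The two diversity measures can be sketched as follows. This is an illustrative implementation only: tokenization is by whitespace, and `inertia` takes the cluster centroids as an argument rather than running k-means itself, so the centroids and toy points here are hypothetical stand-ins for the output of a 10-cluster k-means on response embeddings.

```python
def distinct_ngrams(responses, n):
    """Number of unique n-grams across all responses."""
    seen = set()
    for resp in responses:
        toks = resp.split()
        seen.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return len(seen)

def inertia(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    total = 0.0
    for p in points:
        total += min(
            sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids
        )
    return total

dull = ["i do not know", "i do not know", "i do not know"]
varied = ["i do not know", "i live in canada", "do you like dogs"]
unigrams_dull = distinct_ngrams(dull, 1)
unigrams_varied = distinct_ngrams(varied, 1)

# Toy 2-d "embeddings" with two given centroids.
points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0)]
centroids = [(0.0, 1.0), (4.0, 0.0)]
spread = inertia(points, centroids)
```

A set of identical responses yields few distinct n-grams and, once embedded, collapses onto a few points with near-zero inertia; a more varied set scores higher on both measures.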
As can be seen in Figure 1, the general trend under boosting is that performance improves drastically up to the third model, after which it improves slowly or stays the same. Boosting is far better than bootstrapping. Boosting improves the lexical-level semantic similarity between generated responses and true responses, as measured by cosine similarity. While BLEU scores fluctuate only within a tight range, ROUGE-L degrades slightly under boosting when it is applied to base models that already generate more diversified responses. However, we do not consider BLEU and ROUGE the most important metrics. Diversity measures, including the count of distinct n-grams and the inertia of clusters, are significantly improved by boosting. Combining RAML and MMI seems to give an advantage in BLEU (mainly because the generated responses are longer), inertia, and the number of unigrams.

Qualitative Evaluation
To ensure that the diversified responses are as relevant as before boosting, we asked five annotators to evaluate a randomly sampled subset of 100 examples from each base model against its boosted counterpart. Each context is paired with two responses, one from the base model and one from the boosted model. The annotators are asked to choose the more appropriate response, or to declare a tie if the two are equally appropriate. The results are shown in Table 1. On average, annotators showed no preference about 38 to 47 percent of the time, and boosted models beat base models in 36 to 45 percent of the trials. Note that in all individual tests annotators preferred the boosted model over the base model, except for one case, in which the annotator chose the MMI base model over the boosted model slightly more often. We also provide an example of generated responses in Table 2.

Conclusion
We investigated the use of boosting to improve the diversity and relevance of dialog response generation, with various training and decoding objectives including mutual-information-based decoding and reward-augmented maximum likelihood learning. Our combination of boosting and RAML for response generation is novel, and its combination with MMI gives some of the most diversified results. Quantitative evaluation shows that our method can substantially improve diversity without harming the quality of generated responses. Our human evaluation provides evidence that the diversified responses produced by boosting are even more appropriate than those generated by baseline models.
Figure 1: Quantitative results. The x-axis shows the boosting iteration and the y-axis the metric value. The numbers at iteration 1 represent the base models.

Table 1: Human evaluation results. "Win" stands for the boosted model winning.

Model    Win             Loss           Tie
…        …               … ± 6.8%       47.2 ± 8.8%
RAML     44.8 ± 10.8%    16.8 ± 4.8%    38.4 ± 12.4%
Table 2: Examples of generated responses from the baseline sequence-to-sequence model and its boosted counterpart.

Context:   my family lives in alaska . it is freezing down there .
Human:     i bet it is oh i could not
Baseline:  what do you do for a living
Boosted:   do you live near the beach ? i live in canada