Understanding the Impact of Experiment Design for Evaluating Dialogue System Output

Evaluation of output from natural language generation (NLG) systems is typically conducted via crowdsourced human judgments. To understand how experiment design might affect the quality and consistency of such judgments, we designed a between-subjects study with four experiment conditions. Through our systematic study, with 40 crowdsourced workers in each task, we find that continuous scales yield more consistent ratings than Likert-scale or ranking-based experiment designs. Additionally, we find that raters with no prior experience of participating in similar studies rating dialogue system output show higher consistency and agreement.


Introduction
Obtaining high-quality crowdsourced human judgments of NLG output is a major imperative, since these judgments are the key evidence that certain models perform better than others. Experiment designs to obtain such judgments primarily use Likert scales. Belz and Kow (2010) argue that discrete scales, such as Likert scales, can be unintuitive and that people may avoid extreme values in their judgments. We systematically compare four experiment conditions that incorporate continuous, relative, and ranking scales for obtaining crowdsourced human judgments. Our key findings are:
1. Using continuous scales results in higher inter-rater consistency and agreement.
2. Raters with no prior experience in evaluating dialogue system output show greater inter-rater consistency and agreement than those who have previously participated in such rating tasks.

Data and Models
We used the Reddit Conversational Corpus made available by Dziri et al. (2018) to train our models. The corpus contains 9M training examples, 500K development dialogues, and 400K test dialogues. The models trained for this study are:
• Seq2Seq: a simple encoder-decoder model with an attention mechanism (Bahdanau et al., 2014).
• HRED: the Hierarchical Encoder-Decoder (Serban et al., 2016), which incorporates an utterance and an intra-utterance layer to model context.
• THRED: the Topic Augmented Hierarchical Encoder-Decoder (Dziri et al., 2018), which uses topic words along with a hierarchical encoder-decoder to produce a response.

Experiment Design
We ask human raters to evaluate which model produces the better output on the basis of two metrics. Readability "measures the linguistic quality of text and helps quantify the difficulty of understanding the text for a reader" (Gatt and Krahmer, 2018). Coherence is the "ability of the dialogue system to produce responses consistent with the topic of conversation" (Venkatesh et al., 2018). We constructed four surveys (i.e., experiment conditions) built on three well-known question types: Likert scales, magnitude estimation, and best-worst ranking. Our experiment conditions are:
Likert Scale (LS): Likert scales are typically used in experiments for crowdsourcing human evaluation of dialogue systems (Asghar et al., 2018; Lowe et al., 2017). In our experiment, we ask the raters to rate the generated responses on a 6-point scale, following Novikova et al. (2018), where 1 is the lowest and 6 is the highest score on the metrics of readability and coherence.
Rank-Based Magnitude Estimation (RME): Prior research by Belz and Kow (2011) demonstrates through six separate experiments that continuous scales are viable and offer distinct advantages over discrete scales in evaluation tasks. Recently, Novikova et al. (2018) adopted magnitude estimation by providing the rater with a standard value for a reference sentence to evaluate output from goal-oriented systems. Following Novikova et al. (2018), we set the value of the standard (reference utterance) to 100, since the reference utterance was produced by humans and is considered the gold standard. The crowdsourced workers are asked to provide a score relative to 100 (from 0 to 999) for three system-generated outputs.
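Scores given relative to a fixed standard are usually placed on a logarithmic scale before being compared across raters, since magnitude estimates are multiplicative rather than additive. A minimal sketch of this kind of transformation (the function name and exact form are illustrative, not the authors' code):

```python
import math

def normalize_me_scores(scores, standard=100.0):
    """Log-transform magnitude-estimation scores relative to the
    standard value, so that the reference maps to 0, scores above
    the standard map to positive values, and scores below it map
    to negative values."""
    return [math.log(s / standard) for s in scores]

# A response rated 200 is as far above the reference as one
# rated 50 is below it, on the log scale.
print(normalize_me_scores([50, 100, 200]))
```

On this scale, ratios rather than raw differences carry the meaning: doubling and halving the standard are symmetric around zero.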
Biased Magnitude Estimation (BME): Our third experiment condition is biased magnitude estimation (BME). The main difference between the RME and BME methods is that the standard value we provide for the reference utterance is not uniformly set to 100 for all examples, but is instead calculated by automated methods. Our motivation is to understand whether anchoring bias affects the ratings when judgments are made relative to a fixed value (100) versus a value calculated by automated means. Anchoring bias is the tendency to rely too heavily on one piece of information offered (the "anchor", in this case the number 100) when making decisions (Kahneman, 2016).
Best-Worst Scaling (BWS): Our last experiment condition is best-worst scaling (BWS), in which raters are asked to rank the generated responses from best to worst on both metrics (readability and coherence). This approach has previously been used to estimate emotion intensity and has been shown to produce high-quality, consistent judgments from humans (Kiritchenko and Mohammad, 2017).
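BWS annotations are conventionally converted into per-item scores by counting how often each item is chosen as best versus worst, relative to how often it appeared, as in Kiritchenko and Mohammad (2017). A minimal sketch; the function name and the toy annotations over our three model names are illustrative:

```python
from collections import Counter

def bws_scores(annotations, items):
    """Best-worst scaling scores: for each item,
    (#times chosen best - #times chosen worst) / #times shown.
    `annotations` is a list of (shown_items, best, worst) tuples."""
    best, worst, appeared = Counter(), Counter(), Counter()
    for shown, b, w in annotations:
        for item in shown:
            appeared[item] += 1
        best[b] += 1
        worst[w] += 1
    return {item: (best[item] - worst[item]) / appeared[item]
            for item in items}

# Two toy annotations over the three system outputs.
anns = [
    (("Seq2Seq", "HRED", "THRED"), "THRED", "Seq2Seq"),
    (("Seq2Seq", "HRED", "THRED"), "THRED", "HRED"),
]
print(bws_scores(anns, ["Seq2Seq", "HRED", "THRED"]))
```

The resulting scores lie in [-1, 1] and induce a full ranking of the systems from a set of best/worst choices.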
Each task includes 50 randomly sampled conversations from the test set in our corpus along with generated responses from the three models and the ground truth (reference utterance). For each task, we collected ratings from 40 workers with Master qualifications through Amazon Mechanical Turk.

RQ1: What is the effect of experiment design on the reliability of human ratings?
We use the intraclass correlation coefficient (ICC) to measure reliability across multiple raters (Shrout and Fleiss, 1979). To compare the scores obtained from the magnitude estimation experiments with the ratings from the Likert-scale task, we normalize the magnitude estimation scores on a logarithmic scale, as suggested by Bard et al. (1996). Table 1 reports the ICC scores for consistency (ICC-C). We observe that magnitude estimation with anchors (RME or BME) results in more reliable ratings than Likert scales or best-worst ranking (BWS).
Table 2: ICC scores by participants' prior experience evaluating dialogue system output. The top half shows participants with prior experience and the bottom half those with no prior experience. All values are statistically significant at p < 0.001 except those indicated by †.
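The consistency form of the ICC can be computed from a two-way ANOVA decomposition of the targets-by-raters rating matrix, following the Shrout and Fleiss (1979) formulas. A minimal sketch of the single-rater consistency coefficient ICC(C,1); this is an illustration of the statistic, not the authors' analysis code:

```python
import numpy as np

def icc_consistency(ratings):
    """ICC(C,1): two-way model, consistency, single rater.
    `ratings` is an (n_targets, k_raters) array."""
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    # Two-way ANOVA sums of squares.
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()   # targets
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()   # raters
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols # residual
    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)

# Two raters who agree perfectly up to a constant offset are
# perfectly consistent: ICC-C = 1 even though their raw scores differ.
r = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])
print(icc_consistency(r))
```

Because the consistency form removes the rater main effect, it rewards raters who rank the targets the same way even when they use different regions of the scale, which is what matters when comparing scale designs.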
RQ2: Does prior experience of evaluating dialogue system output affect the reliability of ratings?
We asked each rater additional questions at the end of the task, indicating whether or not they had prior experience taking part in studies involving evaluation of dialogue system output. Table 2 shows the reliability of the ratings based on participants' prior experience of taking part in studies evaluating conversational responses. We find that participants who have not taken part in prior studies are more consistent and have higher agreement scores than participants who have prior experience.

Conclusion
We presented a systematic experiment with four experiment conditions for evaluating the output of dialogue systems. Unlike prior work, where a similar study was conducted with output from goal-oriented systems (Novikova et al., 2018), our study focuses on evaluating output in open-domain settings. We find that using continuous scales to obtain crowdsourced ratings provides more consistent and reliable ratings than Likert scales or best-worst scaling. We also find that a lack of prior experience in evaluating open-domain dialogue system output results in more reliable ratings; one potential explanation is that experienced workers may bring preconceived notions from their past participation. Our findings have implications for how best to design surveys to obtain human judgments of NLG output.