Evaluating the State-of-the-Art of End-to-End Natural Language Generation: The E2E NLG Challenge

This paper provides a detailed summary of the first shared task on End-to-End Natural Language Generation (NLG) and identifies avenues for future research based on the results. This shared task aimed to assess whether recent end-to-end NLG systems can generate more complex output by learning from datasets containing higher lexical richness, syntactic complexity and diverse discourse phenomena. We compare 62 systems submitted by 17 institutions, covering a wide range of approaches, including machine learning architectures -- with the majority implementing sequence-to-sequence models (seq2seq) -- as well as systems based on grammatical rules and templates. Seq2seq-based systems have demonstrated a great potential for NLG in the challenge. We find that seq2seq systems generally score high in terms of word-overlap metrics and human evaluations of naturalness -- with the winning SLUG system (Juraska et al. 2018) being seq2seq-based. However, vanilla seq2seq models often fail to correctly express a given meaning representation if they lack a strong semantic control mechanism applied during decoding. Moreover, seq2seq models can be outperformed by hand-engineered systems in terms of overall quality, as well as complexity, length and diversity of outputs.

M A N U S C R I P T M A N U S C R I P T

Introduction
This paper provides a comprehensive final report and extended analysis of the first shared task on End-to-End (E2E) Natural Language Generation (NLG), substantially extending previous reports (Novikova et al., 2017b; Dušek et al., 2018). In addition to this previous work, we provide a corrected and extended evaluation of the training dataset, as well as a detailed discussion of how current state-of-the-art systems address E2E generation challenges, including semantic accuracy and diversity of outputs, and a comparison of techniques used by the submitted systems with systems outside the competition. We then include a substantially expanded evaluation of the systems using novel automatic metrics, accounting for output complexity, diversity and semantic correctness. In addition, we provide an analysis of system output similarity and confirm that systems using similar techniques, e.g. seq2seq, produce similar outputs. We also provide a detailed error analysis with examples of system outputs. This extended evaluation allows us to reach more in-depth insights about the strengths and weaknesses of end-to-end generation systems. Finally, we discuss directions for future work with respect to end-to-end generation, as well as NLG evaluation in general. In addition, this paper accompanies a release of all the participating systems' outputs on the test set along with the human ratings collected in the evaluation campaign.
Shared challenges have become an established way of pushing research boundaries in the field of Natural Language Processing, with NLG benchmarking tasks running since 2007 (Belz and Gatt, 2007). These previous shared tasks have demonstrated that large-scale, comparative evaluations are vital for identifying future research challenges in NLG (Belz and Hastie, 2014). The E2E NLG shared task is novel in that it poses new challenges for recent end-to-end, data-driven NLG systems. This type of system promises rapid development of NLG components in new domains by reducing annotation effort: such systems jointly learn sentence planning and surface realisation from non-aligned data, e.g. (Dušek and Jurčíček, 2015; Wen et al., 2015b; Mei et al., 2016; Wen et al., 2016; Sharma et al., 2016a; Dušek and Jurčíček, 2016a; Lampouras and Vlachos, 2016). As such, these approaches do not require costly semantic alignment between meaning representations (MRs) and the corresponding natural language reference texts (also referred to as "ground truths" or "targets"), but they are trained on parallel datasets, which can be collected in sufficient quality and quantity using effective crowdsourcing techniques, e.g. .
So far, end-to-end approaches to NLG have been limited to small, delexicalised datasets, e.g. BAGEL, SF Hotels/Restaurants (Wen et al., 2015b), or RoboCup (Chen and Mooney, 2008). Therefore, end-to-end methods have not been able to replicate the rich dialogue and discourse phenomena targeted by previous rule-based and statistical approaches for language generation in dialogue, e.g. (Stent et al., 2004; Mairesse and Walker, 2007; Rieser and Lemon, 2009). In this paper, we describe a large-scale shared task based on a new crowdsourced dataset of 50k instances in the restaurant domain (see Section 3). We show that the dataset poses new challenges, such as open vocabulary, complex syntactic structures and diverse discourse phenomena, as described in Section 4. Our shared task aims to assess whether the novel end-to-end NLG systems are able to produce more complex outputs given a larger and richer training dataset. We received 62 system submissions by 17 institutions from 11 countries for the E2E NLG Challenge, with about 1/3 of these submissions coming from industry, as summarised in Section 5. We consider this level of participation an unexpected success, which underlines the timeliness of this task 1 and allows us to reach general conclusions and issue recommendations on the suitability of different methods.
We analyse how the submitted systems address the challenges posed by the dataset in Section 6, and we evaluate the submitted systems by comparing them to a challenging baseline using automatic evaluation metrics (including novel text-based measures) as well as human evaluation (see Section 7). Note that, while there are previous studies comparing a limited number of end-to-end NLG approaches (Novikova et al., 2017a; Wiseman et al., 2017; Gardent et al., 2017a), this is the first research to evaluate novel end-to-end generation at scale using human assessment.
Figure 1 (excerpt). Reference: The Wrestlers offers competitive prices, but isn't rated highly by customers.
Our results in Section 8 show a discrepancy between data-driven seq2seq models and template- and rule-based systems. While seq2seq models generally score high on word-overlap similarity measures and human rankings of naturalness, manually engineered systems score better than some seq2seq systems in terms of overall quality, as well as diversity and complexity of generated outputs. In Section 9, we conclude by laying out challenges for future shared tasks in this area. We also release a new dataset of 36k system outputs paired with user ratings, which will enable novel research on automatic quality estimation for NLG (Specia et al., 2010; Dušek et al., 2017; Ueffing et al., 2018; Kann et al., 2018; Tian et al., 2018). All data and scripts associated with the challenge, as well as technical descriptions of participating systems, are available at the following URL:

http://www.macs.hw.ac.uk/InteractionLab/E2E/

Domain and Task
In general, the task of NLG is to convert an input MR into a natural language utterance consisting of one or more sentences. In this paper, we focus on the case where an end-to-end data-driven generator is trained from simple pairs of MRs and reference texts, without fine-grained alignments between elements of the MR and words or phrases in the reference texts, as in, e.g. (Dušek and Jurčíček, 2015; Wen et al., 2015b).
An example pair of an MR and a reference text is shown in Figure 1. We focus on restaurant recommendations in our experiments, which have previously been widely explored in dialogue systems research, e.g. (Henderson et al., 2014; Wen et al., 2017). However, our E2E dataset is substantially bigger and more complex than previous NLG training datasets for this domain (Wen et al., 2015b) (see Section 4), which allows us to assess whether NLG systems are able to learn to produce more varied and complex utterances given enough training examples (cf. Section 8).
For the input representation, we use a format commonly found in task-oriented domain-specific spoken dialogue systems: unordered sets of attributes (slots) and their values, e.g. (Young et al., 2010; Liu and Lane, 2016). 2 The list of possible attributes used in the MRs in our dataset with example values is shown in Table 1.

Data Collection Procedure
In order to maximise the chances for data-driven end-to-end systems of producing high quality output, we aim to provide training data in sufficient quality and quantity. We turned to crowdsourcing to collect training data in large enough quantities.
We used the CrowdFlower platform 3 to recruit workers. Previously, crowdsourcing has mainly been used for evaluation in the NLG community, e.g. (Rieser et al., 2014; Dethlefs et al., 2012). However, recent efforts in corpus creation via crowdsourcing have proven to be successful in related tasks. For example, Zaidan and Callison-Burch (2011) showed that crowdsourcing can result in datasets of comparable quality to those created by professional translators given appropriate quality control methods. Mairesse et al. (2010) demonstrate that crowd workers can produce aligned natural language descriptions from abstract MRs for NLG, a method which also has shown success in related NLP tasks, such as spoken dialogue systems (Wang et al., 2012) or semantic parsing (Wang et al., 2015). More recently, data-driven NLG systems, such as (Wen et al., 2015a) and (Dušek and Jurčíček, 2016), have relied on crowdsourcing for collecting training data.
2 Most dialogue systems also include a general intent of the utterance, such as inform, confirm, or request (Wen et al., 2015b; Liu and Lane, 2016). Since our task is focussed on recommendations, this intent would be recommend/inform for all our data, and we can therefore disregard it.
3 The CrowdFlower platform was renamed to FigureEight after our study was completed. See https://www.figure-eight.com/.
When crowdsourcing corpora for training NLG systems, i.e. eliciting natural language paraphrases for given MRs from workers, the following main challenges arise:
1. How to ensure the required quality of the collected data?
2. What types of meaning representations can elicit spontaneous, natural and varied data from crowd workers?
In an attempt to address both challenges before collecting the main training dataset for the E2E NLG challenge, we ran a small-scale pre-study published in . We briefly summarise the results of this study in this section and apply the successful techniques to the whole dataset.
For the pre-study, we prepared a subset of 75 distinct MRs, consisting of three, five or eight attributes from our domain (see Table 1) and their corresponding values in order to evaluate MRs with different complexities. 4 We then implemented several automatic validation procedures for filtering the crowdsourced data in order to address (1), see Section 3.1. To address (2), we explored the trade-off between semantic expressiveness of the MR and the quality of crowdsourced utterances elicited for the different semantic representations. In particular, we investigated translating MRs into pictorial representations as used in, e.g. (Williams and Young, 2007; Black et al., 2011) for evaluating spoken dialogue systems (see Section 3.2). In the remainder of this section, we first describe the detailed setup used to crowdsource our data (Section 3.3) and then finally evaluate the pre-study by comparing pictorial MRs to text-based MRs used by previous crowdsourcing work (Wang et al., 2012) in Section 3.4.


Automatic Validation Measures
We used two simple methods to check the quality of crowd workers on CrowdFlower. First, we only select workers that are likely to be native speakers of English, following Sprouse (2011) and Callison-Burch and Dredze (2010). We use IP addresses to ensure that workers are located in one of three English-speaking countries: Canada, the United Kingdom, or the United States. In addition, we included a requirement that "Participants must be native speakers of British or American English" both in the caption of the task listed on CrowdFlower and in the task instructions. Second, we check whether workers spend at least 20 seconds to complete a page of work. This is a standard CrowdFlower option to control the quality of contributions, and it ensures that the contributor is removed from the job if they complete the task too fast.
We also check the quality of the natural language texts produced by crowd workers for a given MR. In particular, we use three JavaScript validators to ensure that the submitted utterances are well-formed English sentences:
1. We check if the ready-to-submit utterance only contains legal characters, i.e. letters, numbers and symbols ", ' . : ; £".
2. We check whether the submitted text is not shorter than the required minimal length, which is an approximation of the total number of characters used for all attribute values in a given MR, as calculated by Eq. 1:
\[ \mathit{length}_{\min} = \#\,\text{MR characters} - \#\,\text{MR attributes} \times 10 \qquad (1) \]
Here, # MR characters is the total number of characters in the given MR; # MR attributes is the number of attributes in the given MR; and 10 is an average length of an attribute name plus two associated square brackets.
3. We check that workers do not submit the same utterance several times.
We ensured by manually checking a small number of initial trial tasks that these automatic validation methods were able to correctly identify and reject 100% of bad submissions.
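The three utterance checks above can be illustrated with a short sketch. The original validators were written in JavaScript and are not reproduced here; the Python version below is only an approximation under stated assumptions: the legal character set is read as letters, digits, spaces and the symbols , ' . : ; £, duplicates are detected after lowercasing and whitespace normalisation, and all function names are hypothetical.

```python
import re

# Assumed reading of the legal character set: letters, digits, spaces
# and the symbols , ' . : ; £ (the original validators were JavaScript).
LEGAL = re.compile(r"[A-Za-z0-9 ,'.:;£]+")


def has_only_legal_chars(utterance: str) -> bool:
    """Check 1: the utterance consists of legal characters only."""
    return LEGAL.fullmatch(utterance) is not None


def min_length(mr: str, n_attributes: int) -> int:
    """Eq. 1: MR characters minus about 10 characters per attribute
    (average attribute name plus two square brackets)."""
    return len(mr) - 10 * n_attributes


def is_long_enough(utterance: str, mr: str, n_attributes: int) -> bool:
    """Check 2: the utterance is not shorter than the minimal length."""
    return len(utterance) >= min_length(mr, n_attributes)


def is_new(utterance: str, seen: set) -> bool:
    """Check 3: the utterance was not submitted before.
    Normalisation (lowercase, collapsed whitespace) is an assumption."""
    key = " ".join(utterance.lower().split())
    if key in seen:
        return False
    seen.add(key)
    return True


# Illustrative MR in the textual format and a candidate utterance.
mr = "name[Loch Fyne], eatType[restaurant], near[The Rice Boat]"
utterance = "Loch Fyne is a restaurant near The Rice Boat."
seen = set()
print(has_only_legal_chars(utterance),
      is_long_enough(utterance, mr, 3),
      is_new(utterance, seen))
```

Keeping the three checks as separate functions mirrors the description above, where each validator can reject a submission independently of the others.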


Meaning Representations: Pictures and Text
In previous crowdsourcing tasks involving MRs, these were typically presented to workers in a textual form of dialogue acts , such as the following:
inform (type=hotel, pricerange=expensive)
However, there is a limit in the semantic complexity that crowd workers can handle when using this type of textual/logical descriptions of dialogue acts . Also, Wang et al. (2012) observed that the chosen semantic formalism influences the workers' language, i.e. crowd workers are primed by the words/tokens and ordering used in the MR. Therefore, in contrast to previous work (Wen et al., 2015a; Dušek and Jurčíček, 2016), we explore the usage of different modalities of meaning representation:
• Textual/logical MRs appear as a list of comma-separated attribute-value pairs, where attribute values are shown in square brackets after each attribute (see Figures 1 and 2). The order of attributes is randomised so that crowd workers are not primed by the ordering used in the MRs (Wang et al., 2012).
• Pictorial MRs are semi-automatically generated pictures with a combination of icons corresponding to the individual attributes (see Figure 2). The icons are located on a background showing a map of a city, which makes it possible to represent the meaning of the attributes area and near.
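As an illustration of the textual MR format, the following sketch parses a comma-separated list of attribute[value] pairs and renders it back with a randomised attribute order. It is not the code used in the data collection; the function names and the example MR are hypothetical.

```python
import random
import re

# One attribute-value pair: attribute name followed by its value in brackets.
PAIR = re.compile(r"([^,\[\]]+)\[([^\]]*)\]")


def parse_mr(mr: str) -> dict:
    """Parse 'attr[value], attr[value]' into an attribute -> value mapping."""
    return {m.group(1).strip(): m.group(2).strip() for m in PAIR.finditer(mr)}


def render_mr(slots: dict, rng: random.Random) -> str:
    """Render the MR with attributes in random order to avoid ordering priming."""
    items = list(slots.items())
    rng.shuffle(items)
    return ", ".join(f"{attr}[{value}]" for attr, value in items)


slots = parse_mr("name[Loch Fyne], eatType[restaurant], near[The Rice Boat]")
print(render_mr(slots, random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the shuffling reproducible, which is convenient when the same MR has to be shown to several workers in different but loggable orders.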

Data Collection Setup
We set up the data collection tasks on the CrowdFlower platform, using the automatic checks described in Section 3.1 and using both pictorial and textual MRs as input (see Section 3.2). For this pre-study, we collected 1133 distinct utterances from the 75 distinct/unique MRs we prepared. 744 utterances were elicited using the textual MRs, and 498 utterances were elicited using the pictorial MRs. The data collected in the pre-study are freely available for download. 5 We later used the same CrowdFlower setup to collect the whole E2E NLG dataset (see Section 4).

In terms of financial compensation, crowd workers were paid the standard pay on CrowdFlower, which is $0.02 per page (where each page contained 1 MR). Workers were expected to spend about 20 seconds per page. Participants were allowed to complete up to 20 pages, i.e. create utterances for up to 20 MRs. Mason and Watts (2010) found in their study of financial incentives on Mechanical Turk (counter-intuitively) that increasing the amount of compensation for a particular task does not tend to improve the quality of the results. Furthermore, Callison-Burch and Dredze (2010) observed that there can be an inverse relationship between the amount of payment and the quality of work, because it may be more tempting for crowd workers to cheat on high-paying tasks if they do not have the skills to complete them. Following these findings, we did not increase the payment for our task over the standard level.

Results and Discussion
We analysed the collected natural language reference texts, focussing on textual versus pictorial MRs and their effects on objective measures, such as time taken to collect the data and length of an utterance, and human evaluations of the reference texts collected under the different conditions. Results in full detail can be found in ; here we only summarise the main findings. The data analysis showed that:
• There is no significant difference in the time taken to collect data with pictorial vs. textual MRs.
• The average length of a collected reference text, both in terms of number of characters and number of sentences, depends mainly on the number of attributes associated with the MR, rather than on whether pictures or text were used.
• Compared to textual MRs, pictorial MRs elicit texts that are significantly less similar to the underlying MR in terms of semantic text similarity (Han et al., 2013). We assume that this is because pictorial MRs are less likely to prime the crowd workers in terms of their lexical choices.
• The human evaluation revealed that reference texts produced from pictorial MRs are rated as significantly (p < 0.01) more informative than those produced from textual MRs. Equally, utterances produced from pictorial MRs were considered to be significantly (p < 0.001) more natural and better phrased than utterances collected with textual MRs. 6
This shows that pictorial MRs have specific benefits for elicitation of NLG data from crowd workers. This may be because the lack of priming by lexical tokens in the MRs leads the crowd workers to produce more spontaneous and natural language, with more variability. As a concrete example of this phenomenon from the collected data, consider the first MR in Figure 2. The textual version of this MR elicited utterances such as "Loch Fyne is a family friendly restaurant serving cheap Japanese food." whereas the pictorial MR elicited e.g. "Serving low cost Japanese style cuisine, Loch Fyne caters for everyone, including families with small children."
Pictorial stimuli have also been used in other, related NLP tasks, such as crowdsourced evaluations of dialogue systems, e.g. (Williams and Young, 2007; Black et al., 2011). Williams and Young (2007), for example, used pictures to set dialogue goals for users (e.g. to find an expensive Italian restaurant in the town centre). However, no analysis was performed regarding the suitability of such representations. This experiment therefore has a bearing on the general issue of human natural language responses to pictorial task stimuli, and shows for example that pictorial task presentations can elicit more natural variability in user inputs to a dialogue system.
6 Please see  for a definition of informativeness, naturalness and phrasing.
Of course, there is a limit in the meaning complexity that pictures can express.
We observed that pictorial MRs tend to introduce more noise. In particular, crowd workers tend to omit information, such as eatType = restaurant, which is particularly hard to visualise. Finally, producing pictorial MRs is a semi-automatic process, which is expensive to run at large scale.
Based on these findings, we decided to use pictorial MRs to collect 20% of the full dataset and textual MRs for the rest of the data in order to keep noise and production costs low while increasing diversity. To further increase the data quality and diversity, we collected multiple references per MR to help NLG systems deal with potential noise in the data.

The E2E NLG dataset
Using the procedure described in Section 3, we crowdsourced a large dataset of 50k instances in the restaurant domain (Novikova et al., 2017b). Our dataset is substantially bigger than previous NLG datasets for dialogue in the restaurant domain, i.e. BAGEL  and SF Restaurants (SFRest) (Wen et al., 2015b), which typically only allowed delexicalised data-driven end-to-end approaches (see Section 4.1).
In addition, we demonstrate that our data is also more challenging given its lexical richness, syntactic complexity and diverse discourse phenomena. Following an approach suggested by Perez-Beltrachini and Gardent (2017)

language texts including all restaurant names. This is the default form for the E2E set; small postprocessing steps were taken for the other two sets to achieve a compatible format. 8 To obtain the delexicalised versions, we replaced with placeholders (e.g. "Xslot") most slot values from open sets that appear verbatim in the data: restaurant names, area names, addresses, and numbers. 9 Since the E2E and BAGEL datasets contain only restaurant recommendations, i.e. cases where the system is providing information (inform dialogue acts), whereas SFRest also includes system questions, confirmations, and greetings, we also created a subset of SFRest dubbed SFRest-inf with only inform instances for a fairer comparison.
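The delexicalisation step described above can be sketched as a verbatim string replacement. This is a minimal illustration, not the script used for the analysis: the placeholder pattern ("X" followed by the slot name), the choice of open slots, and the function name are assumptions, and addresses and numbers are left out for brevity.

```python
def delexicalise(text: str, slots: dict, open_slots=("name", "near", "area")) -> str:
    """Replace open-set slot values that appear verbatim in the text
    with placeholders of the assumed form X<slot>."""
    # Replace longer values first so that a value contained in another
    # value does not break the longer match.
    candidates = [(s, slots[s]) for s in open_slots if slots.get(s)]
    for slot, value in sorted(candidates, key=lambda sv: -len(sv[1])):
        text = text.replace(value, "X" + slot)
    return text


slots = {"name": "Loch Fyne", "eatType": "restaurant", "near": "The Rice Boat"}
print(delexicalise("Loch Fyne is a restaurant near The Rice Boat.", slots))
```

Values that workers paraphrased rather than copied are left untouched by this approach, which is why only values appearing verbatim in the data can be delexicalised.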
We processed the datasets using the MorphoDiTa part-of-speech tagger (Straková et al., 2014) to identify tokens, words (as opposed to punctuation tokens) and sentence boundaries. We used the same tagger to preprocess our data for lexical and syntactic complexity analysis.
Table 2 summarises the main size statistics of all three datasets, plus the inform-only portion of SFRest. The E2E dataset is significantly larger than the other sets in terms of the total number of different MRs, the total number of data instances (i.e. MR-reference pairs), and especially in terms of the total amount of text in the human references, which is more than 20 times bigger than the next-biggest SFRest. These differences are even more pronounced if we consider delexicalisation: almost all MRs in the E2E set are distinct even after delexicalisation, while the number of unique MRs is reduced significantly (by more than half) for the other sets. Delexicalisation also seems to have a less significant effect on the reference texts in the E2E set than in the other datasets (cf. the number of delexicalised words vs. the total number of words).
While having more data with a higher number of references per MR makes the E2E data more attractive for statistical approaches and enables learning more robust models, it is also more challenging than previous sets as it contains a larger number of sentences in the human reference texts (up to 6 in our dataset, with an average of 1.54, compared to typically 1-2 for the other sets, which average below 1.1). The sentences themselves are also longer than in the other datasets. This is immediately apparent for SFRest or SFRest-inf, which are up to 40% shorter in terms of words and tokens. BAGEL's sentences are slightly longer than E2E's on average, but this situation is reversed when the sets are delexicalised.
In addition, the input MRs in the E2E dataset are more complex than in the other sets: the average number of slot-value pairs in our set is twice that of SFRest (even if only the more complex inform dialogue acts are considered), and slightly higher than BAGEL.

Size
The dataset is split into training, validation and test sets (in an 82-9-9 ratio, see

Lexical Richness
In order to measure various dimensions of lexical richness in the datasets under comparison, we computed statistics on token/unigram, bigram and trigram counts, and we applied the Lexical Complexity Analyser (Lu, 2012), as shown in Table 4. It is clear that our dataset has a much larger vocabulary: 2x larger than the second largest SFRest, and more than 5x larger if delexicalised versions of the datasets are considered. This directly translates into the number of distinct lemmas and distinct n-grams; the E2E set has almost 10x more distinct trigrams than SFRest, and over 13x more in the delexicalised versions. While the proportion of n-grams only appearing once in the set is slightly lower than in the other datasets, it stays relatively high given the dataset size and narrow domain, and poses a challenging task for end-to-end data-driven approaches.
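The n-gram statistics discussed here can be computed along the following lines. This is a sketch under assumptions: it works on a flat token list (so n-grams may span sentence boundaries), and it takes the proportion of n-grams appearing once relative to the number of distinct n-grams; the exact definitions behind Table 4 may differ.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_stats(tokens, n):
    """Return (number of distinct n-grams, share of distinct n-grams seen once)."""
    counts = Counter(ngrams(tokens, n))
    distinct = len(counts)
    singletons = sum(1 for c in counts.values() if c == 1)
    return distinct, (singletons / distinct if distinct else 0.0)


tokens = "a b a b c".split()
print(ngram_stats(tokens, 2))
```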

because the values are inversely proportional to the dataset size. Therefore, we complement TTR with the more robust measure of mean segmental TTR (MSTTR) (Lu, 2012), which divides the corpus into successive segments of a given length (50 tokens) and then calculates the average TTR of all segments. The higher the value of MSTTR, the more diverse is the measured text. In addition, we measure lexical sophistication (LS2) (Lu, 2012), also known as lexical rareness, which is calculated as the proportion of lexical word types not on the list of 2,000 most frequent words generated from the British National Corpus. Table 4 shows that while the E2E dataset is more sophisticated than SFRest, it is slightly less so compared to BAGEL. However, LS2 numbers on the delexicalised sets show that this is mainly caused by lexical slot values: the delexicalised E2E dataset is almost twice as sophisticated as both SFRest and BAGEL.
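A minimal sketch of MSTTR as described above is given below. It assumes that a trailing segment shorter than the segment length is discarded, which is a common convention but is not stated in the text; the function names are hypothetical, and the Lexical Complexity Analyser may differ in details such as tokenisation.

```python
def ttr(tokens):
    """Type-token ratio: distinct tokens divided by all tokens."""
    return len(set(tokens)) / len(tokens)


def msttr(tokens, segment_len=50):
    """Mean segmental TTR over successive full segments of segment_len tokens.
    A trailing partial segment is dropped (assumption)."""
    segments = [tokens[i:i + segment_len]
                for i in range(0, len(tokens) - segment_len + 1, segment_len)]
    if not segments:
        raise ValueError("text is shorter than one segment")
    return sum(ttr(seg) for seg in segments) / len(segments)


# Small example with a segment length of 4 instead of 50.
tokens = ["a", "b", "a", "b", "c", "d", "e", "f", "g"]
print(msttr(tokens, segment_len=4))
```

Because every segment has the same length, the average is not affected by corpus size in the way plain TTR is, which is the reason for preferring MSTTR when comparing datasets of very different sizes.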
Following Oraby et al. (2018a) and Jagfeld et al. (2018), we also use Shannon entropy (Manning and Schütze, 2000, p. 61ff.) as a measure of lexical diversity in the texts:
\[ H(\mathit{text}) = - \sum_{x \in \mathit{text}} \frac{\mathit{freq}(x)}{\mathit{len}(\mathit{text})} \log_2 \frac{\mathit{freq}(x)}{\mathit{len}(\mathit{text})} \]
Here, x stands for all unique tokens/n-grams, freq stands for the number of occurrences in the text, and len for the total number of tokens/n-grams in the text. We computed entropy over tokens (unigrams), bigrams and trigrams, as shown in Table 4. We can see that the E2E dataset has slightly lower unigram and bigram entropy than SFRest and higher trigram entropy than any other set. However, when delexicalised, the E2E set shows the highest entropy for any n-gram value. Considering that entropy is a logarithmic measure, the difference is substantial for trigrams: 12.1 vs. the closest 10.5 for SFRest, which amounts to about 2.98× higher uncertainty.
We further complement Shannon text entropy with n-gram-language-model-style conditional entropy for next-word prediction (Manning and Schütze, 2000, p. 63ff.), given one previous word (bigram) or two previous words (trigram):
\[ H_{\mathit{cond}}(\mathit{text}) = - \sum_{(c,w) \in \mathit{text}} \frac{\mathit{freq}(c,w)}{\mathit{len}(\mathit{text})} \log_2 \frac{\mathit{freq}(c,w)}{\mathit{freq}(c)} \]
Here, (c, w) stands for all unique n-grams in the text, composed of c (context, all tokens but the last one) and w (the last token). Conditional next-word entropy gives an additional, novel measure of diversity and repetitiveness: the more diverse a text is, the less predictable is the next word given previous word(s); on the other hand, the more repetitive the text, the more predictable is the next word given previous word(s). The values for all the datasets are again shown in Table 4, and they demonstrate clearly that E2E data is much more diverse than SFRest or BAGEL. Note also that lexicalisation has a much smaller effect on this measure. In the delexicalised version, the difference against the closest SFRest (2.446 vs. 1.414) indicates about 2.04× more uncertainty on next-word prediction given two previous words.
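Both entropy measures can be sketched in a few lines. The code below is an illustration rather than the evaluation script: it uses base-2 logarithms (consistent with the uncertainty ratios quoted above), works on a flat token list, and counts a context c only where it is followed by another token, so that the conditional probabilities sum to one; these details are assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def shannon_entropy(tokens, n=1):
    """Entropy (in bits) of the n-gram distribution of the text."""
    grams = ngrams(tokens, n)
    total = len(grams)
    counts = Counter(grams)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def conditional_entropy(tokens, n=2):
    """Next-word entropy (in bits) given the n-1 previous words."""
    grams = ngrams(tokens, n)
    total = len(grams)
    counts = Counter(grams)
    # freq(c): how often each context occurs as the prefix of an n-gram.
    contexts = Counter(g[:-1] for g in grams)
    return -sum(c / total * math.log2(c / contexts[g[:-1]])
                for g, c in counts.items())


tokens = "a b a c".split()
print(shannon_entropy(tokens), conditional_entropy(tokens, n=2))
```

Since both measures are in bits, a difference of d between two datasets corresponds to a factor of 2 ** d in uncertainty; for instance, 2 ** (2.446 - 1.414) is approximately 2.04, matching the ratio reported for delexicalised trigram conditional entropy.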

Syntactic Complexity
We used the D-Level Analyser (Lu, 2009) to evaluate the syntactic complexity of human references in our data using the revised D-Level Scale (Covington et al., 2006).

We used the syntactic constituency parser of Collins (1997) to preprocess the sentences for the D-Level Analyser. 11 The D-Level scale has eight levels of syntactic complexity, where levels 0 and 1 include simple or incomplete sentences and higher levels include sentences with more complex structures, e.g. sentences joined by a subordinating conjunction, more than one level of embedding etc. Figure 3 shows the D-Level distribution in all three datasets.
The largest proportion of the datasets is composed of simple sentences (levels 0 and 1), but the proportion of simple texts is much lower for the E2E NLG dataset (46%) compared to others (59-66%). Examples of simple sentences in our dataset include: "The Vaults is an Indian restaurant", or "The Loch Fyne is a moderate priced family restaurant".
The majority of our data, however, contains more complex, varied syntactic structures, including phenomena explicitly modelled by early statistical approaches to NLG (Walker et al., 2004). For example, clauses may be joined by a coordinating conjunction (level 2), e.g. "Cocum is a very expensive restaurant but the quality is great". There are 14% level-2 sentences in the E2E dataset; BAGEL only has 7% and SFRest 9%, but inform MRs in SFRest contain a similar proportion to our set. Level 3 sentences in our domain are mainly those with object-modifying relative clauses, e.g. "There is a pub called Strada which serves Italian food." The E2E dataset contains 18% level-3 sentences, similar to BAGEL but more than SFRest's 12% (13% in inform MRs). Levels 4-5 are not very frequent in any of the datasets. Sentences may contain verbal gerund (-ing) phrases (level 4), either in addition to previously discussed structures or separately, e.g. "The coffee shop Wildwood has fairly priced food, while being in the same vicinity as the Ranch" or "The Vaults is a family-friendly restaurant offering fast food at moderate prices". Subordinate clauses are marked as level 5, e.g. "If you like Japanese food, try the Vaults".
The highest levels of syntactic complexity involve sentences containing referring expressions ("The Golden Curry provides Chinese food in the high price range. It is near the Bakers"), non-finite clauses in adjunct position ("Serving cheap English food, as well as having a coffee shop, the Golden Palace has an average customer rating and is located along the riverside") or sentences with multiple embedded structures from previous levels. As Figure 3 shows, our dataset has a substantially higher proportion of level-6-7 sentences: 15%, compared to 7% for BAGEL and 8% for SFRest (11% in inform MRs).
Table 5 (counts out of 50 sampled MR-reference pairs per dataset; column labels not recovered):
Fully covered       30  47  50
Missing content     11   0   0
Additional content   9   3   0
On average, sentences in the E2E dataset are much more syntactically complex than in the other datasets under comparison: the mean D-Level for E2E data is 2.17, compared to BAGEL's 1.32 and SFRest's 1.25 (1.57 for inform-only MRs).

Attribute Coverage
Our crowd workers were asked to verbalise all information from the MR; however, they were not penalised if they skipped an attribute (cf. Section 3.4). This feature makes generating text from our dataset more challenging as the NLG systems need to deal with a certain amount of noise, i.e. attributes not being verbalised in the human reference texts. In order to measure the extent of this phenomenon, we examined a random sample of 50 MR-reference pairs in all three datasets under comparison. An MR-reference pair was considered "fully covered" if all attribute values present in the MR are verbalised in the reference. It was marked as "additional content" if the reference contains information not present in the MR, and as "missing content" if the MR contains information not present in the reference.
The results of our sample probe in Table 5 indicate that roughly 40% of our data contains either additional or omitted information. In order to help NLG systems account for this variation, we collected multiple references per MR (also see Table 2).

This variation often concerns the attribute-value pair eatType=restaurant, which is either omitted ("Loch Fyne provides French food near The Rice Boat. It is located in riverside and has a low customer rating") or added in case eatType is absent from the MR ("Loch Fyne is a low-rating riverside French restaurant near The Rice Boat"). 12 As discussed in Section 3.4, pictorial MRs might be a possible source of this phenomenon, since eatType=restaurant, eatType=pub, etc. are difficult to illustrate.

Systems in the Competition
The initial idea of the E2E NLG Challenge was first presented in . The interest and active participation in the E2E Challenge has far exceeded our expectations. We received a total of 62 submitted systems by 17 institutions from 11 countries, with about one third of these submissions coming from industry. In accordance with ethical considerations for NLP shared tasks (Parra Escartín et al., 2017), we allowed researchers to withdraw or anonymise their results after obtaining automatic evaluation metrics results (cf. Section 7.1). Two groups from industry withdrew their submissions and one group asked to be anonymised after obtaining automatic evaluation results. A full list of all the remaining submissions is given in Table A.14 in the Appendix (including their automatic metric scores).
We asked each participating team to identify 1-2 primary systems, which resulted in 20 systems by 14 groups. Each primary system is described in a short technical paper (available on the E2E NLG Challenge website) and was evaluated both by automatic metrics and human judges (see Section 7). We compare the primary systems to a baseline system we provided ourselves (see Section 5.1). A detailed overview of all the primary systems is given in Table 6. In the following, we describe the systems in terms of different architectures; see Sections 5.2-5.5.
As TGen does not handle unknown vocabulary well, the sparsely occurring string attributes (see Table 1) name and near are delexicalised (see Section 6.1). The main seq2seq model is trained by minimising cross entropy using the Adam algorithm (Kingma and Ba, 2015) in direct token-by-token generation of surface strings; the reranker is trained to detect the presence of all attributes from the input MR. Based on evaluation on the development part of the E2E dataset using automatic metrics (see Table 7), as well as cursory manual checks, TGen appears to be a strong baseline, capable of generating fluent and relevant outputs in most cases.

Seq2seq-based systems
Systems based on the popular sequence-to-sequence architecture (Sutskever et al., 2014;Bahdanau et al., 2015)  2018b) and TNT2 (Tandon et al., 2018) are using TGen with two different data augmentation techniques (see Section 6.3). Gong (Gong, 2018) trains TGen with fine-tuning by the REINFORCE algorithm (Williams, 1992). Two systems are based on the tf-seq2seq toolkit (Britz et al., 2017): NLE (Agarwal et al., 2018) built a character-to-character seq2seq (using simply characters of the original MR as inputs), TR1 (Schilder et al., 2018) use a regular word-based model. The Adapt system  is based on OpenNMT-py (Klein et al., 2017). It uses pointer networks (a form of a copy mechanism (Vinyals et al., 2015)) and a two-step generation where the first step enriches the input MR for diversity (see Section 6.3).
Several other systems use custom seq2seq implementations. Slug and Slug-alt (Juraska et al., 2018) use an ensemble of two bidirectional LSTM encoders and one convolutional encoder, all paired with an attention LSTM decoder (incl. self-attention).
Harv uses a seq2seq model with multiple additions for MR coverage and diversity (see Sections 6.2 and 6.3). Sheff2's model, on the other hand, is a vanilla seq2seq setup with LSTM cells. Chen (Chen, 2018) presents a seq2seq model with a custom-tailored input data representation: 2-part input embeddings, which divide into slot name and value token embeddings. Zhang applies a seq2seq model with CAEncoder (Zhang et al., 2017), which adds a second layer over a bidirectional encoder with GRU cells (Cho et al., 2014), summarising both directional encoders.

Other data-driven systems
Two groups submitted fully trainable systems that are not based on the seq2seq architecture. First, ZHAW1 and ZHAW2 (Deriu and Cieliebak, 2018) use an RNN language model with semantically conditioned LSTM (SC-LSTM) cells (Wen et al., 2015b) and a 1-hot encoding of input MR slot values. The two system variants differ in the presence of an additional semantic control mechanism (see Section 6.2).
Sheff1  is the only non-neural fully data-driven system submitted to the challenge. It is based on imitation learning using linear classifiers (Crammer et al., 2009) in a two-level generation approach, where the classifiers first select the next slot to be realised and then the corresponding word-by-word realisation of that slot (Lampouras and Vlachos, 2016). The classifiers are trained using the Locally Optimal Learning to Search (LOLS) imitation learning framework (Chang et al., 2015), optimising for BLEU, ROUGE-L, and slot error (cf. Section 7.1).

Rule-based systems
There are two rule-based entries in the E2E challenge: First, the DANGNT system (Nguyen and Tran, 2018) uses a two-step rule-based setup, where the first step determines the appropriate phrases to use for a delexicalised sentence; the second step selects the appropriate phrases to lexicalise slot values. Second, the FORGe1 system (Mille and Dasiopoulou, 2018) is a rule-based pipeline using grammars based on the Meaning-Text Theory (Mel'čuk, 1988). It matches the MR to handcrafted per-slot semantic templates, applies aggregation rules to build sentences, and realises the aggregated sentence structures into surface text.

Template-based systems
Three entries in the E2E challenge are based on traditional template filling. FORGe3 (Mille and Dasiopoulou, 2018) and TR2 (Schilder et al., 2018) take a very similar approach: They mine templates from data by delexicalising slot values. TUDA (Puzikov and Gurevych, 2018), on the other hand, uses templates manually designed by the system authors; the templates are not based on the dataset directly, they are only informed by the data.

Addressing the Challenges
In this section, we focus on how the competing primary systems address specific challenges posed by the task: vocabulary unseen in training (Section 6.1), control of semantic coverage of the input MR (Section 6.2), and producing diverse outputs (Section 6.3). We also include an overview of alternative approaches to addressing these challenges in Section 6.4.

Open Vocabulary
All systems in the challenge have a way of addressing the open vocabulary in the data. In closed-domain setups, slot values are usually the only part of the data where open vocabulary is present, as is the case, e.g., for the name and near slots in our dataset (see Table 1). The common approach to dealing with open vocabulary in NLG systems is to use delexicalisation (Wen et al., 2015b; see also Section 4), i.e. replacing slot values with placeholders at training and generation time (both in input MRs and training sentences). This approach is indeed one of the principles of template-based systems; accordingly, all template-based entries in the E2E Challenge use full delexicalisation of all slot values (except, perhaps, the binary-valued familyFriendly; cf. Table 6). Both rule-based systems also perform full delexicalisation.
The data-driven systems submitted to our challenge mostly opt for partial delexicalisation (see Table 6); the prevailing approach is to delexicalise only the values of the name and near slots, which allows for very simple pre- and postprocessing since these values usually appear verbatim in the outputs. TR1 is the only data-driven system to use a stronger delexicalisation, which also includes the priceRange and customerRating slots. Slug and Slug-alt are the only systems to treat values with different morpho-syntactic properties differently (e.g., a value requiring "an" instead of "a" as an article).
Five of the seq2seq systems in the challenge opted for using no delexicalisation and employ alternative ways of addressing open vocabulary: Adapt, Chen and Harv use a copy mechanism (cf. Section 5.2), which allows the system to copy some of the tokens from the input instead of generating them anew. Zhang operates over sub-word units instead of words; these are determined by the byte-pair encoding algorithm and can combine to create previously unseen words (Sennrich et al., 2016). NLE's seq2seq system operates on the character level.
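To make the partial delexicalisation described above concrete, the following Python sketch replaces the values of the name and near slots with placeholders and restores them after generation. It is only an illustration under stated assumptions: the bracketed MR string format, the placeholder names (X-name, X-near) and the helper functions are hypothetical, and the sketch relies on slot values appearing verbatim in the text. It does not reproduce the preprocessing of any particular submitted system.

```python
import re

# Assumed MR string format: "name[Loch Fyne], food[French], near[The Rice Boat]"
SLOT_RE = re.compile(r"(\w+)\[([^\]]*)\]")


def parse_mr(mr):
    """Parse an MR string into a slot -> value dictionary."""
    return {slot: value for slot, value in SLOT_RE.findall(mr)}


def delexicalise(mr, text, slots=("name", "near")):
    """Replace verbatim occurrences of the chosen slot values by placeholders.

    Returns the delexicalised MR (as a dict) and the delexicalised text.
    """
    values = parse_mr(mr)
    delex_mr = dict(values)
    for slot in slots:
        if slot in values:
            placeholder = "X-" + slot  # hypothetical placeholder naming
            text = text.replace(values[slot], placeholder)
            delex_mr[slot] = placeholder
    return delex_mr, text


def relexicalise(mr, text, slots=("name", "near")):
    """Postprocessing step: put the original slot values back into the output."""
    values = parse_mr(mr)
    for slot in slots:
        if slot in values:
            text = text.replace("X-" + slot, values[slot])
    return text
```

A generator trained on such data only ever sees the placeholders, so unseen restaurant names pose no vocabulary problem; the price is that values with special morpho-syntactic properties (such as the article choice mentioned above) need separate handling.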

Semantic Control
Most of the participating systems explicitly attempt to realise all slots and thus cope with the noise in the training data (cf. Section 4.4). Full realisation is implied for template and rule-based systems as the templates and rules always relate to specific slots and are chosen based on the slots in the input MR. On the other hand, vanilla seq2seq systems have no way of controlling whether all input slots have been realised.
While attention models (Bahdanau et al., 2015) certainly have an influence on this, they are not explicitly trained to attend exactly once to each slot in a vanilla seq2seq setup.
Therefore, most seq2seq systems include an additional tool checking the realised parts of the input MR on the output (cf. Table 6).
The most frequent approach among the E2E submissions is an MR classification reranker (Dušek and Jurčíček, 2016a). Here, the generator first produces multiple out- Zhang adds an attention regularisation loss term to the training process, which attempts to keep the sum of weights close to 1 for each input MR token, similarly to Harv's penalty term. Three systems, Adapt, TR1 and Sheff2, do not use any explicit semantic control mechanism.
The non-seq2seq data-driven systems use specific mechanisms to maintain input MR coverage. ZHAW1 and ZHAW2 are based on SC-LSTM cells (Wen et al., 2015b), which include a special gate that keeps track of slots covered so far in the MR. In addition, ZHAW1 uses convolutional MR classifiers to rerank beam search outputs similarly to most seq2seq systems; however, this classification is also used in an additional loss term during training. The Sheff1 system explicitly decides which slot to verbalise next using a separate slot-level classifier, which is optimised to cover the input MR.
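The general idea of reranking beam search outputs by MR coverage can be sketched in a few lines of Python. This is a simplified illustration, not the implementation used by any submitted system: the trained MR classifier is replaced here by naive verbatim string matching (an assumption that only detects missing slots, not added ones), and the function names and the hypothesis format (text, log-probability) are hypothetical.

```python
def realised_slots(text, mr):
    """Naive stand-in for a trained MR classifier.

    A slot counts as realised if its value occurs verbatim in the text.
    """
    lowered = text.lower()
    return {slot for slot, value in mr.items() if value.lower() in lowered}


def rerank(mr, hypotheses):
    """Rerank beam hypotheses given as (text, log_prob) pairs.

    Hypotheses with fewer missing slots come first; ties are broken by
    the generator's own score (higher log-probability first).
    """
    def sort_key(hypothesis):
        text, log_prob = hypothesis
        missing = len(set(mr) - realised_slots(text, mr))
        return (missing, -log_prob)

    return sorted(hypotheses, key=sort_key)
```

With such a penalty, a hypothesis that the generator itself scores slightly lower can still be selected if it covers more of the input MR.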

Data Augmentation and Diversity
The design of the E2E dataset attempts to provide higher text diversity (see Section 4), and several challenge participants made use of this. Others modified the training set simply to achieve better output quality.
Several systems aim at higher output quality by using data augmentation. TNT1 enriches input MRs by prepending them with the corresponding outputs of the Personage generator (Mairesse and Walker, 2007), with the aim of generating more diverse output. Two teams attempt to increase output diversity by directly modifying the generation process. The ZHAW1 and ZHAW2 systems use a first word control mechanism: they generate outputs starting with all (frequent enough) first words from the training set, then select the final output by sampling. ZHAW1 only samples among semantically correct outputs (see Section 6.2). Adapt takes a different approach, adding a preprocessing step before the main generator, which decides upon specific words that should appear in the output. These are then used to enrich the input MR in the main generation step, providing more diversity in the input. ing dialogue context as well as the input MR to obtain greater diversity.

Evaluation Setup
We evaluated the systems submitted to the E2E challenge using a range of automatic metrics, which we describe in Section 7.1. This includes a novel application of textual measures and a novel usage of standard word-overlap metrics to assess similarity among individual systems. Automatic metrics are popular in NLG (Gkatzia and Mahamood, 2015) because they are cheaper and faster to run than human evaluation. However, sole use of automatic metrics is only sensible if they are known to be sufficiently correlated with human preferences. Recent studies (Novikova et al., 2017a; Reiter, 2018) have demonstrated that this is very often not the case and that automatic metrics only weakly reflect human judgements on system outputs as generated by data-driven NLG. Therefore, we also performed a large-scale crowdsourced human evaluation, as detailed in Section 7.2. For the human evaluation of the 20 primary systems, we address the problem of how to efficiently compare a large number of systems, by:

1. Extending our previous work on rank-based Magnitude Estimation (RankME) and verifying the method at scale;
2. Introducing the data-efficient TrueSkill algorithm (Herbrich et al., 2006; Sakaguchi et al., 2014) to NLG. This allows us to compute an overall ranking by directly comparing the systems, rather than individually assessing them at higher cost, as done by previous NLG challenges (Belz and Hastie, 2014).

Automatic Metrics
We apply two types of automatic metrics: one set assessing the similarity between generated system outputs and natural language references in the corpus using word-overlap-based measures, and another set assessing the complexity and diversity of system outputs using a variety of textual measures.

Word-overlap metrics
For the first set, we selected a range of metrics measuring word-overlap between system output and references, including BLEU and NIST, which are used as standard
NIST (Doddington, 2002) is a version of BLEU with higher weighting for less frequent (i.e., more informative) n-grams and a different length penalty. It uses n ∈ {1, ..., 5}.
CIDEr was primarily designed for generated image captions, but is also applicable for NLG in general. CIDEr is computed as the average cosine similarity between the system output and the reference sentences on the level of n-grams, n ∈ {1, ..., 4}. The importance of the individual n-grams is given by the Term Frequency Inverse Document Frequency (TF-IDF) measure, which weighs an n-gram's frequency in a particular instance against its overall frequency in the whole dataset.
We provided scripts to the challenge participants to run all of these metrics in a simple, easy-to-use way. The scripts are freely available at the following URL: https://github.com/tuetschek/e2e-metrics
In addition to evaluating all NLG systems individually against human-authored reference texts (see Section 8.1), we also apply the same metrics as measures of output similarity among the systems, comparing each system's outputs with all other systems' outputs in place of references (see Section 8.3).

Textual metrics
For the second set of scores, which is intended to measure complexity and diversity in the system outputs, we use the same automatic textual metrics which we used to evaluate the E2E NLG dataset itself (see Sections 4.2 and 4.3), i.e. dimensions of lexical richness, such as lexical sophistication (LS2) and mean segmental type-token ratio (MSTTR), and metrics of syntactic complexity, such as levels of the revised D-level Scale. This allows us both to evaluate the diversity and complexity of system outputs and to establish whether the text characteristics are similar to the training and test sets.
To focus specifically on the style produced by the individual systems, we delexicalised restaurant names in the system outputs before computing textual metrics scores, since restaurant names could skew some of these metrics as they are mostly composed of infrequent nouns (cf. Section 4.2).
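As an illustration of how a segmental lexical richness measure such as MSTTR can be computed, the following Python sketch splits a token sequence into consecutive fixed-size segments and averages the type-token ratio over them. The function name and the choice to discard an incomplete final segment are assumptions of this sketch; the exact tokenisation and segmentation used for the reported scores may differ.

```python
def msttr(tokens, segment_size=50):
    """Mean segmental type-token ratio.

    The token sequence is cut into consecutive segments of `segment_size`
    tokens; an incomplete final segment is discarded (an assumption of
    this sketch). The score is the mean ratio of distinct tokens (types)
    to tokens over all full segments.
    """
    ratios = []
    for start in range(0, len(tokens) - segment_size + 1, segment_size):
        segment = tokens[start:start + segment_size]
        ratios.append(len(set(segment)) / segment_size)
    if not ratios:
        raise ValueError("need at least one full segment")
    return sum(ratios) / len(ratios)
```

Because each segment has the same length, the measure is less sensitive to overall text length than a plain type-token ratio computed over the whole output.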

Human Evaluation
The human evaluation was conducted on the 20 primary systems and the baseline using Rank-based Magnitude Estimation (RankME). In an ordinary (i.e. not rank-based) ME task (Bard et al., 1996), subjects provide a relative rating of an experimental sentence to a reference sentence, which is associated with a pre-set/fixed number. If the target sentence appears twice as good as the reference sentence, for instance, subjects are to multiply the reference score by two; if it appears half as good, they should divide it in half, etc. Rank-based ME extends this idea by asking subjects to provide a relative ranking of several target sentences, i.e. not only to the reference sentence, but also to each other.
Rank-based ME was selected for several reasons. First, its use proved to significantly increase the consistency of human ratings, compared to other data collection methods. Second, it implies the use of continuous scales, i.e. rating scales without numerical labels and without given end points. Recent studies show that continuous scales allow subjects to give more nuanced judgements (Belz and Kow, 2011; Graham et al., 2013; Bojar et al., 2017a). Third, it explores relative ranking of different systems instead of directly assessing quality of each specific system, which makes it more reliable in the environment of a challenge.

The evaluation was conducted using crowdsourcing based on the CrowdFlower/FigureEight platform. Crowd workers were presented with five randomly selected outputs of different systems corresponding to a single MR, and were asked to evaluate and rank these systems from the best to the worst, ties permitted, using the RankME method.
The final evaluation results were produced using the TrueSkill algorithm (Herbrich et al., 2006; Sakaguchi et al., 2014). TrueSkill produces system rankings by gradually updating a Bayesian estimate of each system's capability according to the "surprisal" of pairwise comparisons of individual system outputs. This way, fewer direct comparisons between systems are needed to establish their overall ranking. In previous work, we were able to show that TrueSkill is able to reduce the amount of collected human evaluation data without compromising the final ranking results.
Since the performance of some systems may be very similar and a total ordering would not reflect this, we adopt the practice used in machine translation of presenting a partial ordering into significance clusters established by bootstrap resampling (Bojar et al., 2013, 2014; Sakaguchi et al., 2014). The TrueSkill algorithm is run 200 times, producing slightly different rankings each time as pairs of system outputs for comparison are randomly sampled. This way we can determine the range of ranks where each system is placed 95% of the time or more often. Clusters are then formed of systems whose rank ranges overlap.
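The clustering step can be illustrated with a short Python sketch. It takes the rankings produced by repeated runs as given (the TrueSkill updates themselves are not implemented here), derives a central rank range per system, and groups systems whose ranges overlap. The function names, the use of a symmetric central interval, and the transitive merging of overlapping ranges are assumptions of this sketch rather than a description of the exact procedure used in the challenge.

```python
import math


def rank_ranges(runs, coverage=0.95):
    """Compute a central rank range per system.

    `runs` is a list of dicts mapping system name -> rank (1 = best),
    one dict per resampling run. The returned (low, high) range covers
    at least `coverage` of the runs for each system.
    """
    n = len(runs)
    trim = math.floor(n * (1.0 - coverage) / 2.0)  # runs cut off on each side
    ranges = {}
    for system in runs[0]:
        ranks = sorted(run[system] for run in runs)
        ranges[system] = (ranks[trim], ranks[n - 1 - trim])
    return ranges


def significance_clusters(ranges):
    """Group systems whose rank ranges overlap (merged transitively)."""
    ordered = sorted(ranges, key=lambda system: ranges[system])
    clusters, current, current_high = [], [], None
    for system in ordered:
        low, high = ranges[system]
        if current and low > current_high:
            # No overlap with anything in the current cluster: start a new one.
            clusters.append(current)
            current = []
        current.append(system)
        current_high = high if len(current) == 1 else max(current_high, high)
    if current:
        clusters.append(current)
    return clusters
```

A system whose range is, e.g., (1, 1) forms a cluster of its own, which corresponds to a clear winner.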
Traditionally, human evaluation aims to assess the naturalness (fluency, readability) and informativeness (relevance, correctness, adequacy) of an automatically generated output (Gatt and Krahmer, 2017). Naturalness targets the linguistic quality of the NLG system output; informativeness targets relevance or correctness of the output relative to the input. We decided against explicitly evaluating informativeness since our training instances do not always verbalise all MR attributes (cf. Section 4.4). We therefore only collected separate ranks for quality and naturalness.

Quality: When collecting quality ratings, system outputs were presented to crowd workers together with the corresponding meaning representation, which implies that correctness of the NL utterance relative to the MR should also influence this ranking. The crowd workers were asked: "How do you judge the overall quality of the utterance in terms of its grammatical correctness, fluency, adequacy and other important factors?"
Naturalness: When collecting naturalness ratings, system outputs were presented to crowd workers without the corresponding meaning representation. The crowd workers were asked: "Could the utterance have been produced by a native speaker?"
Ratings of quality and naturalness were collected separately, i.e. in two individual crowdsourcing tasks. Furthermore, when crowd workers were asked to assess naturalness, the MR was not shown to them since it was not necessary for the task. This setup allows us to minimise the correlation between the ratings of naturalness and quality (Callison-Burch et al., 2007).

Results
In this section, we report on the results of the evaluation of all E2E NLG Challenge primary systems, following the evaluation procedures described in Section 7.
We first show the results using automatic metrics: word-overlap-based (Section 8.1) and textual metrics (Section 8.2), as well as automatically computed output similarity between systems (Section 8.3). We then summarise the human evaluation results (Section 8.4), comment on the semantic accuracy of system outputs (Section 8.5) and declare the overall winning system (Section 8.6). Finally, we provide a list of "lessons learnt" in Section 8.7, i.e. observations that we hope will be useful for future NLG system development.

primary system is able to beat it in terms of all metrics, or in terms of the normalised metrics' mean; only Slug comes very close. Several other systems manage to beat TGen in one of the metrics but not in others. Note, however, that many secondary system submissions perform better than the primary ones (and the baseline) with respect to word-overlap metrics (see Table A.14 in the Appendix).
Overall, seq2seq-based systems show the best word-based metric values, followed by Sheff1, a data-driven system based on imitation learning. As expected, attempts to increase output diversity by ZHAW1, ZHAW2, Slug-alt and Adapt result in lowered scores by word-overlap-based metrics. Template-based and rule-based systems mostly score at the bottom of the list. The lowest-scoring systems in terms of word-overlap metrics are the ones of Chen and Sheff2, which tend to produce much shorter outputs than other systems (cf. Section 8.2). This most likely resulted in a severe brevity penalty.
Finally, it must be noted that the results using automatic metrics are quite different from results obtained in human evaluation (see Section 8.4), which confirms previous findings (Novikova et al., 2017a; Reiter, 2018).

Textual Metrics
Table 9 summarises results from a range of textual metrics which aim to assess the complexity and diversity of primary system outputs (cf. Section 7.1). Note that textual metrics have been computed with restaurant names delexicalised (cf. Section 7.1). In addition, we include a comparison to the human references in the test set in order to assess whether systems are able to replicate characteristics of human-produced data. The results in Table 9 show the following:
• Seq2seq-based system outputs are less syntactically complex on average than outputs of other systems (they produce more D-level 0-2 sentences and fewer D-level 6-7 sentences than other architectures).
• The systems seem to show a relatively high variance in syntactic complexity levels, especially with respect to the higher levels; few systems match the distribution of the training and test data. The differences in D-level distributions in the outputs are mostly statistically significant (see Figure A.6 in the Appendix). show among the most distant from other systems. The Gong system mostly produces level 0-2 sentences, and therefore it appears very distant from other systems as well as the most distant system from human references.

• None of the systems reaches the lexical sophistication of the human-authored test set references. The diversity-attempting seq2seq-based Adapt system comes very close, followed by the grammar-based FORGe1 and the TR2 system, which is based on template mining from data. Data-driven systems aiming at higher lexical diversity seem to achieve higher sophistication as well; note the lower performance of Slug-alt, which aims more at syntactic diversity than lexical. For rule-based systems, lexical sophistication is a direct result of the system authors' decisions.
• In terms of MSTTR, highest scores are achieved by template or rule-based systems and by data-driven systems that explicitly aim at greater output diversity (ZHAW1, ZHAW2, Adapt, Slug-alt). Note that MSTTR is typically higher in systems that tend to produce longer outputs, which includes most rule- and template-based systems. We assume that this is due to MSTTR's fixed 50-token window used to segment utterances.
• Most systems produce outputs similar in length to the test set human references. Outputs of rule- and template-based systems tend to be more verbose than those of data-driven systems. The outputs of Zhang, Sheff2 and Chen are much shorter on average than texts in the dataset, which suggests that these systems might not verbalise all the information contained in the MR (cf. Section 8.5).
As for the dataset statistics in Section 4.2, we also computed additional textual measures to assess the diversity/repetitiveness of the generated outputs: number of distinct n-grams, Shannon entropy, and conditional next-word entropy; a selection of these metrics is shown in Table 10. We compare the outputs against the whole test set (multiple references) and a randomly selected single reference per MR from the test set. The results show the following:
• None of the systems is able to produce as much diversity as is contained in a randomly selected human reference; even the most diverse systems lag behind. Adapt comes close in vocabulary size, TR2 is the closest system in terms of entropy and next-word conditional entropy.
• In terms of vocabulary, there is a huge gap between the most diverse Adapt and TR2 systems, and any other system (e.g., the 3rd-ranking ZHAW1 has 3× smaller vocabulary than TR2, and 2.4× smaller ratio of unique trigrams). TR2 demonstrates that mining templates from the training data can lead to very diverse outputs. FORGe3, which uses the same method, also ranks relatively high on vocabulary size and entropy. The diversity produced by Adapt's seq2seq model indicates that the preprocessing step enriching the MRs works effectively (cf. Section 6.3).

• All diversity-attempting data-driven systems (Adapt, ZHAW1, ZHAW2, Harv, TNT1, TNT2, Slug-alt) indeed rank better than most systems not incorporating diversity measures, with TNT1 and TNT2 showing lower gains than the rest of the group. However, template-mining-based systems (TR2, FORGe3) produce outputs of similar or higher diversity with no concentrated effort.
• Outputs of seq2seq-based systems which do not explicitly model diversity (e.g. Gong, Sheff1, TR1, Slug, Chen) indeed show lower diversity scores. The rule-based DANGNT system also ranks very low on diversity, and the TUDA system with handcrafted templates is the least diverse of all.
In summary, few systems are able to approach the complexity and diversity shown in human-authored data. Seq2seq-based systems tend to favour simpler sentences than hand-engineered systems unless diversity control is in place. Vanilla seq2seq and handcrafted templates produce the least diverse outputs; highest diversity is achieved by template mining or explicit diversity control mechanisms.
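The diversity measures used above (number of distinct n-grams, Shannon entropy, and conditional next-word entropy) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: outputs are given as lists of tokens, entropies are measured in bits, and the conditional entropy is conditioned on a single preceding word. The function names are hypothetical, and the exact definitions behind the reported scores may differ in detail.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_ngrams(sentences, n):
    """Number of distinct n-grams over all outputs."""
    return len({gram for sent in sentences for gram in ngrams(sent, n)})


def shannon_entropy(sentences, n=1):
    """Shannon entropy (in bits) of the n-gram distribution."""
    counts = Counter(gram for sent in sentences for gram in ngrams(sent, n))
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def conditional_entropy(sentences):
    """Entropy (in bits) of the next word given the previous word."""
    bigrams = Counter(gram for sent in sentences for gram in ngrams(sent, 2))
    contexts = Counter()
    for (first, _), count in bigrams.items():
        contexts[first] += count
    total = sum(bigrams.values())
    return -sum(
        count / total * math.log2(count / contexts[first])
        for (first, _), count in bigrams.items()
    )
```

Low conditional entropy means that the next word is largely predictable from the previous one, which is what one would expect from heavily templated or repetitive outputs.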

System Output Similarity
In order to assess the similarity of outputs produced by the individual systems, we reused the word-overlap-based metrics applied in the challenge (see Section 7.1). We created all possible pairs of systems and computed word-overlap metrics between each of their outputs for every instance in the test set. As for textual metrics, restaurant names were delexicalised in the system outputs. This process resulted in a table for each of the metrics (see Figure A.5 in the Appendix), with reference systems in rows and tested systems in columns. All five metrics showed a very similar pattern. Figure 4 therefore summarises the results by taking the average of all normalised metrics (cf. Table 8). For comparison, we also measure similarity of system outputs against the reference texts in the test set, as well as a subset of the test set with a single, randomly sampled reference text per MR.
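The pairwise setup described above can be sketched as follows. For brevity, the sketch uses a simple unigram F1 score as a stand-in for the five word-overlap metrics; this substitution, the function names and the input format (one list of output strings per system, aligned by test instance) are assumptions made for illustration only.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """Stand-in overlap metric: F1 over whitespace-separated tokens."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def similarity_matrix(outputs, metric=unigram_f1):
    """Score every system against every other system used as reference.

    `outputs` maps system name -> list of output strings, aligned by test
    instance. Returns matrix[reference_system][tested_system] holding the
    mean metric value over all instances.
    """
    matrix = {}
    for ref_sys, refs in outputs.items():
        matrix[ref_sys] = {}
        for test_sys, cands in outputs.items():
            if test_sys == ref_sys:
                continue
            scores = [metric(c, r) for c, r in zip(cands, refs)]
            matrix[ref_sys][test_sys] = sum(scores) / len(scores)
    return matrix
```

Unlike this symmetric stand-in, metrics such as BLEU are not symmetric, which is why the tables distinguish between reference systems (rows) and tested systems (columns).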
We can see from Figure 4 that all the seq2seq-based system outputs are in general most similar to each other; other data-driven systems also show higher similarity amongst each other. The exception to this rule in case of the Chen and Sheff2 systems can be explained by the brevity of their outputs (cf. Sections 8.1 and 8.2). Systems that aim at output diversity (ZHAW1, ZHAW2, Slug-alt and mainly Adapt) also exhibit lowered similarity of their outputs to those of other systems, which might indicate that their outputs are indeed more original. The outputs of rule-based and template-based systems are markedly less similar to other outputs than those of the data-driven systems. Results with fully lexicalised outputs (i.e. without delexicalising restaurant names) are very similar; the differences are just slightly less pronounced.

Figure 4: Similarity of the systems' outputs as measured by automatic metrics (mean of normalised BLEU, NIST, METEOR, ROUGE-L and CIDEr where one system output is used as reference). Systems are sorted by their architecture. For comparison, we also include metrics values against the full test set with multiple references (test set all) and against a single-reference randomly sampled subset of the test set (test set rand). The table on the right shows mean values of similarity of each system against all other systems (average over columns on the left, excluding the 1st line). System architectures are coded with colours and symbols: ♥ seq2seq, ♦ other data-driven, ♣ rule-based, ♠ template-based.

We can also see that most system outputs, especially those of data-driven methods, are much more similar to each other than they are to a single randomly selected human-authored reference text from the test set. This is to be expected since data-driven methods tend to select more frequent phrasing. Some of the system outputs even show a higher similarity to each other than to the closest matching human references from the test set. This is mainly the case for systems with very similar architectures, which often arrive at identical results (e.g. TGen, TNT1 and TNT2).

Results of Human Evaluation
The results of human evaluation of quality and naturalness are provided in Table 11.
Using the RankME setup described in Section 7.2, we collected 2,979 data points of partial system rankings for quality, where one data point corresponds to one MR and ranked outputs of five randomly selected systems (see Table 13 for examples). From these rankings, a set of 29,790 pairwise output comparisons was produced to be used by the TrueSkill algorithm. This resulted in 1,418 pairwise comparisons per system.
For naturalness, 4,239 data points were collected, which resulted in 42,390 pairwise comparisons, and 2,018 comparisons per system. For each of 630 MRs in the test set, 9.5 systems on average (with a maximum of 14) were compared based on both naturalness and quality of their outputs. That is, using TrueSkill, we were able to reduce the number of required system comparisons by more than half. The CrowdFlower task for collecting human evaluation data was running for 235 hours and cost USD 314 in total.
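The relation between the number of collected rankings and the number of pairwise comparisons follows from simple counting: a ranking of five outputs yields 5·4/2 = 10 unordered pairs. The following Python sketch makes this explicit. The function name, the outcome labels and the example scores are hypothetical; dividing the totals evenly by 21 systems (20 primary systems plus the baseline) and rounding down is an assumption of this sketch that happens to reproduce the per-system figures quoted above.

```python
from itertools import combinations


def pairwise_comparisons(scored_outputs):
    """Turn one ranking (list of (system, score) pairs) into pairwise outcomes."""
    pairs = []
    for (sys_a, score_a), (sys_b, score_b) in combinations(scored_outputs, 2):
        if score_a > score_b:
            outcome = "first"   # sys_a preferred
        elif score_a < score_b:
            outcome = "second"  # sys_b preferred
        else:
            outcome = "tie"
        pairs.append((sys_a, sys_b, outcome))
    return pairs


# One illustrative data point: five systems ranked for a single MR.
example = [("sys1", 100), ("sys2", 100), ("sys3", 70), ("sys4", 50), ("sys5", 20)]
pairs = pairwise_comparisons(example)

quality_pairs = 2979 * len(pairs)      # 2,979 rankings for quality
naturalness_pairs = 4239 * len(pairs)  # 4,239 rankings for naturalness
systems = 21                           # assumed: 20 primary systems + baseline
quality_per_system = quality_pairs // systems
naturalness_per_system = naturalness_pairs // systems
```

Under these assumptions, the totals come out as 29,790 and 42,390 pairwise comparisons, i.e. 1,418 and 2,018 per system.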
Table 11: TrueSkill measurements of quality (left) and naturalness (right) for all primary systems (significance cluster number, TrueSkill value, range of ranks where the system falls in 95% of cases or more, system name). Significance clusters are separated by a dotted line. System architectures are coded with colours and symbols: ♥ seq2seq, ♦ other data-driven, ♣ rule-based, ♠ template-based.

Table 13: Example system outputs with human rankings of quality and a detailed error analysis attempting to interpret the rankings. Each example is shown as ranked for quality by a single crowd worker. The raw RankME scores assigned by the crowd workers are shown; however, note that only relative ranks are used by the TrueSkill algorithm. The outputs within each example are sorted by the score for clarity.

System | Output | Score
Example 3:
[...] | Clowns is a coffee shop near All Bar One with a customer rating of 3 out of 5. | 100
♥ Zhang | Clowns is a coffee shop near All Bar One with a customer rating of 3 out of 5 . | 100
♠ FORGe3 | Clowns is a coffee shop near All Bar One with a rating 3 out of 5. | 70
♦ ZHAW2 | A coffee shop near All Bar One is Clowns. It has a customer rating of 3 out of 5. | 50
♥ Sheff2 | Clowns [...] |
Example 4:
[...] | [...] | 72
♥ Slug-alt | Located near Ranch, The Cricketers is a family-friendly restaurant that is not family-friendly and has a low customer rating. |
♥ Adapt | The Cricketers is a non -family -friendly restaurant located near the Ranch . It has a low customer satisfaction rating . |
♣ FORGe1 | The restaurant The Cricketers is near Ranch. The Cricketers, which does not welcome kids, has a low customer rating. | 65
♠ TUDA | The Cricketers is a restaurant located near Ranch. It has a low customer rating. It is not family friendly. | 56

For the purpose of error analysis, the rankings may be interpreted in the following way (note that quality rankings include both relevance and fluency):
1. Gong and FORGe3 verbalise all attributes but the latter is less fluent. Harv misses the family-friendliness, Zhang misses the price information. Sheff1 misses family-friendliness and is not fluent.
2. TR2 and Slug-alt provide perfect and fluent information but Slug-alt misses the full stop. Gong does not specify the type of place while TGen adds irrelevant price range information. Sheff2 indicates a wrong venue type.
3. Sheff1 and Zhang provide perfect and fluent information, FORGe3 is less fluent and ZHAW2 even less than that. Sheff2 indicates a wrong venue type and misses the customer rating information.
4. Slug provides perfect and fluent information. Slug-alt is repetitive and Adapt was probably penalised for lack of detokenisation. FORGe1 and TUDA provide complete information but are not very fluent.

Human Ratings
We produced the final ranking of all systems for both quality and naturalness using the TrueSkill algorithm with bootstrap resampling as described in Section 7.2. This resulted in clusters of systems with significantly different rankings for both naturalness and quality. 23 In both cases, there are clear winning systems (i.e., the 1st cluster only has one member): Sheff2 for naturalness and Slug for quality. The 2nd clusters are quite large for both criteria: they contain 13 and 11 systems, respectively, and they include the baseline TGen system in both cases.
The results indicate that seq2seq systems dominate in terms of naturalness of their outputs, while most systems of other architectures score lower. The bottom cluster is filled with template-based systems. The winning Sheff2 system is seq2seq-based, and the 2nd cluster mostly includes other seq2seq-based systems. The results also indicate that diversity-attempting systems are penalised in naturalness: Slug-alt, ZHAW1 and ZHAW2 placed in the 3rd cluster, and Adapt in the 4th.
The results for quality 24 are, however, more mixed in terms of architectures, with none of them clearly prevailing. The 2nd, most populous cluster includes all different architecture types. The winner is the seq2seq-based system Slug. However, the bottom two clusters are also composed of seq2seq-based systems. This shows the importance of an explicit semantic control mechanism applied at decoding time in seq2seq systems: none of the systems in the bottom two clusters applies such a mechanism, whereas all better-ranking seq2seq systems do (cf. Section 6.2). 25 Note that the bottom clusters also include the Sheff2 system, which scored top for naturalness. With the exception of the diversity-attempting Adapt, these systems tend to produce the shortest outputs (see Table 9), which indicates that they are penalised for too often failing to realise parts of the input MR (cf. Section 8.5).
Finally, we computed the correlation of word-overlap metrics with the human judgements of both quality and naturalness for all the systems. All of the correlations are weak (< 0.2, see Tables A.16 and A.15 in the Appendix), which confirms earlier findings of Novikova et al. (2017a) and explains the discrepancy between system performances in terms of automatic and human evaluation.

Error Analysis: Input MR Coverage
In order to clarify the mixed quality evaluation results, we attempted to estimate the number of semantic errors produced by the individual systems in two ways. First, we ran a specific crowdsourced evaluation of systems' coverage of the input MR, where crowd workers were asked to manually annotate missed and added information with respect to the input MR (see Table 12). We did not check for workers' correctness here, and thus we can expect some noise. Nevertheless, the annotations confirm that the systems rated low on quality, most of which also produce very short outputs, are the ones with the lowest proportion of perfectly covered MRs (Chen, Sheff2, Zhang, TR1 and Adapt).
Second, following Reed et al. (2018), we implemented a script to estimate the coverage automatically based on regular expression matching. 26 This allowed us to produce an independent estimate of the proportion of outputs with missing or added information (see Table 12). Following Reed et al. (2018), we also computed the slot error rate (SER) using this pattern-matching approach and the following formula: 27

\[
\mathrm{SER} = \frac{\#\,\text{missed} + \#\,\text{added} + \#\,\text{value errors} + \#\,\text{repetitions}}{\#\,\text{slots}} \tag{4}
\]

Here, missed stands for slot values missing from the realisations, added denotes additional information not present in the MR (hallucinations), value errors denote correctly realised slots with incorrect values (e.g., specifying low price range instead of high), and repetitions are values mentioned repeatedly in the outputs; slots is the total number of slots/attributes in the test set. SER thus amounts to the proportion of erroneously realised slots. While the absolute numbers for perfectly covered MRs are different from those estimated by humans, they mostly follow the same trend. The SER value is highly correlated with the proportion of perfectly covered MRs.

26 We based the patterns for the individual attribute-value pairs on Reed et al. (2018)'s script and manually enhanced them using the first 500 instances of the E2E development set.
27 Note that the coverage and SER values produced by the script are only estimates, as the patterns for a given attribute-value pair will not cover all possible correct ways to express it. This is different from Wen et al. (2015b)'s computation of SER, where full delexicalisation allowed them to directly count placeholders in the output.
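The pattern-based SER estimate of Equation (4) can be illustrated with a minimal sketch. This is not the script used in the challenge: the pattern set, the function names and the two example MR/output pairs are hypothetical and far smaller than any realistic configuration, and the error categories are assigned by a simplified rule (a slot whose correct value is found at least once is not checked for additional wrong values).

```python
import re

# Hypothetical, heavily simplified patterns: slot -> {value: regex}.
PATTERNS = {
    "priceRange": {
        "low": r"\b(?:cheap|low[- ]priced|low price)",
        "high": r"\b(?:expensive|high[- ]priced|high price)",
    },
    "familyFriendly": {
        "no": r"\b(?:not|non)[- ]+(?:family|kid|child)[- ]friendly",
        "yes": r"(?<!not )(?<!non-)\b(?:family|kid|child)[- ]friendly",
    },
    "near": {
        "Ranch": r"\bnear (?:the )?Ranch\b",
    },
}


def count_slot_errors(mr, text):
    """Return counts of missed, added, value-error and repeated slots."""
    counts = {"missed": 0, "added": 0, "value_errors": 0, "repetitions": 0}
    for slot, value_patterns in PATTERNS.items():
        hits = {
            value: len(re.findall(pattern, text, flags=re.IGNORECASE))
            for value, pattern in value_patterns.items()
        }
        if slot not in mr:
            # The slot is absent from the MR but verbalised: hallucination.
            if any(hits.values()):
                counts["added"] += 1
            continue
        correct = hits.get(mr[slot], 0)
        wrong = sum(n for value, n in hits.items() if value != mr[slot])
        if correct == 0:
            counts["value_errors" if wrong else "missed"] += 1
        elif correct > 1:
            counts["repetitions"] += correct - 1
    return counts


def slot_error_rate(examples):
    """Compute SER over (mr, text) pairs: total errors / total MR slots."""
    errors = 0
    slots = 0
    for mr, text in examples:
        errors += sum(count_slot_errors(mr, text).values())
        slots += len(mr)
    return errors / slots


# Toy MRs restricted to slots covered by PATTERNS; the second output is an
# invented faulty realisation (wrong price, added attribute, missing 'near').
examples = [
    ({"near": "Ranch", "familyFriendly": "no"},
     "The Cricketers is a restaurant located near Ranch. It is not family friendly."),
    ({"near": "Ranch", "priceRange": "high"},
     "The Cricketers is a cheap family-friendly place."),
]
ser = slot_error_rate(examples)
```

In this toy setting the first output is error-free, while the second contains one value error, one added slot and one missed slot, giving three errors over four MR slots. As with the actual script, such patterns only approximate coverage, since they cannot anticipate every correct verbalisation.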
Both evaluations show that template- and rule-based systems, where MR coverage is implied by the architecture, mostly score high in this regard. However, FORGe3, which uses template mining from training data, scores below average; here, some amount of noise was probably carried over from training data. TUDA, on the other hand, scores high in human ratings and even achieved a perfect score by the automatic script (100% perfect coverage), but this is partly due to its low diversity (cf. Section 8.2): all its templates are probably covered well by the patterns. The results also show that some data-driven systems are able to achieve very good coverage (especially Sheff1, Gong and Slug, with SER estimates below 1.5%), which confirms the efficacy of their respective semantic control approaches (see Section 6.2). Seq2seq systems without reranking (Chen, Sheff2, Zhang, Adapt, TR1) score near the bottom of the list in both evaluations.
Both estimates also indicate that missing information is the most common type of problem; added (hallucinated) information occurs less frequently, but still poses a serious problem for utterance generation in task-based dialogue systems. 28 It also appears that both problems are connected: systems hallucinating less frequently tend to miss information more often.
Finally, the scores show that attempts at diversity may hurt semantic accuracy. This is most apparent in Adapt, the most diverse system with no explicit semantic control mechanism. Other systems with diverse outputs, FORGe3 and Harv, also score lower on coverage. In the case of FORGe3, this is due to the above-mentioned noise in the mined templates; Harv's reranking is probably less aggressive than that of other systems. On the other hand, ZHAW1, ZHAW2 and especially Slug-alt produce diverse outputs while maintaining good coverage thanks to their very powerful semantic control mechanisms.

Winning System
We consider the Slug system, a seq2seq-based ensemble system, as the overall winner of this challenge. It received high human ratings for both naturalness and quality, as well as high scores on automatic word-overlap metrics. In contrast to vanilla seq2seq systems, Slug improves semantic coverage using a heuristic slot aligner in combination with a data augmentation method producing partially aligned examples, which places it among the top-scoring systems in terms of MR coverage (cf. Section 8.5). Slug's only drawback is its relatively low output diversity; note that repetitive output is considered to be problematic for task-based dialogue systems.
A variant of the same system, Slug-alt, provides much more output diversity at the cost of slightly lower quality ratings and MR coverage; it maintains higher quality and coverage scores than other diversity-attempting approaches. While the Sheff2 system, a vanilla seq2seq setup, won in terms of naturalness, it often does not realise all parts of the input MR, which severely affected its quality rating: it placed in the last cluster, ranked 20th-21st out of 21. Sheff2's outputs also rank very low on complexity and diversity.
Furthermore, the TGen baseline system turned out to be hard to beat. It ranked highest on average in word-overlap-based automatic metrics and placed in the 2nd cluster in both quality and naturalness (ranks 3-6 and 4-8 out of 21, respectively). TGen also fared well (albeit not perfectly) in MR coverage evaluations. On the other hand, TGen only scored in the middle of the pack on output diversity.

Lessons Learnt and Future Directions
We attempt to formulate some high-level "lessons learnt" for developing future data-driven NLG systems based on the above results, while we acknowledge that our data is limited to a single domain, and that comparisons are not strictly controlled, i.e. models vary in more than one aspect.
• Semantic control: For seq2seq-based systems, strong semantic control of the generated content seems crucial: beam reranking based on MR classification or heuristic alignments appears to work well, while attention-only models perform poorly on our data. Correct semantics is regarded by users as more important than fluency (Reiter and Belz, 2009) and should be prioritised when training the models (cf. also Reiter, 2019).
• Open vocabulary: For limited domains such as ours, delexicalisation of open-set attributes still seems to be the best approach. However, the systems of Harv and NLE show that character-level models and copy mechanisms are viable alternatives. We believe that the low results of Chen, Zhang and Adapt are due to inferior semantic control, not open-vocabulary handling.
• Complexity and diversity: In general, hand-engineered systems seem to outperform neural systems in terms of output diversity and complexity (see Section 8.2); the most diverse outputs are produced by systems using templates mined from training data and data-driven systems with explicit diversity mechanisms.
Vanilla seq2seq-based systems produce the least diverse outputs: they are essentially probabilistic language models, which tend to settle for the most frequent phrasing, thus penalising length and favouring high-frequency word sequences.
Diversity in seq2seq models can be improved by data selection (Slug-alt), diverse ensembling (Harv) or sampling from the generated beam (Wen et al., 2015b). In contrast, hand-engineered system authors can control the output complexity and diversity directly: here, TUDA's outputs are very repetitive as its set of handcrafted templates is small, while FORGe3 and TR2 with templates mined from data produce some of the most diverse outputs.
In general, any system attempting output diversity needs to impose strong semantic control mechanisms to maintain MR coverage.
• Best method suggestion: Rule-based methods work quite well for limited domains, such as ours. Low-effort handcrafting (as in TUDA) may lead to correct but repetitive outputs. Seq2seq models with semantic reranking emerge as the best data-driven option, in combination with controlling for diversity and using copy mechanisms to minimise preprocessing.
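The semantic-control recommendation above can be illustrated with a minimal reranking sketch, in which each beam candidate's log-probability is penalised by the number of semantic errors reported by a checker. The function names, the penalty weight, the toy checker and the example MR are all hypothetical and do not reproduce the reranker of any submitted system.

```python
def rerank_beam(candidates, mr, count_errors, penalty=100.0):
    """Sort beam candidates by model score minus a penalty per semantic error.

    candidates: list of (text, log_prob) pairs from a seq2seq decoder.
    count_errors: callable (mr, text) -> number of semantic errors.
    """
    def adjusted(candidate):
        text, log_prob = candidate
        return log_prob - penalty * count_errors(mr, text)

    return sorted(candidates, key=adjusted, reverse=True)


def toy_count_errors(mr, text):
    """Toy checker: count MR values that do not appear verbatim in the text."""
    lowered = text.lower()
    return sum(1 for value in mr.values() if value.lower() not in lowered)


# Hypothetical MR and beam; the more probable candidate omits the venue type.
mr = {"name": "Clowns", "eatType": "coffee shop", "near": "All Bar One"}
beam = [
    ("Clowns is near All Bar One.", -1.2),
    ("Clowns is a coffee shop near All Bar One.", -1.9),
]
best_text, best_log_prob = rerank_beam(beam, mr, toy_count_errors)[0]
```

With a large penalty, the complete but less probable candidate is preferred over the shorter one that drops a slot. In practice the checker would be an MR classifier or a slot aligner rather than verbatim string matching, and the penalty weight would need tuning.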


Conclusion
This paper presents the findings of the first shared task on End-to-End Natural Language Generation for Spoken Dialogue Systems. The aim of this challenge was to assess the capabilities of recent end-to-end, fully data-driven NLG systems, which can be trained from pairs of input meaning representations and corresponding texts, without the need for fine-grained semantic alignments.
As part of this challenge, we have created a novel dataset for NLG benchmarking in the restaurant information domain, which is an order of magnitude bigger than any previous publicly available dataset for task-oriented NLG. We also provided one of the previous state-of-the-art seq2seq-based NLG systems, TGen (Dušek and Jurčíček, 2016a), as a baseline for comparison. The challenge received 62 system submissions by 17 different participating institutions. The systems submitted ranged from complex seq2seq-based setups with different additions to the architecture, through other data-driven methods and rule-based systems, to simple template-based ones. We evaluated all the entries in terms of five different automatic metrics. 20 primary submissions (as identified by the participants) were further evaluated using a novel, crowdsourced evaluation setup. We also include a novel comparison of systems in terms of automatic textual metrics aimed at assessing output complexity and diversity. Our evaluation lets us formulate several general recommendations for future NLG system development.
In general, seq2seq-based systems produce very similar outputs (as measured by word-overlap, cf. Section 8.3), despite their different implementations. Seq2seq models tend to score high on word-overlap metrics and human evaluations of naturalness, while the scores for other data-driven, rule-based and template-based systems are lower. However, these other types of systems often score better in human evaluations of the overall quality. While the winning Slug system is seq2seq-based, the results also demonstrated possible pitfalls of using seq2seq models:
1. Vanilla seq2seq models tend to produce short outputs of low diversity and syntactic complexity. Low diversity is especially problematic since it causes repetitive outputs in spoken dialogue systems.
2. Applying a strong semantic control mechanism during decoding is crucial to preserve the input meaning. The most common semantic mistake for systems is to miss out information. However, added information (hallucination) is a closely linked problem. Both types of errors can have severe consequences for task-based dialogue systems, depending on the application domain.
3. Addressing these issues is challenging: Attempts to improve diversity can often result in lowered semantic accuracy and/or output naturalness.
In comparison, hand-engineered systems tend to produce more complex and diverse outputs and are able to reach high overall quality, but are mostly rated low on naturalness. Note that similar findings have been reported by Wiseman et al. (2017) for data-to-document generation. This raises the general question regarding efficiency, costs, and performance of purely data-driven versus carefully hand-engineered NLG systems.
To facilitate further research in this domain, we have made the following data and tools freely available for download:
• The E2E NLG training dataset (including test set with human references),
• A set of word-overlap-based metrics used for automatic evaluation in the challenge,
• Outputs of the baseline TGen system for the development set,
• Outputs for the test set produced by the baseline and all participating systems,
• The corresponding RankME ratings for quality and naturalness collected in the human evaluation campaign.
All can be accessed under the following URL: http://www.macs.hw.ac.uk/InteractionLab/E2E/
In future work, we aim to investigate additional evaluation methods for NLG systems, such as post-edits (Sripada et al., 2005), or extrinsic evaluation, such as NLG's contribution to task success, e.g. (Rieser et al., 2014; Gkatzia et al., 2016). We also intend to continue our work on automatic quality estimation for NLG (Dušek et al.,

Table A.14: Full list of E2E challenge submissions with automatic metric scores (primary systems are indicated in the "P?" column; the column "n. avg." shows an average of all metrics normalised into the 0-1 range, cf. Table 8).

Table A.16: Pearson correlation between automatic metrics and human scores of quality. "*" denotes statistical significance at p < 0.05 level, bold denotes the highest value.