Text Editing by Command

A prevailing paradigm in neural text generation is one-shot generation, where text is produced in a single step. The one-shot setting is inadequate, however, when the constraints the user wishes to impose on the generated text are dynamic, especially when authoring longer documents. We address this limitation with an interactive text generation setting in which the user interacts with the system by issuing commands to edit existing text. To this end, we propose a novel text editing task, and introduce WikiDocEdits, a dataset of single-sentence edits crawled from Wikipedia. We show that our Interactive Editor, a transformer-based model trained on this dataset, outperforms baselines and obtains positive results in both automatic and human evaluations. We present empirical and qualitative analyses of this model’s performance.


Introduction
A long-standing goal of natural language processing (NLP) research has been to generate long-form text (Lebowitz, 1985; Fan et al., 2018; Rashkin et al., 2020). Recent large generative language models such as GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020) demonstrate an impressive ability to generate fluent text, but their outputs are difficult to control beyond a prompt, and they manifest a tendency to hallucinate facts (Wiseman et al., 2017). Much recent work has thus focused on making such models more controllable (Keskar et al., 2019; Hu et al., 2017; Zhang et al., 2020b; Dathathri et al., 2020) and factually grounded (Guu et al., 2020; Liu et al., 2018b).
Most such work only considers a one-shot generation setting. Given a set of inputs, which may be a prompt, a control code (Keskar et al., 2019), or a table of data (Liu et al., 2018b) for example, the system generates text in a single step. Humans, though, often produce text through an evolutionary process involving multiple draft-edit cycles. This is not simply because they make mistakes when writing, but because they may require multiple iterations to help them shape and even make sense of what they want to express (Pirolli and Card, 2005). For example, consider a user writing an article about Barack Obama. They might start with a simple sentence such as "Barack Obama was the 44th President of the United States". Next, they may wish to expand on that sentence, adding information, or rephrasing it to integrate it better with the text. Replicating this process in software will mean allowing users to adjust their requirements in response to model outputs. Even an error-free system that meets all of a user's initial requirements does not obviate the need for iteration, since those constraints are themselves dynamic.
The purpose of this paper is to bring into view the task of controllable text editing, as a step beyond one-shot generation towards interactive document generation. A full interactive document generation system will likely comprise multiple components, possibly including one-shot generation to create a first draft. Editing is crucial to interactivity because it allows users to change previously generated text to fit their dynamic constraints. This is a stateful operation, where the state is the current version of the document, as opposed to stateless recasting of text from scratch using a one-shot model. While services like Grammarly or MS Word already offer rewriting suggestions, they mainly focus on small edits, such as paraphrases (Gupta et al., 2018). In this work, we are interested in a broader range of edits, particularly those that add or remove content, or change the meaning of text. Figure 1 illustrates this editing setting with an example from our trained model, where a user produces a sentence about Barack Obama over multiple edits.
In sum, we make the following contributions: We introduce a challenging new text editing task, wherein a model must learn to edit text in response to a user command, while drawing on grounding to avoid problems of hallucination (Wiseman et al., 2017). To accompany this task, we release an open-source dataset of sentence-level edits extracted from Wikipedia, including editor comments, which we leverage as natural language commands, together with pre-retrieved grounding documents. We show that a transformer-based editing model trained on our data outperforms "parrot" and GPT-2 baselines, and obtains competitive results compared to gold-standard edits in human evaluations. We then perform an empirical analysis of our model's performance, showing the importance of the command and grounding, and the varying difficulty of edits in our dataset.

Text Editing Task
We now formalize our text editing task. Let $D$ be a document, $q$ a user command, and $G$ some appropriate form of grounding. (The notation $q$ reflects that the edit command is analogous to a query in a retrieval or QA setting, in that it expresses a form of user intent.) Moreover, let $D'$ be an edited version of $D$. Then our task is, given a dataset of edits $\mathcal{D} = \{(D_0, q_0, G_0, D'_0), \ldots, (D_N, q_N, G_N, D'_N)\}$, to learn to produce document $D'$, given $D$, $q$, and $G$.
Note that while previous work on text editing usually only considers $D$ as input, we include both a form of control $q$ and grounding $G$. The command is needed because otherwise the type of edit to be made is undefined, while the grounding provides external knowledge needed to make an edit.
In our specific instance of this task, we will only consider sentence-level edits. More formally, we consider edits $D \to D'$, where $D$ and $D'$ differ only on a single sentence $s \in D$, respectively $s' \in D'$. While, in general, edits can vary in complexity from document-level to character-level changes, sentences are a natural way to break down text into relatively independent units of meaning, so it makes sense to edit text one sentence at a time. More complex, document-level edits can be seen as a composition of multiple sentence-level edits.
Additionally, we will consider user commands $q$ written in natural language, e.g., "add years in office". The command could also take other forms, such as a categorical variable, but natural language allows for the greatest flexibility in specifying what the edit should accomplish. Moreover, natural language commands are a good fit for our model, which we will initialize with pretrained language model weights. For similar reasons, we will also consider corpora of text snippets as our grounding $G$. Alternatively, the grounding could also consist of structured data such as tables or graphs. In a real user scenario this grounding might be supplied by the user, or retrieved on the fly. For our dataset, we pre-retrieve groundings by querying a commercial search engine.

Data
To accompany our text editing task we present a novel dataset of nearly 12 million sentence-level edits, WikiDocEdits. These edits were extracted from the revision histories in the February 1st, 2020 dump of English Wikipedia. For a given Wikipedia page, a revision consists of a source and target text, corresponding to the old and new versions of the page. Each revision is also accompanied by an editor comment, which we will use as a proxy for the user command.
For a given revision, we split the source and target texts into sentences and then attempt to match the sentences between source and target. For efficiency, we only look at a $k$-sentence neighborhood. Unmatched sentences are candidates for edits. A source sentence $s$ and target sentence $t$ form an edit pair $s \to t$ if $f(s, t) > \epsilon$, where $f$ is sentence-level BLEU without smoothing and $\epsilon = 0.1$ in our case. If an unmatched source sentence does not form an edit pair with any target sentence, we consider it to be a sentence deletion. This can also be thought of as matching to an empty sentence. We identify sentence insertions in an analogous manner. Importantly, we only consider revisions that contain a single sentence-level edit. Otherwise, the editor comment that accompanies each revision may only describe one of the possibly many sentence-level edits. See appendix A.1 for a more detailed description of our processing pipeline.
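The pairing rule above can be illustrated with a minimal sketch. It assumes whitespace-tokenized sentences, treats the source as the BLEU reference, and pairs sentences greedily; the function names `sentence_bleu` and `classify_unmatched` are hypothetical and are not taken from the released pipeline.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, hypothesis, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights.

    Without smoothing, the score is 0 as soon as one n-gram order has no match.
    """
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
        total = sum(hyp.values())
        clipped = sum(min(count, ref[gram]) for gram, count in hyp.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    if len(hypothesis) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(hypothesis))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


def classify_unmatched(unmatched_src, unmatched_tgt, eps=0.1):
    """Label unmatched sentences as edits, deletions, or insertions.

    Assumption: each source sentence is greedily paired with its best-scoring
    unused target sentence, provided the score exceeds eps.
    """
    edits, used = [], set()
    for s in unmatched_src:
        best_j, best_score = None, eps
        for j, t in enumerate(unmatched_tgt):
            if j in used:
                continue
            score = sentence_bleu(s.split(), t.split())
            if score > best_score:
                best_j, best_score = j, score
        if best_j is None:
            edits.append(("delete", s, ""))  # matched to the empty sentence
        else:
            used.add(best_j)
            edits.append(("edit", s, unmatched_tgt[best_j]))
    edits.extend(("insert", "", t) for j, t in enumerate(unmatched_tgt) if j not in used)
    return edits
```

In this sketch a revision would be kept only if `classify_unmatched` returns exactly one item, mirroring the single-edit restriction described above.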

Grounding
We retrieve grounding snippets for the edits in our dataset by querying a commercial search engine. In order to formulate a query for a given edit, we combine the relevant page and section titles with keywords from the target sentence. While the target sentence is not available at test time, we make the assumption that in a real user scenario the relevant grounding would be provided by the user.
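Appendix A.2 selects keywords by document frequency, keeping words whose frequency falls below 0.01. A minimal sketch of that selection and of the query construction follows. The helper names, the treatment of unseen words as keywords, and the plain space-joined query format are assumptions made for illustration.

```python
from collections import Counter


def document_frequency(documents):
    """Map each word to the fraction of documents (token lists) that contain it."""
    counts = Counter()
    for doc in documents:
        counts.update(set(doc))
    return {word: count / len(documents) for word, count in counts.items()}


def keywords(sentence_tokens, df, threshold=0.01):
    """Keep words rarer than the threshold; unseen words count as df = 0 (assumption)."""
    return [w for w in sentence_tokens if df.get(w, 0.0) < threshold]


def build_query(page_title, section_title, keyword_list):
    """Hypothetical query format: titles followed by the target-sentence keywords."""
    return " ".join([page_title, section_title, *keyword_list])
```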
We retrieve the top 200 returned web page results and only keep the preview snippets returned by the search engine as the grounding corpus. Because Wikipedia, as well as several clones, often appear in search engine results, we check for 4-gram overlap between the target sentence and each grounding snippet, removing any snippet with more than 50% overlap. Finally, we rerank the retrieved snippets using an information extraction score, and merge the ranked snippets to take the first N = 512 tokens.
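The overlap filter can be sketched as follows. The text does not say how the 50% is normalized, so this sketch assumes it is the fraction of the target sentence's 4-grams that also occur in the snippet, computed on lowercased whitespace tokens; `filter_snippets` is a hypothetical name.

```python
def four_grams(tokens):
    """Set of 4-grams in a token list (empty for fewer than four tokens)."""
    return {tuple(tokens[i:i + 4]) for i in range(len(tokens) - 3)}


def overlap_ratio(target, snippet):
    """Fraction of the target's 4-grams that also appear in the snippet."""
    target_grams = four_grams(target.lower().split())
    if not target_grams:
        return 0.0
    snippet_grams = four_grams(snippet.lower().split())
    return len(target_grams & snippet_grams) / len(target_grams)


def filter_snippets(target, snippets, max_overlap=0.5):
    """Drop snippets that largely reproduce the target sentence."""
    return [g for g in snippets if overlap_ratio(target, g) <= max_overlap]
```

Filtering of this kind keeps near-copies of the Wikipedia sentence out of the grounding, so that a model cannot simply copy the target from its inputs.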

Data Analysis
We now provide an overview of our dataset. From Wikipedia, we extract 11,850,786 edits, and take a 1% sample of 118,818 edits to run our analyses. Table 1 presents summary statistics for our data, and in the following, we break down the edits by edit type, analyze the quality of the retrieved grounding, and present some examples.

Fluency and Content Edits
We are interested in the distribution of different edit types within our dataset. In particular, we want to distinguish between fluency edits, which only affect the grammar or structure of a sentence, and content edits, which change the meaning of a sentence. We can lean on previous work to categorize edits on Wikipedia. Yang et al. (2017) create 13 edit intention categories, and train a classifier to label revisions according to the categories. We apply their classifier to our data, and group their 13 categories into "fluency", "content", or "other" edits, as reported in table 2. With the caveat that the edits were labelled automatically using a trained classifier, we see that, while fluency edits make up the majority of the edits in our data, a large proportion are content edits.

Coverage Analysis
We are also interested in knowing how well edits in the data are covered by the inputs (i.e., $D$, $s$, $q$, or $G$), where an edit is well covered if the information necessary to produce the edit appears somewhere in the inputs. To measure coverage we use word recall: how many words that were inserted in an edit also appear in the grounding? However, because simple recall fails to account for synonyms, or the context in which words appear, we use the BERTScore (Zhang et al., 2020a) recall. This allows for fuzzy matching between BERT embeddings instead of requiring exact word matches. We also use idf scores to weigh words, since we are mostly interested in covering rare words, which are more likely to be meaning-carrying. We can define the BERT recall, $R_{BERT}$, for a sentence edit $s \to s'$, with respect to some text corpus $C$ as
$$R_{BERT} = \frac{\sum_{w \in s' \setminus s} \mathrm{idf}(w) \cdot \max_{w' \in C} \mathrm{BERT}(w)^{\top} \mathrm{BERT}(w')}{\sum_{w \in s' \setminus s} \mathrm{idf}(w)},$$
where $s' \setminus s = \{w \in s' \mid w \notin s\}$, and $\mathrm{idf}(w)$ are the inverse document frequency scores computed on a random sample of 500K Wikipedia pages.
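To make the idf-weighted recall concrete, here is a minimal sketch that takes precomputed embeddings as plain Python lists. It assumes cosine similarity as the matching score, as in BERTScore, and a non-empty corpus; the function names and the toy two-dimensional vectors in the usage note are illustrative only and do not come from a real BERT model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def idf_weighted_recall(inserted, corpus, idf):
    """Recall of inserted words against a coverage corpus.

    inserted: list of (word, embedding) pairs for words in s' but not in s
    corpus:   non-empty list of embeddings for the words of the corpus C
    idf:      dict mapping each inserted word to its idf weight
    """
    numerator = sum(
        idf[word] * max(cosine(embedding, c) for c in corpus)
        for word, embedding in inserted
    )
    denominator = sum(idf[word] for word, _ in inserted)
    return numerator / denominator if denominator else 0.0
```

For instance, with `inserted = [("1961", [1.0, 0.0]), ("born", [0.0, 1.0])]`, `corpus = [[1.0, 0.0], [1.0, 1.0]]`, and idf weights of 3.0 and 1.0, the rare word "1961" is matched exactly and dominates the score, while "born" contributes only a partial match.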
Table 3 reports the coverage statistics for our subsample of the data. We used an uncased BERT base model to compute the embeddings. The first row reports the coverage of the target by all of the inputs, namely the command, grounding, context, and source sentence. The second row shows the coverage by the grounding alone. Note that, even with just the grounding, coverage is already fairly high. Finally, the last row presents the coverage by the command alone, which shows that it also provides grounding.
Examples. Table 4 presents some examples from our data. These were chosen to illustrate a variety of phenomena. The first example shows an elaboration edit, appending new information to the end of a sentence. The second example is a simple typo fix, while the third is changing a fact. Finally, the last example is a more complex edit involving insertion and deletion to reword and clarify a sentence. We can see that there is a large variety of edits in our dataset. See table 10 in the appendix for more examples, including comments.

Model
We formalize our model, which we refer to as Interactive Editor, as a standard auto-regressive sequence to sequence model. Because our data only contains single-sentence edits, we assume that the sentence to be edited in the source document is given as an input to the model. Given a source sentence $s \in D$, the context around $s$, which we will refer to as $D$ by abuse of notation, a user command $q$, a grounding corpus $G$, and a candidate target sentence $s'$, the model, $f$, computes
$$f(s, s', D, q, G) = P(s' \mid s, D, q, G) = \prod_i P(s'_i \mid s'_{<i}, s, D, q, G),$$
where $s'_{<i} = \{s'_0, \ldots, s'_{i-1}\}$ are the tokens preceding $s'_i$ in $s'$.
We use the same encoder-decoder architecture as T5 (Raffel et al., 2019) and initialize our model with pretrained language model weights. The encoder-decoder architecture allows us to perform full attention over the inputs $s$, $D$, $q$, and $G$, while the decoder allows us to auto-regressively generate $s'$. Meanwhile, initializing with pretrained weights has been shown to achieve state-of-the-art results on many NLP tasks (Raffel et al., 2019).

Table 4 (source and target sentence pairs):

Source sentence: For decades to follow, the movie was aired in the United States on or near Easter.
Target sentence: For decades to follow, the movie was aired in the United States on or near Easter, although today with the Turner cable networks now holding the television rights, the film is generally shown during the summer and Christmas seasons.

Source sentence: After the execution-style killings that inadvertenty led to the deaths of Frank Castle's family, Russo was hired by Bruno Costa to assassinate another assassin who had failed to kill Frank Castle as well.
Target sentence: After the execution-style killings that inadvertently led to the deaths of Frank Castle's family, Russo was hired by Bruno Costa to assassinate another assassin who had failed to kill Frank Castle as well.

Source sentence: He married Margaret Frances Prowse Shaw in Sydney in 1874.
Target sentence: He married Margaret Frances Prowse Shaw in Sydney in 1871.

Source sentence: Entitled "It Feels Like Home (Re Invented) Tour 2011", it contained his songs and remakes of Alliage hits.
Target sentence: Entitled "It Feels Like Home (Re Invented) Tour 2011", it included many remakes of Alliage hits as well as some of his newer songs.
In order to adapt T5 for our task, we represent all our inputs as sequences of tokens. We then concatenate these sequences together using separator tokens, truncating and padding them to fixed lengths. This is straightforward since all our inputs are text. See fig. 2 for reference. We also use the standard cross-entropy loss to train.
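The input construction can be sketched as below. The order of the segments, the per-segment lengths, and the `<sep>` and `<pad>` token strings are all assumptions chosen for illustration; the text only states that the sequences are truncated and padded to fixed lengths and joined with separator tokens. A real implementation would operate on token ids from the T5 vocabulary rather than on strings.

```python
PAD, SEP = "<pad>", "<sep>"  # hypothetical special tokens


def fit(tokens, length):
    """Truncate or right-pad a token list to exactly `length` tokens."""
    tokens = tokens[:length]
    return tokens + [PAD] * (length - len(tokens))


def build_encoder_input(source, context, command, grounding, lengths):
    """Concatenate fixed-length segments, separated by SEP tokens."""
    segments = [
        ("command", command),
        ("source", source),
        ("context", context),
        ("grounding", grounding),
    ]
    sequence = []
    for index, (name, tokens) in enumerate(segments):
        if index > 0:
            sequence.append(SEP)
        sequence.extend(fit(tokens, lengths[name]))
    return sequence
```

Because every segment has a fixed length, the encoder input always has the same total length, which keeps batching simple at the cost of some wasted padding.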

Experiments
We train our model on a subset of ∼1,020K edits from WikiDocEdits.
We use a training/validation/test split of 1,000K/10K/10K edits, and train for 3 epochs with a fixed learning rate of 0.0001. Following Raffel et al. (2019), we finetune all weights in the model, and use a batch size of 128. We validate every 200 steps and select the model with the lowest validation loss.

Evaluation
For inference we use beam search with a beam width of 5, and keep the 5 highest ranked candidates, excluding any generation that parrots the source as this corresponds to making no edits.
Metrics. We consider several metrics to evaluate our model. One natural metric to consider is BLEU (Papineni et al., 2002). BLEU shows high correlation with human judgement on machine translation (Papineni et al., 2002; Doddington, 2002). While this should not a priori transfer to evaluating different tasks, our task in fact bears a high similarity to machine translation because of how the output is constrained by the inputs. If, for example, the source sentence in an English to German translation task is "Sally met Lucy", the German translation must in some way mention Sally and Lucy. Similarly, in our task, if the source sentence is "Barack Obama was the 44th President of the United States", and the command is "add birth date", the edit must somehow mention a birth date somewhere. Thus, in our setting, BLEU makes sense as a metric since in principle a good model output should not deviate too far from the reference. We use macro-averaged sentence-level BLEU with epsilon smoothing and equally weighted n-grams, with n up to 4.
One issue with BLEU is that the source and target sentences in our task are already very similar, so a model that simply parrots back the source sentence could achieve an unduly high score. Therefore, we also evaluate model outputs by comparing the word-level edits made by the model against the reference, where a word-level edit is a tuple of an operation, either insertion or deletion, a position, and a word. For example, in the edit "Barack Obama was the 44th President of the United States" $\to$ "Barack Obama, born August 4th 1961, was the 44th President of the United States", the set of word edits would look like {(insert, 2, ","), (insert, 3, "born"), ...}. We can then compute precision, recall, and F1 scores based on these word edit sets. Finally, we also compute sentence-level accuracy, which reports the number of edits in the test set for which the model output exactly matched the reference.
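A minimal sketch of the word edit metrics follows. It is an illustration under several assumptions: a simple regular-expression tokenizer stands in for the NLTK tokenizer mentioned in the appendix, the alignment comes from Python's `difflib.SequenceMatcher`, insertions are indexed by their position in the edited sentence and deletions by their position in the source. The function names are hypothetical.

```python
import re
from difflib import SequenceMatcher


def tokenize(sentence):
    """Simplified tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def word_edits(target, source):
    """Set of (operation, position, word) tuples that turn source into target."""
    src, tgt = tokenize(source), tokenize(target)
    edits = set()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ("delete", "replace"):
            edits.update(("delete", i, src[i]) for i in range(i1, i2))
        if op in ("insert", "replace"):
            edits.update(("insert", j, tgt[j]) for j in range(j1, j2))
    return edits


def word_edit_prf(hypothesis, reference, source):
    """Precision, recall, and F1 of the model's word edits against the reference's."""
    hyp_edits = word_edits(hypothesis, source)
    ref_edits = word_edits(reference, source)
    overlap = len(hyp_edits & ref_edits)
    precision = overlap / len(hyp_edits) if hyp_edits else 0.0
    recall = overlap / len(ref_edits) if ref_edits else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Under these definitions a parrot output produces an empty edit set and therefore scores zero on all three metrics, which is the behavior that motivates using them alongside BLEU.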
Baselines. We use two baselines to compare our model to. First, we consider the parrot baseline that simply outputs the source sentence as is. The second baseline attempts to delete the source sentence and replace it with a new sentence. We use a pretrained GPT-2 model (Radford et al., 2019) out of the box that generates a sentence given the left context.

Results
Table 5 presents our main results. Notice that the parrot baseline is able to achieve a considerably high BLEU score, as expected, while the GPT-2 baseline surprisingly achieves a high word edit recall score. Our interactive neural editor model is able to beat both baselines across all metrics, as would be expected. Even on a harsh metric like accuracy our model achieves a nontrivial score, although we suspect most of the edits that the model gets exactly right are minor fluency edits. See table 6 for a breakdown by edit type.
Ablations. The three middle rows of Table 5 show the results for three ablations of our model. The first ablation removes everything but the source sentence $s$. This is similar to the paraphrase setting (Gupta et al., 2018), and the editing setting in Faruqui et al. (2018) and Yin et al. (2019). We can see that including the context, grounding, and command as additional inputs yields significant improvements over only using the source sentence. We can also see from the second ablation that the commands are a crucial element in the model's performance. This is not surprising since without a command the model must guess what type of edit to make. While it may make a valid edit, it should very rarely be able to guess the right edit to make. Similarly, the model without grounding performs considerably worse than the full model, showing that the grounding is as important as the command. Surprisingly, the last two ablations perform only marginally better than the first, meaning that removing the grounding in addition to the commands, or vice-versa, does not lead to a large drop in performance. This seems to suggest a synergistic effect between the command and the grounding, which makes sense since the model would not know what to do with the grounding without a command, and likewise, the model would not have access to the right information without the grounding, even if it knew what to edit from the command.

Breakdown by edit type
The results of our full model are broken down by edit intention labels in Table 6. The columns report the same metrics as in our main table of results, with the exception of S-BLEU, which reports the BLEU score between the source sentence and target, and the last column, which reports the number of test edits that were classified into each category. With the caveat that intention labels come from an automatic classifier and not human annotation, we can observe that our model has varying performance across different types of edits. The model performs very well on fluency edits, but worse on content edits. This comes as no surprise given that fluency edits should be easier as they usually correct minor mistakes, which a language model should be able to detect from pretraining. Content edits, on the other hand, require pulling the correct information from the grounding and incorporating it in the correct manner into the sentence. The S-BLEU scores confirm this since the source sentences in the fluency examples are much more similar to the target sentences than for the content edits. In fact, when looking at the absolute improvement of the BLEU over the S-BLEU scores, the model performs equally well on both types of edits.

Human Evaluations
We also conducted human evaluations of our system, comparing our model's top output from beam search to the reference edit across 200 examples from our test set. Annotators were crowd sourced, and each example was rated by 7 judges for a total of 1400 judgments in each of two tasks. In the first task, we asked judges to choose which system better accomplished the command $q$. In the second, we asked which system was more faithful to the grounding $G$. Table 7 presents the results. Although there is a clear preference for the Reference edits in the command-related task, 59% of judgments suggest that Interactive Editor may be equal to or better than the reference. (The high percentage of Neutral judgments here may be partially attributable to other factors. Majority Neutral judgments are observed for approximately 65% of those examples that received at least 1 Neutral judgment. This suggests that many commands may not have been readily interpretable to our judges.) In the grounding task, Interactive Editor demonstrates good correspondence with the background material. Judges were further asked whether the retrieved grounding was relevant to the context $D$: 92.86% of judgments recorded the grounding as either "Somewhat relevant" or "Very relevant".

Discussion
Text: Geoff Hinton is an English tennis player.
Command: fix profession
Text: Geoffrey Hinton is a computer science professor at the University of Toronto.
Command: add nationality
Text: Geoffrey Hinton is an English-Canadian computer science professor at the University of Toronto.
Command: add birthdate
Text: Geoffrey Hinton (born 1946) is an English-Canadian computer science professor at the University of Toronto.
Command: add most famous work
Text: Geoffrey Hinton (born 1946) is an English-Canadian computer science professor at the University of Toronto. Geoffrey Hinton is most famous for his work on artificial neural networks.

Table 8: An example of a multi-turn interaction with our model. At each turn, the edit was chosen among the top 3 outputs returned by beam-search. See table 11 in the appendix for the grounding used in this example.
From our results, our model seems to be able to learn how to make nontrivial edits to text. However, we have focused solely on single turns, while a real text generation scenario would likely involve multiple edits. It isn't obvious that a model that performs well on a single turn will also perform well across multiple turns, as there may be path dependencies when making edits. Table 8 presents an illustrative example of multi-turn interaction with our model. The starting text is a false sentence about Geoff Hinton, which, through a series of edits, is built into a more elaborate, and factually correct, sentence. The model is able to perform nontrivial reasoning to retrieve relevant information from the grounding and insert it in the appropriate part of the sentence. For example, when asked to "fix profession", the model infers that Hinton is not a tennis player, but a computer scientist. See table 11 in the appendix for the grounding used in this example. While this example is not an empirical result, it suggests that our single-turn model can be used meaningfully over multiple turns, and could thus potentially be extended for the multi-turn setting.

Related Work
Grounded Generation. Large language models can generate fluent text (Radford et al., 2019; Brown et al., 2020; Raffel et al., 2019), but they have a tendency to hallucinate facts (Wiseman et al., 2017). Thus, several works have explored using various forms of grounding to enable models to generate factually consistent texts (Koncel-Kedziorski et al., 2019; Liu et al., 2018b; Prabhumoye et al., 2019; Liu et al., 2018a; Guu et al., 2020). Our work uses grounding to ensure that edits are factually correct, although our task differs from previous work because of the user command, which requires specific information to be retrieved from the grounding during generation.
Controllable Generation. While grounding can be seen as a way to implicitly control the contents of generated text, other works have explored more explicit forms of control. Hokamp and Liu (2017) and Zhang et al. (2020b) use lexical constraints, while Keskar et al. (2019) and Dathathri et al. (2020) control higher level attributes of text, such as style, tone, or topic. Our task instead uses natural language commands, which can flexibly express different types of constraints, ranging from low-level lexical ones, to high-level topical ones. In this sense, we can also draw the parallel to dialog response generation (Ghazvininejad et al., 2018; Dinan et al., 2019), task-oriented dialog (Gao et al., 2019), or open domain question answering (Min et al., 2019; Chen et al., 2017), that also involve user responses or queries, although these tasks are not concerned with text generation in the context of document creation.

Story Generation
The task of Document Generation considered in our work bears similarity with work on generating long-form narratives (Jain et al., 2017). While earlier work in Story Generation focused more on plan-based architectures (Lebowitz, 1985), more recent work moved towards end-to-end approaches (Fan et al., 2018) allowing generation to be unconstrained and creative. As narratives are often aimed at particular goals expressed in terms of outlines and plans, much of the literature in Story Generation is framed as a form of controllable generation, using storylines (Peng et al., 2018), events (Martin et al., 2017; Harrison et al., 2017), plot words or word skeletons (Xu et al., 2018; Ippolito et al., 2019), plans (Yao et al., 2019), story ending (Tambwekar et al., 2019), and outlines (Rashkin et al., 2020) as various forms of constraints. Our work takes a significantly different approach, as we treat document or story generation as an iterative process that also allows a human to generate a full document from scratch, but allows constraints to be more "on demand" and dynamic (e.g., express the desire to add nationality in Table 8 only if the system missed that the first time).
Text Editing. Several previous works have focused on text editing. Guu et al. (2018) generate sentences by editing prototypes taken from their training corpus, although they use editing only as a means for language modeling. Wu et al. (2019) expand upon Guu et al. (2018)'s setting, but for dialog. More related to our own setting, Faruqui et al. (2018) propose WikiAtomicEdits, a dataset of edits crawled from Wikipedia. However, they consider a much narrower definition of edits than our data does. Yin et al. (2019) use WikiAtomicEdits and propose the task of learning to represent edits, which Marrese-Taylor et al. (2020) expand using a variational approach. In contrast, we are more interested in generating edits rather than representing them. Iso et al. (2020) propose a fact-based text editing task, but they do not consider control or other types of edits. Another related task to text editing is text paraphrasing (Gupta et al., 2018); however, paraphrasing usually conserves the meaning of a sentence. While the edits we consider include meaning-preserving edits, such as rewording edits, we are mostly interested in edits that add or modify content.

Conclusion
In this work we argued that text generation should be interactive, and, as a means towards that end, we proposed a general text editing task, where a system must edit a document in response to a user command. In our specific instance of the task we considered single-sentence edits, and we crawled a dataset of several million edits from Wikipedia that included commands, in the form of editor comments, as well as grounding documents. We then showed that training a transformer-based model on our data, while initializing with pretrained language model weights, yields encouraging results on both automatic and human evaluations. Additionally, our ablation studies showed the crucial role played by the user command and grounding. Breaking down our results by types of edits, we saw that our model not only performs well on easier fluency edits, but also on much harder content edits. Finally, our example of multi-turn interactions with the system suggests that our single-turn editing model could be feasibly used to generate longer documents.
Because we do not look at all matches between source and target sentences, a sentence may remain unmatched if, in the target document, it was moved more than $k$ sentences away compared to the source document. Thus we only keep a sentence insertion or deletion if the total number of source and target sentences differ by one. If there are both an unmatched source sentence $s$ and target sentence $t$, we consider them to form an edit $s \to t$ if $f(s, t) > \epsilon$, where $f$ is the BLEU score and $\epsilon = 0.1$.
As a final step, we filter out edits that involve sentences with markup punctuation. We have found that this helps remedy the shortfalls of the markup removal step, since it often leaves behind markup symbols. While there may be valid sentences that use markup punctuation, we do not expect them to make up a significant part of the data, nor do we expect them to be significantly different from regular sentences, except for their use of unusual punctuation.

A.2 Grounding Search Query Construction
For a given edit, we combine the relevant page and section titles with keywords from the target sentence to construct a query that we use to retrieve grounding from a commercial search engine. In order to identify keywords we look at document frequency
$$\mathrm{df}(w) = \frac{|\{D \in \mathcal{D} \mid w \in D\}|}{|\mathcal{D}|},$$
where $\mathcal{D}$ is a sample of 500,000 Wikipedia articles taken from the Tensorflow Wikipedia dataset. We consider words $w$ with $\mathrm{df}(w) < 0.01$ to be keywords.

Grounding snippets (Table 11):

Geoffrey Everest Hinton CC FRS FRSC (born 6 December 1947) is an English Canadian cognitive psychologist and computer scientist, most noted for his work on artificial neural networks. Since 2013 he divides his time working for Google (Google Brain) and the University of Toronto. In 2017, he cofounded and became the Chief Scientific Advisor of the Vector Institute in Toronto.

Geoffrey Hinton : index. Department of Computer Science : email: [REDACTED] : University of Toronto : voice: send email: 6 King's College Rd.

We would like to show you a description here but the site won't allow us.

Geoffrey's great grandfather, the mathematician [REDACTED] Charles Hinton, coined the word "tesseract" and popularized the idea of higher dimensions, while his father, Howard Everest Hinton, was a distinguished entomologist.

Geoffrey Hinton is a fellow of the Royal Society, the Royal Society of Canada, and the Association for the Advancement of Artificial Intelligence. He is an honorary foreign member of the American Academy of Arts and Sciences and the National Academy of Engineering, and a former president of the Cognitive Science Society.

Geoffrey Hinton. Emeritus Prof. Comp Sci, U.Toronto & Engineering Fellow, Google. Verified email at cs.toronto.edu - Homepage. machine learning psychology artificial intelligence cognitive science computer science. Articles Cited by Co-authors. Title. Sort. Sort by citations Sort by year Sort by title.

Geoff Hinton was born in Wimbledon in 1947 to Howard Hinton, an entomologist, and a schoolteacher mother, Margaret Clark. The childhood Hinton describes is a mash-up of Lemony Snicket, ...

As the first of this interview series, I am delighted to present to you an interview with Geoffrey Hinton. Welcome Geoff, and thank you for doing this interview with deeplearning.ai. Thank you for inviting me. I think that at this point you more than anyone else on this planet has invented so many of the ideas behind deep learning.

Talks by Geoffrey Hinton. The next generation of neural networks A 45min version of this talk which was given at the 10 year celebration of the Microsoft Cambridge Research Laboratory. the original powerpoint file version for most browsers. psversion with 4 slides per page. Very gentle after-dinner version of IJCAI-2005 Research Excellence ...

A.3 Grounding Document Reranking
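Because the combined length of the grounding snippets we retrieve far exceeds the capacity of our model, we rerank the retrieved snippets using an information extraction score. We then merge the ranked snippets and take only the first N = 512 tokens. Following Liu et al. (2018a), we use tf-idf scores to rerank. For a given edit $s \to s'$, with retrieved grounding documents $\mathcal{G}$, the information extraction score of snippet $G \in \mathcal{G}$ is
$$\mathrm{score}(G) = \sum_{w \in s'} \text{tf-idf}(w, G),$$
where the tf-idf score of word $w$ is
$$\text{tf-idf}(w, G) = N_w(G) \cdot \log\left(\frac{N_g}{N_{gw}}\right),$$
where $N_w(G)$ is the number of occurrences of $w$ in $G$, $N_{gw}$ is the number of documents in $\mathcal{G}$ that contain $w$, and $N_g$ is the number of documents in $\mathcal{G}$.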
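The reranking step can be sketched as follows, assuming that snippets and the target sentence are already tokenized and that the sum runs over the distinct words of the target sentence. The function name `rerank_and_merge` is hypothetical.

```python
import math
from collections import Counter


def rerank_and_merge(target_tokens, snippets, max_tokens=512):
    """Rank snippets by summed tf-idf of target words, then merge and truncate.

    snippets: list of token lists, one per retrieved grounding snippet.
    """
    n_g = len(snippets)
    # N_gw: number of snippets that contain each word.
    doc_freq = Counter()
    for snippet in snippets:
        doc_freq.update(set(snippet))

    def score(snippet):
        counts = Counter(snippet)  # N_w(G) for every word in this snippet
        return sum(
            counts[w] * math.log(n_g / doc_freq[w])
            for w in set(target_tokens)
            if counts[w] > 0
        )

    ranked = sorted(snippets, key=score, reverse=True)
    merged = [token for snippet in ranked for token in snippet]
    return merged[:max_tokens]
```

Note that a word occurring in every snippet receives a weight of log(1) = 0, so only words that discriminate between snippets influence the ranking.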

A.4 Word Edit Metrics
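For our evaluations we compare the word-level edits made by the model against the reference, where a word-level edit is a tuple of an operation, either insertion or deletion, a position, and a word. We use the word tokenizer from NLTK (Bird et al., 2009) to break a sentence down into words. For a given target sentence $s'$, denote the set of word edits as $\mathrm{WE}(s', s)$. Then we compute the precision
$$P_{WE}(s', h, s) = \frac{|\mathrm{WE}(s', s) \cap \mathrm{WE}(h, s)|}{|\mathrm{WE}(h, s)|},$$
recall
$$R_{WE}(s', h, s) = \frac{|\mathrm{WE}(s', s) \cap \mathrm{WE}(h, s)|}{|\mathrm{WE}(s', s)|},$$
and F1 score
$$F_{1,WE}(s', h, s) = \frac{2 \cdot P_{WE} \cdot R_{WE}}{P_{WE} + R_{WE}},$$
where $s$ is the source sentence, $s'$ is the reference target sentence and $h$ is the target sentence generated by the model.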

Figure 1: An illustration of our interactive text generation setting. This is an example generated by our model. The blue panels represent the text being edited, taken from the document shown on the right. The orange panels represent user edit commands. The model grounds edits in query results from a commercial search engine.

Figure 2: An illustration of our model. The inputs to the encoder are sequences of tokens separated by sep tokens, represented by the vertical bars in the figure.

Table 2: Breakdown of edits by grouped intention labels. See Table 9 in the appendix for a breakdown by intention label instead of group. The percentages do not total 100 because edits can have multiple labels.

Table 3: $R_{BERT}$ statistics of inserted words for edits in WikiDocEdits. All statistics were computed on a 1% subsample of the data. The BERT embeddings used to compute $R_{BERT}$ were produced using a pretrained BERT base model. The idf weights were computed from a sample of 500,000 Wikipedia pages. Each row represents a different recall when considering a different coverage corpus $C$.

Table 4: Example edits from WikiDocEdits. The edited portions are highlighted in bold.

Table 6: Breakdown of results by intention category for our full model. The categories are the same as in table 2.

Table 11: Grounding used for the example in table 8. Parts indicated by [REDACTED] were removed for containing sensitive material.