Ask me in your own words: paraphrasing for multitask question answering

Multitask learning has led to significant advances in Natural Language Processing, including the decaNLP benchmark, where question answering is used to frame 10 natural language understanding tasks in a single model. In this work we show how models trained to solve decaNLP fail under simple paraphrasing of the question. We contribute a crowd-sourced corpus of paraphrased questions (PQ-decaNLP), annotated with paraphrase phenomena. This enables analysis of how transformations such as swapping the class labels and changing the sentence modality lead to a large performance degradation. Training both MQAN and the newer T5 model using PQ-decaNLP improves their robustness and, for some tasks, improves the performance on the original questions, demonstrating the benefits of a model which is more robust to paraphrasing. Additionally, we explore how paraphrasing knowledge is transferred between tasks, with the aim of exploiting the multitask property to improve the robustness of the models. We explore the addition of paraphrase detection and paraphrase generation tasks, and find that while both models are able to learn these new tasks, knowledge about paraphrasing does not transfer to the other decaNLP tasks.


INTRODUCTION
Recent progress in Natural Language Processing (NLP) has led to improved performance across a wide range of language understanding problems (www.gluebenchmark.com/leaderboard). A key component of these advances is the use of knowledge transferred from other tasks, most prominently from language modelling (Peters et al., 2018; Howard & Ruder, 2018; Devlin et al., 2019). McCann et al. (2018) developed a new NLP benchmark: the Natural Language Decathlon (decaNLP). This challenges a single model to perform 10 Natural Language Understanding (NLU) tasks by framing each task as question answering (Fig. 1). For example, when solving a translation task, a model is asked the question "Translate from English to German", given a paragraph of English text as the context, and is expected to output the translation of the context in German as the answer. The key appeal of this task design is that it favours models where all parameters are shared between all tasks, and adding new tasks only requires additional training data, not redesigning the model. As well as decaNLP, McCann et al. (2018) proposed the Multitask Question Answering Network (MQAN) as a neural network architecture for solving the 10 decaNLP tasks.

RELATED WORK
In this section we provide a general overview of multitask learning and the unique setting provided by decaNLP, as well as a description of paraphrasing from a natural language processing perspective.
Paraphrasing: Paraphrasing is often defined as 'sameness of meaning' (Vila, Martí & Rodríguez, 2014). This, however, is ambiguous, as there are many degrees of 'sameness'.

[Figure 1: Examples of fixed questions for decaNLP: "Is this review positive or negative?", "What is the summary?", "Translate from English to German", "What is the translation from English to SQL?", "What is the change in state?"]
In NLP, paraphrasing is generally studied from a machine learning perspective, with notable interest surrounding paraphrase identification for plagiarism detection (El Desouki & Gomaa, 2019; Hunt et al., 2019; Altheneyan & Menai, 2020). Recent advances in language models have shown state-of-the-art performance on this task and the related task of Natural Language Inference (Yang et al., 2019; Devlin et al., 2019). The standard corpus used for evaluation is the Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005), which consists of annotated pairs extracted from news articles. Quora Question Pairs (QQP) (www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) is a larger dataset, formed of questions submitted to the website Quora, and is often used for training models (Imtiaz et al., 2020; Tomar et al., 2017; Li et al., 2019).
Additionally, various methods have been developed to make NLP models more robust to paraphrasing of their input (Ribeiro, Singh & Guestrin, 2018; Minervini & Riedel, 2018; Iyyer et al., 2018). Many of these methods consist of automatically generating variations of the input, feeding each into the model, then ensembling the answers. Dong et al. (2017) perform this via back-translation, while Buck et al. (2018) explore an approach based on an agent which has been trained using reinforcement learning to reformulate the input to maximise the performance of the final model. To enable zero-shot learning on decaNLP, the model should be robust to more complex types of paraphrasing, particularly at the semantic level.
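The generate-then-ensemble idea described above can be sketched as follows. This is a minimal illustration only: `toy_model`, `toy_paraphrase`, and the majority-vote scheme are hypothetical stand-ins, not the cited authors' implementations.

```python
from collections import Counter

def ensemble_answer(model, question, paraphrase_fn, n_variants=5):
    """Query the model with several paraphrases of the question
    and return the most common answer (majority vote)."""
    variants = [question] + [paraphrase_fn(question) for _ in range(n_variants)]
    answers = [model(q) for q in variants]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-ins: a "model" that keys on one surface feature,
# and a trivial "paraphraser" that lowercases the question.
def toy_model(q):
    return "positive" if "positive" in q.lower() else "unknown"

def toy_paraphrase(q):
    return q.lower()

print(ensemble_answer(toy_model, "Is this review Positive or negative?", toy_paraphrase))
```

In practice the variants would come from back-translation or a learned reformulation agent rather than a string transform, and the ensembling may weight answers by model confidence.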
To better categorise paraphrase phenomena, typologies can be constructed based on the understanding of paraphrasing from different fields, primarily theoretical linguistics, discourse analysis, and computational linguistics.
In computational linguistics, typologies are often formed as lists of specific paraphrase mechanisms, grouped into general classes for use in a particular application. Defined at such a low level, these are incomplete descriptions of paraphrasing and cannot be easily transferred to other languages. Vila Rigat et al. (2011) developed a typology specifically with NLP applications in mind. Their approach has been used to tag plagiarism corpora (Barrón-Cedeño et al., 2013) and the influential Microsoft Research Paraphrase Corpus (MSRPC-A) (Vila et al., 2015). The typology consists of 20 paraphrase types and is hierarchical, with paraphrase types grouped by the level at which the change occurs (e.g. morphological, lexical, semantic). Commonly occurring types include Addition/deletion (adding or removing lexical/functional units), Same-polarity substitution (replacing one lexical/functional unit with another of the same meaning), Sentence modality changes (changing the modality of a sentence, e.g. from an imperative command to a question), Synthetic/analytic substitutions (swapping a synthetic for an analytic structure, e.g. "smarter than everyone else" to "the smartest"), and Order (swapping the order of some sentence element, e.g. the order of items in a list).
Multitask learning: Traditionally, machine learning models are trained to perform well on a single task in isolation. This differs greatly from how humans learn new tasks-by relying on prior experience solving related problems. Multitask learning seeks to emulate this process by training models to solve multiple objectives simultaneously.
Methods for multitask learning can be divided into three main schemes: soft parameter sharing, hierarchical sharing, and hard parameter sharing. In soft parameter sharing, each task uses a subset of the parameters, but these are constrained using regularisation techniques (such as the l2 or trace norm) to favour similar values (Duong et al., 2015; Yang & Hospedales, 2017). Hierarchical approaches make explicit use of the theorised relationships between tasks, where some tasks (e.g. named entity recognition) require simple 'low-level' reasoning, and others (e.g. relation extraction) build on this to enable deeper, more complex understanding. These relationships are mirrored in hierarchical approaches, where layers close to the input are used to solve the low-level tasks (Sanh, Wolf & Ruder, 2018; Hashimoto et al., 2017). In hard parameter sharing, a proportion of the parameters are shared between all tasks and the remainder are task-specific, commonly the output layers (Caruana, 1993). This is a stronger form of multitask learning, with whole layers used by multiple tasks.
In this work we focus on decaNLP (McCann et al., 2018), which is a strong form of hard parameter sharing.

DecaNLP
The decaNLP challenge (McCann et al., 2018) frames multiple tasks as question answering, and is an extreme case of hard parameter sharing where all the parameters are shared (without any task-specific parameters). This approach has key advantages, primarily that new tasks can be added without any modification to the model architecture, requiring only changes to the dataset to frame the task as a question. Table 1 shows the 10 tasks included in decaNLP. Each one uses standard, publicly-available datasets and metrics. These metrics are simply summed to give an overall 'decaScore'.

The MQAN network is based on encoder-decoder models for abstractive question answering, notably employing a pointer-generator mechanism for creating the output, a technique commonly applied to summarisation (See, Liu & Manning, 2017). MQAN generalises the pointer mechanism to allow it to construct the output from the question, the context, or an external vocabulary. This modification is particularly important for decaNLP, where the question may contain the class labels (e.g. "Is this review positive, or negative?"), the context may contain key phrases (as in summarisation), or words may only be available in the vocabulary (as in translation). The encoder uses BiLSTMs, dual co-attention, and self-attention to encode the question and context sequences, ensuring that long-term dependencies are captured and information is shared between the sequences. The full details of this model can be found in McCann et al. (2018).

Raffel et al. (2020) build on decaNLP to explore a similar text-to-text paradigm using a transformer model trained with simple keyword prompts (e.g. "summarize:", "cola sentence:"). This model (named T5) was constructed after a series of experiments comparing different architectures, unsupervised objectives, and multitask-learning strategies. The final model organises its transformer blocks in an encoder-decoder configuration.
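As a minimal illustration of how the decaScore aggregates results, the sketch below sums per-task metrics; the task names are real decaNLP tasks, but the metric values are hypothetical, not reported numbers.

```python
# Hypothetical per-task metric values (illustrative only); each task
# keeps its own native metric, and the decaScore is their plain sum.
task_metrics = {
    "SQuAD": 74.3,    # normalised F1
    "IWSLT": 18.7,    # BLEU
    "CNN/DM": 24.6,   # ROUGE
    "MNLI": 72.0,     # exact match
    "SST": 86.4,      # exact match
}

# No weighting or normalisation across tasks: an unweighted sum.
deca_score = sum(task_metrics.values())
print(round(deca_score, 1))
```

Because the metrics are summed without normalisation, tasks whose metrics span a larger range contribute more to the overall score.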

METHODOLOGY
We present our methodology for the two parts of our work: 1. Our new PQ-decaNLP dataset which we use to analyse how the existing models perform when provided with paraphrased questions, and 2. Proposed improvements to the model training to increase the performance on paraphrased questions.

The PQ-decaNLP dataset
We create a paraphrased version of decaNLP questions: PQ-decaNLP, using the crowdsourcing platform Amazon Mechanical Turk (www.mturk.com). Workers were given a description of a decaNLP task and were asked to provide five paraphrases of the fixed question. For tasks which have instance-specific information, for example NLI, where the hypothesis is embedded in the questions, we transform them into generic templates (Fig. 3A).
Of the 10 decaNLP tasks, we limit our work to the seven tasks which have a fixed question template, removing Question Answering, Semantic Role Labelling, and Zero-Shot Relation Extraction, where the question differs for every example. Techniques for improving the robustness of these excluded tasks are not decaNLP-specific and have been widely studied elsewhere (Fader, Zettlemoyer & Etzioni, 2013; Bordes, Chopra & Weston, 2014; Dong et al., 2017).
The submitted paraphrases were then checked using the open source project LanguageTool (www.languagetool.org) for spelling and grammar mistakes. Additionally, via a manual review, we removed paraphrases which did not preserve the meaning of the original question, were ungrammatical, or were duplicates. We accepted 73.1% of the paraphrases, rejecting 3.7% due to grammatical errors, 2.9% due to duplication, and 20.3% which were not paraphrases of the original. Figure 3B shows examples of paraphrases for the summarisation task. We collect 100 paraphrases per task to ensure a variety of paraphrase types while minimising duplication. The resulting 700 paraphrases (100 per task) are split 70/30 into train/test sets. Figure 4A shows the distribution of question lengths for the paraphrase corpus compared with the original decaNLP questions. We see that the majority of the paraphrases are longer than the original fixed question, suggesting that authors tend to add complexity when paraphrasing. Figure 4B shows edit distances, with paraphrases of MNLI differing most from the original question. For evaluating the models, we define the PQ-decaScore as the sum of the task-specific metrics over the seven tasks that we consider on the PQ-decaNLP dataset, analogous to how the decaScore of McCann et al. (2018) is defined over the full set of 10 tasks.
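Edit distances such as those reported in Fig. 4B can be computed with a standard word-level Levenshtein distance. The sketch below is one plausible implementation under that assumption; the example questions are illustrative, not corpus entries.

```python
def levenshtein(a, b):
    """Word-level edit distance between two questions, computed with
    the classic dynamic-programming recurrence (one row at a time)."""
    a, b = a.split(), b.split()
    prev = list(range(len(b) + 1))          # distances from the empty prefix
    for i, wa in enumerate(a, 1):
        curr = [i]                           # deleting i words from a
        for j, wb in enumerate(b, 1):
            cost = 0 if wa == wb else 1      # substitution cost
            curr.append(min(prev[j] + 1,     # deletion
                            curr[j - 1] + 1, # insertion
                            prev[j - 1] + cost))
        prev = curr
    return prev[-1]

original = "What is the summary?"
paraphrase = "Can you summarise the passage above?"
print(levenshtein(original, paraphrase))
```

A character-level variant (or a distance normalised by question length) would change the absolute numbers but not the relative ordering of most paraphrases.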

Annotation
To gain an understanding of exactly which kinds of paraphrasing reduce the performance of the models, we hand-annotate the PQ-decaNLP test set using the typology of Vila, Martí & Rodríguez (2014). As our dataset exclusively contains questions and imperative statements, we only observe a subset of the paraphrase phenomena.

Proposed improvements
Investigating the performance of the models on the PQ-decaNLP paraphrase questions, we find lower scores across all tasks, indicating the models are not robust to paraphrasing of the question. These results and analysis are presented in the Results and Discussion section.
To enhance the robustness of the models, we propose several improvements. As our focus is the exploration of the existing models, we restrict our scope to modifications of the data (adding or modifying decaNLP tasks) rather than of the model architectures themselves. The MQAN model is trained on all ten decaNLP tasks using an anti-curriculum strategy, where the model is first trained on the SQuAD dataset alone (phase one) before sampling batches 'round-robin' from all the tasks in a fixed order (phase two). We use the t5-base version of the T5 model, pretrained on the C4 corpus and fine-tuned following the procedure in Raffel et al. (2020).
Training on PQ-decaNLP: Our first method is to directly train the model on PQ-decaNLP. For each example in the decaNLP training set, we perform uniformly distributed random selections to pick a question from the PQ-decaNLP training set to replace the fixed question. This directly trains the model to consider different paraphrases of the question.
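This question-replacement step can be sketched as follows; the task name, example questions, and data layout are hypothetical, chosen only to illustrate the uniform sampling.

```python
import random

# Hypothetical PQ-decaNLP training pool for one task
# (the questions below are illustrative, not corpus entries).
pq_train = {
    "summarisation": [
        "What is the summary?",
        "Summarise the text above.",
        "Give a short summary of the passage.",
    ],
}

def paraphrase_questions(examples, pq_pool, seed=0):
    """Replace each example's fixed question with a uniformly sampled
    paraphrase from the PQ-decaNLP training pool for that task."""
    rng = random.Random(seed)
    out = []
    for task, context, answer in examples:
        question = rng.choice(pq_pool[task])  # uniform over the pool
        out.append((question, context, answer))
    return out

examples = [("summarisation", "Some long article ...", "Short summary.")]
augmented = paraphrase_questions(examples, pq_train)
print(augmented[0][0] in pq_train["summarisation"])
```

Because the sampling is per-example, the same context can appear with different questions across epochs, which is what exposes the model to question variation.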
Adding paraphrase tasks: Secondly, we propose to exploit the multitasking abilities of the models by adding a new task to indirectly teach the model about paraphrasing in general. To do this, we introduce a paraphrase detection task (identifying whether two sentences are a paraphrase pair), or a paraphrase generation task (generating a paraphrase of the given sentence).
Introducing a new task rather than changing the data of existing tasks has the advantage of preserving the ease of extending decaNLP to additional tasks in the future. Using this approach, new tasks can still be added with fixed questions as before. We need only a dataset of general paraphrase pairs.
For paraphrase detection we ask the question: "[paraphraseCandidate1] - paraphrase, or nonparaphrase?", and provide [paraphraseCandidate2] as the context (where [paraphraseCandidate1] and [paraphraseCandidate2] are the two sequences in a possible paraphrase pair). We expect that, similarly to the existing decaNLP tasks of SST and NLI, the model will learn to select the output classes 'paraphrase' and 'nonparaphrase' from the question. Paraphrase generation is framed as a sequence-to-sequence task using the question "What is the paraphrase?" and [paraphraseCandidate1] as the context; we train the model with the target of [paraphraseCandidate2].
Additionally, we experiment with variants of these tasks which do not explicitly instruct the model to perform paraphrase detection/generation in the question (the 'without task information' setting). For detection, we use [paraphraseCandidate1] as the question and [paraphraseCandidate2] as the context, and train the model to output 'yes' or 'no' from the external vocabulary. For paraphrase generation, we use [paraphraseCandidate1] as the question with a blank context and train the model to generate [paraphraseCandidate2]; this directly trains the model to paraphrase the entire question. These are not valid decaNLP tasks but can be used as pretraining tasks, with the task-specific information given by the answer the task is supervised on. Table 2 shows examples of these formulations with a sample paraphrase pair.
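The two formulations can be sketched as simple example builders. The field names and helper functions below are our own illustrative choices, not part of decaNLP or its codebase.

```python
def detection_example(cand1, cand2, label, with_task_info=True):
    """Build a paraphrase-detection example in either formulation.
    `label` is True when (cand1, cand2) is a paraphrase pair."""
    if with_task_info:
        question = f'"{cand1}" - paraphrase, or nonparaphrase?'
        answer = "paraphrase" if label else "nonparaphrase"
    else:
        question = cand1                      # no task wording at all
        answer = "yes" if label else "no"     # drawn from external vocabulary
    return {"question": question, "context": cand2, "answer": answer}

def generation_example(cand1, cand2, with_task_info=True):
    """Build a paraphrase-generation example in either formulation."""
    if with_task_info:
        return {"question": "What is the paraphrase?",
                "context": cand1, "answer": cand2}
    return {"question": cand1, "context": "", "answer": cand2}

ex = detection_example("How do you start a bakery?",
                       "How can one start a bakery business?", label=True)
print(ex["answer"])
```

In the 'with task information' setting the class labels appear in the question, so a pointer-style model can copy them; in the 'without' setting they must come from the external vocabulary.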

RESULTS AND DISCUSSION
To examine the robustness to paraphrasing, we evaluate on the PQ-decaNLP dataset. In Table 3 we observe a decrease in scores across many tasks, with the MQAN model showing the largest decreases in performance for IWSLT, MNLI, CNN/DM, and SST, and the T5 model showing the largest decreases for CNN/DM and MWSC. Trained on the original decaNLP dataset, the T5 model outperforms MQAN, suggesting a transformer-based language model is better suited to the decaNLP tasks. Additionally, the base T5 model is more robust to paraphrasing, losing only 42.1 of its total score compared with the 140.1 lost by MQAN.
We find that MQAN trained on PQ-decaNLP (PQ-decaNLP trained) reduces this drop across all tasks except MWSC. We hypothesise that the lack of improvement in MWSC is because the original question, "{mainQuestion} {choice1} or {choice2}", already varies greatly between examples in the dataset: only the word 'or' separating the two choices is constant. For WikiSQL, we also find an improvement of 6.9 lfEM (logical form exact match) on the original dataset, suggesting that this task benefits from more varied questions.

Table 2: Example of paraphrase question formulation.

With task information:
Detection - Question: "How do you start a bakery?" - paraphrase, or nonparaphrase? Context: "How can one start a bakery business?" Answer: "paraphrase" or "nonparaphrase"
Generation - Question: What is the paraphrase? Context: "How do you start a bakery?" Answer: "How can one start a bakery business?"

Without task information:
Detection - Question: "How do you start a bakery?" Context: "How can one start a bakery business?" Answer: "yes" or "no"
Generation - Question: "How do you start a bakery?" Context: (blank) Answer: "How can one start a bakery business?"

Table 3: Validation metrics for the decaNLP and PQ-decaNLP datasets. We show paraphrase detection and generation in settings which indicate the task in the question (with task information), and in those where the task is only indicated by the supervision (without task information). The last model is trained only on PQ-decaNLP questions (PQ-decaNLP trained). We report different metrics for each task as described in Table 1. The decaScore and PQ-decaScore reported here are the sums of the task-specific metrics.

When adding paraphrase detection or generation as an additional task, we find the MQAN model is able to learn these new tasks, achieving 85.7 F1 and 31.4 BLEU respectively. We find that while the new tasks have little impact on the performance on the original decaNLP data (some scores are slightly higher), the resulting models perform worse than the original MQAN on PQ-decaNLP. This suggests that the knowledge learnt about paraphrasing does not improve the robustness of MQAN to paraphrasing. Adding these new tasks significantly harms the performance of the T5 model on the original decaNLP data.

To better understand how the models behave on paraphrased questions, we conduct a range of analyses. We find only a weak negative correlation between the edit distance of the paraphrase (compared with the original question) and the score (R = −0.2373 for MQAN). This suggests that while paraphrases which deviate further from the original question tend to perform worse, other factors such as the type of paraphrase may also be significant. Table 4 shows the average difference in performance for paraphrases where a paraphrase phenomenon is present compared to those where the phenomenon is not present.
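The reported correlation is a standard Pearson coefficient; the sketch below computes it from scratch, with illustrative data rather than the actual PQ-decaNLP edit distances and scores.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples.
    (No guard against zero-variance inputs; this is a sketch.)"""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative data only: per-paraphrase edit distances and scores.
edit_distances = [1, 2, 3, 5, 8]
scores = [70.0, 68.0, 71.0, 60.0, 58.0]
print(round(pearson_r(edit_distances, scores), 2))
```

A weak R (such as the −0.24 reported above) means edit distance alone explains little of the variance, which is what motivates the per-phenomenon breakdown in Table 4.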

Impact of paraphrase phenomena
Paraphrased questions which contain an 'Order' annotation perform worse in classification tasks with fixed labels (MNLI, SST) than other paraphrases when using MQAN. The 'Order' tag occurs in 73% of the sentiment analysis task (SST), primarily in the swapping of the class labels 'positive' and 'negative'. We find that manually swapping the class labels back to the same order as the original question increased the performance by 38.2 (to 61.6), suggesting the model is memorising the position of the labels rather than their semantics. This sensitivity to label order harms the ability of MQAN to perform zero-shot learning. The T5 model loses less performance on MNLI and SST, and we no longer find that 'Order' paraphrases cause the largest decrease in performance, suggesting the trained T5 model is much more robust to label position and attends to the labels' meaning.
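The label-swapping probe can be sketched as a simple string transformation. This is a simplified illustration of the idea (restoring the original label order in a question) rather than the exact procedure applied to the dataset.

```python
def swap_labels(question, label_a="positive", label_b="negative"):
    """Swap the two class labels in a question, e.g. to restore the
    label order of the original fixed question. Uses a placeholder
    so the two replacements do not clobber each other."""
    placeholder = "\x00"
    return (question.replace(label_a, placeholder)
                    .replace(label_b, label_a)
                    .replace(placeholder, label_b))

print(swap_labels("Is this review negative or positive?"))
```

If a model's score recovers after this transformation (as MQAN's does, by 38.2 points), it is reacting to where the labels sit in the question rather than to what they mean.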
An interesting aspect of the original decaNLP framing is that English-to-SQL translation is framed as a question ("What is the translation from English to SQL?"), whereas English-to-German translation is framed as an imperative command ("Translate from English to German."). We see from Table 4 that MQAN performs especially poorly on WikiSQL and IWSLT paraphrases which contain a change in sentence modality. Inspecting the answers for these WikiSQL cases reveals that the model outputs German words, indicating confusion between English-SQL translation and English-German translation. This suggests the model relies overly on the indicators of sentence modality ("What is the", "translate"/"translation") rather than on the source and target languages. Again we see that the T5 model is more resilient to changes of sentence modality.
We find that the models trained on PQ-decaNLP have a smaller range of performance, suggesting they perform similarly across all paraphrase types.
We find no correlation between the number of phenomena present and the score.

Analysing pointers
To gain better insight into why MQAN fails, we analyse where the model copies its answers from: the question, the context, or the external vocabulary, which we present in Fig. 6. When asked paraphrased questions, we see a decrease in copying from the question. For translation (IWSLT) we see an increase in copying from the context, and for semantic parsing (WikiSQL) we see an increase in copying from the external vocabulary. These shifts indicate confusion between these tasks.

CONCLUSION
In this work we explore how robust text-to-text models are to paraphrasing of the questions they are asked. We introduce a diagnostic corpus annotated with paraphrase phenomena and show how simple transformations such as changing the label order and altering the sentence modality can harm performance. We believe that the creation of similar typology-annotated corpora will provide useful insights into the robustness to paraphrasing of many models across NLP. Additionally, we find that training models on paraphrased questions improves their robustness to paraphrasing. We find that knowledge learnt from adding the tasks of paraphrase generation or paraphrase detection does not transfer to increased robustness in other tasks for either model.
We hope that the paraphrase corpus of decaNLP questions will encourage further research into more robust multitask question answering models.

ADDITIONAL INFORMATION AND DECLARATIONS Funding
This work made use of the facilities of the N8 Centre of Excellence in Computationally Intensive Research (N8 CIR) provided and funded by the N8 research partnership and EPSRC (Grant No. EP/T022167/1). The Centre is co-ordinated by the Universities of Durham, Manchester and York. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.