The NYU System for the CoNLL–SIGMORPHON 2018 Shared Task on Universal Morphological Reinflection

This paper describes the NYU submission to the CoNLL–SIGMORPHON 2018 shared task on universal morphological reinflection. Our system participates in the low-resource setting of Task 2, track 2, i.e., it predicts morphologically inflected forms in context: given a lemma and a context sentence, it produces a form of the lemma which might be used at an indicated position in the sentence. It is based on the standard attention-based LSTM encoder-decoder model, but makes use of multiple encoders to process all parts of the context as well as the lemma. In the official shared task evaluation, our system obtains the second best results out of 5 submissions for the competition it entered and strongly outperforms the official baseline.


Introduction
The extreme type sparsity in text in a morphologically rich language, i.e., a language which relies strongly on changes in the surface form of words to express properties like gender, tense or number, requires natural language processing (NLP) systems which are able to handle inflected words in a systematic way. The SIGMORPHON and CoNLL–SIGMORPHON shared tasks on morphological reinflection, which have been held since 2016 (Cotterell et al., 2016, 2017a), encourage the development of computational models for inflection in a large number of languages. This year's edition (Cotterell et al., 2018) features two different tasks. The datasets for Task 1 consist of triplets of lemma, morphological tag (also called the "target tag") and the corresponding inflected form, which is given for training and should be produced at test time. This is the standard inflection setup which has also been the subject of the shared tasks in previous years. Task 2, in contrast, is split into two different subtasks (called "tracks"), both focused on inflection in context. Here, a sentence is given, and an inflected form, of which only the lemma is known, should be produced for an indicated position in that sentence. The setup of the first subtask assumes that the lemmas and tags of all surrounding words are available and can be used for prediction. These might be used as desired; e.g., the tags of the previous and next words are often strong indicators for the (unknown) tag of the form to be produced. Track 2, on the other hand, requires systems to produce inflected forms only from their lemma and the inflected context words; no tags or lemmas are given for the context. Thus, track 2 is both a more realistic and a harder version of track 1. All tasks and tracks feature 3 different settings: a low-resource setting (LOW), a medium-resource setting (MEDIUM) and a high-resource setting (HIGH).
In this paper, we describe the New York University (NYU) submission to the CoNLL–SIGMORPHON 2018 shared task on universal morphological reinflection. The system we submitted was exclusively designed for Task 2, track 2, LOW. Thus, we only focus on this particular competition and do not report numbers for other setups (though, in theory, every system which works for track 2 of Task 2 can also produce output for track 1; the same holds true for LOW/MEDIUM/HIGH). Overall, our system obtains the second highest test accuracy out of 5 submitted systems and outperforms the official shared task baseline by a wide margin.

Morphological Inflection in Context
The system presented in this paper is designed for morphological inflection in context, i.e., predicting an inflected form which fits an indicated position in a sentence, given its lemma. Here, we will describe the task in a more formal way.
Let T = {t_1, ..., t_m} be the set of morphological tags being expressed in a language and w a lemma in the same language. We then define the morphological paradigm π of w as follows:

π(w) = {(f_k[w], t_k)}_{k=1,...,m}

Here, f_k[w] denotes the inflected form which corresponds to tag t_k, and both w and f_k[w] are strings consisting of letters from an alphabet Σ.
Note that, even though we follow the convention of describing word forms as functions of the lemma, in the vast majority of cases each inflected form is uniquely defined given any other word form of the same paradigm together with its morphological tag.
The task of morphological inflection consists of predicting a target form f_i[w] from a paradigm, given the lemma w, as well as the tag t_i of the target form.
Building on this, the task of morphological inflection in context consists of predicting a target form f_i[w] from the lemma w, as well as the context c, i.e., the sentence surrounding the target form. For the track of the shared task we are interested in, the context consists of inflected forms. Further, this task is ambiguous: for many languages, several morphological tags and, thus, inflected forms are often acceptable for a given context.
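One way to write the two prediction problems compactly is as follows (this notation is ours, not taken from the shared task description; note that the target tag t_i is not observed in track 2, which is exactly the source of the ambiguity):

```latex
% Plain inflection (Task 1): lemma and target tag are observed
\hat{f}_i[w] = \operatorname*{arg\,max}_{f \in \Sigma^*} \; p\left(f \mid w, t_i\right)

% Inflection in context (Task 2, track 2): only the lemma and the
% inflected context forms c are observed; the target tag is latent,
% so several forms may receive high probability
\hat{f}_i[w] = \operatorname*{arg\,max}_{f \in \Sigma^*} \; p\left(f \mid w, c\right)
```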

Model Description
Our model is based on the standard LSTM encoder-decoder model with an attention mechanism (Bahdanau et al., 2015). Following several previous approaches (cf. Section 5), we apply it at the character level, i.e., the input to the system is the character sequence of the input lemma, represented by embeddings. The output is the (predicted) character sequence of the inflected form.
Additionally, we include the sentence context as follows: given a sentence s = [w_1, w_2, ..., w_{i−1}, l, w_{i+1}, ..., w_n], where l is the lemma of the inflected form of interest and the w_j with j ≠ i are the surrounding context words, we split the past context c_prev = [w_1, w_2, ..., w_{i−1}] and the future context c_fut = [w_{i+1}, ..., w_n] into subword units using byte pair encoding (BPE; Sennrich et al., 2016). We then use two additional encoders to encode the sequences of subword units of both contexts.
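As a concrete illustration, the context split and a toy greedy BPE segmenter might look as follows. The helper names and the tiny merge table are ours; the actual system would use a trained BPE model (e.g., via subword-nmt) rather than this hand-written one:

```python
def split_contexts(sentence, lemma_index):
    """Split a tokenized sentence into past and future context around the lemma."""
    c_prev = sentence[:lemma_index]
    c_fut = sentence[lemma_index + 1:]
    return c_prev, c_fut

def apply_bpe(words, merges):
    """Toy greedy BPE: repeatedly apply the first matching merge within each word."""
    segmented = []
    for word in words:
        symbols = list(word)
        changed = True
        while changed:
            changed = False
            for i in range(len(symbols) - 1):
                if (symbols[i], symbols[i + 1]) in merges:
                    symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
                    changed = True
                    break
        segmented.extend(symbols)
    return segmented

sentence = ["the", "dogs", "barked", "loudly"]
c_prev, c_fut = split_contexts(sentence, 1)  # the target form sits at position 1
merges = {("e", "d"), ("l", "o")}            # hypothetical learned merges
print(apply_bpe(c_fut, merges))  # ['b', 'a', 'r', 'k', 'ed', 'lo', 'u', 'd', 'l', 'y']
```

The two resulting subword sequences are what the past- and future-context encoders consume.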
Using bidirectional encoders, the final hidden state produced by each encoder is the concatenation of the respective forward and backward hidden states, computed over emb_z, where emb_z represents the respective sequence of embeddings, i.e., either the embeddings of the lemma's characters or the embeddings of the subword units of either context. Our model then uses 3 attention mechanisms, one for each encoder, to produce a context vector for each output position: H_t for the lemma, H_t^p for the past context and H_t^f for the future context. The input to the decoder LSTM at each timestep is the concatenation of all context vectors and the embedding of the last output character. Embeddings are shared between the character encoder and the decoder; BPE embeddings are shared between the two context encoders.
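The per-step combination of the three attention contexts with the last output character's embedding can be sketched as below. We substitute simple dot-product attention for the Bahdanau-style mechanism, and all dimensions and values are illustrative:

```python
import numpy as np

def attend(query, encoder_states):
    """Dot-product attention: weight encoder states by similarity to the query."""
    scores = encoder_states @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ encoder_states  # context vector, same width as one state

rng = np.random.default_rng(0)
dim = 4
# Hidden states of the three encoders: lemma characters, past and future BPE units.
H_lemma, H_past, H_fut = (rng.normal(size=(n, dim)) for n in (5, 7, 6))
query = rng.normal(size=dim)           # current decoder state
last_char_emb = rng.normal(size=dim)   # embedding of the last output character

# One context vector per encoder, then the concatenated decoder input.
contexts = [attend(query, H) for H in (H_lemma, H_past, H_fut)]
decoder_input = np.concatenate(contexts + [last_char_emb])
print(decoder_input.shape)  # (16,)
```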
An overview of our model architecture is shown in Figure 1. Our final system is an ensemble of 5 random restarts of the model, combined via majority voting.
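Majority voting over the restarts reduces to picking the most frequent prediction per target position; a minimal sketch (tie-breaking behavior is our choice, not specified in the paper):

```python
from collections import Counter

def majority_vote(predictions):
    """Pick the form predicted by most ensemble members.

    Ties resolve in favor of the earliest restart, since Counter preserves
    insertion order and most_common's sort is stable.
    """
    return Counter(predictions).most_common(1)[0][0]

# Five restarts predict forms for one target position:
print(majority_vote(["barked", "barks", "barked", "bark", "barked"]))  # barked
```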

Training and Hyperparameters
Using the shared task development sets, we decide on the following hyperparameters: we employ 100-dimensional BPE and character embeddings, and the encoder and decoder hidden states are 300-dimensional. Dropout (Srivastava et al., 2014) is used with a probability of 0.5 for all hidden states when used as input to the next layer, as well as for the embedding layer. For training, we employ ADAM (Kingma and Ba, 2014). Whenever performance does not improve for 20 steps, we halve the learning rate and restart from the best performing model. Training stops when the learning rate gets below 0.0001; the best performing model is used for the final predictions. We do not use batching, since it hurts performance in our experiments on the development sets.
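The halving schedule implies a short, fixed sequence of learning rates. Assuming an initial learning rate of 0.001 (the paper does not state the initial value; 0.001 is Adam's common default), the sequence looks like this:

```python
def lr_schedule(initial_lr=0.001, min_lr=0.0001):
    """Yield the learning rates visited under repeated halving.

    Each halving would be triggered by 20 evaluations without improvement;
    training stops once the rate drops below min_lr.
    """
    lr = initial_lr
    while lr >= min_lr:
        yield lr
        lr /= 2

print(list(lr_schedule()))  # [0.001, 0.0005, 0.00025, 0.000125]
```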
For decoding, we apply beam search with a beam of width 5.
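Beam search over the decoder's character distributions can be sketched as follows. The toy next-character model here stands in for the real LSTM decoder scores; everything about it is hypothetical:

```python
import math

def beam_search(step_probs, beam_width=5, eos="</s>"):
    """Toy beam search: step_probs maps a prefix (tuple of symbols) to a
    dict of next-symbol probabilities; returns the best finished string."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    finished = []
    while beams:
        candidates = []
        for prefix, score in beams:
            for sym, p in step_probs(prefix).items():
                hyp = (prefix + (sym,), score + math.log(p))
                (finished if sym == eos else candidates).append(hyp)
        # keep only the beam_width best unfinished hypotheses
        beams = sorted(candidates, key=lambda h: h[1], reverse=True)[:beam_width]
    best_prefix, _ = max(finished, key=lambda h: h[1])
    return "".join(best_prefix[:-1])  # drop the end-of-sequence symbol

def toy_model(prefix):
    """Hypothetical next-character distributions for one target form."""
    table = {
        (): {"b": 0.9, "s": 0.1},
        ("b",): {"a": 1.0},
        ("b", "a"): {"r": 1.0},
        ("b", "a", "r"): {"k": 0.7, "</s>": 0.3},
        ("b", "a", "r", "k"): {"s": 0.6, "</s>": 0.4},
        ("b", "a", "r", "k", "s"): {"</s>": 1.0},
        ("s",): {"</s>": 1.0},
    }
    return table[prefix]

print(beam_search(toy_model))  # barks
```

With width 5 and these probabilities, the short hypothesis "s" is kept in the beam but ultimately loses to "barks" on total log-probability.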

Datasets
The data for Task 2, track 2, LOW consists of sentences taken from the Universal Dependencies (UD) treebanks (Nivre et al., 2017). All context forms, as well as the lemma of the target inflected form, are given for each sentence. Training and development sets feature exactly one correct target form, while, for the test set, additional plausible target forms have been manually provided by the shared task organizers (Cotterell et al., 2018).
The languages we experiment on are German, English, Spanish, Finnish, French, Russian and Swedish.

Baseline System
The official baseline system of the shared task is a character-level LSTM encoder-decoder model with attention (Bahdanau et al., 2015). The main input to the system is the lemma of the inflected form which is to be generated. Further, the context is taken into account: each character of the lemma is concatenated with 7 additional embeddings, representing (i) the lemma of the word at the previous position in the sentence, (ii) the previous word itself, (iii) the tag of the previous word, (iv) the lemma of the word at the next position in the sentence, (v) the next word itself, (vi) the tag of the next word, and (vii) the lemma of the inflected form to generate, and the result is given to the encoder. Note that, since no tags or lemmas are available for track 2 of Task 2, but the architecture is identical to that used for track 1 of the same task, all embeddings but those for the previous word, the next word, and the lemma are set to default vectors.
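A sketch of how the baseline assembles its encoder input in track 2, where the tag and context-lemma slots fall back to default vectors. Dimensions, vocabulary, and embedding values are illustrative, not taken from the shared-task code:

```python
import numpy as np

DIM = 8
rng = np.random.default_rng(0)
vocab = ["g", "o", "go", "dogs", "loudly"]
embed = {s: rng.normal(size=DIM) for s in vocab}
DEFAULT = np.zeros(DIM)  # stands in for the unavailable tags and context lemmas

def encoder_input(lemma, prev_word, next_word):
    """One vector per lemma character: the character embedding concatenated
    with the 7 context slots described above."""
    context = [
        DEFAULT, embed[prev_word], DEFAULT,  # (i) prev lemma, (ii) prev word, (iii) prev tag
        DEFAULT, embed[next_word], DEFAULT,  # (iv) next lemma, (v) next word, (vi) next tag
        embed[lemma],                        # (vii) lemma of the form to generate
    ]
    return [np.concatenate([embed[ch]] + context) for ch in lemma]

vecs = encoder_input("go", "dogs", "loudly")
print(len(vecs), vecs[0].shape)  # 2 (64,)
```

Each encoder timestep thus sees an 8 × DIM vector: the character plus all seven context slots.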
Given the character embedding-context representations produced by the encoder, the LSTM decoder generates the character sequence of the output inflected form, using an attention mechanism.
More details on the shared task baseline system can be found in Cotterell et al. (2018).

Official Test Results
Two official results are reported. First, system performance is calculated by taking only the gold solution into account, i.e., all generated inflected forms that do not match the UD gold standard are counted as wrong. Second, performance is computed by taking all plausible target inflected forms into account, i.e., all forms that could be correct in some reading of the sentence are accepted as correct. The final results for all systems are shown in Tables 1 and 2, respectively. As can be seen, the baseline performs poorly in the low-resource setting we consider here. In particular, its accuracy is far worse than that of any participating system. Looking at our system's performance, we can see that it is the second best one for German, English, Finnish, French, and Swedish, as well as on average, when only considering the gold solution. Taking all plausible forms into account, our system obtains the second highest accuracy for German, English, Finnish, and Swedish, as well as on average (no results with all plausible forms are available for French). The best performing system on average is UZH, and CPH outperforms our model for Spanish, French and Russian for gold solutions, and for Spanish and Russian for all plausible forms. BME-HAS and CUB perform worse than our system for all languages.

A final observation is that the accuracy difference between the evaluation with the gold solution and the evaluation with all plausible forms ranges from 0.92 to 6.49, depending on the system.

Related Work
Most recent work on morphological reinflection was done in the context of the SIGMORPHON 2016 and the CoNLL–SIGMORPHON 2017 shared tasks.
The first edition of the shared task in 2016 (Cotterell et al., 2016) resulted in 3 different types of systems: "pipeline approaches" (unsupervised alignment algorithms applied to the source-target pairs, followed by a model which predicts edit operations), "neural approaches", and "linguistically inspired systems". The winning system was a neural network, namely a character-based RNN encoder-decoder model with attention, similar to the one we use here (Kann and Schütze, 2016). Hence, neural models gained popularity in the 2017 edition of the shared task (Cotterell et al., 2017a). In 2017, explicit low-resource settings were first introduced to the shared task. These settings demonstrated the effectiveness of hard attention in neural sequence-to-sequence models if training data are limited (Makarov et al., 2017).
Research not immediately done for the shared tasks included papers on multi-source reinflection (Cotterell et al., 2017b; Kann et al., 2017a), cross-lingual transfer for reinflection (Kann et al., 2017b), and first attempts at neural systems which make use of context for lemmatization (Bergmanis and Goldwater, 2018).

Conclusion
We presented the NYU system for Task 2, track 2, LOW of the CoNLL–SIGMORPHON 2018 shared task on universal morphological reinflection. The system was designed for the task of morphological inflection in context: it predicts an inflected form for an indicated position in a sentence, given the sentence context and the lemma. In the official evaluation, which consisted of experiments in German, English, Spanish, Finnish, French, Russian and Swedish, our system was the second best performing one out of 5 submissions.

Figure 1: Overview of our employed model architecture.