Explicitly modeling case improves neural dependency parsing

Neural dependency parsing models that compose word representations from characters can presumably exploit morphosyntax when making attachment decisions. How much do they know about morphology? We investigate how well they handle morphological case, which is important for parsing. Our experiments on Czech, German and Russian suggest that adding explicit morphological case—either oracle or predicted—improves neural dependency parsing, indicating that the learned representations in these models do not fully encode the morphological knowledge that they need, and can still benefit from targeted forms of explicit linguistic modeling.


Introduction
Parsing morphologically rich languages (MRLs) is difficult due to the complex relationship between syntax and morphology. But the success of neural networks offers an appealing solution to this problem: computing word representations from characters. Character-level models (Ling et al., 2015; Kim et al., 2016) learn relationships between similar word forms and have been shown to be effective for parsing MRLs (Ballesteros et al., 2015; Dozat et al., 2017; Shi et al., 2017; Björkelund et al., 2017). Does that mean that we can do away with explicit modeling of morphology altogether? Consider two challenges in parsing MRLs raised by Tsarfaty et al. (2010, 2013):

• Can we represent words abstractly so as to reflect shared morphological aspects between them?

• Which types of morphological information should we include in the parsing model?

It is tempting to hypothesize that character-level models effectively solve the first problem. For the second, Tsarfaty et al. (2010) and Seeker and Kuhn (2013) reported that morphological case is beneficial across morphologically rich languages with extensive case systems, where case syncretism is pervasive and often hurts parsing performance. But these studies focus on vintage parsers; do neural parsers with character-level representations also solve this second problem?
We attempt to answer this question by asking whether an explicit model of morphological case helps dependency parsing, and our results show that it does. Furthermore, a pipeline model in which we feed predicted case to the parser outperforms multi-task learning in which case prediction is an auxiliary task. These results suggest that neural dependency parsers do not adequately infer this crucial linguistic feature directly from the input text.

Dependency Parsing Model
We use a neural graph-based dependency parser similar to those of Kiperwasser and Goldberg (2016) and Zhang et al. (2017) for all our experiments. We treat the parser as a black box and experiment only with its input representations. Let w = w_1, ..., w_{|w|} be an input sentence of length |w|, and let w_0 denote an artificial ROOT token. For each input token w_i, we compute a context-independent representation e(w_i) with a bidirectional LSTM (bi-LSTM) over its characters. We concatenate the result with the token's part-of-speech (POS) representation t_i:

x_i = [e(w_i); t_i]

We then feed x_i to a word-level bi-LSTM encoder to learn a contextual word representation w_i. The model uses these representations to compute the probability p(h_i, ℓ_i | w, i) of the head h_i ∈ {0, ..., |w|} \ {i} and label ℓ_i of word w_i.
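The input side of this encoder can be sketched as follows. This is a minimal illustration, not the authors' implementation: all dimensions, the use of the final bi-LSTM states as e(w_i), and the class and variable names are assumptions for the sketch.

```python
import torch
import torch.nn as nn

class WordEncoder(nn.Module):
    """Sketch of the parser's input encoder: a character bi-LSTM composes
    e(w_i), which is concatenated with a POS embedding t_i to form
    x_i = [e(w_i); t_i]; a word-level bi-LSTM then yields contextual
    representations. Sizes are illustrative, not from the paper."""

    def __init__(self, n_chars, n_tags, char_dim=24, pos_dim=12, hidden=32):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_lstm = nn.LSTM(char_dim, hidden,
                                 bidirectional=True, batch_first=True)
        self.pos_emb = nn.Embedding(n_tags, pos_dim)
        # x_i = [e(w_i); t_i] has size 2*hidden + pos_dim
        self.word_lstm = nn.LSTM(2 * hidden + pos_dim, hidden,
                                 bidirectional=True, batch_first=True)

    def forward(self, char_ids_per_word, tag_ids):
        # char_ids_per_word: list of 1-D LongTensors, one per token
        word_vecs = []
        for chars in char_ids_per_word:
            out, _ = self.char_lstm(self.char_emb(chars).unsqueeze(0))
            # e(w_i): last forward state and first backward state
            half = out.size(2) // 2
            e_w = torch.cat([out[0, -1, :half], out[0, 0, half:]])
            word_vecs.append(e_w)
        x = torch.cat([torch.stack(word_vecs),
                       self.pos_emb(tag_ids)], dim=-1).unsqueeze(0)
        w_ctx, _ = self.word_lstm(x)  # contextual word representations w_i
        return w_ctx.squeeze(0)
```

With the sizes above, a three-token sentence yields a (3, 64) matrix of contextual representations, which the (black-box) scoring component would consume.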

Experiments
We experiment with three fusional languages with extensive case systems: Czech, German, and Russian (for each language, we report the number of training sentences). We consider four forms of input (e(w_i), §2): word (embedding), characters, characters with gold case, and characters with predicted case. For the latter two, we append the case label to the character sequence; e.g., b, a, t, Acc represents bat with accusative case. Using the same method, we also supply the gold full morphological analysis, to tease out the importance of case specifically. Finally, we experiment with multi-task learning (MTL; Søgaard and Goldberg, 2016; Coavoux and Crabbé, 2017), using the bi-LSTM states of the lower layer of the bi-LSTM encoder to predict the case feature. Table 1 summarizes the results.
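The case-augmented character input can be illustrated with a tiny helper (the function name is hypothetical; the key point from the paper is that the case label is appended as one extra symbol, not spelled out character by character):

```python
def char_sequence(word, case=None):
    """Represent a token as its character sequence, optionally with an
    explicit morphological case label appended as a single extra symbol,
    as in the 'characters + case' input condition."""
    symbols = list(word)
    if case is not None:
        symbols.append(case)  # 'Acc' is one symbol, not three characters
    return symbols

# 'bat' with accusative case -> ['b', 'a', 't', 'Acc']
```

In the oracle condition the appended label comes from gold annotation; in the predicted condition it comes from the morphological tagger.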
Effect of case We found that the oracle condition of adding gold case improves parsing performance for all languages, and indeed explains all of the gains of a full morphological analysis. In German, case syncretism is pervasive (a single surface form can represent multiple cases), and we see an improvement of up to 2.4 LAS points on the test set. These results suggest that character-level models still struggle to disambiguate case when they learn only from the input text. We then look at performance when we replace gold case with predicted case. We train a morphological tagger to predict case information. The tagger has the same structure as the parser's encoder, with an additional feedforward neural network with one hidden layer followed by a softmax layer. We found that predicted case improves accuracy, although the effect differs across languages. These results are interesting, since in vintage parsers, predicted case usually harmed accuracy (Tsarfaty et al., 2010). However, we note that our taggers use gold POS, which might help.
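The tagger's classification head described above (one hidden layer, then a softmax over case labels) might look like the following sketch; the dimensions, activation function, and class name are assumptions, and the encoder producing the per-token states is omitted since it mirrors the parser's encoder:

```python
import torch
import torch.nn as nn

class CaseTagger(nn.Module):
    """Sketch of the case-tagger head: a feedforward network with one
    hidden layer followed by a softmax, applied to per-token bi-LSTM
    encoder states. All sizes are illustrative."""

    def __init__(self, enc_dim=64, hidden=32, n_cases=7):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(enc_dim, hidden),
            nn.ReLU(),  # the activation function is an assumption
            nn.Linear(hidden, n_cases),
        )

    def forward(self, states):
        # states: (n_tokens, enc_dim) encoder outputs
        return self.ffn(states).log_softmax(dim=-1)
```

At prediction time, the argmax over the output distribution gives the case label that the pipeline appends to each token's character sequence.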
Pipeline model vs. multi-task learning In general, the MTL models achieve similar or slightly better performance than the character-only models, suggesting that supplying case in this way is beneficial. However, we found that using predicted case in a pipeline model yields larger improvements than MTL. We also observe an interesting pattern in which MTL achieves better tagging accuracy than the pipeline model but lower parsing performance (Table 2). This is surprising, since it suggests that the MTL model learns to encode case effectively in its representations but fails to use it effectively for parsing.
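The MTL objective contrasted here can be sketched as a weighted sum of the parsing loss and an auxiliary case-prediction loss computed from the lower encoder layer; the exact loss form and weighting are assumptions of this sketch, not details from the paper:

```python
import torch
import torch.nn.functional as F

def mtl_loss(parse_loss, case_logits, gold_case, weight=1.0):
    """Sketch of a multi-task objective: the parsing loss plus an
    auxiliary cross-entropy loss for case prediction, where case_logits
    come from a classifier over the lower bi-LSTM layer's states.
    The interpolation weight is a hypothetical hyperparameter."""
    aux = F.cross_entropy(case_logits, gold_case)
    return parse_loss + weight * aux
```

In the pipeline model, by contrast, the tagger and parser are trained separately, and case enters the parser only as an explicit symbol in the input sequence.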

Conclusion
Vintage dependency parsers rely on hand-crafted feature engineering to encode morphology. The recent success of character-level models for many NLP tasks motivates us to ask whether their learned representations are powerful enough to completely replace this feature engineering. By empirically testing this with a single feature known to be important, morphological case, we have shown that they are not. Experiments with multi-task learning suggest that although MTL improves over character-only models, it still underperforms a traditional pipeline model.