Dependency Parsing as Sequence Labeling with Head-Based Encoding and Multi-Task Learning

Dependency parsing as sequence labeling has recently proved to be a relevant alternative to the traditional transition- and graph-based approaches. It offers a good trade-off between parsing accuracy and speed. However, recent work on dependency parsing as sequence labeling ignores the pre-processing time of Part-of-Speech tagging – which is required for this task – in the evaluation of speed, while other studies have shown that Part-of-Speech tags are not essential to achieve state-of-the-art parsing scores. In this paper, we compare the accuracy and speed of shared and stacked multi-task learning strategies – as well as a strategy that combines both – to learn Part-of-Speech tagging and dependency parsing in a single sequence labeling pipeline. In addition, we propose an alternative encoding of the dependencies as labels which does not use Part-of-Speech tags and improves dependency parsing accuracy for most of the languages we evaluate.


Introduction
Traditional dependency parsers are transition based (Kuhlmann et al., 2011) or graph based (McDonald, 2006). In contrast to previous studies, Strzyz et al. (2019) recently showed that dependency parsing reframed as a sequence labeling problem is also a competitive strategy. The idea is, for a given token in a sentence, to encode into a single tag the information about which token is its parent in the dependency tree (and the label of the incoming dependency). These tags can be predicted in a sequence labeling process and then decoded in order to rebuild the dependency tree. Strzyz et al. (2019) compare the performance of dependency parsing as sequence labeling using several encodings of the dependencies that have been presented in previous work, and show that the best encoding leads to state-of-the-art performance.
One of the main arguments for performing dependency parsing as sequence labeling is to achieve a good speed-accuracy trade-off (leveraging the efficiency of deep learning frameworks running on GPUs). However, the encoding reported as the best one by Strzyz et al. (2019) requires Part-of-Speech (PoS) tags to encode and decode the dependencies. The method thus involves a pre-processing step of PoS-tagging which is not considered in the evaluation of the parsing speed, whereas previous studies (Ballesteros et al., 2015; de Lhoneux et al., 2017a) showed that PoS-tagging is not a requirement for neural transition-based parsers (using word embeddings as input) in order to achieve state-of-the-art performance. In this work, we set up a single pipeline that performs both PoS-tagging and dependency parsing in order to study the performance of several architectures. We compare the shared (Søgaard and Goldberg, 2016) and stacked (Hashimoto et al., 2017) multi-task learning strategies to a strategy that combines both, with the aim of identifying a proper trade-off between parsing accuracy and speed.
We also present an alternative encoding that does not use PoS-tags to encode the dependencies. It does, however, require an additional step of head tagging, which consists of predicting which tokens in a sentence are parents of other tokens (i.e., have dependents in the dependency tree). The subsequent task of dependency parsing then consists of predicting to which of these parents each token is attached. We use an encoding similar to that of Strzyz et al. (2019). This new encoding aims at reducing the complexity of the attachment step and at correcting some of the flaws of the original PoS-based encoding. Finally, we evaluate whether ablating PoS-tagging in the pipeline using the new encoding affects dependency parsing performance.
Contribution. We (i) combine two multi-task learning strategies to set up an efficient pipeline for PoS-tagging and dependency parsing as sequence labeling and (ii) propose a new encoding of the dependencies as labels that does not use PoS-tags.

Sequence Labeling Pipeline
We propose to perform several sequence labeling tasks, such as PoS-tagging and dependency parsing, in a neural network pipeline architecture which combines shared and stacked strategies for multi-task learning.
In the shared multi-task learning architecture of Søgaard and Goldberg (2016), several tasks are trained simultaneously through the same layers (they share parameters). A single input is given to the network but it feeds different outputs. In contrast, Hashimoto et al. (2017) propose a stacked multi-task learning architecture, in which each layer is dedicated to the training of one task and layers are stacked on top of each other in a specific order. The calculated output of the final layer dedicated to a given task is concatenated with the input sequence of the network and then feeds the first layer dedicated to the next task.
In our architecture, we combine the two strategies in order to benefit from the strengths of both. We define groups of tasks to train sequentially. In a given group, tasks are trained simultaneously using the shared multi-task learning strategy (multiple layers can be stacked for one group). They share the same input and feed different outputs. The outputs of the final layer of each group are concatenated with the input sequence to feed the first layer dedicated to the training of the next group of tasks. We name this the combined strategy.
In all strategies, each layer is a bi-LSTM (Graves and Schmidhuber, 2005). The input sequence of the network is a concatenation of word embeddings (pre-trained) and character embeddings (trained using an additional bi-LSTM layer). The outputs are calculated through a Softmax layer.
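As a rough illustration of how the three strategies differ, the sketch below lists which inputs feed each group of tasks and how many bi-LSTM layers a sentence traverses under each strategy. The task names, the order of the tasks in the stacked strategy, the `wiring` and `depth` helpers and the default of two layers per group are assumptions made for illustration (the last one mirrors the setup reported in the Experiments section); this is not the implementation used in this work.

```python
# Hypothetical task names; the grouping follows the three strategies described above.
TASKS = ["pos", "feats", "label", "attach"]

STRATEGIES = {
    "shared": [TASKS],                                    # one group, all tasks together
    "stacked": [[t] for t in TASKS],                      # one group per task
    "combined": [["pos", "feats"], ["label", "attach"]],  # groups of shared tasks, stacked
}


def wiring(groups):
    """Return, for each group, the tasks it trains and the inputs it receives."""
    plan = []
    previous = []
    for tasks in groups:
        # Each group sees the network input plus the outputs of the previous group.
        inputs = ["embeddings"] + ["out:" + t for t in previous]
        plan.append({"tasks": list(tasks), "inputs": inputs})
        previous = tasks
    return plan


def depth(groups, layers_per_group=2):
    """Number of bi-LSTM layers traversed, assuming a fixed number of layers per group."""
    return len(groups) * layers_per_group


for name, groups in STRATEGIES.items():
    print(name, depth(groups), wiring(groups))
```

Under these assumptions the shared, stacked and combined strategies traverse 2, 8 and 4 bi-LSTM layers respectively, which is one plausible reading of why the combined strategy parses faster than the stacked one.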

Dependency Encodings
Strzyz et al. (2019) observe that the relative PoS-based encoding of the dependencies, inspired by Spoustová and Spousta (2010), outperforms other encodings. Given a sentence $w_1 \ldots w_n$ and its respective sequence of PoS-tags $p_1 \ldots p_n$, an incoming dependency to a token $w_j$ whose parent is $w_i$ (i.e., $w_i \rightarrow w_j$) is described as a tuple of:
• the PoS-tag $p_i$ of its parent $w_i$, and
• the relative position $n$ of $p_i$ to $w_j$ with respect to the PoS-tags of the same value $p$, i.e., $p_i$ is the $n$th PoS-tag of value $p$ counted from $w_j$.
See the RPT tags in Figure 1 as an example of relative PoS-based encoding. Note that, in contrast to Strzyz et al. (2019), who predict the relation and the encoding of a dependency as one concatenated tag, in this work we predict the dependency relations (labels) independently from the dependencies (attachment), as has been done for constituent parsing as sequence labeling (vil, 2019). This approach reduces the size of the tagset for each task (label tagging and dependency attachment).
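The relative PoS-based encoding described above can be sketched as a pair of encode/decode functions. This is a minimal sketch under stated assumptions: indices are 0-based, the root is marked with head `-1` and encoded as `("ROOT", 0)`, positive offsets count to the right and negative ones to the left, and the example sentence and function names are hypothetical.

```python
def encode_relative(heads, tags):
    """Encode each dependency as (tag of the parent, signed relative position).

    heads[j] is the 0-based index of the parent of token j, or -1 for the root.
    """
    encoded = []
    for j, i in enumerate(heads):
        if i < 0:
            encoded.append(("ROOT", 0))
            continue
        step = 1 if i > j else -1
        # Count the tokens carrying the parent's tag from the child (excluded)
        # up to the parent (included).
        count = sum(1 for k in range(j + step, i + step, step) if tags[k] == tags[i])
        encoded.append((tags[i], step * count))
    return encoded


def decode_relative(encoded, tags):
    """Rebuild the head indices from the encoded tags (-1 if no head is found)."""
    heads = []
    for j, (tag, n) in enumerate(encoded):
        head = -1
        if tag != "ROOT" and n != 0:
            step = 1 if n > 0 else -1
            remaining = abs(n)
            k = j + step
            while 0 <= k < len(tags):
                if tags[k] == tag:
                    remaining -= 1
                    if remaining == 0:
                        head = k
                        break
                k += step
        heads.append(head)
    return heads


# Hypothetical example: "The cat chased a small mouse"
pos_tags = ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN"]
gold_heads = [1, 2, -1, 5, 5, 2]

rpt = encode_relative(gold_heads, pos_tags)
print(rpt)
print(decode_relative(rpt, pos_tags) == gold_heads)
```

In this toy sentence, "mouse" receives the tag `("VERB", -1)`, i.e., its parent is the first VERB to its left.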
In particular, we identify two flaws with the PoS-based encoding:
• the tagset includes many infrequent tags (due to infrequent PoS-tags and long-distance dependencies) which are difficult to predict;
• consecutive PoS-tags which have similar roles (such as NOUN and PROPN or VERB and AUX) make the prediction of the relative position less accurate (i.e., biased towards short relative positions) due to the difficulty of identifying which token is the head of a subtree (in a group of tokens which constitute a phrase, e.g., the main noun in a noun phrase or the verb in a verb phrase).
In order to alleviate the impact of these flaws, we propose a new encoding strategy that we name relative head-based encoding. It requires a first step of head tagging in which we identify the heads/parents, i.e., the tokens which have children in the dependency tree. We propose two approaches for tagging the heads:
• a first approach (Unique Head) is to tag all parents with a unique tag X (and all non-parents with a NONE tag);
• a second approach (Chunk Head) is to see parents as heads of syntactic chunks and to define their roles (tags) accordingly. In this case, the tagset of the head tagging task includes 5 tags: VP (for heads which are VERBs and AUXs), NP (for NOUNs, PROPNs and PRONs), AP (for ADJs and NUMs), X (for the remaining heads) and NONE for the non-parents.
With this approach, disambiguating between PoS-tags with similar roles rests on the head tagging step instead of the actual dependency parsing step, which focuses on attaching the children to the correct head.
The relative head-based encodings (RUH for Relative Unique Head and RCH for Relative Chunk Head) are thus deduced from these head tags in the same way as the relative PoS-based encoding is deduced from the PoS-tags. The encoding of a dependency is defined as a tuple of the head tag of the parent and its relative position to the child with respect to other head tags with the same role. Hence, the dependency attachment step consists of predicting these encoded tags and then building the dependency tree, additionally using the information about the heads from the previous head tagging step.
Using the relative head-based encoding reduces the size of the tagset (for the dependency attachment task) by 65% (RUH) and 52% (RCH) on average compared to the relative PoS-based encoding. See an example of the relative head-based encodings (RUH and RCH) in Figure 1. In this sentence, "clogged" and "kitchen" have the same tag with both head-based encodings because they have the same head, while they have different tags with the PoS-based encoding.
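The two head tagsets can be illustrated with a short sketch that derives gold head tags from a dependency tree. The mapping from PoS-tags to chunk tags follows the description above; the function name, the 0-based indexing with `-1` for the root and the example sentence are assumptions made for illustration.

```python
# Chunk Head mapping as described above; any other parent receives the tag X.
CHUNK_OF = {
    "VERB": "VP", "AUX": "VP",
    "NOUN": "NP", "PROPN": "NP", "PRON": "NP",
    "ADJ": "AP", "NUM": "AP",
}


def tag_heads(heads, upos, scheme="chunk"):
    """Tag each token as a parent (X or a chunk tag) or as a non-parent (NONE).

    heads[j] is the 0-based index of the parent of token j, or -1 for the root.
    """
    parents = {i for i in heads if i >= 0}
    head_tags = []
    for j, pos in enumerate(upos):
        if j not in parents:
            head_tags.append("NONE")
        elif scheme == "unique":
            head_tags.append("X")
        else:
            head_tags.append(CHUNK_OF.get(pos, "X"))
    return head_tags


# Hypothetical example: "The cat chased a small mouse"
upos = ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN"]
heads = [1, 2, -1, 5, 5, 2]

unique_tags = tag_heads(heads, upos, scheme="unique")
chunk_tags = tag_heads(heads, upos, scheme="chunk")
print(unique_tags)
print(chunk_tags)
```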
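The sketch below illustrates how the dependency attachment tags could be derived from head tags rather than PoS-tags, and how the tree could be rebuilt from predicted head tags and attachment tags. It rests on several assumptions: 0-based indices, `-1` and `("ROOT", 0)` for the root, signed offsets (positive to the right), and hypothetical function names and example data; the head tags are given directly so that the block stands on its own.

```python
def encode_with_head_tags(heads, head_tags):
    """Encode each dependency as (head tag of the parent, signed relative position).

    Only tokens tagged with the same head tag as the parent are counted, so
    non-parents (NONE) never increase the relative position.
    """
    encoded = []
    for j, i in enumerate(heads):
        if i < 0:
            encoded.append(("ROOT", 0))
            continue
        step = 1 if i > j else -1
        count = sum(
            1 for k in range(j + step, i + step, step) if head_tags[k] == head_tags[i]
        )
        encoded.append((head_tags[i], step * count))
    return encoded


def build_tree(encoded, head_tags):
    """Rebuild head indices using the head tags from the head tagging step."""
    heads = []
    for j, (tag, n) in enumerate(encoded):
        # Candidate parents: tokens with the requested head tag on the requested side.
        if n > 0:
            candidates = [k for k in range(j + 1, len(head_tags)) if head_tags[k] == tag]
        else:
            candidates = [k for k in range(j - 1, -1, -1) if head_tags[k] == tag]
        if tag == "ROOT" or n == 0 or abs(n) > len(candidates):
            heads.append(-1)  # root, or an ill-formed prediction
        else:
            heads.append(candidates[abs(n) - 1])
    return heads


# Hypothetical example: "The cat chased a small mouse"
heads = [1, 2, -1, 5, 5, 2]
unique_tags = ["NONE", "X", "X", "NONE", "NONE", "X"]
chunk_tags = ["NONE", "NP", "VP", "NONE", "NONE", "NP"]

ruh = encode_with_head_tags(heads, unique_tags)
rch = encode_with_head_tags(heads, chunk_tags)
print(ruh)
print(rch)
print(build_tree(ruh, unique_tags) == heads, build_tree(rch, chunk_tags) == heads)
```

In this toy sentence the RUH encoding needs only three distinct attachment tags, which gives an intuition for the tagset reduction, although a single sentence says nothing about the averages reported above.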

Experiments
Models. We design three types of experiments. In a first set of experiments, we compare the shared and stacked learning strategies with the combined strategy. For each experiment, we train four tasks (simultaneously or sequentially): PoS-tagging, (morphological) feature tagging, label (dependency relation) tagging and dependency attachment. For the combined strategy, we define two groups of tasks (trained in the following order): PoS-tagging/feature tagging, followed by label tagging/dependency attachment. As a second experiment, we compare the performance of the combined system using different encodings of the dependencies (PoS-based and head-based). When using our proposed head-based encodings, the groups are (trained in this order): PoS-tagging/feature tagging/head tagging, followed by label tagging/dependency attachment.
Third, we train the pipeline without PoS-tagging and feature tagging (-PoS/feats), using only head tagging as a first group.
Setup. We use the pre-trained word embeddings of Grave et al. (2018). For each task or group of tasks, we use 2 hidden layers of dimension 256. The hidden layer used for training character embeddings has dimension 128.
Evaluation. We average the scores over 5 runs (with different random seeds) for each experiment. We calculate the unlabeled attachment score (UAS) and the labeled attachment score (LAS) following the guidelines of the CoNLL 2018 Shared Task (Zeman et al., 2018). We also evaluate precision on heads, i.e., the percentage of correctly tagged parents.
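For readers unfamiliar with these metrics, the sketch below computes UAS, LAS and a precision on heads for one sentence. It assumes gold tokenization, so that the attachment scores reduce to simple accuracies, and it reads "precision on heads" as the share of predicted parents whose head tag is correct; both the function names and this reading are assumptions, not the official evaluation script.

```python
def attachment_scores(gold_heads, gold_labels, pred_heads, pred_labels):
    """Return (UAS, LAS) as fractions, assuming identical (gold) tokenization."""
    total = len(gold_heads)
    uas_hits = 0
    las_hits = 0
    for gh, gl, ph, pl in zip(gold_heads, gold_labels, pred_heads, pred_labels):
        if gh == ph:
            uas_hits += 1       # correct parent
            if gl == pl:
                las_hits += 1   # correct parent and correct relation
    return uas_hits / total, las_hits / total


def head_precision(gold_head_tags, pred_head_tags):
    """Share of tokens predicted as parents whose head tag matches the gold tag."""
    predicted = [(g, p) for g, p in zip(gold_head_tags, pred_head_tags) if p != "NONE"]
    if not predicted:
        return 0.0
    return sum(1 for g, p in predicted if g == p) / len(predicted)


# Hypothetical gold and predicted analyses of a six-token sentence.
gold_heads = [1, 2, -1, 5, 5, 2]
gold_labels = ["det", "nsubj", "root", "det", "amod", "obj"]
pred_heads = [1, 2, -1, 4, 5, 2]
pred_labels = ["det", "obj", "root", "det", "amod", "obj"]

uas, las = attachment_scores(gold_heads, gold_labels, pred_heads, pred_labels)
precision = head_precision(
    ["NONE", "X", "X", "NONE", "NONE", "X"],
    ["NONE", "X", "X", "NONE", "X", "X"],
)
print(uas, las, precision)
```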

Multi-task Learning Strategies
We compare the learning strategies (shared, stacked and combined in Table 1) when using the relative PoS-based encoding. The shared strategy always leads to the lowest scores, which supports the idea that PoS-tagging and dependency parsing should not be trained simultaneously. On average, the stacked learning strategy leads to slightly higher performance (+0.11 UAS/+0.16 LAS) than the combined strategy.
Each of the two strategies leads to the highest performance for half of the languages, but the stacked strategy significantly outperforms the combined strategy on 3 of the 8 languages, while the combined strategy gives the (significantly) best scores for 2 languages. However, it is worth noting that parsing is much slower with the stacked strategy than with the combined strategy, which increases the parsing speed by 48% on average. With comparable scores on average, the combined strategy is a good trade-off between speed and accuracy.
Overall, the relative head-based encoding is a good approach for parsing as sequence labeling. However, from these results, no clear decision can be made on which tagset for head tagging (RUH vs RCH) would be best suited to other languages. The intuition behind the RCH encoding is well-suited to languages whose structure lends itself to syntactic chunks, such as English (which is reflected in the scores). It is worth noting that the variation in the scores between the two head-based encodings is substantial, and when one is the best option the other often leads to low scores, which shows that the choice of the tagset for head tagging is crucial and might require fine-tuning for the different languages. For instance, head tagging performs very well on Hebrew and Kazakh using the Unique Head tagset, leading to high parsing scores.
Furthermore, although the relative head-based encoding requires an additional step of head tagging (i.e., one more task in the pipeline), the parsing time is equivalent to that of the RPT encoding since the head tagging task is performed at the same level as PoS-tagging and feature tagging.
In general, long dependencies are especially difficult to predict correctly. While local dependencies (neighbouring child) achieve more than 80% UAS on average, dependencies of length greater than 6 do not exceed 50% UAS. We expect the RCH encoding to alleviate the difficulty of the prediction by artificially reducing the distance between the tokens. We analyse the dependency parsing scores with respect to the length of the dependencies. See the comparison between the encodings in Figure 2. Overall, the RCH encoding outperforms the PoS-based encoding for all dependency lengths except neighbouring children, while the RUH encoding is especially good on local dependencies but performs poorly on long dependencies: the dependency attachment tagset for the RUH encoding includes more rare tags with high relative positions, which are then more difficult to predict.
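The analysis by dependency length can be sketched as follows. The sketch assumes that the length of a dependency is the absolute distance between the child and its gold parent, that root tokens are excluded, and that lengths above a cap are pooled in one bucket; the function name, the cap of 7 and the example data are hypothetical.

```python
from collections import defaultdict


def uas_by_length(gold_heads, pred_heads, cap=7):
    """Return {dependency length: UAS}, pooling lengths >= cap into one bucket.

    Heads are 0-based indices; -1 marks the root, which is skipped here.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for j, (gold, pred) in enumerate(zip(gold_heads, pred_heads)):
        if gold < 0:
            continue
        length = min(abs(gold - j), cap)
        totals[length] += 1
        if gold == pred:
            hits[length] += 1
    return {length: hits[length] / totals[length] for length in sorted(totals)}


# Hypothetical gold and predicted heads for a six-token sentence.
gold = [1, 2, -1, 5, 5, 2]
pred = [1, 2, -1, 4, 5, 2]

scores = uas_by_length(gold, pred)
print(scores)
```

In practice the counts would be accumulated over a whole test set rather than a single sentence before taking the ratio.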

Ablating PoS-tagging
As previously studied for transition-based parsers (de Lhoneux et al., 2017a; Smith et al., 2018), we want to assess whether dependency parsing as sequence labeling can achieve state-of-the-art performance without PoS-tagging (and feature tagging) as a pre-processing step. We compare the performance of the combined strategy using the RCH encoding with and without PoS-tagging (last two columns of Table 2) as part of the first group of tasks to train in the pipeline.
The results are noticeably lower for the ablated model (-1.56 UAS/-1.86 LAS on average) than when using PoS-tagging as an auxiliary task for training head tagging. Determining PoS-tags is essential for most of the languages. Only Hebrew does not suffer from the ablation.
Moreover, it is worth noting that ablating PoS-tagging does not increase the parsing speed since the tasks are performed simultaneously.The combined strategy (with PoS-tagging) thus remains a valid trade-off between speed and accuracy.

Conclusion
We showed that a combined strategy for multi-task learning using shared and stacked strategies is on par with a sequential approach while being significantly faster at parsing sentences. It provides a good speed-accuracy trade-off for PoS-tagging and dependency parsing in a single pipeline.
In addition, we proposed a new encoding of the dependencies as labels which does not use PoS-tags. It splits the parsing task into two steps but does not negatively affect the parsing time when head tagging is performed simultaneously with PoS-tagging. We tested two alternatives of this encoding, comparing fine and coarse tagsets for tagging the heads. The results show that the choice of the tagset is crucial: the performance of dependency attachment depends on the performance of head tagging and on how it performs with respect to the length of the dependencies. This suggests that fine-tuning the tagset with respect to properties of the languages could improve overall performance. Overall, the head-based models outperform the PoS-based model for a majority of the languages.

Figure 1: Dependencies and encoded tags on an English sentence from the EWT treebank. RPT is the encoding based on PoS-tags (PoS). RUH and RCH are the relative encodings based on head tags (respectively: unique head -U.Head-tags and chunk head -C.Head-tags).

Figure 2: Averaged UAS (on the 8 languages) as a function of the dependency length for the three encodings (using the combined learning strategy).

Table 1: Dependency parsing scores (+ average sentences per second on CPU) using the PoS-tag based encoding for the different learning strategies (best in bold; † marks statistical significance; T-test with p<0.05). STR19 scores are reported from Strzyz et al. (2019) (except for Tamil, for which they use gold PoS-tags).

Table 2: Dependency parsing scores (+ precision on heads) with the different dependency encodings, using the combined learning strategy (best in bold; † marks statistical significance).