Multitask Pointer Network for Multi-Representational Parsing

We propose a transition-based approach that, by training a single model, can efficiently parse any input sentence with both constituent and dependency trees, supporting both continuous/projective and discontinuous/non-projective syntactic structures. To that end, we develop a Pointer Network architecture with two separate task-specific decoders and a common encoder, and follow a multitask learning strategy to jointly train them. The resulting quadratic system not only becomes the first parser that can jointly produce both unrestricted constituent and dependency trees from a single model, but also proves that both syntactic formalisms can benefit from each other during training, achieving state-of-the-art accuracies on several widely-used benchmarks such as the continuous English and Chinese Penn Treebanks, as well as the discontinuous German NEGRA and TIGER datasets.

Constituent trees, which are commonly used in tasks where span information is crucial, represent the syntax of a sentence by means of constituents (also called phrases) that hierarchically and from the bottom up group words and/or subtrees located in lower levels. We can find two kinds of constituent trees: continuous and discontinuous (described in Figure 1(a) and (d), respectively). The latter extends the former by allowing constituents with discontinuous spans, which results in phrase-structure trees with crossing branches. These are necessary for describing some wh-movement, long-distance extractions, dislocations, cross-serial dependencies and other linguistic phenomena common in free word order languages such as German (Müller, 2004).
On the other hand, in a dependency tree each word of the sentence is attached to another by a directed link that describes a dependency relation between that word and its parent (also called head). This structure is known for representing information closer to semantic relations and can be classified as projective or non-projective (depicted in Figure 1(c) and (f), respectively). Non-projective dependency trees allow crossing dependencies, and can model the same linguistic phenomena described by discontinuous constituent trees.
Since the information described in a constituent tree is not fully encoded into a dependency tree and vice versa (Kahane and Mazziotta, 2015), parsers are typically trained exclusively to produce either dependency or constituent structures and, in some cases, restricted to the less complex continuous/projective representations. In all benchmarks, our approach outperforms the single-task parsers of Fernández-González and Gómez-Rodríguez (2019) and Fernández-González and Gómez-Rodríguez (2020), which proves that learning across regular dependency trees and constituent information (encoded in dependency structures) leads to gains in accuracy in both tasks, obtaining competitive results in all cases and surpassing the current state of the art in several datasets.
The remainder of this article is organized as follows: Section 2 introduces the constituent-to-dependency encoding technique developed by Fernández-González and Martins (2015). In Section 3, we describe in detail the proposed multitask Pointer Network architecture. In Section 4, we extensively evaluate the proposed neural model on continuous/projective and discontinuous/non-projective treebanks, as well as include a thorough analysis of their performance. Section 5 presents other research works that study the joint training of neural models across different syntactic formalisms. Lastly, Section 6 contains a final discussion.

Constituent trees as dependency structures
Since our multitask approach is based on the dependency parser by Fernández-González and Gómez-Rodríguez (2019), constituent trees must be represented as dependencies in order to be processed. This was recently explored for neural discontinuous constituent parsing by Fernández-González and Gómez-Rodríguez (2020) using the encoding by Fernández-González and Martins (2015). In this work, we extend it to continuous phrase-structure datasets, where the non-negligible frequency of unary nodes requires additional processing.

Preliminaries
Let w_1, w_2, …, w_n be a sentence and w_i the word at position i. A constituent tree is defined by constituents (as internal nodes) hierarchically organized over these words (as leaf nodes). Each phrase (or constituent) is defined as a tuple (X, Y, h) that includes a non-terminal symbol X; the set Y of words included in its span (its yield); and h, the word in Y that acts as head and that can be marked by using a language-specific handwritten set of rules. For example, the head word of constituents S and VP in Figure 1(a) is the word is. Furthermore, we say that a constituent tree is continuous if there are no constituents whose yield Y is a discontinuous substring of the sentence. If this does not hold, the tree is classified as discontinuous, and then there is at least one constituent with one or more gaps in its span. For instance, the word muß interrupts the span of constituent (VP, {Darüber, nachgedacht}, nachgedacht) in Figure 1(d), resulting in a phrase structure with crossing branches. Finally, constituents with exactly one child are known as unary constituents (for instance, ROOT, NP, ADVP and ADJP in Figure 1(a)).
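To make the continuity distinction concrete, the following is a minimal sketch (the word positions and the tuple-based tree representation are illustrative, not the paper's implementation) that checks whether a constituent's yield is gap-free:

```python
def is_continuous(yield_positions):
    """A constituent is continuous iff its yield covers a contiguous
    range of sentence positions (no gaps)."""
    ys = sorted(yield_positions)
    return ys == list(range(ys[0], ys[-1] + 1))

def tree_is_continuous(constituents):
    """A tree is continuous iff every constituent's yield is gap-free.
    Each constituent is a (label, yield, head) tuple."""
    return all(is_continuous(y) for _, y, _ in constituents)

# Figure 1(d): 'muß' (position 2) interrupts the VP over {Darüber, nachgedacht}
vp_yield = {1, 3}  # Darüber = 1, nachgedacht = 3
print(is_continuous(vp_yield))  # False: position 2 is missing
```

Applied to the VP from Figure 1(d), whose yield skips the position of muß, the check fails and the whole tree is classified as discontinuous.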
Unlike constituent structures, dependency trees do not require extra internal nodes and are exclusively composed of the words of the sentence (plus an artificial root node) and binary directed links to connect them. Each dependency link is represented as (w_h, w_d, l), where w_h is the head word of the dependent word w_d (h, d ∈ [1, n]) and l a dependency label. Additionally, a dependency tree is classified as projective if, for every dependency link (w_h, w_d, l), we can find a directed path from w_h to all words between w_h and w_d. If this does not hold, it is considered a non-projective dependency tree, such as the one with crossing arcs depicted in Figure 1(e). Fernández-González and Martins (2015) designed an encoding technique to represent a unariless constituent tree with n words as a set of n − 1 labelled dependency arcs with enriched information (plus an arc from the root), where discontinuous phrase structures are encoded as non-projective dependency trees and continuous structures as projective trees, as shown in Figure 1(b) and (e) for the constituent trees in Figure 1(a) and (d), respectively. To that end, for each constituent (X, Y, h) with head word h, each child node c (different from h) is encoded into the unlabelled dependency link (h, h_c). Please note that a constituent's non-head child node c might be a word (in which case h_c is c itself) or another constituent (X', Y', h_c) with h_c as head word. Additionally, these dependencies are augmented with an arc label that includes the non-terminal name concatenated with an index i that indicates the hierarchical order in which non-terminal nodes are built in the tree, resulting in labelled dependency arcs of the form (h, h_c, X#i). Index i was included for those cases where several constituents share the same head word, but are placed in the tree at different levels.
For instance, constituent (S, {Darüber, muß, nachgedacht, werden}, muß) in Figure 1(d) is represented as the augmented dependency arc (muß, werden, S#1) in Figure 1(e), and constituent (VROOT, {Darüber, muß, nachgedacht, werden, .}, muß) is encoded as (muß, ., VROOT#2). Both share the head word muß, but the latter is built on top of the former, and this must be encoded by the hierarchical orders 1 and 2; otherwise, after the deconversion, the resulting structure would be a single constituent (named S or VROOT) spanning the whole sentence.
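The encoding described above can be sketched as follows; the (label, head word, non-head children) tuples and the bottom-up listing are simplifying assumptions for illustration, with each child identified by its own head word:

```python
def encode(constituents):
    """Encode a unariless constituent tree as augmented dependency arcs.
    `constituents` are (label, head, non_head_children) tuples listed
    bottom-up. Each constituent (X, Y, h) yields one arc (h, c, X#i) per
    non-head child c, where i is the hierarchical order of constituents
    built so far on head word h."""
    level = {}  # head word -> number of constituents already built on it
    arcs = []
    for label, head, children in constituents:
        level[head] = level.get(head, 0) + 1
        for child in children:
            arcs.append((head, child, f"{label}#{level[head]}"))
    return arcs

# Figure 1(d), bottom-up: VP over {Darüber, nachgedacht}, then S, then VROOT
tree = [("VP", "nachgedacht", ["Darüber"]),
        ("S", "muß", ["nachgedacht", "werden"]),
        ("VROOT", "muß", ["."])]
print(encode(tree))
# [('nachgedacht', 'Darüber', 'VP#1'), ('muß', 'nachgedacht', 'S#1'),
#  ('muß', 'werden', 'S#1'), ('muß', '.', 'VROOT#2')]
```

Running it on the Figure 1(d) example reproduces the arcs discussed above, including (muß, werden, S#1) and (muß, ., VROOT#2) with their distinct hierarchical orders.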

Constituent-to-dependency conversion
Finally, unary constituents are not directly supported by this encoding strategy. While Fernández-González and Martins (2015) proposed to remove all unary nodes and recover them in a post-processing step, we decided to incorporate unary constituents into the resulting augmented dependency tree by collapsing non-leaf unary chains (for instance, ROOT from Figure 1(a) into ROOT+S) and saving the leaf unary nodes lost after the encoding by assigning them to words (as can be seen in Figure 1(b) for NP, ADVP and ADJP).

Constituent trees recovery
The original unariless constituent trees can be decoded from augmented dependency trees by, following a post-order traversal, building constituents from the set of dependencies composed of each head word together with its dependents, and following the hierarchical order dictated by the index and non-terminal name encoded into each of the dependency labels. Due to erroneous predictions, it might be the case that heads or dependency labels are mistakenly assigned in the resulting augmented dependency tree; however, the technique by Fernández-González and Martins (2015) guarantees that the output is a well-formed constituent tree (which, of course, will differ from the gold tree). For instance, imagine that the word cautious in Figure 1(b) is erroneously attached to the word still (instead of being connected to the verb is); then, instead of a single flat VP with three child nodes, the resulting constituents would be two VPs (the first would have as child nodes the word is and a second VP, which would group the words still and cautious). We can also find different scenarios where dependency labels are erroneously predicted, requiring ad-hoc heuristics during the recovery to deal with some inconsistencies:
• Same indices, but different non-terminal names: Note that dependency labels with the same head and at the same level (same index i) should share the same non-terminal name so that a flat constituent can be recovered. If this does not hold, then the dependency label of the dependent closer to the head will be the one chosen for tagging the resulting constituent. For instance, if the arcs is→still and is→cautious in Figure 1(b) were tagged with labels VP#1 and (incorrect) NP#1, respectively, then we would use non-terminal label VP for naming the output flat constituent and NP would be discarded.
Alternatively, we could consider that the non-terminal name is correct and index was wrongly predicted: in our running example, we might think that the non-terminal name of dependency label NP#1 is correct, but the resulting constituent NP should be in a higher level (with the correct label being NP#2, for instance). This heuristic would lead us to build a constituent NP with a nested VP. However, Fernández-González and Martins (2015) decided to follow a more conservative strategy that tends to produce flatter structures.
• Non-nested indices in continuous parsing: When the reverse conversion is restricted to continuous constituent trees, erroneous dependency labels might lead to discontinuous structures (even when the augmented dependency tree is projective). For example, if the arcs is→still and is→cautious were tagged with labels (incorrect) VP#2 and VP#1, respectively, then the resulting constituent would be a discontinuous constituent VP with two child nodes: the word still and a non-nested VP (with a discontinuous yield composed of the words is and cautious). This would be a well-formed phrase-structure tree in a discontinuous scenario; however, to produce continuous structures, the hierarchical indices of dependent words closer to the head should always be the same or lower than those of adjacent and more distant siblings, thus ensuring that flat or nested continuous constituents will be obtained. In our running example, if the arcs is→still and is→cautious were erroneously tagged with labels VP#2 and VP#1, respectively, then we would decrease index 2 of the dependent word still until reaching the index of the adjacent and more distant sibling (the word cautious). In this case, the index is set to 1 and a flat constituent VP with three child nodes is built.
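The index-repair heuristic for continuous parsing can be sketched as a small stand-alone function (an illustrative simplification, assuming the dependents of a single head are listed from closest to farthest):

```python
def clamp_indices(deps_of_head):
    """Continuous-parsing heuristic: for the dependents of one head,
    ordered from closest to farthest, a hierarchical index may never
    exceed the index of the next (more distant) sibling; offending
    indices are decreased to that sibling's index."""
    fixed = list(deps_of_head)
    # walk from the farthest dependent inwards
    for k in range(len(fixed) - 2, -1, -1):
        word, idx = fixed[k]
        outer_idx = fixed[k + 1][1]
        if idx > outer_idx:
            fixed[k] = (word, outer_idx)
    return fixed

# 'still' (closer) wrongly tagged VP#2, 'cautious' (farther) VP#1
print(clamp_indices([("still", 2), ("cautious", 1)]))
# [('still', 1), ('cautious', 1)]
```

With both indices clamped to 1, the recovery then builds a single flat VP, as in the running example.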
With respect to unary recovery, it is worth noting that, while the Penn treebanks present a significant amount of unary constituents, they are very uncommon in discontinuous datasets: NEGRA has no unaries at all and TIGER contains less than 1%. Therefore, we only perform unary recovery in the Penn treebanks. To do so, we simply uncollapse the unary chains encoded in dependency labels and, for recovering leaf unary nodes lost after the encoding, we use a tagger in a post-processing step. In more detail, we employ the neural sequence tagger developed by Yang and Zhang (2018) to assign to each word a possible leaf unary node (or a sequence of unaries collapsed into a single tag) seen in the training dataset, or the tag NONE (if there is no unary node above that word).
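A minimal sketch of this unary recovery (the '+'-collapsed tag format and the tuple-based constituent representation are illustrative assumptions):

```python
def uncollapse(label):
    """Split a collapsed unary chain (e.g. 'ROOT+S') back into the
    sequence of nested non-terminals, outermost first."""
    return label.split("+")

def attach_leaf_unaries(words, tags):
    """Re-attach leaf unary nodes predicted by the tagger; 'NONE'
    means the word has no unary node on top. Constituents are
    represented as (non_terminal, children) tuples."""
    out = []
    for w, t in zip(words, tags):
        if t == "NONE":
            out.append(w)
        else:
            # wrap the word, innermost non-terminal first
            for nt in reversed(uncollapse(t)):
                w = (nt, [w])
            out.append(w)
    return out

print(uncollapse("ROOT+S"))                      # ['ROOT', 'S']
print(attach_leaf_unaries(["still"], ["ADVP"]))  # [('ADVP', ['still'])]
```

A word tagged NONE is left untouched, so only the scarce positions with leaf unaries are modified.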

Regular vs. augmented dependency trees
Although both the constituent-based and regular dependency structures are directed trees, each provides exclusive information: span information is included in the arc labels of the augmented variant, and regular dependency labels provide additional semantic information not described in phrase-structure trees. Furthermore, regular dependency trees differ from augmented ones not only in the label set, but also in the conversion process. Although dependency trees are often generated from constituent trees, different head-rule sets for marking head words and other transformations can be applied in that process. While a set of syntactic rules is used for identifying head nodes when augmented dependency trees are produced, a semantic-based transformation is applied for choosing the semantic heads necessary for generating regular dependency structures. This is the reason why the dependency structures in Figure 1(b) and (e) differ from those in Figure 1(c) and (f), respectively: for the English example, we use the head-rule set by Collins (1999) in our constituent-to-dependency encoding, while the regular dependency tree was obtained following the Stanford Dependencies conversion (de Marneffe and Manning, 2008); and, for the German sentence, the augmented dependency tree requires a non-projective structure to fully encode the discontinuous constituent tree, while the regular dependency tree represents the syntax (and semantics) of the sentence with just a projective structure. This trains the parser across a broader variety of syntactic representations and notations.

Multitask Neural Architecture
To develop a neural network capable of producing state-of-the-art, unrestricted constituent and dependency parses, we join, under the same architecture, two recently presented transition-based parsers: that of Fernández-González and Gómez-Rodríguez (2019) for non-projective dependency parsing, and that of Fernández-González and Gómez-Rodríguez (2020), an extension of the former that can produce discontinuous constituent trees. As explained before, we additionally extend the latter to also deal with continuous phrase structures and unary constituents.
Fernández-González and Gómez-Rodríguez (2019) rely on Pointer Networks (Vinyals et al., 2015) to perform unlabelled dependency parsing. After learning the conditional probability of a sequence of numbers that represent positions from the input, these neural networks use an attention mechanism (Bahdanau, Cho and Bengio, 2014) to select those positions during decoding. Unlike regular sequence-to-sequence architectures, Pointer Networks do not require a fixed dictionary based on the whole training dataset: the dictionary size is specifically defined by each input sequence length. Fernández-González and Gómez-Rodríguez (2019) adapt Pointer Networks to implement a transition-based approach that, starting at the first word of a sentence of length n, sequentially attaches, from left to right, the current focus word to the pointed head word, incrementally building a well-formed dependency tree in just n steps. This can also be seen as a sequence of SHIFT-ATTACH-p transitions, each of which connects the current focus word to the head word in the pointed position p, and then moves the focus to the next word. In addition, a jointly trained biaffine classifier (Dozat and Manning, 2017) is used for predicting dependency labels.
Inspired by Fernández-González and Gómez-Rodríguez (2019), we introduce a novel neural architecture with two task-specific decoders: each word of the input sentence is attached to its regular head by the first decoder, and to its augmented dependency head by the second decoder. Additionally, each decoder provides a biaffine classifier trained on its task-specific label set. Since both decoders are aligned, the resulting system requires just n steps to dependency- and constituent-parse a sentence of length n, easily allowing joint training.
More specifically, our neural architecture is composed of the following components.

Shared Encoder. Each input sentence w_1, …, w_n is encoded, word by word, by a BiLSTM-CNN architecture (Ma and Hovy, 2016) into a sequence of encoder hidden states h_1, …, h_n. In particular, a Convolutional Neural Network (CNN) is used for extracting a character-level representation e_c(w_i) of each word, and this is concatenated with a word embedding e_w(w_i) to create the vector representation x_i for each input word w_i. Additionally, POS tag embeddings e_p(w_i) are used when gold POS tags are available:

x_i = e_c(w_i) ⊕ e_w(w_i) ⊕ e_p(w_i)

Then, each word representation x_i is fed one-by-one into a BiLSTM for generating the vector representations h_i, which encode context information captured in both directions:

h_i = BiLSTM(x_i)

Additionally, a special vector representation h_0, denoting the ROOT node, is prepended at the beginning of the sequence of encoder hidden states. Finally, we extend the encoder with deep contextualized word embeddings e_B(w_i) extracted from the pre-trained language model BERT (Devlin, Chang, Lee and Toutanova, 2019) by directly concatenating them to the resulting basic word representation before feeding the BiLSTM-based encoder:

x_i = e_c(w_i) ⊕ e_w(w_i) ⊕ e_p(w_i) ⊕ e_B(w_i)

Task-specific Decoders. Each decoder is implemented by a separate LSTM that, at each time step t, receives as input the encoder hidden state h_i of the current focus word w_i and generates a decoder hidden state s_t:

s_t = LSTM(h_i)

Additionally, a pointer layer is implemented for each decoder by an attention vector a^t to perform unlabelled parsing. This vector is generated by computing scores for all possible head-dependent pairs between the current focus word (represented by s_t) and each word from the input (represented by the encoder hidden representations h_j with j ∈ [0, n]).
To that end, a scoring function based on the biaffine attention mechanism (Dozat and Manning, 2017) is used and, then, a probability distribution over the input words is computed:

v^t_j = f_1(s_t)^T W f_2(h_j) + U^T f_1(s_t) + V^T f_2(h_j) + b
a^t = softmax(v^t)

where W is the weight matrix of the bi-linear term, U and V are the weight tensors of the linear terms, b is the bias vector, and f_1(⋅) and f_2(⋅) are two single-layer multilayer perceptrons (MLPs) with ELU activation (Dozat and Manning, 2017).
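The biaffine scoring step can be sketched in NumPy as follows; the dimensions and the random toy inputs are arbitrary, and this is an illustrative re-implementation rather than the authors' code:

```python
import numpy as np

def elu(x):
    # ELU activation used inside the f1/f2 MLPs
    return np.where(x > 0, x, np.exp(x) - 1.0)

def biaffine_scores(s_t, H, W, U, V, b, W1, W2):
    """Biaffine pointer scores for decoder state s_t over encoder
    states H (one row per input position, position 0 = ROOT):
    f1(s_t)^T W f2(h_j) + U^T f1(s_t) + V^T f2(h_j) + b,
    followed by a softmax over positions."""
    f1 = elu(W1 @ s_t)       # f1(s_t), shape (d,)
    F2 = elu(H @ W2.T)       # f2(h_j) for all j, shape (n+1, d)
    scores = F2 @ (W.T @ f1) + (U @ f1) + F2 @ V + b
    a = np.exp(scores - scores.max())
    return a / a.sum()       # probability distribution over positions

rng = np.random.default_rng(0)
d, n = 4, 3
a = biaffine_scores(rng.normal(size=d), rng.normal(size=(n + 1, d)),
                    rng.normal(size=(d, d)), rng.normal(size=d),
                    rng.normal(size=d), 0.0,
                    rng.normal(size=(d, d)), rng.normal(size=(d, d)))
print(a.shape)  # (4,): one probability per head candidate, incl. ROOT
```

Note that the U^T f_1(s_t) term is constant across positions, so it does not change the softmax ranking; it is kept here to mirror the formula above.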
Each attention vector a^t will serve as a pointer to the highest-scoring position p from the input, leading the parsing algorithm to create a dependency arc from the head word (w_p) to the current focus word (w_i). In case this dependency arc is forbidden because it generates cycles in the already-created dependency tree, the next highest-scoring position in a^t will be considered as output instead. Furthermore, the projectivity constraint is also enforced when processing continuous treebanks, discarding arcs that produce crossing dependencies. After the decoding process (where each word is attached to another word at each step), we obtain a well-formed dependency tree where each word has a single head (except the artificial ROOT node, which is not processed), with no cycles and, as a consequence of satisfying both the single-head and acyclicity constraints, with all words guaranteed to be connected.
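The fallback to the next highest-scoring head under the acyclicity constraint can be sketched as follows (a simplified greedy version, without the projectivity check used for continuous treebanks):

```python
def creates_cycle(heads, dep, head):
    """Would attaching dep -> head create a cycle in the partial tree?
    `heads` maps each already-attached word to its head position."""
    node = head
    while node is not None:
        if node == dep:
            return True
        node = heads.get(node)
    return False

def select_head(dep, scores, heads):
    """Pick the highest-scoring head position for `dep`; if that arc
    would close a cycle, fall back to the next-best position."""
    for head in sorted(range(len(scores)), key=lambda j: -scores[j]):
        if head != dep and not creates_cycle(heads, dep, head):
            return head
    return 0  # last resort: attach to ROOT (position 0)

heads = {2: 1}  # word 2 is already attached to word 1
# word 1's top-scoring head is word 2, but that arc would close a cycle,
# so the next-best position (3) is selected instead
print(select_head(1, [0.1, 0.2, 0.5, 0.3], heads))  # 3
```

In the full parser the scores come from the attention vector a^t; here they are a made-up list to keep the example self-contained.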
Finally, each decoder trains a labeler layer (implemented as a multi-class classifier) to predict arc labels and produce a labelled dependency tree. In particular, after the pointer layer attaches the current focus word (represented by s_t) to the pointed head word in position p (represented by h_p), this layer uses the same scoring function as the pointer to compute the score of each possible label for that arc and assign the highest-scoring one:

v^t_l = g_1(s_t)^T W_l g_2(h_p) + U_l^T g_1(s_t) + V_l^T g_2(h_p) + b_l

where W_l, U_l, V_l and b_l are parameters distinctly used for each label l ∈ {1, 2, …, L}, L being the number of labels. In addition, g_1(⋅) and g_2(⋅) are two single-layer MLPs with ELU activation.
The described transition-based algorithm can produce unrestricted non-projective dependency structures in O(n²) time complexity, since each decoder requires n attachments to successfully parse a sentence with n words, and at each step the attention vector is computed over the whole input. Figure 2 depicts a sketch of the multitask neural architecture and the decoding procedure for parsing the sentence in Figure 1.

Multitask Training. Following a multitask learning strategy (Caruana, 1997), we jointly train a single neural model for more than one task by optimizing the sum of their objectives and sharing a common encoder representation.
As both tasks use a dependency representation, the training objective of the pointer of each decoder is to learn the probability P_θ(y | x), where y is the correct unlabelled dependency tree for a given sentence x. This probability can be factorized into the sequence of SHIFT-ATTACH-p transitions used to build y (basically, the sequence of pointed indices p_t):

P_θ(y | x) = ∏_{t=1}^{n} P_θ(p_t | p_{<t}, x)

where p_{<t} represents the previously predicted indices following the left-to-right order. We minimize the negative log of the probability of choosing the correct sequence of indices, implemented as a cross-entropy loss:

L_ptr(θ) = − Σ_{t=1}^{n} log P_θ(p_t | p_{<t}, x)

Additionally, the labeler of each decoder is trained with softmax cross-entropy to minimize the negative log-likelihood of tagging with the correct label l a given dependency arc defined between the head word in position p and the dependent word in the t-th position:

L_lab(θ) = − log P_θ(l | p, t, x)

Then, the whole neural model is jointly trained by summing the pointer and labeler losses of the two decoders:

L(θ) = L_ptr^reg(θ) + L_lab^reg(θ) + L_ptr^aug(θ) + L_lab^aug(θ)

Finally, since both are considered main tasks and our goal is to train exclusively a single model, we neither use task weights nor perform auxiliary-task training.
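The summed cross-entropy objective for one decoder can be sketched on toy data as follows (the probability values are made up; in the real model they come from the pointer and labeler layers):

```python
import numpy as np

def nll(probs, gold):
    """Cross-entropy: negative log-probability of the gold indices,
    summed over decoding steps."""
    return -float(np.sum(np.log([p[g] for p, g in zip(probs, gold)])))

# toy per-step distributions for one decoder's pointer and labeler
ptr_probs = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]
lab_probs = [np.array([0.9, 0.1]), np.array([0.3, 0.7])]
gold_heads, gold_labels = [0, 1], [0, 1]

loss_dep = nll(ptr_probs, gold_heads) + nll(lab_probs, gold_labels)
# the full multitask objective adds the second decoder's pointer and
# labeler losses in exactly the same way, with no task weights
print(loss_dep > 0)  # True
```

Since the two decoders are aligned, both loss terms are computed over the same n decoding steps before being summed.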
For the constituent-to-dependency encoding, we identify head words on German constituents by applying the headrule set defined by Rehbein (2009) and, on English and Chinese structures, by using those developed by Collins (1999) and Zhang and Clark (2008), respectively. The resulting augmented dependencies match regular variants by around 70% in all languages, except for Chinese where the unlabelled augmented and regular dependency trees are exactly the same.
Following standard practice, we discard punctuation for evaluating on both Penn treebanks, using the EVALB script to report constituent accuracy. Furthermore, while all tokens are considered when reporting dependency performance on German datasets, we employ discodop (van Cranenburgh, Scha and Bod, 2016) and ignore punctuation and root symbols for evaluating on discontinuous constituent treebanks.

Settings
Word vectors are initialized with pre-trained structured-skipgram embeddings (Ling, Dyer, Black and Trancoso, 2015) for all languages and character and POS tag embeddings are randomly initialized. All of them are fine-tuned during training. POS tag embeddings are only enabled when gold information is used.
Additionally, we report accuracy gains obtained by augmenting our model with the pre-trained language model BERT (Devlin et al., 2019). Although different approaches to initializing deep contextualized word embeddings from BERT can be found, we proceed with weights extracted from one or several layers for each token as a word-level representation. In addition, since BERT is trained on subwords, we take the vector of each subword of an input token and use the average embedding as the final representation. In particular, we use in our experiments the pre-trained cased German and Chinese BERT models with 12 layers of 768-dimensional hidden vectors, and the uncased English BERT model with 24 layers of 1024-dimensional hidden vectors. Depending on the specific task, some layers proved to be more beneficial than others, which is especially crucial when the resulting embeddings are not fine-tuned during training. In order to check which layers are more suitable for our tasks, we test the combination of different layers on the development sets. In Table 1, we compare, for the English pre-trained model, the accuracy obtained by averaging several groups of four consecutive layers (from last layer 24 down to layer 13) and by just using weights from the second-to-last hidden layer (the simplest and most commonly used strategy, since it is less biased than the last layer towards the target objectives used to train BERT). As can be seen, the combination of layers 17 to 20 achieves the highest accuracy on both tasks and, therefore, this setup is used in our experiments on the PTB. Regarding the pre-trained models for German and Chinese, we noticed that comparable accuracies can be obtained by just using weights from the second-to-last layer instead of combining the four last layers, as can be seen, for instance, in Table 2 for the NEGRA dataset. Therefore, we decided to follow the simplest configuration and use the second-to-last layer in all experiments on German and Chinese. We discarded other combinations, such as the concatenation of several layers, to avoid increasing the dimension of BERT embeddings. Finally, by adapting BERT-based embeddings to our specific tasks, our approach would certainly obtain some gains in accuracy; however, we consider that the amount of resources necessary to that end does not justify the expensive fine-tuning of parameter-heavy BERT layers.

Table 2: Accuracy comparison on regular and augmented dependency trees of the NEGRA development set by using weights from different BERT layers.
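The layer- and subword-averaging just described can be sketched as follows (the array shapes and the toy inputs are illustrative):

```python
import numpy as np

def token_embedding(layer_outputs, subword_slices, layers):
    """Word-level BERT embedding: average the chosen hidden layers,
    then average the subword vectors belonging to each token.
    layer_outputs: (num_layers, num_subwords, dim); layers: 1-based."""
    picked = np.mean([layer_outputs[l - 1] for l in layers], axis=0)
    return [picked[s].mean(axis=0) for s in subword_slices]

# 2 layers, 4 subwords, dimension 3 (toy numbers)
layers_out = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)
# token 0 = subwords 0-1, token 1 = subwords 2-3
embs = token_embedding(layers_out, [slice(0, 2), slice(2, 4)], layers=[1, 2])
print(len(embs), embs[0].shape)  # 2 (3,)
```

For the PTB experiments, `layers=[17, 18, 19, 20]` would correspond to the best-performing group of four consecutive layers; for German and Chinese, a single second-to-last layer is passed instead.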
In each training epoch, we use the same number of examples from each task and choose the multitask model with the highest harmonic mean between the Labelled Attachment Scores on the augmented and regular development sets. In addition, due to random initializations, we report the average accuracy over 3 repetitions.
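A small sketch of the harmonic-mean model selection (the epoch numbers and dev scores are made up):

```python
def harmonic_mean(scores):
    """Harmonic mean of a collection of positive scores."""
    return len(scores) / sum(1.0 / s for s in scores)

def best_checkpoint(dev_scores):
    """Pick the epoch whose (augmented LAS, regular LAS) pair on the
    development sets has the highest harmonic mean."""
    return max(dev_scores, key=lambda e: harmonic_mean(dev_scores[e]))

dev = {1: (91.0, 93.0), 2: (92.5, 92.4), 3: (90.0, 94.5)}
print(best_checkpoint(dev))  # 2
```

Unlike the arithmetic mean, the harmonic mean penalizes checkpoints where one of the two tasks lags behind, which is why the balanced epoch 2 wins here.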
Finally, for parameter optimization and hyper-parameter selection, we follow Ma et al. (2018) and Dozat and Manning (2017); the resulting values are detailed in Table 3. Please note that we use for the multitask variant the exact same hyper-parameters as the single-task baselines. By optimizing them for our specific multitask model, we could certainly increase performance; however, we decided to keep the same settings for a fair comparison.

Results
In Table 4, we compare our own implementation of the single-task dependency and constituent parsers by Fernández-González and Gómez-Rodríguez (2019) and Fernández-González and Gómez-Rodríguez (2020) to the proposed multitask approach. In all datasets tested, training a single model of the multi-representational parser across both syntactic representations leads to accuracy gains on both tasks.
In order to further put our approach into context, we also provide a comparison against state-of-the-art models. In Table 6, we show how our approach outperforms the best dependency parsers to date on the PTB and CTB with regular pre-trained word embeddings. Moreover, although some of the included parsers use several parameter-heavy layers of BERT and additionally perform a task-specific adaptation via expensive fine-tuning, our approach achieves similar performance on the PTB and improves over all models on the CTB. We also outperform the single-task dependency parser by Fernández-González and Gómez-Rodríguez (2019) with BERT, providing evidence that our multitask neural architecture is learning extra syntactic information that is not encoded in the pre-trained model BERT. Furthermore, Table 7 shows that our novel parser obtains competitive accuracies on the constituent PTB and CTB without BERT (best F-score to date on the latter), while being more efficient than O(n³) and O(n⁵) approaches such as (Kitaev and Klein, 2018; Zhou and Zhao, 2019). Finally, in Table 8 we show how our novel neural architecture outperforms all existing single-task parsers on the discontinuous NEGRA and TIGER datasets with regular word embeddings.

Table 4: Accuracy comparison of single-task baseline parsers to the proposed multi-representational approach in both constituent and dependency parsing. We report Labeled Attachment Scores (LAS) and Unlabeled Attachment Scores (UAS) for dependency parsing and, for constituent parsing, the LAS on the augmented dependency trees and the F-score on the post-decoding constituent structure. The corresponding standard deviations over 3 runs for each score are reported in Table 5.

Analysis
In order to obtain insight into why the multi-representational variant outperforms single-task parsers in both tasks, we conduct an error analysis relative to structural factors. For the dependency parsing task, Figure 3(a) shows the F-score relative to dependency displacements (i.e., signed distances) on the PTB and on the concatenation of all datasets, Figure 3(b) reports the performance on common dependency relations on the PTB and Figure 3(c) shows the accuracy of both approaches relative to sentence length on the PTB and on all datasets together. From these results, we can point out that the multitask parser performs better on longer leftward dependency arcs (with positive displacement) and on longer sentences, improving over the single-task system in all frequent dependency relations.

Table 6: Accuracy comparison of state-of-the-art dependency parsers on the PTB and CTB. Models that fine-tune BERT are marked with *. Since performance with BERT was not reported in the original work (Fernández-González and Gómez-Rodríguez, 2019), we run our own implementation of the single-task dependency parser enhanced with BERT-based embeddings and include it in the second block as "Fernández-González and Gómez-Rodríguez (2019)".
Regarding constituent parsing, we specifically analyze performance on both discontinuous German datasets together, where the multi-representational model significantly outperforms the single-task approach. Firstly, we report in Table 8 an F-score exclusively measured on discontinuous constituents (DF1), showing a notable performance on discontinuous structures (probably thanks to the joint training with regular non-projective dependency structures). Additionally, Figure 3(d) plots the F-score on span identification for different lengths, Figure 3(e) shows the performance by span labels and Figure 3(f) measures the accuracy of both approaches at different sentence length cutoffs. It can be noticed that the multitask variant achieves higher performance when spans are larger and sentences tend to be longer, being only less accurate than the single-task parser on Coordinated Noun Phrases (CNP), where, in this particular case, a disagreement in notation between constituent and dependency representations might be misleading the multitask approach.
All this provides some evidence that learning across syntactic representations tackles the main weakness of transition-based sequential decoding: the impact of error propagation on the performance on large constituents and long sentences. Moreover, the information exclusively encoded by each formalism (span information in constituent trees and semantic relations in dependency structures) may complement the other and provide additional guidance not only in the final decoding steps (where the parser is more prone to make a mistake due to error propagation), but also in creating those structures that are less frequent in one of the two representations (as happens with long leftward dependency arcs in languages such as English).
It is also worth mentioning that even on Chinese datasets (where augmented and regular dependencies are the same) our approach benefits from learning across both structures, meaning that both constituent-based and regular dependency label sets provide useful syntactic information.
Finally, the multitask approach achieves lower accuracies on continuous constituent datasets because the encoding technique by Fernández-González and Martins (2015) cannot directly handle unary nodes (which are collapsed or, in the case of leaf unary nodes, assigned with a regular sequence tagger), losing some accuracy in continuous treebanks where the amount of this kind of structure is significant: 19.69% and 19.09% of the constituents on the PTB training and development sets, respectively, are unary nodes. One consequence of encoding unaries by collapsing them is that, while the labeler on regular dependency trees deals with 47 different dependency labels on the PTB, the labeler on augmented dependency structures manages 188 different tags (104 of them generated for encoding unary nodes). On the contrary, in discontinuous datasets such as TIGER (where unary nodes are discarded due to their low frequency), the regular label set size is 45 and the augmented version has 83. This significant increase in augmented dictionary sizes for processing continuous datasets might penalize the labeler's performance and affect final accuracy, especially in an encoding technique where dependency labels have a crucial role during constituent recovery. Additionally, the recovery of leaf unary nodes (73.55% of the total unaries in the PTB development set, for example) lost after the constituent-to-dependency conversion has a greater impact on final accuracy. The tagger in charge of that faces a complex task, since the amount of words with unary constituents on top is scarce in the training set (88.85% of words are tagged with NONE and, since a sequence of leaf unaries is collapsed into a single tag as done for non-leaf unary nodes, the model has to deal with a large dictionary size of 54 tags), hindering the adequate training of the tagger.
While this tagger achieves good overall accuracy (for instance, 98.65% on the PTB development set), its performance is worse when only words with attached unary nodes (just 10.59% of all words) are considered: 92.56% recall, 91.82% precision and 92.19% F-score on the PTB development set. That performance might seem good enough; however, it means that tagging errors are more than 5 times as frequent for words associated with unary nodes as for words overall, and the impact on final parsing accuracy is significant given that scores on the Penn treebanks are remarkably high. Despite all that, our approach obtains the best accuracy to date among all existing transition-based parsers on both continuous and discontinuous constituent structures, and it is on par with state-of-the-art models such as those of Kitaev and Klein (2018) and Zhou and Zhao (2019).
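As a quick sanity check on the "more than 5 times" figure, we can compare the two error rates reported above, under the simplifying assumption that 1 − F-score approximates the error rate on unary-bearing words:

```python
# Back-of-the-envelope check of the error-rate comparison, using the PTB
# development-set figures reported in the text.

overall_accuracy = 0.9865   # tagger accuracy over all words
unary_f_score = 0.9219      # F-score restricted to words with unary nodes

overall_error = 1.0 - overall_accuracy   # ~1.35% of all words
unary_error = 1.0 - unary_f_score        # ~7.81%, treating 1 - F as error

ratio = unary_error / overall_error
print(f"{ratio:.1f}x")  # ~5.8x, consistent with "more than 5 times"
```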

Related work
Parsers based on lexicalized grammars are trained using both constituent and unlabeled dependency information. This includes classic chart parsers (Collins, 2003) as well as lexicalized parsers that build dependencies with reduce transitions, such as that of Crabbé (2015), which can generate both kinds of structure. These parsers are, however, restricted to dependencies directly inferred from the lexicalized constituent trees. In this sense, the multitask approach is more flexible, as it does not have that limitation: one can use dependencies and constituents from different sources. In the deep learning era, there have been a few recent attempts to jointly train a neural model on constituent and dependency trees, producing both syntactic representations from a single model at decoding time.
In particular, Strzyz et al. (2019a) propose a multitask sequence-labelling architecture that, by representing constituent and dependency trees as linearizations (Gómez-Rodríguez and Vilares, 2018; Strzyz, Vilares and Gómez-Rodríguez, 2019b), can learn and perform parsing in both formalisms as joint tasks. While linear-time and fast, this approach yields parsing accuracy notably behind the state of the art (even when training separate models with auxiliary-task learning for each formalism), and the linearization strategy used for constituent parsing is restricted to continuous structures. Zhou and Zhao (2019) also explore the benefits of training a model across syntactic representations. They propose to integrate dependency and constituent information into a simplified variant of the Head-Driven Phrase Structure Grammar (HPSG) formalism. Then, to implement an HPSG parser, they modify the constituent chart-based parser by Kitaev and Klein (2018), which employs an O(n^5) CKY-style algorithm (Stern et al., 2017b) for decoding.11 Although their approach can produce both syntactic structures at the same time and achieves state-of-the-art accuracies on the PTB and CTB treebanks, their parser is bound to produce continuous and projective structures with a high runtime complexity.
Our approach can handle any kind of constituent and dependency structures and provides an efficient runtime complexity, crucial for some downstream applications.

Conclusions and Future Work
We propose a novel encoder-decoder neural architecture based on Pointer Networks that, after being jointly trained on regular and constituent-based dependency trees, can parse a sentence into both constituent and dependency trees. Apart from requiring only a single trained model, our approach can produce not only the simpler continuous/projective trees, but also discontinuous/non-projective structures in just O(n^2) runtime. We test our parser on the main dependency and constituent benchmarks, obtaining competitive results in all cases and reporting state-of-the-art accuracies on several datasets.
As future work, we plan to perform auxiliary-task learning and train a separate model for each task, testing different weights for the loss computation. This will forgo the advantage of training a single model for both tasks, but is likely to lead to further improvements in accuracy.

11 They also propose an O(n^3) decoding method that achieves worse accuracy.