Is Supervised Syntactic Parsing Beneficial for Language Understanding Tasks? An Empirical Investigation

Traditional NLP has long held (supervised) syntactic parsing to be necessary for successful higher-level semantic language understanding (LU). The recent advent of end-to-end neural models, self-supervised via language modeling (LM), and their success on a wide range of LU tasks, however, call this belief into question. In this work, we empirically investigate the usefulness of supervised parsing for semantic LU in the context of LM-pretrained transformer networks. Relying on the established fine-tuning paradigm, we first couple a pretrained transformer with a biaffine parsing head, aiming to infuse explicit syntactic knowledge from Universal Dependencies treebanks into the transformer. We then fine-tune the model for LU tasks and measure the effect of the intermediate parsing training (IPT) on downstream LU task performance. Results from both monolingual English and zero-shot language transfer experiments (with intermediate target-language parsing) show that explicit formalized syntax, injected into transformers through IPT, has very limited and inconsistent effect on downstream LU performance. Our results, coupled with our analysis of transformers’ representation spaces before and after intermediate parsing, make a significant step towards providing answers to an essential question: how (un)availing is supervised parsing for high-level semantic natural language understanding in the era of large neural models?

The question of the necessity of supervised parsing for LU and NLP in general has been raised before. More than a decade ago, Bod (2007) questioned the superiority of supervised parsing over unsupervised induction of syntactic structures in the context of statistical machine translation. Nonetheless, the NLP community has since still managed to find sufficient evidence for the usefulness of explicit syntax in higher-level LU tasks (Levy and Goldberg, 2014; Cheng and Kartsaklis, 2015; Bastings et al., 2017; Kasai et al., 2019; Zhang et al., 2019a, inter alia). However, we believe that the massive improvements brought about by LM-pretrained transformers (unexposed to any explicit syntactic signal) warrant a renewed scrutiny of the utility of supervised parsing for high-level language understanding. The research question we address in this work can be summarized as follows: (RQ) Is explicit structural language information, provided in the form of a widely adopted syntactic formalism (Universal Dependencies, UD) (Nivre et al., 2016) and injected in a supervised manner into LM-pretrained transformers, beneficial for transformers' downstream LU performance?
While an existing body of work (Lin et al., 2019; Tenney et al., 2019; Liu et al., 2019a; Kulmizev et al., 2020; Chi et al., 2020) probes transformers for structural phenomena, our work is more pragmatically motivated. We directly evaluate the effect of infusing structural language information from UD treebanks, via intermediate dependency parsing (DP) training, on transformers' performance in downstream LU. To this end, we couple a pretrained transformer with a biaffine parser similar to that of Dozat and Manning (2017), and train the model (i.e., fine-tune the transformer) for DP. Our parser on top of RoBERTa (Liu et al., 2019b) and XLM-R (Conneau et al., 2020) produces state-of-the-art parsing results. We then fine-tune the syntactically-informed transformers for three downstream LU tasks: natural language inference (NLI) (Williams et al., 2018; Conneau et al., 2018), paraphrase identification (Zhang et al., 2019b; Yang et al., 2019), and causal commonsense reasoning (Sap et al., 2019; Ponti et al., 2020). We quantify the contribution of explicit syntax by comparing the LU performance of the transformer exposed to intermediate parsing training (IPT) with that of its counterpart directly fine-tuned for the downstream task. We investigate the effects of IPT (1) monolingually, by fine-tuning English transformers, BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019b), on an English UD treebank, and (2) in downstream zero-shot language transfer, by fine-tuning massively multilingual transformers (MMTs), mBERT and XLM-R (Conneau et al., 2020), on treebanks of downstream target languages, before the downstream fine-tuning on source-language (English) data.
Our results render supervised parsing largely inconsequential to LU. We observe very limited and inconsistent gains, and only in zero-shot downstream transfer: further analyses reveal that (1) intermediate LM training yields comparable gains and (2) IPT only marginally changes the representation spaces of transformers exposed to a sufficient amount of language data in LM-pretraining. These findings explicitly question the need for supervised parsing and manually labeled treebanks, at least in the context of language understanding with transformer networks, and warrant further similar investigations in future work, in order to fully understand the impact of formal syntactic knowledge on LU performance with modern neural architectures. Disclaimer: The purpose of this work is definitely not to invalidate the admirable efforts on syntactic annotation and modelling, but rather to make an empirically driven step towards a deeper understanding of the relationship between LU and formalised syntactic knowledge, and the extent of its impact on modern semantic language processing and applications.

Related Work
Transformer-Based Dependency Parsing. Building on the success of preceding neural parsers (Chen and Manning, 2014; Kiperwasser and Goldberg, 2016), Dozat and Manning (2017) proposed a biaffine parsing head on top of a Bi-LSTM encoder: contextualized word vectors are fed to two feed-forward networks, producing dependent- and head-specific token representations, respectively. Arc and relation scores are produced via biaffine products between these dependent- and head-specific representation matrices. Finally, the Edmonds algorithm induces the optimal tree from pairwise arc predictions. Most recent DP work (Kondratyuk and Straka, 2019; Üstün et al., 2020) replaces the Bi-LSTM encoder with multilingual BERT's transformer, reporting state-of-the-art parsing performance. Kondratyuk and Straka (2019) fine-tune mBERT's parameters on the concatenation of all UD treebanks, whereas Üstün et al. (2020) freeze the original transformer's parameters and inject adapters (Houlsby et al., 2019) for parsing.
We propose a simpler transformer-based biaffine parser: we apply biaffine attention directly on representations from the transformer's output layer, eliminating the dependent- and head-specific feed-forward mappings. Despite this simplification, our biaffine parser produces DP results competitive with current state-of-the-art dependency parsers.
Syntactic BERTology. The substantial body of syntactic probing work shows that BERT (Devlin et al., 2019) (a) encodes text in a hierarchical manner (i.e., it encodes some implicit underlying syntax) (Lin et al., 2019); and (b) captures specific shallow syntactic information (parts-of-speech and syntactic chunks) (Tenney et al., 2019; Liu et al., 2019a). Hewitt and Manning (2019) find that linear transformations, when applied to BERT's contextualized word vectors, reflect distances in dependency trees. This suggests that BERT encodes sufficient structural information to reconstruct dependency trees (though without arc directionality and relations). Chi et al. (2020) extend the analysis to multilingual BERT, finding that its representation subspaces may recover trees for other languages as well. They also provide evidence that clusters of head-dependent pairs roughly correspond to UD relations. Similarly, Kulmizev et al. (2020) show that BERT's latent syntax corresponds more closely to UD trees than to shallower SUD (Gerdes et al., 2018) structures. Despite the evident similarity between BERT's latent syntax and formalisms such as UD, there is ample evidence that BERT insufficiently leverages syntax in downstream tasks: it often produces similar predictions for syntactically valid and for structurally corrupt sentences (e.g., with random word order) (Wallace et al., 2019; Ettinger, 2020; Zhao et al., 2020).
Intermediate Training. Sometimes called Supplementary Training on Intermediate Labeled-data Tasks (STILT) (Phang et al., 2018), intermediate training is a transfer learning setup in which one trains an LM-pretrained transformer on one or more supervised tasks (ideally with large training sets) before the final fine-tuning for the target task. Phang et al. (2018) show that intermediate NLI training of BERT on the Multi-NLI dataset (Williams et al., 2018) benefits several language understanding tasks. Subsequent work (Wang et al., 2019; Pruksachatkun et al., 2020) investigates more broadly which intermediate tasks benefit which target tasks, and to what extent.

Methodology
We first describe our transformer-based biaffine parser, followed by a brief description of classification heads for downstream LU tasks.

Biaffine Parser
Our parsing model, illustrated in Figure 1, consists of a biaffine attention layer applied directly on the transformer's output (BERT, RoBERTa, mBERT, or XLM-R). We first obtain word-level vectors by averaging the transformed representations of their constituent subwords, produced by the transformer. Let X ∈ R^{N×H} denote the encoding of a sentence with N word-level tokens, consisting of N H-dimensional vectors (where H is the transformer's hidden size). We use the transformed representation of the sentence start token (e.g., [CLS] for BERT), x_CLS ∈ R^H, as the representation of the root node of the parse tree, and prepend it to X: X' = [x_CLS; X] ∈ R^{(N+1)×H}. We then use X as the representation of syntactic dependents and X' as the representation of dependency heads, and directly compute the arc and relation scores as biaffine products of X and X':

Y_arc = X W_arc X'^T + B_arc;   Y_rel = X W_rel X'^T + B_rel,

where W_arc ∈ R^{H×H} and W_rel ∈ R^{H×H×R} denote, respectively, the arc classification matrix and the relation classification tensor (with R as the number of relations); B_arc and B_rel denote the corresponding bias parameters. We greedily select the dependency head for each word by finding the maximal score in each row of Y_arc: while this is not guaranteed to produce a tree, Zhang et al. (2017) show that in most cases it does. Our arc prediction loss is the cross-entropy loss with sentence words (plus the root node) as categorical labels: this implies a different number of labels for different sentences. We compute the relation prediction loss as a cross-entropy loss over gold arcs. Our final loss is the sum of the arc loss and the relation loss.
Note that, in comparison with the original biaffine parser (Dozat and Manning, 2017) and its other transformer-based variants (Kondratyuk and Straka, 2019; Üstün et al., 2020), we feed word-level representations derived from the transformer's output directly into the biaffine products, omitting the dependent- and head-specific MLP transformations. Deep task-specific architectures go against the fine-tuning idea: deep transformers have plenty of their own parameters that can be tuned for DP. We want to propagate as much of the explicit syntactic knowledge as possible into the transformer: a deep(er) DP-specific architecture on top of the transformer would impede the propagation of this knowledge into the transformer's parameters.
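To make the formulation above concrete, below is a minimal PyTorch-style sketch of the biaffine scoring head described in this section. It is an illustrative reconstruction rather than the released implementation: the module name, the scalar arc bias, and the assumption that word-level vectors (averaged subword representations) and the sentence-start vector are already available are ours.

```python
import torch
import torch.nn as nn


class BiaffineHead(nn.Module):
    """Illustrative biaffine arc/relation scorer applied directly on the
    transformer output (no dependent-/head-specific MLPs); a sketch of the
    formulation in Section 3.1, not the authors' original code."""

    def __init__(self, hidden_size: int, num_relations: int):
        super().__init__()
        H, R = hidden_size, num_relations
        self.W_arc = nn.Parameter(torch.empty(H, H))
        self.b_arc = nn.Parameter(torch.zeros(1))        # simplified scalar bias
        self.W_rel = nn.Parameter(torch.empty(R, H, H))
        self.b_rel = nn.Parameter(torch.zeros(R))
        nn.init.xavier_uniform_(self.W_arc)
        nn.init.xavier_uniform_(self.W_rel)

    def forward(self, words: torch.Tensor, cls: torch.Tensor):
        # words: (N, H) word-level vectors (averaged subword representations)
        # cls:   (H,)   sentence-start vector, used as the root node
        heads = torch.cat([cls.unsqueeze(0), words], dim=0)     # X': (N+1, H)
        # Arc scores: one row per dependent, one column per candidate head.
        y_arc = words @ self.W_arc @ heads.T + self.b_arc       # (N, N+1)
        # Relation scores for every (dependent, head, relation) triple.
        y_rel = torch.einsum("ih,rhk,jk->ijr", words, self.W_rel, heads)
        y_rel = y_rel + self.b_rel                               # (N, N+1, R)
        return y_arc, y_rel
```

Greedy decoding then takes the arg-max over each row of the arc scores; the training loss is the cross-entropy over arc-score rows plus the cross-entropy over relation scores at the gold arcs, as described above.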

Downstream Models
After IPT, we fine-tune transformers for two types of LU tasks: (1) sequence classification (SEQC) tasks, where a sequence of text needs to be assigned a discrete label; and (2) multiple-choice classification (MCC) tasks, where we need to select the correct answer among two or more options for a given premise and/or question. For SEQC, we simply apply a softmax classifier on the transformed representation of the sequence start token: y = softmax(x_CLS W_sc + b_sc), with W_sc ∈ R^{H×C} and b_sc ∈ R^C as the classifier's parameters and C as the number of the task's labels.
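A minimal sketch of this sequence classification head (assuming the sequence-start representation x_CLS has already been computed by the transformer):

```python
import torch
import torch.nn as nn


class SequenceClassificationHead(nn.Module):
    """Softmax classifier on the sequence-start token representation."""

    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, num_labels)   # W_sc, b_sc

    def forward(self, x_cls: torch.Tensor) -> torch.Tensor:
        # x_cls: (batch, H) -> class probabilities: (batch, C)
        return torch.softmax(self.linear(x_cls), dim=-1)
```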
For MCC tasks, we first concatenate each of the offered answer choices (independently of each other) to the premise and/or question, and encode it with the transformer. Since some of these tasks, e.g., COPA (Roemmele et al., 2011; Ponti et al., 2020), have very small training sets, we would like to support model transfer between different MCC tasks. Different multiple-choice classification tasks, however, may differ in the number of choices: a classifier whose number of parameters depends on the number of labels is thus not a good fit. Instead, we follow Sap et al. (2019) and Ponti et al. (2020), and couple the transformer with a feed-forward network that outputs a single scalar score for each answer. Let x^i_CLS ∈ R^H be the representation of the sequence start token (i.e., [CLS] or <s>) for the concatenation of the premise/question and the i-th answer; the score for the i-th answer is then y_i = FFN(x^i_CLS), with FFN: R^H → R as the feed-forward scorer shared across answers. We then apply a softmax function on the concatenation of the y_i scores of all answers: y = softmax([y_1, ..., y_K]), with K as the number of answers (i.e., labels) in the task. Finally, we compute the cross-entropy loss on y.
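For illustration, a corresponding sketch of the multiple-choice scorer; the exact shape of the feed-forward scorer (here a single linear layer) is our assumption, since only its scalar output is specified above:

```python
import torch
import torch.nn as nn


class MultipleChoiceHead(nn.Module):
    """One scalar score per answer choice, normalized with a softmax over
    answers; a sketch of the MCC setup described in Section 3.2."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, 1)   # FFN: R^H -> R

    def forward(self, cls_vectors: torch.Tensor) -> torch.Tensor:
        # cls_vectors: (K, H) -- sequence-start representations of the
        # premise/question concatenated with each of the K answer choices.
        scores = self.scorer(cls_vectors).squeeze(-1)   # (K,)
        return torch.log_softmax(scores, dim=-1)        # feed into an NLL loss
```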

Experimental Setup
We now detail the experimental setup, in which LU fine-tuning follows Intermediate Parsing Training (IPT).

Sequential Fine-Tuning
Our primary goal is to identify whether the injection of explicit syntax into transformers via supervised parsing training improves their downstream LU performance. This corresponds to sequential fine-tuning: (1) we first attach the biaffine parser from §3.1 to the transformer and train the whole model on a UD treebank; (2) we then couple the syntactically-informed transformer with the corresponding downstream classification head and perform the final fine-tuning. We then compare the downstream performance of transformers with and without the IPT step.
Mono- vs. Cross-Lingual IPT Experiments. In the monolingual setup, we work with English (EN) transformers, BERT and RoBERTa, pretrained on EN corpora. In the zero-shot language transfer setup, where we work with multilingual models, mBERT and XLM-R (Conneau et al., 2020), we first train the transformer via IPT on the UD treebank of the target language (i.e., a language with no downstream training data) before fine-tuning it on the EN training set of the LU task. We experiment with four target languages: German (DE), French (FR), Turkish (TR), and Chinese (ZH).

Standard vs. Adapter-Based Fine-Tuning. Standard fine-tuning updates all of the transformer's parameters, which, for tasks with large training sets, may have some drawbacks: (i) fine-tuning may take long and (ii) task-specific information may overwrite the useful distributional knowledge obtained during LM-pretraining. Adapter-based fine-tuning (Rebuffi et al., 2018; Houlsby et al., 2019; Pfeiffer et al., 2020) remedies these potential issues by keeping the original transformer's parameters frozen and inserting new adapter parameters into transformer layers. In fine-tuning, both sets of parameters are used to make predictions, but only the adapters are updated based on the loss gradients. As the number of adapter parameters is only a fraction of the number of original parameters (3-8%), adapter-based fine-tuning is also significantly faster.
Therefore, to account for the possibility of forgetting distributional knowledge in standard IPT fine-tuning, we also carry out adapter-based IPT. We follow Houlsby et al. (2019) and inject two bottleneck adapters into each transformer layer: one after the multi-head attention sublayer and another after the feed-forward sublayer. In downstream LU tasks, however, we unfreeze the original transformer parameters and fine-tune them together with the adapters (which now contain syntactic knowledge).
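As an illustration of the adapter architecture referenced above, here is a minimal sketch of a Houlsby-style bottleneck adapter; the bottleneck size of 64 and the GELU activation follow the settings reported in the appendix, while the wiring into a specific transformer implementation is omitted and left as an assumption.

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Houlsby-style bottleneck adapter: down-project, non-linearity,
    up-project, residual connection. Two such modules are inserted into each
    transformer layer (after the attention and feed-forward sublayers)."""

    def __init__(self, hidden_size: int, bottleneck_size: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.up = nn.Linear(bottleneck_size, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```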

Language Understanding Tasks
We now outline the downstream LU tasks for which we measure the effects of intermediate parsing training. For brevity, we report all the technical training and optimization details in the Supplementary Material.
Natural Language Inference. NLI is a ternary sentence-pair classification task. We predict if the hypothesis is entailed by the premise, contradicts it, or neither. For monolingual EN experiments, we use the Multi-NLI dataset (Williams et al., 2018). In zero-shot language transfer experiments, we train on EN Multi-NLI and evaluate on the target-language (DE, FR, TR, ZH) test portions of the multilingual XNLI dataset (Conneau et al., 2018).
Paraphrase Identification is a binary classification task where we predict if two sentences are mutual paraphrases. For EN, we train, validate, and test on the respective portions of the PAWS dataset (Zhang et al., 2019b). In zero-shot language transfer, we evaluate on the DE, FR, and ZH test portions of the PAWS-X dataset (Yang et al., 2019).
Commonsense Reasoning. We experiment with two multiple-choice classification (MCC) datasets for commonsense reasoning. In monolingual evaluation, we use the SocialIQA (SIQA) dataset (Sap et al., 2019), testing models' ability to reason about social interactions. Each SIQA instance consists of a premise, a question, and three possible answers. For zero-shot language transfer experiments, we resort to the recently published XCOPA dataset (Ponti et al., 2020), obtained by translating the test portions of the EN COPA (Choice of Plausible Alternatives) dataset (Roemmele et al., 2011) into 11 other languages. As mentioned, (X)COPA is an MCC task, with each instance containing a premise, a question, and two possible answers. Due to the very limited size of the EN COPA training set (a mere 400 instances), we follow Ponti et al. (2020) and evaluate the models fine-tuned on SIQA (EN) on the XCOPA test portions (in TR and ZH).

Evaluation
We first discuss parsing performance of our novel biaffine parser (see §3.1). We then show transformers' downstream LU performance after IPT, both in the monolingual EN setting and in zero-shot transfer.

Results and Discussion
Parsing Performance. In order to judge the benefits of IPT in downstream LU, we must first verify the parsing performance of our biaffine parser, i.e., that we successfully fine-tune transformers for DP. Table 1 shows that our biaffine parser gives state-of-the-art performance for all five languages in our study. Our (m)BERT-based parser outperforms UDify (Kondratyuk and Straka, 2019), also based on mBERT, for EN, FR, and TR, and performs comparably for ZH. Our parser based on XLM-R additionally yields an improvement over UDify for DE as well. It is worth noting that UDify (1) trains the mBERT-based parser on the concatenation of all UD treebanks and (2) additionally exploits gold UPOS and lemma annotations. We train our parsers only on the training portion of the respective treebank, without using any additional morphosyntactic information. Our mBERT-based parser outperforms our XLM-R-based parser only for ZH: this is likely due to a mismatch between XLM-R's subword tokenization for ZH and the gold tokenization in the ZH-GSD treebank.

Monolingual EN Results. Table 2 quantifies the effects of applying IPT with the EN-EWT UD treebank to BERT and RoBERTa. We report downstream LU performance on NLI, PAWS, and SIQA. The reported results do not favor supervised parsing (i.e., explicit syntax): compared to the original transformers that have not been exposed to any explicit syntactic supervision, the variants exposed to UD syntax via intermediate parsing training (Standard, Adapter) fail to produce any significant performance gains on any of the downstream LU tasks. One cannot argue that the cause of this might be forgetting (i.e., overwriting) of the distributional knowledge obtained in LM-pretraining during IPT: Adapter IPT variants, in which all distributional knowledge is preserved by design, also fail to yield any significant language understanding gains.
Zero-Shot Language Transfer. We show the results obtained in the zero-shot downstream language transfer setup, for both mBERT and XLM-R, in Table 3. Again, these results do not particularly favor the intermediate injection of explicit syntactic information in general. However, in a few cases we do observe gains from the intermediate target-language parsing training: e.g., a 3% gain on PAWS-X for ZH, as well as 4% and 5% gains on XCOPA for ZH and TR, respectively. Interestingly, all substantial improvements are obtained for mBERT; for XLM-R, the improvements are less consistent and less pronounced. We believe this is due to XLM-R's larger capacity, which makes it less susceptible to the "curse of multilinguality" (Conneau et al., 2020): with a subword vocabulary twice as large as mBERT's, XLM-R is able to store more language-specific information. Also, XLM-R has seen substantially more target-language data in LM-pretraining than mBERT for each language. This might mean that the larger IPT gains for mBERT come from mere exposure to additional target-language text rather than from the injection of the explicit syntactic UD signal (see further analyses in §5.2).

Further Analysis and Discussion
We first compare the impact of IPT with the effect of additional LM training on the same raw data.
We then quantify the topological modification that IPT makes in transformers' representation spaces.
Explicit Syntax or Just More Language Data?
We scrutinize the IPT gains that we observe in some zero-shot language transfer experiments. We hypothesize that these gains may, at least in part, be credited to the transformer model simply seeing more target-language data. To investigate this, we replace IPT with intermediate (masked) language modeling training (ILMT) on the same data (i.e., sentences from the respective treebank used in IPT) before the final downstream LU fine-tuning. Because MLM is a self-supervised objective, we can credit all differences in downstream LU performance between ILMT and IPT variants of the same pretrained transformer to supervised parsing, i.e., to the injection of explicit UD knowledge.
Table 3: Performance of multilingual transformers, mBERT and XLM-R, in zero-shot language transfer for downstream LU tasks, with and without prior intermediate dependency parsing training on target-language treebanks.

We run this analysis for the setups in which we obtain conspicuous gains with IPT: PAWS-X for mBERT (Adapter fine-tuning, for FR and ZH) and XCOPA for mBERT (Standard fine-tuning, TR and ZH). The comparison between IPT and ILMT for these setups is provided in Figure 2. Like IPT, ILMT on mBERT yields downstream gains over direct downstream fine-tuning (i.e., no intermediate training) in all four setups. The gains from ILMT (with the exception of XCOPA for ZH) are almost as large as the gains from IPT. This suggests that most of the gain from IPT comes from seeing more target-language text, and prevents us from concluding that the explicit syntactic annotation (i.e., supervised DP) is responsible for the LU improvements in zero-shot downstream transfer. This interpretation is corroborated by the fact that the IPT gains roughly correlate with the amount of language-specific data seen in LM-pretraining: the gains are more prominent for mBERT than for XLM-R, and for TR and ZH than for FR and DE (see Table 3).

Changes in Representation Spaces. Finally, we analyze how fine-tuning transformers on different tasks modifies the topology of their representation spaces. We encode the set of sentences S from the test portions of the treebanks used in IPT (IPT itself consumes only the train and development portions of the UD treebanks, so we can safely use test-portion sentences in this analysis without risking information leakage) with different transformer variants: (a) Base (B): the original LM-pretrained transformer, with no further training; (b) MLM (M): after ILMT; (c) Parsing (P): after Standard IPT; (d) Adapter-Parsing (AP): after Adapter-based IPT; and, for the monolingual transformers (BERT and RoBERTa), also (e) NLI (N): after NLI fine-tuning (without any intermediate training). We analyze the representations in each transformer layer separately: we represent each sentence s ∈ S with the average of the subword vectors from that layer (excluding the sequence start and end tokens). Let X_1 and X_2 ∈ R^{|S|×H} contain the corresponding representations of sentences from S from the i-th layer of two transformer variants (e.g., B and P). We measure the topological similarity of the i-th layers of the two transformers with the linear centered kernel alignment (l-CKA) (Kornblith et al., 2019). Although not invariant to all linear transformations, l-CKA is invariant to orthogonal projection and isotropic scaling, which suffices for our purposes. We base our analysis on the following assumption: the extent of change in the topology of a transformer's representation space (reflected by l-CKA) is proportional to the novelty of the knowledge injected in fine-tuning. Put differently, the injection of new (i.e., missing) knowledge should substantially change the topology of the space (low l-CKA score).
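For reference, a small sketch of the linear CKA computation between two such representation matrices; the formula follows the standard linear CKA of Kornblith et al. (2019), and the column-centering step is an assumption of this sketch.

```python
import torch


def linear_cka(x1: torch.Tensor, x2: torch.Tensor) -> float:
    """Linear centered kernel alignment between two representation matrices
    of shape (num_sentences, hidden_size)."""
    # Column-center both representation matrices.
    x1 = x1 - x1.mean(dim=0, keepdim=True)
    x2 = x2 - x2.mean(dim=0, keepdim=True)
    # ||X2^T X1||_F^2 / (||X1^T X1||_F * ||X2^T X2||_F)
    cross = torch.norm(x2.T @ x1, p="fro") ** 2
    norm1 = torch.norm(x1.T @ x1, p="fro")
    norm2 = torch.norm(x2.T @ x2, p="fro")
    return (cross / (norm1 * norm2)).item()
```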
Figure 3 shows the heatmap of l-CKA scores for pairs of BERT and RoBERTa variants, for layers L8-L12. Comparing B-P and B-N reveals that IPT changes the topology of BERT's higher layers roughly as much as NLI fine-tuning does, implying that both the English UD treebank (EN-EWT) and the Multi-NLI data contain a non-negligible amount of novel knowledge for BERT. The direct N-P comparison shows that IPT and NLI enrich BERT (and RoBERTa) with different types of knowledge, i.e., they change the representation spaces of its layers in different ways. This suggests that the transformers cannot acquire the missing knowledge needed for NLI from IPT (i.e., from EN-EWT), and explains why IPT is not effective for NLI. IPT (comparison B-P), as expected, injects somewhat more new information into transformers than ILMT (comparison B-M), and this is more pronounced for BERT than for RoBERTa. Interestingly, IPT and ILMT change RoBERTa's representation spaces much less than BERT's (see B-M and B-P l-CKA scores for L11/L12). This is additional evidence, besides the fact that RoBERTa consistently outperforms BERT on downstream tasks, that RoBERTa encodes richer language representations, due to its larger training corpora and longer training. It is also in line with suggestions that BERT is undertrained for its capacity (Liu et al., 2019b).
Very high B-P (and B-AP) l-CKA scores in the lower layers suggest that the explicit syntactic knowledge from human-curated treebanks is redundant w.r.t. the structural language knowledge transformers obtain through LM pretraining. This is consistent with concurrent observations (Chi et al., 2020; Kulmizev et al., 2020) showing (some) correspondence between the structural knowledge of (m)BERT and UD syntax. Finally, we observe the highest l-CKA scores in the P-AP column, suggesting that Standard and Adapter IPT inject roughly the same syntactic information, despite the different fine-tuning mechanisms.
Figure 4 illustrates the results of the same analysis for the language transfer experiments, for DE and TR (scores for FR and ZH are in the Supplementary Material). The effects of ILMT and IPT (B-M, B-P/B-AP) for DE and TR with mBERT and XLM-R resemble those for EN with BERT and RoBERTa: ILMT changes the transformers less than IPT does. The amount of new syntactic knowledge IPT injects is larger (l-CKA scores are lower) than for EN, especially for XLM-R (vs. RoBERTa for EN). We believe that this reflects the relative under-representation of the target language in the model's multilingual pretraining corpus (e.g., for TR): this leads to poorer representations of target-language structure by mBERT and XLM-R compared to BERT's and RoBERTa's representation of EN structure. This leaves us with two seemingly conflicting empirical findings: (a) IPT appears to inject a fair amount of target-language UD syntax, but (b) this translates to (mostly) insignificant and inconsistent gains in language transfer in LU tasks (especially so for XLM-R, cf. Table 3). A plausible hypothesis reconciling these two observations is that there is a substantial mismatch between the type of structural information we obtain through supervised (UD) parsing and the type of structural knowledge beneficial for LU tasks. If true, this hypothesis would render supervised parsing rather unavailing for high-level language understanding, at least in the context of LM-pretrained transformers, the current state of the art in NLP. We hope that our work will inspire further discussion and additional studies that focus on this fundamental relationship between parsing and LU.

Conclusion
This work thoroughly examined the effects of leveraging explicit formalized syntactic structures such as UD in state-of-the-art transformer-based language models (e.g., RoBERTa, XLM-R) for downstream language understanding (LU) tasks (e.g., NLI, commonsense reasoning), in monolingual and zero-shot language transfer settings. The key results, obtained through intermediate parsing training (IPT) based on a novel state-of-the-art dependency parser, indicate that explicit syntax, at least in our extensive experiments, provides zero to negligible impact on LU tasks.
Besides offering comprehensive empirical validation of the mismatch between explicit syntax and improving LU performance with current state-of-the-art transformer models, this study has shed new light on some fundamental questions such as the one in the title. Similar to how word embeddings (Mikolov et al., 2013), from a pragmatic viewpoint, removed sparse lexical features from the NLP horizon, will transformers make supervised parsing obsolete for LU applications or not? More dramatically, in the words of Rens Bod (2007): "Is the end of supervised parsing in sight" for semantic LU tasks?

B.1 Training and Optimization Details

In Adapter-based IPT, we set the adapter size to 64 and use GELU (Hendrycks and Gimpel, 2016) as the activation function in adapter layers.

B.2 Intermediate MLM Training (ILMT)
For the analysis in §5.2, we intermediately further train transformers with MLM: we mask 15% of subword tokens in each sentence and predict them with a linear classifier applied on the transformed representations of the [MASK] tokens. We compute the cross-entropy loss and use the hyperparameter configuration described in B.1. The development set, used for early stopping, is subject to fixed masking, whereas we mask the training sentences dynamically, before feeding them to the transformer.
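A minimal sketch of the dynamic masking step described above; the 15% masking rate follows the text, while details such as special-token handling and the 80/10/10 replacement scheme of the original BERT objective are not specified here, so this sketch simply replaces every selected token with the mask id.

```python
import torch


def dynamically_mask(input_ids: torch.Tensor, mask_token_id: int,
                     mask_prob: float = 0.15):
    """Randomly mask a fraction of subword tokens; returns masked inputs and
    labels, where -100 marks positions excluded from the cross-entropy loss."""
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape) < mask_prob      # fresh mask each epoch
    masked_inputs = input_ids.clone()
    masked_inputs[mask] = mask_token_id
    labels[~mask] = -100                                 # ignore unmasked tokens
    return masked_inputs, labels
```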

D Complete Topology Analysis Results
Finally, we show the complete results (for all layers, all transformers, and all languages covered in our experiments) of our topological analysis of transformers' representations before and after different fine-tuning steps. Figure 5 shows the analysis results for the monolingual EN transformers, BERT and RoBERTa. Figure 6 and Figure 7 show the results for the multilingual transformers, mBERT and XLM-R, respectively, for all four target languages included in our experiments: DE, FR, TR, and ZH.

Figure 1 :
Figure 1: Architecture of our variant of a transformer-based biaffine dependency parser.

Figure 3 :
Figure 3: Topological similarity (l-CKA) for pairs of BERT and RoBERTa variants, before and after different fine-tuning steps (B, M, P, AP, and N). Rows: transformer layers; Columns: pairs of transformer variants in comparison.

Figure 4 :
Figure 4: Analysis of topological similarity (l-CKA) for variants of mBERT and XLM-R before and after IPT and ILMT (B, M, P, AP) in zero-shot transfer experiments. Results shown for intermediate parsing on DE and TR data.

Table 1 :
Dependency parsing performance of our transformer-based biaffine parsers.

Table 2 :
Downstream LU performance of monolingual EN transformers (BERT and RoBERTa). None: no IPT; Standard: IPT via standard fine-tuning; Adapter: IPT via adapter-based fine-tuning.