Weighted posets: Learning surface order from dependency trees

This paper presents a novel algorithm for generating a surface word order for a sentence given its dependency tree using a two-stage process. Using dependency-based word embeddings and a Graph Neural Network, the algorithm first learns how to rewrite a dependency tree as a partially ordered set (poset) with edge-weights representing dependency distance. The subsequent topological sort of this poset reflects a surface word order. The algorithm is evaluated against a naive baseline of average dependency distances across 14 languages, performing well in terms of rank correlation and resulting rate of projectivity based on Universal Dependencies corpora.


Introduction
In a tradition dating at least back to Tesnière (1959), the words in a phrase or sentence can be thought of as a set of heads and dependents. Each word save the root is a dependent of another word, its head, and heads and dependents exist in a one-to-many relationship (Polguère and Mel'čuk, 2009). This arrangement of heads and their dependents forms a tree, or more formally an unordered directed acyclic graph (DAG), in which words are nodes and edges are the dependency relations. A sentence is one possible linearization or surface order of the DAG.
This paper describes a method for learning how to generate a valid surface order from a dependency tree. Determining the underlying dependency tree from a surface order is the rather extensively studied task of parsing; this paper concerns the opposite task.
The key insight of the paper is that rather than learning to directly convert a dependency tree to surface order, the target is instead an edge-weighted partially ordered set (poset). The poset's edge direction represents linear precedence in the surface order, while edge weight represents dependency distance, the number of words intervening between dependent and head in the surface order. The topological sort or linear extension of this poset, performed such that nodes connected by edges with smaller weights are placed closest to each other, reflects the surface order of the dependency tree.
For example, Figure 1 shows (a) the dependency tree, (b) edge-weighted poset, and (c) surface order of the sentence Personally I recommend you take your money elsewhere. Rather than attempting to learn how to convert (a) directly into (c), the approach outlined here rewrites (a) to (b) by learning edge directions and weights, then rewrites (b) to (c) via topological sort. Given examples of dependency trees and their corpus-attested surface orders, a neural network can learn to convert previously unseen dependency trees into surface orders by way of a weighted poset.
Implemented as a Graph Neural Network, the machine-learning algorithm treats inputs, targets, and outputs as directed graphs. Further, by representing words with their dependency-based embeddings (that is, embeddings trained on syntactic rather than linear contexts), the model generates a linearized surface order as the final step only, performing all other analysis within a graph framework. In this way surface order is treated as an emergent consequence of topologically sorting an edge-weighted poset, the weights of which represent learned dependency distances.

Background literature

Related linguistic work
Word order is one of the oldest and most prominent areas in the field of linguistics, and as such a wide variety of models have been advanced seeking to describe and understand word-order variation (Song, 2012). It has been approached from generalist perspectives, as in Behaghel's "what belongs together semantically is also placed close together" (1932, p. 4) or Uniform Information Density (Jaeger and R. Levy, 2006), as well as from specific constituent types, such as the ordering of adpositions and adverbials by manner, place and time (Boisson, 1981); demonstratives, numerals, and descriptive adjectives (Greenberg, 1963; Dryer, 2009); or adjectives by size, shape, and so on (Scott, 2002).
Building on principles such as Head Proximity (Rijkhoff, 1986), Early Immediate Constituents (Hawkins, 1994), Dependency Locality Theory (Gibson, 2000), and Minimize Domains (Hawkins, 2004), a recent approach to word order holds that the dependency distance, the number of words intervening between a dependent and its head, should be minimized and long-distance dependencies should be avoided (Hudson, 1995; H. Liu et al., 2017). Dependency Distance Minimization (DDM) proposes that a surface order with a smaller cumulative or mean dependency distance is generally favored over alternatives, a tendency that may be universal (Futrell et al., 2015).
However, DDM alone cannot fully explain word order: it does not distinguish between total mirror orders (the dependency distances of the cat purrs are presumably the same as purrs cat the) or, more plausibly, partial mirror orders such as the swapped adjectives in big red barn or red big barn. Methods for extending DDM include employing phonemes or syllables as the unit of distance (Ferrer-i-Cancho, 2017), or exploring the relationship between dependents and heads in information-theoretic terms (Dyer, 2018; Hahn et al., 2018). Another avenue is some sort of linear principle that could operate to differentiate mirror surface orders, such as "old concepts come before new ones" (Behaghel, 1932, p. 4), or the possibly contradictory "provide the most important information first" (Gundel, 1988, p. 229).
One way to conceive of surface order is as the result of rewriting a dependency graph by modifying its edge directions to reflect linear order. This process represents an intermediate stage between syntactic structure and surface order in which the linear order of certain word pairs is expressed as a series of precedence relations (Gerdes and Kahane, 2001; Kahane and Lareau, 2016). These precedence relations form a partially ordered set (poset) which can be topologically sorted into a non-unique linearization.
Note: Dependency distance is also referred to as dependency length in the literature.

Related NLG work
The field of natural language generation (NLG) seeks to model word order in the service of generating accurate natural language. Contra Harris (1954), language is often seen within NLG as a bag of words in which the task of realizing surface order is based on an n-gram language model (Filippova and Strube, 2009). A common implementation follows the bottom-up insights from dependency parsing (Y. Liu et al., 2015), and features such as syntactic category or dependency relations can improve algorithms for linearizing a bag of words (Zhang and Clark, 2015).
The First Multilingual Surface Realisation Shared Task (SR '18) brought together nine submissions in a shallow track requiring teams to determine word order and inflections of shuffled and lemmatized Universal Dependencies (UD) data, evaluated by both statistical and human assessment (Mille et al., 2018). For the linearization subtask, of the four submissions with the highest BLEU scores in at least one of the 10 supported languages: Puzikov and Gurevych (2018) use a bigram language model with binary neural-net classification; Elder and Hokamp (2018) treat the task as a machine-translation problem, using sequence-to-sequence models augmented with synthetic and outside data; Castro Ferreira et al. (2018) sort dependents into preceding and following groups which are then sorted by syntactic category or with a maximum entropy classifier; and King and White (2018) use features such as syntactic category, projectivity, and dependency distance to build a language model to incrementally linearize words.
It has long been noted that a reliance on statistical n-gram metrics like BLEU for measuring generated language is problematic given their inability to generalize seemingly unimportant word order variation or synonymy (Pastra and Saggion, 2003; Turian et al., 2006), as well as their lack of correlation with human assessment (Novikova et al., 2017). BLEU specifically has been criticized given its understudied technological biases, a sufficient reason to avoid using it alone to report scientific evidence (Reiter, 2018, p. 399). Further, while the target or reference of generated language is not necessarily a single sentence (there may be more than one semantically and syntactically valid surface realization of a given set of words, with context determining appropriateness), limited resources often result in a single human-produced reference being used, usually in the guise of an attested sentence in a corpus.

Projectivity
Projectivity refers to the constraint that a head and its dependents must occur in a contiguous sequence in the surface order (Marcus, 1965). Violations of projectivity, often referred to as discontinuities in the linguistics literature, are instances when a word occurring between a head h and dependent d is not dominated by h in the dependency tree. In the oft-cited non-projective sentence The hearing is scheduled on the issue today, both is and scheduled occur between hearing → issue, but are not dominated by hearing. A projective order would be The hearing on the issue is scheduled today.
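The definition above can be made concrete with a short sketch. The function names and the head-map encoding (1-based word positions, 0 for the root) are illustrative assumptions, as is the UD-style attachment chosen for the two example sentences; only the hearing → issue arc is taken from the text.

```python
def dominated_by(node, ancestor, heads):
    """Return True if `ancestor` is `node` itself or lies on its path to the root."""
    while node != 0:
        if node == ancestor:
            return True
        node = heads[node]
    return False


def non_projective_arcs(heads):
    """List (head, dependent) arcs with an intervening word not dominated by the head.

    `heads` maps each 1-based word position to its head position (0 = root).
    """
    violations = []
    for dep, head in heads.items():
        if head == 0:
            continue  # the root has no incoming arc to check
        lo, hi = sorted((dep, head))
        if any(not dominated_by(w, head, heads) for w in range(lo + 1, hi)):
            violations.append((head, dep))
    return violations


# Assumed attachments for: The hearing is scheduled on the issue today
#                           1    2      3     4     5   6    7     8
non_projective = {1: 2, 2: 4, 3: 4, 4: 0, 5: 7, 6: 7, 7: 2, 8: 4}

# Assumed attachments for: The hearing on the issue is scheduled today
#                           1    2     3   4    5    6     7       8
projective = {1: 2, 2: 7, 3: 5, 4: 5, 5: 2, 6: 7, 7: 0, 8: 7}

print(non_projective_arcs(non_projective))
print(non_projective_arcs(projective))
```

Under these assumed attachments, the first sentence yields the single offending arc from hearing (position 2) to issue (position 7), and the second yields none.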
It seems that all natural languages contain some amount of non-projective dependency relations, though calculating exact rates of non-projectivity is difficult given design decisions in the original parsing to create corpora. That is, some annotation schemes presuppose projectivity, and as a result corpora produced following those schemes will not exhibit discontinuities (Ferrer-i-Cancho and Gómez-Rodríguez, 2016). Observed percentages of non-projectivity range from single digits to the mid-teens depending on language, though sources disagree, likely due to differences in corpora, genre, and annotation scheme.
The causal relationship between Dependency Distance Minimization and projectivity is unsettled. Ninio (2017) concludes that "projectivity appears to be not so much a side-effect of DDM as a mathematical requisite for a method to encode a two-dimensional tree in a one-dimensional sentence-string in a way that makes reconstruction possible" (p. 216), appealing to other linguistic structures such as catenae (Osborne et al., 2012) to explain discontinuities. This traditional view, that projectivity exists as a principle independent of DDM, is largely disproven by an analysis which positively correlates dependency distance and the number of crossing dependencies across a variety of multilingual corpora (Ferrer-i-Cancho and Gómez-Rodríguez, 2016). Park and R. Levy (2009) note that an avoidance of long-distance dependencies can result in non-projective surface orders.

Syntactic word embeddings
The relationship between words has long been thought of distributionally; as Firth (1957) memorably puts it: "you shall know a word by the company it keeps" (p. 11). The company or context of a word is often conceived in terms of the linear neighbors that commonly occur around that word, a context that can be quantified with a dense vector or series of numbers called an embedding. Algorithms have been developed to learn a word's embedding in a corpus, such as skip-grams (Mikolov et al., 2013). O. Levy and Goldberg (2014) extend the notion of context beyond linear neighbors in their word2vecf to use dependency relations in learning syntactic embeddings: a word's context is based on the heads and dependents it takes in a corpus.
The number of dimensions necessary for a given task is an understudied problem. It is widely accepted that larger dimensions are better, up to a point of diminishing returns; for example, O. Levy and Goldberg (2014) use 300 in their evaluation, mentioning that 600 produces similar results. However, Spirling and Rodriguez (2019) note that very large dimensions relative to corpus size result in greater instability of embeddings, where instability refers to the rate at which the cosine-similar nearest neighbors differ between models (Wendlandt et al., 2018). Patel and Bhattacharyya (2017) explore the lower bound of embedding dimensions, below which performance suffers, providing a rather complicated method for calculating the minimum based on the maximum clique of a cosine-similarity matrix of word co-occurrence. An industry rule-of-thumb is to use the fourth root of vocabulary size.

Graph neural networks
While machine-learning algorithms, deep or otherwise, have traditionally operated on data represented in Euclidean space (for example, image data can be represented as a regular grid of pixel values), graph neural networks (GNN) allow the complexity of graph structures to be analyzed (Wu et al., 2019). The Graph Nets (GN) framework relies on a graph-to-graph model called a GN block "which takes a graph as input, performs computations over the structure, and returns a graph as output" (Battaglia et al., 2018, p. 11). In this framework, a graph is composed of nodes and their attributes, edges and their attributes, and a set of global attributes. Input and target graphs may contain different node and edge configurations; only the attributes for nodes and the attributes for edges must be of a consistent form. It is these sets of attributes which form the learned parameters of the neural network.
GN blocks also support message-passing neural networks (MPNN) (Gilmer et al., 2017), a method by which a graph's node and edge attributes undergo spatial-based graph convolutions and pooling (Wu et al., 2019, p. 8). In this manner a graph's connected nodes influence each other's node and edge attributes, passing information along directed edges.

Methodology
The approach described in the current study rests on the notion that adding dependency distances as positive or negative edge weights to a dependency tree allows the DAG to be rewritten as a poset whose topological sort reflects a surface order. Edge weights are therefore the number of words intervening between a dependent and its head, where negative weights indicate a dependent that precedes its head and positive a dependent following its head. Learning these edge weights is the core goal of the model. There are three tasks to be undertaken to convert a dependency tree into a surface order: (1) encode words to generalize from training to testing; (2) for a given dependency tree, learn whether each dependent precedes or follows its head in the surface order, and by how many words, in order to produce an edge-weighted poset; and (3) perform a topological sort of the poset based on edge weight. The first task is accomplished with word2vecf, the second with a graph neural network and message passing, and the third with a custom algorithm which rewrites a weighted poset to a linear graph such that nodes are connected in ascending order of edge weight. An overview of this process is shown in Figure 2.

Syntactic embeddings
Word2vecf (O. Levy and Goldberg, 2014) is used to generate syntactic word embeddings from a Universal Dependencies CoNLL-U file. Embeddings are created for each word|POS|relation, POS|relation, and POS in order to minimize polysemy and homography effects and to enable words unseen during training to be analyzed based on their syntactic category and/or dependency relation. The dimension of the embedding vector is determined by corpus size: in order to avoid the instability seen in both too-small and too-large dimensions, the industry rule-of-thumb of the fourth root of vocabulary size is used, multiplied by two. These dimensions were found during algorithm design to offer a reasonable balance between performance and generalizability.
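As a rough illustration of the dimension rule, the following sketch computes twice the fourth root of vocabulary size. The function name is hypothetical, and rounding to the nearest integer is an assumption, since the text does not say how fractional values are handled.

```python
def embedding_dim(vocab_size):
    """Twice the fourth root of vocabulary size, rounded (rounding is assumed)."""
    if vocab_size < 1:
        raise ValueError("vocabulary size must be positive")
    return round(2 * vocab_size ** 0.25)


# Hypothetical vocabulary sizes, chosen only to show the scale of the result.
for size in (2401, 10000):
    print(size, embedding_dim(size))
```

The rule grows slowly: a vocabulary four times larger raises the dimension by only about 40 percent, which keeps vectors small for small corpora.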

Graph neural network implementation
The machine-learning algorithm is implemented using Graph Nets and Sonnet, two DeepMind libraries for building graph neural networks using Google's TensorFlow (Abadi et al., 2015). The network's layers contain 18 neurons each and follow an 'encode-process-decode' model common to many Graph Nets implementations. Because learned edge weights in the GNN can be positive or negative, loss is calculated as the absolute difference between target and output. An Adam optimizer with a learning rate of 1e-3 is used, there are 6 message-passing steps, and the network is run through 10,000 iterations.
The input is a series of networkx directed graphs, one for each sentence in the training and testing sets. In order to effectively utilize message passing, edges are constructed as dependent → head, opposite the usual syntactic dependency-parsing edge direction. Each node has an attribute which is the vector produced by word2vecf's syntactic embedding. In the GNN, edge weights are used to track dependency distance, both negative and positive. A negative edge weight indicates that a head precedes its dependent, and a positive weight that a dependent precedes its head. Target edge weights are calculated as the difference between the dependent and head location in the original surface order, normalized to [-1,1] by dividing each distance by the maximum dependency distance of a given sentence.
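The target-weight calculation can be sketched in a few lines. This is an illustration under stated assumptions rather than the implementation used in the study: the function name and head-map encoding are hypothetical, the sign is taken literally as dependent position minus head position (it flips if the opposite convention is intended), and the attachments for the example phrase are an assumed UD-style analysis.

```python
def target_edge_weights(heads):
    """Signed dependency distances for one sentence, normalized to [-1, 1].

    `heads` maps each 1-based word position to its head position (0 = root).
    Returns {(dependent, head): weight}, with edges pointing dependent -> head.
    """
    # Assumption: distance = dependent position - head position.
    raw = {(dep, head): dep - head for dep, head in heads.items() if head != 0}
    if not raw:
        return {}  # a one-word sentence has no edges
    longest = max(abs(distance) for distance in raw.values())
    return {edge: distance / longest for edge, distance in raw.items()}


# Assumed attachments for: for your trip to Canada
#                           1    2    3   4    5
heads = {1: 3, 2: 3, 3: 0, 4: 5, 5: 3}
print(target_edge_weights(heads))
```

Dividing by the longest distance in the sentence means the farthest dependent always receives a weight of magnitude 1, so weights are comparable across sentences of different lengths.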
For example, Figure 2 (c) and (d) show the input and output for the phrase for your trip to Canada. The input to the GNN is the dependency tree, where each node's attribute is the word's syntactic embedding. The output is the same dependency tree with learned edge attributes reflecting the distance between dependent and head.

Weighted topological sort
Performing a topological sort of an edge-weighted poset such that connected nodes are placed in ascending order of edge weight is conceptually quite simple, but implementation is more complicated than it may appear. A straightforward approach of simply merging nodes with the smallest weights before those with larger weights does not properly order the nodes, since the weights of arcs crossing the merged nodes are not necessarily updated to reflect the merge. Instead, as outlined in Algorithm 1, each edge (u, v) from the poset can be added to a new directed graph order such that the edge's weight is maintained, even though u and v may not be adjacent in order.
When inserting edge (u, v) with weight w_uv into order, if u is already in order, then traverse the successor nodes of u until the total distance from u, a value maintained by w_sum, exceeds w_uv. At that point, insert v and update the weights of v's neighbors. This process is shown in lines 5-16. Similarly, as shown in lines 17-28, if v is already in order, traverse the predecessor nodes of v until w_sum exceeds w_uv, insert u, and update u's neighbors' weights. Finally, if neither u nor v is in order, add edge (u, v) with weight w_uv to order, as shown in line 30. When all edges from poset have been added to order, the topological sort of order is returned as the surface realization. Each edge in poset must be added to order, and in the worst-case scenario the weight of each existing edge in order must be examined. Therefore Algorithm 1 runs in O(n log n) time, where n is the number of edges in poset.
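The intuition behind turning edge weights into a linear order can be shown with a much simpler stand-in. The sketch below is not Algorithm 1: it skips the incremental insertion and weight updating described above and instead assigns each word a coordinate by accumulating signed edge weights outward from the root, then sorts by coordinate. It assumes a tree-shaped input, signed weights in which a negative value places the dependent before its head, and hypothetical names throughout.

```python
def linearize(weights, root):
    """Order nodes by coordinates accumulated from signed edge weights.

    `weights` maps (dependent, head) to a signed distance; `root` is the
    node with no head. A simplified stand-in, not the paper's Algorithm 1.
    """
    children = {}
    for (dep, head), weight in weights.items():
        children.setdefault(head, []).append((dep, weight))

    coordinate = {root: 0.0}
    stack = [root]
    while stack:
        head = stack.pop()
        for dep, weight in children.get(head, []):
            # Place each dependent relative to its head's position.
            coordinate[dep] = coordinate[head] + weight
            stack.append(dep)
    return sorted(coordinate, key=coordinate.get)


# Toy weights for the phrase used in Figure 2 (values are illustrative).
weights = {
    ("for", "trip"): -1.0,
    ("your", "trip"): -0.5,
    ("Canada", "trip"): 1.0,
    ("to", "Canada"): -0.5,
}
print(linearize(weights, "trip"))
```

Unlike Algorithm 1, this shortcut never revisits a placement, so it cannot repair cases where accumulated coordinates collide or interleave subtrees; it is meant only to show how signed weights induce an order.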

Baseline (AVG)
Rather than generating syntactic word embeddings and running the GNN, a naive approach to determining dependency distances is to average the distance between any two words in the training set for use on the testing set. Similar to the set of word embeddings (§3.1), in order to generalize to unseen words in the test set, average distances are created for each dependent pair of word|POS|relation, POS|relation, and POS. For example, if the|DET|det has an average dependency distance of 1.2 from horse|NOUN|nsubj, and brown|ADJ|amod has an average of 0.9 from horse|NOUN|nsubj, then using those two average distances as weights in a poset would result in a surface order of the brown horse. If red|ADJ|amod was unseen during training, then the average of all instances of ADJ|amod dependent on horse|NOUN|nsubj would be used; if that average distance were 1.3, then this naive approach would return red the horse.
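The averaging and back-off described above might be sketched as follows. The function names, the tuple encoding of tokens, and the order in which back-off keys are tried (most specific dependent first, then most specific head) are assumptions made for illustration; the training observations are toy values, not corpus statistics.

```python
from collections import defaultdict


def backoff_keys(word, pos, rel):
    """Descriptions of a token from most to least specific."""
    return [f"{word}|{pos}|{rel}", f"{pos}|{rel}", pos]


def train_avg(observations):
    """Average distance for every (dependent key, head key) pair.

    `observations` yields (dependent, head, distance), where dependent and
    head are (word, pos, rel) tuples.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for dep, head, distance in observations:
        for dep_key in backoff_keys(*dep):
            for head_key in backoff_keys(*head):
                cell = totals[(dep_key, head_key)]
                cell[0] += distance
                cell[1] += 1
    return {pair: total / count for pair, (total, count) in totals.items()}


def lookup(averages, dep, head):
    """Average distance for the most specific pair seen in training, else None."""
    for dep_key in backoff_keys(*dep):
        for head_key in backoff_keys(*head):
            if (dep_key, head_key) in averages:
                return averages[(dep_key, head_key)]
    return None


horse = ("horse", "NOUN", "nsubj")
training = [
    (("the", "DET", "det"), horse, 1.0),
    (("the", "DET", "det"), horse, 2.0),
    (("brown", "ADJ", "amod"), horse, 1.0),
]
averages = train_avg(training)
print(lookup(averages, ("the", "DET", "det"), horse))
print(lookup(averages, ("red", "ADJ", "amod"), horse))
```

In this toy data the unseen red|ADJ|amod falls back to the ADJ|amod average, mirroring the back-off behavior described in the paragraph above.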
Algorithm 1: Given an edge-weighted poset, construct a total order such that nodes with smallest weights are adjacent.

Evaluation
To evaluate the performance of the GNN algorithm compared to the AVG baseline in an automated way across various languages, we must unfortunately use a single target reference to compare the generated sentences. Thus the reference for each sentence is the attested version in the source UD corpus; the generated sentences from both AVG and GNN will be measured for similarity to the attested version.
The algorithm is attempting to order a set of words as closely as possible to their original surface realization in the corpus. Because words may repeat in the sentence, each order is instead represented by a list of integers, and it is these lists of integers which are compared. For example, assuming a target reference order of [1,2,3] for the red horse, the generated order of red the horse would be [2,1,3]. An obvious way to quantify how similar these integer lists are is with the widely used Spearman's rank correlation coefficient (Spearman, 1904), also known as Spearman's ρ (rho), which non-parametrically measures the similarity of two rankings. It ranges from -1, indicating that one order is the reverse of the other, to 1, for perfect correlation. The example of [1,2,3] and [2,1,3] returns a ρ of 0.5, since in the second order 1 and 2 both precede 3, but 1 does not precede 2. This measure tells us which approach, AVG or GNN, generates orders closest to the attested UD order, and serves as a loose gauge of overall effectiveness for both the general approach and each algorithm.
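The worked example can be checked with a minimal sketch of Spearman's ρ. It uses the closed-form expression 1 - 6Σd²/(n(n²-1)), which holds when neither list contains ties, as is the case for word-position lists; the function name is hypothetical.

```python
def spearman_rho(reference, generated):
    """Spearman's rank correlation for two tie-free orderings of equal length."""
    n = len(reference)
    if n != len(generated) or n < 2:
        raise ValueError("orders must have equal length of at least 2")
    d_squared = sum((r - g) ** 2 for r, g in zip(reference, generated))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


print(spearman_rho([1, 2, 3], [2, 1, 3]))  # the red horse vs. red the horse
print(spearman_rho([1, 2, 3], [3, 2, 1]))  # fully reversed order
```

For the two orders in the text the squared rank differences sum to 2, giving 1 - 12/24 = 0.5, while the fully reversed order gives -1.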
Further, to address the question of projectivity, the percentage of projective dependency arcs generated by the AVG baseline, the GNN algorithm, and the attested sentences is evaluated. In each case, an arc is counted as non-projective when a word appearing between a head h and dependent d is not dominated by h; the rate of projectivity is the share of arcs without such a violation. This measure allows us to explore how dependency distance might result in known rates of projectivity in natural language.

Results & Discussion
Table 1 shows the results of running both the AVG baseline and GNN algorithm on 14 UD v2.4 corpora representing a range of language families. These are relatively small corpora (between 500 and 2000 training sentences), and as a consequence their small vocabularies result in embedding vector dimensions between 14 and 22 due to the use of twice the fourth root of vocabulary size (§2.4, §3.1). While smaller than the more usual 50- or 300-element vectors, tying dimensionality to corpus vocabulary size seemed to avoid instability in the embedding space, though perhaps not in every case. Further, experiments with larger dimensions resulted in poor generalization to the testing set, possibly due to a lack of correlation between embeddings seen and unseen during training.
Results from Spearman's ρ rank correlation show that both AVG and GNN were able to positively correlate surface order with the source UD corpora. Because Spearman's ρ ranges from -1 to 1, positive values are better than chance; values above 0.5 seem rather promising. A large part of surface order can apparently be predicted based on dependency distance, averaged or learned. In all cases the GNN was able to approach AVG, exceeding it 10 out of 14 times. For many languages, the GNN achieved its peak value before training was complete, probably indicating overfitting. In the cases in which the GNN did not best AVG, the sparkline trends for Czech, Hungarian, and Latin suggest problems during training, perhaps due to overzealous learning rates or unstable embeddings, while Uyghur came very close.
In terms of projectivity, the GNN outperformed AVG in all cases, even when it did not best AVG in terms of Spearman's ρ. It is of course true that, were the AVG or GNN method able to perfectly capture the word order of the UD corpora, the rate of projectivity and Spearman's ρ would match exactly; it is nonetheless intriguing that short of perfection, Spearman's ρ and projectivity are not necessarily correlated. Nor do many of the intralanguage trends match between the two measures: the highest GNN projectivity was generally achieved late in the training process, and the two sparklines of, for example, Armenian, are not very similar. While the GNN outperformed AVG in generating surface orders with higher rates of projectivity, even those rates lagged quite a bit behind the actual rates for almost all languages. This is likely due to even seemingly minor word transpositions leading to non-projective arcs (§4.1).
[Figure 3, panels (a) target poset and (a′) poset generated by GNN, for the sentence this judge shall be chosen by lot.]
Importantly, AVG is a naive approach, not a learning algorithm. As such there is very little room for improvement by adjusting how the averaged dependency distances are determined, by employing morphological data or using lemmata instead of wordforms, for example. Conversely, changes to number of iterations, architecture, or hyperparameters of the GNN, especially tailored to each corpus, would almost certainly yield even better results, with a hypothetical upper bound limited only by the irreducible error present in a language's word-order variation.
The results confirm that dependency distances can be learned from dependency trees by the GNN algorithm, usually better than a naive approach. Those distances can be used to generate surface realizations with word orders that positively correlate with attested UD sentences. Because these promising results can be generated from an essentially off-the-shelf GNN with relatively standardized parameters across a wide variety of languages, future work on improving the GNN architecture is certainly warranted.

Error analysis
Delving a bit into the sorts of errors in the surface orders generated by GNN, Figure 3 shows four versions of the same sentence: (a) the poset for a sentence from the UD English-ParTUT corpus; (a′) the poset generated by GNN with a Spearman's ρ coefficient of 0.786, only slightly higher than the average ρ for that corpus and therefore a typical generation; (b) the poset for the same sentence from French-ParTUT; and (b′) the poset generated by GNN with a non-projective arc.
Figure 3 (a′) deviates from (a) in that the weight of be → chosen is larger than that of shall → chosen, and both those edges have weights larger than the combination of this → judge (weight 0.6) and judge → chosen (weight 0.1). The result is a sentence in which the auxiliaries be and shall are transposed, and both appear in front of this judge. Similarly, (b′) deviates from (b) in that the weight of est → désigné (0.8) is larger than the weight of juge → désigné (0.7), though unlike the English not larger than the combined weights of ce → juge (0.9) and juge → désigné (0.7). The result is a transposition of juge and est, causing a non-projective arc as est appears between ce and juge but is not dominated by either.
Aside from the transposition of the auxiliaries in (a′), both generated surface orders suffer from the weight of judge/juge → chosen/désigné being too small. While the offending edge in (a′) is quite small at 0.02, requiring an addition of over 2 to overcome the combined weights of the auxiliaries be and shall, an addition of just 0.11 to the weight of the edge would resolve (b′). In other words, if the weight of juge → désigné (0.7) were increased to 0.81, it would be larger than est → désigné (0.8) and therefore juge would precede est, resolving (b′) to (b). Neither training set for these corpora contains the word judge/juge, so the word's embedding collapses to an average of all nouns acting as passive subjects, NOUN|nsubj:pass. This suggests that insufficient training size, lack of proper generalization from the available training data, and/or problematic embedding creation for unseen words is at fault here. These can all be addressed in future research.

Dependency distance tolerance & projectivity
What is being learned by the GNN? That is, what do the edge weights, used to create a poset, actually represent? The question is perhaps conceptually a bit easier with AVG: the weights are the average distances between dependents and their heads in a corpus. AVG calculates how far a dependent tends to be from its head, or put another way, how many intervening words tend to be allowed between dependent and head in a collection of surface orders. It is a dependent word's tolerance for how far it can be placed in front of or behind its head in a surface realization. It seems that the GNN is learning this same information about dependency distance tolerance, but in a more subtle and context-sensitive way. Rather than simply an average distance, the GNN is learning how far a dependent can be placed from its head in concert with its syntactically related words in a given dependency tree.
Dependency distance tolerance is effectively a maximum for how far apart a dependent and head can be in the surface realization of a given dependency tree. What factors determine this tolerance and how it might be encoded in a linguistic system is left for other research. However, dependency distance tolerance is a useful concept for exploring how projectivity might come about.
It was suggested in §2.3 that observed rates of projectivity might emerge from Dependency Distance Minimization (DDM). That is, the desire to minimize cumulative or mean dependency distances results in the high rates of projectivity seen across languages. A further goal within DDM is to avoid long-distance dependencies, though this avoidance may result in non-projective surface orders. The concept of dependency distance tolerance provides a more nuanced view of this second DDM motivation.
The topological sort of a poset whose edge weights correspond to contextual dependency tolerances, at least as implemented here, may place dependents closer to their heads than their tolerance, but not farther. As such, it defines an upper bound for each edge weight in a poset. A surface order can be seen as the result of assembling words such that dependents are placed no farther from their heads than their tolerance. In this way dependency distances in the surface order are not only minimized, but minimized in such a way that each word's contextual dependency tolerance is taken into account.
Thus the topological sort of a weighted poset implements DDM's goal of minimizing dependency distances generally, while the learned dependency tolerances provide a contextually sensitive definition of what 'long distance' means for each dependent pair in order to avoid generating surface orders with long-distance dependencies. Through this lens both the strong tendency towards projectivity across languages, as well as the occasional instances of non-projectivity, can be seen as an effect of avoiding dependency distances which exceed their contextual tolerances.

Summary
This paper describes a novel method for converting dependency trees to surface orders via syntactic word embeddings and edge-weighted posets. The embeddings are learned via word2vecf, and poset edge directions and weights are learned by a graph neural network (GNN), all trained on Universal Dependencies (UD) corpora. An algorithm is provided for topologically sorting a weighted poset. The output of the GNN is compared to a naive baseline in which average dependency distances are used as poset edge weight, both evaluated against attested word orders in UD corpora representing a variety of language families. The GNN outperforms the baseline on 10 of 14 corpora in terms of rank correlation and in all cases in terms of rate of projectivity.
The main contribution of the paper is the insight that a surface order can be represented by an edge-weighted poset, the weights of which can be learned by a graph neural network. Representing surface order as the result of topologically sorting this poset contributes to our understanding of how a tendency towards projectivity across natural languages might be explained.
Future research directions include improvement of the GNN architecture and hyperparameters; exploration of the interaction between word embedding dimension, performance, and generalizability; and the analysis of larger corpora.

Figure 1: Three graph-theoretic representations of a sentence. (a) A dependency tree as an unordered directed acyclic graph (DAG). (b) A poset in which edge weight indicates dependency distance in the surface order. (c) A surface order generated by a topological sort of the poset in (b).

Figure 2: Overview of methodology. (a) A CoNLL-U file is parsed by word2vecf to produce (b) a list of syntactic word embeddings for each wordform|POS|relation, POS|relation, and POS. These embeddings form the node attributes for (c) a directed graph of a dependency tree. The graph's edge attribute is a single-element vector which will contain the learned distance between dependent and head. Note that edge direction is reversed from conventional dependency directions to enable more effective message passing. (d) An output graph isomorphic to (c) with learned node and edge attributes. (e) A poset with edge weights representing the distance between words in the eventual surface order, built from the learned directions and distances in (d). Note the flipped edge direction between trip and Canada in the DAG (d) versus the poset (e). (f) The unique surface order resulting from a topological sort of (e).

Figure 3: Target and generated posets from English- and French-ParTUT corpora.

Table 1: Results. Each language is listed by its corpus; number of training and testing sentences; embedding dimension; Spearman's ρ rank correlation coefficient for AVG and GNN; and rate of projectivity for AVG, GNN, and as attested in the UD corpus. Boldfaced numbers indicate cases in which GNN performed better than AVG. Sparklines show trends over 10K iterations with horizontal gray lines indicating AVG performance and black dots showing peak performance of GNN.