Efficient Computation of Expectations under Spanning Tree Distributions

We give a general framework for inference in spanning tree models. We propose unified algorithms for the important cases of first-order expectations and second-order expectations in edge-factored, non-projective spanning-tree models. Our algorithms exploit a fundamental connection between gradients and expectations, which allows us to derive efficient algorithms. These algorithms are easy to implement, given the prevalence of automatic differentiation software. We motivate the development of our framework with several cautionary tales of previous research, which has developed numerous less-than-optimal algorithms for computing expectations and their gradients. We demonstrate how our framework efficiently computes several quantities with known algorithms, including the expected attachment score, entropy, and generalized expectation criteria. As a bonus, we give algorithms for quantities that are missing in the literature, including the KL divergence. In all cases, our approach matches the efficiency of existing algorithms and, in several cases, reduces the runtime complexity by a factor (or two) of the sentence length. We validate the implementation of our framework through runtime experiments. We find our algorithms are up to 12 and 26 times faster than previous algorithms for computing the Shannon entropy and the gradient of the generalized expectation objective, respectively.


Introduction
Dependency trees are a fundamental combinatorial structure in natural language processing. It follows that probability models over dependency trees are an important object of study. In terms of graph theory, one can view a (non-projective) dependency tree as a spanning arborescence of a graph in which each word is a node. A dependency tree can then be given a score that decomposes multiplicatively over its edges. As our models do not have the linguistic knowledge to rule out arcs a priori, the graph is complete, i.e., every node has an edge to every other node. When it is clear from context, we will refer to "spanning arborescences" simply as "trees" without further qualification.
The celebrated matrix-tree theorem (MTT) (Kirchhoff, 1847)—more specifically, its counterpart for directed graphs (Tutte, 1984)—appeared before the NLP community with gusto in 2007 through an onslaught of contemporaneous papers (Koo et al., 2007; McDonald and Satta, 2007; Smith and Smith, 2007) that leverage the classic result to efficiently compute the normalization constant of a distribution over trees. The result is still used in more recent work (Liu and Lapata, 2018; Ma and Hovy, 2017). We build upon this tradition through a framework for computing expectations of a rich family of functions under a distribution over trees. Expectations appear in all aspects of the probabilistic modeling process: training, model validation, and prediction. Therefore, developing a framework to efficiently compute expectations over trees is key to accelerating progress in probabilistic modeling.
Our framework is motivated by the lack of a unified approach for computing expectations over spanning trees in the literature. We believe this gap has led to confusion, which has resulted in the publication of numerous less-than-optimal algorithms. We motivate the importance of developing such a framework by highlighting the following cautionary tales.
• McDonald and Satta (2007) proposed an algorithm for computing the expectation of an edge in O(n^5) time rather than O(n^3). This was corrected by Koo et al. (2007) and Smith and Smith (2007); nevertheless, the oversight was made.
• Smith and Eisner (2007) proposed an O(n^4) algorithm for computing entropy, which we show can be done in O(n^3).
• Druck et al. (2009) proposed an O(n^5) algorithm for evaluating the gradient of the generalized expectation (GE) criterion (McCallum et al., 2007). The runtime bottleneck of their approach is the evaluation of a certain covariance matrix. Druck and Smith (2009) later improved the evaluation of this covariance matrix to O(n^4). Using general principles from automatic differentiation, we develop an O(n^3) algorithm that avoids creating this covariance matrix. Druck (2011) identified the potential of automatic differentiation techniques, but mistakenly concluded that their numerical accuracy and practical efficiency would be suboptimal.
We summarize our main results below:
• Unified Framework: We develop an algorithmic framework for calculating expectations over spanning arborescences, giving precise mathematical assumptions on the types of functions supported and efficient algorithms that piggyback on the rich automatic differentiation literature (Griewank and Walther, 2008). Our framework comes as a consequence of MTT (Tutte, 1984), its use in dependency parsing (Koo et al., 2007), and the connection between expectations and gradients (Darwiche, 2003; Li and Eisner, 2009).


The Distribution over Trees

We consider a rooted, weighted, directed graph G = (V, E, ρ) with root node ρ and n non-root nodes. We denote by I_E(m) and O_E(h) the sets of incoming edges to node m and outgoing edges from node h in the edge set E, respectively. The sizes of these sets, |I_E(m)| and |O_E(h)|, are referred to as the in- and out-degree of a node, respectively. The root node ρ has no incoming edges, therefore |I_E(ρ)| = 0. Each edge has a weight w_e ∈ R≥0. Any e ∉ E may be regarded as having w_e = 0. Consequently, we can represent hard constraints on a graph by removing edges from E or, equivalently, setting w_e = 0. We can organize the edge weights into a weighted adjacency matrix A ∈ R^{n×n} for edges between non-root nodes and a special root-weight vector ρ ∈ R^n for edges emanating from the root. We define the following for all h, m ∈ V \ {ρ}:

    A_{hm} = w_{(h,m)}    and    ρ_m = w_{(ρ,m)}

A spanning arborescence (or tree for short) d of a rooted graph is a set of n edges such that every non-root node m has exactly one incoming edge and the root node ρ has at least one outgoing edge. Furthermore, a tree does not contain any cycles. (If the graph has no self-loops, the root condition is automatically met: any acyclic edge set in which each non-root node has exactly one incoming edge must include at least one edge leaving ρ.) More formally, the set of all trees D in a graph is defined by

    D = { d ⊆ E : |I_d(m)| = 1 for all m ∈ V \ {ρ}, |O_d(ρ)| ≥ 1, and d contains no cycles }
We will assume that D always has at least one element (this is not necessarily true for all graphs). The weight of a tree d ∈ D is defined as the product of its edge weights:

    w(d) = ∏_{e∈d} w_e

Normalizing the weight of each tree yields a probability distribution over trees,

    p(d) = w(d) / Z

where Z is the normalization constant defined as

    Z = Σ_{d∈D} w(d)

The Matrix-Tree Theorem
The normalization constant Z involves a sum over D, which can grow exponentially large with n. Fortunately, there is sufficient structure in the computation that it can be evaluated in O(n^3) time.
The Matrix-Tree Theorem (Tutte, 1984; Kirchhoff, 1847) establishes a connection between Z and the determinant of the Laplacian matrix L ∈ R^{n×n}, defined for all h, m ∈ V \ {ρ} by

    L_{hm} = ρ_m + Σ_{h'≠m} A_{h'm}   if h = m
    L_{hm} = −A_{hm}                  otherwise

Theorem 1 (Matrix-Tree Theorem). For any graph G = (V, E, ρ), the normalization constant over all trees is given by the determinant of the Laplacian (Tutte, 1984, p. 140):

    Z = |L|

The determinant of any matrix in R^{n×n} can be evaluated in O(n^3) time,[3] hence Z can be evaluated with the same complexity.
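To make the construction concrete, the following sketch (Python with NumPy; the three-node graph, random weights, and helper names are our own toy assumptions, not from the paper) builds the multi-root Laplacian and checks Theorem 1 against a brute-force sum over all arborescences:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 3
A = rng.uniform(0.5, 2.0, (n, n))   # A[h, m]: weight of edge (h+1) -> (m+1)
np.fill_diagonal(A, 0.0)            # no self-loops
rho = rng.uniform(0.5, 2.0, n)      # rho[m]: weight of edge root -> (m+1)

# Laplacian: L[m, m] = rho[m] + sum_h A[h, m];  L[h, m] = -A[h, m] for h != m.
L = -A.copy()
np.fill_diagonal(L, rho + A.sum(axis=0))
Z_det = np.linalg.det(L)            # Theorem 1: Z = |L|

def reaches_root(par, m):
    """Follow parent pointers from m; True iff we hit the root (node 0)."""
    seen = set()
    while m != 0:
        if m in seen:               # cycle detected
            return False
        seen.add(m)
        m = par[m - 1]
    return True

def trees():
    """Enumerate all (multi-root) spanning arborescences as parent maps."""
    for par in itertools.product(range(n + 1), repeat=n):
        if all(reaches_root(par, m) for m in range(1, n + 1)):
            yield par

def weight(par):
    w = 1.0
    for m, h in enumerate(par, start=1):
        w *= rho[m - 1] if h == 0 else A[h - 1, m - 1]
    return w

Z_brute = sum(weight(par) for par in trees())
assert abs(Z_det - Z_brute) < 1e-8 * Z_brute
```

The determinant takes O(n^3) while the brute force visits every tree, which is why the check is only feasible for tiny n.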

Dependency parsing & the Laplacian zoo
Encoding dependency parsing as a spanning tree distribution can be done as follows. A sentence of length n is a graph G = (V, E, ρ): each non-root node represents a token of the sentence, and ρ represents a special root symbol. Each edge in the graph represents a possible dependency relation between a head word h and a modifier word m. Therefore, a dependency tree is a set of n edges in which each word has exactly one parent (i.e., one incoming edge) and the root symbol is connected to at least one word. An example of a dependency tree is given in Fig. 1.
[3] Algorithms exist to compute the determinant more efficiently than O(n^3) (Coppersmith and Winograd, 1987). However, they are typically not used in practice because of the large constant factors associated with the more sophisticated strategies (Dumas and Pan, 2016). For simplicity, we will assume that the runtime of matrix determinants is O(n^3).
Figure 1: Example of a dependency tree for the sentence "We compute expectations very efficiently" (arc labels: root, nsubj, dobj, advmod, advmod).

In the remainder of this section, we give several variations on the Laplacian matrix that encode specific constraints on the set of trees being summed over. In many cases of dependency parsing, we want ρ to have exactly one outgoing edge. This is motivated by linguistic theory, in which the root of a sentence should be a token in the sentence rather than a special root symbol (Tesnière, 1959). There are exceptions to this, such as parsing Twitter (Kong et al., 2014) or parsing specific languages (e.g., the Prague Treebank (Bejček et al., 2013)). We call trees without this restriction multi-root trees; they are represented by the set D described earlier. Therefore, the normalization constant over all multi-root trees can be computed by a direct application of Theorem 1. Nevertheless, in most dependency parsing corpora, only one edge may emanate from the root (Nivre et al., 2018). We thus modify the set of possible trees to those with an out-degree-1 restriction on the root node ρ. We call these single-rooted trees.
To compute the normalization constant over single-rooted trees, we construct the root-weighted Laplacian L̂: its first row is the root-weight vector, L̂_{1m} = ρ_m, and its remaining entries use only non-root weights, L̂_{hm} = −A_{hm} for h ≠ m and L̂_{mm} = Σ_{h'≠m} A_{h'm}. The choice to replace row 1 by ρ is made by convention; Koo et al. (2007) prove that the result below holds if any row is replaced by ρ in the construction of L̂.
Proposition 1 (Single-root MTT). For any graph G = (V, E, ρ), the normalization constant over all single-rooted trees is given by the determinant of the root-weighted Laplacian (Koo et al., 2007):

    Z = |L̂|

This was an important discovery by Koo et al. (2007), as neither McDonald and Satta (2007) nor Smith and Smith (2007) identified this distinction. Without Proposition 1, finding the normalization constant would require n calls to Theorem 1 and would thus take O(n^4) instead of the O(n^3) of Proposition 1.
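Proposition 1 can be checked the same way. The sketch below (Python/NumPy; the toy graph and all names are ours) builds the root-weighted Laplacian and compares its determinant with a brute-force sum over trees whose root has out-degree exactly one:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 3
A = rng.uniform(0.5, 2.0, (n, n))
np.fill_diagonal(A, 0.0)
rho = rng.uniform(0.5, 2.0, n)

# Root-weighted Laplacian: the diagonal uses only non-root weights,
# and the first row is replaced by the root weights rho.
Lhat = -A.copy()
np.fill_diagonal(Lhat, A.sum(axis=0))
Lhat[0, :] = rho
Z_single = np.linalg.det(Lhat)

def reaches_root(par, m):
    seen = set()
    while m != 0:
        if m in seen:
            return False
        seen.add(m)
        m = par[m - 1]
    return True

def weight(par):
    w = 1.0
    for m, h in enumerate(par, start=1):
        w *= rho[m - 1] if h == 0 else A[h - 1, m - 1]
    return w

# Brute force over trees whose root (node 0) has out-degree exactly 1.
Z_brute = sum(
    weight(par)
    for par in itertools.product(range(n + 1), repeat=n)
    if sum(h == 0 for h in par) == 1
    and all(reaches_root(par, m) for m in range(1, n + 1))
)
assert abs(Z_single - Z_brute) < 1e-8 * Z_brute
```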
Labeled trees. To encode labeled dependency relations in our set of trees, we augment edges with labels, resulting in a multi-graph in which multiple edges may exist between pairs of nodes. Hence, each edge e ∈ E corresponds to a triple (h, m, ℓ), where h and m are the head and modifier nodes as before and ℓ is a dependency relation label in L.
Proposition 2 (Labeled MTT). For any multi-graph G = (V, E, ρ), the normalization constant over all multi- or single-rooted trees can be calculated using Theorem 1 or Proposition 1 (respectively) with the adjusted adjacency matrix and root-weight vector (Koo et al., 2007; McDonald and Satta, 2007; Smith and Eisner, 2007):

    A_{hm} = Σ_{ℓ∈L} w_{(h,m,ℓ)}    and    ρ_m = Σ_{ℓ∈L} w_{(ρ,m,ℓ)}

Constructing this Laplacian takes O(|L|·n^2). Interestingly, the necessary determinant computation is invariant to |L|, giving an overall runtime of O(n^3 + |L|·n^2) to compute Z.
Summary. We give four settings (others may exist) in which the MTT efficiently computes Z for different sets of trees. The choice is dependent upon the task of interest, and one must be careful to choose the correct Laplacian configuration.
The results we present in this paper are completely modular in the specific choice of Laplacian. For the remainder of this paper, we assume the unlabeled tree setting and will refer to the set of trees as simply D and our choice of Laplacian as L.

Expectations
In this section, we characterize the family of expectations that our framework supports. Our framework is an extension of Li and Eisner (2009) to distributions over spanning trees. Their framework considers expectations over B-hypergraphs (Gallo et al., 1993); our distributions over trees cannot be cast as a polynomial-size hypergraph. Another important distinction between our framework and that of Li and Eisner (2009) is that we do not use the semiring abstraction.[6] We consider a random variable D whose possible values are in D, with probability distribution p(D = d) = p(d) as defined earlier.
An expectation of a function f: D → F is defined as

    E_p[f] = Σ_{d∈D} p(d) f(d)

Note that F is not necessarily one-dimensional; it could be that F = R^n or F = R^{n×m} for positive integers n and m. For simplicity, when F is multi-dimensional, we will use several variables to represent the dimensions, e.g., F1 × F2. Without any assumptions on f, the expectation is clearly intractable to compute.[7] We will characterize a class of functions f whose expectations can be efficiently computed.
The first family consists of functions that are additively decomposable along the edges of a tree. Formally, a function r: D → R^R is additively decomposable if it can be written as

    r(d) = Σ_{e∈d} r_e

where, abusing notation slightly, we write r_e ∈ R^R for the value that r assigns to the edge e. An example of an additively decomposable function is r(d) = −log p(d), whose expectation gives the Shannon entropy.

[6] Semirings are too algebraically weak to develop efficient determinant algorithms. Jerrum and Snir (1982) proved that the partition function for spanning trees requires an exponential number of additions and multiplications in the semiring model of computation. It turns out that division is not required, but algorithms for division-free determinant computation typically run in O(n^4) (Kaltofen, 1992). An excellent overview of the power of subtraction in the context of dynamic programming is given in Miklós (2019, Ch. 3).
[7] One could use Monte Carlo methods to approximate the expectation of a general function.
Other first-order expectations include the expected attachment score and the Kullback-Leibler divergence. We demonstrate how to compute these in our framework in §6.1 and §6.3, respectively.
A function t: D → R^T is second-order additively decomposable if it can be written as the outer product of two additively decomposable functions r: D → R^R and s: D → R^S:

    t(d) = r(d) s(d)^⊤

Thus, T = R × S and t(d) is generally a matrix.
An example of such a function is the gradient of entropy or of the GE objective with respect to w; we bold a function f whose domain is the edges to mean the vector containing all f_e. Another example of a second-order additively decomposable function is the covariance matrix. Given two feature functions r: D → R^R and s: D → R^S, the covariance matrix gives the relationship between each pair of features r_i(d) and s_j(d) for d ∈ D; its key ingredient is the expectation of r(d) s(d)^⊤, a second-order additively decomposable function.
We focus on products of additive functions because modeling interactions that do not factor along the edges of dependency trees is known to be NP-hard (McDonald and Pereira, 2006, App. A).
One family of functions that can be computed efficiently, but which we will not explore here, consists of those that are multiplicatively decomposable over the edges. A function q: D → R^Q is multiplicatively decomposable if it can be written as

    q(d) = ∏_{e∈d} q_e

These functions form a family that we call zeroth-order expectations; they can be computed with a constant number of calls to MTT (usually two or three). Examples include the Rényi entropy and p-norms.

Connecting gradients and expectations
In this section, we build upon a fundamental connection between gradients and expectations (Darwiche, 2003; Li and Eisner, 2009). This connection allows us to build on work in automatic differentiation to obtain efficient gradient algorithms. While the propositions in this section are inspired by past work, we believe they have not previously been presented clearly. We find it convenient to work with unnormalized expectations, or totals for short. We denote the total of a function f as

    f̄ = Σ_{d∈D} w(d) f(d)

We recover the expectation with E_p[f] = f̄ / Z. We note that totals (on their own) may be of interest in some applications (Vieira and Eisner, 2017, Section 5.3).
The first-order case. The partial derivative ∂Z/∂w_e determines the total weight of the trees that include the edge e.

Proposition 3 (First-order total weight).

    w̄_e := w_e · ∂Z/∂w_e = Σ_{d∈D: e∈d} w(d)

Proof. Since Z = Σ_{d∈D} ∏_{e'∈d} w_{e'} and each tree contains e at most once, ∂Z/∂w_e = Σ_{d: e∈d} w(d)/w_e; multiplying by w_e gives the result. ∎

Furthermore, w̄_e / Z = p(e ∈ d). Proposition 4 establishes a connection between the unnormalized expectation r̄ and ∇Z.
Proposition 4. For any additively decomposable function r: D → R^R, the total r̄ can be computed using a gradient-vector product,

    r̄ = (∇Z)^⊤ (w ⊙ r)

where ⊙ denotes element-wise multiplication of a vector and a matrix: (w ⊙ r) is the |E| × R matrix whose row for edge e is w_e r_e.

Proof.

    (∇Z)^⊤ (w ⊙ r) = Σ_{e∈E} (∂Z/∂w_e) w_e r_e = Σ_{e∈E} Σ_{d: e∈d} w(d) r_e = Σ_{d∈D} w(d) Σ_{e∈d} r_e = r̄  ∎

The second-order case. We can similarly use second derivatives of Z to obtain the total weight of the trees containing a given pair of edges.

Proposition 5 (Second-order total weight). For any pair of distinct edges e ≠ e',

    w̄_{e,e'} := w_e w_{e'} · ∂²Z/∂w_e ∂w_{e'} = Σ_{d∈D: e∈d, e'∈d} w(d)
Proof. As in Proposition 3, each tree containing both e and e' contributes w(d)/(w_e w_{e'}) to ∂²Z/∂w_e ∂w_{e'}, while trees missing either edge contribute nothing. ∎

Furthermore, w̄_{e,e'} / Z = p(e ∈ d, e' ∈ d). Proposition 6 establishes a connection between the total t̄ and ∇²Z, and additionally a connection between t̄ and ∇r̄. Note that (21) requires an additional term because w̄_{e,e'} does not account for the case e = e'.

Proposition 6. For any function t: D → R^{R×S} expressed as the outer product of two additively decomposable functions r: D → R^R and s: D → R^S, i.e., t(d) = r(d) s(d)^⊤, the total t̄ can be computed using a Jacobian-matrix product,

    t̄ = (∇r̄) (w ⊙ s)    (20)

or, expanded in terms of the second derivatives of Z,

    t̄ = Σ_{e≠e'} w̄_{e,e'} r_e s_{e'}^⊤ + Σ_{e∈E} w̄_e r_e s_e^⊤    (21)

where ∇r̄ ∈ R^{R×|E|} is the Jacobian of r̄ with respect to w. One can verify that (20) and (21) agree by differentiating the expression for r̄ in Proposition 4.

Remark. The total ∇r̄ is itself a second-order quantity: (20) shows that t̄ = ∇r̄ for a judicious choice of s, namely one for which w ⊙ s is the |E| × |E| identity matrix (i.e., s_e is the e-th standard basis vector scaled by 1/w_e). These propositions generalize to higher-order derivatives as well; we do not explore those here, as the runtime complexity becomes unwieldy.
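Proposition 5 and Eq. (21) can be verified numerically on a two-node graph, where only three multi-root trees exist. In the sketch below (Python/NumPy; the toy weights and all names are ours), finite differences of Z are exact up to rounding because Z is multilinear in the edge weights:

```python
import numpy as np

rng = np.random.default_rng(3)
# Two non-root nodes; edges: root->1, root->2, 1->2, 2->1.
edges = [(0, 1), (0, 2), (1, 2), (2, 1)]
w0 = {e: rng.uniform(0.5, 2.0) for e in edges}
r = {e: rng.normal() for e in edges}   # scalar r_e and s_e (R = S = 1)
s = {e: rng.normal() for e in edges}

def Z_at(w):
    # Multi-root Laplacian for n = 2; columns index the modifier node.
    L = np.array([[w[(0, 1)] + w[(2, 1)], -w[(1, 2)]],
                  [-w[(2, 1)], w[(0, 2)] + w[(1, 2)]]])
    return np.linalg.det(L)

def d1(e, eps=1e-3):                   # dZ/dw_e
    wp, wm = dict(w0), dict(w0)
    wp[e] += eps; wm[e] -= eps
    return (Z_at(wp) - Z_at(wm)) / (2 * eps)

def d2(e, ep, eps=1e-3):               # d^2 Z / (dw_e dw_e') for e != e'
    out = 0.0
    for se in (eps, -eps):
        for sp in (eps, -eps):
            w = dict(w0); w[e] += se; w[ep] += sp
            out += np.sign(se * sp) * Z_at(w)
    return out / (4 * eps * eps)

# Eq. (21): tbar = sum_{e != e'} wbar_{e,e'} r_e s_{e'} + sum_e wbar_e r_e s_e
tbar = sum(w0[e] * w0[ep] * d2(e, ep) * r[e] * s[ep]
           for e in edges for ep in edges if e != ep)
tbar += sum(w0[e] * d1(e) * r[e] * s[e] for e in edges)

# Brute force over the three multi-root trees on two nodes.
tree_sets = [[(0, 1), (0, 2)], [(0, 1), (1, 2)], [(0, 2), (2, 1)]]
tbar_brute = sum(np.prod([w0[e] for e in d])
                 * sum(r[e] for e in d) * sum(s[e] for e in d)
                 for d in tree_sets)
assert abs(tbar - tbar_brute) < 1e-6 * (1 + abs(tbar_brute))
```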

Algorithms
Having reduced the computation of r̄ and t̄ to derivatives of Z in §4, we now describe efficient algorithms that exploit this connection. The main algorithmic ideas used in this section are based on automatic differentiation (AD) techniques (Griewank and Walther, 2008). These are general-purpose techniques for efficiently evaluating gradients of functions given algorithms that evaluate the functions. In our setting, the algorithm in question is an efficient procedure for evaluating Z, such as the procedure described in §2.1. While we provide the necessary derivatives in §5.1, they can also be evaluated using any AD library, such as JAX. The identity of Proposition 4, and (20) and (21) from Proposition 6, are realized by Alg 1, Alg 2, and Alg 3, respectively. We provide the runtime complexity of each step of the algorithms; these are discussed in more detail in §5.2.
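As a sanity check of the reduction in Proposition 4, the following sketch (Python/NumPy; the toy graph, seed, and helper names are ours) computes a first-order total r̄ from numerical derivatives of Z (central differences are exact up to rounding here, since Z is multilinear in the edge weights) and compares it with brute-force enumeration:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n = 3
A = rng.uniform(0.5, 2.0, (n, n))
np.fill_diagonal(A, 0.0)
rho = rng.uniform(0.5, 2.0, n)
edges = [(0, m) for m in range(1, n + 1)] + \
        [(h, m) for h in range(1, n + 1) for m in range(1, n + 1) if h != m]
r = {e: rng.normal() for e in edges}     # scalar edge values r_e (R = 1)

def w_of(e):
    h, m = e
    return rho[m - 1] if h == 0 else A[h - 1, m - 1]

def Z_perturbed(e, delta):
    """Z with the weight of edge e shifted by delta."""
    rho_, A_ = rho.copy(), A.copy()
    h, m = e
    if h == 0:
        rho_[m - 1] += delta
    else:
        A_[h - 1, m - 1] += delta
    L = -A_
    np.fill_diagonal(L, rho_ + A_.sum(axis=0))
    return np.linalg.det(L)

# Proposition 4 with R = 1: rbar = sum_e (dZ/dw_e) * w_e * r_e.
eps = 1e-4
rbar = sum((Z_perturbed(e, eps) - Z_perturbed(e, -eps)) / (2 * eps)
           * w_of(e) * r[e] for e in edges)

# Brute force: rbar = sum_d w(d) * sum_{e in d} r_e.
def reaches_root(par, m):
    seen = set()
    while m != 0:
        if m in seen:
            return False
        seen.add(m)
        m = par[m - 1]
    return True

rbar_brute = 0.0
for par in itertools.product(range(n + 1), repeat=n):
    if all(reaches_root(par, m) for m in range(1, n + 1)):
        d = [(par[m - 1], m) for m in range(1, n + 1)]
        rbar_brute += np.prod([w_of(e) for e in d]) * sum(r[e] for e in d)

assert abs(rbar - rbar_brute) < 1e-5 * (1 + abs(rbar_brute))
```

In practice one would obtain ∂Z/∂w_e from an AD library or the closed form of §5.1 rather than finite differences; the check above only validates the identity.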

Derivatives of Z
All three algorithms rely on first- or second-order derivatives of Z. Since Z = |L|, we can express its gradient via Jacobi's formula and an application of the chain rule,[12]

    ∇Z = (∂Z/∂L)^⊤ ∇L,    where ∂Z/∂L = Z · L^{−⊤}

where L^{−⊤} is the transpose of L^{−1} and ∇L is the derivative of L with respect to the edge weights, which depends on the specific Laplacian encoding (§2.2). The second derivative of Z can be evaluated as

    ∇²Z = (∇L)^⊤ (∂²Z/∂L ∂L) (∇L),    where ∂²Z/∂L_{ij} ∂L_{kl} = Z (L^{−1}_{ji} L^{−1}_{lk} − L^{−1}_{jk} L^{−1}_{li})    (23)

Note that (23) contains no term with ∇²L: since L is a linear function of the edge weights, its second derivative is zero, and so ∇²L disappears. Furthermore, we consider ∇²Z and ∂²Z/∂L∂L to be n² × n² matrices (they can alternatively be thought of as tensors).
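Both identities admit a quick numerical check (Python/NumPy; the random test matrix is our own stand-in for a Laplacian):

```python
import numpy as np

rng = np.random.default_rng(4)
Lmat = rng.normal(size=(3, 3)) + 3.0 * np.eye(3)   # a generic invertible matrix
Z = np.linalg.det(Lmat)
Linv = np.linalg.inv(Lmat)

# Jacobi's formula: dZ/dL[i, j] = Z * Linv[j, i]. The central difference is
# exact up to rounding because det is linear in each individual entry.
eps = 1e-5
for i in range(3):
    for j in range(3):
        E = np.zeros((3, 3)); E[i, j] = eps
        num = (np.linalg.det(Lmat + E) - np.linalg.det(Lmat - E)) / (2 * eps)
        assert abs(num - Z * Linv[j, i]) < 1e-6

# Second derivative (one representative entry pair, distinct rows/columns):
#   d^2 Z / (dL[i,j] dL[k,l]) = Z * (Linv[j,i]*Linv[l,k] - Linv[j,k]*Linv[l,i])
i, j, k, l = 0, 1, 2, 0
eps2 = 1e-3
Ei = np.zeros((3, 3)); Ei[i, j] = eps2
Ek = np.zeros((3, 3)); Ek[k, l] = eps2
num2 = (np.linalg.det(Lmat + Ei + Ek) - np.linalg.det(Lmat + Ei - Ek)
        - np.linalg.det(Lmat - Ei + Ek) + np.linalg.det(Lmat - Ei - Ek)) / (4 * eps2 ** 2)
ana2 = Z * (Linv[j, i] * Linv[l, k] - Linv[j, k] * Linv[l, i])
assert abs(num2 - ana2) < 1e-5
```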

Complexity Analysis
The efficiency of our approach is rooted in the following result from automatic differentiation. Given a function f, we denote the number of elementary operations (e.g., +, *, /, −, cos, pow) needed to evaluate f by Cost{f}. Note that all elementary operations are continuously differentiable.

Theorem 2 (Cheap Jacobian-vector Products).
For any function f: R^d → R^m and any vector v ∈ R^m, we can evaluate (∇f(x))^⊤ v via reverse-mode AD with cost satisfying

    Cost{(∇f(x))^⊤ v} ≤ c · Cost{f}

for a small constant c (Griewank and Walther, 2008, Page 44).

[12] The derivative of the determinant can also be given using the matrix adjugate, ∂Z/∂L = adj(L)^⊤. There are benefits to using the adjugate, as it is more numerically stable and equally efficient (Stewart, 1998). In fact, any algorithm that computes the determinant can be algorithmically differentiated to obtain an algorithm for the adjugate.
As a special (and common) case, Theorem 2 implies a cheap gradient principle: the gradient of a function with a single output (m = 1) can be evaluated with the same asymptotic cost as the function itself.
The cheap gradient principle tells us that ∇Z can be evaluated as quickly as Z itself, and that numerically accurate procedures for Z give rise to similarly accurate procedures for ∇Z. Additionally, many widely used software libraries can do this work for us, such as JAX (Bradbury et al., 2018), PyTorch (Paszke et al., 2019), and TensorFlow (Abadi et al., 2015). The runtime of evaluating Z is dominated by evaluating the determinant of the Laplacian matrix. Therefore, Line 1 and Line 2 of Alg 1 have the same complexity: O(n^3). Line 3 is a sum over n² scalar-vector multiplications of size R. If each r_e is R′-sparse (i.e., contains no more than R′ non-zeros), we can compute this sum in O(R′n²), leading to an overall runtime of O(n^3 + R′n²).
Second-order quantities (t̄) appear to require ∇²Z and so do not directly fit the conditions of the cheap gradient principle: the Hessian ∇²Z is the Jacobian of the gradient. One approach to work around this is to make several calls to Theorem 2 with different vectors v. Indeed, this is the approach of Alg 2. In this case, the function in question is the total r̄, which has output dimensionality R. Computing (20) can thus be done with R Jacobian-vector and Hessian-vector products, giving an overall O(R(n^3 + R′n²)) algorithm. We can support somewhat fast accumulation of S′-sparse s in the summation of Alg 2. Unfortunately, ∂r̄/∂w_e will generally be dense, so the cost of the outer product on Line 3 is O(RS′). Thus, Alg 2 has an overall runtime of O(R(n^3 + R′n²) + RS′n²).[16] The downside of Alg 2 is that no work is shared between the R evaluations.[17] For our computation of Z, it turns out that substantial work can be shared among evaluations: ∇²Z relies only on the inverse of the Laplacian matrix, as seen in Alg 3. This is essentially the observation made by Druck and Smith (2009). Exploiting it allows us to compute ∇²Z in O(n^4) time. One may notice that the matrix-matrix products in (23) suggest an inefficiency but, luckily, ∇L is very sparse, with at most 2·|L| nonzero entries per column in the labeled case and at most 2 in the unlabeled case. Thus, the necessary Hessian-matrix products can be computed with O(n^4) constant-time sparse dot products, giving an overall O(n^4) algorithm. The accumulation component of Alg 3 can be done in O(R′S′n^4).

[16] If S < R, we can instead compute t̄ in O(S(n^3 + S′n²) + R′Sn²).
[17] We believe it is possible to create an algorithm that takes the best of Alg 2 and Alg 3 to compute t̄ in O(n^3), by caching the L and L^{−1} used in Alg 3 to calculate the Hessian-vector products in Alg 2. However, we have not yet conducted a detailed analysis of this.
An important special case. We know that the gradient of a first-order quantity r̄ is a second-order quantity. However, when r̄ is used in the context of another function, it is more efficient to apply reverse-mode automatic differentiation to r̄ and Z jointly. This is simply a byproduct of the cheap gradient principle, which allows us to keep the efficiency of Alg 1 regardless of the dimensionality R. The criterion for this speed-up is that the final quantity computed is a scalar: applying the idea only requires using r̄ and Z in a larger computation h(Z, r̄) for some function h: R × R^R → R. We show concrete examples in §6.
Space Analysis. Each call to MTT uses O(n²) space to store the Laplacian matrix. The gradient of Z similarly takes O(n²) space to store. Since storing r̄ takes O(R) space, Alg 1 has a space complexity of O(n² + R). Alg 2 requires O(Rn² + RS) space, because O(Rn²) is needed to compute and store the Jacobian of r̄ and t̄ has size O(RS). While not evident in our pseudocode, a benefit of Alg 3 is that we do not need to materialize the Hessian of Z, as it makes use only of the inverse of the Laplacian matrix. Therefore, we need only O(n²) space for the Laplacian inverse and O(RS) space for t̄; the space complexity of Alg 3 is thus O(n² + RS).

Applications and Prior Work
In this section, we apply our framework to compute a number of desirable quantities. We relate our approach to existing algorithms for computing each quantity in the literature (where applicable) and mention existing and potential applications. Many of our applications are extensions of Li and Eisner (2009) to distributions over trees.

Table 1: Average runtime of computing the entropy of dependency parser output in five languages. We use the Stanford Dependency Parser (Qi et al., 2018). The past approach is that of Smith and Eisner (2007).
In most applications that involve training a probabilistic model, the edge weights will be parameterized in some fashion. Traditional approaches (e.g., Koo et al. (2007); McDonald et al. (2005); Druck (2011)) use log-linear parameterizations, whereas more recent works (e.g., Dozat and Manning (2017); Liu and Lapata (2018); Ma and Xia (2014)) use neural network parameterizations. We are agnostic as to how the edges are parameterized. For any of the training criteria we consider in this section, inference is done one example at a time, so the set of edges is local to an individual example (e.g., a sentence). Parameters are trained to minimize the average of the criterion over a training set.

Risk
Risk minimization is a common technique for training structured prediction models (Li and Eisner, 2009; Smith and Eisner, 2007; Stoyanov and Eisner, 2012). Risk is the expected cost over the trees in a graph, where the cost function r: D → R measures the number of mistakes in comparison to a target tree d*.[20] In the context of dependency parsing, r(d) can be the labeled or unlabeled attachment score (LAS and UAS, respectively). This function is additively decomposable, taking

    r_e = (1/n) · 1{e ∈ d*}

where n is the length of the sentence. Note that we use 1/n so that r(d) is a score between 0 and 1. We then obtain the expected attachment score as r̄ / Z, where r̄ is calculated using Alg 1, and we can get its gradient with reverse-mode AD.
[20] The dependence on the reference tree d* is left as an implicit argument to r(·) since it is constant.
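A sketch of the expected attachment score (Python/NumPy; the toy graph, the reference tree, and all names are ours). It uses closed-form multi-root edge marginals p(e ∈ d), which follow from Proposition 3 together with Jacobi's formula, and checks the result against brute force:

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)
n = 3
A = rng.uniform(0.5, 2.0, (n, n)); np.fill_diagonal(A, 0.0)
rho = rng.uniform(0.5, 2.0, n)

L = -A.copy(); np.fill_diagonal(L, rho + A.sum(axis=0))
Z = np.linalg.det(L)
B = np.linalg.inv(L)

# Edge marginals (multi-root case):
#   p(root->m in d) = rho[m] * B[m, m]
#   p(h->m in d)    = A[h, m] * (B[m, m] - B[m, h])
def marginal(h, m):
    if h == 0:
        return rho[m - 1] * B[m - 1, m - 1]
    return A[h - 1, m - 1] * (B[m - 1, m - 1] - B[m - 1, h - 1])

d_star = [(0, 1), (1, 2), (2, 3)]                  # a reference tree (a chain)
uas = sum(marginal(h, m) for h, m in d_star) / n   # expected attachment score

# Brute-force check of E[(1/n) * |d intersect d*|].
def reaches_root(par, m):
    seen = set()
    while m != 0:
        if m in seen:
            return False
        seen.add(m)
        m = par[m - 1]
    return True

total, acc = 0.0, 0.0
for par in itertools.product(range(n + 1), repeat=n):
    if all(reaches_root(par, m) for m in range(1, n + 1)):
        d = {(par[m - 1], m) for m in range(1, n + 1)}
        w = np.prod([rho[m - 1] if h == 0 else A[h - 1, m - 1] for h, m in d])
        total += w
        acc += w * len(d & set(d_star)) / n
assert abs(uas - acc / total) < 1e-8
```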

Shannon Entropy
Entropy is a useful measure of uncertainty, which has been used a number of times in dependency parsing (Druck and Smith, 2009; Ma and Xia, 2014; Smith and Eisner, 2007). Prior work (Smith and Eisner, 2007) used this idea to minimize entropy in a bootstrapping dependency parsing setup. They computed the Shannon entropy in O(n^4) by calling MTT n times, where each call multiplies the set of incoming edges of one node by a log factor. Unlike the case of single-rooted trees, a modification to MTT as suggested by Koo et al. (2007) cannot be used to improve this runtime, because the log factor must be integrated for each node individually. We noted in §3 that the function r(d) = −log p(d) is additively decomposable and can be used to compute entropy. We can thus compute E_p[r], and hence the entropy, in O(n^3); we can even compute its gradient in the same time complexity.

Experiment. We briefly demonstrate the practical speed-up over Smith and Eisner (2007)'s O(n^4) algorithm. We compare the average runtime per sentence on five different UD corpora. The languages have different average sentence lengths, which demonstrates the extra speed-up gained when calculating the entropy of longer sentences (for which D is a larger set). Tab. 1 shows that even for a corpus of short sentences (Finnish), we achieve a 3x speed-up. This increases to 12x as we move to corpora with longer sentences (Arabic).
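A sketch of the O(n^3) entropy computation via edge marginals (Python/NumPy; the toy graph and names are ours; the marginal formulas assume the multi-root Laplacian). Here the edge values are r_e = −log w_e and the log Z term is handled separately, giving H = log Z − Σ_e p(e ∈ d) log w_e:

```python
import itertools
import numpy as np

rng = np.random.default_rng(6)
n = 3
A = rng.uniform(0.5, 2.0, (n, n)); np.fill_diagonal(A, 0.0)
rho = rng.uniform(0.5, 2.0, n)

L = -A.copy(); np.fill_diagonal(L, rho + A.sum(axis=0))
Z = np.linalg.det(L)
B = np.linalg.inv(L)

def marginal(h, m):          # p(edge in d), multi-root case
    if h == 0:
        return rho[m - 1] * B[m - 1, m - 1]
    return A[h - 1, m - 1] * (B[m - 1, m - 1] - B[m - 1, h - 1])

def logw(h, m):
    return np.log(rho[m - 1]) if h == 0 else np.log(A[h - 1, m - 1])

edges = [(0, m) for m in range(1, n + 1)] + \
        [(h, m) for h in range(1, n + 1) for m in range(1, n + 1) if h != m]
# H = E[-log p(d)] = log Z - sum_e p(e in d) * log w_e
H = np.log(Z) - sum(marginal(h, m) * logw(h, m) for h, m in edges)

# Brute-force entropy for comparison.
def reaches_root(par, m):
    seen = set()
    while m != 0:
        if m in seen:
            return False
        seen.add(m)
        m = par[m - 1]
    return True

H_brute = 0.0
for par in itertools.product(range(n + 1), repeat=n):
    if all(reaches_root(par, m) for m in range(1, n + 1)):
        p = np.prod([rho[m - 1] if par[m - 1] == 0 else A[par[m - 1] - 1, m - 1]
                     for m in range(1, n + 1)]) / Z
        H_brute -= p * np.log(p)
assert abs(H - H_brute) < 1e-8
```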

Kullback-Leibler Divergence
To the best of our knowledge, no algorithm for computing the Kullback-Leibler (KL) divergence between two graph-based parsers (nor its gradient) has been given in the literature. We show how this can be achieved easily within our framework. The KL divergence is given by

    KL(p || q) = Σ_{d∈D} p(d) log (p(d) / q(d))

This takes a form similar to the Shannon entropy. We can set our expectation function to

    r_e = log (w_{p,e} / w_{q,e})

where w_p and w_q are the unnormalized edge weights under the distributions p and q, respectively. Note that this function is additively decomposable, as both log w_p(d) and log w_q(d) are additively decomposable. Then the KL divergence is

    KL(p || q) = r̄/Z_p − log Z_p + log Z_q

where Z_p and Z_q are the normalization constants of p and q, respectively. Putting the pieces together, we can calculate the KL divergence using two calls to MTT and one call to Alg 1 in O(n^3). As with risk and Shannon entropy, we can obtain the gradient of the KL divergence using reverse-mode AD in O(n^3).
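A sketch of this KL computation (Python/NumPy; the two toy weight settings for p and q and all names are ours; the edge marginals again assume the multi-root Laplacian):

```python
import itertools
import numpy as np

rng = np.random.default_rng(7)
n = 3
Ap = rng.uniform(0.5, 2.0, (n, n)); np.fill_diagonal(Ap, 0.0)
rp = rng.uniform(0.5, 2.0, n)
Aq = rng.uniform(0.5, 2.0, (n, n)); np.fill_diagonal(Aq, 0.0)
rq = rng.uniform(0.5, 2.0, n)

def laplacian(A, rho):
    L = -A.copy(); np.fill_diagonal(L, rho + A.sum(axis=0))
    return L

Zp = np.linalg.det(laplacian(Ap, rp))
Zq = np.linalg.det(laplacian(Aq, rq))
Bp = np.linalg.inv(laplacian(Ap, rp))

def marg_p(h, m):            # p(edge in d) under p, multi-root case
    if h == 0:
        return rp[m - 1] * Bp[m - 1, m - 1]
    return Ap[h - 1, m - 1] * (Bp[m - 1, m - 1] - Bp[m - 1, h - 1])

def log_ratio(h, m):         # r_e = log(w_{p,e} / w_{q,e})
    if h == 0:
        return np.log(rp[m - 1] / rq[m - 1])
    return np.log(Ap[h - 1, m - 1] / Aq[h - 1, m - 1])

edges = [(0, m) for m in range(1, n + 1)] + \
        [(h, m) for h in range(1, n + 1) for m in range(1, n + 1) if h != m]
# KL(p || q) = rbar/Z_p - log Z_p + log Z_q, with rbar/Z_p via edge marginals.
kl = sum(marg_p(h, m) * log_ratio(h, m) for h, m in edges) \
     - np.log(Zp) + np.log(Zq)

# Brute-force check.
def reaches_root(par, m):
    seen = set()
    while m != 0:
        if m in seen:
            return False
        seen.add(m)
        m = par[m - 1]
    return True

kl_brute = 0.0
for par in itertools.product(range(n + 1), repeat=n):
    if all(reaches_root(par, m) for m in range(1, n + 1)):
        wp = np.prod([rp[m - 1] if par[m - 1] == 0 else Ap[par[m - 1] - 1, m - 1]
                      for m in range(1, n + 1)])
        wq = np.prod([rq[m - 1] if par[m - 1] == 0 else Aq[par[m - 1] - 1, m - 1]
                      for m in range(1, n + 1)])
        kl_brute += (wp / Zp) * np.log((wp / Zp) / (wq / Zq))
assert abs(kl - kl_brute) < 1e-8
```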

Gradient of the GE Objective
Semi-supervised or lightly supervised learning is an important aspect of dependency parsing (Druck et al., 2009; Kate and Mooney, 2007; Wang et al., 2008). This is because, for many languages, we do not have large datasets annotated with dependency trees. One way to lightly supervise a model is to use the annotated data to create a distribution over trees and then use a large unlabeled dataset to train a model that will get close to this "ideal" distribution (Druck et al., 2009). We can do so by minimizing a measure of distance such as an f-divergence, many of which can be computed in O(n^3) in our framework as zeroth- or first-order expectations. Using a GE criterion (McCallum et al., 2007) is another way to fit a model to a target distribution. The target distribution may be obtained through light supervision (Druck, 2011). The basic idea behind GE is that we have a feature function f(d) ∈ R^F and target feature values t obtained from the light supervision; we then compute the expected feature values E_p[f(d)] and penalize their distance to t. Similar techniques have been used to calculate attention, where instead of a penalty function, marginal probabilities are combined to obtain a scalar loss objective (Liu and Lapata, 2018). As we discussed in our cautionary tales (§1), the gradient of the GE objective has led to confusion in the literature (Druck et al., 2009; Druck and Smith, 2009; Druck, 2011). In particular, Druck and Smith (2009) computed the gradient of the GE objective by materializing the covariance matrix in O(n^4).
Using our framework, given an additively decomposable feature function r: D → R^R, we can compute the feature total r̄ using Alg 1 and then calculate the expected constraints E_p[r] = r̄/Z in O(n^3 + R′n²) time. Since the GE objective is a scalar, we can compute its gradient in O(n^3 + R′n²) using reverse-mode AD. Druck (2011) acknowledges that this can be done, but questions its practicality and numerical accuracy.
Experiment. We compute the GE objective and its gradient for over 1,000 sentences of the English UD Treebank (Nivre et al., 2018) using 20 features extracted with the methodology of Druck et al. (2009). Our framework obtains a speed-up of 26x over materializing the covariance matrix (i.e., Alg 3). Moreover, the gradients agree to an absolute tolerance of 10^-16 (i.e., the gradients are equal to a precision of 53 bits).

Conclusion
We presented a general framework for computing first- and second-order expectations of additively decomposable functions over spanning trees. We did this by exploiting a key connection between gradients and expectations, which allows us to solve our problems using automatic differentiation. The algorithms we provide are simple, efficient, and extensible to many expectations. The automatic differentiation principle has been applied in other settings, such as weighted context-free grammars (Eisner, 2016) and chain-structured models (Vieira et al., 2016). We hope that this paper will also serve as a tutorial on how to compute expectations over trees so that the list of cautionary tales does not grow further. In particular, we hope that this will enable the KL divergence to be used in semi-supervised training of dependency parsers. Our aim is for our approach to computing expectations to be extended to other structured prediction models.