Unsupervised Discourse Constituency Parsing Using Viterbi EM

In this paper, we introduce an unsupervised discourse constituency parsing algorithm. We use Viterbi EM with a margin-based criterion to train a span-based discourse parser in an unsupervised manner. We also propose initialization methods for Viterbi training of discourse constituents based on our prior knowledge of text structures. Experimental results demonstrate that our unsupervised parser achieves comparable or even superior performance to fully supervised parsers. We also investigate discourse constituents that are learned by our method.


Introduction
Natural language text is generally coherent (Halliday and Hasan, 1976) and can be analyzed as discourse structures, which formally describe how text is coherently organized. In discourse structure, linguistic units (e.g., clauses, sentences, or larger textual spans) are connected together semantically and pragmatically, and no unit is independent or isolated. Discourse parsing aims to uncover discourse structures automatically for given text and has proven useful in various NLP applications, such as document summarization (Marcu, 2000; Louis et al., 2010; Yoshida et al., 2014), sentiment analysis (Polanyi and Van den Berg, 2011; Bhatia et al., 2015), and automated essay scoring (Miltsakaki and Kukich, 2004).
Despite the promising progress achieved in recent decades (Carlson et al., 2001; Hernault et al., 2010; Ji and Eisenstein, 2014; Feng and Hirst, 2014; Li et al., 2014; Joty et al., 2015; Morey et al., 2017), discourse parsing remains a significant challenge. The difficulty is due in part to the shortage and low reliability of hand-annotated discourse structures. To develop a parser that generalizes better, existing algorithms require larger amounts of training data. However, manually annotating discourse structures is expensive, time-consuming, and sometimes highly ambiguous (Marcu et al., 1999).
One possible solution to these problems is grammar induction (or unsupervised syntactic parsing) algorithms for discourse parsing. However, existing studies on unsupervised parsing mainly focus on sentence structures, such as phrase structures (Lari and Young, 1990;Klein and Manning, 2002;Golland et al., 2012;Jin et al., 2018) or dependency structures (Klein and Manning, 2004;Berg-Kirkpatrick et al., 2010;Naseem et al., 2010;Jiang et al., 2016), though text-level structural regularities can also exist beyond the scope of a single sentence. For instance, in order to convey information to readers as intended, a writer should arrange utterances in a coherent order.
We tackle these problems by introducing unsupervised discourse parsing, which induces discourse structures for given text without relying on human-annotated discourse structures. Based on Rhetorical Structure Theory (RST) (Mann and Thompson, 1988), which is one of the most widely accepted theories of discourse structure, we assume that coherent text can be represented as tree structures, such as the one in Figure 1. The leaf nodes correspond to non-overlapping clause-level text spans called elementary discourse units (EDUs). Consecutive text spans are combined with each other recursively in a bottom-up manner to form larger text spans (represented by internal nodes) up to a global document span. These text spans are called discourse constituents. The internal nodes are labeled with both nuclearity statuses (e.g., Nucleus-Satellite or NS) and rhetorical relations (e.g., ELABORATION).
Figure 1: An example of the RST-based discourse constituent structure we assume in this paper. Leaf nodes $x_i$ correspond to non-overlapping clause-level text segments, and internal nodes consist of three complementary elements: discourse constituents $x_{i:j}$, discourse nuclearities (e.g., NS), and discourse relations (e.g., ELABORATION).
In this paper, we especially focus on unsupervised induction of an unlabeled discourse constituent structure (i.e., a set of unlabeled discourse constituent spans) given a sequence of EDUs, which corresponds to the first tree-building step in conventional RST parsing. Such constituent structures provide hierarchical information of input text, which is useful in downstream tasks (Louis et al., 2010). For instance, a constituent structure [X [Y Z]] indicates that text span Y is preferentially combined with Z (rather than X) to form a constituent span, and then the text span [Y Z] is connected with X. In other words, this structure implies that [X Y] is a distituent span and requires Z to become a constituent span. Our challenge is to find such discourse-level constituentness from EDU sequences.
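The constituent/distituent distinction in the [X [Y Z]] example can be made concrete with a few lines of Python. This is only an illustrative sketch: the nested-list tree encoding and the helper name `constituent_spans` are assumptions made here, not part of the parser described in this paper.

```python
def constituent_spans(tree, start=0):
    """Return (end_index, spans) for a nested-list tree whose leaves are EDUs.

    Spans are inclusive (i, j) index pairs over the EDU sequence.
    """
    if not isinstance(tree, list):           # a leaf covers exactly one EDU
        return start, []
    spans, pos = [], start
    for child in tree:
        end, child_spans = constituent_spans(child, pos)
        spans.extend(child_spans)
        pos = end + 1                        # next child starts after this one
    spans.append((start, pos - 1))           # span covered by this internal node
    return pos - 1, spans


# [X [Y Z]] with X, Y, Z at EDU positions 0, 1, 2
_, spans = constituent_spans(["X", ["Y", "Z"]])
print(sorted(spans))       # constituent spans of the tree
print((0, 1) in spans)     # [X Y] is a distituent, so this prints False
```

The span (1, 2) for [Y Z] and the document span (0, 2) are constituents, whereas (0, 1) for [X Y] never appears in the tree.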
The core hypothesis of this paper is that discourse tree structures and syntactic tree structures share the same (or similar) constituent properties at a metalevel, and thus, learning algorithms developed for grammar induction are transferable to unsupervised discourse constituency parsing with proper modifications. Indeed, RST structures can be formulated in a similar way to phrase structures in the Penn Treebank, though there are a few differences: The leaf nodes are not words but EDUs (e.g., clauses), and the internal nodes do not contain phrase labels but hold nuclearity statuses and rhetorical relations.
The expectation-maximization (EM) algorithm (Klein and Manning, 2004) has been the dominant unsupervised learning algorithm for grammar induction. Based on our hypothesis and this fact, we train a span-based discourse parser in an unsupervised manner by using Viterbi EM (or ''hard'' EM) (Neal and Hinton, 1998; Spitkovsky et al., 2010; DeNero and Klein, 2008; Choi and Cardie, 2007; Goldwater and Johnson, 2005) with a margin-based criterion (Stern et al., 2017; Gaddy et al., 2018). Unlike the classic EM algorithm using inside-outside re-estimation (Baker, 1979), Viterbi EM allows us to avoid explicitly counting discourse constituent patterns, which are generally too sparse to estimate reliable scores of text spans.
The other technical contribution is to present effective initialization methods for Viterbi training of discourse constituents. We introduce initial-tree sampling methods based on our prior knowledge of document structures. We show that proper initialization is crucial in this task, as observed in grammar induction (Klein and Manning, 2004;Gimpel and Smith, 2012).
On the RST Discourse Treebank (RST-DT) (Carlson et al., 2001), we compared our parse trees with manually annotated ones. We observed that our method achieves a Micro F1 score of 68.6% (84.6%) in the (corrected) RST-PARSEVAL (Marcu, 2000; Morey et al., 2018), which is comparable with or even superior to fully supervised parsers. We also investigated the discourse constituents that can or cannot be learned well by our method.
The rest of this paper is organized as follows: Section 2 introduces the related work. Section 3 gives the details of our parsing model and training algorithm. Section 4 describes the experimental setting and Section 5 discusses the experimental results. Conclusions are given in Section 6.

Related Work
The earliest studies that use EM in unsupervised parsing are Lari and Young (1990) and Carroll and Charniak (1992), which attempted to induce probabilistic context-free grammars (PCFG) and probabilistic dependency grammars using the classic inside-outside algorithm (Baker, 1979). Klein and Manning (2001b, 2002) perform a weakened version of constituent tests (Radford, 1988) by the Constituent-Context Model (CCM), which, unlike a PCFG, describes whether a contiguous text span (such as DT JJ NN) is a constituent or a distituent. The CCM uses EM to learn constituenthood over part-of-speech (POS) tags and for the first time outperformed the strong right-branching baseline in unsupervised constituency parsing. Klein and Manning (2004) proposed the Dependency Model with Valence (DMV), which is a head automata model (Alshawi, 1996) for unsupervised dependency parsing over POS tags and also relies on EM. These two models have been extended in various works for further improvements (Berg-Kirkpatrick et al., 2010; Naseem et al., 2010; Golland et al., 2012; Jiang et al., 2016).
In general, these methods use the inside-outside (dynamic programming) re-estimation (Baker, 1979) in the E step. However, Spitkovsky et al. (2010) showed that Viterbi training (Brown et al., 1993), which uses only the best-scoring tree to count the grammatical patterns, is not only computationally more efficient but also empirically more accurate on longer sentences. These properties are thus suitable for ''document-level'' grammar induction, where the document length (i.e., the number of EDUs) tends to be long. In addition, as explained later in Section 3, we incorporate Viterbi EM with a margin-based criterion (Stern et al., 2017; Gaddy et al., 2018); this allows us to avoid explicitly counting each possible discourse constituent pattern symbolically, which is impractical because such patterns are generally too sparse and most appear only once.
Prior studies (Klein and Manning, 2004;Gimpel and Smith, 2012;Naseem et al., 2010) have shown that initialization or linguistic knowledge plays an important role in EM-based grammar induction. Gimpel and Smith (2012) demonstrated that properly initialized DMV achieves improvements in attachment accuracies by 20 ∼ 40 points (i.e., 21.3% → 64.3%), compared with the uniform initialization. Naseem et al. (2010) also found that controlling the learning process with the prior (universal) linguistic knowledge improves the parsing performance of DMV. These studies usually rely on insights on syntactic structures. In this paper, we explore discourse-level prior knowledge for effective initialization of the Viterbi training of discourse constituency parsers.
Our method also relies on recent work on RST parsing. In particular, one of the initialization methods in our EM training (in Section 3.3 (i)) is inspired by the intra-sentential and multi-sentential approach used in RST parsing (Feng and Hirst, 2014; Joty et al., 2013, 2015). We also follow prior studies (Sagae, 2009; Ji and Eisenstein, 2014) and utilize syntactic information, i.e., dependency heads, which contributes to further performance gains in our method.
The most similar work to that presented here is Kobayashi et al. (2019), who propose unsupervised RST parsing algorithms in parallel with our work. Their method builds an unlabeled discourse tree by using the CKY dynamic programming algorithm. The tree-merging (splitting) scores in CKY are defined as similarity (dissimilarity) between adjacent text spans. The similarity scores are calculated based on distributed representations using pre-trained embeddings. However, similarity between adjacent elements is not always a good indicator of constituentness. Consider the tag sequences ''VBD IN'' and ''IN NN''. The former is an example of a distituent sequence, whereas the latter is a constituent. ''VBD'', ''IN'', and ''NN'' may have similar distributed representations because these tags co-occur frequently in corpora. This implies that it is difficult to distinguish constituents from distituents if we use only similarity (dissimilarity) measures. In this paper, we aim to mitigate this issue by introducing parameterized models to learn discourse constituentness.

Methodology
In this section, we first describe the parsing model we develop. Next, we explain how to train the model in an unsupervised manner by using Viterbi EM. Finally, we present the initialization methods we use for further improvements.

Parsing Model
The parsing problem in this study is to find the unlabeled constituent structure with the highest score for an input text $x$, that is,
$$\hat{T} = \operatorname*{arg\,max}_{T \in \mathrm{valid}(x)} s(x, T),$$
where $s(x, T) \in \mathbb{R}$ denotes a real-valued score of a tree $T$, and $\mathrm{valid}(x)$ represents the set of all valid trees for $x$. We assume that $x$ has already been manually segmented into a sequence of EDUs:
$$x = x_0, \ldots, x_{n-1}.$$
Inspired by the success of recent span-based constituency parsers (Stern et al., 2017; Gaddy et al., 2018), we define the tree scores as the sum of constituent scores over internal nodes, that is,
$$s(x, T) = \sum_{(i, j) \in T} s(i, j). \tag{2}$$
Thus, our parsing model consists of a single scoring function $s(i, j)$ that computes a constituent score of a contiguous text span $x_{i:j} = x_i, \ldots, x_j$, or simply $(i, j)$. The higher the value of $s(i, j)$, the more likely it is that $x_{i:j}$ is a discourse constituent.
We show our parsing model in Figure 2. Our implementation of s(i, j) can be decomposed into three modules: EDU-level feature extraction, span-level feature extraction, and span scoring. We discuss each of these in turn. Later, we also explain the decoding algorithm that we use to find the globally best-scoring tree.

Feature Extraction and Scoring
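Inspired by existing RST parsers (Ji and Eisenstein, 2014; Li et al., 2014; Joty et al., 2015), we first encode the beginning and end words of an EDU:
$$v^{bw}_i = \mathrm{Embed}_w(b_w), \qquad v^{ew}_i = \mathrm{Embed}_w(e_w),$$
where $b_w$ and $e_w$ denote the beginning and end words of the $i$-th EDU, and $\mathrm{Embed}_w$ is a function that returns a parameterized embedding of the input word. We also encode the POS tags corresponding to $b_w$ and $e_w$, denoted $b_p$ and $e_p$, as follows:
$$v^{bp}_i = \mathrm{Embed}_p(b_p), \qquad v^{ep}_i = \mathrm{Embed}_p(e_p),$$
where $\mathrm{Embed}_p$ is an embedding function for POS tags.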
Prior work (Sagae, 2009; Ji and Eisenstein, 2014) has shown that syntactic cues can improve discourse parsing performance. We therefore extract syntactic features from each EDU. We apply a (syntactic) dependency parser to each sentence in the input text and then choose a head word for each EDU. A head word is a token whose parent in the dependency graph is ROOT or is not within the EDU. We also extract the POS tag and the dependency label corresponding to the head word. A dependency label is a relation between a head word and its parent.
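The head-word rule above can be sketched as follows. The encoding is an assumption made for illustration: `heads[t]` is the index of the parent of token `t` within the sentence, with -1 standing for ROOT, and when several tokens qualify the sketch simply takes the leftmost one (the text above does not specify a tie-breaking rule).

```python
def edu_head(edu_start, edu_end, heads):
    """Return the index of the head word of the EDU covering tokens
    edu_start..edu_end (inclusive).

    heads[t] is the parent index of token t, or -1 if the parent is ROOT.
    """
    for tok in range(edu_start, edu_end + 1):
        parent = heads[tok]
        # Head word: parent is ROOT, or parent lies outside the EDU.
        if parent == -1 or not (edu_start <= parent <= edu_end):
            return tok          # assumption: leftmost qualifying token wins
    return None                 # unreachable for a well-formed dependency tree


# "He said | that it works": tokens 0..4, two EDUs (0-1 and 2-4)
heads = [1, -1, 4, 4, 1]
print(edu_head(0, 1, heads))    # token 1 ("said"), attached to ROOT
print(edu_head(2, 4, heads))    # token 4 ("works"), parent outside the EDU
```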
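To sum up, we now have triplets of head information, $\{(h_w, h_p, h_r)_i\}_{i=0}^{n-1}$, each denoting the head word, the head POS, and the head relation of the $i$-th EDU, respectively. We embed these symbols using look-up tables:
$$v^{hw}_i = \mathrm{Embed}_w(h_w), \qquad v^{hp}_i = \mathrm{Embed}_p(h_p), \qquad v^{hr}_i = \mathrm{Embed}_r(h_r),$$
where $\mathrm{Embed}_r$ is an embedding function for dependency relations. Finally, we concatenate these embeddings with the beginning/end word embeddings ($v^{bw}_i$, $v^{ew}_i$) and the beginning/end POS embeddings ($v^{bp}_i$, $v^{ep}_i$):
$$v'_i = [v^{bw}_i; v^{ew}_i; v^{bp}_i; v^{ep}_i; v^{hw}_i; v^{hp}_i; v^{hr}_i],$$
and then transform the result using a linear projection and the Rectified Linear Unit (ReLU) activation function:
$$v_i = \mathrm{ReLU}(W v'_i + b).$$
In the following, we use $\{v_i\}_{i=0}^{n-1}$ as the feature vectors for the EDUs $\{x_i\}_{i=0}^{n-1}$.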
Figure 2: Our span-based discourse parsing model. We first encode each EDU based on the beginning and ending words and POS tags using embeddings. We also embed head information of each EDU. We then run a bidirectional LSTM and concatenate the span differences. The resulting vector is used to predict the constituent score of the text span (i, j). This figure illustrates the process for the span (1, 2).
Following the span-based parsing models developed in the syntax domain (Stern et al., 2017; Gaddy et al., 2018), we then run a bidirectional Long Short-Term Memory (LSTM) over the sequence of EDU representations, $\{v_i\}_{i=0}^{n-1}$, resulting in forward and backward representations for each step $i$ ($0 \le i \le n-1$):
$$\overrightarrow{h}_0, \ldots, \overrightarrow{h}_{n-1} = \overrightarrow{\mathrm{LSTM}}(v_0, \ldots, v_{n-1}),$$
$$\overleftarrow{h}_0, \ldots, \overleftarrow{h}_{n-1} = \overleftarrow{\mathrm{LSTM}}(v_0, \ldots, v_{n-1}).$$
We then compute a feature vector for a span $(i, j)$ by concatenating the forward and backward span differences:
$$h_{i,j} = [\overrightarrow{h}_j - \overrightarrow{h}_{i-1}; \overleftarrow{h}_i - \overleftarrow{h}_{j+1}].$$
The feature vector, $h_{i,j}$, is assumed to represent the content of the contiguous text span $x_{i:j}$ along with contextual information captured by the LSTMs. We did not use any feature templates because we found that they did not improve parsing performance in our unsupervised setting, though we observed that template features roughly following Joty et al. (2015) improved performance in a supervised setting.
Finally, given a span-level feature vector, $h_{i,j}$, we use two-layer perceptrons with the ReLU activation function:
$$s(i, j) = \mathrm{MLP}(h_{i,j}),$$
which computes the constituent score of the contiguous text span $x_{i:j}$.

Decoding
We use a Cocke-Kasami-Younger (CKY)-style dynamic programming algorithm to perform a global search over the space of valid trees and find the highest-scoring tree. For a document with $n$ EDUs, we use an $n \times n$ table $C$, the cell $C[i, j]$ of which stores the subtree score spanning from $i$ to $j$. For spans of length one (i.e., $i = j$), we assign constant scalar values:
$$C[i, i] = 0.$$
For general spans $0 \le i < j \le n - 1$, we define the following recursion:
$$C[i, j] = s(i, j) + \max_{i \le k < j} \big( C[i, k] + C[k+1, j] \big), \tag{17}$$
where $s(i, j)$ denotes the constituent score computed by our model. To parse the full document, we first compute $C[0, n - 1]$ in a bottom-up manner and then recursively trace the history of the selected split positions, $k$, resulting in a binary tree spanning the entire document.
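The decoding procedure can be sketched in plain Python as below. This is a minimal illustration rather than the original implementation: the function name `cky_decode`, the callable `score(i, j)` standing in for the model's $s(i, j)$, and the choice of 0 as the constant for single-EDU spans are assumptions.

```python
def cky_decode(n, score):
    """Find the best-scoring binary tree over n EDUs.

    score(i, j) returns the constituent score of span (i, j), 0 <= i < j <= n - 1.
    Returns (best_tree_score, list_of_internal_spans).
    """
    # C[i][j] holds the best subtree score for span (i, j). Single EDUs get
    # 0.0 (an assumed constant; any constant yields the same best tree).
    C = [[0.0] * n for _ in range(n)]
    split = [[None] * n for _ in range(n)]
    for length in range(2, n + 1):                  # shorter spans first
        for i in range(n - length + 1):
            j = i + length - 1
            # Best split position k with i <= k < j.
            k = max(range(i, j), key=lambda m: C[i][m] + C[m + 1][j])
            split[i][j] = k
            C[i][j] = score(i, j) + C[i][k] + C[k + 1][j]

    spans = []

    def trace(i, j):
        """Follow the stored split positions to recover the tree."""
        if i == j:
            return
        spans.append((i, j))
        trace(i, split[i][j])
        trace(split[i][j] + 1, j)

    trace(0, n - 1)
    return C[0][n - 1], spans


# Toy scores for a 3-EDU document; unspecified spans score 0.
toy = {(0, 2): 1.0, (1, 2): 2.0, (0, 1): 0.5}
best, tree = cky_decode(3, lambda i, j: toy.get((i, j), 0.0))
print(best, tree)
```

With these toy scores the decoder prefers bracketing the last two EDUs first, because span (1, 2) scores higher than span (0, 1).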

Unsupervised Learning Using Viterbi EM
In this paper, we use Viterbi EM (Brown et al., 1993;Spitkovsky et al., 2010), a variant of the EM algorithm and self-training (McClosky et al., 2006a,b), to train the span-based discourse constituency parser (Section 3.1) in an unsupervised manner. Viterbi EM has suitable properties for discourse processing, as described later in this section.

Overall Procedure
We first automatically sample initial trees based on our prior knowledge of document structures (described later in Section 3.3) and then perform the M step on the sampled trees to initialize the model parameters. After the initialization step, we repeat the E step and the M step in turns.
To perform early stopping, we use a held-out development set of 30 documents with annotated trees T * dev , which are never used as the supervision to estimate the parsing model.
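The overall procedure can be summarized by the following skeleton. It is a sketch under stated assumptions: the callables `sample_initial_tree`, `parse`, `fit`, and `evaluate`, as well as the `max_iters` and `patience` parameters, are hypothetical placeholders for the components described in this section, and the patience-based stopping rule is one common way to realize early stopping on the development set.

```python
def viterbi_em(train_docs, sample_initial_tree, parse, fit, evaluate,
               max_iters=10, patience=2):
    """Alternate Viterbi E steps and M steps with early stopping.

    sample_initial_tree(x) -> tree   initial-tree sampling from prior knowledge
    parse(model, x)        -> tree   E step: best tree under the current model
    fit(model, docs, trees)-> model  M step: update parameters on pseudo trees
    evaluate(model)        -> float  score on the held-out development set
    """
    # Initialization: an M step on automatically sampled initial trees.
    trees = [sample_initial_tree(x) for x in train_docs]
    model = fit(None, train_docs, trees)

    best_model, best_score, bad_rounds = model, evaluate(model), 0
    for _ in range(max_iters):
        trees = [parse(model, x) for x in train_docs]    # E step
        model = fit(model, train_docs, trees)            # M step
        dev_score = evaluate(model)
        if dev_score > best_score:
            best_model, best_score, bad_rounds = model, dev_score, 0
        else:
            bad_rounds += 1
            if bad_rounds >= patience:                   # early stopping
                break
    return best_model, best_score
```

The development trees enter only through `evaluate`, so they select when to stop but never supervise the parameter updates.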

E Step
In the E step of Viterbi EM, based on the current model, we perform discourse constituency parsing for all training documents $\mathcal{X}$, resulting in a pseudo treebank with discourse constituent structures, i.e.,
$$\mathcal{D} = \Big\{ (x, \hat{T}) \;\Big|\; x \in \mathcal{X},\ \hat{T} = \operatorname*{arg\,max}_{T \in \mathrm{valid}(x)} s(x, T) \Big\},$$
where $\mathrm{valid}(x)$ denotes the set of all valid trees for $x$, $s(x, T)$ is defined in Equation (2), and $\hat{T}$ is the highest-scoring parse tree based on the current model. Klein and Manning (2001b) and Spitkovsky et al. (2010) count grammatical patterns used to derive syntactic trees in $\mathcal{D}$, which are then normalized and converted to probabilistic grammars in the next M step.
In contrast, ''discourse'' constituents are significantly sparse and tend to appear only once, which implies that it is almost meaningless to explicitly count discourse constituent patterns symbolically. We therefore attempt to directly use the trees in D to update the model parameters in the next M step.

M Step
In the M step, we re-estimate the next model as if it were supervised by the best parse trees found in the previous E step.
More precisely, we update the model parameters so that the next model satisfies the following constraints:
$$s(x, \hat{T}) \ge s(x, T') + \Delta(\hat{T}, T')$$
for each instance $(x, \hat{T}) \in \mathcal{D}$, where $T'$ ranges over all valid trees. $\Delta(\hat{T}, T')$ is a tree distance we define as follows:
$$\Delta(\hat{T}, T') = |\hat{T}| - |\hat{T} \cap T'|,$$
where $|\hat{T}|$ denotes the number of constituent spans (or internal nodes) in $\hat{T}$, and $|\hat{T} \cap T'|$ represents the number of spans shared between $\hat{T}$ and $T'$. In other words, we require the score of the best parse tree $\hat{T}$ to be larger than that of the less probable tree $T'$ by at least the margin $\Delta(\hat{T}, T')$. Please note that $|\hat{T}| = |T'|$ always holds, because the parse tree $\hat{T}$ and the negative-sample tree $T'$ are binary trees. $\Delta(\hat{T}, T') = 0$ holds if, and only if, $\hat{T} = T'$. These constraints can be rewritten by using the margin-based criterion as follows:
$$\max\Big( 0,\ \max_{T' \ne \hat{T}} \big[ s(x, T') + \Delta(\hat{T}, T') \big] - s(x, \hat{T}) \Big).$$
We minimize this criterion by using minibatch stochastic gradient descent and the backpropagation algorithm.
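For concreteness, the tree distance and the margin-based criterion for a single (best tree, negative tree) pair can be written as below. This is an illustrative sketch: trees are represented as collections of (i, j) spans, `score(i, j)` stands in for the model's span score, and the hinge form is the standard margin loss assumed here for one fixed negative tree.

```python
def tree_distance(best_tree, other_tree):
    """Number of spans in best_tree that are not shared with other_tree."""
    best, other = set(best_tree), set(other_tree)
    return len(best) - len(best & other)


def margin_loss(score, best_tree, negative_tree):
    """Hinge loss: zero once the best tree outscores the negative tree
    by at least the tree distance between them."""
    def tree_score(tree):
        return sum(score(i, j) for i, j in tree)    # sum of span scores

    margin = tree_distance(best_tree, negative_tree)
    return max(0.0, tree_score(negative_tree) + margin - tree_score(best_tree))


best_tree = [(0, 2), (1, 2)]
negative_tree = [(0, 2), (0, 1)]
toy = {(0, 2): 1.0, (1, 2): 2.0, (0, 1): 0.5}
print(tree_distance(best_tree, negative_tree))
print(margin_loss(lambda i, j: toy[(i, j)], best_tree, negative_tree))
```

In the toy example the two trees differ in one span, and the best tree already wins by more than that margin, so the loss is zero and no update would be triggered.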
The highest-scoring negative tree $T'$ ($\ne \hat{T}$) can be efficiently found by modifying the dynamic programming algorithm in Equation (17). In particular, we replace $s(i, j)$ with $s(i, j) + \mathbb{1}[(i, j) \notin \hat{T}]$.
Combining Viterbi training and the margin-based objective function allows us to (1) avoid explicitly counting discourse constituent patterns as symbolic variables and (2) directly use the scores of the trees found in the E step for re-estimation of the next model.
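One way to realize the modified dynamic program is loss-augmented decoding in the style of Stern et al. (2017): each span that is absent from the best tree receives a bonus of 1, so that the summed bonus of a candidate tree equals the tree distance (both trees have the same number of spans). The sketch below shows only the score wrapper; the name `augment_scores` is hypothetical, and the assumption that the augmentation is exactly this indicator term should be checked against the original formulation.

```python
def augment_scores(score, best_tree):
    """Wrap a span scorer so that spans outside best_tree gain +1.

    Running the usual CKY recursion with the wrapped scorer then maximizes
    score(tree) + tree_distance(best_tree, tree).
    """
    best = set(best_tree)

    def augmented(i, j):
        bonus = 0.0 if (i, j) in best else 1.0   # indicator of (i, j) not in best tree
        return score(i, j) + bonus

    return augmented


aug = augment_scores(lambda i, j: 0.5, [(0, 2), (1, 2)])
print(aug(1, 2), aug(0, 1))   # shared span keeps its score; the other gains 1
```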

Initialization in EM
In general, the EM algorithm tends to get stuck in local optima of the objective function (Charniak, 1993). Therefore, proper initialization is vital in order to avoid trivial solutions. This phenomenon has also been observed in EM-based grammar induction (Klein and Manning, 2004;Gimpel and Smith, 2012).
In this section, we introduce the initialization methods we use in Viterbi EM. More precisely, given an input document (i.e., a sequence of EDUs), we automatically build a discourse constituent structure based on our general prior knowledge of document structures. Below, we describe the four pieces of prior knowledge we use for the initial-tree sampling.

(i) Document Hierarchy
It is intuitively reasonable to consider that (elementary) discourse units belonging to the same textual chunk (e.g., sentence, paragraph) tend to form a subtree before crossing over the chunk boundaries. For example, we can assume that EDUs in the same sentence are preferentially connected with each other before getting combined with EDUs in other sentences. Indeed, Joty et al. (2013, 2015) and Feng and Hirst (2014) observed that it is effective to incorporate intra-sentential and multi-sentential parsing to build a document-level tree.
First, we split an input document into sentence-level and paragraph-level segments by detecting sentence and paragraph boundaries, respectively. We obtain sentence segmentation by applying Stanford CoreNLP to the concatenation of EDUs. We also extract paragraph boundaries by detecting empty lines in the raw documents (we used the raw WSJ files (''*.out'') in RST-DT, e.g., ''wsj_1135.out''). Our ''paragraph'' boundaries therefore do not strictly correspond to paragraph segmentation; however, we found that this pseudo ''paragraph'' segmentation improves the parsing accuracy. We then build a discourse constituent structure incrementally from sentence-level subtrees to paragraph-level subtrees and then to the document-level tree in a bottom-up manner. Figure 3 shows this process.

(ii) Discourse Branching Tendency
The second prior knowledge relates to information order in discourses and the branching tendencies of discourse trees. In general, an important text element tends to appear at earlier positions in the document, and then the text following it complements the message, which is reflected in the Right Frontier Constraint (Polanyi, 1985) in Segmented Discourse Representation Theory (Asher and Lascarides, 2003). This tendency can be assumed to hold recursively. Therefore, it is reasonable to consider that discourse structures tend to form right-heavy trees, as shown in Figure 4(a). Based on this assumption, we build right-branching trees for sentence-level, paragraph-level, and document-level discourse structures in the initial-tree sampling.
Figure 4: (a) We assume that an important text element tends to appear at earlier positions in the text, and the text following it complements the message, which leads to the right-heavy structure. (b)-(c) We split an intra-sentential EDU sequence into two subsequences based on the location of the EDU with the ROOT word. We build right-branching trees for each subsequence individually and finally bracket them. Head words are underlined.

(iii) Syntax-Aware Branching Tendency
As already discussed, this work assumes that discourse structures tend to form right-heavy trees. However, in our preliminary experiments, we found that this naive assumption produces about 44% erroneous trees for sentence-level structures with 3 EDUs. For sentences with 4 EDUs, the error rate increases to about 70%, which is a non-negligible number in the initialization step.
To resolve this problem, we introduce another, more fine-grained, piece of prior knowledge for sentence-level discourse structures. We expect that sentence-level trees are more strongly affected by syntactic cues (e.g., dependency graphs) than paragraph-level or document-level trees. More specifically, given an EDU sequence of one sentence, $x_i, \ldots, x_j$, we focus on the position of the EDU $x_k$ with a head word that is in a ROOT relation with its parent in the dependency graph. We assume that the sub-sequence starting at the ROOT EDU, $x_{k:j}$, roughly corresponds to the predicate of the sentence, and the sub-sequence before the ROOT EDU, $x_{i:k-1}$, corresponds to the subject. We build right-branching trees for each subsequence individually and finally bracket them. We illustrate the procedure in Figure 4(b)-(c).
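The split-then-bracket procedure can be sketched as follows. The nested-list tree encoding and the function names are assumptions for illustration; `root_index` is the position, within the sentence's EDU list, of the EDU whose head word has the ROOT relation.

```python
def right_branching(items):
    """Build (x0 (x1 (x2 ...))) over a non-empty list of EDUs or subtrees."""
    if len(items) == 1:
        return items[0]
    return [items[0], right_branching(items[1:])]


def adaptive_right_branching(edus, root_index):
    """Split a sentence's EDUs at the ROOT EDU, build a right-branching tree
    for each part, and bracket the two parts together."""
    before, after = edus[:root_index], edus[root_index:]
    if not before:                      # ROOT EDU comes first: plain RB
        return right_branching(after)
    return [right_branching(before), right_branching(after)]


print(adaptive_right_branching(["a", "b", "c", "d"], 2))   # [['a', 'b'], ['c', 'd']]
print(adaptive_right_branching(["a", "b", "c"], 0))        # ['a', ['b', 'c']]
```

When the ROOT EDU is the first EDU of the sentence, the procedure reduces to ordinary right branching.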

(iv) Locality Bias
Inspired by Smith and Eisner (2006), we introduce a structural locality bias as the last prior knowledge. The locality bias was observed to improve the accuracy of dependency grammar induction. We hypothesize that discourse constituents of shorter spans are preferable to those of longer ones.
Instead of introducing the locality bias into the initial-tree sampling, we encode it into the decoding algorithm in training and inference. More precisely, we rewrite the CKY recursion in Equation (17) as follows:
$$C[i, j] = s(i, j) + \frac{\lambda}{j - i} + \max_{i \le k < j} \big( C[i, k] + C[k+1, j] \big),$$
where $\lambda$ denotes a hyperparameter, which we empirically set to $\lambda = 10$. The second term decreases in inverse proportion to the span distance.

Data
We use the RST Discourse Treebank (RST-DT) built by Carlson et al. (2001) (https://catalog.ldc.upenn.edu/LDC2002T07), which consists of 385 Wall Street Journal articles manually annotated with RST structures (Mann and Thompson, 1988). We use the predefined split of 347 training articles and 38 test articles. We also prepare a development set with 30 instances randomly sampled from the training set, which is used only for hyper-parameter tuning and early stopping. We tokenized the documents using the Stanford CoreNLP tokenizer and converted them to lowercase. We also replaced digits with ''7'' (e.g., ''12.34'' → ''77.77'') to reduce data sparsity. We also replaced out-of-vocabulary tokens with the special symbol ''UNK''.
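The token normalization steps can be sketched as below. The function name `normalize`, the exact spelling of the unknown-token symbol, and the order of the steps (lowercasing, digit replacement, then the vocabulary check) are assumptions; tokenization itself is assumed to have been done beforehand.

```python
import re


def normalize(tokens, vocab, unk="UNK"):
    """Lowercase tokens, map every digit to '7', and replace tokens that are
    not in the vocabulary with the unknown-token symbol."""
    normalized = []
    for tok in tokens:
        tok = re.sub(r"\d", "7", tok.lower())    # e.g. "12.34" -> "77.77"
        normalized.append(tok if tok in vocab else unk)
    return normalized


vocab = {"sales", "rose", "77.77"}
print(normalize(["Sales", "rose", "12.34", "%"], vocab))
```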

Metrics
Following existing studies in unsupervised syntactic parsing (Klein, 2005; Smith, 2006), we quantitatively evaluate unsupervised parsers by comparing parse trees with the manually annotated ones. We use the standard (unlabeled) constituency metrics in PARSEVAL: Unlabeled Precision (UP), Unlabeled Recall (UR), and their Micro F1, which can indicate how well the parser identifies the linguistically reasonable structures.
The traditional evaluation procedure for RST parsing is RST-PARSEVAL (Marcu, 2000), which adapts PARSEVAL for the RST representation shown in Figure 5(a)-(b). However, Morey et al. (2018) showed that, as shown in Figure 5(c), traditional RST-PARSEVAL gives a higher-than-expected score because it considers pre-terminals (i.e., spans of length 1), which cannot be incorrect in the unlabeled constituency metrics. We therefore follow Morey et al. (2018) and perform the encoding of RST trees as shown in Figure 5(d)-(f). That is, we exclude spans of length 1 and include the root node. We also do not binarize the gold-standard trees.
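The unlabeled metrics under this encoding can be sketched as follows. This is an illustration only, not an evaluation script used in the experiments: trees are given as collections of inclusive (i, j) spans, spans of length 1 are dropped, the root span is kept because it is an ordinary member of each span set, and counts are pooled over documents to obtain micro-averaged scores.

```python
def micro_prf(pred_trees, gold_trees):
    """Micro-averaged unlabeled precision, recall, and F1 over documents."""
    matched = n_pred = n_gold = 0
    for pred, gold in zip(pred_trees, gold_trees):
        p = {(i, j) for i, j in pred if i < j}    # exclude spans of length 1
        g = {(i, j) for i, j in gold if i < j}
        matched += len(p & g)
        n_pred += len(p)
        n_gold += len(g)
    precision = matched / n_pred if n_pred else 0.0
    recall = matched / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


pred = [[(0, 2), (1, 2), (0, 0)]]
gold = [[(0, 2), (0, 1)]]
print(micro_prf(pred, gold))
```

Because gold trees are not binarized, a gold tree may contain fewer spans than a predicted binary tree, in which case precision and recall differ.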

Baselines
To quantitatively evaluate our unsupervised discourse constituency parser, it is necessary to develop strong baseline parsers. We thus propose Combinational Incremental Parsers (CIPs), which automatically and incrementally build a discourse (unlabeled) constituent structure from an EDU sequence based on the prior knowledge introduced in Section 3.3. That is, CIPs first build sentence-level discourse trees based on sentence segmentation using an elementary parser $f_s$. They then build paragraph-level trees using another elementary parser $f_p$, and finally output the document-level tree using $f_d$. An elementary parser is a function that returns a single tree given a sequence of EDUs or subtrees. CIPs can be represented as a triplet of elementary parsers, namely, $\langle f_s, f_p, f_d \rangle$.
Inspired by earlier studies in unsupervised syntactic constituency parsing (Klein and Manning, 2001a,b; Klein, 2005; Seginer, 2007), we prepare the following four candidates for the elementary parsers:
Right Branching (RB) Given a sequence of elements (i.e., EDUs or subtrees), RB always chooses the left-most element as a left terminal node and then treats the remaining elements as a right nonterminal (or terminal). This procedure is recursively applied to the remaining elements on the right, resulting in $(x_0\ (x_1\ (x_2 \ldots)))$. As described in Section 3.3, we expect RB to capture, to some extent, the branching tendency of discourse informational structures. RB was also used as a strong baseline for unsupervised syntactic constituency parsing in Klein and Manning (2001b).
Left Branching (LB) Contrary to RB, LB always chooses the right-most element as the right terminal and then transforms the remaining elements on the left to a subtree, resulting in $(((\ldots x_{n-3})\ x_{n-2})\ x_{n-1})$.
Adaptive Right Branching (RB*) We augment RB by considering the syntax-aware branching tendency, described in Section 3.3(iii). That is, based on the position of the head EDU (with the ROOT relation), we split the sentence into two parts and then perform RB for each sub-sequence.
Random Bottom-Up (BU) BU randomly selects two adjacent elements and brackets them. This operation is repeated in a bottom-up manner until we obtain a single binary tree spanning the whole sequence.
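The composition of elementary parsers into a CIP can be sketched as below. The nested-list encodings of documents and trees and the function names `rb`, `lb`, `bu`, and `cip` are assumptions for illustration; RB* is omitted because it additionally requires the position of the ROOT EDU in each sentence.

```python
import random


def rb(items):
    """Right Branching: (x0 (x1 (x2 ...)))."""
    return items[0] if len(items) == 1 else [items[0], rb(items[1:])]


def lb(items):
    """Left Branching: (((... x_{n-3}) x_{n-2}) x_{n-1})."""
    tree = items[0]
    for item in items[1:]:
        tree = [tree, item]
    return tree


def bu(items, rng=random):
    """Random Bottom-Up: repeatedly bracket a random adjacent pair."""
    items = list(items)
    while len(items) > 1:
        k = rng.randrange(len(items) - 1)
        items[k:k + 2] = [[items[k], items[k + 1]]]
    return items[0]


def cip(document, f_s, f_p, f_d):
    """Apply sentence-, paragraph-, and document-level elementary parsers.

    document: list of paragraphs; paragraph: list of sentences;
    sentence: list of EDUs.
    """
    paragraph_trees = [f_p([f_s(sentence) for sentence in paragraph])
                       for paragraph in document]
    return f_d(paragraph_trees)


doc = [[["a", "b"], ["c"]], [["d", "e", "f"]]]
print(cip(doc, rb, rb, lb))
```

Each level only sees the subtrees produced by the level below, so no constituent built by these parsers crosses a sentence or paragraph boundary.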

Hyperparameters
We set the dimensionalities of the word embeddings, POS embeddings, relation embeddings, forward/backward LSTM hidden layers, and MLP to 300, 10, 10, 125, and 100, respectively. We initialized the word embeddings with the GloVe vectors trained on 840 billion tokens (Pennington et al., 2014). During the training, we did not fine-tune the word embeddings. We ran the initialization steps for 3 epochs. We used a minibatch size of 10. We also used the Adam optimizer (Kingma and Ba, 2015).

Results and Discussion
In this section we report the results of the experiments and discuss them. We first discuss the comparison results of our method with baselines and the fully supervised RST parsers, including the results published in literature (Section 5.1). We then investigate the impact of initialization methods (Section 5.2). Finally, we provide our analysis on discourse constituents induced by our method (Section 5.3).

Performance Comparison
We compared our method with the baselines described in Section 4.3. We also included the previous work (Kobayashi et al., 2019) on unsupervised RST parsing as our baseline, though it is not a fair comparison because they use binarized gold trees for evaluation (scores against the binarized trees and the original trees are, however, quite similar; Morey et al., 2018). For reference, we also compared our method with fully supervised parsers: the supervised version of our model and recent supervised parsers (Feng and Hirst, 2014; Joty et al., 2015) that incorporate intra-sentential and multi-sentential parsing as in our parser. The supervised version uses the same model and hyperparameters as the unsupervised model; the only difference is that it is trained with conventional supervised learning on manually annotated trees instead of Viterbi EM.
Table 1 shows the unlabeled constituency scores in the corrected RST-PARSEVAL (Morey et al., 2018) against non-binarized trees. We also show the traditional RST-PARSEVAL Micro F1 scores in parentheses. $\langle f_s, f_d \rangle$ indicates that we used only sentence boundaries and discarded paragraph boundaries. The scores of external supervised parsers (Feng and Hirst, 2014; Joty et al., 2015) are borrowed from Morey et al. (2018).
We observe that: (1) the incremental tree-construction approach with boundary information consistently improves parsing performance of the baselines; (2) RB-based CIPs are better than those with LB or BU; and (3) replacing RB with RB* yields further improvements. These results confirm the reasonability of the prior knowledge of document structures. The best baseline is $\langle \mathrm{RB}^*_s, \mathrm{RB}_p, \mathrm{LB}_d \rangle$, which achieves a Micro F1 score of 66.8% (83.7%) without any learning. Surprisingly, the score is competitive with those of the supervised parsers.
Table 1 also demonstrates that our method outperforms all the baselines and achieves an F1 score of 67.5% (84.0%). If we use the best baseline for initial-tree sampling in Viterbi EM, the performance further improves to 68.0% (84.3%).
To investigate the potential of our unsupervised parser, we also augmented the training dataset with an external unlabeled corpus. We used about 2,000 news articles from the Wall Street Journal in the Penn Treebank (Marcus et al., 1993) that are not shared with the RST-DT test set. We segmented the raw documents into EDUs by using an external pre-trained EDU segmenter (Wang et al., 2018) and found that the larger unlabeled dataset can improve parsing performance to 68.6%.
It is worth noting that our method outperforms the baselines used for the initialization, which implies that our method learns some knowledge of discourse constituentness in an unsupervised manner.
Our method also achieves results comparable or superior to those of supervised models. We suspect that the reason why the supervised version of our model outperforms the external supervised parsers (Feng and Hirst, 2014;Joty et al., 2015) lies mostly in feature extraction and the introduction of paragraph boundaries.

Impact of Initialization Methods
Here, we evaluate the importance of initialization in Viterbi EM. Beginning with uniform initialization, we incrementally applied the initialization techniques introduced in Section 3.3 and investigated their impact on the results. Table 1). We then introduced Discourse Branching Tendency in Section 3.3(ii) by replacing BU with RB in the CIP, which also slightly improved the performance, to 59.7%. We then introduced Syntax-Aware Branching Tendency in Section 3.3(iii) by replacing RB with RB* only at the sentence level, which brought a considerable performance gain of 6.6 points (66.3%). Finally, we introduced Locality Bias in Section 3.3(iv) and achieved 67.5%. We also found that our model can be improved further to 68.0% if we use the best baseline for initialization. In total, these initialization techniques made a difference of 9.1 points compared with uniform initialization (i.e., 58.9 → 68.0), which implies that initialization should be carefully considered in unsupervised discourse (constituency) parsing using EM and that the prior knowledge we proposed in Section 3.3(i)-(iv) can capture some of the tendencies of document structures. We also found that Syntax-Aware Branching Tendency is the most effective of the techniques, which suggests that more detailed knowledge can yield further improvements.
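To make the branching priors concrete, the sketch below lists the spans of a fully right-branching and a fully left-branching binary tree over n units. It assumes that RB and LB denote right- and left-branching trees and that spans are inclusive (i, j) index pairs. Both function names are hypothetical, and the sketch does not reproduce the initial-tree sampling procedure of Section 3.3.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # inclusive (start, end) indices over units


def right_branching_spans(n: int) -> List[Span]:
    """Internal-node spans of a fully right-branching tree over n units.

    Every constituent ends at the last unit: (0, n-1), (1, n-1), ...
    """
    return [(i, n - 1) for i in range(n - 1)]


def left_branching_spans(n: int) -> List[Span]:
    """Internal-node spans of a fully left-branching tree over n units.

    Every constituent starts at the first unit: (0, n-1), (0, n-2), ...
    """
    return [(0, j) for j in range(n - 1, 0, -1)]


# With four units, both trees have three internal nodes and share only the root.
rb = right_branching_spans(4)
lb = left_branching_spans(4)
shared = set(rb) & set(lb)
```

For any n of at least two, the two span sets overlap only in the root span, which illustrates why the choice between the two priors can change the initial trees substantially.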

Learned Discourse Constituentness
Here, we further investigate the discourse constituentness learned by our method.
First, we calculated Unlabeled Recall (UR) scores for each relation class in RST-DT. We used 18 coarse-grained classes. Note that we focus only on constituent spans {(i, j)} because our method does not predict relation labels. Table 3 shows the results for the four best and the four worst relation classes of our method. We compare the results with the supervised version. We observe that although our method uses an unsupervised approach and does not rely on structural annotations, some scores are comparable to those of the supervised version. We also found that relation classes with relatively higher scores can be assumed to form right-heavy structures (e.g., ATTRIBUTION, ENABLEMENT), whereas relations with lower scores can be considered to form left-heavy structures (e.g., EVALUATION, SUMMARY). These results are natural because the initialization methods we used in the Viterbi training strongly rely on RB-based CIP. This implies that, to capture discourse constituency phenomena of SUMMARY or EVALUATION relations, it is necessary to introduce other initialization techniques (or prior knowledge) in the future.
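The per-class Unlabeled Recall described above can be sketched as follows. This is an illustrative sketch rather than the evaluation code used in this work: the function name and the input format (gold spans paired with their relation class, predicted spans as a plain set) are assumptions, and how a relation class is attached to each gold span follows the gold annotation and is not modeled here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Span = Tuple[int, int]  # inclusive (start, end) indices over EDUs


def unlabeled_recall_by_relation(
    gold: Iterable[Tuple[Span, str]], pred_spans: Set[Span]
) -> Dict[str, float]:
    """Fraction of gold spans of each relation class that the parser recovers.

    Only the bracket (i, j) has to match; the predicted tree carries no labels.
    """
    total = defaultdict(int)
    hit = defaultdict(int)
    for span, relation in gold:
        total[relation] += 1
        if span in pred_spans:
            hit[relation] += 1
    return {relation: hit[relation] / total[relation] for relation in total}


# Toy example with hypothetical gold annotations.
gold = [((0, 1), "ATTRIBUTION"), ((2, 3), "ATTRIBUTION"), ((0, 3), "SUMMARY")]
pred = {(0, 1), (2, 3), (1, 3)}
scores = unlabeled_recall_by_relation(gold, pred)
```

Since the relation label is read only from the gold side, the measure can be computed for a parser that predicts brackets alone.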
Lastly, we qualitatively inspected the discourse constituentness learned by our method. We computed span scores s(i, j) for all possible spans (i, j) in the RST-DT test set without using any boundary information. We then sampled text spans x_{i:j} with relatively high constituent scores, s(i, j) > 10.0.
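The span selection described above can be sketched as an enumeration of all candidate spans followed by a threshold filter. In this sketch, score_fn is a hypothetical stand-in for the trained span scorer s(i, j), and restricting the enumeration to spans of at least two units is an assumption; neither detail is taken from the implementation used in this work.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # inclusive (start, end) indices over EDUs


def high_scoring_spans(
    n_units: int,
    score_fn: Callable[[int, int], float],
    threshold: float = 10.0,
) -> List[Tuple[Span, float]]:
    """Return all spans (i, j), i < j, whose score exceeds the threshold.

    Spans are sorted from the highest to the lowest score.
    """
    results = []
    for i in range(n_units):
        for j in range(i + 1, n_units):
            score = score_fn(i, j)
            if score > threshold:
                results.append(((i, j), score))
    results.sort(key=lambda item: item[1], reverse=True)
    return results


def toy_score(i: int, j: int) -> float:
    """Toy scorer for illustration only; a real scorer would be the trained model."""
    return 6.0 * (j - i) - i


selected = high_scoring_spans(4, toy_score)
```

A random subset of the returned spans can then be drawn for manual inspection.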
As shown in the upper part of Table 4, our method learns some aspects of discourse constituentness that seem linguistically reasonable. In particular, we found that our method has the potential to predict brackets for (1) clauses with connectives qualifying other clauses from right to left (e.g., "X [because B.]") and (2) attribution structures (e.g., "say that [B]"). These results indicate that our method is good at identifying discourse constituents near the end of sentences (or paragraphs), which is natural because RB is mainly used for generating initial trees in EM training. The bottom part of Table 4 demonstrates that the beginning position of the text span is also important for estimating constituenthood, along with the ending position.
Table 4: Discourse constituents and their predicted scores (in parentheses). We show the discourse constituents (in bold) in the RST-DT test set, which have relatively high span scores. We did NOT use any sentence/paragraph boundaries for scoring.

Conclusion
In this paper, we introduced an unsupervised discourse constituency parsing algorithm that uses Viterbi EM with a margin-based criterion to train a span-based neural parser. We also introduced initialization methods for the Viterbi training of discourse constituents. We observed that our unsupervised parser outperforms the baselines and achieves performance comparable or even superior to that of fully supervised parsers. We also found that the learned discourse constituents depend strongly on the initialization used in Viterbi EM, and that it is necessary to explore other initialization techniques to capture more diverse discourse phenomena.
This study has two limitations. First, this work focuses only on unlabeled discourse constituent structures. Although such hierarchical information is useful in downstream applications (Louis et al., 2010), both nuclearity statuses and rhetorical relations are also necessary for a more complete RST analysis. Second, our study uses only English documents for evaluation. However, different languages may have different structural regularities. Hence, it would be interesting to investigate whether the initialization methods are effective in different languages, which we believe would offer insights into discourse-level universals. We leave these issues for future work.