Unsupervised Chunking as Syntactic Structure Induction with a Knowledge-Transfer Approach

In this paper, we address unsupervised chunking as a new task of syntactic structure induction, which is helpful for understanding the linguistic structures of human languages as well as for processing low-resource languages. We propose a knowledge-transfer approach that heuristically induces chunk labels from state-of-the-art unsupervised parsing models; a hierarchical recurrent neural network (HRNN) learns from such induced chunk labels to smooth out the noise of the heuristics. Experiments show that our approach largely bridges the gap between supervised and unsupervised chunking.1

* Work partially done as a co-op intern at the University of Alberta.
1 Our code and output are released at https://github.com/Anup-Deshmukh/Unsupervised-Chunking


Introduction
Understanding the linguistic structure of language (e.g., parsing and chunking) is an important research topic in NLP. Most previous work employs supervised machine learning methods to predict linguistic structures. While these methods achieve high performance, they require massive data labeled with linguistic structures, such as treebanks (Marcus et al., 1993). Existing resources are mainly constructed for widely used languages (e.g., English), and constructing new treebanks for low-resource languages is cumbersome and expensive.
Unsupervised syntactic structure induction has been attracting increasing interest in recent years (Kim et al., 2019a; Shen et al., 2018a,b). This task concerns discovering linguistic structures of text without using labeled data. It is important to NLP research because it can potentially be used for low-resource languages and serve as a first pass in annotating large treebanks for them. Moreover, grammars learned by these unsupervised methods shed light on linguistic theories.

Previous work on unsupervised syntactic structure induction mainly focuses on constituency parsing, which organizes words in a hierarchical manner (Kim et al., 2019a,b; Shen et al., 2018a). Recently, Shen et al. (2021) propose to jointly induce constituency and dependency structures from text.
In this work, we address unsupervised chunking, another meaningful task of linguistic structure discovery. The chunking task aims to group the words of a sentence into chunks (roughly speaking, phrases) in a non-hierarchical fashion (Sang and Buchholz, 2000; Kudo and Matsumoto, 2001), and our setting is to detect chunks without the supervision of annotated linguistic structures.
In fact, unsupervised chunking has real-world applications, as understanding text fundamentally requires finding spans like noun phrases and verb phrases. It would benefit various downstream tasks, such as keyword extraction (Firoozeh et al., 2020), named entity recognition (Sano et al., 2017), and open information extraction (Niklaus et al., 2018).
In this paper, we propose a knowledge-transfer approach to unsupervised chunking by a hierarchical recurrent neural network (HRNN). We utilize recent advances in unsupervised parsing and propose a maximal left-branching heuristic to induce chunk labels from unsupervised parses. Without any supervision of annotated grammars, this heuristic leads to reasonable (albeit noisy and imperfect) chunks. We further design an HRNN model that learns from the heuristic chunk labels. Our HRNN involves a trainable chunking gate that switches between a lower word-level RNN and an upper chunk-level RNN. This explicitly models the composition of words into chunks and of chunks into the sentence. Results on three datasets show that our HRNN can indeed smooth out the noise of heuristically induced chunk labels, with a considerable improvement in terms of the phrase F1 score; these observations are consistent across domains and languages.
Related Work. Unsupervised syntactic structure detection attracted much attention in early NLP research because of its use in low-resource scenarios (Clark, 2001; Klein, 2005). Klein and Manning (2002) propose to model constituency and context for each span with an Expectation-Maximization (EM) algorithm. Early work also focuses on unsupervised dependency parsing for syntactic structure induction (Seginer, 2007; Paskin, 2001). Klein and Manning (2004) combine constituency and dependency models via co-training to further boost their performance.
To learn syntactic structures, Haghighi and Klein (2006) propose a probabilistic context-free grammar (PCFG) augmented with manually designed features. Reichart and Rappoport (2008) perform clustering by syntactic features to obtain labeled parse trees. Clark (2001) clusters sequences of tags based on their local mutual information to build parse trees. Such early studies typically used heuristics, linguistic knowledge, and manually designed features for unsupervised syntactic structure induction (Wolff, 1988; Klein and Manning, 2002; Clark, 2001).
In the deep learning era, interest in unsupervised parsing has revived. Socher et al. (2011) propose the recursive autoencoder, where a binary tree is built by greedily minimizing the reconstruction loss. Such recursive tree structures can also be learned in an unsupervised way by CYK-style marginalization (Maillard et al., 2019) and Gumbel-softmax (Choi et al., 2018). Yogatama et al. (2017) learn a shift-reduce parser by reinforcement learning toward a downstream task. However, evidence shows that the above approaches do not yield linguistically plausible trees (Williams et al., 2018).
Shen et al. propose to model syntactic distance (2018a) or syntactic ordering (2018b) to build parse trees. Kim et al. (2019b) propose a Compound PCFG for unsupervised parsing. The trees given by these approaches are more correlated with constituency trees. Li et al. (2019) propose to transfer knowledge among several unsupervised parsers and obtain better performance. Our work is inspired by such knowledge transfer, but we propose an insightful heuristic that induces chunk labels from unsupervised parsers. We also design a hierarchical RNN to learn from the induced chunk labels.
Previous studies address unsupervised chunking as an important task in speech processing; they use acoustic information to determine the chunks (Pate and Goldwater, 2011; Barrett et al., 2018). Our work only considers textual information, and views unsupervised chunking as a new task of syntactic structure induction.

Model
In this section, we first induce chunk labels from a state-of-the-art unsupervised parser. Then, we train a hierarchical RNN to learn from the induced labels and smooth out their noise.

Inducing Chunk Labels from Unsupervised Parsing
We propose to induce chunk labels from state-of-the-art unsupervised parsers. The intuition is that the chunking structure can be thought of as a flattened parse tree, and thus agrees with the parsing structure to some extent. Our knowledge-transfer approach is able to take advantage of recent advances in unsupervised parsing (Kim et al., 2019a,b). Specifically, we adopt the Compound PCFG, which is a 5-tuple grammar G = (S, N, P, Σ, R), where S is a start symbol; N, P, and Σ are finite sets of nonterminal, preterminal, and terminal symbols, respectively; and R is a finite set of rules taking one of the following forms:

S → A,      A ∈ N    (1)
A → B C,    B, C ∈ N ∪ P    (2)
T → w,      T ∈ P, w ∈ Σ    (3)

where S → A is the start of a sentence and T → w indicates the generation of a word. A → B C models the bifurcations of a binary constituency tree, where a constituent node is not explicitly associated with a type (e.g., noun phrase).
In addition, the model maintains a sentence-level continuous random vector, serving as the prior of the PCFG. The Compound PCFG is trained by maximum likelihood of text, where the PCFG is marginalized out by CYK-style dynamic programming and the continuous random vector is treated by amortized variational inference. We refer readers to Kim et al. (2019b) for details.
We would like to induce chunk labels from the Compound PCFG, a state-of-the-art unsupervised parser. Given a sentence, we obtain its parse tree by applying the Viterbi-like CYK algorithm to the Compound PCFG. We propose a simple yet effective heuristic that extracts maximal left-branching subtrees as chunks. As is known, the English language is strongly biased toward right-branching structures (Williams et al., 2018; Li et al., 2019). We observe, on the other hand, that a left-branching structure typically indicates closely related words. Here, a left-branching subtree means that the words are grouped in the form of ((· · · ((w1 w2) w3) · · · ) wk). We extract all maximal left-branching subtrees as chunks. In Figure 1, for example, "deeply fried food" is a three-word maximal left-branching subtree, whereas "the kid" and "likes" are also maximal left-branching subtrees (although degenerated). Our heuristic treats them as chunks. The following theorem shows that our heuristic can unambiguously give chunk labels for any sentence with any parse tree. (See Appendix A for the proof.)

Theorem 1. Given any binary parse tree, every word will belong to one and only one chunk by the maximal left-branching heuristic.
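For concreteness, the following is a minimal Python sketch of the heuristic (our own illustration; the tree encoding and function names are hypothetical, not from the released code). A binary parse tree is represented as nested pairs, and a top-down traversal emits each maximal left-branching subtree as one chunk:

def is_left_branching(tree):
    # A leaf is trivially left-branching; an internal node is left-branching
    # iff its right child is a leaf and its left child is left-branching,
    # i.e., the subtree has the form ((...((w1 w2) w3)...) wk).
    if isinstance(tree, str):
        return True
    left, right = tree
    return isinstance(right, str) and is_left_branching(left)

def leaves(tree):
    # Collect the words of a subtree in order.
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])

def extract_chunks(tree):
    # Emit the topmost (hence maximal) left-branching subtree on each
    # branch as one chunk; otherwise recurse into both children.
    if is_left_branching(tree):
        return [leaves(tree)]
    left, right = tree
    return extract_chunks(left) + extract_chunks(right)

# The example of Figure 1:
tree = (("the", "kid"), ("likes", (("deeply", "fried"), "food")))
print(extract_chunks(tree))
# [['the', 'kid'], ['likes'], ['deeply', 'fried', 'food']]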
Our simple heuristic achieves reasonable chunking performance, although its labels are noisy. The HRNN learning (discussed in the next subsection) will smooth out such noise and yield more meaningful chunks.

Training Hierarchical RNN
We would like to train a machine learning model to learn from the Compound PCFG-induced chunk labels. Our intuition is that a learning machine pools the knowledge of different samples into a parametric model and thus may smooth out the noise of our heuristics.
Specifically, we run the Compound PCFG on an unlabeled corpus to obtain chunk labels in the BI schema (Ramshaw and Marcus, 1995), where "B" refers to the beginning of a chunk and "I" refers to the inside of a chunk. Then, a machine learning model (e.g., a neural network) learns from these pseudo-groundtruth labels.
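For illustration, converting the induced chunks into BI labels is then straightforward (a minimal sketch; the function name is ours):

def chunks_to_bi(chunks):
    # "B" marks the first word of each chunk; "I" marks the rest.
    labels = []
    for chunk in chunks:
        labels.append("B")
        labels.extend("I" for _ in chunk[1:])
    return labels

print(chunks_to_bi([["the", "kid"], ["likes"], ["deeply", "fried", "food"]]))
# ['B', 'I', 'B', 'B', 'I', 'I']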
We observe that a classic RNN or Transformer may not be suitable for the chunking task, because the prediction at a time step is unaware of previously predicted chunks, thus lacking autoregressiveness. Feeding back predicted chunk labels as in a sequence-to-sequence model is not adequate either, because a BI label carries only one bit of information and cannot provide useful autoregressive signals.
To this end, we design a hierarchical RNN that models the autoregressiveness of predicted chunks by altering the neural structure. Our HRNN contains a lower word-level RNN and an upper chunk-level RNN. We also design a gating mechanism that switches between the two RNNs in a soft manner, also serving as the predicted probability of the chunk label.
Let w^(1), · · · , w^(n) be the words in a sentence. We first apply the pretrained language model BERT (Kenton et al., 2019) to obtain contextual representations of the words, denoted by x^(1), · · · , x^(n). This helps our model understand the global context of the sentence. For a step t, we first predict a switching gate m^(t) ∈ (0, 1) as the chunking decision:

m^(t) = σ(v⊤[h_w^(t−1); h_c^(t−1); x^(t)])    (4)

where h_w^(t−1) is the hidden state of the lower (word-level) RNN, h_c^(t−1) is that of the upper (chunk-level) RNN, and v is a trainable weight vector. The semicolon represents vector concatenation, and σ represents the sigmoid function. Such a switching gate is also used to control the information flow by altering the network architecture, as shown in Figure 1. In this way, it provides meaningful autoregressive information, as it makes the HRNN aware of previously detected chunks.
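A minimal PyTorch sketch of the switching gate in Equation (4) follows (our own illustration; the single linear scoring layer and all names are assumptions, as the exact parameterization is not fully specified here):

import torch
import torch.nn as nn

class ChunkGate(nn.Module):
    # Scores the concatenation [h_w; h_c; x] and squashes it to a
    # scalar gate m in (0, 1), as in Equation (4).
    def __init__(self, word_dim, hidden_dim):
        super().__init__()
        self.v = nn.Linear(2 * hidden_dim + word_dim, 1)

    def forward(self, h_w, h_c, x):
        # h_w: lower (word-level) RNN state; h_c: upper (chunk-level)
        # RNN state; x: BERT representation of the current word.
        return torch.sigmoid(self.v(torch.cat([h_w, h_c, x], dim=-1)))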
Suppose our model predicts that the t-th word is the beginning of a chunk. This essentially "cuts" the sequence into two parts at this step. The lower RNN and upper RNN are updated by

h_w^(t) = f_w(h^(sos), x^(t))    (5)
h_c^(t) = f_c(h_c^(t−1), h_w^(t−1))    (6)

where f_w and f_c are the transition functions of the two RNNs, respectively.

[Table 1 caption (fragment): ... (Bird, 2006)) as groundtruth labels. "→" refers to our knowledge-transfer approaches.]
In Equation (5), the lower RNN ignores its previous hidden state and restarts from a learnable initial state h^(sos), due to the prediction of a new phrase. In Equation (6), the upper RNN picks up the newly formed phrase, whose representation h_w^(t−1) is captured by the lower RNN, and fuses it with the previous chunk representation h_c^(t−1) in the upper RNN.
Suppose our model predicts that the t-th word is not the beginning of a chunk, i.e., "no cut" is performed at this step. The RNNs are updated by

h_w^(t) = f_w(h_w^(t−1), x^(t))    (7)
h_c^(t) = h_c^(t−1)    (8)

Here, the lower RNN updates its hidden state with the input x^(t) as a normal RNN, whereas the upper RNN is idle because no phrase is formed. The "cut" and "no-cut" cases can be unified by

h_w^(t) = m^(t) f_w(h^(sos), x^(t)) + (1 − m^(t)) f_w(h_w^(t−1), x^(t))    (9)
h_c^(t) = m^(t) f_c(h_c^(t−1), h_w^(t−1)) + (1 − m^(t)) h_c^(t−1)    (10)

In fact, we keep m^(t) as a real number and fuse the lower RNN and upper RNN in a soft manner. This is because chunking is by its nature ambiguous, and our soft gating mechanism is able to better preserve information.
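Putting Equations (4)-(10) together, one step of the soft-gated HRNN can be sketched as follows (a hedged sketch: the GRU transition cells and all names are our assumptions, not necessarily the released implementation):

class HRNNCell(nn.Module):
    def __init__(self, word_dim, hidden_dim):
        super().__init__()
        self.gate = ChunkGate(word_dim, hidden_dim)
        self.f_w = nn.GRUCell(word_dim, hidden_dim)    # lower, word-level RNN
        self.f_c = nn.GRUCell(hidden_dim, hidden_dim)  # upper, chunk-level RNN
        self.h_sos = nn.Parameter(torch.zeros(hidden_dim))  # learnable restart state

    def forward(self, x, h_w, h_c):
        m = self.gate(h_w, h_c, x)  # soft chunking decision, Eq. (4)
        # "Cut": the lower RNN restarts from h_sos and the upper RNN
        # absorbs the finished phrase, Eqs. (5)-(6).
        h_w_cut = self.f_w(x, self.h_sos.expand_as(h_w))
        h_c_cut = self.f_c(h_w, h_c)
        # "No cut": the lower RNN advances; the upper RNN is idle, Eqs. (7)-(8).
        h_w_nocut = self.f_w(x, h_w)
        # Soft interpolation of the two cases by the gate m, Eqs. (9)-(10).
        h_w_new = m * h_w_cut + (1 - m) * h_w_nocut
        h_c_new = m * h_c_cut + (1 - m) * h_c
        return m, h_w_new, h_c_new

At training time, the gate m^(t) can be supervised against the induced B/I labels (B for "cut", I for "no cut") with a binary cross-entropy loss.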

Experiments
Setup. We used the CoNLL-2000 (Sang and Buchholz, 2000), CoNLL-2003 (Sang and De Meulder, 2003), and English Web Treebank datasets for evaluation. We compare the model output with groundtruth chunks in terms of phrase F1 and tag accuracy. Dataset details and our experimental settings are presented in Appendix B.
Main Results. Table 1 presents the main results of our knowledge-transfer approach. In addition to the Compound PCFG, we also adopt another state-of-the-art unsupervised parser (Kim et al., 2019a) based on the features of a pretrained language model (LM). Specifically, we threshold the BERT (Kenton et al., 2019) similarity of consecutive words for chunking (a sketch follows below). We observe that the LM-based unsupervised chunker is worse than the Compound PCFG. Therefore, our main model variant uses the Compound PCFG as the "teacher" model, i.e., the source of knowledge transfer. We train our student HRNN model to learn from the heuristically induced chunk labels. Results show that we achieve an improvement of more than 5 percentage points in phrase F1 based on either the LM-based chunker or the Compound PCFG (42.05 vs. 47.99; 62.89 vs. 68.12) on the CoNLL-2000 dataset. These large margins imply that our HRNN can indeed smooth out the noise of the heuristics and capture the chunking patterns.
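For reference, the following is a hedged sketch of such an LM-based chunker (our reconstruction under assumptions: mean-pooled subword vectors, cosine similarity, and an arbitrary threshold; the exact criterion of the cited work may differ):

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-cased")
bert = AutoModel.from_pretrained("bert-base-cased")

def lm_chunk_labels(words, threshold=0.6):
    # Encode the word sequence; BERT may split words into subwords.
    enc = tok(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state[0]
    # Mean-pool subword vectors into one vector per word.
    ids = enc.word_ids()
    vecs = [hidden[[j for j, w in enumerate(ids) if w == i]].mean(dim=0)
            for i in range(len(words))]
    # Start a new chunk ("B") where consecutive words are dissimilar.
    sims = [torch.cosine_similarity(a, b, dim=0).item()
            for a, b in zip(vecs, vecs[1:])]
    return ["B"] + ["I" if s >= threshold else "B" for s in sims]

print(lm_chunk_labels("the kid likes deeply fried food".split()))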
We also evaluate our knowledge-transfer approach on a different language (German) and a different domain (the English Web Treebank). The results show a similar trend to that on the CoNLL-2000 dataset, highlighting the generality of our approach across languages and domains.
We also tested traditional unsupervised methods for chunking, such as thresholding point-wise mutual information (PMI; Van de Cruys, 2011) and the Baum-Welch algorithm for the hidden Markov model (HMM; Rabiner, 1989); see the sketch below. These methods perform significantly worse than recent advances in unsupervised syntactic structure discovery. In general, our knowledge-transfer approach with HRNN largely bridges the gap between supervised and unsupervised chunking.

We compare the inference efficiency of our student HRNN and the teacher Compound PCFG in Table 2. We observe that the Compound PCFG is slow in inference, as it requires Monte Carlo sampling to marginalize the latent variable and dynamic programming to marginalize the PCFG. Our HRNN not only yields higher-quality chunks, but is also 2-5x faster. The Compound PCFG uses the Viterbi-like CYK algorithm for building parse trees, which has a worst-case running time of O(n^3), where n is the length of the sentence. Thus, the efficiency improvement is larger on the CoNLL-2000 dataset, as it contains longer sentences (shown in Table 5, Appendix B).
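As referenced above, here is a minimal sketch of the PMI baseline (estimation and thresholding details are our own assumptions, not the exact setup of Van de Cruys, 2011):

import math
from collections import Counter

def pmi_chunk_labels(sentences, threshold=0.0):
    # Estimate unigram and adjacent-bigram statistics over the corpus.
    uni, bi = Counter(), Counter()
    for sent in sentences:
        uni.update(sent)
        bi.update(zip(sent, sent[1:]))
    n_uni, n_bi = sum(uni.values()), max(sum(bi.values()), 1)

    def pmi(a, b):
        # PMI(a, b) = log p(a, b) / (p(a) p(b)); -inf for unseen pairs.
        if bi[(a, b)] == 0:
            return float("-inf")
        return math.log((bi[(a, b)] / n_bi) /
                        ((uni[a] / n_uni) * (uni[b] / n_uni)))

    # Start a new chunk ("B") wherever the PMI of an adjacent word pair
    # falls below the threshold; otherwise continue the chunk ("I").
    return [["B"] + ["I" if pmi(a, b) >= threshold else "B"
                     for a, b in zip(sent, sent[1:])]
            for sent in sentences]

print(pmi_chunk_labels([["the", "kid", "likes", "deeply", "fried", "food"]]))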
Analysis. We provide detailed analyses of our maximal left-branching chunking heuristic and our student HRNN model to better understand their contributions. We chose the CoNLL-2000 dataset as our testbed, due to constraints of time and space.

Table 3 compares the heuristics that induce chunks from parse trees. We observe that our maximal left-branching heuristic outperforms right-branching by 20 points in phrase F1. We also introduce a thresholding approach that extracts one-word and two-word chunks only, since most groundtruth chunks contain one or two words. The performance of this heuristic is higher than right-branching, but worse than our left-branching. The results are consistent with our conjecture that right-branching is a common structure of English and does not suggest meaningful chunks. On the contrary, left-branching indicates closely related words and is an effective heuristic for inducing chunks from parse trees.

Table 4 presents an ablation study on the student model. As seen, all student models outperform the teacher model, showing that the imperfection of the chunk heuristics can indeed be smoothed out by a machine learning model. However, a classic RNN or Transformer predicts chunk labels individually, which does not provide autoregressive information; their performance is worse than HRNN's even when the number of layers is controlled (Rows 4 vs. 6). The HRNN using soft gates outperforms a hard HRNN (Rows 5 vs. 6). This verifies that our soft HRNN can better handle the ambiguity of chunks and provide better autoregressive information. Building the HRNN on top of BERT is also helpful (Rows 2 vs. 6), as BERT captures global contextual information.

Conclusion
In this paper, we address a new task of syntactic structure discovery, namely, unsupervised chunking. We propose a hierarchical RNN with soft gates to learn from the chunk labels induced by a state-of-the-art unsupervised parser, the Compound PCFG. Results show that our approach largely bridges the gap between supervised and unsupervised chunking. We also present a rigorous analysis of our chunking heuristic and the student model's architecture.

A Proof of Theorem 1
Theorem 1. Given any binary parse tree, every word will belong to one and only one chunk by the maximal left-branching heuristic.

Proof.
[Existence] A single word is itself a left-branching subtree, which belongs to some maximal left-branching subtree.
[Uniqueness] We will show that two different maximal left-branching subtrees s_1 and s_2 cannot overlap. Assume by way of contradiction that there exists a word x_i in both s_1 and s_2. Then, s_1 must be a substructure of s_2 or vice versa; otherwise, there would be two distinct paths from the root to x_i, one through s_1 and one through s_2, violating the acyclic nature of a tree. But s_1 being a subtree of s_2 (or vice versa) contradicts the maximality of s_1 and s_2.
This simple theorem shows that our maximal left-branching heuristic can unambiguously assign chunk labels for any sentence with any binary parse tree.