Online Adaptor Grammars with Hybrid Inference

Adaptor grammars are a flexible, powerful formalism for defining nonparametric, unsupervised models of grammar productions. This flexibility comes at the cost of expensive inference. We address the difficulty of inference through an online algorithm which uses a hybrid of Markov chain Monte Carlo and variational inference. We show that this inference strategy improves scalability without sacrificing performance on unsupervised word segmentation and topic modeling tasks.


Introduction
Nonparametric Bayesian models are effective tools to discover latent structure in data (Müller and Quintana, 2004). These models have had great success in text analysis, especially syntax (Shindo et al., 2012). Nonparametric distributions provide support over the countably infinite, long-tailed distributions common in natural language (Goldwater et al., 2011).
We focus on adaptor grammars (Johnson et al., 2006), syntactic nonparametric models based on probabilistic context-free grammars. Adaptor grammars weaken the strong statistical independence assumptions PCFGs make (Section 2).
The weaker statistical independence assumptions that adaptor grammars make come at the cost of expensive inference. Adaptor grammars are not alone in this trade-off. For example, nonparametric extensions of topic models (Teh et al., 2006) have substantially more expensive inference than their parametric counterparts (Yao et al., 2009).
A common approach to address this computational bottleneck is through variational inference (Wainwright and Jordan, 2008). One of the advantages of variational inference is that it can be easily parallelized (Nallapati et al., 2007) or transformed into an online algorithm (Hoffman et al., 2010), which often converges in fewer iterations than batch variational inference.
Past variational inference techniques for adaptor grammars assume a preprocessing step that looks at all available data to establish the support of these nonparametric distributions (Cohen et al., 2010). Thus, these past approaches are not directly amenable to online inference.
Markov chain Monte Carlo (MCMC) inference, an alternative to variational inference, does not have this disadvantage. MCMC is easier to implement, and it discovers the support of nonparametric models during inference rather than assuming it a priori.
We apply stochastic hybrid inference (Mimno et al., 2012) to adaptor grammars to get the best of both worlds. We interleave MCMC inference inside variational inference. This preserves the scalability of variational inference while adding the sparse statistics and improved exploration MCMC provides.
Our inference algorithm for adaptor grammars starts with a variational algorithm similar to Cohen et al. (2010) and adds hybrid sampling within variational inference (Section 3). This obviates the need for expensive preprocessing and is a necessary step to create an online algorithm for adaptor grammars.
Our online extension (Section 4) processes examples in small batches taken from a stream of data. As data arrive, the algorithm dynamically extends the underlying approximate posterior distributions. This makes the algorithm flexible, scalable, and amenable to datasets that cannot be examined exhaustively because of their size (e.g., terabytes of social media data appear every second) or their nature (e.g., speech acquisition, where a language learner is limited to the bandwidth of the human perceptual system and cannot acquire data in a monolithic batch; Börschinger and Johnson, 2012).
We show our approach's scalability and effectiveness by applying our inference framework in Section 5 to two tasks: unsupervised word segmentation and infinite-vocabulary topic modeling.

Background
In this section, we review probabilistic context-free grammars and adaptor grammars.

Probabilistic Context-free Grammars
Probabilistic context-free grammars (PCFGs) define probability distributions over derivations of a context-free grammar. We define a PCFG $G$ to be a tuple $\langle W, N, R, S, \theta \rangle$: a set of terminals $W$, a set of nonterminals $N$, productions $R$, start symbol $S \in N$, and a vector of rule probabilities $\theta$.
The rules that rewrite nonterminal $c$ are $R(c)$. For a more complete description of PCFGs, see Manning and Schütze (1999). PCFGs typically use nonterminals with a syntactic interpretation. A sequence of terminals (the yield) is generated by recursively rewriting nonterminals as sequences of child symbols (each either a nonterminal or a terminal). This builds a hierarchical phrase-tree structure for every yield.
For example, a nonterminal VP represents a verb phrase, which probabilistically rewrites into a sequence of nonterminals V, N (corresponding to verb and noun) using the production rule VP → V N. Both nonterminals can be further rewritten. Each nonterminal has a multinomial distribution over expansions; for example, the multinomial for nonterminal N would rewrite as "cake" with probability $\theta_{N \to \text{cake}} = 0.03$. Rewriting terminates when the derivation has reached a terminal symbol such as "cake" (which does not rewrite).
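The generative story above can be sketched in a few lines of Python. This is only an illustration: the grammar, every probability except $\theta_{N \to \text{cake}} = 0.03$, and the function names are assumptions made for the example.

```python
import random

# Toy grammar (hypothetical): nonterminal -> list of (right-hand side, probability).
RULES = {
    "VP": [(("V", "N"), 1.0)],
    "V": [(("eat",), 0.6), (("bake",), 0.4)],
    "N": [(("cake",), 0.03), (("bread",), 0.97)],
}

def sample_tree(symbol, rng):
    """Recursively rewrite `symbol`; symbols without rules are terminals."""
    if symbol not in RULES:
        return symbol
    expansions, probs = zip(*RULES[symbol])
    rhs = rng.choices(expansions, weights=probs, k=1)[0]
    return (symbol, [sample_tree(child, rng) for child in rhs])

def tree_yield(tree):
    """Collect the terminals at the leaves, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1] for word in tree_yield(child)]

def tree_prob(tree):
    """Product of the probabilities of all rules used in the derivation."""
    if isinstance(tree, str):
        return 1.0
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    prob = dict(RULES[label])[rhs]
    for child in children:
        prob *= tree_prob(child)
    return prob

tree = sample_tree("VP", random.Random(0))
print(tree_yield(tree), tree_prob(tree))
```

Because each rewrite is drawn independently given the nonterminal, the probability of a derivation is simply the product of its rule probabilities.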
While PCFGs are used both in the supervised setting and in the unsupervised setting, in this paper we assume an unsupervised setting, in which only terminals are observed. Our goal is to predict the underlying phrase-structure tree.

Adaptor Grammars
PCFGs assume that the rewriting operations are independent given the nonterminal. This context-freeness assumption is often too strong for modeling natural language.
Adaptor grammars relax this assumption by transforming the PCFG's distribution over trees $G_c$ rooted at nonterminal $c$ into a richer distribution $H_c$ over the trees headed by nonterminal $c$, which is often referred to as the grammaton.
A Pitman-Yor adaptor grammar (PYAG) forms the adapted tree distributions $H_c$ using a Pitman-Yor process (Pitman and Yor, 1997, PY), a generalization of the Dirichlet process (Ferguson, 1973, DP). A draw $H_c \equiv (\pi_c, z_c)$ is formed by the stick-breaking process (Sudderth and Jordan, 2008, PYGEM) parametrized by scale parameter $a$, discount factor $b$, and base distribution $G_c$:
$$\pi'_{c,i} \sim \mathrm{Beta}(1 - b,\; a + i b), \quad z_{c,i} \sim G_c, \quad \pi_{c,i} \equiv \pi'_{c,i} \prod_{j=1}^{i-1} (1 - \pi'_{c,j}), \quad H_c \equiv \sum_{i} \pi_{c,i}\, \delta_{z_{c,i}}. \qquad (1)$$
Intuitively, the distribution $H_c$ is a discrete reconstruction of the atoms sampled from $G_c$; hence, it reweights $G_c$. Grammaton $H_c$ assigns non-zero stick-breaking weights $\pi$ to a countably infinite number of parse trees $z$. We describe learning these grammatons in Section 3.
More formally, a PYAG is a quintuple $A = \langle G, M, a, b, \alpha \rangle$ with: a PCFG $G$; a set of adapted nonterminals $M \subseteq N$; Pitman-Yor process parameters $a_c, b_c$ at each adaptor $c \in M$; and Dirichlet parameters $\alpha_c$ for each nonterminal $c \in N$. We also assume an order on the adapted nonterminals, $c_1, \ldots, c_{|M|}$, such that $c_j$ is not reachable from $c_i$ in a derivation if $j > i$. Algorithm 1 describes the generative process of an adaptor grammar on a set of $D$ observed sentences $x_1, \ldots, x_D$.
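A truncated draw from the stick-breaking construction can be sketched as follows. The function name, the truncation at `num_atoms`, and the toy base distribution are assumptions made for illustration; a real base distribution $G_c$ would generate subtrees rather than strings.

```python
import random

def py_stick_breaking(a, b, num_atoms, base_sampler, rng):
    """Truncated draw from a Pitman-Yor stick-breaking process.

    a: scale parameter, b: discount in [0, 1), base_sampler: draws one atom.
    Returns (weights, atoms); the last weight absorbs the left-over stick.
    """
    weights, atoms, remaining = [], [], 1.0
    for i in range(1, num_atoms + 1):
        # Stick proportion of the i-th break; fixed to 1 at the truncation point.
        proportion = 1.0 if i == num_atoms else rng.betavariate(1.0 - b, a + i * b)
        weights.append(proportion * remaining)
        atoms.append(base_sampler(rng))
        remaining *= 1.0 - proportion
    return weights, atoms

weights, atoms = py_stick_breaking(
    a=1.0, b=0.5, num_atoms=10,
    base_sampler=lambda rng: rng.choice(["cake", "bread", "tea"]),
    rng=random.Random(0),
)
print(list(zip(atoms, weights)))
```

A larger discount $b$ leaves more mass on later sticks, which produces the long-tailed weights that make the process attractive for language data.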
Given a PYAG $A$, the joint probability for a set of sentences $X$ and its collection of trees $T$ is
$$p(X, T, \pi, \theta, z \mid A) = \prod_{c \in M} p(\pi_c \mid a_c, b_c)\, p(z_c \mid G_c) \cdot \prod_{c \in N} p(\theta_c \mid \alpha_c) \cdot \prod_{x_d \in X} p(x_d, t_d \mid \theta, \pi, z),$$
where $x_d$ and $t_d$ represent the $d$th observed string and its corresponding parse. The multinomial PCFG parameter $\theta_c$ is drawn from a Dirichlet distribution at nonterminal $c \in N$. At each adapted nonterminal $c \in M$, the stick-breaking weights $\pi_c$ are drawn from a PYGEM (Equation 1). Each weight has an associated atom $z_{c,i}$ from base distribution $G_c$, a subtree rooted at $c$. The probability $p(x_d, t_d \mid \theta, \pi, z)$ is the PCFG likelihood of yield $x_d$ with parse tree $t_d$.
Adaptor grammars require a base PCFG such that it does not have recursive adapted nonterminals, i.e., there cannot be a path in a derivation from a given adapted nonterminal to a second appearance of that adapted nonterminal.

Hybrid Variational-MCMC Inference
Discovering the latent variables of the model (trees, adapted probabilities, and PCFG rules) is a problem of posterior inference given observed data. Previous approaches use MCMC (Johnson et al., 2006) or variational inference (Cohen et al., 2010).
MCMC discovers the support of nonparametric models during inference, but does not scale to larger datasets (due to tight coupling of variables). Variational inference, however, is inherently parallel and easily amenable to online inference, but requires preprocessing to discover the adapted productions. We combine the best of both worlds and propose a hybrid variational-MCMC inference algorithm for adaptor grammars.
Variational inference posits a variational distribution over the latent variables in the model; this in turn induces an "evidence lower bound" (ELBO, L) as a function of a variational distribution q, a lower bound on the marginal log-likelihood. Variational inference optimizes this objective function with respect to the parameters that define q.
In this section, we derive coordinate-ascent updates for these variational parameters. A key mathematical component is taking expectations with respect to the variational distribution $q$. We strategically use MCMC sampling to compute the expectation of $q$ over parse trees $z$. Instead of explicitly computing the variational distribution for all parameters, one can sample from it. This produces a sparse approximation of the variational distribution, which improves both scalability and performance. Sparse distributions are easier to store and transmit in implementations, which improves scalability. Mimno et al. (2012) also show that sparse representations improve performance. Moreover, because the sampled approximation can flexibly adjust its support, it is a necessary prerequisite to online inference (Section 4).

Variational Lower Bound
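We posit a mean-field variational distribution:
$$q(\pi, \theta, T \mid \gamma, \nu, \phi) = \prod_{c \in M} \prod_{i=1}^{\infty} q(\pi'_{c,i} \mid \nu^1_{c,i}, \nu^2_{c,i}) \cdot \prod_{c \in N} q(\theta_c \mid \gamma_c) \cdot \prod_{x_d \in X} q(t_d \mid \phi_d),$$
where the stick-breaking proportion $\pi'_{c,i}$ underlying weight $\pi_{c,i}$ is drawn from a variational Beta distribution parameterized by $\nu^1_{c,i}, \nu^2_{c,i}$; and $\theta_c$ is drawn from a variational Dirichlet distribution parameterized by $\gamma_c \in \mathbb{R}_+^{|R(c)|}$. Index $i$ ranges over a possibly infinite number of adapted rules. The parse for the $d$th observation, $t_d$, is modeled by a multinomial $\phi_d$, where $\phi_{d,i}$ is the probability of generating the $i$th phrase-structure tree $t_{d,i}$.
The variational distribution over latent variables induces the following ELBO on the likelihood:
$$\mathcal{L} = \mathbb{E}_q[\log p(X, T, \pi, \theta, z)] + \mathbb{H}[q(\pi, \theta, T)], \qquad (3)$$
where $\mathbb{H}[\cdot]$ is the entropy function.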
To make this lower bound tractable, we truncate the distribution over $\pi$ to a finite set (Blei and Jordan, 2005) for each adapted nonterminal $c \in M$, i.e., the stick proportion $\pi'_{c,K_c} \equiv 1$ for some index $K_c$. Because the atom weights $\pi_{c,i}$ are deterministically defined by Equation 1, this implies that $\pi_{c,i}$ is zero beyond index $K_c$. Each weight $\pi_{c,i}$ is associated with an atom $z_{c,i}$, a subtree rooted at $c$. We call the ordered set of $z_{c,i}$ the truncated nonterminal grammaton (TNG). Each adapted nonterminal $c \in M$ has its own $\mathrm{TNG}_c$. The $i$th subtree in $\mathrm{TNG}_c$ is denoted $\mathrm{TNG}_c(i)$.
In the rest of this section, we describe approximate inference to maximize $\mathcal{L}$. The most important update is $\phi_{d,i}$, which we update using stochastic MCMC inference (Section 3.2). Past variational approaches for adaptor grammars (Cohen et al., 2010) rely on a preprocessing step and heuristics to define a static TNG. In contrast, our model dynamically discovers trees. The TNG grows as the model sees more data, allowing online updates (Section 4).
The remaining variational parameters are optimized using expected counts of adaptor grammar rules. These expected counts are described in Section 3.3, and the variational updates for the variational parameters excluding $\phi_{d,i}$ are described in Section 3.4.

Stochastic MCMC Inference
Each observation $x_d$ has an associated variational multinomial distribution $\phi_d$ over trees $t_d$ that can yield observation $x_d$ with probability $\phi_{d,i}$. Holding all other variational parameters fixed, the coordinate-ascent update (Mimno et al., 2012; Bishop, 2006) for $\phi_{d,i}$ is
$$\phi_{d,i} \propto \exp\left\{ \mathbb{E}_q^{\neg \phi_d}\left[ \log p(t_{d,i} \mid x_d, \pi, \theta, z) \right] \right\}, \qquad (4)$$
where $\phi_{d,i}$ is the probability of generating the $i$th phrase-structure tree $t_{d,i}$ and $\mathbb{E}_q^{\neg \phi_d}[\cdot]$ is the expectation with respect to the variational distribution $q$, excluding the value of $\phi_d$.
Instead of computing this expectation explicitly, we turn to stochastic variational inference (Mimno et al., 2012; Hoffman et al., 2013) to sample from this distribution. This produces a set of sampled trees $\sigma_d \equiv \{\sigma_{d,1}, \ldots, \sigma_{d,k}\}$. From this set of trees we can approximate our variational distribution over trees $\phi$ using the empirical distribution of $\sigma_d$, i.e.,
$$\phi_{d,i} \propto \sum_{j=1}^{k} \mathbb{I}[\sigma_{d,j} = t_{d,i}]. \qquad (5)$$
This leads to a sparse approximation of variational distribution $\phi$. (In our experiments, we use ten samples.)
Previous inference strategies (Johnson et al., 2006; Börschinger and Johnson, 2012) for adaptor grammars have used sampling. The adaptor grammar inference methods use an approximate PCFG to emulate the marginalized Pitman-Yor distributions at each nonterminal. Given this approximate PCFG, we can then sample a derivation $z$ for string $x$ from the possible trees (Johnson et al., 2007).
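The empirical approximation of $\phi_d$ from a handful of sampled trees can be sketched as below. Representing trees as bracketed strings is an assumption made here so that identical trees compare equal; any hashable tree encoding would do.

```python
from collections import Counter

def empirical_phi(sampled_trees):
    """Map each distinct sampled tree to its relative frequency among the samples."""
    counts = Counter(sampled_trees)
    total = sum(counts.values())
    return {tree: n / total for tree, n in counts.items()}

# Three hypothetical samples for one observation; two of them coincide.
samples = ["(Word (Chars b a))", "(Word (Chars b a))", "(Word (Chars b) (Chars a))"]
phi_d = empirical_phi(samples)
print(phi_d)
```

Only trees that were actually sampled receive an entry, which is what makes the resulting representation sparse.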
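Sampling requires a derived PCFG $G'$ that approximates the distribution over tree derivations conditioned on a yield. It includes the original PCFG rules $R = \{c \to \beta\}$ that define the base distribution and the new adapted productions $R' = \{c \Rightarrow z : z \in \mathrm{TNG}_c\}$. Under $G'$, the probability $\theta'$ of adapted production $c \Rightarrow z$ is
$$\log \theta'_{c \Rightarrow z} = \begin{cases} \mathbb{E}_q[\log \pi_{c,i}], & \text{if } \mathrm{TNG}_c(i) = z, \\ \mathbb{E}_q[\log \pi_{c,K_c}] + \mathbb{E}_q[\log \theta_{c \Rightarrow z}], & \text{otherwise,} \end{cases} \qquad (6)$$
where $K_c$ is the truncation level of $\mathrm{TNG}_c$ and $\pi_{c,K_c}$ represents the left-over stick weights in the stick-breaking process for adaptor $c \in M$. $\theta_{c \Rightarrow z}$ represents the probability of generating tree $c \Rightarrow z$ under the base distribution. See also Cohen (2011).
The expectation of the log Pitman-Yor multinomial weight $\pi_{a,i}$ under the truncated variational stick-breaking distribution is
$$\mathbb{E}_q[\log \pi_{a,i}] = \Psi(\nu^1_{a,i}) - \Psi(\nu^1_{a,i} + \nu^2_{a,i}) + \sum_{j=1}^{i-1} \left( \Psi(\nu^2_{a,j}) - \Psi(\nu^1_{a,j} + \nu^2_{a,j}) \right), \qquad (7)$$
and the expectation of the log probability of generating the phrase-structure tree $a \Rightarrow z$ based on PCFG productions under the variational Dirichlet distribution is
$$\mathbb{E}_q[\log \theta_{a \Rightarrow z}] = \sum_{c \to \beta \in a \Rightarrow z} \left( \Psi(\gamma_{c \to \beta}) - \Psi\Big( \sum_{c \to \beta' \in R(c)} \gamma_{c \to \beta'} \Big) \right), \qquad (8)$$
where $\Psi(\cdot)$ is the digamma function, and $c \to \beta \in a \Rightarrow z$ represents all PCFG productions in the phrase-structure tree $a \Rightarrow z$.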
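Both expectations reduce to digamma differences, which is standard for Beta and Dirichlet variational distributions. The sketch below computes them with a small self-contained digamma; the function names, the hand-rolled digamma approximation, and the treatment of the final index as the truncation point (proportion fixed to 1) are assumptions of this example.

```python
import math

def digamma(x):
    """Digamma for x > 0: recurrence psi(x) = psi(x + 1) - 1/x, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def expected_log_stick_weights(nu1, nu2):
    """E_q[log pi_i] when the i-th stick proportion is Beta(nu1[i], nu2[i]).

    The last index is the truncation point, whose proportion equals 1.
    """
    out, log_rest = [], 0.0  # log_rest sums E[log(1 - proportion_j)] over j < i
    last = len(nu1) - 1
    for i, (p, q) in enumerate(zip(nu1, nu2)):
        if i == last:
            out.append(log_rest)  # E[log proportion] is 0 at the truncation point
        else:
            out.append(digamma(p) - digamma(p + q) + log_rest)
            log_rest += digamma(q) - digamma(p + q)
    return out

def expected_log_rule_probs(gamma):
    """E_q[log theta_r] for one nonterminal with Dirichlet parameters gamma[r]."""
    total = digamma(sum(gamma.values()))
    return {rule: digamma(g) - total for rule, g in gamma.items()}

print(expected_log_stick_weights([1.0, 2.0, 1.0], [1.0, 1.0, 1.0]))
print(expected_log_rule_probs({"N -> cake": 1.0, "N -> bread": 3.0}))
```

Summing the per-rule values over the productions of a subtree gives the expected log probability of that subtree under the base distribution.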
This PCFG can compose arbitrary subtrees and thus discover new trees that better describe the data, even if those trees are not part of the TNG. This is equivalent to creating a "new table" in MCMC inference and provides truncation-free variational updates (Wang and Blei, 2012) by sampling an unseen subtree with adapted nonterminal $c \in M$ at the root. This frees our model from the preprocessing that Cohen et al. (2010) use to initialize truncated grammatons. This stochastic approach has the advantage of creating sparse distributions (Wang and Blei, 2012): few unique trees will be represented.
Figure 1: Given an adaptor grammar, we sample derivations given an approximate PCFG and show how these affect counts.
The sampled derivations can be understood via the Chinese restaurant metaphor (Johnson et al., 2006). Existing cached rules (elements in the TNG) can be thought of as occupied tables; this happens in the case of the yield "ba", which increases counts for unadapted rules $g$ and for entries in $\mathrm{TNG}_A$, $f$. For the yield "ca", there is no appropriate entry in the TNG, so it must use the base distribution, which corresponds to sitting at a new table. This generates counts for $g$, as it uses the unadapted rule, and for $h$, which represents entries that could be included in the TNG in the future. The final yield, "ab", shows that even when compatible entries are in the TNG, it might still create a new table, changing the underlying base distribution.
Parallelization As noted in Cohen et al. (2010), the inside-outside algorithm dominates the runtime of every iteration, both for sampling and variational inference. However, unlike MCMC, variational inference is highly parallelizable and requires fewer synchronizations per iteration (Zhai et al., 2012). In our approach, both the inside algorithm and the sampling process can be distributed, and the resulting counts can be aggregated afterwards. In our implementation, we use multiple threads to parallelize tree sampling.

Calculating Expected Rule Counts
For every observation $x_d$, the hybrid approach produces a set of sampled trees, each of which contains three types of productions: adapted rules, original PCFG rules, and potentially adapted rules. The last set is most important, as these are new rules discovered by the sampler. These are explained using the Chinese restaurant metaphor in Figure 1. The multiset of all adapted productions is $M(t_{d,i})$ and the multiset of non-adapted productions that generate tree $t_{d,i}$ is $N(t_{d,i})$. Writing $\#(r, S)$ for the number of times production $r$ occurs in $S$, we compute three counts:
1: $f$ is the expected number of productions within the TNG. It is the sum over the probability of a tree $t_{d,k}$ times the number of times an adapted production appeared in $t_{d,k}$:
$$f_d(a \Rightarrow z_{a,i}) = \sum_k \phi_{d,k}\, \#\big(a \Rightarrow z_{a,i},\, M(t_{d,k})\big).$$
2: $g$ is the expected count of the PCFG productions $R$ that define the base distribution of the adaptor grammar:
$$g_d(a \to \beta) = \sum_k \phi_{d,k}\, \#\big(a \to \beta,\, N(t_{d,k})\big).$$
3: Finally, a third set of productions are newly discovered by the sampler and not in the TNG. These subtrees are rules that could be adapted:
$$h_d(c \Rightarrow z) = \sum_k \phi_{d,k}\, \#\big(c \Rightarrow z,\, t_{d,k}\big), \quad z \notin \mathrm{TNG}_c.$$
These subtrees (lists of PCFG rules sampled from Equation 6) correspond to adapted productions not yet present in the TNG.
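The three counts can be accumulated in a single pass over the sampled trees, as in the sketch below. The data layout (each tree given as a pair of production lists, productions written as strings) and the function name are assumptions chosen for brevity.

```python
from collections import Counter

def expected_counts(phi, trees, tng):
    """Accumulate f (in-TNG adapted), g (base PCFG), and h (candidate) counts.

    phi[k]: weight of sampled tree k.
    trees[k]: (adapted_productions, pcfg_rules) used by tree k.
    tng: set of adapted productions currently in the TNG.
    """
    f, g, h = Counter(), Counter(), Counter()
    for k, weight in phi.items():
        adapted, base_rules = trees[k]
        for production in adapted:
            # Productions already cached count toward f; new ones are candidates (h).
            (f if production in tng else h)[production] += weight
        for rule in base_rules:
            g[rule] += weight
    return f, g, h

phi = {0: 0.5, 1: 0.5}
trees = {
    0: (["A => b a"], ["B -> b"]),
    1: (["A => c a"], ["B -> c", "A -> B a"]),
}
f, g, h = expected_counts(phi, trees, tng={"A => b a"})
print(dict(f), dict(g), dict(h))
```

In this toy run the first tree reuses a cached subtree, while the second contributes a candidate production together with the base rules that built it.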

Variational Updates
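Given the sparse vectors $\phi$ sampled from the hybrid MCMC step, we update all variational parameters as
$$\gamma_{a \to \beta} = \alpha_{a \to \beta} + \sum_{x_d \in X} g_d(a \to \beta) + \sum_{c \in M} \sum_{k=1}^{K_c} n(a \to \beta, z_{c,k}),$$
$$\nu^1_{a,i} = 1 - b_a + \sum_{x_d \in X} f_d(a \Rightarrow z_{a,i}) + \sum_{c \in M} \sum_{k=1}^{K_c} n(a \Rightarrow z_{a,i}, z_{c,k}),$$
$$\nu^2_{a,i} = a_a + i\, b_a + \sum_{x_d \in X} \sum_{j=i+1}^{K_a} f_d(a \Rightarrow z_{a,j}) + \sum_{c \in M} \sum_{k=1}^{K_c} \sum_{j=i+1}^{K_a} n(a \Rightarrow z_{a,j}, z_{c,k}),$$
where $n(r, t)$ is the expected number of times production $r$ is in tree $t$, estimated during sampling.
Hyperparameter Update We update our PCFG hyperparameter $\alpha$ and PYGEM hyperparameters $a$ and $b$ as in Cohen et al. (2010).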

Online Variational Inference
Online inference for probabilistic models requires us to update our posterior distribution as new observations arrive. Unlike batch inference algorithms, we do not assume we always have access to the entire dataset. Instead, we assume that observations arrive in small groups called minibatches. The advantage of online inference is threefold: a) it does not require retaining the whole dataset in memory; b) each online update is fast; and c) the model usually converges faster. All of these make adaptor grammars scalable to larger datasets.
Our approach is based on the stochastic variational inference for topic models (Hoffman et al., 2013). This inference strategy uses a form of stochastic gradient descent (Bottou, 1998): using the gradient of the ELBO, it finds the sufficient statistics necessary to update variational parameters (which are mostly expected counts calculated using the inside-outside algorithm), and interpolates the result with the current model.
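We assume data arrive in minibatches $B$ (a set of sentences). We accumulate expected counts
$$\tilde f^{(l)}(a \Rightarrow z_{a,i}) = (1 - \epsilon)\, \tilde f^{(l-1)}(a \Rightarrow z_{a,i}) + \epsilon \cdot \frac{|X|}{|B_l|} \sum_{x_d \in B_l} f_d(a \Rightarrow z_{a,i}),$$
$$\tilde g^{(l)}(a \to \beta) = (1 - \epsilon)\, \tilde g^{(l-1)}(a \to \beta) + \epsilon \cdot \frac{|X|}{|B_l|} \sum_{x_d \in B_l} g_d(a \to \beta),$$
with decay factor $\epsilon \in (0, 1)$ to guarantee convergence. We set it to $\epsilon = (\tau + l)^{-\kappa}$, where $l$ is the minibatch counter. The decay inertia $\tau$ prevents premature convergence, and decay rate $\kappa$ controls the speed of change in sufficient statistics (Hoffman et al., 2010). We recover the batch variational approach when $B = D$ and $\kappa = 0$. The variables $\tilde f^{(l)}$ and $\tilde g^{(l)}$ are accumulated sufficient statistics of adapted and unadapted productions after processing minibatch $B_l$. They update the approximate gradient. The updates for variational parameters become
$$\nu^1_{a,i} = 1 - b_a + \tilde f^{(l)}(a \Rightarrow z_{a,i}), \qquad (9)$$
$$\nu^2_{a,i} = a_a + i\, b_a + \sum_{j=i+1}^{K_a} \tilde f^{(l)}(a \Rightarrow z_{a,j}), \qquad (10)$$
$$\gamma_{a \to \beta} = \alpha_{a \to \beta} + \tilde g^{(l)}(a \to \beta), \qquad (11)$$
where $K_a$ is the size of the TNG at adaptor $a \in M$.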
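The decay schedule and the interpolation of sufficient statistics can be sketched as follows. The function names are illustrative, and the `scale` argument (the ratio of dataset size to minibatch size, as is customary in stochastic variational inference) is an assumption of this sketch rather than something stated in the text.

```python
def step_size(l, tau, kappa):
    """Decay factor epsilon = (tau + l) ** (-kappa) for minibatch counter l."""
    return (tau + l) ** (-kappa)

def interpolate(accumulated, minibatch_counts, epsilon, scale):
    """Blend statistics: (1 - epsilon) * old + epsilon * scale * new, key by key."""
    keys = set(accumulated) | set(minibatch_counts)
    return {
        key: (1.0 - epsilon) * accumulated.get(key, 0.0)
        + epsilon * scale * minibatch_counts.get(key, 0.0)
        for key in keys
    }

# Hypothetical stream: 1,000 sentences seen in minibatches of 20.
stats = {}
for l, counts in enumerate([{"Word => b a": 2.0}, {"Word => c a": 1.0}], start=1):
    eps = step_size(l, tau=128, kappa=0.6)
    stats = interpolate(stats, counts, eps, scale=1000 / 20)
print(stats)
```

With `kappa = 0` the step size is 1, so a single minibatch covering the whole dataset (scale 1) reproduces the batch statistics.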

Refining the Truncation
As we observe more data during inference, our TNGs need to change. New rules should be added, useless rules should be removed, and derivations for existing rules should be updated. In this section, we describe heuristics for performing each of these operations.
Adding Productions Sampling can identify productions that are not adapted but were instead drawn from the base distribution. These are candidates for the TNG. For every nonterminal $a$, we add these potentially adapted productions to $\mathrm{TNG}_a$ after each minibatch. The count associated with candidate productions is now associated with an adapted production, i.e., the $h$ count contributes to the relevant $f$ count. This mechanism dynamically expands $\mathrm{TNG}_a$.
Sorting and Removing Productions Our model does not require a preprocessing step to initialize the TNGs; rather, it constructs and expands all TNGs on the fly. To prevent the TNGs from growing unwieldy, we prune each TNG after every $u$ minibatches. As a result, we need to impose an ordering over all the parse trees in the TNG. The underlying PYGEM distribution implicitly places a ranking over all the atoms according to their corresponding sufficient statistics (Kurihara et al., 2007), as shown in Equation 9. It measures the "usefulness" of every adapted production throughout the inference process. In addition to accumulated sufficient statistics, Cohen et al. (2010) add a secondary term to discourage short constituents (Mochihashi et al., 2009). We impose a reward term for longer phrases in addition to $\tilde f$ and sort all adapted productions in $\mathrm{TNG}_a$ using a ranking score that depends on $\tilde f$, the decay factor $\epsilon$, and $|s|$, where $|s|$ is the number of yields in production $a \Rightarrow z_{a,i}$. Because $\epsilon$ decreases each minibatch, the reward for long phrases diminishes. This is similar to an annealed version of Cohen et al. (2010), where the reward for long phrases is fixed; see also Mochihashi et al. (2009). After sorting, we remove all but the top $K_a$ adapted productions.
Rederiving Adapted Productions Johnson and Goldwater (2009) observe that adapted productions can be cached with derivations that do not explain their yield well. They propose table label resampling to rederive yields. In our approach this is equivalent to "mutating" some derivations in a TNG. After pruning rules every $u$ minibatches, we perform table label resampling for adapted nonterminals from general to specific (i.e., a topological sort). This provides better expected counts $n(r, \cdot)$ for rules used in phrase-structure subtrees. Empirically, we find table label resampling only marginally improves the word segmentation result.
Initialization Our inference begins with random variational Dirichlets and empty TNGs, which obviates the preprocessing step in Cohen et al. (2010). Our model constructs and expands all TNGs on the fly. It mimics the incremental initialization of Johnson and Goldwater (2009). Algorithm 2 summarizes the pseudo-code of our online approach.

Complexity
Inside and outside calls dominate execution time for adaptor grammar inference. Variational approaches run the inside-outside algorithm and estimate the expected counts for every possible tree derivation (Cohen et al., 2010). For a dataset with $D$ observations, variational inference requires $O(DI)$ calls to the inside-outside algorithm, where $I$ is the number of iterations, typically in the tens.
In contrast, MCMC only needs to accumulate inside probabilities, and then sample a tree derivation (Chappelier and Rajman, 2000). The sampling step is negligible in processing time compared to the inside algorithm. MCMC inference requires $O(DI)$ calls to the inside algorithm (hence every iteration is much faster than in the variational approach), but $I$ is usually on the order of thousands.
Likewise, our hybrid approach only needs the less expensive inside algorithm to sample trees. While each iteration is less expensive, our approach can also achieve reasonable results with only a single pass through the data, and thus requires only $O(D)$ calls to the inside algorithm.
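The inside-then-sample step that both MCMC and the hybrid approach rely on can be sketched for a toy grammar in Chomsky normal form. The grammar, its probabilities, and the function names are illustrative assumptions; an adaptor grammar sampler would run on the derived PCFG rather than a hand-written one.

```python
import random
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form.
BINARY = {("S", ("A", "B")): 0.5, ("S", ("B", "A")): 0.5}
LEXICAL = {("A", "a"): 1.0, ("B", "b"): 0.7, ("B", "a"): 0.3}

def inside(words):
    """chart[i, j][X] is the probability that nonterminal X yields words[i:j]."""
    n = len(words)
    chart = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(words):
        for (parent, terminal), p in LEXICAL.items():
            if terminal == word:
                chart[i, i + 1][parent] += p
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (parent, (left, right)), p in BINARY.items():
                    chart[i, j][parent] += p * chart[i, k][left] * chart[k, j][right]
    return chart

def sample_tree(chart, words, symbol, i, j, rng):
    """Sample a subtree for `symbol` over words[i:j] in proportion to inside mass.

    Assumes chart[i, j][symbol] > 0, i.e. the span is derivable from `symbol`.
    """
    if j - i == 1:
        return (symbol, words[i])
    options, weights = [], []
    for k in range(i + 1, j):
        for (parent, (left, right)), p in BINARY.items():
            if parent == symbol:
                w = p * chart[i, k][left] * chart[k, j][right]
                if w > 0:
                    options.append((k, left, right))
                    weights.append(w)
    k, left, right = rng.choices(options, weights=weights, k=1)[0]
    return (symbol,
            sample_tree(chart, words, left, i, k, rng),
            sample_tree(chart, words, right, k, j, rng))

words = ["a", "b"]
chart = inside(words)
print(chart[0, len(words)]["S"])
print(sample_tree(chart, words, "S", 0, len(words), random.Random(0)))
```

The chart is filled once per sentence; drawing additional trees from the same chart only repeats the cheap top-down pass, which is why the sampling step is negligible next to the inside computation.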
Because the inside-outside algorithm is fundamental to each of these algorithms, we use it as a common basis for comparison across different implementations. This is over-generous to variational approaches, as the full inside-outside computation is more expensive than the inside probability computation required for sampling in MCMC and our hybrid approach.
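As a rough worked example of these orders of growth, the snippet below counts calls for a corpus of 10k sentences. The iteration counts (50 for variational inference, 2,000 for MCMC) are assumed round numbers consistent with "tens" and "thousands", not measured values.

```python
def num_calls(num_observations, num_iterations):
    """Calls to the inside (or inside-outside) routine under O(D * I) inference."""
    return num_observations * num_iterations

D = 10_000  # hypothetical corpus size
calls = {
    "variational (inside-outside)": num_calls(D, 50),  # assumed: tens of iterations
    "mcmc (inside)": num_calls(D, 2_000),              # assumed: thousands of iterations
    "online hybrid (inside)": num_calls(D, 1),         # a single pass over the data
}
for name, count in calls.items():
    print(f"{name}: {count:,}")
```

Under these assumptions the single-pass hybrid needs orders of magnitude fewer calls, and each of its calls is the cheaper inside computation.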

Experiments and Discussion
We implement our online adaptor grammar model (ONLINE) in Python and compare it against both MCMC (Johnson and Goldwater, 2009, MCMC) and variational inference (Cohen et al., 2010, VARIATIONAL). We use the latest implementation of the MCMC sampler for adaptor grammars and simulate the variational approach using our implementation. For the MCMC approach, we use the best settings reported in Johnson and Goldwater (2009) Table 1.
Footnote 6: Our ONLINE settings are batch size B = 20, decay inertia τ = 128, decay rate κ = 0.6 for the unigram grammar; and minibatch size B = 5, decay inertia τ = 256, decay rate κ = 0.8 for the collocation grammar. TNGs are refined at interval u = 50. Truncation size is set to K_Word = 1.5k and K_Colloc = 3k. The settings are chosen by cross validation. We observe similar behavior under κ = {0.7, 0.9, 1.0}, τ = {32, 64, 512}, B = {10, 50} and u = {10, 20, 100}.
Footnote 7: For ONLINE inference, we parallelize each minibatch with four threads with settings: batch size B = 100 and TNG refinement interval u = 100. The ONLINE approach runs for two passes over the datasets. VARIATIONAL runs fifty iterations, with the same truncation level as in ONLINE. For negative log-likelihood evaluation, we train the model on a random 70% of the data, and hold out the rest for testing. We observe similar behavior for

Word Segmentation
We evaluate our online adaptor grammar on the task of word segmentation, which focuses on identifying word boundaries from a sequence of characters. This task is especially relevant for Chinese, since characters are written in sequence without word boundaries.
We first evaluate all three models on the standard Brent version of the Bernstein-Ratner corpus (Bernstein-Ratner, 1987; Brent and Cartwright, 1996, brent). The dataset contains 10k sentences, 1.3k distinct words, and 72 distinct characters. We compare the results on both unigram and collocation grammars introduced in Johnson and Goldwater (2009) as listed in Table 1. Figure 2 illustrates the word segmentation accuracy in terms of word token F1-scores on brent against the number of inside-outside function calls for all three approaches using unigram and collocation grammars. In both cases, our ONLINE approach converges faster than the MCMC and VARIATIONAL approaches, yet yields comparable or better performance when seeing more data.
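Word token F1 can be computed by comparing character spans, as in the sketch below: a predicted token counts as correct only when both of its boundaries match a gold token. The function names and the toy sentence are assumptions for illustration and do not reproduce the evaluation script behind the reported numbers.

```python
def spans(words):
    """Convert a segmentation into a set of (start, end) character spans."""
    out, start = set(), 0
    for word in words:
        out.add((start, start + len(word)))
        start += len(word)
    return out

def token_f1(gold_sentences, predicted_sentences):
    """Micro-averaged word token F1 over aligned lists of segmented sentences."""
    correct = gold_total = pred_total = 0
    for gold, pred in zip(gold_sentences, predicted_sentences):
        gold_spans, pred_spans = spans(gold), spans(pred)
        correct += len(gold_spans & pred_spans)
        gold_total += len(gold_spans)
        pred_total += len(pred_spans)
    if correct == 0:
        return 0.0
    precision, recall = correct / pred_total, correct / gold_total
    return 2 * precision * recall / (precision + recall)

gold = [["you", "want", "to"]]
predicted = [["you", "wantto"]]
print(token_f1(gold, predicted))
```

Here only "you" is recovered exactly, so precision is 1/2 and recall is 1/3; undersegmentation is penalized through recall and oversegmentation through precision.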
In addition to the brent corpus, we also evaluate the three approaches on three Chinese datasets compiled by Xue et al. (2005) and Emerson (2005), including:
• Peking University (pku): 183k sentences, 53k distinct words, and 4.6k distinct characters; and
• City University of Hong Kong (cityu): 207k sentences, 64k distinct words, and 5k distinct characters.
We compare our inference method against other approaches on F1 score. While other unsupervised word segmentation systems are available (Mochihashi et al. (2009), inter alia), our focus is on a direct comparison of inference techniques for adaptor grammars, which achieve competitive (if not state-of-the-art) performance. Table 2 shows the word token F1-scores and negative log-likelihood on the held-out test dataset of our model against MCMC and VARIATIONAL. We randomly sample 30% of the data for testing and use the rest for training. We compute the held-out likelihood of the most likely sampled parse trees out of each model. Our ONLINE approach consistently segments words better than VARIATIONAL and achieves comparable or better results than MCMC. For MCMC, Johnson and Goldwater (2009) show that incremental initialization (or online updates in general) results in more accurate word segmentation, even though the trees have lower posterior probability. Similarly, our ONLINE approach initializes and learns the grammatons and parse trees on the fly, instead of initializing them for all data upfront as for VARIATIONAL. This uniformly outperforms batch initialization on the word segmentation tasks.

Infinite Vocabulary Topic Modeling
Topic models often can be replicated using a carefully crafted PCFG (Johnson, 2010). These powerful extensions can capture topical collocations and sticky topics; these embellishments could further improve NLP applications of simple unigram topic models such as word sense disambiguation (Boyd-Graber and Blei, 2007), part of speech tagging (Toutanova and Johnson, 2008) or dialogue modeling (Zhai and Williams, 2014). However, expressing topic models in adaptor grammars is much slower than traditional topic models, for which fast online inference (Hoffman et al., 2010) is available.
Zhai and Boyd-Graber (2013) argue that online inference for topic models rests on a fixed-vocabulary assumption that streaming data violate: new words are introduced as more data are streamed to the algorithm. Zhai and Boyd-Graber (2013) introduce an inference framework, INFVOC, to discover words from a Dirichlet process with a character n-gram base distribution.
[Figure caption, partially recovered: "... Table 1. The horizontal axis shows the number of passes over the entire dataset."]
We show that their complicated model and online inference can be captured and extended via an appropriate PCFG grammar and our online adaptor grammar inference algorithm. Our extension to INFVOC generalizes their static character n-gram model, learning the base distribution (i.e., how words are composed from characters) from data. In contrast, their base distribution was learned from a dictionary as a preprocessing step and held fixed. This is an attractive testbed for our online inference. Within a topic, we can verify that the words we discover are relevant to the topic and that new words rise in importance in the topic over time if they are relevant. For these experiments, we treat each token (with its associated document pseudo-word −j) as a single sentence, and each minibatch contains only one sentence (token).
Figure 4: The evolution of one topic (concerning tax policy) out of five topics learned using online adaptor grammar inference on the de-news dataset. Each minibatch represents a word processed by this online algorithm; time progresses from left to right. As the algorithm encounters new words (bottom) they can make their way into the topic. The numbers next to words represent their overall rank in the topic. For example, the word "pension" first appeared in minibatch 100, was ranked at 229 after minibatch 400 and became one of the top 10 words in this topic after 2000 minibatches (tokens).
Quantitatively, we evaluate three different inference schemes and the INFVOC approach on a collection of English daily news snippets (de-news). We used the InfVoc LDA grammar (Table 1). For all approaches, we train the model with five topics, and evaluate topic coherence (Newman et al., 2009), which correlates well with human ratings of topic interpretability (Chang et al., 2009). We collect the co-occurrence counts from Wikipedia and compute the average pairwise pointwise mutual information (PMI) score between the top 10 ranked words of every topic.
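The coherence score described above can be sketched as an average of pairwise PMI values over a topic's top words. The document-level counting scheme, the choice to skip pairs that never co-occur, and all names and counts below are assumptions for illustration; the exact estimator used for the reported scores is not specified here.

```python
import math
from itertools import combinations

def average_pmi(top_words, doc_freq, co_doc_freq, num_docs):
    """Mean PMI over all pairs of top words, from document co-occurrence counts."""
    scores = []
    for w1, w2 in combinations(top_words, 2):
        joint = co_doc_freq.get(frozenset((w1, w2)), 0)
        if joint == 0:
            continue  # assumption: skip pairs that never co-occur instead of smoothing
        p_joint = joint / num_docs
        p1, p2 = doc_freq[w1] / num_docs, doc_freq[w2] / num_docs
        scores.append(math.log(p_joint / (p1 * p2)))
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical reference-corpus counts for three top words of one topic.
doc_freq = {"tax": 10, "pension": 10, "year": 50}
co_doc_freq = {frozenset(("tax", "pension")): 5, frozenset(("tax", "year")): 5}
print(average_pmi(["tax", "pension", "year"], doc_freq, co_doc_freq, num_docs=100))
```

Pairs that co-occur more often than their individual frequencies predict receive positive PMI, so topics whose top words belong together score higher.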
Figure 3 illustrates the PMI scores of the approaches. Our approach yields comparable or better results against all other approaches under most conditions. Qualitatively, Figure 4 shows an example of a topic evolution using online adaptor grammar for the de-news dataset. The topic is about "tax policy". The topic improves over time; words like "year", "tax" and "minist(er)" become more prominent. More importantly, the online approach discovers new words and incorporates them into the topic. For example, "schroeder" (former German chancellor) first appeared in minibatch 300, was successfully picked up by our model, and became one of the top ranked words in the topic.
Footnote 13: INFVOC is available at http://www.umiacs.umd.edu/~zhaike/.
Footnote 14: The de-news dataset is a randomly selected subset of 2.2k English documents from http://homepages.inf.ed.ac.uk/pkoehn/publications/de-news/. It contains 6.5k unique types and over 200k word tokens. Tokenization and stemming are provided by NLTK (Bird et al., 2009).

Conclusion
Probabilistic modeling is a useful tool in understanding unstructured data or data where the structure is latent, like language. However, developing these models is often a difficult process, requiring significant machine learning expertise.
Adaptor grammars offer a flexible and quick way to prototype and test new models. Despite expensive inference, they have been used for topic modeling (Johnson, 2010), discovering perspective (Hardisty et al., 2010), segmentation (Johnson and Goldwater, 2009), and grammar induction (Cohen et al., 2010).
We have presented a new online, hybrid inference scheme for adaptor grammars. Unlike previous approaches, it does not require extensive preprocessing. It also discovers useful structure in text faster; with further development, these algorithms could further speed the development and application of new nonparametric models to large datasets.