LR Parsing for LCFRS

LR parsing is a popular parsing strategy for variants of Context-Free Grammar (CFG). It has also been used for mildly context-sensitive formalisms, such as Tree-Adjoining Grammar. In this paper, we present the first LR-style parsing algorithm for Linear Context-Free Rewriting Systems (LCFRS), a mildly context-sensitive extension of CFG which has received considerable attention in recent years.


Introduction
LR parsing is an incremental shift-reduce parsing strategy in which the transitions between parser states are guided by an automaton which is compiled offline. LR parsers were first introduced for deterministic context-free languages (Knuth, 1965) and later generalized to context-free languages (Tomita, 1984) and tree-adjoining languages (Nederhof, 1998; Prolo, 2003).
Linear Context-Free Rewriting System (LCFRS) (Vijay-Shanker et al., 1987) is an immediate extension of CFG in which each non-terminal can cover more than one continuous span of the input string. LCFRS and equivalent formalisms have been used for the modeling of discontinuous constituents (Maier and Lichte, 2011) and non-projective dependencies (Kuhlmann, 2013), as well as for data-driven parsing of such structures (Maier and Kallmeyer, 2010; Kallmeyer and Maier, 2013; van Cranenburgh, 2012; Angelov and Ljunglöf, 2014). They have also been used for modeling non-concatenative morphology (Botha and Blunsom, 2013), for grammar engineering (Ranta, 2011), and for modeling alignments in machine translation (Søgaard, 2008; Kaeshammer, 2013). To our knowledge, no LR strategy for LCFRS has been presented in the literature so far. In this paper, we present an LR-style parser for LCFRS. It is based on the incremental parsing strategy implemented by Thread Automata (Villemonte de la Clergerie, 2002).
The remainder of the article is structured as follows. In the following section, we introduce LCFRS and thread automata. Section 3 presents the algorithm by means of an example. In particular, section 3.2 gives the algorithms for the construction of the automaton and the parse table, and section 3.3 presents the parsing algorithm. Section 4 concludes the article.

LCFRS
In this paper, we restrict ourselves to string rewriting LCFRS and omit the more general definition (Weir, 1988).
In LCFRS, a single non-terminal can span k ≥ 1 continuous blocks of a string. A CFG is simply a special case of an LCFRS in which k = 1. k is called the fan-out of the non-terminal. We notate LCFRS with the syntax of Simple Range Concatenation Grammars (SRCG) (Boullier, 1998), a formalism equivalent to LCFRS.
An LCFRS (Vijay-Shanker et al., 1987; Seki et al., 1991) is a tuple $G = (N, T, V, P, S)$ where $N$ is a finite set of non-terminals with a function $\mathit{dim}\colon N \to \mathbb{N}$ determining the fan-out of each $A \in N$; $T$ and $V$ are disjoint finite sets of terminals and variables; $S \in N$ is the start symbol with $\mathit{dim}(S) = 1$.
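$P$ is a finite set of rewriting rules with rank $m \geq 0$. All $\gamma \in P$ have the form
$$A(\alpha_0, \ldots, \alpha_{\mathit{dim}(A)-1}) \to A_1(X^{(1)}_0, \ldots, X^{(1)}_{\mathit{dim}(A_1)-1}) \cdots A_m(X^{(m)}_0, \ldots, X^{(m)}_{\mathit{dim}(A_m)-1})$$
where $A, A_1, \ldots, A_m \in N$, $X^{(l)}_j \in V$ for $1 \leq l \leq m$ and $0 \leq j < \mathit{dim}(A_l)$, and $\alpha_i \in (V \cup T)^*$ for $0 \leq i < \mathit{dim}(A)$. The $\alpha_i$ and $X^{(l)}_j$ are called arguments (or sometimes components); the elements in $\alpha_i$ are called argument elements. $A_\gamma$ is the set of all argument elements of $\gamma$. Variable occurrences in the arguments of the non-terminals of $\gamma$ are ordered by a strict total order $\prec$. For all $X_1, X_2 \in V$ occurring in arguments of a non-terminal of $\gamma$, it holds that $X_1 \prec X_2$ iff either $X_1$ precedes $X_2$ in an argument of the non-terminal, or the argument in which $X_1$ occurs precedes the argument in which $X_2$ occurs.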
For all $\gamma \in P$, every variable $X$ occurring in $\gamma$ occurs exactly once in the left-hand side (LHS) and exactly once in the right-hand side (RHS). Furthermore, if for two variables $X_1, X_2 \in V$ it holds that $X_1 \prec X_2$ on the RHS, then also $X_1 \prec X_2$ on the LHS. The rank of $G$ is the maximal rank of any of its rules; its fan-out is the maximal fan-out of any of its non-terminals.
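The two conditions on variables (each variable occurs exactly once on each side, and the order of the variables of an RHS non-terminal is preserved on the LHS) can be checked mechanically. The following Python sketch is purely illustrative: the `Rule` class, its field names and the function `is_well_formed` are assumptions introduced here and are not part of the formalism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical encoding of one rewriting rule.

    lhs_args: one tuple of symbols (terminals/variables) per LHS argument.
    rhs: one (non-terminal, tuple of variables) pair per RHS non-terminal.
    """
    lhs_nt: str
    lhs_args: tuple
    rhs: tuple


def is_well_formed(rule, variables):
    """Check linearity and order preservation for a single rule."""
    lhs_vars = [s for arg in rule.lhs_args for s in arg if s in variables]
    rhs_vars = [x for _, args in rule.rhs for x in args]
    # Every variable occurs exactly once on the LHS and once on the RHS.
    if len(set(lhs_vars)) != len(lhs_vars):
        return False
    if len(set(rhs_vars)) != len(rhs_vars):
        return False
    if set(lhs_vars) != set(rhs_vars):
        return False
    # The variables of each RHS non-terminal keep their order on the LHS.
    position = {x: n for n, x in enumerate(lhs_vars)}
    return all(position[a] < position[b]
               for _, args in rule.rhs
               for a, b in zip(args, args[1:]))


VARIABLES = {"x", "y"}
good = Rule("S", (("x", "y"),), (("A", ("x", "y")),))      # S(xy) -> A(x,y)
swapped = Rule("S", (("y", "x"),), (("A", ("x", "y")),))   # S(yx) -> A(x,y)
copied = Rule("S", (("x", "x"),), (("A", ("x", "y")),))    # S(xx) -> A(x,y)
```

Under these assumptions, `good` passes the check, `swapped` fails because the order of `x` and `y` is reversed on the LHS, and `copied` fails because `x` occurs twice.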
We use the following additional notation: For a rule $\gamma \in P$, $\mathit{lhs}(\gamma)$ gives the LHS non-terminal; $\mathit{lhs}(\gamma, i)$ gives the $i$th argument of the LHS and $\mathit{lhs}(\gamma, i, j)$ its $j$th symbol; $\mathit{rhs}(\gamma, k)$ gives the $k$th RHS non-terminal; and $\mathit{rhs}(\gamma, k, l)$ gives the $l$th component of the $k$th RHS element (starting with index 0 in all four cases). These functions have value $\bot$ whenever there is no such element. Furthermore, in the sense of dotted productions, we define for each $\gamma \in P$ with $A = \mathit{lhs}(\gamma)$ a set of symbols denoting computation points of $\gamma$, $C_\gamma = \{\gamma_{i,j} \mid 0 \leq i < \mathit{dim}(A), 0 \leq j \leq |\alpha_i|\}$, as well as the set $C = \bigcup_{\gamma \in P} C_\gamma$.
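As an illustration of this notation, the accessor functions and the computation points can be sketched in Python. The encoding is an assumption made for this example only: the `Rule` tuple, the use of `None` for $\bot$, and the sample rule `beta` (standing for a rule $A(ax, ya) \to A(x, y)$) are hypothetical.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    """Hypothetical rule encoding; all indices are 0-based as in the text."""
    name: str
    lhs_nt: str
    lhs_args: tuple   # one tuple of symbols per LHS argument
    rhs: tuple        # (non-terminal, tuple of variables) per RHS element


BOTTOM = None  # stands in for the undefined value


def lhs(rule, i=None, j=None):
    """lhs(rule), lhs(rule, i) or lhs(rule, i, j)."""
    if i is None:
        return rule.lhs_nt
    if not 0 <= i < len(rule.lhs_args):
        return BOTTOM
    arg = rule.lhs_args[i]
    if j is None:
        return arg
    return arg[j] if 0 <= j < len(arg) else BOTTOM


def rhs(rule, k, l=None):
    """rhs(rule, k) or rhs(rule, k, l)."""
    if not 0 <= k < len(rule.rhs):
        return BOTTOM
    nonterminal, args = rule.rhs[k]
    if l is None:
        return nonterminal
    return args[l] if 0 <= l < len(args) else BOTTOM


def computation_points(rule):
    """All (name, i, j) with 0 <= i < dim(A) and 0 <= j <= |alpha_i|."""
    return [(rule.name, i, j)
            for i, arg in enumerate(rule.lhs_args)
            for j in range(len(arg) + 1)]


beta = Rule("beta", "A", (("a", "x"), ("y", "a")), (("A", ("x", "y")),))
```

For this sample rule, each of the two LHS arguments has length 2 and therefore contributes three dot positions, so `computation_points(beta)` contains six elements.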
A non-terminal $A \in N$ can be instantiated w.r.t. an input string $w_1 \cdots w_{|w|}$ and a rule $\gamma \in P$ with $\mathit{lhs}(\gamma) = A$. An instantiation maps all argument elements of $\gamma$ to spans of $w$ ($(i-1, j)_w$ denotes the span $w_i \cdots w_j$, $1 \leq i \leq j \leq |w|$). All instantiations are given by a function $\sigma$ […]. A derivation rewrites strings of instantiated non-terminals, i.e., given an instantiated clause, the instantiated LHS non-terminal may be replaced with the sequence of instantiated RHS non-terminals. The language of the grammar is the set of strings which can be reduced to the empty word, starting with $S$ instantiated to the input string.
See figure 1 for a sample LCFRS.

Thread Automata
Thread automata (TA) (Villemonte de la Clergerie, 2002) are a generic automaton model which can be parametrized to recognize different mildly context-sensitive languages. The TA for LCFRS (LCFRS-TA) implements a prefix-valid top-down incremental parsing strategy similar to the ones of Kallmeyer and Maier (2009) and Burden and Ljunglöf (2005). An LCFRS-TA for some LCFRS $G = (N, T, V, P, S)$ works as follows. The processing of a single rule is handled by a single thread which will traverse the LHS arguments of the rule. A thread is given by a pair $p : X$, where $p \in \{1, \ldots, m\}^*$, with $m$ the rank of $G$, is the address, and $X \in N \cup \{\mathit{ret}\} \cup C$, where $\mathit{ret} \notin N$, is the content of the thread. An automaton state is given by a tuple $\langle i, p, \mathcal{T} \rangle$ where $\mathcal{T}$ is a set of threads, the thread store, $p$ is the address of the active thread, and $i \geq 0$ indicates that $i$ tokens have been recognized. We introduce a new start symbol $S' \notin N$ that expands to $S$ and use $\langle 0, \varepsilon, \{\varepsilon : S'\} \rangle$ as start state.
The specific TA for a given LCFRS $G = (N, T, V, P, S)$ can be defined as a tuple […] in which $\delta$ is a function with $\delta(\gamma_{k,i}) = j$ if there is an $l$ such that $\mathit{lhs}(\gamma, k, i) = \mathit{rhs}(\gamma, j-1, l)$, and $\delta(\gamma_{k,i}) = \bot$ if $\mathit{lhs}(\gamma, k, i) \in T \cup \{\bot\}$ (intuitively, a $\delta$ value $j$ tells us that the next symbol to process is a variable that is an argument of the $j$th RHS non-terminal); and $\Theta$ is a finite set of transitions. Every transition has the form $\alpha \xrightarrow{a} \beta$ with $a \in T \cup \{\varepsilon\}$; it roughly indicates that in the thread store, $\alpha$ can be replaced with $\beta$ while scanning $a$. Square brackets in $\alpha$ and $\beta$ indicate parts that do not belong to the active thread. This will be made more precise below. $\Theta$ contains the following transitions (see figure 2):
• Call transitions start a new thread, either for the start symbol or for a daughter non-terminal. They move down in the parse tree. […]
• […] $\to \gamma_{k,i+1}$ if $\mathit{lhs}(\gamma, k, i) \in T$.
• Publish marks the completion of a production, i.e., its full recognition: $\gamma_{k,j} \to \mathit{ret}$ if $\mathit{dim}(\mathit{lhs}(\gamma)) = k + 1$ and $j = |\mathit{lhs}(\gamma, k)|$.
• Suspend suspends a daughter thread and resumes the parent, i.e., moves up in the parse tree. There are two cases: (i) The daughter is completely recognized: […] (ii) The daughter is not yet completely recognized, we have only finished one of its components: $[\gamma_{k,i}]\beta_{l,j} \to \gamma_{k,i+1}[\beta_{l,j}]$ if $\mathit{dim}(\mathit{lhs}(\beta)) > l + 1$, $|\mathit{lhs}(\beta, l)| = j$, $\mathit{lhs}(\gamma, k, i) = \mathit{rhs}(\gamma, \delta(\gamma_{k,i}) - 1, l)$ and $\mathit{rhs}(\gamma, \delta(\gamma_{k,i}) - 1) = \mathit{lhs}(\beta)$.
• Resume resumes an already present daughter thread, i.e., moves down into some daughter that has already been partly recognized. […] and $\beta_{l,j+1} \notin C$.
This is not exactly the TA for LCFRS proposed in Villemonte de la Clergerie (2002) but rather the one from Kallmeyer (2010), which is close to the Earley parser from Burden and Ljunglöf (2005).
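The function $\delta$ described above can be sketched in a few lines of Python. This is only an illustration under assumptions made here: a rule is encoded as a pair `(lhs_args, rhs)`, `None` stands for $\bot$, and both sample rules are hypothetical.

```python
def delta(rule, k, i):
    """Return the 1-based index j of the RHS non-terminal that has the
    symbol lhs(rule, k, i) among its arguments, or None (bottom) if that
    symbol is a terminal or does not exist."""
    lhs_args, rhs = rule
    if k >= len(lhs_args) or i >= len(lhs_args[k]):
        return None                      # dot is at the end: no symbol
    symbol = lhs_args[k][i]
    for j, (_nonterminal, args) in enumerate(rhs, start=1):
        if symbol in args:               # some l with rhs(rule, j-1, l) == symbol
            return j
    return None                          # symbol is a terminal


# Hypothetical rule A(ax, ya) -> A(x, y)
beta = ((("a", "x"), ("y", "a")), (("A", ("x", "y")),))
# Hypothetical rank-2 rule S(xyz) -> A(x, z) B(y)
rho = ((("x", "y", "z"),), (("A", ("x", "z")), ("B", ("y",))))
```

In `beta`, the symbol after the dot in $\beta_{0,1}$ is the variable `x`, an argument of the first RHS non-terminal, so `delta(beta, 0, 1)` is 1, whereas the dot in $\beta_{0,0}$ precedes the terminal `a` and the value is undefined. In `rho`, the middle variable `y` belongs to the second RHS non-terminal.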
The set of configurations for a given input $w \in T^*$ is then defined by the deduction rules in figure 3 (the use of set union $S_1 \cup S_2$ in these rules assumes that $S_1 \cap S_2 = \emptyset$). The accepting state of the automaton for some input $w$ is $\langle |w|, 1, \{\varepsilon : S', 1 : \mathit{ret}\} \rangle$.

LR Parsing
In an LR parser, the parser actions are guided by an automaton or, equivalently, by a parse table, which is compiled offline. Consider the context-free case. An LR parser for CFG is a guided shift-reduce parser, for which we first build the LR automaton. Its states are sets of dotted productions closed under prediction, and its transitions correspond to having recognized a part of the input, e.g., to moving the dot over an RHS element after having scanned a terminal or recognized a non-terminal. Given an automaton with $n$ states, we build the parse table with $n$ rows. Each row $i$, $0 \leq i < n$, describes the possible parser actions associated with the state $q_i$, i.e., for each state and each possible shift or reduce operation, it tells us which state to go to after the operation.
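For the context-free case, the construction just described can be sketched as follows. This is a minimal LR(0) illustration, not the LCFRS construction of this paper; the dictionary encoding of the grammar, the item triples `(lhs, rhs, dot)`, the augmented start symbol `"S'"` and the toy grammar are assumptions made for the example.

```python
def closure(items, grammar):
    """Close a set of dotted productions (lhs, rhs, dot) under prediction."""
    items = set(items)
    work = list(items)
    while work:
        _lhs, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in grammar:   # dot before a non-terminal
            for production in grammar[rhs[dot]]:
                item = (rhs[dot], production, 0)
                if item not in items:
                    items.add(item)
                    work.append(item)
    return frozenset(items)


def goto(state, symbol, grammar):
    """Move the dot over `symbol` in all items of `state` and close the result."""
    moved = {(lhs, rhs, dot + 1) for (lhs, rhs, dot) in state
             if dot < len(rhs) and rhs[dot] == symbol}
    return closure(moved, grammar)


def build_automaton(grammar, start):
    """Return the list of states and the transition table (state, symbol) -> state."""
    states = [closure({("S'", (start,), 0)}, grammar)]
    table = {}
    for state in states:                  # the list grows while we iterate
        symbols = {rhs[dot] for (_lhs, rhs, dot) in state if dot < len(rhs)}
        for symbol in sorted(symbols):
            target = goto(state, symbol, grammar)
            if target not in states:
                states.append(target)
            table[(states.index(state), symbol)] = states.index(target)
    return states, table


# Toy grammar: S -> a S b | c
GRAMMAR = {"S": [("a", "S", "b"), ("c",)]}
STATES, TABLE = build_automaton(GRAMMAR, "S")
```

Each key `(i, x)` of `TABLE` corresponds to one cell of row `i` of the parse table. Reduce actions would be read off the items whose dot is at the end of the production; they are omitted here to keep the sketch short.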

Intuition
The states in the automaton are predict and resume closures of TA thread stores. In order to keep them finite, we allow the addresses to be regular expressions. A configuration of the parser consists of a […] $\ldots x_{n-1} \Gamma_n$ where $\Gamma_i$ is an address followed by a state and […]
Shift: Whenever we have $p : q$ on top of the stack and an edge from $q$ to $q'$ labeled with the next input symbol and an address $p'$, we add the input symbol followed by $pp' : q'$ to the stack.
Suspend: Whenever the top of the stack is $p_1 : q$ such that there is a $\gamma_{i-1,k} \in q$ with $k = |\mathit{lhs}(\gamma, i-1)|$ and $i < \mathit{dim}(\gamma)$, we can suspend. If $i = 1$, we add $p_1 : \gamma_i$ to the set of completed components and we remove $|\mathit{lhs}(\gamma, i-1)|$ terminals/component non-terminals and their preceding states from the stack. If $i > 1$, we check whether there is a $p_2 : \gamma_{i-1}$ in the set of completed components such that the intersection $L(p_1) \cap L(p_2)$ is not empty. We then remove $p_2 : \gamma_{i-1}$ from the set of completed components and we add $p : \gamma_i$ to it, where $p$ is a regular expression denoting $L(p_1) \cap L(p_2)$. Suppose the topmost state on the stack is now $p' : q'$. We then have to follow the edge leading from $q'$ to some $q''$ labeled $A_i : p''$ where $A = \mathit{lhs}(\gamma)$. This means that we push $A_i$ followed by $p'p'' : q''$ on the stack. (Note on the intersection: the corresponding finite state automata can be deterministic; in this case the intersection is quadratic in the size of the two automata. In LCFRS without left recursion in any of the components, the intersection is trivial since the regular expressions denote only a single path each.)
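The emptiness test on $L(p_1) \cap L(p_2)$ can be illustrated with the standard product construction for deterministic automata, which visits at most one pair of states per combination and is therefore quadratic. The sketch below is an assumption-laden example rather than the procedure used in the parser: the triple encoding `(start, accepting, transitions)` and the three sample automata are hypothetical.

```python
from collections import deque


def intersection_nonempty(dfa1, dfa2):
    """Breadth-first search over the product automaton.

    A DFA is a triple (start, accepting, transitions), where transitions
    maps (state, symbol) to the successor state.
    """
    (start1, accepting1, trans1) = dfa1
    (start2, accepting2, trans2) = dfa2
    seen = {(start1, start2)}
    queue = deque(seen)
    while queue:
        a, b = queue.popleft()
        if a in accepting1 and b in accepting2:
            return True                   # a common address exists
        for (source, symbol), target1 in trans1.items():
            if source == a and (b, symbol) in trans2:
                pair = (target1, trans2[(b, symbol)])
                if pair not in seen:
                    seen.add(pair)
                    queue.append(pair)
    return False


# Hypothetical address languages over the alphabet {"1"}:
ONE_PLUS = (0, {1}, {(0, "1"): 1, (1, "1"): 1})   # one or more 1s
ONE_ONE = (0, {2}, {(0, "1"): 1, (1, "1"): 2})    # exactly the address 11
EMPTY_ADDRESS = (0, {0}, {})                      # only the empty address
```

Here the address `11` lies in both of the first two languages, while the language of one or more `1`s has nothing in common with the language containing only the empty address.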
Reduce: Whenever there is a $\gamma_{i-1,k}$ in our current state with $k = |\mathit{lhs}(\gamma, i-1)|$ and $i = \mathit{dim}(\gamma)$, we can reduce, which is like suspend except that nothing is added to the set of completed components.

Automaton and parse table construction
The states of the LR-automaton are sets of pairs $p : X$ where $p$ is a regular expression over $\{1, \ldots, m\}$, $m$ the rank of $G$, and $X \in C \cup \{S'\}$. They represent predict and resume closures. The predict/resume closure $\overline{q}$ of some set $q$ is described by the deduction rules in figure 4. This closure is not always finite.
$$\frac{\varepsilon : S'}{1 : \alpha_{0,0}}\ \mathit{lhs}(\alpha) = S \qquad\qquad \frac{p : \gamma_{i,j}}{pk : \gamma'_{l,0}}\ \mathit{lhs}(\gamma, i, j) = \mathit{rhs}(\gamma, k-1, l),\ \mathit{rhs}(\gamma, k-1) = \mathit{lhs}(\gamma')$$
Figure 4: Predict/resume closure
However, if it is not, we obtain a set of items that can be represented by a finite set of pairs $r : \gamma_{i,j}$, plus possibly $\varepsilon : S'$, such that $r$ is a regular expression denoting a set of possible addresses. As an example for such a case, see $q_3$ in figure 5.
The reason why we can represent these closures by finite sets using regular expressions for paths is the following: There is a finite number of possible elements $\gamma_{i,j}$. For each of these, the set of possible addresses it might be combined with in a state that is the closure of $\{\varepsilon : X_1, \varepsilon : X_2, \ldots, \varepsilon : X_n\}$ is generated by the CFG $\langle C \cup \{S'\} \cup \{S_{\mathit{new}}\}, \{1, \ldots, m\}, P, S_{\mathit{new}} \rangle$ with $S_{\mathit{new}} \to X_i \in P$ for all $1 \leq i \leq n$, $X \to Y k \in P$ for all instances $\frac{p : X}{pk : Y}$ of deduction rules, and $\gamma_{i,j} \to \varepsilon$. This is a regular grammar; its string language can thus be characterized by a regular expression.
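To make the closure and its possible non-finiteness concrete, the following Python sketch applies the two deduction rules of the predict/resume closure while cutting off addresses beyond a fixed length. The cut-off `max_len` replaces the regular-expression representation described above and is purely an assumption of this example, as are the rule encoding and both sample grammars.

```python
def closure(items, rules, start="S", max_len=4):
    """Predict/resume closure with addresses truncated at max_len.

    items: set of (address, content); an address is a tuple of integers,
    a content is "S'" or a computation point (rule_name, i, j).
    rules: rule_name -> (lhs_nt, lhs_args, rhs).
    """
    result = set(items)
    work = list(items)
    while work:
        address, content = work.pop()
        new_items = []
        if content == "S'":
            # First rule: from eps:S' derive 1:alpha_{0,0} if lhs(alpha) = S.
            new_items = [(address + (1,), (name, 0, 0))
                         for name, rule in rules.items() if rule[0] == start]
        else:
            name, i, j = content
            _, lhs_args, rhs = rules[name]
            if j < len(lhs_args[i]):
                symbol = lhs_args[i][j]
                # Second rule: symbol is the l-th argument of the k-th RHS element.
                for k, (nonterminal, args) in enumerate(rhs, start=1):
                    if symbol in args:
                        l = args.index(symbol)
                        new_items += [(address + (k,), (other, l, 0))
                                      for other, rule in rules.items()
                                      if rule[0] == nonterminal]
        for item in new_items:
            if len(item[0]) <= max_len and item not in result:
                result.add(item)
                work.append(item)
    return result


# Hypothetical grammar: S(xy) -> A(x,y), A(ax,ya) -> A(x,y), A(a,b) -> eps
RULES = {
    "alpha": ("S", (("x", "y"),), (("A", ("x", "y")),)),
    "beta": ("A", (("a", "x"), ("y", "a")), (("A", ("x", "y")),)),
    "gamma": ("A", (("a",), ("b",)), ()),
}
# Hypothetical left-recursive variant: A(xa,yb) -> A(x,y), A(a,b) -> eps
LEFT_REC = {
    "lrec": ("A", (("x", "a"), ("y", "b")), (("A", ("x", "y")),)),
    "base": ("A", (("a",), ("b",)), ()),
}
START_CLOSURE = closure({((), "S'")}, RULES)
LEFT_REC_CLOSURE = closure({((), ("lrec", 0, 0))}, LEFT_REC, max_len=3)
```

For the first grammar the closure of the start item is finite. For the left-recursive variant, the item `("lrec", 0, 0)` is reached with the addresses ε, 1, 11, 111 and so on until the cut-off, which is the kind of set that a regular expression such as $1^*$ would summarize.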

Parsing
We run the automaton with $\langle \varepsilon : q_0, [\,], w \rangle$ and input $w = aaba$. The trace is shown in figure 7. We start in $q_0$ and shift two $a$s, which leads to $q_1$. We have then fully recognized the first components of $\gamma$ and $\beta$: We suspend them and keep them in the set of completed components, which takes us to $q_3$. Shifting the $b$ takes us to $q_6$, from where we can reduce, which finally takes us to $q_4$. From there, we can shift the remaining $a$ (to $q_5$), with which we have fully recognized $\beta$. We can now reduce $\beta$ and, with that, $\alpha$, which takes us to the accepting state $q_8$.

Conclusion
We presented the first LR-style algorithm for LCFRS parsing. It offers a convenient factorization of predict/resume operations. We are currently exploring the possibility of using it in data-driven parsing.