Shift-Reduce Task-Oriented Semantic Parsing with Stack-Transformers

Intelligent voice assistants, such as Apple Siri and Amazon Alexa, are widely used nowadays. These task-oriented dialog systems require a semantic parsing module in order to process user utterances and understand the action to be performed. This semantic parsing component was initially implemented by rule-based or statistical slot-filling approaches for processing simple queries; however, the appearance of more complex utterances demanded the application of shift-reduce parsers or sequence-to-sequence models. While shift-reduce approaches initially proved to be the best option, recent efforts on sequence-to-sequence systems have made them the highest-performing method for that task. In this article, we advance the research on shift-reduce semantic parsing for task-oriented dialog. In particular, we implement novel shift-reduce parsers that rely on Stack-Transformers. These make it possible to adequately model transition systems on the Transformer architecture, notably boosting shift-reduce parsing performance. Additionally, we adapt alternative transition systems from constituency parsing to task-oriented parsing, and empirically show that the in-order algorithm substantially outperforms the commonly used top-down strategy. Finally, we extensively test our approach on multiple domains from the Facebook TOP benchmark, improving over existing shift-reduce parsers and state-of-the-art sequence-to-sequence models in both high-resource and low-resource settings.


Introduction
The research community and industry have directed significant attention towards the advancement of intelligent personal assistants such as Apple Siri, Amazon Alexa, and Google Assistant. These systems, known as task-oriented dialogue systems, streamline task completion and information retrieval via natural language interactions within defined domains such as media playback, weather inquiries, or restaurant reservations. The increasing adoption of these voice assistants by users has not only transformed individuals' lives but also impacted real-world businesses.
Humans effortlessly understand language, deriving meaning from sentences and extracting relevant information. Semantic parsing attempts to emulate this process by understanding the meaning of natural language expressions and translating them into a structured representation that can be interpreted by computational systems. Therefore, a crucial component of any voice assistant is a semantic parser in charge of natural language understanding. Its purpose is to process user dialogue by converting each input utterance into an unequivocal task representation understandable and executable by a machine. Specifically, these parsers identify the user's requested task intent (e.g., play music) as well as pertinent entities needed to further refine the task (e.g., which playlist?).
Traditional commercial voice assistants handle user utterances by conducting intent detection and slot extraction tasks separately. For example, given the utterance Play Paradise by Coldplay, the semantic parsing module processes it in two stages: a) initially determining the user's intent as IN:PLAY_MUSIC, and then b) recognizing the task-specific named entities Paradise and Coldplay, tagging these elements (slots) as SL:MUSIC_TRACK_TITLE and SL:MUSIC_ARTIST_NAME, respectively. Intent detection has traditionally been approached as text classification, where the entire utterance serves as input, while slot recognition has been formulated as a sequence tagging challenge [1][2][3].
Annotations generated by these traditional semantic parsers only support a single intent per utterance and a list of non-overlapping slots composed exclusively of tokens from the input. While this flat semantic representation suffices for handling straightforward utterances, it falls short in adequately representing user queries that involve compositional requests. For instance, the query How long will it take to drive from my apartment to San Diego? necessitates first identifying my apartment (IN:GET_LOCATION_HOME) before estimating the duration to the destination (IN:GET_ESTIMATED_DURATION). Hence, there is a requirement for a semantic representation capable of managing multiple intents per utterance, where slots encapsulate nested intents.
In order to represent more complex utterances, Gupta et al. [4] introduced the task-oriented parsing (TOP) formalism: a hierarchical annotation scheme expressive enough to describe the task-specific semantics of nested intents and model compositional queries. In Fig. 1, we illustrate how the intents and slots of utterances mentioned in the previous examples can be represented using the TOP annotation.
An advantage of the TOP representation is its ease of annotation and parsing compared to more intricate semantic formalisms such as logical forms [5] or abstract meaning representations (AMR) [6]. In fact, its similarity to a syntactic constituency tree enables the adaptation of algorithms from the constituency parsing literature to process task-oriented requests. This was the driving force behind the initial proposal by Gupta et al. [4] to modify the shift-reduce constituency parser introduced by Dyer et al. [7] for generating TOP annotations.
Alternatively, Gupta et al. [4] also proposed the application of different sequence-to-sequence models [8][9][10] for parsing compositional queries. Sequence-to-sequence models comprise a specific neural architecture tasked with predicting a sequence of output tokens based on an input sequence of items. After conducting empirical comparisons between the shift-reduce technique and various sequence-to-sequence models for parsing compositional queries, they determined that the shift-reduce parser surpassed other methods and was the only approach capable of guaranteeing that the output representation adhered to a well-formed TOP tree. This superiority can be largely attributed to the fact that, unlike sequence-to-sequence models, shift-reduce algorithms adhere to grammar constraints throughout the parsing process and exhibit an inductive bias towards tree structures, resulting in enhanced performance.
Although sequence-to-sequence approaches may produce flawed representations, recent advancements [11,12] have substantially enhanced their performance by leveraging Transformer neural networks [10] in conjunction with pre-trained language models such as RoBERTa [13] or BART [14]. Consequently, they have emerged as the most accurate approach to date for generating TOP tree structures.
This article presents further advancements in the realm of shift-reduce semantic parsing for natural language understanding. Specifically, we enhance the initial framework introduced by Gupta et al. [4], which relied on the top-down transition system [7] and a Stack-LSTM-based neural architecture [15]. Firstly, we implement a more robust neural model based on Stack-Transformers [16], enabling the accurate modeling of shift-reduce systems within a Transformer-based neural architecture. Secondly, we adapt the bottom-up and in-order transition systems [17,18] from the constituency parsing literature to task-oriented semantic parsing. Lastly, we empirically evaluate these alternatives, along with the top-down algorithm, on our neural architecture. Our findings demonstrate that the in-order transition system achieves the highest accuracy on the Facebook TOP benchmark [4,19], even outperforming the most robust sequence-to-sequence baselines.
In summary, our contributions in this article are as follows:
• We develop innovative shift-reduce semantic parsers for task-oriented dialogues utilizing Stack-Transformers and deep contextualized word embeddings derived from RoBERTa.
• We adapt various transition systems from the constituency parsing literature to handle TOP annotations and conduct a comprehensive comparison against the original top-down approach, demonstrating the superiority of the in-order algorithm across all scenarios.
• We evaluate our approach on both low-resource and high-resource settings of the Facebook TOP datasets, pushing the boundaries of the state of the art in task-oriented parsing and narrowing the divide with sequence-to-sequence models.
• Our system's source code is freely available at https://github.com/danifg/ShiftReduce-TOP.
The remainder of this article is organized as follows: In Section 2, we provide an overview of prior research on semantic parsing for task-oriented compositional queries. Section 3 outlines our proposed approach, beginning with an exposition of the transition-based algorithms adapted from constituency parsing, followed by a detailed description of the Stack-Transformer-based neural model. Section 4 presents the experiments conducted with the three transition systems using the proposed neural architecture as a testing platform, along with a comprehensive analysis of their performance. Finally, concluding remarks are presented in Section 5.

Related work
The hierarchical semantic representation introduced by Gupta et al. [4] to address compositional queries spurred the adaptation of parsing algorithms initially developed for constituency parsing, such as the Stack-LSTM-based shift-reduce parser [7]. Additionally, Gupta et al. [4] proposed sequence-to-sequence models for this task, including those based on convolutional neural networks (CNNs) [9], long short-term memory (LSTM) neural networks [8], and Transformers [10]. Although sequence-to-sequence methods were originally devised for machine translation [20], they were also adapted to constituency parsing by first linearizing the tree structure [21].
Given that the shift-reduce parser initially emerged as the leading method for generating TOP representations, Einolghozati et al. [22] opted to enhance the original system by incorporating an ensemble of seven parsers, contextualized word embeddings extracted from ELMo [23], and a language model ranker. Concurrently, Pasupat et al. [24] modified the span-based constituency parser proposed by Stern et al. [25] to process utterances into TOP trees, achieving promising results without the use of deep contextualized word embeddings.
While sequence-to-sequence models initially lagged behind all available semantic parsing methods, recent advancements have substantially improved their performance in constructing TOP representations. Notably, Rongali et al. [11] devised a sequence-to-sequence architecture bolstered by a Pointer Generator Network [26] and a RoBERTa-based encoder [13]. This neural architecture emerged as the state of the art in task-oriented semantic parsing and has since been adopted and extended by subsequent studies. Among them, Aghajanyan et al. [12] and Chen et al. [19] proposed simplifying the target sequence by eliminating input tokens that are not slot values, while also initializing both the encoder and the decoder with the pre-trained sequence-to-sequence model BART [14]. Furthermore, non-autoregressive variants of the sequence-to-sequence architecture introduced by Rongali et al. [11] have been presented as well [27][28][29][30]. Additionally, Shrivastava et al. [31] recently enhanced sequence-to-sequence models with a scenario-based approach, where incomplete intent-slot templates are available in advance and can be retrieved after identifying the utterance's scenario. Meanwhile, Wang et al. [32] chose to enhance the efficiency of sequence-to-sequence models by generating subtrees as output tokens at each decoding step.
Diverging from the current mainstream trends, we push forward the boundaries of research in shift-reduce task-oriented parsing by crafting a novel approach grounded in a more accurate transition system and implemented on a more robust neural architecture. As a result, our system surpasses even the strongest sequence-to-sequence baselines.
Simultaneously with our research, Do et al. [33] have developed a two-staged approach that demonstrates remarkable results. Initially, they enhance standard pretrained language models through fine-tuning, incorporating additional hierarchical semantic information. Subsequently, the resulting model is integrated with a recursive insertion-based mechanism [34], constrained by grammar information. Specifically, grammar rules extracted from the training dataset are employed to prune unpromising predictions during the parsing process [35]. It is worth noting that these contributions are orthogonal to our approach and could certainly enhance its performance.

Methodology
This section outlines our proposed approach. Specifically, we elaborate on the transition-based algorithms adapted from the constituency parsing literature to handle task-oriented utterances, as well as the neural architecture serving as the foundation of our system.

Transition systems for task-oriented semantic parsing
In task-oriented semantic parsing, the objective is to transform an input utterance comprising $n$ words, denoted as $X = w_1, \dots, w_n$, into a semantic representation (in our case, a TOP tree $Y$). Similar to syntactic constituency representations, $Y$ is a rooted tree consisting of tokens $w_1, \dots, w_n$ as its leaves and a collection of internal nodes (referred to as constituents) hierarchically structured above them. These constituents are denoted as tuples $(N, W)$, where $W$ represents the set of tokens covered by its span, and $N$ denotes the non-terminal label. For example, (SL:SOURCE, {my, apartment}) and (SL:DESTINATION, {San, Diego}) are constituents extracted from the TOP tree depicted in Fig. 1(b). Additionally, in our specific scenario, two distinct types of constituents emerge: intents and slots, with non-terminal labels respectively prefixed with IN: and SL:. Finally, tree structures must adhere to certain constraints to be deemed a valid TOP representation:
• The root constituent, which encompasses the entire utterance, must be an intent node.
• Only tokens and/or slot constituents can serve as child nodes of an intent node.
• A slot node may have either words (one or several) or a single intent constituent as child nodes.
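As a minimal sketch (not the released implementation), the three constraints above could be checked recursively on a small tree structure. The `Node` class and the function names are hypothetical, label strings are written with underscores, and treating an empty constituent as invalid is our own assumption.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Node:
    """A constituent: a non-terminal label plus its ordered children."""
    label: str  # e.g. "IN:PLAY_MUSIC" or "SL:MUSIC_ARTIST_NAME"
    children: List[Union["Node", str]] = field(default_factory=list)


def is_intent(x) -> bool:
    return isinstance(x, Node) and x.label.startswith("IN:")


def is_slot(x) -> bool:
    return isinstance(x, Node) and x.label.startswith("SL:")


def _well_formed(node: Node) -> bool:
    kids = node.children
    if not kids:  # assumption: every constituent must cover something
        return False
    if is_intent(node):
        # An intent may only contain tokens and/or slots.
        if not all(isinstance(k, str) or is_slot(k) for k in kids):
            return False
    elif is_slot(node):
        # A slot contains either words only, or exactly one intent.
        only_words = all(isinstance(k, str) for k in kids)
        single_intent = len(kids) == 1 and is_intent(kids[0])
        if not (only_words or single_intent):
            return False
    else:
        return False  # label is neither IN: nor SL:
    return all(_well_formed(k) for k in kids if isinstance(k, Node))


def is_valid_top(root) -> bool:
    """The root must be an intent and every node must satisfy the rules."""
    return is_intent(root) and _well_formed(root)


# The flat example "Play Paradise by Coldplay" from the introduction.
tree = Node("IN:PLAY_MUSIC", [
    "Play",
    Node("SL:MUSIC_TRACK_TITLE", ["Paradise"]),
    "by",
    Node("SL:MUSIC_ARTIST_NAME", ["Coldplay"]),
])
print(is_valid_top(tree))
```

A checker of this kind is only a post-hoc filter; the transition systems described next enforce the same constraints during parsing by restricting which actions are permissible.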
To process the input utterance, we employ shift-reduce parsers, initially introduced for dependency and constituency parsing [36,37]. These parsers construct the target tree incrementally by executing a sequence of actions that analyze the input utterance from left to right. Specifically, shift-reduce parsers are characterized by a non-deterministic transition system, which defines the necessary data structures and the set of operations required to complete the parsing process; and an oracle, which selects one of these actions deterministically at each stage of the parsing process. Formally, a transition system is represented as a quadruple $S = (C, c_0, C_f, T)$, where:
• $C$ denotes the set of possible state configurations, defining the data structures necessary for the parser.
• $c_0$ represents the initial configuration of the parsing process.
• $C_f$ is the set of final configurations reached at the end of the parsing process.
• $T$ is the set of available transitions (or actions) that move the parser from one state configuration to another.
We can utilize a transition system $S$ along with an oracle $o$ to parse the utterance $X$: commencing from the initial configuration $c_0$, a sequence of transitions $a_0, \dots, a_{m-1}$ (determined by the oracle at each time step $t$) guides the system through a series of state configurations $c_0, \dots, c_m$ until a final configuration is reached ($c_m \in C_f$). At this stage, the utterance will have been fully processed, and the parser will generate a valid TOP tree $Y$. Various transition systems exist in the literature on constituency parsing. In addition to the algorithm employed by Gupta et al. [4], we have adapted two other transition systems for task-oriented semantic parsing, which we elaborate on in the subsequent sections.

Top-down transition system
Initially conceived by Dyer et al. [7] for constructing constituency trees in a top-to-bottom fashion, this transition system was later adapted by Gupta et al. [4] to accommodate TOP tree representations. The top-down transition system comprises the following components:
• State configurations within $C$ are structured as $c = \langle \Sigma, B \rangle$, where $\Sigma$ denotes a stack (responsible for storing non-terminal symbols, constituents, and partially processed tokens), and $B$ represents a buffer (containing unprocessed tokens to be read from the input).
• At the initial configuration $c_0$, the buffer $B$ encompasses all tokens from the input utterance, while the stack $\Sigma$ remains empty.
• Final configurations within $C_f$ are structured as $c = \langle [I], \emptyset \rangle$, where the buffer is empty (indicating that all words have been processed), and the stack contains a single item $I$. This item represents an intent constituent spanning the entire utterance, as the root node of a valid TOP tree must be an intent.
• The set of available transitions $T$ consists of three actions:
  - The Non-Terminal-L transition involves pushing a non-terminal node labeled L onto the stack, transitioning the system from state configurations of the form $\langle \Sigma, B \rangle$ to $\langle \Sigma | L, B \rangle$ (where $\Sigma | L$ denotes a stack with item L placed on top and $\Sigma$ as the tail). Unlike in constituency parsing, this transition can generate intent and slot non-terminals (with labels L prefixed with IN: or SL:, respectively). Therefore, it must adhere to specific constraints to produce a well-formed TOP tree:
    * Since the root node must be an intent constituent, the first Non-Terminal-L transition must introduce an intent non-terminal onto the stack.
    * A Non-Terminal-L transition that inserts an intent non-terminal onto the stack is permissible only if the last pushed non-terminal was a slot, pushed in the immediately preceding state configuration. This condition ensures that the resulting intent constituent from this transition becomes the sole child node of that preceding slot, as required by the TOP formalism.
    * A Non-Terminal-L transition adding a slot non-terminal to the stack is allowed only if the last inserted non-terminal was an intent.
  - A Shift action is employed to retrieve tokens from the input by transferring words from the buffer to the stack. This operation transitions the parser from state configurations $\langle \Sigma, w_i | B \rangle$ to $\langle \Sigma | w_i, B \rangle$ (where $w_i | B$ denotes a buffer with token $w_i$ on top and $B$ as the tail, and conversely, $\Sigma | w_i$ represents a stack with $\Sigma$ as the tail and $w_i$ as the top). This transition is permissible only if the buffer is not empty. Specifically for task-oriented semantic parsing, this action will not be available in state configurations where the last non-terminal added to the stack was a slot, and an intent constituent was already created as its first child node. This constraint ensures that slot constituents have only one intent as their child node.
  - Additionally, a Reduce transition is necessary to construct a new constituent by removing all items (including tokens and constituents) from the stack until a non-terminal symbol is encountered, then grouping them as child nodes of that non-terminal. This results in a new constituent placed on top of the stack, transitioning the parser from configurations $\langle \Sigma | L | e_k | \dots | e_0, B \rangle$ to $\langle \Sigma | L_{e_k \dots e_0}, B \rangle$ (where $L_{e_k \dots e_0}$ denotes a constituent with non-terminal label L and child nodes $e_k \dots e_0$). This transition can be executed only if there is at least one non-terminal symbol and one item (token or constituent) in the stack.
Please note that the original work by Gupta et al. [4] did not provide specific transition constraints tailored to generating valid TOP representations. Therefore, we undertook a complete redesign of the original top-down algorithm [7] for task-oriented semantic parsing to incorporate these task-specific transition constraints.
Finally, Table 1 illustrates how the top-down algorithm parses the utterance depicted in Fig. 1(a). It demonstrates the step-by-step construction of each constituent, which involves defining the non-terminal label, reading and/or creating all corresponding child nodes, and then reducing all items within its span.
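The effect of the three top-down transitions on the stack and buffer can be sketched with a small simulator. This is an illustrative sketch only: the function name, the action strings and the list-based tree encoding are hypothetical, the task-specific constraints are not enforced, and the action sequence is one we derived by hand for Play Paradise by Coldplay from the definitions above.

```python
def top_down_parse(tokens, actions):
    """Apply top-down transitions; returns the single item left on the stack.

    Stack items: str = token, ("OPEN", label) = open non-terminal,
    [label, child, ...] = finished constituent.
    """
    stack, buffer = [], list(tokens)

    def is_open(item):
        return isinstance(item, tuple) and item[0] == "OPEN"

    for act in actions:
        if act == "SHIFT":
            if not buffer:
                raise ValueError("Shift requires a non-empty buffer")
            stack.append(buffer.pop(0))
        elif act.startswith("NT-"):
            stack.append(("OPEN", act[3:]))
        elif act == "REDUCE":
            children = []
            # Pop items until the most recent open non-terminal is found.
            while stack and not is_open(stack[-1]):
                children.append(stack.pop())
            if not stack or not children:
                raise ValueError("Reduce needs a non-terminal and one item")
            label = stack.pop()[1]
            stack.append([label] + children[::-1])
        else:
            raise ValueError(f"unknown action: {act}")
    if buffer or len(stack) != 1:
        raise ValueError("parser did not reach a final configuration")
    return stack[0]


actions = [
    "NT-IN:PLAY_MUSIC", "SHIFT",
    "NT-SL:MUSIC_TRACK_TITLE", "SHIFT", "REDUCE",
    "SHIFT",
    "NT-SL:MUSIC_ARTIST_NAME", "SHIFT", "REDUCE",
    "REDUCE",
]
tree = top_down_parse(["Play", "Paradise", "by", "Coldplay"], actions)
print(tree)
```

Note that every label is committed before any token of its span has been read, which is the property the in-order system later relaxes.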

Bottom-up transition system
In contrast to the top-down approach, shift-reduce algorithms traditionally perform constituency parsing by building trees from bottom to top. Therefore, we have also adapted the bottom-up transition system developed by Fernández-González and Gómez-Rodríguez [17] for task-oriented semantic parsing. Unlike classic bottom-up constituency parsing algorithms [37,38], this transition system does not require prior binarization of the gold tree during training or subsequent recovery of the non-binary structure after decoding. Specifically, the non-binary bottom-up transition system comprises:
• State configurations have the form $c = \langle \Sigma, B, f \rangle$, where $\Sigma$ is a stack, $B$ is a buffer, as described for the top-down algorithm, and $f$ is a boolean variable indicating whether a state configuration is terminal or not.
• In the initial configuration $c_0$, the buffer contains the entire user utterance, the stack is empty, and $f$ is false.
• Final configurations in $C_f$ have the form $c = \langle [I], \emptyset, \text{true} \rangle$, where the stack holds a single intent constituent, the buffer is empty, and $f$ is true. Following a bottom-up algorithm, we can continue building constituents on top of a single intent node in the stack, even when it spans the whole input utterance. To avoid that, this transition system requires the inclusion of variable $f$ in state configurations to indicate the end of the parsing process.
• Actions provided by this bottom-up algorithm are as follows:
  - Similar to the top-down approach, a Shift action moves tokens from the buffer to the stack, transitioning the parser from state configurations $\langle \Sigma, w_i | B, \text{false} \rangle$ to $\langle \Sigma | w_i, B, \text{false} \rangle$. This operation is not permissible under the following conditions:
    * When the buffer is empty and there are no more words to read.
    * When the top item on the stack is an intent node: since slots must have only one intent child node, the parser needs to build a slot constituent on top of it before shifting more input tokens.
  - A Reduce#k-L transition removes the $k$ topmost items from the stack and groups them as child nodes of a new constituent with non-terminal label L, which is then placed on top of the stack.
  - Lastly, a Finish action is used to signal the end of the parsing process by changing the value of $f$, transitioning the system from configurations $\langle \Sigma, B, \text{false} \rangle$ to final configurations $\langle \Sigma, B, \text{true} \rangle$. This operation is only allowed if the stack contains a single intent constituent and the buffer is empty.
Finally, Table 2 illustrates how this shift-reduce parser processes the utterance in Fig. 1(a), constructing each constituent from bottom to top by assigning the non-terminal label after all child nodes are fully assembled in the stack.
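A rough sketch of the non-binary bottom-up behaviour is given below. It assumes that a Reduce#k-L action pops the k topmost stack items and wraps them in a constituent labeled L, which is our reading of the transition system in [17]; the function name, the action strings and the hand-derived action sequence are hypothetical, and no task-specific constraints are enforced.

```python
def bottom_up_parse(tokens, actions):
    """Apply Shift / Reduce#k-L / Finish transitions (assumed semantics).

    Stack items: str = token, [label, child, ...] = finished constituent.
    """
    stack, buffer, finished = [], list(tokens), False
    for act in actions:
        if finished:
            raise ValueError("no action is allowed after Finish")
        if act == "SHIFT":
            if not buffer:
                raise ValueError("Shift requires a non-empty buffer")
            stack.append(buffer.pop(0))
        elif act.startswith("REDUCE#"):
            # "REDUCE#4-IN:PLAY_MUSIC" -> k = 4, label = "IN:PLAY_MUSIC"
            k_text, label = act[len("REDUCE#"):].split("-", 1)
            k = int(k_text)
            if k < 1 or k > len(stack):
                raise ValueError("Reduce#k needs k items on the stack")
            children = stack[-k:]
            del stack[-k:]
            stack.append([label] + children)
        elif act == "FINISH":
            if buffer or len(stack) != 1:
                raise ValueError("Finish needs one item and an empty buffer")
            finished = True
        else:
            raise ValueError(f"unknown action: {act}")
    if not finished:
        raise ValueError("parser did not reach a final configuration")
    return stack[0]


actions = [
    "SHIFT", "SHIFT", "REDUCE#1-SL:MUSIC_TRACK_TITLE",
    "SHIFT", "SHIFT", "REDUCE#1-SL:MUSIC_ARTIST_NAME",
    "REDUCE#4-IN:PLAY_MUSIC", "FINISH",
]
tree = bottom_up_parse(["Play", "Paradise", "by", "Coldplay"], actions)
print(tree)
```

The final Reduce#4 action shows why long spans are harder for this strategy: the label and the number of children must be chosen in a single decision.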

In-order transition system
As an alternative to the top-down and bottom-up strategies, Liu and Zhang [18] introduced the in-order transition system for constituency parsing. We have tailored this algorithm for parsing task-oriented utterances. Specifically, the proposed in-order transition system consists of:
• Configurations maintain the same format as the bottom-up algorithm (i.e., $c = \langle \Sigma, B, f \rangle$).
• In the initial configuration $c_0$, the buffer contains the entire user utterance, the stack is empty, and the value of $f$ is false.
• Final configurations take the form $c = \langle [I], \emptyset, \text{true} \rangle$. Similar to the bottom-up approach, the in-order algorithm may continue creating additional constituents above the intent node left on the stack indefinitely. Hence, a flag is necessary to indicate the completion of the parsing process.
• The available transitions are adopted from both top-down and bottom-up algorithms, but some of them exhibit different behaviors or are applied in a different order:
  - A Non-Terminal-L transition involves pushing a non-terminal symbol L onto the stack, transitioning the system from state configurations represented as $\langle \Sigma, B, \text{false} \rangle$ to $\langle \Sigma | L, B, \text{false} \rangle$. However, unlike the top-down algorithm, this transition can only occur if the first child node of the upcoming constituent is fully constructed on top of the stack. Furthermore, it must meet other task-specific constraints to generate valid TOP representations:
    * A Non-Terminal-L transition that introduces an intent non-terminal to the stack (i.e., L prefixed with IN:) is valid only if its first child node atop the stack is not an intent constituent.
    * A Non-Terminal-L transition that places a slot non-terminal on the stack (i.e., L prefixed with SL:) is permissible only if the fully-created item atop the stack is not a slot node.
  - Similarly to other transition systems, a Shift operation is used to retrieve tokens from the buffer. However, unlike those algorithms, this action is restricted if the upcoming constituent has already been labeled as a slot (by a non-terminal previously added to the stack) and its first child node is an intent constituent already present in the stack. This condition aims to prevent slot constituents from having more than one child node when the item at the top of the stack is an intent.
  - A Reduce transition is employed to generate intent or slot constituents. Specifically, it removes all elements from the stack until a non-terminal symbol is encountered, which is then replaced, together with the item preceding it, by a new constituent at the top of the stack. Consequently, it guides the parser from state configurations represented as $\langle \Sigma | e_k | L | e_{k-1} | \dots | e_0, B, \text{false} \rangle$ to $\langle \Sigma | L_{e_k \dots e_0}, B, \text{false} \rangle$. This transition is only applicable if there is a non-terminal in the stack (preceded by its first child constituent according to the in-order algorithm). Additionally, this transition must comply with specific constraints for task-oriented semantic parsing:
    * When the Reduce operation results in an intent constituent (as determined by the last non-terminal label added to the stack), it is permissible only if there are no intent nodes among the preceding $k - 1$ items (since the first child $e_k$ already adheres to the TOP formalism, as verified during the application of the Non-Terminal-L transition).
    * When the Reduce transition produces a slot constituent, it is allowed only if there are no other slot nodes within the preceding $k - 1$ elements that will be removed by this operation. This condition also encompasses scenarios where the first child node $e_k$ of the upcoming slot constituent is an intent and, since the Shift transition is not permitted under such circumstances, only the Reduce action can construct a slot with a single intent.
  - Lastly, akin to the bottom-up approach, a Finish action is utilized to finalize the parsing process. This action is only permissible if the stack contains a single intent constituent and the buffer is empty.
Table 3 In-order transition sequence and state configurations for generating the TOP representation in Fig. 1(a). NT-L stands for Non-Terminal-L and slot labels have been respectively abbreviated from SL:MUSIC_TRACK_TITLE and SL:MUSIC_ARTIST_NAME to SL:TITLE and SL:ARTIST.
In Table 3, we illustrate how the in-order strategy parses the user utterance depicted in Fig. 1(a). While the top-down and bottom-up approaches can be respectively regarded as a pre-order and post-order traversal over the tree, this transition system constructs the constituency structure following an in-order traversal, addressing the drawbacks of the other two alternatives. The in-order strategy creates each constituent by determining the non-terminal label after its first child is completed in the stack, and then processing the remaining child nodes. Unlike the top-down approach, which assigns the non-terminal label before reading the tokens composing its span, the in-order algorithm can utilize information from the first child node to make a better choice regarding the non-terminal label. On the other hand, the non-binary bottom-up strategy must simultaneously determine the non-terminal symbol and the left span boundary of the future constituent once all child nodes are completed in the stack. Despite having local information about already-built subtrees, the bottom-up strategy lacks global guidance from top-down parsing, which is essential for selecting the correct non-terminal label. Additionally, determining span boundaries can be challenging when the target constituent has a long span, as Reduce#k-L transitions with a high $k$ value are less frequent in the training data and thus harder to learn. The in-order approach avoids these drawbacks by predicting the non-terminal label and marking the left span boundary after creating its first child. In Section 4, we will empirically demonstrate that, in practice, the advantages of the in-order transition system result in substantial accuracy improvements compared to the other two alternatives.
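To make the in-order behaviour concrete, the following sketch simulates its transitions on Play Paradise by Coldplay. It is illustrative only: the function name, the action strings and the tree encoding are hypothetical, the action sequence was derived by hand from the definitions in this section, and only the two Non-Terminal-L constraints are checked (the Shift and Reduce constraints are omitted for brevity).

```python
def in_order_parse(tokens, actions):
    """Apply in-order transitions; returns the root constituent.

    Stack items: str = token, ("OPEN", label) = open non-terminal,
    [label, child, ...] = finished constituent.
    """
    stack, buffer, finished = [], list(tokens), False

    def is_open(item):
        return isinstance(item, tuple) and item[0] == "OPEN"

    for act in actions:
        if finished:
            raise ValueError("no action is allowed after Finish")
        if act == "SHIFT":
            if not buffer:
                raise ValueError("Shift requires a non-empty buffer")
            stack.append(buffer.pop(0))
        elif act.startswith("NT-"):
            label = act[3:]
            # The first child must already be complete on top of the stack.
            if not stack or is_open(stack[-1]):
                raise ValueError("NT needs a finished first child")
            first = stack[-1]
            first_label = first[0] if isinstance(first, list) else None
            # No intent directly over an intent, no slot directly over a slot.
            if first_label is not None and first_label[:3] == label[:3]:
                raise ValueError("NT violates the TOP constraints")
            stack.append(("OPEN", label))
        elif act == "REDUCE":
            rest = []
            while stack and not is_open(stack[-1]):
                rest.append(stack.pop())
            if len(stack) < 2:
                raise ValueError("Reduce needs a non-terminal and its child")
            label = stack.pop()[1]
            first_child = stack.pop()  # sits just below the non-terminal
            stack.append([label, first_child] + rest[::-1])
        elif act == "FINISH":
            if buffer or len(stack) != 1:
                raise ValueError("Finish needs one item and an empty buffer")
            finished = True
        else:
            raise ValueError(f"unknown action: {act}")
    if not finished:
        raise ValueError("parser did not reach a final configuration")
    return stack[0]


actions = [
    "SHIFT", "NT-IN:PLAY_MUSIC",
    "SHIFT", "NT-SL:MUSIC_TRACK_TITLE", "REDUCE",
    "SHIFT",
    "SHIFT", "NT-SL:MUSIC_ARTIST_NAME", "REDUCE",
    "REDUCE", "FINISH",
]
tree = in_order_parse(["Play", "Paradise", "by", "Coldplay"], actions)
print(tree)
```

Each label is chosen only after the first child (here Play, Paradise and Coldplay, respectively) is already on the stack, which is the extra evidence the in-order strategy exploits.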

Neural parsing model
Earlier shift-reduce systems in dependency parsing [15], constituency parsing [7], AMR parsing [39] and task-oriented semantic parsing [4,22] relied on Stack-LSTMs for modeling state configurations. These architectures are grounded in LSTM recurrent neural networks [40], which dominated the natural language processing community until Vaswani et al. [10] introduced Transformers. This neural architecture offers a cutting-edge attention mechanism [41] that outperforms LSTM-based systems and, unlike recurrent neural networks, can be easily parallelized. This motivated Fernandez Astudillo et al. [16] to design Stack-Transformers. In particular, they use Stack-Transformers to replace Stack-LSTMs in shift-reduce dependency and AMR parsing, achieving remarkable gains in accuracy.
Fig. 2 Transformer neural architecture introduced by Vaswani et al. [10]. Note that this neural network requires the incorporation of positional encoding for each input token to maintain sequential order, and Layer Norm refers to the layer normalization technique proposed by Ba et al. [42].
In our research, we leverage Stack-Transformers to represent the buffer and stack structures of the described transition systems, employing them to construct innovative shift-reduce task-oriented parsers. Specifically, we implement the following encoder-decoder architecture:

Encoder
Top-performing sequence-to-sequence approaches [11,27] directly use pre-trained models like RoBERTa [13] as the encoder in their neural architectures, conducting a task-specific fine-tuning during training. RoBERTa, short for "Robustly optimized BERT pretraining approach," employs the same Transformer architecture as BERT [43] and was pre-trained on masked word prediction using a large dataset.
Unlike strong sequence-to-sequence techniques, we adopt a less resource-consuming and greener strategy: we extract fixed weights from the pre-trained language model RoBERTa Large to initialize word embeddings, which remain frozen throughout the training process. Specifically, we use mean pooling (i.e., averaging the weights of the wordpieces that make up each word) to generate a word representation $e_i$ for each token $w_i$ in the input utterance $X = w_1, \dots, w_n$, resulting in the sequence $E = e_1, \dots, e_n$.
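The mean-pooling step can be illustrated in isolation. The sketch below is a plain-Python illustration rather than the actual extraction code: `pool_wordpieces` and its arguments are hypothetical names, the vectors are toy values instead of RoBERTa outputs, and it assumes the wordpiece-to-word alignment is already known and that every word has at least one wordpiece.

```python
def pool_wordpieces(piece_vectors, word_ids):
    """Average wordpiece vectors into one vector per word.

    piece_vectors: one vector (list of floats) per wordpiece.
    word_ids:      for each wordpiece, the index of the word it belongs to.
    """
    n_words = max(word_ids) + 1
    dim = len(piece_vectors[0])
    sums = [[0.0] * dim for _ in range(n_words)]
    counts = [0] * n_words
    for vec, w in zip(piece_vectors, word_ids):
        counts[w] += 1
        for j, value in enumerate(vec):
            sums[w][j] += value
    # Divide each accumulated sum by the number of pieces of that word.
    return [[s / counts[w] for s in sums[w]] for w in range(n_words)]


# Toy input: word 0 is split into two wordpieces, word 1 into one.
pieces = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(pool_wordpieces(pieces, [0, 0, 1]))  # [[2.0, 3.0], [5.0, 6.0]]
```

Because the pooled vectors are frozen, they can in principle be computed once per utterance and cached.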
Next, we define the encoder using a 6-layer Transformer with a hidden size of 256. Transformers utilize a multi-head self-attention layer with multiple attention heads (four in our case) to assess the relevance of each input token relative to the other words in the utterance. The output of this layer is fed into a feed-forward layer, ultimately producing an encoder hidden state $h_i$ for each input word (represented as $e_i$). Therefore, given the sequence of word representations $E$, the encoding process yields the sequence of encoder hidden states $H = h_1, \dots, h_n$. Fig. 2 illustrates the Transformer neural architecture.

Decoder with Stack-Transformers
The decoder is responsible for generating the sequence of target actions $A = a_0, \dots, a_{m-1}$ to parse the input utterance $X$ according to a specific transition system $S$.
We use Stack-Transformers (with 6 layers, a hidden size of 256, and 4 attention heads) to effectively model the stack and buffer structures at each state configuration of the shift-reduce parsing process. In the original Transformer decoder model, a cross-attention layer employs multiple attention heads to attend to all input tokens and compute their compatibility with the last decoder hidden state $q_t$ (which encodes the transition history). However, Stack-Transformers specialize one attention head to focus exclusively on tokens in the stack at state configuration $c_t$ and another head solely on the remaining words in the buffer at $c_t$. This specialization allows the Transformer to represent stack and buffer structures.
In practice, these dedicated stack and buffer attention heads are implemented using masks $m^{stack}$ and $m^{buffer}$ over the input. After applying the transition $a_{t-1}$ to state configuration $c_{t-1}$, these masks must be updated at time step $t$ to accurately represent the stack and buffer contents in the current state configuration $c_t$. To achieve this, we define how these masks are specifically modified for each transition system described in Section 3:
• If the action $a_{t-1}$ is a Shift transition, the first token in $m^{buffer}$ will be masked out and added to $m^{stack}$. This applies to all proposed transition systems, as the Shift transition behaves consistently across them.
• When a Non-terminal-L transition is applied, it affects the stack structure in $c_t$ but has no effect on $m^{stack}$. This is because attention heads only attend to input tokens, and non-terminals are artificial symbols not present in the user utterance.
• For a Reduce transition (including the Reduce#k-L action from the non-binary bottom-up transition system), all tokens in $m^{stack}$ that form the upcoming constituent will be masked out, except for the initial word representing the resultant constituent (since artificial non-terminals cannot be considered by the attention heads).
Fig. 3 (caption fragment) ... that reflect the effects of certain in-order transitions on the stack and buffer during the shift-reduce parsing process illustrated in Table 3.
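The three mask-update rules above can be sketched as follows for the top-down transition system. This is an illustrative sketch, not the authors' implementation: the action strings (`SHIFT`, `NT-<label>`, `REDUCE`), the helper names and the list-based mask encoding (0 for visible, −∞ for masked) are assumptions, and every constituent is assumed to cover at least one token.

```python
NEG_INF = float("-inf")


def init_masks(n):
    """Empty stack (all masked) and full buffer (all visible)."""
    return [NEG_INF] * n, [0.0] * n  # m_stack, m_buffer


def apply_shift(m_stack, m_buffer):
    """Move the first buffer token from m_buffer to m_stack."""
    i = m_buffer.index(0.0)  # first token still in the buffer
    m_buffer[i] = NEG_INF
    m_stack[i] = 0.0
    return i


def apply_reduce(m_stack, child_positions):
    """Mask every visible position of the new constituent except the first."""
    for i in child_positions[1:]:
        m_stack[i] = NEG_INF
    return child_positions[0]


def run_top_down(actions, n):
    m_stack, m_buffer = init_masks(n)
    stack = []  # ints: visible token positions; strs: open non-terminals
    for act in actions:
        if act == "SHIFT":
            stack.append(apply_shift(m_stack, m_buffer))
        elif act.startswith("NT-"):
            stack.append(act)  # non-terminals never touch the masks
        elif act == "REDUCE":
            children = []
            while not isinstance(stack[-1], str):
                children.append(stack.pop())
            stack.pop()  # remove the open non-terminal
            children.reverse()
            stack.append(apply_reduce(m_stack, children))
    return m_stack, m_buffer


# [IN:PLAY_MUSIC play [SL:TITLE thriller ] by [SL:ARTIST michael jackson ] ]
actions = ["NT-IN:PLAY_MUSIC", "SHIFT", "NT-SL:TITLE", "SHIFT", "REDUCE",
           "SHIFT", "NT-SL:ARTIST", "SHIFT", "SHIFT", "REDUCE", "REDUCE"]
mid_stack, mid_buffer = run_top_down(actions[:10], n=5)
final_stack, final_buffer = run_top_down(actions, n=5)
```

After the second Reduce, only the first word of the slot spanning the last two tokens remains visible to the stack head; after the final Reduce, the whole utterance is represented by its first token.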
In Fig. 3, we illustrate how these masks represent the content of the buffer and stack structures and how they are adjusted as the parser transitions from state configurations $c_{t-1}$ to $c_t$. After encoding the stack and buffer in state configuration $c_t$ into masks $m^{stack}_t$ and $m^{buffer}_t$ (both represented as vectors with values of $-\infty$ or 0), the attention head $z^{stack}_t$ (focused exclusively on the stack) is computed as follows:
$$z^{stack}_t = \sum_{i=1}^{n} \alpha_{ti} (h_i W^V), \quad \alpha_{ti} = \frac{\exp(\beta_{ti})}{\sum_{k=1}^{n} \exp(\beta_{tk})}, \quad \beta_{ti} = \frac{(q_t W^Q)(h_i W^K)^\top}{\sqrt{d}} + m^{stack}_{ti}$$
where $W^Q$, $W^K$ and $W^V$ are parameter matrices unique to each attention head, $d$ is the dimension of the resulting attention vector $z^{stack}_t$, and $\beta_{ti}$ is a compatibility function that measures the interaction between the decoder hidden state $q_t$ and each input token $w_i$ (represented by $h_i$). By introducing the mask $m^{stack}_t$ into the original equation to compute $\beta_{ti}$, this scoring function will only affect the words that are in the stack at time step $t$.
Similarly, the attention vector $z^{buffer}_t$ (which only affects input tokens in the buffer in $c_t$) is calculated as:
$$z^{buffer}_t = \sum_{i=1}^{n} \alpha_{ti} (h_i W^V), \quad \alpha_{ti} = \frac{\exp(\beta_{ti})}{\sum_{k=1}^{n} \exp(\beta_{tk})}, \quad \beta_{ti} = \frac{(q_t W^Q)(h_i W^K)^\top}{\sqrt{d}} + m^{buffer}_{ti}$$
The other two regular attention heads $z_t$ are computed as originally described in Vaswani et al. [10]. All resulting attention vectors are combined and passed through subsequent linear and Softmax layers (as depicted in Fig. 2) to ultimately select the next action $a_t$ from the permitted transitions in state configuration $c_t$, according to a specific transition system $S$.
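To make the effect of the additive mask concrete, the following sketch computes a single masked attention head in plain Python. The function name and the toy vectors are assumptions; the query, keys and values are taken to be already projected by the head's parameter matrices, and at least one position is assumed to be unmasked.

```python
import math

NEG_INF = float("-inf")


def masked_attention_head(q, keys, values, mask):
    """Scaled dot-product attention with an additive mask of 0 / -inf."""
    d = len(q)
    # Compatibility scores: masked positions become -inf.
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) + m
              for k, m in zip(keys, mask)]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]  # exp(-inf) == 0.0
    total = sum(weights)
    alphas = [w / total for w in weights]
    dim = len(values[0])
    z = [sum(a * v[j] for a, v in zip(alphas, values)) for j in range(dim)]
    return z, alphas


q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
m_stack = [0.0, NEG_INF, 0.0]  # the second token is not in the stack
z_stack, alphas = masked_attention_head(q, keys, values, m_stack)
```

The masked token receives an attention weight of exactly zero, so the resulting vector only mixes the values of tokens that are currently in the stack; the buffer head works the same way with the buffer mask.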
Finally, note that this neural architecture is flexible enough to implement not only the transition systems described in Section 3, but also any shift-reduce parser for task-oriented semantic parsing.

Training objective
Each shift-reduce parser is trained through the minimization of the overall log loss (implemented as a cross-entropy loss) when selecting the correct sequence of transitions $A = a_0, \dots, a_{m-1}$ to generate the gold TOP tree $Y_g$ for the user utterance $X$:
$$\mathcal{L}(\theta) = -\sum_{t=0}^{m-1} \log p_\theta(a_t \mid a_{<t}, X)$$
where the transition $a_t$ (predicted in time step $t$) is conditioned by previous action predictions ($a_{<t}$).
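As a small worked example of this objective, the sketch below sums the negative log-probabilities that a model assigns to the gold actions. The function name and the toy probability tables are hypothetical; in practice the per-step distributions come from the decoder's Softmax layer.

```python
import math


def sequence_log_loss(step_probs, gold_actions):
    """Negative log-likelihood of the gold transition sequence.

    step_probs[t] maps each candidate action to its predicted probability
    at time step t (already conditioned on the previous actions).
    """
    return -sum(math.log(probs[action])
                for probs, action in zip(step_probs, gold_actions))


step_probs = [
    {"SHIFT": 0.5, "REDUCE": 0.5},
    {"SHIFT": 0.25, "REDUCE": 0.75},
]
loss = sequence_log_loss(step_probs, ["SHIFT", "SHIFT"])
```

With probabilities 0.5 and 0.25 for the two gold actions, the loss equals −(log 0.5 + log 0.25) = log 8.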

Data
We conduct experiments on the main benchmark for task-oriented semantic parsing of compositional queries: the Facebook TOP datasets. The initial version (TOP) was introduced by Gupta et al. [4], who annotated utterances with multiple nested intents across two domains: event and navigation. This was further extended by Chen et al. [19] in the second version (TOPv2), which added six additional domains: alarm, messaging, music, reminder, timer and weather. While the first version presents user queries with a high degree of compositionality, the extension TOPv2 introduced some domains (such as music and weather) where all utterances can be parsed with flat trees. Table 4 provides some statistics of the TOP and TOPv2 datasets.
Table 4 Data statistics for the Facebook TOP benchmark. We provide the number of queries in the training, validation and test splits, along with the number of intents and slots. Additionally, we include the percentage of compositional queries (i.e., utterances parsed by non-flat trees with depth > 2).
Furthermore, TOPv2 offers specific splits designed to evaluate task-oriented semantic parsers in a low-resource domain adaptation scenario. The conventional approach involves utilizing some samples from the reminder and weather domains as target domains, while considering the remaining six full domains (including event and navigation from TOP) as source domains if necessary. Moreover, instead of selecting a fixed number of training samples per target domain, TOPv2 adopts a SPIS (samples per intent and slot) strategy. For example, a 25 SPIS strategy entails randomly selecting the necessary number of samples to ensure at least 25 training instances for each intent and slot of the target domain. To facilitate a fair comparison, we evaluate our approach on the training, test and validation splits at both 25 SPIS and 500 SPIS for the target domains reminder and weather, as provided in TOPv2. Additionally, following the methodology proposed by Chen et al. [19], we employ a joint training strategy in the 25 SPIS setting, wherein the training data from the source domain is combined with the training split from the target domain.
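The SPIS idea can be sketched as a greedy random selection. This is only an illustration under stated assumptions: the exact sampling procedure used to build the official TOPv2 splits may differ, and the function name, the `(utterance, labels)` example format and the toy data are hypothetical.

```python
import random
from collections import Counter


def spis_sample(examples, k, seed=0):
    """Pick random examples until every intent/slot label has >= k samples.

    examples is a list of (utterance, labels) pairs, where labels is the set
    of intent and slot labels occurring in the utterance's tree. A label with
    fewer than k occurrences overall ends up with all of its examples.
    """
    rng = random.Random(seed)
    pool = list(examples)
    rng.shuffle(pool)
    counts = Counter()
    chosen = []
    for utterance, labels in pool:
        # Keep the example only if it helps a label that is still short.
        if any(counts[label] < k for label in labels):
            chosen.append((utterance, labels))
            counts.update(labels)
    return chosen, counts


examples = [
    ("utt%d" % i,
     {"IN:GET_WEATHER", "SL:LOCATION"} if i % 2 else {"IN:GET_WEATHER"})
    for i in range(20)
]
chosen, counts = spis_sample(examples, k=3)
```

Because the number of selected utterances depends on how labels co-occur, two domains sampled at the same SPIS value can end up with training sets of different sizes.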
Finally, we further evaluate our shift-reduce parsers on a variant of the TOPv2 dataset (referred to as TOPv2*). This variant comprises domains with a high percentage of hierarchical structures: alarm, messaging and reminder. Our aim is to rigorously test the three proposed transition systems on complex compositional queries, excluding those domains that can be fully parsed with flat trees, which are more easily handled by traditional slot-filling methods.

Evaluation
We use the official TOP scoring script for performance evaluation, which reports three different metrics:
• Exact match accuracy (EM), which measures the percentage of full trees correctly built.
• Labeled bracketing $F_1$ score ($F_1$), which compares the non-terminal label and span of each predicted constituent against the gold standard. This is similar to the scoring method provided by the EVALB script for constituency parsing [44], but it also includes pre-terminal nodes in the evaluation.
• Tree-labeled $F_1$ score ($TF_1$), which evaluates the subtree structure of each predicted constituent against the gold tree.
Recent research often reports only the EM accuracy; however, in line with Gupta et al. [4], we also include $F_1$ and $TF_1$ scores to provide a more comprehensive comparison of the proposed transition systems. Lastly, for each experiment, we present the average score and standard deviation across three runs with random initialization.
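The first two metrics can be approximated with a short sketch. This is not the official TOP scoring script: the function names are hypothetical, trees are assumed to be well-formed bracketed strings with space-separated brackets, and the tree-labeled variant is omitted.

```python
from collections import Counter


def labeled_spans(tree):
    """Collect (label, start, end) for every constituent of a bracketed tree."""
    spans, open_nodes, pos = Counter(), [], 0
    for tok in tree.split():
        if tok.startswith("["):
            open_nodes.append((tok[1:], pos))
        elif tok == "]":
            label, start = open_nodes.pop()
            spans[(label, start, pos)] += 1
        else:
            pos += 1  # an utterance word
    return spans


def exact_match(pred, gold):
    return pred.split() == gold.split()


def bracketing_f1(preds, golds):
    """Corpus-level labeled bracketing F1 over paired predictions and golds."""
    matched = n_pred = n_gold = 0
    for pred, gold in zip(preds, golds):
        sp, sg = labeled_spans(pred), labeled_spans(gold)
        matched += sum((sp & sg).values())  # multiset intersection
        n_pred += sum(sp.values())
        n_gold += sum(sg.values())
    if matched == 0:
        return 0.0
    precision, recall = matched / n_pred, matched / n_gold
    return 2 * precision * recall / (precision + recall)


gold = "[IN:GET_EVENT events in [SL:LOCATION new york ] ]"
pred = "[IN:GET_EVENT events in [SL:LOCATION new ] york ]"
em = exact_match(pred, gold)
f1 = bracketing_f1([pred], [gold])
```

In this toy case the slot span is wrong, so the prediction fails the exact match while still receiving partial credit for the correct intent constituent.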

Implementation details
Our neural architecture was built upon the Stack-Transformer framework developed by Fernandez Astudillo et al. [16] using the FAIRSEQ toolkit [45]. We maintained consistent hyperparameters across all experiments, based on those specified by Fernandez Astudillo et al. [16], with minor adjustments. Specifically, we used the Adam optimizer [46] with $\beta_1 = 0.9$ and $\beta_2 = 0.98$, and a batch size of 3584 tokens. The learning rate was linearly increased for the first 4,000 training steps from $10^{-7}$ to $5 \times 10^{-4}$, followed by a decrease using the inverse-sqrt scheduling scheme, with a minimum of $10^{-9}$ [10]. Additionally, we applied a label smoothing rate of 0.01, a dropout rate of 0.3, and trained for 90 epochs. Furthermore, we averaged the weights from the three best checkpoints based on the validation split using greedy decoding, and employed a beam size of 10 for evaluation on the test set. All models were trained and tested on a single Nvidia Tesla P40 GPU with 24 GB of memory.
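The learning-rate schedule described above can be sketched as a standalone function. The function name is hypothetical, and treating the minimum value as a hard floor is an assumption; the toolkit may handle the minimum learning rate differently (for instance, as a stopping criterion).

```python
def learning_rate(step, warmup_steps=4000, init_lr=1e-7,
                  peak_lr=5e-4, min_lr=1e-9):
    """Linear warm-up followed by inverse square-root decay."""
    if step < warmup_steps:
        # Linear interpolation from init_lr to peak_lr.
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    # Decay proportionally to 1 / sqrt(step), anchored at the peak.
    # Assumption: min_lr acts as a lower bound on the returned value.
    return max(min_lr, peak_lr * (warmup_steps / step) ** 0.5)


schedule = [learning_rate(s) for s in (0, 2000, 4000, 16000)]
```

The rate peaks at the end of warm-up and is halved every time the step count quadruples, for example at step 16,000.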

Baselines
In addition to evaluating the three proposed transition systems, we compare them against the leading shift-reduce parser for task-oriented dialogue: the system developed by Einolghozati et al. [22]. This model builds upon the system by Gupta et al. [4], which uses a top-down transition system and a Stack-LSTM-based architecture, and enhances it with ELMo-based word embeddings, a majority-vote ensemble of seven parsers, and an SVM language model ranker. We also include current top-performing sequence-to-sequence models in our comparison [11,12,27,29,31]. For low-resource domain adaptation, we compare our models with the enhanced implementation by Chen et al. [19], which is based on Rongali et al. [11] and specifically tested on the low-resource TOPv2 splits. Lastly, we incorporate the recent state-of-the-art approach by Do et al. [33], which employs a language model enhanced with semantic structured information, into both high-resource and low-resource comparisons.

High-resource setting
We first present the evaluation results of the three described transition systems with Stack-Transformers on the TOP and TOPv2* datasets in Table 5. Regardless of the metric, the in-order algorithm consistently outperforms the other two alternatives on both datasets. Although the TOP dataset contains a higher percentage of compositional queries than TOPv2*, the in-order parser shows a more significant accuracy advantage over the top-down parser on TOP (0.49 EM accuracy points) compared to TOPv2* (0.13 EM accuracy points). The bottom-up approach notably underperforms compared to the other transition systems on both datasets.
In Table 6, we compare our shift-reduce parsers to strong baselines on the TOP dataset. Using frozen RoBERTa-based word embeddings, the in-order shift-reduce parser outperforms all existing methods under similar conditions, including sequence-to-sequence models that fine-tune language models for task-oriented parsing. Specifically, it surpasses the single-model and ensemble variants of the shift-reduce parser by Einolghozati et al. [22] by 3.22 and 0.89 EM accuracy points, respectively. Additionally, our best transition system achieves improvements of 0.41 and 0.05 EM accuracy points over top-performing sequence-to-sequence baselines initialized with RoBERTa [27] and BART [12], respectively. The exceptions are the enhanced variant of Einolghozati et al. [22] (which uses an ensemble of seven parsers and an SVM language model ranker) and the two-staged system by Do et al. [33] that employs an augmented language model with hierarchical information, achieving the best accuracy to date on the TOP dataset.
Lastly, our top-down parser with Stack-Transformers achieves accuracy comparable to the strongest sequence-to-sequence models using RoBERTa-based encoders [11,27], and surpasses the single-model top-down shift-reduce baseline by Einolghozati et al. [22] by a wide margin (2.73 EM accuracy points).

Low-resource setting
Table 7 presents the performance of our approach on low-resource domain adaptation. Across all SPIS settings, the in-order strategy consistently achieves the highest scores, not only among shift-reduce parsers but also compared to top-performing sequence-to-sequence models. Specifically, the in-order algorithm outperforms the BART-based sequence-to-sequence model by 3.5 and 2.4 EM accuracy points in the 25 SPIS setting of the reminder and weather domains, respectively. In the 500 SPIS setting, our best shift-reduce parser achieves accuracy gains of 7.9 and 0.5 EM points on the reminder and weather domains over the strongest sequence-to-sequence baseline. Notably, while the reminder domain poses greater challenges due to the presence of compositional queries, our approach exhibits higher performance improvements in this domain compared to the weather domain, which exclusively contains flat queries. Additionally, we include the state-of-the-art scores achieved by the system developed by Do et al. [33] by incorporating semantic structured information into the language model fine-tuning.

Table 6 Comparison of exact match performance among state-of-the-art task-oriented parsers on the TOP test set. The first block encompasses sequence-to-sequence models, while the second block comprises shift-reduce parsers. In the last block, we additionally present the results of Einolghozati et al. [22] with ensembling (+ ensemble) and language model re-ranking (+ SVM-Rank), along with a novel approach that fine-tunes a standard RoBERTa language model by integrating additional semantic structured information (+ hierarchical information). Lastly, we denote with fine-tuned those approaches that utilize pre-trained language models directly as encoders and undergo fine-tuning for adaptation to task-oriented parsing.
Table 7 Comparison of exact match performance among top-performing task-oriented parsers on the test splits of reminder and weather domains within a low-resource setting. The first block compiles sequence-to-sequence models, while the second block encompasses shift-reduce (S-R) parsers. In the last block, we additionally present the results of the novel approach by Do et al. [33], which involves fine-tuning a standard RoBERTa language model by integrating additional semantic structured information (+ hierar. inform.). Lastly, we denote with fine-tuned those approaches that utilize pre-trained language models directly as encoders and undergo fine-tuning for adaptation to task-oriented parsing.

Discussion
Overall, our top-down and in-order shift-reduce parsers deliver competitive accuracies on the main Facebook TOP benchmark, surpassing the state of the art in both high-resource and low-resource settings in most cases. Furthermore, shift-reduce parsers ensure that the resulting structure is a well-formed tree in any setting, whereas sequence-to-sequence models may produce invalid trees due to the absence of grammar constraints during parsing. For instance, Rongali et al. [11] reported that 2% of generated trees for the TOP test split were not well-formed. Although Chen et al. [19] did not document this information, we anticipate a significant increase in invalid trees in the low-resource setting. Finally, it is worth mentioning that techniques such as ensembling, re-ranking, or fine-tuning pre-trained language models are orthogonal to our approach and, while they may consume more resources, they can be directly implemented to further enhance performance.
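To illustrate what a well-formedness failure looks like in a linearized output, the following sketch applies a few simple structural checks to a bracketed string. The function name and the specific checks (balanced brackets, a single root, IN:/SL: label prefixes) are assumptions chosen for illustration; they are not the criteria used in the cited works.

```python
def is_well_formed(tree: str) -> bool:
    """Check that a bracketed string encodes a single, balanced tree."""
    tokens = tree.split()
    depth = 0
    for i, tok in enumerate(tokens):
        if tok.startswith("["):
            if not tok.startswith(("[IN:", "[SL:")):
                return False  # unknown non-terminal prefix
            depth += 1
        elif tok == "]":
            depth -= 1
            if depth < 0:
                return False  # closing bracket without an opening one
            if depth == 0 and i != len(tokens) - 1:
                return False  # material after the root was closed
        elif depth == 0:
            return False  # word outside the root constituent
    return bool(tokens) and depth == 0


ok = is_well_formed("[IN:GET_EVENT events in [SL:LOCATION new york ] ]")
missing_close = is_well_formed("[IN:GET_EVENT events in [SL:LOCATION new york ]")
extra_close = is_well_formed("[IN:GET_EVENT events ] ]")
```

A shift-reduce parser never needs such a post-hoc check, since the transition system only permits actions that keep the partial structure valid.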

Analysis
To comprehend the variations in performance among the proposed transition systems, we conduct an error analysis focusing on utterance length and structural factors using the validation split of the TOP dataset.

Utterance length
In Fig. 4(a) and (b), we present the EM and labeled bracketing $F_1$ scores achieved by each transition system across various utterance length cutoffs. It can be seen that the bottom-up algorithm yields higher EM accuracy for the shortest utterances (≤ 5), but experiences a notable decline in accuracy for longer queries. While less pronounced than in the bottom-up strategy, both the in-order and top-down algorithms also exhibit a clear decrease in accuracy as utterance length increases. This outcome is anticipated as shift-reduce parsers are prone to error propagation: earlier mistakes in the transition sequence can lead the parser into suboptimal state configurations, resulting in further erroneous decisions later on. Notably, the in-order approach consistently outperforms the top-down baseline across all length cutoffs.
Fig. 4(c) and (d) depict the EM and labeled $F_1$ scores achieved by each algorithm on queries with varying numbers of intents per utterance. We observe that the in-order transition system attains higher EM accuracy on utterances with fewer than 3 intents. However, its performance on more complex queries is surpassed by the top-down approach. A similar trend is evident when evaluating performance using the labeled $F_1$ score: the top-down strategy outperforms the in-order algorithm on queries with 4 intents. While this might suggest that a purely top-down approach is preferable for processing utterances with a compositionality exceeding 3 intents, it is essential to note that the number of queries with 4 intents in the validation split is relatively low (just 20 utterances with 4 intents, compared to 2,525, 1,179, and 307 utterances with 1, 2, and 3 intents, respectively). Consequently, its impact on the overall performance is limited. Finally, both plots also indicate that the bottom-up approach consistently underperforms the other two alternatives, except on queries with 3 intents, where it surpasses the in-order strategy in EM accuracy.

Span length and non-terminal prediction
Fig. 4(e) illustrates the performance achieved by each transition system on span identification relative to different lengths, while Fig. 4(f) demonstrates the accuracy obtained by each algorithm on labeling constituents with the most frequent non-terminals (including the average span length in brackets). In Fig. 4(e), we observe that error propagation affects span identification, as accuracy decreases on longer spans, which require a longer transition sequence to be constructed and are thus more susceptible to error propagation. Additionally, the bottom-up transition system exhibits significant accuracy losses in producing constituents with longer spans. This can be attributed to the fact that, while the other two alternatives use a Non-Terminal-L action to mark the beginning of the future constituent, the bottom-up strategy determines the entire span with a single Reduce#k-L transition at the end of the constituent creation. This approach, being more susceptible to error propagation, struggles with Reduce#k-L transitions with higher k values, which are less frequent in the training data and hence more challenging to learn. Regarding the in-order algorithm, it appears to be more robust than the top-down transition system on constituents with the longest span, indicating that the advantages of the in-order strategy (explained in Section 3.1) mitigate the impact of error propagation. Moreover, in Fig. 4(f), we observe that the in-order parser outperforms the other methods in predicting frequent non-terminal labels in nearly all cases, resulting in significant differences in accuracy in building slot constituents with the longest span, such as SL:DATE_TIME_DEPARTURE and SL:DATE_TIME_ARRIVAL. The only exceptions are slot constituents SL:SOURCE and SL:CATEGORY_EVENT, where the bottom-up and top-down algorithms respectively achieve higher accuracy.

Conclusions
In this paper, we introduce innovative shift-reduce semantic parsers tailored for processing task-oriented dialogue utterances. In addition to the commonly used top-down algorithm for this task, we adapt the bottom-up and in-order transition systems from constituency parsing to generate well-formed TOP trees. Moreover, we devise a more robust neural architecture that, unlike previous shift-reduce approaches, leverages Stack-Transformers and RoBERTa-based contextualized word embeddings. We extensively evaluate the three proposed algorithms across high-resource and low-resource settings, as well as multiple domains of the widely-used Facebook TOP benchmark. This marks the first evaluation of a shift-reduce approach in low-resource task-oriented parsing, to the best of our knowledge. Through these experiments, we demonstrate that the in-order transition system emerges as the most accurate alternative, surpassing all existing shift-reduce parsers not enhanced with re-ranking. Furthermore, it advances the state of the art in both high-resource and low-resource settings, surpassing all top-performing sequence-to-sequence baselines, including those employing larger pre-trained language models like BART.
Additionally, it is worth noting that our approach holds potential for further enhancement through techniques such as ensemble parsing with a ranker, as developed by Einolghozati et al. [22], or by specifically fine-tuning a RoBERTa-based encoder for task-oriented semantic parsing, as employed by the strongest sequence-to-sequence models. Finally, incorporating hierarchical semantic information, as successfully implemented by Do et al. [33], is another avenue for improving our approach.

Fig. 4 Performance comparison of the three transition systems relative to utterance length and structural factors. Panels (c) and (d) report the EM and labeled $F_1$ scores achieved by each algorithm on queries with varying numbers of intents per utterance; panel (f) carries the axis label "label (avg. span length)".
• $T$ signifies the set of available transitions (or actions) that can be applied to move the parser from one state configuration to another.
Moreover, during training, a rule-based oracle $o$, given the gold parse tree $Y_g$, selects action $a_t$ for each state configuration $c_t$ at each time step $t$: $a_t = o(c_t, Y_g)$. Once the model is trained, it approximates the oracle during decoding.
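For the top-down transition system, the gold action sequence can be read directly off the bracketed gold tree. The sketch below is a simplified static oracle that produces the whole sequence at once rather than one action per state configuration; the function name and the action strings are assumptions made for illustration.

```python
def top_down_oracle(gold_tree: str):
    """Derive the top-down transition sequence from a bracketed TOP tree."""
    actions = []
    for tok in gold_tree.split():
        if tok.startswith("["):
            actions.append("NT-" + tok[1:])  # open a new constituent
        elif tok == "]":
            actions.append("REDUCE")  # close the most recent open constituent
        else:
            actions.append("SHIFT")  # move the next word to the stack
    return actions


gold_actions = top_down_oracle("[IN:PLAY_MUSIC play [SL:TITLE thriller ] ]")
```

The bottom-up and in-order oracles need a different traversal order, since their non-terminal decisions come after some or all of the children have been processed.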

Table 1
Top-down transition sequence and state configurations (represented by the stack and the buffer) for producing the TOP tree in Fig. 1(a). NT-L stands for Non-Terminal-L and slot labels have been abbreviated from SL:MUSIC_TRACK_TITLE and SL:MUSIC_ARTIST_NAME to SL:TITLE and SL:ARTIST, respectively.

Table 2
Transition sequence and state configurations (represented by the stack, buffer and variable $f$) for building the TOP semantic representation in Fig. 1(a) following a non-binary bottom-up approach. Re#k-L stands for Reduce#k-L.
A Reduce#k-L transition (parameterized with the non-terminal label L and an integer k) is used to create a new constituent by popping k items from the stack and combining them into a new constituent on top of the stack. This transitions the parser from state configurations $\langle \Sigma | e_{k-1} | \dots | e_0, B, \text{false} \rangle$ to $\langle \Sigma | L_{e_{k-1} \dots e_0}, B, \text{false} \rangle$. To ensure a valid TOP representation, this transition can only be applied under the following conditions:
* When the Reduce#k-L action creates an intent constituent (i.e., L is prefixed with IN:), it is permissible only if there are no intent nodes among the k items popped from the stack (as an intent constituent cannot have other intents as child nodes).
* When the Reduce#k-L transition builds a slot node (i.e., L is prefixed with SL:), it is allowed only if there are no slot constituents among the k elements affected by this operation (as slots cannot have other slots as child nodes). Additionally, if the item on top of the stack is an intent node, only the Reduce action with k equal to 1 is permissible (since slots can only contain a single intent constituent as a child node).
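These preconditions can be expressed as a small validity check. The sketch below encodes only the conditions stated here; the function names and the `(kind, value)` representation of stack items are assumptions, and the boolean variable $f$ of the state configuration is not modeled.

```python
def is_node(item, prefix):
    """True if the stack item is a built constituent whose label has prefix."""
    kind, value = item
    return kind == "node" and value.startswith(prefix)


def reduce_allowed(stack, k, label):
    """Check whether Reduce#k-L may be applied to the current stack.

    stack holds ("word", text) or ("node", label) items, top at the end.
    """
    if k < 1 or k > len(stack):
        return False
    popped = stack[-k:]
    if label.startswith("IN:"):
        # An intent cannot have other intents as child nodes.
        return not any(is_node(item, "IN:") for item in popped)
    if label.startswith("SL:"):
        # A slot cannot have other slots as child nodes.
        if any(is_node(item, "SL:") for item in popped):
            return False
        # An intent on top of the stack must be the slot's only child.
        if is_node(stack[-1], "IN:"):
            return k == 1
        return True
    return False


stack = [("word", "directions"), ("word", "to"), ("node", "IN:GET_EVENT")]
slot_k1 = reduce_allowed(stack, 1, "SL:DESTINATION")
slot_k2 = reduce_allowed(stack, 2, "SL:DESTINATION")
intent_k3 = reduce_allowed(stack, 3, "IN:GET_DIRECTIONS")
```

During decoding, a check of this kind would be used to filter the candidate actions before the Softmax layer selects among them, so that only valid TOP trees can be produced.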

Table 5
Average performance across 3 runs on TOP and TOPv2* test splits. Standard deviations are reported with ±.