Action Grammars: A Cognitive Model for Learning Temporal Abstractions

Hierarchical Reinforcement Learning algorithms have been applied successfully to temporal credit assignment problems with sparse reward signals. However, state-of-the-art algorithms require manual specification of subtask structures, rely on a sample-inefficient exploration phase, and lack semantic interpretability. Human infants, on the other hand, efficiently detect hierarchical sub-structures induced by their surroundings. In this work we propose a cognitively inspired Reinforcement Learning architecture which uses grammar induction to identify sub-goal policies. More specifically, by treating an on-policy trajectory as a sentence sampled from the policy-conditioned language of the environment, we identify hierarchical constituents with the help of unsupervised grammatical inference. The resulting set of temporal abstractions is called action grammars (Pastra & Aloimonos, 2012) and can be used to enable efficient imitation, transfer and online learning.


Introduction
Genetically inherited inductive biases enable human infants to infer hierarchical rule-based structures from language, visual input as well as auditory stimuli (M. C. Frank, Slemmer, Marcus, & Johnson, 2009; Marcus, Fernandes, & Johnson, 2007). Several MEG and fMRI studies provide evidence for a universal process of hierarchical language comprehension in the brain (S. L. Frank & Christiansen, 2018; Brennan, Stabler, Van Wagenen, Luh, & Hale, 2016; Nelson et al., 2017) that extends to motor control (Pastra & Aloimonos, 2012; Stout, Chaminade, Thomik, Apel, & Faisal, 2018). By processing trajectories of an expert, the infant is able to learn policies over higher-level sequences of low-level control elements. Inspired by these observations, this work proposes to overcome the problem of sub-structure discovery in Hierarchical Reinforcement Learning (HRL) by making use of grammatical inference. More specifically, the HRL agent uses grammar induction to extract hierarchical constituents from trajectory sentences. The proposed solution to the credit assignment problem is split into two alternating stages (see fig. 1):
1. Grammar Learning: Given episodic trajectories, we treat the time series of transitions as a sentence sampled from the language of the policy-conditioned environment. Using grammar induction algorithms (Nevill-Manning & Witten, 1997), the agent extracts hierarchical constituents of the current policy. Based on the estimated production rules, temporally-extended actions are constructed which convey goal-driven syntactic meaning. The grammar can be inferred efficiently (in linear time) and provides enhanced interpretability.
2. Action Learning: Using the grammar-augmented action space, the agent acquires new value information by sampling reinforcement signals from the environment. It refines its action-value estimates using Semi-Markov Decision Process (SMDP) Q-Learning (Bradtke & Duff, 1995). By operating at multiple time scales, the HRL agent is able to overcome difficulties in exploration and value information propagation. After action learning, the agent samples simulated sentences by rolling out transitions from the improved policy.
By alternating between stages of grammar and action-value learning, the agent iteratively reflects and improves on its behavior in a semi-supervised manner. The inferred grammar parse trees are easy to interpret and provide semantically meaningful sub-policies. Our experiments highlight the effectiveness of the action grammars framework for imitation, curriculum and transfer learning given an expert policy rollout. Furthermore, we show promising results for an online version which iteratively refines grammar and value estimates.

Background
Temporal Abstractions. SMDPs extend Markov Decision Processes to account for not only reward and transition uncertainty but also time uncertainty. The time between individual decisions is modeled as a random variable, $\tau \in \mathbb{Z}_{++}$.
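The waiting time is characterized by the joint likelihood of transitioning from state $s \in S$ to state $s'$ in $\tau$ time steps given that action $m$ was pursued, $P(s', \tau \mid s, m)$. Thereby, SMDPs allow one to elegantly model the execution of actions which extend over multiple time steps. A macro-action (McGovern, Sutton, & Fagg, 1997), $m \in M$, specifies the sequential and deterministic execution of multiple ($\tau_m$) primitive actions. Let $r_{\tau_m} = \sum_{i=1}^{\tau_m} \gamma^{i-1} r_{t+i}$ denote the accumulated and discounted reward for executing a macro. Value estimates can then be updated using SMDP-Q-Learning (Parr, 1998) in a model-free bootstrapping-based manner, which in its standard form with learning rate $\alpha$ reads:
$$Q(s, m) \leftarrow (1 - \alpha)\, Q(s, m) + \alpha \left[ r_{\tau_m} + \gamma^{\tau_m} \max_{m' \in M} Q(s', m') \right].$$
The DQN (Mnih et al., 2015) objective can then be adapted to the semi-Markov case, with $\theta$ the online parameters and $\theta^{-}$ the target-network parameters:
$$L(\theta) = \mathbb{E}_{\{s, m, r_{\tau_m}, s', \tau_m\} \sim D_M} \left[ \left( r_{\tau_m} + \gamma^{\tau_m} \max_{m' \in M} Q(s', m'; \theta^{-}) - Q(s, m; \theta) \right)^2 \right].$$
The gradient with respect to the parameters is approximated by Monte Carlo samples from the experience replay (ER; Lin, 1992) buffer, $\{s, m, r_{\tau_m}, s', \tau_m\} \sim D_M$. The learning dynamics can be stabilized by making use of a target network and gradient clipping.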
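To make the semi-Markov update concrete, the following is a minimal tabular sketch. It assumes the standard SMDP Q-learning rule, in which the bootstrap term is discounted by γ raised to the macro duration. The function names, the dictionary-based Q-table and the toy transition are illustrative assumptions and not part of the original work.

```python
from collections import defaultdict


def discounted_macro_return(rewards, gamma):
    """Accumulated discounted reward: sum_i gamma^(i-1) * r_{t+i}."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))


def smdp_q_update(Q, s, m, rewards, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular SMDP Q-learning step for a macro that ran len(rewards) steps."""
    tau = len(rewards)  # macro duration tau_m
    r_tau = discounted_macro_return(rewards, gamma)
    best_next = max(Q[(s_next, a)] for a in actions)
    # The bootstrap value is discounted by gamma^tau instead of gamma.
    target = r_tau + gamma ** tau * best_next
    Q[(s, m)] = (1 - alpha) * Q[(s, m)] + alpha * target
    return Q[(s, m)]


# Hypothetical toy usage: macro "ab" takes two steps and earns reward 1 on the second.
Q = defaultdict(float)
actions = ["a", "b", "ab"]
new_value = smdp_q_update(Q, s=0, m="ab", rewards=[0.0, 1.0], s_next=2, actions=actions)
```

A primitive action is the special case of a macro with duration one, so the same function covers both kinds of transitions.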
Context-Free Grammars. Given a start symbol S, a formal grammar (Σ, N, S, P) produces a set of output strings. Production rules P map symbols of the non-terminal vocabulary N either to other non-terminals or to terminal strings over the terminal vocabulary Σ. Context-free grammars (CFGs) (Chomsky, 1959) constrain the set of productions to map either one-to-one, one-to-none or one-to-many. A non-branching and loop-free CFG is called a straight-line grammar. Given a sample of sentences, grammar induction infers a consistent language grammar. Sequitur (Nevill-Manning & Witten, 1997) sequentially reads in all symbols and collects repeating subsequences of symbols into a production rule. The final encoded string is only allowed to have unique bigrams, and inferred production rules must be used more than once in the derivation of the string. In order to overcome Sequitur's tendency to overfit noise, k-Sequitur (Stout et al., 2018) has been proposed. Instead of replacing a bigram with a rule once it occurs twice, the bigram has to occur at least k times. As k increases, the grammar becomes less prone to overfitting and more parsimonious in terms of production rules.
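The effect of the threshold k can be illustrated with a small offline sketch. Note that this is not Sequitur itself: it uses a simpler greedy scheme (in the style of Re-Pair) that repeatedly replaces the most frequent bigram as long as it occurs at least k times, and it does not enforce Sequitur's rule-utility constraint. The names `induce_grammar`, `count_bigrams` and the rule labels `R1`, `R2` are assumptions made for illustration.

```python
from collections import Counter


def count_bigrams(seq):
    """Count non-overlapping occurrences of each adjacent symbol pair."""
    counts, last_pos = Counter(), {}
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        # Skip an occurrence that overlaps the previous one (e.g. in "aaa").
        if last_pos.get(pair, -2) == i - 1:
            continue
        counts[pair] += 1
        last_pos[pair] = i
    return counts


def replace_pair(seq, pair, symbol):
    """Replace every left-to-right occurrence of pair by a non-terminal."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def induce_grammar(trace, k=2):
    """Greedy straight-line grammar: a bigram becomes a rule only if seen >= k times."""
    seq, rules = list(trace), {}
    while True:
        counts = count_bigrams(seq)
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < k:
            break
        symbol = f"R{len(rules) + 1}"
        rules[symbol] = pair
        seq = replace_pair(seq, pair, symbol)
    return seq, rules


encoded, rules = induce_grammar("abcabcabc", k=2)
strict_encoded, strict_rules = induce_grammar("abcabcabc", k=4)
```

With k=2 the toy trace is compressed into two nested rules, whereas k=4 leaves it untouched because no bigram occurs four times. This mirrors the regularizing role of k described above.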

Context-Free Action Grammars
Just like communication, action sequences convey goal-directed semantic meaning. They consist of hierarchical structures and are conditioned on the environment in which they are uttered. Furthermore, many real-world problems require a hierarchy of subgoal achievements which increase in sequential difficulty and timescale. A trajectory obtained from traversing the current policy $\pi$ can be viewed as a sample from the language generated by the policy-specific grammar, $L(\pi \mid E)$. Let the terminal vocabulary Σ consist of the primitive action space $A$, hence $\Sigma = A$. We denote the sampled trajectories by $\vartheta_i \sim L(\pi \mid E)$ for $i = 1, \dots, N_g$. Given a set of trajectories, a CFG estimate $\hat{G}$ can be inferred and the resulting production rules transformed into macro-actions $M_{\hat{G}}$ by recursively flattening the non-terminals. The action space of the agent is then augmented such that $A_{\hat{G}} = A \cup M_{\hat{G}}$. Depending on the generating policy of the compressed traces, we propose several grammar-based HRL agents.
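The recursive flattening of non-terminals into macro-actions can be sketched as follows. The rule set reuses the productions quoted in the experiments section (B → CEd, C → baf, E → bc, reading the extracted "ba f" as "baf"). The dictionary representation and the function names are assumptions made for illustration, and the sketch relies on the grammar being straight-line (loop-free) so that the recursion terminates.

```python
def flatten(symbol, rules, cache=None):
    """Recursively expand a symbol into its string of primitive actions."""
    if cache is None:
        cache = {}
    if symbol not in rules:  # terminal symbol, i.e. a primitive action
        return symbol
    if symbol not in cache:
        cache[symbol] = "".join(flatten(s, rules, cache) for s in rules[symbol])
    return cache[symbol]


def augment_action_space(primitives, rules):
    """Return the primitive actions followed by the flattened grammar macros."""
    macros = sorted({flatten(nt, rules) for nt in rules}, key=lambda m: (len(m), m))
    return list(primitives) + [m for m in macros if m not in primitives]


# Production rules as nested symbol lists; lowercase letters are primitive actions.
rules = {"B": ["C", "E", "d"], "C": ["b", "a", "f"], "E": ["b", "c"]}
primitives = list("abcdef")

flat_b = flatten("B", rules)
augmented = augment_action_space(primitives, rules)
```

Each macro is stored as a plain string of primitive actions, so executing it amounts to replaying its characters in order.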
Expert & Transfer Grammars. If the traces $\vartheta_i$ are sampled from the language $L(\pi^{*} \mid E)$ generated by the optimal policy, the agent can use the resulting grammar macros in an imitation learning setting. Before the onset of the first value learning stage, the action space is augmented with the optimal productions. Furthermore, an agent faced with learning a curriculum of tasks can make use of the optimal grammar of an easier, already solved task. Skills universal to all tasks do not have to be re-learned at every stage. Instead, the inferred optimal grammar provides an effective knowledge structure which accelerates the agent's learning process.
Online Inferred Grammars. If an episode terminates successfully, the grammar inference process identifies repeating sub-goal achieving patterns. We hypothesize that by extracting action grammar sub-sequences, one compresses the temporal dimension of the credit assignment problem. After each grammar compression step, the action space is augmented with a new set of grammar macros. The previous set becomes inactive. In order to preserve value estimates between updates, we propose three solutions: (1) Transfer learning (Oquab, Bottou, Laptev, & Sivic, 2014; see fig. 2): To accommodate the variable set of grammar-inferred skills, the size of the DQN output layer has to be updated. Transferring the value-relevant feature detectors between action space augmentations allows the agent to reuse the previously learned value characteristics. (2) Grammar ER Buffer: It is necessary to maintain a grammar-altered buffer system in order to store transition tuples specific to previously inferred macro-actions. At any given point the agent can only sample macro transitions which are associated with the currently active set of grammar macros. Thereby, sample efficiency is increased once a grammar macro is repeatedly inferred. (3) Intra-Macro Updates: During the execution of a macro-action, one stores the overall macro transition tuple $\langle s_t, m_t, r_{t+\tau_m}, s_{t+\tau_m+1}, \tau_m, \text{"on"} \rangle$ as well as the individual transitions $\{\langle s_i, a_i, r_i, s_{i+1}, 1, \text{"on"} \rangle\}_{i=t}^{t+\tau_m}$. Thereby the agent is able to exploit all gathered transition experiences throughout the overall learning process. The length of the sampled trace is going to increase or decrease over the course of the learning procedure. The regularization parameter of the k-Sequitur grammar inference algorithm has to be adapted accordingly.
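The buffer logic behind solutions (2) and (3) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the class name `GrammarReplayBuffer`, its methods, and the choice to track activity through a set of active macros (instead of an "on" flag stored in each tuple) are all hypothetical.

```python
import random
from collections import defaultdict, deque


class GrammarReplayBuffer:
    """Replay buffer that keeps macro transitions keyed by the macro that produced them."""

    def __init__(self, capacity=10_000, seed=0):
        self.primitive = deque(maxlen=capacity)
        self.by_macro = defaultdict(lambda: deque(maxlen=capacity))
        self.active = set()
        self.rng = random.Random(seed)

    def set_active_macros(self, macros):
        """Called after each grammar update; older macro data is kept but hidden."""
        self.active = set(macros)

    def push_macro(self, s, macro, steps, s_next, gamma=0.99):
        """steps: list of (s_i, a_i, r_i, s_next_i) collected while running the macro."""
        r_tau = sum(gamma ** i * r for i, (_, _, r, _) in enumerate(steps))
        self.by_macro[macro].append((s, macro, r_tau, s_next, len(steps)))
        # Intra-macro updates: also store every primitive transition.
        for s_i, a_i, r_i, s_n in steps:
            self.primitive.append((s_i, a_i, r_i, s_n, 1))

    def sample(self, batch_size):
        """Sample from primitive transitions plus transitions of active macros only."""
        pool = list(self.primitive)
        for macro in sorted(self.active):
            pool.extend(self.by_macro.get(macro, ()))
        return self.rng.sample(pool, min(batch_size, len(pool)))


buf = GrammarReplayBuffer()
buf.set_active_macros({"ba"})
buf.push_macro(0, "ba", [(0, "b", 0.0, 1), (1, "a", 1.0, 2)], s_next=2, gamma=0.5)
with_macro = buf.sample(10)
buf.set_active_macros({"cd"})      # a new grammar deactivates "ba"
without_macro = buf.sample(10)
buf.set_active_macros({"ba"})      # "ba" is inferred again and its data is reused
reactivated = buf.sample(10)
```

Because deactivated macro transitions are retained rather than discarded, a macro that is inferred again can immediately draw on its earlier experience, which is the source of the sample-efficiency gain described in (2).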

Experiments
The following experiments aim to answer three questions: (1) Does a grammar learned from optimal policy rollouts allow for rapid imitation learning? (2) Can CFGs be used to enhance curriculum learning by means of transferring previously learned action grammars? (3) Is online grammar inference and action space adaptation able to structure the exploration process of the HRL agent? In order to answer these questions we choose the general N-disk Towers of Hanoi (ToH) environment (see fig. 3) as well as a hierarchically structured gridworld task (see fig. 4).
Solving the N-disk ToH problem requires the agent to identify a hierarchical and recursive principle. By moving n − 1 disks onto an auxiliary pole and the n-th disk onto the target pole, the agent is able to solve the sparse reward problem. Since such a routine can easily be formulated within a grammar parse tree, we hypothesize that the action grammars framework might provide an efficient solution. The gridworld, on the other hand, provides a non-sparse reward design. The agent (red) has to avoid poisonous items (black) and collect food (yellow). Hence, the agent is required to solve a large set of individually smaller subtasks. Finally, the agent has to avoid a terminal collision with the moving blocks (green), whereas the ToH environment rewards the fastest solution.
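The recursive principle can be written down directly, which also yields an optimal action trace that a grammar induction algorithm could compress. The sketch below uses the textbook recursion. The letter encoding of the six pole-to-pole moves in `MOVE_TO_LETTER` is an assumption made for illustration, since the text does not specify which letter denotes which move.

```python
def hanoi_moves(n, source=0, target=2, auxiliary=1):
    """Optimal move list for n disks as (from_pole, to_pole) pairs."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, source, auxiliary, target)    # park n-1 disks on the auxiliary pole
        + [(source, target)]                             # move the n-th disk to the target
        + hanoi_moves(n - 1, auxiliary, target, source)  # bring the n-1 disks on top of it
    )


# Hypothetical encoding of the six possible pole-to-pole moves as primitive actions.
MOVE_TO_LETTER = {
    (0, 1): "a", (0, 2): "b", (1, 0): "c",
    (1, 2): "d", (2, 0): "e", (2, 1): "f",
}


def optimal_trace(n):
    """Optimal policy rollout for the n-disk problem as a sentence of action letters."""
    return "".join(MOVE_TO_LETTER[move] for move in hanoi_moves(n))


trace_3 = optimal_trace(3)
trace_5 = optimal_trace(5)
```

The trace for n disks contains 2^n − 1 moves and embeds the traces of smaller problems as sub-sequences, which is the kind of repetition that grammar induction can turn into production rules.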
Learning with Expert & Transfer Grammars. The right-hand side of figure 3 shows the grammar and resulting macros inferred from a trace of the optimal policy for the 5-disk ToH problem using 2-Sequitur. The flattened production rule B → CEd → bafbcd captures the recursive nature learned by the grammar. C → baf moves two disks onto the auxiliary pole, while E → bc moves a third disk from source to target pole and one disk back onto the source pole. The Expert Grammar HRL agent's action space is augmented with the flattened productions, so that $A_{\hat{G}} = A \cup M_{\hat{G}}$. The grammar macros accelerate the learning progress and reduce the variance of policy rollouts. We hypothesize that this is due to the temporal compression of the sequential problem provided by the macro grammars. Finally, the Transfer Grammar agent is capable of transferring the knowledge distilled in a simpler optimal grammar (4 disks) to a more complex setting (6 disks). The gridworld Grammar-DQN agent (see fig. 6) again infers a set of macro-actions from a single expert rollout. Afterwards, the output layer and action space are augmented. The fixed architecture of the DQN is a two-layer multilayer perceptron with 128 hidden units, trained using Adam (Kingma & Ba, 2014) with a batch size of 32. The two Expert Grammar-DQN agents differ in the number of macro-actions (the top two and top four most used productions in the encoded policy trace) inferred with 2-Sequitur on a converged DQN agent rollout. Again, the expert grammar-endorsed agent is significantly accelerated in its initial learning progress. The two Transfer Grammar-DQN agents, on the other hand, infer a set of two grammar macros from a single policy rollout of a separate, sub-optimal DQN agent (trained for 25 or 75 episodes). Our experiments show that even with noisy, non-optimal rollouts the grammar agents are able to exploit the inferred structure of the environment.
Learning with Online Inferred Grammars. Figure 7 displays the results of the online grammar inference framework for the gridworld task. Every 500 optimization steps the DQN agent infers a new set of grammar macros from a self-rollout using 2-Sequitur. We augment the action space with the top two most used flattened production rules in the trace compression. The learning dynamics provide a competitive extension to the general DQN framework. We want to emphasize the relationship between grammar inference and exploration. In our experiments we found that the frequency of grammar updating as well as the grammar inference hyperparameters play a crucial role.
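Selecting the most used productions from a compressed trace can be sketched as follows. How usage is counted is an assumption here: the sketch counts every appearance of a non-terminal in the encoded trace and in the bodies of other rules, and breaks ties alphabetically. The function names and the toy grammar are hypothetical.

```python
from collections import Counter


def flatten(symbol, rules):
    """Expand a symbol into primitive actions (assumes a loop-free grammar)."""
    if symbol not in rules:
        return symbol
    return "".join(flatten(s, rules) for s in rules[symbol])


def top_k_macros(encoded, rules, k=2):
    """Return the k most used production rules as flattened macro-actions."""
    usage = Counter(s for s in encoded if s in rules)
    for body in rules.values():
        usage.update(s for s in body if s in rules)
    # Sort by descending usage, then by name for a deterministic order.
    ranked = sorted(usage, key=lambda nt: (-usage[nt], nt))
    return [flatten(nt, rules) for nt in ranked[:k]]


# Toy compressed trace: R2 appears three times, R1 once directly and once inside R2.
rules = {"R1": ["a", "b"], "R2": ["R1", "c"]}
encoded = ["R2", "R1", "R2", "c", "R2"]
macros = top_k_macros(encoded, rules, k=2)
```

In an online setting, a routine of this kind would be called after each periodic grammar inference step, and the returned macros would replace the previously active set.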

Conclusion
Inspired by hierarchical parse trees of sequential behavior, we introduced a novel cognitive decision-making framework which exploits grammatical inference to identify temporally-extended actions. Our contributions are the following: (1) + (2) CFG-based HRL agents provide efficient and interpretable solutions to imitation and transfer learning tasks. (3) Alternating between grammar updates and learning action values is an effective method to learn an optimal grammar as well as an optimal policy online.
In future work we are interested in exploring stochastic grammars as well as their incorporation into model-based RL approaches. Ultimately, we envision a dictionary of action sequences which provides an expandable library of skills for agents which act in diverse naturalistic environments. This could provide a major contribution to a key endeavor in general artificial intelligence: life-long learning.