Learning to Apply Schematic Knowledge to Novel Instances

Humans have schematic knowledge of how certain types of events unfold (e.g. coffee shop visits) that can readily be generalized to new instances of those events. Schematic knowledge allows humans to perform role-filler binding, the task of associating schematic roles (e.g. "barista") with specific fillers (e.g. "Bob"). Here we examined whether and how recurrent neural networks learn to do this. We procedurally generated stories from an underlying generative graph, and trained networks on role-filler binding question-answering tasks. We tested whether networks can learn to maintain filler information on their own, and whether they can generalize to fillers that they have not seen before. We studied networks by analyzing their behavior and decoding their memory states. We found that a network's success in learning role-filler binding depends on both the breadth of roles introduced during training and the network's memory architecture. In our decoding analyses, we observed a close relationship between the information we could decode from various parts of the network architecture and the information the network could recall.


Introduction
Humans have a powerful ability to learn the structural relationships that underlie similar events, and to use this knowledge to organize and guide cognition (Bower, Black, & Turner, 1979). For example, we can learn the schematic structure of "coffee shops" from our experiences at coffee shops. Even though none of these experiences are exactly alike, they share some underlying structure, and learning this schematic structure allows us to effectively draw inferences from partial information over the possible states. For instance, given the sentence "Alice ordered a green tea from Bob", we can use our schematic knowledge of coffee shops to infer that "Alice" is a customer and "Bob" is the barista, even though we have no idea who "Alice" and "Bob" are. This kind of inferential process critically relies on an operation which binds a specific "filler" value (e.g. "Alice") to a known structural "role".
Here we examined whether and how recurrent neural networks can learn schematic knowledge and use it to perform role-filler binding. Our approach involves procedurally generating stories from an underlying generative graph, and giving networks sample instantiations of these stories along with questions that probe the networks' ability to perform role-filler binding. Our tasks require the network to store and retrieve schematic information from examples, and to apply this information to solve a role-filler binding question-answering task.
Prior work has shown that neural networks can perform role-filler binding when they are explicitly told what filler information to maintain (St. John & McClelland, 1990). We tested whether neural networks can achieve a more generalized form of this learning: specifically, whether networks can learn to maintain filler information on their own, and whether they can generalize to new fillers that they have not seen before. We studied networks by analyzing their behavior and also decoding their memory states, to understand whether and how they perform this generalized form of schema learning.
In our analyses of network behavior, we found that a network's success in learning role-filler binding depends on both the breadth of roles introduced during training and the network's memory architecture. In our decoding analyses, we observed a close relationship between the information we could decode from various parts of the network architecture and the information the network could recall.
arXiv:1902.09006v1 [cs.AI] 24 Feb 2019

Methods

Schema Learning
We used stories as a test of schema learning. To understand a story, we must track abstract roles (e.g. the main character and their friend), and match these roles to the concrete instantiations that fill them (e.g. "Alice" and "Bob"). We constructed stories using Coffee Shop World, a generator that writes stories based on predefined rules (Coffee Shop World, n.d.). Coffee Shop World models stories as a graph in which nodes represent states of the story and edge weights represent transition probabilities between different states. Each state includes fixed text and variable "roles", which are substituted with specific "fillers" of a certain "entity" type in each instance of the story. For instance, an "Order food" state might read "[Subject] ordered a plate of [Dessert]", and in a specific instance of the story the roles [Subject] and [Dessert] would be instantiated with specific fillers of entity type Person and Food, such as "Alice" and "sandwiches". Given a schema that defines states, transitions between them, and possible fillers for each role in the story, Coffee Shop World probabilistically generates stories that are instances of a given schema.
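The generative process described above can be illustrated with a minimal sketch. This is not the Coffee Shop World code: the state names, sentences, fillers, transition probabilities, and the function name `generate_story` are all hypothetical, chosen only to show how a weighted state graph and a single role-to-filler binding per story yield story instances.

```python
import random

# Hypothetical toy schema. State names, sentences, fillers, and transition
# probabilities are illustrative assumptions, not the actual Coffee Shop World schema.
STATES = {
    "begin": ("[Subject] walked into the coffee shop .", {"order_drink": 0.5, "order_food": 0.5}),
    "order_drink": ("[Subject] ordered a cup of [Drink] .", {"order_food": 1.0}),
    "order_food": ("[Subject] ordered a plate of [Dessert] .", {"end": 1.0}),
    "end": ("[Subject] went home .", {}),
}
FILLERS = {
    "Subject": ["Alice", "Bob"],
    "Drink": ["tea", "coffee"],
    "Dessert": ["sandwiches", "cookies"],
}


def generate_story(rng):
    """Walk the graph from "begin", binding each role to one filler for the whole story."""
    bindings = {role: rng.choice(options) for role, options in FILLERS.items()}
    state, sentences = "begin", []
    while state is not None:
        text, transitions = STATES[state]
        # Substitute every bracketed role with the filler bound to it in this story.
        for role, filler in bindings.items():
            text = text.replace("[" + role + "]", filler)
        sentences.append(text)
        if transitions:
            names = list(transitions)
            # Sample the next state according to the outgoing edge weights.
            state = rng.choices(names, weights=[transitions[n] for n in names])[0]
        else:
            state = None  # terminal state: no outgoing edges
    return " ".join(sentences), bindings


story, bindings = generate_story(random.Random(0))
print(story)
print(bindings)
```

Because the bindings are sampled once per story, the same filler appears wherever its role recurs, which is what makes a query such as QSubject answerable from the story text.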

Neural Network Architectures
We looked at recurrent neural networks (RNNs), a class of neural network architectures with weights that form directed cycles. These cycles form feedback loops that allow the networks to maintain an internal state. RNNs have shown success on a wide range of tasks such as speech recognition (Graves, Mohamed, & Hinton, 2013) and language modeling (Mikolov & Zweig, 2012).
In this paper we present experiments with four neural network architectures: a standard recurrent neural network (RNN), Long Short-Term Memory (LSTM), Fast Weights, and a reduced Neural Turing Machine (NTM). We used layer normalization for the RNN, LSTM, and Fast Weights architectures. This re-centers and re-scales the networks' layers and serves to stabilize the network dynamics (J. L. Ba, Kiros, & Hinton, 2016).
Each network updates its weights over the course of training on many examples; these long-term updates reflect the network's learning of the structure of the Coffee Shop World. As these are all recurrent neural networks, they also have mechanisms to store short-term information, where we define "short-term" as the course of an input sequence and "long-term" as the course of a network's lifetime. These four architectures have distinct forms of this short-term memory. The LSTM is an RNN with gates to control what the internal state stores, forgets, and displays to the rest of the network (Hochreiter & Schmidhuber, 1997). These gated hidden states provide a "long short-term memory", with which the network learns to save, forget, and output information. The Fast Weights architecture is an RNN with a matrix of quickly changing "fast weights" that enable auto-associative memory (J. Ba, Hinton, Mnih, Leibo, & Ionescu, 2016). The reduced NTM is an RNN with an LSTM "controller" that learns to read from and write to an external buffer (Graves et al., 2016). This allows the network to use an external "mental scratchpad" to store and retrieve short-term memories. The standard form of the NTM architecture includes shift weights that allow the network to iterate through a sequence of addresses in the external memory buffer. As a network that learns a schema should be able to perform the role-filler binding task without shift weights, we remove this feature to form the reduced NTM, which we use in our experiments.

Tests of Network Behavior
We tested networks' ability to perform role-filler binding tasks. Given an input containing a story and a query specifying a role, the networks must return the corresponding filler. For instance, given the input "Alice ordered a plate of sandwiches ?QSubject", the network should return "Alice", the filler for the Subject role.
In our experiments, we used the schema corresponding to the story graph in Figure 1 and the state definitions in Table 1. This schema contains six roles, which correspond to the tasks QDessert, QDrink, QEmcee, QFriend, QPoet, and QSubject (where "QX" denotes the task of identifying the filler corresponding to role X). Note that the QSubject task is the easiest, as the Subject always occurs as the second word in the story. Other roles do not always occur at a fixed location; for instance, the appearance of the Friend in the "Sit down" state has three possible locations (measured from the start of the story), depending on whether the story enters the "Order drink" and the "Too expensive" states. The QDessert and QEmcee tasks are the hardest: the Dessert and Emcee do not occur in every story, and they do not have a fixed location even when they do occur in a story.
Table 1: Story states for role-filler binding experiments. We provide the text of each state of the story, where the bracketed roles are substituted by specific fillers in each story.

Input Sequence
We represented each word as a randomly generated 50-dimensional vector, and sequentially fed the network the words of the input. Upon receiving the entire input, the network outputs a 50-dimensional vector. We computed the cosine similarity between this output vector and each word vector in the experiment's corpus, taking the most similar corpus vector as the network's prediction. This results in a chance accuracy rate of 2.3% for Experiments 1 and 2, and of 1.3% for Experiment 3 (details are provided in Section 6.2.1 in the Supplemental Material).
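The prediction rule described above amounts to a nearest-neighbor lookup under cosine similarity. The following is a minimal sketch rather than the authors' code: the function name `predict_word`, the toy corpus, and the use of standard normal vectors are assumptions (the text says only that the word vectors are randomly generated).

```python
import numpy as np


def predict_word(output, corpus):
    """Return the corpus word whose vector has the highest cosine similarity to `output`."""
    words = list(corpus)
    vectors = np.stack([corpus[w] for w in words])
    # Cosine similarity between the output and every corpus vector.
    sims = vectors @ output / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(output))
    return words[int(np.argmax(sims))]


rng = np.random.default_rng(0)
# Assumption: 50-dimensional standard normal embeddings for a tiny illustrative corpus.
corpus = {w: rng.standard_normal(50) for w in ["Alice", "Bob", "sandwiches", "ordered"]}
# A stand-in for a network output that lies close to the "Alice" vector.
noisy_output = corpus["Alice"] + 0.1 * rng.standard_normal(50)
prediction = predict_word(noisy_output, corpus)
print(prediction)
```

Under this rule a network that outputs an arbitrary vector picks each of the N corpus words with roughly equal probability, which is why the chance rate is about 1/N.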
Our experiments fall into two types: fixed and variable fillers.
In "fixed-filler" experiments, we randomly generated a small, finite set of fillers. During training and testing, roles in input stories are substituted with fillers chosen from this set. This category of experiments splits into two further categories: previously seen and previously unseen vectors. In the "previously seen vectors" case, networks draw from the same pool of fillers during training and testing. In the "previously unseen vectors" case, networks draw from non-overlapping pools of fillers during training and testing; during testing, the network must perform role-filler binding on inputs with fillers it has never seen before. In fixed-filler experiments we ensured that the train and test sets contain distinct input sequences.
In "variable-filler" experiments, we randomly generated a new filler vector for each role for each input story during both training and testing, resulting in a large set of train fillers. Therefore, in both training and testing, the network is continuously asked to perform role-filler binding on inputs containing previously unseen filler vectors.
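As a rough illustration of the variable-filler setup, the sketch below embeds a story frame while drawing a fresh random vector for every role in every story. The function name `embed_with_fresh_fillers`, the bracketed role tokens, and the standard normal distribution are assumptions made for this example; the non-role words keep fixed vectors across stories.

```python
import numpy as np


def embed_with_fresh_fillers(frame, fixed_vectors, rng, dim=50):
    """Embed one story frame, drawing a new random vector for every role it contains."""
    fillers = {}
    rows = []
    for token in frame:
        if token.startswith("["):  # role placeholder such as "[Subject]"
            if token not in fillers:
                fillers[token] = rng.standard_normal(dim)  # new filler for this story only
            rows.append(fillers[token])
        else:
            rows.append(fixed_vectors[token])  # ordinary words share vectors across stories
    return np.stack(rows), fillers


rng = np.random.default_rng(0)
frame = "[Subject] ordered a plate of [Dessert]".split()
fixed_vectors = {w: rng.standard_normal(50) for w in frame if not w.startswith("[")}
story_a, fillers_a = embed_with_fresh_fillers(frame, fixed_vectors, rng)
story_b, fillers_b = embed_with_fresh_fillers(frame, fixed_vectors, rng)
print(story_a.shape, np.allclose(fillers_a["[Subject]"], fillers_b["[Subject]"]))
```

Two stories built from the same frame therefore share every vector except the fillers, so the network cannot succeed by memorizing particular filler vectors.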

Decoding Analysis
In addition to analyzing the networks' behavior, we used decoding analyses to examine how networks approach these tasks. We recorded the networks' memory after they receive each word in an input sequence. For the LSTM we recorded the values of hidden state neurons, and for the reduced NTM we recorded the values of the controller's hidden state neurons and of the external memory buffer. For the Fast Weights network, we recorded the values of the hidden state neurons and the associative memory matrix. From this, we obtained a state vector for the network after each time step. We constructed 100 input sequences with the same story frame (a "frame" is a story sequence with unfilled roles) and completed each sequence with distinct fillers. We trained a ridge regression mapping between the state vector and correct output fillers using recordings from 80 of these sequences. Then on each of the remaining 20 sequences, we used this mapping to predict the output filler, and ranked each corpus vector in terms of its cosine similarity with the predicted output. We computed the ranking score, 1 − (actual output rank / corpus size), for each test sequence. These decoding scores have a maximum score of 1.0, with a chance rate of 0.5.
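The decoding procedure can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the "states" are synthetic (a rotated, noisy copy of each filler rather than recorded network memory), the ridge penalty `alpha` is arbitrary, and the rank is taken to start at 0 for the best match, which is inferred from the stated maximum score of 1.0. The helper names `fit_ridge` and `ranking_score` are hypothetical.

```python
import numpy as np


def fit_ridge(states, targets, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha I)^-1 X^T Y."""
    d = states.shape[1]
    return np.linalg.solve(states.T @ states + alpha * np.eye(d), states.T @ targets)


def ranking_score(predicted, correct_index, corpus_vectors):
    """1 - rank / corpus size, where rank 0 means the correct vector is the most similar."""
    sims = corpus_vectors @ predicted / (
        np.linalg.norm(corpus_vectors, axis=1) * np.linalg.norm(predicted)
    )
    rank = int(np.sum(sims > sims[correct_index]))  # number of corpus vectors ranked higher
    return 1.0 - rank / len(corpus_vectors)


rng = np.random.default_rng(0)
fillers = rng.standard_normal((100, 50))  # one distinct filler per input sequence
# Assumption: synthetic memory states that linearly encode the filler, plus noise.
mixing, _ = np.linalg.qr(rng.standard_normal((50, 50)))
states = fillers @ mixing + 0.05 * rng.standard_normal((100, 50))

weights = fit_ridge(states[:80], fillers[:80])  # train on 80 sequences
scores = [ranking_score(states[i] @ weights, i, fillers) for i in range(80, 100)]
mean_score = float(np.mean(scores))  # evaluate on the 20 held-out sequences
print(mean_score)
```

When the state carries no information about the filler, the correct vector lands at a random rank and the mean score falls to about 0.5, matching the chance rate quoted above.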

Experiments 1 and 2: Fixed Fillers
We conducted two fixed-filler experiments, testing on sequences with previously seen and previously unseen fillers. For both experiments, we constructed train and test sets with non-overlapping stories. In the first experiment we used shared fillers: during testing, the network had seen each word of the input before, but had never seen this particular combination of words. In the second experiment we used non-overlapping fillers between the train and test sets (this also implies that the train and test sets have non-overlapping stories, since distinct fillers force stories in the train and test set to differ). In the first experiment, there were eight to ten possible fillers for each role. In the second experiment, there were six possible train fillers for each role, and between two and four possible test fillers for each role.
Figure 2, which contains the test accuracy for each architecture in Experiment 1, shows that each architecture is able to do role-filler binding at an above-chance level on a story it has not previously seen, as long as it has seen each of the story's words before. This tells us that basic RNN architectures can learn and apply a schema to scenarios in which they have seen the fillers before, but in a slightly different context. The Fast Weights and reduced NTM architectures performed better than the basic RNN and LSTM architectures, reaching a much higher level of accuracy in a smaller number of training epochs.
Figure 3 contains the train accuracy and test accuracy for each architecture in Experiment 2. The light blue bars (train accuracy) show that each architecture is able to do role-filler binding on a story with fillers that it has seen during training, mirroring the results from Experiment 1. The test accuracy for each architecture was at floor (as evidenced by the absence of dark blue bars), indicating that all architectures fail to generalize to previously unseen fillers, even though networks may have seen test inputs' story frames during training. This tells us that none of the networks we tested, even those with enhanced memory capabilities, succeed in generalizing to role-filler binding on unseen fillers if they are trained on examples containing a small set of fillers.
Figure 2: Each architecture is able to learn role-filler binding on a story it has not previously seen, as long as it has previously seen each of the story's words. The chance accuracy rate is 2.3%, bars denote mean accuracies, and error bars denote maximum and minimum accuracies over three trials. Full learning curves are available in Section 6.5 in the Supplemental Material.

Experiment 3: Variable Previously Unseen Fillers
Next, we conducted a variable-filler experiment, testing on previously unseen fillers. We constructed train and test sets in which we randomly generated new fillers in each example. In this experiment the network continuously received previously unseen filler vectors and was therefore trained on a very large set of fillers. The network must generalize to previously unseen fillers to have above-chance accuracy, in both the train and test sets. In Figure 4a, which contains the test accuracy for each network, all architectures reach above-chance test accuracy (the chance accuracy rate is 1.3%), showing that all architectures perform some amount of generalization when forced to do so during training. The four architectures show varying amounts of success in generalized role-filler binding, with the LSTM reaching a plateau below 100% accuracy and above chance.
We performed a task-based error analysis to shed light on the different amounts of success in generalization. We tracked the test accuracy for each architecture for each task (e.g. QSubject), and found that networks' errors are not evenly spread across tasks; rather, networks learn each task fully or not at all. Figure 4b, which contains the test accuracy for each network broken down by task, shows that the LSTM learns to generalize only on the QSubject task (which is also the easiest task, since the Subject always occurs at the second location in a story), and the RNN does not learn to generalize on any task. The reduced NTM and Fast Weights networks learn to solve all six tasks.
These results show that architectures are able to perform generalized role-filler binding if forced to do so during training, and that some architectures (the ones with enhanced memory capabilities) learn more tasks than architectures with simpler memory architectures.
Figure 3: Role-filler binding fails with previously unseen fillers. We show accuracy on train trials (with previously seen fillers) and test trials (with previously unseen fillers); note that the test accuracy remains at 0 for all networks. Networks fail to recall the filler fulfilling a specified role when they have not been trained on inputs containing the filler, although they succeed in recalling fillers they have previously seen. The chance accuracy rate is 2.3%, bars denote mean accuracies, and error bars denote maximum and minimum accuracies over three trials. Full learning curves are available in Section 6.5 in the Supplemental Material.

Experiment 4: Memory Decoding
We performed decoding analyses on the four architectures trained in Experiment 3. Figure 5 contains the results of these analyses, showing decoding accuracy at each time step when discriminating between the correct filler and other fillers (the figure shows the mean rank for the correct answer relative to other answers across the 20 test input sequences).
The RNN is unable to solve any of the six tasks (as shown in Figure 4b), and its decoding scores hover around the chance rate (50%) for all tasks, as we show in Figure 5a. From the LSTM's hidden state we can decode only the Subject at an above-chance rate at the end of the input sequence, mirroring this network's ability to only solve QSubject tasks (Figure 5b). In networks trained to solve only the QSubject task, both the RNN and LSTM learn to solve this task, and decoding scores for the Subject remain high, while decoding scores for other role-filler pairs drop substantially, as we show in Section 6.1 in the Supplemental Material.
With the Fast Weights architecture, we decode using either the controller's hidden state or the set of associative fast weights. We show these decoding scores in Figure 5c. The decoding scores from the controller's hidden state mirror those of the LSTM network's hidden state: the scores peak when the network receives the filler in its input, then decline as the network receives more words. (An exception is decoding scores for the Subject filler. This could be due to the fixed location of the Subject filler, and the non-fixed locations of the other fillers.) We see this trend regardless of whether the network was trained to retrieve a certain filler or not, and we show decoding scores on a QSubject experiment in Section 6.1 of the Supplemental Material. In contrast, the decoding scores using the Fast Weights matrix increase when the network receives the corresponding filler in its input and remain high.
We see a similar pattern between standard and enhanced memory components with the reduced NTM (Figure 5d), where the decoding scores using the reduced NTM's controller's hidden state mirror those of the LSTM, and the decoding scores from the reduced NTM's external memory matrix mirror those of the Fast Weights matrix. These results suggest that networks learn to solve tasks by storing the relevant information using their enhanced memory components (either the external memory buffer or the fast weights matrix), while the controller acts as a conduit to receive these words and move them to the enhanced memory component.
These findings suggest that the ability of some architectures to solve more filler-recall tasks, which we observed in Figure 4b, results from networks' ability to learn to store relevant information in enhanced memory components. The hidden memory state of an LSTM is enough to solve a simpler task such as QSubject, but is not enough to retain additional fillers needed to solve more difficult tasks. The external memory buffer of the reduced NTM and the associative fast weights matrix of the Fast Weights network allow for more persistent memory, supporting the decoding of fillers through the end of the input sequence.

Discussion
Our experiments suggest three main conclusions about generalized role-filler binding. First, networks' ability to perform schema learning depends on the breadth of examples they are trained on. The networks' success in role-filler binding in Experiment 1 and failure in Experiment 2, taken together, indicate that networks trained on a small, fixed number of fillers are only able to perform role-filler binding on inputs containing these sets of examples. In contrast, humans easily generalize to previously unseen fillers, successfully answering the input "Clkwef ordered a plate of talsk ?QSubject" even without ever having seen the words "Clkwef" or "talsk". In Experiment 3, networks' successes indicate that training networks on variable random fillers allows them to generalize to previously unseen fillers. Previous work found that neural networks perform quite poorly when the domain of the training set differs from that of the test set, which provides a possible explanation for the lack of generalization to unseen fillers (Grefenstette, 2016). Perhaps networks' success in Experiment 3 is due to the shared domain of fillers in the train and test set. Future work could assess whether networks can also generalize to test filler vectors drawn from a different distribution than train filler vectors.
Second, the ability of a network to perform schema learning depends on the network's memory architecture. Use of variable random fillers is not sufficient for a network to learn generalized role-filler binding on all tasks in our experiments. Networks (the RNN and LSTM) with a hidden state, but without the enhanced memory components of the Fast Weights and reduced NTM architectures, perform role-filler binding on previously unseen vectors for QSubject, but fail to also perform role-filler binding on previously unseen vectors for more difficult tasks. Only the networks with enhanced memory components (the reduced NTM and Fast Weights) are able to generalize on all six role-filler binding tasks. We note that simpler networks are not entirely unable to learn role-filler binding tasks on more difficult tasks; they are simply unable to learn multiple role-filler binding tasks when trained on all tasks at the same time. For instance, in work not shown here, we found that providing an LSTM first with QPoet examples and then adding QSubject during training allowed the LSTM to eventually learn to solve both tasks.
Third, decoding analyses give insight into how the networks solve these tasks, and why some succeed while others fail. Our ability to decode certain fillers from a network's memory state (whether the hidden state or an enhanced memory component) corresponded with networks' abilities to solve those tasks. For example, we were able to decode all six queries from the reduced NTM at an above-chance rate at the end of the story, and the reduced NTM was able to solve the task for all six queries; by contrast, for the LSTM network, the Subject was much more decodable than other fillers at the end of the trial, and this was the only query that was reliably answered by this network. Overall, the decoding results suggest that the reduced NTM and Fast Weights architectures succeed by storing these bindings in the enhanced memory components, and then retrieving the correct binding upon receiving the query. More broadly, these findings show us when and how artificial neural networks can perform role-filler binding on previously unseen examples; in future work, we can investigate how these abilities are similar to and different from how humans use memory for schema learning.
Figure 5: Decoding scores for each network. For the RNN, which is unable to solve any of the tasks, decoding scores for each of the task words are around the chance rate at the end of the input sequence. The LSTM, which solves only the QSubject task, only maintains the ability to decode the Subject throughout the input sequence. The Fast Weights and reduced NTM architectures show a similar trend in the hidden internal state of the controllers: decoding scores of the hidden states peak when the networks receive the respective filler in the input sequence, then decline as the network receives more words. In comparison, the decoding scores using the external memory (i.e. the Fast Weights matrix or the NTM's external memory buffer) increase when the network receives the corresponding filler in its input and the scores remain high throughout the input sequence. The chance rate is 50%.
6 Supplemental Material

6.1 Decoding on QSubject Experiment
In the QSubject experiment, networks are trained only on the QSubject task (i.e. during training, they are never asked about any other role-filler bindings). We show the test accuracies (Figure 6) and decoding scores (Figure 7) of networks trained in this experiment.
Figure 6: Test accuracies for networks trained on the QSubject experiment. The chance rate is 3.8%, bars denote mean accuracies, and error bars denote maximum and minimum accuracies over three trials. Full learning curves are available in Section 6.5 in the Supplemental Material.

Prediction Method and Chance Rates
To determine the network's prediction, we used networks in which the final layer has d nodes, where d is the dimensionality of the word vectors (50 in our experiments). We computed the cosine similarity between the output vector and the vector embedding of each word in the experiment's corpus, and selected the word with the highest cosine similarity to the network's output vector. The set of possible words is the corpus created by combining the words seen in all stories in a particular training batch. For fixed embeddings, the corpus therefore consists of the words that occur in all the stories generated for a particular experiment. For experiments in which we generated a new random embedding for each story, the corpus also includes all the new filler vectors generated for stories in that particular batch.
In each experiment, the network's chance rate depends on the number of words it has to choose from. In this section we detail the chance rate for each experiment.
For experiments 1 and 2 with fixed fillers, the network must choose from a corpus size of 44, corresponding to a chance rate of 2.3%.
For experiments with variable fillers, the network must choose from all the words in the story corpus (25 + n, where n is the number of possible queries) and the newly generated representations for each story in the batch. Since we use a validation batch size of 4 and 12 new filler vectors are generated for each input, this results in a total of between (25 + 1) + 1 + 4 × 12 = 75 words (for n = 1) and (25 + 6) + 1 + 4 × 12 = 80 words (for n = 6), for a chance rate of around 1.3%.
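As a quick check of the quoted figures, the chance rate is simply the reciprocal of the corpus size, assuming a uniform guess over all corpus words. The helper name `chance_rate` is hypothetical; the corpus sizes (44, 75, and 80) are the totals stated above.

```python
def chance_rate(corpus_size):
    """Probability of picking the correct word when guessing uniformly over the corpus."""
    return 1.0 / corpus_size


fixed = chance_rate(44)          # Experiments 1 and 2 (fixed fillers)
variable_low = chance_rate(80)   # variable fillers, largest corpus
variable_high = chance_rate(75)  # variable fillers, smallest corpus

print(f"fixed fillers: {fixed:.1%}")
print(f"variable fillers: {variable_low:.2%} to {variable_high:.2%}")
```

The fixed-filler value rounds to 2.3%, and both variable-filler values lie close to the 1.3% reported for Experiment 3.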

Padding
To ensure that all inputs have the same number of words, we padded inputs with a nonsense word (a randomly generated vector that does not represent any other word in the corpus) between the end of a story and the appearance of the query. We also inserted this nonsense word into a randomly chosen location in the input story, to force the network to learn more shift-invariant representations of the schema.
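A minimal sketch of this padding scheme, operating on word tokens rather than vectors, is shown below. The token name `PAD`, the function name `pad_input`, and the choice to allow insertion at any position (including the very start or end of the story) are assumptions for illustration.

```python
import random

PAD = "<pad>"  # hypothetical token standing in for the nonsense-word vector


def pad_input(story_words, query, total_length, rng):
    """Insert one pad at a random story position, then pad up to the query."""
    words = list(story_words)
    # One nonsense word at a randomly chosen location inside the story.
    words.insert(rng.randrange(len(words) + 1), PAD)
    n_fill = total_length - len(words) - 1  # reserve one slot for the query
    if n_fill < 0:
        raise ValueError("total_length is too short for this story")
    # Remaining pads go between the end of the story and the query.
    return words + [PAD] * n_fill + [query]


story_words = "Alice ordered a plate of sandwiches".split()
padded = pad_input(story_words, "?QSubject", 12, random.Random(0))
print(padded)
```

Every padded input has the same length and ends with the query, while the random pad shifts the positions of the words that follow it.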

Epoch Sizes
In Experiment 1 (fixed fillers, tested on previously seen fillers) we used 47135 train and 11784 test stories. In Experiment 2 (fixed fillers, tested on previously unseen fillers) we used 55448 train and 3339 test stories.
In Experiment 3 (variable fillers, tested on previously unseen fillers) we used 112 train and 112 test stories. To compute the number of distinguishable stories for this experiment we summed over the number of possible queries (queries that can be answered using the information in the story; for instance, some stories may not include an Emcee and therefore the input must not use QEmcee as a task) for each possible traversal through the story graph. This gives us 112 stories.

Network Details
For each of our architectures we used 50 hidden units and a learning rate of 1e-4.
Our reduced NTM model has a memory size of 128, a word size of 20, 1 write head, and 4 read heads.
We used Coffee Shop World to generate the stories used in this experiment. This generator is available on GitHub (Coffee Shop World, n.d.).
The code used to generate data, run experiments, and generate the plots in this paper is available on GitHub at https://github.com/cchen23/generalizedschema learning/. We also include pre-generated data and checkpoints of trained networks.

Learning Curves
In this section we include learning curves corresponding to accuracies depicted in previously presented bar plots. In each plot, we show the mean accuracy with error ribbons for maximum and minimum accuracies, over three trials.

Figure 1: Story graph for role-filler binding experiments. Each edge indicates a possible transition. In our schema, for states with multiple outgoing transitions, each outgoing transition is equally likely.
Figure 4: Overall and Query-Split accuracies on previously unseen fillers, with new fillers continuously introduced during training. The chance rate is 1.3%, bars denote mean accuracies, and error bars denote maximum and minimum accuracies over three trials. Full learning curves are available in Section 6.5 in the Supplemental Material.
(a) Overall Accuracies. Three architectures reach above-chance test accuracy, showing that certain networks perform some amount of generalization when forced to do so during training. However, the four architectures show varying amounts of success in generalized role-filler binding.
(b) Query-Split Accuracies. The LSTM and RNN learn to generalize only on the QSubject task (the easiest task, since the Subject always occurs at the second location in a story). The reduced NTM and Fast Weights networks learn to solve all six tasks; moreover, they learn to solve easier tasks more quickly.

Figure 7: Decoding scores for networks trained on the QSubject experiment.

Figure 8: Test accuracy for Experiment 1 (Fixed Representation with Previously Seen Fillers, originally shown in Figure 2). The chance accuracy rate is 2.3%.

Figure 9: Train and test accuracy for Experiment 2 (Fixed Representation with Previously Unseen Fillers, originally shown in Figure 3). The chance accuracy rate is 2.3%.