The relational processing limits of classic and contemporary neural network models of language processing

The ability of neural networks to capture relational knowledge is a matter of long-standing controversy. Recently, some researchers on the PDP side of the debate have argued that (1) classic PDP models can handle relational structure (Rogers & McClelland, 2008, 2014) and (2) the success of deep learning approaches to text processing suggests that structured representations are unnecessary to capture the gist of human language (Rabovsky et al., 2018). In the present study we tested the Story Gestalt model (St. John, 1992), a classic PDP model of text comprehension, and a Sequence-to-Sequence with Attention model (Bahdanau et al., 2015), a contemporary deep learning architecture for text processing. Both models were trained to answer questions about stories based on the thematic roles that several concepts played in the stories. In three critical tests we varied the statistical structure of new stories while keeping their relational structure constant with respect to the training data. Each model was susceptible to each statistical structure manipulation to a different degree, with their performance falling below chance under at least one manipulation. We argue that the failures of both models are due to the fact that they cannot perform dynamic binding of independent roles and fillers. Ultimately, these results cast doubt on the suitability of traditional neural network models for explaining phenomena based on relational reasoning, including language processing.


Introduction
The ability to represent and reason in terms of the relations between objects plays a crucial role across human cognition (Halford, Wilson, & Phillips, 2010). Several computational models in cognitive science have sought to capture its main characteristics and development (for a review, see Gentner & Forbus, 2011).
These models differ in their representational assumptions. In the canonical view, relational reasoning entails using predicate representations. A predicate is an abstract structure that can be dynamically bound to an argument, specifying a set of properties about that argument (Doumas & Hummel, 2005). For example, predator(x) specifies a series of properties about the variable x (e.g., carnivore, hunts, etc.). Predicate representations have two main attributes. First, predicates maintain role-filler independence, in that at least some aspect of the semantic content of the predicate is invariant with respect to its arguments. For example, predator(fox) and predator(lynx) specify the same set of properties (e.g., carnivore, hunts, etc.) about the objects fox and lynx. Second, predicates can be dynamically bound to arguments; that is, fillers can be assigned and reassigned to different roles as needed during processing. Models based on predicates successfully account for a wide variety of phenomena in the relational thinking literature (for a review, see Forbus, Liang, & Rabkina, 2017).
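The two attributes of predicate representations described above can be illustrated with a minimal sketch. It is not taken from any of the cited models: the class names `Predicate` and `Bindings` are hypothetical, and the property set is limited to the two properties named in the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    """A role whose semantic content does not depend on its filler."""
    name: str
    properties: frozenset


class Bindings:
    """Hypothetical working memory holding temporary role-filler bindings."""

    def __init__(self):
        self._pairs = set()

    def bind(self, predicate, filler):
        # Dynamic binding: a filler is assigned to a role at processing time.
        self._pairs.add((predicate, filler))

    def unbind(self, predicate, filler):
        # Bindings can be removed again, so fillers can be reassigned.
        self._pairs.discard((predicate, filler))

    def properties_of(self, filler):
        # Collect the properties contributed by every role bound to the filler.
        props = set()
        for predicate, bound in self._pairs:
            if bound == filler:
                props |= predicate.properties
        return props


predator = Predicate("predator", frozenset({"carnivore", "hunts"}))
memory = Bindings()
memory.bind(predator, "fox")
memory.bind(predator, "lynx")

# Role-filler independence: the role contributes the same content to both fillers.
fox_props = memory.properties_of("fox")
lynx_props = memory.properties_of("lynx")

# Rebinding: once unbound, the role no longer says anything about the fox.
memory.unbind(predator, "fox")
fox_after_unbinding = memory.properties_of("fox")
```

The point of the sketch is only that the role and the filler are stored as separate objects and joined by an explicit, revisable pairing.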
By contrast, traditional Parallel Distributed Processing (PDP) models explicitly eschew structured representations (see, e.g., Rogers & McClelland, 2014). In these models, representations are patterns of activation across a layer of units. These representations are unstructured because relational roles and objects are not independently represented, but instead are compressed together into a fixed-size vector. Recently, Rogers and McClelland (2014) have proposed that the gestalt models of text comprehension (St. John, 1992; St. John & McClelland, 1990) exhibit effective role-to-filler binding. Some of this optimism is based on the achievements of deep learning architectures in natural language processing. For example, Rabovsky, Hansen, and McClelland (2018) argue that the success of Google's neural machine translation (GNMT) system (Wu et al., 2016) implies that structured representations are an obstacle to capturing the regularities of human language.
In the present study, we tested the Story Gestalt (SG) model (St. John, 1992) and a Sequence-to-Sequence with Attention (Seq2seq+Attention) model (Bahdanau, Cho, & Bengio, 2015), the architecture behind the GNMT system, in a series of tasks requiring the binding of a number of concepts to several roles in a story.

Task overview
Our task, based on the original materials of St. John (1992), consists of answering questions about stories generated by a series of five scripts. All the scripts describe events as a sequence of propositions where several concepts play different thematic roles: agent-1, agent-2, topic, patient-theme, recipient-destination, location, manner and attribute. As an illustrative example, consider the Restaurant script (Table 1). This script describes an event where two people go to a restaurant. Each sentence of the Restaurant script defines fillers for some roles. To generate a specific instance of a Restaurant script (i.e., a Restaurant story), the roles are given values corresponding to specific concepts. Table 2 presents an example of an instantiated Restaurant story in a pseudonatural language format. Note that, as illustrated in Table 1, our scripts produce stories with no repeated topic concepts across propositions.

This work is licensed under the Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0

Table 1 (excerpt). Constraints of the Restaurant script.
Concept restrictions: the roles agent-1 and agent-2 are never 'Lois' or 'Albert'.
Deterministic rule: the quality of the restaurant determines the distance completely (expensive → far, cheap → near).

Each script implements a tree structure where each node represents a proposition and each branch of the tree represents a story. The scripts also implement rules that specify the probability of transitioning from one node to another conditioned on the value of a character or location role. For example, a rule in the Restaurant script (see Table 1) specifies that if the restaurant is expensive, it will be located far away.
We had two training conditions. In the concept restricted condition, some character or object names were never used in specific scripts. For example, in the Restaurant stories the characters Lois and Albert were never used to fill the roles agent-1 or agent-2. In the concept unrestricted condition all concepts were used in all stories. Stories were generated as follows: (1) a script was chosen at random, (2) a sequence of propositions was generated by traversing the tree structure of the script and (3) character and vehicle names were given specific values (respecting the script's deterministic rule and, if necessary, the script's concept restrictions).
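The three generation steps can be sketched as follows. This is an illustration under strong assumptions, not the generator used in the study: the toy tree, the character list, the uniform choice between branches and the name `generate_story` are all hypothetical, and only the concept restriction and the deterministic rule are taken from the Restaurant script.

```python
import random

# Hypothetical character list; the full concept inventory is not reproduced here.
CHARACTERS = ["Anne", "Gary", "Lois", "Albert"]

RESTAURANT = {
    # Concept restrictions of the Restaurant script.
    "restricted": {"agent-1": {"Lois", "Albert"}, "agent-2": {"Lois", "Albert"}},
    # Deterministic rule: restaurant quality fixes the distance.
    "rule": {"expensive": "far", "cheap": "near"},
    # Toy tree (assumed): each node is (topic, children); a leaf has no children.
    "tree": ("decided", [("distance", [("ordered", [("paid", [("tipped", []),
                                                               ("left", [])])])])]),
}
SCRIPTS = {"restaurant": RESTAURANT}


def generate_story(rng, restricted=True):
    # (1) Choose a script at random.
    name = rng.choice(sorted(SCRIPTS))
    script = SCRIPTS[name]

    # (2) Traverse one branch of the tree, collecting proposition topics.
    # Branches are chosen uniformly here; the real scripts use conditional probabilities.
    topics, node = [], script["tree"]
    while node is not None:
        topic, children = node
        topics.append(topic)
        node = rng.choice(children) if children else None

    # (3) Assign concepts, respecting restrictions and the deterministic rule.
    banned = script["restricted"] if restricted else {}
    fillers = {}
    for role in ("agent-1", "agent-2"):
        allowed = [c for c in CHARACTERS
                   if c not in banned.get(role, set()) and c not in fillers.values()]
        fillers[role] = rng.choice(allowed)
    quality = rng.choice(sorted(script["rule"]))
    fillers["attribute"] = quality
    fillers["distance"] = script["rule"][quality]
    return {"script": name, "topics": topics, "fillers": fillers}


example = generate_story(random.Random(0), restricted=True)
```

Calling `generate_story(rng, restricted=False)` corresponds to the concept unrestricted condition, in which any character may fill either agent role.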
To obtain a criterion for each model's performance we designed a baseline test. We presented the models trained in the unrestricted condition with concept unrestricted stories and asked questions about them. The questions were the concepts filling the topic role. The correct answer was the full proposition in which the topic concept was involved. For example, if a proposition in a Restaurant story stated that the "waiter gave change to Anne" and the model was asked about the "gave" proposition, the correct answer was "waiter gave change to Anne". Because there were no repeated topics in our stories, the correct answer was unequivocal. Table 2 presents an example of a Restaurant baseline story, its questions and correct answers.
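The question format can be made concrete with a short sketch of how the correct answer is determined. The dictionary representation, the function name `role_based_answer` and the assignment of concepts to roles in the example story are assumptions made for illustration; only the "gave" example itself comes from the text.

```python
# Each proposition maps thematic roles to concepts; unused roles are omitted.
# The role assignments below are assumed for illustration.
story = [
    {"agent-1": "Anne", "agent-2": "Gary", "topic": "decided", "location": "restaurant"},
    {"agent-1": "waiter", "topic": "gave", "patient-theme": "change",
     "recipient-destination": "Anne"},
]


def role_based_answer(story, topic):
    """Return the full proposition whose topic role is filled by `topic`."""
    matches = [p for p in story if p["topic"] == topic]
    # Topics are never repeated within a story, so exactly one match is expected.
    if len(matches) != 1:
        raise ValueError(f"expected exactly one proposition with topic {topic!r}")
    return matches[0]


answer = role_based_answer(story, "gave")
```

Because the lookup depends only on the topic filler, the answer is fully determined by the content of the story and not by how often its concepts co-occurred elsewhere.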

Models
Story gestalt model
The SG model (St. John, 1992, see Figure 1) integrates a sequence of propositions into a distributed representation of a story, which is then used to answer questions about the story. The model represents all propositions in its input layer through 137 localist units coding for each possible filler of each role (e.g., there is a unit coding for Albert-agent and another unit coding for Albert-recipient).
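The localist input coding described above can be sketched with a much smaller, hypothetical inventory. The concept and role lists below are illustrative subsets, and crossing every concept with every role is a simplification: the actual input layer has 137 units, whose exact composition is not listed here.

```python
# Hypothetical subsets of the concept and role inventories.
CONCEPTS = ["Albert", "Anne", "Gary", "decided", "restaurant"]
ROLES = ["agent", "recipient", "topic", "location"]

# One localist unit per (concept, role) pair.
UNITS = [(concept, role) for concept in CONCEPTS for role in ROLES]
INDEX = {unit: i for i, unit in enumerate(UNITS)}


def encode(proposition):
    """Turn an iterable of (concept, role) pairs into a binary input vector."""
    vector = [0] * len(UNITS)
    for pair in proposition:
        vector[INDEX[pair]] = 1
    return vector


went_out = encode([("Anne", "agent"), ("Gary", "agent"),
                   ("decided", "topic"), ("restaurant", "location")])
albert_agent = encode([("Albert", "agent")])
albert_recipient = encode([("Albert", "recipient")])

# The two Albert units are distinct and share no active unit.
overlap = sum(a * b for a, b in zip(albert_agent, albert_recipient))
```

In this coding the same concept in two different roles activates two unrelated units, so the input layer itself carries no separate representation of the concept and the role.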
To represent a complete proposition, the units coding for the concept filling each role are activated. For example, a representation of the sentence Anne and Gary decided to go to the restaurant would consist of a vector of 137 units where the four units coding for Anne-agent, Gary-agent, decided-topic and restaurant-location are set to 1 and all other units are set to 0 (Figure 1A).

Sequence-to-sequence with attention model
The Seq2seq+Attention model (Bahdanau et al., 2015, see Figure 2) is a deep neural network architecture originally designed to solve translation problems. Typically, the source and target sentences have different lengths. In general, a Seq2seq model consists of an encoder network and a decoder network. Both are recurrent neural networks with their own independent time steps (t for the encoder and t' for the decoder in Figure 2B). The encoder transforms the input sequence into a sequence of fixed-size vectors and the decoder processes these transformed vectors to get the output sequence. Two important features of this model are the use of word2vec representations for the input words (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013) and an attention mechanism that allows the model to selectively attend to different parts of the encoder's output (Bahdanau et al., 2015).
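For reference, the additive attention mechanism of Bahdanau et al. (2015) is commonly written as below. The notation is an assumption chosen to match the time indices used here: $h_t$ is the encoder output at encoder step $t$, $s_{t'}$ is the decoder state at decoder step $t'$, and $W_a$, $U_a$ and $v_a$ are learned parameters. The exact parameterization of the model tested in this study is not specified in the text.

```latex
\begin{align}
e_{t',t}      &= v_a^{\top} \tanh\left(W_a\, s_{t'-1} + U_a\, h_t\right) \\
\alpha_{t',t} &= \frac{\exp\left(e_{t',t}\right)}{\sum_{k} \exp\left(e_{t',k}\right)} \\
c_{t'}        &= \sum_{t} \alpha_{t',t}\, h_t
\end{align}
```

The context vector $c_{t'}$ is a weighted average of the encoder outputs, so at each decoder step the model can draw on different parts of the input sequence.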

Simulations
We designed three critical tests for the models. In our first test, termed concept violation, we trained the models in the concept restricted condition and then tested them with stories where some roles were filled by the restricted concepts. For example, the concept Lois had never appeared as agent in any Restaurant story during the model's training (see Table 1). The model was then tested using a Restaurant story in which Lois appeared as agent by asking, for example, about the "tipped" proposition. The correct (role-based) answer was "Lois tipped waiter big". Note that, while the model was trained on stories where Lois appeared as an agent in other locations, and had been trained to output that someone tipped big with other agents, it had never been trained to output the exact proposition "Lois tipped waiter big".
In our second test, termed correlation violation, we presented the models trained in the concept unrestricted condition with stories where we inverted a perfect statistical regularity of the story script. For example, a rule in the Restaurant script establishes that if the restaurant is cheap it is nearby and if it is expensive it is far away (see Table 1). To create a Restaurant correlation violation story, we switched the second term of the correlation (e.g., a cheap restaurant that was far away) and asked about the "distance" proposition. The role-based answer was "The restaurant was far away", even though all cheap restaurants were close by during training.
In our third test, termed shuffled propositions, we presented the models trained in the concept unrestricted condition with stories where we randomized the order of the propositions. Recall that in our stories there are no repeated topic concepts. As a direct consequence, a role-based answer to a question should use the concepts of the proposition corresponding to each question to fill its roles, ignoring the ordering.
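The three manipulations described in this section can be sketched as transformations of a story. This is an illustrative sketch only: the dictionary representation, the function names and the role assignments in the example story are hypothetical, and the actual test stories were generated from the scripts rather than by editing existing stories.

```python
import random

# Inversion of the second term of the quality-distance rule.
FLIP = {"near": "far", "far": "near"}


def concept_violation(story, trained, restricted):
    """Fill every role held by `trained` with a concept never seen in that script."""
    return [{role: restricted if concept == trained else concept
             for role, concept in p.items()} for p in story]


def correlation_violation(story):
    """Invert the distance so that it contradicts the trained regularity."""
    return [{**p, "attribute": FLIP[p["attribute"]]} if p["topic"] == "distance"
            else dict(p) for p in story]


def shuffle_propositions(story, rng):
    """Randomize proposition order while leaving each proposition intact."""
    shuffled = [dict(p) for p in story]
    rng.shuffle(shuffled)
    return shuffled


# Hypothetical story fragment; role assignments are assumed.
story = [
    {"agent-1": "Anne", "topic": "decided", "location": "restaurant"},
    {"topic": "quality", "attribute": "cheap"},
    {"topic": "distance", "attribute": "near"},
    {"agent-1": "Anne", "topic": "tipped", "manner": "big"},
]

with_lois = concept_violation(story, "Anne", "Lois")
cheap_but_far = correlation_violation(story)
reordered = shuffle_propositions(story, random.Random(1))
```

In each case the set of role-filler pairings within a proposition stays well defined, which is why a role-based answer remains available for every question.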

Training
We trained two versions of the SG model, one on 1,000,000 randomly generated concept restricted stories and another on 1,000,000 randomly generated concept unrestricted stories. We also trained two versions of the Seq2seq+Attention model, one on 500,000 randomly generated concept restricted stories and another on 500,000 randomly generated concept unrestricted stories. We used the Nadam optimization algorithm with default learning parameters.

Results
For each of our tests, we created a dataset of stories by generating 1,000,000 stories and saving all unique ones. Due to the combinatorics of concepts and scripts, these datasets had different sizes (baseline and shuffled propositions: 14,652, concept violation: 728, correlation violation: 14,647). For all tests we compared the proposition generated by the model with the role-based answer. We coded the answer as correct (with a value of 1) if all the concept fillers in the answer corresponded to the concept fillers in the role-based answer and as incorrect (with a value of 0) otherwise. Figure 3 shows the proportion of correct answers per test and model. As can be seen, both models performed well in our baseline test. In our concept violation test the SG model almost invariably filled the roles of the restricted concepts with the most common concepts playing that role during training. For example, if it was presented with a story where the role agent- [...] The Seq2seq+Attention model performed significantly better at this test. The attention mechanism seems to allow this model to apply its word representations to previously unseen sequences of words.
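The construction of the test sets and the scoring rule described in this section can be sketched as follows. The dictionary representation of propositions and the function names are assumptions made for illustration; the sketch does not reproduce the evaluation code used in the study.

```python
def unique_stories(stories):
    """Keep the first copy of each distinct story (proposition order matters)."""
    seen, unique = set(), []
    for story in stories:
        key = tuple(tuple(sorted(p.items())) for p in story)
        if key not in seen:
            seen.add(key)
            unique.append(story)
    return unique


def score(predicted, target):
    """1 if every role filler matches the role-based answer, 0 otherwise."""
    return int(predicted == target)


def proportion_correct(predictions, targets):
    """Proportion of answers that match the role-based answer exactly."""
    scores = [score(p, t) for p, t in zip(predictions, targets)]
    return sum(scores) / len(scores)


# Hypothetical answers; role assignments are assumed.
target = {"agent-1": "Lois", "topic": "tipped", "manner": "big"}
wrong_agent = {"agent-1": "Anne", "topic": "tipped", "manner": "big"}

result = proportion_correct([target, wrong_agent], [target, target])
deduplicated = unique_stories([[target], [target], [wrong_agent]])
```

Under this rule an answer with a single wrong filler, such as the wrong agent in the example, counts as entirely incorrect.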
Both models performed poorly in the correlation violation test. Such errors would seem quite unnatural for a human reader, as they would amount to answering the question "where is the restaurant?" by stating that the restaurant is far away when presented with a proposition stating that the restaurant is close by. Notably, the SG model achieved a higher accuracy than the Seq2seq+Attention model in this test. We suspect that the same attention mechanism that allows the Seq2seq+Attention model to pass the concept violation test makes it even more likely to overfit to a perfect correlation in the dataset.
While our shuffled propositions test affected both models, the SG model performed significantly better than the Seq2seq+Attention model. We again hypothesize that the attention mechanism is the main reason for this difference in performance. Unfortunately, due to the length of our stories, removing the attention mechanism leaves the model unable to pass our baseline test, so we could not test this hypothesis directly.

Discussion
We tested the relational processing capabilities of a classic and a contemporary neural network model of text comprehension. In three critical tests we varied the statistical properties of the test stories while keeping their relational structure intact. Our results show clearly that these models are not using the relational information of the stories to answer the questions, but instead they are relying on the statistical regularities of the training dataset.
Our results are highly consistent with the findings of Lake and Baroni (2018), who found that sequence-to-sequence models failed at a command-to-action translation task that required composing the meaning of new commands formed by using known primitive concepts combined in ways unseen during training. Truly compositional behavior requires independent representations of objects and roles that can be bound together dynamically. A model that dynamically binds roles to fillers would easily pass our tests by filling the untrained concepts into the trained roles to answer the questions (see Doumas & Hummel, 2005).
Interestingly, there has been a resurgence of interest in the binding problem in neural networks (Besold et al., 2017). Moreover, relational learning and reasoning have become a core topic in deep learning research (for a review, see Battaglia et al., 2018), with some deep learning architectures implementing elements traditionally associated with symbolic processing, such as a content-addressable memory (e.g., Graves et al., 2016). Whether these non-traditional neural network architectures are capable of relational reasoning remains an open question. Our results suggest, however, that for a model to successfully account for all aspects of relational processing, it will need to implement a solution to the binding problem.