Analogues of mental simulation and imagination in deep learning

Mental simulation — the capacity to imagine what will or


Introduction
Mental simulation is the ability to construct mental models [1,2] to imagine what will happen or what could be. Mental simulation is a cornerstone of human cognition [3] and is involved in physical reasoning [4,5], spatial reasoning [6], motor control [7], memory [8], scene construction [9], language [10], counterfactual reasoning [11,12], and more. Indeed, some of the most uniquely human behaviors involve mental simulation, such as designing a skyscraper, performing a scientific thought experiment [13], or writing a novel about people and worlds that do not (and could not) exist. However, such phenomena are challenging to model quantitatively, both because the mental representations used are unclear and because the space of possible behavior is combinatorially explosive.
Artificial intelligence (AI) aims to build agents which are similarly capable of behaving creatively and robustly in novel situations. Perhaps unsurprisingly, there is an analogue to mental simulation in AI: a collection of algorithms referred to as model-based methods, with the 'model' referring to a predictive model of what will happen next. While model-based methods have been around for decades [14,15], recent advances in deep learning (DL) and reinforcement learning (RL) have brought renewed interest in learning and using models. Among the results are systems supporting superhuman performance in games like Go [16,17]. Importantly, these systems must necessarily deal with the computational and representational challenges that have historically faced cognitive modelers.
This paper reviews recent model-based methods in DL and emphasizes where such approaches align with human cognition. The aim is twofold. First, for behavioral scientists, this article provides insight into methods which enable intelligent behaviors in large state and action spaces, with the intent of inspiring future models of mental simulation. Second, for DL and AI researchers, this article compares model-based methods to human capabilities, complementing a number of recent related works [18][19][20][21] and clarifying the challenges that lie ahead for building human-level model-based intelligence.

Reinforcement learning
At the core of most model-based methods in DL is the partially-observable Markov decision process, or POMDP [22], which governs the relationship between states (s), observations (o), actions (a), and rewards (r), as illustrated in Figure 1a. Specifically, these variables are related according to transition (T), observation (O), recognition ($O^{-1}$), and reward (R) functions (Figure 1b-e) as well as a policy ($\pi$) which produces actions (Figure 1f).
The field of RL is concerned with the problem of finding a policy that achieves maximal reward in a given POMDP. 'Deep' RL implies that the functions in Figure 1 are approximated via neural networks. Much of the research in deep RL is model-free in that it aims to learn policies without knowing anything about the transition, observation, recognition, or reward functions (see [23] for a review). In contrast, model-based deep RL (MBDRL) aims to learn explicit models of these functions which are used to aid in computing a policy, a process referred to as planning [15 ].
An important component of the POMDP is that of partial observability, in which observations do not contain full state information. Sometimes, the missing information is minimal (e.g. velocity can be inferred from a few sequential frames); other times, it is severe (e.g. first-person observations are individually not very informative about the layout of a maze). The recognition function thus serves a dual purpose: to infer missing information (e.g. [24,25]), and to transform high-dimensional perceptual observations to a more useful representational format.
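To make these component functions concrete, the following is a minimal sketch of a toy POMDP in Python. It is an illustration only: the one-dimensional point-mass dynamics, the `State` class, the reward shape, and the hand-written controller are all assumptions introduced here, not part of any method reviewed in this article. The observation exposes position only, so the recognition function must infer velocity from two consecutive observations, mirroring the velocity example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    """Hypothetical full state: position and velocity of a point on a line."""
    position: float
    velocity: float


def transition(s, a):
    """T: next state given current state and action (action = acceleration)."""
    velocity = s.velocity + a
    return State(s.position + velocity, velocity)


def reward(s, a):
    """R: scalar reward, higher near the origin and for small actions."""
    return -abs(s.position) - 0.1 * abs(a)


def observe(s):
    """O: the observation contains position only, so velocity is hidden."""
    return s.position


def recognize(prev_o, o):
    """O^{-1}: infer the full state from two consecutive observations."""
    return State(position=o, velocity=o - prev_o)


def policy(s):
    """pi: a hand-written controller that steers the point to the origin."""
    return -0.5 * s.position - 1.0 * s.velocity


def run_episode(s0, steps):
    """Run the perception-action loop and return (final state, total reward)."""
    s = s0
    prev_o = observe(s)  # the agent initially assumes zero velocity
    total = 0.0
    for _ in range(steps):
        o = observe(s)
        inferred = recognize(prev_o, o)  # perception
        a = policy(inferred)             # action selection
        total += reward(s, a)
        prev_o = o
        s = transition(s, a)             # the world moves on
    return s, total
```

In this sketch the agent never reads the true state directly; it acts only on what `recognize` recovers from its observations, which is the sense in which recognition is analogous to perception.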
The POMDP model provides a useful framing for a variety of mental simulation phenomena, and illustrates how behaviors that seem quite different on the surface share a number of computational similarities (Figure 2). For example, a mental model [1,2] can be seen as a particular latent state representation paired with a corresponding transition function, allowing it to be manipulated or run via mental actions. The way that people choose which mental simulations to run (e.g. the direction of a mental rotation [6]) implies a particular policy and planning method.
Yet, simply framing mental simulation as a POMDP does not tell us where the component functions (Figure 1b-e) come from, what representations they ought to operate over, or how to perform inference in the resulting POMDP. While a number of existing works have answered these questions in simplified observation, state, and action spaces (e.g. [5,26,27]), the answers remain elusive in higher-dimensional settings. Such settings are exactly where DL excels, suggesting that it may prove useful in building cognitive models of mental simulation in richer, more ecologically-valid environments. Indeed, this approach has already been successful in understanding other aspects of the mind and brain, including sensory cortex [28], learning in the prefrontal cortex [29], the psychological representations of natural images [30], and the production of human drawings [31].

State-transition models
Sometimes, it is assumed that the agent has direct access to a useful representation of the state, obviating the need for a recognition model (Figure 3c). Many approaches to learning such state-transition models use classical recurrent neural networks (e.g. [32,33]). Recently, models which represent the state as a graph [34] have been used to predict the motion of rigid bodies [35,36], deformable objects [37], articulated robotic systems [38], and multiagent systems [39].
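As a rough illustration of what learning a state-transition model involves, the sketch below fits a transition function to observed (state, action, next state) triples by gradient descent on squared prediction error. It is a deliberate simplification: a two-parameter linear model stands in for the recurrent or graph networks cited above, and the ground-truth dynamics in `true_step`, the sample sizes, and the learning rate are all hypothetical choices made for this example.

```python
import random


def true_step(s, a):
    """Ground-truth dynamics, unknown to the learner (assumed for this example)."""
    return 0.9 * s + 0.5 * a


def fit_transition_model(n_samples=200, epochs=200, lr=0.5, seed=0):
    """Fit s' ~ w_s * s + w_a * a by full-batch gradient descent."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_samples):
        s, a = rng.uniform(-1, 1), rng.uniform(-1, 1)
        data.append((s, a, true_step(s, a)))

    w_s, w_a = 0.0, 0.0
    for _ in range(epochs):
        g_s = g_a = 0.0
        for s, a, s_next in data:
            err = (w_s * s + w_a * a) - s_next  # prediction error
            g_s += err * s
            g_a += err * a
        w_s -= lr * g_s / n_samples
        w_a -= lr * g_a / n_samples
    return w_s, w_a
```

Because the data here are noise-free and the model class contains the true dynamics, the fitted weights approach the true coefficients; neither condition should be expected to hold in the richer settings the cited work addresses.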

Observation-transition models
Often, an agent does not have direct access to a useful state representation. One approach to dealing with this issue (often referred to as 'video prediction') learns a transition model directly over sensory observations (Figure 3d). For instance, [40] learn a model of pixel motion; [41] predict image masks indicating how a tower of blocks will fall; and [42] predict the boundaries of objects in images.

Figure 1
The partially-observable Markov decision process (POMDP). (a) A graphical model of the POMDP, where t indicates time. A state s is a full description of the world, such as the geometries and masses of objects in a scene. An observation o is the data that is directly perceived by the agent, such as a visual image. An action a is an intervention on the state of the world chosen by the agent, such as 'move left'. A reward r is a scalar value that tells the agent how well it is doing on a task and can be likened to the idea of utility or risk. Arrows indicate dependencies between variables. Pink circles indicate variables that can be intervened on; blue indicates variables that are observed; and white indicates variables that are unobserved. (b-f) Depictions of the individual functions (in green) that relate variables in the POMDP. The transition function (b) takes the current state and action and produces the next state, $s_{t+1} = T(s_t, a_t)$. The reward function (c) takes a state and action and produces a reward (or utility) signal, $r_t = R(s_t, a_t)$. The observation function (d) is the process by which sensory data are generated given the current state, $o_t = O(s_t)$. For example, this can sometimes be thought of as a 'rendering' function which produces images given the underlying scene specification. The recognition function (e) is the inverse of the observation function, $s_t = O^{-1}(o_t)$, and is analogous to the process of perception. The recognition function is often conditioned on past states and observations (i.e. a 'memory'), to allow aggregation of information across time (for example, velocity, or multiple viewpoints of the same scene). The policy (f) is the function which gives rise to actions given the underlying state of the world, $a_t = \pi(s_t)$. The policy is also often conditioned on past memories.

Prior-constrained latent state-transition models
Rather than computing transitions directly over observations, an alternate approach first transforms observations into latent state representations via the recognition function. Sometimes, prior knowledge can be leveraged, resulting in prior-constrained latent states (Figure 3e). For example, [43] use pre-existing knowledge about physics to encourage recognition models to recover properties like mass and friction. Other approaches use supervision to train recognition and transition models to predict 2D [44,45] and 3D [46,47] motion. [48] use supervision to learn symbolic representations like on(A,B).

Data-constrained latent state-transition models
Another approach infers data-constrained latent states, where the representations are influenced more strongly by data than by prior knowledge (Figure 3f). The most common such approach is to find a representation which can be used to predict future observations (e.g. [49][50][51]). While most approaches assume distributed vector representations, others have explored alternatives such as graphs [52] or low-dimensional binary vectors [53].
Other models have explored different pressures for learning latent state representations beyond reconstructing observations. For example, one approach is to use policy loss or reward prediction error to shape the latent representations (e.g. [25,54,55,56]). Because reward is a scalar signal, such representations may not be useful for predicting future observations, but may still be useful for planning. Other objectives include inferring the action taken between observations [57] or maximizing the mutual information between observation-transitions and state-transitions [58].

Figure 2
Various forms of imagination and mental simulation can be viewed as engaging different aspects of the POMDP model, illustrating how seemingly disparate phenomena are computationally quite similar. In all cases, blue circles indicate relevant observed variables, white circles indicate latent variables, pink circles indicate actions that modify states, and grayed-out circles indicate variables which are not relevant for a particular form of mental simulation. (a) Physical prediction tasks like those explored by [5] can be seen as a case where an initial observation is given (e.g. a tower of blocks) and future states are predicted given that observation (e.g. whether the blocks move). (b) The mental rotation task from [6], in which it was demonstrated that people imagine objects at different rotations in order to compare them, can be seen as choosing a sequence of actions to produce mental images. (c) Theory of mind tasks such as those examined by [26] involve inferring a latent state such as the preferences of another agent (e.g. that the agent prefers pizza over hamburgers) given a sequence of observations (e.g. that the agent picks up a pizza). (d) Tasks like the two-step task [27] which probe how humans learn from reinforcement naturally fall under the POMDP paradigm. In such tasks, people must learn to choose actions to navigate through a sequence of symbols in order to maximize a nonstationary reward.

Methods in DL for using learned models
A model on its own does not enable flexible behavior: a planning method is needed to turn predictions into actions. Background planning uses models only during the process of learning a model-free policy, while decision-time planning uses models during online deliberation [15].

Background planning
The most popular approach to background planning, Dyna (Figure 4a) [59], uses a model to produce simulated experience in place of real experience [32,53,60]. Other methods backpropagate gradients through learned transition and reward models [61], thus providing more information about an action's utility than a scalar reward does on its own.¹ Another approach uses decision-time planning to improve decisions, and then trains a policy to mimic those decisions [17].
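The Dyna idea can be sketched in a few lines using a tabular variant (Dyna-Q) on a toy problem. This is a simplified illustration rather than any of the deep architectures cited above: the five-state chain environment in `env_step`, the hyperparameters, and the deterministic lookup-table model are assumptions made for this example. After every real step, the agent updates its value estimates from the real transition, stores that transition in its model, and then performs several extra updates on transitions replayed from the model.

```python
import random

N_STATES, GOAL = 5, 4   # hypothetical chain: states 0..4, reward on reaching 4
ACTIONS = (-1, +1)      # move left or right


def env_step(s, a):
    """Real environment: deterministic chain with reward 1 at the goal."""
    s2 = min(max(s + a, 0), GOAL)
    return s2, (1.0 if s2 == GOAL else 0.0)


def dyna_q(episodes=30, planning_steps=20, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    model = {}  # learned model: (state, action) -> (next state, reward)

    def update(s, a, r, s2):
        """One Q-learning update; the goal state is terminal."""
        best_next = 0.0 if s2 == GOAL else max(q[(s2, b)] for b in ACTIONS)
        q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])

    for _ in range(episodes):
        s = 0
        while s != GOAL:
            # Epsilon-greedy action selection with random tie-breaking.
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:
                best = max(q[(s, b)] for b in ACTIONS)
                a = rng.choice([b for b in ACTIONS if q[(s, b)] == best])
            s2, r = env_step(s, a)
            update(s, a, r, s2)       # learn from real experience
            model[(s, a)] = (s2, r)   # learn the model
            for _ in range(planning_steps):
                # Background planning: replay simulated experience.
                ps, pa = rng.choice(list(model))
                ps2, pr = model[(ps, pa)]
                update(ps, pa, pr, ps2)
            s = s2
    return q
```

Note that the model is used only to train the value table; acting itself remains model-free, which is what distinguishes background planning from decision-time planning.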
Analogues of mental simulation in deep learning Hamrick 11

Figure 3
Methods for learning models in DL. Most methods focus on learning transition and recognition functions; reward functions are often either assumed to be known or are learned as part of the transition function. Blue outlines indicate variables which are observed. (a-b) In a hypothetical scenario, an agent controls a robot arm to pick up a block, and receives pixel-based observations. (c) In state-transition models, the underlying states (e.g. the orientation of the blocks and the robot arm) are directly observed. (d) In observation-transition models, transitions are learned directly between sensory observations. (e) In prior-constrained latent state-transition models, the states must be inferred from observations but often true states are available at training time for supervision, or strong assumptions are made about the dynamics of T or the representation of s. (f) In data-constrained latent state-transition models, a latent state is used but no supervision is given over states at training time. The learned latent states are usually distributed and often do not directly correspond to interpretable dimensions such as position, orientation, etc.

Decision-time planning
One method for decision-time planning simulates Monte-Carlo rollouts (Figure 4b) and then chooses the rollout (or trajectory) with highest reward [33]. Alternately, trajectories can be aggregated via a learned mechanism [62]. The choice of base policy for simulating trajectories has varied from random sampling [33], to approximating the full model-based policy [62], to learning the base policy end-to-end with the full policy [51].
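A minimal sketch of rollout-based planning with a random base policy is given below. The one-dimensional model in `model_step` (a point that should reach position 3), the action set, and the rollout budget are hypothetical and chosen only to keep the example small; in the cited work the model would be a learned network.

```python
import random


def model_step(s, a):
    """Assumed transition and reward model: a point on a line, target at 3."""
    s2 = s + a
    return s2, -abs(s2 - 3)


def rollout(s, horizon, rng):
    """Simulate one trajectory under a uniformly random base policy."""
    actions = [rng.choice((-1, 0, 1)) for _ in range(horizon)]
    total = 0.0
    for a in actions:
        s, r = model_step(s, a)
        total += r
    return total, actions


def plan(s, n_rollouts=500, horizon=4, seed=0):
    """Return the first action of the highest-return simulated trajectory."""
    rng = random.Random(seed)
    _, best_actions = max(
        (rollout(s, horizon, rng) for _ in range(n_rollouts)),
        key=lambda result: result[0],
    )
    return best_actions[0]
```

Only the first action of the best trajectory is returned; an agent would typically execute it, observe the outcome, and replan from the new state.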
Tree search (Figure 4d) has achieved superhuman performance in Go given a known transition model and learned model-free policy prior [16,17]. Learned models can be used by embedding the tree search into the computation graph itself [56,66]. Other work learns the decisions that are usually hardcoded into tree search [67].
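For intuition, the sketch below performs an exhaustive depth-limited tree search over a known model. It is not the Monte-Carlo tree search used in [16,17], which expands the tree selectively and relies on learned policy and value estimates; the toy model in `model_step` and the three-action set are assumptions made for this example.

```python
def model_step(s, a):
    """Assumed known transition and reward model: a point on a line, target at 3."""
    s2 = s + a
    return s2, -abs(s2 - 3)


def search(s, depth, gamma=1.0):
    """Return (best value, best first action) by expanding every branch to `depth`."""
    if depth == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for a in (-1, 0, 1):
        s2, r = model_step(s, a)
        future, _ = search(s2, depth - 1, gamma)
        value = r + gamma * future
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action
```

The cost of this search grows exponentially with depth, which is one reason practical systems prune or guide the expansion rather than enumerating every branch.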
Finally, dynamic programming (Figure 4e) performs computations recursively over the entire state space. While techniques like value iteration [14] have classically been used in background planning, recent work incorporates them into DL architectures as decision-time mechanisms [24,68].
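Classical value iteration, the technique named above, can be sketched as follows for a small tabular problem. The chain environment inside `value_iteration` is a hypothetical example; the DL architectures in [24,68] embed an analogous recursive computation inside a network rather than running it over an explicit table.

```python
def value_iteration(n_states=5, gamma=0.9, tol=1e-8):
    """Compute optimal state values for a toy chain with reward at the last state."""
    goal = n_states - 1
    actions = (-1, +1)

    def step(s, a):
        """Assumed known model: move along the chain, reward 1 on reaching the goal."""
        s2 = min(max(s + a, 0), goal)
        return s2, (1.0 if s2 == goal else 0.0)

    v = [0.0] * n_states  # the goal is terminal, so its value stays 0
    while True:
        delta = 0.0
        for s in range(goal):
            # Bellman backup: best one-step reward plus discounted next value.
            outcomes = (step(s, a) for a in actions)
            best = max(r + gamma * v[s2] for s2, r in outcomes)
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            return v
```

Each sweep touches every state, which is what 'over the entire state space' means in practice and why the approach is limited to state spaces small enough to enumerate or approximate.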

Modeling mental simulation with model-based deep RL
The varied approaches to learning and using models in DL have resulted in powerful systems that can model complex physical phenomena [35-38,47,58], play difficult puzzle games [62,66,67], and control articulated physical systems [33,40,54,63]. But beyond such applications in AI, model-based methods share a number of similarities with human mental simulation (Figure 2), making them an ideal starting point for developing new cognitive models and for scaling existing ones.

Mental imagery
Consider the classic debate regarding which representations underlie mental imagery [69][70][71]. According to the depictive theory (DT) [70], the representations are 2D spatial arrays resembling images. In contrast, the propositional theory (PT) [69] states that the representations are symbolic in nature, without any intrinsic spatial properties. We can see echoes of these theories in the different structures of transition models. Observation-transition models (Figure 3d) are related to DT in that they operate directly over sensory observations, with intermediate computations operating over 2D convolutional features (e.g. [40]). Prior-constrained latent state-transition models (Figure 3e) may make the assumption that the underlying representation is symbolic (e.g. [48]), just as PT does. However, neither DT nor PT has strongly considered the role of reward functions or policies [71], in contrast to enactive theories (ET) (e.g. [72]) which consider mental imagery to be strongly coupled to actions. Framing DT, PT, and ET as particular instantiations of the POMDP framework, and modeling them with the tools of MBDRL, provides an avenue for furthering these discussions surrounding mental imagery.

¹ Readers are referred to [15] for further details on these methods.
MBDRL may also inform our understanding of mental imagery across the visual [70], auditory [73], and motor modalities [74]. Do these multimodal forms of imagery differ because they deal with different sensory data, or because the underlying mechanisms are themselves also different? MBDRL offers a way to probe this question by training networks with identical or varied architectures on data from different sensory modalities, and comparing the results to human mental imagery phenomena.

Learning by thinking
A longstanding puzzle in cognitive science is that of 'learning by thinking' [75]: how does thought influence behavior without the addition of any new information? One hypothesis proposes that a model-based process trains a model-free action policy [76,77], and has been successfully modeled via Dyna [59] (Figure 4a). However, such work often targets MDPs with small state spaces, which are easier to control experimentally and to compute model predictions for. MBDRL offers the possibility of scaling such theories to behavioral domains with huge state spaces. For example, when combined with DL, such models might also be able to account for the phenomenon of mental practice [78], in which people imagine performing a complex physical action (e.g. throwing a ball) and later exhibit improved performance when actually taking that action.

The control of mental simulation
Finally, an open question is how simulations are controlled during deliberation. An active area of research has investigated the overall choice of whether (and how much) to plan [79,80], treating this choice as a speed-accuracy trade-off and inspiring similarly adaptive approaches in MBDRL [64]. The role of the hippocampus in planning is also an active topic [81,82], with some work suggesting how hippocampal replay might be controlled by a variant of Dyna [77]. Other research has investigated how tree search might support decisions when playing board games [83]. Yet, other domains have received less attention. For example, while people use mental simulation to make predictions about physical scenes like towers of blocks [5], it is unclear how those simulations are engaged when constructing towers. Similarly, while mental simulation is used during creative thought (e.g. [84]), it is not well understood which simulations are explored, and why. By casting these problems as POMDPs and solving them with the powerful planning methods from MBDRL, we can produce quantitative, testable hypotheses about how mental simulations might be controlled.

Challenges for model-based deep RL
Model-based deep RL holds the promise of learning rich models of the world from experience and using them to make flexible and robust decisions. However, in comparison to the human capacity for building and running mental models, there are several challenges in fulfilling this promise.
One view of mental simulation holds that it is fast and precise. For example, simulations from forward models in the motor system must occur in less than 100 ms to support real-time action [7]. Similarly, activation of place cells during hippocampal preplay in rats, corresponding to the planning of future trajectories, occurs on the order of 100-300 ms [81]. This view is most consistent with current methods in MBDRL, which require a large number of faster-than-real-time model evaluations before making a decision (e.g. [16,17,33,54]).
Other mental simulations are slow, noisy, and incomplete, with mental simulations lacking full detail [70], exhibiting systematically wrong dynamics [85], and requiring multiple seconds to run [6]. Latent state-transition models have the potential to learn incomplete models of the world, particularly if they do not rely on reconstructing observations. However, almost all planning algorithms assume mostly accurate models. Even in cases where model error is explicitly addressed [32,62,63,64], it is unclear how well such methods work when the model error is severe. It would seem that the mind can get a lot out of only a handful of incomplete and possibly very inaccurate simulations, a feat which MBDRL methods have yet to achieve.
Mental simulation is also seen as general, flexible, and compositional, supporting behavior across a wide range of different tasks [4], capturing a large body of commonsense knowledge [5,12], and operating over multiple levels of abstraction [86]. While recent graph network approaches do afford more compositional models than standard RNN approaches [34], a separate model is still learned for each task or (at best) for a small set of related tasks (e.g. [38,47,54]). A significant challenge for DL is to build models that seamlessly compose at different levels of abstraction and that are informed by rich background knowledge about the world, enabling rapid transfer to a diverse range of situations and tasks.
Finally, mental simulation is exploratory, counterfactual, and creative, giving rise to thought experiments [13], children's pretend play [11], and creative works [84]. Mental simulations allow us to conceive of counterfactual worlds that did not come to pass, but which could have [11,12], as well as fully impossible worlds. While the notion of an action-conditional transition model (Figure 1b) does encode some counterfactual knowledge, current methods in DL often struggle to generalize far beyond the scenarios they were trained on [20]. It is an open question how such methods could entertain concepts as far removed from reality as humans do (such as, 'what if the Earth were replaced by blueberries?' [87]).
While MBDRL holds much promise for building flexible, robust intelligence, it still has a ways to go. To match human cognition, models must be compositional and assembled on-the-fly; methods for planning must succeed with only a handful of evaluations from noisy, incomplete models; and models must be able to generalize far from their training sets, supporting creative exploration and a richer understanding of the world.

Conclusion
The notion of using models of the world to make better decisions has deep roots in the history of both cognitive science [3] and RL [14]. It is unsurprising, then, that mental simulation and MBDRL share a number of similarities. For cognitive scientists, these similarities suggest that current approaches in MBDRL may be useful starting points for developing new cognitive models and scaling existing models to larger and more complex domains. For DL researchers, they suggest that mental simulation can play an important role in guiding research towards more intelligent agents. In both cases, the integration of model-based methods from DL with theories of mental simulation promises new and exciting research supporting more flexible and creative artificial agents, as well as a deeper understanding of the complexities of the human imagination.

Funding
This work was supported by DeepMind.

Conflict of interest statement
Nothing declared.