Active inference on discrete state-spaces: a synthesis

Active inference is a normative principle underwriting perception, action, planning, decision-making and learning in biological or artificial agents. From its inception, its associated process theory has grown to incorporate complex generative models, enabling simulation of a wide range of complex behaviours. Due to successive developments in active inference, it is often difficult to see how its underlying principle relates to process theories and practical implementation. In this paper, we try to bridge this gap by providing a complete mathematical synthesis of active inference on discrete state-space models. This technical summary provides an overview of the theory, derives neuronal dynamics from first principles and relates these dynamics to biological processes. Furthermore, this paper provides a fundamental building block needed to understand active inference for mixed generative models, which allow continuous sensations to inform discrete representations. This paper may be used in three ways: as a guide for research into outstanding challenges, as a practical guide on how to implement active inference to simulate experimental behaviour, or as a pointer towards various in-silico neurophysiological responses that may be used to make empirical predictions.


Introduction
Active inference is a normative principle underlying perception, action, planning, decision-making and learning in biological or artificial agents. It postulates that these processes may all be seen as optimising two complementary objective functions; namely, a variational free energy, which measures the fit between an internal model and (past) sensory observations, and an expected free energy, which scores possible (future) courses of action in relation to prior preferences. Active inference has been employed to simulate a wide range of complex behaviours, including planning and navigation [1], reading [2], curiosity and abstract rule learning [3], saccadic eye movements [4], visual foraging [5,6], visual neglect [7], hallucinations [8], niche construction [9,10], social conformity [11], impulsivity [12], image recognition [13] and the mountain car problem [14-16]. The key idea that underwrites these simulations is that creatures use an internal forward (generative) model to predict their sensory input, which they use to infer the causes of these data.
Early formulations of active inference employed generative models expressed in continuous space and time (for an introduction see [17], for a review see [18]), with behaviour modelled as a continuously evolving random dynamical system. However, we know that some processes in the brain conform better to discrete, hierarchical representations than to continuous representations (e.g., visual working memory [19,20], state estimation via place cells [21,22], language, etc.). Reflecting this, many of the paradigms studied in neuroscience are naturally framed as discrete state-space problems. Decision-making tasks are a prime candidate for this, as they often entail a series of discrete alternatives that an agent needs to choose among (e.g., multi-arm bandit tasks [23-25], multi-step decision tasks [26]). This explains why, in active inference, agent behaviour is often modelled using a discrete state-space formulation, the particular applications of which are summarised in Table 1. More recently, mixed generative models [27], combining discrete and continuous states, have been used to model behaviour involving discrete and continuous representations (e.g., decision-making and movement [28], pharmacologically induced changes in eye-movement control [29] or reading, which involves continuous visual sampling informing inferences about discrete semantics [27]).
Table 1: Applications of active inference (discrete state-space).

Application | Description | References
Neuromodulation | Use of precision parameters to manipulate exploration during saccadic searches; associating uncertainty with cholinergic and noradrenergic systems. | [6,29,49,50]
Decisions to movements | Mixed generative models combining discrete and continuous states to implement decisions through movement. | [27,28]
Planning, navigation and niche construction | Agent induced changes in environment (generative process); decomposition of goals into subgoals. | [1,9,10]
Atari games | Active inference compares favourably to reinforcement learning in the game of Doom. | [51]
Machine learning | Scaling active inference to more complex machine learning problems. | [52]

Due to the pace of recent theoretical advances in active inference, it is often difficult to retain a comprehensive overview of its process theory and practical implementation. In this paper, we hope to provide a comprehensive (mathematical) synthesis of active inference on discrete state-space models. This technical summary provides an overview of the theory, derives the associated (neuronal) dynamics from first principles and relates these to known biological processes. Furthermore, this paper and [18] provide the building blocks necessary to understand active inference on mixed generative models. This paper can be read as a practical guide on how to implement active inference for simulating experimental behaviour, or a pointer towards various in-silico neuro- and electro-physiological responses that can be tested empirically.
This paper is structured as follows. Section 2 is a high-level overview of active inference. The following sections elucidate the formulation by deriving the entire process theory from first principles; incorporating perception, planning and decision-making. This formalises the action-perception cycle: 1) an agent is presented with a stimulus, 2) it infers its latent causes, 3) plans into the future and 4) realises its preferred course of action; the cycle then repeats. This enactive cycle allows us to explore the dynamics of synaptic plasticity, which mediate learning of the contingencies of the world at slower timescales. We conclude in section 9 with an overview of structure learning in active inference.

Active inference
To survive in a changing environment, biological (and artificial) agents must maintain their sensations within a certain hospitable range (i.e., maintaining homeostasis through allostasis). In brief, active inference proposes that agents achieve this by optimising two complementary objective functions, a variational free energy and an expected free energy. The former measures the fit between an internal (generative) model of its sensations and sensory observations, while the latter scores each possible course of action in terms of its ability to reach the range of "preferred" states of being.
Our first premise is that agents represent the world through an internal model. Through minimisation of variational free energy, this model becomes a good model of the environment. In other words, this probabilistic model and the probabilistic beliefs that it encodes are continuously updated to mirror the environment and its dynamics. Such a world model is considered to be generative, in that it is able to generate predictions about sensations (e.g., during planning or dreaming), given beliefs about future states of being. If an agent senses a heat source (e.g., another agent) via some temperature receptors, the sensation of warmth represents an observed outcome and the temperature of the heat source a hidden state; minimisation of variational free energy then ensures that beliefs about hidden states closely match the true temperature. Formally, the generative model is a joint probability distribution over possible hidden states and sensory consequences, which specifies how the former cause the latter, and minimisation of variational free energy enables the agent to "invert" the model; i.e., determine the most likely hidden states given sensations. The variational free energy is the negative evidence lower bound that is optimised in variational Bayes in machine learning [53,54]. Technically, by minimising variational free energy, agents perform approximate Bayesian inference [55,56], which enables them to infer the causes of their sensations (e.g., perception). This is the point of contact between active inference and the Bayesian brain [57-59]. Crucially, agents may incorporate an optimism bias [60,61] in their model, thereby scoring certain "preferred" sensations as more likely. This lends a higher plausibility to those courses of action that realise these sensations. In other words, a preference is simply something an agent (believes it) is likely to work towards.
To maintain homeostasis, and ensure survival, agents must minimise surprise. Since the generative model scores preferred outcomes as more likely, minimising surprise corresponds to maximising model evidence. In active inference, this is assured by the aforementioned processes; indeed, the variational free energy turns out to be an upper bound on surprise, and minimising expected free energy ensures preferred outcomes are realised, thereby avoiding surprise on average.
Active inference can thus be framed as the minimisation of surprise [62-65] by perception and action. In discrete state models of the sort discussed here, this means agents select from different possible courses of action (i.e., policies) in order to realise their preferences and thus minimise the surprise that they expect to encounter in the future. This enables a Bayesian formulation of the perception-action cycle [66]: agents perceive the world by minimising variational free energy, ensuring their model is consistent with past observations, and act by minimising expected free energy, to make future sensations consistent with their model. This account of behaviour can be concisely framed as self-evidencing [67].
In contrast to other normative models of behaviour, active inference is a 'first principle' account, which is grounded in statistical physics [68,69]. Active inference describes the dynamics of systems that persist (i.e., do not dissipate) during some timescale of interest, and that can be statistically segregated from their environment; conditions which are satisfied by biological systems. Mathematically, the first condition means that the system is at non-equilibrium steady-state (NESS). This implies the existence of a steady-state probability density to which the system self-organises, and to which it returns after perturbation (i.e., the agent's preferences). The statistical segregation condition is the presence of a Markov blanket (cf. Figure 1) [70,71]: a set of variables through which states internal and external to the system interact (e.g., the skin is a Markov blanket for the human body). Under these assumptions it can be shown that the states internal to the system parameterise Bayesian beliefs about external states and that their dynamics can be cast as a process of variational free energy minimisation. This coincides with existing approaches to approximate inference [53,72-74]. Furthermore, it can be shown that the most likely courses of action taken by those systems are those which minimise expected free energy, a quantity that subsumes many existing constructs in science and engineering (see section 7).
By subscribing to the above assumptions, it is possible to describe the behaviour of viable living systems as performing active inference. The remaining challenge is to determine the computational and physiological processes that implement active inference. This paper aims to summarise possible answers to this question, by reviewing the technical details of a process theory for active inference on discrete state-space generative models, first presented in [43]. Note that it is important to distinguish active inference as a principle (presented above) from active inference as a process theory. The former is a consequence of fundamental assumptions about living systems, while the latter is a hypothesis concerning the computational and biological processes in the brain that might implement active inference.
The ensuing process theory can then be used to predict plausible neuronal dynamics and electrophysiological responses that are elicited experimentally.

Discrete state-space generative models
The generative model [53] expresses how the agent represents the world. This is a joint probability distribution over sensory data and the hidden (or latent) causes of these data. The sorts of discrete state-space generative models used in active inference are specifically suited to represent discrete time series and decision-making tasks. These can be expressed as variants of partially observable Markov decision processes (POMDPs; [75]): from simple Markov decision processes [76-78] to generalisations in the form of deep probabilistic (hierarchical) models [2,79,80]. For clarity, the process theory is derived for the simplest model that facilitates understanding of subsequent generalisations; namely, a POMDP where the agent holds beliefs about the probability of the initial state (specified as D), the transition probabilities from one state to the next (defined as matrix B) and the probability of outcomes given states (i.e., the likelihood matrix A); see Figure 2.
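To make the roles of D, B and A concrete, the following minimal sketch samples a short sequence of hidden states and outcomes from a toy model of this kind. The two-state, two-outcome setup, the numerical values and the helper names are illustrative assumptions that do not come from the paper, and a single transition matrix B is used (i.e., one fixed action at every time-step) to keep the example short.

```python
import random

# Hypothetical toy model: two possible states, two possible outcomes.
# All numbers below are illustrative assumptions.
D = [0.5, 0.5]          # D[i]    = P(s_1 = i), beliefs about the initial state
B = [[0.9, 0.2],        # B[i][j] = P(s_{tau+1} = i | s_tau = j), transitions
     [0.1, 0.8]]
A = [[0.8, 0.3],        # A[k][i] = P(o_tau = k | s_tau = i), likelihood
     [0.2, 0.7]]


def sample_categorical(probs, rng):
    """Draw an index according to a vector of categorical probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def column(matrix, j):
    """Return column j of a matrix stored as a list of rows."""
    return [row[j] for row in matrix]


def sample_trajectory(T, rng):
    """Sample hidden states s_1..s_T and outcomes o_1..o_T from the model."""
    states, outcomes = [], []
    s = sample_categorical(D, rng)          # initial state drawn from D
    for _ in range(T):
        states.append(s)
        outcomes.append(sample_categorical(column(A, s), rng))  # outcome given state
        s = sample_categorical(column(B, s), rng)               # next state given state
    return states, outcomes


states, outcomes = sample_trajectory(5, random.Random(0))
print(states, outcomes)
```

Each column of A and B is a categorical distribution, which is why the columns (not the rows) sum to one in this convention.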
Notation | Meaning | Type
Dir | Dirichlet distribution (conjugate prior of the categorical distribution); probability distribution over the parameter space of the categorical distribution, parameterised by a vector of positive reals. | Probability distribution over ...
... | Matrix indexing convention. |
$a$, $\mathbf{a}$ | Parameters of prior and approximate posterior beliefs about A. |
$a_0$, $\mathbf{a}_0$ | Matrices of the same size as $a$, $\mathbf{a}$, with homogeneous columns; the elements of the $i$-th column are denoted by $a_{i0}$, $\mathbf{a}_{i0}$ and defined by $a_{i0} = \sum_j a_{ji}$, $\mathbf{a}_{i0} = \sum_j \mathbf{a}_{ji}$. |
log, Γ, ψ | Natural logarithm, gamma function and digamma function. By convention these functions are taken component-wise on vectors and matrices. | Functions.
$E_{P(X)}[f(X)]$ | Expectation of random variable $f(X)$ under a probability density $P(X)$, taken component-wise if $f(X)$ is a matrix: $E_{P(X)}[f(X)] := \int f(X) P(X)\, dX$. | Real-valued operator on random variables.
$H[P]$ | Shannon entropy of a probability distribution $P$. Explicitly, $H[P] = E_{P(x)}[-\log P(x)]$. | Functional over probability distributions.
As mentioned above, a substantial body of work justifies describing certain neuronal representations with discrete state-space generative models (e.g., [19,20,86]). Furthermore, it has long been known that, at the level of neuronal populations, computations occur periodically (i.e., in distinct and sometimes nested oscillatory bands). Similarly, there is evidence for sequential computation in a number of processes (e.g., attention [87-89], visual perception [90,91]) and at different levels of the neuronal hierarchy [2,92], in line with ideas from hierarchical predictive processing [93,94]. This accommodates the fact that visual saccadic sampling of observations occurs at a frequency of approximately 4 Hz [28]. The relatively slow presentation of a discrete sequence of observations enables inferences to be performed in peristimulus time by (much) faster neuronal dynamics.
Active inference, implicitly, accounts for fast and slow neuronal dynamics. At each time-step the agent observes an outcome, from which it infers the past, present and future (hidden) states through perception. This underwrites a plan into the future, by evaluating (the expected free energy of) possible policies. The inferred (best) policies specify the most likely action, which is executed. At a slower timescale, parameters encoding the contingencies of the world (e.g., A) are inferred. This is referred to as learning. Even more slowly, the structure of the generative model is updated to better account for available observations; this is called structure learning. The following sections elucidate these aspects of the active inference process theory. This paper will be largely concerned with deriving and interpreting the inferential dynamics that agents might implement using the generative model in Figure 2. We leave the discussion of more complex models to Appendix A, since the derivations are analogous in those cases.

Free energy and model evidence
Variational Bayesian inference rests upon minimisation of a quantity called (variational) free energy, which provides an upper bound on the improbability (i.e., the surprise) of sensory observations under a generative model. Simultaneously, variational free energy minimisation is a statistical inference technique that enables the approximation of the posterior distribution in Bayes rule. In machine learning, this is known as variational Bayes [53,72-74]. Active inference agents minimise variational free energy, enabling concomitant maximisation of their model evidence and inference of the latent variables of their generative model. In the following, we fix a particular time point t ∈ {1, ..., T}, at which the agent has observed a sequence of outcomes $o_{1:t}$. The posterior about the latent causes of sensory data is given by Bayes rule:

$$P(s_{1:T}, A, \pi \mid o_{1:t}) = \frac{P(o_{1:t} \mid s_{1:T}, A, \pi)\, P(s_{1:T}, A, \pi)}{P(o_{1:t})} \tag{1}$$

Figure 1: A Markov blanket is a set of variables through which states internal and external to the system interact. Specifically, the system must be such that we can partition it into a Bayesian network of internal states µ, external states η, sensory states o and active states u (µ, o and u are often referred to together as particular states), with probabilistic (causal) links in the directions specified by the arrows. All interactions between internal and external states are therefore mediated by the blanket states b. The sensory states represent the sensory information that the body receives from the environment and the active states express how the body influences the environment. This blanket assumption is quite generic, in that it can be reasonably assumed for a brain as well as elementary organisms. For example, when considering a bacillus, the sensory states become the cell membrane and the active states comprise the actin filaments of the cytoskeleton. Under the Markov blanket assumption, together with the assumption that the system persists over time (i.e., possesses a non-equilibrium steady state), a generalised synchrony appears, such that the dynamics of the internal states can be cast as performing inference over the external states (and vice-versa) via a minimisation of variational free energy [68,69]. This coincides with existing approaches to inference; i.e., variational Bayes [53,72-74]. This can be viewed as the internal states mirroring external states, via sensory states (e.g., perception), and external states mirroring internal states via active states (e.g., a generalised form of self-assembly, autopoiesis or niche construction). Furthermore, under these assumptions the most likely courses of action can be shown to minimise expected free energy. Note that external states beyond the system should not be confused with the hidden states of the agent's generative model (which model external states). In fact, the internal states are exactly the parameters (i.e., sufficient statistics) encoding beliefs about hidden states and other latent variables, which model external states in a process of variational free energy minimisation. Hidden and external states may or may not be isomorphic. In other words, an agent uses its internal states to represent hidden states that may or may not exist in the external world.

Figure 2: ... states, outcomes and other variables that cause outcomes. In this representation, states unfold in time causing an observation at each time-step. The likelihood matrix A encodes the probabilities of state-outcome pairs. The policy π specifies which action to perform at each time-step. Note that the agent's preferences may be specified either in terms of states or outcomes. It is important to distinguish between states (resp. outcomes) that are random variables, and the possible values that they can take in S (resp. in O), which we refer to as possible states (resp. possible outcomes). Note that this type of representation comprises a finite number of timesteps, actions, policies, states, outcomes, possible states and possible outcomes. In Panel 2b, the generative model is displayed as a probabilistic graphical model [53,71,74,81] expressed in factor graph form [82]. The variables in circles are random variables, while squares represent factors, whose specific forms are given in Panel 2a. The arrows represent causal relationships (i.e., conditional probability distributions). The variables highlighted in grey can be observed by the agent, while the remaining variables are inferred through approximate Bayesian inference (see section 4) and called hidden or latent variables. Active inference agents perform inference by optimising the parameters of an approximate posterior distribution (see section 4). Panel 2c specifies how this approximate posterior factorises under a particular mean-field approximation [83], although other factorisations may be used [84,85]. A glossary of terms used in this figure is available in Table 2. The mathematical yoga of generative models is heavily dependent on Markov blankets. The Markov blanket of a random variable in a probabilistic graphical model comprises those variables that share a common factor with it. Crucially, a variable conditioned upon its Markov blanket is conditionally independent of all other variables. We will use this property extensively (and implicitly) in the text.
Computing the posterior distribution requires computing the model evidence $P(o_{1:t}) = \sum_{\pi \in \Pi} \sum_{s_{1:T} \in S^T} \int P(o_{1:t}, s_{1:T}, A, \pi)\, dA$, which is intractable for complex generative models embodied by biological and artificial systems [92], a well-known problem in Bayesian statistics. An alternative to computing the exact posterior distribution is to optimise an approximate posterior distribution over latent causes $Q(s_{1:T}, A, \pi)$, by minimising the Kullback-Leibler (KL) divergence [95] ($D_{KL}$), a non-negative measure of discrepancy between probability distributions. We can use the definition of the KL divergence and Bayes rule to arrive at the variational free energy F, which is a functional of approximate posterior beliefs:

$$\begin{aligned} 0 \le D_{KL}[Q(s_{1:T}, A, \pi) \,\|\, P(s_{1:T}, A, \pi \mid o_{1:t})] &= E_Q[\log Q(s_{1:T}, A, \pi) - \log P(s_{1:T}, A, \pi \mid o_{1:t})] \\ &= \underbrace{E_Q[\log Q(s_{1:T}, A, \pi) - \log P(o_{1:t}, s_{1:T}, A, \pi)]}_{F[Q]} + \log P(o_{1:t}) \\ \Rightarrow\ -\log P(o_{1:t}) &\le F[Q] \end{aligned} \tag{2}$$

From (2), one can see that varying Q to minimise the variational free energy enables us to approximate the true posterior, while simultaneously ensuring that surprise remains low. This means that variational free energy minimising agents simultaneously infer the latent causes of their observations and maximise the evidence for their generative model. To aid intuition, the variational free energy can be rearranged into complexity and accuracy:

$$F[Q] = \underbrace{D_{KL}[Q(s_{1:T}, A, \pi) \,\|\, P(s_{1:T}, A, \pi)]}_{\text{Complexity}} - \underbrace{E_Q[\log P(o_{1:t} \mid s_{1:T}, A, \pi)]}_{\text{Accuracy}} \tag{3}$$

The first term of (3) can be regarded as complexity: a simple explanation for observable data Q, which makes few assumptions over and above the prior (i.e., with KL divergence close to zero), is a good explanation. In other words, a good explanation is an accurate account of some data that requires minimal movement in updating prior to posterior beliefs (cf. Occam's principle). The second term is accuracy; namely, the (expected log) probability of the data given posterior beliefs about model parameters Q. In other words, how well the generative model fits the observed data.
The idea that neural representations weigh complexity against accuracy underwrites the imperative to find the most accurate explanation for sensory observations that is minimally complex, an imperative formalised in, for example, Horace Barlow's principle of minimum redundancy [96] and subsequently supported empirically [97-100]. Figure 3 illustrates the various implications of minimising free energy.
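The bound on surprise and the complexity-accuracy decomposition can be checked numerically on a deliberately small example. The sketch below assumes a single hidden state with a categorical prior D, a known likelihood matrix A and one observed outcome, so that Q is just a categorical vector q; policies and beliefs about A are left out, and all names and numbers are illustrative assumptions.

```python
import math

# Hypothetical one-step model: prior D over two states, known likelihood A.
D = [0.5, 0.5]                 # D[i]    = P(s = i)
A = [[0.8, 0.3],               # A[k][i] = P(o = k | s = i)
     [0.2, 0.7]]
o = 0                          # index of the observed outcome


def free_energy(q):
    """F[Q] = E_Q[log Q(s) - log P(o, s)] for a categorical Q with parameters q."""
    return sum(qi * (math.log(qi) - math.log(A[o][i] * D[i]))
               for i, qi in enumerate(q) if qi > 0)


def complexity(q):
    """KL divergence between the approximate posterior q and the prior D."""
    return sum(qi * math.log(qi / D[i]) for i, qi in enumerate(q) if qi > 0)


def accuracy(q):
    """Expected log likelihood of the observed outcome under q."""
    return sum(qi * math.log(A[o][i]) for i, qi in enumerate(q))


evidence = sum(A[o][i] * D[i] for i in range(len(D)))        # P(o)
surprise = -math.log(evidence)                                # -log P(o)
posterior = [A[o][i] * D[i] / evidence for i in range(len(D))]

q = [0.6, 0.4]                                                # an arbitrary belief
print(free_energy(q), complexity(q) - accuracy(q), surprise)
print(free_energy(posterior), surprise)                       # bound is tight here
```

For any q the free energy is at least as large as the surprise, and the two coincide when q equals the exact posterior, which is the sense in which minimising F both approximates the posterior and bounds surprise.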

On the family of approximate posteriors
The goal is now to minimise variational free energy with respect to Q. To obtain a tractable expression for the variational free energy, we need to assume a certain simplifying factorisation of the approximate posterior. There are many possible forms [84,117,118] (e.g., mean-field, marginal, Bethe), each of which trades off the quality of the inferences with the complexity of the computations involved. For the purpose of this paper we use a particular (structured) mean-field approximation (cf. Figure 2):

$$Q(s_{1:T}, A, \pi) = Q(A)\, Q(\pi) \prod_{\tau=1}^{T} Q(s_\tau \mid \pi) \tag{4}$$

This choice is driven by didactic purposes and by the fact that this factorisation has been used extensively in the active inference literature (e.g., [2,27,43]). However, the most recent software implementation of active inference (i.e., spm_MDP_VB_X.m) employs a marginal approximation [84,119], which retains the simplicity and biological interpretation of the neuronal dynamics afforded by the mean-field approximation, while approximating the more accurate inferences of the Bethe approximation. For these reasons, the marginal free energy currently stands as the most biologically plausible.
Figure 3: Markov blankets and self-evidencing. This schematic illustrates the various interpretations of minimising variational free energy. Recall that the existence of a Markov blanket implies a certain lack of influences among internal, blanket and external states. These independencies have an important consequence; internal and active states are the only states that are not influenced by external states, which means their dynamics (i.e., perception and action) are a function of, and only of, particular states (i.e., internal, sensory and active states); here, the variational (free energy) bound on surprise. This surprise has a number of interesting interpretations. Given it is the negative log probability of finding a particle or creature in a particular state, minimising surprise corresponds to maximising the value of a particle's state. This interpretation is licensed by the fact that the states with a high probability are, by definition, attracting states. On this view, one can then spin-off an interpretation in terms of reinforcement learning [76], optimal control theory [101] and, in economics, expected utility theory [102]. Indeed, any scheme predicated on the optimisation of some objective function can now be cast in terms of minimising surprise, in terms of perception and action (i.e., the dynamics of internal and active states), by specifying these optimal values to be the agent's preferences. The minimisation of surprise (i.e., self-information) leads to a series of influential accounts of neuronal dynamics; including the principle of maximum mutual information [103,104], the principles of minimum redundancy and maximum efficiency [105] and the free energy principle [65]. Crucially, the average or expected surprise (over time or particular states of being) corresponds to entropy. This means that action and perception look as if they are minimising entropy. This leads us to theories of self-organisation, such as synergetics in physics [106-108] or homeostasis in physiology [109-111]. Finally, the probability of any blanket states given a Markov blanket (m) is, on a statistical view, model evidence [112,113]. This means that all the above formulations are internally consistent with things like the Bayesian brain hypothesis, evidence accumulation and predictive coding; most of which inherit from Helmholtz's notion of unconscious inference [114], later unpacked in terms of perception as hypothesis testing in 20th century psychology [115] and machine learning [116].

Computing the variational free energy
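The next sections focus on producing biologically plausible neuronal dynamics that perform perception and learning based on variational free energy minimisation. To enable this, we first compute the variational free energy, using the factorisations of the generative model and approximate posterior (cf. Figure 2):

$$F[Q] = D_{KL}[Q(A) \,\|\, P(A)] + D_{KL}[Q(\pi) \,\|\, P(\pi)] + E_{Q(\pi)}\big[F_\pi[Q(s_{1:T} \mid \pi)]\big] \tag{5}$$

where

$$F_\pi[Q(s_{1:T} \mid \pi)] := \sum_{\tau=1}^{T} E_{Q(s_\tau \mid \pi)}[\log Q(s_\tau \mid \pi)] - \sum_{\tau=1}^{t} E_{Q(s_\tau \mid \pi) Q(A)}[\log P(o_\tau \mid s_\tau, A)] - E_{Q(s_1 \mid \pi)}[\log P(s_1)] - \sum_{\tau=2}^{T} E_{Q(s_\tau \mid \pi) Q(s_{\tau-1} \mid \pi)}[\log P(s_\tau \mid s_{\tau-1}, \pi)] \tag{6}$$

is the variational free energy conditioned upon a policy. This is the same quantity that we would have obtained by omitting A and conditioning all probability distributions in the numerators of (1) by π. In the next section, we will see how perception can be framed in terms of variational free energy minimisation.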

Perception
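In active inference, perception is equated with state estimation [43] (e.g., inferring the temperature from the sensation of warmth), consistent with the idea that perceptions are hypotheses [115]. To infer the (past, present and future) states of the environment, an agent must minimise the variational free energy with respect to $Q(s_{1:T} \mid \pi)$ for each policy π. This provides the agent's inference over hidden states, contingent upon pursuing a given policy. Since the only part of the free energy that depends on $Q(s_{1:T} \mid \pi)$ is $F_\pi$, the agent must simply minimise $F_\pi$. Substituting $Q(s_\tau \mid \pi)$ by their sufficient statistics (i.e., parameters $s_{\pi\tau}$), $F_\pi$ becomes a function of those parameters. This enables us to rewrite (6) conveniently in matrix form:

$$F_\pi(s_{\pi 1}, \ldots, s_{\pi T}) = \sum_{\tau=1}^{T} s_{\pi\tau}^\top \log s_{\pi\tau} - \sum_{\tau=1}^{t} o_\tau^\top \log(\mathbf{A})\, s_{\pi\tau} - s_{\pi 1}^\top \log D - \sum_{\tau=2}^{T} s_{\pi\tau}^\top \log(B_{\pi,\tau-1})\, s_{\pi,\tau-1} \tag{7}$$

Here $o_\tau$ is the one-hot vector encoding the observed outcome, $\log(\mathbf{A}) := E_{Q(A)}[\log A]$ and $B_{\pi\tau}$ is the transition matrix prescribed by policy π at time τ. This enables us to compute the variational free energy gradients [120]:

$$\nabla_{s_{\pi\tau}} F_\pi = \mathbf{1} + \log s_{\pi\tau} - \begin{cases} \log(\mathbf{A})^\top o_\tau + \log(B_{\pi\tau})^\top s_{\pi,\tau+1} + \log D & \text{if } \tau = 1 \\ \log(\mathbf{A})^\top o_\tau + \log(B_{\pi\tau})^\top s_{\pi,\tau+1} + \log(B_{\pi,\tau-1})\, s_{\pi,\tau-1} & \text{if } 1 < \tau \le t \\ \log(B_{\pi\tau})^\top s_{\pi,\tau+1} + \log(B_{\pi,\tau-1})\, s_{\pi,\tau-1} & \text{if } \tau > t \end{cases} \tag{8}$$

where the term involving $s_{\pi,\tau+1}$ is absent for τ = T. The neuronal dynamics are given by a gradient descent on free energy [43], with state-estimation expressed as a softmax function σ of accumulated (negative) free energy gradients:

$$\dot{v}(s_{\pi\tau}, t) = -\nabla_{s_{\pi\tau}} F_\pi, \qquad s_{\pi\tau} = \sigma(v) \tag{9}$$

The constant term $\mathbf{1}$ is generally omitted since the softmax function removes it anyway. This enables us to equate the gradient with a prediction error.
The softmax function, a generalisation of the sigmoid to vector inputs, is a natural choice, as the variational free energy gradient is a logarithm and the components of $s_{\pi\tau}$ must sum to one.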
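The following sketch illustrates state estimation as a softmax function of accumulated negative free energy gradients, in the simplest case of a single time-step with one observation, a known likelihood matrix A and a prior D. The Euler discretisation, the step size, the number of iterations and the toy numbers are assumptions made for illustration; they do not reproduce the schedule used in any particular software implementation.

```python
import math


def softmax(v):
    """Numerically stable softmax of a list of real numbers."""
    m = max(v)
    e = [math.exp(x - m) for x in v]
    z = sum(e)
    return [x / z for x in e]


# Hypothetical one-step model (illustrative numbers).
D = [0.5, 0.5]                 # prior over the hidden state
A = [[0.8, 0.3],               # A[k][i] = P(o = k | s = i)
     [0.2, 0.7]]
o = 0                          # index of the observed outcome


def prediction_error(s):
    """Negative free energy gradient, with the constant term omitted."""
    return [math.log(A[o][i]) + math.log(D[i]) - math.log(s[i])
            for i in range(len(s))]


def infer_states(steps=200, dt=0.25):
    """Euler integration of dv/dt = prediction error, with s = softmax(v)."""
    v = [0.0] * len(D)         # plays the role of a membrane potential
    s = softmax(v)             # plays the role of a firing rate
    for _ in range(steps):
        eps = prediction_error(s)
        v = [vi + dt * e for vi, e in zip(v, eps)]
        s = softmax(v)
    return s


s_hat = infer_states()
evidence = sum(A[o][i] * D[i] for i in range(len(D)))
exact = [A[o][i] * D[i] / evidence for i in range(len(D))]
print(s_hat, exact)
```

In this single-step case the fixed point of the dynamics is the exact Bayesian posterior, because the prediction error vanishes (up to a constant that the softmax removes) precisely when log s equals the log likelihood plus the log prior.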

Plausibility of neuronal dynamics
The temporal dynamics expressed in (9) unfold at a much faster timescale than the sampling of new observations (i.e., within timesteps) and correspond to fast neuronal processing in peristimulus time. This is consistent with behaviour-relevant computations at frequencies that are higher than the rate of visual sampling (e.g., working memory [121], visual stimulus perception in humans [90] and macaques [91]).
Furthermore, these dynamics are consistent with predictive processing [122,123], since active inference prescribes dynamics that minimise prediction error, although they generalise it to a wide range of generative models. Note that, while also a variational free energy gradient, this sort of prediction error is not the same as that given by predictive coding schemes (which rely upon a different sort of generative model) [17,18,124].
Just as neuronal dynamics involve translation from post-synaptic potentials to firing rates, (9) involves translating from a vector of real numbers ($v$) to a vector whose elements are bounded between zero and one ($s_{\pi\tau}$), via the softmax function. As a result, it is natural to interpret the components of $v$ as the average membrane potential of distinct neural populations, and $s_{\pi\tau}$ as the average firing rate of those populations, which is bounded thanks to neuronal refractory periods. This is consistent with mean-field formulations of neural population dynamics, in that the average firing rate of a neuronal population follows a sigmoid function of the average membrane potential [125-127]. Using the fact that a softmax function is a generalisation of the sigmoid to vector inputs (here the average membrane potentials of coupled neuronal populations), it follows that their average firing follows a softmax function of their average potential. In this context, the softmax function may be interpreted as performing lateral inhibition, which can be thought of as leading to narrower tuning curves of individual neurons and thereby sharper inferences [128]. Importantly, this tells us that state-estimation can be performed in parallel by different neuronal populations, and a simple neuronal architecture is sufficient to implement these dynamics (see Figure 6 in [84]).
Lastly, interpreting the dynamics in this way has a degree of face validity, as it enables us to synthesise a wide range of biologically plausible electrophysiological responses; including repetition suppression, mismatch negativity, violation responses, place-cell activity, phase precession, theta sequences, theta-gamma coupling, evidence accumulation, race-to-bound dynamics and transfer of dopamine responses [36,43].
The neuronal dynamics for state estimation coincide with variational message passing [129,130]: a widely used algorithm for approximate Bayesian inference. This is an important result, since it shows that variational message passing emerges under active inference using a particular mean-field approximation. If one were to use the Bethe approximation, the corresponding dynamics coincide with belief propagation [53,82,84,85,117], another widely used algorithm for approximate inference. This offers a formal connection between active inference and message passing interpretations of neuronal dynamics [27,131,132]. In the next section, we examine planning, decision-making and action selection.
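Two of the claims in this paragraph are easy to check numerically: that the softmax reduces to the sigmoid for two competing populations, and that raising the potential of one population lowers the firing rate of the others (a simple reading of lateral inhibition). The sketch below is illustrative only; the potentials are arbitrary numbers chosen for the example.

```python
import math


def sigmoid(x):
    """Logistic sigmoid of a real number."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(v):
    """Numerically stable softmax of a list of real numbers."""
    m = max(v)
    e = [math.exp(x - m) for x in v]
    z = sum(e)
    return [x / z for x in e]


# Two populations: the softmax of [x, 0] recovers the sigmoid of x.
x = 1.3
two_way = softmax([x, 0.0])

# Three populations: depolarising the first one suppresses the others.
before = softmax([1.0, 1.0, 1.0])
after = softmax([2.0, 1.0, 1.0])

print(two_way[0], sigmoid(x))
print(before, after)
```

Because the outputs are normalised to sum to one, any increase in one population's rate is necessarily paid for by the remaining populations, which is what gives the softmax its competitive character.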

Planning, decision-making and action selection
So far, we have focused on optimising beliefs about hidden states under a particular policy by minimising a variational free energy functional of an approximate posterior over hidden states, under each policy.
In this section, we explain how planning and decision-making arise as a minimisation of expected free energy, a function scoring the goodness of each possible future course of action. We briefly motivate how the expected free energy arises from first principles. This allows us to frame decision-making and action-selection in terms of expected free energy minimisation. Finally, we conclude by discussing the computational cost of planning into the future.

Planning and decision-making
At the heart of active inference is a description of agents that strive to attain a target distribution specifying the range of preferred states of being, given a sufficient amount of time. To work towards reaching these preferences, agents select policies Q(π), such that their predicted states Q(s_τ, A) at some future time point τ > t (usually, the time horizon of a policy T) reach the preferred states P(s_τ, A), which are specified by the generative model. These considerations allow us to show in Appendix B that the requisite approximate posterior over policies Q(π) is a softmax function of the negative expected free energy G:

$$Q(\pi) = \sigma(-G(\pi)), \qquad G(\pi) = \underbrace{D_{KL}\big[Q(s_\tau, A \mid \pi) \,\|\, P(s_\tau, A)\big]}_{\text{Risk}} + \underbrace{\mathbb{E}_{Q(s_\tau, A \mid \pi)}\big[H[P(o_\tau \mid s_\tau, A)]\big]}_{\text{Ambiguity}} \tag{10}$$

This means that the most likely (i.e., best) policies minimise expected free energy. This ensures that future courses of action are exploitative (i.e., risk minimising) and explorative (i.e., ambiguity minimising). In particular, the expected free energy specifies the optimal balance between goal-seeking and itinerant novelty-seeking behaviour, given some prior preferences or goals. Note that the ambiguity term rests on an expectation over fictive (i.e., predicted) outcomes under beliefs about future states. This means that optimising beliefs about future states during perception is crucial to accurately predict future outcomes during planning. In summary, planning and decision-making respectively correspond to evaluating the expected free energy of different policies, which scores their goodness in relation to prior preferences, and forming approximate posterior beliefs about policies.
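As a rough numerical illustration of policy evaluation, the sketch below scores two hypothetical policies by risk plus ambiguity and converts the scores into a posterior over policies. It is a simplification under stated assumptions: the likelihood matrix A is treated as known (so the term involving beliefs about A is dropped), only one future time step is considered, and all numbers and function names are invented for illustration.

```python
import math

def softmax(v):
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def kl(q, p):
    """KL divergence between two categorical distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def expected_free_energy(q_s, p_s, A):
    """Risk plus ambiguity for one policy at one future time step.

    q_s: predicted states under the policy; p_s: preferred states;
    A[o][s]: probability of outcome o given state s (assumed known).
    """
    risk = kl(q_s, p_s)
    ambiguity = sum(q_s[s] * entropy([row[s] for row in A])
                    for s in range(len(q_s)))
    return risk + ambiguity

# Hypothetical model: state 0 yields precise outcomes, state 1 ambiguous ones.
A = [[0.9, 0.5],
     [0.1, 0.5]]
preferred_states = [0.8, 0.2]
predicted_states = [[0.9, 0.1],   # policy 0 leads mostly to state 0
                    [0.1, 0.9]]   # policy 1 leads mostly to state 1

G = [expected_free_energy(q, preferred_states, A) for q in predicted_states]
q_pi = softmax([-g for g in G])  # posterior over policies
```

In this toy example policy 0 is favoured on both counts: it brings about preferred states (low risk) and states whose outcomes are informative (low ambiguity).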

Action selection, policy-independent state-estimation
Approximate posterior beliefs about policies allow us to obtain the most plausible action as the most likely under all policies; this can be expressed as a Bayesian model average

$$a_t = \arg\max_{a} \sum_{\pi} \delta_{a, \pi_t}\, Q(\pi) \tag{11}$$

where δ is the Kronecker delta and π_t denotes the action prescribed by policy π at time t. In addition, we obtain a policy-independent state-estimation at any time point τ ∈ {1, ..., T}, as a Bayesian model average of approximate posterior beliefs about hidden states under policies:

$$Q(s_\tau) = \sum_{\pi} Q(s_\tau \mid \pi)\, Q(\pi) \tag{12}$$

Note that these Bayesian model averages may be implemented by neuromodulatory mechanisms [41].
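The two Bayesian model averages described above can be sketched in a few lines. The policies, probabilities and function names here are hypothetical; policies are represented as tuples of action indices, one per time step.

```python
def select_action(policies, q_pi, t, n_actions):
    """Pick the action with the greatest posterior mass at time t.

    Each policy contributes its probability to the action it prescribes at t,
    which plays the role of the Kronecker delta in the model average.
    """
    action_prob = [0.0] * n_actions
    for policy, prob in zip(policies, q_pi):
        action_prob[policy[t]] += prob
    best = max(range(n_actions), key=lambda a: action_prob[a])
    return best, action_prob

def average_states(q_s_given_pi, q_pi):
    """Policy-independent state beliefs as a weighted average over policies."""
    n_states = len(q_s_given_pi[0])
    return [sum(q_pi[k] * q_s_given_pi[k][s] for k in range(len(q_pi)))
            for s in range(n_states)]

# Hypothetical example: three two-step policies over two actions.
policies = [(0, 1), (1, 1), (0, 0)]
q_pi = [0.30, 0.45, 0.25]
best_action, action_prob = select_action(policies, q_pi, t=0, n_actions=2)

# Hypothetical state beliefs under each policy at some time point.
q_s_given_pi = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
q_s = average_states(q_s_given_pi, q_pi)
```

Note that the selected action (0) differs from the first action of the single most probable policy (1), because two less probable policies agree on action 0 and together outweigh it.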

Biological plausibility
Winner-take-all architectures of decision-making are already commonplace in computational neuroscience (e.g., models of selective attention and recognition [133,134], hierarchical models of vision [135]). This is convenient, since the softmax function in (10) can be seen as providing a biologically plausible [125][126][127], smooth approximation to the maximum operation, which is known as soft winner-take-all [136]. In fact, the generative model presented in Figure 2 can be naturally extended such that the approximate posterior contains an (inverse) temperature parameter γ multiplying the expected free energy inside the softmax function (see Appendix A.2). This temperature parameter regulates how precisely the softmax approximates the maximum function, thus recovering winner-take-all architectures for high parameter values (technically, this converts Bayesian model averaging into Bayesian model selection, where the policy corresponds to a model of what the agent is doing). This parameter, regulating precision of policy selection, has a clear biological interpretation in terms of confidence encoded in dopaminergic firing [34][35][36][43]. Interestingly, Daw and colleagues [23] uncovered evidence in favour of a similar model employing a softmax function and temperature parameter in human decision-making.
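The effect of the (inverse) temperature parameter can be seen in a small sketch. The expected free energy values and the two values of γ below are arbitrary choices for illustration, and `policy_posterior` is a hypothetical helper name.

```python
import math

def policy_posterior(G, gamma=1.0):
    """Softmax of the negative expected free energy, scaled by precision gamma."""
    scores = [-gamma * g for g in G]
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical expected free energies of three policies.
G = [0.4, 0.6, 1.8]

low_precision = policy_posterior(G, gamma=0.1)    # close to uniform
high_precision = policy_posterior(G, gamma=50.0)  # close to winner-take-all
```

With low precision the posterior stays close to uniform, so Bayesian model averaging blends all policies; with high precision nearly all probability mass falls on the policy with the smallest expected free energy, which approximates model selection.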

Pruning of policy trees
From a computational perspective, planning (i.e., computing the expected free energy) for each possible policy can be cost-prohibitive, due to the combinatorial explosion in the number of sequences of actions when looking deep into the future. There has been work in understanding how the brain finesses this problem [137], which suggests a simple answer: during mental planning, humans stop evaluating a policy as soon as they encounter a large loss (i.e., a high value of the expected free energy that renders the policy highly implausible). In active inference this corresponds to using an Occam window; that is, we stop evaluating the expected free energy of a policy if it becomes much higher than that of the best (smallest expected free energy) policy, and set its approximate posterior probability to an arbitrarily low value accordingly. This biologically plausible pruning strategy drastically reduces the number of policies one has to evaluate exhaustively.
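A minimal sketch of this pruning strategy follows. It assumes that the expected free energy of a policy accumulates over time steps as a sum of non-negative per-step contributions, so that a partial sum exceeding the window already rules the policy out. The window size, the per-step values and the function name are hypothetical, and pruned policies are given probability zero rather than an arbitrarily small value.

```python
import math

def prune_with_occam_window(step_costs, window=3.0):
    """Evaluate policies step by step, abandoning any whose running expected
    free energy exceeds that of the best fully evaluated policy by `window`.

    step_costs[k][tau] is the (assumed non-negative) contribution of time
    step tau to the expected free energy of policy k.
    """
    best = math.inf
    totals = []
    for costs in step_costs:
        running = 0.0
        pruned = False
        for c in costs:
            running += c
            if running > best + window:
                pruned = True  # stop evaluating this policy early
                break
        if pruned:
            totals.append(math.inf)
        else:
            totals.append(running)
            best = min(best, running)
    # Posterior over policies; pruned policies receive zero probability.
    g_min = min(totals)
    weights = [math.exp(-(g - g_min)) if g < math.inf else 0.0 for g in totals]
    z = sum(weights)
    return [w / z for w in weights], totals

step_costs = [[0.2, 0.3, 0.1],   # policy 0
              [5.0, 0.1, 0.1],   # policy 1: large loss at the first step
              [0.4, 0.4, 0.4]]   # policy 2
posterior, totals = prune_with_occam_window(step_costs, window=3.0)
```

In this sketch the outcome depends on the order in which policies are evaluated, since the window is measured against the best policy found so far; a fuller implementation might evaluate all policies in parallel, one time step at a time.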
Although effective and biologically plausible, the Occam window for pruning policy trees cannot deal with large policy spaces that ensue with deep policy trees and long temporal horizons. This means that pruning can only partially explain how biological organisms perform deep policy searches. Further research is needed to characterise the processes by which biological agents reduce large policy spaces to tractable subspaces. One explanation for the remarkable capacity of biological agents to evaluate deep policy trees rests on deep (hierarchical) generative models, in which policies operate at each level. These deep models enable long-term policies, modelling slow transitions among hidden states at higher levels in the hierarchy, to contextualise faster state transitions at subordinate levels (see Appendix A).
The resulting (semi Markovian) process can then be specified in terms of a hierarchy of limited horizon policies that are nested over temporal scales; c.f., motor chunking [138][139][140].

Discussion of the action-perception cycle
Minimising variational and expected free energy are complementary and mutually beneficial processes. Minimisation of variational free energy ensures that the generative model is a good predictor of its environment; this allows the agent to accurately plan into the future by evaluating expected free energy, which in turn enables it to realise its preferences.
In other words, minimisation of variational free energy is a vehicle for effective planning and reaching preferences via the expected free energy; in turn, reaching preferences minimises the expected surprise of future states of being.
In conclusion, we have seen how agents plan into the future and make decisions about the best possible course of action. This concludes our discussion of the action-perception cycle. In the next section, we examine expected free energy in greater detail. Then, we will see how active agents can learn the contingencies of the environment and the structure of their generative model at slower timescales.

Properties of the expected free energy
The expected free energy is a fundamental construct of interest. In this section, we unpack its main features and highlight its importance in relation to many existing theories in neuroscience and engineering.
The expected free energy of a policy can be unpacked in a number of ways. Perhaps the most intuitive is in terms of risk and ambiguity:

$$G(\pi) = \underbrace{D_{KL}\big[Q(s_\tau, A \mid \pi) \,\|\, P(s_\tau, A)\big]}_{\text{Risk}} + \underbrace{\mathbb{E}_{Q(s_\tau, A \mid \pi)}\big[H[P(o_\tau \mid s_\tau, A)]\big]}_{\text{Ambiguity}} \tag{13}$$

This means that policy selection minimises risk and ambiguity. Risk, in this setting, is simply the difference between predicted and prior beliefs about final states. In other words, policies will be deemed more likely if they bring about states that conform to prior preferences. In the optimal control literature, this part of expected free energy underwrites KL control [141,142]. In economics, it leads to risk sensitive policies [143]. Ambiguity reflects the uncertainty about future outcomes, given hidden states. Minimising ambiguity therefore corresponds to choosing future states that generate unambiguous and informative outcomes (e.g., switching on a light in the dark).
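We can express the expected free energy of a policy as a bound on information gain and expected log (model) evidence (a.k.a., Bayesian risk):

$$\begin{aligned} G(\pi) &= \underbrace{\mathbb{E}_{Q(o_\tau \mid \pi)}\big[D_{KL}[Q(s_\tau, A \mid o_\tau, \pi) \,\|\, P(s_\tau, A \mid o_\tau)]\big]}_{\text{Expected evidence bound}} - \underbrace{\mathbb{E}_{Q(o_\tau \mid \pi)}\big[\log P(o_\tau)\big]}_{\text{Expected log evidence}} - \underbrace{\mathbb{E}_{Q(o_\tau \mid \pi)}\big[D_{KL}[Q(s_\tau, A \mid o_\tau, \pi) \,\|\, Q(s_\tau, A \mid \pi)]\big]}_{\text{Expected information gain}} \\ &\geq - \underbrace{\mathbb{E}_{Q(o_\tau \mid \pi)}\big[\log P(o_\tau)\big]}_{\text{Expected log evidence}} - \underbrace{\mathbb{E}_{Q(o_\tau \mid \pi)}\big[D_{KL}[Q(s_\tau, A \mid o_\tau, \pi) \,\|\, Q(s_\tau, A \mid \pi)]\big]}_{\text{Expected information gain}} \end{aligned} \tag{14}$$

The first term in (14) is the expectation of log evidence under beliefs about future outcomes, while the second ensures that this expectation is maximally informed, when outcomes are encountered. Collectively, these two terms underwrite the resolution of uncertainty about hidden states (i.e., information gain) and outcomes (i.e., expected surprise) in relation to prior beliefs.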
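When the agent's preferences are expressed in terms of outcomes (c.f., Figure 2), it is useful to express risk in terms of outcomes, as opposed to hidden states. This is most useful when the generative model is not known or during structure learning, when the state-space evolves over time. In these cases, the risk over hidden states can be replaced by risk over outcomes by assuming the KL divergence between the predicted and true posterior (under expected outcomes) is small:

$$\begin{aligned} G(\pi) &= \underbrace{D_{KL}\big[Q(s_\tau, A \mid \pi) \,\|\, P(s_\tau, A)\big]}_{\text{Risk (states)}} + \underbrace{\mathbb{E}_{Q(s_\tau, A \mid \pi)}\big[H[P(o_\tau \mid s_\tau, A)]\big]}_{\text{Ambiguity}} \\ &= \underbrace{D_{KL}\big[Q(o_\tau \mid \pi) \,\|\, P(o_\tau)\big]}_{\text{Risk (outcomes)}} + \underbrace{\mathbb{E}_{Q(s_\tau, A \mid \pi)}\big[H[P(o_\tau \mid s_\tau, A)]\big]}_{\text{Ambiguity}} + \underbrace{\mathbb{E}_{Q(o_\tau \mid \pi)}\big[D_{KL}[Q(s_\tau, A \mid o_\tau, \pi) \,\|\, P(s_\tau, A \mid o_\tau)]\big]}_{\text{Expected evidence bound}} \\ &\geq \underbrace{D_{KL}\big[Q(o_\tau \mid \pi) \,\|\, P(o_\tau)\big]}_{\text{Risk (outcomes)}} + \underbrace{\mathbb{E}_{Q(s_\tau, A \mid \pi)}\big[H[P(o_\tau \mid s_\tau, A)]\big]}_{\text{Ambiguity}} \end{aligned} \tag{15}$$

This divergence constitutes an expected evidence bound that also appears if we express expected free energy in terms of intrinsic and extrinsic value:

$$\begin{aligned} G(\pi) = &- \underbrace{\mathbb{E}_{Q(o_\tau \mid \pi)}\big[\log P(o_\tau)\big]}_{\text{Extrinsic value}} + \underbrace{\mathbb{E}_{Q(o_\tau \mid \pi)}\big[D_{KL}[Q(s_\tau, A \mid o_\tau, \pi) \,\|\, P(s_\tau, A \mid o_\tau)]\big]}_{\text{Expected evidence bound}} \\ &- \underbrace{\mathbb{E}_{Q(o_\tau \mid \pi)}\big[D_{KL}[Q(s_\tau \mid o_\tau, \pi) \,\|\, Q(s_\tau \mid \pi)]\big]}_{\text{Intrinsic value (states) or salience}} - \underbrace{\mathbb{E}_{Q(o_\tau, s_\tau \mid \pi)}\big[D_{KL}[Q(A \mid o_\tau, s_\tau, \pi) \,\|\, Q(A)]\big]}_{\text{Intrinsic value (parameters) or novelty}} \end{aligned} \tag{16}$$

Extrinsic value is just the expected value of log evidence, which can be associated with reward and utility in behavioural psychology and economics, respectively [144][145][146]. In this setting, extrinsic value is the negative of Bayesian risk [147], when reward is log evidence. The intrinsic value of a policy is its epistemic value or affordance [39]. This is just the expected information gain afforded by a particular policy, which can be about hidden states (i.e., salience) or model parameters (i.e., novelty). It is this term that underwrites artificial curiosity [148].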
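The relation between risk over states and risk over outcomes discussed above can be checked numerically. The sketch below is a simplified check under stated assumptions: the likelihood A is treated as known (parameters are omitted), preferences over outcomes are taken to be the marginal of preferences over states under A, and all numbers and helper names are invented for illustration. Under these assumptions, risk over states equals risk over outcomes plus the expected evidence bound, so the two risks agree when that bound is small.

```python
import math

def kl(q, p):
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def predict_outcomes(A, x):
    """Marginal over outcomes: sum_s A[o][s] * x[s]."""
    return [sum(A[o][s] * x[s] for s in range(len(x))) for o in range(len(A))]

def posterior_over_states(A, x, o):
    """Bayes rule for states given outcome o, under state beliefs x."""
    joint = [A[o][s] * x[s] for s in range(len(x))]
    z = sum(joint)
    return [j / z for j in joint]

A = [[0.9, 0.2],
     [0.1, 0.8]]          # A[o][s] = P(o | s), hypothetical
q_s = [0.7, 0.3]          # predicted states under a policy
p_s = [0.4, 0.6]          # preferred states

q_o = predict_outcomes(A, q_s)   # predicted outcomes
p_o = predict_outcomes(A, p_s)   # preferred outcomes (assumed marginal)

risk_states = kl(q_s, p_s)
risk_outcomes = kl(q_o, p_o)
evidence_bound = sum(
    q_o[o] * kl(posterior_over_states(A, q_s, o),
                posterior_over_states(A, p_s, o))
    for o in range(len(A)))
```

The ambiguity term is the same in both expressions, so it is left out of the comparison.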
Intrinsic value is also known as intrinsic motivation in neurorobotics [144,149,150], the value of information in economics [151], salience in the visual neurosciences and (rather confusingly) Bayesian surprise in the visual search literature [152][153][154]. In terms of information theory, intrinsic value is mathematically equivalent to the expected mutual information between hidden states in the future and their consequences, consistent with the principles of minimum redundancy or maximum efficiency [104,105,155]. Finally, from a statistical perspective, maximising intrinsic value (i.e., salience and novelty) corresponds to optimal Bayesian design [156] and machine learning derivatives, such as active learning [157]. On this view, active learning is driven by novelty; namely, the information gain afforded to model parameters, given future states and their outcomes. Heuristically, this curiosity resolves uncertainty about "what would happen if I did that" [146]. Figure 4 illustrates the compass of expected free energy, in terms of its special cases, ranging from optimal Bayesian design through to Bayesian decision theory.
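The equivalence between intrinsic value about states and mutual information can be verified with a small sketch. It assumes a known likelihood matrix and a single time step; the numbers and the function name are hypothetical.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def salience(q_s, A):
    """Return expected information gain about states, computed two ways:
    (i) the expected KL divergence from prior to posterior state beliefs, and
    (ii) the mutual information H[Q(o)] - E_Q(s)[H[P(o|s)]]."""
    n_o, n_s = len(A), len(q_s)
    q_o = [sum(A[o][s] * q_s[s] for s in range(n_s)) for o in range(n_o)]
    info_gain = 0.0
    for o in range(n_o):
        post = [A[o][s] * q_s[s] / q_o[o] for s in range(n_s)]
        info_gain += q_o[o] * sum(
            post[s] * math.log(post[s] / q_s[s])
            for s in range(n_s) if post[s] > 0)
    conditional = sum(q_s[s] * entropy([A[o][s] for o in range(n_o)])
                      for s in range(n_s))
    mutual_info = entropy(q_o) - conditional
    return info_gain, mutual_info

q_s = [0.5, 0.5]
informative = [[0.9, 0.2], [0.1, 0.8]]   # outcomes discriminate between states
flat = [[0.5, 0.5], [0.5, 0.5]]          # outcomes carry no information

gain_informative, mi_informative = salience(q_s, informative)
gain_flat, mi_flat = salience(q_s, flat)
```

An uninformative likelihood mapping yields zero salience, so a policy leading to such states offers nothing to a curious agent, whatever its extrinsic value.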

Learning
In active inference, learning concerns the dynamics of synaptic plasticity, which are thought to encode beliefs about the contingencies of the environment [43] (e.g., beliefs about B, in some settings, are thought to be encoded in recurrent excitatory connections in the prefrontal cortex [46]). The fact that beliefs about matrices (e.g., A, B) may be encoded in synaptic weights conforms to connectionist models of brain function, as it offers a convenient way to compute probabilities, in the sense that the synaptic weights could be interpreted as performing matrix multiplication as in artificial neural networks, to predict, e.g., outcomes from beliefs about states, using the likelihood matrix A.

Figure 4: Expected free energy. This figure illustrates the various ways in which minimising expected free energy can be unpacked (omitting model parameters for clarity). The upper panel casts action and perception as the minimisation of variational and expected free energy, respectively. Crucially, active inference introduces beliefs over policies that enable a formal description of planning as inference [1,158,159]. In brief, posterior beliefs about hidden states of the world, under plausible policies, are optimised by minimising a variational (free energy) bound on log evidence. These beliefs are then used to evaluate the expected free energy of allowable policies, from which actions can be selected [43]. Crucially, expected free energy subsumes several special cases that predominate in the psychological, machine learning and economics literature. These special cases are disclosed when one removes particular sources of uncertainty from the implicit optimisation problem. For example, if we ignore prior preferences, then the expected free energy reduces to information gain [112,156] or intrinsic motivation [144,149,150]. This is mathematically the same as expected Bayesian surprise and mutual information that underwrite salience in visual search [152,154] and the organisation of our visual apparatus [103][104][105][155]. If we now remove risk but reinstate prior preferences, one can effectively treat hidden and observed (sensory) states as isomorphic. This leads to risk sensitive policies in economics [143,160] or KL control in engineering [142]. Here, minimising risk corresponds to aligning predicted outcomes to preferred outcomes. If we then remove ambiguity and relative risk of action (i.e., intrinsic value), we are left with extrinsic value or expected utility in economics [161] that underwrites reinforcement learning and behavioural psychology [76]. Bayesian formulations of maximising expected utility under uncertainty are also known as Bayesian decision theory [147]. Finally, if we just consider a completely unambiguous world with uninformative priors, expected free energy reduces to the negative entropy of posterior beliefs about the causes of data, in accord with the maximum entropy principle [162]. The expressions for variational and expected free energy correspond to those described in the main text (omitting model parameters for clarity). They are arranged to illustrate the relationship between complexity and accuracy, which become risk and ambiguity, when considering the consequences of action. This means that risk-sensitive policy selection minimises expected complexity or computational cost. The coloured dots above the terms in the equations correspond to the terms that constitute the special cases in the lower panels.
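These synaptic dynamics (e.g., long-term potentiation and depression) evolve at a slower timescale than action and perception, which is consistent with the fact that such inferences need evidence accumulation over multiple state-outcome pairs. For simplicity, we will assume the only variable that is learned is A, but what follows generalises to more complex generative models (c.f., Appendix A.1). Learning A means that approximate posterior beliefs about A follow a gradient descent on variational free energy. In what follows, a denotes the Dirichlet parameters of the prior P(A) and $\mathbf{a}$ those of the approximate posterior Q(A). Seeing the variational free energy (5) as a function of $\mathbf{a}$ (the sufficient statistic of Q(A)) we can write:

$$F(\mathbf{a}) = D_{KL}[Q(A) \,\|\, P(A)] - \sum_{\tau=1}^{t} \mathbb{E}_{Q(s_\tau) Q(A)}\big[o_\tau \cdot \log(A)\, s_\tau\big] + \dots = D_{KL}[Q(A) \,\|\, P(A)] - \sum_{\tau=1}^{t} o_\tau \cdot \overline{\log A}\; \mathbf{s}_\tau + \dots \tag{17}$$

where $\overline{\log A} = \mathbb{E}_{Q(A)}[\log A]$ and $\mathbf{s}_\tau$ is the vector of state probabilities under $Q(s_\tau)$. Here, we ignore the terms in (5) that do not depend on Q(A), as these will vanish when we take the gradient. The KL-divergence between Dirichlet distributions is [163,164]:

$$D_{KL}[Q(A) \,\|\, P(A)] = \sum_{i} \Big[ \log \Gamma(\mathbf{a}_{0i}) - \sum_{k} \log \Gamma(\mathbf{a}_{ki}) - \log \Gamma(a_{0i}) + \sum_{k} \log \Gamma(a_{ki}) + (\mathbf{a}_{\bullet i} - a_{\bullet i}) \cdot \overline{\log A}_{\bullet i} \Big] \tag{18}$$

where i indexes the columns of A, $a_{0i} = \sum_k a_{ki}$, Γ is the gamma function and $\overline{\log A}_{ki} = \psi(\mathbf{a}_{ki}) - \psi(\mathbf{a}_{0i})$ with ψ the digamma function. Incorporating (18) in (17), we can take the gradient of the variational free energy with respect to $\overline{\log A}$:

$$\partial_{\overline{\log A}} F(\mathbf{a}) = \mathbf{a} - a - \sum_{\tau=1}^{t} o_\tau \otimes \mathbf{s}_\tau \tag{19}$$

where ⊗ is the Kronecker product. This means that the dynamics of synaptic plasticity follow a descent on (19):

$$\dot{\mathbf{a}}(t) = -\partial_{\overline{\log A}} F(\mathbf{a}) = -\mathbf{a}(t) + a + \sum_{\tau=1}^{t} o_\tau \otimes \mathbf{s}_\tau \tag{20}$$

In computational terms, these are the dynamics for evidence accumulation of Dirichlet parameters at time t. Since synaptic plasticity dynamics occur at a much slower pace than perceptual inference, it is computationally much cheaper, in numerical simulations, to do a one-step belief update at the end of each trial of observation epochs. Explicitly, setting the free energy gradient to zero at the end of the trial gives the following update for Dirichlet parameters:

$$\mathbf{a} = a + \sum_{\tau=1}^{t} o_\tau \otimes \mathbf{s}_\tau \tag{21}$$

After which, the prior beliefs P(A) are updated to the approximate posterior beliefs Q(A) for the subsequent trial. Note that this update scheme is formally identical to associative or Hebbian plasticity.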
As one can see, the learning rule concerning accumulation of Dirichlet parameters (c.f., (21)) means that the agent becomes increasingly confident about its likelihood matrix by receiving new observations (since the matrix that is added onto a at each timestep has non-negative entries). This is fine as long as the structure of the environment remains relatively constant. In the next section, we will see how Bayesian model reduction can reverse this process, to enable the agent to adapt quickly to a changing environment. Table 3 summarises the belief updating entailed by active inference, and Figure 5 indicates where particular computations might be implemented in the brain.
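The end-of-trial accumulation of Dirichlet parameters can be sketched as follows. The prior counts, outcomes and state beliefs are hypothetical, outcomes are one-hot vectors, and the function names are illustrative.

```python
def outer(o, s):
    """Outer product of an outcome vector and a state-belief vector."""
    return [[oi * sj for sj in s] for oi in o]

def update_dirichlet(a_prior, outcomes, state_beliefs):
    """Add, for every time step, the outer product of the observed outcome
    and the beliefs about the hidden state to the prior Dirichlet counts."""
    a_post = [row[:] for row in a_prior]
    for o, s in zip(outcomes, state_beliefs):
        for i, row in enumerate(outer(o, s)):
            for j, value in enumerate(row):
                a_post[i][j] += value
    return a_post

def expected_likelihood(a):
    """Normalise each column of counts to obtain the expected matrix A."""
    n_o, n_s = len(a), len(a[0])
    col_sums = [sum(a[o][s] for o in range(n_o)) for s in range(n_s)]
    return [[a[o][s] / col_sums[s] for s in range(n_s)] for o in range(n_o)]

a_prior = [[1.0, 1.0],
           [1.0, 1.0]]                       # flat prior counts
outcomes = [[1, 0], [1, 0], [0, 1]]          # one-hot observations over a trial
state_beliefs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]

a_post = update_dirichlet(a_prior, outcomes, state_beliefs)
A_expected = expected_likelihood(a_post)
```

Counts only ever increase under this rule, which is why the expected likelihood becomes progressively more resistant to new evidence as trials accumulate.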
Table 3: Summary of belief updating.

| Process | Computation | Equations |
| Perception | $s_{\pi\tau} = \sigma(v)$ | (9) |
| Planning | $G(\pi)$ | (43), (44) |
| Action selection | $a_t = \arg\max_a \sum_\pi \delta_{a,\pi_t} Q(\pi)$ | (11) |
| Policy-independent state-estimation | $Q(s_\tau) = \sum_\pi Q(s_\tau \mid \pi) Q(\pi)$ | (12) |
| Learning (end of trial) | $\mathbf{a} = a + \sum_\tau o_\tau \otimes \mathbf{s}_\tau$ | (21) |

Structure learning

In the previous sections, we have addressed how an agent performs inference over different variables at different timescales in a biologically plausible fashion, which we equated to perception, planning and decision-making. In this section, we consider the problem of learning the form or structure of the generative model.
The idea here is that agents are equipped (e.g., born) with an innate generative model that entails fundamental preferences (e.g., essential to survival), which are not updated. For instance, humans are born with prior preferences about their body temperature being around 37 °C and their cardiac frequency lying within a certain range. Mathematically, this means that the parameters of these innate prior distributions, encoding the agent's expectations as part of its generative model, have hyperpriors that are infinitely precise (e.g., a Dirac delta distribution) and thus cannot be updated in an experience-dependent fashion. The agent's generative model then naturally evolves by minimising variational free energy to become a good model of the agent's environment but is still constrained by the survival preferences hardcoded within it. This process of learning the generative model (i.e., the variables and their functional dependencies) is called structure learning.
Structure learning in active inference is an active area of research. Active inference proposes that the agent's generative model evolves over time to maximise the evidence for its observations. However, a complete set of mechanisms that biological agents use to do so has not yet been laid out. Nevertheless, we use this section to summarise two complementary approaches, namely Bayesian model reduction and Bayesian model expansion [3,171,172,173], that enable one to simplify and complexify the model, respectively.

Bayesian model reduction
To explain the causes of their sensations, agents must compare different hypotheses about how their sensory data are generated, and retain the hypothesis or model that is the most valid in relation to their observations (i.e., has the greatest model evidence). In Bayesian statistics, these processes are called Bayesian model comparison and Bayesian model selection; these correspond to scoring the evidence for various generative models in relation to available data and selecting the one with the highest evidence [174,175]. Bayesian model reduction (BMR) is a particular instance of structure learning, which formalises post-hoc hypothesis testing to simplify the generative model. This precludes redundant explanations of sensory data and ensures the model generalises to new data. Technically, it involves estimating the evidence for simpler (reduced) priors over the latent causes and selecting the model with the highest evidence. This process of simplifying the generative model, by removing certain states or parameters, has a clear biological interpretation in terms of synaptic decay and switching off certain synaptic connections.

Figure 5: Here, a visual observation is sampled by the retina, aggregated in first-order sensory thalamic nuclei and processed in the occipital (visual) cortex. The green arrows correspond to message passing of sensory information. This signal is then propagated (via the ventral visual pathway) to inferior and medial temporal lobe structures such as the hippocampus; this allows the agent to go from observed outcomes to beliefs about their most likely causes in state-estimation (perception), which is performed locally. The variational free energy is computed in the striatum. The orange arrows encode message passing of beliefs. Preferences C are attributed to the dorsolateral prefrontal cortex, which is thought to encode representations over prolonged temporal scales [44], consistent with the fact that these are likely to be encoded within higher cortical areas [3]. The expected free energy is computed in the medial prefrontal cortex [43] during planning, which leads to inferences about most plausible policies (decision-making) in the basal ganglia, consistent with the fact that the basal ganglia is thought to underwrite planning and decision-making [165][166][167][168][169][170]. The message concerning policy selection is sent to the motor cortex via thalamocortical loops. The most plausible action, which is selected in the motor cortex, is passed on through the spinal cord to trigger a limb movement. Simultaneously, policy-independent state-estimation is performed in the ventrolateral prefrontal cortex, which leads to synaptic plasticity dynamics in the prefrontal cortex, where the synaptic weights encode beliefs about A.
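To keep things concise, let ν represent a hidden variable in the generative model that is optimised during learning (e.g., A), and o = o_1:t a sequence of observations. The current model has a prior P(ν) and we would like to test whether a reduced (i.e., less complex) prior $\tilde{P}(\nu)$ can provide a more parsimonious explanation for the observed outcomes. Using Bayes rule, we have the following identities:

$$P(\nu) P(o \mid \nu) = P(\nu \mid o) P(o) \tag{22}$$

$$\tilde{P}(\nu) P(o \mid \nu) = \tilde{P}(\nu \mid o) \tilde{P}(o) \tag{23}$$

where $P(o) = \int P(o \mid \nu) P(\nu)\, d\nu$ and $\tilde{P}(o) = \int P(o \mid \nu) \tilde{P}(\nu)\, d\nu$. Dividing (22) by (23) yields

$$\frac{P(\nu)}{\tilde{P}(\nu)} = \frac{P(\nu \mid o) P(o)}{\tilde{P}(\nu \mid o) \tilde{P}(o)} \tag{24}$$

We can then use (24) in order to obtain the following relations:

$$1 = \int \tilde{P}(\nu \mid o)\, d\nu = \frac{P(o)}{\tilde{P}(o)} \int \frac{\tilde{P}(\nu)}{P(\nu)} P(\nu \mid o)\, d\nu = \frac{P(o)}{\tilde{P}(o)}\, \mathbb{E}_{P(\nu \mid o)}\left[\frac{\tilde{P}(\nu)}{P(\nu)}\right] \tag{25}$$

$$\Rightarrow \log \tilde{P}(o) - \log P(o) = \log \mathbb{E}_{P(\nu \mid o)}\left[\frac{\tilde{P}(\nu)}{P(\nu)}\right] \tag{26}$$

We can approximate the posterior term in the expectation of (26) with the corresponding approximate posterior Q(ν), which simplifies the computation. This allows us to compare the evidence of the two models (reduced and full) and select the best. If the reduced model has more evidence, it implies the current model is too complex, and redundant parameters can be removed by adopting the new priors.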
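For a single column of A with Dirichlet prior and posterior, the expectation in the log evidence difference has a closed form in terms of multivariate beta functions. The sketch below illustrates this under stated assumptions: the counts are invented, the posterior is taken to be the exact Dirichlet posterior after observing the counts n, and the helper names are hypothetical. It also checks the result against the evidence computed directly from the Dirichlet-categorical model.

```python
import math

def log_beta(alpha):
    """Log of the multivariate beta function of a Dirichlet parameter vector."""
    return sum(math.lgamma(x) for x in alpha) - math.lgamma(sum(alpha))

def delta_log_evidence(a_prior, a_post, a_reduced):
    """Log evidence of the reduced model minus that of the full model,
    using only the full prior, the full posterior and the reduced prior.
    Also returns the posterior under the reduced prior."""
    a_reduced_post = [p + r - f for p, r, f in zip(a_post, a_reduced, a_prior)]
    delta = (log_beta(a_prior) - log_beta(a_reduced)
             + log_beta(a_reduced_post) - log_beta(a_post))
    return delta, a_reduced_post

# Hypothetical example: three outcomes, the third never observed.
a_prior = [1.0, 1.0, 1.0]
n = [8.0, 1.0, 0.0]                        # observed counts
a_post = [a + c for a, c in zip(a_prior, n)]
a_reduced = [1.0, 1.0, 0.1]                # reduced prior: third outcome nearly pruned

delta, a_reduced_post = delta_log_evidence(a_prior, a_post, a_reduced)

# Direct computation of both log evidences, for comparison.
log_ev_full = log_beta(a_post) - log_beta(a_prior)
log_ev_reduced = log_beta([r + c for r, c in zip(a_reduced, n)]) - log_beta(a_reduced)
```

A positive difference favours the reduced prior; here the never-observed outcome makes the simpler model the better explanation of the data, so its prior would be adopted for the next trial.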
In conclusion, BMR allows for computationally efficient and biologically plausible hypothesis testing, to find simpler explanations for the data at hand. It has been used to emulate sleep and reflection in abstract rule learning [3], by simplifying the prior over A at the end of each trial; this has the additional benefit of preventing the agent from becoming overconfident.

Bayesian model expansion
Bayesian model expansion is complementary to Bayesian model reduction. It entails adopting a more complex generative model (by adding, e.g., more states) if, and only if, the gain in accuracy in (3) is sufficient to outweigh the increase in complexity. This model expansion allows for generalisation and concept learning in active inference [172]. Note that additional states need not always lead to a more complex model. It is in principle possible to expand a model in such a way that complexity decreases, as many state estimates might be able to remain close to their priors in place of a small number of estimates moving a lot. This 'shared work' by many parameters could lead to a simpler model.
From a computational perspective, concept acquisition can be seen as a type of structure learning [179,180] that can be emulated through Bayesian model comparison. Recent work on concept learning in active inference [172] shows that a generative model equipped with extra (latent) hidden states can engage these 'unused' hidden states when an agent is presented with novel stimuli during the learning process. Initially the corresponding likelihood mappings (i.e., the corresponding columns of A) are uninformative, but these are updated when the agent encounters new observations that cannot be accounted for by its current knowledge (e.g., observing a cat when it has only been exposed to birds). This happens naturally, during the learning process, in an unsupervised way through free energy minimisation. To allow for effective generalisation, this approach can be combined with BMR, in which any new concept can be aggregated with similar concepts, and the associated likelihood mappings can be reset for further concept acquisition, in favour of a simpler model with higher model evidence. This approach can be further extended by updating the number of extra hidden states through a process of Bayesian model comparison.

Discussion
Due to the various recent theoretical advances in active inference, it is easy to lose sight of its underlying principle, process theory and practical implementation. We have tried to address this by rehearsing, in a clear and concise way, the assumptions underlying active inference as a principle, the technical details of the process theory for discrete state-space generative models and the biological interpretation of the accompanying neuronal dynamics. It is useful to clarify these results: as a first step towards outstanding theoretical research challenges, as a practical guide to implementing active inference to simulate experimental behaviour, and as a pointer towards various predictions that may be tested empirically.
Active inference offers a degree of plausibility as a process theory of brain function. From a theoretical perspective, its requisite neuronal dynamics correspond to known empirical phenomena and extend earlier theories like predictive coding [63,122,123]. Furthermore, the process theory is consistent with the underlying free energy principle, which biological systems are thought to abide by; namely, the avoidance of surprising states. This can be articulated formally based on fundamental assumptions about biological systems [68,69]. Lastly, the process theory has a degree of face validity as its predicted electrophysiological responses closely resemble empirical measurements.
However, for a full endorsement of the process theory presented in this paper, rigorous empirical validation of the synthetic electrophysiological responses is needed. To pursue this, one would have to specify the generative model that a biological agent employs for a particular task. This can be done through Bayesian model comparison of alternative generative models with respect to the empirical (choice) behaviour being measured (e.g., [181]). Once the appropriate generative model is formulated, evidence for plausible but distinct implementations of active inference would need to be compared. These implementations come from various possible approximations to the free energy [84,85,117], each of which yields different belief updates and simulated electrophysiological responses. Note that the marginal approximation to the free energy currently stands as the most biologically plausible [84]. From this, the explanatory power of active inference can be assessed in relation to empirical measurements and contrasted with other existing theories.
This means that the key challenge for active inference, and arguably data analysis in general, is finding the generative model that best explains observable data (i.e., evidence maximising). A solution to this problem would enable one to find the generative model entailed by an agent by observing its behaviour. In turn, this would enable one to simulate its belief updating and behaviour accurately in silico. It should be noted that these generative models can be specified manually for the purposes of reproducing simple behaviour (e.g., agents performing the simple tasks needed for the empirical validation discussed above). However, a generic solution to this problem is necessary to account for complex datasets; in particular, complex behavioural data from agents in a real environment. Moreover, a biologically plausible solution to this problem could correspond to a complete structure learning roadmap, accounting for how biological agents evolve their generative model to account for new observations. Evolution has solved this problem by selecting phenotypes with a good model of their sensory data. Therefore, understanding the processes that have selected generative models that are fit for purpose for our environment might lead to important advances in structure learning and data analysis.
Discovering new generative models corresponding to complex behavioural data will require extending the current process theory to these models, in order to provide testable predictions and reproduce the observed behaviour in silico. Examples of generative models that do not fall within the discrete state-space, continuous state-space [8,18,30,182,183,184,185] or mixed [27][28][29] models currently implemented in active inference include Markov decision trees [74,186] and Boltzmann machines [77,187,188].
One challenge that may arise, when scaling active inference to complex models with many degrees of freedom, will be the size of the policy trees in consideration. Although effective and biologically plausible, the current pruning strategy is unlikely to reduce the search space sufficiently to enable tractable inference in such cases. As noted above, the issue of scaling active inference may yield to the first principles of the variational free energy formulation. Specifically, generative models with a high evidence are minimally complex. This suggests that 'scaling up', in and of itself, is not the right strategy for reproducing more sophisticated or deep behaviour. A more principled approach would be to explore the right kind of factorisations necessary to explain structured behaviour. Key candidates here are deep temporal or diachronic generative models that have a separation of timescales. This form of factorisation (c.f., mean field approximation) replaces deep decision trees with shallow decision trees that are hierarchically composed.
To summarise, we argue that some important challenges for theoretical neuroscience include finding process theories of brain function that comply with active inference as a principle [68,69]; namely, the avoidance of surprising events. The outstanding challenge is then to explore and fine grain such process theories, via Bayesian model comparison (e.g., using dynamic causal modelling [58,189]) in relation to experimental data. From a structure learning and data analysis perspective, the main challenge is finding the generative model with the greatest evidence in relation to available data. This may be achieved by understanding the processes evolution has selected for creatures with a good model of their environment. Finally, to scale active inference to behaviour with many degrees of freedom, one needs to understand how biological agents effectively search deep policy trees when planning into the future, when many possible policies may be entertained at separable timescales.

Conclusion
In conclusion, this paper aimed to summarise the assumptions underlying active inference, the technical details underwriting its process theory, and how the associated neuronal dynamics relate to known biological processes. These processes underwrite action, perception, planning, decision-making, learning and structure learning, which we have illustrated under discrete state-space generative models. We have discussed some important outstanding challenges: from a broad perspective, the challenge for theoretical neuroscience is to develop increasingly fine-grained mechanistic models of brain function that comply with the core tenets of active inference [68,69]. With regard to the process theory, key challenges relate to experimental validation, understanding how biological organisms evolve their generative model to account for new sensory observations and how they effectively search large policy spaces when planning into the future.
The variational free energy, after having observed o_{1:t}, is computed analogously to equation (5). The process of finding the belief dynamics is then akin to section 8; we rehearse it in the following. Selecting only those terms in the variational free energy that depend on B and D yields: Using the form of the KL divergence for Dirichlet distributions (18) and taking the gradients yields where ⊗ denotes the Kronecker product. Finally, it is possible to specify neuronal plasticity dynamics following a descent on (30), (31), which correspond to biological dynamics. Alternatively, we have belief update rules implemented once after each trial of observation epochs in in-silico agents:

A.2 Complexifying the prior over policies
In this paper, we have considered a simple approximate posterior over policies; namely, σ(−G(π)). This can be extended to σ(−γG(π)), where γ is an (inverse) temperature parameter that denotes the confidence in selecting a particular policy. This extension is quite natural in the sense that γ can be interpreted as the postsynaptic response to dopaminergic input [34,35]. This correspondence is supported by empirical evidence [36] and enables one to simulate biologically plausible dopaminergic discharges (c.f., Appendix E [43]). Anatomically, this parameter may be encoded within the substantia nigra, in nigrostriatal dopamine projection neurons [36], which maps well onto our proposed functional anatomy (c.f., Figure 5), since the substantia nigra is connected with the striatum. We refer the reader to [43] for a discussion of the associated belief updating scheme.
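The effect of the precision parameter can be illustrated numerically. The following is a minimal sketch rather than the belief updating scheme of [43]: the expected free energy values in `G` are made up for illustration, and `policy_posterior` is a hypothetical helper name.

```python
import math

def softmax(xs):
    """Numerically stable softmax of a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def policy_posterior(G, gamma=1.0):
    """Return sigma(-gamma * G) for a list G of expected free energies."""
    return softmax([-gamma * g for g in G])

# Illustrative (made-up) expected free energies for three policies.
G = [1.0, 1.5, 3.0]
for gamma in (0.0, 1.0, 4.0):
    # gamma = 0 gives a uniform distribution; larger gamma concentrates
    # probability on the policy with the lowest expected free energy.
    print(gamma, [round(p, 3) for p in policy_posterior(G, gamma)])
```

With γ = 0 all policies are equally likely; as γ grows, the distribution sharpens around the policy with the lowest expected free energy, which is the sense in which γ encodes confidence in policy selection.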

A.3 Multiple state and outcome modalities
In general, one needs not one but many hidden state and outcome factors to represent the environment. Intuitively, this happens in the human brain as we integrate sensory stimuli from our five (or more) distinct senses. Mathematically, we can express this via different streams of hidden states (usually referred to as hidden factors) that evolve independently of one another but interact to generate outcomes at each time step; e.g., see Figure 9 [74] for a graphical representation of a multi-factorial hidden Markov model. This means that A becomes a multi-dimensional tensor that integrates information about the different hidden factors to cause outcomes. The belief updating is analogous in this case, provided that one assumes a mean-field factorisation of the approximate posterior over the different hidden state factors (see, e.g., [5,42]). This means that the beliefs about states may be processed in a manner analogous to Figure 5, invoking a greater number of neural populations.
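To make the tensor form of A concrete, the following minimal sketch contracts a likelihood tensor with mean-field beliefs over two hidden factors to obtain predicted outcomes. The factor names ("what", "where"), the numerical values and the helper `predict_outcomes` are hypothetical and chosen only for illustration; a single outcome modality is assumed.

```python
def predict_outcomes(A, q1, q2):
    """Contract a likelihood tensor A[o][i][j] = P(o | s1=i, s2=j)
    with factorised beliefs q1 (over factor 1) and q2 (over factor 2)."""
    return [
        sum(A[o][i][j] * q1[i] * q2[j]
            for i in range(len(q1)) for j in range(len(q2)))
        for o in range(len(A))
    ]

# Hypothetical example: two hidden factors with two values each,
# and one outcome modality with two possible outcomes.
A = [
    [[0.9, 0.5], [0.1, 0.5]],  # P(o=0 | s1, s2)
    [[0.1, 0.5], [0.9, 0.5]],  # P(o=1 | s1, s2)
]
q_what = [0.8, 0.2]    # beliefs about factor 1
q_where = [0.5, 0.5]   # beliefs about factor 2

predicted = predict_outcomes(A, q_what, q_where)
print(predicted)  # a probability distribution over the two outcomes
```

Under the mean-field assumption, each factor's beliefs enter the contraction separately, so adding a factor adds one index to A and one belief vector, without changing the form of the computation.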

A.4 Deep temporal models
A deep temporal model is a generative model with many layers that are nested hierarchically and act at different timescales. These were first introduced within active inference in [3]. One can picture them graphically as a POMDP (c.f., Figure 2) at the higher level where each outcome is replaced by a POMDP at the lower level, and so forth.
There is a useful metaphor for understanding the concept underlying deep temporal models: each layer of the model corresponds to the hand of a clock. In a two-layer hierarchical model, a ticking (resp. rotation) of the faster hand corresponds to a time step (resp. trial of observation epochs) at the lower level. At the end of each trial at the lower level, the slower hand ticks once, which corresponds to a time step at the higher level, and the process unfolds again.
One can concisely summarise this by saying that a state at the higher level corresponds to a trial of observation epochs at the lower level. Of course, there is no limit to the number of layers one can stack in a hierarchical model.
To obtain the associated belief updating, one computes free energy at the lower level by conditioning the probability distributions from Bayes' rule on the variables from the higher levels. This means that one performs belief updating at the lower levels independently of the higher levels. Then, one computes the variational free energy at the higher levels by treating the lower levels as outcomes. For more details on the specificities of the scheme see [3].
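The clock metaphor can be sketched as two nested loops. This is only an illustration of the separation of timescales, not the belief updating scheme of [3]; the function name and step counts are hypothetical.

```python
def run_two_level_clock(n_high_steps, n_low_steps):
    """Return (high_step, low_step) pairs in the order they unfold."""
    schedule = []
    for high in range(n_high_steps):      # slow hand: one tick per lower-level trial
        for low in range(n_low_steps):    # fast hand: one tick per time step
            schedule.append((high, low))
    return schedule

# Two higher-level time steps, each spanning a lower-level trial of three steps.
schedule = run_two_level_clock(n_high_steps=2, n_low_steps=3)
print(schedule)
```

Each value of the outer index plays the role of one state at the higher level, which spans a whole trial of observation epochs at the lower level.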

B Expected free energy
At the heart of active inference is a description of a certain class of systems at non-equilibrium steady-state (NESS) [68,69]. An important consequence of NESS is the existence of a steady-state probability distribution P(s_τ, A) that the agent is guaranteed to reach given a sufficient amount of time. Intuitively, this distribution should be thought of as the agent's preferences over states and model parameters. Practically, this means that the agent selects policies such that its predicted states Q(s_τ, A) at some future time point τ > t (usually, the time horizon T of a policy) reach its preferences P(s_τ, A), which are specified by the generative model. In the following, we will show how a specific family of distributions Q(π) guarantees that an agent reaches its preferences. Then, we will see how NESS in fact enables us to extract one single canonical member of this family: the (softmax negative) expected free energy.
Objective: we seek distributions over policies that imply steady-state solutions; i.e., when the final distribution does not depend upon initial observations. Such solutions ensure that, on average, stochastic policies lead to a steady-state or target distribution specified by the generative model.
are equal on average under Q if and only if the system reaches steady-state. Explicitly: Here, β ≥ 0 characterises the steady-state with the relative precision (i.e., negative entropy) of policies and final outcomes, given final states. The generative model stipulates steady-state, in the sense that the distribution over final states (and outcomes) does not depend upon initial observations. Here, the generative and predictive distributions simply express the conditional independence between policies and final outcomes, given final states. Note that when β = 1, Gibbs energy becomes expected free energy.
Proof. Let us unpack the Gibbs energy expected under Q: And the result is immediate.
A straightforward consequence of Lemma 1 is that each distribution describes a certain kind of system that self-organises to some steady-state distribution. This family of distributions has interesting interpretations: for example, the case β = 0 corresponds to standard stochastic control, variously known as KL control or risk-sensitive control [142]: In other words, one chooses policies that minimise the KL divergence between the predictive and target distribution. More generally, when β > 0, policies are more likely when they simultaneously minimise the entropy of outcomes, given states. In other words, β > 0 ensures that the system exhibits itinerant behaviour. One can see that KL control may arise in this case if the entropy of the likelihood mapping remains constant with respect to policies.
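The role of β can be illustrated with a small numerical sketch. It rests on an assumption about the form of the Gibbs energy: for a single future time step, and ignoring beliefs about the parameters A, we take it to be risk plus β times ambiguity, which is consistent with the statements that β = 0 yields KL control and β = 1 yields expected free energy. All distributions below are made up, and `gibbs_energy` is a hypothetical helper.

```python
import math

def kl(q, p):
    """KL divergence between two categorical distributions (p must be positive)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def entropy(p):
    """Shannon entropy of a categorical distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def gibbs_energy(q_states, target, A_columns, beta):
    """Assumed form: risk + beta * ambiguity, with A_columns[k] = P(o | s = k)."""
    risk = kl(q_states, target)
    ambiguity = sum(qk * entropy(col) for qk, col in zip(q_states, A_columns))
    return risk + beta * ambiguity

target = [0.5, 0.5]                      # preferred (steady-state) distribution over states
A_columns = [[0.99, 0.01], [0.5, 0.5]]   # state 0 is unambiguous, state 1 is ambiguous
q_a = [0.5, 0.5]                         # predicted states under policy a
q_b = [0.7, 0.3]                         # predicted states under policy b

for beta in (0.0, 1.0):
    print(beta,
          round(gibbs_energy(q_a, target, A_columns, beta), 4),
          round(gibbs_energy(q_b, target, A_columns, beta), 4))
```

In this example, policy a has the lower energy when β = 0 because it matches the target distribution exactly, whereas policy b has the lower energy when β = 1 because it favours the state whose outcomes are less ambiguous.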
Remark 2. It is possible to extend this framework by considering systems that reach their preferences at a collection of time steps into the future, say τ_1, ..., τ_n > t. In this case, one can adapt the proof of Lemma 1 to obtain: where G(π, τ_i; β) is the Gibbs free energy of Lemma 1, replacing τ by τ_i. In this case, the canonical choice of approximate posterior over policies would be:
One perspective on the distinction between simple and general steady-states is in terms of uncertainty about policies. For example, simple steady-states preclude uncertainty about which policy led to a final state. This would be appropriate for describing classical systems (that follow a unique path of least action), where it would be possible to infer which policy had been pursued, given the initial and final outcomes. Conversely, in general steady-state systems (e.g., mice, Homo sapiens), simply knowing that 'you are here' does not tell me 'how you got here', even if I knew where you were this morning. Put another way, there are lots of paths or policies open to systems that attain a general steady-state.
In active inference, we are interested in a certain class of systems that self-organise to general steady-states; namely, those that move through a large number of probabilistic configurations from their initial state to their final steady-state. The treatments in [68,69] effectively turn the steady-state lemma on its head by assuming NESS is stipulatively true and then characterise the ensuing self-organisation in terms of Bayes optimal policies:
Corollary 3 (Active inference [69]). If a system attains a general steady-state, it will appear to behave in a Bayes optimal fashion, both in terms of optimal Bayesian design (i.e., exploration) and Bayesian decision theory (i.e., exploitation). Crucially, the loss function defining Bayesian risk is the negative log evidence for the generative model entailed by an agent. In short, systems (i.e., agents) that attain general steady-states will look as if they are responding to epistemic affordances [44].
So far, we have deduced the distribution over policies of systems that reach steady-state. However, recall that reaching steady-state is only a consequence of NESS. In fact, NESS dynamics under a Markov blanket (c.f., Figure 1) imply a slightly stronger statement: the most likely trajectories of systems described by active inference are those which minimise expected free energy [68,69]; this is exactly the case β = 1 in (38). This is nice, since many existing theories of cognition and control emerge under this specific imperative (c.f., Figure 4).

C Computing expected free energy
In this appendix, we present the derivations underlying the analytical expression of the expected free energy that is used in spm_MDP_VB_X.m. Following [119], we can re-express the expected free energy in the following form:

C.2 Risk
The risk term of (42) is the KL divergence between predicted states following a particular policy and preferred states. This can be expressed as: where the vector C ∈ R^m encodes preferences over states, P(s_τ) = Cat(C). However, it is also possible to approximate this risk term over states by a risk term over outcomes (c.f., (15

C.3 Novelty
The novelty term of (42) is where The KL divergence between both distributions (c.f., (18)) can be expressed as: where ψ is the digamma function. We now want to make sense of a′. Suppose that at time τ the agent knows the possible outcome j and possible state k as in Q(A|o_τ, s_τ) (c.f., Table 2 for terminology). This means that in this case, beliefs about hidden states correspond to the true state; in other words, 𝐬_τ = s_τ. We can then use the rule of accumulation of Dirichlet parameters to deduce a′ = a + o_τ ⊗ s_τ. In other words, a′_jk = a_jk + 1 and the remaining components are identical. Using the well-known identity: We can use an asymptotic expansion of the digamma function to simplify the expression: Finally, the analytical expression of the novelty term:
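The simplification described above can be checked numerically for a single column of Dirichlet counts. The sketch below makes several assumptions that are ours: the identity used is taken to be ψ(x + 1) = ψ(x) + 1/x, the asymptotic expansion is taken to be ψ(x) ≈ ln x − 1/(2x), and the KL divergence is taken from the updated Dirichlet (one count added to entry j) to the original one. Under these assumptions, the divergence reduces to approximately (1/a_j − 1/a_0)/2, where a_0 is the column sum. The helper names and counts are hypothetical.

```python
import math

def digamma(x):
    """Digamma function for x > 0, using psi(x) = psi(x + 1) - 1/x to shift x
    upwards, followed by a standard asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series

def kl_after_one_count(column, j):
    """Exact KL[Dir(a') || Dir(a)] for one column a of Dirichlet counts,
    where a' equals a with one extra count added to entry j."""
    a0, aj = sum(column), column[j]
    return math.log(a0) - math.log(aj) + digamma(aj + 1.0) - digamma(a0 + 1.0)

def kl_after_one_count_approx(column, j):
    """Approximation obtained from the assumed identity and expansion."""
    a0, aj = sum(column), column[j]
    return 0.5 * (1.0 / aj - 1.0 / a0)

column = [10.0, 20.0, 30.0]   # made-up Dirichlet counts for one hidden state
for j in range(len(column)):
    print(j, kl_after_one_count(column, j), kl_after_one_count_approx(column, j))
```

The approximation is closest when the counts are large, and it shows that entries with few accumulated counts contribute the most novelty.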

Figure 1: Markov blankets in active inference. This figure illustrates the Markov blanket assumption of active inference. A Markov blanket is a set of variables through which states internal and external to the system interact. Specifically, the system must be such that we can partition it into a Bayesian network of internal states µ, external states η, sensory states o and active states u (µ, o and u are often referred to together as particular states), with probabilistic (causal) links in the directions specified by the arrows. All interactions between internal and external states are therefore mediated by the blanket states b (i.e., the sensory and active states). The sensory states represent the sensory information that the body receives from the environment and the active states express how the body influences the environment. This blanket assumption is quite generic, in that it can be reasonably assumed for a brain as well as elementary organisms. For example, when considering a bacillus, the sensory states become the cell membrane and the active states comprise the actin filaments of the cytoskeleton. Under the Markov blanket assumption, together with the assumption that the system persists over time (i.e., possesses a non-equilibrium steady state), a generalised synchrony appears, such that the dynamics of the internal states can be cast as performing inference over the external states (and vice versa) via a minimisation of variational free energy [68,69]. This coincides with existing approaches to inference; i.e., variational Bayes [53,72-74]. This can be viewed as the internal states mirroring external states, via sensory states (e.g., perception), and external states mirroring internal states via active states (e.g., a generalised form of self-assembly, autopoiesis or niche construction). Furthermore, under these assumptions the most likely courses of action can be shown to minimise expected free energy. Note that external states beyond the system should not be confused with the hidden states of the agent's generative model (which model external states). In fact, the internal states are exactly the parameters (i.e., sufficient statistics) encoding beliefs about hidden states and other latent variables, which model external states in a process of variational free energy minimisation. Hidden and external states may or may not be isomorphic. In other words, an agent uses its internal states to represent hidden states that may or may not exist in the external world.

Figure 2: Example of a discrete state-space generative model. Panel 2a specifies the form of the generative model, which is how the agent represents the world. The generative model is a joint probability distribution over (hidden) states, outcomes and other variables that cause outcomes. In this representation, states unfold in time, causing an observation at each time step. The likelihood matrix A encodes the probabilities of state-outcome pairs. The policy π specifies which action to perform at each time step. Note that the agent's preferences may be specified either in terms of states or outcomes. It is important to distinguish between states (resp. outcomes) that are random variables, and the possible values that they can take in S (resp. in O), which we refer to as possible states (resp. possible outcomes). Note that this type of representation comprises a finite number of time steps, actions, policies, states, outcomes, possible states and possible outcomes. In Panel 2b, the generative model is displayed as a probabilistic graphical model [53,71,74,81] expressed in factor graph form [82]. The variables in circles are random variables, while squares represent factors, whose specific forms are given in Panel 2a. The arrows represent causal relationships (i.e., conditional probability distributions). The variables highlighted in grey can be observed by the agent, while the remaining variables are inferred through approximate Bayesian inference (see section 4) and called hidden or latent variables. Active inference agents perform inference by optimising the parameters of an approximate posterior distribution (see section 4). Panel 2c specifies how this approximate posterior factorises under a particular mean-field approximation [83], although other factorisations may be used [84,85]. A glossary of terms used in this figure is available in Table 2. The mathematical yoga of generative models is heavily dependent on Markov blankets. The Markov blanket of a random variable in a probabilistic graphical model is the set of variables that share a common factor with it. Crucially, a variable conditioned upon its Markov blanket is conditionally independent of all other variables. We will use this property extensively (and implicitly) in the text.

Figure 5: Possible functional anatomy. This figure summarises a possible (coarse-grained) functional anatomy that could implement belief updating in active inference. The arrows correspond to message passing between different neuronal populations. Here, a visual observation is sampled by the retina, aggregated in first-order sensory thalamic nuclei and processed in the occipital (visual) cortex. The green arrows correspond to message passing of sensory information. This signal is then propagated (via the ventral visual pathway) to inferior and medial temporal lobe structures such as the hippocampus; this allows the agent to go from observed outcomes to beliefs about their most likely causes in state-estimation (perception), which is performed locally. The variational free energy is computed in the striatum. The orange arrows encode message passing of beliefs. Preferences C are attributed to the dorsolateral prefrontal cortex (which is thought to encode representations over prolonged temporal scales [44]), consistent with the fact that these are likely to be encoded within higher cortical areas [3]. The expected free energy is computed in the medial prefrontal cortex [43] during planning, which leads to inferences about the most plausible policies (decision-making) in the basal ganglia, consistent with the fact that the basal ganglia are thought to underwrite planning and decision-making [165-170]. The message concerning policy selection is sent to the motor cortex via thalamocortical loops. The most plausible action, which is selected in the motor cortex, is passed on through the spinal cord to trigger a limb movement. Simultaneously, policy-independent state-estimation is performed in the ventrolateral prefrontal cortex, which leads to synaptic plasticity dynamics in the prefrontal cortex, where the synaptic weights encode beliefs about A.

Table 2: Glossary of terms and notation.
s_τ: (Hidden) state at time τ. In computations, if s_τ evaluates to the i-th possible state, then interpret it as the i-th unit vector in R^m.
s_{1:t}: Sequence of hidden states s_1, ..., s_t. Random variables over S^t.
o_τ: Outcome at time τ. In computations, if o_τ evaluates to the j-th possible outcome, then interpret it as the j-th unit vector in R^n.
o_{1:t}: Sequence of outcomes o_1, ..., o_t. Random variables over O^t.
Π: Set of all allowable policies; i.e., sequences of actions. Finite subset of U^T.
π: Policy. Random variable over Π, or element of Π depending on context.