Branching Time Active Inference: the theory and its generality

Over the last 10 to 15 years, active inference has helped to explain various brain mechanisms from habit formation to dopaminergic discharge and even modelling curiosity. However, the current implementations suffer from an exponential (space and time) complexity class when computing the prior over all the possible policies up to the time-horizon. Fountas et al (2020) used Monte Carlo tree search to address this problem, leading to impressive results in two different tasks. In this paper, we present an alternative framework that aims to unify tree search and active inference by casting planning as a structure learning problem. Two tree search algorithms are then presented. The first propagates the expected free energy forward in time (i.e., towards the leaves), while the second propagates it backward (i.e., towards the root). Then, we demonstrate that forward and backward propagations are related to active inference and sophisticated inference, respectively, thereby clarifying the differences between those two planning strategies.


Introduction
Active inference is now a compelling explanatory approach in cognitive neuroscience, and significant analyses of biologically realistic implementations in both neural and non-neural communication networks have been conducted. More specifically, active inference extends the free energy principle to generative models with actions (Friston et al, 2016; Da Costa et al, 2020a; Champion et al, 2021b) and can be regarded as a form of planning as inference (Botvinick and Toussaint, 2012). This framework has successfully explained a wide range of neuro-cognitive phenomena, such as habit formation (Friston et al, 2016), Bayesian surprise (Itti and Baldi, 2009), curiosity (Schwartenbeck et al, 2018), and dopaminergic discharges (FitzGerald et al, 2015). It has also been applied to a variety of tasks, such as animal navigation (Fountas et al, 2020), robotic control (Pezzato et al, 2020), the mountain car problem (Çatal et al, 2020), the game of DOOM (Cullen et al, 2018) and the cart pole problem (Millidge, 2019). Many of those applications require planning several steps into the future in order to be solved successfully. However, as explained in more depth in Appendix H, an exhaustive search over all possible sequences of actions quickly becomes intractable, i.e., the number of sequences to explore grows exponentially with the time horizon of planning. Figure 1 illustrates this exponential growth. Exploring only a subset of this exponential number of possible sequences using a tree search therefore becomes a compelling and quite natural alternative.

Figure 1: Illustration of all possible policies up to two time steps in the future when $|U| = 2$. The state at the current time step is denoted by $S_t$. Additionally, each branch of the tree corresponds to a possible policy, and each node $S_I$ is indexed by a multi-index (e.g. $I = (12)$) representing the sequence of actions that led to this state. This should make it clear that for one time step in the future, there are $|U|$ possible policies, after two time steps there are $|U|$ times more policies, and so on until the time-horizon $T$ where there are a total of $|U|^T$ possible policies, i.e., the number of possible policies grows exponentially with the number of time steps for which the agent tries to plan.
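The exponential growth of the policy space can be made concrete with a short sketch. The helper name `enumerate_policies` and the convention of labelling actions from 1 to |U| are illustrative assumptions, not part of the original framework.

```python
from itertools import product


def enumerate_policies(num_actions, horizon):
    """Return every action sequence of length `horizon` over actions 1..num_actions."""
    return list(product(range(1, num_actions + 1), repeat=horizon))


# The four two-step policies over two actions.
print(enumerate_policies(num_actions=2, horizon=2))

# The number of policies is num_actions ** horizon, so it explodes with the horizon.
for horizon in (1, 2, 5, 10, 20):
    print(horizon, 2 ** horizon)
```

Materialising the list is only feasible for very short horizons, which is the reason for exploring a subset of the sequences with a tree search.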
But what exactly is active inference? Imagine a basketball player at the top of the key (i.e., the area just below the net) ready to take a shot. Intuitively, active inference sees the world as a collection of external states such as the positions of the net, the player and the ball. The player (or agent) is equipped with sensors (such as the eyes) which allow for measurements of the external states. The player is also able to perform actions in the world, such as making sudden eye movements or simply unfolding his (or her) arms and legs. Furthermore, it is believed that the agent stores an internal representation of the external states, which we shall refer to as the internal states.
Importantly, the external and internal states are separated from each other by the Markov blanket (Kirchhoff et al, 2018), i.e., the sensory information received and actions taken by the agent. In other words, the external states can only modify the internal states indirectly through the observations (also called sensory information) made by the agent, and the internal states can only modify the external states indirectly through the actions taken by the agent.
More formally, active inference builds on a subfield of Bayesian statistics called variational inference (Fox and Roberts, 2012), in which the true posterior distribution is approximated with a variational distribution. This method provides a way to balance the complexity and accuracy of the posterior distribution. The variational approach is only tractable because some statistical dependencies are ignored during the inference process, i.e., the variational distribution is generally assumed to fully factorise, leading to the well known mean-field approximation:

$$Q(X) = \prod_i Q_i(X_i), \tag{1}$$

where $X$ is the set of all hidden variables of the model and $X_i$ represents the $i$-th hidden variable.

However, as just stated, there is a major bottleneck to scaling up the active inference framework: the number of action sequences grows exponentially with the time-horizon (see Appendix H for details). In the reinforcement learning literature, this explosion is frequently handled using Monte Carlo tree search (MCTS) (Silver et al, 2016; Browne et al, 2012; Schrittwieser et al, 2019). This approach has been applied to active inference in several papers (Fountas et al, 2020; Maisto et al, 2021). Fountas et al (2020) chose to modify the original criterion used during the node selection step in MCTS. This step returns the node that needs to be expanded, and the reinforcement learning community commonly uses the upper confidence bound for trees (UCT) as a selection criterion:

$$UCT_j = \bar{X}_j + 2 C_p \sqrt{\frac{2 \ln n}{n_j}}, \tag{2}$$

where $n$ is the number of times the current (parent) node has been explored; $n_j$ stands for the number of times the $j$-th child node has been explored; $C_p > 0$ is the exploration constant and $\bar{X}_j$ is the average reward received by the $j$-th child, i.e., the sum of all rewards received by the current node and its descendants divided by $n_j$. The child node with the largest $UCT_j$ is selected.
In their paper, Fountas et al (2020) replaced this selection criterion by:

$$U(s, a) = -\tilde{G}(s, a) + C_{explore}\, Q(a|s)\, \frac{1}{1 + N(s, a)}, \tag{3}$$

where $U(s, a)$ indicates the utility of selecting action $a$ in state $s$; $N(s, a)$ is the number of times that action $a$ was explored in state $s$; $C_{explore}$ is an exploration constant equivalent to $C_p$ in the UCT criterion; $Q(a|s)$ is a neural network modelling the posterior distribution over actions, which is trained by minimizing the variational free energy, and $\tilde{G}(s, a)$ is an estimator of the expected free energy (EFE). The EFE is computed from the following equation:

$$\begin{aligned}
G(\pi, \tau) = {} & - \mathbb{E}_{Q(\theta|\pi) Q(s_\tau|\theta,\pi) Q(o_\tau|s_\tau,\theta,\pi)} \big[ \ln P(o_\tau|\pi) \big] \\
& + \mathbb{E}_{Q(\theta|\pi)} \big[ \mathbb{E}_{Q(o_\tau|\theta,\pi)} H(s_\tau|o_\tau,\pi) - H(s_\tau|\pi) \big] \\
& + \mathbb{E}_{Q(\theta|\pi) Q(s_\tau|\theta,\pi)} H(o_\tau|s_\tau,\theta,\pi) - \mathbb{E}_{Q(s_\tau|\pi)} H(o_\tau|s_\tau,\pi),
\end{aligned} \tag{4}$$

where $H(x|y)$ is the entropy of $p(x|y)$. The computation of the EFE is performed by sampling from three distributions whose parameters are predicted by deep neural networks, i.e., the encoder network modelling $Q(s_\tau)$, the decoder network modelling $P(o_\tau|s_\tau)$ and the transition network modelling $P(s_\tau|s_{\tau-1}, a_{\tau-1})$. Note that Equation (2) was developed as a criterion for selecting nodes during planning, such that the selected node minimizes the agent's regret (c.f. Appendix G for additional details). Equation (3) finds its origin in the Predictor Upper Confidence Bound (PUCB) algorithm introduced by Rosin (2010). The idea of the PUCB algorithm is to use contextual information to predict the node to select during planning. Equations (2) and (3) both aim to select the node that minimizes the agent's regret, and can therefore be used interchangeably. However, Equation (3) requires contextual information and a model predicting the node to be selected. Fountas et al (2020) proposed to use the neural network modelling $Q(a|s)$ as a predictor.
This has the advantage of making the predictor very flexible, since neural networks are known to be general function approximators, but neural networks are also expensive to train and lack interpretability.
To avoid the additional complexity brought by the predictor, this paper makes use of (2), which arises from the multi-armed bandit literature (Auer et al, 2002). The idea is to minimise the agent's regret to handle the trade-off between exploration and exploitation at the tree-level in an optimal manner.
A major novelty of our paper is to think about tree search as a dynamical expansion of the generative model, where the past and present are modelled as a partially observable Markov decision process (Sondik, 1971) and the future is modelled by a tree-like generative model. Importantly, our agent treats future states and observations as latent variables over which posterior beliefs are computed, and those beliefs encode the uncertainty of our agent over future states. In contrast, Fountas et al (2020) use a maximum a posteriori (MAP) estimate of the future hidden states while performing MCTS. Lastly, the posterior beliefs held by our agent are computed using variational message passing as presented in (Champion et al, 2021b). In comparison, Fountas et al (2020) perform amortized inference using an encoder network that predicts the mean and variance of the posterior distribution over latent states. Then, during planning, a MAP estimate is used as input for the neural network modelling the temporal transition. All those neural networks are trained using gradient descent on the variational free energy.
Overall, the key contribution of our paper is to use MCTS to expand or grow the probabilistic graphical model, treat future states and observations as latent variables, and do inference using variational message passing. Indeed, the definition in (Champion et al, 2021b) of a general message passing procedure for performing active inference makes it possible to construct graphical active inference models in a modular fashion. In turn, this makes it possible to incrementally expand an active inference model as required of our MCTS procedure. It is this message passing procedure that makes our approach possible. To our knowledge, an approach of this kind has never been studied before.
In the following, we first provide the requisite background concerning Forney factor graphs, variational message passing, active inference, and Monte Carlo tree search in Sections 2, 3, 4, and 5, respectively. Next, Section 6 introduces our method that frames planning using a tree as a form of Bayesian model extension. Using terminology from concurrency theory (Bowman, 2005), we call our new formalism Branching Time Active Inference (BTAI).
In this domain, models of systems based upon sequences of actions (the format of policies) are described as linear time, while models based upon tree and even graph structures are called branching time (Glabbeek, 1990; van Glabbeek, 1993; Bowman, 2005). Importantly, BTAI does not consider the generative model and the tree as two different objects. Instead, BTAI merges those two objects together into a generative model that can be dynamically expanded. For a detailed analysis of the properties of BTAI, the reader is referred to our companion paper (Champion et al, 2021a), which provides an empirical demonstration of the benefits of BTAI over standard active inference (AcI) in the context of a graph navigation task. This companion paper also supplies a theoretical comparison of BTAI and standard AcI based upon a complexity class analysis. Briefly, standard AcI has a space complexity class of $O(|\pi| \times T \times |S|)$, where $|\pi| = |U|^T$ is the number of possible policies, $|U|$ is the number of available actions, $T$ is the time horizon of planning, and $|S|$ is the number of values that the hidden state can take. In contrast, the space complexity class of BTAI is $O([K + t] \times |S|)$, where $t$ is the current (i.e. present) time point, and $K$ is the number of expansions of the tree performed during planning. Importantly, even complex applications such as the game of Go can be solved by expanding only a small number of nodes (Silver et al, 2016; Schrittwieser et al, 2019). Section 6 is followed by Section 7, which explains the connection between our method and the planning strategies used in both active inference and sophisticated inference. Finally, Section 8 concludes this paper and provides ideas for future research.
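To give a feel for the gap between the two complexity classes, the following sketch simply evaluates the two expressions quoted above for some hypothetical problem sizes. The function names and the example values are assumptions chosen for illustration, and the results are only meaningful up to constant factors.

```python
def standard_aci_space(num_actions, horizon, num_states):
    """Evaluate |pi| x T x |S| with |pi| = |U| ** T."""
    return num_actions ** horizon * horizon * num_states


def btai_space(num_expansions, current_time, num_states):
    """Evaluate (K + t) x |S|."""
    return (num_expansions + current_time) * num_states


# Hypothetical sizes: 4 actions, horizon of 10, 100 state values,
# 150 tree expansions performed at time step 5.
print(standard_aci_space(num_actions=4, horizon=10, num_states=100))
print(btai_space(num_expansions=150, current_time=5, num_states=100))
```

The first quantity grows exponentially with the horizon, whereas the second grows linearly with the number of expansions.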

Forney Factor Graphs
A Forney factor graph (Forney, 2001) uses three kinds of nodes. The nodes representing hidden and observed variables are depicted by white and gray circles, respectively, and the distribution's factors are represented using white squares, which are linked to variable nodes by arrows or lines. Arrows are used to connect factors to their target variable, while lines link factors to their predictors. Figure 2 shows an example of a Forney factor graph corresponding to the following generative model:

$$P(O, S) = P_O(O|S)\, P_S(S).$$

Generally, factor graphs only describe the model's structure such as the variables and their dependencies, but do not specify the definition of individual factors. For example, the definitions of $P_O$ and $P_S$ are not given by Figure 2, and additional information is required to remove the ambiguity, e.g., $P_S(S) = \mathcal{N}(S; \mu, \sigma)$ clarifies that $P_S$ is a Gaussian distribution.

Figure 2: The hidden state is represented by a white circle with the variable's name at the center, and the observed variable is depicted similarly but with a gray background. The factors of the generative model are represented by squares with a white background and the factor's name at the center. Finally, arrows connect the factors to their target variable and lines link each factor to its predictor variables.

Variational Message Passing
We now build on Forney factor graphs and provide an overview of the method of Winn and Bishop (2005). For more details, see Champion et al (2021b), which provided a complete derivation of the equations presented below from Bayes' theorem.

Winn and Bishop method
Variational message passing as developed by Winn and Bishop (2005) is an approach for inference based upon the mean-field approximation, which assumes that the posterior fully factorises, i.e.

$$Q(X) = \prod_i Q_i(X_i),$$

where $X$ is the set of all hidden variables of the model and $X_i$ represents the $i$-th hidden variable. In this section, we focus on the intuition behind the method, starting with the update equation of an arbitrary hidden state $x_k$:

$$\ln Q_k^*(x_k) = \big\langle \ln P(x_k | pa_k) \big\rangle_{\sim Q_k} + \sum_{j \in ch_k} \big\langle \ln P(x_j | x_k, cp_{kj}) \big\rangle_{\sim Q_k} + C, \tag{9}$$

where $C$ is a normalizing constant, and $\langle \cdot \rangle_{\sim Q_k}$ is the expectation over all factors but $Q_k(x_k)$. Equation (9) tells us that the optimal posterior of any hidden state $x_k$ only depends on its Markov blanket, i.e., $x_k$'s parents $pa_k$, children $ch_k$ and co-parents $cp_{kj}$. To make (9) more specific, we assume that each random variable of the model is conjugate to its parents (i.e., the posterior has the same functional form as the prior) and is distributed according to a distribution in the exponential family, i.e.,

$$P(x_k | pa_k) = \exp\big( \mu_k(pa_k) \cdot u_k(x_k) + h_k(x_k) + z_k(pa_k) \big), \tag{10}$$

where $\mu_k(pa_k)$, $u_k(x_k)$, $h_k(x_k)$ and $z_k(pa_k)$ are the parameters, the sufficient statistics, the underlying measure and the log partition, respectively. Under those two assumptions, (9) can be re-written as:

$$Q_k^*(x_k) = \exp\big( \mu_k^* \cdot u_k(x_k) + h_k(x_k) + z_k^* \big), \tag{11}$$

$$\mu_k^* = \tilde{\mu}_k\big( \{ \langle u_i(x_i) \rangle \}_{i \in pa_k} \big) + \sum_{j \in ch_k} \tilde{\mu}_{j \to k}\big( \langle u_j(x_j) \rangle, \{ \langle u_l(x_l) \rangle \}_{l \in cp_{kj}} \big), \tag{12}$$

where $\tilde{\mu}_k$ is a re-parameterization of $\mu_k(pa_k)$ in terms of the expectation of the sufficient statistics of the parents of $x_k$, and similarly $\tilde{\mu}_{j \to k}$ is a re-parameterization of $\mu_{j \to k}$. Importantly, $u_k(x_k)$ and $h_k(x_k)$ in the optimal posterior (11) are the same as in the prior (10), and only the parameters have changed according to (12).
Figure 3: This figure illustrates the computation of the optimal posterior parameters as a message passing procedure, which requires the transmission of messages from the parent (m 2 ) and child (m 3 ) factors. Additionally, the message from the child factor (m 3 ) requires the computation of messages from the co-parent (m 4 ) and child (m 5 ) variables. Also, the message from the parent factor (m 2 ) requires the computation of a message (m 1 ) from the parent variable.
To understand the intuition behind (12), let us suppose that we are given the Forney factor graph illustrated in Figure 3 and we wish to compute the posterior of Y . Then, the only parent of Y is Z, the only child of Y is X and the only co-parent of Y with respect to X is W . Therefore, applying (12) to our example leads to the equation presented in Figure 3 whose components can be interpreted as messages. Indeed, each variable (i.e., X, Z and W ) sends the expectation of its sufficient statistics (i.e., a message) to the square node in the direction of Y (i.e., either P X or P Y ). Those messages are then combined using a function (i.e., eitherμ Y orμ X→Y ) whose output (i.e., another set of messages) are summed to obtain the optimal parameters µ * Y . The computation of the optimal parameters (12) can then be understood as a message passing procedure. Also, we provide in Appendix C a concrete instance of the approach presented above.
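As a minimal sketch of the "sum of messages" reading, consider a single categorical hidden state with a known prior vector `D` and a known likelihood matrix `A`, where `A[o][s]` is the probability of observation `o` given state `s`. For a categorical variable the natural parameters are log-probabilities, so the message from the prior factor is `ln D` and the message from the likelihood factor is the log of the row of `A` selected by the observation; summing the two and normalising gives the posterior. This toy model with fixed parameters, and the function names below, are assumptions made for illustration and are not the example of Appendix C.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def posterior_over_state(D, A, observation):
    """Posterior over a categorical state as a normalised sum of two messages."""
    message_from_prior = [math.log(d) for d in D]
    message_from_likelihood = [math.log(A[observation][s]) for s in range(len(D))]
    natural_parameters = [p + l for p, l in zip(message_from_prior, message_from_likelihood)]
    return softmax(natural_parameters)


D = [0.5, 0.5]                # prior over two hidden states
A = [[0.9, 0.2], [0.1, 0.8]]  # A[o][s] = P(O = o | S = s)
print(posterior_over_state(D, A, observation=0))
```

In this simple graph the result coincides with exact Bayesian inference, since no other hidden variable needs to be averaged out.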

Active Inference
This section provides a quick overview of the active inference framework, and Appendix H presents a description of the exponential complexity class that it exhibits. The reader is referred to Appendix F for any notations that might not be explained here. For a more detailed treatment of the active inference framework, we refer the reader

Generative model
As illustrated in Figure 4, the classic generative model represents the world as a sequence of hidden states generating observations through the matrix $A$. The prior over the initial states is defined by the vector $D$ and the transition between time steps is encoded by a 3-tensor $B$, i.e., one matrix per action. Importantly, the random variable $\pi$ represents all possible policies up to a given time horizon $T$ and each policy is defined as a sequence of actions, i.e., $\{U_t, ..., U_{T-1}\}$ where $U_\tau \in \{1, ..., |U|\}$ $\forall \tau \in \{t, ..., T-1\}$. The prior over the policies is then set such that policies with high probability minimise the EFE, which is defined as follows (Parr and Friston, 2019):

$$G(\pi) = \sum_{\tau = t+1}^{T} \Big( D_{KL}\big[ Q(O_\tau|\pi) \,\|\, P(O_\tau) \big] + \mathbb{E}_{Q(S_\tau|\pi)} \big[ H[P(O_\tau|S_\tau)] \big] \Big),$$

where $H[P(O_\tau|S_\tau)] = \mathbb{E}_{P(O_\tau|S_\tau)}[-\ln P(O_\tau|S_\tau)]$ is the Shannon entropy, $G$ is a vector containing as many elements as the number of policies, and the $i$-th element of $G$ represents the cost of the $i$-th policy. The prior preferences over observations $P(O_\tau)$ represent the (categorical) distribution that the agent wants its observations to be sampled from and is traditionally encoded by the vector $C$. Note that this generalises the concept of reward from reinforcement learning. Indeed, maximising reward can be reformulated as sampling observations from a Dirac delta distribution over reward maximising states (Da Costa et al, 2020b).

Figure 4: This figure illustrates the Forney factor graph of the entire generative model presented by Friston et al (2016). The probability of the initial states is defined by the vector $D$, and the matrix $A$ defines the probability of the observations given the hidden states. The $B$ matrices define the transition between any successive pair of hidden states. This transition depends on the action performed by the agent, i.e., on the policy $\pi$. Furthermore, the prior over the policies has been chosen such that policies minimizing expected free energy are more probable. Finally, the precision parameter $\gamma$ (which modulates the confidence over which policies to pursue) is distributed according to a gamma distribution.
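To illustrate how such a cost can be evaluated for categorical distributions, the sketch below computes one time step of the EFE in the commonly used "risk plus ambiguity" form, i.e., the KL divergence between predicted and preferred observations plus the expected entropy of the likelihood. The function name, the list-based representation (`A[o][s]` for the likelihood, `C[o]` for the preferences) and the single-step restriction are assumptions made for illustration.

```python
import math


def expected_free_energy_step(state_belief, A, C):
    """Risk plus ambiguity for one future time step of a categorical model."""
    num_obs, num_states = len(A), len(state_belief)
    # Predicted observations: Q(O | pi) = sum_s P(O | S = s) Q(S = s | pi).
    q_obs = [sum(A[o][s] * state_belief[s] for s in range(num_states))
             for o in range(num_obs)]
    # Risk: KL divergence between predicted and preferred observations.
    risk = sum(q * (math.log(q) - math.log(C[o]))
               for o, q in enumerate(q_obs) if q > 0)
    # Ambiguity: expected Shannon entropy of the likelihood P(O | S).
    ambiguity = 0.0
    for s in range(num_states):
        entropy = -sum(A[o][s] * math.log(A[o][s])
                       for o in range(num_obs) if A[o][s] > 0)
        ambiguity += state_belief[s] * entropy
    return risk + ambiguity


identity_A = [[1.0, 0.0], [0.0, 1.0]]
print(expected_free_energy_step([1.0, 0.0], identity_A, C=[0.9, 0.1]))
print(expected_free_energy_step([0.0, 1.0], identity_A, C=[0.9, 0.1]))
```

With an unambiguous likelihood, the state leading to the preferred observation receives the lower cost, which is how the preference vector plays the role of a reward.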
Lastly, the precision parameter $\gamma$ has been associated with neuromodulators such as dopamine (FitzGerald et al, 2015; Friston et al, 2013) and can be understood as modulating the confidence over the information afforded by the expected free energy, e.g., smaller values of $\gamma$ lead to more stochastic decision-making. The framework also allows $A$, $B$ and $D$ to be learned by introducing Dirichlet distributions over the columns of these tensors such that the posterior parameters of $A$, $B$ and $D$ can be reused in a new trial, as parameters of the prior, giving an empirical prior. Finally, the classic generative model is defined as follows:

$$P(O_{0:t}, S_{0:T}, \pi, A, B, D, \gamma) = P(\pi|\gamma) P(\gamma) P(A) P(B) P(S_0|D) P(D) \prod_{\tau=0}^{t} P(O_\tau|S_\tau, A) \prod_{\tau=1}^{T} P(S_\tau|S_{\tau-1}, B, \pi).$$

The factors are a softmax prior over policies, $P(\pi|\gamma) = \sigma(-\gamma G)$, a gamma prior over $\gamma$, Dirichlet priors over $A$, $B$ and $D$, and categorical distributions over the hidden states and observations. Here, $G$ is a vector of size $|\pi|$ whose $i$-th element corresponds to the expected free energy of the $i$-th policy, $\sigma(\bullet)$ is the softmax function, $\Gamma(\bullet)$, $Cat(\bullet)$ and $Dir(\bullet)$ stand for a gamma, categorical and Dirichlet distribution, respectively, $\pi_{\tau-1} \in \{1, ..., |U|\}$ is the action prescribed by policy $\pi$ at time $\tau - 1$, $O_{0:t}$ is the set of (random variables representing) observations between time step $0$ and $t$, and $S_{0:T}$ is the set of (random variables representing) hidden states between time step $0$ and $T$.
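The effect of the precision parameter can be illustrated with a small sketch of a softmax prior over policies. The form `softmax(-gamma * G)` is the one commonly used in the active inference literature and is assumed here; the function names and the example values of `G` are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def policy_prior(G, gamma):
    """Prior over policies: lower expected free energy means higher probability."""
    return softmax([-gamma * g for g in G])


G = [1.0, 2.0, 4.0]  # hypothetical expected free energy of three policies
for gamma in (0.0, 1.0, 5.0):
    print(gamma, policy_prior(G, gamma))
```

With `gamma = 0` the prior is uniform, and as `gamma` increases the probability mass concentrates on the policy with the lowest expected free energy, i.e., decision-making becomes less stochastic.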

Variational Distribution
The most widely used variational distribution (Da Costa et al, 2020a; Friston et al, 2016) is not fully factorized, i.e., the posterior models the influence of the policy on the hidden states, leading to the following factorization:

$$Q(S_{0:T}, \pi, A, B, D, \gamma) = Q(\pi) Q(A) Q(B) Q(D) Q(\gamma) \prod_{\tau=0}^{T} Q(S_\tau|\pi),$$

where each factor has its own posterior parameters, denoted by variables with a hat. Notice that the distributions over $A$, $B$ and $D$ remain Dirichlet distributions, and the distributions over $\gamma$ and $S_\tau$ remain a gamma and a categorical distribution, respectively. Only the distribution over $\pi$ changes from a Boltzmann to a categorical distribution but both are discrete distributions.
Remark 1 By definition the generative model P (O 0:t , S 0:T , π, A, B, D, γ) is a joint probability distribution over both the observed (O 0:t ) and latent (S 0:T , π, A, B, D, γ) variables. However, the goal of the variational distribution Q(S 0:T , π, A, B, D, γ) is to approximate the true posterior P (S 0:T , π, A, B, D, γ|O 0:t ), which is a distribution over the latent variables only. Thus, the approximate posterior Q(S 0:T , π, A, B, D, γ) is also a distribution over the latent variables only, and does not contain the observed variables.

Variational Free Energy
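By definition, the variational free energy (VFE) is the Kullback-Leibler divergence between the variational distribution and the generative model, i.e.

$$F = D_{KL}\big[ Q(S_{0:T}, \pi, A, B, D, \gamma) \,\|\, P(O_{0:t}, S_{0:T}, \pi, A, B, D, \gamma) \big] = \mathbb{E}_{Q}\big[ \ln Q(S_{0:T}, \pi, A, B, D, \gamma) - \ln P(O_{0:t}, S_{0:T}, \pi, A, B, D, \gamma) \big].$$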
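The following sketch evaluates this quantity for a toy model with a single categorical hidden state and one fixed observation, which is an assumption made to keep the example small. It illustrates the standard property that the VFE is an upper bound on the negative log evidence, with equality when the variational distribution equals the true posterior. The function name and the numerical values are hypothetical.

```python
import math


def variational_free_energy(q, joint):
    """F = sum_s Q(s) [ln Q(s) - ln P(o, s)] for a fixed observation o."""
    return sum(qs * (math.log(qs) - math.log(ps))
               for qs, ps in zip(q, joint) if qs > 0)


prior = [0.5, 0.5]       # P(s)
likelihood = [0.9, 0.2]  # P(o | s) for the observation that was made
joint = [p * l for p, l in zip(prior, likelihood)]  # P(o, s)
evidence = sum(joint)                               # P(o)
posterior = [j / evidence for j in joint]           # P(s | o)

print(variational_free_energy(prior, joint))      # a loose bound
print(variational_free_energy(posterior, joint))  # the tightest bound
print(-math.log(evidence))                        # negative log evidence
```

Minimising the VFE with respect to the variational distribution therefore moves it towards the true posterior.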

Action selection
In active inference, the simplest strategy to select actions is to compute the evidence for all policies under consideration and then choose the most likely action according to these policies. Mathematically, this amounts to a Bayesian model average by executing the action with the highest posterior evidence:

$$u_t^* = \arg\max_{u} \sum_{m=1}^{|\pi|} [u = \pi_t^m]\, Q(\pi = m),$$

where $|\pi|$ is the number of policies, $\pi_t^m$ is the action predicted at the current time step by the $m$-th policy, and $[u = \pi_t^m]$ is an indicator function that equals one if $u = \pi_t^m$ and zero otherwise.
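This Bayesian model average can be sketched in a few lines. The function name is hypothetical, policies are represented as tuples of actions, and actions are indexed from zero for convenience, which differs from the 1-based convention of the text.

```python
def select_action(policies, policy_posterior, t, num_actions):
    """Sum the posterior mass of the policies that prescribe each action at time t."""
    evidence = [0.0] * num_actions
    for policy, probability in zip(policies, policy_posterior):
        evidence[policy[t]] += probability  # indicator [u = pi^m_t]
    best_action = max(range(num_actions), key=lambda u: evidence[u])
    return best_action, evidence


policies = [(0, 0), (0, 1), (1, 0), (1, 1)]
policy_posterior = [0.1, 0.2, 0.3, 0.4]
print(select_action(policies, policy_posterior, t=0, num_actions=2))
```

Here the two policies starting with action 1 jointly carry more posterior mass than those starting with action 0, so action 1 is executed.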

Monte Carlo Tree Search
By now, the reader should be familiar with the framework of active inference and how variational message passing combined with the Forney factor graph formalism can be used to compute posterior beliefs. We now turn to the last piece of background required to present the method proposed in this paper: Monte Carlo tree search (MCTS), which is based on the multi-armed bandit literature (c.f. Appendix G for details).

A four step process
Monte Carlo tree search has been widely used in the reinforcement learning literature as it enables agents to plan efficiently when the evaluation of every possible action sequence is computationally prohibitive (Silver et al, 2016;Browne et al, 2012;Schrittwieser et al, 2019;Fountas et al, 2020). This algorithm essentially builds a tree in which each node corresponds to a future state and each edge represents the action that led to that state. Initially, the tree is only composed of a root node corresponding to the current state. From here, MCTS is a four step process.
First, a node is selected according to a criterion such as the upper confidence bound for trees (UCT):

$$UCT_j = \bar{X}_j + 2 C_p \sqrt{\frac{2 \ln n}{n_j}},$$

where $n$ is the number of times the current (parent) node has been explored, $n_j$ stands for the number of times the $j$-th child node has been explored, $C_p > 0$ is the exploration constant and $\bar{X}_j$ is the average reward received by the $j$-th child. Note, if the rewards are in $[0, 1]$, then $C_p = 1/\sqrt{2}$ is known to satisfy the Hoeffding inequality (Browne et al, 2012) and the UCT criterion reduces to:

$$UCT_j = \bar{X}_j + 2 \sqrt{\frac{\ln n}{n_j}}.$$

Importantly, the UCT aims to explore highly rewarding paths (exploitation in first term), while also visiting rarely explored regions (exploration in second term).
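The trade-off between the two terms can be sketched as follows, assuming the form of the UCT criterion given by Browne et al (2012), i.e., the average reward plus `2 * C_p * sqrt(2 * ln(n) / n_j)`. The function names and the representation of children as `(average_reward, visit_count)` pairs are illustrative assumptions.

```python
import math


def uct(average_reward, n_parent, n_child, c_p=1 / math.sqrt(2)):
    """UCT score: exploitation term plus exploration bonus."""
    return average_reward + 2 * c_p * math.sqrt(2 * math.log(n_parent) / n_child)


def select_child(children, n_parent, c_p=1 / math.sqrt(2)):
    """Return the index of the child with the largest UCT score.

    `children` is a list of (average_reward, visit_count) pairs with visit_count > 0.
    """
    return max(range(len(children)),
               key=lambda j: uct(children[j][0], n_parent, children[j][1], c_p))


children = [(0.6, 10), (0.5, 2)]  # a well explored child and a rarely explored one
print(select_child(children, n_parent=12))           # exploration bonus matters
print(select_child(children, n_parent=12, c_p=0.0))  # pure exploitation
```

With the default exploration constant the rarely visited child is preferred despite its lower average reward, whereas setting the constant to zero reduces the criterion to greedy selection.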
As shown in Figure 5, this criterion is first used at the root level leading to the selection of a node from the root's children. Then, it is used at the level of the root's children, and so on until a leaf node is reached. UCT is a direct application of the UCB1 criterion to trees, where at each level, the allocation strategy must pick a node that is expected to lead to the highest reward, and "picking the $i$-th node" can be seen as the $i$-th action of a multi-armed bandit problem. Once a leaf node has been selected, an expansion step is performed by sampling an action from a distribution and adding the node corresponding to this action as a child of the leaf node, i.e., the leaf node is expanded.
The third step consists of performing virtual rollouts into the future to estimate the average future reward obtained from the state corresponding to the newly expanded node. Finally, during the back-propagation step, the average reward obtained from the newly expanded state is used to re-evaluate the average quality of all its ancestors, and the visit counts of all nodes (in the branch explored) are increased. Iterating this four-step process until the time budget has been spent gives a fairly good estimate of the best action to perform next. Figure 5 summarises the MCTS procedure. In the next section, we present our approach and show how MCTS can be fused with active inference by performing a dynamical expansion of the generative model.

Figure 5: This figure illustrates the MCTS algorithm as a four step process. First, we start at the node representing the current state $S_t$ and select a node based on the UCT criterion until a leaf node is reached. Second, the tree is expanded to a new node by taking a virtual action from the selected node. Third, the value of this action is estimated by simulating the expected reward following that action. In the simplest version of MCTS, simulations are run until a terminal state is reached, e.g., until the game ends in Go or Chess. Fourth, the expected value is back-propagated to the new node and all of its ancestor nodes. The multi-indices in curly brackets denote action sequences taken from the root node, indicating the current state of the environment.
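The four steps can be sketched on a toy problem as follows. This is a generic MCTS variant written for illustration, not the procedure used later in this paper: it expands the first node that still has an untried action, uses uniformly random rollouts of fixed depth, and scores a rollout by the reward of its final state. The environment (a walk on the integers with actions -1 and +1, rewarded when the state is positive) and all names are assumptions.

```python
import math
import random


class Node:
    """A tree node: a state, the action that led to it, and running statistics."""

    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children = []
        self.visits = 0
        self.total_reward = 0.0


def uct(child, c_p):
    """UCT score of a child that has been visited at least once."""
    mean = child.total_reward / child.visits
    return mean + 2 * c_p * math.sqrt(2 * math.log(child.parent.visits) / child.visits)


def mcts(root_state, actions, step, reward, rollout_depth, iterations, rng,
         c_p=1 / math.sqrt(2)):
    root = Node(root_state)
    for _ in range(iterations):
        # 1. Selection: descend through fully expanded nodes using UCT.
        node = root
        while len(node.children) == len(actions):
            node = max(node.children, key=lambda child: uct(child, c_p))
        # 2. Expansion: add one child for an action not yet tried from this node.
        tried = {child.action for child in node.children}
        action = rng.choice([a for a in actions if a not in tried])
        node = Node(step(node.state, action), parent=node, action=action)
        node.parent.children.append(node)
        # 3. Simulation: random rollout, scored by the reward of the final state.
        state = node.state
        for _ in range(rollout_depth):
            state = step(state, rng.choice(actions))
        value = reward(state)
        # 4. Back-propagation: update the new node and all of its ancestors.
        while node is not None:
            node.visits += 1
            node.total_reward += value
            node = node.parent
    return root


root = mcts(0, actions=(-1, 1), step=lambda s, a: s + a,
            reward=lambda s: 1.0 if s > 0 else 0.0,
            rollout_depth=2, iterations=500, rng=random.Random(0))
best = max(root.children, key=lambda child: child.visits)
print(best.action, [(child.action, child.visits) for child in root.children])
```

The action returned is the one whose subtree was visited most often, which is a common way of reading off the decision once the time budget has been spent.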

Branching Time Active Inference (BTAI)
In this section, we present a novel active inference agent that frames planning using a tree as a form of Bayesian model extension. Using terminology from concurrency theory (Bowman, 2005), we call our new formalism Branching Time Active Inference (BTAI). In this domain, models of systems based upon sequences of actions (the format of policies) are described as linear time, while models based upon tree and even graph structures are called branching time (Glabbeek, 1990; van Glabbeek, 1993; Bowman, 2005). Importantly, we do not consider the generative model and the tree as two different objects. Instead, we merge those two objects together into a generative model that can be dynamically expanded. Figure 6 illustrates an example of such a model, where for the sake of simplicity, we assume that the matrices $A$, $B$ and $D$ are given to the agent. Furthermore, the random variable representing the policies has been replaced by random variables representing actions and the precision parameter $\gamma$ has been removed, which is a common design choice (Fountas et al, 2020). Additionally, we follow Parr and Friston (2019) by viewing future observations as latent random variables. Finally, note that the transition between two consecutive hidden states in the future ($S_{I \setminus last}$ and $S_I$ where $I$ is a multi-index) will only depend on the matrix $\bar{B}_I = \bar{B}(\bullet, \bullet, I_{last})$, i.e., the matrix corresponding to action $I_{last}$ that led to the transition from $S_{I \setminus last}$ to $S_I$. The reader is referred to Table 1 for the definition of $\bar{B}$, and more details about multi-indices can be found in Appendix F.

Prior, Posterior and Target distributions
Since the generative model is fairly different from the standard model, we state here its formal definition: where $\mathbb{I}_t$ is the set of all non-empty multi-indices already expanded by the tree search from the current state $S_t$, the second product ($\tau$ from $0$ to $t - 1$) models the uncertainty over action, reflecting the focus on actions rather than policies, and $S_{I \setminus last}$ is the parent state of $S_I$. Intuitively, the product over all $I \in \mathbb{I}_t$ models the future, while the rest of the above equation models the past and present. Additionally, we need to define the individual factors: where $\bar{A}$ and $\bar{B}$ are defined in Table 1, $\bar{B}_I = \bar{B}(\bullet, \bullet, I_{last})$ is the matrix corresponding to $I_{last}$ and $I_{last}$ is the last index of the multi-index $I$, i.e., the last action that led to $S_I$. Importantly, $\bar{A}$ and $\bar{B}$ should not be confused with

We now turn to the definition of the variational posterior. Under the mean-field approximation: where the individual factors are defined as: and $Q(B)$, respectively. Importantly, $O_I$ appears in the variational distribution because observations in the future are treated as hidden variables.
Finally, we follow Millidge et al (2021) in assuming that the agent aims to minimise the KL divergence between the approximate posterior depicting the state of the environment and a target (desired) distribution. Therefore, our framework allows for the specification of prior preferences over both future hidden states and future observations: where the individual factors are defined as: Importantly, by specifying the value of future observations and states, C O and C S play a similar role to the vector C in active inference, i.e., they specify which observations and hidden states are rewarding.
To sum up, this framework is defined using three distributions: the prior defines the agent's beliefs before sampling any observation; the posterior is an updated version of the prior which takes into account past observations made by the agent; finally, the target distribution encodes the agent's prior preferences in terms of future observations and hidden states.
Figure 6: The future is now a tree-like generative model whose branches correspond to the policies considered by the agent. As we will see, these branches can be dynamically expanded during planning. Here, the nodes in light gray represent possible expansions of the current generative model. For the sake of clarity, the random tensors $A$, $B$, $\Theta_\tau$ and $D$ are not illustrated, i.e., Dirichlet priors over those random tensors are not shown.

Bayesian belief updates
In this section, we focus on the set of update equations used to perform approximate Bayesian inference. These Cox et al, 2019), which in a way similar to automatic differentiation alleviates the final user from the burden of manually deriving complex update equations for each new generative model. To simplify our notation, we use two operators ⊗ and ⊙ that we call generalized outer and inner product, respectively. The generalized outer product creates an N dimensional tensor from N vectors, while the generalized inner product performs a weighted average over one dimension of an N dimensional array, cf Appendix A for details. Using these notations, the first set of update equations are given by: where o τ is the observation made at time τ . Furthermore, this first set of equations count (probabilistically) the number of times, an initial hidden state has been observed, an action has been performed, a state has generated a particular observation or an action has led to the transition between two consecutive hidden states. For example, the posterior parametersâ are computed by adding t τ =0 ⊗[D τ , o τ ] (i.e., the number of times a state-observation pair has been observed during this trial) to the prior parameters a (i.e., the number of times this same pair has been observed during previous trials). The equations for belief updates are given by: where σ( • ) is the softmax function, ch t are the children (states) of the current states S t , ch I are the children (states) of the states S I , [predicate] is an indicator function returning one if the predicate is true and zero otherwise, and the definition ofÅ,B,D andΘ τ are given in Table 1. 
Note that thanks to the operators ⊗ and ⊙, the perception (i.e., state-estimation) equations can be intuitively understood as a sum of messages, where each message from a factor to a variable is the average over all dimensions except the dimension of the variable, e.g., the message (Å ⊙ o 0 ) from P o 0 to S 0 is the vector obtained by weighting the rows of Å by the elements of o 0 . Importantly, the above update equations are almost identical to the ones used in standard active inference, and thus can be implemented efficiently. Indeed, most of the computation required consists of matrix additions, O(n^2), and matrix multiplications, O(n^3), or their higher-dimensional counterparts.
Notation Meaning The expectation of Ā The expectation of B Table 1: Update equations notation. Note that Appendix D provides a proof for D.

Planning as structure learning
In this section, we frame planning as a form of structure learning where the structure of the generative model is modified dynamically. This method is greatly inspired by the Monte Carlo tree search literature, c.f., Section 5 for details.

Selection of the node to be expanded
The first step of planning is to select a node to be expanded. The selection process starts at the root node. If the root node still has unexplored children, then one of them is selected. Otherwise, the child node maximising the UCT criterion, where the average reward is replaced by minus the average EFE, is selected, i.e., the selected node maximises: $UCT_J = -\bar{g}_J + C_{explore} \sqrt{\ln n / n_J}$, where J is a multi-index, n is the number of times the root node has been visited, n J is the number of times the child corresponding to the multi-index J was selected, ḡ J is the average cost received when selecting the child S J , and C explore is the exploration constant. The UCT criterion can be understood as a trade-off between exploitation and exploration at the tree level, which is different from the exploitation and exploration dilemma at the model level; the latter dilemma is handled by the EFE. Also, the notion of cost in the above equation can be defined in many ways and is the subject of Section 6.3.3. For our purposes, the cost will be equal, or similar, to the expected free energy, which means that the expected free energy drives structure learning. When a root's child is selected, it becomes the new root in the above procedure, which is iterated until a leaf node is reached.
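The selection step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `Node` class, the `c_explore` constant and its default value are assumptions introduced here, and the score uses the usual UCT form with the average reward replaced by minus the average cost.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node holding the statistics used by the UCT criterion."""
    visits: int = 0
    aggregated_cost: float = 0.0
    children: list = field(default_factory=list)

    def average_cost(self):
        return self.aggregated_cost / self.visits


def uct_score(child, parent_visits, c_explore=1.0):
    """Minus the average cost (exploitation) plus an exploration bonus."""
    if child.visits == 0:
        return math.inf  # unexplored children are selected first
    exploration = c_explore * math.sqrt(math.log(parent_visits) / child.visits)
    return -child.average_cost() + exploration


def select_leaf(root, c_explore=1.0):
    """Descend from the root, following the best UCT score, until a leaf is reached."""
    node = root
    while node.children:
        parent_visits = node.visits
        node = max(node.children,
                   key=lambda ch: uct_score(ch, parent_visits, c_explore))
    return node
```

Setting `c_explore` to zero makes the search purely greedy with respect to the average cost, while larger values favour rarely visited children.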

Dynamical expansion of the generative model
Let S I\last denote the leaf node selected for expansion. When S I\last has been selected, the structure of the generative model needs to be modified by expanding all possible actions from that node. For each action, we expand the generative model by adding a future hidden state S I whose prior distribution is given by $P(S_I | S_{I \setminus last}) = \mathrm{Cat}(\bar{B}_I)$, where B̄ I is the matrix corresponding to the last action that led to S I . Finally, we expand the (future) observation associated with the new hidden state S I , whose distribution is: $P(O_I | S_I) = \mathrm{Cat}(\bar{A})$. To sum up, the expansion step adds two random variables (S I and O I ) to the generative model, i.e. the generative model becomes bigger, and I is added to the set of all non-empty multi-indices already expanded by the tree search (I t ). The prior distributions over those newly added random variables (i.e. S I and O I ) are defined using the matrices B̄ I and Ā, which effectively predict the future states and observations. After the expansion step, the posterior distribution over S I and O I needs to be computed. At least two kinds of inference strategies can be used. The first, global inference, performs variational message passing over the entire generative model, while the second, local inference, only iterates the update equations of the newly expanded nodes, i.e., S I and O I , until convergence to the variational free energy minimum.
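To make the expansion step concrete, the sketch below adds one child per action and predicts its state and observation beliefs with a single forward pass. It is an illustration under assumptions, not the inference scheme of the paper: the function names are hypothetical, the matrices are taken to be column-stochastic (column j holds the distribution given state j), and the iterative local or global message passing is replaced by a simple matrix-vector prediction.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def expand(parent_belief, transition_by_action, likelihood):
    """Return {action: (state_belief, observation_belief)} for every action.

    parent_belief: beliefs over the states of the node selected for expansion.
    transition_by_action: {action: B matrix}, the expected transition per action.
    likelihood: the expected A matrix mapping states to observations.
    """
    children = {}
    for action, transition in transition_by_action.items():
        state_belief = matvec(transition, parent_belief)          # predicted S_I
        observation_belief = matvec(likelihood, state_belief)     # predicted O_I
        children[action] = (state_belief, observation_belief)
    return children
```

For example, with two states, a "stay" action (identity matrix) and a "switch" action (permutation matrix), expanding a node whose belief is concentrated on the first state yields one child concentrated on each state.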

Cost evaluation of the expanded nodes
After expanding the model structure, we need to compute the cost of the newly expanded node S I . As explained in Section 6.3.1, the cost of S I will influence the probability of expanding S I during future planning iterations. In active inference, the classic objective of planning is the expected free energy as defined in Section 4.1, i.e., where g classic I trades off risk (first summand) and ambiguity (second summand). Alternatively, one could follow Section 5 of Millidge et al (2021) and define the cost of S I using the free energy of the expected future: where V (O I , S I ) is the target distribution over states and observations. The target distribution V (O I , S I ) generalises the C matrix in Friston's model by specifying prior preferences over both future observations and future states. Also, this formulation of the cost speaks to the notion of KL divergence minimization proposed by Hafner et al (2020).
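As an illustration of the risk and ambiguity trade-off, the sketch below evaluates a cost for one expanded node. It assumes the common form in which risk is the KL divergence between predicted and target observations, and ambiguity is the expected entropy of the likelihood; the exact expression of Section 4.1 is not reproduced here, and all function names are hypothetical.

```python
import math


def kl_divergence(q, p):
    """KL[q || p] for two categorical distributions given as lists."""
    return sum(q_i * math.log(q_i / p_i) for q_i, p_i in zip(q, p) if q_i > 0)


def entropy(p):
    """Shannon entropy of a categorical distribution."""
    return -sum(p_i * math.log(p_i) for p_i in p if p_i > 0)


def classic_cost(obs_belief, obs_target, state_belief, likelihood):
    """Risk plus ambiguity for one node (assumed form, see the lead-in).

    likelihood: matrix whose column j is the distribution P(O | S = j).
    """
    risk = kl_divergence(obs_belief, obs_target)
    columns = list(zip(*likelihood))
    ambiguity = sum(s * entropy(col) for s, col in zip(state_belief, columns))
    return risk + ambiguity
```

With an identity likelihood the ambiguity vanishes and the cost reduces to the risk, which is zero exactly when the predicted observations match the target distribution.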
Furthermore, due to the mean-field approximation of the posterior (23) and the factorised form of the target distribution (24), the expression of the cost simplifies to: $g^{pcost}_I = D_{KL}[Q(O_I) \,\|\, V(O_I)] + D_{KL}[Q(S_I) \,\|\, V(S_I)]$.
In this section, we let G aggr L be a variable that contains the total cost of the node S L , where L could be any multi-index. According to the previous section, we let g L be any of the following evaluation criteria: g pcost L , g feef L and g classic L . Initially, G aggr L equals g L . Also, we let S K be the node that was selected for expansion, and let S I be an arbitrary hidden state expanded from S K . The cost of the newly expanded node(s) can be propagated either forward or backward. The forward propagation (towards the leaves) leads to the following equation: $G^{aggr}_I \leftarrow G^{aggr}_I + G^{aggr}_K$, where G aggr K is the aggregated cost of the parent of S I . Importantly, the symbol ← refers to a programming-like assignment (i.e., an incremental update) performed each time the tree is expanded. The backward propagation (towards the root) leads to: $\forall J \in A_I, \; G^{aggr}_J \leftarrow G^{aggr}_J + G^{aggr}_I$, where A I corresponds to all ancestors of the newly expanded node S I . We will see in Section 7 that these strategies respectively relate to active inference and sophisticated inference. Finally, since the agent is free to choose any action, we can back-propagate the (locally) minimum cost, i.e., the minimum of G aggr K::a over all actions a, where K :: a is a multi-index obtained from K by adding the action a to the sequence of actions described by K.
In all cases, the propagation step updates the counter n J associated with each ancestor S J of the newly expanded hidden state S I ; this counts the number of times the node S J has been explored (exactly as in MCTS). This counter will be used for action selection, as well as for the computation of the average cost of S J , denoted ḡ J , that was left undefined by Section 6.3.1. Formally, ḡ J is given by: $\bar{g}_J = G^{aggr}_J / n_J$.
Remark 2 The forward propagation of the cost presented above will only be used for theoretical purposes in Section 7. Practical implementations of BTAI should use the backward schemes.
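The two propagation schemes can be sketched on a toy tree as follows. This is a minimal illustration: the `Node` class and function names are hypothetical, and starting the visit counter of a new node at one is an assumption made here for simplicity.

```python
class Node:
    """Hypothetical tree node; the aggregated cost starts at the local cost g."""

    def __init__(self, local_cost, parent=None):
        self.parent = parent
        self.aggregated_cost = local_cost
        self.visits = 1  # assumed initial count for a newly expanded node


def ancestors(node):
    """Yield the parent, grandparent, ... up to the root."""
    while node.parent is not None:
        node = node.parent
        yield node


def count_visit(new_node):
    """Increment the counter of every ancestor of the newly expanded node."""
    for ancestor in ancestors(new_node):
        ancestor.visits += 1


def propagate_forward(new_node):
    """Forward scheme: add the parent's aggregated cost to the new node."""
    if new_node.parent is not None:
        new_node.aggregated_cost += new_node.parent.aggregated_cost
    count_visit(new_node)


def propagate_backward(new_node):
    """Backward scheme: add the new node's cost to all of its ancestors."""
    for ancestor in ancestors(new_node):
        ancestor.aggregated_cost += new_node.aggregated_cost
    count_visit(new_node)


def average_cost(node):
    return node.aggregated_cost / node.visits
```

Under the forward scheme a leaf accumulates the cost of the path leading to it, whereas under the backward scheme the root accumulates the cost of every node expanded beneath it.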

Action selection
The planning procedure presented in the previous section ends after a pre-specified amount of time has elapsed or when a sufficiently good policy has been found. When planning is over, the agent needs to choose an action to perform in its environment. In a companion paper (Champion et al, 2021a) that presents empirical results of BTAI, the actions are sampled from σ(−γ g/N ), where σ( • ) is a softmax function, γ is a precision parameter, g is a vector whose elements correspond to the cost of the root's children, N is a vector whose elements correspond to the number of visits of the root's children, and the division is element-wise. Importantly, actions with low average cost are more likely to be selected than actions with high average cost.
Alternative approaches to action selection (Browne et al, 2012) could be studied. For example, one could imagine sampling actions from a categorical distribution with parameter σ(N ), where N is a vector containing the n J of all children of the root node. Alternatively, we could select the action corresponding to the root's child with the highest number of explorations n J . The fact that it has been visited more often means that it has a lower cost overall. If there were a tie between several actions, the action with the lowest cost would be selected. The study of these strategies is left to future research.
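The selection rules discussed in this section can be sketched as follows. The function names and the default precision are assumptions introduced for illustration; `costs` and `visits` stand for the vectors g and N of the root's children.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def action_probabilities(costs, visits, gamma=1.0):
    """Softmax of minus the precision-weighted average cost of each child."""
    return softmax([-gamma * g / n for g, n in zip(costs, visits)])


def sample_action(costs, visits, gamma=1.0, rng=random):
    """Sample an action index from the softmax distribution."""
    probs = action_probabilities(costs, visits, gamma)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def most_visited_action(costs, visits):
    """Alternative rule: most visited child, ties broken by the lowest cost."""
    return max(range(len(visits)), key=lambda a: (visits[a], -costs[a]))
```

A larger `gamma` concentrates the probability mass on the child with the lowest average cost, while `gamma = 0` gives a uniform distribution over actions.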

Action-perception cycle with tree search
In active inference, the action-perception cycle realises an active inference agent in an infinite loop (van de Laar and de Vries, 2019). Each loop iteration begins with the agent sampling an observation from the environment. The observation is used to perform inference about the states and contingencies of the world, e.g., an impression on the retina might be used to reconstruct a three-dimensional scene with a representation of the objects that it contains. Then, planning is performed by inferring the consequences of alternative action sequences. Importantly, only a subset of all possible action sequences is evaluated, due to the dynamical expansion of the generative model. Finally, the agent selects an action to perform in the environment by sampling a softmax function of minus the average cost weighted by the precision parameter γ, i.e., σ(−γ g/N ). Therefore, actions with low average cost are more likely to be selected than actions with high average cost. We summarise our method using pseudo-code in Algorithm 2.
Algorithm 2: Action-perception cycle with tree search
while end of trial not reached do
    sample an observation from the environment;
    perform inference using the observation (Section 6.2);
    while maximum planning iteration not reached do
        select a node to be expanded (Section 6.3.1);
        perform the expansion of the node (Section 6.3.2);
        perform inference on the newly expanded nodes (Section 6.2);
        evaluate the cost of the newly expanded nodes (Section 6.3.3);
        propagate the cost of the nodes through the tree, either forward or backward (Section 6.3.4);
    end
    select an action to be performed (Section 6.4);
    execute the action in the environment leading to a new observation;
end

Connection between BTAI, active inference and sophisticated inference
In this section, we explore the relationship between BTAI, active inference (AcI) and sophisticated inference (SI).
We show that BTAI is a class of algorithms that generalizes AcI and is related to SI. To do so, we focus on the "cost" of a policy for each method. In addition, we need to introduce the notion of localized and aggregated cost.
The localized cost of a node S I , denoted G local I , is the cost of S I in and of itself, i.e., without any consideration of the cost of past or future states. The aggregated cost of a node S I , denoted G aggre I , is the cost of S I when taking into account either the cost of future states that can be reached from S I (which is the case in SI) or the cost of the past states that an agent has to go through in order to reach S I (which is the case in AcI).

Active inference
The full framework of active inference was described in Section 4. This section focuses on expressing the expected free energy in a recursive form that highlights the relationship between BTAI and AcI. We start by defining the notion of localized and aggregated EFE with Definitions 3 and 4, respectively. Then, we show that in active inference (under some assumptions described below), the aggregated EFE of a policy of size N is given by the aggregated EFE of a policy of size N − 1 plus the localized EFE received at time t + N .
In active inference, a policy is a sequence of actions π = (U t , U t+1 , ..., U T −1 ), where T is the time horizon of planning, and for convenience, π N denotes a policy of size N , obtained by selecting the first N actions of the policy π, i.e., π N = (U t , U t+1 , ..., U t+N −1 ) with N ≤ T − t. Recall from Section 4, that (in active inference) the expected free energy of a policy is given by: If instead of letting τ range from t + 1 to T , we let N range from 1 to T − t, then Equation 44 can be re-written as: Additionally, under the assumption that the probability of observations and states are independent of future actions, i.e., that ∀j ∈ N >0 , Q(O t+i |π i ) ≈ Q(O t+i |π i+j ) and ∀j ∈ N >0 , Q(S t+i |π i ) ≈ Q(S t+i |π i+j ), π can be replaced by π N in the RHS of the above equation, leading to: Importantly, the elements of the above summation constitute the localized cost presented in Definition 3.

Definition 3
We define the localized cost received at time t + N after selecting policy π N as: Importantly, the localized cost quantifies the amount of risk and ambiguity received by the agent at time step t + N , assuming that it will follow the policy π N . We now turn to the notion of aggregated cost of a policy of size N . Definition 4 states that the aggregated cost of a policy is defined recursively. Indeed, by definition, a policy of size zero has an aggregated cost of zero, and then, the aggregated cost of a policy π N (of size N ) is equal to the aggregated cost of π N −1 (of size N − 1) plus the localized cost received at time t + N .
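Definition 4 We define the aggregated cost of a policy π N of size N as: $G^{aggre}_{\pi_0} = 0$ and $G^{aggre}_{\pi_N} = G^{aggre}_{\pi_{N-1}} + G(\pi_N, t + N)$ for N ≥ 1, where G(π N , t + N ) is the localized cost received at time t + N (Definition 3). Equipped with Definitions 3 and 4, we are now ready to state and prove Theorem 5 using the two Lemmas of Appendix E.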
Theorem 5 Under the assumption that the probability of observations and states are independent of future actions, i.e., ∀j ∈ N >0 , Q(O t+i |π i ) ≈ Q(O t+i |π i+j ) and ∀j ∈ N >0 , Q(S t+i |π i ) ≈ Q(S t+i |π i+j ), the expected free energy can be written as:
Proof This proof is based on two lemmas demonstrated in Appendix E. Note that in active inference the expected free energy is defined as: Let N denote the size of the policy π, i.e. N = T − t. Note that because π is of size N , then by definition π = π N , and the above equation can be re-written as: Expanding the summation and using Definition 3: If, instead of letting τ range from t + 1 to t + N − 1, we let i range from 1 to N − 1, then the above equation can be re-written as: Therefore, we replace N by i + k i in the above summation: Lemma 11 tells us that under the assumption that the probability of observations and states are independent of future actions, ∀k i ∈ N >0 , G(π i+k i , t + i) ≈ G(π i , t + i), which allows us to remove the k i to get: Finally, Lemma 12 states that $\sum_{i=1}^{N-1} G(\pi_i, t + i) = G^{aggre}_{\pi_{N-1}}$, and thus: The above equation will be used in Section 7.4 to show that BTAI generalizes active inference.

Sophisticated inference
Sophisticated inference (Friston et al, 2021) is a new type of active inference that defines the EFE recursively from the time horizon backward. Intuitively, the agent does not simply ask "what would happen if I did that", but instead wonders "what would I believe about what would happen if I did that". In other words, the agent is exhibiting a form of sophistication, which refers to the fact of having beliefs about one's own or another's beliefs.

Friston et al (2021) also replaced variational message passing by an alternative inference scheme called Bayesian Filtering (Fox et al, 2003). While the change of inference method is of little relevance to us here, the recursive definition of the EFE is at the core of this section. As explained in Section 4.3 of Da Costa et al (2020b), the (recursive) EFE of a Markov decision process is given by: where U τ and S τ are the action and state at time τ , and V (S τ ) is the target (i.e., desired) distribution over states at time τ . Using our terminology of localized and aggregated cost, this can be rewritten as: $G^{aggre}(U_\tau, S_\tau) = G^{local}(U_\tau, S_\tau) + \mathbb{E}_{Q(U_{\tau+1}, S_{\tau+1} | U_\tau, S_\tau)}[G^{aggre}(U_{\tau+1}, S_{\tau+1})]$. Put simply, the aggregated cost of taking action U τ in state S τ can be computed by summing the localized cost at time step τ and the expected aggregated cost at time step τ + 1. Note that for τ = T − 1 the second term vanishes because future states beyond the temporal horizon are ignored, and thus G aggre (U T −1 , S T −1 ) = G local (U T −1 , S T −1 ). Also, the above equation will be useful in Section 7.5 to show that BTAI is related to sophisticated inference.

Remark 6
The recursive aspect of Equation 59 is deeply related to dynamic programming and the interested reader is referred to Da Costa et al (2020b) for details about this relationship.

Branching Time Active Inference (BTAI)
In BTAI, the (localized) cost of the hidden state S I is defined as G local I = g I , where g I can be equal to g classic I , g feef I or g pcost I , and there are two ways of computing the aggregated cost of S I . We can either propagate the localized cost towards the leaves (forward): where g I\last is the cost of the parent of S I . Alternatively, we can back-propagate the cost towards the root: where A corresponds to all ancestors of the newly expanded node S I .

BTAI as a generalisation of active inference
To understand the relationship between BTAI and active inference, we need to focus on the forward propagation of the cost where the cost is given by g classic I . Recall that the update for forward propagation is given by: where g I\last is the cost of the parent of S I , i.e., the parent of the newly expanded node. This equation tells us that the aggregated cost of S I is equal to the localized cost of S I plus the aggregated cost of S I\last , i.e., $G^{aggre}_I = G^{local}_I + G^{aggre}_{I \setminus last}$. We also recall Equation 49. The only difference between Equations 49 and 66 is notational. Indeed, in BTAI (Eq. 66) a policy is represented by a multi-index denoting the sequence of actions selected, e.g., I = (1, 2) corresponds to a policy of size two consisting of action one followed by action two. In contrast, in active inference (Eq. 49), a policy is a sequence of actions, e.g., π 2 = (1, 2) corresponds to the same policy as the one described by I.

Relationship between BTAI and sophisticated inference
The relationship between BTAI and sophisticated inference is slightly more involved. The backward propagation equation, i.e., tells us that when expanding a node S I , we first need to compute its localized cost g I and then add g I to the aggregated cost of its ancestors S J where J ∈ A. In other words, we can rewrite the backward propagation equation as the following: the aggregated cost of an arbitrary node S J will equal the sum of its localized cost g J and that of its descendants D J that have already been evaluated where the descendants are the children, children of children, etc. We can further simplify this expression by grouping the summands by children of S J . This leads us to: where ch J are the children of S J and D I are the descendants of S I . The above equation has clear similarities to Equation (62), which is, However, the second term of the RHS of (62) is an expectation, while the second term of the RHS of (70) is a summation over the children that have already been expanded. The expectation in (62) where: where U is the set of all possible actions, and argmins is defined as: This means that an action is assigned positive probability mass if and only if it minimises the aggregated cost at the next time point G aggre (U τ +1 , S τ +1 ) and a set is required, because multiple actions could have the same minimum cost. Note that if there is a unique minimum, then Q(U τ +1 |S τ +1 ) will be a one-hot like distribution with a probability of one for the best action.
To conclude, Equations (62) and (70) suggest that BTAI and SI share a similar notion of EFE, where the immediate (or localized) EFE is added to the future (or aggregated) EFE. Both BTAI and SI propagate the cost backward; however, in SI the aggregated EFE (i.e., the back-propagated cost) is weighted by the probability of the next action and states, i.e., Q(U τ +1 , S τ +1 |U τ , S τ ). Intuitively, the weighting terms in SI discount the impact of the back-propagated cost for unlikely states and (locally) sub-optimal actions. Importantly, those weighting terms emerge from the recursive definition of the EFE that relates to the Bellman equation (Da Costa et al, 2020b). In contrast, there are no such weights in BTAI because BTAI finds its inspiration in active inference.
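The recursion used by sophisticated inference can be illustrated on a small Markov decision process. The sketch below rests on several assumptions: the localized cost is taken to be the KL divergence between the predicted next states and the target distribution, the next action is uniformly distributed over the minimisers of the aggregated cost (so that the expectation over actions equals the minimum), and the names `make_sophisticated_cost`, `transitions` and `horizon` are hypothetical.

```python
import math
from functools import lru_cache


def kl_divergence(q, p):
    """KL[q || p] for two categorical distributions given as lists."""
    return sum(q_i * math.log(q_i / p_i) for q_i, p_i in zip(q, p) if q_i > 0)


def make_sophisticated_cost(transitions, target, horizon):
    """Build G(tau, action, state), the aggregated cost defined backward in time.

    transitions[u][s] is the distribution over next states after action u in state s.
    target is the desired distribution over states.
    """
    n_actions = len(transitions)

    @lru_cache(maxsize=None)
    def aggregated(tau, action, state):
        next_states = transitions[action][state]
        cost = kl_divergence(next_states, target)  # localized cost
        if tau == horizon - 1:
            return cost  # nothing beyond the horizon is considered
        for next_state, prob in enumerate(next_states):
            if prob > 0:
                # Uniform mass on the minimisers: the expectation is the minimum.
                cost += prob * min(aggregated(tau + 1, u, next_state)
                                   for u in range(n_actions))
        return cost

    return aggregated
```

In contrast with the unweighted sums of BTAI, each future cost here is weighted by the probability of reaching the corresponding state and by the action distribution that puts all its mass on locally optimal actions.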

Conclusion and future works
In this paper, we have presented a new approach where planning is cast as structure learning. Simply put, this approach consists of dynamically expanding branches of the generative model by evaluating alternative futures under different action sequences. The dynamic expansion trades off evaluating promising (with respect to the target distribution) policies with exploring policies whose outcomes are uncertain. We proposed two different tree search methods: the first in which the nodes' cost is propagated forward from the root node to the leaves; the second in which the nodes' cost is propagated backward from the leaves to the root node. Then, in Section 7 we showed that forward propagation of the EFE leads to active inference (AcI) under the assumption that the probability of observations and states are independent of future actions, and that backward propagation relates to sophisticated inference (SI). This clarifies the link between AcI and BTAI, and helps to understand the relationship between AcI and SI.
Importantly, by performing a complexity class analysis, we have shown that while active inference suffers from an exponential complexity class, our approach scales nicely (linearly) with the number of tree expansions. We also know that humans engage in counterfactual reasoning (Rafetseder et al, 2013), which, in our planning context, could involve the consideration and evaluation of alternative (non-selected) sequences of decisions. It may be that, because of the more exhaustive representation of possible trajectories, classic active inference can more efficiently engage in counterfactual reasoning. In contrast, branching-time active inference would require these alternatives to be generated afresh for each counterfactual deliberation. In this sense, one might argue that there is a trade-off: branching-time active inference provides considerably more efficient planning to attain current goals, while classic active inference provides a more exhaustive assessment of paths not taken.
Now that we have laid out the mathematics of BTAI, many directions of research could be investigated. One could for example obtain an intuitive understanding of the model's parameters through experimental study. At first it might be necessary to restrict oneself to agents without learning, i.e., inference only. This step should help answer questions such as: How does the number of expansions of the tree and the quality of the prior preferences impact the quality of planning? What is the best inference method (i.e., local or global inference) to use during planning?
Then, one could consider learning of the transition and likelihood matrices as well as the vector of initial states. This can be done in at least two ways. The first is to add Dirichlet priors over those matrices/vectors and the second would be to use neural networks as function approximators. The second option will lead to a deep active inference agent (Millidge, 2020) equipped with tree search that could be directly compared to the method of Fountas et al (2020). Including deep neural networks in the framework will also enable direct comparison with the deep reinforcement learning literature (Haarnoja et al, 2018; Mnih et al, 2013; van Hasselt et al, 2016; Lample and Chaplot, 2017; Silver et al, 2016). These comparisons will enable the impact of epistemic terms to be studied when the agent is composed of deep neural networks.
Another, very important direction for future research would be the creation of a biologically plausible implementation of BTAI. For example, using artificial neural networks to model the various mappings of the framework may provide a neural-based implementation of BTAI that is closer to biology. This would especially be the case, if the back-propagation algorithm frequently used for learning is replaced by the (more biologically plausible) generalized recirculation algorithm (O'Reilly, 1996). Another possible approach would be to use populations of neurons to encode the update equations of the framework, as was proposed by Friston et al (2017).
Whatever technique is chosen for learning and inference, implementing MCTS in a biologically plausible way will be challenging. Indeed, MCTS requires a dynamic expansion of the search tree used to explore the space of possible policies. Each time an expansion is performed, the agent needs to store the associated variables such as: the number of visits, the aggregated expected free energy, and the posterior beliefs of the newly expanded node.
Given the fast pace at which planning must be performed to be useful, slow mechanisms such as synaptic plasticity and neurogenesis are likely to be unsuitable for the task. A more plausible approach might rely upon a change of neuronal activation, which can occur within a few hundred milliseconds. One such approach uses a binding pool (Bowman and Wyble, 2007) and provides a notion of variable. In this framework, a variable is composed of two parts. First, a token that can be intuitively understood as the variable's name, and second, a type corresponding to the variable's value. The binding pool is then composed of neurons representing the fact that a variable's name is bound (or set) to a specific value. A localist realisation of a binding pool could be implemented as a 2D array of neurons of size "number of tokens" × "number of types". However, such a representation is quite inefficient and a more compact (i.e. distributed) representation has been developed (Wyble and Bowman, 2006). [...] (Brochu et al, 2010; Bergstra et al, 2011). For example, Thompson Sampling has been shown to improve upon the standard UCT criterion when applied to MCTS (Bai et al, 2013), but requires additional modelling and we leave this for future research.

Appendix A: Generalized inner and outer products
Generalized outer products: Given N vectors V i , the generalized outer product returns an N-dimensional array W , whose element in position (x 1 , ..., x N ) is given by the product V 1 (x 1 ) × ... × V N (x N ), where V i (x i ) is the x i -th element of the i-th vector. In other words: $W(x_1, ..., x_N) = \prod_{i=1}^{N} V_i(x_i)$ for all $x_j \in \{1, ..., |V_j|\}$, where |V j | is the number of elements in V j . Also, note that by definition W is an N-tensor of size |V 1 | × ... × |V N |. Figure 7 illustrates the generalized outer product for N = 3.
Figure 7: This figure illustrates the generalized outer product W = ⊗ V 1 , V 2 , V 3 , where W is a cube of values illustrated in red, whose typical element W (i, j, k) is the product of V 1 (i), V 2 (j) and V 3 (k). Also, the vectors V i ∀i ∈ {1, ..., 3} are drawn in blue along the dimension of the cube they correspond to.
Generalized inner products: Given an N -tensor W and M = N − 1 vectors V i , the generalized inner product returns a vector Z obtained by performing a weighted average (with weighting coming from the vectors) over all but one dimension. In other words: where |Z| denotes the number of elements in Z, and the large summand is over all x r for r ∈ {1, ..., M } \ {j}, i.e., excluding j. Also, note that if |W | V i ∀i ∈ {1, ..., M } is the number of elements in the dimension corresponding to Figure 8 illustrates the generalized inner product for N = 3.  W (i, j, k). Also, the vectors Z and V i ∀i ∈ {2, 3} are drawn in blue along the dimension of the cube they correspond to.
Naming of the dimensions: Importantly, we should imagine that each side of W has a name, e.g., if W is a 3x2 matrix, then the i-th dimension of W could be named: "the dimension of V i ". This enables us to write: where Z 1 is a 1x2 matrix (i.e., a vector with two elements) and Z 2 is a 3x1 matrix (i.e., a vector with three elements). The operator ⊙ knows (thanks to the dimension name) that W ⊙ V 1 takes the weighted average w.r.t "the dimension of V 1 ", while W ⊙ V 2 must take the weighted average over "the dimension of V 2 ".
In the context of active inference, the matrix A has two dimensions that we could call "the observation dimension" (i.e., row-wise) and "the state dimension" (i.e., column-wise). Trivially, A ⊙ o τ will then correspond to the average of A along the observation dimension and A ⊙D τ will correspond to the average of A along the state dimension.
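The two operators can be sketched in plain Python as follows. This is an illustrative sketch under assumptions: tensors are stored as dictionaries keyed by index tuples, dimensions are identified by their position rather than by a name, and the function names are hypothetical.

```python
from itertools import product


def generalized_outer(vectors):
    """Return {(x1, ..., xN): V1[x1] * ... * VN[xN]} for N input vectors."""
    tensor = {}
    for index in product(*(range(len(v)) for v in vectors)):
        value = 1
        for vector, x in zip(vectors, index):
            value *= vector[x]
        tensor[index] = value
    return tensor


def generalized_inner(tensor, shape, weights):
    """Weighted sum of `tensor` over every dimension listed in `weights`.

    weights maps a dimension position to its weighting vector; exactly one
    dimension must be left out, and the result is a vector along it.
    """
    free = [d for d in range(len(shape)) if d not in weights]
    if len(free) != 1:
        raise ValueError("exactly one dimension must be left unweighted")
    keep = free[0]
    result = [0] * shape[keep]
    for index, value in tensor.items():
        for dim, vector in weights.items():
            value *= vector[index[dim]]
        result[index[keep]] += value
    return result


def matrix_to_tensor(rows):
    """Convert a list of rows into the dictionary representation used above."""
    return {(i, j): v for i, row in enumerate(rows) for j, v in enumerate(row)}
```

For a 3x2 matrix, weighting the column dimension reproduces the usual matrix-vector product, while weighting the row dimension reproduces the product with the transposed matrix, which is the role played by the named dimensions in the text.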

Appendix B: Generalized inner/outer products and other well-known products
In this section, we explore the relationship between our generalized inner and outer products-presented in appendix A-and other well known products in the literature.

Inner product of two vectors
The inner product of two vectors a and b of the same size is given by: where a i and b i are the elements of the vectors a and b, respectively, and | a| is the number of elements in a. This product is a special case of our generalised inner product, i.e., where Z is a scalar (a 0-tensor), W and V are two vectors (two 1-tensors) of the same size, and |W | is the number of elements in W .

Inner product of two matrices (Frobenius inner product)
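The inner product of two matrices A and B of the same size is given by: $\langle A, B \rangle_F = \sum_{i=1}^{|A|_1} \sum_{j=1}^{|A|_2} a_{ij} b_{ij}$, where a ij (respectively b ij ) are the elements of the matrix A (respectively B), |A| 1 is the number of rows in A, and |A| 2 is the number of columns in A. This product is not a special case of our generalised inner product.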

Inner product of two tensors (Frobenius inner product)
The inner product of two tensors A and B of same sizes is given by: where a(i 1 , ..., i n ) and b(i 1 , ..., i n ) are the elements of the tensors A and B, respectively, and |A| i is the number of elements in the i-th dimension of A. This product is not a special case of our generalised inner product.

Standard matrix multiplication
Let A be an n × m matrix and b be a vector of size m. The standard matrix multiplication of A by b is given by: This is a special case of our generalised inner product, i.e., Additionally, let A be an n × m matrix and a be a vector of size n, then: This is a special case of our generalised inner product, i.e., Note that because the dimensions are "named" (c.f. Appendix A) the operator performs the transposition implicitly.

Outer product of two vectors
Given two vectors a and b, their outer product, denoted a ⊗ b, is a matrix defined as: $(a \otimes b)_{ij} = a_i b_j$. This outer product is a special case of our generalised outer product, where the operator is applied to only two vectors, i.e., $a \otimes b = \otimes[a, b]$.

Outer product of two tensors
Given two tensors U and V , the outer product of U and V is another tensor W such that: $W(i_1, ..., i_n, j_1, ..., j_m) = U(i_1, ..., i_n) \, V(j_1, ..., j_m)$. Given N vectors V i ∀i ∈ {1, ..., N }, our outer product is a sequence of outer tensor products, i.e., $\otimes[V_1, ..., V_N] = V_1 \otimes_{tensor} V_2 \otimes_{tensor} ... \otimes_{tensor} V_N$, where ⊗ tensor and ⊗ are the tensor and generalised outer products, respectively.

Kronecker product
Given two matrices A and B, the Kronecker product of A and B-denoted ⊗ K -is a generalisation of the outer product from vectors to matrices defined as: where a ij are the elements of A. Note that even if the Kronecker product is a generalisation of the outer product, it is neither a special case nor a generalisation of our generalized outer product.

Hadamard product
Given two matrices A and B of the same size, the Hadamard product of A and B is an element-wise product defined by: where a ij , b ij and c ij are the elements in the i-th row and j-th column of A, B and C, respectively, |A| 1 is the number of rows in A, and |A| 2 is the number of columns in A. This product is unrelated to both our generalised inner and outer products.

Appendix C: Instance of variational message passing
This appendix provides a concrete instance of the method of Winn and Bishop discussed in Section 3. The generative model is as follows: where: Additionally, the variational distribution is given by: which means that we assume that S is an observed random variable. Let us start with the definition of the Dirichlet and categorical distributions written in the form of the exponential family: ...
where · performs an inner product of the two vectors it is applied to, B(d) is the Beta function and |S| is the number of values a state can take. The first step requires us to re-write Equation 97 as a function of u D (D), which is straightforward because µ S (D) is just another name for u D (D). Using the fact that the inner product is commutative: ... .
where $\langle \bullet \rangle$ refers to $\langle \bullet \rangle_{\sim Q_D}$. Note that in the above equation, the $d_i$ are fixed parameters, therefore there is no posterior over d and the first expectation $\langle \bullet \rangle_{\sim Q_D}$ can be removed. The third step rests on taking the exponential of both sides, using the linearity of expectation and factorising by $u_D(D)$ to obtain: ...
where $z_D(d)$ has been absorbed into the constant term because it does not depend on D. The fourth step is a re-parameterisation done by observing that [S = i] is the i-th element of the expectation of the vector $u_S(S)$. Indeed, the above equation is in fact a Dirichlet distribution in exponential family form, and can be re-written into its usual form to obtain the final update equation:

Appendix D: Expected log of Dirichlet distribution
Definition 7 A probability distribution over x parameterized by µ is said to belong to the exponential family if its probability mass function P(x|µ) can be written as:
$$P(x|\mu) = h(x) \exp\big(\mu \cdot T(x) - A(\mu)\big),$$
where h(x) is the base measure, µ is the vector of natural parameters, T(x) is the vector of sufficient statistics, and A(µ) is the log partition.
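Lemma 8 The log partition is given by:
$$A(\mu) = \ln \int h(x) \exp\big(\mu \cdot T(x)\big) \, dx.$$
Proof Starting with the fact that P(x|µ) integrates to one:
$$\int h(x) \exp\big(\mu \cdot T(x) - A(\mu)\big) \, dx = 1 \ \Leftrightarrow \ \exp\big(A(\mu)\big) = \int h(x) \exp\big(\mu \cdot T(x)\big) \, dx \ \Leftrightarrow \ A(\mu) = \ln \int h(x) \exp\big(\mu \cdot T(x)\big) \, dx.$$

Lemma 9 The gradient of the log partition function is the expectation of the sufficient statistics, i.e.,
$$\frac{\partial A(\mu)}{\partial \mu} = \mathbb{E}_{P(x|\mu)}\big[T(x)\big].$$
Proof Starting with the derivative of the result of Lemma 8:
$$\frac{\partial A(\mu)}{\partial \mu} = \frac{\partial}{\partial \mu} \ln \int h(x) \exp\big(\mu \cdot T(x)\big) \, dx,$$
and using the chain rule:
$$\frac{\partial A(\mu)}{\partial \mu} = \frac{1}{\int h(x) \exp\big(\mu \cdot T(x)\big) \, dx} \, \frac{\partial}{\partial \mu} \int h(x) \exp\big(\mu \cdot T(x)\big) \, dx.$$
Note that the denominator of the first term is equal to the exponential of A(µ), and we can swap the derivative and the integral because the limits of integration do not depend on the parameters µ:
$$\frac{\partial A(\mu)}{\partial \mu} = \exp\big(-A(\mu)\big) \int h(x) \, \frac{\partial}{\partial \mu} \exp\big(\mu \cdot T(x)\big) \, dx.$$
Using the chain rule again:
$$\frac{\partial A(\mu)}{\partial \mu} = \int h(x) \exp\big(\mu \cdot T(x) - A(\mu)\big) \, T(x) \, dx = \mathbb{E}_{P(x|\mu)}\big[T(x)\big].$$

Theorem 10 If D is distributed according to a Dirichlet distribution $Q(D) = \text{Dir}(D; \hat{d})$, then:
$$\mathbb{E}_{Q(D)}\big[\ln D_i\big] = \psi(\hat{d}_i) - \psi\Big(\sum_k \hat{d}_k\Big).$$
Proof Let µ be equal to $\hat{d} - 1$. Taking the exponential of both sides in Equation 96 and using that $\hat{d} = \mu + 1$, we obtain:
$$\text{Dir}(D; \hat{d}) = \exp\big(\mu \cdot T(D) - A(\mu)\big), \quad \text{with} \quad T(D) = \ln D \quad \text{and} \quad A(\mu) = \ln B(\mu + 1),$$
where µ is the vector of natural parameters, T(D) is the vector of sufficient statistics, A(µ) is the log partition, and B(•) is the beta function. Using the result of Lemma 9:
$$\mathbb{E}_{Q(D)}\big[\ln D\big] = \frac{\partial A(\mu)}{\partial \mu} = \frac{\partial \ln B(\mu + 1)}{\partial \mu}.$$
We now focus on a typical element of this vector:
$$\mathbb{E}_{Q(D)}\big[\ln D_i\big] = \frac{\partial \ln B(\mu + 1)}{\partial \mu_i},$$
and use the definition of the beta function:
$$\mathbb{E}_{Q(D)}\big[\ln D_i\big] = \frac{\partial}{\partial \mu_i} \Big[ \sum_k \ln \Gamma(\mu_k + 1) - \ln \Gamma\Big(\sum_k (\mu_k + 1)\Big) \Big],$$
where Γ(•) is the gamma function. The last step relies on the chain rule:
$$\mathbb{E}_{Q(D)}\big[\ln D_i\big] = \psi(\mu_i + 1) - \psi\Big(\sum_k (\mu_k + 1)\Big) = \psi(\hat{d}_i) - \psi\Big(\sum_k \hat{d}_k\Big),$$
where we used that $\hat{d} = \mu + 1$ and the definition of the digamma function, i.e., $\psi(x) = \frac{\partial \ln \Gamma(x)}{\partial x}$.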
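Theorem 10 can be sanity-checked numerically. The sketch below is only an illustration under stated assumptions: the digamma function is approximated by a central finite difference of `math.lgamma` (the standard library has no digamma), Dirichlet samples are drawn by normalising independent gamma variates, and all function names are hypothetical.

```python
import math
import random


def digamma(x, h=1e-5):
    """Central finite-difference approximation of d/dx ln Gamma(x)."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)


def expected_log_dirichlet(d_hat):
    """Closed form of Theorem 10: E[ln D_i] = psi(d_i) - psi(sum_k d_k)."""
    total = digamma(sum(d_hat))
    return [digamma(d) - total for d in d_hat]


def sample_dirichlet(d_hat, rng):
    """Draw from Dir(d_hat) by normalising independent Gamma(d_i, 1) draws."""
    g = [rng.gammavariate(d, 1.0) for d in d_hat]
    s = sum(g)
    return [x / s for x in g]


def monte_carlo_expected_log(d_hat, n_samples=20000, seed=0):
    """Estimate E[ln D_i] by averaging ln D_i over Dirichlet samples."""
    rng = random.Random(seed)
    sums = [0.0] * len(d_hat)
    for _ in range(n_samples):
        for i, x in enumerate(sample_dirichlet(d_hat, rng)):
            sums[i] += math.log(x)
    return [s / n_samples for s in sums]


d_hat = [2.0, 3.0, 5.0]  # arbitrary posterior parameters for the example
print(expected_log_dirichlet(d_hat))    # closed form
print(monte_carlo_expected_log(d_hat))  # sampling estimate, should be close
```

The two printed vectors should agree up to Monte Carlo error, which shrinks as `n_samples` grows.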

Appendix E: Relationship between BTAI and active inference (Lemmas)
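Lemma 11 Under the assumption that the probability of observations and states are independent of future actions, i.e., $\forall j \in \mathbb{N}_{>0}$, $Q(O_{t+i}|\pi_{i+j}) \approx Q(O_{t+i}|\pi_i)$ and $\forall j \in \mathbb{N}_{>0}$, $Q(S_{t+i}|\pi_{i+j}) \approx Q(S_{t+i}|\pi_i)$, then:
Proof The proof is straightforward, we start with the following definition: Then, using the assumption that the probability of observations and states are independent of future actions, i.e., $\forall j \in \mathbb{N}_{>0}$, $Q(O_{t+i}|\pi_{i+j}) \approx Q(O_{t+i}|\pi_i)$ and $\forall j \in \mathbb{N}_{>0}$, $Q(S_{t+i}|\pi_{i+j}) \approx Q(S_{t+i}|\pi_i)$, we get:

Lemma 12 For all $N \in \mathbb{N}_{>0}$:
$$G^{aggre}_{\pi_N} = \sum_{i=1}^{N} G(\pi_i, t + i).$$
Proof The proof is done by induction. The initialisation holds for N = 1, indeed, $\pi_1 = \{U_t\}$ and by definition:
$$G^{aggre}_{\pi_1} = G^{aggre}_{\pi_0} + G^{local}_{\pi_1} = G(\pi_1, t + 1),$$
because by definition $G^{aggre}_{\pi_0} = 0$. Then, assuming that $G^{aggre}_{\pi_N} = \sum_{i=1}^{N} G(\pi_i, t + i)$ holds for some N, we show that it holds for N + 1 as well. By definition:
$$G^{aggre}_{\pi_{N+1}} = G^{aggre}_{\pi_N} + G^{local}_{\pi_{N+1}},$$
and:
$$G^{local}_{\pi_{N+1}} = G(\pi_{N+1}, t + N + 1).$$
Using the inductive hypothesis and the above two equations:
$$G^{aggre}_{\pi_{N+1}} = \sum_{i=1}^{N} G(\pi_i, t + i) + G(\pi_{N+1}, t + N + 1) = \sum_{i=1}^{N+1} G(\pi_i, t + i).$$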

Appendix F: Notation
In this appendix, we introduce the notation used throughout this paper. The following sub-sections describe the notation related to sets of numbers, tensors, probability distributions, global variables, multi-indices and random variables, respectively.

Sets of numbers
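Definition 13 Let $\mathbb{N}_{>0}$ be the set of all strictly positive integers defined as:
$$\mathbb{N}_{>0} = \{n \in \mathbb{N} \mid n > 0\},$$
where $\mathbb{N}$ is the set of all natural numbers.
Definition 14 Let $\mathbb{R}_{>0}$ be the set of all strictly positive real numbers defined as:
$$\mathbb{R}_{>0} = \{x \in \mathbb{R} \mid x > 0\},$$
where $\mathbb{R}$ is the set of all real numbers.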
Definition 16 Let T be an n-tensor. The element of T indexed by the n-tuple (x 1 , ..., x n ) is a real number denoted by T (x 1 , ..., x n ).
Remark 17 A 0-tensor is a scalar, a 1-tensor is a vector, and a 2-tensor is a matrix.
Definition 18 Let T be an n-tensor. The size of T is a vector of size n denoted |T | whose i-th element corresponds to the size of the i-th dimension of T .
slice of T, i.e.,
$$W = T(x_1, \dots, x_{i-1}, \bullet, x_{i+1}, \dots, x_n),$$
where • represents the selection of a 1-dimensional slice of T, and the values of all $x_{j \neq i}$ must be set to specific values in $\{1, ..., |T|_j\}$. Figure 9 (left) illustrates the notion of a 1-dimensional slice.
Definition 20 Let T be an n-tensor and m < n. An m-sub-tensor of T (denoted W) is an m-tensor obtained by selecting an m-dimensional slice of T, i.e.,
$$W = T(x_1, \dots, x_{i_1 - 1}, \bullet, x_{i_1 + 1}, \dots\dots, x_{i_m - 1}, \bullet, x_{i_m + 1}, \dots, x_n),$$
where $i_k \in \{1, ..., n\}$ ∀k ∈ {1, ..., m} are indices representing the dimensions being selected. Naturally, the k-th dimension of W corresponds to the $i_k$-th dimension of T for k ∈ {1, ..., m}. Importantly, the sextuple of dots in the middle of the expression indicates that there will be m symbols "•", i.e., one for each dimension selected.
Example 3 Let T be a 3-tensor such that: Then T(1, •, •) is a 2-sub-tensor of T full of ones, and T(2, •, •) is a 2-sub-tensor of T full of twos.
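The sub-tensor notation can be mimicked with nested lists, as in the following hedged sketch. The function `sub_tensor` is a hypothetical helper in which `None` plays the role of the symbol "•", and the tensor used here, with T(i, j, k) = i, is an assumed instance chosen only so that it behaves as described in Example 3.

```python
def sub_tensor(T, index):
    """Select a sub-tensor of a nested-list tensor T.

    `index` has one entry per dimension of T: a 1-based integer fixes that
    dimension, while None (standing for the bullet) keeps it.
    """
    if not index:
        return T
    head, rest = index[0], index[1:]
    if head is None:
        return [sub_tensor(sub, rest) for sub in T]
    return sub_tensor(T[head - 1], rest)


# Assumed 2 x 2 x 3 tensor with T(i, j, k) = i.
T = [[[i] * 3 for _ in range(2)] for i in (1, 2)]

print(sub_tensor(T, (1, None, None)))  # 2-sub-tensor full of ones
print(sub_tensor(T, (2, None, None)))  # 2-sub-tensor full of twos
print(sub_tensor(T, (2, 1, None)))     # 1-sub-tensor, i.e. a vector of size 3
```

The number of `None` entries in the index equals the number of dimensions of the selected sub-tensor.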

Probability distributions
Definition 21 A random n-tensor is an n-tensor over which we have an n-dimensional probability distribution.
Remark 22 A random variable is a random 0-tensor, a random vector is a random 1-tensor, and a random matrix is a random 2-tensor.
Definition 23 An n-tensor T is said to represent a joint distribution over a set of n random variables {X 1 , ..., X n } if: For conciseness, if T represents P (X 1 , ..., X n ) we let: Remark 24 If T represents P (X 1 , ..., X n ), then the sum of its elements must equal one.
Remark 25 In contrast, if T is a random n-tensor, then: which means that the joint probability over X 1 , ..., X n is represented by T , and because T is a random tensor (taking values in the set of valid n-tensors T n , i.e., the set of all n-tensors whose elements sum up to one), we must specify which instance of T ∈ T n should be used to define the joint distribution over X 1 , ..., X n .
Remark 29 Importantly, definition 23 uses the symbol T to represent P(X 1 , ..., X n ) and definition 26 uses the symbol R to represent P(X 1 , ..., X n |X n+1 , ..., X m ). Throughout this document, different symbols will be used for representing joint and conditional distributions.
Definition 30 Let R be a random m-tensor representing P(X 1 , ..., X n |X n+1 , ..., X m , R) and k = m − n be the number of variables upon which the variables X 1 , ..., X n are conditioned. Having a Dirichlet prior over R means that: where r is an (m + 1)-tensor such that the 1-sub-tensor r(i 1 , ..., i k , • ) contains the parameters of the Dirichlet prior over P(X 1 , ..., X n |X n+1 = i 1 , ..., X m = i k ). For conciseness, we denote the prior over R as:
$$P(R) = \text{Dir}(r).$$
Remark 31 Definition 30 implicitly means that if V is a 1-tensor then Dir(V) represents a Dirichlet distribution. However, if V is an m-tensor (with m ≠ 1) then Dir(V) represents a product of Dirichlet distributions.
Remark 32 If R is a random m-tensor representing P (X 1 , ..., X n |X n+1 , ..., X m , R), then its prior will be a product of |X n+1 | × ... × |X m | Dirichlet distributions, where |X i | is the number of values that X i can take. Additionally, each Dirichlet distribution will have |X 1 | × ... × |X n | parameters stored in the last dimension of r, where r is the tensor storing the parameters of the prior over R, i.e. P (R) = Dir(r).
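The counting argument of Remark 32 can be written down in a few lines. This is a minimal sketch; the function name `dirichlet_prior_layout` and the example sizes are assumptions made for illustration.

```python
from math import prod


def dirichlet_prior_layout(target_sizes, conditioning_sizes):
    """Return (number of Dirichlet factors, number of parameters per factor).

    target_sizes:       [|X_1|, ..., |X_n|], the variables being predicted
    conditioning_sizes: [|X_{n+1}|, ..., |X_m|], the conditioning variables
    """
    return prod(conditioning_sizes), prod(target_sizes)


# Transition tensor P(S_{tau+1} | S_tau, U_tau) with |S| = 4 and |U| = 3:
# one Dirichlet per (state, action) pair, each with |S| parameters.
print(dirichlet_prior_layout([4], [4, 3]))

# Likelihood matrix P(O | S) with |O| = 3 and |S| = 4.
print(dirichlet_prior_layout([3], [4]))
```

With no conditioning variables the product over an empty list is one, which matches the case of a single Dirichlet distribution over a joint distribution.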
Example 4 Let A be a random 2-tensor representing P (O|S, A), then the Dirichlet prior over A is given by: where |A| 2 is the number of values that S can take (i.e., the number of hidden states). by: where |B| 2 is the number of values that S τ can take (i.e., the number of hidden states) and |B| 3 is the number of values that U τ can take (i.e., the number of actions).

Global labels
Definition 33 The number of actions available to the agent is denoted |U |.
Definition 34 The number of states in the environment is denoted |S|.

Definition 35 The number of observations that the agent can make is denoted |O|.
Definition 36 The number of policies that the agent can pick from is denoted |π|.
Definition 37 The time point representing the present is a natural number denoted t.
Definition 38 The time-horizon (i.e., the time point after which the agent stops modelling the sequence of hidden states) is a natural number denoted T.
Remark 40 Multi-indices are used to index random variables such that S I is the hidden state obtained after taking the sequence of actions described by I, and O I is the random variable representing the observation generated by S I .

Definition 41 The last index of a multi-index is denoted I last , i.e., I last is the last element of the sequence I.
Definition 42 The one-hot representation of the action corresponding to I last is denoted I last .
Definition 43 Given a multi-index I, I \ last corresponds to the sequence of actions described by I without the last element.
Remark 44 In Section 6, when a hidden state (i.e., S I ) is indexed by I, then S I\last will be the parent of S I .

Definition 45 Given an expandable generative model, I t is the set of all multi-indices already expanded from the current state S t .
Remark 46 In Section 6 each time a hidden state (i.e., S I ) is added to the generative model, I is added to the set of all multi-indices already expanded I t .

Random variables and parameters of their distributions
Remark 47 Parameters of the posterior distributions are recognizable by the hat notation, e.g., $\hat{a}$, $\hat{b}$ and $\hat{d}$ will be posterior parameters, while a, b and d will be prior parameters.

Remark 48 The expected logarithm of an arbitrary tensor X representing a conditional or a joint distribution is

The posterior distribution over $S_I$ is a categorical distribution represented by $\hat{D}_I$.
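Definition 54 Let A be a |O| × |S| random matrix defining the probability of an observation $O_\tau$ given the hidden state $S_\tau$. The prior distribution over A is a product of Dirichlet distributions whose parameters are stored in a |S| × |O| matrix a. The posterior distribution over A is also a product of Dirichlet distributions, but the parameters are stored in a |S| × |O| matrix $\hat{a}$.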
Definition 55 Let B be a |S| × |S| × |U| random 3-tensor defining the probability of transiting from $S_\tau$ to $S_{\tau+1}$ when taking action $U_\tau$. The prior distribution over B is a product of Dirichlet distributions whose parameters are stored in a |S| × |U| × |S| 3-tensor b. The posterior distribution over B is also a product of Dirichlet distributions, but the parameters are stored in a |S| × |U| × |S| 3-tensor $\hat{b}$.
Definition 56 Let D be a random vector of size |S| defining the probability of the initial state $S_0$. The prior distribution over D is a Dirichlet distribution whose parameters are stored in a vector d of size |S|. The posterior distribution over D is also a Dirichlet distribution, but the parameters are stored in a vector $\hat{d}$ of size |S|.
The prior distribution over $\Theta_\tau$ is a Dirichlet distribution whose parameters are stored in a vector $\theta_\tau$ of size |U|.
The posterior distribution over $\Theta_\tau$ is also a Dirichlet distribution, but the parameters are stored in a vector $\hat{\theta}_\tau$ of size |U|.
Definition 58 Let γ be a random variable taking values in $\mathbb{R}_{>0}$. The prior distribution over γ is a gamma distribution with shape parameter α = 1 and rate parameter $\beta \in \mathbb{R}_{>0}$. The posterior distribution over γ is a gamma distribution with shape parameter $\hat{\alpha} = 1$ and rate parameter $\hat{\beta} \in \mathbb{R}_{>0}$.
Definition 59 Let π be a random variable taking values in {1, ..., |π|} indexing all possible policies. The prior distribution over π is a softmax function of the vector G multiplied by minus the precision γ. Note that G is a vector of size |π| whose i-th element is the expected free energy of the i-th policy. The posterior distribution over π is a categorical distribution whose parameters are stored in a vector $\hat{\pi}$ of size |π|.
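As an illustration of the prior over policies in Definition 59, the following sketch computes σ(−γG) for a small vector of expected free energies. The values of G and γ are arbitrary, and the function names are hypothetical.

```python
import math


def softmax(x):
    """Numerically stable softmax of a list of real numbers."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    z = sum(exps)
    return [e / z for e in exps]


def policy_prior(G, gamma):
    """Prior over policies: softmax of the expected free energy times -gamma."""
    return softmax([-gamma * g for g in G])


G = [2.0, 1.0, 3.0]  # assumed expected free energy of three policies
print(policy_prior(G, gamma=1.0))  # the second policy is the most probable
print(policy_prior(G, gamma=8.0))  # higher precision gives a sharper prior
print(policy_prior(G, gamma=0.0))  # zero precision gives a uniform prior
```

Policies with a lower expected free energy receive a higher prior probability, and the precision γ controls how deterministic the resulting distribution is.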

Appendix G: Multi-armed bandit problem
In the multi-armed bandit problem, the agent is prompted with K actions (one for each bandit's arm). Pulling the i-th arm returns a reward sampled from the reward distribution $P_i(X)$ associated with this arm. Let $\mu_i$ be the mean of the i-th reward distribution and $T_i(n)$ be the number of times the i-th arm has been selected after n plays. To solve the bandit problem, one needs to come up with an allocation strategy that minimises the agent's regret, defined as:
$$R_n = \mu^* n - \sum_{i=1}^{K} \mu_i \, \mathbb{E}[T_i(n)],$$
where $\mu^*$ is the average reward of the best action. Note that an upper bound of $\mathbb{E}[T_i(n)]$ is derived by first upper bounding $T_i(n)$, and then using the Chernoff-Hoeffding bound, the Bernstein inequality and some properties of p-series, cf. the proof of Theorem 1 in Auer et al (2002) for details. Let the UCB1 criterion be defined as:
$$UCB1_i = \bar{X}_i + \sqrt{\frac{2 \ln n}{n_i}},$$
where $n_i$ is the number of times the i-th action has been selected, and $\bar{X}_i$ is the average reward received after taking the i-th action. The main result of Auer et al (2002) was to show that if an allocation strategy uses the UCB1 criterion to select the next action, then its expected regret grows at most logarithmically in the number of plays n, i.e., O(ln n). In addition, since it is known that the expected regret of the (best) allocation strategy grows at least logarithmically in n (Lai and Robbins, 1985), we say that the UCB1 criterion resolves the exploration / exploitation trade-off, i.e., the UCB1 criterion ensures that the expected regret grows as slowly as possible.
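The allocation strategy discussed in this appendix can be sketched as follows. This is a minimal simulation under stated assumptions: the arms return Bernoulli rewards with made-up success probabilities, every arm is played once before the criterion is used, and the names `ucb1_scores`, `run_ucb1` and `pseudo_regret` are hypothetical.

```python
import math
import random


def ucb1_scores(means, counts, n):
    """UCB1 criterion: average reward plus sqrt(2 ln n / n_i) for each arm."""
    return [m + math.sqrt(2.0 * math.log(n) / c) for m, c in zip(means, counts)]


def run_ucb1(success_probs, n_plays, seed=0):
    """Play Bernoulli arms with UCB1 and return how often each arm was pulled."""
    rng = random.Random(seed)
    k = len(success_probs)
    counts = [0] * k
    means = [0.0] * k
    for n in range(1, n_plays + 1):
        if n <= k:
            arm = n - 1  # initialisation: play every arm once
        else:
            scores = ucb1_scores(means, counts, n - 1)  # n - 1 plays so far
            arm = max(range(k), key=scores.__getitem__)
        reward = 1.0 if rng.random() < success_probs[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running average
    return counts


def pseudo_regret(success_probs, counts):
    """Regret computed from the realised pull counts instead of E[T_i(n)]."""
    best = max(success_probs)
    return sum((best - p) * c for p, c in zip(success_probs, counts))


probs = [0.2, 0.5, 0.8]  # assumed mean rewards of three arms
counts = run_ucb1(probs, n_plays=2000)
print(counts)
print(pseudo_regret(probs, counts))
```

The exploration bonus shrinks as an arm is pulled more often, so sub-optimal arms are selected less and less frequently as the number of plays grows.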
Appendix H: The exponential complexity class
In this appendix, we precisely pinpoint the exponential complexity class that is addressed in this paper, but first, we introduce a multi-index notation. Multi-indices will help us to refer to hidden states in the future. Naturally enough, the indices inside the multi-indices will correspond to the actions the agent will have to perform to reach the hidden state, e.g., $S_{(123)}$ corresponds to a hidden state at time t + 3 obtained by performing action 1 at time t, 2 at time t + 1 and 3 at time t + 2. Using this notation, Figure 10 depicts all the possible policies up to two time steps in the future and the associated hidden states. Importantly, Figure 10 shows that the number of policies grows exponentially with the number of time steps for which the agent tries to plan. Therefore, the definition of the prior over the policies, i.e., P(π|γ) = σ(−γG), exhibits an exponential space and time complexity class because the agent needs to store and compute the $|U|^T$ parameters of P(π|γ), where T is the time-horizon.

Figure 10: The state at the current time step is denoted by $S_t$. Additionally, each branch of the tree corresponds to a possible policy, and each node $S_I$ is indexed by a multi-index (e.g. I = (12)) representing the sequence of actions that led to this state. This should make it clear that for one time step in the future, there are |U| possible policies, after two time steps there are |U| times more policies, and so on until the time-horizon T where there are a total of $|U|^T$ possible policies, i.e., the number of possible policies grows exponentially with the number of time steps for which the agent tries to plan.
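The growth described above can be reproduced by enumerating every multi-index explicitly. The sketch below is only illustrative: `enumerate_policies` is a hypothetical helper, and the choice of four actions in the second part is an arbitrary assumption.

```python
from itertools import product


def enumerate_policies(n_actions, horizon):
    """List every sequence of actions (multi-index) of length `horizon`."""
    actions = range(1, n_actions + 1)
    return list(product(actions, repeat=horizon))


# With |U| = 2 and two time steps, the leaves are S_(11), S_(12), S_(21), S_(22).
policies = enumerate_policies(n_actions=2, horizon=2)
print(policies)

# Number of policies |U|^T for an assumed |U| = 4 and T from 1 to 6.
for horizon in range(1, 7):
    print(horizon, len(enumerate_policies(4, horizon)))
```

Storing one parameter per policy therefore requires memory that grows by a factor of |U| with each additional time step.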
To show that this exponential explosion is not only a theoretical problem but also appears in practice, we measured, using MATLAB's tic and toc, the time required to execute the function "spm_maze_search" when the agent is allowed to plan N time steps in the future. Figure 11 presents the results of our simulations for N from 2 to 6. With a logarithmic scale on the time axis, the experimental results fall almost perfectly on a straight line, which provides empirical evidence for an exponential time explosion. Note that the simulation for N = 7 crashed after trying to allocate an array of 9.5GB (space explosion). In section 5, we presented an approach that deals with the exponential complexity class arising during planning, yet remains fundamentally similar to active inference. This effectively means that our paper shows how standard active inference can be made more efficient and scale to longer time horizons.

Figure 11: This figure shows the time required to execute the function "spm_maze_search" when the agent is allowed to plan N time steps in the future (for N from 2 to 6). A logarithmic scale is used on the time axis.