Canonical Cortical Circuits and the Duality of Bayesian Inference and Optimal Control

The duality of sensory inference and motor control has been known since the 1960s and has recently been recognized as a commonality between the computations required for posterior distributions in Bayesian inference and those for value functions in optimal control. Meanwhile, an intriguing question about the brain is why the entire neocortex shares a canonical six-layer architecture while its posterior and anterior halves are engaged in sensory processing and motor control, respectively. Here we consider the hypothesis that the sensory and motor cortical circuits implement the dual computations for Bayesian inference and optimal control, or for perceptual and value-based decision making, respectively. We first review the classic duality of inference and control in linear quadratic systems and then review the correspondence between dynamic Bayesian inference and optimal control. Based on the architecture of the canonical cortical circuit, we explore how different cortical neurons may represent variables and implement computations.


Introduction
Sensory perception and motor control are the most fundamental functions of the brain.
Although they are often studied separately, sensory perception and motor control depend on each other, which calls for an integrated approach. In addition to composing a sensorimotor loop, the computations for sensory inference and optimal control have been shown to be similar. Rudolf Kalman [1] showed that the equations used for optimal sensory inference by the Kalman filter have the same structure as the equations used for optimal motor control by a linear quadratic regulator (LQR) [2]. This is known as Kalman's duality. More recently, researchers in reinforcement learning discovered more general correspondences between the computations for the posterior distribution in dynamic Bayesian inference and the value function in reinforcement learning [3][4][5][6][7][8][9][10][11]. This notion has created a research strategy called control as inference, or reinforcement learning as inference, which provides novel mathematical insights and helps the development of new reinforcement learning algorithms.
Regarding the brain's architecture for sensory perception and motor control, the most fundamental division is between the posterior half of the cerebral cortex, which is mostly involved in sensory perception, and the anterior half, which is mostly involved in motor control, planning, and decision making. An interesting unanswered, or unasked, question is: Why does the entire cerebral cortex share the same canonical circuit architecture [12,13], characterized by a common six-layer structure with specific types of neurons and connections, for implementing both sensory inference and motor control?
Here we consider the hypothesis that the sensory and motor cortical circuits evolved to implement the dual computations for sensory inference and optimal control, or for perceptual and value-based decision making, respectively. We first review the classic duality of inference and control in linear quadratic systems, known as Kalman's duality, and then review a more general correspondence between dynamic Bayesian inference and optimal control. We then explore how different types of cortical neurons in the sensory and motor cortices may represent different variables and what cortical dynamics may realize the required computations. We further discuss what experimental and computational approaches are required for scrutinizing this dual cortical circuit hypothesis.

The Duality of Control and Inference
The Kalman filter [1] is the standard method for keeping track of a signal of interest despite noisy observations, based on the assumption of linear dynamics and Gaussian noise (Box 1).
Kalman pointed out that the set of equations for updating the estimates of the mean and the covariance of the state variable has the same structure as the equations for optimal control of a linear dynamical system with Gaussian noise [2]. This is known as Kalman's duality. While Kalman's duality has been documented in textbooks of control theory and signal processing as a matter of mathematical beauty, researchers in reinforcement learning have recently found that the relationship extends beyond linear-Gaussian cases and have developed novel reinforcement learning and control algorithms based on this notion. Emanuel Todorov pointed out the general duality between the computations needed for posterior distributions in dynamic Bayesian inference and those for value functions in optimal control and reinforcement learning (Box 2) [7][8][9]. A similar correspondence was formulated by other researchers as well [3][4][5][6][10].
Based on this notion, Todorov realized that, by defining the cost of action as the Kullback-Leibler divergence of the controlled state transition probability from that of 'passive dynamics,' the exponentiated state value function can be computed by solving a linear equation, which drastically reduces the required data and computation and enables compositionality of value functions for different goal rewards [14,15]. A similar approach was also derived by Kappen [3,4].
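To make this concrete, here is a minimal sketch of the linear computation of the exponentiated value (the 'desirability' function) for a finite-horizon problem; the random passive dynamics and state costs are made-up illustrations, not taken from [14,15]:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, horizon = 5, 20

q = rng.uniform(0.0, 1.0, n_states)                 # state cost rate (illustrative)
P_passive = rng.uniform(size=(n_states, n_states))  # passive dynamics p(x'|x)
P_passive /= P_passive.sum(axis=1, keepdims=True)

# With the action cost defined as KL(controlled dynamics || passive dynamics),
# the Bellman equation for z_t(x) = exp(-V_t(x)) becomes LINEAR:
#   z_t = exp(-q) * (P_passive @ z_{t+1})
z = np.exp(-q)                                      # terminal condition z_T
for t in range(horizon):
    z = np.exp(-q) * (P_passive @ z)

V = -np.log(z)                                      # value function (cost-to-go)
# The optimal controlled dynamics simply reweight the passive dynamics by z:
P_opt = P_passive * z[None, :]
P_opt /= P_opt.sum(axis=1, keepdims=True)
```

Because the backup is linear in z, desirability functions for different goal rewards can be superposed, which is the compositionality noted above.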
Most recently, Sergey Levine reviewed these works and formulated a probabilistic graphical model (PGM) containing an optimality variable, which takes the value 1 if a state-action pair is optimal (Figure 2B) [11]. By assuming that the optimality variable follows a probability given by the exponential of the reward function, a standard message-passing algorithm for the PGM turns into the update equations for the state and action value functions. For this conversion to hold, the objective function should include a regularization term for the entropy of the action policy, which is known as maximum-entropy reinforcement learning [16]. Table 2 summarizes the correspondence between the components of Bayesian inference and optimal control. The framework presents a unified theoretical basis for efficient and robust reinforcement learning algorithms, such as the soft actor-critic [17], and is expected to promote the derivation of novel algorithms.
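The resulting update can be sketched as a 'soft' value iteration in which a log-sum-exp replaces the hard max of standard value iteration (a minimal tabular illustration with a random MDP and unit temperature; this is not the soft actor-critic algorithm itself):

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 4, 3, 0.95

# Random transition model p(x'|x,u) and reward r(x,u), for illustration only
P = rng.uniform(size=(nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)
R = rng.uniform(-1.0, 1.0, (nS, nA))

V = np.zeros(nS)
for _ in range(500):
    Q = R + gamma * (P @ V)               # soft action value: r + E[V(x')]
    Qmax = Q.max(axis=1, keepdims=True)   # stabilized log-sum-exp
    V = (Qmax + np.log(np.exp(Q - Qmax).sum(axis=1, keepdims=True))).ravel()

pi = np.exp(Q - V[:, None])               # max-entropy policy pi(u|x) = exp(Q - V)
```

The entropy regularization mentioned above is what turns the hard max into this log-sum-exp.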

Canonical Cortical Circuits
The cerebral neocortex has a common six-layer architecture, known as the canonical cortical circuit (Figure 3) [12,13,18]. While most studies have focused on the sensory cortex, the architecture of the motor cortex, with its thalamic inputs originating from the cerebellum and the basal ganglia, has also been worked out [19]. A marked difference between the sensory and motor cortices is the thickness of layer 4, which is densely populated by excitatory stellate cells in the sensory cortex. Despite quantitative differences across areas, the basic architecture is preserved: layer 4 receives bottom-up thalamic and cortical input and projects to layers 2/3, where neurons have dense recurrent connections. Layer 2/3 pyramidal neurons project to higher cortical areas and also send output to layers 5/6. Layer 5 pyramidal neurons project to the cerebellum and the basal ganglia, and layer 6 pyramidal neurons project to the thalamus and lower cortical areas.

Dual Cortical Computation Hypothesis
Considered together, the duality of Bayesian inference and optimal control and the canonical cortical circuits in the sensory and motor areas suggest that common computations for inference and control are implemented in the common architecture of the neural circuits in the sensory and motor cortices, or the posterior and anterior halves of the cerebral cortex. In dynamic Bayesian inference, layer 4 neurons may represent the likelihood p(y_t | x_t) of the sensory observation, and layer 2/3 neurons, whose recurrent connections can implement the state transition model, may combine the likelihood with the prediction to represent a surprise signal, which is sent to higher cortical areas. This information is also sent to layers 5/6 to update the posterior probability p(x_t | y_1, ..., y_t). These computations may also be conditional on a top-down contextual signal, including the executed action u_{t-1}, from higher cortical areas.
In optimal control, the reward function r(x, u) and the state value function V(x) correspond to the log likelihood and the log posterior probability in sensory inference, so that they would be represented by layer 4 and layer 5/6 neurons, respectively. The update of the action value function Q(x, u) requires the state transition model p(x_{t+1} | x_t, u_t) and reward information, which are likely to be represented by layer 2/3 neurons. The action policy π(u | x) is computed by subtracting the state value from the action value (see the relation below), so that action may be selected in layer 5 or 6 and sent to lower cortical and subcortical areas. Note that the above is just one hypothetical realization and many other mappings of computational roles onto neurons and connections are conceivable.
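In the maximum-entropy formulation of Box 2, this subtraction has an explicit form (stated here as the standard relation from that framework, not as a claim about cortical coding):

```latex
\pi(u \mid x) = \exp\big(Q(x,u) - V(x)\big),
\qquad
V(x) = \log \sum_{u} \exp Q(x,u)
```

so the state value acts as the normalizer (log-partition function) of the action policy.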
There are many interesting open questions about the cortical implementation of the dual computations for Bayesian inference and optimal control. First, how is the backward computation realized in real time? In the visual cortex, evidence suggests that the alpha rhythm around 10 Hz carries top-down feedback information [33] and underlies multi-modal sensory arbitration [34]. In the motor cortex, the beta rhythm around 20 Hz shows responses before the execution or during the imagination of movements [35,36]. These rhythms might be correlates of periodic execution of the backward computation.
Another important question is how the state transition model p(x_{t+1} | x_t, u_t) and the sensory observation model p(y_t | x_t) are learned, together with the internal representations of the state x and the action u. The roles of the cerebellar and basal ganglia inputs to the motor cortex through the thalamus in learning are also an interesting question [37,38].
Finally, how are the parameters for Bayesian inference and optimal control regulated, such as the time frame of planning and the prior uncertainty of the state dynamics and sensory observation? The roles of neuromodulators, such as serotonin, noradrenaline, and acetylcholine, have been suggested [39][40][41][42][43][44].
Given recent advances in two-photon calcium imaging [45,46] and electrode array recording [47], it is now feasible to test such hypotheses regarding the implementation of Bayesian inference and optimal control by large-scale measurement of neural activities during sensorimotor tasks [38,48,49]. The correspondence between variables and mappings for Bayesian inference and optimal control as depicted in Table 3 may provide a basis for interpreting data from both sensory and motor cortices and coming up with a unified theory of cortical computation.

Conclusion
This article reviewed the duality between sensory inference and motor control and presented a hypothesis that the canonical posterior and anterior cortical circuits perform such dual computations. The author believes that this overarching hypothesis is worthy of experimental testing by utilizing multi-area, multi-layer neural recording technologies. In addition to the basic operations for inference and control, the regulatory mechanisms for such computations and possible malfunctions of such mechanisms would provide a better understanding of the roles of neuromodulators and the causes of psychiatric disorders.

Box 1: Kalman's duality.
The computations for optimal filtering and optimal control under linear Gaussian assumptions reduce to solving matrix equations of the same form [1,7].

Figure 1: A. Kalman filter. We consider a linear discrete-time dynamical system

x_{t+1} = A x_t + B u_t + ω_t    (1)
y_t = C x_t + ν_t    (2)

where x_t is the state, u_t is the action input, y_t is the sensory observation, and ω_t and ν_t are the state and observation noises with covariance matrices S and U, respectively. We aim to estimate the changing state x_1, ..., x_t iteratively from the observations y_1, ..., y_t. We represent the uncertainty of the state by a Gaussian distribution x_t ~ N(μ_t, Σ_t).
With the state transition (1), the mean and the covariance of the state distribution evolve as

μ̂_{t+1} = A μ_t + B u_t    (3)
Σ̂_{t+1} = A Σ_t A' + S    (4)

where A' is the transpose matrix of A. With a new observation y_{t+1}, the distribution is updated by Bayesian inference with the predicted distribution N(μ̂_{t+1}, Σ̂_{t+1}) as the prior and the observation giving the likelihood N(y_{t+1} − C μ̂_{t+1}; 0, U). This leads to an update of the mean of the state distribution in proportion to the prediction error

μ_{t+1} = μ̂_{t+1} + K_{t+1} (y_{t+1} − C μ̂_{t+1})    (5)

where the update gain is given by

K_{t+1} = Σ̂_{t+1} C' (C Σ̂_{t+1} C' + U)^{−1}    (6)

This is called the filter gain, or Kalman gain, which becomes large when the state uncertainty Σ̂_{t+1} is large. The covariance of the state distribution is also updated by the Kalman gain as

Σ_{t+1} = (I − K_{t+1} C) Σ̂_{t+1}    (7)

which generally reduces the uncertainty. Taking (4), (6), and (7) together, the update of the state covariance is given as

Σ̂_{t+1} = A (Σ̂_t − Σ̂_t C' (C Σ̂_t C' + U)^{−1} C Σ̂_t) A' + S    (8)
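The following is a minimal numerical sketch of the recursions (1)-(7), tracking a position-velocity state from noisy position observations; all matrices are made-up examples:

```python
import numpy as np

rng = np.random.default_rng(2)

A = np.array([[1.0, 1.0],
              [0.0, 1.0]])        # state transition (position, velocity)
C = np.array([[1.0, 0.0]])        # observe position only; action input omitted
S = 0.01 * np.eye(2)              # state noise covariance
U = np.array([[0.5]])             # observation noise covariance

mu, Sigma = np.zeros(2), np.eye(2)   # belief N(mu, Sigma)
x = rng.normal(size=2)               # true state

for t in range(50):
    # Simulate the true system (1)-(2), with B u_t = 0
    x = A @ x + rng.multivariate_normal(np.zeros(2), S)
    y = C @ x + rng.multivariate_normal(np.zeros(1), U)

    # Prediction (3)-(4)
    mu_hat = A @ mu
    Sig_hat = A @ Sigma @ A.T + S
    # Kalman gain (6), mean update (5), covariance update (7)
    K = Sig_hat @ C.T @ np.linalg.inv(C @ Sig_hat @ C.T + U)
    mu = mu_hat + K @ (y - C @ mu_hat)
    Sigma = (np.eye(2) - K @ C) @ Sig_hat
```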

B: Linear quadratic regulator (LQR).
Here we consider the same dynamical system (1) and the cost function (negative reward)

c(x_t, u_t) = x_t' Q x_t + u_t' P u_t    (9)

where Q and P are matrices defining the state and action costs [2]. The aim is to minimize the cumulative cost

J = \sum_{t=1}^{T} c(x_t, u_t)    (10)

Under the linear Gaussian assumption, the state value function V(x_t, t) can be represented by a quadratic form of the state vector and a matrix V_t as

V(x_t, t) = x_t' V_t x_t    (11)

The optimal action is given by the Bellman equation

V(x_t, t) = min_{u_t} [ x_t' Q x_t + u_t' P u_t + V(x_{t+1}, t+1) ]    (12)

The solution leads to a feedback control law

u_t = −K_t x_t    (13)

where the feedback control gain is given by

K_t = (P + B' V_{t+1} B)^{−1} B' V_{t+1} A    (14)

The matrix V_t for the value function is computed backward in time with the terminal condition V_T = Q as

V_t = Q + A' V_{t+1} (A − B K_t)    (15)

From (14) and (15), the update equation for V_t is

V_t = A' (V_{t+1} − V_{t+1} B (P + B' V_{t+1} B)^{−1} B' V_{t+1}) A + Q    (16)

which has a similar form to equation (8) for the Kalman filter.
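The similarity between (16) and (8) can be checked numerically: running the Kalman covariance recursion (8) with the dual substitutions A → A', C → B', S → Q, U → P reproduces the LQR recursion (16) exactly. A sketch with arbitrary made-up matrices:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 3, 2
A = 0.5 * rng.normal(size=(n, n))
B = rng.normal(size=(n, m))
Q, P = np.eye(n), np.eye(m)       # state and action costs

def lqr_step(V, A, B, Q, P):
    """One backward value-matrix step, equation (16)."""
    G = np.linalg.inv(P + B.T @ V @ B)
    return A.T @ (V - V @ B @ G @ B.T @ V) @ A + Q

def kalman_step(Sig, A, C, S, U):
    """One forward covariance step, equation (8)."""
    G = np.linalg.inv(C @ Sig @ C.T + U)
    return A @ (Sig - Sig @ C.T @ G @ C @ Sig) @ A.T + S

V, Sig = Q.copy(), Q.copy()
for _ in range(10):
    V = lqr_step(V, A, B, Q, P)              # backward in time
    Sig = kalman_step(Sig, A.T, B.T, Q, P)   # dual substitutions

print(np.allclose(V, Sig))                   # True: the recursions coincide
```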
The table below summarizes the correspondence between the Kalman filter and the LQR [1,7,8].

Kalman filter (forward in time)   |  LQR (backward in time)
state covariance Σ̂_t              |  value matrix V_t
dynamics matrix A                  |  transposed dynamics matrix A'
observation matrix C'              |  input matrix B
state noise covariance S           |  state cost Q
observation noise covariance U     |  action cost P

Instead of just a formal similarity, there is a meaningful correspondence between dynamic Bayesian inference and optimal control or reinforcement learning [7][8][9].

Figure 2: A. Dynamic Bayesian inference.
Here we consider the stochastic state dynamics p(x_{t+1} | x_t, u_t) and the sensory observation p(y_t | x_t). In active perception, the aim is to estimate the state trajectory behind a series of actions and observations as p(x_1, ..., x_T | y_1, ..., y_T, u_1, ..., u_T). In observation learning, the sequence of actions, such as muscle forces, is estimated from observations as p(u_1, ..., u_T | y_1, ..., y_T). Here we outline the simplest case involving no action, p(x_{t+1} | x_t). This is an example of a hidden Markov model (HMM), and a standard algorithm for solving the problem is the forward-backward algorithm [50,51]. We define the forward message as the joint probability of the observations up to time t and the state x_t

α_t(x_t) = p(y_1, ..., y_t, x_t)

and the backward message as the conditional probability of the observations after time t given the state x_t

β_t(x_t) = p(y_{t+1}, ..., y_T | x_t)

The forward and backward messages are iteratively computed as

α_{t+1}(x_{t+1}) = p(y_{t+1} | x_{t+1}) \sum_{x_t} p(x_{t+1} | x_t) α_t(x_t)
β_t(x_t) = \sum_{x_{t+1}} p(y_{t+1} | x_{t+1}) p(x_{t+1} | x_t) β_{t+1}(x_{t+1})

The posterior probability is then computed from the product of the two messages:

p(x_t | y_1, ..., y_T) ∝ α_t(x_t) β_t(x_t)
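A minimal sketch of these recursions for a discrete HMM (random model for illustration; messages are left unnormalized for clarity, so this is only suitable for short sequences):

```python
import numpy as np

rng = np.random.default_rng(4)
nX, nY, T = 3, 4, 10

P = rng.uniform(size=(nX, nX)); P /= P.sum(axis=1, keepdims=True)  # p(x'|x)
O = rng.uniform(size=(nX, nY)); O /= O.sum(axis=1, keepdims=True)  # p(y|x)
prior = np.full(nX, 1.0 / nX)
y = rng.integers(0, nY, size=T)        # an observed sequence

# Forward messages alpha_t(x) = p(y_1..y_t, x_t)
alpha = np.zeros((T, nX))
alpha[0] = prior * O[:, y[0]]
for t in range(1, T):
    alpha[t] = O[:, y[t]] * (P.T @ alpha[t - 1])

# Backward messages beta_t(x) = p(y_{t+1}..y_T | x_t), with beta_T = 1
beta = np.ones((T, nX))
for t in range(T - 2, -1, -1):
    beta[t] = P @ (O[:, y[t + 1]] * beta[t + 1])

# Posterior p(x_t | y_1..y_T) from the product of the messages
post = alpha * beta
post /= post.sum(axis=1, keepdims=True)
```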
B: Optimal control as inference.
One way to formulate optimal control as Bayesian inference is to consider an 'optimality variable' O_t [11], which takes the value 1 when the state-action pair at time t is optimal, and to assume that the reward function represents the log probability of the state-action pair being optimal:

p(O_t = 1 | x_t, u_t) = exp(r(x_t, u_t))

In this formulation, the posterior probability of a state-action trajectory conditioned on the optimality variables being 1 is given as

p(x_{1:T}, u_{1:T} | O_{1:T} = 1) ∝ p(x_1) \prod_{t=1}^{T-1} p(x_{t+1} | x_t, u_t) exp( \sum_{t=1}^{T} r(x_t, u_t) )

which means that state-action trajectories that are feasible under the state dynamics and have higher cumulative rewards have higher posterior probability.
This posterior probability can be computed by backward message passing. We define backward messages representing the probabilities that the state-action pair or the state is optimal from time t onward,

β_t(x_t, u_t) = p(O_{t:T} = 1 | x_t, u_t)
β_t(x_t) = p(O_{t:T} = 1 | x_t) = \sum_{u_t} p(u_t | x_t) β_t(x_t, u_t)

where p(u_t | x_t) is a policy prior, such as uniform action selection. These messages are computed backward in time as

β_t(x_t, u_t) = p(O_t = 1 | x_t, u_t) \sum_{x_{t+1}} p(x_{t+1} | x_t, u_t) β_{t+1}(x_{t+1})

Then the optimal policy is derived from these messages as

p(u_t | x_t, O_{t:T} = 1) = p(u_t | x_t) β_t(x_t, u_t) / β_t(x_t)

These messages correspond to the value functions as Q(x_t, u_t) = log β_t(x_t, u_t) and V(x_t) = log β_t(x_t) under the addition of entropy-based regularization of the policy.
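The following sketch computes these backward messages for a small finite-horizon example and reads off the soft value functions as their logarithms (random tabular model with a uniform policy prior; rewards are shifted to be non-positive so that exp(r) is a valid probability):

```python
import numpy as np

rng = np.random.default_rng(5)
nS, nA, T = 3, 2, 6

P = rng.uniform(size=(nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.uniform(-1.0, 0.0, (nS, nA))   # r <= 0, so p(O=1|x,u) = exp(r) is in (0,1]
prior = np.full(nA, 1.0 / nA)          # uniform policy prior p(u|x)

beta_s = np.ones(nS)                   # beta_{T+1}(x) = 1
for t in range(T):
    # beta_t(x,u) = p(O_t=1|x,u) * sum over x' of p(x'|x,u) beta_{t+1}(x')
    beta_sa = np.exp(R) * (P @ beta_s)
    # beta_t(x) = sum over u of p(u|x) beta_t(x,u)
    beta_s = beta_sa @ prior

pi = prior * beta_sa / beta_s[:, None]   # posterior (optimal) policy
Q_soft = np.log(beta_sa)                 # Q(x,u) = log beta(x,u)
V_soft = np.log(beta_s)                  # V(x)   = log beta(x)
```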