A primer on Variational Laplace (VL)

Highlights • Variational Laplace (VL) is a scheme for Bayesian modelling.• VL is widely used in neuroimaging, in particular DCM.• This paper provides a tutorial explanation of the math and algorithms.• New standalone code is provided to enable re-implementation.• The supplementary materials provide worked derivations.


Introduction
Scientific enquiry typically involves inferring quantities that cannot be directly observed.For example, a neuroscientist may wish to investigate the activity of neural populations from electrical activity that can be measured on the scalp.Similarly, a seismologist may wish to make inferences about geophysical events that occur beneath the surface of the earth, from seismograph measurements taken at the surface.Both of these are examples of ill-posed problems, where multiple configurations of the underlying system of interest could lead to similar measurements.Consequently, any inferences that are made about the underlying mechanisms that generate the data will typically involve uncertainty.This uncertainty needs to be quantified when drawing conclusions, which is why probabilistic, or Bayesian inference methods are typically used.
This article explains the mathematics behind a scheme for Bayesian modelling called Variational Laplace (VL), which is widely used in neuroimaging.It is used to test the statistical evidence for competing models, and in tandem, to rapidly estimates models' parameters, without the need for computationally demanding sampling procedures found in other Bayesian approaches.VL is the cornerstone of Dynamic Causal Modelling (DCM) for fMRI and M/EEG -a framework for inferring neural connectivity from non-invasive neuroimaging data (Friston et al., 2003).A special case of VL, referred to as variational Restricted Maximum Likelihood (REML), is applied behind the scenes in every Statistical Parametric Mapping (SPM) analysis for estimating temporal auto-correlation, as well as for Bayesian source localisation with M/EEG data (Friston et al., 2008a).The original implementation of VL was in the SPM software package (Friston et al., 2007), and variants of the scheme are now provided in other packages for analysing neural and psychological data, such as the VBA Toolbox (Daunizeau et al., 2014) and TAPAS (Frässle et al., 2021).It is also increasingly being applied to fields beyond neuroimaging, including theoretical neurobiology (Smith et al., 2022), robotics (Lanillos et al., 2021;Lanillos and van Gerven, 2021) and epidemiology (Friston et al., 2020).
Despite its widespread use, understanding VL can be challenging for the uninitiated; the original description of VL assumed some familiarity with variational methods, statistical physics and Bayesian inference.Here, our aim is didactic.We explain the methodology from first principles, which we anticipate will be particularly useful for people interested in developing new models of neuroimaging data, or new software toolboxes for neuroimaging analysis.Additionally, we hope this article will be useful for experimenters, who wish to gain a deeper understanding of how the analyses they routinely conduct are performed under the hood.Finally, VL may have relevance to other fields, including machine learning and other physical / biological sciences.We have therefore provided generic pseudocode and MATLAB code with this paper, which could ease translation for new applications.This article compliments previous tutorial introductions to variational Bayes in other contexts (Chappell et al., 2016;Ostwald et al., 2014), and technical reports which have set out the mathematics of VL in detail (Daunizeau, 2017;Stephan et al., 2005;Friston et al., 2007;Starke and Ostwald, 2017).

Example problem: modelling neural connectivity
To motivate the use of VL, we consider the following classic fMRI study, which has been used to develop and illustrate many new analysis methods in the decades since it was published.Büchel et al. (1998) investigated area V5 of the visual cortex, which was known to be sensitive to visual motion.While undergoing fMRI, participants viewed white dots on a computer screen, which were either in motion or stationary.On a subset of the trials with motion, participants were instructed to pay attention to the speed of the dots' motion.The authors found that the neural response to visual motion in V5 was enhanced when people paid attention to the speed of the moving dots.Attention also enhanced the neural response in brain regions lower in the visual hierarchy (primary visual cortex) and higher in the hierarchy (superior parietal cortex).
Here, for illustrative purposes, we will use the fMRI data from a single participant from Büchel et al. (1998) to address the following question: which neural connections explain why V5 was sensitive to visual motion?We consider two candidate hypotheses 1 : • H1: Attention modulated bottom-up connectivity from primary visual cortex (V1) to V5 • H2: Attention modulated top-down connectivity from superior parietal cortex (SPC) to V5 Fig. 1A shows the data -fMRI timeseries that were extracted from brain regions V1, V5 and SPC.To formalize the two hypotheses, the next step is to specify models that can generate or simulate the fMRI data that would be expected under each hypothesis, before using the VL scheme to evaluate their evidence.

From hypothesis to models
Models commonly used for fMRI analysis, referred to as Dynamic Causal Models, are specified as follows.Defining a vector z(t) = (z 1 (t), z 2 (t), z 3 (t)) to be the overall level of neural activity in V1, V5 and SPC respectively at time t, and vector y(t) to be the (concatenated) fMRI timeseries from all three regions, either hypothesis H1 or H2 can be expressed using the following pair of equations (a state-space model): The first line, which we will call the neural model, describes how the rate of change in the three brain regions, ż, depends on a time-varying connectivity matrix J(t), which is of dimension [3 × 3], and vector of external driving inputs Cu 1 (t) .The second line of Eq. ( 1) includes an observation model, g, which translates from neural activity z to fMRI data y.This part of the model is governed by a vector of parameters θ h , which, for MRI, includes the rate of blood flow through the venous compartment.This model, together with the Bayesian methods set out here, are together referred to as DCM for fMRI (Friston et al., 2003).
The parameterisation of J(t) determines which connections are switched on (informed by the data) and which are switched off (fixed at zero).We can therefore formalize our hypotheses by specifying two variants of the model, m 1 and m 2 , that differ in which connections can be modulated by attention, as illustrated in Fig. 1B.The corresponding parametrisation of J(t) for the two models is provided in Appendix 1, although this is not required for understanding the rest of this article.Given these two hypotheses, H1 and H2, which we have now formally stated as mathematical models m 1 and m 2 , we have two overall aims: 1 To estimate the models' parameters, which includes the strength of neural connections and the effects of each experimental condition on each connection. 2 To estimate the probability for each model given the data, P(m 1 |y) and P(m 2 |y), enabling us to select the best model or models (referred to as Bayesian model selection).This requires the intermediate step of estimating the model evidence, also called the marginal likelihood, which is the probability of having seen the data under each model, P(y|m 1 ) and P(y|m 2 ).
Both objectives are fulfilled by the VL scheme, which we will introduce next.

VL in a nutshell
To explain the problem that VL solves from a statistical perspective, we begin with Bayes rule.For a model with a vector of parameters θ, we start by defining a prior probability density P(θ).This function defines our belief about any particular value of the parameters, or range of values, before seeing the data.The priors serve to regularize or constrain the estimation of the model, making ill-posed problems tractablein the sense there is a unique solution.We also define a likelihood P(y|θ), which is a function of the parameters θ and returns the probability of observing the data y given those parameters.Together, the priors and likelihood form a generative model of the data.The goal of Bayesian inference, as introduced above, is two-fold.First, to obtain an updated probability density over the parameters after seeing the datathe posterior P(θ|y).Second, to obtain the marginal likelihood or model evidence P(y), which scores the quality of the model and enables models to be compared.If the data 'look like' the kind of data that the model would have predicted, then P(y) will be large, whereas if the data are inconsistent with the model, then P(y) will be smaller, and thus models can be compared on the basis of their evidence.This is called Bayesian model comparison and we will return later to why the model evidence is wellsuited to comparing models.The posterior and evidence are related by Bayes rule: Where the model evidence is the integral or sum over possible settings of the parameters: Thus, calculating the model evidence involves marginalising (summing or integrating) over the parameters.Unfortunately, this integral typically lacks an analytic solution and cannot be calculated directly.This is problematic for the calculating model evidence, but also the posterior probability (which requires the evidence).
It is the intractability of the integral in Eq. ( 3) that necessities approximate Bayesian inference.Traditional sampling methods are a common approach for approximating the posterior, which eschew the need to tackle this integral, however they can be very computationally intensive and do not provide a straight-forward way to approximate the evidence, thereby precluding Bayesian model comparison.Instead, the 1 These models are used as teaching examples, which we supply with the SPM software package.The dataset and analysis scripts, can be freely downloaded from https://www.fil.ion.ucl.ac.uk/spm/data/attention/ .
P. Zeidman et al. approach described here is to construct a quantity that lower bounds the log of the evidence (i.e., is always smaller than or equal to it).This is often referred to as an evidence lower bound (ELBO) or a free energy functional. 2 An algorithm is then derived to maximise this boundmaking it as close as possible to the log evidence.Methods that involve this optimisation of a boundand therefore ensure that the ELBO or free energy approximates the log evidenceare collectively referred to as 'Variational Bayes' (VB).These methods were first introduced by Richard Feynman in statistical physics (Feynman, 1972) and terms such as 'free energy' were retained when it was subsequently applied in different fields.Variational methods were introduced into machine learning though ensemble learning (Hinton and Van Camp, 1993;MacKay, 1995b;MacKay, 1995a).Later, schemes like expectation maximisation (EM) were considered in the light of VB (Bishop, 1998;Neal and Hinton, 1998;Beal, 2003), which proved particularly useful for fitting graphical models (Jordan et al., 1999).
A key practical challenge for the routine use of VB is deriving the necessary model-fitting algorithm for a given model.This is time-consuming and requires a certain degree of skill with variational calculus.Approaches have therefore been developed over the years for implementing VB for a sufficiently broad class of models that new models can be introduced and fitted to data using 'plug-and-play' software routines.VL is one such approach (Friston et al., 2007), which has been employed in a large body of work in neuroscience.The reference implementation of VL is a MATLAB function, spm_nlsi_gn, 3 implemented in the Statistical Parametric Mapping (SPM) software package (https ://www.fil.ion.ucl.ac.uk/spm/).
We proceed by deriving a score for the quality of a model: the free energy bound on the log evidence.Then, we derive the algorithm that optimises this bound, providing estimates of the log evidence and posterior over parameters.We illustrate this with worked examples, including the attention to visual motion example above, code for which is provided with this paper.Finally, in the discussion, we consider some advantages and disadvantages of this scheme and its relation to other Timing of three experimental conditions (top) and representative timeseries fMRI from three brain regions that will be the focus of the analysis (bottom).B. Structure of two candidate Dynamic Causal Models (DCMs) used to explain these fMRI data.The models differ only in where Attention has an effect.Each large coloured circle is a brain region, with latent level of neural activity z 1 , z 2 , z 3 respectively.Arrows between the circles are neural connections encoded in matrix A of the model (see Appendix 1).Arrows with rounded ends encode modulatory effects of Motion and Attention.
2 A functional is a function of a function.More specifically -at least for the purposes of this paper -it is a mapping whose input is a function and whose output is a number. 3This routine is particularly useful for fitting continuous data when the likelihood has a Gaussian form (whose expectation and covariance may be nonlinear functions of the parameters).An alternative MATLAB function, spm_nlsi_ Newton, allows the specification of generic likelihood functions (for example, it could be used for fitting binary data using likelihood densities compatible with logistic regression models).
P. Zeidman et al. variational inference software tools currently in use.An outline of mathematical notation appears in the footnote. 4

The generative model
The VL scheme works with models that can be written generically in the following form.We have a vector of observed data y of length D and a model g (β), where β are the model parameters: In the example above, y are fMRI timeseries data and g is a model of how the data were generated, however VL is generic and any data or model could be used.The vector of errors or residuals ϵ y has a multivariate normal density, with mean zero, and precision matrix Π y of dimension D (which is the inverse of the covariance matrix): Elements on the leading diagonal of matrix Π y are the precision (inverse variance) of each measurement, and any non-zero off-diagonal elements encode the conditional dependencies amongst measurements, which determines their correlation.To parameterize Π y -such that it can be estimated from the datawe decompose it into a weighted mixture of K matrices called precision components, Q k=1,…,K ∈ R D×D , each of which has a corresponding scalar hyperparameter λ = (λ 1 , …λ k ): The exponential of each hyperparameter is taken to enforce positivity (because precisions and variances cannot be negative).This approach provides a convenient method for allowing different mixtures of observations to share variance.At its simplest, a single precision component can be used, set to the identity matrix, Q 1 = I D , in which case the corresponding hyperparameter λ 1 encodes the log precision of the observation noise.
To make this into a statistical model, we define a likelihood function P(y|θ) and a prior probability density P(θ), where the parameters and hyperparameters are θ = (β, λ).As set out in the previous section, the likelihood P(y|θ) returns the probability of observing the data y given a particular setting of the parameters θ.This has a multivariate normal density: Eq. ( 7) states that the observations are expected to be centred on the prediction of the model g(β), with the level of observation noise given by the precision matrix Π y (λ).Next, the prior density P(θ) quantifies our belief about any given value of the parameters before performing the analysis.Here, we define multivariate normal densities as priors over the parameters and hyperparameters: Having defined the model, we next set out methods for approximating the log of the model evidence lnP(y) and the posterior P(θ|y).

Variational Bayes
We will convert the difficult problem of calculating the integral in Eq. (3) into a simpler optimization or search problem.This begins by defining a functional called the free energy, which is a lower bound on the log of the model evidence lnP(y).This means that by construction, it can only return values that are less than or equal to the log evidence.Then we'll search for a setting of the parameters of the free energy (variational parameters) that maximize it, making it as close as possible to the unknown log evidence.Helpfully, the parameters that maximize the bound will turn out to approximate the posterior P(θ|y).

Constructing a lower bound on the log evidence
The log evidence is defined as: To construct a lower bound on this quantity, we will make use of Jensen's inequality, which says that the average of a log is always less than or equal to the log of an average (Fig. 2).That means that if we can express ln P(y) as the average of a log, then to construct a lower bound we simply need to rearrange terms to get the log of an average.
To express the log evidence as the average of a log, we first define a probability density Q(θ) that will form our approximation of the posterior over parameters.We then introduce this density into the log evidence (from Eq. ( 9)), multiplying and dividing so that it causes no overall change to the evidence: Jensen's inequality says that this average-of-functions will always be less than or equal to the corresponding function-of-the-average: Thus, using the notation of expected values, ). Portrait in the public domain via Wikipedia.https://en.wikipedia.org/wiki/Johan_Jensen_(mathematician). 4The following mathematical notation is used.Variables: Lower case italic text (x) is used for scalar variables, bold uppercase letters (Y) are used for matrices and bold lower-case letters are used for vectors (y).The expected value or average of variable x under the density Q is written The variation of a functional δ may be thought of as the derivative of the functional with respect to the function.For instance, for a functional of the form A = ∫ S(h(x))dx, the variation of A with respect to h is: Where E[⋅] is the expected value or average. 5This is the log of an average, so (by Jensen's inequality) a lower bound on this will be the average of the log, which is called the free energy F: Here, we have written the free energy as a functional (a function of a function), with Q(θ) as its input.By construction, for any choice of probability density Q(θ), it holds that F[Q(θ)] ≤ ln P(y).Note that the free energy is sometimes defined to be the negative of this quantity.We have used the terminology typically found in the neuroimaging literature, where the free energy is a lower bound on the log evidence, also known as the ELBO in statistical and machine learning.The next step will be to maximize this lower bound, i.e., find the probability density Q(θ) that maximizes the free energy, bringing it close to the log evidence, so F ≈ ln P(y).

Useful properties of the free energy
The free energy can be rewritten in different ways to illustrate why it serves as a useful score for the quality of a model.First, we can rewrite it as the log evidence minus the Kullback-Leibler (KL) divergence between the true and approximate posterior: Where D KL [Q ‖ P] is the KL divergence from density P to Q, which is a (non-negative) measure of difference between the two densities.The log evidence is a fixed (but unknown) quantity, so if we can find a density Q(θ) that maximizes the free energy, then we minimize the divergence between the true posterior P(θ|y) and the approximation Q(θ).
We can also re-write the free energy from Eq. ( 11) as the difference between the model's accuracy and its complexity: Here, the accuracy term is the expected log likelihood, and the complexity is how far the parameters have diverged from the prior to the approximate posterior.Thus, if we select the model with the most positive free energy out of several candidate models, then we inherently select the model that offers the best trade-off between accuracy and complexity (c.f., Occam's razor).Importantly, this definition of complexity takes into account the covariance amongst parameters, and thus enables the free energy to serve as a better approximation of the log evidence than statistics which discard the covariancein particular the BIC and AIC (Penny, 2012b).
Finally, we can re-arrange the free energy in Eq. ( 11) as follows: By analogy with its applications in statistical physics, the first term is referred to as an energy, and the latter is the Shannon entropy H of the posterior density Q(θ).A probability density Q(θ) with high entropy will be smooth (or in the limit, flat) over the possible values of θ, whereas a probability density with low entropy will have peaks around particular values.This means that if we have two candidate Q(θ) densities, both equally likely under the priors, to maximize the free energy we would select the one that is smoother or, equivalently, with the higher entropy.This is referred to as Jaynes' Principle of Maximum Entropy (Jaynes, 1957), and means that we select the simplest explanation for the data where possible.(A further consequence of Jaynes' Principle is that as we come closer to maximizing the free energy, the free energy forms a smoother landscape, aiding the performance of optimization algorithms.We will return to the notion of a free energy landscape in Section 5).

Free energy under the Laplace approximation
We now build up to the definition of the free energy that is used in the VL scheme, by first defining the probability densities that appear in the definition of the free energy (Eq.( 11)) and then by introducing a mean-field partition over parameters and hyperparameters.

Quadratic approximations
In what follows, we will approximate the various unknown probability densities that appear in Eq. ( 11) using multivariate normal densities, by way of quadratic approximations and Laplace's method, which we will first reprise.A quadratic approximation, also called a second order Taylor approximation, approximates any (twice differentiable) function g(x) that has vector input x, close to its peak or mode x = x 0 , with the linear function: where ∇g(x 0 ) is the gradient of function g evaluated at x 0 and H g (x 0 ) is the Hessian matrix-a matrix of second derivatives-of g evaluated at x 0 , i.e., [H g (x 0 )] i,j = ∂ xixj g(x 0 ).At the mode of the density, x = x 0 , the linear term equals zero and thus disappears in Eq. ( 15).This has a very similar form to the log of a normal density, which is utilized in Laplace's method for function approximation.

Laplace's method
Laplace's method leverages the quadratic approximation in order to approximate a function g near its mode as a (scaled) normal density, as detailed in Fig. 3.If g(x) is a function of scalar variable x, the Laplace approximation L(x) is: where μ is the mean and π is the precision.When g(x) is a function of a vector of variables x, the Laplace approximation L(x) returns a (scaled) multivariate normal density: 5 For a probability density function P(x), the expected value E[P(x)] is a weighted average over the possible values of x, where each value is weighted by its probability, i.e., E[P(x)] = ∫ xP(x)dx.The expected value can also be taken with respect to another probability density function Q(x), which we write as dx.This is a weighted average of P(x), where the weighting comes from Q(x).
Where μ is the mean and Π is the precision matrix, i.e., the inverse of the variance-covariance matrix.We will apply this method several times to approximate the different quantities that comprise the free energy.

Approximating the free energy
Recall Eq. ( 11), the definition of the free energy: We will now write expressions for the log joint ln P(y, θ) and the approximate posterior (i.e., the choice of Q(θ)).As the true form of the posterior is unknown, and the form of the log joint may involve analytically difficult nonlinearities, we will use (local) quadratic or Laplace approximations to define them in a flexible manner.

Approximating ln P(y, θ)
We will apply a quadratic approximation for the log joint density, which is reasonable because densities tend to look normal close to their mode.We define μ to be the mode of the joint probability density, μ = argmax θ P(y, θ).Note that this is the same as the mode of the posterior P(θ|y), because the posterior is simply a scaled version of the joint.(Also, μ is the mode of the log of these densities, as taking the logarithm does not change the mode.)The approximation to the log joint is:

Choosing an approximate posterior Q
We will similarly apply a quadratic approximation to the log posterior lnP(θ|y) around the posterior mode μ.Under Laplace's method, this means we will be using a normal density, Q(θ), in order to approximate the posterior P(θ|y): ln P(θ|y) ≈ ln P(μ|y) − 1 2

Taking expectations
Next, we will apply the expectation operator6 from Eq. ( 11) to the previous two densities under the approximate posterior: Where Eq. ( 22) uses the definition of the log of the multivariate normal density for a variable with dimension n.
To simplify these expressions, note that each quadratic term inside the square brackets is a scalar.This means we can use the 'trace trick', tr(ABC) = tr(CAB).Applying this gives the simpler expressions: Substituting these expectations into Eq.( 11), we get the free energy under the Laplace approximation: Where Σ is the posterior covariance and n is the total number of parameters.

Free energy under a mean-field approximation
The previous section assumed that all parameters are treated equally.However, it is often helpful to separate them out into two (or more) types.The estimation scheme outlined here alternates between updating the estimate of the model's parameters β and the hyperparameters λ that control the precision of the observation noise.We previously lumped these two kinds of parameter together as θ but now separate them out.We can re-write the joint probability as follows: with normal densities for the likelihood and priors Eqs. ( 7) and ( 8).The approximate posterior is chosen to factorise as follows: This factorisation is known as a mean-field approximation.Each approximate posterior is a multivariate normal density: The free energy is easily extended for this factorisation (from Eq. ( 25)): μ is the mode of P(θ|y) (prior to optimisation of the approximate posterior, it may not be).The linear term in the expansion will still disappear under the expectation as it includes a factor of E Q(θ) [θ − μ] = 0, recovering Eq. 21.
Where p is the number of parameters and h is the number of hyperparameters.So, we now just need to define the log joint, which is the product of three normal densities in Eq. ( 26).The log of a normal density with mean m covariance matrix S for a variable of dimension k is defined as: Where the error term is ϵ x = x − μ.Substituting into Eq.( 29) (with dim(y) = k): With error terms: ) Where, to recap, Π β , Π λ are prior precisions, η β , η λ are prior expectations, Σ β , Σ λ are posterior covariances, μ β , μ λ are posterior expectations and Π y is the (modelled) precision of the data.We then use Eq. ( 31) as the objective function (i.e., the measure of the model's quality), which we seek to maximize in order to approximate the log evidence and identify the model's parameters.

Estimation scheme
Next, we derive an algorithm for finding the posterior parameter density Q(θ) that maximizes the free energy F[Q(θ)].Full derivations of these equations are provided in the supplementary material.

Overview of gradient ascent and Gauss-Newton
The simplest approach to numerically maximizing the free energy is gradient ascent.Conceptually, the free energy forms a landscape, the dimensions of which are the parameters.Gradient ascent or descent takes small steps in the same direction as the gradient, as if climbing or descending a hill.Writing the function to be optimized generically as f(μ), the parameters μ are updated on each iteration according to μ = μ + Δμ, where: and α is the step size (positive for ascent, and negative for descent).This is illustrated in Fig. 4A, for finding the minimum of a function using gradient descent.From an initial guess (labelled 1), small steps are taken towards a minimum, which in this example happens to be the global minimum (shaded green sphere).
Gradient ascent or descent is rarely sufficient in practice.One issue regards its speed: because the update is proportional to the gradient, estimates advance very slowly in shallow regions of the landscape (as can be seen in Fig. 4A).Conversely, in very steep regions, the estimates advance quicklyrunning the risk of taking too large a step and moving away from an optimum.We would ideally like the opposite situationto move quickly in shallow regions, and slowly in steeper regions.
The Gauss-Newton algorithm takes a different approach and can converge far more quickly.To minimize a function, a parabola (or Ushaped plane) is fitted to f(μ) at the current estimate of the parameters (Fig. 4B).The algorithm then jumps to the minimum of the parabola, which becomes the new parameter estimate (labelled 2 in Fig. 4B) and the process repeats.This can enable fast progression from the initial estimate to the global or local optimum (Fig. 4C).For some vector of parameters μ * , the parabola is a second-order quadratic approximation of f(μ) at the current estimate of the parameters μ: Where H f is the Hessian matrix of second derivatives.This approximation has a unique minimum at μ − H f (μ) − 1 ⋅∇ μ f(μ), which becomes the new estimate of the parameters, giving rise to the update equation: While Gauss-Newton is much faster than gradient ascent, it will perform poorly if the quadratic approximation is poora particular risk P. Zeidman et al. when taking large steps that are far from the current estimate.In these situations, it would be ideal to dynamically switch to a gradient ascent or descent.A continuous transition between gradient ascent and Gauss-Newton like behaviour is achieved by the variational Laplace scheme, as we will return to shortly.As a first step towards this, we will derive a gradient ascent algorithm for the free energy.

Gradient ascent on free energy
The mean-field approximation Q(θ) = Q(β)Q(λ) naturally gives rise to a gradient ascent algorithm that alternates between optimising the parameters and hyperparameters.With the mean-field approximation, the free energy (Eq.( 11)) can be extended to: We will use δ Q(β) F and δ Q(λ) F to denote the variation of the free energy with respect to each factor of the approximate posterior, Q(β) and Q(λ), respectively.These are functional derivatives, encoding the rate of change in F that would result from infinitesimal adjustments to the form of each function-where the functions in question here are probability densities.Ignoring constants: Setting to zero and taking the exponential, we get the optimal approximate posteriors on each iteration of the algorithm: As before (Eq.( 23)), quadratic approximations can be used for the terms inside the expectations: ) ) By definition, the modes of the approximate posterior densities over parameters μ β and hyperparameters μ λ must maximise the above quantities (i.e., the logarithms of the approximate posteriors), giving the following pair of equations: = 1 2 tr Where vector Δμ β is the change in the parameters at each iteration of the algorithm, (Δμ λ ) i is the change in the i-th hyperparameter at each iteration, ∇ μ β [⋅] is the gradient with respect to the parameters and J g is the Jacobian matrix7 of first partial derivatives with (J g ) ij = ∂ μ j g i (μ β ) for observation i and parameter j.For hyperparameter i, the derivative term ∂ μ λ i (ϵ λ ) is a vector with value of unity in index i and zero elsewhere, and Terms including the second derivative of g(θ) have been omitted on the assumption that it is locally approximately linear.Similarly, the final term for the parameters is usually ignored, because its contribution is usually trivial in relation to the other terms, and is zero when the precision is linear in the hyperparameters.These updates depend on first calculating the covariance of the parameters Σ β and hyperparameters Σ λ , which are given by the expressions8 : Where Σ y = Π − 1 y .The gradient ascent proceeds by alternately applying the two updates in Eq. ( 40).However, as mentioned above, a gradient ascent is rarely sufficient for robust performance, motivating an algorithm that dynamically switches between gradient ascent and Gauss-Newton-like updates.This is set out in the next section.

Updates in continuous time
Parameter updates in optimization are generally treated as occurring in discrete steps, Δμ β and Δμ λ , as described above.However, Friston et al. (2007) considered updates occurring in continuous time, which provides a principled way to transition between gradient ascent and Gauss-Newton-like updates.In continuous time, the gradient ascent on the parameters can then be written in terms of the time derivative μβ (t) at time t: Where [ln P(y, μ β , λ)].Integrating Eq. ( 42) over time using local linearization (Ozaki, 1985), the update for a time interval t is: And a similar expression is applied to update the hyperparameters.When t is small and positive, the algorithm behaves like a gradient ascent, because the term (exp[tμ β ] − I) in Eq. ( 43) regularizes the update (i.e., reduces its size).This follows because the second derivatives μβ are negative.As t increases, the matrix exponential term exp[tμ β ] reduces to zero, resulting in a standard Gauss-Newton update (by around t = 2): The integration time t can therefore be varied dynamically to adjust the behaviour of the algorithm.When the algorithm begins, the time is set to a small value, causing it to behave like a gradient scheme for stability.If the free energy has increased, i.e., improved as a result of an update, then regularization is decreased (by increasing t).Conversely, if the free energy has decreased, i.e., worsened, then regularization is increased (by decreasing t).
In practice, the step size t is generally specified with a (log) descent parameter v that is automatically scaled by the average curvature of the landscape α: Here, n is the number of parameters.A cautious, slow descent corresponds to v = − 4 (the default starting value).As the average real eigenvalue of μβ increases (i.e., the curvature increases), the regularisation is increased by decreasing v.The algorithm stops when the change in free energy becomes sufficiently small.In more detail, the predicted value of I β (μ β ) after making the step Δμ β is given by: ∇ μ β [I β (μ β )]⋅Δμ β .If this is small over a series of iterations, then the algorithm is considered to have converged.Pseudocode for the complete VL algorithm is provided in Appendix 2, and MATLAB code accompanies this article.

Interim summary
This section described an algorithm that searches for a probability density over the parameters Q(θ) that maximizes the free energy.The algorithm ascends the free energy landscape, with an adaptive step size.Readers experienced with machine learning may note some similarity with the Levenberg-Marquardt algorithm, however the continuous time approach reviewed here is derived from first principles without requiring the introduction of an arbitrary regularization term: see Friston et al. (2007) for a detailed comparison.

Examples
This section presents worked examples using simulated data, where the same algorithm introduced above is applied to different kinds of model.MATLAB code for each example is provided with this paper.

Static models
We start with a simple linear regression model with parameters β: For this example, the design matrix X has two columns: the first is a column of ones and the second consists of 100 evenly spaced values between − 50 and 50.Thus, the parameter β 1 encodes the mean and β 2 encodes the regression slope.The observation noise is encoded by a single precision component, controlled by log-precision parameter λ: Note that a truly linear model would converge within a single iteration of the VL scheme, whereas here there is an exponential function in Eq. ( 47), in order to ensure positivity of the hyperparameter λ.This introduces non-linearity into the model, thereby requiring several iterations to converge.Fig. 5A shows that the parameters and hyperparameters used to generate the simulated data (grey bars) were successfully recovered (blue bars with 90% credible intervals).Note that the covariance amongst (hyper)parameters was also calculated by the VL scheme, but is not shown.
The second example considers the situation where there is heteroskedasticityobservations having different levels of observation noise.This is a common situation, for example when parts of the data come from different measurement channels.This can be modelled by having two precision components: Here, the first precision component matrix, Q 1 has values of one on the leading diagonal for the first half of the observations, and zeros for the second half.It therefore captures the precision of the first half of the observations.Similarly, Q 2 captures the precision of the second half of the observations.As shown in Fig. 5B, the parameters and hyperparameters used to generate the data were correctly recovered.Next, we illustrate dynamic models, i.e., those modelling continuous changes over time.

Dynamic models
A typical application of the VL algorithm is to estimate the parameters of dynamic models that are specified as ordinary differential equations (ODEs).As introduced in the empirical example in Section 1.1, this is accomplished by splitting the model into two parts: a model f of the dynamics of the unobserved (latent) variables x, and a static model g that translates the latent variables into observations y: Where x(t) is approximated using a standard numerical integration scheme, i.e.
Here, we perform this integration using the local linearization approach of Ozaki (1985).A simple example is estimating the rate of an exponential decay of a single observed variable x, with rate parameter β, for which the model specification corresponds to: The model fit and estimated parameters are shown in Fig. 5C.Finally, we consider a more involved examplethe haemodynamic model of Stephan et al. (2007), which is used in the analysis of functional magnetic resonance imaging (fMRI) data to infer neural dynamics.(This forms function g in Eq. ( 1), Section 1.1).For a detailed walkthrough of the physiology, please see Appendix 5 of Zeidman et al. (2019a).In brief, there are four hidden states (vasoactive signal s, blood inflow f in , blood volume v, deoxyhaemoglobin q) and three time-invariant parameters that are estimated from the timeseries data (haemodynamic transit time τ h , vasoactive signal decay rate κ and stimulus efficacy z).The dynamics of the four hidden states are governed by the following equations, which together constitute f in Eq. ( 49): Where 1 α and α, γ are fixed parameters.The final part of the model translates from the latent variables for blood flow v and deoxyhaemoglobin q to the fMRI timeseries y: ) Where for this example, ϑ 0 , E 0 , r 0 , TE, ϵ h are fixed parameters.Fig. 5D shows that the parameters were recovered successfully, although the precision with which the transit time parameter could be recovered was lower than the other two parameters, as reflected in the larger 90% credible interval (pink bar).

Bayesian model comparison
The main purpose of the VL scheme is to enable Bayesian model comparison, which is comparing the relative evidence for different models, where the log evidence is approximated by the free energy.If each model encodes a hypothesis for how the data were generated, then Bayesian model comparison enables different hypotheses to be compared, in terms of how well they trade off accuracy and complexity (see Section 3.2).
Models to be compared can differ in their likelihoodi.e., the definition of their forward model − and/or in the specification of their priors.The only requirement is that all models have been fitted to the same data (where this is not the case, an alternative approach referred to as Bayesian Data Comparison may be considered, see Zeidman et al. (2019c)).Where models differ only in their priors, there is no need to fit each model separately to the data using the VL schemeit is sufficient to fit one 'full' model, and use Bayesian model reduction to analytically Fig. 6.Bayesian model comparison.Three general linear models (GLMs) were fitted to the data using the VL scheme, and then compared based on their free energy.The models differed in whether they had one, two or three precision components.Left: The log Bayes factor for each model relative to model 1.Right: The same results transformed to a posterior probability for each model.compute the free energy and parameters of the alternative reduced models, under Laplace approximations (Friston et al., 2018).Through these methods, decisions such as whether to include particular variables in the model or how to capture their interactions can be decided in a principled manner.
To illustrate this, we will use GLM example above, where the first half of the observations were noisier than the second half (illustrated in Fig. 5B, dashed lines, left panel).We will then use Bayesian model comparison to ask which is the best explanation for the data: a model with one precision component (i.e., all observations had the same level of noise), two precision components (the first and second half of the observations had different levels of observation noise) or three components (each third of the data had a different level of noise).Including more precision components could increase the model's accuracy by introducing more degrees of freedom, but would also increase model complexity.To assess which of the three models optimized the accuracy / complexity trade-off, we specified each model and calculated their free energy using the VL scheme.The values were F 1 = − 18.21, F 2 = 39.61,F 3 = 14.32,where a more positive free energy is better.
These free energies can then be compared in terms of the Bayes factor (Kass and Raftery, 1995) and posterior probabilities.The Bayes factor is the ratio of evidence for a model m 1 relative to another model m 2 : The larger the ratio, stronger the evidence in favour of model m 1 .Taking the log, division becomes subtraction, therefore the log Bayes factor is simply the difference in log evidence.The log Bayes factor in favour of model 1 is: The log Bayes factor can be computed for more than two models by selecting a model to serve as the baseline or reference.Here, we chose the worst model, m 1 , as the reference model, and calculated log Bayes factor for models m 2 and m 3 relative to m 1 , as shown in Fig. 6 (left panel).As expected, m 2 was the best (as the data were generated using two precision components), m 3 was the second best, and m 1 was the worst model.This example also makes a key point, that accuracy is not an apt measure of model quality.The third model here would have been the most accurate because it has more degrees of freedom.However, its added complexity was correctly penalized by the free energy, ensuring it would be discarded in favour of the model that is the simplest explanation for the databut not too simple.
It can aid interpretation to report not only the log Bayes factor, but also the posterior probability for each model, e.g., p(m 1 |y).Under equal prior probability for each model, by application of Bayes rule, the posterior probabilities are given by a softmax function of the log Bayes factors: This is illustrated in Fig. 6 (right panel).This demonstrates that m 2 had posterior probability close to unity, meaning that we could be extremely confident that it provided the best explanation for the data.This procedure may be applied with any number of models, and therefore forms the basis for hypothesis testing in Bayesian inference.

Empirical example of Bayesian model comparison
In Section 2 we introduced an example modelling problem, where the aim was to compare the evidence for two candidate models, as explanations for why visual region V5 of the brain is enhanced by visual attention.Here, we illustrate applying the VL scheme to the two candidate models, m 1 and m 2 and performing Bayesian model comparison.
Fig. 7A shows the free energy over iterations for model m 1 (relative to the first iteration).The algorithm converged after 17 iterations.The resulting free energies for the two models were F 1 = − 3277.61 and F 2 = − 3294.20 respectively, where a more positive free energy is better.These free energies were then taken forward for Bayesian model comparison.
Having reported the probability for each model, studies typically report the posterior estimates of the parameters from the winning model if there's a clear winner, or alternatively if there's no clear winner, the (precision-weighted) average of parameters across models (this is called Bayesian model averaging).Fig. 7D shows the posterior expected values from model m 1 .It can be seen that under this model, the presence of visual motion boosted the connection from brain region V1 to V5 (by 0.52 Hz), and attention to visual motion further increased the strength of this connection (by 0.17 Hz).
Together, one may conclude that the data were best explained by hypothesis H1 -i.e., attentional modulation of V5 could be accounted for by bottom-up connectivity from V1. Examining the estimated parameters of the model demonstrated that attention had a gating effect on feed-forward or ascending connections from primary visual cortex.

Discussion
The VL scheme described here underwrites thousands of neuroimaging studies, and is beginning to find applications in other fields.We considered it important, therefore, to clearly explain how it works, what assumptions it makes, and how to interpret the outputs.To this end, Sections 1-3 set out the key challenge of Bayesian inferencethe intractable integral within the model evidence (Eq.( 3)) -and how this can be resolved using variational Bayes (a.k.a., approximate Bayesian inference).This involves defining a lower bound on the log evidence (the free energy, Eq. ( 11)), and then identifying a probability density over the parameters that maximizes this bound, bringing it as close as possible to the log evidence.In effect, variational procedures convert an impossible integration or marginalisation problem into a tractable optimisation problem.Section 4 set out the implementation of variational Bayes typically used in neuroimaging (Eq.( 31)), where Laplace approximations (i.e., normal densities) are used to approximate the free energy bound.Section 5 then described an efficient algorithm for maximizing this (approximate) free energy, as illustrated for static and dynamic models in Section 6. Worked derivations are provided in the supplementary materials, pseudocode is provided in the appendix and standalone MATLAB code accompanies this article.
This scheme has several advantages over alternative methods, both in terms of the free energy approximation of the log evidence and the algorithm used to maximize it.First, the free energy serves as a better approximation of the log evidence than other commonly used heuristics like the AIC and BIC.These approximations can fail dramatically even in relatively benign settingssee (Penny, 2012a) for some unsettling examples.While all three measures can be decomposed into accuracy and complexity terms, only the free energy takes into account uncertainty in the parameters for the accuracy term, as well as the covariance amongst parameters in the complexity term (Penny, 2012b).VL also has advantages over non-variational sampling methods.While sampling is highly effective for profiling the shape of a probability density, it does not provide a straightforward way to estimate the log evidence (one common approach, the harmonic mean, has been described as the "Worst Monte Carlo Method Ever" for its poor performance9 ).The lower computational cost of VL relative to sampling schemes, together with the fact that it is deterministic (so always provides the same results given the same data) provide further advantages.Regarding the algorithm described here for maximizing the free energy, the closest alternative is Expectation Maximization (EM).EM differs from variational Bayes in that EM ignores uncertainty about the hyperparameters.Thus, an advantage of VL is that it conveys the uncertainty of both the parameters and hyperparameters to the next iteration of the algorithm (i.e., EM is a special case of the variational Bayes where uncertainty about the hyperparameters is ignored.) There are potential drawbacks of the VL scheme.First, the Laplace assumption may not be suitable for all applications.For instance, a Gaussian posterior may not be appropriate where the true posterior is multimodal, if a multimodal posterior is important for making inferences.To evaluate this for a particular application, the validity of the Laplace assumption can be assessed using sampling methods.This can be particularly useful when dealing with highly nonlinear models.Typically, variational Laplace accommodates nonlinearities by applying gaussian assumptions to nonlinear transformations of the parameters.A nice example of this is the treatment of hyperparameters above Eqs.( 47)-(48).By taking the exponential of the hyperparameter, one is effectively assuming a log normal prior, which ensures positivity for scale parameters of this sort (a scale parameter is a nonnegative parameter, such as a rate or time constant, variance or distance measure).Using the same device in hierarchical and nonlinear models allows one to accommodate a large range of (weakly) non-linear models within variational Laplace.However, it is sometimes necessary to check the robustness to violations of the Laplace assumption with reference to sampling schemes.
Second, the algorithm presented here is highly likely to converge to an optimal value, but only if the initial values of the parameters are well chosen (typically, the prior expected values of the parameters are used).In other words, if there are multiple local optima, then the algorithm may not be able to escape the local optimum that is easiest to reach from the starting value (this is sometimes expressed as starting within the basin of attraction of a fixed point in the free energy landscape).To overcome this, multi-start approaches have been used in conjunction with the VL scheme.For example, in the analysis of neuroimaging data from multiple test subjects, a common approach is to iteratively restart the algorithm from the group average parameter values (Friston et al., 2015).Finally, it should be noted that variational Bayes methods commonly suffer from overconfidence in their posterior parameter estimates, as demonstrated in the context of DCM for fMRI by Daunizeau et al. (2012).
Various extensions and variants of the VL scheme have been developed to handle a broader range of models in the context of neuroimaging, which we have not had space to detail in this article.For example, VL has been applied to modelling data in the frequency domain (complex cross-spectra), which is routinely used in the analysis of electrophysiological data (Moran et al., 2009) and resting state fMRI data (Friston et al., 2014).Similar approaches have been introduced to invert stochastic differential equation models, namely Dynamic Expectation Maximization (DEM) and Generalised Filtering (GF), which estimate a model's hidden states and parameters, treating both as time-dependant random variables (Friston et al., 2010;Friston et al., 2008b).VL has also been applied to model voxel-wise fMRI data, yielding maps of parameters and posterior probabilities (Zeidman et al., 2018;Puckett et al., 2020).That approach leveraged parallelisation to estimate multiple voxels' timeseries independently; future work could improve performance by sharing parameters across voxels, such as those relating to haemodynamics or observation noise.More recently, the Parametric Empirical Bayes (PEB) framework was introduced to extend the VL scheme to hierarchical experimental designs, where for example, data have been sampled from multiple subjects at multiple time points (Zeidman et al., 2019b;Friston et al., 2016).
Since the introduction of the VL scheme, other algorithms and software tools for variational Bayesian inference have been introduced that serve a similar role.The Variational Message Passing (VMP) algorithm pre-dates VL (Winn et al., 2005) and is now used in the Microsoft infer.NET programming language and the ForneyLab Julia package (Cox et al., 2019).Like VL, this is a deterministic algorithm, which inverts models that can be expressed as Bayesian networks or Forney Factor Graphs (with the prerequisite that nodes are conjugate to their parent).The distributed nature of the VMP algorithm has also enabled its use as a model for how inference is performed in biological neural networks (Parr et al., 2019).Another prominent algorithm is automatic differentiation variational inference (ADVI) (Kucukelbir et al., 2017), which is implemented in multiple probabilistic programming frameworks including Stan (Kucukelbir et al., 2015), Turing.jlfor Julia (Ge et al., 2018), PyMC3 for Python and Tensorflow Probability (TFP).Like VL, ADVI optimizes the free energy, however this is performed without Laplace approximations of the free energy functional.Instead, the log joint in Eq. ( 11) is evaluated automatically from a given graphical model, and the expected value is approximated using Monte Carlo integration methods, when the free energy is evaluated.Having defined the free energy functional, the next step is to maximize it, and VL and ADVI differ in how they do this.In VL, the gradient of the free energy under the Laplace approximation is given by Eq. ( 40), which depends on first computing the gradient of the function to optimized, g(θ), using numerical methods (finite differences).ADVI estimates the gradient of the free energy using Automatic differentiation (AD), which involves reducing the free energy into a graph of elemental math operations and then repeatedly applying the chain rule (Griewank, 1989;Iri, 1984).These gradients are then supplied to a stochastic gradient descent algorithm (Hoffman et al., 2013).An interesting future direction would be to compare the performance of the VL scheme against VMP and ADVI for the kind of models typically applied in neuroimaging.We would predict that VL and VMP would be significantly faster because they eschew sampling, however ADVI would offer the most accurate posteriors in situations where the Laplace approximation is violated.Finally, VL and ADVI may also be compared against a recently developed scheme called Stochastic VB (sVB), which was introduced in the neuroimaging literature and also features stochastic gradient descent.It has been applied to discovering functional modes in neuroimaging data using hidden Markov models (Vidaurre et al., 2017) and quantifying variability in resting state networks across a large sample of the population in the UK Biobank (Farahibozorg et al., 2021).
The code accompanying this paper illustrates applications of the VL scheme with a variety of models.Readers interested in learning more about the applications of VL may wish to proceed to recent tutorials on modelling neuroimaging data using dynamic causal modelling (DCM) (Zeidman et al., 2019a) and behavioural data using active inference (Smith et al., 2022).

Code availability
MATLAB code accompanying this paper can be downloaded from https://github.com/pzeidman/vl-tutorial.

Data and code availability
All MATLAB code accompanying this paper can be downloaded from Github at https://github.com/pzeidman/vl-tutorial.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.In detail, matrix A encodes the coupling strength amongst brain regions, where element a i,j is the strength of the connection from region j to region i in units of hertz (s − 1 ).Thus, the columns of matrix A correspond to outgoing connections from regions V1, V5 and SPC, and the rows correspond to A model corresponding to the second hypothesis, H2, may be specified through a minor modification to the parameter matrix B (ATTENTION) , to indicate that attention should modulate the SPC→V5 connection rather than the V1→SPC connection: P. Zeidman et al.

Fig. 1 .
Fig. 1.Exampling modelling problem.A.Timing of three experimental conditions (top) and representative timeseries fMRI from three brain regions that will be the focus of the analysis (bottom).B. Structure of two candidate Dynamic Causal Models (DCMs) used to explain these fMRI data.The models differ only in where Attention has an effect.Each large coloured circle is a brain region, with latent level of neural activity z 1 , z 2 , z 3 respectively.Arrows between the circles are neural connections encoded in matrix A of the model (see Appendix 1).Arrows with rounded ends encode modulatory effects of Motion and Attention.

Fig. 2 .
Fig. 2. Jensen's inequality.The curved blue line could be any concave function y = f (x).Here as an example, the function f (x) = ln(x) is shown, which has been evaluated in the range x = [0, 10].The straight red secant line joints two arbitrary points: f (x 1 ) and f (x 2 ).Any point on the secant line, such as y A , is a weighted average of f (x 1 ) and f (x 2 ): y A = af (x 1 ) + (1 − a)f (x 2 ).Jensen's inequality says that this average-of-functions will always be less than or equal to the corresponding function-of-the-average: y B = f (ax 1 + (1 − a)x 2 ).Thus, using the notation of expected values, E[f (x)] ≤ f (E[X]).Portrait in the public domain via Wikipedia.https://en.wikipedia.org/wiki/Johan_Jensen_(mathematician).

Fig. 3 .Fig. 4 .
Fig.3.Laplace approximation of a nonnormal density.This example illustrates Laplace's method by approximating a gamma probability density function G(x; a, b) over the variable x with shape parameter a and inverse scale parameter b.The result is a scaled normal density N(x; μ, π − 1 ) with mean μ = x 0 and precision π = − ∂ x0x0 lnG(x 0 ), where x 0 is the mode of lnG.After normalisation to ensure it is a proper probability density, the scaling becomes irrelevant, and we end up with a normal density.Portrait by James Posselwhite, in the public domain via Wikipedia, https://en.wikipedia.org/wiki/Pierre-Simon_Laplace.

Fig. 5 .
Fig. 5. Example applications of the VL scheme.A-D illustrate the VL scheme applied to different models described in the text.Left panels: The simulated data (dashed blue lines) and prediction from the model (solid red lines).Middle and right panels: Parameter or hyperparameter values used to generate the data (grey bars without error bars) and posterior estimated values (blue bars with error bars).

Fig. 7 .
Fig. 7. Results of applying VL to the fMRI attention example.A. The free energy per iteration of the VL model estimation scheme for model m 1 , relative to the free energy of the first iteration (i.e., this shows the increasing log Bayes factor).B. The free energy per model relative to m 2 which was set to zero.C. The posterior probability for each model.D. The expected values of the parameters.With reference to Appendix 1, which details the parameterisation of the model -numbers on the connections relate to parameter matrix A of the model, whereas the effects of motion and attention relate to the parameter matrices B of the model.
in the V1→V5 connection due to motion and attention respectively.Finally, parameter matrix C encodes the sensitivity of each brain region to the driving effect of visual stimuli to region V1: