Interacting Particle Solutions of Fokker–Planck Equations Through Gradient–Log–Density Estimation

Fokker–Planck equations are extensively employed in various scientific fields, as they characterise the behaviour of stochastic systems at the level of probability density functions. Although broadly used, they allow for analytical treatment only in limited settings, and one often has to resort to numerical solutions. Here, we develop a computational approach for simulating the time evolution of Fokker–Planck solutions in terms of a mean field limit of an interacting particle system. The interactions between particles are determined by the gradient of the logarithm of the particle density, approximated here by a novel statistical estimator. Our method shows promising performance, with more accurate and less fluctuating statistics compared to direct stochastic simulations of comparable particle number. Taken together, our framework allows for effortless and reliable particle-based simulations of Fokker–Planck equations in low and moderate dimensions. The proposed gradient–log–density estimator is also of independent interest, for example, in the context of optimal control.


Introduction
Many biological and physical systems are characterised by the presence of stochastic forces that influence their dynamics. These forces may be attributed either to intrinsic or extrinsic sources [1][2][3], i.e. arising either from random fluctuations of constituent subsystems [4,5], or from fluctuating interactions with the environment [6,7].
Often, deterministic analysis of these systems suffices to describe their macroscopic behaviour, with fluctuations contributing only negligible perturbations around the deterministic dynamics. In systems biology, for example, rate equations describing the mean concentrations of a system's constituent species have provided a useful description of chemical reaction network dynamics, and have enabled answering important questions pertaining to chemical systems comprising large numbers of molecules [8].
However, in several settings, stochastic forces considerably influence the resulting behaviour of a system by qualitatively altering its evolution. In those settings, random fluctuations have to be accounted for, and stochastic analysis becomes essential [9,10]. Phenomena such as stochastic resonance [11,12], noise-induced transitions [13][14][15], and stochastic synchronisation, to name a few, are prevalent in many physical systems, and highlight the importance of considering fluctuating forces in the analysis of a system's behaviour. Manifestations of these phenomena abound in nature and have been encountered in genetics [16], neuroscience [17,18], climate science [12,19,20] and other fields.
In this work, we introduce an approximation to the effective force acting on each particle; given the consistency of the employed estimator, this approximation becomes exact in the large particle number limit.
Our approach is motivated by recent developments in the field of machine learning, where GLD estimators have been studied independently and are used to fit probabilistic models to data. An application of these techniques to particle approximations for FPE is, to our knowledge, new. (The approach in [45] uses a GLD estimator different from ours for particle dynamics but with a probability flow towards equilibrium which is not given by a standard FPE.) Furthermore, our method provides straightforward approximations of entropy production rates, which are of primary importance in non-equilibrium statistical physics [46]. This article is organised as follows: Section 2 describes the deterministic particle formulation of the Fokker-Planck equation. Section 3 shows how a gradient of the logarithm of a density may be represented as the solution of a variational problem, while in Section 4 we discuss an empirical approximation of the gradient-log-density. In Section 5, we introduce function classes for which the variational problem may be solved explicitly, while in Section 6 we compare the temporal derivative of empirical expectations based on the particle dynamics with exact results derived from the Fokker-Planck equation. Section 7 is devoted to the class of equilibrium Fokker-Planck equations, where we discuss relations to Stein Variational Gradient Descent and other particle approximations of Fokker-Planck solutions. In Section 8, we show how our method may be extended to general diffusion processes with state-dependent diffusion, while Section 9 discusses how our framework may be employed to simulate second order Langevin dynamics. In Section 10 we demonstrate various aspects of our method by simulating Fokker-Planck solutions for different dynamical models. Finally, we conclude with a discussion and an outlook in Section 11.

Deterministic Particle Dynamics for Fokker-Planck Equations
We consider Fokker-Planck equations of the type

∂p_t(x)/∂t = -∇ · ( f(x) p_t(x) ) + (σ²/2) ∇²p_t(x) .   (1)

Given an initial condition p_0(x), Equation (1) describes the temporal development of the density p_t(x) for the random variable X(t) ∈ R^d following the stochastic differential equation

dX(t) = f(X(t)) dt + σ dB(t) .   (2)

In Equation (2), f(x) ∈ R^d denotes the drift function characterising the deterministic part of the driving force, while dB(t) ∈ R^d represents the differential of a vector of independent Wiener processes capturing stochastic, Gaussian white noise excitations. For the moment, we restrict ourselves to state-independent and diagonal diffusion matrices, i.e. diffusion matrices independent of X(t) (additive noise) with diagonal elements σ² characterising the noise amplitude in each dimension. Extensions to more general settings are deferred to Section 8. We may rewrite the FPE, Equation (1), in the form of a Liouville equation

∂p_t(x)/∂t = -∇ · ( g(x, t) p_t(x) )   (3)

for the deterministic dynamical system

dX/dt = g(X, t)   (4)

(dropping the time argument in X(t) for simplicity) with velocity field

g(x, t) = f(x) - (σ²/2) ∇ ln p_t(x) .   (5)

Hence, by evolving an ensemble of N independent realisations of Equation (4) (to be called 'particles' in the following) according to

dX_i/dt = f(X_i) - (σ²/2) ∇ ln p_t(X_i) ,   i = 1, . . . , N ,   (6)

we obtain an empirical approximation to the density p_t(x).
Since the only source of randomness in Equation (4) can be attributed to the initial conditions X_i(0), averages computed from the particle approximation (Equation (6)) are expected to have smaller variance compared to N independent realisations of the SDE (Equation (2)). Unfortunately, this approach requires perfect knowledge of the instantaneous density p_t(x) (Equation (5)), which is exactly the unknown quantity of interest.
Here, we circumvent this issue by introducing statistical estimators for the term ∇ ln p_t(x), computed from the entire ensemble (X_1(t), . . . , X_N(t)) of particles at time t. Although this additional approximation introduces interactions among the particles via the estimator, for sufficiently large particle number N, fluctuations of the estimator are expected to be negligible, and the limiting dynamics should converge to the mean field limit (Equation (4)), provided the estimator is asymptotically consistent. Thus, rather than computing a differentiable approximation to p_t(x) from the particles, e.g. by a kernel density estimator, we show in the following section how the function ∇ ln p_t(x) may be directly estimated from samples of p_t(x).

Variational Representation of Gradient-Log-Densities
To construct a gradient-log-density (GLD) estimator we rely on a variational representation introduced by Hyvärinen in his score-matching approach for the estimation of non-normalised statistical models [47]. We favoured this approach over other estimators [48,49] due to its flexibility to adapt to different function classes chosen to approximate the GLD.
Here, we use a slightly more general representation compared to [47], allowing for an extra arbitrary reference function r(x) = (r^(1)(x), . . . , r^(d)(x)), such that the component α of the gradient is represented as

∂_α ln p(x) = r^(α)(x) + argmin_φ L_α[φ] ,   (7)

where ∂_α ≐ ∂/∂x^(α) stands for the partial derivative with respect to coordinate α of the vector x. The cost function is defined as an expectation with respect to the density p(x) by

L_α[φ] = ∫ p(x) [ φ(x)² + 2 ∂_α φ(x) + 2 r^(α)(x) φ(x) ] dx ,   (8)

with dx representing the volume element in R^d. To obtain this relation, we use integration by parts (assuming appropriate behaviour of densities and φ at the boundaries), and get

L_α[φ] = ∫ p(x) ( φ(x) + r^(α)(x) - ∂_α ln p(x) )² dx - ∫ p(x) ( ∂_α ln p(x) - r^(α)(x) )² dx .   (9)

Minimisation with respect to φ yields Equation (7).

Gradient-Log-Density Estimator
To transform the variational formulation into a GLD estimator based on N sample points (X_1, . . . , X_N), we replace the density p(x) in Equation (8) by the empirical distribution, yielding the empirical cost

L̂_α[φ] = (1/N) Σ_{i=1}^N [ φ(X_i)² + 2 ∂_α φ(X_i) + 2 r^(α)(X_i) φ(X_i) ]   (10)

and the estimator

φ̂_α = argmin_{φ ∈ F} L̂_α[φ] ,   (11)

where F is an appropriately chosen family of functions with controllable complexity. By introducing the estimator of Equation (11) in Equation (6), we obtain a particle representation for the Fokker-Planck equation,

dX_i^(α)/dt = f^(α)(X_i) - (σ²/2) [ r^(α)(X_i) + φ̂_α(X_i) ]   (12)

for i = 1, . . . , N and α = 1, . . . , d, with φ̂_α recomputed at each time t from the current particle ensemble. Although in this article we use reference functions r(·) ≡ 0 for all simulated examples, the choice r(x) = (2/σ²) f(x), which cancels the first two terms in Equation (12), leads to interesting relations with other particle approaches for simulating Fokker-Planck solutions for equilibrium systems (cf. Section 7) and impacts the numerical approximation of φ (cf. Section 5). In particular, when considering Brownian dynamics, where f(x) = -∇U(x), the choice r(x) = -(2/σ²) ∇U(x) leads to φ(x) = 0 once the system has reached thermal equilibrium. More generally speaking, one may choose r(x) = ∇ ln p_*, where p_* is an appropriate reference measure such as the equilibrium measure of the underlying stochastic process.
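To make the scheme concrete, the following minimal sketch integrates the particle representation with forward Euler for a one-dimensional Ornstein-Uhlenbeck process, f(x) = -γx, with r ≡ 0. For a Gaussian density, the GLD is represented exactly by the linear-basis estimator discussed in Section 5, reducing to (m̂ - x)/v̂ with empirical mean m̂ and variance v̂; all numerical values below are illustrative choices, not the settings of our experiments.

```python
import numpy as np

def simulate_ou_particles(n=2000, gamma=1.0, sigma=1.0, dt=1e-3, t_end=5.0, seed=0):
    """Forward-Euler integration of the deterministic particle dynamics
    dX_i/dt = f(X_i) - (sigma^2/2) * (estimated grad-log-density)(X_i)
    for f(x) = -gamma * x.  For a Gaussian density, the GLD estimator
    with basis {1, x} is exactly (mean - x) / variance."""
    rng = np.random.default_rng(seed)
    x = rng.normal(2.0, 0.1, size=n)            # initial ensemble drawn from p_0
    for _ in range(int(t_end / dt)):
        score = (x.mean() - x) / x.var()        # GLD estimate at the particles
        x += dt * (-gamma * x - 0.5 * sigma**2 * score)
    return x

x = simulate_ou_particles()
# stationary density is N(0, sigma^2 / (2 * gamma)) = N(0, 0.5)
```

The only randomness enters through the initial conditions; the ensemble variance relaxes deterministically to the stationary value σ²/(2γ) = 0.5.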

Estimating the Entropy Rate
Interestingly, the variational approach provides us with a simple, built-in method for computing the entropy rate (temporal change of entropy) of the stochastic process (Equation (2)).
Using the FPE (1) and integration by parts, one can derive the well-known relation for the entropy S(t) = -∫ p_t(x) ln p_t(x) dx, see e.g., [50],

dS(t)/dt = (σ²/2) ⟨ ‖∇ ln p_t(X)‖² ⟩ + ⟨ ∇ · f(X) ⟩ ,   (13)

where ⟨·⟩ denotes the expectation with respect to p_t(x). The first term on the right hand side is usually called entropy production, whereas the second term corresponds to the entropy flux. In the stationary state, the total entropy rate vanishes. For equilibrium dynamics, both terms vanish individually at stationarity. This should be compared to the minimum of the cost function (Equation (9)), which for r ≡ 0 equals

Σ_{α=1}^d min_φ L_α[φ] = - ∫ p(x) ‖∇ ln p(x)‖² dx .   (14)

Thus, we obtain the estimator

dŜ(t)/dt = -(σ²/2) Σ_{α=1}^d L̂_α[φ̂_α] + (1/N) Σ_{i=1}^N ∇ · f(X_i) .   (15)

We will later see for the case of equilibrium dynamics that a similar method may be employed to approximate the relative entropy distance to the equilibrium density.
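Assuming a one-dimensional setting with r ≡ 0 and a GLD estimate already available at the particle positions, the entropy-rate estimator can be sketched as follows; the stationary Ornstein-Uhlenbeck ensemble and all parameter values are illustrative. At stationarity the total rate should vanish.

```python
import numpy as np

def entropy_rate(x, score, div_f, sigma):
    """Estimate dS/dt = (sigma^2/2) <|grad ln p|^2> + <div f>
    from particles x, their GLD estimates, and the drift divergence."""
    production = 0.5 * sigma**2 * np.mean(score**2)   # entropy production >= 0
    flux = np.mean(div_f(x))                          # entropy flux
    return production + flux

gamma, sigma = 0.5, 1.0
rng = np.random.default_rng(2)
# samples from the stationary OU density N(0, sigma^2 / (2 * gamma))
x = rng.normal(0.0, np.sqrt(sigma**2 / (2 * gamma)), size=100_000)
score = (x.mean() - x) / x.var()     # exact-family GLD estimate for Gaussian p
rate = entropy_rate(x, score, lambda s: -gamma * np.ones_like(s), sigma)
# at stationarity the total entropy rate is (approximately) zero
```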

Function Classes
In the following, we discuss choices for families of functions F leading to explicit, closed form solutions for estimators.

Linear Models
A simple possibility is to choose linearly parametrised functions of the form

φ(x) = Σ_{k=1}^{m} a_k φ_k(x) ,   (16)

where the φ_k(x) are appropriate basis functions, e.g., polynomials, radial basis functions or trigonometric functions. For this linear parametrisation, the empirical cost (Equation (10)) is quadratic in the parameters a_k and can be minimised explicitly. A straightforward computation from Equation (12) shows that

dX_i^(α)/dt = f^(α)(X_i) - (σ²/2) [ r^(α)(X_i) + Σ_{k=1}^{m} a_k^(α) φ_k(X_i) ] ,  a^(α) = -C^{-1} b^(α) ,   (17)

with C_kl = (1/N) Σ_i φ_k(X_i) φ_l(X_i) and b_k^(α) = (1/N) Σ_i [ ∂_α φ_k(X_i) + r^(α)(X_i) φ_k(X_i) ]. Here, we require the number of samples to be greater than the number of employed basis functions, i.e., N ≥ m + 1, to have a non-singular matrix C. This restriction may be lifted by adding a penalty to the empirical cost (Equation (10)) for regularisation, similar to ridge regression estimators. Equation (17) is independent of the reference function r when r belongs to the linear span of the selected basis functions. However, this model class with a finite parameter number has limited complexity.
When the particle number N grows large, we expect convergence of the dynamics to a mean field limit, given by replacing the empirical averages in Equation (17) with expectations with respect to the limiting density q_t(x); the brackets ⟨·⟩_{q_t} denote such expectations, so that, e.g., C_kl = ⟨φ_k(X) φ_l(X)⟩_{q_t}. Since the linear model class (Equation (16)) exhibits limited complexity for fixed m, we do not expect the approximated solution q_t(x) to equal the exact solution p_t(x) of the FPE. Nevertheless, in rare cases where both ∇ ln p_t(x) and r(x) are linear combinations of the employed basis functions for all times t, q_t(x) would provide an exact solution. For example, in a setting with linear drift function f(x) = -γx, reference function r(x) = 0, and dimensionality d = 1, a basis consisting of a constant φ_1(x) = 1 and a linear function φ_2(x) = x would be able to perfectly represent the GLD of the Gaussian density p_t(x).
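A minimal sketch of the closed-form linear-model estimator (one dimension, r ≡ 0): minimising the empirical cost over φ(x) = Σ_k a_k φ_k(x) gives a = -C⁻¹b, with C_kl = ⟨φ_k φ_l⟩ and b_k = ⟨∂φ_k⟩ taken as empirical averages. The basis {1, x} and the Gaussian test data below are illustrative choices.

```python
import numpy as np

def linear_score_estimate(samples, basis, basis_grad):
    """Estimate d/dx ln p from 1-d samples via the score-matching
    variational problem restricted to a linear function class (r = 0).
    Returns coefficients a with phat(x) = sum_k a_k * basis[k](x)."""
    Phi = np.stack([phi(samples) for phi in basis])           # (m, N)
    dPhi = np.stack([dphi(samples) for dphi in basis_grad])   # (m, N)
    C = Phi @ Phi.T / samples.size    # C_kl = <phi_k phi_l>
    b = dPhi.mean(axis=1)             # b_k  = <d phi_k / dx>
    return -np.linalg.solve(C, b)     # minimiser of the empirical cost

rng = np.random.default_rng(0)
x = rng.normal(1.0, 2.0, size=50_000)                  # p = N(1, 4)
a = linear_score_estimate(x, [np.ones_like, lambda s: s],
                          [np.zeros_like, np.ones_like])
score_hat = a[0] + a[1] * x
```

For Gaussian samples the estimated coefficients recover the exact score, a_0 + a_1 x ≈ (m̂ - x)/v̂, i.e. here a_1 ≈ -1/4 and a_0 ≈ 1/4.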

Kernel Approaches
Here, we consider a family F of functions for which the effective number of parameters to be computed is not fixed beforehand, but rather increases with the sample number N: a reproducing kernel Hilbert space (RKHS) of functions defined by a positive definite (Mercer) kernel K(·, ·). Statistical models based on such function spaces have played a prominent role in the field of machine learning in recent years [51].
A common, kernel-based approach to regularise the minimisation of empirical cost functions is via penalisation using the RKHS norm · RKHS of functions in F . This can also be understood as a penalised version of a linear model (Equation (16)) with infinitely many feature functions φ k . For so-called universal kernels [52], this unbounded complexity suggests that we could expect asymptotic convergence of the GLD estimator (see [53] for related results) and a corresponding convergence of the particle model to the FPE as its mean field limit. However, a rigorous proof may not be trivial, since particles in our setting are not independent.
The explicit form of the kernel-based approximation is given by

φ̂_α = argmin_{φ ∈ F} { L̂_α[φ] + λ ‖φ‖²_RKHS } ,   (18)

where the parameter λ controls the strength of the penalisation. Again, this optimisation problem can be solved in closed form in terms of matrix inverses. One can prove a representer theorem which states that the minimiser φ(x) in Equation (18) is a linear combination of kernel functions evaluated at the sample points X_i, i.e.,

φ(x) = Σ_{i=1}^N b_i K(x, X_i) .   (19)

For such functions, the RKHS norm is given by

‖φ‖²_RKHS = Σ_{i,j=1}^N b_i b_j K(X_i, X_j) .   (20)

Hence, this representation leads again to a quadratic form in the N coefficients. The solution of the minimisation problem (Equation (21)) is given by a linear system involving the kernel matrix K_ij ≐ K(X_i, X_j). Similar approaches for kernel-based GLD estimators have been discussed in [48,49]. For r(·) = 0, Equation (21) agrees with the GLD estimator of [48], derived by inverting Stein's equation, or by minimising the kernelised Stein discrepancy.
The resulting particle dynamics is given by Equation (22). Please note that here the inverse matrix also depends on the particles X_k. We may simplify Equation (22) by adding and subtracting a term λ δ_kl r(X_l) in the summation over l, with δ_kl denoting the Kronecker delta; this yields Equation (23). In the limit of small λ, the right hand side becomes independent of the reference function r.
In the present article, we employ Gaussian radial basis function (RBF) kernels given by

K(x, x′) = exp( -‖x - x′‖² / (2 l²) ) ,   (24)

with a length scale l. A different possibility would be given by kernels with a finite dimensional feature representation, which may also be interpreted as a linear model as in Equation (16) with an L2 penalty on the unknown coefficients.
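For r ≡ 0, the closed-form kernel solution coincides with the estimator of [48] obtained by inverting Stein's identity; a one-dimensional sketch with the RBF kernel of Equation (24) reads as follows (length scale l and regularisation λ are ad hoc choices):

```python
import numpy as np

def kernel_score_estimate(x, l=1.0, lam=0.1):
    """Kernel GLD estimate at the sample points (1-d case, r = 0).
    Discretising Stein's identity  E[ s(x) K(x, x') + d/dx K(x, x') ] = 0
    over the samples gives  K s = -v, regularised here by lam * I."""
    d = x[:, None] - x[None, :]              # d_ij = x_i - x_j
    K = np.exp(-d**2 / (2 * l**2))           # RBF Gram matrix
    # v_j = sum_i dK(x_i, x_j)/dx_i = -sum_i K_ij (x_i - x_j) / l^2
    v = -(K * d / l**2).sum(axis=0)
    return -np.linalg.solve(K + lam * np.eye(x.size), v)

rng = np.random.default_rng(1)
x = rng.normal(size=400)
s_hat = kernel_score_estimate(x)   # should approximate the exact score -x
```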

A Sparse Kernel Approximation
Since the inversion of the N × N matrix in Equation (22) has to be performed at each step of a time-discretised ODE system, for large particle number N the cubic complexity becomes too computationally demanding. Hence, we resort to a well-established approximation from machine learning to overcome this issue, by applying a sparse approximation to the optimisation problem of Equation (18), see e.g., [54]. In particular, we introduce a smaller set of M ≪ N inducing points {z_k}_{k=1}^M that need not necessarily be a subset of the N particles. We then minimise the penalised cost function (Equation (18)) in the finite dimensional family of functions

φ(x) = Σ_{l=1}^M c_l K(x, z_l) .   (25)

This may also be understood as a special linear parametric approximation. To keep matrices well conditioned, in practice we add a small 'jitter' term ε Σ_l c_l² to the RKHS penalty in Equation (18) as the total penalty. In the limit λ, ε → 0, this representation reduces to an approximation of the form of Equation (16) with M basis functions K(·, z_l) for l = 1, . . . , M. By introducing the corresponding kernel matrices between particles and inducing points (Equations (27) and (28)), we replace the particle dynamics of Equation (22) by its sparse analogue (Equation (30)). Hence, for this approximation we have to invert only M × M matrices. For fixed M, the complexity of the GLD estimator is limited. Results for log-density estimators in machine learning (obtained for independent data) indicate that for a moderate growth of the number of inducing points M with the number of particles N, similar approximation rates may be obtained as for full kernel approaches.
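A one-dimensional sketch of the sparse estimator (r ≡ 0, hyperparameters illustrative): only an M × M linear system is solved, regardless of the particle number N.

```python
import numpy as np

def sparse_score_estimate(x, z, l=1.0, lam=1e-3, jitter=1e-6):
    """Sparse GLD estimate (1-d, r = 0): phi(x) = sum_l c_l K(x, z_l),
    minimising the empirical score-matching cost plus RKHS and jitter
    penalties.  x: N particles, z: M inducing points."""
    N = x.size
    Psi = np.exp(-(x[:, None] - z[None, :])**2 / (2 * l**2))   # K(X_i, z_l)
    dPsi = -Psi * (x[:, None] - z[None, :]) / l**2             # d/dx K(x, z_l)
    Kmm = np.exp(-(z[:, None] - z[None, :])**2 / (2 * l**2))
    A = Psi.T @ Psi / N + lam * Kmm + jitter * np.eye(z.size)  # M x M system
    c = np.linalg.solve(A, -dPsi.mean(axis=0))
    return Psi @ c            # GLD estimate at the N particle positions

rng = np.random.default_rng(4)
x = rng.normal(size=2000)
s_hat = sparse_score_estimate(x, np.linspace(-3.0, 3.0, 20))
# for standard Gaussian samples, s_hat should approximate -x
```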

A Note on Expectations
In this section, we present a preliminary discussion of the quality of the particle method in approximating expectations of scalar functions h of the random variable X(t). We concentrate on the temporal development of the expectation ⟨h(X(t))⟩. While it would be important to obtain an estimate of the approximation error over time, we defer such an analysis to future publications and concentrate only on a result for the first time derivative of expectations, i.e., the evolution over infinitesimal times.

Using the FPE (Equation (1)) and integration by parts, one derives the exact result

d⟨h(X(t))⟩/dt = ⟨L_x h(X(t))⟩ ,   (31)

where ⟨·⟩ denotes the expectation with respect to p_t(x) and the operator L_x is defined as

L_x = Σ_{α=1}^d f^(α)(x) ∂_α + (σ²/2) Σ_{α=1}^d ∂_α² .   (32)

In Equation (32), L_x denotes the generator of the diffusion process defined by the corresponding stochastic differential equation (Equation (2)). To obtain a related result for the particle dynamics and the empirical expectations, denoted by ⟨·⟩_{p̂_t}, we employ the relation in Equation (33), where we have used the chain rule for the time derivative. We will focus initially on the dynamics based on basis functions. Expressing the sum over X_l in Equation (17) as an expectation, we obtain Equation (34). Hence, by inserting Equation (34) into Equation (33), and adding and subtracting a suitable expectation term, we obtain Equation (35), where the remainder term ∆ is given by Equation (36). For the approximation of the vectorial function ∇h(x), we have Equation (37). A simple comparison shows that each component of the vector ∇̂h(x) can be written as the minimiser of a least-squares problem whose minimiser is a linear combination of basis functions. Hence, ∇̂h(x) equals the best approximation of the vectorial function ∇h(x) based on the 'data' ∇h(X_l), using regression with basis functions. Thus, if ∇h(x) is well approximated by the basis functions, the remainder ∆ is small. If indeed ∇h(x) = Σ_{n=1}^M c_n φ_n(x) for some c_n ∈ R^d, the remainder term vanishes, ∆ = 0. By its similarity to the finite basis function model, this result should also be valid for the sparse kernel dynamics of Equation (30) when the penalty λ is small. One might conjecture that the temporal development of expectations for reasonably smooth functions might be faithfully represented by the particle dynamics. This conjecture is supported by our numerical results.
A similar result also holds for the dynamics of Equation (22). In this case, the function ∇̂h(x) in the remainder ∆ (Equation (36)) is given by a corresponding kernel expression, which also has an interpretation as an approximation of ∇h(x) by regularised kernel regression.

Equilibrium Dynamics
An important class of stochastic dynamical systems describes thermal equilibrium, for which the drift function f is the negative gradient of a potential U, while the limiting equilibrium density p_∞ is explicitly given by a Gibbs distribution:

p_∞(x) = (1/Z) exp( -(2/σ²) U(x) ) ,   (40)

with Z the normalising partition function. For this class of models, our method provides a simple and built-in estimator for the relative entropy between the instantaneous and the equilibrium density, p_t and p_∞ respectively. As we discuss here, our framework may also be related to two other particle approaches that converge to the (approximate) equilibrium density.

Relative Entropy
The relative entropy or Kullback-Leibler divergence is defined as

KL(p_t ‖ p_∞) = ∫ p_t(x) ln( p_t(x) / p_∞(x) ) dx .   (41)

Following a calculation similar to the one that led to Equation (13), we obtain

d KL(p_t ‖ p_∞)/dt = ⟨ g(X, t) · ∇ ln( p_t(X)/p_∞(X) ) ⟩ = -(2/σ²) ⟨ ‖g(X, t)‖² ⟩ ,   (42)

where g(x, t) indicates the velocity field of the particle system defined in Equation (4). The first equality holds for arbitrary drift functions. To obtain the second equality, we have inserted the explicit result for p_∞. Hence, we may compute the relative entropy at any time T as a time integral,

KL(p_T ‖ p_∞) = KL(p_0 ‖ p_∞) - (2/σ²) ∫_0^T ⟨ ‖g(X, t)‖² ⟩ dt ,   (43)

where the inner expectation is easily approximated by our particle algorithm. This result shows that the exact velocity field g(x, t) converges to 0 for t → ∞, and one expects particles to also converge to fixed points. For other, non-equilibrium systems, asymptotic fixed points are, however, the exception.

Relation to Stein Variational Gradient Descent
Recently, Stein variational gradient descent (SVGD), a kernel-based particle algorithm, has attracted considerable attention in the machine learning community [55,56]. The algorithm is designed to provide approximate samples from a given density p_∞ as the asymptotic fixed points of a deterministic particle system. Setting -ln p_∞(x) = U(x) + const, SVGD is based on the dynamics

dX_i/dt = (1/N) Σ_{j=1}^N [ -K(X_i, X_j) ∇U(X_j) + ∇_{X_j} K(X_i, X_j) ] .   (44)

This may be compared to our approximate FPE dynamics (Equation (22)) for the equilibrium case by setting σ² = 2 and r(x) = f(x) = -∇U(x). For this setting, both algorithms have, in fact, the same conditions for the 'equilibrium' fixed points. See [44] for a discussion of these fixed points for different kernel functions. However, the two dynamics differ at finite times t, where a single time step of SVGD is computationally simpler, being free of the matrix inversion required by our framework. The mean field limit N → ∞ of Equation (44) differs from the FPE limit, and the resulting partial differential equation is nonlinear [57]. Nevertheless, it is possible to interpolate between the two particle dynamics. In fact, in the limit of a large regularisation parameter λ → ∞, the inverse matrix in Equation (22) becomes diagonal, i.e., (K + λI)^{-1} ≈ (1/λ) I, and we recover SVGD (Equation (44)) by introducing a rescaled time τ ≐ t/λ. This result could be of practical importance when the goal is to approximate the stationary distribution, irrespective of the finite-time dynamics. SVGD combines faster matrix operations with slower relaxation times to equilibrium compared to the FPE dynamics. It would be interesting to see whether an optimal computational speed of a particle algorithm might be achieved at some intermediate regularisation parameter λ.
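For completeness, a minimal one-dimensional SVGD sketch for a standard Gaussian target, U(x) = x²/2 (step size, kernel scale and iteration count are illustrative choices):

```python
import numpy as np

def svgd_step(x, grad_u, eps=0.2, l=1.0):
    """One SVGD update:
    x_i += eps/N * sum_j [ -K(x_i, x_j) gradU(x_j) + d/dx_j K(x_i, x_j) ]."""
    d = x[:, None] - x[None, :]                    # d_ij = x_i - x_j
    K = np.exp(-d**2 / (2 * l**2))                 # RBF kernel
    drive = K @ (-grad_u(x)) / x.size              # kernel-smoothed drift
    repulse = (K * d / l**2).sum(axis=1) / x.size  # repulsion spreads particles
    return x + eps * (drive + repulse)

x = np.linspace(-3.0, 3.0, 100)                    # deterministic initial ensemble
for _ in range(500):
    x = svgd_step(x, lambda s: s)                  # gradU(x) = x for U = x^2/2
# the particles approximate samples from N(0, 1)
```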

Relation to Geometric Formulation of FPE Flow
Following Otto [58] and Villani [59], the FPE for the equilibrium case can be viewed as a gradient flow on the manifold of probability densities with respect to the Wasserstein metric. This formulation can be used to define an implicit Euler time discretisation method for the dynamics of the density p_t. For small times δt (and σ² = 2), this is given by the variational problem

p_{t+δt} = argmin_p { KL(p ‖ p_∞) + (1/(2 δt)) W₂²(p_t, p) }   (46)

in terms of the Kullback-Leibler divergence and the L2 Wasserstein distance W₂. The latter is given by the minimum of ⟨‖X̃ - X(t)‖²⟩ for two random variables X(t) and X̃, where the expectation is over their joint distribution with fixed marginals p_t and p. Using the dual formulation for a regularised Wasserstein distance, approximate numerical algorithms for solving Equation (46) have been developed by [60] and by [61], with applications to simulations of FPEs. We show in the following that Equation (46) may be cast into a form closely related to our variational formulation (Equation (7)) for r(x) = -∇U(x). Assuming that X̃ and X(t) are related through deterministic (transport) mappings of the form given in Equation (47), we may represent the Wasserstein distance in terms of a potential ψ, and the variational problem in Equation (46) may be reformulated as Equation (48), where p_{t+δt} is expressed through the transport map (Equation (49)). To proceed, we expand the relative entropy in Equation (48) to first order in δt, inserting the representation of Equation (49) for p_{t+δt}(x), thereby obtaining Equation (50). Minimisation, ignoring the O(δt²) terms and employing integration by parts, yields Equation (51), which is related to our cost function (Equation (8)) if we formally identify φ(x) = -∂_α ψ(x). More precisely, by replacing p_t by samples, the empirical cost function may be regularised with an RKHS norm penalty, resulting in a nonparametric estimator for the unnormalised log-density [62]. One could use this estimator as an alternative to our approach. This would lead to a simultaneous estimate of all components of the GLD. In our approach, each of the d components of the gradient is computed individually.
In this way, we avoid additional second derivatives of kernels, which would increase the dimensionality of the resulting matrices.

Extension to General Diffusion Processes
The Fokker-Planck equation for an SDE with arbitrary drift f(x) and a general, state-dependent diffusion matrix D(x) is given by

∂p_t(x)/∂t = -∇ · ( f(x) p_t(x) ) + (1/2) Σ_{i,j} ∂_i ∂_j ( D_ij(x) p_t(x) ) .   (52)

This may again be written in the form of a Liouville equation (Equation (3)), where the effective force term equals

g(x, t) = f(x) - (1/2) ∇ · D(x) - (1/2) D(x) ∇ ln p_t(x) ,   (53)

with (∇ · D(x))_i = Σ_j ∂_j D_ij(x).

Second Order Langevin Dynamics (Kramers' Equation)
For second-order Langevin equations, the system state comprises positions X ∈ R^d and velocities V ∈ R^d following the coupled SDE

dX(t) = V(t) dt ,
dV(t) = ( f(X(t)) - γ V(t) ) dt + σ dB(t) .   (54)

In Equation (54), the dynamics describe the effect of a friction force, -γV, an external force, f(X), and a fluctuating force, σ dB(t), where γ denotes the dissipation constant. In this setting, the effective deterministic ODE system is given by

dX_i/dt = V_i ,
dV_i/dt = f(X_i) - γ V_i - (σ²/2) ∇_v ln p_t(X_i, V_i) .   (56)

Considering here the equilibrium case, we set f(x) = -∇U(x), for which the stationary density equals

p_∞(x, v) = (1/Z) exp( -β H(x, v) ) ,   (57)

where β = 2γ/σ² and H(x, v) = ‖v‖²/2 + U(x) denotes the Hamiltonian function. Inserting p_∞ into Equation (56), we find that for t → ∞, the damping and the density-dependent part of the force cancel, and we are left with pure Hamiltonian dynamics,

dX_i/dt = V_i ,  dV_i/dt = -∇U(X_i) ,   (58)

for which all particles become completely decoupled, with each one conserving energy separately. Of course, this result also precludes fixed point solutions of the particle dynamics. However, this limiting dynamics captured by Equation (58) assumes the mean field limit N → ∞, together with a consistent estimate of the GLD, before taking the limit t → ∞. For GLD estimators at finite N, we expect reasonably small stationary fluctuations of individual particle energies, which were also evident in our numerical experiments. The exact asymptotic behaviour is also reflected in the expression for the change of the relative entropy for Kramers' equation. Similar to Equation (42), we obtain

d KL(p_t ‖ p_∞)/dt = -(2/σ²) ⟨ ‖ γV + (σ²/2) ∇_v ln p_t(X, V) ‖² ⟩ .   (59)

When the system approaches equilibrium, both terms in the norm cancel out and the entropy production rate converges to 0.

Simulating Accurate Fokker-Planck Solutions for Model Systems
To demonstrate the accuracy of our approach, we simulated solutions of FPEs for a range of model systems and compared the results with those obtained from direct stochastic simulations (Monte Carlo sampling) with the same particle number, as well as with analytic solutions where relevant. We tested our framework on systems with diverse degrees of nonlinearity and dimensionality, as well as with various types of noise (additive/multiplicative). We quantified the accuracy of transient and steady state solutions resulting from our method in terms of the 1-Wasserstein distance [59] and the Kullback-Leibler (KL) divergence (Appendices C and D), along with squared errors between distribution cumulants. For evaluating particle solutions for nonlinear processes, where analytical solutions of the Fokker-Planck equation are intractable, we simulated a very large number (N_∞) of stochastic trajectories that we considered to be ground truth Fokker-Planck solutions. We employed Euler-Maruyama and forward Euler integration schemes of constant step size dt = 10^-3 for the stochastic and deterministic simulations, respectively. We provide a description of the employed algorithm, along with an analysis of its computational complexity, in Appendix H, while further numerical experiments on the influence of hyperparameter values on the performance of the estimator are provided in Appendices G and F.
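For one-dimensional samples of equal size, the 1-Wasserstein distance between two empirical measures reduces to the mean absolute difference of their order statistics; the sketch below illustrates the metric (the exact evaluation protocol is described in Appendices C and D, and the test distributions here are illustrative):

```python
import numpy as np

def w1_empirical(a, b):
    """1-Wasserstein distance between two 1-d empirical measures with
    equally many samples: mean absolute difference of sorted samples."""
    assert a.size == b.size
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

rng = np.random.default_rng(3)
same = w1_empirical(rng.normal(size=10_000), rng.normal(size=10_000))
shifted = w1_empirical(rng.normal(size=10_000),
                       rng.normal(1.0, 1.0, size=10_000))
# shifting a distribution by 1 moves it a W1 distance of about 1
```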

Linear Conservative System with Additive Noise
For a two-dimensional Ornstein-Uhlenbeck process (Appendix A.1), transient and stationary densities evolved through deterministic particle simulations (D) consistently outperformed their stochastic counterparts (S) comprising the same number of particles in terms of accuracy in approximating the underlying density (Figure 1). In particular, comparing the 1-Wasserstein distance W_1(P_t^A, P_t^N) between samples from analytically derived densities P_t^A (Appendix B), considered here to reflect the ground truth, and the deterministically (D) or stochastically (S) evolved densities P_t^N, we observed smaller Wasserstein distances to ground truth for densities evolved according to our deterministic particle dynamics, both for transient (Figure 1a) and stationary (Figure 1c) solutions. Specifically, we quantified the transient deviation of simulated densities from ground truth by the average temporal 1-Wasserstein distance ⟨W_1(P_t^A, P_t^N)⟩_t (Appendix D). For small particle numbers, deterministically evolved interacting particle trajectories represented the evolution of the true probability density more reliably than independent stochastic ones, as reflected in smaller average Wasserstein distances. For increasing particle number, the accuracy of the simulated solutions of the two approaches converged. Yet, while for N = 2500 particles the stochastically evolved densities achieved on average (over trials) comparable approximation precision to their deterministic counterparts, the deterministically evolved densities delivered a given accuracy more reliably, as indicated by the smaller dispersion of Wasserstein distances among different realisations (Figure 1a,c). Likewise, we observed similar results when comparing only the stationary distributions, W_1(P_∞^A, P_∞^N) (Figure 1c).
While for small particle numbers the interacting particle system captured the underlying limiting distribution more accurately, for increasing particle number the accuracy of both approaches converged, with our method consistently delivering more reliable approximations across individual repetitions.
Moreover, densities evolved with our deterministic framework exhibited less fluctuating cumulant trajectories in time compared to their stochastic counterparts (Figure 2c). In particular, even for limited particle numbers, cumulants calculated over deterministically evolved particles progressed smoothly in time, while substantially more particles were required for the stochastic simulations to reach the same temporal cumulant smoothness. To further quantify the transient accuracy of Fokker-Planck solutions computed with our method, we compared the average transient discrepancy between the first two analytic cumulants (m_t and C_t) and those estimated from the particles (m̂_t and Ĉ_t), ⟨‖m̂_t - m_t‖²⟩_t (Figure 1b) and ⟨‖Ĉ_t - C_t‖_F⟩_t (Figure 1d), where ‖·‖_F stands for the Frobenius norm (Appendix E).
In line with our previous results, our deterministic framework delivered considerably more accurate transient cumulants when compared to stochastic simulations, with more consistent results among individual realisations, as denoted by the smaller dispersion of average cumulant differences. (Notice the logarithmic y-axis scale in Figure 1b.) Interestingly, the number of sparse points M employed in the gradient-log-density estimation had only minor influence on the quality of the solution (Figure 1a,c). This hints at substantially low computational demands for obtaining accurate Fokker-Planck solutions, since our method is computationally limited by the inversion of the M × M matrix in Equation (29).

Bi-Stable Nonlinear System with Additive Noise
For nonlinear processes, since the transient solution of the FPE is analytically intractable, we compared the transient and stationary densities estimated by our method with those obtained from stochastic simulations of N_∞ = 26,000 particles (taken as ground truth), and contrasted them against stochastic simulations with the same particle number as our method.
For a system with bi-modal stationary distribution (Appendix A.2), the resulting particle densities from our deterministic framework closely agreed with those arising from the stochastic system with N_∞ = 26,000 particles (Figure 3a). In particular, deterministically evolved distributions respected the symmetry of the underlying double-well potential, while the stochastic system failed to accurately capture the symmetric structure of the potential (Figure 3a(iii)).
Systematic comparisons of the 1-Wasserstein distance of deterministic and stochastic N-particle simulations to the "$N_\infty$" stochastic simulation comprising $N_\infty = 26{,}000$ particles revealed that our approach efficiently captured the underlying PDF already with N = 500 particles (Figure 3c,d). For increasing particle number, the stationary solutions of both systems converged to the "$N_\infty$" one. However, we observed consistently higher approximation accuracy from the deterministic simulations compared to their stochastic counterparts.
It is noteworthy that, on average, deterministic simulations of N = 500 particles conveyed a better approximation of the underlying transient PDF compared to stochastic simulations of N = 2500 particles (Figure 3c).
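In one dimension, the 1-Wasserstein distance between two equally sized particle ensembles reduces to the mean absolute difference of the sorted samples, which is how such comparisons can be carried out in practice (an illustrative sketch; the equal-sample-size restriction is ours for simplicity):

```python
import numpy as np

def wasserstein1_1d(a, b):
    """1-Wasserstein distance between two equally sized 1-D samples.

    In 1-D, W1 equals the L1 distance between the empirical quantile
    functions, i.e. the mean absolute difference of the sorted samples.
    """
    assert len(a) == len(b)
    return np.mean(np.abs(np.sort(a) - np.sort(b)))
```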
Interestingly, for small particle numbers, the number of employed inducing points M did not significantly influence the accuracy of the approximated solution. However, for increasing particle number, enlarging the set of inducing points contributed to a more accurate approximation of the Fokker-Planck solutions (Figure 3c), at the trade-off of additional computational cost.
Similar to the Ornstein-Uhlenbeck process (Section 10.1), comparing cumulant trajectories computed from both the deterministic and stochastic particle systems revealed less fluctuating cumulant evolution for densities evolved with our deterministic framework also in this nonlinear setting (Figure 3b).

Nonlinear System Perturbed by Multiplicative Noise
To assess the accuracy of our framework on general diffusion processes perturbed by state-dependent (multiplicative) noise, we simulated a bi-stable system with dynamics governed by Equation (A3) with diffusion function $D(x) = \sin^2(x)$, according to Equation (53). In this setting too, the deterministic particle distributions delivered a closer approximation of the underlying density than direct stochastic simulations. In particular, we found that the deterministically evolved distributions more accurately captured the tails of the underlying distribution, represented here by stochastic simulations of $N_\infty = 35{,}000$ particles (Figure 4a,b).
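As a reference for what the stochastic baseline involves, a multiplicative-noise diffusion can be simulated with a standard Euler-Maruyama scheme, as sketched below (the double-well drift and the exact relation between the amplitude g and D(x) are illustrative assumptions, not the paper's Equations (A3) and (53)):

```python
import numpy as np

def euler_maruyama(f, g, x0, dt, n_steps, rng):
    """Euler-Maruyama integration of dX = f(X) dt + g(X) dB, where the
    state-dependent amplitude g makes the noise multiplicative."""
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(n_steps):
        x = x + f(x) * dt + g(x) * np.sqrt(dt) * rng.normal(size=x.shape)
        traj.append(x.copy())
    return np.stack(traj)

# Illustrative bi-stable drift and sinusoidal noise amplitude (assumed forms):
f = lambda x: 4.0 * x - 4.0 * x ** 3            # double-well drift
g = lambda x: np.abs(np.sin(x))                 # noise amplitude, D(x) ~ sin^2(x)
```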
Similar to the previously examined settings, the deterministic framework delivered more reliable and smooth trajectories for the marginal statistics of the underlying distribution (Figure 4c).
Comparing the temporal average and stationary 1-Wasserstein distance (Figure 4d,f) between the optimal stochastic distributions and the deterministic and stochastic particle distributions of size N, we found that the deterministic system delivered consistently more accurate approximations, as portrayed by smaller 1-Wasserstein distances.
Interestingly, we found that for deterministic particle simulations, the number of sparse points employed in the gradient-log-density estimation mediated a moderate approximation improvement for small system sizes, while for systems comprising more than N = 2000 particles, the number of sparse points had minimal or no influence on the accuracy of the resulting distribution (Figure 4e,g).

Performance in Higher Dimensions
To quantify the scaling and performance of the proposed framework for increasing system dimension, we systematically compared simulated densities with analytically calculated ones for Ornstein-Uhlenbeck processes of dimension $D \in \{2, 3, 4, 5\}$ following the dynamics of Equation (A4), for inducing point numbers M = 100 (Figure 5) and M = 200 (Figure 6). To evaluate the simulated Fokker-Planck solutions, we calculated the Kullback-Leibler divergence between analytically evolved densities (Appendix B) and particle densities, employing the closed-form expression for the KL divergence between two Gaussian distributions (Appendix C) with the empirically estimated mean $\hat{m}_t$ and covariance $\hat{C}_t$ of the particle distributions.
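The closed-form Gaussian KL divergence used here is the standard textbook formula; a direct implementation might look as follows (function name is our own):

```python
import numpy as np

def kl_gaussians(m0, C0, m1, C1):
    """KL( N(m0, C0) || N(m1, C1) ) in closed form for D-dimensional Gaussians:
    0.5 * [ tr(C1^-1 C0) + (m1-m0)^T C1^-1 (m1-m0) - D + ln(det C1 / det C0) ].
    """
    D = len(m0)
    C1_inv = np.linalg.inv(C1)
    dm = m1 - m0
    return 0.5 * (np.trace(C1_inv @ C0) + dm @ C1_inv @ dm - D
                  + np.log(np.linalg.det(C1) / np.linalg.det(C0)))
```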
For all dimensionalities, the deterministic particle solutions approximated transient and stationary densities remarkably accurately, with a Kullback-Leibler divergence between simulated and analytically derived densities below $10^{-2}$, both for transient and stationary solutions (Figures 5a,d and 6a,d). In fact, the deterministic particle solutions delivered more precise approximations of the underlying densities than direct stochastic simulations with the same particle number. Remarkably, even for processes of dimension D = 5, deterministically evolved solutions mediated through N = 500 particles resulted in approximately the same KL divergence as stochastic particle solutions of N = 6500 particles. Our deterministic particle method also delivered consistently better approximations of the mean of the underlying densities compared to stochastic particle simulations (Figure 5b,e). Specifically, estimates of the stationary mean of the underlying distributions were more than two orders of magnitude more accurate than their stochastically approximated counterparts already for small particle numbers (Figure 5e).
Yet, the accuracy of our deterministic framework deteriorated with increasing dimension (Figure 5a,d). More precisely, while for low dimensionalities the covariance matrices of the underlying densities were accurately captured by the deterministically evolved particles, for increasing system dimension the approximations of the covariance matrices became progressively worse. Even so, for systems of dimension D = 5, covariance matrices computed from deterministically simulated solutions of N = 500 particles were of the same order of magnitude in accuracy as covariances delivered by stochastic particle simulations of size N = 6500.
However, comparing the performance of solutions obtained with different numbers of inducing points (Figures 5 and 6) reveals that, for increasing dimension, more inducing points are required to attain accurate FPE solutions. In particular, both transient and stationary KL divergences to ground truth improved remarkably for dimensions D = 4 and D = 5 when employing M = 200 inducing points in the gradient-log-density estimation (Figures 5a,d and 6a,d). In more detail, for nearly all dimensions the estimation of the covariance of the underlying distribution improved considerably, both for transient and stationary solutions (Figures 5c,f and 6c,f), while only the stationary mean for dimension D = 4 showed significant improvement (Figures 5b,e and 6b,e). Figure 6c,f reveals that, by increasing the number of inducing points, our framework captures the spread of the underlying distribution more effectively, clearly surpassing in approximation accuracy the solutions mediated by stochastic particle simulations.

Second-Order Langevin Systems
To demonstrate the performance of our framework for simulating solutions of the FPEs of second-order Langevin systems as described in Section 9, we incorporated our method into a symplectic Verlet integrator (Equations (A10)-(A12)) simulating the second-order dynamics captured by Equation (56), for a linear, $f(x) = -4x$, and a nonlinear, $f(x) = -4x^3 + 4x$, drift function (Equation (A10)), and compared the results with stochastic simulations integrated by a semi-symplectic scheme [63]. In agreement with previous results, cumulant trajectories evolved more smoothly in time for the deterministic particle simulations than for their stochastic counterparts (Figures 7a and 8c). Stationary densities closely matched the analytically derived ones (Equation (A7); purple contour lines in Figures 7b and 8b), while transient densities captured the fine details of simulated stochastic particle densities comprising $N_\infty = 20{,}000$ particles (Figure 8a). Furthermore, the symplectic integration contributed to the preservation of energy levels for each particle after the system reached equilibrium (Figures 7e and 8f), which was also evident when observing individual particle trajectories in state space (Figures 7c,d and 8d,e).
As already conveyed in Section 9, the velocity term and the gradient-log-density term cancelled out in the long time limit (Figures 7f and 8g) for each particle individually, while the average kinetic energy in equilibrium converged exactly to the value dictated by the fluctuation-dissipation relation and the equipartition of energy, i.e., $\langle K^{(i)} \rangle_N = \frac{\sigma^2}{2\gamma}$ (Figures 7g and 8h).

Nonconservative Chaotic System with Additive Noise (Lorenz Attractor)
As a final assessment of our framework for simulating accurate solutions of Fokker-Planck equations, we employed a Lorenz attractor model with parameters rendering the dynamics chaotic, perturbed by moderate additive Gaussian noise (Equation (A13)). By comparing stochastic simulations of $N_\infty = 150{,}000$ particles with deterministic and stochastic simulations of N = 4000 particles (Figure 9), we observed that the deterministic framework more precisely captured the finer details of the underlying distribution (Figure 9a), represented here by the $N_\infty$ stochastic simulation. While both stochastic and deterministic simulations captured the overall butterfly profile of the Lorenz attractor, the deterministic system indeed delivered a closer match to the underlying distribution. Similar to the previously examined models, cumulant trajectories computed from deterministically evolved particles showed closer agreement with those computed from the $N_\infty$ stochastic system, compared to the stochastic system comprising N particles (Figure 9b). In particular, cumulants of the x and y states exhibited high temporal fluctuations when computed from stochastically evolved distributions, while our framework conveyed more accurate cumulant trajectories, closer to those delivered by the $N_\infty$ stochastic system.
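For reference, the stochastic Lorenz baseline amounts to an Euler-Maruyama integration of the Lorenz drift with additive noise, roughly as follows (the classical parameter values and the noise amplitude are illustrative assumptions, not necessarily those of Equation (A13)):

```python
import numpy as np

def lorenz_drift(x, s=10.0, rho=28.0, beta=8.0 / 3.0):
    """Classical Lorenz drift with the standard chaotic parameters (assumed)."""
    return np.array([s * (x[1] - x[0]),
                     x[0] * (rho - x[2]) - x[1],
                     x[0] * x[1] - beta * x[2]])

def simulate_lorenz(x0, dt=1e-3, n_steps=5000, noise=1.0, seed=0):
    """Euler-Maruyama simulation of the Lorenz system with additive noise."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    out = np.empty((n_steps + 1, 3))
    out[0] = x
    for n in range(n_steps):
        x = x + lorenz_drift(x) * dt + noise * np.sqrt(dt) * rng.normal(size=3)
        out[n + 1] = x
    return out
```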

Discussion and Outlook
We presented a particle method for simulating solutions of FPEs governing the temporal evolution of the probability density for stochastic dynamical systems of the diffusion type. By reframing the FPE in a Liouville form, we obtained an effective dynamics in terms of independent deterministic particle trajectories. Unfortunately, this formulation requires the knowledge of the gradient of the logarithm of the instantaneous probability density of the system state, which is actually the unknown quantity of interest. We circumvented this complication by introducing statistical estimators for the gradient-log-density based on a variational formulation. To combine high flexibility of estimators with computational efficiency, we employed kernel-based estimation together with an additional sparse approximation. For the case of equilibrium systems, we related our framework to Stein Variational Gradient Descent, a particle-based dynamics to approximate the stationary density, and to a geometric formulation of Fokker-Planck dynamics. We further discussed extensions of our method to settings with multiplicative noise and to second order Langevin dynamics.
To demonstrate the performance of our framework, we provided detailed tests and comparisons with stochastic simulations and, where possible, analytic solutions. We demonstrated the accuracy of our method on conservative and non-conservative model systems of different dimensionalities. In particular, we found that our framework outperforms stochastic simulations in both linear and nonlinear settings by delivering more accurate densities for small particle numbers when the dimensionality is small enough. For increasing particle number, the accuracy of the two approaches converges. Yet, our deterministic framework consistently delivered results with smaller variance among individual repetitions. Furthermore, we showed that, even for small particle numbers, our method exhibits low-order cumulant trajectories with significantly less temporal fluctuation than stochastic simulations of the same particle number.
We envisage several ways to improve and extend our method. There is room for enhancement by optimising hyperparameters of our algorithm such as inducing point position and kernel length scale. Current grid-based and uniform random selection of inducing point position may contribute to the deterioration of solution accuracy in higher dimensions. Other methods, such as subsampling or clustering of particle positions may lead to further improvements. On the other hand, a hyperparameter update may not be at all necessary at each time step in certain settings, such that a further speedup of our algorithm could be achieved.
The implementation of our method depends on the function class chosen to represent the estimator. In this paper, we have focused on linear representations, leading to simple closed-form expressions. It would be interesting to see whether other, nonlinear parametric models, such as neural networks (see, e.g., [64]), could be employed to represent the estimators. While in that setting there would be no closed-form solutions, the small changes in estimates between successive time steps suggest that only a few updates of numerical optimisation may be necessary at each step. Moreover, the ability of neural networks to automatically learn relevant features from data might help to improve performance for higher dimensional problems, where particle motion is typically restricted to lower-dimensional submanifolds.
From a theoretical point of view, rigorous results on the accuracy of the particle approximation would be important. These would depend on the speed of convergence of estimators towards exact gradients of log-densities. However, to obtain such results may not be easy. While rates of convergence for kernel-based estimators have been studied in the literature, the methods for proofs usually rely on the independence of samples and would not necessarily apply to the case of interacting particles.
We have so far addressed only the forward simulation of FPEs. However, preliminary results indicate that related techniques may be applied to particle-based simulations for smoothing (forward-backward) and related control problems for diffusion processes [65]. Such problems involve computations of an effective, controlled drift function in terms of gradient-log-densities. We defer further details and discussion to subsequent publications on the topic.
Taken together, the main advantage of our framework is its minimal requirement in simulated particle trajectories for attaining reliable Fokker-Planck solutions with smoothly evolving transient statistics. Moreover, our proposed method is nearly effortless to set up when compared to classical grid-based FPE solvers, while it delivers more reliable results than direct stochastic simulations.
for $d \in D$. Simulation time was determined by the time required for the analytic mean $m_t$ to converge to its stationary value within a precision of $\tilde{\epsilon} = 10^{-5}$, while the integration step was set to $dt = 10^{-3}$.

Appendix A.4. Second Order Langevin Dynamics
For demonstrating the energy preservation properties of our method for second-order Langevin dynamics, we incorporated our framework into a Verlet symplectic integration scheme (Equation (A10)), and compared the results with stochastic simulations integrated according to a semi-symplectic scheme [63].
We consider a system with dynamics for positions $X$ and velocities $V$ captured by
$$dX_t = V_t \, dt, \qquad dV_t = \left( f(X_t) - \gamma V_t \right) dt + \sigma \, dB_t,$$
where the velocity change (acceleration) is the sum of a deterministic drift $f$, a velocity-dependent damping $-\gamma V$, and a stochastic noise term $\sigma \, dB_t$. In conservative settings, the drift is the gradient of a potential, $f(x) = -\nabla U(x)$. Here we used a quadratic (harmonic) potential $U(x) = 2x^2$ and a double-well potential $U(x) = x^4 - 2x^2$.
We may compute the energy of each particle at each time point as the sum of its kinetic and potential energies,
$$E^{(i)} = \frac{1}{2}\left(V^{(i)}\right)^2 + U\!\left(X^{(i)}\right),$$
where the superscript denotes an individual particle. After the system has reached equilibrium, energy levels per particle are expected to remain constant. From the equipartition of energy and the fluctuation-dissipation relation, in the long time limit the average kinetic energy of the system is expected to approach
$$\left\langle K^{(i)} \right\rangle_N = \frac{\sigma^2}{2\gamma}.$$
Symplectic integration [33] of Equation (56) follows Equations (A10)-(A12), where $n$ denotes a single integration step.
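The fluctuation-dissipation check above can be reproduced numerically by integrating an underdamped Langevin ensemble and comparing the stationary velocity statistics with $\sigma^2/(2\gamma)$. The sketch below uses the harmonic drift $f(x) = -4x$ and a semi-implicit Euler discretisation of the stochastic dynamics (our own simplified scheme, not the paper's semi-symplectic integrator [63]):

```python
import numpy as np

def stationary_velocity_moment(gamma=1.0, sigma=1.0, dt=0.01,
                               n_steps=5000, n_particles=5000, seed=0):
    """Integrate dX = V dt, dV = (-4 X - gamma V) dt + sigma dB for an
    ensemble and return <V^2> at the final time; the fluctuation-dissipation
    relation predicts the stationary value sigma^2 / (2 gamma)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n_particles)
    v = rng.normal(size=n_particles)
    for _ in range(n_steps):
        x = x + v * dt                                   # position update (old v)
        v = (v + (-4.0 * x - gamma * v) * dt
               + sigma * np.sqrt(dt) * rng.normal(size=n_particles))
    return np.mean(v ** 2)
```

With the default parameters, the returned moment should lie close to $\sigma^2/(2\gamma) = 0.5$ up to discretisation and sampling error.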
We calculated the average temporal 1-Wasserstein distance as the time average of the instantaneous 1-Wasserstein distances between the two distributions under comparison, $P^A_t$ and $P^B_t$, i.e.,
$$\left\langle W_1\!\left(P^A_t, P^B_t\right) \right\rangle_t = \frac{1}{T} \sum_{t} W_1\!\left(P^A_t, P^B_t\right) dt,$$
where, as before, $T$ stands for the duration of the simulation and $dt$ for the time discretisation step.

Appendix E. Frobenius Norm
For comparing covariance matrices of the simulated particle systems with ground truth, we employed the Frobenius norm of the difference of the relevant matrices. The Frobenius norm of an $m \times n$ matrix $A$ may be calculated as
$$\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} |a_{i,j}|^2} = \sqrt{\operatorname{trace}\!\left(A^* A\right)},$$
where $a_{i,j}$ denote the entries of matrix $A$, and $A^*$ stands for the conjugate transpose of $A$.
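Equivalently, in code (a one-line NumPy sketch):

```python
import numpy as np

def frobenius_norm(A):
    """Frobenius norm: square root of the sum of squared absolute entries,
    equivalently sqrt(trace(A^* A))."""
    return np.sqrt(np.sum(np.abs(A) ** 2))
```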

Appendix F. Influence of Hyperparameter Values on the Performance of the Gradient-Log-Density Estimator
To determine the influence of the hyperparameter values on the performance of the gradient-log-density estimator, we systematically evaluated the approximation error of our estimator on N = 1000 samples of a one-dimensional log-normal distribution with parameters $\mu = 0$ and $\sigma = 0.5$, over 20 independent realisations.
We quantified the approximation error as the average error between the analytically calculated and the predicted gradient-log-density over the samples, where the analytic gradient-log-density is
$$\nabla \ln p(x) = \frac{\mu - \sigma^2 - \ln(x)}{\sigma^2 x}.$$
By systematically varying the regularisation parameter $\lambda$, the kernel length scale $l$, and the inducing point number $M$, we observed the following:
- The hyperparameter that most strongly influences the approximation accuracy is the kernel length scale $l$ (Figure A1).
- Underestimation of the kernel length scale $l$ has a stronger impact on the approximation accuracy than overestimation (Figure A1).
- For increasing regularisation parameter $\lambda$, underestimation of $l$ has less impact on the approximation accuracy (Figures A1 and A2).
- When the kernel length scale $l$ is overestimated, the regularisation parameter $\lambda$ and the inducing point number $M$ have nearly no effect on the resulting approximation error (Figure A1).

Figure: Required particle number, $N^*_{KL}$, to attain a given time-averaged Kullback-Leibler divergence to ground truth, $\langle KL(P^A_t, P^N_t) \rangle_t$, for deterministic (green) and stochastic (brown) particle systems for a two-dimensional Ornstein-Uhlenbeck process. Markers indicate the mean required particle number, while error bars denote one standard deviation over 20 independent realisations. Grey circles indicate the required particle number for each individual realisation. The deterministic particle system consistently required at least one order of magnitude fewer particles than its stochastic counterpart. (Further parameter values: regularisation constant $\lambda = 0.001$, inducing point number $M = 100$, and RBF kernel length scale $l$ estimated at every time point as two times the standard deviation of the state vector. Inducing point locations were selected randomly at each time step from a uniform distribution spanning the state-space volume covered by the state vector.)
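The analytic gradient-log-density of the log-normal distribution used as ground truth can be verified against a finite-difference derivative of the log-density, for example:

```python
import numpy as np

MU, SIGMA = 0.0, 0.5          # log-normal parameters used in the experiment

def lognormal_score(x, mu=MU, sigma=SIGMA):
    """Analytic gradient-log-density: (mu - sigma^2 - ln x) / (sigma^2 x)."""
    return (mu - sigma ** 2 - np.log(x)) / (sigma ** 2 * x)

def lognormal_logpdf(x, mu=MU, sigma=SIGMA):
    """Log-density of the log-normal distribution."""
    return (-np.log(x * sigma * np.sqrt(2.0 * np.pi))
            - (np.log(x) - mu) ** 2 / (2.0 * sigma ** 2))

# central finite difference of the log-density as a sanity check
x = np.linspace(0.5, 3.0, 10)
h = 1e-6
fd = (lognormal_logpdf(x + h) - lognormal_logpdf(x - h)) / (2.0 * h)
```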