Variational Monte Carlo Approach to Partial Differential Equations with Neural Networks

The accurate numerical solution of partial differential equations is a central task in numerical analysis, enabling a wide range of natural phenomena to be modelled with specialized solvers chosen according to the application scenario. Here, we develop a variational approach for solving partial differential equations governing the evolution of high-dimensional probability distributions. Our approach naturally works on the unbounded continuous domain and encodes the full probability density function through its variational parameters, which are adapted dynamically during the evolution to optimally reflect the dynamics of the density. For the considered benchmark cases we observe excellent agreement with numerical solutions as well as analytical solutions in regimes inaccessible to traditional computational approaches.

Introduction. The description of nearly all processes in nature is formalized and modelled by means of differential equations, which dictate the evolution of a system given its initial state. Examples include the Navier-Stokes equations in fluid mechanics [1-4], the Schrödinger equation in quantum mechanics [5-7], and the Fokker-Planck equation governing diffusive processes [8-15]. Analytical solutions of these equations are only available in special cases and, generally, one is forced to resort to numerical techniques. Significant effort was devoted over the last century to improving the numerical solution of differential equations [16-18]. There are numerous properties a numerical solver should ideally fulfill, rendering the field quite diverse, with many specialized solvers being developed [19].
Here, we focus on modelling the dynamics of d-dimensional probability density functions (PDFs) by means of an ansatz function, which in our case is given by an artificial neural network (ANN), as illustrated in Fig. 1. We consider evolution equations of Fokker-Planck form,

∂_t p = −∑_i ∂_{x_i}(µ_i p) + ∑_{i,j} ∂_{x_i}∂_{x_j}(D_{ij} p),   (1)

where µ ∈ R^d is the drift and D ∈ R^{d×d} is the positive semi-definite diffusion matrix, and it is understood that p, µ and D are evaluated at position x and time t. PDFs arise naturally across many disciplines, describing, for example, the phase space evolution of (quantum) matter [20,21], the positions of particles subject to Brownian motion [11], the density of fluids [1] or stock prices in finance [15]. For many of these scenarios the PDF evolution is described by a diffusion process, meaning that the path of a single sampled point evolves according to a stochastic differential equation (SDE) [22]. In the limit of averaging infinitely many stochastic trajectories one recovers the evolution of the PDF.
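To make the Fokker-Planck form concrete, the following minimal sanity check (an illustration of ours in plain NumPy, not part of the paper's jax-based code) evaluates the right-hand side of Eq. (1) for a one-dimensional Ornstein-Uhlenbeck process, drift µ(x) = −x and constant diffusion D; its stationary density p(x) ∝ exp(−x²/2D) should make the RHS vanish.

```python
import numpy as np

# Illustrative check of Eq. (1): for the 1D Ornstein-Uhlenbeck process,
# mu(x) = -x and constant D, the stationary density p ~ exp(-x^2/(2D))
# satisfies  0 = -d_x(mu p) + d_x^2(D p).
D = 0.5
x = np.linspace(-5, 5, 2001)
dx = x[1] - x[0]
p = np.exp(-x**2 / (2 * D)) / np.sqrt(2 * np.pi * D)  # normalized density

mu = -x
drift_term = -np.gradient(mu * p, dx)                   # -d_x(mu p)
diff_term = np.gradient(np.gradient(D * p, dx), dx)     # d_x^2(D p)
rhs = drift_term + diff_term

# Away from the grid boundaries the RHS is numerically close to zero.
assert np.max(np.abs(rhs[100:-100])) < 1e-3
```

The finite-difference residual is limited only by the grid spacing, which already illustrates the discretization error a grid-based solver of Eq. (1) incurs.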
Consequently, the temporal evolution of probability densities can be obtained either by directly solving Eq. (1) via spatial discretization (grid-based solvers), or by solving the corresponding stochastic dynamics for a large number of sample points (particle-based solvers). The former approach, while allowing the discretization error to be controlled via the grid spacing, suffers from the curse of dimensionality [23,24], as the computational cost scales exponentially in the spatial dimension, restricting its applicability to low-dimensional cases. The latter approach solves the SDE associated with the Fokker-Planck equation through the Feynman-Kac formula for an ensemble of points sampled from the initial distribution [25,26]. While suited to computing observables, such as moments of the distribution, in high dimensions, there is no direct way to obtain estimates for functionals of the distribution, as an expression for p is lacking [27-29].
In this work, we present a new tool that overcomes the aforementioned limitations of traditional methods by combining variational Monte Carlo with normalizing flows (NFs). While variational Monte Carlo is a long-established technique in quantum many-body physics [30-33], NFs are a relatively novel class of artificial neural networks, also known as invertible neural networks (INNs) [34]. They have been applied with remarkable success to long-standing problems in statistical physics [35], inference and data generation [34,36-40], as well as quantum field theories [41,42]. Here, we understand the NF as an ansatz function for the time-dependent density. The choice of ansatz function is a degree of freedom in our approach and can be adapted to the problem at hand, exploiting prior knowledge about the function class the time-dependent density belongs to. Among the possible choices, artificial neural networks are a promising class of ansatz functions, as they may become universal function approximators in the infinite parameter limit, which applies to a lesser extent to NFs [43,44]. Adjusting the parameters of the ansatz function to the dynamics dictated by Eq. (1) is achieved by a time-dependent variational principle (TDVP), which maps the dynamics of the PDF onto the variational manifold generated by the ansatz function [32,33,45]. Crucially, the approach is self-contained and at no point relies on data generated by other solvers, in contrast to prior works using neural networks to solve PDEs [46-49], allowing us to obtain numerical solutions for tasks that are challenging for grid-based or particle-based solvers.
Our approach differs from the popular physics-informed neural networks (PINNs) [50,51] in that we do not carry out a costly global gradient-descent-based optimization in each time step to update the model's parameters, but rather follow an explicit, analytically derived time derivative of the network parameters, which is given by the TDVP. We are particularly interested in high-dimensional scenarios, which are infeasible to solve with grid-based methods, and in quantities which are not easily obtainable by simulating many stochastic processes, such as functionals of the PDF. Indeed, we show that, using the developed approach, we can reliably estimate differential entropies in a Monte Carlo fashion requiring only a few thousand samples. We benchmark our approach for the case of an eight-dimensional heat equation and a six-dimensional dissipative phase space evolution.
Normalizing flows. While we employ neural networks as ansatz functions, we emphasize that the derived TDVP is applicable to any parameterized density, such as Gaussian mixture models or energy-based estimators. We use NFs [34,36] to model densities as they have many desirable properties, among which are (i) a guarantee of normalization for any set of parameters θ, (ii) a tractable likelihood and (iii) the ability to generate independent samples without the need to resort to Markov chains. NFs parameterize densities by assuming a latent distribution π which is transformed into the distribution of interest by a trainable and invertible map f_θ,

x = f_θ(z),  z ∼ π.   (2)

Usually, π is chosen to be a 'simple' distribution, e.g. a Gaussian, such that its samples z can be generated easily. The probability associated with the point x is given by π(f_θ^{−1}(x)) times the Jacobian determinant of the transformation,

p_θ(x) = π(f_θ^{−1}(x)) |det(∂f_θ^{−1}(x)/∂x)|.   (3)

The function f_θ is composed of a series of invertible transformations, f_θ = ϕ_θ^1 ∘ ... ∘ ϕ_θ^N, which are explained in detail in the Supplemental Material (SM) [52]. Importantly, the Jacobian of the function is tractable, meaning that its determinant is efficiently obtained when computing a forward pass, an operation carried out whenever the real space probability is evaluated at some point of interest. By stacking many of these 'coupling blocks' ϕ^i, the function f_θ becomes an expressive coordinate transform that is, however, incapable of changing the tail behavior of the latent space distribution [53]. We overcome this problem by dynamically adapting the latent space distribution π to reflect dynamical changes in the tails of the distribution. This is explained in more detail below and in the SM [52].
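The mechanics of Eqs. (2)-(3) can be sketched with a toy affine coupling block (an illustration with hypothetical toy "networks", not the paper's architecture): one half of the input conditions a scale and shift on the other half, so the Jacobian is triangular and its log-determinant is simply the sum of the log-scales.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy affine coupling block in d = 4: u1 is passed through, u2 is scaled
# and shifted conditioned on u1.  W_s, W_t are stand-ins for the s and t
# networks of a real flow.
W_s = rng.normal(size=(2, 2)) * 0.1
W_t = rng.normal(size=(2, 2)) * 0.1

def s(u):                      # toy "scale network"
    return np.tanh(u @ W_s)

def t(u):                      # toy "shift network"
    return u @ W_t

def coupling_forward(u):
    u1, u2 = u[:2], u[2:]
    v2 = u2 * np.exp(s(u1)) + t(u1)
    log_det = np.sum(s(u1))    # tractable log|det J| of the forward map
    return np.concatenate([u1, v2]), log_det

def coupling_inverse(v):
    v1, v2 = v[:2], v[2:]
    u2 = (v2 - t(v1)) * np.exp(-s(v1))
    return np.concatenate([v1, u2])

z = rng.normal(size=4)                       # latent sample, z ~ pi
x, log_det = coupling_forward(z)
assert np.allclose(coupling_inverse(x), z)   # exact invertibility

# Change-of-variables formula, cf. Eq. (3), with a standard normal pi:
log_pi = -0.5 * np.sum(z**2) - 2 * np.log(2 * np.pi)
log_p_x = log_pi - log_det
assert np.isfinite(log_p_x)
```

Stacking many such blocks with alternating splits yields the expressive map f_θ, while the total log-determinant remains a cheap sum over blocks.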
Time-Dependent Variational Principle. The idea of the TDVP originated in the context of Variational Monte Carlo (VMC) [30], where it has been applied extensively to solve problems in quantum many-body physics, with a growing interest in the use of neural networks as variational ansatz functions [32,33,45,54]. Its aim is to locally search for the closest approximation to the dynamics of the density within the variational manifold. Concretely, one aims to solve

θ̇ = argmin_{θ̇} D(p_{θ(t)+τθ̇}, p_{θ(t)} + τ ṗ_{θ(t)}),   (4)

where D is a suitable distance measure between probability distributions, τ denotes a small time step, ṗ_{θ(t)} is the derivative given by Eq. (1), and θ̇ is the unknown corresponding parameter time derivative. The solution to Eq. (4) can be found by requiring the derivative with respect to θ̇ to be zero. By expanding Eq. (4) to second order in τ one finds

S θ̇ = F,  with  S_{kl} = ⟨O_k(x) O_l(x)⟩_{x∼p_θ(t)},  F_k = ⟨O_k(x) ∂_t log(p_θ(t)(x))⟩_{x∼p_θ(t)},   (5)

where O_k(x) = ∂_{θ_k} log(p_θ(t)(x)). We defer the details of this derivation to the SM [52]. Here, ∂_t log(p_θ(t)(x)) is given by the RHS of the PDE to be solved, and ⟨·⟩_{x∼p_θ(t)} denotes an expectation value evaluated through Monte Carlo sampling from the model distribution p_θ(t). Notice that we heavily rely on the differentiability of the ansatz function p_θ(t) with respect to both variational parameters and spatial coordinates.
The latter frequently appear on the RHS of Eq. (1) and are thus required for computing ṗ_θ(t). This is in striking contrast to grid-based techniques, which require finer grid cells for higher accuracy; here, instead, we have access to the exact derivatives through automatic differentiation. The choice of distance measure to compare the two probability distributions is not arbitrary, as the form of S and F directly depends on it. In order to obtain expressions for S and F that can be efficiently estimated from a finite number of samples, we found that both the Hellinger distance D_H(p, q) = 1 − F(p, q) = 1 − ∫ √(p q) dx and the Kullback-Leibler (KL) divergence D_KL(p, q) = ∫ p log(p/q) dx yield the same result of the desired form. Care has to be taken when solving Eq. (5) for θ̇, as the inverse of S may not exist. This is the case if directions in parameter space are present along which the probabilities are stationary, which can be dealt with by regularization procedures [33,54].

Problem Setup. We are interested in solving initial value problems, for which the initial density distribution p(0, x) = u(x) is given along with the RHS of Eq. (1) governing its evolution. To exactly encode the initial distribution u(x) in the model p_θ(t=0), the latent distribution is set to u(x) and the parameters of the map f_θ(t=0) are chosen such that it represents the identity map, f_θ(t=0)(x) = x. If the initial distribution cannot be given in closed form and therefore cannot be set analytically as the latent space distribution π, the network may be trained on its samples to approximately encode it at time t = 0. Then a solver is used which integrates the parameters according to Eq. (5).
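A single TDVP step can be made concrete with a deliberately simple toy model (our illustration, not the paper's implementation): a one-parameter Gaussian ansatz p_σ = N(0, σ²) evolving under the 1D heat equation ∂_t p = D ∂²_x p. The exact solution has σ̇ = D/σ, and the Monte-Carlo estimates of S and F in Eq. (5) reproduce it.

```python
import numpy as np

# One TDVP step for the toy ansatz p_sigma = N(0, sigma^2) under
# d_t p = D d_x^2 p.  Analytically, sigma_dot = D / sigma.
rng = np.random.default_rng(1)
D, sigma = 1.0, 2.0
x = rng.normal(scale=sigma, size=5000)        # samples x ~ p_sigma

# O(x) = d_sigma log p_sigma(x)
O = x**2 / sigma**3 - 1.0 / sigma
# d_t log p = (RHS of the PDE) / p = D p'' / p for the Gaussian
dt_log_p = D * (x**2 / sigma**4 - 1.0 / sigma**2)

S = np.mean(O * O)            # 1x1 "S matrix" of Eq. (5)
F = np.mean(O * dt_log_p)     # force vector of Eq. (5)
sigma_dot = F / S             # solve S sigma_dot = F

assert abs(sigma_dot - D / sigma) < 1e-10
```

Because the Gaussian manifold contains the exact solution here, the update is exact regardless of the number of samples; for an expressive flow the same linear solve is performed in the full parameter space.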
Application 1: Diffusion in High Dimensions. As a first benchmark scenario we consider the heat equation in d = 8 dimensions. The heat equation appears across many disciplines ranging from engineering [55,56] and molecular motion [11] to the pricing of financial derivatives given by the famous Black-Scholes equation [57,58], and reads

∂_t p(t, x) = D Δ_x p(t, x).   (6)

Importantly, an analytical solution exists against which we can benchmark, making the described scenario a good showcase of the proposed approach. The solution is given by a convolution of the initial distribution with the Green's function of Eq. (6) [59],

p(t, x) = ∫ u(y) (4πDt)^{−d/2} exp(−|x − y|²/(4Dt)) dy.   (7)

We aim to observe the growth of the differential entropy

S(t) = −∫ p(t, x) log(p(t, x)) dx   (8)

with time, a task which is challenging or even intractable using other numerical techniques in high dimensions for the reasons mentioned above [23,24,27-29].

[FIG. 2 caption: While analytical comparison data is available in the case of a Gaussian initial distribution, we compare to numerical data for the Student-t obtained with finite differences on a 1D grid. Inset: adjustment of the latent space distribution π by changing its parameter ν dynamically in time.]

In the case of a Gaussian distribution for p(0, x) with zero mean and unit covariance matrix, we obtain a Gaussian of larger variance at later points in time, in which case we observe perfect agreement between the analytical solution and the one obtained using the INN, as shown in Fig. 2. If we choose a Student-t distribution as initial distribution, i.e.

p(0, x) ∝ (1 + |x|²/ν)^{−(ν+d)/2}   (9)

with ν = 2, we can no longer compare to the analytical solution as the involved integrals become infeasible to solve. However, by exploiting the spherical symmetry of the problem, we can map the evolution to an effective 1D problem for the radial dependence of p, which we can approximately solve on a grid using finite differences.
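The entropy estimation underlying Fig. 2 can be sketched as follows (our illustration in NumPy): with exact samples and a tractable likelihood, Eq. (8) becomes a simple Monte-Carlo average, S ≈ −⟨log p(x)⟩_{x∼p}, checked here against the closed-form entropy of a d = 8 isotropic Gaussian.

```python
import numpy as np

# Monte-Carlo differential entropy, Eq. (8): S = -<log p(x)>_{x~p},
# for a d = 8 isotropic Gaussian with standard deviation sigma, where
# the exact value is (d/2) log(2 pi e sigma^2).
rng = np.random.default_rng(2)
d, sigma = 8, 1.5
x = rng.normal(scale=sigma, size=(5000, d))   # exact samples from p

log_p = (-0.5 * np.sum(x**2, axis=1) / sigma**2
         - 0.5 * d * np.log(2 * np.pi * sigma**2))
entropy_mc = -np.mean(log_p)
entropy_exact = 0.5 * d * np.log(2 * np.pi * np.e * sigma**2)

assert abs(entropy_mc - entropy_exact) < 0.2
```

A few thousand samples suffice because the estimator averages log-densities rather than attempting any grid quadrature in d = 8 dimensions.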
The grid-based solution and that obtained using the INN are generally in good agreement. We observe a slight difference, which we attribute to technical challenges of the grid-based approach, discussed in more detail in the SM [52].

Application 2: Diffusion in classical phase space. As a second demonstration of the proposed approach we consider classical Hamiltonian dynamics in phase space with additional diffusion. Concretely, we choose the Hamiltonian H to represent coupled harmonic oscillators (coupling strength k) which are in contact with heat baths of different temperatures T_i, such that the solution does not factorize in the eigenbasis of H. We provide the Hamiltonian and its generated phase space flow in the SM [52]. The heat baths lead to diffusion in phase space, which implies that sampled points of the distribution evolve according to an SDE. We show that the INN faithfully estimates moments of the distribution, probabilities (i.e. integrals over finite domains), as well as functionals of the PDF that correspond to integrals over the entire domain.
The described system obeys the following Fokker-Planck equation [60],

∂_t ρ(t, x, p) = [−∂_p H · ∂_x + ∂_x H · ∂_p + γ ∂_p · p + γ m k_B ∑_i T_i ∂²_{p_i}] ρ(t, x, p),   (10)

whose corresponding stochastic differential equation is given by [60]

dx_i = ∂_{p_i} H dt,  dp_i = −∂_{x_i} H dt − γ p_i dt + √(2 γ m k_B T_i) dw_i.   (11)

Here, dw_i is the Wiener process with zero average, ⟨dw_i⟩ = 0, and standard scaling dw_i² = dt, implying that its increments can be written as dw_i = √dt Π with Π drawn from a standard Gaussian, Π ∼ N(0, 1). For simplicity we choose all quantities except T_i equal to unity.
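A particle-based reference solution of the SDE can be sketched with Euler-Maruyama integration (our illustration; for brevity a single uncoupled oscillator, k = 0, with m = ω = k_B = γ = 1). The long-time statistics should approach the Gibbs ensemble, ⟨x²⟩ = ⟨p²⟩ = T.

```python
import numpy as np

# Euler-Maruyama for  dx = p dt,  dp = -x dt - p dt + sqrt(2 T) dw
# (single oscillator, k = 0, unit mass/frequency/damping).
rng = np.random.default_rng(3)
T, dt, n_steps, n_particles = 1.0, 0.01, 5000, 2000
x = np.zeros(n_particles)
p = np.zeros(n_particles)

for _ in range(n_steps):
    dw = rng.normal(scale=np.sqrt(dt), size=n_particles)  # Wiener increments
    x_new = x + p * dt
    p_new = p - x * dt - p * dt + np.sqrt(2 * T) * dw
    x, p = x_new, p_new

# Long-time (Gibbs) statistics: <x^2> = <p^2> = T.
assert abs(np.var(x) - T) < 0.15
assert abs(np.var(p) - T) < 0.15
```

Such an ensemble gives moments cheaply, but no expression for the density itself, which is exactly the gap the variational ansatz fills.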
In the case of heat baths of equal temperatures, T_i = T, and vanishing coupling (k = 0), the system instead assumes a thermal steady state of Gaussian form in the long-time limit, given by the Gibbs ensemble

ρ_∞(x, p) = exp(−H/k_B T)/Z,   (12)

with Z = ∫ exp(−H/k_B T) dx dp the partition function. The Gaussian form allows us to compare against analytical results.
We consider four quantities of interest which we evaluate by drawing 10,000 samples from the INN, see Fig. 3. The first two quantities are the means and variances of the distribution evolved for the case of different T_i and k = 1. Here, comparison against estimates from solving the SDE for the same number of sampled points is straightforward, and one observes excellent agreement between both methods. To obtain an easy benchmark case for integral and entropy estimation, we choose k = 0 and T_i = T such that the steady state is Gaussian, see Eq. (12). We choose the integration volumes to be hyperspheres of radius r centered at the origin, allowing for analytical evaluation of the Gaussian integral. The values of these integrals correspond to the probability of finding the system inside the hypersphere. Using the INN, we can estimate such integrals in a Monte Carlo fashion by uniformly sampling points x_i from inside the integration domain and averaging the associated probabilities p_θ(x_i), which are shown to converge to the analytically obtained steady-state value in Fig. 3(c).
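The hypersphere-integral estimate can be sketched as follows (our illustration, for a standard Gaussian in d = 6 as a stand-in for the steady state in suitable units): sample uniformly inside the ball, average the densities, and multiply by the ball's volume; for even d the exact value follows from the chi-squared distribution in closed form.

```python
import numpy as np

# MC estimate of P(|x| < r) for a standard Gaussian in d = 6.
rng = np.random.default_rng(4)
d, r, n = 6, 2.0, 200000

# Uniform points in the d-ball: random direction times radius ~ r * U^(1/d).
v = rng.normal(size=(n, d))
v /= np.linalg.norm(v, axis=1, keepdims=True)
radii = r * rng.uniform(size=(n, 1)) ** (1.0 / d)
pts = v * radii

dens = (2 * np.pi) ** (-d / 2) * np.exp(-0.5 * np.sum(pts**2, axis=1))
volume = np.pi ** (d / 2) * r**d / 6.0       # Gamma(d/2 + 1) = 3! = 6
prob_mc = volume * np.mean(dens)

# Closed form for d = 6 (chi-squared CDF), with t = r^2 / 2:
t = r**2 / 2
prob_exact = 1 - np.exp(-t) * (1 + t + t**2 / 2)
assert abs(prob_mc - prob_exact) < 0.01
```

The same recipe works for any domain over which one can sample uniformly, since the flow provides pointwise density values.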
Finally, we again focus on the differential entropy (Eq. (8)); Fig. 3(d) shows that our method succeeds in predicting the differential entropy with low noise while converging to the expected steady-state value.
Conclusion and Outlook. We have introduced a variational approach to the dynamics of continuous probability distributions using normalizing flows and demonstrated its power by applying it to paradigmatic benchmark problems. Our method is widely applicable, even beyond the Fokker-Planck form (1), e.g. to cases with non-local terms [61]. Its unique strength lies in estimating functionals of probability densities in high dimensions, enabled by the availability of exact samples with tractable likelihood. We emphasize that other approaches such as PINNs [50] require solving a large-scale non-convex optimization problem in each time step, which the TDVP replaces by the explicit update rule (5) (see [52] for further discussion). The form of the ansatz function can be chosen flexibly and is not required to be a neural network. The only restrictions are that (i) samples from its distribution may be obtained and (ii) derivatives with respect to inputs and parameters are computable. While building normalizing flows from stacked coupling blocks is a popular approach, other flow architectures exist, and it would be interesting to investigate their potential for solving PDEs in the future. Since the TDVP can also work with non-normalized probabilities, energy-based models would be viable ansatz functions as well, although this would mean that samples would have to be obtained by resorting to Markov chains.
For the utilized architecture we found that challenges exist when trying to solve chaotic dynamics. We believe this to be caused by the large amount of information in the phase space distribution, which needs to be encoded using comparatively few parameters. Additionally, we found it challenging to model distributions whose tail behaviour deviates from that of the latent space distribution. In the example shown in Fig. 2 this could be dealt with by elevating ν to a variational parameter, which tends to infinity for late times, representing the exact tail behaviour of the real space distribution. However, if the real space tail behaviour cannot be accurately modelled in latent space, e.g. because its form is not known beforehand, one cannot expect to accurately model the distribution on the entire domain.
Code & Data availability. The code used for this project is based on the jVMC library [62], making use of flax [63] and jax [64], and is available at GitHub: RehMoritz/vmc_pde. The repository also contains the data from Fig. 2.

Derivation of the TDVP Equation
The basic idea of a TDVP is to minimize the distance between the evolved state at time t + τ and the updated network state with respect to the parameter updates θ̇ at each time t. Here, we exemplarily derive Eq. (5) from the Hellinger distance D_H(p, q) = 1 − ∫ √(p(x) q(x)) dx, i.e. by maximizing the classical fidelity F(p, q) = 1 − D_H(p, q). As noted in the main text, an equivalent derivation is possible using the Kullback-Leibler divergence D_KL, which leads to the same result. For better readability, we drop the time index and continue with the optimality condition

∂_θ̇ ∫ √(a(x) b(x)) dx = 0,   (13)

where a and b are given by

a(x) = p_{θ+τθ̇}(x),   (14)
b(x) = p_θ(x) + τ ṗ(x).   (15)

Next we perform a second order expansion of the square root in the (small) time step τ. Writing a = p (1 + τ θ̇·O + O(τ²)) and b = p (1 + τ ∂_t log p), with O_k(x) = ∂_{θ_k} log(p_θ(x)), one obtains

√(a b) = p [1 + (τ/2)(θ̇·O + ∂_t log p) − (τ²/8)(θ̇·O − ∂_t log p)² + O(τ³)].   (16)

Using that the normalization of p is conserved under the time evolution one finds that the term linear in τ vanishes:

∫ p (θ̇·O + ∂_t log p) dx = ∂_τ ∫ a dx + ∫ ṗ dx = 0.   (17)

Thus, the optimality condition becomes

∂_{θ̇_k} ∫ p (θ̇·O − ∂_t log p)² dx = 2 ∑_l ⟨O_k O_l⟩ θ̇_l − 2 ⟨O_k ∂_t log p⟩ = 0.   (18)

Dropping the factor of 2 we obtain an equation for the optimal parameter update θ̇:

∑_l S_{kl} θ̇_l = F_k,  S_{kl} = ⟨O_k O_l⟩,  F_k = ⟨O_k ∂_t log p⟩.   (19)

Importantly, we can now evaluate the integral by sampling according to the encoded probabilities p(x), since the integrand is proportional to p(x) for both F and S. This is a unique property of the distance measures D_H and D_KL, while other distance measures, as for example the L² norm, do not lead to expressions of a form that can be efficiently evaluated from Monte Carlo samples.

The same derivation can be carried out without assuming normalization. In this case the form of S and F is altered to

S_{kl} = ⟨O_k O_l⟩ − ⟨O_k⟩⟨O_l⟩,   (20)
F_k = ⟨O_k ∂_t log p⟩ − ⟨O_k⟩⟨∂_t log p⟩,   (21)

where the last two lines are obtained using the log derivative trick,

∂_{θ_k} p_θ(x) = p_θ(x) ∂_{θ_k} log(p_θ(x)) = p_θ(x) O_k(x).   (22)

One may proceed similarly for the time derivative. Overall, this leaves us with a connected correlator structure instead of a simple correlator, with O_k the (logarithmic) variational derivative

O_k(x) = ∂_{θ_k} log(p_θ(x)).   (23)

We finally arrive at

θ̇ = S̃⁻¹ F,   (24)

where the tilde is due to the fact that we cannot invert S itself but rather need to regularize it because it is usually rank-deficient. One can show that the updates that were found are indeed maxima of the fidelity, since the Hessian of the expanded fidelity with respect to θ̇ is −(τ²/4) S, which is negative semi-definite.

Approximation Error
The adjustment of the parameters to reflect changes in the probability carries an associated error, as the parameters can usually not be changed to perfectly reflect the time derivatives of all the sampled points used to estimate F. The TDVP allows this error to be quantified by estimating the residual

r² = ⟨(∂_t log(p_θ(x)) − ∑_k θ̇_k O_k(x))²⟩_{x∼p_θ}.   (25)

Computational Complexity

Here we compare the computational complexity of the explicit variational method that we are proposing to an iterative gradient descent based technique.
The operations carried out in the gradient descent based procedure are given in Alg. 1. To summarize, the algorithm computes time derivatives at the sampled points using the differential operator F, which depends on the PDE under scrutiny (e.g. F = DΔ_x in the case of the heat equation, see Eq. (6)). The time derivatives, weighted with some small time step τ, are added to the current probability values p_θ(x_i) and define the new regression targets. Using a loss function that is minimal when the encoded distribution agrees with the new regression targets, one searches for a new solution in the parameter space. Once a convergence criterion is met, the search stops and continues with the next time step. This search can become costly, since the optimization problem is in general non-convex without convergence guarantees. It is therefore beneficial to avoid the iterative search using a closed form, as laid out in Alg. 2.

Algorithm 1: Iterative Gradient Descent [listing not reproduced; its final step is the parameter update θ ← θ + τ θ̇]

The proposed algorithm also relies on obtaining samples from the distribution, for which time derivatives are then computed. Beyond that, we require the logarithmic variational derivatives of the probabilities, O_k(x) = ∂_{θ_k} log(p_θ(x)). Obtaining these derivatives is of similar computational cost to a single gradient descent step in Alg. 1, as such a step also requires the differentiation of the probabilities at all sample positions. However, using the explicit scheme, this operation needs to be carried out only once. Additionally, we need to add these derivatives together, as shown in lines 5 and 6 of Alg. 2, an operation with negligible computational cost. To arrive at the time derivatives of the parameters θ̇, we need to invert the S matrix, which has cubic computational cost in the number of network parameters. In practice, there are many more computationally efficient options than actually computing the inverse, which reduce the computational cost of this step. While this step limits the number of network parameters that can practically be used, and thus the expressivity of the ansatz, we have not found this to be a limiting factor in the applications considered in this work.
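One standard way to avoid forming S⁻¹ (an illustration of ours, not necessarily the solver used in the accompanying code) is a matrix-free conjugate-gradient solve: S is only accessed through matrix-vector products of the form Oᵀ(Ov)/n, which is all CG needs.

```python
import numpy as np

# Solve the regularized system S theta_dot = F by conjugate gradients,
# accessing S only via matrix-vector products.
rng = np.random.default_rng(6)
n_params, n_samples = 50, 2000
O = rng.normal(size=(n_samples, n_params))   # stand-in for d_theta log p

def S_matvec(v, eps=1e-6):
    # (O^T O / n_samples + eps * I) v  -- regularized S times v
    return O.T @ (O @ v) / n_samples + eps * v

F = rng.normal(size=n_params)

def cg(matvec, b, tol=1e-10, max_iter=500):
    x = np.zeros_like(b)
    r = b - matvec(x)
    p, rs = r.copy(), r @ r
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

theta_dot = cg(S_matvec, F)
assert np.linalg.norm(S_matvec(theta_dot) - F) < 1e-8
```

Since S is symmetric positive semi-definite by construction, the small regularization eps makes it positive definite and CG converges in far fewer than n_params iterations for well-conditioned cases.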
The runtimes of the examples we presented all lie below half an hour on a single NVIDIA A100 GPU.

Normalizing Flows
For the definition of the normalizing flow we use the Real-NVP type coupling blocks introduced in [34] and depicted in Fig. 4. Each coupling block ϕ_i splits the input into two parts u_1 and u_2, which is done in a random but fixed way. The transformations in a single coupling block are defined by four networks s_1, t_1, s_2, t_2 that change the input as follows:

v_1 = u_1 ⊙ exp(s_2(u_2)) + t_2(u_2),
v_2 = u_2 ⊙ exp(s_1(v_1)) + t_1(v_1).   (26)

The inverse of this transformation is given by:

u_2 = (v_2 − t_1(v_1)) ⊙ exp(−s_1(v_1)),
u_1 = (v_1 − t_2(u_2)) ⊙ exp(−s_2(u_2)).   (27)

Here ⊙ denotes element-wise multiplication. The networks s and t are built equivalently as two-layer feedforward networks with half as many nodes in each layer as there are dimensions. In some cases we found it useful to not include the additive t networks. Additionally, we allowed the network to adjust the mean µ and the covariance matrix Σ of the distribution in latent space directly, potentially along with parameters ϑ of the distribution, e.g. ν in the case of the Student-t, such that π = π(µ_θ, Σ_θ, ϑ_θ).
We parameterize Σ using either the Cholesky decomposition or by setting Σ = 1 + A Aᵀ, where we found the latter to be more stable numerically for simulating the heat equation. Network details are listed in Table I.
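The appeal of the Σ = 1 + A Aᵀ parameterization can be verified directly (a quick illustration of ours): for any real matrix A the result is symmetric with eigenvalues bounded below by 1, so it is always a valid and well-conditioned covariance, with no constraints on A itself.

```python
import numpy as np

# Sigma = I + A A^T is symmetric positive definite for arbitrary real A:
# all eigenvalues are >= 1, since A A^T is positive semi-definite.
rng = np.random.default_rng(7)
d = 6
A = rng.normal(size=(d, d))
Sigma = np.eye(d) + A @ A.T

eigvals = np.linalg.eigvalsh(Sigma)
assert np.all(eigvals >= 1.0 - 1e-12)     # bounded away from singularity
assert np.allclose(Sigma, Sigma.T)        # symmetric by construction
```

A Cholesky parameterization also guarantees positive definiteness, but its eigenvalues can approach zero during the evolution, which is one plausible source of the numerical instability mentioned above.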

Isotropic Heat Equation as a 1D Problem
Here we describe the procedure by which the reference data for Fig. 2 of the main text was obtained in the case of a Student-t initial distribution, and give an explanation for the slight discrepancies observed there.
The heat equation can be recast as a 1D problem if the initial condition p(0, x) = u(x) features a spherical symmetry. This is the case if it is fully described by a mean µ and a covariance matrix Σ, as this allows coordinates to be rescaled such that the new distribution obeys µ = 0 and Σ = 1, enabling us to write p(t, x) as p(t, r), where r = |x|. Then, the spherical form of the Laplacian may be exploited,

Δ p(t, r) = ∂²_r p(t, r) + (d − 1)/r · ∂_r p(t, r),   (31)

where d is the dimension of the distribution. The evolution of the distribution can then be solved using finite differences on a 1D grid. Note, however, that there are caveats associated with this procedure, as the second term of Eq. (31) has a divergence for r → 0. For the diffusion cases we considered, the distribution p(t, r) has a maximum at r = 0, irrespective of the time t, implying that the numerator ∂_r p(t, r) also vanishes. This necessitates the use of L'Hôpital's rule to write

lim_{r→0} ∂_r p / r = ∂²_r p / ∂_r r = ∂²_r p.   (32)
We work with equidistant grid cells of size δ = 4 · 10^{−3} and set a cutoff at r = 100. We employ the L'Hôpital approximation for the first 10 grid cells, i.e. for r ∈ [0, 10δ], which we found to be necessary for numerical stability. This implies, however, that the reference data is also not free of approximations, which may be particularly relevant as we did not observe the INN curve to come closer to the reference data when increasing the network size. This could be viewed as an indication that it is not necessarily the INN curve that deviates from the true entropy, but rather the data obtained from the 1D grid method described in this section. The difference between the INN and the grid-based result is shown in Fig. 5.
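The radial scheme can be sketched as follows (an illustrative, simplified variant of ours: d = 3 for speed, explicit Euler, and L'Hôpital applied only at r = 0 rather than over the first 10 cells), checked against the exact isotropic Gaussian solution of the d-dimensional heat equation.

```python
import numpy as np

# Explicit Euler on  d_t p = D (d_r^2 p + (d-1)/r d_r p),  with
# L'Hopital at the origin:  d_t p(0) = D * d * d_r^2 p(0).
d, D = 3, 1.0
delta, r_max = 0.02, 8.0
r = np.arange(0.0, r_max + delta, delta)
dt, t_end = 5e-5, 0.05

def gaussian(r, var):
    return (2 * np.pi * var) ** (-d / 2) * np.exp(-r**2 / (2 * var))

p = gaussian(r, 1.0)                           # isotropic Gaussian start
for _ in range(int(round(t_end / dt))):
    d2 = np.zeros_like(p)
    d2[1:-1] = (p[2:] - 2 * p[1:-1] + p[:-2]) / delta**2
    d2[0] = 2 * (p[1] - p[0]) / delta**2       # symmetry: d_r p(0) = 0
    d1 = np.zeros_like(p)
    d1[1:-1] = (p[2:] - p[:-2]) / (2 * delta)
    rhs = D * d2
    rhs[1:-1] += D * (d - 1) / r[1:-1] * d1[1:-1]
    rhs[0] = D * d * d2[0]                     # L'Hopital at r = 0
    p = p + dt * rhs
    p[-1] = 0.0                                # absorbing outer boundary

exact = gaussian(r, 1.0 + 2 * D * t_end)       # variance grows as 1 + 2Dt
assert np.max(np.abs(p - exact)) < 1e-3
```

Even in this clean test case the scheme carries both spatial (δ²) and temporal (dt) discretization errors, consistent with the caveat that the grid-based reference data is itself approximate.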

Phase Space Evolution
The phase space evolution of the example discussed in Fig. 3 of the main text is governed by the Hamiltonian H, which we choose such that k gives the strength of the coupling between the oscillators. The resulting phase space flow is given by ẋ_i = ∂_{p_i} H, ṗ_i = −∂_{x_i} H. If one additionally considers damping in phase space, the following Fokker-Planck equation is obtained [60]:

∂_t ρ(t, x, p) = [−∂_p H · ∂_x + ∂_x H · ∂_p + γ ∂_p · p + γ m k_B ∑_i T_i ∂²_{p_i}] ρ(t, x, p).