Systematic matrix formulation for efficient computational path integration

In this work we introduce a novel methodological treatment of the numerical path integration method, used for computing the response probability density function of stochastic dynamical systems. The method is greatly accelerated by transforming the corresponding Chapman-Kolmogorov equation to a matrix multiplication. With a systematic formulation we split the numerical solution of the Chapman-Kolmogorov equation into three separate parts: we interpolate the probability density function, we approximate the transitional probability density function of the process, and we evaluate the integral in the Chapman-Kolmogorov equation. We provide a thorough error and efficiency analysis through numerical experiments on one-, two-, three- and four-dimensional problems. By comparing the results obtained through the path integration method with analytical solutions and with previous formulations of the path integration method, we demonstrate the superior ability of this formulation to provide accurate results. Potential bottlenecks are identified and a discussion is provided on how to address them. © 2022 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license.


Introduction
The probability density function (PDF) is an important characteristic of the response statistics of dynamical systems subjected to random effects. Through the time evolution or the steady state of the response PDF of these dynamical systems we can investigate various phenomena in a wide range of application areas, including engineering, finance, and the physical and life sciences. By obtaining the response PDF, one can predict important characteristics of the dynamical system, such as n-th order moments and reliability, and can improve decision-making processes, where appropriate. The use of stochastic differential equations (SDEs), where the described phenomenon is governed by established laws, is a concise and efficient tool for the description and quantitative modelling of such systems.
In the case of utilising SDEs to analyse stochastic dynamics, there are well-established methods to obtain the PDF of the state variables, such as the Monte-Carlo (MC) method or solving the corresponding Fokker-Planck equation. In the MC method we use path-wise approximations obtained through the numerical integration of the SDE. The advantage of this method is its relative simplicity and that it generalises to a large set of problem classes. However, due to the stochastic nature of the MC method, we may need a large number of sample paths to properly characterise the statistics of the investigated system, and some uncertainty still remains in the results. Alternatively, using the Fokker-Planck equation we have to solve a partial differential equation to obtain the response PDF. In general, there is no explicit analytical solution for this equation, except for a small number of special cases, thus requiring a numerical approximation, such as finite element [4,22] or finite difference [27] methods. The main advantage of this approach over the MC method is that the results are deterministic, and as such, it is free from the perturbations that stochastic simulations introduce. However, due to conditions on continuity the Fokker-Planck equation does not generalise to certain non-smooth settings, such as time-varying impacts.
In this paper we develop a novel approach within the path integral (PI) formulation to track the time evolution of the PDF of the solution of an SDE. It is based on the law of total probability captured by the Chapman-Kolmogorov (CK) equation. An essential component of the CK equation is the evaluation of an integral involving the transitional PDF (TPDF), which describes the transition from one state to another over a time interval. Since the exact TPDF is not available in most cases, it has to be approximated. One possible approach to estimate the TPDF is generalised cell mapping (GCM), where the region of interest is divided into cells, and the probability of moving from one cell to another is computed through MC simulations [15,30]. This method generalises well to a wide range of systems, but it introduces stochastic perturbations. Another approach, the Wiener Path Integral method, uses a variational formulation and the most probable path to approximate the TPDF [14,13,23,26]. The difficulty of this method is that it requires the solution of a boundary value problem corresponding to a Lagrangian functional that must be determined for each type of system that one wants to analyse, so it has been applied to only a limited number of specific systems. However, when the Lagrangian functional is available, this method is efficient at approximating the TPDF and thus can be used directly to determine the steady-state PDF. This has been shown for a number of engineering mechanics systems exhibiting diverse nonlinear, hysteretic behaviours [24], even with fractional derivative terms [25].
An approach that balances generalisability and efficiency uses the PDF of an explicit numerical time step to approximate the TPDF, thus removing the need for discrete cells and the negative effects of the stochastic perturbations in the GCM, while utilising the straightforward determination of the PDF for a large number of system classes, including non-smooth dynamical systems. Hence this method is used in numerous implementations of the PI method, as well as in this work.
Most formulations of the numerical evaluation of PI solutions to the Chapman-Kolmogorov equation lead to a computationally expensive iterative method [2,5,6,19-21], where the integral in the CK equation is evaluated directly while using an interpolation for the spatial discretisation of the PDF. Recently there have been two major efforts to reduce the computation time of the iterations within a PI formulation. One approach uses GPUs [2,8] to evaluate an iteration, greatly increasing the performance of the method, as this is a highly parallelisable task without restrictions on the problems that can be investigated. The other approach reduced the computational cost of the calculations by an order of magnitude by reformulating the CK equation as a convolution and using the FFT to transform that convolution to a matrix multiplication [9,10,17] in the Fourier space. However, there are a few restrictions with this FFT-based approach: one can apply it only to systems subjected to additive noise excitation, and it also requires that the response PDF is well described in the Fourier space. Furthermore, the previous formulations coupled the discretisation of the PDF and the evaluation of the integral in the CK equation, resulting in inefficient sampling of the TPDF and restrictive dependencies between the spatial and temporal resolutions. Related limitations are observed in the FFT-based method, which is compatible only with resolutions that are powers of two, namely $N = 2^m$, $m \in \mathbb{N}^+$.
In this work we propose a new systematic formulation of the PI approach, the Step Matrix Multiplication PI method (SMM-PI). We transform the CK equation to a matrix multiplication, without the restrictions of the FFT-based approach. This is especially useful in the case of time-invariant systems, since the matrix corresponding to the CK equation is then computed only once and used iteratively to advance the response PDF in time. An additional advantage is that there are no restrictions on the types of SDE systems that can be investigated within this formulation, since it maintains the general structure of the original PI method. Thus it can be used for a wide range of problem classes, such as systems with parametric noise or non-smooth systems. Several advantages follow from our formulation of SMM-PI, where the numerical solution of the CK equation is separated into three main tasks: interpolation of the PDF, approximation of the TPDF of the process, and evaluation of the integral of the CK equation. This structure allows the efficient treatment of each of these computations, as well as the flexibility to incorporate different interpolation and approximation methods for improved performance, depending on the problem. The approach also lends itself to the straightforward application of further efficiencies, such as parallelisation. We conduct, for the first time, a systematic error and computational cost analysis of the PI method through numerical experiments on one-, two-, three- and four-dimensional problems, comparing the results obtained through the PI method against analytical solutions. This provides a clear understanding of the improvements that can increase the performance and efficiency of the method.
As previous studies of the PI-based methods did not include a detailed discussion of the error and performance of the method, providing limited evidence for convergence, here we compare the new approach with the direct and the FFT-based evaluations of the CK equation. This systematic evaluation leads to the clear understanding of the improvements to be made to increase the efficiency of the SMM-PI, and how to achieve them. Finally, we discuss the limitations of this formulation of the PI method and how they can be addressed.
The applications of the proposed SMM-PI include a variety of stochastic systems described by SDEs. These include nonlinear systems with time-dependent drift and diffusion coefficients; moreover, the SMM-PI can be generalised to non-smooth systems as well. The SMM-PI works well for systems with a wide range of noise sources, including correlated noise. The noise may directly affect each coordinate or only some of them, as in engineering models, where noise tends to appear in the second-order equations describing, e.g., a motion. The SMM-PI is especially useful in the analysis of systems that have a steady-state behaviour and have stationary or attracting periodic response PDFs. Owing to the general nature of the SMM-PI method, it can be used in biology, physics, engineering, finance and other fields of science.
The paper is organised as follows: in Section 2.1 we introduce and discretise the Chapman-Kolmogorov equation and construct the step matrix. Next, in Section 3 we analyse the error convergence, the performance and the efficiency of the proposed method through numerical examples. Then in Section 4 we provide a discussion on the choice of resolutions for the path integration method. Finally, we draw conclusions in Section 5.

Chapman-Kolmogorov equation
We consider the following stochastic differential equation

$$dx(t) = f(x(t), t)\,dt + g(x(t), t)\,dW(t). \qquad (1)$$

We arrange the state space $x$ in such a way that the diffusion term $g$ has the following structure: $g_{i,j}(x(t), t) \equiv 0$ for $i < k \le d$ and $j = 1, 2, \ldots, m$; namely, we formulate the dynamics (1) so that the first $1, \ldots, k-1$ states are not subjected to direct noise excitation. Note that it is not uncommon that one or more components of $g$ vanish, e.g., when (1) gives equations of motion representing a mechanical or electrical system and the components of $x$ represent displacement and velocity. We characterise the behaviour of the stochastic dynamical system (1) by computing the time-dependent PDF $p(x, t)$ using the PI method. The central part of any formulation of the PI method is the Chapman-Kolmogorov equation

$$p(x, t_{n+1}) = \int p(x, t_{n+1} \mid y, t_n)\, p(y, t_n)\, dy. \qquad (3)$$

It describes the time evolution of the probability density function $p(x, t)$ between the discrete times $t_n$ and $t_{n+1}$ using the transitional probability density function, conditioned on the initial position at $t_n$. Since the solution of the Chapman-Kolmogorov equation is generally not available in a closed analytical form, we have to numerically evaluate (3) for each discrete time step $t_n$. Our systematic treatment of the Chapman-Kolmogorov equation has three key elements: approximation of the transitional probability density function (TPDF) $p(x, t_{n+1} \mid y, t_n)$, spatial discretisation of the PDF $p(x, t_n)$, and quadrature computation of the integral.
This approach leads to a number of advantages in computing the time evolution of the PDF $p(x, t)$. The formulation allows us to separately investigate the effect of the discretisation of each component and to isolate the potential bottlenecks of the method. A critical feature of our approach is the evaluation of the integral in (3) in terms of multiplication by a step matrix. In the case when the SDE (1) is time-invariant (or time-periodic), it is necessary to compute the step matrix (or a finite sequence of step matrices) only once. Furthermore, the systematic treatment of the PI method allows us to analyse each component's influence on the error convergence and computational performance, and opens the door to further efficient implementations in the future.

Approximation of the TPDF
We approximate the transitional probability density function $p(x, t_{n+1} \mid y, t_n)$ by using the probability density function of the Euler-Maruyama scheme with $x(t_n) = y$:

$$x(t_{n+1}) \approx y + f(y, t_n)\,\Delta t_n + g(y, t_n)\,\Delta W_n =: \gamma(y, t_n, t_{n+1}) + g(y, t_n)\,\Delta W_n, \qquad (4)$$

where $\Delta t_n := t_{n+1} - t_n$ is the length of the time step, and $\Delta W_n = W(t_{n+1}) - W(t_n)$ is the Wiener increment of the process $W(t)$ on the time interval $[t_n, t_{n+1}]$, with normal distribution $N(0, \sqrt{\Delta t_n})$. Since only the states from $x_k$ through $x_d$ are subjected to direct noise excitation, the approximate evolution of the states $x_1$ through $x_{k-1}$ in (4) is deterministic, while the evolution of the remaining states is stochastic. Thus it is useful to split the state space $x$ into

$$x_I = (x_1, \ldots, x_{k-1})^\top \quad \text{and} \quad x_{II} = (x_k, \ldots, x_d)^\top. \qquad (5)$$
The corresponding approximate dynamics is also separated as

$$\gamma(x, t_n, t_{n+1}) = \begin{pmatrix} \gamma_I(x_I, x_{II}, t_n, t_{n+1}) \\ \gamma_{II}(x_I, x_{II}, t_n, t_{n+1}) \end{pmatrix} \quad \text{and} \quad g(x, t_n) = \begin{pmatrix} 0 \\ g_{II}(x_I, x_{II}, t_n) \end{pmatrix}. \qquad (6)$$

In the following we use the subscripts $I$ and $II$ to denote decompositions based on (5) and (6). The probability density function of the deterministic variables $x_I$ is given in terms of a Dirac-delta function $\delta(\cdot)$, which is used to write the probability density function of (4) in terms of the decomposition,

$$p(x, t_{n+1} \mid y, t_n) = \delta\left(x_I - \gamma_I(y_I, y_{II}, t_n, t_{n+1})\right) \cdot p_{II}(x_{II}; y_I, y_{II}, t_n, t_{n+1}). \qquad (7)$$

Since (4) is the Euler-Maruyama step for (1), the function $p_{II}$ is the PDF $p_N$ of a multivariate normal distribution with parameters $\mu = \gamma_{II}$ and $\sigma = g_{II}\sqrt{\Delta t_n}$:

$$p_{II}(x_{II}; y_I, y_{II}, t_n, t_{n+1}) := p_N\left(x_{II};\, \gamma_{II}(y_I, y_{II}, t_n, t_{n+1}),\, g_{II}(y_I, y_{II}, t_n)\sqrt{\Delta t_n}\right). \qquad (8)$$

Here $w$ denotes the solution of

$$c_I(w, y_{II}, t_n, t_{n+1}) = x_I - \gamma_I(w, y_{II}, t_n, t_{n+1}) = 0, \qquad (12)$$

namely, where the argument of the multivariate Dirac-delta function $\delta(x)$ in (7) is zero. In general, the solution of (12) is not available in a closed analytical form; thus, we obtain it using the Newton-Raphson iteration. The Jacobian $\det J_{c_I}$ originates from the change of variables in the integration of the Dirac-delta function $\delta(g(x))$ of a function $g(x)$, detailed in (10). It is defined as

$$J_{c_I}(w, y_{II}, t_n, t_{n+1}) := \left.\frac{\partial}{\partial y_I} c_I(y_I, y_{II}, t_n, t_{n+1})\right|_{y_I = w}. \qquad (13)$$

We note that instead of (4) we can use other methods to approximate the TPDF of the process (1), such as higher-order numerical schemes, e.g. an explicit 4th-order Runge-Kutta (RK4) step to approximate the drift step $\gamma$, or the Milstein method to approximate the diffusion. Further demonstration and discussion is given in Section 3.1.1.
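As a concrete illustration, the following Python sketch (illustrative only; the paper's implementation is in Julia, and the drift and noise functions here are assumed examples) evaluates the Euler-Maruyama-based TPDF for a scalar SDE, where the one-step transition density is a Gaussian centred at the deterministic drift step:

```python
import numpy as np

def tpdf_euler_maruyama(x, y, f, g, dt):
    """Approximate transitional PDF p(x, t_{n+1} | y, t_n) of the scalar SDE
    dx = f(x) dt + g(x) dW as the density of one Euler-Maruyama step:
    a Gaussian centred at y + f(y)*dt with standard deviation g(y)*sqrt(dt)."""
    mu = y + f(y) * dt              # deterministic drift step
    sigma = g(y) * np.sqrt(dt)      # scale of the Wiener increment contribution
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Ornstein-Uhlenbeck-like example with additive noise: f(x) = -x, g(x) = 0.5
p = tpdf_euler_maruyama(0.01, 0.0, lambda x: -x, lambda x: 0.5, 1e-2)
```

Since the function returns a proper Gaussian density in `x` for fixed `y`, it integrates to one, which is the property the quadrature in the CK equation relies on.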

Spatial discretisation of the PDF
We approximate the PDF $p(x, t_n)$ at discrete times $t_n$ using an interpolation $\bar{p}(x, t_n)$ defined over a finite spatial region $I$:

$$p(x, t_n) \approx \bar{p}(x, t_n), \quad x \in I. \qquad (14)$$

In (14) we assume that the PDF $p(x, t_n)$ is nonzero only on the finite region $I$, which is reasonable for most stochastic processes for which a steady-state solution $p_{st}(x)$ exists. Here, $\bar{p}$ denotes an interpolation function defined by

$$\bar{p}(x, t_n) := \langle \phi(x), q_n \rangle. \qquad (15)$$

The interpolation (15) is constructed by a Cartesian product of interpolations along each dimension labelled by $j$, where $j = 1, \ldots, d$, and we use $N_j$ for the number of interpolation nodes along the $j$-th dimension. For concise notation of the interpolation we use $\langle \cdot, \cdot \rangle$ for the Euclidean inner product [7] of two rank-$d$ tensors $A, B \in \mathbb{R}^{N_1 \times \ldots \times N_d}$, which is given by

$$\langle A, B \rangle = \sum_{i_1 = 1}^{N_1} \cdots \sum_{i_d = 1}^{N_d} A_{i_1, \ldots, i_d}\, B_{i_1, \ldots, i_d}. \qquad (16)$$

The elements of the rank-$d$ tensors $\phi(x)$ and $q_n$ in the product form of the interpolation are given by

$$\phi_{i_1, \ldots, i_d}(x) = \prod_{j=1}^{d} \phi_{j, i_j}(x_j) \quad \text{and} \quad q_{i_1, \ldots, i_d, n} = \bar{p}\left(X_{(i_1, \ldots, i_d)}, t_n\right). \qquad (17)$$

This formulation allows us to decouple the interpolation functions $\phi_{j, i_j}$ from the values $q_{i_1, \ldots, i_d}$, which are recorded at the nodes of the interpolation grid. The actual form of the basis functions $\phi_{j, i_j}$ depends on the type of interpolation and the number of nodes $N_j$ that we use along the $j$-th direction. In Appendix B.3 we provide a simple demonstrative example of how the interpolation (15) is constructed for $d = 2$.
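A minimal Python sketch of this product-form construction for $d = 2$ (the piecewise-linear "hat" bases and the random node values are illustrative assumptions, not the paper's basis set from Appendix B): the interpolant is the Euclidean inner product of the basis tensor, built as an outer product of one-dimensional bases, with the node-value tensor.

```python
import numpy as np

nodes = np.linspace(-1.0, 1.0, 5)        # N_j = 5 nodes along each dimension

def hat_basis(x, nodes):
    """Values of all one-dimensional piecewise-linear (hat) basis functions."""
    h = nodes[1] - nodes[0]
    return np.clip(1.0 - np.abs(x - nodes) / h, 0.0, None)

def p_bar(x1, x2, q):
    """Interpolant <phi(x), q>: inner product of the rank-2 basis tensor
    (outer product of the 1-D bases) with the node-value tensor q."""
    phi = np.outer(hat_basis(x1, nodes), hat_basis(x2, nodes))
    return float(np.sum(phi * q))

q = np.random.default_rng(0).random((5, 5))   # recorded node values q_{i1,i2}
```

Evaluating `p_bar` at a grid node returns exactly the recorded value there, which is the node property the method relies on when advancing the PDF in time.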
In this work we compare the linear, cubic, quintic, barycentric (Lagrangian), and trigonometric interpolations. The basis functions $\phi_{j, i_j}$ of these interpolations are listed in Appendix B. We refer to the linear, cubic, and quintic interpolations as sparse interpolations, as their basis functions $\phi_{j, i_j}(x_j)$ are all zero, except for two, four or six of them, respectively. Similarly, we refer to the barycentric and trigonometric interpolations as dense, as all their basis functions $\phi_{j, i_j}(x_j)$ have nonzero values. We implement the barycentric interpolation as Chebyshev interpolation, emphasising that we record the data $q_{i_1, \ldots, i_d}$ at the Chebyshev points. This avoids the Runge phenomenon, an oscillation typically appearing at the edges of the interpolation interval for higher-order interpolation approximations based on equally spaced nodes.
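The Runge phenomenon can be demonstrated in a few lines (an illustrative Python experiment, not taken from the paper): interpolating the classic test function $1/(1 + 25x^2)$ through 15 equally spaced nodes oscillates badly near the interval edges, while the same-degree interpolant through Chebyshev points stays accurate.

```python
import numpy as np

def lagrange_eval(xn, yn, x):
    """Evaluate the degree-(N-1) interpolating polynomial through the points
    (xn, yn) at the locations x, using the direct Lagrange form (fine for
    small N; barycentric weights would be used for larger N)."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    for i in range(len(xn)):
        li = np.ones_like(x)
        for j in range(len(xn)):
            if j != i:
                li *= (x - xn[j]) / (xn[i] - xn[j])
        out += yn[i] * li
    return out

runge = lambda x: 1.0 / (1.0 + 25.0 * x**2)     # classic Runge test function
N = 15
x_eq = np.linspace(-1.0, 1.0, N)                      # equally spaced nodes
x_ch = np.cos((2*np.arange(N) + 1) * np.pi / (2*N))   # Chebyshev points
xs = np.linspace(-1.0, 1.0, 1001)
err_eq = np.max(np.abs(lagrange_eval(x_eq, runge(x_eq), xs) - runge(xs)))
err_ch = np.max(np.abs(lagrange_eval(x_ch, runge(x_ch), xs) - runge(xs)))
```

The maximum error at the equispaced nodes exceeds the Chebyshev-node error by orders of magnitude, which is why the dense interpolation records its data at the Chebyshev points.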
A common property of each interpolation approach is that at the nodes $X_{(i_1, \ldots, i_d)}$

$$\bar{p}\left(X_{(i_1, \ldots, i_d)}, t_n\right) = q_{i_1, \ldots, i_d, n}. \qquad (18)$$

In order to compute the interpolated PDF $\bar{p}(x, t_{n+1})$ at the next discrete time $t_{n+1}$ we utilise property (18) to compute the data values $q_{i_1, \ldots, i_d, n+1}$. For the remainder of this paper, as an abuse of notation, we use a single index $(i)$ for quantities previously indexed with $i_1, \ldots, i_d$, e.g. $q_{(i), n} := q_{i_1, \ldots, i_d, n}$ and $X_{(i)} := X_{(i_1, \ldots, i_d)}$. Substituting the TPDF (8) and (11) into the Chapman-Kolmogorov equation, and noting that $q_n$ behaves as a constant with respect to $y_{II}$, we can interchange the integral and the Euclidean inner product, leading to

$$q_{(i), n+1} = \left\langle U_{(i), n}, q_n \right\rangle, \qquad (21)$$

with

$$U_{(i), n} = \int_{I_{(k,d)}} \frac{p_{II}\left(X_{(i), II};\, W_{(i)}, y_{II}, t_n, t_{n+1}\right)}{\left|\det J_{c_I}(W_{(i)}, y_{II}, t_n, t_{n+1})\right|}\, \phi\left(W_{(i)}, y_{II}\right) dy_{II}, \qquad (22)$$

where $W_{(i)}$ is defined as the solution of $X_{(i), I} - \gamma_I(W_{(i)}, y_{II}, t_n, t_{n+1}) = 0$ (similarly to (12)) and $I_{(k,d)} := \prod_{j=k}^{d} I_j$. As (21) gives a single $q_{(i), n+1}$, we repeat (19) for each node value $q_{(i), n+1}$ and organise each resulting $U_{(i), n}$ into a step matrix $S_n$ to obtain the matrix multiplication

$$\mathrm{vec}(q_{n+1}) = S_n\, \mathrm{vec}(q_n). \qquad (23)$$
This form is equivalent to the numerical evaluation of (3) and represents a time step in the approximation of the PDF $p(x, t)$. The operator $\mathrm{vec}(A)$ reshapes the $d$-dimensional tensor $A$ into a column vector. The decoupling of the interpolation function $\phi(x)$ from the recorded node values $q_n$ in (15) allows us to transform the evaluation of the Chapman-Kolmogorov Eq. (3) into the matrix multiplication (23). In the case that (1) is time-invariant ($f(x, t) \equiv f(x)$ and $g(x, t) \equiv g(x)$) and we want to investigate the evolution or the steady-state solution of the PDF $p(x, t)$, we have to compute the step matrix $S_n \equiv S$ only once, given that we use a constant time step $\Delta t$ for each $n$. Then advancing the PDF $p(x, t)$ by a single time step is a very efficient operation, as it is achieved via the matrix-vector multiplication defined in (23). Similarly, in the case that (1) is a $T_p$-periodic system ($f(x, t) = f(x, t + T_p)$ and $g(x, t) = g(x, t + T_p)$), we have to compute the set of $S_n$, $n = 1, \ldots, p$ (where $p = T_p / \Delta t$ is the discrete period), and then just successively evaluate the matrix multiplication (23) to advance the PDF $p(x, t)$ in time.
The main computational cost of the SMM-PI is in finding $S_n$, since matrix multiplication is one of the best optimised numerical tasks. Given that the step matrix $S_n$ is already computed, the matrix multiplication (23) is one of the fastest methods to approximate (3). This is especially useful when investigating time-invariant or periodic systems that require a large number of time steps to reach their steady state.
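The build-once, multiply-many-times workflow can be sketched end to end for a scalar Ornstein-Uhlenbeck process (a Python illustration with assumed parameters, not the paper's Julia implementation; it uses the simplest possible ingredients, an Euler-Maruyama TPDF and trapezoidal quadrature on the grid nodes, rather than the higher-order interpolation and Gauss-Legendre subnodes of the SMM-PI):

```python
import numpy as np

# dx = -x dt + sigma dW has the stationary PDF N(0, sigma^2 / 2).
sigma, dt = 0.5, 0.01
N = 201
x = np.linspace(-3.0, 3.0, N)
w = np.full(N, x[1] - x[0]); w[0] *= 0.5; w[-1] *= 0.5   # trapezoid weights

mu = x + (-x) * dt              # drift step: y + f(y) dt for each node y
s = sigma * np.sqrt(dt)         # std of the Wiener increment contribution
# S[i, j] ~ w_j * p_N(x_i; mu_j, s): one CK step as a matrix multiplication
K = np.exp(-0.5 * ((x[:, None] - mu[None, :]) / s) ** 2) / (s * np.sqrt(2*np.pi))
S = K * w[None, :]              # step matrix, computed only once

q = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)   # initial PDF: standard normal
for _ in range(2000):                            # advance 2000 steps (t = 20)
    q = S @ q
    q /= np.sum(q * w)                           # renormalise the total mass

p_exact = np.exp(-x**2 / sigma**2) / (sigma * np.sqrt(np.pi))  # N(0, sigma^2/2)
```

After the transient, the iterated node values agree with the analytical stationary density up to the Euler-Maruyama time-discretisation bias, illustrating why precomputing $S$ makes long transient simulations cheap.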
Note that we have to choose the limits of the interpolated region $I$ in such a way that it covers the entire region where the PDF $p(x, t)$ takes nonzero values during the investigated time frame $t \in [0, T]$.

Approximation of the integral
The final component of the solution of the Chapman-Kolmogorov equation (3) is to evaluate the integral (22). Depending on the time step $\Delta t_n$ and the state $X_{(i)}$, there is only a small region $I_{(k,d,i)} \subset I_{(k,d)}$ where the TPDF $p_{II}$, and thus the kernel of (22), is nonzero. To minimise the number of subnodes required for an accurate numerical approximation of $U_{(i), n}$, we aim to evaluate the kernel of (22) only where it is relevant, namely, over the region $I_{(k,d,i)}$. In order to determine $I_{(k,d,i)}$, we first identify the initial state $\xi$ of the drift step from which the system evolves to $X_{(i)}$ by solving the equation

$$X_{(i)} - \gamma(\xi, t_n, t_{n+1}) = 0. \qquad (24)$$

Next, we take the covariance matrix $\Sigma(\xi, t_n, t_{n+1}) = g(\xi, t_n)\, g(\xi, t_n)^\top \Delta t_n$ and use its spectral radius $\rho(\Sigma)$ to compute the largest standard deviation $\sigma$, i.e.,

$$\sigma = \sqrt{\rho\left(\Sigma(\xi, t_n, t_{n+1})\right)}. \qquad (25)$$

With the help of $\sigma$ we determine the new region $I_{(k,d,i)} := I^{(\sigma)}_{(k,d,i)} \cap I_{(k,d)}$, over which the integral in (22) is evaluated, with

$$I^{(\sigma)}_{(k,d,i)} := \prod_{j=k}^{d} \left[\xi_j - s\sigma,\ \xi_j + s\sigma\right]. \qquad (26)$$

Since we use a small time step $\Delta t_n$, we assume that the distribution in $y$ is approximately Gaussian; thus we choose $s = 6$ to ensure that the whole region where the TPDF $p_{II}$ is nonzero is covered. Using the region $I_{(k,d,i)}$ instead of $I_{(k,d)}$ we restrict the integration to where the TPDF is non-negligible.
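For a scalar state the region determination reduces to a few lines. The sketch below (with a hypothetical drift function and assumed additive noise, purely for illustration) finds the initial state of the drift step, called `xi` here, with a hand-coded Newton-Raphson iteration, and clips the resulting six-standard-deviation box to the interpolation region:

```python
import numpy as np

f = lambda x: -x + 0.1 * x**3      # illustrative drift (not from the paper)
g_const, dt, s = 0.5, 0.01, 6.0
X, I = 0.8, (-3.0, 3.0)            # target node and interpolation region

xi = X                             # Newton-Raphson for X - (xi + f(xi) dt) = 0
for _ in range(20):
    r = X - (xi + f(xi) * dt)                    # residual of the drift step
    drdxi = -(1.0 + (-1.0 + 0.3 * xi**2) * dt)   # d r / d xi (f' = -1 + 0.3 xi^2)
    xi -= r / drdxi

sigma = np.sqrt(g_const**2 * dt)   # largest std dev from the covariance g g^T dt
lo, hi = max(I[0], xi - s * sigma), min(I[1], xi + s * sigma)
```

The quadrature subnodes are then placed only on `[lo, hi]`, which is far narrower than the full region whenever the time step is small.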
In order to evaluate the integral (22) over $I_{(k,d,i)}$ with high accuracy and performance, we use the Gauss-Legendre (GL) quadrature [1]. For integrating a function $f: [-1, 1] \to \mathbb{R}$ it reads

$$\int_{-1}^{1} f(z)\, dz \approx \sum_{i=1}^{m} w_i\, f(z_i). \qquad (27)$$

Here, the nodes $z_i$, $i = 1, \ldots, m$, are defined by the roots of the $m$-th Legendre polynomial $P_m(z)$, where $m$ is the number of sample points used, and the weights $w_i$ are given by the formula

$$w_i = \frac{2}{\left(1 - z_i^2\right)\left[P'_m(z_i)\right]^2}. \qquad (28)$$

In the case of $k = d$ we can apply (27) directly by rescaling the integration lattice values $z_i$ and weights $w_i$ appropriately for the chosen interval. To apply the GL quadrature over the region $I_{(k,d,i)}$ for $k < d$, we need to rescale the integration lattice values $z_i$ and weights $w_i$ in each dimension $j = k, \ldots, d$ constructing $I_{(k,d,i)}$ and take their Cartesian product in a manner similar to (15).
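The rescaling of the GL lattice to an arbitrary interval $[a, b]$ can be sketched as follows (an illustrative Python snippet using NumPy's built-in Gauss-Legendre nodes; the narrow Gaussian kernel and its six-sigma region are assumed example values):

```python
import numpy as np

def gl_integrate(func, a, b, m):
    """Gauss-Legendre quadrature rescaled from [-1, 1] to [a, b]."""
    z, w = np.polynomial.legendre.leggauss(m)    # nodes and weights on [-1, 1]
    y = 0.5 * (b - a) * z + 0.5 * (b + a)        # rescale nodes to [a, b]
    return 0.5 * (b - a) * np.sum(w * func(y))   # rescale weights by (b-a)/2

# Integrate a narrow Gaussian kernel over its +/- 6 sigma region, as in the
# restricted-region evaluation of the kernel integral.
sig = 0.05
gauss = lambda y: np.exp(-0.5 * (y / sig) ** 2) / (sig * np.sqrt(2.0 * np.pi))
val = gl_integrate(gauss, -6.0 * sig, 6.0 * sig, 32)
```

A modest number of nodes recovers the unit mass of the kernel to high precision, illustrating why GL subnodes over the restricted region outperform uniform sampling over the full region.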
The formulations of the PI method in previous works [17,19,20] relied on uniformly distributed node values in a cubic interpolation to evaluate the integral directly. Utilising the GL quadrature approximation combined with the integration over the restricted region $I_{(k,d,i)}$ greatly increases the accuracy and performance of the (approximate) evaluation of the integral (22) [29]. We discuss this further in Section 4.

Numerical experiments
In this section we test the error convergence and the computational efficiency of the PI method through numerical experiments. The first example we investigate is a simple nonlinear scalar system with additive noise (29), where $N(0, 1)$ refers to the standard normal distribution. The steady-state PDF of (29), shown in Fig. 1a, is given in (30) in terms of the modified Bessel function of the first kind $I_a(z)$.
The second system we discuss is a second-order system ($d = 2$), namely an oscillator with cubic nonlinearity (31). Here we use the states $x$ and $v$, which represent the displacement and velocity of the oscillator, respectively. The state vector is defined as $x = [x\ v]^\top$ and the initial distribution is a two-dimensional standard normal distribution, i.e., $x(0) \sim N(0, I)$. The stationary PDF $p_{st}(x)$ corresponding to (31) is given by (32), where $C_1$ is the normalisation constant. For the tests we choose $\zeta = 0.15$, $k = 0.25$ and $\sigma^2 = 0.075$, which results in a symmetric steady-state PDF $p_{st}(x)$ with two well-separated peaks, as Fig. 1b shows. The approximation of $p^{PI}_{st}(x)$ with dense interpolations is then challenging, as the contrast of large flat regions with the rapid rise of the peaks makes the interpolations prone to high-frequency components.
To benchmark the performance of the PI method for $d = 3$ and $d = 4$, we apply the method to a linear oscillator subjected to first-order filtered noise and to a coupled oscillator. The governing equations of the linear oscillator for $d = 3$ are

$$dx(t) = v(t)\, dt, \qquad dv(t) = \left(-2\zeta v(t) - x(t) + \sigma Z(t)\right) dt,$$

where the states $x$ and $v$ represent the displacement and velocity. In the coupled oscillator ($d = 4$) the states $x$ and $y$ represent the displacements, and $v_x$ and $v_y$ the corresponding velocities.
In the following sections we test the accuracy and performance in the case of the scalar system ($d = 1$) and the nonlinear oscillator ($d = 2$), while for the $d = 3$ and $d = 4$ dimensional systems we approximate the CPU time required to compute the step matrix $S_n$ and the matrix multiplication (equivalent to taking a time step). Furthermore, as all the systems described by Eqs. (29)-(34) are time-invariant, we omit the time step index $n$ where it is unnecessary, i.e. $S_n \equiv S$ and $U_{(i), n} \equiv U_{(i)}$.

Error of the path integration method
First, we test the error convergence of the PI method for different temporal and spatial resolutions for the systems described by (29) and (31).

Error due to the temporal discretisation
First we consider the effect of the temporal resolution $\Delta t$ on the error $e_1$ of the steady-state PDF approximated with the PI method. In the Runge-Kutta-Maruyama method the drift is approximated with the 4th-order Runge-Kutta scheme (detailed in Appendix A), while the diffusion is approximated with the Maruyama scheme.
Utilising the RK4 method adds additional complexity to the computation required to solve (12) and determine $J_{c_I}$. For $d > 1$ this might translate to better approximations of the interaction between the state variables; however, it does not increase the error convergence rate as $\Delta t_n \to 0$, as that rate is limited by the weak order of the diffusion step, which is $O(\Delta t)$ (or $e_1 \propto \Delta t$) for the Maruyama approximation of the diffusion. The argument against the use of the Milstein method for the approximation of the diffusion is similar. The computation of the distribution of the Milstein step is very complex in the general case, with $k < d$ and non-diagonal diffusion, while not providing better convergence properties: both the Milstein and the Maruyama approaches yield a weak $O(\Delta t)$ approximation of the diffusion [12]. For the scalar system (29), Fig. 2a shows that we achieve almost the same accuracy regardless of the time-stepping method used. In contrast, as Fig. 2b shows, for the second-order system (31) ($d = 2$) the use of the RK4 method provides higher accuracy, because it better approximates the dynamics of the state variables described by the drift. However, the two time-stepping methods have the same error convergence rate, which is limited by the approximation of the noise term. This is illustrated by the auxiliary line that corresponds to a first-order error convergence ($e_1 \propto \Delta t^1$).
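To make the drift-order argument concrete, a toy comparison (illustrative, not the paper's code) of a single explicit Euler step against a single classical RK4 step for the deterministic drift $dx/dt = -x$ shows the accuracy gap that RK4 offers for the drift part, even though the weak order of the combined scheme remains $O(\Delta t)$ due to the Maruyama diffusion step:

```python
import numpy as np

f = lambda x: -x                   # drift of the deterministic test problem
x0, dt = 1.0, 0.1

euler = x0 + f(x0) * dt            # one explicit Euler step

k1 = f(x0)                         # one classical RK4 step
k2 = f(x0 + 0.5 * dt * k1)
k3 = f(x0 + 0.5 * dt * k2)
k4 = f(x0 + dt * k3)
rk4 = x0 + dt * (k1 + 2.0*k2 + 2.0*k3 + k4) / 6.0

exact = x0 * np.exp(-dt)           # exact flow of dx/dt = -x
err_euler, err_rk4 = abs(euler - exact), abs(rk4 - exact)
```

The per-step drift error drops by several orders of magnitude with RK4, which is what improves the $d = 2$ results in Fig. 2b without changing the overall convergence rate.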

Error due to spatial discretisation
Next, we consider the effect of the spatial resolution $N$ and the choice of the interpolation method on the error $e_1$ of the steady-state PDF approximated with the PI method. For the scalar system (29) we choose the interpolated region as $x \in [-3, 3]$, while for the second-order system (31) we choose $x, v \in [-3.5, 3.5]$ and use the same spatial resolution $N$ for the interpolation along both dimensions $x$ and $v$. Fig. 3 shows the errors of the different interpolation methods as functions of the number of nodes $N$ used for the interpolation, and compares different time steps $\Delta t$. Based on the results of the numerical experiments, we see that the linear interpolation is the least reliable spatial discretisation method. Even though the solution corresponding to the linear interpolation converges with $O(N^{-2})$, it does so only after a plateau. For both the scalar ($d = 1$) and second-order ($d = 2$) systems, this plateau persists to larger node numbers $N$ as we decrease $\Delta t$, in some cases for the entire range of $N$ investigated. The cubic and quintic interpolations display faster convergence rates of $O(N^{-3})$ and $O(N^{-5})$, respectively, and are less susceptible to resolution errors generated by the decrease of the time step $\Delta t$. The Chebyshev and trigonometric interpolations show an exponential convergence for all time resolutions $\Delta t$, so that already for smaller values of $N$ the maximum achievable accuracy, as dictated by the time-stepping method, is attained. This is particularly clear for the scalar system ($d = 1$): in Fig. 3a the Chebyshev interpolation achieves this minimum error $e_1$ at $N = 25$ for $\Delta t = 10^{-3}$ and $N = 37$ for $\Delta t = 10^{-5}$, while for the cubic interpolation it is achieved only for $N > 100$ ($\Delta t = 10^{-3}$), or not at all in the investigated region ($\Delta t = 10^{-5}$). The accuracy obtained with the quintic interpolation lies between that of the cubic and the dense interpolations.
In contrast, for the second-order system ($d = 2$, Fig. 3b) the use of sparse interpolations can produce results with an $e_1$ error similar to that obtained by the dense interpolations for larger time steps ($\Delta t = 10^{-1}$). The quintic interpolation performs especially well, with its convergence rate appearing exponential for the largest time step $\Delta t = 10^{-1}$, similar to the dense interpolations. However, as we decrease the time step $\Delta t$, the dense interpolations unequivocally produce more accurate results for larger spatial resolutions $N$, until the accuracy is limited by the time discretisation. In Fig. 3 we observe another phenomenon: as we decrease the time step $\Delta t$, the error can increase if we do not also increase the number of nodes $N$ of the interpolation. Fig. 4 illustrates how the interaction of the resolutions for $d = 1$ can influence the solutions obtained through the SMM-PI method for $\Delta t = 10^{-2}, 10^{-3}, 10^{-5}$ and $N = 15, 25, 51$, compared to the analytical solution $p^A_{st}(x)$ in (30). We see that increasing the number $N$ of interpolation nodes indeed increases the approximation accuracy, up to the point where the approximated solutions are almost perfectly aligned with the analytical solution. However, decreasing the time step $\Delta t$ can lead to less accurate results or to false solutions if the number $N$ is not adjusted accordingly. For the scalar system (29) the sparse interpolations (linear, cubic and quintic) are more susceptible to this phenomenon than the dense interpolations (Chebyshev and trigonometric). The figure shows both the failure of the linear interpolation to capture small variations in the PDF for smaller $\Delta t$, and the instability of higher-order interpolations with limited $N$.

Performance benchmarks
In this section we benchmark the performance of the PI method in terms of CPU time and memory requirements. These results are complementary to those of Section 3.1, which characterise the numerical performance of the interpolation and thus the mathematical algorithm, but may not translate directly to the time needed to compute accurate results. The measures we provide in this section reflect that the different interpolation methods result in different densities of the step matrix $S$, which in turn can influence the time it takes to compute a single time step. Our implementation of the PI method is written in Julia (version 1.7 with libLLVM-12.0.1), and the tests were completed on a thin-and-light laptop with an Intel Core i7-8565U CPU, 24 GB of RAM and Linux kernel 5.14. During the performance tests the step matrix $S$ is represented as a dense matrix for the dense interpolations for $d = 1$ and $d = 2$, and as a compressed sparse row (CSR) sparse matrix for both the sparse and dense interpolations for $d = 3$ and $d = 4$. For comparison we include the benchmark results of two FORTRAN implementations [16] of the PI method: the direct integration [18] of the CK Eq. (3) and the approach where the CK equation is evaluated using the FFT [17]. We refer to these as the direct and FFT-based methods, respectively. Both utilise cubic B-spline interpolations for the spatial discretisation.

Computational time and spatial discretisation
In Fig. 5 we summarise the CPU time required to compute the matrix S and to take a single time step as a function of the spatial resolution N. During the tests we used Δt = 10⁻⁵ for d = 1, Δt = 10⁻³ for d = 2, Δt = 10⁻² for d = 3 and Δt = 10⁻¹ for d = 4. The dashed lines represent approximated computation times. The computation time of S was approximated based on the average time needed to compute a single row vec(U^(i),n) of the step matrix S: specifically, we individually measured and averaged the computation time of 1000 randomly selected rows and then multiplied this average by the number of rows N^d. To approximate the time required to evaluate a time step (a matrix multiplication), we used the previously computed rows and individually measured the computation time of multiplying each row with a dense column vector representing the node value vector q_n. We averaged these measurements over the rows and multiplied this average by the number of rows N^d to obtain the estimate of the CPU time of a single time step. In Fig. 5 the left column presents the CPU time necessary to compute the step matrix S, the right column presents the CPU time necessary to compute a single time step, while the rows correspond to the different dimensions d of the example systems.
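The sample-and-scale timing estimate can be sketched as follows. This is an illustrative Python sketch, not the paper's Julia implementation, and `compute_row` is a hypothetical stand-in for the routine that assembles one row of S.

```python
import time
import numpy as np

def estimate_build_time(compute_row, n_rows, sample_size=1000, seed=None):
    """Estimate the CPU time to assemble the full step matrix S by timing a
    random sample of rows and scaling the mean row time to all n_rows rows
    (n_rows = N**d for a d-dimensional grid)."""
    rng = np.random.default_rng(seed)
    sample = rng.choice(n_rows, size=min(sample_size, n_rows), replace=False)
    t0 = time.perf_counter()
    for i in sample:
        compute_row(i)                      # assemble (and discard) one sampled row
    mean_row_time = (time.perf_counter() - t0) / len(sample)
    return mean_row_time * n_rows           # extrapolate to the full matrix
```

The same pattern yields the time-step estimate by timing a row-vector product instead of a row assembly; sampling without replacement keeps the overhead of preparing each row from being counted twice.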
For the scalar problem (d = 1) defined by (29), we see that using a sparse matrix to represent the step matrix S is beneficial only for larger spatial discretisations N. When we consider the exact computation time (continuous lines), we see that for small spatial resolutions N the computation time of the step matrix is almost constant and only gradually increases as we increase N. This is due to the significant overhead of preparing the computations of the step matrix S, i.e., we need to preallocate the memory space for S and prepare the Newton-Raphson iteration to get W^(i), all of which has a computational cost comparable to the computation of the rows U^(i). As we increase the spatial resolution N, the effect of this constant overhead becomes negligible compared to the time required to compute S, and the approximated time converges to the exact time requirement. Recall that the approximation considers only the computation of the rows U^(i).
We also see a significant difference in the time it takes to compute the matrix S between the sparse interpolations (almost perfectly overlapping each other), the Chebyshev interpolation and the trigonometric interpolation. This is due to the difference in complexity of the two interpolation classes: during the sparse interpolation we need to compute two (linear interpolation), four (cubic interpolation) or six (quintic interpolation) weights, while the Chebyshev and trigonometric interpolations require weights at all N points. Moreover, the computation of the weight matrix φ for the trigonometric interpolation takes longer than for the Chebyshev interpolation, given that the trigonometric functions are more computationally expensive to evaluate than the polynomials used in the Chebyshev case. Considering the CPU time cost of a single time step for d = 1, Fig. 5 shows a performance gain from a sparse representation only for step matrix sizes larger than approximately 101×101. As in Fig. 3, for d = 1 the dense interpolations reach the maximum accuracy at N = 37 even for Δt = 10⁻⁵, thus larger matrix sizes are not necessary to accurately compute the PDF using dense interpolations. The figure confirms that computing an accurate steady-state PDF p_st with dense interpolations is orders of magnitude faster than with sparse interpolations for d = 1.
For d = 2, representing S as a sparse matrix is beneficial for the CPU cost of a time step, particularly for larger spatial discretisation values N, where the time step computation is 3 orders of magnitude faster for the sparse representation than for the dense methods. For d = 3 and d = 4 the approximated CPU time cost of a time step is again orders of magnitude smaller for the sparse interpolations than for the dense interpolations. Here we see again that the sparse and dense interpolations separate into two groups in terms of the computation time of S. Next, in Fig. 5 we compare the time required to compute a time step with the SMM-PI with that of the direct and FFT-based approaches for d > 1. Once the step matrix S has been computed based on sparse interpolations, computing a single time step by multiplying the node value vector vec(q_n) with the step matrix S is orders of magnitude faster than the direct approach and significantly faster than the FFT-based approach. For the dense interpolations, the matrix multiplication can be slower even than the direct approach, since the number of operations in the matrix multiplication (even for a sparsely represented step matrix obtained with dense interpolation) negatively impacts the performance of a time step. Analogously to the computation time of the step matrix S, the results split into two groups based on the interpolation class (sparse or dense).
To summarise, we observe in Fig. 5 that for d = 2, 3 and 4 the time complexity of computing the step matrix is approximately O(N^(2d)), while the time complexity of computing a time step is approximately O(N^(2d−1)) for the dense and O(N^d) for the sparse interpolations. We assume that this holds for higher dimensions d > 4 as well. As the computational time necessary to compute S scales exponentially with d, the accurate computation of the PDF for larger systems is challenging using the SMM-PI method without further efficiencies, e.g. parallelisation. We note that the systematic framework we present provides a solid foundation for implementing such efficiencies.

Numerical efficiency
In this section we investigate the efficiency of the PI method, defined as the computational time necessary to obtain a result with error e₁. Fig. 6 presents these results for the cases considered in Fig. 3.
In Fig. 6a we see that in the case of d = 1 the dense interpolation methods are capable of producing accurate results orders of magnitude faster than the sparse interpolation methods, and especially faster than the direct and FFT-based evaluation approaches. As already seen in Fig. 4, the linear interpolation is unable to capture the structure of the PDF p_st(x) for small time steps Δt, and thus its poor performance cannot be improved with increased CPU time. The large differences between the time necessary to obtain the dense and sparse interpolated approximations of the steady-state PDF p_st are due to the representation of the step matrix S. We reach the maximum accuracy with the dense interpolations at N = 37 and Δt = 10⁻⁵, while the matrix multiplication with a sparse matrix only performs better for N > 101, as shown in Fig. 5. We also see the efficiency benefit of transforming the CK Eq. (3) to a matrix multiplication in this case. The new approach computes an accurate approximation of the PDF of the system (29) at least an order of magnitude faster than the FFT-based computation and almost three orders of magnitude faster than the direct approach.
In the case of d = 2 shown in Fig. 6b, the efficiency advantage is not as obvious as for d = 1. For the largest time step size shown, Δt = 10⁻¹, the computational cost of setting up S combined with the small number of time steps (400 steps to reach T = 40) results in similar efficiency for each interpolation method except the linear interpolation. As we decrease the time step to increase accuracy, the number of steps needed for convergence to the steady state increases. The SMM-PI approach is again more efficient than the FFT-based evaluation of the CK equation. The sparse quintic interpolation outperforms every other method, with a greater advantage observed for the smallest investigated time step Δt = 10⁻³. Even though the dense interpolations use a smaller number of interpolation nodes to reach the desired accuracy, computing the time evolution and the steady-state PDF of a dynamical system with them is more expensive. This is due to the complexity difference between the sparse and dense interpolations, and to the larger number of nonzero elements in the corresponding sparse step matrices S for the dense methods, which leads to slower matrix multiplications as well. This observation implies that for problems with large step matrices S (e.g. problems with d > 2) the accuracy benefit of the dense interpolations is outweighed by the slower performance of the matrix multiplication corresponding to the time stepping.

Memory requirements
Another important aspect of the PI method is the memory required to store the step matrix S, as its size increases exponentially with d. In this section the step matrix is S ∈ ℝ^(N^d × N^d), meaning that S has m_S,full = N^(2d) elements. To reduce the memory needed to store the step matrix we use compressed sparse row (CSR) matrices. The density ρ_S ≤ 1 of the step matrix S is defined as the ratio of the number of nonzero elements to the total number of elements in the matrix S. Since they use all the grid values participating in the interpolation, the dense interpolations always produce a full step matrix S (ρ_S = 1); we therefore treat the elements that satisfy condition (37) as zero elements, where S_ij denotes a single element of S and max_{i,j}(|S_ij|) denotes the element with the maximum absolute value.
The number of stored values for a CSR matrix with density ρ_S is the sum of the ρ_S N^(2d) nonzero elements, the ρ_S N^(2d) corresponding column indices, and the N^d + 1 row pointers that assign the stored elements to rows. The required number of stored numbers is therefore m_S,sparse = 2ρ_S N^(2d) + N^d + 1. This means that if m_S,sparse < m_S,full = N^(2d), it is beneficial to store S as a sparse matrix, as far as memory requirements are concerned.
The limit density ρ_S^lim(N), at which m_S,sparse = m_S,full, is given by ρ_S^lim(N) = (N^(2d) − N^d − 1) / (2N^(2d)). Fig. 7 summarises the density of the step matrix S and the memory required to store it. When approximating the memory requirement of the step matrix S we assume that each element or index is represented by a 64-bit double-precision floating-point number or a 64-bit integer. For d = 1 and d = 2 we measured the memory requirement directly, while for d = 3 and d = 4 we approximated it using the average density of the 1000 rows previously used for the computational time approximations (hence these are plotted as dashed lines). For d = 3 and d = 4 the black crosses ("+") denote exact measurements of the density and the number of elements/memory requirement for two specific cases with quintic interpolation. We see that for d = 1, 2 the dense interpolations produce step matrices S with density at or above the limit ρ_S^lim(N) throughout the range of N investigated, indicating that a dense matrix representation is reasonable. For d = 3 and 4 the sparse interpolations produce sparse step matrices for most d and N combinations shown, thus we use sparse matrices to store S.
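The CSR storage count and the break-even density can be expressed as short helpers; a minimal Python sketch (the paper's implementation is in Julia), where the limit density follows from solving m_S,sparse = m_S,full for ρ_S using the element counts given in the text:

```python
def csr_storage_count(N, d, rho):
    """Numbers stored for an N^d-by-N^d step matrix S at density rho in CSR:
    rho*N**(2d) values + rho*N**(2d) column indices + N**d + 1 row pointers."""
    nnz = rho * N ** (2 * d)
    return 2 * nnz + N ** d + 1

def limit_density(N, d):
    """Density at which CSR storage equals the dense storage N**(2d)."""
    return (N ** (2 * d) - N ** d - 1) / (2 * N ** (2 * d))
```

For large N the limit density approaches 1/2: above roughly half-full, the value and index arrays of the CSR format together exceed the dense layout.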
Even though there is a steady decline in the density of the step matrices S, the number of nonzero elements grows, along with the memory needed to store the matrix. The density, and thus the memory requirement, of the dense and sparse interpolations separate into two clearly distinguishable groups. This separation is due to the different approaches employed by the two types of interpolation: in dense interpolation we assign a weight to each of the N^d interpolation grid values, while in sparse interpolations we assign weights to a fixed number of node values (two for the linear, four for the cubic and six for the quintic interpolation). Even though condition (37) is used for the dense interpolations to exclude certain elements of the matrix S, the number of nonzero elements in the resulting step matrix S is multiple orders of magnitude larger than for the step matrix obtained using the sparse interpolations.

Comparison with FFT approach for higher dimensional systems
In the previous sections we compared the efficiency for d = 1 and d = 2. In this section we compare the SMM-PI and the FFT-based approaches for the higher-order systems (33) (d = 3) and (34) (d = 4). We focus the comparison on the error e₁, the errors of the marginal densities, and the total CPU time.
As neither (33) nor (34) is subjected to parametric noise, their response PDFs can be well approximated in the Fourier space (the response PDFs have decaying tails in each direction). Thus the FFT-based approach [17] is a potential candidate to consider alongside the SMM-PI approach presented in this paper. Through this comparison, we consider whether the computation of the step matrix S is indeed worth the computational overhead, as it may be computationally expensive.
We recall that the FFT-based method is compatible only with resolutions that are powers of two, namely N = 2^m, m ∈ ℕ⁺, and that the TPDF p(x, t_{n+1} | y, t_n) is sampled at the interpolation nodes when the integration is implemented. This can cause issues when using a small time step Δt, which results in a small discretised diffusion term g_k(x, t)√Δt_n for which the TPDF p(x, t_{n+1} | y, t_n) is narrow, with no or limited overlap with the interpolation nodes. To properly sample the interaction between the PDF and the TPDF, a high and computationally expensive spatial resolution N = 2^m is necessary, even though a lower resolution N would be enough to characterise the PDF. In contrast, for the SMM-PI method we narrow the integration region to I^(k,d,i), i.e., we omit from the integration the regions where we approximate the TPDF as zero, as described in Section 2.4. Tables 1 and 2 summarise the comparison. The error e₁ was computed by comparing the results obtained by the PI method to the Gaussian PDF characterised by (D.2) in Appendix D.
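The undersampling problem can be reproduced in one dimension. The following is a hypothetical Python illustration (the centre y, width σ and grid bounds are invented for the example): a narrow Gaussian TPDF carries almost no mass when sampled on a coarse uniform grid, while sampling only a ±6σ window around y recovers it, mimicking the narrowed integration region I^(k,d,i).

```python
import numpy as np

def grid_mass(y, sigma, a, b, N):
    """Riemann-sum estimate of the total probability of a Gaussian TPDF
    centred at y with standard deviation sigma, sampled at N uniform nodes."""
    x = np.linspace(a, b, N)
    h = x[1] - x[0]
    pdf = np.exp(-0.5 * ((x - y) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return pdf.sum() * h

sigma = 1e-3                                     # narrow TPDF from a small time step
coarse = grid_mass(0.013, sigma, -5.0, 5.0, 64)  # uniform nodes straddle the peak
fine = grid_mass(0.013, sigma, 0.013 - 6 * sigma, 0.013 + 6 * sigma, 64)  # +-6 sigma window
```

With these values the coarse grid sees essentially zero probability mass (its nearest node is tens of standard deviations from the peak), while the trimmed window integrates to one with the same number of nodes.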
For d = 3 we see that the FFT-based approach yields numerically unstable solutions, which exhibit large oscillations for increasing N. This instability stems from the fact that the FFT-based approach cannot capture the interaction between the TPDF and the PDF for this example, even at higher spatial resolutions such as N = 128. The large e₁ errors of the FFT-based method are due to these oscillations in the solutions. In contrast, the SMM-PI approach produced reasonably accurate results already for N = 21, with e₁ decreasing with N. Fig. 8 compares the steady-state marginal PDFs obtained with the SMM-PI and the FFT-based PI method with the analytical results. Once again the FFT-based method shows oscillations in the solution at N = 128, while the SMM-PI captures the steady-state marginal PDFs of (33) already at N = 21. Fig. 8c shows that the FFT-based method is able to capture the interaction between the TPDF p(x, t_{n+1} | y, t_n) and the PDF p(x, t_n) for a larger time step Δt = 10⁻¹, thus eliminating the oscillations in the solution. However, in this case the large time step causes a different numerical instability; we get the almost uniform marginal distributions shown for p_v,st(v) and p_Z,st(Z).
Considering the error e₁ for d = 4 in Table 2, we see that it gradually decreases with increasing N for the SMM-PI approach. In contrast, for the FFT-based approach e₁ does not decrease consistently as N increases, and it is substantially larger than the error of the solution obtained through the SMM-PI approach. Fig. 9a presents the steady-state marginal PDFs of (34) compared to the analytical solution based on Appendix D. We see that with N = 17 the SMM-PI approach yields a false solution, while the steady-state marginal PDFs are well approximated with N = 21. In Fig. 9b we see that with the FFT-based approach we get a solution with seemingly improving accuracy of the marginal PDFs as we increase N, even though the error measure e₁ does not decrease significantly.
To better understand the apparent discrepancy between e₁ and the behaviour of the marginal PDFs in Fig. 9b, we inspect the error distribution of the method by calculating the marginal errors e₁,x(x), e₁,y(y), e₁,Vx(V_x), e₁,Vy(V_y) in Fig. 10. These quantities give us insight into the cumulative errors in each direction for the two approaches with different resolutions. We see that even though the marginal PDFs in Figs. 9a and 9b appear similar, the marginal errors of the FFT-based approach remain substantially larger. This shows the importance of having a systematic PI approach that can reliably address computational error. An inspection of the shape of the marginal PDFs alone may lead us to incorrectly conclude that the full solution is correct, and thus to false conclusions when this steady-state PDF is used in a practical problem. Even though the FFT-based method is able to produce results quickly, one has to verify the results before accepting them, while the SMM-PI method does not have this issue. Additionally, the FFT-based approach cannot address systems described by Eq. (1) in full generality, e.g., when the system is subjected to parametric noise excitation or it is an impacting system with a non-symmetric response PDF. Thus in general, we should use the SMM-PI approach, unless the step matrix S cannot fit into memory. For d = 4 this is a legitimate concern: in the example above, even for N = 31 we need 6.9 GB of memory for the step matrix obtained with quintic interpolation. If that resolution is not sufficient and we need a resolution of N = 41 or 51, the step matrix S uses 31.6 GB or 277 GB of memory, respectively. If this amount of memory is not available and the FFT-based method is not efficiently applicable, we should utilise the direct approach. However, we should consider the direct approach as a last resort, as according to the benchmarks it is the least efficient method to evaluate the CK Eq. (3).

Discussion of choosing appropriate spatial and temporal resolutions
The transformation of the CK Eq. (3) to a matrix multiplication has great potential for increasing the efficiency of the PI method. At the same time, given the results in the previous sections, one has to consider a few things when using the method to obtain the time evolution or the steady state of the PDF of a stochastic dynamical system.

[Table 1. Error e₁ and total computation times of the approximation of the steady-state response PDF p_st(x) ≈ p(x, 110) of the system (33) (d = 3) with time step Δt = 10⁻². For the SMM-PI approach quintic interpolation and for the FFT approach cubic B-spline interpolation was used. Results obtained using the FFT that are numerically unstable are marked with the abbreviation NU.]

First of all, it is necessary to choose the interpolation region I. In general, if there is no analytical solution or approximation that provides the relevant moments or the PDF evolution, preliminary Monte-Carlo simulations are required for an initial estimate of I. Based on the simulated realisations, we can choose a proper interpolation region from the empirical response histogram of the system, or from the steady-state central second moment of the state variables, e.g. by taking a six-standard-deviation region of the process. For the examples (29)-(34) we had the analytical solution for the steady-state PDF: for d = 1 and d = 2 the solutions are presented in (30) and (32), while for d = 3 and 4 the steady-state PDF is provided in Appendix D.
Once we have the interpolation region I, we choose proper temporal discretisation methods and resolutions. First, we need to choose the time stepping method and step size Δt_n for which the discretised SDE (4) captures the dynamics of the continuous SDE (1) accurately and without introducing numerical instability over the region I. A small Δt allows a more accurate approximation of both the drift and the diffusion; however, a very small Δt can lead to numerical issues, as the TPDF p(x, t_{n+1} | y, t_n) is then highly concentrated near y, so that without sufficient spatial resolution N the integration of the TPDF can magnify the interpolation errors.
While there are various considerations to take into account, e.g. balancing N and Δt, just as in previous approaches, this new formulation provides a systematic way of implementing the PI method and of addressing the potential sources of computational error. In Fig. 4 we demonstrated the effect of different temporal and spatial discretisations, namely that even though the spatial discretisation was sufficient for a given Δt_n, decreasing the time step may increase the error for the same N, as mentioned above. In the case of linear interpolation the method is not able to accurately resolve small fluctuations in the density for a small time step Δt, and instead tends to smooth the spatial variation of p(x, t). At the same time, the higher-order (cubic, quintic and dense) interpolations can introduce well-known oscillatory instabilities if the resolution is too low. In this paper we do not provide a strict analysis of this effect; however, we note that this oscillatory instability is related to the case where the largest eigenvalues of the step matrix S take multiple, usually complex, values. This spectral structure leads to false solutions, as the steady state lim_{n→∞} vec(q_n) in (23) converges to the corresponding eigenvector.

[Table 2. Error e₁ and total computation times of the approximation of the steady-state response PDF p_st(x) ≈ p(x, 40) of the system (34) (d = 4) with time step Δt = 10⁻¹. For the SMM-PI approach quintic interpolation and for the FFT-based approach cubic B-spline interpolation was used.]
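The link between time stepping and the spectrum of S can be seen in a toy example: repeated application of the step matrix is a power iteration, so the computed steady state is the dominant eigenvector of S. The following Python sketch uses a random column-stochastic matrix as a stand-in for a well-behaved step matrix (an assumption for illustration, not the paper's construction); there the dominant eigenvalue is 1 and real, so the iteration converges instead of oscillating.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((50, 50))
S = M / M.sum(axis=0, keepdims=True)   # column-stochastic stand-in for a step matrix

q = np.full(50, 1.0 / 50)              # uniform initial node-value vector
for _ in range(200):
    q = S @ q                          # repeated time stepping = power iteration
    q /= q.sum()                       # renormalise the discrete PDF

# Rayleigh-type estimate of the dominant eigenvalue reached by the iteration
lam = np.linalg.norm(S @ q) / np.linalg.norm(q)
```

When the largest eigenvalues are instead a complex pair, the same iteration does not settle on a single vector, which is the false-solution mechanism described above.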
In essence, the SMM-PI formulation requires an adjustment of the resolution of the spatial and temporal discretisation to assure sufficient accuracy both for the approximation of the PDF p(x, t_n) and for its interaction with the TPDF p(x, t_{n+1} | y, t_n) in the evaluation of the CK equation. The new SMM-PI formulation, based on the separate treatment of the three parts (interpolation of the PDF, approximation of the TPDF of the process and evaluation of the CK integral), provides insight into how to choose these resolutions. For example, for a d-dimensional system with a single noise excitation (k = d) a good strategy is to initially set the time step Δt_n and the spatial resolution N such that the ratio of (b_k − a_k)/N to g_k(x, t)√Δt_n is close to one or smaller, and to test the resulting PDF against a histogram obtained through coarse MC simulations. Then we can adjust the resolutions by comparing the PDF obtained through the SMM-PI with, e.g., the one obtained through the MC simulations: we can increase accuracy by decreasing Δt or increasing N, or decide to decrease the computational cost by decreasing N. Note that the above recommendation on the initial ratio of Δt and N is based on empirical results, as we did not analyse the complex interaction between the multiple sources of error (time evolution, spatial discretisation and numerical integration) to obtain exact error convergence rates and numerical stability criteria.
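The rule of thumb above can be written as a one-line helper; a hedged Python sketch with illustrative arguments (`a`, `b` the interpolation interval along the excited direction and `g` a constant diffusion coefficient, simplifications of the state-dependent g_k(x, t)):

```python
import math

def initial_resolution(a, b, g, dt):
    """Smallest spatial resolution N for which the grid spacing (b - a)/N does
    not exceed the TPDF width scale g*sqrt(dt), per the rule of thumb above."""
    return math.ceil((b - a) / (g * math.sqrt(dt)))
```

For example, on the interval [−5, 5] with g = 1 and Δt = 10⁻² the heuristic suggests N = 100; halving Δt raises the suggested N by roughly a factor of √2, which makes the cost of shrinking the time step explicit before any matrix is assembled.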
For low-dimensional stochastic systems (e.g. d = 1 or 2), increasing N can be done at a low computational cost: if the spatial resolution N is not sufficient for the chosen time step Δt_n, we can increase the resolution N at a small performance penalty. As the size of the investigated system increases (e.g., d > 3), one has to choose the balance between Δt and the corresponding spatial resolution N carefully, since the computation of the step matrix S becomes very expensive in terms of both computational time and memory.
To summarise, the SMM-PI method is especially useful for analysing models of low-dimensional nonlinear systems (d ≤ 3) and for investigating the effect of parameters on the time evolution of the PDF.

Conclusions
In this paper we provided a novel systematic formulation of the PI method as an approximate method to solve the Chapman-Kolmogorov (CK) Eq. (3). The CK equation models the evolution of the PDF of a general stochastic process, and we apply it to systems described by SDEs. There are three key elements of the new PI construction: the approximation of the TPDF p(x, t_{n+1} | y, t_n), the discretisation of the PDF p(x, t_n), and the quadrature evaluation of the integral. Since the TPDF is the fundamental component that advances the PDF in time, we approximate it using the PDF of a numerical time stepping scheme applied to the SDE (1). For the discretisation of the PDF p(x, t_n) we use different interpolation methods, and to evaluate the integral in (3) that combines these components, we use the Gauss-Legendre quadrature. Partitioning the PI method in this manner allows us to investigate the performance of the approximation of each component separately and to isolate potential bottlenecks within the method. Furthermore, through this formulation we transform the Chapman-Kolmogorov equation into a multiplication by a step matrix that captures the full PDF evolution, thus boosting the performance of the PI method.
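To illustrate how the three parts combine, the following minimal one-dimensional Python sketch builds a step matrix for an Ornstein-Uhlenbeck process (not one of the paper's examples) using the Euler-Maruyama Gaussian TPDF and trapezoid quadrature taken directly on the interpolation nodes; the paper's formulation uses Gauss-Legendre quadrature with a separate interpolation step, so this is a simplified stand-in for the construction, not the SMM-PI implementation itself.

```python
import numpy as np

sigma, dt = 1.0, 0.05                           # dx = -x dt + sigma dW
a, b, N = -4.0, 4.0, 201                        # interpolation region I and resolution
x = np.linspace(a, b, N)
h = x[1] - x[0]
w = np.full(N, h); w[0] = w[-1] = 0.5 * h       # trapezoid quadrature weights

# Euler-Maruyama TPDF: x_{n+1} | y ~ Normal((1 - dt) y, sigma^2 dt)
mean = (1.0 - dt) * x                           # drift applied at each source node y = x[j]
std = sigma * np.sqrt(dt)
S = np.exp(-0.5 * ((x[:, None] - mean[None, :]) / std) ** 2) / (std * np.sqrt(2 * np.pi))
S *= w[None, :]                                 # fold the quadrature weights into S

q = np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)  # initial PDF guess at the nodes
for _ in range(2000):                           # q_{n+1} = S q_n: one product per step
    q = S @ q
    q /= q @ w                                  # renormalise the discrete PDF

var = q @ (w * x ** 2)                          # stationary variance of the computed PDF
```

With these parameters the iteration converges to the stationary density of the discretised map, whose variance is σ²/(2 − Δt) rather than the continuous value σ²/2, so the residual error is set by the time discretisation rather than by the matrix formulation.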
To test the accuracy, computational requirements and efficiency of the PI method with the different temporal and spatial approximation methods and resolutions, we conducted numerical experiments on a set of nonlinear and linear examples. We note that this study appears to be the first systematic performance evaluation of any PI method. The results presented in Fig. 6 show that in the case of the scalar system (d = 1) the most efficient approximations of the PDF use dense interpolation approaches, for which the method yields accurate results even with relatively small step matrices S_n. For dimension d = 2, the efficiency of the quintic interpolation method surpasses that of the dense interpolations, and our tests indicate that this conclusion holds for higher d-dimensional systems as well. The numerical tests also confirm that for d = 1, 2, 3, transforming the CK Eq. (3) to a matrix multiplication significantly increases the efficiency of the PI method compared to previous approaches, even before considering any parallelisation.
The computationally most expensive part of the SMM-PI approach is the generation of the step matrix S_n. Nevertheless, since the new formulation provides the advantage that the row vectors vec(U^(i),n)ᵀ of S_n defined in (22) are independent of each other, we can greatly reduce the computation time necessary to obtain S_n by computing the individual rows in parallel, further increasing the efficiency of the SMM-PI method. However, there is a single caveat to this approach based on Fig. 7: a large amount of memory is necessary to store S_n, as shown for d ≥ 4, even when a sparse interpolation is applied. This is a major barrier to be overcome in order to generalise this approach to higher d-dimensional systems. Nonetheless, the method presented in this paper proves to be remarkably efficient and is a potential candidate for the high-performance computational solution of nonlinear (time-dependent) stochastic systems (subjected to various noise sources) that are used for modelling in biology, physics, engineering, finance and other fields of science, even on thin and light laptops with moderate computational capabilities.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

In case N is even: during the trigonometric interpolation we use the uniform grid x_i = a + (i − 1)(b − a)/N, l = 0, …, N_j − 1. (B.31)

B.3. Two-dimensional linear interpolation
In this section we demonstrate the construction of the linear interpolation of the function p(x), x ∈ ℝ², p : ℝ² → ℝ on x ∈ I. As p is a two-dimensional function, we use a two-variable notation, i.e., p(x, y) := p([x, y]ᵀ), where x := x₁ ∈ [a₁, b₁] and y := x₂ ∈ [a₂, b₂]. We construct the linear interpolation along the two dimensions and show that it indeed results in the form provided in (15).

The matrices corresponding to the product weight functions φ_{1,i₁}(x) φ_{2,i₂}(y) and the node values q_{i₁,i₂} are …

Appendix C. Equation of motion of the coupled oscillator
The equation describing the motion of the two coupled linear oscillators is given by

dx = v_x dt,  dy = v_y dt,  dv_x = −2ω₁(ζ₁ + ζ₁₂) v_x − 2ζ₁₂ω₁ v_y − (ω₁² + ω…
Appendix D. Steady-state PDF of linear time-invariant stochastic systems

Theorem (8.2.12) in [3] states that the steady-state solution of the equation (D.1) … In the case of (33) the coefficients of (D.1) are …