A spectral surrogate model for stochastic simulators computed from trajectory samples

Introduction
Nowadays, computer simulations are an essential ingredient of the research and development workflow in virtually all fields of science and engineering. Typically, not all parameters and conditions needed for the simulations are known exactly, and this uncertainty affects the output of the simulations. This is the main focus of the field of uncertainty quantification (Smith, 2014). Most computer simulations can be classified as deterministic simulators: repeatedly evaluating the model M for the same set of input parameters x always yields the same deterministic response y = M(x) ∈ R.¹ To perform uncertainty quantification, the uncertainty on the input (parameters and conditions) is usually represented probabilistically, and we follow this approach in this paper. Propagating the input uncertainty through the deterministic simulator, the overall response of the simulation becomes a random quantity.
However, not all computer simulations can be classified as deterministic simulators. Some models contain intrinsic stochasticity that cannot be modeled as an input parameter, e.g., epidemiological models where each transmission or recovery is a random event, governed by the respective rate of occurrence. Other models depend on uncontrollable environmental variables such as wind fields or earthquakes, for which it can be infeasible or undesirable to explicitly model their uncertainty. In these cases, it is more convenient to use the notion of a stochastic simulator: only some of the uncertainty is explicitly modeled as random input variables, and there is some residual randomness affecting the computational model that causes the model response M(x) for a fixed set of input parameters x to still be a random variable: Y_x = M(x). In other words, evaluating the computer model several times with the same input parameters x will result in different realizations y of the random variable Y_x. Of course, since there is no true randomness in a computer, every computer simulation can be made deterministic by fixing the random seed.
However, the seed is in general not a useful parametrization of uncertainty.
Uncertainty quantification methods typically require many runs of the computational model, which can become costly or even infeasible for expensive engineering simulators. To save computational resources, the model is often replaced with a cheaper surrogate model (or metamodel), which provides a reasonably good approximation to the original model. The surrogate model is computed from a small number of model evaluations and can subsequently be evaluated many times with negligible computational cost. Surrogate models often treat the model as a black box, i.e., they do not use any specific knowledge about the model and rely only on the available input-output data samples (and sometimes on the characteristics of the input parameter space). Popular surrogate models for deterministic simulators include polynomial chaos expansions (Ghanem and Spanos, 1991;Xiu and Karniadakis, 2002), Kriging (Sacks et al., 1989;Rasmussen and Williams, 2006), radial basis functions (Buhmann, 2000), and support vector regression (Vapnik, 1995;Smola and Schölkopf, 2004).
Since the response of stochastic simulators is a random variable for every set of input parameters, even more runs might be required to analyze their uncertainty, making surrogate models all the more relevant in this case. Research on surrogating stochastic simulators is comparatively recent. Most available methods focus on the marginal response distribution P(Y |X = x) for x ∈ D and emulate the conditional density itself or certain statistics of it. Early contributions aimed at characterizing the variation of the first two moments of the output response over the input domain using joint Gaussian process models (Iooss and Ribatet, 2009; Marrel et al., 2012). Another class of methods aims at directly modeling the variation of the marginal output probability density function (pdf) of the random variable Y_x over the input domain. Assuming that the true marginal response pdf at a number of input locations is known, Moutoussamy et al. (2015) represent the marginal pdf of a new input point as a linear combination of training examples (i.e., kernel regression) or of specifically constructed basis functions. However, the true marginal pdf is rarely known, and its generation might require a lot of samples. For a finite number of stochastic simulator evaluations over the input domain (with or without replications), Zhu and Sudret (2020, 2021a) model the variation of the marginal output pdf over the input domain using the so-called generalized lambda model, a parametric distribution family that is able to approximate many classical families. In fact, stochastic simulators are akin to real-world scientific experiments, which are usually stochastic due to unavoidable measurement error and environmental noise.

¹ We consider here only real-valued simulators. The extension to low-dimensional vector-valued simulators is straightforward. For the extension to high-dimensional vector-valued or function-valued simulators, see e.g. Nagel et al. (2020); Perrin et al. (2021).
Therefore, standard statistical methods like quantile regression (Torossian et al., 2020) and kernel conditional density estimation (Hall et al., 2004) can also be used to emulate the marginal distribution of the response of a stochastic simulator. Furthermore, Zhu and Sudret (2022) developed an approach that emulates the stochastic simulator response in distribution, inspired by the weak PCE methodology based on maximum likelihood estimation (Xiu, 2010).
A related class of methods from machine learning is Bayesian neural networks, whose weights are modeled as independent Gaussian random variables (MacKay, 1992; Goan and Fookes, 2020).
Bayesian methods such as Markov Chain Monte Carlo or variational inference are used to determine the parameters of the weight densities from the given data. Furthermore, generative models like variational autoencoders (Kingma and Welling, 2014) and generative adversarial networks (Goodfellow et al., 2014) can be seen as surrogate models in distribution, learning a conditional target density from data.
All the methods cited above aim at emulating only the univariate probability density functions of the response random variables of the stochastic simulator. However, they do not take into account the correlation and higher-order information between the stochastic simulator responses at different points in the input domain. This close relation between the responses at different input locations can be best illustrated by fixing the stochasticity of the simulator (e.g., by fixing the random seed)²: in this case, the stochastic simulator response over the input domain becomes a deterministic function, which we call a trajectory. In other words, the stochastic simulator can be seen as a random field, i.e., as a collection of random functions.
Surrogating a stochastic simulator based on few model evaluations therefore becomes the task of inferring a random field from discrete samples (often called "limited data"). Popular methods for modeling random fields include orthogonal series expansions, such as spectral representation (Shinozuka and Deodatis, 1991; Grigoriu, 1993) or Karhunen-Loève expansion (KLE) (Loève, 1978; Karhunen, 1946; Zhang and Ellingwood, 1994; Ramsay and Silverman, 2005; Grigoriu, 2006), and translation processes, which are mappings of Gaussian processes (Yamazaki and Shinozuka, 1988; Grigoriu, 1998; Sakamoto and Ghanem, 2002; Shields et al., 2011). To our knowledge, the only publication in the specific context of stochastic simulators which takes the random field point of view and aims at emulating trajectories (including the higher-order relations between responses at different input locations) is by Azzi et al. (2019), who construct a metamodel using Karhunen-Loève expansion together with the deterministic methods PCE and Kriging.

² Note that this does not require this randomness to be modeled. In practice, fixing the seed might not be possible for all computational models, as it depends on their implementation.
The goal of our paper is to develop a surrogate model that is able to emulate the trajectories of a stochastic simulator, and that allows insight into the dependence between the simulator responses at different input locations. Our method of choice in this paper is Karhunen-Loève expansion, one of the most popular methods for random field inference from limited data. The main challenges in constructing a trajectory-based surrogate for a stochastic simulator (a stochastic emulator) are explained in more detail in the following:
1. Accuracy and efficiency: the surrogate should be accurate while needing as few model evaluations as possible.
2. Continuous surrogate from discrete data: the surrogate should emulate the response over the whole (continuous) input domain, while the available data consists of trajectories sampled at a few points throughout the input domain (i.e., discrete samples).
3. Non-Gaussianity: the stochastic simulator is in general a non-Gaussian random field, which introduces additional complexity into the Karhunen-Loève model.
We address each of these challenges by introducing a novel approach that combines several state-of-the-art methodologies. We use Karhunen-Loève expansion in conjunction with sparse PCE (Blatman and Sudret, 2011; Lüthen et al., 2021), which is a powerful and sample-efficient surrogate modeling method for deterministic simulators, to address Challenge 1. This circumvents the otherwise high computational cost of solving the integral eigenvalue problem of KLE (Schwab and Todor, 2006; Betz et al., 2014) by reducing the integral eigenvalue problem to finite-dimensional discrete principal component analysis (PCA) in the truncated space of PCE coefficients. The joint distribution of the resulting sample of dependent random KLE coefficients (Challenge 3) is identified using statistical inference within the marginal-copula framework (Torre et al., 2019b). The procedure results in an analytical formula for the stochastic emulator that can be used for computing marginals and correlations, as well as for generating new trajectories that resemble trajectories of the original stochastic simulator.
In our approach, the extension from discrete data to the continuous model (Challenge 2) is achieved by approximating the sampled trajectories by sparse regression-based PCE. A similar approach has been used by Navarro Jimenez et al. (2017) in the context of stochastic differential equations with the goal of sensitivity analysis, using non-intrusive pseudospectral projection to compute the PCE coefficients. The representation by sparse PCE can be seen as a variant of orthogonal series expansion (OSE) (Zhang and Ellingwood, 1994), which expands a second-order random process in terms of an orthogonal basis of the associated Hilbert space.
Note that when random fields are approximated based on a set of samples, it is most often assumed that the latter are collected on a discrete mesh in the index set, whereas this is not a requirement for our method. In such mesh-based approximations to random fields, PCE is often used for modeling the random variables arising in dimension-reduced expansions (Desceliers et al., 2006;Doostan et al., 2007;Das et al., 2009;Raisee et al., 2015;Abraham et al., 2018;Dai et al., 2019). This is distinct from our approach, as we use PCE to approximate the trajectories in the input space. Our approach yields an emulator for the whole input space (including unseen locations), while existing approaches are mostly focused on building an emulator on the discrete mesh where the samples were collected.
KLE represents a random field using an optimal orthogonal basis of the index space, resulting in an expansion in terms of deterministic functions weighted by random coefficients. These random coefficients are by construction uncorrelated, but unless the random field is a Gaussian random field, they are in general statistically dependent. Inferring the joint distribution of dependent random variables from samples is challenging, but necessary for approximating a general non-Gaussian random field by KLE. To address this challenge of inference, several approaches have been proposed. Grigoriu (2010) suggests two methods to infer the joint distribution of the random coefficients of a series expansion model, of which one amounts to kernel density estimation, and the other to the fitting of a discrete joint distribution. Poirion and Zentner (2013, 2014) use KLE for modeling seismic ground motion time series, and model the random KLE coefficients by 1D sample CDFs assuming at most pairwise dependence (Poirion and Zentner, 2013), or by kernel density estimation (Poirion and Zentner, 2014). In the present paper, we investigate the use of kernel density estimation and inference of parametric joint distributions based on marginals and vine copulas.

This paper is organized as follows: in Section 2 we recall the relevant theory and definitions.
In Section 3 we present our new stochastic emulator. The proposed method is then applied in Section 4, where we assess its performance on several examples of varying complexity. Here we observe that the KLE is often significantly dominated by its first mode, a phenomenon that we investigate in Section 5. Finally, we draw conclusions and give an outlook on possible further developments in Section 6.

Theoretical foundation
We provide a brief summary of the relevant theory and concepts needed to construct our proposed stochastic emulator for stochastic simulators: random fields, polynomial chaos expansions, Karhunen-Loève expansion, and inference of joint probability distributions.

Stochastic simulators as random fields
Let X be a random vector with values in D ⊂ R^d, with finite variance and joint probability density function (pdf) f_X. Denote by ω ∈ Ω an abstract random event in a probability space (Ω, F, P). A stochastic simulator is a mapping

M : D × Ω → R, (x, ω) ↦ M(x, ω).

Fixing x ∈ D, the quantity Y_x = M(x, ·) : Ω → R is a random variable. Fixing ω ∈ Ω, M(·, ω) : D → R is a function of the input parameters, which we call a trajectory or realization of the stochastic simulator (see also Fig. 1). We assume that Y_x has finite variance for all x, and that M(·, ω) ∈ L²_{f_X}(D) for all ω ∈ Ω. These definitions imply that a stochastic simulator M can be seen as a random field (also: stochastic process or random process) {Y_x}_{x∈D} with index set D, i.e., as a family of random variables {Y_x} indexed by x ∈ D. In the following, we provide a brief reminder of a few random field basics. For more details, see e.g. Grigoriu (2002).
To fully characterize a general random field, one needs to specify the collection of all its finite-dimensional distributions

F_{x_1,...,x_n}(y_1, ..., y_n) = P(Y_{x_1} ≤ y_1, ..., Y_{x_n} ≤ y_n)

for all n ≥ 1 and any x_1, ..., x_n ∈ D. Extending the concept of moments of random variables to random fields, the deterministic mean function of the random field is given by

µ(x) = E[Y_x].

If µ(x) = 0 for all x ∈ D, the random field is called centered. The (auto-)covariance function is defined by

c(x, x′) = Cov(Y_x, Y_{x′}) = E[(Y_x − µ(x))(Y_{x′} − µ(x′))].

In general, a random field is not uniquely defined by its mean and covariance function. The only exception is the family of Gaussian processes, for which all finite-dimensional joint distributions are multivariate Gaussian distributions. For Gaussian processes, conditional distributions are again multivariate Gaussian, which lies at the foundation of the popular surrogate modeling technique of Kriging/Gaussian process modeling. While Gaussian random fields are computationally convenient, random fields encountered in real-world problems (and in particular, stochastic simulators) are often non-Gaussian. One obvious argument is that Gaussian variables are unbounded, while physical quantities are almost always bounded (Grigoriu, 2002).
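As a concrete illustration of these definitions, the mean and covariance functions can be estimated empirically from a set of sampled trajectories. The toy stochastic simulator below (a sine wave with a random amplitude) is our own illustrative choice, not a model from this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stochastic simulator: Y_x = A * sin(x), with latent A ~ N(1, 0.2^2).
# Each draw of A fixes one trajectory (one "random seed").
x_grid = np.linspace(0.0, np.pi, 50)        # discretized index set D
R = 2000                                     # number of trajectories
A = rng.normal(1.0, 0.2, size=R)
trajectories = A[:, None] * np.sin(x_grid)[None, :]   # shape (R, 50)

# Empirical mean function mu(x) and covariance function c(x, x')
mu_hat = trajectories.mean(axis=0)                    # (50,)
centered = trajectories - mu_hat
c_hat = centered.T @ centered / (R - 1)               # (50, 50)

# For this toy model, c(x, x') = Var(A) sin(x) sin(x') = 0.04 sin(x) sin(x')
```

For this simple model the estimates can be checked against the closed-form mean sin(x) and the rank-one covariance 0.04 sin(x) sin(x′).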
A special feature of a stochastic simulator M, as opposed to classical random fields, is that its index set is not an interval or a hypercube, but a general domain D ⊂ R^d with weight function f_X. We will use this property to build an accurate surrogate model for M respecting the probability density f_X of the input space.

Polynomial chaos expansion
Polynomial chaos expansion (PCE) is a technique for modeling random variables using a basis of polynomials that are orthonormal w.r.t. a given probability density function (Ghanem and Spanos, 1991;Xiu and Karniadakis, 2002). In our algorithm (Section 3), we will use PCE to approximate trajectories M(·, ω) of the stochastic simulator, which can be seen as random variables M(X, ω) with their randomness induced by the uncertainty in the input X.
Consider a random vector X with values in D ⊂ R^d and independent components. Let

f_X(x) = ∏_{i=1}^d f_{X_i}(x_i)

be its probability density function (pdf) and assume that X has finite variance. Let L²_{f_X}(D) be the space of real-valued functions that are square-integrable under f_X, i.e.,

L²_{f_X}(D) = {g : D → R | Var_X[g(X)] < +∞}.

Under certain assumptions on the random vector X (Xiu and Karniadakis, 2002; Ernst et al., 2012), there exists an orthonormal polynomial basis {ψ_α : α ∈ N^d} of L²_{f_X}(D), where each element is the product of univariate polynomials characterized by the multi-index α.
Let M ∈ L²_{f_X}(D) be a (computational) model. Its output Y = M(X) is a random variable, which can be represented in terms of the orthonormal polynomial basis as

M(X) = Σ_{α∈N^d} a_α ψ_α(X) with a_α = E_X[M(X) ψ_α(X)].

This representation is called polynomial chaos expansion. In practice, a truncated expansion

M(X) ≈ Σ_{α∈A} a_α ψ_α(X)

is computed, where A ⊂ N^d is a finite subset of the full basis. The accuracy of a truncated PCE depends on three ingredients: the choice of A, the method used for computing the coefficients a = (a_α)_{α∈A}, and the choice of points X ⊂ D used in the coefficient computation method. An extensive overview of the state-of-the-art methods to determine these is given in Lüthen et al. (2021).
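As an illustration of the regression approach to computing the coefficients, the following sketch builds a one-dimensional PCE with an orthonormal Legendre basis (for X ~ U(−1, 1)) and estimates the coefficients by ordinary least squares; the model M(x) = x³ + x and the design size are illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Model to surrogate (illustrative): M(x) = x^3 + x, with X ~ U(-1, 1)
model = lambda x: x**3 + x

# Orthonormal Legendre basis on U(-1, 1): psi_k(x) = sqrt(2k+1) * P_k(x)
def psi(x, k):
    coeffs = np.zeros(k + 1); coeffs[k] = 1.0
    return np.sqrt(2 * k + 1) * np.polynomial.legendre.legval(x, coeffs)

# Experimental design and regression matrix for the truncated basis A = {0,...,4}
N, degrees = 100, range(5)
X = rng.uniform(-1.0, 1.0, size=N)
Psi = np.column_stack([psi(X, k) for k in degrees])   # (N, 5)

# Least-squares estimate of the PCE coefficients a_alpha
a, *_ = np.linalg.lstsq(Psi, model(X), rcond=None)

# Since x^3 + x = (8/5) P_1 + (2/5) P_3, the only nonzero coefficients are
# a_1 = 1.6 / sqrt(3) and a_3 = 0.4 / sqrt(7) (up to numerical precision).
```

The sparse regression solvers referenced in the text (least-angle regression with adaptive basis selection) replace this plain least-squares step in the actual method.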

Karhunen-Loève expansion
Karhunen-Loève expansion (KLE) is a well-established spectral expansion technique through which a random field is represented in terms of an optimal orthogonal basis for the index space, weighted by random coefficients (Karhunen, 1946; Loève, 1978). KLE transforms the random field, which is an uncountably infinite but correlated family of random variables {M_x}_{x∈D}, into a countably infinite but uncorrelated family of different random variables {ξ_i}_{i=1,2,...}. Furthermore, the random variables ξ_i are typically of decreasing importance. KLE is therefore well suited and often used for the discretization and modeling of random fields.
For a random field {M_x}_{x∈D} with mean function µ and covariance function c, the KLE reads

M_x = µ(x) + Σ_{k=1}^∞ √λ_k φ_k(x) ξ_k.    (6)

Here, (λ_k, φ_k) are solutions to the integral eigenvalue problem

∫_D c(x, x′) φ_k(x′) dx′ = λ_k φ_k(x),    (7)

and ξ_k is the result of the projection of M onto the spatial basis:

ξ_k = (1/√λ_k) ∫_D (M_x − µ(x)) φ_k(x) dx.    (8)

From Eq. (6) and the properties of φ_k and ξ_k it follows immediately that the covariance function can be expressed as

c(x, x′) = Σ_{k=1}^∞ λ_k φ_k(x) φ_k(x′)    (9)

(Mercer's theorem). Note that the KLE random variables {ξ_k} (herein KL-RV) do not enter this expression.
KLE is especially well-suited to Gaussian random fields, since in this case the random variables ξ k are standard Gaussian and independent. However, Eq. (6) holds for all random fields fulfilling the assumptions, not only for Gaussian random fields. The non-Gaussianity is modeled by the (possibly complex) joint distribution f ξ of the KL-RV.
Eqs. (6) to (8) are formulated in terms of L²(D), but they can be generalized: let X be a random vector with values in D ⊂ R^d, density f_X, and finite variance. Then KLE can be generalized to the space L²_{f_X}(D) instead of L²(D). In that case, the index set D does not have to be bounded, since the volume of D under the measure f_X dx is finite. This is called extended KLE (Iemma et al., 2006). This property is crucial for our proposed stochastic emulator, which we will introduce in Section 3.
In practice, the infinite expansion in Eq. (6) must be truncated. From the orthonormality of {φ_k} and Eq. (9) it follows that the variance of the random field is equal to Σ_{k=1}^∞ λ_k. The sequence λ_1 ≥ λ_2 ≥ · · · ≥ 0 is non-increasing, and typically (depending on the correlation length of the random field) this sequence decays rather quickly to zero. Loosely speaking, the higher the correlation between different locations in the index set, the fewer spatial basis functions are needed to approximate the trajectories, and the faster the decay of the eigenvalues. Knowing this, the KLE can be truncated at the M-th term, with M chosen so that the fraction of explained variance is sufficiently large:

Σ_{k=1}^M λ_k / Σ_{k=1}^∞ λ_k ≥ 1 − ε

for a small threshold parameter ε > 0 (e.g., ε = 0.001).    (10)
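The explained-variance truncation rule can be sketched in a few lines; the geometric spectrum below is a hypothetical example, not taken from the paper:

```python
import numpy as np

# Hypothetical decaying KLE spectrum (e.g., from a smooth covariance kernel)
lam = 2.0 ** -np.arange(1, 21)          # lambda_k = 2^-k, k = 1..20

# Smallest M such that the retained variance fraction exceeds 1 - eps
eps = 0.001
frac = np.cumsum(lam) / lam.sum()
M = int(np.searchsorted(frac, 1.0 - eps) + 1)
# With lambda_k = 2^-k, the residual after M terms is ~2^-M, so M = 10 here
```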
KLE is closely related to functional principal component analysis (fPCA) (Besse and Ramsay, 1986; Ramsay and Silverman, 2005). To compute a solution to the integral eigenvalue problem in Eq. (7), there are several possibilities (Ramsay and Silverman, 2005, Section 8.4): the integrals can be approximated numerically; the eigenproblem can be discretized on a number of representative grid points in D (this is the approach chosen by the majority of modelers, including Azzi et al. (2019)); or the eigenproblem can be written in terms of a suitable (truncated) spatial basis, which transforms the problem into a (finite-dimensional) discrete eigenvalue problem. The third approach is related to orthogonal series expansion (OSE) (Zhang and Ellingwood, 1994).
It is used by Poirion and Zentner (2014), who derive the explicit discrete problem for a basis consisting of interpolation functions, building on results by Besse and Ramsay (1986) and Besse (1991). We use this approach together with the orthogonal basis provided by polynomial chaos expansion (Section 3). Detailed calculations are provided in Appendix A.

Inference of the joint distribution of the Karhunen-Loève random variables
Characterizing the dependent (but uncorrelated) Karhunen-Loève random variables (KL-RV) ξ_k, k = 1, ..., M correctly is important for the accurate modeling of a general non-Gaussian stochastic process (Grigoriu, 2010). However, inferring the joint distribution of a random vector is a challenging task. The main challenge is the scarcity of data: the higher the dimensionality, the more samples are needed to be able to correctly infer the dependence structure of the data.
We need to construct a suitable parametric or non-parametric model to accurately describe the joint distribution. In the following, we introduce the marginal-copula framework, which is a powerful tool to represent and infer complex dependence structures between random variables (Nelsen, 2006;Torre et al., 2019a).
Let Z be any M-dimensional random vector with multivariate cumulative distribution function (CDF) F_Z and marginal distributions F_{Z_i}. Sklar's theorem states that F_Z can be written as

F_Z(z) = C(F_{Z_1}(z_1), ..., F_{Z_M}(z_M)),    (11)

where the function C : [0, 1]^M → [0, 1] is called a copula (Sklar, 1959; Nelsen, 2006). C is a CDF with uniform marginals, which defines the dependence structure of the random vector Z. C is unique if all marginals F_{Z_i} are continuous, and it holds that

C(u_1, ..., u_M) = F_Z(F_{Z_1}^{-1}(u_1), ..., F_{Z_M}^{-1}(u_M)).

Let an i.i.d. sample Z = {z^(1), ..., z^(N)} of the random vector Z be given. The goal is to infer the joint distribution F_Z from this sample. For this, the copula representation of Eq. (11) is convenient, since it allows inferring the marginals and the dependence structure of the data separately, as briefly explained in the following.
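Sklar's theorem can also be read constructively: sampling a copula and pushing its uniform marginals through inverse marginal CDFs yields a sample of Z. A minimal sketch with a bivariate Gaussian copula follows; the correlation value and the marginal families are illustrative assumptions, not choices from the paper:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Target: Z with Exponential(1) and Uniform(0,1) marginals, coupled by a
# Gaussian copula with correlation rho (all choices here are illustrative).
rho, N = 0.7, 50_000
L = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))
G = rng.standard_normal((N, 2)) @ L.T          # correlated standard normals
U = stats.norm.cdf(G)                          # copula sample on [0, 1]^2

# Sklar's theorem in reverse: push the uniform marginals through F^{-1}
Z1 = stats.expon.ppf(U[:, 0])
Z2 = stats.uniform.ppf(U[:, 1])                # identity map on [0, 1]
```

The resulting pair (Z1, Z2) has the prescribed marginals while inheriting the dependence of the Gaussian copula (Kendall's tau (2/π) arcsin(ρ)).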
To infer the marginal distributions, we consider two options. The first is parametric inference, where we choose, from a set of parametric probability distributions with zero mean and unit standard deviation (see Table 1), the distribution with the smallest Akaike information criterion (AIC). If a distribution family has more than two parameters, its remaining parameters are chosen by maximum likelihood. We utilize the uncertainty quantification software UQLab (Marelli and Sudret, 2014; Torre et al., 2021) with a modification prescribing the desired moments.

Table 1: Considered marginal families with zero mean and unit standard deviation. The last column lists the remaining degrees of freedom k after fixing the first two moments. The Akaike information criterion is then given as AIC = 2k − 2 log L, where L is the likelihood.
A second popular method to represent marginal behavior non-parametrically is kernel density estimation (KDE) (Wand and Jones, 1995;Simonoff, 1996), which has also been proposed for estimating the distribution of KL-RV (Grigoriu, 2010;Poirion and Zentner, 2014). Here the distribution is modeled as a Gaussian mixture, where the Gaussian density functions are centered in the data points and share the same standard deviation, called bandwidth in the case of 1D KDE. We adopt a bandwidth estimation method optimal for data with Gaussian distribution (Bowman and Azzalini, 1997).
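A minimal sketch of 1D Gaussian KDE with the normal-reference bandwidth (the rule that is optimal for Gaussian data) might look as follows; the sample and evaluation grid are illustrative:

```python
import numpy as np

def kde_pdf(data, x, bandwidth=None):
    """1D Gaussian KDE; the default bandwidth is the normal-reference rule
    sigma * (4 / (3 n))^(1/5), optimal for Gaussian data."""
    n = data.size
    if bandwidth is None:
        bandwidth = data.std(ddof=1) * (4.0 / (3.0 * n)) ** 0.2
    u = (x[:, None] - data[None, :]) / bandwidth
    return np.exp(-0.5 * u**2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

rng = np.random.default_rng(3)
sample = rng.standard_normal(5000)
grid = np.linspace(-4, 4, 201)
pdf_hat = kde_pdf(sample, grid)      # close to the standard normal density
```

Since the estimate is a mixture of Gaussians centered at the data points, it integrates to one and is smooth, at the cost of a small smoothing bias.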
To characterize the dependence structure, we use a copula. While any multivariate CDF with uniform marginals U([0, 1]) constitutes a copula, there are a number of well-known parametric families (see, e.g., Nelsen (2006); Joe (2014)). Besides the independence copula and the families derived from multivariate elliptical distributions, most of these parametric families are pair copulas, i.e., they couple only two variables. Constructing meaningful parametric copulas for more than two variables (other than elliptical copulas) is in general difficult (Nelsen, 2006).
A solution is to decompose the M-variate copula into a product of conditional pair copulas, which is known as the vine copula construction (Bedford and Cooke, 2002). This is always possible as a consequence of the chain rule of probability. In general, a vine copula is the product of M(M − 1)/2 (conditional) pair copulas. The factorization into pair copulas is not unique but depends on the ordering and grouping of variables. Two classes of vine copulas, differing in the order in which the variables are grouped into pairs, are the drawable vine (D-vine) (Kurowicka and Cooke, 2005) and the canonical vine (C-vine) (Aas et al., 2009). For a more detailed description of the vine copula construction, we refer to Aas et al. (2009) and Torre et al. (2019b).
To infer a copula from data, we first map the multivariate data to [0, 1]^M by applying elementwise the inferred marginal CDFs (see Eq. (11)). Then we infer the dependence structure by using Kendall's tau to determine the groupings of variables as well as their order in the vine copula (Aas et al., 2009; Torre et al., 2019b). For each pair copula, the parameters are identified by maximum likelihood. Finally, the best-fitting copula is chosen using AIC. This approach is implemented in the statistical inference module of UQLab. The list of available copula families can be found in Lataniotis et al. (2021, Section 1.4).
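The first stage of this inference, mapping the data to pseudo-observations on the unit hypercube and ranking pairs by Kendall's tau, can be sketched as follows. The three-dimensional sample is synthetic, the rank transform stands in for the inferred marginal CDFs, and the actual vine construction and pair-copula fitting of UQLab are not reproduced here:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Illustrative 3-dimensional sample with one strongly dependent pair
n = 2000
z1 = rng.standard_normal(n)
z2 = 0.9 * z1 + 0.1 * rng.standard_normal(n)
z3 = rng.standard_normal(n)
Z = np.column_stack([z1, z2, z3])

# Pseudo-observations on [0, 1]^3 via the empirical CDF (rank transform)
U = stats.rankdata(Z, axis=0) / (n + 1)

# Pairwise Kendall's tau, used to decide which pairs to couple first
tau = np.zeros((3, 3))
for i in range(3):
    for j in range(i + 1, 3):
        tau[i, j] = tau[j, i] = stats.kendalltau(U[:, i], U[:, j])[0]
strongest_pair = np.unravel_index(np.abs(tau).argmax(), tau.shape)
```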

Surrogating a stochastic simulator from a set of samples
We are now ready to describe the construction of our spectral surrogate model for a stochastic simulator. Assume that discrete samples of the stochastic simulator M are available in the following form:

{ y^(r,i) = M(x^(r,i), ω^(r)) : i = 1, ..., N_r, r = 1, ..., R },

i.e., in the form of discrete evaluations of the stochastic simulator on R trajectories, where for the r-th trajectory the responses are sampled at N_r input locations x^(r,1), ..., x^(r,N_r) ∈ D. In particular, for different trajectories the samples can be taken at different locations, i.e., for r_1 ≠ r_2 we can have x^(r_1,i) ≠ x^(r_2,i) and in principle even different numbers of samples N_{r_1} ≠ N_{r_2}. However, here we assume for notational simplicity that N_r = N for all r.
Our proposed method consists of the following steps (see also Fig. 2):
1. Approximate each sampled trajectory by a sparse PCE,

M(x, ω^(r)) ≈ M̂^(r)(x) = Σ_{α∈A^(r)} a^(r)_α ψ_α(x),

with A^(r) the set of regressors with nonzero associated coefficient a^(r)_α. We use a total-degree basis with degree- and q-norm adaptivity to determine the truncation set A^(r), and apply the least-angle regression solver to compute the coefficients (sparse PCE) (Blatman and Sudret, 2011; Lüthen et al., 2021). This results in R PCE trajectories, where each trajectory uses the same set of P PCE basis functions.

2. Center the PCE trajectories by subtracting the sample mean

μ̂(x) = (1/R) Σ_{r=1}^R M̂^(r)(x).    (15)

3. Collect the coefficients ã^(r)_α of the centered trajectories M̃^(r) = M̂^(r) − μ̂ and store them in a P × R matrix ã.
4. Apply extended KLE to the set of PCE trajectories. The sample covariance function has the form

ĉ(x, x′) = (1/(R − 1)) Σ_{r=1}^R M̃^(r)(x) M̃^(r)(x′).

Computing the eigenfunctions φ_k(x) of the associated integral eigenvalue problem in Eq. (7) is equivalent to computing a PCA on the PCE coefficients, i.e., equivalent to solving the following P-dimensional eigenproblem for ã:

(1/(R − 1)) ã ã^T b_k = λ_k b_k

(see Appendix A for the derivation of this equivalence). The eigenvectors b_k contain the coefficients of the eigenfunctions represented in the PCE basis: φ_k(x) = Σ_{α∈A} (b_k)_α ψ_α(x).
5. Identify the truncation order K ≤ P for the KLE based on a given threshold for the explained variance. We use a threshold of 99.9% (see Eq. (10)).
6. Compute the realizations of the KL-RV ξ_k from the sample trajectories by projecting onto the eigenfunctions. Due to the orthonormality of the PCE basis, this can be done analytically (see Appendix A.2). Denote the realizations by ξ^(r) ∈ R^K.
7. Infer the joint distribution f_ξ of the random KL coefficients from the data set {ξ^(r)}_{r=1,...,R}.
We will test four methods consisting of the techniques described in Section 2.4:
(a) Option 1: assume standard Gaussian marginals, which implies independence;
(b) Option 2: parametric inference of the marginals (with moment constraints) and of the copula;
(c) Option 3: 1D kernel density estimation of each marginal, assuming independence;
(d) Option 4: 1D kernel density estimation of each marginal and parametric inference of the copula.
The resulting stochastic model for the random field M is

M̂_x = μ̂(x) + Σ_{k=1}^K √λ_k φ_k(x) ξ_k, with (ξ_1, ..., ξ_K) ∼ f̂_ξ.    (18)

The full procedure is visualized in Fig. 2.
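The core of Steps 4 to 6, PCA on the matrix of PCE coefficients followed by projection onto the eigenvectors, can be sketched as follows. The coefficient matrix below is synthetic, with a planted two-dimensional structure, standing in for the centered coefficients that Steps 1 to 3 would produce:

```python
import numpy as np

rng = np.random.default_rng(5)

# Assume R centered trajectories are already expressed in a shared orthonormal
# PCE basis: column r of `a` holds the P coefficients of trajectory r.
P, R = 8, 300
# Synthetic coefficients with a low-rank structure (two dominant directions)
directions = np.linalg.qr(rng.standard_normal((P, 2)))[0]
a = directions @ np.diag([3.0, 1.0]) @ rng.standard_normal((2, R))
a += 0.01 * rng.standard_normal((P, R))           # small residual noise

# PCA on the coefficients replaces the integral eigenvalue problem of KLE:
C = a @ a.T / (R - 1)                             # P x P sample covariance
lam, B = np.linalg.eigh(C)
lam, B = lam[::-1], B[:, ::-1]                    # sort descending

# Truncate by explained variance (threshold 99.9%, as in the paper)
K = int(np.searchsorted(np.cumsum(lam) / lam.sum(), 0.999) + 1)

# Realizations of the KL-RV: project each trajectory onto the eigenvectors.
# Orthonormality of the PCE basis makes this an ordinary matrix product.
xi = (B[:, :K].T @ a) / np.sqrt(lam[:K, None])    # K x R, unit sample variance
```

On this synthetic example the truncation recovers the planted two-mode structure, and the rows of `xi` feed the distribution inference of Step 7.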
Using the stochastic emulator constructed in Eq. (18), we can easily compute the following quantities:
• The mean function μ̂ is given by the sample mean of the approximated trajectories (PCE trajectories), see Eq. (15).
• The covariance functionĉ(·, ·) can be computed from the KLE eigenfunctions using the truncated version of Eq. (9). Note that this relation does not involve the KL-RV.
• New trajectories (i.e., realizations of the random field) can be generated by drawing new samples of the KL-RV ξ_k and evaluating the truncated expansion of Eq. (18).
• A histogram of the marginal pdf f M x of the random field at any input space location x can be created by generating many new trajectories and evaluating them at x .
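Trajectory resampling from such an emulator can be sketched with a toy two-mode KLE on D = [0, 1]; the mean, eigenpairs, and the independent standard Gaussian KL-RV (Option 1 above) are all illustrative choices, not quantities computed in the paper:

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy emulator on D = [0, 1]: mean and two KLE modes chosen for illustration
mu = lambda x: np.sin(2 * np.pi * x)
phi = [lambda x: np.ones_like(x),              # orthonormal in L^2 w.r.t. U(0,1)
       lambda x: np.sqrt(3.0) * (2 * x - 1)]
lam = np.array([0.5, 0.1])

# Option 1 of the paper: assume independent standard Gaussian KL-RV
def sample_trajectories(n, x):
    xi = rng.standard_normal((n, 2))
    modes = np.stack([p(x) for p in phi])       # (2, len(x))
    return mu(x) + (xi * np.sqrt(lam)) @ modes  # (n, len(x))

x_grid = np.linspace(0, 1, 11)
trajs = sample_trajectories(20_000, x_grid)
# Marginal at x = 0.5: mean mu(0.5) = 0, variance 0.5*1 + 0.1*0 = 0.5
```

A histogram of `trajs[:, j]` then estimates the marginal pdf at `x_grid[j]`, exactly as described in the last bullet point.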

Remark 1 (Another stochastic emulator).
A simple stochastic emulator able to model marginal distributions f M x can be constructed by evaluating all PCE trajectories from Step 2 above at the new location x and computing a kernel density estimate on the resulting set of predictions. This method will be used as a comparison method for marginal estimation in Section 4. However, unlike our stochastic emulator in Eq. (18), this simple emulator is not able to resample trajectories.

Remark 2 (Alternatives to PCE).
We choose PCE to approximate the sampled trajectories because it is a powerful method for deterministic surrogate modeling. However, the choice of PCE in the above method is not crucial: without any changes to the methodology, PCE could be replaced by any other spectral expansion onto an orthonormal basis of L 2 f X (D), e.g., a Poincaré basis  or a spline basis (Rahman, 2020). From the orthonormality of the basis it follows that functional PCA in L 2 f X (D) becomes traditional (unweighted) PCA in the coefficient space (see Appendix A), which avoids the expensive numerical solution of the integral eigenvalue problem in d dimensions, and instead solves an inexpensive discrete eigenvalue problem.

Numerical experiments
To analyse the performance of our stochastic emulator, we apply it to three models of increasing complexity: the three-dimensional Ishigami function with two random parameters (Section 4.1), the borehole model with five hidden (latent) variables (Section 4.2), and finally the Heston stochastic volatility model, a system of two stochastic ODEs with six inputs that has already been used by Zhu and Sudret (2021b) as a stochastic emulator benchmark model (Section 4.3).
We first investigate the pointwise approximation capabilities of our emulator by plotting the stochastic simulator and emulator responses at selected points throughout the input domain.
Then, we investigate the convergence behavior of our stochastic emulator using the following global error measures:
• The global convergence of the marginal distributions is assessed using the averaged normalized Wasserstein distance. The Wasserstein distance of order two between two random variables Y_1, Y_2 with quantile functions (inverse CDFs) Q_1, Q_2 is defined by (Villani, 2009)

d_WS(Y_1, Y_2) = ( ∫_0^1 (Q_1(u) − Q_2(u))^2 du )^{1/2}.

To measure the global quality of the marginal approximation, we consider this distance between the simulator and emulator responses, normalized by the standard deviation of the simulator response and averaged over the input domain. The resulting quantity ε_marg is computed by Monte-Carlo integration on a validation set with N_val = 1,000 points and R_val = 10,000 replications (Zhu and Sudret, 2021a).
• The global error between the true covariance function c and the emulated one ĉ is computed as

ε_cov = ‖C − Ĉ‖_F / ‖C‖_F,    (21)

where C and Ĉ denote the true and emulated covariance matrices for a validation sample {x^(i) : i = 1, . . . , N_val}, and ‖·‖_F is the Frobenius norm.
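Both error measures can be computed from samples as sketched below; the exact normalization of ε_marg used in the paper (following Zhu and Sudret, 2021a) may differ in detail:

```python
import numpy as np

def wasserstein2(y1, y2, n_q=1000):
    """Order-2 Wasserstein distance between two 1D samples, computed
    via their empirical quantile functions."""
    p = (np.arange(n_q) + 0.5) / n_q
    q1, q2 = np.quantile(y1, p), np.quantile(y2, p)
    return np.sqrt(np.mean((q1 - q2) ** 2))

def cov_error(C_true, C_emul):
    """Relative covariance error in the Frobenius norm."""
    return np.linalg.norm(C_true - C_emul) / np.linalg.norm(C_true)
```

For example, shifting a sample by a constant c yields a Wasserstein distance of exactly |c|, since the quantile functions differ by c everywhere.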

Problem statement
The Ishigami function is a well-known benchmark function for deterministic surrogate models.
It is highly non-linear and has significant interaction terms. It becomes a stochastic simulator by treating its parameters a and b, which are usually fixed at a = 7 and b = 0.1, as additional random variables A and B:

M(X_1, X_2, X_3) = sin(X_1) + A sin^2(X_2) + B X_3^4 sin(X_1).    (22)

A and B have here the role of so-called hidden or latent random variables. In other words, we assume that they cannot be observed, and that therefore their values cannot be utilized in the surrogate modeling process. They introduce stochasticity into the otherwise deterministic Ishigami function. The red line in Fig. 3a marks the slice along which the simulator/emulator response is plotted in Fig. 9. The red dots annotated by small numbers denote the five points that are used for visualization in the following. Fig. 3b shows a sample of the latent space (A, B in Eq. (22)).

Analysis of the KL-RV samples
To illustrate the type of result obtained with our proposed stochastic emulator, we compare samples of the KL-RV generated by the emulator with validation samples of the KL-RV. We see that the validation samples have a slightly larger spread than the emulator samples, but that overall the behavior is matched well. Some coefficients exhibit a linear functional dependence, e.g., a_1, a_8 and a_17, which is perfectly reproduced by the emulator. These parameters correspond to the basis functions α_1 = (0, 0, 0) (constant term), α_8 = (0, 4, 0) and α_17 = (0, 6, 0) and are needed to emulate the second term of the stochastic Ishigami model in Eq. (22). There are no interactions with the other terms, therefore a different value of A just proportionally changes the relative weighting of these terms. A similar explanation holds for a_2 and a_6 with α_2 = (1, 0, 0) and α_6 = (1, 0, 2), which are involved in emulating the first and the third term of Eq. (22) and change proportionally with B.

Marginal performance on selected validation points
We now investigate the performance of the stochastic emulator with parametric inference of KL-RV marginals and copulas (choice 7b) on a selection of out-of-sample validation points, i.e., points that were not used for training.  The five selected validation locations {x (i) } 5 i=1 in the input space are visualized in Fig. 3a by red dots. Each black (resp. red) point in Fig. 6 is a new trajectory of the stochastic simulator (resp. emulator) evaluated at the five given points. Both samples have the same size (200 new trajectories). Overall, the model behavior is captured well, but the stochastic simulator has a slightly larger spread (see e.g. Y 2 vs. Y 3 ).
From the data in the off-diagonal scatter plots in Fig. 6, we can compute the sample covariance matrix. However, we can also compute the covariance analytically from the KLE eigenfunctions, using Eq. (9). In Fig. 7, we use the five illustrative points shown in Fig. 3a to compare this covariance estimate to a validation covariance matrix computed empirically from 10, 000 trajectories of the stochastic simulator. Qualitatively, the covariance is reproduced well, although the KLE-based covariance is slightly smaller in magnitude than the empirical covariance.
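Evaluating the covariance analytically from the truncated KLE (Eq. (9)) reduces to a small matrix product; the eigenfunction evaluator below is a hypothetical interface, not the paper's implementation:

```python
import numpy as np

def kle_covariance(eigvals, eigfuns, X):
    """Covariance matrix c(x, x') = sum_i lambda_i phi_i(x) phi_i(x')
    (Eq. (9)) evaluated at the points X.

    eigvals: (M,) KLE eigenvalues; eigfuns: callable mapping an (n, d)
    array of points to the (n, M) matrix of eigenfunction values."""
    Phi = eigfuns(X)                        # (n, M)
    return Phi @ np.diag(eigvals) @ Phi.T   # (n, n) covariance matrix
```

Because the expansion is truncated, this estimate systematically underestimates the variance, consistent with the observation above that the KLE-based covariance is slightly smaller in magnitude than the empirical one.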
In Fig. 8, we compare the marginal distribution predicted by the stochastic emulator for increasing numbers of trajectories R with a kernel density estimate computed from a large validation sample of the stochastic simulator. As expected, we observe that with an increasing number R of trajectories, the shape of the predicted marginal becomes closer to the kernel density estimate of the validation set and shows less variation.
Finally, to assess visually how well the resampled trajectories match the behavior of the original stochastic model, we plot in Fig. 9 a 1D slice of 10 new trajectories generated by the stochastic simulator (left) and the stochastic emulator (middle). The slice through the input space is shown in Fig. 3a by a red line. On the right, data for the same slice is shown, but this time we show quantiles aggregated over 10,000 new trajectories each. The trajectory slices look qualitatively similar, although there is a lot of variability between individual realizations. From the aggregated data on the right, we see that the bulk of the distribution (10%-90% quantile) is predicted quite accurately. Interestingly, in Fig. 9c it seems that the trajectories of the stochastic emulator (red) have a larger spread than the ones of the simulator (black), contrary to the results earlier in this section, which always showed the simulator having a larger spread than the emulator.
This illustrates the difficulty of inferring global behavior from local observations. Theoretically, the emulator should have a smaller variance than the simulator, because terms are missing from Eq. (9) due to truncation.

Convergence with the number of trajectories
To assess the global performance of our proposed method, we now construct stochastic emulators for all combinations of input space experimental design sizes N ∈ {50, 100, 150} and numbers of trajectories in the range R ∈ {10, 30, 100, 300}. We then evaluate each of the resulting stochastic emulators R_val = 10,000 times at N_val = 1,000 validation points in the input space (out-of-sample, i.e. not used for training) and compute the errors as described in Section 4.
Each combination is independently repeated 50 times to account for the statistical uncertainty of the sampling of both experimental design and trajectories, which allows us to display results in the form of Tukey boxplots.
In Fig. 10a, we display the convergence of the marginal error with the number of trajectories for the three experimental design sizes. For the marginal estimates illustrated in Fig. 10b, we add the corresponding value of the normalized Wasserstein distance between simulator and emulator prediction as a small colored circle to the plot in Fig. 10a.
We observe that the quality of the marginal estimates improves as we increase the size of the input parameter sample, which is expected since the PCE approximation of the trajectories becomes better with increasing experimental design size. N = 50 points are clearly too few to achieve a good estimate, whereas N = 100 and N = 150 show convergence with the number of trajectories, indicating that the error from statistical inference is the dominating one. While the convergence of the marginals with the number of trajectories looks slow, the improvement of the marginal shapes is actually significant (compare the values with Fig. 10b).
In addition to marginal predictions, our stochastic emulator can also emulate the covariance function using Eq. (9). In Fig. 10c we display the convergence of ε_cov from Eq. (21). We observe that the error decreases with an increasing number of trajectories. For the largest numbers of trajectories and experimental design points, the error is already in the range of the rough lower bound on achievable accuracy (obtained as described above by empirical sampling of the true model).
Interestingly, unlike the marginal error in Fig. 10a, an increasing number N of input parameter samples does not lead to a smaller covariance error. This indicates that the covariance estimate is less sensitive to the quality of the trajectory approximation, while the inference of the distribution of the KL-RV is more sensitive to it.
So far, we showed results for the stochastic emulator with parametric inference only (Option 7b). Now, in Fig. 11 we compare the four inference options described in Step 7 of the algorithm in Section 3 with the results of a fifth method described in Remark 1, which we call here PCE-KDE. We use the experimental design size N = 100, which yields PCE approximations with a relative validation error of the order of 10^{-5}. Due to this close fit, the PCE-KDE estimate can be considered a near-optimal estimate given the available data.
The error ε_marg is again computed on a validation set consisting of R_val = 10,000 trajectories evaluated at N_val = 1,000 points in the input space.
We observe that for 20 trajectories, the five methods show almost identical performance. This suggests that 20 trajectories are not yet enough to infer a distribution that is able to generalize to unseen data, so that any marginal distribution with mean zero and unit standard deviation provides a reasonable approximation. Comparing with the results for PCE-KDE, we see that our emulator is similarly accurate in prediction at an unseen point as a kernel density estimate using the training set of highly accurate PCE trajectories. This suggests that we do not lose much accuracy by applying our KLE approach on top of the PCE approximation, which can be seen as a form of dimension reduction in the stochastic space.
For the larger number of trajectories, R = 100 and R = 500, we do observe a difference between the performance of the different marginal inference methods: standard Gaussians perform worst, while kernel density estimation without copula performs best of all the inference methods considered. Kernel density estimation with independence assumption performs almost on par with the PCE-KDE estimate. This suggests that the KL-RV are close to independent in this case, and that fitting a vine copula (using the available pair copulas) does not improve the overall inference, at least in the considered cases of limited data. It also demonstrates that the true KL-RV distribution is not well approximated by independent standard Gaussians nor by other currently available parametric families, but that it can be approximated well by the more flexible kernel density estimation. Parametric inference, offering a variety of standard marginal shapes, performs slightly better than Gaussian random variables.
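Two of the inference options (standard Gaussians, and componentwise kernel density estimation under the independence assumption) can be sketched as follows; the parametric and vine-copula variants would follow the same resampling pattern:

```python
import numpy as np
from scipy import stats

def sample_kl_rv(xi_train, n_new, method="kde", rng=None):
    """Resample the KL random variables from observed realizations
    xi_train of shape (R, M), under two of the inference options."""
    rng = np.random.default_rng(rng)
    R, M = xi_train.shape
    if method == "gaussian":
        # Option: independent standard Gaussian KL-RV
        return rng.standard_normal((n_new, M))
    # Option: kernel density estimate per component, independence assumed
    cols = []
    for m in range(M):
        kde = stats.gaussian_kde(xi_train[:, m])
        cols.append(kde.resample(n_new).ravel())
    return np.column_stack(cols)
```

Feeding the resampled KL-RV into the truncated KLE (Eq. (18)) then produces new emulator trajectories.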

Borehole function with latent variables
As a second example, we consider the well-known borehole function, which computes the water flow between two aquifers that are connected by a borehole (Harper and Gupta, 1983). It depends on eight parameters and is defined by

B(r_w, r, T_u, H_u, T_l, H_l, L, K_w) = 2π T_u (H_u − H_l) / [ ln(r/r_w) (1 + 2 L T_u / (ln(r/r_w) r_w^2 K_w) + T_u / T_l) ].

Its input random variables and their distributions are provided in Table 2.
We consider five parameters (Ξ = (R, T u , T l , H l , L)) of the borehole function to be latent, resulting in the three-dimensional stochastic simulatorB(r w , h u , k w ) = B(r w , h u , k w ; Ξ).
For the three-dimensional input space, we use N ∈ {20, 30, 60} input samples and a maximal PCE degree of p = 6. The accuracy of the borehole approximation in terms of relative mean-squared validation error is of the order of 10^{-3}/10^{-7}/10^{-10} for the different experimental design sizes. The number of trajectories is in the range R ∈ {10, 30, 100, 300}.
This results in typically M = 2 KL modes for an explained variance threshold of 99.9%. The eigenvalues of the KLE are approximately λ_1 ≈ 170 and λ_2 ≈ 0.5. The first mode alone covers more than 99.5% of the total variance, even though five independent parameters are used as latent variables. Two of these have a significant total Sobol' index, and the sum of the first-order indices of the latent group is 19%. We will investigate this phenomenon in more detail in Section 5 below. As before, we analyze the global convergence of the marginal and covariance approximation for increasing numbers of input samples and trajectories, and we compare different methods for inferring the distribution of the KL-RV as described in Step 7 of our algorithm (Section 3).
For the detailed explanation of the error measures, the setup of the convergence study, and the interpretation of the plots, see Sections 4 and 4.1.
In Fig. 12a, we see that our method converges in both global error metrics (ε_marg and ε_cov) towards the rough empirical lower bound indicated by the gray area and dashed line. For ε_marg, the difference between the results for the three experimental design sizes N = 20, 30, and 60 is small. This indicates that at least for this model, a validation mean-squared error smaller than O(10^{-3}) does not lead to significantly more accurate results, and that below this accuracy the error is dominated by the inference error.
The convergence of the emulated covariance function is displayed in Fig. 12c. As expected, the global error becomes smaller with an increasing number of trajectories, and approaches the lower bound representing the variability due to the error being computed from samples. Again, the difference in the results for the three experimental design sizes N = 20, 30, 60 is small.
Since the first mode accounts for more than 99.5% of the explained variance, the first KLE eigenfunction has the dominating influence on the covariance estimation (Eq. (9)). The results indicate that the first eigenfunction and its eigenvalue are estimated accurately already for the smallest experimental design sizes.
The comparison of the different inference methods for the distribution of the KL-RV (Step 7 of the algorithm) is displayed in Fig. 13. Similarly to the Ishigami function, we observe that for a small number of trajectories (R = 10 and 30) the four inference methods and PCE-KDE show almost the same performance. Modeling the KL-RV with standard Gaussian distributions seems to offer a slight advantage (resulting in a slightly smaller median error and smaller variability) for small numbers of trajectories, probably because they make the strongest assumptions on the distribution shape, which is advantageous for generalizability in the case of small data. As the number of trajectories grows, a similar pattern as in Section 4.1 emerges: standard Gaussian inference shows the worst performance, followed by parametric inference. Inference with kernel density estimation (dependent and independent) shows the best performance, on par with the PCE-KDE estimate, which (due to the high accuracy of the PCE approximations for N = 60) represents the near-optimal estimate given the available training data.
Interestingly, there is no significant difference between the performance of KDE with and without the independence assumption. Here the magnitude of the eigenvalues might offer an explanation: with more than two orders of magnitude difference between λ 1 and λ 2 , the dependence between the two random variables ξ 1 and ξ 2 does not influence the resulting predictions as much as the correct identification of the marginal shape of the first KL-RV ξ 1 .

Heston stochastic volatility model for a stock price
As a third example, we consider the Heston stochastic volatility model (Heston, 1993), which describes a stock price Y_t with its volatility ν_t modeled as a stochastic process:

dY_t = μ Y_t dt + √(ν_t) Y_t dW_t^(1),    (24)
dν_t = κ (θ − ν_t) dt + σ √(ν_t) dW_t^(2),    (25)

with two Wiener processes W_t^(1) and W_t^(2) with correlation coefficient ρ. This model has six uniformly distributed parameters X = (μ, κ, θ, σ, ρ, ν_0) detailed in Table 3, the bounds of which are calibrated from real data as described in Zhu and Sudret (2021b). The quantity of interest is Y = Y_1, i.e., the stock price after one year. As proposed by Zhu and Sudret (2021b), we set Y_0 = 1, use the Euler-Maruyama method to integrate the system of stochastic differential equations (SDEs), and replace ν_t by max(ν_t, 0) to avoid negative values of ν_t. This model is a stochastic simulator due to the stochasticity induced by the two Wiener processes W_t^(1) and W_t^(2) driving the SDEs.
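A minimal Euler-Maruyama sketch of this simulator (the step count and parameter values below are illustrative, not the paper's settings):

```python
import numpy as np

def heston_em(mu, kappa, theta, sigma, rho, nu0, T=1.0, n_steps=1000,
              y0=1.0, rng=None):
    """Euler-Maruyama integration of the Heston SDE system, replacing
    nu_t by max(nu_t, 0) to avoid negative volatility."""
    rng = np.random.default_rng(rng)
    dt = T / n_steps
    y, nu = y0, nu0
    for _ in range(n_steps):
        z1 = rng.standard_normal()
        # correlate the two Wiener increments with coefficient rho
        z2 = rho * z1 + np.sqrt(1.0 - rho**2) * rng.standard_normal()
        nu_pos = max(nu, 0.0)
        y += mu * y * dt + np.sqrt(nu_pos * dt) * y * z1
        nu += kappa * (theta - nu_pos) * dt + sigma * np.sqrt(nu_pos * dt) * z2
    return y  # stock price at time T
```

Fixing the sequence of Wiener increments (e.g., by fixing the random seed) while varying the six parameters yields one trajectory over the parameter space D.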
A trajectory in the parameter space D is obtained by fixing the realizations of these processes and evaluating Eqs. (24) and (25) for varying parameter values x ∈ D. This results in typically M = 4 to 6 KL modes for an explained variance threshold of 99.9%.
The first eigenvalue is λ_1 ≈ 0.05 and usually covers more than 97% of the variance. Again, we analyze the global convergence of the marginal and covariance approximation in the same way as in the preceding sections. The marginal approximations of the parametric stochastic emulator converge with increasing experimental design size and number of trajectories, but slowly, as displayed in Fig. 14a. There is no significant difference between the three experimental design sizes N = 50, 100, 150. This indicates that the improvement due to a better PCE approximation for an increasing number of experimental design points is overshadowed by the inaccuracy due to the inference of the KL-RV. This, in turn, could be because the PCE approximation is not yet sufficiently accurate (note that the relative validation error is of the order of 10^{-2} for all ED sizes). Even for the largest number of trajectories, the averaged and normalized Wasserstein distance between the responses of the true model and the emulator is still much larger than the variability resulting from sampling the true model, which is illustrated by the gray areas and the dashed line in Fig. 14a (quantiles and median, respectively). Comparing the boxplots to the colored points corresponding to the marginal estimates illustrated in Fig. 14b, we observe that the marginal shape of the stochastic response for one validation point x_val is already captured quite well for 100-300 trajectories (with the value for R = 300 being a bit of an outlier).
The convergence of the covariance function is shown in Fig. 14c. As expected, the covariance estimate becomes better with increasing number of trajectories, even approaching the lower bound obtained by resampling the original model. However, we observe again that an increasing number of experimental design points does not influence the estimate much, which indicates that the covariance estimate is quite robust against the trajectory approximation quality. Since the first mode is also dominant for this example (accounting for ca. 97% of the explained variance), it indicates that the first eigenfunction is well estimated already for small experimental design sizes.

Fig. 15 shows the comparison between the different methods for KL-RV inference (Step 7 of the algorithm in Section 3) as described in Section 4.1.4. All four methods perform comparably. The independent standard Gaussian approximation performs slightly better than the other methods in the case of few trajectories and slightly worse for the case of many trajectories, which is consistent with the previous cases. Again, KDE with and without dependence shows almost identical performance, indicating that either the true copula is the independence copula, or that the existing parametric copulas are not suitable for capturing the dependence structure.
While the inference methods show a similar performance to PCE-KDE for smaller numbers of trajectories, PCE-KDE finds a better marginal approximation when R = 300 trajectories are available. This indicates that some information is lost in the KLE procedure of Section 3.

On the number of KL modes
In this section we investigate how many modes K we can expect in the stochastic emulator of Eq. (18). We consider here a certain class of stochastic simulators that arise from a deterministic model z → M(z) by considering some of its variables as hidden (or latent). In other words, the stochastic simulator is M(X, Ξ), where X are the explicit input variables, and Ξ the latent variables inducing the stochasticity (the random events, see Section 2.1). Assume that all components of Z = (X, Ξ) are independent, and denote by f_k the marginal distribution of Z_k.
Assume further that the deterministic simulator M has finite variance under f_Z. Then it can be decomposed into the Hoeffding-Sobol' decomposition (a.k.a. ANOVA decomposition, analysis of variance) (Hoeffding, 1948; Sobol and Gresham, 1995) as

M(z) = Σ_{I ⊆ {1,...,d}} m_I(z_I),    (27)

which we regroup by explicit and latent variables as

M(x, ξ) = M_0 + M_1(x) + M_2(ξ) + M_3(x, ξ),    (28)

where, e.g., M_1(x) = Σ_{I: Z_I ⊂ X} m_I(z_I). The last summand of Eq. (28) contains all interaction terms from Eq. (27) that involve at least one input and at least one latent variable.
It is a rather common phenomenon in uncertainty quantification that engineering models actually have near-zero interaction terms. In that case, the model is essentially additive with respect to the two groups of variables X and Ξ:

M(x, ξ) ≈ M_0 + M_1(x) + M_2(ξ).    (29)

This means that any realization ξ of the unknown latent variables Ξ results in a constant shift of M(·, ξ) regardless of the value of the input parameters x, a behavior that can be accurately modeled by a single KL mode: if equality holds in Eq. (29), the mean function is μ(x) = M_0 + M_1(x) (recall that E[M_2(Ξ)] = 0 by construction of the Hoeffding-Sobol' decomposition), and the centered process M_2(Ξ) is constant in x, so that a single mode with a constant eigenfunction captures it exactly. For the borehole function with the input distributions of Table 2, its first-order Sobol' indices (Sobol', 1993) sum up to Σ_{i=1}^{8} S_i ≈ 96.7%. The interaction effect between the explicit and the latent group is around 2%. Therefore, only one mode is sufficient to model the stochastic simulator that results from treating several of the model's variables as latent.
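The single-mode phenomenon is easy to reproduce numerically: for an additive model, the centered trajectory matrix has rank one, so the first mode explains (numerically) all of the variance. The toy model below is an illustration, not one of the paper's examples:

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.linspace(0.0, 1.0, 50)        # discretized input domain
xi = rng.standard_normal(100)         # latent realizations

# Additive toy model M(x, xi) = sin(2*pi*x) + xi: no x-xi interaction
T = np.sin(2 * np.pi * xs)[None, :] + xi[:, None]   # (R, n) trajectories

Tc = T - T.mean(axis=0)               # center over the trajectories
eigval = np.linalg.eigvalsh(Tc.T @ Tc / (len(xi) - 1))[::-1]
explained = eigval[0] / eigval.sum()  # close to 1: a single KL mode suffices
```

Adding an interaction term such as ξ·sin(2πx) to the toy model would spread the variance over several modes.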
Conclusions
We presented a spectral surrogate model for stochastic simulators, a special class of computational models whose response for a given input is a random variable. Our method relies on several advanced techniques for modeling uncertainties, such as polynomial chaos expansion (PCE), Karhunen-Loève expansion (KLE), and statistical inference of multivariate distributions. The resulting surrogate model is not only able to emulate the marginal distributions and the covariance structure, but it can also generate new trajectories.
The form of our surrogate model provides insight into the model structure. The number of expansion modes indicates how high-dimensional the underlying stochasticity is. The eigenfunctions of the KLE, which are polynomials, give information about how the input parameters influence the stochastic simulator output. Even though our numerical examples were chosen to represent a range of cases of increasing complexity, we found that one mode was usually sufficient to explain more than 95% of the variance of the stochastic simulator. We were able to explain this phenomenon for the common case of stochastic simulators that arise from finite-dimensional deterministic models with independent inputs and finite variance by treating some of their input variables as latent. Considering the Hoeffding-Sobol' decomposition of the underlying deterministic simulator, we found that if the interaction terms between the explicit and latent variables are negligible, one single KL mode will be sufficient to emulate the behavior of the stochastic simulator. Indeed, by experience, this is a common occurrence in applications of uncertainty quantification. Interactions are rarely dominant in engineering problems, and so one KL mode might suffice in many cases.
From our numerical experiments, we found that the Gaussian (or more generally, parametric) approximation of marginals of the KL-random variables (KL-RV) can be preferable if the number of available trajectories is very small. When more trajectories are available, the better choice is kernel density estimation. Since the first mode was dominant in our numerical examples, the characterization of the dependence between KL-RV turned out not to be crucial, at least for the class of applications considered.
Our numerical tests reveal that the emulator is able to capture the true model behavior reasonably well, as long as enough input samples and trajectories are used. Due to the sequential nature of our approach, it is important to use enough points in the experimental design: if the PCE approximation is not accurate enough, the inferred distribution of the KL-RV will also be inaccurate. Interestingly, we observed in all three examples that the covariance estimate was not heavily influenced by using a larger experimental design, even though the latter typically results in more accurate PCE approximations of the trajectories. This indicates that the number of trajectories is more important for the covariance approximation than the quality of the PCE approximation. Also, it seems that (since the first mode was dominant for all investigated models) the first KLE eigenfunction can be identified accurately already from a small experimental design.
Note that our surrogate relies on the assumption that the trajectories are well approximated by sparse PCE, an assumption not fulfilled by some stochastic simulators, e.g., ones with discontinuous or non-differentiable trajectories in the input parameter space. This could be circumvented by using another basis specially adapted to the purpose, such as wavelets. Furthermore, by construction, our emulator is only suitable for stochastic simulators whose response is correlated throughout the input domain. If there is little to no correlation between the responses at different points in the input space, KLE (which is ultimately a dimension reduction technique) would need infeasibly many modes to converge.
Our methodology can be extended in several ways. The computation of the sparse PCE approximation of the trajectories could be done jointly for all trajectories, instead of fitting each trajectory separately and later discarding regressors. While in our study the dependence between KL-RV was not crucial for the accuracy of the emulator, an improved estimation of the dependence structure would be desirable if for future models more modes turn out to be important. Furthermore, an interesting question is under which circumstances one mode is enough to represent the underlying stochasticity of stochastic simulators, and how the methodology can be adapted to take advantage of this phenomenon. The general idea of representing trajectories by their coefficients, and after dimension reduction modeling their joint distribution, can also be applied outside spectral expansions, e.g. for Bayesian neural networks, at the cost of losing some of the analytical properties following from orthogonality. Finally, in the spirit of common random numbers (Pearce et al., 2022), the applicability of our stochastic emulator for purposes such as optimization should be explored.

Appendix A
In Appendix A.1, we show that if the extended KLE is applied to the PCE approximations of the trajectories, then the sample covariance function is a polynomial, the integral eigenvalue problem reduces to PCA in the expansion coefficients, and the eigenfunctions are polynomials. In Appendix A.2, we show that the realizations of the random KLE coefficients can be determined analytically.
Let

T̂_r(x) = Σ_{α∈A} a_α^{(r)} ψ_α(x),   r = 1, . . . , R,

be the centered PCE trajectory computed from discrete evaluations of the trajectory T_r (Eq. (13)). The sample covariance function is a polynomial given by

ĉ(x, x′) = (1/(R−1)) Σ_{r=1}^{R} T̂_r(x) T̂_r(x′) = Σ_{α,β∈A} S_{αβ} ψ_α(x) ψ_β(x′),

where S_{αβ} = (1/(R−1)) Σ_{r=1}^{R} a_α^{(r)} a_β^{(r)} is the sample covariance matrix of the coefficients.

A.1 Analytical solution of the extended KLE eigenvalue problem
Computing an extended KLE in the function space L^2_{f_X}(D) corresponds to solving the following eigenproblem:

∫_D ĉ(x, x′) φ_i(x′) f_X(x′) dx′ = λ_i φ_i(x).    (32)

Since ĉ is a polynomial, the eigenfunctions will also be polynomials and can be represented in terms of the PCE basis as follows:

φ_i(x) = Σ_{α∈A} b_α^{(i)} ψ_α(x).

Solving Eq. (32) reduces to finding the eigenpairs (λ_i, b^{(i)}). Dropping the i-subscript of the eigenfunction for convenience, we compute

∫_D ĉ(x, x′) φ(x′) f_X(x′) dx′ = Σ_{α,β∈A} S_{αβ} ψ_α(x) ∫_D ψ_β(x′) φ(x′) f_X(x′) dx′ = Σ_{α,β∈A} S_{αβ} b_β ψ_α(x),

where the last equality follows from the orthonormality of the basis, and S_{αβ} = (1/(R−1)) Σ_{r=1}^{R} a_α^{(r)} a_β^{(r)} denotes the sample covariance of the coefficients. Hence, Eq. (32) is equivalent to the matrix eigenvalue problem S b = λ b, i.e., unweighted PCA in the coefficient space.
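The reduction to a matrix eigenproblem can be verified numerically: diagonalizing the coefficient covariance and mapping the eigenvectors through the basis reproduces the sample covariance function exactly. The Legendre basis and coefficient scales below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Orthonormal Legendre basis on [-1, 1] with uniform input weight
psi = [lambda x: np.ones_like(x),
       lambda x: np.sqrt(3.0) * x,
       lambda x: np.sqrt(5.0) / 2.0 * (3.0 * x**2 - 1.0)]

R = 300
# Synthetic coefficients playing the role of centered trajectory coefficients
A = rng.normal(size=(R, 3)) * np.array([2.0, 1.0, 0.2])
S = A.T @ A / (R - 1)            # sample covariance of the coefficients
lam, B = np.linalg.eigh(S)       # eigenpairs (lambda_i, b^(i))

x = np.linspace(-1.0, 1.0, 7)
Psi = np.column_stack([p(x) for p in psi])       # basis values at x
C_sample = Psi @ S @ Psi.T                       # sample covariance on grid
C_kle = (Psi @ B) @ np.diag(lam) @ (Psi @ B).T   # KLE reconstruction
assert np.allclose(C_sample, C_kle)              # identical: S = B diag(lam) B^T
```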