Importance sampling and its optimality for stochastic simulation models

We consider the problem of estimating an expected outcome from a stochastic simulation model. Our goal is to develop a theoretical framework on importance sampling for such estimation. By investigating the variance of an importance sampling estimator, we propose a two-stage procedure that involves a regression stage and a sampling stage to construct the final estimator. We introduce a parametric and a nonparametric regression estimator in the first stage and study how the allocation between the two stages affects the performance of the final estimator. We analyze the variance reduction rates and derive oracle properties of both methods. We evaluate the empirical performances of the methods using two numerical examples and a case study on wind turbine reliability evaluation.


Introduction
The 2011 Fisher lecture (Wu, 2015) features the landscape change in engineering, where computer simulation experiments are replacing physical experiments thanks to advances in modeling and computing technologies. An insight from the lecture is that traditional principles for physical experiments do not necessarily apply to virtual experiments on a computer. The virtual environment calls for new modeling and analysis frameworks distinguished from those developed under the constraints of the physical environment. In this context, this study considers a new problem that has emerged along with the recent development of stochastic simulation-based engineering.
A concrete motivating problem of this study is estimating a system failure probability based on stochastic simulations (although our methods apply more generally to the estimation of any expected outcome, as detailed in the next section). A system configuration, X, is randomly sampled from a known density p and passed to a stochastic simulation model. The simulation model, regarded as a stochastic black box, produces V, which follows an unknown distribution depending on X. When V is greater than a threshold, say ξ, the system fails. Thus, the goal is to estimate the probability P(V ≥ ξ) when X is drawn from the density p. The simulation model in our case study is designed to mimic the real system accurately; it takes roughly one minute of wall-clock time to simulate ten minutes of real operation of the system on a computer commonly available nowadays. The engineering goal is to estimate the probability of system failure during 50 years of operation, which is computationally challenging even with U.S. national labs' supercomputers (Manuel et al., 2013; Graf et al., 2016).
Such computational challenges are commonly observed in engineering simulations. Finite element simulations, which are used widely in various engineering applications, can take hours of computing time to obtain a single data point (e.g., Qian et al., 2006). Despite the computational expense, highly accurate simulations are cost-effective alternatives to physical experiments and used widely in industry (e.g., Ford Motor Company's crash simulation (Wang and Shan, 2007)) and in government (e.g., NASA's rocket booster simulation (Gramacy and Lee, 2012)).
The overarching goal of this study is to propose a framework for minimizing the computational burden necessary to maintain a given level of estimation accuracy. The flip side of the same problem is minimizing the estimation variance given a fixed computational resource. Variance reduction techniques (VRTs) in the simulation literature aim to reduce the variance of an estimator in simulation experiments. Traditional VRTs are well studied for the simulation model that outputs V given X in a deterministic fashion, also known as the deterministic simulation model (see Chapter 9 of Kroese et al. (2011) for a survey of such VRTs), when the input X is sampled from a known distribution. For stochastic simulation models whose underlying processes have known properties (e.g., Markovian), Glynn and Iglehart (1989) and Heidelberger (1995) provide VRTs. For black-box stochastic simulation models, few studies (e.g., Choe et al., 2015; Graf et al., 2017) consider VRTs. The research on VRTs for black-box stochastic simulations remains underdeveloped despite the rapid growth of such simulations in real systems, for example, chemical systems (Gillespie, 2001), biological systems (Henderson et al., 2012), and engineering systems (Ankenman et al., 2010; Picheny et al., 2013; Plumlee and Tuo, 2014).
Among VRTs, importance sampling (Kahn and Marshall, 1953) is known to be one of the most effective methods and has been used widely in various applications such as communication systems (Heidelberger, 1995;Bucklew, 2004), finance (Owen and Zhou, 2000;Glasserman and Li, 2005), insurance (Blanchet and Lam, 2011), and reliability engineering (Au and Beck, 1999;Lawrence et al., 2013;Choe et al., 2016) to name a few.
In the vast majority of the literature, importance sampling takes a parametric form tailored to the problem at hand, for both the deterministic simulation model (e.g., Lawrence et al., 2013) and its stochastic counterpart (Choe et al., 2015). Nonparametric approaches have also been proposed for deterministic simulation models (Zhang, 1996; Neddermeyer, 2009). To the best of our knowledge, no nonparametric approach has been developed for stochastic simulation models. This study particularly considers the black-box stochastic simulation model whose output has an unknown stochastic relationship with the input.
In this paper, we focus on the theoretical aspects of importance sampling with stochastic simulation models. We design two-stage sampling methods that may perform as well as the best sampling method (a property known as the oracle property). The main contributions of this paper to the existing body of literature are as follows:
• We introduce an importance sampling approach to estimate the expectation of black-box stochastic simulation output and study the optimal importance sampler (Section 2).
• We design a two-stage procedure that uses a parametric or a nonparametric regression estimator to approximate the optimal importance sampler (Figures 1 and 2).
• We analyze the allocation of the resources between the two stages (Theorems 3 and 7) and study the convergence of the two-stage procedure toward the oracle importance sampler (Corollaries 4 and 8).
• We conduct an extensive numerical study to evaluate the empirical performances of the two-stage importance samplers (Sections 4.2.1 and 4.2.2).
• We apply our methods to a case study on wind turbine reliability evaluation (Section 4.3) to validate our results.
This paper is organized as follows. In Section 2 we formulate the stochastic simulation-based estimation problem and introduce a two-stage procedure to estimate the expected simulation output. In Section 3 we study the theoretical performance of the proposed procedure and derive the corresponding oracle properties. In Section 4 we evaluate the empirical performance of our approach using two numerical examples and a wind turbine simulator. We discuss our results in Section 5.

Importance sampling for the stochastic simulation model
A stochastic simulation model takes an input configuration value and returns a random number representing the outcome of the simulation. The input configuration determines the distribution of the (random) outcome. Thus, the outcome of a stochastic simulation model can be represented by a random variable V conditioned on the input configuration x, and the CDF of V is
F(v | x) = P(V ≤ v | config = x),
where config = x denotes choosing the configuration to be x. For simplicity, we denote the random variable V conditioned on config = x as V(x).
In many scientific or engineering problems (e.g., Heidelberger, 1995; Au and Beck, 2003; Bucklew, 2004; Graf et al., 2017), we assume that nature generates the configuration from a known density p, and we are interested in evaluating the quantity
E = E(g(V(X))), X ~ p, (1)
where g is a known function.
Example 1. A common example for equation (1) is the case when g(v) = v, often considered in the literature on two-level nested simulation (Sun et al., 2011), where the outer level generates a scenario (or configuration) according to a known density p and conditioning on the scenario, the inner level simulates a random outcome whose mean is of interest. Applications include decision theory (Brennan et al., 2007), financial engineering (Staum, 2009), and queuing system (Sun et al., 2011).

Example 2.
Another example takes g(v) = 1(v ∈ S_ξ) for some set S_ξ parametrized by ξ. Specifically, Choe et al. (2015) consider a reliability evaluation problem where V is an instability measurement of a system that fails when V falls into S_ξ = [ξ, ∞). The goal is to estimate the failure probability when the system operates in its natural environment, where the configuration behaves like a random variable from the density p.
To estimate E, we can choose several configurations x_1, ..., x_n and then run the simulation to obtain realizations v_1 = V(x_1), ..., v_n = V(x_n). However, generating V from a given configuration x is often computationally expensive for stochastic simulation models, so we would like to run the simulation as few times as possible. To reflect this constraint, we assume that we run the simulation only n times but are able to choose the configuration for each simulation. We choose n configurations and evaluate the corresponding values of V. Namely, we only have the pairs (x_1, V_1), ..., (x_n, V_n), where each V_i is a realization of the random variable V(x_i). Such a constraint on the number of simulations, n, is sometimes called a computational budget.
Under this constraint, a natural question is: how do we choose the configurations x_1, ..., x_n? Here we use the idea of importance sampling: we choose x_1, ..., x_n from a density function q. In other words, we first sample X_1, ..., X_n from q and then use x_i = X_i as the configuration to run the i-th stochastic simulation. The density q is called the sampling density. Note that the configurations do not necessarily have to come from the same density function.
When we generate X_1, ..., X_n from q and then obtain V_1, ..., V_n accordingly, a simple estimator of E is
Ê_q = (1/n) Σ_{i=1}^n g(V_i) p(X_i)/q(X_i). (2)
It is easy to see that Ê_q is an unbiased estimator under the assumption that q(x) = 0 implies g(V(x))p(x) = 0 for all x, i.e., when the support of q covers the support of g(V(x))p(x). We call this type of estimator an importance sampling estimator. Throughout this paper, we will focus on importance sampling estimators.
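As a concrete illustration, the estimator (2) can be sketched in a few lines of code. Everything in this toy setup (the one-dimensional Gaussian simulator, the densities, and the threshold) is a hypothetical stand-in, not the paper's case study:

```python
import numpy as np

rng = np.random.default_rng(0)

def is_estimate(g, simulate, p_pdf, q_pdf, q_sample, n):
    """Importance sampling estimate of E = E[g(V(X))], X ~ p.

    Configurations are drawn from the sampling density q; each draw is
    reweighted by the likelihood ratio p(x)/q(x), as in estimator (2).
    """
    xs = q_sample(n)                          # X_1, ..., X_n ~ q
    vs = np.array([simulate(x) for x in xs])  # one stochastic simulation per X_i
    weights = p_pdf(xs) / q_pdf(xs)           # importance weights
    return np.mean(g(vs) * weights)

# Toy stochastic "simulation": V | X = x ~ N(x, 1), with X ~ N(0, 1) under p.
simulate = lambda x: rng.normal(x, 1.0)
p_pdf = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
# Sampling density q = N(1, 1), shifted toward the failure region.
q_pdf = lambda x: np.exp(-(x - 1.0)**2 / 2) / np.sqrt(2 * np.pi)
q_sample = lambda n: rng.normal(1.0, 1.0, size=n)

g = lambda v: (v > 2.0).astype(float)  # failure indicator g(v) = 1(v > xi)
est = is_estimate(g, simulate, p_pdf, q_pdf, q_sample, 20000)
```

Here the true value is P(V > 2) with V ~ N(0, 2), about 0.079; shifting q toward the failure region concentrates the simulations where g is nonzero.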
Using an importance sampling estimator (2), a key problem we want to address is: what is the optimal sampling density q* that minimizes the estimation error? Because the estimator (2) is unbiased, we only need to find the minimal-variance estimator. Thus, the question is equivalent to: what is the optimal sampling density q* that minimizes the variance of Ê_q? Let r(x) = E(g²(V(X)) | X = x). The following lemma provides an answer.

Lemma 1. For any sampling density q whose support covers that of g(V(x))p(x),
Var(Ê_q) ≥ (1/n)·V_min, where V_min = (E(√(r(X*))))² − E², (3)
and X* is a random variable from the density p. The equality holds when we choose q to be q*(x) ∝ √(E(g²(V(X)) | X = x)) · p(x) = √(r(x)) · p(x). Namely, the optimal sampling density is q*(x).
We call the quantity V_min the oracle variance of the importance sampling: (1/n)·V_min is the minimal variance that an importance sampler can achieve. A widely studied special case in engineering is the deterministic simulation model, where Var(V | X = x) = 0 for all x, which implies V_min = 0 for any nonnegative function g(v) (e.g., Kahn and Marshall, 1953; Au and Beck, 1999; Kroese et al., 2011). The density that leads to the oracle variance, q*, is called the optimal sampling density. This density is a modification of the natural configuration density p; q* puts more weight on the regions with higher r(x) (e.g., higher probability of system failure). As long as r is uniformly bounded away from zero and infinity within the support of p, the variance is finite.
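For intuition, the deterministic special case can be verified in one line from the quantities in Lemma 1 (a sketch under the notation above):

```latex
\operatorname{Var}(V \mid X = x) = 0 \;\Longrightarrow\; r(x) = g^2(V(x)), \qquad
V_{\min} = \Big(\mathbb{E}\sqrt{r(X^*)}\Big)^2 - E^2
         = \Big(\mathbb{E}\,|g(V(X^*))|\Big)^2 - \Big(\mathbb{E}\,g(V(X^*))\Big)^2 = 0,
```

where the last equality uses g ≥ 0, so that |g| = g.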
However, we cannot directly generate configurations from q* because it involves the unknown quantity r(x) = E(g²(V(X)) | X = x). A remedy to this problem is a two-stage sampling scheme. In the first stage, we generate part of the configurations and evaluate the corresponding values of V(x). Using the sample from the first stage, we obtain a pilot estimator r̂ of r. In the second stage, we generate configurations from an estimate of q* based on the pilot estimator r̂ and use the remaining computational budget to evaluate values of V(x). Finally, we use the samples from both stages to form the final estimator of E.
Here is a useful insight in estimating r(x). Let Y be a random variable such that Y = g 2 (V (X)). Then r(x) = E(g 2 (V (X))|X = x) = E(Y |X = x). Thus, estimating r(x) is equivalent to estimating the regression function with the observations (X 1 , Y 1 = g 2 (V 1 )), · · · , (X n , Y n = g 2 (V n )) assuming we have n observations.
Thus the two-stage procedure can be summarized as follows. We first generate a size-m sample (X_1, V_1), ..., (X_m, V_m), where X_i, i = 1, ..., m, are from an initial sampling density q_0. Then we transform it into (X_1, Y_1), ..., (X_m, Y_m) with Y_i = g²(V_i). Now we estimate the regression function r(x) by a regression estimator r̂(x) and compute the corresponding estimator q̂* of the oracle sampling density q*. Finally, we generate the remaining data points from q̂* and pool both samples together to form the final estimator of the quantity E. Because q̂* will tend to be closer to q* than the initial sampling density q_0, estimating E using the sample in the second stage is more efficient (lower variance). The sample size m in the first stage is a crucial quantity in our analysis; we call m the allocation. When m is too small, the estimator of q* is inaccurate, so the overall estimation efficiency is suboptimal. When m is too large, only a small budget remains for the second stage, so the overall estimation efficiency is again low. As a result, to balance the estimation accuracy of q̂* against the size of the efficient second-stage sample, there is an optimal value of m depending on the total sample size n. In what follows we propose two different models to estimate r and q*, and analyze the optimal value of the allocation m.
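The two-stage procedure can be sketched end to end on a toy model. Everything here is an illustrative assumption (a Bernoulli simulator, a linear pilot fit, a fixed budget split); the second stage samples from a density proportional to the square root of the fitted regression times p, following the optimal-density formula:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: X ~ Uniform(0,1) (so p = 1 on [0,1]); V | X = x ~ Bernoulli(x).
# With g(v) = v, r(x) = E[g^2(V)|x] = x is what the pilot stage must estimate.
def simulate(x):
    return float(rng.random() < x)

n, m = 4000, 400                      # total budget and first-stage allocation

# Stage 1: pilot sample from q0 = p, regress Y = g^2(V) on X.
x1 = rng.random(m)
y1 = np.array([simulate(x) for x in x1])   # g(v) = v, so Y = g^2(V) = V here
coef = np.polyfit(x1, y1, deg=1)           # simple linear pilot fit of r(x)
r_hat = lambda x: np.clip(np.polyval(coef, x), 1e-3, 1.0)

# Stage 2: sample from q_hat(x) ∝ sqrt(r_hat(x)) p(x) by acceptance-rejection.
x2 = []
while len(x2) < n - m:
    x = rng.random()
    if rng.random() < np.sqrt(r_hat(x)):   # accept prob = sqrt(r_hat) <= 1
        x2.append(x)
x2 = np.array(x2)
v2 = np.array([simulate(x) for x in x2])
const = np.mean(np.sqrt(r_hat(rng.random(100000))))  # MC normalizing constant
w2 = const / np.sqrt(r_hat(x2))                      # weights p(x)/q_hat(x)

# Pool both stages (stage-1 weights are p/q0 = 1).
est = (y1.sum() + (v2 * w2).sum()) / n
```

The pooled estimator targets E = E[X] = 0.5 in this toy model; the clipping of r̂ keeps the acceptance probability valid and the weights bounded.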

Parametric importance sampling
As in the regression analysis, a straightforward approach of estimating the regression function is to assume a parametric model and estimate the corresponding parameters. Namely, we assume r(x) = r θ (x) for some θ ∈ Θ and use the first part of the sample to estimate θ.
To estimate r_θ(x), we use a classical approach, the least squares method:
θ̂_m = argmin_{θ∈Θ} Σ_{i=1}^m (Y_i − r_θ(X_i))². (4)
Then the estimator is r̂(x) = r_{θ̂_m}(x). Note that one can also assume a parametric form for the distribution of Y_i | X_i and then use a maximum likelihood estimator.
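A minimal sketch of the least squares step (4), assuming a hypothetical linear-in-parameters family r_θ(x) = θ_0 + θ_1·x and synthetic pilot data standing in for (X_i, Y_i = g²(V_i)):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical parametric family: r_theta(x) = theta_0 + theta_1 * x.
# Synthetic pilot data with true theta = (0.2, 0.5).
m = 2000
x = rng.random(m)
y = 0.2 + 0.5 * x + rng.normal(0, 0.1, size=m)   # stand-in for Y_i = g^2(V_i)

# Least squares criterion (4): theta_hat = argmin sum_i (Y_i - r_theta(X_i))^2.
design = np.column_stack([np.ones(m), x])        # columns for theta_0, theta_1
theta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)

r_hat = lambda x0: theta_hat[0] + theta_hat[1] * x0   # fitted r_{theta_hat_m}
```

Because the family is linear in θ, the minimizer of (4) has the closed form delivered by ordinary least squares; for a family nonlinear in θ one would use an iterative least squares solver instead.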
Parametric Importance Sampling
(S1) Choose an initial sampling density q_0 and generate the first part of the sample (X_1, V_1), ..., (X_m, V_m).
(S2) Transform the sample into (X_1, Y_1), ..., (X_m, Y_m) with Y_i = g²(V_i).
(S3) Use the least squares method (4) to obtain θ̂_m and the estimator r_{θ̂_m}(x).
(S4) Change the sampling density to q̂*_{θ̂_m} and generate the remaining sample (X_{m+1}, V_{m+1}), ..., (X_n, V_n).
(S5) The final estimator is
Ê_{θ̂_m} = (1/n) [ Σ_{i=1}^m g(V_i) p(X_i)/q_0(X_i) + Σ_{i=m+1}^n g(V_i) p(X_i)/q̂*_{θ̂_m}(X_i) ]. (5)
Using the estimator θ̂_m, we can estimate the regression function r_{θ̂_m} and construct the estimated optimal sampling density q̂*_{θ̂_m}(x) ∝ √(r_{θ̂_m}(x)) · p(x). For the remaining (n − m) data points, we generate the configurations from q̂*_{θ̂_m}, run the simulation, and estimate E accordingly.
We summarize our Parametric Importance Sampling method in Figure 1. Later, in Section 3.1, we will derive the variance of this approach and show that the optimal allocation is to choose m = O(n^{2/3}).

Nonparametric importance sampling
Now we consider estimating r(x) nonparametrically. For simplicity, we use the kernel regression (Nadaraya, 1964; Watson, 1964). Note that other nonparametric regression approaches, such as the local polynomial regression (Wasserman, 2006), also work. The kernel regression uses the estimator
r̂_h(x) = [ Σ_{i=1}^m Y_i K((x − X_i)/h) ] / [ Σ_{i=1}^m K((x − X_i)/h) ], (6)
where K is a smooth function (known as the kernel function), such as a Gaussian, and h > 0 is the smoothing bandwidth. Similar to the parametric approach, we then use this estimator to construct an estimated optimal sampling density q̂*_h(x) ∝ √(r̂_h(x)) · p(x).
Nonparametric Importance Sampling
(S1) Choose an initial sampling density q_0 and generate the first part of the sample (X_1, V_1), ..., (X_m, V_m).
(S2) Transform the sample into (X_1, Y_1), ..., (X_m, Y_m) with Y_i = g²(V_i).
(S3) Apply the nonparametric regression to (X_1, Y_1), ..., (X_m, Y_m) to obtain the estimators r̂_h and q̂*_h.
(S4) Change the sampling density to q̂*_h, generate the remaining sample (X_{m+1}, V_{m+1}), ..., (X_n, V_n), and construct the final estimator
Ê_h = (1/n) [ Σ_{i=1}^m g(V_i) p(X_i)/q_0(X_i) + Σ_{i=m+1}^n g(V_i) p(X_i)/q̂*_h(X_i) ] (7)
using the procedure described previously.
Figure 2 summarizes the procedure of nonparametric importance sampling. There are two tuning parameters to select: the smoothing bandwidth h and the allocation size m. The smoothing bandwidth can be chosen by either cross-validation or a reference rule. In Section 3.3, we will derive the optimal rate for the smoothing bandwidth, h = O((log m/m)^{1/(d+4)}), and the optimal allocation m ≍ (n log n)^{(d+4)/(d+6)}.
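As an illustration, the Nadaraya-Watson estimator can be written directly in a few lines. The data-generating function, the d = 1 setting, and the bandwidth below are toy assumptions, not the paper's examples:

```python
import numpy as np

rng = np.random.default_rng(3)

def nadaraya_watson(x_train, y_train, x_eval, h):
    """Kernel regression estimate of r(x) = E[Y | X = x] with a Gaussian kernel."""
    # Pairwise kernel weights K((x_eval_j - x_train_i) / h); the kernel's
    # normalizing constant cancels in the ratio below.
    diffs = (x_eval[:, None] - x_train[None, :]) / h
    k = np.exp(-0.5 * diffs**2)
    return (k * y_train[None, :]).sum(axis=1) / k.sum(axis=1)

# Synthetic pilot data with r(x) = x^2: Y = r(X) + noise, X ~ Uniform(0, 1).
m = 3000
x = rng.random(m)
y = x**2 + rng.normal(0, 0.2, size=m)

h = (np.log(m) / m) ** (1 / 5)     # rate h ≍ (log m / m)^{1/(d+4)} with d = 1
r_hat = nadaraya_watson(x, y, np.array([0.5]), h)[0]   # estimate r(0.5) = 0.25
```

The bandwidth follows the rate derived later; with this fairly wide h the estimate at x = 0.5 carries a visible smoothing bias, which is exactly the O(h²) bias term in the analysis.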

Remark. We may rewrite the final estimator as
Ê = (m/n) Ê_1 + ((n − m)/n) Ê_2,
where Ê_1 and Ê_2 are the estimators constructed from the first-stage and the second-stage samples, respectively. If we know the variances of Ê_1 and Ê_2, then we can further reduce the variance of the final estimator by taking a weighted combination with weights inversely proportional to the variances.
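The weighted combination just mentioned is the standard inverse-variance weighting; as a sketch (assuming the two stage-wise estimators are unbiased and independent):

```latex
\hat{E}_w = w\,\hat{E}_1 + (1-w)\,\hat{E}_2, \qquad
w^* = \frac{\operatorname{Var}(\hat{E}_2)}{\operatorname{Var}(\hat{E}_1) + \operatorname{Var}(\hat{E}_2)},
```

which gives Var(Ê_{w*}) = (Var(Ê_1)^{-1} + Var(Ê_2)^{-1})^{-1} ≤ min{Var(Ê_1), Var(Ê_2)}; in practice the two variances would themselves have to be estimated.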

Theoretical analysis
Throughout our analysis, we assume that the natural configuration density p has a compact support K ⊂ R d and the support of the initial sampling density q 0 contains K.

Parametric importance sampling
Assumptions.
(P1) There exists a unique θ_0 ∈ Θ such that r(x) = r_{θ_0}(x), and sup_{x∈K} Var(Y | X = x) < ∞.
(P2) The estimator satisfies ‖θ̂_m − θ_0‖ = O_P(m^{−1/2}).
(P3) There exists a constant C_0 such that sup_{x∈K} |r_θ(x) − r_{θ_0}(x)| ≤ C_0 ‖θ − θ_0‖ for all θ in a neighborhood of θ_0.
In addition, the support of r_θ(x) contains the support of p(x) for every θ ∈ Θ, and r(x) > 0 for all x ∈ K. Also, the PDF q_0(x) exists and has non-zero variance.
(P1) means that the model is correctly specified -the regression function can be parametrized in the parametric model we consider. (P2) is a common assumption in the M-estimation theory (van der Vaart and Wellner, 1996;van der Vaart, 2000) to derive the convergence rate. The extra assumption (P3) is a mild assumption that converts the convergence rate of parameter estimation to the convergence rate of function estimation. As long as r θ (x) is smooth within an open set around θ 0 , (P3) holds.
The following theorem describes the estimation error when the parametric family contains the true regression function.

Theorem 2. Assume (P1-3). The error rate for the estimator r_{θ̂_m} is
sup_{x∈K} | r_{θ̂_m}(x) − r(x) | = O_P(m^{−1/2}).
Theorem 2 presents the error rate for estimating r(x) when the model is correct. Based on this error rate, we can further derive the variance of the parametric importance sampler in Figure 1.
Theorem 3. Assume (P1-3). Let V_{q_0} be the excess variance from using q_0 compared to q*, i.e., the difference between the single-sample variances under q_0 and under q*. The variance of the estimator Ê_{θ̂_m} is
Var(Ê_{θ̂_m}) = (1/n)·V_min + (1/n²)·m·V_{q_0} + ((n − m)/n²)·O(m^{−1/2}).
Theorem 3 has three components. The first component, (1/n)·V_min, is the oracle variance mentioned previously; it is the minimal variance that can be achieved by an importance sampling estimator. The second component, (1/n²)·m·V_{q_0}, is the excess variance due to the initial sampling density. The third component is the excess variance due to the error of the estimator r_{θ̂_m}(x).
By balancing the second and third components over m, we obtain the optimal rate of m as a function of the sample size n:
m·V_{q_0} ≍ (n − m)·m^{−1/2}, i.e., m ≍ n^{2/3},
where the notation ≍ means that the two quantities are of the same order, i.e., a_n ≍ b_n ⇔ lim_{n→∞} a_n/b_n ∈ (0, ∞). Thus, the optimal allocation is to choose m ≍ n^{2/3}, which leads to the following:

Corollary 4. Assume (P1-3). Under the allocation m ≍ n^{2/3},
Var(Ê_{θ̂_m}) = (1/n)·V_min·(1 + O(n^{−1/3})).

That is, if the model is correctly specified, the excess variance shrinks at rate O(n^{−1/3}) under the optimal allocation. Note that if we add conditions so that the error of r_{θ̂_m} contributes to the variance only through a term bounded by C(x)/m for some uniformly bounded function C(x), the variance in Theorem 3 can be improved in the sense that the final term becomes ((n − m)/n²)·O(1/m), which leads to the optimal allocation m ≍ n^{1/2}. This allocation rate is similar to Li et al. (2013), although we consider different scenarios. Li et al. (2013) consider mixture importance sampling for a deterministic (as opposed to stochastic) simulation model and study the optimal allocation, where the optimal mixing weight (of each component) is unknown and has to be estimated using a pilot sample. In our case, the parametric model approximates the regression function implied by a stochastic simulation model.
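As a quick arithmetic check of the allocation rule (constants suppressed and set to one, values purely illustrative), the two excess-variance terms indeed balance at m ≍ n^{2/3}:

```python
# For a budget n, the optimal parametric allocation is m ≍ n^{2/3}.
n = 8000
m = round(n ** (2 / 3))                  # 8000^{2/3} = 400

# The two excess-variance terms in Theorem 3, with all constants set to 1:
# initial-density term m/n^2 and estimation-error term (n - m)/n^2 * m^{-1/2}.
term_q0 = m / n**2
term_err = (n - m) / n**2 * m ** (-0.5)

ratio = term_q0 / term_err               # should be of order one at the optimum
```

At any other allocation (say m ≍ n^{1/2} or m ≍ n), one of the two terms dominates and the total excess variance is strictly larger in order.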
The key assumption of the parametric method is (P1): the actual r(x) belongs to the parametric family. However, if this assumption is violated, then the excess variance in the parametric method will never shrink to 0.
Theorem 5. Assume (P2-3) and that the model is misspecified, i.e., r(x) ≠ r_θ(x) for every θ ∈ Θ, with θ̂_m converging to some θ* ∈ Θ. Then
Var(Ê_{θ̂_m}) = (1/n)·(V_min + V_{θ*})·(1 + o(1))
for some constant V_{θ*} > 0 determined by the discrepancy between r and r_{θ*}.

The proof of this theorem is straightforward, so we omit it. Theorem 5 shows that when the model is incorrectly specified, there is an additional variance V_{θ*} that never disappears. Thus, the variance of the parametric importance sampler will not converge to the oracle variance. Later we will see that this implies the parametric importance sampler does not satisfy the oracle inequalities when the model is incorrectly specified.

Nonparametric importance sampling
In this section, we study the properties of the nonparametric importance sampler in Figure 2. As with the parametric importance sampler, we first derive the convergence rate for estimating r(x), then derive a variance decomposition for Var(Ê_h), and finally study the optimal allocation. Assumptions.
(N1) For all x, the function r(x) has a bounded second derivative, and q_0(x) has a bounded first derivative.
(N2) sup_{x∈K} Var(Y | X = x) < ∞ and r(x) > 0 for all x ∈ K.
(K1) The kernel function K is a smooth, symmetric probability density with ∫ x² K(x) dx < ∞ and ∫ K²(x) dx < ∞.
(K2) The collection of functions
F = { y ↦ K((x − y)/h) : x ∈ K, h > 0 }
is a VC-type class, i.e., there exist constants A, v and a constant envelope b_0 such that
sup_Q N(F, L_2(Q), ε·b_0) ≤ (A/ε)^v,
where N(T, d_T, ε) is the ε-covering number for a semi-metric set T with metric d_T, and L_2(Q) is the L_2 norm with respect to the probability measure Q.
(N1) and (N2) are common assumptions for nonparametric regression; see, e.g., Wasserman (2006) and Györfi et al. (2006). (K1) is a standard condition on the kernel function (Wasserman, 2006; Scott, 2015). (K2) regularizes the complexity of kernel functions so that we have a uniform bound on the stochastic variation. This assumption was first proposed in Giné and Guillou (2002) and Einmahl and Mason (2005), and was later used in various studies such as Genovese et al. (2014) and Chen et al. (2015b, 2017). Based on the above assumptions, the uniform convergence rate of the kernel estimator r̂_h(x) is given by the following.

Theorem 6. Assume (N1-2), (K1-2). The error rate of the kernel estimator r̂_h is
sup_{x∈K} | r̂_h(x) − r(x) | = O(h²) + O_P(√(log m/(m·h^d))).
The error in Theorem 6 can be decomposed into two parts: the bias part O(h²) and the stochastic variation O_P(√(log m/(m·h^d))) (which is related to the variance).
In many nonparametric studies, similar bounds appear for density estimation; see, e.g., Giné and Guillou (2002) and Einmahl and Mason (2005). Balancing the two parts gives the optimal smoothing bandwidth h* ≍ (log m/m)^{1/(d+4)} and the corresponding error rate O_P((log m/m)^{2/(d+4)}). Under such an optimal error rate, we again obtain the variance decomposition for the nonparametric importance sampler.
Theorem 7. Assume (N1-2), (K1-2). Let V_{q_0} be the excess variance from using q_0 compared to q*, as in Theorem 3. The variance of the estimator Ê_{h*} under the optimal smoothing bandwidth is
Var(Ê_{h*}) = (1/n)·V_min + (1/n²)·m·V_{q_0} + ((n − m)/n²)·O_P((log m/m)^{2/(d+4)}).
Similar to Theorem 3, the variance in Theorem 7 has three components: the oracle variance (1/n)·V_min, the excess variance (1/n²)·m·V_{q_0} due to the initial sampling density, and the excess variance from the estimator r̂_{h*}(x).
To obtain the rate of the optimal allocation, we equate the two excess variances:
m·V_{q_0} ≍ (n − m)·(log m/m)^{2/(d+4)}, which gives m ≍ (n log n)^{(d+4)/(d+6)}
(ignoring the log log n and multi-logarithm terms).
This choice of m yields the following variance reduction rate.

Corollary 8. Assume (N1-2), (K1-2). Under the allocation m ≍ (n log n)^{(d+4)/(d+6)},
Var(Ê_{h*}) = (1/n)·V_min·(1 + O(n^{−2/(d+6)} log^{(4d+20)/(d+4)} n)) = (1/n)·V_min·(1 + O(n^{−2/(d+6)} log^5 n)).
Note that in the last equality in Corollary 8, we use the fact that a_n = O(log^{(4d+20)/(d+4)} n) implies a_n = O(log^5 n) to simplify the expression. Corollary 8 shows that under the optimal allocation, the excess variance of the nonparametric importance sampler shrinks at rate O(n^{−2/(d+6)} log^5 n), which is only slightly slower than the rate of the parametric importance sampler under a correct model (O(n^{−1/3}) by Corollary 4). Although the parametric method enjoys a fast convergence rate, it rests on a very restrictive assumption: the model has to be correctly specified. This assumption rarely holds exactly in applications. Thus, even though the nonparametric importance sampler has a slower variance reduction rate, the nonparametric approach has its own merit in applicability.
There is a limitation to the nonparametric approach: the curse of dimensionality. When d is not small, the convergence rate is very slow, so the nonparametric approach may not be applicable to scenarios with more than, say, four variables. In this situation, even if the parametric model does not converge to the optimal sampler, it may still be the preferred approach: the parametric model converges to the best possible result under the misspecified model as long as the conditions of Theorem 2 hold.

Remark 1. Note that the nonparametric rate can be improved if the regression function is very smooth and we use a higher-order kernel (Wasserman, 2006). If r(x) has bounded derivatives up to order β and we use a β-th order kernel, then under the corresponding optimal bandwidth and optimal allocation, the variance of the nonparametric importance sampler will be
Var(Ê_{h*}) = (1/n)·V_min·(1 + O(n^{−β/(3β+d)})) (ignoring logarithmic terms).
When β → ∞, the variance of the nonparametric importance sampler becomes (1/n)·V_min·(1 + O(n^{−1/3})) (ignoring the log n term), which recovers the parametric rate in Corollary 4.

Oracle properties
In Lemma 1, we see that the oracle sampler Ê_{q*} has the minimal variance. Let
G = { q : q(x) = 0 implies g(V(x))p(x) = 0 for all x }
be the collection of density functions that lead to an unbiased importance sampler, and let Ξ = { Ê_q : q ∈ G } be the collection of all possible unbiased importance samplers. Because all importance samplers from G are unbiased, any estimator Ê ∈ Ξ satisfies
MSE(Ê) = Var(Ê), (10)
so the oracle sampler Ê_{q*} satisfies
MSE(Ê_{q*}) = Var(Ê_{q*}) = (1/n)·V_min = inf_{Ê∈Ξ} MSE(Ê). (11)
Because of equation (11), Ê_{q*} is called the oracle for Ξ with respect to the mean square error in nonparametric theory; see, e.g., pages 60-61 in Tsybakov (2009).
We say Ê† ∈ Ξ satisfies the oracle inequalities if
MSE(Ê†) = (1 + o(1)) · inf_{Ê∈Ξ} MSE(Ê).
Note that the estimators in the above expressions are all based on a size-n sample; we omit the subscript n for brevity. In nonparametric theory, an estimator satisfying the oracle inequalities is asymptotically as good as the optimal estimator. The crude Monte Carlo sampler Ê_p (which samples only from the natural configuration density p (Kroese et al., 2011)) obviously does not satisfy the oracle inequalities, because its variance (1/n)·Var(g(V(X*))) is strictly larger than (1/n)·V_min in general. The parametric importance sampler Ê_{θ̂_m} satisfies the oracle inequalities when the model is correctly specified (i.e., r(x) = r_θ(x) for some θ ∈ Θ). To see this, recall Corollary 4 and equation (10):
MSE(Ê_{θ̂_m}) = (1/n)·V_min·(1 + O(n^{−1/3})) = (1 + o(1)) · inf_{Ê∈Ξ} MSE(Ê).
However, when the model is incorrect, Theorem 5 proves that Ê_{θ̂_m} does not satisfy the oracle inequalities:
MSE(Ê_{θ̂_m}) = (1/n)·(V_min + V_{θ*})·(1 + o(1)).
The nonparametric importance sampler has the advantage that it satisfies the oracle inequalities in most cases. By Corollary 8 and equation (10),
MSE(Ê_{h*}) = (1/n)·V_min·(1 + O(n^{−2/(d+6)} log^5 n)) = (1 + o(1)) · inf_{Ê∈Ξ} MSE(Ê).
Thus, without any further information about the structure of r(x), we recommend using the nonparametric importance sampler, since it behaves asymptotically as well as the oracle (optimal) importance sampler.

Remark 2. The way we obtain the oracle property is very different from the classical approach. Many estimators with oracle properties are constructed by minimizing an estimated risk (Tsybakov, 2009). That is, for a collection of estimators, the risk of each is estimated, and the one minimizing the (estimated) risk is chosen. When the risk is consistently estimated uniformly over all estimators, this procedure leads to an estimator with the oracle property. In our case, however, we do not consider any risk estimator, nor do we choose an estimator from many candidates, yet we still obtain the oracle property.

Empirical analysis
To evaluate the empirical performances of the importance samplers, this section presents an implementation guideline, a numerical study, and a case study.

Implementation guideline
To implement parametric or nonparametric importance sampling, we can follow the procedure in Figure 1 or Figure 2, respectively. In practice, n is typically determined by the available computational budget. We can choose m according to the optimal allocation rate: m ≍ n^{2/3} in Corollary 4 for the parametric importance sampler, or m ≍ (n log n)^{(d+4)/(d+6)} in Corollary 8 for the nonparametric importance sampler. In the range of experiments we present below, any choice of multiplicative constant between two and six results in similar empirical performances of the importance samplers.
In the first stage, we can simply choose the natural configuration density p as the initial sampling density q_0 to maintain the same error rate as the crude Monte Carlo sampler. In the second stage, once we build a regression model r̂(x) for the unknown conditional expectation r(x) = E(g²(V(X)) | X = x), we can sample exactly from q̂(x) ∝ √(r̂(x)) · p(x) using the acceptance-rejection method (Kroese et al., 2011, p. 59) by using p(x) as the proposal (or envelope) density and finding an upper bound on √(r̂(x)). The upper bound can be computed numerically or may even be known from physical/engineering knowledge of the system (e.g., the simulation output V can take values only up to a certain physical/engineering limit, or r̂(x) is a probability). As an alternative to the acceptance-rejection method, Markov chain Monte Carlo methods can be used as well.
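A sketch of the acceptance-rejection step just described, under toy assumptions (uniform p and a hypothetical fitted r̂ that is a probability, so the upper bound on √r̂ is one):

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_q(r_hat, p_sample, upper, size):
    """Acceptance-rejection draws from q(x) ∝ sqrt(r_hat(x)) p(x).

    Uses p as the proposal (envelope) density; `upper` bounds sqrt(r_hat(x)).
    """
    out = []
    while len(out) < size:
        x = p_sample()
        if rng.random() < np.sqrt(r_hat(x)) / upper:   # accept/reject step
            out.append(x)
    return np.array(out)

# Toy setup: p = Uniform(0, 1); r_hat(x) = x^2 is a probability, so upper = 1.
xs = sample_q(lambda x: x**2, lambda: rng.random(), 1.0, 20000)
# Accepted draws have density q(x) ∝ x on [0, 1], whose mean is 2/3.
mean_q = xs.mean()
```

The accepted sample is an exact draw from q̂ (no approximation beyond r̂ itself), which is what allows the unbiasedness argument for the second-stage estimator to go through.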
Also, for an importance sampling estimator (e.g., in (5) or (7)), the normalization constant of q̂(x) can be calculated to a desired level of accuracy, independent of n, by using numerical integration, such as quadrature for low-dimensional x and Monte Carlo integration for high-dimensional x. Thus, we can ensure that estimation of the normalization constant will not affect the asymptotic optimality and empirical performance of the importance sampling estimator. If desired, one can avoid calculating the normalization constant by using a self-normalized importance sampling estimator at the expense of bias. A more in-depth discussion of this trade-off is given by Owen (2018).
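The Monte Carlo estimate of the normalizing constant can be sketched as follows (toy uniform p and a hypothetical r̂; the exact constant is available analytically here only to check the estimate):

```python
import numpy as np

rng = np.random.default_rng(5)

# The normalizing constant of q(x) ∝ sqrt(r_hat(x)) p(x) is
# c = E_p[sqrt(r_hat(X))], which plain Monte Carlo integration estimates
# to any desired accuracy, independently of the simulation budget n.
r_hat = lambda x: x**2        # hypothetical fitted regression (a probability)
n_mc = 200000
x = rng.random(n_mc)          # draws from p = Uniform(0, 1)
c_hat = np.mean(np.sqrt(r_hat(x)))
# Exact constant: integral of x dx over [0, 1] = 0.5.
```

Because these draws only evaluate r̂ (not the expensive simulator), n_mc can be made much larger than n at negligible cost, as the guideline notes.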
Note that in practice, the computational costs of the acceptance-rejection method and the numerical integration are negligible (a matter of hours, if not minutes) compared to running high-fidelity simulation models such as those discussed in Section 1 and the simulation model in our case study. For example, a simulation experiment can take days (if not weeks) even with a supercomputer (Manuel et al., 2013;Graf et al., 2016).

Numerical study
Our numerical study considers two examples, one with normal distributions for X and V | X and the other with exponential distributions for X and V | X. Motivated by our case study, we estimate the probability E = P(V > ξ) = E(g(V(X))) for g(V) = 1(V > ξ) and a pre-specified ξ > 0. As a baseline, we set ξ such that P(V > ξ) equals 0.5 (unless specified otherwise) regardless of the input configuration dimension d, because it is known that the performance of an importance sampler often depends on the probability being estimated (Heidelberger, 1995; Kroese et al., 2011).
We vary the total sample size n = 1000, 2000, 4000, 8000 and the input configuration dimension d = 1, 2, 4 to see their impacts on the mean squared error
MSE = (1/n_MC) Σ_{i=1}^{n_MC} (Ê_i − E)²,
where Ê_i is the estimate from the i-th replication and the total number of Monte Carlo replications, n_MC, is set to 10,000 to obtain reliable results. We use high-performance computing (Lenovo NextScale E5-2680 v4, 112 cores in total with 1 TB RAM) for our simulation experiments, which take several weeks in total. The R scripts for the experiments are available as supplementary material.
We consider two parametric importance samplers, one with a correctly specified model of r(x) and the other with an incorrect model, as well as a nonparametric importance sampler. To build the models of r(x), we use the first-stage sample of size m chosen according to the implementation guideline above. To sample from the importance sampling density q̂(x) ∝ √(r̂(x)) · p(x) using the acceptance-rejection method, we use p(x) as the envelope density because p(x) ≥ √(r̂(x)) p(x) in the examples: we sample x from p(x) and accept it with probability √(r̂(x)). To compute the normalizing constant of q̂(x), we use Monte Carlo integration. Since we know the true r(x) in the examples (although it is unknown in practice), we calculate the true E = P(V > ξ) and V_min using Monte Carlo integration and use them to calculate MSEs and to demonstrate how the empirical results conform to the theoretical predictions made in Section 3.

Example 1: normal-normal data generating model
As a modification of an example in Ackley (1987), we use a data generating model in which the d-dimensional input vector X = (X^{(1)}, ..., X^{(d)}) follows a multivariate normal distribution with zero mean and identity covariance matrix, and the output V at X follows N(μ(X), 1), where the mean function μ(X) is an Ackley-type function built from the terms cos(2πX^{(i)}), i = 1, ..., d.
Thus, we have r(x) = Φ(μ(x) − ξ), where Φ(·) is the CDF of a standard normal distribution. As parametric models of r(x), we consider two models: a correct model, for which r_θ(x) = r(x) at some θ = (θ_0, . . . , θ_d) ∈ Θ, and an incorrect model. For fitting with the least squares method, the initial parameters are set at the correct values, i.e., θ_0 = . . . = θ_d = 1, in the implementation. Figure 3(d) shows nMSE against d for fixed n = 8000. Regardless of d, nMSE of the correct parametric importance sampler stays close to V_min. In contrast, nMSE of the incorrect parametric importance sampler essentially remains the same as d varies in this example, although this observation cannot be taken as a general pattern because the input configuration dimension d impacts how incorrect the model is. While the nonparametric importance sampler performs almost as well as the correct parametric importance sampler when d = 1, the performance gap widens as d increases since n is fixed.
If X were sampled only from the natural configuration density p instead of an importance sampling density q in the estimator in (2), then this simple baseline approach, commonly called crude Monte Carlo (CMC) (Kroese et al., 2011), would result in an estimator having the theoretical nMSE of 0.25 (= P(V > ξ)(1 − P(V > ξ))). In this example, the incorrect parametric importance sampler essentially does not improve over the baseline.
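The CMC baseline and its theoretical nMSE of P(V > ξ)(1 − P(V > ξ)) can be reproduced in a few lines; the normal toy model below is a hypothetical stand-in for the simulator:

```python
import numpy as np

rng = np.random.default_rng(1)

# Crude Monte Carlo: estimate P(V > xi) by sampling X ~ p and V | X directly.
# Toy model (hypothetical): X ~ N(0,1), V | X ~ N(X,1), xi = 0, so P(V > xi) = 0.5.
n, n_rep, xi = 1000, 2000, 0.0
estimates = np.empty(n_rep)
for r in range(n_rep):
    x = rng.standard_normal(n)
    v = x + rng.standard_normal(n)     # V | X ~ N(X, 1)
    estimates[r] = np.mean(v > xi)

# n * MSE should be close to P(1 - P) = 0.25 for the unbiased CMC estimator.
print(n * np.mean((estimates - 0.5) ** 2))
```

Any improvement an importance sampler offers is measured against this 0.25 benchmark.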
As d increases, V_min approaches the baseline nMSE of 0.25 in Figure 3(d). This increasing inefficiency of optimal importance sampling with respect to d is peculiar to this example, because V_min in (3) depends on d only through g(V(X)). Thus, this observation should not be interpreted as a manifestation of the curse of dimensionality known in the importance sampling literature (e.g., Au and Beck, 2003), which may occur when the approximation of the optimal density q* becomes harder as d increases. In contrast, the optimal importance sampler theoretically attains a V_min of zero regardless of d for deterministic simulation models with any nonnegative function g(v) (Kahn and Marshall, 1953).
We also vary the estimand E = P(V > ξ) = 0.005, 0.05, 0.5 to examine its impact on the performance of the importance samplers, while fixing n = 8000 and d = 1. The initial sampling density q_0 is set as the uniform distribution on (−5, 5) to broadly cover the support where rare events of interest can happen. Figure 4(a) suggests that as the estimand decreases by a factor of 10 (i.e., 0.5, 0.05, 0.005), the normalized root mean squared error (i.e., √MSE/E) increases more slowly for the correct parametric and nonparametric importance samplers than for the incorrect parametric importance sampler. Figure 4(b) shows the well-known phenomenon in the literature that importance samplers save more against CMC when the estimand is the probability of a rarer event. The savings of the correct parametric and nonparametric importance samplers quickly exceed 90% as E decreases to 0.005.

Example 2: exponential-exponential data generating model
Here, we consider a data generating model where both X and V |X follow exponential distributions, which have heavier tails than normal distributions and allow analytical calculation of key objects of interest such as the estimand E, the conditional expectation r(x) = E(g²(V(X))|X = x), the optimal sampling density q*(x), and the oracle variance V_min.
Let X = (X^(1), . . . , X^(d)) be a vector of d independent exponential random variables with the identical mean 1/λ > 0, so that the natural configuration density is p(x) = λ^d exp(−λ(x^(1) + . . . + x^(d))). Given a configuration X, let V follow an exponential distribution with mean 1/(X^(1) + . . . + X^(d)). In our simulation experiment, we fix λ = 1. With the given data generating model, we can analytically calculate r(x) = exp(−ξ(x^(1) + . . . + x^(d))), which implies that q* is the joint density of d independent exponential random variables with the identical mean 1/(ξ/2 + λ). We determine ξ so that E = P(V > ξ) = 0.5 and calculate V_min accordingly. Similar to the normal-normal example, we consider two parametric models of r(x): (i) Correct model: we use r_θ(x) = e^{θ_0 + θ_1 x^(1) + . . . + θ_d x^(d)}; for least squares fitting, the initial parameters are set at the correct values. (ii) Incorrect model: we use a misspecified form of r_θ(x), with the initial parameters θ_0 = . . . = θ_d = 0 for least squares fitting. As a nonparametric model, we use the kernel regression model r_h(x) as in the normal-normal example. Figure 5(a) plots nMSE versus n for d = 1 with respect to the three importance samplers. The behaviors of the correct and incorrect parametric importance samplers echo what we see in the normal-normal example. In contrast, the nonparametric importance sampler behaves irregularly (note that nMSE in Figure 5(a) is calculated after discarding the nonparametric estimates exceeding one). Figure 5(b) shows the Tukey box plots of the nonparametric estimates over different n. The interquartile range (i.e., box height) decreases as n increases (note that the interquartile range of estimates may be comparable to the square root of MSE, but not directly to nMSE in Figure 5(a)). The magnitudes of outliers (e.g., estimates greater than one, presented as numbers at the top of Figure 5(b) for each n) suggest that the sampling distribution of the nonparametric estimator might be heavy-tailed for this example.
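To make the oracle sampler concrete, the following sketch runs a second-stage estimate analytically for d = 1 and λ = 1 with an illustrative threshold ξ = 4 (a hypothetical choice of ours): under this model P(V > ξ | X = x) = e^{−ξx}, hence E = 1/(1 + ξ), and the text gives the oracle density q* as Exp(ξ/2 + λ).

```python
import numpy as np

rng = np.random.default_rng(2)

# Exponential-exponential model with d = 1, lambda = 1:
# X ~ Exp(1), V | X ~ Exp(rate X), so P(V > xi | X = x) = exp(-xi * x)
# and E = P(V > xi) = 1/(1 + xi).
xi, n = 4.0, 50000
rate_q = xi / 2 + 1.0            # oracle density q* is Exp(xi/2 + lambda)

x = rng.exponential(1.0 / rate_q, size=n)        # second-stage draws from q*
v = rng.exponential(1.0 / x)                     # stochastic simulation output
w = np.exp(-x) / (rate_q * np.exp(-rate_q * x))  # importance weights p(x)/q*(x)
E_hat = np.mean((v > xi) * w)

print(E_hat, 1.0 / (1.0 + xi))  # importance sampling estimate vs. true value
```

The unbiasedness of the weighted indicator average holds for any sampling density whose support covers that of p; the oracle choice only minimizes the variance.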
We calculate effective sample sizes (ESSs) as diagnostic tools of importance sampler performance, where an ESS too small compared to n is interpreted as a sign that the importance sampling is unreliable. An ESS widely used in the importance sampling literature (Kong, 1992; Elvira et al., 2018) is ESS = (Σ_{i=1}^n w_i)² / Σ_{i=1}^n w_i², where w_i is the likelihood ratio, or importance weight, for X_i. In the first stage (i = 1, . . . , m), both parametric and nonparametric importance samplers have w_i = p(X_i)/q_0(X_i). In the second stage (i = m + 1, . . . , n), the parametric importance sampler has w_i = p(X_i)/q*_{θ_m}(X_i) and the nonparametric importance sampler has w_i = p(X_i)/q*_h(X_i). Because ESS does not reflect the integrand g(V), the following ESS specific to g(V) can be a better diagnostic of importance samplers, as demonstrated for deterministic simulation models (where V(X) is a deterministic function of X) (Evans and Swartz, 1995; Owen, 2018): ESS(g) = (Σ_{i=1}^n w_i(g))² / Σ_{i=1}^n w_i(g)², where w_i(g) = |g(V_i)| w_i. For stochastic simulation models, as illustrated in Figure 6(a), a very small ESS (close to one) does not necessarily indicate a poor estimate because correct parametric importance samplers always yield good estimates of E. On the other hand, ESS(g) is shown to be an effective diagnostic; in Figure 6(b), it is confirmed that when ESS(g) is close to one, the corresponding estimates of nonparametric importance samplers tend to be far from the estimand E.
Fig 6. For the exponential-exponential data generating model, we compare the three importance samplers in terms of their effective sample size, ESS, in (13) and g-specific ESS, ESS(g), in (14), when d = 1 and n = 1000. The box plots are created based on 10,000 replications. The nonparametric importance sampler yields ESS(g) close to one, explaining the erratic nMSE observed in Figure 5(a).
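Both diagnostics are direct functions of the weights; a minimal sketch with hypothetical weight vectors for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

def ess(w):
    """Effective sample size of Kong (1992): (sum w)^2 / sum(w^2)."""
    return np.sum(w) ** 2 / np.sum(w ** 2)

def ess_g(g_vals, w):
    """g-specific effective sample size with weights w_i(g) = |g(V_i)| * w_i."""
    wg = np.abs(g_vals) * w
    return np.sum(wg) ** 2 / np.sum(wg ** 2)

# Hypothetical weights: near-uniform weights give ESS close to n, ...
w = rng.uniform(0.9, 1.1, size=1000)
print(ess(w))          # close to 1000

# ... while one dominant weight drives ESS toward 1.
w_bad = np.ones(1000)
w_bad[0] = 1e6
print(ess(w_bad))      # close to 1
```

When g is an indicator, ess_g simply recomputes the same ratio over the weights of the samples that fall in the event of interest.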
The inspection of the g-specific ESS suggests that importance weights in the tail of the nonparametric importance sampling density q*_h(X_i) can be unstable. Especially because kernel regression can estimate the tail part poorly, tail-controlling techniques (Owen and Zhou, 2000; Li et al., 2013) can be helpful to address the issue.
We attribute the erratic nMSE of the nonparametric importance sampler in Figure 5(a) to the severe violation of the assumption, made as a basis of our theoretical analysis in Section 3, that the natural configuration density p has a compact support K ⊂ R^d. In this example, the tail of p(x) ~ exp(−x) decays even more slowly than the tail of p(x) ~ exp(−x²) in the normal-normal example. The simulation experiment results for d = 2 and 4 are not presented here, as they repeat the same pattern as for d = 1. We note that the natural configuration density p in the case study in Section 4.3 has a compact support, indicating that the assumption is still practical.

Case study
Wind energy is one of the fastest growing renewable energy sources (You et al., 2017). Yet, harvesting wind energy remains expensive compared with fossil energy sources such as oil, coal, and natural gas, due to the high capital cost of installing wind turbines. A utility-scale wind turbine, whose blade diameter commonly exceeds 100 ft, typically costs more than one million U.S. dollars. Therefore, the wind energy industry pays the utmost attention to ensuring the structural reliability of wind turbines to prevent their failure (e.g., Moriarty, 2008; Graf et al., 2017).
At the design stage of a wind turbine, evaluating its reliability based on physical experiments is severely limited due to the associated costs. Alternatively, the international standard IEC 61400-1 (International Electrotechnical Commission, 2005) requires wind turbine manufacturers to use stochastic simulation models. For this purpose, the most widely used simulation models in the U.S. include TurbSim (Jonkman, 2009) and FAST (Jonkman and Buhl Jr., 2005), which are developed and maintained by the National Renewable Energy Laboratory (NREL) of the U.S. Department of Energy. TurbSim simulates a 3-dimensional time-marching wind profile, which becomes an input to FAST that, in turn, simulates a wind turbine's structural response to the wind. This case study focuses on two types of bending moments at the root of a turbine blade, namely, edgewise and flapwise bending moments, which represent two perpendicular structural responses of the blade root due to an external force or moment causing it to bend. We use the same benchmark turbine model and simulator setup as Moriarty (2008) (see the references for the rest of the settings not provided below).
In this case study, the input configuration X is a 10-min average wind speed (unit: meter per second, m/s), which is fed into TurbSim. X is sampled from the Rayleigh distribution truncated to the support [3, 25], with the scale parameter 10√(2/π). The simulation output of interest, V, is the 10-min maximum bending moment (unit: kilonewton meter, kNm) at the blade root, which is produced by FAST based on the simulated wind from TurbSim. Because V is random even for a fixed X due to the randomness of the wind profile generated in TurbSim, we regard TurbSim and FAST together as a black-box stochastic simulation model.
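Sampling X from the truncated Rayleigh distribution can be done by inverting the CDF restricted to the support; a sketch under the stated scale and support (the function names are ours):

```python
import numpy as np

rng = np.random.default_rng(4)

# Truncated Rayleigh input for the wind-speed configuration:
# scale sigma = 10 * sqrt(2/pi) (so the untruncated mean is 10 m/s), support [3, 25].
sigma = 10.0 * np.sqrt(2.0 / np.pi)
lo, hi = 3.0, 25.0

def rayleigh_cdf(x, s):
    return 1.0 - np.exp(-x**2 / (2.0 * s**2))

def sample_truncated_rayleigh(size):
    # Invert the CDF restricted to [F(lo), F(hi)].
    u = rng.uniform(rayleigh_cdf(lo, sigma), rayleigh_cdf(hi, sigma), size=size)
    return sigma * np.sqrt(-2.0 * np.log(1.0 - u))

x = sample_truncated_rayleigh(10000)
print(x.min(), x.max())  # all samples fall inside [3, 25]
```

The same inverse-CDF draw also serves as the envelope-free way to sample the natural density p in this application, since p is known in closed form.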
To compare the nonparametric importance sampler proposed in this paper with a parametric importance sampler, we take the parametric method in Choe et al. (2015) as a benchmark, which also approximates the optimal sampling density q*(x) ∝ √(r(x)) · p(x) in Lemma 1. We use the same simulation experiment setup as therein: for the edgewise bending moment, we use n = 3600, m = 600, and ξ = 9300 kNm; for the flapwise bending moment, we use n = 9600, m = 600, and ξ = 14300 kNm.
To model r(x) using a pilot sample, X is sampled m = 600 times from a uniform distribution with the support of [3,25], and the corresponding V 's are generated from the NREL simulators. To build the parametric model of r(x), the generalized additive model for location, scale and shape (GAMLSS) (Rigby and Stasinopoulos, 2005) is fitted to the pilot sample. Specifically, the GAMLSS model assumes that the conditional distribution of V given X is a generalized extreme value distribution whose location and scale parameters are cubic spline functions of X while the shape parameter is constant over X. The model parameters are estimated using the backfitting algorithm (Rigby and Stasinopoulos, 2005). To build the nonparametric model of r(x), we fit the kernel regression model to the pilot sample with the Gaussian kernel and choose the smoothing bandwidth by cross-validation.
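A minimal sketch of the kernel regression step with a leave-one-out cross-validated bandwidth (Nadaraya-Watson form assumed; the data below are synthetic stand-ins, not the pilot sample):

```python
import numpy as np

rng = np.random.default_rng(5)

def nw_predict(x0, x, y, h):
    """Nadaraya-Watson kernel regression with a Gaussian kernel."""
    k = np.exp(-0.5 * ((x0[:, None] - x[None, :]) / h) ** 2)
    return k @ y / k.sum(axis=1)

def loo_cv_bandwidth(x, y, grid):
    """Pick h on a grid minimizing leave-one-out squared prediction error."""
    best_h, best_err = None, np.inf
    for h in grid:
        err = 0.0
        for i in range(len(x)):
            mask = np.arange(len(x)) != i
            pred = nw_predict(x[i:i + 1], x[mask], y[mask], h)[0]
            err += (y[i] - pred) ** 2
        if err < best_err:
            best_h, best_err = h, err
    return best_h

# Synthetic pilot sample (hypothetical): noisy observations of a smooth curve.
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(200)
h = loo_cv_bandwidth(x, y, [0.02, 0.05, 0.1, 0.3])
print(h)
```

In practice the responses would be the pilot-sample outputs used to estimate r(x), and the fitted curve would then define the nonparametric sampling density.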
For both parametric and nonparametric importance samplers, we repeat estimating the failure probability E = P (V > ξ) 50 times (in contrast to 10,000 times in the numerical study in Section 4.2). Recall that running the NREL simulators once takes about 1-min wall-clock time, implying that obtaining the pilot sample of size m = 600 takes roughly 10 hours. We use the same pilot sample in all 50 replications, because repeating 50 times of the simulation experiment with n − m = 3000 or 9000 itself requires several days of computation even if we use high-performance computing described in Section 4.2.
The parametric importance sampler in Choe et al. (2015) uses a failure probability estimator in which the pilot sample is not used, in contrast to the estimator in (5); in their procedure, the pilot sample is only used to build the model of r(x). For a fair comparison, we report the estimation results both with and without using the pilot sample in the estimator. Note that nonparametric importance samplers for stochastic simulation models have not been reported in the literature. Table 1 summarizes the simulation experiment results for edgewise bending moments. We see that the standard errors of the parametric importance samplers and those of the nonparametric importance samplers are not significantly different. Computational savings of both methods against CMC are remarkable and, at the same time, comparable with each other. Table 2 shows the estimation results for flapwise bending moments, which convey a similar message to the results in Table 1. Note that ξ is set to yield roughly the same failure probability E = P(V > ξ) of 0.01 for both structural load types, because the magnitude of E tends to impact the computational saving. Yet, we see that the computational saving of importance sampling over CMC for flapwise bending moments is, albeit substantial, not as large as that for edgewise bending moments. This is because the natural configuration density p is not very different from the optimal sampling density q* for flapwise bending moments, so the benefit of changing the sampling density is not enormous.
Note: The computational (comp.) saving is (n^(CMC) − n)/n^(CMC), where n^(CMC) = P(1 − P)/(Standard error)² is the theoretical sample size required for CMC to attain the same standard error when the true failure probability is equal to P, which is the sample mean of the parametric importance samplers using the pilot sample. The 95% bootstrap confidence interval (CI) is constructed based on 100,000 bootstrap replicates.
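The computational-saving formula in the table note can be evaluated directly; the numbers below are illustrative choices of ours, not values from Tables 1-2:

```python
# Computational saving of an importance sampler relative to crude Monte Carlo:
# n_CMC = P(1 - P) / SE^2 is the CMC sample size needed to match standard error SE.
# Illustrative inputs (hypothetical): P = 0.01, SE = 5e-4, IS sample size n = 3600.
P, se, n = 0.01, 5e-4, 3600
n_cmc = P * (1 - P) / se**2
saving = (n_cmc - n) / n_cmc
print(round(n_cmc), round(saving, 3))  # 39600 and about 0.909
```

In words: matching a 5e-4 standard error on a 1% failure probability would take about 39,600 CMC runs, so an importance sampler achieving it with 3,600 runs saves roughly 91% of the simulation budget.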
This case study considers a relatively moderate failure probability E = P(V > ξ) around 0.01, so we can use the given computational resources to run 50 replications, which allow us to construct bootstrap confidence intervals on standard errors and compare the performances of the methods. On the other hand, simulation experiments at the scale of a national lab may consider a more extreme event with a smaller probability. In such a case, combining the 50 replications can lead to an estimate of a much smaller probability (on the order of 1e-05 or smaller, as in Choe et al. (2016)). Still, one can construct a confidence interval on the probability using only the sample from the second stage.

Discussion
We consider the problem of estimating the average output from a stochastic simulation model and propose two-stage estimators using either a parametric or a nonparametric approach. Theoretically, both estimators satisfy oracle inequalities, but they achieve the oracle variance asymptotically under different rates and assumptions: the parametric approach needs a strong assumption but its variance converges to the oracle variance faster, whereas the nonparametric approach requires weaker assumptions but its variance reduction rate is not as fast. Empirically, our numerical study confirmed the theoretical results, and our case study indicated that the proposed importance samplers perform well in practice, saving 50%-95% of computational resources over a standard Monte Carlo estimator.
We note that Choe et al. (2015) investigated a parametric importance sampler for a stochastic simulation model, which is a special case (with g(v) = 1(v ≥ ξ)) of our parametric importance sampler. They implicitly use a two-stage method with the pilot sample given a priori for modeling r(x) and do not use the pilot sample in the estimator. They neither consider nonparametric importance samplers nor perform a theoretical analysis of the convergence toward the optimal estimator. Our work focuses more on the formalization of the general two-stage strategy and the theoretical foundation of importance sampling for stochastic simulation models. Because we obtain a concrete convergence rate of the estimator, we enable the optimal choice of the pilot sample size.
In what follows we discuss possible future research directions.
• Manifold support case. In general, when the dimension d of the configuration is large, the nonparametric importance sampler in Section 2.2 will not work due to the curse of dimensionality (the slow convergence rate). However, if the support of q* is concentrated around a lower dimensional manifold, the nonparametric sampler may still work because a fast convergence rate of a nonparametric estimator is possible (Balakrishnan et al., 2013; Chen, 2016). So we may be able to design a modified nonparametric importance sampling procedure that achieves the oracle variance much faster than the rate in Corollary 8. The construction of such a procedure is left as future work.
• Multi-stage sampling. In this paper we only consider splitting the computational budget into two stages. We can generalize this idea into a k-stage sampling procedure, where at each stage, we use the samples from all previous stages to design the estimator and sampler for the current stage (e.g., Choe, 2017). In this case, the allocation problem becomes more complicated, since we may assign different sample sizes to different stages. Also, the number of stages k will be another quantity that we want to optimize. Because the two-stage approach is a special case of a multi-stage sampling procedure, the latter can attain a variance reduction rate at least as high as that of the proposed methods in this paper.
• Confidence interval. In the current paper, we focus on the construction of an estimator of E. In practice, we often report not only a point estimate but also a confidence interval attached to it. Here we briefly describe two potential methods for constructing the confidence interval. The first method is to derive the asymptotic normality of the estimator of E and then find a consistent variance estimator. Note that this is a non-trivial task because when we use a two-stage approach, the observations are no longer IID. Moreover, estimating the variance could be another challenging task.
The other approach is to use the bootstrap (Efron, 1982, 1992) to obtain a confidence interval. If we choose to use the bootstrap, we need to prove the validity of such a bootstrap procedure.
• Multiple estimands of interest. This paper considers a single quantity to be estimated, E = E(g(V(X))), in (1) based on the output V(X) from a stochastic simulation model. Extending the current work to optimally estimating multiple quantities of interest warrants further investigation. Such an extension for stochastic simulation models can build upon the corresponding line of research around deterministic simulation models. For example, we can consider importance sampling of the union of target events (Owen et al., 2019). Also, we can recycle existing simulation results (e.g., from the pilot stage or both stages of estimating another quantity) by recomputing importance weights (Cornuet et al., 2012). These approaches can benefit from the classical idea of safe and effective importance sampling (Owen and Zhou, 2000).
Proof of Lemma 1. By the variance formula Var(Y) = E(Var(Y|X)) + Var(E(Y|X)), applied with Y = g(V_i) p(X_i)/q(X_i) and X = X_i, we obtain the decomposition in (16) whenever q(x) = 0 implies r†(x)p(x) = 0. Note that X* is from the density p. Thus, the sampling density q affects the variance only via the quantity E(r(X_i) p²(X_i)/q²(X_i)). This quantity has the lower bound (17) from the Cauchy-Schwarz inequality, and the equality holds when √(r(x)) p(x)/√(q(x)) ∝ √(q(x)), which implies that the optimal sampling density is q*(x) ∝ √(r(x)) · p(x).
Thus, when we choose the sampling density to be q*, by equations (16) and (17), the variance attains the minimum V_min.
Proof of Theorem 2. By assumptions (P1-2) and the M-estimation theory (van der Vaart and Wellner, 1996; van der Vaart, 2000), the estimator θ_m is √m-consistent for θ_0, where θ_0 is the parameter such that r(x) = r_{θ_0}(x). Applying assumption (P3) then proves the result.
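As a numerical sanity check of Lemma 1 in the exponential-exponential setting (d = 1, λ = 1, with an illustrative ξ = 2 chosen by us), the per-sample variance under the oracle density Exp(ξ/2 + λ) stated in Section 4 can be compared against sampling from p itself:

```python
import numpy as np

rng = np.random.default_rng(6)

# Exponential-exponential model: X ~ Exp(1), V | X ~ Exp(rate X), g = 1(V > xi).
# We verify empirically that sampling from the oracle density Exp(xi/2 + lambda)
# yields a smaller per-sample variance than sampling from p (crude Monte Carlo).
xi, n = 2.0, 200000

def per_sample_variance(rate_q):
    x = rng.exponential(1.0 / rate_q, size=n)        # draws from q = Exp(rate_q)
    v = rng.exponential(1.0 / x)                     # stochastic simulation output
    w = np.exp(-x) / (rate_q * np.exp(-rate_q * x))  # likelihood ratio p(x)/q(x)
    z = (v > xi) * w
    return z.var()

var_opt = per_sample_variance(xi / 2 + 1.0)  # oracle q*
var_cmc = per_sample_variance(1.0)           # q = p, i.e., crude Monte Carlo
print(var_opt, var_cmc)  # the first should be clearly smaller
```

For these parameters the crude Monte Carlo per-sample variance is E(1 − E) = 2/9, while the oracle density reduces it noticeably, consistent with the lemma.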
Proof of Theorem 3. For our estimator E_{θ_m}, we decompose it into the first-stage part and the second-stage part, and we bound the variance of each part and the covariance between them. Let D_m = {(X_1, V_1), . . . , (X_m, V_m)} be the collection of the first-stage data. Conditioning on D_m bounds the covariance between the two parts, so we only need to focus on the variance of each part.
Let V_min = E²(√(r(X*))) − E²(r†(X*)) be the minimal variance under the optimal sampling density. By Lemma 1,