Stochastic simulation under input uncertainty_ A Review

Stochastic simulation is an invaluable tool for operations-research practitioners for the performance evaluation of systems with random behavior and mathematically intractable performance measures. An important step in the development of a simulation model is input modeling, which is the selection of appropriate probability models that characterize the stochastic behavior of the system inputs. For example, in a queueing-system simulation, input modeling includes choosing the probability distributions for stochastic interarrival and service times. The lack of knowledge about the true input models is an important practical challenge. The impact of the lack of information about the true input model on the simulation output is referred to as ‘input uncertainty’ in the simulation literature. Ignoring input uncertainty often leads to poor estimates of the system performance, especially when there is limited amount of historical data to make inference on the input models. Therefore, it is critically important to assess the impact of input uncertainty on the estimated performance measures in a statistically valid and computationally efficient way. The goal of this paper is to present input uncertainty research in stochastic simulations by providing a classification of major research streams and focusing on the new developments in recent years. We also review application papers that investigate the value of representing input uncertainty in the simulation of real-world stochastic systems in various industries. We provide a self-contained presentation of the major research streams with a special attention on the new developments in the last couple of


Introduction
Simulation is an invaluable tool for practitioners for the analysis of complex systems and processes. It is the method of choice when realworld experiments on a system are expensive to conduct or no analytical formula is available to make inferences about the system under study. A major component of a simulation study is input modeling, which is the selection of probability distributions (i.e., input models) that characterize the stochastic behaviors of the random input variates in the system. For example, when simulating a queueing system, customer arrival times and service times are usually random and need to be represented with probability distributions. Similarly, in an inventory system simulation, customer demands and lead times can be modeled as random variables.
After the identification of the probability distributions and their parameters to represent such random input variables used to driven the simulation experiments, the next step in a simulation study is the generation of sample paths, where one generates random variates from the selected distributions to run the simulation. In the aforementioned queueing simulation for example, the customer arrival times and service completion times are generated. The final step of the simulation study is the output analysis and this is where one collects and analyzes the simulation output data to calculate the performance measures of the system under study. What is important to recognize here is that the simulation output data, and hence the performance measures, are directly affected by the probability distributions (i.e., input models) from which the sample paths are generated in simulation.
In practice, the typical approach is to select the "best" input model using goodness-of-fit tests, such as Anderson-Darling test or Kolmogorov-Smirnov test, and use the "best" distribution to execute the simulation [1]. This "best" distribution and its parameters, however, are often estimated from a limited set of historical data points and thus may be far from the "true" model and "true" parameters. As the number of historical data points approaches infinity, the selected probability distributions and their parameters converge to their counterparts; however, there can be cases where the simulation practitioner does not have access to large amounts of data in many applications.
The inherent variability of the simulation output that is due to the stochastic nature of the system under study is referred to as stochastic uncertainty; e.g., the variation of waiting times of customers in a queueing service system. It is well known in practice how to address this type of uncertainty in the form of certain measures, e.g., mean, variance and percentile of the performance measure can be estimated by simulation experiments with finite run length and replications [2]. In this paper, we focus on the uncertainty in the simulation output that is attributable to the lack of knowledge on the true input models, which is referred to as input uncertainty. When a parametric function is assumed to be known as the input model and the parameters are estimated from the historical data, input uncertainty takes the form of parameter uncertainty. For the cases where the input model itself is unknown, it may be referred to as input-model uncertainty. Input uncertainty and inputmodel uncertainty are often used interchangeably in the literature. One way to reduce the stochastic uncertainty is to run more simulation replications, which is possible with more computing power. However, input uncertainty cannot be decreased with additional simulation replications. In fact, the only way to reduce it is to collect more input data, which could be very expensive or even impossible in some situations. Therefore, the quantification of input uncertainty in stochastic simulations has traditionally been the focus of simulation literature.
The negative impacts of ignoring input uncertainty have been demonstrated in a variety of settings. For example, among others, [3] shows the danger of ignoring input uncertainty when running a queueing simulation [4]. show the consequences of ignoring input uncertainty by only using the best-fit distributions for estimating service levels in an inventory simulation. In the following, we illustrate the concept of input uncertainty through a simple M/M/1 queueing system (Section 1.1) and an inventory system operating under an (s, S)-policy (Section 1.2). More specifically, in Section 1.1 we assume that the probability distributions for modeling the service time and the interarrival time are given but the distribution parameters are unknown and estimated from a limited amount of historical data. Hence, we demonstrate the impact of parameter uncertainty in the estimation of a performance measure by using simulation. We further demonstrate how the input uncertainty can lead to an incorrect estimation of the optimal solution in service-level optimization. In Section 1.2, we assume that the input probability distribution itself is unknown and we further investigate the impact of input-model uncertainty on evaluating the performance of a given inventory policy. Our choice of relatively simple examples such as the M/M/1 queuing system and the (s, S) inventory system is on purpose. We show that even in these simple systems, the impact of input uncertainty can be significant. As the system under study becomes more complex (i.e., requiring a larger number of probability distributions as input models and a more complicated simulation logic), it is natural to expect that the impact of input uncertainty will be even higher.

Example: Simulation of an M/M/1 queueing system
Let F A and F S denote the probability distributions of the time between job arrivals and the service time of a service center, respectively, and suppose that they are exponential distributions with arrival rate per time unit and service rate = 1.5 per time unit. We consider the expected time a customer spends in the system (in the steady state) as the performance measure, and denote it by w(λ, θ). It is well known from queueing theory that = w ( , ) 1/( ). That is, the true value of this performance measure is equal to 1 time unit at the values of λ and θ specified above.
The practical challenge is that the parameters λ and θ are unknown. Suppose that they are estimated from m past realizations of the interarrival and service times via maximum likelihood method. Because m is finite (and can be quite small in many real-life situations), the point estimates of λ and θ will have some random error. To illustrate the impact of this error in the output performance measure, we take a frequentist approach and consider 1000 independent experiments consisting of generating a new set of m interarrival and service times (at the true parameters λ and θ). Specifically, we first generate m realworld data, find the maximum likelihood estimates ^a nd ^o f unknown parameters, and then estimate the mean performance measure by using simulation with input parameters ^a nd ^. Fig. 1 provides the histogram of the calculated performance measure w (^,^) at different lengths of historical data, given by m ∈ {25, 50, 100}. It is worth noting that using the closed-form formula w (^,^) corresponds to performing a simulation of infinite length at the input parameters ^a nd ^( i.e., running the simulation for a long time until no stochastic uncertainty remains in the estimated value of the performance measure. This allows us to make two key observations from Fig. 1: (1) Even if a large amount of computing effort is put into simulation to improve the estimation of a performance measure, the performance-measure estimate is subject to considerable amount of variability due to the uncertainty in the point estimators of the input-distribution parameters λ and θ. This can only be reduced by collecting more real-world data for input modeling. In fact, as the number of historical data points increases from = m 25 to = m 100, we observe that the uncertainty around the mean performance measure diminishes and converges to its true value of 1. (2) If the input uncertainty is not accounted for properly, more simulation effort may lead to even poor confidence intervals having lower coverage. For instance, Fig. 1 (top) illustrates a 95% confidence interval 1.16 ± 0.21 (solid black horizontal line) for the performance measure w(0.5, 1.5) by using the output data from only 30 independent simulation replications obtained under the input parameters = 0.59 and = 1.57. Notice that this confidence interval includes the true value of the performance measure = w (0.5, 1.5) 1, which is illustrated as the dashed vertical line in Fig. 1. Since the half-width of the confidence interval decreases with more simulation replications, the confidence interval begins to exclude the true value of the performance measure as the number of simulation replications increases; e.g., see the adjusted confidence interval when the number of simulation replications increases from 30 to 100 (solid red horizontal line). This example shows that a confidence interval that does not explicitly account for input uncertainty can have poor coverage of the true performance measure, and it becomes even worse as more simulation is performed. Fig. 1 has shown that input uncertainty can lead to a poor estimate of a performance measure when simulation is used for performance evaluation. Input uncertainty may also cause a poor estimate of the optimal solution within the context of performance optimization. We illustrate this in an optimization problem based on the simple queuing system described above. Suppose that the service rate θ is a decision variable with the objective of minimizing the long-run average cost per time unit, while the arrival rate is still an unknown input parameter for the simulation model. Specifically, we consider that c 1 is the cost rate associated with increasing service rate one unit and c 2 is the cost charged per time unit for a service request waiting in the queue. It follows from queuing theory that the objective function is given by Fig. 2 shows the impact of input uncertainty on the optimal solution of the service-rate optimization problem described above for = c 5 1 and = c 1 2 . If the arrival rate λ was known (i.e., no input uncertainty), then the optimal service rate could be obtained as = argmin c * ( ; ), which is equal to 0.86 in Fig. 2 (left) as shown as a dashed line. In this example, we consider that the decision maker obtains ^b y using 20 past realizations of the inter-arrival times, and solves for = argmin ĉ ( ;^). Fig. 2 (left) plots the histogram of ^o btained from 1000 independent experiments. The variability in this histogram can be interpreted as the impact of input uncertainty on the optimal solution. Notice that the decision maker uses the exact formula for the objective function; if such a formula was not available and simulation had to be used for performance evaluation, one could expect the variability would be even higher, showing the importance of accounting for input uncertainty in simulation-based optimization. Fig. 2 ( *, ))/ ( *, ) 100%. This shows that the solution which is considered as "optimal" may lead to substantially poor performance if input uncertainty is ignored.

Example: Simulation of an inventory system
Consider an (s, S) inventory policy with full backlogging. At the end of each period, the inventory position is calculated and, if it is below s, an order is placed at cost K to get the inventory position back up to S. The holding cost h is charged for each inventory unit carried to the next period and the purchasing cost c is charged for each ordered unit. Every order is delivered ℓ periods later. The demand in each period is independent and identically distributed; however, its probability distribution F is unknown in practice, leading to the problem of input uncertainty. We consider the steady-state expected stock-out rate (i.e., the fraction of the demand not supplied from stock on-hand) as the performance measure and denote it with w(F). The objective of the simulation is to estimate w(F) under a specific (s, S) policy when F is unknown.
Suppose that the length of the historical demand data is = m 20. Different from Section 1.1, we consider a nonparametric Bayesian input model to avoid making an assumption on the parametric form of F and also to be able to incorporate any prior information. Specifically, we capture the uncertainty in F with a Dirichlet process with a normal prior (using historical mean and variance as the distribution parameters) and a concentration parameter equal to 5 (i.e., this specific value of the parameter implies 80% weight is given to empirical distribution and 20% weight is given to prior in the base measure of the Dirichlet process). The Dirichlet process can be regarded as a distribution over all the possible demand distribution functions [5].
Let the true demand distribution be normal with mean 100 and standard deviation 30,    (80,130). Fig. 3 illustrates how the uncertainty in the input model F propagates to the uncertainty in the performance measure w(F) by plotting the histogram of the performance measure estimates from 1000 independent simulations, where each simulation is run with a randomly sampled distribution from the posterior distribution of F. Specifically, Fig. 3 (left) does this when there is no stochastic uncertainty (this is achieved by running the simulation sufficiently long such that running the simulation further does not significantly affect the result anymore; i.e., 4000 time periods in this case). On the other hand, Fig. 3 (right) illustrates the more common situation with the presence of stochastic uncertainty by running the simulation 400 periods after a 100-period warm-up period (because simulation can be computationally intensive in practice, making it impossible to completely eliminate the sampling error due to simulation). The solid horizontal lines in Fig. 3 illustrate 95% Bayesian credible intervals for the mean performance measure w(F). The one on the left is also referred to as the perfect-fidelity credible interval as the simulation perfectly determines the performance measure at a given input model (6).
Different from Section 1.1, adapting a Bayesian approach treats the performance measure w(F) as a random variable, and the probability distribution of this random variable provides a direct characterization of input uncertainty and stochastic uncertainty. We observe that perfect-fidelity credible intervals are smaller than the 95% Bayesian credible intervals. This is because the former has no stochastic uncertainty while the latter has both stochastic uncertainty and input uncertainty. It is of practical importance to be able to quantify the contributions of stochastic uncertainty and input uncertainty to a credible interval in a computationally efficient way, and thus a significant amount of research has focused on this problem [6,7].

Overview and organization of the paper
The topic of input uncertainty in stochastic simulations has gained growing interest in recent years. Early review papers include [8], [9], and [10], while more recent ones are [3], [11], [12], and [13]. As a rapidly growing research area, there is already a significant amount of new developments after these reviews have been published. In our paper, we aim to provide a self-contained presentation of the major research streams with a special attention on the new developments in the last couple of years. Furthermore, to the best of our knowledge, there is no state-of-an-art survey that explicitly discusses the operations research applications of the methods that address input uncertainty in stochastic simulations. Our review paper aims to fill this gap in the literature by providing an extensive review of the (i) methodological papers that model and analyze input uncertainty with a focus on recent trends, and more importantly (ii) application papers that explicitly model and address input uncertainty in the simulation of real-life stochastic systems in various industries. Finally, we also discuss the relationship between the input uncertainty literature and the fast-growing field of data science.
We organize the remainder of the paper as follows. Section 2 introduces a taxonomy of the simulation methods under input uncertainty. Section 3 presents major input uncertainty quantification methods. Section 4 discusses sensitivity analysis methods, which can provide guidance on most informative data collection to efficiently reduce input uncertainty. In Section 5, we discuss simulation optimization research under input uncertainty. Section 6 presents simulation applications in various industries under input uncertainty. We conclude in Section 7 with a discussion on the relationship between the fastgrowing area of data science and input uncertainty literature and directions for future research.

Taxonomy of the simulation methods under input uncertainty
where ω is an outcome from the probability space ( , ), and F is the collection of L input distributions. The sample space Ω captures the possible realizations of random elements in the simulated system. One might think ω ∈ Ω as the realization of a sequence of independent and identically distributed standard uniform random variables, which are used to draw random variates from the input distributions and then to generate sample paths of the variables of interest.
The simulation output from a single replication can be written as is the performance measure of interest and ϵ(F, ω) is a mean-zero, finite-variance random variable representing the simulation error. For notational convenience, we drop the dependence of simulation output on ω, and we append a subscript j to it to indicate the jth independent simulation replication as in As it is seen in this representation, the simulation output data (and hence the performance measure of interest) is a function of the unknown input models F used in the simulation. The estimation of these input models from limited amount of historical data has an impact on the simulation output data, which is referred to as input uncertainty. In the remainder of this section, we provide a taxonomy of the methods that quantify input uncertainty in simulation output data. Section 2.1 characterizes the literature based on whether the input uncertainty is viewed in accordance with the notions of frequentist or Bayesian statistics. Section 2.2, on the other hand, surveys the related literature based on whether the system response of interest is directly estimated by simulation runs or by a metamodel. Finally, Section 2.3 provides a survey of input uncertainty quantification methods when the input models are not necessarily parametric or independent of each other.

Representation of input uncertainty
The methods that quantify input uncertainty in simulation output data can be divided into frequentist and Bayesian approaches based on how the uncertainty around unknown input models and parameters is modeled. Let m l denote the number of independent and identically distributed real-world observations = … z z z ( , , ) l m 1 l from the lth input distribution. The collection of all real-world input data is denoted by . Both the frequentist and Bayesian approaches make use of the available real-world data, while the latter can also incorporate any subjective expert opinion on the input models in the quantification of the input uncertainty. For ease in presentation, we assume that F is a collection of L parametric distribution functions … F F ( , , ) is the collection of all the inputdistribution parameters. In general, input distributions can take a variety of forms (e.g., multivariate or nonparametric); Section 2.3 extends the survey of input uncertainty quantification methods to these cases.

Frequentist approach
Frequentist approaches assume that unknown input-model parameters, denoted by θ c , are estimated from the past realizations of the input random variables by using a point estimator ^, which is a function of real world data z. The objective of the simulation output data analysis is to obtain the confidence interval specified by bounds q L and q U such that ; it is common to refer to [q L , q U ] as the (1 )100% confidence interval (CI). The simulation response μ( · ) is estimated from noisy simulation output data; see Section 2.2 for more details. Since the real-world data is one particular realization of many possible realizations from the true data-generation mechanism, the uncertainty about ^i s quantified by its sampling distribution. Then, the CI that accounts for the input uncertainty can be identified by inverting the sampling distribution of μ (^), where μ ( ) is the point estimate of the system response at any fixed θ. There are two potential limitations when we use frequentist approaches [6]. First, it is often very difficult to obtain the exact sampling distribution of ^q uantifying the parameter uncertainty. Instead, one may resort to asymptotic approximation methods, such as normal approximation and bootstrapping. However, the validity of these approximations can be hard to justify in practice when there is only a limited amount of real-world input data. Second, it is not possible for the frequentist approaches to take any prior information about the input models into account. Table 1 summarizes some of the papers in the literature that take a frequentist approach to quantify input uncertainty in the simulation output. The table further categorizes the papers based on the characteristics of the input models and how the estimation error in the input models is propagated to the simulation output uncertainty. These will be discussed in Section 2.2 and Section 2.3.

Bayesian approach
Bayesian approaches quantify our belief of unknown input-model parameters θ with a random vector Θ. Before the collection of any realworld input data, the expert opinion on the input parameters is captured by its prior distribution π Θ (θ). This prior distribution is updated after the observation of the real-world input data z via the Bayes' rule to obtain the posterior distribution π Θ (θ|z) of the input paramaters. One might think the posterior distribution of Θ as being similar to the sampling distribution of ^i n the frequentist approach in that both of them capture the uncertainty in the unknown input-distribution parameters. However, there are two important distinctions from the frequentist approach. First, the prior distribution provides a convenient way to incorporate expert opinion on the input models. Second, there is no need to rely on the assumption regarding the availability of large amount of real-world input data. However, the derivation of the exact posterior distribution π Θ (θ|z) is often very difficult. This is why it is often necessary to resort to analytical approximations by using variational inference methods [33] or computational approximations by using Markov chain Monte Carlo (MCMC) methods [34] to obtain the posterior distribution π Θ (θ|z).
As the computational budget gets larger, theoretically, MCMC can provide exact posterior samples of the input parameters. However, to the best of our knowledge, there is no rigorous way to quantify the approximation error introduced by the analytical approximations in general.
If the system response (e.g., mean and quantile) function μ( · ) is known, the effect of the input uncertainty could be characterized by the induced posterior distribution of μ(Θ) with Θ ~ π Θ (θ|z). More specifically, we could construct a (1 )100% credible interval (CrI), denoted by [c L , c U ], which contains 1 of the probability content: as the perfect fidelity credible interval because it is the credible interval that does not possess any additional uncertainty due to simulation uncertainty and only reflects the uncertainty that is attributable to the input uncertainty.
Since the system response μ( · ) is unknown in practice, it is first necessary to estimate it from noisy simulation experiments. Section 2.2 reviews the methods to estimate μ( · ) under uncertainty around the input-distribution parameters. Table 2 presents a sample of representative papers that take a Bayesian approach to quantify input uncertainty in the simulation output. The table further categorizes the papers based on the characteristics of the input models and how the input uncertainty is propagated to the simulation output data. These will be discussed in Section 2.2 and Section 2.3.

Propagation of input uncertainty to simulation output data
In this section, we focus on how the system response surface μ( · ) is estimated. In particular, Section 2.2.1 reviews the approaches that directly estimate μ( · ) from independent simulation experiments, while Section 2.2.2 reviews the approaches that first identify a functional relationship between the input-distribution parameters and the system response.

Direct simulation
We suppose that the input uncertainty is quantified by B samples of the input models, denoted by ( ) . These samples could be generated by either frequentist or Bayesian approaches. In the direct simulation, we run simulations at each F (b) with = … b B 1, 2, , and estimate the mean response with sample mean. Specifically, at F (b) , the mean response μ(F (b) ) is estimated by the sample mean, where Y j represents the simulation output from the j-th replication and n b denotes the number of replications assigned to the sample of input model F (b) . Then, the α/2 and (1 /2)-th order statistics can be used to construct the percentile interval quantifying the overall estimation uncertainty of the mean response μ(F c ), where the order statistics , . This interval accounts for both input uncertainty and stochastic uncertainty. It is a two-sided percentile CI if the input uncertainty is quantified by a frequentist approach and it is a CrI if the input uncertainty is quantified by a Bayesian approach. Notice that we could consider both one-and two-sided intervals to quantify the system performance estimation uncertainty. Here, we use a two-sided interval for illustration.
It is important to note that in addition to the quantile-based CI or CrI provided in (1), the normality-based confidence interval is frequently used in the literature and it is indeed the basis of random effects model in [20] and other delta-effects based models that will be discussed in Section 3. Letting Var represent the variance of the simulation output data encompassing both the input uncertainty and stochastic uncertainty and z 1 /2 represent the 1 /2 percentile of the standard normal distribution, the normality-based CI can be summarized as follows:

Metamodel-Assisted simulation
For complex stochastic systems, each simulation run could be computationally expensive. To precisely construct a percentile interval quantifying the system response estimation uncertainty for μ(F c ), we often require B be large. It could be computationally prohibitive to use direct simulation to precisely estimate the system response, especially risk measures (e.g., quantiles), at all samples of input models. Thus, based on a few well-selected design points, we can construct a metamodel as an approximate response surface to efficiently propagate the estimation error in the input models to output uncertainty.
Here, we suppose that the input models F can be specified by a finite and fixed number of input parameters or moments, denoted by . By abusing the notation, we want to construct a metamodel for the response surface µ ( ). Metamodels commonly used in input uncertainty literature include the local linear approximation and the global Gaussian process metamodel. When we use the local linear approximation, the simulation output can be modeled as where denotes the vector of slope paramaters and e is the estimation error of Ȳ . Given a few well-selected design points, denoted by , and simulation output, denoted by Y , the sampling distribution or the posterior distribution of parameters can be used to quantify the metamodel uncertainty.
When we use the global Gaussian process metamodel, we suppose that the underlying true mean response surface can be thought of as a realization of a stationary Gausssian process (GP). Then, the simulation C.G. Corlu, et al. Operations Research Perspectives 7 (2020) 100162 output can be modeled as is the error due to simulation uncertainty. The covariance between W ( ) and W ( ) quantifies how the knowledge of the surface at some design points affects the prediction of the surface. A parametric form of the spatial covariance, denoted by is typically assumed where τ 2 denotes the variance and r( · ) is a correlation function that depends only on the distance . To efficiently estimate the impact of input uncertainty on the distribution of a detailed system output (e.g., waiting time of each customer in the service system), [28] introduce a distributional metamodel to simultaneously model a sequence of system percentile curves. Suppose that the underlying true percentile curve q (·) c is a realization of a GP, where … ( , , ) 1 denotes the significant levels of interest. Then, the simulation percentile estimator can be written as where µ is a constant global trend and W ( ) is a zero-mean GP modeling the spatial dependence of the percentile curve q ( ). Since percentile curves with different probabilities typically increase or decrease simultaneously, authors model the spatial dependence of percentile curves as where Z ( ) represents a zero-mean GP with variance equal to one and 2 is the spatial variance of the α ℓ -th quantile. Since the distributional metamodel can integrate the estimation over different percentiles, it can efficiently reduce the simulation estimation error.

Characteristics of input models
If the input process in a simulation model is univariate, stationary, independent and identically distributed (IID) and we have data, then a "simple" input model may suffice. We refer the reader to [49] for a review of such input models. In fact, almost all simulation input-modeling software assume that the historical input data includes observations that are IID. However, in many practical situations, it is possible to have input processes that may exhibit dependence. In this section, we review the simulation literature with input models that are not independent of each other. [50] provides techniques to check whether dependence exists in a data set. Dependence can occur in time sequence or across different input processes, or both. The degree of dependence is typically represented with parametric models as reviewed in Section 2.3.1. In Section 2.3.2, we provide an overview of the nonparametric and semi-parametric input models.

Multivariate input-models with spatial or temporal dependence
Dependent and multivariate input processes occur naturally in many service, communications, and manufacturing systems. [51] provide a comprehensive review on the development of multivariate input models which incorporate the interactions and interdependencies among the inputs for the stochastic simulation of such systems. Among others, most widely used approaches include the transformation-based methods Normal-To-Anything [52] and Vector-AutoRegressive-To-Anything [53], and copula-based models [54,55]. In particular, the copula-based models have the ability to capture a wide variety of dependence structures by describing dependence in a more general manner than linear correlation. [56] review the copula-based methods for multivariate input-modeling in stochastic simulations with dependent inputs. Specifically, they consider the cases where the dependence between pairs of simulation input random variables is measured by tail dependence (i.e., the amount of dependence in the tails of a bivariate distribution) and review the techniques to construct copula-based input models representing positive tail dependencies. Since the input correlation is often induced by latent common factors in many situations (e.g., financial portfolio management), [55] further explore the factor structure of the underlying generative processes for the dependence and develop a flexible Gaussian copula-based multivariate input model that can capture important properties in the real-world high dimensional data and provide the insights for system risk analysis.

Nonparametric and semiparametric input models
When no standard distribution fits the data well, or making a distributional assumption is not desired, then the data itself can be used to build an input model. This is commonly referred to as a nonparametric input model [1]. For a multivariate input process, it is typically needed to use parametric functions to capture the dependence relationships even if nonparametric models (e.g., the empirical distribution functions) are used as the marginal distributions of the input variable; see e.g., [57] and [58]. [47] use this approach in a simulation setting and refer to it as semiparametric input modeling. [7] introduce a semiparametric Bayesian framework that improves both computational and statistical efficiency, where both input and stochastic uncertainty are characterized by the posterior distributions. Without strong prior information on the input models and the system mean response surface, [32] propose a Bayesian nonparametric framework to quantify the impact from both sources of uncertainty. Nonparametric input models are introduced to faithfully capture the important features of the realworld data, and Bayesian posteriors of ible input models characterize the input uncertainty, which automatically accounts for both model selection and parameter value uncertainty.

Major input uncertainty quantification methods
In this section, we review approaches proposed for the representation of input uncertainty in stochastic simulations, including direct bootstrap, Bayesian Model Averaging, Delta method, metamodel-assisted bootstrapping, and a robust-optimization based approach. We discuss each method along with its pros and cons. Fig. 4 provides a summary of each method discussed in this section while Table 3 compares various methods and provides suggestions on when to use each method. We conclude this section by presenting the recent trends in input-uncertainty quantification.

Direct bootstrap
In the direct bootstrap, the empirical distribution is constructed as the input model estimate. The bootstrapping takes the real-world data as the whole population and generates B bootstrapped samples of the to quantify input uncertainty.
. The percentile confidence interval constructed based on the simulation outputs at all bootstrapped samples of the input model is used to quantify both input uncertainty and simulation uncertainty as shown in Equation (1). This method dates back to [14,15].
The direct bootstrap is simple, easy to implement, and it does not require any prior information on the input model and system response surface. However, it has a few potential limitations. First, it could be computationally prohibitive because it requires n simulation runs at B 1, , , and B needs to be large in order to construct an accurate percentile CI. Recently, [59] develop a subsampling framework to avoid the computational burden of the conventional bootstrap approach. The idea is to achieve efficiency by retrieving information from the conditional performance measures from each replication. Recently, [60] take a different approach focusing on extracting more information from within-replication sample paths. Green simulation, which is an approach used to reuse simulation outputs from past experiments, is applied at the sample path level to gain efficiency. Likelihood ratio method is used to implement the green simulation approach. The premise of the resulting procedure is its computational efficiency; i.e., as opposed to B × n simulation runs required to implement the direct bootstrap approach, only n runs are required. However, this approach comes with some limitations. For example, the likelihood function of system trajectory is unknown and the likelihood ratio calculations required for the implementation of the method could be a burden.
Another challenge in the implementation of the direct bootstrap method is to decide how to allocate the limited simulation budget between B and n. This decision has direct implications on simulation output analysis. [61] introduce a sequential experiment design to efficiently allocate the simulation resources to bootstrapped samples of input models which contribute most to the percentile CI bounds estimation.
Another disadvantage of the direct nonparametric bootstrap approach is that even though the underlying true distribution is continuous, empirical distribution is discrete. When we have a limited amount of real-world data, samples from the empirical distribution could overlook some important properties in the true input distributions (e.g., properties in the tails). This could cause obvious deviation on the simulation assessment of system performance when the simulation output is highly dependent on the tail behaviors of the input distribution. For example, supply chain management in high-tech industries often requires the service levels to be close to 100% making it critical to model the tail-behavior of the service-levels accurately.

Bayesian model averaging method
In the Bayesian Model Averaging (BMA) approach, given a few candidate parametric families, the posterior probabilities of the candidate models are used to quantify input uncertainty. Likewise, the posterior of distribution parameters is used to quantify the parameter uncertainty. Then, at each posterior sample of the input model, direct simulation is used to estimate the system performance. The confidence or credible interval based on the simulation outputs at all posterior samples of the input model is used to quantify both input and simulation uncertainty; see Equation (1) for the percentile CI or CrI.
The BMA approach dates back to 1970s and it is [35][36][37] who first adapts this approach to characterize the input uncertainty and parameter uncertainty in the analysis of simulation output data. [39,40] extend the work in Chick (2001) and decompose the posterior simulation output variance into two terms related to parameter uncertainty and simulation uncertainty. [38,41] further consider the case with unknown input models and quantify the simulation output variance due to simulation uncertainty, parameter uncertainty, and model uncertainty. [45] extend the BMA to input models with dependence by using the Normal-to-Anything (NORTA) input model to characterize correlated input processes.  (2012)).

Table 1
Literature on frequentist approaches to input uncertainty.

Distributional form
Parametric: [16][17][18], [19], [23] Parametric: [25], [26], [27], [29,30] Non-parametric: [14,15], [20], [21,22], [11], [24], [31,32] Non-parametric: Semi-parametric: Semi-parametric: Since the validation of Bayesian approaches does not require a large sample of real-world data, BMA overcomes the limitation of the direct bootstrap. All the information of the input model in the direct bootstrap only comes from the real-world data. By selecting the candidate distributions with appropriate tail behaviors, it could overcome the limitation of the bootstrap, especially when the amount of the real-world data is limited. Also, it is straightforward for Bayesian approaches to incorporate the prior information about the underlying input model.
However, BMA could have a few potential limitations. First, BMA is based on the assumption that all data come from one of the candidate distributions [62]. In other words, BMA relies on the assumption that all data are generated from a single true underlying parametric family. Without strong prior belief, it could be challenging to select the candidate distributions. If the selected parametric families are not mutually exclusive, such as exponential and Gamma distributions, it can potentially lead to model identification problems. Furthermore, a single parametric distribution typically cannot capture the rich properties in the real-world data. Second, BMA does not account for the simulation estimation error in Bayesian sense. Since frequentist and Bayesian approaches have different perspectives on quantifying the uncertainty and we also assess their performance differently, it is not clear how to assess the CI delivered by the BMA.

The delta method
Different from the direct bootstrap and the BMA that are named based on the methodology used for quantifying the input uncertainty, the delta method is named following the local Taylor approximation on the response surface. This approach is first discussed in [63] and is further studied in [16][17][18].
The underlying assumption is that the distribution family of the input model is known. The sampling distribution of the maximum likelihood estimates of the input parameters is used to quantify the parameter uncertainty. Since it could be hard to directly get the sampling distribution, the large-sample normal approximation is used to quantify input uncertainty. Then, the input uncertainty is propagated to the output through a linear approximate response surface.
Unlike the direct bootstrap and BMA that estimate the system response at each sample of the input model, delta method introduces an equation-based metamodel of the system response. Thus, it does not need substantial computational effort. However, a metamodel based on a local linear approximation is only appropriate when there is a large quantity of real-world data. The delta method is not suitable when the underlying response surface is highly non-linear and the amount of realworld data is limited. Second, delta method assumes that the distribution family of the input model is known. It is typically hard for standard parametric distributions to capture the rich properties in the real-world data. Another disadvantage of this method is that the simulation effort used to estimate the gradients of the response surface also increases with increasing number of input parameters.
Motivated by the aforementioned challenges of the classical delta method, [64] study the use of the delta method in estimating nonparametric input variance. The idea is to use random perturbation to obtain a finite difference estimator which approximates the gradient needed in the implementation of the delta method. Motivated from resampling, random perturbation uses multinomial distribution, which connects the approach to the infinitesimal jacknife estimator used in estimating the variance of bagging.

Metamodel-Assisted bootstrap
In the metamodel-assisted bootstrap approach, the bootstrap is used to quantify input uncertainty. Then, based on the simulation results at a few well-selected input models, Gaussian process metamodel is constructed to propagate the estimation error in input models to output uncertainty. A CI could be built to quantify both input uncertainty and simulation uncertainty. This approach is first proposed in [25].
Compared to the delta method, the use of a general-form metamodel does not require the input uncertainty to be small. Compared to the direct bootstrap and BMA, the use of a metamodel also reduces the impact of simulation error on the accuracy of the system performance estimate. However, the metamodel-assisted bootstrap requires the distribution family of the input model be known or the input model be specified by finite parameters/moments. Second, when the number of input model parameters is large (in hundreds to thousands), fitting a metamodel becomes challenging mainly due to the computational complexity of estimating the parameters as well as the large number of simulation runs needed to fit the model. Finally, the validity of using the bootstrap to quantify input uncertainty holds asymptotically. The finite sample performance could depend on the specific problem.
Motivated by the shortcomings of the direct bootstrap and the metamodel assisted bootstrap, [65] propose two new approaches named shrinkage and hierarchical bootstrap to produce direct bootstrap CIs that account for both input uncertainty and simulation uncertainty and investigate the empirical performance of the basic shrinkage CI, percentile shrinkage CI, basic hierarchical bootstrap CI, and percentile hierarchical bootstrap CI. The percentile shrinkage CI performs well when the number of available data, number of replications at each bootstrap sample, and the number of bootstrap samples are large while the basic hierarchical bootstrap CI shows good coverage in all cases regardless of the magnitude of the input data, the number of replications at each bootstrap sample as well as the number of bootstrap samples.

Robust optimization approach to input uncertainty
An alternative approach to modeling input uncertainty in stochastic simulations is to use a robust approach to input uncertainty. This approach is distinct from the other approaches discussed in the previous sections in the sense that it uses the premise of worst-case optimization borrowing some ideas from the well-established distributionally-robust optimization (DRO) literature ( [66] and [67]). A typical formulation under input uncertainty takes the form h x h x min ( ) and max ( ) where U is the uncertainty set (usually represented in the form of Table 2 Literature on Bayesian approaches to input uncertainty.

Distributional form
Parametric: [35][36][37], [38][39][40][41], [45], [46] Parametric: [42][43][44], [6] Non-parametric: [48] Non-parametric: -Semi-parametric: [7,47] Semi-parametric: - constraints), which captures the information and belief on the input model. The optimal solutions identified by these two optimization problems provide the best and worst-case values of the performance measure subject to the uncertainty imposed by U. Note that these two solutions are also the lower bound and the upper bound of the CI obtained for the performance measure of interest. The robust optimization approach to input uncertainty is relatively new and is still a very active research area. See [12] for an overview of this approach and its comparison to the more traditional approaches to input uncertainty discussed in the previous sections. Other recent papers in this area include [68], [69,70] and [71,72].
There are several advantages of this approach. First, it naturally works under parametric and nonparametric regimes. Second, the uncertainty set U can represent the prior beliefs in a nonparametric way offering more flexibility compared to the BMA approach. The approach can further quantify the nonparametric uncertainty of serial dependency [73]. Furthermore, the robust framework has been shown to improve over some of the traditional approaches. For example, the procedure developed in [69,70] is shown to improve over the direct bootstrap approach and the delta method. In particular, the optimization replaces the resampling step of the bootstrap approach leading to lighter computational requirements. Another advantage of the method is that unlike the bootstrap method, it is less sensitive to the computational budget allocation choices. Furthermore, it improves upon the delta method because it provides more accurate finite-sample performance.
However, it possesses some limitations and faces a few challenges. The main challenge is how to construct the uncertainty set U so that it will be practically meaningful and will allow computational tractability. Common types of U include some constraints based on a statistical distance (i.e., Kullback-Leibler or χ 2 -distance) from a baseline distribution, constraints that specify that the moments of input model are within certain ranges, or some shape constraints on the distribution. However, it is not clear how sensitive the system response to the selection of distance measure or more generally to the type of the constraint. Another challenge is that this framework does not allow the simulation practitioner to decompose the input uncertainty from the stochastic uncertainty. Furthermore, in general, h is nonlinear and nonconvex in U so the optimization problem is non-convex and hard to solve. Developing algorithms to solve this problem accurately and in a reasonable time is the subject of recent work in this area (see [71,72]).

Recent trends in input uncertainty quantification
Risk Quantification: The traditional approach in the input-uncertainty quantification literature is to gain inferences on the mean performance measures at the true but unknown input parameter. These inferences could be in the form of point estimates or confidence intervals. [74] take a different approach and study the risk quantification of the mean response with the goal of providing inferences on extreme scenarios of mean response in all possible input models. VaR and CVaR are used as risk measures of the mean response and a nested Monte Carlo simulation is proposed to estimate them under input uncertainty. Consistency and asymptotic normality of the proposed nested risk estimators are proved. This paper uses Bayesian approach but the approach developed is applicable to other input uncertainty quantification methods.
Online Data: One of the main assumptions in the input-uncertainty quantification literature is the availability of input data all at once in a batch (fixed data). However, this assumption may not be valid nowadays. Especially with the availability of more data and more advanced techniques that would make data acquisition more practical, it becomes imperative to be able to incorporate the online data into the simulation for real-time decision making. Motivated by this, [75] develop an inputuncertainty quantification method that work efficiently with online data. Authors assume that the input model takes a parametric form and Table 3 Comparison of input-uncertainty quantification methods. use the Bayesian approach to estimate unknown input parameters from data. The idea is that as a new data point arrives, the Bayesian posterior distribution is updated. This update is done via the help of an importance sampling procedure. The resulting procedure is shown to outperform a naive Bayesian approach in the simulation of an M/M/1 queuing system. Other Approaches: Motivated by the computational challenges of the delta method and bootstrap method, and the additional parametric assumptions needed to be satisfied for the metamodeling-based approaches to work, [76] devise a simpler approach which is based on sectioning the input data and constructing a confidence interval based on a pivotal statistic. The premise of the approach is its minimal simulation requirement and its reliance on minimal structural assumptions. The approach works for both parametric and nonparametric input regimes.

Sensitivity analysis for input uncertainty reduction
If the input uncertainty is too large, the next step is to try to reduce it through the collection of additional real-world input data. However, acquiring additional data could be costly. In that regard, it is important to determine the sample size sensitivity of each input process; i.e., how additional input data collection for each process contributes the most to the reduction of the overall input uncertainty, which can guide the most informative data collection. This section reviews those papers that take a sensitivity analysis approach for reducing the overall input uncertainty. [19] propose two main analysis-of-variance (ANOVA) approaches based on a fixed effects model and a random effects model. In the fixed effects model, ANOVA is used to detect differences between endpoints of the confidence interval (CI) of a particular parameter, whereas in the random effects model, it is used to identify differences in responses with respect to the parameters estimated from the bootstrap samples of the input data. [42,43] propose a Bayesian approach to study the sensitivity of the mean response to parameter uncertainty. The asymptotic normal approximation is used to quantify the parameter uncertainty and it is propagated to the output by using a local linear regression model. Both input and response surface parameter estimation uncertainties are characterized by posterior distributions. Then, balancing the costs for collecting additional input and simulation data, given a finite budget, they develop an optimal data collection strategy to minimize the overall estimation uncertainty of the system mean response.
[22] develop a nonparametric frequentist approach for sensitivity analysis. Basically, each input model is characterized by a mean and a variance. To study the impact of input uncertainty, they model the mean response surface with a separable linear regression model taking the first-two moments of each input model as independent variables. Then, the input uncertainty is quantified by bootstrapping. The contribution of each input to the overall input uncertainty is estimated by the main effect, which means all other input models are fixed but only a single model is varied. The derivative of the input variance with respect to the sample sizes provides the value of collecting more data. Since both [43] and [22] approximate the system response as a linear function of the input parameters or input moments, this approximation error could depend on both input uncertainty and the nonlinearity of the true response surface.
In the commonly used variance-based sensitivity measures, e.g., [22,43], the first-order effects and the total effects, fail to sum to the total variance and adequately deal with the interaction effects of the inputs. In this context, a first-order effect measures "the expected reduction in variance of the output when an input is fixed to a constant" and total effect measures "the expected remaining variance of the output when all other input values are fixed." [77]. For independent inputs, [78] introduces a new sensitivity measure based on Shapley value motivated by game theory, which can solve this issue. This approach is devised for a deterministic computer experiment where there is no stochastic uncertainty and it measures the uncertainty in the output allocated to each "known" input distribution. [77] further extends this measure to stochastic simulation experiments with dependent inputs. In addition, they propose a Monte Carlo algorithm to estimate the Shapley effects to avoid expensive computation when the number of inputs is large. [79] propose a Bayesian framework for global sensitivity analysis using the Bayesian metamodeling approach proposed in [6] for inputuncertainty quantification. The resulting metamodel assisted variance decomposition and sensitivity analysis quantifies the impact of each input model estimation uncertainty. The method also assesses the value of additional input data for each input process so that the system performance is improved. The application of the method in an M/M/1/K queuing system and a biopharmaceutical inventory example illustrates its better performance compared to the methods proposed in [43] and [22].
Most of the work in sensitivity analysis has focused on the sensitivity of the simulation output to the mean of the input distribution. More recently, [80] study the local sensitivity of simulation outputs to the variance of the parametric input models. Authors propose a new family of local sensitivity measures that are referred to as output-meanwith-respect-to-input-variance sensitivity measures.

Simulation optimization under input uncertainty
Simulation has been a widely used tool for comparing the performances of large and complex systems as well as identifying the system with the best expected performance. For instance, in supply chain management, simulation could be used to compare the performances of several inventory policies based on the expected demand fulfillment probability that they can achieve or expected cost that they provide. In queueing applications, simulation could be used to compare the expected waiting times of several queueing systems as well as to identify the service time that minimizes the total expected waiting time.
When the number of simulated systems is limited so that every system can be simulated, this problem is referred to as a Ranking and Selection (R&S) problem. A more general form of this problem where the solution space is continuous is called simulation optimization or Optimization via Simulation (OvS). Both areas are very established research areas; we refer the reader to [81] for an excellent review of the simulation optimization research. Our goal is not to perform an extensive review of the literature in this area; we rather focus on papers that particularly study input uncertainty in the implementation of the R&S and simulation optimization procedures. We review the R&S methods under input uncertainty in Section 5.1 and the simulation optimization methods under input uncertainty in Section 5.2.

Ranking and selection
Ranking and selection methods are used to compare and select a subset of systems or the "best" system according to their expected performance. When the goal is to select a subset of systems that performs good among a finite number of alternatives, this problem is referred to as the subset-selection problem. When the interest is on estimating the "best" system (i.e., the system that maximizes profit or minimizes cost), then the R&S procedure is called the indifference-zone (IZ) procedure. More specifically, the goal of an IZ procedure is to identify the best system with at least a probability of 1 when the true mean of the best is at least δ better than the second best.
Since we run a limited number of simulation replications, the simulation analyst is never certain about the order of the systems' performances. Even when we have infinite computing budget, as shown in the papers below, the presence of input uncertainty may dominate, and may have a significant impact on the performance of the R&S procedures. [82,83] are the first papers that look at the implications of parameter uncertainty on a subset-selection procedure. The authors design a decision rule that accounts for parameter uncertainty for a subset selection procedure and provide insights into the effect of parameter uncertainty on the identification of the stochastic system designs whose sample means are within an error tolerance of the best. [84] consider an IZ R&S procedure where the simulation input models to represent the uncertain inputs belong to an "ambiguity set" which contains a finite number of scenarios for the underlying input distribution. The authors devise a robust selection-of-the-best procedure which compares the alternatives based on their worst-case performance among all the distributions in the ambiguity set and selects the decision with the best worst-case performance. [85] continue the study of this problem and design a two-stage selection procedure and a sequential selection procedure. The resulting procedures are validated through two queuing applications. Unlike the above two papers, [86] take a ambiquity-neutral approach and investigate the impact of input uncertainty on the IZ R&S procedures. They particularly consider the impact of input uncertainty on the indifference-zone parameter δ. Authors conclude that in the presence of input uncertainty, for some competing designs, existing IZ procedures may fail to deliver a valid statistical guarantee of correct selection.
Unlike the frequentist approach adapted in [84], [87] take a robust Bayesian approach and propose several extensions of the knowledge gradient policy originally proposed by [88] to deal with Bayesian R&S in the presence of input uncertainty. The authors assume independence among alternatives but consider the correlations among the performance measures of an alternative under different input distributions.
Another approach used to solve R&S problems concerns properly allocating the simulation budget to maximize the probability of correctly selecting the best system among competing designs. This is known as the ''fixed budget problem." The most widely used method for the fixed budget problem is the optimal computing budget allocation (OCBA) procedure which allocates the simulation samples sequentially to maximize the probability of correct selection under a computational budget constraint. [89,90] consider an OCBA procedure under input uncertainty. More specifically, the authors take a conservative approach and devise a robust-OCBA procedure, which compares the system designs based on their worst-case performance and maximizes the probability of correct selection under the worst-case scenario. [91] continue the study of this problem and develop a new OCBA algorithm which allows the incorporation of additional input data with the hope of learning more about the input process. The resulting procedure maximizes the probability of correct selection by balancing input uncertainty and simulation uncertainty.
Building on the work in [84] and [90], [92] take a different approach and consider the practical case where the goal is to select the top m designs under input uncertainty. Selecting top m designs are especially relevant in engineering applications where multiple performance measures exist for competing designs. Authors formulate an OCBA problem where the goal is to maximize the probability of correctly selecting the top m designs within a fixed simulation budget. A conservative approach is taking where competing designs are compared based on their worst case performance. OCBA-RM procedure is proposed for allocating the budget efficiently in the presence of input uncertainty and is shown to be more more efficient than existing allocation rules OCBA-M in [93], Proportional to Variance Allocation (PTV), and Equal Allocation (EA).
Recently, [94] further look at the problem of ranking all alternatives in the presence of input uncertainty as opposed to ranking a subset of alternatives or selecting the best design. Authors propose a sequential ranking procedure which is based on an OCBA model.
Motivated by several practical applications, [95] consider a "constrained" R&S procedure with input uncertainty. Authors formulate the decision problem and propose a robust selection approach where an uncertainty set that contains a finite number of input distributions is assumed. The selection problem aims to maximize the probability of correct selection of the best design where the objective and constraint of a design are represented by their worst-case performance. The optimal selection rule, which is called robust optimal computing budget allocation with constraints (ROCBA -CO) is easy to implement and is shown to outperform two well-known procedures EA and OCBA-CO, which is a typical approach in constrained R&S problems ( [96]).
The aforementioned literature on R&S procedures with input uncertainty all assume a fixed input data set mainly because of the costs associated with collecting more input data. However, motivated by the analytical advances lately which make it easier to acquire additional new data, [97] relax this assumption and devise am IZ R&S procedure where additional data can be acquired on the spot during the implementation of the R&S procedure. Authors propose a moving average estimator for online estimation and devise a new procedure by extending the Sequential Elimination framework. The resulting procedure is shown achieve the target probability of correct selection. However, it suffers from conservatiness and its efficiency can be improved.
We conclude this section by noting that the "robust" approaches utilized in the literature may suffer from being too conservative and selecting a much worse design than the true optimal design. Recently, [98] overcome this conservatism with a procedure called input-output uncertainty comparisons.

Simulation optimization
Simulation optimization concerns the following problem where the objective function H(x) cannot be computed analytically so has to be evaluated via simulation and thus only sample performance h (x, ξ) is available. The distribution of ξ is usually unknown and is estimated from finite set of historical data leading to input uncertainty. The typical optimization method in the literature that hedges against the input uncertainty is the distributionally robust optimization (DRO) framework (see [99], [66], [100], [101], [102], and [103] among others). The DRO approach is a conservative approach because it uses the worst-case input distribution among all distributions that could be used to model the input data. Another extreme approach to hedge against input uncertainty would be to solve the optimization problem with all input distributions that could be used to represent the data and average out the results to make a final decision. This approach is easy to implement but is risk neutral to all possibilities. Motivated by the shortcomings of the simulation optimization methods under input uncertainty, [104] propose a new Bayesian riskbased optimization (BRO) framework which includes the DRO framework and the averaging approach as special cases. The BRO framework features three different formulations based on a mean-variance framework, Value-at-Risk (VaR) formulation, and Conditional-Value-at-Risk (CVaR) formulation. Authors show the effectiveness of the new formulations on a classical M/M/1 queueing system while [105] apply the formulations to a newsvendor inventory setting. [106] establish the consistency and asymptotic normality of the objective functions and optimal solutions obtained under these three formulations. Recently, [107] optimize the BRO problem under a specific formulation and posterior distribution. Specifically, in order to solve the BRO problem more efficiently, authors propose stochastic gradient estimators and stochastic approximation algorithms. A two-sided market model is studied as an example to show the performance of the proposed algorithms and to further provide insights on the choice of the BRO formulation.
Despite the success of the risk-based method, solving the risk-based simulation optimization problem remains as a numerical challenge. [108,109] make an attempt to solve the CVaR formulation using an extension of the gradient-based adaptive stochastic search method. The resulting method allows one to optimize CVaR more accurately with less computational effort. [110] propose a Bayesian simulation optimization approach under input uncertainty and modify the two most popular simulation optimization algorithms, Efficient Global Optimization algorithm [111] and the Knowledge Gradient algorithm with continuous parameters [112], to work in the presence of input uncertainty. Different from [87], this paper considers continuous set of alternatives and correlation between alternatives as well as input distributions. Furthermore, the goal is to identify the solution with the best expected performance rather than identifying the solution which is the best in the worst case. Building on the work in [110], [113] consider the case where real data collection is possible and devise a novel simulation optimization procedure called Bayesian Information Collection and Optimization, which automatically determines in each iteration whether it is more beneficial to collect more data or to run more simulations.
[114] modify the Efficient Global Optimization algorithm with the OCBA procedure to work under input uncertainty by combining the procedure with a stochastic kriging metamodel-assisted bootstrapping framework. The resulting framework is a robust simulation optimization algorithm in the sense that it also estimates the worst-case bounds of the optimal solution. Authors test the proposed algorithm using an M/M/1 queueing model and show that it provides tighter worst-case bounds.
To guide the dynamic decision making for complex stochastic systems, [31] introduce a simulation-based prediction framework. The prediction risk is quantified by the posterior predictive distribution accounting for both input uncertainty and simulation uncertainty. The input uncertainty is characterized by the posterior distribution of a flexible nonparametric input model with time dependence. Then, to search for the optimal operational decisions hedging against the prediction uncertainty, a mini-batch stochastic gradient descent approach is used to efficiently employ the simulation resource.
Finally, with the possibility of being able to collect more data to reduce the impact of input uncertainty, [115] consider the case where additional input data is available, which they refer to as "streaming data". This setting creates a series of stochastic optimization problems parameterized by estimated input parameters, which converge to their true values as more data become available. Borrowing ideas from the area of misspecified optimization problems, authors develop a stochastic approximation (SA) framework to solve these sequence of optimization problems. The resulting solution iterates are shown to converge to the optimal solution in an expected-value sense.

Applications of input uncertainty quantification
The goal of this section is to review several application areas in which the importance of capturing input uncertainty in operational decision making has been quantified. Among the areas discussed are inventory control, manufacturing systems (in particular, assemble-toorder systems and high-tech manufacturing systems), pharmaceutical supply chains, biomanufacturing, and energy systems.

Inventory control
The impact of input uncertainty in inventory management is a relatively well-studied area. There are two major groups of papers published in this area: (1) papers that investigate the impact of input uncertainty in an inventory setting where the goal is to identify the inventory targets that minimize/maximize a certain objective (see [116,117] and [118,119]), and (2) papers that investigate the impact of input uncertainty on the simulation of inventory systems (see [120], [121], [4] and [122]). There are a number of reasons as to why the input data might be limited or additional data collection might be infeasible in this context. For example, new product offerings with short product life cycles make it impossible to collect large amount of demand realizations. Furthermore, frequently changing market conditions pushes companies to use the most recent data to be able to capture the recent trends in the market. Brian Lewis, who is the co-founder of Fractal Sciences, describes this phenomenon as "big data dreams, small data reality", in the January/February 2014 issue of the Analytics magazine. [116,117] are the first papers that study input uncertainty in the context of an inventory problem by focusing on the estimation of inventory targets in a newsvendor setting in the presence of limited amount of historical demand data. Authors demonstrate the importance of accounting for parameter uncertainty on inventory-target estimation and compute inventory targets hedging against both stochastic uncertainty and parameter uncertainty. [119] further study the impact of parameter uncertainty on the inventory targets delivered to maximize the joint demand fulfillment probability in a budget-constrained multiitem inventory system. [121] use simulation to estimate the optimal order size for a newsvendor inventory setting using a Bayesian demand model with the objective of capturing parameter uncertainty due to the estimation of demand parameters from limited amount of historical demand data. Authors use the posterior sampling algorithm introduced in [120] to determine the optimal order size under parameter uncertainty. In the numerical section of the paper, authors compare the classical approach of estimating the optimal order size to the proposed Bayesian approach in the paper and find that the classical approach is more conservative in estimating the service levels and it overestimates the expected profit in the presence of limited demand data. The discrepancy between the two approaches diminishes with increasing number of historical demand observations. [122] investigate the impact of parameter uncertainty in the simulation of a newsvendor inventory setting with limited historical demand data. In particular, authors show that the 95% confidence-interval length obtained for the Type-1 service level (defined as "expected fraction of time the demand is completely satisfied by the stock on hand") and Type-2 service level (defined as "expected proportion of demand which is delivered without delay from the stock on hand") even for the case where simulation uncertainty is negligible may be too wide to help the inventory manager make an informative decision. Collecting more demand data helps to reduce the length of the confidence interval but this uncertainty never disappears completely, illustrating the importance of capturing demand parameter uncertainty in the simulation of inventory systems.
Although the focus in [122] is on demand parameter uncertainty, most of the time the demand models themselves are also unknown and need to be calibrated from the limited amount of historical demand data. Motivated by this observation, [4] consider the simulation of an inventory system with unknown input models and parameters and propose a simulation replication algorithm that jointly estimates the input models as well as the performance measures. The proposed algorithm is built on a nonparametric Bayesian approach, which frees the inventory manager from making restrictive assumptions on the form of the input models while allowing him/her to incorporate expert opinion into the decision making process. The numerical section of the paper compares the practice of ignoring input uncertainty (by using the bestfit distributions or empirical distributions to represent the limited amount of historical demand data) to the practice of accounting for input uncertainty (by using the proposed algorithm in the paper) in the simulation of a single-product inventory system with the objective of estimating the Type-1 and Type-2 service levels. The paper also constructs credible intervals using real-world data obtained from a global manufacturer.

Assemble-To-Order systems
The research in this area is in its infancy. To the best of our knowledge, [123] is the only paper that focuses on an Assemble-to-Order production system simulation where the demands for the end products and the time since the last customer arrival are correlated. Similar to the aforementioned inventory models, the lack of enough demand observations could be due to the short life products or the tendency of the companies use the recent demand observations to keep up with the market conditions. This paper proposes a procedure that decouples the sampling of the marginal distribution parameters and the correlation matrix to propagate the estimation error in input models to the output uncertainty. The numerical section focuses on an ATO system having six components and three end products with unknown production time parameters and demand arrival-rate parameters. The following three procedures are evaluated based on the coverage of the confidence intervals for the expected per-period profit they provide: (i) a conditional analysis which uses the empirical distribution for the marginal distributions and estimates the correlation matrix using the semi-parametric maximum likelihood method; (ii) bootstrap resampling method which captures the uncertainty only around the parameters of the marginal distributions ignoring the uncertainty around the correlation matrix; and (iii) the proposed procedure in the paper which accounts for the uncertainty around the parameters of the marginal distributions and the correlation matrix. Numerical results show that the conditional confidence intervals provide very poor coverage while the confidence intervals obtained with the proposed procedure attain the desired coverage. [124] consider the new-product development process of a pharmaceutical company. As part of this process, the company needs to source different biological products from different contract manufacturers whose reliability is unknown but estimated with a logistic regression from the historical data on the performance of the manufacturer. The reliability of the manufacturer, which is critically important for the pharmaceutical company to deliver the product on time, is defined as the likelihood of the manufacturer to successfully manufacture the biological product. The company uses simulation to estimate the performance of a sourcing decision and the execution of the simulation requires the generation of binary random variates from the logistic regression model whose parameters are estimated from the limited amount of data. The data on the past performance of the manufacturer is usually limited because a single company produces only limited number of new products and thus works only limited amount of times with the same manufacturer. This paper considers the input uncertainty that is due to the estimation of the parameters of the logistics regression model used in the simulation and quantifies the impact of input uncertainty on the expected profit of the pharmaceutical company. The paper also introduces a sampling-based algorithm with the goal of improving the sourcing decisions by accounting for the input uncertainty. The numerical experiments illustrate that ignoring input uncertainty in the decision-making process may lead to suboptimal decisions up to 90% of the time.

Pharmaceutical supply chains
[125] continue the study of the supplier failure probability under input uncertainty and investigate how input uncertainty is affected from the attributes of the historical products defined using the two metrics related to the similarity of the current product to the old products (called difference in the paper) and the standard deviation of the feature of the products developed by the supplier (called dispersion in the paper). The main takeaway from this paper is that input uncertainty should be considered particularly when the current product and the old products in the historical data are different from each other and an attribute does not vary much from one product to another.
Finally, [126] study an engineer-to-order production system with random yield where uncertainty in the yield is modeled with a betaregression model in which the mean value of the yield depends on the unique attributes of the engineer-to-order product. The manufacturer has limited amount of historical data about the past yield realizations and thus has to deal with the input uncertainty around the beta-regression parameters. The paper investigates the impact of the input uncertainty on identifying the optimal batch sizes and on the expected cost of the manufacturer.

Biopharmaceutical manufacturing
Biomanufacturing is growing rapidly, and it plays a significant role in supporting the economy and ensuring public health. The biopharmaceutical industry generated more than $300 billion in revenue in 2019 and more than 40% of the drug products in the development pipeline were biopharmaceuticals. Compared to traditional synthesized small molecule pharmaceuticals, biotherapeutics are manufactured in living cells whose biological processes are very complex and have highly variable outputs. The productivity and product critical quality attributes are determined by hundreds of critical process parameters, including raw materials, media compositions, feeding strategy, and process operational conditions. As new biotherapeutics (e.g., cell and gene therapies) become more and more "personalized", the production, regulation procedure, analytical testing time required by biopharmaceuticals of complex molecular structure is lengthy, and the historical process observations are relatively limited in particular for drugs in early development stages. To guide the decision making for biomanufacturing, [127] develop a stochastic simulation model for biomanufacturing risk analysis by focusing on the production process from raw materials to nished drug substance. They study main sources of uncertainty leading to batch-to-batch variation, such as raw material biomass, cell culture, and target protein purication, while accounting for input uncertainty.

Energy systems
Wind power as a renewable energy is gaining significant importance and is being integrated into smart power grids. However, the inherent uncertainty around the wind energy introduces certain operational challenges. To address this problem, [128] propose a short-term wind power probabilistic forecast that account for both inherent stochastic uncertainty and model estimation error. The resulting framework, which is based on a Bayesian nonparametric probabilistic forecast model, is shown to provide reliable wind power probabilistic forecast which can be utilized to support real-time risk management for smart power grids. [129] further consider the problem of power grids scheduling under high wind penetration. This problem is addressed in the literature with the use of a stochastic unit commitment (SUC) model, which ensures cost-efficient and reliable power grids scheduling. However, the traditional SUC model makes the implicit assumption that the statistical model characterizing the wind power generation uncertainty is known, leading to the underestimation of the input uncertainty. Furthermore, the sample average approximation method used to solve the SUC usually uses finite scenarios to approximate the expected cost and thus ignore the finite sampling error. Motivated by these two shortcomings of the existing SUC model, [129] propose a data-driven SUC model and further introduce a parallel computing based optimization approach, which provides optimal unit commitment decisions hedging against all sources of uncertainty including the stochastic uncertainty of the wind power generation, SUC input model estimation uncertainty, uncertainty induced to the use of finite sampling approach in the implementation of the sample average approximation method.
6.6. High-Tech manufacturing systems [130] study the decision-making process of high-tech companies who are (i) dealing with high uncertainty in the supply, production, and demand; (ii) have access to limited amount of historical data; and (iii) are usually risk-averse. The paper proposes a simulation-based prediction framework based on a two-stage dynamic decision model that accounts for input uncertainty. The proposed framework makes use of the metamodel-assisted approach to ensure efficient use of simulation resources and also delivers a credible interval that accounts for both stochastic uncertainty and input uncertainty. The numerical study focuses on a single-item, multi-period inventory system under a continuous (Q,R) policy with the goal of estimating the risk-adjusted total cost incurred in a given time-period. The authors compare the proposed metamodel-assisted approach to the sample-average approximation method, which is the typical approach used in the literature to estimate the risk-adjusted cost objective. The results show the superiority of the metamodel-assisted approach as it provides much smaller estimation error while efficiently using the simulation budget to reduce the stochastic uncertainty.

Discussion and future perspectives
This paper has reviewed the literature on input uncertainty in stochastic simulations by providing a classification of major research streams and focusing on the new developments in recent years. We also review application papers that investigate the value of representing input uncertainty in the simulation of real-world stochastic systems in various industries.
It is clear from this review that the problem of input uncertainty has received growing attention from the simulation researchers especially over the last couple of years. The future trends in this area where we expect to see more research developments in the coming years are the use of machine learning methods to better exploit the detailed simulation output data to generate deeper insights on system performance by explicitly considering input uncertainty, the integration of online data to simulation or simulation-optimization for real-time decision making, and improving the computational efficiency of classical inputuncertainty quantification methods.
Although several methodological advances have been made in this area, the research on the impact of input uncertainty on specific application domains remains scarce. We expect to see that the implications of input uncertainty in several applications areas will be investigated in the near future. It would be also interesting to see how the existing approaches would perform for specific application domains. Since different application areas may call for different capabilities of the input uncertainty quantification methods, we believe that this would also shape the future of the input uncertainty research.
It is important to note that the studies reviewed in this paper, including computer simulation input modeling, input uncertainty quantification, sensitivity analysis, simulation optimization under input uncertainty, are closely related to the fast-growing field of data science. They can facilitate the knowledge discovery of complex stochastic cyberphysical systems and internet-of-things (e.g., distributed manufacturing, supply chain, and health care processes) in the form of explanations and predictions. Given the heterogeneous data collected by smart devices connected by internet networks, the simulation model can facilitate data integration. The studies on computer input modeling, input uncertainty and optimization can facilitate us to: (1) characterize the randomness from each sources of uncertainty, (2) capture interesting and robust patterns that satisfy the rich and heterogeneous data and data streams, (3) improve the knowledge and prediction of complex systems, (4) facilitate the representative digital twin development, and (5) guide the optimal, coherent, and robust decision making for complex stochastic systems.
Despite the existence of well-established methods to represent input uncertainty in a simulation study, these methods have not been integrated to the simulation software packages for general use. One extension is Simio (http://www.simio.com), where the company implements the work in [22], allowing the users to perform input analysis and identify which input contributes most to the overall input uncertainty. We expect to see more collaboration between input uncertainty researchers and simulation software developers to allow the research be accessible for simulation practitioners.