From inference to design: A comprehensive framework for uncertainty quantification in engineering with limited information

In this paper we present a framework for addressing a variety of engineering design challenges with limited empirical data and partial information. The framework includes guidance on characterising a mixture of uncertainties, together with efficient methodologies for integrating data into design decisions, conducting reliability analysis, and performing risk- and reliability-based design optimisation. To demonstrate its efficacy, the framework has been applied to the NASA 2020 uncertainty quantification challenge. The results and discussion in the paper refer to this application.


Introduction
Often complex systems must safely operate in extreme environments, under varying or uncertain operational conditions. Ensuring satisfactory operation of novel systems subject to uncertain conditions is the aim of uncertainty quantification, the process of analysing the uncertainties in the interaction between complex systems and their environments [1]. It is common in modern engineering for a black-box, deterministic model to be used to inform design decisions, and uncertainties are always present in these models. Many model parameters may have an uncertain, unknown, or unobservable value. These models, used in the design process, are often constructed in isolation from the acquisition of experimental data. Experimental data may be scarce or of poor quality because of the novelty of the mission, the inaccuracy of experimental measurements, or the prohibitive cost of collecting data. Uncertainty quantification must be performed rigorously where possible, and diligently otherwise, so as to maximize the reliability of designs, decisions, and policies.
In engineering, the design process is always, whether explicitly acknowledged or not, an exercise in reducing uncertainty. All engineering design problems share some of the following four common challenges when dealing with uncertainty.
• Variability: All engineering systems are expected to operate in varying environmental conditions. The design of these systems must be sensitive to the expected variability in the system's anticipated operating environment.
• Inference: When designing novel systems, data is often sparse or lacking entirely. During the design phases of engineered systems it is common for more data to become available progressively. The design strategy must be able to incorporate this data as it becomes available, without excessively constraining the design by over-fitting to any data that does arrive.
• Ignorance: During the design of engineering systems, computational models are often used to support decision makers. These models may include parameter values that are unknown, or there may be unknown parametric dependencies. An effective design strategy must express and conserve this uncertainty, and make minimal assumptions about it, to ensure design decisions are robust to ignorance.
• Decision Making: The design process of complex systems involves multiple stages of progressive refinement. Each stage necessitates decisions about the collection of data which can be used to improve the design. These decisions must be made with limited information and must accommodate the uncertainty present in the available data. The design strategy should provide as much flexibility to decision makers as possible, since constraining the design too early in the design process can be costly to change later.
The approach presented within this paper is a harmonised framework, borrowing from the most recent advances in uncertainty science, to form a complete solution sequence for dealing with these four types of uncertainty that are always present within engineering design. This approach supports the characterisation of these uncertainties simultaneously and can be used to perform an iterative design optimisation. This new framework provides support to practitioners to: make decisions about further data acquisition, assess the reliability of prototype designs, optimise the design against severe failures and uncertainties, and refine design decisions.
Within this work we propose a new Bayesian calibration method to construct an uncertainty model which captures both epistemic and aleatory uncertainties using a multi-dimensional second-order probability distribution. The stochastic area metric (Kantorovich-Rubinstein/Wasserstein distance) [3,4], which measures the discrepancy between two non-parametric probability distributions, is used in an approximate Bayesian updating procedure in place of the likelihood. This extension allows for a single-step calibration of correlated second-order probability distributions. We use sliced-normal distributions, recently introduced in [5], to efficiently generate samples of the updated posterior and to describe data-enclosing sets to visualise dependencies in high-dimensional spaces. We use probability bounds analysis, in a value-of-information approach, to investigate the sensitivity of the system output to the epistemic parameters. We compare this approach to a variance-based sensitivity analysis, and use this comparison to identify which parameters should be prioritised. Second-order Monte-Carlo and probability bounds analysis are both adopted for forward propagation. Finally, we develop a reliability- and risk-based design optimisation scheme inclusive of both aleatory and epistemic uncertainties to identify candidate system designs which are robust to these unknowns. The use of probability bounds analysis throughout the framework allows for consistent characterisation and propagation of both aleatory and epistemic uncertainties.
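For two samples of equal size, the stochastic area metric (the Wasserstein-1 distance between the two empirical CDFs) reduces to the mean absolute difference between the sorted samples. A minimal sketch of this computation (the function name is ours, not from the challenge code):

```python
def area_metric(x, y):
    """Area between the empirical CDFs of two equal-size samples
    (Wasserstein-1 distance), via the sorted-sample shortcut."""
    xs, ys = sorted(x), sorted(y)
    assert len(xs) == len(ys), "this shortcut assumes equal sample sizes"
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

# shifting a sample by a constant c shifts its CDF by c, so the area is c
d = area_metric([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])
```

In the calibration procedure, a distance of this kind replaces the log-likelihood when comparing simulated and observed output distributions.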
A new NASA Langley challenge on optimisation under uncertainty was launched in 2020 [6]. The challenge combines problems of statistical inference with complex, multi-dimensional, black-box engineering systems. This challenge emulates common fundamental problems across science and engineering, and is intended to push the limits of the state of the art within the field of uncertainty quantification. Because of the complexity of the challenge problem, no single approach in the literature currently satisfies all requirements, and solutions have to strike a balance between robustness and efficiency. This new challenge builds upon the previous NASA uncertainty quantification and robust design challenge launched in 2014 [7], where a black-box model was used to provide a discipline-independent platform for benchmarking and testing of cutting-edge methods for uncertainty propagation, sensitivity analysis, model calibration, and robust optimisation. Several authors tackled the original challenge using a variety of approaches [8][9][10][11][12]. Among these, [8] proposed a random-set theoretical approach combined with approximate Bayesian computation and meta-modelling, upon which our work builds and from which it takes inspiration.
The rest of the paper is organized as follows. Section 2 presents the problem statement and discusses some of the most relevant challenges. Section 3 presents the theoretical preliminaries and reviews the existing literature. We highlight the novel contributions of this work in Section 4, and the numerical implementation of the proposed solution sequence is described in Section 5. We apply the proposed solution method to the NASA challenge problem, and Section 6 presents the numerical results for the six problems and their subproblems. Section 8 closes the paper with a discussion of the results, failed attempts, and possible research directions.

Problem statement
Consider a system of interest which is modeled as a set of interconnected subsystems, with uncertain parameters (a, e). Here, a is the vector of aleatory inputs of size n_a, such that a ∈ A ⊆ R^{n_a}, and e is the vector of epistemic inputs, such that e ∈ E_0 ⊆ R^{n_e}. The true value of the parameters e ∈ E_0 and the probability distribution of the variables a ∼ F_A are both unknown. The subsystem is modeled by the function y(a, e, t), where y : R^{n_a} × R^{n_e} × [0, T] → R and t is time. The integrated system is modeled by z(a, e, θ, t), where z : R^{n_a} × R^{n_e} × R^{n_θ} × [0, T] → R^{n_z}. Hence, the output of y(a, e, t) is a function of time, and the output of the integrated system, z(a, e, θ, t), is n_z functions of time. The functions y and z are given as discrete time histories, e.g., y(t) = [y(0), y(dt), ..., y(n_t dt)], where n_t dt = T.
This model is the subject of a design optimisation procedure, where the goal is to select a design vector θ of size n_θ in the space Θ ⊆ R^{n_θ}, which ensures a high probability of system compliance with given reliability requirements g_j, j = 1, ..., n_g; these requirements may concern the stability of the system, efficiency and energy consumption, structural integrity, maximum thermal or elastic stress, etc. The reliability requirements are modeled via a vector of reliability functions g : R^{n_a} × R^{n_e} × R^{n_θ} → R^{n_g}, where g_j defines the compliance of the system with respect to requirement j = 1, ..., n_g. If g_j(a, e, θ) > 0, the design θ fails requirement j under the uncertain scenario (a, e).
Two data sets of discrete time histories, D_1 and D_2, are made available at different stages of the design process. The quantities n_1 and n_2 denote the number of time histories for the subsystem and the integrated system, respectively. For the sake of notational simplicity, the dependency of the signals z and y with respect to the uncertain factors (a, e) and θ has been dropped in D_1 and D_2.
The computational models and the data sets must be combined to tackle different challenges in the prototyping of a reliable design. Specifically, this work will focus on the following six tasks:
• Model calibration and uncertainty quantification for the subsystem
• Epistemic uncertainty reduction
• Reliability analysis of a baseline design
• Integrated system prototype via reliability-based design optimisation
• Model updating and final design tuning
• Risk-based design optimisation
The problem statement presented in this section emulates many real-life design challenges. The focus of this work is the NASA 2020 challenge on optimisation under uncertainty [6], but our approach is general. In summary, the specific setup of the NASA challenge problem considers n_z = 2 dynamic system responses, n_e = 4 epistemic parameters, n_a = 5 inputs affected by aleatory uncertainty, and n_θ = 9 parameters defining the design of a feedback controller with a non-collocated sensor/actuator pair. The design vector θ must satisfy n_g = 3 reliability requirements on the stability, settling time, and energy consumption of the controlled system. Two data sets D_1 and D_2, each containing n_1 = n_2 = 100 realizations of the system responses of length n_t = 5000, are made available at different stages of the design process. Fig. 1 provides a visual summary of the NASA problem statement, whilst Table 1 summarises the dimensions of the parameters, the availability of data, and the relevance of the factors for the six tasks.

Challenges
There are four features of the NASA 2020 challenge which characterise its complexity and distinguish it from previous challenge problems: i) the calibration of a mixed uncertainty model, containing epistemic parameters and an unknown correlated probability distribution simultaneously, in high dimensions; ii) the inverse black-box inference problem, where an uncertainty model of the inputs is sought but only data for the output is provided; iii) the auto-correlation of the time history of the model output; iv) the very limited data provided relative to the size of the output, which makes the inferential uncertainty large and the propensity for over-fitting significant.

Models of uncertainty
Most in the uncertainty quantification community agree that uncertainty falls into two classes: aleatory and epistemic. These two classes behave differently, and require careful and separate consideration in their modelling. This distinction in the nature of uncertainty is echoed in the NASA challenge problem statement, which clearly states that the computational model has both aleatory and epistemic inputs. In this section we describe these concepts, and review various mathematical frameworks for modelling systems with mixed aleatory and epistemic uncertainty.
Aleatory uncertainty arises from natural stochasticity, and is sometimes referred to as randomness. The specific random values of an aleatory parameter cannot be predicted exactly; however, its overall dispersion follows a specific probability distribution which may, in aggregate, be inferred. Aleatory uncertainty is therefore thought to be perfectly modelled by probability theory, where the probability is interpreted as a frequency. This type of uncertainty is irreducible in the sense that gathering more data may give a better characterisation of its distribution, but cannot reduce it to a scalar.
Epistemic uncertainty, which arises from imperfect knowledge, is typically understood to be reducible by empirical effort (gaining more knowledge). Epistemic parameters are constant but unknown, and are not naturally varying. Epistemic uncertainty may also be modelled with probability, following the Bayesian interpretation, where a probabilistic statement is a subjective degree of belief assigned to possible events. The Bayesian interpretation makes no distinction between the frequency of observed events and subjective belief about unobserved events. Kolmogorov's axioms, which define the calculus of probability, do not distinguish between these either; therefore characterising physical systems with a combination of both (observed data and subjective belief) can allow biases to yield erroneous results.
Intervals [13] offer a more natural way to model imperfect knowledge, or epistemic uncertainty. These intervals may be bounded at one end, bounded at both ends, or open. Intervals make no statement about the distribution of mass inside the interval and hence are a relaxation of probability theory. However, they are not ideally suited to characterising aleatory uncertainty. Because of the natural ability of intervals to express epistemic uncertainty, and the ability of probability to express aleatory uncertainty, combining these two frameworks yields the theory of imprecise probabilities (IP). IP is a theory of uncertainty which models both variability and imprecision. This provides a framework for a more rigorous characterisation of the limits of knowledge, and a distinction between what is not known and what is varying. IP is effectively a generalisation of probability theory for computing with sets of probabilities, relaxing Kolmogorov's additivity axiom. Imprecise uncertainty models include probability bounds analysis [14][15][16][17][18], Dempster-Shafer evidence theory [19], random set theory [20][21][22], fuzzy set theory [23][24][25][26], polymorphic uncertainty theory [27,28], info-gap theory [29] and others.
The probability box (p-box) is an imprecise uncertainty model which has recently grown in popularity. A p-box bounds a set of probability distributions via up to three kinds of constraint: (i) two bounding CDFs, a lower bound F̲(x) and an upper bound F̄(x); (ii) interval bounds on the moments (μ_X, σ_X); and (iii) a set of admissible distribution families F (e.g. normal). That is, a distribution F(x) is a member of a p-box if F̲(x) ≤ F(x) ≤ F̄(x) for all x, its moments μ_X and σ_X lie within their respective intervals, and F belongs to F. Some of these constraints may be missing, or informed from the others. For example, bounds on the moments may be retrieved from the CDF and distribution family information. Alternatively, the CDF bounds may be informed through moment and range information [30], by making use of the Chebyshev, Markov, and Cantelli inequalities. Or, the distribution family may be missing entirely, with such p-boxes sometimes being referred to as non-parametric.
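As a minimal illustration of constraint (iii) combined with interval moments, the sketch below bounds the CDFs of a normal family whose mean and standard deviation are only known to within intervals; the interval values are illustrative assumptions. For the normal family, the envelope over the parameter box is attained at its corners, so evaluating the four corner CDFs suffices:

```python
from math import erf, sqrt

def norm_cdf(x, mu, sd):
    # standard normal CDF, shifted and scaled
    return 0.5 * (1.0 + erf((x - mu) / (sd * sqrt(2.0))))

def pbox_bounds(x, mus=(0.0, 1.0), sds=(0.8, 1.2)):
    """Lower/upper CDF bounds at x for a normal family with
    mean in [0, 1] and standard deviation in [0.8, 1.2]."""
    vals = [norm_cdf(x, m, s) for m in mus for s in sds]
    return min(vals), max(vals)

lo, hi = pbox_bounds(0.5)
```

Sweeping x over a grid traces out the bounding CDFs F̲ and F̄ of this parametric p-box.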
The standard method for propagating p-boxes is p-box arithmetic, or probability bounds analysis [31,17,32]. This requires an interface between the model source code and the compiler, because most programming languages do not support interval computation. Efforts to overwrite machine-level arithmetic for interval computation are underway [33], but these approaches are not yet commonly used. In the NASA challenge, and generally in engineering, we are restricted to studying a black-box model. Therefore, the choice of the appropriate model of uncertainty is limited to methods primarily based on sampling. In the recent literature we have seen the emergence of sampling-based methods that simulate the aforementioned imprecise models of uncertainty [34][35][36][37][38][39], including a number of non-intrusive methods involving multiple Monte-Carlo or optimisation operations [40][41][42][43]. These authors have shown that probability bounds analysis, evidence theory, and random set theory can be performed via sampling without altering the rules of code compilation, demonstrating the applicability of such frameworks to black-box models.
A popular mixed uncertainty model, particularly in combination with Bayesian methods, is the second-order distribution, which attempts to model both the aleatory and epistemic components. These structures arise naturally in so-called double-loop Monte-Carlo simulation, where first a probability distribution is picked at random from a specific family, and then its values are simulated with Monte Carlo. This yields an ensemble of distributions. These structures grew in popularity in the 1990s [44,45] as more computational resources became available. However, they are rarely discussed in the recent literature, most likely due to several criticisms: the method is generally very expensive, since multiple Monte-Carlo loops are required; it is thought to severely underestimate epistemic uncertainty [46,47]; and modelling dependencies may also be troublesome. In [47, Section 5.1], an example is given for the sum of N probability distributions supported on the box [a, b]^N. When propagating epistemic uncertainty, Monte Carlo becomes increasingly less accurate as the number of variables increases. Given that the sample variance of the sum decreases linearly with the number of variables, it becomes ever more improbable to sample close to the endpoints of the box where the exact bounds hold. Despite this, second-order distributions are a flexible method for modelling epistemic and aleatory uncertainty with sampling. Some of the criticisms of second-order distributions may be alleviated by converting them to p-boxes by enveloping the ensemble, referred to in the challenge document as an inner approximation of a p-box [17,18].
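The double-loop construction of a second-order distribution, and its envelope into an inner approximation of a p-box, can be sketched on an assumed toy model: a normal aleatory variable whose mean is an epistemic parameter lying in [0, 1] (all numerical choices here are illustrative):

```python
import random
from bisect import bisect_right

random.seed(0)
grid = [-2.0 + 0.2 * i for i in range(31)]   # x-grid from -2 to 4
lower = [1.0] * len(grid)
upper = [0.0] * len(grid)

for _ in range(50):                          # outer loop: epistemic draw
    mu = random.uniform(0.0, 1.0)            # pick one distribution at random
    s = sorted(random.gauss(mu, 1.0) for _ in range(500))  # inner aleatory MC
    for i, x in enumerate(grid):
        F = bisect_right(s, x) / len(s)      # empirical CDF of this member at x
        lower[i] = min(lower[i], F)          # envelope of the ensemble
        upper[i] = max(upper[i], F)
```

The envelope [lower, upper] is an inner approximation because it bounds only the sampled members of the ensemble, not the full set of admissible distributions.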
In this work we develop a probability bounds analysis via sampling strategy to address the NASA challenge problems using an inner approximation of a p-box constructed from a second-order distribution. This is well suited to deal efficiently with the propagation problem of mixed uncertainty.

Calibration of mixed uncertainty models
Inverse problems cover a class of mathematical challenges where the objective is to derive unobserved function inputs from a set of output observations [48][49][50]. In the context of probability modelling, this involves finding a distribution of input parameters which matches some output distribution. Model calibration is an example of such an inverse problem. Probabilistic model calibration can be performed with Bayesian [51,52] or frequentist methods [53]. Bayesian approaches to model calibration have a number of desirable features; for example, they are generally applicable and can incorporate prior expert judgment where data is scarce. However, there are a number of limitations to these approaches. The calibration problem may be ill-posed [51], with multiple optimal solutions. Furthermore, computational challenges arise from high-dimensional, highly peaked or multi-modal distributions, or when computationally expensive models [54] are required to evaluate candidates. Accounting for both aleatory and epistemic uncertainty is particularly challenging. A number of recent papers have also shown that non-additive measures are a requirement for valid inferential problems, and that additive models of epistemic uncertainty, commonly employed in Bayesian approaches, lead to false confidence [55,56]. False confidence occurs when there is a large probability of assigning high confidence to a false assertion. Martin and Liu [57,58] propose an inferential framework to combat false confidence.
If inferential uncertainty is large due to limited data, or if both epistemic and aleatory uncertainty are present, imprecise probabilities are required. A number of recent authors have proposed methods for the calibration of imprecise probabilities [59,25,26,60]. The literature on the topic is sparse and mainly follows the Bayesian paradigm [51,61]. There is also work on fuzzy logic and arithmetic [25,26]; an example of fuzzy arithmetic used to solve inverse problems in finite element models can be found in [62]. Interval approaches to the inverse problem have also appeared in the recent literature [63,64], as has work on interval characterization [65]. However, the black-box nature of the problem necessitates the use of sampling methods, making robust imprecise inference difficult.
Updating over time histories is a common problem; it introduces complications due to the auto-correlation present between time steps. One way to account for this auto-correlation within model updating is to model the time histories as stochastic processes or random fields [66]. Due to the high dimensionality, approximations or dimensionality-reduction methods are often required to compactly represent random fields, for example Gaussian processes [67,68], Fourier transformations, and Karhunen-Loève expansions. Compact representations of interval fields [69], as well as imprecise random fields using parameterised kernels [70], have been proposed.
Within the NASA challenge problem, we are attempting to calibrate an uncertainty model (UM) for the system. This problem is high-dimensional, there is no knowledge of the input space other than bounds, the model accounts for both aleatory and epistemic uncertainty, and the available data is limited. This motivates the use of an approach, described in Section 5, which avoids the limitations of previous methods described.

Approximate Bayesian computation
Approximate Bayesian methods are a way to solve inference problems in the absence of a likelihood function, which is replaced by a sufficient statistical metric. Bi et al. [71] propose a Bayesian method for calibrating a set of distributions using stochastic metrics to replace the likelihood [72], allowing for an updating procedure over a space of distributions. Similarly, an approximate Bayesian computation using the Wasserstein distance was applied by Bernton et al. [73], making the method scalable. Engquist et al. [50] analytically and numerically characterise two major effects of the quadratic Wasserstein distance as the measure of data discrepancy in computational solutions of inverse problems. Approximate Bayesian methods utilise a class of random-walk algorithms, Markov chain Monte-Carlo (MCMC), that incrementally improve the selection of inputs that best match the output data when propagated through the model. A recent tutorial paper for some of these MCMC methods is available [74]. These algorithms use simple proposal probability distributions, seeded by samples of the previous iteration, to sample new candidate points. If these candidate input points yield output points closer (as defined by the chosen distance metric) to the target output data, then they are kept; otherwise they are discarded. Most common amongst these methods is the use of Metropolis-Hastings [51,75]. Transitional Markov chain Monte-Carlo (TMCMC) was first proposed by Ching et al. [76], adapting the earlier Metropolis-Hastings algorithm. TMCMC constructs a set of independent Markov chains, making parallelisation possible. This is essential when the model is computationally expensive to evaluate, or when the dimensionality of the problem is large.
Since the introduction of TMCMC several improvements have been proposed, such as: a recalculation of weights at each iteration, a burn-in at the beginning of each chain to improve intermediary posterior sample convergence, and an adaptive scaling of the proposal distribution [77]. Further improvements have been identified with enhanced selection of task dependent distance metrics within the ABC regime [73,72]. Similar approaches include the Metropolis-adjusted Langevin algorithm [78], the Metropolis preconditioned Crank-Nicolson method [79], Hamiltonian MCMC [80], relativistic MCMC [81], and many others [82][83][84][85][86].
An appropriate likelihood metric is not readily available for the NASA challenge problem. Therefore, approximate Bayesian methods with TMCMC sampling provide a means of calibration which is efficient and capable of handling the high dimensionality of the output.
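The core acceptance logic, with a distance between simulated and observed samples replacing the likelihood, can be sketched with a plain rejection sampler (much simpler than the TMCMC scheme used in this work; the toy model, uniform prior, and tolerance are illustrative assumptions):

```python
import random
random.seed(1)

def area_metric(x, y):
    # Wasserstein-1 distance between two equal-size empirical samples
    xs, ys = sorted(x), sorted(y)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

data = [random.gauss(1.0, 1.0) for _ in range(100)]    # "observed" output data

accepted = []
while len(accepted) < 200:
    mu = random.uniform(-2.0, 2.0)                     # candidate from the prior
    sim = [random.gauss(mu, 1.0) for _ in range(100)]  # push through the model
    if area_metric(sim, data) < 0.3:                   # keep if close to the data
        accepted.append(mu)

post_mean = sum(accepted) / len(accepted)              # approximate posterior mean
```

Rejection sampling wastes most model evaluations; MCMC variants such as TMCMC reuse accepted samples to propose better candidates, which matters when each evaluation is expensive.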

Sensitivity analysis
Sensitivity analysis "is the study of how uncertainty in the output of a model (numerical or otherwise) can be apportioned to different sources of uncertainty in the model input" [87]. In general there are a number of motivations to employ sensitivity analysis: to rank uncertainties in order of magnitude of impact; to identify where greater empirical effort may best be placed to reduce uncertainty; to estimate the effect of different modeller assumptions; or to justify simplifying a model by removing inputs that have little effect. There are a number of approaches to sensitivity analysis that are concerned with different models of uncertainty. These approaches are broadly classified as variance-based, Bayesian sensitivity analysis, and value-of-information. Variance-based sensitivity analysis uses a variance decomposition scheme. This class of sensitivity analyses attempts to attribute portions of the variance in the model output to the different inputs, characterised by probability distributions [88,89]. Within variance-based methods the most commonly employed approach is to propagate samples of the input distributions to the output domain multiple times and to derive sample statistics about the resulting distribution. This approach ranks input parameters based on their respective contributions to the output variance, either in isolation or in some order of combination.
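A minimal pick-freeze estimator of first-order variance-based (Sobol) indices, on an assumed toy model where the first input dominates the output variance, can be sketched as:

```python
import random
random.seed(2)

def model(x1, x2):
    return x1 + 0.1 * x2    # x1 should receive almost all of the output variance

N = 20000
A = [(random.random(), random.random()) for _ in range(N)]
B = [(random.random(), random.random()) for _ in range(N)]
yA = [model(*a) for a in A]
mean_yA = sum(yA) / N
var_yA = sum((y - mean_yA) ** 2 for y in yA) / N

def first_order(i):
    # pick-freeze: keep input i from A, resample the other input from B
    yC = [model(a[0], b[1]) if i == 0 else model(b[0], a[1])
          for a, b in zip(A, B)]
    cov = sum(ya * yc for ya, yc in zip(yA, yC)) / N - mean_yA * (sum(yC) / N)
    return cov / var_yA     # first-order index S_i = Cov(Y, Y_i') / Var(Y)

S1, S2 = first_order(0), first_order(1)   # analytically ~0.99 and ~0.01
```

The estimator confirms the intended ranking: nearly all of the output variance is attributed to the first input.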
Bayesian sensitivity analysis is also probabilistic, but uses the machinery of Bayesian analysis [90]. This approach is sometimes referred to as a robust Bayesian analysis, where the modeller explores the effect of different prior distributions, relative to their best available estimates, on the posterior of interest. This approach seeks to identify which of the modeller's prior beliefs have the greatest impact on the posterior [91,92].
Value-of-information sensitivity analysis is concerned with estimating the reduction in uncertainty of the output should more evidence become available about each of the inputs. This is usually done by comparing a metric of uncertainty, such as the width of the output interval or the Hartley-like measure, before and after pinching the uncertainty of one input, or of multiple inputs at the same time. Several methods have been proposed in the literature that explore this using different metrics [3,8,89].
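A pinching study on an assumed toy response, using the width of the output interval as the uncertainty metric, might look like the following (the response function and the pinched values are illustrative assumptions, and a brute-force grid scan stands in for a proper interval propagation):

```python
def out_range(e1_lo, e1_hi, e2_lo, e2_hi, n=101):
    # grid scan of y = e1 + 0.2 * e2**2 over the epistemic box
    ys = [e1_lo + (e1_hi - e1_lo) * i / (n - 1)
          + 0.2 * (e2_lo + (e2_hi - e2_lo) * j / (n - 1)) ** 2
          for i in range(n) for j in range(n)]
    return min(ys), max(ys)

def width(bounds):
    return bounds[1] - bounds[0]

base = width(out_range(0.0, 1.0, -1.0, 1.0))      # full epistemic box
pinch_e1 = width(out_range(0.5, 0.5, -1.0, 1.0))  # e1 pinched to a point value
pinch_e2 = width(out_range(0.0, 1.0, 0.0, 0.0))   # e2 pinched to a point value

# fraction of output uncertainty removed by each pinch
red_e1 = 1.0 - pinch_e1 / base
red_e2 = 1.0 - pinch_e2 / base
```

Here pinching e1 removes a much larger fraction of the output interval width than pinching e2, so empirical effort to learn e1 would be prioritised.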

Reliability-and risk-based design optimisation
The goal of reliability-based design optimisation (RBDO) is to identify a design that minimizes (or constrains to an acceptable level) the failure probability for some performance requirement(s). A traditional approach to RBDO involves two nested loops: an outer loop explores the design space, while the inner loop calculates the failure probability for a candidate design [93]. However, nested-loop methods are often very inefficient. This inefficiency is due to the time-consuming estimation of the failure probability, which involves approximating a multidimensional integral over the failure domain for each candidate. Efficient RBDO can be achieved by: (i) replacing the nested loop with a single-loop [94] or decoupled approach [95], or (ii) reducing the computational cost of the inner loop. The cost of the inner loop can be reduced by utilising an approximate analytical method, such as the first- and second-order reliability methods, to estimate the failure probability; or by using efficient Monte-Carlo methods such as subset simulation [96], line sampling [34], or multi-level surrogate models [97]. Other contributions have explored inverting these nested loops for the propagation of non-additive probabilities [22].
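The nested-loop structure can be sketched on a toy problem with one design variable and an analytic limit state (failure when g = a − θ > 0, with a standard normal a); a grid search stands in for the outer optimiser, and the failure-probability target is an illustrative assumption:

```python
import random
random.seed(3)

def failure_probability(theta, n=20000):
    # inner loop: Monte-Carlo estimate of P(g > 0) with g = a - theta
    return sum(1 for _ in range(n) if random.gauss(0.0, 1.0) > theta) / n

# outer loop: the cheapest design is the smallest theta meeting Pf <= 0.05;
# analytically theta* = Phi^{-1}(0.95), roughly 1.645
candidates = [0.05 * i for i in range(61)]          # theta grid on [0, 3]
best = next(t for t in candidates if failure_probability(t) <= 0.05)
```

Even in this one-dimensional toy, the inner loop dominates the cost, which is what motivates the single-loop, decoupled, and efficient-sampling alternatives cited above.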
In addition to these computational issues with RBDO, maximizing the reliability gives no guarantee on the severity of failures. This may yield sub-optimal solutions, where the probability of failure is small but the severity of any failure may be very significant. In other words, the tails of the distribution of the reliability performance function can extend deep into the failure domain. In practice, the analyst might want to minimize the severity of failure, given by the extent of the tail of the reliability functions. The average severity of failures can be obtained by computing conditional value-at-risk scores, which can replace the failure probability in the objective function of the optimisation procedure. Minimization of the superquantile, i.e. the conditional value-at-risk, yields designs with a small probability of failure that are also robust against severe failures. This type of design problem is known as risk-based design optimisation. Beck et al. [98] present a comparison of risk-based, reliability-based, and robust design optimisation approaches; see also [99]. Robust methods tend to yield designs that are less sensitive to uncertainty, whereas RBDO methods seek designs with a maximum level of reliability, whilst risk-based design methods target designs that optimally compromise between cost, probability of safe operation, and severity of failure. For more details about the superquantile approach to failure probabilities see [100], and for recent risk-based design methods see [101].
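From samples of a performance function, the superquantile at level α is simply the mean of the worst (1 − α) fraction of the samples; a minimal estimator (a sketch, not the estimator used in the challenge solution):

```python
def superquantile(samples, alpha):
    # conditional value-at-risk: mean of the worst (1 - alpha) tail
    s = sorted(samples)
    k = int(len(s) * alpha)          # index of the alpha-quantile
    tail = s[k:]
    return sum(tail) / len(tail)

# for samples 0..99 the 0.9-superquantile is the mean of 90..99, i.e. 94.5
q = superquantile(list(range(100)), 0.9)
```

Because the superquantile is never smaller than the corresponding quantile, constraining it penalises designs whose performance tail reaches deep into the failure domain, even when the failure probability itself is small.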
The majority of existing reliability- and risk-based design optimisation methods rely on a precise characterization of a probabilistic model, which is necessary to estimate a design's reliability or risk. The issue with defining a precise probabilistic model is the sensitivity of the failure probability and severity to the choice of model. Selecting a good model of the uncertainty can be challenging, especially for high-dimensional problems, when dependencies are unknown, or when data is lacking. Where a lack of data is affecting the analysis, a distributionally robust reformulation of the RBDO problem or a non-probabilistic hybrid reformulation of the reliability problem can be adopted [102,103]. Other methods are available to circumvent these issues in the definition of the probability model. These include performing risk-based design optimisation and RBDO directly on the given data [104], set-valued characterization of the uncertain factors and minimization of the non-probabilistic reliability index [103], or using Scenario-RBDO methods to enclose design solution data while offering epistemic guarantees [105]. Only a few works have attempted to solve RBDO in the presence of mixed aleatory and epistemic uncertainty, mostly in attempts to address the previous NASA challenge problem on UQ (2014) [9,8,10].

Novel contributions
In this paper we present a complete and harmonised framework for dealing with uncertainty in engineering design. It is complete in the sense that it addresses many major UQ tasks found in any engineering design mission, from calibration to design optimisation. This framework combines some of the latest methods in uncertainty quantification, and is tested on the high-dimensional, low-data NASA challenge problem in a mixed uncertainty setting.
The novel contributions within the framework we present include methods for:
• calibration of second-order distributions on time-history data,
• construction of a mixed uncertainty model with a focus on characterisation of inferential uncertainty (avoiding over-fitting),
• efficient simulation of posterior samples using sliced-normal distributions,
• probability bounds sensitivity analysis via sampling, and
• RBDO under mixed parametric uncertainty.

The proposed approach
In this section, we present a novel solution framework to address some of the challenges discussed in the previous sections. We sketch some of the mathematical foundations that are required to understand the results presented in this document. We will keep a practical mindset oriented towards the problems of interest.

Notation
Extensive use of mathematical notation will be made throughout the paper. The use of notation will be consistent with the symbols already introduced in the challenge manifesto [106]. For more details about the symbols herein used and their meaning the reader is referred to the Nomenclature in Appendix A.

Constructing a multivariate parametric distribution
In this framework, we model the aleatory uncertainty with a parametric distribution $F_A$ whose parameters must be inferred from data as additional epistemic inputs. In principle any parametric family may be selected for the marginals, with the dependency structure being defined by a parametric copula family. The selection of these families may be problem specific. If a problem has bounded aleatory inputs, and no information about the shape of the marginals, then a Beta distribution family is a good choice [107]. For our application we introduce a joint distribution based on Beta marginals. This is because the family is bounded, like our aleatory inputs $A_i$ for the NASA challenge, and produces a wide variety of shapes with two parameters. For our problem, each $a_i$ is bounded in the range $[0, 2]$, so we scale the standard Beta model:
$$F_{A_i}(a_i) = B\!\left(\frac{a_i}{2};\, \alpha_i, \beta_i\right),$$
where $i = 1, \dots, n_a$, $a_i \in [0, 2]$, and $B$ is the Beta distribution's CDF with shape parameters $\alpha$ and $\beta$. Since $F_A$ is $n_a$ dimensional, a total of $2n_a$ parameters are required to define the shape of all of the marginal distributions. The dependence structure of $F_A$ may be modelled by selecting an $n_a$ dimensional parametric copula. The Gaussian copula is a typical choice, parameterised by a positive semi-definite correlation matrix $\Sigma$. A Gaussian copula $C_\Sigma$ is defined in terms of the (mean zero) multivariate Gaussian CDF $F_G$ parameterised by correlation matrix $\Sigma$:
$$C_\Sigma(u_1, \dots, u_{n_a}) = F_G\!\left(\Phi^{-1}(u_1), \dots, \Phi^{-1}(u_{n_a});\, \Sigma\right),$$
where $\Phi^{-1}$ is the inverse CDF of a standard normal distribution, $u_i \in [0, 1]$, and $n_a$ is the dimension of the copula. With $n_a$ dimensions, an $n_a \times n_a$ correlation matrix is required to define the copula for $F_A$, the off-diagonals of which may be varied to change the dependence structure. Since there are $n_a(n_a - 1)/2$ unique off-diagonal elements in an $n_a \times n_a$ correlation matrix, a total of $n_a(n_a - 1)/2$ parameters are required to define the dependence of $F_A$.
For this problem, since $n_a = 5$, there are $2n_a = 10$ Beta parameters required along with $n_a(n_a - 1)/2 = 10$ parameters for the correlation matrix. This gives a total of 20 additional epistemic parameters $e_{5:24} \in E_B \subseteq \mathbb{R}^{20}$ that are required to completely specify the joint distribution $F_A$:
$$F_A(a;\, e_{5:24}) = C_\Sigma\!\left(F_{A_1}(a_1), \dots, F_{A_5}(a_5)\right),$$
where Sklar's theorem is used, and $E_0 \times E_B$ is the collection of 24 parameters defining the entire joint distribution, with $e_{1:4}$ defining the original epistemic parameters, $e_{5:14}$ defining the shape of the marginals, and $e_{15:24}$ the dependence structure (see Fig. 2). We begin the indexing of this vector from 5 and call it $e$ because these are epistemic parameters, and will be added to the original 4 epistemic parameters of the computational model in the calibration stage, where inference can be performed on the entire $e_{1:24}$ epistemic vector. Fig. 2 shows the density $f_A$ for a particular choice of $E_B$. A Gaussian copula family requires a $5 \times 5$ correlation matrix as an input, the off-diagonal elements of which may be varied to create different dependence structures. A valid correlation matrix must be positive semi-definite. It has however been shown in [108] that a positive semi-definite correlation matrix may be parameterised in terms of partial correlations, which all independently take values in $[-1, 1]$. Thus ten partial correlations may be included in the calibration procedure to define the dependence structure for $A$.
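As an illustration, the joint model above can be sampled by drawing correlated Gaussians, mapping them to uniforms through the standard normal CDF, and inverting the scaled Beta marginals. The following sketch uses SciPy; the shape parameters and correlation matrix passed in are placeholders for calibrated values, not results from the paper:

```python
import numpy as np
from scipy import stats

def sample_joint(n_samples, alphas, betas, corr, rng=None):
    """Sample a joint with Beta marginals on [0, 2] and a Gaussian copula.

    alphas, betas: shape parameters of the n_a Beta marginals (assumed values).
    corr: n_a x n_a correlation matrix of the Gaussian copula.
    """
    rng = np.random.default_rng(rng)
    n_a = len(alphas)
    # Gaussian copula: correlated normals -> uniforms via the standard normal CDF
    z = rng.multivariate_normal(np.zeros(n_a), corr, size=n_samples)
    u = stats.norm.cdf(z)
    # Inverse-transform through each scaled Beta marginal: a_i in [0, 2]
    a = np.column_stack([
        2.0 * stats.beta.ppf(u[:, i], alphas[i], betas[i]) for i in range(n_a)
    ])
    return a
```

The same routine serves as the inner (aleatory) loop of the double-loop Monte Carlo described later, with `(alphas, betas, corr)` supplied by an epistemic sample.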
A multivariate imprecise structure can be constructed from $F_A$ by expressing epistemic uncertainty about the parameters $E_B$; for example, when $E_B$ is an interval (compact space), $F_A$ is a multivariate p-box, and when $E_B$ is distributional (probability space), $F_A$ will be a second order distribution. The parameters of $E_B$ will be inferred in the calibration procedure.
The UM, as has been described, is an ensemble of aleatory models. Fig. 2 shows several visualisations of the probabilistic model, given a sample of the epistemic space. This is demonstrative of the flexibility of the Beta marginal and Gaussian copula model.
Since the epistemic space is expanded in this framework, its dimensionality will increase. For example, in the context of the NASA challenge, we go from an $n_e = 4$ dimensional space to an $n_e = 24$ dimensional space. Therefore we redefine $e \in E = E_0 \times E_B \subseteq \mathbb{R}^{24}$ for the subsequent steps in this framework. For the sake of simplicity, we use $e$ without any subscripts whenever this does not cause any confusion. Whenever $e$ is used as an input to define an aleatory distribution this refers only to components $e_{5:24}$, as in Eq. (3). In contrast, whenever $e$ is supplied to the system or performance functions, it will denote $e_{1:4}$.

Uncertainty calibration
Under this parametrisation, a standard calibration technique may be performed over $e$. In this work we use Bayesian model updating to perform the calibration of our uncertainty model. In Bayesian model updating, one would like to construct a probability distribution over $e$ that is reflective of the provided experimental data. For this, we select a prior distribution which reflects the analyst's knowledge about the parameter space $e$ before observing any data, and is updated such that the updated distribution (a posterior) matches the experimental data in the output domain. This posterior is found by solving Bayes' law:
$$P(e|D) = \frac{P(D|e)\, P(e)}{P(D)},$$
where $P(e|D)$ is the posterior, constituting a calibrated model. $P(e)$ is the prior, which is often selected to be a uniform distribution when nothing about the parameters except their ranges is known. $P(D|e)$ is the likelihood: the probability of observing the data $D$ given a point $e$ in the input domain. Defining and evaluating the likelihood is often the most difficult part of Bayesian inference, since it requires evaluating the model at the point $e$ and comparing its prediction to the data in a consistent way. This will be discussed in the following subsection. $P(D)$ is a normalising constant whose evaluation involves an integral which is usually intractable in high dimensions. However, since it does not depend on the parameters $e$, a method for sampling from an unnormalised probability distribution, such as MCMC, may be used to draw samples from $P(e|D)$.

Approximate likelihood for second order distributions
For a multidimensional output, and under the assumption of independence between time points, the likelihood is a function $\mathbb{R}^{n_1} \times \mathbb{R}^{n_t} \times E \to \mathbb{R}$ defined as the product over time of the per-time densities:
$$P(D|e) = \prod_{t} P(D_t|e),$$
where $P(D_t|e)$ is the probability density of the data at time point $t$, evaluated for a model output from point $e$.
Note that there is some loss of information with this independence assumption. Accurate estimation of the data density is very costly, and would require a large number of samples and possibly a density estimator such as Kernel Density Estimation. We therefore follow an approach similar to that outlined in [71], where the likelihood is replaced by an approximate likelihood based on a stochastic distance metric. A stochastic distance metric defines distances between probability distributions (i.e. defines how dissimilar they are). This framework fits our problem well, since both our predictive distribution (for a fixed $e$) and the experimental data are distributional. We may therefore define a higher likelihood for points $e$ which produce distributions similar to $D$, and penalise points which produce distributions which are "far" from $D$ with respect to our metric. We define an approximate likelihood based on the Gaussian distribution for each time point, as:
$$P(D_t|e) \propto \exp\!\left(-\frac{d\big(F_P(\cdot|e),\, F_{y_t}\big)^2}{\epsilon^2}\right),$$
where $d$ is a distance metric between distributions and $\epsilon$ is the so-called width factor, which must be preselected by the analyst. The width factor controls the "spikiness" of the posterior: the smaller $\epsilon$ is, the tighter the posterior, but also the harder it is to sample from; the larger $\epsilon$ is, the more spread out the posterior, leading to a looser inference. Bi et al. [71] suggest that $\epsilon \in [10^{-3}, 10^{-1}]$. A common distance metric is the Euclidean distance, which would correspond to the distance between the means. If the Euclidean distance is selected for $d$, this corresponds to standard approximate Bayesian computation. However, this discards important information about the shapes and dependencies. A distance metric which can capture the entire shape of the distributions is ideal. For $d$ we have selected the area metric [109]: the horizontal integral, or area, between the empirical distribution functions of $D$ and samples of the predictive distribution.
The area metric is defined as:
$$d\big(F_P(\cdot|e),\, F_{y_t}\big) = \int_{-\infty}^{\infty} \left| F_P(x|e) - F_{y_t}(x) \right| \mathrm{d}x,$$
where $F_P(x|e)$ and $F_{y_t}$ are the empirical cumulative distribution functions of the prediction and data at time $t$ respectively. The area metric is also commonly known as the $L_1$ Wasserstein metric [73]. It has some favourable properties for updating.
• It is unbounded, therefore it can quantify the discrepancy between two distributions no matter their relative locations or overlap.
• It returns the Euclidean distance for scalars (i.e. degenerate distributions).
• It is sensitive to the entire shape of the distributions, and not only their moments.
• It can be computed efficiently, and from sample data.
• It can be generalised to higher dimensions.
• It satisfies all the mathematical properties of a metric.
• It applies to non-parametric distributions.
• It is units preserving (returns the physical units of the distribution).
Alternative metrics commonly used in updating include the Kolmogorov-Smirnov distance or the Bhattacharyya distance, but these metrics are bounded, and therefore require a 2-step updating procedure [71]. Since the area metric is unbounded we can perform the Bayesian updating in a single step. Eq. (6) is computed for each time point in the output; the likelihood for a sample $e$ of the epistemic space is computed by taking their product.
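The area metric and the resulting approximate log-likelihood can be computed directly from samples. The sketch below is a minimal implementation; the function names are ours, and `eps` plays the role of the width factor $\epsilon$:

```python
import numpy as np

def area_metric(x, y):
    """Area between two empirical CDFs (L1 Wasserstein), from sample vectors."""
    xs, ys = np.sort(x), np.sort(y)
    # Evaluate both ECDFs on the pooled support and integrate |F_x - F_y|
    grid = np.sort(np.concatenate([xs, ys]))
    fx = np.searchsorted(xs, grid, side='right') / len(xs)
    fy = np.searchsorted(ys, grid, side='right') / len(ys)
    return np.sum(np.abs(fx - fy)[:-1] * np.diff(grid))

def approx_log_likelihood(pred_samples, data_samples, eps=1e-2):
    """Gaussian-kernel approximate log-likelihood from the stochastic distance."""
    d = area_metric(pred_samples, data_samples)
    return -d**2 / eps**2
```

For two degenerate (scalar) samples the metric reduces to the Euclidean distance, as stated in the property list above.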
This requires $n_t$ evaluations of the integral in Eq. (7), one for each time step $t = 0, dt, 2\,dt, \dots, n_t\, dt$, which adds significant cost to the Bayesian updating. A dimensionality reduction technique can be used to reduce the dimension $n_t$ and the computational cost of the calibration, see e.g. PCA or the Karhunen-Loève expansion [110]. In this work, we reduce the dimensionality $n_t$ by Fourier transformation of the discrete-time stochastic processes in $D$. We then select a subset of harmonics ($\ll n_t$) in the Fourier domain that are sufficient to explain the variability in the data.
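The harmonic selection step can be sketched as follows. The energy-ranking criterion used here is an assumption standing in for whatever selection rule is used to explain the variability in the data:

```python
import numpy as np

def reduce_harmonics(signals, n_keep):
    """Keep the n_keep largest-energy harmonics (averaged over realisations).

    signals: (n_series, n_t) array of discrete-time realisations.
    Returns the retained complex coefficients and their harmonic indices.
    """
    coeffs = np.fft.rfft(signals, axis=1)
    energy = np.mean(np.abs(coeffs)**2, axis=0)   # average over realisations
    idx = np.argsort(energy)[::-1][:n_keep]       # dominant harmonics
    return coeffs[:, idx], idx

def reconstruct(coeffs, idx, n_t):
    """Inverse transform using only the retained harmonics."""
    full = np.zeros((coeffs.shape[0], n_t // 2 + 1), dtype=complex)
    full[:, idx] = coeffs
    return np.fft.irfft(full, n=n_t, axis=1)
```

The area metric of Eq. (7) is then evaluated per retained harmonic rather than per time step, reducing the number of integrals from $n_t$ to the number of harmonics kept.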
Samples of the posterior may be simulated with a MCMC algorithm. For this work we used TMCMC [76], an MCMC algorithm which works well for high dimensional problems such as this. This procedure is expensive, since an entire Monte Carlo loop is required for a single evaluation of the likelihood. We therefore draw fewer samples from the posterior using TMCMC, and propose an efficient method for drawing further samples from the posterior based on sliced-normal distributions (see Section 5.4.1), to increase the accuracy of the uncertainty propagation. This is explored in the following section.

Sliced-normal sampling
In this framework, we use a sliced-normal distribution to approximate the Bayesian posterior [5]. This serves as a density estimator that is used to generate a larger posterior sample set for propagation. This allows for a more robust propagation of uncertainty when generating a sufficient sample set directly from the Bayesian posterior is computationally prohibitive.
A sliced-normal is generated by extending the parameter space from the $n_p = 24$ dimensional physical space of the data vector $e \in E$ to an $n_q$ dimensional feature space vector $q \in \mathbb{R}^{n_q}$, where $n_q = \binom{n_p + d}{n_p} - 1$ and $d$ is a specified degree for the monomial expansion function $q(e; d)$. For example, for a two-dimensional vector in physical space $(x, y)$, a monomial expansion of degree $d = 2$ would produce a five-dimensional feature space vector $q(x, y; 2) = (x, y, x^2, xy, y^2)$. This has the effect of 'bending' the physical space such that the distribution of data more closely resembles a feature space Gaussian distribution.
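The monomial expansion can be sketched as below; the helper name and the ordering of the monomials are illustrative:

```python
from itertools import combinations_with_replacement
from math import comb
import numpy as np

def monomial_features(e, d):
    """All monomials of the entries of e up to total degree d (constant excluded)."""
    e = np.asarray(e)
    feats = []
    for deg in range(1, d + 1):
        for idx in combinations_with_replacement(range(len(e)), deg):
            feats.append(np.prod(e[list(idx)]))   # e.g. (0, 1) -> e[0]*e[1]
    return np.array(feats)
```

For a two-dimensional input and degree 2 this yields the five features $(x, y, x^2, xy, y^2)$, matching $\binom{2+2}{2} - 1 = 5$.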
The mean $\mu_q$ and covariance matrix $\Sigma_q$ of the data in feature space can be estimated from a set of $N_e$ posterior samples $e_j$, $j = 1, 2, \dots, N_e$, in a number of ways. We compare two methods in terms of their level set coverage: the maximum-likelihood and worst-case maximum methods described in [5].
Once the mean and covariance $\mu_q, \Sigma_q$ of this feature space Gaussian distribution have been determined, the feature space density $f_q$ of a mapped point $q(e; d)$ can be used as an estimate of the unnormalised physical space density $f_e$:
$$f_e(e) \propto f_q\big(q(e; d)\big).$$
The normalising constant missing in Eq. (8) is difficult to compute because it requires the solution of a multidimensional integral over a polynomial manifold given by $q(e; d)$. However, the unnormalised densities are sufficient for producing samples from a TMCMC process. Evaluating the density of each point $e$ using the sliced-normal is efficient, since it only requires a monomial expansion and a density calculation on a single Gaussian. Because of this, the TMCMC process can proceed at a much faster pace than with the Bayesian posterior, allowing many thousands of samples to be drawn for forward propagation.

Sliced-normal validation
If the sliced-normal is calibrated appropriately, the level sets drawn from the relative densities of the data allow one to maintain the coverage criterion of a suitable confidence structure. To check the coverage of this structure, the distribution of the p-values must stochastically dominate a uniform CDF. A p-value represents the probability of observing an event at least as extreme as some proposal. The p-values can be used to define confidence level sets $C(\alpha)$ that should contain a future sample with probability $\geq \alpha$. This is represented by:
$$P\big[e \in C(\alpha)\big] \geq \alpha,$$
where $\alpha$ is the desired confidence level, and $\beta(e; e_j) = B^{-1}(\alpha_B;\, k + 1,\, N_e - k)$ is the inverse upper Clopper-Pearson bounding distribution evaluated at a second confidence level $\alpha_B$, with $k$ defined as the number of samples having a Mahalanobis distance greater than the equivalent distance for $e$. The minimum confidence level required to contain a remaining sample in a leave-one-out validation process is given by $1 - \beta(e; e_j)$. The available sample set is shuffled and split in two: one set for training, $e_{1:N_e - 1}$, and one for testing, $e_{N_e}$. The minimum required confidence to cover $e_{N_e}$ is then calculated. The training and test sets are then reshuffled, and the process is repeated a number of times in order to estimate the distribution of minimum confidence levels, which is visualised in the left of Fig. 12. This provides evidence that the sliced-normal distribution is capable of covering the samples with a minimum frequency. The approach used here to calculate the optimal sliced-normal parameters matches the maximum-likelihood approach of [111].
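The shuffle-split coverage check can be sketched as follows. This is a simplified stand-in for the procedure above: it fits a plain Gaussian rather than a sliced-normal, and uses empirical Mahalanobis-distance p-values in place of the Clopper-Pearson bounding distribution:

```python
import numpy as np

def loo_pvalues(samples, n_rounds=200, rng=None):
    """Shuffle-split p-values of a held-out sample under a fitted Gaussian.

    The p-value is the fraction of training points whose Mahalanobis
    distance exceeds that of the held-out point.
    """
    rng = np.random.default_rng(rng)
    samples = np.asarray(samples)
    pvals = []
    for _ in range(n_rounds):
        perm = rng.permutation(len(samples))
        train, test = samples[perm[:-1]], samples[perm[-1]]
        mu = train.mean(axis=0)
        prec = np.linalg.inv(np.cov(train, rowvar=False))
        d2 = np.einsum('ij,jk,ik->i', train - mu, prec, train - mu)
        d2_test = (test - mu) @ prec @ (test - mu)
        pvals.append(np.mean(d2 > d2_test))
    return np.array(pvals)
```

If the fitted model covers the data, the empirical CDF of these p-values should lie at or below the uniform CDF (stochastic dominance), so e.g. the fraction of p-values below 0.1 should not greatly exceed 0.1.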

The double loop MC method
Monte-Carlo (MC) simulation is one of the most widely used methods to propagate aleatory uncertainty. MC treats the model under study as a black box, which makes the method relatively easy to apply to almost any problem.
When a mixture of aleatory and epistemic uncertainty is present in the input, characterised by second-order distributions or p-boxes, Monte-Carlo can still be used with a modification. A number of different modifications have been proposed, including double-loop Monte-Carlo [39], generalised importance sampling and the interval Monte Carlo method [112], the slicing method [43], multi-level meta-models [97] and, for special classes of problems, analytical methods [113].
The double-loop MC approach is composed of two nested loops. The outer loop explores $E$ whilst the inner loop is used to propagate the aleatory uncertainty. Each sample of the outer loop defines a random multivariate distribution, $F_A$, which is then sampled in the inner loop. Repeating this process produces an ensemble of $F_A$. Fig. 3 summarises this process of sampling from a second-order distribution using double loop MC. First, $N_e$ epistemic samples $e$ are obtained from the epistemic box $E$, either with Latin hypercube sampling or with the available samples from the calibration procedure. Each of these samples defines an epistemic point $e_{1:4,j}$ (the green dot in Fig. 3) and a $j$-th aleatory distribution defined by a set of marginal distributions (parameters $e_{5:14,j}$) and a copula (parameters $e_{15:24,j}$), the orange dot in Fig. 3. The aleatory distribution is then sampled by correlated inverse transform sampling. This involves drawing samples from the copula (the blue dot in Fig. 3) and then inverting them through the marginal CDFs. These transformed samples then follow the correlated multivariate distribution. In the case of a Gaussian copula, a well known method can be applied to generate its samples: first generating samples of a multivariate Gaussian (using a Cholesky decomposition) and then performing an isoprobabilistic transform through the standard normal CDFs.
This produces two matrices: a matrix of size $n_e \times N_e$ with the epistemic samples, and a matrix of size $n_a \times N_a \times N_e$ with the aleatory samples (see Fig. 4).
The black-box model (system and performance functions) is evaluated on these matrices, producing the output shown in Fig. 4. The rows of the output matrices (e.g. $y$, $z$ and $g$) are samples from a unique probability distribution; a p-box is then constructed by enveloping the MC ensemble predictions over the $N_e$ dimension.
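The enveloping step can be sketched as follows, assuming a one-dimensional output with one row of aleatory samples per epistemic sample:

```python
import numpy as np

def pbox_envelope(outputs, grid):
    """Envelope an ensemble of MC predictions into p-box bounds on the CDF.

    outputs: (N_e, N_a) array, one row of aleatory output samples per
             epistemic sample.
    Returns (lower CDF, upper CDF) evaluated on `grid`.
    """
    cdfs = np.stack([
        np.searchsorted(np.sort(row), grid, side='right') / len(row)
        for row in outputs
    ])
    return cdfs.min(axis=0), cdfs.max(axis=0)
```

Each row yields one empirical CDF; the pointwise minimum and maximum over the ensemble give the lower and upper p-box bounds.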
In this work we generate samples of e according to a sliced-normal model of the posterior, see Section 5.4.1.

Interval Monte Carlo
Representing the aleatory space with a p-box presents challenges for sampling, which can be overcome with an interval Monte Carlo approach. The interval Monte Carlo method can be conveniently adopted if the input p-boxes are non-parametric, i.e., only upper and lower CDF bounds are available. It also relaxes the distributional assumption on $F_A$, e.g. the Beta assumption, allowing us to simulate samples from distributions in the p-box which may be non-Beta. This can be useful for checking the validity of the double loop MC results when a non-parametric, distribution-free model is adopted.
The interval Monte Carlo [114], also known as slicing method, alpha-cut method, or random set sampling [22] is a simulation procedure to propagate p-boxes. Unlike the double loop Monte-Carlo, the slicing method samples intervals from the real-line input p-box rather than point-valued realizations from a random distribution F A . Intervals are propagated through a black-box model by computing the minimum and the maximum of the output quantities of interest.
In order to illustrate the slicing method we will make use of two indices: $i = 1, \dots, n_a$ to index the $i$-th dimension of the aleatory vector, here denoted by $x$, and $j = 1, \dots, N_a$ to index the $j$-th simulated sample, i.e. a slice. The slicing method operates as follows:
1. Produce uniform correlated samples, $\alpha_j \in (0, 1]^{n_a}$, $j = 1, \dots, N_a$, from a copula function, $\alpha_j \sim C$.
2. Use the inverse CDFs of the bounds of the p-box to generate a slice. The slice of the joint correlated p-box is also called a focal element, and it is obtained with the following Cartesian product:
$$I_j = \prod_{i=1}^{n_a} \left[\, \overline{F}_i^{-1}(\alpha_{j,i}),\; \underline{F}_i^{-1}(\alpha_{j,i}) \,\right].$$
Notice that for the slicing method, a focal element $I$ is the analogue of an MC sample vector.
3. An inner approximation of the interval-valued response from the performance function $g : \mathbb{R}^{n_a} \to \mathbb{R}^{n_g}$ is computed as:
$$\left[\, \min_{x} g(x),\; \max_{x} g(x) \,\right], \quad \text{subject to } x \in I_j \subseteq \mathbb{R}^{n_a}, \quad j = 1, \dots, N_a.$$
Bounds on the failure probability can then be estimated from the output intervals using the indicator function $1(c)$, which is 1 if the condition $c$ is true and 0 otherwise. Different techniques can be used to approximate Eq. (11), for instance sampling methods, the vertex method, or global optimisation.
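For a one-dimensional aleatory input, the slicing loop can be sketched as below. Bounding each focal element by inner sampling is a cheap stand-in for the vertex method or global optimisation; the argument names are ours:

```python
import numpy as np

def propagate_slices(ppf_left, ppf_right, g, n_slices, n_inner=64, rng=None):
    """Interval MC for a 1-D p-box: sample alpha, cut a focal element,
    then bound g over it by inner sampling.

    ppf_left:  inverse of the upper bounding CDF (gives the left endpoint).
    ppf_right: inverse of the lower bounding CDF (gives the right endpoint).
    """
    rng = np.random.default_rng(rng)
    alphas = rng.uniform(size=n_slices)
    out = []
    for a in alphas:
        lo, hi = ppf_left(a), ppf_right(a)     # focal element [lo, hi]
        x = rng.uniform(lo, hi, size=n_inner)  # inner approximation by sampling
        gx = g(x)
        out.append((gx.min(), gx.max()))
    return np.array(out)                       # (n_slices, 2) output intervals
```

The resulting output intervals can then be passed through the indicator function to bound a failure probability.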

Sensitivity analysis
With the means of sampling defined, we can propagate the uncertainty in the input through the model. In many practical cases, it is important to understand the effect of different input uncertainties on the output. In this work, we utilised two different sensitivity analysis methods: variance-based and value-of-information. The theory behind these approaches can be found in Section 3.4. The motivation for utilising two distinct methods was to investigate the effect of different measures of sensitivity and to enhance the robustness of our conclusions.

Value-of-information sensitivity analysis
The value-of-information (VoI) sensitivity analysis aims to assess the possible reduction in uncertainty about the output if extra knowledge about the input were available [3]. This is usually done by comparing the uncertainty in the output before and after reducing the uncertainty in one or more inputs through a process called pinching. In this section we propose a VoI sensitivity analysis based on partial pinching. Our proposed partial pinching procedure replaces the original uncertain input set with a tighter subset.
This choice is motivated by the need to address the challenge problem, where directional refinements of the epistemic parameters are requested. A directional refinement for a single parameter consists of either: lowering the upper bound, or raising the lower bound of a model parameter.
We face two issues in addressing the NASA challenge uncertainty reduction problem.
1. The output of the system response is stochastic and highly multidimensional. 2. The system is a complex nonlinear black-box function.
The combination of these two issues makes the sensitivity study uniquely challenging within the literature. VoI sensitivity within the probability bounds analysis (PBA) paradigm is naturally equipped to deal with stochastic systems; however, additional research is required to address VoI sensitivity analysis with black-box models.
The VoI score for any uncertainty contraction can be computed with the following simple expression:
$$S = 1 - \frac{\mathrm{unc}\big([F]_R\big)}{\mathrm{unc}\big([F]_B\big)},$$
where $B$ is the baseline scenario, $R$ corresponds to the scenario with an input partially or totally pinched (or reduced), $[F]$ is the p-box system output, and $\mathrm{unc}(\cdot)$ is a measure of uncertainty of the output. The result of the expression in Eq. (13) is an estimate of the expected value any additional empirical information about the input might give, in terms of its corresponding reduction in uncertainty of the output. This score belongs to the $[0, 1]$ interval. When the pinching is applied to each input quantity in turn we obtain a relative rank: the highest ranked input is the largest contributor to the output uncertainty. The sum of the VoI scores for each input pinch, unlike the factorisation used in variance-based sensitivity analyses, does not generally add up to 1.
Because we are in the context of PBA, the obvious choice for the $\mathrm{unc}(\cdot)$ measure is the area between the upper and lower bounds of the output p-box. This measure is the previously described stochastic area metric (Wasserstein distance), now applied to the bounds of a p-box:
$$\mathrm{unc}\big([F]\big) = \int_{-\infty}^{\infty} \left| \overline{F}(x) - \underline{F}(x) \right| \mathrm{d}x.$$
In this work, the sensitivity analysis is conducted to understand how the input epistemic vector $e$ influences the output of the system functions $y, z_1$, and $z_2$. Because a sensitivity score is always computed by associating a reduction in the input space (epistemic vector) with a reduction in the output space (system response), we introduce the following notation to uniquely denote each sensitivity score.

$$S^{\mathrm{VoI}}_{e_i \to \gamma},$$
where $\gamma \in \{y, z_1, z_2\}$ is used as a placeholder symbol for the output.
Since the dimension of the output vectors $y, z_1, z_2 \in \mathbb{R}^{5000}$ is large, we analyse the sensitivities in the Fourier domain, where the dimensions can be reduced to about 60 harmonics. The combined sensitivity scores of Eq. (15) are obtained in the Fourier space by summing the area metric corresponding to each of the 60 harmonics. This procedure looks at the marginal output distributions individually, so it cannot capture the effect of dependence between the harmonics.
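The uncertainty measure of Eq. (14) and the VoI score of Eq. (13) can be sketched from tabulated p-box bounds as follows; the function names are ours, and the bounds are assumed to be evaluated on a common grid:

```python
import numpy as np

def unc(lower_cdf, upper_cdf, grid):
    """Area between the bounds of a p-box (stochastic area metric),
    by trapezoidal integration on a common grid."""
    gap = upper_cdf - lower_cdf
    return float(np.sum(0.5 * (gap[1:] + gap[:-1]) * np.diff(grid)))

def voi_score(baseline_bounds, pinched_bounds, grid):
    """VoI score: relative reduction in p-box area after pinching.
    Lies in [0, 1] when pinching can only tighten the output p-box."""
    return 1.0 - unc(*pinched_bounds, grid) / unc(*baseline_bounds, grid)
```

Each `bounds` argument is a `(lower_cdf, upper_cdf)` pair; applying `voi_score` to each input pinch in turn produces the relative rank described above.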

Numerical strategy to compute the VoI scores
As pointed out in [115], PBA is a form of global sensitivity analysis. However, the presence of a black-box code poses an additional challenge: it is not possible to propagate an interval vector of parameters while guaranteeing that the output is inclusion monotone. While the interval extension of the function representing the black-box may itself be monotone, this property cannot be ensured via sampling.
An interval function $f$ is inclusion monotone if for any two interval vectors $[x], [y] \subseteq \mathbb{R}^d$ it holds that
$$[x] \subseteq [y] \implies f([x]) \subseteq f([y]).$$
With sampling the above rule is violated, because coverage (the minimum and maximum are included in the bounds) is only ensured when the function is evaluated with interval arithmetic rules. Therefore, it is not possible to consistently compute sensitivity scores using PBA with sampling, potentially leading to negative scores. The more samples the function is evaluated on, the better the approximation. The use of a global optimiser to obtain the bounds on each focal element would also lead to a better approximation, although with enormous computational effort. To overcome this limitation, we devise a sampling strategy that emulates the inclusion property of Eq. (16). This strategy is based on a caching algorithm that stores the results of a large number of point valued input/output simulations. While the proposed strategy emulates the property of Eq. (16) of interval analysis, its bounding capability is still tied to sampling, so it is still an inner approximation. Convergence to the target bounds cannot be guaranteed in a black-box system; nevertheless an approximation can be computed with a significantly high number of samples, in the order of thousands. The strategy is most efficient when the epistemic uncertainty is predominant over the aleatory one, and gets less efficient as the focal elements get thinner. In other terms, this strategy works well when the p-boxes are fat as opposed to thin. The caching algorithm computes the bounds of each focal element on stored samples produced upfront; it emulates inclusion monotonicity and removes the need for re-sampling.
The algorithm implementing this strategy is summarised in Fig. 5. From left to right: (i) first, each p-box is discretised into a number of focal elements with equal mass, (ii) [lower] these are then combined by Cartesian product under random set independence, this results in the gray box shown over the blue scatter points, (iii) [higher] is the partial pinching (halving) of e 1 , (iv) the caching algorithm executes the propagation using the samples contained in each gray box [of (ii) and (iii)], and (v) finally, the output p-box for the baseline case (blue) and the pinched scenario (purple), corresponding to the lower halving of e 1 .

Variance-based sensitivity analysis
From the probabilistic perspective on uncertainty quantification, the model's output can be regarded as a random variable, $Y$, with a corresponding (joint) probability distribution. In scenarios where the variance of this random variable is regarded as a good measure of output uncertainty, the importance ranking of the inputs can be done via variance-based sensitivity analysis. We will denote the epistemic vector $e_{1:4}$ by $X$, and the output of the model by $Y$.

Fig. 5. Visual summary of the value-of-information sensitivity procedure with partial pinching and probability bounds analysis. Left to right: focal elements drawn from input p-boxes, caching algorithm for pinching, propagation through the system function to the output probability box, and value-of-information sensitivity score.
Since the models considered here are deterministic, the variance of $Y$ will be entirely due to the uncertainty in the input values. This leads to the notion that fixing one of the inputs, $X_i$, to a given value, $x_i$, and re-running the code results in $Y$ having a smaller variance. Let $\mathrm{Var}_{X_{\sim i}}(Y \mid X_i = x_i)$ denote the conditional variance of $Y$ taken over all factors but $X_i$ (denoted $X_{\sim i}$), given $X_i = x_i$. This conditional variance can be used as a measure of how influential the fixed parameter is. However, the conditional variance is also a function of $x_i$ and is of little practical use in determining input importance. This problem can be resolved by taking the expected value of the conditional variance over all possible values of $X_i$, $E_{X_i}\big(\mathrm{Var}_{X_{\sim i}}(Y \mid X_i)\big)$. Then, by the law of total variance, the first order effect of $X_i$ on $Y$ is
$$V_i = \mathrm{Var}_{X_i}\big(E_{X_{\sim i}}(Y \mid X_i)\big) = \mathrm{Var}(Y) - E_{X_i}\big(\mathrm{Var}_{X_{\sim i}}(Y \mid X_i)\big).$$
To improve interpretability, the first order effect can be normalised by the unconditional variance of the output, to give what is known as the first order Sobol' index [88]:
$$S_i = \frac{V_i}{\mathrm{Var}(Y)}.$$
A high value of the Sobol' index for a given variable means that its contribution to the output variance, without regard to any interactions with other inputs, is important.
Often, it is desirable to inspect the importance of interactions among inputs. In such cases, the higher order Sobol' indices can be constructed in much the same way as the first order ones. Let $p \subseteq \{1, \dots, n_e\}$ be a set of all inputs under investigation. Then, the $|p|$-th order index is given by
$$S_p = \frac{\mathrm{Var}_{X_p}\big(E_{X_{\sim p}}(Y \mid X_p)\big)}{\mathrm{Var}(Y)},$$
where $|\cdot|$ denotes the cardinality of a set. For a model with $d$ inputs, it is easy to show that there are $2^d - 1$ possible combinations for $p$, too many to be considered one by one, even for moderately sized $d$. To summarise the effect of interactions, one can compute the total Sobol' index [116]:
$$S_{T_i} = 1 - \frac{\mathrm{Var}_{X_{\sim i}}\big(E_{X_i}(Y \mid X_{\sim i})\big)}{\mathrm{Var}(Y)}.$$
This metric captures the effect of the $i$-th input and all of its interactions. An efficient way to compute Sobol' indices is discussed at length in [89].
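As a sketch of such an efficient scheme, the pick-freeze (Saltelli/Jansen) estimators for independent $U(0, 1)$ inputs can be written as follows; this is one common Monte Carlo estimator, not necessarily the one used in [89]:

```python
import numpy as np

def sobol_indices(f, d, n, rng=None):
    """First-order and total Sobol' indices via pick-freeze estimators,
    for a model f with d independent U(0, 1) inputs."""
    rng = np.random.default_rng(rng)
    A = rng.uniform(size=(n, d))
    B = rng.uniform(size=(n, d))
    fA, fB = f(A), f(B)
    var = np.var(np.concatenate([fA, fB]))
    S, ST = np.empty(d), np.empty(d)
    for i in range(d):
        ABi = A.copy()
        ABi[:, i] = B[:, i]                       # replace only column i
        fABi = f(ABi)
        S[i] = np.mean(fB * (fABi - fA)) / var    # Saltelli first-order
        ST[i] = 0.5 * np.mean((fA - fABi)**2) / var  # Jansen total
    return S, ST
```

For the additive test function $Y = X_1 + 2X_2$, the exact indices are $S_1 = 0.2$ and $S_2 = 0.8$, with $S_{T_i} = S_i$ since there are no interactions.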
For a model with independent inputs, the sensitivity indices provide a decomposition of the variance of the model's output. That is, they obey
$$\sum_{p} S_p = 1,$$
where the sum runs over the terms of the variance decomposition. In such models it also holds that $0 < S_i \leq S_{T_i} < 1$, where equality can only arise in the case of a perfectly additive model. When the inputs are correlated, the results of the decomposition in Eq. (20) may include contributions from inputs that are outside of those currently analysed. In this case, the alternative decomposition introduced in [117] can be used. In this decomposition, which is a generalisation of that in Eq. (20), the covariance among inputs is taken into account. The reader is referred to [118] for further discussion of variance-based SA with correlated inputs.
For the case where the output is a multidimensional variable in the time domain, we can estimate the sensitivity indices for each time step and then compute a mean value and a corresponding standard error of an aggregated sensitivity metric across time.
Standard error is used to produce error bars on each total/local sensitivity index. This provides a sense of the amount of variability in the value of each index across the time dimension.
A. Gray et al.

Reliability analysis and reliability-based design optimisation
Calibration, propagation and sensitivity analysis all allow for a more accurate characterisation of the uncertainty model, motivated by a desire to use this information to improve design decisions and prototyping. To do so, we use reliability analysis and an iterative design optimisation process. Both of these require assessment of the reliability performance of a design, and this framework provides a means of doing so in the presence of a mixture of both aleatory and epistemic uncertainty.
The reliability of a design $\theta$ with respect to a set of requirements $g_1, \dots, g_{n_g}$ can be assessed using the worst-case reliability function:
$$w(a, e, \theta) = \max_{j = 1, \dots, n_g} g_j(a, e, \theta),$$
where $g_j : \mathbb{R}^{n_\theta} \times \mathbb{R}^{n_a + n_e} \to \mathbb{R}$ are the reliability functions describing the compliance of the system with respect to requirement $j = 1, \dots, n_g$. A design $\theta$ satisfies all requirements if $w(a, e, \theta) < 0$, and requirement $j$ is satisfied if $g_j(a, e, \theta) < 0$. If $e$ is known and only aleatory uncertainty is affecting the study, the failure probability $P_f(\theta)$ of a design gives the precise probability that $\theta$ fails to comply with the reliability requirements. Traditionally, $P_f(\theta)$ is defined as a multidimensional integral of the probability density of $a$ computed over the failure domain. If both epistemic and aleatory uncertainty affect the study, a generalised expression for the joint failure probability is given by
$$P_f(e, \theta) = P\big[w(a, e, \theta) \geq 0\big],$$
and the individual failure probabilities are defined as follows:
$$P_{f,j}(e, \theta) = P\big[g_j(a, e, \theta) \geq 0\big],$$
where the probability of failure is now a function of $e$. Note that to estimate $P[\cdot]$ as a precise probability, the distribution of the aleatory variables and the value of $e$ must be fixed. If $e$ is only known to be contained in some set, or the distribution of the aleatory variables is imprecise, the probability of failure will be an interval. A sample-based estimate of $P_f(e, \theta)$, obtained from a set of $n$ samples $a_1, a_2, \dots, a_n$ from a precise $F_A(a; e_{5:24})$, is given by
$$\widehat{P}_f(e, \theta) = \frac{1}{n} \sum_{k=1}^{n} 1\big(w(a_k, e, \theta) \geq 0\big),$$
where $1(\cdot)$ is the indicator function. We also make use of $\sigma(e, \theta)$, the conditional expectation of $w(a, e, \theta)$ over the failure domain, i.e., the average magnitude of the violations. For a presentation of sample-based approximation of $\sigma(e, \theta)$ and a discussion of confidence bounds and the computational issues involved, see, e.g., [119].
Note that the model of the uncertainty F_A is a key component in the procedure, and a large sample size is generally required to improve the accuracy and convergence of the failure probability and tail expectation estimates. In addition, if epistemic uncertainty is present, a precise estimate of the reliability cannot be obtained. To account for this, epistemic bounds on the joint and individual failure probabilities of a proposed design θ are computed as

R_w(θ) = [min_{e∈E} P_f(e, θ), max_{e∈E} P_f(e, θ)], R_j(θ) = [min_{e∈E} P_{f,j}(e, θ), max_{e∈E} P_{f,j}(e, θ)].

If the resulting probability interval is vacuous, i.e., R_j(θ) = [0, 1], j = 1, …, n_g, the epistemic uncertainty in the problem prevents us from answering any questions about the reliability of the design θ.
Similarly, the severity score of the integrated system for all requirements, s_w(θ), and for the individual requirements, s_j(θ), are computed as

s_w(θ) = max_{e∈E} σ(e, θ) × P_f(e, θ), s_j(θ) = max_{e∈E} σ_j(e, θ) × P_{f,j}(e, θ), j = 1, 2, …, n_g.
Note that the product σ(e, θ) × P_f(e, θ) can be regarded as a classical measure of the risk of the design θ for a given e ∈ E; that is, it quantifies not only the probability of incurring a failure but also the expected magnitude of that failure, given by σ(e, θ). Hence, the severity score s_w(θ) can be interpreted as the worst-case risk computed over all possible values of the epistemic parameters e ∈ E. While the width of R_j indicates the degree of epistemic uncertainty, s_j is unbounded from above. Increasing the epistemic uncertainty in a model will only increase s_j, allowing for relative comparisons, but a single severity score does not have a clear interpretation on its own.
Two levels of approximation jointly affect the estimates of R_w(θ) and s(θ): i) the sample-based approximation of P[⋅] and E[⋅]; and ii) the approximation of the minimum and maximum operators over the epistemic space E.
The problem in ii) is aggravated by the presence of strong non-linearities in g_j, P_f(e, θ) and σ(e, θ) which, combined with poor coverage of the epistemic domain, lead to inner approximations of the bounds of these operators. The accuracy of these inner approximations can be improved by the methods outlined in Section 5.4.1 and Section 5.5. These numerical issues further complicate reliability-based design in the presence of mixed aleatory-epistemic uncertainties. In this work, we seek reliable designs by solving the following optimisation programs:

θ_(30) = argmin_{θ∈Θ} max_{e∈E} P_f(e, θ), (30)
θ_(31) = argmin_{θ∈Θ} s_w(θ). (31)

The optimisation program in Eq. (30) minimizes the right epistemic bound of R_w(θ), whilst the one in Eq. (31) minimizes the severity for all requirements. These programs are non-convex and highly discontinuous in Θ, which precludes the use of gradient-based optimisation methods. The approach we adopted instead is based on repeated runs of a genetic algorithm (GA). The fittest candidates in the final θ populations for the two programs were pooled together and ranked according to the resulting reliability and severity scores, given by Eqs. (26)-(29). Note that, although samples are very cheap to produce thanks to the SN approach, the computational cost needed to estimate the severity and the epistemic bounds on the reliability of a design can be very high. This problem is particularly serious for small failure probabilities and numerically expensive reliability functions g. In these situations, decoupled RBDO approaches, numerically efficient methods for the estimation of low failure probabilities, and multi-level high-fidelity models can be considered to reduce the computational cost of the RBDO.
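A minimal sketch of this min-max design search follows. It uses SciPy's differential evolution, a related population-based global optimiser, as a stand-in for the GA; the requirement function, design bounds and sample sets are illustrative assumptions, not the challenge models.

```python
import numpy as np
from scipy.optimize import differential_evolution

rng = np.random.default_rng(1)
E = rng.uniform(0.0, 1.0, size=(50, 2))       # samples covering the epistemic set E
A = rng.normal(0.0, 1.0, size=(2_000, 2))     # aleatory samples from a mock F_A

def pf_hat(e, theta):
    # Toy worst-case requirement: w = max_j g_j, failure when w >= 0.
    w = np.maximum(A[:, 0] + e[0] - theta[0], A[:, 1] * (1.0 + e[1]) - theta[1])
    return np.mean(w >= 0.0)

def upper_bound_pf(theta):
    # Inner maximisation over E approximates the right bound of R_w(theta).
    return max(pf_hat(e, theta) for e in E)

# Population-based global search over the design hyper-box Theta.
res = differential_evolution(upper_bound_pf, bounds=[(0.0, 5.0)] * 2,
                             seed=2, maxiter=10, tol=1e-3)
theta_star, worst_pf = res.x, res.fun
```

The same double loop, with σ(e, θ) × P_f(e, θ) as the inner objective, would target the severity program instead.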

Risk-based design optimisation
Traditionally, risk-based design optimisation methods seek system designs by minimizing a combination of probability and severity of failures. In this work, we look at risk-based design from a new perspective. A risk-based design θ_risk is a design that only accounts for a portion of the remaining epistemic space. A risk parameter r ∈ [0, 100] quantifies a reduction in epistemic volume. For example, a risk r = 10 means that we only retain 90% of the volume of the epistemic space. We will refer to the retained volume as E(r). Clearly, we have that E(r) ⊂ E(0) = E for any r > 0. The risk-based design θ_risk must be compared to the optimal RBDO design (θ_final), and the resulting reliability scores must be used to either adopt or reject a risk-based design solution. For a more detailed description of the problem refer to the NASA challenge [6, Problem F]. In this work, we investigate two metrics, l_{P_f}(r, θ) and l_s(r, θ), to quantify the gain from taking a risk r. Each is defined as the ratio of the worst-case failure probability (respectively, severity) of the best available design θ_final computed over E to that of a candidate θ computed over the reduced space E(r); they therefore measure the improvement in reliability (or severity) achieved by θ when a risk r is taken. A value l_{P_f}(r, θ) < 1 means that the worst-case reliability of a candidate design θ is lower than that of the final design computed over E.
Similarly, l_s > 1 implies a reduction in severity with respect to θ_final. Given a risk r, a risk-based design θ_risk results in superior reliability compared to θ_final if l_{P_f}(r, θ_risk) > l_{P_f}(r, θ_final) or, in words, if the maximum failure probability of the risk-based design computed over the risk-taking epistemic space E(r) is lower than that of the final design computed over the same epistemic volume. A risk-based design results in superior performance with respect to the severity score if l_s(r, θ_risk) > l_s(r, θ_final).
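The reliability gain metric can be sketched as below. This is a sketch under stated assumptions: the failure-probability surrogate is hypothetical, the gain is taken as the worst-case-over-E failure probability of θ_final divided by the worst case of the candidate over E(r), and the retention rule (keep the epistemic samples with the lowest candidate P_f) is one possible way of choosing the ignored portion of E.

```python
import numpy as np

rng = np.random.default_rng(3)
E = rng.uniform(0.0, 1.0, size=(5_000, 2))    # samples covering the epistemic set E

def pf_hat(e, theta):
    # Hypothetical failure-probability surrogate for a scalar design theta.
    return 0.05 + 0.4 * e[0] * e[1] / (1.0 + theta)

def l_pf(r, theta, theta_final):
    # Retain the (100 - r)% of epistemic samples with the lowest P_f for the
    # candidate -- one assumed rule for choosing the ignored portion of E.
    pf_cand = np.array([pf_hat(e, theta) for e in E])
    worst_cand = pf_cand[pf_cand <= np.percentile(pf_cand, 100.0 - r)].max()
    worst_final = max(pf_hat(e, theta_final) for e in E)
    return worst_final / worst_cand
```

For r = 0 the metric reduces to 1 when θ = θ_final, and it grows as more of the adverse epistemic volume is discarded.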

The 2020 NASA black-box challenge
In this section, it will be assumed that the reader is familiar with the NASA challenge [6]. A distinction between aleatory and epistemic uncertainty is made in the challenge statement (see Section 2), thus the two kinds will be treated differently. Such a distinction carries strong practical implications for the solution strategy. In this section we implement the solution strategy described in Section 5 to address the NASA challenge. Fig. 6 shows a visual representation of the NASA challenge for reference. The relation of the different system models, their inputs, and interactions are shown in the context of each of the problem tasks presented in Section 2. Here, we describe the solution sequence in corresponding order of these problem tasks.

Problem A: Model calibration & uncertainty quantification
Problem A of the challenge is to calibrate the model's inputs against the limited experimental data provided. No aleatory distribution is given, so the beta model described in Section 5.2 is adopted to attempt to envelop the true distribution F_A. Therefore, a calibrated uncertainty model will constitute a set within E describing a five-dimensional probability box, along with a reduced volume for the four epistemic inputs in [0, 2]^4. The calibration data provided represent 100 realisations of a discrete-time stochastic process. The provided data are neither Gaussian nor stationary. In this section, the calibration of UM is performed against the first set of data, D_1. From our perspective, D_2 was unavailable to us until the initial design decision had been made based on UM_y; this mimics the experience of a real-world design process.
As previously described, the problem of calibrating F A and the four-dimensional epistemic box was overcome by selecting the parametric distribution Eq. (3), and updating its parameters along with e 1:4 . Under this parametrisation 20 additional dimensions are added to the original four epistemic inputs in the updating procedure. A summary of these parameters and their meaning is given in Table 2.

Uncertainty calibration
Under the parametrisation presented previously, an approximate Bayesian calibration technique, as described in Section 5.3, may be performed over E. This method requires a prior P(e) and a likelihood P(D|e) to be defined. The prior was selected to be a uniform distribution bounded to the ranges given in Table 2. The likelihood was estimated using the procedure described in Section 5.3.1.
The first calibration procedure was performed over D_1 and defines UM_y. This calibration involves drawing realisations of e to produce distributions F_A from which samples can be generated. These were then propagated through the system model y and compared against the available data D_1, as shown in Fig. 7.
Calculating the likelihood over the time domain would require 5001 evaluations of the integral in Eq. (7), and would add significant cost to the Bayesian updating. To reduce the number of comparisons that must be made, this procedure may be carried out in the Fourier domain. Of the entire spectrum of harmonics, we determined that the output signal was well represented by the first 30; these are shown in Fig. 8. Therefore, if a Fourier transformation is performed on the experimental data and on the predictive distribution, only 60 (30 real and 30 imaginary) evaluations of Eqs. (6) and (7) are required.
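The reduction from 5001 pointwise comparisons to 60 Fourier features can be sketched as follows; the mock signal, noise level and sample counts are illustrative assumptions, not the challenge data.

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.linspace(0.0, 5.0, 5001)                 # 5001 time points, as in the text
data = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal((100, t.size))  # mock D_1

K = 30                                          # harmonics retained
F = np.fft.rfft(data, axis=1)[:, :K]            # first K complex coefficients
features = np.hstack([F.real, F.imag])          # 60 real comparisons per signal

# Sanity check: a signal rebuilt from only K harmonics stays close to the truth.
full = np.zeros((data.shape[0], t.size // 2 + 1), dtype=complex)
full[:, :K] = F
recon = np.fft.irfft(full, n=t.size, axis=1)
rmse = np.sqrt(np.mean((recon - np.sin(2 * np.pi * t)) ** 2))
```

The likelihood comparisons of Eqs. (6) and (7) are then made column-wise on `features` rather than on the full time series.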
With a prior and a likelihood now defined, we generate 520 samples of the posterior by TMCMC. Fig. 9 shows the calibrated model output against D_1. The output of these samples produces a p-box at each point in time, shown in Fig. 10. It can be seen that the tails of the data are captured by most, but not all, of the predictive p-boxes. There could be several reasons for this. The first, and we believe the main reason, is the very high dimensionality of the inputs e. Because the limited number of posterior samples being propagated do not fill the 24-dimensional space very densely, the bounds of the second-order distribution may not have converged. This is one of the main drawbacks of using second-order Monte Carlo as a propagation method: a very large number of samples is needed in high dimensions. We have, however, proposed a cheap method for generating a large number of samples of this 24-dimensional posterior; see Section 5.4.1. The second reason may be that our choice of stochastic metric in Eq. (7) is not very sensitive to the tails of the distribution [109]. If a metric which gives higher weight to the tails of distributions were selected, for example a higher-order Wasserstein distance, the Anderson-Darling distance, or the Kolmogorov-Smirnov distance, we believe this could be improved. This would, however, introduce further computational complexity. The calibration procedure for the unknown probability distribution F_A results in a five-dimensional p-box, which envelopes the true distribution. Fig. 11 shows this calibrated p-box, with marginal p-boxes shown along the diagonal panels and pairwise bivariate p-boxes in the off-diagonal panels. A multivariate p-box is an extension of p-boxes to higher dimensions, and bounds a set of multivariate CDFs. It can be seen that the marginal p-box for a_5 is very wide and close to an interval; therefore, according to UM_y, the marginal for a_5 can be nearly any distribution in [0, 2]. In contrast, the probability distribution for the variable a_3 is much narrower. This result suggests a higher importance of the variable a_3 when prescribing an uncertainty model for the data y_t.

[Fig. 9. Realisations of the calibrated model UM_y (blue) against data from D_1 (red). All of the samples of the second-order distribution are plotted simultaneously.]
[Fig. 10. Comparison of the predictive p-box (blue) from the calibrated model depicted in Fig. 9 to the empirical CDF of the experimental data (red) at various times.]

Sliced-normal approximation of posterior for propagation
Due to the cost of sampling the posterior, which required expensive MCMC, a means of efficiently generating additional samples is required to reduce sample-based approximation errors. We thus use a sliced-normal distribution as an approximation of the posterior, based on the 520 TMCMC samples. The density of the sliced normal is less expensive to evaluate than the posterior, allowing us to efficiently draw more samples. The sliced normal was fit to the 24-dimensional posterior with two degrees of freedom, producing a feature-space expansion into 324 dimensions. Generating 10^4 sliced-normal samples takes roughly 90 min when parallelised on a 4-core personal computer. By comparison, generating the 520 posterior samples took approximately two days on a 40-core HPC machine.
The suitability of this approximation depends on its ability to appropriately represent the posterior density. For the two feature-space methods (maximum likelihood and worst-case) of calculating the sliced-normal parameters presented in [5], we evaluated the ability of the corresponding sliced normals to appropriately assign relative likelihoods to the Bayesian posterior samples through leave-one-out validation. This was assessed by calculating a possibilistic transform of the sliced normal, and evaluating the alpha cut required to contain the held-out sample. The cumulative distribution of these alpha levels should dominate a unit uniform distribution, indicating that the sliced normal is correctly ordering the samples by relative likelihood. Fig. 12 depicts the coverage properties for possibilistic structures based on the two methods on the left, with a representation of the alpha cut sets generated in a two-dimensional space shown on the right. The coverage comparison was performed over the four epistemic dimensions only, due to the computational cost of the worst-case method. It can be seen that both methods have similar predictive capability. The right side of the figure indicates that the confidence level sets for this data are not unique, depending on the approach used to generate the sliced normal. Since the worst-case method is significantly cheaper to calculate, we adopted this approach for generating further posterior samples.

[Fig. 11. Representation of the five-dimensional p-box for F_A. The marginal p-boxes are shown on the diagonal panels, with off-diagonal panels showing bivariate p-boxes for parameter pairs. Upper bounds for the bivariate p-boxes are shown in transparent blue, with the lower bound in red.]
A. Gray et al.
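The essence of the sliced-normal construction can be sketched in two dimensions: a degree-2 monomial feature map, a Gaussian fitted in feature space, and cheap extra samples drawn by rejection. Everything here (the synthetic "posterior", the dimensions, the proposal box) is an illustrative assumption, not the 24-dimensional implementation used in the paper.

```python
import numpy as np

rng = np.random.default_rng(5)
# Stand-in for the 520 TMCMC posterior samples (2-D here for brevity).
post = rng.multivariate_normal([0.5, 0.5], [[0.02, 0.015], [0.015, 0.02]], size=520)

def features(e):
    # Degree-2 monomial expansion Z(e); a sliced normal is Gaussian in Z.
    e1, e2 = e[..., 0], e[..., 1]
    return np.stack([e1, e2, e1 * e1, e2 * e2, e1 * e2], axis=-1)

# Fit the feature-space mean and covariance from the posterior samples.
Z = features(post)
mu = Z.mean(axis=0)
prec = np.linalg.inv(np.cov(Z, rowvar=False) + 1e-9 * np.eye(5))

def log_density(e):
    # Unnormalised sliced-normal log-density; cheap to evaluate.
    d = features(e) - mu
    return -0.5 * np.einsum('...i,ij,...j', d, prec, d)

# Cheap extra samples by rejection from a uniform proposal over the box.
cand = rng.uniform(0.0, 1.0, size=(200_000, 2))
logp = log_density(cand)
extra = cand[np.log(rng.uniform(size=len(cand))) < logp - logp.max()]
```

The accepted points in `extra` play the role of the 10^4 additional posterior samples used for propagation.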

Geometry of epistemic space
The sliced-normal distribution can be used to define level sets on the expanded epistemic space. For visualisation purposes, these sets are calculated for pairs of epistemic parameters to highlight the ability of sliced normals to capture a variety of dependence structures. A representation of the resulting geometry of the epistemic space is shown at defined confidence levels with contours in Fig. 13. Similar sets can be produced for the full 24-dimensional space, though the data sparsity and the projection of a 24-dimensional set onto two dimensions make visualisation difficult (see Fig. 14).

Problem B: Uncertainty reduction
In this section we describe the path taken to update the uncertainty model with the aim of improving its predictive capability. The task is divided in two stages.
• The first stage is to rank the epistemic inputs by their importance, see Section 5.6.
• The second stage is to identify a set of changes, to reduce the bounds of the epistemic inputs, such that the expected reduction in the output uncertainty is as large as possible.
The tasks of ranking the epistemic parameters according to their effect on the predictive capability of the computational model, and of determining those whose reduction will increase the predictive capability of the model, are challenging. They cannot be accomplished using naïve implementations of variance-based sensitivity analysis. Therefore, we describe two different approaches to evaluate these rankings: one using a modified variance-based sensitivity analysis approach described in Section 5.6.3, the other using a value-of-information approach described in Section 5.6.1. Different aspects of variance-based and value-of-information sensitivity analysis are highlighted which make them suitable for tackling the problem at hand. Although the two methods differ philosophically and computationally, they are shown to achieve similar results.

Ranking via variance-based sensitivity analysis
As mentioned in Section 5.6.3, variance-based SA (VBSA) is an appropriate ranking approach whenever the variance of the model output is the primary measure of its uncertainty.
In this procedure the epistemic parameters are the only ones to be ranked, and they are given uniform distributions in UM_0. We interpret the variance of these uniform distributions as a metric for their uncertainty. By re-framing the interpretation in this way, we are able to carry out the VBSA on the (n_a + n_e)-dimensional input space directly. This is in contrast to the full uncertainty model, described in Section 6.1, where a more expressive, p-box, model of the uncertainty is used. We carry out the VBSA in both the original and calibrated input spaces and compare the results below. For the prior-space analysis, n = 5000 Latin hypercube (LH) points were sampled from the marginal distributions of each of the n_a + n_e = 9 inputs. First-order and total sensitivity indices were obtained at each time step, and the results were averaged across the entire time window. It is important to note that here, as in Section 6.1.1, a simplifying assumption is made about the time dimension of the model, in that separate time steps are considered independent (see [120] for a discussion of time-dependent approaches to VBSA). The results are summarised on the left of Table 3.

[Fig. 14. Refinement e_2^+ for harmonics 6, 18, 19.]
We used a similar procedure for the sensitivity analysis with respect to the calibrated uncertainty model, UM_y. In order to obtain input combinations in the constrained domain, the TMCMC samples from the updating procedure were used to compute the bounds of each parameter. A new set of n = 5000 LH samples was obtained within these bounds, and the first-order and total indices were recomputed. The results are shown on the right in Table 3. For both prior and posterior analyses, the aleatory variables were left free to vary within their ranges, so as to allow any interactions between the aleatory and epistemic sets of variables to manifest themselves in the total sensitivity indices.
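The first-order and total indices can be estimated with a standard pick-freeze scheme; the estimators below (Saltelli et al. 2010 for first-order, Jansen for total) are standard, but the four-input toy model standing in for the black box is an assumption.

```python
import numpy as np

rng = np.random.default_rng(6)

def model(x):
    # Toy surrogate with an interaction term, standing in for the black box.
    return x[:, 0] + 2.0 * x[:, 1] * x[:, 2] + 0.5 * x[:, 3]

n, d = 20_000, 4
A_mat = rng.uniform(0.0, 1.0, size=(n, d))
B_mat = rng.uniform(0.0, 1.0, size=(n, d))
fA, fB = model(A_mat), model(B_mat)
var_y = np.var(np.concatenate([fA, fB]))

S1, ST = np.empty(d), np.empty(d)
for i in range(d):
    ABi = A_mat.copy()
    ABi[:, i] = B_mat[:, i]                     # pick column i from B, freeze the rest
    fABi = model(ABi)
    S1[i] = np.mean(fB * (fABi - fA)) / var_y   # first-order (Saltelli et al. 2010)
    ST[i] = 0.5 * np.mean((fA - fABi) ** 2) / var_y  # total effect (Jansen)
```

A gap between ST[i] and S1[i], as for the interacting inputs here, is the signature of the interaction effects discussed below.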
Under both UM_0 and UM_y, the second epistemic input, e_2, was identified as the most important, followed by e_1, e_3 and e_4 in that order. From the values of the total indices in Table 3, it becomes clear that the model output is sensitive to interactions between e_2 and one or more other inputs. In the present, variance-centric setting, these results suggest that e_2 is the epistemic parameter with the highest potential to improve the predictive capability of the model.

Ranking via value-of-information sensitivity analysis
The ranking of the epistemic parameters is done by scoring each of the 8 possible refinements {e_i^−, e_i^+}, i = 1:4, where e_i^− denotes a request for the lower bound of e_i to be increased, whilst e_i^+ denotes a request for the upper bound of e_i to be decreased. For UM_0: e_i^− = [0, 1] and e_i^+ = [1, 2]. The ranking is done by comparing the amount of epistemic uncertainty in the output y_t before and after the refinement, while simultaneously checking that the data D_1 still fall within the bounds. The comparison is enabled by the area metric of Eq. (14), which quantifies the area contained between two distributions or datasets.
For the purpose of the sensitivity analysis, the calibrated uncertainty model of Problem A, consisting of correlated second-order probability distributions, is converted to its equivalent bounded counterpart: the 24-dimensional epistemic space is boxed in a hyper-rectangle containing all the samples of the updated posterior. This allows us to treat the calibrated inputs as probability boxes for the sake of the uncertainty propagation. When the area metric, Eq. (14), is applied to the two bounding distributions of a p-box, the resulting area quantifies the amount of epistemic uncertainty carried by the p-box. After a refinement, the area of the p-box shrinks (or does not change), reflecting the fact that less epistemic uncertainty is projected to the output. Denoting by [F_Y]_0 and [F_Y]_{e_i^*} the p-box of the output before and after the refinement, respectively, their corresponding area metrics W_0 and W_{e_i^*} are used to compute the sensitivity scores as the relative reduction in area, S_{e_i^*} = (W_0 − W_{e_i^*})/W_0. In order for this computation to succeed, the p-boxes [F_Y]_{e_i^*} need to be fully nested in [F_Y]_0. When this does not hold, negative sensitivity scores can appear. When the uncertainty propagation is not rigorous, it is not obvious how to avoid this problem. A meta-model strategy has been developed within this work to ensure that the nesting always occurs, as described in Section 5.6.1 (see Table 4).
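A minimal sketch of the area metric and a relative-reduction score follows, assuming the score is the fractional shrinkage of the p-box area; the bounding distributions here are synthetic stand-ins for the propagated output bounds.

```python
import numpy as np

def area_metric(x, y):
    # Area between the empirical CDFs of two equal-size samples
    # (the 1-Wasserstein distance between them).
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

rng = np.random.default_rng(7)
lower = rng.normal(0.0, 1.0, 10_000)          # lower bounding distribution
upper = lower + 1.0                            # upper bound of a mock p-box

W0 = area_metric(lower, upper)                 # epistemic width before refinement
W_ref = area_metric(lower, lower + 0.4)        # width after a (nested) refinement
S = (W0 - W_ref) / W0                          # relative-reduction score
```

Because the refined p-box here is nested in the original by construction, the score is non-negative; a non-nested refinement would make W_ref exceed W0 in places and could drive S negative, as discussed above.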
Since the model of the subsystem has a time-varying output, the VoI scores S_{e_i} were computed in the reduced Fourier space, looking at the first 30 harmonics. A summary of the ranking is provided in Table 5. As in the VBSA analysis, in Table 5, UM_0 and UM_y are, respectively, the UMs before and after calibration.
Both the VoI and variance-based methods largely agree on the top-ranked parameter, with e_2 carrying the highest sensitivity. A slight difference between the two methods remains in the ranking of parameters e_3 and e_1, but both are still ranked above e_4.

Table 3
Main and total effect indices for the epistemic inputs before, (UM 0 ), and after, (UM y ), calibration. Inputs ranked by their importance in descending order.

Parameter refinement
The purpose of ranking the epistemic parameters by their influence on the predictive capability of the model is to guide the choice of refinements. As part of the NASA challenge, the hosts offered up to 4 refinements to the bounds of the epistemic parameters. As outlined in Section 2, there are 2 possible updates for the i-th epistemic parameter, namely increasing its lower bound (denoted e_i^−) and/or decreasing its upper bound (denoted e_i^+); thus we had 8 possible options to choose from. Having a smaller budget than the possible selection of actions is common in real-world applications, where the analyst is likely to have constraints on the computational or experimental budget. Here, based on the two ranking approaches discussed earlier, we show how each of them can be used to inform the choice of refinements.

Parameter refinement via variance-based analysis
In order to choose the parameters and bounds to refine, we conducted two sets of sensitivity analyses. In the first set, the epistemic parameters had their upper bounds lowered one at a time; in the second set, each lower bound was raised, again one parameter at a time. The main and total effects of these analyses were then compared to their UM_y counterparts, and the parameter-bound combination with the largest absolute difference in the indices was selected for refinement. The aleatory variables were left unchanged throughout the analyses, to enable effects due to variability in A to propagate to the final results.
The motivation behind the above approach hinges on the fact that, in variance-based sensitivity analysis, a variable can have a Sobol' index equal to 0 in one of two ways: either the variable is not included in the model (the so-called dummy variable), or it has no variation, i.e. it is a fixed parameter. Thus, if the interval for each epistemic parameter were reduced to a single number, its Sobol' index would become 0. For this reason, one would expect to observe a noticeable change in the values of the sensitivity indices of important parameter-bound combinations.
For each analysis, the respective bound was moved to the median value of the parameter's range. This particular point was chosen as no information was provided on how much the bounds would be shrunk, if requested. To escape the median assumption, one could carry out a sensitivity study on the bounds themselves, at an increased computational and post-processing expense. Such an analysis would yield highly correlated variates which need to be accounted for in the interpretation of the results. We compared the results of the sensitivity study against a straightforward variance-reduction analysis, in which the average output variance across the time domain was calculated with and without bound reduction. The analyses showed a clear preference for e_2 to be refined, with a subtle but consistent advantage of e_2^− over e_2^+. A summary of the results is shown in Table 5.

[Table 5. VoI sensitivity scores for each refinement.]

Parameter refinement through value-of-information
The parameter refinements are scored as in Eq. (34), by computing the relative amount of reduced uncertainty that each refinement carries compared to the total. In Eq. (34), where S_{e_i^−} + S_{e_i^+} = S_{e_i}, each partial score S_{e_i^*} is weighed by its importance towards the whole score S_{e_i}. The results are summarised in Table 5.
Out of the 8 possible refinements, and with a maximum of k ≤ 4, we have requested one reduction for this problem. The VoI and conditional variance-based analyses for parameter refinement largely agree on the important refinements to be made. Following the requested parameter refinement, an additional iteration of the updating procedure was conducted. The new outputs of the post-refinement calibrated model, UM_{y−2}, are shown for a number of output time points in Fig. 16 and for the entire process in Fig. 15. The improvement, i.e. the reduction in the epistemic uncertainty of the output, is obvious.

Problem C: Reliability analysis
In this section we seek to evaluate the reliability of a baseline design θ_base according to the current UM. This task can be divided into the following sub-tasks:
• evaluate the range of the failure probability for the joint requirement and each individual requirement, Eqs. (26)-(27);
• rank the epistemic uncertainties according to the contraction of R_w(θ) that might result from their reduction;
• identify representative points (a, e) with a comparatively large likelihood near the failure domain, using (a, e) to characterize qualitatively different transitions to failure, and the corresponding time responses of the integrated system; and
• evaluate the severity of each individual requirement violation and of all violations.
The challenge of the reliability assessment is to balance the computational cost with the precision of the reliability estimates. The severity estimators are very sensitive to a lack of samples. It is also difficult to obtain precise lower failure-probability bounds, due to the epistemic uncertainty in the tails of our calibrated model. We question the interpretation of small failure probabilities (< 10^−2) yielded by advanced Monte Carlo methods such as importance sampling. Given the paucity of data, n_1 = 100, and the explicit accounting for the epistemic uncertainties, it is difficult to confidently obtain such small failure probabilities.

[Fig. 16. Calibrated p-boxes at various times in the process. The second-order samples from the Bayesian calibration are in blue, with the bounds from the sliced normal in yellow, and the experimental data in red. It can be seen that the extra samples from the sliced normal lead to a more converged p-box bound, and a better fit to the data.]

Range of failure probabilities for individual and all requirements
The range of failure probability R_w(θ), and those for each individual requirement g_{1:3}, for the baseline design θ_base are estimated via double-loop MC sampling as described in Section 5.4. Table 6 presents (first row) the results of this analysis with respect to R_w(θ), R_j(θ), s_w(θ) and s_j(θ). The reliability performance of θ_base is estimated using N_e = 10^4 epistemic samples from the sliced normal of UM_{y−2} and N_a = 200 aleatory samples from the corresponding realizations of the joint probability distributions F_A. The most uncertain reliability scores for θ_base are with respect to the settling time [0, 0.99], stability [0, 0.67], and energy consumption [0, 0.27]. Note that the failure-probability bounds for the joint requirement, R_w(θ), are poorly informative due to the large epistemic uncertainty in UM_{y−2}.
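The double-loop estimate of the failure-probability range can be sketched as below; the requirement function, distributions and sample sizes are illustrative placeholders, not the challenge models.

```python
import numpy as np

rng = np.random.default_rng(8)
Ne, Na = 1_000, 200                           # epistemic / aleatory sample sizes

def w(a, e, theta):
    # Toy worst-case requirement function; failure when w >= 0.
    return np.max(a * (1.0 + e) - theta, axis=-1)

theta_base = np.array([2.5, 2.5])
E_samp = rng.uniform(0.0, 1.0, size=(Ne, 2))  # outer loop: epistemic draws

pf = np.empty(Ne)
for k, e in enumerate(E_samp):
    A_samp = rng.uniform(0.0, 2.0, size=(Na, 2))  # inner loop: a ~ F_A(.; e)
    pf[k] = np.mean(w(A_samp, e, theta_base) >= 0.0)

R_w = (pf.min(), pf.max())                    # bounds on the failure probability
```

Because the outer loop only samples E, the resulting interval is an inner approximation of the true epistemic bounds, which is why the refinement methods of Sections 5.4.1 and 5.5 matter.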

Rank of epistemic uncertainty on R w
In this section we are asked to rank the epistemic uncertainties according to the contraction of R_w(θ) that might result from their reduction. To do so, we use the VoI metric, Eq. (13), which measures the reduction in area of the p-box of g_{1:3} for a fixed value of e_{1:4}, relative to the area obtained with all parameters varying in their posterior range. Additionally, any reduction in the range of R_w was also taken into account. The performance requirements g_{1:3} showed high sensitivity towards the epistemic uncertainty in the UM. Only a modest contraction of R_{1:3} was found from the reduction of the epistemic inputs, with parameters e_1 and e_2 ranking highest.
We note that variance-based sensitivity analysis can also be used to assess the effect of e 1:4 on R w . However, such an analysis would require the problem to be slightly reformatted to enable VBSA to work with the interval R w . We chose not to pursue this strategy further, postponing more in-depth investigations for future work.

Identify realizations of (a, e) with large likelihood near the failure domain
To find likely realisations near the failure domain, we isolated a focal element for each failing requirement g_{1:3} = 0, so three in total, and identified the realizations in the input domain falling inside these three focal elements. The parameters defining the marginal distributions of a_1 and a_2 were identified as major contributors to high failure probabilities for the three failure conditions. In contrast, none of the original epistemic parameters was found to be influential in producing a high probability of failure.

Severity of requirement violation
We estimate the severity of each individual requirement and of the joint requirements via the double-loop Monte Carlo propagation procedure presented in Section 5.4. The accuracy of this estimator is more sensitive to the number of aleatory samples than that of the epistemic bounds of the failure probability, i.e. an outlier can substantially alter the value of the severity. The ranking of the requirements by most severe failure is reversed compared to the reliability ranking, see Table 6. Here, the energy consumption potentially leads to the most severe failures, s_3(θ_base) = 0.7154.

Optimality criteria and computational approach
For the reasons outlined in Section 5.7, a new candidate design was selected according to the criteria of minimizing the failure probability and severity of the integrated system. We selected genetic algorithms as a global, gradient-free optimisation strategy to search for a candidate design θ_new. The set of possible candidate designs Θ was not provided in the challenge. As such, we elected to search within the space defined by the hyper-cube θ_base(1 ± α), with α > 0, centred on the baseline design.

Analysis for θ new
The first four rows of Table 6 present the results of the optimisation. The reliability metrics are obtained as described above for the baseline θ_base and the updated θ_new designs. A substantial improvement in the severity can be observed, along with a modest reduction of the upper bound on the system failure probability. It is also interesting to note that the baseline design represents a case that highlights the importance of the severity metric: the transition to failure is marked by an extremely sharp increase in severity at the boundary. Subsequent designs remove this sharp change, allowing for a less extreme transition to failure. Fig. 17 presents the resulting probability boxes for the baseline design and θ_new in blue and red, respectively. Fig. 18 shows the composite failure and safe regions projected onto a 2-d space of uncertain factors. The figure shows three designs in the projected space. These figures are indicative of the relatively improved performance of the designs, shown by the removal of failure limit-states in the aleatory space for each θ. The transparent markers are aleatory samples from multiple probabilistic models F_A defined by epistemic vectors e_{5:24}.

Problem E: Model update & design tuning
Upon finding a new design point θ_new, the last problem of the challenge is to perform a final improvement to the UM and design. After providing θ_new to the challenge hosts, 100 realisations of the subsystems Z_1 and Z_2 were provided for calibration. The calibration approach pursued here is identical to that described in Section 6.1.1, where the new data was used alongside the original data to improve the UM. A sensitivity analysis, identical to that in Section 6.2.2, was performed to request 3 more parameter refinements to E. These refinements were selected by performing the analysis with respect to the subsystems Z_1, Z_2 and Y, and ranking the parameters again as in Table 7. Once the challenge hosts returned the 3 reductions to E, the UM was calibrated a final time over this reduced space with the new data. Figs. 19-21 show the results of this calibration, along with bounds from a sliced-normal distribution.
The updated model was then used within the RBDO described in Section 6.4 to obtain θ_final. Table 6 presents the reliability scores for θ_final and Fig. 18 shows a 2-d projection of the individual failure regions. Similarly to θ_new, the most uncertain reliability score for θ_final is with respect to the settling-time requirement. However, the epistemic uncertainty affecting this score decreased substantially with the new uncertainty model, i.e., from [0, 0.985] to only [0, 0.075]. Fig. 22 shows the bounds of the uncertainty model on Z_1 and Z_2 over the three designs and calibrated uncertainty models.

Problem F: Risk-based design
• F-1) Propose a metric to quantify the gain, l, resulting from taking the risk r = r̂.
• F-2) Find a design point that maximizes l(r) and denote it as θ_r̂. Explain the process used to choose the portion of E being ignored.
• F-3) Compare θ_final and θ_r̂ using the figures of merit above.
• F-4) Evaluate l(r, θ_final) and l(r, θ_r̂) for r = [0, 0.5, 1, …, 10]. Determine the optimal trade-off between performance gain l(r) and risk for each design point (if any).
• F-5) Use the results of this sub-problem to detail a rationale that might be used to justify either adopting or rejecting a risk-based design.

Fig. 17. A comparison of the p-boxes for g_1:3 (top to bottom panel) for the baseline design (blue), the design θ_new (red) and θ_final (black solid line).

A. Gray et al.
This task is closely related to the definition of risk for non-probabilistic spaces. The NASA challenge problem gives a way to define risk as a reduction of the volume [6]. However, this way of defining risk does not uniquely identify the possible subsets from which the volume can be computed, because there are infinitely many subsets of E with a given volume. A procedure is therefore needed to generate subsets of the updated epistemic space. For example, we can make a 5% reduction by removing volume from the boundary of E while enforcing some compactness condition. Alternatively, in some applications, the best design improvements may be obtained by removing volume from the geometric centre, e.g., discarding failure domains that are internal to the space defining E.
Thus, one of the main challenges in Problem F is the definition of an appropriate epistemic volume removal scheme, or in other words, a good justification of the way in which we accept a risk r. The way we remove epistemic volume inevitably affects the resulting risk-based design. If a probabilistic characterisation of the epistemic volume E is adopted (e.g. using a posterior distribution), a risk-based analyst would probably be tempted to reduce the epistemic volume by removing the regions with the lowest probability mass. In contrast, a non-probabilistic characterisation of the epistemic volume does not associate a probability with the different regions. The latter appears to be the more rational choice; however, it is not clear how to reduce volume in this case, e.g. from the boundaries or from the geometric centre. Both procedures entail highly subjective choices.
We consider two different schemes for the removal of volume from the updated epistemic space for parameters e_1:4. For the first risk-taking scheme, we arbitrarily remove volume from the boundary of the convex hull, and the reduction is calculated from the ratio between the two volumes. This is equivalent to removing the epistemic regions corresponding to the support of the tails of the posterior distributions of e. This operation is iterative and is referred to as convex hull peeling [121,122]. For computational reasons, instead of computing the convex hull in the full 24-dimensional space, we compute its pairwise projections in (e_i, e_j). The Cartesian product of these convex hulls conservatively includes the 24-dimensional convex hull. Any samples that define this convex hull are then removed and the process is repeated. Fig. 23 displays the procedure and the resulting nested compact epistemic volumes. The black marked lines represent the volume for a risk r = 0, i.e., the convex hull containing all 10^4 sliced-normal samples. The red solid lines are nested volumes obtained by the iterative removal of the samples on the boundary of the hull. For the second removal scheme we consider the selective removal of a volume having the highest failure probability and severity.
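The pairwise-projection peeling step can be sketched as follows, using stand-in Gaussian samples in four dimensions rather than the actual sliced-normal posterior samples:

```python
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(2)
samples = rng.normal(size=(500, 4))   # stand-in for sliced-normal samples of e_1:4

def peel_once(points):
    """Remove every sample lying on any pairwise-projection convex hull.
    The Cartesian product of the 2-d hulls conservatively contains the
    full-dimensional hull, so peeling these is conservative too."""
    n, d = points.shape
    on_hull = np.zeros(n, dtype=bool)
    for i in range(d):
        for j in range(i + 1, d):
            hull = ConvexHull(points[:, [i, j]])
            on_hull[hull.vertices] = True
    return points[~on_hull]

peeled = samples
for _ in range(3):                # three peeling iterations give nested volumes
    peeled = peel_once(peeled)

ratio = len(peeled) / len(samples)   # crude proxy for the retained volume fraction
print(ratio)
```

Here the retained-sample fraction is used as a crude proxy for the volume ratio; computing the 2-d hull areas directly would give the ratio described above.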
We considered seven candidates θ_risk and compared their reliability performance to θ_final and θ_new. The seven risk-based candidates were selected to be optimal in specific sub-regions of E, indicating that they are potentially suitable design choices depending on the approach taken for volume reduction. Fig. 24 shows the reliability results of the different designs when removing volume as displayed in Fig. 23. The metrics of gain, l_s(r) and l_Pf(r), of the nine designs are presented for an increasing reduction in the volume of the resulting set within the original (calibrated) four-dimensional epistemic space E_0. Note that θ_final results in the highest gain for most of the volume-reduction procedure. Only after removing approximately 70% of the volume do some of the risk-based designs result in a better severity, see e.g. θ_risk-7. The numerical results for θ_final and two of the seven risk-based designs θ_risk (with increasing risk levels) are presented in Table 9. The results show that the final design solution for Problem E is very robust to the considered uncertainties, and only when the epistemic uncertainty becomes very small can we identify designs with better performance.
For the second approach, we discard the volume from E in which the failure probability and severity of the candidate designs are highest. To this aim, we associate a label with each of the 10^4 sliced-normal samples, corresponding to the index of the best-performing design for that sample, i.e., the index of the design having the lowest probability of failure, or severity, with respect to e. This yields nine classes: seven for the risk-based designs, one for θ_new, and one for θ_final. The ratio between the number of samples within a class and the total number of samples gives an approximation of the volume in which each of the nine candidates outperforms the others. Table 8 shows a summary of these approximated volumes.

Fig. 23. Pairwise plot showing an illustrative subset of the expanded epistemic space, selected to graphically highlight the dependence, or lack thereof, between a number of the epistemic parameters. The incremental volume reduction resulting from the convex hull peeling procedure is shown across several parameter pairs. The black marked lines show the convex hull projection of the sliced-normal samples (the blue markers). The red lines show convex hulls obtained for increasing risk levels r, defining nested epistemic volumes.
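The labelling scheme above admits a very compact sketch; the failure-probability table below is random placeholder data rather than the challenge results:

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples, n_designs = 10_000, 9

# Hypothetical table: pf[k, d] = failure probability of design d
# evaluated at epistemic sample k.
pf = rng.random((n_samples, n_designs))

labels = pf.argmin(axis=1)   # index of the best-performing design per sample
volumes = np.bincount(labels, minlength=n_designs) / n_samples
print(volumes)               # fraction of E in which each design dominates
```

The same labelling works with a severity table in place of `pf`; the resulting fractions approximate the per-design dominance volumes summarised in Table 8.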

Table 8
Ranking of θ_base, θ_new, θ_final and the seven risk-based designs according to their reliability and severity dominance over a percentage of the epistemic volume. The percentage volumes are approximated as the ratio of sliced-normal samples for which a design's reliability is superior to that of the others.

Issues/problems
The proposed solution faced a number of issues during development, many of which stem from the black-box nature of the problem. Bayesian updating allowed for flexibility in specifying an imprecise model, but this was extremely costly, and there is no good method for selecting the optimal epsilon parameter in the likelihood (6), which must therefore be chosen heuristically. Part of the flexibility of the model stems from the ability to capture dependence through the correlation matrix between the marginal beta distributions. However, these correlations did not update to a notable degree during Bayesian updating. This may reflect the imprecision inherent in attempting to assign a precise distribution to the aleatory space, and supports the use of imprecise methods, but it appears to have introduced conservatism where alternative approaches might have made better use of the dependence between samples.
The sliced-normal distributions provided a means of efficiently drawing additional samples whilst maintaining dependence structures, but their expressiveness was limited by computational cost. The sliced normal used in this approach had 234 dimensions, based on a two-degree-of-freedom expansion in a 24-dimensional problem. Additional degrees of freedom would have improved the ability to capture multi-modality accurately, for example, but this was not possible due to the computational cost of such an expansion. A more efficient means of dealing with dimensionality when using sliced normals could have allowed for a more accurate representation of the Bayesian posterior. This problem worsens as the dimensionality of the problem increases.

Fig. 24. For increasing levels of risk r (volume reduction), a comparison between the gain in severity l_s(r, θ) (left) and the gain in reliability l_Pf(r, θ) (right) for the designs listed in the legend. The change in shading annotates the volume threshold beyond which one design outperforms θ_final.

Table 9
Summary of the reliability scores of the final design and two risk-based designs for an increasing volume reduction given by r. Results estimated using N_e = 10^4 and N_a = 300.

A sensitivity analysis approach based on sampling initially failed because samples drawn from the reduced p-box could fall outside the original p-box when resampling occurred, which had the potential to produce negative sensitivity indices for some reductions. This was remedied by restricting resampling to subsets of the original sample set and enforcing inclusion monotonicity, but this approach is limited by the extent to which epistemic uncertainty defines the reduced p-box. If the epistemic uncertainty is greatly reduced, the resulting set of sub-samples may be too small to be of practical use.
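The inclusion-monotonic resampling remedy can be sketched as follows, with a placeholder two-dimensional epistemic box standing in for the reduced p-box:

```python
import numpy as np

rng = np.random.default_rng(4)
e_original = rng.uniform(0.0, 1.0, (5000, 2))   # samples over the original box

# Hypothetical reduced epistemic box, nested inside the original
reduced_lo = np.array([0.2, 0.1])
reduced_hi = np.array([0.9, 0.8])

# Restrict resampling to the subset of original samples inside the reduced box:
# no resampled point can ever fall outside the original support, which
# enforces inclusion monotonicity and rules out negative sensitivity indices.
inside = np.all((e_original >= reduced_lo) & (e_original <= reduced_hi), axis=1)
subset = e_original[inside]
resampled = subset[rng.integers(0, len(subset), 1000)]
```

Note the limitation discussed above: as the reduced box shrinks, `subset` shrinks with it, and may become too small to be of practical use.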
The risk-based design approach suggested a means of taking a risk whereby the epistemic space is reduced. In this solution, there was no unique means of producing a given reduction of the epistemic space. This could lead to false confidence, since a reduction could preferentially target the epistemic space associated with failure states. This may not reflect any true risk, and the means of taking such a risk should be defined before attempting any risk-based design optimisation. This was remedied by proposing a number of candidate designs that are optimal for different risk approaches.

Limitations
The approach also has the following limitations:
• Time-correlation information is not used for calibration: some information is lost in the Bayesian updating when taking the product of the marginal likelihoods (independence assumption).
• A prior must be selected for the second-order model, and it is difficult to make use of prior knowledge in this case.
• The second-order and sliced-normal distributions suffer from increasing dimensionality; the computational cost needs to be considered.
• Engineering problems are rarely true black boxes, in which case intrusive analysis can provide more robust results.
• Epistemic bounds on black-box models may not be rigorous.
• Optimisation requires a specified means of volume reduction.
• The probabilistic model is inherently imprecise.
Although this is a reasonable approach for limited data, it should be noted that unless the aleatory uncertainty follows the same distribution family selected for the probabilistic model (beta marginals with a Gaussian copula in this work), the method will not yield a precise distribution, even with a large data set. This limitation may be addressed by selecting a non-parametric distribution for the second-order probabilistic model.
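As an illustration of what such a non-parametric second-order model could look like (this is not the method used in this work), a distribution-free p-box can be built from the empirical CDF with a Dvoretzky-Kiefer-Wolfowitz (DKW) confidence band:

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(0.0, 1.0, 200)   # placeholder data set

def ks_pbox(data, alpha=0.05):
    """Distribution-free p-box: the empirical CDF bracketed by a DKW band
    of half-width sqrt(ln(2/alpha) / (2n)), valid for any distribution."""
    n = len(data)
    half_width = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))
    x = np.sort(data)
    ecdf = np.arange(1, n + 1) / n
    lower = np.clip(ecdf - half_width, 0.0, 1.0)
    upper = np.clip(ecdf + half_width, 0.0, 1.0)
    return x, lower, upper

x, lower, upper = ks_pbox(data)
```

Because the band width shrinks as n grows, this p-box converges towards a precise distribution with enough data, regardless of the underlying family, which is exactly the behaviour the parametric model above cannot guarantee.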

Future work
This work may be extended in the following ways, some of which may address the limitations described above.
• An uncertainty metric which accounts for dependence: Accounting for dependencies in the output data would give a more accurate calibration. In this work, independence had to be assumed since the selected uncertainty metric is one-dimensional. A possible candidate is the multivariate Wasserstein metric, a popular distance in optimal transport theory [123]. However, an efficient method to compute this distance between multivariate distribution functions is still required.
• Non-parametric second-order distributions: The p-boxes produced in this work will not reduce to a precise distribution, even with a large amount of data, because a parametric distribution family was used in the probabilistic model. Using non-parametric distributions would remove this limitation and give a more flexible calibration.
• A non-probabilistic solution to this problem: A possibilistic approach would allow for a unique means of epistemic volume reduction and would also prevent false confidence in reliability statements, owing to the consonant nature of possibilistic structures. The methods for doing so are currently under development, and the most appropriate means of assigning necessity to the input space based on the black-box output are not yet defined. One potential solution is to represent the Bayesian posterior with a possibilistic transform of the sliced-normal distribution; this was investigated over the course of this work and is currently under development.
A uniform prior was used here for all model parameters, but the black-box nature of the problem and the lack of context give no justification for any particular prior.
• Testing for false confidence: Failure regions that are finite may lead to false confidence when a probabilistic approach is taken to reliability analysis. Assessing the impact of this would be a useful aid when approaching design problems using the approach described in this manuscript, highlighting where false confidence is likely to cause concern and where non-probabilistic approaches may alleviate these problems.

• Efficient RBDO under mixed uncertainties
There are several efficient reliability optimisation methods that work well in situations where only aleatory uncertainty is considered, e.g., decoupled RBDO methods and multi-level surrogates. However, efficient optimisation methods for mixed uncertainties are not as mature. Genetic algorithms provided a solution, but at significant computational expense. Resolving this would greatly expand the applicability of this approach.
• Probabilistic-model-free RBDO: A probabilistic-model-free optimisation-under-uncertainty method could be applied to this problem, where the reliability is computed directly from the experimental data in a robust way [105]. This would remove the need for calibration (Problem A).
• Use of surrogate modelling (regarding dimensionality and expense): The use of surrogate modelling would allow for more samples to be used and for the reliability to be computed to a higher accuracy. Surrogate models would also be required for the methods of this paper to be applied to more computationally expensive models.
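As a concrete starting point for the dependence-aware metric listed above, the one-dimensional Wasserstein distance between two empirical samples is readily computed; the multivariate, dependence-aware extension remains the open problem. The samples below are placeholders:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(6)
observed = rng.normal(0.0, 1.0, 500)    # placeholder "data" sample
predicted = rng.normal(0.3, 1.2, 500)   # placeholder "model" sample

# 1-d Wasserstein (earth mover's) distance between the empirical
# distributions; it vanishes only when the two samples coincide.
d = wasserstein_distance(observed, predicted)
print(d)
```

Extending this to joint distributions would allow the calibration to penalise mismatched dependence structures, rather than comparing marginals one dimension at a time.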

Conclusions
In this paper we present a harmonised framework for dealing with uncertainty in engineering design. The framework described within this document builds upon recent advances in the UQ literature. We have consolidated a plethora of methodologies and analysis procedures into one workable framework for engineering design with black-box models. The framework deals with the four common sources of uncertainty in engineering design: (i) limited data, (ii) variability, (iii) unknown parameter values, and (iv) uncertain dependencies. We frame the design challenge, inclusive of these sources of uncertainty, as a sequence of uncertainty/risk-analytic problems:
• uncertainty characterisation,
• calibration,
• sensitivity analysis and uncertainty reduction,
• reliability analysis,
• design optimisation, and
• risk- and reliability-based design optimisation.
Our novel sequence of solutions quantifies each uncertainty, and provides support to practitioners to: make decisions about further data acquisition, assess the reliability of prototype designs, optimise the design against severe failures and uncertainties, and refine design decisions.
To demonstrate the applicability, efficacy, and efficiency of the proposed solution strategy, we apply our framework to the NASA 2020 challenge, which combines problems of statistical inference with complex, multi-dimensional, black-box engineering systems. The challenge emulates fundamental problems common across science and engineering, so our success in addressing it showcases the flexibility of our approach.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Nomenclature

Value of the stochastic area metric
e_5:24 ∈ E_B    Vector of epistemic parameters defining f_a
E_B    Augmented epistemic space
P_f,j    Failure probability for requirement j
P_f(θ)    Estimator of the failure probability
l(θ, r)    Gain metric for a risk r
ϕ    Unnormalised Gaussian density function
q(u; n_d)    Monomial expansion of u of degree n_d