Bayesian model inversion using stochastic spectral embedding

In this paper we propose a new sampling-free approach to solve Bayesian inverse problems that extends the recently introduced spectral likelihood expansions (SLE) method. The latter solves the inverse problem by expanding the likelihood function onto a global polynomial basis orthogonal w.r.t. the prior distribution. This gives rise to analytical expressions for key statistics of the Bayesian posterior distribution, such as evidence, posterior moments and posterior marginals by simple post-processing of the expansion coefficients. It is well known that in most practically relevant scenarios, likelihood functions have close-to-compact support, which causes the global SLE approach to fail due to the high polynomial degree required for an accurate spectral representation. To solve this problem, we herein replace the global polynomial expansion from SLE with a recently proposed method for local spectral expansion refinement called stochastic spectral embedding (SSE). This surrogate-modeling method was developed for functions with high local complexity. To increase the efficiency of SSE, we enhance it with an adaptive sample enrichment scheme. We show that SSE works well for likelihood approximations and retains the relevant spectral properties of SLE, thus preserving analytical expressions of posterior statistics. To assess the performance of our approach, we include three case studies ranging from low to high dimensional model inversion problems that showcase the superiority of the SSE approach compared to SLE and present the approach as a promising alternative to existing inversion frameworks.


Introduction
Computational models are an invaluable tool for decision making, scientific advances and engineering breakthroughs.They establish a connection between a set of input parameters and output quantities with wide-ranging applications.Model inversion uses available experimental observations of the output to determine the set of input parameters that maximize the predictive potential of a model.The importance of efficient and reliable model inversion frameworks can hardly be overstated, considering that they establish a direct connection 1 between models and the real world.Without it, the most advanced model predictions might lack physical meaning and, consequently, be useless for their intended applications.
Bayesian model inversion is one way to formalize this problem (Jaynes, 2003;Gelman et al., 2014).It is based on Bayesian inference and poses the problem in a probabilistic setting by capitalizing on Bayes' theorem.In this setting a so-called prior (i.e., before observations) probability distribution about the model parameters is updated to a so-called posterior (i.e., after observations) distribution.The posterior distribution is the probability distribution of the input parameters conditioned on the available observations, and the main outcome of the Bayesian inversion process.
In Bayesian model inversion, the connection between the model output and the observations is established through a probabilistic discrepancy model.This model, which is a function of the input parameters, leads to the so-called likelihood function.The specific form of the likelihood function depends on the problem at hand, but typically it has a global maximum for the input parameters with the model output that is closest to the available observations (w.r.t.some metric), and rapidly goes to zero with increasing distance to those parameters.
Analytical expressions for the posterior distribution can only be found in few academic examples (e.g., conjugate priors with a linear forward model, Bishop (2006); Gelman et al. (2014)).In general model inversion problems, such analytical solutions are not available though.Instead, it is common practice to resort to sampling methods to generate a sample distributed according to the posterior distribution.The family of Markov chain Monte Carlo (MCMC) algorithms are particularly suitable for generating such a posterior sample (Beck and Au, 2002;Robert and Casella, 2004).
While MCMC and its extensions are extensively used in model inversion, and new algorithms are continuously being developed (Haario et al., 2001;Ching and Chen, 2007;Goodman and Weare, 2010;Neal, 2011), it has a few notable shortcomings that hinder its application in many practical cases.It is well known that there are no robust convergence criteria for MCMC algorithms, and that their performance is particularly sensitive to their tuning parameters.Additionally, samples generated by MCMC algorithms are often highly correlated, thus requiring extensive heuristic post-processing and empirical rules (Gelman and Rubin, 1992;Brooks and Gelman, 1998).MCMC algorithms are also in general not well suited for sampling multimodal posterior distributions.
When considering complex engineering scenarios, the models subject to inversion are often computationally expensive.Because MCMC algorithms usually require a significant number of forward model evaluations, it has been proposed to accelerate the procedure by using surrogate models in lieu of the original models.These surrogate models are either constructed non-adaptively before sampling from the posterior distribution (Marzouk et al., 2007;Marzouk and Xiu, 2009) or adaptively during the sampling procedure (Li and Marzouk, 2014;Birolleau et al., 2014).Adaptive techniques can be of great benefit with posterior distributions that are concentrated in a small subspace of the prior domain, as the surrogate only needs to be accurate near high density areas of the posterior distribution.
Polynomial chaos expansions (PCE) are a widely used surrogate modelling technique based on expanding the forward model onto a suitable polynomial basis (Ghanem and Spanos, 1991;Xiu and Karniadakis, 2002).In other words, it provides a spectral representation of the computational forward model.Thanks to the introduction of sparse regression (see, e.g.Blatman and Sudret (2011)), its computation has become feasible even in the presence of complex and computationally expensive engineering models.This technique has been successfully used in conjunction with MCMC to reduce the total computational costs associated with sampling from the posterior distribution (Marzouk et al., 2007;Wagner et al., 2020).
Recently, Nagel and Sudret (2016) proposed spectral likelihood expansions (SLE), a novel approach that is closely related to PCE and aims at finding a spectral expansion of the likelihood function in an orthogonal basis w.r.t. the prior density function.The advantage of this method is that it provides analytical expressions for the posterior marginals and general posterior moments by post-processing the expansion coefficients.However, the typically compact support of likelihood functions generally prevents a sparse functional representation.
Stochastic spectral embedding (SSE) is a metamodelling technique suitable for approximating functions with complex localized features recently developed in Marelli et al. (2020).
In this paper we show how this technique, when extended with adaptive sample enrichment, is effective for the approximation of likelihood functions in Bayesian model inversion problems.Furthermore, we show that due to the construction of the basis polynomials, SSE preserves the most important post-processing properties of SLE.
The paper is organized as follows: In Section 2 we establish the basics of Bayesian inference and particularly Bayesian model inversion.We then give an introduction into spectral function decomposition with a focus on polynomial chaos expansions and their application to likelihood functions (SLE) in Section 3. In Section 4 we present the main contribution of the paper, namely the derivation of Bayesian posterior quantities of interest when SSE is applied to likelihood functions and the extension of the SSE algorithm with an adaptive sampling strategy.Finally, in Section 5 we showcase the performance of our approach on three case studies of varying complexity.

Model inversion
The problem of model inversion occurs whenever the predictions of a model are to be brought into agreement with available observations or data.This is achieved by properly adjusting a set of input parameters of the model.The goal of inversion can be twofold: on the one hand the inferred input parameters might be used to predict new realizations of the model output.On the other hand, the inferred input parameters might be the main interest.Model inversion is a common problem in many engineering disciplines, that in some cases is still routinely solved manually, i.e. by simply changing the input parameters until some, often qualitative, goodness-of-fit criterion is met.More quantitative inversion approaches aim at automatizing this process, by establishing a metric (e.g., L 2 -distance) between the data and the model response, which is then minimized through suitable optimization algorithms.
While such approaches can often be used in practical applications, they tend not to provide measures of the uncertainties associated with the inferred model input or predictions.
These uncertainties are useful in judging the accuracy of the inversion, as well as indicating non-informative measurements.In fact, the lack of uncertainty quantification in the context of model inversion can lead to erroneous results that have far-reaching consequences in subsequent applications.One approach to consider uncertainties in inverse problems is the Bayesian framework for model inversion that will be presented hereinafter.

Bayesian inference
Consider some non-observable parameters X ∈ D X and the observables Y ∈ D Y .Furthermore, let Y = {y (1) , . . ., y (N ) } be a set of N measurements, i.e., noisy observations of a set of realizations of Y .Statistical inference consists in drawing conclusions about X using the information from Y (Gelman et al., 2014).These measurements can be direct observations of the parameters (Y = X) or some quantities indirectly related to X through a function or model M : D X → D Y .One way to conduct this inference is through Bayes' theorem of conditional probabilities, a process known as Bayesian inference.
Denoting by π(•) a probability density function (PDF) and by π(•|x) a PDF conditioned on x, Bayes' theorem can be written as where π(x) is known as the prior distribution of the parameters, i.e., the distribution of X before observing the data Y.The conditional distribution π(Y|x), known as likelihood, establishes a connection between the observations Y and a realization of the parameters X = x.For a given realization x, it returns the probability density of observing the data Y.
Under the common assumption of independence between individual observations, {y (i) } N i=1 , the likelihood function takes the form: (2) The likelihood function is a map D X → R + , and it attains its maximum for the parameter set with the highest probability of yielding Y.With this, Bayes' theorem from Eq. ( 1) can be rewritten as: where Z is a normalizing constant often called evidence or marginal likelihood.On the lefthand side, π(x|Y) is the posterior PDF, i.e., the distribution of X after observing data Y.In this sense, Bayes' theorem establishes a general expression for updating the prior distribution using a likelihood function to incorporate information from the data.

Bayesian model inversion
Bayesian model inversion describes the application of the Bayesian inference framework to the problem of model inversion (Beck and Katafygiotis, 1998;Kennedy and O'Hagan, 2001;Jaynes, 2003;Tarantola, 2005).The two main ingredients needed to infer model parameters within the Bayesian framework are a prior distribution π(x) of the model parameters and a likelihood function L. In practical applications, prior information about the model parameters is often readily available.Typical sources of such information are physical parameter constraints or expert knowledge.Additionally, prior inversion attempts can serve as guidelines to assign informative prior distributions.In cases where no prior information about the parameters is available, so-called non-informative or invariant prior distributions (Jeffreys, 1946;Harney, 2016) can also be assigned.The likelihood function serves instead as the link between model parameters X and observations of the model output Y.To connect these two quantities, it is necessary to choose a so-called discrepancy model that gives the relative probability that the model response to a realization of X = x describes the observations.
One common assumption for this probabilistic model is that the measurements are perturbed by a Gaussian additive discrepancy term E ∼ N (ε|0, Σ), with covariance matrix Σ.For a single measurement y (i) it reads: This discrepancy between the model output M(X) and the observables Y can result from measurement error or model inadequacies.By using this additive discrepancy model, the distribution of the observables conditioned on the parameters Y |x is written as: where N (•|µ, Σ) denotes the multivariate Gaussian PDF with mean value µ and covariance matrix Σ.The likelihood function L is then constructed using this probabilistic model π(y (i) |x) and Eq. ( 2).For a given set of measurements Y it thus reads: With the fully specified Bayesian model inversion problem, Eq. (3) directly gives the posterior distribution of the model parameters π(x|Y).In the setting of model inversion, the posterior distribution represents therefore the state of belief about the true data-generating model parameters, considering all available information: computational forward model, discrepancy model and measurement data (Beck and Katafygiotis, 1998;Jaynes, 2003).
Often, the ultimate goal of model inversion is to provide a set of inferred parameters, with associate confidence measures/intervals.This is often achieved by computing posterior statistics (e.g., moments, mode, etc.).Propagating the posterior through secondary models is also of interest.So-called quantites of interest (QoI) can be expressed by calculating the posterior expectation of suitable functions of the parameters h(x) : R M → R, with X|Y ∼ π(x|Y), as in: Depending on h, this formulation encompasses posterior moments ) 2 for the first and second moments, respectively), posterior covariance (h(x) = ) or expectations of secondary models (h(x) = M (x)).

Spectral function decomposition
To pose a Bayesian inversion problem, the specification of a prior distribution and a likelihood function described in the previous section is sufficient.Its solution, however, is not available in closed form in the general case.
Spectral likelihood expansion (SLE) is a recently proposed method that aims at solving the Bayesian inversion problem by finding a polynomial chaos expansion (PCE) of the likelihood function in a basis orthogonal w.r.t. the prior distribution (Nagel and Sudret, 2016).
This representation allows one to derive analytical expressions for the evidence Z, the posterior distribution, the posterior marginals, and many types of QoIs, including the posterior moments.
We offer here a brief introduction to regression-based, sparse PCE before introducing SLE, but refer the interested reader to more exhaustive resources on PCE (Ghanem and Spanos, 1991;Xiu and Karniadakis, 2002) and sparse PCE (Xiu, 2010;Blatman andSudret, 2010, 2011).
Let us consider a random variable X with independent components {X i , i = 1, . . ., M } and associated probability density functions is a scalar function of X which fulfills the finite variance condition (E M(X) 2 < +∞).Then it is possible to find a so-called where α is an M -tuple (α 1 , . . ., α M ) ∈ N M and A ⊂ N M .For most parametric distributions, well-known classical orthonormal polynomials {Ψ α } α∈N M satisfy the necessary orthonormality condition w.r.t.π(x) (Xiu and Karniadakis, 2002).For more general distributions, arbitrary orthonormal polynomials can be constructed numerically through the Stieltjes procedure (Gautschi, 2004).If additionally, A is a sparse subset of N M , the truncated expansion in Eq. ( 8) is called a sparse PCE.
There exist different algorithms to produce a sparse PCE in practice, i.e. select a sparse basis A and compute the corresponding coefficients.A powerful class of methods are regression-based approaches that rely on an initial input sample X , called experimental design, and corresponding model evaluations M(X ) (See, e.g.Lüthen et al. (2020) for a recent survey).Additionally, it is possible to design adaptive algorithms that choose the truncated basis size (Blatman and Sudret, 2011;Jakeman et al., 2015).
To assess the accuracy of PCE, the so-called generalization error shall be evaluated.A robust generalization error estimator is given by the leave-one-out (LOO) cross validation technique.This estimator is obtained by where M ∼i PCE is constructed by leaving out the i-th point from the experimental design X .For methods based on linear regression, it can be shown (Chapelle et al., 2002;Blatman and Sudret, 2010) that the LOO error is available analytically by post-processing the regressor matrix.

Spectral likelihood expansions
The idea of SLE is to use sparse PCE to find a spectral representation of the likelihood function L occurring in Bayesian model inversion problems (see Eq. ( 2)).We present here a brief introduction to the method and the main results of Nagel and Sudret (2016).
Likelihood functions can be seen as scalar functions of the input random vector X ∼ π(x).In this work we assume priors of the type π(x) = M i=1 π i (x i ), i.e. with independent marginals.Additionally, likelihood functions fulfill the finite variance condition (Nagel and Sudret, 2016) and therefore admit a spectral decomposition: where the explicit dependence on Y was dropped for notational simplicity.
Upon computing the basis and coefficients in Eq. ( 8), the solution to the inverse problem is converted to merely post-processing the coefficients a α .The following expressions can be derived for the individual quantities: Evidence The evidence emerges as the constant polynomial's coefficient a 0 Posterior Upon computing the evidence Z, the posterior can be evaluated directly through Posterior marginals Let u and v be two non-empty disjoint index sets such that u ∪ v = {1, . . ., M }.We split the random vector X into two vectors X u with components ) the prior marginal density functions of X u and X v respectively.The posterior marginals then read: where The series in the above equation constitutes a subexpansion that contains non-constant polynomials only in the directions i ∈ u.
Quantities of interest Finally, it is also possible to analytically compute posterior expectations of functions that admit a polynomial chaos expansion in the same basis of the form h(X) ≈ α∈A b α Ψ α (X).Eq. ( 7) then reduces to the spectral product: The quality of these results depends only on the approximation error introduced in Eq. ( 10).The latter, in turn, depends mainly on the chosen PCE truncation strategy (Blatman and Sudret, 2011;Nagel and Sudret, 2016) and the number of points used to compute the coefficients (i.e., the experimental design).It is known that likelihood functions typically have quasi-compact supports (i.e., L(X) ≈ 0 on a majority of D X ).Such functions require a very high polynomial degree to be approximated accurately, which in turn can lead to the need for prohibitively large experimental designs.

Stochastic spectral embedding
Stochastic spectral embedding (SSE) is a multi-level approach to surrogate modeling originally proposed in Marelli et al. (2020).It attempts to approximate a given square-integrable where K ⊆ N 2 is a set of multi-indices with elements k = ( , p) for which = 0, . . ., L and p = 1, . . ., P where L is the number of levels and P is the number of subdomains at a specific level .We call R k S (X) a residual expansion given by In the present paper the term j∈A k a k j Ψ k j (X) denotes a polynomial chaos expansion (see Eq. ( 8)) constructed in the subdomain D k X , but in principle it can refer to any spectral expansion (e.g., Fourier series).A schematic representation of the summation in Eq. ( 15) is given in Figure 2. The detailed notation and the algorithm to sequentially construct an SSE are given in the sequel.

SSE for Bayesian model inversion
Viewing the likelihood as a function of a random variable X, we can directly use Eq. ( 15) to write down its SSE representation where the variable X is distributed according to the prior distribution π(x) and, consequently, the local basis used to compute R k S (X) is orthonormal w.r.t. that distribution.Due to the local spectral properties of the residual expansions, the SSE representation of the likelihood function retains all of the post-processing properties of SLE (Section 3.1): Evidence The normalization constant Z emerges as the sum of the constant polynomial coefficients weighted by the prior mass: Posterior This allows us to write the posterior as Posterior marginal Utilizing again the disjoint sets u and v from Eq. ( 13) it is also possible to analytically derive posterior marginal PDFs as where x) that contains only non-constant polynomials in the directions i ∈ u.Note that, as we assumed that the prior distribution has independent components, the constants V k and V k v are obtained as products of univariate integrals which are available analytically from the prior marginal cumulative distribution functions (CDFs).

Quantities of interest Expected values of function
for k ∈ K under the posterior can be approximated by: where b k α are the coefficients of the PCE of h in the card(K) bases {Ψ k α } α∈A k .This can also be used for computing posterior moments.
These expressions can be seen as a generalization of the ones for SLE detailed in Section 3.1.For a single-level global expansion (i.e., card(K) = 1) and consequently V (0,1) = 1, they are identical.

Modifications to the original algorithm
The original algorithm for computing an SSE was presented in Marelli et al. (2020).It recursively partitions the input domain D X and constructs truncated expansions of the residual.We reproduce it below for reference but replace the model M with the likelihood function L. We further simplify the algorithm by choosing a partitioning strategy with based on a partitioning strategy (d) If < L, ← + 1, go back to 2a, otherwise terminate the algorithm 3. Termination (a) Return the full sequence of D ,p X and R ,p S (X ,p ) needed to compute Eq. ( 15).
In practice, the residual expansions R ,p S (X ,p ) are computed using a fixed experimental design X and corresponding model evaluations L(X ).The algorithm then only requires the specification of a partitioning strategy and a termination criterion, as detailed in Marelli et al. (2020).
Likelihood functions are typically characterized by a localized behaviour: Close to the data-generating parameters they peak while quickly decaying to 0 in the remainder of the prior domain.This means that in a majority of the domain the likelihood evaluation is non-informative.Directly applying the original algorithm is then expected to waste many likelihood evaluations.
We therefore modify the original algorithm by adding an adaptive sampling scheme (Section 4.2.1) that includes the termination criterion and introducing an improved partitioning strategy (Section 4.2.2) that is especially suitable for finding compact support features.The rationale for these modifications is presented next.

Adaptive sampling scheme
This estimator incorporates the subdomain size through the prior mass V +1,s , and the approximation accuracy, through the leave-one-out estimator.The distinction is necessary to assign an error estimator also to domains that have too few points to construct a residual expansion, in which case the error estimator of the previous level E ,s LOO is reused.The algorithm sequentially splits and refines subdomains with large approximation errors.
Because likelihood functions typically have the highest complexity close to their peak, these regions tend to have larger approximation errors and are therefore predominantly picked for refinement.The proposed way of adaptive sampling then ends up placing more points near the likelihood peak, thereby reducing the number of non-informative likelihood evaluations.
The choice of a constant N ref is simple and could in principle be replaced by a more elaborate strategy (e.g., based on the approximation error of the current subdomain relative to the total approximation error).A benefit of this enrichment criterion is that all residual expansions are computed with experimental designs of the same size.Upon choosing the domain with the maximum approximation error among the terminal domains, the error estimators then have a more comparable estimation accuracy.

Partitioning strategy
The partitioning strategy determines how a selected refinement domain is split.As described in Marelli et al. (2020), it is easy to define the splitting in the uniformly distributed quantile space U and map the resulting split domains D ,p U to the (possibly unbounded) real space X through an appropriate isoprobabilistic transform (e.g., the Rosenblatt transform (Rosenblatt, 1952)).
Similar to the original SSE algorithm presented in Marelli et al. (2020), we split the refinement domain in half w.r.t.its prior mass.The original algorithm chooses the splitting direction based on the partial variance in the refinement domain.This approach is well suited for generic function approximation problems.For the approximation of likelihood functions, however, we propose a partitioning strategy that is more apt for dealing with their compact support.
We propose to pick the split direction along which a split yields a maximum difference in the residual empirical variance between the two candidate subdomains created by the split.This can easily be visualized with an example given by the M = 2 dimensional domain D ,p X in Figure 1(a).Assume this subdomain was selected as the refinement domain.To decide along which dimension to split, we construct the M candidate subdomain pairs ..,M and estimate the corresponding {E i split } i=1,...,M in those subdomains defined by In this expression, X i,1 split and X i,2 split denote subsets of the experimental design X inside the subdomains D i,1 split and D i,2 split respectively.The occurring variances can be easily estimated with the empirical variance of the residuals in the respective candidate subdomains.
After computing the residual variance differences, the split is carried out along the di- i.e., to keep the subdomains D d,1 split and D d,2 split that introduce the largest difference in variance.For d = 1, the resulting split can be seen in Figure 1(d).
The choice of this partitioning strategy can be justified heuristically with the goal of approximating compact support functions.Assume that the likelihood function has compact support, this criterion will avoid cutting through its support and instead identify a split direction that results in one subdomain with large variance (expected to contain the likelihood support) and a subdomain with small variance.In subsequent steps, the algorithm will proceed by cutting away low variance subdomains, until the likelihood support is isolated.

The adaptive SSE algorithm
The algorithm is presented here with its two parameters N ref , the minimum experimental design size needed to expand a residual, and N ED , the final experimental design size.The sample X ,p refers to X ∩ D ,p X , i.e. the subset of X inside D ,p X .Further, the multi-index set T ∈ N 2 at each step of the algorithm gathers all indices ( , p) of unsplit subdomains.It thus denotes the terminal domains: D k X , k ∈ T .For visualization purposes we show the first iterations of the algorithm for a two-dimensional example in Figure 2.

Initialization:
(a) D 0,1 S (X) of L(X) in the full domain X 0,1 , retrieve its approximation error E 0,1 and initialize T = {(0, 1)} iii.Retrieve the approximation error E +1,s from Eq. ( 23  (a) Return the full sequence of D ,p X and R ,p S (X ,p ) needed to compute Eq. ( 15) The updating of the multi-index set in Step 2a refers to removing the current index ( , p) from the set and adding to it the newly created indices ( + 1, s 1 ) and ( + 1, s 2 ).

Case studies
To showcase the effectiveness of the proposed adaptive SSE approach, we present three case studies with increasing dimensionality and different model complexity: (i) a one-dimensional vibration problem for illustrative purposes, (ii) a six-dimensional heat transfer problem that describes the steady-state heat evolution in a solid body with inclusions and (iii) a 62-dimensional diffusion problem modeling the concentration-driven diffusion in a onedimensional domain.
For all case studies, we adopt the adaptive sparse-PCE based on LARS approach developed in Blatman and Sudret (2011) through its numerical implementation in UQLab (Marelli andSudret, 2014, 2019).Each R k S is therefore a degree-and q-norm-adaptive polynomial chaos expansion.We further introduce a rank truncation of r = 2, i.e. we limit the maximum number of input interactions (Marelli and Sudret, 2019) to two variables at a time.
The truncation set for each spectral expansion (Eq.( 8)) thus reads: where The q-norm is adaptively increased between q = {0.5, • • • , 0.8} while the maximum polynomial degree is adaptively increased in the interval p = {0, 1, • • • , p}, where the maximum degree p = 20 for case study (i) and (ii) and p = 3 for case study (iii) due to its high dimensionality.
In case study (ii) and (iii), the performance of SLE (Nagel and Sudret, 2016), the original non-adaptive SSE (Marelli et al., 2020) and the proposed adaptive SSE approach presented in Section 4.2 is compared.The comparison was omitted for case study (i), because only adaptive SSE succeeded in solving the problem.For clarity, we henceforth abbreviate the adaptive SSE algorithm to adSSE.
To simplify the comparison, the same partitioning strategy employed for adSSE (Section 4.2.2) was employed for the non-adaptive SSE approach.Also, the same experimental designs were used for the non-adaptive SSE and the SLE approaches.Finally, the same parameter N ref was used to define the enrichment samples in adSSE and the termination criterion in non-adaptive SSE.
To assess the performance of the three algorithms considered, we define an error measure that allows to quantitatively compare the similarity of the SSE, adSSE and SLE solution with the reference MCMC solution.This comparison is inherently difficult, as a samplingbased approach (MCMC) needs to be compared to a functional approximation (SSE, adSSE, SLE).We proceed to compare the univariate posterior marginals, available analytically in SSE, adSSE and SLE (See Eq. ( 13) and Eq. ( 20)), to the reference posterior marginals estimated with kernel density estimation (KDE, Wand and Jones (1995)) from the MCMC sample.Denoting by πi (κ i |Y) the SSE, adSSE or SLE approximations and by π i (κ i |Y) the reference solution, we define the following error measure where M is the dimensionality of the problem and JSD is the Jensen-Shannon divergence (Lin, 1991), a symmetric and bounded ([0, log(2)]) distance measure for probability distributions based on the Kullback-Leibler divergence.
The purpose of the error measure η is to allow for a fair comparison between the different methods investigated.It is not a practical measure for engineering applications because it relies on the availability of a reference solution, and it its magnitude does not have a clear As all algorithms (SSE, adSSE, SLE) depend on a randomly chosen experimental designs, we produce 20 replications for case study (ii) and 5 replications for case study (iii) by running them multiple times.

1-dimensional vibration problem
This first case study serves as an illustrative example of how the proposed adaptive algorithm constructs an adSSE.In the presented problem the goal is the inference of a single unknown parameter with a bi-modal posterior distribution.This problem is very difficult to solve with standard MCMC methods due to the probability valley, i.e. low probability region, between the posterior peaks.The considered problem is fabricated and uses synthetic data, but is presented in the context of a relevant engineering problem.
Consider the oscillator displayed in Figure 3 8.67, 8.84, 9.22, 8.54}.This problem is well known in mechanics and in the linear case (i.e., assuming small deformations and linear material behavior) can be solved analytically with the amplitude of the frequency response function.This function returns the ratio between the steady state amplitude of a linear oscillator and the amplitude of its excitation.It is given by We assume a discrepancy model with known discrepancy standard deviation σ.In conjunction with the available measurements Y, this leads to the following likelihood function: We employ the adSSE algorithm to approximate this likelihood function with N ref = 10.
A few iterations from the solution process are shown in Figure 4.The top plots show the subdomains D ,p X constructed at each refinement step, highlighting the terminal domains T .The middle plots display the residual between the true likelihood and the approximation at the current iteration, as well as the adaptively chosen experimental design X .The bottom plots display the target likelihood function and its current approximation.
The initial global approximation of the first iteration in Figure 4(a) is a constant polynomial based on the initial experimental design.By the third iteration, the algorithm has identified the subdomain D 2,2 X as the one of interest and proceeds to refine it in subsequent steps.By the 8th iteration both likelihood peaks have been identified.Finally, by the 10th iteration in Figure 4(d), both likelihood peaks are approximated well by the adSSE approach.
The last iteration shows how the algorithm splits domains and adds new sample points.
There is a clear clustering of subdomains and sample points near the likelihood peaks at X = 0.95 and X = 1.05.
The results from Eq. ( 22) show that without further computations it would be possible to directly extract the posterior moments by post-processing the SSE coefficients.In the present bi-modal case, however, the posterior moments are not very meaningful.Instead, the available posterior approximation gives a full picture of the inferred parameter X|Y.It is shown together with the true posterior and the original prior distribution in Figure 5.
For this case study, non-adaptive experimental design approaches like the standard SSE (Marelli et al., 2020) and the original SLE algorithm (Nagel and Sudret, 2016) will almost surely fail for the considered experimental design of N ED = 100.In numerous trial runs these approaches did not manage to accurately reconstruct the likelihood function due to a lack of informative samples near the likelihood peaks.

Moderate-dimensional heat transfer problem
This case study was originally presented in Nagel and Sudret (2016) and solved there using SLE.We again solve the same problem with SSE and compare the performance of SLE (Nagel and Sudret, 2016), the original non-adaptive SSE (Marelli et al., 2020) and the proposed adSSE approach presented in Section 4.2.
Consider the diffusion-driven stationary heat transfer problem sketched in Figure 6(a).
It models a 2D plate with a background matrix of constant thermal conductivity κ 0 and 6 inclusions with conductivities κ def = (κ 1 , . . ., κ 6 ) .The diffusion driven steady state heat distribution is described by a heat equation in Euclidean coordinates r def = (r 1 , r 2 ) of the ) and a temperature T1 Dirichlet boundary condition on the top.
We employ a finite element (FE) solver to solve the weak form of Eq. ( 31) by discretizing the domain into approximately 10 5 triangular elements.A sample solution returned by the FE-solver is shown in Figure 6b.
In this example we intend to infer the thermal conductivities κ of the inclusions.We assume the same problem constants as in Nagel and Sudret (2016) (i.e., q 2 = 2,000 W/m 2 , T1 = 200 K, κ 0 = 45W/m/K).The forward model M takes as an input the conductivities of the inclusions κ, solves the finite element problem and returns the steady state temperature T def = ( T1 , . . ., T20 ) at the measurement points, i.e., M : κ → T . 19 To solve the inverse problem, we assume a multivariate lognormal prior distribution with independent marginals on the inclusion conductivities, i.e. π(κ) = 6 i=1 LN (κ i |µ = 30 W/m/K, σ = 6 W/m/K).We further assume an additive Gaussian discrepancy model, which yields the likelihood function with a discrepancy standard deviation of σ.
As measurements, we generate one temperature field with κ def = (32, 36, 20, 24, 40, 28) W/m/K and collect its values at 20 points indicated by black dots in Figure 6(a).We then perturb these temperature values with additive Gaussian noise and use them as the inversion data We look at two instances of this problem that differ only by the discrepancy parameter σ from Eq. ( 32).The prior model response has a standard deviation of approximately 0.3 K, depending on the measurement point T i .We therefore solve the problem first with a large value σ = 0.25 K and second with a small value σ = 0.1 K.As the discrepancy standard deviation determines how peaked the likelihood function is, the first problem has a likelihood function with a much wider support and in turn is significantly easier to solve then the second one.It is noted here that in practice, the peakedness of the likelihood function is either increased by a smaller discrepancy standard deviation, or the inclusion of additional experimental data.
To monitor the dependence of the algorithms on the number of likelihood evaluations, we solve both problems with a set of maximum likelihood evaluations N ED = {1,000; 2,000, 5,000; 10,000; 30,000}.
The number of refinement samples is set to N ref = 1,000.
As a benchmark, we use reference posterior samples generated by the affine-invariant ensemble sampler MCMC algorithm (Goodman and Weare, 2010) with 30,000 steps and 50 parallel chains, requiring a total of N ED = 1.5 • 10 6 likelihood evaluations.Based on numerous heuristic convergence tests and due to the large number of MCMC steps, the resulting samples can be considered to accurately represent the true posterior distributions.
The results of the analyses are summarized in Figure 7, where the error measure η is plotted against the number of likelihood evaluations for the large and small standard deviation case.For the large discrepancy standard deviation case, both SSE approaches clearly outperform standard SLE w.r.t. the error measure η.This is most significant at mid-range experimental designs (N ED = 5,000; 10,000), where SLE does not reach the required high degrees and fails to accurately approximate the likelihood function.At larger experimental designs SLE catches up to non-adaptive SSE but is still outperformed by the proposed adSSE approach.The real strength of the adaptive algorithm shows for the case of a small discrepancy standard deviation, where the limitations of fixed experimental designs become obvious.When the likelihood function is nonzero in a small subdomain of the prior, the global SLE and non-adaptive SSE approach will fail in practice because of the insufficient 20 number of samples placed in the informative regions.The adSSE approach, however, works very well in these types of problems.It manages to identify the regions of interest and produces a likelihood approximation that accurately reproduces the posterior marginals.
Tables 1 and 2 show the convergence of the adSSE method's moment estimate (mean and variance) to the reference solution for a single run.In brackets next to the moment estimates ξ, the relative error   the posterior characteristics are very well captured.However, the adSSE approach sometimes fails to accurately represent the tails of the distribution.This is especially obvious in the small discrepancy case in Figure 8(c) where the tail is sometimes cut off.We emphasize here that the SSE marginals are obtained analytically as 1D and 2D surfaces for the univariate and bivariate marginals respectively.For the reference MCMC approach, on the other hand, they need to be approximated with histograms based on the available posterior sample.

Convergence of the posterior moments
In practical inference applications, posterior moments are often one of the main quantities of interest.An estimator of these moments is readily available at every refinement step of adSSE through Eq. ( 22).
Tracking the evolution of the posterior moments throughout the adSSE iterations can be used as a heuristic estimator of the convergence of the adSSE algorithm.However, only the stability of the solution can be assessed, without guarantees on the bias.As an example, we now consider the large discrepancy problem and plot the evolution of the posterior mean and standard deviation for every X i as a function of the number of likelihood evaluations in Figure 9.It can be seen that after ∼ 10,000 likelihood evaluations, most moment estimators achieve convergence to a value close to the reference solution.This plot also reveals a small bias of the E [X 3 |Y] and Var [X 3 |Y] estimators, that was previously highlighted in Table 1.

Influence of N ref
The main hyperparameter of the proposed adSSE algorithm is the number N ref , which corresponds to the number of sample points that are required at each PCE construction step (see Section 4.2).In Figure 10 we display the effect of different N ref values on the convergence in the small and large discrepancy problems.
N ref influences the accuracy of the two error estimators used inside the adSSE algorithm.
They are: (i) the residual expansion accuracy E LOO in Eq. ( 23) and (ii) the splitting error E split in Eq. ( 24).
Small values of N ref allow to quickly obtain a crude likelihood approximation with limited experimental design sizes N ED , but this comes at the cost of lower convergence rates at larger N ED .This behaviour can be partially attributed to the deterioration of residual expansion error E LOO in Eq. ( 23).At small experimental design sizes, the overall number of terminal domains is relatively small and this effect is not as pronounced.At larger experimental designs and higher numbers of subdomains, however, the error estimators high variances can lead to difficulties in identifying the true high error subdomains.
Large values of N ref lead to slower initial convergence rates because of the smaller number of overall subdomains.The algorithm stability, however, is increased because both error estimators have lower variance and thereby allow the algorithm to more reliably identify the       true high error subdomains and choose the split directions that maximize Eq. ( 25).

High-dimensional diffusion problem
The last cast study shows that adSSE for Bayesian model inversion remains feasible in high dimensional problems.The considered forward model is often used as a standard benchmark in UQ computations (Shin and Xiu, 2016;Fajraoui et al., 2017).It represents the 1-D diffusion along a domain with coordinate ξ ∈ [0, 1] given by the following boundary value problem: The concentration field u can be used to describe any steady-state diffusion driven process (e.g., heat diffusion, concentration diffusion, etc.).Assume that the diffusion coefficient κ is a log-normal random field given by κ(ξ, ω) = exp (10 + 3g(ξ)) where g is a standard normal stationary Gaussian random field with exponential autocorrelation function ρ(ξ, ξ ) = exp (−3 |ξ − ξ|).Let g be approximated through a truncated Karhunen-Loève expansion with the pairwise uncorrelated random variables X k denoting the field coefficients and the real valued function e k obtained from the solution of the Fredholm equation for ρ (Ghanem and Spanos, 1991).The truncation variable is set to M = 62 to explain 99% of the variance.
Some realizations of the random field and resulting concentrations are shown in Figure 11.
In this example, the random vector of coefficients X = (X 1 , . . ., X 62 ) shall be inferred using a single measurement of the diffusion field at u(ξ = 1) given by Y = 0.16.The considered model therefore takes as an input a realization of that random vector, and returns the diffusion field at ξ = 1, i.e., M : x → u(1).normal distribution (E [X i ] = 0 and Var [X i |Y] = 1 for i = 1, . . ., 62) it is obvious from this table that the data most significantly affects the first three parameters.
The adSSE approach manages to accurately recover the first two posterior moments at the relatively low cost of N ED = 10,000 likelihood evaluations.The average absolute error for E [X i ] and Var [X i |Y] is approximately 0.02.

Conclusions
Motivated by the recently proposed spectral likelihood expansions (SLE) framework for Bayesian model inversion presented in Nagel and Sudret (2016), we showed that the same analytical post-processing capabilities can be derived when the novel stochastic spectral embedding (SSE) approach from Marelli et al. (2020) is applied to likelihood functions.Because SSE is designed for models with complex local characteristics, it was expected to outperform SLE on practically relevant, highly localized likelihood functions.To further improve SSE performance in Bayesian model inversion applications, we introduced a novel adaptive sampling procedure and modified partitioning strategy.
There are a few unsolved shortcomings of the SSE that will be addressed in future works.
Namely, the discontinuities at the subdomain boundaries may be a source of error that should be addressed.Additionally, for the adaptive SSE algorithm, it is not possible at the moment to specify the optimal N ref parameter a priori.In light of the considerable influence of that parameter as shown in Section 5.2.2, it might be necessary to adaptively adjust it, or decouple this parameter from the termination criterion.
Approximating likelihood functions through local PCEs prohibits the enforcement of strict positivity throughout the function domain.For visualization purposes this is not an issue, as negative predictions can simply be set to 0 in a post-processing step.When computing posterior expectations with Eq. ( 22), however, this can lead to erroneous results such as negative posterior variances.One way to enforce strict positivity is through an initial transformation of the likelihood function (e.g., log-likelihood log L ≈ k∈K f PCE k (X)).This is avoided in the present work because it comes at a loss of the desirable analytical postprocessing properties.
The biggest advantage of SSE, however, is that it poses the challenging Bayesian computation in a function approximation setting.This yields an analytical expression of the posterior distribution and preserves the analytical post-processing capabilities of SLE while delivering highly improved likelihood function approximations.As opposed to many existing algorithms, SSE can even efficiently solve Bayesian problems with multiple posterior modes.
As shown in the case studies, the proposed adaptive algorithm further capitalizes on the compact support nature of likelihood functions and leads to significant performance gains, especially at larger experimental designs.
The proposed algorithm has two parameters: the experimental design size for the residual expansions N ref and the final experimental design size corresponding to the available computational budget N ED .At the initialization of the algorithm N ref points are sampled as a first experimental design.At every further iteration, additional points are then sampled from the prior distribution.These samples are generated in a space-filling way (e.g. through latin hypercube sampling) in the newly created subdomains D +1,s X to have always exactly N ref points available for constructing the residual expansions.The algorithm is terminated, once the computational budget N ED has been exhausted.At every step, the proposed algorithm chooses a single refinement domain from the set of unsplit, i.e. terminal domains, creates two new subdomains by splitting the refinement domain and constructs residual expansions after enriching the experimental design.The selection of this refinement domain is based on the error estimator E k that is defined by

Figure 1 :
Figure 1: Partitioning strategy for a 2D example visualized in the quantile space U .The refinement domain D ,p U is split into two subdomains D +1,s 1 U ) (c) If no new expansions were created, terminate the algorithm, otherwise go back to 2 3. Termination

Figure 2 :
Figure 2: Graphical representation of the first steps of the adaptive SSE algorithm described in Section 4.2.3 for a two-dimensional problem with independent prior distribution.Upper row: partitioning in the quantile space; Lower row: partitioning in the unbounded real space with π(x) contour lines in dashed blue.Red dots show the adaptive experimental design that has a constant size of N ref = 5 in each created subdomain.The terminal domains T are highlighted in orange.The splitting direction in each subdomain is determined randomly in this example.

Figure 3 :
Figure 3: 1-dimensional vibration problem: Sketch of the linear oscillator subject to harmonic (i.e., sinusoidal) excitation.Assume the prior information about its stiffness X def = k is that it follows a lognormal distribution with µ = 0.8 N/m and σ = 0.1 N/m.Its true value shall be determined using measurements of the oscillation amplitude at the location of the mass m.The known properties of the oscillator system are the oscillator mass m = 1 kg, the excitation frequency ω = 1 rad/s and the viscous damping coefficient c = 0.1 N s/m.The oscillation amplitude is measured in five independent oscillation events and normalized by the forcing amplitude yielding the measured amplitude ratios Y = {9.01, (a) 1st iteration, NED = 10 (b) 3rd iteration, NED = 30 (c) 8th iteration, NED = 80 (d) 10th iteration, NED = 100

Figure 4 :
Figure 4: One-dimensional vibration problem: Illustration of the adSSE algorithm approximating the likelihood function L.

Figure 5 :
Figure 5: One-dimensional vibration problem: Comparison of the true multimodal posterior and its adSSE based approximation.

Figure 6 :
Figure 6: Moderate-dimensional heat transfer problem: Model setup and exemplary solution.

( a )Figure 7 :
Figure 7: Moderate-dimensional heat transfer problem: Convergence of the η error measure (Eq.(28)) as a function of the experimental design size N ED in 20 replications for SLE, SSE with a static experimental design and the proposed adSSE approach.We display the two discrepancy standard deviation cases σ = {0.25,0.1} K.

def=
|ξ MCMC − ξ SSE |/ξ MCMC is also shown.Due to the nonstrict positivity of the SSE estimate, one variance estimate computed with Eq. (22) is negative and is therefore omitted from Table2.The full posterior marginals obtained from one run of adSSE with N ED = 30,000 are also compared to those of the reference MCMC and displayed in Figure8.The individual plots show the univariate posterior marginals (i.e.π(x i |Y)) on the main diagonal and the bivariate posterior marginals (i.e.π(x ij |Y)) in the i-th row and j-th column.It can be clearly seen that

Figure 9 :
Figure 9: Moderate-dimensional heat transfer problem: Evolution of the posterior moment estimation for a typical run of the adSSE algorithm on the small discrepancy problem.The thin lines show the MCMC reference solution.

( a )Figure 10 :
Figure 10: Moderate-dimensional heat transfer problem: Convergence of the η error measure (Eq.(28)) as a function of the experimental design size N ED in 20 replications for the proposed adSSE approach with different N ref parameters.We display the two discrepancy standard deviation cases σ = {0.25,0.1} K.

Figure 11 :
Figure 11: High-dimensional diffusion problem: 5 independent realizations of X and the resulting κ and u with the model output u(1) highlighted with a circle.

Figure 12 :
Figure 12: High-dimensional diffusion problem: Convergence of the η error measure (Eq.(28)) over five replications for SLE, SSE with a static experimental design and the proposed adSSE approach.

Figure 13 :
Figure 13: High-dimensional diffusion problem: Comparison plots of the posterior distribution marginals for the first 3 parameters {X 1 , X 2 , X 3 } computed from adSSE (N ED = 10,000) and MCMC (N ED = 10 7 ).The prior marginals are shown in red.

Table 1 :
Moderate-dimensional heat transfer problem: adSSE results with large discrepancy standard deviation σ = 0.25 K. Relative errors w.r.t.MCMC reference solution are shown in brackets.

Table 2 :
Moderate-dimensional heat transfer problem: adSSE results with small discrepancy standard deviation σ = 0.1 K. Relative errors w.r.t.MCMC reference solution are shown in brackets.Field with an asterisk ( * ) indicates negative variance estimate.