Learning model discrepancy: A Gaussian process and sampling-based approach

Predicting events in the real world with a computer model (simulator) is challenging. Every simulator, to varying extents, has model discrepancy: a mismatch between real world observations and the simulator (even when the 'true' parameters are known). Model discrepancy occurs for various reasons, including simplified or missing physics in the simulator, numerical approximations required to compute the simulator outputs, and the fact that assumptions in the simulator are not generally applicable to all real world contexts. The existence of model discrepancy is problematic for the engineer, as calibrating the simulator will lead to biased parameter estimates, and the resulting simulator is unlikely to accurately predict (or even be valid for) various contexts of interest. This paper proposes an approach for inferring model discrepancy that overcomes the non-identifiability problems associated with jointly inferring the simulator parameters and the model discrepancy. Instead, the proposed procedure seeks to identify model discrepancy given some parameter distribution, which could come from a 'likelihood-free' approach that accounts for model discrepancy during calibration, such as Bayesian history matching. In this case, model discrepancy is inferred whilst marginalising out the uncertain simulator outputs via a sampling-based approach, therefore better reflecting the 'true' uncertainty associated with the model discrepancy. Verification of the approach is performed before a demonstration on an experimental case study, comprising a representative five storey building structure. © 2020 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license.

where z ∈ R^(N×1) are real world observations for a set of inputs X ∈ R^(N×d_x). The simulator g(·,·) depends on the inputs X and parameters θ ∈ R^(M×1), whereas the model discrepancy term d(·) is assumed to depend only on the inputs X. Finally, e ∈ R^(N×1) is assumed Gaussian additive noise. The decision about what the inputs X and parameters θ of a simulator are will be application specific; generally, the inputs X are chosen to reflect the measured outputs z, and the parameters θ are additional variables in the simulator that could be tuned. For example, the inputs X could be variables that drive a physical process, such as a force or crack length, or independent variables like spectral lines or frequency bins, with the parameters θ being, for example, material properties, or even mass, stiffness and damping coefficients, such as in direct model updating [18].
The main difference between the proposed method and existing techniques, such as in [1], is that the parameters θ are identified before the functional model discrepancy term d(·) is inferred, rather than jointly inferring both θ and d(·). This decoupling assumption is made because the alternative joint inference problem is susceptible to non-identifiability issues, caused by modelling the model discrepancy d(·) with a Gaussian process whilst inferring the parameters [4,6]. In fact, this modelling choice in the joint approach leads to a rather poor likelihood, where even 'bad' parameter samples are given a high probability in the likelihood function [6], due to the ability of the GP to model any arbitrary function well [9], making the likelihood insensitive. Instead, by decoupling the problem, inferring the parameters and then the model discrepancy, non-identifiability issues and problems with the likelihood can be overcome.
In order to decouple the parameter and model discrepancy inferences, the calibration method must be able to account for model discrepancy in another way. 'Likelihood-free' approaches, such as Bayesian history matching, offer such a technique. These methods incorporate model discrepancy through a notion of distance, removing issues associated with defining a specific likelihood, whilst approximating the parameter posterior distribution p(θ|z). Once this is obtained, model discrepancy can be inferred using a Gaussian process model without affecting the parameter posterior distribution. However, the Gaussian process model must be constructed from the uncertain simulator outputs p(y|X, θ) to the real world observations z. Unfortunately, a Gaussian process with uncertain inputs cannot be constructed in closed form, meaning that a sampling-based solution is required.
It is noted that in the joint inference method, the parameters θ are inferred given an empirical Bayes estimate of the model discrepancy hyperparameters, requiring the Gaussian process to be conditioned on the parameter prior distributions, e.g. ∫ p(z|y, φ, θ) p(θ) dθ [1-7]; this can bias the inferred model discrepancy. This conditioning typically leads to a restricted choice of prior distributions p(θ) that are conjugate with a Gaussian process, such as a Gaussian [1] or uniform [4] distribution; non-conjugate priors require an additional expensive procedure, on top of the parameter estimation, performed in low dimensions by a quadrature approach and in high dimensions by another sampling step [1]. These issues are removed by the decoupled approach proposed in this paper.
By decoupling the inference procedure, the model discrepancy method can also be applied in scenarios where the parameter distribution is obtained by some elicitation process or experimentation. This makes the technique more generally applicable to a wider range of problems outside of those originally considered by the joint approach.
The proposed method assumes that some parameter distribution p(θ|z) has been obtained from a 'likelihood-free' calibration method, or that p(θ) has been acquired from some elicitation process; for simplicity, p(θ) will be used to denote a generic parameter distribution. The approach then seeks to find the additive model discrepancy (and noise) term, modelled using Gaussian process (GP) regression. As the simulator outputs, given the parameter distribution, are uncertain, a sampling-based approach is used to marginalise out the simulator outputs, meaning (potentially calibrated and) bias-corrected model predictions can be made, reflecting the uncertainty from the parameter distribution. A brief outline of the approach is as follows:
1. Obtain N_s samples from the parameter distribution, i.e. for the jth sample, θ^(j) ~ p(θ).
2. Propagate those N_s samples through the simulator to obtain N_s simulator output samples (denoted y ∈ R^(N×1)), i.e. y^(j).
3. Learn a GP mapping for each of the N_s output samples y^(j) to a set of training observations z, i.e. GP^(j): {y^(j), X} → z, and obtain a weight w_j for each regression model; the weights will be formally introduced in Section 2.2.
4. Calculate the weighted average of the set of GP regression models, generating a bias-corrected model prediction.
It is noted that in the case where the simulator is computationally expensive to evaluate, a more computationally efficient emulator, or surrogate model [19], can be constructed. This efficient approximation can be sampled instead of the simulator in step two, where any emulator technique within the literature could be implemented [19][20][21]; in this paper a Gaussian process emulator is utilised.
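The four steps above can be sketched numerically. The snippet below is a minimal illustration only, assuming a hypothetical one-dimensional toy simulator, a synthetic discrepancy and fixed GP hyperparameters (the paper optimises them; see Section 2.1); none of these choices come from the paper's case studies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem (not the paper's beam simulator): a 1-D input
# grid, a stand-in simulator g(x, theta) and synthetic observations z that
# include an unknown discrepancy d(x).
x = np.linspace(0.1, 0.9, 13)
def g(x, theta):
    return theta * np.sin(2.0 * np.pi * x)
d_true = 0.2 * (1.0 - x)                        # assumed discrepancy, illustration only
z = g(x, 1.0) + d_true + rng.normal(0.0, 0.01, x.size)

def matern32(A, B, ls, sf=1.0):
    # Matern 3/2 kernel with per-dimension lengthscales ls.
    r = np.sqrt((((A[:, None, :] - B[None, :, :]) / ls) ** 2).sum(-1))
    return sf**2 * (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

def gp_mean(train_in, z_train, test_in, sn2=1e-4):
    # Posterior mean of a GP mapping {y, x} -> z (fixed hyperparameters).
    ls = np.array([0.5, 0.3])
    K = matern32(train_in, train_in, ls) + sn2 * np.eye(len(train_in))
    Ks = matern32(test_in, train_in, ls)
    return Ks @ np.linalg.solve(K, z_train)

# Steps 1-4 of the outline: sample theta, run the simulator, learn a GP per
# sample, then form the weighted average of the GP predictions.
Ns = 50
theta = rng.normal(1.0, 0.05, Ns)               # step 1: theta_j ~ p(theta)
preds = []
for th in theta:
    y = g(x, th)                                # step 2: simulator outputs y_j
    train_in = np.column_stack([y, x])
    preds.append(gp_mean(train_in, z, train_in))  # step 3: GP_j : {y_j, x} -> z
w = np.ones(Ns) / Ns                            # equal weights (see Section 2.2)
z_corrected = w @ np.array(preds)               # step 4: weighted average
```

For an expensive simulator, step two would instead sample a GP emulator of g, exactly as noted above.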

Gaussian process regression
Model discrepancy is modelled in this paper by GP regression as it is a flexible, nonparametric tool, and because it has a Bayesian formulation allowing the uncertainty associated with the inferred functional form to be estimated [22,9]. These properties, the ability to approximate any unknown function well whilst quantifying the uncertainty in the prediction, are useful as the functional form of the model discrepancy is unknown a priori and quantifying the uncertainty associated with this form may aid simulator developers in targeting improvements to their computer model. In addition, by decoupling the inference problem, the choice of modelling model discrepancy with a Gaussian process no longer affects the likelihood in the parameter inference stage, making it a more suitable assumption.
The model discrepancy term is assumed in Eq. (1) to be additive, meaning it can be formed as a map from the simulator outputs y and inputs X to the observational data z, GP: {y, X} → z, where the inferred model discrepancy GP can be related back to the inputs as d(X). For this reason, GP regression is introduced in this section with the simulator outputs y forming part of the GP inputs along with X, and with the noisy observations z as the GP outputs.
A Gaussian process places a prior distribution over a latent function f(y, X) (of the noisy function z(y, X)),

f(y, X) ~ GP(m(·), k(·,·)),

where GP(·,·) is a Gaussian process with a mean function m(·) and covariance function k(·,·), which define the prior belief about the types of possible functions that could model the function f ∈ R^(N×1). Here a zero mean function is assumed, i.e. m(y, X) = 0, although it is trivial to add a non-zero mean function. The covariance function defines the correlation between any two points in the input space (hence being a function of (y, X) and (y′, X′)) in a Reproducing Kernel Hilbert Space (RKHS) and is fully specified by a set of hyperparameters φ, i.e. K = k(·,·; φ). The covariance function utilised in this paper is a Matérn 3/2 covariance function, as it is ideal for modelling relatively 'smooth' real world functions, being once mean square differentiable [23], and is defined as,

k((y, X), (y′, X′)) = σ_f² (1 + √3 r) exp(−√3 r),

where

r² = (y − y′)ᵀ L_y⁻² (y − y′) + (X − X′)ᵀ L_x⁻² (X − X′),

and where K_{f,f} ∈ R^(N×N) is the covariance matrix for inputs y ∈ R^(N×1) and X ∈ R^(N×d_x), σ_f² is the signal variance hyperparameter, and L_y = diag(l_y1, …, l_yd) and L_x = diag(l_x1, …, l_xd) are lengthscale hyperparameters (making the covariance function an automatic relevance determination prior, i.e. it reduces the effect of redundant inputs). The covariance structure here separates out y and X, allowing them to have an independent relationship with the outputs z. The hyperparameter vector for the Matérn 3/2 covariance function is therefore φ = {σ_f, L_y, L_x}. It is noted that the notation K_{f,*} = k((y, X), (y*, X*)) is used, where f indicates training and * test data.

In order to make predictions, the joint Gaussian distribution is formed between a set of training data D = {{y, X}, z} and testing data {{y*, X*}, z*}, assuming a Gaussian likelihood, where I_f ∈ R^(N×N) and I_* ∈ R^(N*×N*) are identity matrices and σ_n² is a Gaussian noise variance. Following standard Gaussian conditionals, the posterior for a Gaussian process regression model can be formed as,

E(z*) = K_{*,f} (K_{f,f} + σ_n² I_f)⁻¹ z,
V(z*) = K_{*,*} + σ_n² I_* − K_{*,f} (K_{f,f} + σ_n² I_f)⁻¹ K_{f,*}.   (7)

Conventionally, GP models are inferred by taking a type-II maximum likelihood approach [9], i.e. finding the hyperparameters that maximise the marginal likelihood, leading to an empirical Bayes estimate of the hyperparameters φ̂. By combining the noise variance with the set of covariance function hyperparameters, i.e. φ = {σ_f, L_y, L_x, σ_n²}, the empirical Bayes estimates for the set of hyperparameters may be found through optimisation (here a global optimisation approach is used, specifically quantum particle swarm [24]) by minimising the negative log marginal likelihood,

−log p(z|y, X, φ) = ½ zᵀ (K_{f,f} + σ_n² I_f)⁻¹ z + ½ log|K_{f,f} + σ_n² I_f| + (N/2) log 2π.   (9)

It is noted that a fully Bayesian analysis would require marginalisation of the hyperparameters, which is not solvable in closed form due to the dependence of the covariance function on the hyperparameters. However, the fully Bayesian solution may be inferred from a sampling-based approach [25][26][27] and is explored in the following section.
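The posterior predictive moments and the negative log marginal likelihood above can be sketched directly. This is a minimal single-input illustration with assumed, hand-picked hyperparameter values (the paper optimises them with quantum particle swarm, which is not reproduced here).

```python
import numpy as np

# Minimal GP regression sketch: Matern 3/2 kernel, posterior mean/covariance
# in the form of Eq. (7), and the negative log marginal likelihood of Eq. (9).
def matern32(A, B, lengthscales, sf):
    r = np.sqrt((((A[:, None, :] - B[None, :, :]) / lengthscales) ** 2).sum(-1))
    return sf**2 * (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

def gp_posterior(Xf, z, Xs, lengthscales, sf, sn2):
    Kff = matern32(Xf, Xf, lengthscales, sf) + sn2 * np.eye(len(Xf))
    Ksf = matern32(Xs, Xf, lengthscales, sf)
    Kss = matern32(Xs, Xs, lengthscales, sf) + sn2 * np.eye(len(Xs))
    mean = Ksf @ np.linalg.solve(Kff, z)
    cov = Kss - Ksf @ np.linalg.solve(Kff, Ksf.T)
    return mean, cov

def neg_log_marginal_likelihood(Xf, z, lengthscales, sf, sn2):
    Kff = matern32(Xf, Xf, lengthscales, sf) + sn2 * np.eye(len(Xf))
    _, logdet = np.linalg.slogdet(Kff)
    return 0.5 * (z @ np.linalg.solve(Kff, z) + logdet + len(z) * np.log(2 * np.pi))

# Toy data: recover a smooth function from noisy observations.
rng = np.random.default_rng(1)
Xf = np.linspace(0.0, 1.0, 20)[:, None]
z = np.sin(3.0 * Xf[:, 0]) + rng.normal(0.0, 0.05, 20)
mean, cov = gp_posterior(Xf, z, Xf, np.array([0.3]), 1.0, 0.05**2)
nlml = neg_log_marginal_likelihood(Xf, z, np.array([0.3]), 1.0, 0.05**2)
```

In practice `neg_log_marginal_likelihood` would be minimised over (lengthscales, σ_f, σ_n²) by any global optimiser to obtain the empirical Bayes estimate φ̂.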

Sampling-based approach
The method outlined in this paper utilises GP regression to identify the map GP: {y, X} → z, and therefore to infer the model discrepancy term d(X). However, the output from a simulator y will typically be uncertain, i.e. p(y|X, θ), arising from the parametric uncertainty in p(θ), and Gaussian process regression cannot be solved in closed form for uncertain inputs; even though the simulator inputs X are deterministic, the simulator outputs y are uncertain. To create bias-corrected predictions that account for this parametric uncertainty, the simulator outputs y* must be integrated out, forming the following integral,

p(z*|X*, θ, D, φ) = ∫ p(z*|y*, X*, D, φ) p(y*|X*, θ) dy*,   (10)

where p(z*|X*, θ, D, φ) is the bias-corrected predictive output and p(y*|X*, θ) is the simulator prediction at test inputs X*, conditioned on the parameter distribution p(θ). It is noted that in previous work [8] the bias-corrected outputs have been approximated using the maximum a posteriori (MAP) estimate of the parameters (and an empirical Bayes estimate of the hyperparameters φ̂), i.e. p(z*|X*, θ_MAP, D, φ̂). However, this will not account for the complete parametric uncertainty from p(θ) and may result in a biased estimate of the model discrepancy. The proposed method seeks to approximate Eq. (10) via a sampling-based approach, specifically from an importance sampling viewpoint. Importance sampling is a technique for obtaining unbiased estimates of expectation integrals [28], such as in Eq. (10), and can be generalised as,

E_p[f(x)] = ∫ f(x) p(x) dx = ∫ f(x) (p(x)/q(x)) q(x) dx = E_q[f(x) p(x)/q(x)],   (11)

where f(x) is a function, p(x) the nominal distribution over the variable x and q(x) the proposal distribution, with X ~ q denoting independent draws from the proposal distribution. The expectation in Eq. (11) can be approximated as,

E_p[f(x)] ≈ (1/N) Σ_{i=1}^{N} f(X_i) w(X_i),   X_i ~ q,

where N is the number of samples and w(X) = p(X)/q(X) are the importance weights [28]. Eq. (10) can be approximated by setting the nominal and proposal distributions equal to the simulator output predictive distribution p(y*|X*, θ).
This means that N_s samples can be obtained from the parameter distribution, i.e. θ^(j) ~ p(θ), and propagated through the simulator to obtain output samples y^(j), meaning that the weight for each sample equals one, i.e. w(y*^(j)) = 1 (this is effectively the same as approximating the integral via Monte Carlo sampling; however, the language of importance sampling will be useful later in this section). The predictive distribution p(z*|X*, θ, D, φ) can now be approximated using the set of weights and the Gaussian process predictions for each sample, via the laws of total expectation and variance,

E(z*) ≈ (Σ_j w_j E_j(z*)) / (Σ_j w_j),
V(z*) ≈ (Σ_j w_j (V_j(z*) + E_j(z*)²)) / (Σ_j w_j) − E(z*)²,

where E_j(z*) and V_j(z*) are the GP predictive mean and covariance from Eq. (7) for the jth sample. The bias-corrected predictions are approximately Gaussian, given that they are formed from weighted averaged Gaussian processes. The method is outlined in Algorithm 1. The main computational expense is in sampling the simulator outputs, which can be reduced by replacing the simulator with a computationally efficient emulator [19][20][21]. One problem with this approach is that the predictions are still dependent on a set of hyperparameters (that have been inferred from the GP associated with the parameter MAP estimates). However, these hyperparameters can also be marginalised out of the predictive equations using importance sampling, as discussed below.
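The weighted combination by the laws of total expectation and variance can be written as a small helper. The per-sample predictive moments below are illustrative numbers, not outputs from the paper's simulators.

```python
import numpy as np

# Combine per-sample GP predictive moments into one bias-corrected prediction
# using the laws of total expectation and variance. means[j] and variances[j]
# are the GP predictive mean and (diagonal) variance for the j-th sample.
def combine(means, variances, weights):
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                  # self-normalise weights
    mean = np.einsum('j,jn->n', w, means)            # E[z*] = E_j[E_j(z*)]
    second = np.einsum('j,jn->n', w, variances + means**2)
    var = second - mean**2                           # V[z*] = E[V] + V[E]
    return mean, var

# Illustrative moments for 3 parameter samples at 2 test points.
means = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]])
variances = np.full((3, 2), 0.01)
mean, var = combine(means, variances, np.ones(3))    # equal weights, w_j = 1
```

With equal weights this reduces to the Monte Carlo mixture described above; the same function handles the unequal marginal-likelihood weights introduced later.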

Algorithm 1. Model discrepancy inference dependent on empirical Bayes estimates of φ̂
The following integrals can be solved to generate a bias-corrected prediction that depends on neither the simulator outputs nor the GP hyperparameters, requiring the posterior of the hyperparameters p(φ|D), which can also be approximated via importance sampling. By solving Eq. (14), rather than optimising the GP via a type-II maximum likelihood technique, the fully Bayesian solution for the hyperparameters can be acquired.
The posterior distribution of the hyperparameters, p(φ|D) ∝ p(z|y, X, φ) p(φ) (where D = {{y, X}, z}), can be approximated with importance sampling by setting the unnormalised nominal distribution as p(z|y, X, φ) p(φ). By keeping the proposal and nominal distributions for the parameters the same as in the first approach, and by setting the proposal distribution for the hyperparameters equal to the prior distribution, i.e. φ^(k) ~ p(φ) = q(φ), the weight for each of the N_φ hyperparameter samples is equal to p(z|y, X, φ^(k)), i.e. the marginal likelihood of the GP model (in negative log-form in Eq. (9)). This formulation allows both y* and φ to be integrated out with importance sampling, forming p(z*|X*, θ, D); the overall procedure is outlined in Algorithm 2. It is noted that these approaches rely on importance sampling, which will suffer from the curse of dimensionality as the dimensions of y* and φ increase, as it becomes less likely that a given sample will carry meaningful weight. This issue can be mitigated by an adaptive approach [26], which is left for further research.
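The hyperparameter marginalisation step can be sketched as follows: draw φ from the prior and weight each draw by the GP marginal likelihood. The kernel, the toy data and the log-normal lengthscale prior are illustrative assumptions, not the paper's choices.

```python
import numpy as np

# Sketch of marginalising GP hyperparameters by importance sampling: the
# proposal equals the prior, so each draw's weight is the GP marginal
# likelihood p(z | y, X, phi^(k)), self-normalised across draws.
def matern32(a, b, l, sf):
    r = np.abs(a[:, None] - b[None, :]) / l
    return sf**2 * (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

def log_marginal(x, z, l, sf, sn2):
    K = matern32(x, x, l, sf) + sn2 * np.eye(x.size)
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (z @ np.linalg.solve(K, z) + logdet + x.size * np.log(2 * np.pi))

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 15)
z = np.sin(3.0 * x) + rng.normal(0.0, 0.05, 15)

Nphi = 200
lengthscales = np.exp(rng.normal(np.log(0.3), 0.5, Nphi))  # assumed log-normal prior
logw = np.array([log_marginal(x, z, l, 1.0, 0.05**2) for l in lengthscales])
w = np.exp(logw - logw.max())     # subtract max for numerical stability
w /= w.sum()                      # self-normalised importance weights
ess = 1.0 / np.sum(w**2)          # effective sample size: weight-degeneracy diagnostic
```

The effective sample size makes the curse-of-dimensionality remark above concrete: as the hyperparameter dimension grows, the ESS collapses towards one unless an adaptive proposal is used.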

Obtaining the parameter distribution
As aforementioned, the proposed model discrepancy inference approach is primarily designed to accompany a 'likelihood-free' calibration process, where model discrepancy is accounted for through a distance measure. These approaches allow p(θ) to be obtained without the GP model discrepancy influencing the posterior parameter distribution, removing problems associated with non-identifiability. This section briefly introduces Bayesian history matching as one such approach that can be used in combination with the proposed procedure (for more details on Bayesian history matching the interested reader is referred to [8]).

Bayesian history matching
Bayesian history matching (BHM) is an approximate Bayesian approach for calibrating statistical models of the form in Eq. (1). The method seeks to determine whether parameter combinations are 'implausible', i.e. unlikely to have produced the observations z, based on a criterion such that the remaining non-implausible parameter space is identified, leading to an approximation of the posterior distribution p(θ|z). The criterion for discarding implausible samples is a combination of an implausibility metric and a threshold T, where the implausibility metric accounts for model discrepancy through a notion of distance.
BHM assumes that the simulator is computationally expensive to evaluate, and hence replaces the simulator with a computationally efficient GP emulator, g(X, θ) ≈ GP: {X, θ} → y. This replacement of the simulator by an emulator is possible because a GP estimates the uncertainty associated with the approximation through the predictive variance V(y(X, θ)) (from Eq. (7)). This means that parameter combinations are only discarded when the approximation is certain enough, given the other uncertainties in the implausibility metric. The implausibility metric assesses the distance between the mean emulator prediction E(g*) and the observed data z, weighted by several uncertainties,

I(θ) = |E(g*) − z| / √(V(g*) + V_o + V_m),   (16)

where V_o is the observational uncertainty and V_m the prior model discrepancy uncertainty. The threshold is set given the statistical properties of the implausibility metric [8]; e.g. for Eq. (16) the threshold can be set as T = 3 given Pukelsheim's 3σ rule [29]. By updating the emulator approximation at each iteration, the criterion can discard more parameter combinations with increased confidence. Finally, convergence is reached when either the uncertainty in the emulator is lower than the remaining uncertainties, i.e. V_o + V_m > V(y(X, θ)), or all the parameters are discarded.
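The implausibility criterion is straightforward to evaluate once the emulator moments are available. The snippet below is a minimal sketch; the emulator means, variances and uncertainty magnitudes are illustrative numbers, not values from the paper.

```python
import numpy as np

# Sketch of the BHM implausibility criterion: a parameter sample is kept as
# non-implausible when the normalised distance between the emulator mean and
# the observation falls below the threshold T = 3 (Pukelsheim's 3-sigma rule).
def implausibility(z_obs, emu_mean, emu_var, V_o, V_m):
    return np.abs(emu_mean - z_obs) / np.sqrt(emu_var + V_o + V_m)

z_obs = 1.0                                # a single observed output
T = 3.0                                    # threshold from the 3-sigma rule
emu_means = np.array([0.9, 1.4, 2.5])      # emulator means for three theta samples
emu_vars = np.array([0.01, 0.01, 0.01])    # emulator predictive variances
I = implausibility(z_obs, emu_means, emu_vars, V_o=0.001, V_m=0.02)
non_implausible = emu_means[I < T]         # samples surviving this wave
```

In practice this test is applied wave by wave, refitting the emulator on the surviving region each time, until the emulator variance falls below V_o + V_m or no samples remain.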

Case study: numerical verification problems
The proposed two-stage calibration and model discrepancy inference was verified on two numerical case studies: one where x ∈ R^(N×1) and the other where x ∈ R^(N×2). In addition, the first numerical case study is used to benchmark the proposed decoupled approach, using Bayesian history matching and the sampling-based model discrepancy procedure, against the hierarchical Bayesian model formulation, where the model discrepancy and model parameters are jointly inferred. In both case studies the simulator modelled the tip deflection of a cantilever beam subject to an open crack, with a point force of 10 kN at the tip. The stiffness reduction model for an open crack used in this case study was that proposed by Christides and Barr [30], where the stiffness along the beam EI(·) is a function of the length along the beam x, Young's modulus E, the second moment of area for the undamaged beam I_0, the beam thickness t, the crack location l_loc and α, a coefficient experimentally defined by Christides and Barr as 0.667. The constant C = (I_0 − I_c)/I_c is a function of the undamaged and damaged second moments of area, I_0 and I_c, which for a rectangular beam are I_0 = wt³/12 and I_c = w(t − l_cr)³/12, where w is the beam width and l_cr is the crack length. The tip deflection was numerically estimated via the Euler-Bernoulli bending beam equation, d²y/dx² = M(x)/EI(x), where M(·) is the moment along the beam. In both case studies the beam used in the analysis was rectangular with the following dimensions: l = 1 m, w = 0.5 m and t = 0.1 m.
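A numerical tip-deflection computation of this kind can be sketched as below. The exponential decay form of the Christides and Barr stiffness reduction, EI(x) = E·I_0 / (1 + C·exp(−2α|x − l_loc|/t)), is the commonly quoted version of that model and is an assumption here rather than the paper's exact implementation; the crack location is also an illustrative choice.

```python
import numpy as np

# Cracked cantilever tip deflection via Euler-Bernoulli, y'' = M(x) / EI(x),
# with a Christides-and-Barr-style local stiffness reduction at the crack.
L, w, t = 1.0, 0.5, 0.1            # beam dimensions (m), from the case study
P = 10e3                           # 10 kN point load at the tip
E = 68e9                           # Young's modulus (Pa)
l_loc, l_cr, alpha = 0.35, 0.038, 0.667   # l_loc is an illustrative choice

I0 = w * t**3 / 12.0               # undamaged second moment of area
Ic = w * (t - l_cr)**3 / 12.0      # damaged second moment of area
C = (I0 - Ic) / Ic

x = np.linspace(0.0, L, 2001)
# Assumed exponential form of the Christides and Barr stiffness reduction:
EI = E * I0 / (1.0 + C * np.exp(-2.0 * alpha * np.abs(x - l_loc) / t))
M = P * (L - x)                    # bending moment for a tip point load
curvature = M / EI                 # y'' = M / EI

# Integrate twice (trapezoidal) with clamped boundary: y(0) = y'(0) = 0.
dx = np.diff(x)
slope = np.concatenate(([0.0], np.cumsum(0.5 * (curvature[1:] + curvature[:-1]) * dx)))
deflection = np.concatenate(([0.0], np.cumsum(0.5 * (slope[1:] + slope[:-1]) * dx)))
tip = deflection[-1]

# Undamaged closed-form tip deflection for comparison: P L^3 / (3 E I0).
tip_undamaged = P * L**3 / (3.0 * E * I0)
```

The cracked beam deflects slightly more than the undamaged closed-form value, with the extra compliance localised around l_loc, which is the behaviour the calibration inputs in the next section exploit.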

Numerical case study: one input problem
The first illustrative case study considers a scenario where the input was the crack location, x = l_loc, the parameters were Young's modulus E and crack length l_cr, i.e. θ = {l_cr, E}, and the output (both from the simulator and experiments) was the tip deflection. In this analysis the true parameters were defined as l_cr = 38 mm and E = 68 GPa, i.e. θ̂ = {38, 68}; the simulator evaluated at these parameters is depicted in Fig. 1. The training inputs for both the simulator and experimental data were 13 equally spaced points from 0.1 m to 0.9 m, and the simulator parameters were evaluated between 0 mm and 50 mm, and 50 GPa and 90 GPa, resulting in 25 equally spaced data points. These training inputs were used to construct an emulator with a linear mean and Matérn 3/2 covariance. The model discrepancy was defined as,

d(x) = 0.3(1.5 − x) sin(1.8(x − 0.2) · 2π),   (19)

shown in Fig. 1. The experimental data z were formed from the simulator output plus the additive model discrepancy, where the observation noise was Gaussian distributed with variance 0.001; the experimental data are displayed in Fig. 1. The prior model discrepancy uncertainty (used in BHM) was V_m = 0.05, reflecting the expected magnitude of the model discrepancy, where the error bars on the experimental data in Fig. 1 show the total prior uncertainties. BHM was used to find the approximate posterior distribution, shown in Fig. 2, where the true parameters (shown in red) are close to the mode of the joint posterior distribution. Samples from the posterior distribution are shown in Fig. 1, where the output from the mode of the posterior distribution is visually in good agreement with the output at the true parameter values.

The proposed model discrepancy inference procedure was subsequently run for three scenarios:
1. Using a MAP estimate of the simulator parameters θ_MAP and an empirical Bayes estimate of the GP hyperparameters φ̂ (when the simulator output for θ_MAP is used).
2. Marginalising out the simulator outputs y* using importance sampling, with an empirical Bayes estimate of the GP hyperparameters φ̂ (when the simulator output for θ_MAP is used); N_s = 1000.
3. Marginalising out both the simulator outputs y* and GP hyperparameters φ via importance sampling; N_s = 1000.
For each scenario a zero mean and Matérn 3/2 covariance function were used. The hierarchical Bayesian approach was also applied to the same training dataset, where the prior parameter distributions were l_cr ~ N(35, 10) and E ~ N(70, 36), and the model discrepancy Gaussian process was also modelled with a Matérn 3/2 covariance function. These prior parameter distributions are more informative than those considered in BHM (which can be understood as a uniform prior); a comparison is shown in Fig. 3. The reason for using more informative priors in the hierarchical Bayesian approach than in BHM is due to the findings of Brynjarsdóttir and O'Hagan [6]: to reduce the problem of non-identifiability, one solution is to apply more informative priors, constraining the posterior when the likelihood is relatively flat as a result of the model discrepancy Gaussian process, although it is noted that obtaining more informative priors is challenging in practical applications. Other aspects of the hierarchical Bayesian analysis were kept the same as in the BHM approach, such that objective comparisons could be made. The posterior distributions from the hierarchical Bayesian approach were obtained via an adaptive Markov chain Monte Carlo scheme [31,32], such that 100,000 posterior samples were obtained with a 50,000 sample burn-in period. The autocorrelations of the chains were checked for stationarity in order to confirm convergence. The posterior distribution is shown in Fig. 4, where it can be seen that the parametric uncertainties are much larger than in the posterior distribution from Bayesian history matching (BHM).
This effect arises because the likelihood dominates the posterior distribution, enlarging the region of probable parameter values due to the insensitive likelihood function [4,6], which itself arises from modelling the model discrepancy as a Gaussian process during the joint inference approach.
The results from the three decoupled approaches and the hierarchical Bayesian approach are shown in Fig. 5, where it can be seen that all of the methods have managed to predict the tip deflection well, reflected in the low normalised mean squared errors (NMSEs) in Table 1 for a 200 point independent test dataset. However, due to the large parametric uncertainty in the posterior distribution from the hierarchical Bayesian approach, the inferred model discrepancy not only has a large variance, but also a less accurate mean prediction, with an NMSE of 39.349, over 17 times larger than the highest NMSE from the decoupled approaches. This shows the challenges an insensitive likelihood causes in the inference process, and why a decoupled solution is one approach that can be used to overcome these challenges. Furthermore, the hierarchical Bayesian approach has an underestimated predictive variance for the tip deflection d_tip, with a large number of data points, particularly around the first peak at 0.36 m, exceeding a three standard deviation interval. This relatively simple numerical case study shows the problems with a hierarchical Bayesian approach and further motivates the need for alternative solutions to the model discrepancy inference problem, such as the decoupled approach proposed in this paper. In terms of comparing the three decoupled approaches, the main difference, as expected, is in the estimated uncertainty for the model discrepancy. Scenario one has the smallest uncertainty, with a larger number of experimental test data points outside a 3σ range when compared to the other two scenarios. The first scenario is also overconfident in the model discrepancy predictions, especially around 0.1 m, where the true model discrepancy is outside of the 3σ range.
Scenarios two and three increase the uncertainty in the model discrepancy, reflecting the parameter uncertainty in the posterior distribution, meaning the true model discrepancy remains within the 3σ range. The NMSE is lowest for the model discrepancy in scenario three, with scenario two producing the largest error in its mean prediction. It can be argued from these results that scenario one is overconfident: although its mean prediction is better than scenario two's, its distribution could be misleading and less helpful to the engineer by not reflecting the true uncertainty in the analysis. Finally, the posterior of the Gaussian process hyperparameters is obtained as part of scenario three, and presented in Fig. 6.

Numerical case study: two input problem
The second case study considers a scenario with multiple inputs, X = {l_loc, l_len}, i.e. crack location and length, where the output is the tip deflection. The parameter in this analysis is Young's modulus, θ = E, where the true parameter value is 68 GPa. The training inputs for both the simulator and experimental data were 64 data points evenly spaced between 0.1 m and 0.5 m, and 0 mm and 50 mm (for the crack location and length respectively), where the outputs are shown in Fig. 7. The simulator parameters were evaluated at four points between 50 GPa and 90 GPa, and an emulator was constructed using a linear mean and Matérn 3/2 covariance function. The model discrepancy was defined as a function of the two inputs, displayed in Fig. 7. Again the experimental data, shown in Fig. 7, were formed from the simulator at the true parameters plus the model discrepancy, with Gaussian additive noise with a variance of 0.001. The prior model discrepancy variance was V_m = 0.05. The approximate posterior from BHM is presented in Fig. 7, where the difference between the mode and the true parameter value is 0.9%. Samples from the posterior are shown in Fig. 7, showing the simulator has been adequately calibrated.
The model discrepancy inference procedure was run for three scenarios:
1. Using a MAP estimate of the simulator parameters θ_MAP and an empirical Bayes estimate of the GP hyperparameters φ̂ (when the simulator output for θ_MAP is used).
2. Marginalising out the simulator outputs y* using importance sampling, with an empirical Bayes estimate of the GP hyperparameters φ̂ (when the simulator output for θ_MAP is used); N_s = 1000.
3. Marginalising out both the simulator outputs y* and GP hyperparameters φ via importance sampling; N_s = 1000.
For each scenario a zero mean and a Matérn 3/2 automatic relevance determination covariance function were used. The results in Fig. 8 show adequate predictions for all the scenarios, with scenario three producing the lowest predictive NMSE on a 400 point independent test dataset (see Table 2). Interestingly, the second approach achieves a lower NMSE on the model discrepancy when compared to scenario three (with both performing better than scenario one). It can be seen in Fig. 8 that the model discrepancy has been correctly captured by all three approaches. Finally, scenario three obtained the posterior hyperparameter distribution, useful in understanding the extracted model discrepancy (see Fig. 9).

Case study: five storey shear structure
In order to demonstrate the effectiveness of the proposed approach an experimental case study is presented. This case study seeks to infer the model discrepancy of a modal finite element model used to predict the change in natural frequency when different masses are applied to the fourth floor (simulating a damage scenario). Estimation of the parameter distribution was performed using Bayesian history matching as outlined in [8]. A brief overview of the calibration process is introduced below, where the reader is referred to [8] for more details.

Calibration using Bayesian history matching
Bayesian history matching was applied to infer the material properties θ = {E, ν, ρ} (Young's modulus, Poisson's ratio and density) of a finite element model of a five storey shear structure (g(·,·)), known to have model-form errors due to modelling simplifications (particularly of the boundary condition between the structure and its fixing). The observational data were the first five bending modes of a representative building structure, z = {ω_1, …, ω_5}, constructed from aluminium 6082, depicted in Fig. 10. These data were obtained via modal testing, where an electrodynamic shaker applied a Gaussian noise excitation with a bandwidth of 409.6 Hz, and five uniaxial accelerometers were used to capture the acceleration response at each of the five floors (where the sample rate and duration were chosen to give a frequency resolution of 0.05 Hz). Masses were incrementally added to the fourth floor of the structure, m = {0, 0.1, …, 0.5} kg, representing pseudo-damage, and were treated as the inputs in this analysis, i.e. x = m. Ten repeat estimates of the natural frequencies were obtained for each mass, providing a representation of the observational uncertainty.
Calibration was performed on training data, which were the ten repeat observations of the bending natural frequencies when x = {0, 0.3, 0.5}. The testing data were the ten repeat observations of the bending natural frequencies when x* = {0.1, 0.2, 0.4}. The prior parameter bounds were ±15% of typical material properties for aluminium 6082: E = 71 GPa, ν = 0.33, ρ = 2770 kg/m³. These parameter bounds behave in a similar way to a uniform prior over the space. The approximate posterior distribution of the parameters, identified from the Bayesian history matching analysis, is displayed in Fig. 11.
Samples of the simulator output distribution (for each of the five natural frequencies) y are shown in Fig. 12, where the error bars correspond to the prior model discrepancy and observational uncertainties. It is clear from Fig. 12 that there is a large amount of model discrepancy for the first natural frequency.

Model discrepancy inference
Inference of the model discrepancy from the Bayesian history matching analysis in Section 5.1 was performed using three approaches:

1. Using a MAP estimate of the simulator parameters θ_MAP and an empirical Bayes estimate of the GP hyperparameters φ (when the simulator output for θ_MAP is used).
2. Marginalising out the simulator outputs y* using importance sampling, with an empirical Bayes estimate of the GP hyperparameters φ (when the simulator output for θ_MAP is used); N_s = 1000.
3. Marginalising out both the simulator outputs y* and the GP hyperparameters φ via importance sampling; N_s = 1000.

For each of the three methods the model discrepancy was inferred as a map GP_i : {Y, X} → z_i, ∀i ∈ {1, …, 5}. It is noted that a multiple-output Gaussian process could be implemented [33], meaning only one map would need to be inferred from {Y, X} to Z. This would not change the general formulation of the approach and is therefore left for further research. The priors for each of the five GP models were zero mean functions with Matérn 3/2 automatic relevance determination covariance functions (specified by Eqs. (3)-(5)).
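The Matérn 3/2 automatic relevance determination covariance used for the GP priors can be sketched in a few lines of NumPy; the function name and interface below are illustrative, not the authors' code:

```python
import numpy as np

def matern32_ard(X1, X2, signal_var, lengthscales):
    """Matern 3/2 covariance with automatic relevance determination:
    one lengthscale per input dimension (here the simulator output y
    and the input x would each receive their own lengthscale)."""
    X1s = X1 / lengthscales  # scale each dimension independently
    X2s = X2 / lengthscales
    # pairwise Euclidean distances in the scaled space
    d = np.sqrt(np.maximum(
        np.sum(X1s**2, 1)[:, None] + np.sum(X2s**2, 1)[None, :]
        - 2.0 * X1s @ X2s.T, 0.0))
    return signal_var * (1.0 + np.sqrt(3.0) * d) * np.exp(-np.sqrt(3.0) * d)

X = np.linspace(0.0, 1.0, 5)[:, None]
K = matern32_ard(X, X, signal_var=1.0, lengthscales=np.array([0.5]))
# K is a symmetric positive semi-definite Gram matrix with
# signal_var on the diagonal
```

Shrinking a dimension's lengthscale increases the kernel's sensitivity to that dimension, which is what allows ARD to indicate which inputs drive the discrepancy.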
For the third approach - the marginalisation of both the simulator outputs and GP hyperparameters - independent Gaussian priors were defined for each hyperparameter in the set. The priors for the lengthscale hyperparameters assumed that the process will change slowly with the input (i.e. large lengthscales), with the Gaussian priors placed over the log lengthscales, log l.

The calibrated and bias-corrected predictions for each of the three approaches are displayed in Fig. 13, with the inferred model discrepancies shown in Fig. 14. Firstly, it can be seen that the mean predictions of all three approaches, both in terms of their output predictions and inferred model discrepancies, are visually similar. The main difference between the three approaches is their estimation of the uncertainty, with significant differences in the inferred model discrepancy uncertainty. The second and third methods have propagated the posterior parameter uncertainty through to the model discrepancy and the output predictions, unlike the first approach, which collapses this uncertainty down to the parameter MAP estimates. This increase in uncertainty in the model discrepancy from methods two and three is useful to the engineer as it provides a better reflection of the underlying model discrepancy and will be helpful in identifying simulator improvements, as the 'true' model discrepancy is more likely to be contained within the confidence intervals. This is particularly clear given the results in the numerical case studies, where method one resulted in model discrepancy predictions where the 'true' model discrepancy occasionally exceeded 3σ. In terms of output prediction, each of the methods visually appears to have captured the noise process, with the third method showing increases in uncertainty outside of the training observations.
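The weighted averaging underlying methods two and three can be illustrated with a small sketch: each simulator-output sample yields a GP predictive mean and variance, and the importance weights combine them into a Gaussian mixture whose moments follow from the law of total variance. The names and numbers below are hypothetical:

```python
import numpy as np

def mixture_moments(means, variances, weights):
    """Mean and variance of a weighted mixture of Gaussian predictive
    distributions, one per simulator-output sample y*_s.
    means, variances: (N_s, N_test); weights: (N_s,), unnormalised."""
    w = weights / weights.sum()
    mu = w @ means                            # mixture mean
    var = w @ (variances + means**2) - mu**2  # law of total variance
    return mu, var

# hypothetical example: three GP maps with equal importance weights
means = np.array([[1.0, 2.0], [1.2, 2.1], [0.8, 1.9]])
variances = np.full_like(means, 0.04)
mu, var = mixture_moments(means, variances, np.ones(3))
# the mixture variance exceeds the within-model variance wherever the
# per-sample means disagree, reflecting the propagated parameter uncertainty
```

This is why the bias-corrected predictive distributions of methods two and three are wider than those of the MAP-only approach: the spread of the per-sample means enters the variance directly.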
This effect is likely to be caused by the small number of training observations, causing the uncertainty to increase away from the training data and indicating that the prior has a large effect on the posterior due to the small number of observations in the likelihood. However, the extra uncertainty quantified by marginalising out the posterior hyperparameter distributions is useful for gaining an insight into the level of trust in the identified model discrepancy given the limited training data used to estimate it. Several validation metrics have been applied in order to quantitatively assess the performance of the inferred models for each of the three scenarios. The first metric, the normalised mean squared error (NMSE) (the sum of the squared errors divided by the variance and the number of data points), assesses the performance of the mean prediction. The second metric is the maximum mean discrepancy (MMD) distance, a measure of the distance between two distributions [34]. The distance is the difference between the means of two kernel embeddings of the data (here a Gaussian kernel is used, with the scale parameter inferred via a median heuristic [35]). The third metric is the posterior likelihood, a measure of the probability of the data coming from the inferred GP model. The validation metrics are applied to the predictions from each of the three scenarios and are shown in Fig. 15 and Table 3.
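The first two validation metrics can be sketched concisely. The MMD estimate below is the simple biased (V-statistic) form with a Gaussian kernel and median-heuristic scale, written for one-dimensional samples; it is an illustration under those assumptions rather than the authors' exact implementation:

```python
import numpy as np

def nmse(y_true, y_pred):
    """Normalised mean squared error: the sum of squared errors divided
    by the variance of the data and the number of data points."""
    return np.sum((y_true - y_pred) ** 2) / (len(y_true) * np.var(y_true))

def mmd2(X, Y):
    """Squared maximum mean discrepancy between 1-D samples X and Y,
    using a Gaussian kernel with a median-heuristic scale parameter."""
    Z = np.concatenate([X, Y])[:, None]
    d2 = (Z - Z.T) ** 2                 # pairwise squared distances
    sigma2 = np.median(d2[d2 > 0])      # median heuristic
    K = np.exp(-d2 / (2.0 * sigma2))
    n = len(X)
    Kxx, Kyy, Kxy = K[:n, :n], K[n:, n:], K[:n, n:]
    return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()
```

With these conventions, a mean-only prediction gives NMSE = 1, and MMD² is zero for identical samples and grows as the two distributions separate.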
In terms of the mean predictions, the NMSEs indicate that, on average, the third approach provided the best mean performance. In fact, both the second and third methods outperformed the first method in all of their mean predictions (apart from method three's predictions of the fourth natural frequency). This shows that including and propagating these sources of uncertainty is beneficial for the overall mean predictive performance, with the results supporting the conclusions from the numerical case studies. Helpfully, the models for all three scenarios show the same general mean predictive behaviour across the five natural frequencies, with predictions being poorest for the second natural frequency. Comparing the output distributions, the MMD distances for each scenario are relatively comparable, with the third method performing best on average. The reason for the similar MMD distances is that the data distribution is being inferred from 10 observations at every input, and this has a greater effect on the distance than each scenario's change in predictive distribution. This demonstrates the challenges in validating predictive distributions when only a small number of validation data points is available. In comparison to the NMSEs, the posterior likelihood gives a different assessment of which natural frequency is predicted best: the second natural frequency is most likely to have produced the observational data. The posterior likelihoods indicate that, on average, the first method was more likely to have produced the observational data than the other approaches. However, the second method produces the highest posterior likelihoods for the third and fourth natural frequencies. Furthermore, the second and third methods have comparable posterior likelihoods for the first to third natural frequencies, which are very similar to method one.
It is noted that the first method does not reflect the uncertainty associated with the parameters, and therefore may be overconfident in its predictions, given the training observations.
From the third approach it is possible to obtain the posterior distribution of the GP hyperparameters. Fig. 16 depicts two of the posterior hyperparameter distributions, for the first and fifth natural frequencies. The posteriors show that the lengthscales for the simulator output L_y are all uncorrelated (as expected by the automatic relevance determination and distance metric assumptions in the covariance function, i.e. L_y is diagonal). Furthermore, both show that the modes of the signal variance and noise variance are fairly constant when compared to the lengthscales. This means that the noise and signal variances have been well identified, and the output uncertainty is mainly attributed to the uncertainty in the lengthscales. As a result, the posterior hyperparameter distributions provide a high level of insight into the inferred model discrepancy. With more observational data, these posterior distributions will provide insight into the type of missing functional form in the simulator, as the uncertainty in the lengthscale distributions would be expected to decrease.

Conclusions
Every computer model (here defined as a simulator) will imperfectly reflect the real world due to some level of model discrepancy (whether from missing physics, simplifications, approximations etc.). Without identifying the level of model discrepancy within a simulator, predictions are likely to be inaccurate. This paper proposes a method, based on Gaussian process (GP) regression and a sampling-based approach, for identifying model discrepancy given some parameter distribution. The method has been demonstrated to be effective on numerical examples and an experimental case study of a representative five storey building structure.
The approach in this paper allows for bias-corrected predictions to be constructed that marginalise out the simulator outputs, with the additional ability to marginalise out the GP hyperparameters. By performing this process, the bias-corrected predictive distributions better reflect the parameter and hyperparameter uncertainty in the predictions and help the engineer identify 'true' improvements to a simulator, rather than those based on overconfident estimations of the model discrepancy. The technique relies on generating a set of Gaussian process maps from the uncertain simulator outputs and deterministic inputs to observational data and performing weighted averages to form the bias-corrected predictive distributions.
Three scenarios were investigated: using a MAP estimate of the simulator parameters and simulator outputs, and an empirical Bayes estimate of the GP hyperparameters; marginalising out the simulator outputs using importance sampling, and an empirical Bayes estimate of the GP hyperparameters; and marginalising out both the simulator outputs and the GP hyperparameters via importance sampling.
The numerical case studies show that a two stage decoupled process, utilising Bayesian history matching and the proposed model discrepancy procedure, is appropriate for calibrating a simulator and extracting model discrepancy. In addition, the first numerical case study demonstrated issues associated with the hierarchical Bayesian approach that seeks to jointly infer the parameters and model discrepancy. In accordance with the literature [4,6], the approach leads to an insensitive likelihood, and can cause non-identifiability issues. In the numerical case study the hierarchical Bayesian approach inferred a model discrepancy distribution that was more uncertain than the proposed decoupled approaches, and had a worse mean predictive performance. Furthermore, both numerical case studies showed that considering both the simulator output and hyperparameter uncertainties provides more information about the model discrepancy function, and improves the performance of the mean prediction.
Finally, an experimental case study was provided, where it was shown that the third approach (marginalising both the simulator outputs and the GP hyperparameters) produced the best mean predictions. In addition, the uncertainty associated with the parameters was propagated onto the model discrepancy and output predictions, better reflecting the uncertainty quantified by the BHM parameter posterior distribution. The inclusion of the parametric uncertainty is valuable in understanding the 'true' model discrepancy, and is beneficial in determining what is known about the model discrepancy from the analysis. In addition, the third approach provides posterior distributions of the hyperparameters, which give further insight into the model discrepancy process.

Declaration of Competing Interest
None.