Bayesian history matching for structural dynamics applications

Computer models provide useful tools in understanding and predicting quantities of interest for structural dynamics. Although computer models (simulators) are useful for a speciﬁc context, each will contain some level of model-form error. These model-form errors arise for several reasons e.g., numerical approximations to a solution, simpliﬁcations of known physics, an inability to model all relevant physics etc. These errors form part of model discrepancy ; the difference between observational data and simulator outputs, given the ‘true’ parameters are known. If model discrepancy is not considered during calibration, any inferred parameters will be biased and predictive performance may be poor. Bayesian history matching (BHM) is a technique for calibrating simulators under the assumption that additive model discrepancy exists. This ‘likelihood-free’ approach iteratively assesses the input space using emulators of the simulator and identiﬁes parameters that could have ‘plausibly’ produced target outputs given prior uncertainties. This paper presents, for the ﬁrst time, the application of BHM in a structural dynamics context. Furthermore, a novel method is provided that utilises Gaussian Process (GP) regression in order to infer the missing model discrepancy functionally from the outputs of BHM. Finally, a demonstration of the effectiveness of the approach is provided for an experimental representative ﬁve storey building structure. (cid:1) 2020 The Authors. Published by Elsevier Ltd. ThisisanopenaccessarticleundertheCCBY license


Introduction
Calibration of computer models (herein defined as simulators) is often an important aspect of creating predictions that adequately match observational data. However, simulators often contain model-form errors from various sources, such as an absence and/or simplification of the physics and approximations in solution techniques. These errors form part of the term model discrepancy, which is defined as the mismatch between observational data and the simulator output when the 'true' parameters are known. If a mechanism that accounts for model discrepancy is not included in the parameter estimation approach, any parameters identified will be biased and may provide poor predictions, especially when extrapolating [1,2]. Consequently, Bayesian History Matching (BHM) has been developed as a methodology for calibrating simulators under the assumption that model discrepancy exists and can be treated as uncertain.
History matching as a term originates from the oil industry and describes methods that find parameters of simulators where the outputs closely match data from historical reservoir production. Many of these techniques within the literature, such as those reviewed by Oliver and Chen [3], are similar to classical model updating techniques well-established within structural dynamics [4,5]. Nonetheless, Craig  an approximate Bayesian methodology that searched for all, rather than a single parameter match [6]. This category of approaches was defined as Bayesian History Matching.
BHM has been implemented across a wide variety of applications, from its origins in oil reservoir modelling [6], to understanding Galaxy formation [7][8][9], complex social models of HIV transfer in populations [10,11] and climate science [12,13]. The method seeks to discard parts of the parameter space that were unlikely to produce outputs that match the observational data, given that observational and model discrepancy uncertainties exist. A key benefit of the approach is in gaining an understanding about parameters in complex computer models where model discrepancy is present. In addition, Andrianakis et al. in [10] discuss utilising the method as a pre-calibration step before applying fully Bayesian analysis, such that the parameter domain is well understood, and informative priors identified. However, none of the applications of BHM within the literature apply the technique to a structural dynamic context, nor do they seek to identify the functional form of the model discrepancy after the parameter distributions are estimated. Correspondingly, one of the primary novel contributions of this paper is in providing a methodology for inferring the functional form of the model discrepancy after calibration via BHM. This is performed by using the maximum a posteriori (MAP) estimate of the inferred parameter posterior distribution such that Gaussian Process (GP) regression models can be inferred to map between the simulator output and a set of training observational data. This novel combined approach provides an alternative method to conventional Bayesian calibration methods that consider model discrepancy within the parameter inference process [1,2,[14][15][16].
Another benefit of BHM is that the approach is 'likelihood-free' meaning that input and output combinations can be removed and added iteratively without invalidating the analysis. This means that the considered parameter domain can be truncated based on physical understanding of the parameters, reducing non-identifiability issues and non-physical inferences; which are often present in current Bayesian approaches to calibration that consider model discrepancy [1,2,[14][15][16]. Moreover, by decoupling the parameter and model discrepancy estimation problem, BHM offers a further mechanism to prevent against non-identifiability issues and non-physical inferences. The 'likelihood-free' approach also provides benefits in allowing approximate Bayesian solutions to be identified when a valid likelihood is not possible to obtain.
The technique is designed to be computationally competitive, when compared to Bayesian sampling methods such as Markov Chain Monte Carlo (MCMC), and therefore practical for complex simulators. By construction, BHM utilises emulators-computationally efficient surrogate models-such that the parameter domain can be explored efficiently. Within the literature both Bayes linear techniques [8,9] and Gaussian Process (GP) [10] emulators have been implemented. These choices are popular as both provide estimates of code uncertainty [1]-the uncertainty introduced by replacing the simulator with an emulator. Quantifying code uncertainty ensures that regions of the parameter space are not discarded until emulator predictions are sufficiently 'certain'.
The outline of this paper is as follows. Section 2 introduces BHM defining assumptions and aspects of implementation. Following the methodology, a numerical case study is presented in Section 3 such that the effectiveness of BHM can be demonstrated on an example where the true parameters are known. Next, BHM is demonstrated on experimental case study of a representative five storey building structure, where masses are used to simulate pseudo-damage scenarios. This section also outlines and implements the approach for inferring the functional form of the model discrepancy. Finally, conclusions are discussed, highlighting areas for further research.

Methodology
BHM is an approximate Bayesian approach for determining whether parameter combinations H are 'implausible'; that is to say not likely to have produced known observations z. These implausible parameter combinations h I 2 H are discarded based on a criteria such that the remaining non-implausible space h nI 2 H are identified. In terms of a statistical model, BHM aims to calibrate, where z j ðxÞ is the jth observational output given inputs x; g j ðx; hÞ is the jth simulator given x and parameters h. The model discrepancy and observational uncertainty are d and e respectively, where the simulator, model discrepancy and observational uncertainty are independent. Eq. (1) is similar to that proposed by Kennedy and O'Hagan in [1]; although here model discrepancy is defined as constant and additive with respect to the inputs. In order to calibrate Eq. (1), BHM explores the parameter space of the simulator iteratively, where each iteration is called a wave. During a wave simulator outputs are assessed for various parameter combinations using an implausibility metric and discarded if above a threshold T. As the method is required to assess a large parameter space, a computationally efficient emulator is utilised. The technology used in constructing an emulator within BHM must also quantify code uncertainty [1]-the uncertainty introduced by replacing the simulator with an emulator. This is important as code uncertainty prevents regions of the parameter space from begin discarded due to poor emulation; instead these are retained until the emulator sufficiently represents the simulator output. For this reason GP emulators are implemented-Bayesian, non-parametric regression models [17,18]-as they will fit known simulator runs exactly (under a no noise assumption) and provide estimations of code uncertainty when predicting away from known simulator runs. It is noted that Bayes linear techniques could also be used, as these also quantify code uncertainty [8,9]. However, these approaches are approximation methods that update beliefs using linear fitting, and are generally more applicable to scenarios where the simulator space is not well modelled by a GP.

Gaussian process emulators
A GP emulator is constructed as, g j ðx; hÞ$GP j ðmðx; hÞ; kððx; hÞ; ðx 0 ; h 0 ÞÞÞ ð2Þ where the jth simulator output is modelled as a GP, GP j ðÁ; ÁÞ; taking both x and h as its arguments-it is noted that the emulator is a map of both the inputs x and parameters h to the jth simulator output g j ðx; hÞ. The GP emulator is fully defined by its mean mðÁÞ and covariance kðÁ; ÁÞ functions, which define the prior belief over the functions that could represent the simulator output. Here the mean function is defined to be linear in the parameters, i.e. mðÁÞ ¼ Hb, where H is comprised of p basis functions H ¼ðh 1 ðÁÞ; ...; h p ðÁÞÞ and there are p coefficients b ¼ðb 1 ; ...; b p Þ. Is it noted that the basis functions in H should reflect the prior belief about gðÁ; ÁÞ, where typically no more than a constant or linear mean can be assumed a priori. The covariance function states the prior assumption about how smooth the simulator output is. It defines the correlation between any two points in a Reproducing Kernel Hilbert Space (RKHS) and is dependent on some set of hyperparameters / g , i.e. K ¼ kððÁ; ÁÞ; / g Þ. One example of a covariance function is the squared exponential covariance function, kððx; hÞ; ðx 0 ; h 0 ÞÞ ¼ r 2 where the hyperparameters / g for the squared exponential are two diagonal matrices of roughness parameters i.e.
Þ and a signal variance r 2 g (the reader is referred to [18] for more examples and definitions of covariance functions). By forming a joint prior over the training (denoted y) and testing (denoted Ã) points, standard Gaussian conditionals can be used to obtain the predictive distribution, pðg Ã jx Ã ; h Ã ; y; x; h; /Þ¼NðEðg Ã Þ; Vðg Ã ÞÞ ð4aÞ Vðg Ã jx Ã ; h Ã ; y; x; h; /Þ¼K Ã;Ã À K Ã;y K À1 y;y K y;Ã ð4cÞ where g Ã are predictions of the simulator function at the test points and Eq. (4c) quantifies the code uncertainty. Due to the focus of this paper being BHM and not GP regression, the reader is referred to [17][18][19][20] for more mathematical details. It is noted that the computation cost of training a GP emulator is Oðn 3 Þ (where n is the number of training points in K y;y ). It is expected that in applications involving computationally expensive simulators the computational cost in training a GP emulator will be insignificant compared to obtaining simulator evaluations. Nonetheless the computational cost in training a GP emulator can become problematic when the mapping from inputs x and parameters h to the jth simulator output g j ðx; hÞ is complex compared to the prior (fully specified by mðÁÞ and covariance kðÁ; ÁÞ), or when the parameter and input spaces are significantly high dimensional, and therefore a large number of simulator evaluations are required to adequately train the GP. However, in these scenarios sparse GP approximations can be utilised, where the computational cost of training the GP emulator reduces from Oðn 3 Þ to Oðnm 2 Þ, where typically m ( n and is a number of inducing points (for more details on sparse GP emulators see [21]).

Implausibility metric
The implausibility metric IðÁ; ÁÞ [10,11], used to determine whether a given parameter combination was likely to have produced the observed output, incorporates the emulators within its formulation, where V o ; V m and V c ðx; hÞ are the variances associated with the observational, model discrepancy and code uncertainties; V c ðx; hÞ¼V j ðg Ã j x Ã ; h Ã ; y; x; h; /Þ. Eq. (5) is essentially the distance between observational data and emulator mean predictions weighted by the uncertainties in the processes. Trivially, in the case where simulator runs are computationally cheap, the emulator mean can be replaced with the simulator output and the code uncertainty term removed. The implausibility metric requires specification of the observational and model discrepancy uncertainties, V o and V m . The observational uncertainty V o can often be estimated from expert knowledge and from the observational data. Model discrepancy uncertainty V m can be more challenging to define, but may be elicited from expert judgement. The likelihood-free property of BHM means that the model discrepancy uncertainty can be refined during each wave. This means that sensitivity analysis of the effect of V m can be performed during a wave to understand changes in rejection rates and help improve its specification. Observational and model discrepancy uncertainties can be dependant on both inputs x and outputs z j ðxÞ, i.e. V o;j ðxÞ and V m;j ðxÞ, if input dependent hetroscedastic noise or model discrepancy are hypothesised.
The implausibility metric presented in Eq. (5) provides a quantity for every parameter combination h, input x and output y. However a single value is required for each parameter combination in order to decide whether it should be removed. Several extensions of the implausibility metric that deal with multiple outputs and inputs can be considered [10]. Firstly, a maximum implausibility can be formed, whereby the worst case for a given parameter combination is used. Another approach is to form a multivariate implausibility metric for either the inputs, or outputs, which is equivalent to taking the Mahalanobis distance; assessing the Euclidean distance of the principal components (standard practice in outlier analysis [22]). Again a maximum can be taken over either Eq. (7), (8) to collapse the metric to a single value for each parameter combination.

Decision threshold
In order to decide which parts of the parameter space to exclude, a threshold T is placed on the implausibility metric. Large implausibilities (for each formulation) indicate a parameter set was very unlikely to have produced an output that matched the observational data, given the included uncertainties. Using this knowledge, a rejection criteria can be formed for a particular parameter combination h, where determining the value of T changes based on the type of implausibility metric considered. Andrianakis et al. [10] state that a sensible threshold T for single I j ðx; hÞ or maximum I max ðhÞ implausibilities (where the maximum is of a single implausibility set) can be determined by Pukelsheim's 3r rule [10]. The rule states that any continuous unimodal distribution will contain at least 99.5% of probability mass within three standard deviations away from the mean [23]. For multivariate implausibilities the threshold T can be set as a high percentile (e.g. a > 95%) from a chi-squared distribution with either j, or the input size of x, degrees of freedom [10], i.e. T ¼ F À1 v 2 ðaÞ the output from a chi-squared quantile function (inverse Cumulative Density Function (CDF)). This can be thought of as performing a frequentist hypothesis test on the parameter combination, using a chi-squared (v 2 ) test.

Parameter domain exploration
During each wave BHM explores the parameter space using the implausibility metric and threshold. This requires sampling the parameter domain using a design of experiments, and then running the simulator such that outputs can be obtained with which to form the emulator; a Latin hypercube-based approach is a natural choice for constructing the emulator. In this scenario the initial parameter space bounds are used in conjunction with a simulator budget to construct a Latin hypercube design. In this paper Generalised Maximum Latin Hypercube (GMLHC) designs are used as this approach has been shown to reduce the code uncertainty in GP emulators at the bounds of the design [24]; this is particularly useful in BHM, as decisions about whether parameters near the bounds are implausible can be made without high emulator uncertainty. At each wave a new GMLHC can be formed such that the non-implausible space from the last wave can be interrogated.
Once an emulator is constructed from a cost-effective number of simulator runs it is deployed in sampling the parameter domain. In this paper parameter combinations are sampled from a uniform distribution bounded by the initial parameter domain-effectively defining a uniform prior over the space. Emulator predictions are made for each of these samples and a decision made regarding whether they should be discarded.

Algorithm
The algorithm for performing BHM is stated in Algorithm 1. Two stopping criteria are constructed based on the following outcomes: all the space is deemed implausible; or the code uncertainty in the non-implausible region is less than the remaining uncertainties, i.e. V c;j ðx; h nI Þ < V o;j þ V m;j . This second stopping criteria indicates that the emulator is at least as certain about its predictions as the modeller is with the uncertainties due to model discrepancy and observation variability.
To illustrate BHM, Algorithm 1 is applied to a simple numerical example (where the sampling stage is replaced with a uniform grid). In the example a simulator constructed from Eq. (10a) models the experimental observation z, which is obtained from the 'true' process with noise, stated in Eq.  Fig. 1a. BHM was performed following Algorithm 1 with a simulator evaluation budget of four (for each space-filled design in wave k) where the single implausibility metric IðhÞ and threshold T ¼ 3 are implemented. The emulator for each wave was constructed from constant mean and squared exponential covariance functions. The first, second and fourth waves are shown in Fig. 1.
In the first wave (Fig. 1b) the emulator predictions are most uncertain outside of h 1 leading to these regions being classified as non-implausible. It can also be seen that the initial known simulator runs are deemed implausible, which can be visually confirmed as they are not within the remaining uncertainty bounds z AE ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi Between these known simulator runs the code uncertainty increases leading to the parameters, around 1 and 2, being classed as non-implausible. By the second wave (Fig. 1c) additional simulator runs mean that the code uncertainty in the [0.75 2.25] interval are reduced below the remaining uncertainties and all judged as implausible. Simulator runs at the parameter bounds 'pin' the code uncertainty removing the domain edges as implausible. By the final wave ( Fig. 1d where k ¼ 4) the code uncertainty has reduced across the space and is lower than the remaining uncertainties in the non-implausible region. The non-implausible set h nI at this wave clearly contain two regions around the solution 0:90 and 4:23.

Approximate posterior sampling
Once the final wave has identified a non-implausible parameter region, importance sampling [10] can be used to obtain an approximation of the posterior distribution pðhjzÞ. The approximate posterior is formed from the ratio, where w un ðÁÞ are a set of un-normalised weights and h q are samples from a proposal distribution qðh q Þ. The un-normalised weights w un ¼ pðzjh q Þpðh q Þ=q un ðh q Þ are the probability of each sample h q $ q un . However, BHM is 'likelihood-free', and therefore an approximation of the likelihood must be formed. In this paper the likelihood is approximated as, which is the product of multivariate Gaussian distributions over zðxÞ for the set of inputs x, where hÞ. This approximation reflects the form of the implausibility metric. This assumes that these sources of uncertainty are normally distributed. This is an acceptable assumption for the emulator and code uncertainty due to the Gaussian form of the predictions. However, the assumption should be checked for the observational and model discrepancy uncertainties for a given application. The proposal distribution used in importance sampling must have adequate support. Here a multivariate Gaussian distribution is used, where l nI and R nI are the sample mean and variance-covariance from the non-implausible set after the last wave and j is an inflation parameter to ensure good coverage of the space.
The choice of prior pðhÞ depends on the modellers beliefs from the last wave. However, it is often reasonable to assume a constant prior over the final non-implausible set, due to the bounded approach for sampling the simulator in each wave. This means the weights in Eq. (11) become w un ¼ Lðh q Þ=q un ðh q Þ where h q are a number of samples from q un and the constant prior essentially truncates the proposal samples to be within the final non-implausible domain.
Lastly the approximate posterior from Eq. (11) can be re-sampled in order generate direct samples from the approximate posterior. This involves drawing N q samples where the probability of occurrence is defined by the normalised weights wðh q Þ¼w un ðh q Þ= P w un ðh q Þ. Fig. 2 demonstrates importance sampling and re-sampling applied to the numerical example in Fig. 1, where N q ¼ 10; 000 and j ¼ 2. The re-sampled posterior samples are subsequently used to draw Monte Carlo realisations of the simulator and bias-corrected output. The results show that the emulator has been adequately calibrated with the two parameter solutions lying within the central probability mass. Furthermore the simulator and bias-corrected results lie within the given uncertainty bounds. These uncertainty bounds are AE3 ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi V o þ V m p for the simulator, reflecting both the model discrepancy and observational uncertainties, and AE3 ffiffiffiffiffi ffi V o p for the bias-corrected output, where 3 standard deviations reflect Pukelsheim's 3r rule.
3. Numerical case study: mass, tensioned wire system BHM accounts for model discrepancy by defining a prior variance V m , stating an assumption of uniform additive discrepancy across the space. In order to illustrate its effectiveness a simple numerical example of a mass, tensioned wire system, depicted in Fig. 3, is presented. In this case study the simulator output is the natural frequency of the system y ¼ x n , with the input being an applied tension x ¼ T, and the calibration parameter being the mass h ¼ M. The simulator is formed from, gðx; hÞ¼x n ðT; MÞ¼ where l ¼ 1m is the length. In this example model discrepancy is additive and sinusoidal defined as, dðxÞ¼0:5 sinð2p Â 0:044xÞ. This form of model discrepancy is chosen as it could be defined as uniform and additive, which is the assumption made within BHM. The observational data is obtained from, where the 'true' mass isĥ ¼ 5:43 and the observational uncertainty is e $Nð0; 0:01Þ. Fig. 4a presents differences between the simulator output (using the 'true' mass) and the realisation observational responses (where zðx Ã Þ are 50 realisations of Eq. (15)).
The training data is shown in Fig. 4a, where the inputs are in 100 N steps from 200 N to 1000 N. The observational uncertainties are depicted on the training data as AE ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi ðV o þ V m Þ p intervals. In this numerical example the variances associated with these uncertainties are known; V o ¼ 0:01 and V m ¼ 1=12 (calculated as the variance from a uniform distribution bounded [-0.5 0.5]). A simulator budget of 15 runs was given for defining the parameter space, as evaluations of the simulator were cheap for this case study; this number of runs allowed the emulator to adequately estimate the simulator in one wave, allowing the case study to focus on the methods ability to account for model discrepancy. As the calibration example is onedimensional, i.e. only the mass is calibrated, a uniform grid is used to sample the space between [2 20] kg. The emulator utilised in this case study was constructed from a linear mean function mðx; hÞ¼½x; h T b, reflecting the prior assumption that the natural frequency will increase with tension. In addition, a Matérn 3/2 covariance function was implemented, defining a prior that assumes the simulator is a relatively smooth function.
A multivariate implausibility metric was implemented with a threshold calculated from the 99% quantile of a 9 degreeof-freedom v 2 -distribution. 50; 000 uniformly distributed samples were used to explore the parameter domain and the approximate posterior was sampled 10; 000 times. Due to the emulator being trained from an adequate number of simulator runs, BHM reached the stopping criteria after one wave. The approximate posterior is depicted in Fig. 4b

Experimental case study: five storey shear structure
An experimental case study of a representative five storey building structure is presented in order to demonstrate the effectiveness of BHM. The aim of calibration in this scenario was to infer the material properties h ¼fE; m; qg (elastic modulus, Poisson's ratio and density respectively) of the columns and floors of an Finite Element (FE) model, using observational data from a representative building structure made from aluminium 6082. In this case study the simulator has been created such that model-form errors, caused by a simplification of the attachments to ground and of the bolted joints (modelled as bonded), are present. These issues cause a clear model discrepancy particularly in the first natural frequency, highlighting the need for BHM-it is noted that despite these model-form errors, only the material properties are calibrated. The quantities of interest were the first five bending natural frequencies of the structure under different masses, m ¼f0; 0:1; ...; 0:5g kg, added to the fourth floor of the building-simulating pseudo-damage extents.
The calibration process is presented in section 4.1, where the posterior distribution over the parameters is identified. Following calibration, section 4.2 outlines a method for inferring the functional form of the model discrepancy, leading to predictions that are bias-corrected, i.e. account for the model discrepancy. Validation of these bias-corrected predictions is subsequently performed in section 4.3, highlighting the benefits of the novel combined approach.
Modal testing was performed via an electrodynamic shaker; where the setup is shown in Fig. 5. The excitation was bandlimited Gaussian noise, with a bandwidth of 409.6 Hz; where 40 averages were obtained for each test. The sample rate and time were chosen such that the frequency resolution was 0.05 Hz. Five accelerometers were placed on each of the five floors to measure the first five bending modes. For each mass, ten repeats were conducted in order to obtain an understanding of the underlying modal frequency distributions.
The simulator gðx; hÞ was a modal FE model, where the five bending natural frequencies were extracted as a set of outputs y. Evaluations of the simulator were acquired for the six damage extents x ¼f0; 0:1; ...; 0:5g kg. The model parameters h were: elastic modulus E, Poisson's ratio m and density q. The prior bounds on h were AE15% of typical material properties for aluminium 6082, shown in Table 1. A fifteen point, three dimension GMLHC was used to sample the parameter space, with an independent five point, three dimension GMLHC for validation.

Bayesian history matching
BHM was implemented as outlined in Algorithm 1. Five independent GP emulators were constructed from the training GMLHC, such that the output natural frequencies could be predicted 1 . Each of the five emulators were constructed from a linear mean function mðx; hÞ¼½x; h T b and Matérn 3/2 covariance function. Exploration of the parameter domain was performed via propagating 100,000 samples from a uniform distribution over the bounds through the GP emulators (where each emulator was trained from fifteen simulator runs and validated against five separate simulator runs). ffiffiffiffiffi ffi Vo p . Fig. 3. Schematic of the mass, tensioned wire system. 1 It is noted that the simulator outputs will not be independent; an area of further research is in implementing multivariate GP emulators [25,26] Table 2 The process uncertainties defined in the implausibility measure utilised for performing BHM on the five storey representative building structure. The multivariate implausibility metric (Eq. (7)) was implemented with the non-implausibility criteria being when the maximum multivariate implausibility for all five outputs (the five natural frequencies) was less than the 99% quantile for a three degree-of-freedom v 2 -distribution (reflecting the size of the training inputs x). Table 2 outlines the process uncertainties utilised in the implausibility metric. Observational variances V o;j were estimated for each natural frequency from the variance of the output training data. The model discrepancy variances were determined using expert judgement; the simulators prediction of the first natural frequency was know to provide poor predictive performance and therefore was given a large variance, whereas the other remaining outputs were more accurate, leading to smaller prior model discrepancy variances.
The stopping criteria was met after one wave, as the emulators inferred code uncertainties were less than the total process uncertainties for each output-diagnostic checks using methods outlined in [19] were also used to confirm the inferred emulators were valid. After the first wave a non-implausible space % 2:8% of the original space was identified.
In order to visualise the non-implausible space, minimum implausibility and optical depth plots were created. These quantities divide the parameter space into 'bins' within which each of the 100; 000 samples (from the uniform parameter domain sampling) are placed. Minimum implausibility takes the lowest value of implausibility below the threshold for the set of samples within a given bin. This provides an indication of which parts of the parameter space can be discarded irrespective of the other parameters. Optical depth is the ratio between non-implausible samples and the total number of samples within a given bin, providing an estimate of the probability of finding a non-implausible parameter combination given the set within a bin. Fig. 6 presents these quantities after the first wave when each parameter is divided into thirty bins. Here it can be seen that the outputs, as expected, are relatively insensitive to changes in Poisson's ratio. Furthermore, there is a clear linear correlation between the non-implausible space of the elastic modulus and density, displayed in the bottom left and top right quadrants of Fig. 6.
Once the stopping criteria has been met approximate posterior densities can be formed using importance sampling and re-sampling. In this case, a Gaussian proposal distribution with j ¼ 1:5 was used to generate 100; 000 samples with which to assess the normalised weights using the methodology presented in Section 2.6. 100; 000 samples were subsequently obtained by re-sampling the posterior distribution. Fig. 7 presents the marginal and pairwise joint posterior distributions, which are visually similar to the minimum implausibility and optical depths; with a linear relationship between density and elastic modulus and a relatively insensitive effect from Poisson's ratio in the pairwise joint distributions. Fig. 8 displays the marginal posterior distribution for each parameter, all showing slightly bi-modal distributions. The marginal distributions are in contact with the bounds of the prior parameter domain. This demonstrates a known strength of BHM, as the bounds constrain calibration to physical values. If these bounds were not used then non-physical parameter inferences may be arrived at, due to the existence of model discrepancy. The output distributions for each of the five natural frequencies were obtained via Monte Carlo sampling the posterior parameter distribution. 1; 000 samples were taken from the re-sampled parameter posterior distributions and propagated through each of the five emulators in order to obtain realisations of the output distributions. As the code uncertainty across all emulators were low, each emulator mean was taken as deterministic and the GPs were not sampled. The mean predictions of the GP emulators for the 1000 Monte Carlo realisations are presented in Fig. 9. The predictions are shown against the observational data used within BHM zðxÞ with AEc r ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi V o;j þ V m;j p bounds; where c r is the standard deviation associated with 99% probability mass of a standard normal (assuming output distributions to be approximately Gaussian). Fig. 9 demonstrates that all five outputs are within the defined uncertainty bounds. However large discrepancies between the experimental observations and simulator outputs (represented by the five emulator's mean predictions) occur, especially for the first and fifth natural frequencies. This illustrates that the simulator has model-form errors, that would lead to incorrect parameter inference if model discrepancy was not considered in the calibration process.

Model discrepancy inference
Inferring the functional form of the model discrepancy is important, as it provides a mechanism for targeting model improvements and quantifying simulator adequacy, i.e. a large model discrepancy will infer that the simulator may not be fit for purpose. In light of this aim, a method is proposed for inferring the functional form of the model discrepancy using the inferred parameter estimates from BHM. The approach involves utilising GP regression to infer the functional difference between the calibrated simulator output and the observational data, at a set of training points. This sections outlines and demonstrates the approach for the five storey building structure case study.
The method begins by obtaining the parameter MAP estimates from the posterior distribution identified by BHM, i.e. h MAP ¼ max pðhjzÞ. The calibrated simulator output is subsequently obtained by evaluating the GP emulators at this calibrated value i.e. pðg Ã;j jx Ã ; h Ã ; y j ; x; h MAP ; /Þ, as the emulators have been established to approximate the simulator well during BHM. The mean prediction from these emulators, Eq. (4b), is taken to be the calibrated simulator output, gðx; h MAP Þ. Model discrepancy is subsequently modelled as a GP regression model (with a Gaussian noise variance) that maps between the calibrated simulator output (for all outputs) to the experimental data, i.e. GP j;d : gðx; h MAP Þ!z j ðxÞ. Once constructed, biascorrected outputs are obtained pðz Ã;j jx Ã ; z j ; y; x; h MAP ; / d;j Þ, that account for model discrepancy. It is noted that this method does not propagate the parameter uncertainty from the posterior distribution through to the model discrepancy inference stage, and therefore this uncertainty is not reflected in the bias-corrected predictions. Uncertainty propagation of the param-eter uncertainty is therefore highlighted as an area of further research. Likewise, implementing the approach with multioutput GP regression models should also be pursued as further research.
The outlined approach was applied to the five storey building structure case study. Five independent GP regression models were again constructed but here mapping from the calibrated emulator outputs to the ten repeat observations (aiding estimation of the observational uncertainty) at the training inputs x ¼f0; 0:3; 0:5g kg. The GP priors were modelled using zero mean functions and Matérn 3/2 covariance functions. The bias-corrected output predictions are displayed in Fig. 10 and the inferred model discrepancy functions in Fig. 11.
The inferred model discrepancies in Fig. 11 capture part of the missing physics. It can be observed for the first, second and fourth natural frequencies that the discrepancies increase with mass, and that the first natural frequency has a large discrepancy of around 2 Hz. In contrast, the model discrepancies for the third and fifth natural frequencies indicate that the calibrated simulator closely matched the observational data, and therefore the model discrepancy functions likely capture a 'noise' process rather than any particular missing physics. These discrepancies would lead the modeller in targeting improvements to the simulator that corrects the first natural frequency most, and that account for the relatively linear increase in natural frequency with mass that the simulator currently fails to capture.

Validation
Validation metrics are applied to the bias-corrected predictions from the joint BHM and GP regression approach in order to assess the methods effectiveness. Normalised Mean Squared Errors (NMSEs) were used to assess the mean predictive performance; defined as,  where z Ã ðxÞ are the output test data, r 2 zÃ , the variance of the output test data andẑ Ã ðxÞ, the mean test predictions. A score of zero indicates a mean prediction without any error; conversely, a score of 100 represents a scenario where the prediction is no better than taking the mean of the true values Eðz Ã ðxÞÞ. Table 3 presents the scores for each natural frequency demonstrating good predictive performance (as they are all below five). Predictive performance was found to be poorest for the second natural frequency, which is due to the discrepancy when x ¼ 0:1 kg.
The approach in this paper is probabilistic and as such the complete predictive distributions should be assessed. In this case, three statistical distances are applied: the Area Metric, Hellinger distance and Maximum Mean Discrepancy (MMD) distance, which measure the distance between two probability measures, P and Q. The Area Metric [27] is the L 1 -norm between two CDFs (FðxÞ), where the units of the Area Metric are that of the quantities of interest. The units therefore make this a useful validation metric. Additionally, the Area Metric assess the distance between CDFs, this means that an empirical CDF can be used to provide a non-parametric estimation of the observation distributions. The second validation metric considered is the Hellinger distance; the L 2 -norm between two Probability Density Functions (PDFs) (pðxÞ), where the metric is bounded [0 1]. These bounds mean the Hellinger distance is good at objectively comparing predictions at different scales, as zero means the distributions are the same, and one, that the distributions are completely different. The final validation metric is the MMD distance, which is the maximum distance between the mean embeddings of two sample sets in a RKHS, calculated as, where the projection is performed by the function class F , where the function f is called a reproducing kernel kðÁ; ÁÞ [28], and x and y are samples from P and Q respectively. The distance is non-parametric and has a lower bound of zero. A popular choice of kernel is the radial basis kernel [28,29], where the median pairwise distance among the joint data is commonly used to infer the hyperparameters [30]. The three validation metrics are applied to the output predictions and observational samples for each of the five natural frequency. Numerical integration is used to infer both the Area Metric and Hellinger distance, where an empirical CDF and kernel density estimate are used to approximate the observational distribution for both the Area Metric and Hellinger distance respectively. Due to the MMD being sample based, ten samples were obtained from predictive distributions such that the distance could calculated between these samples and the observational samples. This procedure was repeated 100 times in order to obtain the average MMD distance, which should be more robust to the predictive distribution sampling. Furthermore, the MMD distance was implemented using a radial basis kernel, reflecting the expected Gaussian-form of the observational data, where the hyperparameters were inferred using the median heuristic. The distances are presented in Fig. 12.
The Area Metric values, shown in Fig. 12a, are relatively small-of the order 10 À3 -providing evidence that the predictive distributions are close the observational data. The largest distance is for the fifth natural frequency at 0 kg, which is likely due to the spread of observation samples at 0 kg, with one data point, which potentially is an outlier, lying far from the others. The next relatively large distances are for the 0.1 kg case for all natural frequencies. These are due to the offset in the predictive mean. This result is further evidenced in the Hellinger distances, Fig. 12b, where the 0.1 kg case produces relatively large distances, i.e. close to 0.5. The MMD distances also identify the 0.1 kg case as producing the largest distances. This discrepancy at the 0.1 kg mass input is due to the training data not providing enough information about the functional trend for this input. More training data would therefore improve the predictive performance at this location. Additionally, improvements to the simulator could aid predictions at 0.1 kg.
Finally, the MMD witness function was evaluated for the second natural frequency in order to further investigate its performance, due to it having the largest NMSE. The MMD witness function f Ã ðÁÞ, is the difference between two kernel embeddings, defined over the variable t as, where fx i g m i¼1 and fy j g n j¼1 are the two sample sets. The witness function provides a visualisation of the difference between two distributions and is zero intuitively where the two distributions are the same, positive when P is larger than Q, and negative when Q is greater than P, as far as the smoothness constraint allows. Therefore, high values, whether positive or negative, indicate a large difference between the predictive and observational distributions. Fig. 13 presents the witness function for the second natural frequency predictions. The 0, 0.3, 0.4 and 0.5 kg cases have low witness function magnitudes, which is expected for the training data x ¼f0; 0:3; 0:5g kg but shows good predictive performance for 0.4 kg. It can also be seen that the 0.4 kg predictive distribution is narrower than the observational data (indicated by negative values about the mode), meaning the results are not conservative. The mode is also under-estimated for the 0.1 and 0.2 kg cases however, the variance for the 0.2 kg case is conservative, covering the observational prediction. Inclusion of the parameter uncertainty may inflate these predictive distributions, potentially improving predictive performance, or at least making the distributions more conservative.

Conclusions
Model discrepancy poses challenges in calibrating structural dynamics simulators. Without accounting for the presence of model discrepancy, any parameter estimate will be biased and predictive performance potentially poor. In this paper it has been demonstrated that BHM provides a methodology for calibrating simulators whilst assuming an additive model discrepancy. The approach has been demonstrated to be successful on both a numerical case study and on an experimental five storey building structure. Furthermore, a method has been outlined for inferring model discrepancy functional forms. The novel combined technique has been shown to provide improved predictive performance and a greater insight into simulator inadequacies.
BHM is an effective method for discarding parameter space iteratively in a 'likelihood-free' manner. This approach means that difficult-to-emulate outputs, or input combinations, can be excluded and reintroduced between waves when they are more defined; this is not possible in a likelihood based approach. By separating parameter inference from model discrepancy learning the technique removes non-identifiability problems, provided its assumptions are appropriate. Furthermore, by utilising importance sampling approximations of the posterior parameter distribution can be obtained. There are avenues for further research into optimal sampling methods for assessing the parameter space, with sequential design methods potentially providing an effective method for reducing the number of simulator evaluations required to construct effective emulators. Multivariate GP emulators should also be incorporated, such that a more informative prior can be used for dependent outputs.
The model discrepancy inference technique outlined in this paper, constructs GP regression models that map the calibrated simulator outputs to experimental observations. This approach has been demonstrated to be effective in learning such func-  Table 3 Normalised mean squared errors (NMSEs) between bias-corrected output predictions and experimental data. tions from a small number of data points. Further research should be conducted into propagating the full parameter distributions in a computationally efficient manner, without the need for constructing a large quantity of GP models. Additionally, research should be conducted into physically constraining these GPs such that prior knowledge is utilised effectively. Nonetheless, the work presented in this paper demonstrates an effective methodology for both calibrating computer models when model discrepancy is present, and inferring the functional form of that model discrepancy. The approach has been shown to improve predictive performance and aid in identification of improvements to the computer model.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.