Bayesian calibration of a flood simulator using binary flood extent observations

Abstract

Computational simulators of complex physical processes, such as inundations, require a robust characterization of the uncertainties involved to be useful for flood hazard and risk analysis. While flood extent data, as obtained from synthetic aperture radar (SAR) imagery, have become widely available, no methodologies have been implemented that can robustly assimilate this information source into fully probabilistic estimations of the model parameters, model structural deficiencies, and model predictions. This paper proposes a fully Bayesian framework to calibrate a 2D physics-based inundation model using a single observation of flood extent, explicitly including uncertainty in the floodplain and channel roughness parameters, simulator structural deficiencies, and observation errors. The proposed approach is compared to the current state-of-practice Generalized Likelihood Uncertainty Estimation (GLUE) framework for calibration and with a simpler Bayesian model. We found that discrepancies between the computational simulator output and the flood extent observation are spatially correlated, and calibration models that do not account for this, such as GLUE, may consistently mispredict flooding over large regions. The added structural deficiency term succeeds in capturing and correcting for this spatial behavior, improving the rate of correctly predicted pixels. We also found that binary data carry no information about the magnitude of the observed process (e.g. flood depths), raising issues in the identifiability of the roughness parameters and of the additive terms of structural deficiency and observation errors. The proposed methodology, while computationally challenging, is shown to perform better than existing techniques. It also has the potential to consistently combine observed flood extent data with other data such as sensor information and crowd-sourced data, something which is not currently possible using the GLUE calibration framework.


Introduction
As floods represent the most costly natural hazard worldwide, the relevance of scientifically-informed risk management to aid in mitigating the impact of floods on human environments is of foremost importance (Global Facility for Disaster Reduction and Recovery, 2014; Jha et al., 2012). Furthermore, the characterization of flood risk management as a problem of decision making under uncertainty is now widely accepted in the research community, and increasing efforts are placed in that direction (Hall ...). Informal calibration approaches such as the Generalized Likelihood Uncertainty Estimation (GLUE) framework, however, cannot capture the spatial structure of the residuals, and cannot differentiate between observation errors (typically assumed mutually independent) and model inadequacy errors that usually exhibit a systematic behavior in space and time (Stedinger et al., 2008; Vrugt et al., 2009; Rougier, 2014), i.e. some locations are systematically over- or under-flooded. As a result, all variability in prediction is characterized through uncertainty in the model parameters, implicitly assuming that at least some of the simulator parameters can fit the true process reasonably well in all locations of interest.
Alternatively, the formal Bayesian approach for uncertainty quantification in simulator predictions appears as a robust framework to cope with many of the limitations of GLUE, while transparently and consistently translating the modeler's subjective judgments on uncertainty into probabilities (Rougier, 2014). The seminal work on probabilistic calibration of computational simulators by Kennedy and O'Hagan (2001) proposed a fully probabilistic model-building methodology to explicitly account for observation errors, model inadequacy, and parametric uncertainty. This approach, later extended by others (Goldstein and Rougier, 2004; Rougier, 2014), has been widely used in many different disciplines, although its application to inundation models has been mostly restricted to a few cases where flood depth measurements were available (Hall et al., 2011). The complex statistical properties of spatially censored observations (i.e. binary observations) have restricted its widespread implementation in practice. Some researchers have extended this framework to binary observations (Wani et al., 2017; Cao et al., 2018; Chang et al., 2019), and Woodhead (2007) attempted this for inundation models. This latter study, however, did not explicitly account for model inadequacy, and inferences were focused on binary values (i.e. flooded or non-flooded) rather than on flood depths.
The present work proposes to adapt the fully Bayesian methodology of Kennedy and O'Hagan (2001) to create probabilistic predictions of flood depths using inundation simulators with spatial binary observations of flood extent obtained from SAR satellite imagery, and to compare it to simpler and more traditionally used methods. The proposed calibration approach aims to:
1. develop a rigorous and probabilistic framework for inundation model inference. This enables consistent uncertainty propagation through the risk modeling chain using a proper Bayesian methodology, and allows for a consistent integration of data from different sources, such as binary satellite-borne flood extent data, continuous flood-depth measurements on the ground, or crowd-sourced qualitative data.
2. explicitly account for the deficiencies of the computational flood simulator driven by simplifications in the fluid dynamics equations used, errors in the input data (e.g. elevation map inaccuracies), and/or numerical inaccuracies of the solver. These are taken into account by the inclusion of an additive, spatially correlated, inadequacy term in predictions.
3. model errors in data acquisition (i.e. observation errors) explicitly, disaggregated from other uncertainty sources such as model structural deficiencies and input errors.
The proposed approach is built on Gaussian Process classification theory (see Rasmussen and Williams, 2006), where the physical process acts as the latent variable that is being censored. It can also be found in the literature as the Clipped Gaussian Process model (CGP) (Oliveira, 2000), or as a particular case of the spatially correlated multivariate probit with nonlinear regressors (Chib and Greenberg, 1998). Several publications can be found where different physics-based simulators were calibrated using censored observations of the real process and statistical inadequacy functions (Wani et al., 2017; Cao et al., 2018; Chang et al., 2019), although no attempt, to the authors' knowledge, has been made to implement this in the context of flood simulators and binary extent observations.

This work is organized as follows. Section 2 provides a review of the general Bayesian framework for the statistical calibration of computational models as proposed by Kennedy and O'Hagan (2001), and describes the three specific calibration approaches used in this paper: (1) the widely used GLUE framework, (2) a fully Bayesian model without an inadequacy function, and (3) a fully Bayesian framework including an inadequacy term. Section 3 describes a synthetic 1D example to help understand some of the key implications of using an inadequacy function and binary data for calibration. Section 4 describes a real case study, the data available for calibration, its numerical implementation, and the results from the three models. Section 5 presents a discussion of the key findings and limitations of the proposed approaches in terms of model building, uncertainty quantification, and information content of data. Conclusions and potential paths for future research can be found in Section 6.

Uncertainty framework
Using mathematical models to make inferences about real-world physical systems involves many sources of uncertainty, since (i) the models are always incomplete simplifications of reality, (ii) computational simulators can be numerically imprecise, and (iii) these simulators require inputs that are uncertain (Goldstein and Rougier, 2004). A robust characterization of these uncertainties is, then, necessary to understand the implications of the inferences made. In this context, an uncertainty framework is a mathematical description of the probabilistic relationships between (1) a computational simulator S used to describe a physical system, (2) inferences about the true physical process Y, and (3) observations of that process Z.
In this work, we will use upper-case letters to describe random variables and lower-case letters for particular realizations of those. Bold letters indicate vectorial quantities, which in the case of the spatial processes S, Y, Z indicate values at different points in space. For example, Y = {Y(x_1), Y(x_2), ..., Y(x_n)}. Greek letters will be used for model input parameters.
The simulator is a mechanistic model used to represent a physical process by means of simplifying assumptions about the real world. In the context of flood modeling, this is a computational code that outputs flood depths (and potentially many other quantities) at different points in space S, by solving a set of flow dynamics equations in a given spatial and time domain for a set of required inputs, as in Eq. (1):

S = S(ν, β)    (1)

where the inputs are: (1) a set of boundary conditions ν that vary from event to event and are considered known (either observed or predicted), like the upstream hydrograph or peak flow in the case of a riverine flood scenario; and (2) a set of model parameters β that are typically unobservable and considered to be fixed for a wide range of contexts by an appropriate calibration procedure (Kennedy and O'Hagan, 2001), such as terrain topography and roughness parameters. A real-world physical process is, however, described by a very complex function relating the input boundary conditions ν and the true process output Y. This function is not known (and cannot be known), and the modeler needs to rely on the available simulator, which is not, in any case of practical interest, a perfect representation of the true process: Y ≠ S. To represent this discrepancy mathematically, some researchers (Kennedy and O'Hagan, 2001; Goldstein and Rougier, 2004) have proposed the use of an additive model inadequacy function δ as in Eq. (2):

Y = S(ν, β) + δ    (2)
Finally, observations of the physical process Z at a set of points in space can be obtained through some data-acquisition technique. During flood events, there are different sources of observations, such as water level readings from gauging stations in rivers, satellite imaging of flood extent (particularly Synthetic Aperture Radar, SAR, imaging), and, more rarely, ground observations of water depths in the floodplain (Di Baldassarre, 2012). In any case, there are two distinct types of observations: uncensored water depth observations (a positive real value) and censored binary observations (water / no water) such as those obtained from SAR satellite imagery.
Mathematically, observations can be thought of as noisy versions of the true process Y, typically characterized by some additive error process ε as per Eq. (3) (Kennedy and O'Hagan, 2001; Goldstein and Rougier, 2004). Furthermore, binary observations, on which this work will mostly focus, can be represented as censored versions of the continuous water depth measurements as seen in Eq. (4):

Z = Y + ε    (3)

Z = 1{Y + ε > 0}    (4)
Where 1 {condition} is the indicator function that equals 1 when the condition is true, and 0 otherwise.
All variables and model parameters introduced so far, are summarized in Table 1.

Bayesian inference
The objective of uncertainty quantification is, in this case, to obtain probabilistic predictions of the true process output Y* at spatial locations x* for unobserved boundary conditions ν*. The Bayesian methodology allows these predictions to be conditioned on previously observed data {z, ν}, yielding what are also known as calibrated predictions (Gelman et al., 2013). The predictive distribution density can be computed as

f(y*|ν*, z, ν) = ∫∫∫ f(y*|s, δ) f(δ|β, z, ν) f(s|ν*, β) f(β|z, ν) dδ ds dβ    (5)

The distribution f(s|ν*, β) represents the uncertainty in the computation of the simulator outputs at the locations of the new values, also termed code uncertainty in the literature. This can be associated with numerical errors in the implementation and solving of the equations of the model, but more significantly can arise due to surrogacy of the model by the use of statistical emulators. Emulators are useful when the original simulator is computationally expensive to run and doing hundreds or thousands of simulations is not feasible. This will not be the case in this work, so we will assume that the simulator output S can be deterministically obtained given its parameters β. That is, f(s|ν*, β) has a single probability mass at s = S(ν*, β), and the predictive distribution reduces to Eq. (6). An application of statistical emulators within the framework described here can be found in Kennedy and O'Hagan (2001) and Hall et al. (2011).
Considering a deterministic computational simulator, and integrating out the terms δ and s in Eq. (5), we obtain

f(y*|ν*, z, ν) = ∫ f(y*|β, ν*, z, ν) f(β|z, ν) dβ    (6)

The distribution f(β|z, ν) is the posterior distribution of the simulator parameters β conditioned on observed data, and is a representation of the parametric uncertainty of the model. On the other hand, the conditional predictive distribution f(y*|β, ν*, z, ν) reflects the uncertainty in new true process predictions for a given simulator output defined by β and for a new event ν*. This is the quantification of model inadequacy, and is obtained by integrating over the posterior distribution of the inadequacy term f(δ|β, z, ν) conditioned on data as in Eq. (7):

f(y*|β, ν*, z, ν) = ∫ f(y*|S(ν*, β), δ) f(δ|β, z, ν) dδ    (7)
The posterior distributions of both the inadequacy term δ and the simulator parameters β are the core of Bayesian statistics, since they define how data are assimilated into new predictions of the true process, as given in Eqs. (8) and (9):

f(δ|β, z, ν) ∝ f(z|δ, β, ν) f(δ)    (8)

f(β|z, ν) ∝ f(z|β, ν) f(β)    (9)

The distribution f(z|δ, β, ν) is the joint probability of the observations conditioned on given values of the simulator and the inadequacy terms (also termed the conditional likelihood function). That is, it is defined by the distribution of the errors in observations ε. On the other hand, f(z|β, ν) is the probability of obtaining the observed data for a given set of simulator parameters, and is obtained by integrating out the inadequacy terms from distribution (8). This is why it is also termed here the marginal likelihood function, or simply the likelihood function. Finally, f(δ) and f(β) are the prior distributions of the inadequacy term and the simulator parameters, respectively. These should reflect the modeler's knowledge before obtaining, or disregarding, the available data (Rougier, 2014).
Since the spatial terms δ and ε are high-dimensional random variables, they can result in complex prior distributions with an arbitrarily large number of hyper-parameters. It is typical, then, to define parametric joint distributions for each that depend only on a small set of hyper-parameters: θ and σ, respectively (see Table 1). Parameter selection can be done by fixing these hyper-parameters to some values defined by the modeler, or by assigning appropriate hyper-priors to each. The posterior distribution of simulator parameters and the likelihood function are extended to include the set of hyper-parameters as in Eq. (10):

f(β, θ, σ|z, ν) ∝ f(z|β, θ, σ, ν) f(β) f(θ) f(σ)    (10)

Along the same lines, the predictive distribution (6) and the inadequacy function posterior distribution (8) should always be understood as also conditioned on the hyper-parameters θ and σ.
In this framework, then, a given statistical model is defined by assigning a prior distribution for the inadequacy term, f(δ), and a distribution for the observation error, f(z|δ, β, ν) (the likelihood function is obtained by integrating out δ from the latter). When no closed-form solution for the distributions exists, approximate sampling schemes, such as Markov Chain Monte Carlo (MCMC) (Gelman et al., 2013), are typically used to (i) sample model parameters from (10), then (ii) sample the inadequacy function from (8), and finally (iii) sample the true process output (e.g. flood depths) from f(y*|s, δ).
In the next section, two different models within this framework will be analyzed: one with the inadequacy function and one without.

Statistical models
Given the proposed Bayesian uncertainty framework to obtain calibrated predictions of flood events given past observations, a proper implementation requires the modeler to define probability distributions for the observation error model ε and prior distributions for the model inadequacy δ and the simulator parameters β. Three models are reviewed in this work: (1) the widely used GLUE method (current state of practice), (2) a simple Bayesian model without an inadequacy term, and (3) a newly proposed Bayesian model including an inadequacy term with a spatially correlated Gaussian Process prior.

Model 1: GLUE
In this framework, the marginal likelihood f(z|β, ν) is obtained from a score that ranks each set of β in terms of how 'good' it is at predicting the observed data. In this sense, it is only informally a likelihood function, or pseudo-likelihood. It is typical to leave out (i.e. assign a 0 likelihood to) those β values for which the score is below some minimum acceptance threshold defined by the modeler. A very high threshold will only accept very few β with a relatively good fit to the observation, but with a potentially larger bias (see Beven, 2014a, b).
The most widely used score function is some modified version of the Critical Success Index (CSI), which penalizes overprediction of flooded pixels (Aronica et al., 2002; Hunter et al., 2005; Stephens and Bates, 2015). This is

F = (A − B) / (A + B + C)    (11)

where A is the number of correctly predicted pixels, B the number of over-predicted pixels (predicted flooded, observed non-flooded), and C the number of under-predicted pixels (predicted non-flooded, observed flooded).
To build a likelihood distribution for each value of β from the score in Eq. (11), three steps are typically followed: (1) compute a rescaled F_0 score so that it lies in the [0, 1] range to ensure positivity; (2) reject all 'non-behavioral' models by assigning F_0 = 0 according to a predefined acceptance threshold (e.g. all models with F < 0.5, or the 70% of models with lower F values); and (3) standardize the resulting F_0 scores so that they integrate to 1. The pseudo-likelihood is then defined as in Eq. (12):

f(z|β_j, ν) = F_0(β_j) / Σ_k F_0(β_k)    (12)
Posterior samples of the parameters β are typically drawn by assuming a uniform prior, although this is not a requirement of the framework. Values β_j are drawn uniformly from the defined range of possible values, and the likelihood is computed for each with the procedure described above. The likelihood values are then used as weights to sample posterior values of β.
Finally, since no inadequacy is considered in this framework (δ ≡ 0), true process inferences are obtained directly from the simulator output, Y = S, for each posterior sample of β. All uncertainty, then, is included in the posterior distribution of the simulator parameters (i.e. parametric uncertainty), and no attempt is made to describe the joint distribution of the discrepancy between observations and simulator outputs.
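To make the procedure concrete, the following R sketch implements the three steps above under stated assumptions: run_simulator is a hypothetical wrapper returning a logical vector of predicted wet pixels given roughness parameters, and the score, threshold, and resampling follow Eqs. (11) and (12).

```r
# Sketch of GLUE calibration with the penalized CSI score of Eq. (11).
# run_simulator() is a hypothetical wrapper: it takes beta = (r_ch, r_fp)
# and returns a logical vector of predicted wet pixels.
glue_calibrate <- function(z_obs, run_simulator, n_draws = 1000,
                           threshold = 0.45, n_post = 500) {
  # Draw candidate parameters uniformly from their prior ranges
  beta <- cbind(r_ch = runif(n_draws, 0.001, 0.3),
                r_fp = runif(n_draws, 0.001, 0.3))

  # Step 1: compute the penalized CSI score for each candidate
  f_score <- apply(beta, 1, function(b) {
    wet <- run_simulator(b)
    A <- sum(wet & z_obs == 1)   # correctly predicted wet
    B <- sum(wet & z_obs == 0)   # overpredicted
    C <- sum(!wet & z_obs == 1)  # underpredicted
    (A - B) / (A + B + C)        # Eq. (11)
  })

  # Step 2: rescale to [0, 1] and reject 'non-behavioral' runs
  f0 <- (f_score - min(f_score)) / (max(f_score) - min(f_score))
  f0[f_score < threshold] <- 0

  # Step 3: normalize into a pseudo-likelihood (Eq. 12) and resample
  w <- f0 / sum(f0)
  beta[sample(n_draws, n_post, replace = TRUE, prob = w), ]
}
```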

Model 2: No inadequacy, independent observations
This model considers no inadequacy (i.e. δ ≡ 0) and assumes that observation errors are spatially independent. It is perhaps the simplest formally Bayesian model for calibrated predictions, and it implicitly assumes that the simulator can reliably capture the spatial structure of the true process. As in the GLUE framework, uncertainty in predictions is characterized by uncertainty in the simulator parameters β. The model setup is summarized in Eqs. (13).
The likelihood function is given by the distribution of observation errors ε for a given set of hyper-parameters σ. Since observations are, in this case, independent binary variables, the likelihood function is just the product of Bernoulli probabilities:

f(z|β, σ, ν) = Π_j p_j^{z_j} (1 − p_j)^{1−z_j}    (14)

where p_j is the probability that observation Z_j = 1 and depends on the marginal distribution of the observation error ε_j.
Different probability distributions for the independent observation errors ε_j yield likelihood models for the data with distinct properties. Some typical ones are:

1. Probit model: assumes a Gaussian distribution ε_j ∼ N(0, σ²), so that p_j = Φ(S_j/σ).

2. Logit model: assumes a logistic distribution ε_j ∼ logistic(0, σ), so that p_j = (1 + e^{−S_j/σ})^{−1}.

3. Binary channel model: assumes a discrete distribution for ε_j such that p_j = σ_1 when S_j > 0 and p_j = 1 − σ_2 when S_j = 0.

For a given set of parameters β and σ, each data point contributes marginally p_j^{z_j} (1 − p_j)^{1−z_j} to the total likelihood.
Considering this, we can draw some conclusions on how each error model deals with model predictions when maximizing the likelihood function:

- Correctly predicted wet (z_j = 1 | S_j > 0): the marginal contribution is p_j. The logit and probit models assign a marginal contribution 0.5 < p_j < 1, and the binary channel model a value σ_1.
- Correctly predicted dry (z_j = 0 | S_j = 0): the marginal contribution is 1 − p_j. The logit and probit models assign a marginal contribution of 0.5, and the binary channel model a value σ_2.
- Overpredicted (z_j = 0 | S_j > 0): the marginal contribution is 1 − p_j. The logit and probit models assign a marginal contribution 0 < 1 − p_j < 0.5, and the binary channel model a value 1 − σ_1.

- Underpredicted (z_j = 1 | S_j = 0): the marginal contribution is p_j. The logit and probit models assign a marginal contribution of 0.5, and the binary channel model a value 1 − σ_2.
The analysis above shows that the probit and logit models do not seem reasonable for this type of problem. On the one hand, these models assign the same likelihood to a correctly predicted dry point and an underpredicted point (predicted dry, observed wet). On the other hand, both models are very inflexible in the sense that they always penalize overpredicted points more than underpredicted ones, and correctly predicted wet points are more 'rewarded' than correctly predicted dry points. For these reasons, the binary channel model, although quite simple, appears as a more suitable and flexible option in this case (see Woodhead, 2007, for a more detailed discussion of this model). This model also allows for a simple expression for the log-likelihood for a given set of parameters:

log f(z|β, σ, ν) = A log σ_1 + B log(1 − σ_1) + C log(1 − σ_2) + D log σ_2    (15)

where A is the total number of correctly predicted flooded observations, B is the number of overpredicted observations (predicted flooded but observed dry), C is the number of underpredicted observations (predicted dry but observed flooded), and D is the number of correctly predicted dry observations.
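As a minimal sketch, the log-likelihood in Eq. (15) reduces to counting the four pixel classes; the variable names here (s_sim, z_obs) are illustrative.

```r
# Binary channel log-likelihood, Eq. (15).
# s_sim: simulated flood depths at observed pixels (0 = dry)
# z_obs: binary observations (1 = wet, 0 = dry)
loglik_binary_channel <- function(s_sim, z_obs, sigma1, sigma2) {
  wet <- s_sim > 0
  A <- sum(wet & z_obs == 1)    # correctly predicted wet
  B <- sum(wet & z_obs == 0)    # overpredicted
  C <- sum(!wet & z_obs == 1)   # underpredicted
  D <- sum(!wet & z_obs == 0)   # correctly predicted dry
  A * log(sigma1) + B * log(1 - sigma1) +
    C * log(1 - sigma2) + D * log(sigma2)
}
```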
Since there is no inadequacy, as in the GLUE model, predictive samples of flood depths are obtained from the simulator output for each posterior sample of β.

Model 3: Inadequacy function, independent observations
The full model proposed here assumes that the additive inadequacy function δ has a 0-mean Gaussian Process (GP) prior with a squared exponential covariance function defined by hyper-parameters θ, and that observation errors ε follow an independent Gaussian distribution with hyper-parameter σ representing the marginal standard deviation. The use of GPs for the inadequacy is convenient for analytical reasons, since Gaussian distributions have well-studied joint and conditional properties, but also from a historical perspective, since the use of GPs for spatial data regression underlies the first geostatistical techniques, such as kriging (Schabenberger and Gotway, 2005).
This is a generalization of the spatially independent model described in Sect. 2.2.2, in the sense that it assumes that the discrepancy between simulator output and true process has a spatially correlated structure. The null mean implies that, a priori, it is not known whether the model inadequacy will be positively or negatively biased. The model setup is described in Eqs. (16). The true process Y is strictly positive, while the sum S + δ is not necessarily so. Thus, the model setup requires rectifying the negative values with the function max(·, 0), assigning the probability of all negative values to the point y = 0 and generating a zero-inflated process (there is a non-zero probability mass at y = 0). The spatial structure of the inadequacy function is characterized by a squared exponential kernel (SEK), as per Eq. (17), with a variance parameter θ_1 that controls the amplitude of the realization and an inverse length parameter θ_2 that controls its roughness (number of crossings of a horizontal axis):

k_θ(x, x′) = θ_1 exp(−θ_2² ‖x − x′‖²)    (17)

For an analysis of the influence of kernel selection in GP regression, refer to Rasmussen and Williams (2006).
Given this model setup, the likelihood of the observed data is given by the cumulative multivariate Gaussian function

f(z|β, θ, σ, ν) = ∫_B N(u | S(ν, β), K_θ + σ²I) du    (18)

where K_θ is the covariance matrix obtained from kernel (17) for the observed spatial points x given hyper-parameters θ, and the integration region for each observed point is

B_j = (0, ∞) if z_j = 1,  B_j = (−∞, 0] if z_j = 0    (19)

Equation (18) gives the likelihood of a multivariate probit model that has been widely used for spatial classification, and which can also be found in the literature as the Clipped Gaussian Process (CGP) (Oliveira, 2000) or Spatial Generalized Linear Model (SGLM) (Schabenberger and Gotway, 2005; Berrett and Calder, 2016). It involves the integration of a high-dimensional multivariate Gaussian distribution. An efficient implementation to compute this was proposed by Genz (1992) and has been readily implemented in different coding languages. It is important to highlight that the rectification of the negative values in the definition of Y does not affect this likelihood, since the integration limits cover all positive, or all negative, values.
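A sketch of this likelihood evaluation in R is given below, assuming the kernel form of Eq. (17); the pmvnorm call is from the TruncatedNormal package used later in Sect. 4 (argument names as written here should be checked against the package documentation).

```r
library(TruncatedNormal)

# Squared exponential kernel of Eq. (17) for a coordinate matrix x (n x 2)
sek_cov <- function(x, theta1, theta2) {
  d <- as.matrix(dist(x))          # pairwise Euclidean distances
  theta1 * exp(-(theta2 * d)^2)
}

# Marginal log-likelihood of Eq. (18): probability that the latent field
# u ~ N(s_sim, K + sigma^2 I) is positive exactly at the observed wet pixels
loglik_cgp <- function(z_obs, s_sim, x, theta1, theta2, sigma) {
  K  <- sek_cov(x, theta1, theta2) + diag(sigma^2, length(z_obs))
  lb <- ifelse(z_obs == 1, 0, -Inf)    # integration region, Eq. (19)
  ub <- ifelse(z_obs == 1, Inf, 0)
  log(pmvnorm(mu = s_sim, sigma = K, lb = lb, ub = ub))
}
```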
For the current model, calibrated predictions of the true process require posterior sampling of the inadequacy term as described by distributions (7) and (8). Making use of Bayes' theorem given the current model setup, and in line with the theory of GPs for classification (Rasmussen and Williams, 2006), the posterior distribution of the inadequacy function at the observed points is given by

f(δ|β, θ, σ, z, ν) ∝ N(δ | 0, K_θ) Π_j p_j^{z_j} (1 − p_j)^{1−z_j}    (20)

where p_j = Φ((S_j + δ_j)/σ).
Since distribution (20) does not have a closed-form solution, approximate numerical techniques should be used to draw posterior inadequacy samples. A simple change of variables allows us to obtain a more convenient sampling scheme. Introducing an intermediate noisy latent variable u = S + δ + ε, the distribution (20) can be rewritten as

f(δ|β, θ, σ, z, ν) = ∫ f(δ|u, β, θ, σ) f(u|β, θ, σ, z, ν) du    (21)

The first factor of the integrand is the conditional distribution of two correlated Gaussian variables and is given by Eq. (22), as can be deduced from standard multivariate Gaussian properties (Rasmussen and Williams, 2006):

f(δ|u, β, θ, σ) = N(δ | K_θ(K_θ + σ²I)^{−1}(u − S), K_θ − K_θ(K_θ + σ²I)^{−1}K_θ)    (22)

The second factor represents the conditional distribution of a continuous Gaussian variable u conditioned to lie in the region defined by z = 1{u > 0}.
That is the definition of the truncated Gaussian distribution in Eq. (23):

f(u|β, θ, σ, z, ν) = N_B(u | S(ν, β), K_θ + σ²I)    (23)

This is strictly valid if u = Y + ε, which is not exactly the case due to the rectification of negative values in the definition of Y. Then again, since observations only distinguish between all negative and all positive values, this is not expected to have a significant influence on the model.
The distribution N_B(u|·, ·) is a Gaussian distribution truncated to the region B = B_1 × ... × B_n defined by the observations as in Eq. (19). For example, in the bivariate case with observations z = {0, 1}, the distribution is truncated to the region (−∞, 0] × (0, ∞) (i.e. the upper-left quadrant of the plane). It is important to note that expression (22) is very similar to the predictive distribution in kriging regression, with u in place of the actual observation values (Rasmussen and Williams, 2006).
At this point, posterior predictive samples at the locations of the observed data can be obtained by simply adding the computational simulator output (for any new event ν*) and the inadequacy sample (for each posterior set of θ and β parameters).
However, if predictions are required at other, non-observed, points in space x*, the spatial correlation in the posterior predictions of the inadequacy term should be taken into account. Since inadequacy values at different points are jointly Gaussian, we can sample posterior predictions at unobserved points from the Gaussian distribution in Eq. (24):

f(δ*|δ, θ) = N(δ* | K*_θᵀ K_θ^{−1} δ, K**_θ − K*_θᵀ K_θ^{−1} K*_θ)    (24)
The matrix K**_θ is the covariance for the new spatial points x*, and K*_θ the cross-covariance between the observed points x and the new points x*.

Summarizing, the inference process requires the following three steps:

1. Sample β, θ, σ from their posterior distribution, with likelihood given by (18), using an appropriate MCMC scheme.
2. Sample the inadequacy function term at the locations of interest from the conditional distribution f(δ|θ, z, ν) by:

(a) sampling the noisy latent field u from the truncated normal distribution in Eq. (23);
(b) given the sampled field u, sampling the inadequacy field from the Gaussian distribution in Eq. (22);
(c) if necessary, sampling the inadequacy field at unobserved locations using Eq. (24).
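A sketch of step 2 in R, continuing the assumptions of the previous sketch (sek_cov and the cross-covariance helper sek_cross are illustrative; rtmvnorm is the truncated multivariate normal sampler of the TruncatedNormal package):

```r
# Cross-covariance between two coordinate matrices, same kernel as Eq. (17)
sek_cross <- function(a, b, theta1, theta2) {
  d2 <- outer(rowSums(a^2), rowSums(b^2), "+") - 2 * a %*% t(b)
  theta1 * exp(-theta2^2 * pmax(d2, 0))
}

# One posterior draw of the inadequacy field at observed (x) and new (x_new)
# locations, for fixed beta (through s_sim), theta and sigma.
sample_inadequacy <- function(z_obs, s_sim, x, x_new, theta1, theta2, sigma) {
  n  <- length(z_obs)
  K  <- sek_cov(x, theta1, theta2)
  Ku <- K + diag(sigma^2, n)          # covariance of u = S + delta + eps

  # (a) latent field u from the truncated normal of Eq. (23)
  lb <- ifelse(z_obs == 1, 0, -Inf)
  ub <- ifelse(z_obs == 1, Inf, 0)
  u  <- as.numeric(rtmvnorm(1, mu = s_sim, sigma = Ku, lb = lb, ub = ub))

  # (b) delta | u from the conditional Gaussian of Eq. (22)
  A     <- K %*% solve(Ku)            # K (K + sigma^2 I)^(-1)
  m     <- as.numeric(A %*% (u - s_sim))
  V     <- K - A %*% K
  delta <- m + t(chol(V + 1e-8 * diag(n))) %*% rnorm(n)

  # (c) delta at unobserved locations from Eq. (24)
  Ks    <- sek_cross(x, x_new, theta1, theta2)
  Kss   <- sek_cov(x_new, theta1, theta2)
  Kinv  <- solve(K + 1e-8 * diag(n))
  m_new <- t(Ks) %*% Kinv %*% delta
  V_new <- Kss - t(Ks) %*% Kinv %*% Ks
  d_new <- m_new + t(chol(V_new + 1e-8 * diag(nrow(Kss)))) %*% rnorm(nrow(Kss))

  list(delta = as.numeric(delta), delta_new = as.numeric(d_new))
}
```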

Performance assessment
In Bayesian inference, the performance of predictive models is typically assessed by comparing calibrated predictions against observations. Since we are dealing with spatial binary data, a spatial index can be built as the posterior probability of any given point in space being mispredicted (over- or underpredicted). That is, for an observed flooded point z_j = 1, this index is the posterior probability that the point is predicted dry. Adopting a negative sign for overpredictions and a positive sign for underpredictions to improve visualization, this 'misprediction rate' index is expressed mathematically as

ρ_j = 1{z_j = 1} p(y_j = 0 | z, ν) − 1{z_j = 0} p(y_j > 0 | z, ν)    (25)

where the sub-index j indicates the spatial location x_j.
A point with ρ_j close to 1 (or −1) means that the calibrated predictions systematically underpredict (or overpredict) flooding at that location. On the contrary, a value closer to 0 implies that the simulator is consistently estimating it correctly.
For a perfect simulator, all pixels would be correctly predicted flooded or dry in every simulation (in fact, all realizations of the statistical model would be the same).
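As a sketch, ρ_j can be estimated directly from posterior predictive samples; here y_samp is a hypothetical matrix of predicted flood depths with one row per posterior draw and one column per pixel.

```r
# Misprediction rate index of Eq. (25) from posterior predictive samples,
# together with its spatial average as introduced in Eq. (26) just below.
misprediction_rate <- function(y_samp, z_obs) {
  p_dry <- colMeans(y_samp == 0)        # posterior prob. each pixel is dry
  rho <- ifelse(z_obs == 1,
                p_dry,                  # underprediction: positive sign
                -(1 - p_dry))           # overprediction: negative sign
  list(rho = rho, P = mean(abs(rho)))
}
```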
An overall metric for the entire region of analysis can be obtained as the average misprediction rate, by averaging |ρ_j| over all points as per Eq. (26). This is equivalent to the average number of mispredicted observations over all posterior predictions.

P = (1/n) Σ_j |ρ_j|    (26)

Synthetic 1D example

To help understand some conceptual and numerical features of the role of the inadequacy function and binary observations in calibration, a simplified one-dimensional (1D) model is used as the real process, and noisy synthetic observations, both uncensored and censored (binary), are numerically obtained from it. Three different model settings are used: (1) a model with the simulator alone, (2) a model with the inadequacy alone, and (3) a model with both.

In this one-dimensional case, the true (synthetic) process is represented by a modulated harmonic function f(x) = 3x² sin(10x² + 6x). To predict this, a computational model is available in the form of a fixed-amplitude harmonic function: S(x) = 3x² sin(βx). This simple model could represent our flood simulator. It does a good job of capturing the varying amplitude of the true periodic function, but it cannot perfectly represent its frequency content, as can be seen by comparing the arguments of both sinusoidal functions in Fig. 1. The parameter β needs to be calibrated with observations (training dataset) of the true process in order to maximize the appropriate likelihood function. Observations of the true process are obtained by adding white Gaussian noise, and by thresholding at y = 0 in the case of binary observations (see Fig. 1).
In every case in this example, posterior samples of parameters were obtained by means of Markov Chain Monte Carlo (MCMC) methods.
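For reference, the synthetic data of this example can be generated along the following lines in R (the domain discretization and noise level are assumptions, as the text does not state them):

```r
set.seed(1)
x <- seq(0, 2, length.out = 200)             # assumed spatial domain
f_true <- 3 * x^2 * sin(10 * x^2 + 6 * x)    # true (synthetic) process

# Computational model with the frequency parameter beta to be calibrated
simulator <- function(x, beta) 3 * x^2 * sin(beta * x)

sigma_eps <- 0.2                             # assumed observation noise
z_cont <- f_true + rnorm(length(x), 0, sigma_eps)  # continuous observations
z_bin  <- as.integer(z_cont > 0)             # censored (binary) observations
```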

Model calibration with continuous observations
Calibrated predictions with continuous observations are included to provide a better understanding of the limitations of having censored observations. The prediction model is the same as Model 3 defined in Sect. 2.2.3, with the difference that for uncensored observations we have Z = S + δ + ε. This is in line with the standard theory of GP regression, where the simulator output S appears as a deterministic additive term; see Rasmussen and Williams (2006), and Hall et al. (2011) for an application to inundation models. The predictive distribution for new observations and the parameters' likelihood are given by equations (27) and (28).

Model predictions are shown in Fig. 2 for (1) the computational model alone (δ ≡ 0), (2) the inadequacy function alone (S ≡ 0), and (3) the full model. It can be seen, as expected, that the simulator alone cannot reliably predict the data over the entire range, since it cannot capture the varying frequency of the true process. The likelihood shows many peaks for different β values corresponding to different possible models: low-frequency models where the data are fitted rather well at the left end of the range, or high-frequency models where the data are fitted well at the right end of the range (as the one plotted in Fig. 2). The narrow uncertainty bounds reflect the choice of a narrow prior to converge to one of these models. Choosing a 'less-informative' prior might result in unreasonably large confidence bounds and a predictive curve that mixes rather physically-distinct models, rendering it useless.
On the other hand, the middle plot shows that the flexibility of the GP prior allows the data to be fitted adequately everywhere without the need for the computational model. For the full model (lower plot), it can be seen that the simulator and the inadequacy add up to correctly fit the data everywhere.
In the latter case, model parameters are, a priori, unidentifiable, since the inadequacy function competes with the simulator to fit the data. That is, for any given set of simulator parameters β, a posterior distribution of the inadequacy parameters θ can be found that appropriately fits the data. To solve this, we fixed the values of the simulator parameters β close to the ones obtained in the first case (see the full yellow line in the plot), and let the inadequacy correct the predictions only where necessary (see the dotted yellow line in the plot). This is in line with the suggestions in Reichert and Schuwirth (2012) and Wani et al. (2017), and is based on the assumption that physics-based computational simulators are expected to have better extrapolating capabilities to unseen events than the inadequacy function. Numerically, this is done by assigning narrow priors for β in the region where the simulator alone does a better job.

Model calibration with binary observations
In the case of binary observations, the role of the inadequacy function in data fitting becomes more complex. Using only the inadequacy to fit the binary data, the model shows a reasonable fit for the varying frequency but an approximately constant amplitude over the entire range (see Fig. 3). This reflects the limited information that binary data carry relative to their uncensored counterpart. In particular, binary data do not give information about the amplitude of the true process, only about its crossings of the 0-axis (also termed function roughness, or frequency content).
It is expected that parameters that control the 'amplitude' of the inadequacy function (and/or the simulator) remain largely unidentified (Oliveira, 2000; Berrett and Calder, 2016). This is the case for the marginal variance θ_1 in the squared exponential kernel of the Gaussian Process. Using a very wide prior for this parameter will yield arbitrarily large uncertainty bounds, since the data do not constrain its amplitude (i.e. the posterior is also very wide). To show this, Fig. 3 compares two calibrations using relatively wide priors for θ_1 centered at different values. Results show a similar frequency pattern but radically different amplitudes due to the different priors for θ_1. Along the same lines, the noise parameter σ is also not identifiable, since the binary operator filters out the high-frequency, low-amplitude influence of the white noise.
As a result, the amplitude of the inadequacy remains unidentified by the data, but at the same time has a very significant impact on predictions: an arbitrarily large θ_1 will imply arbitrarily large amplitudes of the inadequacy function, yielding physically unrealistic predictions, while very low θ_1 values will yield a very low amplitude for the inadequacy, rendering it virtually useless for calibration. In this context, the influence of the simulator on calibration becomes of paramount importance.
The simulator is the only component that can give a reliable assessment of the amplitude of the true process, while the inadequacy can help correct for the spatial structure (i.e. crossings of the 0-axis) wherever the simulator is deficient. There is no strict guidance on what value to center the prior for θ_1 at, but a rule of thumb would indicate using the lowest value that clearly improves the fit with observations. Figure 4 shows different calibrations using both the simulator and the inadequacy for narrow θ_1 priors centered at different values. A very low value (upper plot) implies that the inadequacy practically does not modify the simulator output anywhere, while a very large value (lower plot) implies that the inadequacy amplitude overshadows the simulator output everywhere, but especially where the simulator on its own does not fit the data correctly. A prior for θ_1 centered around 0.1 seems to balance the amplitude tradeoff reasonably well, giving a significant improvement over the simulator alone but still leaving the simulator output to predominate over the inadequacy. This also reinforces the idea that the simulator should be fixed around its best values, as was explained in the uncensored data case.

Case study

The case study is based on a short reach of the upper river Thames in Oxfordshire, England, just downstream from a gauged weir at Buscot. The river at this location has an estimated bankfull discharge of 40 m³/s and drains a catchment of approximately [...] (Aronica et al., 2002). The topography DEM was obtained from stereophotogrammetry at a 50 m scale with a vertical accuracy of ±25 cm, derived from large-scale UK Environment Agency maps and surveys (see Fig. 5).
For calibration, a satellite observation of the flood extent of a 1-in-5-year event that occurred in December 1992 was used (Fig. 5). The binary image of the flood was captured 20 h after the flood peak, when the discharge was at a level of 73 m³/s as per the hydrometric data recorded by the gauging station (Aronica et al., 2002). The resolution of the image is 50 m. As described in previous research, the short length of the reach and the broadness of the hydrograph imply that a steady-state hydraulic model is sufficiently accurate for the calibration (Aronica et al., 2002; Hall et al., 2011).

Inundation model
The computational inundation model used is the raster-based Lisflood-fp model (Neal et al., 2012). Lisflood-fp couples a 2D water flow model for the floodplain with a 1D solver for the channel flow dynamics. Its numerical structure makes it computationally efficient and suitable for the many simulations needed for probabilistic flood risk analysis and model calibration. A simplified rectangular cross-section is used for the channel, with a constant width of 20 m for the entire reach and a varying height of around 2 m. The observed event is defined by the boundary condition of a fixed input discharge of ν = 73 m³/s at the geographic location of the gauging station shown in Fig. 5, and by an assumed downstream boundary condition of a fixed water level of approximately 90 cm above the channel bed height. The model's parameters used for calibration are the Manning's roughness parameters for the channel r_ch and for the floodplain r_fp, both considered spatially uniform in the domain of analysis. That is, β = {r_ch, r_fp}.

Numerical implementation
The calibration routine was implemented in the programming language R (R Core Team, 2020), using the pmvnorm function of the TruncatedNormal package (Botev and Belzile, 2021) to compute the high-dimensional integral of the likelihood function (18). The rtmvnorm function from the same package was used to sample from the truncated normal distribution of Eq. (23).
These two evaluations were the most time-consuming parts of the entire process due to the high dimensionality of the observations (around 1800 pixels): while a single run of the inundation model takes around 3 s on a 10-core intel i9-10700k processor, one likelihood evaluation takes around 40 s. The original image of the reach was trimmed closer to the flood extent to reduce the number of observations for calibration, as seen in Fig. 5.
For Models 2 and 3, both the predictive distribution of new observations and the posterior distributions of the model parameters were sampled using an adaptive MCMC scheme. A Gaussian jump distribution was used to select candidates, with a covariance matrix empirically obtained from initial runs of the chain and subsequently scaled up and down to obtain an acceptance ratio of around 0.25. Two chains of 15,000 runs with an initial adaptive step of 5,000 were used to ensure adequate mixing and stabilization of the chains, as measured by a Gelman-Rubin convergence diagnostic (Gelman et al., 2013) below 1.05.
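A minimal sketch of such an adaptive scheme is given below; log_post stands for the unnormalized log-posterior (likelihood (18) plus log-priors), and the rescaling rule is one simple way to steer the chain toward the stated 0.25 acceptance ratio.

```r
library(MASS)  # mvrnorm for the multivariate Gaussian jump

adaptive_metropolis <- function(log_post, init, n_iter = 15000,
                                n_adapt = 5000, target_acc = 0.25) {
  d <- length(init)
  S <- diag(d) * 0.01                  # initial jump covariance
  scale <- 1; acc <- 0
  chain <- matrix(NA_real_, n_iter, d)
  cur <- init; cur_lp <- log_post(cur)
  for (i in seq_len(n_iter)) {
    prop <- mvrnorm(1, cur, scale^2 * S)       # Gaussian jump candidate
    lp <- log_post(prop)
    if (log(runif(1)) < lp - cur_lp) {         # Metropolis acceptance
      cur <- prop; cur_lp <- lp; acc <- acc + 1
    }
    chain[i, ] <- cur
    if (i <= n_adapt && i %% 500 == 0) {
      # Adapt: empirical covariance of the chain so far, rescaled to
      # push the running acceptance ratio toward the target
      S <- cov(chain[1:i, , drop = FALSE]) + 1e-8 * diag(d)
      scale <- scale * ifelse(acc / i > target_acc, 1.1, 0.9)
    }
  }
  chain
}
```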

Results
Calibrated predictions for the observed event are obtained using the three methods described in Sect. 2.2. These predictions are then compared to the satellite binary observation through the goodness-of-fit metrics in Eqs. (25) and (26). The 'probability of flood' map p(y_j > 0|z, ν) is obtained empirically from the prediction samples, by computing the proportion of samples in which each pixel is flooded. Since only one observation is available, no formal testing error is evaluated here.

Model 1: GLUE
The inundation model was run for a fine grid of β with uniform (prior) probability in the region 0.001 < r_ch < 0.3 and 0.001 < r_fp < 0.3. Only runs with F > 0.45 were retained as 'behavioral solutions' for posterior analysis and prediction, with a maximum of F = 0.52 obtained for r_ch = 0.029 and r_fp = 0.045 (see Fig. 6). While this leaves out the large majority of the runs, visual inspection showed that values lower than F = 0.45 could yield unacceptable inundation patterns. It appears from the marginal posterior distributions in Fig. 6 that the model fit is better for lower values of channel or floodplain roughness.
The probability of flood and misprediction rate ρ maps in Fig. 7 show that all accepted simulations systematically mispredict (over- or under-) inundation in several regions around the edge of the flood extent, yielding an average misprediction rate of P = 0.08. This can be due to limitations of the Lisflood-fp model in capturing the spatial behavior of the true event, to some spatially dependent error in the input data (e.g. the DEM), or, most probably, a combination of both.
It is important to stress again that the results of the GLUE model are very sensitive to the acceptance threshold used (in this case, F > 0.45). A lower threshold implies larger areas with more uncertain inundation patterns and larger areas of relatively low probability of misprediction, while retaining only a few best simulations (i.e. a higher threshold) yields less variability in predictions and smaller areas with a higher probability of misprediction, as in the results of Fig. 7.

Model 2: No inadequacy, independent observations
This model is tested with a binary channel observation error structure (see Sect. 2.2.2). Prior and posterior distributions for the model parameters are shown in Fig. 8. It can be seen that the posterior of the Lisflood-fp parameters is narrowly concentrated around fixed 'best-fit' values: r_ch = 0.03, r_fp = 0.045. For the observation error parameters, the posteriors are narrowly concentrated around σ_1 = 0.82 and σ_2 = 0.98, implying that over-predictions are more 'accepted' than underpredictions. The model likelihood seems to clearly discriminate this set of parameters against any other, resulting in practically no variability in predictions. This is in agreement with the results obtained by Woodhead (2007), and has to do with the shape of the likelihood function (15), where for σ_1 and σ_2 values close to 1 a slight change in the number of mispredicted pixels results in a large change in the likelihood. On the other hand, values equal to 0.5 would imply a constant likelihood (see Eq. (15)) and no information gained from the data. [...] two models. This model yields an average misprediction rate of P = 0.057 and also displays a different spatial pattern, where over- and under-predicted pixels seem to be more uniformly distributed along the edge of the observed flood extent.

Discussion
The methodology and results shown in the previous sections indicate that satellite-borne binary observations can provide valuable information for inundation model calibration, and that explicitly modeling the spatially correlated discrepancy between simulator output and observations through an inadequacy term can improve predictions. In this section we discuss some of the main takeaways from the results obtained for the illustrative example and the real-world case study.

Model building and hypothesis
The traditionally used GLUE framework was described and compared to a more general and formal framework in which appropriate probabilistic models were assigned to the discrepancy between observations and simulator output. The main advantages of this methodology are (1) the capability of consistently including the different sources that contribute to the discrepancy, such as observation errors or simulator inadequacy, and (2) the computation of formal probability distributions, and thus uncertainty bounds, for readily observed or future events. This does not mean that any calibrated predictions and uncertainty bounds within [...]

A basic assumption used for model building here is that observation errors are spatially independent and homogeneous (identically distributed). Model 2 was a simple model built by assuming that all discrepancy between observations and simulator output could be explained by this type of observation error alone. Results from this model, and from GLUE (see Figs. 7 and 9), show that the discrepancy is spatially correlated, reflected in regions of systematically under- or over-flooded pixels, indicating that the simulator on its own cannot capture the spatial structure of the observed extent everywhere. This could be due to a more complex observation error structure (uncertainty in satellite image acquisition and interpretation), to limitations in the physical representation given by the equations of the model, or to errors in the boundary conditions (e.g. errors in the DEM used). Most probably, a combination of all.
These results also highlight the importance of explicitly modeling the spatial correlation that is not captured by the computational simulator. Model 3 implements this through an additive inadequacy term with a spatially correlated Gaussian Process prior. Implicit in this model is the assumption that all spatial correlation in the discrepancy is assigned to simulator inadequacy and none to the observation errors (assumed independent). This, of course, could be challenged in the light of further information or knowledge, as in the error models studied by Woodhead (2007), but it remains a common assumption in the calibration of physical models (Kennedy and O'Hagan, 2001).

Information content of binary observations
The illustrative 1D example in Sect. 3 was specifically selected to show the limitations that using censored observations implies for predictive accuracy, model building, and numerical modeling. Censored binary observations, as explained, do not provide information about the magnitude of the true process output (i.e. flood depths) but only about its spatial frequency (i.e. crossings of the 0-axis). This means that when only censored observations are available, an appropriate simulator is of paramount importance to obtain meaningful predictions, as seen in Fig. 4.
From a numerical standpoint, binary observations do not provide information about the marginal variance θ_1 that controls the inadequacy amplitude. To avoid identifiability and convergence problems, a narrow prior (or simply fixing the value) is required, centered at an appropriate value following the criteria mentioned previously: not so large as to unrealistically distort simulator outputs, but large enough to improve the fit to the observations where needed. For this reason, relatively large uncertainty bounds on flood-depth predictions are also to be expected, as can be seen in Figs. 4 and 12. The noise parameter σ also remains mostly unidentifiable, since it is filtered out by the censoring, and it does not seem to significantly affect the results if kept within reasonable values (similar to the θ_1 magnitude for this case study). Further experiments are needed to analyze the influence and tradeoffs of the marginal variance and noise values used in calibration.

Limitations
As explained in the introduction, the main reason for the popularity of GLUE is its conceptual simplicity and ease of implementation. The proposed Bayesian model, in contrast, is considerably more demanding: the discrepancy between a simulator output and observations of the real process can be very complex and non-stationary (both in space and time).
A flexible and non-linear model such as a GP is a good choice to predict complex and correlated residual structures, but it also carries the risk of over-fitting the observations. When uncensored data are available, this can be controlled with an appropriate level of observation noise (e.g. σ in our framework). When only binary data are available, however, the noise is not distinguishable in the likelihood, and over-fitting can pose a serious problem. Our proposed approach to deal with this is to give the simulator a predominant role in prediction while restricting the inadequacy to act 'only where necessary'. This implies verifying improvements in the fit of the model in those places where the simulator alone systematically mispredicts, and also checking the absolute values of predictions to prevent unrealistic values from the inadequacy function. This still requires modeler expertise and knowledge about the particular problem at hand.
Along the same lines, an implicit hypothesis of the model is that the inadequacy function does not depend on the forcing event ν. This was assumed more for reasons of data availability than on theoretical grounds. While a single event might be informative of the inadequacy function, particularly if deficiencies in the simulator are due to input errors (e.g. DEM inaccuracies), it is also expected that some of the simulator's deficiencies in predicting the true process depend on the magnitude of the events being predicted. This issue could be approached by calibrating with observations from events of different magnitudes, resulting, of course, in a higher computational demand.
In this regard, an important limitation of the proposed Bayesian model with a spatially correlated inadequacy function is the computational burden. While this is also a limitation for any calibration method, in this case the likelihood function (18) is very time-consuming to evaluate with current computational capacities and mathematical techniques (Genz, 1992), even for a reach as small as the one studied here. The same applies to sampling from the truncated normal in Eq. (23) for calibrated predictions. Widespread adoption of this formal model thus requires more efficient techniques for computing and sampling from very high-dimensional Gaussian distributions, or ways of using fewer observations without losing important information.
Dimensionality reduction techniques are expected to play a role in this direction (Chang et al., 2019).

Conclusions
Efficient management of natural hazard risk requires reliable predictions of very complex physical processes, such as inundations. Hence the importance of having adequate predictive models that can capture the relevant spatial features of flooding while making use of all available real-world data. This paper proposes a fully probabilistic framework for the statistical calibration of inundation simulators using binary observations of flood extent.

Figure 1. (upper) Synthetic true model and realizations of computational model runs; (lower) continuous and binary observations of the true process


Figure 2. (upper) Calibration using the computational model only; (middle) calibration using the inadequacy function only; (lower) calibration using the computational model and an additive inadequacy function

Figure 3. Calibration using the inadequacy function only, for a θ1 prior centered at 1 (upper plot) and centered at 20 (lower plot)

Figure 4. Comparison of calibrations using both the simulator and the inadequacy function for different prior means of θ1: 0.02 (upper plot), 0.1 (middle plot), 20 (lower plot)


Figure 5. Floodplain topography at Buscot, SAR imagery of the 1992 flood event (light blue), channel layout (dark blue), and gauging station location (red dot). The dotted green line indicates the observed region considered for calibration.

Figure 6. Marginal prior and posterior distributions of parameters for Model 1

Figure 8. Marginal prior and posterior distributions of parameters for Model 2

Figure 9.

Figure 10. Marginal prior and posterior distributions of parameters

Figure 11.

Figure 12. Predictive distributions of flood depths for three different points in the floodplain

Figure 13. Flood-depth maps from Model 3 predictions: simulator + inadequacy (upper), simulator alone (middle), and inadequacy alone (lower), for a marginal prior mode of θ1 = 0.05 (left column) and θ1 = 1 (right column). The dashed black line is the observed flood extent.


Table 1. Summary of variables, parameters, and hyper-parameters in the uncertainty framework