Gaussian process emulation of spatio-temporal outputs of a 2D inland flood model

The computational limitations of complex numerical models have led to the adoption of statistical emulators across a variety of problems in science and engineering to circumvent the high computational costs associated with numerical simulations. In flood modelling, many hydraulic and hydrodynamic numerical models, especially when operating at high spatiotemporal resolutions, have prohibitively high computational costs for tasks requiring the instantaneous generation of very large numbers of simulation results. This study examines the appropriateness and robustness of Gaussian Process (GP) models for emulating the results of a hydraulic inundation model. The developed GPs produce real-time predictions based on the simulation output of the LISFLOOD-FP numerical model. An efficient dimensionality reduction scheme is developed to tackle the high dimensionality of the output space and is combined with the GPs to investigate the predictive performance of the proposed emulator for estimating inundation depth. The developed GP-based framework is capable of robust and straightforward quantification of the uncertainty associated with the predictions, without requiring additional model evaluations and simulations. Further, this study explores the computational advantages of a GP-based emulator over alternative methodologies, such as neural networks, through a comparative analysis. For the case study data presented in this paper, the GP model was found to accurately reproduce water depths and inundation extent (by classification), and produced computational speedups of approximately 10,000 times compared with the original simulator and 80 times compared with a neural network-based emulator.


Introduction
Numerical modelling and simulation of flooding scenarios is a vital part of flood risk management, enabling timely planning and response to extreme climatic events. A range of complex numerical modelling techniques have been developed in the past decade to accurately simulate flow-structure interactions and the corresponding hydrodynamic structure. These models differ in factors such as scale and spatial resolution (e.g., global, continental, basin), numerical techniques, complexity as measured by the number of dimensions used to model the flow (1D, 2D and 3D), the underlying processes which lead to flooding events (i.e., inland and coastal floods), and the model outputs. Hence, there are significant differences between the capabilities, accuracy, and robustness of the available flood models. Much research has been conducted into reviewing current hydrodynamic modelling strategies (Teng et al., 2017; Bulti & Abebe, 2020) and assessing the performance of existing widely used flood models (Fewtrell et al., 2011; Zischg et al., 2018).
A primary drawback of existing numerical simulations for flood modelling is the computational cost of generating results. Despite active research to enhance the computational capabilities of these models, accurate flood modelling at high spatiotemporal resolutions can still be prohibitively expensive. In most flood modelling studies, it is also desirable to understand the influence of varying the inputs on the outputs and to simulate potential flooding scenarios. An in-depth understanding of the uncertainty associated with the model outputs and the model parameterisation is essential to establish a robust flood modelling framework. There are several approaches to addressing these questions, including sensitivity analysis (Oakley & O'Hagan, 2004; Daneshkhah & Bedford, 2013), uncertainty quantification (Soize, 2017; Ghanem et al., 2017), and model calibration (Lee et al., 2019; Kennedy & O'Hagan, 2001). However, these techniques generally require very large numbers of model runs, meaning that performing such tasks can become infeasible with even modest model computation times; alternative modelling approaches are therefore needed for robust and timely analysis of numerical models.
One approach to mitigating the high costs associated with numerical flood modelling tools is to utilise parallel and GPU computing techniques (Sharif et al., 2020; Morales-Hernandez et al., 2021; Neal et al., 2018; Yeganeh-Bakhtiary et al., 2020). At high temporal and spatial resolutions, although parallel CPU and GPU implementations can significantly reduce the computation time, a single simulation of such models can still take many hours to complete (Goodarzi et al., 2020; Shaw et al., 2021). Despite the computational limitations, in many cases numerical modelling is the most reliable tool for the simulation and prediction of flooding and inundation, especially when no alternative data-driven approaches (Van Steenbergen et al., 2012; Abolfathi et al., 2016) can be utilised due to sparse empirical observations.
The adoption of machine learning (ML) approaches (Mosavi et al., 2018; Yang & Chang, 2020; Lin et al., 2020) in flood modelling could alleviate the high computational costs associated with numerical modelling, allowing a wider range of scenarios to be considered in a reasonable amount of time and with more accessible computational resources. The increasing dependence on numerical models for applications in science and engineering, and their associated computational costs, have led to the development and adoption of a new class of ML-based models, known as surrogate models or probabilistic 'emulators'.
An emulator is a statistical model that approximates the outputs of a, usually complex, model, where the model can be considered as a deterministic input-output computer simulator (Sarri et al., 2012). The emulation approach considered in this paper originates from the Bayesian treatment of computer model outputs (O'Hagan, 2006), developed for settings where evaluating the simulators that represent complex mathematical models is computationally very expensive. An emulator should approximate the output of a simulator to a sufficient level of accuracy while offering a significant reduction in computational cost and in the number of required training datapoints. In recent years, several studies have focused on the use of emulators for a range of numerical models, including tsunami models (Sarri et al., 2012; Salmanidou et al., 2017), flood models (Kabir et al., 2020), and climate projection models (Tran et al., 2019). A variety of approaches have been examined for statistical emulation, including Artificial Neural Networks (García-Alba et al., 2019; Wang et al., 2019), Polynomial Chaos Expansions (Massoud, 2019; Moreno-Rodenas et al., 2018) and Gaussian Processes (Conti et al., 2009; Chang et al., 2015; Longobardi et al., 2020; Yang et al., 2018). Kabir et al. (2020) proposed a deep convolutional neural network (CNN) model to emulate flood simulation results from a numerical model and generate rapid predictions of inundation levels. Benchmarked against a support vector regression (SVR) model, Kabir et al. (2020) found the CNN model outperforms SVR with respect to classification metrics such as precision and recall, and slightly underperforms with respect to regression metrics such as RMSE and Nash-Sutcliffe efficiency. However, the model was validated at only a very small subset of locations in the domain, which may not provide the most accurate assessment of model performance. The proposed emulator was more than 20 times faster than the underlying numerical model; however, while this computational improvement is significant relative to the original numerical model, the runtime of the emulator may still be prohibitively expensive for tasks such as scenario modelling, sensitivity analysis, and uncertainty quantification. Kabir et al. (2020) further concluded that, in contrast to GPs, the CNN-based approach cannot readily quantify the uncertainty associated with the emulator predictions.
Several ML-based techniques have led to a variety of hybrid approaches to forecasting flood inundation depth, utilising computationally expensive physics-based numerical models alongside ML methodologies. These approaches often aim to reconstruct, or simplify, the modelling problem to make it more tenable for ML methods and have had success in reconstructing inundation values (Yan et al., 2021; Zhou et al., 2021). In this study, however, a novel GP-based methodology is developed to emulate the outputs of a 2D hydraulic flood model while making as few modelling simplifications as possible. The proposed model requires far fewer model evaluations, and consequently delivers a significant reduction in the computational resources needed and in the model's runtime. The numerical model used in this study is LISFLOOD-FP, originally developed by Bates & De Roo (2000) for flood inundation simulations. Due to the high number of model outputs to be considered, a dimensionality reduction scheme is developed on the output space to reduce the computational complexity of modelling with multi-output GP models. The proposed model has low computational costs and is capable of straightforward quantification of predictive uncertainty. The appropriateness and robustness of the developed multi-output GP model for predicting flood inundation depths is examined for a case study in North Yorkshire, UK, and a direct comparison is made between the outlined GP methodology and the CNN model proposed by Kabir et al. (2020).

Numerical flood modelling
Numerical flood models simulate the interactions of water flow with the surrounding environment using partial differential equations (PDEs) that govern the motion of the fluid flow. Many open-source models exist for modelling floods in two dimensions (i.e., spatial coordinates), including LISFLOOD, LISFLOOD-FP, MIKE-FLOOD and TELEMAC-2D. These models have been successfully validated for modelling inland and coastal flooding scenarios (Pinos & Timbe, 2019; Dong et al., 2018). This study adopts LISFLOOD-FP, a two-dimensional hydraulic model (hereafter, 2D flood model) capable of utilising high-resolution topographic data for simulating flood inundation depths in challenging urban settings. The accuracy of LISFLOOD-FP for predicting inundation levels has been validated by several studies (e.g., Shustikova et al., 2019; O'Loughlin et al., 2020).
2D flood models make the simplifying assumption that the depth of the water is shallow compared with the magnitude of the other spatial dimensions, and as a result vertical variations in flow and velocity structure can be ignored. The shallow water equations are commonly used as the governing equations of 2D flood models to determine the interactions between the fluid flow and the environment. LISFLOOD-FP simplifies the problem of simulating 2D flow by decoupling the flows in the x and y directions, treating the simulation of 2D flow as a series of one-dimensional calculations through the cell face boundaries (Shustikova et al., 2019). The governing equations used for flow modelling in this study, comprising the conservation of mass (Eq. (1)) and momentum (Eq. (2)), are:

$$\frac{\partial Q}{\partial x} + \frac{\partial A}{\partial t} = q \quad (1)$$

$$S_0 - \frac{n^2 P^{4/3} Q^2}{A^{10/3}} = 0 \quad (2)$$

where Q is the volumetric flow rate in the channel, A denotes the cross-sectional area of the flow, q is the flow into the channel from other sources (e.g., precipitation, wastewater discharge), S_0 is the bed slope, n denotes Manning's roughness coefficient, P is the wetted perimeter of the channel, and h stands for the flow depth. The numerical model adopted in this study discretises the governing equations over square-type Eulerian grids, discretising the domain into a rectangular grid of H × W cells, and uses an adaptive numerical time-stepping scheme to determine Δt. Fluid flow between two cells is described as a function of the difference in free surface height between the cells over time:

$$\frac{\partial h^{i,j}}{\partial t} = \frac{Q_x^{i-1,j} - Q_x^{i,j} + Q_y^{i,j-1} - Q_y^{i,j}}{\Delta x \, \Delta y} \quad (3)$$

where h^{i,j} is the water free surface height at cell (i, j), Δx and Δy are the cell sizes in the x and y directions, and Q_x^{i,j} and Q_y^{i,j} are the volumetric flow rates between cells in the floodplain. The reader should refer to Bates & De Roo (2000) for a full explanation of the governing equations and their numerical implementation.
Flood simulation with LISFLOOD-FP requires parameterisation with a topographic dataset, boundary conditions describing fluid flows into and out of the model domain, a choice of numerical solver, and other case-specific parameters such as friction coefficients. In this study, assuming D sources of time-varying boundary conditions, LISFLOOD-FP can be considered as a function, f(x^{(t)}), taking inputs x^{(t)} ∈ R^D describing the time-varying boundary conditions at time t. The output generated at each simulation timestep is a matrix of inundation depths Y^{(t)} ∈ R^{H×W}, where Y^{(t)}_{ij} denotes the water level in cell (i, j) at time t. For mathematical convenience, the matrices Y^{(t)} are flattened into vectors y^{(t)} ∈ R^{HW}, as illustrated in Fig. (1). Therefore, a single flood simulation with T timesteps can be described by the dataset D = {(x^{(t)}, y^{(t)})}_{t=1}^{T}, or equivalently by the matrices (X, Y) corresponding to the row-wise aggregation of each x^{(t)} and y^{(t)}.
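To make this data arrangement concrete, the following minimal sketch (in Python, with illustrative shapes and random stand-in arrays rather than the study's actual data pipeline) shows how a single simulation's gridded outputs are flattened and paired with the boundary-condition inputs:

```python
import numpy as np

# Illustrative dimensions only; the case study grid has H*W = 876,204 cells.
T, D = 250, 9        # timesteps and boundary-condition features
H, W = 100, 120      # grid rows and columns

X = np.random.rand(T, D)            # stand-in for time-varying boundary conditions x_t
Y_grids = np.random.rand(T, H, W)   # stand-in for simulated depth fields Y_t

# Flatten each H x W depth matrix into a length-HW row vector (cf. Fig. 1),
# giving the row-wise aggregated dataset (X, Y) used to build the emulator.
Y = Y_grids.reshape(T, H * W)
```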

Gaussian Processes
A Gaussian Process is a stochastic process defined as a collection of random variables, any finite number of which have a joint Gaussian distribution (Rasmussen & Williams, 2006). In other words, a GP can be considered a generalisation of the multivariate Gaussian distribution, retaining many of its convenient mathematical properties. However, instead of being defined over finite-length vectors and parameterised by a mean vector and covariance matrix, a GP is defined over functions and parameterised by mean and covariance functions. A GP is a (possibly infinite-dimensional) distribution that is fully specified by its mean function m(x^{(i)}) and its covariance function k(x^{(i)}, x^{(j)}), referred to as the kernel function. For a real process f(x^{(i)}), such as a numerical model taking boundary conditions x^{(i)} at time t = i as outlined in Section 2.1, the mean and kernel functions can be written as:

$$m\left(\mathbf{x}^{(i)}\right) = \mathbb{E}\left[f\left(\mathbf{x}^{(i)}\right)\right] \quad (4)$$

$$k\left(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}\right) = \operatorname{cov}\left(f\left(\mathbf{x}^{(i)}\right), f\left(\mathbf{x}^{(j)}\right)\right) = \mathbb{E}\left[\left(f\left(\mathbf{x}^{(i)}\right) - m\left(\mathbf{x}^{(i)}\right)\right)\left(f\left(\mathbf{x}^{(j)}\right) - m\left(\mathbf{x}^{(j)}\right)\right)\right] \quad (5)$$

leading to the prior distribution over functions:

$$f(\mathbf{x}) \sim \mathcal{GP}\left(m(\mathbf{x}), k\left(\mathbf{x}, \mathbf{x}'\right)\right) \quad (6)$$

which can be read as: some function f(·), at a given location in the input space x^{(i)}, is distributed as a GP with mean function m(·) and kernel function k(·, ·). A real kernel is a symmetric function which can approximately be considered to provide a metric of 'similarity' between two points x^{(i)} and x^{(j)}. Eq. (6) outlines a prior distribution over target functions and encapsulates the prior beliefs about the space of functions from which the target function f(·) could be drawn. In practice, mean functions and kernel functions are replaced with finite-length mean vectors, μ = 0, and covariance matrices, K, whose entries are K_{ij} = k(x^{(i)}, x^{(j)}). It should be noted that the zero-mean function assumption is common in ML applications and does not limit the mean of the posterior to zero (for further details see Rasmussen & Williams, 2006). The flexibility of GPs comes from the choice of kernel function and its parameterisation, which determines the properties of candidate functions such as smoothness and variability. By combining the prior distribution with the observed dataset, the space of candidate functions is constrained to contain only functions which interpolate (exactly or approximately) the training datapoints, yielding a posterior distribution over the function of interest.

In practice it is more common to have a collection of noisy observations of some target function, y^{(i)} = f(x^{(i)}) + ε, where ε is additive independent and identically distributed Gaussian noise with variance σ_n². The prior on the noisy observations can then be rewritten as:

$$\mathbf{y} \sim \mathcal{N}\left(\mathbf{0}, K + \sigma_n^2 I\right) \quad (7)$$

where X ∈ R^{N×D} is the design matrix consisting of N training instances, which here represent the numerical timesteps in the flood simulations, in which each row corresponds to an input x^{(i)} ∈ R^D, and K is the covariance matrix evaluated on X. Therefore, given a set of observations (X, y), where X is the previously mentioned design matrix and y = {y_i}_{i=1}^N is a vector of noisy training outputs, a joint prior distribution over the observed outputs, y, and the predicted outputs, f_*, at a set of test inputs X_*, can be defined as:

$$\begin{bmatrix} \mathbf{y} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} K(X, X) + \sigma_n^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix}\right) \quad (8)$$

where K(X, X_*) denotes the matrix of kernel evaluations between the training and test inputs, and analogously for the other blocks. By conditioning this prior distribution on the observed data, the posterior distribution of the zero-mean Gaussian process over the test inputs is given as:

$$\mathbf{f}_* \mid X, \mathbf{y}, X_* \sim \mathcal{N}\left(\boldsymbol{\mu}_*, \Sigma_*\right) \quad (11)$$

where

$$\boldsymbol{\mu}_* = K(X_*, X)\left[K(X, X) + \sigma_n^2 I\right]^{-1}\mathbf{y} \quad (12)$$

$$\Sigma_* = K(X_*, X_*) - K(X_*, X)\left[K(X, X) + \sigma_n^2 I\right]^{-1}K(X, X_*) \quad (13)$$

Eq. (11) shows that, within the GP framework, prediction is equivalent to the construction of a full multivariate Gaussian posterior distribution over function values at new, unseen points in the input space. Hence, μ_* can be considered as the mean prediction and Σ_* the variance associated with the mean prediction.
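For readers wishing to trace Eqs. (11)-(13) numerically, the following minimal NumPy sketch implements zero-mean GP prediction; the squared-exponential kernel is used here purely for brevity, and all names and toy data are illustrative rather than the study's implementation:

```python
import numpy as np

def se_kernel(A, B, sigma2=1.0, l=1.0):
    # Squared-exponential kernel: k(x, x') = sigma^2 exp(-||x - x'||^2 / (2 l^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma2 * np.exp(-0.5 * d2 / l ** 2)

def gp_posterior(X, y, X_star, sigma2_n=1e-2):
    K = se_kernel(X, X) + sigma2_n * np.eye(len(X))        # K(X, X) + sigma_n^2 I
    K_s = se_kernel(X, X_star)                             # K(X, X*)
    K_ss = se_kernel(X_star, X_star)                       # K(X*, X*)
    mu_star = K_s.T @ np.linalg.solve(K, y)                # posterior mean, Eq. (12)
    Sigma_star = K_ss - K_s.T @ np.linalg.solve(K, K_s)    # posterior covariance, Eq. (13)
    return mu_star, Sigma_star

# Usage on toy data (50 training timesteps with 9 input features, 5 test points)
X_tr = np.random.rand(50, 9); y_tr = np.random.rand(50); X_te = np.random.rand(5, 9)
mu, Sigma = gp_posterior(X_tr, y_tr, X_te)
```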

Learning the hyperparameters
In the GP modelling framework, assuming a zero mean, training the model means choosing an appropriate kernel function and using the training data to optimise the kernel's hyperparameters. The default kernel used in GP regression models is the Squared Exponential (SE) kernel:

$$k\left(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}\right) = \sigma^2 \exp\left(-\frac{\left\|\mathbf{x}^{(i)} - \mathbf{x}^{(j)}\right\|^2}{2l^2}\right) \quad (14)$$

where σ² is the output variance, determining the average distance of the function from the mean, and l is the lengthscale, which determines how quickly the function varies and how the covariances decay with the distance between points. The SE kernel generalises well to many applications, yet it makes assumptions about smoothness which are often unrealistic. In this study, a zero-mean function and the Matern 3/2 kernel function are used for the development of the GP model. The Matern 3/2 kernel is a stationary kernel, operating only on the Euclidean distance between points, with origins in spatial statistics, and is commonly used for modelling physical processes (Rasmussen & Williams, 2006). Furthermore, preliminary testing showed the Matern 3/2 (Eq. (15)) to outperform the SE kernel for the flood modelling context addressed in this study:

$$k\left(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}\right) = \sigma^2 \left(1 + \frac{\sqrt{3}\left\|\mathbf{x}^{(i)} - \mathbf{x}^{(j)}\right\|}{l}\right) \exp\left(-\frac{\sqrt{3}\left\|\mathbf{x}^{(i)} - \mathbf{x}^{(j)}\right\|}{l}\right) + \sigma_n^2 \delta_{ij} \quad (15)$$
where δ_ij = 1 if i = j and δ_ij = 0 otherwise. The kernel's hyperparameters, denoted by θ = (σ², l, σ_n²), where σ² is the variance parameter, l is the lengthscale, and σ_n² is the Gaussian noise parameter, are determined through maximum likelihood estimation.
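As a sketch, the Matern 3/2 kernel of Eq. (15) can be implemented as follows; the noise term σ_n²δ_ij is added to the diagonal of the training covariance separately, as in the posterior sketch above:

```python
import numpy as np

def matern32(A, B, sigma2=1.0, l=1.0):
    # k(x, x') = sigma^2 (1 + sqrt(3) r / l) exp(-sqrt(3) r / l), with r = ||x - x'||
    r = np.sqrt(np.maximum(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1), 0.0))
    s = np.sqrt(3.0) * r / l
    return sigma2 * (1.0 + s) * np.exp(-s)
```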
To estimate the kernel hyperparameters, the log marginal likelihood is maximised with respect to the hyperparameters. Given that the prior distribution over the observations can be expressed as y ∼ N(0, K + σ_n² I), the log marginal likelihood, explicitly conditioned on the kernel hyperparameters, can be written as:

$$\log p\left(\mathbf{y} \mid X, \boldsymbol{\theta}\right) = -\frac{1}{2}\mathbf{y}^{\top} K_y^{-1} \mathbf{y} - \frac{1}{2}\log\left|K_y\right| - \frac{N}{2}\log 2\pi \quad (16)$$

By taking partial derivatives with respect to each hyperparameter, the marginal likelihood can be maximised using:

$$\frac{\partial}{\partial \theta_j} \log p\left(\mathbf{y} \mid X, \boldsymbol{\theta}\right) = \frac{1}{2}\mathbf{y}^{\top} K_y^{-1} \frac{\partial K_y}{\partial \theta_j} K_y^{-1} \mathbf{y} - \frac{1}{2}\operatorname{tr}\left(K_y^{-1} \frac{\partial K_y}{\partial \theta_j}\right) \quad (17)$$

where K_y = K + σ_n² I and tr is the trace operator. Performing this operation has O(N³) computational complexity arising from the inversion of the K_y matrix.
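A minimal sketch of this training step is shown below, reusing the matern32 function from the previous sketch; the hyperparameters are optimised in log space to keep them positive, and numerical gradients are used instead of Eq. (17) for brevity. The toy data and names are illustrative, not the study's implementation:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_theta, X, y):
    sigma2, l, sigma2_n = np.exp(log_theta)     # positivity via log-parameterisation
    K_y = matern32(X, X, sigma2, l) + sigma2_n * np.eye(len(X))
    L = np.linalg.cholesky(K_y)                 # the O(N^3) step noted in the text
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K_y^{-1} y
    # Negative of Eq. (16); log|K_y| = 2 * sum(log(diag(L)))
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * len(y) * np.log(2 * np.pi)

X_tr = np.random.rand(50, 9); y_tr = np.random.rand(50)  # toy data
res = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), args=(X_tr, y_tr))
sigma2_hat, l_hat, sigma2_n_hat = np.exp(res.x)          # fitted theta
```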

Dimensionality reduction
With the LISFLOOD-FP model, the outputs describe inundation across the study area at each timestep. When constructing a statistical emulator, a collection of K design floods, with inputs X_k ∈ R^{T_k×D} and outputs Y_k ∈ R^{T_k×HW}, where H is the number of cells in the model in the x-direction and W in the y-direction, is used to train and validate the emulator. The input-output data matrices from each design flood are aggregated into new input-output matrices, X ∈ R^{N×D} and Y ∈ R^{N×HW}, with a new number of aggregated timesteps N = T_1 + … + T_K. For numerical simulators operating at high spatiotemporal resolutions, this can result in very large values of HW and N.
To overcome the computational issues arising from modelling a very high-dimensional output with GPs, a dimensionality reduction (DR) scheme is first developed and applied to the outputs of the LISFLOOD-FP simulations, making the construction of a GP-based statistical emulator more feasible. Principal Component Analysis (PCA) decomposition is implemented using a randomised implementation of Singular Value Decomposition (SVD) (Halko et al., 2009; Feng et al., 2018). This transformation produces a lower-dimensional dataset of latent features, Z ∈ R^{N×D*}, to replace the original dataset Y ∈ R^{N×HW}. The output features in the original dataset correspond to the water depths in cells of the output domain, and so significant linear spatial correlations can be expected in the data, a structure which PCA can effectively represent in a lower-dimensional space (Cheng et al., 2022). Furthermore, PCA relies on linear transformations, allowing it to scale easily to very high-dimensional datasets, and has a computationally efficient decoding process to reconstruct the original observations from the latent features.
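A sketch of this step, using scikit-learn's randomised SVD solver (Halko et al., 2009) on a stand-in output matrix far smaller than the study's, is given below; the component count and shapes are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the aggregated output matrix Y (the real one is far larger).
Y = np.random.rand(500, 10_000).astype(np.float32)

pca = PCA(n_components=6, svd_solver="randomized", random_state=0)
Z = pca.fit_transform(Y)              # latent features Z, here of shape (500, 6)
Y_recon = pca.inverse_transform(Z)    # linear decoding back to the cell space
```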

PCA-GP development
For standard GP regression problems, only functions of the form f : R^D → R are considered, such that when performing GP regression a scalar output, y^{(i)}, is regressed onto a vector-valued input x^{(i)}. This type of GP regression scales cubically with the number of training instances, O(N³); that is, the runtime grows with the cube of the training set size. However, after the construction of the latent feature dataset, Z, there are D* outputs to be modelled, and so construction of the emulator is equivalent to a vector-valued regression problem in which the mapping f : R^D → R^{D*} is learned. Approaches for vector-valued GP regression have been explored, and methods such as the linear model of coregionalization (LMC) have been proposed (Alvarez et al., 2012). However, exact implementation of these methods would result in a computational complexity of O((D*·N)³).
Using the reduced dataset, Z, a collection of independent single-output GPs (SOGPs) is utilised to model each latent feature independently, given that applying PCA constructs a collection of orthogonal latent features, removing linear correlations between output features. The exact implementation of this methodology scales with O(D*·N³) computational complexity; hence, the SOGP approach is at least (D*)² times faster than the LMC method. Fig. (2) illustrates the computational complexity of alternative approaches to vector-valued GP regression for a varying number of features, D*, using a baseline of N = 1000 training instances. This figure implicitly highlights how infeasible a naive implementation on the original dataset, for which D* ≈ 800,000, would be. Fig. (3) shows a graphical representation of the GP model with D input features and D* independent output features, mapping each input vector to a scalar via a GP f_j for j = 1, …, D*. For each GP in this model, a zero-mean function and the Matern 3/2 kernel function are used, with the hyperparameters determined as outlined in Section 2.2.
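A sketch of the SOGP collection using scikit-learn, whose Matern kernel with ν = 1.5 corresponds to the Matern 3/2 case, is given below; the study's actual implementation may differ, and all names are illustrative:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

def fit_sogp_collection(X, Z):
    # One independent GP with a Matern 3/2 kernel per latent feature.
    models = []
    for j in range(Z.shape[1]):
        kernel = Matern(nu=1.5) + WhiteKernel()   # Matern 3/2 plus noise term
        gp = GaussianProcessRegressor(kernel=kernel)
        gp.fit(X, Z[:, j])                        # O(N^3) cost per feature
        models.append(gp)
    return models

def predict_latent(models, X_star):
    # Stack per-feature means and standard deviations into (N*, D*) arrays
    means, stds = zip(*(m.predict(X_star, return_std=True) for m in models))
    return np.column_stack(means), np.column_stack(stds)
```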

Validation metrics
When building statistical emulators, the primary concerns are predictive accuracy and the time taken to generate new predictions. In this study, modified versions of the root mean squared log-error (RMSLE) and root mean squared error (RMSE), defined by Eqs. (18) and (19) respectively, are used for assessing regression performance:

$$\text{RMSLE} = \sqrt{\frac{1}{T\left|\Omega\right|}\sum_{t=1}^{T}\sum_{i \in \Omega}\left(\log\left(y_i^{(t)} + 1\right) - \log\left(\hat{y}_i^{(t)} + 1\right)\right)^2} \quad (18)$$

$$\text{RMSE} = \sqrt{\frac{1}{T\left|\Omega\right|}\sum_{t=1}^{T}\sum_{i \in \Omega}\left(y_i^{(t)} - \hat{y}_i^{(t)}\right)^2} \quad (19)$$

where, for each simulation, Ω is the set of cell indices in which either LISFLOOD-FP or the GP emulator predicts an inundation value greater than 0 at any timestep, y_i^{(t)} is the simulated depth and ŷ_i^{(t)} the emulated depth in cell i at timestep t. Restricting the average to Ω provides a more realistic error metric, because most cells never see any inundation, which would otherwise deflate the average error. The log-error is used alongside the ordinary RMSE as it gives a more realistic weighting to the errors: the relative error of a prediction matters more than its absolute value. For instance, in a deep section of the river, the emulator's prediction error may be large in an absolute sense but small when scaled against the true value, whereas in the floodplain, where inundation is lower, a small absolute error could represent a significant difference in the interpretation of the results.

The model's ability to correctly classify wet/dry cells is also assessed. To convert regression predictions into a classification task, an inundation threshold c is applied as a binary classifier to determine whether a cell is wet or dry:

$$\text{class}\left(\hat{y}_i\right) = \begin{cases} \text{wet}, & \hat{y}_i \geq c \\ \text{dry}, & \hat{y}_i < c \end{cases} \quad (20)$$

Different values of c were trialled, including 0.05m, 0.1m and 0.3m. A threshold of c = 0.3m is recommended following the flood risk assessment conducted by Aldrige et al. (2016), in which it was concluded that properties intersecting a flood depth of 0.3m would be determined as flooded. Having established a binary classification scheme, classification performance is assessed using the true positive rate (TPR), false positive rate (FPR), and F1 score, defined by Eqs. (21)-(23), respectively:

$$\text{TPR} = \frac{TP}{TP + FN} \quad (21)$$

$$\text{FPR} = \frac{FP}{FP + TN} \quad (22)$$

$$F_1 = \frac{2\,TP}{2\,TP + FP + FN} \quad (23)$$

Fig. 2. Computational complexity analysis of the LMC model (Alvarez et al., 2012) and the SOGP model developed in this study, in terms of the number of output features and training points. The univariate GP's complexity is also highlighted. A baseline of N = 1000 training instances is used for the calculations.
where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively.
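For concreteness, the following sketch implements the metrics as described above, assuming depth arrays of shape (timesteps, cells); the exact forms used in the original implementation of Eqs. (18)-(23) may differ slightly:

```python
import numpy as np

def omega_mask(y_true, y_pred):
    # Omega: cells where either the simulator or the emulator is ever wet (> 0)
    return (y_true > 0).any(axis=0) | (y_pred > 0).any(axis=0)

def rmse(y_true, y_pred):
    m = omega_mask(y_true, y_pred)
    return np.sqrt(np.mean((y_true[:, m] - y_pred[:, m]) ** 2))

def rmsle(y_true, y_pred):
    m = omega_mask(y_true, y_pred)
    a = np.log1p(np.clip(y_true[:, m], 0.0, None))   # log(1 + depth), depths >= 0
    b = np.log1p(np.clip(y_pred[:, m], 0.0, None))
    return np.sqrt(np.mean((a - b) ** 2))

def classification_scores(y_true, y_pred, c=0.3):
    wet_t, wet_p = y_true >= c, y_pred >= c          # wet/dry threshold, Eq. (20)
    TP = np.sum(wet_t & wet_p); TN = np.sum(~wet_t & ~wet_p)
    FP = np.sum(~wet_t & wet_p); FN = np.sum(wet_t & ~wet_p)
    return TP / (TP + FN), FP / (FP + TN), 2 * TP / (2 * TP + FP + FN)
```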

Site description and modelling domain
The GP modelling framework developed in this study is adopted for a case study of urban flood modelling in Tadcaster, UK. Tadcaster is a town situated between York and Leeds in the northeast of England. The town is prone to flooding from the river Wharfe, which runs through its centre. In December 2015 (Fig. (4)), in the aftermath of Storm Frank, the town faced its worst flood in recorded history, during which many businesses and homes were evacuated and the town's critical infrastructure (e.g., bridges and roads) suffered major damage from the force of the flood waters. As a result of the extensive flood damage from the 2015 Storm Frank, a £10m flood defence scheme was proposed to enhance the flood resilience of the town against extreme climatic events. However, the project's execution and completion have been delayed until 2026 due to 'modelling inaccuracies' (Laycock, 2021). Hence, this study examines the performance of the proposed GP-based emulator for flood scenario modelling in Tadcaster, with the aim of producing instantaneous inundation predictions and allowing quantification of the uncertainty surrounding the model predictions. Fig. (5) illustrates the 3.6 km² study domain considered for the case study in this paper.

Topographic data
LISFLOOD-FP was parameterised with a Digital Elevation Model (DEM) describing the topography of the modelling domain. The DEM was constructed using Digital Terrain Models (DTMs) to capture the elevation of the underlying surface, with key geographical features and urban structures then reinserted to ensure accurate and realistic flow routing in an urban environment. Flood modelling in urban environments requires high-resolution spatial data, and therefore this study constructed the DEM at a 2m resolution, resulting in a two-dimensional Eulerian mesh of 876,204 cells. Fig. (6) compares elevation (in metres above sea level) for the original DTM (Fig. (6a)) and the post-processed DEM used in the flood model (Fig. (6b)).

Boundary conditions
Time-varying point-source hydrographs at upstream points on the north side of the domain were used as the boundary conditions for each simulation scenario, alongside a free boundary at the southern downstream boundary. The hydrographs were determined using discharge data collected from a river monitoring station within the domain. From the empirical discharge data, a collection of candidate flood hydrographs was extracted as the basis for synthetic hydrographs. Then, by sampling peak-discharge values from a Pareto distribution fitted using historical peaks-over-threshold data, the hydrographs were rescaled to match the sampled peak-discharge values. This study focuses on modelling extreme climatic events by oversampling from the upper tail of the peak-discharge distribution to investigate scenarios with more severe flooding. In total, 14 synthetic flood scenarios, described in Fig. (7), are simulated to investigate the extent of inundation and flood water depth across the domain. For each hydrograph, a discharge value is recorded for the river Wharfe at 15-minute intervals and input to the numerical flood model.
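A minimal sketch of this hydrograph-generation step is given below; the distribution family (scipy's genpareto), the stand-in peaks-over-threshold series, and the candidate hydrograph shape are all illustrative assumptions rather than the study's fitted values:

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(0)
pot = rng.gamma(2.0, 50.0, size=200)      # stand-in peaks-over-threshold data (m^3/s)

shape, loc, scale = genpareto.fit(pot)    # fit a Pareto-type distribution to POT data
peak = genpareto.rvs(shape, loc=loc, scale=scale, random_state=1)  # sample a peak

candidate = np.sin(np.linspace(0.0, np.pi, 96)) ** 2  # stand-in candidate hydrograph
synthetic = candidate / candidate.max() * peak        # rescale to the sampled peak
```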

LISFLOOD-FP inundation data
Following implementation of the topographic features of the case study location and the boundary conditions, LISFLOOD-FP was used to simulate the flood for each synthetic hydrograph. A spatially varying Manning's friction coefficient was applied, with values determined based on land usage (Papaioannou et al., 2018). Fig. (8) presents the land cover map with the respective Manning's coefficient values.
To minimise the potential uncertainty associated with the simulation results, an inundation threshold of 0.03m was applied, whereby any inundation less than 0.03m was set to 0. Following the completion of the simulations, the water depth inundation data were aggregated across simulations. The discharge value at time t = i, along with the values at t = i − 1, i − 2, …, i − 8, is used as the training input for the GP emulator. These input features were then scaled to have zero mean and unit variance, resulting in aggregated inputs X ∈ R^{2624×9} and outputs Y ∈ R^{2624×876204} for the construction of the emulator, i.e. 2624 aggregated timesteps across all 14 simulations and 876,204 features (cells) per timestep.
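A sketch of this input construction, with an illustrative stand-in discharge series, is:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def lagged_inputs(q, n_lags=8):
    # Row i holds [q_i, q_{i-1}, ..., q_{i-n_lags}] for each valid timestep i
    return np.column_stack([q[n_lags - k: len(q) - k] for k in range(n_lags + 1)])

q = np.random.rand(200)                                # stand-in 15-minute discharge series
X = StandardScaler().fit_transform(lagged_inputs(q))   # zero-mean, unit-variance features
```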

Latent feature dataset
The dimensionality reduction approach outlined in Section 2.2 was then applied to Y to produce a dataset of latent variables Z ∈ R^{2624×D*}. D* := 6 was chosen after experimentation found that the first six principal components were sufficient to explain 99% of the variance within the dataset. To assess the fit of this PCA decomposition, the RMSE between the original dataset and the reconstruction from the latent features back to the original feature space was assessed:

$$\text{RMSE}_{\text{recon}} = \sqrt{\frac{1}{N \cdot HW}\sum_{t=1}^{N}\sum_{i=1}^{HW}\left(Y_{ti} - \widehat{Y}_{ti}\right)^2} \quad (24)$$

where Ŷ denotes the reconstruction of Y from Z. The RMSE obtained was 0.018m, meaning the reconstruction deviates from the original by 1.8cm on average.
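As a sketch, the choice of D* can be automated from the cumulative explained variance ratio; the stand-in data below will not reproduce the six components found for the study's dataset:

```python
import numpy as np
from sklearn.decomposition import PCA

# Y as in the earlier PCA sketch; fit more components than needed, then choose
# the smallest D* whose cumulative explained variance reaches 99%.
Y = np.random.rand(500, 10_000).astype(np.float32)
pca = PCA(n_components=50, svd_solver="randomized", random_state=0).fit(Y)
cum_var = np.cumsum(pca.explained_variance_ratio_)
D_star = int(np.searchsorted(cum_var, 0.99) + 1)
```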
Performing 10-fold cross-validation on the PCA transformation found the result obtained from Eq. (24) to be highly consistent across folds, showing that the methodology generalises well to out-of-sample test instances. Fig. (9) illustrates the new dataset, Z, plotting the principal component values aggregated across all the hydrographs; the delineation between each of the 14 hydrographs is highlighted by the blue dashed lines. The unit of measurement on the y-axis can be disregarded, as regression performance in the latent feature space is not interpreted directly. Fig. (10) provides a visual representation of the complete methodology employed in this study to build and train the GP emulator for inland flood modelling, highlighting the sequential modelling process and data sources. Fig. (11) shows how the constructed emulator can then be employed to make predictions on unseen data as a replacement for the LISFLOOD-FP flood simulation model.

Cross validation results
To validate the GP model and assess out-of-sample performance, leave-one-out cross-validation (LOOCV) was implemented, whereby the GP-based emulator is trained on 13 synthetic hydrographs (Fig. (7)) and predictive performance is assessed on the remaining hydrograph, repeating this process for all 14 cases. Fig. (12) illustrates the model's regression performance, showing the test RMSLE and RMSE values for each held-out hydrograph. The mean RMSLE and RMSE values are 0.11 and 0.21, respectively. As expected, the RMSLE values are lower than the RMSE values and are theoretically expected to be a better estimate of the model's error. Reasonable variance among the cross-validation results is observed, with hydrograph 8 being somewhat of an outlier; overall, however, performance is consistent across the test cases.
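A sketch of this leave-one-hydrograph-out procedure is given below; fit_fn and score_fn stand in for the PCA + SOGP training pipeline and the metrics of the validation section, and `groups` labels each aggregated timestep with its source hydrograph (all names are illustrative):

```python
import numpy as np

def leave_one_hydrograph_out(X, Y, groups, fit_fn, score_fn):
    # groups[i] identifies the hydrograph that timestep i belongs to (1..14 here)
    scores = {}
    for g in np.unique(groups):
        train, test = groups != g, groups == g
        model = fit_fn(X[train], Y[train])   # e.g. fit PCA + SOGP on 13 hydrographs
        Y_hat = model.predict(X[test])       # decoded depth predictions for held-out event
        scores[g] = score_fn(Y[test], Y_hat)
    return scores
```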
While good regression performance is important, the ability to correctly identify areas as wet/dry is essential for flood risk management and for producing interpretable forecasts. As such, classification was also considered, utilising the scheme outlined in Eq. (20). Fig. (13) shows the cross-validation classification performance, differentiating the model performance by the threshold value c used. Table (1) provides a full breakdown of the mean scores (averaged across cross-validation folds) for each metric at each threshold. The scores remain relatively consistent across the threshold values, with no value being clearly favourable, although c = 0.05 appears marginally better than the others. Overall, the GP model is clearly effective at predicting whether cells will be wet or dry.
Figs. (14)-(15) illustrate regression performance in the latent feature space with 95% predictive intervals highlighted. A common feature of these plots is the ability of each GP to accurately predict the value of the first three principal components in each test case. The performance on the following three components is more varied, and the model expresses greater uncertainty surrounding the predictions for these features, as illustrated by the significantly wider confidence intervals. However, the earlier principal components capture most of the variance within the dataset, and so accurate prediction of these is significantly more important than accurate prediction of the latter components.
Figs. (16)-(17) illustrate flood modelling results for hydrographs 7 and 9, respectively, during the cross-validation process (the data from these simulations are held out for testing while the model trains on the other 13 hydrographs), in the original feature space after projecting the latent-space predictions back to the high-dimensional space. The maximum water depth in each cell across the entire simulation is plotted, and a water depth threshold of 0.3m is then applied: if the maximum inundation across the flood event is greater than 0.3m, the cell is highlighted, and left blank otherwise. The difference between the average water depth produced by the LISFLOOD-FP simulation and the predictions of the GP emulator is also illustrated, allowing exploration of the average prediction error on a cell-by-cell basis. Similar results are observed in both scenarios, with the GP model's error being approximately 0 for very large sections of the inundated areas and smaller areas of higher error observed. In both cases, the areas of higher error correspond to areas of more complex urban topography in closest proximity to the river reaches. To improve future model performance in similar scenarios, the training data could be chosen such that a wide range of inundation scenarios in these areas is observed.

Benchmarking
To robustly assess the performance of the proposed GP model against alternative methods of statistical emulation, the CNN model (Fig. (20)) outlined in Kabir et al. (2020) was reconstructed for this study, enabling a direct comparison of predictive performance between the two methodologies.
Applying the same approach to cross-validation, training 14 different CNNs with one hydrograph held out as validation data each time, a direct comparison of GP and CNN model performance can be made. Table (2) shows the cross-validation regression performance, and Table (3) describes the classification performance of each model.
Although the performance of both models is quite similar, the results show that the GP model outperforms the CNN with respect to both regression and classification. These results confirm the findings of previous studies regarding the ability of deep learning methodologies to tackle such complex predictive tasks. However, this study demonstrates the capability of the proposed GP methodology to outperform alternative emulators, which cannot directly quantify the uncertainty associated with their predictions. The proposed GP model is also far more robust for scenario modelling than alternative emulators whose computational costs are significant (see Section 4.3).

Model Runtimes
Table (4) outlines the training time, runtime, and model size (i.e., number of cells in the modelling domain) for the original LISFLOOD-FP simulator, the GP-based emulator developed in this study, and the CNN emulator reconstructed for this study. The runtime is calculated as the time taken to generate new predictions for a simulation with 250 15-minute timesteps. The results show a significant reduction in runtime when using the GP emulator proposed in this study over the CNN. However, for both methods of statistical emulation, very significant reductions in runtime are observed compared with the original numerical model. The GP emulator is shown to be 80 times faster than the CNN model for the considered case study.
Constructing emulators has a considerable development cost, given that training data are generated from the original numerical model, which can be costly. Although this computational cost is common to all surrogate models, the GP model can be efficiently trained with a limited number of training datapoints, which makes it a very cost-efficient emulator. Furthermore, modelling decisions need to be made during the development of surrogate models (e.g., network structure for a CNN, or kernel choice for a GP). The training times in Table (4) represent only the time taken to estimate model hyperparameters; the overall development time associated with emulator construction is higher due to data collection, model training, and validation.

Application of GP Emulator for flood predictions
As an example of replacing the simulator with the efficient GP-based emulator, a flood was simulated using a synthetic hydrograph with an estimated 1000-year return period. Using the previously constructed peak-discharge distribution (see Section 3.3), the estimated peak discharge was 750 m³/s; for reference, the estimated peak discharge in the 2015 flood was 480 m³/s. A candidate synthetic hydrograph was scaled to match this estimated peak-discharge value, as shown in Fig. (21). Fig. (22) shows the GP's prediction for the 1000-year return period event in the latent feature space. Comparing the predictions made in this instance with those seen previously in Figs. (14) and (15), far greater uncertainty is associated with these predictions, as illustrated by the much wider confidence intervals, even among the first three principal components where accurate prediction was observed previously. The uncertainty observed in the latent feature space for this event can be attributed to the fact that the emulator did not observe a training instance similar to the 1000-year return period hydrograph. Fig. (23a) illustrates the inundation resulting from the 1000-year flood event, with Fig. (23b) discretising this into sections of different inundation depths. The grey boundary illustrates the estimated area within the modelling domain that has a 0.1% or greater risk of flooding each year (as provided by the UK's Environment Agency). Overlaid in blue is the estimated inundation extent predicted by the GP model tested on the hydrograph described in Fig. (21). Different representations of the underlying processes or initial conditions of the model will propagate differences through the simulation and can result in varied outputs. For instance, the GP emulator models a flash-flood type scenario in which no previous inundation is present, whereas the Environment Agency likely models on a longer timeframe and accounts for additional variables such as previous inundation and soil moisture content. However, there are also clear similarities and consistencies between these predictions, suggesting that the proposed GP emulator's predictions (and the estimation of a 1000-year flood) are in line with those of the Environment Agency.
The differences in maximum inundation between GP predictions at the upper and lower bounds of the predicted 95% confidence intervals are analysed in Fig. (24). These intervals are calculated using the variance values predicted by the GP emulator in the latent feature space. Since GPs define a Gaussian distribution over function values at the test inputs, each prediction comprises a mean and a corresponding variance estimate. For each test instance x_*^{(i)}, the upper and lower 95% confidence interval bounds are defined as:

$$\mu_*^{(i)} \pm 1.96\sqrt{\Sigma_*^{(i,i)}}$$

Fig. (24) shows that most of the difference between the inundation estimates is limited to less than 0.5m. However, in some sections, such as the river channel, inundation differences of greater than 1m between the predictive interval bounds can be observed. These values illustrate the predictive uncertainty of the GP-based emulator.
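A sketch of this interval construction, reusing the hypothetical predict_latent, models, and pca objects from the earlier sketches, is:

```python
import numpy as np

# X_star holds the test-event inputs (e.g. the 1000-year hydrograph features).
mu, sd = predict_latent(models, X_star)           # latent means and standard deviations
upper = pca.inverse_transform(mu + 1.96 * sd)     # decode the 97.5% bound to cell space
lower = pca.inverse_transform(mu - 1.96 * sd)     # decode the 2.5% bound
band = upper.max(axis=0) - lower.max(axis=0)      # per-cell spread in maximum inundation
```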

Conclusions
A multi-output Gaussian Process-based statistical emulator was proposed for the emulation of high-dimensional outputs from a numerical flood simulator. The standard univariate GP regression methodology is extended to the multi-output case by producing a collection of univariate GPs to model each output feature independently. The GP model was developed to approximate the results of a coupled hydraulic flood model (i.e., LISFLOOD-FP), and significantly reduce the computational costs associated with numerical flood modelling at high spatiotemporal resolutions. To overcome the methodological issues surrounding the use of very high-dimensional data in GP modelling, a computationally efficient dimensionality reduction scheme was utilised.
The appropriateness and robustness of the proposed model is examined for a case study of a flood prone region in the UK. Topographical and hydrological data is collected and preprocessed to produce a realistic, high-resolution Digital Elevation Model and a collection of synthetic hydrographs to parameterise and run the LISFLOOD-FP simulations. Inundation data, in the form of water levels across the modelling domain over time, is generated from the numerical simulator which is then used to build and train the GP-based emulator.
The GP emulator was assessed against a CNN emulator by performing cross-validation to ensure no over-fitting occurs in either model and to assess each methodology's ability to generalise to out-of-sample tests. During these tests, the performance of both the dimensionality reduction scheme and the GP-based emulator was found to be robust. The GP method outperformed the CNN emulator with respect to both regression and classification. Furthermore, the runtime of the GP emulator was very significantly reduced compared with both the original numerical model and the CNN emulator.
Demonstrating the GP's high predictive accuracy, fast runtimes, and readily available quantification of predictive uncertainty, this study establishes the robustness and appropriateness of the proposed GP emulator as a more efficient method of probabilistically emulating complex physics-based models than alternative approaches such as the CNN-based emulator. However, a statistical emulator is very sensitive to its experimental design. Choosing a representative sample of training data remains vital to building an emulator that can generalise well to out-of-sample tests. Therefore, the choice of training data should be carefully considered, and the training dataset should be as large (comprehensive) as possible while accounting for the cost of collecting new training data. It should be noted that each emulator is case-specific and so cannot be transferred to alternative scenarios in which different geographical areas are considered. Modelling a new scenario involves considerable further development time for the original simulator, as well as the collection and processing of input data and the simulation time itself. However, the GP emulator methodology proposed here can be readily translated to any scenario in which high-resolution spatiotemporal data are generated from a highly complex model or process (i.e., not just 2D inundation modelling). Furthermore, there are likely to be constraints on the spatiotemporal extent that emulators can effectively reproduce, especially given the computationally expensive training cost of a GP methodology. Nevertheless, this study demonstrated the model's high degree of accuracy for a very high spatial-resolution model with over 800,000 discrete cells. The developmental cost of building emulators can be very high, and consideration should therefore be given to whether an emulator is necessary in the first place. However, for cases where the underlying model's computational cost is prohibitive to real-time forecasting, or for tasks requiring large numbers of simulations (e.g., scenario modelling), a GP-based emulator is an efficient modelling alternative, striking a suitable balance between accuracy and complexity; this study shows GPs to outperform competing methods of emulation.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data Availability
Data will be made available on request.

Supplementary materials
Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.watres.2022.119100.