Assessing the spatial spread–skill of ensemble flood maps with remote-sensing observations

. An ensemble of forecast ﬂood inundation maps has the potential to represent the uncertainty in the ﬂood forecast and provide a location-speciﬁc likelihood of ﬂood-ing. Ensemble ﬂood map forecasts provide probabilistic information to ﬂood forecasters, ﬂood risk managers and insurers and will ultimately beneﬁt people living in ﬂood-prone areas. Spatial veriﬁcation of the ensemble ﬂood map forecast against remotely observed ﬂooding is important to understand both the skill of the ensemble forecast and the uncertainty represented in the variation or spread of the individual ensemble-member ﬂood maps. In atmospheric sciences, a scale-selective approach has been used to evaluate a convective precipitation ensemble forecast. This determines a skilful scale (agreement scale) of ensemble performance by locally computing a skill metric across a range of length scales. By extending this approach through a new application, we evaluate the spatial predictability and the spatial spread–skill of an ensemble ﬂood forecast across a domain of interest. The spatial spread–skill method


Introduction
Forecast flood maps indicating the extent and depth of fluvial flooding within an actionable lead time are a useful tool for flood risk managers and emergency response teams prior to and during a flood event.Typically, forecast flood maps are presented as deterministic forecasts showing precisely where flooding will occur.This can lead to incidents of false alarms or missed warnings and subsequent recriminations, causing mistrust in the system (Arnal et al., 2020;Savage et al., 2016).A timely prediction of exactly where and when fluvial flooding caused by intense or prolonged rainfall will occur is virtually impossible due to the chaotic nature of the atmosphere (Lorenz, 1969).The ensemble forecasting approach aims to address the sensitivity of the atmospheric dynamics to initial conditions, and through multiple model runs these initial condition uncertainties can be quantified (Leutbecher and Palmer, 2008).The ensemble forecast results in a probabilistic weather forecast that indicates the predictability of Published by Copernicus Publications on behalf of the European Geosciences Union.
the atmosphere at a given space and time.State-of-the-art operational ensemble flood forecasting systems link together a chain of forecast models to produce probabilistic streamflow and flood inundation forecasts at national and global scales (Cloke and Pappenberger, 2009;Emerton et al., 2016;Wu et al., 2020).Ensemble numerical weather prediction models provide meteorological inputs into land surface, hydrological and hydraulic models, cascading the atmospheric uncertainty through to the flood forecast.Throughout this chain of models, multiple sources of uncertainty exist that have been investigated in numerous studies (Beven, 2016;Matthews et al., 2022;Pappenberger et al., 2005;Zappa et al., 2011).As discussed by Boelee et al. (2019), these uncertainties include those arising from meteorological inputs, measurements and observations, initial conditions, unresolved physics within the models and parameter estimates.A probabilistic flood inundation forecast should present a meaningful prediction of the likelihood of flooding so that there is confidence in the forecast, given the uncertainties represented in the system (Alfonso et al., 2016).
The accuracy of the location of flooding, predicted in advance, is defined as spatial predictability.The spatial predictability of ensemble forecasts of flood inundation could be verified by comparisons with a remote observation of the flood from sensors based on satellites or uncrewed aerial vehicles (UAVs).Satellite-based optical and synthetic aperture radar (SAR) sensors are well-known for their flooddetection capability (e.g.Horritt et al., 2001;Mason et al., 2012).SAR sensors are active, which enables them to scan the Earth through weather and clouds and at night.The SAR backscatter intensity detected depends on the roughness of the surface, with unobstructed flooded areas and other surface water bodies appearing relatively smooth and returning low backscatter values.Dasgupta et al. (2018a) detail some of the challenges along with approaches to solutions of flood detection using SAR.Examples of these challenges include the following: roughening of the water surface by heavy rain and strong wind, emergent or partially submerged vegetation, and flood detection in urban areas.Accurate flood detection in urban areas, particularly due to surface water flooding, has become increasingly important (Speight and Krupska, 2021), and recent techniques have led to improved flood detection (Mason et al., 2018(Mason et al., , 2021a, b), b).Optical instruments rely on solar energy and cannot penetrate clouds, making them less useful during a flooding situation.Recent studies have investigated the flood-detection benefits from combining both optical and SAR imagery (Konapala et al., 2021;Tavus et al., 2020).Improvements in the spatial-temporal resolution of SAR images and their open-source availability mean that they are an increasingly valuable tool for hydraulic and hydrodynamic model improvements through calibration, validation and data assimilation (e.g.García-Pintado et al., 2015;Grimaldi et al., 2016;Cooper et al., 2018Cooper et al., , 2019;;Di Mauro et al., 2021;Dasgupta et al., 2018bDasgupta et al., , 2021a, b), b).The Global Flood Monitoring (GFM) product (EU Science Hub, 2021;GFM, 2021;Hostache, R., 2021) of the Copernicus Emergency Management Service (CEMS) (Copernicus Programme, 2021) produces SAR-derived flood inundation maps for every Sentinel-1 image that detects flooding.Three flood-detection algorithms provide uncertainty estimation and population-affected estimates within 8 h of the image acquisition.The European Space Agency (ESA) Copernicus Programme has recently included the ICEYE constellation of small satellites into the fleet of missions contributing to Europe's Copernicus environmental monitoring programme (ESA, 2021).ICEYE captures very-high-resolution (spot mode ground range resolution of 1 m) SAR images, which brings the potential for increased accuracy of flood detection, particularly in urban areas.
To evaluate the accuracy of an ensemble forecast, a number of verification measures have been proposed.Anderson et al. (2019) developed a joint verification framework for end-to-end assessment of the England and Wales Flood Forecasting Centre (FFC) ensemble flood forecasting system.Anderson et al. (2019) describe verification metrics such as the continuous rank probability score (CRPS), rank histograms, Brier skill score (BSS) and the relative operative characteristics (ROC) diagrams that are commonly applied to assess the main ensemble attributes desirable in both precipitation and streamflow ensemble forecasts (e.g.Renner et al., 2009).These metrics refer to flooding events as part of a time series evaluated against a reference benchmark, such as climatology, to produce an average skill score.In contrast, here we consider ensemble spatial verification at a single time point.The verification of ensemble forecasts usually involves comparing the RMSE (root mean square error) of the ensemble mean against an observed quantity to assess the skill of the forecast with the ensemble standard deviation used as a measure of spread.A perfect ensemble should encompass forecast uncertainties such that the ensemble spread is correlated with the RMSE of the forecast (Hopson, 2014).This spread-skill relationship was assessed by Buizza (1997) to investigate the predictability limits of the European Centre for Medium-Range Weather Forecasts (ECMWF) Ensemble Prediction System (EPS).This approach to ensemble verification is based on point values and makes the assumption that the ensemble mean is the forecast state with the highest probability and that the forecast distribution is Gaussian.Significant flooding events are, in their nature, a rare occurrence, and in certain circumstances a few ensemble members can indicate a low probability of an extreme flood.Also, in particular atmospheric scenarios the ensemble forecast may result in a multi-modal forecast where two clusters of ensemble members are each equally likely (Galmiche et al., 2021).For example, both clusters may indicate flooding events but at different magnitudes.In both of these instances the individual ensemble-member details are important, and evaluation of the ensemble mean alone would not be meaningful.When mapping the flood extent prediction, the ensem-Nat.Hazards Earth Syst.Sci., 23,[2769][2770][2771][2772][2773][2774][2775][2776][2777][2778][2779][2780][2781][2782][2783][2784][2785]2023 https://doi.org/10.5194/nhess-23-2769-2023ble mean field alone does not retain the spatial detail of the individual-member forecasts.
The spatial spread-skill of the ensemble forecast is determined by evaluating the full ensemble against observations of flooding.For a flood map ensemble to be considered spatially well-spread, the spread or variation between ensemble members should equal the spatial predictability or skill of the ensemble members (Dey et al. (2014); see Sect. 2).Presently, to the best of our knowledge, quantitative evaluation methods assessing the spatial spread-skill of ensemble forecast flood maps do not exist.However, previous work in numerical weather prediction by Ben Bouallègue and Theis (2014) investigated the application of spatial techniques to ensemble precipitation forecasts using a neighbourhood or fuzzy approach that allowed comparisons at scales larger than grid level (native resolution).A location-dependent approach to the spatial spread-skill evaluation of a convective precipitation ensemble forecast was developed by Dey et al. (2016b).This method compares every ensemble member across a range of scales on a spatial field against an observation field to assess whether the ensemble forecast is spatially over-spread, under-spread or well-spread on average across a domain of interest (Chen et al., 2018).In a recent study, a scale-selective approach was developed and applied to evaluate a deterministic flood map forecast where comparisons were made against conventional binary performance measures (Hooker et al., 2022a).A scale-selective approach to flood map evaluation was found to have several benefits over conventional binary performance measures.These include overcoming the double-penalty impact problem when validating at higher spatial resolutions and accounting for the impact of the flood magnitude on the skill score.The work described here extends and applies this scale-selective approach to assess the spatial predictability and the spatial spread-skill of an ensemble flood map forecast.
In this paper we aim to address the following questions: -How can we summarize the spatial predictability information in ensemble flood map forecasts?
-How can we evaluate and visualize the spatial spreadskill of an ensemble flood map forecast?
-How does the spatial spread-skill vary with location and how can this be presented?
In Sect. 2 we present a new approach to the evaluation of spatial predictability and the spatial spread-skill of an ensemble flood map forecast by comparing against a remotely observed flooding extent.We illustrate the features of the methods through an example case study of an extreme flooding event of the Brahmaputra River, which impacted India and Bangladesh in August 2017, with a focus on the Assam region of India.The flood event details are described in Sect.3.1.The international ensemble version of the JBA Consulting Flood Foresight system provides forecast flood maps for the study and is described in Sect.3.2.Observations of the flood are derived from satellite-based SAR sensors, and the method is explained in Sect.3.3.The results including the spatial spread-skill (SSS) map are discussed in Sect. 4. Our results show that individual ensemble-member spatial predictions of flooding are meaningful and that the full ensemble spatial detail should be evaluated.We conclude in Sect. 5 that the spatial spread-skill of the ensemble forecast varies with location across the domain and can be linked to physical characteristics of the flooding event.

Ensemble flood map spatial predictability evaluation methods
In this section we present new methods for evaluating and visualizing the spatial spread-skill of an ensemble flood map forecast.Hooker et al. (2022a) described and applied a new scale-selective approach to evaluate the spatial skill of a deterministic flood map forecast relative to an observed SARderived flood map.Here, we apply this same measure to evaluate different aspects of an ensemble forecast.The scaleselective fraction skill score (FSS) method is outlined in Sect.2.1.Agreement-scale maps indicating forecast accuracy are defined for location-specific comparisons between forecast and observed flood maps in Sect.2.2.These are used to assess the spatial relationship between each unique pair of ensemble-member flood maps (member-member) and between every ensemble-member flood map and the observed SAR-derived flood map (member-SAR; Sect.2.3).Visualization methods of the spatial spread-skill relationship including our new spatial spread-skill (SSS) map are presented in Sect.2.4.

Fraction skill score
The FSS is a scale-selective verification measure that can determine the skilful scale of a modelled flood map, when compared against a remotely sensed observation of flooding (Roberts and Lean, 2008;Hooker et al., 2022a).We will call these flood maps the model array and the observed array, respectively.For an ensemble forecast, the model array could be an individual ensemble member or a summarized flood estimate derived from a combination of ensemble members such as a combined ensemble or the ensemble median (see Sect. 3.4).Both the model and observed arrays are converted into binary fields using a situation-dependent threshold (e.g.depths greater than 0.2 m are labelled flooded).For this ensemble application of the FSS we evaluate the entire flood extent across the domain.Each grid cell is labelled as inundated (1) or dry (0).All grid cells are numbered according to their spatial locations (i, j ), i = 1. ..N x and j = 1. ..N y , where N x is the number of columns and N y is the number of rows.Surrounding each grid cell, a square of length n creates an n × n neighbourhood. (1) A potential maximum MSE n(ref) depends on the fraction of flooding in the domain for the modelled and observed fields and is calculated as Finally, the FSS is The FSS is initially calculated at grid level (n = 1) followed by the smallest neighbourhood size (n = 3) before increasingly larger neighbourhood sizes (n = 5, n = 7. ..) are considered.The FSS ranges between 0 (no skill) and 1 (perfect skill).Increasing the neighbourhood size typically leads to an improved FSS as the fractions are calculated over a larger area.Plotting FSS against the neighbourhood size can indicate a range of scales where the model is deemed to be the most skilful.A target FSS score (FSS T ) can be determined from the fraction of observed flooding across the whole domain (f 0 ): The point where the FSS n exceeds FSS T can be viewed as being equidistant between the skill of a random forecast and perfect skill (Roberts and Lean, 2008).A recent study by Skok and Roberts (2018) investigated the sensitivity of the calculated skilful scale to the constant value (0.5) in Eq. ( 4) and found that 0.5 gave meaningful results compared with the measured displacement.The magnitude of the observed flood, relative to the domain area, determines the value of FSS T .This allows the comparison of the skilful scale (neighbourhood size) where FSS T is reached across different domain sizes and floods of different magnitudes.

Location-dependent agreement scales
The FSS (Sect.2.1) gives a domain average measure of forecast performance and a minimum spatial scale at which the forecast is deemed skilful.To enable the spatial spread-skill of the full ensemble to be evaluated at specific locations, we first define an agreement scale (see Dey et al., 2014Dey et al., , 2016b;;Hooker et al., 2022a, for full methodology).The agreement scale is calculated and mapped for every grid cell in the domain and shows a measure of similarity between two arrays of data.In contrast to the FSS method the arrays are not required to be thresholded.The agreement-scale method can be applied to both binary flood extent maps and flood depth fields.These could both be ensemble-member flood maps or an ensemble-member flood map and an observed flood map.Two data arrays are compared, F 1ij and F 2ij , and the aim is to find a minimum neighbourhood size (or spatial scale) for every grid cell such that there is a predetermined acceptable minimum level of agreement between F 1ij and F 2ij .This is known as the agreement-scale . (Note that the relationship between the agreement scale and the neighbourhood size described previously in Sect.2.1 is given by S The agreement scale (now defined as S for simplicity in the following equations) is determined individually for every grid cell by testing and meeting a chosen criterion.
A relative MSE, D S ij , is calculated for all grid cells, initially at grid level S = 0 (n = 1), If F 1ij = 0 and F 2ij = 0 (both dry), then D S ij = 0 (correct at grid level).The value of D S ij ranges between 0 and 1.The arrays are deemed to be in agreement at the scale being tested if The parameter value α indicates an acceptable bias at grid level such that 0 ≤ α ≤ 1.Additional historical forecast data of flood events are not available for the region in this study; so we assume there is no background bias between the forecast and the observations and set α = 0.A fixed maximum scale S lim is predetermined using human judgement considering the physical characteristics of the flood event.The value chosen for S lim depends on the magnitude of the flood extent relative to the size of the sub-catchment.For the case study presented here, we set S lim = 80 (2400 m), which is approximately 1 4 to 1 2 of the sub-catchment widths in the domain.If D S ij ≥ D S crit,ij , then the next neighbourhood size up is considered (S = 1, n = 3, a 3 by 3 square), where F 1 1ij and F 1 2ij are arrays containing the average value of each neighbourhood surrounding the grid cell at position (i, j ) for each array.The process continues by comparing increasingly larger neighbourhoods (e.g. S = 2, n = 5, a 5 by 5 square) until the agreement criterion is met for every cell in the domain.The agreement scale at which the agreement criterion is met will usually vary from Nat. Hazards Earth Syst.Sci., 23, 2769-2785, 2023 https://doi.org/10.5194/nhess-23-2769-2023 grid cell to grid cell, and these values (S = 0, S = 1, S = 2 and so on up to S lim ), each specific to each grid cell location, can be mapped onto the domain of interest to provide a location-specific measure of agreement between the two data arrays that are compared.A small value for the agreement scale means that the two arrays being compared are very similar (spatially) at a specific location, whereas a large value for the agreement scale means that the two arrays being compared are dissimilar.Note that the skilful scale determined by the FSS (Sect.2.1) differs from the agreement scale defined here.The former is directly linked to the spatial differences between objects -e.g.Skok and Roberts (2018) whereas the latter reflects a pre-defined "acceptable" bias at different scales.
Validation of forecast flood maps against remotely observed flooding extent is typically carried out by labelling each grid cell using a contingency table with categories: correctly predicted flooded, underprediction (miss), overprediction (false alarm) and correctly predicted unflooded.In the contingency table underpredicted cells are set to +1, overpredicted cells are set to −1, correctly predicted flooded cells are assigned NaN and correctly predicted unflooded cells are set to 0. Mapping these categories creates a conventional contingency map, which, when combined (by an elementwise array product) with an agreement-scale map (Eq.7), creates a categorical-scale map made by plotting the absolute agreement-scale values coloured according to the contingency class.A categorical-scale map shows a measure of spatial accuracy between two data arrays (Hooker et al., 2022a).Categorical-scale maps may be used as a basis for comparison between ensemble members and observations, as we illustrate with our case study in Sect.4.3.

Ensemble spatial spread-skill evaluation
We assume that each ensemble forecast flood map represents an equally likely future scenario and the evaluation of the full ensemble is needed to quantify the uncertainty and to evaluate the spatial spread-skill relationship.The ensemble flood map's spatial characteristics vary with location, and in order to preserve the location-dependent information we utilize a method developed to evaluate a convective ensemble precipitation forecast (Dey et al., 2016b, a).Here we outline the method and describe a new application to evaluate an ensemble forecast flood map.
A neighbourhood approach (Sect.2.2) is used to assess the spatial agreement-scale or measure of similarity at each grid cell location (i, j ) between each unique pair of ensemble flood maps.For an ensemble of M members, there are unique pairs (e.g.1275 pairs for a 51-member ensemble).For an ensemble, the skilful scale can be renamed as a believable scale, which is the scale where ensemble members become sufficiently similar to observations such that they are a useful prediction.Every paired ensemble agreement-scale field is averaged at each grid cell to produce a mean field, from the agreement-scale field defined in Eq. ( 7), indicating the location-specific believable scales of the forecast flood map ensemble.Maps of S A(mm) ij summarize the spatial spread of the full ensemble.Each of the agreementscale fields between the ensemble members and the observations are also averaged at each grid cell to give A measure of the spatial spread-skill of the ensemble can be found by comparing the average agreement scale between the ensemble members S A(mm) ij representing the ensemble spread with the average agreement scale between the ensemble members and the observed flood field S A(mo) ij representing the ensemble skill.

Spatial spread-skill visualization methods
To evaluate the spatial spread-skill relationship, S A(mm) ij (representing the ensemble spread) must be compared in the same location as S A(mo) ij (representing the ensemble skill).Data arrays can be visually compared using a binned scatter plot that averages across a selected bin of cells at the same location within the domain.Dey et al. (2016b) demonstrated for an idealized example that by plotting S A(mm) ij against S A(mo) ij as a binned scatter plot in order to preserve the spatial location of the comparison (Fig. 1), the ensemble can be classified as over-spread, under-spread or well-spread.The ensemble is deemed to be well-spread at a specific location in the domain of interest when the spread of the individual members represented at each grid cell by S , the ensemble is considered to be over-spread and the binned scatter plot will lie beneath the 1 : 1 line.The converse is true for an under-spread ensemble forecast where the agreement between members, the spread, is less than the agreement between the ensemble and the observations, the skill.Here S To summarize the spread-skill relationship we develop this visualization further by plotting a hexagonal binned 2D histogram plot (an example hexbin plot is presented in Sect.4.3).The domain is divided into a (pre-determined) number of hexagons.Hexagons minimize the perimeterto-area ratio and therefore minimize the edge effects.The hexbin histogram plot colour shade represents the number of data points within each bin.
Whilst the hexbin plot is useful for gaining an understanding of the general spread-skill relationship of the ensemble flood map forecast, it does not tell us specifically where in the domain the ensemble spatial predictability is better or worse.Our new spatial spread-skill (SSS) map plots S A(mm) ij −S A(mo) ij at every grid cell location so that the spreadskill is mapped across the domain and can be linked directly to different sub-catchments and surface features such as tributaries, embankments, bridges and importantly the underlying topography or digital terrain model (DTM), which influence the derivation of the ensemble flood maps.Regions on the SSS map where the ensemble is over-spread are positive with negative areas indicating where the ensemble is underspread, and zero values show a well-spread ensemble.Note that this does not necessarily mean that the entire ensemble is in agreement with observations at grid level but that the agreement scales between S A(mm) ij and S A(mo) ij are equal.(An example SSS map is presented in Sect.4.3.)

Ensemble forecasting flood event case study
In this section we describe an example flooding event used to demonstrate the application of the spatial spread-skill evaluation approach.We evaluate a 1 d flood inundation 51 ensemble-member forecast from the Flood Foresight sys-tem (Sect.3.2) for the domain area against a satellite SARderived flood map (Sect.3.3).

Brahmaputra flood, Assam India, August 2017
The origin of the Brahmaputra River (also known as the Yarlung Tsangpo in Tibetan, the Siang and/or Dihang River in Arunachali, Luit in Assamese, and the Jamuna River in Bangladesh) lies in the Himalayan Kailas Range of southwestern Tibet, China.Draining an area of 543 000 km 2 , the Brahmaputra flows for 2000 km across the Tibetan Plateau and a further 1000 km parallel to the Himalayan foothills through the Assam Valley, India, before entering Bangladesh where the Brahmaputra joins the Ganges River (Palash et al., 2020).The Brahmaputra baseflow originates from the upstream glacial snowmelt; however the streamflow rates are dominated by the summer monsoon precipitation.The basin receives up to 95 % of its annual rainfall during the pre-monsoon and monsoon season, which usually runs from April to September and causes annual flooding of the Brahmaputra.The Assam region typically records on average 2300 mm of annual rainfall and up to 5000 mm in the Himalayan foothills (Dhar andNandargi, 2000, 2003).
For this example case we focus on the third wave of flooding that occurred during the monsoon season in August 2017, peaking around 12 August.Figure 2 shows the location of the Brahmaputra and of a chosen domain centred upon some of the worst flooding that occurred.This area includes a confluence zone where the Subansiri River meets the Brahmaputra.The monsoon flooding impacted an estimated 40 million people across India and Bangladesh.Locally in the Assam region, the flooding in August affected over 3.3 million people and approximately 3200 villages, and river embankments were damaged in 11 districts.Over 14 000 people were evacuated to 1 of around 700 relief camps that were also needed to house over 180 000 people relocated (Floodlist, 2017).The local Assam State Disaster Management Authority (AS-DMA, 2017) flood early warning system issued a low warning alert (disasters that can be managed at the district level) on 10 August for the district.
In 2017, the south-west monsoon season rainfalls were predicted to be normal by the South Asian Climate Outlook Forum (WMO, 2017).However, the pre-monsoon season began early in the year with heavy thunderstorms affecting the region from March onwards.In the Assam region, June and July were 60 % wetter than the previous 3 years, and during August more locally intense rainfall was recorded compared with historical observations (Palash et al., 2020).In higher-latitude areas, 30 km to the north of the domain at North Lakhimpur, 215.8 mm rainfall was recorded in the 3 d prior to the flood peak (Floodlist, 2017;Hossain et al., 2021).An above-normal flood situation is declared in India where the river water level exceeds the warning level, a severe flood occurs where the water level exceeds the danger level and an extreme flood occurs where the previous highest Nat.Hazards   flood level is exceeded (Central Water Commission, 2023).The peak water level recorded downstream at Tezpur (danger level of 65.23 m) on 14 August was 66.12 m.There are regional variations in maximum water levels reported, with upland regions to the north of the Assam Valley recording water levels that exceed the previous highest flood level, indicating an extreme flood level (Floodlist, 2017).

Ensemble flood forecasting system
The Flood Foresight system (Fig. 3), developed and operationally run by JBA Consulting, is a fluvial flood inundation mapping system that can be implemented at any river basin around the world.Flood Foresight utilizes a simulation library approach to generate real-time and forecast flood inundation and water depth maps.The simulation library approach saves valuable computing time and allows the application of Flood Foresight near-continuous real time at national and international scales.A pre-computed library of flood maps for a river basin or country is created using JFlow ® (where a DTM is available) (Bradbrook, 2006) and RFlow (where a DTM is unavailable).JFlow uses a raster-based approach with a detailed underlying digital terrain model (DTM) and a diffusion wave approximation of the full 2D hydrodynamic shallow water flow equations.
RFlow combines a 1D model based upon normal depth calculations, optimized for use on a digital surface model (DSM; NEXTmap, 2016) with rapid 2D flood spreading (created by spreading normal depth from upstream to downstream) and is calibrated against JFlow.These equations capture the main controls of the flood routing for shallow, topographically driven flow.Six flood maps at 30 m resolution are created for 20-, 50-, 100-, 200-, 500-and 1500-year return period flood events (corresponding to annual exceedance probabilities (AEPs) of 5 %, 2.5 %, 1 %, 0.5 %, 0.2 % and 0.07 %, respectively).Between each adjacent pair of modelled return period maps, five additional intermediate flood maps are created by linear interpolation of both flood depth and extent.An additional five flood maps are also created beneath the lowest return period flood map.This gives, in total, a library of 36 flood maps.Note that these flood maps are undefended, and local, temporary flood defences are not included.Flood Foresight is set up for a region by dividing the river basin into sub-catchments using the HydroBASINS dataset (level 12) (Lehner, 2014).Flood Foresight takes gridded inputs of ensemble forecast streamflow and uses these to select the most appropriate flood map for each sub-catchment.These are mosaicked together, and forecasts of ensemble flood maps are produced daily, out to 10 d ahead.
The global (non-UK and Ireland) configuration of Flood Foresight uses ensemble streamflow forecast data from the Global Flood Awareness System (GloFAS) (Alfieri et al., 2013;GloFAS, 2021).The runoff data produced are routed through a representation of the river network using a double kinematic wave approach, which includes bankfull and over-bankfull routing.The river network used is taken from the HydroSHEDS dataset (Lehner and Grill, 2013).GloFAS outputs a gridded (approximately 10 km spatial resolution) ensemble forecast of river streamflow (Fig. 4).Each of the GloFAS grid cells are linked to the subcatchments in the Flood Foresight system.The simulation library flood maps are selected when the forecast streamflow exceeds a return period (RP) threshold level within each sub-catchment.The RP threshold levels are calculated using ERA5 reanalysis data (Harrigan et al., 2020).Each ensemble-member flood map forecast is created by aggregat-ing the individual sub-catchment maps.In summary, the meteorological IFS 51-member ensemble input to the flood forecasting chain allows atmospheric evolution uncertainties to be represented within the ensemble streamflow forecast and the ensemble of inundation flood maps, thus creating a probabilistic flood map forecast, indicating the likelihood of flooding.Flood Foresight produces daily ensemble flood depth and extent forecasts at 30 m spatial resolution out to 10 d.

SAR-derived flood map
A Sentinel-1 (S1A) image was acquired in interferometric wide swath mode (swath width of 250 km) around the time of the flood peak at 17:18 (IST) on 12 August 2017.The ESA Grid Processing on Demand (GPOD) HASARD service (service terminated June 2021) was utilized to map the flooding.The flood-mapping algorithm (Chini et al., 2017) uses an automated, statistical, hierarchical split-based approach to distinguish between two classes (background and flood) using a pre-flood and flood image.A pre-flood image (February 2017) from the same satellite sensor and track was used to derive the flood map (Fig. 2).Original SAR images (VV polarization) were pre-processed, which involved precise orbit correction, radiometric calibration, thermal noise removal, terrain correction, speckle reduction and re-projection to the WGS84 coordinate system.The HASARD mapping algorithm removes permanent water bodies that are detected on the pre-flood image, such as the unflooded river water, lakes and reservoirs, by applying a thresholding approach.Flooded areas beneath vegetation, under bridges and near buildings will not be detected using this method.Flood Foresight forecast flood maps include the river channel and exclude surface features such as vegetation and buildings.To smooth the HASARD flood maps and allow a fairer comparison we apply a morphological closing operation (without impacting the location of the flood extent) to flood-fill vegetation and buildings.The wide and braided Brahmaputra River in the Assam region covers a significant area of the selected domain.In order to evaluate the flood prediction accuracy alone, the pre-flood occurrence of surface water using the JRC Global Surface Water database (Pekel et al., 2016) has been removed from the Flood Foresight forecast inundation maps.The observed flood extent derived from satellite-based SAR data at 20 m grid size is re-scaled to match the forecast flood map grid size (30 m) using average aggregation.The closest available (cloud-free) optical image available was a Sentinel-2 image on the 17 August 2017, 5 d after the SAR image acquisition.During this time the flood waters had receded from their peak, which makes this unsuitable for comparison with the SAR-derived flood map.Since no other validation sources are available, for the purposes of this study we have assumed that the SAR-derived observation of flooding represents the true flooding extent.Since October 2021, Sentinel-1 SAR images have been processed by CEMS GFM (GFM, 2021) to derive flooding extent and provide an uncertainty estimate of the grid cell classification.This means uncertainty information in the SAR-derived flood map could be accounted for in future evaluation studies by verification across different levels of observation uncertainty.Additionally, a flood mask, indicating areas where flood detection using SAR data is not currently possible (at the Sentinel-1 spatial resolution), could be used to exclude areas from the evaluation process (note that this was not possible for this case study, since this information was not available in 2017).

Forecast data
Flood Foresight was set up for the Brahmaputra basin in India and Bangladesh using the simulation library approach to flood mapping described in Sect.3.2.Flood maps were precomputed for the domain of interest (Fig. 2) using a DSM and RFlow.The forecast data for the Brahmaputra flood event contain a 51-member ensemble of flood maps indicating flooding extent, produced at a 1 d lead time.Vertically stacking each individual ensemble-member flood map and adding vertically across every grid cell combines all ensemble members into a single flood map (all flooded grid cells are set to 1) showing where flooding is possible across all members (ens all ).A spatial median flood map is created (ens median ) where 26 members or more predict flooding at a particular grid cell location.Each of the ensemble-member flood maps for the domain is plotted in Fig. 5 along with ens all , ens median and the SAR-derived flood map.
Figure 6 shows the amalgamated probabilistic ensemble forecast indicating the probability of flooding at each grid cell location.This was produced by vertically stacking each ensemble-member flood map and vertically adding the number of flooded cells at each grid cell location across all ensemble members.The total is divided by 51 to calculate the probability.The dark blue colours near the central river channel indicate agreement between all ensemble members and 100 % forecast probability of flooding, and lighter colours to the north of the river indicate a low probability of flooding.

Results and discussion
To demonstrate an application of the spatial-scale approach to both ensemble forecast flood map evaluation of forecast skill and the spatial spread-skill relationship, we apply the methods outlined in Sect. 2 to the flooding case described in Sect.3.1.First, in Sect.4.1 we verify the full ensemble using a spatial-scale approach to calculate a skilful scale of agreement between each ensemble member and the SAR-derived flood map (Fig. 2) along with the combined ensemble (ens all ) and the ensemble spatial median (ens median ).We evaluate the location-specific spatial skill of the ensemble by calculating categorical-scale maps (Sect.4.2) for ens all , ens median and a best and worst case ensemble member determined by the skilful scale calculated in Sect.4.1.In Sect.4.3 we evaluate the spatial predictability of the full ensemble and show https://doi.org/10.5194/nhess-23-2769-2023 Nat. Hazards Earth Syst.Sci., 23, 2769-2785, 2023  this on our new spatial spread-skill (SSS) map, indicating regions where the ensemble is over-spread, under-spread or well-spread.

Ensemble spatial-scale evaluation
Here we investigate how a scale-selective approach can be useful for extracting meaningful information from a flood map ensemble forecast where multiple forecast flood maps represent equally likely flooding scenarios (Fig. 5).A minimum skilful scale (where FSS > FSS T ) has been calculated for each individual-member flood map, ens all and ens median .
The results in Fig. 7 show that individual ensemble-member spatial skill varies considerably with FSS at grid level ranging from 0.35 to 0.59.One member ens 1 , which would usually be disregarded as an outlier due to its low probability, outperformed all other members significantly with a skilful scale achieved at a neighbourhood size of n = 3.The combined ens all showed more skill at grid level (n = 1) and smaller neighbourhood sizes compared with ens median ; both however exceeded FSS T at n = 41 or 600 m.At neigh-Nat.Hazards Earth Syst.Sci., 23, 2769-2785, 2023 https://doi.org/10.5194/nhess-23-2769-2023 Figure 7.The spatial skill of each individual ensemble-member forecast flood extent is evaluated along with the ens median (a spatial median where 26 or more members predict flooding at a grid cell location) and ens all (flooded grid cells from all ensemble members are combined).
The FSS is calculated at increasing neighbourhood sizes to determine the scale at which the forecast becomes skilful at capturing the observed flood (FSS T ).
bourhood sizes greater than n = 41, ens median outperformed ens all .There is a cluster of members showing similar skill to ens median and ens all and a second cluster, with more ensemble variation but indicating lower skill than the first cluster.
The ens median and ens all flood maps outperform the second cluster; however there are individual members with a higher spatial skill score compared to ens median and ens all .These results show that all ensemble-member flood maps, including outliers, should be considered individually as possible future flooding scenarios.Spatial variations across individual ensemble members (see Fig. 5 ens 1 compared to ens median ) indicate that it is not meaningful to consider only the ensemble median flood map to represent the information within the full ensemble.

Ensemble spatial predictability
The scale-selective skill scores calculated for different aspects of the ensemble forecast give a domain-averaged score and skilful scale.To understand location-specific spatial predictability of the ensemble forecast, categorical-scale maps are calculated and presented in Fig. 8.These show how the agreement scale (Sect.2.2) varies with location for Fig. 8a, ens all ; Fig. 8b, ens median ; Fig. 8c, ens 1 , the "best"-performing ensemble member; and Fig. 8d, ens 21 , the "worst"-performing ensemble member.The ensemble summary map ens all (Fig. 8a) captures most of the observed flooding (in grey) with small regions of underprediction (red).However, as you might expect to see by including every potential flooding realization there are significant regions of overprediction (blue) or false alarm.The region of overpre-diction to the south of the river is less evident in the ens median categorical-scale map (Fig. 8b), which performs worse to the north by underpredicting flooding here.This flooding is captured well by ens 1 (Fig. 8c) and in particular close to a confluence zone where the Subansiri River joins the Brahmaputra (grid cell location (1100, 250)).This ties in with the high rainfall totals accumulated just to the north of this region associated with localized very heavy rainfall (Floodlist, 2017).
A region of underprediction at grid cell location (750, 750) is missed by all members.In future work, a closer inspection of the DTM or surface features included and/or excluded in the hydraulic modelling, such as embankment heights, may indicate how this modelling could be improved.The worstperforming ensemble member ens 21 (Fig. 8d) accurately predicts flooding closer to the river channel; however underprediction to the north along with overprediction to the south show where the forecast was inaccurate.Categorical-scale maps enable different ensemble flood map presentations to be evaluated so that the most useful presentation method can be determined for a particular flooding situation.

Ensemble spatial spread-skill
To evaluate the location-specific skill of the full ensemble, one option would be to calculate 51 categorical-scale maps from each individual-member flood map (Fig. 5).Although this approach maintains the spatial detail held within each of the ensemble-member flood maps, it does require multiple visual comparisons to be made by the flood forecaster or modeller, which takes time and effort  Fig. 5 provides a demonstration of these forecasting difficulties.Further, the categorical-scale maps do not evaluate the ensemble spatial spread.To address this, we develop a spatial spread-skill (SSS) map (derived from Fig. 9, presented in Fig. 10) showing the spread-skill of the full ensemble forecast and keeping the location-specific detail.All ensemble members are included in this analysis which evaluates both the spatial skill and the ensemble spatial spread of the forecast against the remotely observed flooding extent.Figure 9 shows how the average ensemble-ensemble agreement scale in Fig. 9a S A(mm) ij calculated at each grid cell (representing ensemble spread) compares with the average ensemble-observed scale in Fig. 9b S A(mo) ij (representing ensemble skill) along with the hexbin scatter plot in Fig. 9c, which compares Fig. 9a and b to indicate the spatial spreadskill of the forecast.The hexagonal tessellation is used so that the distances along the hexbin diagonal are on the same scale as those along the x and y axis.For a perfect ensemble fore-cast the average agreement scale between ensemble members should match the agreement scale between the ensemble forecast and observed flood map; i.e. they should align along the 1 : 1 line.The SSS map plots the difference between the ensemble-ensemble and the ensemble-observed average agreement scales at each grid cell (Fig. 10) and indicates where the spatial spread-skill is over-spread, under-spread, or well-spread.Three numbered areas (Fig. 9a) identify three different ensemble spread-skill relationships.Area 1 shows that the agreement between ensemble members is close but that they disagree with the observed flooding extent.This is displayed in orange shades as an under-spread or missed region on the SSS map in Fig. 10.This is the region close to the confluence area described in Sect.4.2.Recall that in this region most ensemble members did not predict the flooding that occurred with the exception of one ensemble member (ens 1 ).In area 2 in Fig. 9, both Fig. 9a and Fig. 9b    well-spread: these are shown in white in Fig. 10.Apart from the missed and well-spread regions in Fig. 9, the overall visual impression is that the ensemble spread-skill lies below the 1 : 1 line and is over-spread, indicated by area 3.This corresponds to the purple shading on the SSS map (Fig. 10).Overall Fig. 9 tells us that the spread-skill relationship for this example case study is not uniform across the domain but is in fact location specific.The areas identified (1, 2 and 3) lie within different sub-catchments, which are linked to different GloFAS grid cells, driving the ensemble flood map selection for each sub-catchment.Inferences can be made about the spread-skill of the driving discharge data at sub-catchment level across the domain.Using the spatial spread-skill relationship shown on the ensemble SSS map we can infer how well the ensemble forecasting system encompasses the multiple sources of uncertainty and how meaningful the probabilistic ensemble forecast of flood inundation actually is.An ensemble flood map forecast that is well-spread suggests that the probabilistic forecast is meaningful.The SSS map is a useful evaluation tool for validating flood forecasts in ungauged or partially gauged rivers.A simulation library approach, like the Flood Foresight maps used here, relies on the accuracy of the return period thresholds set, the (ensemble) forecast streamflow and the accuracy of the flood inundation map for a given streamflow.The forecast evaluation approaches presented here enable these system attributes to be evaluated even where observed streamflow is limited or erroneous.The SSS map summarizes the whole ensemble, which makes it useful for forecasters attempting to convey uncertainty information to decision makers, highlighting regions where there is high or low confidence in the forecast.

Conclusions
Differences between ensemble members in ensemble forecast flood map systems are mostly driven by initial condition perturbations at the top of the hydro-meteorological forecast chain within the numerical weather prediction system.Presently, there is limited understanding or evaluation of how these meteorological uncertainties link to mapped flooding predictability, which involves additional sources of uncertainty.An evaluation of the spatial predictability and the spread-skill relationship of the ensemble flood map forecast provides an improved understanding of the performance of the forecast system.Uncertainties in other parts of the forecast chain are not truly represented by the ensemble flood maps, and evaluating the spatial spread-skill of the flood maps is important for understanding the likelihood of flooding that the ensemble flood maps capture.In this paper, we present a new scale-selective approach to assess the spatial predictability and spread-skill of an ensemble flood map forecast by comparing this against a satellite SAR-derived observation of flooding extent.By calculating a skilful scale at each grid cell for every unique ensemble-member pair we can determine the ensemble spatial spread, and between every ensemble member and the SAR-derived flood map we can determine the ensemble spatial skill.The hexbin scatter plot summarizes the spread-skill relationship so that a trend across the whole domain can be assessed.The difference between these skilful scales can be mapped onto the spatial spread-skill (SSS) map, which shows, for each specific location in the domain, whether the ensemble is over-spread, under-spread or well-spread.The methods are applied to an example flooding event of the Brahmaputra River in the Assam region of India in August 2017.
In operational practice there are multiple options of ensemble flood map presentation type such as presenting the ensemble median or another exceedance probability for delivery to end users and decision makers.An important aspect of developing an inundation flood forecasting system is to determine the most useful way to present a spatial ensemble forecast.Using a scale-selective approach we have evaluated the performance of individual ensemble members, a combined total ensemble and the spatial ensemble median compared to a SAR-derived observation of flooding extent.Other options could be to exclude ensemble-member outliers, to spatially cluster similar ensemble members into groups of flooding extent or to present a most likely, best and worst case ensemble flood map.Whichever presentation method is chosen, this should be fully explored using the spatial spread-skill methods described here to evaluate the ensemble performance of historical flooding events.We found for this example flooding event that one ensemble member significantly outperformed the combined and median flood maps and that potentially in some flood forecasting scenarios this member would have been excluded as an outlier.The categorical-scale maps show that the ensemble spatial median could miss vital flooding information and that all members should be considered potential future flooding scenarios.
Through mapping the spatial spread-skill relationship, which varies with location, links can be made between the spatial variations in spread-skill and the physical characteristics of the flooding event.We found that one ensemble member outperformed all others in a region close to a confluence zone and nearby observed heavy rainfall.The region correlates with an area of under-spread ensemble members, indicating that not enough members were predicting flooding here.Future studies could investigate the physical processes further using the methods presented here.The ensemble flood map spatial spread-skill could be investigated in the context of a particular physical process (such as rainfall intensity and/or location or an improved aspect of the hydrological model such as antecedent soil moisture) and how these uncertainties translate to the probabilistic flood map forecast.An understanding of the spatial predictability is particularly important for un-gauged catchments where the calibration of both forecast streamflow and return period thresholds (used to select the simulation library flood map) is rarely practised routinely.Ideally, in operational practice, these spa-Nat.Hazards Earth Syst.Sci., 23,[2769][2770][2771][2772][2773][2774][2775][2776][2777][2778][2779][2780][2781][2782][2783][2784][2785]2023 https://doi.org/10.5194/nhess-23-2769-2023tial verification approaches including the categorical-scale and SSS maps could be calculated and stored routinely as flooding events coincide with SAR-derived or other remotely observed flood maps to build up a verification catalogue or database.This database could then be used to investigate the spatial spread-skill model performance under different scenarios such as forecast lead time, month or season, or flood type.More locally, the impact of an improved DTM or the inclusion of a digital surface model (DSM) or other surface features in the hydraulic model such as embankments could be considered.Over time, such a database would improve our understanding of the spatial predictability of an ensemble flood map system and how well the uncertainties present are represented by the ensemble forecast.
The result would lie on a 1 : 1 line on the binned scatter plot.Where the spread between the ensemble members exceeds the skill of the ensemble forecast, i.e.

Figure 1 .
Figure 1. Figure reproduced with permission from Dey et al. (2016b) showing results on a binned scatter plot from an idealized experiment indicating the spatial spread-skill relationship between an ensemble forecast and the observation.

Figure 2 .
Figure 2. Left panel: domain location on the Brahmaputra River in the Assam region of India.Domain size is 57.5 km by 39.3 km.Right panel: Sentinel-1 SAR-derived flood map and permanent water bodies from the JRC Global Surface Water database for the domain of interest (DOI).Base map from © Google Maps.

Figure 3 .
Figure 3.The Flood Foresight ensemble forecast flood inundation and impact mapping work flow.Prepared by JBA Consulting.

Figure 4 .
Figure 4. GloFAS grid, permanent water bodies and Flood Foresight sub-catchments for the domain of interest (DOI).

Figure 5 .
Figure 5. Brahmaputra River, Assam region, August 2017.51 ensemble-member forecast flood maps (labelled 0 to 50), ens median and ens all all at 1 d lead time and the Sentinel-1 SAR-derived flood map.

Figure 6 .
Figure 6.Brahmaputra River, Assam region, August 2017.Colour shading from white (low) to dark blue (high) indicates the forecast probability of flooding based on a 1 d lead time, 51 ensemble-member flood map forecast for the Brahmaputra River in the Assam region, August 2017 (note that the map background is grey).

Figure 8 .
Figure 8. Brahmaputra River, Assam region, August 2017.Categorical-scale maps for (a) ens all (flooded grid cells from all ensemble members are combined), (b) ens median (a spatial median where 26 or more members predict flooding at a grid cell location), (c) individual ensemble member 1 and (d) individual ensemble member 21.Red areas indicate where the forecast is underpredicted and blue regions represent overprediction.The colour shading indicates the scale of agreement (Eq.7) between the forecast and the observed flooding, with lighter shading indicating that a smaller agreement scale is required to reach the agreement criterion (Eq.6).A fixed maximum-scale S lim is drawn to scale (c).For georeferencing, see Fig.6; each grid cell is 30 m × 30 m.

Figure 9 .
Figure 9. Brahmaputra River, Assam region, August 2017.(a) The average agreement-scale map of each unique pair of forecast ensemble flood maps and (b) between each ensemble member compared against the observed SAR-derived flood map.(c) A binned histogram scatter plot compares (a) and (b) to indicate the spatial spread-skill of the forecast ensemble.Panel (d) indicates the corresponding sub-catchment locations.Areas labelled 1, 2 and 3 are discussed in Sect.4.3.A fixed maximum-scale S lim (Eq.6) is drawn to scale (a).Note that PWB means permanent water bodies.

Figure 10 .
Figure 10.Brahmaputra River, Assam region, August 2017.(a) The spatial spread-skill (SSS) map shows the difference between the ensemble-ensemble and the ensemble-observed average agreement scales at each grid cell.Negative values (orange) indicate where the ensemble is under-spread and positive values (purple) indicate where the ensemble is over-spread.White areas indicate where the average agreement scales match and indicate good spatial spread-skill.Panel (d) indicates the corresponding sub-catchment locations.The labelled areas (1, 2 and 3) are discussed in Sect.4.3.A fixed maximum-scale S lim (Eq.6) is drawn to scale (a).Note that PWB means permanent water bodies.