The credibility challenge for global fluvial flood risk analysis

Quantifying flood hazard is an essential component of resilience planning, emergency response, and mitigation, including insurance. Flood hazard assessment has traditionally been undertaken at catchment and national scales, but efforts have recently intensified to estimate flood risk globally to better allow consistent and equitable decision making. Global flood hazard models are now a practical reality, thanks to improvements in numerical algorithms, global datasets, computing power, and coupled modelling frameworks. Outputs of these models are vital for consistent quantification of global flood risk and in projecting the impacts of climate change. However, the urgency of these tasks means that outputs are being used as soon as they are made available and before such methods have been adequately tested. To address this, we compare multi-probability flood hazard maps for Africa from six global models and show wide variation in their flood hazard, economic loss and exposed population estimates, which has serious implications for model credibility. While there is around 30%–40% agreement in flood extent, our results show that even at continental scales, there are significant differences in hazard magnitude and spatial pattern between models, notably in deltas, arid/semi-arid zones and wetlands. This study is an important step towards a better understanding of modelling global flood hazard, which is urgently required for both current risk and climate change projections.


Introduction
Flooding is one of the most damaging natural hazards, accounting for 31% of all economic losses worldwide resulting from natural hazards [1]. The ten costliest floods between 1980 and 2014 caused an estimated US $187 billion in overall losses (adjusted for inflation) as well as the loss of 13 597 lives [2]. With the frequency and magnitude of flood disasters projected to increase due to both climate change and growing population exposure [3,4], flooding is one of the key societal challenges for this century. In order to address this challenge, knowledge of the expected flood hazard for a given probability is required for risk reduction. Such risk reduction is at the heart of two recent international agreements: the Sendai Framework for Disaster Risk Reduction [5] and the Warsaw International Mechanism for Loss and Damage Associated with Climate Change Impacts [6]. Some countries have made significant progress in this regard, due to greater wealth, political will and more comprehensive data availability. However, fluvial (river) flood risk for much of the world is still 'unmapped', and even where mapping exists, it often uses different and inconsistent methodologies or datasets across countries and regions. This lack of consistent risk information makes global and national efforts to reduce risk and increase resilience as well as high level planning and decision making, particularly challenging. In the same way that national level modelling in some countries (e.g. UK, Germany) has allowed a more consistent, comprehensive and equitable understanding of flood hazard, relative to disparate collections of heterogeneous and patchy local scale modelling, so global scale models provide the same benefits for those interested in global flood risk relative to a national scale.
In addition, consistent global coverage can provide flood risk information for many nations where even national level assessments are currently unavailable [7].
Computational river flood models are one of the core tools used for national flood hazard mapping and flood forecasting. Usually, they consist of: (i) a method to estimate river flow magnitude for a given probability; and (ii) a model to simulate water flow in river channels and over floodplains. Programmes for national level flood modelling often use specially commissioned data collection, for example airborne laser terrain data at high resolution (1-2 m horizontal), detailed surveys of river bathymetry and long-term river flow data. The application of these methods on a global scale was hard to envisage ten years ago [8] due to the local nature of flood hazard, but recent global datasets have enabled this possibility [9]. Datasets such as the Shuttle Radar Topography Mission (SRTM) digital elevation data [10], suitably processed for floodplain modelling [11], as well as derived river networks [12], and mapping of characteristics such as channel width [13], mean there are now sufficient data at a moderate resolution (of the order ∼90 m at the equator) with which to undertake global flood modelling. Added to this are new methods of estimating extreme flow probability distributions by cascading climate reanalysis datasets through atmospheric and land surface models [14-16] or regional flow frequency analysis based on river gauge observations [17]. Finally, with advances in algorithms for rapid simulation of flood flow physics [18], it is now possible to model global flood risk in sufficient detail (100 m-1 km resolution) to be useful for decision makers. Recognising this potential, scientific and commercial groups have recently been developing global flood hazard models.
Current publications show that model outputs are now available and being used to address science and management questions related to flood risk, including the issue of how these risks could change in the future due to climate change and socioeconomic development [7,19]. Global models are even being incorporated into a new range of open online hazard tools [1,7,20]. In parallel, proprietary Catastrophe (CAT) flood models for the insurance industry are being developed, and model evaluation is a regulatory requirement for most industries around the globe.
There are ultimately many different end users who need to know how accurate these models are and if they are fit for purpose [7]. However, to date, all global flood hazard models have had limited validation against observed flood flows or extents. Partly, this is because they are different to other more local scale models in this field, and so cannot draw on a rich heritage of previous testing methods, but mainly it is due to the difficulty of undertaking validation comprehensively over such large spatial scales, particularly in data scarce areas where risk products are most needed.
The validation and benchmarking that has been undertaken so far for individual global models shows they have some skill in predicting flood hazard at a large river scale [4,14,16,21-24]. Benchmarking undertaken for the SSBN model against Canadian and UK flood hazard maps shows that the global model captures between two thirds and three quarters of the area determined to be at risk in the detailed models [22]. The JRC model was also benchmarked against European rivers and the results were comparable to SSBN's, although with lower scores in some areas. For regions outside Europe and North America, where no detailed flood models are generally available, comparison of the JRC model with satellite images of flooding shows more variable results [21]. Better results for European rivers are thought to be due to the more reliable hydrological data available and the relatively small size of floodplain and wetland areas [21]. For the GLOFRIS model, visual comparisons with satellite observations of Bangladesh show plausible river flood hazard output [14]. The GLOFRIS model was benchmarked against some UK and German national flood hazard maps of large rivers, commensurate with model resolution, and showed that it captured around two thirds of the detailed model's predicted flood hazard [4]. The ECMWF model was benchmarked against a global flood hazard map that was produced for the 2011 Global Assessment Report on Disaster Risk Reduction [23,24], and found to compare reasonably well, but in general predicted greater flood extents. CaMa-UT was benchmarked against flow gauges and SAR satellite data of floodplain inundation of the Amazon basin, and showed a good correlation with observations [16]. Flow validation of the CaMa-UT model against gauges in 30 major river basins was also conducted and results were more variable, but improved on previous attempts [16].
As the number of available global models increases and their results are incorporated more deeply into decision making, there is an urgent need for those who use them to understand how they compare with each other. Are they interchangeable in the new global flood risk assessment frameworks? It is also important to identify strengths and weaknesses of particular models and how we might improve them.

Data, models and methods
The need to compare models was identified as a research priority at the inaugural Global Flood Partnership (GFP) meeting hosted at the European Centre for Medium-Range Weather Forecasts in Reading, UK during March 2014 [25]. The research presented here is a direct outcome of that collective agreement to begin the process of model inter-comparison and testing. We take the flood hazard output from six state-of-the-art global models, and assess how they compare in terms of flood hazard simulations, to understand the implications for estimates of exposed gross domestic product (GDP) loss and population. The inter-comparison analysis is undertaken for the entire African continent for a standard range of hazard return periods (25, 100, 250, 500, 1000 years) and is summarised at continent, catchment and country level. The African continent was chosen as large enough to be meaningful, the least commercially sensitive to encourage participation, and most lacking in flood hazard information for global planning. All six models were also aggregated into a single 'model agreement' dataset, categorising areas by how many models agree that they are flooded.
The six global flood hazard models compared in this paper are CaMa-UT [16], GLOFRIS [14,26], ECMWF [15], JRC [21], SSBN [22], and CIMA-UNEP [1]. All the models attempt to simulate, for a given probability flow, how water that is excess to river channel capacity inundates the surrounding floodplain topography. While at its core this is a similar aim to traditional hydraulic modelling, the sheer scale of the model domain, and the lack of high quality DEM or gauged flow data require innovative approaches at all stages that are largely untested at this scale. The models each use a wide variety of different approaches to tackle these challenges. All the global flood hazard models predict flood extent and depth from fluvial (river) flooding only; coastal and pluvial hazard are excluded. The flood hazard is predicted for the range of standard return periods by deriving a river flow for the return period and simulating the flooding that would occur.
In generating river flows for a given probability, the six models can be grouped (figure 1) by general structure into: (i) those that use a model cascade of a precipitation timeseries from global climate reanalysis data driving a land surface model to produce flows at locations along a river (CaMa-UT, GLOFRIS, ECMWF, JRC); and (ii) those that use a regional flood frequency approach to estimate flood flows from pooled river gauge data (5000-8000 gauges), given upstream catchment characteristics (SSBN), or complemented with hydrologic simulations (CIMA-UNEP). The models also differ in how they simulate floodplain inundation, ranging in complexity from: (i) flood volume redistribution (GLOFRIS) and water elevation calculated from flow at a river section (CIMA-UNEP); (ii) floodplain storage elevation relationships. All models are based on processed versions of the SRTM DEM [10] and Hydrosheds river network [12] to provide near global coverage. A detailed description of each model framework can be found in the supplementary material, and further technical details can be found in the supporting publications [14-16, 21, 22, 26].
All model results were provided for the comparison analysis in their native raster (grid) format (e.g. NetCDF, ArcGIS raster) and converted to a common geotiff format, while retaining the native resolution and data precision. Model results that were provided in multiple tiles or overlapping catchments were merged into seamless rasters covering the entire continent of Africa. All rasters were provided in, and all processing was undertaken in, the WGS84 projection system. Variation of raster cell area with latitude was accounted for using the Haversine method. Model outputs were mostly provided in a water depth format and these were converted to binary rasters of flooded (depth > 0 m) and dry (depth = 0 m) cells for this analysis.
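The cell-area correction and the depth-to-binary conversion described above can be illustrated with a minimal sketch. It assumes a spherical Earth with a mean radius of 6371 km, square cells in decimal degrees, and a cell area approximated as east-west width times north-south height, each measured with the Haversine formula. The function names, the radius value and the nested-list raster layout are illustrative assumptions, not details taken from the study.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def cell_area_km2(lat_centre, cell_size_deg):
    """Approximate area of a square lat/lon cell centred on lat_centre."""
    half = cell_size_deg / 2
    # East-west width measured along the cell's central latitude.
    width = haversine_km(lat_centre, -half, lat_centre, half)
    # North-south height measured along a meridian.
    height = haversine_km(lat_centre - half, 0.0, lat_centre + half, 0.0)
    return width * height


def flooded_area_km2(depth_rows, lat_top, cell_size_deg):
    """Sum the area of cells with depth > 0 m; rows run from north to south.

    None is treated as no-data and counted as dry (an assumption of this sketch).
    """
    total = 0.0
    for r, row in enumerate(depth_rows):
        lat_centre = lat_top - (r + 0.5) * cell_size_deg
        wet_cells = sum(1 for d in row if d is not None and d > 0)
        total += wet_cells * cell_area_km2(lat_centre, cell_size_deg)
    return total
```

With these assumptions a one-degree cell at 60 degrees latitude has roughly half the area of a one-degree cell at the equator, which is why flooded cell counts cannot be compared directly across latitudes.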
Exposure analysis was undertaken by intersecting the flooded areas with spatially distributed exposure datasets for population and GDP. Population exposure was calculated with the Worldpop dataset using the 2010 population with national totals adjusted to match UN population division estimates, resolution 1/120 decimal degrees [27] (http://worldpop.org.uk). GDP exposure was estimated using downscaled GDP data for 2010 [28], at 1/120 decimal degrees.
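As a rough illustration of the intersection step, the sketch below sums a gridded exposure value (population or GDP) over flooded cells. It assumes the binary flood mask has already been resampled onto the same grid as the exposure data; how that resampling was done in the study is not stated here, and the function names and the use of None for no-data are hypothetical.

```python
def exposed_total(flood_mask, exposure):
    """Sum exposure values over cells flagged as flooded.

    Both grids are lists of rows on the same (assumed, pre-aligned) grid.
    None in the exposure grid marks no-data and is skipped.
    """
    total = 0.0
    for mask_row, exposure_row in zip(flood_mask, exposure):
        for wet, value in zip(mask_row, exposure_row):
            if wet and value is not None:
                total += value
    return total


def exposed_fraction(flood_mask, exposure):
    """Exposed share of the regional total, or 0.0 if the total is zero."""
    overall = sum(v for row in exposure for v in row if v is not None)
    if overall == 0:
        return 0.0
    return exposed_total(flood_mask, exposure) / overall
```

Expressing exposure as a fraction of the regional total makes results comparable between countries of very different size.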
Flooded area and exposure analysis was also undertaken for a combined SRTM Waterbody and MODIS water mask [13] in order to identify results for normally wet areas.
Summary statistics of flooded areas and exposure for regions of interest were calculated for the Africa continental boundary, country boundaries and Hydrosheds catchments [12].
To analyse model agreement, we aggregate (separately for each return period) the flood area extent from all the models into categories according to how many models agree that an area is flooded. This results in a single categorised dataset where the category is an integer number of models that predict an area as flooded (figure 2). This gives a range between 0 (no models predict flooding i.e. dry) and 6 (all models predict flooding). This aggregation is carried out at the finest resolution of all the models to ensure no loss of fidelity. The aggregated dataset is available free of charge for academic research and education purposes at Research Data Leeds (doi: 10.5518/96).
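The aggregation can be sketched as a cell-by-cell count of binary flood masks. This example assumes every model mask has already been resampled to one common grid (the finest model resolution in the text); the function names and the nested-list layout are illustrative only.

```python
def agreement_grid(model_masks):
    """Count, for each cell, how many models flag it as flooded.

    model_masks: list of binary grids (lists of rows), all on one common grid.
    Returns a grid of integers from 0 (dry in all models) to len(model_masks).
    """
    n_rows = len(model_masks[0])
    n_cols = len(model_masks[0][0])
    counts = [[0] * n_cols for _ in range(n_rows)]
    for mask in model_masks:
        for r in range(n_rows):
            for c in range(n_cols):
                if mask[r][c]:
                    counts[r][c] += 1
    return counts


def area_by_category(counts, cell_areas):
    """Total area in each agreement category, ignoring dry cells (category 0)."""
    areas = {}
    for count_row, area_row in zip(counts, cell_areas):
        for n_agree, area in zip(count_row, area_row):
            if n_agree > 0:
                areas[n_agree] = areas.get(n_agree, 0.0) + area
    return areas
```

The per-category areas returned by `area_by_category` are the quantities needed for a regional agreement summary.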
A model agreement index (MAI), equation (1), is then calculated from these categories for a given region (e.g. country) by summing the total area of each flooded category, multiplied by the fraction of models that agree in that category, and then dividing this sum by the maximum possible model agreement, resulting in a fraction of model agreement. The resulting fraction varies between 0 for no agreement and 1 for maximum agreement. Here, A is the total flooded area predicted by all models, a_i is the flooded area for an aggregated category, N is the number of models in the comparison, and i is the aggregated category (i.e. the number of models in agreement).
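Equation (1) itself is not reproduced in this text, so the sketch below is one plausible reading of the verbal definition, MAI = (sum over i = 2..N of (i/N) a_i) / A. Two points are assumptions: the sum starts at i = 2, so that the index is 0 when no cell is flooded by more than one model, and A is taken as the area flooded by at least one model. The function name is hypothetical.

```python
def model_agreement_index(category_areas, n_models):
    """Assumed form of the MAI: sum_{i=2..N} (i/N) * a_i, divided by A.

    category_areas: dict mapping i (number of models agreeing, 1..N) to the
    flooded area a_i in that category. A is taken here as the sum of all a_i,
    i.e. the area flooded by at least one model (an assumption).
    """
    total_flooded = sum(a for i, a in category_areas.items() if i >= 1)
    if total_flooded == 0:
        return 0.0  # no flooded area at all; treated as no agreement
    weighted = sum((i / n_models) * a
                   for i, a in category_areas.items() if i >= 2)
    return weighted / total_flooded
```

For example, with six models and illustrative areas of 50, 30 and 20 units in categories 1, 2 and 6, this reading gives (30 × 2/6 + 20 × 6/6) / 100 = 0.3.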
This index does not assume any one model is correct and is purely an agreement measure for wet areas; dry areas are ignored. Including dry areas in an agreement index is problematic for three reasons: (i) each model has a different upper catchment size, where flooding is ignored, and these should really be no-data areas in the model results; (ii) some models mask out arid areas in post-processing; and (iii) large dry areas (∼90% of land area) will bias an agreement measure upwards, giving a false impression of flooding agreement.
Cohen's kappa coefficient was also calculated for each pair of models for each return period and results are detailed in the supplementary material. For the aggregated dataset, population and GDP exposure were calculated for all return periods, and together with the assumption that a 2 year return period flood has zero exposure, expected annual exposure (EAE) was then calculated as the area under the exceedance probability-exposure curve, and the mean value of all categories is plotted in figure 4(d), see [29].
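The EAE calculation can be sketched as a trapezoidal integration of exposure against annual exceedance probability (1 / return period), anchored at zero exposure for the 2 year return period as stated above. The trapezoidal rule, the absence of any extrapolation beyond the rarest return period, and the function name are assumptions of this sketch.

```python
def expected_annual_exposure(return_periods, exposures, zero_exposure_rp=2.0):
    """Area under the exceedance probability-exposure curve (trapezoidal rule).

    return_periods: return periods in years, e.g. [25, 100, 250, 500, 1000].
    exposures: exposure (population or GDP) for each return period.
    zero_exposure_rp: return period assumed to have zero exposure.
    """
    points = [(1.0 / zero_exposure_rp, 0.0)]
    points += [(1.0 / rp, e) for rp, e in zip(return_periods, exposures)]
    points.sort(reverse=True)  # from most frequent to rarest event
    total = 0.0
    for (p_hi, e_hi), (p_lo, e_lo) in zip(points, points[1:]):
        total += 0.5 * (e_hi + e_lo) * (p_hi - p_lo)
    return total
```

As a check with purely illustrative numbers, a constant exposure of 100 at every return period from 25 to 1000 years gives 0.5 × 100 × (0.5 − 0.04) + 100 × (0.04 − 0.001) = 26.9.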

Results and discussion
Encouragingly, aggregated results (figure 2) show many areas of agreement between the models, most obviously directly adjacent to large rivers, particularly where these are constrained by distinct floodplain boundaries, such as near the confluence of the Niger and Benue Rivers in Nigeria (figure 2(a)). However, when we calculate two measures of model agreement continent-wide, we find a MAI of only 0.29, and a mean Cohen's kappa coefficient of 0.43, across all models and all return periods. Both measures range between 0 (no agreement, or agreement by chance for kappa) and 1 (perfect agreement) and these calculated values therefore indicate significant differences. Similar Global Circulation Model inter-comparisons [30] have highlighted that agreement between models can be dependent on using common model components, but the global flood models compared here are very new and have been developed mostly independently so far, resulting in a variety of structures and very few shared components. There are many areas where the models disagree, in particular in delta regions, where differences in the way individual models handle bifurcating flood flows result in very different patterns of inundation. Arid and semi-arid climate zones also show more disagreement between models than tropical and semi-tropical areas, pointing towards the greater importance of evaporation and recharge processes in these areas. Some post-processing is carried out on some models to mask these arid areas, as they are difficult to treat well with traditional flood modelling assumptions. There is also model disagreement in the larger wetlands, such as that of the Congo River. This is likely due to the challenges of modelling the connectivity of the main channel and floodplain in large wetlands, as well as the presence of vegetation artefacts in the SRTM DEM, particularly in flat areas.
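For readers unfamiliar with the pairwise measure, Cohen's kappa for two binary flood masks can be sketched as follows. The sketch assumes both masks are flattened to equal-length sequences on a common grid and that cells are unweighted; whether the study weighted cells by area or excluded particular regions is not stated here, and the function name is hypothetical.

```python
def cohens_kappa(mask_a, mask_b):
    """Cohen's kappa for two equal-length binary sequences (1 = flooded)."""
    if len(mask_a) != len(mask_b) or len(mask_a) == 0:
        raise ValueError("masks must be non-empty and of equal length")
    n = len(mask_a)
    both_wet = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    both_dry = sum(1 for a, b in zip(mask_a, mask_b) if not a and not b)
    observed = (both_wet + both_dry) / n
    wet_a = sum(1 for a in mask_a if a) / n
    wet_b = sum(1 for b in mask_b if b) / n
    # Agreement expected by chance from each model's wet and dry fractions.
    expected = wet_a * wet_b + (1 - wet_a) * (1 - wet_b)
    if expected == 1.0:
        return 1.0  # both masks entirely wet or entirely dry
    return (observed - expected) / (1 - expected)
```

Because kappa subtracts chance agreement, two models that each flood half the cells but overlap only as often as chance would predict score 0, even though their raw cell-by-cell agreement is 50%.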
Comparing total flooded area to the total continental area (figure 3(a)) for each model and all return periods shows a wide variation in the area simulated to be under threat from flooding. At a 1-in-25 year return period, this flooded area ranges from 3% to 8.3% of the continent and for a 1-in-1000 year return period, 4.2%-10.5%, depending upon the model. These differences can be a consequence of the different hydrological datasets and model structures used. Permanent waterbodies account for 1% of the total flooded area. Another interesting difference in the flooded area results is that the majority of models display limited sensitivity to the range of probability, evidenced by the flatter curves in figure 3(a). Using the output from the less sensitive models in a risk analysis will show less difference between low probability and high probability hazards. Flooded area results at a catchment scale (figure 3(b)) also show significant spatial differences.
These differences in hazard have significant implications for exposure analysis (figure 4). The spread of GDP and population exposure for flooded areas from the different models demonstrates that, even where models agree on the percentage area flooded, this aggregate agreement may result from very different spatial patterns of flooding, which result in very different exposure estimates (figures 4(a) and (b)). For example, the 1-in-1000 year floods for the SSBN and ECMWF models have around the same flooded area of just over 10%, but show a difference in total population exposure of 6.5%. Some of this difference will be due to the SSBN model's inclusion of smaller rivers, and these smaller rivers will be in locations with less exposure. However, river size threshold does not explain all the differences, evidenced by the fact that CaMa-UT and ECMWF models share the same hydraulic model and river size threshold, but the CaMa-UT flooded area is only half that of ECMWF's, indicating that the difference here is due to different climate forcing or land surface models. Indeed, evaluation of reanalysis products over West Africa shows significant biases in precipitation, which are especially acute in ERA-Interim, used in the ECMWF model [31]. Exposure analysis by country also shows big differences between the model results (figure 4(c)), for example, Egypt ranging from approximately 1%-50%, depending on the model.
Applying a simple measure of model agreement (MAI) to each country, along with a measure of the EAE, we can see a spread of results that provides a useful perspective on the differences between models (figure 4(d)). This analysis could be applied to any region, not just at country level, and provides an indication of where models agree or not and what the exposure implications are. Split arbitrarily into four quadrants, it also shows where different follow-up actions, such as model improvement or exposure dataset refinement, should have a higher priority. Looking at an example from each quadrant:
Quadrant A: Egypt will be sensitive to model variations as it has ∼95% of its population living along the banks of the Nile and half of those in the delta. There is a low model agreement due to how the models deal with bifurcating delta channels.
Quadrant B: South Sudan also has a high exposure, but shows more agreement between models as all identify the large Sudd wetlands. There is some disagreement due to the fact that dynamics and evaporation play a dominant role in the flood extent, and not all models include these processes.
Quadrant C: Western Sahara has a low population with few exposed to flood in any model. There is low agreement between models, but it is flat and has an arid climate, and any flood risk is likely to be localised flash flooding. Models differ in this climatic context, as some do not include arid climate processes and there are no major rivers, but this is of low consequence in this context.
Quadrant D: Rwanda shows better agreement between the models, but the relative proportion of population exposed to flood hazard is low. The country is small and elevated, has a temperate to subtropical climate, and is dominated by mountains and small confined river systems with some lakes, so models should generally agree better in this hydraulic and climatic context.
While there is encouraging agreement between the models in some areas, there are enough differences between the models in most areas that any flood risk conclusions resulting from identical analysis using different models will lead to very different implications and actions. This shows we are currently at an early stage of model development and the results from only one model will need to be used with appropriate caution.

Conclusions
We have outlined the two main types of new global flood models (climate or gauge data driven) and summarised some of their key structural differences. The newness of the models means there is a rich variety of structural approaches to the many challenges of modelling floods at a global scale.
Previous validation of individual models shows that these models have some skill in mapping flood extent on larger rivers, typically capturing between two thirds and three quarters of the area determined to be at risk in the more detailed engineering scale flood models. Many also show skill at capturing some large scale observed flood events.
Aggregating the flood extent data for six of these global flood models and subsequent analysis shows that over the continent of Africa, there is around 30%-40% agreement in flood extent. There are significant differences in hazard magnitude and spatial pattern between models, notably in deltas, arid/semi-arid zones and wetlands. There are also some areas of strong agreement, where the flood hydraulics is more straightforward, such as confined floodplains along major rivers.
The main conclusion from this study, particularly important for users of these models, is that there are sufficient differences between the model results that they are currently not interchangeable in global flood risk frameworks.

Outlook
This first comparison of global flood hazard models has shown that it is vital to have a more sustained and carefully planned comparison leading into the future. We see this as analogous to the Coupled Model Intercomparison Project [32], and the Inter-Sectoral Impact Model Intercomparison Project [33], underpinning the Intergovernmental Panel on Climate Change, while being more explicitly focused on global flood risk.
The research presented here was shared at the June 2016 GFP conference at the Joint Research Centre, Ispra, Italy. The outlook for global flood model testing and validation was discussed at a dedicated workshop session, and the outcomes are summarised below.
The GFP conference has a strong representation from the user community, and it was clear that while they do not expect perfection from the models, they do want clarity on which models are useful or best in which areas, and how that relates to their interest (flood risk, planning or forecasting) and scale (local community, catchment, national). Making aggregated comparison data open access, as with this paper, will assist in this process, but web visualisation tools should also be considered to communicate outcomes in a localised manner.
Forthcoming comparisons should include more models as they become available, and ideally include commercial models used by the insurance industry. As all the models are complex chains of sub-models, this leads to multiple parameters and challenges in calibration. Undertaking meaningful calibration of these models and quantifying uncertainty are seen as important next stages of development. Expanding the models to include pluvial and coastal flood risk is also considered an important aspect of future model development.
The GFP also has a very active flood observation community, and efforts are underway to collate benchmark datasets (e.g. satellite observations of flooding) for a more comprehensive validation against observed events. Increasingly, these models will be used for assessing the impacts of climate change on global flood risk, and recent attempts show increasing risk due to both greater flood hazard as well as growing exposure [3,4,7,19]. However, models will require credible skill at representing currently observed flooding before climate change impacts can be predicted with certainty. As models are improved, there is a parallel need to address scale and accuracy limitations in exposure and vulnerability datasets, which are used together with the flood model output for global scale risk assessments [34].
Future inter-comparisons should also extend beyond the outputs of the models and cover internal stages: model physics, estimated flows, return period estimation, processed DEMs, and river networks, all of which need improvement. For example, there is now a well-recognised and pressing need for global DEMs that improve on the relatively poor resolution and precision of the current datasets, as these significantly limit our ability to estimate flood inundation and risk [35,36].
Future inter-comparisons, and the data and methods developed, should be an open and transparent process. This will drive model improvements more rapidly and allow users to see how the models compare to others available, bringing increased credibility to global flood risk management efforts.