Evaluation and machine learning improvement of global hydrological model-based flood simulations

A warmer climate is expected to accelerate the global hydrological cycle, causing more intense precipitation and floods. Despite recent progress in global flood risk assessment, the accuracy of flood simulations based on global hydrological models (GHMs) remains insufficient for most applications. Here we compared flood simulations from five GHMs under the Inter-Sectoral Impact Model Intercomparison Project 2a (ISIMIP2a) protocol against flood indices calculated from 1032 gauging stations in the Global Streamflow Indices and Metadata Archive (GSIM) for the historical period 1971–2010. A machine learning approach, the long short-term memory network (LSTM), was then adopted to improve the GHMs-based flood simulations within a hybrid physics-machine learning framework, using basin-averaged daily mean air temperature, precipitation, wind speed and the daily discharge simulated by the GHMs-CaMa-Flood model chain as inputs of the LSTM, and observed daily discharge as the target. We found that the GHMs perform reasonably well in terms of the amplitude of peak discharge but relatively poorly in terms of its timing, with large discrepancies across climate zones. The large differences between GHM simulations and observations indicate that these simulations require improvement. The LSTM used in combination with the GHMs was then shown to drastically improve the performance of global flood simulations (especially the amplitude of peak discharge), suggesting that combining classical flood simulation with machine learning techniques might be a way forward for more robust and confident flood risk assessment.


Introduction
Flooding is one of the costliest natural disasters in the world [1], and has been projected to occur more frequently under global warming [2,3]. Reasonable and reliable projection of maximum discharge is critical for flood risk assessment and climate change adaptation. Global hydrological models (GHMs) are broadly adopted to assess the impacts of climate change on water availability and scarcity, droughts, flood hazard and risk, water resources management and planning [4][5][6], on the response of the global hydrological cycle to climate change, and for forecasting and examining the role of water in agricultural production [2,[7][8][9][10][11][12] at both global and regional scales. In recent decades, global institutions have increasingly used GHMs to make first-order global policy decisions on water-related hazards, because of the lack of applicable input data to support in-depth regional studies. Examples include the World Bank series on climate change and development [13], the IPCC assessment reports [14] and the Global Environmental Outlooks [15].
Nonetheless, the accuracy of flood simulations is rarely evaluated across GHMs under different socio-economic scenarios and climate zones. Existing studies have mostly focused on typical large basins, or selected stations according to the performance of discharge simulations [5,10,12,16]. To our knowledge, a general global evaluation of flood simulations is still missing, owing to the lack of observed global daily streamflow from which flood-related indices can be calculated. The recently released Global Streamflow Indices and Metadata Archive (GSIM) makes such an evaluation possible.
Machine learning has shown remarkable potential in the geosciences in recent years for various applications, such as land use change detection, precipitation prediction and bias correction of forecasts [17][18][19]. Deep learning methods have made revolutionary advances in sequential data modeling [20][21][22][23][24] and streamflow forecasting [25]. Recently, several studies have used the so-called hybrid modeling approach, coupling physical models with machine learning [20,21,26]. Conceptually, these models aim at leveraging the predictive power of physical models (generalization) while harvesting existing data to correct biases. Moreover, in terms of discharge simulation, GHMs without basin-specific calibration generally do not perform as well as specifically calibrated regional models [10]. In this study, we implemented a novel scheme combining GHMs, other existing data and a state-of-the-art machine learning approach that accounts for temporal correlations, the long short-term memory network (LSTM), to improve the performance of flood simulations at the global scale.
The objectives of this study are to: (1) evaluate the performance of GHM simulations in terms of flood-relevant indices under different hydro-climatological conditions over the historical period; (2) highlight the current strengths and limitations of GHMs-based global-scale flood simulations and (3) assess the potential of an LSTM combined with those GHMs to improve global-scale flood modeling. This paper is organized as follows: section 2 introduces the datasets and methods used in this study and the input datasets of the LSTM. Section 3 describes the detailed evaluation and LSTM improvement of the GHMs-based flood simulations, followed by conclusions in section 4.

Data and methods
To evaluate the flood-simulation performance of GHMs forced with observational climate data (1971-2010) within the Inter-Sectoral Impact Model Intercomparison Project 2a (ISIMIP2a), we processed 15 routing simulations with the Catchment-based Macro-scale Floodplain (CaMa-Flood) model [27,28], driven by daily runoff simulated by five GHMs (section 2.1) under three socio-economic conditions, i.e. naturalized conditions (NONSOC), present-day socio-economic conditions (PENSSOC) and dynamic socio-economic scenarios (VARSOC). We then compared five flood-relevant indices (describing the amplitude and timing of peak discharge, section 2.2) simulated offline by the GHMs and CaMa-Flood against those calculated from the GSIM dataset.

Evaluating GHMs-based flood simulation against GSIM data
We used daily runoff simulations (0.5°×0.5°) for the period 1971-2010 from five GHMs: DBH [29], H08 [30,31], LPJML [32], PCR-GLOBWB [33,34] and VIC [35] (the general characteristics and configurations of these models are given in supplementary table S1, available online at stacks.iop.org/ERL/14/114027/mmedia and www.isimip.org/impactmodels/) under the ISIMIP2a protocol (www.isimip.org/). Briefly, ISIMIP2a offers a framework for projecting the impacts of climate change consistently across different impact sectors and spatial scales, and serves as a basis for model evaluation and improvement, allowing for better estimates of the biophysical and socio-economic impacts of climate change at different levels of global warming. The GHMs used here were run under the VARSOC conditions, which account for time-varying (historical) dam construction, water abstraction and land-use changes. We then compared the outputs with GHM simulations under the NONSOC (no human impacts, such as dams and water abstractions, on river flow) and PENSSOC (socio-economic conditions fixed at 2005 levels) conditions to test the sensitivity of our main findings to the implementation of direct human influences.
After obtaining daily runoff data from the GHMs (through https://esg.pik-potsdam.de/search/isimip/), we applied the CaMa-Flood model as the river routing approach to simulate basin-wide daily streamflow. CaMa-Flood is currently the only open-source global river model available (http://hydro.iis.u-tokyo.ac.jp/~yamadai/cama-flood/index.html) that is capable of simulating the backwater effect within a reasonable computational time while providing flooded area, flooded volume and flood depth. It has been widely used in flood assessment and future projection of global flood risk [12,16,36].
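CaMa-Flood itself routes runoff along a global river network with floodplain dynamics; as a much simpler illustration of what a routing step does conceptually, the toy sketch below routes basin-aggregated daily runoff through a single linear reservoir. This is emphatically NOT the CaMa-Flood scheme, and the residence time `k` is an arbitrary illustrative value.

```python
import numpy as np

def route_linear_reservoir(runoff, k=5.0, dt=1.0):
    """Toy river-routing step (NOT CaMa-Flood): treat the basin as one linear
    reservoir with residence time k (days), so daily outflow Q = S / k.
    `runoff` is daily basin-aggregated runoff; returns daily discharge."""
    storage, q = 0.0, []
    for r in runoff:
        storage += r * dt        # add the day's runoff to channel storage
        outflow = storage / k    # linear-reservoir release
        storage -= outflow * dt  # drain the released volume
        q.append(outflow)
    return np.asarray(q)
```

A single runoff pulse produces the characteristic exponential recession of discharge, which is the behaviour the 30 d lag window of the LSTM (section 2.3) is meant to capture.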
It should be noted that none of these GHMs-CaMa-Flood model chains has been calibrated against observed discharge, which allows us to use GSIM as an independent benchmark to evaluate their performance in flood simulation. The GSIM dataset (available from https://doi.pangaea.de/10.1594/PANGAEA.887470) is a collection of discharge indices (i.e. mean daily streamflow, standard deviation, coefficient of variation, percentiles and the interquartile range of streamflow at monthly, seasonal and yearly scales) calculated from daily streamflow observations at more than 35 000 stations globally [37,38], and is therefore well suited to evaluating the performance of flood simulations. We selected 1032 catchments from the GSIM dataset according to the following criteria:

1. To avoid under-sampling, only catchments with at least 10 years of complete and continuous observations during the period 1971-2010 were chosen.

2. A minimum catchment size of 9000 km² (approximately four grid cells) and a distance between the station and the nearest grid line of less than 5% of the grid edge length were required, to omit catchments whose hydrological processes cannot be correctly represented by GHMs operating at 0.5°.
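The two selection criteria above can be expressed as a simple filter over station metadata. The sketch below is illustrative only: the record structure and field names (`complete_years`, `area_km2`, `grid_offset_frac`) are hypothetical, not taken from the GSIM metadata schema.

```python
# Thresholds taken from the selection criteria in the text.
MIN_YEARS = 10           # complete, continuous years within 1971-2010
MIN_AREA_KM2 = 9000.0    # roughly four 0.5-degree grid cells
MAX_OFFSET_FRAC = 0.05   # station-to-grid-line distance, fraction of grid edge

def keep_station(meta):
    """Return True if a station record (a dict with hypothetical keys)
    passes both selection criteria used in this study."""
    return (meta["complete_years"] >= MIN_YEARS
            and meta["area_km2"] >= MIN_AREA_KM2
            and meta["grid_offset_frac"] < MAX_OFFSET_FRAC)

stations = [
    {"id": "A", "complete_years": 12, "area_km2": 15000.0, "grid_offset_frac": 0.01},
    {"id": "B", "complete_years": 8,  "area_km2": 15000.0, "grid_offset_frac": 0.01},
]
selected = [s["id"] for s in stations if keep_station(s)]  # only "A" survives
```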
To assess the main properties of flood amplitude and timing, five indices were selected in this study: annual maximum discharge, annual maximum 7 d averaged discharge, their corresponding days of year, and the annual 90th-percentile maximum discharge from the yearly part of the GSIM dataset.

Evaluation metrics
To evaluate the GHMs-based flood simulations, three metrics that have been widely used to evaluate the performance of hydrological models in streamflow simulation [10,39] were selected: relative error (RE), Nash-Sutcliffe efficiency (NSE) and Pearson's correlation coefficient (PCC). RE = 0 indicates perfect agreement between the observed and simulated streamflow volumes. NSE (ranging from −∞ to 1) measures the residual variance (noise) relative to the variance of the measured data, where NSE = 1 indicates a perfect match. PCC takes values between −1 and +1, where 1 indicates total positive linear correlation, 0 no linear correlation, and −1 total negative linear correlation.
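The three metrics can be computed as follows. The exact RE formulation used in the study is not given in the text; here we assume the common form, the percentage bias of mean simulated relative to mean observed discharge.

```python
import numpy as np

def relative_error(obs, sim):
    """RE (assumed form): percentage bias of mean simulated vs. observed
    discharge; RE = 0 means the simulated volume matches the observed one."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 100.0 * (sim.mean() - obs.mean()) / obs.mean()

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus residual variance over observed
    variance; ranges from -inf to 1, with NSE = 1 a perfect match."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def pcc(obs, sim):
    """Pearson correlation coefficient between observed and simulated series."""
    return float(np.corrcoef(obs, sim)[0, 1])
```

Note that the three metrics are complementary: PCC ignores bias and scale, RE ignores timing, and NSE penalizes both.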

Improving GHMs-based flood simulations through LSTM
To further improve the GHMs-based flood simulations, we explored the potential of combining GHMs with a state-of-the-art machine learning approach, the LSTM. A feedforward neural network usually comprises an input layer, a few hidden layers and an output layer. A physics-guided neural network (PGNN) leverages the output of physics-based model simulations, along with observational features, to generate predictions using a neural network architecture [20,26]. Here, we put forward a general PGNN method that combines knowledge-based GHMs with the LSTM to build a hybrid simulation scheme. Briefly, the LSTM is a special recurrent neural network that is fed not only the input data but also the states of its hidden neurons from previous time steps; this simplifies handling the temporal lag and autocorrelation present in discharge series and avoids vanishing gradients [40]. We therefore combined the GHMs-CaMa-Flood model chain with the LSTM to improve the performance of flood simulations in this study.
Given the large spatial heterogeneity among basins, we applied the LSTM at each selected station (at the daily time scale) during 1979-2010 (a period constrained by the availability of meteorological data). We used basin-averaged daily mean air temperature, precipitation, wind speed and the simulated daily discharge (corresponding to the selected gauging stations) from the GHMs-CaMa-Flood model chain as inputs of the LSTM, and observed daily discharge as the target. Specifically, we chose simulations from the VIC-CaMa-Flood model chain as the input of the LSTM to simplify the calculations, after noticing the similar performance of the five GHMs (section 3.2). Considering a typical e-folding (recession) time scale of streamflow, we set the time lag to 30 d (both forward and backward). We then used the first half of the time series at each station as training samples (comprising several complete years; unlike the inputs of regular feedforward networks, these samples must be temporally continuous), and the second half as testing samples. For example, for a station with complete observations in 1981-2000, the observations in 1981-1990 would be used for training and those in 1991-2000 for testing.
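One plausible reading of the sample construction above (the paper does not give the exact windowing) is that each training sample pairs the observed discharge on day t with a 61 d feature window spanning 30 d backward and 30 d forward of t. A minimal sketch of that assumption, plus the half-and-half split:

```python
import numpy as np

LAG = 30  # days of context, both forward and backward of the target day

def make_samples(features, q_obs, lag=LAG):
    """Build (sample, time, feature) windows of length 2*lag+1 centred on
    each day t, paired with observed discharge on day t. `features` has
    shape (n_days, n_features); days too close to either end are dropped."""
    n_days = features.shape[0]
    x, y = [], []
    for t in range(lag, n_days - lag):
        x.append(features[t - lag: t + lag + 1])  # backward + forward context
        y.append(q_obs[t])
    return np.stack(x), np.asarray(y)

def split_half(x, y):
    """First half of the record for training, second half for testing."""
    mid = len(y) // 2
    return (x[:mid], y[:mid]), (x[mid:], y[mid:])
```

The resulting `x` array (samples × 61 time steps × 4 features: temperature, precipitation, wind speed, simulated discharge) is the shape an LSTM layer would consume.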
Observed daily streamflow data were obtained from the Global Runoff Data Centre (GRDC, 56068 Koblenz, Germany) and the U.S. Geological Survey (USGS, https://waterdata.usgs.gov/nwis). Following the criteria used for GSIM in section 2.1, 2110 stations were selected; for better spatial coverage, a minimum record length of 5 years instead of 10 years was adopted in this section. Basin-averaged daily precipitation was derived from the Multi-Source Weighted-Ensemble Precipitation dataset (MSWEP, [41]), and basin-averaged daily mean air temperature and wind speed were derived from the ERA-Interim dataset ([42], available from www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era-interim).

The GHMs-based simulations perform reasonably well between 30°N and 30°S, but rather poorly beyond 60°N. This might be related to the relatively low spatial resolution of the terrain data used beyond 60°N for generating the routing directions within the CaMa-Flood model, which constrains the accuracy of simulations in boreal regions [12]. Moreover, the low Pearson correlation coefficients in high-latitude and high-elevation basins (for the day of year with maximum discharge; figures S4 and S5) indicate that difficulties in reproducing snowmelt and soil freezing could be other sources of error [5].
The simulations of peak discharge timing performed more poorly than those of peak discharge amplitude. The mean PCCs over all selected stations are 0.41, 0.45, 0.53, 0.24 and 0.25 for maximum discharge, maximum 7 d averaged discharge, 90th-percentile maximum discharge, day of year (1-366) with maximum discharge and day of year (1-366) with maximum 7 d averaged discharge, respectively. Overall, the simulation of peak discharge by the GHMs-CaMa-Flood model chain was unsatisfactory.

Evaluation of GHMs-based flood simulation under different climate zones
We then examined the GHMs-based flood simulations under different climate zones (Köppen-Geiger climate classification) and socio-economic scenarios (VARSOC, NONSOC and PENSSOC). The results showed great discrepancies between models across climate zones (figure 2). In basins located in the BWh (hot desert) and Dwa (cold, dry winter, hot summer) climate zones, all GHMs performed satisfactorily except the DBH model. In basins located in the Dwd (cold, dry and very cold winter) climate zone, the LPJML model performed poorly. Most models performed poorly in basins located in the Dwb (cold, dry winter, warm summer) climate zone, except the H08 and PCR-GLOBWB models. The VIC model showed the best performance in basins located in the Dsb (cold, dry and warm summer) and Dfd (cold, very cold winter) climate zones. Overall, the performance of the five GHMs varied greatly across climate zones and flood-relevant indices [5], but differed little among the socio-economic scenarios (figure S1).
The Taylor diagram is a useful tool for multi-model evaluation [44]. Using Taylor diagrams, we compared the performance of the five GHMs in terms of the five multi-station-averaged flood-relevant indices (section 2.2). For annual maximum discharge, annual maximum 7 d averaged discharge and annual 90th-percentile maximum discharge, the PCR-GLOBWB and VIC models performed relatively well. For the days of year with maximum and maximum 7 d averaged discharge, the DBH, PCR-GLOBWB and VIC models performed better. The GHMs' performance varied considerably across river basins and flood-relevant indices, owing to their different process representations and spatially generalized parameters [45]. The large inter-model differences in seasonal variability might be due to different descriptions of human impacts and regulation in the GHMs [5,46].
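A Taylor diagram summarizes each model against observations with three linked statistics: the Pearson correlation, the standard deviation (often normalized by the observed one), and the centred root-mean-square error. A minimal computation of these statistics (whether the study normalized by the observed standard deviation is our assumption):

```python
import numpy as np

def taylor_stats(obs, sim):
    """Statistics plotted on a Taylor diagram: Pearson correlation, the
    ratio of simulated to observed standard deviation, and the centred
    RMS error (bias removed from both series)."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    r = float(np.corrcoef(obs, sim)[0, 1])
    sigma_ratio = float(sim.std() / obs.std())
    crmse = float(np.sqrt(np.mean(((sim - sim.mean()) - (obs - obs.mean())) ** 2)))
    return r, sigma_ratio, crmse
```

The three quantities satisfy the law of cosines, crmse² = σ_obs² + σ_sim² − 2 σ_obs σ_sim r, which is what lets a single 2D diagram display all of them at once.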
In general, the uncertainties in GHMs-based flood simulations might come from: (1) biases in the climate forcing data used as inputs to the GHMs; (2) errors in describing hydrological processes (the representation of hydrological processes and the discrepancy between modelled and real processes) at the grid-cell scale; and (3) limitations of the river routing approach, such as the model parameters related to the physical routing process [10,11,[46][47][48]. The large variations in the GHMs' performance across basins, climate zones and flood-relevant indices call for a careful selection of GHMs and improved techniques for future flood simulations.
Improvement of GHMs-based flood simulations using machine learning

Given the temporal lag between basin-scale peak discharge and historical input data (precipitation, temperature) [49], we chose an LSTM to improve the performance of the GHMs-based streamflow simulations. We used basin-averaged daily mean air temperature, precipitation, wind speed and the simulated daily discharge (corresponding to the selected gauging stations) from the VIC-CaMa-Flood model chain as inputs of the LSTM.
We then compared the Nash-Sutcliffe efficiency (NSE) and relative error (RE) of the VIC-driven CaMa-Flood streamflow simulations and of the LSTM outputs against the daily discharge observations. The NSE and RE of daily streamflow (figure 4) show significant improvements for all basins with the physics-guided LSTM, which improved the median NSE across all stations from −0.54 to 0.49 and the median RE from −14.39 to −7.52.
The VIC-CaMa-Flood model chain (figure 5) clearly overestimated maximum discharge, especially in the lower-amplitude range. The timings of peak discharge were scattered relative to the observations (the DOYMAX and DOYMAX7 panels in figure 5), indicating the poor performance of the VIC-CaMa-Flood model chain for the timing of peak discharge; Liu et al [49] and Zhao et al [12] also highlighted the poor performance of GHMs in this respect. The combination of the LSTM and the VIC-CaMa-Flood model chain (PGNN) drastically improved the accuracy of the five flood-relevant indices. Specifically, the physics-guided LSTM improved R² from 0.27 to 0.95 for annual maximum discharge, from 0.30 to 0.98 for annual maximum 7 d averaged discharge, from 0.29 to 0.99 for annual 90th-percentile maximum discharge, from 0.17 to 0.24 for the day of year with maximum discharge, and from 0.19 to 0.27 for the day of year with maximum 7 d averaged discharge.
Yet, despite these dramatic improvements, the LSTM could not significantly improve the timing of peak discharge, even with the time lag set to 30 d (both forward and backward). This might be because errors in simulating the processes to which peak discharge is sensitive, such as extreme precipitation, snowmelt (or soil freezing) and soil moisture excess [50], cannot be entirely corrected by the LSTM. Moreover, biases in the climate forcing data, the representation of real hydrological processes and the limitations of the routing scheme could further limit the capacity to predict the correct timing [10,11,46,51]. Thus, improving the timing might depend more on the traditional GHMs-CaMa-Flood model chain than on tools like the LSTM used in this study, e.g. through higher temporal/spatial resolution for both the hydrological and routing processes, and higher-precision forcing data.
Based on the meteorological data (precipitation, temperature, wind speed and day of year used in this study), we compared the performance of the physics-guided LSTM with that of a pure LSTM. The results (figure S2) showed clearly better R² values for the physics-guided LSTM (used as the PGNN here) than for the pure LSTM for all five selected indices; for annual maximum discharge, for example, R² is 0.95 for the PGNN, compared with 0.86 for a pure LSTM driven only by meteorological data. To disentangle the contributions of the meteorological data and the simulated discharge, we also ran an LSTM using only the simulated discharge as input (figure S3 in the supplementary material). Compared with figure S2 (meteorological data only), the higher R² with discharge-only input implies the dominant role of the simulated discharge in the performance of the physics-guided LSTM. These results indicate that the PGNN relies on both the VIC-CaMa-Flood simulations and the meteorological data, and that the combination of the LSTM and VIC-CaMa-Flood gives the best performance in flood simulation (the highest R²).
Our analyses show that the LSTM can correct the biases of streamflow simulations by harvesting existing data. Combined with present discharge observations, it could be a powerful tool for more robust and confident flood risk assessment. Moreover, the combination of the LSTM and the GHMs-CaMa-Flood model chain developed in this study might serve as a useful tool for projecting future flood disasters under climate change.

Summary and conclusions
In this study, we performed a global evaluation of flood simulations from five GHMs against flood-relevant indices calculated from the GSIM dataset, and made the first attempt to improve GHMs-based flood simulations using a hybrid physics-guided machine learning approach (LSTM). The following conclusions were drawn: (1) The hybrid physics-guided neural network used in this study demonstrated excellent performance in global flood simulation, implying that the combination of classical flood simulations and machine learning techniques might be a powerful tool for more robust and confident flood risk assessment.
(2) The large variations in the GHMs' performance across basins, climate zones and flood-relevant indices call for a careful selection of GHMs and improved techniques for future flood simulations.

Acknowledgments

We appreciate the editor and two anonymous reviewers for their constructive and insightful comments.