The critical role of the routing scheme in simulating peak river discharge in global hydrological models

Global hydrological models (GHMs) have been applied to assess global flood hazards, but their capacity to capture the timing and amplitude of peak river discharge—which is crucial in flood simulations—has traditionally not been the focus of examination. Here we evaluate to what degree the choice of river routing scheme affects simulations of peak discharge and may help to provide better agreement with observations. To this end we use runoff and discharge simulations of nine GHMs forced by observational climate data (1971–2010) within the ISIMIP2a project. The runoff simulations were used as input for the global river routing model CaMa-Flood. The simulated daily discharge was compared to the discharge generated by each GHM using its native river routing scheme. For each GHM both versions of simulated discharge were compared to monthly and daily discharge observations from 1701 GRDC stations as a benchmark. CaMa-Flood routing shows a general reduction of peak river discharge and a delay of about two to three weeks in its occurrence, likely induced by the buffering capacity of floodplain reservoirs. For a majority of river basins, discharge produced by CaMa-Flood resulted in a better agreement with observations. In particular, maximum daily discharge was adjusted, with a multi-model averaged reduction in bias over about 2/3 of the analysed basin area. The increase in agreement was obtained in both managed and near-natural basins. Overall, this study demonstrates the importance of routing scheme choice in peak discharge simulation, where CaMa-Flood routing accounts for floodplain storage and backwater effects that are not represented in most GHMs. Our study provides important hints that an explicit parameterisation of these processes may be essential in future impact studies.


Introduction
Fluvial flooding is one of the most costly natural disasters in the world, claiming thousands of lives and causing billions of dollars in losses globally each year (Munich Re 2016). Most of the losses occur in Asia and Africa, where the frequency of flooding events is projected to increase under a warmer climate (Hirabayashi et al 2013, Dankers et al 2014, Winsemius et al 2015, Arnell and Gosling 2016. Fluvial flood occurs when river discharge exceeds the maximum flowing capacity at a river section. Accurate assessment and realistic projection of peak discharge is critical for risk assessment and adapting to climate change and designing flood protection infrastructures. Hydrological modelling has been a useful tool to provide flood projections under various climate scenarios (Lehner et al 2006, Prudhomme et al 2014. However, without a calibration for individual river basins global grid-based hydrological models (GHMs) generally perform worse in reproducing the observed discharge than calibrated regional models even at monthly timescale (Hattermann et al 2017, Veldkamp et al 2017. A first reason is the coarse spatial resolution and uncertainties in global precipitation datasets (Fekete et al 2004, Voisin et al 2008, Biemans et al 2009, Wanders and Wada 2015, Müller Schmied et al 2016. A second reason is the uncertainty related to the land surface model structure, namely the representation of local hydrological processes at the scale of grid cells, such as the partitioning of precipitation into evaporation and runoff, and representation of snow and groundwater (Haddeland et al 2011). A third factor is the routing of the simulated gridcell runoff through the river network, which is a crucial step in simulating river discharge, a variable directly comparable to numerous observations. This physical process is commonly referred to as river routing. Although the effects of river routing scheme including the representation of drainage area (Alkama et al 2011) and flow velocity (Decharme et al 2010) are considered to be important for simulating realistic river discharge at global scale, there is thus far no comprehensive multimodel evaluation of the sensitivity of simulated discharge, especially of peak discharge, to using different routing schemes.
Routing schemes as a component of land surface models (part of Earth system models) were primarily developed to close the hydrological cycle and compute freshwater input into the ocean, which is important for ocean convection and circulation in atmosphereocean models (Manabe and Stouffer 1988). The first models of that kind were capable of simulating a rough water balance, by adopting strong simplifications of the fluvial hydrodynamic equations, often with linear reservoirs and spatially or temporally constant water transfer velocity (Liston et al 1994, Miller et al 1994, Sausen et al 1994, Coe 1997, 2000, Oki 1999). Improved schemes now better simulate flow velocity, e.g. by using a kinematic wave equation combined with the Manning's equation (Arora and Boer 1999, Oki 2003, Decharme et al 2010. While these simplified schemes are theoretically capable of representing flood wave delay and attenuation, they often lack water pressure gradient simulations and some of them do not account for processes such as floodplain storage (represented and shown to be important in e.g. Decharme et al 2008, Dadson et al 2010 and backwater effect (Meade et al 1991), thereby limiting the accuracy of river discharge especially in flat areas and during flooding episodes. The incorporation of routing schemes in GHMs has also allowed for assessments of future changes in hydrological regimes under different climate scenarios from an ensemble of models (e.g. Schewe et al 2014). In a continuous process of improvements of routing scheme in some GHMs, simulations have benefited from an increased spatial resolution, in both the topographical inputs and routing algorithm (e.g. Ducharne et al 2003, David et al 2011, Stacke and Hagemann 2012a, Lehner and Grill 2013 and improved hydrologically conditioned digital elevation models at high resolution, such as HydroSHEDS (Lehner et al 2008). This trend reduces the deviation from regional-scale hydraulic models, which have been progressively applied to simulate flood propagation in river systems of increasing size building on simplified or the full Saint-Venant equations (Rudorff et al 2014, Paiva et al 2013. Recently, efficient and realistic global hydrodynamic simulations were made possible by the CaMa-Flood global river routing model, which relies on the HydroSHEDS topography and simulates floodplain dynamics and backwater effects by explicitly solving the diffusive wave equation (Yamazaki et al 2011), thus providing the potential for improved simulations of daily river discharge than GHMs' native routing schemes, especially for high flow simulations. CaMa-Flood is designed to use gridded runoff output from GHMs as forcing data and also allows for the calculation of flooded areas, flood depth and flood volume critical for global flood risk assessments (e.g. Hirabayashi et al 2013).
As mentioned above, differences between simulated and observed discharge could in principle stem Environ. Res. Lett. 12 (2017) 075003 from (1) biases in the observed climate input data, (2) errors in the representation of runoff, and (3) limitations in the derivation of discharge. Here, we focus on an evaluation of the last component, and to reduce the influence of the first two components we (1) consider all of a large set of river basins with very different climatic conditions and potential bias of the observational climate data and (2) use runoff simulations from nine different hydrological models to test for a systematic increase of agreement across a wide range of runoff conditions. However, we can only reduce but not exclude the potential for error compensations that has for instance been found in a study evaluating the implementation of variable flow velocity scheme in a global river routing model TRIP 2.0 (Ngo-Duc et al 2007). So far, evaluations did not focus on daily maximum discharge, which is more relevant to flooding. Further, these evaluations only used a single GHM, and there has been no comprehensive, multi-model analysis examining the influence of a different routing scheme on the performance of peak discharge simulation down to daily scale. We perform simulations with CaMa-Flood to produce monthly and daily river discharge, driven by daily runoff from nine GHMs in a naturalized (e.g. without anthropogenic disturbances to the water cycle) experiment under the ISIMIP2a protocol (www.isimip.org). The GHMs additionally simulated monthly and daily river discharge from their own river routing schemes. Both native routing and CaMa-Flood routing discharge are then compared to in situ discharge observations worldwide. These comparisons test (1) for which models, and (2) in which regions CaMa-Flood routing brings discharge closer to observation compared to GHM's native routing. In that, evaluation metrics for the performance of the hydrological models focus on the amplitude (maximum discharge) and timing (correlation) of river discharge, both relevant for flood risk assessment. Representation of (1) flow velocities, (2) floodplain storage and (3) ground water storage are expected to be critical for the simulation of peak discharge. CaMa-Flood builds on a detailed implementation of the first two processes going beyond the simplified implementations included in the other routing schemes. However, CaMa-Flood itself does not represent ground water storage-a process that is included in the runoff projections by three of the GHMs. As direct anthropogenic disturbances such as dams and reservoirs also influence observed discharge, which are not represented in the model simulations, we test the agreement between observations and simulations separately for heavily managed basins and nearnatural ones.

Data and methods
Nine offline routing simulations were performed with CaMa-Flood driven by daily runoff data from nine GHMs. The simulated daily and monthly discharges were then evaluated at 1701 globally distributed discharge gauges, using the data from the Global Runoff Data Centre (GRDC, 56068 Koblenz, Germany). Additionally, each GHM used in this study also provides daily discharge output from its own river routing scheme, which in general has simpler physics compared to CaMa-Flood. This group of discharge output was also evaluated against observations using the same performance metrics. For each GHM and at every discharge gauge location, the difference in performance metrics between the CaMa-Flood simulated discharge and the GHM's own discharge simulations were analysed. The following sections discuss the data and methods used here in detail. Supplementary table S2 (available online at stacks. iop.org/ERL/12/075003/mmedia) lists some general model characteristics regarding both the water and energy budget formulations, and routing schemes of the various GHMs, which mostly rely on linear reservoir algorithms. In the ISMIP2a framework, they all used the same 0.5°drainage direction map (DDM30, Döll and Lehner 2002), with the exception of CLM and PCR-GLOBWB. Note that the routing model of the MPI-HM model includes a wetland scheme with floodplains (Stacke and Hagemann 2012), and in order to have a consistent drainage direction map, the routing model of the MPI-HM model was also used to route the runoff from the ORCHIDEE model.
Except for WaterGAP2, none of the other GHMs has been calibrated with observed discharge data, allowing us to use GRDC observations as a relatively independent benchmark. As CaMa-Flood is not calibrated either, the WaterGAP group conducted additional no calibration (nc) runs (referred to as WaterGAP2nc) in order to allow a consistent comparison. The comparison procedure was conducted for both model versions, but only Water-GAP2nc was used for ensemble statistics and most of Environ. Res. Lett. 12 (2017) 075003 the discussion. However, evaluating the calibrated version allows us to present the effect of model calibration compared to the effect of river routing.
GHMs used in this study were all run under so called naturalized conditions (referred to as 'NOSOC' in the ISIMIP2a protocol), meaning that no human impacts, such as dams and water abstractions on river flow were considered. This facilitates a consistent comparison focusing on the effect of routing scheme, given that CaMa-Flood does not account for human regulation of rivers. Also, for peak daily flow, a previous study showed that for some major basins, the shape of the hydrograph is not significantly different between natural and human impact experiments (Pokhrel et al 2012). Comparing the effect of human impacts on peak discharge is difficult at the current stage since not all the GHMs include reservoir operations; also, the rules of reservoir operations (when they are included) is substantially different among GHMs (Masaki et al 2017). However, we also used the 'VARSOC' simulations generated within the ISIMIP2a intercomparison to test the sensitivity of our main findings on the implementation of direct human influences. In the 'VARSOC' setting modellers were asked to account for time-varying dam construction, water abstraction and land use changes if their model offers the capacity to do so.
Under the ISIMIP protocol, runoff is defined as the sum of local surface and subsurface runoff within each grid box. Therefore, when routing the runoff along rivers with CaMa-Flood (see section 2.2), precipitation and evaporation over rivers or lakes are not included in the resulting discharge, while they can in some GHMs. Runoff should not be confused with river discharge, which is the result of routing gridded runoff along the river network. The latter is also generated with the same drainage direction map (DDM30, Döll and Lehner 2002) by the GHMs. The sub-grid topography of river channel and floodplain storage is explicitly parameterized in each unit-catchment corresponding to each grid box, and water level is assumed to be uniform within each unit-catchment (figure S5). Currently CaMa-Flood is the only open-source global river model available (http://hydro.iis.u-tokyo.ac.jp/ ∼yamadai/cama-flood/) that is capable of simulating backwater effects with a reasonable computation time (7 hours on a 3.2 GHz Intel Xeon E5-2667 v3 processor for one single 40 year simulation), making it a popular choice in studies on the future projection of global flood risk (Pappenberger et al 2012, Hirabayashi et al 2013, Koirala et al 2014. Similar models such as ISBA-TRIP  and HyMAP (Getirana et al 2012) have other advantages such as differential time delays for surface and subsurface runoff, however they cannot simulate the backwater effect; their performance could be assessed in future studies.

Simulating river discharge with CaMa-Flood
In this study we used the default setup of CaMa-Flood version 3.4.4 following a common model spin-up procedure of five times repeating the first year (1971) of input daily runoff forcing. An adaptive time step scheme was used in the simulations, leading to a time step of about 10 minutes. The runoff data from the nine GHMs were finally spatially interpolated, conserving mass to the native resolution of CaMa-Flood (0.25°), which is twice as fine as the original GHM spatial resolution.

Comparison to the observed river discharge
Observed river discharge dataset from GRDC consists of more than 9000 stations in 160 countries, with an average record length of 42 years. We made a selection from this dataset according to the following criteria: 1. A minimum of 5 year coverage during the validation period of 1971-2010. While the data do not need to be continuous, we only use years with less than 10 missing days to avoid under-sampling any month. A threshold of 5 years instead of 10 years is adopted to allow for better spatial coverage, enabling assessment at some important basins such as Indus and Ganges in Asia, and Po in Europe, although a 10 year threshold would be more robust for inter-annual variations. Analyses using stations with a minimum of 10 year coverage show similar results (not shown), nevertheless.

2.
A minimum catchment size of 9000 km 2 to omit catchments whose hydrological processes are not correctly represented by GHMs operating at 0.5°( Hunger and Döll 2008).
Using the MIRCA2000 irrigated crop area dataset (Portmann et al 2010) and the Global Reservoir and Dam (GRanD) Database (Lehner et al 2011), we separated GRDC stations with (near-)natural flow regimes from stations with highly managed flow, based on catchments characteristics. A catchment is classified as (near-) natural if its area subject to irrigation is Environ. Res. Lett. 12 (2017) 075003 <2%, and the total reservoir capacity in the catchment is <10% of its long-term mean annual discharge.
As detailed in section 2.4, we examined basinaverage performance metrics, in addition to metrics at individual stations to avoid statistical effects from outliers; we chose not to focus on outlet stations, because for flood applications, it is important to have a reasonable representation of high flow at every stage of the river. We also analysed only stations located within major basins worldwide according to the DDM30 network. Grid cells in the DDM30 and CaMa-Flood's network corresponding to each GRDC station were determined through a semi-automatic procedure described in supplementary text C. In total, this selection procedure led to a set of 1701 stations with monthly data (figure 1) and 1205 stations with daily data. These stations belong to 198 (170 for stations with daily data) out of 321 major basins according to DDM30. These 198 basins sampled by at least one selected station cover 70 million km 2 (47% of global land area) area and host 3.8 billion people (55% of 2010 global total according to Gridded Population of the World, version 4, http://sedac.ciesin.columbia. edu/). Of all the selected GRDC stations, 538 (526 for daily) were identified as managed (i.e. having a managed upstream area), and 1130 (646 for daily) as (near-) natural, and the rest has no information regarding upstream area management.
While most models used the DDM30 drainage direction dataset provided in ISIMIP, different networks were used in CLM and PCR-GLOBWB (the latter use an independently adjusted version of DDM30). We nevertheless include them in the analysis for the purpose of maximizing the sample size while minimizing the impact of compromised consistency. For these two models we followed a similar procedure as with DDM30, keeping only stations with no more than 30% (approximately one grid cell for the smallest catchment) difference in upstream area with GRDC, and applying manual correction separately. This led to about 10% (40%) less corresponding cells selected for PCR-GLOBWB (CLM), respectively. As a result, for some stations, the ensemble statistics at station level contain only seven or eight GHMs. We checked that ensemble results excluding these two models are similar to the full ensemble, to make sure the different station sampling does not alter the conclusions of this study.

Performance metrics
Several metrics were computed at every station to assess the performance of GHM simulated discharge (D H ) and discharge using CaMa-Flood forced by GHM-runoff (D C ) versus GRDC observed daily and monthly discharge (D O ). The metrics include Pearson's correlation and percent biases of mean and standard deviation, as well as Nash-Sutcliffe Efficiency (NSE) for overall performance. We also included percent bias of multi-year maximum and mean annual maximum discharge as indicators of models' ability to simulate extreme discharge. Detailed description for the metrics is given in table 1.
After computing the metrics for pairs of GHM vs. observation and CaMa-Flood vs. observation, difference of each metric was assessed at GRDC stations. We consider NSE < 0, R < 0 and percent biases >100% as poor skill, and before computing the differences or basin-average metrics, we first capped NSE and R at 0 and the absolute value of percent biases at 100% (for example, change of NSE from À100 to À1 would be considered no change in performance, as both are poor). With this, performance difference at any station is limited within À1 and 1, preventing a single station to have a large effect in basin-average metrics. Environ. Res. Lett. 12 (2017) 075003 Although this capping does not allow for assessing differences between GHM and CaMa-Flood at very poorly performing stations (22% of the stations do not contribute to NSE difference due to capping; for other metrics less than 3% stations are effectively excluded in computing the differences), it is unlikely to have a large impact in model applications if discharge is poorly simulated in both cases; e.g. NSE < 0 suggests the model skill is worse than using observed mean (Krause et al 2005).

Number of GRDC stations
Based on the location of the stations in the 198 major river basins, we then computed the average performance metrics for each GHM and the performance difference between GHM and CaMa-Flood over every river basin. We considered the stations as samples of the rivers and computed the basin averages without weighting, and it is worth noting that the sampling may not be representative especially in data sparse basins. Multi-model ensemble mean of the performance differences were also calculated, for daily and monthly values respectively. For biases, where values close to 0 indicate good agreement with observation, performance difference was computed based on absolute values (table 1), so that any positive difference indicates an improvement. Finally, multimodel mean and best-model performance metrics were computed to show where the models can simulate a reasonable discharge compared to the GRDC observations.
Additionally, before comparing to observations, we examined how daily river discharge simulated by CaMa-Flood differs from the GHMs' original discharge at the location of selected GRDC stations, both in timing and amplitude. In order to filter out interannual variability in the timing of annual maximum daily discharge, multi-year mean daily discharge for each of the 365 days (excluding February 29 from data) were computed to produce mean daily hydrograph during 1971-2010 (referred to as climatological daily discharge) at each station. To define a metrics for timing differences between discharge simulated by CaMa-Flood and by the GHMs' native routing schemes, differences in number of days on when the simulated climatological discharge maximum occurs in the mean daily hydrograph between CaMa-Flood and GHM were calculated for each GRDC station. Numbers are shifted by a year, if necessary, to yield differences of no more than half a year. At some stations, small upstream area or short record length may cause an unrealistic time lag between the two maxima in some models. To prevent this from affecting the model ensemble mean, all absolute difference values greater than 150 days were excluded when computing the ensemble mean. Regarding amplitude we simply compared the multi-year mean annual maximum discharges based on CaMa-Flood and GHMs' native routine schemes. The results are reported in section 3.1.

Results
3.1. Effect of CaMa-Flood routing on the timing and amplitude of the simulated discharge For the majority of GHMs, CaMa-Flood routing generally produces a delayed climatological daily maximum discharge (figure 2) and a reduction in the amplitude of maximum river discharge at most stations (figure 3). Time series for individual river basins are shown in the supplementary data (figures S9ÀS11). Table 1. Description of performance metrics examined. D H,C represents discharge from GHM model output (D H ) or CaMa-Flood (D C ), computed separately with the same equations; D O is GRDC-observed monthly or daily discharge. Time is represented by t (month or day) with a station-dependent step of T; N represents station-dependent record length in years. For BMAX, max indicates maximum discharge of the whole time series. For BMYM, max i indicates annual maximum in year i. Overbar represents mean value of the whole time series.

Abbreviation
Name Calculation procedure Performance difference NSE Nash-Sutcliffe Efficiency Both effects are likely due to floodplain processes being present in CaMa-Flood and not in most of the GHMs examined (note that LPJmL and ORCHIDEE both have a version simulating floodplain processes but not used in ISIMIP simulations). The mean delay with CaMa-Flood is 18 days over all stations (with a large standard deviation of 24 days due to site specific characteristics), and the multi-year mean annual maximum discharge is 22% lower than in original GHMs. It is opposite for two of the models (ORCHIDEE and MPI-HM), where dynamic wetland processes, which include floodplain interactions, are already part of the native routing schemes (Stacke and Hagemann 2012), causing similar effects on timing and even stronger amplitude reduction of the peak discharge at many stations compared to CaMa-Flood's floodplain mechanism.

Change in performance of hydrological models when using CaMa-Flood routing
With alteration in both timing and amplitude of discharge shown in figures 2 and 3, noticeable performance changes can be identified with CaMa-Flood for most of the metrics (except BMEAN, for which no change is expected from different routing method when the upstream area is identical). We consider a change significant at the river basin scale, when the difference in the performance metrics was greater than 0.05 (or 5% for the percent changes). The ensemble mean of performance differences show that for all metrics related to peak flow (BSTD, BMAX, BMYM), most basins display a significant increase in agreement of more than 5% with CaMa-Flood routing. Basins where CaMa-Flood improved BMYM cover  (table   S3). Even though most GHMs in general simulate an earlier peak discharge compared to observations (figure S6), the delay in peak discharge with CaMa-Flood routing leads to a later peak (figure S7) that does not necessarily increase agreement with observations. Instead, the delay is probably too long, given that the model ensemble mean of absolute error in timing of peak discharge, averaged over 623 stations with over 30 years of observational record, becomes slightly larger (42 days) with CaMa-Flood compared to GHMs' native  Figure 4. Multi-model ensemble mean performance differences compared to daily GRDC data, all shown as basin averages (denoted by _m). Grey colour shows differences <5% in basin-averaged performance metrics. Green colours show basins where a discharge metrics is improved with CaMa-Flood compared to native GHM routing. routing schemes (36 days). As expected, little change in BMEAN is seen, and only a few basins display a significant change in NSE (table S4). Analyses on monthly discharge show similar results, although the improvement in peak discharge simulation is less pronounced ( figure S8). For the majority of GHMs, the larger part of the studied basin areas shows significantly better instead of worse agreement with observation in peak discharge simulation with CaMa-Flood (table 2, using BMYM as an example here as the statistics are similar for BSTD and BMAX). The largest improvement (and major contribution to ensemble mean) is observed in DBH and LPJmL. MATSIRO and ORCHIDEE show approximately the same area of increase and decrease in BMYM performance. The impact of CaMa-Flood routing on BMYM is largely negative for MPI-HM compared to its native routing scheme, and this is similar for ORCHIDEE (which is 'natively' routed by the MPI-HM scheme) to a lesser extent. Since MPI-HM includes a dynamic wetland scheme that could encompass larger buffering zones than floodplains, this result confirms that the gain in performance can be expected from better describing the hydraulic connection between the streams and the riparian areas. As a result, CaMa-Flood allows a large improvement of the simulated discharge if BMYM is originally large (too weak buffering with the original routing scheme), and slightly deteriorates the absolute bias in BMYM if GHMs' original discharge simulations are already relatively good (table S5). An interesting feature is that the area-weighted overall BMYM varies from 36 to 84% for the GHMs' original discharge, which spans a much larger inter-model range than the 43%-52% obtained with CaMa-Flood routing. This suggests that a significant portion of the variance among GHMs' performances in peak discharge simulation can be attributed to different routing schemes, including the effects of surface water bodies.
Similar conclusions can be drawn using only stations affected by human management, stations that measure near-natural discharges, or the full sample (table 2, figures not shown). For managed stations only, more models show larger area where CaMa-Flood routing leads to better agreement with observations. Notably for WaterGAP2, the larger part of the studied basin areas shows significantly better agreement with managed stations in peak discharge simulation with CaMa-Flood, but the opposite is true with natural stations only.

Discussion
In this study we assessed whether and where the CaMa-Flood global river routing scheme provides a closer agreement with observed discharge compared to the native routing schemes of nine GHMs. To our knowledge, this is a first multi-model comparison study using daily discharge observations, focusing on GHM performance in peak discharge simulations. Quite notable improvement of simulated peak discharge over large part of the world is found with CaMa-Flood routing compared to most of the GHMs' native routing schemes. As revealed by the daily hydrographs from three major river basins (supplementary text A), this improvement is primarily achieved through reducing the amplitude of peak discharge, in cases that peak discharge is overestimated by the GHMs, most likely due to floodplain mechanism in CaMa-Flood. Previous observational and modelling case studies have well documented the role of floodplains in reducing downstream peak discharge, largely through storage (or greatly reduced velocity) of a portion of the runoff on overbank surfaces (Ahilan et al 2016, Lininger and Latrubesse 2016, Acreman and Holden 2013, Woltemade and Potter 1994. CaMa-Flood represents this effect at global scale, with floodplain storage simulation and a higher roughness coefficient for floodplains. The variable flow speed simulation that accounts for backwater effect in CaMa-Flood may also play a role. Sensitivity experiments (Paiva et al 2013, Yamazaki et al 2011) showed that while backwater effect is important for representing more realistic floodplain storage and hydrological regimes (such as water surface profiles and flood extent) for Amazon, the simulated peak discharge with backwater effect is not drastically different compared to kinematic wave approach; in contrast, experiment with no floodplain produces daily discharge that is noisy and in advance, with a notably larger peak amplitude.
In addition, analyses on three selected GHMs (H08, LPJmL and PCR-GLOBWB) show that the CaMa-Flood simulated peak discharge compares favourably over the GHMs' native routing (supplementary text B), even when human impacts are included in the latter. These results indicate that an explicit representation of floodplains as included in CaMa-Flood may significantly improve the simulation of peak discharge. Model calibration (not implemented in CaMa-Flood) could potentially compensate for the lack of explicit floodplain hydrology as indicated by the good performances of the calibrated GHM WaterGAP2 (e.g. table 2). This is interesting, as the calibration scheme of WaterGAP2 only forces the simulated long-term average annual river discharge to observed values (Müller Schmied et al 2014) and is not directly designed to reproduce peak discharge at a daily time step. From a more general perspective, the differences in NSE or R between CaMa-Flood simulated discharge and discharge from GHMs' native schemes are relatively small for a majority of models. It is important to note that our assessment focuses on peak discharge while in some water resource applications low levels of discharge are more critical. The CaMa-Flood's representation of low flow conditions has to be assessed separately. As low flow Environ. Res. Lett. 12 (2017) 075003 conditions could be particularly sensitive to human management (Veldkamp et al 2017), it may be essential to account for processes such as water withdrawal (agricultural, domestic, industrial, etc.) and flow alteration due to dams and reservoirs in simulating discharge. These effects are currently not included in CaMa-Flood.
While our results highlight the importance of representing floodplains in routing models, future research on the attribution of factors such as floodplain storage, roughness, connectivity and backwater effect in affecting peak flow will be helpful for further model improvement. Large differences among individual basins including varied rainfall regimes and river channel characteristics, distribution of floodplains, synchrony between tributary and main channel peaks (Acreman and Holden 2013), and channel connectivity, for example reported in observational studies on Amazon and Congo (Alsdorf et al 2010, Jung et al 2010, have to be taken into consideration. Nevertheless, we performed a simple analysis in order to further examine whether the storage capacity of floodplains is related to peak discharge reduction with CaMa-Flood routing. Specifically, we computed daily basin totals of flood storage and runoff for 34 basins (>100 000 km 2 ) worldwide. For each basin we used the multi-year mean  of annual range of daily flood storage as a fraction of multi-year mean annual runoff to represent its floodplain storage capacity. Model ensemble medians of this capacity indicator and peak discharge (using dpBMYM here as in figure 3) changes at the outlets of the basins were then derived for five models that do not represent floodplains (DBH, H08, LPJmL, MATSIRO, WaterGAPnc; CLM and PCR-GLOBWB are excluded for this exercise due to their different routing networks). We saw a clear inverse relationship, suggesting that if the CaMa-Flood simulated annual fluctuation of flood storage is large relative to basin total runoff, a larger amplitude reduction of peak discharge at the basin outlet is more likely to be seen. Note that the relationship is likely not linear, as when the storage is large enough, its further increase may not lead to further reduction of peak discharge amplitude. Water stored on floodplain also does not always reduce peak flows especially if the storage is saturated during long wet period.
In addition to discharge, CaMa-Flood is capable of providing other key outputs for risk assessments such as flooded area, flood volume and flood depth. Some regional flood inundation models can represent physical processes in even more detail and could potentially be applied at the global scale (e.g. Sampson et al 2015). However, a global simulation with regional models is often limited to simulations of specific magnitude floods (e.g. a 100 year flood) due to computational constraints. When using runoff from multiple GHMs as input, CaMa-Flood can ensure that the simulated discharge follows a consistent drainage direction network that is similar to the established DDM30 network; this consistency alone is worth noting as it prevents potential location errors in using discharge results from multiple models. We also observed that the difference in overall performance among GHMs (between-GHMs spread) seems to be reduced with CaMa-Flood routing, suggesting that the routing scheme is an important contributing factor explaining the spread between models in simulated peak discharge. For all the GHMs we examined, CaMa-Flood generally leads to improved peak discharge simulations at a global scale except for MPI-HM and ORCHIDEE. In contrast to the other model specific routing schemes, these two models have their own representation of floodplain dynamics, which supports the hypothesis that such dynamics (also represented in CaMa-Flood) are critical for the representation of peak discharge. CaMa-Flood also enables the inclusion of more land or earth system models that do not have a routing scheme in global flood studies, in doing so accounting better for uncertainties in simulating land processes. Additionally, our study provides a model evaluation framework on peak discharge, which is often not directly addressed in global flood studies (e.g. Hirabayashi et al 2013). For an estimate on future flood hazards, it is advisable to use multiple GHMs as their performance can vary considerably depending on the basin of interest, due to their different process representation and generalized parameterisation. Recent research is exploring novel methods for utilizing ensemble model results (Zaherpour et al 2017); here we also showed that taking the best performing ensemble member at each station limits the basin-averaged biases in peak discharge to within 30% for most of the examined basins (figure 5). Therefore, even though global-scale flood simulations are still at an early stage, it is possible to gain insights using the current ensemble of modelling tools for future flood projections. Further improvement can be achieved by improving routing alone, for example, CaMa-Flood currently uses relatively low resolution in the terrain data north of 60°N for generating routing direction, limiting the accuracy of simulations in boreal regions; implementation of a higher quality topography input is currently under development. Neither evaporation over floodplains nor transmission loss is considered in CaMa-Flood, which could possibly lead to an overestimation of peak amplitude in basins under hot climate, such as in the inner Niger delta (Zwarts We also found that CaMa-Flood routing often shows considerable improvement in the cases where NSE<0 or biases >100% (results not shown), beyond which the performance metrics were capped for the analyses. It could be interesting from a model development perspective to also examine such improvement; however, regarding on the potential of model applications limited value can be gained if the improved NSE is still below 0. Additionally, we note that the GRDC stations used in this study are not evenly distributed across the globe, with relatively sparse samples in Asia and Central Africa, where flood risk is high. This prompted us to relax data requirement to a minimum of five years, which allowed us to perform analyses on about 10% more stations and cover a few more basins. However, results in data sparse basins are naturally less robust, have sampling bias and should be treated with caution. We also performed sensitivity analyses using only stations with a minimum record of ten years (results not shown) instead of five; here, our major conclusions still hold. The classification of managed and (near-) natural stations was based on two parameters (reservoir volume, irrigated area), while in the real world human impacts may be present in some other forms that are difficult to measure correctly. While currently being the best tool available, the GHMs' representation of time-varying human impacts is still at an early stage and will continue to improve in the future. Therefore, while our results show that human impacts implemented in the examined GHMs have a quite limited influence on peak discharge simulations, human interventions could be significant for certain regions in the real world.

Conclusions
The recently developed global river routing scheme CaMa-Flood, was shown to generally delay the timing and reduce the amplitude of daily peak discharge in most regions compared to the routing schemes in a majority of GHMs. The amplitude reduction led to an overall improvement in simulated peak discharge compared to the GHMs' native routing schemes. Similar levels of increased performance were observed when analysing managed or near-natural stations alone; also, sensitivity analysis displayed a very limited effect of human impacts on peak discharge. The improved peak discharge simulation is likely due to explicit simulation of the floodplain in CaMa-Flood that smoothens the daily hydrographs in a way similar to wetland implementations in the MPI-HM routing scheme. This buffering effect was also Environ. Res. Lett. 12 (2017) 075003 found to largely reduce inter-model differences in overall performance. Our analyses showed that the choice of routing scheme has a considerable influence on the simulated river discharge and its peak values, and suggested that GHMs could be useful tools in combination with routing schemes such as CaMa-Flood to simulate peak discharge. By adopting a multi-model approach, uncertainties in assessment with only a single model due to missing physical mechanisms and globally universal parameters could be largely mitigated. The benefit of a consistent routing also allows the inclusion of more models thus better accounting for structural model uncertainties in future flood-related studies. With a focus towards flood hazards simulations and projections, this study offers a comprehensive framework to assess how well the models simulate peak discharge, thus allowing a more informed interpretation on their future projections, which could be helpful for decision-makers. The improved multi-model database of historical daily discharge generated from this study will be made available and serve as a reference on the skills of models for studies projecting future flood disasters under a changing climate.