How do I know if I’ve improved my continental scale flood early warning system?

Flood early warning systems mitigate damages and loss of life and are an economically efficient way of enhancing disaster resilience. The use of continental scale flood early warning systems is rapidly growing. The European Flood Awareness System (EFAS) is a pan-European flood early warning system forced by a multi-model ensemble of numerical weather predictions. Responses to scientific and technical changes can be complex in these computationally expensive continental scale systems, and improvements need to be tested by evaluating runs of the whole system. It is demonstrated here that forecast skill is not correlated with the value of warnings. In order to tell if the system has been improved, an evaluation strategy is required that considers both forecast skill and warning value. The combination of a multi-forcing ensemble of EFAS flood forecasts is evaluated with a new skill-value strategy. The full multi-forcing ensemble is recommended for operational forecasting, but there are spatial variations in the optimal forecast combination. Results indicate that optimizing forecasts based on value rather than skill alters the optimal forcing combination and the forecast performance. Also indicated is that model diversity and ensemble size are both important in achieving best overall performance. The use of several evaluation measures that consider both skill and value is strongly recommended when considering improvements to early warning systems.


Introduction
Flood Early Warning Systems (EWS) are vital for enhancing disaster resilience (Guha-Sapir et al 2013, Stephens et al 2015a, Carsell et al 2004, Coughlan de Perez et al 2016, Girons Lopez et al 2017), particularly for serious flooding in transnational river basins (Eleftheriadou et al 2015, Webster et al 2010). Although flood forecasts are improving (Pappenberger et al 2011, Collier 2016), EWS developers still face considerable challenges (Pagano et al 2014, Wetterhall et al 2013, Zia and Wagner 2015). One of the most prominent challenges is understanding how best to evaluate scientific improvements within a computationally intensive operational forecasting environment. The complexities of these systems mean that when small scientific or technical improvements are made, the consequent improvements to the forecasted variables and flood warnings are not necessarily straightforward. For example, improvements in grid resolution, bias correction or additional data assimilation do not always produce the expected results because of feedbacks in the system (Kauffeldt et al 2015, Adams and Pagano 2016; supplementary material: section S1, figure S1, stacks.iop.org/ERL/12/044006/mmedia). Thus the only way to comprehensively evaluate improvements in such complex systems is through an intensive set of numerical experiments which run the whole system (as for weather forecasting systems, see for example https://software.ecmwf.int/wiki/display/FCST/Terminology+for+IFS+testing).
Another consideration is that decisions about the utility of improvements to EWS are typically based on an assessment of how physically consistent the system is with respect to observations. This is measured in terms of the quantitative skill of the system in forecasting variables such as river discharge or water level (Pappenberger et al 2015a, Robertson et al 2013, Wanders et al 2014). However, investment decisions about EWS instead consider the cost-benefit ratio of predictions, such as the value of flood warnings issued. This reliance on skill measures to evaluate system improvements may exist because it is usual practice to evaluate system skill and alternatives are simply not considered, because there is an inherent assumption that skill is correlated with value (which it may not be), or because evaluating the value of warnings is a very resource- and data-hungry activity which is not easy to achieve.
In this paper this mismatch is addressed by evaluating EWS improvements using traditional measures of forecast skill, measures of the value of the warnings, and a number of hybrid measures. The EWS used is the European Flood Awareness System (EFAS), an operational continental scale flood EWS (Smith et al 2015). A large set of reforecasts from EFAS is used to evaluate a system improvement that has never been previously objectively and fully tested in EFAS: the implementation of a multi-model Numerical Weather Prediction (NWP) forcing framework. Such a framework should theoretically provide a better estimation of uncertainty and an improved predictive distribution compared with a single forcing approach.

Methods
In this paper, the whole integrated EFAS EWS is used to demonstrate a new skill-value evaluation strategy in the testing of the implementation of a multi-forcing framework. The experiment uses a large set of flood forecasts generated with a 2 year EFAS reforecast. Multi-forcing approaches use forecast combination techniques, which require the estimation of weights for each individual flood forecast, or ensemble of flood forecasts. All possible permutations of the NWP forcings available to EFAS are optimized in order to test the hypothesis that the full multi-model forcing provides the highest forecast skill and highest warning value (figure 1). The weights are optimized using five evaluation measures which range from traditional river discharge skill evaluation through to evaluating flood warning value (section 2.2). A sensitivity analysis is then undertaken in order to evaluate the impact of the model forcing combinations on EFAS performance. The combinations are evaluated relative to one another, again using evaluation measures ranging from river discharge skill through to flood warning value (figure 1). Following best practice for sensitivity testing (Saltelli et al 2008), first a Leave One Out Comparison (LOOC) is undertaken, followed by an Add One In Comparison (AOIC) (section 2.3). In order to avoid confusion in such a complex analysis, when the evaluation measures are used for optimization they are referred to as methods 1-5, and when the evaluation measures are used for the sensitivity analysis the method names (CRPS, SEDI, etc) are used (see figure 1).

The European Flood Awareness System (EFAS)
EFAS produces probabilistic flood warnings up to 15 days ahead as part of the Copernicus Emergency Management Service (Smith et al 2015). EFAS was developed by the European Commission to contribute to better flood risk management in advance of and during flood crises across Europe. The system provides both national authorities and the European Commission with pan-European overviews of forecasted floods with the aim of improving coordination of aid and acting as a complementary source of information for national systems. The system is now forced with NWP forecasts that are both global and regional, deterministic and ensemble-based. The hydrological model is LISFLOOD (Van Der Knijff et al 2010), which is set up over the European domain on a 5 × 5 km grid and also for 768 river catchments, which are used in the operational EFAS for monitoring and post-processing. The forecasts are bias-corrected and post-processed at the locations where real-time hydrological observations are available. Further details on the EFAS NWP forcings, flood warning decision rules and performance are provided in the supplementary material section S2.
When the EFAS has a system upgrade, a 'reforecast' is produced in order to evaluate the new changes to the system. Reforecasting is also known as hindcasting or retrospective forecasting and involves computing forecasts with the new EFAS configuration for past dates. The most recent reforecast for EFAS was produced in January 2014 (ECMWF 2014, Salamon 2014) and covers the 2 year period January 2012 to December 2013. The reforecasts used in this study are issued daily, looking up to 10 d ahead for the whole European domain. Here forecast lead times of 3-10 d are used in the analysis, as EFAS is a medium-range forecasting system that is designed for forecasts of lead times of 3 days and longer.
The NWP forcings for EFAS could be combined in a number of different ways (figure 1, table 1). It is assumed in the operational EFAS that the full multi-forcing combination (configuration 15) provides the best system performance, but hitherto this has not been tested. As spatially distributed observed discharge data are not available, and the quality and coverage of station observations over the European domain are very unequal, the river discharge observations used in this study to evaluate the reforecasts are proxies, derived from routing observed rainfall through the hydrological model (as per Pappenberger et al 2008). As the same model is used for the observations and the predictions, this also allows a number of other uncertainties to be controlled for.

Forecast improvement: combination and optimization
First the flood forecasts are combined, requiring the estimation of weights for each individual forecast or ensemble of forecasts in order to optimise the output of the systems against the proxy observations. This is done for each of the 768 river catchments.
The river discharge forecasts from the different NWPs are combined using nonhomogeneous Gaussian regression, NGR (Gneiting et al 2005), in which the predictive distribution is a normal distribution with bias- and spread-corrected parameters:

\mathcal{N}\left(w + \sum_{i=1}^{M} g_i\, f_{i,s,t},\; h + y \sum_{i=1}^{M} \sigma_{i,s,t}^{2}\right) \qquad (1)

where:
f_{i,s,t}: mean of the ith ensemble forecast (or the forecast value in the case of a deterministic forecast) at location s and lead time t
M: number of systems
w, g: bias correction parameters
h, y: spread correction parameters
\sigma_{i,s,t}: standard deviation of the ith ensemble forecast. In the case where only a deterministic forecast is used this is replaced by the forecast value.
The parameters of the NGR can be estimated by optimising an evaluation measure on the past 90 d of forecasts. Here five different evaluation measures are used for this optimization stage. These have been selected to cover the range from a traditional skill-based evaluation measure (method 1) through to a monetary value-based score (method 5), with hybrid scores in between (methods 2-4).
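As an illustration, the NGR combination and its optimization against a skill measure over a training window can be sketched as follows. This is a minimal sketch, not the operational EFAS code: it assumes per-model weights on the ensemble means, a shared spread correction, and the closed-form CRPS of a normal distribution (Gneiting et al 2005); the function names are illustrative.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def gaussian_crps(mu, sigma, y):
    """Closed-form CRPS of a normal N(mu, sigma^2) against observation y."""
    z = (y - mu) / sigma
    return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))

def fit_ngr(F, S2, y):
    """Fit NGR parameters (w, h, y_spread, g_1..g_M) by minimising the mean
    CRPS over a training window (e.g. the past 90 d of forecasts).
    F: (n_days, M) per-model ensemble-mean forecasts.
    S2: (n_days,) mean ensemble variance. y: (n_days,) proxy observations."""
    n_days, M = F.shape

    def mean_crps(p):
        w, h, y_spread, g = p[0], p[1], p[2], p[3:]
        mu = w + F @ g                                        # bias-corrected mean
        sigma = np.sqrt(np.maximum(h + y_spread * S2, 1e-6))  # spread correction
        return gaussian_crps(mu, sigma, y).mean()

    p0 = np.concatenate(([0.0, 1.0, 0.1], np.full(M, 1.0 / M)))
    return minimize(mean_crps, p0, method="Nelder-Mead").x
```

In the experiment this optimization would be repeated independently for each of the 768 catchments and each lead time.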
2.2.1. Optimization method 1: optimization using the continuous rank probability score (CRPS), CRPS
Method 1 considers the skill of river discharge and optimizes the CRPS (Hersbach 2000) for each lead time. The NGR is optimized independently for each lead time and location using the analytical formula for the CRPS given in Grimit et al (2006).

Figure 1. An overview of the experimental design for evaluating the implementation of multi-model forcings in the EFAS EWS. River discharge forecasts are produced by forcing EFAS with different NWP models ('Forecasts'). These forecasts are then combined in all possible permutations; the weightings for each combination are calculated by optimizing the past 90 d of forecasts against proxy observations using 5 different methods (1: CRPS, 2: CRPS-lagged, 3: SEDI, 4: Hit Rate and 5: Value). This then produces a large set of combined forecasts which are evaluated against one another in a sensitivity analysis. The evaluation measures used in the sensitivity analysis are CRPS, SEDI, Hit Rate and Value. In order to avoid confusion in such a complex analysis, when the evaluation measures are used for optimization they are referred to as methods 1-5, and when the evaluation measures are used for the sensitivity analysis the method names (CRPS, SEDI, etc) are used.

2.2.2. Optimization method 2: optimization using the continuous rank probability score (CRPS) for lagged forecasts, CRPSl
Warning decisions in EFAS are based on lagged forecasts. Consecutive forecasts are required to issue an alert as this provides a better false alarm rate (see supplementary material S2). The CRPS is optimized using an NGR formulation which contains not only the most recent forecast but also forecasts issued 3-10 days beforehand, increasing the number of ensemble systems i used in equation (1). This method is hence closer to the relevant decision rules.
2.2.3. Optimization method 3: optimization using the symmetric extreme dependency index, SEDI
In this method, the performance of warnings which use lagged forecasts is scored in terms of hits, misses, false alarms and correct rejections using a contingency table and the decision framework shown in table S2. Flood events are low frequency events and so the Symmetric Extreme Dependency Index (SEDI) is used (Ferro and Stephenson 2011, North et al 2013):

\mathrm{SEDI} = \frac{\ln F - \ln H - \ln(1 - F) + \ln(1 - H)}{\ln F + \ln H + \ln(1 - F) + \ln(1 - H)} \qquad (2)

where H is the hit rate:

H = \frac{a}{a + c} \qquad (3)

and F is the false alarm rate:

F = \frac{b}{b + d} \qquad (4)

where a, b, c and d are the number of hits, the number of false alarms, the number of misses and the number of correct rejections respectively. The SEDI ranges over [−1, 1], taking the value of 1 for perfect forecasts and 0 for random forecasts; scores above 0 therefore have some degree of skill.
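A minimal sketch of the SEDI computed from the contingency-table counts described above (a hits, b false alarms, c misses, d correct rejections); this is an illustrative implementation, not EFAS code:

```python
import math

def sedi(a, b, c, d):
    """Symmetric Extreme Dependency Index (Ferro and Stephenson 2011).
    a: hits, b: false alarms, c: misses, d: correct rejections."""
    H = a / (a + c)   # hit rate
    F = b / (b + d)   # false alarm rate
    num = math.log(F) - math.log(H) - math.log(1 - F) + math.log(1 - H)
    den = math.log(F) + math.log(H) + math.log(1 - F) + math.log(1 - H)
    return num / den
```

SEDI is 0 for a random forecast (H = F) and approaches 1 as rare events are captured without false alarms, which is why it suits low frequency flood events better than many other contingency-table scores.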
2.2.4. Optimization method 4: optimization using the hit rate
As false alarms have a low cost in early warnings (Dale et al 2013), method 4 uses an objective scoring function for optimization which focuses on the number of hits and misses (the hit rate, equation (3)) for lagged forecasts as a proxy for monetary benefit.

2.2.5. Optimization method 5: optimization using value
Pappenberger et al (2015) have estimated the monetary value of the EFAS by calculating the avoided flood damages of the early warnings and comparing them with the costs of implementing and running the system. This required a large analysis involving details of EFAS forecasts, the EU and national forecasting context of EFAS, the flood alert decision rules, damage data sets (Barredo 2009, the EM-DAT emergency events database (EM-DAT 2014) and complementary information from the European Solidarity Fund application (EC 2014)), and the calculation of avoided flood damages. However, when comparing various setups within any one early warning system (in this case EFAS), the monetary value can be evaluated using an analysis of just the hits (correct forecasts). This is because the base investment value in the system and the running costs remain static, the false alarms can be neglected (as above in method 4) and the total number of observed flood events (hits + misses) remains constant in the evaluation dataset. Optimizing against the hits is directly equivalent to optimizing against the value, and thus the approach taken in method 4 can be modified to use only the hits in the optimization, which will directly reflect the monetary value (a in equation (3)).
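The equivalence argued above can be sketched numerically: with the investment and running costs fixed across configurations and false alarms neglected, ranking configurations by monetary value reduces to ranking them by hits. The damage and cost figures below are purely hypothetical and are not the values estimated for EFAS by Pappenberger et al (2015).

```python
# Hypothetical figures for illustration only.
AVOIDED_PER_HIT = 4.0e6   # euros of flood damage avoided per correct warning
FIXED_COSTS = 1.0e6       # investment + running costs, identical for all setups

def monetary_value(hits):
    """Net value of a configuration: avoided damages minus static costs."""
    return hits * AVOIDED_PER_HIT - FIXED_COSTS

# Hits scored by three candidate configurations on the same set of observed events.
hits_by_config = {"A": 12, "B": 17, "C": 9}

ranked_by_value = sorted(hits_by_config,
                         key=lambda k: monetary_value(hits_by_config[k]),
                         reverse=True)
ranked_by_hits = sorted(hits_by_config, key=hits_by_config.get, reverse=True)
assert ranked_by_value == ranked_by_hits   # optimizing hits == optimizing value
```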

Sensitivity analysis of forecast improvements
The optimised EFAS forecast sets are evaluated against one another in order to understand the influence of the different forcing combinations on EFAS performance (figure 1). The mean of the CRPS for all lead times above 3 d (CRPSm) is used for comparison with the other evaluation measures (lead times above 3 d are selected in this calculation as EFAS is a medium-range forecasting system that is designed for forecasts of lead times of 3 d and longer). The CRPS-lagged does not appear in this part of the analysis as it requires weighting past forecasts, which requires optimization.
The sensitivity analysis methodology follows recommendations from Saltelli et al (2008). First a 'Leave One Out Comparison' (LOOC) is performed, in which the combinations containing a particular forcing (or group of forcings) are compared with the combinations that do not contain the individual forcing (or group of forcings) (the score reference). For example, and referring to table 1, for the DWD forcing, combinations 1, 5, 6, 7, 11, 12 and 13 (which contain the DWD forcing) are compared with combinations 2, 3, 4, 8, 9, 10 and 14 (which do not contain the DWD forcing).
Second, an 'Add One In Comparison (AOIC)' is performed in which an individual forcing/group of forcings is added to each combination and compared to the combination without it (the score reference). For example, and again referring to table 1, for the group of forcings 'ECMWF-ENS and COSMO-LEPS' , combinations 13, 14 and 15 are compared with combinations 1, 2 and 5.
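The membership logic behind the two comparisons can be sketched by enumerating the 15 non-empty forcing combinations. Note that this sketch reproduces only the set logic, not table 1's specific configuration numbering: simple membership gives eight combinations containing any single forcing, whereas the seven-versus-seven comparison quoted above reflects the paper's particular pairing in table 1.

```python
from itertools import combinations

FORCINGS = ("DWD", "ECMWF-Highres", "ECMWF-ENS", "COSMO-LEPS")

# All 15 non-empty forcing combinations (the configurations of table 1).
CONFIGS = [frozenset(c) for r in range(1, len(FORCINGS) + 1)
           for c in combinations(FORCINGS, r)]

def looc(group):
    """Leave One Out Comparison: configurations containing the whole group
    versus configurations containing none of it (the score reference)."""
    group = frozenset(group)
    with_group = [c for c in CONFIGS if group <= c]
    without_group = [c for c in CONFIGS if not (group & c)]
    return with_group, without_group

def aoic(group):
    """Add One In Comparison: pair each configuration lacking the group with
    the same configuration plus the group (the score reference)."""
    group = frozenset(group)
    return [(c, c | group) for c in CONFIGS if not (group & c)]
```

For the two-forcing group ECMWF-ENS and COSMO-LEPS, `aoic` returns three reference/extended pairs, matching the three comparisons described above.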
In the sensitivity analysis, evaluation of the different configurations is undertaken using 'skill' scores. Using one specified system configuration as the reference, the individual scores are divided by the reference score and thus normalised. The higher the score the better, and anything above 0 indicates 'skill' of the forecast in relation to the reference. The CRPS thus becomes the CRPSS (Continuous Rank Probability Skill Score) by dividing by the CRPS of the reference configuration; the SEDI becomes the SEDIS (Symmetric Extreme Dependency Index Skill Score) by dividing by the SEDI of the reference configuration; the Hit Rate becomes the Hit Rate Skill by dividing by the Hit Rate of the reference configuration; and the Value becomes the Relative Value by dividing by the Value of the reference configuration.

Results
Results are presented as an average over all of the 768 EFAS river catchments for the full reforecast data set.

Is an optimization method looking at both skill and value required?
First, evidence is presented on the requirement to consider both skill and value in the optimization methods used to combine forecasts. The Spearman rank correlations between the methods are shown in table 2, with the mean value of all optimization methods provided in the bottom row. These demonstrate the expected relationships between scores, in that those that are most similar in their construction tend to have the higher correlations; for example, the relationships between the SEDI, Hit Rate and Value scores, the three scores that most represent value. There is also some correlation between CRPSm and SEDI. Correlations, however, are in general weak, which is not surprising as the optimization methods are a mix of continuous and threshold-based scores including various transformations. This highlights the necessity of considering both a range of optimization methods and evaluation measures to evaluate the system; no one measure can fully replace another.
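The rank correlations reported in table 2 are Spearman correlations between per-method scores across catchments. A minimal sketch on synthetic data follows; the scores below are randomly generated stand-ins with a weak shared signal, not EFAS results.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(42)
n_catchments = 768   # one score per EFAS river catchment

# Synthetic per-catchment scores for two methods sharing a weak common signal,
# standing in for the modest dependence between, e.g., CRPSm and SEDI.
shared = rng.normal(size=n_catchments)
score_crps_m = shared + 2.0 * rng.normal(size=n_catchments)
score_sedi = shared + 2.0 * rng.normal(size=n_catchments)

# Spearman works on ranks, so it is insensitive to the differing scales and
# monotone transformations of continuous and threshold-based scores.
rho, p_value = spearmanr(score_crps_m, score_sedi)
```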
The forecast performance for the CRPSm and Value, ranked between forcing combinations (numbered) for the 5 different methods of optimization (shapes), is shown in figure 2. This also demonstrates that skill and value are not well correlated, and hence the importance of an evaluation strategy that explicitly considers value as well as skill. The full complexity of attempting multi-forcing combination is shown by the variation in rankings between the different methods for the different forcing combinations. However, the full multi-model ensemble forcing combination (No 15, shown in a circle) ranks highly for both Value and CRPSm (although not quite the highest). It also shows remarkable consistency between all methods used for optimization and supports the implementation of the full multi-forcing ensemble combination at the European scale for EFAS.

Which NWP forcings have the greatest relative contribution to improved forecast performance?
In order to understand the relative contribution of an individual NWP model forcing to the EFAS forecast performance, a comparison between all combinations in which the forcing is used and those in which it is not is performed over all catchments, for the whole reforecast dataset, employing the Leave One Out Comparison (LOOC) method. Table 3 shows the mean and standard deviation (over all forecast start dates and river catchments) of the skill score values. Values above 0 indicate a positive contribution to forecast performance (i.e. making the forecasts better), with higher numbers meaning an increasingly positive contribution; values lower than 0 indicate a negative contribution (i.e. making the forecasts worse).
Although numerical values in the cells cannot be directly compared with each other because they are different optimization-evaluation combinations, larger values indicate better performance and larger standard deviations indicate greater space-time variability across forecasts and river catchments.
Results show that most of the combinations have a positive contribution to EFAS performance regardless of the skill/value score or optimization method used; results for the DWD, ECMWF-Highres and ECMWF-ENS nearly always show a positive contribution. This provides good evidence for employing a multi-forcing framework for EFAS.
The picture for COSMO-LEPS is more mixed, and it does not add value to forecast performance in many of the combinations. However, the COSMO-LEPS results also exhibit a very large variance, which suggests that the contribution is very variable across Europe. If the analysis is restricted to the Alpine area, over which the high resolution COSMO-LEPS is considered to outperform lower resolution models, there is significant improvement in the COSMO-LEPS score (1 ± 1 as opposed to −1 ± 1 for method 1 and CRPSSm) with little deterioration in the other scores (3 ± 0.9 DWD; 2 ± 1 ECMWF-Highres; 3 ± 1 ECMWF-ENS), indicating the value added from the COSMO-LEPS forcing even though it deteriorates the Europe-wide mean. This is an important finding because it means that in some areas there is a positive contribution to forecast performance even though the spatio-temporal mean is negative, and thus forcings to EFAS cannot be discounted purely on a spatio-temporal mean of performance.

Table 3. Relative contribution of the NWP forcings to EFAS forecast performance for the 5 optimization methods and 4 evaluation measures. Positive numbers represent a positive contribution to the forecast performance (i.e. this forcing is making the forecasts better) and negative numbers represent a negative contribution (i.e. this forcing is making the forecasts worse). Direct intercomparison of the values is not possible because they are different optimization-evaluation combinations, but larger values indicate better performance and larger standard deviations indicate greater space-time variability across forecasts and river catchments.
CRPSSm (×10), methods 1-5:
DWD: 4 ± 11, 0 ± 11, 2 ± 5, 2 ± 11, 3 ± 283
ECMWF-Highres: 4 ± 10, 6 ± 10, 5 ± 5, 5 ± 12, 1 ± 218
ECMWF-ENS: 3 ± 14, 8 ± 12, 5 ± 6, 6 ± 11, 18 ± 208
COSMO-LEPS: −6 ± 15, −11 ± 12, −7 ± 13, −9 ± 17, 262 ± 734

Hit Rate Skill (×1000), methods 1-5:
DWD: 7 ± 33, 0 ± 33, 2 ± 32, 2 ± 35, −13 ± 31
ECMWF-Highres: 10 ± 20, 14 ± 23, 5 ± 10, 6 ± 16, −8 ± 16
ECMWF-ENS: 3 ± 27, 15 ± 27, 3 ± 9, 5 ± 15, −5 ± 15
COSMO-LEPS: −21 ± 53, −24 ± 25, −9 ± 24, −13 ± 30, 7 ± 21

Relative Value (×100), methods 1-5:
DWD: 4 ± 12, 3 ± 11, 0 ± 8, 0 ± 7, 6 ± 10
ECMWF-Highres: 2 ± 7, 2 ± 8, 0 ± 3, 0 ± 3, 6 ± 7
ECMWF-ENS: 0 ± 8, 0 ± 9, 0 ± 3, 0 ± 3, 6 ± 6
COSMO-LEPS: 3 ± 20, 1 ± 12, 5 ± 9, 6 ± 9, −5 ± 8

If the analysis considers river catchments individually and looks for the maximum value from any of the catchments (rather than the mean over all catchments), then for Relative Value and method 3 the following values are achieved: 33 (DWD), 17 (ECMWF-Highres), 22 (ECMWF-ENS) and 100 (COSMO-LEPS), which are significantly larger than the numbers in table 3. There are thus very strong spatial dependencies in the scores achieved for different combinations of forcings. There are also variations depending on the optimization method and the evaluation measures used. To highlight this, figure 3 shows, for each EFAS river catchment, the NWP forcing with the maximum positive contribution to EFAS forecast performance for the CRPS (a) and the Value (b). Each colour represents a different forcing and the spatial variability across the catchments is clearly shown. Considering as an example the dark red colour, which represents the DWD forcing, those catchments in which EFAS forecasts improve the most when the DWD is added are shaded. Although there is some overlap between figures 3(a) and (b) in this shading, there are also substantial differences in the catchments shaded. This indicates that there would be significant gains to be achieved in flood forecast performance by using particular forcing combinations for individual river catchments, and also that these gains should be evaluated in terms of both skill and value as the results differ substantially.
The resulting patterns are necessarily complex because of the spatial variability in the hydrological regimes of the river catchments as they respond to the variations in the performance of the numerical weather forecasts (i.e. river discharge responds non-linearly to changes in rainfall and varies between catchments). The impact of spatial variability in the optimal forcing combinations for EFAS should be explored further in future research.
3.3. Which NWP forcing most improves forecast performance when added to an existing configuration?
To evaluate which forcing is the most beneficial when added to an existing configuration, the Add One In Comparison (AOIC) method is used. Figure 4 shows the improvements in forecast performance when a forcing is added to an existing configuration. First, it should be noted that these results demonstrate the value of multi-forcing ensembles. With increasingly large configurations the skill and the value continue to increase (i.e. the lines are not horizontal), although this effect is smaller for Relative Value (bottom row) than for CRPSS (top row). This is true for the addition of all forcings and for all optimization methods. This supports the implementation of a multi-forcing framework and suggests it provides an improved predictive distribution compared with a single forcing approach. It also suggests that the monetary benefit of the EFAS as previously calculated would be lower without the full multi-forcing ensemble.
Figure 4(a) (left column) shows the improvements in CRPSSm and Relative Value from adding an additional forcing to an existing configuration (for all optimization methods). As would be expected, the larger the multi-forcing ensemble, the less added value an individual forcing has. ECMWF-ENS contributes considerably more in terms of CRPSSm performance than any other NWP model. In terms of relative value gain the picture is less clear, with DWD and ECMWF-Highres adding the most value to the multi-model system. This suggests that model diversity is of greater importance for improving the hit rate but ensemble size is more important for improving the CRPS. Figure 4(b) compares the different optimization methods, focusing on the ECMWF-ENS forcing. Methods 1 and 2 perform similarly, considerably outperforming methods 3 and 4 in CRPSSm as well as Relative Value.
Additional skill gain in CRPSSm and Relative Value is very similar in a system which uses 3 or 4 forcings (size of original configuration 2 or 3). Given the substantial resource and political costs of adding any additional forcing into a continental scale flood forecasting system in terms of implementation and maintenance, one conclusion from these results could be that adding a 4th forcing is not worthwhile. However given the high correlation in the physics between the ECMWF models, additional diversity by incorporating other NWP forcings remains an attractive option (Hagedorn et al 2012).

Conclusions
This work has demonstrated that when evaluating the impacts of scientific and technical improvements to flood early warning systems the correlation between the skill of forecast variables and the value of warnings is not high and an evaluation strategy that considers both components is necessary. This will also be true for other earth system modelling and forecasting systems.
Here a new skill-value strategy has been tested on multi-forcing optimization of the European Flood Awareness System (EFAS). The full multi-forcing ensemble achieves good flood forecast performance in both the skill of river discharge forecasts and the value of warnings, and this configuration is recommended for operational forecasting and warning at the European scale, although spatial variations are evident when looking at individual river catchments. Optimization of forecasts based on value rather than skill alters the optimal forcing combination and the forecast performance. Results indicate that a multi-model forcing framework provides an improved predictive distribution over a single model approach. In this evaluation, adding more than 2 NWPs to a multi-forcing ensemble only brought small benefits in terms of score values, although it should also be remembered that achieving diversity in NWP forcing models is important for improving forecast hits, and ensemble size is important for improving forecast skill. It should also be noted that the full multi-forcing framework brings the most benefit in forecast performance, which indicates that the monetary benefit of the EFAS would be lower without the full multi-forcing ensemble.
Where possible, the use of a full suite of skill-value evaluation methods is strongly recommended. Those evaluating modelling and forecasting systems with only one skill-based evaluation method due to computational or other resource constraints should consider the diversity of performances found in this study and that system skill may not reflect system value.

Environ. Res. Lett. 12 (2017) 044006