An efficient approach for estimating streamflow forecast skill elasticity

Seasonal streamﬂow prediction skill can derive from catchment initial hydrological conditions (IHCs) and from the future seasonal climate forecasts (SCFs) used to produce the hydrological forecasts. Although much effort has gone into producing state-of-the-art seasonal streamﬂow forecasts from improving IHCs and SCFs, these developments are ex- pensiveandtimeconsumingandtheforecastingskillisstilllimitedinmostpartsoftheworld.Hence,sensitivityanalysesare crucial to funnel the resources into useful modeling and forecasting developments. It is in this context that a sensitivity analysis technique, the variational ensemble streamﬂow prediction assessment (VESPA) approach, was recently introduced. VESPA can be used to quantify the expected improvements in seasonal streamﬂow forecast skill as a result of realisticimprovementsinitspredictabilitysources(i.e.,theIHCsandtheSCFs)—termed‘‘skillelasticity’’—andtoindicate where efforts should be targeted. The VESPA approach is, however, computationally expensive, relying on multiple hindcasts havingvarying levelsofskillinIHCs andSCFs. Thispaperpresents twoapproximationsofthe approachthatare computationally inexpensive alternatives. These new methods were tested against the original VESPA results using 30 yearsofensemblehindcastsfor18catchmentsofthecontiguousUnitedStates.Theresultssuggestthatoneofthemethods, end point blending, is an effective alternative for estimating the forecast skill elasticities yielded by the VESPA approach. The results also highlight the importance of the choice of veriﬁcation score for a goal-oriented sensitivity analysis


Introduction
Unprecedented increases in computer capabilities have shaped the last several decades' advances in numerical weather prediction (NWP), and with them, the development of environmental forecasting and modeling systems. This has led to a shift in the strategy of operational forecasting centers toward more integrated modeling and forecasting approaches, such as coupled systems and Earth system models (ESMs), with the final aim to extend the limits of predictability (i.e., from subseasonal to seasonal forecasting). These developments are supported by the assimilation of more and better-quality observation data as well as the increase in model resolutions and complexity. However, such advances can be very expensive and data hungry and may not yield proportional improvements.
Seasonal hydrological forecasts are predictions of the future states of the land surface hydrology (e.g., streamflow), up to a few months ahead. They are valuable for applications such as reservoir management for hydropower, agriculture and urban water supply, spring flood and drought prediction, and navigation, among others (Clark et al. 2001;Hamlet et al. 2002;Chiew et al. 2003;Wood and Lettenmaier 2006;Regonda et al. 2006;Luo and Wood 2007;Kwon et al. 2009;Cherry et al. 2005;Viel et al. 2016). They have the potential to provide early warning for increased preparedness (Yuan et al. 2015). Traditionally, seasonal streamflow forecasts have relied upon land surface memory, the persistence in the land surface (e.g., catchment) initial hydrological conditions (IHCs; of soil moisture, groundwater, snowpack, and the current streamflow). IHCs are one of the most important predictability sources of seasonal streamflow forecasts and were thus the starting point for the development of the ensemble streamflow prediction (ESP) approach in the 1970s (Wood et al. 2016b). The ESP was first developed and used for reservoir management purposes. It is produced by running a hydrological model with observed meteorological inputs to produce current observed IHCs, from which the forecast is started, and the forcing over the forecast period is undertaken using an ensemble of historical meteorological observations (Day 1985). The ESP method assumes that the model states to initialize a forecast are perfectly estimated, while the future climate is completely unknown. However, the skill of the ESP decreases significantly after one to a few months of lead time over most parts of the world because of a decrease in the land surface memory with time. The achievable predictability from the ESP thus depends on the persistence of the IHCs, which can vary as a function of the season (i.e., the transition between dry and wet seasons can, for example, be hard to forecast) and the location and size of the catchment (i.e., the streamflow in a large catchment with a slow response time and/or situated in a region with negligible precipitation inputs during the forecast period will for example be easier to forecast; Wood and Lettenmaier 2008;Shukla et al. 2013; van Dijk et al. 2013;Yuan et al. 2015).
More recently, seasonal climate predictability derived from large-scale climate precursors [e.g., El Niño-Southern Oscillation (ENSO) and the North Atlantic Oscillation (NAO)] has been used to enhance seasonal streamflow forecasting (e.g., Wood et al. 2002;Yuan et al. 2013;Demargne et al. 2014;Mendoza et al. 2017). Such systems produce streamflow forecasts by initializing a hydrological model to estimate IHCs and forcing the model with inputs based on seasonal climate forecasts (SCFs; of temperature and precipitation) instead of historical observations. Their skill is also still limited because of the rapid decrease in precipitation forecasting skill beyond two weeks of lead time, and the skill is variable in both space and time (Yuan et al. 2011; van Dijk et al. 2013;Slater et al. 2017). In Europe, for instance, the skill is higher in winter in regions where the winter precipitation is highly correlated with the NAO. Regions with high skill include the Iberian Peninsula, Scandinavia, and regions around the Black Sea (Bierkens and van Beek 2009). In the contiguous United States (CONUS), the skill is on average higher over (semi)arid western catchments, due to the persistence of the IHCs influence up to three months of lead time. The skill can be higher in some regions of the western CONUS (i.e., California, the Pacific Northwest, and Great Basin) in the winter and fall due to higher precipitation forecasting skill in strong ENSO phases (Wood et al. 2005).
Increasing the seasonal streamflow forecast skill remains a challenge: one that is being tackled by improving IHCs and SCFs using a variety of techniques. Techniques include model developments and data assimilation and can vary in computational expense. However, over the past several decades, it has been shown that operational streamflow forecast quality has not significantly improved (Pagano et al. 2004;Welles et al. 2007). This is the motivation for the use of sensitivity analysis techniques to guide future forecasting developments for seasonal streamflow forecasting and is the basis for this paper.
It is in this context that the attribution of seasonal streamflow forecast uncertainty to the IHC and SCF errors has been researched extensively. Wood and Lettenmaier (2008) introduced a method based on two hindcasting end points: the ESP and the reverse ESP. In contrast to the ESP, which only represents the uncertainty in the future climate, the reverse ESP only represents the uncertainty in IHCs by using an ensemble of initial model states taken from historical simulations to initialize a prediction forced by a single set of observed meteorological inputs. Typically, the input uncertainty attenuates over a period of months as the influence of the perfect future climate input increasingly determines model states.
Comparing the skill of the ESP versus reverse-ESP seasonal streamflow forecasts allows one to identify the dominant predictability source (and conversely uncertainty source) of seasonal streamflow forecasting (i.e., the IHCs or the SCFs), and its evolution in both space and time. It was successfully used to disentangle the relative importance of initial conditions and boundary forcing errors on seasonal streamflow forecast uncertainties by several authors: for example, for catchments in the United States (Wood and Lettenmaier 2008;Li et al. 2009;Shukla and Lettenmaier 2011), in France (Singla et al. 2012), in Switzerland (Staudinger and Seibert 2014), in China (Yuan et al. 2016;Yuan 2016), and in the Amazon (Paiva et al. 2012), as well as for the entire globe (Shukla et al. 2013;Yossef et al. 2013;MacLeod et al. 2016). This work is instructive as it enables the dominant predictability source to be identified (i.e., where efforts and resources should be targeted) to focus improvement, which could potentially lead to more skillful seasonal streamflow predictions. This method was extended by Wood et al. (2016a, hereafter W16) via a method called variational ensemble streamflow prediction assessment (VESPA), which involves assessing intermediate IHC and SCF uncertainty points between the perfect and climatological points applied in ESP and reverse ESP. The approach allows the calculation of a metric called ''skill elasticity,'' that is, the sensitivity of streamflow forecast skill to IHC and SCF skill changes. A key drawback of the VESPA approach, however, is that it is computationally intensive. For each catchment and initialization month of a forecast, the response surface was defined through the use of dozens of multidecadal variable-skill ensemble hindcasts, ultimately amounting to millions of simulations. In contrast, the ESP and reverse-ESP skill can be estimated from a single set of ensemble hindcasts spanning a historical period. The IHC and SCF skill variation method was also highly specific to the particular model state configuration and involved a relatively simplistic linear blending procedure. The elasticity calculations were furthermore based only on a single verification score of forecast skill (i.e., coefficient of determination R 2 ) for the analysis. An ensemble forecast has many attributes, for example, the skill, the reliability, the resolution, and the uncertainty of the forecast, among others. To obtain a complete picture of the forecast quality, the scores should encompass many of these attributes, as each verification score will give us different information about the forecast quality.
The drawbacks of VESPA motivate us to assess two computationally inexpensive methods of estimating the forecast skill elasticities, using only the original ESP and reverse-ESP results that depend on the single hindcast series as mentioned above. The two methods are termed end point interpolation (EPI) and end point blending (EPB). In the first part of this paper, we compare results from the two methods tested on 18 catchments of the CONUS to the original results from the VESPA, using a single verification score. The objective of this part is to investigate whether the new methods can discriminate the influence of IHC and SCF errors on seasonal streamflow forecasting uncertainties and to assess the ability of those new methods to correctly estimate the forecast skill elasticities. In the second part, additional verification scores are applied for streamflow forecast verification, supporting the second objective of the paper, which is to explore the sensitivity of the results obtained from the two new methods and the VESPA approach to the choice of the verification score.
2. Methods, data, and evaluation strategy a. The VESPA approach In this work, as in W16, the term ''perfect'' refers to current observed meteorological data and the term climatological refers to the whole distribution of historical observed data. Figure 1 presents the ESP (Fig. 1a), the reverse ESP (Fig. 1b), the climatology (Fig. 1c), and the VESPA forecast ( Fig. 1d), as generated in W16. The ESP, the reverse ESP, the perfect forecast, and the climatology are all end points of the uncertainty in the sense that the uncertainty in those forecasts is either perfect or climatological. They are the end points of the VESPA approach.
VESPA aims to produce streamflow forecasts from IHCs and SCFs with an uncertainty situated between the perfect and the climatological uncertainty (Fig. 1d). Forecasts were generated by linearly blending the climatological and perfect IHCs (i.e., model moisture states) and the climatological and perfect SCFs (i.e., meteorological forcings of precipitation, evapotranspiration, and temperature), subsequently used to run the hydrological model. The weights used for blending the data were (w 5 0, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 1.0), applied so that a weight of zero is the perfect knowledge and unity is the climatological knowledge, with w IHC and w SCF denoting the weights used to blend the IHCs and the SCFs, respectively (W16). An ESP forecast results from the weights w IHC 5 0 and w SCF 5 1, the reverse ESP from w IHC 5 1 and w SCF 5 0, the perfect forecast from w IHC 5 0 and w SCF 5 0, and the climatology from w IHC 5 1 and w SCF 5 1.
To plot the skill of the VESPA forecasts as a function of the IHC and SCF skill, W16 used skill surface plots ( Fig. 2), interpolating forecast skill results from different IHC and SCF weighting combinations. The axes represent the SCF and IHC skill, derived respectively from the blending weights w SCF and w IHC using the following equations (W16): SCF skill 5 100 3 (1 2 w 2 SCF ) and (1) The SCF and the IHC skill values obtained from these equations are the percentage of climatological variance explained in the respective predictability source (i.e., SCF and IHC; W16). Each SCF skill-IHC skill combination corresponds to a specific VESPA forecast, the skill of which can be plotted on the skill surface plot (black plus signs in Fig. 2). The blue circles are the end points of the VESPA forecasts: the reverse ESP (revESP in Fig. 2), the perfect forecasts, the ESP, and the climatology (climo in Fig. 2). The skill surface plots are hence a graphical representation of the response surface obtained from the VESPA sensitivity analysis.
The VESPA seasonal streamflow forecasts were generated by W16 using lumped Sacramento Soil Moisture Accounting (SAC-SMA) and SNOW-17 catchment models for unimpaired catchments. The models were forced with daily inputs in precipitation, temperature, and potential evapotranspiration and were calibrated and validated against observed daily streamflow from the U.S. Geological Survey (USGS). Eighty-one skill variations of a 30-yr hindcast (from 1981 to 2010) were produced for 424 catchments in the CONUS, starting at the beginning of each month (i.e., forecast initialization dates), with lead times up to 6 months.

b. Alternative methods to the VESPA approach
In this paper we present two alternative methods of the VESPA approach, the EPI and the EPB. These methods aim to reproduce the response surface obtained from the VESPA approach by using the same 30-yr hindcast ensembles produced by W16, aggregated over the first three months with zero lead time for each initialization date (referred to as 3-month streamflow forecast hereafter) and corresponding exclusively to the end points (i.e., the ESP, the reverse ESP, the perfect forecast, and the climatology). The two new methods were tested for a subset of the CONUS-wide catchment dataset presented in W16 (Fig. 3), comprising 18 catchments from the large USGS Hydro-Climatic Data Network (HCDN; Lins 2012). The 18 selected catchments cover a large range of hydrometeorological conditions, including the maritime climate regime of the U.S. West Coast catchments; the humid regime of the eastern United States (south of the Great Lakes) with rainfall-driven runoff and variable winter snow in the most northern catchments; and the Intermountain West and northern Great Plains regions, where streamflow is greatly influenced by the snow cycle.

1) END POINT INTERPOLATION
The EPI produces a response surface by interpolating the forecast skill of the end points throughout the skill surface plot. Both linear (i.e., linear barycentric interpolation) and cubic interpolation techniques were tested. However, results will be shown for the linear interpolation only as the cubic interpolation did not provide noticeable improvements to the linear interpolation given that the interpolation is based on only four points situated at the corners of the response surface. The linear EPI was performed for each forecast initialization date and for each catchment.

2) END POINT BLENDING
The EPB generates hindcasts for each w SCF -w IHC combination (i.e., each plus sign in Fig. 2; w SCF and w IHC are selected to be the same blending weights used by W16, for the purpose of comparison). For each combination point, a new ensemble of 100 members was generated by blending the four end points, given a specific weighted average. The percentage of each end point used [EP(%); i.e., the number of members randomly selected from each end point], was calculated for each combination point using the following equation: where x EP and y EP are the w IHC and w SCF values of the end point for which the percentage is calculated, respectively. For example, if the w IHC and w SCF match the end point values, 100% of the EPB hindcast members are resampled from that end point (i.e., the end point skill is reproduced). This was done for each forecast initialization date and for each catchment.
To produce the skill surface plots for the EPB method, the SCF and IHC skill was calculated using the same equations as in W16 [i.e., Eqs. (1) and (2), respectively].

c. The evaluation strategy
The aim of this paper is to compare two computationally inexpensive alternative methods to the VESPA approach, the EPI and the EPB. To this end, the paper unfolds into two distinct objectives. First, we want to investigate whether the EPI and/or the EPB can discriminate the influence of IHC and SCF errors on seasonal streamflow forecasting uncertainties and reproduce VESPA skill elasticity estimates. This will validate the use of one or both methods as alternative to the VESPA approach. Second, we want to explore the sensitivity of the results obtained from the EPI, the EPB, and the VESPA methods to the choice of the verification score. This will be an attempt to demonstrate the importance of the choice of the verification score for forecast verification and communication. To explore the first objective of this paper, skill surface plots were produced for the EPI, the EPB, and the VESPA methods. As in W16, the seasonal streamflow forecast skill depicted in the skill surface plots was calculated from the R 2 of forecast ensemble means with the observations, where perfect forecasts (model simulations driven by the observed meteorology) were treated as observations to calculate the R 2 . As discussed at length in W16, this choice deliberately excludes the model errors as a source of forecast uncertainty.
The skill surface plots obtained from the EPI and the EPB methods were subsequently compared qualitatively and quantitatively to the skill surface plots obtained for the VESPA approach. The qualitative analysis consisted in visually inspecting the patterns contained in the skill surface plots, giving an indication of the dominant predictability source on the streamflow forecast skill. The quantitative analysis of the results was based on the calculation of the skill elasticities for the IHCs and the SCFs (E IHC and E SCF , respectively), for the EPI, the EPB, and the VESPA methods, averaged across three transects of a quadrant situated in the center of the response surface, according to the following equations: =3.
The numerators, expressed as S(F [Á]) 2 S(F [Á]), contain the gradients in the streamflow forecast skill between IHC skill (or SCF skill) values of 75% and 19% (the denominator). The values in between the square brackets of the numerator are the IHC skill followed by the SCF skill values, which indicates a certain w SCF -w IHC combination point in the example skill surface plot in Fig. 2. In the denominator, the IHC and SCF skill gradients are gradients in the percentage of the climatological variance explained in the respective predictability source. The skill elasticities (E IHC and E SCF ) are positively oriented, where a skill elasticity of zero is obtained when the predictability source has no influence on the skill of the streamflow forecast, while positive (negative) elasticities mean that an improvement in the predictability source will lead to higher (lower) streamflow forecast skill. The skill elasticities were calculated for all three methods and for the 3-month streamflow forecasts produced for each catchment and forecast initialization date. The only difference between Eqs. (4) and (5) and the skill elasticities calculated in W16 is that they chose to calculate skill elasticities around the ESP point in the skill surface plots. Here, we choose to calculate skill elasticities across a quadrant within the skill surface plot (between 75% and 19% of the climatological variance explained in the IHC and the SCF) in order for the skill elasticity values calculated in this paper to reflect the forecast skill gradients within the response surface. This is done differently to W16 because the aim of this paper is to compare (qualitatively and quantitatively) the skill surface plots obtained from the EPI and the EPB methods to the VESPA approach.

SURFACE TO THE CHOICE OF THE VERIFICATION SCORE?
To investigate the second objective of this paper, several verification scores were calculated for each method (i.e., the EPI, the EPB, and the VESPA approach). These scores were selected in order to cover key attributes of the forecasts verified, and they include d the mean absolute error (MAE) of forecast ensemble means, relative to the perfect forecasts and d the continuous rank probability score (CRPS) and its decomposition: The potential CRPS is the CRPS value that a forecast with perfect reliability would have. The uncertainty is the variability of the observations and the resolution is the ability of the forecast to distinguish situations with distinctly different frequencies of occurrence. The CRPS reliability is a measure of the bias and the spread of the system. The CRPS was chosen as it is a widely used score to assess the overall quality of an ensemble hydrometeorological forecast. The CRPS moreover has the advantage that it can be decomposed into different scores in order to look at the many different attributes of an ensemble forecast. The CRPS for a single forecast is equivalent to the MAE, which is why the latter was chosen.
For all of the above verification scores, the corresponding skill scores were calculated for each point of the skill surface plots from skill score forecast 5 1 2 score forecast score reference , where the score reference is the score of the climatology point, for each method, each initialization date, and each catchment. A perfect forecast results in a forecast skill score of unity and a forecast with equal quality as the reference forecast corresponds to a skill score of zero. Any forecasts of lower quality than the reference forecast produce negative skill score values. Skill scores were calculated in order to have a similar score range as the R 2 (i.e., a climatological score of zero and a perfect score of one), for comparative purposes. Skill elasticities were subsequently calculated for all the skill scores, using Eqs. (4) and (5), for all three methods and for the 3-month streamflow forecasts produced for each catchment and forecast initialization date. From these skill elasticity values, the influence of improvements in the IHCs and SCFs on the seasonal streamflow forecast skill can be assessed, in terms of the forecasts' overall performance (considering the mean of the ensemble or the full ensemble spread, from the MAE and the CRPS, respectively), their resolution and uncertainty (CRPSpot), and their reliability (CRPSreli).

Results
a. Can EPI and EPB discriminate the influence of IHC and SCF errors on seasonal streamflow forecast uncertainties?
For the first part of this study, the Crystal River (Colorado; USGS gauge 009081600), a snowmelt-driven catchment, will be used as a test case to illustrate the skill surface plots obtained from the EPI and the EPB methods, compared to the VESPA approach. Precipitation is the highest in winter and spring in this catchment and falls as snow between November and April. In April, the snow starts melting and consequently the soil moisture and streamflow both increase. Figure 4 displays the skill surface plots obtained for the VESPA (Fig. 4a), the linear EPI (Fig. 4b), and the EPB methods (Fig. 4c), from R 2 for the 3-month streamflow forecast for the Crystal River, for initializations on the first of each month (each row in Fig. 4). FIG. 4. Skill surface plots obtained for (a) the VESPA, (b) the linear EPI, and (c) the EPB methods. The skill is calculated from the R 2 of the 3-month streamflow forecast ensemble means against the perfect forecasts, for hindcasts produced from 1981 to 2010 for the Crystal River (USGS gauge 009081600), with forecast initializations on the first day of each month. Differences between the skill surface plots obtained for (d) the VESPA and linear EPI methods and (e) the VESPA and EPB methods are also shown. Figures 4d and 4e show the differences between the skill surface plots obtained for the VESPA and EPI methods and the VESPA and EPB methods, respectively. A first visual comparison of the skill surface plots obtained from the linear EPI method (Fig. 4b) and the EPB method ( Fig. 4c) with those obtained from the VESPA approach (Fig. 4a) for the Crystal River tells us that the skill surface plots obtained from all three methods are very similar. For each initialization date, the orientation of the gradients in streamflow forecast skill appears identical. The EPI and the EPB methods seem to correctly indicate the dominant predictability source on the 3-month streamflow forecast skill, for each initialization date for this catchment. Similar results were obtained for the other 17 catchments (see Figs. S1-S17 in the supplemental material). Forecasts made on the first of February, March, and September show a sensitivity to the SCF skill (i.e., horizontal or near to horizontal orientation of the streamflow forecast skill gradients), while all other forecasts are dominantly sensitive to the IHC skill (i.e., vertical or near to vertical orientation of the streamflow forecast skill gradients).
The gradients in streamflow forecast skill contained in the EPI skill surface plots (Fig. 4b) differ moderately from the gradients obtained from the VESPA approach (Fig. 4a). This can be observed in Fig. 4d, showing the differences between the skill surface plots obtained for both methods. The VESPA approach gives very strong gradients, causing a rapid decrease in streamflow forecast skill with a decrease in one of the predictability sources' skill, depending on the initialization date. In comparison, the EPI method indicates a gradual decrease in streamflow forecast skill with a decrease in one of the two predictability sources, depending on the initialization date. The streamflow forecast skill gradients produced by the EPI method are a reflection of the interpolation method used (i.e., here linear), and because the corner points lack information about describing curvature of the surface at interior points, they cannot fully capture nonlinearities in the skill gradients across the skill surface. For some interior points, this limitation of the EPI method could estimate very different skill elasticities than those obtained from the VESPA approach.
The skill surface plots produced by the EPB method (Fig. 4c) show minor differences in the streamflow forecast skill gradients when compared to the skill surface plots generated by the VESPA approach (Fig. 4a). This can be seen in Fig. 4e, which shows the differences between the skill surface plots obtained for both methods. To further inspect those differences, they will be explored quantitatively (i.e., by comparing the skill elasticities) below.
To quantify the accuracy of the patterns contained in the EPI and the EPB skill surface plots compared to the patterns of the VESPA skill surface plots, IHC and SCF skill elasticities (i.e., E IHC and E SCF , respectively) were calculated across a quadrant situated within the response surface for all three methods, for the 18 catchments and each forecast initialization date, from Eqs. (4) and (5), respectively. Figure 5 presents the skill elasticities for nine of the 18 catchments (the plots for the other nine catchments are shown in Fig. S18). Each plot corresponds to a catchment and shows the skill elasticities obtained from the VESPA, the linear EPI, and the EPB methods as a function of the forecast initialization date. From the nine different plots, the skill elasticities given by the EPB method appear almost identical to the VESPA approach, whereas the skill elasticities obtained from the EPI method differ in some places. This confirms that the patterns of the EPB method are very similar to the patterns of the VESPA approach, with it being the closest out of the two tested methods.
The value of the SCF skill elasticity (i.e., E SCF ) in relation to the value of the IHC skill elasticity (i.e., E IHC ), for a given method, indicates the dominant predictability source on the 3-month streamflow forecast skill (here calculated from the R 2 ). For a selected method, equal SCF and IHC skill elasticity values signifies that equal improvements in both the SCFs and the IHCs will lead to equal improvements in the streamflow forecast skill. If E SCF is superior (inferior) to E IHC , it reflects a larger potential increase in streamflow forecast skill by improving the SCFs (IHCs). Although the EPI method almost always indicates the same dominant predictability source as the two other methods, the degree of influence of changes in IHC and SCF skill on the streamflow forecast skill (i.e., the exact values of the skill elasticities) often differs. For many catchments and forecast initialization dates, the EPI appears to underestimate the skill elasticities produced by the VESPA method.
The nine different catchments for which the skill elasticities are presented in Fig. 5 display three different types of behavior, best captured by the VESPA approach and the EPB method. For the three catchments in Fig. 5 (left), improvements in the IHCs would yield the highest improvements in the 3-month streamflow forecast skill for spring to summer initializations (April-August for the Crystal River, March-July for the Fish River, and March-June for the Middle Branch Escanaba River) and in the winter (October-January for the Crystal River, November-December for the Fish River, and in December for the Middle Branch Escanaba River). SCF improvements would lead to better 3-month streamflow forecast skill for forecasts initialized in the late winter and summer to fall (February-March and September for the Crystal River, February and August-October for the Fish River, and January-February and July-September for the Middle Branch Escanaba River). For the three catchments in Fig. 5 (middle), a notable feature is that the 3-month streamflow forecast skill would benefit from SCF improvements for summer initializations (June-September for the Chattooga and the Nantahala Rivers and July-September for the New River). Finally, for the three catchments in Fig. 5 (right), the 3-month streamflow forecast skill would benefit from improvements in the SCFs for all initialization dates. This is true with the exception of forecasts initialized in December for East Fork Shoal Creek. It is important to note that there is uncertainty around these estimates. However, this is a good first indication of the sensitivity of 3-month streamflow forecast skill (measured from the R 2 ) to IHC and SCF errors for each forecast initialization date and each catchment.
The skill elasticities produced by the EPB method appear to be almost identical to the skill elasticities obtained from the VESPA approach, with occasional marginal differences. This suggests that the EPB method captures nearly exactly the degree of influence of changes in IHC and SCF skill on the streamflow forecast skill, obtained from the VESPA approach. Both methods additionally indicate the same dominant predictability source: the predictability source which, once improved, could lead to the largest increase in 3-month streamflow forecast skill. The EPB method will therefore be used as an alternative to the VESPA approach to investigate the second objective of this paper.
b. What is the sensitivity of the response surface to the choice of the verification score?
To investigate the sensitivity of the response surface to the choice of the verification score, and therefore to FIG. 5. Streamflow forecast skill elasticities for the IHCs (i.e., E IHC , solid line) and the SCFs (i.e., E SCF , dashed line), calculated across a quadrant situated within the 3-month streamflow forecast skill surface plots for the VESPA (red), the linear EPI method (gray), and the EPB method [blue; using Eqs. (4) and (5)]. Each plot shows the evolution of the IHC and SCF skill elasticities with the initialization date for a given catchment. The climatological regions of the catchments are indicated in the plots' headings. The skill surface plots from which these skill elasticities were calculated are presented in Fig. 4 and Figs. S1-S17. the attribute of the forecast, several scores were computed to evaluate the streamflow forecast quality. The R 2 , the mean absolute error skill score (MAESS), and the continuous rank probability skill score (CRPSS) were calculated to evaluate the forecasts' overall performance in terms of the ensemble mean and the entire ensemble. The potential CRPSS (CRPSSpot) was computed to look at the forecasts' resolution and uncertainty, and the CRPSS reliability (CRPSSreli) was computed to look at the forecasts' reliability. The Crystal River (USGS gauge 009081600) will here again be used as a test case to illustrate this part of the results. Figure 6 presents the IHC and SCF skill elasticities [i.e., E IHC and E SCF ; in Fig. 6 (top) and Fig. 6 (bottom), respectively] as a function of forecast initialization date for the Crystal River catchment. These are calculated from Eqs. (4) and (5), for all the mentioned verification scores, for the VESPA approach (Fig. 6a) and the EPB method (Fig. 6b). If we compare the skill elasticities obtained from the VESPA approach with the skill elasticities obtained from the EPB method, it appears that both methods produce very similar elasticities for the R 2 , the MAESS, and the CRPSS. This further confirms the results of the first part of the analysis, which highlighted the similarity of the EPB results to the VESPA results and extends it to multiple attributes of the seasonal streamflow forecasts. However, slight differences between the skill elasticities produced by the two methods can be observed for the CRPSSpot, and significant differences exist for the CRPSSreli. These dissimilarities are discussed further below.
If we now compare the skill elasticities obtained for the various verification scores for both methods, it is clear that the R 2 , the MAESS, the CRPSS, and the CRPSSpot give very similar skill elasticities. This hints that those verification scores overall agree on the degree of influence of changes in IHC and SCF skill on the streamflow forecast skill. However, a few dissimilarities can be observed for some of the forecast initialization dates. This is, for example, the case for forecasts made in the spring and in summer, where the E IHC appears lower for the MAESS and the CRPSS (and the CRPSSpot for the VESPA approach) compared to the E IHC obtained for the R 2 for both methods. It is also apparent for forecasts made on the first of February, March, and September, where the E SCF calculated for the MAESS and the CRPSS (and the CRPSSpot for the VESPA approach) is lower than the E SCF obtained for the R 2 for both methods. For both examples, it infers that improvements in the IHC and the SCF skill could lead to larger improvements in the streamflow forecast skill in terms of the R 2 rather than in terms of the MAESS and the CRPSS (and the CRPSSpot for the VESPA approach). Overall, this indicates that the degree of influence of changes in IHC and SCF skill on the streamflow forecast skill differs relative to the choice of the verification score.
While the R 2 , the MAESS, the CRPSS, and the CRPSSpot give a very similar picture, the skill elasticities obtained for the CRPSSreli appear very different, occasionally reaching negative values. These negative values indicate a loss in streamflow forecast skill (in terms of the forecast reliability) as a result of improvements in one of the two predictability sources, while all the other verification scores suggest a gain in streamflow forecast skill (in terms of the forecast ensemble mean and the ensemble overall performance, its resolution, and uncertainty) with improvements in one of the two predictability sources.
The substantial differences in skill elasticities obtained for the CRPSSreli from the VESPA versus EPB method suggest that there are limitations to the ability of EPB to reconstruct the full ensemble information present in VESPA, and of VESPA (applied with relatively small ensembles at the end points) to estimate sensitivities for complex verification scores such as reliability. The reliability verification score is influenced by the combination of bias, spread, and other ensemble properties and exhibits more noisy outcomes here than were obtained for other verification scores. A negative elasticity may occur because the ensemble spread has narrowed without sufficient improvements in bias, for instance. The behavior of the elasticity of reliabilities is even more difficult to diagnose, but we suspect that the presence of noise (erroneous local minima or maxima) or curvature in the associated VESPA skill surface greatly undermines the linear blending techniques.
Overall, these results suggest that improvements in the skill of either of the two predictability sources will impact streamflow forecast skill differently depending on the attribute (i.e., verification score) of the forecast skill that is considered and whether the ensemble mean or the full ensemble is used.

a. Implications and limitations of the results
W16 introduced the VESPA approach, a sensitivity analysis technique used to pinpoint the dominant predictability source of seasonal streamflow forecasting (i.e., the IHCs and the SCFs), as well as quantifying improvements that can be expected in seasonal streamflow forecast skill as a result of realistic improvements in those key predictability sources. Despite being a powerful sensitivity analysis approach, VESPA presents two key limitations.
1) It is computationally intensive, requiring multiple ensemble hindcasts to define the skill response surface (81 were used in the VESPA paper vs one for the EPB and the EPI techniques). 2) It requires a complex state and forcing blending procedure that may introduce additional uncertainties, biases, or interactions between the predictability sources (Saltelli et al. 2004;Baroni and Tarantola 2014) that are not accounted for or difficult to quantify. This is not necessary in any of the end points required in the two approaches presented here, which rely instead on analyzing a single conventional hindcast dataset that is more likely to be feasible for forecasting centers.
The central aim of this paper was to address the first limitation of the VESPA approach by presenting two computationally inexpensive alternative methods: the EPI and the EPB methods. Both methods successfully identified the dominant predictability source of 3-month streamflow forecasts for a given catchment and forecast initialization date (i.e., given by the orientation of the streamflow forecast skill gradients in the skill surface plots). However, the EPB was more successful in reproducing the VESPA skill elasticities-the exact streamflow forecast skill gradients situated within the skill surface plots (for skill and accuracy verification scores including the R 2 , the MAESS, the CRPSS, and the potential CRPSS to a certain extent). These skill elasticities indicate the influence of changes in IHC and SCF skill on streamflow forecast skill.
The new methods, by differing in their setup from the VESPA approach, do not inherit the drawbacks specific to this approach and mentioned above. The EPI and the EPB methods nevertheless have their own limitations.
The EPI (both for the linear and cubic interpolation methods; the latter was not shown) did not fully capture the VESPA skill elasticities because of the nature of the method that produces predefined gradients within the skill surface plots (i.e., defined by the interpolation method used). Additionally, curvature or local minima or maxima (if any) of the response surface cannot be represented by the EPI method. The EPB, on the other hand, performs better at reflecting curvature in the skill response surface, hence local elasticities between the end points. The EPB method aimed at reproducing VESPA elasticities only by manipulating the output of a single hindcast dataset (interpreted as ESP, reverse ESP, the perfect forecast, and climatology). The EPB method cannot match exactly the forecasts created by the VESPA approach, as it does not account for the idiosyncrasies in model forecast behavior, such as interactions between the predictability sources. Furthermore, it is likely that the more the model investigated is nonlinear or exhibits skill response thresholds, the more the results obtained from the EPB method will differ from the ones obtained from the VESPA approach. These results overall allow that the EPB method can be used as an inexpensive alternative method to the VESPA approach, yet with the potential limitations of the method stated above.
For the first part of the analysis, the streamflow forecast quality was evaluated in terms of the forecasts' skill from the R 2 . The use of multiple verification scores is, however, essential to obtain a more complete perspective of forecast quality. Thus, we explored the performance of the two new methods and the VESPA approach for a range of additional verification scores. The results, presented for the EPB method and the VESPA approach, showed differences in the response surfaces obtained for the various verification scores (i.e., the R 2 , the MAESS, the CRPSS, and its decomposition). This suggests distinct sensitivities of the seasonal streamflow forecast attributes (i.e., overall performance of the forecast ensemble mean and its full ensemble, forecast resolution, uncertainty, and reliability) to changes in the IHC and SCF skill. Ideally, a sensitivity analysis should be goal oriented, that is, it should be performed with prior knowledge of the intended use of the results (Saltelli et al. 2004;Pappenberger et al. 2010;Baroni and Tarantola 2014), which may favor using one verification score over another.
This paper covered selected limitations of the work presented by W16. However, many areas were left unexplored and could be interesting topics in which to focus future research. First, a major area inherent to model-based sensitivity analyses is that their results are model dependent (Saltelli et al. 2000); thus, the extent to which they can be transferred to reality depends on the model fidelity. The results presented in this paper are specific to the forecasting system and similar systems on which this analysis was based and should be used as an indicator of catchment sensitivities. As noted in W16, an extension of the elasticity analysis to include observations and a model error component would provide valuable insights. Another possible approach could be to use the results from various forecasting systems as input to the sensitivity analysis, in order to achieve a multimodel consensus view of the skill. As shown in Cloke et al. (2017), a multimodel forcing framework can be highly beneficial for streamflow forecasting compared to a single model forecasting approach, provided the models are chosen judiciously so as to provide a rational characterization of forecasting uncertainty. Second, the dependence of blending technique performance versus VESPA on the characteristics of the skill surface (e.g., linear or nonlinear) bears further investigation. Finally, this sensitivity analysis leaves generic the concept of improvements in either of the predictability sources, although the space-time nature of improvements may be consequential. This work could therefore be extended by studying the effect of degradations in the temporal and spatial accuracy of the input data, thereby indicating the relative value of improvements in the spatial or temporal predictability for a specific catchment and a specific time of the year.

b. The wider context
The new strategy of operational forecasting centers is to move toward more integrated operational modeling and forecasting approaches, such as land surfaceatmosphere coupled systems, and beyond that, Earth system models. These advances are enabled by the continuous growth of computing capabilities, a better understanding of physical processes and their interactions throughout all compartments of the Earth system, and the availability and use of more and better observation data (i.e., satellite data). Despite all these advances, most forecasts still reflect substantial uncertainty that grows with time and limits the predictability of observed events beyond a few weeks of lead time. The rapid progress has led our systems to be ever more data hungry as increases in model complexity and resolution are sought. These computationally expensive developments are not always feasible; hence, model developers must be creative and constantly weigh the costs and benefits of improving one aspect over another, such as increasing the resolution or complexity of the models (Flato 2011).
In this context, sensitivity analyses appear more than ever as a natural tool to establish priorities in improving predictions based on Earth system modeling. Such analyses are a powerful and valuable tool to support the examination of uncertainty and predictability across spatial and temporal scales and for various applications. They can be used for a large range of activities, including examining model structure, identifying minimum data standards, establishing priorities for updating forecasting systems, designing field campaigns, and providing realistic insights into the potential benefits of efforts to improve a forecasting system to managers with prior knowledge of their costs (Cloke et al. 2008;Lilburne and Tarantola 2009;W16).
However, sensitivity analyses must be easily reproducible to be effective in supporting each new model or forecast system update, and the results should easily be applied in order to constitute a ''continuous learning process'' (Baroni and Tarantola 2014). In other words, a sensitivity analysis should be a simple, tractable tool for addressing a multifaceted challenge.

Conclusions
This paper presents two computationally inexpensive alternative methods to the VESPA approach for estimating forecast skill sensitivities and elasticities. Of these, the end point blending (EPB) method provides a useful substitute to the VESPA approach. Despite the existence of some differences between the EPB and the VESPA outcomes, the EPB successfully identifies the dominant predictability source (i.e., IHCs and SCFs) of seasonal streamflow forecast skill, for a given catchment and forecast initialization date. The EPB method can additionally reproduce the VESPA forecast skill elasticities, indicating the degree of influence of changes in IHC and SCF skill on the streamflow forecast skill. The paper also draws attention to how the choice of verification score impacts the forecast's sensitivity to improvements made to the predictability sources. With a good understanding of the limitations of the methods, such a sensitivity analysis approach can be a valuable tool to guide future forecasting and modeling developments.