Seasonal hydroclimatic ensemble forecasts anticipate nutrient and suspended sediment loads using a dynamical-statistical approach

Subseasonal-to-seasonal (S2S) water quantity and quality forecasts are needed to support decision and policy making in multiple sectors, e.g. hydropower, agriculture, water supply, and flood control. Traditionally, S2S climate forecasts for hydroclimatic variables (e.g. precipitation) have been characterized by low predictability. Since recent next-generation S2S climate forecasts are generated using improved capabilities (e.g. model physics, assimilation techniques, and spatial resolution), they have the potential to enhance hydroclimatic predictions. Here, this is tested by building and implementing a new dynamical-statistical hydroclimatic ensemble prediction system. Dynamical modeling is used to generate S2S flow predictions, which are then combined with quantile regression to generate water quality forecasts. The system is forced with the latest S2S climate forecasts from the National Oceanic and Atmospheric Administration’s Climate Forecast System version 2 to generate biweekly flow, and monthly total nitrogen, total phosphorus, and total suspended sediment loads. By implementing the system along a major tributary of the Chesapeake Bay, the largest estuary in the US, we demonstrate that the dynamical-statistical approach generates skillful flow, nutrient load, and suspended sediment load forecasts at lead times of 1–3 months. Through the dynamical-statistical approach, the system comprises a cost and time effective solution to operational S2S water quality prediction.


Introduction
Hydroclimatic predictions at the subseasonal-to-seasonal (S2S) timescale are critically needed to inform water-related decisions and policies in multiple sectors, including agriculture (Shukla et al 2014), energy (De Felice et al 2015), and water resources management (Wood and Lettenmaier 2006). The S2S timescale is defined as spanning the period between 2 weeks and 9 months (National Academies of Sciences, Engineering and Medicine 2016). This timescale has remained a major weather-climate prediction gap, since it encompasses the time frame where most of the information from atmospheric initial conditions is lost and the timeframe is too short to be strongly influenced by climate modes of variability (National Academies of Sciences, Engineering and Medicine 2016). However, recent and significant improvements in coupled atmosphere-ocean general circulation models (GCMs) in terms of observational datasets, spatial resolutions, model physics, initial conditions, and assimilation techniques, are providing new opportunities to potentially enhance hydrological forecasting at the S2S timescale (Li et al 2009, Shukla and Lettenmaier 2011, Mo et al 2012, Yuan and Wood 2012, Yossef et al 2013, Mendoza et al 2017. To test this idea, we build and implement a new dynamicalstatistical hydroclimatic ensemble prediction system. The system is forced with outputs from the National Oceanic and Atmospheric Administration's (NOAA's) National Centers for Environmental Prediction Climate Forecast System version 2 (CFSv2). CFSv2 is a recently developed, state-of-the-science GCM (Saha et al 2014) designed to enhance global S2S climate predictions.
Previous S2S hydroclimatic studies have demonstrated the increased value of CFSv2 outputs over previous generation S2S climate forcing datasets (Yuan and Wood 2012, Yuan et al 2013, and shown that new hydrological forecast systems can be built to seamlessly integrate weather and climate forecast timescales (Yuan et al 2014). These studies have not, however, proven the ability of S2S climate predictions to support water quality forecasting. We show here for the first time the ability of S2S climate predictions to result in both skillful water quantity and quality forecasts. This has important implications for US national-level, and possibly other regions, operational hydrological forecasting strategies (Demargne et al 2014, Gochis et al 2015, Salas et al 2018. We demonstrate that it is possible to equip current operational forecasting systems, in a cost and time efficient manner, with new prediction capabilities for water quality, which are currently lacking in the US. Water quality predictions are increasingly needed at the S2S timescale, e.g. to anticipate freshwater ecological disasters such as harmful algal blooms (Michalak et al 2013, Michalak 2016. By increasing the lead time at which excess nutrient loadings are forecasted, mitigating actions could be coordinated and implemented further in advance to reduce nutrient-related impacts on sensitive water bodies, estuaries and lakes (National Academies of Sciences, Engineering and Medicine 2016).
Several approaches have been adopted to generate S2S hydroclimatic predictions. The approaches can be generally grouped into three main categories: statistical, dynamical, and hybrid. The statistical approach uses an empirical relationship between the streamflow forecasts and potential sources of predictability originating from various processes in the atmosphere, ocean, and land (Pagano et al 2009, Schepen et al 2016, Mendoza et al 2017, Lee et al 2018, Slater and Villarini 2018. In the dynamical approach, outputs from GCMs are used as inputs into a hydrological model to produce streamflow forecasts (Yuan and Wood 2012). The hybrid approach that combines the strengths of both the dynamical and statistical approach (Mendoza et al 2017) is implemented here.
Dynamic hydrologic simulation models often include a module to simulate water quality constituents (Bekele and Nicklow 2005, Easton et al 2008, Abbaspour et al 2015. However, such systems require a wide range of spatial and temporal observations to be properly configured for forecasting purposes. Water quality data are scarce, making the effective calibration of associated model parameters a difficult challenge (Jackson-Blake et al 2017). On the other hand, current operational hydrological forecasting systems have been designed primarily for streamflow forecasting purposes (Alfieri et al 2014, Demargne et al 2014, Pagano et al 2016. Upgrading these modeling systems with process-based water quality capabilities would require considerable effort and time. An alternative, the one being proposed here, is to implement a hybrid dynamical-statistical modeling approach that combines dynamic hydroclimatic predictions with statistical models to efficiently generate water quality forecasts. Additionally, several recent studies have shown using observations that precipitation is a dominant driver of monthly nutrient loads at the basin scale (Bastola and Misra 2015, Sinha and Michalak 2016, Sinha et al 2017. This supports the possibility of using S2S precipitation predictions to forecast nutrient loads, providing the motivation for the present research. The objective of this letter is to demonstrate the ability of S2S climate forecasts to result in both skillful water quantity (streamflow) and quality (nutrients and suspended sediments) predictions using a new dynamical-statistical ensemble prediction system. This system is comprised of: (i) hydrometeorological and water quality observations, (ii) S2S climate reforecasts from the CFSv2, (iii) a distributed hydrological model, and (iv) a statistical water quantity/quality model. The system is used to generate S2S ensemble predictions of streamflow, total nitrogen (TN), total phosphorus (TP), and total suspended sediment (TSS) at the biweekly and monthly timescales out several months. The generated water quantity and quality forecasts are used to address the following questions: How skillful are the CFSv2-based flow forecasts? Can the CFSv2based flow forecasts be used to anticipate nutrient and suspended sediment loads? What is the relative importance of different sources of uncertainty (climatic and hydrologic) in the flow, nutrient load, and suspended sediment load forecasts? How does the skill of the water quantity and quality forecasts vary with different forecasting conditions (e.g. lead time, season, and flow threshold) and watershed characteristics (e.g. basin size and land cover)? The remainder of the paper is organized as follows. Section 2 describes the datasets and methods used, section 3 presents the main results, and section 4 discusses key findings and their implications.

Forecasting and observational datasets
We use S2S precipitation and near-surface air temperature predictions from the CFSv2 (https://cfs.ncep. noaa.gov/) as forcing to generate S2S water quantity and quality forecasts. The CFSv2 is a fully coupled dynamical prediction system representing the interactions among the ocean, land, and atmosphere. The system runs at T126L64 spatial resolution (∼0.94°G aussian grid spacing or ∼100 km) and at 6-hourly temporal resolution (Saha et al 2014). We utilize a total of 14 years (2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014)(2015)(2016) of CFSv2 reforecast data. CFSv2 consists of 4 control runs per day at 00, 06, 12, and 18 Coordinated universal Time (UTC) cycles, out 9 months and initiated every 5th day beginning in January 1st of each year. We use 90 d of CFSv2 forecasts because previous findings have shown that CFSv2 forecasts for precipitation and near-surface temperature tend to be the most skillful at lead times<90 d (Tian et al 2017). For both precipitation and near-surface temperature forecasts, time-lagged ensembles consisting of 8 members are used. The ensembles are obtained by combining 4 runs per day from 2 consecutive initialization dates. The CFSv2 data are bilinearly interpolated onto the 4×4 km 2 grid cell resolution of the hydrological model used. Note that the CFSv2 provides global forecasts, so that our proposed methods have potential to be implemented in basins across the US and the world, provided the required observational data and models are available.
For the hydrometeorological observational datasets, we use multi-sensor precipitation estimates (MPEs), gridded near-surface air temperature, and streamflow observations at selected US Geological Survey (USGS) gages. These observational datasets are used to calibrate and verify the hydrological model, perform the hydrological model simulations, and obtain initial conditions for the forecasting runs. The forecasting runs are performed for the period 2003-2016. Both hourly gridded MPEs and near-surface air temperature data at 4×4 km 2 are obtained from the NOAA's Middle Atlantic River Forecast Center. Daily streamflow observations for the selected gaged locations are obtained from the USGS (https:// waterdata.usgs.gov/nwis/rt). Additionally, we use monthly estimates of TN, TP, and TSS loads at the selected gage locations to build and test a new statistical model for water quality forecasting. Water quality data are obtained from the Chesapeake Bay Nontidal Network Stations (https://sciencebase.gov/catalog/ item/59c65eb1e4b017cf313f0aea) (Moyer 2016).

Dynamical-statistical approach to S2S water quantity and quality predictions
We propose and describe a modular dynamicalstatistical approach for generating consistent S2S water quantity and quality ensemble forecasts. The approach combines dynamical CFSv2-based flow predictions with a statistical model to generate nutrient and suspended sediment loads at S2S timescales. To build and implement this dynamical-statistical system, the following three main modeling components are integrated: (i) a statistical climate preprocessor, (ii) a distributed hydrological model, and (iii) a statistical water quantity/quality model. The statistical climate preprocessor is used to correct systematic biases in the CFSv2 predictions. GCM outputs from the CFSv2 are often characterized by the presence of systematic biases (Saha et al 2014), which propagate through the hydrological model to result in biased streamflow forecasts (Yuan and Wood 2012). Statistical preprocessing techniques are used to correct such biases (Shukla and Lettenmaier 2011, Yuan and Wood 2012, Crochemore et al 2016, Lucatero et al 2018. We use a logistic regression model to statistically preprocess the CFSv2 forecasts (Messner et al 2014a, 2014b, Yang et al 2017, Sharma et al 2018. This model has been successfully used before with weather forecasts (Sharma et al 2019), and it is tested here for the first time with climate forecasts. Different logistic regression models are used for near-surface temperature and precipitation forecasts to take into consideration that precipitation is a censored variable, i.e. it can only take positive values. The logistic regression models used have the important advantage of accounting for forecast heteroscedasticity by using the ensemble spread as a predictor. Additionally, they allow obtaining the full continuous predictive probability density function of the forecast variables. The logistic regression models are described in detail in the supplementary text 1, available online at stacks.iop.org/ERL/14/ 084016/mmedia.
The statistically preprocessed CFSv2 predictions are then used as forcing to a hydrological model to generate the S2S streamflow forecasts. NOAA's Hydrology Laboratory-Research Distributed Hydrologic Model (HL-RDHM) is used as the spatially distributed hydrological model (Koren et al 2004). HL-RDHM is run at a 4 km spatial resolution and daily time step. A description of the HL-RDHM model is provided in supplementary text 2, including the calibration of the model. The streamflow forecasts from HL-RDHM are statistically postprocessed to remove any systematic biases using quantile regression (QR) (Koenker 2005). Uncertainties in the hydrologic model structure, parameters, and initial conditions could result in biased streamflow forecasts. In addition, postprocessing often corrects errors in the hydrological model that are ignored during parameter calibration Schaake 2008, Yuan andWood 2012). Different QR models are developed to correct systematic biases in the streamflow forecasts and to generate bias-corrected TN, TP and TSS load forecasts. QR has the relevant advantages of accounting for non-Gaussian error distributions and being less sensitive to the tail behavior of the forecasts (Koenker 2005). The latter is because QR results in conditional quantiles rather than conditional means and, consequently, the regression coefficients are robust with respect to outliers. The QR models are implemented separately for each lead time and forecast location. Note that some of the statistical models commonly used to estimate water quality constituents (Preston and Brakebill 1999, Runkel et  regression techniques, making them susceptible to outliers and requiring that model errors do not deviate significantly from normality, which is often not the case. The QR models employed here, which are described in detail in the supplementary text 3, overcome these drawbacks.
Both the logististic regression and QR models are implemented for the period 2003-2016 using a leaveone-out cross-validation approach. For this, we select a stationary period of 13 years for training and the remaining 1 year for verification purposes. This is repeated until all the 14 years have been statistically processed and verified independently of the training period. This is done so that no training data are discarded and the entire 14 year period of analysis can be used to generate the water quantity/quality forecasts.

Implementation and verification strategy
We implement the dynamical-statistical ensemble prediction system in the James River Basin (JRB). The JRB is the third largest tributary to the Chesapeake Bay, after the Susquehanna and Potomac River basins. The Chesapeake Bay is the largest estuary in the US. Nutrient pollution is a major environmental and economic concern in the Chesapeake Bay (Pyke et al 2008). We select the JRB because S2S water quantity and quality predictions are particularly challenging in this basin. Among other factors, this is due to the basin having little influence from known climate teleconnections (Leathers et al 2008) and having a relatively small fraction of agricultural land cover (table 1), which is typically a major source of excess nutrients in rivers. Therefore, we expect that any gains in S2S water quantity and quality predictability in this basin would be indicative of potential gains in many other US basins with more favorable forecasting conditions, e.g. where teleconnections are stronger and agricultural activities more dominant. The JRB has an overall drainage area of 17 504 km 2 and a humid subtropical climate, with mean annual precipitation and nearsurface temperature of ∼1118 mm and ∼13.15°C, respectively (Kang and Sridhar 2017). The land cover distribution is ∼75% forest, ∼15% cropland and pasture, and ∼7% urban. In the JRB, we select six different USGS daily gage stations as the streamflow and water quality forecast locations (table 1). These stations are the Bullpasture River at Williamsville (BPRV2; USGS gauge 02015700), Calfpasture River above Mill Creek at Goshen (GOHV2; USGS gauge 02020500), James River at Blue Ridge (JBIV2; USGS gauge 02024752), Rivanna River at Palmyra (PYAV2; USGS gauge 02034000), James River at Cartersville (CARV2; USGS gauge 02035000), and James River near Richmond (RMDV2; USGS gauge 0203700). This last gage location is at the overall outlet of the JRB, just upstream of the tidal section of the James River, while the other gages represent nested subbasins in the JRB.
A map of the JRB showing the gage locations is included in the supplementary material.
The annual average daily discharge at the outlet (RMDV2) for the chosen study period 2003-2016 is 78 000 m 3 s −1 . The year 2003 shows the highest annual average streamflow (151 000 m 3 s −1 ) and annual TN load (905 600 lbs), while the lowest annual average streamflow (42 500 m 3 s −1 ), and the lowest annual TN (145 820 lbs), TP (2 969 lbs), and TSS load (9902 700 lbs) occur in the year 2008. The selected forecast locations show different behaviors in terms of the nutrient loads. RMDV2 and CARV2 exhibit low variations in TN concentrations despite large variations in discharge, implying that these basins behave chemostatically (Thompson et al 2011). The concentration variability relative to discharge (expressed as the ratio of variation of concentration and discharge or CV C /CV Q , table 1) for RMDV2 and CARV2 is ∼0.20. In contrast, TN concentrations decrease with increasing discharge at JBIV2, PYAV2, GOHV2 and BPRV2, implying that dilution is the dominant process controlling TN concentrations in these basins (table 1). In the case of TP and TSS, the forecast locations do not reveal any chemostatic behavior. Further information about the selected basins is included in table 1.
To verify the skill of the water quantity and quality forecasts, we use the mean continuous ranked probability skill score (CRPSS) and the mean Brier Skill Score (BSS) (Hersbach 2000). The CRPSS is used because it accounts for ensemble or probabilistic information when computing the prediction skill. The BSS is used because it remains proper when the predicted variable is conditioned, e.g. when computing the skill of flows above a given threshold. The CRPSS is derived from the continuous ranked probability score (CRPS), which evaluates the overall accuracy of a probabilistic forecast by estimating the quadratic distance between the forecasts' cumulative distribution function and the corresponding observations. Thus, the CRPS accounts for both the forecast ensemble spread, or uncertainty, and forecast accuracy. Let the probability distribution function of the ensemble forecasts be ( ) p z and the actual value be z . a Then, the CRPS is defined as where ( ) P z f and ( ) P z a are cumulative functions given by respectively. H is the Heaviside step function which is 1 if the argument is positive and zero otherwise. To measure the skill of the forecasting system relative to a reference system, the associated skill score or CRPSS is computed as where CRPS m and CRPS r are the average CRPS values for the main forecasting system (i.e. the system to be evaluated) and reference forecast system, respectively. The CRPSS ranges from -¥ [ ] , 1 . Positive CRPSS values indicate the main forecasting system has higher skill than the reference forecasting system, with 1 indicating perfect skill. In this study we use sampled climatology as the reference forecasting system.
The Brier score (BS) is analogous to the mean square error over the n forecast-observation pairs, but where the forecast is a probability and the observation is either 0 or 1. The BS is given by In order to measure the skill score of the main forecast system with respect to the reference forecast, the associated skill score or BSS is computed as where BS m and BS r are the BS values for the main and reference forecasting system, respectively. Similar to the CRPSS, a BSS of 0 indicates no skill and a BSS of 1 indicates perfect skill. Besides the CRPSS and BSS, we also employ a commonly used deterministic metric, the correlation coefficient, to measure the degree of linear association between the mean ensemble forecasts and corresponding observations. Confidence intervals for the verification metrics are determined using the stationary block bootstrap technique (Politis and Romano 1994). The verification metrics (CRPSS, BSS, and correlation coefficient) are all computed following the leave-oneout cross-validation approach previously described for the forecasting system. That is, we select one year for verification purposes, and the remaining years for training. This is repeated for every year until all the available data have been postprocessed and verified independently of the training period.  1). The QR statistical model is used to simulate TN, TP and TSS loads in the selected basins. Note that the QR model is used to both simulate and forecast water quality by using the simulated and forecasted streamflow, respectively, as predictors. We find, overall, that the performance of the simulated TN, TP and TSS loads with the QR model is comparable to that of streamflow, with NSE values greater than 0.6 (table 1). For most of the basins, the NSE values for the simulated TN loads are slightly better than those for the TP and TSS loads, which tend to be very similar across basins, due to lower variability in TN concentrations relative to TP and TSS concentration variability. In general, the performance of the QR model in simulating TN, TP, and TSS loads is satisfactory.

Skill of the raw and preprocessed CFSv2 predictions
Prior to evaluating the water quantity and quality forecasts, we assess the performance of the CFSv2 forecasts using the CRPSS as the measure of forecast skill. The CRPSS (relative to sampled climatology) is determined for both the raw and preprocessed CFSv2 near-surface temperature (figures 1(a), (b)) and precipitation (figures 1(c), (d)) ensemble forecasts. The forecast verification is done for lead times of up to 3 months using both biweekly and monthly aggregations. To verify the CFSv2 ensemble predictions, each CFSv2 grid cell is treated as a separate verification unit. Thus, for the JRB, the CRPSS values are obtained by averaging the verification results from the different verification units that fall within the overall basin boundary.
The raw biweekly ensemble, near-surface temperature forecasts demonstrate positive skill up to ∼4 weeks, beyond which the skill continues to degrade, becoming worse than the sampled climatology ( figure 1(a)). This translates into monthly accumulations that are mostly skillful at the month 1 lead time ( figure 1(b)). In the case of raw ensemble precipitation forecasts, positive skill is observed only for lead times up to ∼2 weeks for biweekly accumulations (figure 1(c)); hence, the skill of monthly accumulations is low for month 1 and negligible or negative beyond that ( figure 1(d)). The CFSv2 raw precipitation forecasts demonstrate lower skill than the near-surface temperature forecasts across all the lead times for both biweekly and monthly accumulations.
After statistically preprocessing the raw CFSv2 forecasts, both the near-surface temperature and precipitation forecasts are more skillful than the raw forecasts (figure 1). The relative improvements in skill are generally greater at the later forecast lead times. As was the case with the raw forecasts, the preprocessed nearsurface temperature forecasts remain more skillful than the precipitation forecasts. The biweekly preprocessed, near-surface temperature forecasts remain skillful up to ∼5 weeks (figure 1(a)), with ∼1 week of skill gain compared to the raw forecasts. This skill gain in the biweekly forecasts results in enhanced forecast skill at the monthly timescale ( figure 1(b)), especially for months 1 and 2. A similar trend in skill improvement is observed in the preprocessed precipitation forecasts. The preprocessed, biweekly accumulated precipitation forecasts show ∼2 weeks of skill gain (figure 1(c)), resulting in skillful biweekly accumulations up to ∼4 weeks.

Skill of the S2S streamflow predictions
By using the statistically preprocessed CSFv2 precipitation and near-surface temperature forecasts as forcing to the distributed hydrological model HL-RDHM, we generate S2S raw ensemble streamflow forecasts. The raw forecasts are then statistically postprocessed to remove any systematic biases using the QR model. The CRPSS relative to sampled climatology is used to assess the skill of the raw and postprocessed ensemble streamflow forecasts (figure 2). The CRPSS is computed for all the basins using the biweekly accumulated streamflow forecasts. Overall, the results show that the CFSv2 outputs are able to produce skillful ensemble streamflow forecasts at S2S timescales (figure 2). However, the skill performance of the forecasts varies with the forecast lead time and basin size. We consider basins with area less than 2000 km 2 as small (BPRV2, GOHV2 and PYAV2) and basins with area greater than 7500 km 2 as large (JBIV2, PYAV2, and RMDV2). For small basins (figures 2(a)-(c)), the raw biweekly ensemble streamflow forecasts show positive skill up to ∼2 (GOHV2) and 3 weeks (PYAV2), whereas large basins (figures 2(d)-(f)) remain skillful up to ∼3 (JBIV2) and 5 weeks (CARV2). Since the native CFSv2 data have a coarse spatial resolution (∼100×100 km 2 ), the lower skill in the small basins could be due to the reduced ability of the CFSv2 outputs to capture the spatial variability of the climate variables in these basins, in particular precipitation.
After statistically postprocessing the S2S ensemble streamflow forecasts, the forecast skill improves relative to the raw forecasts across lead times (figure 2). In general, the relative improvements in skill are greater at the later forecast lead times. The relative skill improvements can be as low as ∼5% (CARV2 and RMDV2 at week 2, figures 2(e), (f)) to as high as ∼30% (RMDV2 at week 6, figure 2(f)). The skill gain achieved by postprocessing with respect to the raw forecasts is ∼2 weeks. The CRPSS is also calculated (not shown) for monthly accumulations. Both the raw and postprocessed monthly accumulations generally demonstrate higher skill than the respective biweekly accumulations. The higher skill in the monthly accumulated flows is due, in part, to temporal averaging and the accumulation of skill from earlier periods.

Skill of the S2S nutrient and suspended sediment load predictions
Using the raw, monthly ensemble streamflow forecasts as the predictor in the QR model, we generate forecasts of monthly nutrient and suspended sediment loads for the selected basins. The quality of the TN, TP, and TSS forecasts is assessed using both the correlation coefficient (figures 3(a)-(c)) and CRPSS (relative to sampled climatology) (figures 3(d)-(f)). Both metrics are computed for monthly loads and for lead times of 1-3 months. Overall, the CFSv2-based forecasts result in skillful nutrient and suspended sediment load forecasts at lead times of 1-3 months (figure 3). In terms of both metrics (correlation coefficient and CRPSS), it is apparent that large basins (RMDV2, CARV2 and JBIV2) remain skillful at all the lead times for the TN, TP and TSS forecasts, while small basins (PYAV2, GOHV2 and BPRV2) tend only to have positive skill up to month 2. The variability of streamflow and nutrient concentrations are generally higher for smaller basins (Jawitz and Mitchell 2011), also likely contributing to the reduced forecasting skill of the small basins. In general, TN demonstrates better forecasting ability than both TP and TSS, with TP outperforming TSS. Although these trends are apparent in all the basins and lead times, they tend to be emphasized at the later lead time of 3 months.
Compared to the skill of the streamflow forecasts, the TN, TP and TSS load forecasts show similar skill for the month 1 lead time. Beyond month 1, however, some of the load forecasts have greater CRPSS values than the streamflow forecasts. This can be explained by the nature of the empirical power law relationship between the nutrient concentration C and the streamflow Q (Zhang et al 2016) Q where a and b are coefficients (table 1). For example, higher predictability of TN at RMDV2 and CARV2 is associated with their chemostatic behavior. These basins show the highest CRPSS values at month 3 among all the basins and water quality variables. Under chemostatic conditions: (i) the power law relationship is linear with C collapsing to the value of the coefficient a as the exponent b » 0 (Zhang et al 2016), and (ii) the ratio of the concentration and discharge coefficients of variation is small (CV C /CV Q ; Thompson et al 2011). The linear relationship between flow and load allows the TN predictions at CARV2 and RMDV2 to stay skillful at month 3. In the case of GOHV2, the short observational record, which results in limited data for training the QR model, and the low skill of the streamflow forecasts make the TN forecast skill in this basin worse than sampled climatology at the month 3 lead time.
3.5. Effect of spatial scale, seasonality, and forecast threshold on the skill of the water quantity and quality predictions We now examine the effect of spatial scale (basin size), seasonality (warm and cool seasons), and forecast threshold (low versus high forecast values) on the skill of the streamflow, TN, TP, and TSS predictions. To assess the effect of basin size and seasonality on the forecast skill, the CRPSS values for the monthly streamflow, TN, TP and TSS forecasts at the month 1 lead time are plotted against the drainage area for the entire year ( figure 4(a)), cool season ( figure 4(b)), and warm season ( figure 4(c)). The cool season is defined as covering the months of October-March and the warm season as April-September. We find that monthly streamflow forecast skill increases with the basin size (figure 4), and the linear trend is significant for both seasons (p-value<0.05) (figures 4(b), (c)). The skill of the streamflow forecasts is appreciably higher during the cool season ( figure 4(b)) than the warm one ( figure 4(c)). This is mostly due to the CFSv2 precipitation forecast skill being higher in the cool season (Peng et al 2013, Tian et al 2017. The lower CFSv2 skill in the warm season is partly due to challenges with the GCM in predicting convective precipitation and tropical cyclones (Moore et al 2015). Both the nutrient and suspended sediment load forecasts also show increasing skill with the basin size. However, this trend is only significant for the warm season TN and cool season TP. The skill of the nutrient and suspended sediment load forecasts is also better in the cool season than in the warm season. This is mainly due to the higher skill of the cool season streamflow forecasts.
To understand the quality of the streamflow forecasts during low and high flow conditions, we employ the BSS (figure 5). The low forecast category represents flows with non-exceedance probability, Pr, less than 0.05 (Pr<0.05) while the high flow category is for Pr>0.95. We find that the BSS of the S2S streamflow forecasts is greater for high flows than low flows, with high flows remaining skillful up to lead times of ∼4-5 weeks while low flows are only skillful for ∼2-3 weeks (figure 5). There are two reasons for this. First, since the high flows result from the direct response of the basin to the precipitation events and the low flows are mostly dominated by subsurface processes, any skill in the precipitation forecasts propagates through the hydrological model directly to the high flows. Second, the performance of the hydrological model in simulation mode across the different basins is better for high flows than low flows, with an average NSE of 0.76 and 0.55 for high and low flows, respectively.
The behavior of the water quality forecasts during low and high load conditions (figure 6) mimics that of the streamflow forecasts (figure 5). For example, the BSS values for TN, TP, and TSS (figure 6) are comparable to those for streamflow at the month 1 lead time (figure 5), and the BSS values are higher during high load conditions than low conditions (figure 6). The higher skill for the high load forecasts is due, in part, to the higher skill of high flows. We also assess the effect of land cover on prediction skill. However, since the percent land cover distribution in the basins is relatively similar (table 1), we do not find any marked trends between the skill of the water quantity/quality predictions and land cover, perhaps with the exception of PYAV2. This is the most anthropogenically disturbed basin out of the 6 selected basins, as it has the lowest percent of forest land cover, ∼65%, and the largest percent of developed and agricultural land cover, ∼32%. PYAV2 is the only basin that has negative CRPSS skill at the month 3 lead time for TP and TSS (figures 3(e)-(f)). This negative skill could partly be due to increased anthropogenic influence in this basin. For example, it seems possible that water-related decisions (e.g. stormwater management), not considered in the hydrological model, could be influencing TP and TSS budgets.
3.6. Uncertainty of the streamflow, nutrient load, and suspended sediment load predictions To quantify the relative importance of different sources of uncertainties (climate and hydrological) in the S2S streamflow forecasts, we employ the CRPSS relative to both the observed and simulated values ( figure 7). When the S2S streamflow forecasts are verified relative to the observed values, the CRPSS accounts for the effect on forecast skill of both climate and hydrological uncertainties, whereas the forecasts verified relative to simulated values mainly account for climate uncertainty. The difference between the two, i.e. the CRPSS relative to observed values minus the CRPSS relative to simulated ones, provides an estimate of the effect of hydrological uncertainty on the streamflow forecast skill. We find that, regardless of the basin size, hydrological uncertainty has the strongest influence on streamflow forecast skill at the initial lead time, as also shown by Shukla and Lettenmaier (2011) using a different forecasting approach. For instance, for BPRV2 at the month 1 lead time ( figure 7(a)), the CRPSS values relative to the simulated and observed flows are ∼0.30 and ∼0.10, respectively, suggesting a reduction of ∼65% skill due to hydrological uncertainty. As the lead time grows, hydrological uncertainty become less pronounced and climate uncertainty starts to dominate the streamflow forecast skill. This is demonstrated in figure 7 by the close proximity between the two CRPSS values (observation and simulation) beyond the week 6 lead time.
The CRPSS is also used to quantify the impact of different sources of uncertainties (hydrologic, climate and water quality) on the nutrient and sediment load forecasts (figure 8). The water quality uncertainty is related to the generation of water quality forecasts using QR. In this case, the CRPSS verification relative to observed values (solid lines in figure 8) accounts for the water quality uncertainty along with the climate and hydrologic uncertainties, while the verification relative to simulated values mainly accounts for climate uncertainty. The difference between the two provides an estimate of both the hydrological and water quality uncertainties. We find that at the initial lead time (month 1) the major source of uncertainty is hydrological and water quality (figure 8). As the lead time grows, the hydrological and water quality uncertainties become less pronounced compared to the climate uncertainty. In contrast to figure 7, the solid and dashed lines in figure 8 do not tend to converge towards each other with increasing lead time. Interestingly, this suggests that there is potential for improvements in water quality modeling to enhance the skill of nutrient and suspended sediment load forecasts even at the later lead times.

Discussion and conclusions
In this study, we build and implement a new dynamical-statistical ensemble prediction system to generate S2S water quantity and quality forecasts. The system consists of a logistic regression statistical preprocessor, a distributed hydrologic model, and a QR statistical postprocessor. QR is used both to remove any systematic biases in the raw ensemble streamflow forecasts and to generate the nutrient and suspended sediment load forecasts. The system is implemented for the years 2003-2016 along a major tributary of the Chesapeake Bay. Overall, we find that the dynamical CFSv2-based forecasts, when combined with QR, can generate skillful flow, nutrient load, and suspended sediment load forecasts at lead times of 1-3 months. This study serves to benchmark the skill of CFSv2based water quality forecasts. Such benchmark could be used in the future to assess the potential skill improvements obtained from using more complex and expensive forecasting systems. For example, systems that consider multimodel climate forcings, process-based water quality modeling, and assimilation of initial conditions, among other factors.
Although the ability of the dynamical-statistical ensemble prediction system to generate skillful S2S water quantity and quality forecasts is demonstrated for the JRB, the results can be generalized to other places where the quality of the CFSv2 forecasts is similar to that observed in the JRB. Indeed, the S2S nearsurface temperature and precipitation forecast skill values obtained for the JRB are comparable to values reported for many other locations across the US (Chen et al 2013, Peng et al 2013, Saha et al 2014, Tian et al 2017. The CFSv2 near-surface temperature and precipitation forecast skills do not tend to vary markedly across the US (Chen et al 2013, Peng et al 2013, Saha et al 2014, Tian et al 2017, with some exceptions. For example, near-surface temperature skill is lower in parts of the US Northwest and South regions, whereas precipitation demonstrates minimum skill in parts of the US Rockies and Midwest region (Peng et al 2013, Tian et al 2017. Aside from these US regions of minimal CFSv2 skill, the skill values of the CFSv2-based water quantity and quality forecasts for the JRB are expected to be representative of potential skill values in many other locations in the US, specially locations with favorable forecasting conditions, e.g. where climate teleconnections are strong. This, together with the fact that the CFSv2 is a global forecasting system, make our results relevant to other basins. We find that the flow forecasts can be skillful up to 6 weeks, which goes beyond the skill limit observed for the CFSv2 precipitation outputs (∼4 weeks). This may be due to the propagation of skill from previous weeks through the serial dependency of flows. Nonetheless, the quality of the CFSv2 forecasts tends to limit hydroclimatic predictions beyond 2 months. To further extend the lead time of streamflow predictions, it is necessary to improve the spatial resolution of the CFSv2 forecasts. This would allow to resolve finer scale atmospheric phenomena and, as a consequence, potentially enhance precipitation and streamflow predictions. This is particularly relevant for small basins. Our results show that the skill-drainage area relationship is strong for streamflow, indicating a reduction in skill for small basins. This is also the case for TN and TP. Having more spatially resolved climate forcing could contribute to enhancing hydroclimatic skill across spatial basin scales. This could also help improve the warm season forecasts. There is a noticeable difference in the performance of the water quantity and quality forecasts during the cool and warm season. The forecasts are more skillful in the cool season. This is due to the higher skill of the CFSv2 forcing in this season. Thus, improving the warm season CFSv2 forecasts in the future will enhance the performance of the dynamical-statistical approach. Additionally, multimodel climate ensembles that adequately sample initial conditions, parameters, as well as structural uncertainties in the climate models, such as the North American Multimodel Ensemble (Kirtman et al 2014), could be used to further improve the streamflow forecast skill and subsequently the water quality forecasts.
There is also a marked difference in the skill of high and low flows/loads. The higher skill for the high loads is due to the higher skill of the high flow forecasts. These results are relevant to water quality management as monthly high flows have been associated with increasing eutrophication and hypoxia in coastal waters (Kaushal et al 2008). It also identifies an area where improvements to the hydrological model could be made. The model in simulation mode performs better for high flows than low flows. Nonetheless, the BSS for the high flows declines with increasing lead time and becomes comparable to the BSS of low flows beyond month 1. This relatively rapid decline in skill suggests that initial land surface states (e.g. soil moisture) can have a strong effect on the forecast skill. Improving the representation of these initial states could likely serve to prolong the forecasts' lead times.
The findings of this study could be relevant to water quality management applications. One area of application is in the prevention and mitigation of harmful algal blooms. For example, in the Great Lakes, high precipitation events, coupled with agricultural practices and land use change, have increased nutrient loading into Lake Erie, which has led to the episodic occurrence of extensive harmful algal blooms over recent years (Michalak et al 2013). Our ability to predict nutrient loads 3 months ahead could support building early warning systems for harmful algal blooms, and help make better informed decisions for nutrient-related management plans and actions. Since flows and nutrient loads are effective predictors of bloom size across different geographic regions (Stumpf et al 2012), our dynamic-statistical approach should be readily transferable to other freshwater systems and coastal locations, such as the Gulf of Mexico and Great Lakes, where the prediction of bloom occurrence, extent, and timing is crucial for implementing realistic bloom mitigation strategies. Although the water quality forecasts are generated at the monthly timescale, the ability of the dynamicalstatistical approach to generate daily streamflow forecasts allows to implement the approach at higher temporal resolutions (e.g. weekly or biweekly), provided water quality observations of sufficient temporal resolution are available to train the statistical model.
Seasonal and spatial variations in the forecast skill are visible among the basins. Both flow and load forecasts skill is better in the cool season than in the warm one, and is higher for the large basins. Although small basins demonstrate low forecast skill during the warm season, the forecast skill tends to increase with the basin size. This makes our approach more suitable to inform and support nutrient and sediment management strategies at the regional scale, rather than in small basins. To provide more localized forecasting information, it would be necessary to improve the spatial resolution of the CFSv2 forecasts and the spatiotemporal resolution of water quality observations. In the case of regional sediment management, the S2S forecasts could be used to better manage sediment sources, agricultural practices, and sediments in reservoirs and regulated rivers (Kondolf et al 2014). By using the forecasts to anticipate disproportionate suspended sediment loads, those sediment management activities could potentially be performed in a more dynamic and adaptive manner. This could contribute to making sediment management more effective. For example, in dams that allow sediment passage, reservoir operation plans could be designed to interact with the S2S flow and suspended sediment forecasts.
Our findings have implications for operationalizing S2S water quality predictions in the US and potentially other places. By building and testing a modular dynamical-statistical ensemble prediction system, we demonstrate a feasible approach to the generation of water quality forecasts. The approach is more cost effective and computationally efficient than fully implementing a process-based water quality model. This makes it readily implementable in an operational forecasting setting, which we view as the main application area of our study results in support of NOAA's operational priorities for the next years (National Oceanic and Atmospheric Administration 2016). For instance, an existing operational seasonal streamflow forecast product could be combined with QR to generate nutrient and suspended sediment load forecasts. This study also identifies relevant timescales and sources of uncertainties that could assist in the future when trying to implement process-based water quality models operationally. In addition, the outputs from the dynamical-statistical ensemble prediction system could be used as boundary conditions (e.g. riverine inflows and associated nutrient and suspended sediment concentrations) to operational marine hydrodynamic and biogeochemical forecasting models. The S2S forecasting approach adopted here is designed to emulate specific aspects of operational forecasting systems, such as the use of ensemble S2S climate predictions, a hydrologic model, and a statistical preprocessor and postprocessor. For example, the National Water Model (Salas et al 2018) uses the climate ensembles from the CFS to produce ensemble streamflow forecasts one month ahead. This study shows that it is possible to use current operational streamflow forecast systems, e.g. the National Water Model, to generate skillful water quality predictions several months in advance. This would have the benefit of making predictions available to the public, providing support to a wide range of water quality management strategies, and creating new opportunities for yet unknown applications.