Stochastic Decadal Projections of Colorado River Streamflow and Reservoir Pool Elevations Conditioned on Temperature Projections

Decadal (∼10‐year)‐scale flow projections in the Colorado River Basin (CRB) are increasingly important for water resources management and planning of its reservoir system. Physical models, ensemble streamflow prediction (ESP), do not have skill beyond interannual time scales. However, Global Climate Models have good skill in projecting decadal temperatures. This, combined with the sensitivity of CRB flows to temperature from recent studies, motivate the research question, can skill in decadal temperature projections be translated to operationally skillful flow projections and consequently, water resources management? To explore this, we used temperature projections from the Community Earth System Model‐Decadal Prediction Large Ensemble (CESM‐DPLE) along with past basin runoff efficiency as covariates in a Random Forest (RF) method to project ensembles of multiyear mean flow at the key aggregate gauge of Lees Ferry, Arizona. RF streamflow projections outperformed both ESP and climatology in a 1982–2017 hindcast, as measured by ranked probability skill score. The projections were disaggregated to monthly and subbasin scales to drive the Colorado River Mid‐term Modeling System (CRMMS) to generate ensembles of water management variables. The projections of pool elevations in Lakes Powell and Mead, the two largest U.S. reservoirs that are critical for water resources management in the basin, were found to reduce the hindcast median root mean square error by up to −20% and −30% at lead times of 48 and 60 months, respectively, relative to projections generated from ESP. This suggests opportunities for enhancing water resources management in the CRB and potentially elsewhere.

Water resources management in this basin has become more challenging due to past and projected warming. Decreased precipitation is generally agreed upon as the primary driver of the ongoing drought since 2001, referred to as the "Millennium drought," mostly resulting from a shift to cooler Pacific sea surface temperatures related to the Pacific Decadal Oscillation and Atlantic Multidecadal Oscillation (Delworth et al., 2015;Hoerling et al., 2010;Lehner et al., 2018;Nowak et al., 2012;Schubert et al., 2009;Seager & Ting, 2017;Seager et al., 2005;Zhao et al., 2017). Several recent studies suggest that this "hot drought," the worst on record with mean annual flow ∼20% less than the long-term average, was due to both decreased precipitation and, in contrast to previous droughts, a mean temperature ∼1°C above twentieth century climatology (Udall & Overpeck, 2017;Woodhouse et al., 2016). Conversely, temperature increases in the CRB and elsewhere have been attributed to anthropogenic GHG forcings (Hoerling et al., 2019;Lukas et al., 2014), but estimates of temperature's relative influence on the recent drought vary widely. Hoerling et al. (2019) suggest that increased temperature was responsible for one fifth of the recent flow decline, while Udall and Overpeck (2017) and Xiao et al. (2018) estimate that temperatures were responsible for approximately one third and one half of the flow decline, respectively. The difference in estimates of temperature influence is partially due to different temperature sensitivities used in each study. Consider Hoerling et al. (2019)'s fully coupled, land surface model (LSM)-derived sensitivity of −2.5% °C −1 compared to Udall and Overpeck (2017)'s sensitivity of −6.5% °C −1 , which itself was chosen because it was the mean of a range of −3% °C −1 to −10% °C −1 developed by Vano et al. (2012Vano et al. ( , 2014 from multimodel LSM simulations. Recently, Milly and Dunne (2020) reported a LSM model-derived sensitivity range of −7.8% °C −1 to −12.2% °C −1 . Empirically derived sensitivities tend toward higher values, with estimates of −9% °C −1 and 14% °C −1 from McCabe and Wolock (2007) and Nowak et al. (2012), respectively. The true CRB temperature sensitivity, and possible changes in it, will play an important role in the degree of potential future flow declines induced by the warming trend.
Projections of mid-and end-of-century CRB flows, primarily predicated on climate model outputs, vary widely and are dependent on a variety of factors including both climate and LSM selection, downscaling techniques, emissions scenarios, parameter estimation, resolution, etc. (Haddeland et al., 2002;Mendoza et al., 2016;Vano et al., 2014). While simulations from the Coupled Model Intercomparison Project (CMIP) all predict a warmer future in the CRB, approximately half of CMIP3 models project increases in CRB precipitation and streamflow by midcentury, while the other half project decreases, resulting in an ensemble median of approximately 0% Interestingly, approximately two thirds of CMIP5 models project midcentury precipitation increases, but like CMIP3, have an ensemble median of zero for midcentury streamflow change (USBR, 2016). Many of the CMIP models exhibit a wet-bias, with approximately one half and one quarter of CMIP5 and CMIP3 models, respectively, unable to replicate the Millennium drought at any point (Udall & Overpeck, 2017). This supports earlier conclusions that many climate models cannot simulate decadal and multidecadal drought cycles (Ault et al., 2012(Ault et al., , 2013. However, some studies have projected an increase in the risk and severity of decadal and multidecadal drought due to warming, despite any possible increases in precipitation (Ault et al., 2014(Ault et al., , 2016Cook et al., 2015). Further, among the CMIP3 and CMIP5 members that did replicate the Millennium drought in simulations, mean projected end-of-century flows were 85% and 91% of the twentieth century average observed flow at Lees Ferry, respectively. The CMIP3 and CMIP5 members that could not replicate the Millennium drought suggest mean end-of-century flows 109% and 113% higher, respectively, than twentieth century average observed flow (Udall & Overpeck, 2017). A separate analysis by Milly and Dunne (2020) forced CMIP members that had a high fidelity to observed Upper Colorado River Basin (UCRB) flows with projected midcentury changes in mean temperature and precipitation. This evaluation generated ensemble-mean flow changes between −5% and −24% under Representative Concentration Pathway (RCP) 4.5 and a +3% to −40% range under RCP 8.5. This suggests that flows will either decrease or, if buffering is provided by increased precipitation, remain approximately unchanged. Flow decreases were largely attributed to decreased snow albedo resulting in higher potential evapotranspiration.
The high variability of CRB flow, potential for multiyear droughts, and unknown, possibly deleterious future shifts in hydroclimate have prompted a large amount of research attention toward better understanding and predicting CRB flow behavior, with prognostic research efforts tending toward either seasonal forecasts or multidecadal projections (e.g., Baker, 2019;Baker et al., 2019Baker et al., , 2020Baker et al., , 2021aBracken et al., 2010Bracken et al., , 2014Erkyihun et al., 2016Erkyihun et al., , 2017Prairie et al., 2006;Rajagopalan et al., 2019). Decadal flow projections, or forecasts with lead times greater than 1 year and less than 10 years, have received relatively less attention, despite their importance for multiyear planning, which is particularly relevant in the CRB due to a high storage capacity and susceptibility to multiyear droughts. Ensemble streamflow prediction (ESP) is a current, commonly used method for generating seasonal and annual forecasts in the CRB and elsewhere. ESP involves initializing a hydrologic model with current conditions and forcing the model with sequences of observed past precipitation and temperature (Wood & Werner, 2011). While ESP has been shown to be significantly more skillful than climatology at seasonal time scales (Baker et al., 2021a;Franz et al., 2003;Harrigan et al., 2018), there is limited skill at a 12-month lead time and no skill relative to climatology in the CRB for annual forecasts with a lead time of 15 months or longer (Baker, 2017).
Unlike the uncertainty associated with climate model-based projections of future precipitation and streamflow at decadal and multidecadal scales, confidence in long-term temperature projections is higher due to hindcast skill in predicting the warming trend of the last several decades, primarily due to the strong signal imparted by GHG forcings (Hargreaves, 2010;Hausfather et al., 2020;Jiang et al., 2012;Kumar et al., 2016). Climate model-based projections have shown skill at high resolutions for seasonal time scales (Infanti & Kirtman, 2014;Ma et al., 2015;Mishra et al., 2019;Mo & Lyon, 2015;Slater et al., 2019), but generally only multiyear averages have been shown to be skillful past a 1-year lead time (Kim et al., 2012;Meehl et al., 2014;Towler et al., 2018;van Oldenborgh et al., 2012;Yeager et al., 2018). While earlier efforts at decadal prediction of streamflow have yielded modest skill (Nowak, 2011), decadal temperature predictions are becoming more common and are increasingly being used with nonparametric and other approaches to improve hydrologic projections at multiyear time scales (Esit et al., 2021;Kiem et al., 2021;Towler & Yates, 2021;Towler et al., 2021 [in review]).
In watersheds with multiyear storage capacities, like the CRB, flow projections past a 12-month lead time are important for operational planning. However, climatological forecasts are often used past year-1 due to the lack of decadal skill from current forecast methods, potentially leading to suboptimal decision making under multiyear drought scenarios. We investigate whether the skill found in climate model-based temperature projections can be transferred to decadal streamflow forecasts in the CRB. This critical need combined with the knowledge that (a) Global Climate Models have good skill in projecting decadal temperatures and (b) the documented sensitivity of CRB flows to temperature, motivates the present study with this research question-can skill in decadal temperature projections be translated to operationally skillful flow projections and consequently, water resources management? To this end, the paper is organized as follows. First, the data sets and the water resources 4 of 21 management model are described briefly. Then, the proposed streamflow projection methodology, a Random Forest (RF) algorithm conditioned on temperature covariates is presented along with projections of basin water resources management variables. Results assessing the skill of these projections follows. The paper concludes with a summary and discussion.

Data and Water Resources Management Model
Naturalized flow at Lees Ferry, Arizona, for 1906-2017 was obtained from the United States Bureau of Reclamation for this study (USBR, 2020). Naturalized flow is ideal when investigating hydroclimate teleconnections, as it has anthropogenic impacts like diversions and storage removed (Prairie et al., 2005). Streamflow at Lees Ferry, just downstream of Lake Powell, represents flow from the entire UCRB that on average produces ∼90% of total CRB flow (Christensen et al., 2004). The UCRB yields a majority of Colorado River streamflow since its headwaters are situated in the Rocky Mountains and lies within the states of Colorado, New Mexico, Utah, and Wyoming. Conversely, the Lower Basin is in the much more arid states of Arizona, California, and Nevada as well as portions of Mexico. The Lees Ferry streamflow gauge is used as the demarcation point between the UCRB and Lower Colorado River Basins.
Observed hydroclimate data used in this study include precipitation and minimum and maximum temperatures from the Parameter-elevation Regressions on Independent Slopes (PRISM) model (Daly et al., 1994). Monthly 4-km data for the entire continental US were clipped to the UCRB boundary and spatially averaged, then either summed or averaged in time for precipitation and temperature, respectively, to generate water year values for the UCRB.
Climate model simulated data for 1981-2017 were obtained from the Community Earth System Model-Decadal Prediction Large Ensemble (CESM-DPLE; Yeager et al., 2018). The CESM-DPLE is a 40-member ensemble at a monthly, 1-degree (nominally 111-km) resolution. The CESM-DPLE is a fully coupled model with atmosphere, ocean, land, and sea-ice components. The CESM-DPLE is initialized with "current" atmospheric conditions at the beginning of every 10-year projection in the 1981-2017 running hindcast. Since ensemble members are generated via perturbing initial conditions by round off errors, the ensemble spread is representative of natural variability and climate forcings only and does not include model biases (unlike other large ensembles). Calculating the anomaly correlation coefficient (ACC) indicates that the CESM-DPLE has good skill in projecting both maximum and minimum CRB temperature at multiyear time scales but has little to no skill in projecting precipitation ( Figure 2). The ACC represents the correlation between forecast anomalies and observed anomalies

of 21
on a spatial scale (Jolliffe & Stephenson, 2011;Miyakoda et al., 1972). An ACC of 1 means that the forecast and observed anomalies are perfectly correlated while a value of 0 signals that there is no correlation. A negative ACC indicates that the forecast is predicting the opposite of what occurred, and an ACC of 0.6 or greater is recognized as having better skill than using a climatological forecast (Anomaly Correlation Coefficient -Forecast User Guide -ECMWF Confluence Wiki, n.d.). The ACC for the DPLE minimum temperature is over 0.7 for all lead times. The DPLE maximum temperature projections have ACC values of 0.55, 0.46, and 0.35 for lead times of 1-5, 3-7, and 5-9 years, respectively, suggesting decreasing skill with increased lead time. Conversely, DPLE precipitation has an ACC of 0.10 for a lead time of 5-9 years and a value of 0 for both other lead times, generally indicating no skill in projecting precipitation. Lead time indicates the time into the future for which projections are made, as well as the years being temporally averaged into the projection. For example, a lead time of 3-7 years represents a projection of the mean annual value between 3 and 7 years into the future.
The translation of hydrologic skill into operational skill of climate-conditioned RF flows was tested by comparing their performance to ESP when used to force the Colorado River Mid-term Modeling System (CRMMS), previously known as the Mid-term Operations Model (MTOM; USBR, 2015). CRMMS is a key modeling tool used by the United States Bureau of Reclamation (Reclamation) for midterm planning in the CRB. It is driven by monthly streamflow at the 12 forecast locations in the UCRB (Figure 3), initial conditions of basin reservoirs, and a climatological approach for six subbasins below Lees Ferry. CRMMS is a RiverWare model (Zagona et al., 2001) that simulates operational decisions such as releases and diversions according to the Law of the River and policies from the 2007 Interim Guidelines for shortages (USBR, 2015(USBR, , 2020. Ensemble simulations are run via the Colorado River Basin Operational Prediction Testbed (Baker et al., 2021b) implemented within RiverSMART software (RiverSMART, 2020), providing a platform for testing of many different hydrologic input scenarios and comparing their resulting operational variables (e.g., pool elevation). Reclamation provided CRMMS files and forcing data (ESP, observed flows, and initial conditions). Several management actions related to delivery reductions (e.g., DCP and Interim Guidelines) are intimately tied to pool elevations at Lakes Mead and Powell, hence flow projections and subsequent simulations of reservoir levels are critically important for decision making.

Proposed Methodology
The proposed methodical framework consists of three modules: (a) RF machine learning model, (b) disaggregation model, and (c) water resources decision model. In the CRB, future drought risk might be better mitigated if well-informed water resources management decisions could be reliably made on multiyear time horizons. However, operational flow forecast methods like ESP are generally only skillful for lead times up to 1 year, which is the same time frame that operational decisions in the CRB are limited to. Conversely, temperature projections, often from Global Climate Model projections, have proved skillful at multiyear time scales. Figure 4 shows the scatterplots of PRISM-derived precipitation and temperatures with CRB flows for 5-year running means. Strong positive correlation with precipitation (0.8) and moderate to strong negative correlations with annual maximum (−0.74) and minimum (−0.39) temperatures can be seen, consistent with prior studies documenting temperature sensitivity to streamflow in the basin. Similar relationships and correlation strengths are seen between these hydroclimate variables with running mean lengths of 2 and 10 years. Motivated by these relationships in both the observed and model spaces, we propose our methodology to exploit these relationships and the skill in temperature projections to improve projections of streamflow and reservoir elevations.

Streamflow Projection
We propose a RF machine learning approach to generate streamflow ensembles conditioned on a set of covariates.

Covariate Selection
Strong correlations between temperature and CRB flow were seen in the historical record in Figures  To select the covariates, we assessed the relationship between the multiyear mean of UCRB flow and potential covariates: corresponding ensemble-mean minimum and maximum temperatures from CESM-DPLE and past multiyear basin runoff efficiencies (REs). The temperatures are a result of large-scale ocean and atmospheric conditions, while the RE captures the basin characteristics and antecedent conditions. We examined the relationships for "multiyear means" of 2-10 years. The past RE for each multiyear period was selected based on maximum correlation within a 15-year window. For example, for 3-year mean future flow, the optimal past RE 6 of 21 was found to be an 8-year mean, while the optimal past RE for 5-and 7-year mean future flow was 7 and 6 years, respectively. The scatterplots of flow and potential covariates for multiyear mean lengths of 3, 5, and 7 years are shown in Figure 5 as representative samples. The covariates for all other multiyear mean flow lengths are shown in Figures SC1-SC6. At shorter mean lengths (e.g., 3-year means), the relationship between CESM-DPLE-simulated temperature and observed flow is inverse and relatively linear, consistent with the observed hydroclimate relationships ( Figure 5). For 5-year means, simulated minimum temperature also exhibits a linear relationship with flow. While the simulated maximum temperature generally exhibits a negative correlation with flow for positive temperature anomalies, mimicking the observed relationship, flow unexpectedly decreases as negative temperature anomalies decrease. The relatively lower skill of DPLE maximum temperature projections compared to minimum temperature suggest that lower quality temperature projections might not be able to capture well the 7 of 21 observed trend between flow and temperature. From a physical perspective, we would expect that as temperature increases, evapotranspiration demand rises and consequently reduces flow as noted in several studies described earlier (Hoerling et al., 2019;Vano et al., 2012;Woodhouse et al., 2016;Xiao et al., 2018). In terms of impact, the cool temperature-low flow pairings occur in the late 1980s through early 1990s, a time when the warming signal was not as strong as in the last two decades, suggesting that this erroneous relationship should play a relatively minimal role when making forecasts forced with more recent temperature projection inputs. The relationship of the 7-year means is very similar to that of the 5-year means. We selected RE as an effective indicator of the current basin hydrologic signature because streamflow is an interaction between temperature, precipitation, and land surface conditions that is well captured by RE (Nowak et al., 2012).
Based on these diagnostic analyses, the covariates selected for various multiyear mean flow projections consist of three covariates-future multiyear annual mean of maximum and minimum temperatures from CESM-DPLE and past multiyear average RE. For example, for the 2-year mean flow projections from the year 1982, the feature vector consists of CESM-DPLE-projected mean maximum and minimum temperatures for 1982-1983 and 1973-1981 average RE (past 9-year mean efficiency yielded the highest correlation with future 2-year mean flow). Conversely, for the 5-year mean flow projections in the same year, the feature vector consists of CESM-DPLE projected mean maximum and minimum temperatures for 1982-1986 and 1975-1981 average RE (past 7-year mean efficiency yielded the highest correlation with future 5-year mean flow). Thus, for each historical year, we compute this vector of covariates.

Random Forest
An RF machine learning approach was used to generate multiyear mean flow forecasts based on the covariates described above. RFs, originally developed by Ho (1995) and later improved by Breiman (2001), are a Classification and Regression Tree (CART)-based supervised machine learning algorithm that can perform either classification or regression. RFs are essentially generated through repeated bootstrapping of the training data set, with a decision tree generated for each sample. Thus, unlike CART where a single tree is fitted to the data, here many trees (hence, forest) are fitted to bootstrapped samples of the historical data. For a covariate vector, estimates of the predictand (i.e., the dependent variable, here it is the multiyear CRB flow) are obtained from each tree in the forest and averaged. We use the estimates from all trees to make up an ensemble, which provides a distribution. The RF algorithm also provides variable importance details, or how much utility a given predictor in the training data set imparts on prediction performance. A single value of mean square error (MSE) or node purity is produced with every forest. The MSE metric indicates how much the fitted MSE would increase if the given predictor is randomly permuted. For details of the RF algorithm, we refer the reader to Hastie et al. (2009).
RFs are appealing for a multitude of reasons, including their predictive performance, robustness, speed, nonparametric nature, stability, diagnosis of variable importance, as well as their ability to handle nonlinearity, interactions, noise, and small sample sizes in forcing data (Tyralis et al., 2019). RFs are being used in a variety of applications including water resources and water quality modeling (Suchetana et al., 2017), construction safety risk (Tixier et al., 2016), and used with success in recent years for flow forecasting, primarily at daily and monthly time scales (Abbasi et al., 2021;Al-Juboori, 2019;Ghorbani et al., 2020;Hussain & Khan, 2020;Li et al., 2019;Liang et al., 2018;Muñoz et al., 2018;Papacharalampous & Tyralis, 2018;Pham et al., 2020).

RF Implementation
RF models are trained on historical, naturalized, water year flow at Lees Ferry for each multiyear block flow ranging from 2-to 10-year means ("multiyear means") along with the scaled anomalies of the covariates spanning the same time ranges. The water year flows begin in October while the climate projections begin in November; 10.1029/2021WR030936 9 of 21 the 1-month difference in start dates is likely negligible since multiyear averages are calculated and used for both data sets. All RF forecasts start on 1 October, regardless of the mean length. Multiyear mean flows are generated from each of the trees in the RF, thus producing an ensemble. We implemented the RF model in a K-fold cross-validation mode, which is described below.
1. To make a given forecast starting in year i with mean length N, a corresponding covariate vector x i,N is fed into a RF model ensemble generated using the randomForest R package (Liaw & Wiener, 2002;R Core Team, 2019). Each RF projection is trained on a blind training data set subset from the 1982-2017 historical period, d i − N , generated through a K-fold cross-validation approach, where K is the number of years dropped prior to model training. K is a function of both the selected mean length N and its past RE window length (re p ). So that no information from the N-year-long period to be forecast is included in the training set (i.e., a "blind forecast"), only years less than the forecast year i minus N or greater than the forecast year i plus the maximum of N and re p are included in the training set (Equation 1). This approach prevents any overlap of the training and forecast years given the nature of multiyear means. Using the maximum of N and re p also removes any knowledge of the past annual RE from the year to be forecast.  (2017) in the period of record used. The predictands include multiyear mean flows of a selected mean length (e.g., 5-year mean flow) and the predictors include CESM-DPLE projected mean minimum and maximum temperature of the same mean length, and past RE of mean length determined through a correlation analysis that varies for each mean length.
2. The RF algorithm was set to generate 300 trees for each combination of forecast year and mean length. This number of trees was selected because while there were only marginal performance improvements past 100-10 of 21 200 trees, use of 300 trees allows for a robust and well-dispersed forecast with very reasonable computational expense. To generate a projection for a given forecast year and mean length, the relevant feature vector was fed into each tree in the forest resulting in an ensemble of 300 multiyear mean flow projections (Equation 2). Traditionally, the average of all trees' predictions is used as the RF output in regression contexts, however our approach was to use individual trees' predictions to create an ensemble. Default values from the randomForest R package (Liaw & Wiener, 2002) for the remaining model parameters were used as tuning lead to only marginal differences in projection performance. The two most important model parameters include the number of trees, nTree, and the number of covariates randomly sampled for each split in a tree, mTry. mTry defaults to the number of covariates divided by 3. In this study, three covariates are used, resulting in a value of 1 for mTry.
where − = predictors for forecast year and mean length − = predictands for forecast year and mean length = th decision tree generated from blind training set = number of trees in forest 3. The RF algorithm generates some tree outcomes that match exactly the input predictands (i.e., streamflow) and other tree outcomes that deviate slightly from the predictands due to introduced randomization. Consequently, to generate annual flow sequences from the RF-generated multiyear means, if a projected multiyear mean was not an exact match with an observed multiyear mean flow from the blind training data set, then the closest observed value within the blind training data set was used for disaggregation. Disaggregation of multiyear means to annual flow sequences simply involved finding the sequence of annual flows that constituted a selected multiyear mean (Equations 3 and 4). One potential shortcoming of this approach is that it can only produce annual flow volumes sampled from the observed record, although use of paleo-reconstructed flows could lengthen the period of record.̄=

Space-Time Streamflow Projection on the Network via Disaggregation
While there is value in skillful projections of annual and multiyear mean flow at midterm time scales, for it to be of impact to stakeholders, the skill in flow projections needs to be translated into reservoir pool elevations.

of 21
As mentioned, to evaluate the utility of our novel climate-conditioned decadal streamflow projections within an operational context, we use Reclamation's CRMMS, which runs on a monthly time scale. As such, we needed to disaggregate the RF-simulated annual flows, ⇀ −annual , at Lees Ferry, the aggregate gauge, to a monthly time step at all 12 upstream subbasins.
The annual flows from the RF were disaggregated to monthly flows at all the locations (i.e., subbasins) of CRB that are input to CRMMS. For this space-time disaggregation, we used the nonparametric disaggregation method developed by Nowak et al. (2010). In this, a K-nearest neighbor (KNN) method is applied to each annual flow in the multiyear block to select a historical year and its space-time proportion of the annual flows at Lees Ferry aggregate gauge. The selected proportion vector is multiplied with the annual flow to obtain flows for the 12 months and 12 subbasins. For details of this simple and robust method and its application, we refer the readers to Nowak et al. (2010). This is repeated for all the annual flow values.

Projections of Operationally Critical Reservoir Elevations
Of the many variables CRMMS projects, we focus on simulated pool elevation at Lakes Mead and Powell, since various pool elevation thresholds in the two largest CRB reservoirs play a crucial role in determining water resources management throughout the basin. For example, when the August-projected end of year water level of Lake Mead drops below 1,075 ft a Level 1 Shortage Condition is declared for the Lower Basin and deliveries are reduced. A Lake Powell related threshold exists at elevation 3,700 ft, when the reservoir releases extra water to distribute high flows (i.e., the Equalization tier; Baker, 2019;USBR, 2020). These policies then percolate to regional and local water management strategies across the basin.
CRMMS simulations were run for 32 hindcasts spanning 1982-2017 using the disaggregated RF projections from the 3-, 5-, and 7-year means as hydrologic flow forcings. The 3-, 5-, and 7-year means were used as CRMMS inputs to test the shortest, midlength, and longest multiyear mean projections. The 10-year means only produced 27 hindcasts due to the longer mean length making 2008-2017 the last available full block of historical data. Each hindcast is a 3-, 5-, or 7-year-long block run at a monthly resolution. An ESP hindcast data set trained on observed 1981-2010 precipitation and temperature sequences and originally produced by the National Weather Service Colorado River Basin Forecasting Center (CBRFC) was used as a "baseline" hydrologic forcing in CRMMS simulations covering the 1982-2017 hindcast period. For each CRMMS run of a retrospective forecast, the ESP trace that covered the same time period was dropped prior to CRMMS input (i.e., the ESP trace that started in the same month and year as the CRMMS simulation to be run was dropped). This made the number of ESP traces either 29 or 30 for forecasts made before or after 2010, respectively. ESP was used as a baseline model since it is used by Reclamation and other stakeholders in the CRB for seasonal-to annual-scale planning and risk assessment as well as for multiyear exploratory studies. Finally, a CRMMS simulation forced with historical flows in each subbasin was run for each projection block to be used as a "truth" for calculating the projection skill of reservoir elevations. The historical simulation is not the true observed pool elevation since operational rules have changed over the course of the hindcast period but represents what would have been the observed pool elevation if the current Law of the River policies, including the Interim Guidelines, had been in place for the given historical unregulated flows. Since ESP and the historical simulation are at most 60 months long, the 7-year flow projection simulations were subset to a 5-year length for plotting and skill analysis. Similarly, the ESP and historical simulation were subset to a 36-month length when analyzed with the 3-year mean flow projections.

Validation of Models
We validated both the ensemble projections of flows and the monthly reservoir pool elevations, for the hindcast period of 1982-2017. Each framework was examined with separate performance metrics. For each forecast year, i, and mean length, N, the flow projections were made in a "blind" K-fold cross-validation model, where K = N + maximum(N,re p ) −1, such that the feature vector of that year and any preceding or following running means with overlapping knowledge of the year to be forecast were not included in the RF training data set. In other words, this avoids the prospect of selecting the flow of the N years immediately prior to or following year i in training the projection model. The years selected in these blind training data sets were also used to disaggregate from multiyear mean projections to annual flows. A variety of performance metrics were calculated on both the multiyear mean flow projections and the subsequent CRMMS simulation results. Applications of metrics such 12 of 21 as the ranked probability skill score (RPSS), root mean square error (RMSE), continuous ranked probability skill score (CRPSS), Nash-Sutcliffe efficiency (NSE), and reliability are further derived in the Supporting Information Section A.

Results
Results from the flow and water resources decision variables are described below. For the multiyear mean projections of Lees Ferry naturalized streamflow, the 3-, 5-, and 7-year means are examined in Section 4.1 due to their skill, which generally outperforms both climatology and ESP, as well as their reliability, which is comparable with ESP's reliability. For the simulated reservoir pool elevations, disaggregated results from the 3-, 5-, and 7-year mean flows are presented in Section 4.2.

Streamflow Projections
The blind K-fold cross-validation ensemble projections of 3-, 5-, and 7-year mean flows for each year for the 1982-2017 period from RF are shown as boxplots along with the historically observed multiyear mean flows ( Figure 6). The flow projections can capture multiyear variability quite well for some mean lengths given the limited duration of training data. Hindcast projection results for the remaining multiyear means are shown in Figure SD1.

of 21
Aggregating individual MSE values for each predictor variable from every year in the 1982-2017 hindcast shows that past RE is the most important variable, followed by CESM-DPLE-simulated maximum and minimum temperature as can be seen for the 5-year projection in Figure 7. The RE captures the combined impact of temperature with the basin hydrology and physical characteristics. This pattern is consistent for mean lengths of 2-9 years ( Figures SE1-SE7). The variable importance diagnosis was not performed on 10-year means due to a training data set with too few years for adequate analysis after making the data "blind" for 10-year means (i.e., dropping 19 out of 36 years in the 1982-2017 training record).
The RPSS was calculated by tercile thresholds for each year in the hindcast period. These tercile thresholds represent the 33rd and 66th percentiles but vary depending on the mean length selected. RPSS values greater or less than 0 indicate better or worse performance, respectively, than a climatological ensemble forecast, with a score of 1 indicating a perfect forecast. For long-lead forecasts, any positive RPSS is considered satisfactory, while RPSS values of 0.5 are considered good and higher values considered excellent. The RF projections show a positive median skill score for all mean lengths except for the 8-and 10-year mean flows, the former of which is just below 0 (Figure 8). The temporal pattern of the 10-year mean flow projections is fairly well captured, however the magnitudes were not. This we believe is due to the relatively long mean length necessitated dropping 19 years of data when training the RF in order to generate a blind forecast and thus yielded a constrained and underdispersed sample space, hence the lower skill. Furthermore, the RF median RPSS mostly outperformed ESP median RPSS where applicable, with improvements in median RPSS ranging from 0.39 to 0.58 (Figure 8 and Table 1). ESP had a median RPSS of 0 for 2-year mean flow projections and a negative median RPSS for all other mean lengths, echoing prior studies that show little to no skill in ESP past a 1-year lead time.
Time series of RPSS for the 3-, 5-, and 7-year means show that RF outperforms ESP in most years of the hindcast, particularly when projecting 3-year mean flows during the beginning years of the Millennium drought, where ESP has little to no skill (Figure 9). RPSS time series of the remaining multiyear means are shown in Figure SD2. Reliability diagrams (Figure 10) show the relationship between the forecasted probability of an event and its observed frequency. A perfectly reliable forecast falls along the 1:1 line and in some cases (e.g., 5-year means) the RF is not only closer to the 1:1 line than ESP but also has a flatter rank histogram. A flat rank histogram indicates that, over the course of the hindcast, the observed flows being forecast have an equal probability of occurring in any rank of the ensemble projection. In other words, the forecasts and observations are random samples from the same probability distribution. Rank histograms that are skewed to high or low ranks suggest forecast quality issues such as bias or underdispersion (Hamill, 2001). Reliability scores ( Figure SD3, Candille et al., 2007;Figure 7. Variable importance plot for Random Forests generating 5-year mean flow projections for running 1982-2017 hindcasts. "re.p," "DPLE.Tmax," and "DPLE.Tmin," indicate past runoff efficiency, DPLE maximum temperature, and DPLE minimum temperature, respectively. Percent increase in mean square error (MSE) indicates how much the mean square error would increase if the select variable is randomly changed during the training period.
14 of 21 Frenette, 2019;Hersbach, 2000) are calculated through a CRPS decomposition; a perfectly reliable forecast is represented by a reliability score of 0 and higher values indicate lower reliability. Where comparable, ESP has consistently better reliability scores than RF; although RF reliability begins to improve for mean lengths of 6, 7, and 8 years potentially due to reduced variability induced by multiyear averaging. RF reliability worsens for the 9-and 10-year means, likely due to the smaller sample size available for training these forecasts.

Reservoir Pool Elevations
As mentioned, ensemble projections of future 3-, 5-, and 7-year annual, naturalized Lees Ferry flow from the RF method were disaggregated in space and time via the KNN approach described in the previous section. These space-time flows were then used to drive the CRMMS and compared with CRMMS simulations forced with ESP over the 1982-2017 hindcast period of running 3-or 5-year blocks. Since ESP's duration is 5 years, analysis of CRMMS results was stopped at the 60-month mark for the 7-year flow projections. Figures 11a and 11b show the hindcasts for Lakes Powell and Mead pool elevations, respectively, generated from the 5-year mean flow projections for the 1999-2003 period, which covers the worst years of the Millennium drought. At a 60-month lead time, the historical reservoir level at Lake Powell is not captured by either forecast method ensemble, although  the RF median and quartiles skew more toward the historical pool elevation compared to ESP (Figure 11a). At Lake Mead, both forecast methods capture the historical reservoir levels at all lead times, but the RF distribution again tends toward lower elevations relative to ESP (Figure 11b). Both forecast methods have similar sharpness and spread. Similar results are seen in other hindcast blocks in the 1982-2017 period as well as from the other multiyear mean lengths (Sections G.1-G.3 in Supporting Information).
For each of the 32 different hindcasts, an "ensemble RMSE" was calculated on the end of water year (EOWY) simulated pool elevations at Lakes Powell and Mead and is shown as boxplots in Figure 12. For a detailed description of the ensemble RMSE calculation, see Supporting Information Section A. It can be seen from the boxplots that,  while ESP performs similarly to, or better than, RF at shorter lead times (≤24 months), RF generally outperforms ESP at lead times of 36 months or longer. This is also corroborated in the median RMSE at all lead times (Tables 2  and 3). RF forecasts at longer lead times generated reductions in hindcast median RMSE relative to ESP ranging from −7% to −30% at Mead and −3% to −20% at Powell. Generally, these results indicate translation of skill in flow projections informed by temperature to improvements in water resources management variables.
CRPSS and NSE were also applied to the CRMMS simulations and showed similar results as the ensemble RMSE. For more information, see Supporting Information Section H. Overall, these results suggest that ESP could remain the preferred forecast method during the first 12 months of a midterm forecast but improvements in skill during years 3-5 could be gained through RF or similar methods.

Summary and Discussion
A simple decadal flow projection method was developed based on RF machine learning trained on CESM-DPLE simulations of UCRB temperature, past RE, and observed UCRB flow. Motivation for the model came from moderate correlation and nonlinearities between future multiyear mean flow and future multiyear mean CESM-DPLE simulated temperature, as well as between past RE and future flow. RF projections of 3-through 7-year mean flow were able to broadly capture multiyear variability and outperformed climatology over half of the time. Based on hindcast median RPSS, RF outperformed ESP for multiyear mean projections of 3-5 years. The greatest improvement over ESP occurred for 3-year means, where the RF and ESP hindcast median RPSS were 0.41 and −0.15, respectively, a 0.56 difference. In projecting 3-year means during the lowest flow years of the Millennium drought (1999-2005 and 2012-2015), RF had RPSS values at or above ∼0.5, whereas ESP RPSS values were at or below 0. Interestingly, in the only other low flow period in the hindcast (1987)(1988)(1989)(1990)(1991)(1992) during which model temperatures projections were anomalously cool, RF performed worse than ESP in projecting 3-year means. This potentially indicates the importance of both the warming trend and quality of projected temperature covariates to model performance in predicting low flow years. Otherwise, RF tends to outperform both ESP and climatology in low flow years for mean lengths of up to 5 years. Both methods perform similarly for high flow years and generally show improvements over climatology.
When used as a hydrologic flow forcing in an operational simulation model of the CRB, RF projections generally outperformed ESP in predicting pool elevations at Lakes Powell and Mead for lead times of 36 months or greater. The improvements of RF over ESP are noteworthy when using projections disaggregated from either 3-or 5-year means. The hindcast median ensemble RMSE of RF was, at best, between 12% and 30% lower than that of ESP for lead times between 36 and 60 months: equating improvements of approximately 5-16 ft. One caveat is that the   We suggest that modest improvements upon current forecast methods can be gained in projecting reservoir elevations at midterm time scales using CESM-DPLE simulated temperature data along with RF machine learning models. Advancements in midterm flow projection techniques are critically important for multiyear planning of reservoir management. As shown during the Millennium drought, rapid drawdowns can occur with sustained low flows and a lack of midterm shortage planning. The warming trend, associated uncertainty with possible reductions in streamflow and increases in drought risk, as well as continued population growth all support the necessity of smart, efficient, and informed midterm planning that emphasizes conversation when facing multiyear droughts; temperature conditioned flow forecasts can help meet this need.
Future work might investigate methods to improve skill of the RF approach and the covariates used. Of particular importance is improving the skill for precipitation and pre-1980 temperature simulations, upon which the performance of any derivative flow projection is heavily dependent. Future studies should investigate any new iterations of CESM or other climate model-based projections to leverage potential advances in hydroclimate projection skill. While we used a correlation-informed heuristic analysis to determine model covariates, more objective techniques like Generalized Cross-Validation (GCV) might be used during covariate selection, particularly if the parameter space is significantly enlarged with additional climate models. Additionally, the model training data set might be further expanded through inclusion of paleo-reconstructed flow and overlapping simulated hydroclimate data. Though we used future 2-through 10-year block lengths, similar approaches could be applied to multidecadal time scales for longer term planning. Finally, incorporation of other machine learning methods (e.g., neural networks) as well as wavelet decomposition of temperature, precipitation, and flow data may improve simulated flow projections by exploiting hydroclimate linkage patterns and regime-like flow behavior.