Near‐term forecasts of stream temperature using deep learning and data assimilation in support of management decisions

Deep learning (DL) models are increasingly used to make accurate hindcasts of management-relevant variables, but they are less commonly used in forecasting applications. Data assimilation (DA) can be used for forecasts to leverage real-time observations, where the difference between model predictions and observations today is used to adjust the model to make better predictions tomorrow. In this use case, we developed a process-guided DL and DA approach to make 7-day probabilistic forecasts of daily maximum water temperature in the Delaware River Basin in support of water management decisions. Our modeling system produced forecasts of daily maximum water temperature with an average root mean squared error (RMSE) of 1.1 to 1.4°C for 1-day-ahead and 1.4 to 1.9°C for 7-day-ahead forecasts across all sites. The DA algorithm marginally improved forecast performance when compared with forecasts produced using the process-guided DL model alone (0%-14% lower RMSE with the DA algorithm). Across all sites and lead times, 65%-82% of observations were within 90% forecast confidence intervals, which allowed managers to anticipate the probability of exceeding ecologically relevant thresholds and aided in decisions about releasing reservoir water downstream. The flexibility of DL models shows promise for forecasting other important environmental variables and aiding in decision-making.


| INTRODUCTION
The intersection between human and environmental water needs can generate complex water allocation decisions that are made with imperfect information about the future (Vorosmarty et al., 2000). Forecasts are predicted future conditions with associated uncertainty (Clark et al., 2001), and aquatic forecasts are becoming critical management tools for balancing various uses of water to generate more desirable outcomes (Turner et al., 2020; Viel et al., 2016). Parallel advances in predictive modeling, real-time environmental observing systems, and decision analysis tools can be leveraged to create more useful forecast products for water resources decisions.
To date, forecasts of aquatic systems have mostly relied on process-based models and statistical models. These models have enabled managers and other partners to anticipate future changes in water temperatures (Abdi et al., 2020; Thomas et al., 2020), dissolved oxygen concentrations (Matos & de Sousa, 1996), and streamflow (Block et al., 2009; Hansen et al., 2019; Turner et al., 2020). Process-based models define relations between driver data and the variable being forecasted a priori, whereas statistical models discover relations between driver data and the target variable during a model training phase with a few simple assumptions about model structure and distributions. Process error can be a large source of near-term ecological forecast error for process-based and statistical models (Dietze, 2017; Massoud et al., 2018; Thomas et al., 2020). This indicates that these types of models may misrepresent system dynamics with rigid equations that oversimplify the natural world, require a priori parameterization, and cannot adapt on the fly to mismatches between predictions and observations. Deep learning (DL) models, in contrast to process-based and simple statistical models, learn a complex mapping between drivers and output by exposing models to training data and an optimization scheme (Shen, 2018). DL models can be very accurate and have the potential to reduce error in forecasting by reducing process error. Promising examples of DL-based aquatic forecasting efforts include short-term forecasts of water demand (Jain et al., 2001), streamflow forecasting (Nevo et al., 2022; Xiang & Demir, 2020), and algal bloom prediction (Lee & Lee, 2018). However, rarely do DL models assimilate real-time water observations, such as those available for many sites in the United States (Hirsch & Fisher, 2014), to increase DL prediction accuracy.
Data assimilation (DA) is a method that compares model predictions to observations and adjusts model states, parameters, or inputs to optimize estimates of system states (Reichle, 2008). DA is commonly used to improve process-based model performance (Thomas et al., 2020), and a DL architecture with a similar ability to update model states with information from real-time data could provide superior forecasts for water-use decisions (e.g., Nearing et al., 2021).
In this paper, we describe a real-time forecasting system of stream water temperature that assimilates observations into a DL model as they are collected and produces forecasts of stream temperature 7 days into the future. Water temperature is an ecological master variable that governs rates of organismal metabolism (Yvon-Durocher et al., 2010), gas exchange (Raymond et al., 2012), and chemical transformations (Greaver et al., 2016), and it is a crucial aspect of water quality management and an important input for other scientific studies. Observations of stream temperature are extensive and increasing rapidly (Read et al., 2017), providing opportunity to produce accurate, data-driven forecasts of this important variable. In the spirit of learning quickly from real-world forecasting efforts, here we describe the forecasts we produced and delivered in 2021, beginning with a promising modeling approach while facing real-world deadlines, resource constraints, and surprises. We evaluate forecasts issued across 168 consecutive days, assess how DL forecast performance compares with and without DA, and compare this performance to a baseline persistence forecast. The DL forecasts of stream temperature with DA were considered by resource managers in the Delaware River Basin to optimize releases of water from reservoirs to cool downstream segments while retaining enough water to supply New York City and other municipalities with drinking water.

Research Impact Statement

Deep learning models can ingest real-time observations to make forecasts with associated uncertainty to aid in decision-making about releasing water from drinking water reservoirs.

| Study site

We generated real-time forecasts of daily maximum stream water temperature (hereafter referred to as "maximum water temperature") in the Delaware River Basin in support of drinking water reservoir management decisions. The Delaware River Basin is an ecologically diverse region and a societally important watershed along the East Coast of the United States as it provides drinking water to over 15 million people (Williamson & Lant, 2015). The extensive and expanding environmental monitoring network in the Delaware River Basin provides a great opportunity to leverage both historical and real-time aquatic information for forecasting stream water temperature at specific locations or basin-wide.
Drinking water reservoirs in the Delaware River Basin provide essential services to the public, while also maintaining water availability and suitable aquatic habitat in stream segments downstream from the reservoirs. The Flexible Flow Management Program between the State of Delaware, the State of New Jersey, the State of New York, the Commonwealth of Pennsylvania, and the City of New York dictates water management in the basin. The 2017 agreement includes provisions that aim to maintain maximum water temperature below 23.89°C (75°F) in the upper Delaware River Basin (above Lordville, New York) to ensure cold-water stream habitat. The Cannonsville, Pepacton, and Neversink Reservoirs thermally stratify, which means that during the summer, the deep-water release points of the reservoirs discharge cold hypolimnetic reservoir water downstream. Reservoir releases can therefore be used to mitigate anticipated temperature exceedances in streams below the dams. Water in these reservoirs is highly regulated given competing demands of drinking water diversions from the reservoirs to New York City, drinking water intakes downstream, maintaining flows for habitat, and maintaining cold-water stream habitat. The volume of water in the reservoir used to cool downstream river segments ("thermal bank") is limited, and reservoir managers need to anticipate when these downstream segments may exceed thermal thresholds to optimize the use of the thermal bank. Near-term forecasts of stream water temperature with associated uncertainty can help inform managers of conditions when stream temperature may exceed these thermal thresholds.
Five sites are downstream from the three New York City reservoirs that are relevant to release decisions for mitigating temperature exceedances and for which we forecasted water temperature (Figure 1). These sites included two sites on the West Branch of the Delaware River that are affected by releases from the Cannonsville Reservoir (at Hancock, New York and Hale Eddy, New York), one site on the East Branch of the Delaware River that is affected by releases from the Pepacton Reservoir (at Harvard, New York), one site on the Delaware River (at Lordville, New York) that is affected by releases from both Cannonsville and Pepacton Reservoirs, and one site on the Neversink River (at Bridgeville, New York) that is affected by releases from the Neversink Reservoir.
F I G U R E 1 Map of the target water temperature forecasting sites in the Delaware River Basin, with call-outs detailing stream distances between upstream reservoir outlets (open circles) and target sites (closed circles).

| Datasets
We trained our models in two phases. First, we pretrained five site-specific DL models on process-based model output to guide the model toward more physically consistent predictions of water temperature (as in Jia et al., 2021). Next, we fine-tuned our models on observed maximum water temperatures. Using these trained models, we then forecasted maximum water temperature 7 days into the future for every issue date during our forecast period. Below we describe the pretraining, fine-tuning, and forecasting datasets used to make near-term future predictions of water temperature.

| Spatial fabric and stream temperature observation dataset
We used the National Geospatial Fabric to define the physical characteristics of five stream reaches, each with <1 day travel time, whose downstream endpoints are our five target sites (Viger, 2014; Viger & Bock, 2014). We downloaded sub-daily observations of stream water temperature from the U.S. Geological Survey's (USGS) National Water Information System (NWIS) (U.S. Geological Survey, 2021), the Water Quality Portal (Read et al., 2017), and the Spatial Hydro-Ecological Decision System (http://db.ecosheds.org/) for both the fine-tuning dataset and for assimilating in real time when making forecasts. The sub-daily observations of stream temperature were aggregated to maximum water temperature values for each site, which were used as the target variable for our DL models and for real-time forecasting. We matched observations of maximum water temperature collected at a specific latitude and longitude to the nearest stream segment within a tolerance of 250 m. Observations farther than 5000 m along the river channel from the outlet of a segment were omitted from our training and forecasting dataset. Segments with multiple observation sites were aggregated to a single maximum water temperature value from all temperature values.

| Historical driver dataset
We used five drivers (i.e., input features) to train our DL models from 1985 to 2021, including gridMET daily minimum and maximum air temperature and daily average downward shortwave radiation (Abatzoglou, 2013), NWIS daily average reservoir release rate from the reservoirs directly upstream from each site, and observations of yesterday's maximum water temperature. When observations of yesterday's maximum water temperature were not available, we used yesterday's predicted daily mean water temperature from the process model pretraining dataset (described below). Observations of maximum water temperature were available for 68%-98% of the training period dates depending on the stream segment.

| Process-based model pretraining dataset
We used process-based model output from 1985 to 2020 to pretrain the DL models before fine-tuning on observations of maximum water temperature. Pretraining DL models on process-based model output can improve predictive performance, including for out-of-sample predictions (Read et al., 2019). For water temperature prediction, process-based models use an energy budget approach to estimate temperature change. Given the utility of learning these physical rules from an existing model and the emphasis on predicting summertime maximum water temperatures (which could be poorly represented in the fine-tuning dataset), pretraining was used to generate preliminary weights and biases for the long short-term memory (LSTM) models that were then refined via fine-tuning on actual observations (following Jia et al., 2021).
For each of the five site-specific models in the Delaware River Basin, we developed a pretraining dataset from process-based model predictions of daily average stream temperature from 1980 to 2020. The Precipitation Runoff Modeling System with a coupled Stream Temperature Network model (PRMS-SNTemp) was used to make initial predictions of daily average stream temperature (Markstrom, 2012;Sanders et al., 2017) using the calibrated flow parameters of Regan et al. (2018). PRMS-SNTemp is a natural flow model, meaning that it does not represent reservoirs within the stream network. This natural flow assumption routinely produces predictions of summer stream temperature that are biased high for stream segments downstream from reservoirs.
To improve upon the PRMS-SNTemp predictions below reservoirs, we simulated the water temperature of reservoir releases from two major reservoirs in the basin, Cannonsville and Pepacton Reservoirs, using the General Lake Model (GLM v3.1; Hipsey et al., 2019). These GLM simulations required lake structural information and were driven by daily inflows, outflows, inflow temperatures, and meteorological conditions. Reservoir hypsographic curves were acquired from Nystrom (2018). Observed daily inflows from major tributaries to each reservoir were obtained from the USGS National Water Information System (U.S. Geological Survey, 2021) and further modified by a multiplier of 1.8 for Pepacton Reservoir and 1.4 for Cannonsville Reservoir to represent total inflows from both gaged and ungaged tributaries; these multiplier values were selected through trial and error to achieve long-term water balance among inflows, outflows, precipitation, runoff, and evaporation in each reservoir. Inflow temperatures were estimated as the flow-weighted average of PRMS-SNTemp daily predictions of average stream temperature in the gaged tributaries. Recorded outflows (releases and diversions) were obtained from USGS New York Water Data Reports (U.S. Geological Survey, 2021). Given these input data, the GLM parameters cd (bulk aerodynamic transfer coefficient for momentum), sw_factor (scaling factor for shortwave radiation), and Kw (light extinction coefficient) were then calibrated using the optim() function in R to minimize root mean squared error (RMSE) relative to in-reservoir water temperature observations, which were provided by the New York City Department of Environmental Protection (NYC DEP; Oliver et al., 2021) and supplemented with data from NWIS. We used the calibrated GLM simulations to predict the temperature of released water at the dam, computed as the flow-weighted average of reservoir water temperatures at the two release depths (one surface, one deep) for each reservoir.
The final pretraining dataset was computed as a weighted average of temperature predictions from PRMS-SNTemp and GLM, where GLM predictions only influenced stream segments downstream from GLM-simulated reservoirs. The weight given to GLM predictions in each reach was a function of distance, where stream segments closer to upstream reservoirs had water temperatures more similar to the reservoir releases. The reservoir weight began at 1 at the reservoir outlet and declined with distance downstream according to an exponential decay function that was fit to stream temperature observations; the decay rate was 0.036 km −1 for Cannonsville Reservoir and 0.040 km −1 for Pepacton Reservoir, such that the weight given to GLM predictions was <0.1 after 64 stream kilometers.
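The distance-decay blending described above can be sketched in a few lines of Python. This is an illustrative sketch, not the operational code; the function and variable names are ours, but the decay rates are those reported in the text.

```python
import math

def blended_temperature(prms_pred, glm_pred, dist_km, decay_per_km):
    """Blend process-model predictions for a segment below a reservoir.

    The weight on the GLM (reservoir release) prediction starts at 1 at the
    dam outlet and decays exponentially with downstream distance; the
    remaining weight goes to the PRMS-SNTemp prediction.
    """
    w = math.exp(-decay_per_km * dist_km)  # weight on GLM prediction
    return w * glm_pred + (1.0 - w) * prms_pred

# With the Cannonsville decay rate of 0.036 km^-1, the GLM weight falls
# below 0.1 after roughly 64 stream kilometers, as stated in the text.
```

At the dam outlet (`dist_km = 0`) the blended value equals the GLM release temperature; far downstream it converges to the PRMS-SNTemp prediction.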

| Forecasted driver dataset
When making near-term forecasts, we used the National Oceanic and Atmospheric Administration's (NOAA's) Global Ensemble Forecast System (GEFS; https://nomads.ncep.noaa.gov/) operational forecasts of minimum and maximum air temperature and daily average downward shortwave radiation to predict daily maximum stream temperature for the forecast issue date and 7 days into the future. We also used yesterday's maximum water temperature (i.e., autoregression) and today's reservoir releases as additional drivers (Table 1; Figure 2).

TABLE 1 Summary of the different datasets used to train the deep learning (DL) models and forecast stream temperature 7 days into the future. The DL models produced forecasts with and without data assimilation (DA), and their model performance was compared to a baseline persistence model that assumes yesterday's maximum water temperature will recur on all 7 future days. Dataset sources include the U.S. Geological Survey's National Water Information System (NWIS), the Water Quality Portal (WQP), Spatial Hydro-Ecological Decision System (EcoSHEDS), New York City Department of Environmental Protection (NYCDEP), the Office of the Delaware River Master (ODRM), gridMET, National Oceanic and Atmospheric Administration's Global Ensemble Forecasting System (GEFS), Precipitation Runoff Modeling System with a coupled Stream Temperature Network Model (PRMS-SNTemp), and the General Lake Model (GLM). DL and persistence model architecture and training procedures are described in Sections 2.7, 2.9, and 2.10 below. (Table columns: Input category, Input, Source.)

We generated deep learning model predictions 1 day at a time, building on predictions from days earlier in the prediction sequence. For predictions made with a 0-day lead time (i.e., nowcast), we used today's reservoir releases and yesterday's maximum water temperature analysis after assimilating observations (see Section 2.8) as drivers. For predictions made with 1- to 7-day lead times, we used reservoir release scenarios (see release scenario description below) and model predictions of yesterday's maximum water temperature as drivers.
We can anticipate realistic release scenarios for 7 days into the future because the Flexible Flow Management Program dictates release volumes that are mandated given the day of year and current reservoir storage. These releases are referred to as "conservation releases" and are set once per month. For every forecast generated, we made forecasts using two different reservoir release scenarios, one with no additional reservoir release (current conservation release volume + 0 cfs) and one simulating a relatively high reservoir release (current conservation release volume + 100 cfs) from the Cannonsville Reservoir. The +0 cfs scenario simulates normal reservoir operations, whereas the +100 cfs scenario simulates a thermal bank release that would be used to mitigate expected temperature exceedances.
The +100 cfs scenario was typically a ~20% increase in outflow from the Cannonsville Reservoir during the forecast period. The additional release was added only to the day-1 lead time reservoir release driver and only for the Cannonsville Reservoir because this is the main reservoir used to mitigate thermal exceedances for the Delaware River at Lordville site (Figure 2). This +100 cfs release scenario was intended to help the reservoir manager anticipate what would happen to future maximum water temperature if a thermal bank release were made tomorrow.

FIGURE 2 Schematic of forecast inputs and outputs for the deep learning with DA model, including when observations or driver data are being generated and used in the model. The model output panel shows forecast ensemble averages and 90% confidence interval (CI) for the forecast issued on June 28, 2021, for the Delaware River at Lordville site.

GEFS operational forecasts are provided at 3-h timesteps in UTC for 240 h past the forecast issue time. Starting with the 03:00 valid time, timesteps represent the average, minimum, or maximum of the preceding 3 h depending on the meteorological driver forecasted. To transform these timesteps to daily values in average solar time in the Delaware River Basin (approximately UTC -5:00), we treated the 09:00 through 30:00 UTC (4:00-25:00 in UTC -5:00) timesteps as day 0, 33:00-54:00 UTC (28:00-49:00 in UTC -5:00) as day 1, and so on. This provided the closest possible alignment of GEFS timesteps with average solar time in the Delaware River Basin. Minimum, maximum, and average daily values for these timesteps were then calculated for each GEFS grid cell for minimum air temperature, maximum air temperature, and downward shortwave radiation, respectively. To map GEFS values to individual stream segments, we matched a 0.25° GEFS grid cell with the centroid of the target stream segment and used the meteorological drivers of that grid cell for the given segment. Each stream segment was almost entirely contained within a single GEFS grid cell. All 31 GEFS ensemble members were used to make ensemble forecasts of stream temperature; see Section 2.10 below for how the GEFS ensembles are incorporated into the DL forecasts.
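The timestep-to-day alignment described above can be sketched as a small helper function. This is our own illustrative sketch (the function name and the handling of the pre-day-0 timesteps are assumptions), with hours counted past the 00:00 UTC issue time.

```python
def gefs_day_index(hours_past_issue):
    """Map a GEFS 3-h valid time (hours past the 00:00 UTC issue time) to a
    forecast day in approximate Delaware River Basin solar time (UTC -5:00).

    Timesteps valid 09:00-30:00 UTC fall on day 0, 33:00-54:00 UTC on
    day 1, and so on; the 03:00 and 06:00 timesteps precede day 0 and are
    returned as None.
    """
    if hours_past_issue < 9:
        return None
    return (hours_past_issue - 9) // 24
```

Grouping the 3-h timesteps by this index and taking the daily minimum, maximum, or average then yields the driver values for each forecast day.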

| DL model
We used an LSTM network to forecast daily maximum water temperature. An LSTM is a type of recurrent neural network that captures temporal relations among system states and has been used with great success in several other hydrologic modeling applications (Kratzert et al., 2018; Rahmani et al., 2020). The cell states of the system, $c_t$, evolve through time and are modified at each time $t$ by a filtered, transformed version of the model inputs at that time, $x_t$ (e.g., meteorological drivers). The LSTM also inherits information from previous model timesteps through a second set of states (hidden states, $h_{t-1}$) that are a function of $c_{t-1}$. At each timestep, the LSTM first generates a candidate cell state $\tilde{c}_t$ as:

$$\tilde{c}_t = \tanh\left(W_c^h h_{t-1} + W_c^x x_t + b_c\right) \tag{1}$$

where $\tanh(\cdot)$ is the hyperbolic tangent function, $W_c^h$ and $W_c^x$ are learnable weight matrices of the hidden state and input features, respectively, and $b_c$ is a learnable bias vector. $\tilde{c}_t$ represents new "memories" that could be used to make predictions at timestep $t$. Then the LSTM generates a forget gate, $f_t$, an input gate, $i_t$, and an output gate, $o_t$, as:

$$f_t = \sigma\left(W_f^h h_{t-1} + W_f^x x_t + b_f\right) \tag{2}$$

$$i_t = \sigma\left(W_i^h h_{t-1} + W_i^x x_t + b_i\right) \tag{3}$$

$$o_t = \sigma\left(W_o^h h_{t-1} + W_o^x x_t + b_o\right) \tag{4}$$

where $\sigma(\cdot)$ is the sigmoid function, and $W$ and $b$ are again matrices and vectors, respectively, of learnable model parameters. The forget gate is used to filter the information inherited from $c_{t-1}$, and the input gate is used to filter the candidate cell state at time $t$, in essence determining how much old and new memory will be used for the new cell state and then passed to the next timestep. The output gate determines how much and which parts of the current cell state are used to update the hidden state, $h_t$, which is then used in the water temperature prediction. The new cell state, $c_t$, and the hidden representation, $h_t$, are computed as:

$$c_t = f_t \otimes c_{t-1} + i_t \otimes \tilde{c}_t \tag{5}$$

$$h_t = o_t \otimes \tanh\left(c_t\right) \tag{6}$$

where $\otimes$ represents element-wise multiplication.
The predicted target variables, $\hat{y}_t$ (in this case just maximum water temperature), are calculated as:

$$\hat{y}_t = W_y h_t + b_y \tag{7}$$

where $W_y$ and $b_y$ are a learnable output weight matrix and bias vector. During the model training phase, all weights, $W$, and biases, $b$, of the LSTM are optimized via iterative minimization of a loss function, $\mathcal{L}$, via the back-propagation algorithm (Rumelhart et al., 1986). Our loss function was the RMSE across all timesteps:

$$\mathcal{L} = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\left(\hat{y}_t - z_t\right)^2} \tag{8}$$

where $T$ is the total number of training timesteps and $z_t$ are the observations at time $t$.
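A single LSTM timestep, as described by the candidate-state, gate, and state-update equations above, can be sketched in NumPy. This is an illustrative sketch rather than the operational TensorFlow code; the weight and bias dictionary keys are our own naming.

```python
import numpy as np

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM timestep: candidate cell state, forget/input/output gates,
    and the cell- and hidden-state updates (minimal NumPy sketch)."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    c_cand = np.tanh(W["ch"] @ h_prev + W["cx"] @ x_t + b["c"])  # candidate cell state
    f_t = sigmoid(W["fh"] @ h_prev + W["fx"] @ x_t + b["f"])     # forget gate
    i_t = sigmoid(W["ih"] @ h_prev + W["ix"] @ x_t + b["i"])     # input gate
    o_t = sigmoid(W["oh"] @ h_prev + W["ox"] @ x_t + b["o"])     # output gate
    c_t = f_t * c_prev + i_t * c_cand    # new cell state (element-wise)
    h_t = o_t * np.tanh(c_t)             # new hidden state
    return h_t, c_t
```

With all weights and biases at zero, the gates sit at 0.5 and the cell state simply halves at each step, which is a convenient sanity check of the update rules.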
We estimated uncertainty in our DL model predictions using Monte Carlo Dropout (MCD) as described in Gal and Ghahramani (2016), where the DL model randomly removes a proportion of the network's recurrent and input elements (by setting their weights to 0) during each training iteration or prediction. When making many predictions with MCD, this produces an ensemble of LSTM model structures, such that the distribution of their predictions approximates the DL model uncertainty. We used MCD during training and for an ensemble of predictions; specifics of these procedures, model dimensions, and other model hyperparameters are described in Section 2.9 below.
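The MCD idea can be illustrated with a toy model: each forward pass randomly zeroes a proportion of the inputs, and the spread of repeated stochastic predictions approximates model uncertainty. The toy linear model below stands in for the paper's TensorFlow LSTM; the function name, the inverted-dropout scaling, and the default rates are our own choices.

```python
import numpy as np

def mc_dropout_predictions(weights, x, n_samples=100, rate=0.4, seed=0):
    """Monte Carlo Dropout sketch: apply a fresh random dropout mask to the
    inputs on every forward pass and collect the resulting ensemble of
    predictions (toy linear model for illustration only)."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_samples):
        keep = rng.random(x.shape) >= rate               # random dropout mask
        x_drop = np.where(keep, x, 0.0) / (1.0 - rate)   # inverted-dropout scaling
        preds.append(weights @ x_drop)                   # one stochastic forward pass
    return np.array(preds)
```

The ensemble mean approximates the deterministic prediction (the scaling preserves the expected value), while the ensemble spread reflects the model-structure uncertainty that MCD is meant to capture.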

| Data assimilation
During the forecasting period, we used DA to adjust the LSTM states based on recently collected maximum water temperature observations. Previous studies have shown potential for DA improving DL model predictions (Brajard et al., 2020), but autoregressive DL models may perform just as well or better than DL models with DA (e.g., Fang & Shen, 2020; Nearing et al., 2021). Memory cells from a trained LSTM network can represent dynamic system storages that can be similar to our understanding of environmental processes (Kratzert, Herrnegger, et al., 2019; Lees et al., 2022), indicating that LSTM cell states could be updated via model-data fusion techniques using the learned processes from the trained LSTM to connect across different datasets (Dietze et al., 2013). Given the promise of both DA and autoregressive DL models to improve predictions, we developed a model that uses both techniques to ingest real-time observations to forecast maximum water temperature.
We used the ensemble Kalman filter (EnKF) as our DA algorithm to update maximum water temperature predictions and the cell states of the LSTM (Evensen, 1994). We used the EnKF given the ease of adding multiple sources of uncertainty via the ensemble members including meteorological driver uncertainty (e.g., NOAA GEFS ensembles), model process uncertainty (e.g., MCD), and initial condition uncertainty (e.g., initial LSTM states).
We concatenated the LSTM state estimates (stream temperature and cell states) for each ensemble member $m$ into one vector, $Y_{m,t}$, of length $s + s \times u$, where $s$ is the number of stream segments and $u$ is the number of LSTM hidden units:

$$Y_{m,t} = \left[\hat{y}_{m,t},\; c_{m,t}\right]^\top \tag{9}$$

When observations were available at time $t$, each ensemble member was updated as:

$$Y_{m,t}^{u} = Y_{m,t} + K_t\left(z_t + \epsilon_{m,t} - H_t Y_{m,t}\right) \tag{10}$$

where $z_t$ is the vector of stream temperature observations, $\epsilon_{m,t}$ is random noise drawn from the observation error distribution, and $H_t$ is an $s$ by $[s + s \times u]$ matrix that maps the state vector to the observed quantities. Equation (11) shows what $H_t$ would look like if there were two stream segments, $a$ and $b$, and three LSTM hidden units:

$$H_t = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix} \tag{11}$$

$K_t$ was the Kalman gain weighting matrix, expressed as:

$$K_t = P_t H_t^\top \left(H_t P_t H_t^\top + R_t\right)^{-1}, \quad P_t = \frac{1}{N_m - 1}\,\Delta Y_t \Delta Y_t^\top \tag{12}$$

where $R_t$ was an $s$ by $s$ observation error covariance matrix. If two stream segments, $a$ and $b$, were predicted in the forecasting phase, $R_t$ would be expressed as:

$$R_t = \begin{bmatrix} \mathrm{Var}\left(z_{a,t}\right) & 0 \\ 0 & \mathrm{Var}\left(z_{b,t}\right) \end{bmatrix} \tag{13}$$

where $\mathrm{Var}$ was the estimated variance around observations of stream temperature in each stream segment at time $t$. Variance of maximum water temperature observations was set to 0.25°C for all observation times and locations based on reported error in thermistors and temperature variation within stream segments when there were multiple sampling locations. More analyses are needed to determine if observation error varies significantly by location, day of year, or stream conditions (e.g., streamflow).

$\Delta Y_t$ was an $[s + s \times u]$ by $N_m$ matrix of all ensemble deviations from the average of the estimated states at time $t$ ($\bar{y}_t$), expressed as:

$$\Delta Y_t = \left[\Delta y_{1,t},\; \Delta y_{2,t},\; \ldots,\; \Delta y_{N_m,t}\right] \tag{14}$$

where the $m$th column of $\Delta Y_t$ was:

$$\Delta y_{m,t} = \hat{y}_{m,t} - \bar{y}_t \tag{15}$$

The cell states ($c$) of the LSTM were set to the updated states from $Y_{m,t}^{u}$ (Equation 10), and we then proceeded to the next timestep to make predictions at time $t + 1$. We did not update the LSTM $h$ states because maximum water temperature and $c$ states were updated and the $h$ state is calculated from the updated $c$ state.
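The EnKF update step can be sketched generically in Python. This is a sketch of the standard perturbed-observation EnKF, not the operational code; variable names mirror the text, but the interface and the random-number handling are our own assumptions.

```python
import numpy as np

def enkf_update(Y, H, z, obs_var, rng=None):
    """Perturbed-observation EnKF update for a states-by-members matrix Y.

    Computes ensemble deviations, the sample state covariance, and the
    Kalman gain, then nudges every ensemble member toward a perturbed copy
    of the observations so the updated ensemble keeps realistic spread."""
    rng = rng or np.random.default_rng(0)
    n_states, n_members = Y.shape
    y_bar = Y.mean(axis=1, keepdims=True)
    dY = Y - y_bar                                   # ensemble deviations
    P = dY @ dY.T / (n_members - 1)                  # sample state covariance
    R = np.diag(obs_var)                             # observation error covariance
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)     # Kalman gain
    # perturb observations with draws from the observation error distribution
    Z = z[:, None] + rng.normal(0.0, np.sqrt(obs_var)[:, None], (len(z), n_members))
    return Y + K @ (Z - H @ Y)
```

For a single observed segment with two cell states, `H` would be `[[1, 0, 0]]`: only the temperature is observed, yet the gain `K` also updates the unobserved cell states through their sample covariance with temperature.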

| Model training
We trained separate, site-specific LSTM models on five management-relevant and well-monitored segments within the Delaware River Basin.
All five stream segments are 14.6 to 70.1 km downstream from reservoirs (Figure 1). Each model was first pretrained for 50 epochs at a learning rate of 0.05 with data from a single segment from 1985-05-01 to 2020-04-01, using the pretraining dataset as the target features and the historical drivers as the input features. Next, pretrained model weights and biases were fine-tuned with observations of maximum water temperature for each stream segment, for 350 epochs, at a learning rate of 0.05, using the same historical drivers from 1985-05-01 to 2021-04-14. We used six hidden units and recurrent and elemental dropout rates of 0.4 for both the pretraining and fine-tuning phases. The model hyperparameters (e.g., number of epochs, learning rates, hidden units, dropout rates) were tuned manually to achieve reasonable model performance. Due to the quick turnaround time for going from DL-DA research to operational forecasts (~1 month), we did not perform a thorough hyperparameter or input feature search; however, we found these hyperparameters and input features to perform well for our DL forecasting models based on examining accuracy and uncertainty quantification metrics (e.g., RMSE, model reliability). After the models were trained, the ending LSTM states from the fine-tune training phase were used to initialize the LSTM states during the start of the forecasting phase. We used TensorFlow v2.5.0 (TensorFlow Developers, 2021) to train and forecast with the LSTM models.

| Model forecasts
At each forecast issue timestep, we made predictions for the issue date (day 0) and 7 days into the future using the starting conditions of the previous timestep's LSTM model (i.e., maximum water temperature, hidden, and cell states). We made predictions using trained LSTM models with data assimilation (DL-DA) and without data assimilation (DL) as well as a deterministic persistence forecast as our baseline model (Table 1).
The persistence model simply forecasts the same maximum water temperature that was observed yesterday for all 8 days that follow. The DL model forecasted maximum water temperature 7 days into the future using the drivers described in Table 1. For the DL-DA model, if observations were available at a given timestep, we assimilated the observations into the LSTM model as described in Equations (10)-(15) and updated the stream temperature estimates and LSTM cell states (Figure 2). The DL-DA model for the next forecast issue date was initialized with the updated states and was used to make predictions 7 days into the future.
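The persistence baseline is trivial to state in code (a sketch; the function name and horizon argument are our own):

```python
def persistence_forecast(yesterday_max_temp, horizon=8):
    """Baseline persistence forecast: yesterday's observed maximum water
    temperature is repeated for the issue date (day 0) through day 7."""
    return [yesterday_max_temp] * horizon
```

Despite its simplicity, this baseline is a meaningful benchmark because day-to-day maximum water temperature is strongly autocorrelated.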
We evaluated forecasts generated from 2021-04-16 to 2021-09-30 because 2021-04-16 was when we started downloading operational GEFS forecasts from NOAA, and we stopped forecasting on 2021-09-30 because thermal exceedances were extremely unlikely after this date. Our LSTM models used the ending states from the fine-tune training phase (fine-tune training ended on 2021-04-14) to initialize the hidden and cell states during the forecasting phase. The DL and DL-DA models incorporated three different sources of uncertainty via the ensemble predictions: driver uncertainty (e.g., GEFS ensembles), DL model uncertainty (e.g., MCD), and initial condition uncertainty (e.g., initial hidden and cell states and yesterday's maximum water temperature predictions). Each DL and DL-DA model had a total of 3100 ensemble members. To produce 3100 ensemble member predictions, each of the 31 GEFS ensemble input features was
repeated 100 times and used as batch inputs to the DL and DL-DA models. Because we used MCD when making forecasts, a distribution of 100 predictions were produced for each GEFS ensemble member representing the DL model uncertainty for predicting maximum water temperature using a particular GEFS ensemble member as an input feature. In addition, each of the 3100 ensemble members started at a slightly different hidden and cell state representing uncertainty in model initial conditions. Collectively, the 3100-ensemble prediction distribution represents our total forecast uncertainty when taking into account driver uncertainty, DL model uncertainty, and initial condition uncertainty. We produced 100 MCD predictions for each GEFS ensemble to balance the representativeness of the prediction distribution and computational speed and storage.
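Constructing the 3100-member batch by repeating the GEFS members can be sketched as follows; the array shapes (31 members, timesteps, drivers) are illustrative.

```python
import numpy as np

def build_forecast_batch(gefs_ensembles, n_mcd=100):
    """Repeat each of the 31 GEFS ensemble members n_mcd times so that one
    batched, dropout-active forward pass yields 31 x 100 = 3100 ensemble
    predictions; gefs_ensembles has shape (31, n_timesteps, n_drivers)."""
    return np.repeat(gefs_ensembles, n_mcd, axis=0)  # (3100, n_timesteps, n_drivers)
```

Because MCD draws a fresh dropout mask per batch element, the 100 copies of each GEFS member produce 100 distinct predictions, giving the combined driver-plus-model uncertainty described above.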
All model training was conducted on USGS Advanced Research Computing resources using CPUs (Falgout et al., 2019), and the operational forecasts were generated using Amazon Elastic Compute Cloud instances. Operational forecasts were run each day immediately following a scheduled observation data pull that occurred at 9:00 a.m. EDT. Because observations of daily maximum water temperature had not yet occurred at 9:00 a.m. EDT, we could only assimilate maximum water temperature observations into our model at lead time day −1 or earlier (Figure 2). Forecasts were summarized as the average of the ensemble predictions, 90% confidence intervals (CIs) of the ensembles, and the probability of exceeding 23.89°C (75°F) over the next 7 days for each of the sites of interest, and this summarized forecast output was sent to the New York City reservoir managers to aid in thermal release decision making (Table 2).

| Model evaluation
We evaluated model forecast performance with bias, RMSE, and continuous ranked probability score (CRPS; Thomas et al., 2020) by comparing valid date predictions with observations of maximum water temperature that occurred on the same dates. Bias and RMSE reflect accuracy of the average of the ensembles relative to the observed maximum water temperature. CRPS measures both the accuracy and precision of the full distribution of ensemble predictions (interpreted as a probabilistic forecast), where lower values indicate a better model performance.
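The three scores can be computed from an ensemble forecast matrix as in the sketch below (array shapes and the function name are assumptions; CRPS uses the standard ensemble estimator E|X − y| − 0.5·E|X − X′| rather than the paper's exact implementation):

```python
import numpy as np

def forecast_scores(ensemble, obs):
    """Score ensemble forecasts (n_times x n_members) against
    observations (n_times,). Returns bias and RMSE of the ensemble
    mean, and the mean ensemble CRPS (lower is better)."""
    mean_pred = ensemble.mean(axis=1)
    bias = np.mean(mean_pred - obs)
    rmse = np.sqrt(np.mean((mean_pred - obs) ** 2))
    # Ensemble CRPS: E|X - y| - 0.5 * E|X - X'|
    term1 = np.mean(np.abs(ensemble - obs[:, None]), axis=1)
    term2 = 0.5 * np.mean(
        np.abs(ensemble[:, :, None] - ensemble[:, None, :]), axis=(1, 2)
    )
    crps = np.mean(term1 - term2)
    return bias, rmse, crps
```

A perfect deterministic forecast scores zero on all three metrics; CRPS additionally penalizes ensembles that are too wide or too narrow, which is why it is used alongside RMSE for probabilistic forecasts.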
Because multiple forecasts were valid for any given day, we could analyze how model forecast accuracy changed with lead time and how it may have changed across the entire forecast period.

TABLE 2
Example of summarized forecast output that can be delivered daily to the New York City reservoir managers.

We evaluated how well the model characterized forecast uncertainty with reliability plots, in which we calculated the proportion of observations that fall within CIs calculated from the ensemble predictions as described in Thomas et al. (2020). A well-calibrated forecasting model would have 10% of observations within the 10% forecast CI (i.e., the 45th-55th quantiles of the forecast probability distribution), 20% of observations in the 20% forecast CI (the 40th-60th quantiles), and so on. If a higher or lower percentage of the observations fall within a given forecast CI, then the model is considered underconfident or overconfident, respectively.
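The reliability calculation above can be sketched as follows (the function name and array shapes are assumptions; central CIs are taken from ensemble quantiles as described in the text):

```python
import numpy as np

def ci_coverage(ensemble, obs, ci_levels=(10, 20, 30, 40, 50, 60, 70, 80, 90)):
    """For each central CI level, compute the proportion of observations
    falling inside the interval defined by ensemble quantiles.
    ensemble: (n_times, n_members); obs: (n_times,)."""
    coverage = {}
    for ci in ci_levels:
        # e.g., the 20% CI spans the 40th-60th percentiles
        lo = np.percentile(ensemble, 50 - ci / 2, axis=1)
        hi = np.percentile(ensemble, 50 + ci / 2, axis=1)
        coverage[ci] = np.mean((obs >= lo) & (obs <= hi))
    return coverage
```

A well-calibrated model gives `coverage[ci] ≈ ci / 100`; coverage below the nominal level indicates overconfidence (intervals too narrow), and coverage above it indicates underconfidence.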
Model code used to create pretraining process-based datasets, train the DL models, and forecast with the trained DL models can be found at Zwart et al. (2022b). All model drivers and predictions used in this analysis can be found at Oliver et al. (2021) and Zwart et al. (2022a).

| DA and DL
Our DA routine updated LSTM cell states using Equations (9-15). As an example, we used the Delaware River at Lordville DL-DA model to show relations among temperature predictions, LSTM cell states, and the adjustment of these states after assimilating observations of daily maximum water temperature. The sample error covariance showed that some cell states covaried more strongly with stream temperature than others; cell states 2, 5, and 6 had the strongest covariance with maximum water temperature predictions for the Lordville DL-DA model (Figure 3). Assimilating maximum water temperature observations into the LSTM adjusted the average ensemble temperature and cell states (Figure 4). Maximum water temperature predictions and cell states 2, 5, and 6 showed the greatest adjustment after assimilating maximum water temperature observations for the Lordville DL-DA model.
After assimilating maximum water temperature observations into the LSTM, error in the average of the ensemble maximum water temperature predictions compared to observations was reduced (e.g., RMSE over the whole season for lead time day −1 before and after assimilating observations was 0.94 and 0.32°C, respectively; Figure 5). Model state adjustments via DA only occurred on lead time day −1 (Figure 2), so Figure 5 represents the error in maximum water temperature predictions for lead time day −1 for both the DL (prior) and DL-DA (posterior) models.
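The paper's exact update Equations (10-15) are not reproduced in this section, but an ensemble-Kalman-style cell-state update of this general form can be sketched as below. All names, the observation error variance, and the perturbed-observation scheme are assumptions for illustration, not the authors' exact algorithm:

```python
import numpy as np

def enkf_update_cell_states(temp_pred, cell_states, obs, obs_var=0.25):
    """Ensemble-Kalman-style update (a sketch, not the paper's exact
    Equations 10-15): nudge each ensemble member's LSTM cell states
    toward values consistent with the observed maximum temperature.

    temp_pred:   (n_ens,) predicted max water temperature per member
    cell_states: (n_ens, n_cells) LSTM cell states per member
    obs:         scalar observed max water temperature
    """
    n_ens = temp_pred.shape[0]
    d_temp = temp_pred - temp_pred.mean()
    d_cell = cell_states - cell_states.mean(axis=0)
    # Cross-covariance between each cell state and the prediction, and
    # the prediction's own variance (cf. the sample covariance in Figure 3)
    cov_cy = d_cell.T @ d_temp / (n_ens - 1)   # (n_cells,)
    var_y = d_temp @ d_temp / (n_ens - 1)
    gain = cov_cy / (var_y + obs_var)          # Kalman gain per cell state
    # Perturbed-observation update: each member sees a noisy copy of obs
    rng = np.random.default_rng(0)
    perturbed_obs = obs + rng.normal(0.0, np.sqrt(obs_var), n_ens)
    innovation = perturbed_obs - temp_pred     # (n_ens,)
    return cell_states + innovation[:, None] * gain[None, :]
```

Cell states that covary strongly with the temperature prediction receive a larger gain and therefore a larger adjustment, which is consistent with cell states 2, 5, and 6 moving the most at Lordville.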

FIGURE 3
Sample covariance between daily maximum stream temperature predictions (temperature) and long short-term memory (LSTM) cell states 1-6 (c1-c6) for the Delaware River at Lordville DL-DA model from 2021-04-15 to 2021-09-30. Sample covariance is calculated as (1 / (N_m − 1)) ΔY_t ΔY_t^⊤ as shown in Equation (12).

| Model accuracy
Our DL-DA models predicted maximum water temperature dynamics across the forecasting period with an average RMSE of 1.2°C across all five stream sites for 1-day lead times, while the persistence baseline model had an average RMSE of 1.97°C (Figure 6). The addition of our DA algorithm (DL-DA model, Figure 7) marginally improved DL forecast performance for lead times 0 and 1 across all stream segments (0%-14% lower RMSE with the DA algorithm). Our models produced predictions of maximum water temperature for 0- to 7-day lead times, with RMSEs ranging from 1.1 to 1.4°C across all segments for day-ahead forecasts (lead time day 1) for the DL-DA model (Figure 7). Performance of the models generally worsened as lead time increased. In later summer months (e.g., August and September), all models were biased slightly high at 1-day lead times (all models had 0.1°C bias).

| Model reliability
Our DL and DL-DA models estimated forecast uncertainty using MCD, GEFS ensemble members, and initial condition uncertainty, resulting in 65%-82% of observations contained within 90% CIs across all sites, models, and lead times (Figure 8); the difference in reliability between the DL and DL-DA models was very small. Uncertainty estimates were mildly overconfident, with more reliable uncertainty estimates for the central 10%-30% of the CI distribution and greater overconfidence for larger CIs. Forecast reliability for differing lead times across sites did not have a clear pattern, as shorter lead times were generally more reliable for some sites (e.g., DR @ Lordville), while longer lead times were more reliable for other sites (e.g., NR @ Bridgeville).

FIGURE 7
Model forecast accuracy as a function of forecast lead time for each site in the Delaware River Basin, including bias, root mean squared error (RMSE), and continuous ranked probability score (CRPS). Forecasts were produced for the DR @ Lordville, EBDR @ Harvard, NR @ Bridgeville, WBDR @ Hale Eddy, and WBDR @ Hancock sites.

| Impact of release scenarios
Our DL-DA model predicted a decrease in maximum water temperature in response to the additional 100 cfs release added to lead time day 1 (Figure 9). The average decrease in stream temperature across all forecast issue times from June through August with a 1-day lead time was 0.11, 0.12, and 0.09°C at the WBDR @ Hale Eddy, WBDR @ Hancock, and DR @ Lordville sites, respectively. WBDR @ Hale Eddy was the closest downstream site to the Cannonsville Reservoir (14.6 km) and DR @ Lordville was the farthest downstream (45.3 km; Figure 1). Across all sites, the effect of the simulated release decreased at longer lead times. For nearly all lead times at the WBDR @ Hale Eddy and Hancock sites, the upper 90th percentile prediction was affected more by the simulated reservoir release compared to the average prediction and the lower 90th percentile prediction.

| Thermal exceedances
We examined how well our DL-DA and DL models predicted the exceedance of the thermal threshold of 23.89°C set at DR @ Lordville for all days during our forecast period. For 2 days during our forecast period, the maximum water temperature for DR @ Lordville exceeded the thermal threshold (June 29 and June 30; Figure 10a). For 1-day lead times, our DL-DA model predicted a 26.8% and 32.3% chance of exceeding the thermal threshold on June 29 and 30, respectively (Figure 10a), whereas our DL model predicted a 15.3% and 22.8% chance of exceeding the threshold for the same dates and lead time. Across the entire forecast season, the DL-DA model had a greater number of prediction instances (i.e., count of ensemble members) that were true thermal exceedances at shorter lead times and a greater number of prediction instances that were false thermal exceedances at longer lead times (Figure 10b). The DL-DA model had more true and false thermal exceedances at lead times 0-2 days compared to the DL model, and the increase in true thermal exceedances for the DL-DA model was much greater than the increase in false thermal exceedances. The difference between the models for true or false exceedance at longer lead times was small (Figure 10c).
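Computing an exceedance probability from the ensemble reduces to counting members above the threshold, as in this minimal sketch (array shapes assumed):

```python
import numpy as np

THRESHOLD_C = 23.89  # 75 degF management threshold at DR @ Lordville

def exceedance_probability(ensemble):
    """Fraction of ensemble members forecasting a daily maximum water
    temperature above the thermal threshold, per valid date.
    ensemble: (n_dates, n_members)."""
    return np.mean(ensemble > THRESHOLD_C, axis=1)
```

For example, with 3100 members, a forecast in which 831 members exceed 23.89°C would be reported to managers as a 26.8% chance of a thermal exceedance.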

| DISCUSSION
Accurate and reliable forecasts of environmental variables are becoming important tools for environmental resource management (Bradford et al., 2020;Dietze et al., 2018). Using a novel integration of DL and DA, our models produced accurate predictions of maximum water temperature 7 days into the future for forecasts issued between 2021-04-16 and 2021-09-30, with an average RMSE of 1.37°C across all sites and forecast lead times while the persistence baseline model had an average RMSE of 2.37°C. Our models also characterized uncertainty relatively well and can aid reservoir managers in decisions about when to release water from reservoirs to cool stream segments. Below we discuss the forecast performance and further development that could improve upon the DL-DA forecast accuracy and reliability.

| Forecast accuracy
Although our models produced accurate predictions overall, performance varied across sites and lead times (Figures 6 and 7). The persistence forecast performance can give insight into the predictability across sites, as better performing persistence forecasts were often collocated with better performing DL and DL-DA forecasts. For example, WBDR @ Hale Eddy and Hancock sites had the best performing persistence, DL, and DL-DA forecasts at longer lead times, indicating likely stronger autocorrelation in maximum water temperature at these sites and thus higher predictability (Figure 7). Stream temperatures at these sites are heavily influenced by the upstream Cannonsville Reservoir, and water released from the reservoir likely mutes variability in maximum water temperature, thereby increasing autocorrelation. This contrasts with the NR @ Bridgeville site, which generally had the lowest forecast performance for the persistence, DL, and DL-DA models for 0-and 1-day lead times, and the EBDR @ Harvard site, which generally had the lowest performance for 2-to 7-day lead times. The NR @ Bridgeville and EBDR @ Harvard DL models were missing potentially important drivers that influence maximum water temperature such as streamflow, precipitation, or shading. We also did not model reservoir temperature in the Neversink Reservoir above the Bridgeville site, and the lack of a pre-trainer that simulated reservoir outlet dynamics may have partially accounted for decreased performance at this site. Our forecasting models were missing these potentially important features due to the short development period prior to delivering the forecasts to inform decision-making.
However, forecasting under tight deadlines underscores the value of the iterative forecast cycle and quick feedback on model performance to help guide model development. To improve upon the accuracy of DL model forecasts for these sites, DL models could be trained across multiple sites, additional upstream information or meteorological variables could be added as drivers (Rahmani et al., 2020), the DL model structure could be modified to represent network connections across stream segments, and/or custom model loss functions could be used to ensure the model obeys the laws of thermodynamics (Read et al., 2019).
The DL-DA model predicted more true exceedances of the thermal threshold of 23.89°C than did the DL model, and only 0.23% of DL-DA prediction instances (i.e., count of ensemble members) were false exceedances across the entire forecast period for the DR @ Lordville site.
However, the DL-DA model tended to underpredict for the 2 days when observations exceeded this threshold for DR @ Lordville in 2021 (Figure 10). More prediction instances were true thermal exceedances for shorter lead times (e.g., 0- to 2-day lead times) compared to longer lead times, and a portion of the prediction instances forecasted true exceedances as far out as the 7-day lead time, which helped alert reservoir managers to potential exceedances even when the model was not yet confident in them. These longer lead times, although not as accurate as shorter lead times, were useful for the New York State Department of Environmental Conservation to plan for staffing coverage around weekends and holidays (Brenan Tarrier, New York State Department of Environmental Conservation, written commun., August 4, 2021). A higher proportion of DL-DA model prediction instances forecasted true exceedances compared to the DL model at 0- to 2-day lead times (Figure 10c). This indicates that the DA algorithm may help improve DL model forecast accuracy when forecasting threshold exceedances; however, observations exceeded this threshold on only 2 days, so more events would need to be analyzed before concluding that DA improves threshold exceedance accuracy.

| DA and DL
We found that combining DA with our DL models modestly improved forecast performance of maximum water temperature, typically at 0- and 1-day lead times, compared to our DL model without DA (Figures 7 and 10c). DA is heavily relied upon in forecasting applications using process-based models to adjust model states, parameters, and drivers prior to issuing forecasts (Lewis et al., 2021), and interest in assimilating data into DL models to make improved predictions is growing. For example, Brajard et al. (2020) used an iterative DL and DA algorithm to emulate the Lorenz 96 model using sparse and noisy observations, which can be a challenging task for DL models alone, and Fang and Shen (2020) showed that a data integration kernel within a DL model improves near-term forecasts of soil moisture. Chen et al. (2021) used an invertible neural network structure to adjust DL states after assimilating observations of stream temperature to improve predictions of water temperature downstream from reservoirs. In addition, Nearing et al. (2021) showed that DL models that ingest near-real-time observations improve forecasts of streamflow compared to models without those observations.
One of the simplest methods of ingesting near-real-time observations when making forecasts is autoregression (i.e., using lagged observations as input features). Nearing et al. (2021) demonstrated that autoregressive models outperformed models that ingested near-real-time observations using variational DA methods when forecasting streamflow and recommended improving autoregressive methods over DL-DA methods. In the current analysis, both our DL and DL-DA models used lagged observations as input features to the LSTM. In other words, both the DL and DL-DA models were autoregressive models and both utilized the most recent observations to make predictions; however, our DL-DA model also adjusted LSTM cell states when assimilating observations. The sample error covariance showed that some cell states covaried strongly with stream temperature, and updating the cell states of an LSTM may be useful for improving predictions using this method of DA (Figure 3). The LSTM-DA method that Nearing et al. (2021) developed did not have an autoregressive component and relied only on variational DA to ingest near-real-time observations. Our inclusion of LSTM state updating along with an autoregressive method may be why our DL-DA model showed modest improvement over our DL method, whereas Nearing et al. (2021) showed that their DL model outperformed their DL-DA model for forecasting streamflow.

FIGURE 10
(a) Maximum water temperature predictions from our DL-DA model for 1-day lead time and observations for the DR @ Lordville site from June 27 to July 3, 2021. Two thermal exceedance events occurred during this time, on June 29 and 30. The DL-DA model ensemble predictions for 1-day lead time are color coded by whether they correctly or incorrectly predicted exceedances or deceedances of the thermal threshold of 23.89°C. The mean of the ensemble predictions is indicated by a white bar and the maximum observed water temperature is indicated by a black point on each day. The probability of exceeding the thermal threshold for each day is indicated by a percentage at the top of the panel in gray text. (b) The number of DL-DA model prediction instances correctly or incorrectly predicting thermal exceedances and incorrectly predicting thermal deceedances as a function of forecast lead time for the DR @ Lordville site across the entire forecast period. (c) The difference in model prediction instances (DL-DA minus DL) correctly or incorrectly predicting thermal exceedances as a function of forecast lead time for the DR @ Lordville site across the entire forecast period. Positive values mean that more DL-DA ensemble members predicted a given outcome compared to the DL ensemble members.
However, we caution against concluding that our DL-DA method would have superior performance over other autoregressive-only DL models because we have a limited number of test sites and a short test period with which to evaluate our methods, and our model parameterization and target features were different compared to Nearing et al. (2021). Furthermore, loss of information occurs for any DA algorithm due to assumptions embedded in the algorithm (e.g., estimated error variance, data distributional assumptions; Nearing et al., 2018); thus, it may be beneficial to allow the DL model itself to decide how to use the new observation information. In addition, DA methods could be used to adjust model parameters instead of or in addition to model states, which may improve DL forecasts at longer lead times. Contrasting results from this current analysis and Nearing et al. (2021) indicate that more research is needed to assess under which conditions and which DA algorithms or autoregressive methods optimally combine new observations with DL model predictions to improve forecast performance.

| Forecast uncertainty
Our DL-DA model tended to be overconfident in predictions for the forecast period we analyzed. The overconfidence is likely due to overfitting models to the training dataset, which is a common problem in all types of modeling, especially machine learning. To improve upon the forecast uncertainty estimates, we may want to pretrain global DL models on many stream segments and fine-tune to individual segments, add in noise regularization to the drivers during training, and/or increase the dropout rate for the recurrent and input elements during model training. Due to the short model development period, we did not perform a thorough hyperparameter search for MCD dropout rates to optimize forecast reliability, which could potentially improve our characterization of maximum water temperature forecast uncertainty. In addition, Klotz et al. (2021) show that mixture density networks have better uncertainty characterization than MCD for streamflow predictions, indicating that adoption of these relatively newer methods is promising for improving DL forecast reliability.

| Reservoir release scenarios
Some forecast products can trigger decisions that alter the conditions being forecasted; in our use case, a forecast that predicts water temperature will exceed 23.89°C (75°F) in the future could trigger a manager to call for larger reservoir releases that would cool downstream waters to maintain cool-water fish habitat. Generating a forecast with possible scenarios that can be exposed to the decision maker can make the forecasts more actionable. Our implementation included two reservoir water release scenarios, which provided a range of expected outcomes for taking no action (conservation releases +0 cfs thermal bank release) and, alternatively, issuing a single-day preventative action (conservation releases +100 cfs release; see Figure 9). We observed a decrease in stream temperature following the reservoir release scenario, with the DR @ Lordville site having the smallest response, which was expected because DR @ Lordville was the farthest downstream from the Cannonsville Reservoir of the three sites. Although we observed an expected decrease in stream water temperature following the +100 cfs release scenario, the resulting average decrease was small for the three segments downstream from the Cannonsville Reservoir (decrease ~0.1°C). The expected average decrease in stream water temperature in response to a +100 cfs release during this time period is ~0.3-0.4°C for the sites downstream from Cannonsville Reservoir (Brenan Tarrier, New York State Department of Environmental Conservation, written commun., November 3, 2021). Currently, the model does not explicitly have information that would help it distinguish between increases in release volumes that are coming from the surface of the reservoir (through the spillway) versus the bottom of the reservoir (via conservation or directed releases), nor is it exposed to flow changes related to precipitation events.
We speculate that adding in stream discharge and/or precipitation as an additional model driver may help the DL model learn about the effects of reservoir releases under various release conditions. In addition, we could expose the DL model to more pretraining scenarios that could help the model learn about the effect of reservoir releases on downstream temperature, potentially increasing the effect of reservoir release on changes in stream temperature.
Future efforts could extend this simple +0 cfs/+100 cfs case by exposing a greater variety of scenarios to reservoir release decision makers, including additional release volumes and their future timing (e.g., beyond the day-of-issue). Alternatively, the complete modeling system could be provided to the end-users where model inputs could be fully controlled by the decision makers. The combination of real-time data and modern DL modeling advances described here can provide timely and accurate information for critical water resources decisions.

| Summary
We demonstrate the utility of combining DA with DL models for forecasting management-relevant environmental variables. Both our DL and DL-DA models produced accurate and reasonably reliable forecasts of maximum water temperature 7 days into the future. This enabled decisions to be made about releasing water from New York City drinking water reservoirs to mitigate temperature exceedances downstream as outlined by the multi-state Flexible Flow Management Program. Future improvements could include increasing the complexity of the underlying DL models, such as adding additional upstream information or sub-daily meteorological variables as drivers and modifying the DL model structure to represent network connections across stream segments. Also, we have not yet explored all possible ways to combine DA with DL; more research would be useful to find the combinations that most improve forecast accuracy and reliability at multiple forecast horizons.
Given the flexibility of DL models, we expect future extensions and improvements of this type of framework could be used for many types of environmental forecasting applications.

ACKNOWLEDGMENTS
All model training was run on USGS High-Performance Computing. We thank Brenan Tarrier and Kendra Russell for their input on management-relevant forecast target variables and locations, Gray Nearing for valuable input on the forecasting model design decisions, Tadgh Moore for sharing his experience and insights on reservoir modeling, and the USGS VizLab team for figure design inspiration. We thank Reza Abdi, James McCreight, and anonymous reviewers for valuable feedback when reviewing earlier drafts of this manuscript. Any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government.

CONFLICT OF INTEREST
The authors declare no conflicts of interest.

DATA AVAILABILITY STATEMENT
This submission uses novel code to create pretraining process-based datasets, train the deep learning models, and forecast with the trained deep learning models, which can be found at Zwart et al. (2022b). All data for model drivers and predictions used in this analysis can be found at Oliver et al. (2021) and full forecast output are published at Zwart et al. (2022a).