Exploring the exceptional performance of a deep learning stream temperature model and the value of streamflow data

Stream water temperature (T_s) is a variable of critical importance for aquatic ecosystem health. T_s is strongly affected by groundwater-surface water interactions, which can in principle be learned from streamflow records, but such information has previously been difficult for process-based models to absorb due to parameter equifinality. Based on the long short-term memory (LSTM) deep learning architecture, we developed a basin-centric lumped daily mean T_s model, which was trained over 118 data-rich basins with no major dams in the conterminous United States and showed strong results. At a national scale, we obtained a median root-mean-square error of 0.69°C, Nash-Sutcliffe model efficiency coefficient of 0.985, and correlation of 0.994, which are marked improvements over previous values reported in the literature. The addition of streamflow observations as a model input strongly elevated the performance of this model. In the absence of measured streamflow, we showed that a two-stage approach could be used, in which streamflow simulated by a pre-trained LSTM model (Q_sim) still benefited the T_s model even though no new information was brought directly into its inputs. The model indirectly used information learned from the streamflow observations provided during the training of Q_sim, potentially to improve its internal representation of physically meaningful variables. Our results indicate that strong relationships exist between basin-averaged forcing variables, catchment attributes, and T_s that can be simulated by a single model trained on data at the continental scale.


Introduction
Stream water temperature (T_s) is a critical, decision-relevant variable that controls numerous physical, chemical, and biological processes and properties, including dissolved oxygen concentrations and nutrient transformation rates, as well as industrial processes such as cooling power plants and treating drinking water (Delpla et al 2009, Kaushal et al 2010, Madden et al 2013). Thermal regimes of streams directly affect aquatic species (Justice et al 2017), and in some cases fish mortality rates increase as T_s passes a certain threshold (Marcogliese 2001, Martins et al 2012). These thermal regimes are complicated by industrial water uses, such as withdrawing stream water for cooling systems, which cause thermal pollution downstream (Raptis et al 2016). Fulfilling the temperature requirements of the environment, agriculture, industry, and municipalities, and coordinating these uses, requires a delicate balance. Accurate T_s models can inform the decision-making process and help lower the risks of exceeding thermal thresholds.
Myriad basin and in-stream/near-stream processes govern T_s (Poole and Berman 2000). The heat balance of a basin, as modulated by land use types (Borman and Larson 2003, Moore et al 2005, Nelson and Palmer 2007), is a primary control on T_s. At the basin scale, snowmelt and groundwater baseflow contributions (Kelleher et al 2012) are also important factors due to their sharp contrast in temperature with air. In streams, T_s is influenced by solar radiation, latent heat flux, air-water heat exchange, riparian vegetation (Theurer et al 1985, Garner et al 2017), channel geomorphology (Hawkins et al 1997), hyporheic exchange (Evans and Petts 1997), and reservoirs and industrial discharges (Poff et al 2010). At any point in the channel network, T_s is the spatiotemporal integration of all of the above processes. Process-based models, while offering physical explanations of causes and effects, must embrace substantial complexity to represent all or even parts of these processes, with their heterogeneity and scaling effects (Johnson et al 2020). The input data requirements also make scaling up such simulations challenging. Some large-scale process-based models have reported root-mean-square error (RMSE) values greater than 2.5°C (van Vliet et al 2012, Wanders et al 2019). A large body of literature has employed statistical models to simulate T_s, with good summaries given by Benyahya et al (2007) and Gallice et al (2015). Typically, T_s was regressed against air temperature (T_a), but more recent studies regressed the parameters of T_s-T_a relationships on catchment attributes. Among these studies, and most relevant to our work, Segura et al (2015) predicted the slope and intercept of the 7-day average T_s-T_a relationship based on catchment characteristics such as watershed area and baseflow index.
A Nash-Sutcliffe coefficient of 0.78 was obtained for reference sites for the 7-day average T_s, and strong hysteresis was noted in the stream-air temperature relationship (Segura et al 2015). Stewart et al (2015) integrated an artificial neural network with a soil water balance model and obtained an RMSE of around 1.5°C and R² of 0.76 for 371 sites across Wisconsin. Very recently, Johnson et al (2020) used sine-wave linear regression and reported RMSE values of 1.41°C for an extensive regional dataset and 1.85°C for the national-scale U.S. Geological Survey (USGS) dataset; they highlighted the importance of spatial scales and heterogeneity. Graf et al (2019) used wavelet transformations of daily average air temperature as inputs to an artificial neural network to predict water temperature at eight different sites in Poland and obtained RMSE values ranging from 0.98°C to 1.43°C in the test period. Additionally, several studies used recent water temperature data as an input to predict stream temperature, which can significantly increase model performance as it can be interpreted as a form of data assimilation (Feng et al 2020). Sohrabi et al (2017) obtained an RMSE of ∼1.25°C when they used the previous day's temperature and streamflow as drivers. On one temperature gauge, Stajkowski et al (2020) obtained an RMSE of 0.76°C using a variant of long short-term memory (LSTM) with the previous hour's stream temperature included in the inputs. This differs from the present work in that our purpose was to provide long-term projections, and so we did not use recent measurements as inputs.
Streamflow conditions are often not utilized by statistical models of T_s, partially because the relationships are not clear; none of the abovementioned statistical models used streamflow. However, we do know that streamflow exerts substantial control on T_s. Rivers dominated by baseflow are typically fairly stable and cool in summer, but they have relatively low thermal capacity and are rapidly heated by strong solar radiation. Peak flows are typically dominated by surface runoff, the temperature of which is strongly influenced by the fast-changing air temperature (Edinger et al 1968). A sensitivity analysis on large river basins found that, on average, a decrease in river flow of 50% relative to a reference condition lowered the minimum annual river discharge temperature (in winter) by 0.4°C or raised the maximum temperature (in summer) by 1.2°C (van Vliet et al 2012). In another study on one specific river, a linear regression model based on monthly air temperature and streamflow data revealed that air temperature rise and flow reduction were responsible for 60% and 40% of June-to-August temperature increases, respectively (Moatar and Gailhard 2006).
Recently, deep learning (DL) models, including those based on the LSTM algorithm, have shown promise in predicting hydrologic variables such as soil moisture and streamflow, achieving superior results with low computational and human effort (Fang and Shen 2020, Feng et al 2020). LSTM can learn long-term dependencies and has delivered high performance for streamflow prediction in snow-dominated regions (Feng et al 2020). The memory mechanisms of LSTM may be able to mimic heat units, similar to heat accumulation and release processes. Thus, it is natural to think that LSTM may also be suitable for T_s modeling. However, given the complicated and scale-dependent processes influencing T_s, it is highly uncertain whether there even is a stable relationship between basin-average forcing inputs and T_s across different spatial scales, and if so, whether such a relationship can be captured by LSTM given limited observational data.
One advantage of a DL model over process-based ones is that it can incorporate auxiliary information without requiring an explicit understanding of relationships. In our case, streamflow not only directly influences T_s fluctuations; it also reveals multi-faceted hydrologic dynamics in a basin, such as baseflow contributions and residence times of surface runoff, which could aid T_s modeling. Therefore, we expected adding streamflow information to improve model performance. In a process-based modeling framework, streamflow data can be used to calibrate the hydrologic components. Unfortunately, due to the issue of model equifinality (Beven and Freer 2001, Beven 2006), calibration may or may not improve a model's internal dynamics, depending on its parameterization, structure, and the information content of the data (Huang and Liang 2006). Within the DL framework, we hypothesize that models may be able to automatically extract information from hydrographs to inform T_s; to our knowledge, no study has examined this in the context of DL models.
Even if streamflow data are indeed useful, real-world use can be hampered by a lack of available streamflow data. Beyond existing stations, collecting new streamflow data is more expensive than collecting new T_s data. However, given that highly accurate LSTM-based streamflow models have been reported (Feng et al 2020), we wondered whether a well-trained LSTM streamflow model could serve as a surrogate for actual measurements. Deep networks are known to maximally use available information, so it was not clear whether using such a streamflow model as an input to another DL model would present any benefit; the streamflow model used forcing data identical to those already used by our temperature model and would not explicitly bring any new information into the inputs.
In this work, we attempted to answer two main research questions and improve our understanding of the stream heat balance: (a) Are there reliable relationships between T_s and basin-average meteorological forcings or attributes that could be learned by deep networks to predict T_s with high accuracy? (b) Can observed or simulated streamflow be used to improve temperature predictions, especially when the simulated streamflow is predicted using the same information as the T_s model?

Methods
We simulated T_s from a basin perspective, that is, as a function of basin-average climate forcings and attributes. This setup greatly simplified the model representation compared to spatially explicit models and was supported by widely available data, but it ignored some local channel characteristics. We examined the effect of including daily streamflow in the inputs to assess its information content for T_s.

Datasets
Basin characteristics came from the Geospatial Attributes of Gages for Evaluating Streamflow dataset version II (GAGES-II), which represents geological aspects, land cover, reservoir information, and air temperature data for basins across the conterminous United States (CONUS) (Falcone 2011). Historical daily mean T_s data were downloaded from the USGS National Water Information System (NWIS) website for all 9322 basins in GAGES-II (USGS 2016). We obtained daily meteorological forcing data (e.g. precipitation, maximum and minimum air temperature, vapor pressure, solar radiation) by interpolating a gridded meteorological dataset (Daymet). Many of the GAGES-II basins did not have T_s observations recorded for all days of the year, with unobserved days being more common during the winter (there were nonetheless many sites with winter data, and our resulting model predicts temperature for all days of the year). For this work, we selected temperature gauges with more than 60% of daily observations available between 2010/10/01 and 2014/09/30, in basins with no major dams (over 50 ft in height, or more than 5000 acre-feet in storage, as defined in GAGES-II), resulting in a dataset of 118 basins ranging in size from 2 to 14 000 km². We simulated T_s at the pour point of each basin, where the USGS streamgage was located. Limiting this analysis to sites with >60% data coverage allowed us to focus on the capabilities of LSTM for T_s modeling under relatively ideal conditions. Future research on the effects of reservoir presence and data availability could perhaps further improve stream temperature predictions, but such a focus was outside the scope of our work here.

LSTM-based models for predicting T_s
We used the long short-term memory (LSTM) algorithm, which has received increasing attention in the hydrologic literature. This method is designed to learn and keep information over long periods using units called memory cells and gates: cells store the information, and gates decide which information enters and leaves the cells. Because the basic LSTM architecture has been described extensively elsewhere, we refer readers to those papers for a more detailed discussion of its equations and structure (Hochreiter and Schmidhuber 1997, Fang et al 2017, 2019), although a sketch and equations for this model are provided in figure S1. We standardized all input and target values. As a preprocessing step, streamflow was first divided by basin area and mean annual precipitation to obtain a dimensionless streamflow, which was then transformed to a new, more Gaussian distribution (Feng et al 2020): v* = log10(√v + 0.1), where v* and v are the variables after and before transformation, respectively. Next, the transformed streamflow data, along with all other meteorological forcing data, basin characteristics, and T_s observations, were standardized by the following formula (Feng et al 2020): x_i,new = (x_i − x̄)/σ, in which x_i,new is the standardized value, x_i is the raw value, x̄ is the mean of the variable, and σ is its standard deviation. This standardization better conditions the model for gradient descent and also forces the model to pay roughly equal attention to both large wet basins and small dry basins (Feng et al 2020). All results in this study are shown after destandardization, i.e. reversal of all standardization procedures applied to the model outputs.
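These preprocessing steps can be sketched in a few lines of Python (a minimal sketch; the log-sqrt form of the Gaussianizing transform follows our reading of Feng et al (2020), and the example values are illustrative, not data from this study):

```python
import numpy as np

def gaussianize(v):
    """Map skewed, non-negative dimensionless streamflow toward a more
    Gaussian shape. The log-sqrt form is an assumption after Feng et al (2020)."""
    return np.log10(np.sqrt(v) + 0.1)

def standardize(x):
    """Standardize one variable to zero mean and unit variance."""
    mean, std = np.mean(x), np.std(x)
    return (x - mean) / std, mean, std

def destandardize(x_std, mean, std):
    """Reverse the standardization applied to model outputs."""
    return x_std * std + mean

# Dimensionless streamflow: Q divided by basin area and mean annual precipitation.
q = np.array([0.0, 0.5, 2.0, 10.0])
q_std, q_mean, q_sigma = standardize(gaussianize(q))
```

Keeping the per-variable means and standard deviations, as above, is what allows all reported results to be destandardized back to physical units.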
Hyperparameters were chosen by running multiple tests and are listed in table S2. The loss function the model minimized was root-mean-square error (RMSE); we also report an unbiased RMSE, i.e. the RMSE computed after removing the mean bias, as well as the bias (mean error) and the Nash-Sutcliffe efficiency coefficient (NSE, equation in supporting information) (Nash and Sutcliffe 1970) for the test periods, for comparison with other studies. Further, because a model that simply copies air temperature may give relatively acceptable metrics, we also report the Nash-Sutcliffe coefficient calculated for the residual temperature (NSE_res), the difference between daily mean water and air temperatures: T_res = T_s − T_a. To provide a baseline for comparison, we also compared against a locally fitted autoregressive model with exogenous variables (ARX2). The ARX2 inputs contained current and one-day-delayed atmospheric forcings (X) and ARX2-simulated stream temperature over the previous 2 days: T_s^(t,*) = Σ_{i=1}^{p} a_i X_i^t + Σ_{i=1}^{p} b_i X_i^(t−1) + c_1 T_s^(t−1,*) + c_2 T_s^(t−2,*), where a, b, and c are fitted coefficients, T_s^(t,*) is the stream temperature simulated by this model at time step t, and p is the number of forcings. All temperature models were trained on data from 2010/10/01 to 2014/09/30 and tested from 2014/10/01 to 2016/09/30.
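As an illustration, such an ARX baseline can be fitted by ordinary least squares; the Python sketch below is a hypothetical implementation (the exact regressor layout, intercept term, and fitting procedure are our assumptions, not the paper's specification):

```python
import numpy as np

def fit_arx2(X, ts):
    """Least-squares fit of
    T[t] = a.X[t] + b.X[t-1] + c1*T[t-1] + c2*T[t-2] + intercept.
    X: (n, p) array of forcings; ts: (n,) observed stream temperature."""
    rows = [np.concatenate([X[t], X[t - 1], [ts[t - 1], ts[t - 2], 1.0]])
            for t in range(2, len(ts))]
    coef, *_ = np.linalg.lstsq(np.array(rows), ts[2:], rcond=None)
    return coef

def simulate_arx2(coef, X, ts_init):
    """Roll the fitted model forward, feeding back its own predictions
    (as needed for long-term projection without recent measurements)."""
    sim = list(ts_init)  # two observed values seed the recursion
    for t in range(2, len(X)):
        feats = np.concatenate([X[t], X[t - 1], [sim[t - 1], sim[t - 2], 1.0]])
        sim.append(float(feats @ coef))
    return np.array(sim)
```

Because the simulation feeds back its own predictions rather than observations, this baseline, like the LSTM models here, performs projection rather than data assimilation.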

Streamflow observations or simulations as model inputs
Across USGS gages, streamflow is a more widely available measurement than temperature, suggesting that the inclusion of streamflow data could bring in additional information. To test this, the following models were trained: T_s = LSTM_obsQ(F, A_T, Q_obs); T_s = LSTM_noQ(F, A_T); and T_s = LSTM_simQ(F, A_T, Q_sim), where F is the forcing data time series, A_T represents static, single-valued attributes of the basin for temperature modeling, Q_obs is the observed time series of daily mean streamflow, and Q_sim is simulated streamflow (described below). LSTM_obsQ, LSTM_noQ, and LSTM_simQ are LSTM-based models incorporating observed streamflow, no streamflow information, and simulated streamflow, respectively. For Q_sim, streamflow was simulated using an LSTM-based streamflow model shown to have very good performance (Feng et al 2020): Q_sim = LSTM_Q(F, A_Q), where A_Q represents the static attributes of the basins used for streamflow modeling (table S1 in supporting information). The meteorological forcing data used for these simulations were the same as for the temperature prediction models. The Q_sim model was trained using observations from 2397 basins and over a longer training period (2004/10/01 to 2014/09/30) than the temperature models in this study.
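In code, the three variants differ only in the input stack supplied to the network; a hypothetical helper (assuming the common practice of repeating static attributes along the time axis, which the paper does not spell out) might look like:

```python
import numpy as np

def build_inputs(forcings, attrs, q=None):
    """Assemble per-timestep LSTM inputs: dynamic forcings concatenated with
    static basin attributes repeated along the time axis, plus an optional
    streamflow column (observed for LSTM_obsQ, simulated for LSTM_simQ).
    forcings: (n_t, n_f); attrs: (n_a,); q: (n_t,) or None."""
    n_t = forcings.shape[0]
    cols = [forcings, np.tile(attrs, (n_t, 1))]
    if q is not None:
        cols.append(np.asarray(q).reshape(-1, 1))  # one extra input channel
    return np.concatenate(cols, axis=1)
```

LSTM_noQ then sees an (n_t, n_f + n_a) input matrix, while the streamflow-informed variants see exactly one additional column.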

Overall results
All LSTM-based models delivered exceptionally strong performance in the test period (figure 1). Across the conterminous United States (CONUS), the median test-period RMSE for the model incorporating streamflow observations (LSTM_obsQ) was 0.69°C. The RMSE for the model incorporating simulated streamflow (LSTM_simQ) was 0.81°C, which was still lower than that of the model lacking any streamflow information (LSTM_noQ), at 0.86°C. The corresponding median NSE values were 0.986, 0.983, and 0.979, respectively, and all of the correlation values were above 0.992, indicating that temporal fluctuations were extremely well captured. These metrics are markedly better than those reported in the literature at this scale, which demonstrates that LSTM is particularly well-suited for T_s modeling at basin outlets. In general, the LSTM-based models performed much better than ARX2, which had a median RMSE of 1.41°C. Moreover, when we evaluated T_res, the locally fitted ARX2 model's median NSE_res worsened substantially to 0.772, indicating that a substantial portion (although not all) of ARX2's predictive power came from air temperature and some memory (linear regression performed worse than ARX2; not shown). In comparison, the LSTM-based models were much less affected: the median NSE_res values were above 0.950 and 0.924 for LSTM_obsQ and LSTM_noQ, respectively. The LSTM models captured most fluctuations unaccounted for by seasonality and were able to capture more complicated memory effects than the simple linear autocorrelation used in ARX2. These temporal fluctuations could have been induced by heat storages in the basin (vegetation, snow, soil, groundwater, riparian zone, urban areas) causing delayed responses to atmospheric forcings.
LSTM_obsQ generally performed better in the eastern CONUS than in the western half, and better in the northern half than in the southern half (figures 2(a) and (b)). Most of the eastern basins had NSE values above 0.975 and RMSE values below 0.9°C. Northern basins had slightly higher NSE values, presumably because in colder basins the minimum winter liquid T_s is confined to above 0°C and is therefore easier to predict. Existing statistical models often have difficulty with northern basins, where air temperature and water temperature are decoupled. LSTM has a long memory to keep track of seasonal snow states and can learn threshold-like functions; hence it is quite useful where existing models often have deficiencies. Sites with large RMSE values were scattered across the geographic extent of the CONUS, with no clear, explainable patterns. Regardless, the NSE values for most of these 'difficult' basins were still quite high, with only two stations out of the 118 in our dataset having NSE values under 0.9.

Figure 1. CONUS-scale aggregated metrics of stream temperature models for the test period. LSTM_obsQ incorporated observed streamflow, LSTM_noQ had no input streamflow information, and LSTM_simQ incorporated simulated streamflow (Q_sim). ARX2 is a locally fitted auto-regressive model with extra inputs. The lower whisker, lower box edge, center bar, upper box edge, and upper whisker represent the 5%, 25%, 50%, 75%, and 95% quantiles of the data, respectively.

Impacts of observed and simulated streamflow as inputs
Providing streamflow as an input to the T_s model generally improved model accuracy, and the effects were most pronounced for the poorly simulated sites. The models incorporating either observed (LSTM_obsQ) or simulated (LSTM_simQ) streamflow improved the median bias (reducing its absolute value by 0.120°C and 0.062°C, respectively) and the RMSE (by 0.170°C and 0.049°C) compared to the model lacking streamflow (LSTM_noQ) (figure 1). Including streamflow information helped both to reduce bias and to greatly improve the representation of temporal fluctuations, especially for the worse-performing sites. Without the streamflow data, ten sites had NSE values below 0.9. Additionally, LSTM_noQ had a median bias of around −0.25°C, while the median bias of LSTM_obsQ was much closer to 0°C. The inclusion of observed streamflow also greatly reduced the overall error range, providing the largest improvements in model performance at the most troublesome sites.
The model incorporating simulated streamflow (LSTM_simQ) generally performed between LSTM_obsQ and LSTM_noQ. Similar to LSTM_obsQ, LSTM_simQ noticeably improved accuracy and reduced the spread of bias (decreasing the error range, as shown by compressed whiskers and outliers compared to LSTM_noQ), but it did not help as much with the median bias. Understandably, simulated streamflow had more errors than actual observations, as the input attributes (A_Q) do not fully characterize a basin. While Q_sim offered state-of-the-art performance, it still encountered more errors in estimating peaks (mainly due to rainfall inputs) and baseflows, especially in the western CONUS (possibly due to inadequate geological information) (Feng et al 2020).
The negative biases of LSTM_noQ were attributable to underestimating T_s peaks in both winter and summer at some sites (e.g. figure 3(a)) and to a more consistent bias at other sites (e.g. figure 3(b)). T_s peaks are often associated with streamflow peaks (possibly caused by warm rain) in the winter but with after-storm recession limbs in the summer. For the Black River in Ohio (figure 3(a)), T_s peaks coincided with recession periods between storms in summer 2015 (annotated points A and B). The T_s simulated by LSTM_noQ did not rise as high as the observed T_s, possibly because LSTM_noQ's internal representation of baseflow was overestimated here, while LSTM_obsQ captured the peaks well. For the South Fork Sultan River in Washington (figure 3(b)), there was a more prominent year-round bias in the temperature predictions for 2015, concurrent with an overestimation of baseflow in Q_sim. This underestimation could potentially be due to multi-year accumulation and melt of snowpacks, as this basin typically has a long snow season, sometimes lasting the whole year, but in summer 2015 all snow had melted by June (verified via Google Earth).
Several reasons could explain why observed streamflow helped the model, but to think them through we first need to assume that the LSTM model has internal representations of physically relevant quantities such as water depth, snowmelt, water temperature, net heat flux, and baseflow temperature. Other studies have shown that LSTM can learn to use its cell memory states to represent intermediate hydrologic variables that were never matched to observations, e.g. snow cover (Jiang et al 2020). Given this assumption, it is possible that observed or simulated streamflow corrects the internal 'water depth' variable used to estimate the effect of net heat flux. From the energy balance equation, stream temperature changes are estimated by dividing the net heat flux by the flow depth. If streamflow is overestimated during summer baseflow periods, the positive heat flux is vertically diluted too much, and thus the temperature rise is underestimated. Secondly, as suggested by figure 3(b), the model may not be able to accurately keep track of long-term snow accumulation and melt, causing LSTM_noQ to misjudge the amount of cool snowmelt water; LSTM_obsQ, informed by observed flow, corrected this error. In fact, the basins with the lowest NSE values were concentrated in the Rocky Mountains region, which has long snow seasons (figure 2(c)). Thirdly, LSTM-based T_s models may have learned other holistic hydrologic information from the streamflow time series. For example, they may have learned to perform baseflow separation internally, if such a feature was helpful for T_s prediction. Streamflow data may provide more clues to reduce uncertainties; for example, cool summer temperatures could be due to high baseflow or to abundant riparian shade, and streamflow data may make it easier to distinguish between these causes.
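The depth-dilution argument can be made concrete with a back-of-the-envelope calculation for a well-mixed water column, where dT/dt = H_net/(ρ c_w d); the constants and heat flux below are illustrative textbook values, not quantities from this study:

```python
RHO_W = 1000.0          # density of water, kg m^-3
C_W = 4184.0            # specific heat of water, J kg^-1 K^-1
SECONDS_PER_DAY = 86400.0

def daily_warming(net_flux_w_m2, depth_m):
    """Daily temperature rise (degC) of a well-mixed column of given depth."""
    return net_flux_w_m2 * SECONDS_PER_DAY / (RHO_W * C_W * depth_m)

# If baseflow (hence depth) is overestimated 2x (0.6 m instead of 0.3 m),
# the warming predicted from the same 150 W m^-2 heat flux is halved,
# underestimating the summer temperature rise.
shallow = daily_warming(150.0, 0.3)
deep = daily_warming(150.0, 0.6)
```

This sketch only illustrates the proposed mechanism; the LSTM's actual internal representation of depth, if any, is not directly observable.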

Further discussion
LSTM, with its hidden states (100 hidden units here) that store system states evolving at different rates, is extremely well-suited to modeling systems with memory and hysteresis. The warming and cooling of water storage compartments (soil water, groundwater, riparian zones, etc) are driven by different mechanisms with different rates, durations, and lags relative to drivers (accumulation, flushing by storms, etc). Such multi-rate exchanges, along with diffusive exchanges with soil, easily lead to hysteresis in the system (Briggs et al 2014). We suspect that the internal states and gates of LSTM-based models mimic the buffering and delaying effects of these heat (and water) storage compartments and can be sufficiently trained with 4 years of data, as was done in this study.
Streamflow data may have carried multifaceted, temperature-relevant information about stream depth, basin hydrologic properties, and the relative influence of flow versus other heat-moderating processes. Even simulated streamflow provided valuable new information to the temperature model, despite the fact that Q_sim ingested meteorological forcing data identical to that of the T_s models. We posit that the pre-trained Q_sim model may have derived some of this new information from the additional catchment attributes in A_Q relative to A_T (table S1 in supporting information), but that the majority came from the 10 years of streamflow observations across the 2397 stations on which Q_sim was trained. We therefore hypothesize that Q_sim was able to learn and transfer a wealth of nuanced information about each basin's hydrologic properties and responses to meteorological drivers, which in turn likely improved the implicit representations of those attributes in the T_s models.
As the first LSTM application to stream temperature, this work focused on temporal prediction for basins with a good record of historical data and, as such, may not generalize well to ungauged basins. It is well known that spatial extrapolation of stream temperature models can be quite risky (Gallice et al 2015), a problem that merits further work. Also warranting further investigation is the representation of spatial heterogeneity at smaller scales, e.g. using a multiscale graph network or calibrating the parameters of a spatially distributed process-based model.

Conclusions
This is the first time a basin-centric lumped T_s model has been shown to be so effective. The results clearly indicate that robust (but complex) mapping relationships exist between basin-averaged attributes, climate forcings, and T_s, and that they can be reliably learned by a uniform, continental-scale model using a few years' worth of daily T_s observations. All models presented exceptional evaluation metrics that outperformed state-of-the-art models reported in the literature by a substantial margin. Additionally, this performance was achieved without detailed representations of the subsurface or the channel network, a convenience that promises high-quality forecasts of future T_s given available climate forcings.
Our use of a basin-centric lumped model to predict T_s allowed for great simplification, which potentially enabled LSTM to learn the connections between the different factors influencing T_s, alongside the more obvious benefits of simplifying model assembly and training. The disadvantage of this basin-centric formulation is that it assumes each basin is homogeneous in forcings and attributes. The homogeneity assumption fundamentally limits the size of basin that can be simulated: when predictions are needed for larger, mainstem rivers, reach-centric models will be required. Therefore, while the current model is highly capable and useful, we do not regard its present form as complete in functionality.
Our results show that observed streamflow information helped to improve the modeling of T_s, perhaps because the observations (a) allowed the model to better resolve groundwater and snowmelt contributions, and (b) provided a more accurate water volume for estimating the effect of net heat fluxes, especially during recession periods when T_s changed rapidly. The benefits were most substantial in basins with multi-year snow accumulation. If streamflow observations do not exist for a basin where temperature prediction is desired, our results show that a well-trained continental-scale streamflow model can indirectly bring in data from a larger training dataset; in this study, doing so alleviated more than half of the degradation in median NSE that would otherwise have resulted from the lack of streamflow observations.

Data availability statement
Data supporting the findings of this study are openly available at the following URL/DOI: 10.5066/P97CGHZH.

Acknowledgments
FR was supported by the Pennsylvania Water Resources Research Center graduate internship G19AC00425, with funding for that fellowship and AA and SO provided by the Integrated Water Prediction Program at the US Geological Survey. CS was supported by National Science Foundation Award OAC #1940190. Data sources have been cited in the paper, and all model inputs, outputs, and code are archived in a data release (Rahmani et al 2020). The LSTM code for modeling streamflow is available at https://github.com/mhpi/hydroDL. CS and KL have financial interests in HydroSapient, Inc., a company which could potentially benefit from the results of this research. This interest has been reviewed by the University in accordance with its Individual Conflict of Interest policy, for the purpose of maintaining the objectivity and the integrity of research at The Pennsylvania State University. Any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the US Government.