Using Machine Learning to Identify Hydrologic Signatures With an Encoder–Decoder Framework

Hydrologic signatures are quantitative metrics that describe a streamflow time series. Examples include annual maximum flow, baseflow index and recession shape descriptors. In this paper, we use machine learning (ML) to learn encodings that are optimal ML equivalents of hydrologic signatures, and that are derived directly from the data. We compare the learned signatures to classical signatures, interpret their meaning, and use them to build rainfall‐runoff models in otherwise ungauged watersheds. Our model has an encoder–decoder structure. The encoder is a convolutional neural net mapping historical flow and climate data to a low‐dimensional vector encoding, analogous to hydrological signatures. The decoder structure includes stores and fluxes similar to a classical hydrologic model. For each timestep, the decoder uses current climate data, watershed attributes and the encoding to predict coefficients that distribute precipitation between stores and store outflow coefficients. The model is trained end‐to‐end on the U.S. CAMELS watershed data set to minimize streamflow error. We show that learned signatures can extract new information from streamflow series, because using learned signatures as input to the process‐informed model improves prediction accuracy over benchmark configurations that use classical signatures or no signatures. We interpret learned signatures by correlation with classical signatures, and by using sensitivity analysis to assess their impact on modeled store dynamics. Learned signatures are spatially correlated and relate to streamflow dynamics including seasonality, high and low extremes, baseflow and recessions. We conclude that process‐informed ML models and other applications using hydrologic signatures may benefit from replacing expert‐selected signatures with learned signatures.

Long Short-Term Memory (LSTM) networks are a popular ML model structure that processes time series data sequentially and allows for storing and forgetting of information. These models have been popularized by the neuralhydrology Python library (Kratzert et al., 2022). Using an LSTM in a joint learning framework across many watersheds improves prediction accuracy (Kratzert, Klotz, Shalev, et al., 2019), and in K-fold validation where the LSTM was tested in ungauged mode it outperformed a locally calibrated traditional model (Kratzert, Klotz, Herrnegger, et al., 2019). These successes across large regions demonstrate how ML techniques can extract and extrapolate general relationships from large samples of data, in a way that traditionally structured models have failed to do.
More recently, hydrologists have developed differentiable, parameter-learning (dPL) models that use an LSTM to predict static and dynamic parameters of an HBV-like process-based model (Feng et al., 2022). dPL models are close to matching state-of-the-art LSTM performance, and allow users to interpret and simulate unobserved variables as in a classical model. Deep learning models can estimate the parameters of a global process-based model, improving calibration speed and spatial coherence of parameters over classical calibration (Tsai et al., 2021). These advances merge the interpretability and physical principles of classical hydrology with the cross-watershed information transfer and predictive power of machine learning (Frame et al., 2022; Hoedt et al., 2021; Razavi et al., 2022).
Instead of using a single ML model to predict runoff from rainfall, an alternative approach in multi-step-ahead flood forecasting applications is to use a pair of ML models. The first model uses meteorological data to predict the current catchment state (i.e., catchment wetness), and the second model uses this current state, plus the new meteorological data, to forecast flow. Various structures have been proposed for the paired models, including LSTMs (e.g., Kao et al., 2020) and Convolutional Neural Networks (CNNs; Lin et al., 2020; Van et al., 2020). CNNs use a sliding window that passes across the input series to extract relevant patterns and are widely used to extract features from time series (Wang et al., 2017). Paired ML model structures are known as encoder-decoder systems, with the first model being the encoder and the second model being the decoder. The encoding refers to the intermediate state that is calculated by the encoder and then used by the decoder to make the prediction. In the examples above, the encoding is a representation of the catchment wetness. However, encodings can represent any useful properties of the input signal (Cho et al., 2014).
In this paper, we propose a novel use of an encoder-decoder system in hydrology, to learn encodings that are optimal ML equivalents of hydrologic signatures, and that are derived directly from the data. In our framework, the encoder is a CNN that learns the signatures (encodings) from historical streamflow and meteorological time series. The decoder is a differentiable, process-informed ML model that uses the learned signatures and current meteorological data to predict streamflow. By jointly training the encoder (to learn the signatures) and decoder (to predict streamflow), we ensure that the learned signatures are optimized to increase the model's predictive power. Our aim is to use ML to learn optimized signatures and compare them to classical signatures. We will interpret the physical meaning of the new, learned signatures by correlation with classical signatures and by investigating their effects on the stores and fluxes in the decoder model. We will determine whether the learned signatures can extract new information from streamflow series by testing whether they improve decoder model performance over control scenarios that use classical signatures or no signatures.

Data
We used the U.S. CAMELS data set to conduct our analysis (Addor et al., 2017;Newman et al., 2015). This data set is widely used in hydrology machine learning studies, enabling easy performance comparison. CAMELS includes 671 watersheds in the contiguous U.S. that are minimally impacted by human activities, with areas from 4 to 25,000 km². For each watershed, CAMELS provides continuous daily meteorological and streamflow data for 35 years from 1980 to 2014. Streamflow records are from U.S. Geological Survey gauges; meteorological forcing is from Daymet and includes precipitation, temperature maximum and minimum, snow water equivalent, solar radiation, vapor pressure, and day length (Thornton et al., 2012). CAMELS provides 30 static indices describing the topography, land cover, soil and geological characteristics of each watershed, and 13 pre-calculated hydrological signatures describing the flow characteristics. Two of the pre-calculated signatures use precipitation as well as streamflow data (runoff ratio, streamflow elasticity). Of the static indices, we removed categorical indices and near-duplicates, leaving 19. Of the hydrologic signatures, we removed the slope_fdc (slope of the flow duration curve) signature, which is known to contain errors. Of the watersheds, we removed Gauge number 03281100 for which no signatures were available. For comparison with learned signatures, in addition to the CAMELS signatures we also used a catalog of signatures that target baseflow and overland flow processes and are available via the TOSSH Matlab toolbox (Gnann et al., 2021a). All signatures used are listed in Table 1.

Encoder-Decoder Networks
An encoder-decoder is a pair of neural networks that can automatically extract salient information from high-dimensional input data. The first network is the encoder, which maps the input to a lower-dimensional encoding vector. The second network is the decoder, which uses the encoding as an input and performs some downstream task. The encoder and decoder are trained end-to-end to perform the downstream task, and in the process, the encoder learns to extract the information from the input that maximizes the decoder's performance at the downstream task.
Encoder-decoder frameworks are widely used, for example, in machine translation to find simple representations of the meaning of sentences, and in computer vision to describe the appearance of a patch of an image that is invariant to changes in viewpoint (Cho et al., 2014;DeTone et al., 2018). In both examples, these learned descriptors outperform human-designed descriptors. Encoders are key parts of well-known networks including the language model GPT-3 and the text-to-image synthesis network Dall-E (OpenAI, 2022). In these networks, the output is estimated from an encoding of the input rather than directly from the input.

An Encoder-Decoder Model to Extract Hydrological Signatures
In this paper, we use an encoder to extract features from a window of streamflow and meteorological data. These features are one input to a decoder that predicts streamflow from meteorological data (over a different time window). We train the encoder-decoder end-to-end to predict streamflow, and in the process the encoder learns the features of its input most useful for predicting streamflow.
The encodings are analogous to hydrological signatures. Both encodings and signatures are numerical values that describe a streamflow series. Because the encodings of historical streamflow data feed into the decoder, they perform a similar function to that envisaged by Gupta et al. (2008) for hydrological signatures, that they can be used to improve the structure or parameters of the hydrological model. It is worth noting that our system differs from an autoencoder, where the decoder reverses the encoding process to recreate the input. In that case, the encodings would form a compressed representation of the streamflow such that the streamflow could be reconstructed from the encoding. However, this encoding would not serve the same purpose as a hydrological signature, because it would be specific to the particular year of streamflow (e.g., it would contain information about the timing of flood events or low flows within the year). Instead, our decoder is a hydrological model that uses a rainfall series, together with the encoding, to predict streamflow in a different year. This structure forces the encoding to encapsulate general information about streamflow dynamics that applies to any year of streamflow.
The model structure is shown in Figure 1, and the encoder and decoder structures are described in more detail in the following sections.

Encoder
We require the encoder to map a window of streamflow and meteorological data (a multi-channel time series of N years of daily precipitation, temperature and streamflow values) to a feature vector. One-dimensional convolutional layers are a common approach to mapping spatial or temporal data (images or time series) to higher-level representations of patterns in the data, with some invariance to where the patterns occur (Goodfellow et al., 2016, Chapter 9). They are therefore a good choice to learn hydrologic signatures from meteorological and streamflow time series. Each convolutional layer applies filters (kernels) to the input data to produce an output of reduced size but with multiple channels. By using filters, convolutional neural networks can learn complex behavior while reducing the number of learned parameters compared to alternative ML model structures. Each layer has a kernel size of 9, with 32 or 16 output channels, and rectified linear (ReLU) activation. Between layers, a Maximum Pool (MaxPool) with stride length three reduces the input data (N years of daily precipitation, temperature and streamflow values) to a smaller array. We append an average pool layer to allow the network to support multiple years of input, although in this paper we use 1-year windows (see Section 3.1.1 Sensitivity Analysis for experiments with different length windows). Splitting the flow record into sub-samples and calculating signatures for each sample to give a distribution of values follows recommendations from previous authors (Schaefli, 2016; Vogel & Fennessey, 1994). This approach enables a more accurate representation of typical hydrologic years and reduces the sensitivity of signatures to hydrologic extremes present in the period of record (Vogel & Fennessey, 1994). The final sigmoid activation maps each channel from its unconstrained range to a value in the range [0,1], normalizing the 16 learned signatures for easier analysis.

The dimensions of the encoder are chosen such that each layer has an intermediate feature dimension larger than the final encoding size so it can retain sufficient information (32 channels for 16 encodings), so that the reduction in size between each layer (the stride) is not too large, and so that the network reduces each year of data to a single feature vector (the encoding). Sizes in powers of two allow for fast computation. These are all typical choices for CNN design (Goodfellow et al., 2016). Similarly, use of ReLU activations and MaxPool layers are standard choices with high efficiency in training. Given the extreme flexibility of ML models such as CNNs, structural choices such as which activation function to use are less important than in process-based models, as model parameters can adapt equally easily to any standard choice. We choose a 16-dimensional encoding because it is larger than the number of independent signatures that can typically be derived from streamflow data (2-4; Olden & Poff, 2003), while allowing for some redundancy in the common case that multiple channels learn the most dominant features. Further, the encoding size is similar to existing sets of signatures used to summarize streamflow dynamics, such as the 13-value CAMELS signature set that we used for comparison (Addor et al., 2017). To evaluate the number of independent signatures that can be learned by our model, we conducted a sensitivity test of the model performance by varying the encoding size from 2 to 32 signatures.

Table 1 (continued). Overland flow signatures:
IE_effect — Infiltration excess importance; average of the standardized z-score coefficients for mean and maximum event intensity in regression equations to predict event peak magnitude and quickflow volume (-)
IE_thresh_signif — Infiltration excess threshold significance; P-value for the significance of a change in slope in a plot of event quickflow volume versus event maximum intensity (-)
IE_thresh — Infiltration excess threshold depth (intensity of precipitation needed to produce quickflow); value of the threshold identified in the IE_thresh_signif signature (mm)
SE_effect — Saturation excess importance; average of the standardized z-score coefficients for total event precipitation and 3 and 7-day antecedent precipitation totals in regression equations to predict event peak magnitude and quickflow volume (-)
SE_thresh_signif — Saturation excess threshold significance; P-value for the significance of a change in slope in a plot of event quickflow volume versus event total precipitation (-)
SE_thresh — Saturation excess threshold location (depth of precipitation needed to produce quickflow); value of the threshold identified in the SE_thresh_signif signature (mm)
SE_slope — Above-threshold slope in a plot of quickflow volume versus total precipitation (mm/mm)
Storage_thresh_signif — Storage threshold significance; P-value for significance of a change in slope in a plot of event quickflow volume versus event precipitation + antecedent precipitation index (-)
Storage_thresh — Storage threshold location (storage depth needed to produce quickflow); value of the threshold identified in the Storage_thresh_signif signature (mm)
Note. For more detailed technical descriptions and formulae, refer to Gnann et al. (2021a) and Addor et al. (2017). Signature descriptions adapted from those papers.
To verify that this network structure is suitable for learning signatures, we conducted a separate test (not part of the ML encoder training described in the rest of the paper) of whether the structure could learn the existing CAMELS signatures. The structure was able to learn the signatures, with typical median relative errors of 10%, showing that our overall structure choices are suitable for our research objective. We did not evaluate other kinds of encoder that could perform the same task, for example, recurrent encoders or transformers (Jaderberg et al., 2015).
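The successive reduction of the time dimension by the convolution and pooling stack can be traced with simple length arithmetic. The sketch below is an illustration only; the number of convolution/pool blocks shown is an assumption, not taken from our implementation.

```python
import numpy as np

def conv1d_valid_len(n, k=9):
    # output length of a 'valid' 1-D convolution with kernel size k
    return n - k + 1

def maxpool_len(n, k=3, s=3):
    # output length of a max pool with window k and stride s
    return (n - k) // s + 1

# Trace feature lengths through a hypothetical stack of
# conv(kernel 9) + MaxPool(stride 3) blocks for one year of daily data.
n = 365
for block in range(3):  # number of blocks is an assumption
    n = maxpool_len(conv1d_valid_len(n))
    print(f"after block {block + 1}: length {n}")

# An average pool over the remaining time axis collapses each of the
# 16 channels to a scalar; a sigmoid maps each scalar into [0, 1].
encoding = 1.0 / (1.0 + np.exp(-np.random.randn(16)))
assert encoding.shape == (16,)
assert np.all((encoding >= 0) & (encoding <= 1))
```

Each block roughly divides the series length by three, so a few blocks suffice to reduce a year of daily data toward a single feature vector per year.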

Decoder
We require the decoder to map the encodings and a window of meteorological data (a 7-channel time series of N years of daily precipitation and climate values) to a predicted N-year streamflow time series. Two common types of model have emerged in the hydrological community to perform this type of streamflow prediction: LSTM models (e.g., Kratzert, Klotz, Herrnegger, et al., 2019; Kratzert, Klotz, Shalev, et al., 2019) and process-informed models (e.g., Feng et al., 2022). LSTM models can learn the dynamics of the target catchment in their parameters, and might represent information about dynamics in their hidden state. The networks also learn hidden state variables that are equivalent to stores (Lees et al., 2021). However, relating the states and weights of an LSTM to hydrological processes is not straightforward. In contrast, process-informed models have structures similar to a conventional hydrological model and use recurrent state variables representing stores that are more straightforward to interpret. Given that interpretation of the new learned signatures and their relationship to model dynamics is a key aim of this research, we chose to use a process-informed structure.
Our decoder runs on a daily timestep and uses eight stores to represent water storage. Eight stores offer flexibility in simulating runoff generation pathways, and are sufficient to represent the stores we expect in a classical model without too much redundancy, simplifying the interpretation of the model's results. We conducted a sensitivity analysis to test model performance with alternative numbers of stores. As with a classical hydrologic model, at each timestep the decoder uses its inputs (the encoding and current meteorological data) and the current state of the catchment (the store volumes from the previous timestep) to predict outflows (streamflow) and to update the store volumes.
The model uses a neural network to learn how to distribute water between stores and how much outflow occurs from each store, the equivalent of learning dynamic model parameters. The neural network uses fully connected layers with ReLU activations, an efficient and flexible structure that is the default choice recommended by Goodfellow et al. (2016, Chapter 11). Two layers are used to combine the 16-value encoding with the 19 static CAMELS attributes to generate a 16-value attribute vector. This step prevents overfitting of the model and ensures that the complexity of the model remains the same with or without signatures, allowing a fair test of the model performance improvement gained by using signatures. For tests where the 16-value encoding is not used (see Section 2.4), the 19 CAMELS attributes are passed through an equivalent two-layer network to generate a 16-value attribute vector. Four further layers are used to combine the attribute vector with the daily meteorological data and store values from the previous timestep, to predict the dynamic model parameters. An alternative to this structure would be to use an LSTM to learn the dynamic model parameters (e.g., Feng et al., 2022). However, this would make it less easy to interpret the effect of learned signatures on model behavior, because the LSTM maintains memory in its hidden states rather than in the stores.
The dynamic parameters predicted by the decoder at each timestep are the actual evapotranspiration (scalar AET value), the distribution of precipitation between stores (eight-value vector a), and the outflow coefficient for each store (eight-value vector b). All these values (AET, a, b) are predicted once per timestep as outputs from the four fully connected layers described above. We use a softmax activation to enforce that the precipitation fractions sum to one, and softplus activations to enforce that AET and outflows are positive. The streamflow prediction for each timestep is the dot product of the b coefficients with the store volumes. The store volumes are then updated as follows: the precipitation fraction minus AET (a coefficient × precipitation − AET) is added to each store, and the outflow (b coefficient × store volume) is subtracted. As explained in Section 2.3, the decoder model parameters that predict the a, b, and AET coefficients are trained such that they optimize the model's ability to predict observed streamflow.
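The per-timestep update described above can be sketched as follows. This is a minimal numpy illustration, not our trained model; `decoder_step` and the `raw_*` inputs, which stand in for the outputs of the fully connected layers, are hypothetical names.

```python
import numpy as np

def softmax(x):
    # precipitation fractions: positive values that sum to one
    e = np.exp(x - x.max())
    return e / e.sum()

def softplus(x):
    # smooth positive activation for AET and outflow coefficients
    return np.log1p(np.exp(x))

def decoder_step(stores, precip, raw_a, raw_b, raw_aet):
    """One daily timestep of the store-and-flux decoder (sketch).

    stores: store volumes from the previous timestep (length 8)
    raw_a, raw_b, raw_aet: unconstrained neural-network outputs
    """
    a = softmax(raw_a)        # fraction of precipitation to each store
    b = softplus(raw_b)       # positive linear outflow coefficient
    aet = softplus(raw_aet)   # positive actual evapotranspiration
    outflow = b * stores      # outflow from each store
    streamflow = outflow.sum()  # equals dot(b, stores)
    # add precipitation fraction minus AET, subtract outflow
    stores = stores + a * precip - aet - outflow
    return stores, streamflow
```

Because `a` is produced by a softmax, precipitation is conserved across the stores, while the softplus activations keep evapotranspiration and outflows non-negative.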

Training Strategy
The model is trained end-to-end, ensuring that the learned signatures are optimized to predict rainfall-runoff behavior. The training requires that we specify a loss function (performance measure) to evaluate the decoder's streamflow prediction. Following previous ML streamflow prediction studies, we use the Nash-Sutcliffe Efficiency (NSE). To prevent undue influence of outliers, we applied the Huber Loss to the square root of 1-NSE to downweight the impact of watersheds with very low NSE (Equation 1). Using the Huber Loss is standard practice in robust fitting (Friedman, 2001), and here it prevents model convergence to a degenerate solution of predicting zero flows everywhere to avoid extreme negative values. The Huber Loss performs a similar function to a bounded normalization of the NSE, which is recommended to prevent extreme negative values from dominating when calculating statistical moments of the NSE (e.g., the mean) across multiple watersheds (Mathevet et al., 2006; Pushpalatha et al., 2012).
L = HuberLoss(√(1 − NSE))
Equation 1: Loss function L constructed of the Huber Loss applied to the square root of 1 − NSE. Minimizing 1 − NSE is equivalent to maximizing NSE.
The loss function ignores the first ⅛ of the time series as a warm-up period, in our case equal to 6 weeks with 1 year of input. During training, we add a regularization term to the loss function, equal to the difference in summed store volumes between the beginning and end of the model run. This term helps to maintain the water balance by penalizing the model if water accumulates in stores. Requiring mass conservation has been shown in LSTM trials to produce a small reduction in performance, but the difference is minor (<0.01 NSE loss) (Frame et al., 2022).
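A minimal sketch of this loss, assuming a standard Huber function and the ⅛ warm-up exclusion (the Huber delta and the exact form of Equation 1 are assumptions, not reproduced from our code):

```python
import numpy as np

def nse(q_obs, q_sim):
    # Nash-Sutcliffe Efficiency; always <= 1
    return 1.0 - np.sum((q_obs - q_sim) ** 2) / np.sum((q_obs - q_obs.mean()) ** 2)

def huber(x, delta=1.0):
    # standard Huber function: quadratic near zero, linear in the tails
    return np.where(np.abs(x) <= delta,
                    0.5 * x ** 2,
                    delta * (np.abs(x) - 0.5 * delta))

def loss(q_obs, q_sim):
    # drop the first 1/8 of the series as warm-up
    n = len(q_obs) // 8
    e = 1.0 - nse(q_obs[n:], q_sim[n:])  # minimizing 1-NSE maximizes NSE
    return float(huber(np.sqrt(e)))      # sqrt is safe since 1-NSE >= 0
```

Because NSE is bounded above by one, 1 − NSE is non-negative, and the square root plus Huber function compresses the extreme losses produced by watersheds with strongly negative NSE.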
The training phase used 470 watersheds, with 100 reserved for validation and 100 for testing. Watersheds in the test sample were never used for training, which is a stricter criterion than several previous studies that allowed different years from the same watersheds to occur in both training and testing samples (e.g., Feng et al., 2022; Frame et al., 2021). Evaluating hydrological models under different climate conditions, or (harder) in different watersheds to those in the training set, is considered best practice to produce a robust model (Klemeš, 1986; Razavi, 2021). We evaluate the model in a new set of watersheds to ensure that our results reflect how well the decoder can represent rainfall-runoff dynamics based on the encoding, and not how well it has learned the behavior of individual watersheds. Each sample used for training or testing consists of two parts. One randomly chosen year of historical meteorology and streamflow data is input into the encoder to calculate the 16-value encoding (signatures). A second year of meteorology and streamflow data is used to test the decoder, by inputting the meteorology data and testing the model output against the streamflow data. The decoder has no access to historical streamflow data except for the 16 signature values. In future, the decoder could be used to predict streamflow in completely ungauged watersheds by estimating signature values in those watersheds. Signature values could be estimated using regionalization methods, for example, by transferring signatures from watersheds with similar climate, topography and geology. We conducted additional experiments to test the sensitivity of the model to the length of the training data (from 1-year samples in the default case up to 5 years).
We train in batches of 128 samples, using the Adam optimizer (Kingma & Ba, 2014) and a learning rate of 0.0003, for 200 or more epochs (an epoch being one pass through all the training data). This takes 2 days on a PC, running on CPU. As with any neural network, our model has a large number of fitted parameters (10,000s to millions are typical; our encoder has 25,900 parameters and the decoder has 41,945 parameters). Therefore, care is needed to prevent overfitting of the training data to the detriment of testing performance. The primary method we use to prevent overfitting is early stopping, whereby after each epoch of training the model performance is evaluated against the validation data set (Razavi, 2021). When the validation error stops decreasing or starts to increase, overfitting has started and therefore training is halted. Figure 2 shows an example of training and validation performance for our model training using learned signatures. After approximately 2000 epochs, validation performance no longer increases and therefore training is halted. Further strategies that we used to prevent overfitting were batching (Goodfellow et al., 2016) and regularization by maintaining a water balance (see details above). Limited experiments with other regularizations (weight decay and dropout; not shown) showed no effect on the difference between training and validation error. After training finished, the model was evaluated on the (separate) test set of watersheds.
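The early-stopping logic can be sketched generically as follows; `train_epoch` and `validate` are placeholder callables, and the patience value is illustrative rather than taken from our training setup.

```python
def train_with_early_stopping(train_epoch, validate, patience=5, max_epochs=10000):
    """Halt training when validation error stops improving.

    train_epoch() runs one pass over the training data;
    validate() returns the current validation error.
    """
    best_err, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_epoch()
        err = validate()
        if err < best_err:
            best_err, best_epoch = err, epoch      # still improving
        elif epoch - best_epoch >= patience:
            break  # validation error no longer decreasing: stop
    return best_epoch, best_err
```

In practice the model weights from the best validation epoch would also be saved and restored, which is omitted here for brevity.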

Analysis
After training our model as described above, we calculated the NSE performance measure for each watershed in the test set, and calculated the mean, median and quantiles of the NSE distribution. To test whether the learned signatures improved runoff prediction over classical signatures, we trained and tested the model using three configurations: [1] no signatures, [2] classical CAMELS signatures, and [3] learned signatures. Comparing these configurations tests whether there is extra information in streamflow dynamics that is not captured by the human-designed signatures, but that is important for model simulation. We further benchmarked our model performance by comparing the mean and median NSE scores against recent process-informed and LSTM-based deep learning models, and traditional hydrological models that have been tested on CAMELS watersheds. The main aim of our paper is to analyze the usefulness and interpretation of the learned signatures, rather than purely striving for the best NSE score. Therefore, the objective of the benchmarking was to show that our models provided reasonable scores in the same range as previous, well-respected models.
In our second research question, we aim to interpret the physical meaning of the learned signatures. This will expose the features of the streamflow dynamics that are most important for model simulation. To answer this question, we use two main strategies. Our first strategy is to plot a correlation matrix for the learned signatures against classical hydrologic signatures, using the CAMELS signatures and the sets of signatures for baseflow and overland flow processes (Table 1). Where learned signatures correlate with one or more classical signatures, we infer their meaning to be similar to those combinations of signatures. Where learned signatures do not correlate with any classical signatures, they have learned new dynamics of the flow series. We use maps of the learned signatures to evaluate their spatial patterns and spatial correlations and compare these with patterns of classical signatures, as spatial correlation is a key metric of signature behavior (Addor et al., 2018).
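The cross-correlation computation can be sketched with numpy as follows. Random data stands in for the real signature values, and Pearson correlation is assumed for illustration; the correlation measure used in the paper's analysis may differ.

```python
import numpy as np

# Hypothetical arrays: rows are watersheds, columns are signatures.
rng = np.random.default_rng(0)
learned = rng.normal(size=(100, 16))     # 16 learned signatures
classical = rng.normal(size=(100, 13))   # 13 classical (CAMELS) signatures

# np.corrcoef treats rows as variables, so transpose and stack both sets;
# the off-diagonal block is the learned-vs-classical correlation matrix.
full = np.corrcoef(learned.T, classical.T)
cross = full[:16, 16:]

# entry (i, j) is the correlation of learned signature i
# with classical signature j
assert cross.shape == (16, 13)
```

High absolute values in a row of `cross` would suggest that the corresponding learned signature captures dynamics similar to those classical signatures; a row of near-zero values would suggest newly learned dynamics.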
Our second strategy to interpret the meaning of learned signatures is to analyze the ML model behavior, and how the learned signatures affect its dynamics. We analyze the seasonal dynamics of the eight model stores and interpret their similarity to classical model stores. To do this, we examine two parameters: the a parameter which is the proportion of precipitation routed into the store, and the b parameter which is the linear outflow coefficient. For each day of the year, we calculate the average parameter values across all years and all watersheds, to visualize and interpret typical behavior. We then use a Method of Morris sensitivity analysis (Morris, 1991) to quantify the effect of each learned signature on the store dynamics. Method of Morris is an efficient global sensitivity analysis method that selects multiple starting locations in the parameter space, followed by "one-at-a-time" parameter adjustments to measure the effect on the output variable of interest. In our case, the parameters are the learned signature values, and the output variables are the a and b parameters for each store.
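A simplified one-at-a-time variant of the Method of Morris can be sketched as follows. The classical method builds trajectories through the parameter space; this radial sketch conveys the elementary-effect idea, with a toy linear model standing in for the mapping from signature values to a store parameter.

```python
import numpy as np

def morris_mu_star(model, n_params, n_points=10, delta=0.1, seed=0):
    """Mean absolute elementary effect (mu*) per parameter.

    model maps a parameter vector in [0, 1]^n_params to a scalar output
    (in our application: a learned-signature vector to a store's a or b
    parameter). Simplified one-at-a-time variant of Method of Morris.
    """
    rng = np.random.default_rng(seed)
    effects = np.zeros((n_points, n_params))
    for t in range(n_points):
        x = rng.uniform(0, 1 - delta, size=n_params)  # random start point
        y0 = model(x)
        for i in range(n_params):                     # one-at-a-time steps
            x_step = x.copy()
            x_step[i] += delta
            effects[t, i] = (model(x_step) - y0) / delta
    return np.abs(effects).mean(axis=0)               # mu* sensitivity index

# toy check: output depends strongly on input 0, weakly on input 1
mu_star = morris_mu_star(lambda s: 3.0 * s[0] + 0.1 * s[1], n_params=2)
```

For a linear toy model the elementary effects are exact, so `mu_star` recovers the coefficients; for the trained decoder, large mu* values identify the learned signatures that most strongly control each store's a and b parameters.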

Results
The aim of this section is to assess the utility of the learned signatures and how they compare to classical signatures. We first tested the decoder model performance in predicting streamflow, and compared it against a decoder that uses no signatures or classical signatures. This tested whether the learned signatures extracted new and useful information from the streamflow signal. We then interpreted the physical meaning of the learned signatures, through their correlation with classical signatures, their spatial patterns, and their effect on the decoder stores and fluxes.

Model Performance Comparison
We trained the model for the three configurations described in Section 2. The resulting NSE distributions are shown in Figure 3a, and performance statistics in Table 2.
The results confirm that as expected, using signatures as input improves model performance. The results further show that learned signatures improve performance over classical CAMELS signatures, implying that the learned signatures capture additional features of the streamflow dynamics that are not included in the classical signature set.
In Table 3, we compared the median and mean NSE for our model using learned signatures, against other machine learning and classical hydrological models that have been tested for streamflow prediction on the CAMELS data set. Overall, LSTM deep learning models typically produce the highest scores, followed by process-informed deep learning models. Classical hydrological models typically have slightly lower scores, even when they are locally calibrated on each watershed. Model performance in all categories varies according to several methodological choices such as how data is split between training/testing sets and whether ensembles of models are used. This comparison shows that our model performance is comparable with other recent models and therefore our model provides a trustworthy framework in which to evaluate and interpret learned signatures.

Figure 3b shows the spatial pattern of NSE scores for all watersheds for the [3] Learned Signatures configuration.
We included all watersheds in the map because the testing set contains only 100 watersheds, which is insufficient to show the regional pattern. The pattern is typical of hydrological model performance in the U.S., that is, performance is higher in humid watersheds and lower in arid watersheds (Newman et al., 2015). Poor performance can be caused by multiple factors, including rainfall errors, non-closed water balance and other unusual flow dynamics not well represented by the model structure. The successive introduction of CAMELS signatures and learned signatures notably improves the most poorly performing watersheds and those with below-average performance.

Sensitivity Analysis
We conducted additional experiments to test the sensitivity of the validation NSE performance to model structure choices. We varied the length of the encoder input meteorology and streamflow data used to calculate the signatures (default was 1 year). We varied the number of learned signatures from 2 to 32 (default was 16). We varied the number of stores used in the process-informed decoder model from 2 to 32 (default was 8). The results are shown in Figure 4. Figure 4 shows that none of these structural choices make large differences to the validation NSE performance.
For models with at least six learned signatures and six model stores, validation NSE is in the range 0.685-0.710. Validation NSE increases with longer encoding input length and number of decoder stores. For the number of learned signatures, validation NSE does not substantially increase above six signatures. All such choices involve trade-offs: for example, a larger number of decoder model stores increases performance but makes the function of each store harder to interpret, while a longer encoding length increases performance but reduces the number of samples on which to train the model and therefore may decrease robustness to future changes in climate. Future work could investigate whether performance continues to increase above six signatures for more complex hydrological models (i.e., a greater number of decoder model stores) or for larger numbers of catchments. Given the small range in validation performance in all cases, we retained the default structure choices for the remainder of the analysis below.

Note (Table 3). All the ML models here are regionalized, not locally trained for individual basins. All performance values are given for test data, but the studies differ in whether test watersheds are included in the training set; this information is provided in the Model description and notes column. *These studies use 531 of the CAMELS watersheds, excluding basins >2,000 km² and those with discrepancies between different methods of calculating basin area.

Interpretation of Learned Signatures: Correlations With Classical Signatures
In this section, we interpret the meaning of the learned signatures by examining their correlation with classical signatures (Figure 5) and their spatial patterns (Figure 6). Classical signatures that have high correlations with learned signatures represent basic properties of the watershed relating to its water balance and ability to store water (Figure 5, section "Water balance"). These include runoff_ratio (partitioning of water to flow and evapotranspiration), hfd_mean (day of year when half of the flow volume has passed, related to snow storage), and EventRR_TotalRR_ratio (related to the division of flow into fast and slow pathways). Therefore, learned signatures similarly represent these properties.
Low flow metrics have moderate correlations with learned signatures. Among the more highly correlated are BFI and variability index (both related to baseflow magnitude), q5 (fifth percentile low flow magnitude), low_q_freq (how often low flow periods occur), and RecessionParametersBeta (shape of hydrograph recessions). High flow metrics also have moderate correlation with learned signatures, with q95 (high flow magnitude) having the strongest of those. These examples show that the machine learning model is learning both low and high flow characteristics of the data, despite the emphasis of the NSE loss function on high flow periods where errors are typically larger. Further, correlations of learned signatures to classical signatures that describe flow dynamics (e.g., half flow date, baseflow index, recession shape) show that the model learns characteristics of the streamflow as well as the meteorology data.
Overland flow metrics with moderate or high correlations to learned signatures include SE_effect (importance of saturation excess flow), SE_thresh_signif and Storage_thresh_signif. The threshold significance signatures measure whether the watershed has a strong threshold for the depth of rainfall needed over shorter/longer timescales to generate flow. This result may partly reflect that neural networks are good at learning threshold relationships.
Some classical signatures do not have a learned signature equivalent. This can occur for multiple reasons. For example, metrics related to infiltration excess processes have low correlations with learned signatures, which could be because these processes occur at sub-daily timescale while this analysis used data at a daily timestep. Other signatures may classify an aspect of the streamflow time series that is not important to model performance as measured by NSE, and therefore these classical signatures are not useful for improving model NSE.
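The correlation screening used in this section can be sketched as follows. The data here are synthetic stand-ins (the count of 30 classical signatures and 120 basins is our illustrative choice); the key details are taking the absolute correlation, because the sign of a learned signature is arbitrary, and finding each learned signature's best-matching classical signature:

```python
import numpy as np

# Synthetic stand-in data: 120 basins, 16 learned and 30 classical signatures.
rng = np.random.default_rng(0)
learned = rng.normal(size=(120, 16))
classical = rng.normal(size=(120, 30))

# Pearson correlations between every (learned, classical) pair; with
# rowvar=False, columns are treated as variables and the two arrays are
# stacked, giving a (46, 46) correlation matrix.
full = np.corrcoef(learned, classical, rowvar=False)
cross = np.abs(full[:16, 16:])        # (16, 30) learned-vs-classical block

# Best-matching classical signature for each learned signature.
best = cross.argmax(axis=1)
```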
The maps in Figure 6 show the spatial patterns of each learned signature. Note that high and low values can be reversed without change in meaning, and that the numbering 1-16 of the learned signatures is arbitrary, as the signatures could be learned in any order. Some encodings reproduce key features of the hydroclimate; for example, maps 5, 6 and 9 are all similar to the pattern of aridity across the U.S., with the most arid areas in the Great Plains and the Southwest. These encodings correlate with the low_q_freq signature that measures the frequency of low flow periods. Even where there are strong correlations between an encoding and a physical signature, there may be notable differences in patterns. For example, learned signature 2 has a strong correlation of 0.67 with mean half flow date, but the spatial pattern does not differentiate between Florida and the wider Southern and Appalachian regions as strongly as the physical signature (see Figure 4c in Addor et al., 2017).
Some of the learned signatures such as 8, 15, and 16 do not correlate with any of the classical signatures in Figure 5 or with physical attributes (Addor et al., 2017). The lack of correlation shows that the ML framework has found new information in the streamflow data that was not previously used in flow descriptions or predictions. We can distinguish this new information from noise because it shows clear spatial patterns (Figure 6). All the learned signatures show high spatial correlations (smooth areas of high or low values), in contrast with classical signatures, which sometimes show speckled maps with no clear spatial pattern (Addor et al., 2017). High spatial correlation is a desirable feature of signatures, as it implies lower uncertainty and stronger relationships to physical attributes (Addor et al., 2018).
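Spatial correlation of a signature map can be quantified with, for example, Moran's I; the sketch below (our choice of statistic and toy one-dimensional "map", not the study's method) illustrates how a smooth field scores near +1 while a speckled field scores near 0:

```python
import numpy as np

def morans_i(x, w):
    """Moran's I spatial autocorrelation: near +1 for smooth maps,
    near 0 (slightly negative) for spatially random maps.
    `w` is a symmetric spatial weight (adjacency) matrix."""
    x = np.asarray(x, float) - np.mean(x)
    return (len(x) / w.sum()) * (x @ w @ x) / (x @ x)

# Toy 1-D "map" with adjacency weights between neighboring points.
n = 50
w = np.zeros((n, n))
i = np.arange(n - 1)
w[i, i + 1] = w[i + 1, i] = 1.0

smooth = np.sin(np.linspace(0, 3 * np.pi, n))      # coherent spatial field
speckle = np.random.default_rng(3).normal(size=n)  # spatial noise
```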

Interpretation of Learned Signatures: Behavior of Process-Informed Model
We further interpret the meaning of the learned signatures by evaluating their impact on the ML model dynamics.
We first plotted the seasonal dynamics of the eight stores in the ML model (Figure 7) to understand their roles. The figure shows mean values of the store parameters: a, the proportion of precipitation routed into the store, and b, the linear outflow coefficient. Values are averaged across all years and all watersheds. Store 1 (dark blue) has the highest outflow coefficient (Figure 7b), so we interpret it as a quickflow store. Water is routed into this store during the spring and summer, that is, snow-free periods (Figure 7a). Store 5 (lime green) has similar seasonality to Store 1 but moderate outflow coefficients, interpreted as an interflow store. Water is routed into Stores 4 and 7 (cyan, red) during the winter, interpreted as snow or winter precipitation stores. Store 4 has higher outflow coefficients that peak in the spring, interpreted as a fast-melt store, while Store 7 has low outflow coefficients and is interpreted as delayed contributions of snowmelt to runoff. The interpretation as snow stores is further confirmed by mapping the average proportion of precipitation entering each store, which shows the highest values in Northern watersheds (not shown). A high proportion of precipitation is routed into Store 8 (brown; high values in Figure 7a), but its outflow coefficients are low and stable, so it is interpreted as a baseflow store. The remaining Stores 2, 3 and 6 absorb little water on average (low values in Figure 7a), so they play a less important role in most catchments. This result shows that five stores are usually sufficient to represent the flow dynamics. However, a greater proportion of water is routed into Stores 2, 3, and 6 in Great Plains watersheds where model performance is lower (not shown), and therefore the additional stores may represent unusual flow dynamics that help improve performance in difficult-to-model watersheds.
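The store bookkeeping behind these dynamics can be sketched in a few lines. This is a simplified illustration with fixed parameters; in the actual model, the decoder network predicts a and b at every timestep from climate data, watershed attributes and the encoding, and the b values below are our invented examples of fast-to-slow stores:

```python
import numpy as np

def decoder_step(stores, precip, a, b):
    """One daily timestep of a simplified decoder: `a` splits precipitation
    across stores and `b` are linear outflow coefficients (both fixed here,
    network-predicted in the full model)."""
    outflow = b * stores                   # linear release from each store
    stores = stores + a * precip - outflow
    return stores, outflow.sum()           # streamflow = summed store release

a = np.full(8, 1.0 / 8)                                       # even inflow split
b = np.array([0.5, 0.01, 0.01, 0.2, 0.1, 0.01, 0.02, 0.005])  # fast to slow
stores = np.zeros(8)
q_total = 0.0
for p in [10.0, 0.0, 0.0, 5.0]:            # short precipitation series (mm/day)
    stores, q = decoder_step(stores, p, a, b)
    q_total += q
# Mass balance: all precipitation is either still stored or has left as flow.
```

Because the inflow fractions sum to one and outflow is drawn from the stores, water volume is conserved exactly, which is the "fundamental knowledge" the hybrid structure enforces.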
Based on the above interpretation of the role of each store, we analyze the impact of each learned signature on the stores. The results of the Method of Morris sensitivity analysis are plotted in Figure 8, which shows the mean variation in store inflow (a parameter) and outflow coefficient (b parameter) caused by a unit change in each learned signature value. Overall, the largest variations in store inputs are for quickflow (Figure 8a). For inflows and outflows, the learned signatures tend to impact groups of similar stores. This occurs particularly for the two slow stores (slow snow release and baseflow), which are usually simultaneously increased or decreased. This shows that the learned signature values make meaningful changes to watershed behavior, rather than trading off one slow store for another. For learned signatures 8 and 15, which have no clear correlation to classical signatures, we investigated their behavior by perturbing the signature and examining the changes in store volumes and hydrographs in specific watersheds. This analysis showed that the signatures changed the relative volumes of water lost to AET and entering the slow-snow store, which altered the hydrograph volume at mid-range flows.
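The Method of Morris estimates input importance from elementary effects. The sketch below is a simplified one-at-a-time variant averaged over random base points (the study's exact sampling design may differ, and the toy response function is ours):

```python
import numpy as np

def morris_mu_star(f, x0, delta=0.1, n_traj=20, seed=0):
    """Simplified Morris screening: mean absolute elementary effect (mu*)
    of each input of f, averaged over random base points."""
    rng = np.random.default_rng(seed)
    d = len(x0)
    ee = np.zeros((n_traj, d))
    for t in range(n_traj):
        x = x0 + rng.normal(scale=0.5, size=d)   # random base point
        fx = f(x)
        for i in range(d):                       # perturb one input at a time
            xp = x.copy()
            xp[i] += delta
            ee[t, i] = (f(xp) - fx) / delta      # elementary effect
    return np.abs(ee).mean(axis=0)

# Toy response: input 0 dominates, input 1 is weak, input 2 has no effect.
f = lambda x: 10.0 * x[0] + 1.0 * x[1]
mu = morris_mu_star(f, np.zeros(3))
```

In the study's setting, f would map learned signature values to a store's a or b parameter, so mu* summarizes how strongly each signature drives each store.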
We synthesized the results of the correlation analysis and the sensitivity analysis into Table 4, and added our interpretation of the meaning of each learned signature. The table shows that four learned signatures relate to hydrograph recessions, three relate to seasonality and three to extremes (low or high), two relate to wetness and baseflow and two to ET. The variety of signature interpretations shows that despite the end-to-end training judging model performance using Nash Sutcliffe efficiency, the signatures did not solely focus on high flows.

Implications for Signature Use
The choice and design of signatures for applications in hydrology and ecohydrology is widely discussed. Questions include how to choose signatures among multiple options (Olden & Poff, 2003), what criteria to use for signature choice and the impact of uncertainty in signature values (Westerberg & McMillan, 2015; Westerberg et al., 2016). The results from this study show that signatures can be learned using an encoder-decoder framework, and that the learned signatures represent a wide variety of aspects of streamflow dynamics including seasonality, recession speed and shape, wetness and baseflow importance, ET magnitude and high and low extremes. Using these signatures as input to a model in a watershed without other gauging data improves the model performance over equivalent models that use classical signatures or no signatures.
In summary, we demonstrated how to replace expert-designed signatures with ML-based signatures, and we showed that this improves performance in a flow prediction application. The implication of our results is that for hydrological applications that use signatures, the user should consider ML signatures as an alternative. In ecohydrology, this includes applications where signatures are selected for their ability to explain ecological outcomes (Carlisle et al., 2017), or applications that use signatures to predict processes in the upstream watershed (Wlostowski et al., 2020). In modeling, it includes applications where signatures are selected as calibration targets based on their relationship to individual model parameters. The ML framework used to optimize the learned signatures should be adapted based on the application.
In contrast, where signatures are hand-selected by the user based on the theoretical underpinnings of the signature, or to manually emphasize one aspect of the flow dynamics over another, ML signatures cannot yet replace classical signatures. This includes ecohydrological applications that require a mechanistic or process-based explanation between the signature and the ecological response (Poff et al., 2010). It further includes catchment classification and model calibration applications where the user chooses signatures to balance the emphasis on high and low flows, or on magnitude-frequency-duration aspects of the flow dynamics (Kennard et al., 2010; Yadav et al., 2007). These applications rely on an expert understanding of the signature interpretation and require signatures to be chosen rather than optimized.

Machine Learning Model Structures in Hydrology
We used a process-informed ML model that tracks water volumes in stores, similar to other recent differentiable, parameter-learning approaches in hydrology (Feng et al., 2022). We showed that the stores learn recognizable hydrological dynamics such as losses to ET, snow storage and melt, and quickflow and baseflow stores. Our framework uses modern ML to learn these complex processes from the data, while enforcing the fundamental knowledge that catchments store and release water. The combination of machine learning for difficult-to-model aspects and classical methods for well-understood processes has been successful in computer vision. In that area, hybrid systems are state-of-the-art for 3D reconstruction tasks, combining ML for appearance-based data association with classical geometric methods (Choy et al., 2020;Ranftl & Koltun, 2018). In hydrology, the potential to combine "data-driven discovery" with "knowledge-driven regularization" offers an emerging pathway to bring the ML and geoscience communities together (Razavi et al., 2022).
As the hydrology community makes more extensive use of ML models, we will develop a knowledge base on ML model structure choices. As for classical hydrologic models, an ML hydrologic model should be chosen such that it can reproduce the hydrologic dynamics of interest, and can be calibrated (trained) using the data available.
There are many choices in ML model design, including the model structure (e.g., convolutional neural net, fully connected network or LSTM), the number and sizes of model layers, activation functions, regularization methods, network initialization, training strategies including data length and batch size, and the use of ensembles. These choices are known as the hyper-parameters of the network. As for classical hydrological models, much effort can be put into tuning these hyper-parameters when the aim of a study is to optimize performance. However, in this study, our primary aim was to discover which hydrologic signatures an ML model would learn, and to interpret them in terms of their relationships to classical signatures and to model dynamics. Therefore, there were many model structure choices that we did not fully investigate (including input data length and use of ensembles), in order to focus our attention on the learned signatures. Future work could test these choices in more depth, to meet the joint aims of learning signatures and maximizing model predictive performance.

Challenges and Future Work
In our model, we chose to use eight stores, which is sufficient to model fast, slow and snow storage, and 16 learned signatures, which is sufficient to capture important catchment dynamics. Previous studies represent the state of the catchment with a 64- or 256-dimensional LSTM hidden state (Feng et al., 2020; Lees et al., 2021). We achieve comparable model performance with a simpler model state, which is also easily interpretable. Alternatively, a model with more signatures and stores might have the capacity to learn more complex processes.
One concern is that the type of stores and signatures learned by the model might not be repeatable, which introduces uncertainty into the interpretation of the ML process. We conducted exploratory tests to train the model with different random seeds, which showed that the five most important model stores have similar dynamics, and learned signatures have similar behavior. However, further work is still needed to understand why certain signatures are learned, and how they should be interpreted.
An important application of hydrological signatures is that they can be regionalized (predicted for ungauged watersheds). The high spatial correlations of the signatures suggest that they are suitable for prediction from watershed attributes, which would enable this framework to be applied in ungauged catchments. Further analysis of the error in predicted signatures and its impact on forecast quality would be required.
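Such regionalization can be sketched as regressing learned signatures on watershed attributes. The sketch below uses synthetic data and plain least squares as an illustrative baseline (a real application would use CAMELS attributes such as aridity, soils and land cover, and likely a more flexible regression model):

```python
import numpy as np

# Synthetic stand-ins: 200 basins, 10 attributes, 16 learned signatures.
rng = np.random.default_rng(2)
attrs = rng.normal(size=(200, 10))
true_w = rng.normal(size=(10, 16))
sigs = attrs @ true_w + 0.1 * rng.normal(size=(200, 16))  # noisy signatures

# Fit on 150 "gauged" basins, predict for 50 held-out "ungauged" basins.
X_tr = np.hstack([attrs[:150], np.ones((150, 1))])  # add intercept column
w, *_ = np.linalg.lstsq(X_tr, sigs[:150], rcond=None)
X_te = np.hstack([attrs[150:], np.ones((50, 1))])
pred = X_te @ w
rmse = np.sqrt(np.mean((pred - sigs[150:]) ** 2))
```

The prediction error (rmse) for the held-out basins is the quantity whose impact on forecast quality would need to be analyzed before applying the framework in ungauged catchments.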

Conclusions
We trained an ML network with an encoder-decoder structure to simultaneously learn the ML equivalent of hydrologic signatures, and predict streamflow dynamics. The decoder used a process-informed ML model structure that learned inflow distributions and outflow coefficients for eight model stores that simulated water storage and release. Using end-to-end training, we optimized the learned signatures to improve performance when used as input to the process-informed ML model. We interpreted the meaning of the learned stores and signatures by comparing them with classical stores and signatures. The hybrid model structure combines classical components such as model stores with ML components such as learned inflow and outflow parameters. This promising approach benefits from an ML model's ability to simulate complex relationships and learn across watersheds, while also allowing the user to interpret model behavior.
Our results showed that the learned signatures have many positive characteristics that are desired in a set of hydrological signatures. Model configurations that used learned signatures as input in otherwise-ungauged watersheds showed improved performance over those that used classical signatures or no signatures. Interpretation of the learned signatures showed that they cover a variety of streamflow dynamics, including seasonality, recession speed and shape, wetness and baseflow importance, ET magnitude and high and low extremes. Most signature applications aim for such variety in signatures. The maps of learned signatures showed that they are spatially correlated. Signatures with high spatial correlations are less sensitive to flow uncertainty, are more highly related to physical attributes and are better predicted by hydrologic models (Addor et al., 2018). Therefore, we found that ML learning for hydrologic signatures provided a successful approach. In applications where signatures are selected based on their ability to explain data or improve performance, we recommend that hydrologists consider replacing expert-selected signatures with learned signatures.