Generalization of an Encoder-Decoder LSTM model for flood prediction in ungauged catchments

Flood prediction in ungauged catchments is usually conducted by hydrological models that are parameterized based on nearby and similar gauged catchments. As an alternative to this process-based modelling, deep learning (DL) models have demonstrated their ability for prediction in ungauged catchments (PUB) with high efficiency. Catchment characteristics, the number of gauged catchments, and their level of hydroclimatic heterogeneity in the training dataset used for model regionalization can directly affect the model's performance. Here, we study the generalization ability of a DL model to these factors by applying an Encoder-Decoder Long Short-Term Memory neural network for a 6-hour lead-time runoff prediction in 35 mountainous catchments in China. By varying the available number of catchments and model settings with different training datasets, namely local, regional, and PUB models, we evaluated the generalization ability of our model. We found that both quantity (i.e. number of gauged catchments available) and heterogeneity of the training dataset used for the DL model are important for improving model performance in the PUB context, due to a data synergy effect. The assessment of the sensitivity to catchment characteristics showed that the model performance is mainly correlated to the local hydro-climatic conditions; the more arid the region, the more likely it is to have a poor model performance for prediction in ungauged catchments. The results suggest that the regional ED-LSTM model is a promising method to predict streamflow from rainfall inputs in PUB, and outline the need for preparing a representative training dataset.


Introduction
Accurate and computationally efficient hydrological models are necessary for streamflow prediction to issue timely warnings for flash floods (Moore et al., 2005). Physics-based hydrological models are the most robust models that can be used for this purpose. They simulate physical processes in the rainfall-runoff transformation with parameters that represent soil, land surface, and climate properties, that need to be optimized for each geographic location with observations. But most catchments worldwide lack hydrological monitoring data and are considered "ungauged" (Guo et al., 2021), meaning that direct calibration of catchment parameters in these catchments is not possible. For this reason, the problem of prediction in ungauged basins (PUB) has received considerable attention in the hydrological community (Sivapalan et al., 2003).
One solution to the calibration of hydrological models for catchments without available data utilizes the concept of parameter regionalization. The idea is to use parameters calibrated in gauged catchments to predict the model parameters in a target ungauged catchment (Blöschl and Sivapalan 1995). Similarity-based and regression-based methods are widely used for model regionalization (Oudin et al., 2008). For example, Beck et al. (2016) proposed a scheme for the regionalization of model parameters at the global scale based on a similarity approach by selecting 10 gauged catchments with the most similar characteristics as donors for parameter transfer. However, the question of how to identify the selection criteria for choosing the optimal donor catchments remains a challenge that restricts the wide application of this method. Ragettli et al. (2017) applied the classification and regression tree (CART) method to explore parameter transferability in the full space of catchment descriptors for the hydrological model and showed that this method can be an effective tool for identifying similarity among catchments. However, it is model dependent and relies on manually identifying the similarity and then transferring the parameter set from a series of pre-defined hydrological models.
An alternative solution to parameterized hydrological models for streamflow simulation in ungauged catchments is the use of data-driven deep learning (DL) models. These models can be directly trained with inputs from meteorological and catchment characteristic data to simulate streamflow without using a physical hydrological model to predefine their similarity. For example, Kratzert et al. (2019) evaluated the ability of a Long Short-Term Memory (LSTM) model for the regionalization of over 500 catchments in the USA. They concluded that data-driven models had a strong capacity to learn non-linear climate-runoff relationships and to achieve model regionalization without identifying pre-defined criteria for similar donor catchments.
The catchment characteristics (e.g. topography, land use) and the climatic training dataset are the two most important factors that affect the model regionalization and the setup of a physical hydrological model in ungauged catchments (Teutschbein et al., 2018; Gong et al., 2021). While the performance of hydrological models in ungauged catchments is sensitive to these two factors, only a few studies have evaluated the generalization ability of data-driven DL models. For example, Potdar et al. (2021) predicted flood peak discharge in ungauged catchments based on the gradient boosted trees model (XGBoost) and found that catchment geomorphologic attributes have a higher impact on the prediction skill than climatologic attributes. Gauch et al. (2021) studied the sensitivity of the prediction skill of the LSTM model for daily streamflow in the USA to additional training samples and showed that it is not enough to train data-driven models on a few gauged catchments; one should strive to use as many catchments as possible. Fang et al. (2022) proposed the concept of 'data synergy', pointing out that to achieve higher predictive performance, a representative dataset with large but heterogeneous training samples (i.e. different characteristics of catchments) is needed. However, the generalization of DL models to the PUB problem with respect to the representativeness of the training dataset and catchment characteristics has not yet been studied thoroughly.
This study aims to evaluate the generalization ability of a DL model to predict floods in ungauged catchments considering the above factors. For this purpose, an Encoder-Decoder LSTM (ED-LSTM) neural network was applied to set up a forecast model for a 6-hour lead-time streamflow prediction in 35 mountain catchments in China. Three model setups: (i) a local model for each catchment; (ii) a regional model; and (iii) regional PUB models which differ in the choice of training and testing catchments, were applied. Their performances were compared to the CART regionalization method to evaluate the ability of the DL model to predict streamflow in the ungauged catchments (Ragettli et al., 2017). The analysis of the generalization ability of the PUB models concerning the training dataset and catchment characteristics was conducted by comparing the three model setups. The main purpose is to provide recommendations on the importance of preparing representative training datasets in the context of PUB with DL methods. We aim to answer the question of how ED-LSTM models (and other rainfall-runoff DL-based models) may be used for event-based flood forecasting, and which data requirements have to be present to provide reasonably accurate predictions in ungauged basins.

Study area
The study focuses on 35 mountainous catchments (Table S1) located in ten Chinese provinces (Fig. 1). The catchments were classified as northern catchments (17) and southern catchments (18) based on the traditional geographical south-north division of China, which is called the Huai river-Qin mountain line. This line approximates the 0 °C January isotherm and the 800-mm isohyet (Zhao et al., 2015). Mean annual precipitation in the north is on average 57 % lower than in the south and mean annual air temperatures are on average 6 °C lower.

Fig. 1. Locations of the 35 studied catchments, divided into northern (17, orange cross) and southern (18, blue cross) regions; the markers correspond to the locations of the local hydrological stations. The black solid line represents the south (S)-north (N) division of catchments along the Huai river-Qin mountain line (Ragettli et al., 2017).
The catchment areas range from 14 to 1693 km², with a mean catchment size of 278 km². Hourly hydrological and meteorological data are available from stream and rain gauges located within or in close vicinity to the catchments (Fig. 1). In addition, the county weather stations record daily maximum and minimum 2-meter air temperature. On average, 11 years of data are available per catchment, with 1 to 7 storm events occurring per year between April and October. We consider storm events as days with total precipitation greater than 5 mm d⁻¹, following the definition by Ragettli et al. (2017). Each of the catchments is characterized by several static variables (Kratzert et al., 2019) describing their climatic, vegetation, soil, and topographical properties (Table S3).

Long Short-Term memory (LSTM) network
LSTM networks are a type of recurrent neural network that can learn time dependencies in time series data (Hochreiter and Schmidhuber 1997). They have cell and hidden states, which account for long- and short-term memory effects. They are therefore a good choice for modeling time series of runoff, as they can account for a range of time-dependent delays, such as seasonality and natural annual variability cycles (long-term, months to years) and the immediate rainfall-runoff response (short-term, minutes to hours).
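Equations 1 to 6 provide the mathematical formulation of the LSTM at each time step (written here in the standard LSTM form):

$$f[t] = \sigma\left(W_f\,x[t] + U_f\,h[t-1] + b_f\right) \tag{1}$$
$$i[t] = \sigma\left(W_i\,x[t] + U_i\,h[t-1] + b_i\right) \tag{2}$$
$$\tilde{c}[t] = \tanh\left(W_c\,x[t] + U_c\,h[t-1] + b_c\right) \tag{3}$$
$$o[t] = \sigma\left(W_o\,x[t] + U_o\,h[t-1] + b_o\right) \tag{4}$$
$$c[t] = f[t] \odot c[t-1] + i[t] \odot \tilde{c}[t] \tag{5}$$
$$h[t] = o[t] \odot \tanh\left(c[t]\right) \tag{6}$$

where f[t], i[t], c[t], o[t], and h[t] represent the forget gate, input gate, cell state, output gate, and hidden state at each time step, respectively; \tilde{c}[t] is the candidate cell state; W, U, and b are the weights and bias terms of the neural network; and \odot denotes element-wise multiplication.
σ (the sigmoid function) and tanh are the two activation functions, and x[t] are the inputs.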
The LSTM cell has three gates maintaining and adjusting its cell state and hidden state (Fig. 2), including a forget gate (Eq. 1), an input gate (Eq. 2), and an output gate (Eq. 4). Each gate has a sigmoid function that adds non-linearity to the linear combination of the input x[t] and hidden state from last time step h[t-1]. Eqs. 3 and 5 represent how much input and last hidden state is contributed to the cell state c [t]. Finally, the output h[t] of the current time step is calculated from the output gate and cell state shown in Eq. 6. Cell state and hidden state are then passed to the next time step.
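The gate logic described above can be illustrated with a minimal pure-Python sketch of a single LSTM time step. It assumes the standard LSTM formulation (sigmoid gates, a tanh candidate cell state, element-wise products); the function names, the list-based weight layout, and the `zero_params` helper are hypothetical and are not taken from the study's implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, U, b, x, h_prev):
    """Return W x + U h_prev + b for list-based matrices and vectors."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h_prev))
        + b_k
        for W_row, U_row, b_k in zip(W, U, b)
    ]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step in the standard formulation (an assumption here).

    params maps each of 'f', 'i', 'g', 'o' to a (W, U, b) tuple.
    """
    f = [sigmoid(z) for z in affine(*params["f"], x, h_prev)]    # forget gate
    i = [sigmoid(z) for z in affine(*params["i"], x, h_prev)]    # input gate
    g = [math.tanh(z) for z in affine(*params["g"], x, h_prev)]  # candidate cell state
    o = [sigmoid(z) for z in affine(*params["o"], x, h_prev)]    # output gate
    # New cell state: keep part of the old state, add part of the candidate.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # New hidden state: gated, squashed view of the cell state.
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def zero_params(n_in, n_hidden):
    """Hypothetical helper: all-zero weights, handy for a sanity check."""
    W = [[0.0] * n_in for _ in range(n_hidden)]
    U = [[0.0] * n_hidden for _ in range(n_hidden)]
    b = [0.0] * n_hidden
    return {key: (W, U, b) for key in "figo"}
```

With all-zero weights every gate evaluates to 0.5 and the candidate state to 0, so the cell state is simply halved at each step, which makes the role of the forget gate easy to verify by hand.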
The Encoder-Decoder (ED) (Fig. 2) structure has been used in the field of sequence-to-sequence prediction problems, especially for language translation (Cho et al., 2014). The encoder and decoder enable the model to operate on different input and output time steps. The ED-LSTM model consists of two LSTM networks in both the encoder and the decoder parts. The application of the ED structure in LSTMs can efficiently improve the performance of ahead-time prediction in the field of hydrology since the existence of the encoding and decoding architecture can eliminate the restriction on the length of input and output sequences (Kao et al., 2020). In this case, the input and output sequences do not necessarily have the same time steps. The input and output can be flexible to embed different types of input data. For example, catchment static variables and rainfall time series can be used as input to the model at the same time.
Unlike the basic LSTM structure, the encoder layer only outputs the hidden state from the last cell. Then, it is copied as input for each LSTM cell in the decoder layer. It contains information collected from the input sequence at each time step. Thus, it could be effective to use this structure to improve long-term dependencies for longer time step prediction than in the regular LSTM. In this study, five additional dense layers were set after the LSTM decoder layer for better decoding of the sequence output at each time step.

Input data composition
The ED-LSTM model experiments described in the next section require dynamic and static inputs. The dynamic input data includes the following climate variables: (i) hourly precipitation; (ii) hourly streamflow; and (iii) maximum/minimum 2-m daily air temperature. The dynamic inputs were divided into an observation phase and a prediction phase (Fig. 3). The first phase includes 24-hour precipitation, temperature, and streamflow data, computed from the hourly observed data before the prediction phase. The second phase contains the 6-hour precipitation and temperature as driving data for the streamflow forecast. Also, previously observed streamflow was used as the dynamic input because it is known to improve forecast (e.g. Song et al., 2020). Daily maximum and minimum temperature data are used to better account for snow-induced streamflow processes (e.g. Xiang et al., 2020).
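As a rough illustration of how one storm event could be cut into observation and prediction phases, the sketch below builds index windows with a 24-hour observation phase followed by a 6-hour prediction phase, moving forward in 6-hour increments as stated in the Fig. 3 caption. The function name `make_windows` and the slice-based layout are assumptions made for illustration only.

```python
def make_windows(n_hours, obs_len=24, pred_len=6, stride=6):
    """Return (obs_slice, pred_slice) index pairs for one storm event.

    obs_len, pred_len and stride follow the 24-hour observation phase,
    6-hour prediction phase and 6-hour increments described in the text;
    the function name and the slice-based layout are illustrative assumptions.
    """
    windows = []
    start = 0
    # Keep sliding while a full observation + prediction window still fits.
    while start + obs_len + pred_len <= n_hours:
        obs = slice(start, start + obs_len)
        pred = slice(start + obs_len, start + obs_len + pred_len)
        windows.append((obs, pred))
        start += stride
    return windows


# Usage: a hypothetical 48-hour event of hourly records.
windows = make_windows(48)
```

Each `obs` slice would select precipitation, temperature, and streamflow for the encoder, while each `pred` slice would select the driving precipitation and temperature and the target streamflow for the decoder.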
We reduced the number of static catchment attributes from a total of 27 (Table S2) to 14 (Table S3) using a principal component (PC) analysis, preserving only the uncorrelated attributes that represent best the natural clusters in each category of catchment characteristics (Singh et al., 2014). The absolute PC scores (Table S4) of each selected attribute were taken to represent the uncorrelated patterns rather than the individual catchment characteristics (Ragettli et al., 2017).

Numerical experiments
Three different ED-LSTM numerical experiments were set up: (i) local models, i.e. a unique ED-LSTM setup for each of the catchments; (ii) a regional model simulating all catchments at once; and (iii) regional PUB models, i.e. models that include both gauged and ungauged catchments in different combinations.
In the first experiment, ED-LSTM models were trained and tested on individual catchments without using data from other catchments. In total, 35 local models were set up. Only the dynamic variables were applied as input data and no static variables were involved in this setup. Preliminary tests were conducted on the catchment with the longest training samples to select the optimal ED-LSTM model hyperparameters.
In the second experiment, we set up a regional model, namely one model for all catchments. This ED-LSTM model was trained and tested using an ensemble of events from all catchments together. The regional model was applied with both the dynamic and static variables as inputs. Kratzert et al. (2018) demonstrated that the LSTM model can perform better with a regional model setting than with a setup for individual catchments (as in the first experiment) as more data is available for model training (i.e. additional rainfall-runoff interactions are available for the LSTM to learn from). This experiment aims to find out how much the flood warning prediction ability can benefit from a large training dataset.
Fig. 3. Input composition and prediction process of the ED-LSTM structure for one event. Two encoder layers incorporated previous 24-hour data in the observation phase and 6-hour pseudo forecasting meteorological data as driving forces (all of these data were observations). The decoder layer outputs the predicted 6-hour runoff corresponding to a 6-hour driving force in the prediction phase. The dense layer functions as an embedding layer for feeding the catchment static variables. For one event, the two phases consist of a moving window and each step moves them forward in 6-hour increments.

In the last experiment, the regional ED-LSTM setup was applied for prediction in the context of ungauged catchments (PUB models). We conducted two tests here, first to explore the ED-LSTM generalization ability to the number of gauged catchments used in the training, and second to the characteristics of the catchment. To explore the first question, we followed a similar setup as for the regional model. However, we trained the model with fewer catchments using the k-fold validation strategy, which enables us to test the model performance on catchments unseen in the training process. The k-fold validation is commonly used for model parameter selection (e.g. Chang et al., 2015), but here we used it as a tool for an 'out-of-sample prediction': the 35 catchments were split randomly into k groups (namely 'folds') of approximately equal size; catchments from k-1 groups were used to train the model, and then the model was tested on the remaining single group of catchments as ungauged catchments. This procedure is repeated k times so that out-of-sample predictions are made available to all catchments (Kratzert et al., 2019). As the catchments are heterogeneous, the number of valid training samples varies. At first, 10-fold validation was adopted to train 10 models with the same model structure and each model was applied for the prediction in three ungauged catchments. This means that the training process was repeated 10 times with a different set of 32 catchments each time to cover all catchments (Pub1, Table S4).
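The out-of-sample use of k-fold validation over catchments (rather than over events) can be sketched as follows. This is only an illustration of the splitting logic under stated assumptions: the function names, the fixed random seed, and the integer catchment identifiers are hypothetical, and the study's actual fold assignment is the one listed in Table S4.

```python
import random


def catchment_folds(catchment_ids, k, seed=0):
    """Split catchments randomly into k folds of approximately equal size."""
    ids = list(catchment_ids)
    random.Random(seed).shuffle(ids)
    # Striding over the shuffled list gives fold sizes that differ by at most one.
    return [ids[j::k] for j in range(k)]


def pub_splits(catchment_ids, k, seed=0):
    """Yield (train_ids, test_ids); each fold is held out once as 'ungauged'."""
    folds = catchment_folds(catchment_ids, k, seed)
    for j, test_ids in enumerate(folds):
        train_ids = [c for m, fold in enumerate(folds) if m != j for c in fold]
        yield train_ids, test_ids


# Usage: 35 hypothetical catchment identifiers, 10 folds.
splits = list(pub_splits(range(35), k=10))
```

Because whole catchments are held out, no event from a test catchment is ever seen during training, which is what makes the test catchments behave as ungauged. With 35 catchments and 10 folds, this particular sketch holds out either 3 or 4 catchments per fold.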
To evaluate the generalization ability of the PUB modeling to catchment characteristics such as climate and topography, k-means clustering was applied to classify and group the catchments based on the 27 catchment static variables listed in Table S2, including climatic, topographic, vegetation, and catchment drainage properties. Based on the silhouette scores (Fig. S1), we found that five clusters are required to group the catchments by their attributes. We averaged the model performance based on each cluster. Fang et al. (2022) hypothesized that DL-based models will have a better prediction skill in the context of PUB if a regional model is not trained on a relatively small and hydrologically homogeneous dataset (e.g. few catchments but sharing similar hydrological characteristics) but rather on a larger and heterogeneous sample (e.g. multiple catchments with varying characteristics). To test this effect, additional experiments were conducted: the 'PUB' model was applied to each of the 5 clusters to create a new 'PUB' model for the smaller sub-regions represented by the clusters. The leave-one-out scheme was used so that each sub-regional PUB model was trained on N-1 catchments and tested on the remaining catchment; the 'PUB' model was then applied to either the north (17 catchments in "dry and cold" climate) or the south (18, "wet and warm") catchments. To examine the effect of sample size on regionalization performance (e.g. Gong et al., 2021), the PUB experiment was iterated for a different number of catchments in training, ranging from 18 to 30 based on fold numbers from 2 to 10 (Pub2 to Pub6, Table S4). As a reference for the quality of the ED-LSTM model predictions in the PUB mode, we used simulations for the 35 catchments by the PRMS hydrological model presented in Ragettli et al. (2017). Note that Ragettli et al. (2017) used two CART methods to emulate parameter regionalization in ungauged catchments. However, we used only the results of their classification tree as our reference, as the other CART method resulted in a very similar model performance.
The training strategy of the ED-LSTM model was as follows. In the first step, we determined the hyperparameters (e.g. learning rate, batch size, cell numbers) based on a grid search. The hydrometeorological dataset was divided into training, validation, and testing sets (50 %, 25 %, and 25 %, respectively) using the local model. Afterward, the dataset was split into training and testing sets for training the local and regional experiments (75 % and 25 %, respectively). For the PUB models, all data in 'gauged' catchments were used for training while the events in 'ungauged' catchments were used for testing. All ED-LSTM models had 256 memory cells in both the encoder and decoder layer, with a dropout rate of 0.4 based on the results of the hyperparameter grid search. There were 128, 64, 32, 16, and 1 cells in the five dense layers after the LSTM layer, and the optimal batch size was 32.

Evaluation metrics
The evaluation of the model performance aimed to (i) assess the capacity of the model to reproduce an overall streamflow fit at the event scale; and (ii) evaluate its ability to correctly identify streamflow extremes, i.e. the peak flow which is important for flood warning. The Nash-Sutcliffe efficiency (NSE, Nash and Sutcliffe 1970) metric was used to assess the overall streamflow fit:

$$\mathrm{NSE} = 1 - \frac{\sum_{t}\left(obs_{t} - sim_{t}\right)^{2}}{\sum_{t}\left(obs_{t} - obs_{m}\right)^{2}} \tag{7}$$

where sim and obs are the predicted and observed streamflow, t indicates a given time step and m refers to the mean. NSE ranges from −∞ to 1, with 1 being a perfect match. NSE values larger than 0.5 can be considered as a satisfying prediction capacity, while a value of 0 signifies that the prediction is only as good as the mean of the observations (Moriasi et al., 2007). Flood frequency analysis was used for the quantification of flood warning performance. The cumulative distribution function of the Generalized Extreme Value distribution was applied to estimate the return periods of the observed and simulated hourly streamflow peaks (see Ragettli et al. 2017, for example). For each storm event, we determined if the maximum hourly streamflow exceeded a reference flood quantile of a given return period. We consider the 2-year return period to represent common high streamflow and the 10-year return period to represent a severe flood.
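For readers who want to reproduce the overall-fit metric, a minimal sketch of the NSE follows. It uses the standard Nash-Sutcliffe definition (one minus the ratio of the squared simulation errors to the squared deviations of the observations from their mean); the function name and the plain-list inputs are illustrative assumptions.

```python
def nse(sim, obs):
    """Nash-Sutcliffe efficiency of simulated vs. observed streamflow."""
    if len(sim) != len(obs) or len(obs) == 0:
        raise ValueError("sim and obs must be non-empty and of equal length")
    obs_mean = sum(obs) / len(obs)
    # Sum of squared errors between observation and simulation.
    residual = sum((o - s) ** 2 for s, o in zip(sim, obs))
    # Sum of squared deviations of the observations from their mean.
    variance = sum((o - obs_mean) ** 2 for o in obs)
    if variance == 0:
        raise ValueError("NSE is undefined for constant observations")
    return 1.0 - residual / variance
```

A simulation equal to the observations gives 1, and a simulation that always predicts the observed mean gives 0, matching the interpretation given in the text.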
We identified three cases for flood prediction performance (Table S5). Moreover, to assess the temporal accuracy of reproducing the peak flow, a 2-hour condition was added to the evaluation, which means that if the simulated peak flow had a 2-hour shift compared to the observed peak, the prediction was also identified as a miss. Three contingency scores were computed for evaluating the flood warning ability: (i) the Probability of Detection (POD, Eq. (8)), which is the fraction of correct event predictions (hits) in all observed high flow events; (ii) the Success Rate (SR, Eq. (9)), which is the fraction of hits in the total number of all high flow event predictions; and (iii) the Critical Success Index (CSI, Eq. (10)), which is the fraction of hits in the total number of event predictions plus the number of missed observations.
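The three contingency scores can be sketched from event counts as below. The verbal definitions come from the text; the label "false alarm" for a predicted exceedance that was not observed, the function names, and the `classify_event` helper are assumptions made for illustration. The 2-hour timing condition is deliberately left out of this sketch.

```python
def classify_event(obs_peak, sim_peak, threshold):
    """Label one storm event against a flood quantile (timing check omitted).

    Returns 'hit', 'miss', 'false_alarm', or None when neither peak exceeds
    the threshold. The label names are illustrative assumptions.
    """
    obs_exceeds = obs_peak > threshold
    sim_exceeds = sim_peak > threshold
    if obs_exceeds and sim_exceeds:
        return "hit"
    if obs_exceeds:
        return "miss"
    if sim_exceeds:
        return "false_alarm"
    return None


def contingency_scores(hits, misses, false_alarms):
    """Return (POD, SR, CSI) from event counts."""
    pod = hits / (hits + misses)                 # hits among observed high flows
    sr = hits / (hits + false_alarms)            # hits among predicted high flows
    csi = hits / (hits + misses + false_alarms)  # hits among all flagged events
    return pod, sr, csi
```

For example, 8 hits, 2 misses, and 2 false alarms give a POD and SR of 0.8 each and a CSI of 8/12; the CSI can never exceed either of the other two scores because its denominator includes both error types.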

Evaluation of the model performance
First, we compared the NSE values between the observed and predicted streamflow in the three experiments (Fig. 4). We qualitatively divided the NSE values into three performance groups: poor (NSE ≤ 0), average (0 < NSE < 0.5), and above average (NSE ≥ 0.5). The training of local models resulted in 14 catchments with above average NSE values and 12 catchments with poor NSE values, while in the regional model, most catchments resulted in an above average performance and only three catchments had poor performance (Fig. 4a). Evaluating the performance of the PUB models in the north and south areas separately (Fig. 4b and c), it becomes apparent that the models have better prediction ability in southern catchments, with a considerably higher number of models with good performance (4 on average in the northern catchments in comparison to 12 on average in the southern ones). Compared to the results of Ragettli et al. (2017), the PUB model performed better than the CART-based method in the northern catchments, as 2 (1) catchments resulted in above average performance and 5 (9) catchments in poor performance for the PUB (CART-based) model (Fig. 4a and b). In southern catchments, the performance distribution of the two methods is nearly identical (Fig. 4c).
The results of the contingency scores, evaluating the flood warning performance, are presented in Table 1. For the 2-year return period events, the regional and the PUB models had the best performances (equally) for detecting streamflow peaks (POD values of 0.85, SR values of 0.82, and CSI values of 0.72, considering all catchments). For prediction in ungauged catchments, PUB models outperform the CART-based regionalization methods by 12 % on the probability of detection considering all catchments. Similarly, PUB models have a better success rate and critical success index in comparison to the CART-based regionalization. For the 10-year return period events, all models' contingency scores decreased by at least 10 %. In line with the 2-year return period predictions, the regional and PUB models showed the best performance (POD of 0.7, considering all catchments). Again, the performance of the PUB models was 10 % higher than that of the CART-based methods. For the SR and CSI scores, the PUB model alone achieved the best performance, rather than being equal to the regional model. Results indicate that the model detection performances were affected by the climate, as better prediction abilities were observed in wetter climate (i.e. the southern catchments) than in the drier regions (i.e. the northern catchments). The difference between the southern and northern catchments grows with the magnitude of the predicted streamflow, from about 8 % for the 2-year return period to ~30 % for the 10-year return period (the exception is the PUB model at the 10-year return period).

Fig. 5 shows the overall performance of PUB models that were trained on different sub-regions. The homogeneous (similar climate) dataset did not improve but rather impaired the PUB model performance (Fig. 5a): while the median NSE value of the model trained on southern catchments was similar to the result trained on the global dataset (around 0.5), for the northern sub-region the model performance was significantly worse than the result when trained on all catchments. This is even more evident when the models are trained based on the climate clusters (Fig. 5b), where the overall performance decreased significantly compared to the median NSE when the models are trained on the entire dataset. For clusters 1 and 3, the median dropped significantly from around 0.35 to −0.1 and −0.7, respectively. In addition, the variation of the models' NSE skill (i.e. the box plots) increases in all models trained on the cluster-based (homogeneous) datasets in comparison to the training with the entire region datasets, with negative lower quartile and lower whisker NSE values reaching −1 for clusters 1, 3, and 4.

Generalization ability to different training dataset
The model generalization ability to the number of catchments used for training is presented in Fig. S4. The median NSE of PUB models did not vary notably for models trained on 32 to 24 catchments, but when the number of catchments used in the training dataset was below 18, the median NSE declined from 0.3 to only 0.15 (Fig. S4a), and this declining trend continues for even smaller numbers of catchments. In contrast to the NSE results, the POD scores of the PUB models do not degrade with the decreasing number of training catchments, and all PUB models demonstrate good flood warning capability with median POD scores higher than 0.8 (Fig. S4b).
However, model performance at an individual catchment may not always be improved when trained on a large dataset. As shown in Fig. 6, four performance categories can be distinguished: (i) performance is similar for all models (e.g. Yimen catchment); (ii) poor performance in the local model but satisfactory in others (e.g. Shangliu catchment); (iii) poor performance in the PUB model but satisfactory in others (e.g. Pei River); and (iv) random performance variation with model setup (e.g. Qigu catchment). The largest group of catchments (12) falls into category (ii), followed by 10 catchments in category (i). Still, 8 catchments fall into category (iii) and 5 into category (iv), with a poorer performance even though they were trained using a larger dataset.

Fig. 4. Summary of the model performance classified by NSE values (poor, NSE < 0; average, 0 < NSE < 0.5; and above average, NSE greater than 0.5). The NSE values for the classification tree method ('CART') are from the hydrological model in Ragettli et al. (2017).

Table 1
Contingency scores (POD: probability of detection; SR: success rate; CSI: critical success index) evaluating high streamflow predictability (A: all catchments; S: southern catchments; N: northern catchments). Bold numbers represent the highest contingency score for each area.

Generalization ability to catchment characteristics
The classification of the 35 catchments into five classes based on the k-means clustering method is presented in Fig. 7a. Of the northern catchments, 11 (81 %) were grouped into cluster 5. The southern catchments were classified into clusters 2 to 4; catchments from the southwest were mostly clustered in cluster 2, while catchments from the southeast region were grouped in cluster 4. The catchments in cluster 1 and cluster 3 are mainly located in central China in Henan province.
The five clusters represent different climatological and hydrological conditions (Table 2). Cluster 5 climate is arid to semi-arid, with a ratio of annual potential evapotranspiration to precipitation (PET/P) lower than 1. Moreover, the catchments in cluster 5 are mainly located in high mountain areas in northern latitudes (Fig. 7a), and, thus, are also colder on average compared to catchments in other clusters. The catchments in cluster 3 are located in lower, flatter, and warmer areas. The topography attributes of cluster 4 are similar to cluster 3. However, the latter is more humid. The catchments in cluster 2 are located in high-elevation, warm, and humid areas in southwest China (Fig. 7a). The climatic and topographical characteristics of the catchments in cluster 1 are intermediate among the clusters, and the three catchments associated with this cluster are found in central China (Fig. 7a).
The models are sensitive to the catchment characteristics as the model performances vary between clusters. The NSE scores of the clustered catchments in the PUB model in clusters 2 and 4 (both above 0.5) are the highest among the five clusters, while cluster 5 has the lowest median NSE with 0.23 (Table 2 and Fig. 7b). A declining trend in NSE values is also observed from south to north (Fig. 7b). The POD scores, however, show no remarkable differences in flood warning capability between clusters, with the highest value of 0.91 in cluster 4 and the lowest values of 0.76 in clusters 2 and 5 (Table 2).
A negative correlation was found between PET/P (used as climate proxy) and NSE (Fig. 8a), indicating that the DL model is more likely to perform well in wetter areas. However, such a relationship cannot be observed between POD scores and PET/P (Fig. 8b). No correlations of either NSE or POD were found with topographic attributes. For example, neither NSE nor POD shows a clear correlation with the h-gradient (used as a proxy for topographic steepness, Fig. 8c and d).

Use of the ED-LSTM in the PUB context
The PUB model using the regional ED-LSTM structure showed a higher skill in predicting high streamflow than the conceptual hydrological model (Fig. 4 and Table 1). Even in catchments with poor prediction performance (i.e. NSE < 0.5), where the streamflow is not perfectly simulated by the ED-LSTM model, decent performances for flood warning were obtained (i.e. high POD values). It is further evident in the results that the PUB model is not sensitive to the number of training catchments and always keeps a good flood warning skill (Fig. S5). This implies that while the deep-learning models do not always learn the streamflow dynamic (i.e. the time series as a whole) well, they can still extrapolate the extreme streamflow events; this is further seen when plotting the correlation between NSE and POD (Fig. S2). It also implies that there is a high similarity in the intense observed rainfall between the catchments, which triggers flood peaks of similar magnitude. There may be a potential to use deep learning for flood prediction in ungauged catchments, even if the number of gauged catchments is small. In other words, the ED-LSTM appears to be a reliable flood warning model in ungauged catchments since it does not require successfully capturing the entire event hydrograph and the timing of the peak but rather solely forecasting the magnitude of the peak.
Given a similar size of training data, the prediction ability of an LSTM model is often much better than that of physics-based models (Kratzert et al., 2018; Frame et al., 2022). While the physically-based regionalization method selects and transfers the optimal parameter set from a gauged catchment to an ungauged catchment (Yang et al., 2018), more flexibility is found with the ED-LSTM PUB approach, as the model adapts to different dynamic patterns rather than being limited to a fixed set of parameters obtained from a donor catchment. This adaptation is implicit in the proposed approach and does not require the identification of any criteria for selecting donor catchments, resulting in a potentially better performance of the ED-LSTM in PUB predictions. For example, here we obtained POD values of above 0.8 and 0.7 for the prediction of high streamflow at 2- and 10-year return periods (Table 1), while previous studies (using various physically-based methods but for different locations, climates, and time scales) obtained values lower than 0.48 (2-year return period, France, Javelle et al., 2016), 0.61 (2-year return period, 6 catchments in France and Italy, Norbiato et al., 2008), and 0.38 (10-year return period, Pakistan, Kim et al., 2018). We conclude that the proposed ED-LSTM PUB model can be considered an applicable tool for issuing fast and reasonably accurate flood warnings in ungauged catchments.

Model sensitivity to the training dataset
We found that the overall PUB model performance is better when training on a larger and more climate-heterogeneous dataset rather than on a smaller and regionally focused (i.e. climate-homogeneous) dataset, as implied by the slight performance decrease of the south-north based PUB models (Fig. 5a) and the meaningful performance decline of the cluster-based PUB models (Fig. 5b and Fig. S4). Since the characteristics of the catchments in the different clusters vary significantly, the models trained solely on the data from the catchments of a single cluster are not able to generalize well when applied to catchments from clusters with other characteristics. This behavior is expected and is in line with the hypothesis proposed by Fang et al. (2022) that in the context of PUB, the DL models can still benefit from the data synergy effect provided by the modest diversity in the training data. A more heterogeneous dataset may increase the probability of covering the relevant conditions for a new catchment, because it represents a wider range of rainfall-runoff processes. The sensitivity of the ED-LSTM streamflow simulation to the number of training events (or catchments) is non-linear, but NSE scores are positively correlated with it, as shown in Fig. 5 and Fig. S4. The decrease in the ED-LSTM NSE performance with decreasing number of catchments used for the training (Fig. S4) can potentially be explained by the fact that catchments that are not well represented by the training dataset are more sensitive to the change in training dataset size. This is evident in the results, as the upper quartile (the well-represented catchments) in PUB1, PUB2, and PUB3 remains the same, while the lower quartile drops significantly (Fig. S4).
The results suggest that both the quantity and the diversity of the training dataset for DL models are important for improving PUB model performance. We stress that future applications of machine learning models in the PUB context should ensure a representative dataset with a sufficiently large number of training samples and catchments to properly include the impacts emerging from catchment characteristics on overall PUB model performance.
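The two kinds of training setup compared above can be sketched as catchment-level splits. This is a minimal illustration under stated assumptions, not the experimental design of this study: the function names, the catchment identifiers, the cluster labels, and the number of folds are all hypothetical. Only the total of 35 catchments is taken from the text.

```python
def pub_folds(catchment_ids, n_folds):
    """Yield (train_ids, test_ids) pairs in which the test catchments are
    treated as pseudo-ungauged and never appear in the training set."""
    ids = list(catchment_ids)
    for k in range(n_folds):
        test_ids = ids[k::n_folds]
        held_out = set(test_ids)
        train_ids = [c for c in ids if c not in held_out]
        yield train_ids, test_ids

def single_cluster_split(cluster_of, train_cluster):
    """Train on one cluster only and evaluate on catchments of all other clusters."""
    train_ids = [c for c, k in cluster_of.items() if k == train_cluster]
    test_ids = [c for c, k in cluster_of.items() if k != train_cluster]
    return train_ids, test_ids

# Hypothetical identifiers for the 35 catchments.
catchment_ids = [f"C{i:02d}" for i in range(1, 36)]
folds = list(pub_folds(catchment_ids, n_folds=5))

# Hypothetical cluster labels for three catchments.
cluster_train, cluster_test = single_cluster_split({"C01": 1, "C02": 1, "C03": 2}, train_cluster=1)
```

The first split keeps the training set heterogeneous while holding out the evaluated catchments; the second restricts training to a climate-homogeneous subset, which is the setting in which the performance decline was observed.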
However, some of our results challenge the hypothesis that the performance of DL models for streamflow simulation is not compromised by additional information, even when the added catchments have different hydroclimatic conditions (Fang et al., 2022). The performances of several catchments, especially of those with sufficiently representative training data, also decreased (Fig. 6) from the local to the PUB model. A possible explanation is the lack of data on the pseudo-ungauged catchments when conducting out-of-sample prediction. This means that the hydrological responses in these catchments are unique and the enlarged dataset still fundamentally lacks critical inputs so that, in the PUB context, the prediction results were much poorer than those of the local model. Meanwhile, the performance decrease also happens between local and regional models due to the addition to the enlarged dataset of some poorly-performing catchments without sufficiently representative training data, which have irregular hydrological responses in comparison to the well-performing catchments. This even happens in a single catchment. For example, the flood hydrographs measured in the Xinghe catchment show strong non-stationarities: for similar rainfall intensities the resulting flood waves are very different, which ultimately results in a very low prediction skill even for a local model. The diverse hydrological behaviors, in this case, can impair the learning. This drop in performance of some catchments in the regionalization of LSTM models is also reported and discussed by Kratzert et al. (2019) and Hashemi et al. (2021). However, we did not quantify to what extent the heterogeneity may harm the model performance in this study. Such analyses are left for further research. Nevertheless, when adding new data, the drop in performance of the well-performing catchments is minor compared to the benefit of increasing the performance of the poorly-represented catchments.

Model sensitivity to the catchment characteristics
Previous studies have found that the prediction of streamflow in ungauged catchments tends to be better in humid areas for either physics-based or data-driven models (Ragettli et al., 2017; Kratzert et al., 2018; Feng et al., 2020; Lees et al., 2021) and we confirm this finding with our results (e.g. the NSE-PET/P relation, Fig. 8 and Table 1). There is a physical explanation for this finding: in dry regions, runoff generation is more likely to occur due to infiltration excess. Soil infiltration capacity has large spatial variability in the 35 catchments, so different soil saturation states can result in different streamflow magnitude and timing for the same rainfall intensity. This can explain the variability in streamflow responses within one catchment, for example, the Xinghe catchment described in section 5.2. Hence, it is likely that streamflow simulation in dry PUB areas can be improved if the machine learning model is trained to learn the interaction between rainfall and streamflow with a larger sample of different rainfall-streamflow-soil moisture conditions. Alternatively, deep-learning models can be improved to simulate infiltration-excess runoff by introducing physical laws, as discussed in the next section.
We found that the model performance is only sensitive to climatic variables and not to topographic variables (Fig. 8) when considering NSE as the model evaluation metric. This is in agreement with Addor et al. (2018) and Stein et al. (2021), who concluded that streamflow behavior across regions is most strongly influenced by climate attributes for flood prediction in ungauged catchments. Our findings, thus, support the need to incorporate more dynamic and static climatic variables that can increase the representativeness of the dataset when setting up regional machine-learning models for flood warnings.

Limitations and future development
The analyses we conducted are limited by the small number of events that were available for the model training and evaluation. For example, we used the 10-year return period as the representative of a high streamflow event but in 11 of the catchments, the number of events is not sufficient to estimate the 10-year return period with high accuracy. Another limitation is that the models were trained on event-based data. If continuous streamflow data are available and used instead, the model can potentially better learn the hydrological patterns, such as the streamflow seasonality and event-antecedent soil moisture conditions (which has the potential to improve predictions in dry climates, see the previous section). A complete observation time series covering 5 to 10 years would likely be sufficient to represent the non-stationarity behind the hydrological dynamics (O et al., 2020) but it was not available to us for this study. Extending the data for the training of the models will result in a better prediction, but will not change the main conclusions of this work, namely, the outperformance of DL models in comparison to a physics-based hydrological model in predicting floods in general and in the context of ungauged catchments in particular.
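Why a short record limits the accuracy of a 10-year estimate can be seen from an empirical return-period calculation. The sketch below uses the Weibull plotting position T = (n + 1) / m on annual peak discharges, where n is the number of years and m the rank of the peak. This is an assumed choice for illustration; the method used to derive the return periods in this study is not restated in this section, and the discharge values are hypothetical.

```python
import numpy as np

def empirical_return_periods(annual_peaks):
    """Return the peaks sorted in descending order together with their
    Weibull plotting-position return periods T = (n + 1) / rank, in years."""
    peaks = np.sort(np.asarray(annual_peaks, dtype=float))[::-1]
    ranks = np.arange(1, peaks.size + 1)
    return peaks, (peaks.size + 1) / ranks

# Hypothetical 6-year record of annual peak discharge (m^3/s).
peaks, periods = empirical_return_periods([120.0, 95.0, 210.0, 150.0, 80.0, 175.0])

longest_period = periods[0]  # (6 + 1) / 1 = 7 years for the largest observed peak
```

With six years of data the largest observed peak is assigned a return period of only seven years, so a 10-year discharge has to be extrapolated beyond the observations, which is where the uncertainty enters.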
The results of our experiments imply that it is essential to focus on developing model structures that are also adaptable to dry regions. This can be done by incorporating governing equations or physical constraints into "hydrological" machine learning models. An example is the Mass-Conserving LSTM (MC-LSTM), which builds the conservation law into the network architecture (Hoedt et al., 2021). Another alternative is the development of physically-informed hybrid models that embed the hydrological dynamics into the recurrent neural network architecture (e.g. Jiang et al., 2020). These types of models can better capture the interaction between soil-rainfall-evaporation and streamflow, and be more readily generalized beyond the regimes covered by the training data (Khandelwal et al., 2020). Future applications of DL models in the field of hydrology should be combined with such hydrological knowledge and physics.

Conclusions
We applied the Encoder-Decoder LSTM model to predict rainfall-runoff events and flood peaks in ungauged catchments. The model outperformed conventional hydrological-model regionalization methods. The most considerable improvement in the model predictive ability was observed in the poorly represented catchments. By evaluating the generalization ability, i.e. the applicability of the model across many catchments and conditions, we found that the performance of the ED-LSTM model was sensitive not only to the number of samples used for the training of the model but also to the representativeness (climate-heterogeneity level) of the dataset. Also, the DL regional model still suffers from issues of model adaptability: although the ED-LSTM model reliably predicts the occurrence of rare events also in arid regions, it is more likely to have a poor model performance for predicting streamflow in arid catchments than in humid catchments. Surprisingly, we discovered that the catchment topographic attributes, such as elevation and gradient, did not improve the model performance when added as static variables in the model setup. We conclude that, compared to conventional methods, the regional ED-LSTM model is a promising method for hydrological modeling in ungauged catchments, and our results could be an important reference for further studies of DL-based hydrological modeling that must assemble a representative training dataset from a rather limited amount of data.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability
The code for the Encoder-Decoder LSTM modeling is available through GitHub (https://github.com/yikuizh/edlstm_flood_prediction). Rainfall and runoff data from ground stations in China that were used for this study are not freely available for academic or commercial use (contact SR for further details).