Coupling Deep Learning and Physically Based Hydrological Models for Monthly Streamflow Predictions

This study proposes a new hybrid model for monthly streamflow predictions by coupling a physically based distributed hydrological model with a deep learning (DL) model. Specifically, a simplified hydrological model is first developed by optimally selecting grid cells from a distributed hydrological model according to their soil moisture characteristics. It is then driven by bias corrected general circulation model (GCM) predictions to generate soil moistures for the forecasting months. Finally, model‐simulated soil moisture along with other predictors from multiple sources are used as inputs of the DL model to predict future monthly streamflows. The proposed hybrid model, using the simplified Variable Infiltration Capacity (VIC) as the hydrological model and the combination of Convolutional Neural Network and Gated Recurrent Unit (CNN‐GRU) as the DL model, is applied to predict 1‐, 3‐, and 6‐month ahead reservoir inflows for the Danjiangkou Reservoir in China. The results show that the hybrid model consistently performs better than VIC and CNN‐GRU models with great improvement in Kling‐Gupta efficiency (KGE) values for lead times up to 6 months. Additional tests indicate that hybrid models based on CNN‐GRU outperform those based on LASSO, XGBoost, CNN, and GRU models. Moreover, compared with the distributed hydrological model, the hybrid model greatly reduces the computation burden of rolling prediction. It also saves decision‐makers the time and effort of trying different combinations of predictors, which is indispensable when building DL models. Overall, the new hybrid model demonstrates great potential for monthly streamflow prediction where training data are limited.


Introduction
Prediction of monthly streamflow is of great importance for water resources management and reservoir operation (Crochemore et al., 2016;Lin et al., 2006;Wang & Robertson, 2011;Wetterhall et al., 2015).Generally, there are two types of approaches used for monthly streamflow predictions: data-driven approaches which use hydrological and climatic variables as predictors in statistical or machine learning models, and process-based approaches which drive hydrological models using historical meteorological data or seasonal meteorological forecasts.
Even modest incremental improvements in seasonal forecasting skill can help reap substantial socioeconomic benefits (Hamlet et al., 2002;Yao & Georgakakos, 2001).Consequently, there is intense interest in improving the accuracy of the streamflow prediction systems.Over the last two decades, with the advances in computer sciences and information engineering, a particular focus has been potential adoption of machine learning (ML), which trains a system to perform complex tasks and uses harvested data and experience to gain higher accuracy (Jordan & Mitchell, 2015).Generally, the more data, the better the performance will be.From the point of view of science and services, ML technique opens a myriad of possibilities to enhance existing streamflow prediction systems, given the fact that more accurate, clearer and faster data have become available from various sources (e.g., data providers, forecasting centers, monitoring networks like the relatively new SNOtel LITE (SNOLITE) network, and high-resolution data sets like the NASA Airborne Snow Observatory (ASO) remotely sensed snow product).So far various predictor types have been tested in ML-based streamflow prediction methods, including gridded meteorological data sets (Wang, Rezaie-Balf, et al., 2021), satellite-based data (Kumar et al., 2021), climate indices (Ahmed et al., 2021), and climate model forecasts from the North American Multi-Model Ensemble (NMME) or Copernicus Climate Change Service (C3S) seasonal multi-system (Slater & Villarini, 2018;Xu, Chen, Zhang, Xiong, & Chen, 2022).
Yet it is worth noting that statistical regression models remain the most widely used data-driven techniques in seasonal streamflow forecasting, which usually serve as a fundamental component of most operational prediction systems throughout the world due to their high efficiency and low operation costs (Mendoza et al., 2017).For example, the United States (U.S.) Department of Agriculture Natural Resources Conservation Service (NRCS) and the California Department of Water Resources have adopted principal component regression (PCR) models for operational water supply forecasting (WSF) since 1990s (Garen, 1992;Hart & Gehrke, 1990;James et al., 2013;Pagano et al., 2004).However, the regression-based prediction system has known technical issues and needs to be upgraded and replaced (Fleming et al., 2021).In regions such as the western U.S. where streamflow is largely snowmelt-dominated, the forecasts of forthcoming seasonal river flow volumes can be regressed on winter-spring mountain snowpack measurements (Grantz et al., 2005).However, beyond the snowmelt period, a seasonal forecast is usually less accurate (Anghileri et al., 2016).Although the horizon of the forecasts can be extended by making use of climate indices or seasonal-scale dynamical climate model forecasts, the regression coefficients of these variables depend largely on the extensive experiences from hydrologists and engineers (Yang et al., 2017).Compared with statistical regression models, ML is capable of training itself to find patterns or make predictions without any human training or control (Feurer et al., 2015).It can even gain insight or automate decision-making in multiple cases where humans cannot (e.g., Shen, 2018).Recently, NRCS has developed a next-generation WSF prototype known as M4 (Multi-model machine-learning metasystem), which is mainly composed of one linear regression, one quantile regression, and four ML models (Fleming & Goodbody, 2019).The retrospective and real-time operational tests conducted by Fleming et al. (2021) demonstrate that M4 appears to have met the NRCS design criteria, confirming the feasibility of migrating ML into operational streamflow prediction systems.
Deep learning (DL) is a major branch of ML, performs well in automatically learning the abstract and high-level representations of mass data.Previous studies have acknowledged that DL is a powerful and superior tool to make the most use of the ample database so as to learn the complex mappings between potential predictors and monthly streamflow (Feng et al., 2020;Herbert et al., 2021;Liu et al., 2022;Sadler et al., 2022;Xiang & Demir, 2020;Zhang et al., 2018;Zhao et al., 2021;Zhu et al., 2020).However, the use of DL models for streamflow prediction is still controversial because the simulated results cannot always be reasonably interpreted due to the lack of consideration of runoff generation mechanisms (e.g., Bloschl et al., 2019;Jiang et al., 2020;Karpatne et al., 2017).Pure DL models can also hardly simulate hydrometeorological situations which do not find in the training period (Cai et al., 2022;Wang et al., 2020).In addition, although multiple data types, such as streamflow observations, local meteorological data, and large-scale climate indices have been used as predictors for DL models (e.g., Ahmed et al., 2021Ahmed et al., , 2022;;Herbert et al., 2021), the relationship extracted from historical hydrological and meteorological data may not hold for future conditions.This is because the means and extremes of precipitation, evapotranspiration, and rates of discharge of rivers are being altered to various extents and in different ways due to climate change and human activities (Ionita et al., 2008;Wu et al., 2013).
In contrast to data-driven models, the architecture of process-based models is built upon the prior knowledge of hydrological systems, and therefore has the ability to capture the catchment dynamics, as well as the mechanisms behind runoff process (Casadel et al., 2003;Ivanov et al., 2004;Thober et al., 2015;Yilmaz et al., 2008).Moreover, physically based distributed hydrological models such as Variable Infiltration Capacity (VIC) model can specifically take into account the spatial variability of precipitation, infiltration capacity, land cover, and topography over a watershed (Agboma et al., 2009;Maurer et al., 2001;Su et al., 2008;Wood et al., 1988), which means they can provide greater spatial details on current hydrologic variables compared with lumped models.In process-based models, meteorological forcings are available from multiple sources, including a single or an ensemble of forcing variables resampled from historical climate observations (e.g., Ensemble Streamflow Prediction (ESP) (Day, 1985)), dynamical seasonal forecasts (e.g., SEAS5 and CFSv2 (Johnson et al., 2019;Yuan et al., 2011)), or stochastic climate generators (e.g., Caraway et al., 2014).
ESP is one of the most commonly used operational streamflow prediction methods (Harrigan et al., 2018;van Dijk et al., 2013;Werner et al., 2004;Wood & Schaake, 2008;Yuan et al., 2016), adopted by diverse government agencies and other service-delivery organizations (SDOs), including the U.S. National Weather Service River Forecast Centers (RFCs), the U.S. Bureau of Reclamation, the Bonneville Power Administration, the British Columbia (BC) River Forecast Center, and BC Hydro (e.g., Druce, 2001;McEnery et al., 2005;Rosenberg et al., 2011;Weber et al., 2012;Wood & Lettenmaier, 2006), and has also been applied by NRCS on a limited basis.However, the ESP only accounts for uncertainty in meteorological inputs, while ignoring climate non-stationarity.To tackle this problem, numerous variations and extensions of the traditional ESP have been proposed.One typical approach is to generate weather inputs for ESP through a conditional resampling of historical weather sequences based on the state of climate indices (i.e., pre-ESP) (Beckers et al., 2016;Grantz et al., 2005;Hamlet & Lettenmaier, 1999).Another well-known strategy for incorporating climate information into ESP forecasts is called trace weighting (Mendoza et al., 2017;Werner et al., 2004).It is a post-ESP method in which forecasted flow traces are weighted based on the similarity between climate-related features of the forecast year and those of the meteorological year.More recently, climate forecasts from dynamical climate models have also been added to the forecaster's toolbox to take into account the climate variability for the future period.Many studies have examined the feasibility of using historical meteorological data or general circulation model (GCM) hindcasts to drive distributed hydrological models for monthly streamflow predictions (Greuell et al., 2018;Ma et al., 2018Ma et al., , 2019;;Mazrooei et al., 2021;Mo & Lettenmaier, 2014;Shrestha et al., 2015;Sikder et al., 2016;Sinha et al., 2014;Sinha & Sankarasubramanian, 2013;Thober et al., 2015;Wood et al., 2002;Yuan et al., 2013).In general, both historical climate conditions and GCMs are useful information for improving overall skill of the streamflow predictions.However, it is extremely difficult for a purely process-based model to leverage all available climate information.In addition, water flow with nonlinear dynamic interactions and time lags cannot be perfectly described by process-based models due to limited human knowledge of physical processes in the real world (Clark et al., 2015).Moreover, the use of distributed hydrological models for monthly streamflow predictions is usually computationally expensive, and high input data demand on a gridded basis especially for large watersheds with high resolutions, which could hinder their practical application (Gu et al., 2020;Mazrooei et al., 2021).
As mentioned above, both DL models and physically based hydrological models have their own merits and limitations.Therefore, this study aims to propose a hybrid model (as detailed in Section 3) to take the complementary strengths of these two approaches for an effective and efficient prediction of monthly streamflow.The organization of the rest of the paper is as follows: in Section 2 the study area and data are presented, followed by a description of the hybrid model in Section 3. Section 4 provides the results and discussion about the use of the hybrid model for 1-, 3-, and 6-month ahead streamflow predictions.Section 5 makes a brief conclusion.

Study Area
The proposed hybrid model was applied to predict monthly inflows for the Danjiangkou Reservoir (DJKR) (32°3 6′-33°48′N, 110°59′-111°49′E), which is the largest artificial freshwater lake in Asia and the water source for the middle route of the South-to-North Water Diversion Project (SNWDP) in China (Figure 1a).The SNWDP has been constructed to improve the allocation of water resources in China (Liu & Du, 1985;Liu & Zheng, 2002).The Middle Route Project (MRP) (http://www.csnwd.com.cn/gcjs/zxgc/),one of the three water diversion routes (eastern, middle, and western) under the SNWDP, was designed to transfer water from the DJKR on the Hanjiang River to the water-scarce North China Plain.The drainage basin is located in a mountainous region with a subtropical monsoon climate characterized by four distinct seasons: spring (March-May), summer (June-August), autumn (September-November), and winter (December-February).The drainage area above the DJKR is approximately 95,200 km 2 .The annual mean precipitation varies from 700 to 1,100 mm, with 70%-80% of the rainfall occurring in the wet season between May and October (Guo et al., 2009).Three hydrological stations including Shiquan, Ankang, and Baihe used in this study are distributed in the upper reaches of the mainstream of the DJKR basin.Figure 1 and Table 1 present the locations of the three hydrological stations and the DJKR as well as the physical and hydroclimatic characteristics of the corresponding basins.

Observed Data
Daily meteorological variables including precipitation, maximum and minimum air temperatures for the 1980-2010 period over 20 meteorological stations (see Figure 1a) were provided by the National Climate Center of the China Meteorological Administration.Daily streamflow data for the Shiquan and Ankang hydrological stations cover the 1980-2008 period.Daily streamflow data for the Baihe hydrological station and daily inflow to the DJKR cover the 1980-2010 period.The streamflow data were obtained from the National Hydrological Yearbook of the Yangtze River Basin.Daily data were used for hydrological model calibration and validation.In addition, daily meteorological and streamflow data were aggregated to monthly values, and monthly meteorological data for the DJKR basin were calculated by averaging all grid points within the basin and surround.The  (Cheng et al., 2019;Guo et al., 2020;Yang et al., 2021).

General Circulation Model (GCM) Hindcasts
Retrospective monthly updated GCM outputs, also known as GCM hindcasts, were obtained from the European Center for Medium-Range Weather Forecasts System 5 (ECMWF SEAS5) product, which is the current operational seasonal forecast system run by ECMWF.To verify the system and calibrate the forecasts, SEAS5 model uses a set of retrospective seasonal forecasts for past dates that can be compared to the historical records (Johnson et al., 2019).This set of hindcast simulations starts on the 1st day of each month for years 1981-2016.It is made publicly available with 1°× 1°spatial resolution on latitude and longitude, and with lead times ranging from 1 to 7 months.More information about the SEAS5 model can be found in Johnson et al. (2019).This study averaged monthly time series of precipitation, maximum and minimum air temperatures with lead times of 1-6 months for 25 members of SEAS5 over the 1983-2010 period.Previous studies (Christiansen, 2018;Rougier, 2016) have shown that multi-member ensemble means outperform single forecasts, so we only used ensemble means for subsequent research.Additionally, the use of ensemble means can solve the problem of inconsistency in the number of members between hindcasts and real-time forecasts.To verify the predictive performance of the hybrid model, GCM hindcasts were used as GCM predictions in this study.GCM data were regridded using the bilinear interpolation to match the resolution of the distributed hydrological model to drive it, and the areal mean values were used as GCM predictions for DL model construction.

Downscaling of GCM Hindcasts
Prior to downscaling, both observed and GCM data are first interpolated to hydrological model grids using the bilinear interpolation method.The LS method is then applied to correct the interpolated GCM data at the monthly scale.The LS method corrects the systematic bias of mean monthly precipitation and temperature for each specific month.The scaling factor (defined as bias) is calculated as the ratio (for precipitation) or difference (for temperature) between the observed and historical forecast values in the calibration period.The scaling factor obtained through calibration is then applied as a multiplicative factor to correct raw monthly precipitation forecasts while as an additive factor to correct raw monthly temperature forecasts.Detailed information about the LS method can be found in Chen et al. (2013).
The bias corrected monthly precipitation and temperature are then disaggregated to daily data by the nonparametric disaggregation model of Prairie et al. (2007).This model disaggregates monthly forecasts into daily time series by identifying similar monthly conditions in the historical period based on the K-NN method.The number of the nearest neighbors for each month is chosen based on the heuristic scheme K =
In this study, precipitation and temperature forecasts from SEAS5 grids (1°× 1°) over the DJKR basin were first interpolated to the grid cells required by the VIC model (1/12°× 1/12°).The LS method was then applied to correct the interpolated data.To begin with, the 1983-2000 and 2001-2010 periods were selected for calibration and validation, respectively.Figure S1 in Supporting Information S1 shows bias corrected results versus observed monthly mean precipitation averaged over Shiquan, Ankang, Baihe, and DJKR sub-basins.Table S1 in Supporting Information S1 shows the evaluation metrics for lead times of 1-6 months.Overall, the observed precipitation is well reproduced by the bias corrected data for most months.The KGE values are higher than 0.70 for all sub-basins and lead times.When generating daily forecasts for the period of 2001-2010, the nearest neighbors were obtained by computing the Euclidean distance between the current monthly forecast and the historical observations from 1983 to 2000.After that, the same downscaling procedure was carried out using the data from the 1993-2010 and 1983-1992 periods as the calibration and validation periods, respectively (result is not shown).

Simplified Hydrological Model
The simplified model is constructed within a specified period of time and can be directly used for any other time periods.The development of the simplified model is composed of three parts.(a) The distributed hydrological model is calibrated and validated based on observed meteorological data.(b) Observed gridded meteorological forcing data are used to estimate initial hydrologic conditions (IHCs) for hydrological model in part (a) prior to each predicted month, and then bias-corrected and temporally disaggregated precipitation and temperature forecasts are used to drive the hydrological model with updated initial hydrologic conditions.(c) The least absolute shrinkage and selection operator (LASSO), a feature selection method proposed by Tibshirani (1996) is employed to select grid cells from the distributed hydrological model, among which the series of monthly soil moisture forecasts in all grid cells are treated as covariate vectors and the monthly streamflow of the outlet hydrological station of the study watershed is treated as the response vector.The motivation to use soil moisture as a covariate is not only because this variable is affected by the complex and dynamic interactions among climate, topographic, soil and vegetation, but also because soil moisture is a continuous storage indication and the key variable in runoff generation process.The detailed description of the simplified model and how it was constructed in this study are presented in Sections 3.2.1-3.2.3 below.

Hydrological Model Calibration and Validation
In this study, the Variable Infiltration Capacity (VIC) model was selected as the representative of hydrological models.The VIC model is a distributed hydrological model for catchment scale up to continental scale applications, developed together by Washington University, University of California at Berkeley and Princeton University (Liang et al., 1994).The structure and parameters of the model are described in detail in Xie et al. (2007).Digital elevation model, soil, land-use, and hydrological data are needed for VIC model calibration and validation.The VIC model calculates the evapotranspiration and soil moisture for each grid and then separately generates surface runoff and subsurface flow data.Runoff generation and flow routing are also two separated processes in the VIC model.In the three-layer VIC model, the soil is divided into the top thin soil layer, which represents quick baer soil evaporation following small rainfall events, the middle soil layer, which represents the dynamic response of the soil to rainfall events, and the lower layer, which characterizes the seasonal soil moisture behavior (Liang et al., 1994).The VIC model was implemented with the spatial resolution of 1/ 12°× 1/12°in this study.Therefore, the DJKR basin was divided into 1,382 grids.In addition, the meteorological data from meteorological stations were interpolated into the corresponding spatial resolution at the daily time step using the inverse-distance weighting interpolation (IDW) method.Daily hydrometeorological data for the 1980-1986 period were used for model calibration with 1980 as the spin-up period, whereas data for the 1987-2000 period were used for validation.A detailed description of the calibration process of the VIC model can be found in the in Supporting Information S1 (see Text S1).The Kling-Gupta efficiency (KGE) (Gupta et al., 2009), coefficient of determination (R 2 ), and Percent bias (PBIAS) were used to evaluate the VIC performance.Of which, KGE ranges from ∞ to 1, while R 2 ranges from 0 to 1.The larger value is equivalent to the better model performance for these two metrics.The best value for PBIAS is zero.The values of the three quality measures for the three hydrological stations and DJKR are listed in Table S4 in Supporting Information S1.

Rolling Prediction
Prior to selecting representative grid cells of the VIC model, precipitation and temperature forecasts with different lead times were used to drive the original VIC model to perform rolling predictions in the training period (see Figure S3a in Supporting Information S1).For example, after updating IHCs at the end of December 1982, the hydrological model was driven with precipitation and temperature forecasts with lead times of 1-6 months, which were issued at the beginning of January 1983.Model-simulated soil moisture from January to June in 1983 with lead times of 1-6 months were then obtained.For lead times of 1, 3, and 6 months, daily time series of soil moisture forecasts in the lower layer of 1,382 grid cells from January 1983 to December 2000, March 1983to February 2001, and June 1983to May 2001 were aggregated into the monthly values, respectively.These data series in the training period were used to select grid cells (i.e., develop the simplified model).The same rolling prediction procedure was carried out with the data from the last 18 years (1993)(1994)(1995)(1996)(1997)(1998)(1999)(2000)(2001)(2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010) as the training period and the first 10 years (1983)(1984)(1985)(1986)(1987)(1988)(1989)(1990)(1991)(1992) as the testing period.

Grid Cells Selection
The main water balance terms in VIC include precipitation, evaporation, soil moisture content of each soil layer, surface runoff, and baseflow (Lohmann et al., 1996).Soil moisture is one of the most important components, as it in the upper two soil layers determines the infiltration capacity and the volume of surface runoff, and it in the deepest soil layer determines the product of the baseflow (Liang et al., 1994).Figure S4 in Supporting Information S1 shows the correlations between the seven variables of the VIC model and the observed monthly inflow to the DJKR.We can find that the monthly inflow is well correlated (correlation more than 0.5 for over 80% grids) with all seven variables for current record.However, the correlations are much weaker when variables are generated using forecast data rather than observed data.Nonetheless, the monthly inflow is strongly correlated with the lower layer soil moisture in over 20% grids (correlation more than 0.6) for lead times up to 6 months.In addition, we found that baseflow is highly correlated with soil moisture in the lower layer (correlation more than 0.99 and 0.90 for over 70% and 90% grids, respectively), whether generated using observed data or forecast data.Therefore, we only used soil moisture in the lower layer as an indicator to represent the reservoir inflow pattern.Furthermore, soil moisture in the lower layer is considered to represent seasonal soil moisture behavior in the VIC model and reflect the degree of soil profile saturation, and the yield from a rainfall event on a wetted catchment is higher than on a drier catchment (Nilsson et al., 2006).LASSO was then employed for each lead time to select grid cells using monthly soil moisture forecasts in the lower layer of all grids as covariate vectors and monthly inflow as the response vector.The simplified VIC model was finally constructed with the selected grid cells, which largely reduces the number of grid cells to be calculated and is therefore computation-efficient.The simplified models can be directly used to generate real-time soil moisture forecasts in the testing periods (i.e., in or after 2001) (Figure S3a in Supporting Information S1).Specifically, the testing period for lead times of 1, 3, and 6 months is January 2001 to December 2010, March 2001 to December 2010, and June 2001 to December 2010, respectively.
Taking 1-month lead time as an example, the simplified VIC model is composed of 14 grid cells as shown in Figure S5 and Table S5 in Supporting Information S1. Figure S5 in Supporting Information S1 also shows the spatial maps of land use and soil in the DJKR basin.Loam and clay loam are the main soil types in topsoil and subsoil, respectively.Evergreen needleleaf, deciduous broadleaf, mixed cover, wood land and cropland are the vegetation types with the largest proportion in most cells.Seen from Table S5 in Supporting Information S1, most information about the vegetation and soil in the DJKR basin can be retained by the simplified VIC model.

Water Resources Research
10.1029/2023WR035618 XU ET AL.

Model Inputs
Like many other data-driven models, appropriate input variables (predictors) for the DL model need to be determined.In most studies, the time-lag effect of local hydrology (streamflow data), meteorological data, and large-scale climate data on the predicted monthly streamflow was taken into account (e.g., Apaydin et al., 2021;Fan et al., 2017;Ni, Wang, Singh, et al., 2020;Yang et al., 2017).Herein multistep observed data ahead of the targeted streamflow are considered as candidate predictors.In order to consider the non-stationary climate conditions, GCM predictions in the forecasting months are used as extra predictors.Furthermore, soil moisture forecasts derived from the selected grid cells of the simplified hydrological model are also incorporated for training and testing the DL model.
In this study, previous 6 months' streamflow, meteorological, and large-scale climate data prior to the forecasting months (see Figure S3b in Supporting Information S1), together with precipitation and temperature forecasts from SEAS5 and aggregated soil moistures generated by the simplified VIC model for the forecasting months (see Figure S3a in Supporting Information S1) were considered as candidate input variables to the DL model.These data sources can be randomly combined, and 31 combinations are tested in this study (Table 3).There are five combinations containing one type of data sources (A1-A5), 10 combinations containing two types of data sources (B1-B10), 10 combinations containing three types of data sources (C1-C10), five combinations containing four types of data sources (D1-D5), and one combination containing five types of data sources (E).The hybrid models include model-simulated soil moisture as predictors, that is, A5, B4, B7, B9-B10, C3, C5-C6, C8-C10, D1-D4, and E.
Competent variable selection is also an integral part of skillful prediction models (Chu et al., 2020;Das & Chanda, 2020;Ren et al., 2020).To discard irrelevant and redundant predictors, LASSO, an automatic and computation-efficient model, was used to select suitable predictors for our prediction models.LASSO is an extension of Ordinary Least Squares regression (OLS), which adds an L1 penalty to the residual sum of squares.The penalty parameter λ controls the amount of shrinkage.As λ increases, a certain number of estimated coefficients become zero, and thus their corresponding variables are removed from the model.In this study, the penalty parameter λ of the LASSO model was estimated by the 10-fold cross-validation method combined with the LassoLarsCV algorithm (Efron et al., 2004;Osborne et al., 2000).

DL Model Architecture
This section presents a brief description of the DL model proposed in this study, that is, a combination of Convolutional Neural Network and Gated Recurrent Unit (CNN-GRU).CNN is well suited for the task of automatic feature extracting while GRU can capture long-term temporal dependencies in highly nonlinear time series data (Gao et al., 2020;Ha & Choi, 2016).Stacking CNN and GRU layers can preserve the strengths of the respective architecture (Cirstea et al., 2018;Yu et al., 2021).The CNN and GRU models were also tested for comparison.The architecture of CNN, GRU and CNN-GRU models adopted in this study are shown in Tables S6, S7, and S8 in Supporting Information S1, respectively.They are relatively simple in form with a limited number of hidden layers.As the number of network layers increases, the feature abstraction and capabilities of DL increase gradually but the number of parameters required for training also increases and the need for sample data volume rises accordingly, which will raise the risk of over fitting for small sample data.Therefore, we propose an efficient way of extracting features so as not to increase the network parameters/computation significantly while extracting features.DL models are confronted with a very large choice of possible hyperparameter configurations.In our models, they are mainly the number of filters, kernel size, number of neurons in each layer, activation function, dropout rate, optimizer, learning rate, loss function, batch size, and epoch, as shown in Table S9 in Supporting Information S1.The learning rate value of 0.0001 came from extensive experiments conducted by a previous study of Ledig et al. (2017).Xiang et al. (2020) and Wang, Tian, et al. (2021) also found that this value worked well for their studies.The dropout rate of 0.1 is adopted as one of the most commonly used dropout settings (Géron, 2019).In this study, the models' performances were tested by varying four important hyperparameters.Specifically, we varied the number of filters (32, 64) and kernel size (3 × 3, 5 × 5) in the CNN layer, the number of hidden nodes in the GRU layer (64, 128), and batch size (8,16).So there exists 8, 4, and 16 combinations of hyperparameters for CNN, GRU, and CNN-GRU models, Water Resources Research 10.1029/2023WR035618 XU ET AL. respectively.Data normalization was executed as a data preprocessing step for each model in order to improve the convergence speed and prediction accuracy of the model (Zhang et al., 2015), so predicted values need to be inversely normalized before they were outputted.The required software for DL models used in this study are Python 3.6.13,Keras 1.0.8,TensorFlow 1.15.0,NumPy 1.19.2,SciPy 1.5.2, and Scikit-learn 0.24.2.

Benchmark Models
To evaluate the CNN-GRU based hybrid model performance, several benchmarks were used.They include eight basic extrapolative methods and two regression models.Detailed information about the basic extrapolative methods including persistence, average method, naive method, seasonal naive method and four moving average methods can be found in Text S3 in Supporting Information S1.Additional hybrid models were developed using the same data set described in Table 3 but different regression methods: LASSO and EXtreme gradient Boosting model (XGBoost).The LASSO model has already been introduced in the previous section.XGBoost is a variant of Gradient Boosting methods that consists of many decision trees (Chen et al., 2016).Each decision tree predicts the residual between all previous decision trees and the ground truth.The predicted values of all decision trees are then assembled together to estimate the target outcome (Nalenz & Villani, 2018).To avoid the overfitting problem, XGBoost adds regularizations to the objective function to control the complexity of the model.Previous studies have shown that using XGBoost has great potential in simulating and predicting streamflows (Ni, Wang, Wu, et al., 2020;Yu et al., 2020;Zhang et al., 2019).Three important hyperparameters in the XGBoost model were tested in this study, including learning rate, number of estimators (n_estimators), and maximum regression tree depth (max_depth).They were set as (0.1, 0.3), (100, 300, 500), and (3, 5, 6), respectively.Other hyperparameters including minimum number of samples in a child node (min_child_weight, 1), minimum loss reduction threshold (γ, 0), and percentage of attribute samples (colsample_bytree, 1) were set as default values, as displayed in parentheses.

Evaluation Framework
It is expected that the hybrid streamflow prediction model, integrating simulated output from a simplified distributed hydrological model into a data-driven model (i.e., CNN-GRU/CNN/GRU/LASSO/XGBoost), would exploit the strengths of the respective modeling approaches to provide the most accurate and reliable predictions.Therefore, to assess the performance of this hybrid model, a comparison of the performances of the original distributed hydrological model, the data-driven model, and the hybrid streamflow prediction model should be undertaken.The forecasting period for the original distributed hydrological model and the testing period for the data-driven and hybrid models were set to be identical.Besides, same training period was shared by data-driven and hybrid models.In this study, we did two different splits of the data to check the generalization ability of the prediction model with the limited samples.To begin with, the first 18 years (1983)(1984)(1985)(1986)(1987)(1988)(1989)(1990)(1991)(1992)(1993)(1994)(1995)(1996)(1997)(1998)(1999)(2000) were used as the training data set and the remaining years (2001)(2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010) were used as test data set.The training and testing periods for lead times of 1, 3, and 6 months are shown in Figure S3 in Supporting Information S1.After that, the first 10 years were held out for testing (1983)(1984)(1985)(1986)(1987)(1988)(1989)(1990)(1991)(1992), and the same model development procedure was carried out with the data from the last 18 years (1993)(1994)(1995)(1996)(1997)(1998)(1999)(2000)(2001)(2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010).The final evaluation was performed with the merged test data sets.Three metrics including the KGE, R 2 , and Willmott's Index of Agreement (WI) (Willmott, 1982) were used to evaluate the predictive performance of monthly streamflow predictions.KGE ranges from ∞ to 1, while R 2 and WI ranges from 0 to 1.The larger the three metrics, the better the model performance.

Comparison of Predictive Performance Among VIC, CNN-GRU, and Hybrid Models
We first compared the predictive performance among VIC, CNN-GRU, and hybrid streamflow prediction models.Figure 3 and Figures S6 and S7 in Supporting Information S1 show the KGE, R 2 , and WI values for monthly inflow predictions for DJKR with lead times of 1, 3, and 6 months, respectively.For 1-month lead time prediction, as shown in Figure 3, the predictive performance of different CNN-GRU models varies considerably, but none of them produces better predictions than the VIC model.Models using combinations B6 (a combination of local meteorological data and GCM predictions) and C2 (a combination of streamflow observations, local meteorological data, and GCM predictions) as predictors provide relatively better predictions in terms of the three evaluation metrics.In contrast, all hybrid models show a modest increase of 0.08-0.For 3-month-ahead prediction, all hybrid models give a very similar performance in terms of KGE and WI metrics, and generally outperform the VIC and CNN-GRU models (Figure S6 in Supporting Information S1).It is observed that the KGE and WI values for all hybrid models ranging between 0.53 and 0.54 and between 0.74 and 0.76, respectively.The choice of predictor combinations has much less influence on predictive performance than 1-month ahead prediction, but the A5-, B10-, and C9-driven predictions slightly outperform other predictor combinations.The large-scale climate indices are not good indicators of the catchment condition and their inclusion may counteract some of the benefits of using the model-simulated soil moisture.For 6-month-ahead prediction, hybrid models produce comparable or better predictive performance than the VIC model depending on the evaluation metrics.The results show that hybrid models generally perform better when including large-scale climate indices as predictors.Additionally, the hybrid model C10 shows an increase in KGE value of 0.23 when compared with the CNN-GRU model using large-scale climate indices and GCM predictions (B8) as predictors.In comparison with the VIC model, the KGE and WI values obtained by the hybrid model B9 are moderately increased (from 0.37 to 0.53 and from 0.68 to 0.74, respectively).However, the differences between these two models are negligible for R 2 values.The large-scale climate indices are indispensable predictors in improving the KGE value at the lead time of 6 months.The persistence in streamflow results from the delayed response in the rainfall-runoff process due to groundwater and soil water storage, giving streamflow a memory of conditions over several months.However, this memory in the streamflow series disappears relatively quickly and, over longer lags, large-scale climate indices are likely to be better explanatory variables for predicting monthly streamflows due to the longer memory.
Overall, model-simulated soil moisture provides a better representation of initial catchment conditions at the start of a forecasting period and climate situation during the forecasting period than any of the candidate predictors, although the relationship between model-simulated soil moisture and streamflow outputs is weakened with longer lead times as expected.In this study, the prediction accuracy is greatly improved by the hybrid model which uses model-simulated soil moisture as extra inputs.Moreover, there is no additional computation burden for the hybrid model in comparison with the CNN-GRU model.When compared with the VIC model, the proposed hybrid model not only produces better predictive performance but also largely alleviates the burden of storage and computation in the streamflow forecasting process.The original VIC model is driven with atmospheric forcing data from 1,382 grid cells which need to be updated prior to each predicted month.The surface runoff and subsurface flow obtained from each grid cell are then routed to predict the streamflows for each lead time.The rolling predictions take hundreds of runs of the VIC model, which are computationally inefficient.In contrast, the simplified VIC model does not include the flow routing process and largely reduces the number of grid cells.Taking 1-month lead time as an example, when simplified model is developed using the training period of 1983-2000, it only consists of 14 grid cells and thus saves more than 99% computation time than the original VIC model during the testing period of 2001-2010.Moreover, we further assessed whether the hybrid model improves the prediction accuracy for the three hydrological stations upstream of DJKR (Shiquan, Ankang, and Baihe), and similar conclusion can be drawn as in the DJKR (result is not shown).It is worth noting that the grid cells selection process was conducted separately for each basin.
The predicted monthly hydrographs using VIC, CNN-GRU, and hybrid models are shown in Figure 4 (1-month ahead prediction).Generally, the VIC model predicted hydrographs exhibit a good overall match to observed hydrographs.However, the CNN-GRU model always underestimates the high flows.For example, the high flows around 2003, 2005, and 2007 are consistently underestimated (see Figure 4a).In comparison, the hybrid model not only better captures two large peak flows that happened around 2003 and 2005, but also generates most accurate predictions for the complex graph in 2006.This suggests that the hybrid model has good intelligence for finding complex patterns in data and exploiting them to infer unobserved events from a given training data set consisting of normal events.Similarly, the peak flows during 1983-1985 for the DJKR can be better simulated by the hybrid model than that from CNN-GRU and VIC (Figure 4b).
In this study, the VIC model exhibits a remarkably better performance than CNN-GRU models for all lead times.This is because the inflow in DJKR is mainly contributed by short-duration and high-intensity rainfall events, and the connection between surface water and groundwater is weak due to the relief amplitudes and slopes of the basin, leading to low auto-correlations of monthly streamflow series.This makes it challenging for CNN-GRU models to efficiently learn the relationship between monthly streamflow and potential predictors (Xu, Chen, & Zhang, 2022).Besides, it is difficult for CNN-GRU models to predict the peak flows accurately when training samples are insufficient, especially when training period length is ≤25 years (Xu, Chen, & Zhang, 2022).
The application of DL models to predict streamflows introduces a significant challenge, since the uncertainty in the predicted outcomes arises from ambiguities in both the model structure and the determination of model hyperparameters.As shown in Figure 3, predictive uncertainty caused by randomly initialized weight matrices and different hyperparameters combinations exists for both hybrid and CNN-GRU models.While training DL models, the connections between neurons and layers of neurons, which are also referred to as model weights are adjusted to find an optimal set of values that is most suitable to solve the underlying task.Therefore, the random initialization strategy followed in this study has a certain impact on model performance.Besides, predictive uncertainty for both hybrid and CNN-GRU models can also be attributed to the lack of data, considering the amount of monthly streamflow data used for training models are very limited in the study watershed.Therefore, instead of training the DL starting from scratch with random initialization, transfer learning (i.e., using a predefined setting of weights that stem from a previously trained neural network) could be used in future studies, which has been proven to be a meaningful starting point for the training process (Bortsova et al., 2021).Moreover, the introduction of the batch normalization layer can allow us to use much higher learning rates and be less influenced by weights initialization prior to training a neural network (Ioffe & Szegedy, 2015).
To assess the prediction bounds, four indices including Containing Ratio (CR), Average Relative Bandwidth (RB), Average Symmetry Degree (S), and Average Relative Deviation Amplitude (RD) were calculated (Xiong et al., 2009).The results are listed in Table S10 in Supporting Information S1 for both the hybrid and DL models.Generally, within the 6-month lead time, the hybrid models exhibit better or comparable performance in terms of S, RD, and CR than the DL models.This means that the prediction bounds of hybrid models tend to be more symmetrical, aligning more closely with the hydrograph of observed streamflows.

Comparison of Predictive Performance of Hybrid Models Based on Different ML Methods
The illustrative case of inflow predictions for DJKR demonstrates that the hybrid model based on the CNN-GRU model has enhanced prediction accuracy for lead times up to 6 months.For 1-month, 3-month, and 6-month ahead predictions, the best performing hybrid models are B7, B10, and B9, respectively.We further compared the predictive performance among the hybrid models based on CNN, GRU, and CNN-GRU models (see Figure 5).Results shown in boxplots are calculated from all combinations of hyperparameters, so there are 8, 4, and 16 dots for CNN, GRU, and CNN-GRU models, respectively.Results show that the CNN-GRU model has the largest R 2 and WI values for all lead times.However, CNN and CNN-GRU models show similar performance in KGE values for lead times of 3 and 6 months, while the performance of GRU model is much inferior to that of the other two models, especially for the KGE metric.Similar conclusions can be drawn for other 15 predictor combinations for all lead times (see Text S4 in Supporting Information S1).
Water Resources Research 10.1029/2023WR035618 In order to provide a more comprehensive understanding of the model's performance, we examined how forecast skill varies across different seasons and lead times.As shown in Figure 6, the VIC model outperforms both DL models and hybrid models in winter, especially for 1-month and 3-month lead times.DL models perform much better in winter and spring compared to summer and fall, especially for the 1-month lead time.The KGE values decrease from approximately 0.40 to around 0.20.In contrast, the hybrid models maintain stable and favorable predictive performance across all four seasons.Furthermore, hybrid models exhibit superior predictive performance compared to VIC and DL models in summer and fall.For the 1-month lead time, the CNN-GRU based hybrid model surpasses CNN-GRU and VIC models, with improvements of 0.50 and 0.10 for summer and 0.33 and 0.09 for fall in KGE values.
In addition, using the same input described in Table 3, LASSO and XGBoost were also tested for the DJKR (see Figure 7).For 1-month lead time prediction, three models perform similarly in terms of KGE metric, while XGBoost performs slightly better than the other two models (Figure 7a1).However, in terms of R 2 and WI  7c3).KGE, R 2 , values for all the models mentioned above can be found in Tables S11-S52 in Supporting Information S1 and WI for CNN-GRU models increase by 0.02/0.06,0.02/0.08,and 0.01/0.05compared to LASSO and XGBoost, respectively.

Post-Hoc Interpretability for the Best-Performed Hybrid Model
In order to develop the best-performed CNN-GRU based hybrid model, various input combinations have been tested.It turned out that B7-and B9-driven predictions outperform other predictor combinations for lead times of 1 and 6 months, respectively.The expected gradients (EG) (Erion et al., 2021), an additive feature attribution method was used to derive explanations of the models' individual predictions (i.e., outputs).In the EG method, the importance score to each feature i for a specific input is expressed as Φ i

EG
. Its calculation procedure was performed with the SHapley Additive exPlanations (SHAP) package (Lundberg & Lee, 2017)  For 1-month-ahead prediction (see Figure 8a), the most effective meteorological feature for January and July is 2 month-lag evaporation, while in April and October, 4 month-lag evaporation is the most effective.Soil moisture generated by most of the grid cells contribute more to model's predictions than past evaporation and precipitation.The feature importance is also more balanced among soil moisture variables than meteorological variables.Four different months shown in Figure 8a illustrate three different behaviors of soil moisture variables: one on the right side (positive), one on the left side (negative) and two balance between the negative and positive sides.
According to Figure 8b, the climate index with the most extensive influence on the model's prediction is Nino 4 and Multivariate ENSO Index (MEI).
The MEI is based on six observed variables over the tropical Pacific Ocean: sea surface temperature, surface air temperature, sea level pressure, zonal component of surface wind, meridional component of surface wind and cloudiness (see Wolter, 1987).Its calculation process can refer to Wolter and Timlin (2011).In this study, it turns out that the MEI is more important than SOI.However, the lag correlation between SOI and MEI is relatively high, so perhaps a longer period of lagging information may affect the importance of these two indicators in the prediction model.As shown in Figure 8b, Nino 4 and Nino 1 + 2 are both important features for 6-month-ahead monthly streamflow prediction in the DJKR basin, because sea-surface temperature (SST) variability in the tropical central-eastern Pacific has been recognized as a major factor of the variations of the East Asian monsoon (Yuan & Yang, 2012), which is linked to an anomalous rainfall pattern over the middle and lower reaches of the Yangtze River basin and may be a very important index to predict the heavy monsoon rainfall at various lead times (Zhao et al., 2020;Zhu & Li, 2016).The soil moisture variables feature importance is more prominent in January and July.The importance score pattern in January and April is similar, while the pattern in July and October is similar.The sum of the Φ i EG values of four soil moisture variables is the determining variable throughout all months, far exceeding the other predictors.

Limitations and Future Work
The approach that treats both current and previous months' information as predictors is similar to the one employed by Hejazi and Cai (2011) and Yang et al. (2017).The longest memory time of historical climate data used in this study is 6 months, which has been extended compared with previous studies.Since the choice of lag times has a moderate impact on the predictive performance (Cheng et al., 2020), taking longer lag times of historical meteorological data and large-scale climate indices into consideration may produce better prediction accuracy for hybrid models.In addition, data splitting is an important consideration during ML development.A wide variety of cross-validation methods are available in the literature, including, but not limited to, k-fold crossvalidation, stratified k-fold cross-validation, leave-p-out cross-validation, and leave-one-out cross-validation (May et al., 2010;Rodriguez et al., 2010;Vehtari et al., 2017).An appropriate cross-validation method will produce error estimates that accurately reflect the level of skill expected on future data (Elsner & Schmertmann, 1994).The choice of a cross-validation method depends on data size, computing resources, application scenarios, etc. Investigation on cross-validation methods can be an avenue for future work.Due to changes in the generation process of climate model forecasts and hindcasts (e.g., model physics, assimilation algorithm and observational data used, etc.), the real-time forecasts may not be statistically consistent with the hindcasts.However, the inconsistency problem can be partly alleviated with the selection of a trustworthy climate model.For example, SEAS5 used in our study has taken action to enhance the consistency between hindcast and forecast initialization.This is because SEAS5 uses a more recent period (i.e., 1993-2016) for the calculation of forecast anomalies to avoid the long-term trend of climate change from overly affecting the forecast products, and it also makes a few adjustments to model initialization.In the atmosphere component, the prognostic ozone scheme in both hindcasts and forecasts is initialized with a seasonally varying climatology produced by the ozone model (Monge-Sanz et al., 2011).In the land component, a limiter is used to prevent the real-time land surface values taking inconsistent values relative to those used in the hindcasts.More information can be found in the SEAS5 user guide2.In the ocean and sea-ice components, SEAS5 obtains initial conditions for both forecasts and hindcasts from the OCEAN5 system (Zuo et al., 2019).OCEAN5 uses updated data assimilation and observational data sets to create both a historical reanalysis (Ocean Reanalysis System, version 5; ORAS5) and a real-time analysis (OCEAN5-RT).Moreover, SEAS5 generates ensemble members using an ensemble data assimilation (EDA) and the singular vector (SV) initial condition perturbation methods.The EDA perturbations are only available for the later years in the hindcast, so that consistency can be maintained between hindcasts and forecasts.The use of ensemble means from multiple members may also help to alleviate the inconsistency between hindcasts and real-time forecasts.Despite the above-mentioned measures, great attention should be paid to the impact of the bias inconsistency of climate model outputs in hindcast and forecast periods.
The inherent limitation of hydrological models in capturing the dynamic nature of hydrological systems gives rise to the complexity in the temporal characteristics of forecast errors.Hydrological model errors are known to be frequently autocorrelated, heteroscedastic, and non-Gaussian (e.g., Bates & Campbell, 2001;Evin et al., 2014;Kuczera, 1983).In this study, significant correlations are observed in the forecast errors of the VIC model across different lead times (see Tables S53-S55 in Supporting Information S1).Notably, the hybrid model demonstrates effective mitigation of autocorrelation issues in streamflow predictions at a 1-month lead time.However, at lead times of 3 and 6 months, hybrid models exhibit persistence in forecast errors which are similar to those observed in the VIC model.The White test was employed to evaluate the heteroscedasticity of forecast errors in both the VIC and hybrid models.The results show that heteroscedasticity exists in both cases.Moreover, a One-Sample Kolmogorov-Smirnov Normal Test was used to examine if forecast errors follow a Gaussian distribution.The results reveal that the forecast errors of VIC and hybrid models do not conform to a Gaussian distribution.Histograms of errors for VIC and hybrid models are shown in Figure S10 in Supporting Information S1.To address issues related to correlation, variance structure, and the distribution of forecast errors, various models have been developed, spanning from simple and traditional ones (e.g., Bates & Campbell, 2001;Evin et al., 2013;Marshall et al., 2006) to more sophisticated approaches (e.g., Bennett et al., 2016;Li et al., 2016).
The seasonality of forecast errors was also analyzed by first classifying the errors into four seasons.A One-Way Analysis of Variance (ANOVA) was then employed to assess whether forecast errors from VIC and hybrid models exhibit seasonality.As shown in Table S56 in Supporting Information S1, the hybrid models exhibit very different seasonality of forecast errors compared to that of the VIC model for the same lead time.Specifically, at the 1-month lead time, the hybrid model shows no discernible seasonality in forecast errors, while the VIC model is unable to consistently capture seasonal variations of streamflow, resulting in significant seasonality in forecast errors.In contrast, at the 6-month lead time, the seasonality of forecast errors becomes significant for the hybrid model.This issue can be addressed by applying an appropriate time series model, such as the AutoRegressive Integrated Moving Average (ARIMA) model, to the error time series.Seasonal effects can also be addressed by introducing seasonal indicators and dummy variables in DL models or employing clustering methods.
The hybrid model proposed in this study is particularly promising in situations where the predictive performance of both DL models and hydrological models is suboptimal, or when distributed hydrological models have high spatial resolution, leading to substantial computational demands.The tested DJKR basin has a typical subtropical humid climate.Testing across various climate regimes with different areas to validate the hybrid model can be avenues for future research.

Conclusion
This study proposed a hybrid monthly streamflow prediction model coupling simplified physically based hydrological and CNN-GRU models, which takes use of the merits from process-based and data-driven models and minimizes their limitations.Specifically, grid cells selected from the VIC model were used to generate soil moistures for the forecasting months, which were then used as extra inputs to the CNN-GRU model.In addition, the influence of using different input combinations for CNN-GRU model is also investigated by utilizing a total of five types of data sources including streamflow observations, local meteorological data, large-scale climate indices, GCM predictions, and model-simulated soil moisture.The developed hybrid model was applied to predict 1-, 3-, and 6-month ahead inflow for the DJKR and evaluated using the KGE, R 2 and WI metrics.VIC and CNN-GRU were used as baseline models for comparison with the hybrid model.Furthermore, predictive performance of the hybrid models based on different ML models were also tested.The following conclusions can be drawn: 1.For this specific river basin, the VIC model has a remarkably better performance than the CNN-GRU model for monthly streamflow predictions, particularly at the 1-month lead time, because monthly inflow data for the DJKR are limited in length and have a low correlation.2. The proposed hybrid model is capable of utilizing soil moisture generated from the VIC model, and it performs consistently better than both VIC and CNN-GRU models for monthly streamflow predictions up to 6 months, although the improvement in prediction accuracy decreases with an elongation of the lead time.For example, when compared with the respective models, the best performing hybrid model improves KGE values by at least 0.14, 0.12 and 0.16 for lead times of 1, 3, and 6 months, respectively for the DJKR basin.3. Different input combinations influence the predictive performance of the CNN-GRU model.In the hybrid model, the introduction of model-simulated soil moisture is capable of minimizing the influence of input combinations.For 1-month ahead prediction, the inclusion of model-simulated soil moisture produces the best predictive performance when used in combination with local meteorological data or with local meteorological data and GCM predictions.For longer lead times (e.g., 6 months), large-scale climate indices provide a better representation of historical climate conditions than meteorological data, making them indispensable predictors to better predict monthly streamflows.4. The CNN-GRU based hybrid models generate stable reservoir inflow values across all parameterizations.
Moreover, the hybrid models based on CNN-GRU outperform those based on all other ML models including LASSO, XGBoost, CNN, and GRU in the study watershed.
The proposed hybrid model appears to be a promising tool for monthly streamflow predictions, and its effectiveness needs to be further examined and justified in more basins and using more physically based hydrological models.

Figure 1 .
Figure 1.Locations of the three hydrological stations and Danjiangkou Reservoir (DJKR) (a) and average monthly precipitation/runoff depth of the corresponding basins (b).

Figure 2
Figure2illustrates the three stages of the proposed hybrid model.The first stage involves downscaling of monthly precipitation, maximum and minimum temperature forecasts provided by GCMs.The spatiotemporal downscaling procedure includes bias correction of monthly forecasts using the Linear scaling (LS) method and daily disaggregation using the K-nearest neighbor (K-NN) method.The downscaled forecasts are then used to drive the hydrological model for a specified period of time in the second stage.The second stage involves the development of a simplified spatially distributed hydrological model to reduce the computational cost required for the use of a distributed model to perform streamflow predictions.In the third stage, the simplified hydrological model is driven with the disaggregated precipitation and temperature forecasts to generate soil moisture at the daily time scale for the forecasting months.The daily soil moisture time series were aggregated into the monthly values and used as inputs to the DL model along with other predictors for monthly streamflow predictions.The detailed procedures are presented in Sections 3.1-3.3below.

Figure 2 .
Figure 2. A block diagram of the hybrid monthly streamflow prediction model.

Figure 3 .
Figure 3. Predictive performance (KGE, R 2 , and WI) of 1-month ahead prediction.The first column represents evaluation metrics for the VIC model.The dashed line corresponds to the VIC model performance.Letters A1-E represent 31 predictor combinations, in the same order as Table 3, corresponding to 31 models.There are 16 combinations of hyperparameters for each model (see Section 3.3.2),and all weight matrices and bias values of each model were randomly initialized, so 10 repeated experiments were carried out to reduce the impact of model parameters initialization on the results.Dots colored green and black represent repeated runs and different parameterizations of hybrid models (16 models) and CNN-GRU models (15 models), respectively.The horizontal line in each bar indicates the median value of 160 dots and the bars indicate interquartile value spread (25th and 75th).
Figure S7 in Supporting Information S1 shows that the case of using large-scale climate indices and model-simulated soil moisture (B9) or large-scale climate indices, GCM predictions, and model-simulated soil moisture (C10) as predictors exhibits the best performance with KGE, R 2 , and WI values of approximately 0.53, 0.32, and 0.74, respectively.The hybrid model B9 shows a large increase of 0.34 in KGE value compared with the CNN-GRU model using large-scale climate indices (A3) as predictors.

Figure 4 .
Figure 4. Comparison of observed and predicted monthly streamflows for DJKR in 1-month lead prediction.The green and purple solid lines represent median values of 160 runs for CNN-GRU and hybrid models, respectively.

Figure 5 .
Figure 5.Comparison of predictive performance (KGE, R 2 , and WI) among the CNN, GRU, and CNN-GRU models.The best performing hybrid model (based on CNN-GRU model performance) was singled out for each lead time.Boxplots of different tests.Each box is calculated from all combinations of model hyperparameters.Each dot represents the average value of 10 repeated experiments.The horizontal line in each box indicates the median value of different combinations of model hyperparameters, the boxes indicate interquartile value spread (25th and 75th), and the ends of the whiskers represent the 5th and 95th percentiles.

Figure 6 .
Figure6.The verification results are shown for seven models (depicted by colored lines), across each season (corresponding to columns), for three lead times (along the x-axis), and using three evaluation metrics (represented by rows).Specifically, for 1-month, 3-month, and 6-month ahead predictions, hybrid models using combinations B7, B10, and B9 as predictors, and DL models employing A2, A4, and A3 as predictors are singled out.

Figure 7 .
Figure 7.Comparison of predictive performance (KGE, R 2 , and WI) of 16 hybrid models among the LASSO, XGBoost, and CNN-GRU models.There are 18 and 16 combinations of hyperparameters for the XGBoost and CNN-GRU models, respectively (see Sections 3.3.2and 3.4).Each bar represents the average value of various combinations.

Table 1
Basic Information About the Three Hydrological Stations and DJKR

Table 2
Information of Selected Large-Scale Climate IndicesThe so-called teleconnection patterns are a number of preferred patterns that characterize the climate variability of the atmospheric circulation.

Table 3
Thirty-One Predictor Combinations

large-scale climate data, model-simulated soil moisture C9 Local meteorological data, GCM predictions, and model-simulated soil moisture C10 Large-scale climate indices, GCM predictions, and model-simulated soil moisture D1 Meteorological and climate data, GCM predictions, model-simulated soil moisture D2 Streamflow and climate data, GCM predictions, model-simulated soil moisture D3 Streamflow and meteorological data, GCM predictions, model-simulated soil moisture D4 Streamflow, meteorological, and climate data, model-simulated soil moisture
XU ET AL.
. A large positive or negative Φ i EG value indicates that the corresponding feature strongly increases or decreases the model output, while Φ i EG value close to zero suggests little influence.The Φ i EG values per model output are displayed in Figure 8 for 4 months (i.e., January, April, July, and October) and in Figures S8 and S9 in Supporting