A novel ensemble-based conceptual-data-driven approach for improved streamflow simulations

Abstract A novel framework, an ensemble-based conceptual-data-driven approach (CDDA), is developed that integrates a hydrological model (HM) with a data-driven model (DDM) to simulate an ensemble of HM residuals. Thus, a CDDA delivers an ensemble of ‘residual-corrected’ streamflow simulations. This framework is beneficial because it respects hydrological processes via the HM and it profits from the DDM’s ability to simulate the complex relationship between residuals and input variables. The CDDA enables exploring different DDMs to identify the most suitable model. Six DDMs (eight variants in total) are explored: Multiple Linear Regression (MLR), k Nearest Neighbours Regression (KNN), the Second-Order Volterra Series Model (SOV), Artificial Neural Networks (ANN), and two variants each of eXtreme Gradient Boosting (XGB) and Random Forests (RF). The proposed CDDA, tested on three Swiss catchments, was able to improve the mean continuous ranked probability score by 16-29% over the standalone HM. Based on these results, XGB and RF are recommended for simulating the HM residuals.


Introduction
The water resources domain has witnessed growing interest in using data-driven models (DDMs) to improve precipitation-runoff simulation (Shen, 2018;Nearing et al., 2020), whether by directly replacing (e.g., Kratzert et al., 2019) or being used in conjunction with (e.g., Boucher et al., 2020) hydrological models (HMs). DDMs learn the relationship between input and target variables directly from data, without the need to explicitly consider the physical laws governing the underlying processes. Hence, they are constantly increasing in popularity within hydrology and have been applied to many different hydrological problems: precipitation-runoff modelling (Rajurkar et al., 2004;Shortridge et al., 2016;Tongal & Booij, 2018), streamflow forecasting (Solomatine & Xue 2004;Boucher et al., 2011;Papacharalampous & Tyralis, 2018;Ni et al., 2020;Tyralis et al., 2020), groundwater level forecasting (Suryanarayana et al., 2014;Rahman et al., 2020), water quality modeling (Fatehi et al., 2015;Bhagat et al., 2021), and many others. DDMs require, however, the pre-selection of input variables as potential predictors, which greatly impacts their accuracy (Galelli et al., 2014). In addition, since hydrological variables often display auto- and cross-correlation, it is important to consider time-lagged versions of the input variables (e.g., precipitation, air temperature) when simulating a target hydrological variable (e.g., streamflow) (Gauch et al., 2021), and to be able to identify the maximum lag time (memory length). This maximum time lag defines a lag time after which the impact of the input variable on the target variable is marginal (Bowden et al., 2005). A correct pre-selection of input variables enables the exclusion of redundant and/or irrelevant input variables, thereby reducing the complexity and increasing the interpretability of the resulting DDM (Quilty et al., 2016).
Most previous data-driven approaches for streamflow simulation have used DDMs to explicitly simulate streamflow and in this way they provide an alternative to a HM. Yet, only very few works have attempted to characterize or directly simulate residuals of a HM with a data-driven approach. Some examples include Montanari & Koutsoyiannis (2012), Sikorska et al. (2015), Wani et al. (2017), and Ehlers et al. (2019), who simulate model errors via resampling from a query dataset based on streamflow properties (and other hydro-meteorological data in the case of Ehlers et al., 2019). However, each of these studies relied on a specific statistical method for resampling model errors (the meta-Gaussian approach in Montanari & Koutsoyiannis (2012) and k nearest neighbours in the others) and, therefore, their approaches are not generalized to any DDM (as is the case in this study).
Approaches for coupling HMs with DDMs into a hybrid model are very rare in application to streamflow modelling (Wang et al. 2021). For example, Tongal and Booij (2018) decomposed streamflow into base and surface flows and used HM simulations at the previous time step along with meteorological variables as input to a DDM to simulate streamflow at the current time step, indirectly linking model residuals to flow conditions. Senent-Aparicio et al. (2018) have proposed an indirect coupled approach, in which streamflow simulations from a DDM are corrected using additional information on maximum mean daily flow that is taken from a HM. Yang et al. (2020a) have proposed a different indirect coupled approach that used a HM to simulate pseudo-observed data for training a DDM to perform streamflow simulation. Hybrid streamflow simulation models (whereby a DDM is used to model HM residuals) have only been explored by Wu et al. (2019) and Konapala et al. (2020). Both studies used DDMs for simulating the residuals of HMs but they both lack consideration of the uncertainty in estimating HM parameters. Thus, the authors in these two studies did not consider ensemble streamflow simulations. In addition, Wu et al. (2019) required the transformation of HM residuals (prior to modelling via DDMs), and the determination of stationary time windows (as identified by the Hilbert-Huang Transform) in order to simulate the HM residuals, and assumed that the simulated HM residuals follow a Student's t-distribution (to permit the construction of uncertainty intervals). More recently, Wang et al. (2021) proposed a hybrid model that uses the output of the Xinanjiang HM along with wavelet-decomposed sub-series of previous streamflow observations as input to Random Forests for simulating a single realization of streamflow without any uncertainty considered. Althoff et al. (2021) proposed an alternative hybrid model that consisted of a simplified version of a HM (soil moisture component) and a DDM to simulate streamflow. However, they also provided only a single realization of the streamflow without considering any sources of uncertainty.
As can be seen from the above literature review, despite several recently proposed hybrid modelling approaches, none of the above mentioned studies developed a fully coupled framework for an ensemble-based streamflow simulation where a DDM is used to simulate an ensemble of residuals stemming from a HM, conditioned on input variables (e.g., precipitation, air temperature). Thus, in this study, a novel conceptual-data-driven modelling framework is developed that pairs an ensemble of conceptual deterministic HMs with an ensemble of DDMs (that simulate the HM residuals) for improved ensemble streamflow simulation. This novel framework is called an ensemble-based conceptual-data-driven approach (CDDA). While a HM simulates the precipitation-streamflow process at the catchment scale, respecting, to an extent, the physical hydrological processes, a DDM, added 'on-top' of the HM, mimics the HM residuals enabling the streamflow simulations to be improved. Thus, the CDDA combines the advantages of a HM (that seeks to respect the physical relationships between hydrological processes) with the capabilities of the DDM (that can model complex (nonlinear) relationships between input-target variables) enabling an effectual estimation of the correlated residuals stemming from the HM. Evidently, the resulting output of the CDDA framework is an ensemble of streamflow simulations instead of a single streamflow realization. Conceptually, the HM-generated ensemble passes information about the observed and simulated streamflow (gained through the identification of the HM parameters, i.e., calibrated using observed data) to the DDM, which is relied upon to extract additional information not captured by the HM, to simulate the HM residuals and improve the overall ensemble streamflow simulation. Thus, the CDDA can overcome certain issues of the above mentioned hybrid models since it provides an ensemble of HM residuals (accounting for HM parameter uncertainty), and does not require any transformation of HM residuals, relaxing all assumptions on the residual distribution. These last two strengths of the proposed CDDA approach also make it more attractive than the Bayesian method (described above), that requires specification of a likelihood function and the distribution of the model residuals. Thus, the novel CDDA approach can be seen as a less-restrictive and more flexible data-driven method for simulating HM residuals and generating an ensemble of streamflow simulations.
With regards to the above, the focus of this paper is primarily on developing the CDDA framework, which enables, for any case study where data required by the HM is available, the identification of the best DDM to simulate HM residuals and the selection of the most useful input variables to use in the DDM. Since any DDM can be used within the CDDA, six different DDMs (eight variants in total), that have either been explored in detail or shown to be promising in recent studies in the hydrological modelling literature, are explored: Multiple Linear Regression (MLR), k Nearest Neighbours Regression (KNN), Second-Order Volterra Series Model (SOV), Artificial Neural Networks (ANN), and two variants each of eXtreme Gradient Boosting (XGB) and Random Forests (RF). These models are developed by considering time-lagged copies of observed streamflow, precipitation, and air temperature as potential predictors and are coupled with a bucket-type HM, Hydrologiska Byråns Vattenbalansavdelning or HBV (Bergström & Forsman, 1973), and tested using three Swiss catchments. Furthermore, several input variable selection (IVS) methods are considered for selecting the most important candidate input variables to use in the DDM. Thus, the major objective of this study is to introduce the ensemble-based CDDA framework and evaluate its performance against a standalone (ensemble-based) HM model. In addition, the proposed CDDA enables one to answer questions specific to individual case studies, such as: (A) Which DDMs are most suitable for simulating the residuals of the HM within CDDA? (B) Which input variables are the most important to consider when simulating HM residuals via the explored DDMs? The framework developed in this study is timely, as coupled HM-DDM approaches are only beginning to consider uncertainty in resulting streamflow simulations (e.g., Papacharalampous et al., 2019a;Tyralis et al., 2019a;Boucher et al., 2020;Teweldebrhan et al., 2020), although no such framework exists that uses DDMs to simulate the residuals from an ensemble of HM streamflow simulations.
The remainder of this paper is organized as follows: Section 2 describes the methods by introducing the theoretical CDDA framework and providing details on candidate DDMs selected for simulating the HM residuals; Section 3 includes the experimental settings, an overview of the case study, and the HM and DDM development details; Section 4 highlights the main results; Section 5 discusses the significance of the results, describes the current limitations of the developed approach, and suggests future research avenues; and Section 6 concludes the paper.

Overview of the Conceptual-Data-Driven Approach (CDDA)
A single data-driven model is developed that mimics the residuals of a hydrological model and is linked to the input (explanatory) meteorological variables, which are usually precipitation and air temperature. This DDM is attached on top of a single simulation (e.g., streamflow) from the HM resulting from its parameter set. Thus, the target (response) variable is the residual between the observed streamflow and the HM-simulated streamflow at the time step t-0. In the case of multiple parameter sets of the HM (i.e., an ensemble of HM streamflow simulations), the DDM is attached to each HM simulation (i.e., there is a single DDM trained for each set of residuals). This latter model setup, i.e., an ensemble HM+DDM, is called an ensemble-based conceptual-data-driven approach, see Figure 1 for an overview. The residuals of the HM are modelled using different DDMs taking the meteorological variables and observed streamflow at previous days as inputs (see Section 2.3 for details).
Different input combinations are tested (Section 2.4) to explore which set of inputs best simulates the residuals of the HM. Note that the observed streamflow is assumed to be available (at a regular time interval) at a given site. In case observed streamflow is unavailable, both the HM and DDM cannot be calibrated and other indirect approaches would be required (see Section 5.2).
To evaluate the usefulness of the CDDA, the ensemble of deterministic outputs from the hydrological model without any DDM is used as the benchmark (Section 2.6). Eight different variants of the CDDA are proposed and compared against each other, and against the benchmark, to determine the best DDM and identify useful input combinations.

Ensemble-based CDDA
The streamflow simulation for the i-th ensemble member of the CDDA output is the sum of the streamflow simulated with the HM model (Q_HM) and the simulation of its residual using the DDM model (ε_DDM):

\hat{Q}_{CDDA}^{(i)}(t) = \hat{Q}_{HM}\left(t \mid P, T, \theta_{HM}^{(i)}\right) + \hat{\varepsilon}_{DDM}\left(t \mid P, T, Q, \theta_{DDM}^{(i)}\right)   (1)

where P, T, and Q are observed precipitation, air temperature, and streamflow, θ^(i) is a parameter set of the CDDA, and θ_HM^(i) and θ_DDM^(i) are the parameter sets of the HM and DDM models, respectively, with θ^(i) = {θ_HM^(i), θ_DDM^(i)}.
Note that each ensemble member is associated with a single θ^(i). By applying Eq. (1) to each ensemble member, an ensemble of streamflow simulations can be obtained through the CDDA. Finally, while the HM requires P and T as input to simulate streamflow for a given day, the DDM can additionally use P, T, and Q from previous days (i.e., time-lagged copies of these variables) in order to improve predictive performance. This idea is discussed in more detail in Section 2.4.
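To make the workflow of Eq. (1) concrete, a minimal R sketch of the ensemble loop is given below; run_hbv(), train_ddm(), and predict_ddm() are hypothetical placeholders standing in for the HM run and for whichever DDM is selected, not functions from any particular package.

# Minimal sketch of the ensemble-based CDDA (Eq. 1); run_hbv(), train_ddm(),
# and predict_ddm() are hypothetical placeholders, not package functions.
simulate_cdda <- function(theta_hm_list, precip, temp, q_obs, lagged_inputs) {
  n_members <- length(theta_hm_list)
  q_cdda <- matrix(NA, nrow = length(q_obs), ncol = n_members)
  for (i in seq_len(n_members)) {
    # 1) HM streamflow simulation for the i-th optimized parameter set
    q_hm <- run_hbv(precip, temp, theta = theta_hm_list[[i]])
    # 2) HM residuals (the DDM target): observed minus simulated streamflow
    eps <- q_obs - q_hm
    # 3) one DDM is trained per ensemble member on the time-lagged inputs
    ddm_i <- train_ddm(x = lagged_inputs, y = eps)
    # 4) 'residual-corrected' streamflow: HM simulation plus simulated residual
    q_cdda[, i] <- q_hm + predict_ddm(ddm_i, newdata = lagged_inputs)
  }
  q_cdda  # matrix of ensemble streamflow simulations (time steps x members)
}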
The HM is run at a daily time step, as is the DDM. Depending on the chosen HM model structure, other input variables may also be required for the HM model (e.g., potential evapotranspiration) and thus could also be explored within the DDM. Similarly, using a different time step for the HM (e.g., hourly) may require the DDM to be conditioned on input variables of a different resolution and a different lag time length.

Data-Driven Models (DDM)
In this study, six different DDMs (eight variants in total) were explored for simulating the residuals of the HM. Since this section is intended to provide only a brief explanation of each DDM, references to appropriate literature are included for the reader interested in a more detailed treatment of the various methods. However, due to the ubiquity of Multiple Linear Regression (MLR) (which is one of the adopted DDMs), no details are provided for this model.

K Nearest Neighbours Regression (KNN)
K Nearest Neighbours regression (KNN), introduced first by Fix and Hodges (1951), is one of the simplest nonlinear methods to apply (Altman 1992, Mitchell 1998, Liu et al. 2004, Hastie et al. 2009). KNN generates predictions for a 'query' vector (i.e., an input variable vector) by applying two simple steps: 1) searching for a set of K neighbours (i.e., input variable vectors) in a training dataset that are closest to the query according to a predefined distance metric (e.g., Euclidean distance) and 2) taking the average of the target (response) variable associated with each of the K closest neighbors. Thus, KNN is a local regression technique.
Despite its simplicity, one of the main drawbacks of KNN is that it cannot extrapolate beyond the range of the target in the training set. Nonetheless, KNN has a rich history in hydrology (Karlsson & Yakowitz, 1987) and is still frequently used for simulating and predicting streamflow (Lee et al., 2017;Ebrahimi et al., 2020), among other hydrological applications (Sun & Tevor, 2017;Jiang et al., 2020). For more information on KNN, the interested reader may refer to Chapter 2.3 of Hastie et al. (2009).

Second-Order Volterra Series Model (SOV)
The Second-Order Volterra series model (SOV) is essentially a polynomial regression where the design matrix is created by considering all zero-, first-, and second-order interactions amongst the input variables. The coefficient attached to the zero-order term is the bias and the remaining coefficients (attached to the first- and second-order terms) are considered kernel coefficients (Wu & Kareem, 2014). After they have been estimated, the coefficients can be used to generate predictions of the target variable for a given input variable vector.
The reader may consult the Supplemental Material of Quilty and Adamowski (2020) for the mathematical formulation of the SOV adopted in this study.
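As a rough illustration of this construction (a simplified sketch, not the exact formulation given in Quilty and Adamowski (2020)), the following R snippet builds an SOV-style design matrix with a bias column, first-order terms, and all pairwise second-order terms, and estimates the coefficients by ordinary least squares.

# Simplified sketch of a second-order Volterra (polynomial) regression.
build_sov_design <- function(X) {
  X <- as.matrix(X)
  D <- cbind(1, X)                    # zero-order (bias) and first-order terms
  p <- ncol(X)
  for (j in seq_len(p)) {
    for (k in j:p) {
      D <- cbind(D, X[, j] * X[, k])  # second-order interaction terms
    }
  }
  D
}

fit_sov <- function(X, y) {
  qr.solve(build_sov_design(X), y)    # ordinary least-squares estimate of coefficients
}

predict_sov <- function(coefs, X_new) {
  drop(build_sov_design(X_new) %*% coefs)
}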

Artificial Neural Networks (ANN)
Artificial Neural Networks have been applied to streamflow prediction since the 1990s (Abrahart et al., 2012) and represent one of the most popular data-driven models used for hydrological prediction (Fahimi et al., 2017). ANN generates predictions by passing a vector of input variables through a network comprising a series of weights that connect to neurons, which contain a bias and a (potentially nonlinear) activation function that transforms the weighted inputs and directs them to the next layer of the network; the output of the network is the sum of weighted and (nonlinearly) transformed inputs. ANN is powerful due to its universal approximation capability, which allows it to approximate any nonlinear function (i.e., mapping from input variables to the target) provided nonlinear infinitely differentiable activation functions are used in the hidden layer and enough training iterations (epochs) as well as model parameters (weights and biases) are considered (Hornik et al., 1989).
The type of ANN used in this study is a feed-forward backpropagation (FFBP) network with a single hidden layer (along with the usual input and output layers, representing the input variables and target, respectively). The FFBP ANN operates by initializing all network parameters (i.e., weights and biases) to small random values and iteratively updates these parameters by back propagating the error signal through the network for a fixed number of epochs or until sequential evaluations of the error function do not appreciably decrease with respect to a predefined tolerance. After it has been trained, ANN generates predictions by passing input variable vectors through the network, where the resulting output is the sum of weighted and (nonlinearly) transformed inputs.
For additional information on ANN, the reader can refer to Chapter 5 of Ripley (1996) or many of the references included in review articles in the domains of statistics (Ching & Titterington, 1994) or hydrology (Abrahart et al., 2012).

Random Forests (RF)
Random Forests, introduced by Breiman (2001), are a class of (ensemble) decision trees that include random sampling of both training instances (often referred to as bagging) and input variables, reducing the predictive variance (without increasing the bias) of the model (Breiman, 1996) while also providing robust performance in the presence of noisy input variables (Biau, 2012). A very useful feature of RF is that it implicitly provides an estimate of input variable importance (Grömping, 2009), which can be used for input variable selection (Genuer et al., 2010), see for example Tyralis & Papacharalampous (2017). RF generates predictions by taking the average across all predictions produced by each ensemble member.
The interested reader may refer to Chapter 15 of Hastie et al. (2009) for details on RF as well as Fawagreh et al. (2014) for a review of different RF variants and applications.

eXtreme Gradient Boosting (XGB)
The eXtreme Gradient Boosting model is a very recent method that combines decision trees and boosting (Chen & Guestrin, 2016). While bagging (as used in RF) operates by developing an ensemble of independent models on each resampled dataset, boosting instead builds an ensemble of models where each member is built on top of the residuals generated by the previous member, with the objective to reduce the errors made in the previous iteration. In other words, bagging reduces the variance of the predictors, while boosting reduces bias (Vanschoren et al., 2012); both addressing opposite aspects of the bias-variance trade-off and using different strategies for 'ensembling' decision trees (Mehta et al., 2019). While recent research has shown that combining bagging and boosting can lead to better performing ensemble decision trees (Ghosal & Hooker, 2020), such methods are not within the scope of this study.
XGB is an improved version of the gradient boosting machine that is more computationally efficient and less prone to overfitting due to L2-norm regularization (Chen & Guestrin, 2016). Similar to RF, XGB inherently estimates the importance of input variables and can be used for input variable selection (e.g., Chen et al., 2020;Rahman et al., 2020).

XGB has only recently been considered within the hydrological domain but has shown promise for streamflow simulation (Hadi et al., 2019;Gauch et al., 2021) and forecasting (Ni et al., 2020;Tyralis et al., 2020), predicting daily reference evapotranspiration (Fan et al., 2018), water quality index prediction (Abba et al., 2020), water table depth forecasting (Brédy et al., 2020), and imputing missing sub-hourly precipitation records (Chivers et al., 2020), among other applications.
Additional details on the theory and main innovations behind XGB can be found in Chen & Guestrin (2016).
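For readers unfamiliar with the library, a minimal R sketch of fitting an XGB model to HM residuals and retrieving its 'gain'-based importance scores is shown below; the object names and hyper-parameter values are illustrative assumptions, not the settings tuned in this study.

library(xgboost)

# Illustrative only: 'lagged_inputs' is a numeric matrix of time-lagged
# predictors and 'resid_hm' the HM residuals at t-0 (assumed objects).
xgb_fit <- xgboost(data = as.matrix(lagged_inputs), label = resid_hm,
                   nrounds = 100,                   # assumed, not the tuned value
                   objective = "reg:squarederror",  # squared-error regression loss
                   verbose = 0)

# 'gain': fractional contribution of each input variable across all trees
xgb.importance(model = xgb_fit)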

Input Variable Selection (IVS)
Since input variable selection is an important step in the development of any DDM (Galelli et al., 2014), for each of the six DDMs (MLR, KNN, SOV, ANN, RF, and XGB), IVS was carried out to identify the input variables that may be useful for simulating the HM residuals. Different IVS approaches were considered for the DDMs. The (linear) partial correlation input selection (PCIS) algorithm (May et al., 2008) was paired with MLR to serve as a fully linear benchmark. The Edgeworth approximations-based approach (EA) was used to carry out CMI-based IVS, which is a nonlinear analogue to PCIS (Quilty et al., 2016). The EA approach was coupled with KNN, SOV, and ANN. Since both RF and XGB implicitly perform IVS, an 'external' (model-free) IVS method (e.g., PCIS, EA) was not necessary. The PCIS method was selected as it is one of the most popular linear IVS methods (Galelli et al., 2014) while the EA method was selected as it has been shown to provide similar IVS accuracy as competing CMI-based methods (e.g., based on kernel density estimation, k nearest neighbors) at a fraction of the computational 'cost' (Quilty et al., 2016). A short description of the different IVS methods is provided in the sub-sections below.

Partial Correlation Input Selection (PCIS)
PCIS is an iterative greedy IVS method whereby candidate input variables (i.e., time-lagged copies of observed streamflow, precipitation, and air temperature) are selected one at a time based on their partial correlation with the target variable (i.e., HM residuals at t-0), conditioned on previously selected inputs. At each iteration, the candidate input variable with the highest partial correlation (i.e., the best candidate input variable) is selected and a predefined IVS stopping condition is checked. The version of PCIS adopted in this study uses the Bayesian Information Criterion (BIC) as a stopping condition to halt the IVS procedure (see Galelli et al., 2014). At each iteration of the IVS procedure, an MLR model is built that predicts the target variable using all previously selected inputs and the best candidate input variable; afterwards, the BIC is measured. If the BIC increases for the current iteration compared to the previous one, the IVS procedure is halted and all input variables selected before the current iteration are returned.
Otherwise, the IVS procedure continues. The variable importance measure for PCIS is the partial correlation coefficient (PC), which ranges between -1 and 1 (with 0 representing independence while -1 and 1 represent perfect correlation).
Additional details on PCIS can be found in May et al. (2008) and Galelli et al. (2014).
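A condensed R sketch of this greedy forward-selection loop is given below; it is a simplified illustration of PCIS with the BIC stopping rule, not the reference implementation of May et al. (2008).

# Simplified sketch of PCIS with a BIC stopping rule (illustrative only).
# 'candidates' is a data frame of candidate inputs; 'y' is the target
# (HM residuals at t-0).
pcis_bic <- function(candidates, y) {
  selected <- character(0)
  best_bic <- Inf
  repeat {
    remaining <- setdiff(names(candidates), selected)
    if (length(remaining) == 0) break
    # partial correlation of each candidate with y, conditioned on the
    # already-selected inputs (computed via residuals of linear fits)
    scores <- sapply(remaining, function(v) {
      if (length(selected) == 0) {
        abs(cor(candidates[[v]], y))
      } else {
        Z  <- as.matrix(candidates[selected])
        rv <- resid(lm(candidates[[v]] ~ Z))
        ry <- resid(lm(y ~ Z))
        abs(cor(rv, ry))
      }
    })
    best  <- remaining[which.max(scores)]
    trial <- lm(y ~ ., data = candidates[c(selected, best)])
    if (BIC(trial) > best_bic) break   # halt when the BIC increases
    best_bic <- BIC(trial)
    selected <- c(selected, best)
  }
  selected
}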

Edgeworth Approximations-based Conditional Mutual Information (EA)
The EA approach to CMI-based IVS was introduced and discussed in detail in Quilty et al. (2016); thus, only the essential features of this method are described here. Similar to PCIS, EA is an iterative greedy IVS method. However, in EA, the CMI is estimated between candidate input variables and the target, conditioned on previously selected inputs, instead of partial correlation. The stopping condition used for EA is based on a modified version of the tolerance-based approach from Vlachos & Kugiumtzis (2010), where at each iteration the CMI between the best candidate input variable and the target variable, conditioned on all previously selected inputs, is compared to the mutual information between the target variable and all previously selected input variables, including the best candidate input variable. If the ratio between these two quantities drops below the tolerance (ranging between 0 and 1), the IVS procedure halts and all input variables selected before the current iteration are returned. By increasing the tolerance, the number of selected input variables can be decreased. The only difference between the EA method adopted here and in Quilty et al. (2016) is in the choice of stopping criterion. It was found during earlier experimentation that the tolerance-based method provided greater control over the IVS process, facilitating an improved balance between computational run-time and model accuracy (compared to the method used in Quilty et al., 2016). Trial-and-error led to the selection of 0.05 as a suitable threshold to balance these two objectives. The variable importance measure for the EA method is the partial informational correlation (PIC), which is a nonlinearly scaled version of CMI (Sharma & Mehrotra, 2014) such that the PIC ranges between 0 and 1 (with 0 representing independence between variables and 1 indicating perfect correlation).
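The tolerance-based stopping rule can be sketched as follows; cmi() and mi() are hypothetical stand-ins for the Edgeworth-approximation estimators of conditional mutual information and mutual information described in Quilty et al. (2016), shown only to illustrate the ratio check.

# Sketch of EA-based IVS with the tolerance stopping rule (illustrative only).
# cmi() and mi() are hypothetical placeholders for the Edgeworth-approximation
# estimators; 'candidates' is a data frame of inputs and 'y' the target.
ea_ivs <- function(candidates, y, tol = 0.05) {
  selected <- character(0)
  repeat {
    remaining <- setdiff(names(candidates), selected)
    if (length(remaining) == 0) break
    # CMI of each candidate with the target, conditioned on selected inputs
    scores <- sapply(remaining, function(v)
      cmi(candidates[[v]], y, candidates[selected]))
    best <- remaining[which.max(scores)]
    # compare to the MI between the target and all selected inputs plus 'best'
    ratio <- scores[best] / mi(y, candidates[c(selected, best)])
    if (ratio < tol) break
    selected <- c(selected, best)
  }
  selected
}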

Variable Importance (Decision Tree-based) Methods
The decision tree-based methods (RF and XGB) generate internal measures of variable importance that are useful in identifying the most important inputs to the model (Wang et al., 2018;Pathy et al., 2020). In contrast to the PCIS and EA methods, which are considered model-free approaches, RF and XGB are model-based IVS approaches. While the input variable importance measures generated by RF and XGB can also be used for IVS in other DDMs (see for instance, Chen et al., 2018;Hadi et al., 2019;Prasad et al., 2019;Chen et al., 2020;Tyralis et al., 2020;Bhagat et al., 2021), this study only uses the variable importance measures to identify the most useful variables specific to these models.
For RF, the variable importance score is the total decrease in node impurity that occurs by splitting on a particular input variable and averaged over all decision trees, measured by the sum of squared errors (Li et al., 2017;Wright & Ziegler, 2017). The variable importance scores generated by RF were normalized by dividing each input variable's score by the maximum variable score (see Eq. (1) in Deng & Runger (2013)), resulting in the normalized variable importance (NVI).
The variable importance (VI) score adopted for XGB is the fractional contribution of each input variable to the model prediction, averaged across all trees, also known as 'gain' (Li et al., 2017), where the model's predictive performance is measured by the sum of squared errors loss function (Chen et al., 2019). For both RF and XGB, higher importance scores represent variables that are more important than the others (Li et al., 2017).
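As a brief illustration of the RF normalization, assuming an already-fitted ranger model trained with impurity-based importance (rf is an assumed object), the NVI could be obtained as follows.

# Minimal sketch: normalized variable importance (NVI) from a fitted RF.
# 'rf' is an assumed ranger model trained with importance = "impurity".
vi  <- ranger::importance(rf)   # total node-impurity decrease per input variable
nvi <- vi / max(vi)             # normalize by the maximum importance score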

Benchmark Method
As discussed in Section 2.1, the CDDA can generate an ensemble of streamflow simulations via Eq. (1). In order to evaluate the added value of using the CDDA it must be compared against a benchmark. In this case, the benchmark is simply the ensemble of streamflow simulations generated by the HM model for each parameter set (i.e., without any model for simulating the HM residuals).

Performance Metrics
In order to compare the performance of CDDA against the benchmark HM, as well as to identify the best DDM to use within the CDDA, a number of statistical (performance) metrics were used. The performance metrics were divided into two classes: ensemble and deterministic metrics. The ensemble metrics make use of all ensemble members when evaluating the simulations' performance while the deterministic metrics are computed using the mean ensemble member, i.e., the mean simulation at each simulation time step. Since the metrics adopted in this study are well-known in the hydrology community, the formulae used to calculate these scores are not provided although the cited sources include the necessary information to permit their calculation.
The ensemble metrics consist of the mean continuous ranked probability score (CRPS) (Bröcker, 2012), the mean interval score (MIS) (Gneiting & Raftery, 2007), and the average width (AW) (Xiong et al., 2009). Both the MIS and AW require the specification of a confidence level, which is taken to be 0.05 in this study. For a particular confidence level, upper and lower uncertainty intervals can be computed, either by assuming the ensemble members follow a particular distribution (e.g., Gaussian) or by empirical means, such as estimating the quantiles associated with the confidence level. The latter approach is followed in this study; thus, the upper and lower uncertainty intervals are estimated at the 0.975 and 0.025 quantiles, respectively, and they together define the 95% uncertainty intervals. The CRPS was calculated using the scoringRules R package (Jordan et al., 2019) while the MIS and AW were calculated using custom R functions.
The CRPS is a useful metric that evaluates a simulation's reliability and sharpness, reducing to the mean absolute error (MAE) for deterministic simulations, permitting the comparison between ensemble and deterministic forecast quality (Boucher et al., 2011). The MIS also evaluates a simulation's reliability and sharpness, but additionally includes a penalty for simulations that fall outside the upper and lower uncertainty intervals. The AW solely measures the simulation's sharpness (i.e., the narrowness of its uncertainty intervals).
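A minimal R sketch of how these scores can be computed from an ensemble matrix is given below; crps_sample() comes from the scoringRules package cited above, while the interval-score and average-width calculations are simplified stand-ins for the custom functions used in this study ('ens' and 'obs' are assumed objects).

library(scoringRules)

# 'ens' is a matrix of ensemble simulations (rows = time steps, columns =
# members) and 'obs' the observed streamflow; alpha = 0.05 gives 95% intervals.
mean_crps <- mean(crps_sample(y = obs, dat = ens))

alpha <- 0.05
lo <- apply(ens, 1, quantile, probs = alpha / 2)       # 0.025 quantile
up <- apply(ens, 1, quantile, probs = 1 - alpha / 2)   # 0.975 quantile
# interval score: width plus penalties for observations outside the interval
is_t <- (up - lo) +
  (2 / alpha) * (lo - obs) * (obs < lo) +
  (2 / alpha) * (obs - up) * (obs > up)
MIS <- mean(is_t)
AW  <- mean(up - lo)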

Study Catchments
Three Swiss mid-size, mid-altitude catchments were chosen for this study (Figure 2). All three catchments have an insignificant areal glacier percentage (<5%) and no significant direct human impacts documented in the observation period. The catchments represent two different types of dominant flood processes, i.e. rainfall-driven (Dünnern) and a mixed contribution of rainfall and snowmelt floods (Kleine Emme and Muota), see also Table 1. All variables were averaged to mean catchment values using the Thiessen polygon method.

Hydrological Model
In this study, a conceptual HBV model, in particular, HBV light (Seibert & Vis, 2012), was used for streamflow simulations. This bucket-type model consists of four major routines: i) precipitation excess and snowmelt, ii) soil moisture, iii) groundwater and runoff-streamflow response, and iv) routing in the stream. The snow component is important for catchments with significant snow processes, i.e., for two out of three studied here. HBV has been frequently applied to rainfall-and snow-dominated catchments (e.g., Breinl, 2016;Griessinger et al., 2016;Sikorska & Seibert, 2018;Westerberg et al., 2020).
The HBV model used in this study has 15 parameters and is run at a daily time step.
Model inputs are precipitation and air temperature time series and long-term averaged values of daily evaporation and air temperature. The model output is a single (deterministic) realization of a continuous streamflow time series at the catchment outlet. Due to its simplicity, HBV is used as an example in this study but can be interchanged with another HM without difficulty.

Calibration of the Hydrological Model
The HBV model was calibrated via the Genetic Algorithm and Powell (GAP) optimization method (Seibert, 2000), resulting in 1000 optimized parameter sets that form the HM ensemble used throughout this study.

Input Variable Selection and Calibration of Data-Driven Models
Prior to the calibration of DDMs, input variable selection was required. Input variables for the DDM included observed streamflow, precipitation and air temperature, at the current and/or previous nine days. For streamflow, inputs included observations from the nine days preceding the simulation day (t-1, ..., t-9). For precipitation and air temperature, inputs included observations from the day of the simulation (t-0) as well as the previous nine days (t-0, ..., t-9).
In total, 29 different input variables were considered as potential predictors of the HM residuals at t-0.
The maximum time lag (D) for each input variable was determined using the conditional mutual information (CMI) (Brown et al., 2012) between the HM residuals at t-0 and each explanatory variable from t-0 to t-D (with the exception of streamflow, which considered time lags t-1 to t-D) by locating the lag at which the CMI reached a local minimum. The goal was to identify a sufficient number of time lags to accurately simulate the HM residuals while also attempting to keep the input variable set of a reasonable size. Given that precipitation has the largest effect on modifying the streamflow (Müftüoğlu, 1991), it was given a higher priority in identifying the maximum time lag. This approach resulted in a maximum time lag of D=9.
Section 2.4.2 outlines the method used to estimate CMI.
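One possible reading of this local-minimum rule is sketched below: given a vector of CMI values computed for lags 0 to L (an assumed input, estimated with the method of Section 2.4.2), the maximum lag D is set to the first interior local minimum.

# Sketch: pick D as the first local minimum of the CMI across lags (one
# possible reading of the rule above); 'cmi_by_lag' is an assumed numeric
# vector of CMI values for lags 0, 1, ..., L.
first_local_min <- function(cmi_by_lag) {
  for (d in 2:(length(cmi_by_lag) - 1)) {
    if (cmi_by_lag[d] < cmi_by_lag[d - 1] && cmi_by_lag[d] < cmi_by_lag[d + 1])
      return(d - 1)          # convert vector index to lag (lags start at 0)
  }
  length(cmi_by_lag) - 1     # fall back to the largest lag considered
}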
The different DDMs were calibrated (trained) using the target (residuals of the HM-simulated streamflow) and input variables (i.e., time-lagged versions of the observed precipitation, air temperature, and streamflow) for the same calibration period as the HBV model.

K Nearest Neighbours Regression (KNN)
As noted in Section 2.3.1, the KNN model does not require any explicit training strategy since model predictions are generated by computing the distance (here, the Euclidean distance) of a given input variable vector (e.g., from the validation set) with those from the training data, locating its K nearest neighbours, and taking the mean of the targets associated with each of the K neighbours. While KNN does not require training, it requires the selection of the hyper-parameter, K, with lower K values leading to predictions with high variance and low bias and vice versa for high K values (Hastie et al., 2009). Previous research by one of the authors found that a wide variety of K values (5, 10, 20, 30, 50, and 100) led to similar results when sampling conditional model errors for streamflow simulation (Sikorska et al., 2015). Thus, this study used K=5, seeking to strike a balance in the bias-variance tradeoff related to the selection of K. Other common choices for K include √n, where n is the number of samples in the training set (Lall & Sharma, 1996).
To ensure input variables with higher ranges are given equal weight (or importance) as input variables with smaller ranges, all inputs are individually normalized (i.e., scaled between 0 and 1, using the maximum and minimum of each variable over the training set), prior to searching for nearest neighbours. Using normalized inputs in KNN has been shown to significantly improve model performance (compared to using unnormalized inputs) (Piryonesi & El-Diraby, 2020) and in earlier experiments was also found to have the same effect for the catchments under study. The FNN package in R (Beygelzimer et al., 2019) was used for developing the KNN models, which uses the fast k nearest neighbours method (Beygelzimer et al., 2006) and the kd-tree approach (Friedman et al., 1977) when searching for nearest neighbours.
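A short R sketch of this setup is given below (min-max normalization using the training-set range, followed by FNN's k nearest neighbours regression with K = 5); x_train, x_valid, and resid_train are assumed placeholder objects.

library(FNN)

# Min-max normalize each column with the training-set range, then predict the
# HM residuals with K = 5 neighbours; data objects are assumed placeholders.
normalize <- function(x, ref) (x - min(ref)) / (max(ref) - min(ref))
x_train_n <- mapply(normalize, x_train, x_train)
x_valid_n <- mapply(normalize, x_valid, x_train)
knn_pred  <- knn.reg(train = x_train_n, test = x_valid_n,
                     y = resid_train, k = 5)$pred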

Multiple Linear Regression (MLR) and Second-Order Volterra Series Model (SOV)
The parameters in MLR are the slope and bias coefficients associated with the design matrix (input variables), while the parameters in SOV are the bias (zero-order) and kernel coefficients associated with the first- and second-order interactions amongst the input variables.
The parameters in MLR and SOV were solved via ordinary least squares (Wu & Kareem, 2014).
Note that there are no tunable hyper-parameters for the MLR and SOV models. The MLR and SOV models were developed using custom functions in R.

Artificial Neural Networks (ANN)
The parameters in ANN include the input, hidden, and output layer weights as well as the hidden and output layer biases. The particular application of ANN used in this study is based on the avNNet function in the caret R package (Kuhn, 2019), which makes use of the nnet R package (Venables & Ripley, 2002). This implementation of ANN individually trains several networks with different randomly initialized parameters via the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm (Fletcher, 1987) and averages their predictions. The hyper-parameters include the number of networks (set by default to 5), the number of training epochs (or iterations, set by default to 100), the number of hidden nodes (since only a single hidden layer is used), and the decay rate.
A grid search over the decay rate (1e-3, 0.01, 0.1) and the number of hidden neurons (with the search range determined as a function of the number of inputs in the ANN) (Hecht-Nielsen, 1989;Fatehi et al., 2015) was carried out using 5-fold cross-validation and the RMSE as the objective function, to determine optimal values for these hyper-parameters. The ANN models adopted linear activation functions in the output layer and sigmoid (logistic) activation functions in the hidden layer. All model inputs were normalized before training the ANN (i.e., using the same approach as KNN).
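A condensed R sketch of this configuration is shown below; the hidden-layer sizes in the grid are illustrative (the study derived the range from the number of inputs), and the data objects are assumed placeholders.

library(caret)

# 5-fold CV grid search over decay and hidden-layer size for avNNet (which
# averages several nnet networks); 'size' values are illustrative only.
ctrl <- trainControl(method = "cv", number = 5)
grid <- expand.grid(size = c(3, 5, 7), decay = c(0.001, 0.01, 0.1), bag = FALSE)
ann_fit <- train(x = x_train_n, y = resid_train, method = "avNNet",
                 trControl = ctrl, tuneGrid = grid, metric = "RMSE",
                 linout = TRUE,      # linear output activation
                 maxit = 100,        # number of training epochs
                 trace = FALSE)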

Random Forests (RF)
The parameters in RF are the splitting variables and split points at each node of the decision tree, while the hyper-parameters include the number of trees (ntree), the number of variables selected randomly at each split (mtry), and the minimum node size (nodesize) (Hastie et al., 2009).
In practice, it is generally found that once a sufficient number of trees have been considered in the 'forest', adding additional trees beyond this point does not substantially increase performance (for example, see the large-scale study by Probst & Boulesteix (2018)) and comes at an increased computational cost (Hastie et al., 2009). However, in order to strike a balance between predictive accuracy and computational efficiency, the minimum node size is fixed in this study (at its default value, 5) and a two-stage approach is used to identify suitable values for ntree and mtry, as described below.
At first, a preliminary analysis was undertaken revealing that 300 trees (ntree = 300) and a minimum node size of 5 led to stable performance when mtry was within the range of 1 to 5. Thus, ntree and the minimum node size were held constant at 300 and 5, respectively, and mtry was optimized through a grid search (mtry = 1, 2, ..., 5) using 5-fold cross-validation (where the RMSE was used as the objective function).
The RF models were developed using a combination of the caret and ranger R packages (Kuhn et al., 2019;Wright & Ziegler, 2017).
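A brief sketch of the second stage (the grid search over mtry using caret and ranger) might look as follows; the data frame layout is an assumption, not the study's actual code.

library(caret)

# Grid search over mtry (1 to 5) with ntree = 300 and minimum node size 5,
# 5-fold CV and RMSE; 'train_df' with target column 'resid' is an assumed layout.
ctrl <- trainControl(method = "cv", number = 5)
grid <- expand.grid(mtry = 1:5, splitrule = "variance", min.node.size = 5)
rf_fit <- train(resid ~ ., data = train_df, method = "ranger",
                num.trees = 300, trControl = ctrl, tuneGrid = grid,
                metric = "RMSE")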

eXtreme Gradient Boosting (XGB)
The parameters in XGB, similar to RF, are the variables and values used at each split.
The XGB hyper-parameters are described in detail in Chen & Guestrin (2016) and Chen et al. (2019).

Quantitative Assessment of the CDDA Performance
Deterministic performance metrics for the mean ensemble simulation (i.e., the mean over all 1000 ensemble members) are presented for all CDDA variants in Table 2, whereas Table 3 illustrates the ensemble performance metrics (that consider all 1000 ensemble members). These values are compared to the performance criteria computed for the benchmark, which can be used to determine if and by how much the performance of the (ensemble) streamflow simulation is improved when using the CDDA instead of the HM only. The performance criteria are provided for all three study catchments. Given that the CDDA generates ensemble streamflow simulations, it is also very important to assess ensemble performance when comparing the CDDA to the standalone HBV. To assess the properties of the uncertainty intervals of the CDDA simulations versus the HBV-based simulations (benchmark), three ensemble performance metrics were considered: the mean continuous ranked probability score, the mean interval score, and the average width of uncertainty intervals, which are presented in Table 3. The CDDA based on MLR led to only marginal improvements in these criteria (e.g., the CRPS was improved by 6% and 4% for the Dünnern and Muota catchments, respectively, and no change was observed for the Kleine Emme).
In general, based on both deterministic and ensemble performance criteria, it can be concluded that all CDDA variants led to an improvement in the (ensemble) streamflow simulations with reference to the HM-based simulations. Among all tested CDDA variants, the CDDA based on XGB led to the largest improvement in the ensemble streamflow simulations.
The second best model was the one based on RF. The third and fourth best models were XGB_6 and RF_6, followed by ANN, SOV and KNN. The worst simulation performance was achieved by the CDDA adopting MLR, which was only slightly better than the benchmark. Thus, it may be argued that linear DDMs are inappropriate for fitting the HM residuals in the study catchments (see further Section 5.3).

Importance of Input Variables in DDMs
To simulate the HM residuals, the different DDMs considered several input variables.
These input variables were streamflow, precipitation, and air temperature observed at previous days. In this study, a timeframe of up to 9 days preceding the day of simulation was considered for all three candidate input variables as described in Section 2.4. Hence, the previous 9 days of streamflow (t-1, ..., t-9), as well as the current and previous 9 days (t-0, ..., t-9) of the precipitation and the air temperature were considered as input variables. In total, 29 different input variables were considered as potential predictors of the HM residuals. IVS was performed for each of the 1000 ensemble members, considering the HM residuals as the target variable, in order to determine the best predictors to use for each individual residual series. Since PCIS and EA are both model-free IVS methods, they were run independent of the DDMs (i.e., the inputs and model parameters were determined separately). In contrast, RF and XGB are model-based IVS approaches; thus, the inputs selected by these methods are related to the model parameters determined during training.
The importance scores of the input variables selected by the different IVS methods associated with the CDDA variants are illustrated in Figures 6-8 for the three study catchments, which summarize the importance scores across all 1000 ensemble members using box plots. The higher the importance score, the stronger impact the input variable has on the simulated residuals. Since the KNN, SOV and ANN models use the exact same inputs as determined by the EA IVS method, only a single plot is considered for these methods. Since PCIS is a linear IVS method, it is solely coupled with MLR. Note that for XGB_6 and RF_6, only six input variables are plotted as these model variants considered only the six most important input variables determined by their base method (XGB and RF). It is important to note that all variables presented in these plots with importance scores above 0 were not necessarily selected for each of the 1000 ensemble members, but any inputs with importance scores higher than 0 were selected at least once (in the 1000 ensemble members).

Figure 6. Importance of input variables in different DDMs (Dünnern). Different variable importance scores for the DDM are described in Section 2.4. Q, P, and T represent observed streamflow, precipitation and air temperature at preceding days (t-0, t-1, ..., t-9). Note that observed streamflow at t-0 is not considered as it is the target simulation day.

From Figures 6-8 it can be seen that, generally, the importance of the input variables increases with decreasing lag time, i.e., a higher importance is given to input variables whose lag time directly precedes the simulation day. For all CDDA variants, among the three considered input variables, precipitation was the strongest predictor of the HM residuals, followed by streamflow. The air temperature was the weakest predictor among all DDMs apart from the MLR model where it was a stronger predictor than the streamflow in two out of three catchments.
Looking at different models in detail, it appears that for KNN, SOV and ANN (which use the EA method based on conditional mutual information for IVS), the air temperature does not play any significant role, while the streamflow from the three preceding days (t-1, ..., t-3) and the precipitation from the day of simulation and the three preceding days (t-0, ..., t-3) were the most important for simulating HM residuals in all three catchments. For MLR, the precipitation (t-0, ..., t-3) was very important in all catchments, whereas the importance of the air temperature and streamflow varied depending on the catchment. In RF and XGB, the precipitation (t-0, ..., t-3), the streamflow (t-1, ..., t-3) and the temperature (t-0, ..., t-3) were all very important in all three catchments, with precipitation (t-0, t-1, t-2) and streamflow t-1 being the most important. Yet, it can be noticed that all input variables have above 0 importance in RF and XGB, meaning that all variables contribute to the overall ensemble HM residual simulations. For RF_6 and XGB_6, which consider only the six most important variables (from their respective base model), the selected variables were always streamflow t-1 and t-2, and precipitation t-0, t-1, and t-2 in all three catchments. The sixth most important selected variable varied depending on the catchment and it was either air temperature t-2 (Dünnern), air temperature t-0 (Kleine Emme), or streamflow t-3 (Muota). The order of importance for these six variables varied depending on the catchment; however, streamflow t-1 and precipitation t-0 and t-1 were always found to be very important. An interesting result was obtained by RF_6 for the Muota catchment: when only the six most important inputs identified by RF were considered in RF_6, it was found that precipitation (t-2) did not add any information that was useful when simulating the HM residuals. This suggests that, when all 29 inputs are considered, there is another input (outside of the other five selected inputs) that, when combined with precipitation (t-2), adds information that is useful for simulating the HM residuals. While it is outside the scope of this research to identify the other inter-dependent input, the interested reader may refer to Galelli et al. (2014), who discuss the inter-dependency of inputs in IVS.

Effect of the Ensemble Size
Simulations of both the CDDA and HM models were based upon 1000 ensemble members that originate from the 1000 optimized parameter sets of the HBV model. Yet, it is not clear whether use of all 1000 ensemble members is of value for the CDDA. Based on Figure 9, it can be seen that as few as 100 ensemble members provide roughly the same ensemble performance as 1000 ensemble members for both the HBV-based simulations (benchmark) and for the CDDA-based simulations when considering the CRPS. Using fewer than 100 ensemble members leads to a visible deterioration in the CRPS, whereas using more than 100 ensemble members does not significantly improve the model performance. The strongest improvement is noticed between one and 10 ensemble members, which seems logical when moving from the deterministic approach (only one ensemble member) towards a probabilistic approach (several ensemble members). This effect of growing the ensemble size on the model performance was visible for all tested CDDA variants as well as for the HBV model in all three study catchments.

Discussion
In this work an ensemble-based conceptual-data-driven approach (CDDA) was developed and tested. In this section, the main findings and the current limitations are discussed, and some recommendations for future research are given.

Different DDM and Importance of Input Variables
Six different DDMs and two additional variants were tested for simulating HM residuals.
These DDMs link the HM residuals on the simulation day (t-0) to observed precipitation and air temperature on the simulation day and those preceding it (t-0 to t-9) as well as streamflow on preceding days (t-1 to t-9). This maximal lag time of nine days was determined by studying the nonlinear correlation between explanatory variables and the HM residuals (see Section 3.5).
These results demonstrated that, generally, all DDMs coupled with the HM within the CDDA framework led to an improvement in the ensemble streamflow simulations with reference to the HM-based simulations (Section 4.2). This increase in the streamflow simulation performance (measured by deterministic and ensemble performance metrics) was, however, not the same for all tested DDM variants.
Analysis of the input variable importance (Section 4.3) revealed that the importance of the input variables increases with decreasing lag time, i.e., input variables from days directly preceding the simulation day were of a higher importance than those from the distant past. For all DDMs, the observed streamflow (t-1, …, t-3) and precipitation (t-0, …, t-3) from the current and preceding three days had the highest impact on the HM simulated residuals while for certain methods (RF and XGB), the air temperature from the current and three preceding days (t-0, …, t-3) were also deemed useful inputs. However, for XGB and RF, observations farther in the past (t-4 to t-9) also had an impact on the simulated residuals. Hence, it appears that, despite the maximal lag time of nine days, the effective memory length, which determines the lag length beyond which input variables have only marginal or null effect on the simulated output, is about three days directly preceding the simulation day. This memory lag length (Bowden et al., 2005) may be further explained by the catchment memory to past inputs that determines how long water is retained in the catchment in different forms such as aquifers, snowpack or groundwater storage (Müftüoğlu, 1984;Rajurkar et al., 2004). This catchment memory is often identified as precipitation influence history (Müftüoğlu, 1991), which is the strongest predictor of the streamflow. The length of the catchment memory varies between catchments and may be from several hours to several days or longer. As the results indicate, not only the simulated streamflow but also residuals of the hydrological model may be linked to the catchment memory.
From these findings, it appears that the more complex DDMs, that link the residuals to additional input variables (i.e. use longer lag times), have better predictive skill in correctly mimicking the residuals of the HM. Hence, although the most important input variables seem to be observations from the last three days, using longer lag times than that with more complex models further improves the model performance. Note, however, that using longer lag times with simpler models does not improve the model performance.
Next, among the three considered input variables, the observed precipitation was the strongest predictor of the HM residuals, followed by the observed streamflow, while the observed air temperature was the weakest predictor of residuals. This seems reasonable as the precipitation is usually also the strongest predictor of streamflow in any HM that relies on the precipitation-streamflow generation concept. The air temperature is usually used within many HMs to determine the form of the precipitation (liquid or solid) and to determine whether snow melt occurs (Seibert & Vis, 2012). Thus, it seems reasonable that its impact is smaller than that of the precipitation, which determines the amount of water entering the catchment. The importance of the observed streamflow as a predictor of model residuals can be explained by the auto-correlation effect present for most hydrological models, i.e., when streamflow residuals at t-0 are correlated with residuals at preceding time steps (t-1, t-2, …, etc.) (Sorooshian & Dracup, 1980;Yang et al., 2007;Sikorska et al., 2012). The strength of this auto-correlation likely depends on the precipitation and flow conditions and for wet or high flow periods it is expected to be higher than for dry or low flow periods. This concept has been explored by Del Giudice et al. (2013), who linked the residuals of a hydrological model to the precipitation amount or streamflow via Bayesian inference. The authors found that, if conditioned on one variable, the streamflow-dependent description of model residuals performed better than the precipitation-dependent error model. This seems opposite to the findings from this study; however, DDMs that rely on only a single input variable were not investigated here, as the DDMs always included several different input variables.

Applicability of the Ensemble-based CDDA Approach and Limitations
The results demonstrated that the ensemble-based CDDA framework is very promising  (Yang et al., 2020b). Further, the use of IVS can also help reduce the computational burden of model development, which is substantial for large ensembles (e.g., 1000 members), although running the models is extremely quick in an operational setting. Thus, by reducing the computational burden, IVS may also increase the exploration of alternative DDMs.
Additionally, IVS may also prove useful in reducing the number of ensemble members in the CDDA by identifying a reduced number of members that maintain a similar level of performance as the initial ensemble size. For example, it was found that 100 ensemble members (i.e., the first 100 out of 1000 randomly generated members) carried roughly the same level of information as the entire 1000-member ensemble (see Section 4.4).
The major limitation of this study is that only parametric uncertainty of the hydrological model (here, HBV) was investigated and represented with multiple optimized parameter sets (1000), without explicitly considering any other uncertainty sources of the HM, such as uncertainties in the inputs or model structure (Renard et al., 2011;Sikorska & Renard, 2017).
At present, the proposed ensemble-based CDDA framework has only been tested in three gauged catchments. Yet, it would be very interesting to test the approach in ungauged catchments (without streamflow observations). This would enable improving streamflow simulations at sites where performance is the lowest (due to a lack of recorded data to calibrate a HM). As a direct calibration of the CDDA and its components (i.e., the HM or DDM) is not possible at ungauged locations, other methods to inform both models should be sought: regionalization approaches could be used to inform the HM model parameters, while training in a large set of catchments with different properties could help to constrain information on simulated residuals and transfer these residuals to ungauged catchments that have similar properties. Gauch et al. (2021) have revealed that using data from a large set of catchments to train a single DDM yields better streamflow simulation results, also in poorly gauged regions, suggesting potential towards generalization of DDMs. Wu et al. (2019) have also demonstrated that a DDM used to simulate residuals of an un-calibrated HM may improve its predictive performance.
Finally, it is important to note that in an operational context, if there are no streamflow observations available, then lagged measurements of the streamflow cannot be used as an input variable for simulating streamflow via the CDDA. This is relevant, as there may be cases where streamflow gauging stations may be temporarily offline or are no longer in operation due to decommissioning, damage due to a flooding event, etc. (Tencaliec et al., 2015;Villalba et al., 2021); thus, obtaining updated information on recent streamflow is not possible. In these cases, while streamflow data may be used for calibrating the CDDA (i.e., using streamflow measurements from a limited historical database as the target variable), other input variables may be required instead of streamflow to improve its predictive performance.

Recommendations
To summarize, it appears from the results that it is generally better to use (almost) any (nonlinear) DDM to simulate the residuals of a hydrological model than to use a standalone HM. However, in the case of the simplest data-driven model (MLR), the improvement in the ensemble streamflow simulation was only marginal, whereas more complex (nonlinear) models (XGB and RF) led to a significant improvement in the ensemble streamflow simulation.
Moreover, the use of a DDM to simulate residuals entails some additional computational effort. Thus, the selection of the DDM is very important when computational power is restricted.
Since it was found that MLR barely improved the original HM simulations, this approach is not recommended, unless, of course, there is reason to believe the relationship between HM residuals and input variables is linear, which did not appear to be the case for the study catchments.
However, it was found that XGB and RF both led to significant gains in deterministic and ensemble performance over the standalone HM; given that both methods inherently perform IVS, and thus require little user intervention while providing impressive performance, they are recommended for further study. Additionally, since it was found that only 100 ensemble members provided a similar level of performance as the initial 1000-member ensemble, it is plausible, although it was not verified, that a smaller and carefully selected set of ensemble members, identified via IVS, may provide a similar level of performance as the initial ensemble size. This could be explored in two ways, by performing IVS: 1) on the HM ensemble and then developing DDMs for this reduced set or 2) on the CDDA ensemble. Speculation as to which of the two approaches would result in better performance is out of the scope of this paper but it is recommended as an interesting line of future research. Additionally, it is recommended to include other sources of uncertainty in the ensemble-based CDDA, such as those related to inputs, input variable selection, parameters, and model output, in order to improve uncertainty estimation and the overall utility of the probabilistic forecasts. Finally, other potentially useful input variables, such as potential evapotranspiration, soil moisture, relative humidity, and wind speed, are recommended to be explored in the DDM, even if such variables cannot be included in a particular HM.

Conclusions
A novel ensemble-based conceptual-data-driven approach (CDDA) has been proposed. It consists of a hydrological model (HM) to simulate an ensemble of precipitation-streamflow processes and a data-driven model (DDM) to model an ensemble of HM residuals. The DDM takes precipitation, air temperature, and streamflow observed at preceding days to simulate the residuals. Such a CDDA combines the advantages of a HM, respecting hydrological processes, with the ability of the DDM to simulate complex (nonlinear) relationships between input-target (explanatory-response) variables by tackling auto-correlated HM residuals. The CDDA does not require any statistical assumptions on the model residuals and it provides a framework for identifying suitable DDMs and input variables. Moreover, the ensemble-based CDDA is very flexible as it can be coupled with any hydrological model and any DDM for ensemble streamflow simulation. The selection of potential input variables can also be adjusted based on available data as well as user needs and specific conditions (e.g., hydrological model or type of runoff generation). Generally, the CDDA has been shown to be a very promising approach to improve ensemble streamflow simulations. Among eight variants of different DDMs, eXtreme Gradient Boosting (XGB) and Random Forests (RF) were found to be the most accurate predictors of the HM residuals and also required less user intervention in terms of selecting appropriate inputs to the DDM. Also, precipitation and streamflow from the three days directly preceding the simulation day were found to have the largest impact on simulated residuals.
Based on the results, XGB and RF models are recommended to simulate the residuals of the hydrological model within the ensemble-based CDDA framework. It was also found in this study that the number of ensemble members may be substantially reduced, i.e., from 1000 to 100, without significantly affecting model performance; this is especially important for cases where computational resources are limited.