Improving Real-time Inflow Forecasting into Hydropower Reservoirs through a Complementary Modelling Framework

Accuracy of reservoir inflow forecasts is instrumental for maximizing the value of water resources and benefits gained through hydropower generation. Improving hourly reservoir inflow forecasts over a 24 h lead time is considered within the day-ahead (Elspot) market of the Nordic exchange market. A complementary modelling framework presents an approach for improving real-time forecasting without needing to modify the pre-existing forecasting model, but instead formulating an independent additive or complementary model that captures the structure the existing operational model may be missing. We present here the application of this principle for issuing improved hourly inflow forecasts into hydropower reservoirs over extended lead times, and the parameter estimation procedure refor-mulated to deal with bias, persistence and heteroscedasticity. The procedure presented comprises an error model added on top of an unalterable constant parameter conceptual model. This procedure is applied in the 207 km 2 Krinsvatn catchment in central Norway. The structure of the error model is established based on attributes of the residual time series from the conceptual model. Besides improving forecast skills of operational models, the approach estimates the uncertainty in the complementary model structure and produces proba-bilistic inflow forecasts that entrain suitable information for reducing uncertainty in the decision-making processes in hy-dropower systems operation. Deterministic and probabilis-tic evaluations revealed an overall significant improvement in forecast accuracy for lead times up to 17 h. Evaluation of the percentage of observations bracketed in the forecasted 95 % confidence interval indicated that the degree of success in containing 95 % of the observations varies across seasons and hydrologic years.


Introduction
Hydrologic models can deliver information useful for management of natural resources and natural hazards (Beven, 2009).They are important components of hydropower planning and operation schemes where it is essential to estimate future reservoir inflows and quantify the water available for power production on a daily basis.The identification and representation of the significant responses of hydrologic systems have been diverse among hydrologists.Different hydrologists have incorporated their perceptions of the functioning of hydrologic systems into their models and come up with several rival models; some of them process based and others data based (for thorough reviews of the historic development of hydrologic modelling refer to Todini, 2007 andBeven, 2012).These models can be grouped into two main classes, conceptual and data-driven models.
Lumped conceptual hydrologic models are the most commonly used models in operational forecasting.Models of this class use sets of mathematical expressions to provide a simplified generalization of the complex natural processes of the hydrologic systems in the headwater areas of reservoirs.Application of such models conventionally requires estimating the model parameters by conditioning them to observed hy-Published by Copernicus Publications on behalf of the European Geosciences Union.drologic data.Unlike conceptual models, data-driven models establish mathematical relationship between input and output data without any explicit attempt to represent the physical processes of the hydrologic system.Reconciling the two modelling approaches and combining the advantages of both approaches (Todini, 2007) has produced some example applications in forecasting systems where the two modelling approaches are harmoniously used for improving reliability of hydrologic model outputs (e.g.Abebe and Price, 2003;Solomatine and Shrestha, 2009).
Usefulness of a model for operational prediction is determined by the level of accuracy to which the model reproduces observed hydrologic behaviour of the study area.In operational applications, evaluation of how well the models capture rainfall-runoff processes, especially the snow accumulation and melting process in cold regions, is important because of the extent to which the models accurately reproduce the reservoir inflows can significantly influence the efficiency of the hydropower reservoir operation and subsequently the power price.Application of hydrologic models for reproducing historic records can suffer from inadequacy in model structure, incorrect model parameters, or erroneous data.Consequently, despite failing to reproduce the observed hydrographs exactly, they enable simulation of hydrologic characteristics of a study catchment to a fair degree of accuracy.It gets more challenging when using the models in the operational set-up for forecasting the unknown future just based on the known past, which the model might not capture accurately.In the context of the Norwegian hydropower systems, being unable to predict future reservoir inflows accurately has negative consequences on the power producers.Norway's energy producers have to pledge the amount of energy they produce for next 24 h in the day-ahead market and if unable to provide the pledged amount of energy the chance of incurring losses is very high.Estimation of future reservoir inflows (be it long-or short-term) involves estimating the actual (initial) state of the basin, forecasting the basin inputs during the lead time, and describing the water movement during the lead time (Moll, 1983).Hence, the quality of a hydrologic forecast depends on the accuracy achieved and methodology selected in implementing each of these aspects.
In this study, we intend to use conceptual and data-driven models complementarily.A conceptual model with calibrated model parameters is used as the fundamental model that approximately captures dominant hydrologic processes and forecasts the behaviour of the catchment deterministically.A data-driven model is then formulated on the residuals, the difference between observations and predictions from the conceptual model.By studying the whole set of residuals and exploring the information they contain, important information that describes the inadequacies of the conceptual model can be extracted.In general, this kind of information can be used for improving either the conceptual model itself or the prediction skill of a forecasting system.Emulating the practice in most Norwegian hydropower reservoir opera-tors, we stick to the latter purpose with the aim of enhancing the performance of a hydropower reservoir inflow forecasting system.According to Kachroo (1992), data-driven models defined on the residuals from a conceptual model can expose whether the conceptual model is adequate to identify essential relationships exhibited in the input-output data series.Data-driven models can establish the mathematical relationship that describes the persistence revealed in the residual time series, which is caused by failure of the conceptual model to capture all the physical processes exactly.Thus, in the operational sense, the data-driven models can play a complementary role by adjusting output of the conceptual model whenever the conceptual model needs corrective adaptation (e.g.Serban and Askew, 1991;World Meteorological Organization, 1992).
Several example applications can be found in the scientific literature on using conceptual and data-driven models complementarily.For instance, Toth et al. (1999) compared performance improvements six autoregressive integrated moving average (ARIMA)-based error models brought to streamflow forecasts from a conceptual model to identify the best error model and data requirements.Shamseldin and O'Connor (2001) coupled a multi-layer neural network model on top of a conceptual rainfall-runoff model to improve accuracy of streamflow forecasts without interfering with the operation of the conceptual model.Similarly, Madsen and Skotner (2005) developed a procedure for improving operational flood forecasts by combining error models (linear and non-linear) and a general filtering technique.Xiong and O'Connor (2002) investigated performance of four errorforecast models, namely, the single autoregressive, the autoregressive threshold, the fuzzy autoregressive threshold and the artificial neural network updating models, for improving real-time flow forecasts and compared their results.Likewise, Goswami et al. (2005) examined the forecasting skill of eight error-modelling-based updating methods.A recent review on the application of error models and other data assimilation approaches for updating flow forecasts from conceptual models can be found in Liu et al. (2012).
As reviewed above, the principle of complementing conceptual models with data-driven models has enjoyed applications in real-time hydrologic forecasting since the 1990s.The methodological contribution of the present work is reformulation of the parameter estimation procedure for the databased model.We recognize that the bias, persistence and heteroscedasticity seen in the residuals from the conceptual model reflect structural inadequacy of the conceptual model to capture the catchment processes and, hence, are important in defining the manner the residual series is dealt with.Accordingly, we describe the reservoir inflows in a transformed space and present an iterative algorithm for estimating parameters of the data-driven model and the transformation parameters jointly.
Two main features distinguish application aspects of the present paper from previously published work built on the same concept of complementing conceptual models with data-driven models.First, it attempts to provide hourly reservoir inflows of improved accuracy 24 h ahead.The earlier papers mainly succeeded in improving forecasts for forecast lead times up to six time steps or incorporated a scheme to update the forecast system at an interval of six time steps.Second, an attempt is made in what follows, to produce a probabilistic forecast by estimating the uncertainty of the error model, rather than only the deterministic estimate.This, thereby, enables forecast of an ensemble of reservoir inflows, thereby allowing for a risk-based paradigm for hydropower generation to be put to use.Reasons as to why hydrologic forecasts should be probabilistic and the potential benefits therein are presented and explained in Krzysztofowicz (2001).Krzysztofowicz (1999) described a methodology for probabilistic forecasting via a deterministic hydrologic model.Li et al. (2013) presented a review of scientific papers that provide various regression and probabilistic approaches for assessing performance of hydrologic models during calibration and uncertainty assessment.Smith et al. (2012) demonstrate a good example of producing probabilistic forecasts based on deterministic forecast outputs.In this paper, the improvement levels achieved are evaluated deterministically using the same or similar metrics as past studies, and probabilistically using (i) the containing ratio (Xiong et al., 2009), which is also referred to as reliability score (e.g.Renard et al., 2010) and (ii) the probability integral transform (PIT) plot.The technique is similar to the predictive Q-Q plot (e.g.Thyer et al., 2009) but assesses, in terms of the percentiles, how close a continuous random variable transformed by its own cumulative distribution function (cdf) is to a uniform distribution.We emphasise here that taking into account uncertainties emanating from various recognized sources and describing the degree of reliability of the inflow forecasts has important benefits.According to Montanari and Brath (2004), the Bayesian forecasting system (BFS) and the generalized likelihood uncertainty estimation (GLUE) are the popular methods for inferring the uncertainty in hydrologic modelling.Yet, the scope of producing probabilistic inflow forecasts in this study is limited to attaching a certain probability to the deterministic forecasts, which are common in the Norwegian hydropower industry, based on analysis of the statistical properties of the error series from the conceptual model, and assessing its degree of reliability.
In the next section, the complementary model set-up is formulated and the performance evaluation criteria are provided.An example application is presented in the subsequent section.This includes description of the study area and data used, findings from the evaluation of the complimentary setup and its components during calibration and validation, and results of forecasting skill assessment using deterministic and reliability metrics.Finally, concluding remarks are provided.

The conceptual model set-up
The widely applied conceptual hydrologic model, HBV (Hydrologiska Byråns Vattenbalansavdelning) (Bergström, 1995), is used in this study.The version used allows for dividing the study catchment up into 10 elevation zones.A deterministic HBV model with already calibrated model parameter values was assumed to take the role of the operational hydrologic models Norwegian hydropower companies commonly use for forecasting reservoir inflows.In the operational set-up, the air temperature and precipitation input over the forecast lead time are obtained from the Norwegian Meteorological Institute (http://www.met.no).As this study aims to improve hydrologic forecasts into the hydropower reservoirs by complementing the conceptual model by an error model, we assume that the predictions from the HBV model are made using the best possible input data.Hence, the observed air temperature and precipitation data are used as input forecasts in hindcast.

The complementary error model
The error model aims at exploiting the bias, persistence and heteroscedasticity in the residuals and estimating the errors likely to occur in the forecast lead time.Forecasting the error in the lead time is regarded as a two-step process: offline identification and estimation of the error model, and error predictions based on most recent information.

Identification of the model structure
An error model that captures the structures the processes model is missing should lead to a zero-mean homoscedastic residual series from the modelling framework.In order to identify the right structure and establish a parsimonious model that adequately describes the data, we diagnose the residuals and address the bias, persistence and heteroscedasticity the series might exhibit as follows.
First and foremost, we transform the observed (Q) and the predicted ( q, from the conceptual model) inflows into z and ẑ, respectively.This way we deal with the heteroscedasticity seen in the residuals by making repeated use of Eq. ( 1) with the appropriate inflow term.
where β and λ are the transformation parameters.
The discrepancy (ε) between the observed and predicted inflow at time step (t) can be expressed as ε t = z t − ẑt .Analysis of whether the residuals are random or show some bias follows.Lest the mean of the residuals would be different from zero, the mean error (µ e ) is subtracted from the error series (ε) to produce a zero-mean residual series (e t = ε t − µ e ).

A. S. Gragne et al.: Improving real-time inflow forecasting into hydropower reservoirs
This is followed by assessment of the autocorrelation function (acf) and partial autocorrelation function (pacf), which are keys for identifying the order of Markovian dependence the residuals exhibit.We consider an autoregressive (AR) model structure (Eq.2) to represent the persistence structure in the residual series.Comparative assessment of error models of different complexity would be an interesting study but is beyond the scope of this work.Xiong and O'Connor (2002) affirm that the AR model's longstanding popularity is deservedly right and further emphasize the effectiveness of a very parsimonious model, such as the AR model, for error forecasting.
where p designates the length of the lag time, and a 1 , a 2 , . . ., a p are coefficients of the AR model.In order to provide improved hourly reservoir inflow forecasts over a 24 h lead time, the error-forecasting model takes the form of Eq. (3).In order to overcome lack of observed residuals encountered for forecast lead time (f ) longer than one-step ahead, it is necessary to utilize estimated errors as inputs (see Eq. 3).The number of estimated error values to be used as inputs depends on the identified order of the AR model and can vary across the forecast lead times.
In its complete form, the error-corrected reservoir inflow forecast (z ) from the complementary modelling framework can be given as

Parameter estimation
Parameters of the AR model can be set to the corresponding Yule-Walker estimates of a 1 , a 2 , . . ., a p given the autocorrelation function of the error series fulfils a form of the linear difference equation.However, in practice, Eq. ( 2) can be treated as a linear regression and parameters can be estimated by least squares method as demonstrated by Xiong and O'Connor (2002).An iterative algorithm suggested in Beven et al. (2008) is adopted for estimating the model parameters, while optimizing transformation of the inflow data.Adoption of a methodology that amalgamates parameter estimation and Box-Cox (Box and Cox, 1964) inspired transformation of inflow is useful for taking into account the heteroscedastic residuals and obtaining a normally distributed residual series from the error model.The parameter and inflow transformation steps with a little modification from Beven et al. (2008) over the calibration period (1, . . ., T ) are as follows: 1. Values of β, λ ≥ 0 are selected and the reservoir inflows ( q1:T , Q 1:T ) are transform to get (ẑ 1:T , z 1:T ) using Eq.(1).
3. Perform an optimization for the error-model parameters (a 1 , a 2 , . . ., a p ) to minimize (ε 1:T − ε1:T ) 2 , where ε represents the forecast from the error model which at a given observation time step (t) equals (µ e + êt ).Thus, the observed (ε) and forecasted (ε) errors at a given observation time step (t) can be related as ε t = εt + η t , where η t is a random noise that describes the total uncertainty originating from various sources.
4. Adjust (β, λ) and repeat the optimization until the residuals of the error model appear homoscedastic.The η t term (step 3) is assumed to be unimodal, symmetric and unbounded random variable with a zero expected mean and second moment given as σ 2 .

Performance evaluation
In addition to visual evaluation of the hydrographs, performance of the present procedure is robustly analysed using deterministic and reliability metrics.The root mean square error (RMSE), relative error (RE) and the Nash-Sutcliffe efficiency (NSE) (Nash and Sutcliffe, 1970) are employed to evaluate efficiency of the models during calibration and validation deterministically.Evaluations are made with respect to varying forecast lead times and season-wise as well.
Among the three statistical performance criteria, the RE (Eq.5) measures the relative error between the total observed and predicted inflow volume.For a good simulation the value of RE is expected to be close to zero.Quantifying the relative error (RE) of the simulations/forecasts is important because it indicates how the inaccuracies affect a hydropower company's ability to deliver the amount of energy it has pledged to provide to the energy market.Therefore, special attention is given to the less aggregate version of RE, which we refer to as percentage volume error (hereafter PVE) and describe as follows.
The PVE designates the relative error at each time step, which in reference to Eq. ( 5) can be obtained by omitting aggregation of the errors by summation.It indicates the magnitude of the errors as percentage of the observed inflows at each inflow time step.From a hydropower systems operations point of view, the PVE enables evaluation of the forecast errors at each time step and assess implication on the power production capacity directly.The PVE analysis devised here divides the computed PVEs into six PVE classes (i.e. ≤ 10, 10-20, 20-30, 30-40, 40-50 and > 50 %), and treats overestimates and underestimates separately.The number of times each of the six absolute PVE classes appeared in the set or subset of interest (i.e.hydrologic year or seasons) is constructed by keeping score of the PVE class into which each and every residual fell in.Then the fraction of time in which each PVE class occurred is divided into the total number of points in the given set/subset and is reported as a percentage.This is designated as a "PVE count".Model performance assessment using PVE (during simulation and forecasting) mainly focuses on assessing the change in the number of incidences in each PVE set, which in other words means the change in PVE counts.The PVE count/change in PVE count, along with the above-mentioned deterministic statistical criteria, is used for evaluating the simulation and forecasting skill of the complementarily set-up system (conceptual model + error model).As a metric for measuring the relative improvement in forecasting skills, high PVE counts for the low PVE classes (e.g.≤ 10 %) are considered desirable quality.The justification is that the penalty a power producer incurs when failing to deliver the pledged amount of power would be lesser if its forecasting system makes errors of lower PVE classes more frequently.
Another useful metric used for assessing forecasting skill of the complementary set-up is through uncertainty analysis.An interval forecast (Chatfield, 2000) can be constructed by specifying an upper and lower limit between which the future reservoir inflow is expected to lie with a certain probability (1 − α).The prediction interval for the inflow forecast is estimated using the Linear Regression Variance Estimator (LRVE) described by Shrestha and Solomatine (2006).Xiong et al. (2009) outlined several indices that can serve for describing the properties of prediction bounds of particular probability and for comparative study of prediction intervals resulting from different uncertainty assessment schemes.The indices characterise the prediction bound either by the percentage of observations it contains, its bandwidth, or its symmetry relative to the observation.Of all indices, according to Xiong et al. (2009), the containing ratio (CR), which describes the percentage of observed inflows falling in the desired interval percentage, is the widely used metric for assessing reliability of probabilistic forecasts.We adopt the CR metric for describing the reliability of the forecasts with the desired interval percentage of 95 % (α = 0.05).In addition to the CR, we verify the probabilistic forecasts graphically using the less formal PIT uniform probability plot.The working procedure as well as detailed application examples can be found in Laio and Tamea (2007) and Thyer et al. (2009).Among others, Pokhrel et al. (2013) and Wang et al. (2009) demonstrated viability of the "PIT uniform probability plot" approach for checking uniformity (and investigating the causes, in cases of deviations from uniformity) without binning the data subjectively.

Study area and data
The Krinsvatn catchment is located in Nord Trøndelag County in mid-north Norway.It comprises an area of 207 km 2 and about 57 % of the catchment is mountain area above the treeline.The elevation ranges from 87 to 628 m a.m.s.l.(above mean sea level) and is drained by the Stjørna/Nord River.The dominant land use is forest covering 20.2 % of the study site while marsh, lakes and farmlands cover about 9, 6.7 and 0.4 % of the catchment area, respectively.Figure 1 provides the location and main characteristics of the study site, and the daily potential evapotranspiration values used.
Observed hourly data of 11 water years (September 2000 to August 2011) were split into three sets used for warmingup (2000), calibrating (2001-2005) and validating (2006)(2007)(2008)(2009)(2010) the conceptual and the error models alike.Observed precipitation and temperature data of two meteorological stations (i.e.Svar-Sliper and Mørre-Breivoll) in neighbouring catchments are used.Discharge data for the catchment are derived from water level records at the Krinsvatn gauge station.Romanowicz et al. (2006) outline the advantages to direct use of water-level information in hydrologic forecasting.Rating curve uncertainties and their influence on the accuracy of flood predictions have been very well documented (e.g.Sikorska et al., 2013;Aronica et al., 2006;Pappenberger et al., 2006;Petersen-Overleir et al., 2009).Krinsvatn is considered a stable discharge measurement site with few external influences, and the rating curve was updated in 2004.This study, however, considers the uncertainty of the rating curve to be one of the factors contributing to the total error expressed in Eq. (2) and does not address it separately.

HBV model for Krinsvatn catchment
The catchment is divided into 10 elevation zones in the HBV model set-up.Input data used are hourly areal precipitation, air temperature, and potential evapotranspiration.The model is run on an hourly time step for the water years 2000 to 2005 with the last 5 water years being used for model calibration.Calibration is carried out using the shuffled complex evolution algorithm (Duan et al., 1993), with the NSE between the observed and predicted flows as an objective function.Description of the model parameters along the corresponding optimized values is provided in Table 1.

Overview of the conceptual model's performance
The simulation and observed reservoir inflow hydrographs shown in Fig. 2 indicate a certain level of agreement for most of the calibration and validation periods, which the statistical evaluations (Table 2) agree with.The overall hourly reservoir inflow predictions during calibration and validation show efficiency of NSE > 0.5 and RE < ±25 %, even though simu-   PVE counts of the six PVE classes (i.e.≤ 10, 10-20, 20-30, 30-40, 40-50 and > 50 %) are computed on the residuals between observed and simulated reservoir inflows.The stacked columns of Fig. 3a and b show how frequently each of the six absolute PVE classes occurred over the calibration and validation period.The results reveal a large degree of discrepancy between observations and predictions during calibration and validation.Simulated inflows deviated from the corresponding observed values by a magnitude of more than ±10 % in about 83.3 % (calibration) and 88.6 % (validation) of the respective simulation time steps.Huge difference between observations and simulations is noted in the summer season with absolute PVE of the class > 50 % occurring in more than half of the simulation time steps throughout the calibration and validation periods.Winter simulations listed the highest level of occurrence of PVE of the class ≤ ±10 % during both calibration and validation.Comparable to the results in Table 2, volume errors in winter simulations do not seem to be a serious problem, probably because the season is predominantly a snow accumulation rather than runoff generation period.Errors of the high absolute PVE classes scored high PVE counts in the spring and autumn seasons.
Details of the extent to which the reservoir inflows are under-and overestimated can be seen in Fig. 3c and d.The fraction of time the simulated inflows exhibited under-and overestimation during calibration is 51.9 and 46.8 %, respec- tively.In the validation period, the reservoir inflows are underestimated about 65.6 % of the time compared to overestimation in 33.4 % of the time.This is also revealed in the findings from statistical metrics in Table 2, which disclose the bias in the model.Yet, the results in Fig. 3 further reveal that the model predictions deviate from the observations at high discharges.For example, during the validation period 59.2 % of the time observations exceeded the predictions by magnitudes of more than 10 %.Such information is useful because direct evaluation of observed and predicted values explains the implications of model performance on the planning and operation of a hydropower system better than an aggregated variance-based statistic.From an operational management point of view, considerable underestimation of reservoir inflows can have both short-term and long-term effects on the operation of a hydropower system.In the short-term, the company could be forced to release unvalued water especially when the reservoir water level is close to its maximum capacity.Hence, the high percentage of underestimations that occur in the autumn and spring seasons (during calibration and validation) should not be tolerated because the inflows in the autumn and spring seasons are very important.On the one hand, substantial overestimation of reservoir inflows can at least expose any Norwegian hydropower company to undesirable expenses due to obligations to match the power supply it has failed to deliver by dealing with other producers in the intra-day physical market (Elbas).Although overestimation does not seem to be a pertinent issue, Fig. 3d unmasks that the inflows are overestimated by a magnitude > 50 % at least 10 % of the time in all seasons.

Residual analysis
Following the example of Xu (2001), a Kolmogorov-Smirnov test is applied to residuals of the conceptual model.The test revealed that the residuals are not normally distributed.The maximum deviation between the theoretical and the sample lines is 0.130, which is larger than Kolmogorov-Smirnov test statistic of 0.008 at significance level α = 0.05.
Presence of homoscedasticity in the residuals series is diagnosed visually by plotting the residuals versus the predicted reservoir inflows (Fig. 4a).With respect to the horizontal axis, the scattergram does not remain symmetric for the entire range of predicted inflows.The residuals show high variability and possible systematic bias when inflows are less than 3.5 mm while the opposite is true when the inflows exceed 3.5 mm.Inflows of magnitudes between 3.5 and 5.5 mm seem to be underestimated, while overestimation is visible when the inflow rates are greater than 5.5 mm.However, as can be seen from Fig. 2, inflows of magnitude up to 3 mm represent reservoir inflows during the rise of the hydrographs including all peak inflows for all hydrologic years except 2005 and 2010.Hence, except for the possible systematic bias during low flows, the inference from the scatter plot is inconclusive to support or dismiss the issue of predominant underestimation revealed in the model performance evaluation.Moreover, hourly inflows of magnitudes higher than 3mm are rare and occurred about 0.1 % of the time over the calibration and validation period.
Plots of autocorrelation and partial autocorrelation functions of the residual time series (Fig. 4b and c) indicate a strong time persistence structure in the error series.Rapid decaying of the partial autocorrelation function confirms the dominance of an autoregressive process, which the gradually decaying pattern of the autocorrelation function also suggests.Thus, in order to obtain a Gaussian series, it is important to address issues of heteroscedasticity and serial correlation in the residual series.As the current study aims at utilising the persistent structure in the residuals for supplementing the forecasting system, the corrective action to be taken only aims at removing the heteroscedasticity.A successful way to do it is through transformation of the flow data (e.g.Engeland et al., 2005).As outlined in the methodology section, the reservoir inflows (both observed and predicted) are transformed while estimating parameters of the error model.

Structure and performance of the error model
In accordance with the findings from the ACF and PACF plots discussed in Sect.3.3.2,AR models of up to an order of p = 3 were investigated while estimating parameters of the error model.As outlined in Sect.2.2.2, coefficient of the AR(p) model and the transformation parameters were estimated by minimizing the sum of the squares of the offsets between the inflows (observed and predicted) in the transformed space, and assessment of whether the subsequent residuals from the complementary modelling framework appear homoscedastic and exhibited correlation.The latter was assessed using the Kolmogorov-Smirnov (KS) statistic as a relative quantitative measure followed by visual inspection of the residual plots, which led to the selection of an AR(1) model with transformation parameters β = 41.4 and λ = 0.9, bias correction µ e = 0.021 and coefficient a 1 = 0.97.
Calibration efficiencies calculated for the error model using the RMSE, RE and NSE metrics are 0.096, −100 % and 0.517, respectively.Corresponding values for the validation period are computed as 0.095, 20.3 % and 0.630, respectively.NSE values for the calibration and validation periods imply the ability of the error model to capture at least half of the discrepancies observed between observations and predictions from the conceptual model.All the three metrics reveal a higher efficiency in the validation set than the calibration set.With reference to Table 2, this suggests too much fitting of the HBV model to the data that led to extraction of more information from the calibration set.Assessment of the residuals from the complementary framework reveals that the transformation reduced the maximum deviation between the theoretical and the sample lines slightly from 0.13 to 0.10; yet the residuals are not normally distributed (i.e.Kolmogorov-Smirnov statistic of 0.008 at significance level of α = 0.05).This implies the assumption that the residuals from the complementary forecasting system would be Gaussian is far from being true.As the aim of this study is to utilize the error and complementary models additively, we discuss in the next section the extent to which the complementary set-up boosted prediction ability in the forecasting mode and come back to the issue of violation of the Gaussian assumption in section 3.5, where we analyse the reliability of the forecasts probabilistically.

Forecasting skill of the complementary set-up (deterministic assessment)
Imitating operational application of forecasting models in the Norwegian hydropower system, reservoir inflows for the day-ahead market (Elspot) are estimated using the presented forecasting system.The system has to run once a day at an hourly time step, sometime before 12:00 LT after retrieving the latest observations, and the inflow forecasts are issued for the next 24-hourly time steps beginning from 12:00 LT.Overall performance of the complementary model in forecasting the reservoir inflows during the calibration and validation periods is first discussed and is followed by evaluation of its forecasting skill with respect to forecast lead times.Evaluation of the forecast skill presented in this paper is based on assessment of forecasts made for the period between September 2006 and August 2011 as the data sets from September 2000 to August 2006 are used for calibrating the system.

Overall performance
Assessment of the overall forecasting skill of the complementary set-up shows significant improvement in forecast accuracy.The RMSE and NSE statistical criteria computed between forecasted and observed inflows are 0.095 and 0.896, respectively.RMSE values for the autumn, winter, spring and summer forecasts are 0.094, 0.090, 0.132 and 0.044, respectively, and the corresponding NSE values are 0.904, 0.905, 0.859 and 0.873.Proving capability of the complementary set-up to reduce the bias revealed in the simulation forecasts from the concep-tual model, which was pointed out in the previous section, the 24 h lead-time forecasts exhibited low-level underestimation bias with RE equal to 3.8 %.Degree of bias in the inflow forecasts differed seasonally.The RE computed for each season in a decreasing order is summer (10.2 %), spring (4.6 %), autumn (2.9 %) and winter (0.7 %).The relatively higher bias in the spring and autumn forecasts can be related to runoff generation in the Krinsvatn catchment due to snowmelt or occurrence of precipitation in the form of rainfall, which can affect the persistence structure in the residual series obtained from the conceptual model.
Stacked-column plots in Fig. 5 display the occurrence level of each of the six PVE classes in the residual series between forecasts and observations.Visual comparison of stacked-column plots of Fig. 5 and Fig. 3 shows reduction in PVE count of the high PVE classes and increase in PVE counts of low PVE classes; e.g.PVE count for the PVE class > ±50 % decreased by about 15 %, while PVE count for the PVE class ≤ ±10 % grew by about 50 %.In order to assess this assertion, a further assessment is carried out by dividing the six PVE classes into two groups: low PVE (PVE ≤ ±10 %) and high PVE (PVE > ±10 %).Ratio between seasonal PVE counts of the low and high PVE classes is taken and comparison is made on two sets of residual series.These sets of residuals are (1) residuals from the simulated forecasts (conceptual model) and ( 2) residuals from forecasts of the complementary set-up.Results are presented in Table 3. Apart from confirming the success in reducing PVE counts of high PVE errors, the results indicate that an equal level of success is not achieved in all four seasons.In relative terms, high PVE errors occur more often in the spring and summer forecasts.As pointed out earlier, this can be associated with the snowmelt and, to a certain degree, to rainfall incidents occurring in these seasons.In both cases of under-and overestimation, absolute PVE of the class ≤ 10 % occurred more frequently; for example, the fraction of time reservoir inflow forecasts of 1 h lead-time deviated from the observations by a magnitude ≤ 10 % increased by about 52.7 and 27.7 % during under-and overestimations.Overall, the plots show that the magnitude of discrepancy at each forecasting point is significantly reduced.The improvement level at each forecast lead time is proportional to the vertical distance from the horizontal axis.It can be noted that, the vertical distance narrows down with increasing lead time suggesting a declining improvement level with increased lead time.93.9 87.7 81.7 76.0 70.6 64.9 59.3 54.4 50.6 47.4 44.8 42.5 40.4 38.5 36.8 35.2 33.9 32.8 30.0 31.3 30.5 29.7 29.0 28.3 10/11 94.6 88.6 82.2 75.7 69.4 63.4 57.7 52.5 48.7 46.8 44.5 41.7 39.0 36.7 34.6 32.7 31.1 29.8 28.7 27.8 26.8 25.8 24.6   Calculation of the relative RMSE reduction and the change in PVE counts agree that the forecast accuracy is improved through the complementary set-up.The assessments further revealed that the degree of improvement weakens with increased forecast lead time.However, the relative RMSE reduction computations indicate that in some occasions the simulated inflow forecasts stand out to be better.The relative RMSE reduction values for lead times longer than 20 h (Table 4) show that complementing the conceptual model with an error model is counterproductive in autumn and winter seasons of the water years 2007 and 2006, respectively.

Reliability of the inflow forecast
Computation of the CR for the entire forecast reveals that 95.8 % of the observations are inside the 95 % prediction interval.The inflow hydrographs (Fig. 8) confirm that most of the observed inflows are contained in the specified uncertainty bounds.
The percentage of observation points falling within the forecasted 95 % confidence interval varies from season to season and across hydrologic years (see Fig. 9a).All observed winter and summer inflows are bracketed in the 95 % uncertainty bound at least 95 % of the time.In general, the winter season is more of a snow accumulation period and a closer observation of the hydrographs (see Fig. 8) reveals that the summer hydrographs cover the recession and base flow portions of the annual hydrographs.Thus, better persistence structure and predictable discrepancies between simulated forecasts from the conceptual model and the observations.As Goswami et al. (2005) argued, the persistence structure in residual series primarily arises from the dynamic storage effects of a catchment system.
The desired percentage of autumn observations is contained in the 95 % prediction interval the years 2006, 2008 and 2010.In the years 2007 and 2009, however, only 93.2 and 93.8 % of the observed autumn inflows are bracketed in the estimated 95% prediction intervals, respectively.Reliability score (CR) calculations for the spring season indicate that percentage of observation points falling in the desired prediction interval percentage are below 95 % in the hydrologic years 2009 and 2010 (i.e.93.8 and 89.2 %, respectively).Unlike winter and summer inflows, autumn and spring flows mostly cover portions of the hydrograph corresponding to the rising limb or high-flow regime (see Fig. 8).While physical factors contributing to the increase in quick flow into the reservoir are precipitation incidents (in the form of rainfall) and melting of snow in the headwaters, comprehension of this concept and its encapsulation into the HBV model leaves control of the catchment response to two threshold values (TX and TS; see Table 1

for description).
Employing such simple threshold values to govern initiation of the runoff generation process based on air temperature measurement at a given time step obviously involves more sources of uncertainty (i.e.measurement, model structure and model parameters).For instance, we assume the input air temperature at a given time step is erroneously recorded to be higher than TX and/or TS due to measurement error.Subsequently, the model will partition the precipitation as rainfall and initiate melting of snow, which the observation does not reveal.This kind of misclassification of precipitation and/or misrepresentation of snow accumula- tion and melting processes can simply occur due to the error in the input temperature record.Because of this, the persistence in the errors between simulated forecasts from the conceptual model and the observations can get weaker.According to Goswami et al. (2005), some degree of persistence in the model input (i.e.rainfall) is another primary source of the persistence characteristic of observed flow series.Even though the least CR calculated for the autumn and spring seasons are by no means too bad (i.e.> 89 %), the requirement for reliability is for the uncertainty bound to contain as much fraction of observations as desired percentage of prediction interval; hence, the complementary set-up presented seems to have struggled with it in the aforementioned hydrologic years.
The fraction of observed inflows bounded within the estimated prediction interval decreases with increased lead time (Fig. 9b).The reliability score for all 24 forecast lead times fulfil the requirement of containing 95 % of the observations.For lead times beyond 19 h, the exact CR values are slightly lower than 95 % with a minimum of 94.8 % at forecasts lead time of 24 h.
From an operational hydrology point of view, we concur with the opinion of Thyer et al. (2009) that the toughest goodness-of-fit test the complementary framework has to pass is whether the predictive distribution is consistent with the observed inflow, which the PIT uniform probability plots (PIT plots) evaluate directly.This involves deriving at each time step the p value of the observation from the corresponding predictive distribution, and constructing the cumulative distribution function (cdf) of the p values.Subsequently, validity of the Gaussian hypothesis in the validation set is ex-  9c-f) reveal that the uncertainty attached to the deterministic forecasts is not always perfect.Overall, the PIT uniformity probability test confirms that the uncertainty is overestimated (i.e.low slope in the mid-range and thin tails).Irrespective of the forecast lead time, the highest degree of overestimation is noted in the summer set (i.e.most points fall outside the Kolmogorov significance band, and the p = 0.5 values deviate significantly from the bisector) and reduces from winter to autumn.On the other hand, PIT plots of the spring subset reveal that almost all transformed p values fall within the Kolmogorov significance band, which might imply validity of the Gaussian assumption used for forecasting the confidence intervals, at least, for the spring subset, and influence of high flows on the estimation of the model error variance.The latter might be one of the factors behind the overestimation of the uncertainty bands the PIT plots exhibited because the LRVE method (i.e.method used for forecasting the confidence intervals) solely relies on the historical residuals between forecasts and observations.While assessing reliability of predictive uncertainty quantifications, Thyer et al. (2009) reported violation of the probability model assumptions and poor performance of the Bayesian total error analysis (BATEA) methodology in quantifying the prediction uncertainty during lower flows than higher flows.They further exemplify that for flows of magnitudes close to zero the standard deviation the assumed output error model uses might be too high, leading to overestimation of the uncertainty.According to Schoups and Vrugt (2010), in hydrologic applications residual series are often assumed to be independent and identically distributed but these assumptions are usually violated.In the next section, we briefly assess reliability of the model identification and parameter estimation approach implemented in this study.

On the implemented parameter estimation technique
The parameter (AR model coefficient(s) and transformation parameters) estimation technique we employed (Sect.2.2.2) follows a pseudo multi-objective optimization approach, which includes minimizing the sum of squares of the residuals and making sure a homoscedastic residual series.We first employed the least squares (LS) method to estimate the parameters associated with several AR models (of the order of 1 to 3).Since the unit of the inflows (the errors as well) in the transformed space depended on the transformation parameters, and the inclusion of the transformation parameters into the calibration problem posed a challenge to identify the optimal among the candidate AR models, we resorted to the dimensionless KS statistic.The KS metric served as a relative quantitative measure to discriminate between candidate models by measuring how close-to-constant the residual variances' are.As a result, the selected AR model is suboptimal in terms of yielding the least discordance between predictions and observations.Putting aside the issue of (in)validity of the Gaussian assumption, we demonstrate that shortcomings of the present LS-and KS-based model, which we refer to as the LS-KS model, the probabilistic metrics revealed are not unique to the implemented parameter estimation approach.In order to verify this, we set-up an AR model estimating the coefficients and transformation parameters by maximizing the Gaussian maximum likelihood (GML).
An AR(2) model was identified with coefficients and transformation parameters: β = 1.08, λ = 0.01, a 1 = 1.82 and a 2 = −0.82.All the deterministic metrics used in this study confirm performance improvement of a slight degree by the GML-based model during calibration and validation.This does not come as a surprise because parameters of the LS-KS-based model were suboptimal.On the other hand, the KS test revealed that the maximum distance between the sample line and the theoretical line increased to 0.290, which is higher than the statistic the error transformation using parameterization of the LS-KS model (0.10) yielded.To be fair, comparison of the KS statistics associated with the GML and LS-KS transformation parameters might not be appropriate because the LS-KS-based AR model was selected for its low KS statistic.Nevertheless, the KS statistic corresponding to the GML-based transformation shows a heteroscedasticity of degree higher than the untransformed residuals (0.13).The PIT uniform probability plots revealed that both approaches overestimated the uncertainty in a similar pattern with the probability model assumption only honoured in the spring season.Comparison of the CR of the GML and LS-KSbased models showed a similar proportion of observations contained in the prediction interval.The CR again reveals the same characteristics of high values at short lead times and the fraction of observations contained in the prediction bound declines at longer lead times.This affirms that the validity of the Gaussian assumptions stand out as the main issue requiring further investigation in relation to probabilistic forecasting.We emphasise here the importance of formulating an appropriate likelihood function to ensure the uncertainty estimates that are derived represent the samples they are built on.Readers are referred to a framework for defining A. S. Gragne et al.: Improving real-time inflow forecasting into hydropower reservoirs the most appropriate likelihood model given the sample being used (Smith et al., 2015).While not adopted here, such a framework reduces the need to assume a likelihood function, adopting instead the most appropriate function suited to the data at hand.

Concluding remarks
In the present study, the forecasting system comprising of additively set-up conceptual and simple error models is presented.Parameters of the conceptual model were left unaltered, as are in most operational set-ups, and the data-driven model was arranged to forecast the corrective measures to be made to outputs of the conceptual models to provide more accurate inflow forecasts into hydropower reservoirs several hours ahead.
Application to the Krinsvatn catchment revealed that the present procedure could effectively improve forecast accuracy over a 24 h lead time.This proves that the efficiency of a flow forecasting system can be enhanced by setting up a data-driven model to complement a conceptual model operating in the simulation mode.Furthermore, the current study reveals that analysing characteristics of the residuals from the conceptual model is important and heteroscedastic behaviour should be addressed before identifying and estimating parameters of the error model.Compared to past studies that applied data-driven and conceptual models in a complementary way, the present procedure is successful in providing acceptably accurate forecast for extended lead times.It also outlines procedure for extracting useful information from the bias, the persistence and the heteroscedasticity the residual series from the conceptual model exhibited, although the assumption that the residuals from the modelling framework to be random failed to hold.
Results also indicate that probabilistic forecasts can be obtained from deterministic models by constructing uncertainty of the complementary set-up based on predictive uncertainty of the simple error model.The uncertainty bound seems to satisfy the reliability requirement of containing about 95 % of the observations in the prediction interval when evaluated over the entire forecasting period.Its reliability with respect to forecast lead time also appears satisfactory for all 24 forecast lead times in terms of containing the desired percentage of observations.Nevertheless, detailed assessment revealed that the degree of reliability of the forecasts vary from season to season and one hydrologic year to another.Given that the error model essentially makes use of the persistence structure in the residuals from the conceptual model, the present procedure seems to be unable to capture transitions in the hydrograph errors from over-to underestimation (and vice versa).On the one hand, it was unveiled that the degree of reliability of the forecasts decline with longer lead times and the deterministic metrics (RMSE and PVE) confirmed the same.Reliability assessment using the PIT plots revealed that, regard-less of season and lead time, the uncertainty bands somehow appear to be wider than they should be.The PIT plots spotlighted the challenge associated with forecasting confidence intervals using the LRVE or similar methods, which estimate the model error variance from the historical residuals.
In order to address these challenges, a future development can be to explore methodologies for taking care of seasonal variability in the structure of the residual series.Updating the error models periodically can be one solution but care must be taken if the selected updating method makes a Gaussian assumption.Another alternative would be to explore more complex stochastic models for the residuals, that use exogenous predictor variables either observed directly (much like the seasonal reservoir inflow forecasting models described in Sharma et al., 2000), or using state variables simulated from the conceptual model (like the Hierarchical Mixtures of Experts framework in Marshall et al., 2006 andJeremiah et al., 2013).Formulation of these models will also offer better insight into the deficiencies that exist within the HBV conceptual model, thereby allowing further improvement to reduce the structural errors present.A subsequent study (Gragne et al., 2015) attempts to address some of these issues using a filter updating procedure, which assimilates inflow measurements periodically to the error-forecasting model, and explores the potential of a data assimilation technique for improving model forecast accuracy and constraining forecast uncertainty without significant computational costs.
Another interesting topic of future investigation is the intercomparison of the probabilistic forecasts presented in the current paper with the same from popular methods such as the Bayesian forecasting system, the generalized likelihood uncertainty estimation and the Bayesian recursive estimation.We believe this would enable identification of the most effective and reliable probabilistic forecasting method that can also be implemented in an operational set-up.

Figure 1 .
Figure 1.Location, characteristics and potential evapotranspiration estimates of the study catchment.

Figure 2 .
Figure 2. Observed and predicted reservoir inflow hydrographs during calibration (left-column panels) and validation (right-column panels) of the conceptual model.

Figure 4 .
Figure 4. Plots of (a) residuals from the conceptual model as a function of predicted inflow during the calibration period, (b) autocorrelation function of the residuals, and (c) partial autocorrelation functions of the residuals.

Figure 6 .
Figure 6.Summary of relative seasonal RMSE reductions as a function of forecast lead time (minimum, mean and maximum values computed from corresponding computations for the hydrologic years 2006-2010).

Figure 9 .
Figure 9. Reliability score (containing ratio CR) for 95 % prediction interval for (a) each season of every hydrologic year, and (b) different forecast lead times based on entire series.In (c)-(f) sample PIT uniform probability plots for each of the four seasons at 1, 6, 12, 18 and 24 h forecast lead times.Solid line designates the theoretical uniform distribution, broken lines represent the Kolmogorov significance band, and the dots denote PIT value of the observed p values.

Table 1 .
Model parameters and corresponding optimized values.
of the validation period reveals the conceptual model's tendency to underestimate reservoir inflows in spring and summer considerably.In light of what the NSE and RE metrics suggest, the lower RMSE values (i.e. for instance summer season) do not reflect superior model performances.

Table 4 .
Relative RMSE reductions (%) in reservoir inflows forecast as a function of forecast lead time.

Hydrol. Earth Syst. Sci., 19, 3695-3714, 2015 www.hydrol-earth-syst-sci.net/19/3695/2015/ amined
Thyer et al. (2009)ansformed p values (i.e.transformation defined by own cdf) with that of a uniform distribution.When the two distributions plot to a straight line and the points remain within the Kolmogorov bands of 5 % significance from the diagonal bisector, the PIT plots validate consistency of the calibration assumption.Otherwise, the PIT plots invalidate consistency of the hypothesis and, among others, demonstrate whether the prediction uncertainty is over-or underpredicted.PIT plots point to an overestimated uncertainty if the points (p values) cluster around the mid-range and an underestimated uncertainty if the points (p values) cluster around the tails.We refer readers toThyer et al. (2009)for a detailed description of how to interpret the Q-Q plots, which also apply to the PIT plots.Comparison of the transformed p values (i.e.different sets based on season or lead time) with that of a uniform distribution (Fig.