When best is the enemy of good – critical evaluation of performance criteria in hydrological models

Performance criteria play a key role in the calibration and evaluation of hydrological models and have been extensively developed and studied, yet some of the most widely used criteria still have unknown pitfalls. This study set out to examine counterbalancing errors, which are inherent to the Kling-Gupta Efficiency (KGE) and its variants. A total of nine performance criteria – including the KGE and its variants, as well as the Nash-Sutcliffe Efficiency (NSE) and the refined version of the Willmott's index of agreement (dr) – were analysed using synthetic time series and a real case study. Results showed that, when assessing a simulation, the score of the KGE and some of its variants can be increased by concurrent over- and underestimation of discharge. These counterbalancing errors may favour bias and variability parameters, thereby preserving an overall high score of the performance criteria. As bias and variability parameters generally account for 2/3 of the weight in the equation of performance criteria such as the KGE, this can lead to an overall higher criterion score without being associated with an increase in model relevance. We recommend using (i) performance criteria that are not or less prone to counterbalancing errors (NSE, dr, modified KGE, non-parametric KGE, Diagnostic Efficiency), preferably in a multi-criteria framework, and (ii) scaling factors in the equation of the KGE and its variants to reduce the influence of counterbalancing errors.


Introduction
Hydrological models are fundamental to solving problems related to water resources. They help characterise hydrosystems (Hartmann et al., 2014), predict floods (Kauffeldt et al., 2016; Jain et al., 2018) and manage water resources (Muleta and Nicklow, 2005). Considerable research effort is thus dedicated to improving the reliability, robustness and relevance of such models. Improvements can be made by working on (i) input data, (ii) model parameters and structure, (iii) uncertainty quantification, and also (iv) model calibration (Beven, 2019). In this study, we focus on the proper use of performance criteria for calibrating and evaluating hydrological models, an important aspect that can easily be overlooked (Jackson et al., 2019).
A performance criterion aims to evaluate the goodness-of-fit of a model to observed data. It is generally expressed as a score, for which the best value corresponds to a perfect fit between predictions and observations. In hydrology, the Nash-Sutcliffe Efficiency (NSE) (Nash and Sutcliffe, 1970) is still one of the most commonly used criteria (Kling et al., 2012), although the past decade has seen a gain in popularity of alternatives (Clark et al., 2021), e.g. the Kling-Gupta Efficiency (KGE) (Gupta et al., 2009). Many authors have pointed out the inherent limitations of performance criteria, especially the fact that a single score cannot reflect all relevant hydrological aspects of a model (Gupta et al., 2009). The use of a multi-criteria framework is thus often emphasised to quantify different aspects of a model (Clark et al., 2021; Moriasi et al., 2015; Gupta et al., 1998; Jackson et al., 2019; van Werkhoven et al., 2009; Knoben et al., 2019; Althoff and Rodrigues, 2021; Ritter and Muñoz-Carpena, 2013; Krause et al., 2005; Legates and McCabe Jr., 1999), alongside a scientific evaluation of the results (Biondi et al., 2012). Knoben et al. (2019), Althoff and Rodrigues (2021) and Clark et al. (2021) pointed out that modellers should carefully consider which aspects of their hydrological model they deem most important and how to evaluate them.
Performance criteria also have shortcomings at an individual level. A number of studies have identified several limitations of the NSE: (i) the contribution of the normalised bias depends on the discharge variability of the basin, (ii) discharge variability is inevitably underestimated because the NSE is maximised when the variability equals the correlation coefficient, which is always smaller than unity, and (iii) the mean flow is not a meaningful benchmark for highly variable discharges (Gupta et al., 2009; Willmott et al., 2012). The KGE aims to address these limitations but also has its own issues (Gupta et al., 2009). Santos et al. (2018) identified pitfalls when using the KGE with a prior logarithmic transformation of the discharge. Knoben et al. (2019) warned against directly comparing NSE and KGE scores, as the KGE has no inherent benchmark. Ritter and Muñoz-Carpena (2013) and Clark et al. (2021) showed that NSE and KGE scores can be strongly influenced by a few data points, resulting in substantial uncertainties in the predictions.
What is not yet fully addressed is the trade-off between individual components (Wöhling et al., 2013) and especially the impact of counterbalancing errors induced by the bias and variability parameters that are integrated in many performance criteria. While accurate bias and variability are desired aspects of hydrological models, good evaluations may sometimes accidentally result from negative and positive errors cancelling each other out (Jackson et al., 2019; Massmann et al., 2018). This can be particularly detrimental to model calibration and evaluation, as it generates an increase in criterion score without necessarily being associated with better model relevance. Some performance criteria naturally address this problem by using absolute or squared error values, but other criteria such as the KGE and its variants do not, as they use relative errors. The aim of this study is to assess the extent to which criteria scores can be trusted for calibrating and evaluating hydrological models when predictions have concurrent over- and underestimated values. The influence of counterbalancing errors is evaluated on nine performance criteria, including the NSE and the KGE. This selection is far from being exhaustive but includes widely used and recently proposed KGE variants, as well as more traditional criteria such as the NSE and the refined version of the Willmott's index of agreement (dr) for comparison purposes. We first use synthetic time series to highlight the mechanism of counterbalancing errors. Second, we show how counterbalancing errors can impair the interpretation of hydrological models in a real case study. Finally, we provide some recommendations on the use of scaling factors and the choice of appropriate performance criteria to nullify or reduce the influence of counterbalancing errors.

Parameters description
All the performance criteria considered in this study are based on the same or similar statistical indicators, which are first described to avoid repetition.
We use Qobs(t) and Qsim(t) to refer to the observed and simulated values of the calibration variable Q at a specific time step t. r and rs correspond to the Pearson and the Spearman rank correlation coefficients (Freedman et al., 2007), respectively.

β is the ratio between the mean of simulated values μsim and the mean of observed values μobs:

β = μsim / μobs

βn corresponds to the bias (mean error) normalised by the standard deviation of observed values σobs:

βn = (μsim − μobs) / σobs

α is the ratio between the standard deviation of simulated values σsim and the standard deviation of observed values σobs:

α = σsim / σobs

γ is the ratio between the coefficient of variation of simulated values (CVsim = σsim/μsim) and the coefficient of variation of observed values (CVobs = σobs/μobs):

γ = CVsim / CVobs

The relative bias Brel compares simulated and observed values along the Flow Duration Curve (FDC), where Qsim(p) and Qobs(p) correspond to the simulated and observed values of the calibration variable at exceedance probability p:

Brel(p) = (Qsim(p) − Qobs(p)) / Qobs(p)

B̄rel is the mean of Brel(p) when looking at n observations. |Barea| is calculated as follows:

|Barea| = ∫ |Brest(p)| dp

with Brest the residual bias:

Brest(p) = Brel(p) − B̄rel

αNP (Pool et al., 2018) is also based on the FDC:

αNP = 1 − (1/2) Σk |Qsim(I(k)) / (n μsim) − Qobs(J(k)) / (n μobs)|

where I(k) and J(k) stand for the time steps of the k-th largest discharge for the simulated and observed time series, respectively.

As β, βn and B̄rel all represent the bias, they are designated as "bias parameters" in this study.
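The parameters above translate directly into code. The following is a minimal sketch using NumPy; the function names are ours, not from the paper, and np.std is the population standard deviation:

```python
import numpy as np

def beta(sim, obs):
    # bias ratio: mean of simulated over mean of observed values
    return np.mean(sim) / np.mean(obs)

def beta_n(sim, obs):
    # normalised bias: mean error divided by the observed standard deviation
    return (np.mean(sim) - np.mean(obs)) / np.std(obs)

def alpha(sim, obs):
    # variability ratio: standard deviation of simulated over observed values
    return np.std(sim) / np.std(obs)

def gamma(sim, obs):
    # ratio of the coefficients of variation (CV = std / mean)
    return (np.std(sim) / np.mean(sim)) / (np.std(obs) / np.mean(obs))
```

A simulation that doubles every observed value has β = α = 2 but γ = 1, since a multiplicative rescaling leaves the coefficient of variation unchanged.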

Score calculation
A total of nine performance criteria are analysed in this study.

The NSE (Nash and Sutcliffe, 1970) is a normalised variant of the Mean Squared Error (MSE) and compares a prediction to the observed mean of the target variable:

NSE = 1 − Σt (Qsim(t) − Qobs(t))² / Σt (Qobs(t) − μobs)²

Gupta et al. (2009) algebraically decomposed the NSE into correlation, variability and bias components:

NSE = 2αr − α² − βn²

The Kling-Gupta Efficiency (KGE) was proposed by Gupta et al. (2009) as an alternative to the NSE. The optimal KGE corresponds to the closest point of the three-dimensional Pareto front (of r, α and β) to the ideal value of [1; 1; 1]:

KGE = 1 − √[(r − 1)² + (α − 1)² + (β − 1)²]

A modified Kling-Gupta Efficiency (KGE') was proposed by Kling et al. (2012). The coefficient of variation is used instead of the standard deviation to ensure that bias and variability are not cross-correlated:

KGE' = 1 − √[(r − 1)² + (γ − 1)² + (β − 1)²]

Tang et al. (2021) proposed another variant (KGE'') that uses the normalised bias instead of β to ensure that the score is not overly sensitive to mean values (μsim or μobs) close to zero (Santos et al., 2018; Tang et al., 2021):

KGE'' = 1 − √[(r − 1)² + (α − 1)² + βn²]

Pool et al. (2018) cautioned against the implicit assumptions of the KGE (data linearity, data normality and absence of outliers) and proposed a non-parametric alternative (KGENP) to limit their impact. The non-parametric form of the variability is calculated using the Flow Duration Curve (FDC), and the Spearman rank correlation coefficient is used instead of the Pearson correlation coefficient:

KGENP = 1 − √[(rs − 1)² + (αNP − 1)² + (β − 1)²]

In a similar way, Schwemmle et al. (2021) used FDC-based parameters to account for variability and bias in another KGE variant: the Diagnostic Efficiency (DE). This criterion is based on constant, dynamic and timing errors and aims to provide a stronger link to hydrological processes (Schwemmle et al., 2021):

DE = √[B̄rel² + |Barea|² + (r − 1)²]

In this study, we used a Normalised Diagnostic Efficiency (DE') so that the best score equals one, to facilitate the comparison with the other performance criteria:

DE' = 1 − DE

Liu (2020) proposed another alternative, the Liu-Mean Efficiency (LME), to improve the simulation of extreme events. The LME aims to address the underestimation of variability of the KGE, which remains a concern despite being less severe than for the NSE (Gupta et al., 2009; Mizukami et al., 2019):

LME = 1 − √[(rα − 1)² + (β − 1)²]
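The NSE, its decomposition and the KGE can be written directly from the definitions above. This is a sketch with NumPy (our own helper names); the decomposition is an exact algebraic identity, which makes it a convenient check for either implementation:

```python
import numpy as np

def nse(sim, obs):
    # normalised mean squared error against the observed-mean benchmark
    return 1 - np.sum((sim - obs) ** 2) / np.sum((obs - np.mean(obs)) ** 2)

def nse_decomposed(sim, obs):
    # Gupta et al. (2009): NSE = 2*alpha*r - alpha^2 - beta_n^2
    r = np.corrcoef(sim, obs)[0, 1]
    a = np.std(sim) / np.std(obs)
    bn = (np.mean(sim) - np.mean(obs)) / np.std(obs)
    return 2 * a * r - a ** 2 - bn ** 2

def kge(sim, obs):
    # Euclidean distance of (r, alpha, beta) from the ideal point (1, 1, 1)
    r = np.corrcoef(sim, obs)[0, 1]
    a = np.std(sim) / np.std(obs)
    b = np.mean(sim) / np.mean(obs)
    return 1 - np.sqrt((r - 1) ** 2 + (a - 1) ** 2 + (b - 1) ** 2)
```

Note that a simulation equal to twice the observations (r = 1, α = β = 2) yields KGE = 1 − √2, illustrating that the KGE has no inherent benchmark comparable to the NSE's zero.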

Generating synthetic time series with homothetic transformations
A simulation's performance can be assessed in terms of bias, variability and timing errors (Gupta et al., 2009). Bias and variability errors correspond to a difference in volume and amplitude of discharges. Timing errors correspond to a shift in time. We created a synthetic hydrograph corresponding to one flood event as the reference (observed) time series. We also generated synthetic transformations of the reference time series, with different errors on bias and variability, corresponding to time series simulated by a model. We did not consider any timing errors, as our aim is to assess counterbalancing errors induced by bias and variability parameters. Synthetic transformations were generated by multiplying the reference time series by a coefficient C:

QT(t) = C × QR(t)

where QT(t) stands for the transformed discharge at time t, and QR(t) for the reference discharge at time t. log10(C) values were sampled between -0.36 and 0.36 at a defined interval of 0.002 to ensure a fair distribution between underestimated (C < 1) and overestimated (C > 1) transformations. This results in 361 transformations evenly distributed around the C = 1 homothety, which corresponds to the reference time series (i.e. absence of transformation). We defined the bounds of C such that the transformed peak discharge roughly ranges from half (C = 0.437) to twice (C = 2.291) that of the reference time series. Note that C homotheties still induce small timing errors (considered negligible) because the correlation coefficients (r and rs) also slightly account for the shape of the transformation. To study counterbalancing errors induced by bias and variability parameters, we generated time series consisting of two successive flood events and considered all possible combinations of the 361 transformations for the simulated time series (Fig. 1). This results in a total of 361² = 130321 transformations with two flood events, including (i) a "perfect" transformation with C = 1 for both flood events, (ii) "Bad-Good" (BG) or "Good-Bad" (GB) transformations when C = 1 for only one of the two flood events, and (iii) "Bad-Bad" (BB) transformations when C ≠ 1 for both flood events. The performance of the transformations, with regard to the reference time series, was evaluated using the nine performance criteria presented in Sect. 2. Depending on the criterion, the variability parameter can also affect the score in a similar counter-intuitive manner. β is heavily impacted by the counterbalance, whereas the effect seems mitigated for βn, αNP and |Barea|. The timing parameters (r and rs) have an expected score that favours the BG model. However, the score difference on timing errors between BB and BG models is very small (0.03 at best). The impact on the overall score is thus limited compared to the one induced by bias and variability parameters, whose counterbalancing errors can be cumulated (e.g. both β and α in the KGE) or show a larger difference (up to 0.12). Counterbalancing errors can thus result in better values for bias and variability, which increase the overall score. In this case, the highest score may not be the most appropriate indicator of model relevance.
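The sampling of the homothetic coefficients can be reproduced in a few lines. This sketch uses a hypothetical Gaussian-shaped flood event as the reference hydrograph, not the one used in the paper:

```python
import numpy as np

# hypothetical reference hydrograph: one flood event on a baseflow of 1
t = np.arange(50)
q_ref = 1 + 10 * np.exp(-0.5 * ((t - 20) / 5.0) ** 2)

# log10(C) sampled between -0.36 and 0.36 at an interval of 0.002,
# giving 361 coefficients evenly distributed around C = 1
coeffs = 10.0 ** np.arange(-0.36, 0.36 + 1e-9, 0.002)

def transform(c1, c2):
    # two-event series: each flood event scaled by its own coefficient
    return np.concatenate([c1 * q_ref, c2 * q_ref])

reference = transform(1.0, 1.0)  # 361**2 = 130321 combinations in total
```

The bounds 10^-0.36 ≈ 0.437 and 10^0.36 ≈ 2.291 match the roughly half-to-double range of peak discharge stated above.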

Identifying counterbalancing errors on a straightforward example
The largest differences in score appear for the LME and the LCE criteria, as all of their parameters are affected by counterbalancing errors. The KGE and the KGE'' also show significant differences, as they accumulate the counterbalancing errors of their bias and variability parameters. The KGE' demonstrates a smaller difference than the KGE due to the use of γ. Both FDC-based criteria, the KGENP and the DE', show the smallest differences due to αNP and |Barea|, which have nearly equal values for the BB and BG models. The NSE has a slightly better score for the BG model, and the difference is more pronounced for the dr. This example demonstrates how relative error metrics can cancel each other out and affect the design and the evaluation of hydrological models. The counterbalancing errors especially affect the bias parameters (β, βn and B̄rel) but also the variability parameter α.
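This mechanism can be reproduced numerically. Under the same assumptions as above (a hypothetical Gaussian flood event and our own helper functions, so the exact scores differ from the paper's), the KGE ranks the "Bad-Bad" combination [C1 = 0.75; C2 = 1.2] above the "Bad-Good" combination [C1 = 0.75; C2 = 1], while the squared-error-based NSE ranks them the other way around:

```python
import numpy as np

t = np.arange(50)
q = 1 + 10 * np.exp(-0.5 * ((t - 20) / 5.0) ** 2)  # hypothetical flood event
ref = np.concatenate([q, q])                        # reference: two identical events

bb = np.concatenate([0.75 * q, 1.2 * q])  # "Bad-Bad": errors on both events
bg = np.concatenate([0.75 * q, 1.0 * q])  # "Bad-Good": error on one event only

def kge(sim, obs):
    r = np.corrcoef(sim, obs)[0, 1]
    a = np.std(sim) / np.std(obs)
    b = np.mean(sim) / np.mean(obs)
    return 1 - np.sqrt((r - 1) ** 2 + (a - 1) ** 2 + (b - 1) ** 2)

def nse(sim, obs):
    return 1 - np.sum((sim - obs) ** 2) / np.sum((obs - np.mean(obs)) ** 2)

# the over- and underestimation partly cancel in the bias ratio beta...
assert abs(np.mean(bb) / np.mean(ref) - 0.975) < 1e-9
# ...so the KGE rewards the counterbalanced (BB) simulation,
assert kge(bb, ref) > kge(bg, ref)
# whereas the NSE still prefers the BG simulation
assert nse(bg, ref) > nse(bb, ref)
```

The bias ratio of the BB series is (0.75 + 1.2)/2 = 0.975, much closer to 1 than the BG series' 0.875, even though the BB series accumulates strictly more error at every time step of the second event.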

Exploring counterbalancing errors with synthetic transformations
Figure 4 shows the score distribution of the synthetic set of hydrographs presented in Sect. 3.1. For each value of C1, the minimum and maximum criteria scores of the transformations resulting from all combinations with C2 provide the dashed envelope of the score distribution, with the maximum transformation score at the top (1 corresponding to a perfect model) and the worst at the bottom. The transformations corresponding to the BG models (with C2 = 1) are represented by the black line. All transformations included in the dashed envelope can be identified as "Bad-Bad" models, except when C1 = 1 or C2 = 1 (black line). The LME criterion has a very distinctive envelope, for which the maximum score of 1 is reached for many BB models, even when both C1 and C2 differ from 1. This can be explained by the interaction between r and α that leads to an infinite number of solutions (Choi, 2022). The KGENP and the DE' (FDC-based criteria) both show similar envelopes, with a break point near the maximum transformation score on both sides of C1 = 1. This is especially pronounced for the DE', for which the BG model is nearly the best model between C1 = 0.83 and C1 = 1.17. These results show that counterbalancing errors can occur over a large range of parameters and that, when using the KGE or its variants, the more meaningful model (i.e. the BG model) may have a lower score than a "compensated" or "Bad-Bad" model. Figure 5 shows the value of C2 corresponding to the best evaluation for a given C1, by performance criterion. As identified above, the NSE and the dr both evaluate the BG models as the best transformations (the NSE and dr black lines coincide at C2 = 1, Fig. 5). Counterbalancing errors are apparent for the KGE and its variants. For C1 ≠ 1, the best transformations are always BB models and follow two conditions: (i) if C1 < 1 then C2 > 1, and (ii) if C1 > 1 then C2 < 1. This means that, in this case, such performance criteria will always be biased towards concurrent under- and overestimation of discharges in a transformation.
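The condition "if C1 < 1 then C2 > 1" can be checked with a brute-force search over the coefficient grid. This sketch reuses the same hypothetical hydrograph and helper functions as in the earlier examples, so it illustrates the mechanism rather than reproducing the paper's exact figures:

```python
import numpy as np

t = np.arange(50)
q = 1 + 10 * np.exp(-0.5 * ((t - 20) / 5.0) ** 2)  # hypothetical flood event
ref = np.concatenate([q, q])
coeffs = 10.0 ** np.arange(-0.36, 0.36 + 1e-9, 0.002)

def kge(sim, obs):
    r = np.corrcoef(sim, obs)[0, 1]
    a = np.std(sim) / np.std(obs)
    b = np.mean(sim) / np.mean(obs)
    return 1 - np.sqrt((r - 1) ** 2 + (a - 1) ** 2 + (b - 1) ** 2)

def best_c2(c1):
    # exhaustive search for the C2 maximising the KGE at a fixed C1
    scores = [kge(np.concatenate([c1 * q, c2 * q]), ref) for c2 in coeffs]
    return coeffs[int(np.argmax(scores))]

# with an underestimated first event, the best-scoring C2 overestimates
assert best_c2(0.75) > 1.0
```

A calibration algorithm maximising the KGE on this toy series would therefore be actively drawn away from the error-free second event.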

Real case study
To highlight how counterbalancing errors can affect the assessment of hydrological models in a real case study, we used two different modelling approaches: artificial neural networks (ANN) and reservoir models. The simulated karst spring discharges of both models were evaluated on the same 1-year validation period. To clearly highlight the problem, we deliberately chose a reservoir simulation that is noticeably affected by counterbalancing errors, yet still realistic. Further information on the modelling approaches, the input data, the calibration strategy and the simulation procedure can be found in Cinkus et al. (2022).

Study site
The Unica springs are the outlet of a complex karstic system influenced by a network of poljes. The recharge area is about 820 km² and is located in a moderate continental climate with a strong snow influence. Recharge comes from both (i) allogenic infiltration from two sub-basins drained by sinking rivers, and (ii) autogenic infiltration through a highly karstified limestone plateau (Gabrovšek et al., 2010; Kovačič, 2010; Petric, 2010). The network of connected poljes constitutes a common hydrological entity that induces a high hydrological variability in the system, and long and delayed high discharges at the Unica springs (Mayaud et al., 2019). The limestone massif can reach a height of 1800 m above sea level and has significant groundwater resources (Ravbar et al., 2012). A polje downstream of the springs can flood when the Unica discharge exceeds 60 m³ s⁻¹ for several days. If the flow reaches 80 m³ s⁻¹, the flooding can reach the gauging station and influence its measurement. The flow data are from the gauging station at Unica-Hasberg (ARSO, 2021a). Precipitation, snow cover height and new snow height data are from the meteorological stations in Postojna and Cerknica (ARSO, 2021b). Temperature and relative humidity data are from the Postojna station. Potential evapotranspiration is calculated from the Postojna station data with the Penman-Monteith formula (Allen et al., 1998).

Modelling approaches
The first modelling approach is based on Convolutional Neural Networks (CNN) (LeCun et al., 2015), a specific type of ANN that is powerful for processing image-like data but also very useful for processing sequential data. The model consists of a single 1D convolutional layer with a fixed kernel size of three and an optimised number of filters. This layer is complemented by a Max-Pooling layer, a Monte-Carlo dropout layer with a 10% dropout rate and two dense layers. The first dense layer has an optimised number of neurons and the second a single output neuron. We programmed our models in Python 3.8 (van Rossum, 1995), using the following frameworks and libraries: BayesOpt (Nogueira, 2014), Matplotlib (Hunter, 2007), NumPy (van der Walt et al., 2011), Pandas (Reback et al., 2021; McKinney, 2010), Scikit-Learn (Pedregosa et al., 2018), TensorFlow 2.7 (Abadi et al., 2016) and its Keras API (Chollet et al., 2015).
The second modelling approach is a reservoir model, a conceptual representation of a hydrosystem consisting of several reservoirs that are intended to represent the main processes involved. We used the adjustable modelling platform KarstMod (Mazzilli et al., 2019). The model structure consists of one upper reservoir for simulating soil and epikarst processes (including a soil available water capacity) and two lower reservoirs corresponding to the matrix and conduit compartments. A very reactive transfer function from the upper reservoir to the spring is used to reproduce the very fast flows occurring in the system.

Impact of counterbalancing errors on model evaluation
The small flood events (mid-January, mid-April, early and late June 2017) are better simulated by the ANN model than by the reservoir model. The ANN model simulates them satisfactorily, except for the second one, where the simulated discharges are overestimated. The reservoir model does not simulate the first two events at all and largely overestimates the last two, in addition to timing errors. Both models can be improved during recession and low flow periods. The ANN model is rather close to the observed discharges but seems too sensitive to precipitation (continuous oscillations). On the other hand, the reservoir model shows no oscillations but either overestimates or underestimates the observed discharges. In general, the ANN model can be described as better because it is closer to the observed values in both high and low flow periods. Some events are not well simulated by either model (e.g. the May 2017 flood), which may be due to uncertainties in the input data. This visual assessment is confirmed by only a few performance criteria: the NSE, the dr and the KGENP. These criteria evaluate the ANN model as better, although the performances of both models are quite close for the dr. However, the KGE and most of its variants (except the KGENP) all favour the reservoir model over the ANN model, sometimes by a large margin. It is interesting to note how similar these results are to those of the synthetic example (Fig. 3a). Looking at the values of the parameters in the equations, we find that the bias parameters are systematically better for the reservoir model, with 1 over 0.92 for β, 0 over -0.06 for βn and -0.07 over 0.18 for B̄rel. Timing parameters are systematically better for the ANN model, with 0.95 over 0.92 for r and 0.94 over 0.83 for rs. Variability parameters favour the reservoir model, with 1.1 over 0.78 for α, 1.1 over 0.85 for γ, 0.22 over 0.3 for |Barea|, and a very close better value (by 0.005) for the αNP parameter. In summary, all bias and variability parameters have better values for the reservoir model, while timing and shape parameters are better for the ANN model. As the KGE and its variants are generally composed of equally weighted bias, variability and timing terms, their overall score is heavily affected by compensation effects, except in the case of a large error on one parameter. In our case, all parameters have similar errors, which results in a better KGE for the reservoir model than for the ANN model. This applies to all the KGE variants except the KGENP, for which the error on rs is significant, resulting in a better score for the ANN model. The LME score is extremely high (0.99) for the reservoir model, which is probably due to the compensation of r and α identified by Choi (2022). Also, using γ instead of α to assess the variability seems to lower counterbalancing errors.
Figure 6b shows a consistently greater or equal overestimation by the reservoir model compared to the ANN model, except for the May-June period, where the difference is small compared to the February or September events. The underestimated values are similar for both approaches, except when the reservoir model overestimates the flooding events. Interestingly, the cumulative sum of the absolute bias error between simulated and observed values is smaller for the ANN model (1394 m³) than for the reservoir model (1611 m³), yet the relative bias and variability parameters are better for the reservoir model. This observation highlights how counterbalancing errors can impair the evaluation of hydrological models: seemingly better parameter values (bias and variability) that increase criteria scores are not necessarily associated with an increase in model relevance.
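This effect is easy to reproduce with toy numbers (hypothetical values, not the case-study data): a simulation whose over- and underestimations cancel can show a perfect bias ratio despite a much larger cumulative absolute error:

```python
import numpy as np

obs = np.array([10.0, 10.0, 10.0, 10.0])
sim_a = np.array([9.0, 9.0, 9.0, 9.0])    # consistently slightly low (ANN-like)
sim_b = np.array([14.0, 6.0, 14.0, 6.0])  # over- and underestimations cancel

abs_err_a = np.sum(np.abs(sim_a - obs))   # cumulative absolute error of sim_a
abs_err_b = np.sum(np.abs(sim_b - obs))   # four times larger for sim_b

beta_a = np.mean(sim_a) / np.mean(obs)    # 0.9: penalised by relative bias
beta_b = np.mean(sim_b) / np.mean(obs)    # 1.0: "perfect" bias, by cancellation
```

Here sim_b accumulates four times the absolute error of sim_a yet shows a perfect bias ratio, mirroring the reservoir-versus-ANN comparison above.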

Recommendations
The aim of this paper is primarily to raise awareness among modellers. Performance criteria generally condense several aspects of the characteristics of a model into a single value, which can lead to an inaccurate assessment of those aspects.
Ultimately, all criteria have their flaws and should be carefully selected with regard to the aim of the model.

Use of relevant performance criteria
Table 1 summarises the presence and impact of counterbalancing errors, as well as the advantages and drawbacks (as reported in other studies) of the different performance criteria. The recommendations on counterbalancing errors are based on the results of this research, i.e. the synthetic and real case studies. The KGE and all its variants are affected by counterbalancing errors with varying degrees of intensity: (i) mildly impacted (+) for the KGE', KGENP and DE, (ii) moderately impacted (++) for the KGE, KGE'' and LCE, and (iii) strongly impacted (+++) for the LME. In this study, the NSE and the dr stand out as clearly better since they show no counterbalancing errors. However, they have other drawbacks that are not associated with counterbalancing errors but are still important to consider. We thus recommend using performance criteria that are not or less prone to counterbalancing errors (NSE, dr, KGE', KGENP, DE), preferably in a multi-criteria framework, to better quantify the different aspects of a hydrological model and further reduce the uncertainties inherent to each performance criterion.

Drawbacks reported in other studies (Table 1):
- KGE (a): still a slight underestimation of high discharges (Gupta et al., 2009); bias and variability are cross-correlated (Kling et al., 2012); implicit assumptions of data linearity, data normality and absence of outliers (Pool et al., 2018); no inherent benchmark (Knoben et al., 2019); not suited to a logarithmic transformation of discharge (Santos et al., 2018).
- NSE: the contribution of βn depends on the variability (Gupta et al., 2009); variability is underestimated (Gupta et al., 2009); the benchmark is inappropriate for highly variable discharges (Gupta et al., 2009).
- dr: no counterbalancing errors; addresses the shortcomings of the NSE (Jackson et al., 2019; Willmott et al., 2012).

(a) KGE drawbacks may likely also apply to the KGE variants, but this has not been studied extensively.

Use of scaling factors
The assessment of the hydrological models in the real case study shows how concurrent over- and underestimation can generate counterbalancing errors on bias and variability parameters. For the case study considered in this paper, the ANN model, although offering a better simulation, is evaluated as worse (sometimes considerably) than the reservoir model because it slightly underestimates the total volume. This has a great impact on the overall score, as the KGE and its variants are calculated with the bias and variability parameters together accounting for 2/3 of the overall criterion score.
While the overall balance (bias) may be a desired feature in a model, we showed that a good value may be accidental and result from counterbalancing errors. The common use of the KGE neglects one of the original proposals, which is to weight the parameters r, α and β in the equation. Gupta et al. (2009) proposed an alternative equation for adjusting the emphasis on the different aspects of a model:

KGEs = 1 − √[(sr (r − 1))² + (sα (α − 1))² + (sβ (β − 1))²]

with sr, sα and sβ the scaling factors of r, α and β, respectively. By default, these factors are equal to 1, which induces a weight of 1/3 on the parameter in absolute value (r) and 2/3 on the parameters in relative values (α, β). To the best of our knowledge, only Mizukami et al. (2019) have considered changing the scaling factors when using the KGE. We suggest carefully considering such scaling factors for the calibration and the evaluation of hydrological models using the KGE and its variants. Depending on the purpose of the model, they can help to emphasise particular aspects of a model or to reduce the influence of relative parameters and counterbalancing errors.
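The weighted form can be implemented directly (a sketch; kge_scaled is our name for the alternative equation of Gupta et al., 2009):

```python
import numpy as np

def kge_scaled(sim, obs, s_r=1.0, s_alpha=1.0, s_beta=1.0):
    # KGE with scaling factors on the timing (r), variability (alpha)
    # and bias (beta) components; s_r = s_alpha = s_beta = 1 is the default KGE
    r = np.corrcoef(sim, obs)[0, 1]
    a = np.std(sim) / np.std(obs)
    b = np.mean(sim) / np.mean(obs)
    return 1 - np.sqrt((s_r * (r - 1)) ** 2
                       + (s_alpha * (a - 1)) ** 2
                       + (s_beta * (b - 1)) ** 2)
```

For instance, a 4:1 ratio between the absolute and relative parameters corresponds to s_r = 4 with s_alpha = s_beta = 1, which quadruples the penalty on timing errors relative to the two relative-value components, whereas down-weighting s_alpha and s_beta shrinks the reward a simulation can gain from counterbalanced bias and variability.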
Figure 8 shows how emphasising absolute parameters with scaling factors helps to reduce the influence of counterbalancing errors for the KGE (Fig. 8a) and its most used variant, the KGE' (Fig. 8b). The default combination (1-1-1), corresponding to scaling factors of 1 for r, 1 for α (KGE) or γ (KGE') and 1 for β, respectively, is compared to other factor combinations with different ratios between absolute and relative parameters. The 1:2 ratio (1-2-2) increases counterbalancing errors, as the emphasis is on the relative parameters, while the 2:1, 3:1, 4:1 and 5:1 ratios decrease counterbalancing errors. The ANN model is evaluated as better with the 4:1 ratio for the KGE and the 3:1 ratio for the KGE', highlighting that the KGE' is less sensitive to counterbalancing errors. This also shows how the score of a performance criterion, and by extension its interpretation, can differ radically depending on the parameters used in the equation. This is why a multi-criteria framework can strengthen the evaluation of models and reduce the uncertainties of performance criteria scores.

Figure 1 :
Figure 1: Synthetic hydrograph corresponding to two flood events.

Figure 2
Figure 2 presents two hydrographs extracted from the set of transformations: (i) a BB model with the combination [C1 = 0.75; C2 = 1.2], and (ii) a BG model with the combination [C1 = 0.75; C2 = 1]. The BG model stands as the better model because it perfectly reproduces the second flood event and is identical to the BB model on the first flood (C1 = 0.75). Nevertheless, the KGE and its variants (KGE', KGE'', KGENP, DE', LME and LCE) all favour the BB model, whereas only the NSE and the dr evaluate the BG model as better (Fig. 3a).

Figure 2 :
Figure 2: Synthetic examples extracted from the set of transformations. The first and second flood events of the "Bad-Bad" and "Bad-Good" transformations were generated with the [C1 = 0.75; C2 = 1.2] and [C1 = 0.75; C2 = 1] combinations, respectively.

Figure 4 :
Figure 4: Score of each transformation for all [C1; C2] combinations, by performance criterion.

Figure 5 :
Figure 5: Graph of each [C1; C2] combination identified as the best transformation by each performance criterion. The NSE and the dr lines coincide at C2 = 1.

https://doi.org/10.5194/hess-2022-380. Preprint. Discussion started: 15 November 2022. © Author(s) 2022. CC BY 4.0 License.

Figure 6
Figure 6a shows the results of the two hydrological models for the Unica springs. The models have overall good dynamics and succeed in reproducing the observed discharges. Regarding high flow periods, both models show a small timing error, inducing a delay in the simulated flood peak. The first two flood events (February-March 2017) are slightly underestimated by the ANN model, while the first peak is overestimated by the reservoir model. Although having a similar volume estimate, the third flood event (May 2017) is better simulated by the ANN model because (i) the timing error is less important and (ii) the recession period is accurate. The last flood event (September 2017) comprises a small peak followed by a very high and long-lasting flood. Both models fail to account for the small peak. The following important flood event is highly overestimated by the reservoir model, while being nicely simulated by the ANN model, despite the small underestimation and timing error.

Figure 6 :
Figure 6: (a) Observed and simulated spring discharge time series on the validation period. (b) Relative difference between simulated and observed discharge on the validation period.

Figure 7 :
Figure 7: (a) Scores of the ANN and reservoir models according to the different performance criteria. (b) Values of the parameters used in the calculation of the performance criteria.

Figure 8 :
Figure 8: (a) KGE and (b) KGE' scores of the ANN and reservoir models (Fig. 6a) according to different scaling factors. The y-axis numbers correspond to the scaling factors of the timing, variability and bias parameters, with the default being 1-1-1.