Simultaneous Calibration of Hydrological Models in Geographical Space

Hydrological models are usually calibrated for selected catchments individually using specific performance criteria. This procedure assumes that the catchments show individual behavior. As a consequence, the transfer of model parameters to other ungauged catchments is problematic. In this paper, the possibility of transferring part of the model parameters was investigated. Three different conceptual hydrological models were considered. The models were restructured by introducing a new parameter η which exclusively controls water balances. This parameter was considered as individual to each catchment. All other parameters, which mainly control the dynamics of the discharge (dynamical parameters), were considered for spatial transfer. Three hydrological models combined with three different performance measures were used in three different numerical experiments to investigate this transferability. The first numerical experiment, involving individual calibration of the models for 15 selected MOPEX catchments, showed that it is difficult to identify which catchments share common dynamical parameters. Parameters of one catchment might be good for another catchment but not vice versa. In the second numerical experiment, a common spatial calibration strategy was used. It was explicitly assumed that the catchments share common dynamical parameters. This strategy leads to parameters which perform well on all catchments. A leave-one-out common calibration showed that in this case a good parameter transfer to ungauged catchments can be achieved. In the third numerical experiment, the common calibration methodology was applied for 96 catchments. Another set of 96 catchments was used to test the transfer of common dynamical parameters. The results show that even a large number of catchments share similar dynamical parameters. The performance is worse than that obtained by individual calibration, but the transfer to ungauged catchments remains possible.
The performance of the common parameters in the second experiment was better than in the third, indicating that the selection of the catchments for common calibration is important.


Introduction
Hydrological models are widely used to describe catchment behavior and subsequently for water management, flood forecasting and other purposes. Hydrological modeling is usually done for catchments with observed precipitation and discharge data. The unknown (and partly unmeasurable) parameters of a conceptual or, to some extent, physics-based model are adjusted in a calibration procedure to reproduce the measured discharge from the observed weather and catchment properties. Due to the high variability of catchment properties and hydrological behavior (Beven, 2000), this modeling procedure is usually performed individually for each catchment.
Different catchments are often modeled using different models. This great variety of models and catchments makes a generalization of the description of hydrological processes very challenging (Sivapalan, 2003). Additionally, even for a selected model applied to a specific catchment, the parameter identification is not unique: a great number of parameter vectors might lead to a very similar performance (Beven and Freer, 2001).
Moreover, due to the over-reliance on measured discharge for model calibration, estimating model parameters for ungauged basins is a big challenge. Instead of model calibration, parameters have to be estimated on the basis of other information (Sivapalan, 2003). A decade of worldwide research effort has been devoted to the prediction of runoff in ungauged basins (PUB) (Hrachowitz et al., 2013). The PUB synthesis book takes a comparative approach to learning from similarities between catchments and summarizes a great number of methods that are used for predicting runoff regimes in ungauged basins. Many attempts have been made to develop catchment classification schemes to identify groups of catchments which behave similarly (Grigg, 1965; Sawicz et al., 2011; Ali et al., 2012; Sivakumar and Singh, 2012; Toth, 2013). The task is nonetheless of great importance: McDonnell and Woods (2004) discussed the need for a widely accepted classification system, and Wagener et al. (2007) pointed out that a good classification would help to model the rainfall-runoff process for ungauged catchments. Razavi and Coulibaly (2012) give a comprehensive review of regionalization methods for predicting streamflow in ungauged basins. Catchment similarity can be determined by comparing the corresponding discharge series using correlation (Archfield and Vogel, 2010) or copulas (Samaniego et al., 2010). Much of the variability in discharge time series is controlled by weather patterns. Therefore, similarity in discharge is likely to be higher for catchments with well-correlated weather, which often requires geographical closeness (Archfield and Vogel, 2010). However, discharge series produced by catchments can be very different under different meteorological conditions. Even the same catchment behaves differently in a dry and in a wet year.
Due to the different weather forcing, the above methods would consider the same catchment in one time period as dissimilar to itself in another time period.
One can also define catchment similarity using hydrological models (McIntyre et al., 2005; Oudin et al., 2010; Razavi and Coulibaly, 2012): catchments are similar if they can be modeled reasonably well by the same model using the same model parameters (Bárdossy, 2007). Due to observational errors and specific features of the calibration period, the adjustment of the model can be very specific to the observation period, leading to overcalibration (Andréassian et al., 2012). To overcome such limitations, the regional calibration approach (Fernandez et al., 2000) has been suggested to identify a single parameter set that performs well for all catchments within the modeled domain. Parajka et al. (2007) indicate that iterative regional calibration indeed reduced the uncertainty of most parameters.
Regional calibration can result in better temporal robustness than conventional individual calibration (Gaborit et al., 2015), and it provides an effective approach for large-scale hydrological assessments (Ricard et al., 2012).
The focus of this paper is to investigate whether the transformation of precipitation to discharge can be described independently of the weather. For this purpose, the hydrological model parameters are separated into two groups:
- Parameters describing the water balances, which are strongly related to climate.
- Parameters describing the dynamics of the runoff triggered by weather.
The second group of parameters is supposed to be weather independent and represents the focus of this paper. To simplify the problem, a single new parameter η was introduced to describe the water balance. This parameter is conditional on the other model parameters and adjusts the long-term water balance.
The purpose of this paper is to investigate to what extent different catchments share a similar dynamical rainfall-runoff behavior and can be modeled using the same model parameters, with the exception of the newly introduced individual water balance parameter η.
Hydrological models are usually judged by the degree to which they reproduce discharge dynamics and water balances. Water balances are mainly driven by weather in terms of precipitation, temperature, radiation and wind, while dynamics are controlled by catchment properties in terms of size, terrain, slopes, soils etc. The formation of landscapes as a result of long-term climate is a quasi-equilibrium process. The hypothesis of this paper is that this equilibrium is mirrored in a similar dynamic behavior. Thus, a large number of catchments can be modeled using the same dynamic parameters.
3. The geographical extent of the catchments used for simultaneous calibration is expanded.

A great number of assumed ungauged catchments are used for testing the hypothesis.
The hypothesis is that the rainfall-runoff process can be described using the same dynamical hydrological model parameters for a number of catchments. The very different climatic conditions and water balances of the catchments are considered by the newly introduced specific parameter η, which controls the long-term water balance of each catchment individually. The other model parameters control the discharge dynamics on both short and long time scales. These dynamical parameters are supposed to be shared despite the great heterogeneity of the catchments. This procedure simplifies the hydrological model parameter estimation for ungauged catchments: it is reduced to the estimation of a single parameter η, which can be related to long-term water balances.
The paper is structured as follows: after the introduction, the investigation area is described. This is followed by a description of the three conceptual hydrological models and the three performance criteria used for calibration and validation. In section four, the new model parameter η controlling the water balance is introduced. In sections five to seven, three numerical experiments are described and their results are presented, starting with the individual calibration of the models and ending with a transfer of the model parameters to randomly selected catchments. The paper concludes with a discussion of the results.

Investigation area and available data
The study area is the eastern United States. Locations of the 196 catchments used in this study are shown in Fig. 1. The catchments form a subset of those used for the international Model Parameter Estimation Experiment (MOPEX) project. Catchments range in size from 134 to 9889 km² and exhibit aridity indices (ratio of long-term potential evapotranspiration to precipitation) between 0.41 and 3.3, hence representing a heterogeneous dataset. Time series of daily streamflow, precipitation and temperature for all catchments were provided by the MOPEX project (Duan et al., 2006). Catchments within this dataset are minimally impacted by human influences. Streamflow information within this dataset was originally provided by United States Geological Survey (USGS) gauges, while precipitation and temperature were supplied by the National Climatic Data Center (NCDC). The MOPEX dataset has been used widely for hydrological model comparison studies (see references in Duan et al., 2006).

Hydrological models and performance criteria
Three simple conceptual hydrological models were applied in this study, because the great number of calibration and validation experiments could only be performed with relatively simple model structures. It is important to see whether the results are similar for different models and performance measures. In a subsequent study, spatially distributed models will be considered.

HYMOD model
The HYMOD model is a conceptual rainfall-runoff model derived from the Probability Distributed Model (Moore, 1985). The soil moisture accounting module of HYMOD uses a Pareto distribution function of storage elements of varying sizes: the storage elements of the catchment are distributed according to a probability density function defined by the maximum soil moisture storage CMAX and the parameter b describing the distribution of soil moisture stores (Wagener et al., 2001).
Evaporation from the soil moisture store occurs at the rate of the potential evaporation estimated using the Hamon approach. After evaporation, the remaining rainfall and snowmelt are used to fill the soil moisture stores. A routing module divides the excess rainfall using a split parameter α, which separates fluxes between two parallel conceptual linear reservoirs meant to simulate the quick and slow flow response of the system (defined by residence times k_q and k_s).

HBV model
The HBV model is a conceptual model and was originally developed at the Swedish Meteorological and Hydrological Institute (SMHI) (Bergström and Forsman, 1973). Snow accumulation and melt, actual soil moisture and runoff generation are calculated using conceptual routines. The snow accumulation and melt is based on the degree-day approach. Actual soil moisture is calculated by considering precipitation and evapotranspiration. Runoff generation is estimated by a non-linear

Xinanjiang model (XAJ)
The XAJ model was established in the early 1970s in China. This conceptual rainfall-runoff model has been applied to a large number of basins in the humid and semi-humid regions of China. The lumped version of the XAJ model consists of four main components (Zhao, 1995). The evapotranspiration is represented by a three-layer soil moisture module which differentiates upper, lower and deeper soil layers. Runoff production is calculated based on rainfall and soil storage deficit; a tension water capacity curve is introduced to represent a non-uniform distribution of tension water capacity throughout the whole catchment. The runoff separation module separates the generated runoff into three parts, namely surface runoff, interflow and groundwater. The flow routing module transfers the local runoff to the outlet of the basin. In order to account for the precipitation contributed by snowmelt, the degree-day snowmelt approach is added to this model. In this study, the model has 16 parameters which can be adjusted by calibration.

Performance criteria
Model calibration depends strongly on the performance criteria used. In order to obtain reasonably general results, three different criteria were selected to evaluate model performance.
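The Nash-Sutcliffe efficiency (Nash and Sutcliffe, 1970) between the observed and modeled flow, the most frequently used criterion, is taken as the first evaluation criterion:

$$NS = 1 - \frac{\sum_t \left(Q_o(t) - Q_m(t)\right)^2}{\sum_t \left(Q_o(t) - \overline{Q_o}\right)^2} \qquad (1)$$

Here Q_o(t) is the observed discharge and Q_m(t) is the modeled discharge on a given day t, and \overline{Q_o} is the mean observed discharge. The abbreviation NS is used subsequently for this performance measure.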
The NS model performance criterion has often been criticized (for example in Schaefli and Gupta, 2007), and several modifications and other criteria have been suggested. One interesting suggestion was published by Gupta et al. (2009), who proposed a performance measure which accounts for the water balances and the correlation of the observed and modeled time series separately. Their approach was slightly modified and the following performance criterion was introduced: Here r(Q_o, Q_m) is the correlation coefficient between the observed and modeled time series of discharge, and β is a weight expressing the importance of the water balance. In our study, β = 5 was selected. The reason for selecting this version of the coefficient is that a model should produce good water balances and appropriate discharge dynamics simultaneously. The quadratic form in Eq. (2) ensures that both aspects are considered and that the worse of the two dominates. The abbreviation GK is used subsequently for this performance measure.
The Nash-Sutcliffe coefficient of the logarithm of the discharges focuses more on low flow conditions than the traditional NS coefficient:

$$LNS = 1 - \frac{\sum_t \left(\ln Q_o(t) - \ln Q_m(t)\right)^2}{\sum_t \left(\ln Q_o(t) - \overline{\ln Q_o}\right)^2} \qquad (3)$$

To concentrate equally on high and low flows, a combination of the original NS and the logarithmic NS is used as a third measure: The abbreviation NS + LNS is used subsequently for this performance measure.

The three performance criteria were formulated so that the higher the value, the better the model; the best possible value of each criterion is 1.
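As an illustration of the criteria above, the following minimal Python sketch computes NS as in Eq. (1) and its logarithmic variant LNS for daily series. The small offset `eps` inside the logarithm and the equal-weight average used for NS + LNS are assumptions of this sketch, since the exact combination used in Eq. (4) is not reproduced here; GK (Eq. 2) is left out for the same reason.

```python
import math
from statistics import fmean


def nash_sutcliffe(obs, sim):
    """NS: 1 minus the squared error normalized by the variance of the observations."""
    mean_obs = fmean(obs)
    sse = sum((o - s) ** 2 for o, s in zip(obs, sim))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / sst


def log_nash_sutcliffe(obs, sim, eps=1e-6):
    """LNS: NS computed on log-discharges; eps (assumed) avoids log(0) on zero-flow days."""
    return nash_sutcliffe([math.log(o + eps) for o in obs],
                          [math.log(s + eps) for s in sim])


def ns_plus_lns(obs, sim):
    """Assumed equal-weight combination of NS and LNS; its best value is 1."""
    return 0.5 * (nash_sutcliffe(obs, sim) + log_nash_sutcliffe(obs, sim))


obs = [1.0, 2.0, 3.0]
print(nash_sutcliffe(obs, obs))              # prints 1.0 (perfect fit)
print(nash_sutcliffe(obs, [2.0, 2.0, 2.0]))  # prints 0.0 (no better than the observed mean)
```

All three functions return 1 for a perfect simulation, consistent with the convention that higher values indicate a better model.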

Model parameter to control water balance
Climatic conditions are of central importance for water balances. The relationship of potential to actual evapotranspiration can differ strongly due to water or energy limitations. This suggests that catchments might have similar dynamical behavior but different water balances. In order to account for this, the model parameters could be separated into two groups: one group with parameters controlling the water balances and another controlling the discharge dynamics.
This separation of existing model parameters is difficult, as they often influence both components simultaneously. Instead of an artificial, model-specific separation, a new parameter η was introduced to all three models. This parameter controls the ratio between daily potential and actual evapotranspiration depending on the available water, and depends on the long-term water balance only. The parameter η enters the models as follows: Here SM is the actual soil water available for evapotranspiration, CMAX is the maximum possible soil moisture, and E_tp and E_ta stand for the potential and actual evapotranspiration, respectively.
The parameter η regulates the water balances in accordance with the dynamical parameters. It can be calculated directly for each parameter vector θ. This is necessary, as η is meant to establish correct water balances. Thus η depends on both the catchment and the parameter vector θ. f(η) = V_iM(η, θ) is a monotonically decreasing function of η. If the model can provide correct long-term water balances, then: As f(η) = V_iM(η, θ) is continuous, there is a unique η(θ) for which: If Eq. (6) is not fulfilled, then the parameter vector θ is not appropriate for the model.
The parameter η is fitted individually for each θ; in this way, a correct water balance is ensured for the calibration period.
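Because f(η) = V_iM(η, θ) is stated to be continuous and monotonically decreasing, the water-balance-matching η for a given θ can be found by a simple bracketing search. The sketch below uses bisection; the bracket [0, 10], the tolerance and the toy volume function are assumptions for illustration, not the models used in this study. In practice, `volume_of_eta` would run the hydrological model with θ fixed and return the simulated long-term discharge volume.

```python
import math


def fit_eta(volume_of_eta, observed_volume, lo=0.0, hi=10.0, tol=1e-10):
    """Bisection for eta such that volume_of_eta(eta) == observed_volume.

    Assumes volume_of_eta is continuous and decreasing on [lo, hi]. Returns None if the
    observed volume lies outside [volume_of_eta(hi), volume_of_eta(lo)], i.e. no eta in
    the bracket reproduces the water balance, so theta would be rejected.
    """
    if not volume_of_eta(hi) <= observed_volume <= volume_of_eta(lo):
        return None
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if volume_of_eta(mid) > observed_volume:
            lo = mid  # simulated volume too large: a larger eta is needed
        else:
            hi = mid  # simulated volume too small (or exact): eta lies at or below mid
    return 0.5 * (lo + hi)


def toy_volume(eta):
    """Hypothetical stand-in for V_iM(eta, theta): continuous and decreasing in eta."""
    return 100.0 + 600.0 * math.exp(-0.3 * eta)


eta = fit_eta(toy_volume, 400.0)
print(round(eta, 4))  # analytic solution of the toy problem: ln(2) / 0.3
```

Returning `None` mirrors the rule that a parameter vector θ for which no η reproduces the long-term water balance is not appropriate for the model.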

Experimental design
In this study, the ROPE algorithm (Bárdossy and Singh, 2008)

The ranges of the model parameters are relatively large. As a first step, we checked whether the catchments have common parameter vectors. For each pair of catchments (i, j), for the same performance measure and time period, the intersection G_i ∩ G_j of the convex hulls of the good parameter sets is empty, showing that there are no common best parameters. From this result, it would seem that none of the catchments are similar.
As a next step, the 10 000 generated best dynamical parameter vectors for a given time period and hydrological model obtained for catchment i were applied to model all other catchments using the same hydrological model and time period. Note that the value of η is not transferred but adjusted to the true long-term water balance. In the numerical experiments, we assume that the long-term discharge volumes are known for all simulations. This, however, highlights the issue of estimating the real water balance in ungauged basins, which is addressed in the discussion. Figure 3 shows the color-coded matrices of the mean NS and GK performance of the three hydrological models using transferred parameters for all 15 catchments for the calibration period 1971-1980.
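The donor-receiver matrices of Fig. 3 can be read as follows: entry (i, j) is the mean performance on receiver j of the good parameter vectors of donor i, with η re-fitted to the water balance of j. The sketch below assumes a hypothetical callback `performance(j, theta)` that hides the model run and the η adjustment; the one-parameter toy performance surface is invented purely to show why such a matrix need not be symmetric.

```python
from statistics import fmean


def transfer_matrix(good_sets, performance):
    """M[i][j]: mean performance on receiver j of donor i's good parameter vectors."""
    n = len(good_sets)
    return [[fmean([performance(j, theta) for theta in good_sets[i]]) for j in range(n)]
            for i in range(n)]


def max_asymmetry(m):
    """Largest |M[i][j] - M[j][i]|; zero only if the matrix is symmetric."""
    n = len(m)
    return max(abs(m[i][j] - m[j][i]) for i in range(n) for j in range(n))


# Hypothetical toy: one dynamical parameter, each catchment has its own optimum.
optimum = [0.0, 0.5]


def performance(j, theta):
    return 1.0 - (theta[0] - optimum[j]) ** 2


# Donor 0 has a wide set of good vectors, donor 1 a narrow one.
good_sets = [[(0.0,), (0.2,)], [(0.5,)]]
m = transfer_matrix(good_sets, performance)
print(m)                 # M[0][1] != M[1][0]
print(max_asymmetry(m))
```

In the toy case, donor 0's set contains a vector closer to the optimum of catchment 1 than donor 1's only vector is to the optimum of catchment 0, so M[0][1] exceeds M[1][0]; the asymmetry comes from how the good sets are distributed, not from any property of the receivers.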
The performance of the transferred parameter vectors displays a strongly varying picture. While in some cases the catchments seem to share parameter vectors with reasonably good performance, in other cases the transfer led to weak performance. A further surprising fact is that none of the matrices is symmetric. Some catchments are good donors, as their parameters are good for nearly all catchments, while others have parameters which are hardly transferable.
The asymmetry of the parameter transition matrices cannot be explained by catchment properties.
Two different catchments seem to share well-performing parameters if calibrated on one catchment but no common good parameters if calibrated on the other one. Take catchments 1 and 12 with the NS performance as an example. For all three models, parameters calibrated for catchment 1 are not suitable for catchment 12, but parameters of catchment 12 perform reasonably well for catchment 1. From the observation data, we found that catchment 12 experienced relatively dry climate conditions during the calibration period. We also found from the simulated hydrographs that the parameter sets calibrated on catchment 1 could not adequately capture the dynamic behavior of catchment 12, as the low flows were underestimated most of the time and the peak flows were clearly overestimated. The matrices for NS show different performances for different models. In general, the HBV model performs best: the average value of the matrix is 0.62 for HBV, 0.55 for HYMOD and 0.54 for XAJ. Furthermore, the correlations of transferred model performance between different models are all greater than 0.7. From the viewpoint of parameter transferability, the three models behave similarly: if a parameter transfer from catchment i to j is reasonable for one model, then it is also reasonable for the other models. The results for the GK performance differ from those of the NS performance. Here the XAJ model seems to give the most transferable parameters in general.
Parameter vectors from other catchments generally fail to perform well on catchment 15 across all three models.
The difference in transferability between these performance measures could be explained by their different focuses: NS is mainly focused on the squared difference between the observed and modeled discharge, GK focuses on water balances and good timing, and NS+LNS is strongly influenced by low flow events. It is interesting to observe that catchment 12 is a very bad receiver of model parameters for NS, while it is an excellent receiver for GK. This means that different events have different influence on the performance. A possible explanation for the asymmetry is that the catchments have different weather forcing in the calibration period. The runoff events which are most important for a performance measure may occur frequently in the calibration period of one catchment, leading to good transferability, and seldom in that of the other, causing weak transferability of the parameters from one catchment to another.
The transferability of the model parameters was also tested for an independent validation period between 1991 and 2000. Figure 4 shows the corresponding color-coded results for NS as performance measure. The matrices are similar to those obtained for calibration. Catchment 12 remained a bad receiver but a good donor, indicating that the bad performance is unlikely to be caused by observation errors. Further, for some columns the off-diagonal elements are larger than the diagonal ones, which is a sign of possible overcalibration of the models.

To investigate the influence of climate on calibration, the hydrological models calibrated for different time periods using the same model and performance measure were compared. As the different time periods represent different climate conditions, the calibrations lead to different parameter sets. For comparison, the differences in calibrated model parameters for different catchments, using the same model and performance measure, were also examined. As an example, the left similarity of the two scatterplots suggests that the difference between the different catchments is comparable to the difference between the different time periods. In hydrological modeling, it is usually assumed that model parameters are constant over time, provided there is no significant change in climate or other characteristics. The results, however, show that the assumption that parameters are the same over space is not completely unrealistic. The figures even suggest that there might be parameter vectors which perform reasonably well for all 15 catchments. Since the parameter transfer worked reasonably well for many pairs of catchments, we next investigated whether there are parameters which perform reasonably well for all catchments. As seen in the previous section, none of the catchments share optimal parameters. Therefore, common suboptimal parameters have to be found.

In order to identify parameter vectors which perform simultaneously well for each catchment, the hydrological models were calibrated for all 15 catchments simultaneously. The simultaneous calibration of the model for all catchments is a multi-objective optimization problem. The goal is to find parameter vectors which are almost equally good for all catchments with no exception.
As the models perform differently for the different catchments due to data quality and catchment particularities, the performance was measured as the loss in performance compared to the usual individual calibration. Thus, the objective function was formulated following the compromise programming method (Zeleny, 1981): Here index i indicates the catchment number and index j indicates the type of the individual performance measure specified in Eqs. (1), (2) and (4). The goal is to minimize this objective function R^(j).
Here p is the so-called balancing factor: the larger p, the more the largest loss in performance contributes to the common performance. In order to obtain parameters which are good for all catchments, a relatively high p = 4 was selected for all three performance measures.
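The role of the balancing factor p can be illustrated with a small sketch. It assumes that the objective takes the usual compromise-programming form of an L_p norm of the per-catchment performance losses relative to individual calibration; this form, and the clipping of negative losses to zero, are assumptions of the sketch rather than a reproduction of the objective used in this study. The two hypothetical candidates below have the same total loss, but with p = 4 the one that spreads its loss evenly scores better.

```python
def compromise_loss(best, common, p=4.0):
    """L_p norm of losses best[i] - common[i] over catchments (assumed form).

    best[i]: performance of catchment i's individual calibration.
    common[i]: performance of a candidate common parameter vector on catchment i.
    Negative losses (common better than individual) are clipped to zero here.
    """
    losses = [max(b - c, 0.0) for b, c in zip(best, common)]
    return sum(loss ** p for loss in losses) ** (1.0 / p)


best = [0.8, 0.7, 0.9]
even = [0.7, 0.6, 0.8]    # loses 0.1 on every catchment
skewed = [0.8, 0.7, 0.6]  # loses 0.3 on one catchment only
for p in (1.0, 4.0):
    print(p, compromise_loss(best, even, p), compromise_loss(best, skewed, p))
```

With p = 1 both candidates are rated equally; with p = 4 the single large loss of `skewed` dominates, which is the behavior sought when parameters should be acceptable for every catchment.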
As for the individual calibration, the ROPE algorithm was used for the simultaneous calibration.

The optimized parameter sets H^(j) perform well simultaneously on all catchments for each model and time period. The left part of Fig. 2 compares the performance of the individual and the common calibration for the 15 selected catchments using NS as performance criterion. As expected, the individual calibrations lead to better performance, but the joint parameter vectors perform reasonably well for all catchments.

As the goal of modeling is not the reconstruction of already observed data, the performances for a different validation period (1991-2000) were also compared. The right part of Fig. 2 shows the mean model performances for the 15 individually calibrated and the commonly calibrated datasets. The observation that parameter vectors obtained through common calibration may outperform individual on-site calibration may also indicate a weakness of the calibration process for an individual catchment, which should ideally be able to identify the 'best' parameter set.
These results indicate that instead of transferring model parameters from a single catchment, a parameter transfer might perform better if the parameters obtained through common calibration on all other catchments are used. In order to test this kind of parameter transfer, a set of simple "leave-one-out" calibrations was performed: for each catchment i, the hydrological models were calibrated commonly on the remaining 14 catchments, leaving catchment i out, which led to 15 simultaneous calibrations. These common model parameters were then applied to the catchment which was left out. The performance of the models on these catchments in the calibration period is reasonably good for all catchments. Figure 6 shows the results of HBV and HYMOD using the NS performance measure. It compares the performance of the parameters obtained via individual calibration (red x-marks), individual parameter transfers from other catchments (blue plus signs) and the transfer of the common parameters obtained by the leave-one-out procedure (green diamonds). The performance of the common parameters is clearly weaker than that of the individual calibration, but better than many of the individual parameter transfers. To test the potential transferability of the common parameters, a validation period was used. Figure 7 shows the results for the validation period 1991-2000. In this case, the common calibration performs very well. For HYMOD, it outperforms the parameter vectors obtained by individual calibration for 6 out of the 15 catchments. For the other catchments, the loss in performance is relatively small. Note that this good performance of the common models was obtained without using any information from the target catchment. The transfer of parameters obtained from individual calibrations on other catchments shows a highly heterogeneous picture, as described in experiment 1. The transferred common parameters perform better than most of these individual transfers.
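The leave-one-out procedure reduces to a simple loop. In the sketch below, `calibrate_common` and `evaluate` are hypothetical callbacks standing in for the simultaneous calibration (ROPE with the compromise objective in this study) and for the model run on the left-out catchment. The toy versions use a single parameter and pick the midpoint of the donors' optima, which minimizes the largest quadratic loss; they are illustrative assumptions only. The essential property is that the left-out catchment never enters the calibration.

```python
def leave_one_out(n, calibrate_common, evaluate):
    """For each catchment k, calibrate on all others and score the result on k only."""
    scores = []
    for k in range(n):
        donors = [i for i in range(n) if i != k]
        theta = calibrate_common(donors)  # catchment k is not used here
        scores.append(evaluate(k, theta))
    return scores


# Hypothetical toy setting: one parameter, quadratic performance around each optimum.
optimum = [0.2, 0.4, 0.5, 0.9]


def calibrate_common(donors):
    """Minimax choice for quadratic losses: midpoint of the donors' extreme optima."""
    values = [optimum[i] for i in donors]
    return 0.5 * (min(values) + max(values))


def evaluate(k, theta):
    return 1.0 - (theta - optimum[k]) ** 2


print(leave_one_out(len(optimum), calibrate_common, evaluate))
```

In the toy output the outlying catchment (optimum 0.9) receives the weakest score, as its behavior is least represented by the donors.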
Further note that the results of experiment 1 provide no explanation of why certain transfers work well and others do not. Thus, for the transfer of model parameters to ungauged catchments, common calibration seems to be a reasonable method.

In order to illustrate how model parameters of the leave-one-out common calibration perform in validation, two hydrographs are presented. Figures 8 and 9 show part of the observed hydrograph for catchments 5 and 14, together with the hydrographs modeled by HBV using randomly selected parameter sets from the individual calibration and from the leave-one-out common calibration. While for catchment 5 the common calibration leads to a hydrograph which is slightly better than that obtained by individual calibration, for catchment 14 the ranking is reversed. However, in both cases the common parameters, which were obtained without using any observations of the catchment, perform surprisingly well.

Numerical experiment 3: extension to other catchments
The results of the previous experiment suggest that even more catchments might share parameters which perform well on all of them. The 15 catchments used in experiments 1 and 2 are, however, to some extent similar and thus cannot necessarily be considered representative of a great number of other catchments. Therefore, for the third experiment, 192 catchments of the MOPEX dataset were considered. 96 of them were randomly selected for common calibration (marked as blue circles in Fig. 1); the other 96 catchments were used as receivers to test the performance of the common parameters (marked as green triangles in Fig. 1). The HBV model with the three selected performance measures was considered in this experiment.
For each of the 192 catchments, an individual model calibration was carried out using 1971-1980 as the calibration period. Common calibration was performed for the selected 96 catchments in the same way as in experiment 2, for the HBV model using all performance measures.

As a first step, the model performances for the individual and common calibration were compared.
As expected and already seen in experiment 2, the performance of the common calibration is lower than that of the individual one for HBV with all performance measures. For example, the mean NS performance over all 96 catchments drops from 0.69 to 0.50. When the models are applied to the validation period 1991-2000, the mean performance of the individually calibrated models is 0.65, while for the common calibration the mean increases to 0.51. Figure 10 shows the histograms of NS for the calibration and validation periods for the individual and the common calibrations. The results indicate the robustness of the common calibration. The transfer to the 96 assumed ungauged catchments shows performance of the common parameters very similar to that for the catchments selected for common calibration. Figure 11 shows the histograms of NS for the individual calibration and for the transfer to the assumed ungauged catchments. The histograms show clearly that there is very little difference between the performance for the gauged and the ungauged catchments. In 90 % of the catchments, the common calibration works reasonably well even for the ungauged cases. The fact that common parameters can describe the runoff dynamics of all 192 catchments indicates a high degree of similarity among these catchments.

Comparing the results of the common calibration using the 96 catchments to those obtained using the 15 catchments, one can observe that increasing the number of catchments considered for the common calibration leads to a decrease in performance. The common parameter sets calibrated on 15 catchments in reasonable geographic proximity perform better than the parameter sets calibrated on 96 catchments. The parameters obtained through common calibration on the 96 catchments can thus be regarded as describing the common dynamical behavior of many very different catchments over a large geographical area. If one is interested in finding model parameters for a specific ungauged catchment, common calibration using a more careful selection of the donor catchments is likely to lead to good parameter transfers.
The water balances of the 192 catchments are very different, leading to very different η parameters. Figure 12 shows the distribution of η values for three randomly selected good common parameter sets for the HBV model, using NS as the performance measure for the calibration time period. The curves show clearly that, for the same catchment, η differs between dynamical parameter sets. Owing to the differences in water balance, different catchments require very different η values to control actual evapotranspiration. Furthermore, across all 192 catchments, η shows a very similar tendency for the different dynamical parameter sets. Figure 13 plots the mean η value against the ratio of long-term actual evapotranspiration to potential evapotranspiration (E_ta/E_tp) for each catchment. It shows a strong negative correlation (−0.72) between η and E_ta/E_tp.
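Figure 13 summarizes each catchment by the mean of η over the good dynamical parameter sets and relates it to the long-term ratio E_ta/E_tp. The sketch below shows that computation with NumPy, assuming η is stored as an array of shape (number of parameter sets, number of catchments). The array layout, the function name and the numbers are hypothetical; they only illustrate the procedure and do not reproduce the reported correlation of −0.72.

```python
import numpy as np


def eta_ratio_correlation(eta, et_actual, et_potential):
    """Pearson correlation between per-catchment mean eta and E_ta/E_tp.

    eta: array (n_parameter_sets, n_catchments) with eta fitted for each
         dynamical parameter set and catchment (hypothetical layout).
    et_actual, et_potential: long-term totals per catchment.
    """
    mean_eta = np.asarray(eta, dtype=float).mean(axis=0)  # average over sets
    ratio = np.asarray(et_actual, dtype=float) / np.asarray(et_potential, dtype=float)
    return float(np.corrcoef(mean_eta, ratio)[0, 1])


# Hypothetical example: 3 parameter sets, 4 catchments
eta = [[0.9, 0.7, 0.5, 0.3],
       [1.1, 0.8, 0.6, 0.2],
       [1.0, 0.9, 0.4, 0.4]]
et_actual = [300.0, 400.0, 500.0, 600.0]        # mm/yr
et_potential = [1000.0, 1000.0, 1000.0, 1000.0]  # mm/yr
r = eta_ratio_correlation(eta, et_actual, et_potential)
print(round(r, 3))
```

Averaging η over parameter sets before correlating is a simplification; it seems reasonable here because η follows a similar tendency across parameter sets for all catchments.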

Robust parameter sets
The three experiments were carried out in such a way that a set of parameters (usually represented by 10 000 individual parameter sets) was used. This leads to considerable variability in the results. Modelers often prefer to use a single parameter vector. If a single parameter vector is desired, then according to Bárdossy and Singh (2008), the deepest parameter set (which represents the most central point of the whole set of parameter vectors) is the most likely candidate to be robust. This study also indicates that the deepest parameter set performs slightly better than the mean of the parameter sets considered.

Variability and estimation of η
As defined, the water-balance parameter η is specific to each catchment and each model parameter vector. Therefore, each individual catchment shows a large variation in η across the 10 000 calibrated parameter sets. For the same set of good parameters, different catchments always require very different η values to match their different water balances through the control of actual evapotranspiration. Parameter η was not transferred; only the other parameters, which control flow dynamics and short-term water balances, were assumed to be shared by many catchments. Parameter η is instead estimated for each catchment, because it controls the water balance and can thus be estimated at other catchments. The remaining (dynamical) parameters are calibrated regionally, i.e. all catchments are given the same parameter set, so only η varies between catchments. As η is specific to each parameter vector, regionalizing η directly is not feasible, and η remains different for different parameter vectors after regionalization. In the numerical experiments, the long-term discharge volumes were treated as known for both gauged and ungauged catchments in order to estimate η. For practical applications, the long-term discharge volumes have to be estimated for ungauged catchments.
This problem is not explicitly treated in this paper. The estimation of parameter η is a limitation of the presented simultaneous calibration approach. Regionalization of long-term discharge volumes is a prerequisite for its application in ungauged basins. For the study area, the discharge coefficients, which relate discharge volumes to (known) precipitation, show a quite smooth spatial behavior, as shown in Fig. 14. Thus the regionalization of this quantity seems not to be an overly complicated task in this particular region. According to the previous analysis of η, for each common dynamical parameter set one can obtain an estimate of η for a given catchment based on the regionalized discharge coefficient. The potential application of this approach in other regions needs to be investigated in future work.
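For a fixed common dynamical parameter set, estimating η for an ungauged catchment then reduces to matching a target long-term discharge volume, for instance a regionalized discharge coefficient multiplied by the known precipitation. A minimal sketch of that step, assuming the simulated long-term discharge volume is continuous and monotone in η, is a one-dimensional root search, here plain bisection. The function `estimate_eta`, the toy water balance `toy_volume` (in which η damps actual evapotranspiration) and all numbers are hypothetical stand-ins for a run of HBV or the other models; they are not the η formulation used in this study.

```python
from typing import Callable


def estimate_eta(simulate_volume: Callable[[float], float],
                 target_volume: float,
                 lo: float, hi: float,
                 tol: float = 1e-8, max_iter: int = 200) -> float:
    """Find eta so that the simulated long-term discharge volume matches target.

    Assumes simulate_volume is continuous and monotone on [lo, hi] and that the
    target lies between its values at the bracket ends (bisection).
    """
    f_lo = simulate_volume(lo) - target_volume
    f_hi = simulate_volume(hi) - target_volume
    if f_lo * f_hi > 0:
        raise ValueError("target volume is not bracketed by [lo, hi]")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = simulate_volume(mid) - target_volume
        if abs(f_mid) < tol or (hi - lo) < tol:
            return mid
        if f_lo * f_mid <= 0:
            hi = mid                  # root lies in [lo, mid]
        else:
            lo, f_lo = mid, f_mid     # root lies in [mid, hi]
    return 0.5 * (lo + hi)


# Hypothetical catchment: long-term precipitation and potential ET (mm/yr)
P, ETP = 800.0, 900.0


def toy_volume(eta: float) -> float:
    """Toy long-term discharge volume: P minus actual ET damped by eta."""
    et_actual = min(P, ETP) / (1.0 + eta)
    return P - et_actual


discharge_coefficient = 0.25          # hypothetical regionalized value
target = discharge_coefficient * P    # estimated long-term discharge (mm/yr)
eta_hat = estimate_eta(toy_volume, target, lo=0.0, hi=10.0)
print(round(eta_hat, 4))  # 0.3333
```

Bisection is chosen only because it is simple and robust for a monotone one-dimensional problem; a bracketing method such as Brent's would serve equally well. In practice each evaluation of `simulate_volume` would be a full model run over the long-term period, so the number of iterations matters.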

Prediction in ungauged basins
The results of this study support the general finding of Ricard et al. (2012) and Gaborit et al. (2015) […] one for both calibration and validation time periods. The loss of performance relative to individual calibration is smaller in validation than in calibration. When applied to ungauged catchments, the simultaneous calibration shows more robustness than the individual one. Simultaneous calibration of models in geographical space offers a promising approach for runoff prediction in ungauged basins. Compared with traditional regionalization methods, only the water balance parameter η has to be estimated, based on the regionalization of discharge coefficients.
Inspection of the hydrographs showed that high flows are often underestimated and low flows tend to be overestimated. This phenomenon has also been detected in previous regional calibration studies (Ricard et al., 2012; Gaborit et al., 2015). This behavior is mainly due to the uncertainty of the model structure and the low spatial and temporal resolutions of both models and input variables (Gaborit et al., 2015).

Conclusions
In this paper, the transfer of the dynamical parameters of hydrological models was investigated.
A new model parameter η controlling the actual evapotranspiration was introduced to cope with the clear differences in water balances due to water or energy limitations. Three hydrological models were used in combination with three different performance measures in three numerical experiments on a large number of catchments.
The individual calibration and transfer results indicate that models are often overfitted during calibration. The parameters are sometimes rather specific to the calibration time period, and their relation to catchment properties seems to be unclear. This makes parameter transfer or parameter regionalization based on individual calibration difficult. The common spatial calibration strategy, which explicitly assumes that catchments share dynamical parameters, was tested on sets of 15 and 96 catchments. The common calibration provides an effective way to identify parameter sets which work reasonably well for all catchments within the modeled domain. Testing the parameters on an independent time period shows that common parameters perform comparably to those obtained using individual calibration. The transfer of the common parameters to ungauged catchments works well. The performance of common parameters on a small number (15) of catchments was better than on a large number (96) of catchments covering a large spatial scale. This indicates that the performance of the common parameters depends strongly on the selection of the catchments used to assess them, and that reasonable geographic proximity of the catchments might be a good choice for common calibration. The results of the experiments were similar for all three hydrological models, independently of the choice of performance measure. Note, however, that the common parameters corresponding to the different performance measures differ considerably. Common behavior depends on how one evaluates the performance of the models.
The fact that many catchments share common parameters which describe their dynamical behavior does not mean that they have the same dynamical behavior. The model output depends strongly on the parameter η, which varies from catchment to catchment and also as a function of the other model parameters describing dynamical behavior. Common parameters offer a good possibility for prediction in ungauged catchments, as only the parameter η, which controls the long-term water balance, has to be estimated individually. This, however, can be done using other modeling approaches, including regionalization methods.
In this study, all the models were tested on the daily time scale. The results show that many catchments behave similarly, as the same dynamical parameter sets perform reasonably well for all of them. This means that hydrological behavior on the daily scale is mainly dominated by precipitation characteristics and actual evapotranspiration, and we believe that differences in catchment properties have rather significant effects on smaller temporal scales (e.g. hourly). Results also indicate that the differences in catchment properties cannot be captured well by simple lumped model parameters.

Figure 11. Histograms of the NS model performance of HBV for the 96 test (ungauged) catchments. Left: calibration period (1971-1980), right: validation period (1991-2000).