A comparison of regionalization methods in monsoon dominated tropical river basins

The present study evaluated five regionalization methods (global averaging, regression, spatial proximity, behavioral similarity and artificial neural network (ANN)) for the Soil and Water Assessment Tool (SWAT), using data from 24 monsoon-dominated tropical river basins of peninsular India. Regionalization was performed for each basin using the remaining 23 basins as donors. The performance of the calibration, and thus of the regionalization methods, is limited by unreliable or erroneous data at the basins. Overall, we found that the regression method outperforms the other regionalization methods in predicting daily as well as peak discharges. Despite showing a better R in training, testing and validation, the ANN method performed poorly, probably because of the small number of training data. Therefore, it is suggested that the ANN should be avoided for regionalization in the absence of sufficient training data. Moreover, the regression equations developed in the present study can be used to predict SWAT parameters of basins located in the vicinity of the study area. However, basins located far away from the group of catchments, or having diverse characteristics, should be avoided for regionalization.


INTRODUCTION
Hydrological processes are complex and dynamic, making their accurate prediction difficult. Since simple empirical equations are not adequate for prediction, hydrological modeling becomes necessary to simulate these complex processes. Numerous models have been developed for understanding hydrological systems, especially over the past few decades (Devia et al. ). All the models require some degree of calibration and validation to achieve adequate basin representation, which is possible only for gauged basins. However, in many basins, observed streamflow data are not available or are insufficient for developing models. Such basins are considered ungauged (Sivapalan et al. ). To overcome the problem of model calibration in ungauged basins, various regionalization techniques have been developed. The process of transferring parameters from hydrologically similar basins (donors) to a basin of interest (target) is generally referred to as regionalization (Blöschl & Sivapalan ).
The simplest method for regionalization is to identify similar or proxy basins, based on location or behavior, known as the distance-based method. When the geographical distance is used as the basis to determine similar basins, the entire set Basin to estimate maximum, minimum and long-term mean streamflows. They found drainage area, total watercourse length, sub-basin mean altitude and perimeter to be the explanatory variables for the regionalization. According to Merz et al. (), compared with the method of regionalization, quantity and quality of expert judgment (meaningful catchment attributes (Skøien et al. ) and hydrological distance measures (Merz & Blöschl )) play a very significant role in maximizing regionalization performance.
Regression, especially two-step regression, is the most popular and widely used regionalization technique. In the two-step regression, regionalization is performed in two steps: (1)  Recently, by comparing the results of 33 typical studies involving similarity- and regression-based regionalization, Guo et al. () found that the accuracy of a regionalization method increases with an increasing number of gauged catchments or the runoff site density. There is still plenty of room to improve the prediction capability in data-sparse regions.
The main objective of the present study is to compare different regionalization methods and to suggest a suitable method for regionalizing SWAT parameters so that the model can be applied to ungauged basins in peninsular India. The performance of the regression-based regionalization was compared with the default and calibrated parameter sets and with the other methods of regionalization (global averaging, spatial proximity, behavioral similarity and ANN). The description of the study area and the datasets used is provided in the section below. This is followed by the 'Methodology' section, which includes the SWAT model description and the various regionalization methods used. The results of the estimation of SWAT parameters by the different methods and their comparison are provided in the 'Results and discussion' section. The summary and conclusions of the study are given in the final section.

STUDY AREA AND DATA USED
Twenty-four basins that satisfy the following four conditions were selected for impact assessment:
1. Basin Area: Due to the coarse resolution of the input data, out of the available basins, we discarded the basins having an area of less than 1,000 km².
3. Human Intervention: All those basins having any dam or reservoir were discarded to avoid consideration of regulated flows.
Modified Nash-Sutcliffe efficiency (MNS): here, $Q_m$ and $Q_o$ are the modeled and observed discharge.
Significantly sensitive parameters are shown in bold. The symbols 'r' and 'v' with each parameter denote the relative and replace methods of parameter calibration.
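For reference, the MNS criterion can be sketched in a few lines. The expression for MNS is not reproduced in the text, so the form used here, MNS = 1 − Σ|Q_o − Q_m|^j / Σ|Q_o − mean(Q_o)|^j, and the default exponent j = 1 are assumptions based on the commonly used modified form; the function name `modified_nse` is hypothetical.

```python
from statistics import mean


def modified_nse(observed, modeled, j=1):
    """Modified Nash-Sutcliffe efficiency.

    Assumed form: 1 - sum(|Qo - Qm|^j) / sum(|Qo - mean(Qo)|^j).
    j = 1 is an assumption; j = 2 gives the classical Nash-Sutcliffe efficiency.
    """
    if len(observed) != len(modeled) or not observed:
        raise ValueError("observed and modeled must be non-empty and of equal length")
    obs_mean = mean(observed)
    # Sum of absolute model errors raised to the power j
    numerator = sum(abs(o - m) ** j for o, m in zip(observed, modeled))
    # Sum of absolute deviations of the observations from their mean
    denominator = sum(abs(o - obs_mean) ** j for o in observed)
    if denominator == 0:
        raise ValueError("observed series is constant; MNS is undefined")
    return 1.0 - numerator / denominator
```

A value of 1 indicates a perfect match, while a value of 0 means the model is no better than the mean of the observations.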
The water balance equation used in SWAT is given by:

$$SW_t = SW_0 + \sum_{i=1}^{t} \left( R_i - Q_i^{surf} - E_i^{a} - W_i^{seep} - Q_i^{gw} \right)$$

where $SW_t$ is the final soil water content (mm); $SW_0$ is the initial soil water content (mm); $t$ is time (days); $R_i$ is the amount of precipitation on day $i$ (mm); $Q_i^{surf}$ is the amount of surface runoff on day $i$ (mm); $E_i^{a}$ is the amount of evapotranspiration on day $i$ (mm); $W_i^{seep}$ is the water entering the vadose zone from the soil profile on day $i$ (mm); $Q_i^{gw}$ is the amount of return flow on day $i$ (mm). For computing surface runoff, SWAT uses the Soil Conservation Service Curve Number (CN) method as shown below:

$$Q_i^{surf} = \frac{(R_i - 0.2S)^2}{R_i + 0.8S}$$

where $S$ is the retention parameter.
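The water balance and curve number relations described above can be illustrated with a minimal sketch. It assumes the standard SCS relationships, namely an initial abstraction of 0.2S and S = 25.4(1000/CN − 10) in mm, which the text does not spell out; the function names are hypothetical, and this is not the SWAT source code.

```python
def retention_parameter(cn):
    """Retention parameter S (mm), assuming S = 25.4 * (1000 / CN - 10)."""
    return 25.4 * (1000.0 / cn - 10.0)


def cn_surface_runoff(rain_mm, cn):
    """Daily surface runoff (mm), assuming an initial abstraction of 0.2 * S."""
    s = retention_parameter(cn)
    if rain_mm <= 0.2 * s:
        return 0.0  # all rainfall is lost to initial abstraction
    return (rain_mm - 0.2 * s) ** 2 / (rain_mm + 0.8 * s)


def soil_water(sw0, rain, q_surf, et, seep, q_gw):
    """Final soil water content (mm) after accumulating the daily fluxes."""
    sw = sw0
    for r, q, e, w, g in zip(rain, q_surf, et, seep, q_gw):
        sw += r - q - e - w - g
    return sw
```

For example, `cn_surface_runoff(10.0, 80)` returns zero because 10 mm of rain is below the assumed initial abstraction of 12.7 mm for CN = 80.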

Global averaging
It involved computing the mean of each parameter listed in

Regression
The backward elimination method of stepwise regression was employed to determine functional relationships between the basin attributes (Table 3) and model parameters (Table 4). The procedure starts by including all the basin attributes in the regression model. In each step, a basin attribute is considered for removal from the set of variables based on the significance of its coefficient. If the coefficient of the attribute is found to be insignificant (t-stat < 1), the variable is removed from the regression. If the procedure results in the removal of all the variables, the corresponding globally averaged value obtained in the 'Global averaging' section is used (He et al. ).
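The elimination loop described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used in the study: ordinary least squares with an intercept is assumed, the absolute value of the t-statistic is compared with the threshold, and the single least significant attribute is dropped per step. All function names are hypothetical.

```python
import math


def _invert(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix (collinear attributes?)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols_t_stats(x_rows, y):
    """Return (coefficients, t-statistics) for y ~ intercept + attributes."""
    design = [[1.0] + list(row) for row in x_rows]
    n, k = len(design), len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    inv = _invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(design, y)]
    sigma2 = sum(e * e for e in resid) / (n - k)  # residual variance
    t = []
    for i, b in enumerate(beta):
        var = sigma2 * inv[i][i]
        t.append(b / math.sqrt(var) if var > 0 else math.inf)
    return beta, t


def backward_elimination(attributes, y, t_min=1.0):
    """Drop the least significant attribute until every |t-stat| >= t_min."""
    names = list(attributes)
    while names:
        rows = list(zip(*(attributes[name] for name in names)))
        beta, t = ols_t_stats(rows, y)
        # Index 0 of t is the intercept, which is never eliminated here
        weakest = min(range(len(names)), key=lambda i: abs(t[i + 1]))
        if abs(t[weakest + 1]) >= t_min:
            return names, beta
        names.pop(weakest)
    return [], None  # caller falls back to the globally averaged value
```

Here `attributes` maps each attribute name to its values over the donor basins and `y` holds the calibrated values of one SWAT parameter; an empty result signals that the globally averaged value should be used instead.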

Spatial proximity
The geographical distance ($D_{t,d}$) between the centroids of the basins is used as a similarity measure (Equation (4)), where LAT is the latitude of the centroid of the basin; LON is the longitude of the centroid of the basin; t is the index for the test basin and d is the index for the donor basin.
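A minimal sketch of donor selection by spatial proximity is given below. Equation (4) is not reproduced in the text, so a Euclidean distance in latitude-longitude space is assumed here purely for illustration; the function name `nearest_donor` and the data layout are hypothetical.

```python
import math


def nearest_donor(test_id, centroids):
    """Return the id of the donor basin whose centroid is closest to the test basin.

    centroids maps basin id -> (latitude, longitude) in degrees.
    Euclidean distance in degree space is an assumption made for illustration.
    """
    lat_t, lon_t = centroids[test_id]
    donors = (basin for basin in centroids if basin != test_id)
    return min(
        donors,
        key=lambda d: math.hypot(centroids[d][0] - lat_t, centroids[d][1] - lon_t),
    )
```

The test basin is excluded from the candidate donors, consistent with performing regionalization for each basin from the remaining basins.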

Behavioral similarity
This method is similar to spatial proximity (Section 'Spatial proximity'), except that the distance between basins was calculated using basin attributes (see Table 3) instead of geographical coordinates (He et al. ).
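The same nearest-donor idea can be sketched in attribute space. The text does not state how the attributes are scaled or which distance is used, so the z-score standardization and Euclidean distance below are assumptions, and the function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def standardize(attributes):
    """Z-score each attribute column so that no single attribute dominates the distance."""
    ids = list(attributes)
    columns = list(zip(*(attributes[b] for b in ids)))
    scaled = []
    for col in columns:
        mu, sd = mean(col), pstdev(col)
        # A constant attribute carries no information and is set to zero
        scaled.append([(v - mu) / sd if sd > 0 else 0.0 for v in col])
    return {b: [col[i] for col in scaled] for i, b in enumerate(ids)}


def most_similar_donor(test_id, attributes):
    """Return the donor basin closest to the test basin in standardized attribute space."""
    z = standardize(attributes)
    donors = (b for b in z if b != test_id)
    return min(donors, key=lambda d: math.dist(z[test_id], z[d]))
```

Standardization matters here because basin attributes such as area and land-use percentages have very different magnitudes.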

Artificial neural network
The SWAT model parameters for each test basin were predicted using an ANN model trained with the calibrated model parameters of the remaining 23 basins as outputs (Table 4) and their corresponding basin attributes as inputs (Table 3).
ANN was employed to predict each SWAT model parameter for each test basin using the model parameters of the remaining 23 basins as outputs (Table 4) and basin attributes (of corresponding basins;  Table 5.

RESULTS AND DISCUSSION
For each of the five regionalization methods, first the SWAT model parameters were estimated for each basin, and then their performance was compared with the corresponding calibrated model for simulating daily streamflow during the study period.

Estimation of SWAT parameters
For global averaging, the mean of each calibrated parameter was computed over all donor basins after removing the test basin. For spatial proximity, the closest basins were identified based on location, and for behavioral similarity, the closest basins were identified based on basin attributes (see Table 6). Of the 24 basins, nine had the same donor basin under both the spatial proximity and behavioral similarity methods.
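The leave-one-out averaging described above can be sketched as follows. The data layout (a dictionary of calibrated parameter sets keyed by basin id) and the function name `global_average` are hypothetical.

```python
from statistics import mean


def global_average(test_id, calibrated):
    """Mean of each calibrated parameter over all basins except the test basin.

    calibrated maps basin id -> {parameter name: calibrated value}.
    """
    donors = [params for basin, params in calibrated.items() if basin != test_id]
    if not donors:
        raise ValueError("at least one donor basin is required")
    return {name: mean(p[name] for p in donors) for name in donors[0]}
```

Calling the function once per basin yields a separate averaged parameter set for each test basin, each computed from the remaining basins only.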
The soil category Loam (L) is an important explanatory variable that is present in all the equations except that for r_CN2. Figure 4 shows the scatter plots of calibrated and regression-based parameters obtained from the regional regression equations. For all the parameters, regression equations were obtained with R² greater than 0.7, except for r_CN2 (R² = 0.497) and v_GWQMN (R² = 0.332).
The equations perform best for r_SOL_AWC, v_GW_DELAY and r_OV_N, with R² greater than 0.9. The performance of regression is better than a previous study
The '-' represents the case when no satisfactory ANN model was found.
by these regionalization methods is shown in Figure 8(c).
Furthermore, to compare the performance of these methods in simulating peak flows, the analysis was repeated for only those days on which streamflows have a 5% exceedance probability. The % improvement (in terms of RMSE) for predicting such flows with respect to the default set is shown in Figure 8. On comparing the performance based on peak flow prediction, it can be seen that both the ANN and the regression-based regionalization are comparatively better at predicting the peak streamflow (regression median = 7.99% and ANN median = 10.03%). Moreover, behavioral similarity shows marginally better performance (median = 4.82%) than spatial proximity (median = 2.16%). Thus, these methods (which involve physical parameters) are able to capture peak flows more accurately.
There are some exceptions, such as B-12, for which all the methods show significant improvement except the spatial proximity method. Similarly, for B-22, all the methods show improvement except the similarity methods (spatial proximity and behavioral similarity). These basins (B-12 and B-22) have diverse LULC and soil characteristics (see Table 3 and Figure 3). Thus, we must be careful while selecting the range of basin attributes for regionalization. For basins B-7 and B-23, none of the regionalization methods shows any improvement, probably due to dissimilar climatic conditions (lower maximum temperature and higher relative humidity) and soil categories (largest % of Loam for B-23), respectively.
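The peak-flow comparison described above can be sketched as follows. The text does not give the exact definitions, so two assumptions are made: peak days are taken as the top 5% of days ranked by observed flow, and the % improvement is the relative reduction in RMSE with respect to the default parameter set. The function names are hypothetical.

```python
import math


def peak_flow_days(observed, exceedance=0.05):
    """Indices of the days with the highest observed flows (top `exceedance` fraction)."""
    n = len(observed)
    n_peak = max(1, int(n * exceedance))
    ranked = sorted(range(n), key=lambda i: observed[i], reverse=True)
    return sorted(ranked[:n_peak])


def rmse(observed, modeled, days):
    """Root mean square error restricted to the selected days."""
    return math.sqrt(sum((observed[i] - modeled[i]) ** 2 for i in days) / len(days))


def pct_improvement(observed, default_sim, regional_sim, exceedance=0.05):
    """Relative RMSE reduction (%) of a regionalized run against the default run."""
    days = peak_flow_days(observed, exceedance)
    base = rmse(observed, default_sim, days)
    if base == 0:
        raise ValueError("default simulation already matches the observed peaks")
    return 100.0 * (base - rmse(observed, regional_sim, days)) / base
```

A positive value means the regionalized parameters reproduce the peaks better than the default set, and a negative value means they perform worse.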

SUMMARY AND CONCLUSIONS
In this study, parameters of the SWAT model for 24 river basins in peninsular India obtained from five different methods of regionalization (global averaging, regression, spatial proximity, behavioral similarity and ANN) are compared with the default and calibrated parameter sets.
Of the 24 basins, 9 had the same closest basin under both the spatial proximity and behavioral similarity methods. All the regression equations developed during regionalization showed an R² value greater than 0.7, except those for r_CN2 (R² = 0.497) and v_GWQMN (R² = 0.332), with the equations for r_SOL_AWC, v_GW_DELAY and r_OV_N having R² greater than 0.9. For the best ANN architectures, nearly 30 neurons were needed in the hidden layers for each variable. For the variables corresponding to groundwater processes (GWQMN, GW_DELAY, GW_REVAP), comparatively fewer neurons gave better results.
The median MNS value for all the calibrated models (about 0.55) is relatively smaller than the globally averaged median value of 0.75, found from 33 studies reviewed by Guo et al. (). This can be attributed to unreliable or erroneous data for Indian basins.
Compared with the other methods, spatial proximity has a larger box width, and for the calibration period and for the ANN, the whiskers go below the zero MNS mark. In terms of % improvement, regression and behavioral similarity show slightly better performance, with regression having the edge owing to its higher median value. On comparing the performance based on peak flow prediction, we found that the methods which involve physical parameters (regression, behavioral similarity and ANN) are able to capture peak flows more accurately than the other methods of regionalization.
On analyzing the regional patterns, we found that for the basins where calibration shows a higher improvement, the regionalization methods also exhibit improved performance. For the basin located at the most extreme distance, none of the methods (including calibration) showed any significant improvement. For some of the basins (e.g. basins B-7 and B-23) having dissimilar climatic or soil characteristics, none of the regionalization methods shows any improvement. Based on the above analysis, the conclusions of the present study are:
1. The performance of the regionalization method (and calibration) is limited by unreliable or erroneous data at the basin.
2. The regression method outperforms other regionalization methods in terms of predicting the daily as well as peak discharges. Thus, the equations developed in the present study can be utilized to predict SWAT parameters of basins located in the vicinity of the study area.
3. Basins located far away from the group of catchments, or having diverse characteristics, should be avoided for regionalization.
4. The ANN should also be avoided for regionalization in the absence of sufficient training data, as its performance is then inadequate.

Future scope
The results of our study show that considerable scope remains for further work on regionalization, especially for Indian basins. The most important consideration for future studies is to include a sufficiently large number of basins. With a larger number of basins, homogeneous groups should be formed and regionalization should be carried out within each group. Moreover, future studies can also explore nonlinear and nonparametric regression methods.