GIS-based Groundwater Spring Potential Mapping Using Data Mining Boosted Regression Tree and Probabilistic Frequency Ratio Models in Iran

This study intends to investigate the performance of boosted regression tree (BRT) and frequency ratio (FR) models in groundwater potential mapping. For this purpose, location of the springs was determined in the western parts of the Mashhad Plain using national reports and field surveys. In addition, thirteen groundwater conditioning factors were prepared and mapped for the modelling process. Those factor maps are: slope degree, slope aspect, altitude, plan curvature, profile curvature, slope length, topographic wetness index, distance from faults, distance from rivers, river density, fault density, land use, and lithology. Then, frequency ratio and boosted regression tree models were applied and groundwater potential maps (GPMs) were produced. In the last step, validation of the models was carried out implementing receiver operating characteristics (ROC)


Introduction
Groundwater problems have been identified as the most important challenges of the 21st century in the world [1,2]. Considering the increase of demand for fresh water resources, two appropriate tools are being used by the engineers and planners for an efficient management and production of groundwater resources [3]. Geographic information system (GIS) can be used in the decision making process in water resource management as it is a powerful tool [4]. Several researchers have applied a combination of GIS and remote sensing (RS) tools for evaluation of groundwater potential mapping in medium to regional scale [3][4][5]. With the progresses of the RS and GIS techniques and software, mapping of groundwater potential has become an easy and efficient procedure [5]. However, in developing and threshold countries like Iran, there is a severe lack of precise and complete dataset, for that reason, remote sensing data is necessary for understanding the groundwater condition and subsequently its management. In this respect, many researchers have used different models and methods, for example, Oh et al. [3], Ozdemir [6,7], Manap et al. [8], Pourtaghi and Pourghasemi [9], Naghibi et al. [10] used FR model. Other statistical models such as, weights-of-evidence [7,9], logistic regression [6], evidential belief function [11][12][13] models have been implemented for groundwater assessment. More recently, some researchers used data mining algorithms such as boosted regression trees (BRT), classification and regression trees, random forests, k nearest neighbor (KNN), linear discriminant analysis (LDA), quadric discriminant analysis (QDA), and multivariate adaptive regression splines (MARS), support vector machine (SVM), artificial neural network (ANN), flexible discriminant analysis, penalized discriminant analysis models in ground water potential mapping [14][15][16][17]. Data mining as a group of applied artificial intelligence can be clarified as the extraction of information from a dataset and relate the input and output variables [18]. Data mining models are able to deal with non-linear issues such as landslide and groundwater studies [14,19]. These models can be used for classification, regression and in some cases for survival studies. In this study BRT model was used for a two-condition classification problem which was existence or non-existence of the springs in the study area. In addition, it is worth mentioning that in the literature, some physically-based models are used to address the incidence of springs in massifs of crystalline rocks and at catchment scale, and have contributed to clarify the role of rock structure and flow circuits in the incidence of springs [20,21].
Recently, data mining approaches are getting extremely popular because of their high performance and strong features. Investigation of the literature shows that FR as a probabilistic model has been used in different studies for groundwater potential mapping. These studies have reported the results of this model acceptable. BRT, on the other hand, has been used in lower number of studies. This research aims to compare the performance of FR and BRT as probability based and data mining based approaches to determine which one of these approaches can provide better results. A thorough literature review reveals that several researchers have reported different importance and contribution of the GCFs in groundwater modelling. For example, some studies have reported high importance of altitude, plan curvature, and profile curvature [17], while other studies have depicted high contribution of altitude, TWI, slope angle, and fault density [15]. Thus, it can be seen that there are some differences between these two studies which could be related to different characteristics of the watersheds, and the modelling procedure implemented in the mentioned studies. Therefore, this research intends to (i) investigate and compare the performance of data mining based BRT and probabilistic based FR models, (ii) investigate the relationship among the thirteen conditioning factors, and (iii) determine the contribution of the conditioning factors.

Study Area
The study area is situated in the western parts of the Mashhad Plain which consists of Torghabeh and Shandiz residential areas located between 36° 04′ 17″ and 36° 30′ 36″ latitudes, and 59° 02′ 32″ and 59° 33′ 48″ longitudes. The study area covers approximately 1,268 km 2 ( Figure 1). The altitude of the study area varies from 1,045 to 2,944 m (average = 1,615 m). Four land use classes can be seen in the study area such as agriculture, orchard, rangeland, and residential area. Among these land use classes, rangeland covers the most part of the area. In this region, people exploit groundwater resources by well, spring and qanat, and use water resources in different sections such as farming, drinking water, and livestock.

Spring Characteristics
In the study area, 155 springs were detected and mapped at 1: 50,000-scale by field surveys and national reports by Iranian Department of Water Resources Management [22] (Figure 1). Then, the reported spring data were randomly grouped [3,[6][7] into two classes of 109 (70%), and 46 (30%). These two groups were employed in modelling and evaluating of the groundwater potential maps (GPM). Discharge of the springs ranges between 0.1 and 60 lit/s with an average of 3.6 lit/s. The pH of the water in springs changes from 5.2 to 8.4 with an average of 7.3. From 155 springs in the study area, 12 springs are seasonal and other springs are permanent. Water of these springs is used in different sections such as agriculture, drinking water, and livestock.

Data Preparation
The main factors affecting groundwater potential were selected based on literature review [3,6,7]. However, other researchers [20,21] have mentioned that a few factors influence groundwater potential such as geological parameters (i.e. length of fractures, the stress field orientation and the deformational regimes) and recharge system. These factors are slope degree, slope aspect, altitude, plan curvature, profile curvature, slope length, topographic wetness index, distance from rivers and faults, river density, fault density, land use, and lithology. A DEM having 30×30 m grid size was derived from the 1: 50,000-scale topographic maps of the studied area. This layer was employed to produce slope degree, slope aspect, altitude, slope length (LS), topographic wetness index (TWI), profile and plan curvature factors. Classification of the conditioning factors was arranged based on the methods encountered in the literature review [10,15,17].
In addition, two curvature maps including plan and profile curvature maps which has been reported as important conditioning factors on groundwater potential were used. A contour line which is created at the crossing between the horizontal plane and terrain surface is called plan c urvature ( Figure 2d) [23][24][25]. Profile curvature is defined as curvature of the flow (Figure 2e) [23,25,26]. These layers were calculated in the system for automated geoscientific analyses (SAGA) software and were used in this study as conditioning factors.
In addition, two other DEM-derived factors of LS and TWI were produced using the SAGA software and used in the modelling process. The equation for calculating LS factor was defined by Moore and Burch [27] and used in this study (Figure 2f). Pourghasemi et al. [28] mentioned that there is a direct relationship between LS and the water that gathers at the bottom of the region. TWI was calculated using the equation defined by Moore et al. [29] and was classified and used in the current study (Figure 2g). River and fault related factors were reported to have influence on the spring occurrence as well as the groundwater potentiality [10]. In this case, four factors were prepared and used such as distance layers of rivers and faults and density layers of the rivers and fault. One hundred, and two hundred and fifty meter intervals were considered in classifying two distance layers of river and fault (Figure 2h-i). In the case of river density and fault density, since there is jump in density layers, natural break classification scheme was implemented to classify them [30,31] (Figure 2j-k).
In order to create the land use map of the western parts of the Mashhad Plain, maximum likelihood, a supervised classification algorithm and Landsat satellite images of the year 2014 were implemented by Iranian forest, rangeland and watershed management organization [32]. Land use layer comprises four different classes of agriculture, orchard, rangeland and residential areas (Figure 2l).
A 1:100,000-scale geological map was employed for preparing the lithology layer of the western parts of the Mashhad Plain [33]. The lithology layer includes thirteen groups that are represented in Figure. 2m, and Table 1.

Multicollinearity Analysis by Variance Inflation Factor and Tolerance
Multicollinearity denotes the non-independence of GCFs which can be seen in datasets due to their high correlation [34]. For analyzing the multicollinearity issue, two indices such as variance inflation factor (VIF) and tolerance were computed for GCFs. VIFs greater than 10 or tolerance less than 0.1 show that there is multicollinearity and the factor should be removed from next steps of the modelling process [35,36].

Frequency Ratio (FR)
FR model can be used to compute the relationship among the input and output factors of the model which in this study are conditioning factors and spring location [3]. The results of this model can be understood easily and application of this model is simple [37,38]. To compute the FR for classes of each input factor, number of springs in each class and area related to each class was determined by ArcGIS 9.3. In the next stage, the FR equation can be used as below [39]: (1) Where, S depicts the number of springs in each class, TS shows the total number of springs, P represents the number of pixels in each class, and TP depicts the total number of pixels. At last, FR values for all of the factors will be summed and a final GPM will be produced by Equation 2: (2) A FR value of 1 depicts an average value, while a greater value than 1 shows higher correlation and potential of each class [40].

Boosted Regression Tree (BRT)
Another relatively new data mining model which was used for groundwater potentiality modelling in this study is boosted regression trees. BRT uses two powerful techniques such as gradient boosting and classification and regression tree (CART) [41,42], and was introduced by Friedman [43]. Boosting is used to improve model accuracy by averaging many rules as an alternative of obtaining a single one with higher accuracy [44]. Decision trees include two types of classification and regression. In regression trees, the target or output variable can have continuous values. The rules are decision rules which could be different decision rules in tree-based models based on the relationship between the output and input factors and the modelling procedure as well. Boosting gradient technique employs a weighted average of results gained from prediction of various samples [45]. In BRT, three parameters require to be optimized including number of trees or number of boosting iterations, interaction depth or max tree depth, and shrinkage [46]. The shrinkage shows the contribution of the trees in the grown model [47]. The size of the individual trees is determined by a parameter called the interaction depth [48].
For tuning BRT, a grid of these parameters was employed and a 10-fold cross validation method was implemented. This grid contains interaction depths of 1, 5, and 9, shrinkages of 0.001, 0.005, 0.01, 0.05, and 0.1, and number of trees of 100 to 1,000 with 100 intervals. BRT also determines the importance of the conditioning factors in the modelling process. This is done by measuring the final (tree) performances associated to the use of the given covariate and then normalizing with respect to the highest contributor [49]. In this study, BRT was optimized employing caret and generalized boosted regression models scripts in the R statistical software, and caret and gbm packages [46,50,51]. Considering the above mentioned grid, the model runs on data and the accuracy index will be calculated as shown in Figure 6. Finally, the CARET (Classification and Regression Training) package selects the best combination of the three mentioned parameters in BRT, and this final model will be employed in the prediction step. The CARET is a package which provides different models and algorithms for classification and regression issues [46]. In this package, two indices of accuracy and Kappa are calculated for the training procedure and selection of the best parameters for the final model. Accuracy, and Cohen's Kappa index, can be computed as below [52]: where, n is a proportion of pixels from the total number of training pixels that is accurately categorized. FP denotes false positive, TP shows true positive, TN represents true negative, and FN denotes false negative.

Validation of the Models Performance Using Receiver Operating Characteristics (ROC) Curve
In the modeling procedure, one of the most important steps to achieve scientific significance is validation process. In many studies, ROC curve has been used to assess the efficiency of the groundwater potential mapping [4,9,53]. This curve is known as a common index for predication of accuracy of the models in classification problems [54]. Area under the mentioned curve illustrates the capability of the method to foresee whether a phenomenon happens or not [55]. In this study, ROC and AUC values were plotted and calculated using SPSS 20. For calculating ROC, a dataset including validations spring locations and non-spring locations were used. For ROC analysis, proportion of the springs and non-springs need to be determined. The researchers are arguing about the best proportion of presence and absence to use for validation process which is called cutoff value [56]. A cutoff of 0.5 is used when springs and non-springs have the same number in the validation step [56]. Lombardo  In BRT which determines the probability of spring occurrence, a value greater than 0.5 shows high potential of the groundwater, while in frequency ratio, a high value of FR shows higher potential of the groundwater. This needs to be mentioned that non spring locations were determined by Hawth's tools extension in ArcGIS 9.3 based on a random algorithm. The results of this curve are represented in Figure 4. Based on the results, BRT had AUC of ROC value of 87.2%, while it was seen that FR had AUC of ROC value of 83.2% which shows higher capability of the BRT model than FR in this research.

Results and discussion
The results of multicollinearity are presented in Table 2. According to the results, it can be seen that VIF values are less than 10 and tolerance values are greater than 0.1; thus, it can be evident that there is no multicollinearity problem in this study.  Table 3 displays the result of the frequency ratio for different factors employed as input. A higher FR, in this case, shows higher probability of spring occurrence and groundwater potentiality [4,47]. Results implied that slope degree class (5-15) has a high FR value, slope aspect classes (east, and south) have the highest values, altitude class (1,800-2,200) has the highest values of 1.25, plan curvature class (concave) had the highest value of 1.34, profile curvature class (<-0.001) showed the highest value of 1.45, slope length class (>60) have the highest value of 1.49, TWI class (>12) has the highest value of 2.64, distance from rivers class (100-200) has the highest value of 2.56, distance from faults class (250-500, and <250) have the highest values of 1.68, and 1.67, respectively, river density class (<0.31) has the highest value of 1.59, fault density class (8.37-15.70) has the highest value of 1.31, land use class (rangeland) has the highest value 1.17, and lithology class (6 and 8) have the highest values of 11.45, and 6.11, respectively. Overall, the results of the FR model exhibit a relatively reverse relationship between spring occurrence and groundwater conditioning factors such as altitude and river density in the study area. Lower altitudes include lower slope areas with more developed drainage system which could be a reason for the observed reverse relationship between altitude and spring occurrence. In the respect of river density, higher river densities show higher concentration of the water flow and subsequently higher amount of infiltration. On the other hand, a direct relationship was seen between TWI and groundwater potentiality. TWI shows the water intent to gather at any point of the watershed, therefore a higher TWI could be referred to a higher amount of infiltration. The findings of this study also showed a reverse relationship between distance from faults and spring occurrence which could be the result of the influence of the faults on the fracture springs. In the GPM produced by FR, classes range and area percentage of each class are shown in Figure 5a and Table 5. According to Table 5, low, moderate, high, and very high classes cover 30.40, 19.44, 19.17, and 0.98 % of the study region, respectively. Table 3. Spatial relationship between each conditioning factor and spring locations using frequency ratio (FR) model considering number of pixels, percentage of pixels, number of springs, and percent of the springs.

Figure 5. GPMs produced by (a) FR and (b) BRT models classified into four classes of low, moderate, high, and very high, and distribution of the validation springs in different classes.
The results of the BRT model showed that the final BRT model included number of trees (Boosting iterations) of 500, interaction depth of 5, and shrinkage of 0.005 ( Figure 6). The final BRT model had accuracy and kappa values of 0.73, and 0.47, respectively. In addition, the importance of the conditioning factors was determined using BRT model ( Table 4). The results of BRT model showed that TWI, and altitude were the most important factors, while aspect and plan curvature had the lowest importance in groundwater potential modelling. Also, it was seen that lithology had not any effect on the modelling process in the current study. Figure 5b shows the final GPM produced by BRT model indicating that low, moderate, high, and very high classes cover 36.83, 29.57, 19.79, and 13.81 % of the study region, respectively ( Table 5).
The findings of this study suggest that BRT and FR methods are appropriate tools for groundwater assessment. BRT had better performance than FR model, but their AUC of ROC values were both more than 80% which shows their acceptable performance in groundwater potential mapping. The BRT as a model that uses regression and gradient boosting algorithms [59] have some advantages and disadvantages. BRT is capable to select the conditioning factors, i t can identify the interactions, and it is capable to fit an accurate function for classification [47]. In addition, one point which makes the BRT models robust is that it can remove predictor variable which have large number of missing values [60]. However, there are some disadvantage in the BRT model. For example, interpretation of the BRT model is difficult, and computation of the model can take a long time [60]. The results of the FR model showed that there is a relatively reverse relationship between spring occurrence and groundwater conditioning factors such as altitude and river density in the study area. On the other hand, a direct relationship was seen between TWI and groundwater potentiality. Moreover, this research attempted to define the importance of different groundwater conditioning factors in groundwater modelling. According to the results of this study, TWI was the most important factor, followed by altitude, and distance from rivers. On the other hand, aspect, and plan curvature were seen to be the least important factors. In addition, the results showed that lithology had not any effect on groundwater modelling using the BRT model. Naghibi and Pourghasemi [12] reported that altitude, distance from faults, SPI, and fault density represented the highest contribution to groundwater potential assessment. In another research, Naghibi et al [17] stated that altitude, slope, plan curvature, and profile curvature were the most important influencing factors. Comparing the results of the current research and the two above mentioned studies, it can be seen that only altitude was seen as an important factor, while other factors have different ranks and importance in different studies. This fact can be related to different hydro-geological, climatic, and topographical features of the watersheds. Furthermore, different algorithms and techniques employed in the implemented models could lead to these variations.

Conclusion
In order to obtain the objectives of this study, at first, a dataset of spring locations and groundwater conditioning factors was provided. Springs were classified into two groups of training and validation and groundwater conditioning factors are slope degree, slope aspect, altitude, plan curvature, profile curvature, slope length, topographic wetness index, distance from river, distance from fault, river density, fault density, land use, and lithology. Utilizing the training springs and groundwater conditioning factors, BRT and FR models were applied and GPMs were produced. At last, ROC and AUC were employed to appraise the capability of the methods in groundwater modelling. The results of this research showed that BRT and FR models are good tools for groundwater assessment. BRT had better performance than FR model, but their AUC of ROC values were both more than 80% which shows their acceptable performance. The results also showed that TWI was the most important factor, followed by altitude, and distance from rivers, while aspect, and plan curvature factors were seen to be the least important factors. In addition, the results showed that lithology had not any effect on groundwater modelling using the BRT model. The methodology produced in this study could be used for groundwater exploitation plans to decrease the costs and time needed for these projects in watersheds having similar conditions.