Landslide susceptibility assessment and mapping using statistical and data mining models in Iran

This study applied to evaluate landslide susceptibility using four data mining models including, “Generalized Linear Model (GLM)”, “Maximum Entropy (ME)”, “Articial Neural Network (ANN)”, and “Support Vector Machine (SVM)” in Cherikabad Watershed in Urmia City, Iran. In particular, Shannon entropy was used to assess the intercomparison of factors’ classes. Eleven factors including, elevation, slope angle, slope aspect, geological formation, annual mean rainfall, land use/ land cover, distance to the village, distance to faults, distance to roads, distance to streams, and NDVI used in the current study. Landslide inventory map was identied using Google Earth imagery, extensive eld surveys, and scrutinizing archived data. The produced landslide susceptibility maps were evaluated by the AUROC index. The results of performance metrics revealed that the Shannon entropy with an AUROC of 0.879 proved highly reliable and so is the intercomparison analysis of factors’ classes derived from it. Additionally, the goodness-of-t of the GLM, ME, ANN, and SVM models were 0.763, 0.740, 0.926, and 0.924, while their predictive powers were 0.751, 0.727, 0.917, and 0.935, respectively. Hence, the results indicated that the SVM model can be introduced as the superior model for the study area based on which the most critical factors affecting landslides were found to be elevation, annual mean rainfall, and distance to the village. The results of this work are of great use for land use planning in landslide-prone areas with similar geo-topological, geomorphological, and climatic conditions.

Iran is considered a landslide-prone country due to favorable geographical conditions in which many landslides have occurred in recent years (Moayedi et al., 2020). Evaluation of landslide susceptibility, and hazard with a spatial connotation of landslide occurrence probability has been considered as a practical solution and a pivotal part of the landslide risk mitigation (Hong et al. 2019). In particular, the classi cation of susceptible areas through landslide spatial modeling (LSM) is an essential step in environmental hazard assessment in watershed management frameworks (Sarkar et al., 1995). Also, landslide sensitivity assessment is a logical approach and a new solution to reduce and minimize landslide damages (Fell et al., 2008).
The aim of generating a landslide spatial map is to understand the most susceptible areas as well as to predict future landslide-prone areas (Chen et al., 2019c). During the past three decades, the development of novel modeling techniques has been unceasingly progressing, all of which aim at producing a more precise and generalizable landslide susceptibility mapping scheme (Kornejady et al., 2017;Broeckx et al., 2018;Reichenbach et al., 2018).
Among different landslide susceptibility assessment and zoning models, data mining models use a wide range of data to discover patterns and predict the occurrence of the phenomenon of interest in such a way that enables them to provide better results compared to their statistical counterparts (Zaki et al., 2014;Zheng, 2015). In light of these premises, we set the objectives of this work as follows: four data mining models, namely GLM, ME, ANN, and SVM), as well as a bivariate statistical model known as Shannon entropy-SE, were used to map landslide susceptibility across the Cherikabad (NW of Iran) and the derived susceptibility maps per each model were analyzed using one cutoff-independent metric termed as the AUCROC index.

Study area
The Cherikabad Watershed is situated in the Northwest of Iran and extends for an area of 51.8 km 2 (Fig. 1).
Elevation of the Cherikabad Watershed vary from 1,628 to 3,473 m.a.s.l. The average slope is about 18%. The average annual rainfall of this watershed is 430 mm. The maximum and minimum of temperature, respectively, amounts to 13 °C and -7 °C (Mohammadnejad and Asghari, 2016). An arid or semi-arid climate has prevailed upon the study area. Diverse lithological settings exist in the region mainly consist of the ophiolite, metamorphic and meta-sedimentary rocks, limestone, dolomites, shale, and conglomerates, most of which belong to the Late Cretaceous from the Mesozoic, while plains mostly consist of Quaternary sediments.

Methods And Materials
The methodological owchart adopted in this study is presented in Fig. 2.

Data compilation
Landslide inventory (LI) is known as the rst and the most pivotal step for modeling susceptibility and hazard of landslides and, generally, any modeling effort with a spatial connotation (Regmi et al., 2014). In this study, the LI was carried out utilizing a handheld GPS device during extensive eld surveys, Google Earth images, and acquiring archived organizational data. In general, 95 landslide locations were recorded and mapped in ArcGIS, which were split into two sets of training (60%) and validation (40%) (Teimouri and Kornejady 2019). The second step regards the selection of the optimal combination of landslide-conditioning factors, which include predisposing and triggering factors (Pourghasemi and Rossi 2017). Drawing on the literature review and scrutinizing our observations during eld inspections, eleven factors were selected, namely elevation, annual mean rainfall, slope aspect, slope degree, distance to streams/village/ roads/ faults, lithological formation, land use/ land cover, and NDVI (Fig. 3). Table 1 show the original sources of the acquired data, their spatial resolution, and causative roles are provided in detail in Table1.

Statistical model
In recent decades, different statistical techniques texted to generate landslide susceptibility and hazard maps in divers' areas. Statistical methods can quantitatively examine the relationship between different variables and landslide locations (Guzzetti et al., 1999;Carrara et al., 1999). Statistical methods have gained due attention from researchers who are in pursuit of assessment of the reciprocal relationship between the landslide locations and effective factors (Aleotti and Chowdhury, 1999). According to the general concusses, statistical methods are divided into two categories (reference): bivariate and multivariate statistical methods. Bivariate statistical methods examine the relationships between each controlling factor and landslide occurrence in an area regardless of the effect of other factors, whereas multivariate statistical methods consider the interaction among different factors and between each factor and landslide occurrence (Wang et al., 2005). Bivariate and multivariate statistical methods have different types. In this study, a bivariate statistical method, namely, Shannon entropy, was used.

Shannon entropy model
The concept of entropy has been used in different scienti c branches with different meanings, such as disorder, instability, confusion, and uncertainty (Yufeng and Fengxiant, 2009). The theory was rst proposed by Boltzmann and expounded by Shannon (1948) later on. Entropy can indirectly determine how to select the most in uential factors from the many available factors (Lombardo et al., 2015). Therefore, this theory can have a signi cant impact on the sieving process of thematic layers contributing to landslide susceptibility modeling (Sharma et al., 2012). Shannon entropy is underpinned by a simple yet practical mathematical equation, which is known as frequency ratio. The FR is a primary statistical model that quanti es the relationship between the factor's classes within which the landslides have occurred and, accordingly, the relative rate of that particular class and the weights of the factors themselves. In general, the mathematics of Shannon entropy can be presented as follow (Tay et al., 2014;Sharma and Mahajan, 2019): "where a and b are the pixel percentages and landslides percentages, respectively, (Pij) is the probability density, H j and H jmax represent values of the entropy, I j is the information coe cient, and Wj represents the resultant weight value for the parameter as a whole (Sharma and Mahajan, 2019)".
Machine learning/ data mining models Data mining (DM) is known the extraction of information from a big database (Győrödi, 2003). The DM is a branch of computer science that started in the late 1980s by applying the concepts and methods related to arti cial intelligence, patterns identi cation, database systems, and the science of statistics (Chandrasekaran et al., 2019). In this research, some well-known data mining models with previously reported good performances were adopted, namely maximum entropy (ME), SVM, ANN, and GLM, which were all implemented in the ModEco software. The Kappa histogram was plotted and treated as the Jackknife test to determine the most imperative factors that are positively contributing to the modeling process of landslide susceptibility in the Cherikabad Watershed. All data mining models were executed in a software namely the ModEco according to Guo and Liu (2010).

Support vector machine
The SVM is a robust data mining technique that is underpinned by theory of statistical learning and was rst introduced by Boser et al. (1992). The SVM model pivots on a robust pattern recognition algorithm. In general, in linear problems, the SVM creates a separating line between diver patterns so that maximized the margin between the points (Kecman, 2005). In highly complex nonlinear subjects, the SVM transfers into an n-dimensional space using the kernel trick, where it can better distinguish the patterns and set the separating line termed as the hyperplane (Statnikov, 2011). There are different kernel functions, among which radial basis function (RBF) has been the most popular algorithm and showed outstanding performance in other literature (Aiserman et al., 1964). More details can be found in Yao et al. (2008) and Marjanović et al. (2011).

Maximum entropy
The ME is known as a presence-only generative data mining model based on the theory of information (Phillips et al., 2006). Shannon (1948) rst coined the term entropy as the information content stored in the data and the degree to which a phenomenon can be relatively unpredictable. A higher degree of relative unpredictability can reciprocally provide higher information content (Shannon, 1951). Hence, Shannon expounded that the ME can increase the content of information and enable the model to better discern the pattern of interest (Kornejady et al., 2017). Phillips et al. (2004) rst developed the MaxEnt software for spatial modeling of the present locations of the phenomena, and by walking through the parameters space with a rst guess and further modi cations, can perpetually distinguish the presences from absences (Kornejady et al., 2017). In this process, the MaxEnt randomly extracts 10,000 pseudo-absence locations that enable the model to better differentiate the pattern. Details can be observed in Phillips et al. (2004Phillips et al. ( , 2006 and Elith et al. (2011).

Arti cial neural network
The ANN imitates humans' neural systems in which information is passed through neurons from one synapse to another (Cherkassky et al., 2006). ANNs are machine learning algorithms made of three main layers, namely input, hidden, and output (Saha et al., 2002). Input layers are responsible for transferring the input data to the hidden layer where the input data are processed through different sub-layers and assigned different weights (either positive or negative) by applying sigmoid activation functions (Yesilnacar and Topal, 2005). The output layer analyzes the system's behavior in response to the gained information and the learning process (Prasad et al., 2012). A speci c type of ANNs, named back-propagation neural network (BPNN), executes the learning and error analyzing process through a forward feeding ows of information and backward error propagation cycles (Kumar and Mathur, 2001).
ANNs are useful tools once nonlinear data are involved; however, in pursuit of nding the most suitable solution with the least error (global minima), the model may get stuck in local minima, which have been mistakenly recognized by the model as the global minima. Moreover, if the model structure is formed of many layers, the learning process may become cumbersome and prolonged. More details can be found in (Kumar and Mathur, 2001;Yesilnacar and Topal, 2005).

Generalized linear model
The GLM simply seeks a linear regression between the landslide locations and effective variables via a link function (Brenning, 2005;Felicísimo et al., 2013;Conoscenti et al., 2014). The latter connects the linear predictors and the mean of an exponential probability distribution function. The GLM was rst expounded by Nelder and Wedderburn (1972). As the name of GLM implies, it is a parametric method with limited and disputable assumptions that would not be well-suited to nonlinear predictors. More details are given in Lee and Pradhan (2007).

Model's validation
After preparing the landslide susceptibility maps using adopted data mining models, the validation dataset (40% of the total landslide numbers) (Park & Kim, 2019) was used to test the validity of the results. The latter was tested using the ROC curve (i.e., AUROC value). The ROC curve plots the false positives (i.e., 1-speci city, incorrectly predicting the non-landslide locations as landslides) on the x-axis against the true positives (i.e., sensitivity, correctly predicting the landslide locations as observed in nature). The AUC is the representative of the model's performance in which the AUROC value of 1 shows a perfect model, while values equal or close to 0.5 signify a neutral model whose results are derived from pure random chance (Pontius and Schneider, 2001;Yilmaz, 2009). Once the ROC curve is plotted only by the training set, it represents the goodness-of-t and the learning skill of the model, whereas the validation set-derived ROC curve only states the prediction power and the generalization capacity of the model. For drawing the ROC curve, the Performance Measure Tool (PMT) applied in ArcGIS 10.4 (Rahmati et al., 2019).

Results And Discussion
Validation of the Shannon entropy model After identifying the factors affecting the landslide occurrence and preparing their classi ed maps using GIS, the weight of each of these factors was calculated and analyzed following the previously mentioned equations ( Table   2). The FR was applied to determine the relationship between the landslides and the most in uential variables. Landslide susceptibility map derived from Shannon entropy was categorized into ve classes (i.e., very low, low, moderate, high, and very high). The Natural Break method used for this classi cation ( ve classes). According to Table (3), portion of high and very high susceptibility zones is about 32% of the Cherikabad Watershed. .
Based on validation's results, the AUROC value of the ME model was found to be about 0.88 (Fig. 4). According to the classi cation suggested by Hosmer et al. (2013), the results stated that the ME model performs very well in identifying the highest contributing factors to landslide occurrence as well as susceptibility mapping across the basin.

Intercomparison of factors' classes using Shannon entropy model
Some of the inferences derived from factors and the classes within were expected, while some unexpected results were also discernible. The latter makes any solitary factor analysis infeasible; that is, unexpected results should be scrutinized, and their synergistic interactions with other factors need to be further assessed. By doing so, a more straightforward explanation would emerge, which is provided in detail as follows.

Expected results
It is expected that landslide susceptibility should increase as rainfall increases, as with our results and those reported by Chen et al. (2019a) and Pham et al. (2019) . However, such a constant increase is contingent on spatially intact conditions such as homogeneous lithology along the elevation gradient, which is rarely the case. For instance, slopes in higher altitudes of our study area are generally formed of rocks that are highly resistant to landsliding. Hence, the decline of landslide susceptibility in higher rainfall classes (i.e., above 441 mm) is also justi ed. The same applies to the slope gradient factor. Expectedly, steep slopes exert more weight on the underlying materials, which leads to landslide initiation; however, highly steep slopes are formed of highly weathering-resistant rocks, which manifests in the form of low pedogenic development of soil pro les at upper slope positions.
Moreover, moderate slopes in the study area (i.e., 24-34%) are covered by highly susceptible shale and silts and poorly-managed pastures with signs of overgrazing. Low susceptibility values at lower slope positions are also justi ed due mainly to the low and less-effective gravitational pull. These results are similar to Kornejady et al. (2017).
Regarding the lithological formations, the tectonic status of the study area has led to diverse and heterogeneous outcrops that add up to 15 lithological formations, among which ultra-basic rocks, pillow-shaped rock mounds, and pelagic limestone, Mesozoic in age, had the highest rate of landslide density. Mainly, unmetamorphosed pelagic limestone by featuring a chaotic assemblage of clayey-marly sequences with an increase in shale thickness towards upper strata exhibits high susceptibility to different types of landslides, which has been previously reported in different studies such as Ortner and Kilian (2016) and Carabella et al. (2019). Similar results on landslide-prone limestones are also reported by Lau (2018), , and Pham et al. (2019). Regarding the NDVI factor, areas with sparse vegetation gained the highest susceptibility values, which aligns with the role of vegetation canopy and root system in protection of slope surface and reinforcement of subsurface soil particles. The same result has been reported by Yalcin et al. (2008). North-facing slopes in the study area receive more rainfall and accordingly have higher moisture content, which explains the high concentration of landslides in these slopes. This result is in agreement with Zhang et al. (2018) and Chen et al. (2019b).
Changes of the land-use from rangeland to irrigation farming is a well-known driver of landslide occurrence, particularly when it takes place on moderate to steep slopes with susceptible lithological formations. Similarly, irrigation farming land covers in the Cherikabad Watershed gained the highest susceptibility value, which addresses the role of the arti cial rise in groundwater table due to the excessive irrigation and the altered natural compactness of soil surface due to the use of agricultural machinery. Similar results have been reported by Pourghasemi et al. (2013) and Sharma and Mahajan (2019). Further, low to moderate altitudes receive more rainfall than higher altitudes due to the prevailing inverse rainfall gradient relationship in the region. Additionally, higher altitudes are comprised of landslide-and erosion-resistant rock formations, which coincides with Jaafari et al. (2014), Pourghasemi and Rossi (2017), Hong et al. (2017), and Kim et al. (2018).

Unexpected results
It is often expected to observe a decreasing susceptibility pattern by moving away from linear or point features such as streams, faults, roads, and residential areas, which are mostly mapped under distance functions; however, such an assumption only holds when such factors actively contribute to landsliding process. Nonetheless, even highly important distance-based thematic maps can show asymmetric susceptibility patterns due mainly to the adopted classi cation method and the pivotal role of factor outweighing effect. The latter regards the fact that factors interact in a complex manner; for instance, lithology and slope as the main predisposing factors can always interfere and outweigh other factors, hence resulting in asymmetric and unexpected results at rst glance. Hence, we scrutinized the asymmetries in each factor by investigating the other underlying factors.
As for the distance from streams, most of the landslides and accordingly, the highest susceptibility is observed in the range of 185-310 m from streams, which also contains intense fault systems that exacerbated the predisposing condition for landslide occurrence. Asymmetries in this factor mostly pinpoint the range of 85-185 m; 70% of which have are located on gentle slopes-lower than 24% slope angle. Similarly, most of the roads are constructed in gentle slopes-almost at. Moreover, as the result of local stakeholders' participation in soil conservation projects, the slopes nearby roads are protected by afforestation, installment of drainage systems, and construction of concrete masonry retaining walls. Hence, the role of the road system in landslide occurrence became negligible, which is in line with Regmi et al. (2010).
The role of rural residential areas in landslide occurrence is still controversial; some emphasized the role of landslide redistribution and reactivation of the old dormant landslide, while some simply rejected the idea because the residential fabric is built upon at slopes. The latter reasoning is also debatable since some local slopes may exist within or in the vicinity of residential fabric, which can be triggered by anthropogenic agents. Nonetheless, it seems that any interpretation regarding the causative role of the village on landslide occurrence, if any, should be concentrated only in areas close to the village, and there is a big chance that in farther areas, other predisposing factors would prevail. As with our results, the effective and interpretable distance reaches 3.37 km from the village where landslide susceptibility decreases concurrently with taking distance from the village, while upper classes are prevailed by other factors such as the presence of irrigation farming.
Distance from faults also seems to be unexpected due to the abundance of fault systems in the region; however, faults require to be seismically active to partake in landslide occurrence, and such an assumption does not always hold. Moreover, factor outweighing is another plausible explanation for such a result based on which by moving away from faults, sparse vegetation, gentle slopes of faults, and the outcrop of susceptible formations. Somewhat similar results have been reported by Pourghasemi et al. (2018) and Devkota et al. (2013).

Data mining/machine learning models
The nal maps extracted from four data mining models were classi ed in ArcGIS10.4 (Fig. 5). The percentages of high and very high classes for the ME, GLM, ANN, and SVM techniques were 13%, 35%, 23.5%, and 29%, respectively (Table 4).

Validation of data mining models
Regarding the AUROC values in the training stage, it is evident that SVM and ANN equally outperformed their counterparts by the respective values of 0.924 and 0.926, indicating a less desirable performance presented by GLM and ME with corresponding AUROC values of 0.763 and 0.74. The results of AUROC values obtained from the validation data for ME, GLM, ANN, and SVM models were 0.727, 0.751, 0.917, and 0.935, respectively. The latter stated that the SVM model with the AUROC value of 0.935 outperforms its counterparts in the Cherikabad Watershed ( Fig. 6). Outperformance of SVM has also been reported in some publications (e.g., Hipni et al. 2013;Pham et al. 2016;Chen et al. 2017c;Chen et al. 2018a), where most of its success is known to be related to its robust pattern recognition algorithm. Antithetically, ANNs are more prone to get stuck in local minima in pursuit of nding the global minimum, winding up with a less optimal solution where the most of training problems and generalizations occurred (Chen et al. 2017b). MaxEnt can also be deprived of good differentiation between presence and absence patterns since it only operates on presence location of landslides (Kornejady et al. 2017).
Factor importance analysis According to Fig. 7 and the Jackknife test, three factors, namely altitude, annual mean rainfall, and distance from the village, were ranked as the main contributing variables in the modeling process of landslide susceptibility and hazard in the Cherikabad Watershed. The latter aligns with the observed evidences during eld surveys.

Conclusion
In this work, the SVM and ANN models showed outstanding performances in terms of high goodness-of-t and prediction power, outperforming their statistical counterparts. According to the literature review, the ANN can roughly learn and optimize the corresponding learning parameters under limited landslide locations. The latter can lead to a model with somewhat less training skill, which can accordingly wind up in low prediction power and generalization capacity. Among different data mining/machine learning techniques, the SVM model is known to be underpinned by a robust pattern recognition algorithm, which is formulated through the process of setting maximized margins between the support vectors. One of the most important advantages of the SVM in comparison to the ANN, it is that the ANN has some problems with local minima, whereas the SVM solves this subject with a unique power.
On the other hand, the ANNs often end up in the best local solution (local minima), which is not the nal best solution (global minima). Regarding the pitfalls encountered by the MaxEnt algorithms, the main issue regards its presence-only modeling feature. Despite the merits of presence-only modeling schemes such as avoiding to presume the prevalence of the phenomenon, it is noteworthy that MaxEnt inevitably extracts 10,000 pseudo-absence locations to distinguish the presences from absences. Although such a process enables the model to better differentiate the landslide occurrence pattern across the study area, the pseudo-absence locations are randomly selected, and there is a reasonable chance that they may have been selected from areas that have a high susceptibility to landsliding, but the phenomenon is not morphologically emerged due to the lack of triggering factors and the overweening of stability-favoring predisposing factors in the factors' interaction pool. Moreover, it is known that solving nonlinear relationships among the factors by MaxEnt is somewhat problematic since it operates under some limited quanti cation equations called features such as linear, quadratic, product, threshold, and hinge, while the SVM and ANN better handle nonlinear relationships.   Flow chart of the overall methodology.

Figure 3
Landslide conditioning factors used for LSM in the study area: (a) Elevation, (b) Distance to streams, (c) Rainfall (d) slope aspect, (e) Distance to the village, (f) slope degree (g) Distance to faults, (h) Lithological formation, (i) Land cover, (j) NDVI, (k) distance to roads.

Figure 4
The ROC curve of the LSM derived from Shannon entropy model in the Cherikabad Watershed (plotted by PMT).  ROC curves plotted for the implemented data mining models in the Cherikabad Watershed.

Figure 7
Kappa values presented in the jackknife histogram, identifying the most important factors affecting landslide occurrence in Cherikabad Watershed