A robust discretization method of factor screening for landslide susceptibility mapping using convolution neural network, random forest, and logistic regression models

ABSTRACT The discretization criteria and interval numbers of landslide-related environmental factors are generally not determined or filtered quantitatively, which results in uncertainties and limitations in the performance of machine learning (ML) methods for landslide susceptibility mapping (LSM). The aim of this study is to propose a robust discretization criterion (RDC) to quantify and explore the uncertainty and subjectivity of different discretization methods. The RDC consists of two steps: raw classification dataset generation and optimal dataset extraction. To evaluate the robustness of the proposed RDC method, Lushan County of Sichuan Province in China was chosen as the study area to generate the LSM based on three datasets (optimal dataset, original dataset with continuous values, and statistical dataset) using three popular ML methods, namely, convolution neural network, random forest, and logistic regression. The results show that the areas under the receiver operating characteristic curve (AUCs) of the optimal dataset for the abovementioned ML models are 0.963, 0.961, and 0.930, which are higher than those of the original dataset (0.938, 0.947, and 0.900) and statistical dataset (0.948, 0.954, and 0.897). In conclusion, the RDC method can extract more representative features from environmental factors and outperforms the other conventional discretization methods.


Introduction
Landslides are widespread geological disasters in mountainous areas and result in serious casualties and economic losses every year; approximately 17% of fatalities from natural hazards across the world can be attributed to landslides (Achour et al. 2021; Ma and Mei 2021; Petley 2012). It is necessary to predict where potential landslides will occur, as this can play a pivotal role in the risk prevention and mitigation of mountain disasters (Huang et al. 2020a; Dou et al. 2019; Li et al. 2020; San 2014; Van Westen, Van Asch, and Soeters 2006; Liu et al. 2022). As an indispensable component of landslide risk assessment and mitigation strategies, landslide susceptibility mapping (LSM) can effectively address the above issues based on the relationships between historical landslides and landslide-related environmental factors (Chang et al. 2022; Reichenbach et al. 2018; Corominas et al. 2014).
The modeling process of LSM generally comprises several steps, including acquisition of a recorded landslide inventory, consideration of landslide-related conditioning factors, preprocessing of environmental factors, selection of appropriate mapping units, and construction of LSM methods (Huang et al. 2020b; Huang et al. 2022). Hence, over the past several decades, quite a few studies have concentrated on the comparison of LSM results by considering different mapping units, distinctive landslide-related environmental factors, and improved modeling methods. Regarding improved modeling methods, there are increasing numbers of research reports on LSM models such as the convolutional neural network (CNN) (Panahi et al. 2021; San 2014; Wang, Fang, and Hong 2019a; Zhao et al. 2021), support vector machine (SVM) (Akinci and Zeybek 2021; Pham et al. 2019; San 2014), logistic regression (LR) (Abedini et al. 2019; Pham et al. 2021b; Phong et al. 2021), random forest (RF) (Mosavi et al. 2022; Achour and Pourghasemi 2020; Chen et al. 2018a; Chen et al. 2018b; Sun et al. 2020), frequency ratio method (FRM) (Wang et al. 2019b; Chang et al. 2020; Huang et al. 2022), and analytic hierarchy process (AHP) (Achour et al. 2017; Moragues et al. 2021; Panchal and Shrivastava 2020; Skilodimou et al. 2019). However, the discretization preprocessing of landslide-related environmental factors has not been fully explored or considered.
The discretization of landslide-related environmental factors with continuous values can be regarded as the primary step in extracting the classification characteristics of environmental factors and avoiding overfitting problems in modeling processes (Huang et al. 2020; Zhao et al. 2021). Most studies show that there is subjectivity and randomness in the selection of the discretization methods and interval number of landslide-related environmental factors, which will amplify or narrow feature differences in different categories and thus obviously affect the performance of LSM. For example, conventional unsupervised discretization methods, such as equal interval (EI), geometric interval (GI), quantile interval (QI), and natural breaks (NB), are subjectively selected to reclassify the environmental factors by user-defined interval numbers (Ba et al. 2017; Chang et al. 2020; Choi et al. 2012; Huang et al. 2018; Yan et al. 2019; Zhu et al. 2020). In addition, some studies have shown that environmental factors can be transformed into discrete values based on the statistical interval (SI) method, which fully considers the relationship between landslide distribution and environmental factors (Kurgan and Cios 2001; Tornyai and Lúchava 2015; Zhao et al. 2021). However, there is no objective standard to quantify which discretization methods and interval numbers should be considered, and user-defined discretization will result in biases and limitations in the performance of LSM (Huang et al. 2021). It is difficult to confirm whether LSM methods are adequately and effectively applied in LSM based on certain geographical environmental factors (Lin et al. 2021; Palenzuela Baena et al. 2019; Tian et al. 2009; Xiao et al. 2021). Therefore, a robust discretization method of factor screening should be proposed to quantify the uncertainty and subjectivity in the classification of environmental factors.
Based on the above analysis of discretization methods, the most appropriate interval numbers and discretization methods can be determined by screening the contribution of environmental factors with different discretization results to the occurrence of landslides based on statistical tools. However, most studies only apply statistical tools to measure the importance and interactions of landslide-related environmental factors (Yang et al. 2019; Rong et al. 2020; Sun et al. 2021). For applications of machine learning (ML) models for LSM, the optimal discretization methods for landslide-related environmental factors with continuous values have not been fully discussed.
In summary, based on the above issues, considering the complexity, diversity, and nonlinear characteristics of landslides, the uncertainty and subjectivity of discretization methods, and the spatial detection advantages of statistical tools, a robust discretization method of factor screening for LSM using machine learning methods is proposed in this work. To further validate the proposed method and confirm the productiveness of different machine learning methods, an original dataset (OD) with continuous values and a statistical dataset (SD) considering the landslide distribution were selected to participate in a controlled trial for LSM.

Study area
The study area is located in the northeastern part of Sichuan Province, China (Figure 1). It covers approximately 1,166.39 km² with an altitude varying from 557 m to 5,289 m. The study area has a mid-latitude inland subtropical humid climate. The average precipitation in this study area is 102 mm per year, and most precipitation occurs from May to September. It is situated in the transition zone between the southwest margin of the Sichuan Basin and the Qinghai-Tibet Plateau, and is composed of mountainous areas inclined from northwest to southeast (Xu, Xu, and Shyu 2015). On April 20, 2013, a 7.0-magnitude earthquake with a focal depth of 13 km occurred in Lushan, which caused 217 fatalities, injured 13,484 persons, and destroyed 193,000 houses. It was estimated that the total economic losses exceeded 50 billion RMB (Tang et al. 2015). Meanwhile, the strong shock caused many collapses, landslides, and debris flows. Moreover, the loose deposits formed by earthquakes provided abundant sources for landslide activities under the action of rainstorms (Zhao et al. 2021). Due to its distinctive erosional landforms, complex climatic conditions, and frequent earthquake activity, active landslides are prevalent in the study area. Figure 2 shows examples of rainfall-induced and earthquake-induced landslides in Lushan County.

Data used
In LSM, it is important to select abundant landslide inventory maps and landslide-related environmental factors (Hussin et al. 2016; Reichenbach et al. 2018). Landslide inventory maps play an irreplaceable role in the spatiotemporal expression of disasters in the study area. Accordingly, data on 353 historical landslides in 2015 and some recently occurring landslides in 2021 (Figure 7) were obtained from the Lushan Natural Resources and Planning Bureau. The attributes of landslide inventory maps are plentiful, including the name, time, type, volume, and coordinates of landslides. Basically, the interaction of landslide-related environmental factors has an obvious effect on the occurrence of landslides (Ba et al. 2017; Pan et al. 2008). From a geological point of view, lithology reflects the geotechnical properties of the landslide surface (Morales et al. 2021). As the slope gradient gradually increases, the stability of the slope gradually decreases (Tien Bui et al. 2016). The vegetation fractional coverage (VFC) reflects the ability of the ecosystem to adjust to rainstorms and protect the earth's surface (Cheng et al. 2018). Land use change establishes relationships with natural factors, which can cause slope instability and other disasters (Pham et al. 2021a). Rainfall is the main triggering factor of landslides, which causes a decline in slope stability and eventually leads to the occurrence of disasters (Morales et al. 2021; Gariano and Guzzetti 2016). Changes caused by human activities are equivalent to climate change in affecting the spatiotemporal occurrence of landslides (Crozier 2010). The action of earthquakes causes a rise in the groundwater level and a change in runoff conditions, which further creates the formation conditions of landslides (Rodríguez, Bommer, and Chandler 1999). Hence, twelve landslide-related environmental factors were selected from multisource datasets. Topography with a spatial resolution of 30 m, including the elevation, slope, and aspect, was generated from a digital elevation model (DEM). Geological data at a scale of 1:250,000, including faults and lithology, were obtained. The annual average rainfall was interpolated using 11 meteorological station values. A summary of the data used is shown in Table 1. In addition, to meet the above ML method requirements, balanced nonlandslides were randomly generated and validated by the buffer zone tool after excluding the influence of landslides (Zhao et al. 2021).

Discretization methods
Data discretization is defined as a process of transforming continuous values into a set of intervals. Compared with continuous values of landslide-related environmental factors, data discretization can meet the needs of models by reducing the time and space of model training and improving the noise resistance of models (Sun et al. 2021). Moreover, as a simplified model is generated using discretized data, the overfitting risk of some models, such as LR, is reduced (Liu et al. 2002). The general discretization methods are divided into supervised and unsupervised methods. Because the performance of supervised approaches mainly depends on the richness of the samples, unsupervised methods are easier to research and analyze for the classification of landslide-related environmental factors. NB, EI, QI, and GI are four commonly used discretization methods that achieve data discretization through user-defined cut points (Yan et al. 2019; Yang et al. 2021). The NB method is an unsupervised method that seeks to maximize the variance between intervals while minimizing the variance within them (Ba et al. 2017). The EI method equally divides the continuous values into multiple intervals according to the number of cut points, but it completely ignores the distributive characteristics of the data (Lin et al. 2021). The QI method places an equal number of samples in each interval, and the GI method creates geometric intervals by minimizing the sum of squares of the number of elements in each interval. To improve the reliability of the models, the SI method is added to the experiments for comparison based on the distribution of landslides and landslide-related environmental factors (Kurgan and Cios 2001; Tornyai and Lúchava 2015; Zhao et al. 2021).
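As a minimal sketch of how two of the unsupervised methods differ, the following code derives cut points for the EI and QI methods and assigns class indices. The function names and the slope values are hypothetical and are not taken from the study; NB, GI, and SI are not implemented here.

```python
from bisect import bisect_right


def equal_interval_edges(values, k):
    """Cut points that split [min, max] into k equal-width intervals (EI)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + width * i for i in range(1, k)]


def quantile_interval_edges(values, k):
    """Cut points that place roughly the same number of samples in each interval (QI)."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // k] for i in range(1, k)]


def assign_classes(values, edges):
    """Map each value to the index (0 .. k-1) of the interval it falls in."""
    return [bisect_right(edges, v) for v in values]


# Hypothetical slope angles (degrees), skewed toward gentle slopes.
slopes = [2.0, 3.5, 5.0, 8.0, 12.0, 15.0, 21.0, 30.0, 44.0, 60.0]
ei_classes = assign_classes(slopes, equal_interval_edges(slopes, 4))
qi_classes = assign_classes(slopes, quantile_interval_edges(slopes, 4))
```

On this skewed sample, EI places six of the ten values in the first interval, whereas QI spreads them almost evenly, which illustrates why the choice of method can change the class structure seen by a model.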

Statistical method
GeoDetector, a robust spatial statistical tool that was developed to detect spatial heterogeneity and assess the spatial correlation of landslides and conditioning factors, mainly includes four parts: a factor detector, an interaction detector, a risk detector, and an ecological detector (Wang et al. 2010). GeoDetector is widely used in health risk assessment, landslide susceptibility assessment (Luo and Liu 2018; Sun et al. 2021), and other fields. In this research, the factor detector is mainly applied to calculate the q-statistic, which represents the contribution of conditioning factors to the occurrence of landslides. The interactive q-statistic calculated by the interaction detector characterizes whether the interaction of environmental factors strengthens or weakens the occurrence of landslides (Luo et al. 2016; Wang et al. 2010). The equation of the q-statistic is:

$$q = 1 - \frac{\sum_{h=1}^{L} N_h \sigma_h^2}{N \sigma^2}$$

where $L$ is the number of categories of factor $x$; $N_h$ and $N$ represent the number of units in layer $h$ and in the whole area, respectively; and $\sigma^2$ and $\sigma_h^2$ indicate the total variance and the variance of layer $h$ for variable $y$. Specifically, a larger q-statistic value indicates a more consistent spatial distribution of landslides and conditioning factors and a stronger explanation of factors on the occurrence of landslides. In this research, the selection of the optimal discretization method is determined by using the q-statistic to measure the spatial correlation between landslides and landslide-related environmental factors.
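The factor-detector q-statistic can be sketched in a few lines. This is an illustrative implementation under stated assumptions rather than the GeoDetector software itself: the function name is hypothetical, `classes` holds the discretized factor value of each unit, `y` holds the landslide indicator of each unit, and population variances are used.

```python
from collections import defaultdict


def q_statistic(classes, y):
    """Return q = 1 - sum_h(N_h * var_h) / (N * var_total)."""
    n = len(y)
    mean = sum(y) / n
    total_var = sum((v - mean) ** 2 for v in y) / n
    if total_var == 0:
        raise ValueError("y has zero variance; q is undefined")

    # Group the y values by the class (layer h) of the factor.
    layers = defaultdict(list)
    for c, v in zip(classes, y):
        layers[c].append(v)

    # The sum of squared deviations within a layer equals N_h * var_h.
    within = 0.0
    for vals in layers.values():
        layer_mean = sum(vals) / len(vals)
        within += sum((v - layer_mean) ** 2 for v in vals)

    return 1.0 - within / (n * total_var)


# Classes that separate landslide (1) from nonlandslide (0) units perfectly.
q_perfect = q_statistic([0, 0, 1, 1], [0, 0, 1, 1])
# Classes that carry no information about y.
q_none = q_statistic([0, 1, 0, 1], [0, 0, 1, 1])
```

A discretization whose classes align with the landslide distribution drives the within-layer variance toward zero and q toward 1, while an uninformative discretization gives q close to 0.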

Convolutional neural network
CNN, a deep learning method, is a multilayer perceptron similar to an artificial neural network (Girshick 2015). The basic structure of a CNN includes an input layer, convolutional layers, pooling layers, fully connected layers, and an output layer (Hong, Pourghasemi, and Pourtaghi 2016; Ji et al. 2020). It can conduct implicit parallel learning and extract hierarchical feature representation without prior definition of the precise mathematical formulas between the input and output parameters. Due to its powerful learning ability and better generalization ability, it has recently been widely used in the fields of pattern recognition, semantic segmentation, and imagery classification (Meena et al. 2021; Prakash, Manconi, and Loew 2021; Sameen, Pradhan, and Lee 2020; Zhao et al. 2021). Therefore, a two-dimensional CNN model, namely CNN-2D, is proposed in this paper and applied in LSM. Its optimal parameters were designed and constructed with the Neural Network Intelligence tuning tool (NNI, https://nni.readthedocs.io/en/stable/). Accordingly, the conditioning factors were extended to two-dimensional data as input parameters, and each pixel of the image contained the corresponding landslide category (landslide or nonlandslide) and a 53 × 53 matrix composed of 12 landslide-related environmental factors. In this work, the first convolutional layer convolves the input data through sixteen 2 × 2 convolutional kernels to obtain a feature matrix with sixteen 53 × 53 layers (with padding applied), and a fully connected layer with 2,304 neurons was obtained through multiple-layer pooling and convolution to express the separated features that were extracted in the original layer. Finally, the tail of the model uses two neural units to represent the binary classification of landslides, namely landslide and nonlandslide. The network structure and the training parameters are shown in Figure 3.
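To make the layer sizes concrete, the following sketch traces feature-map shapes through one hypothetical layer stack. Only the 53 × 53 × 12 input, the sixteen kernels of the first convolution, and the 2,304-neuron fully connected layer come from the text; the later channel counts (32 and 64) and the three 2 × 2 pooling steps are assumptions chosen because they reproduce 2,304, and the actual architecture is the one given in Figure 3.

```python
def conv_same(size):
    """A padded ('same') convolution keeps the spatial size unchanged."""
    return size


def max_pool_2x2(size):
    """A 2 x 2 pooling step with stride 2 halves the size, rounding down."""
    return size // 2


size, channels = 53, 12  # input patch: 53 x 53 cells, 12 environmental factors
shapes = [(size, size, channels)]

# Hypothetical stack: three conv + pool stages; only the first (16 kernels)
# is stated in the text, the other two channel counts are assumed.
for out_channels in (16, 32, 64):
    size = conv_same(size)
    channels = out_channels
    shapes.append((size, size, channels))
    size = max_pool_2x2(size)
    shapes.append((size, size, channels))

flattened = size * size * channels  # input width of the fully connected layer
```

Under these assumptions the spatial size shrinks from 53 to 26, 13, and 6, so the flattened feature vector has 6 × 6 × 64 = 2,304 entries.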

Random forest
RF is an intelligent combinatorial classification algorithm based on statistical learning theory, which was proposed by Breiman (2001). It mainly uses the bootstrap resampling method to extract multiple samples from the original data and constructs a classification tree for each bootstrap sample. Then, the result is obtained by combined voting on the prediction of the classification trees (Sun et al. 2020). RF has been applied to many fields, such as flood risk assessment and landslide susceptibility assessment (Chen et al. 2017a; Hong et al. 2019; Sun et al. 2021), due to its high tolerance and factor detection ability. Therefore, RF was selected as one of the ML methods to verify the highlights of this study.
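The bootstrap-and-vote idea can be illustrated with a deliberately simplified sketch. It bags one-feature threshold rules (decision stumps) instead of full classification trees and omits the random feature selection of a true RF, so it is not the configuration used in this study; all names and the toy data are hypothetical.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Pick the threshold and label orientation with the fewest training errors."""
    best = None
    for t in sorted(set(xs)):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= t else right) != y for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    _, t, left, right = best
    return lambda x: left if x <= t else right


def fit_bagged_stumps(xs, ys, n_trees=25, seed=0):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees


def predict(trees, x):
    """Combine the individual predictions by majority vote."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


# Toy data: low factor values are nonlandslide (0), high values are landslide (1).
xs = [1, 2, 3, 4, 10, 11, 12, 13]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
trees = fit_bagged_stumps(xs, ys)
```

Each bootstrap sample yields a slightly different threshold, and the vote averages out this variation, which is the mechanism behind the tolerance of RF noted above.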
In this study, the RF is formed of two trees (landslide and nonlandslide), and each tree comprises twelve random features. The generalization error of the RF algorithm can be expressed as follows:

Logistic regression
LR, a common multivariate statistical analysis method, has been widely applied to interpret nonlinear relationships of natural phenomena (García-Rodríguez et al. 2008; Martinović, Gavin, and Reale 2016; Wang et al. 2020; Wang, Sawada, and Moriguchi 2013). Some studies found that LR performed better than SVM, likelihood ratio methods, and classification trees (Akgün and Bulut 2007; Lee and Sambath 2006; Oh et al. 2009). Due to its advantage of simplification, LR has been widely used in LSM to describe the multivariate regression relationship between landslides and environmental factors (Budimir, Atkinson, and Lewis 2015; Jiang et al. 2017). The equation of the LR method can be expressed as follows:

$$Z = b_0 + \sum_{i=1}^{n} b_i x_i$$

where $Z$ is the dependent variable (landslide or nonlandslide), $b_0$ denotes the intercept of the model, $n$ is the number of landslide-related environmental factors, $b_i$ is the regression coefficient for each landslide-related environmental factor, and $x_i$ represents the twelve landslide-related environmental factors.
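A minimal sketch of how the LR model turns factor values into a susceptibility value is given below. It assumes the standard logistic link, P = 1 / (1 + e^(-Z)), which is the usual form of LR but is not written out in the text; the coefficients and factor values are invented for illustration.

```python
import math


def linear_predictor(b0, coefficients, x):
    """Z = b0 + sum_i b_i * x_i for one mapping unit."""
    return b0 + sum(b * xi for b, xi in zip(coefficients, x))


def landslide_probability(z):
    """Standard logistic link (assumed here): maps Z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical intercept, coefficients, and two factor values.
z = linear_predictor(-1.0, [0.5, 2.0], [2.0, 0.0])
p = landslide_probability(z)
```

With these invented numbers Z is 0, so the predicted probability is exactly 0.5; positive coefficients raise the probability as the corresponding factor value increases.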

Modeling process
The purpose of this study was to develop a robust discretization method of factor filtering and explore the effects of various discretization methods on the performance of ML methods for LSM. To validate the hypothesis, the flowchart used in this study is presented in Figure 4 and is composed of three phases: (1) twelve conditioning factors were processed as five sample datasets by five discretization methods; (2) landslide-related environmental factors were screened with optimal classification methods, which were selected as a new optimal dataset (ND), based on the above sample data which were processed by the q-statistic; and (3) three mature ML methods were selected to generate a more accurate LSM with three frequently-used datasets.

Sample design
It is important to select appropriate landslide-related environmental factors and mapping units (Pourghasemi et al. 2018). Although machine learning models based on slope units have stronger practical application than those based on grid units (Chang et al. 2022), the main disadvantage of the slope unit is that the same probability of landslide occurrence is assigned to the entire slope unit (Huabin et al. 2005). Moreover, calculation of grid units is more convenient than that of slope units. Therefore, we selected the 30 m resolution grid cell as the mapping unit in this study. According to the geological conditions of zonation, previous literature, and field investigation (Sajedi-Hosseini et al. 2018; Intrieri and Gigli 2016), twelve conditioning factors (Table 1) were derived from data on topography, geology, hydrology, human activities, land cover, and climate (Lombardo et al. 2020). Furthermore, the EI, GI, QI, and NB methods were used to divide the landslide-related environmental factors with continuous values into discrete values according to interval numbers (2-8), yielding multiple datasets that serve as input for selecting the optimal discretization methods. In addition, it is worth emphasizing that the SI method generates only one classification dataset because it relies on the subjective experience of experts, whose discretization considers the relationships between landslides and environmental factors. To ensure the equalization of samples, each experimental dataset was divided 7:3 by using stratified random sampling, of which 70% was used for model training and 30% was used for model validation (Mosavi et al. 2020).
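The 7:3 stratified random split can be sketched as follows. The function name, the seed, and the toy labels are hypothetical; the point is only that the split is performed separately within the landslide and nonlandslide classes so that both subsets keep the same class balance.

```python
import random


def stratified_split(labels, train_fraction=0.7, seed=42):
    """Return (train_indices, validation_indices), split within each class."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)

    train, valid = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        valid.extend(indices[cut:])
    return sorted(train), sorted(valid)


# Toy balanced sample: 10 landslide (1) and 10 nonlandslide (0) cells.
labels = [1] * 10 + [0] * 10
train_idx, valid_idx = stratified_split(labels)
```

Here each class contributes 7 cells to training and 3 to validation, so the balanced ratio of landslides to nonlandslides is preserved in both subsets.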

Statistical processing
GeoDetector was used to calculate the q-statistic and interactive q-statistic values to evaluate the effectiveness of the different discretization methods. These values reflect the contribution of the differently classified landslide-related environmental factors to the occurrence of landslides and the interactive relationships among the optimal conditioning factors, respectively. Consequently, we compared the q-statistic values for each influencing factor based on five discretization methods with settings of seven interval numbers (2-8). For each conditioning factor, the number of intervals and the discretization method that yield the highest q-statistic and interactive q-statistic values were taken as optimal, as they characterize the strongest spatial correlation of that factor. The optimal classification combinations of the twelve landslide-related environmental factors were identified as ND.
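The screening step amounts to a search over (method, interval number) pairs for each factor. The sketch below assumes the q-statistic values have already been computed and simply picks the best pair; the function name and most table entries are invented for illustration, and only the two winning entries (QI with six intervals at 0.530 for distance to a road, NB with seven intervals at 0.293 for slope) follow values reported in this paper.

```python
def select_optimal(q_table):
    """q_table maps (method, interval_number) -> q-statistic for one factor."""
    (method, k), q = max(q_table.items(), key=lambda item: item[1])
    return method, k, q


# Hypothetical q-statistics for two factors; non-winning values are invented.
q_by_factor = {
    "distance_to_road": {("EI", 8): 0.47, ("QI", 6): 0.530, ("NB", 7): 0.51, ("GI", 5): 0.44},
    "slope": {("EI", 4): 0.21, ("QI", 6): 0.25, ("NB", 7): 0.293, ("GI", 3): 0.19},
}

# The ND dataset keeps, for every factor, its own best discretization.
optimal = {factor: select_optimal(table) for factor, table in q_by_factor.items()}
```

Because the selection is made per factor, different factors can end up with different methods and interval numbers, which is what distinguishes ND from a dataset discretized with a single user-chosen method.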

LSM with three methods
To verify the performance of ML methods based on the proposed ND and explore the effects of different datasets on the representation of ML methods, LSM was predicted by the CNN, RF, and LR methods based on two common datasets, including the OD (original data with continuous values) and SD (statistical data combined with the distribution of landslides and landslide-related environmental factors). Then, the performance of the proposed discretization method was compared and assessed by related indicators, such as statistical indicators, landslide susceptibility maps, and the kernel density. To avoid the contingency of the experimental results, tenfold cross validation was adopted in all experiments.

Results

Quantification processes of conditioning factors for discretization
The optimal discretization results of landslide-related environmental factors contribute to improving the performance of models and reducing overfitting problems in model processes (Liu et al. 2002; Dougherty, Kohavi, and Sahami 1995). The quantification processes of conditioning factor discretization are shown in Figure 5. First, the q-statistic value of each conditioning factor differs across the discretization methods and interval numbers (2-8). This means that subjective and random discretization methods may lead to biases and limitations of the model. For example, the q-statistic value of the distance to a road using the EI method increases with the number of intervals. However, the q-statistic value of the slope using the same method initially increases, then decreases, and then increases again. In addition, the best q-statistic values of distance to a road, elevation, slope, and distance to an epicenter were 0.530, 0.419, 0.293, and 0.186 for the five discretization methods for different interval numbers (2-8), and the optimal discretization methods were QI with six intervals, SI and NB with seven intervals, and EI with eight intervals, respectively. Therefore, it is necessary to quantify the optimal discretization result for each conditioning factor rather than rely on a user-defined discretization process, which depends on the user's interpretation and understanding of the knowledge-level representation.

Analysis of the optimal ND dataset for statistical tools
After the discretization of each conditioning factor, the best combination of ND with twelve conditioning factors is shown in Table 2. Among the twelve landslide-related environmental factors, the optimal discretization method of each conditioning factor differs, including the QI method with 6 and 7 intervals, SI and NB methods with 7 and 8 intervals, and EI method with 6 and 8 intervals. In addition, most of them presented significant effects on the occurrence of landslides, except for aspect (p > 0.05). Moreover, the distance to a road revealed the highest determinant power, followed by elevation, lithology, slope, population density, rainfall, and distance to an epicenter, each of which contributed more than 18% to the occurrence of landslides. Therefore, the discretization results of conditioning factors will limit the performance of models when the same discretization method is applied to all landslide-related environmental factors. Meanwhile, it is important to explore the direct influence of different discretization methods on the original dataset and the indirect performance of different classified datasets (SD, ND, and OD) on the machine learning methods.
Interactive detection measures the interaction and importance among the landslide-related environmental factors. Figure 6 highlights the interactive coefficients for the combinations of conditioning factors. It is worth emphasizing that the interaction between aspect and land use is less than 0.1, and the interactions between the distance to a road and other environmental factors (elevation, rainfall, etc.) are generally more than 0.5. This reflects that the distance to a road has a significant positive effect on the occurrence of landslides.

Landslide susceptibility mapping
The landslide susceptibility indices (LSIs) in the whole study area were calculated under the nine conditions of three machine learning methods (i.e., CNN, RF, and LR) and three datasets (i.e., ND, SD, and OD). Subsequently, the nine types of LSIs were categorized into five groups by the natural breaks method (Huang et al. 2020a), including very low, low, moderate, high, and very high. In this study, the nine LSMs (ND-LR, ND-RF, ND-CNN, SD-LR, SD-RF, SD-CNN, OD-LR, OD-RF, and OD-CNN) were generated under the conditions of combining three datasets and three machine learning methods, as shown in Figure 7. These LSMs indicate that the very high and high susceptibility regions are mainly concentrated on both sides of roads. The distance to a road and elevation are two significant environmental factors. Moreover, in terms of the spatial distribution of the five susceptibility levels, the CNN model demonstrates better performance in LSM, followed by RF and LR. Additionally, regardless of which machine learning method was adopted, the differences in the LSMs (ND-LR, ND-RF, and ND-CNN) generated by the ND dataset were minimal compared with other LSMs (SD-LR, SD-RF, SD-CNN, OD-LR, OD-RF, and OD-CNN) constructed by the corresponding machine learning methods and other datasets, which shows the higher stability and better performance of the ND dataset compared with the SD and OD datasets, especially in the LR approach. Therefore, the optimal discretization method contributes to extracting more representative features and improving the robustness of models. In addition, recently occurring landslides obtained from a field survey are consistent with the LSM performance and are mainly distributed at the very high and high susceptibility levels. Figure 8 is a statistical figure depicting the number of landslides in each susceptibility level for the nine LSMs.
Figure 9 shows the ratio of zonation in each susceptibility level for grid cells. It must be emphasized that regardless of which machine learning method was used, the number of historical landslides in very high susceptibility regions predicted using the ND dataset is higher than that of the other datasets (i.e., SD and OD), at 258, 279, and 231, respectively. This means that the accuracy of the LSM generated by the ND dataset is better than that of the other two datasets. In addition, 231, 109, and 203 historical landslides were predicted in very high susceptibility regions by the LR model (ND-LR, SD-LR, and OD-LR, respectively). The LR model based on the OD performed poorly, with an extremely low proportion of very low and low susceptibility regions (Figure 9 (c)). The performance of the LR model based on the OD displays an extremely unstable trend (Figure 9 (c)). In contrast, the CNN model shows a low dependency on the sample data regardless of which dataset was selected (Figure 9 (a)). Therefore, the extremely unstable trend shows that it is necessary to quantify the discretization process among environmental factors.

Model accuracy evaluation
To measure the performance of the proposed method, four evaluation metrics, namely, the area under the receiver operating characteristic (ROC) curve (AUC), precision, recall, and F-measure (F1), were adopted to assess the ability of the nine models under different calculation conditions (Tehrany et al. 2018; Choubin et al. 2019). Table 3 shows the statistical results for the above metrics. Figure 10 reveals the AUC values of the nine models. From the perspective of the ND dataset, the results arranged in order of AUC are ND-CNN > SD-CNN > OD-CNN, ND-RF > SD-RF > OD-RF, and ND-LR > SD-LR > OD-LR. Compared with other datasets (SD and OD), the ND dataset with the optimal knowledge-level representation has the best performance among the same machine learning methods, the AUCs of which were 0.963, 0.961, and 0.930 for the CNN, RF, and LR methods, respectively. From the perspective of model comparison, the LR models under conditions of different discretization results showed an obvious difference in the overall metrics, with AUCs of 0.930, 0.897, and 0.900, indicating that the LR model highly depends on the knowledge-level of data representation.
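For reference, the four metrics can be computed as in the following sketch. The function names and the toy labels and scores are hypothetical; the AUC is computed through its rank interpretation (the probability that a randomly chosen landslide cell receives a higher score than a randomly chosen nonlandslide cell, with ties counted as one half), which is equivalent to the area under the ROC curve.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the landslide class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def auc(y_true, scores):
    """Fraction of (landslide, nonlandslide) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: two landslide and two nonlandslide cells.
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = precision_recall_f1(y_true, y_pred)
area = auc(y_true, scores)
```

Precision, recall, and F1 depend on the 0.5 threshold used to binarize the scores, whereas the AUC is threshold-free, which is why the AUC is the main basis for comparing the nine models.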

Distribution feature analysis of LSIs under each model
The performance of LSM can be comprehensively measured by the distribution features, mean values, and standard deviations of the LSIs (Huang et al. 2020a). The mean values and standard deviations of LSIs reflect the centralization trend and dispersion degree of the LSIs. The LSIs were calculated under the three models with the three datasets (Figure 11). In terms of standard deviation performance, the nine models can be ranked as ND-CNN > OD-CNN > SD-CNN, ND-RF > SD-RF > OD-RF, and ND-LR > OD-LR > SD-LR. High standard deviations suggest that the corresponding LSM model has a higher distinguishing ability across the study area (Huang et al. 2020b). This is consistent with the performance of the corresponding AUC values. Therefore, regardless of which machine learning models we adopted, the performance of the ND dataset with a higher knowledge-level was better than that of the other two datasets. Regardless of which dataset we applied, the mean values and standard deviation values of the CNN were also better than those of the RF and LR methods.

Discussion
In recent years, many types of machine learning models have been developed, and their improvements are mainly concentrated on the acquisition of environmental factors, representation of mapping units, and modification of model structures (Huang et al. 2022). However, the discretization processes of environmental factors are still debated and underexplored (Huang et al. 2020a). User-defined discretization results are usually influenced by subjectivity and randomness. These biases and limitations in the discretization results influence the stability and accuracy of the models (Hong et al. 2019; Yang et al. 2019). Therefore, it is necessary to improve discretization methods to quantify the optimal discretization degree and extract the optimal knowledge-level representation for each environmental factor (Liu et al. 2002). In this study, a robust discretization method was introduced to quantitatively measure and filter the optimal discretization method with the most appropriate interval numbers, and the effectiveness of the proposed method was validated by three machine learning methods.
First, the optimal discretization result may differ for each conditioning factor, and this should be taken into account. In previous studies, although some researchers have found that models with categorical variables show an excellent fit to the training data compared with continuous variables (Zhao et al. 2019), they mainly focus on selecting the same discretization method for some environmental factors (Huang et al. 2021) or subjectively reclassify the environmental factors based on expert experience (Zhao et al. 2021). However, standard guidelines for the quantification of environmental factors are still absent. In this study, a statistical tool was selected to analyze landslide-related environmental factors, which showed a significant difference in assessing the contributions of conditioning factors to the occurrence of landslides (Table 2 and Figure 5) (Hong, Pourghasemi, and Pourtaghi 2016; Liu et al. 2021). There is clear evidence that the optimal discretization results of different environmental factors differ (Figure 5 and Table 2). For example, the distance to a road and elevation were classified as discrete data using the QI and SI methods, respectively. Therefore, model performance may rise or fall unpredictably when discretization methods and cut points are selected subjectively or arbitrarily.
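The idea of screening discretization results per factor can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the standard GeoDetector factor-detector statistic, q = 1 - Σ N_h σ_h² / (N σ²), and scores each candidate method and interval number by its q value. The helper names, the two candidate methods (`equal_interval` and `quantile`), and the toy data are all hypothetical; they are not necessarily the QI and SI methods referred to above.

```python
from statistics import pvariance

def q_statistic(strata, y):
    """Factor-detector q: 1 - sum(N_h * var_h) / (N * var), computed over strata."""
    total_var = pvariance(y)
    if total_var == 0:
        return 0.0  # no variance to explain
    groups = {}
    for label, value in zip(strata, y):
        groups.setdefault(label, []).append(value)
    within = sum(len(g) * pvariance(g) for g in groups.values())
    return 1.0 - within / (len(y) * total_var)

def equal_interval(x, k):
    """Assign each value to one of k equal-width bins (hypothetical candidate)."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / k or 1.0  # guard against a constant factor
    return [min(int((v - lo) / width), k - 1) for v in x]

def quantile(x, k):
    """Assign each value to one of k equal-count bins by rank (hypothetical candidate)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    labels = [0] * len(x)
    for rank, i in enumerate(order):
        labels[i] = rank * k // len(x)
    return labels

def best_discretization(x, y, methods, interval_numbers):
    """Return (q, method_name, k) for the candidate with the highest q."""
    scored = [(q_statistic(fn(x, k), y), name, k)
              for name, fn in methods.items() for k in interval_numbers]
    return max(scored)

# Toy factor values and landslide labels (1 = landslide), illustrative only.
x = [1, 2, 3, 4, 10, 11, 12, 50]
y = [0, 0, 0, 0, 1, 1, 1, 1]
methods = {"equal_interval": equal_interval, "quantile": quantile}
best = best_discretization(x, y, methods, interval_numbers=(2, 3, 4))
print(best)
```

With this skewed toy factor, the equal-width bins are dominated by the single large value, so the rank-based bins separate landslide and non-landslide units better and receive the higher q. A different factor could favor a different method, which is the point made above.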
Second, a robust discretization method can improve the adaptability and robustness of machine learning models, especially the LR model. In terms of statistical metrics (Table 3, Figures 10 and 11), the models trained on the ND dataset performed better than the corresponding models trained on the OD and SD datasets. In addition, the performance differences among the ND-LR, ND-RF, and ND-CNN models were smaller than those among the other models, indicating that the improved discretization method can capture more representative features to train the corresponding models. Meanwhile, the LR models show obvious differences across the different discretization results. This means that the CNN and RF models can better handle both categorical and continuous data compared with the LR model (Chen et al. 2017b; Ließ, Glaser, and Huwe 2012). This research suggests that the classification of factors under different discretization criteria should be considered to maximize their spatial information and the differences between categories, which can improve the accuracy of models (Lombardo, Opitz, and Huser 2018) and reduce the risk of model overfitting (Ließ, Glaser, and Huwe 2012).
Third, input parameters and model structures are equally important for the performance of models in LSM (Sun et al. 2021). For the RF and LR models, the dataset with the best LSM performance is the ND dataset, followed by the SD and OD datasets, and the LSM results using the ND dataset have a higher accuracy and a better distribution pattern (Table 3 and Figure 11). It should be noted that, regardless of which dataset was selected, the histogram distribution showed no significant changes for the CNN model (Figure 11). The primary reason for this phenomenon might be that the CNN model is highly adaptive and is thus conducive to mining the spatial characteristics of the original data, which compensates for deficiencies in the data to some extent (Zhao et al. 2021). Although the differences among discretization methods might be offset by improving the structure of the models, the performance of the models would be more convincing when the landslide-related environmental factors are more clearly differentiated in the high-dimensional feature space (Chang et al. 2022; Chen et al. 2018b; Liu et al. 2021).
Finally, although this research contributes to improving the accuracy of LSM predicted by machine learning methods using a robust factor discretization method, limitations remain. On the one hand, it must be emphasized that GeoDetector requires all landslide-related environmental factors to be classified in advance, which limits the operational efficiency of the models to some extent. On the other hand, a few unsupervised and supervised methods with the ability to mine more spatial information should also be considered when abundant samples are available (Chang et al. 2020; Huang et al. 2020b). In addition, the neighborhood characteristics of landslides should be considered in the spatial knowledge representation (Huang et al. 2022). Accordingly, the capture, representation, and propagation of spatial information are closely related to the performance of ML models in geospatial fields. In future research, we will evaluate more classification methods for landslide-related environmental factors to identify additional spatial characteristics of factors and improve the performance of LSM.

Conclusions
The novelty of this study is the proposal of a robust discretization method that quantitatively measures and screens the optimal discretization results of landslide-related environmental factors based on spatial statistics, and thereby improves the stability and adaptability of machine learning methods in LSM. The experimental results show that the proposed approach is an effective technique for improving the performance of models compared to the other methods.
In conclusion, the proposed robust discretization method can extract more comprehensive interclass features from each environmental factor than user-defined discretization methods, which are more subjective and random. The LR, RF, and CNN models with the improved ND dataset have stronger practical applicability than the models with other user-defined discretization datasets. Therefore, the presented methodology could be applied in machine learning models for LSM or similar scenarios.
Additionally, although models based on the improved spatial dataset offer broader applicability, better interpretability, and greater robustness in LSM compared to other methods, the extraction of the knowledge-level of each environmental factor, the selection of optimal discretization metrics, and the representation of spatial information are still insufficient and will be improved in future research.

Figure 1. Location and historical landslide distribution of the study area.
where x and y are landslide-related environmental factors that indicate the probability over the x and y space, mg is the margin function, and I(*) is the indicator function (Breiman 2001; Chen et al. 2018b).

Figure 4. Flowchart used in this study.

Figure 6. Interactive coefficients of factors on the occurrence of landslides.

Figure 8. Number of historical landslides in five susceptibility levels for nine LSMs.

Figure 9. Ratio of zonation for five susceptibility levels. The numbers 1-3 in the superscripts refer to the position in the pie chart (i.e. 3: outermost pie chart; 2: middle pie chart; 1: innermost pie chart). (a) LSMs predicted by the CNN; (b) LSMs predicted by the RF; (c) LSMs predicted by the LR.

Figure 10. AUC values of three datasets predicted by three machine learning models.

Figure 11. LSI distribution features of the nine models.

Table 1. Summary of data used in this study.

Table 2. The selection of discretization methods and q-statistics.

Table 3. Statistical indicators for different methods.