Landslide Susceptibility Mapping in Guangdong Province, China, Using Random Forest Model and Considering Sample Type and Balance

Abstract: Landslides pose a serious threat to human lives and property. Accurate landslide susceptibility mapping (LSM) is crucial for sustainable development. Machine learning has recently become an important means of LSM. However, the accuracy of machine learning models is limited by the heterogeneity of environmental factors and the imbalance of samples, especially for large-scale LSM. To address these problems, we created an improved random forest (RF)-based LSM model and applied it to Guangdong Province, China. First, the RF-based LSM model was constructed using rainfall-induced landslide samples and 13 environmental factors and by exploring the optimal positive-to-negative and training-to-test sample ratios. Second, the performance of the RF-based LSM model was evaluated and compared with three other machine learning models. The results indicate that: (1) the proposed RF-based model has the best performance with the highest area under the curve (AUC) of 0.9145, based on optimal positive-to-negative and training-to-test sample ratios of 1:1 and 8:2, respectively; (2) the introduction of rainfall and global human modification (GHM) can increase the AUC from 0.8808 to 0.9145; and (3) rainfall and topography are two dominant factors in Guangdong landslides. These findings can facilitate landslide risk prevention and serve as a technical reference for large-scale accurate LSM.


Introduction
As one of the main types of geological hazards, landslides refer to the sliding of rock and soil masses along a slope. They are generally influenced by both natural factors and human interference and are characterized by wide distribution and frequent occurrence [1]. Landslide disasters in China have caused considerable casualties and property damage due to their sudden and unpredictable nature [2][3][4]. During the period from 2002 to 2014, a total of 246,768 landslides occurred in China, which led to the injury or death of 17,296 people and direct economic losses of CNY 58.75 billion [5]. It is evident that landslide disasters have imposed constraints on sustainable development [6,7]. The International Consortium on Landslides (ICL) has advocated for harmonizing regulations of the Sendai Landslide Partnerships 2015-2025 and the 2030 Agenda Sustainable Development Goals, among others [7]. Understanding where landslides can occur is, therefore, of great importance for reducing casualties, addressing ecological and economic losses, and promoting sustainable development [8]. This understanding generally includes two main aspects: landslide these variables may reflect one or a few aspects of human activity, this is not sufficient to reflect the comprehensive effect of human activities on landslides [46,47]. The third is to divide the study area into multiple regions using the clustering analysis method based on the distribution of landslides in the study area and the similarity of environmental factors and to use the category attribute of each region as one of the input variables of the machine learning model [16,41]. The introduction of more variables, however, may lead to a decrease in the generalization ability in the clustering analysis method [48]. In addition, it is difficult to quantitatively evaluate the clustering result, which may further increase the uncertainty of the LSM model [48].
Moreover, machine-learning-based LSM studies are often affected by evident sample imbalances. Yu et al. [5] indicated that the sample imbalance issue may affect the reliability of the LSM. Reichenbach et al. [18] also indicated that there are far more negative samples (i.e., non-landslide samples) than positive samples (i.e., landslide samples) in LSM studies, and the imbalance between positive and negative samples may affect the accuracy of the LSM. In particular, the selection of the training-to-test sample ratio is critical to the results of the machine-learning-based LSM models [18]. The training-to-test sample ratios of 7:3 [14,44] and 8:2 [49,50] are generally applied in existing LSM studies, while the performance of different ratios in constructing LSM models is seldom explored. In general, the selection of the positive-to-negative sample ratio and the training-to-test sample ratio affects the accuracy of LSM models. Most LSM studies have to rely on a small amount of sample data [14,51] and, hence, suffer from the problem of sample imbalance [18]; this is because it is difficult and expensive to obtain adequate landslide samples and their genesis types on a large scale in a tight timeframe [14,18]. Although some landslide inventories are available at national and global scales, they normally consist of major landslide events only. Therefore, making full use of the limited sample data and setting reasonable ratios of landslide samples are crucial for constructing LSM models.
In view of the above problems, this study proposes an improved RF-based model by considering sample type and balance for LSM in Guangdong Province. The proposed method aims to improve the existing LSM methods in terms of three aspects: (1) improving LSM accuracy by considering the comprehensive impact of multiple environmental factors, such as topography, geography, land cover type, rainfall, and human activities; (2) reducing the error caused by multiple types of landslide samples by using rainfall-induced landslide samples only as the model input; (3) mitigating the adverse effects of sample imbalance by optimizing the positive-to-negative sample ratio and training-to-test sample ratio of the LSM model based on quantitative analysis.

Study Area
The study area is Guangdong Province (20°13′ N~25°31′ N, 109°39′ E~117°19′ E) (Figure 1), located in Southern China, with a land area of 178,000 km². The area is made up of 31.4% mountains, 25.61% hills, 20.26% tablelands, and 22.73% plains and valleys [52]. The intrusive rocks dominated by granite account for one-third of the total area of Guangdong Province, while migmatite and gneiss also exist widely [45,52]. Guangdong has a subtropical monsoon climate, with an average annual rainfall of 1758.8 mm, and is characterized by perennially mild and humid weather. Under the pressure of abundant rainfall and frequent typhoon landfalls, the probability of landslides in Guangdong Province is high [2,45,51]. Considering that Guangdong has a high population density and economic activity intensity, the multiple effects of geography, economy, and demographics in Guangdong create conditions susceptible to the occurrence of landslide disasters [4]. Due to the lack of data on the environmental factors of some islands (e.g., Dongsha Islands), this study mainly investigates the landslide susceptibility of the major land areas in Guangdong Province.

Landslide Inventory
Landslide inventory refers to a detailed register of the spatial distribution, geometry, and attributes of landslides [9]. In this study, two classes of landslide inventory were used: (1) Class I: data of 335 historical landslide points in Guangdong Province from 2015 to 2019, which were collected from geological disaster reports published by the Department of Natural Resources of Guangdong Province. These reports provide information on the occurrence time, detailed location, coordinates, primary causal factor (rainfall-induced or human activity-induced), and geological hazard type (rock fall, landslide, ground collapse, and debris flow). Among these landslide events, 254 belong to the rainfall-induced type; the remainder were triggered by both rainfall and slope excavation and, thus, belong to the mixed type of rainfall-induced and human activity-induced landslides. Overall, the Class I landslide data are mainly rainfall-induced, with some events also affected by human activity. (2) Class II: data of 586 landslide polygons in Longchuan County and Zijin County, Heyuan City in Guangdong Province. The Class II landslide data were obtained through a two-step procedure, which included visual interpretation of the GaoFen-2 images in 2021 and verification based on field investigation. This indicates that the Class II landslide data have a high accuracy and can be used as ground truth data in the region. The Class II landslide data used in this study are of mixed types, covering all landslide events detected in these two counties; however, their triggering categories are not specified. In this study, the Class I landslide data were used as the landslide samples for training and testing the RF-based model for LSM, while the Class II landslide data were employed to validate the effectiveness of the RF-based model for multiple types of landslides; there is no overlap between the two classes of landslide data.

Environmental Factors
Considering the regional characteristics of Guangdong Province and the main type of landslide samples (i.e., Class I landslide data), 13 environmental factors were selected, which can be categorized into topography, geology, land cover, meteorology, hydrology, and human activities. Detailed information and data sources of these factors are listed in Table 1. These environmental factors were resampled into 1 km spatial resolution grid cells using the WGS1984 coordinate system; the spatial distributions of these factors in Guangdong Province are shown in Figure 2. To ensure the consistency of the grid unit, 287 grids were labeled as landslide samples, with one or more historical landslide points falling into their grid area.
Sustainability 2023, 15, 9024

Topographic data
The elevation, slope, aspect, profile curvature, and plane curvature were extracted from the 30 m spatial resolution ASTER global digital elevation model (GDEM) V2 version data after resampling (data year 2011). The ASTER GDEM V2 version was adopted in this study because it has been proven to be more accurate than the V1 version and has been more widely used and verified than the V3 version [55]. Topographic data are the most commonly used factor in LSM studies [18]. Landslides can occur when elevation and slope exhibit certain conditions [18,48]; a higher curvature means that the slope has a stronger capacity for water accumulation and is more prone to landslides [34,48]; the slope aspect regulates sunlight, hydrological elements, and wind direction, which affect slope stability [48].

Geological data
Landslides are related to geological conditions [18]. They primarily occur in areas with a weathered soil layer on the bedrock surface [51] as well as with tectonic activity [18]. Thus, the geological data of the study area mainly include lithology and distance to fault, which were derived from the geological vector map of Guangdong Province provided by the National Geological Archive (NGA). The lithology factor was reclassified into the quaternary system, plutonic rock, metamorphic rock, melange (or mélange) rock, hypabyssal rock, extrusive rock, clay rock, clastic rock, and biochemical sedimentary rock, in accordance with the original data description and the study of Liu et al. [41]. The distance to fault factor was calculated using the Euclidean distance between each grid and the nearest fault structure.

Land cover data
The normalized difference vegetation index (NDVI) and land cover type were used to represent land cover in the study area, as they can affect slope stability by altering the density of vegetation, soil moisture content, land evapotranspiration, and root strength [18,35]. The NDVI was generated via the Landsat8 Collection 1 Tier 1 8-Day from 2015 to 2019 on the Google Earth Engine (GEE) platform; the maximum values during the study period were adopted as the NDVI factor. The land cover type was obtained from the land cover product GlobeLand30 for 2020. Since Guangdong Province is located in a low-latitude area and has a tropical and subtropical climate, eight land cover types, including cultivated land, forest, grassland, shrubland, wetland, water bodies, artificial surfaces, and bare land, were considered in the study area (excluding tundra and permanent snow and ice).

Meteorological and hydrological data
Rainfall is an important trigger of landslides [56]. Because rainfall in Guangdong Province is abundant and the samples input to the model are all rainfall-induced landslides, it is necessary to consider the rainfall factor. The annual average rainfall represents the long-term average rainfall condition of the region [24] and has been validated in previous machine learning LSM studies [16,41]. The average annual rainfall of the study area from 2015 to 2019 was used as the rainfall factor; it was calculated from the raw monthly rainfall dataset provided by the National Earth System Science Data Center [53]. Rivers can carry away the rock and soil mass at the slope toe, so that slope toes near rivers easily form a free face, which promotes the occurrence of landslides [24]. Thus, the factor of distance to river was added; it was obtained by calculating each grid's Euclidean distance to the nearest river. The river network data were obtained from OpenStreetMap.
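The distance-based factors (distance to river, and likewise distance to fault and distance to road) can be illustrated with a minimal sketch. It is not the ArcGIS workflow used in the study: the function name `distance_to_nearest`, the toy coordinates, and the assumption that all coordinates are already expressed in kilometres in a planar (projected) system are illustrative only.

```python
import math

def distance_to_nearest(cells, feature_points):
    """Return {cell: Euclidean distance to the closest feature point}.

    cells and feature_points are (x, y) tuples in the same planar units
    (assumed here to be km). Brute force; fine for a toy example only.
    """
    return {
        cell: min(math.dist(cell, point) for point in feature_points)
        for cell in cells
    }

# Hypothetical grid-cell centres and river vertices (not real data).
grid_cells = [(0.0, 0.0), (3.0, 0.0), (6.0, 8.0)]
river_points = [(0.0, 0.0), (3.0, 4.0)]

dist_to_river = distance_to_nearest(grid_cells, river_points)
```

For a province-wide 1 km raster, a GIS distance tool or a spatial index would normally replace the brute-force loop.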

Human activity data
The road construction near slopes can considerably reduce the stability of slopes [24] and even directly lead to landslide occurrence [47]. Therefore, the factor of distance to road indicates a trigger of human activity; it was obtained by calculating the Euclidean distance from each grid to the nearest road based on road data downloaded from the OpenStreetMap. In particular, to fully measure the complicated impact of human activities on landslides [47], the Global Human Modification (GHM) data with a 1 km spatial resolution on the GEE platform were also incorporated (data year 2016) [54]. The GHM is a continuous index ranging from 0 (no human impact) to 1 (high impact), where a greater value indicates more intense human modification of terrestrial lands. It provides an insight into the impact of human activities by analyzing various types of data, including human settlements, agricultural activities, transportation, mining and energy production activities, and electricity infrastructure. As an integrated variable, the GHM considers the impact of different types of human modification.

Methods
This study constructed an RF-based LSM model to analyze the distribution pattern of landslide susceptibility in Guangdong Province and its dominant environmental factors. First, normalization and a correlation test were performed to preprocess the input data of the LSM model. Second, the RF-based LSM model with the optimal ratio of positive-to-negative samples and training-to-test samples was constructed. Third, the performance of the proposed RF-based LSM model was evaluated and compared with the SVM, MLP, and LR-based models using the area under the curve (AUC) method. All steps were performed through the scikit-learn open-source library in Python and the ESRI ArcGIS software.

Normalization and Correlation Test of Model Inputs
To avoid the inconsistent influence of the different dimensions of the input factors on the LSM model, min-max normalization pre-processing was performed for the 13 environmental factors using the following formula:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x$ is the original value of the independent variable (i.e., environmental factor), $x_{\max}$ and $x_{\min}$ are the maximum and minimum values of the independent variable, respectively, and $x'$ is the normalized independent variable. Considering the correlation effect between the 13 environmental factors in the machine learning model [33], the variance inflation factor (VIF) was calculated to detect multicollinearity among these factors (VIF values above 5 indicate the presence of multicollinearity) [34]. An appropriate combination of environmental factors could then be determined. The formula for VIF is as follows:

$$VIF_i = \frac{1}{1 - R_i^2}$$

where $R_i$ is the multiple correlation coefficient obtained by regressing $x_i$ on the remaining independent variables.
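The two preprocessing steps can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' code: the function names, the random test matrix, and the handling of constant columns are assumptions made for the example.

```python
import numpy as np

def min_max_normalize(X):
    """Scale every column to [0, 1] via (x - x_min) / (x_max - x_min)."""
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)
    x_range = X.max(axis=0) - x_min
    x_range[x_range == 0] = 1.0  # assumption: constant columns map to 0
    return (X - x_min) / x_range

def vif(X):
    """VIF_i = 1 / (1 - R_i^2), with R_i^2 from regressing column i on the rest."""
    X = np.asarray(X, dtype=float)
    n_samples, n_factors = X.shape
    values = np.empty(n_factors)
    for i in range(n_factors):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        design = np.column_stack([np.ones(n_samples), others])  # intercept
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residual = y - design @ coef
        r_squared = 1.0 - (residual @ residual) / ((y - y.mean()) ** 2).sum()
        values[i] = 1.0 / (1.0 - r_squared)
    return values

# Synthetic stand-in for the factor table (rows = grid cells, columns = factors).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# Append a near-copy of column 0 to show what multicollinearity looks like.
X_collinear = np.column_stack([X, X[:, 0] + 0.05 * rng.normal(size=500)])

X_norm = min_max_normalize(X)
vif_independent = vif(X_norm)
vif_collinear = vif(min_max_normalize(X_collinear))
```

In this toy setup the independent columns stay well below the threshold of 5, while the duplicated column and its source exceed it, which is the situation the VIF test is meant to flag.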

Construction of LSM Model Considering Sample Type and Balance
This study employed the RF algorithm to construct the LSM model based on Class I landslide data and environmental factors. To minimize the model fitting error caused by the heterogeneity of landslide samples, the rainfall-induced landslides derived from the Class I landslide data were used as positive samples. In addition, the negative samples were selected from non-landslide regions and defined as non-road and non-river regions more than 1 km away from the positive samples [44].
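The negative-sample rule can be sketched as a simple filter over grid cells. This is only an illustration under stated assumptions: cells are indexed as (row, column) on the 1 km grid so that one index step equals 1 km, "more than 1 km away" is read as a centre-to-centre Euclidean distance strictly greater than 1, and the function name, toy grid, and random seed are hypothetical.

```python
import random

def select_negative_cells(candidates, positives, excluded,
                          n_per_positive=1, min_dist_km=1.0, seed=0):
    """Randomly draw non-landslide cells for a 1:n positive-to-negative ratio.

    candidates: all (row, col) cells of the study area
    positives:  cells labeled as landslide samples
    excluded:   road and river cells that must not be used as negatives
    """
    positives = set(positives)
    excluded = set(excluded)

    def far_from_positives(cell):
        return all(
            ((cell[0] - r) ** 2 + (cell[1] - c) ** 2) ** 0.5 > min_dist_km
            for r, c in positives
        )

    pool = [
        cell for cell in candidates
        if cell not in positives and cell not in excluded
        and far_from_positives(cell)
    ]
    n_needed = n_per_positive * len(positives)
    if n_needed > len(pool):
        raise ValueError("not enough eligible non-landslide cells")
    return random.Random(seed).sample(pool, n_needed)

# Toy 10 x 10 grid with two landslide cells and a "river" along row 5.
grid = [(r, c) for r in range(10) for c in range(10)]
landslide_cells = [(2, 2), (7, 7)]
river_cells = [(5, c) for c in range(10)]

negatives = select_negative_cells(grid, landslide_cells, river_cells,
                                  n_per_positive=2)
```

Changing `n_per_positive` to 1, 2, or 3 yields the 1:1, 1:2, and 1:3 sample sets that the ratio experiment compares.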
The selection of the positive-to-negative sample ratio (i.e., landslide to non-landslide sample ratio) and the training-to-test sample ratio is essential for machine-learning-based LSM models [5,18]. The grid cells where the landslides occur were selected as the landslide samples, while the grid cells in the area outside the occurrence of landslides were used as the non-landslide samples [18]. Thus, the number of available landslide samples was much lower than the number of the non-landslide samples. To obtain the optimal positive-to-negative sample ratio of the LSM model, a positive-to-negative sample ratio of 1:1 was used to represent the case of balanced positive and negative samples, and ratios of 1:2 and 1:3 represented the case of unbalanced positive and negative samples. Considering that the area under the curve (AUC) is insensitive when there is a quantitative difference between positive and negative samples [57], the Sensitivity from the confusion matrix was employed to evaluate the landslide identification precision. As shown in Figure 3, TP, FN, TN, and FP are four metrics of the confusion matrix. TP is the true positive, i.e., the number of landslide samples correctly predicted by the model; FN is the false negative, i.e., the number of samples where the model misclassifies landslides as non-landslides; TN is the true negative, i.e., the number of non-landslide samples correctly predicted by the model; and FP is the false positive, i.e., the number of samples where the model misclassifies non-landslides as landslides. The indexes of Sensitivity and Specificity can be calculated using the following formulas:

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$

where Sensitivity denotes the true-positive rate, i.e., the ratio of the number of landslides correctly predicted by the model to the number of landslide samples; Specificity denotes the true-negative rate, i.e., the ratio of the number of non-landslides correctly predicted by the model to the number of non-landslide samples.
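As a small worked example of these two indexes, the sketch below computes them from the four confusion-matrix counts. The counts are hypothetical and are not taken from Figure 3 or Figure 4; the function name is likewise illustrative.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (Sensitivity, Specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of landslide samples detected
    specificity = tn / (tn + fp)  # share of non-landslide samples detected
    return sensitivity, specificity

# Hypothetical test set: 50 landslide and 50 non-landslide samples.
sens, spec = sensitivity_specificity(tp=40, fn=10, tn=45, fp=5)
print(f"Sensitivity = {sens:.2%}, Specificity = {spec:.2%}")
# prints "Sensitivity = 80.00%, Specificity = 90.00%"
```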
In this study, the positive-to-negative sample ratio with the greatest Sensitivity was selected as the optimal ratio. Referring to several LSM studies, the training-to-test sample ratio of the sample set was set to 7:3 [14,44] and 8:2 [49,50]. AUC values of the LSM model with different ratios were calculated, and average AUC values were obtained by repeating the previous process ten times. The AUC denotes the area enclosed by the receiver operating characteristic (ROC) curve and the coordinate axis. It has been widely used to measure the accuracy of models [34]. The AUC value of a useful model ranges from 0.5 (random guessing) to 1, where a higher value indicates a stronger discriminative capability of the model [34]. Generally, values of AUC above 0.7 indicate high accuracy, while values above 0.9 indicate very high accuracy [58]. The ROC curve is generated with 1 − Specificity (the false-positive rate) as the horizontal axis and Sensitivity (the true-positive rate) as the vertical axis.
In this study, the training-to-test sample ratio with the highest average AUC value was selected as the optimal ratio. The optimal combination of hyperparameters was determined using the cross-validation and grid search methods.
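The ratio comparison and the hyperparameter search can be sketched with scikit-learn, which the study reports using. Everything data-related here is an assumption for illustration: the samples are synthetic (574 rows to mimic a 1:1 set with 13 factors), the searched grid is deliberately tiny, and `max_depth` is included only as an example of a second hyperparameter.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for a balanced sample set with 13 factors.
X, y = make_classification(n_samples=574, n_features=13,
                           n_informative=6, random_state=0)

def repeated_auc(test_size, n_repeats=10):
    """Train and test an RF n_repeats times; return (mean AUC, all AUCs)."""
    aucs = []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed)
        model = RandomForestClassifier(n_estimators=100, random_state=seed)
        model.fit(X_tr, y_tr)
        # AUC needs a continuous score: probability of the landslide class.
        aucs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
    return float(np.mean(aucs)), aucs

results = {"8:2": repeated_auc(0.2), "7:3": repeated_auc(0.3)}
best_ratio = max(results, key=lambda ratio: results[ratio][0])

# Grid search with 5-fold cross-validation on an illustrative grid.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
    scoring="roc_auc", cv=5)
search.fit(X, y)
best_params = search.best_params_
```

On real data the winning ratio depends on the sample set, so `best_ratio` should be read from the averaged AUC values rather than assumed in advance.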

Machine-Learning-Based LSM Model Comparison
To verify the reliability of the proposed RF-based LSM model, the performance of this model was compared with three other machine learning models, i.e., SVM, MLP, and LR, using the optimal positive-to-negative sample ratio and training-to-test sample ratio. Specifically, all samples were first divided into ten sample sets, each of which had the optimal ratio of positive-to-negative samples, i.e., 287 positive samples and 287n (n = 1, 2, or 3) negative samples. For each sample set, the optimal ratio of training-to-test samples was adopted, i.e., 70% or 80% of the samples were used as the training set, while the remaining 30% or 20% were used as the test set. Then, these ten sample sets were used to train and test four machine learning models, and their performances were evaluated based on the testing set and the ROC. Finally, AUC values corresponding to the sample sets of each machine learning model were calculated, and the model with the largest average AUC value indicated the highest accuracy of the LSM.
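A comparison of this kind might look like the sketch below. It simplifies the described procedure: instead of ten separately drawn sample sets, it uses ten random 8:2 splits of one synthetic data set, and all model settings are scikit-learn defaults or illustrative choices rather than the tuned values of the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for a balanced sample set with 13 factors.
X, y = make_classification(n_samples=574, n_features=13,
                           n_informative=6, random_state=1)

def make_models(seed):
    """Fresh, untrained instances of the four compared model families."""
    return {
        "RF": RandomForestClassifier(n_estimators=100, random_state=seed),
        "SVM": SVC(probability=True, random_state=seed),
        "MLP": MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000,
                             random_state=seed),
        "LR": LogisticRegression(max_iter=1000),
    }

auc_by_model = {name: [] for name in make_models(0)}
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    for name, model in make_models(seed).items():
        model.fit(X_tr, y_tr)
        score = model.predict_proba(X_te)[:, 1]
        auc_by_model[name].append(roc_auc_score(y_te, score))

mean_auc = {name: sum(v) / len(v) for name, v in auc_by_model.items()}
best_model = max(mean_auc, key=mean_auc.get)
```

Using the same splits for every model keeps the comparison paired, so differences in mean AUC reflect the models rather than the sampling.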

Model Application and Multidimensional Analysis
The model with the largest AUC value was employed to map the landslide susceptibility of Guangdong Province. We used the natural breaks method to identify four levels, i.e., low susceptibility, moderate susceptibility, high susceptibility, and very high susceptibility. The distribution pattern of landslide susceptibility in the study area was further analyzed at the provincial scale and the city scale.
To evaluate the effect of the model on landslide samples, the percentage of the number of Class I landslides at each landslide susceptibility level to the total number of Class I landslides ($Pn_i$) was calculated; it is defined as follows:

$$Pn_i = \frac{n_i}{n} \times 100\%$$

where $n_i$ denotes the number of Class I landslides at each susceptibility level $i$ (four levels, including low, moderate, high, and very high), and $n$ denotes the overall number of Class I landslides.
The susceptibility levels of landslides for each city in Guangdong Province were evaluated by calculating the percentage of the area at different susceptibility levels in each city of Guangdong Province ($Pa_{ij}$). The formula of $Pa_{ij}$ is as follows:

$$Pa_{ij} = \frac{A_{ij}}{A_j} \times 100\%$$

where $A_{ij}$ denotes the area of each susceptibility level $i$ (four levels, including low, moderate, high, and very high) of city $j$, and $A_j$ denotes the administrative area of city $j$.
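Both percentages are simple ratios, as the following sketch shows. The counts and areas are made-up numbers for a single hypothetical city, and the function names are illustrative.

```python
def landslide_percentage(counts_by_level):
    """Share of landslides per level: 100 * n_i / n."""
    n = sum(counts_by_level.values())
    return {level: 100.0 * n_i / n for level, n_i in counts_by_level.items()}

def area_percentage(area_by_level, city_area):
    """Share of a city's area per level: 100 * A_ij / A_j."""
    return {level: 100.0 * a_ij / city_area
            for level, a_ij in area_by_level.items()}

# Hypothetical landslide counts per susceptibility level.
pn = landslide_percentage(
    {"low": 10, "moderate": 20, "high": 30, "very high": 40})

# Hypothetical areas (km^2) per level for one city of 2000 km^2.
pa = area_percentage(
    {"low": 500.0, "moderate": 700.0, "high": 600.0, "very high": 200.0},
    city_area=2000.0)
```

A model that separates landslides well would concentrate most of the landslide share in the high and very high levels while those levels cover a comparatively small share of the area.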
Furthermore, the landslide susceptibility distribution pattern was the result of various environmental factors. Considering that the RF model has also proven to be an effective method for quantifying the importance of different variables, the RF model was used to analyze the impact of 13 environmental factors on the landslides in Guangdong Province.

The Correlation of Environmental Factors
Table 2 shows the multicollinearity test results, i.e., the VIF values of 13 environmental factors. As shown in Table 2, the VIF values of the environmental factors are all less than 5, which indicates there is no multicollinearity among these factors [34]. Therefore, this study employed all 13 environmental factors for LSM.
Figure 4 shows the confusion matrices of the RF-based LSM model with different ratios of positive-to-negative samples (i.e., landslide samples and non-landslide samples) using the testing sets. It can be found that the Sensitivity value (multiplied by 100% to show the percentage) is the highest (79.31%) when the positive-to-negative sample ratio is 1:1. For ratios of 1:2 and 1:3, the Sensitivity values drop to 44.83% and 25.86%, respectively. We can, therefore, conclude that the ratio of positive-to-negative samples has a significant effect on the sensitivity index, i.e., the precision of landslide identification. In other words, the model's ability to identify landslides would be greatly limited when there is a significant quantitative difference between positive and negative samples. Therefore, 1:1 is selected as the optimal positive-to-negative sample ratio.
Figure 5 shows the ROC curves and AUC values obtained from the same sample set with a positive-to-negative sample ratio of 1:1. The RF model was constructed using training-to-test sample ratios of 8:2 and 7:3, and model construction was repeated 10 times for each ratio. With a training-to-test ratio of 8:2, the AUC values range from 0.7609 to 0.8852, with an average of 0.831. With a ratio of 7:3, they range from 0.7468 to 0.8569, with an average of 0.7973. Both the highest and the average AUC are higher for the 8:2 ratio than for the 7:3 ratio. Thus, in this study, the optimal training-to-test sample ratio for the RF model was set as 8:2.
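The repeated-split comparison described above could be sketched as follows. This is an illustration on synthetic data using scikit-learn, not the authors' pipeline; the helper `repeated_auc`, the data set, and the use of the random seed to vary each repetition are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def repeated_auc(X, y, test_size, n_repeats=10):
    """Train and score an RF model on n_repeats different random splits."""
    aucs = []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed
        )
        model = RandomForestClassifier(n_estimators=100, random_state=seed)
        model.fit(X_tr, y_tr)
        aucs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
    return np.array(aucs)

# Synthetic, roughly balanced (1:1) data with 13 features (assumption).
X, y = make_classification(
    n_samples=600, n_features=13, n_informative=6, random_state=0
)

summary = {}
for label, test_size in (("8:2", 0.2), ("7:3", 0.3)):
    aucs = repeated_auc(X, y, test_size)
    summary[label] = (aucs.min(), aucs.max(), aucs.mean())
```

Reporting the minimum, maximum, and mean per ratio mirrors the way the two ratios are compared in the text; on other data the ordering of the two ratios may differ.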

Comparison of Machine Learning Models
The proposed RF-based LSM model was compared with three other machine learning models, namely SVM, MLP, and LR. All four models were trained and tested using 10 different sample sets with the optimal positive-to-negative sample ratio (1:1) and training-to-test sample ratio (8:2). The optimal combination of hyperparameters was also determined and is shown in Table 3. Figure 6 illustrates the ROC curves and AUC values of the four models. In general, the AUC values of the SVM, MLP, and LR models lie between 0.6322 and 0.7901, with average AUC values of around 0.7. The RF model performs best, with the highest average AUC value of 0.8116 and the highest single AUC value of 0.9145 (Figure 6a). These findings indicate that the RF model is superior to the other three machine learning models for LSM in this setting. This is likely because the RF model limits overfitting to the training set by aggregating multiple decision trees and can handle missing values and outliers [17].

Table 3. The optimal combination of main hyperparameters involved in RF.
Hyperparameter | Explanation | Value
n_estimators | The number of decision trees. |
criterion | The sample segmentation criteria. | gini
max_depth | The maximum depth of decision trees. | None
min_samples_split | The minimum number of samples required to split an internal node. | 2
min_samples_leaf | The minimum number of samples required to be at a leaf node. | 1
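A comparison of this kind might be set up as in the sketch below. It assumes scikit-learn, whose parameter names match those listed in Table 3; the synthetic data, the feature scaling for SVM, MLP, and LR, and the helper `positive_scores` are assumptions. `n_estimators` is left at the library default because its value is not given here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def positive_scores(model, X):
    """Continuous scores for the positive class, whichever method exists."""
    try:
        return model.predict_proba(X)[:, 1]
    except AttributeError:
        return model.decision_function(X)

# Synthetic stand-in for a 1:1 sample set with 13 factors (assumption).
X, y = make_classification(
    n_samples=600, n_features=13, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0  # 8:2 split
)

models = {
    # RF settings follow the values recoverable from Table 3.
    "RF": RandomForestClassifier(
        criterion="gini", max_depth=None,
        min_samples_split=2, min_samples_leaf=1, random_state=0,
    ),
    "SVM": make_pipeline(StandardScaler(), SVC(random_state=0)),
    "MLP": make_pipeline(
        StandardScaler(), MLPClassifier(max_iter=2000, random_state=0)
    ),
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

auc_by_model = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    auc_by_model[name] = roc_auc_score(y_test, positive_scores(model, X_test))
```

On synthetic data the ranking of the four models need not match Figure 6; the sketch only shows how a like-for-like AUC comparison on a shared split can be organized.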

Distribution Pattern Analysis of Landslide Susceptibility at the Provincial Scale
Following the analysis outlined above, the optimal positive-to-negative sample ratio of 1:1 and training-to-test sample ratio of 8:2 were determined, and the RF model was found to outperform the SVM, MLP, and LR models. Therefore, the RF-based LSM model with the highest AUC value (A1 in Figure 6a) was applied to Guangdong Province to obtain the landslide susceptibility index for each grid cell. The landslide susceptibility raster layer of the study area was then classified into four levels using the natural breaks method: low susceptibility (0-0.28), moderate susceptibility (0.29-0.43), high susceptibility (0.44-0.60), and very high susceptibility (0.61-0.98). Figure 7 shows the distribution of the four landslide susceptibility levels in Guangdong Province. Regions of high landslide susceptibility are mainly located in the northern and eastern parts of the province, while regions of low landslide susceptibility are mainly located in the southern and western parts. Mountainous regions tend to have higher landslide susceptibility, while regions with gentle terrain, such as the Pearl River Delta, the Hanjiang Delta, and the coastal areas, tend to have lower landslide susceptibility. This also indicates that the landslides of Guangdong are closely related to topography.
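Once the break values are known, assigning a level to each grid cell is a simple lookup. The sketch below applies the class limits quoted above; it does not compute the natural breaks themselves, and the function name and the treatment of values falling between two printed limits (e.g., 0.285 is assigned to the higher class) are assumptions.

```python
from bisect import bisect_left

# Upper limits of the low, moderate, and high classes as reported above.
UPPER_BOUNDS = [0.28, 0.43, 0.60]
LEVELS = ["low", "moderate", "high", "very high"]

def susceptibility_level(index: float) -> str:
    """Map a susceptibility index in [0, 1] to one of the four levels."""
    return LEVELS[bisect_left(UPPER_BOUNDS, index)]

examples = {v: susceptibility_level(v) for v in (0.10, 0.28, 0.29, 0.44, 0.61)}
# {0.1: 'low', 0.28: 'low', 0.29: 'moderate', 0.44: 'high', 0.61: 'very high'}
```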

Figure 7. The landslide susceptibility map of Guangdong Province and the 335 recorded historical landslide points provided by the Class I landslide data (note: because elevation data are missing for the sea around some offshore land, slope, aspect, and curvature could not be calculated there, so the boundary of this figure may differ slightly from the boundary of Figure 1).

Table 4 shows the percentage of Class I landslides located in each susceptibility level region relative to their total number in the study area ($Pn_i$). The results show that 98.5% of the historical landslide points in the study area fall within regions of high or very high susceptibility. Moreover, as shown in Figure 7, the areas surrounding the landslide points are mainly categorized as high susceptibility regions. This indicates that the landslide susceptibility levels of Guangdong Province are spatially consistent with the Class I landslide data.

Distribution Pattern Analysis of Landslide Susceptibility at the City Scale
To further investigate the hierarchical structure of the landslide susceptibility level at the city scale, the $Pa_{ij}$ values for all 21 cities in Guangdong Province were calculated and ranked in descending order. Figure 8 shows that the landslide susceptibility level varies from city to city. The cities can be grouped into four categories to support multi-level disaster reduction work. The first category includes cities with large $Pa_{ij}$ values at the very high susceptibility level, among which Heyuan has the largest value (29.37%); Qingyuan, Zhaoqing, Shaoguan, Guangzhou, and Meizhou also have $Pa_{ij}$ values over 20% at this level. These cities should, thus, further strengthen landslide prevention measures. The second category includes cities with large $Pa_{ij}$ values at the high susceptibility level, such as Chaozhou, Shenzhen, Jieyang, Shanwei, and Shantou. Although their $Pa_{ij}$ values at the very high susceptibility level are not prominent, their values at the high susceptibility level are over 25%, indicating that these cities face potential landslide risks. The third category includes Jiangmen, Zhuhai, Yangjiang, and Zhanjiang, which are all coastal cities. Their $Pa_{ij}$ values at the very high susceptibility level are less than 2%, while their values at the low susceptibility level are more than 50%, which shows that the probability of landslides in these cities is low. The fourth category consists of cities with relatively large $Pa_{ij}$ values at the moderate susceptibility level, including Zhongshan, Dongguan, Foshan, Yunfu, Huizhou, and Maoming; these areas require regular inspection of potential landslide hazards.
Figure 8. $Pa_{ij}$ values in each city of Guangdong Province ($Pa_{ij}$ is calculated using Equation (6)).
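The city-level percentages can be illustrated with a small sketch. It assumes that the percentage for a city is the share of its area at a given susceptibility level, approximated here by counting equal-area grid cells; the function name, the city labels "A" and "B", and the toy cells are hypothetical.

```python
from collections import Counter, defaultdict

def area_share_by_city(cells):
    """cells: iterable of (city, level) pairs, one per equal-area grid cell.

    Returns {city: {level: percentage of the city's cells at that level}}.
    """
    counts = defaultdict(Counter)
    for city, level in cells:
        counts[city][level] += 1
    shares = {}
    for city, level_counts in counts.items():
        total = sum(level_counts.values())
        shares[city] = {
            level: 100.0 * n / total for level, n in level_counts.items()
        }
    return shares

# Hypothetical toy raster: ten cells in each of two made-up cities.
cells = (
    [("A", "very high")] * 3 + [("A", "low")] * 7
    + [("B", "very high")] * 1 + [("B", "moderate")] * 9
)
shares = area_share_by_city(cells)

# Rank cities by their share at the very high level, in descending order.
ranking = sorted(
    shares, key=lambda city: shares[city].get("very high", 0.0), reverse=True
)
```

Ranking on a single level, as done for the first category above, is only one possible ordering; the other categories would use the high, moderate, or low shares instead.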

Analysis of Environmental Factor Importance
The selection of environmental factors may also influence the model's accuracy. Thus, the 13 environmental factors of landslide susceptibility in the study area were analyzed and ranked according to their importance. As shown in Figure 9, rainfall is the most important factor in the model, which is consistent with the Class I landslide data from disaster reports indicating that most landslides are rainfall-induced. Elevation, profile curvature, slope, and plane curvature rank second to fifth, which indicates that topographic factors (except for aspect) also have a great influence on landslides in Guangdong Province. NDVI and distance to fault rank sixth and seventh, respectively. GHM has higher importance than distance to road, which suggests that it can serve as a better index of human activity. The importance of distance to river, lithology, aspect, and land cover type is relatively low, reflecting the relatively low contribution of these factors to landslide susceptibility in Guangdong Province.
Given that landslides are generally related to geological conditions, the landslide susceptibility model adopted two geological factors commonly used in existing LSM studies, i.e., lithology and distance to fault [18]. However, the landslides of Guangdong Province have proven to be predominantly rainfall-induced: rising groundwater levels increase the soil pore water pressure, reduce the geotechnical strength, and raise the stress state of the slope [52,59]. Lithology therefore has a relatively low impact on landslides; its influence may be considered theoretical and indirect.
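A factor ranking of the kind discussed in this section can be obtained from a fitted RF model. The sketch below uses scikit-learn's impurity-based `feature_importances_`, which is an assumption, since the importance measure used for Figure 9 is not specified here. The data are synthetic, and the factor names are only labels attached to random columns.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
names = ["rainfall", "elevation", "slope", "ndvi", "ghm"]  # labels only

# Synthetic data (assumption): the label depends mostly on the first column,
# weakly on the second, and not at all on the remaining ones.
X = rng.normal(size=(500, len(names)))
y = (X[:, 0] + 0.3 * X[:, 1] + 0.2 * rng.normal(size=500) > 0).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort them in descending order.
ranking = sorted(
    zip(names, model.feature_importances_),
    key=lambda item: item[1],
    reverse=True,
)
```

Impurity-based importances can favor continuous or high-cardinality factors, so a permutation-based measure on held-out data is a common cross-check.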

Impact of Rainfall and GHM on Landslide Susceptibility Mapping
In this study, GHM was introduced to LSM for the first time, and the importance of rainfall as an environmental factor for landslides in Guangdong Province was verified. As shown in Table 5, the AUC values of the optimal model (RF) with and without the rainfall and GHM factors were calculated to investigate the effectiveness of these two environmental factors. When the GHM factor was removed from the RF model, the AUC value decreased from 0.9145 to 0.8991; when the rainfall factor was removed, it decreased from 0.9145 to 0.8917; when both factors were removed, it decreased further, from 0.9145 to 0.8808. These results indicate that the introduction of rainfall and GHM noticeably improves the mapping accuracy of landslide susceptibility. The model accuracy and factor weights vary with spatial resolution in the same region [41]. Given the limited spatial resolution, the importance and impact of some highly heterogeneous factors (such as GHM) may also be underestimated. Therefore, a higher spatial resolution can help analyze the impact of factors more accurately, as long as computational power allows.
Other rainfall characteristics, such as intensity and duration, have also proven helpful for LSM studies [18]; these factors are more applicable to the modeling of landslide inventories after rainfall or typhoon events [51,60] as well as for landslide prediction under multiple scenarios [59,61].
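The size of each effect in the factor-removal experiment can be read off with a few subtractions. The sketch below uses only the AUC values quoted in this section; the variable names are illustrative.

```python
FULL_AUC = 0.9145  # RF model with all 13 factors

# AUC after removing the named factor(s), as quoted in this section.
auc_without = {
    "GHM": 0.8991,
    "rainfall": 0.8917,
    "rainfall and GHM": 0.8808,
}

auc_drop = {
    removed: round(FULL_AUC - auc, 4) for removed, auc in auc_without.items()
}
# {'GHM': 0.0154, 'rainfall': 0.0228, 'rainfall and GHM': 0.0337}

sum_of_single_drops = round(auc_drop["GHM"] + auc_drop["rainfall"], 4)
# 0.0382, slightly more than the joint drop of 0.0337
```

The joint drop being smaller than the sum of the two single drops may suggest that rainfall and GHM carry partly overlapping information, although a single AUC value per configuration is not enough to establish this.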

Analysis of Landslide Environmental Factors in Guangdong Province
To reveal the nonlinear relationship between landslide susceptibility and each environmental factor, partial dependence plots were drawn. As shown in Figure 10, among all the factors, the most dramatic increase in the probability of landslide occurrence is observed when rainfall is in the range of 1650-1900 mm. The probability of landslide occurrence is also elevated at elevations of 50-250 m and for the artificial surfaces land cover type (code 80). Moreover, profile curvature, plane curvature, distance to road, and distance to fault show an approximately linear relationship with the probability of landslide occurrence within a certain range. Notably, the probability of landslide occurrence increases rapidly as GHM rises through the interval 0.25-0.35 and then decreases slowly. The reasons may be twofold. On the one hand, initial human activities are destructive to the land (e.g., agricultural reclamation and large-scale deforestation [47]); however, as human activities intensify, greater attention is paid to measures that prevent landslides (e.g., conversion of farmland to forestry [47] and the installation of well-maintained drainage systems [46]). On the other hand, the most intensive human activities in Guangdong Province occur in the Pearl River Delta, which mostly consists of plains and, thus, has a very low landslide probability. Although the linear relationship between distance to road and the probability of landslide occurrence is more obvious, the potential advantage of GHM is that it quantifies the intensity of human modification. In other words, GHM may be able to show to what extent human modification promotes landslide occurrence and, thus, help identify regions under threat.
However, partial dependence plots cannot reveal the combined effects of multiple factors in high-dimensional data [62]. The partial dependence plots in Figure 10 should not be interpreted as simple linear relationships between landslide occurrence and individual environmental factors. Landslides are, in most cases, the result of the combined effects of multiple environmental factors, with intricate and complex mechanisms [13,17]. For example, Yangjiang City and Jiangmen City are located in the area with the most abundant rainfall in Guangdong and experience intense human activity. However, the terrain in this area consists mostly of plains and low hills, so landslides are quite rare in this region. This helps explain the sharp decline in the partial dependence plot of rainfall when the annual rainfall exceeds 2100 mm.
To summarize, the partial dependence plots can help elucidate the relationship between landslide susceptibility and different environmental factors to some extent; however, the complexity of the interactions among these factors should be taken into account.
Interestingly, almost all LSM studies that have adopted the RF algorithm have used a relatively large number of factors (more than 10) as inputs [22][23][24][25][26][27]. The reason may be that the RF model accepts both categorical and continuous factors [17]; thus, it does not require much preprocessing and avoids the loss of information. This may also be one of the reasons why the RF model was more accurate than the other models in this study. However, it is noteworthy that different classification criteria for categorical factors may still affect the result [18], especially for factors that lack uniform classification criteria, such as lithology.
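The averaging behind a partial dependence curve, and the reason it can hide interactions, can be shown with a short model-agnostic sketch. The function `partial_dependence_1d` and the two toy models are hypothetical stand-ins; they are not the RF model of this study.

```python
import numpy as np

def partial_dependence_1d(predict, X, feature, grid):
    """Average prediction when one feature is forced to each grid value."""
    X = np.asarray(X, dtype=float)
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value  # overwrite the feature for every sample
        averages.append(predict(X_mod).mean())
    return np.array(averages)

def additive_model(X):
    """Toy model without interaction between the two features."""
    return 2.0 * X[:, 0] + X[:, 1]

def interaction_model(X):
    """Toy model in which the effect of feature 0 depends on feature 1."""
    return X[:, 0] * X[:, 1]

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
grid = np.array([0.0, 0.5, 1.0])

pd_additive = partial_dependence_1d(additive_model, X, feature=0, grid=grid)
pd_interaction = partial_dependence_1d(interaction_model, X, feature=0, grid=grid)
```

For the additive model the curve equals 2 * grid plus the mean of the second feature, so it fully describes the effect of the first feature. For the interaction model the curve is a straight line with slope equal to the mean of the second feature, although the slope for an individual sample ranges from 0 to 1; this is the kind of combined effect that a one-dimensional plot averages away.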

Model Evaluation for Multiple Types of Landslides
To further verify the reliability of the LSM model, multiple types of landslides were involved in the validation process. Longchuan County and Zijin County in Heyuan City were selected as the study areas because Heyuan City has the highest $Pa_{ij}$ value at the very high susceptibility level among the 21 cities (29.37%). The landslide susceptibility map of these two counties was overlaid with the Class II landslide data. Figure 11 shows that the distribution patterns of landslides in Longchuan County and Zijin County are positively associated with the landslide susceptibility level. As shown in Table 6, a percentage analogous to $Pa_{ij}$ was calculated for the different susceptibility levels in these two counties. The percentages of landslide areas at the high and very high susceptibility levels are over 88% for both counties. Regions with high and very high susceptibility levels in Longchuan County and Zijin County are thus spatially consistent with the actual landslide locations. This further supports the reliability and generalizability of the model for application in Guangdong Province. Nevertheless, this paper only used remote sensing interpretation data in local areas for preliminary validation, because it is difficult to obtain a comprehensive sample of landslides for the whole province. Additional validation should be performed when more landslide samples are available.
The Class II landslide data were obtained via remote sensing interpretation and field verification. They cover almost all landslide events that can be detected in the study area but do not record the type of each landslide. Thus, we treated the Class II landslide data as a mixed type and applied them to validate the LSM model. To further improve model performance, detailed information on landslide type is needed.

Figure 11. The Class II landslide data and the landslide susceptibility map of two counties in Heyuan City: (a) Longchuan County, (b) Zijin County.

Importance of Sample Type and Balance
This study employed the RF model for mapping the landslide susceptibility of Guangdong Province. Compared with previous large-scale landslide susceptibility studies [16,41,44], the RF-based LSM model can provide satisfactory mapping accuracy when multiple landslide sample types and sample imbalance issues are considered. It is, therefore, necessary to select the landslide type with similar genesis and relevant environmental factors as input variables and to optimize the positive-to-negative sample ratio and training-to-test sample ratio. The sample balance issue has been reported in some LSM studies [5,18]; however, the landslide sample ratio selection is rarely explored. In this study, the sample ratios were analyzed quantitatively using confusion matrices and AUC values. The RF-based LSM model with optimal sample ratios has better generalization ability and operability, as it does not require a large amount of landslide data or the introduction of many variables [16]. Moreover, compared with other machine learning models, the improved RF-based model is more suitable for LSM studies on a large scale, especially when the sample size is small and there are multiple sample types.
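Fixing the positive-to-negative ratio amounts to drawing a controlled number of non-landslide samples. The sketch below shows one way to do this by random sampling without replacement; the function name, the candidate pool, and the sample sizes are hypothetical, and the spatial constraints that a real negative-sampling scheme would need are omitted.

```python
import numpy as np

def sample_negatives(n_positive, negative_ids, negatives_per_positive=1, seed=0):
    """Draw negatives_per_positive * n_positive distinct negative sample ids."""
    negative_ids = np.asarray(negative_ids)
    n_wanted = n_positive * negatives_per_positive
    if n_wanted > negative_ids.size:
        raise ValueError("not enough negative candidates for this ratio")
    rng = np.random.default_rng(seed)
    return rng.choice(negative_ids, size=n_wanted, replace=False)

# Hypothetical pool of 10,000 candidate non-landslide cells and 100 landslides.
candidate_negatives = np.arange(10_000)
balanced = sample_negatives(100, candidate_negatives)                     # 1:1
imbalanced = sample_negatives(100, candidate_negatives, negatives_per_positive=3)  # 1:3
```

Repeating the draw with different seeds yields the multiple sample sets needed to check how stable the resulting AUC and Sensitivity values are.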

Conclusions
To address the issues of environmental heterogeneity and sample imbalance in large-scale landslide susceptibility mapping, this study proposes an improved RF-based LSM model that considers the sample type and balance and introduces GHM to the LSM model for the first time. The results indicate that sample type and balance, as well as appropriate environmental factors, have an important impact on the accuracy of LSM.
For the LSM models, the optimal positive-to-negative sample ratio of 1:1 and training-to-test sample ratio of 8:2 were obtained using confusion matrices and AUC values. Compared with the SVM, MLP, and LR models, the RF model, with the highest AUC value of 0.9145, had excellent accuracy for LSM in Guangdong Province. In the study area, rainfall and topography are the two primary influencing factors of landslides, with the highest importance rankings. Moreover, the newly introduced GHM can be used in the LSM model to improve the AUC value. In Guangdong Province, the regions of high landslide susceptibility are mainly located in the northeast, while the regions of low landslide susceptibility are mainly located in the southwest. Heyuan, Qingyuan, Zhaoqing, Guangzhou, Shaoguan, and Meizhou show higher landslide susceptibility. These findings can provide an effective technical reference for LSM studies and can contribute to landslide prevention and disaster reduction, supporting the sustainable development of the regional economy.