Enhancing landslide susceptibility mapping using a positive-unlabeled machine learning approach: a case study in Chamoli, India

Introduction The Indian Himalayas’ susceptibility to landslides, particularly as a location where climate change effects may be event catalysts, necessitates the development of dependable landslide susceptibility maps (LSM). Method This study diverges from traditional binary classification models, framing LSM as a positive-unlabeled learning problem. This approach acknowledges that regions without recorded landslides are not necessarily at low risk but could simply have not experienced landslides yet. The study utilizes novel positive-unlabeled learning-enhanced algorithms—Random Forest, K-Nearest Neighbor, and Decision Tree—to create LSM for Chamoli district, India. Eleven causative factors for landslides are identified, including elevation, aspect, slope, geology, geomorphology, distance to lineament, lithology, NDVI, distance to river, distance to road and residential land use. To address spatial correlation biases, instead of randomly splitting the data-set, the study adopts spatial splitting to get the training and testing datasets. Conclusion The study reveals that positive-unlabeled learning substantially improves the Area Under Curve and recall, leading to a more conservative LSM compared to binary classification methods. Analysis shows that the southern region of Chamoli exhibits high recall but lower accuracy, suggesting a latent high landslide susceptibility despite a lack of historical landslides in this region. The study also quantifies the impact of human activity on landslide risk, indicating an elevated threat to life and the local economy, especially in Chamoli’s southwestern areas.


Introduction
Annually, landslides in India result in a monetary loss of approximately US$ 20-30 M (Parkash 2011), impacting a minimum of 15% of India's total land area (Sangeeta and 2008;Vijith et al. 2009).Among the explored methods, machine learning is known for capturing the nonlinear relationship between the landslide causative factors and landslides, and generating cell-calibrated output (Das et al. 2023).Chauhan et al. (2010) used a logistic regression model to create a landslide susceptibility zonation (LSZ) map of 600 km 2 area in the Chamoli region in the Garhwal Himalayas.The study found that 71.13% of observed landslides occur in the predicted very high and high susceptibility zones, covering 21.96% of the LSZ.Saha et al. (2020) evaluated a set of individual and ensemble machine learning methods to generate an LSM for the Rudraprayag District of Uttarakhand, India.The study considered 16 different landslide conditioning factors, and the dataset was randomly split into training and testing sets.Their study found that the Artificial Neural Network-Random Forest-Logistic Regression (ANN-RF-LR) model performs the best, with an 88% area under the success rate curve and a 94% area under the prediction rate curve.Saha et al. (2021) employed an ensemble of conditional probability and boosted regression treebased method to develop a landslide susceptibility map for Rudraprayag, India.Their findings indicated that fallow land and roadsides with high elevation and steep slopes exhibit a significantly higher susceptibility to landslide occurrences.Juyal and Sharma (2021) explored the feasibility of utilizing machine learning techniques to develop a landslide susceptibility map in Uttarakhand, India.They highlighted that the prevalent methods for LSM generation include Support Vector Machines (SVM), Random Forest (RF), and Artificial Neural Networks (ANN).Furthermore, they suggested that an ensemble of machine learning algorithms can offer a superior performance.Kainthura and Sharma (2022) conducted a comparative study to assess the efficacy of three different machine learning techniques, namely RF, Neural Network, and Bayesian Network, for developing an LSM of the Uttarkashi region, India.The study employed eleven landslide factors and randomly divided the dataset into training and testing sets.The results indicated that RF yields the most favorable outcome, exhibiting a testing accuracy of 86%.Solanki et al. (2022) used machine learning algorithms to prepare landslide susceptibility maps for the Kali River Valley in the Kumaun Himalaya region.The study considered 15 landslide conditioning factors and used 368 landslides for analysis.The results indicated that slope, elevation, and distance to thrust are the most significant factors.The study concluded that the ensemble XGB and RF algorithms outperform KNN, with approximately 85% accuracy compared to KNN's 81%.In a recent investigation by Badola et al. (2023), the landslide susceptibility of the river Alakananda in Chamoli, Uttarakhand, India, was analyzed using the XGBoost method and seven landslide factors.The results of the study demonstrated a high susceptibility along the road, with an AUC value of 91% being achieved.Table 1 shows the comparison of landslide conditioning factors and predictive model performance among the literatures above.
Table 1 Overview of landslide conditioning factors and predictive model performances in using machine learning to generate the LSM for the Himalayan region of India Traditional machine learning techniques applied to create LSM typically frame the challenge as a binary classification task, distinguishing samples into either 'landslide' (positive) or 'non-landslide' (negative) categories.Such a binary distinction might be an oversimplification, as the 'non-landslide' category not only encompasses truly landslide-free areas (negative) but also regions with latent landslide potential (positive).Consequently, LSM generation should not be treated as a conventional binary classification machine learning problem (Fang et al. 2021).Wu et al. (2020) first used Positive Unlabeled (PU)-bagging decision tree in generating LSM. for landslide-prone areas, considering that only positive and unlabeled data are provided in the landslide susceptibility modeling field.PU bagging refers to creating a training set by combining all positive data points with a random sample from the unlabeled points, with replacement.The results demonstrated that this method outperforms other established techniques, such as logistic regression, support vector machines, and artificial neural networks, in producing accurate LSM.Fang et al. (2021) applied a PU learning approach in combination with adaptive sampling and RF to predict landslide susceptibility in the Three Gorges Reservoir area of China, which achieved desirable prediction outcomes.Yao et al. (2022) classified LSM as a P.U.learning problem and introduced a twostep deep learning (T-DNN) framework based on the Spy technique.This approach enhances the selection of negative samples and improves the accuracy and reliability of landslide susceptibility assessment.
This study pioneers the application of a PU learning methodology to craft a high-fidelity LSM for the Himalayan region of India.Unlike traditional machine learning methods, the LSM created through PU learning exhibits a more conservative approach, which is crucial for the Himalayan area, where landslides pose a continual risk to communities.The LSM produced in this research is able to identify regions with a high landslide susceptibility even if they have never experienced landslides, providing valuable insights for local authorities.This research not only underscores the importance of implementing PU learning for LSM development but also establishes a pioneering model for future geohazard modeling, especially in areas facing similar challenges with dataset characteristics.

Study area
This study focuses on the Chamoli district, as shown in Fig. 1, the second largest district in the Uttarakhand state of India, with a total geographical area of approximately 7820 km 2 .Characterized by its predominantly mountainous terrain, Chamoli exhibits an agrarian economy, with forest cover constituting the primary land use at 58.38%.The district experiences a diverse climate, oscillating between subtropical monsoon and tropical upland types, and maintains a relative humidity fluctuating between 35 and 70% annually (CGWB 2022).The landscape of the Chamoli region is characterized by elevated hills and mountains, interspersed with narrow valleys and deep gorges, with certain areas enduring perennial snow cover.
The district hosts the vital arterial road connecting Rishikesh to Badrinath.This revered Hindu pilgrimage circuit encompasses four seminal temples nestled in the Garhwal Himalayas: Kedarnath, Badrinath, Gangotri, and Yamunotri.Given the influx of pilgrims attracted to this route, it is crucial to ensure its safety.
In this terrain, landslides emerge as a recurrent and formidable hazard, imperiling the pilgrims and the intricate infrastructure of Chamoli's road network (Roushan 2023).The Chamoli Rock-Ice Avalanche in 2021 demonstrated the severity of such hazards, claiming 78 lives and leaving 204 individuals unaccounted for (Vangla et al. 2022).In addition, it likely heralds a new causative factor of climate change and its associated factors (elevated temperatures, glacial melting, and flooding amongst others) that has not been explicitly relevant in regions of past studies.Consequently, the development of accurate LSM's for the Chamoli and similar Himalayan foothill districts has escalated in priority and is poised to play a pivotal role in pinpointing vulnerable zones and devising effective mitigation strategies.

Data source and materials
This study examines eleven causative factors for the landslide, including elevation, aspect, slope, geology, geomorphology, distance to lineament, lithology, NDVI, distance to river, distance to road and residential land use, as shown in Fig. 2.These factors are selected to provide comprehensive geological, hydrological, land cover, and morphological insights, marking an initial effort to forecast events in regions affected by climate change stressors (Lima et al. 2022;Reichenbach et al. 2018).
The elevation determines the distribution of the surface and directly provides the gravitational energy that initializes a landslide (Das et al. 2023;Youssef et al. 2016).Aspect is tied with the insolation, rainfall, and land use pattern, which will indirectly impact a landslide (Das et al. 2023).The slope is the critical factor, considering that there is a positive correlation between landslide and slope angle (Das et al. 2023;Felicísimo et al. 2013).Geology significantly influences landslide risk due to slopeforming material, soil type, weathering, and erosion conditions (Das et al. 2023).Geomorphology represents the comprehensive collection of knowledge regarding the terrain and, therefore, plays a role in assessing the risk of landslides (Das et al. 2023).Lineaments are mappable features on a surface, often indicative of underlying structures such as fractures, faults, and shear zones.Lineaments can indicate areas of weakness, which contribute to higher landslide risk in their vicinity (Chauhan et al. 2010;O'leary et al. 1976).The resistance of lithology units to landslides can vary due to differences in their structure and compactness (Chauhan et al. 2010).NDVI shows the growth conditions and the concentration of vegetation.A low NDVI, indicative of sparse vegetation, will increase landslide susceptibility due to reduced water retention and weakened soil stability (Pradhan and Lee 2010).The distance to a river is an important factor to consider when assessing landslide susceptibility since rivers can disrupt the equilibrium of forces within a slope, alter the weight of materials, and cause erosion (Gupta and Shukla 2022).The intensity of human activities can be reflected by land use and distance to roads.Urbanization can potentially increase the susceptibility of landslides in mountainous regions like the Chamoli district, highlighting the importance of these factors (Singh et al. 2021).
In this study, the elevation data is obtained from the Shuttle Radar Topography Mission (SRTM) 1 Arc-Second Global dataset provided by USGS, which offers a 30-m resolution (USGS 2018).Aspect and slope are derived using ArcGIS Pro based on the elevation data.NDVI is derived from Sentinel-2 data with a resolution of 10 m (EOB 2023).The remaining data related to causative factors are acquired from "Bhukosh", a portal for geoscientific data of the Geological Survey of India, and other relevant Government of India websites.These datasets were obtained as a shapefile and accessed in July 2022 (IMD 2023).All data is re-projected, resampled to 50 m resolution, and rasterized using ArcGIS Pro.Therefore, the resolution of the LSM produced in this study is 50 m by 50 m.Table 2 further elucidates the features of each landslide causative factor stored in the dataset.Future studies are expected to introduce additional climate change stressor factors to further enhance the predictions based on other similar events such as Sikkim Glacial Lake Out-Flow (GLOF) (Sattar 2023).
This work utilizes historical landslide data to identify positive landslide points.The data is sourced from the Geological Survey of India (Geosadak 2023) and news outlets through crowdsourcing.Every landslide point or polygon is buffered to form a polygon with a diameter of 500 m.A 500-m buffer is selected to ensure a conservative manner in generating LSM, capturing a broader area that might be influenced by historical landslides.Furthermore, the 500-m buffer is used consistently throughout the study to avoid introducing any other variables in the analysis.This uniform buffer is particularly applied when generating categorical features such as Distance to lineament, Distance to road, and Distance to river, as shown in Table 2.

PU learning
The PU learning algorithm employed in this study originates from Learning Classifiers from Only Positive and Unlabeled Data (Elkan and Noto 2008).This algorithm assumes that the selection of positive labeled data points is performed randomly from the entire pool of positive data points.This assumption serves as the basis for the derivation of the underlying lemma: In Eq. 1, x represents individual data points and y denotes the binary label [0,1].The variable s serves as the indicator for whether a data point is labeled.If the data point is labeled, s = 1; otherwise, s = 0.It is important to note that s will only be assigned a value of 1 when y is 1.
P y = 1|x is calculated based on the lemma, and the corresponding algorithms are shown in Fig. 3.
In essence, this algorithm and its foundational lemma enable the computation of the probability of a data point being positively labeled, leveraging the relationships between the labeled data points and the overall dataset. (1) In this study, data points that correspond to areas where a landslide has previously occurred are classified as positive points, whereas data points that do not have a history of landslides are classified as unlabeled points.

Spatial correlation: train test split
Numerous studies have employed a random dataset split into training and testing sets for LSM generation using machine learning techniques (Kainthura and Sharma 2022;Saha et al. 2020;Solanki et al. 2022).However, such an False approach fails to account for the spatial correlation within the spatial data, leading to an overly optimistic model performance (Meyer et al. 2019).To this end, this study divides the dataset into four equally sized sections along the latitude, denoted as X1, X2, X3, and X4, as shown in Fig. 4, with each section serving as a testing set in turn while the model is trained on the remaining sections.Considering PU learning's focus on positive points as the source of known knowledge, the positive point ratio, defined as the percentage of positive points amongst the four sections, is calculated and presented in Table 3.Although the dataset splitting method ensures a balanced number of data points in each dataset, an imbalance is still observed in the positive point ratio, with X1 containing only 13.55% of positive points, whereas X3 has 39.55%.This issue is further discussed in the following section.

Performance metrics
Three performance metrics are used to evaluate the model performance in this work, namely recall, AUC, and accuracy.Recall, also called as true positive rate, measures the proportion of actual positive instances that were correctly identified by a classifier.Recall is a particularly important metric in PU learning considering the nature of the data, since only positive instances are labeled.
AUC is generated from ROC curve which is used to assess the performance of binary classifiers by plotting true positive rate against the false positive rate.False positive rate is the proportion of actual negative instances that were incorrectly identified as positive.AUC measures the total area underneath the ROC curve.An AUC of 1.0 denotes a perfect classifier, while an AUC of 0.5 indicates a classifier performing no better than random chance.Accuracy is one of the most common metrics to evaluate classification models.It represents the fraction of predictions that are correct out of all the predictions made.In PU learning, where data consists of positive and unlabeled instances, the unlabeled points are often treated as negative for accuracy calculations.This approach allows for the swift establishment of a baseline model.

Results
In this study, five experiments are conducted using different train and test datasets, namely random splitting, using X1 to test and X2, X3, X4 to train, using X2 to test and X1, X3, X4 to train, using X3 to test and X1, X2, X4 to train, using X4 to test and X1, X2, X3 to train.For each experiment, RF, KNN, and decision tree models are trained with and without the PU learning algorithm.The recall metric measures the proportion of actual positive points that are correctly predicted, which is considered the most important metric for this dataset, as discussed in "Performance metrics" section.The model performance is presented in Table 4, with the bold value indicating the highest performance of each metric in every experiment.Notably, using the PU learning algorithm improves the recall and AUC performance in every experiment, with a particularly pronounced improvement in the recall value.Among the models evaluated, the RF model with the PU learning algorithm reaches an average recall of 56.67%, AUC of 69.37%, and accuracy of 79.28% when splitting the dataset spatially.This model consistently surpasses others in recall, all the while sustaining high accuracy levels.
Figure 5 depicts the performance of RF with and without PU learning in each experiment.It is evident from Fig. 5 that the accuracy remains relatively stable in the first four sets of columns, which represent random split, X1, X2, and X3 as test sets, respectively.However, there is a notable improvement in recall with the inclusion of PU learning, indicating the enhanced ability of models with PU Learning to correctly identify positive instances.
A different trend is observed in the X4 data set, where applying P.U.learning results in a drop in accuracy, despite an increase in recall.This decrease in accuracy in the X4 set can be attributed to the presence of a higher number of points prone to landslides, even though landslides have not occurred at those points.The model, with its heightened sensitivity to positive instances through PU learning, predicts more of these unlabeled points as positive.This results in a decrease in accuracy, as these unlabeled points are treated as negatives in the accuracy calculation.
Figure 6a-d illustrates the visualization of the landslide susceptibility degree when utilizing RF with PU learning on X1, X2, X3, and X4 as test sets, respectively.Figure 6e combines the test results into a singular map and Fig. 6f overlays Fig. 6e together with ground truth landslide map.The probability output by RF with PU learning is categorized into five classes, namely very low susceptible (VLS), low susceptible (LS), medium susceptible (MS), high susceptible (HS), and very high susceptible (VHS) based on Jenk's natural break.As observed in Fig. 6e, f, the X4 region, especially the western area, has a higher concentration of VHS areas, but less historical landslide points, which coincides with the model's performance depicted in Fig. 5.

Balanced positive points distribution
As outlined in "Spatial correlation: train test split" section, there is an imbalanced distribution of positive points across X1, X2, X3, and X4.Since the model relies heavily on positive points, this imbalance can negatively impact the model's performance.To address this issue, the range of X1, X2, X3, and X4 is adjusted based on latitude to ensure that each set has a roughly equal number of positive points.Table 5 shows the corresponding positive point ratios after this adjustment.Compared to the original split in Table 3, the positive point ratio in X1 has significantly increased, while that in X3 has decreased following the adjustment.
After obtaining the new data split, four experiments were conducted using revised train and test datasets.These experiments involved using revised X1 to test and revised X2, X3, X4 to train, using revised X2 to test and revised X1, X3, X4 to train, using revised X3 to test and revised X1, X2, X4 to train, and using revised X4 to test and revised X1, X2, X3 to train.As demonstrated in "Results" section, RF generally performs better than KNN and decision tree models.Therefore, for each experiment, an RF model was trained both with and without the PU learning algorithm.Table 6 displays the performance of the models using the revised dataset, with the bold value highlighting the highest performance of each metric in every experiment.Consistent with the results in "Results" section, using the PU learning algorithm enhances the recall and AUC performance in all experiments, particularly the recall value.Moreover, the revised X4 exhibits a significant decrease in accuracy after using the PU learning algorithm, which also aligns with the findings in "Results" section.
Figure 7 compares accuracy and recall for the RF with the PU learning algorithm between the original and revised datasets.The accuracy remains roughly consistent with and without adjustment across all experiments.However, there is a decrease in recall for X1 and an increase in recall for X3, while the recall for X2 and X4 remains relatively unchanged.This shift in recall aligns with the fact that the number of positive points in X1 increases while the number of positive points in X3 decreases after the adjustment.Overall, the performance of the revised model demonstrates that the results in "Results" section were not significantly influenced by the imbalanced distribution of positive points in each set.

Human activities and HS and VHS area
Figure 8 shows Chamoli LSM overlaying with road map and residential land use map.From Fig. 6e and Fig. 8, it is evident that the VHS area aligns with the road map in Chamoli.Also, the Southwestern region of Chamoli exhibits heightened susceptibility to landslides, which aligns with the residential land use map.It is noteworthy that human activities may lead to greater susceptibility to landslides, while urbanization is also vulnerable to such incidents.To this end, two indices, namely Human Activity-Landslide Susceptibility Alignment (HA-LSA) Index and Landslide Threat Coverage (LTC) Index, are derived for road and residential land use, respectively.The   LSA Index shows the consistency between human activities and HS, VHS area, shedding light on how human activities may exacerbate landslide susceptibility, as shown in Eq. 2. The LTC Index shows the percentage of human activities under the threat of landslides, as shown in Eq. 3. Table 7 shows the results for road and residential land use. ( The overlap between human activities and HS, VHS area HS, VHS area (3) LTC index = The overlap between human activities and HS, VHS area human activities area The indices derived from road and residential land use data indicate a notable overlap between human activities and HS and VHS areas, with values ranging from 50 to 60% for HA-LSA Index.This observation underscores the potential of human activities, such as the construction of roads and residential areas, to increase the susceptibility of the region to landslides.However, the more concerning finding is that LTC Index reveals that over 50% of the roads and over 30% of the residential areas are currently under the threat of landslides.Considering the seriousness of this issue, it's crucial for local government officials to consider immediate action.

P > 1 in PU learning case
In this work, landslide susceptibility is classified into five degrees according to the probability derived from the RF model with the PU learning algorithm.However, for certain data points, particularly those in the HS and VHS areas, the probability may exceed 1.This is due to the assumption of the PU learning algorithm that the positive labeled data points are selected randomly from the entire pool of positive data points.If the assumption does not hold true, P(s = 1|x) could be larger than P(s = 1|y = 1) , leading to a P y = 1|x being greater than 1.However, it is hard for a real-world dataset to perfectly fit into this assumption.In the Chamoli district, there is a higher chance for the HS and VHS areas to have landslides (positive points) than the VLS, LS, and MS areas.Therefore, there are points with a probability larger than one in the HS, and VHS areas.Despite this limitation, the conclusion drawn from the model remains valid since the probability range still captures the different degrees of landslide susceptibility.In future studies, adjusting the PU learning algorithm to fit the landslide scenario better to obtain more accurate probability estimates may be beneficial.• The use of machine learning to generate a LSM is more accurately classified as a PU learning problem rather than a binary classification problem due to the nature of landslide data points.The absence of historical landslide events does not necessarily indicate a lack of susceptibility to landslides.Hence, it should be considered as unlabeled instead of being labeled as negative.• When generating LSM with machine learning methods, it is preferable to partition the dataset into training and testing sets based on spatial criteria rather than random selection due to the inherent spatial correlation in geospatial datasets.• It is noteworthy that in the context of PU learning, recall is a particularly significant metric.Compared with the original machine learning model, using PU learning can significantly increase the recall, indicating that the model can better identify the features of areas susceptible to landslides.• The RF model with the PU learning algorithm achieves the highest performance for the Chamoli landslide dataset, demonstrating an average recall of 56.67%, AUC of 69.37%, and accuracy of 79.28% when the dataset is spatially split.• The model achieves a higher recall but a much lower accuracy when using PU learning for X4 as the test set.This suggests that the southern part of Chamoli (X4) may be highly susceptible to landslides, including some areas where landslides have not occurred before, indicating a need for increased attention from local authorities.• The imbalance of positive points in the Chamoli landslide dataset does not significantly impact the performance of the model, indicating the robustness of the approach.• Roads and residential land use align with the HS and VHS area distributions, which could indicate that human activities cause a higher landslide susceptibility.Also, a large percentage of roads and residential land use are within HS and VHS areas, potentially threatening life and the local economy.• The landslide dataset does not perfectly fit into the PU learning assumption, which leads to some points having a probability larger than 1.However, this does not affect the quality of the LSM's, and the algorithm can be further improved to better reflect the true probability of landslides occurring in an area.• Ongoing studies are focused on introducing additional climate change stressors as factors in the PU learning method to further enhance the prediction method and will be presented in future publications.

Fig. 5
Fig. 5 Accuracy and recall of RF with and without P.U.learning algorithm

Fig. 7
Fig. 7 The comparison of accuracy and recall on the original dataset and the revised dataset.a Accuracy; b Recall

Fig. 8
Fig. 8 LSM of Chamoli with road and residential land use map

Table 2
The landslide causative factors in the dataset

Table 3
Positive point ratio in each set

Table 4
Performance of machine learning models with and without P.U.learning algorithms

Table 5
Positive point ratio of revised train test split

Table 6
The performance of RF model with and without PU learning algorithm on revised datasets

Table 7
HA-LSA Index and LTC Index for roads and residential land use