Techniques for evaluating the performance of landslide susceptibility models
Introduction
In recent decades, many efforts have been made to assess landslide susceptibility at the regional scale. Despite the large number of models produced with various methods, little attention has been devoted to the evaluation of their results.
Model evaluation is a multi-criteria problem (Davis and Goodrich, 1990). The acceptance of a model needs to fulfil at least three criteria: its adequacy (conceptual and mathematical) in describing the system behaviour, its robustness to small changes of the input data (i.e. data sensitivity), and its accuracy in predicting the observed data.
With physically-based models, the first evaluation criterion is aimed at assessing whether the model provides a physically acceptable explanation of the cause–effect relationships. Alternatively, justification is given for using simplifications of physical processes. With statistical or empirical models, the first kind of evaluation focuses on how well the variables used by the models can describe the processes. Due to the complexity of natural systems, this kind of evaluation involves a large component of judgement by experts with a deep knowledge of landslide processes (Carrara et al., 2003).
The robustness of the model can be evaluated by systematically analyzing the variation of model performance under small changes of the input parameters or their uncertainties. In the landslide susceptibility literature, only a few papers deal with robustness evaluation (Guzzetti et al., 2006, Melchiorre et al., 2006).
The most relevant criterion for quality evaluation is the assessment of model accuracy, which is performed by analyzing the agreement between the model results and the observed data. In the case of landslide susceptibility models, the observed data comprise the presence/absence of landslides within a certain terrain unit used for the analysis.
In the pioneering susceptibility models produced beginning in the 1980s, accuracy was evaluated through visual comparison of actual landslides with susceptibility classification (Brabb, 1984, Gökceoglu and Aksoy, 1996), or in terms of efficiency (or accuracy) (e.g., Carrara, 1983). In the last decade, different authors have proposed equivalent methods to evaluate the models in terms of landslide density within different susceptibility classes (“landslide density”, Montgomery and Dietrich, 1994, Ercanoglu and Gokceoglu, 2002, Crosta and Frattini, 2003; “degree of fit”, Irigaray et al., 1999, Baeza and Corominas, 2001, Fernández et al., 2003; “b/a ratio”, Lee and Min, 2001, Lee et al., 2003). Other authors chose to represent the success of the model by comparing the landslide density with the area of susceptible zone for different susceptibility levels (Zinck et al., 2001; “Success-Rate curves”, Chung and Fabbri, 2003, Remondo et al., 2003, Zêzere et al., 2004, Lee, 2005, Guzzetti et al., 2006). More recently, ROC curves have been adopted for model evaluation and comparison in the landslide literature (Yesilnacar and Topal, 2005, Begueria, 2006, Gorsevski et al., 2006, Frattini et al., 2008, Nefeslioglu et al., 2008).
When a landslide susceptibility model is applied in practice, the classification of land according to susceptibility has economic consequences. For instance, terrain that is classified as stable can be used without restrictions, increasing its economic value, whereas unstable terrain is restricted in use, and is consequently reduced in value. The misclassification of terrain in a model also produces economic costs. Hence, the performance of the models can be evaluated by assessing these costs, in order to select the best model, or the one that minimizes costs to society. This has typically been done in disciplines such as machine learning (Drummond and Holte, 2000, Provost and Fawcett, 2001) and biometrics (Pepe, 2003, Briggs and Ruppert, 2005).
None of the techniques used in the literature to assess the accuracy of landslide susceptibility models accounts for misclassification costs. This limitation is significant for landslide susceptibility analysis, because the cost of a misclassification differs greatly with the error type. A Type II error (false negative) means that a terrain unit with landslides is classified as stable, and consequently used without restrictions. The false negative misclassification cost, c(−|+), is equal to the loss of elements at risk that can be impacted by landslides in these units. This cost depends on the economic value and the vulnerability of the elements at risk (e.g., lives, buildings and lifelines), and on the temporal probability and intensity of landslides. A Type I error (false positive) means that a unit without landslides is classified as unstable, and therefore limited in its use and economic development. Hence, the false positive misclassification cost, c(+|−), amounts to the loss of economic value of these terrain units. This cost is different for each terrain unit as a function of its environmental (slope gradient, altitude, aspect, distance from the main valley, etc.) and socio-economic (distance from an urban/industrial area, road, etc.) characteristics. With landslide susceptibility models, costs related to Type II errors are normally much larger than those related to Type I errors. For example, siting a public facility, such as a school building, in a terrain unit that is incorrectly identified as stable (a Type II error) could lead to very large social and economic costs in the event of a landslide.
Accounting for misclassification costs in the evaluation of model performance is possible with ROC curves by using an additional procedure (Provost and Fawcett, 1997), but the results are difficult to visualize and assess. In this paper, a simple technique (Cost curves, Drummond and Holte, 2000) is adopted to explicitly account for these costs.
In the following, different techniques for the evaluation and comparison of landslide susceptibility model performance (accuracy statistics, ROC curves, Success-Rate curves, and Cost curves) are presented and tested on shallow landslide/debris-flow susceptibility models.
The aim of the paper is to demonstrate the applicability and capability of these techniques, by discussing their advantages and disadvantages. In order to simplify the presentation and to keep the focus on model evaluation, landslide susceptibility models already presented in the literature (Carrara et al., 2008) are used.
As previously mentioned, accuracy is assessed by analyzing the agreement between the model results and the observed data. Since the observed data comprise the presence/absence of landslides within a certain terrain unit, a simple method to assess accuracy is to compare these data with a binary classification of susceptibility into stable and unstable units. This classification requires a cutoff value of susceptibility that divides stable terrain (susceptibility below the cutoff) from unstable terrain (susceptibility above the cutoff).
The comparison of observed data and model results reclassified into two classes is represented through contingency tables (Table 1). Accuracy statistics assess model performance by combining correctly and incorrectly classified positives (i.e., unstable areas) and negatives (i.e., stable areas).
The first statistic, introduced in the field of weather forecasting (Finley, 1884), is the Efficiency (Accuracy or Percent correct, Table 2), which measures the percentage of terrain units that are correctly classified by the model. However, Gilbert (1884) showed that the Efficiency statistic is unreliable because it is heavily influenced by the most common class, usually "stable slopes" in the case of landslide susceptibility models, and it is not equitable. A statistic is equitable if it gives the same score for different types of unskilled classifications. In other words, classifications by random chance, "always positive" and "always negative", should produce the same (bad) score (Murphy, 1996).
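The lack of equitability can be illustrated with a minimal numerical sketch (the counts below are hypothetical, not taken from the case study): when landslides are rare, the trivial "always stable" classification still obtains a high Efficiency score.

```python
def efficiency(tp, fp, fn, tn):
    """Percent correct: fraction of terrain units classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical study area: 5% of 1000 terrain units are unstable,
# and the model classifies every unit as "stable" (no skill at all).
tp, fp, fn, tn = 0, 0, 50, 950
print(efficiency(tp, fp, fn, tn))  # 0.95: a high score despite zero skill
```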
True Positive rate (TP) and the False Positive rate (FP) are insufficient performance statistics, because they ignore false positives and false negatives, respectively. They are not equitable, and they are useful only when used in conjunction (e.g., ROC curves).
The Threat score (Gilbert, 1884) measures the fraction of observed and/or classified events that were correctly predicted. Because it penalizes both false negatives and false positives, it does not distinguish the source of classification error. Moreover, it depends on the frequency of events (poorer scores for rarer events) since some true positives can occur purely due to random chance.
The Equitable threat score (Gilbert's skill score; Gilbert, 1884, Schaefer, 1990) measures the fraction of observed and/or classified events that were correctly predicted, adjusted for true positives associated with random chance. As above, it does not distinguish the source of classification error.
Peirce's skill score (True skill statistic; Peirce, 1884, Hanssen and Kuipers, 1965) uses all the elements of the contingency table and does not depend on event frequency. This score may be more useful for more frequent events (Mason, 2003).
Heidke's skill score (Cohen's kappa; Heidke, 1926) measures the fraction of correct classifications after eliminating those classifications which would be correct due purely to random chance.
The odds ratio (Stephenson, 2000) measures the ratio of the odds of true prediction to the odds of false prediction. This statistic takes prior probabilities into account and gives better scores for rare events, but cannot be used if any of the cells in the contingency table is equal to 0.
Finally, the odds ratio skill score (Yule's Q; Yule, 1900) is closely related to the odds ratio, but conveniently ranges between −1 and 1.
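For convenience, the statistics reviewed above can be computed directly from the four cells of the contingency table. The following sketch (function and dictionary names are ours; the formulas follow the standard forecast-verification definitions cited in the text) takes TP, FP, FN and TN as raw counts:

```python
def skill_scores(tp, fp, fn, tn):
    """Cutoff-dependent accuracy statistics from a 2x2 contingency table."""
    n = tp + fp + fn + tn
    # True positives expected purely by random chance, given the marginals
    tp_rand = (tp + fp) * (tp + fn) / n
    scores = {
        "efficiency": (tp + tn) / n,                  # percent correct
        "threat": tp / (tp + fp + fn),                # Threat score
        "equitable_threat": (tp - tp_rand) / (tp + fp + fn - tp_rand),
        "peirce": tp / (tp + fn) - fp / (fp + tn),    # TP rate - FP rate
        "heidke": 2 * (tp * tn - fp * fn)
                  / ((tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)),
        "yule_q": (tp * tn - fp * fn) / (tp * tn + fp * fn),
    }
    if fp > 0 and fn > 0:  # the odds ratio is undefined if any cell is zero
        scores["odds_ratio"] = (tp * tn) / (fp * fn)
    return scores
```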
As mentioned, accuracy statistics require splitting the classified objects into a few classes by defining specific values of the susceptibility index, called cutoff values. For statistical models, such as discriminant analysis (e.g., Carrara, 1983) or logistic regression analysis (e.g., Chung et al., 1995, Atkinson and Massari, 1998, Dai and Lee, 2002, Ohlmacher and Davis, 2003, Nefeslioglu et al., 2008), a statistically significant probability cutoff exists, equal to 0.5. When the groups of stable and unstable terrain units are equal in size and their distributions are close to normal, this value maximizes the number of correctly predicted stable and unstable units. Under different conditions, or for other types of landslide susceptibility models, such as physically-based (Van Westen and Terlien, 1996, Gökceoglu and Aksoy, 1996, Crosta and Frattini, 2003, Frattini et al., 2004, Godt et al., 2008), heuristic (e.g., Barredo et al., 2000), artificial neural network (Lee et al., 2003, Ermini et al., 2005, Nefeslioglu et al., 2008), and fuzzy logic models (Binaghi et al., 1998, Ercanoglu and Gokceoglu, 2004), the choice of cutoff values to define susceptibility classes is arbitrary, unless a cost criterion is adopted (Provost and Fawcett, 1997). A first solution to this limitation is to evaluate the performance of the models over a large range of cutoff values by using cutoff-independent performance criteria. Another is to find the optimal cutoff by minimizing the costs of the models.
The most commonly-used cutoff-independent performance techniques for landslide susceptibility models are the Receiver Operating Characteristic (ROC) curves and Success-Rate curves.
The ROC analysis was developed during the Second World War to assess the performance of radar receivers in detecting targets. It has since been adopted in different scientific fields, such as medical diagnostic testing (Goodenough et al., 1974, Hanley and McNeil, 1982, Swets, 1988) and machine learning (Egan, 1975, Adams and Hand, 1999, Provost and Fawcett, 2001). The Area Under the ROC Curve (AUC) can be used as a metric to assess the overall quality of a model (Hanley and McNeil, 1982): the larger the area, the better the performance of the model over the whole range of possible cutoffs. The points on the ROC curve represent (FP, TP) pairs derived from the different contingency tables created by applying different cutoffs; points closer to the upper-right corner correspond to lower cutoff values (Fig. 1). An ROC curve is better than another if it lies closer to the upper-left corner. The range of cutoff values for which the ROC curve is better than a trivial model (i.e., a model that classifies objects by chance, represented in ROC space by the straight line joining the lower-left and upper-right corners) is defined as the operating range.
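The construction of a ROC curve and its AUC can be sketched as follows (an illustrative implementation, not the procedure of any cited work; it assumes binary labels with 1 = landslide, at least one positive and one negative unit, and breaks score ties in sorting order):

```python
def roc_points(scores, labels):
    """(FP rate, TP rate) pairs obtained by sweeping the cutoff from the
    highest to the lowest susceptibility score."""
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    pts = [(0.0, 0.0)]  # cutoff above every score: nothing classified unstable
    for _, y in sorted(zip(scores, labels), reverse=True):
        if y:
            tp += 1
        else:
            fp += 1
        pts.append((fp / neg, tp / pos))
    return pts

def auc(pts):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```

A perfect ranking yields AUC = 1, while a model that ranks units by chance plots along the diagonal, with AUC = 0.5.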
Success-Rate curves (Zinck et al., 2001, Chung and Fabbri, 2003) represent the percentage of correctly classified objects (i.e., terrain units) on the y-axis, and the percentage of area classified as positive (i.e., unstable) on the x-axis. In the landslide literature, the y-axis is normally considered as the number of landslides, or the percentage of landslide area, correctly classified. In the case of grid-cell units where landslides correspond to single grid cells and all the terrain units have the same area, the y-axis corresponds to TP, analogous with the ROC space, and the x-axis corresponds to the number of units classified as positive.
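For the grid-cell case described above, a Success-Rate curve can be sketched as follows (names are illustrative; all cells are assumed to have equal area, and landslide presence is expressed as the landslide area within each cell):

```python
def success_rate_curve(susceptibility, landslide_area):
    """Rank cells by decreasing susceptibility; x = cumulative fraction of
    the study area classified as unstable, y = cumulative fraction of the
    total landslide area captured."""
    order = sorted(range(len(susceptibility)),
                   key=lambda i: susceptibility[i], reverse=True)
    total = sum(landslide_area) or 1.0  # guard against a landslide-free input
    n = len(order)
    xs, ys = [], []
    captured = 0.0
    for rank, i in enumerate(order, start=1):
        captured += landslide_area[i]
        xs.append(rank / n)
        ys.append(captured / total)
    return xs, ys
```

Plotting ys against xs gives the curve; a steep initial rise indicates that most of the landslide area falls within the cells ranked most susceptible.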
The total cost of misclassification of a model depends on (Drummond and Holte, 2000): the percentage of terrain units that are incorrectly classified, the a-priori probability of having a landslide in the area, and the costs of misclassification of the different error types.
In order to explicitly represent costs in the evaluation of model performance, Drummond and Holte (2006) proposed the Cost curve representation. The Cost curve represents the Normalized Expected cost as a function of a Probability-Cost function (Fig. 1).
The Normalized Expected cost, NE(C), is calculated as:

NE(C) = (1 − TP − FP) · PC(+) + FP

where the expected cost, FN · p(+) · c(−|+) + FP · p(−) · c(+|−) (with FN = 1 − TP), is normalized by the maximum expected cost, which occurs when all cases are incorrectly classified, i.e. when FP and FN are both one. The maximum normalized cost is 1 and the minimum is 0.
The Probability-Cost function, PC(+), is:

PC(+) = p(+) · c(−|+) / [p(+) · c(−|+) + p(−) · c(+|−)]

which represents the normalized version of p(+)c(−|+), so that PC(+) ranges from 0 to 1. When misclassification costs are equal, PC(+) = p(+). In general, PC(+) = 0 occurs when cost is due only to negative cases, i.e., positive cases never occur (p(+) = 0) or their misclassification cost, c(−|+), is null. PC(+) = 1 corresponds to the other extreme, i.e., p(−) = 0 or c(+|−) = 0.
A single classification model, which would be a single point (FP, TP) in ROC space, is a straight line in the Cost curve representation (Fig. 1). A set of points in ROC space, the basis for an ROC curve, is a set of Cost lines, one for each ROC point.
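Following Drummond and Holte, each (FP, TP) classifier traces the cost line NE(C) = (1 − TP − FP) · PC(+) + FP, and the Cost curve of a model is the lower envelope of the cost lines of all its ROC points together with the two trivial classifiers. A minimal sketch (function names are ours):

```python
def cost_line(fp, tp, pc_plus):
    """Normalized expected cost of a single (FP, TP) classifier at a given
    value of the probability-cost function PC(+)."""
    return (1 - tp - fp) * pc_plus + fp

def cost_curve(points, pc_plus):
    """Lower envelope over the model's ROC points plus the trivial
    'always stable' (0, 0) and 'always unstable' (1, 1) classifiers."""
    candidates = list(points) + [(0.0, 0.0), (1.0, 1.0)]
    return min(cost_line(fp, tp, pc_plus) for fp, tp in candidates)
```

At PC(+) = 0 or PC(+) = 1 the trivial classifiers cost nothing, so a model is useful only where its envelope falls below the lines of the trivial classifiers, i.e. within its operating range.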
Case study
To analyse the advantages and disadvantages of the different evaluation techniques, debris-flow/shallow landslide susceptibility models recently presented in the literature (Carrara et al., 2008) are used. In the following, some fundamental data and information concerning the study area and the models are reported; a more detailed description can be found in Carrara et al. (2008).
Methods
Here, the performance of the susceptibility models is evaluated by using accuracy statistics, ROC curves, Success-Rate curves, and Cost curves.
Six commonly-used cutoff-sensitive accuracy statistics are applied (Table 2): Accuracy, Threat score, Gilbert's skill score, Peirce's skill score, Heidke's skill score, and Yule's Q. The following cutoffs are defined:
- (statistical models) probability = 0.5
- (physically-based shallow landslide model) q/T ratio = 0.01
In order to compare the performance of the
Results
Accuracy statistics show similar results for the evaluation of the different models (Fig. 4, Table 5). The coarse slope-unit discriminant model (cSU_DIS) outperforms the others with all the statistics, whereas the physically-based model is always the worst. Overall, Accuracy and the Threat score show smaller differences among the models, thus making the choice of the best model more difficult. The other statistics are practically equivalent.
A comparison of grid-cell discriminant models built
Accuracy statistics
Traditional cutoff-dependent approaches reveal slight differences among the models (Fig. 4). Only the physically-based model shows a significantly lower performance. This result indicates that the accuracy statistics have a scarce capability to discriminate the performance of the models, at least for the present case study. In addition, the application of each statistic is reliable only under specific conditions (e.g., rare events or frequent events) that should be evaluated case by case, in
Conclusions
In this paper, different approaches for evaluating landslide susceptibility models and comparing their performance are analysed. These approaches are tested on debris-flow susceptibility models developed by the authors (Carrara et al., 2008) using different methods and different terrain units.
From the results of the analysis it is possible to conclude that:
- cutoff-dependent accuracy statistics (Accuracy, Threat score, Gilbert's skill score, Peirce's skill
Acknowledgments
Thanks to Paolo Campedel from the Provincia Autonoma di Trento for useful discussions on the models. Also thanks to Jonathan Godt and Candan Gokceoglu for their helpful reviews.
References (64)
- Comparing classifiers when the misallocation costs are uncertain. Pattern Recognition (1999)
- Comparing heuristic landslide hazard assessment techniques using GIS in the Tirajana basin, Gran Canaria Island, Spain. JAG (2000)
- Comparing models of debris-flow susceptibility in the alpine environment. Geomorphology (2008)
- Landslide characteristics and slope instability modelling using GIS, Lantau Island, Hong Kong. Geomorphology (2002)
- Use of fuzzy relations to produce landslide susceptibility map of a landslide prone area (West Black Sea Region, Turkey). Engineering Geology (2004)
- Artificial neural networks applied to landslide susceptibility assessment. Geomorphology (2005)
- Shallow landslides in pyroclastic soil: a distributed modeling approach for hazard assessment. Engineering Geology (2004)
- Assessment of rockfall susceptibility by integrating statistical and physically-based approaches. Geomorphology (2008)
- Requirements for integrating transient deterministic shallow landslide models with GIS for hazard and susceptibility assessments. Engineering Geology (2008)
- Landslide susceptibility mapping of the slopes in the residual soils of the Mengen region (Turkey) by deterministic stability analyses and image processing techniques. Engineering Geology (1996)
- Estimating the quality of landslide susceptibility models. Geomorphology
- Artificial neural networks and cluster analysis in landslide susceptibility zonation. Geomorphology
- An assessment on the use of logistic regression and artificial neural networks with different sampling strategies for the preparation of landslide susceptibility maps. Engineering Geology
- Using multiple logistic regression and GIS technology to predict landslide hazard in northeast Kansas, USA. Engineering Geology
- Prediction of landslide susceptibility using rare events logistic regression: a case-study in the Flemish Ardennes (Belgium). Geomorphology
- Landslide susceptibility mapping: a comparison of logistic regression and neural networks methods in a medium scale study, Hendek region (Turkey). Engineering Geology
- Mapping and modelling mass movements and gullies in mountainous areas using remote sensing and GIS techniques. International Journal of Applied Earth Observation and Geoinformation
- Impact of mapping errors on the reliability of landslide hazard models. Natural Hazards and Earth System Sciences
- Generalised linear modelling of susceptibility to landsliding in the Central Apennines, Italy. Computers & Geosciences
- Assessment of shallow landslide susceptibility by means of multivariate statistical techniques. Earth Surface Processes and Landforms
- Validation and evaluation of predictive models in hazard assessment and risk management. Natural Hazards
- Innovative approaches to landslide hazard mapping
- Slope instability zonation: a comparison between certainty factor and fuzzy Dempster–Shafer approaches. Natural Hazards
- Assessing the skill of yes/no predictions. Biometrics
- Multivariate models for landslide hazard evaluation. Mathematical Geology
- Geomorphological and historical data in assessing landslide hazard. Earth Surface Processes and Landforms
- Validation of spatial prediction models for landslide hazard mapping. Natural Hazards
- Multivariate regression analysis for landslide hazard zonation
- Distributed modeling of shallow landslides triggered by intense rainfall. Natural Hazards and Earth System Sciences
- Explicitly representing expected cost: an alternative to ROC representation
- Cost curves: an improved method for visualizing classifier performance. Machine Learning