Assessing the quality of landslide susceptibility maps – case study Lower Austria

Landslide susceptibility maps are helpful tools to identify areas potentially prone to future landslide occurrence. As more and more national and provincial authorities demand for these maps to be computed and implemented in spatial planning strategies, several aspects of the quality of the landslide susceptibility model and the resulting classified map are of high interest. In this study of landslides in Lower Austria, we focus on the model form uncertainty to assess the quality of a flexible statistical modelling technique, the generalized additive model (GAM). The study area (15 850 km2) is divided into 16 modelling domains based on lithology classes. A model representing the entire study area is constructed by combining these models. The performances of the models are assessed using repeated k-fol cross-validation with spatial and random subsampling. This reflects the variability of performance estimates arising from sampling variation. Measures of spatial transferability and thematic consistency are applied to empirically assess model quality. We also analyse and visualize the implications of spatially varying prediction uncertainties regarding the susceptibility map classes by taking into account the confidence intervals of model predictions. The 95 % confidence limits fall within the same susceptibility class in 85 % of the study area. Overall, this study contributes to advancing open communication and assessment of model quality related to statistical landslide susceptibility models.


Introduction
Landslides occur in mountainous as well as in hilly or coastal regions worldwide and have often been an underestimated hazard.In general, people and governing authorities are not sufficiently aware of the potential locations and consequences of landslides (Hervás, 2003).However, in Austria authorities and residents have become more aware of landslide hazards because of recent major landslide events, which affected many residents, caused significant damage to infrastructure and private properties and were reported and discussed on the local media (Damm et al., 2010).These include events in August 2005 in Gasen and Haslau and incidents in 2009 in the district of Feldbach and in the province of Lower Austria, where about 4000 landslides occurred in total (Schwarz and Tilch, 2008;Hornich and Adelwöhrer, 2010; Abteilung Feuerwehr und Zivilschutz, Amt der NÖ Landesregierung, 2010;BMLFUW, 2010).While in the past in Austria practices of post-disaster recovery and reconstruction were applied, recently more and more national and provincial authorities have been demanding pre-disaster mitigation tools, which help to prevent future damage caused by natural hazards such as landslides.Aiming at avoidance of the hazard, zonation plans can be facilitated in prospective land use planning to prevent development in undesirable locations or undesirable types of development (Schwab et al., 2005).In this context landslide susceptibility maps have proven to be a powerful tool, as they give coherent information on the spatial probability on where landslides, or landslide scarps, might occur (Brabb, 1984;Glade et al., 2005;Guzzetti et al., 2000Guzzetti et al., , 2006;;Varnes, 1984).The term landslide susceptibility is hereby defined by Brabb (1984) as the likelihood of a landslide occurring in an area with given local terrain attributes.The frequency or time of occurrence of the future landslides is not assessed within susceptibility zoning (Fell et al., 2008).
This study was embedded in a project preparing landslide susceptibility maps for the province of Lower Austria (15 850 km 2 ) with an aimed output map scale of 1 : 25 000 (Glade et al., 2012).Generally, landslide hazard zonation for determining slope instability due to landslides is based on simple landslide inventories or heuristic, statistical (including machine learning) or deterministic approaches (Soeters and Van Westen, 1996).Although these approaches are dealing with landslide hazards (spatial and temporal occurrence probability of landslides) the same categorization can be applied for susceptibility maps.Statistical landslide susceptibility models are particularly useful for modelling large areas on a medium scale (1 : 25 000-1 : 500 000) to get an overview of which slopes or slope sections might be prone to landslides in future (Fell et al., 2008).According to the resulting susceptibility maps, hot spots can be identified where more detailed analysis of the slope stability should follow (Van Westen et al., 1997).For this study a statistical approach was regarded as suitable considering the aimed output map scale, the size of the study area, the available data and the limited availability of additional resources for preparing input data (such as a more detailed landslide inventory or soil parameters).
The central assumption of landslide susceptibility modelling is based on James Hutton's (1726-1797) concept of uniformitarianism, "the past and the present are keys to the future" (Orme, 2002).Thus, statistical estimation of the possible future location of landslides is usually based on the conditions (e.g.local terrain attributes) of past landslides (Varnes et al., 1984;Carrara et al., 1993).Presently, several statistical and machine learning techniques are available for application to landslide susceptibility modelling.The most common are logistic regression (e.g.Atkinson et al., 1998;Ayalew and Yamagishi, 2005;Van Den Eeckhaut et al., 2006), bivariate models such as weights of evidence (e.g.Neuhäuser and Terhorst, 2007;Schicker and Moon, 2012) and machine learning techniques such as artificial neural networks (e.g.Nefeslioglu et al., 2008;Pradhan and Lee, 2009) and support vector machines (e.g.Marjanović et al., 2011;Pradhan, 2013).Detailed reviews and comparison of different models can be found, among others, in Guzzetti et al. (1999), Dai et al. (2002), Brenning (2005), Glade and Crozier (2005), Guzzetti (2005), Rossi et al. (2010) and Vorpahl et al. (2012).However, the more flexible machine learning algorithms often tend to overfit (Sect.5.2), while linear models might not be flexible enough to portray possible nonlinearity in the relationship between landslide occurrence and the explanatory variables (Brenning, 2005;Nefeslioglu et al., 2008;Goetz et al., 2011).In this study the nonlinear generalized additive model (GAM, Hastie and Tibshirani, 1990) was selected for the susceptibility modelling.The GAM is widely used in ecological modelling (Guisan et al., 2002) and has recently been introduced to landslide susceptibility modelling (Brenning, 2008;Jia et al., 2008;Park and Chi, 2008;Goetz et al., 2011;Vorpahl et al., 2012).It is an extension of the generalized linear models (GLMs, such as logistic regression) that can model nonlinear relationships between response and explanatory variables while being less prone to overfitting in geomorphological modelling than the more flexible methods (Hastie and Tibshirani, 1990;Brenning, 2009;Goetz et al., 2011).
Since the application of the landslide susceptibility map resulting from this study is planned for implementation by municipal authorities, our main focus is to identify different aspects of quality (depending on input data, model performance and provided analysis of prediction uncertainties).Accordingly, the purpose of this study is to review the characteristics that describe quality, to analyse the (statistical) model quality and to assess and visualize the uncertainty of the prediction.
Likewise, this paper presents a review on considerations on quality in statistical landslide susceptibility maps (Sect.2); background information on the study area and the landslides (Sect.3); a description of the data sources and the input data preparation (Sect.4); a detailed section on the methods used assessing the model form uncertainty and prediction uncertainty within a topographically heterogeneous study area (Sect.5) and a thorough presentation of the results (Sect.6) followed by a critical discussion (Sect.7).The paper ends with a short conclusion highlighting the main findings of this study (Sect.8).

Considerations on the quality of statistical landslide susceptibility models and resultant maps
As it comes to the application of the landslide susceptibility maps, which is associated with constructing a decisive reality (Egner and Pott, 2010) for the municipality and the land use planners, a detailed and transparent assessment of its quality is necessary.This very broad term of quality can be interpreted in several ways and in several stages of the process of preparing a landslide susceptibility map as described, amongst others, by Carrara (1993), Carrara et al. (1995), Ardizzone et al. (2002), and Guzzetti et al. (2006).In general, good quality refers to input data, a model or result (e.g.susceptibility map) which includes relatively low uncertainties.
Landslide susceptibility modelling is inherently riddled with uncertainties.In engineering and risk assessment a basic distinction between aleatory and epistemic uncertainties is very common (Hoffman and Hammonds, 1994;Oberkampf et al., 2004;Roy and Oberkampf, 2011;Hill et al., 2013).Many processes and conditions leading to landslide occurrence are known and mappable, known but not collectable, or are simply unknown (Carrara et al., 1999).The imperfect understanding of the complexity of this hazardous phenomenon arising from the missing knowledge (Ardizzone et al., 2002;Kunz et al., 2011) is generally referred to as epistemic uncertainty (Hoffman and Hammonds, 1994;Hora, 1996;Oberkampf et al., 2004).Epistemic uncertainty is often divided into input uncertainty, parametric uncertainty and structural or model form uncertainty (Roy and Oberkampf, 2011;Rougier and Beven, 2013).Naturally, these uncertainties can be reduced by improving knowledge of the subject (Spiegelhalter and Riesch, 2011).Unavoidable uncertainty arising from the natural variability and/or randomness of the natural hazard process can be distinguished as aleatory uncertainty (Rougier and Beven, 2013;Rougier, 2013).Although these distinctions can be applied, the organization of uncertainties is dependent on the study objectives (Hora, 1996;Roy and Oberkampf, 2011).
In the following, we will discuss quality by addressing uncertainty issues associated with statistical modelling regarding input data (parametric uncertainty), model performance (model form uncertainty) and the final susceptibility map.Details on other types of uncertainties which might be more important for other types of susceptibility assessment (e.g.deterministic) or on the propagation of uncertainties can be found e.g. in Hoffman and Hammonds (1994), Draper (1995), Oberkampf et al. (2004), Karam (2005), Helton et al. (2010) andRougier (2013).

Quality of input data
Achieving a good quality of the final landslide susceptibility map starts with the quality of the input data set for the modelling.Besides the geomorphological relevance, the spatial resolution and accuracy of the geo-environmental as well as the landslide inventory data is important (van Westen et al., 2008).During the preparation of all input data (done by mapping or measuring) subjectivity, the experience of the mapper, measuring errors or imprecision in computer processes are sources of parametric uncertainty (Mosleh, 1986;Ardizzone et al., 2002;Elith et al., 2002;van Westen et al., 2005van Westen et al., , 2008)).Furthermore, possible incompleteness, not only in a sense of full spatial coverage but also of general availability of important thematic information on (predisposing or preparatory) factors determining the landslide susceptibility influences the quality and makes the susceptibility assessment more difficult (Carrara et al., 1999;Ardizzone et al., 2002;Van Westen et al., 2005).This source of epistemic uncertainty can rarely be reduced and other assessments of the uncertainty have to be selected.The quantitative assessment of parametric uncertainty and incorporation in subsequent hazard and risk assessment is still a challenge (Ardizzone et al., 2002;Oberkampf et al., 2004;Roy and Oberkampf, 2011).Therefore, estimation on the completeness of the landslide inventory and details on the collecting and mapping method giving information on the accuracy and the location of the landslide point/line/polygon (main scarp or entire land-slide body) are very important.This influences the further usage of the input data set and the feasible interpretation of the susceptibility maps substantially.

Quality of statistical model form
Every model is a simplification of reality.Therefore, while modelling landslide susceptibility, it should be expected that some discrepancies regarding the ability to explain all the processes involved in the phenomenon would naturally occur (Rougier and Beven, 2013).Every modelling result is highly dependent on the input data, the assumptions made to set up the model design and the selection and performance of an appropriate model.These assumptions also define the limitations of the model, its result and the allowed interpretation of it.Quantifying this part of the model form uncertainty is challenging.However, model validation procedures have been developed and are commonly applied to assess the model performance and therefore the model form uncertainty (Roy and Oberkampf, 2011).
In single hold-out methods the data set is split in one single training and test sample.The training sample is used to fit the model and the test sample is used to determine the model performance.This results in a single estimate of the performance measure (e.g. one single AUROC value) without providing a measure of precision of the estimator.The estimate depends on the (random) sample used for modelling the susceptibility and testing the model's performance, which may itself have "peculiar" random characteristics that would be different for a different test set.Repeated k-fold cross-validation solves this problem by using, one after another, different subsets or partitions of the data set as test and training sets, thus effectively using the entire data set for performance estimation (Brenning, 2012a, b).In addition, repeating this procedure for different data partitioning reduces sampling variability and allows for the determination of the precision of the performance estimator (see Sect. 5.3 for details).
Depending on the partitioning method (randomly or spatially) to obtain the folds, these statistical estimation methods can further provide a means for assessing the non-spatial transferability of a model onto a different, independent random sample, and the spatial transferability into a spatially separate area.Spatial transferability refers to the capability of the model to generalize empirical relationships learned on a training data set, and to transfer these relationships to (usually adjacent) regions without major loss in predictive performance (Brenning, 2005;Von Ruette et al., 2011).This (non-) spatial transferability describes the model form uncertainty besides other performance measures such as the AU-ROC value.
In addition, aleatory uncertainties are of importance in the modelling stage, because even if all parameters were known perfectly, some deviance (or discrepancy) of the model results in nature might still occur due to the natural variability of the process (Elith et al., 2002, Rougier andBeven, 2013).

Quality of final susceptibility mapvisualizing prediction uncertainties
One important model uncertainty which can be visualized in a map is the prediction uncertainty arising from using a statistical model.The output of a statistical model for spatial modelling is usually comprised of a single probability value for each unit of the prediction surface (grid cells, slopes or terrain units).These individual probability values represent an estimated conditional mean value of the predicted probability (Hosmer and Lemeshow, 2000).Therefore, there is a prediction uncertainty or possible range, as determined by the standard error of the predicted probabilities, of probability estimates for each unit of the susceptibility map (Guzzetti et al., 2006).The analysis of the standard error of the predicted probabilities and the prediction uncertainty analysis are independent of any class thresholds.Estimating the standard error was analysed and presented in previous research (e.g.Guzzetti et al., 2006;Van den Eeckhaut et al., 2009;Rossi et al., 2010;Sterlacchini et al., 2011).However, the way it affects the appearance of the classified landslide susceptibility map is of interest for planning purposes as considering these prediction uncertainties might result in overlaps of different susceptibility classes.Naturally, the amount of overlapping raster cells and therefore the uncertainty in the final classified map might be dependent on the number of classes and the selected thresholds for the classification.

Lower Austria and landslide occurrence
The Austrian province of Lower Austria covers a total area of 19 177 km 2 , 15 850 km 2 of which constitute our study area (Fig. 1).
The variation of material and topographic characteristics across Lower Austria can be attributed to the diverse lithology (Petschko et al., 2012).The predominant material types include gravel, sand in the alluvial deposits and fluvial terraces; loess and loam; marl with a high amount of silt in the Molasse Zone and Schlier Zone; sandstone and marlstone of the Rheno-danubian Flysch Zone and Mélange Zone; limestone, dolostone and the igneous rocks of the Austroalpine Unit; and gneisses and granites in the Bohemian Massif (Fig. 1, Table 1).
The median slope angle was used as a simple proxy of the topographic characteristics to highlight the diverse conditions across Lower Austria.The median slope angle ranges from a minimum of < 1 • (alluvial deposits) to a maximum of 27 • (Austroalpine Unit with dolostone; Table 1).
Land use patterns and variation on precipitation also contribute to the diverse geography of Lower Austria.The land use (40 % agricultural land, 40 % forest, 13 % grassland) is unevenly distributed across the province.Areas south of the Danube and the relatively flat northeast region of the province are predominantly used as agricultural land.The steeper slopes in the south and southwest are mainly forestcovered (coniferous and deciduous forest; Eder et al., 2011).The spatial distribution of the mean annual precipitation rate shows a gradient between the northeast (400-500 mm) and the southwest regions (1600-1700 mm) (Hydrographischer Dienst des Landes Niederösterreich, 2011).
Many different landslide types including rockfalls, earth slides, debris slides and debris flows, can be found in Lower Austria (classified according to Cruden and Varnes, 1996;and Dikau et al., 1996).While rockfall and debris flows typically occur in the southern lithology (Austroalpine Unit), earth-and debris slides occur all over the province with different densities (Table 1), sizes and depths.Only earth-and debris slides which were mapped with one point in the main scarp of each slide were within the scope of this study; recent examples of these landslide types are shown in Fig. 2.
The earth-and debris slides in Lower Austria are mainly triggered by heavy rainfall or rapid snowmelt events (Schwenk, 1992;Schweigl and Hervás, 2009).The Mélange lithology has the maximum landslide density of 5.5 landslides km −2 .The Rheno-danubian Flysch Zone and the Zone of Molasse with Schlier also have high landslide densities (> 4 landslides km −2 ).The lithological properties and topographic characteristics of these units are highly susceptible to earth-and debris slides (Gottschling, 2006;Wessely, 2006).These densities were calculated from the landslide inventory described in Sect.4.1.

Data sources and preparation
The susceptibility modelling in this study was based on an airborne laser scanning digital terrain model (LiDAR-DTM) with a resolution of 1 m × 1 m, acquired between 2006 and   2009.LiDAR-DTMs are very useful for mapping landslides and representing the morphology of the study area, even in forest areas (Van den Eeckhaut et al., 2007;Razak et al., 2011).
Considering the detailed resolution we have given in the LiDAR-DTM and the rather coarse resolution given in the soil data (50 m × 50 m) we had to find a compromise for the analysis and output of the susceptibility map.We decided an output resolution of the maps of 10 m × 10 m would be suitable to produce 1 : 25 000 scale maps; we are still able to take advantage of the high resolution of the topographic data but also avoid wrong signals in the modelling from very local variations in the derivatives of the high-resolution DTM (Van Westen et al., 2008).The resampling to a 10 m × 10 m resolution was done by bilinear interpolation.Nevertheless, we are aware that this artificial improvement of the soil data resolution does not increase the data accuracy.

Response variable -landslide inventory
The landslide inventory originates from previous research where earth-and debris slides were mapped from a highresolution LiDAR-DTM (Petschko et al., 2010;Glade et al., 2012, Petschko et al., 2013a).This inventory consisted of points located in each landslide's main scarp and includes 12 889 earth-and debris slides (Table 1).However, no attributes (landslide type, estimated age, land use, time/date of occurrence etc.) were assigned to each point.
Considering the large study area and the aimed result of a susceptibility map with three classes, a point inventory was preferred over a polygon inventory (Petschko et al., 2013a, c).The decision to map one point per landslide was aimed at increasing mapping effectiveness, avoiding uncertainty related to mapping landslide polygon boundaries, reducing spatial autocorrelation of the case samples (e.g.landslides) and providing equal treatment of small and large landslide samples (Carrara, 1993;Atkinson and Massari, 1998;Van den Eeckhaut et al., 2006;Heckmann et al., 2013;Petschko et al., 2013a).A comparison of modelling landslide susceptibility with sampling either a single point for the main scarp or a random point anywhere in a landslide polygon was conducted by Petschko et al. (2013c).They observed only small differences between the predictive model performances and classified susceptibility maps for the two landslide sampling schemes.
The landslide mapping focussed on distinct and easily detectable morphologies that remain visible after the occurrence of a landslide (McCalpin, 1984).Very old landslides and landslides that were not visible in the available imagery, were not included in the inventory; the full slide with scarp, transportation and accumulation area had to be discernible in the LiDAR-DTM hillshade to be included in the inventory (Petschko et al., 2013a).In areas where human activity on land use was very high, such as agricultural areas, the ability to identify landslides was particularly challenging (Bell et al., 2012).We assume that in agricultural areas fewer landslides were mapped than actually occurred; however, the full extent of the incompleteness of the inventory obtained in the study area remains unknown.

Explanatory variables
Geomorphically meaningful, explanatory variables available for this study represented terrain conditions, soil properties and potential tectonic activity.
Several terrain attributes were derived from the LiDAR-DTM as proxies for geomorphic and hydrological processes.We used SAGA GIS (Conrad, 2007) to calculate slope angle ( • ), slope aspect ( • , transformed using the sine and cosine representing east versus west and north versus south; Brenning, 2009), overall curvature (all calculated with second-degree polynomial approximation; Zevenbergen and Thornes, 1987), a topographic wetness index (SAGA wetness index; Boehner et al., 2002), catchment height and catchment area calculated with multiple flow direction algorithm (Freeman, 1991;Quinn et al., 1991), as well as convergence indices, calculated with 10 m and 50 m radius, respectively, to represent morphological changes on different scales.Data preparation was performed within the R package RSAGA (Brenning, 2011).
Van Westen et al. (2008) had discussed the relevance of most of these terrain attributes to landslide susceptibility.The relationship of slope angle with landslide activity is well known from general slope stability literature (e.g.Crozier, 1986).Slope aspect can be used as a proxy for bedding orientation.It may also reflect differences in intensity of solar radiation, which controls the local temperature and evaporation and therefore soil moisture (Van Westen et al., 2008).Curvature represents convex and concave surfaces related to local morphology (3 × 3 grid cells).The topographic wetness index was used as a proxy for the soil moisture and ground water level (Beven and Kirkby, 1979;Seibert et al., 2007).The position of the landslide on the slope and the distance from the ridge was represented by the variable catchment height.The catchment area was calculated for the subcatchments and gives a local representation of the contributing area.Convergence indices were calculated to represent the slope morphology on two different scales by using two different window sizes for the calculation (10 and 50 m).Positive values indicate ridges while negative values indicate local depressions.
Information on the soil properties is also very important as they have an effect on the infiltration capacity and water storage in the soil, which ultimately influences the disposition to landslides (Crozier, 1986).From the soil data set of Eder et al. (2011) we extracted four gridded variables representing permeability (mm d −1 ; average value of the top 20 cm, and minimum value within 100 cm profiles) and void space (%; average value of the top 20 cm, and average of 100 cm profiles), which may be considered as a proxies for the infiltration capacity.
Available tectonic vector data included fault lines and nappe boundaries (Kurz, 2012).The proximity to a fault line might refer to the occurrence of weakened material that has already been strongly tectonically influenced, reworked and mechanically fragmented.This material generally has lower shear strength and may therefore be more prone to landslides (Crozier, 1986).Furthermore, earth-and debris slides might occur with higher density close to nappe boundaries, as these indicate a distinct difference in the material and permeability.For example, many earth-and debris slides have been observed where the nappe boundary of the Austroalpine Unit with limestone overlays the Rheno-danubian Flysch Zone.The higher water permeability of the Austroalpine limestone on top of denser sandstones and marlstones of the Flysch Zone can result in increased soil water availability at the boundary zone and presence of boundary springs (Schnabel, 1985).The fault lines and nappe boundaries were included in the susceptibility analysis by calculating grids representing the Euclidean distance from the respective features.
Land cover data for Lower Austria was also available for this study, but was not included in the analysis.Since the ages of the landslides were not available in the inventory, it would not be accurate to associate the landslide distribution with present land cover conditions (Petschko et al., 2013b).We simply do not know how land cover had historically changed and influenced all of the landslides in this inventory, which is important to understand the changing conditions of slope stability (Glade, 2003;Beguería, 2006b;Van Den Eeckhaut et al., 2009;Bathurst et al., 2010).For example, areas which are forested today might not have been forested for a long time period in the past, e.g.due to influences of mining ore in the study area (Lettner and Wrbka, 2011).Therefore, the landslide susceptibility models were prepared focussing on local terrain conditions (Brabb, 1984), which are relatively static compared to the dynamic nature of land cover (Van Westen et al., 2008).

Modelling of heterogeneous areas
The heterogeneity of geotechnical and topographical characteristics over a study area should be considered when modelling landslide susceptibility (Lee et al., 2008;Blahůt et al., 2010).Modelling separately in lithology units is one approach to addressing this heterogeneity (Petschko et al., 2012).This approach avoids the use of interaction terms to represent lithology-dependent processes and preparatory factors, and thus facilitates easier interpretation of the models.Fitting the model with individual modelling domains based on the lithology units is expected to improve overall predictive performance by accurately representing the diversity of geotechnical conditions across the study area (Blahůt et al., 2010;Petschko et al., 2012;Trigila et al., 2013).
Accordingly, the study area was divided into 16 modelling domains based on a simplified 1 : 200 000 lithological map (Fig. 1; Schnabel, 2002;Bell et al., 2013).The lithological map gives details on the parent material available for soil formation.This determines the geotechnical characteristics on a scale of 1 : 200 000.Data on the parent material was used as a proxy for these characteristics, as no appropriate geotechnical data covering the entire study area was available.Lithology units with no observed landslides were merged with geotechnically similar units to create homogenous modelling domains (Table 1).The final susceptibility map of the province was obtained by merging the individual susceptibility maps of the 16 domains.

Generalized additive model
Among the currently available methods for landslide susceptibility modelling a generalized additive model (GAM) is a compromise between the nonlinear predictive flexibility of machine learning algorithms and the smooth, yet linear, predictions of GLMs such as logistic regression.The model fit of the GAM can be easily interpretable, unlike most machine learning algorithms (Brenning, 2008;Goetz et al., 2011).In addition, the GAM is more suitable for representing the nonlinear response of slope stability to changing site conditions than the GLM (Goetz et al., 2011).
The basic design of a GAM is to replace the linear function of each co-variate as used in a GLM with an empirically fitted smooth function to "let the data show the appropriate functional form" (Hastie and Tibshirani, 1990).Thus a GAM allows the combination of linear and nonlinear smoothing functions in an additive manner to describe the relationship between explanatory and response variables.In the case of the logistic additive model for binary (presence/absence) response variables, the response variable is not modelled directly, but using the logit of the occurrence probability p(x) conditional on explanatory variables x = (x 1 , . . ., x m ) T (Hastie and Tibshirani, 1990): (1) where the functions f i are nonparametric smoothers.The quantity is referred to as the odds.Thus, the logistic GAM is additive at the logit level, but increases in f i have a multiplicative effect on the odds.We used a combined backward and forward stepwise variable selection based on the Akaike Information Criterion (AIC), which measures the goodness-of-fit while penalizing for model complexity to obtain a parsimonious model (Akaike, 1974).Smaller models help to keep the estimated coefficient standard errors small and prevent the model from overfitting, which occurs when the number of variables in the model is large relative to the number of landslide samples (Hosmer and Lemeshow, 2000).Overfitting refers to an algorithm or model that performs very well on the available training data to which it is fitted, but poorly on future or new test data and therefore produces unreliable predictions (Hand, 1997;Hosmer and Lemeshow, 2000).
Our study design was a case-control study with the mapped landslide samples as cases and randomly selected non-landslide samples as controls.The landslide susceptibility maps were derived for each modelling domain from a model (GAM 1 -GAM 16 ) using all landslide samples while the model performance is assessed using cross-validation, i.e. subsamples of landslide cells (Sect.5.3).The sampling of non-landslide cells for the entire study area was based on a density of 2 % of all grid cells.An equal number of cases and controls (1 : 1) was used for each model fitted; the landslide samples were matched to an equal number of randomly selected non-landslide samples.This gives the sample size in the respective modelling domain.
It was necessary to adjust each model's raw predictions based on the corresponding sampling rate to account for the general relative landslide susceptibility of each modelling domain.We adjusted the prediction by considering the sampling rate (τ 0 /τ 1 ) of each domain, using Eq. ( 3), where τ 0 = number of sampled non-landslide cells (4) total number of non-landslide cells , and τ 1 = number of sampled landslide cells (5) total number of landslide cells , and odds (x) is the unadjusted prediction, in our case, based on training a model with an equal number of landslides to non-landslides (1 : 1) in each domain.The odds * (x) gives the adjusted odds (Scott and Wild, 1986;Hosmer and Lemeshow, 2000).As all landslides were used in this step, τ 1 =1.The adjusted probability p * (x) was calculated from odds * (x) which was therefore comparable among different modelling domains.GAM modelling was performed with the gam package within the open-source statistical software R (Hastie, 2011;R Development Core Team, 2011).
The adjusted probabilities of landslide susceptibility of the merged map were further classified into discrete classes of low, medium and high susceptibility based on thresholds related to the percentage of observed landslides falling within each susceptibility class.This was a result of testing different thresholds and checking them in the field according to the best geomorphic and planning plausibility.A threshold for the low susceptibility class was selected so that 5 % of the observed landslides having the lowest predicted probabilities would fall within this class.An additional 25 % of the observed landslides would fall within the medium susceptibility class, while the high susceptibility class contained the 70 % of the landslides with the highest predicted probabilities.The corresponding class thresholds of 0.00209 and 0.0141 were also applied to classify the maps corresponding to the approximate upper and lower confidence limits calculated for each grid cell at the 95 % confidence level (refer to Sect.5.5).

Spatial and non-spatial cross-validation
The assembly of the test data for performance assessment can be achieved in three ways: (1) random subsampling, (2) spatial subsampling and (3) temporal partitioning of the landslide data (Chung and Fabbri, 2003).As there was no information on the landslide age or time of occurrence available for temporal partitioning, we focused on testing random and spatial subsampling of cases and controls within each homogeneous modelling domain.
We used non-spatial and spatial k-fold cross-validation to assess each model's predictive performance as a measure of Nat.Hazards Earth Syst.Sci., 14, 95-118, 2014 www.nat-hazards-earth-syst-sci.net/14/95/2014/ the model's non-spatial and spatial transferability (Kohavi, 1995;Townsend Peterson et al., 2007).With random subsampling traditional non-spatial cross-validation was performed.Furthermore, spatial cross-validation was applied by spatial subsampling based on k-means clustering classification of point coordinates (Ruß and Brenning, 2010).In k-fold crossvalidation, k (here: k = 5) randomly or spatially selected disjoint subsamples, or folds, are derived.The model was fitted k times on the combined data of k-1 folds and tested on the data of the remaining fold by applying the fitted model to the test fold and calculating the performance measure (Kohavi, 1995).
In order to obtain results that are independent of a particular partitioning, cross-validation was repeated r times (here: r = 20), and the median and interquartile range of the 20 outcomes was calculated.This resulted in 100 different estimates of the performance measures.
We use the AUROC as a performance measure, which was derived by comparing the sensitivity of a model (proportion of true positives) to the specificity (more precisely, 1-specificity, or false positives rate; Hosmer and Lemeshow, 2000).The AUROC takes values between 0 and 1 where a value of 0.5 would be achieved by pure chance agreement between predictions and observations and a value of 1 represent perfect discrimination (Brenning, 2005;Guzzetti et al., 2006); however, this may also indicate overfitting.Thus, the AUROC measures the model's ability to discriminate landslide and non-landslide cells (Hosmer and Lemeshow, 2000).
Spatial and non-spatial cross-validation was also applied to test the effects of reducing the sample size on the interquartile range of the AUROC values and the thematic consistency within one domain.This was performed within the modelling domain Flysch Zone.The cross-validation was applied 9 times (with 5 folds and 20 repetitions each) while the sample size of the training sample (including landslide and non-landslide samples) is reduced from 12 562 to 6400, 3200, 1600, 800, 400, 200, 100 and to 50 samples.The test sample was determined constant with a sum of 2000 landslide and non-landslide sample cells.Spatial and non-spatial cross-validation was performed with the sperrorest package in R (Brenning, 2012b).

Transferability and thematic consistency indices
The non-spatial and spatial transferability were expressed by the interpretation of the estimated interquartile range (IQR) of the AUROC values resulting from the non-spatial and spatial cross-validation of each modelling domain.The lower the estimated IQR, the better we considered the non-spatial and spatial transferability of the model within the modelling domain.However, sample size differences among modelling domains result in differences in sampling variability of AU-ROC estimators, which then has an influence on the IQR of AUROC among cross-validation repetitions.In order to account for this contribution to sampling variability and be able to provide a transferability measure that was comparable among modelling domains, the IQR had to be adjusted according to the sample size.We define a transferability index by adjusting the estimated interquartile range (IQR) for the contribution of test-set estimation to sampling variability.For this purpose, we calculated the approximate standard error (SE) of the AUROC (AUC) estimator on a test set of N landslide and N non-landslide samples using the equation presented by Hanley and McNeil (1982): The quantities Q 1 and Q 2 were calculated from the AUROC (AUC) value as shown by the following equations: Since the IQR is approximately equal to 1.35 times the corresponding standard deviation under normal distribution, we used the following equation to remove the contribution of SE to the IQR of the AUROC, and refer to this quantity as the transferability index T : When slightly negative terms occurred under the square root due to the approximation used, these values were replaced by a value of zero.A higher value of T indicates a greater variation in predictive performance among models fitted to cross-validation training partitions, which can be interpreted as a poorer transferability.We refer to the transferability calculated by spatial (non-spatial) cross-validation as spatial (non-spatial) transferability T sp (T nsp ).
Furthermore, the results of the non-spatial and spatial cross-validation provide an estimate on the variable importance and the thematic consistency in each modelling domain.Based on the 100 models fitted on the different crossvalidation training sets within each modelling domain, using stepwise variable selection we assess the importance of each variable in predicting landslide susceptibility using its selection frequency (Goetz et al., 2011).This variable-selection frequency can also be interpreted as a proxy for the thematic consistency of the model (Guzzetti et al., 2006).The thematic consistency is an additional measure to the transferability index showing the model's (and the stepwise variable selection's) sensitivity to sample variation.A high thematic consistency is preferable and describes a good quality of the model (Guzzetti et al., 2006).To formalize this concept, we introduce a thematic consistency index that measures the agreement between variable choices among crossvalidation repetitions.In analogy to the Gini impurity index used in classification (Hand, 1997), we define the consistency index C by where p i is the proportion of models that include the ith predictor variable out of m predictors.The consistency index is calculated for each modelling domain and for spatial (C sp ) and non-spatial cross-validation (C nsp ).Good thematic consistency is achieved when each variable has a selection frequency either close to 0 % or 100 %, resulting in a p m (1p m ) value near 0. Therefore, a low value of C indicates a strong consistency among models.Selection frequencies around 50 % indicate weak thematic consistency and produce p m (1-p m ) values of up to 0.25 m and a maximum C value of 1.

Spatially varying prediction uncertainties
The basis for the visualization of the landslide susceptibility map is the predicted probability of the occurrence of landslides of each grid cell, which is computed from the predicted logit.These predictions are subject to uncertainty due to sampling variability and model error, which can be expressed by the standard error of model predictions.This standard error further provides a means to determine approximate upper and lower confidence limits for the predicted logit and ultimately the predicted probability.These limits define an interval within which the true logit or probability of sites with given values of the explanatory variables is located with the chosen confidence of, for example, 95 % (Hosmer and Lemeshow, 2000).In other words, we have strong confidence that the true probability of landslide occurrence at a given type of location is within the confidence interval, but we would hesitate to claim that the true probability falls within any narrower range of values within the confidence interval.
In this study, confidence interval estimates for the predicted logit and probability are of special interest in order to assess the implications of spatially varying uncertainties for the interpretation of the final classified susceptibility map.Since the available GAM implementation (Hastie, 2011) does not provide standard errors or confidence intervals for "new" objects that are not included in the training sample, we proceed as follows to estimate standard errors for each location in the prediction map.We first compute standard errors on the logit and probability levels for all sample cells.A lookup table is then used to transfer these standard errors to all grid cells of the raster based on the similarity of the values of explanatory variables used by the model.Tolerance thresholds were applied to each explanatory variable to identify suitable observations in the training sample that match any given prediction location and therefore have similar standard errors.Several tolerance thresholds were tested for each modelling domain to maximize the R 2 obtained by comparing, on the training sample, standard errors estimated by table lookup to the standard errors calculated by the GAM implementation.This results in a raster data set which gives an estimation of the standard error of the predicted logit for each grid cell.
Based on these approximated standard errors we estimate the approximate upper and lower limit of a 95 % confidence interval of the predicted logit using a normal approximation.These logit-scale confidence intervals are further converted to the probability level and adjusted based on each modelling domain's sampling rate (Sect.5.2).The approximate upper and lower confidence limits and the predicted probability are used to compare the spatially varying uncertainties in a classified landslide susceptibility map.Therefore, the classified susceptibility map is compared to the classified maps of upper and lower confidence limits to assess potential areas and grid cells in which overlaps of the susceptibility classes (high, medium or low) may occur.
Furthermore, the approximate logit-scale standard errors from each model's predictions are used as relative uncertainty indices of the susceptibility map within each domain.This uncertainty index allows for a more nuanced visualization of prediction uncertainties within each domain, knowing, however, that its interpretation is only applicable within the domain since no adjustment for the domain-specific sampling rate is applied to this index.
To communicate the possible effects of the prediction uncertainty on the classified susceptibility map, a static way of visualization of the results was selected (Elith et al., 2002).Therefore, the classified susceptibility map was presented alongside a map of possible class overlaps considering the upper and lower confidence limits and with a map of the standard errors (unclassified).
A flow chart of the analysis in this study points out the relation of the different aspects of quality raised in Sect. 2 with the presented methods and measures (Fig. 3).

Landslide susceptibility map
The three classes of the final landslide susceptibility map classified according to the proportion of landslides included covered 75 % (low susceptibility), 19 % (medium susceptibility) and 6 % (high susceptibility) of the total study area (Fig. 4).
The variable frequency analysis showed that different subsets of the available 15 explanatory variables were included in the GAMs (GAM 1 -GAM 16 ) for the different modelling domains (Table 2).The total number of variables used in the models (GAM 1 -GAM 16 ) ranged from four variables in the Bohemian Massif with plutonic rock and Waschberg Zone, including Bohemian Massif with sedimentary cover domains (52 landslides each), to 11 variables used in the model for the Flysch Zone (6281 landslides, Table 3).The number of  variables included in the models generally increased with the number of observations in the training sample, which was attributed to the AIC's penalization based on model complexity relative to sample size.Furthermore, 65 % of the variables were used in a smoothed form (Table 2).However, mainly linear model terms were selected in four modelling domains.A similar overall frequency of nonlinear model terms was obtained in the models fitted within the cross-validation procedures (71 % nonlinear overall) with very similar results for the spatial and non-spatial techniques (refer to Table 2 for details).Two domains (Loess, Loam and Waschberg Zone and Bohemian Massif with sedimentary cover) primarily used linear model terms in spatial and non-spatial crossvalidation.Additionally, in the Bohemian Massif with plutonic rock (only in spatial cross-validation) and in the alluvial deposits including lake and wetland (only in non-spatial cross-validation) high proportions of linear terms were observed.

Spatial and non-spatial cross-validation
In general, spatial cross-validation had a larger range of AU-ROC values than non-spatial cross-validation over all crossvalidation repetitions (Fig. 5).However, the median AU-ROC values were very similar; non-spatial cross-validation had only slightly higher median values.The highest median AUROC values of 0.98 (spatial) and 0.99 (non-spatial cross-validation) were found in the alluvial deposits including lakes and wetland domain.With the exception of the Permo-Mesozoic rocks domain, all median AUROC values are higher than 0.74 (Fig. 5).In this domain the largest differences of median AUROC values between spatial (0.53) and non-spatial cross-validation (0.79) were found.Furthermore, the lowest 1st quartile AUROC value, which was higher performing with non-spatial cross-validation (0.73) than spatial cross-validation (0.35), was found for the Permo-Mesozoic rocks domain.Additionally, this domain showed the highest interquartile range of AUROC values (0.42) with spatial cross-validation and the second highest (0.11) with nonspatial cross-validation.
Three domains had very high AUROC values of the 3rd quartile, which ranged from 0.97 to 1 in the spatial and nonspatial cross-validation.These were the loess and loam, the Bohemian Massif with plutonic rock and the alluvial deposits including lakes and wetland domains.
The poorest spatial and non-spatial transferabilities assigned at T sp > 0.10 and T nsp > 0.04 were obtained in the three modelling domains with the smallest sample sizes (Table 3).Transferability tended to be better in domains with larger sample sizes and/or higher landslide densities, but was unrelated to median AUROC.Among the domains with larger sample sizes, the Austroalpine Unit with dolostone stood out with relatively poor spatial transferability (T sp = 0.098) in addition to its relatively low median AUROC of 0.75.Spatial transferabilities were best (T sp < 0.03) for igneous rocks of the Austroalpine Unit, the Molasse Zone and the Schlier Zone.
Reducing the sample size from using the total number of landslide and non-landslide samples in the case-control sample (12 562) to 50 samples within a modelling domain (Flysch Zone) still resulted in acceptable median AUROC values (> 0.76) for all sample sizes.However, the median AU-ROC values decreased from 0.84 to 0.76 in both the spatial and non-spatial cross-validation (Fig. 6) with very little difference between both approaches.While the AUROC values stay relatively constant until a sample size of 3200,  they started to decrease more rapidly from there on as the sample size decreases.Despite that, the interquartile ranges were substantially higher with spatial cross-validation, and generally increased with decreasing sample sizes in both spatial (0.052-0.087) and non-spatial (0.011-0.059) crossvalidation.Below a sample size of about 400 (spatial cross-   validation) and 200 (non-spatial cross-validation) the interquartile range of the AUROC values sharply increased as the sample size decreased; thus the transferability of the model decreased substantially for sample sizes smaller than 400 (spatial cross-validation) and 200 (non-spatial crossvalidation).In addition, the smaller sample sizes led to lower thematic consistency of the model.

Variable importance and thematic consistency
In the comparison of the models fitted within the 16 modelling domains (GAM 1 -GAM 16 ), topographic variables were more important than the available variables on soil properties.Out of a maximum of 16 selections, the variable slope angle was selected 15 times within the stepwise variable selection, whereas the minimum permeability value was not selected at all (Table 2).However, this only shows the result of one specific random sample and variable selection.According to the relative variable-selection frequency resulting from the two cross-validation approaches, the variable importance for predicting the landslide susceptibility also changed distinctly between the modelling domains.
All modelling domains (except the domain of the Permo-Mesozoic rocks) had slope angle as the most important variable.It was selected on average in 91.8 and 95.7 % of the model repetitions in spatial and non-spatial cross-validation.
Another important variable was catchment height, which was selected 55.6 % (spatial) and 65 % (non-spatial) of the repetitions.In the spatial cross-validation the Euclidian distance to nappe boundaries was the second most important variable as it was selected in 74.2 % of the models (non-spatial selection frequency 56.1 %, rank 6).The topographic wetness index (68.3% / 61.4 %), the convergence index (50 pixel radius; 56.1 % / 64.3 %) and the curvature (53.8 % / 58.6 %) were also among the top ranking variables in both crossvalidation approaches.Void space (mean 0-100 cm) was not selected in any of the model runs, while void space (0-20 cm) was selected by less than 1 % of the models and the minimum permeability was included in 1.1 % (spatial cross-validation) to 4.9 % (non-spatial cross-validation) of the models on average over all modelling domains.Overall the thematic consistency was stronger within the non-spatial cross-validation because training sets are less variable when the data are partitioned randomly as opposed to spatially (Table 3).A strong thematic consistency was assigned for a consistency index of C sp , C nsp < 0.20 and was found for seven domains in the non-spatial cross-validation but only for four domains in the spatial cross-validation.The strongest thematic consistency was observed in the Flysch Zone (C sp = 0.099) in spatial cross-validation and in the Austroalpine Unit with dolostone (C nsp = 0.068) and the Klippen Zone (C nsp = 0.081) in the non-spatial crossvalidation.These domains had a very large sample size and landslide density and also a good spatial or non-spatial transferability.While in spatial cross-validation the domains with the smallest sample size and poorest spatial transferability had the weakest thematic consistency, in non-spatial crossvalidation the consistency index was unrelated to sample size and transferability.Among these, the Waschberg Zone (including the Bohemian Massif with sedimentary cover domain) gave very contrasting results.It showed a weak thematic consistency in spatial cross-validation but a medium consistency index in non-spatial cross-validation.The weakest thematic consistency was found for the Permo-Mesozoic rocks (C sp = 0.519) and the alluvial deposits including lakes and wetlands (C nsp = 0.389) domains.In non-spatial cross-validation the thematic consistency was stronger with lower median AUROC values.However, in spatial crossvalidation the median AUROC values and the thematic consistency were unrelated.

Spatially varying prediction uncertainties
The largest range of logit-scale standard errors was obtained for the Bohemian Massif with plutonic rock domain (0.37-16.24;Table 3), whereas the Quaternary fluvial terrace had the minimum range (0.22-1.69).The highest standard error of 16.24 was in the Bohemian Massif with plutonic rock domain, while the median of the highest standard errors over all domains was 2.27.After the transformation from logit-scale to probability scale the typical distribution of the standard error was the following: the range of the standard error of the predicted probability was largest for medium probabilities that were in the medium susceptibility class.The minimum range of the standard errors was typically obtained at the minimum and maximum probability that were contained in the low and high susceptibility class.Nevertheless, the lowest standard errors were computed at the minimum predicted probability.
Within the study area of Lower Austria seven types of susceptibility class uncertainties in the classified landslide susceptibility maps were identified by the analysis of overlaps between the susceptibility classes (Figs. 7 and 8).Most commonly, for about 85 % of the grid cells, there were no overlaps of different susceptibility classes either with the lower or the upper confidence limit meaning that in all maps the susceptibility class was the same.The most common overlaps were found for the low susceptibility class where 6 % of the grid cells were classified with low susceptibility in the predicted probability map but with medium susceptibility in the upper confidence limit map (Fig. 7).Moreover, 2 % of the study area experienced overlaps of the medium class in the predicted probability map to the high susceptibility class in the upper confidence limit map.Even fewer grid cells, 0.03 % had a range from the low class in the predicted probability map to the high class in the upper confidence limit map.
The results of the analysis of spatially varying uncertainties were presented by maps chosen in two exemplary modelling domains with contrasting landslide densities, the Flysch Zone (4.6 km −2 ) and the Bohemian Massif (0.09 km −2 , Fig. 8).The range of the standard errors of the predicted logit of these domains was very similar.While the standard errors range from 0.05 to 1.6 in the Flysch Zone, the Bohemian Massif had a range from 0.24 to 1.73 (Table 3).With this minimum value the Flysch Zone gives the lowest standard error in the study area.The R 2 resulting from using the lookup table to transfer the standard errors to all grid cells range from 0.51 (in the Mélange Zone) to 0.9 (in Loess, Loam; Table 3).In the Flysch Zone the computed R 2 was 0.82 which is only slightly better than in the Bohemian Massif (0.79).
Comparing the classified maps from the Flysch Zone to the Bohemian Massif it can be seen that the grid cells showing an overlap in the Flysch Zone were much more scattered, pixel-wise, and mainly occurred at the boundaries of the susceptibility classes (Fig. 8).In the centre of the high susceptible class no overlaps occurred.However, in the Bohemian Massif the overlaps mainly occurred between the medium and high susceptibility classes and generally covered the entire class or larger areas, instead of single grid cells only.Both maps shared a low frequency of dark blue and dark green grid cells, which showed overlaps from the low class to the high class (either from lower confidence limit to predicted probability or from predicted probability to upper confidence limit).Furthermore, the overlaps of classes were occurring equally in-and outside of permanent settlement areas, which was important when considering the map for future planning purposes (Fig. 8).

Quality of input data
Practical challenges in this study arise from the size of the study area and the intended output map scale of 1 : 25 000.The size of the study area brings along some limitations regarding the availability of data sources that offer a full spatial coverage and a reasonable map scale.This introduces a number of parametric uncertainties into the modelling.Generally a complete, unbiased inventory would be desirable, as for example a full inventory that was mapped directly in the aftermath of a landslide event (single landslide or multiple landslides triggered at the same time) in the area of the susceptibility map (Van Westen et al., 2008).This would allow for the inclusion of land cover or precipitation data in the modelling which might be helpful to learn even more about the landslides in the area and to build scenarios on future landslide susceptibility (or hazard/risk; Begueria, 2006b; Tarolli et al., 2011).However, a complete inventory is rarely 60 1 Figure 7. Susceptibility class uncertainty expressed by types of overlaps between 2 susceptibility maps at the lower confidence limit (LLCI) or the upper confidence limit (ULCI) 3 at the 95% confidence level relative to the class in the predicted probability map (PP).With 4 the susceptibility class the percentage of area in the maps of LLCI, PP and ULCI is given in 5 the box.The arrow thickness is relative to the percentage of affected area.Possible types of 6 overlaps, which did not occur in the study area, are indicated with grey arrows.Next to the 7 type of overlap the percentage of affected area related to the study area is given.In 85% of the 8 study area, the 95% confidence limits fall within the same susceptibility class (not indicated 9 here).10 11 12 Fig. 7. Susceptibility class uncertainty expressed by types of overlaps between susceptibility maps at the lower confidence limit (LLCI) or the upper confidence limit (ULCI) at the 95 % confidence level relative to the class in the predicted probability map (PP).With the susceptibility class the percentage of area in the maps of LLCI, PP and ULCI is given in the box.The arrow thickness is relative to the percentage of affected area.Possible types of overlaps, which did not occur in the study area, are indicated with grey arrows.Next to the type of overlap the percentage of affected area related to the study area is given.In 85 % of the study area, the 95 % confidence limits fall within the same susceptibility class (not indicated here).
available.Particularly for historical inventories, the level and type of completeness is unknown, while it is known that they are generally incomplete (Malamud et al., 2004).Even a substantially complete inventory, which would be useful in statistical modelling as it includes a substantial fraction (random sample) of all landslides at all scales, land use types, lithological units or slope angles, cannot be reached for historical inventories (Malamud et al., 2004).This originates from the observation that landslides and their visibility on aerial photographs or other base maps (e.g.hillshades derived from airborne laser-scanning DTMs) are highly influenced by new landslides, reactivation, erosion, land use type and anthropogenic activities (Bell et al., 2012;Malamud et al., 2004;McCalpin, 1984, Van Westen et al., 2008).Furthermore the mapping and identification of landslides is highly dependent on the experience and knowledge of the investigator (Van Westen et al., 1999;Harp et al., 2002;Ardizzone et al., 2002).If these influences on the completeness are not random they might introduce a bias in the inventory and furthermore in the sampling which results in a model bias or systematic modelling error.
In our study area it is assumed that the inventory is not complete as it originates from recent data sources (not multitemporal) only and the visibility of landslides in the LiDAR-DTM or orthophoto is influenced by human impact depending on the land use type (Bell et al., 2012).Furthermore, a drawback is that no information on the time of occurrence of the landslide is available.However, the type of incompleteness was not analysed for the entire study area.Therefore, it is not clear if the missing landslides are missing completely at random or are biased toward absence in certain land uses or lithological units.The implications of an incomplete inventory on the model performance (shown by the AUROC value) were estimated by performing the repeated k-fold cross-validation using training and test samples.The results show rather high AUROC values for most modelling domains, which indicates that even with an incomplete inventory (training sample) the prediction of landslides of the test sample was successful for most cases.However, sample size is of importance for the model performance.For the discussion of this please see Sect.7.4.

Study design to meet the heterogeneity of the study area
Observed variable-selection frequencies showed that different explanatory variables were relevant in different domains, which provides additional justification to the decision to model susceptibility in each modelling domain separately.Additionally, not only the different choice of the variables is important but also the way the variables are fitted or smoothed according to the sample in the respective domain.
Previous studies showed that within each lithological unit landslides occur at different slope angles (Blahůt et al., 2010;Muenchow et al., 2012;Petschko et al., 2012)  class uncertainty refers to differences between susceptibility maps at the lower confidence 7 limit (LLCI) or the upper confidence limit (ULCI) at the 95% confidence level relative to the 8 class in the predicted probability map (PP) in panels a) and d).Data sources: LiDAR-DTM 9 hillshade, river, major road and settlement -Provincial Government of Lower Austria. 10 11  (d-f) are in the Bohemian Massif (very low landslide density).The susceptibility class uncertainty refers to differences between susceptibility maps at the lower confidence limit (LLCI) or the upper confidence limit (ULCI) at the 95 % confidence level relative to the class in the predicted probability map (PP) in panels (a) and (d).Data sources: LiDAR-DTM hillshade, river, major road and settlement -Provincial Government of Lower Austria.
present for other explanatory variables as well, as the geomorphic and geologic characteristics change (Lee et al., 2008).Facing this, our study design gives much more flexibility to represent the characteristics of the study area.Furthermore, it incorporates information on lithology by adjusting the odds of the prediction with the sampling rate of cells in each lithological unit.
However, one may argue that with this approach problems occur at the boundaries of the lithological units.Inaccuracies in the delineation of the lithological map of the area have effects on the model results as the landslides might be assigned incorrect lithological information.This may lead to an underestimation of effect sizes as data from different (true) litho-logical units would be mixed.Similar mixing effects may occur for quantitative predictor variables as well, for example as a function of positional accuracy for scale and resolution.In regression this effect is known as dilution, which may introduce a bias of estimated regression coefficients toward zero (Frost and Thompson, 2000).As this would also occur using the lithological map as a factor instead of as a mask for partitioning the study area, the boundary inaccuracies are not only a problem in the applied study design.

Spatial and non-spatial cross-validation
We assessed the model form uncertainty by deriving a model performance estimate and thematic consistency by spatial and non-spatial cross-validation.Cross-validation estimation of a model's predictive performance is a crucial step in predictive modelling because estimation on the training set is always too optimistic (Hosmer and Lemeshow, 2000;Brenning, 2005).Cross-validation results in bias-reduced performance estimates as the test partitions used in each repetition do not overlap with the training sample (Brenning, 2005).In particular spatial cross-validation is recommendable for spatial data, which may be subject to spatial autocorrelation (Brenning, 2005(Brenning, , 2012a)).
The median AUROC values estimated by spatial and nonspatial cross-validation were generally similarly high in this study.However, the median AUROC values and the transferability index clearly showed that non-spatial cross-validation provided a more optimistic or even overoptimistic assessment of the model performance and transferability in contrast to spatial cross-validation.Therefore, spatial and temporal cross-validation should be preferred for performance estimation (Chung and Fabbri, 2008).While spatial performance and transferability do not necessarily reflect a model's predictive performance in the time domain, they provide a more realistic assessment of its ability to generalize from the available data than non-spatial approaches (Brenning, 2005).
The spatially and non-spatially least transferable models in this study were associated with domains that had the smallest sample sizes.The relationship of sample size on predictive abilities has also been shown in other spatial modelling studies (Stockwell and Peterson, 2002;Hjort and Marmion, 2008).However, we believe that the cases of high variation in AUROC values may be also related to the cross-validation sampling variation as indicated by the difference between T sp and lower T nsp , and possibly the proportion of stable and unstable terrain in a modelling domain.
Heterogeneity of landslide conditions (e.g.related to topography or land use) in the cross-validation samples is more likely to occur if samples are partitioned spatially, such as the case in spatial cross-validation.Here, similar characteristics of the explanatory variables in both training and test sample are assumed and necessary (Guzzetti et al., 2006).If this assumption is not met by the data (e.g. a rock type or land use class is missing in the test sample) the transfer of the fitted model to the test sample and the estimation of the model performance are difficult (or impossible) (Guzzetti et al., 2006).In our study some model domains might have high contrast between stable (e.g.large flat areas) and unstable (e.g.steep areas) terrain which gives potential for greater variation of sampled terrain conditions; it may be possible that in some samples one terrain condition is overrepresented relative to others.The sampling strategy may be improved further by masking low-lying flat areas that are not typically susceptible to landslides (Van den Eeckhaut et al., 2009;Goetz et al., 2011).Consequently, the sample may have more potential to capture the differentiating conditions of stable and unstable terrain in an area that is generally susceptible to landslides (e.g.steep hillslopes).This might lead to a smaller variation in the AUROC values.
The high importance of topographic variables in the susceptibility modelling goes along with findings in other studies (Guzzetti et al., 2006;Begueria, 2006b;Van Westen et al., 2008;Guns and Vanacker, 2012).However, it was rather surprising that the soil data on permeability and void space was not selected more often during the stepwise-variable selection.This might be related to the poor resolution of the soil data (50 m × 50 m) and to the usage of topographic data as a proxy for hydrological and soil characteristics (soil moisture and thickness).A strong correlation between the topographic wetness index and soil characteristics was found amongst others by Seibert et al. (2007).Furthermore, the higher resolution of the topographic variables might be of advantage for better describing the local conditions of landslide susceptibility.
Whereas a very strong thematic consistency was generally found for domains with a large sample size and sampling rate, domains with a small sample size and rate showed a high variability in the variable-selection frequencies which gave a weak thematic consistency.Therefore, the weak thematic consistency might also be associated with a poor spatial and non-spatial transferability, both originating from a small sample size and a small sampling rate.This relation was stronger for the spatial cross-validation, while the thematic consistency from non-spatial cross-validation was unrelated.
Generally, the usage of rather static data in the modelling was a necessity resulting from the available landslide inventory.However, this does not give the possibility to include dynamic data on triggers into the modelling or to design scenarios of future landslide susceptibility considering land cover or precipitation change.A clear source of model form uncertainty is the concept of uniformitarianism in modelling.This implies that the predisposing and triggering factors of landslides do not change in future.However, natural variability of landslide triggering mechanisms and also of the climate system might cause future changes.Currently, the influence of climate change on current or future landslide activity is debated.However, no clear evidence on these possible future changes was found in many regions (e.g.Crozier, 2010;Huggel et al., 2012).Additionally, it might be possible that the most important data set explaining the susceptibility might still be missing.This might happen although expert knowledge on geomorphology was applied in selecting geomorphologically relevant data.Moreover, it has to be stated that this type of susceptibility map was designed for the main scarps of landslides.A classified map might cover the possible runout of landslides but by definition does not show how the initiated landslides might move downslope and endanger further areas (Demoulin and Chung, 2007).

Quality of final susceptibility mapvisualizing prediction uncertainties
This analysis of prediction uncertainty is an improvement of previous landslide susceptibility studies (Guzzetti et al., 2006;Van den Eeckhaut et al., 2009;Rossi et al., 2010, Sterlacchini et al., 2011), as it not only showed an uncertainty index of the predicted probability on a grid cell basis but additionally provided information on where the susceptibility class uncertainties were located.Some model uncertainties within this method arise from using the lookup table for transferring the prediction standard error to all grid cells as shown by the range of resulting R 2 .This method might be improved or substituted by a function assigning the standard errors to all grid cells.It was found that in the classified map the majority of grid cells did not change.However, there are differences between the modelling domains where some domains had larger overlaps of different susceptibility classes than others.Special attention should be given to the low susceptibility class.Here, the highest percentage of overlapping classes and underestimation of the susceptibility were detected.
The visualization of these spatially varying uncertainties is of special interest for future land use and development planning usually performed by non-landslide experts.In the aftermath of this study each landslide susceptibility class will be related to, not legally binding, recommendations for the designation of new building areas.Therefore, a misclassification (e.g.low instead of medium susceptibility) might lead to an interpretation by the municipality or landowner that underestimated landslide susceptibility.Knowledge about the susceptibility class overlaps might outline where more caution and detailed investigations are necessary.Additionally, it also shows where no uncertainties are expected, which might help to avoid costs for slope investigations.
There is a need to communicate the research results and their quality with appropriate explanations for the local officials, environmental managers and the public to raise awareness and knowledge about it, which leads to an easier understanding and incorporation of the results into the decisionmaking process (Knuepfer and Petersen, 2002;Rogers, 2006;Brierley, 2009;Hill et al., 2013).This analysis might aid good acceptance of the landslide susceptibility maps in the local governments, as instead of a fuzzy statement on involved uncertainties, these are clearly shown in a map on grid cell level (Guzzetti et al., 2006;Luoto et al., 2010).Furthermore, the preparation of the susceptibility maps showing the class overlaps contributes to an easier understanding of the possible effects of the prediction uncertainties.
The question if the policy-makers or stakeholders are really interested in knowing more about the uncertainty is discussed conversely.The study of Brugnach et al. (2006) pointed out that the confidence in modelling results is dependent on the way the uncertainties are addressed.Policymakers were missing more information on the uncertainty of any model result.Therefore, the modelling results should be presented with a measure of uncertainty or confidence indicator (Brugnach et al., 2006).In habitat suitability modelling the visualization of uncertainty was identified as relevant to inform decision-makers about areas with extreme error, but also about areas which are particularly well modelled (Elith et al., 2002).This openly addresses the uncertainties involved in the maps instead of giving an impression of certainty (Elith et al., 2002).However, interviews of Klimeš and Blahůt (2012) showed that local governments do not want any information on uncertainties.
Nevertheless, the spatially verying prediction uncertainties might have severe consequences on buildings and their inhabitants if an event occurred within the uncertainties of the method used to delineate the hazard zones.The converse discussion shows that more or better communication with the stakeholders or policy-makers (also during the modelling process) is necessary to learn about uncertainties and enlarge confidence into the modelling (Brugnach et al., 2006).However, the way the uncertainties are presented to the stakeholder has to be adapted by the scientist to ensure the success of the communication.The visualization of some aspects of the quality of landslide susceptibility maps, such as the spatially varying prediction uncertainty, can enhance the communication among experts and decision-makers to facilitate informed decisions (Kunz et al., 2011).
Additionally, further aspects of considering and communicating the effects of epistemic uncertainty are still open research fields in susceptibility modelling.A clear assessment of these is necessary to evaluate on their consequences on the susceptibility (or hazard or risk) map.

Considerations on sample size
Summarizing the previously discussed findings some considerations on a minimum sample size might be possible.While the transferability index is less strongly related to sample size or sampling rate, the thematic consistency index shows a stronger relationship to them.Generally, larger sample sizes and sampling rates result in better thematic consistency and transferability of the model.Furthermore, the minimum standard error of the prediction was lower with larger sample size (Table 3).
The effect of a reduced sample size on the median and interquartile range AUROC values was assessed in the Flysch domain.We found that the median AUROC remained satisfactory high but decreased as sample size decreased, while the interquartile range of the AUROC increased.Even with the smallest sample size the model still achieved a good discrimination between landslide and non-landslide cells according to the median AUROC value.Summarizing the results, a minimum sample size with a sum of around 400 slide and non-slide cells might be recommended for the methods applied in this study.This size leads to an acceptable transferability and thematic consistency of the model in spatial cross-validation in the analysed modelling domain.However, examples from successfully fitting a susceptibility model with smaller sample sizes (10 landslides with 15 cells each in an area of 177 km 2 ; Demoulin and Chung, 2007) give a very contrasting result.This shows that for small sample sizes other modelling methods (statistical or heuristic) might be more suitable.However, some difficulties arise in the modelling with small samples, which might create the need for a larger sample or for selecting a different method.The sample needs to be substantially complete to represent all local terrain conditions (Malamud, et al., 2004).However, increasing the sample size can only be done by enlarging the landslide inventory (e.g. by selecting a larger study area, mapping more landslides).This is challenging, as in some regions no additional data on landslides (or resources for mapping more landslides) might be available due to a general low susceptibility of the area or to missing data.
Giving a recommendation on a minimum sample size is challenging, as this study showed that the general trends found for sample size and sampling rate do not apply for all modelling domains.Therefore, we highlight that the resulting quality estimates (transferability index, consistency index and prediction uncertainty) might additionally be dependent on a combination of the domain size and the landslide density (landslides per km 2 ).Also, dependencies on local terrain conditions and their homogeneity in the modelling domain might exist.In this context the provision of the presented quality estimates is recommended, as these clearly show the quality of the results which might be lower compared to studies with larger sample sizes.
Moreover, the geomorphic plausibility of the susceptibility map has to be analysed.Previous studies found that high performance measures do not always guarantee high geomorphic plausibility of the map (Bell, 2007;Trigila et al., 2013).It might be possible that with a smaller sample size the geomorphic plausibility of the map is lower.However, the influence of a small sample size on the geomorphic plausibility of the susceptibility map is unclear.Nevertheless, analysing this was beyond the scope of this study.

Conclusions
The high quality of landslide susceptibility maps is defined by the low aleatory and epistemic uncertainties involved in the susceptibility modelling.This was analysed in terms of landslide susceptibility model performance and spatially varying prediction uncertainties of the final classified susceptibility map.The analysis gives an overview and some estimates on the epistemic and aleatory uncertainties involved in statistical susceptibility modelling.However, the effect of the propagation of all the single uncertainties on the final map and subsequently to hazard and risk maps has to be analysed further.Considering the present results in high model performance, analysis and visualization of prediction uncertainties, the applied model and resulting classified landslide susceptibility map are regarded to be of high quality.
The applied study design with modelling in the different domains provides a high flexibility for representing the characteristics of the heterogeneous study area.Opposed to single hold-out validation the repeated k-fold cross-validation provides a measure on the precision of the estimator and is independent of the one (random) test sample.The spatial crossvalidation gave a more realistic assessment of the model performance and spatial transferability from the available data than non-spatial approaches.This aids as an estimate on the model form uncertainty, which is considered to be low.
A recommendation on an appropriate minimum sample size might be given considering the presented analysis in different modelling domains and the tests on reducing the sample size.According to the results, the larger the sample size the better is the transferability and thematic consistency.Therefore, also the quality of the statistical model form increases with sample size.However, not all modelling domains follow this trend.This might be related to a combined influence of the heterogeneity or homogeneity of the local terrain conditions, the size of the domain and its landslide density.A minimum sample size of around 400 slide and non-slide cells (200 each) might be recommended for the methods applied in this study.
Only rather static data on local terrain conditions could reasonably be included in the analysis given the landslide inventory, with no known landslide age.The stepwise variable selection resulted in a satisfactory thematic consistency.Among the geomorphologically relevant variables, topographic variables were selected with a higher frequency than soil variables.This might be related to the spatial resolution of the respective data.However, this result goes along with comparable studies having topographic variables as the most frequently selected variables.The final landslide susceptibility map gives a good representation of the landslide susceptibility based on topography although the map does not include possible landslide triggers.
Regarding the susceptibility class uncertainties we conclude that the majority of the study area is not affected by class uncertainties.Special attention has to be drawn to possible overlaps of the low and medium susceptibility class in the predicted probability map and the map of the upper confidence limit.A misclassification in the low class might result in an underestimation of the susceptibility.This might have adverse effects on the municipality or landowner if the recommendations for the assignment of building areas might not be restrictive enough.
We discussed that there is a need to assess, minimize and communicate uncertainties involved in susceptibility modelling.The analysis results need to be communicated in an understandable manner to the stakeholder to allow for informed decisions instead of giving an impression of certainty.

56Figure 3 .
Figure 3. Flow chart presenting the methods and measures applied within this study assessing the quality of input data, quality of statistical model and quality of final susceptibility map.

Fig. 3 .
Fig. 3. Flow chart presenting the methods and measures applied within this study assessing the quality of input data, quality of statistical model and quality of final susceptibility map.

Figure 5 .Fig. 5 .
Figure 5. Boxplots showing the range of AUROC values resulting from repeated k-fold cross-2 validation with spatial subsampling (spCV) and non-spatial subsampling (nspCV) for each 3 modelling domain.*The y-axis limits for this domain range from 0 to 1 and a grey line marks 4 the 0.5 value.5 6

Figure 6
Figure 6.a) Median AUROC value and b) interquartile range of AUROC values (IQR) at reduced sample sizes (numbers given refer to the total number of equally distributed landslide and non-landslide points) in the Flysch Zone (Domain 126) resulting from spatial crossvalidation (spCV, grey line) and non-spatial cross-validation (nspCV, black dashed line).

Fig. 6 .
Fig. 6.(a) Median AUROC value and (b) interquartile range of AU-ROC values (IQR) at reduced sample sizes (numbers given refer to the total number of equally distributed landslide and non-landslide points) in the Flysch Zone (Domain 126) resulting from spatial cross-validation (spCV, grey line) and non-spatial cross-validation (nspCV, black dashed line).

Figure 8 .
Figure 8. Landslide susceptibility map (target scale 1:25,000) and map uncertainty in an 3 example area.a), d) susceptibility map; b), e) relative uncertainty index; c), f) susceptibility 4 class uncertainty.Panels a)-c) correspond to an area in the Flysch Zone (very high landslide 5 density), d)-f) are in the Bohemian Massif (very low landslide density).The susceptibility 6

Fig. 8 .
Fig. 8. Landslide susceptibility map (target scale 1 : 25 000) and map uncertainty in an example area.(a), (d) Susceptibility map; (b), (e) relative uncertainty index; (c), (f) susceptibility class uncertainty.Panels (a-c) correspond to an area in the Flysch Zone (very high landslide density),(d-f) are in the Bohemian Massif (very low landslide density).The susceptibility class uncertainty refers to differences between susceptibility maps at the lower confidence limit (LLCI) or the upper confidence limit (ULCI) at the 95 % confidence level relative to the class in the predicted probability map (PP) in panels (a) and (d).Data sources: LiDAR-DTM hillshade, river, major road and settlement -Provincial Government of Lower Austria.
A possible example was shown by the visualization of the H. Petschko et al.: Assessing the quality of landslide susceptibility maps prediction uncertainty and its effects on the classified landslide susceptibility map.

Table 1 . Lithological units of Lower Austria: landslides and topography.
+ sorted according to frequency of occurrence, after geological map

Table 2 .
Variable frequency for the model using all landslide cells (GAM 1 -GAM 16 ) and variable-selection frequency of variables used linearly (N) or with a smoothing function (S) in spatial (spCV) and non-spatial (nspCV) cross-validation.All values are summarized over all modelling domains.

Table 3 .
Number of selected variables for the models GAM 1 -GAM 16 .Median AUROC values, transferability and thematic consistency of the modelling domains for spatial (spCV) and non-spatial (nspCV) cross-validation.Lower index values T sp , T nsp and C sp , C nsp indicate strong spatial and non-spatial transferability and thematic consistency, respectively.Minimum (Min.), maximum (Max.) and range of standard error (SE) of the prediction on logit level.R 2 obtained by comparing standard errors estimated by the lookup table to the standard errors calculated by the GAM implementation.