Improving the Accuracy of Vegetation Classifications in Mountainous Areas

Abstract In recent decades, mountainous areas that contain some of the best-preserved habitats worldwide are experiencing significant, rapid changes. Efficient monitoring of these areas is crucial for impact assessments, understanding the key processes underlying the changes, and development of measures that mitigate degradation. Remote sensing is an efficient, cost-effective means of monitoring landscapes. One of the main challenges in the development of remote sensing techniques is improving classification accuracy, which is complicated in mountainous areas because of the rugged topography. This study evaluated the 3 main steps in the supervised vegetation classification of a mountainous area in the Spanish Pyrenees using Landsat-5 Thematic Mapper imagery. The steps were (1) choosing the training data sampling type (expert supervised or random selection), (2) deciding whether to include ancillary data, and (3) selecting a classification algorithm. The combination (in order of importance) of randomly selected training data, ancillary data (topographic and vegetation index), and a random forest classifier improved classification accuracy significantly (4–11%) in the study area in the Spanish Pyrenees. The classification procedure includes important steps that improve classification accuracies; these are often ignored in standard vegetation classification protocols. Improved accuracy is vital to the study of landscape changes in highly sensitive mountain ecosystems.


Introduction
Mountainous areas contain some of the best-preserved natural habitats, but they are threatened by changes in land use and climate (Thuiller et al 2005). The trend in European mountains has been a reduction in grazing by extensive livestock (Luick 1998;Kozak 2003;Torta 2004;Lasanta-Martínez et al 2005), and its consequences in terms of the structural and compositional changes in ecosystems, eg woody encroachment and reduced biodiversity, are worrisome (Dirnbö ck et al 2003;Komac et al 2011a). Efficient monitoring of these ecosystems is crucial for assessing the impact of changes, detecting thresholds in ecosystem responses and degradation, and ultimately, for developing early measures to mitigate these changes. However, monitoring mountain areas is expensive and time consuming because of the inaccessibility and complexity of the terrain (highly variable aspect, slope, and elevation), which at the same time lead to different plant communities. In this respect, remote sensing and geographic information systems play an important and cost-effective role in monitoring landscape changes (Poyatos et al 2003;Pô ç as et al 2011b). Although several types of satellite imagery are used for vegetation mapping at the local or regional level, such as SPOT, IKONOS, ASTER, and Landsat (Lu and Weng 2007), we chose Landsat-5 Thematic Mapper (TM) because the platform has been providing images since 1984, which makes it the longest series available. The long-term data allow comparisons between periods, which increases the value of these comparisons (Pô ç as et al 2011a; . Since the 1970s, remote sensing data have been used for supervised vegetation classifications (Shalaby and Tateishi 2007;Aragó n and Oesterheld 2008), and there have been efforts to improve classification accuracy (Chen and Stow 2002;Pal and Mather 2006;Tso and  Mather 2009). Although new classification methods have been developed, no single procedure has universal applicability (Lu and Weng 2007). The type of environment, scale, type and amount of training data, and availability of ancillary data influence the accuracy of classifications. In addition, classifications are more complicated and challenging in mountains because of the variable topography and potential for shadows (Shi et al 2009).
In mountainous areas, even small improvements in classification accuracy can be vital in detecting changes in vegetation, which tend to be prolonged and can thus go undetected in the short term; eg shrub encroachment on subalpine grassland can advance 2 m/y (Komac et al 2011b). At each step of the classification process, changes can be made to improvement classification accuracy. The type of training data used is important in the classification process, because it dictates the information that the algorithm will use to classify the vegetation. The training data must be mutually exclusive and exhaustive (Chuvieco and Huete 2009); therefore, selection should maximize the separability among categories and ensure that enough variability is retained so that the data are representative of each category. There is a tradeoff, because variability and separability are negatively correlated. Furthermore, the selection and addition of ancillary data can be critical steps for improving classification accuracy, especially in complex environments such as mountainous areas. In these areas, topography can influence the reflectance of the vegetation because of surface roughness, shadows, presence of rocks or bare soil, and steep slopes, which add inaccuracy to the values measured passively by satellite (Chuvieco and Huete 2009;Tso and Mather 2009;Balthazar et al 2012).
Methods of classifying vegetation have to be chosen, and numerous algorithms are available (Lu and Weng 2007;Tso and Mather 2009). In this study, we compared 2 classifiers, maximum likelihood (ML) and random forest (RF). The ML algorithm takes into account the variance and covariance of each category to be classified. When assigning each pixel to one of the categories, the algorithm assumes normality in the distribution of the values in each category. Therefore, given the mean and the covariance matrix of each category, the statistical probability is computed, and its ultimate category is the one that has the highest probability (ML) (Lillesand et al 2008). The ML classifier is commonly used in remote sensing (Lu and Weng 2007;Liu et al 2011), but it is a parametric classifier that is noise sensitive (Adams and Gillespie 2006).
The RF classifier is based on a combination of tree decision classifiers, in which each tree is generated using a random vector that is sampled independently of the training set of input variables (Breiman 2001). Although tree decision classifiers are independent of the distribution of the training data, the combination of the tree decision classifiers-which uses a majority voting scheme for each variable-from all trees provides a stable and stronger result (Tso and Mather 2009). The RF algorithm is used in ecology (Prasad et al 2006;Cutler et al 2007), remote sensing (Pal 2005;, and vegetation classification (Ham et al 2005;Gislason et al 2006;Chan and Paelinckx 2008;Wessels et al 2011;. Because it is not dependent on parametric restrictions, it is easy to fine tune and executes quickly (Breiman 2001;Pal 2005;Prasad et al 2006;Cutler et al 2007). In addition, RF is robust to collinearity and overfitting of the variables (Breiman 2001), which can be particularly useful in mountain areas, where topographic variables tend to be highly collinear. RF can thus improve vegetation classifications, because topography and vegetation are highly correlated (Franklin et al 1994;De la Riva 1997;Shi et al 2009).
The objective of this study was to identify the best combination of classification steps to improve the accuracy of land-cover classifications, using Landsat-5 TM images, from among the selection of training data, ancillary data, and classifier. The choice of a classifier (Rodriguez-Galiano et al 2012) and the selection of ancillary data (Rodriguez-Galiano and Chica-Olmo 2012) can significantly improve the accuracy of the classification, but the effects of the selection of training data and the combinations of the options selected in each step need to be improved.

Study area
The 25,970-ha study area included Ordesa and Monte Perdido National Park and the surrounding protected area but excluded the northwest portion of the Ara River and Lapazosa Cliff in the Central Pyrenees, Aragó n, Spain (42u369N, 0u009E) ( Figure 1). Elevation was between 650 and 3355 masl. The area has a mountain climate characterized by an equinoctial precipitation regime and a continental influence. Average annual precipitation is 1657 mm, and the average daily maximum and minimum temperatures (at 2200 m) are 8.6 and 1.4uC, respectively.

Vegetation categories
Vegetation in the study area is of 4 main physiognomic types ( Figure 2). These are strongly influenced by climate conditions and follow an elevation gradient: (1) high elevation (.2200 masl), sparse grassland (.60% rock cover); (2) dense grassland, which forms a continuum of vegetation cover that is interspersed with (3) shrubland at elevations below 2200 m; and (4) dense forests, which are below 1700 masl. Multiple plant communities are represented in each of the 4 categories. To identify the strata in the stratified sampling procedure, we used

Landsat-5 TM and multilayer composition
Since 1984, Landsat-5 has operated using the TM sensor, which was designed for thematic cartography and is useful for environmental monitoring (image spatial resolution 5 30 m; 120 m in the thermal infrared). The analysis used cloud-free images taken on 4 August 2007 by the European Space Agency, which were orthorectified in Erdas 9.2 using a digital elevation model (DEM) with a 20-m resolution and orthophotographs from 2006. In the 2006 photographs, much of the study area up to 2000 m was covered with snow; therefore, the analysis included orthophotographs taken in 1997 and both had 0.5-m spatial resolution. The root mean square error of the orthorectified image was ,0.5 pixels. An ATCOR 3 module was used to convert digital values to reflectance and remove atmospheric and topographic effects (based on the Lambertian assumption), and MODTRAN 4 was used to generate the lookup table. The correction was performed following the calibration parameters described by Chander et al 2009 for images taken after 31 December 1991, using 60-km scene visibility, a midland summer rural model for the solar region, a solar elevation angle of 57u, and a solar azimuth angle of 136u, as described in the image metadata. In ATCOR 3, a 10-m resolution DEM was used to correct errors caused by variations in elevation, slope, and aspect, because it was desirable to have a spatial resolution close to 0.25 times the pixel size (Geosystems 2009).
For the classification, we used 2 multilayers (Figure 3), which form a composite image that contains the bands that represent each of the variables included in the analysis. The layers differed in their spectral composition and use of ancillary data. Multilayer A included 6 of the 7 bands in the Landsat-5 TM images (all except the thermal band), and multilayer B included 14 layers: 7 bands from the Landsat-5 TM image, 2 vegetation indices (normalized difference vegetation index [NDVI] and soil adjust vegetation index [SAVI]), and 5 topographic features (elevation, slope, north-south aspect, east-west aspect, and insolation). NDVI 5 (r nir 2 r red)/(r nir + r red), where (r red) and (r nir) are the ground reflectance values of the red and near-infrared bands, respectively. SAVI 5 ([r nir 2 r red] * 1.5)/([r nir + r red] + 0.5) (Baret and Guyot 1991), which was derived from the ATCOR module. Topographic details were obtained from the 20-m resolution DEM; to obtain north-south and east-west orientations, aspects were transformed to cosine and sine, respectively (Felicísimo 1994). Potential insolation was calculated using ArcGis 9.3 software. Topographic information can complement remote sensing data because the rugged topography in mountainous areas has a strong influence on vegetation composition (Franklin et al 1994;Shi et al 2009).

Training data
The training data must be mutually exclusive and exhaustive (Chuvieco and Huete 2009). The 2 most common sampling strategies were used: expert supervised (see the section on selected training) and random (see the section on random training) (Figures 3, 4). Random sampling is the most exhaustive sampling procedure; it exhibits higher variability within vegetation categories and lower separability. Expert-supervised block selection is more exclusive and therefore exhibits lower variability and higher separability among vegetation categories. Transformed divergence values (TrDi) were used to quantify the separability between categories in the training data (Swain 1972). TrDi $ 1500 indicates acceptable separability (Haack et al 1987) and TrDi $ 1600 indicates good separability (Metternicht and Zinck 1998) between classes, with the maximum separability TrDi 5 2000. Distributions of the sources of the ST data (left) and the RT data (right) used for training the algorithms in supervised classifications. ST accounts for exclusivity, with lower data variance, and RT accounts for exhaustibility, with higher variance in the training data. from zones that were representative of each vegetation category in the study area and was influenced by the experience and knowledge of the observer (Figure 4). We used the following 2 versions of ST: N Total-block training (TBT): For each vegetation category, different numbers of training data were selected using a nonstratified design that included 4000 training data. This is the most common way of selecting training data in remote sensing research.
N Reduced-block training (RBT): RBT data were obtained using a random selection of 800 training data from the TBT and based on a stratified design that included 200 training data for each vegetation category.
Random training (RT): RT was based on a random selection of a single pixel. The vegetation category to which each selected pixel corresponded was identified using a vegetation map of the area (given earlier), inspection in the field, and orthophotographs ( Figure 4). The following 2 versions of RT were evaluated: N Random single-pixel training (RST): In RST, 800 pixels within the study area were selected using a nonstratified design.
N Randomly stratified single-pixel training (RSST): In RSST, 200 pixels per vegetation category were selected in a stratified design. For the stratification, we used the vegetation map described in the section on vegetation categories.

Selection of classifiers
A supervised classification includes a multilayer image, training data, and a classifier (Figure 3). For each vegetation category, classifiers identify a known amount of training data and assign the remainder of the pixels in the multilayer to a vegetation category based on their similarity in the reflectance (Chuvieco and Huete 2009).
In this study, we compared the ML method using Erdas 9.2 and the RF method using R from the randomForest package. ML is the most commonly used (standard) method (Liu et al 2011), and RF is one of the most promising methods for the classification of vegetation using remote sensing data (Gislason et al 2006). ML is a parametric method that is based on the probability that a pixel belongs to a given class, and the mean vector and the variance-covariance matrix are the 2 parameters that characterize each class (Adams and Gillespie 2006). The nonparametric RF classifier is among the treebased classifier methods that interactively use a hierarchical splitting mechanism to categorize the data. The RF can produce a large number of trees (ntree), each generated by selecting a subset of the sample (randomly drawn with replacement from the original dataset, bootstrapping), and uses a random set of predictors (mtry, the number of predictors used in each split). The trees that result are aggregated by averaging (Breiman 2001;Pal 2005). For the RF classifier, ntree and mtry are the tune parameters. The tuning parameters were adjusted by selecting an ntree in which the overall misclassification error stabilizes (ntree 5 500), testing different numbers of predictors, and selecting the one that reduces significantly the misclassification error using the tuneRF function from the randomForest R package (Liaw and Wiener 2002).

Accuracy assessment
The accuracy of the classifications was tested using samples taken following the RST procedure (see the earlier section) (Congalton 1988(Congalton , 1991. We selected 800 pixels, which provided more than the minimum number of samples per category recommended for accuracy assessment (50 samples per category) (Congalton 1991), specifically, 213 pixels for forest, 120 for shrub, 140 for dense grassland, and 327 for sparse grassland. To assess the accuracy of each vegetation classification, we used Cohen's kappa index (k) of agreement (Cohen 1960), which is the standard method in remote sensing research (Congalton et al 1983;Rosenfield and Fitzpatricklins 1986;Chen and Stow 2002;Congalton and Green 2009).

Statistical analysis
To assess the effects of the different procedures-ie type of multilayer, method of obtaining training data, and classification methods-on classification accuracy, 2 complementary analyses were performed: a pairwise comparison of k values and a multivariate comparison using generalized least squares (GLS) models (Zuur et al 2009). In the pairwise comparison, differences between each pair of independent kappa indices were tested statistically based on the standardized and normally distributed Z value (significant at p 5 0.05 when Z . 1.96) (Congalton and Green 2009), which identified the classification pairs that differed significantly.
A multivariate analysis compared the following procedures: multilayer composition type, sampling procedures for selecting training data, classification methods, and the interactions among them. GLS models were used because variance heterogeneity was detected in model residuals and differences identified by a multivariate analysis of variance model were statistically significant (Zuur et al 2009). The 3 procedures (multilayer and sampling types and classification methods) and all 2way interactions were included in the GLS. Sampling procedures were grouped into 2 categories (ST and RT), because optimal comparisons were obtained using categorical variables that had the same number of categories. The best model was identified based on forward and backward stepwise procedures using Akaike's information criterion (Zuur et al 2009). If a procedure or interaction term was significant, the factor was evaluated using analysis of variance (ANOVA) and a Tukey test. To identify the differences among the training data types in each layer used in the classification, spectral curves were created for each vegetation category using the mean percentage of reflectance and standard deviation of the different types of training data for 6 Landsat bands and the mean and standard deviation of the values of the ancillary data (NDVI and elevation).

Results
In Ordesa and Monte Perdido National Park and the surrounding protected area in the Spanish Central Pyrenees, the accuracy (k 5 0.83-0.94) of a supervised classification of the vegetation improved by 4-11%, depending on the choices made at each step ( Figure 5). Sampling type (ST or RT) had the greatest effect on accuracy, and classification method (ML or RF) had the least effect (Table 1). The type of multilayer had an intermediate effect, and the interaction between multilayer type and classification method was marginally significant (Table 1).
RT produced significantly higher k values than did ST (ANOVA F 5 2.6, P , 0.05), even though the separability of RT (as reflected by the TrDi value) was lower than that of ST. The accuracy of the classification that used multilayer A (which included the Landsat-5 TM spectral bands only) was independent of the classification method (ML or RF); however, when topographic information and vegetation indices were included in the multilayer (multilayer B), the RF classifier was significantly better than was the ML classifier (Tukey post-hoc test T 5 2.9, P , 0.05). The RF classification method produced significantly higher k values than did ML (ANOVA F 5 5.09, P , 0.05). The combination of RT data, multilayer B, and the RF method had the highest accuracy ( Figure 5). In this case, the results based on stratified (RSST) and nonstratified (RST) RT data did not differ significantly ( Figure 5).
Based on the k values of each vegetation category, the RF algorithm produced stable, robust results (Table 2). Differences among sampling types (ST and RT) were never .10% (columns TBT-RST and RBT-RSST, Table 2). The use of RT sampling data with the 2 multilayers provided a small but consistent improvement ( Table 2). The ML, however, produced results that were sensitive to changes in the type of sampling for either multilayer A or multilayer B (differences . 10%; columns TBT-RST and RBT-RSST, Table 2). Multilayer A produced the highest accuracies using ST data (TBT or RBT) (except for sparse grassland), while with multilayer B, the highest accuracies were obtained using RT data (RST or RSST) ( Table 2).
The 2 types of ST (TBT, RBT) and the 2 types of RT (RST, RSST) had similar spectral curves ( Figure 6); therefore, ST and RT were compared. In most cases, and especially in forest and sparse grassland, the RT data had higher variability or standard deviations than did the ST data ( Figure 6). The spectral signatures of the forest, shrubland, and grassland were typical of these vegetation types ( Figure 6A-D), and they all showed limited (,12%) reflectance in the visible bands (highest in sparse grassland). Increases in the reflectance of all vegetation types were detected in the near-infrared (NIR) region (band 4), and the increase was highest in dense grassland (up to 35%). Reflectance in the shortwave infrared (SWIR) region (bands 5 and 7) tended to decline because of an increase in radiance absorption by the water contained in plants. In sparse grassland, however, reflectance values in NIR and SWIR were 25-30% because of the high proportion of bare soil. In SWIR bands, the reflectance values from the RT data were higher than those from the ST data, and in NIR, except forests, the values from the RT data were lower than those from the ST data. In the visible region (bands 1-3), the performances of the 2 sampling types were similar.
An analysis of the training values of the ancillary data ( Figure 6E, F) showed that the RT data had more contrasted values than did the ST data, eg for elevation, which increased the separability among categories, even though the RT data had higher variability than did the ST data. Except for shrubland, the RT data had the lowest NDVI. Like Landsat bands 2, 3, 5, and 7, elevation and NDVI were the most important layers for the RF algorithm in the supervised classification when multilayer B was used. The combination of multilayer type and classification method, which was marginally significant (Table 1), was examined further because of the differences between the final maps obtained using multilayer A or B with the RF algorithm and RT data (RST) (Figure 7). In most cases, the pure pixels (which represented only 1 of the 4 vegetation categories) were classified correctly, and were similar with the 2 strategies (Figure 7). When the pixels included a combination of vegetation categories, the 2 classification procedures often produced different results. When multilayer A was used, at high elevations-where there are little more than sparse grassland-some shadows and steep slopes were classified as shrub (Figure 7). These errors were corrected by including topographic variables such as ancillary data in the multilayer (multilayer B).

Discussion
In Ordesa and Monte Perdido National Park and the surrounding protected area in the Spanish Central Pyrenees, the modest improvements (k 5 0.83-0.94) in the accuracies of the classification strategies might be essential for detecting changes in the vegetation in mountainous areas, particularly woody encroachment and loss of biodiversity (Dirnbö ck et al 2003;Lasanta-Martínez et al 2005;Komac 2010). The improvements came from considering options at all crucial steps in the classification procedure (Lu and Weng 2007), rather than the classification method only (Landgrebe 2003;Cutler et al 2007;Tso and Mather 2009). Conditional k values for each vegetation type and the overall k (in bold), for the supervised classifications, using combinations of procedures that included 2 multilayer types (A and B), 2 methods of classification (ML and RF), and 4 sampling types, 2 of which not stratified, TBT (ST) and RST (RT), whereas the other 2 were stratified, RBT (ST) and RSST (RT). The third and sixth columns show a subtraction between the k of the sampling types (columns 1, 2, 4, and 5).  In our study, the selection of training data was the most important factor in increasing the accuracy of a supervised classification (Hixson et al 1980;Gog and Howarth 1990), and RT was the best sampling type (Gog and Howarth 1990;Gong et al 1996). The nonstratified RST provided slightly, but not significantly, better results than did the stratified RSST. Thus, stratification might not be recommendable except when a vegetation map exists and the vegetation categories are automatically identified, which is less time consuming.

TBT-RST
The RT data provided high variation (as reflected in the standard deviation of the data), which can help to better define all vegetation communities, but it created overlap in the training data values of the 4 vegetation types, thus reducing the separability of those categories. In addition, training data that have high variability, such as the RT data, are better classified using nonparametric classifiers such as the RF classifier (Cortijo and De La Blanca 1997).
In our study, the second most important factor influencing the accuracy of the classifications was the multilayer compositions used. In mountainous areas, the inclusion of topographic information can improve classification accuracy (Franklin et al 1994;Hansen et al 1996;Gislason et al 2006;Shi et al 2009), because topography strongly influences vegetation composition. The benefits of the topographic layers were greatest in areas of mixed vegetation, in which case, with multilayer A (Landsat information only), information was insufficient to distinguish among the vegetation categories. In this case, the use of the topographic information (multilayer B) might improve the classification by adding more information that better defines the classification of each pixel. But too many layers can reduce classification accuracy (Price et al 2002) because not all classifiers work well with a large number of layers or correlated layers; in that case, nonparametric classifiers might be more appropriate (Lu and Weng 2007). That might be why, when we used the ancillary information in the multilayer, the RF classifier improved accuracy significantly; however, when the ML method was used, the improvement was not as pronounced as before.
The ML and other single-stage classifier methods are used extensively in remote sensing studies, but they can produce results that are less accurate than are those derived by multistage iterative classifiers FIGURE 7 An example from the study area showing the differences between 2 of the methodologies used, differing in the multilayer used (A or B). In higher altitudes where there are no shrub communities, multilayer B lead to better classification of the mixed zones, taking into account the topographic data.
(San Miguel-Ayanz and Biging 1996; Friedl and Brodley 1997;Muchoney et al 2000;Rogan et al 2002;Shi et al 2009). In some cases, RF multistage classifiers had advantages and were more accurate (Na et al 2009;Waske and Braun 2009; than was the ML classifier. The RF classifier is a nonparametric classifier, robust to outliers and noise, in which the most effective feature is selected sequentially in each node-splitting stage, and it can be used with layers that have high collinearity (Breiman 2001); thus, it is suited to the classification of complex areas. The ML classifier is appropriate when there is enough information for the classification, few and noncorrelated layers, and the variability of the training data is low (eg ST data). In mountain areas, however, such conditions are infrequent and RF or other multistage classifiers are highly recommended.

Conclusions
This study aimed to optimize the outcome of the vegetation classifications derived from remote sensing images in mountain areas. We assessed the 3 main steps required for this procedure (selection of the training data, the ancillary data, and the classifier), evaluating the most widely used and promising options for each step. All analyses were applied to a case study area in a national park within the Spanish Pyrenees, representative of a mountain scenario. Each of the 3 steps played a significant role in improving the accuracy of the classification, and their relative importance was not influenced by the selection of the classifier. The results highlight the need to select more than 1 step to improve the accuracy of vegetation classifications in mountainous areas because of the complexity of the topography and its consequences, such as undesired shadows.
More research on vegetation classifications in other mountainous areas is required to identify the role of each classification step in improving accuracy. Ultimately, improvements in accuracy would improve the ability to detect change, which is directly related to our ability to manage and preserved these important and complex ecosystems.
The specific combinations of parameters and procedures used in this research improved the accuracies of vegetation classifications by 4-11%. RT data coupled with multilayers derived from Landsat-5 TM images and ancillary data, and a RF classifier method produced the highest accuracy for the supervised classification of vegetation. Some standard protocols for the classification of Landsat images include important steps that dictate the way in which the data are analyzed. The choice of sampling design, the use of ancillary data, and to a lesser extent, the choice of classifiers influenced the accuracies of vegetation classifications, especially in mountainous areas, where errors caused by a heterogeneous topography are intrinsically high.

ACKNOWLEDGMENTS
The Spanish government (Configuración Espacial de la Biodiversidad y Conservación del Ecosistemas project, I+D+I. CGL2008-00655/BOS, Ministry of Science and Innovation) and the European Community (Land and Ecosystem Degradation and Desertification: Assessing the Fit of Responses project, FW7 ENV.2009.2.1.3.2) funded this study. In addition, we thank the managers of Ordesa and Monte Perdido National Park for providing information and assistance. For assistance with the bibliographic search, we thank Cristina Pérez de Larraya. The suggestions of IC Barrio considerably improved earlier versions of the paper. We thank Bruce MacWhirter for his critical reading of the manuscript and for providing helpful suggestions. Finally, we thank the anonymous reviewers for their helpful comments.