Simulating highly disturbed vegetation distribution: the case of China’s Jing-Jin-Ji region

Background. Simulating vegetation distribution is an effective method for identifying vegetation distribution patterns and trends. The primary goal of this study was to determine the best simulation method for vegetation in an area that is heavily affected by human disturbance.
Methods. We used climate, topographic, and spectral data as the input variables for four machine learning models (random forest (RF), decision tree (DT), support vector machine (SVM), and maximum likelihood classification (MLC)) on three vegetation classification units (vegetation group (I), vegetation type (II), and formation and subformation (III)) in Jing-Jin-Ji, one of China’s most developed regions. We used a total of 2,789 vegetation points for model training and 974 vegetation points for model assessment.
Results. Our results showed that the RF method was the best of the four models, as it could effectively simulate vegetation distribution in all three classification units. The DT method could only simulate vegetation distribution in units I and II, while the other two models could not simulate vegetation distribution in any of the units. Kappa coefficients indicated that the DT and RF methods had more accurate predictions for units I and II than for unit III. The three vegetation classification units were most affected by six variables: three climate variables (annual mean temperature, mean diurnal range, and annual precipitation), one geospatial variable (slope), and two spectral variables (mid-infrared ratio of winter vegetation index and brightness index of summer vegetation index). Variable Combination 7, including annual mean temperature, annual precipitation, mean diurnal range, and precipitation of the driest month, produced the highest simulation accuracy.
Conclusions. We determined that the RF model was the most effective for simulating vegetation distribution in all classification units present in the Jing-Jin-Ji region. The RF model produced high accuracy vegetation distributions in classification units I and II, but relatively low accuracy in classification unit III. Four climate variables were sufficient for vegetation distribution simulation in such a region.


INTRODUCTION
Vegetation is an essential component of terrestrial ecosystems and landscapes (Editorial Committee of Vegetation Map of China, 2007). Environmental research, resource management, and conservation planning require vegetation distribution maps (Franklin, 2010) to better understand, use, and monitor vegetation. Vegetation patterns and distributions are affected by the climate (Chen et al., 2015; Zhang et al., 2018) and other disturbances, particularly those caused by changes in land use (Hansen et al., 2013; Wehkamp et al., 2018). Human disturbances, such as industrialization, urbanization, population growth, and the conversion of land to agricultural use, strongly influence the environment by greatly altering vegetation patterns, making exact mapping a significant challenge (Xie, Sha & Yu, 2008; Zhou et al., 2016).
Field surveys, the traditional method used to map vegetation, are costly and labor-intensive (Newell & Leathwick, 2005; Zhou et al., 2016). Mapping using remote sensing data is also a popular method that has been used over the last 30 years (Xie, Sha & Yu, 2008). This method makes it possible to obtain a wide range of reliable data from remote sensing images, and vegetation boundaries are updated through visual interpretation of the images combined with field surveys (Zhang et al., 2008). However, determining vegetation units and their boundaries by visual interpretation can produce inaccurate results. Researchers may get different results from the same images for the same study areas (Bie & Beckett, 1973; Pfeffer, Pebesma & Burrough, 2003). Furthermore, in both field survey and remote sensing methods, vegetation unit boundaries are drawn manually based on climate, elevation, and soil type information, which can be inaccurate in transition areas (Zhang et al., 2008). Using simulation models in combination with field and remote sensing data may be an effective alternative for mapping vegetation.
Changes in the environment can affect vegetation composition, structure, function, and spatial distribution. Environmental variables have been used to simulate the global distribution of vegetation (Dilts et al., 2015;Mod et al., 2016). Simulation models are usually developed to test how environmental variables control vegetation distribution (Guisan & Zimmermann, 2000). Modern remote sensing data and software make it more convenient than ever before to produce predictive vegetation maps (Franklin, 1995).
Predictive vegetation mapping uses environmental variables and various models based on niche theory and gradient analysis to visualize communities in geographic space (Dilts et al., 2015;Lany et al., 2019). Other methods based on statistics and machine learning have also been used to simulate vegetation distribution. Predictive vegetation mapping includes various statistical methods such as the generalized linear model, the generalized additive model, and multivariate statistical approaches (Lany et al., 2019;Prasad, Iverson & Liaw, 2006). Recently, machine learning modeling methods have been used to map the distribution of both vegetation communities and individual species. These methods include the support vector machine (SVM), decision tree (DT), and artificial neural network (Guisan & Zimmermann, 2000;Hastie, Tibshirani & Friedman, 2009;Zhou et al., 2016). These machine learning models have fewer limitations and can produce more reliable results than traditional vegetation modeling methods (Hastie, Tibshirani & Friedman, 2009). Advanced machine learning techniques can integrate spectral and spatial predictors and improve classification accuracy by retaining important information about vegetation composition and structural differences (Sesnie et al., 2010). Machine learning models efficiently and cost-effectively produce vegetation maps without the general inaccuracies caused by visual interpretation (Franklin, 2010).
The Jing-Jin-Ji region, also known as the Beijing-Tianjin-Hebei urban agglomeration, is the political, cultural, and economic center of northern China. Because of its extent, it faces significant problems such as unbalanced regional development and the conflict between economic growth and limited resources. The region's larger cities, including Beijing and Tianjin, have large populations, developed economies, and abundant educational resources. However, these big cities face issues of limited natural resources and serious ecological and environmental pollution. In particular, Beijing's large population places heavy demands on limited resources such as water, land, and vegetation (Wang & Gong, 2018). Breaking up administrative divisions may be the best method to coordinate regional development (Wang et al., 2019). The new Xiong'an area located in Hebei province is being constructed to relocate some of Beijing's population. The development of areas like Xiong'an is affected by the surrounding natural environment. To better integrate the environmental carrying capacity and socioeconomic development of the Jing-Jin-Ji region, including the new Xiong'an area, accurate vegetation maps with adequate temporal resolution are necessary. The most recent vegetation map of the Jing-Jin-Ji region is the Vegetation Map of the People's Republic of China (VMC), with a scale of 1:1,000,000 (Editorial Committee of Vegetation Map of China, 2007). Most of its data come from a field survey conducted between 1980 and 1990, meaning that it is outdated in both temporal and spatial terms.
In this study, we integrated geospatial, climate, and spectral data to simulate vegetation distribution through four different models across three vegetation classification units. This study differs from that of Zhou et al. (2016) in three respects. First, our study area was the Jing-Jin-Ji region, which is located in the North China Plain and affected by high socioeconomic disturbance, whereas the Qilian Mountains studied by Zhou et al. (2016) are characterized by complex terrain but not by high socioeconomic disturbance. Second, the predictive variables and their combinations differed from those of Zhou et al. (2016). Third, we compared four models for simulating vegetation distribution at three vegetation classification levels, whereas Zhou et al. (2016) used only three models at two classification levels. Our primary objectives were to: (1) determine the best modeling method for vegetation affected by high socioeconomic disturbance, (2) create an improved vegetation map of the Jing-Jin-Ji region, (3) determine the predictive abilities of different models across different vegetation classification units, and (4) determine which variables enhanced the classification accuracy for vegetation mapping.

Study area
The Jing-Jin-Ji region is located in the northern part of the North China Plain. It extends from 113°04′ to 119°53′E and from 36°01′ to 42°37′N and is bordered by Taihang Mountain in the west, Yanshan Mountain in the north, and the Bohai Sea in the east. The region includes Beijing, Tianjin, and Hebei Province (Fig. 1). Jing-Jin-Ji has a population of approximately 110 million people and covers an area of approximately 216,000 km² (Wang et al., 2019). The region is a temperate monsoon climate zone with an elevation range of −14 to 2,837 m (Fig. 1). The annual precipitation ranges from 305 to 711 mm, with increased precipitation at lower altitudes. The annual mean temperature ranges from −3 to 14 °C, with colder averages at higher elevations. The amount of precipitation in the region gradually decreases moving from the southeast to the northwest, while temperature changes show the reverse pattern.

Vegetation and training data
The VMC, completed in 2007 based on field survey data, included eight vegetation groups (I), 15 vegetation types (II), and 75 formations and subformations (III) from the Jing-Jin-Ji region. However, some of the map's vegetation unit areas are very small and difficult to distinguish. To ensure that enough training and assessment point data could be randomly selected in units II and III, we selected eight units I, 12 units II, and 39 units III from the study area (Table 1). Cultivated vegetation is mainly distributed in low areas with an altitude range of −14 to 254 m and an annual mean temperature range of 7 to 14 °C. Major cultivated plants include winter wheat and coarse grains. Scrub and grass-forb communities are mainly distributed in the north, at elevations ranging from 254 to 1,440 m.
We obtained model training and assessment data on vegetation composition from field surveys and other publications. We collected a total of 3,763 vegetation points, with 2,789 of those used for model training and 974 used for model assessment. Each unit III had at least 80 vegetation points, with at least 60 of those used for model training and 20 used for model assessment. The model training and assessment data were randomly selected for each unit III. Additionally, we increased the credibility of the model assessment by first rasterizing the vector VMC onto the same grid as the modeled data, and then assessing the data using the Kappa coefficient (Landis & Koch, 1977; Weng & Zhou, 2006; Zhou et al., 2016).
1995; Cohen & Goward, 2004; Zhou et al., 2016). We tested the vegetation discrimination of 14 vegetation indices (Table 2).

Table 2 The vegetation indices.
Indices    Abbreviation    Formula
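The random per-unit selection of training and assessment points described under "Vegetation and training data" can be illustrated with a minimal sketch. It assumes that exactly 20 points per unit III are held out for assessment and that the rest are used for training; the text only states "at least 60" and "20", so the exact allocation, the function name `split_points`, and the fixed seed are assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_points(points, n_assess=20, min_train=60, seed=0):
    """Randomly split vegetation points into training and assessment sets.

    points: list of (point_id, unit_label) tuples, one per vegetation point.
    Returns (train, assess). Holding out exactly n_assess points per unit
    is an assumption of this sketch, not a detail stated in the text.
    """
    by_unit = defaultdict(list)
    for point in points:
        by_unit[point[1]].append(point)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, assess = [], []
    for unit in sorted(by_unit):
        members = by_unit[unit][:]
        if len(members) < n_assess + min_train:
            raise ValueError(f"unit {unit!r} has too few points")
        rng.shuffle(members)
        assess.extend(members[:n_assess])
        train.extend(members[n_assess:])
    return train, assess
```

For example, two hypothetical units with 80 and 90 points would yield 40 assessment points and 130 training points, with no point appearing in both sets.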
To determine the distribution predictive ability of different variables, we grouped the variables into different combinations based on the results of the Pearson correlation. We only used less correlated variables (|r| < 0.7, Pearson correlation) (Chala et al., 2017) in Combinations 1-9 (Table 3), then used the variable combinations as input predictor variables to simulate vegetation distribution. Combination 1 included the less correlated variables of the summer land surface albedos from bands 1 to 7. Combination 2 included the less correlated variables of the winter land surface albedos from bands 1 to 7. Combination 3 included the less correlated variables in Combinations 1 and 2. Combination 4 included

Table 3 Variable combinations.
Number    Variables
1    Summer land surface albedos of bands 1 and 5.
2    Winter land surface albedos of bands 1 and 6.
3    Summer land surface albedos of bands 1 and 5; winter land surface albedos of bands 1 and 6.
Note: DT10 and RF10 represent the top 10 important variables of the decision tree (DT) and random forest (RF) methods with Combination 9 at the vegetation group level, respectively. The vegetation indices and their abbreviations are shown in Table 2.
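The correlation screening used to build the variable combinations can be sketched as follows. This is only an illustration: the text gives the threshold (absolute Pearson correlation below 0.7) but not the rule for choosing which variable of a correlated pair to keep, so the greedy first-come order used here is an assumption, as are the function names and the sample values.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)


def select_less_correlated(variables, threshold=0.7):
    """Keep variables whose |r| with every already kept variable is below threshold.

    variables: dict mapping a variable name to its values at the sample points.
    Variables are visited in insertion order (an assumption of this sketch).
    """
    kept = []
    for name, values in variables.items():
        if all(abs(pearson_r(values, variables[other])) < threshold
               for other in kept):
            kept.append(name)
    return kept


# Hypothetical albedo values at five sample points.
bands = {
    "band1": [1, 2, 3, 4, 5],
    "band2": [2, 4, 6, 8, 10.5],   # almost collinear with band1
    "band5": [5, 1, 4, 2, 3],      # weakly correlated with band1
}
selected = select_less_correlated(bands)
```

With these made-up values, `band2` is dropped because it is nearly collinear with `band1`, while `band5` is retained.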

Vegetation distribution models
We used DT, RF, MLC, and SVM vegetation distribution models in this study. The DT model is a divisive, monothetic, and supervised classifier often used for species distribution modeling and related applications (Franklin, 2010). It is computationally fast and easy to understand and implement. It uses classification or regression algorithms to generate classification rules, and then visualizes those rules as simple tree graphics (Hastie, Tibshirani & Friedman, 2009; Zhou et al., 2016). The DT model calculates the most significant variables contributing to the model (Deng, 2010). We used a DT with five layers, 40 samples in the smallest parent node, and 10 samples in the smallest child node.
The RF model is an ensemble method that has been applied in risk assessment and species distribution modeling studies (Cutler et al., 2007; Zhang & Dong, 2017). The RF model creates and combines different DTs to produce considerably more accurate classifications that are largely unaffected by noise or overtraining (Burai et al., 2015; Cutler et al., 2007; Gislason, Benediktsson & Sveinsson, 2006). The RF model also calculates the most significant variables that contribute to the model (Cutler et al., 2007). Running an RF model requires defined parameters, including tree number, number of randomly selected features, and node impurity function. We generated the RF model in EnMAP-Box, a license-free and platform-independent software interface designed to process hyperspectral remote sensing data, which was developed by the Humboldt University of Berlin. EnMAP-Box includes built-in applications for processing hyperspectral data, such as SVM and RF for the classification of image data. We used the default settings in EnMAP-Box with 100 trees. The number of randomly selected features was equal to the square root of the number of all features, and we used the Gini coefficient as the node impurity function (Ma, Gao & Gu, 2019; van der Linden et al., 2015; Zhou et al., 2016).
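To make the tree parameters above concrete, the following sketch shows a generic Gini-based split search on a single variable with the stated node-size limits (40 samples in a parent node, 10 in a child node) and the square-root rule for the number of features tried per split. It is not the EnMAP-Box implementation; the function names are hypothetical, and rounding the square root down is an assumption.

```python
import math
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def n_features_per_split(n_features):
    """Square-root rule; rounding down is an assumption of this sketch."""
    return max(1, math.isqrt(n_features))


def best_threshold(values, labels, min_parent=40, min_child=10):
    """Find the split of one variable with the lowest weighted Gini impurity.

    Returns (threshold, weighted_impurity), or None when the node is smaller
    than min_parent or no split leaves min_child samples on both sides.
    """
    n = len(values)
    if n < min_parent:
        return None
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(min_child, n - min_child + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold separates two equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini_impurity(left)
                 + len(right) * gini_impurity(right)) / n
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or score < best[1]:
            best = (threshold, score)
    return best
```

In a random forest, each tree would repeat a search of this kind on a bootstrap sample, considering only `n_features_per_split` randomly chosen variables at every node, and the final class would be chosen by majority vote across trees.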
The MLC model is one of the most commonly used supervised image classification methods. MLC's classification rules use the statistics of the Gaussian probability density function to assign each pixel to the class with the highest probability. Although the MLC method usually generates similar or more accurate classifications than other methods, it is not applicable when there are fewer training samples than input predictors (Burai et al., 2015;Zhou et al., 2016).
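The maximum likelihood rule can be illustrated with a deliberately simplified sketch that uses a single predictor and equal class priors; both simplifications, the class names, and the sample values are assumptions made for illustration. A full MLC works with a mean vector and covariance matrix per class, and that covariance matrix cannot be inverted when a class has fewer training samples than predictors, which is consistent with the limitation noted above.

```python
import math


def fit_gaussians(samples):
    """Estimate (mean, variance) of one predictor for every class.

    samples: dict mapping a class label to its list of training values.
    """
    params = {}
    for label, values in samples.items():
        n = len(values)
        if n < 2:
            raise ValueError(f"class {label!r} needs at least two samples")
        mean = sum(values) / n
        variance = sum((v - mean) ** 2 for v in values) / (n - 1)
        if variance == 0:
            raise ValueError(f"class {label!r} has zero variance")
        params[label] = (mean, variance)
    return params


def log_likelihood(x, mean, variance):
    """Log of the Gaussian probability density at x."""
    return -0.5 * math.log(2 * math.pi * variance) - (x - mean) ** 2 / (2 * variance)


def classify(x, params):
    """Assign x to the class with the highest Gaussian likelihood."""
    return max(params, key=lambda label: log_likelihood(x, *params[label]))


# Hypothetical annual mean temperatures (°C) for two vegetation classes.
params = fit_gaussians({
    "cultivated": [12.0, 13.0, 14.0],
    "scrub": [4.0, 5.0, 6.0],
})
```

With these made-up values, a pixel with a value of 12.5 is assigned to the hypothetical class "cultivated" and a pixel with a value of 5.5 to "scrub".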
The SVM model is a supervised machine learning model used for classification and regression. It is a complex and widely used method that can output more accurate predictions (Burai et al., 2015) than other methods. The SVM model searches for an optimal plane in a multidimensional space to divide all sample elements into two categories, making the distance between the closest points in the two classes as large as possible (Kabacoff, 2016). Running an SVM model requires a defined kernel parameter g and regularization parameter c. In this study, we generated the SVM model in EnMAP-Box. We applied the EnMAP-Box default settings for the SVM model, in which the parameter g ranged from 0.01 to 1,000 and the parameter c ranged from 0.1 to 1,000. Parameters g and c were tested using a grid search with internal performance estimation, and we used those with the best performance for data training (Lin et al., 2014; van der Linden et al., 2015).
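The grid search over g and c can be sketched in outline. The sketch assumes the two ranges are sampled in powers of ten, which the text does not state, and it replaces the internal performance estimate with a made-up scoring function (`toy_score`) because training an SVM is outside the scope of a short example. All names here are hypothetical.

```python
import math
from itertools import product


def log_grid(low_exp, high_exp):
    """Powers of ten from 10**low_exp to 10**high_exp, inclusive."""
    return [10.0 ** e for e in range(low_exp, high_exp + 1)]


def grid_search(score, g_values, c_values):
    """Return (g, c, score) for the pair with the highest score.

    score: callable taking (g, c); in practice it would return an internal
    performance estimate such as cross-validated accuracy.
    """
    best = None
    for g, c in product(g_values, c_values):
        value = score(g, c)
        if best is None or value > best[2]:
            best = (g, c, value)
    return best


def toy_score(g, c):
    """Made-up stand-in for a performance estimate, peaking at g=1 and c=10."""
    return -(math.log10(g) ** 2) - (math.log10(c) - 1) ** 2


g_values = log_grid(-2, 3)   # 0.01 to 1,000
c_values = log_grid(-1, 3)   # 0.1 to 1,000
best_g, best_c, best_value = grid_search(toy_score, g_values, c_values)
```

Under the power-of-ten assumption the grid contains 6 values of g and 5 values of c, so 30 parameter pairs are evaluated.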
We generated the predicted vegetation maps of the three classification units using the DT, RF, MLC, and SVM methods with a resolution of 500 m. We selected all 11 variable combinations as the input variables for each method. The DT and RF method results indicated which variables were most important for vegetation discrimination.

Model assessment
We used the VMC and a total of 974 vegetation points to assess the overall accuracy and Kappa coefficient of every predicted vegetation map. Kappa coefficient values ranging from 0.4 to 0.55 indicated moderate agreement, from 0.56 to 0.8 indicated substantial agreement, and from 0.81 to 1 indicated almost perfect agreement (Landis & Koch, 1977;Weng & Zhou, 2006;Zhou et al., 2016). When the Kappa coefficient value was greater than 0.4, the assessed predicted map was considered acceptable.
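The two assessment statistics can be computed from a confusion matrix as in the sketch below. The Kappa coefficient is the standard Cohen's Kappa, (p_o − p_e) / (1 − p_e), where p_o is the overall accuracy and p_e is the agreement expected by chance. The agreement labels follow the ranges given above; treating the small gaps between ranges (for example between 0.55 and 0.56) as belonging to the lower class is an assumption of this sketch, as are the function names.

```python
def accuracy_and_kappa(matrix):
    """Overall accuracy and Cohen's Kappa from a square confusion matrix.

    matrix[i][j] is the number of assessment points of reference class i
    that the model assigned to class j.
    """
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    k = len(matrix)
    observed = sum(matrix[i][i] for i in range(k)) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    if expected == 1.0:
        raise ValueError("Kappa is undefined when chance agreement is 1")
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


def agreement_label(kappa):
    """Map a Kappa value to the agreement classes used in this study."""
    if kappa >= 0.81:
        return "almost perfect"
    if kappa >= 0.56:
        return "substantial"
    if kappa >= 0.4:
        return "moderate"
    return "not acceptable"
```

For a hypothetical two-class matrix [[40, 10], [10, 40]], the overall accuracy is 0.8, the chance agreement is 0.5, and the Kappa coefficient is 0.6, which falls in the "substantial" range.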

Unit I modeling and assessment
The RF model's results were better than the results of the DT, MLC, and SVM models (Table 4). The RF model had a Kappa coefficient larger than 0.4 when using variable Combinations 6 to 11 assessed by field point data, with an overall accuracy of 50% to 72%. The RF model had a Kappa coefficient larger than 0.56 when using variable Combinations 7 to 11 assessed by field point data, with an overall accuracy of 68% to 72%. The RF model had the highest Kappa coefficient of 0.66 and the highest overall accuracy of 72% when using variable Combination 7. The DT model had a Kappa coefficient larger than 0.4 when using variable Combinations 7 to 11 assessed by field point data, with an overall accuracy of 54% to 56%. The DT model did not reach a Kappa coefficient larger than 0.56 with any variable combination. When assessed against the VMC, the highest Kappa coefficient was 0.38, with an overall accuracy of 57%, for the RF model using variable Combinations 9 to 11 (Table 4; Fig. 2).

Unit II modeling and assessment
The RF model results were better than the results of the other three models. The RF model using variable Combinations 7 to 11 had a Kappa coefficient larger than 0.4, with overall accuracies of 66%-70% and 54%-55% for field point data and VMC assessments, respectively. The RF model using Combinations 7 to 11 had a Kappa coefficient larger than 0.56 and an overall accuracy of 66%-70% when assessed by field point data. The RF model had the highest Kappa coefficient of 0.65 and the highest overall accuracy of 70% when using variable Combination 7. The DT model using variable Combinations 7 to 11 had a Kappa coefficient larger than 0.4, with overall accuracies of 53%-55% and 65%-72% for field point data and VMC assessments, respectively. The DT model had the highest Kappa coefficient of 0.54 and overall accuracy of 72% when using variable Combination 7. When assessed by the VMC, the DT model had a larger Kappa coefficient and greater overall accuracy than the RF model (Table 5; Fig. 3).

Unit III modeling and assessment
Only the RF model could simulate vegetation distribution in unit III. The RF model using variable Combinations 7 to 11 had a Kappa coefficient larger than 0.4 and an overall accuracy of 55%-58% assessed by field point data. The RF model using variable Combination 7 had the highest Kappa coefficient of 0.57 (the only model with a Kappa coefficient larger than 0.56) and the highest overall accuracy of 58% assessed by field point data. The Kappa coefficients in all models were less than 0.4 when assessed by the VMC (Table 6; Fig. 4). The abbreviations are the same as in Table 4.
[Figure 2. ... Table 1. Full-size DOI: 10.7717/peerj.9839/fig-2]

Important variables
For the RF model, eight of the top 10 most important variables were the same across the different vegetation units: three climate variables (annual mean temperature, mean diurnal range, and annual precipitation), one geospatial variable (slope), and four spectral variables (mid-infrared ratio and NDVI of winter vegetation index, brightness index and NDVI of summer vegetation index). For the DT model, nine of the top 10 most important variables were the same across the different vegetation units: four climate variables (annual mean temperature, mean diurnal range, precipitation of the driest month, and annual precipitation), one geospatial variable (slope), and four spectral variables (mid-infrared ratio of winter vegetation index, brightness index of summer vegetation index, summer surface albedo of band 1, and winter surface albedo of band 6) (Table 7).

Vegetation classification units
Vegetation classification is an important and complex system with multiple levels. Higher level classification methods not only accurately classify vegetation, but they can also describe ecosystem diversity, even during global changes (Faber-Langendoen et al., 2014). Plants in different vegetation classification units have different spectral characteristics and climatic conditions that are the basis for vegetation distribution simulation. Thus, models using the same variables to simulate the vegetation distribution of different classification units may produce different classification accuracies (Dobrowski et al., 2008; Prasad, Iverson & Liaw, 2006). Map accuracy has been found to be a function of which classification system and categories are used (Muchoney et al., 2000). Previous studies have explored vegetation distribution simulation using different vegetation classification systems. Plant functional types (PFTs), defined as plant sets sharing similar perturbation response effects on dominant ecosystem processes, have been used to simulate vegetation distribution, as seen in the Biome and Box system models (Box, 1981; Box, 1996; Dormann & Woodin, 2002) with positive simulation results (Box, 1981; Song, Zhou & Ouyang, 2005; Weng & Zhou, 2006). The Mapped Atmosphere-Plant-Soil System (MAPSS) model was also used to simulate vegetation distribution using vegetation life forms, leaf area index, leaf morphology, and leaf longevity (Zhao et al., 2002). Other researchers studied potential vegetation distribution using the Holdridge life zone model, with positive vegetation pattern results (Zheng et al., 2006). When the IGBP classification system was applied to simulate vegetation distribution at a regional scale, the map estimate accuracy was upwards of 80% (Muchoney et al., 2000). In this study, we used machine learning models and a hierarchical classification system from the VMC to determine the best modeling method for vegetation affected by high socioeconomic disturbance at various classification levels. In the VMC, unit I was the highest classification level, mainly based upon community appearance; unit II was the second highest level, mainly based upon community and climate appearance; and unit III was the medium classification level, based upon the dominant species. The accuracy of the vegetation distribution simulations in units I and II was similar to each other and higher than unit III's simulation (Tables 4-6).
[Figure 3. ... Table 1. Full-size DOI: 10.7717/peerj.9839/fig-3]
[Figure 4. ... Table 1. Full-size DOI: 10.7717/peerj.9839/fig-4]

Different model performances
We were interested in vegetation distribution modeling's ability to forecast and respond to environmental changes and vegetation pattern management at local to global scales. Vegetation distribution predictions can help explain the relationship between plants and their abiotic and biotic environments (Franklin, 2010). To benefit from ecosystem service functions, people can design vegetation distributions according to distribution and abundance patterns and trends (Hastie, Tibshirani & Friedman, 2009). Vegetation classification has become a widely used ecological method due to a number of new statistical and machine learning methods used alongside mapped biological and environmental data to model vegetation distributions over large spatial scales at higher resolutions (Cutler et al., 2007). Different image classification methods are rarely used together in the same classification research, especially when combined with environmental variables (Li et al., 2014).
In this study, the RF model performed better than the DT, SVM, and MLC models across the three classification levels. This finding was consistent with the results of other studies that found that the RF method modeled vegetation distribution better than other methods (Prasad, Iverson & Liaw, 2006). The DT model divided the data into homogenous subgroups according to the range of predictor variable values. The DT model was generally able to handle a large number of independent variables and could build a tree model faster than the other methods. However, the DT model was somewhat unstable for vegetation distribution modeling and had lower classification accuracy (Zhou et al., 2016). The RF model generated a large number of independent trees through data subsets and developed a split in every tree model using a random subset of predictor variables. Therefore, we concluded that the RF model was generally better than the DT model. The SVM model was developed from statistical learning methods and discriminated class samples by locating potentially nonlinear or multiple linear boundaries between individual training points (Burai et al., 2015). The aim of the MLC model was to maximize the overall probability that a pixel is correctly assigned to a class. However, the MLC model requires a large number of training samples that limits its application (Sesnie et al., 2010). Previous research has shown that classification accuracies when using the SVM classifier were higher than the MLC model (Pal & Mather, 2005;Boyd, Sanchez-Hernandez & Foody, 2006;Sanchez-Hernandez, Boyd & Foody, 2007;Sesnie et al., 2010). Because the model had fewer requirements, the DT method provided significantly more accurate classifications than those of the MLC model (Boyd, Sanchez-Hernandez & Foody, 2006). 
Other studies found that the RF and SVM models were similarly accurate (65.3% and 66.6%, respectively) (Sesnie et al., 2010), and that the RF, MLC, DT, and SVM models performed similarly and reasonably well when simulating land use classification (Li et al., 2014). In addition to the methods mentioned above, an artificial neural network implemented at a regional scale produced classification accuracies of 60%-80% (Muchoney et al., 2000; Haslem et al., 2010). In the Arctic, this method provided the most accurate vegetation mapping (Langford et al., 2019). The similarly positive results of these models may be due to the relatively large differences between classification objects and to the use of sufficiently representative training samples and appropriate input variables. In our study, the SVM and MLC models produced output only for variable Combinations 1 to 6. This may be due to the poor separability of the training samples, as the models could not recognize the training points or their vegetation categories (Jarnevich et al., 2015). The Jing-Jin-Ji region has many types of vegetation with very small distribution areas, so the selected training points may have been insufficient. Future training points for these vegetation types should be selected using field surveys, and more suitable models for modeling global vegetation distribution should be developed and tested (Jiang et al., 2012).

Important variables in vegetation classification models
Variable selection is directly related to the vegetation distribution model's ability to capture important environmental factors (Mod et al., 2016). Models predict the important variables that drive the distribution of vegetation (Prasad, Iverson & Liaw, 2006). Vegetation distribution is predominantly driven by temperature, precipitation, and topographical variables (Franklin, 1995;Mod et al., 2016;Prasad, Iverson & Liaw, 2006), specifically those related to physiological tolerance, site energy, and moisture balance (Franklin, 1995). In addition to environmental variables, some spectral variables are used as input variables. However, the overuse of spectral variables can actually decrease discrimination accuracy, meaning that only spectral variables reflecting vegetation information should be selected, such as those related to the visible spectrum, infrared spectrum, and vegetation indices (Price, Guo & Stiles, 2002;Zhou et al., 2016). Different variables respond to different information. Spectral variables directly reflect land surface object information, while geospatial and climatic variables reveal information about the vegetative environment.
Terrain, an important variable in vegetation distribution models, has long been used to improve map accuracy, especially for regions with large elevation differences (Dobrowski et al., 2008; Oke & Thompson, 2015). Sesnie et al. (2010) found that adding elevation as a predictive variable dramatically improved the accuracies of the SVM and RF models to >80% for most forest types. Slopes with similar elevations but different aspects have very different soil and vegetation temperatures (Gunton, Polce & Kunin, 2015; Mod et al., 2016). Dobrowski et al. (2008) highlighted the importance of slope and aspect when mapping vegetation communities in the Sierra Nevada. Slope was also an important variable in this study (Table 7). Climate variables were important as well, since different types of vegetation require different precipitation and temperature levels and have different tolerances to extreme heat and cold. The significance of these climate variables (annual mean temperature, temperature range, and annual precipitation) has been validated in other studies (Prasad, Iverson & Liaw, 2006; Sesnie et al., 2008). We looked at two surface albedo indices (the summer surface albedo of band 1 and the winter surface albedo of band 6). Sesnie et al. (2010) combined elevation and spectral band data to increase the classification accuracy to a satisfactory level for most forest types. De Colstoun et al. (2003) obtained high accuracies (80%) when classifying coniferous, temperate broad-leaf, and mixed forest types using Landsat ETM+ bands. Other studies have used different vegetation index variables (Price, Guo & Stiles, 2002; Zhou et al., 2016) specific to their study areas and data.
The input variables used in our vegetation distribution model are not exhaustive. Ecophysiologically meaningful predictors such as soil moisture, pH, and nutrients, should be considered. Other factors, such as actual light, disturbance, biotic interactions, land use, and bioclimatic information could also be incorporated into vegetation distribution models (Dobrowski et al., 2008;Mod et al., 2016;Prasad, Iverson & Liaw, 2006;Sesnie et al., 2010). We suggest building more ecophysiologically sound vegetation distribution models that require a collaborative effort across the ecological, geographical, and environmental sciences (Mod et al., 2016).

Other factors affecting classification accuracy
In addition to classification units, models, and input variables, classification accuracy is affected by other factors, including algorithm error and image data (Li et al., 2014). We must acknowledge the existence of errors in random sample selection, modeling, and data preprocessing algorithms. Remote sensing data sources, as well as the date and processing of selected images, vary, resulting in different simulated values and accuracies (Price, Guo & Stiles, 2002). Remote sensing images with high spectral and spatial resolutions provide rich spectral and ground information, moderately improving the predictive ability of the vegetation distribution model (Peng et al., 2002). However, the use of high spectral and spatial resolution images creates a greater demand for data access, larger computer storage capacities, and faster data processors (Price, Guo & Stiles, 2002), which is why we did not use high spectral and spatial resolution images in this study. Moreover, some cultivated vegetation and shelter forests in the Jing-Jin-Ji region are greatly affected by human disturbance, which affects their water-heat conditions and soil nutrition. Urbanization reduces vegetation, transforming some areas into industrial, commercial, and residential land. This has led to the direct or indirect pollution of the water, soil, and air, and the reduced predictive ability of vegetation distribution models. The VMC we used for model assessment was published in 2007, and no updated study has been published over the past 10 years. The current state of the Jing-Jin-Ji region's vegetation no longer coincides with the VMC's assessment.

CONCLUSIONS
Our main objective was to determine the best simulation method for vegetation affected by high socioeconomic disturbance in the Jing-Jin-Ji region. The RF model was the most capable of simulating vegetation distribution across all three units. The DT model could simulate the vegetation distribution in units I and II. The SVM and MLC models could not simulate the distribution in any of the three units. Based on the Kappa coefficient, the RF model was generally better than the DT model and the most suitable model for simulating vegetation distribution in the Jing-Jin-Ji region. The most important variables affecting vegetation classification accuracy were three climate variables (annual mean temperature, mean diurnal range, and annual precipitation), one geospatial variable (slope), and two spectral variables (mid-infrared ratio of winter vegetation index and brightness index of summer vegetation index). We recommend using the RF model to produce or improve vegetation maps in areas of high human disturbance.