Models for dominating forest cover type prediction

This paper investigates which forest tree species are most suitable for a given area and landscape. A set of classifiers is constructed to relate the type of soil and other features of a forest area to the preferable tree species. Decision tree classifiers and ensemble methods that apply bagging and boosting to such trees are used. This classification task is one of the important problems of the forest regeneration process. The efforts of ecologists can be more effective if expert systems are available that suggest the best forest cover type for areas affected by forest fires or by deforestation caused by human activity. The results and conclusions of this paper can be used in other forest recovery tasks. The same methods can be applied to find the preferable tree species for other areas, provided there is enough data to solve the task with machine learning techniques.


Introduction
Today deforestation is a very important problem for the whole world. In some areas it is caused by human activity; in others, seasonal fires are driven by climate and local features. There have been many forest fires of giant magnitude recently: in the USA, Australia, Brazil and the Siberian regions of Russia. Forest regeneration is therefore a very important ecological problem.
Nowadays data analysis and machine learning are applied in many domains of knowledge [1,2]. In this research the most suitable tree species are determined with machine learning methods, which can accelerate the process of forest regeneration. The conclusions best suit the area where the data were collected [3], but the same technique can be used to handle data from other forest areas. Here data collection and dataset creation are very important problems that must be solved in different regions by ecologists [4]. Their efforts help to involve data scientists all over the world in solving ecological problems [5]. Still, the problems of forest regeneration after fires [6-13], agricultural deforestation and regeneration after logging [14-16] are usually studied with traditional methods. Data science and time series analysis methods [17,18] can now be applied in this domain to predict fires and to construct classifications and clusterings [19] of forest types for regeneration.

The dataset structure and classification quality metrics
In the original data competition [3] the main task was to predict the dominant kind of tree cover. The data analyzed in the paper were collected in the Roosevelt National Forest (Colorado, USA). The forest area was divided into cells of 30 m by 30 m, and each row of the dataset describes one such cell. Categorical parameters such as the type of wilderness area (4 values) and the type of dominating tree species (type of cover, 7 values) are handled with one-hot encoding. Cover_Type is the main parameter predicted in the data competition [3]. Seven types of dominating tree species are represented in the dataset: spruce (fir), lodgepole pine, ponderosa pine, willow (cottonwood), aspen, Douglas-fir and krummholz. There are 581012 records in the dataset.
Lodgepole Pine and Ponderosa Pine dominate over the largest portion of the area (85% of the area observed in the dataset). This means that the classes in the classification problem are unbalanced, so ordinary accuracy is not a reliable measure of classifier quality. Special metrics are used instead. The usual measures of classifier quality are precision (2), recall (3) and the $F_\beta$ measure with $\beta = 1$ (4), which can be considered as their combination [20]:

$$\mathrm{precision} = \frac{TP}{TP + FP}, \tag{2}$$

$$\mathrm{recall} = \frac{TP}{TP + FN}, \tag{3}$$

$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}, \tag{4}$$

where $TP$, $FP$ and $FN$ are the numbers of true positive, false positive and false negative predictions for the class considered.

Correlation coefficients between all pairs of parameters have been considered. The hillshade indices at 9 a.m. and 3 p.m. are negatively correlated, and the magnitude of the coefficient is 78%. This can be explained by the daily movement of the Sun across the sky: an area that receives a lot of sunlight in the morning is less illuminated in the evening because of the specifics of the landscape. The correlation between the hillshade indices at noon and at 3 p.m. can be explained in the same way.
The parameters describing elevation above sea level and type of soil also correlate: the type of soil depends on the local climate, and height above sea level is one of the important factors influencing climate.
The aspect value also correlates with the hillshade index measured at 3 p.m. (65%). This can be explained by cells that are oriented approximately towards the mean trajectory of the Sun at this time.
In each pair of highly correlated parameters, one parameter is removed.
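The removal of one parameter from each highly correlated pair can be sketched as follows. This is only an illustration under stated assumptions: the threshold of 0.6 is hypothetical (the text reports 65% and 78% as high correlations but gives no cut-off), the helper name `columns_to_drop` is invented here, and the three columns are synthetic stand-ins rather than the real dataset.

```python
import numpy as np

def columns_to_drop(X, names, threshold=0.6):
    """Return names of columns strongly correlated with an earlier kept column.

    threshold is an assumed cut-off on the absolute Pearson coefficient.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    dropped = []
    for j, name in enumerate(names):
        for i in range(j):
            # Compare only against earlier columns that are still kept.
            if names[i] not in dropped and corr[i, j] > threshold:
                dropped.append(name)
                break
    return dropped

# Synthetic stand-in data (not the real measurements): the second column is
# built to be negatively correlated with the first, the third is independent.
rng = np.random.default_rng(0)
morning = rng.normal(size=1000)
X = np.column_stack([
    morning,
    -0.8 * morning + 0.6 * rng.normal(size=1000),
    rng.normal(size=1000),
])
names = ["hillshade_9am", "hillshade_3pm", "elevation"]

dropped = columns_to_drop(X, names)
kept = [n for n in names if n not in dropped]
print("dropped:", dropped, "kept:", kept)
```

Which member of a pair is kept depends on column order in this sketch; a different rule, such as keeping the parameter that is easier to interpret, may be preferable in practice.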
The horizontal part h and the vertical part v of the distance to the nearest surface water source are combined into a new parameter that can be treated as the Euclidean distance to that source:

$$d = \sqrt{h^2 + v^2}.$$

The parameters do not correlate with the type of forest cover. The dataset is of high quality. Linear models and classification models can be used to describe the type of forest cover [20].
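The combined distance to water described above can be computed in one line. The sketch below assumes that h and v are given in the same units; the function name `distance_to_water` and the sample values are illustrative only.

```python
import numpy as np

def distance_to_water(h, v):
    """Euclidean distance built from horizontal (h) and vertical (v) parts."""
    return np.hypot(h, v)

# Illustrative values only; a negative v (water source above or below the
# cell) does not matter because the components are squared.
h = np.array([30.0, 300.0, 0.0])
v = np.array([40.0, -400.0, 12.0])
d = distance_to_water(h, v)
print(d)  # distances are 50, 500 and 12
```

After this step the two original columns can be replaced by the single column d.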

Experiments
The classification task described above has been solved with several algorithms: a k-nearest-neighbours classifier, a decision tree classifier and ensemble methods (extra trees, random forest and gradient boosting classifiers).
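A minimal sketch of such an experiment with scikit-learn is given below. It is not the configuration used in the paper: the data are a synthetic, imbalanced three-class problem generated with `make_classification`, the train/test split is an assumption, and all hyperparameters are library defaults except a reduced `n_estimators` for gradient boosting to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the forest data (an assumption).
X, y = make_classification(n_samples=1500, n_features=10, n_informative=6,
                           n_classes=3, weights=[0.6, 0.3, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "KNeighborsClassifier": KNeighborsClassifier(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "ExtraTreesClassifier": ExtraTreesClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "GradientBoostingClassifier": GradientBoostingClassifier(
        n_estimators=50, random_state=0),
}

# Fit every model on the same split and record its macro-averaged F1.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = f1_score(y_test, model.predict(X_test), average="macro")

for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name:28s} {score:.3f}")
```

The ranking obtained on synthetic data says nothing about the ranking on the real dataset; the sketch only shows the shape of the comparison.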
The F1 measures of the constructed classifiers are shown in table 1. The corresponding scikit-learn classes are listed in the first column, and the macro-averaged F1 value of each classifier is shown in the second column. Ensemble methods combine the responses of many "simple" classifiers, so it is difficult to explain their decisions. Decision tree classifiers, in contrast, operate with just one tree, and their behaviour can be explained [20]. Here the main parameters of the classification process are elevation above sea level, type of soil, and distances to the closest fire points and roads.
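Macro-averaging, as used for the F1 values in table 1, gives every class the same weight regardless of its size. The small worked example below, written in plain Python, illustrates why this matters for unbalanced classes. The labels are toy values, the function names are invented for this sketch, and returning 0 when a denominator is zero is a convention chosen here.

```python
def per_class_scores(y_true, y_pred, label):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 values."""
    labels = sorted(set(y_true))
    return sum(per_class_scores(y_true, y_pred, c)[2] for c in labels) / len(labels)

# Toy case: nine cells of a common class "A" and one cell of a rare class "B";
# the classifier ignores the rare class completely.
y_true = ["A"] * 9 + ["B"]
y_pred = ["A"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro = macro_f1(y_true, y_pred)
print(f"accuracy = {accuracy:.3f}, macro F1 = {macro:.3f}")
```

In this toy case accuracy is 0.9 although the rare class is never predicted, while the macro-averaged F1 is 9/19, about 0.474, which exposes the failure.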
The classification tree has many nodes, each containing a part of the investigated examples, and many sets of conditions define each class, so only some simple cases are shown. All classes except Lodgepole Pine and Ponderosa Pine (dominating over 85% of the area) are combined into a third class. The bounds of the classes in some cases are shown in table 2. Here dist_fire denotes the distance to the closest fire point, dist_roads is the distance to the closest road, wilderness is the type of wilderness area (4 binary values), hillshade3pm is the hillshade index at 3 p.m. and elevation is the height of the cell above sea level.
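Class bounds of this kind can be read from a fitted tree as threshold rules. The sketch below uses scikit-learn's `export_text` on a shallow tree; it is an assumption that a similar tool was used in the paper. The feature names are borrowed from the description above, but the data are synthetic, so the printed thresholds carry no ecological meaning.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ["elevation", "dist_fire", "dist_roads", "hillshade3pm"]

# Synthetic three-class data standing in for the real cells (an assumption).
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, n_classes=3,
                           n_clusters_per_class=1, random_state=0)

# A shallow tree keeps the rule list short enough to read.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
rules = export_text(tree, feature_names=feature_names)
print(rules)
```

Each root-to-leaf path in the printed text is one set of conditions of the form "feature <= threshold" or "feature > threshold" that leads to a predicted class.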
Tree ensemble classifiers construct a "strong" classifier from a set of "weak" ones, which are decision tree classifiers. The combined work of many classifiers can identify the most appropriate subsets of the dataset and the appropriate ranges of parameters. The ExtraTreesClassifier is a more strongly randomized variant of the RandomForestClassifier algorithm, and here its results are better. Both are based on the bagging idea. Gradient boosting is considered one of the best ensemble methods; it is based on the boosting technique [20]. Nevertheless, the ExtraTreesClassifier shows the best result in this task.
Principal component analysis [20] has been applied to the dataset. Two components are enough to describe 97% of the variance. A subset of 50000 records has been created to plot the various types of trees in the basis of the two principal components PC1 and PC2. As mentioned above, Lodgepole Pine and Ponderosa Pine dominate over the largest portion of the forest area (85%), so the first plot contains information only about these types of trees. The other types are shown in the second plot. Scaling by means of the standard deviation and mean value (according to expression (5)) delivers results that look the same.
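The projection onto two principal components can be sketched with scikit-learn as follows. The data are synthetic and constructed so that two directions carry most of the variance, the subset size is reduced for illustration, and `StandardScaler` is used as one way to scale by mean value and standard deviation; none of these choices is taken from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data with two dominant latent directions plus small noise
# (an assumption standing in for the real parameters).
rng = np.random.default_rng(0)
latent = rng.normal(size=(2000, 2)) * np.array([5.0, 3.0])
X = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(2000, 6))

# Scale every column to zero mean and unit standard deviation.
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA on all rows and measure the variance kept by two components.
pca = PCA(n_components=2).fit(X_scaled)
explained = pca.explained_variance_ratio_.sum()

# Project a random subset for plotting in the (PC1, PC2) basis.
subset = rng.choice(len(X_scaled), size=500, replace=False)
coords = pca.transform(X_scaled[subset])

print(f"variance explained by PC1 and PC2: {explained:.3f}")
print("shape of plotted coordinates:", coords.shape)
```

The two columns of coords can be passed to any plotting library, with one colour per cover type, to obtain scatter plots of the kind described above.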

Conclusion
The Forest cover type dataset has been investigated in this paper. It includes information about the tree species of the Roosevelt National Forest (USA). The data competition [3] was aimed at constructing cover type classifiers, that is, at finding the most appropriate tree species for a given type of landscape.
Forest recovery is very important ecological work. The forest area decreases steadily because of fires and human activity. Strong efforts are needed to recover forests all over the world at high speed.
A set of learning models has been used to classify the dataset [3]. As shown in figures 1 and 2, the types of trees are mixed in the dataset, so linear regression and linear classification models do not give appropriate results.
Here decision tree classifiers and bagging and boosting methods based on trees are applied. Their F1 measures are greater than or equal to 89%.
The way decision trees distinguish the classes can be explained. The main parameters of the classification process are elevation above sea level, type of soil, and distances to the closest fire points and roads. Some bounds of classes obtained with the decision tree are shown in table 2.
Solutions of such classification tasks can support the efforts of ecologists aimed at recovering forests.