Mapping growing stock volume and forest live biomass: a case study of the Polissya region of Ukraine

Forest inventory and biomass mapping are important tasks that require inputs from multiple data sources. In this paper we implement two methods for the Ukrainian region of Polissya: random forest (RF) for tree species prediction and k-nearest neighbors (k-NN) for growing stock volume and biomass mapping. We examined the suitability of the five-band RapidEye satellite image to predict the distribution of six tree species. The accuracy of RF is quite high: ~99% for forest/non-forest mask and 89% for tree species prediction. Our results demonstrate that inclusion of elevation as a predictor variable in the RF model improved the performance of tree species classification. We evaluated different distance metrics for the k-NN method, including Euclidean or Mahalanobis distance, most similar neighbor (MSN), gradient nearest neighbor, and independent component analysis. The MSN with the four nearest neighbors (k = 4) is the most precise (according to the root-mean-square deviation) for predicting forest attributes across the study area. The k-NN method allowed us to estimate growing stock volume with an accuracy of 3 m3 ha−1 and for live biomass of about 2 t ha−1 over the study area.


Introduction
Data from remote sensing (RS) are crucial for a number of tasks in forest inventory and monitoring, including the estimation of forest area and its dynamics, construction of thematic forest maps, estimation of tree species distribution, stratification of the territory for sampling, and calculation of forest parameters for an area from a sample (Latifi et al 2015a, McRoberts et al 2014, Schepaschenko et al 2015aSchepaschenko et al 2015b. The challenges of the combined implementation of groundbased and remote methods of forest inventory have been discussed extensively in the scientific literature (Chirici et al 2016, McRoberts andTomppo 2007). In this regard, the technology of satellite image processing by the k-nearest neighbors (k-NN) method and random forest (RF) have been quite successful.
According to McRoberts (2012), the k-NN method is used in forest inventory for four main tasks: (1) imputation of missing values for forest inventory and forest monitoring databases; (2) wall-to-wall mapping based on point measurements; (3) small area estimation; and (4) support for design-based and model-based inference. Forest parameter estimation based on the k-NN technique involves combining sample plot measurements, RS data, and auxiliary information, including a digital elevation model (DEM), land cover, etc. Given the limited extent of ground measurements, the method provides reasonable accuracy, both overall and at a fine scale. The k-NN technique builds a continuous, spatially explicit model of each forest parameter. The model reflects the distribution and variability of forest indicators and results in a set of maps that support forest monitoring and inventory (Beaudoin et al 2014, Maselli and Chiesi 2006, McRoberts and Tomppo 2007, Mozgeris 2008, Reese et al 2003, Tomppo et al 2016.
Non-parametric methods are common for classification of satellite imagery. They do not have specific requirements for the distribution of the studied parameters. RF is one of the most efficient and well-established machine learning techniques for classification of remote sensing imagery (Breiman 2001, Belgiu andDrȃguţ 2016) and can also be used for regression analysis. It relies on bagging to form an ensemble of classification and regression trees. Bootstrap samples are used to construct multiple trees such that each tree is split with a randomized subset of features, thus the name 'random' forest. Another subset, which is not part of the classification, is used for accuracy assessment. Recently, the method was successfully applied for deriving forest parameters (Latifi et al 2012) and tree species prediction in particular (Immitzer et al 2012, Myklush et al 2013.
Although both the k-NN and RF methods have been commonly used in forestry applications as outlined in Chirici et al (2016), a meta-analysis by the same authors revealed that only 3.4% of the 148 papers reported confidence intervals while most were limited to overall accuracy, the Kappa index and root-mean-square deviation (RMSD) (see e.g. Bernier et al 2011, Gagliasso et al 2014, Latifi et al 2015b, Trubins and Sallnäs 2014, Zald et al 2016. In contrast this paper reports the estimated forest parameters with standard errors, which are needed for forest inventories. The forest mask produced using RF is also provided with confidence intervals and compared with other forest masks. This paper discusses the implementation of both methods: (1) RF for delineating forested area and classifying tree species; and (2) k-NN for mapping growing stock volume and live biomass. We report confidence intervals for the estimated parameters using an area in Ukraine as a case study.
The traditional Ukrainian forest inventory and planning (FIP) approach involves first dividing the forest (using RS data) into homogenous units (primary units of forest inventory) and then forest parameters (i.e. tree species composition, age, average height and diameter, growing stock volume, etc.) are assigned by trained personal on the ground through visual estimation or measurements. This results in the description of the individual forest stands with a return interval of about 10 years. The limitations include: (1) FIP covers area managed by the state forestry authority only (about 85% of forested land); (2) shelterbelts and other protective tree associations in agroforestry systems are not covered; (3) independent monitoring is needed. Reliable remote sensing techniques are especially relevant for Ukraine, as about 15% of the forested area is not covered by the FIP and also because of the relatively large dynamics of forest cover, some of the drivers of which are illegal logging, dieback of shelterbelts due to drought, and afforestation of abandoned arable land.

Study area
The study area is in the Snovsk district of the Chernihiv region (figure 1) and covers 45 km 2 . It is located within the East European Plain with wavy plan relief typical of the Ukrainian Polissya. The major part of the area belongs to the first and second river terraces with the altitude ranging between 120 and 170 m above mean sea level. According to the FIP completed in 2011, forest cover represents 1893 ha (463 individual forest stands) or 41.8% of the study area. Pine and birch are the most common species (44.7% and 39.8% of the forest area, respectively), alders cover 13.1%, while other species (aspen, oak, ash, black locust, spruce, and maple) each cover less than 1%. Forests are highly productive (reaching an average stand height of 27-34 m at 100 years old); the age structure has two peaks: young (37%) or mature (42%) forests are dominant, while the middle aged, premature, and over-mature groups are less well represented.
The study area is representative of the entire Ukrainian Polissya region in terms of landscape, tree species composition and productivity.

Spatial datasets
We used five-band RapidEye imagery with a spatial resolution of 5 m, acquired in 2011. The image was both geometrically corrected and converted to the top of atmosphere (TOA) reflectance. Another input dataset was a 10 m DEM resampled to the same resolution. All datasets were converted to a WGS 84/UTM zone 36 N projection ( figure 2).
An IKONOS image with a spatial resolution of 1 m was used for visual interpretation of land cover type in order to validate the forest mask.

Live biomass model
The FIP database consists of information for every individual forest stand. This includes tree species and several other parameters, which were used to estimate the live biomass of all the forest stands as follows (Shvidenko et al 2007): where R is the biomass expansion factor; M is the live biomass of fraction fr, oven-dry t ha −1 ; GS is the growing stock volume in m 3 ; A is age in years; RS is relative stocking; SI is the site index, which reflects the quality of the site (Shvidenko et al 2007); and a 0 -a 5 are model parameters.
The biomass expansion factors (table S1, available at stacks.iop.org/ERL/12/105001/mmedia) for birch, black alder and aspen were estimated using 443 sample plots collected in Ukraine and neighboring countries, where the biomass fractions were measured using the destructive sampling method (Schepaschenko et al 2017b). Parameters for the pine and oak forests were taken from Shvidenko et al (2007) and applicable for European Russia, Belarus and Ukraine. Table S1 also reports the RMSD, which is the square root of the   average of the squared deviations between the actual and estimated values of the live biomass. The live forest biomass of every forest stand was estimated based on equation (1) and the parameters presented in table S1. The models considered stem wood biomass over bark, aboveground live biomass of the stand (stem wood over bark, branches and leaves), the total live biomass of forest stands (aboveground stand biomass and roots) and the total forest live biomass (forest stand, understory and green forest floor) in an oven-dry state.

Random forest
The RF technique was used (1) to create a forest mask and (2) to delineate the dominant tree species. In our analysis, we used five-band RapidEye imagery at a 5 m spatial resolution and a DEM. The reference data were taken from the FIP database in the following way. We randomly distributed 4000 points over the study area. Of these, 1729 fell within the forest and species information from the FIP database, which were then used as a training dataset for classification. The main tree species and sample sizes are presented in table 1.
The results of the RF classification were aggregated in order to remove small groups of pixels (under 40). Finally, the minimum size of a forest polygon was 0.1 ha, which corresponds to the national regulations for the forest inventory.
The validation dataset consists of an additional 2300 randomly selected points. We calculated the confusion matrix and estimated the user's and producer's accuracies and the confidence intervals for the area of different tree species assessment (Congalton andGreen 2008, Olofsson et al 2014).
The data were classified in R using the {random-Forest} package, v. 4.6-12 (Liaw and Wiener 2002). To understand the contribution of each variable in the classification model, we used the mean decrease in accuracy (MDA) as a measure of the variable's importance. The general idea of the MDA is to rank variables according to their contribution (in percentage terms) to the mean squared error of a classification if they are excluded from the calculation. Clearly, the larger the MDA, the more the variable contributes to the accuracy of the model.

k-NN imputation
Imputation is a process that replaces missing values with predicted or observed values (McRoberts 2009). The k-NN technique was used to map growing stock volume and forest biomass. Ground truth information was obtained from the FIP database and consists of 90 randomly selected forest stands.
The k-NN method requires that both the number (k) of nearest neighbors and the equation to calculate the distance to these neighbors in the parameter space be specified. We compared several methods to estimate the distance, as suggested by Crookston and Finley (2008), namely, the Euclidean (EUC) and Mahalanobis (MAL) distances between the reflectance of the target and the reference pixels of the image. Other methods of calculating distance are based on canonical correlation analysis (MSN-most similar neighbor), canonical correspondence analysis (GNN-gradient nearest neighbor), and component analysis (ICA-independent component analysis).
The response variable (growing stock volume or live biomass) for a pixel p (̃) was predicted by equation (2) (Tomppo et al 2016): where y is the value of variable y of the i-th member of the training dataset for pixel p; w , is the weight of the i-th member for pixel p; and I ℎ is the size of training sub-dataset for the stratum h.
The weight, w , , is inversely proportional to the distance (d ) between the target pixel and the NN in the parameter space as follows: where k is the number of nearest 'neighbors'; and t is a real number usually between 0 and 2. We use t = 2 to increase the contribution of large distances. Besides classification and mapping, the k-NN method can be successfully used for the unbiased estimation of the mean values of the forest parameters based on a sample (McRoberts 2012). Model-based methods are used extensively in forest inventories to infer the mean values of the forest attributes, where the estimates are required for a small area of the population (small area estimates). If we assume that a simple model describes the population, then the expected values of Y at an i-th point belonging to the area of interest can be estimated as follows: where is the mean value of Y for the population unit; and is the random deviation of observation y from its mean value at a point i.
Model-based approaches are generally focused on the estimation of the mean values of forest attributes in the population rather than on definite observations. With the k-NN method equation (2), the estimator of the population mean iŝ: where N is the sample size.
The k-NN model was run in R using the {yaImpute} package, version 1.0-26 (Crookston and Finley 2008).

Estimation of the performance of global/ regional products in the study area
We compared a number of global and regional maps with the ground FIP data in the study area in order to estimate their accuracy and if locally calibrated models are able to improve this accuracy further. The FIP data were aggregated to match the pixel size of each map. We predicted the forested area, the average and total biomass, the overall accuracy of the forest mask and the tree species group. Note that the estimated accuracies are only valid for the study area and cannot be treated as an estimation of uncertainties of the global/regional maps.

Forest mask
The contribution of different parameters to the forest extent prediction is shown in figure 3. The most important input datasets for delineating the forest are the B5 NIR band, the coordinates and the elevation.
The very high resolution image, with the forest mask overlaid (revealing the edges of the imagery so that forest areas can be clearly seen), is presented in figure 4 with correspondence evident. The misclassification rate was estimated at 1.6% based on the OOB (out-of-bag) error.
The forest mask was used for further segregation of the forest area by dominant tree species.
We compared the forest extent in several global and regional maps of high resolution (25-60 m) with a random sample of visual interpretation of the IKONOS image (table 2). We estimated the user's and producer's accuracy of the forest masks and confidence interval for area estimation. Two of the maps (Hansen et al 2013, Sexton et al 2013 are represented as percentage tree cover. We compared the percentage of forest on the IKONOS image with the tree cover value on the maps (table 2, forest share RMSD). To delineate forest in these two maps, we applied a threshold (table 2, tree cover threshold) that matched the forested area as closely as possible without decreasing the accuracy.
The global land cover and forest maps delineate forest with reasonably good user's accuracy (75%-90%). GlobeLand30 shows the least amount of forest, which results in the lowest producer's accuracy. Despite the high resolution of this product, i.e. 30 m, it recognizes core forest areas only, classifying the rest as cropland, i.e. the dominant land cover type. Only the national dataset (Lesiv et al 2015) demonstrates high and balanced user's and producer's accuracies with narrow confidence intervals despite underestimating the forest area.

Tree species classification
The DEM, together with B5 NIR, contribute the most to the accuracy of the model (figure 5). Elevation was particularly important for the prediction of alder forest, which covers the lowest elevations, i.e. floodplains.
The OOB error (in our case 8.22%) shows a sufficiently high accuracy of prediction for the tree species. Figure 6 shows the variability of independent variables used for tree species classification. Despite the fact that an individual variable may have overlapping values for different species, each contributes to a robust tree species prediction. For example, pine   forests are distinguished by both low near-infrared radiation (NIR) and red-edge values. As expected, the spectral reflectance of the tree species has large values in the NIR and red-edge channels of RapidEye imagery, which means that these two parameters contribute the most in tree species recognition. The median value of the pine reflectance is significantly lower compared to all deciduous species while alder is the most distinguishable species among the deciduous trees based on these spectral channels.
The results of the classification are presented in the form of a tree species map for the study area (figure 7). The map was used for further quantification of growing stock values and live biomass.
From figure 7, one can see that the territory is quite heterogeneous in terms of species composition. A significant part of the forest area is covered with a pine (PISY, 48.7%), birch (BEPE, 27.8%) and alder (ALGL, 17.7%) dominated forest. Low elevation in the south-eastern part of the study area is covered by alder, while oak forest is observed in the western  Figure 5. Importance of the variables for predicting the tree species using a RF classification model.  part only. In general, it is consistent with the forest inventory data. The results of the validation of the tree species map based on 2300 randomly selected pixels are presented in table 3. The RF model is more accurate for alder and pine, which are represented by a pure (single species) stand in the study area. Birch is, in most cases, mixed with other species, and therefore birch-dominated forests are less accurate (table 3).

B1
The overall accuracy of the tree species classification is 87.9%. The relatively low value of the producer's accuracy for oak, black locust, and aspen can be attributed to the small portion of the overall area covered by these species. For instance, oak is represented by five forest stands with an overall area of 3 ha while there are only two 3.5 ha black locust stands. On the other hand, the user's accuracy overall (except for oak) exceeds 75%. The most robust model in terms of both user's and producer's accuracies was obtained for pine and alder, where sufficient training data were available for the RF algorithm.
We checked how global and regional datasets describe tree species in the study area. We indicate the share of forest area recognized by every map (table 4) and the accuracy of classifying coniferous evergreen and broadleaved deciduous species. Even at the tree group level, the error exceeded 25% and even 50% in many cases. The Global Land Cover by National Mapping Organizations (GLCNMO) classified 1/3 of the forested area as 'broadleaf evergreen' or 'needleleaf deciduous', which are not represented in the study area. MODIS Land Cover classifies all the forest as mixed, which is more or less correct at a 500 m resolution.

Growing stock volume and biomass
Optical imagery can be used for the indirect estimation of biomass based on canopy cover and surface reflectance (Schepaschenko et al 2017a, Lu et al 2016. We tested several k-NN imputation methods to predict GSV and LB (table 5). Table 5 demonstrates the advantage of the MSN method, which has the smallest normalized RMSD for every response variable. This method was thus selected for further implementation. To determine the optimal number of k for use with the NN technique, we started with a small number and then increased the number iteratively until the accuracy no long increased significantly. Figure 8 shows that the highest accuracy was reached at k = 4. More NNs provide the same or a larger normalized RMSD.
As a result of the k-NN imputation, we obtained GSV and LB maps at a spatial resolution of 5 m (figure 9). The pine forest has the largest GSV (325 m 3 ha −1 on average). There is minor variability in the GSV within stocked forest stands. Sparse stands, which are mainly encroaching on former arable land, have much higher variability of GSV. High variability is observed for small biomass forests  (up to 75 m 3 ha −1 ) and biomass is highly correlated with GSV (figure 9). We used equation (5) to estimate the mean growing stock volume and the live biomass for the study region (table 6).
The average modeled GSV over the study area is 165 m 3 ha −1 , which corresponds to forest inventory data of 162 m 3 ha −1 . Pine has the largest GSV and biomass values per hectare.
The distribution of the imputed GSV and biomass in the study area for different tree species and aggregated for all species together is presented in figure 10. The distribution has two peaks. The second one corresponds to mature pine forests with large GSV and biomass.
We compared the biomass from different global and regional maps (table 7). Most of the datasets have a 1 km resolution so they are too coarse for calculating the accuracy of the spatial distribution of biomass over the study area. Hence we only compared the average biomass value for the forest area recognized by the datasets and the total biomass for the study area. The average AGB estimates vary from −31% up to +49% compared to the FIP data. The most similar estimates were obtained for the national map produced by Lesiv et al (2015) and the boreal biomass map by Thurner et al (2014). All the datasets substantially underestimated the total biomass (from −7% to −72%). This is the result of both underestimation of forested area and the average biomass value.

Conclusions
The integration of the information derived from satellite images, a DEM, other geospatial datasets, and a limited number of field measurements can contribute to the effective prediction of forest attributes at the pixel level during a forest inventory. The application of RF and k-NN techniques allows for an unbiased estimation of the mean values of forest parameters and the mapping of the forest based on remote sensing data with a small number of ground measurements. Both methods are viable for the processing of RapidEye images for area estimation, prediction of tree species composition, and imputation of structural forest parameters. In the context of the remote sensing assisted estimation of the growing stock volume and the live biomass, our research demonstrates how forests can be mapped in the form of continuous surfaces and how the mean values of forest attributes can be assessed across a defined geographic region.
The methods applied here demonstrate a substantial increase in accuracy compared to existing global and national products. Global forest masks capture forests well in the study area, but existing tree species distributions and biomass estimations are poor. Hence the proposed methods are very promising for national-scale forest inventories and monitoring in Ukraine, especially considering illegal harvesting, the dieback of shelterbelts due to droughts, and the afforestation of abandoned arable land. 15-20 t/ha 21-50 t/ha 51-100 t/ha 101-150 t/ha > 200 t/ha Figure 9. Predicted growing stock volume and total stand biomass over the study area using k-NN imputation (MSN, k = 4) with an IKONOS image in background.      Gallaun et al (2010) 97 117 Boreal by Thurner et al (2014) 7 9 3 6 European by Kindermann et al (2008) 58 124 Global by Kindermann et al (2008) 125 114 Global by Hu et al (2016) 100 37 Ukraine by Lesiv et al (2015) 80 120