Tree species classification using Sentinel-2 imagery and Bayesian inference International Journal of Applied Earth Observations and Geoinformation

The increased temporal frequency of optical satellite data acquisitions provides a data stream that has the po- tential to improve land cover mapping, including mapping of tree species. However, for large area operational mapping, partial cloud cover and different image extents can pose challenges. Therefore, methods are needed to assimilate new images in a straightforward way without requiring a total spatial coverage for each new image. This study shows that Bayesian inference applied sequentially has the potential to solve this problem. To test Bayesian inference for tree species classification in the boreo-nemoral zone of southern Sweden, field data from the study area of Remningstorp (58 ◦ 27 ′ 18.35 ′′ N, 13 ◦ 39 ′ 8.03 ′′ E) were used. By updating class likelihood with an increasing number of combined Sentinel-2 images, a higher and more stable cross-validated overall accuracy was achieved. Based on a Mahalanobis distance, 23 images were automatically chosen from the period of 2016 to 2018 (from 142 images total). An overall accuracy of 87% (a Cohen ’ s kappa of 78.5%) was obtained for four tree species classes: Betula spp. , Picea abies , Pinus sylvestris , and Quercus robur . This application of Bayesian inference in a boreo-nemoral forest suggests that it is a practical way to provide a high and stable classification accuracy. The method could be applied where data are not always complete for all areas. Furthermore, the method requires less reference data than if all images were used for classification simultaneously.


Introduction
Tree species information is currently one of the key parameters of interest to ecologists and forest managers alike. The increased stream of freely available optical satellite data provides more opportunities to use a multitude of images, making it possible to map more thematically detailed classes with higher accuracy. As an example, the Sentinel-2 satellites from the European Space Agency's (ESA) Copernicus program provides images with a temporal frequency of five days at the equator, and higher frequency at higher latitudes (e.g., every one to three days over Sweden). The use of multi-temporal data for land cover classification has been shown to improve accuracy of tree species classification (Amani et al., 2017;Immitzer et al., 2019). Yet, processing a large number of images poses a new challenge in the process of land cover mapping: how can the stream of partially overlapping data best be used? Hence, there is a need for methods that provide highly accurate and thematically detailed (e.g., tree species) maps over large areas that can be easily updated as new images are acquired.
Earlier work on tree species classification from optical data benefited from using a few multi-temporal images where phenological differences were used to separate the classes (Fassnacht et al., 2016;Hagner and Reese, 2007;Hill et al., 2010;Reese et al., 2002;Wolter et al., 1995). Much of the recent work using Sentinel-2 data have used a selection of scenes rather than the continuous data flow. Immitzer et al. (2019) classified tree species using 18 Sentinel-2 images (from 2015 to 2017) and a random forest classification. They found that six to seven dates of imagery provided as accurate a model as using all 18 images (an out-ofbag overall accuracy of 85.7% for 12 tree species), and that images from April to August contributed most to a higher accuracy. Persson et al. (2018) used four images, from April, May, July, and October of the same year in a random forest classification of five tree species. They achieved the highest overall accuracy of 88.2% when using all images and confirmed that the spring and autumn images, which had different phenological states, were the most useful. Puletti et al. (2018) used three Sentinel-2 images from spring, summer and autumn in a random forest classification to discriminate between coniferous, broad-leaved and mixed forest, achieving a maximum overall accuracy of 86.2% using a separate validation data set.
Research on determining vegetation phenology (including tree species) using a continuous flow of remote sensing data has been conducted at a site in Sweden by Jönsson et al. (2018). Their study evaluated a new method that would better accommodate data gaps in time series of Sentinel-2, Landsat and MODIS data, and compared the method to that used in TIMESAT (Jönsson and Eklundh, 2004), among others. They applied seasonal shape priors combined with box constrained separable least squares fit to logistic model functions, which was a robust approach when processing time series with sparse or missing data (e.g., due to cloud occlusions).
When making land cover classification over large areas with multitemporal images, there are other issues apart from handling the large number of images. One such issue is the problem of occluded areas in the images, leading to missing data. One of the primary causes of this is clouds and cloud shadows. Another issue may be differences in images taken under different atmospheric conditions or with different illumination conditions (e.g., sun angle). Yet another issue is that each area may have been imaged a different number of times, resulting in an unequal number of observations at different times between sites. Therefore, one motivation behind the current study is to investigate a method that could utilise a data stream of images that are only partly overlapping.
As the amount of available data have increased, new approaches to efficiently and accurately classifying a continuous stream of optical data have been developed, including convolutional neural networks (Yuan et al., 2020), distribution-based thresholding (Nink et al., 2019), and support vector machines (Shelestov et al., 2017). In past decades, Maximum likelihood classification was the most widely used classification method for making thematic land cover maps from remote sensing data. One advantage of Maximum likelihood classification is that it can be used in a Bayesian framework where probabilities can be updated by using new observations. Strahler (1980) proposed that the method may make "use of time-sequential information in making the outcome of a later classification contingent on an earlier classification." How this is done depends on the decision rule used (Swain, 1978). With the high temporal resolution of current earth observation missions, combined with the increase in computational power since 1980, this Bayesian method can be worth revisiting.
Similar to Strahler's proposal Strahler (1980), Bayesian updating of land cover was used by Cardille and Fortin (2016) to solve the problem of partially overlapping data when classifying an area of central Quebec into water, forest and burnt area from a time series consisting of eleven Landsat 8 images. They used the method to ignore noise caused by clouds and smoke, and to avoid problems due to images covering only parts of the area. By using Bayesian updating, they achieved an overall accuracy of more than 90% at each time step, compared to a median overall accuracy of 78% when images were used separately. Crowley et al. (2019) used the same method with a combination of six Landsat 8 images, ten Sentinel-2 images, and six MODIS images to produce maps of burned and unburned areas in British Colombia.
The objective of this study was to investigate the utility of applying a Bayesian inference method to a continuous flow of Sentinel-2 data for tree species classification in the boreo-nemoral forest landscape of southern Sweden. The aim was to test the method of Bayesian inference by applying Maximum Likelihood classification to the posterior probabilities produced by the method, similar to that proposed by Strahler (1980), and using data with high temporal resolution. The focus was not on finding the optimal spectral bands or combination image dates for classification. This method could provide a relatively simple approach to achieve continuously updated and highly accurate large area tree species maps, using a continuous flow of data with different image date combinations and only partial overlap. In the case of Sweden, such tree species maps are of interest to the forest agency and the environmental protection agency, and at European level for example for keeping the Corine land cover data base up to date. By applying Bayesian inference and updated posterior probabilities to multiple images in a sequence, the expected outcome was to obtain a higher and more stable overall accuracy than when classifying single or few images.

Study site and field data
Field data were collected during the summer of 2016 and the spring of 2017 in the study area of Remningstorp in the boreo-nemoral zone of southern Sweden (58 • 27 ′ 18.35 ′′ N, 13 • 39 ′ 8.03 ′′ E). In total, 335 circular plots with a radius of 10 meters were available (Fig. 1). Of these plots, 265 were placed in the north-west and central parts of the study area, in managed forest stands mainly dominated by conifers. These plot coordinates had been determined by an earlier inventory campaign using systematic sampling in a grid with random start. A new grid with random start was placed in the south-east part of the study area, that was dominated by broad-leaved forest, mainly in a nature reserve that was historically used for grazing.
During the field inventory of the south-eastern area, the grid points were used as a reference, but the plot centres were allowed to be shifted by up to 20 m to the nearest area containing a dominant representation of a single tree species. The tree species composition was assessed with an angle gauge at the grid point as well as positions 20 m north, east, south and west of the grid point. If one species constituted more than 70% of the basal area at one of the positions, that position was selected as the new centre coordinate for the field plot. In total 70 new field plots were established.
All plot centres were positioned using a real-time kinematic GPS with an accuracy of one metre. Stem diameter at 1.3 m above ground (d bh ) and species was recorded for all trees where d bh was at least 4 cm. Only plots where mean d bh was at least 10 cm were included. Since angle gauge measurements of basal area might differ from the basal area within a circular plot (Gregoire and Valentine, 2008, chapter 8), only those plots where one species constituted 70% or more of the basal area calculated from d bh were used. To assure enough training data for each species, only those that dominated more than 10 plots were kept. In total, 169 plots with four species classes remained (Table 1). These four tree species represent over 90% of the forest biomass in Sweden (Nilsson et al., 2020).

Remote sensing data
Satellite imagery from both Sentinel-2A and Sentinel-2B were used for classification. The mission provides data around every three days over the study area. Each image consists of thirteen spectral bands with varying ground sampling distance: 10 metres, 20 metres and 60 metres (European Space Agency, 2015). Images from the period of 2016-07-31 to 2018-08-30 from granule T33VVE were downloaded from Copernicus Open Access Hub. Only Level-1C images were available for all dates, therefore, we performed atmospheric correction to Level-2A using the Sen2Cor program, version 2.8 (European Space Agency, 2019), provided by ESA. Both processing levels were evaluated and due to the way in which the proposed classification method works, combined with a fairly small study area, no significant differences in the outcomes when using Level-1C or Level-2A were expected.
Due to the low sun angles and presence of snow in winter images, only images from the months of April to October of each year were used. In total, 142 images were obtained from this time period, and 22 images, or 15%, were free of clouds and cloud shadows in the study area.
Among the spectral bands available from different sensors, the most common ones used for vegetation mapping are green, red and nearinfrared. Since the aim of this study was to evaluate the method of Bayesian inference on a data stream (i.e., rather than to determine optimal band combinations), it was deemed most important to have constant parameters for every image through the time series; therefore, the green, red, and near-infrared wavelength bands were used (bands 3, 4, and 8), which all have the same pixel size of 10 metres. The visible blue band was excluded, however, since it is very sensitive to variations in atmospheric effects, and therefore less useful for large-area operational mapping of tree species.
When extracting pixel values, field plot coordinates were transformed from SWEREF 99TM (EPSG 3006) to WGS84/UTM zone 33 N (EPSG 32633), which are almost identical in their placement for this study area. Spectral data for the field plot locations were extracted with bilinear interpolation.

Classification method
Maximum likelihood (ML) classification is made by using a decision rule based on an expression describing conditional probabilities of different classes given an observation (Swain, 1978, pp. 152-158). A decision rule for making an ML classification using multiple images was formulated, with the assumption that observations were conditionally independent given a class. The probability of interest was where ω k is the event that a pixel has class k, and X t is the event of observing the vector x → t of band values in the pixel of an image taken at time t. The selection rule, equivalent to the probability of interest, which was used for classification was: Select k to maximize where p is the probability density function, θ → k,t is the parameter vector of that function for species k and image t.
Bayesian classification is often made using a normal distribution, but can be used with any likelihood function (Gorte and Stein, 1998). Two alternative distributions were tested: normal distribution and t-distribution. The t-distribution has wider tails, possibly allowing for a more robust classification. Kernel methods for obtaining empirical distributions were not an option due to the small sample size for some of the tree species in this study.

Image selection
Images were assigned a grade according to the level of class separation within them. For each pair of classes, a Mahalanobis distance (Mahalanobis, 1936), hereafter denoted as Z or Z-value, of the difference between the class population means was calculated as where μ → is sample mean vector, Σ is the sample variance-covariance matrix, m is sample size, and subscripts a and b denote the different  classes. To select images, they were put in decreasing order by Z-value and for each pairwise combination of classes, the highest ranking images were selected. The number of images to be selected was set as close as possible to fifteen percent of the total number of images (N = 142), corresponding to the proportion of cloud-free images. There were a total of four thematic classes, giving six different class combinations. The value closest to 15% of 142 that is evenly divisible by six is 24. The total number of images selected per class combination by their Z-value was increased until the number of images selected for classification was no more than 24.

Accuracy evaluation
Classification model accuracy was evaluated using leave-one-out cross-validation. The main measure of accuracy was overall accuracy, that is defined as the number of correctly classed elements divided by the total number of elements.
To assess the impact on accuracy due to adding more images, combinations of a different number of images were evaluated. This was done both for all available images (N = 142) as well as the subset selected by their Z-values (N⩽24). For each integer number i from one to the number of available images N minus one, N image combinations were sampled from a set of image combinations, S i . The set S i consisted of all possible combinations of i images. The overall accuracies when using each of those sampled combinations for classification were recorded and plotted to view the effect of using multiple images. When only one image was used for classification, it was the same as making an ordinary ML classification.

Results
When using images selected by their Z-value, no difference between the use of a normal distribution and a t-distribution was found. Only the results from calculations made using normal distributions are presented here. In addition, there was no difference when using Level-1C and Level-2A data for classification, and therefore Level-1C images were used for the study.
The images selected by their Z-values ranged from April to October, but most tended to be from May, June, and July, which is the beginning and middle of the summer in Sweden. To separate between P. abies and P. sylvestris, only images from the middle and end of summer were selected, and to separate Q. robur from other classes, mostly images from early summer were selected. P. abies and P. sylvestris were the two most spectrally similar classes, as shown by low Z-values for images selected for separation of that combination. The highest Z-values were obtained for class-pairs that included the Q. robur class ( Table 2).
The overall accuracy for the classification was 87% when using the 23 images selected by their Z-values (Table 3). When all 142 images-both those with high and low Z-values-were used for classification, an overall accuracy of 85% was achieved. When using only a single image, the highest overall classification accuracy obtained was 83%, the average was 63%, and the lowest was 33%. When using all 142 images as well as when using the images selected by their Z-value, it could be seen that a maximum overall accuracy occurred when using only a subset of the images (Fig. 2). When sampling image combinations, both from the total of 142 images and the 23 images selected by Z-values, the average overall accuracy of the samples increased with increasing sample size, as expected. At the same time, variation in classification accuracy decreased (Fig. 2) (see Fig. 3).
The confusion matrix (Table 3) shows that it was primarily the two coniferous species that were conflated. Q. robur was only conflated with Betula spp. while Betula spp. itself was conflated at least one time with each of the other species.

Discussion
The use of Bayesian inference applied sequentially on multiple Sentinel-2 images resulted in a higher overall classification accuracy of Table 2 Images selected by their Z-value (a Mahalanobis distance, calculated using Eq. (2)). B stands for Betula spp. (birch), S stands for Picea abies (spruce), P stands for Pinus sylvestris (pine), and O stands for Quercus robur (oak). Bold face means that the image in question was among the best ones for discriminating between those two classes.  tree species in a boreo-nemoral forest. In addition, the variation of overall accuracy between samples was smaller as compared to the result when Bayesian classification was applied to a single date image. Even though the method presented in this article does not utilize the temporal relationship between images, it does improve the classification accuracy by using information from multiple observations. Similar to other studies (e.g., Hill et al., 2010;Persson et al., 2018), a higher accuracy was obtained when using multiple image dates as opposed to a single date image. A big advantage of the proposed method is that it does not require that all areas are covered with the same sequence of image data, which allows, for example, cloudy areas to be handled in an efficient way.
There are many classification algorithms available for use with remote sensing data, however, there are certain advantages to Bayesian inference. For example, in large area mapping, each cell in a fixed grid over the area could be assigned the probability density values of different tree species, and updated with new probabilities and class membership as each new image became available. In this study, images were selected based on a quality measure derived for the entire image (i. e., Z-values). Sentinel-2 data products include pixel-wise quality measurements which are routinely used to flag pixels with poor quality. This could be utilized to select the pixels to include or exclude when making a Bayesian inference.
In addition, when using Bayesian inference on individual images in a sequential manner, the curse of dimensionality can be avoided; if bands from every image in the time series were added as a new dimension of the data set, this would cause observations to be too sparse and overfitting would be a concern. Otherwise, to utilize a stream of images simultaneously, reduction in dimensionality would be needed each time an image was added to the model. With Bayesian inference where probabilities can be updated sequentially, this is not a problem as long as the assumption of conditional independence holds. When processing single images sequentially as opposed to combining many images, less training data would be required for this Bayesian approach.
Selecting images by their Z-value improved the overall classification accuracy, but more importantly, the overall accuracy initially increased by a relatively large percentage when adding more images that had relatively high Z-values. This supports the practice of selecting images that are cloud-free and of good radiometric quality (Kempeneers and Soille, 2017). When using Z-values to select images, mainly cloud-free images were identified, even though the emphasis of the method lies in class separation. The main advantage of using Z-values is that it is easily implemented and provides an automated approach to find a subset of optimal images from the data stream for the classification process. When all images were used, as shown in Fig. 2(a), the increase in accuracy started to flatten out when using between 50 and 100 images. When this result is compared to the overall accuracy obtained using images chosen by their Z-value, it can be seen in Fig. 2(b) that the variation in overall accuracy quickly decreased with increased sample size. Most importantly, the average classification accuracy was higher when using images with higher Z-values, as compared to using all images.
There were high Z-values for different tree species at different times of the year, supporting the use of multi-date images for catching phenological differences. Q. robur and Betula spp. green-up at different times in the spring, which is shown by the high separability between Q. robur and Betula spp. when using spring images, which for this study was primarily mid to late May images (Table 2). P. abies and P. sylvestris had the highest separation during late June and all through July. The reason for this might be that P. sylvestris trees have a sparser crown structure in comparison to P. abies trees, which leads to a sparser canopy in P. sylvestris stands as compared to P. abies stands. For P. sylvestris stands, this allows more light to penetrate the canopy. In the study area, P. sylvestris stands often have an undergrowth of Corylus avellana (hazel), resulting in some pixels with a mixed spectral response of P. sylvestris and C. avellana; this can result in difficulties in obtaining a pure P. sylvestris signature and lead to confusion in the classification with other spectrally overlapping classes. For the results from this paper, it should be kept in mind that a drought occurred in Sweden during the summer of 2018, potentially affecting the forest and other vegetation.
As for all parametric methods, some assumptions of data distributions need to be made. Two assumptions were tested: (1) that data were normally distributed, and (2) that all species followed a t-distribution with as many degrees of freedom as there were plots in the training data.
The assumption of a t-distribution was selected since it allowed for a higher probability density for values far from the mean, especially in classes with few field plots. Both assumptions produced the same overall accuracy, which means that the variation in data is greater that the difference between the two distributions. An example where both of these assumptions would not be met would be during the leafing out of deciduous trees; a part of the population could be leafed-out while another could still be leafless. This would result in a bimodal distribution with the two subgroups appearing as local maxima. If more field data were available, it might be feasible to construct empirical distributions using kernel methods.
Sentinel-2 has some issues with the geometric quality of images. The geometric correction of Sentinel-2 imagery has proven to be of varying quality, with geo-location inaccuracies in Level-1C data of up to around 10 m on unrefined images and 8 m on refined images (Gascon et al., 2017). Such shifts in the pixel geometry, and therefore occasional poor geographic co-location between images, was also seen in this study. The classification results could be expected to improve with better colocation of pixels, but the general results from this study should be valid even without a perfect geometry.
An image-level linear calibration from digital numbers to radiance or reflectance will not influence the classification result since each image is classified with the aid of ground reference plots. However, for large-area applications, any variation in atmospheric optical thickness within an image might be an error source. For this reason, and to obtain a good signal to noise ratio, it is important to select images that are as cloud and haze free as possible and only use the part of the images with a good image quality, which is possible with the presented method. In the case of large-area applications, the use of data at processing level 2A (with Bottom of Atmosphere correction) might provide a better basis for tree species classification, since the effects of non-linear atmospheric calibration could be greater.
Additional modeling could be used with Bayesian inference. A Hidden Markov Model (HMM) could provide a way to incorporate a model for phenological change to improve classification accuracy (Siachalou et al., 2015). To do this, a model of probabilities for phenological change as a function of time would be needed. To do this in the best way, fieldbased information on vegetation phenology over the study area should be used to formulate the transition matrices needed for HMM; however, this type of field data is not currently collected in a comprehensive and available manner for all of Sweden.
All images were treated as equally important for classification purposes, regardless of how close they were in time to the field inventory. One idea for future development would be to smooth out the distributions with a weighting according to the difference in time from the inventory data. This would mean that an observation taken at a date close to the image acquisition date would be trusted more than one further away in time.
To further examine the usefulness of Bayesian inference as a classification method of tree species, a study on a greater landscape level would be of interest. If data from a national forest inventory (NFI) were to be used, it might be advisable to use whole swaths of satellite data, as opposed to individual 100 × 100 km granules, to get better estimations of distributions. Since NFI data are often acquired on a regular basis, weighting observations by temporal difference might be useful.
The method of Bayesian inference, where pixel-wise probabilities can be calculated and then updated as more observations are made, provides a way to easily automate classification of a data stream consisting of multi-temporal images. The results in this study might provide a foundation for a practical method for continuous large area forest mapping where new images are screened for image quality using a Zvalue, and high quality images are used for improving the previous classification. The method looks promising and should be tested in more forest ecosystems.

Conclusions
This study demonstrated the utility of classical Bayesian inference applied sequentially, where accumulated class likelihoods were used as prior probabilities from an image stream. The study used an initial stream of 142 Sentinel-2 images to map four tree species over a study Fig. 3. Classification results when using the 23 images chosen by Z-value. An existing land cover map was used here for non-forest land cover. Right subfigure is GSD-Orthophoto, 0.5m, color-IR © Lantmäteriet. area in southern Sweden. Using Z-scores, a best subset of 23 images was identified. The overall classification accuracy of the 23 images was 87%, while for the 142 images it was 85% and for the single best image it was 83%. In addition, the variation of overall accuracy between samples was smaller when Bayesian inference applied sequentially was used. The method is a straight forward way to handle cases where images have different geographical extents or contain occlusions (e.g., due to cloud cover). The method thus has the potential to be an efficient way for utilizing a stream of multi-temporal images for large area tree species or land cover classification.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.