Tree Cover Estimation in Global Drylands from Space Using Deep Learning

Accurate tree cover mapping is of paramount importance in many fields, from biodiversity conservation to carbon stock estimation, ecohydrology, erosion control, or Earth system modelling. Despite this importance, there is still uncertainty about global forest cover, particularly in drylands. Recently, the Food and Agriculture Organization of the United Nations (FAO) conducted a costly global assessment of dryland forest cover through the visual interpretation of orthoimages using the Collect Earth software, involving hundreds of operators from around the world. Our study proposes a new automatic method for estimating tree cover using artificial intelligence and free orthoimages. Our results show that our tree cover classification model, based on convolutional neural networks (CNN), is 23% more accurate than the manual visual interpretation used by FAO, reaching up to 79% overall accuracy. The smallest differences between the two methods occurred in the driest regions, but disagreement increased with the percentage of tree cover. The application of CNNs could be used to improve and reduce the cost of tree cover maps from the local to the global scale, with broad implications for research and management.


Introduction
Making accurate estimates of tree cover with affordable methods and available data across vast regions is critical in many fields, such as carbon stocks and biomass estimation [1][2][3][4][5], biodiversity assessment [6,7], ecohydrological modelling [8], Earth system modelling [9][10][11], forestry [12], and ecological succession [13,14]. The role of tree cover in stabilizing the carbon cycle in drylands is particularly relevant, given their capacity to buffer the seasonal and interannual variability in precipitation [15]. Reducing uncertainties about the extent, distribution, and tree cover of forests in dryland biomes [16,17] is particularly necessary given the huge contribution of drylands to the interannual variability of the global carbon cycle [18]. To address these uncertainties, the FAO (Food and Agriculture Organization of the United Nations) recently conducted the Global Drylands Assessment (GDA), a costly study that assessed the forest cover of dryland forests around the world [19].
FAO's GDA expanded the extent of tree cover in dryland biomes by 9% [19] compared to previous studies. Such differences in tree cover were particularly important for regions with a high proportion of open forest (10-40% tree cover), but also for those areas with closed forest cover (≥40% tree cover) [20]. Discrepancies in tree cover estimation can be explained by the different methodologies used for estimating tree cover, which include images with different spatial resolutions (mainly very high resolution versus mainly Landsat [21] or MODIS [22,23] images), and the use of different image processing methods (automatic algorithms versus manual photointerpretation [16,19-24]). FAO's GDA used the augmented visual interpretation method (implemented in the Collect Earth software) on very high resolution (VHR) images from Google Earth™. However, FAO's GDA was controversial and several studies raised many sources of uncertainty [25][26][27][28][29], mainly related to soil background effects and to the biases and subjectivities introduced by hundreds of operators worldwide. In any case, FAO's GDA required a vast effort, which limits the use of this methodology for monitoring. Hence, there still exists a need to develop more cost-effective, faster, less subjective and more precise methods to reduce uncertainty in estimating tree cover in global drylands [30][31][32].
Recent studies have demonstrated the potential of convolutional neural networks (CNNs) in remote sensing (e.g., [33]) for land cover mapping using aerial and satellite imagery (e.g., scattered shrub detection [30], crop type classification, land use and land cover mapping, and change detection [34][35][36][37][38]). CNNs, a type of deep-learning method within artificial intelligence, are the dominant and most accurate methods in recognition and detection in images [39]. Since 2014, the quality of CNN architectures has considerably improved thanks to deeper and broader networks [40]. CNNs such as GoogLeNet Inception can achieve accuracies of over 90%, when trained on high quality datasets, operate under strict memory and computational processing capacity constraints, and are increasingly used in remote sensing [41]. For example, Inception v.3 [42], the model that won the ILSVRC 2016 (ImageNet Large Scale Visual Recognition Competition), is now one of the most widely used CNNs.
Image classification based on deep learning has been shown to improve on the speed and accuracy of manual labelling [43][44][45]. CNNs have been used to classify different land uses with well-defined classes in satellite images [46]. However, in continuous classifications, such as the estimation of the percentage of tree cover, artificial intelligence based on deep learning has been used in few studies (e.g., biomass estimation [47]). Our study proposes the use of CNNs to develop an automatic method for quantifying forest cover in drylands using very high resolution satellite and aerial images (such as the RGB images of the Earth surface provided by Google Earth™ software, https://google.com/earth). Then, we compared the performance of our CNN-based method with the approach of the FAO's Global Dryland Assessment, which is based on manual visual interpretation using Collect Earth by human technicians [19]. Our main hypothesis is that the estimation of tree cover using a CNN-based model would be more accurate and less biased than FAO's assessment using manual visual interpretation. To do this, we first compared the effect of different strategies to design the training datasets on the ability of the CNN to differentiate three tree cover classes (open, closed, and dense forest; see Figures 1 and S2): i.e., (1) by including a non-forest auxiliary class with many more samples than the targeted tree cover classes, (2) by considering samples along continuous tree cover levels within each class versus samples taken only at a discrete tree cover level within each class, and (3) by increasing the size of the training dataset. Second, we used the best trained CNN model to quantify tree cover on a representative sample of plots in the FAO's GDA. Finally, the accuracy of tree cover quantification was compared between the CNN-based method and the FAO's GDA [19].
The main contributions of this paper can be summarized as follows:
• Finding a dataset design and learning strategy that makes the CNN model correctly learn and predict tree cover in orthoimages. In particular, we showed how (1) adding to the training dataset an auxiliary "non-forest" class with many more samples than the targeted classes negatively affected performance; (2) selecting training samples along the continuous range of tree cover percentages within each class achieved greater performance than selecting samples at a discrete percentage; and (3) increasing the number of samples in the targeted tree-cover classes of the training dataset did not significantly affect performance.
• Demonstrating as a proof of concept that CNN models can estimate tree cover in orthoimages with better accuracy than FAO's GDA based on manual land cover classification in Collect Earth.

Figure caption (partially recovered): [...] Figure S2). (3) Detecting phase: the best trained CNN-based model is applied to 396 new images taken from FAO's plots, stratified across zones and aridity levels, to estimate their tree cover class in different drylands globally. When the probability of an image belonging to a class was lower than 50%, the prediction was considered Uncertain. (4) Validation phase: (4a) 80% of the images were used to train the model and the other 20% were dedicated to the internal validation; (4b) the external validation was based on the 396 new images taken from FAO's plots and independently tagged by the authors. The performance assessment of the tree cover classifications was based on Precision, Recall, and F1-measure.

Materials and Methods
The problem of estimating tree cover in global drylands with deep learning by means of an image classification model was addressed as a proof-of-concept of how convolutional neural networks can provide more accurate estimates of tree cover than visual photointerpretation made by humans. To build the CNN-based model of tree cover, we used the Inception v.3 [42,43] image recognition model, one of the most widely used CNNs (see Supplementary Material, Figure S1), with TensorFlow (Sections 2.3-2.5). CNN-based models learn directly from image data, eliminating the manual determination of spectral and spatial characteristics that define the classes of objects or units to be identified in the image (e.g., tree cover classes). To compare the CNN's performance to FAO's GDA photointerpretation through Collect Earth [19], we assessed a representative sample of plots from FAO's GDA, stratified across 12 geographical areas, four levels of aridity, and four levels of tree cover (Table S1), as in [19]. The 12 dryland regions were: Australia, Central Asia, Southeastern America, Southwestern America, Europe, Horn of Africa, Middle East, North America, North Africa, Sahel, South Africa, and Southwest Asia. The four aridity levels were based on the aridity index (AI: ratio between total annual precipitation and potential evapotranspiration): hyperarid (AI < 0.05), arid (0.05 ≤ AI < 0.2), semiarid (0.2 ≤ AI < 0.5) and dry subhumid (0.5 ≤ AI < 0.65). The four tree cover classes were: Non-Forest (<10%), Open Forest (10-40%), Closed Forest (41-65%), and Dense Forest (66-100%).
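The stratification thresholds above can be written down as two small lookup functions. This is only an illustrative sketch: the function names are hypothetical, the "non-dryland" label for AI ≥ 0.65 is an assumption, and assigning fractional percentages between two integer class limits (e.g., 40.5%) to the upper class is also an assumption, since the text only gives integer limits.

```python
def aridity_level(ai: float) -> str:
    """Map an aridity index (annual precipitation / potential evapotranspiration) to a level."""
    if ai < 0:
        raise ValueError("aridity index cannot be negative")
    if ai < 0.05:
        return "hyperarid"
    if ai < 0.2:
        return "arid"
    if ai < 0.5:
        return "semiarid"
    if ai < 0.65:
        return "dry subhumid"
    return "non-dryland"  # assumption: label for values outside the four dryland levels


def tree_cover_class(percent: float) -> str:
    """Map a tree cover percentage (0-100) to one of the four tree cover classes."""
    if not 0 <= percent <= 100:
        raise ValueError("tree cover must be between 0 and 100")
    if percent < 10:
        return "Non-Forest"
    if percent <= 40:
        return "Open Forest"
    if percent <= 65:  # assumption: values between 40 and 41 fall in the upper class
        return "Closed Forest"
    return "Dense Forest"
```

For instance, a plot with AI = 0.3 and 55% tree cover would be stratified as semiarid, Closed Forest.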

Workflow
To build the CNN-based method for tree cover estimation, we followed four phases ([...] Table S1 and Archive S3) with the built CNN-based model (see Section 2.4). (4) Validation phase: assessment of performance on the new images (see Section 2.5).

Datasets Design
Before building the training datasets for the CNN-based models, we compiled a global database of very high resolution satellite and aerial RGB images from two sources:


• First, we downloaded 71,135 images from Google Maps corresponding to the FAO's GDA 0.5 ha forest and non-forest plots that were available at zoom 19 (eye altitude ~150 m and pixel ~0.5 m). Download occurred between 1 and 13 December 2017 (see Supplementary Material, Archive S1). In FAO's GDA, each plot is georeferenced and tagged as forest or non-forest, with a tree cover percentage, and with a climate and region of the world [19].
• Second, to increase the number of images in the "Non-Forest" class, we used the Northwestern Polytechnical University NWPU-RESISC45 dataset [48], a set of publicly available reference orthoimages for the classification of remotely sensed images, developed by NWPU. This dataset contains 31,500 images with 45 different scene classes (i.e., land use, land cover, objects, and infrastructure) with 700 images within each class. To build our global dataset, we removed all images corresponding to the "chaparral" and "forest" classes and grouped the remaining classes. We then reviewed all these images and filtered noisy ones. We obtained a total of 29,942 images that met FAO's criteria for Non-Forest (e.g., roads, rivers, beaches, urban areas, etc.).
Once the global database was compiled, we created five different training datasets to compare the effect of five different learning strategies on the performance of the CNN-based model (Table 1). The selection of forest images from our database was stratified across the different combinations of regions, aridity levels, and tree cover levels used by FAO [19] (Supplementary Material, Table S1). The five created datasets contained three tree cover classes: Open Forest, Closed Forest, and Dense Forest (Figure S2). In the continuous strategy, we built each tree cover class by selecting training samples along the continuous range of tree cover percentages. In the discrete strategy, we built each tree cover class by selecting samples only at a discrete percentage of tree cover (the 15%, 55%, and 95% levels were used to maximize the distance and contrast between classes, i.e., 40% difference in tree cover percentage among them). We also included a "Non-Forest" class in two of the datasets (see Section 2.4 for a description of the images contained in each dataset).

Table 1. Five datasets created considering continuous or discrete training strategy for the forest classes and with or without the "Non-Forest" class. In the continuous strategy, the forest classes were: Open Forest with tree cover 10-40%, Closed Forest with tree cover 41-65%, and Dense Forest with tree cover 66-100%. In the discrete strategy, the forest classes were: Open Forest with tree cover 15%, Closed Forest with tree cover 55%, and Dense Forest with tree cover 95%. Number of samples (plots) used to train the CNN-based models (third column). See Table S1 in Supplementary Material for more information about the images used in the classes.
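The difference between the continuous and discrete selection strategies can be illustrated with a short sketch. The plot list, identifiers, and function names below are hypothetical and only serve to show how the two strategies would pick samples from plots tagged with an integer tree cover percentage; this is not the authors' selection code.

```python
# Class definitions taken from the text: intervals for the continuous
# strategy and single levels for the discrete strategy.
CONTINUOUS = {"Open Forest": (10, 40), "Closed Forest": (41, 65), "Dense Forest": (66, 100)}
DISCRETE = {"Open Forest": 15, "Closed Forest": 55, "Dense Forest": 95}


def select_continuous(plots):
    """Keep every plot whose tree cover falls inside a class interval."""
    selected = {name: [] for name in CONTINUOUS}
    for plot_id, cover in plots:
        for name, (low, high) in CONTINUOUS.items():
            if low <= cover <= high:
                selected[name].append(plot_id)
    return selected


def select_discrete(plots):
    """Keep only plots whose tree cover equals the single level of a class."""
    selected = {name: [] for name in DISCRETE}
    for plot_id, cover in plots:
        for name, level in DISCRETE.items():
            if cover == level:
                selected[name].append(plot_id)
    return selected


# Hypothetical plots: (plot id, tree cover percentage).
plots = [("p1", 15), ("p2", 30), ("p3", 55), ("p4", 60), ("p5", 95), ("p6", 5)]
continuous_sets = select_continuous(plots)
discrete_sets = select_discrete(plots)
```

With these hypothetical plots, the continuous strategy keeps p1 to p5 while the discrete strategy keeps only p1, p3, and p5, which illustrates why continuous classes expose the model to more varied examples per class.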

Training Phase: CNN-Based Model Parameters
To build a CNN-based model able to accurately estimate tree cover in a given input image, we evaluated different training strategies. In particular, we considered continuous and discrete learning. In continuous learning, the model is trained to estimate an interval of tree cover, e.g., Open Forest with tree cover from 10% to 40%. In discrete learning, the model is trained to estimate the exact tree cover, e.g., Open Forest with 15% tree cover. We also evaluated the effect of including a "Non-Forest" class in the training. In total, we designed five datasets (see Table 1).
We obtained five models by training Inception v.3 on each of the five created datasets. Several studies have shown that increasing the size of the dataset using data augmentation improves the performance of CNN-based models [49,50]. These techniques have been proposed to reduce the requirement of a large dataset for model training [43][44][45]. We configured the model parameters by training the last two fully connected network layers on our dataset using a learning rate of 0.001 and a decay factor of 16 every 30 epochs. As an optimization algorithm, we used RMSProp with a momentum of 0.9 and epsilon of 0.1.
In the training phase, data augmentation was used to artificially increase the number of samples [50]. Data augmentation applies specific transformations to the original samples while preserving their labels, to obtain samples slightly different from the original ones [50]. In particular, a 50% increase in brightness, half-random rotation, random rotation with a factor randomly selected between 0° and 359°, random cropping, and random scaling of image size were used. In addition, we used transfer learning to reuse the knowledge acquired by the CNN model on a first problem and apply it to another related problem. Transfer learning is analogous to the development of knowledge in humans, which never starts from scratch but always builds on previous knowledge. We used the weights pretrained on the ImageNet database to initialize the Inception v.3 model. ImageNet has 1.28 million images for 1000 different classes of objects [51].

Optimizing a CNN-Based Model to Detect Tree Cover Classes
In the classification phase, the CNN-based model provided, for each image, the probability (0-100%) of belonging to each class. A prediction had to reach a probability of 50% or higher to be used in further analysis. Otherwise, we considered the prediction to be uncertain and did not take it into account.
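The 50% rule can be expressed as a small decision function. This is a minimal sketch under the assumption that the model output is available as a mapping from class name to probability (0-1); the function name and the example probabilities are hypothetical.

```python
def predict_class(probabilities, threshold=0.5):
    """Return the most probable class, or "Uncertain" if it is below the threshold."""
    best = max(probabilities, key=probabilities.get)
    # "50% or higher" in the text, so the comparison is inclusive.
    return best if probabilities[best] >= threshold else "Uncertain"


# Hypothetical model outputs for two images.
confident = predict_class({"Open Forest": 0.62, "Closed Forest": 0.30, "Dense Forest": 0.08})
uncertain = predict_class({"Open Forest": 0.45, "Closed Forest": 0.40, "Dense Forest": 0.15})
```

Images labelled "Uncertain" by such a rule would simply be dropped before computing any performance metric.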
To determine which of the five CNN-based models best estimated tree cover, each model was trained on a differently constructed dataset (see below in this section). To assess whether "continuous tree cover" versus "discrete tree cover" categories increased the model's performance during the training phase, we tested two different strategies for constructing tree cover categories. In the first case, we defined continuous categories considering the following three tree cover intervals: 10-40% for Open Forest, 41-65% for Closed Forest, and 66-100% for Dense Forest classes [52]. In the case of the discrete categories, we selected three tree cover classes separated by 40% tree cover, built from images of Open Forest with 15% tree cover, Closed Forest with 55%, and Dense Forest with 95%. This resulted in five different training datasets for the CNN-based models (Table 1). In addition, we evaluated whether a Non-Forest class with tens of thousands of training images facilitates the classification of tree cover in drylands. Finally, success rates were compared between the CNN-based model trained without a Non-Forest class and with a Non-Forest class that contained 29,942 non-forest images from the NWPU-RESISC45 dataset [48].

Validation of the CNN-Based Model
To validate the CNN-based models, we built a final dataset of 396 images from the 71,135 Google Maps VHR images of FAO's GDA plots. We selected the images with samples stratified across different combinations of regions, aridity levels, and tree cover, according to FAO [19]. For each combination, we tried to select between three and seven images from the 71,135 images downloaded at zoom 19 from Google Maps. However, this was not always possible, resulting in a dataset with a total of 396 images (Table S1 and Archive S3). Two evaluators (authors) established Forest or Non-Forest tags and tree cover for each image in the external dataset. The evaluators were trained to perform the visual interpretation of the tree cover level of the images. Therefore, the labels assigned by them were used in the external validation as a reference for a correct classification. Two validation procedures were performed: internal and external. First, the internal validation used the performance obtained by classifying 20% of the images from the training dataset (while 80% was used in the training phase). Second, external validation was carried out independently of the training dataset.
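The 80/20 partition used for internal validation can be sketched as follows. The text does not state how the partition was drawn, so the random shuffle, the fixed seed, and the function name are assumptions made only for illustration.

```python
import random


def split_train_validation(items, validation_fraction=0.2, seed=0):
    """Shuffle the items and hold out a fraction of them for internal validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # assumption: simple random split
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Hypothetical image identifiers.
train_ids, validation_ids = split_train_validation(range(100))
```

The external validation set (the 396 independently tagged images) would be kept entirely outside this split.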
To evaluate the performance of the CNN-based models, we used these metrics [30]: precision (Equation (1)), recall (Equation (2)), and F1-measure (Equation (3)):

$$\text{Precision} = \frac{TP}{TP + FP} \quad (1)$$

$$\text{Recall} = \frac{TP}{TP + FN} \quad (2)$$

$$\text{F1-measure} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \quad (3)$$

where false positives (FP) represent the images that were assigned to a class by the model but actually corresponded to another class, true positives (TP) represent the images that were correctly classified by the model, and false negatives (FN) represent the images of a class not detected by the model. In simple terms, high precision (positive predictive value) means that most images the model assigned to a tree cover class truly belonged to it, while high recall (sensitivity) means that the model retrieved most of the images that truly belonged to each class. The F1-measure provides a balance between recall and precision. We used a five-fold cross-validation strategy to evaluate our approach and the baseline on the test dataset.
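The three metrics described above can be computed per class from a list of reference labels and a list of predicted labels. The sketch below uses only the standard definitions of precision, recall, and F1-measure; the function name and the toy labels are hypothetical, and returning 0.0 when a denominator is zero is an assumption.

```python
def per_class_metrics(reference, predicted, label):
    """Return (precision, recall, F1) for one class from paired label lists."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for ref, pred in pairs if ref == label and pred == label)
    fp = sum(1 for ref, pred in pairs if ref != label and pred == label)
    fn = sum(1 for ref, pred in pairs if ref == label and pred != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical reference (evaluator) labels and model predictions for six images.
reference = ["Open", "Open", "Closed", "Dense", "Dense", "Closed"]
predicted = ["Open", "Closed", "Closed", "Dense", "Open", "Closed"]
open_metrics = per_class_metrics(reference, predicted, "Open")
closed_metrics = per_class_metrics(reference, predicted, "Closed")
```

In this toy example the "Closed" class has perfect recall but lower precision, because one Open image was also assigned to it.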

Effect of CNN Training Strategies on Performance in Estimating Tree Cover in Drylands
The lowest F1-measure for tree cover classification was observed in the two models trained with the datasets that included the Non-Forest class. Among the models trained on datasets of the same size that did not include a Non-Forest class (rows 2 and 3 in Figure 2), the CNN-based model trained with continuous categories obtained a higher F1-measure than the one trained with discrete categories. The improvement in F1-measure from using continuous categories in the training dataset was large (up to 6%) in Dense Forest, null (0%) in Closed Forest, and small (2%) in Open Forest (Figure 2). Increasing the size of the training dataset improved model performance only slightly. The most accurate CNN-based model was obtained without the Non-Forest auxiliary class, using continuous tree cover classes and a larger number of training samples (see Table 1). This model reached the highest F1-measure for Open Forest and Dense Forest, 71% and 79%, respectively. Compared to the CNN-based model with fewer samples, the model with the largest number of samples increased the F1-measure by up to 1% for Open Forest and Dense Forest (Figure 2).
In summary, the FAO's GDA manual photointerpretation method using Collect Earth [19] always showed lower accuracy than the CNN-based model for the selected dataset in all tree cover classes. For the Open, Closed, and Dense Forest classes, the F1-measures of the best CNN-based model (71%, 53%, and 79%, respectively) were higher than those of FAO's assessment (54%, 30%, and 52%) (Figure 3 and Table S2).

Differences between the Best CNN-Based Model and FAO's Global Dryland Assessment
Full consensus of correctly estimated tree cover classes between the best CNN-based model and FAO's GDA assessment only occurred in 79 plots (Figure 4), 22% of the total dataset (Table 2). The number of correctly estimated tree cover classes just by CNN-based models was higher (129) than the number correctly estimated just by FAO's GDA (65). Regarding aridity gradient, the disagreements between FAO's GDA assessment and the best CNN-based model were 4.7%, 15.3%, 24.5%, and 27.6% of F1-measure in hyperarid, arid, semiarid, and dry subhumid, respectively (Table 2).

Table 2. Consensus of the estimated tree cover class in 359 plots (396 without 37 plots with uncertain class with score <50%). GT is ground truth, CNN is the best results of our CNN-based model, and FAO is the FAO's Global Dryland Assessment from Bastin et al. [19].
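A consensus tally of this kind can be sketched by comparing, plot by plot, the CNN and FAO classes against the ground truth. The function name, category labels, and toy data below are hypothetical; only the counts 79 and 359 in the last line come from the text.

```python
from collections import Counter


def consensus(ground_truth, cnn, fao):
    """Count plots by which method matched the ground-truth class."""
    tally = Counter()
    for gt, cnn_class, fao_class in zip(ground_truth, cnn, fao):
        cnn_ok = cnn_class == gt
        fao_ok = fao_class == gt
        if cnn_ok and fao_ok:
            tally["both correct"] += 1
        elif cnn_ok:
            tally["only CNN correct"] += 1
        elif fao_ok:
            tally["only FAO correct"] += 1
        else:
            tally["neither correct"] += 1
    return tally


# Hypothetical classes for four plots.
toy = consensus(
    ["Open", "Closed", "Dense", "Open"],
    ["Open", "Closed", "Open", "Dense"],
    ["Open", "Dense", "Dense", "Closed"],
)

# Share of full consensus reported in the text: 79 of 359 plots.
full_consensus_percent = round(100 * 79 / 359)
```

The last line reproduces the reported share of full consensus, 22% of the 359 plots with a non-uncertain class.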

CNNs to Estimate Tree Cover in Drylands
Our results showed that the automatic classification of tree cover in satellite VHR RGB images in drylands with CNNs was more effective than the manual photointerpretation method of Collect Earth used by FAO's GDA, which employed hundreds of operators. The greater performance of CNNs compared to human visual interpretation has also been reported in other studies (e.g., [43][44][45]). Our results call for further improvements in our CNN method by using more images in the training dataset and increasing their quality (e.g., spatial resolution) and representativeness in all global environmental gradients.
Many studies show that deep-learning classification is a powerful tool for mapping land cover and crop types using remote sensing data [53,54], detecting individual trees in RGB images [55], identifying tree species related to forest disturbance with very high resolution multispectral images [56], and detecting firs damaged by bark beetle in unmanned aerial vehicle (UAV) images [57]. However, estimating the percentage of surface occupied by a land use is usually addressed with segmentation methods such as Mask-RCNN [58], which have a high computational cost [59]. In this study, tree cover estimation is approached with a simple classification method [42] that gave better results than manual methods [19]. In addition, increasing the size of the training dataset generally improves classification accuracy (e.g., [60,61]); however, in our tree cover estimation problem, increasing the training data by 165% only improved the F1-measure by 1% in two of the three classes studied (tree cover 10-40% and tree cover 66-100%).
Intermediate tree cover (Closed Forest class) was estimated less accurately, by both the Collect Earth manual approach and the CNN-based methods, than low tree cover (Open Forest) and high tree cover (Dense Forest). This is due to its intermediate position between the Open Forest and Dense Forest classes, with which it shares similarities. In fact, many studies only use two forest density categories, such as open and dense forest (e.g., [62][63][64]), even for assessing transitional forests [65], or even just one closed forest class [66]. This similarity of our Closed Forest class to the surrounding Open and Dense Forest classes could lead the CNN-based models to confuse the patterns of Closed Forest with those of the other two classes. In addition, the comparison between the FAO's GDA method and the best CNN-based model could include unintended classification errors, due to human error in the photointerpretation of the plots or to an update of the images in Google Maps (between FAO's assessment and ours) that would show changes in tree cover. Since the accuracy (F1-measure) of FAO's GDA tree cover estimation is relatively low [26][27][28], we should not expect high consensus with any method that has high accuracy. A high consensus would imply that the new method is as inaccurate as FAO's GDA [19].
The CNN-based model was more effective when the training dataset did not include a Non-Forest class, which can be understood as a decrease in background noise [67]. Thus, the learning was more oriented towards the proposed goal: the quantification of the tree cover of forest (e.g., [68,69]). The CNN-based models trained with continuous categories showed a high F1-measure, probably because they were trained with more diverse and varied examples to characterize each class of tree cover, except the Closed Forest class, as previously discussed. The CNN-based models trained with discrete categories presented a low F1-measure, probably because the variability between the training samples was lower than when using continuous categories.

Limitations and Challenges in Tree Cover Mapping
The main limitation found in this work was of a technical nature. Despite the current increase in the number of very high resolution satellite sensors (e.g., [70][71][72]), there are still many parts of the Earth's surface that are not yet covered by VHR images (e.g., remote areas). For example, the availability of VHR images is low in Africa but high in Australia [73]. This means that we could not design a larger training dataset for each class (even if we used data augmentation and transfer learning) in CNN-based models [74]. Other technical difficulties for the application of CNNs on satellite images to estimate tree cover at the global scale are the vast amount of data to handle [75], and the need for graphics processing unit (GPU) servers to efficiently train the CNN models [76].
Traditionally, forest cover has been estimated by surrogates of vegetation structure (e.g., leaf area index, phytolith index [77]). Many studies have assessed tree cover from local to global scales through the use of normalized difference vegetation index (NDVI) images from the MODIS (Moderate Resolution Imaging Spectroradiometer) and Landsat satellite sensors, in both cases with high accuracies [15,18,20,22].
In addition, as a complement to these methods, CNN-based methods could provide new information for a more accurate estimate of tree cover and, subsequently, of net carbon gains [78], biodiversity [79], or forest health [80]. A global application of our CNN-based model on VHR satellite images to estimate tree cover in drylands could greatly complement traditional tree cover mapping methods [16,20,22]. Increased tree-cover estimation accuracies would also help to characterize forest areas affected by environmental and anthropogenic changes (e.g., [81,82]), or to set priorities for protecting biodiversity from global change [83][84][85]. CNNs are useful because of their automatic character; once built and trained, they can be used to classify new images without another training phase. In addition, CNNs have the favorable characteristic of being able to learn and reuse learning from one image to another [49], which makes them versatile at a planetary scale. For example, CNN-based tree cover detectors applied to VHR satellite images could be used by the global community involved in the Group on Earth Observations-Biodiversity Observation Network (GEOBON) initiative on essential biodiversity variables from satellite remote sensing [86].

Conclusions
Dry forests have global importance. However, there is still high uncertainty about their extent and tree cover. Our research shows how the development and implementation of new free and automatic tools using CNN-based models on very high resolution and free aerial and satellite RGB imagery could be used to quantify the tree cover of forests in global drylands. Compared to the Collect Earth manual method used in the FAO's Global Drylands Assessment, the CNN-based model has demonstrated greater accuracy (23% higher) and speed. We encourage researchers and organizations (e.g., FAO) to further develop and apply this technology in future works at local, regional, continental, or global scales to increase the accuracy of forest tree cover estimation in drylands. More accurate tree cover mapping based on CNN models would positively impact many fields, from biodiversity conservation to carbon stock estimation, ecohydrology, erosion control, or Earth system modelling.

Table S1: Representativeness of the database for external validation of the CNN-method that classifies orthoimages into tree cover classes. A total of 396 images were sampled globally stratified across the different combinations of regions, aridity levels, tree cover levels and forest and non-forest classes according to FAO's GDA [19]. Some combinations did not exist or did not have images with good quality. NF (Non-Forest), F (Forest), TC (Tree cover). Table S2