An Attention-Based U-Net for Detecting Deforestation Within Satellite Sensor Imagery

In this paper, we implement and analyse an Attention U-Net deep network for semantic segmentation using Sentinel-2 satellite sensor imagery, with the aim of detecting deforestation within two South American forest biomes: the Amazon Rainforest and the Atlantic Forest. The performance of the Attention U-Net is compared with U-Net, Residual U-Net, ResNet50-SegNet and FCN32-VGG16 across three different datasets (three-band Amazon, four-band Amazon and Atlantic Forest). Results indicate that the Attention U-Net provides the best deforestation masks when tested on each dataset, achieving average pixel-wise F1-scores of 0.9550, 0.9769 and 0.9461, respectively. Mask reproductions from each classifier were also analysed, showing that, compared with the ground reference, the Attention U-Net detects non-forest polygons more accurately than U-Net and provides more accurate land cover polygons overall than each of the other methods, despite its reduced complexity and training time. To our knowledge, this is the first application of an Attention U-Net to a land cover segmentation problem. The paper concludes with a brief discussion of the ability of the attention mechanism to offset the reduced complexity of the network.


Introduction
The Amazon Rainforest represents around 40% of the remaining tropical forests on Earth (Hubbell et al., 2008) and provides refuge for 10% of the world's species (WWF, 2020). Moreover, the enormous carbon-sequestering capability of the Amazon Rainforest, estimated to store 76 billion tonnes of carbon in the form of 390 billion trees (Müller, 2020), is pivotal to the regulation of the continental and global climate. However, the region has seen large-scale deforestation for agriculture, raw materials and land for housing, driven by the rapid development of South America (Garcia-Ayllon, 2016).
This destruction poses an existential threat to the Amazon Rainforest and threatens to further worsen the effects of global warming. It is estimated that the Amazon's ability to act as a carbon sink will disappear by 2035 (Hubau et al., 2020), and the region is already showing signs of approaching this point (Harris et al., 2021), which would result in extreme weather such as drought and forest fires, both locally and globally.
In 2020 alone, an average of 2309.5 hectares of forest per day was destroyed (MapBiomas, 2020), roughly equating every month to an area the size of Ottawa, the capital city of Canada, which covers 6287 square kilometres (Statistics Canada, 2011). As a result, it has become a global priority to minimise the rate of deforestation of the Amazon by designating protected areas, campaigning against companies which produce products in illegally-cleared areas of the forest, and by regular monitoring (Tollefson, 2015). The latter has long been a problem, as on-the-ground monitoring is infeasible due to the sheer amount of surface area the Amazon Rainforest covers (Gong et al., 1994). This paper looks to further the effort towards remotely-sensed monitoring for detecting deforestation, primarily within the Amazon region but also in other forest biomes, through the use of artificial intelligence (AI) in the form of an Attention U-Net deep neural network (Oktay et al., 2018).

Architecture Fundamentals
The Attention U-Net is based upon the U-Net architecture (Ronneberger et al., 2015), which itself is a specific type of fully convolutional network (FCN); a family of neural networks characterised by an encoder-decoder, or contraction and expansion, structure. These are designed for semantic segmentation, also known as pixel-wise classification.
U-Net builds upon the standard FCN architecture by introducing skip connections, meaning that blocks of layers within the contraction phase can pass their output directly to blocks within the expansion phase, which greatly improves the ability to extract high-level features from training images. Previously, the U-Net has been applied to the task of semantic segmentation of the Amazon Rainforest using Sentinel-2 satellite imagery with high success, and the aim of this paper is to explore the incorporation of an attention mechanism into U-Net to improve upon this benchmark.
An attention mechanism aims to replicate the human ability to direct focus, or to concentrate on, specific stimuli. In the domain of neural networks, this involves learning which parts of the input to focus on during the process of training and prediction. Attention mechanisms are prominently used within the field of natural language processing (NLP), where they focus on sections of an input corpus, which is useful within tasks such as sentiment analysis (Galassi et al., 2020).
The Attention U-Net is created by adding an attention gate to the skip connection used within U-Net. Rather than concatenating each upscaled layer in the expansion phase with the appropriate contraction-phase layer, the upscaled layer is concatenated with the output from the attention mechanism, a function of the pre-upscaled layer and the aforementioned contraction-phase layer.
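The attention gate described above can be sketched in Keras as follows. This is an illustrative additive-attention implementation following Oktay et al. (2018), not the paper's exact configuration; the `inter_channels` parameter and layer sizes are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def attention_gate(x, g, inter_channels):
    """Additive attention gate on a U-Net skip connection.

    x: skip features from the contraction phase (spatial size 2S).
    g: gating signal, i.e. the pre-upscaled expansion-phase features
       (spatial size S).
    Returns x re-weighted by learned attention coefficients.
    """
    theta_x = layers.Conv2D(inter_channels, 1, strides=2)(x)  # bring x down to g's size
    phi_g = layers.Conv2D(inter_channels, 1)(g)
    f = layers.Activation("relu")(theta_x + phi_g)
    psi = layers.Conv2D(1, 1, activation="sigmoid")(f)        # coefficients in [0, 1]
    alpha = layers.UpSampling2D(2, interpolation="bilinear")(psi)
    return x * alpha                                          # gated skip features
```

In the decoder, the upscaled expansion-phase layer is then concatenated with `attention_gate(x, g, ...)` rather than with the contraction-phase layer `x` directly.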

Previous Work
Machine learning-based forest cover change monitoring of the Amazon has been ongoing for almost a decade (Souza et al., 2013), with deep learning (DL) methods being the current state-of-the-art. This has been demonstrated within comparisons of cutting-edge methods such as U-Net, ResUNet (Diakogiannis et al., 2020) and SharpMask (Pinheiro et al., 2016), using Landsat imagery of the Amazon Rainforest, versus less sophisticated methods such as the multilayer perceptron (MLP) and random forests (de Bem et al., 2020).
Previous segmentation work using U-Net and involving Sentinel-2 satellite data has also been carried out, such as detecting change within Ukrainian forests (Isaienkov et al., 2021), as well as mapping irrigation systems (Graf, 2020). Other examples include the use of a spatio-temporal FCN for land cover segmentation of Slovenia (Zupanc et al., 2019). Spatio-temporal approaches use a short-term collection of images to map changes over time, thus learning how to recognise deforested areas. This type of approach has been used with early- and late-fusion spatio-temporal U-Nets, which have been shown to provide marginal improvements upon U-Net at mapping deforestation within the Amazon (Maretto et al., 2021).
Desertification detection within Algeria, using Landsat ETM+ satellite data and a variational autoencoder (VAE) (Verstraete, 1986), is another example of the wide variety of contexts and approaches that have been used with semantic segmentation. Importantly, previous applications of the Attention U-Net have only been within medical contexts, such as brain tumour segmentation (Islam et al., 2021), liver computerised tomography (CT) scan segmentation (Li et al., 2020) and gland segmentation (Zhao et al., 2020). As a result, we believe that this paper represents the first, or one of the first, successful applications of the Attention U-Net to a land cover change detection problem.

Datasets
To evaluate the Attention U-Net for deforestation segmentation, we used three datasets produced from images in the satellite imagery database Sentinel-Hub (Sinergise, 2014). The first dataset is a collection of RGB-converted images and deforestation masks of the Amazon Rainforest ([dataset] Bragagnolo et al., 2019), where 0s and 1s in the masks represent deforested and forested areas, respectively. This dataset is composed of 30 training images and 15 validation images. In order to have more unseen data on which to evaluate our models, we took five images from the training data and added them to the validation dataset.
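This re-partitioning of the first dataset can be sketched as follows; the file identifiers are hypothetical, and the random choice of the five images is an assumption rather than something the text specifies.

```python
import random

# Hypothetical image identifiers for the 30 training and 15 validation pairs.
train_ids = [f"train_{i:02d}" for i in range(30)]
val_ids = [f"val_{i:02d}" for i in range(15)]

random.seed(0)
moved = random.sample(train_ids, 5)                  # five images taken from training...
train_ids = [t for t in train_ids if t not in moved]
val_ids = val_ids + moved                            # ...and added to the validation set

print(len(train_ids), len(val_ids))  # 25 20
```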
The other two datasets are both composed of 4-band RGB + near-infrared imagery, one containing images from the Amazon Rainforest and the other from the Atlantic Forest (Mata Atlantica) ([dataset] Bragagnolo et al., 2021). Figure 1 shows the location of these biomes, as well as example images, and clearly shows that the images are highly concentrated within two geographically distinct regions. These datasets had 499 and 485 training images, respectively, with 100 validation images and 20 test images each. For training, we randomly selected 250 training images due to memory limitations.
Throughout this paper, we refer to the first dataset as the 'RGB' dataset and the latter two as the 4-band Amazon and 4-band Atlantic Forest datasets. Each image is of shape (512, 512, 3) in the RGB dataset and (512, 512, 4) in the 4-band datasets, and each image has a corresponding (512, 512, 1) deforestation mask. To produce the images and masks found within each dataset, the author of the dataset split a large satellite image into sub-images and produced masks using a modified version of the k-means classification algorithm within the GRASS-GIS 7.6.1 software suite (GRASS Development Team, 2020). Images were repeatedly re-classified until the corresponding masks had 'a satisfactory rating'.

Difference Between U-Net and Attention U-Net
The parameters within the attention mechanism, as seen in Figure 2, add complexity. Nevertheless, in the configuration used within this paper, the Attention U-Net is much more computationally efficient than U-Net and has greater performance.

Comparison Models
In order to evaluate the performance of the Attention U-Net, four other models were also tested: U-Net, Residual U-Net, ResNet50 (He et al., 2015) with a SegNet backbone (Badrinarayanan et al., 2016), and FCN32 with a VGG16 backbone (Simonyan and Zisserman, 2015). Each model was trained from scratch, including the backbone architecture, in order to provide a fair comparison. Typically, VGG16 and ResNet models are trained using transfer learning, where the model has already been trained on a dataset such as ImageNet (Deng et al., 2009) before being trained on a dataset corresponding to the given task. Transfer learning is particularly useful where limited training data are available, as the features learned from a dataset such as ImageNet can provide transferable knowledge to tasks such as segmentation (Dube et al., 2018).
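As a sketch of this distinction, Keras applications expose the choice through the `weights` argument: `weights=None` gives the trained-from-scratch setup used for the comparison here, while `weights="imagenet"` would instead load ImageNet-pretrained weights for transfer learning. The three-band input shape matches the RGB dataset; how the paper adapted the backbone for 4-band input is not stated.

```python
import tensorflow as tf

# VGG16 backbone with randomly initialised weights (trained from scratch),
# as in this paper's comparison; weights="imagenet" would instead load
# ImageNet-pretrained weights for transfer learning.
backbone = tf.keras.applications.VGG16(weights=None, include_top=False,
                                       input_shape=(512, 512, 3))
print(backbone.output_shape)  # five 2x down-samplings: (None, 16, 16, 512)
```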

Training and Testing Procedure
The number of epochs and the learning rate used for each network can be found in Table 1. These values were found, through experimentation, to give maximal validation accuracy: models were trained using different learning rates and numbers of epochs until the highest validation accuracy was obtained, and once the optimal hyperparameters were found, models were not re-trained. The Adam optimiser (Kingma and Ba, 2017) was used as it provided a greater peak validation accuracy than the stochastic gradient descent (SGD) optimiser.
The Binary Crossentropy (BCE) loss function was used as it has been shown to work well within binary semantic segmentation tasks (Jadon, 2020). Trained models were evaluated by producing predicted masks for the validation and test images and computing pixel-wise differences between these masks and the original ground truth masks.
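The training configuration described above can be sketched in Keras as follows; the helper function and its arguments are illustrative, and the per-model learning rates and epoch counts of Table 1 are not reproduced here.

```python
import tensorflow as tf

def compile_and_train(model, train_data, val_data, learning_rate, epochs):
    """Compile with the Adam optimiser and BCE loss, tracking validation
    accuracy, which was the model-selection criterion described above."""
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss=tf.keras.losses.BinaryCrossentropy(),
        metrics=["accuracy"],
    )
    return model.fit(train_data, validation_data=val_data, epochs=epochs)
```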

Quantifying Results
To quantify our results, the weighted Precision, Recall, F1-score and Jaccard Index, also known as the Intersection over Union (IoU) score, were used.
The IoU score was selected as it describes the similarity of the predicted deforestation polygons to the ground truth, which is a better measure for image segmentation than pixel accuracy, which only counts correct pixel predictions. Weighted metrics were used as they account for the class imbalance between forest and non-forest pixels (Tague-Sutcliffe, 1992).
Precision, Recall and IoU scores were computed for both classes and weighted according to the number of pixels within each class. The positive class is the forest pixels.
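The weighting scheme described above can be sketched in numpy as follows; this is a minimal illustration of per-class scores weighted by pixel counts, not the paper's exact evaluation code.

```python
import numpy as np

def weighted_scores(y_true, y_pred):
    """Per-class precision, recall, F1 and IoU on binary masks,
    weighted by each class's share of the total pixel count."""
    y_true, y_pred = y_true.ravel(), y_pred.ravel()
    scores = {"precision": 0.0, "recall": 0.0, "f1": 0.0, "iou": 0.0}
    for cls in (0, 1):
        t, p = y_true == cls, y_pred == cls
        tp = np.sum(t & p)
        prec = tp / max(np.sum(p), 1)          # guard against empty predictions
        rec = tp / max(np.sum(t), 1)
        f1 = 2 * prec * rec / max(prec + rec, 1e-12)
        iou = tp / max(np.sum(t | p), 1)       # intersection over union
        weight = np.sum(t) / y_true.size       # class share of all pixels
        for name, val in zip(("precision", "recall", "f1", "iou"),
                             (prec, rec, f1, iou)):
            scores[name] += weight * val
    return scores
```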
Another essential piece of the analysis of a model is determining its computational efficiency, a factor which determines whether it is viable for real-world use. If the training time is too high, it may be more suitable to opt for a slightly less performant model which takes less time to train. Furthermore, models with large parameter spaces are more likely to overfit and have worse generalisability than less complex models (Ying, 2019), despite typically having greater performance. Therefore, it is important to evaluate the efficiency of a candidate model to determine whether it should be recommended for wider use, so in this paper we compare the number of learnable parameters and the total training time of each model.
To carry out our experimentation, we used the Google Colaboratory Python environment (Google, 2017) as it provides free use of datacenter-grade GPUs.
Neural networks within this paper were trained on an NVIDIA Tesla P100 16GB GPU with 12GB of RAM. Each of the models were written with the Keras (Chollet et al., 2015) Application Programming Interface (API) of the TensorFlow machine learning framework (Abadi et al., 2015).

RGB Dataset
When testing the models on the RGB validation data, the Attention U-Net achieved the highest results overall, as can be seen in Table 4. This is evidenced in Figure 4a, where the mask prediction from the Attention U-Net is markedly better than those produced by the other classifiers. There is a reduced tendency to incorrectly classify forest as non-forest (false positives for the non-forest class), in contrast to U-Net, which often appears to exaggerate the non-forest polygons. The exception to this is the upper red circle within the Attention U-Net reconstructed mask, where the other classifiers fail to identify the extent of the highlighted forested polygon.

4-band Amazon Dataset
A similar result is seen with the validation and test metrics for the 4-band Amazon data. Table 2 shows that the Attention U-Net outperforms each of the other classifiers, with a 0.5% improvement in F1-score over the standard U-Net. This can be seen in Figure 4b, where the Attention U-Net produces deforestation polygons with greater detail than U-Net and gives fewer false positives than ResNet50-SegNet.

Atlantic Forest Dataset
Following on from this, the Attention U-Net once again outperforms the other models on the Atlantic Forest data. In particular, Figure 3 shows that the F1-score produced by the Attention U-Net is significantly greater than those of the other models. This difference can be witnessed in Figure 4c, where the Attention U-Net is again able to identify more complex polygons than U-Net. ResNet50-SegNet can also accurately identify the polygons in question; however, it produces more false positives than the Attention U-Net.

Testing on non-local imagery
When testing the Amazon-trained models on the Atlantic Forest data, in Table 5, we see that the Attention U-Net is the most performant overall, except for the recall score, in which it is bested by U-Net; more importantly, however, it has higher F1 and IoU scores, meaning that the reproduced mask is more spatially similar to the ground truth and has greater precision and recall overall. In the opposite scenario, the difference in performance between the Attention U-Net and the other models is much greater, suggesting that the Attention U-Net transfers better to data from a different location than the other methods.

Figure 4: (a) Image from the RGB Amazon dataset; (b) image from the 4-band Amazon dataset; (c) image from the 4-band Atlantic Forest dataset.

Figure 6 shows that the Attention U-Net is the most efficient model, containing both the fewest parameters and the lowest training time for each of the datasets; it trains between 20% and 56% faster than the other models. The training time of both '4-band' models was identical, as they were trained over the same number of epochs with the same learning rate.

Attention U-Net versus U-Net
Finally, we compare the ground truth masks to the predictions made by the Attention U-Net and U-Net. Figure 5 shows that the Attention U-Net correctly identifies a greater percentage of forest pixels than U-Net on both the RGB and Atlantic Forest datasets, by 2.47% and 3.06% respectively, and produces 2.47% and 3.16% fewer false positives, respectively. On the 4-band Amazon dataset, the Attention U-Net produces fewer misclassifications as well as a greater proportion of correct predictions overall compared with U-Net; this is highlighted by the fact that only 2.21% of pixels are misclassified. Taking into account the correctly identified pixels within each dataset, the Attention U-Net identifies 1.03%, 0.274% and 1.73% more pixels correctly on the respective datasets. When using a model to determine deforested regions in satellite imagery in order to estimate total deforested area, false positives are more desirable than false negatives, as an underestimate of deforested area can cause new deforestation within an area to go undetected. In this case, however, as the Attention U-Net accurately identifies a greater number of pixels than U-Net, the greater number of false negatives is not an issue.
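The pixel percentages discussed above come from confusion matrices normalised by the total pixel count, as plotted in Figure 5. A minimal numpy sketch, assuming binary masks with forest encoded as 1:

```python
import numpy as np

def pixel_confusion(y_true, y_pred):
    """2x2 pixel-wise confusion matrix as percentages of all pixels:
    rows = ground truth (non-forest, forest), columns = prediction."""
    y_true, y_pred = y_true.ravel(), y_pred.ravel()
    cm = np.zeros((2, 2))
    for t in (0, 1):
        for p in (0, 1):
            cm[t, p] = np.sum((y_true == t) & (y_pred == p))
    return 100.0 * cm / y_true.size
```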

General comments
Throughout our analysis, the Attention U-Net outperforms the other models.
Despite the Residual U-Net providing better results in some cases, the Attention U-Net consistently provides the best, or second-best, results. The improvement of the Attention U-Net upon U-Net is likely due to the attention mechanism being able to distinguish high levels of detail in complex polygons, resulting in fewer errors within mask predictions. It was also shown within our experimentation that the 4-band Attention U-Net models are transferable to images from a different region, and this could be further confirmed by testing on a similar dataset from a different forest. We also saw that the Attention U-Net is more efficient than U-Net: its training time was up to 30% lower, yet its performance was noticeably better. With regard to the datasets themselves, we can see in Figure 5 that the Atlantic Forest dataset has a large class imbalance in favour of non-forest pixels, which accounted for two-thirds of the total number of pixels. This is likely the reason why the Atlantic Forest models performed very well when evaluated on Amazon data.

Limitations
We conjecture that the performance of the classifiers is limited by the quality of the ground truth masks, as they were produced using an imperfect classification method. It was noted in the dataset author's paper that some model mask predictions identify deforested polygons which were not picked up within the ground truth masks. As a result, it could be useful for future work to update the ground truth masks by adding in the polygons found by our Attention U-Net model.

Future work
To build upon the work from this paper, interested readers could experiment with other loss functions such as Jaccard loss (Bertels et al., 2019), Dice loss (Sudre et al., 2017), or derivatives such as DiceTopK and DiceFocal, as they have been successful with other segmentation tasks (Ma et al., 2021). Also, the addition of regularisation layers such as Dropout and Batch Normalisation could reduce overfitting and validation loss. These were not tested in our experimentation, but have been shown to provide improvements to deep learning models in multiple scenarios.
Since the addition of the attention mechanism allows the Attention U-Net to perform to such a degree despite having very few parameters, we believe that others may have success implementing attention mechanisms into less complex versions of existing deep learning methods to a similar effect. One such possibility is the use of a Residual Attention U-Net, which would contain more parameters than the Attention U-Net, and perhaps require a longer training time, but may improve upon the Residual U-Net.
Finally, we suggest that transfer learning could be used with either of the 4-band Attention U-Net models by training on both 4-band training datasets.
This could allow for greater transferability to images from a wider set of locations. It was shown in Section 3.3 that the models trained on a single location were transferable, so it is sensible to suggest that transfer learning would further improve this and allow for successful applicability to forest imagery from around the world.

Conclusion
In this paper, we have carried out a quantitative analysis of the performance of the Attention U-Net on the semantic segmentation of South American tropical rainforest imagery for detecting deforestation. We found that the addition of an attention mechanism to a less complex version of U-Net provides greater performance than the standard U-Net architecture, as well as several other state-of-the-art methods. The attention mechanism enables the network to retain high levels of spatial information despite containing layers of much lower dimensionality than U-Net. Due to the successful application of an attention mechanism to a deep neural network for this task, we can recommend the use of an Attention U-Net for other land cover segmentation tasks in the field.
CRediT authorship contribution statement

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.