Methodology of Data Fusion Using Deep Learning for Semantic Segmentation of Land Types in the Amazon

This study proposes a methodology using deep learning and a multi-resolution segmentation algorithm to perform the semantic segmentation of remote sensing images. Initially, the image is segmented using a CNN; then, an image with homogeneous regions is generated using a multi-resolution segmentation algorithm. Finally, a data fusion process is performed with these two images, generating the final classified image. The study area was the Brazilian Amazon region. The proposed methodology classifies images into the following classes: forest, pasture, and agriculture. The input data were LANDSAT-8/OLI images, and the reference data were extracted from the results of the TerraClass project of 2014. Two datasets were evaluated: the first with six bands and the second with three bands. Three CNN architectures were evaluated together with three optimization methods (SGDM, ADAM, and RMSProp) and with dropout and L2 regularization as methods for improving generalization. The best model (CNN + optimization method + generalization improvement technique), evaluated on the validation set, was submitted to a 5-fold cross-validation methodology, and the results were compared with pre-trained networks using transfer learning; in this case, the networks used for comparison were ResNet50, InceptionResNetv2, MobileNetv2, and Xception. The proposed methodology was evaluated through image segmentation of some regions of the Amazon and, finally, in regions used by other authors. The accuracy values obtained for the evaluated images were over 99%.


I. INTRODUCTION
Remote sensing consists of collecting information about an object or geographic area without direct contact. This information is collected through various technologies, such as sensors installed on board aircraft, satellites, or other platforms [1]. The data generated by remote sensing systems are very useful for several applications, for example: urban (demographic studies, government registrations, urban planning, support for the real estate sector); agricultural (monitoring crop conditions, crop forecasting, soil erosion); geological (minerals, oil, and natural gas exploration); ecological (monitoring flooded regions, soils, oceans, and continental waters); and forest (lumber production, biomass estimation, monitoring of deforestation) [2].

The associate editor coordinating the review of this manuscript and approving it for publication was Aneel Rahim.
With regard to monitoring deforestation in the Amazon through remote sensing, the National Institute for Space Research (INPE) is a world reference. Among the various projects developed by INPE for this monitoring, the Program for Monitoring Deforestation in the Legal Amazon (PRODES) and TerraClass stand out [3], [4]. PRODES provides data through annual maps of deforestation in the Amazon region. The TerraClass project uses the data generated by PRODES to perform a land use and land cover classification in the following classes: forest, pasture, agriculture, urban areas, mining and others. This information helps government agencies develop public prevention policies to contain deforestation [5]. In 2018 and 2019, deforestation in the Amazon affected 7,535 km² and 10,129 km², respectively. Therefore, the deforested area showed an increase of 34% from 2018 to 2019. In 2019, 84.56% of all deforestation observed in the Amazon occurred in the states of Pará, Mato Grosso, Amazonas, and Rondônia [6]. Although PRODES and TerraClass are large projects and provide very reliable data, they still rely on a significant portion of work done by human intervention. In order to speed up and make the process more efficient, several studies have been done using machine learning tools [7]–[18].
The works of [7] and [17] employed unsupervised classification to segment remote sensing images. In [17], the authors used Bayesian Updating of Land Cover: Unsupervised (BULC-U) and obtained an accuracy of 80%. In [7], the authors used K-Nearest Neighbors (KNN) and obtained an accuracy of 75.95%.
According to [19], from the first decade of this century, the neural network was reborn under the new label known as deep learning. Deep learning makes use of computational models with hierarchical architectures composed of multiple processing layers, in order to ''learn'' representations of data in very different formats: audio, images, and text [20]. According to [19], the large amount of data available in recent years, together with the increase in computational power, allowed the use of deeper architectures that were not previously possible.
The works of [9], [10], [18] used deep learning to detect deforested areas in regions of the Brazilian Amazon. The results obtained by these authors were quite satisfactory. However, the authors evaluated a specific region of the Amazon. Another criticism we make of the previously published studies is that they have not made available the dataset used for benchmarking.
A major challenge in training CNNs with remote sensing images is that the classes are unbalanced: in a region captured by a satellite image, there is a great imbalance among the areas occupied by the various land types. During CNN training, this imbalance can lead optimization methods to perform better on the most frequent classes. This work proposes a method to address the difficulties in the segmentation of land use and land cover in different regions of the Amazon using CNNs. To circumvent the imbalance problem, the mosaic image technique is proposed in this work for training CNNs in remote sensing applications. In this technique, small rectangular samples of agriculture, forest, and pasture are extracted from satellite images. From these samples, a large image is created with almost the same number of small rectangles of forest, agriculture, and pasture.
Another contribution of this study is that it boosts the results obtained with CNN in soil classification by fusing its output with the output of a segmentation algorithm of homogeneous regions. The fusion results outperform the results obtained only with CNN.
Therefore, the main contributions of this paper are the following:
1) Proposing customized CNN architectures for semantic segmentation of different types of land use and land cover in different regions of the Amazon.
2) Proposing a training, validation, and testing methodology for CNNs in the remote sensing area, using the mosaic image concept.
3) Comparing the results of the proposed CNN architectures, trained from scratch, with the results of transfer learning architectures.
4) Proposing a fusion technique that combines CNN output data with data from a segmentation algorithm of homogeneous regions to improve soil classification in remote sensing applications.
5) Making available a dataset of land use/cover types (forest, agriculture, and pasture) in the Amazon region, which may serve as a base for other research in the region.

II. LITERATURE REVIEW
In [7], [16], [17], the authors carried out the classification of images in different regions of the state of Mato Grosso, Brazil. The classes used by the authors were agriculture, forest, and pasture. In [7], the authors used LANDSAT-5/TM images as the input dataset. The training and testing datasets were built from information from the TerraClass 2008 project. The authors used the following classifiers: Decision Tree (DT), Naive Bayes (NB), KNN, Support Vector Machine (SVM), and Optimum-Path Forest (OPF). The best results were obtained using the SVM classifier in the region corresponding to image 227/68 (LANDSAT-5/TM); for this region, an accuracy of 86.04% was obtained. In [16], the authors used LANDSAT-8/OLI and PROBA-V C1 images. The classifier used was random forest, and the accuracy obtained was 88%. In [17], the region of the state of Mato Grosso evaluated by the authors included the municipalities of Alto Boa Vista, Querência, Ribeirão Cascalheira, and Canarana. The images used were from LANDSAT-5, and data from the GlobCover project of 2009 were used as the gold standard. The authors used the unsupervised classifier BULC-U, and the accuracy obtained was 80%.

In [8], the authors used random forests to classify regions of the state of Rondônia into the following classes: deforestation, degradation, and forest. The authors used LANDSAT-5/TM and LANDSAT-8/OLI images. The accuracy obtained was 85%.
In [11], [12], [15], the authors employed threshold segmentation for remote sensing image classification. In [11], the authors evaluated a region of the state of Mato Grosso with the objective of classifying deforestation and environmental degradation. The region assessed corresponds to an area of scene 226/68 of LANDSAT. LANDSAT-5/TM images from 2005 to 2011 and LANDSAT-8/OLI images from 2013 to 2017 were used. The accuracy obtained by the authors was 94.93%. In [12], the authors used LANDSAT-8/OLI and Sentinel-2 images to classify selective logging in a region of southern Amazonas state, near the village of Santo Antonio do Matupi. The accuracy obtained by the authors was 95.7% for the LANDSAT-8/OLI images and 96.7% for the Sentinel-2 images. In [15], the authors carried out the classification of images in the region of the Carajás mines, located in the south of the state of Pará. The classes defined by the authors were cerrado, forest, water, and mining area. The authors used IKONOS, GeoEye, and WorldView images, and the gold standard was generated by the authors themselves through visual interpretation. The accuracy obtained by the authors was 91%.
In [18], the authors used CNN to perform the classification of deforested areas in a region of southern Pará state. The authors used Landsat-8/OLI images and PRODES data as the gold standard. Using CNN U-NET, the authors obtained an accuracy of about 95%.
In [9], the authors mapped deforestation between pairs of images acquired one year apart, between 2017 and 2018 and between 2018 and 2019, using a deep learning methodology. The images used were scenes 227/63, 227/65, and 230/65 from LANDSAT-8/OLI, which correspond to regions of the states of Pará and Amazonas, Brazil. The deforestation reference data were extracted from the PRODES project. The authors used three CNN architectures available in the literature: SharpMask, U-Net, and ResUnet, and two classical machine learning algorithms: random forest (RF) and multi-layer perceptron (MLP). The best results, obtained with the ResUnet CNN, had an accuracy and an F1-Score of 99.93% and 94.65%, respectively.
In [10], the authors evaluated deep learning strategies for automatic deforestation detection. The experiments used two Landsat-8/OLI images, acquired at different dates. The first study area corresponds to a small region of the Amazon biome, located, more specifically, in the state of Pará, Brazil. The state of Pará comprises 26% of the Brazilian Amazon, and most of it is covered by dense tropical forest. The reference data used in this experiment refer to the deforestation that occurred between August 2016 and July 2017. The second study area is a small region of the Brazilian Cerrado biome, located in the state of Maranhão, Brazil. For this case, the deforestation took place in the years 2017 and 2018. The authors obtained an accuracy and F1-Score of 95% and 63%, respectively, in the Amazon, and 97% and 78%, respectively, in the Cerrado. The results obtained in the Cerrado were higher than those obtained in the Amazon. According to the authors, this is due to the pattern of deforestation in the Cerrado biome, which removes all vegetation, in contrast with the Amazon, where vegetation is not completely removed.
The papers presented in this review showed that machine learning techniques have great potential to classify remote sensing images and to perform automatic segmentation of the most varied types of vegetation cover in the Amazon region. In [9], [10], [18], the authors used deep learning techniques to classify deforestation regions in the Amazon, and the results achieved were quite satisfactory. However, each work evaluated only a specific region of the Amazon. Another criticism we make of the previously published works is that they have not made available the datasets used for benchmarking, which limits the development of solutions that generalize to real-world scenarios. No papers were found that presented an adequate deep learning model for land use/cover classification for the Amazon region encompassing the many varieties of existing environments.
In this work we make available the dataset used, which covers different areas of the Amazon. These areas cover the Brazilian states of Amazonas, Mato Grosso, Pará, and Rondônia.

III. MATERIALS AND METHODS
This work aims to propose a method to address the difficulties in the segmentation of land use and land cover in different regions of the Amazon. Figure 1 shows a flowchart of the methodology used for segmentation of land use and land cover. As shown, there are two branches: the left one corresponds to CNN processing and the right one to segmentation into homogeneous regions. In the first block of the left branch, band selection is performed. Then, in the semantic segmentation block, the CNN identifies the different soil coverages. In the right branch, we perform the segmentation of homogeneous regions. At the end, uniting the two branches, we have the fusion block. In the following sections, each block of this flowchart is explained in detail.

A. LANDSAT IMAGE
In this work, LANDSAT-8/OLI images from the Amazon region, from 2014, were used. These images are available for free at [25]. As shown in Table 1, the images used cover the Brazilian states of Amazonas, Mato Grosso, Pará and Rondônia.
For generating the gold-standard images for the classifier, the results of the TerraClass project, from 2014, were used. The data were acquired free of charge on the INPE website [4]. The data generated in the TerraClass project delimit the regions of the Brazilian Amazon in the following classes: forest, agriculture, pasture, unobserved area, urban area, mining, others, non-forest, and hydrography. For this work, information on forest, pasture, and agriculture areas was used, corresponding to a problem with three classes. The classes of unobserved area, urban area, mining, and others were not considered because they represent a very small percentage of the area, and it would not be possible to extract a significant number of samples to be used in the process of semantic classification with CNNs. The non-forest and hydrography areas were not used because they are part of an INPE exclusion mask; thus, these regions are not considered in the systematic mapping.
B. MOSAIC IMAGES

Using the reference data from the TerraClass project for the year 2014, 40 × 40 pixel-sized patches were generated for each class. Figure 2 presents three examples of these patches, corresponding to forest, pasture, and agriculture. Using the LANDSAT-8/OLI images, 4,000 agriculture patches, 225,000 forest patches, and 6,000 pasture patches were generated. These patches were divided into three sets: training, validation, and testing, as shown in Table 2.
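The patch-generation step described above can be sketched as follows. This is an illustrative Python/NumPy sketch, not the authors' implementation: the array layout (a band stack co-registered with a TerraClass label raster) and the criterion that a patch must be entirely of one class are assumptions of this example.

```python
import numpy as np

PATCH = 40  # patch side in pixels, as described in the text

def extract_uniform_patches(bands, labels, class_id, stride=PATCH):
    """Collect PATCH x PATCH windows whose reference labels are entirely
    `class_id`. `bands` is an (H, W, B) band stack; `labels` is (H, W)."""
    patches = []
    h, w = labels.shape
    for r in range(0, h - PATCH + 1, stride):
        for c in range(0, w - PATCH + 1, stride):
            window = labels[r:r + PATCH, c:c + PATCH]
            if np.all(window == class_id):  # keep only pure-class windows
                patches.append(bands[r:r + PATCH, c:c + PATCH, :])
    return patches
```

Windows that mix classes are simply discarded, which matches the idea of building class-pure training samples.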
Using the patches previously described, large images, defined as mosaic images, are generated. Each mosaic image has dimensions of 400 × 400 pixels and is formed by 100 patches. Each unit of the mosaic image is randomly generated selecting a patch of agriculture, forest, or pasture. Each mosaic image has a gold standard image. For building the gold standard image, the pixels corresponding to forest regions were marked with the value 255, the pixels corresponding to pastures region were marked with the value 100, and the pixels corresponding to agriculture regions were marked with the value 1. Figure 3 shows an example of a mosaic image and the corresponding gold standard. It can be observed that, out of the 100 patches, 34 are from agriculture, 35 from forest, and 31 from pasture regions.
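The mosaic-building procedure can be sketched as below, assuming per-class patch lists are already available. The gold-standard codes (1 = agriculture, 100 = pasture, 255 = forest) follow the text; the function and argument names are hypothetical.

```python
import numpy as np

GOLD = {"agriculture": 1, "pasture": 100, "forest": 255}  # codes from the text

def build_mosaic(patches_by_class, grid=10, patch=40, bands=6, rng=None):
    """Assemble a (grid*patch) x (grid*patch) mosaic by drawing one random
    patch per cell, plus the matching single-band gold-standard image."""
    if rng is None:
        rng = np.random.default_rng()
    side = grid * patch
    mosaic = np.zeros((side, side, bands), dtype=np.float32)
    gold = np.zeros((side, side), dtype=np.uint8)
    classes = list(patches_by_class)
    for i in range(grid):
        for j in range(grid):
            cls = classes[rng.integers(len(classes))]       # random class
            pool = patches_by_class[cls]
            p = pool[rng.integers(len(pool))]               # random patch
            r, c = i * patch, j * patch
            mosaic[r:r + patch, c:c + patch, :] = p
            gold[r:r + patch, c:c + patch] = GOLD[cls]
    return mosaic, gold
```

Drawing the class uniformly per cell yields roughly equal counts of forest, pasture, and agriculture patches, which is the balancing effect the mosaic technique is designed to achieve.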
Two versions of the image dataset were generated, with 4,000 mosaic images each. For each version, 70% of the images (2,800) were assigned to the training set, 15% (600) to the validation set, and 15% (600) to the test set. The first version consists of six-band images: B2, B3, B4, B5, B6, and B7. The second version is composed of three-band images: B4, B5, and B6. Table 3 presents the image datasets used in this work.
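Producing the two dataset versions amounts to simple indexing of a stacked array. In this sketch, the band-to-index mapping (B2 through B7 stacked in order) is an assumption of the example, not something stated in the text.

```python
import numpy as np

# Assumed stacking order of the Landsat-8/OLI bands used in this work:
# index 0 = B2, 1 = B3, 2 = B4, 3 = B5, 4 = B6, 5 = B7.
BAND_INDEX = {"B2": 0, "B3": 1, "B4": 2, "B5": 3, "B6": 4, "B7": 5}

def select_bands(stack, names):
    """Return the requested band subset from an (H, W, 6) stack."""
    return stack[:, :, [BAND_INDEX[n] for n in names]]
```

With this helper, `select_bands(stack, ["B4", "B5", "B6"])` yields the three-band version and the full stack is the six-band version.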

C. CNN SEMANTIC SEGMENTATION
The three CNN architectures (Figure 4) used in this work were based on the architecture proposed by [28], in which the authors performed lumen segmentation in intravascular optical coherence tomography (IVOCT) images.
As shown in Figure 4, CNN1 has four sub-sampling steps (max-pooling). Before and after the first sub-sampling step, there are three sequences of 3 × 3 convolutional layer, batch normalization layer, and ReLU. After the second, third, and fourth sub-sampling steps, there are four sequences of 3 × 3 convolutional layers. In addition, CNN1 is composed of four up-sampling steps. CNN2 has two sub-sampling steps (max-pooling) and two up-sampling steps. Before each sub-sampling step, there are three sequences of 3 × 3 convolutional layer, batch normalization layer, and ReLU. Finally, CNN3, like CNN2, also has two sub-sampling steps (max-pooling) and two up-sampling steps; however, before each sub-sampling step there is only one sequence of 3 × 3 convolutional layer, batch normalization layer, and ReLU.

The optimization methods evaluated were SGDM, ADAM, and RMSProp. For each of these methods, the following options were used to improve generalization: none, dropout, L2 regularization, and dropout with L2 regularization. Mosaic images were used as the input dataset. Two groups of simulations were performed, each comprising 36 simulations (3 architectures × 3 optimization methods × 4 generalization options). In the first group, six-band mosaic images were used as CNN inputs, while in the second group, three-band mosaic images were used.
For each dataset version, the model (architecture + optimization method + generalization improvement method) with the best performance, evaluated in terms of accuracy on the validation set, was selected.
The CNN performances were compared with the performance of pre-trained networks, using transfer learning with shallow training. The pre-trained networks used in this work were ResNet50, InceptionResNetv2, MobileNetv2, and Xception, combined with the SGDM, ADAM, and RMSProp optimization methods, with six-band mosaic images as input. A total of twelve experiments were performed.
A Dell Precision 5820 Workstation with an Intel Xeon W-2102 2.9 GHz CPU, 64 GB RAM, Windows 10, Matlab 2019a, and an NVIDIA Quadro GV100 32 GB GPU was used in the experiments. The following parameters were used for the CNN training: initial learning rate = 0.001, learning rate drop factor = 0.5, number of epochs = 200, batch size = 2, dropout layer parameter = 0.3, and L2 regularization factor = 0.001.
In this work, the CNNs were trained using mosaic images consisting of 40 × 40-pixel patches, which tend to generate output images cut into patches of the same size. To neutralize this tendency and improve the classification/segmentation, a fusion process using the CNN output and a multi-resolution homogeneous region segmentation algorithm was used. This algorithm and the fusion process are described in the following subsections.

D. SEGMENTATION OF HOMOGENEOUS REGIONS
The segmentation of homogeneous regions consists of partitioning the image into regions of homogeneous pixel values. In this work, the multi-resolution segmentation algorithm proposed by [30] was used, which performs a general segmentation based on homogeneity definitions combined with local and global optimization techniques. A scale parameter is used to control the average size of the image objects. According to the authors, multi-resolution segmentation is a region-merging technique that starts with one-pixel objects; in several subsequent steps, smaller image objects are merged into larger objects.
Throughout this pairwise grouping process, the underlying optimization procedure minimizes the heterogeneity of the resulting image objects. At each step, the pair of adjacent image objects whose merge results in the smallest growth of the defined heterogeneity is merged. If the smallest growth exceeds the limit defined by the scale parameter, the process stops. In this sense, multi-resolution segmentation is a local optimization procedure. In this work, the algorithm was applied using the eCognition software, with a scale factor of 200.
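The merging procedure can be illustrated with a much-simplified sketch of the same idea (not the eCognition implementation): regions start as single pixels, the adjacent pair whose merge least increases the total squared deviation from the region mean is merged, and the process stops when that increase exceeds the scale parameter. The sketch is written for a single-band image, and the heterogeneity measure here deliberately ignores the shape criteria of the original algorithm.

```python
import numpy as np
from itertools import product

def region_merge(img, scale):
    """Greedy pairwise region merging: start from one-pixel regions and
    repeatedly merge the 4-adjacent pair whose merge increases the total
    squared deviation the least; stop when that increase exceeds `scale`."""
    h, w = img.shape
    labels = np.arange(h * w).reshape(h, w)
    # per-region statistics: (pixel count, sum, sum of squares)
    stats = {int(i): (1, float(v), float(v) ** 2)
             for i, v in zip(labels.ravel(), img.ravel())}

    def sse(n, s, q):  # sum of squared deviations from the region mean
        return q - s * s / n

    while True:
        pairs = set()  # current adjacent region pairs (4-connectivity)
        for r, c in product(range(h), range(w)):
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < h and cc < w and labels[r, c] != labels[rr, cc]:
                    pairs.add(tuple(sorted((int(labels[r, c]),
                                            int(labels[rr, cc])))))
        if not pairs:
            break
        best, cost = None, None
        for a, b in pairs:
            na, sa, qa = stats[a]
            nb, sb, qb = stats[b]
            grow = (sse(na + nb, sa + sb, qa + qb)
                    - sse(na, sa, qa) - sse(nb, sb, qb))
            if cost is None or grow < cost:
                best, cost = (a, b), grow
        if cost > scale:  # smallest growth exceeds the scale parameter
            break
        a, b = best
        na, sa, qa = stats[a]
        nb, sb, qb = stats.pop(b)
        stats[a] = (na + nb, sa + sb, qa + qb)
        labels[labels == b] = a
    return labels
```

On a tiny two-valued image, a small scale keeps the two homogeneous areas separate, while a very large scale merges everything into one region, which is the qualitative behavior of the scale parameter described above.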

E. DATA FUSION
The data fusion consists of combining the image resulting from the multi-resolution algorithm and the image resulting from the semantic classification made by the CNN model. The data fusion used in this work is illustrated in the flowchart presented in Figure 5. For each region generated by the homogeneous region segmentation algorithm, we evaluate the pixel classification produced by the CNN; the most frequent class among the pixels of a region is assigned to that region. Figure 6 illustrates this process: Figure 6(a) shows an original image, and Figure 6(b) shows the result of the homogeneous region algorithm applied to Figure 6(a). In Figure 6(b), inside a black rectangle, three small neighboring regions are selected and magnified on the right side. These selected regions are filled with a hypothetical pixel classification by the CNN. The upper region has more white pixels, so it is classified as forest. The central region has more black pixels, so it is classified as agriculture, and the lower region has more gray pixels, so it is classified as pasture.
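The fusion rule itself reduces to a per-region majority vote. A minimal sketch, assuming a CNN class map and a region-label map of the same shape (the function name is hypothetical):

```python
import numpy as np

def fuse(cnn_classes, region_labels):
    """Assign to every homogeneous region the most frequent CNN class
    among its pixels (majority vote), as in the fusion step above."""
    fused = np.empty_like(cnn_classes)
    for lab in np.unique(region_labels):
        mask = region_labels == lab
        values, counts = np.unique(cnn_classes[mask], return_counts=True)
        fused[mask] = values[np.argmax(counts)]  # majority class of region
    return fused
```

Pixels the CNN misclassified inside an otherwise well-classified region are overwritten by the region's majority class, which is how the fusion reduces the patch-shaped errors mentioned earlier.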

IV. RESULTS

A. CNN SIMULATION
Tables 4 and 7 present the performance metrics of the three CNN architectures proposed in this work using the six-band and three-band mosaic image datasets, respectively. For the six-band dataset, the model with the best performance was the one in which CNN1 was used together with the RMSProp optimization method and without methods for improving generalization; this model obtained a global accuracy (GA) of 99.98%. For the three-band dataset, the model with the best performance was the one in which CNN1 was used together with the RMSProp optimization method and the dropout method; this model obtained a GA of 99.92%.

Tables 5 and 8 present the confusion matrices of the best CNN model for the six-band and three-band datasets, respectively. From Tables 5 and 8, we observe that these models presented a greater classification error in regions that the gold standard identified as forest but that were classified as pasture. These CNN models were evaluated with a 5-fold cross-validation methodology, using the training and test datasets. Tables 6 and 9 present the performance obtained in this experiment for the six-band and three-band datasets, respectively: GAs of 99.97% and 99.86% were obtained for the six- and three-band datasets.

Table 10 presents the results obtained using pre-trained networks. The best performance was achieved using the ResNet50 network and the ADAM method, with a GA of 94.50%. However, this performance was inferior to that of the CNN1 model with the RMSProp optimization method, which presented a GA of 99.98%.
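For reference, the global accuracy (GA) and per-class F1-Scores reported here can be computed directly from a confusion matrix, as in this sketch. The row/column convention (rows = gold standard, columns = predictions) is an assumption of the example.

```python
import numpy as np

def metrics(cm):
    """Global accuracy and per-class F1 from a confusion matrix whose
    rows are reference (gold-standard) classes and columns predictions."""
    cm = np.asarray(cm, dtype=float)
    ga = np.trace(cm) / cm.sum()           # global accuracy
    tp = np.diag(cm)                       # true positives per class
    precision = tp / cm.sum(axis=0)        # column sums = predicted totals
    recall = tp / cm.sum(axis=1)           # row sums = reference totals
    f1 = 2 * precision * recall / (precision + recall)
    return ga, f1
```

This makes explicit why GA can be very high while a per-class F1 stays low when one class dominates the image, which is the imbalance issue the mosaic technique addresses.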

C. CLASSIFICATION/SEGMENTATION OF IMAGES FROM AMAZON REGIONS
To evaluate the methodology proposed in this work, some images from the Amazon region were selected for segmentation/classification. These images were classified using the model CNN2 + ADAM optimization method + L2 regularization and the fusion process described in Sections III-D and III-E. Figure 7 shows an example of the classification of a LANDSAT-8/OLI image using this process. As shown in Figure 7(c), in the result of the CNN classification, some regions are incorrectly classified as forest. In Figure 7(e), which shows the result of the fusion process, these misclassifications are minimized. The accuracy obtained using only CNN semantic segmentation was 96.84%; after the data fusion process, the accuracy increased to 99.81%.
We can observe that the proposed methodology presented impressive performances. The best results were obtained using the CNN2 model, the ADAM optimization method, and L2 regularization for improving generalization. The data fusion process increased the image classification accuracy, as shown in Figure 7. Table 11 presents a comparison of the image segmentation performance for Figure 7 using only the CNN2/ADAM/L2 model and using the data fusion process.
In Figure 8, we show two LANDSAT-8/OLI images of Amazon regions with their respective gold standards and the images classified using the CNN2 + ADAM optimization method + L2 regularization and the data fusion process. Image 1 corresponds to a region of scene 228/67, and image 2 corresponds to a region of scene 226/68.

The methodology presented in this work was also evaluated in the regions studied by [9] and [10]. Two datasets with 2,000 mosaic images each were assembled using the images evaluated by [9] and [10], respectively. Table 12 shows the images used for the construction of each dataset and the division of the sets for training and validation. Each mosaic image is 400 × 400 pixels in size and is made up of 20 × 20-pixel patches, each corresponding to a region of deforestation or non-deforestation. Figures 9 and 10 show examples of mosaic images and the corresponding gold standards for datasets 1 and 2, respectively. Table 13 presents the performance of the CNN models for each of these datasets. The best model of each CNN presented in Table 6 was used.

FIGURE 9. Example of a mosaic image for the region used by [9]. In (a), the mosaic image is displayed in color composition B6-B5-B4 and in (b) the corresponding gold standard.

FIGURE 10. Example of a mosaic image for the region used by [10]. In (a), the mosaic image is displayed in color composition B6-B5-B4 and in (b) the corresponding gold standard.
The CNN3/RMSProp/L2 model and the data fusion process were used to classify the images presented in Figures 11 and 12. Figure 11 corresponds to a region evaluated by [9]. For this image, the accuracy obtained was 99.91% and the F1-Score was 99.94%. The accuracy value was very similar to the value obtained by [9], and the F1-Score was higher than the value obtained by the authors, which was 94.65%. Figure 12 corresponds to a region evaluated by [10]. For this image, the accuracy obtained was 99.94% and the F1-Score was 99.92%. The accuracy value was higher than the best result obtained by [10], which was 95%, and the F1-Score was also higher than the one obtained by the authors, which was 63%.

V. CONCLUSION
This study proposed a methodology using deep learning to address the difficulties in the segmentation of land use and land cover in the Amazon region for the pasture, agriculture, and forest classes. The proposed methodology consisted of a data fusion process combining semantic segmentation using a CNN and a homogeneous region algorithm. LANDSAT-8/OLI optical images of the Brazilian Amazon region were used to validate the proposed methodology. The reference data used were the results obtained from the TerraClass project of INPE in 2014. The CNN architectures used in this study were based on the architecture proposed by [28]. Two datasets were evaluated: the first using six bands and the second using three bands. The results for the six-band dataset outperform those obtained with the three-band dataset.
In this study, the CNNs were trained using mosaic images consisting of 40 × 40-pixel patches, which tend to generate output images cut into patches of the same size. To neutralize this tendency and improve the classification/segmentation, a fusion process using the CNN output and a multi-resolution homogeneous region segmentation algorithm was used.
The proposed method was compared with some pre-trained CNN architectures using transfer learning. The comparison demonstrated the effectiveness of our method against these well-known CNN architectures in remote sensing applications, and the proposed method proved to be more appropriate for this specific application. For the same region evaluated by [9], the results were practically the same; regarding the results obtained by [10], our method obtained better results.
The main contributions of this work to soil classification in remote sensing applications were twofold. The first was the proposal of the mosaic image technique for training CNN architectures. The second was a fusion technique to improve soil classification, using two types of data: the CNN output image and the output image of a homogeneous region segmentation algorithm. Another contribution is making the datasets used in this work available, which may serve as a base for other research in the Amazon region; data will be available upon request. Given the results presented, we conclude that the methodology proposed in this work proved to be quite efficient in performing the task of segmentation/classification of remote sensing images for regions of the Amazon.
For future studies, we propose using more varied classes of land use, in addition to using other types of remote sensing images, for example, synthetic aperture radar (SAR) images.