Mapping land cover on Reunion Island in 2017 using satellite imagery and geospatial ground data

We here present a reference database and three land use maps produced in 2017 over the Reunion island using a machine learning based methodology. These maps are the result of a satellite image analysis performed using the Moringa land cover processing chain developed in our laboratory. The input dataset for map production consists of a single very high spatial resolution Pleiades images, a time series of Sentinel-2 and Landsat-8 images, a Digital Terrain Model (DTM) and the aforementioned reference database. The Moringa chain adopts an object based approach: the Pleiades image provides spatial accuracy with the delineation of land samples via a segmentation process, the time series provides information on landscape and vegetation dynamics, the DTM provides information on topography and the reference database provides annotated samples (6256 polygons) for the supervised classification process and the validation of the results. The three land use maps follow a hierarchical nomenclature ranging from 4 classes for the least detailed level to 34 classes for the most detailed one. The validation of these maps shows a good quality of the results with overall accuracy rates ranging from 86% to 97%. The maps are freely accessible and used by researchers, land managers (State services and local authorities) and also private companies.


a b s t r a c t
We here present a reference database and three land use maps produced in 2017 over the Reunion island using a machine learning based methodology. These maps are the result of a satellite image analysis performed using the Moringa land cover processing chain developed in our laboratory. The input dataset for map production consists of a single very high spatial resolution Pleiades images, a time series of Sentinel-2 and Landsat-8 images, a Digital Terrain Model (DTM) and the aforementioned reference database. The Moringa chain adopts an object based approach: the Pleiades image provides spatial accuracy with the delineation of land samples via a segmentation process, the time series provides information on landscape and vegetation dynamics, the DTM provides information on topography and the reference database provides annotated samples (6256 polygons) for the supervised classification process and the validation of the results. The three land use maps follow a hierarchical nomenclature ranging from 4 classes for the least detailed level to 34 classes for the most detailed one. The validation of these maps shows a good quality of the results with overall accuracy rates ranging from 86% to 97%. The maps are freely accessible and used by researchers, land managers (State services and local authorities) and also private companies.

Data
The data described in this data paper are of two different types related to land use on Reunion Island for the year 2017: A reference database consisting of GIS vector dataset in ESRI shapefile format composed of 6256 polygons representative of the diversity of land use on Reunion Island. This data is used in a Specifications Table   Subject Computer Science, Earth Sciences, Social Sciences Specific subject area Remote Sensing, GIS, Land Cover Map Type of data Vector How data were acquired The reference database was created with the QGIS software (www.qgis.org) For the production of land use maps, the Moringa processing chain uses the Orfeo ToolBox software (www.orfeo-toolbox.org) with scripts written in Python language. The source code of the Moringa processing chain is available at https://gitlab.irstea.fr/ raffaele.gaetano/moringa.git Data format Raw data (Shapefile, Esri) Parameters for data collection For the construction of the reference database, the objective of the choice of plots was (i) a good representativeness of the class and (ii) a homogeneous distribution over the territory Description of data collection To create the reference database, we used existing IGN databases but also field surveys acquired with a GPS and the knowledge of professionals specialized in their field. For the land use maps, we used a supervised classification method of satellite images (Sentinel2, Landsat8 and Pleiades) based on the Random Forest algorithm driven by the reference database mentioned above. We produced the three land use maps using the reference database and satellite image classifications as described below. Value of the Data -the referenced land cover maps provide an unprecedented overview of the entire territory of Reunion Island, with a significant potential impact in various tasks related to land, agriculture and environmental monitoring. -These maps can be used in GIS to monitor changes in the territory and help managers make decisions about urbanization, natural and agricultural land management. -The reference database can be used by remote sensing specialists to assess new methods for land cover mapping and other classification algorithms. -All data provided is georeferenced and in vector format for use in GIS tools in future projects. supervised classification process to train an algorithm to recognize land use classes on a set of variables from high and very high spatial resolution satellite images. A part of the polygons (20%) is dedicated to the validation of classification results (Cf. Fig. 1); Three land use maps produced starting from a dataset including a Very High Spatial Resolution (VHRS) Pl eiades image, a time series of HRS Sentinel-2 and Landsat-8 images and a digital terrain model. These maps correspond to three levels of land use nomenclature (from 4 to 34 classes) and are distributed in vector format (shapefile). Each geometry corresponds to an object provided by the segmentation of the Pleiades image, attributed using reflectances, radiometric and textural indices computed on the different remote sensing images available plus topographic information (altitude and slope). Such objects are individually classified using a supervised classification algorithm trained using the reference database described above. The validation of the maps gives overall accuracies ranging from 86% for the most detailed level to 98% for the least detailed level. These maps were produced as part of the GABIR project, for which we needed precise information on crop location. The nomenclature is therefore detailed on this type of land use but we have also detailed the natural areas so that these maps can be used in different themes. The three maps produced are illustrated in Figs. 2e4.
Final maps in ESRI shapefile format are delivered in the local UTM projection (WGS 84 UTM 40 South, EPSG code 32740).
Data are referenced in the CIRAD Dataverse and are hosted on CIRAD's Aware Geographic Catalog (http://aware.cirad.fr). The advantage of distributing geographical data with this system is that it is possible to simply visualize them, to use them directly on GIS software with web services (WFS: Web Feature Service or WMS: Web Map Service) or simply download them.

Reference database and nomenclature
The reference database is organized according to a multi-level nomenclature (Cf. Table 1). For class representativeness purposes it is important that the number of polygons in each class meets the minimum requirement of 60 entities at the most detailed level of the nomenclature.
To build this database we use data disseminated by the French Institute of Geography (IGN) and available on condition of eligibility. These are the products: RGE Alti®, "BD Topo®" and "RPG". They are available on this website: http://professionnels.ign.fr. The reference database for 2017 consists of 6256 plots. We will give a brief description of the sources and techniques used to build it according to the various land use groups:   For agricultural areas, we have information on certain cultivated plots. These are the declarations made by farmers to apply for EU subsidies: the Registre Parcellaire Graphique (RPG, the French Land Parcel Identification System). This data is disseminated in France by the French Institute for Geographical and Forestry Information (IGN). The description of this data is available here (in French): http://professionnels.ign.fr/doc/DC_DL_RPG-2-0.pdf. These vector data are of good quality and can be used as a model to locate crops. The release times imply that we use the RPG for year N À1. It is therefore necessary to check the correct consistency of the data by photo-interpretation of the VHSR image. The RPG provides little information on tree crops. Therefore, for these classes we have called on colleagues specialised in mango, lychee and citrus fruit cultivation who are familiar with their area and can locate plots in the VHSR image. Field surveys were conducted using GPS for market gardening crops. The plots of the "greenhouse or shade cultivation" class are derived from the "industrial building" layer of the IGN's "BD TOPO" product, and selected among different height levels to keep a sufficient diversity of greenhouse types. Each of the polygons was verified by photointerpretation using the Pleiades image. If the greenhouse or shade was not visible in the image, the polygon was deleted. For natural areas, there is no regularly updated mapping, but the main classes can be recognized from the GIS layers of the State services that manage these areas (ONF: National Forestry Office and DEAL: Regional Office of the Ministry of Environment, Development and Housing). The validity date of this data is not known. We therefore checked the coherence of each selected polygon against the VHSR image. Two specific classes have been added (identified by photo-interpretation) to address the problems of satellite images: a class of shadows due to the island's steep terrain (areas which are never visible because of shades) and a class of vegetation located on steep slopes facing the morning sun called "rampart moor". For water and wet spaces, the "marsh", "water" and "hillside retention" classes were obtained by photo-interpretation of the 2017 Pleiades image. These classes are easily recognizable on this type of image.
For urban spaces, we randomly selected polygons from the IGN's "BD TOPO" product. For the housing type building, four building height classes have previously been created (depending on the height of the layer field) in order to preserve a good diversity of the types of buildings present on the island. A random selection of polygons from each class was then made. The "built" layer was completed by a random selection of industrial buildings from the "industrial building" layer of the IGN's "BD TOPO" product. This selection was made in the "nature" field of the layer (i.e. the following types: silo, industrial and livestock). These different polygons were finally merged into a single built up class grouping all types of buildings while maintaining a good representativeness.
The "photovoltaic panel" class was obtained by photo-interpretation of the polygons on the 2017 Pleiades image The contours of each polygon have been checked to ensure that they are perfectly overlapping on the THRS image used for the classification process and slightly eroded (negative buffer) in order to limit edge effects in the training process.
The reference database was used in the works described in Refs. [1,2].

Images
➢ Very High Spatial Resolution (VHSR): We use VHSR images acquired in the framework of the Kalideos project, driven by CNES (Centre National d'Etudes Spatiales: government agency responsible for shaping and implementing France's space policy in Europe). Pleiades images are not free and are available under the condition of eligibility via the Kalideos project website (https://www.kalideos.fr). In 2017, we dispose of 15 Pleiades scenes covering the whole island. Several zones are acquired multiple times because of cloud coverage. Pre-processed images are provided: radiometric correction to Top-Of-Atmosphere (TOA) level, precise orthorectification using high quality IGN's Pl eiades database as a spatial reference (see Fig. 7, image mosaic).

➢ High Spatial Resolution (HSR):
The Sentinel 2A and 2B satellites (S2A and S2B) have been deployed by the European Space Agency (ESA). The images offer 13 spectral bands with a spatial resolution between 10 m and 60 m. We only keep the 10 bands with a resolution of 10 m and 20 m. The interval between two subsequent acquisitions is 5 days considering both satellites. We use the Sentinel-2 (S2) level-2 images provided by THEIA (download site: https://theia.cnes.fr). These images are corrected for atmospheric effects (Top-Of-Canopy, TOC) and have a cloud mask calculated by the MAJA algorithm [3]. The first image of the Sentinel-2B satellite was acquired on Reunion Island on July 6, 2017. The satellite was put into operational service in August 2017. So from August we have an S-2 image every 5 days.
The Landsat-8 (L8) satellite was deployed by NASA (download site: https://earthexplorer.usgs.gov). The revisiting frequency is 16 days. L8 images have a spatial resolution of 15 m for the panchromatic band and 30 m for the multispectral bands. We use TOA level radiometrically corrected images, and pre-process the whole reflectance stack to a 15 m spatial resolution using pansharpening.
The characteristics of the L8 and S2 images are different, but in tropical areas with high cloud cover, the combination of these sensors increases the probability of regularly observing the entire territory.
The time series consists of 64 images acquired between January 1 and December 31, 2017. That is 22 L8, 29 S2A and 13 S2B. The selection criteria for these images are: cover at least 20% of the study area and have less than 80% cloud cover per tile. Fig. 5 illustrates the finally selected images.

Topography
The Digital Terrain Model (DTM) used is the IGN's "RGE ALTI®" product with a resolution of 5 m. This product is obtained by combining several types of acquisition (LiDAR, radar and aerial photography). The production method is described in this document: (http://professionnels.ign.fr/doc/DC_ RGEALTI_2-0.pdf).

The Moringa processing chain
The TETIS laboratory is developing an automated soil classification method based on the Moringa chain that minimizes interactions with users by automating most image analysis and processing processes [4]. The Moringa chain can be downloaded with this link: https://gitlab.irstea.fr/raffaele. gaetano/moringa.
The methodology jointly uses a Spot6/7 or Pleiades Very High Spatial Resolution (VHSR) image, time series of Sentinel-2 and Landsat-8 High Spatial Resolution (HRS) optical images and a Digital Terrain Model (DTM) for Object Based Image Analysis (OBIA) classification (use of the Random Forest algorithm) driven by a learning database combining in situ and photo-interpretation measurements. The chain is built upon the Orfeo Tool Box (OTB) applications, orchestrated by python scripts. Some preprocessing steps are performed under QGiS. The main processes used by the chain are summarized in Fig. 6. In the following paragraphs, we will describe the elements that present specific parameters adapted to the production of the maps presented here and other details useful for a full understanding of the method used.

VHSR pre-processing
To combine the Pleiades images, we used the OTB mosaic tool described in the article [5] which allows radiometric equalization between the images. The assembly map and the image produced are illustrated in Fig. 7 which shows the areas covered by each image in the mosaic.

HSR pre-processing
Pre-processing applied to HRS images is automated in the Moringa chain. For S2, all bands are resampled at 10 m. For L8, pansharpening processing is applied to bring the spatial resolution as close as possible to S2 images and a cloud cover mask is calculated with the F-Mask tool [6].
An automatic co-registration onto the THRS image is performed automatically by a procedure based on homologous point extraction. This processing is conceived to improve overlapping among the different remote sensing dataset, and is mostly important for the characterization of smaller scale objects.
Cloud and shadow masks are then used to fill gaps due to cloud coverage into the time series by means of a multi-temporal interpolation [7]. The chain produces, from the cloud masks, an image illustrating the number of times a pixel is not covered by clouds in the time series. This illustration locates the areas where there is a risk of instability of the results on the maps if the number of images is low (Cf. Fig. 8).

The variables used in the classification
➢ We have selected the indices that we believe are most useful for extracting our classes of interest (Cf. Table 2) ➢ Textures are important to detect patterns visible on the THRS image such as tree alignments in tree crops. In the Moringa chain, these texture indices are the only variables derived from the VHSR image. We use the OTB "Haralick Texture Extraction" algorithm applied to the panchromatic image (Cf. Table 2) ➢ Slopes were also calculated on this data using QGIS software. DTM and slopes are used as variables in the classification process.

Object based classification
We use the Random Forest classification algorithm [14] which is adapted for heterogeneous data. This is our case since we use data from several sensors but also altitude, slopes and texture indices. RF is in our case applied to objects extracted in the image via a segmentation step. This step allows to extract objects corresponding to homogeneous groups of pixels in the THRS image. We use the method described in Ref. [15] with OTB's GenericRegionMerging (Large Scale version) application [16]. To obtain a segmentation result adapted to our study, we act on both the homogeneity criteria and the maximum heterogeneity threshold. We have chosen the following parameters: ➢ Scale parameter: 150 ➢ Weight parameter on the shape: 0.1 ➢ Weight parameter on compactness: 0.5 To apply RF to segmentation objects, zonal statistics are extracted for each object using the average of the pixels of each variable from the time series, textures, slopes and altitude (Cf. Table 2). We perform an independent training and classification for each level of nomenclature.

Validation
We here use the k-fold cross-validation technique to evaluate the accuracy of the provided land use maps. We decide to avoid using the classical hold out strategy (i.e., leaving a single subset aside for validation) because of the strong intra-class variability which can bias the evaluation. After dividing the field database into approximately equal k subsets (with k ¼ 5 in our case), a cross-validation is performed by training five Random Forest models with different combination of four out of five subsets and validating with using the fifth one, so that each subset is used once for validation and 4 times for training.
The final precision metrics (global accuracy, Kappa, fscore, etc.) are obtained by averaging the metrics obtained at each turn (fold) of the cross-validation. This process is performed for each level of the nomenclature.
➢ Overall accuracy: refers to the portion of well-classified area in relation to the total annotated surface. It does not provide information on the most reliable classes. ➢ The Kappa index (or KIA) takes into account the errors in the rows and columns of the confusion matrix. It expresses the reduction in error compared to that obtained by a classification that would be carried out randomly. It varies from 0 to 1. For example, a value of 0.75 expresses that the classification method used avoids 75% of the errors obtained by a procedure working completely at random. ➢ The F1-score is the harmonic mean of the user's and producer's accuracies given by the confusion matrix. These quality indicators are given in Table 3.

Post classification
➢ Smoothing by majority filter: We apply a majority filter to the classification to smooth out contours and remove isolated pixels. We use OTB's Classification Map Regularization tool. The size of the structuring element can be adjusted to measure the intensity of the smoothing. To limit the degradation of the classification, a filter of radius 1, corresponding to a 3 Â 3 pixel window, is chosen.
➢ Crossing with GIS layers: To make corrections to the classification results we use the SAFER SAUP layer which identifies the arable lands. Coupling the output of the classification with this layer makes it possible to switch all polygons detected as cropland outside arable land to natural areas.