Pan-tropical Sentinel-2 cloud-free annual composite datasets

Sentinel-2 MSI is one of the core missions of the Copernicus Earth Observation programme of the European Union. This mission shows great potential to map the regional high-resolution spatio-temporal dynamics of land use and land cover. In tropical regions, despite the high revisiting time of 5 days including both Sentinel-2A and 2B satellites, the frequent presence of clouds, cloud-shadows, haze and other atmospheric contaminants are precluding the visibility of the Earth surface up to several months. In this paper we present four annual pan-tropical cloud-free composites computed and exported from Google Earth Engine (GEE) by making use of available Sentinel-2 L1C collection for the period spanning from 2015 to 2020. We furthermore propose empirical approaches to reduce the BRDF effect over tropical forest areas by showing pros and cons of image-based versus swath-based methodologies. Additionally, we provide a dedicated web-platform offering a fast and intuitive way to browse and explore the proposed annual composites as well as layers of potential annual changes as a ready-to-use means to visually identify and verify degradation and deforestation activities as well as other land cover changes.


a b s t r a c t
Sentinel-2 MSI is one of the core missions of the Copernicus Earth Observation programme of the European Union. This mission shows great potential to map the regional highresolution spatio-temporal dynamics of land use and land cover. In tropical regions, despite the high revisiting time of 5 days including both Sentinel-2A and 2B satellites, the frequent presence of clouds, cloud-shadows, haze and other atmospheric contaminants are precluding the visibility of the Earth surface up to several months. In this paper we present four annual pan-tropical cloud-free composites computed and exported from Google Earth Engine (GEE) by making use of available Sentinel-2 L1C collection for the period spanning from 2015 to 2020. We furthermore propose empirical approaches to reduce the BRDF effect over tropical forest areas by showing pros and cons of image-based versus swath-based methodologies. Additionally, we provide a dedicated web-platform offering a fast and intuitive way to browse and explore the proposed annual composites as well as layers of potential annual changes as a ready-to-use means to visually identify and verify degradation and deforestation activities as well as other land cover changes.

Value of the Data
• The annual cloud-free pan-tropical composites facilitate the use of Sentinel-2 Multispectral Instrument (MSI) Level 1C imagery for national forestry services in tropical countries, especially in relation to Reducing Emissions from Deforestation and Forest Degradation (REDD + ) activities and the United Nations Restoration Declaration Decade (2021-2030) agenda on diversity restoration. • Anyone aiming at large-scale land cover/land use change mapping (LCLUC) including photointerpretation, automatic classification and validation can benefit from the pre-computed annual composites and change layer. In particular, the following stakeholder groups would be the main users: the European Commission, conservationists, policy makers, financiers, donors, international development agencies including national governments, stakeholders of the research and scientific community, non-governmental organizations (NGOs) as well as commercial companies. • The annual composites can be used for automatic classification, feature extraction, visual inspection, validation, change detection, computation of NDVI, NBR, NDWI and any other spectral indexes based on RED, NIR and SWIR1 bands.
• The pre-computed composites derived from Sentinel-2 L1C imagery are accessible, free of charge, throughout a dedicated web-portal as well as via WMS services to be easily integrated into other websites or desktop GIS software. With a 20 m resolution in the RED, NIR, SWIR1 bands, geographic Lat-Lon WGS84 projection and 8 bit data type, each composite is a light weight product (e.g., only 200 GB for the entire Africa) easy to handle in both local and regional-scale mapping activities. GeoTiff tiles of 10 °·10 °can be eventually downloaded for local processing and analysis.

Context and reference information
The Sentinel-2 mission of the European Copernicus Earth observation program started with the launch of the first satellite in June 2015 and became operational in late 2017 when the constellation of the two polar-orbiting satellites (2A and 2B) was able to acquire images of the land and coastal areas at a spatial resolution ranging from 10 to 60 m in the optical domain (13 spectral bands from 0.44-2.2 μm) and a revisit time of 5 days at the equator [1] .
The above-mentioned characteristics, accompanied by a free, full and open access distribution policy, are making Sentinel-2 imagery a valuable product for LULCC mapping initiatives either at regional, national or continental scale.
The work described in this manuscript has been conducted in the framework of the Re-CaREDD project (Reinforcement of Capacities for REDD + ) [2] , an international initiative aiming at enhancing the capacity of institutions in tropical partner countries to report on deforestation and forest degradation in a reliable and cost-efficient manner.
Although applicable at global scale, the proposed methodology has been developed and applied only within the pan-tropical scope of the project ( Fig. 1 ), where the persistent cloud cover, cloud shadows and haze frequently result in no exploitable images over large areas for weeks or even months. Hence the need for a robust methodology capable of removing any pixel affected by atmospheric contaminations and using the few available cloud-free ones to compute the annual synthesis.
In order to cope with the limited internet bandwidth and computing power of many tropical partner institutions, the proposed annual composites have been produced only for the SWIR1, NIR and RED bands, with a spatial resolution of 20 m, a geographic Lat-Long, WGS84 projection  and 8-bit data type. This configuration reduces the original size of each 3 bands composite by a factor of 6 (0.5 TB vs 3 TB).
The Google Earth Engine (GEE) [3] script link available in the "data accessibility" section can be used to recompute and export a customized composite with any time frame, band combination, resolution and projection thanks to a dedicated and user-friendly graphical user interface.

Annual cloud-free composites
This article is proposing four Sentinel-2 L1C cloud-free pan-tropical annual composites for the period 2015-2020 by extracting per-band annual median values after cloud and shadow masking based on spectral conditions specifically developed for tropical regions ( Fig. 1 ). This set of rules, further described in chapter 2.1, have been compiled in a library named PINO (convenient trade name, not acronym).
All available raw Sentinel-2A and 2B L1C Top of Atmosphere (TOA) Reflectance images have been processed in the GEE cloud computing platform and downloaded by selecting only SWIR1, NIR and RED bands converted to 8 bit (Byte) using a multiplicative factor of 0.051 for visualization purposes and size optimization.
For the 2020 composite a post processing step, aiming at equalising the different acquisition swaths, has been applied as described in chapter 2.3.
Due to the scarcity of available images at the beginning of the Sentinel-2 missions (Sentinel 2A was launched in June 2015 while 2B in March 2017), the first proposed composite covers a time frame spanning from 2015 to 2017.

Potential annual change layers
Additional layers, showing potential annual land cover changes, are computed and visualized on the fly using the SWIR1 bands of two selected GeoTiff yearly composites (Y1, Y2) ( Fig. 2 ) in a false colour combination virtual file (vrt) as follows R: SWIR1 Y2 , G: SWIR1 Y1 , B: SWIR1 Y2 . This  visualization choice and technique does not need any image pre-processing and highlights the loss and gain of photosynthetic activity on the ground in purple and green tones respectively as the direct consequence of the absorption (vegetation) or reflection (soil) in this portion of the electromagnetic spectrum.
The intensity of the original SWIR1 reflectance band is preserved when no changes are occurring leading to dark grey tones in dense moist forest or bright ones in arid regions. Despite being sensitive to residual cloud and shadow contaminations in the reference composites, the change layers may serve as a ready-to-use means to visually identify and verify land cover changes, as well as degradation processes ( Fig. 3 ).

Download
Each annual pan-tropical composite is composed of 10 °x10 °tiles, corresponding to 112 downloadable GeoTiff files ( Fig. 4 ) with internal cubic overviews and a total size of approximately 500 GB.
Each file is identified by a prefix corresponding to the central coordinates of the box, the continent, the year and the band order e.g., N35_W115_LAC_composite_2019_1184.tif corresponds to the upper left tile in Latin America and Caribbean region. Other continent identifiers are AFR (Africa) and SEA (Southeast Asia). The proposed gridding system, although leading to multiple downloads for large countries, reduces the volume of data to be transferred by single request to better cope with internet bandwidth and connection failures as reported by several stakeholders operating in the tropical belt.
The GEE script provided in the 'data accessibility' section can be eventually used to export a customized version of the composite.

Experimental Design, Materials and Methods
The new Sentinel-2 Global Mosaic service (S2GM) [4] of the Copernicus Global Land Service is providing on-demand composites from Sentinel-2 surface reflectance (level 2A). However, the algorithms used for the image compositing rely on the scene classification of Sentinel-2 L2A data, which is prone to errors (e.g., confusion between clouds, bright soil and built-up areas) eventually resulting in undesired artefacts in the S2GM products [5] .
The L2A level, Bottom of the Atmosphere (BOA) reflectance, should ensure a bidirectional reflectance distribution function (BRDF) corrected product, hence uniform in space and time. However, the L2A products currently produced by the European Space Agency (ESA) Copernicus Ground Segment with the Sen2Cor processor [6] does not fully resolve the differences in brightness across the acquisitions swath in west-east direction and suffer from overcorrection over north-facing slopes as visible on Fig. 5 .
Additionally, the best pixel Medoid/STC-S2 mosaicking algorithms [7] preserve the spectral profile of each pixel but introduces spatial heterogeneity also referred to as the "salt and pepper" effect as demonstrated on Fig. 6 . In order to cope with that problematic, Copernicus is further developing Sen2Cor and the S2GM service is working on alternative compositing algorithm (ArtCom [8] ) based on mean values, providing spatially smoother and visually more appealing products.
Kempeneers et al. [9] proposed an image compositing based on quick looks and the cloud mask of Level 1C original images but resulted in a global composite contaminated with persistent cloud coverage and tiling effect limiting its usage for large scale classification purposes ( Fig. 7 ).
Corbane et al. [10] proposed a compositing approach by computing the distribution of all selected observations in a UTM grid zone with the 25 th percentile of least cloudy pixels. Although the reference year has been set to 2018, image selection was extended to 2017 in cloudy areas  (cloud percentage above 30%) in order to avoid data gaps ( Fig. 8 ). This approach increased the likelihood to use the pixels stemming from the same season, hence reducing phenology changes within the same UTM. The tiling effect has been drastically reduced but acquisition swaths are still pronounced on the final product. This global composite is available at 10 m resolution for the RED, GREEN BLUE and NIR bands.
The quality of the above-mentioned composites is strongly related to the quality of the official Sentinel-2 L1C and L2A cloud masks provided by the Copernicus service: although good enough for less cloudy areas, the frequent cloud cover in tropical regions often compromises the quality of the final composite products.
Several ready to use cloud and cloud-shadow masking algorithms such as the Automated Cloud-Cover Assessment Algorithm (ACCA) [11] , F-mask [12] and Hollstein et al. [13] have been tested. However, due to the need of local fine-tuning, application of morphological filters to incorporate the cloud edge, computing time restrictions and overall quality, these algorithms could not be efficiently applied over larger cloudy areas.
The algorithm adopted for clouds and shadows masking in the proposed annual composites, namely PINO (described in section 2.1), is the evolution of the threshold-based rule-set proposed by Simonetti et al. [14] . Originally developed to operate only with Landsat TM/ETM/OLI sensors at a spatial resolution of 30 m, the rule-set has been further developed adding a decision tree tailored to the spatial and spectral characteristic of the Sentinel-2 imagery so as to better detect small and sparse clouds over the tropical belt where the annual average cloud percentage computed over the Military Grid Reference System (MGRS) tile [15] can be greater than 80% of the total observations ( Fig. 9 ) hence limiting the possibility of sensing cloud-free images.
The PINO algorithm has been applied to cloud-prone countries, as highlighted in yellow in Fig. 10 , while a simple mask based on QA60 band equal or greater than 1024 was sufficient in areas (green) with abundance of cloud free images (greater than 60% of total observations). The former approach is resource (CPU, RAM) demanding hence almost two times slower; however, the average execution time in GEE (per orbit, per country) remains within the 2 h.
Additionally, the 2019 composite of Colombia, Gabon and Equatorial Guinea, has been processed in early 2020 with an extra 3D shadow casting model (based on sun elevation and azimuth parameters) with the aim of maximizing the detection of non-contaminated pixels  in presence of "mackerel sky" [16] , a sky condition characterized by the alternance of clouds, shadows and small clear gaps in between. However, due to the resource-intense 3D models often leading to processing errors, for the computation of the 2020 annual composite in early 2021, the PINO rule-set has been enriched (version 26 described in section 2.1) with spectral thresholds able to better identify shadows over the tropical moist forest and avoid shadowcasting models and buffers.
Due to the large volume of pixels to be classified and normalized over time and space, the composites are computed and exported from Google Earth Engine (GEE).
Despite the fact that the proposed approach is resource (CPU, RAM) demanding, it allows creating high quality annual cloud-free composites in tropical areas affected by persistent cloud coverage such as in Malabo, Bioko Island (Equatorial Guinea), the cloudiest city in Africa [17] . The 12 least cloudy Sentinel-2 acquisitions in 2019 and the annual composite of all 68 available images corresponding to MGRS granules T32NMK, are displayed in Fig. 11 .
The amount of contaminated pixels per single image prevents a wall-to-wall Land Cover-Land Cover Change (LC-LCC) mapping activity, hence the need for a pixel-based annual composite. However, in 2019 some forest areas of Bioko Island remained un-sensed by this optical sensor due to the fact that no single cloud free observation was available.
Please note that the median annual compositing approach i) does not preserves relationships between bands (Medoid always returns bands from the same selected pixel), ii) does not necessary catches the natural vegetation cycle and ii) can be strongly biased by the uneven distribution of cloud across the year, eventually leading to synthesis computed from images acquired in the dry season. However, by using the GEE script provided in the "data accessibility" section, new customized composites can be generated and exported to better cope with local or regional phenological conditions.

PINO classification: GEE code
The following code shows the JavaScript GEE implementation of the PINO cloud mask tailored to Sentinel-2 L1C images collected all over the tropical belt. Originally developed in 2014 by Simonetti et al. [14] to operate only with Landsat TM/ETM/OLI sensors at a spatial resolution of 30 m, it has been further developed adding decision tree tailored to the spatial and spectral characteristic of the Sentinel-2 imagery. The proposed release, version 26, does not apply any morphological filters (e.g., opening, closing) or buffering to cloud or shadow classes.
Spectral indices, as well as the median extraction, are based on TOA reflectance values hence subject to haze and other atmospheric contaminations. However, the cloud and shadow mask computed based on the L1C products as proposed hereafter, can be eventually applied to the correspondent Bottom of Atmosphere (BOA) reflectance image in the L2A collection.

Decision tree approach
The classification algorithm relies on several spectral indexes (NDVI, NDSI, NDWI) as well as a set of fuzzy bands conditions (e.g., BLU > GREEN > RED) and fixed thresholds (e.g., B1 > 20 0 0 and NDVI > 0.2) designed to recognize the spectral profile of different land cover types, clouds and shadows. This set of rules is the transposition in python code of the average profile of several spectral signatures, belonging to different classes, extracted from Sentinel-2 L1C images across the tropical belt.
After the initial cloud masking based on the QA60 band, the input image is dived into 3 broad thematic strata (1: water, 2: soil, 3: vegetation) on the basis of the NDVI values; within each stratum the spectral profile of each pixel is compared with the set of rules to identify clouds or shadows contaminations.
This approach allows i) better commission/omission error tracking in each stratum, ii) integrating tailored spectral rules, iii) distinguishing between clouds and shadows over water, soil or vegetation to apply further processing steps depending on the field of application of the composite (e.g., PINO has room of improvement in cloud/shadow detections over water).

Compositing by orbits
Despite the fact that double acquisitions provide a higher probability of getting cloud free observations in the overlapping area, pixels laying on the eastern and western side of the adjacent swaths present a considerable spectral difference caused by the BRDF effect, especially over the dense humid forest.
To overcome the heterogeneous spectral response in overlapping areas, the Sentinel-2 L1C 2019 and 2020 median composites have been processed by orbit and subsequently exported using the geographic Lat-Lon common projection instead of the multiple native UTM ones. Additionally, an inwards buffer of 8km has been applied to each swath (orbit) geometry to i) limit the amount of data to be processed and ii) remove sporadic detectors zig-zag along the edge of the scene and iii) guarantee a correct empirical BRDF correction as described hereafter.
The red and white lines in Fig. 12 show the boundaries of the re-computed orbits number 125 and 25, respectively as available in the vector file [18] used in the compositing algorithm in 2019 and 2020; the 2018 median composite in background reflects the uneven distribution of images in the three distinct zones corresponding to the west orbit, the overlap with double acquisitions and east orbit.

Different empirical approaches to mitigate the BRDF effects on 2019 and 2020 composites
The Sentinel-2 2015-2017 and 2018 composites have been processed by computing the simple median value of each band across the year. However, the differences in brightness across the Sentinel-2 acquisitions in west-east direction, is visible in the final products, especially over the humid tropical forest as shown in Fig. 13 .
In 2019, before computing the median value, each granule has been normalized using the dark object subtraction method (using evergreen forest as pseudo-invariant feature) with the aim of mitigating this BRDF radiometric effect. The evergreen forests in central Africa look homogeneous but tiles falling outside the evergreen domain, which are not normalized, are still affected ( Fig. 14 ).
Considering the above-mentioned limitation, the 2020 composite has been produced by applying a post processing orbit normalization using the IMPACT Toolbox [19] consisting of a multiplicative gradient ranging, over humid forest, from -12% to 0% (west-east) for the RED, NIR and SWIR1 bands; the effect of the gradient is visible when comparing 2018 and 2020 composites as reported on Fig. 15 .

Indication for potential change between 2015-2017 / 2018 / 2019 / 2020
Potential change layers are generated 'on the fly' by the web-platform based on the spectral difference of the SWIR1 bands of two selected yearly composite (Y1, Y2) and rendered in a in a false colour combination as follows R: SWIR1 Y2 , G: SWIR1 Y1 , B: SWIR1 Y2 . Purple and green colours correspond to an increase (e.g., due to soil component) and a decrease (absorption e.g., due to vegetation growth or water) in the SWIR1 band, respectively. These layers may serve as a ready-to-use source of information to identify potential land cover changes, to be then confirmed by visual inspection of the reference annual composites or by supplementary materials.
Examples of land cover changes as derived from the proposed annual composites, are reported on Figs. 16 -19 covering locations in Malaysia, Brazil, Madagascar and Tanzania, respectively. Vegetation loss (purple) is quite evident in Malaysia, characterized by vast deforested areas followed by re-greening (green) in the subsequent years. Less pronounced but still evident, the geometric deforestation patches in Brazil while a more fragmented and small-scale pattern is observable in Madagascar. Apparently, in Tanzania, the layers show evident changes around lakes; a closer look confirms the drop and the increase of the SWIR1 band across the years caused by the inter-annual water dynamics. Being based on the change of the SWIR1 intensity    over time, the potential change layer per-se is not capable of discriminating soil to vegetation transitions from soil to water and vice versa.

Ethics Statement
Not applicable.
CRediT Author Statement D. Simonetti was the principal investigator of this study, he designed and developed the cloud mask and compositing algorithm in GEE, downloaded and equalized individual satellite swaths, organized and optimized data for dissemination and developed the server back-end. The manuscript was prepared by D. Simonetti with the contributions of U. Pimple and A. Marelli .
A. Marelli developed the front-end of the Sentinel-2 cloud-free explorer; A. Langner contributed to the analysis of the results providing useful feedback on the products in the framework of the ReCaREDD (Reinforcement of Capacities for REDD + ) and REDDCopernicus projects.

Declaration of Competing Interest
The authors declare that they have no financial interests or personal relationships that could have appeared to influence the work reported in this paper.