CloudSEN12+: The largest dataset of expert-labeled pixels for cloud and cloud shadow detection in Sentinel-2

Detecting and screening clouds is the first step in most optical remote sensing analyses. Cloud formation is diverse, presenting many shapes, thicknesses, and altitudes. This variety poses a significant challenge to the development of effective cloud detection algorithms, as most datasets lack an unbiased representation. To address this issue, we have built CloudSEN12+, a significant expansion of the CloudSEN12 dataset. This new dataset doubles the expert-labeled annotations, making it the largest cloud and cloud shadow detection dataset for Sentinel-2 imagery to date. We have carefully reviewed and refined our previous annotations to ensure maximum trustworthiness. We expect CloudSEN12+ will be a valuable resource for the cloud detection research community.

© 2024 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Specifications Table
Subject: Computers in Earth Sciences.
Specific subject area: Cloud detection in optical remote sensing data.
Type of data: GeoTIFF imagery; CSV table.
Data collection: The dataset comprises Sentinel-2 Level 1C (S2) imagery associated with hand-crafted labels. The S2 images are retrieved from Google Earth Engine [9] using the R client interface [3]. Sixteen experts generated the labels following a strict cloud detection protocol. Additionally, we have incorporated cloud detection predictions generated by the CloudSEN12 UnetMobV2 model [1].

Value of the Data
• The collection consists of more than 50,000 S2 image patches (Fig. 1). It covers diverse cloud scenes with varying shapes, thicknesses, sizes, and altitudes, providing a comprehensive dataset for training and testing cloud detection algorithms.
• It includes images from various regions worldwide, providing a geographically diverse dataset that can help improve the generalization of trained cloud detection algorithms.
• It provides high-quality expert-labeled annotations using a consistent and well-defined labeling protocol in two patch sizes: 509 × 509 and 2000 × 2000 pixels at 10 m resolution. As part of the legacy of version 1 (CloudSEN12 dataset), it also provides scribble expert-labeled annotations and no-label patches.
• It can serve as a foundation for other remote sensing (RS) sensors, enabling researchers to transfer the knowledge gained from S2 to similar sensors, such as Landsat or the many small RGBNIR optical satellites.
• This dataset is licensed under CC0, which puts it in the public domain and allows anyone to use, modify, and distribute it without permission or attribution.

Fig. 1. CloudSEN12+ spatial coverage. The terms p509 and p2000 denote the patch sizes 509 × 509 and 2000 × 2000, respectively. 'high', 'scribble', and 'nolabel' refer to the types of expert-labeled annotations.

Background
Accurately detecting clouds in optical RS imagery is critical for various environmental and Earth observation studies [5, 10, 13]. Clouds obstruct and contaminate the surface reflectance signal, causing inaccuracies when retrieving land and ocean parameters [15]. To tackle this challenge, there has been a growing interest in creating robust algorithms for cloud screening from RS imagery. From a data-driven perspective, the first step in developing cloud detection algorithms is to create a training dataset. Several datasets (Fig. 2) have been created for this purpose [4, 8, 6, 11]. However, they have limitations; in particular, they lack diversity across cloud types and geographies [14]. The CloudSEN12 dataset [1] was designed to address these issues. Nevertheless, models trained on CloudSEN12 are still not exempt from errors [2]. The novel CloudSEN12+ tackles these errors by extending CloudSEN12 with more labels in challenging areas, increasing the size of the patches to improve shadow detection, and curating several of the original labels following extra quality control procedures. With these improvements, we expect to push forward the accuracy of cloud detection models.

Data Description
Table 1 presents the number of image patches in each subfolder. The dataset is divided into two main collections, p509 and p2000, as shown in Fig. 3A. These numbers correspond to the image patch sizes of 509 × 509 and 2000 × 2000 pixels, respectively.

Table 2
Cloud semantic categories considered in CloudSEN12+. Some classes have a greater impact on the overall quality of the image. To measure this impact fairly, we have introduced a 'Priority' column to indicate the classes that require greater attention from labelers. Lower priority levels indicate higher relevance.

Code | Class | Description | Priority
0 | Clear | Pixels without cloud and cloud shadow contamination. They are primarily identified using bands B4-B3-B2, B1-B12-B13, and the cirrus band. |
 | Thick Cloud | Opaque clouds that block all the reflected light from the Earth's surface. We identify them by assuming clouds exhibit distinctive shapes and maintain higher reflectance values in bands B4-B3-B2, B1-B12-B13, and the cirrus band. |
 | Thin Cloud | Semitransparent clouds that alter the surface spectral signal but still allow the background to be recognized. This is the hardest class to identify. We utilize CloudApp [1] to better understand the background, both with and without cloud cover. |
 | Cloud Shadow | Dark pixels where light is occluded by thick or thin clouds. Cloud shadows depend on the presence of clouds and, by considering the solar position, we can identify and map these shadows through a reasoned projection of the cloud shape. | 2

The initial p509 folder is further divided into three groups depending on the manual label type: 'high', 'scribble', and 'nolabel' (Fig. 3B). S2 images labeled as 'high' indicate that each pixel within the image (i.e., 509 × 509) is associated with a cloud semantic category described in Table 2. This subset is ideal for training machine learning models since they require pixel-level accuracy to learn complex patterns and distinctions in cloud formations. S2 images within the 'scribble' subset were annotated with the Intelligently Reinforced Image Segmentation (IRIS) [12] brush tool, and their annotations cover only a small percentage of pixels (less than 5%). These labels are particularly useful for validation, offering a balanced representation of pixels far from and near to edges, areas where cloud detection algorithms commonly fail [1]. Finally, S2 images in the 'nolabel' subset do not have human annotations. However, we include in all patches the accurate cloud masks generated by the CloudSEN12 UnetMobV2 model, which can serve as a basis for training a cloud detection model before fine-tuning it with the 'high' quality human labels. Both the 'high' and 'scribble' categories are segmented into three subfolders (train, val, and test), while 'nolabel' only contains the train subfolder (Fig. 3C).
The p2000 collection exclusively contains 'high' quality human annotations and is systematically organized into the train, val, and test subfolders. In both cases, p509 and p2000, the human annotators had the option to generate the labels with the initial support of a machine learning model assistant (see section Cloud detection protocol). In contrast to p509, the p2000 subset includes only one image per location. The p2000 collection is designed to enhance the performance of models initially trained on the p509 dataset by leveraging larger image patches. Models trained on p2000 patches should better capture the spatial autocorrelation between cloud and cloud shadow classes thanks to the wider receptive field.
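The folder hierarchy described above can be summarized in a few lines of Python. This is only a sketch of the structure stated in the text: the exact on-disk folder names (for example `p2000` or `nolabel`) and the path separator are assumptions, and the per-location folders of p509 are omitted.

```python
# Splits available for each (collection, label type), as described in the text.
# The literal folder names are assumptions made for illustration.
LAYOUT = {
    "p509": {
        "high": ("train", "val", "test"),
        "scribble": ("train", "val", "test"),
        "nolabel": ("train",),  # 'nolabel' only contains the train subfolder
    },
    "p2000": {
        "high": ("train", "val", "test"),  # p2000 holds only 'high' labels
    },
}


def subfolders(layout=LAYOUT):
    """Return every collection/label/split path implied by the layout."""
    return [
        f"{collection}/{label}/{split}"
        for collection, labels in layout.items()
        for label, splits in labels.items()
        for split in splits
    ]


paths = subfolders()
for path in paths:
    print(path)
```

Keeping the layout in a single dictionary makes it easy to check that a requested combination, such as a 'scribble' split of p2000, does not exist before trying to read it.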
Each image patch in p509 and p2000 comprises fifteen bands: thirteen corresponding to S2, one for the manual label (filled with NA for the 'nolabel' subset), and one for the automatic labels generated by CloudSEN12 UnetMobV2.
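A minimal sketch of how the fifteen bands might be separated once a patch has been read into memory. It assumes that the bands are stored in the order in which they are listed above (thirteen S2 bands, then the manual label, then the automatic label) and uses `None` as a stand-in for NA; both are assumptions, and `split_patch` and `has_manual_label` are hypothetical helper names.

```python
N_S2_BANDS = 13  # Sentinel-2 spectral bands


def split_patch(bands):
    """Split a 15-band patch into (s2_bands, manual_label, automatic_label).

    `bands` is a sequence of 15 two-dimensional arrays (lists of rows here).
    The band ordering is an assumption based on the order used in the text.
    """
    if len(bands) != N_S2_BANDS + 2:
        raise ValueError(f"expected 15 bands, got {len(bands)}")
    s2 = list(bands[:N_S2_BANDS])
    manual = bands[N_S2_BANDS]
    automatic = bands[N_S2_BANDS + 1]
    return s2, manual, automatic


def has_manual_label(manual):
    """Return True if at least one pixel of the manual band is not NA (None)."""
    return any(value is not None for row in manual for value in row)


# Toy 2 x 2 patch imitating the 'nolabel' subset.
toy = [[[i, i], [i, i]] for i in range(N_S2_BANDS)]
toy.append([[None, None], [None, None]])  # manual label band filled with NA
toy.append([[0, 1], [3, 0]])              # automatic label band
s2, manual, automatic = split_patch(toy)
print(len(s2), has_manual_label(manual), automatic)
```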

Experimental Design, Materials and Methods
CloudSEN12+ is an extension of the CloudSEN12 dataset which adds a set of new manually labeled images with a larger patch size (see subsection Large patch size collection) and improves the labels of 452 images identified by a label quality protocol (see subsection Semi-automatic label quality).

Sentinel-2 data
The S2 mission currently comprises two nearly identical satellites, Sentinel-2A and Sentinel-2B, launched in June 2015 and March 2017, respectively. These satellite products offer estimates of reflectance values across 13 spectral channels, covering the entire globe every five days [7]. The S2 imagery is freely distributed under an open data policy. In the CloudSEN12+ dataset, we use the L1C products that provide Top-of-Atmosphere (TOA) reflectance. All image bands at 20 m and 60 m are upsampled to 10 m resolution using nearest neighbor interpolation to have a uniform resolution across bands.
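The resampling step can be illustrated with a small pure-Python sketch. It assumes that the coarse and fine grids are aligned, in which case nearest neighbor upsampling by an integer factor reduces to repeating each pixel (factor 2 for 20 m bands, factor 6 for 60 m bands). `upsample_nearest` is a hypothetical helper, not the routine used to build the dataset.

```python
def upsample_nearest(band, factor):
    """Repeat every pixel `factor` times along both axes (nearest neighbor)."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    out = []
    for row in band:
        # Repeat each value along the row, then repeat the widened row.
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


band_20m = [[1, 2],
            [3, 4]]
band_10m = upsample_nearest(band_20m, 2)   # 20 m -> 10 m
band_60m = [[7]]
band_60m_to_10m = upsample_nearest(band_60m, 6)  # 60 m -> 10 m
print(band_10m)
```

Nearest neighbor resampling does not create new reflectance values, so the upsampled bands keep the original measurements.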

Large patch size collection
One of the most significant challenges in cloud semantic segmentation is accurately identifying cloud shadows [16]. Neural network models often struggle to differentiate between cloud shadows and other types of shadows, such as those originating from terrain or other objects. To address this problem, larger patches were added to make it easier for the networks to learn the spatial relationship between clouds and their shadows. Our selection process involved manually choosing images with high potential error for cloud shadows (see section Semi-automatic label quality). Furthermore, additional bright regions such as deserts and snow were included. Ultimately, 849 images were labeled. The final spatial distribution of the dataset can be seen in Fig. 1.

Cloud detection protocol
Creating human-generated labels for cloud detection can be a complex task, and several factors can contribute to potential inaccuracies. Firstly, defining borders between clear and cloud-contaminated areas is challenging, as individual priors and biases influence the thresholds and decisions used to differentiate them. Secondly, cloud detection is an imbalanced problem. Opaque and oval-shaped clouds are more commonly observed and labeled, which can result in the under-representation of less frequent cloud types, such as semi-transparent and elongated clouds. Thirdly, semantic classes are not always mutually exclusive. Pixels within an image can simultaneously belong to multiple classes. For instance, a semi-transparent cirrus cloud may overlap an opaque cumulus cloud or a cloud shadow, creating mixed pixels. Finally, some classes are interdependent, as cloud shadows are inherently dependent on the existence of clouds.
Recognizing and accurately labeling complex cloud patterns requires specialized knowledge. To achieve the highest label accuracy for CloudSEN12+, we have designed a five-step protocol that addresses the unique challenges that cloud labeling poses (Fig. 4). This protocol is not only applicable to CloudSEN12+ but can also be adapted to enhance labeling accuracy in various remote sensing tasks. Our protocol is built around IRIS, a semi-automatic tool designed for manual segmentation of multi-spectral and geospatial imagery (Fig. 5). This tool aids in achieving precise and consistent labeling by leveraging machine learning assistance while allowing for human oversight and adjustment.
1. Sampling: Manual sampling remains the most effective approach, despite being time-consuming. The vast array of cloud types and their unique characteristics demand careful consideration. To address this, labelers prioritize atypical clouds, such as contrails, ice clouds, and haze/fog, over more common varieties like cumulus and stratus. Furthermore, using a reference model to determine which data points should be included in a dataset is helpful. Labelers make informed decisions about which samples to prioritize by comparing human interpretation with the reference model.
2. Agreement: Before starting the labeling process, all labelers come to a mutual agreement on the definitions and criteria for each semantic class, creating common guidelines (refer to Table 2). When ambiguity arises, and a pixel could belong to multiple classes, the priority attribute is used to determine the final allocation based on the higher priority class. This prioritization strategy ensures consistent labeling decisions, particularly in borderline cases. Additionally, all labelers agree on a specific metric to optimize. For CloudSEN12+, the chosen metric is the F2-score, which places more emphasis on recall in the evaluation. This prioritization highlights errors in thick clouds and cloud shadows over those in thin clouds and clear classes. Lastly, specific band combinations are established to aid cloud detection (refer to Table 2).
3. Training: Labelers follow a comprehensive training program designed to teach them to agree with the labeling software and enhance their skills in ambiguous scenarios. The training begins with an in-depth review of accurately labeled image examples, enabling participants to align with the established standards and expectations for labeling. Furthermore, the training encompasses hands-on practical sessions, during which labelers put their learning into practice in real-time scenarios and receive constructive feedback to sharpen their labeling skills. The training stage is pivotal in ensuring that all contributors are thoroughly prepared and maintain consistency in their labeling efforts.
4. Production: The labeling process is conducted in this stage. Labelers can start labeling from a blank canvas or fine-tune the preliminary cloud mask predictions provided by the CloudSEN12 UnetMobV2 model. Each labeler undertakes the task independently.
5. Quality Control: The labels go through a double-blind quality control process that involves all the labelers to ensure their integrity and accuracy. If more than two independent reviewers report a label, it is sent back to the production stage. Additionally, all human-generated labels exhibiting a P_error equal to 1 (see Semi-automatic label quality section) are subject to a meticulous re-examination.
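Two of the decision rules in the protocol lend themselves to a short sketch: the priority rule from the Agreement step and the review triggers from the Quality Control step. The priority values below are placeholders chosen for illustration (lower number means higher relevance), and `resolve_class` and `needs_rework` are hypothetical names, not part of the authors' tooling.

```python
# Placeholder priorities for illustration only; lower number = higher relevance.
PRIORITY = {"thick cloud": 1, "cloud shadow": 2, "thin cloud": 3, "clear": 4}


def resolve_class(candidates, priority=PRIORITY):
    """Pick the class with the lowest priority number among the candidates."""
    if not candidates:
        raise ValueError("at least one candidate class is required")
    return min(candidates, key=lambda name: priority[name])


def needs_rework(n_reports, p_error):
    """Apply the two triggers described in the Quality Control step.

    A label is revisited when more than two reviewers report it, or when
    its P_error flag equals 1.
    """
    return n_reports > 2 or p_error == 1


print(resolve_class(["thin cloud", "cloud shadow"]))
print(needs_rework(n_reports=2, p_error=0))
print(needs_rework(n_reports=3, p_error=0))
```

Encoding the priorities as data rather than as nested conditionals keeps the rule easy to audit when labelers revise their agreement.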

Semi-automatic label quality
CloudSEN12+ employs a dual-scoring approach to detect potential human errors in semantic segmentation [2]. The methodology is illustrated in Fig. 6. Initially, we calculate the trustworthiness index (TI), which compares the cloud mask prediction from a reference model with the corresponding human annotations used as ground truth. We have selected the CloudSEN12 UnetMobV2 as the best available reference model. The TI is computed using the F2 multi-class score, adopting a one-vs-all macro strategy, that is, the per-class F2-score averaged over the classes.
Annotation errors are more likely in challenging scenarios, such as class boundaries, intricate cloud shapes, or insufficient contextual information. To address this, we incorporate a Hardness Index (HI) that considers the perceived difficulty of the labelers during the annotation process. In order to build this index, a ResNet-10 model is trained with the S2 images as input and the labelers' perceived difficulty as the target, which is included in the metadata of the CloudSEN12 dataset [1]. This model accounts for the complexity of the annotation task and helps identify areas where errors are more likely to occur.
The TI and HI indices are estimated for all the image patches in CloudSEN12+. The potential errors P_error are detected by considering a simple combination of these indices:

$$
P_{error} =
\begin{cases}
1 & \text{if } TI < 0.3 \text{ and } HI > 0.5 \\
0 & \text{otherwise}
\end{cases}
$$

All the image patches flagged by P_error undergo an extra visual inspection (see Fig. 6). This method flagged 17.12% of CloudSEN12+ annotations as potential errors (3,570 images). Upon visual inspection by the labeling team, 342 and 110 image patches from the high and scribble subsets, respectively, were confirmed as real human errors. In Figs. 7 and 8, we present examples of human labeling before and after the review process.

Fig. 6. A high-level summary of our workflow to detect human errors. Prediction accuracy (TI) and sample difficulty (HI) are used to identify errors in high-quality and scribble subsets.
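The flagging rule can be written as a small function. The thresholds 0.3 and 0.5 come from the text; the patch names and index values in the example are invented for illustration.

```python
TI_THRESHOLD = 0.3  # patches below this agreement level are suspicious
HI_THRESHOLD = 0.5  # patches above this difficulty level are suspicious


def p_error(ti, hi):
    """Return 1 when TI < 0.3 and HI > 0.5, and 0 otherwise."""
    return 1 if (ti < TI_THRESHOLD and hi > HI_THRESHOLD) else 0


# Invented (TI, HI) pairs for three hypothetical patches.
patches = {
    "patch_a": (0.12, 0.80),  # low agreement, hard sample -> flagged
    "patch_b": (0.12, 0.30),  # low agreement, easy sample -> not flagged
    "patch_c": (0.90, 0.90),  # high agreement -> not flagged
}
flagged = [name for name, (ti, hi) in patches.items() if p_error(ti, hi) == 1]
print(flagged)
```

Both comparisons are strict, so a patch sitting exactly on a threshold (TI = 0.3 or HI = 0.5) is not flagged.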

Limitations
As mentioned in the semi-automatic label quality section, the ground truth data relies on human interpretation, which is not infallible. While two rounds of validation have been performed on this dataset, some errors may remain, especially in complex areas with snow, faint cloud shadows, or thin clouds, where consensus was difficult to achieve. Nonetheless, these errors are expected to be minimal. After the second review, out of the 3,570 images examined (17.12%), only 452 (12.6%) were found to have actual errors, with less than 1% being significant errors.
Dataset link: CloudSEN12+: The largest collection of expert-labeled pixels for cloud and cloud shadow detection in Sentinel-2 (Original data)

Fig. 3. The CloudSEN12+ dataset is structured hierarchically, with the top level (A) dividing the dataset into two main categories: p509 and p2000 image patches, represented by gray folders. Moving to the next level (B), the images are further organized based on the label type, with each label type having a different folder. Within each label type, an additional level (C) groups the images based on a block of random data splitting, represented by blue folders. Moreover, within the p509 category, there is an additional division based on geographic location, highlighted by yellow folders (D). Each yellow folder contains a set of five distinct images with cloud cover ranging from 0% (cloud-free) to near 100% (cloudy).

Fig. 4. The image illustrates our cloud detection protocol, structured into five stages: Sampling, Agreement, Training, Production, and Quality Control. The IRIS graphical user interface is integral to each of these stages. The Quality Control section is detailed in Fig. 6.

Fig. 5. The IRIS (Intelligently Reinforced Image Segmentation) graphical user interface includes seven toolbars: A) Edit and navigation bar; B) Selection tool for drawing semantic classes; C) Drawing toolbar, with a final button to execute the GBDT algorithm that completes the mask using previous manual annotations; D) Testing toolbar, allowing comparison between human and AI annotations; E) Image contrast toolbar, which adjusts brightness and saturation; F) Image metadata section, displaying a thumbnail and IP location via Google Maps; G) Machine learning summary support, showing GBDT performance metrics. The IRIS interface includes views of the Cirrus band, Red-Green-Blue, and Blue-SWIR1-SWIR2 bands by default. This image corresponds to the one found in the supplementary information of [1].
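The trustworthiness index (TI) is the F2-score computed per class (one-vs-all) and macro-averaged over the classes:

$$
TI = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + 0.2\,FP_c + 0.8\,FN_c}
$$

where FN represents false negatives, FP false positives, and TP true positives; c identifies each class (clear, thin cloud, thick cloud, and cloud shadow), and C is the number of classes (C = 4).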
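The macro one-vs-all F2 computation behind the TI can be sketched in pure Python. The human annotation is treated as ground truth and the reference model output as the prediction, as stated in the text. The integer class codes and the convention of scoring a class as 0 when its denominator is zero are assumptions made for this sketch.

```python
CLASSES = (0, 1, 2, 3)  # illustrative integer codes for the four classes


def trustworthiness_index(human, model, classes=CLASSES):
    """Macro-averaged one-vs-all F2 between two flat label sequences.

    Per class: F2 = TP / (TP + 0.2 * FP + 0.8 * FN), which is the usual
    F-beta score with beta = 2 after dividing numerator and denominator by 5.
    """
    if len(human) != len(model):
        raise ValueError("label sequences must have the same length")
    total = 0.0
    for c in classes:
        tp = sum(1 for h, m in zip(human, model) if h == c and m == c)
        fp = sum(1 for h, m in zip(human, model) if h != c and m == c)
        fn = sum(1 for h, m in zip(human, model) if h == c and m != c)
        denom = tp + 0.2 * fp + 0.8 * fn
        # Assumption: a class absent from both masks contributes 0.
        total += tp / denom if denom > 0 else 0.0
    return total / len(classes)


perfect = trustworthiness_index([0, 1, 2, 3], [0, 1, 2, 3])
partial = trustworthiness_index([0, 0, 1, 1], [0, 1, 1, 1], classes=(0, 1))
print(perfect, round(partial, 4))
```

Because false negatives weigh four times as much as false positives in the denominator, a model that misses labeled pixels of a class lowers the TI more than one that over-predicts that class.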

Fig. 8. Correcting labels in the 'scribble' subset. These images originate from ROIs 1909, 3472, and 3474. The varying shades of yellow, green, and red represent the edges (darker) and center (lighter) of the annotations.

Table 1
Summary of image patch distribution across CloudSEN12+ subfolders.