Large-scale detection and categorization of oil spills from SAR images with deep learning

We propose a deep learning framework to detect and categorize oil spills in synthetic aperture radar (SAR) images at a large scale. By means of a carefully designed neural network model for image segmentation trained on an extensive dataset, we obtain state-of-the-art performance in oil spill detection, achieving results comparable to those produced by human operators. We also introduce a classification task, which is novel in the context of oil spill detection in SAR. Specifically, after being detected, each oil spill is classified according to different categories pertaining to its shape and texture characteristics. The classification results provide valuable insights for improving the design of oil spill services by world-leading providers. As a final contribution, we present our operational pipeline and a visualization tool for large-scale data, which allows detecting and analyzing the historical presence of oil spills worldwide.


Introduction
Spaceborne Synthetic Aperture Radar (SAR) instruments have been used for monitoring and early detection of oil spills for several decades and are a well-established tool for many operational monitoring services. This information is paramount for planning oil spill preparedness strategies where location, extent, and early warning are relevant. Oil detections can concern legal, illegal, and accidental discharges from off-shore installations, ships, and pipelines. For example, in the North Sea, oil slicks resulting from the discharge of produced water 1 are frequently detected around oil platforms [3,35,44]. The large amount of archived data of several thousands of detected and some verified oil spills over the years is a direct result of the long-term use of SAR. This has created a huge potential for new technologies and methods to emerge, which can take advantage of this work-effort and archived data. Remarkably relevant is the free database of satellite images obtained from the Sentinel-1 sensors, which have been acquiring images since April 2014 (Sentinel-1A) and April 2016 (Sentinel-1B) [7]. SAR sensors allow monitoring the surface independently of weather and sun conditions. This is especially important in the North Sea and the Barents Sea due to heavy cloud cover and darkness for long periods of the year.
In SAR images, oil slicks appear as dark patches due to their low backscatter response compared to the surrounding clean sea areas. The low backscatter is a result of the oil damping short gravity and capillary ocean surface waves. The dark signature of oil slicks in SAR is also common to many other ocean features, like low-wind areas and natural biogenic slicks, also known as look-alikes. An extensive effort has been made to design methodologies to distinguish oil slicks from natural biogenic slicks and/or low-wind areas (see e.g., [28,39,42,43,45,47]). The backscatter of oil slicks depends on several factors like wind, sensor properties, and oil characteristics. For example, the wind is the main factor generating the ocean surface roughness and, therefore, wind significantly affects the oil-sea contrast. The oil-sea contrast also depends on the incidence angle of the satellite, as it affects the backscatter response.

The main contributions of this work are the following:

• (Detection) we develop a CNN architecture that detects oil spills in SAR scenes with high accuracy.
When trained on a large-scale dataset, our model achieves extremely high performance.
• (Classification) each oil spill detected by our deep learning model is further processed by a second neural network, which classifies it according to shape, contrast, and texture categories.
• (Visualization) we present our production pipeline to perform inference at a large scale and visualize the obtained results. Our visualization tool allows analyzing the presence of oil spills worldwide at specific times in history.
Closely related to our first contribution (detection) is the work presented in [28], which compares the performance of six existing CNN models for semantic segmentation in performing oil spill detection. Among the tested models, DeepLabv3+ [10] achieves the best segmentation performance. To train and evaluate the models, the authors introduce a new segmentation dataset based on the pollution events provided by the European Maritime Safety Agency (EMSA) through the CleanSeaNet service. Compared to ours, their dataset is smaller, as it consists of 1112 images, each one covering an area of approximately 12.5 × 6.5 km². Additionally, their images are associated with segmentation masks with 5 classes (sea, land, oil, ships, and look-alikes), while we use binary masks (oil and non-oil). Finally, differently from [28], we do not adopt off-the-shelf architectures, but rather propose a CNN model and training and evaluation procedures that are optimized for the oil spill segmentation task at hand.
The remainder of this paper is organized as follows. In Section 2, we describe the original data and the preparation of the dataset used for training the proposed deep learning framework. In Section 3, we present the CNN architecture used to perform semantic segmentation of oil spills. Section 4 describes the second CNN model that categorizes the oil spills after they are detected by the segmentation network.
In Section 5 we report the performance of the proposed framework and discuss the results obtained in comparison to the gold standard of manual labelling and in-situ measurements. Section 6 describes the software we developed for large-scale visualization and analysis, based on the proposed framework. Finally, in Section 7 we draw our conclusions.

Dataset description
The dataset has been produced by Kongsberg Satellite Service (KSAT) 2 and consists of Synthetic Aperture Radar (SAR) scenes from Sentinel-1, associated with a binary mask generated by trained human operators at KSAT, which indicates the location and extent of the oil spills. KSAT has a long and consolidated experience in oil detection by SAR, offering worldwide near-real-time services with extensive coverage and fine temporal resolution based on different SAR sensors. The masks associate each pixel with a label that is 0 for the class "non-oil spill" and 1 for the class "oil spill". The whole dataset consists of 713 products of the Sentinel-1 sensor and each product covers an area up to approximately 150,000 km². The products are collected over a period of 4 years between 2014 and 2018, and they contain 2,093 oil spill events. An oil spill event refers to a collection of one or several single oil slicks that are located nearby and originate from the same source according to trained human operators at KSAT. The total number of individual oil spills is 227,964.
The SAR scenes are acquired with the dual-polarimetric SAR mode (IW mode of Sentinel-1) using the vertical transmit and vertical receive (VV) and vertical transmit and horizontal receive (VH) polarization channels with the medium resolution mode (see [7] for additional information about Sentinel-1). The VV polarization is preferred over VH for oil spill detection, due to less impact of system noise in VV compared to the VH channel. For this reason, the VV channel is the one considered in this study.
All the SAR products used for training the deep learning models are smoothed for noise removal and are at 40-meter resolution, i.e., each pixel covers an area of 40 × 40 meters. After the model is trained, to perform detection also on the high-definition (10 m resolution) SAR products that are publicly available from the Copernicus Open Access Hub (https://scihub.copernicus.eu/), we first apply a boxcar filter of size 11 and then down-sample the images to 1/4 of their original size to recover the 40 m resolution.
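This resampling step can be sketched as follows (a minimal SciPy example; the exact boxcar implementation and border handling are not specified in the text, so a mean filter with mirrored borders is assumed):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def to_40m_resolution(vv, box_size=11, factor=4):
    """Smooth a 10 m SAR image with a boxcar (moving-average) filter,
    then keep every `factor`-th pixel to reach the 40 m grid."""
    smoothed = uniform_filter(vv.astype(np.float32), size=box_size, mode="reflect")
    return smoothed[::factor, ::factor]

img = np.random.rand(400, 400).astype(np.float32)
low = to_40m_resolution(img)
print(low.shape)  # (100, 100)
```

Averaging before subsampling acts as an anti-aliasing step, so the 10 m products become statistically comparable to the 40 m products used for training.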
The values in the VV channel are provided in 16-bit unsigned integer format, meaning that the backscatter assumes values in the interval [0, 2¹⁶]. Radiometric calibration to sigma-nought is not performed since it is a linear transformation and we expect the deep learning model to learn to apply such a transformation if needed. Very few outlier pixels (typically backscattering from ships/platforms or pixels located at high incidence angle) in the dataset have high backscatter values up to 2¹⁶, while the majority are concentrated in a much smaller interval. Fig. 1 shows a normalized distribution of the values of the pixels marked as "oil" and "non-oil" computed over all the SAR products in our dataset. Since pixels with high backscatter are very few, for visualization purposes the density plots show a limited range [0, 500] of backscatter values, rather than the whole interval [0, 2¹⁶]. As we can see from the plots, the backscatter distribution is shifted toward smaller values.

Figure 1: Distribution of the backscatter values in the VV images, across pixels belonging to class "oil" and "non-oil". The plots show only the range [0, 500], rather than [0, 2¹⁶].

As expected, the backscatter distribution of oil-covered pixels is shifted towards values that are lower than those of non-oil pixels. Based on this analysis, we clip the maximum value in all VV images in the dataset to 150, which corresponds approximately to the 98th percentile. Notably, all the oil spill pixels have a backscatter value lower than 150.

Besides the binary masks indicating the position and shape of oil spills, each oil spill event in the dataset is categorized according to 12 different fields, which indicate the type of shape and texture of the oil spill. Tab. 1 reports the name of the 12 categories, the set of possible values assumed by each category, and the distribution of the values across the dataset.
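The clipping step can be illustrated as follows (the paper only specifies the clip value of 150; scaling to [0, 1] afterwards is our assumption about a typical network-input normalization):

```python
import numpy as np

def preprocess_vv(vv_uint16, clip_value=150):
    """Clip the 16-bit VV backscatter at `clip_value` (~98th percentile)
    and scale to [0, 1] for the network input."""
    vv = np.minimum(vv_uint16.astype(np.float32), clip_value)
    return vv / clip_value

raw = np.array([[0, 75, 150, 40000]], dtype=np.uint16)
print(preprocess_vv(raw))  # values: 0.0, 0.5, 1.0, 1.0
```

The bright outliers (ships, platforms) saturate at 1.0 instead of compressing the dynamic range of the dark oil signatures.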
It is possible to notice immediately that for several categories, such as the texture attributes, the distribution of the values is very skewed. This likely indicates that discriminating across certain categories is challenging for the human operators who label the SAR scenes. In particular, texture categories are difficult to detect given the inherent noise in the VV channel. An objective in this study is to assess if such categories can be predicted using machine learning; indirectly, this indicates how much information about texture, shape, and contrast can be extracted from the SAR image.

Division in patches
To convert the dataset into a format suitable for training a CNN architecture, we extracted a set of patches of size 160 × 160 pixels from the SAR products, each one covering an area of approximately 41 km². An entire oil spill event can be very large and a patch, in general, does not cover it completely. Recall that an oil spill event can comprise several neighbouring oil slicks originating from the same source. By referring to the example in Fig. 2, 8 patches are necessary to cover the oil event depicted in the figure. Since the labels in Tab. 1 are associated with a whole oil event, which can be composed of multiple oil slicks, two options can be considered to associate labels and patches. The first is to assign the same label to all the patches covering the same oil event. The second is to take the most central of the patches covering the oil event (depicted as the green box in Fig. 2) and assign the label of the whole oil event only to that central patch. We opted for the second option and ended up with a total of 2,093 "centred" patches, each associated with a label describing the values of the 12 attributes in Tab. 1. We refer to this dataset as D 1.

Figure 2: Illustration of patch extraction from the SAR products. The patches centred on the oil spill events (depicted in green) form the dataset D 1 and are associated with a label that indicates the values of the 12 categories for that oil event. The second dataset, D 2, contains i) all the patches of D 1, ii) all the patches with at least 1 oil spill pixel, iii) an equal amount of patches without oil, randomly sampled from other locations in the SAR product. Along with the VV channel, the segmentation masks are always included in both datasets.
We note that D 1 includes only a fraction of the available segmentation masks: by referring to the oil event depicted in Fig. 2, all the data outside the green central patch is not contained in D 1. To exploit also the remaining information (i.e., the segmentation masks associated with the oil spill pixels outside the green box in Fig. 2), we built a second dataset D 2, which includes all the patches of D 1 plus all the patches that contain at least one pixel belonging to the oil class. The additional patches assigned to D 2 are depicted as blue boxes in Fig. 2. We note that all the patches in D 1 (green boxes) are also included in D 2. During the training phase, we want to expose the segmentation model also to patches where no oil is present. Therefore, we also included in D 2 patches without any oil spill pixels, which are randomly sampled from the SAR products. The total number of patches in D 2 is 187,321 and approximately half of the patches do not contain any oil pixel. The total amounts of pixels belonging to class 0 and 1 are 59,689,609 and 1,243,242, respectively. Therefore, pixels of class "oil" constitute 2.04% of the total.
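A sketch of how the patches of D 2 could be assembled from a single scene (the non-overlapping tiling and the helper name are illustrative; the paper does not detail the exact sampling procedure):

```python
import random
import numpy as np

def build_d2_patches(vv, mask, patch=160, seed=0):
    """Tile a scene into patches; keep every patch that contains at least
    one oil pixel, plus an equal number of randomly chosen oil-free
    patches, mirroring the class balancing used to build D2."""
    oil, clean = [], []
    for i in range(0, vv.shape[0] - patch + 1, patch):
        for j in range(0, vv.shape[1] - patch + 1, patch):
            p, m = vv[i:i+patch, j:j+patch], mask[i:i+patch, j:j+patch]
            (oil if m.any() else clean).append((p, m))
    random.Random(seed).shuffle(clean)
    return oil + clean[:len(oil)]
```

Balancing at the patch level keeps roughly half of the training patches oil-free, while the pixel-level imbalance is handled separately by the loss function.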

Division in training, validation, and test sets
The dataset D 1 is used to train the deep learning model that performs classification (see Sect. 4) and to perform hyperparameter selection (see Sect. 5). We split D 1 into training, validation, and test sets with sizes 1,843, 150, and 100, respectively.
The dataset D 2 is used to train the model that performs oil spill detection and is split into a training and a validation set of sizes 149,856 and 37,465, respectively. We made sure that all the patches in the validation and test sets of D 1 are excluded from the training set of D 2.
From the original dataset consisting of 713 SAR products, 3 whole products are kept aside, i.e., they are not used to extract the patches that populate D 1 or D 2. These 3 products form a separate dataset, D test, which is used exclusively to test the performance of the segmentation task. Tab. 2 summarizes the content of the datasets D 1, D 2, and D test. Sect. 5 has a special focus on the performance achieved on the three test scenes. Tab. 3 provides additional details on D test that will be useful for discussing the results.

Oil spill detection
Oil spill detection is conveniently framed as a semantic segmentation task, which consists of performing pixel-level classification [48]. In the following, we first describe the neural network model used to perform segmentation, and then the procedures adopted for training the model and performing inference on new, unseen data.

The deep learning architecture for semantic segmentation
The model used to perform segmentation is a fully convolutional network, referred to in the rest of the paper as OFCN (Oil Fully ConvNet). The OFCN is a network with no dense layers, which can process inputs of variable size. This allows for training on small images and processing larger ones at inference time. The OFCN model is based on the U-net [37], a popular deep learning architecture for image segmentation that is also used in several remote sensing applications [6,17,30].

Figure 3: Schematic depiction of the OFCN architecture used for segmentation. Conv(n) stands for a convolutional layer with n neurons. For example, n = 32 in the first Encoder Block, 64 in the second, and so on.
The OFCN consists of an encoder and a decoder part, respectively depicted in green and purple in Fig. 3. The encoder gradually extracts feature maps that detect the patterns of interest in the image. By reducing the spatial dimensions and increasing the number of filters, the deeper layers in the encoder capture features of increasing complexity and larger spatial extent in the input image. The decoder gradually transforms the high-level features and, in the end, maps them into the output. The output is a binary segmentation mask, which has the same height/width of the input image and associates to each pixel a class value: 1 if it belongs to the oil class, 0 otherwise. The skip connections link the feature maps from the encoding to the decoding layers, such that some information can bypass the bottleneck located at the bottom of the architecture. In this way, our architecture still learns to generalize from the highlevel latent representation but also recovers spatial information from the intermediate representations through a pixel-wise semantic alignment.
Each block in the encoder consists of 9 layers, as depicted in the green box of Fig. 3 (bottom-left). Conv(n) indicates a convolutional layer with n filters (e.g., n = 32 in the 1st block, 64, 128, 256, and 512 in the 2nd, 3rd, 4th and 5th blocks, respectively), filter size 3 × 3, stride 1, and padding modality same [14]. Each Conv layer is followed by a Batch Normalization (BN) layer [23] and a ReLU activation function. At the end of each block, there is a max-pooling with stride 2, a Squeeze-and-Excitation (SE) [22], and a Dropout layer [46]. Each unit in the deeper layer of the encoder has a receptive field of 140, meaning that each feature depends on a neighbourhood with a radius of 140 pixels in the input image. A visualization of the growth of the receptive field in the encoder layers is reported in Appendix A.
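An encoder block of this kind can be sketched in Keras as follows (a minimal reconstruction from the description above; the SE reduction ratio and the dropout rate are assumptions, and the skip connection is taken before pooling as in the standard U-net):

```python
import tensorflow as tf
from tensorflow.keras import layers

def se_module(x, ratio=16):
    """Squeeze-and-Excitation: reweight channels with a learned gate."""
    c = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)
    s = layers.Dense(c // ratio, activation="relu")(s)
    s = layers.Dense(c, activation="sigmoid")(s)
    return layers.Multiply()([x, layers.Reshape((1, 1, c))(s)])

def encoder_block(x, n, dropout=0.1):
    """Conv-BN-ReLU twice, then max-pool, SE, and dropout (9 layers)."""
    for _ in range(2):
        x = layers.Conv2D(n, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    skip = x                        # feature map passed to the decoder
    x = layers.MaxPooling2D(2)(x)   # stride-2 down-sampling
    x = se_module(x)
    x = layers.Dropout(dropout)(x)
    return x, skip

inp = tf.keras.Input(shape=(None, None, 1))  # variable-size input
out, skip = encoder_block(inp, 32)
model = tf.keras.Model(inp, [out, skip])
```

Because no layer depends on the spatial extent of the input, the same weights apply to 160 × 160 training patches and to much larger windows at inference time.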
Compared to the encoder, the decoder has a mirrored structure and the details of each block are shown in Fig. 3 (bottom-right). The main differences are the bilinear upsampling layers, which upscale the feature maps of the previous layer, and the concatenation with the skip connections, which injects into the decoder the output of the encoder blocks. The last decoder block replaces the second ReLU activation with a sigmoid that produces output in the interval [0, 1]. This is a common choice in binary classification tasks, such as the generation of the binary oil spill mask.
Let n be the number of convolutional filters in the first layer; the number of filters in the rest of the OFCN architecture is then uniquely determined as n-(2n)-(4n)-(8n)-(16n)-(8n)-(4n)-(2n)-n. For conciseness, we use the notation OFCN(n) in the rest of the paper; for example, the architecture in Fig. 3 is OFCN(32).
In the Appendix, we describe Bilinear upsampling, Batch Normalization, Squeeze-and-Excitation, and Dropout modules.

Training and inference
The OFCN is trained to predict the binary segmentation mask, indicating the presence of oil spills. The network weights are optimized by iteratively minimizing a loss function evaluated on mini-batches of input-output pairs. The choice of the loss function and the other training procedures are reported in the following.

Loss functions for unbalanced dataset
Oil spills are small objects and, as discussed in Sect. 2, they represent only a tiny fraction (≈ 2%) of the entire dataset. Due to the strong imbalance between the pixels of class 0 ("non-oil spill") and class 1 ("oil spill"), a naive classifier can achieve an accuracy of ≈ 98% simply by assigning all the pixels to class 0. To handle the class imbalance, rather than training the OFCN with the standard binary cross-entropy loss, we use a binary cross-entropy with class balancing. Specifically, the standard cross-entropy is re-weighted to assign a larger penalty when a pixel of the under-represented class is wrongly classified. This is done by weighting the loss associated with each pixel of the oil class with a value larger than that used for the non-oil class. We also tested three additional loss functions that account for class imbalance but obtained unsatisfactory results. The details are in Appendix C.
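The class-balanced binary cross-entropy can be written as follows (a NumPy sketch; the actual class weight used in the paper is not reported, so `pos_weight=10` is purely illustrative):

```python
import numpy as np

def weighted_bce(y_true, y_pred, pos_weight=10.0, eps=1e-7):
    """Binary cross-entropy where errors on the rare 'oil' class (y=1)
    are penalised `pos_weight` times more than errors on 'non-oil'."""
    y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
    loss = -(pos_weight * y_true * np.log(y_pred)
             + (1 - y_true) * np.log(1 - y_pred))
    return loss.mean()
```

With the same prediction error, missing an oil pixel (`weighted_bce([1.], [0.3])`) costs ten times more than a false alarm of equal confidence (`weighted_bce([0.], [0.7])`), which counteracts the trivial all-zeros solution.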

Data augmentation
To prevent the model from overfitting the training data and to enhance its generalization capability on unseen examples, we augment the dataset during training. Randomized data augmentation can improve the generalization performance in several computer vision tasks, including applications in remote sensing [13]. In particular, we apply the following random transformations on the fly to the training images: horizontal and vertical flips, horizontal and vertical shifts, rotations, zooming, and shearing. To ensure consistency between the inputs and the target segmentation masks used for training, the same transformations applied to the input are also applied to the oil masks.
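The key point is that input and mask must undergo identical transformations. A minimal sketch covering only flips and 90° rotations (the paper additionally uses shifts, zoom, and shearing):

```python
import numpy as np

def augment_pair(image, mask, rng):
    """Apply the same random flips and 90-degree rotations to a SAR
    patch and its oil mask, keeping the two geometrically aligned."""
    k = int(rng.integers(0, 4))     # number of 90-degree rotations
    image, mask = np.rot90(image, k), np.rot90(mask, k)
    if rng.random() < 0.5:          # horizontal flip
        image, mask = np.fliplr(image), np.fliplr(mask)
    if rng.random() < 0.5:          # vertical flip
        image, mask = np.flipud(image), np.flipud(mask)
    return image, mask
```

Drawing all random parameters once and reusing them for both arrays guarantees that each oil pixel still lies under its own label after augmentation.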

Two-stage training
We first trained the OFCN on a low-resolution version of the dataset, obtained by halving the patch size (80 × 80 pixels). After completing the training on the low-resolution dataset, we resumed training on the patches of original size without resetting the network weights. This is possible because, as discussed in Sect. 3.1, the OFCN can consume images of variable size, since its weights are independent of the input shape. The intuition behind training first on downsized images is to let the model learn the coarser structures in the inputs first, and then refine the parameter tuning as the incoming images grow larger and more detailed.

Test time augmentation.
When computing the prediction of a SAR scene at inference time, we slide the OFCN on the large image, computing predictions for one window at a time. Again, we stress that the window size at test time can be larger than 160 × 160 pixels. However, this approach usually generates checkerboard artefacts and border effects close to the window edges. To obtain smoother and more accurate predictions, overlapping sliding windows with a stride equal to half the window size are processed by the OFCN. Furthermore, 8 predictions from all the possible 90° rotations and flips of each window are generated. To obtain the final result, we used a 2nd-order spline interpolation to merge all the computed predictions.
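The rotation/flip part of this test-time augmentation can be sketched as follows (the overlapping windows and the spline-based merging are omitted; `predict` stands for the trained OFCN applied to a single window):

```python
import numpy as np

def dihedral_tta(window, predict):
    """Average predictions over the 8 flips/90-degree rotations of a
    window. `predict` maps an image to a same-sized probability map."""
    preds = []
    for k in range(4):
        for flip in (False, True):
            t = np.rot90(window, k)
            t = np.fliplr(t) if flip else t
            p = predict(t)
            p = np.fliplr(p) if flip else p   # undo the flip
            preds.append(np.rot90(p, -k))     # undo the rotation
    return np.mean(preds, axis=0)
```

Each prediction is mapped back to the original orientation before averaging, so the ensemble cancels orientation-dependent artefacts without blurring the mask.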

Oil spill classification
With oil spill classification we refer to the task of predicting the 12 categories, described in Tab. 1, which indicate the texture, shape, and contrast of the oil spill. Such categories are useful to end-users and analysts in deriving information about the source, stage of weathering, and internal variations within the oil slicks. To classify each of the 12 categories, we train a separate instance of the architecture depicted in Fig. 4. The model takes as input a SAR patch and the associated mask predicted by the OFCN model described in Sect. 3.

Figure 4: First, the trained OFCN generates the segmentation masks for the SAR images. Then, both the SAR images and the predicted masks are fed into the classification network. A different architecture is trained to classify each one of the categories (e.g., the depicted one classifies the "texture" category).

The (predicted) segmentation mask encourages the classification network to focus on the areas of the patch with oil spills and makes it easier to extract shape information. Fig. 4 shows the whole pipeline: input SAR image → detection → classification. The pipeline is not trained end-to-end: we first train the OFCN and, afterwards, the classification network. The first, obvious reason for not training the whole pipeline end-to-end is that the category labels are available only for D 1 and not for D 2, the large dataset used to train the OFCN. Moreover, compared to the segmentation masks, the 12 category labels are noisier as they are more subject to human interpretation. Therefore, to achieve the best possible segmentation performance, the OFCN is trained independently, without being conditioned by the classification loss.
The classification network in Fig. 4 consists of 5 convolutional blocks responsible for feature extraction. Each block is structured as [conv(n)-BN-ReLU-Maxpool-SE], where the number of convolutional filters n is 32, 64, 128, 256, and 512 in the blocks B1-B5, respectively. The classification head has the following architecture: Dense(256)-BN-ReLU-Dropout-Dense(#cat)-Softmax. The "Dense" layers are two fully-connected layers whose numbers of units are 256 and the number of values assumed by each category (#cat), respectively. The network is trained by minimizing a categorical cross-entropy loss and by using the same image augmentation procedure used for training the OFCN architecture.
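The classification network can be reconstructed in Keras roughly as follows (the SE modules after each block are omitted for brevity, and global average pooling before the dense head is an assumption, since the text does not specify how the feature maps are flattened):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_classifier(n_values, patch=160):
    """One classifier per category: the SAR patch and the OFCN-predicted
    mask are stacked as a 2-channel input."""
    inp = tf.keras.Input(shape=(patch, patch, 2))
    x = inp
    for n in (32, 64, 128, 256, 512):        # blocks B1-B5
        x = layers.Conv2D(n, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.GlobalAveragePooling2D()(x)   # assumed flattening step
    x = layers.Dense(256)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Dropout(0.5)(x)
    out = layers.Dense(n_values, activation="softmax")(x)
    return tf.keras.Model(inp, out)
```

One such model is trained per category, with `n_values` set to the number of values that category can assume in Tab. 1.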

Results and discussion
In this section, we describe the experimental setting and report the results obtained for the detection and classification task, respectively.

Evaluation metrics.
While the parameters of the OFCN are optimized by minimizing the loss described in Sect. 3, we require more interpretable metrics to quantify the results obtained on the test set and to monitor the model performance on the validation set during training. We consider metrics other than accuracy, since the oil class is highly under-represented in the segmentation task and some categories are imbalanced in the categorization task.
The first metric is the F1 score, which is computed at the pixel level and is defined as F1 = 2 · (precision · recall) / (precision + recall), where precision = TP/(TP + FP) and recall = TP/(TP + FN) (TP = True Positives, FP = False Positives, FN = False Negatives). To compute the F1 score in the segmentation task, the output o of the sigmoid in the last layer of the OFCN is rounded as follows: o is mapped to 1 if o ≥ τ and to 0 otherwise. When not specified otherwise, we use τ = 0.5.
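The pixel-level F1 computation with threshold τ can be written as:

```python
import numpy as np

def pixel_f1(y_true, prob, tau=0.5):
    """Pixel-level F1 score of a thresholded probability map."""
    y_pred = (prob >= tau).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

Unlike accuracy, this score ignores the overwhelming number of correctly predicted "non-oil" pixels, so a model predicting all zeros scores 0 rather than ≈ 0.98.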
In each experiment presented in the following, the F1 score on the validation set is used to evaluate, during training, the performance of the current model on unseen data. Specifically, whenever the model improves its F1 score on the validation set, the current instance of the weights is saved as the best model.
We also consider a second metric that indicates if the OFCN managed to correctly locate the oil spill, without accounting for small differences in the shape contours of the human-made and predicted segmentation masks. For this purpose, we consider the bounding boxes that contain oil spills in both the human-made and the predicted mask. TPs are now measured as the number of bounding boxes in the human-made mask that have a non-zero intersection with a bounding box in the predicted mask. Similarly, we compute the FPs and FNs. To quantify how much the bounding boxes in the ground truth and the prediction overlap, we compute the intersection over union (IoU): IoU = (area of the bounding boxes' intersection) / (area of the bounding boxes' union).
Contrary to the F1 score, which is evaluated during training to save the best model, the IoU is computed only once the training is over, to evaluate the final performance.
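The bounding-box IoU used above can be computed as follows (boxes are assumed to be axis-aligned and given in pixel coordinates):

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (row0, col0, row1, col1)."""
    r0, c0 = max(a[0], b[0]), max(a[1], b[1])   # intersection corner
    r1, c1 = min(a[2], b[2]), min(a[3], b[3])   # opposite corner
    inter = max(0, r1 - r0) * max(0, c1 - c0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0
```

A non-zero IoU marks a detection as a TP in the box-level evaluation, while its magnitude quantifies how tightly the predicted box matches the operator's box.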
To make the hyperparameter search tractable, we used a smaller architecture, OFCN(16), and trained it on the training and validation sets of D 1, which is much smaller than D 2, for 100 epochs only. We used mini-batches of size 32 and image augmentation with the following parameters: max rotation 90°, max width shift 0.1 of total width, max height shift 0.1 of total height, max shearing 0.3, max zoom 0.2, probability of horizontal and vertical flips 0.5, pad mode "mirror". Tab. 4 reports the F1 score obtained by the 3 best configurations.

Comparison with baselines
We compare the performance of the proposed OFCN with the vanilla U-net architecture [37] and with DeeplabV3+ [10], which is considered, at the time of writing, the state-of-the-art for image segmentation in computer vision. Notably, DeeplabV3+ is the segmentation architecture that achieved the best performance in related work on oil spill segmentation in Ref. [28].
To perform the comparison, we used the Keras implementations of U-net and DeeplabV3+ available at two popular public repositories 4 . For this experiment, we used the larger OFCN(32) architecture with configuration C1 described in Tab. 4. The DeeplabV3+ is configured with the Xception backbone [11]. All the settings are the same as in the previous experiment, with the exception that the models are trained for 400 epochs and the batch size is 16. Tab. 5 reports the number of trainable parameters in each architecture, the time (in hours) necessary to complete 400 epochs of training, the final training accuracy and training loss, and the best F1 score obtained on the validation set. First, we notice that DeeplabV3+ has many more trainable parameters than OFCN and U-net, which makes its training almost 50% slower than for the other two architectures. On the other hand, the training times of OFCN and U-net are comparable. DeeplabV3+ outperformed U-net but, despite its larger capacity, did not achieve a better performance than the proposed OFCN architecture.
Importantly, we report that in some runs the U-net did not manage to learn anything: the loss was not decreasing and the predicted output was a mask of all zeros for each image in the training set. This indicates a strong sensitivity to initialization and a lack of robustness in the U-net model. Finally, we also experimented with DeeplabV3+ configured with the Mobilenet [21] backbone but we obtained unsatisfactory performance.
The training graphs showing the evolution of the loss on the training set and the F1 score on the validation set are in Appendix B and they show that none of the models overfits the training data.

Training on the large dataset
First, we trained three OFCN(32) models configured with the best hyperparameter settings (C1, C2, C3 reported in Tab. 4) for 400 epochs. Training each model on D 2 takes up to one week on an Nvidia RTX 2080. The results reported in Tab. 6 show that, even in this case, OFCN configured with C1 obtains the best performance: highest accuracy and lowest loss on the training set, highest F1 score on the validation set.

Table 6: Validation performance, obtained on D 2 using the three best configurations C1, C2, and C3, the two-step training (2ST) strategy with configuration C1, and a long training of 3000 epochs (Long). We also report the training time, the accuracy, and loss achieved on the training set.

We also applied the 2-stage training strategy, discussed in Sect. 3.2, using the configuration C1 (C1-2ST in Tab. 6). The procedure takes approximately 50% extra time, since in the first stage the images are down-sampled by a factor of 2, but it yields some performance improvement. Note that we do not plot the training accuracy, which is always above 98% from the first epochs, and we did not compute the F1 score on the training set. The latter requires computing predictions for the whole training set at each epoch and, given the size of the dataset D 2, it would significantly prolong the training time, which is already in the order of days.
From the plots in Fig. 5, we notice that the training procedure is much more stable when the OFCN is equipped with the SE module and that a higher F1 score is achieved when using BN. Most importantly, none of the models is overfitting on the training set and the F1 score is still improving after 400 epochs. This suggests that the training has not converged yet and that better performance can be achieved by training the OFCN model for more epochs.
In our last experiment, we trained OFCN(32) configured with C1 for 500 epochs in phase one (downscaled images) and then 3,000 epochs on full-resolution images. Training this model took almost two months and we obtained a significant improvement, reaching an exceptional 0.892 F1 score on the validation set. Fig. 6 reports the training statistics for the second training phase. After 3,000 epochs the F1 score has finally stabilized, but we also notice that the network is still not overfitting the training set. Fig. 7 reports some examples of segmentation masks predicted by the OFCN on the validation set of D 2. From left to right, we depict the VV input channel, the ground-truth mask made by a human operator, and the OFCN prediction thresholded at 0.5 (values < 0.5 → 0, values ≥ 0.5 → 1). To facilitate the interpretation of the results, we generated a bounding box around the oil spills in both the mask generated by the operator and the mask predicted by OFCN; overlapping bounding boxes count as TPs. FPs are more common, as they can arise from small details that might be overlooked by the human operator and often appear on the edge of the oil spill outline. On the other hand, FNs are very rare, meaning that our model misses very few of the human-detected oil spills. Having a low number of FNs is particularly important because FPs can always be discarded during a post-analysis, whereas a missed detection cannot be recovered.
From now on, by OFCN we will refer to the OFCN(32) model trained with configuration C1-2ST-Long (see Tab. 6). Here, we investigate how much the incidence angle of the satellite affects oil spill detection. All the patches in our dataset are characterized by an incidence angle between 30 and 45 degrees, which is the range in which the ScanSAR mode of Sentinel-1 operates [7]. Our initial hypothesis was that oil spills are detected more easily at medium inclinations, since high and low incidence angles yield low oil-sea contrast [1,15]. However, the results disproved our hypothesis. In Fig. 8, the red line shows the mean detection F1 score obtained for oil spills at a given incidence angle; the red area shows the standard deviation; the blue bars indicate the number of oil spills for each incidence angle. To assess whether there are statistical differences between the F1 scores obtained at different incidence angles, we perform the Kruskal-Wallis H-test, a non-parametric version of ANOVA that makes only a few assumptions about the characteristics of the population from which the data originate [29]. With the Kruskal-Wallis test, we test the null hypothesis that the population medians of all the groups are equal, and we obtain H-statistic = 54.14 and p-value < 10^-4. Since the p-value is much lower than 0.05, the null hypothesis is rejected, meaning that there is a statistically significant difference between the F1 scores obtained at different incidence angles.
However, we recall that the available incidence angles are in the range 30 to 45 degrees, where oil spill detection is usually preferred and Bragg scattering dominates [1,20]. Outside this range, specular reflection from both oil slicks and clean sea occurs, resulting in lower oil-sea contrast [1,33]. It is also interesting to note that there are fewer oil samples at the near and far incidence angles (30 and 45 degrees), where oil slick detection is challenging.
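A test like the one described above can be run with `scipy.stats.kruskal`. The sketch below uses synthetic per-slick F1 scores grouped into hypothetical incidence-angle bins (the data and the number of bins are our assumptions, not the study's actual groups):

```python
# Kruskal-Wallis H-test on per-oil-spill F1 scores grouped by
# incidence-angle bin. The scores below are synthetic stand-ins.
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)
# One array of F1 scores per incidence-angle bin (hypothetical data)
f1_by_angle = [np.clip(rng.normal(0.85, 0.05, 40), 0.0, 1.0) for _ in range(5)]

h_stat, p_value = kruskal(*f1_by_angle)
print(f"H = {h_stat:.2f}, p = {p_value:.4f}")
# Null hypothesis (equal group medians) is rejected when p < 0.05
if p_value < 0.05:
    print("Significant difference between incidence-angle groups")
else:
    print("No significant difference between incidence-angle groups")
```

Since the groups here are drawn from the same distribution, the test will typically not reject the null hypothesis, unlike on the real detection scores.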

Visualization of the learned filters
To provide an interpretation of what the OFCN is learning, we synthetically generated images with the patterns that maximally activate the convolutional filters in the OFCN. To generate the images, we rely on the Activation Maximization technique [34]: we start from an input image containing random noise and gradually update it by ascending the gradient ∂L_AM/∂input, where L_AM denotes the activation maximization loss.

Figure 9: Filter visualization. We synthetically generated the input images that maximally activate 64 of the 512 filters in the first convolutional layer of the "Enc Block 512" at the bottom of the OFCN.

Fig. 9 depicts the patterns that maximally activate the first 64 of the 512 filters in the first convolutional layer of the "Enc Block 512" (the one at the bottom of the OFCN in Fig. 3). Interestingly, it is possible to notice several wave patterns, meaning that the OFCN looks for waves when trying to detect oil spills. This is reasonable, since ocean surface waves are present in moderate wind conditions, which are the most favourable for detecting oil spills [2,41]. In fact, oil spills cannot be detected if the sea surface is too flat due to low wind or too irregular due to high wind. For low winds, the backscatter response is similar between the calm ocean surface and oil slicks [18], while for strong winds the oil can break up and/or sink due to turbulence in the upper surface layer. The suggested wind speed range for optimal oil-sea contrast is from 2-3 m/s to 10-14 m/s [2,41].

Tab. 7 reports the performance obtained on the three test products of Dtest, whose details are given in Tab. 3. Compared to the validation set, the proportion of oil pixels is much lower in the three SAR products. Also, given the large size of the images, which extend up to 860 km, there is a higher chance of detecting FPs. For these reasons, the F1 scores in Tab. 7 are lower than the score obtained on the validation set (0.892).
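The Activation Maximization procedure described earlier in this section can be sketched without the actual OFCN: the snippet below gradient-ascends the mean ReLU response of a single random convolutional filter with respect to the input image. The filter, image size, and learning rate are stand-ins for illustration only:

```python
# Activation Maximization sketch: starting from random noise, ascend the
# gradient of a filter's mean ReLU activation with respect to the input.
import numpy as np
from scipy.signal import correlate2d, convolve2d

rng = np.random.default_rng(42)
kernel = rng.normal(size=(5, 5))                # stand-in for a learned filter
image = rng.normal(scale=0.1, size=(32, 32))    # start from random noise

def mean_activation(img):
    return np.mean(np.maximum(correlate2d(img, kernel, mode="valid"), 0.0))

initial = mean_activation(image)
lr = 0.5
for _ in range(50):
    pre = correlate2d(image, kernel, mode="valid")   # filter response
    mask = (pre > 0).astype(float) / pre.size        # d(mean ReLU)/d(response)
    grad = convolve2d(mask, kernel, mode="full")     # backprop through correlation
    image += lr * grad                               # gradient-ascent step

final = mean_activation(image)
print(f"mean activation: {initial:.4f} -> {final:.4f}")
```

In the real setting, the gradient is computed by the deep learning framework's autodiff through all layers, and the optimized input is regularized (e.g., clipped or blurred) to obtain interpretable patterns.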
We note, however, that a large portion of the FPs is due to small bounding boxes not marked by the human operators. As discussed at the beginning of this section, the OFCN returns a soft output in [0,1], which must be rounded to obtain a binary segmentation mask. By varying the rounding threshold τ it is possible to significantly vary the number of TPs, FPs, and FNs, as well as the two performance metrics, F1 score and IoU. In particular, with a lower τ more FPs appear, while a higher τ implies more FNs. From Fig. 10 we observe that, in the three test products, using a higher τ decreases the number of FPs significantly, while the FNs only increase in the third image (Fig. 10(c)). This suggests that a high τ value improves the detection precision. On the other hand, the IoU and F1 score become much worse for high τ values, indicating that more precise contours can be found using a lower threshold. We will exploit the behaviours observed for different τ when implementing our visualization tool, presented in Sec. 6. Fig. 11 depicts examples of segmentation results on each of the SAR products in Dtest. Note the presence of a large FP in the second product (Fig. 11(f)).
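The trade-off driven by τ can be illustrated at the pixel level. The sketch below thresholds a synthetic soft prediction against a synthetic ground-truth mask (both invented for illustration) and reports FP, FN, IoU, and F1 for several values of τ:

```python
# Effect of the rounding threshold tau on pixel-wise FP/FN, IoU, and F1.
import numpy as np

rng = np.random.default_rng(0)
truth = np.zeros((64, 64), dtype=bool)
truth[20:40, 20:40] = True                           # a square "oil spill"
soft = np.where(truth,
                rng.uniform(0.5, 1.0, truth.shape),  # confident inside the slick
                rng.uniform(0.0, 0.6, truth.shape))  # noisy background scores

def pixel_scores(soft, truth, tau):
    pred = soft >= tau
    tp = int(np.sum(pred & truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return fp, fn, iou, f1

results = {tau: pixel_scores(soft, truth, tau) for tau in (0.3, 0.5, 0.8)}
for tau, (fp, fn, iou, f1) in results.items():
    print(f"tau={tau}: FP={fp} FN={fn} IoU={iou:.3f} F1={f1:.3f}")
```

As in Fig. 10, raising τ removes false positives at the cost of more false negatives, and the best IoU/F1 is found at an intermediate threshold.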

Oil spill classification: experimental setting and results
To train the classification network, we first generate the predicted masks for both the training and validation set of D1 using the trained OFCN(32). We decided to use the soft predictions, i.e., we do not threshold the output of the OFCN, since thresholding introduces an unnecessary bias and can conceal potential information of interest. As previously discussed, the OFCN outputs soft values in [0,1] that can be interpreted as the degree of certainty that a pixel belongs to the oil class. Such a classification probability is lower in areas close to the edges of the oil spill, or where the oil starts to dissolve.
We trained 12 different instances of the classification models described in Sec. 4, one for each category. Each model is trained for 1,000 epochs using the Adam optimizer with initial learning rate 10^-4, batch size 32, L2 norm regularization weight 10^-6, and dropout 0.1. Image augmentation is used with the following parameters: max rotation 90°, max width shift 0.1 of total width, max height shift 0.1 of total height, max shearing 0.3, max zoom 0.2, probability of horizontal and vertical flips 0.5, pad mode "mirror".
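An augmentation with the listed parameters can be sketched with `scipy.ndimage`, as below. The implementation details (interpolation order, order of operations) are our assumptions, and shearing is omitted for brevity:

```python
# Random augmentation sketch: rotation (max 90 deg), width/height shifts
# (max 0.1), zoom (max 0.2), flips with probability 0.5, mirror padding.
import numpy as np
from scipy.ndimage import rotate, shift, zoom

def augment(img, rng):
    h, w = img.shape
    out = rotate(img, rng.uniform(-90, 90), reshape=False, mode="mirror")
    out = shift(out,
                (rng.uniform(-0.1, 0.1) * h, rng.uniform(-0.1, 0.1) * w),
                mode="mirror")
    out = zoom(out, 1.0 + rng.uniform(-0.2, 0.2), mode="mirror")
    # mirror-pad and crop back to the original patch size after zooming
    out = np.pad(out, ((0, max(0, h - out.shape[0])),
                       (0, max(0, w - out.shape[1]))), mode="symmetric")
    out = out[:h, :w]
    if rng.random() < 0.5:
        out = np.flip(out, axis=0)   # vertical flip
    if rng.random() < 0.5:
        out = np.flip(out, axis=1)   # horizontal flip
    return out

rng = np.random.default_rng(0)
patch = rng.normal(size=(160, 160))
aug = augment(patch, rng)
```

In practice, such transformations are typically applied on the fly by the training framework's image-augmentation utilities rather than hand-rolled.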
The accuracies obtained for each category are reported in Tab. 8. Since the values in most categories are unbalanced, we also report the F1 score as a performance measure.

Compared to traditional image processing tools, CNNs usually achieve very high performance in recognizing textures and exploit this capability to achieve high accuracy in downstream classification tasks. However, in our case, the performance obtained for some texture categories is particularly low. We argue that one of the reasons is the presence of noise in the labelling process, since it is difficult to precisely determine texture and contrast features from a SAR image in a consistent manner. We also observe a low accuracy and F1 score in some other categories, such as "Contrast" and "Edge". Compared to the masks in the segmentation task, the classification labels are less reliable, since the labelling procedure is more subjective and there is room for human errors. Most importantly, the trained operators use complementary information, such as sea state, wind, and historical seep sites, both to define the categories and to label a slick as an oil spill. The operators also account for nearby potential polluters (ships/platforms), by combining the automatic identification system (AIS) and sea maps. Since all this information is not contained in the SAR products, a classification model based only on the image content can struggle to determine the right category, and also to detect oil slicks that are not necessarily human-made.

Large-scale visualization
We developed a pipeline to automatically acquire all the SAR products available in a given area and within a specified time frame, and then process them with our deep learning framework. Our pipeline performs the following steps:

1. as input, we only specify the coordinates of an area and the time interval;
2. all the SAR products within the time frame that overlap at least 20% with the specified area are fetched from the Alaska Satellite Facility (ASF) repository;
3. since the SAR images come from Sentinel-1 (GRDH, 10 m resolution), they are first smoothed and then down-sampled by a factor of 4 to match the mode of our training data;
4. all the SAR products are processed with the OFCN described in Sec. 3; the procedure consists of two steps, filtering and coloring, discussed below;
5. each oil spill is associated with a vector of features, including the size of the slick and the distance from the closest detected oil spill;
6. very small and isolated slicks are removed, i.e., slicks whose surface is smaller than 0.25 km² and that are further than 1.5 km from any other oil spill.
Based on the discussion related to Fig. 10, it is possible to obtain high detection precision by thresholding the OFCN output with a high τ. On the other hand, with a smaller τ the oil spill contours are more accurate. For this reason, we first perform a filtering step with τ = 0.8 to keep only the larger slicks and discard many FPs. In the coloring step, we compute with τ = 0.5 the outline of the slicks that remain after filtering. The filtering-coloring procedure is very fast, since the soft output of the OFCN does not need to be recomputed.
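The two-threshold filtering-coloring step can be sketched as follows: connected components of the loose (τ = 0.5) mask are kept only if they contain at least one confident pixel (τ = 0.8). The soft map below is synthetic, for illustration:

```python
# Two-threshold "filtering and coloring" on a soft segmentation output.
import numpy as np
from scipy.ndimage import label

def filter_and_color(soft, tau_filter=0.8, tau_color=0.5):
    loose = soft >= tau_color                   # coloring: accurate contours
    labels, n = label(loose)                    # connected components
    keep = np.zeros_like(loose)
    for comp in range(1, n + 1):
        region = labels == comp
        if (soft[region] >= tau_filter).any():  # confident pixel inside?
            keep |= region                      # keep the whole component
    return keep

rng = np.random.default_rng(1)
soft = rng.uniform(0.0, 0.4, (64, 64))          # low-confidence background
soft[10:20, 10:20] = 0.9                        # a confident slick -> kept
soft[40:45, 40:45] = 0.55                       # a weak blob -> discarded
mask = filter_and_color(soft)
```

This realizes the behaviour described above: high-τ filtering provides precision, while the surviving components keep their low-τ outlines.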
The results obtained are encoded into GeoJSON files, which are visualized with NLive 5, our geographic visualization tool. In NLive it is possible to select the individual oil slicks and visualize a small chunk of the SAR image where the oil spill has been detected, plus additional information about the shape and the distance from the closest neighbour. The oil spills detected in an area of approximately 100,000 km² in the Southern Hemisphere between October 2014 and March 2020 are shown in Fig. 12. In the example, a total of 501 SAR products were retrieved, and in 136 of them at least one oil spill was detected; a total of 665 oil spills were found.
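A detected slick can be encoded as a GeoJSON Feature with its outline and attributes. The property names below (`area_km2`, `nn_dist_km`, `product_id`) are hypothetical stand-ins for illustration, not the actual NLive schema:

```python
# Encoding a detected slick as a GeoJSON Feature (RFC 7946 structure).
import json

def slick_to_feature(outline_lonlat, area_km2, nn_dist_km, product_id):
    return {
        "type": "Feature",
        "geometry": {"type": "Polygon", "coordinates": [outline_lonlat]},
        "properties": {"area_km2": area_km2,       # slick surface
                       "nn_dist_km": nn_dist_km,   # distance to nearest slick
                       "product_id": product_id},  # source SAR product
    }

# Hypothetical slick outline (closed ring of lon/lat pairs)
feature = slick_to_feature(
    [[5.01, 60.10], [5.03, 60.10], [5.03, 60.12], [5.01, 60.12], [5.01, 60.10]],
    area_km2=1.8, nn_dist_km=4.2, product_id="S1A_EXAMPLE")
collection = {"type": "FeatureCollection", "features": [feature]}
geojson_str = json.dumps(collection)
```

Any GeoJSON-aware map client can then render such a FeatureCollection and expose the per-slick properties on selection.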

Conclusions
In this paper, we proposed a deep learning framework to perform detection and categorization of oil spills on a large-scale dataset. We formulated oil spill detection as an image segmentation task, where each pixel in an input SAR image is assigned to the class "oil" or "non-oil". We designed a fully convolutional neural network for semantic segmentation, which we trained on pairs consisting of a small patch of a large SAR product and an associated binary mask, drawn by a human operator, that defines the class of each pixel. Through an extensive experimental evaluation, we demonstrated the capability of the proposed architecture to achieve high detection performance, obtaining results comparable to human operators.
Once an oil spill is detected, we used a second neural network to classify it according to 12 different categories describing shape and texture features. Differently from the detection task, the classification is not done at the pixel level but is relative to the whole patch. Ours is the first exploratory work on categorizing oil spills in SAR images; the categorization results are useful to end-users and analysts to derive information about the source, the stage of weathering, and internal variations within the oil slicks that could be related to oil concentration or thickness.
Although neural networks are particularly capable of detecting textures, we obtained a low classification accuracy for some categories. We believe that part of the reason is the noise and inconsistency in the human-made labels. Indeed, it is extremely difficult to precisely determine texture and contrast features from a SAR image in a consistent manner. Remarkably, our findings on the automatic categorization performance provided valuable insights for improving the design of future oil spill services by operators such as KSAT.
Finally, we presented a production pipeline to detect and visualize the presence of oil spills worldwide at given times in history. Our pipeline fetches SAR products from the ASF repository of Sentinel-1 images and performs automatic detection and categorization. The results are visualized in an interactive geographical map, where each oil spill can be individually selected to be further analyzed. To the best of our knowledge, this is the first tool based on deep learning that allows analyzing oil activity on such a large scale.
A Further details of the segmentation model

A.1 Bilinear upsampling layer

A standard 2D upsampling procedure enlarges the image simply by inserting new rows and columns between the existing ones and filling them by replicating the existing pixel values. Instead, to obtain a more accurate generation of the output map, we perform bilinear upsampling in the decoder layers of the OFCN. Bilinear upsampling computes the new pixel values by performing a linear interpolation between the adjacent existing pixels. It has been shown that bilinear upsampling yields a more accurate reconstruction, and an architecture equipped with it obtains a better segmentation accuracy [31].
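The difference between the two upsampling schemes can be seen with `scipy.ndimage.zoom`, where spline order 0 replicates pixels (standard upsampling) and order 1 interpolates linearly (bilinear), on a toy 2×2 feature map:

```python
# Nearest-neighbour replication vs. bilinear interpolation for 2x upsampling.
import numpy as np
from scipy.ndimage import zoom

fmap = np.array([[0.0, 1.0],
                 [2.0, 3.0]])

nearest = zoom(fmap, 2, order=0)    # replicate existing pixel values
bilinear = zoom(fmap, 2, order=1)   # linearly interpolate between pixels

print(nearest)
print(bilinear)
```

The order-0 output contains only the four original values in larger blocks, while the order-1 output contains smooth intermediate values, which is why bilinear upsampling produces less blocky segmentation maps.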

A.2 Batch normalization layer
To speed up the convergence of the training and provide a regularization to the network that improves its generalization capabilities, we applied Batch Normalization (BN) [23] before each non-linear activation in the OFCN, both in the encoder and decoder. BN normalizes channel-wise the mean and scale of the activations in the previous layer, making the network output almost invariant to the scale of the activations of the previous layers. BN has the effect of reducing the variance in the distribution of layer activations over the course of training, preventing the network weights from diverging and the activations from saturating. Empirically, it was shown that BN stabilizes and accelerates the training while reducing the need to tune a variety of other hyperparameters to achieve higher performance [40].
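The channel-wise normalization that BN performs at training time can be written out in a few lines. The sketch below normalizes a batch of feature maps per channel and applies the learnable scale and shift (γ, β); it omits the running statistics used at inference time:

```python
# Training-time batch normalization for feature maps of shape
# (batch, height, width, channels): normalize per channel, then
# rescale with learnable gamma and shift with learnable beta.
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=(0, 1, 2), keepdims=True)   # per-channel mean
    var = x.var(axis=(0, 1, 2), keepdims=True)     # per-channel variance
    x_hat = (x - mean) / np.sqrt(var + eps)        # normalized activations
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(8, 16, 16, 4))
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```

With γ = 1 and β = 0, each channel of the output has approximately zero mean and unit variance regardless of the input scale, which is what makes the downstream layers insensitive to the scale of earlier activations.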

A.3 Squeeze-and-Excitation layer
We equipped the encoder of the OFCN with SE modules, which model channel interdependencies at almost no additional computational cost [22]. SE obtains a global understanding of each channel by squeezing each feature map to a single numerical value. The SE blocks are inserted after each ReLU activation.

Figure 13: Overview of the Squeeze-and-Excitation block used in the OFCN encoder.

Fig. 13 illustrates the mechanism of the Squeeze-and-Excitation block. Let X ∈ R^{H×W×C} be a feature map generated, for example, by a convolutional layer, where H and W are the height and width of the feature maps and C is the number of channels (i.e., the neurons in the previous convolutional layer). A squeeze operation aggregates the feature map X across the spatial dimensions H and W. The resulting embedding is a vector in R^C that captures the global distribution of channel-wise feature responses. An excitation operation uses the embedding vector to implement a self-gating mechanism that rescales the original feature map channel-wise. The resulting feature map X̃ is used as input for the next neural network layer.
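The squeeze-excitation mechanism above can be sketched for a single feature map. The dense-layer weights below are random stand-ins for the learned parameters, with a reduction ratio r chosen for illustration:

```python
# Squeeze-and-Excitation on a feature map X of shape (H, W, C):
# squeeze to a C-vector, excite through a bottleneck with sigmoid
# gating, then rescale the channels of X.
import numpy as np

def se_block(x, w1, w2):
    z = x.mean(axis=(0, 1))               # squeeze: global average per channel
    s = np.maximum(z @ w1, 0.0)           # excitation: bottleneck + ReLU
    s = 1.0 / (1.0 + np.exp(-(s @ w2)))   # sigmoid gates in (0, 1), one per channel
    return x * s                          # channel-wise rescaling -> X~

rng = np.random.default_rng(0)
H, W, C, r = 8, 8, 16, 4
x = rng.normal(size=(H, W, C))
w1 = rng.normal(size=(C, C // r))         # stand-in learned weights
w2 = rng.normal(size=(C // r, C))
x_tilde = se_block(x, w1, w2)
```

Since the gates lie in (0, 1), the block can only attenuate channels, which lets the network learn to emphasize informative channels relative to the others at negligible cost.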

C Unsuccessful approaches
In the following, we mention other strategies we experimented with but did not provide satisfactory results.

C.1 Loss functions for class imbalance
Besides the binary cross-entropy with class balancing, we tried three additional loss functions that are specifically designed to handle classes with an uneven number of samples.
• Focal loss. Addresses class imbalance by reshaping the standard cross-entropy loss such that it down-weights the loss assigned to well-classified examples [32]. The Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. The Focal Loss is defined as FOC(p_t) = −α(1 − p_t)^γ log(p_t).
We used the default parameters α = 0.25 and γ = 2 proposed in the original paper.
• Jaccard Loss. Handles class imbalance by computing the similarity between the predicted region and the ground-truth region for an object present in the image. In particular, the loss penalizes a naive algorithm that predicts every pixel of an image as background, as the intersection between the predicted and ground-truth regions would be zero [36]. The Jaccard loss is defined as JAC(y, ŷ) = 1 − Σ(y ⊙ ŷ) / (Σy + Σŷ − Σ(y ⊙ ŷ)), where ⊙ indicates the Hadamard product, y the ground-truth mask, and ŷ the predicted mask.
• Lovász-softmax loss. An extension of the Jaccard Loss, which generates convex surrogates to submodular loss functions, including the Lovász hinge. We refer to the original paper for the formal definition [5]. The official TensorFlow implementation 6 has been used to perform the experiments.
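As an illustration of the first loss above, the binary focal loss with the default α = 0.25 and γ = 2 can be implemented as follows (the prediction vectors are invented for the example):

```python
# Binary focal loss: down-weights well-classified examples via (1 - p_t)^gamma.
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    p = np.clip(p, eps, 1 - eps)          # numerical stability
    p_t = np.where(y == 1, p, 1 - p)      # probability of the true class
    return float(np.mean(-alpha * (1 - p_t) ** gamma * np.log(p_t)))

y = np.array([1, 1, 0, 0])
well_classified = focal_loss(np.array([0.95, 0.9, 0.1, 0.05]), y)
hard_examples = focal_loss(np.array([0.3, 0.4, 0.7, 0.6]), y)
print(well_classified, hard_examples)
```

The (1 − p_t)^γ factor makes the loss on confidently correct pixels nearly vanish, so the gradient is dominated by the hard, misclassified examples.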
For each loss function, we repeated the same hyperparameter search described in Sec. 5.1 and, in Tab. 9, we report the best configuration found and the associated F1 score. The results are significantly lower than those reported in Tab. 4, which were obtained using binary cross-entropy with class weights. In particular, when optimized with the Lovász-softmax loss, our model achieves an F1 score that is 17% lower.

Table 9: Best configurations and F1 scores for loss functions different from binary cross-entropy. Acronyms: BN (Batch Normalization), SE (Squeeze-and-Excitation), L2 reg. (strength of the L2 regularization on the network parameters), LR (Learning Rate), FOC (Focal loss), JAC (Jaccard loss), LOV (Lovász-softmax loss).

C.2 Conditional Random Field
The outcome of the prediction on each tile is refined using a CRF [25] as a post-processing step. The CRF produces a result given by the combination of the pixel-wise neural network prediction, the pixel value in the input image (the SAR value in this case), and the pixel position. More formally, the network prediction ψ(x_i) for pixel i is combined with the pairwise potential

ψ_p(i, j) = w^(1) exp(−‖p_i − p_j‖² / (2θ_α²) − ‖I_i − I_j‖² / (2θ_β²)) + w^(2) exp(−‖p_i − p_j‖² / (2θ_γ²)),

where j ranges over the indices of the other pixels in the patch, p_i indicates the position of pixel i, and I_i its SAR value. The parameter configuration used for the training is w^(1) = 5, w^(2) = 0.1, θ_α² = 2, θ_β² = 2, and θ_γ² = 1. We found the CRF to be computationally intensive, very sensitive to several hyperparameters that are difficult to tune and, most importantly, it did not bring a significant improvement in the segmentation performance. Similar results were also found in other related work [27].

C.3 Multi-head classification network
In a first attempt to perform the categorization task, we designed an architecture that shares the first 5 convolutional blocks and has 12 different output heads, each one specialized in predicting one of the 12 categories. Such a network is capable of predicting all the categories at the same time, given the input VV channel and mask. However, we achieved better individual accuracy by independently training 12 different networks with a single head, one per category, such as the one depicted in Fig. 4.

C.4 Gradient descent optimizers
As an alternative to Adam, we tried the Nadam optimizer [38]: it minimizes the loss faster in the first epochs but, in the end, settles in worse minima than Adam.