Enhancing Coastal Water Body Segmentation with Landsat Irish Coastal Segmentation (LICS) Dataset

Ireland’s coastline, a critical and dynamic resource, is facing challenges such as erosion, sedimentation, and human activities. Monitoring these changes is a complex task we approach using a combination of satellite imagery and deep learning methods. However, limited research exists in this area, particularly for Ireland. This paper presents the Landsat Irish Coastal Segmentation (LICS) dataset, which aims to facilitate the development of deep learning methods for coastal water body segmentation while addressing modelling challenges specific to Irish meteorology and coastal types. The dataset is used to evaluate various automated approaches for segmentation, with U-NET achieving the highest accuracy of 95.0% among deep learning methods. Nevertheless, the Normalized Difference Water Index (NDWI) benchmark outperformed U-NET with an average accuracy of 97.2%. The study suggests that deep learning approaches can be further improved with more accurate training data and by considering alternative measurements of erosion. The LICS dataset and code are freely available to support reproducible research and further advancements in coastal monitoring efforts.

dataset, which aims to facilitate the development of deep learning methods for coastal water body segmentation while addressing modelling challenges specific to Irish meteorology and coastal types.The dataset is used to evaluate various automated approaches for segmentation, with U-NET achieving the highest accuracy of 95.0% among deep learning methods.Nevertheless, the Normalized Difference Water Index (NDWI) benchmark outperformed U-NET with an average accuracy of 97.2%.The study suggests that deep learning approaches can be further improved with more accurate training data and by considering alternative measurements of erosion.The LICS dataset and code are freely available to support reproducible research and further advancements in coastal monitoring efforts.

Introduction
Ireland's coastline is both a vital and dynamic resource.Coastal regions are impacted by erosion, sedimentation, and human activities like land development.
In fact, it is estimated that 20% of Ireland's 4,578km of coastline are eroding [1].
A trend that is likely to be exacerbated by climate change and sea-level rise [2].
To identify the areas worst at risk we must closely monitor changes in the coastline.
The length of the coastline means this is no straightforward task.There is a growing consensus that, to meet the challenge, we can use a combination of satellite imagery and deep learning methods [3].At the same time, there is limited research done in this area.Particularly for Ireland, there are no extensive open-source machine-learning datasets for coastal water body segmentation.
Hence, we present the Landsat Irish Coastal Segmentation (LICS) dataset.
Its purpose is to aid the development of deep learning methods for coastal water body segmentation.At the same time, the dataset may be used to shed light on modelling challenges specific to Ireland.In particular, we aim to answer questions about how solar altitude, various coastline types and the date of images will impact model performance.In the process, we benchmark various automated approaches for segmentation and explore their assumptions.In the spirit of reproducible research, both the dataset1 and code2 are freely available.

Background
We must distinguish between two tasks -coastal water body segmentation and coastline detection.For segmentation, we aim to classify each pixel in an image as either land or ocean.For coastline detection, we aim to classify each pixel as either coastline or not.The latter process will depend on how we define the coastline.In this paper, we consider the instantaneous coastline which is the boundary between land and water at the exact time a satellite image was taken [4].Under this definition, the two tasks are related.That is the coastline pixels are the pixels where the segmentation map changes from land to ocean.
The instantaneous coastline is limited in its ability to measure erosion as it depends on the tide.Alternative measurements include the high water mark, vegetation line and dune volume [5].These are considered to be better definitions for measuring erosion.However, gathering ground truth for these measurements is more complicated as they require onsite evaluation.In comparison, the instantaneous coastline can be determined using only satellite images and additional higher-resolution images of the same coastline [6,3].This partly explains why most studies have chosen this definition and approach to creating a ground truth dataset.
Traditionally, spectral indices have been used for water body segmentation [7,8].For coastline detection, various edge detection algorithms have been applied [9,10,11].The advantage of these approaches is they do not require a training set.The downside is they are not robust to noise in satellite images caused by factors like clouds, swell and land development [12,13,14].Satellite images and ground-based sky images [15] are often corrupted by atmospheric clouds [16,17,18].Additionally, as they require one channel as input, we must first select [19] an individual spectral band or combine multiple bands into one value per pixel.In the process, we may lose important information from other bands or from interactions between bands.
In comparison, deep learning models can use all available spectral bands.Additionally, they can use a pixel's context to make predictions.This means they can use the spectral band intensities from surrounding pixels and not just the intensities for the given pixel.Initial work with these models has shown promise.[20], [21] and [22] apply variations of U-NET, a common image segmentation algorithm, to coastal water body segmentation datasets.However, the images in the studies are naturally coloured meaning the models cannot make use of the range of spectral bands available in satellite images.Particularly, the Near-Infrared (NIR) band which is important for water body segmentation [23,24].
To the best of our knowledge, four studies use satellite images as input.[25] showed a multi-layer perception could accurately segment five coastal water bodies across three continents.[26] focused on predicting the vegetation line using convolutional neural networks (CNN).[6] used a combination of CNN and transformer architecture for land-sea segmentation in the yellow sea region of china.In terms of dataset diversity, [3] presents the most extensive study.The researchers provided a test set of 98 images from 49 locations around the world.
The aim was to provide a benchmark dataset that would aid the development of land-ocean segmentation models that are scalable to all global coastlines.Such a model is ideal.However, it is a challenging task.All coastal regions will have their own unique geographical and meteorological conditions [27,28,29] and labelling a training dataset that adequately captures these variations will be time-consuming.Hence, [3] opted to use semi-supervised methods to label their training dataset.To make the task more manageable, we have chosen to focus on one country-Ireland.Still, even this relatively small island presents a large variation in coastline conditions.
From sandy beaches to rocky cliffs, Ireland's varying coastal geographies will make some coastlines more or less susceptible to erosion [30,31].Wave power is another factor that affects erosion [32].The west coast of Ireland faces the Atlantic and experiences a larger amount of wave energy [33].The long-term effect is typically more jagged coastlines in these areas.In other words, we have a less uniform boundary between land and ocean and we expect these areas to be more challenging to produce accurate segmentation.
Other considerations are cloud cover, tidal variations and variations in solar altitude-the angle of elevation of the sun above the horizontal plane.Ireland experiences large differences in solar altitude between summer and winter months.A factor worth considering as low solar altitudes have been shown to lead to poorer performance for water body extraction indices [34].Ultimately, if we want a model that can perform accurate segmentation across all times and coastline types, we must build a dataset that adequately captures variation in these factors.

Landsat Irish Coastal Segmentation Dataset
We introduce the Landsat Irish Coastal Segmentation (LICS) dataset [35].This is the first dataset created for deep-learning semantic segmentation of the Irish coastline.It has been created with the goal of developing robust models that can perform accurate segmentation across different years, coastal types and atmospheric conditions.Particular attention has been paid to the model performance at varying solar altitudes.Figure 1 gives a summary of the dataset development process and we will discuss each step in depth.

Selecting Scenes
The first step was to obtain metadata of all potential Landsat scenes.A tile covers a specific geographic area and we considered 11 tiles which all contained some section of the Irish coastline.You can see examples of these in Figure 2.
Combined, every section of the Irish coastline is included in these 11 tiles.We obtained the metadata for all scenes from these tiles from April 1984 to May 2023.This was 14,850 scenes in total.The fields included in the metadata allowed us to select scenes from this list for model development.Specifically, we removed any scenes that did not meet the following criteria: 1. We select scenes from Landsat 5, 7, 8 & 9.
2. For Landsat 7, we only select scenes before 2003-05-31 due to faulty satellite mirrors after this date.
3. We select scenes that fall in Tier 1 as these are the highest quality data.4. We consider scenes that had less than 10% total cloud cover.
The cloud cover percentage is calculated using the CFMask algorithm [36].
Figure 3 gives the histogram of these cloud cover percentages for all the scenes.Ideally, we would only select scenes that had 0% cloud cover.However, we can see that this would severely limit the number of available scenes.In fact, only 5.6% of scenes had less than 10% cloud cover so we decided to use this as our cutoff.
The above process left us with 326 scenes.We selected 100 scenes from this list using the solar altitude as an additional criterion.We calculated the  For each year and altitude category, we selected the scene that had the lowest cloud cover.In Table 1 we see the breakdown of scenes for each tile.To have a more even distribution across the tiles, we selected a further 8 scenes for tile 6.
The final dataset had 42, 42 and 43 scenes in the altitude categories respectively and at least one scene in each year.The final result is a dataset that captures variation introduced by solar altitude, coastline type and time.As altitude is related to the time of year, we also capture month-on-month variation.

Spectral bands
After selecting the final list, we obtained the spectral bands for the 100 scenes.We consider the bands listed in Table 2 as input into the modelling approaches.These all have a resolution of 30m.These are the bands common to Landsat 5, 7, 8 and 9.The newer satellites do have more bands available.
However, we believe the ones we have selected to be appropriate for water body segmentation as they include bands common to water body indices.

Cropping scenes
The Landsat scenes are roughly 8,000 by 8,000 pixels.These dimensions are larger than what is typically used to train machine learning models.Hence, as seen in Figure 5 we crop 256 by 256 pixels squares from each scene to create the training and test set.For the test set, we select one geographical location for each tile.Hence, we have 11 testing locations with additional variation introduced through time and atmospheric conditions.These locations are chosen randomly with the conditions that they fall on the island of Ireland, no bounding box is included and the ratio of land to ocean is between 40% and 60%.
For the training set, 300 crops per scene are selected.These are chosen randomly with the condition that they do not overlap with the testing location and

Training Annotation
The training annotations were created manually by drawing segmentation masks on top of the Landsat scenes.Specifically, pixels were given a value of 1 for ocean and 0 otherwise.To be clear, a scene was annotated before the above cropping process and then the masks were cropped along with the spectral bands.This approach was chosen as it was less time-consuming than annotating the 30,000 training instances individually.
To further reduce time requirements only a rough mask was drawn.These typically took between 15 and 25 minutes depending on the tile and a strict cutoff of 30 minutes per scene was used.As a reference when drawing the masks, the scenes were visualised using the standard RGB (3/2/1) bands and using the NIR band in replace of the Red band (4/2/1).These can be seen in Figure 6.Open-source software called Label Studio was used to draw the annotations.

Test Annotation
Evaluating models using these rough annotations would likely overestimate model performance.Hence, for the test instances, we created more precise annotations.This was done after the cropping process and so only the pixels within the test crop area were annotated.Figure 7 gives the references used to create the test annotations.Like the training set, these instances were visualised using the 3/2/1 and 4/2/3 bands.Additionally, Google Earth Pro was used to provide a higher-resolution image of the testing locations.This allowed us to observe the location at various tide levels and at times close to when the Landsat scene was taken.No time limit was set to ensure the most accurate annotations possible.
The problem of mixed pixels should be mentioned.These are land and ocean pixels in a Landsat scene that have merged.This effect will be most prominent in pixels close to the instantaneous coastline.As we have decided on a binary target variable, these must either be classified as land or ocean.In the test set, these are handled by the authors' judgement based on the available resources mentioned above.Overall, the process produced segmentation masks that reflected the true instantaneous coastline as closely as possible without visiting the testing locations.We can see an example of a rough training mask and a more precise test mask for a testing location in Figure 1.The hope is the mistakes in the rough masks are not systematic.Then through training on 30,000 instances, the mistakes will be averaged out and we will be able to predict an accurate segmentation.
Evaluating the segmentation approaches using the more precise test masks will give a clearer indication of the true performance of the models.However, the results should be interpreted with the test set annotation process in mind.

Coastal Type Classification
For further analysis, the test images were classified by their coastline types -"rocky" or "sandy".We consider only these classifications as it is estimated that the majority of Ireland's coast is either hard rock (59%) or sandy beaches (39%) [1].All testing locations are classified visually using the same references seen in Figure 7.For locations with mixed types, the majority type was used for the final classification.These include tiles (207,22), (208,22) and (208,24).
They were classified as rocky but a minority of the coastline was sandy.

Segmentation Approaches
Normalized Difference Water Index As a benchmark, we use the Normalized Difference Water Index (NDWI) [37].
This is a well-established spectral index for water body extraction.As seen in Equation 1, an intensity value is calculated for each pixel in a test image.If this value is equal to or above 0 the pixel is labelled as water.If it is less than 0 the pixel is labelled as land.As this process is deterministic, it does not require the training set.

Extreme Gradient Boosting
For comparison to the deep learning methods, we used an Extreme Gradient Boosting (XGBoost) model [38].This is an ensemble method that makes predictions using a collection of decision trees.Specifically, we used a model with 500 trees and a maximum depth of 3 for each tree.To create the dataset for this

U-Net
For the deep learning method, we use the U-Net architecture [39].This is a popular segmentation architecture developed for medical image segmentation.
The architecture consists of an encoder, bottleneck and decoder.Layers in the encoder and decoder are connected through skip connections.The model was trained using a 90/10 training/evaluation split for 50 epochs with early stopping if the validation loss did not improve for 10 epochs.We follow the same process using variations of the U-Net.That is the Attention U-Net [40] and R2 U-Net [41].It is not clear if these variations will provide improvement in performance for our problem.This is because they have been developed to address issues common to medical imagery-small sample sizes and unbalanced datasets.
As input, the 3 deep learning approaches take all bands and pixels.This means they can not only use the spectral bands for a pixel but also the surrounding pixels to make segmentation predictions.We expect this to improve model performance.Especially for pixels close to the coastline where we expect the distinction between land and ocean pixels to be less clear.By comparing these approaches to the XGBoost model we can understand the extent to which this is true.

Evaluation Metrics Confusion Matrix Metrics
When evaluating the segmentation approaches we consider confusion matrixbased measures based on the values in Table 3. Suppose P i,j and G i,j are the pixel values in the i, jth position in the predicted segmentation mask (P) and the ground truth mask (G).Then TP is the count of cases where P i,j = G i,j = 1, TN is the count where P i,j = G i,j = 0, FP is the count where P i,j = 1, G i,j = 0 and FN is the count where P i,j = 0, G i,j = 1.We use the metrics based on these values listed in Equations 2-5.

Accuracy = T P + T N T P +
P recision = T P T P + F P (3) F 1 = 2 * P recision * Recall P recision + Recall When calculating these metrics all pixels in an image are considered.As mentioned we expected the pixels close to the coastline to be harder to classify.
By taking the general performance, these metrics may overestimate the performance of models in these regions.Hence, we also consider variations of these metrics where only the pixels within 10 pixels of a coastline pixel are considered.The coastline pixels are determined using the process detailed in the next section.

Figure of Merit
We use Figure of Merit (FOM) as another approach for assessing the accuracy of the coastline.Previous experiments have shown this to be an effective metric for evaluating coastline edge detection problems [42].This metric is used for evaluating edge detection algorithms.Hence, as seen in Figure 8, we must first create edge maps for the test masks and predictions.To do this we first calculate the gradient of each pixel.Pixels with a gradient that is not equal to 0 is labelled as an edge pixel.
FOM is calculated using equation 6. N G is the number of actual edge pixels, N E is the number of the detected edge pixels, α is the scaling constant, and d(k) is the minimum distance between the detected edge pixel and an actual edge pixel [43].In the context of our problem, FOM captures the average distance of the predicted coastline from the ground truth coastline.

Interpretability Metric
To interpret the deep learning models, we use a permutation feature importance approach [24].This involves permuting the pixels in each band of the 100 test instances.The permutation score for each band is the original model accuracy less the accuracy when that band is permuted.Large values for this band suggest that the band was important to a model's predictions.This allows us to test the assumption that deep learning models benefit from using multiple bands as input.Understanding, which spectral bands are most important to predictions also builds trust in model predictions.This is because we can relate the results to previous research on spectral indices.A final benefit is that it will inform choices around future model development.

Evaluation metrics
Table 4 gives the evaluation metrics when all pixels are used in the calculations.For the deep learning approaches, U-NET had the highest accuracy of 95.0%.This is 2.4 percentage points higher than XGBoost.This suggests that model performance is improved when pixel context can be used to classify each pixel.However, we see that the NDWI benchmark had better evaluation approaches in all metrics except recall.The average accuracy for NDWI was 2.2 percentage points higher than U-NET.FOM also indicates that NDWI was able to better approximate the coastline than the other approaches.5 gives the evaluation metrics when only the pixels within 10 pixels of a coastline edge are used in the calculations.Comparing the metrics to those in Table 4, we see a decrease in all the values.This means that all methods had more difficulty predicting pixels close to the coastline than the pixels in general.
Additionally, we now see larger differences between the methods.The average accuracy for NDWI is 10.2 percentage points higher than U-NET.This tells us that the improvement in NDWI over U-NET seen in Table 5 comes primarily from more accurate predictions around the coastline.A visual analysis of Figure 9 supports these results.We see that the U-NET predicts masks that tend to either under or over-estimate the coastline.
In comparison, NDWI accurately predicts the coastline for most instances but misclassified ocean pixels further away from land.This is seen in the images for tiles (205,23) and (208,24).XGBoost is impacted in a similar way.Both of these methods only consider the intensity of individual pixels and the misclassified pixels will likely have intensity values similar to land pixels.In comparison, the deep learning approaches do not tend to misclassify pixels in this way.This is likely a result of including pixel context in predictions.

The Advantages and Disadvantages of the Annotation Process
We should consider the above results with the dataset annotation process in mind.The test set was annotated to provide precise segmentation masks.
However, there was no on-site evaluation to ensure their accuracy.Further bias can be introduced as the annotations were not cross-evaluated by other professionals.In other words, the ground truth was determined by only one of the paper's authors using a visual analysis of the satellite images and Google Earth images of the same location.
A similar process was used for the training dataset.However, to ensure a reasonable amount of time was required to develop this dataset, the annotation process produced less precise segmentation masks.As you can see in Figure 10, this means there are incorrectly labelled pixels used to train the machine learning approaches.This helps explain the lower accuracy for these approaches compared to NDWI.Additionally, the way Landsat tiles were chosen may also limit the model's robustness as we have relied on relatively cloudless images.A final limitation is that the method for developing the training data was labor-intensive.In comparison, the NDWI benchmark requires no training data.There are still some noticeable benefits to the deep learning approach.

Accuracy by Coastal Type
The LICS dataset was not designed specifically to analyse coastal types.Yet, through randomly selecting testing locations we can expect to capture variations in this factor.We can see this in Figure 9 where various coastal types are present.
As mentioned the testing location of the tiles, where further classified as having either a rocky or sandy coastline type.Another characteristic is the shape of the coastlines.That is some tiles are jagged and others more uniform.
Table 6 gives the average accuracy from each tile.U-NET had the lowest accuracy for tile (208,23) which visually is the least uniform.In contrast, U-NET had the highest accuracy for tiles (208,24) and (205,24) which are relatively uniform.This suggests the model is not robust to variations in this coastline characteristic.As mentioned, the west coast of Ireland is more exposed to swell leading to more jagged coastlines.As a result, we may expect the model to perform worse in these regions.
Table 7 gives the average accuracy of the tiles in each of these groups.For NDWI, the accuracy for the sandy coastlines is 2.5 percentage points higher than for rocky coastlines.This suggests it is potentially harder to segment rocky coastlines.However, we see the opposite for U-Net with a smaller difference of 1.1 percentage points.Considering this, in light of Table 6 it seems as if the coastline type does not influence model performance as much as its shape.

Accuracy through Time
In Table 8 we can see some variation when comparing accuracy for U-Net by decade.Specifically, the difference between the best (2010) and worst (2020 performing decades was 2.3 percentage points.However, we must consider potential confounding between the tiles and years.In Table 9, we see the percentage of test instances that come from the tiles for each decade.18% of the test Considering that this was the tile with the lowest accuracy, it would partially explain the lower accuracy for 2020 in general.In other words, variation in accuracy by decade is not due to inherent characteristics of scenes in that decade but the non-uniform distribution of tiles across the decades.

Solar Altitude
In Table 10, we see the performance across the different altitude categories.
For U-NET the difference between the best and worst altitudes is 1.5 percentage points.This figure is 2.1 percentage points for NDWI.In contrast to previous research the spectral indices performed better for lower altitudes.Even so, the results suggest that solar altitude can have an impact on the performance of spectral indices.In comparison, the performance of U-NET is more uniform.
This suggests that solar altitude does not play a significant role in the ability of the model to perform accurate segmentation.The model did have lower performance for some coastline shapes.We believe this can be addressed through more accurate training annotations.At the same time, the model produced similar results for different decades, coastal types and solar altitude which is a proxy for time of year.Hence, the results show that a deep learning model can be used for inference for any scene in Ireland, during any time of the year, provided that scene is not cloudy.

Permutation Band Importance
The visual analysis of the segmentation predictions in Figure 9 suggests that the deep learning models benefit from using pixel context.This is a commonly stated benefit of deep learning models over spectral indices.Another stated benefit is they can use all available spectral bands as input.Looking at Figure 11, we can see that for the U-NET approach this benefit may be overstated.
The largest permutation scores are 38.96% and 17.17% for the NIR and SWIR  This suggests that only the NIR and SWIR 1 bands are having a significant impact of model predictions.
The NIR band is recognised as an important spectral band for water body segmentation.It is used in the NDWI indices included in this paper.it is also used in the calculation for the Automated Water Extraction Index with Shadows Elimination (AWEIsh) [44] and Water Index 2015 (WI2015) [45].Likewise, the SWIR 1 band is used to calculate AWEIsh and WI2015 as well as a mod-

Conclusion & Future Work
We presented LICS, the first Irish coastline segmentation dataset for deep learning.It was created using Landsat scenes from 1984 to 2023 and includes 30,000 training instances and 100 test instances.We benchmarked the performance on this dataset using various segmentation approaches including an NDWI threshold, XGBoost and the U-NET deep learning architecture.For the benefit of the community, both the dataset and code for these experiments are made freely available.
When developing the LICS dataset, we aimed to capture variation in factors, inherent to the Irish coast, that were expected to impact model performance.
These include the year and month of the scene, coastline types and solar altitude.
Initial results suggest that it is possible to build a deep learning model that is robust to changes in these factors.This means that such a model can output accurate segmentation for any Landsat scene of the Irish coastline.This will enable accurate inference and further coastal monitoring efforts.
We explored assumptions around the benefits of deep learning approaches, such as U-NET, over other segmentation methods.These are that U-NET can use pixel context and all available spectral bands to make predictions.A visual analysis of U-NET predictions verified the first benefit.Interpreting the U-NET showed that the second benefit is not as influential.Results suggested only the NIR and SWIR 1 bands were used to make predictions.Future models can take advantage of this result.By using only the two bands as input, we can reduce model complexity and training time whilst having no negative effect on model performance.
It is important to not overstate the performance of the deep learning approaches.The results do not show an improvement over traditional spectral indices.U-NET was the best-performing model with an average accuracy of 95.0%.This is compared to 97.2% when using NDWI.However, a visual analysis showed promise for the deep learning approaches.U-NET tended to misclassify pixels close to the coastline.This is likely a result of the annotation process for the training set.It produced rough masks where the pixels close to the coastline were most likely to be incorrectly labelled.
We believe that the deep learning approach can significantly outperform the spectral indices given more accurate training data.Future research will focus on developing a modelling process that will create accurate annotations while limiting the amount of time required to label training data.This will likely involve semi-supervised methods used to annotate a large number of training instances as well as a smaller manually annotated dataset.This will enable a transfer learning approach where an initial model, trained on the semi-supervised dataset, can be fine-tuned on the manually annotated dataset.
When pursuing this goal we must consider the purpose of the model.The dataset was developed based on the instantaneous coastline definition.This fundamentally limits a model's ability to monitor erosion and other coastline changes.Additionally, the 30m resolution of Landsat scenes means that only changes over relatively long periods can be observed.Future research will focus on alternative definitions such as the high water mark, vegetation line and dune volume and use higher resolution sources such as sentinel-2 satellite imagery.
Variations to the instantaneous coastline definition will also be considered such as including a third category for mixed pixels.When doing all of this, we will explore how the LICS dataset can be leveraged using fine-tuning approaches.

Figure 1 :
Figure 1: Summary of the Landsat scene selection, scene cropping and annotation process.The end result of the process is 30,000 training instances and 100 test instances.

Figure 2 :
Figure 2: An example of each of the 11 Landsat tiles considered for this analysis.The tile's row and path (row,path) are given in the title above each image.The scenes have been visualised using the NIR band to show contrast between land and ocean.

Figure 3 :
Figure 3: Frequency of cloud cover percentage.The frequencies are calculated using the metadata of 14,850 Landsat scenes of Ireland.

Figure 4 :
Figure 4: Average solar altitude by month of 14,850 Landsat scenes of Ireland.We take the altitude of the sun at the location and time the scenes were taken.We can see that the altitude is highest in the summer months.
contain no bounding box.Each training instance was randomly flipped vertically with 50% probability and horizontally with 50% probability.The final result is a dataset of 30,000 training instances and 100 test instances.Importantly, the test set is geographically independent of the training set.Hence, evaluation results will indicate the model's ability to generalise to the Irish coastline.

Figure 5 :
Figure 5: Example of test and training crops from a Landsat scene with tile (205,23).The test crop is given by the red square.The training crops are shown by the 300 blue squares.

Figure 6 :
Figure 6: Example of the references used to annotate the training set.These include a visualization of the visible light bands (RGB) and a visualisation which uses the Near-infrared band in place of the red band (NGB).

Figure 7 :
Figure 7: Example of the references used to annotate the test set.These include a visualization of the visible light bands (RGB) and a visualisation which uses the Near-infrared band in place of the red band (NGB) and high-resolution Google Earth images from multiple time periods.
model, we randomly select 100 pixels from each training image.This gives us 3,000,000 rows where each row has 8 values-one for each band and the target variable.After training, the model is used to classify each pixel in a test image individually.The predictions are then combined into the final segmentation prediction.

Figure 8 :
Figure 8: Example of an edge map created using gradients of a mask.

Figure 9 :
Figure 9: Examples of predicted masks.Each row gives a random test image from a tile.The tile is given above the RGB visualisation in the first column.The second column gives the test mask for that image.The remaining 5 columns give the predicted mask for the 5 different approaches.The number next to the approach name gives the accuracy for that prediction when compared to the mask.The red line overlaying the predictions gives the edge of the coastline given in the mask.Examples for the remaining tiles can be found in the appendix.

Figure 10 :
Figure 10: Example of the output from the training and test annotation process.The training masks are obtained by cropping a mask drawn on the entire Landsat scene.When we zoom in on this mask, you can see there are incorrectly labelled pixels.In comparison, the test masks are obtained by annotating only the test location with more precision.
Firstly, although we have shown a rough training mask for a testing location in Figure 10, we ensured that we would not include these locations in the training set.This is to ensure the results indicate how well the model can generalise to unseen locations.Secondly, the U-NET model, though initially trained on this dataset, can be fine-tuned and improved over time.Lastly, the dataset should still produce a model that is robust to other factors like solar altitudes, coastline types and time.We explore these factors in more depth using the NDWI, U-NET model and the accuracy metric.

Figure 11 :
Figure 11: Permutation importance scores from each band used in the U-NET model.Accuracy is decreased by 38.96 and 17.17 percentage points when the NIR and SWIR 1 bands are permuted.

Table 1 :
The number of scenes selected for each tile.

Table 2 :
The spectral bands used as input into segmentation approaches.They all have a resolution of 30m.

Table 3 :
Confusion matrix for pixel classification.Water pixels are represented by a value of 1 and land pixels are represented by a value of 0.

Table 4 :
Evaluation metrics for the segmentation approaches applied to the LICS test set.The average of the evaluation metrics over 100 test images is given.

Table 5 :
Evaluation metrics within 10 pixels of the coastline.The average of the evaluation metrics over 100 test images is given.

Table 6 :
Average accuracy by tile for NDWI and U-NET.N gives the number of test images for each tile.

Table 7 :
Average accuracy by coastline type for the NDWI and UNET approaches.The 11 tiles in the test set have been classified as either sandy or rocky coastlines.N gives the number of test images in each category.

Table 8 :
Average accuracy by decade for the NDWI and UNET approaches.N gives the number of test images for each decade.

Table 9 :
Percentage of test images that come from each tile in each decade.

Table 10 :
Average accuracy by altitude category for the NDWI and UNET approaches.N gives the number of test images in each category.