Assessment of Machine Learning Methods for Seagrass Classification in the Mediterranean

Posidonia oceanica is a seagrass species endemic to the Mediterranean. Even though the species is protected, P. oceanica is currently listed as threatened. Conserving it therefore requires accurate, high-resolution distribution maps that can be produced repeatedly over time. This study aims to create seagrass distribution maps from WorldView-2 imagery with two machine learning algorithms, namely random forests and support vector machines. In-situ data were collected via underwater video and scuba diving for classification training and testing. Atmospheric, radiometric and water column corrections were applied to preprocess the optical satellite image. Because light penetration in water decreases with depth, the study area was limited to a maximum depth of 20 meters. The classification accuracies and Cohen’s kappa coefficients are 94% and 0.89 for random forests and 71% and 0.61 for support vector machines, respectively. According to these results, the random forests method is superior to support vector machines for seagrass mapping in our study area. The proposed framework enables rapid production of seagrass distribution maps, which can be used to monitor temporal change for a sustainable ecosystem.


Introduction
Posidonia oceanica is an endemic plant whose meadows are acknowledged as one of the most valuable ecosystems in the Mediterranean (Badalamenti et al., 2015). However, P. oceanica meadows are declining in the Mediterranean (Hughes et al., 2009). It is therefore vitally important to reduce the risk of seagrass extinction by assessing the present state and spatial pattern of the meadows, investigating the causes, cycles and processes of degradation, and monitoring for preservation and restoration (Hossain et al., 2014).
Because classic surveying techniques such as point- or transect-based surveys are limited in time and scale, information about the ecological function and geographic diversity of benthic habitats is scarce (Wright and Heyman, 2008). Geological and biological seafloor research based on collecting samples from the sea bottom, for example by grab or trawl, is not sufficient to characterize biological patterns and processes (Eleftheriou, 2013; Van Rein et al., 2009). Even though these techniques provide detailed information for local areas, they are not practical for describing the biological characteristics of the sea floor at regional and national scales (Brown et al., 2011).
Optical remote sensing products provide cost-effective solutions with wide coverage for shallow coastal systems. They can produce more detailed, accurate and comprehensive spatial distribution maps than classical surveying techniques (Lyons et al., 2011), and advances in remote sensing techniques have made them increasingly useful in benthic studies (Franklin, 2010). Since turbidity reduces the penetration of light in water, such studies are limited to shallow coastal areas (Brown et al., 2011).
Optical satellite images are widely used in studies of seagrass detection, areal coverage, distribution mapping, and extent and biomass change detection (Hossain et al., 2015). Since P. oceanica is an essential plant for the Mediterranean, the species has been extensively investigated (Gumusay et al., 2019). Pasqualini et al. (2005) compared the P. oceanica mapping accuracies of SPOT 2.5 m and SPOT 10 m imagery. Fornes et al. (2006) investigated the use of IKONOS imagery for P. oceanica mapping with maximum likelihood supervised classification. Matarrese et al. (2008) produced maps of P. oceanica meadows in the Taranto Gulf, Ionian Sea, using IKONOS, ETM+ and ASTER imagery. Borfecchia et al. (2013) proposed an integrated methodology including QuickBird imagery and spectral vegetation indices for P. oceanica mapping. Matta et al. (2014) focused on the extraction of P. oceanica cover in the shallow waters (< 10 m) of the Gulf of Oristano, Italy, using RapidEye imagery. Hachani et al. (2016) utilized Google Earth imagery to map P. oceanica in the south eastern Gulf of Tunisia with the minimum distance classification method. Traganos and Reinartz (2018a) investigated mapping the distribution of P. oceanica in the Thermaikos Gulf, north-western Aegean Sea, with Sentinel-2 imagery using maximum likelihood, support vector machines (SVM) and random forests (RF). The same authors applied a semi-analytical inversion model to Sentinel-2 imagery to obtain the P. oceanica extent in the same study area (Traganos and Reinartz, 2018b). Topouzelis et al. (2018) mapped seagrasses, including P. oceanica, in Greek territorial waters using Landsat-8 imagery. Poursanidis et al. (2019) focused on the contribution of the coastal aerosol band of Sentinel-2 to habitat mapping and satellite-derived bathymetry.
This study aims to assess the performance of machine learning techniques for seagrass classification using WorldView-2 imagery in the Gulf of Gokova on the Mediterranean coast of Turkey. After pre-processing of the WorldView-2 imagery, the random forests (RF) and support vector machine (SVM) supervised machine learning classification techniques are applied to map the P. oceanica distribution.

Study Area
The study area is the Gulf of Gokova, which is located on the Mediterranean coast of Turkey (Figure 1). Gokova Bay is an ideal spot for seagrass habitats, especially for P. oceanica. Even though P. oceanica is able to live at depths of up to 50 meters (Duarte, 1991), the study area is limited to a maximum depth of 20 meters since the penetration of light into the sea is limited. The resulting study area covers around 8.6 km² of sea floor.

Satellite Image
In this study, we have used WorldView-2 imagery captured on 20 July 2010 for machine learning classification. We have only used Red, Green and Blue bands of the image. In our study area, these bands have spatial resolution of 2 meters. Technical specifications of WorldView-2 satellite have been given in Table 1.

In-Situ Data
In order to train machine learning classifiers and carry out accuracy assessment, in-situ data has been collected by scuba diving and underwater video. We collected seagrass presence positions using RTK-GNSS solutions. We also took advantage of a recent report into Gokova Bay's biodiversity (Okuş et al., 2006).

Pre-Processing
We aimed to extract current seagrass coverage using optical imagery via classification for accuracy assessment of the model. The use of optical satellite images in marine applications is limited by water turbidity, which affects light penetration in the water. Therefore, we were able to perform a classification analysis efficiently in Gokova Bay within a depth limit of 20 meters, which was determined using electronic nautical charts (ENCs). As the seafloor is not observed directly, satellite images must undergo a preprocessing phase. Satellite images are usually delivered as digital numbers (DNs) quantized according to the radiometric resolution, and they contain atmospheric effects. DNs are converted to surface reflectance in two steps using radiometric and atmospheric corrections. First, DNs are transformed to radiance, stored as a band-interleaved-by-line (BIL) raster in μW/(cm²·sr·nm) units.
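The DN-to-radiance step can be illustrated with a minimal sketch. It assumes the commonly documented WorldView-2 conversion (radiance = absolute calibration factor × DN / effective bandwidth), which is not spelled out in the text, and it uses made-up calibration values; the real values come from the metadata delivered with each image. The function names are hypothetical.

```python
def dn_to_radiance(dn, abs_cal_factor, effective_bandwidth_um):
    """Return band-averaged radiance in W/(m^2 sr um) for a digital number."""
    return abs_cal_factor * dn / effective_bandwidth_um


def to_flaash_units(radiance_w_m2_sr_um):
    """Convert W/(m^2 sr um) to uW/(cm^2 sr nm).

    1 W/m^2 = 100 uW/cm^2 and 1 um = 1000 nm, so the factor is 0.1.
    """
    return radiance_w_m2_sr_um * 0.1


# Hypothetical calibration values for a single band (assumed, not from the paper).
ABS_CAL_FACTOR = 0.01          # W/(m^2 sr count)
EFFECTIVE_BANDWIDTH_UM = 0.05  # micrometres

radiance = dn_to_radiance(500, ABS_CAL_FACTOR, EFFECTIVE_BANDWIDTH_UM)
flaash_radiance = to_flaash_units(radiance)
print(f"{radiance:.1f} W/(m^2 sr um) = {flaash_radiance:.1f} uW/(cm^2 sr nm)")
```

The only fixed quantity here is the unit factor of 0.1 between the two radiance units; everything else depends on the image metadata.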
The radiance image is then converted to surface reflectance using atmospheric correction methods. Various atmospheric correction methods have been used in P. oceanica studies in the literature. Borfecchia et al. (2013) exploited the COST method for atmospheric correction of QuickBird imagery. Matta et al. (2014) corrected atmospheric effects with 6SV. Traganos and Reinartz (2018a) used the ATCOR software for this purpose. The same authors tried a different approach in another study by using C2RCC (Traganos and Reinartz, 2018b). Poursanidis et al. (2019) performed atmospheric corrections with Sen2Cor for Sentinel-2 imagery. In this study, the Fast Line-of-sight Atmospheric Analysis of Spectral Hypercubes (FLAASH) tool, which is based on Moderate Resolution Atmospheric Transmission (MODTRAN), was used. MODTRAN provides information on atmospheric components such as water vapour, aerosol and visibility from the spatially most suitable of six atmospheric models. The mid-latitude summer (MLS) atmospheric model was used to obtain surface reflectance values. The MLS atmospheric model uses values of 2.92 g/cm² for water vapour and 21 °C for the surface temperature.
Once the image is radiometrically and atmospherically corrected, water column correction needs to be applied in order to remove the effects of the water column on bottom reflectance prior to classification. Water column correction methods fall into three categories, namely band combination algorithms, model-based algebraic algorithms and optimization/matching algorithms (Zoffoli et al., 2014). For example, Traganos and Reinartz (2018a) and Poursanidis et al. (2019) used the analytical model of Maritorena et al. (1994) for this step. In this study, a widely used band combination algorithm proposed by Lyzenga (1981) was used. This common method in marine remote sensing applications is based on the creation of Depth Invariant Indices (DIIs) using the attenuation ratios of bands in the visible range only (RGB), as the other bands cannot penetrate to the seafloor. The ratio is calculated from a region with a single bottom type (generally sand) and varying depth using (1), which provides the attenuation ratio of band i to band j:

$$\frac{k_i}{k_j} = a + \sqrt{a^2 + 1} \qquad (1)$$

where $k_i$ and $k_j$ are the attenuation coefficients of band i and band j. The variable a is calculated using (2):

$$a = \frac{\sigma_{ii} - \sigma_{jj}}{2\,\sigma_{ij}} \qquad (2)$$

where $\sigma_{ii}$ is the variance of the reflectance in band i, $\sigma_{jj}$ is the variance of the reflectance in band j, and $\sigma_{ij}$ is the covariance of the reflectance values of band i and band j, calculated as:

$$\sigma_{ij} = \overline{X_i X_j} - \overline{X_i}\,\overline{X_j} \qquad (3)$$

where $X_i$ and $X_j$ are the (log-transformed) reflectance values in the base region for band i and band j, respectively. According to (4), the DII of band i and band j is calculated using the attenuation ratio:

$$Y_{ij} = X_i - \frac{k_i}{k_j}\,X_j \qquad (4)$$

DIIs are calculated for each band pair. For example, the DIIs $Y_{ij}$, $Y_{ik}$ and $Y_{jk}$ must be calculated for a three-band image with bands i, j and k. A subset of the image before and after DII calculation is shown in Figure 2. Classification methods were then applied to the DIIs obtained via pre-processing. In this study, two different machine learning algorithms were utilized for seagrass classification in Gokova Bay.
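The depth invariant index calculation can be sketched in a few lines of Python. This is a minimal illustration of the Lyzenga-style band-pair computation under stated assumptions, not the processing chain used in the study: reflectance is log-transformed before the statistics are computed, the sand pixels are synthetic, and the attenuation coefficients, bottom reflectances and function names are all hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Population covariance, mean(x*y) - mean(x)*mean(y)."""
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)


def attenuation_ratio(sand_i, sand_j):
    """Estimate k_i/k_j from reflectance over one bottom type at varying depth."""
    xi = [math.log(v) for v in sand_i]
    xj = [math.log(v) for v in sand_j]
    # covariance(x, x) is the variance of x
    a = (covariance(xi, xi) - covariance(xj, xj)) / (2.0 * covariance(xi, xj))
    return a + math.sqrt(a * a + 1.0)


def depth_invariant_index(band_i, band_j, ratio):
    """DII for one band pair: ln(band_i) - (k_i/k_j) * ln(band_j)."""
    return [math.log(vi) - ratio * math.log(vj) for vi, vj in zip(band_i, band_j)]


# Synthetic sand pixels (assumed values): reflectance decays exponentially with depth.
K_I, K_J = 0.05, 0.10   # hypothetical attenuation coefficients (1/m)
R_I, R_J = 0.30, 0.40   # hypothetical bottom reflectance of sand in each band
depths = [2.0, 5.0, 8.0, 12.0, 16.0, 20.0]
sand_i = [R_I * math.exp(-2.0 * K_I * z) for z in depths]
sand_j = [R_J * math.exp(-2.0 * K_J * z) for z in depths]

ratio = attenuation_ratio(sand_i, sand_j)
dii = depth_invariant_index(sand_i, sand_j, ratio)
print(ratio, dii)
```

On this noise-free synthetic example the estimated ratio recovers K_I / K_J, and the index takes the same value at every depth, which is the property that makes it usable as a classification feature.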
Due to their proven image classification performance, both the SVM and RF methods were chosen to create continuous seagrass distribution maps from the available WorldView-2 satellite image from 20 July 2010. The WorldView-2 image was used because of its availability for the Gulf of Gokova. Classification analysis for seagrasses could also be performed on other optical imagery with RGB bands, as widely reviewed by Hossain et al. (2015).

Machine Learning Classification
SVM is a machine learning method developed by Vapnik (1995) which is based on statistical learning and structural risk minimization from training samples. Support vectors are created by constructing decision functions using training samples. The optimum hyperplane is defined by using the support vectors (Vapnik, 1995).
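Suppose a binary classification problem with two linearly separable classes, where $\{(x_i, y_i)\}_{i=1}^{N}$ is the training set, N is the size of the training set and $y_i \in \{-1, +1\}$ are the target values of the respective classes (Haykin, 2009). An optimum hyperplane can be defined as:

$$w \cdot x_i + b = 0$$

where w is the weight vector, $x_i$ is the input vector and b is the bias. w and b are estimated as:

$$\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1, \quad i = 1, \dots, N$$

The distance between the two classes is defined as $2/\lVert w \rVert$. The value of $\lVert w \rVert$ must therefore be minimized to maximize this distance and hence to obtain a better separation. By using Lagrange multipliers ($\alpha$), the optimization problem can be redefined as:

$$\max_{\alpha}\ \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j\,(x_i \cdot x_j) \quad \text{subject to} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \quad \alpha_i \ge 0$$

The input space is transformed into a higher-dimensional space by applying a kernel function for better separation. The above equation can be written using the kernel function, where $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$:

$$\max_{\alpha}\ \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j\,K(x_i, x_j)$$

There are four widely used kernel functions, namely linear, polynomial, radial basis and sigmoid. In this study, a radial basis function kernel was used with a gamma of 0.333 and a penalty parameter of 100. SVM classification was carried out using the ENVI image analysis software.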
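To make the kernel formulation concrete, the following minimal sketch evaluates a radial basis function kernel and the resulting decision function for a trained binary SVM. It assumes the usual RBF form K(x, z) = exp(-gamma * ||x - z||^2) and uses the gamma of 0.333 reported above; the support vectors, multipliers and bias are invented for illustration and do not come from the ENVI model.

```python
import math

GAMMA = 0.333  # RBF gamma reported in the text


def rbf_kernel(x, z, gamma=GAMMA):
    """Radial basis function kernel, exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def decision_value(x, support_vectors, labels, alphas, bias, gamma=GAMMA):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b over the support vectors."""
    return sum(
        alpha * y * rbf_kernel(sv, x, gamma)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    ) + bias


def classify(x, support_vectors, labels, alphas, bias, gamma=GAMMA):
    """Return +1 or -1 according to the sign of the decision value."""
    return 1 if decision_value(x, support_vectors, labels, alphas, bias, gamma) >= 0 else -1


# Hypothetical trained model with two support vectors in a three-feature space.
# The multipliers satisfy sum(alpha_i * y_i) = 0; with a penalty parameter C,
# each alpha_i would additionally be bounded above by C.
SUPPORT_VECTORS = [[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]]
LABELS = [-1, +1]
ALPHAS = [1.0, 1.0]
BIAS = 0.0

print(classify([1.9, 2.0, 2.1], SUPPORT_VECTORS, LABELS, ALPHAS, BIAS))
print(classify([0.1, 0.0, -0.1], SUPPORT_VECTORS, LABELS, ALPHAS, BIAS))
```

A larger gamma makes the kernel decay faster with distance, so each support vector influences a smaller neighbourhood of the feature space.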
RF is an ensemble classifier proposed by Breiman (2001) that forms a powerful classification model by combining a number of individual tree predictors. It is an efficient method, as significant improvements in classification accuracy can be achieved by letting the individual trees vote for the most popular class. The generalization error of an RF classifier depends on the strength of the individual trees in the forest.
Individual trees in the decision forest are constructed from samples drawn from the dataset by bootstrap sampling. Each tree is then grown from the selected samples with the Classification and Regression Tree (CART) method, which uses the Gini index to choose the optimum split. The Gini index is an impurity-based measure that calculates the differences between the probability distributions of the target attribute values for a specific dataset (Han et al., 2012):

$$\mathrm{Gini}(A) = 1 - \sum_{i} p_i^2$$

where $p_i$ is the probability of class i in the dataset A. The Gini index is calculated for a binary split on each attribute. In order to obtain the best split for an attribute, all possible subsets are evaluated. The Gini index for the divided dataset can be calculated using (8):

$$\mathrm{Gini}_{split}(A) = \frac{|A_1|}{|A|}\,\mathrm{Gini}(A_1) + \frac{|A_2|}{|A|}\,\mathrm{Gini}(A_2) \qquad (8)$$

where $A_1$ and $A_2$ are the two subsets produced by the split. The minimum value of the Gini index provides the best binary split. This process is repeated for each attribute. In the decision process, the class estimates of all trees are examined and the class with the most votes is taken as the result of the decision forest.
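The split criterion can be illustrated with a short sketch. It computes the Gini index of a label set, the weighted Gini index of a binary split, and the best threshold on a single numeric feature by exhaustive search over midpoints. The feature values and class labels are invented, and the function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini index of a label set: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(left, right):
    """Size-weighted Gini index of a binary split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Return (threshold, gini) for the midpoint split with the lowest Gini index."""
    candidates = sorted(set(values))
    best = (None, float("inf"))
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2.0
        left = [lab for val, lab in zip(values, labels) if val <= threshold]
        right = [lab for val, lab in zip(values, labels) if val > threshold]
        score = gini_split(left, right)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical feature values (for example one depth invariant index) and labels.
feature = [-1.5, -1.25, -1.0, 0.5, 0.75, 1.0]
classes = ["sand", "sand", "sand", "seagrass", "seagrass", "seagrass"]

print(gini(classes))
print(best_threshold(feature, classes))
```

A split that separates the two classes completely has a weighted Gini index of zero, which is the minimum the search can reach.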
There are two parameters in the RF model, namely the tree count and the number of random variables. The dataset is split into training and test sets for tree development. The development of a classification tree in the RF model can be summarized as follows:
• N samples are drawn at random, with replacement, from the N samples of the original training dataset.
• Suppose that the number of input variables is M. A number m < M of variables is selected at random at each node. The value of m is constant for every tree in the forest.
• The trees are grown as large as possible.
These steps are repeated until the specified tree count is reached. The error rate of the forest depends on the strength of each individual tree and on the correlation between any two trees. If the correlation between two trees increases, the forest error rate increases with it. Likewise, if the strength of each individual tree in the forest increases, the error rate decreases. Both the correlation and the strength depend on the value of m. Smaller values of m lead to lower correlation and lower strength, while larger values of m have the opposite effect.
In this study, RF classification was carried out in the MATLAB environment with a tree count of 40 and 2 random variables per node. These values were determined empirically.
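The three ingredients of forest construction (bootstrap sampling, a random subset of m variables at each node, and majority voting) can be sketched as follows. This is an illustration only, not the MATLAB implementation used in the study: tree growing itself is omitted, the tree count of 40 and m = 2 are taken from the text, the three input features are an assumption (for example three depth invariant indices), and the votes are invented.

```python
import random
from collections import Counter

N_TREES = 40     # tree count reported in the text
M_TRY = 2        # random variables per node reported in the text
N_FEATURES = 3   # assumed number of input features


def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement from range(n_samples)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def candidate_features(n_features, m, rng):
    """Pick m distinct feature indices to be considered at one node."""
    return rng.sample(range(n_features), m)


def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# One bootstrap sample per tree; each tree would be grown on its own sample.
samples_per_tree = [bootstrap_indices(10, rng) for _ in range(N_TREES)]

# Feature subset that a single node would search for its best split.
node_features = candidate_features(N_FEATURES, M_TRY, rng)

# Hypothetical per-tree predictions for one pixel.
votes = ["seagrass"] * 25 + ["sand"] * 15
print(node_features, majority_vote(votes))
```

Because sampling is done with replacement, each bootstrap sample usually contains repeated indices and leaves some of the original samples out.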

Results and Discussion
The seagrass distribution map of Gokova Bay was produced from WorldView-2 satellite imagery using the RF and SVM machine learning methods. The classification accuracies and Cohen's kappa coefficients were calculated as 94% and 0.89 for RF and 72% and 0.61 for SVM, respectively. According to Landis and Koch (1977), kappa values of 0.81-1.00 and 0.61-0.80 are referred to as 'Almost Perfect' and 'Substantial', respectively. While RF produced promising results, the SVM method was not able to classify the continuous seagrass distribution with a satisfactory level of accuracy.
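For reference, overall accuracy and Cohen's kappa can both be derived from a confusion matrix, as in the minimal sketch below. The confusion matrix is invented for illustration and is not the one obtained in this study; only the two agreement labels quoted above come from the text, and the remaining bands follow Landis and Koch (1977).

```python
def overall_accuracy(matrix):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


def cohens_kappa(matrix):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    total = sum(sum(row) for row in matrix)
    observed = overall_accuracy(matrix)
    # Chance agreement from the row and column marginals.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(len(matrix))
    ) / total ** 2
    return (observed - expected) / (1.0 - expected)


def landis_koch_label(kappa):
    """Agreement label after Landis and Koch (1977)."""
    if kappa > 0.80:
        return "Almost Perfect"
    if kappa > 0.60:
        return "Substantial"
    if kappa > 0.40:
        return "Moderate"
    if kappa > 0.20:
        return "Fair"
    if kappa >= 0.0:
        return "Slight"
    return "Poor"


# Hypothetical two-class confusion matrix (rows: reference, columns: predicted).
confusion = [[45, 5],
             [5, 45]]

print(overall_accuracy(confusion), cohens_kappa(confusion))
print(landis_koch_label(0.89), landis_koch_label(0.61))
```

Kappa is lower than overall accuracy because it discounts the agreement that would be expected by chance from the class proportions alone.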
Classification results are given for four subset areas of Gokova Bay where seagrass is relatively dense. Classification results for the northern study area are given in Figure 3. SVM was not successful at extracting seagrass coverage. In contrast, RF produced better results even though the seagrass in this region is not very dense.
Classification results of Region 2 (southeast of the study area) are given in Figure 4. Both classification methods extracted seagrass coverage parallel to the shore. This kind of seagrass distribution may be due to the steeply sloping seafloor morphology. SVM performed relatively well in this region.
Region 3 is where the densest seagrass occurs. This may be because this region is the least exposed to waves in the study area. Classification results of Region 3 (south of the study area) are given in Figure 5. Even in dense seagrass conditions, SVM was not able to extract a continuous seagrass distribution. Both classification techniques extracted a patchy seagrass distribution in Region 4 (southwest of the study area).

[Figure: classification results for (a) random forests and (b) support vector machines, and Region 4 (southwest of the study area) for (c) random forests and (d) support vector machines]

A total area of 8.6 km² has been classified. The seagrass coverage area has been calculated as 24.63 ha by RF and 10.75 ha by SVM. Traganos and Reinartz (2018a) also compared the performances of SVM and RF for P. oceanica mapping. In their case, SVM performed better than RF. Therefore, it is safe to say that our results are not universal. The performance metrics may instead depend on the imagery used, the season and the water characteristics of the study area. Additionally, Poursanidis et al. (2019) investigated the effect of the coastal band of Sentinel-2 on P. oceanica classification. They stated that the coastal band improves accuracy and can also extend the maximum depth that can be observed. The WorldView-2 imagery used in our study also includes a coastal band, which can be exploited in future studies.
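The reported coverage figures can be related to pixel counts with simple arithmetic. The sketch below assumes that the classified maps keep the 2 m pixel size of the input bands, which is an assumption rather than something stated for the output maps; the pixel counts it produces are derived values, not figures reported in the study.

```python
PIXEL_SIZE_M = 2.0      # assumed pixel size of the classified map (metres)
STUDY_AREA_KM2 = 8.6    # classified area reported in the text
M2_PER_HA = 10_000.0


def pixels_to_hectares(n_pixels, pixel_size_m=PIXEL_SIZE_M):
    """Area in hectares covered by n_pixels square pixels."""
    return n_pixels * pixel_size_m ** 2 / M2_PER_HA


def hectares_to_pixels(area_ha, pixel_size_m=PIXEL_SIZE_M):
    """Number of square pixels needed to cover area_ha hectares."""
    return round(area_ha * M2_PER_HA / pixel_size_m ** 2)


def share_of_study_area(area_ha, study_area_km2=STUDY_AREA_KM2):
    """Percentage of the study area covered (1 km^2 = 100 ha)."""
    return 100.0 * area_ha / (study_area_km2 * 100.0)


for method, area_ha in (("RF", 24.63), ("SVM", 10.75)):
    pixels = hectares_to_pixels(area_ha)
    share = share_of_study_area(area_ha)
    print(f"{method}: {pixels} pixels, {share:.2f}% of the study area")
```

Under this assumption, the RF and SVM coverage areas correspond to roughly 61,575 and 26,875 seagrass pixels, or about 2.9% and 1.25% of the classified area.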
Recent seagrass studies rely on machine learning methods rather than formerly state-of-the-art methods such as maximum likelihood and minimum distance classification. However, although a limited number of seagrass studies use deep learning with UAV imagery (Martin-Abadal et al., 2018) and underwater videos (Martin-Abadal et al., 2019), to the best of our knowledge no study to date has applied recently trending deep learning methods to satellite imagery for seagrass mapping. Since information about seagrass coverage is scarce and deep learning requires a large amount of training data, this creates a bottleneck for applying deep learning techniques in benthic studies.

Conclusions
In this study, seagrass coverage maps have been produced from WorldView-2 imagery using RF and SVM supervised machine learning classification and in-situ survey samples in Gokova Bay, which is located on the Mediterranean coast of Turkey. The study shows that WorldView-2 imagery can be exploited for benthic studies. The results also show that the RF method produced more accurate results and extracted more continuous seagrass coverage. Even though the SVM method is able to provide results with moderate accuracy, visual inspection and post-processing reveal that the SVM results are not reliable.
The machine learning classifications were carried out using the RGB bands. WorldView-2 also provides a coastal band. Our future work is to include the coastal band in the classification phase to investigate its effect on performance and accuracy.