Seagrass and hydrographic data for the Mediterranean Sea

The dataset includes 1,771 locations of major seagrass families (Cymodoceaceae, Zosteraceae, Posidoniaceae, Hydrocharitaceae, Ruppiaceae), which are further divided into the species they include, as well as 1,284 locations of seagrass absence (algorithmically produced), in the Mediterranean Sea. For each location, 217 biological, chemical, physics, and human related parameters are available, which were merged from other publicly available data sources. As the most comprehensive dataset for seagrass in the Mediterranean to date, it is suitable for data analysis and machine learning. For more insight, please see “Seagrass Detection in the Mediterranean: A Supervised Learning Approach” (Effrosynidis et al., 2018). The dataset is available on Mendeley Data (Effrosynidis, 2019).


Data
The present article delivers data for seagrass presence observations in the Mediterranean Sea ( Fig. 1) in combination to a set of parameters (Table 2) describing the local environmental, ecosystemic and human impact forcings. It was produced after geo-locating, sub-setting, aggregating, and compiling "big geo-referenced datasets" maintained from various openly available data sources. This posed a major challenge, especially as these data consisted of blended in-situ observations and assimilated numerical model outcomes, differing in spatial resolution. In addition, a seagrass absence dataset was algorithmically developed (Fig. 2), following certain rules and hypothesis. A Python code was developed following a set of well-defined rules to generate points exhibiting the absence in seagrass meadows.
The dataset has a tabular format, with each row representing an observation point and each column representing the different variables describing this point. There are also two columns that contain the coordinates of each data point, an identification number (ID column), a column that states in which class this entry belongs (absence or the name of the seagrass species), and a similar column that groups species into families (Table 1).
Seagrass beds are considered as highly-valued components of coastal ecosystems related to nutrients cycling, carbon sequestration, food-web structure and ecosystem services [3e5]. Understanding the non-linearities hidden in the relations of seagrass presence with critical physio-chemical factors may aid the assessment of climate change effects.

Experimental design, materials, and methods
In this section we analyze in detail how we created the dataset using external datasets for the seagrass points and the environmental variables. Value of the data This dataset can be used in machine learning classification tasks to classify an unknown location having seagrass or not, and if it has, in which seagrass family it belongs to. This dataset can also help in finding which variables are most important when classifying seagrasses and under which conditions each seagrass family favors. Provided data can be used to resolve conditions related to seagrass presence and absence and the differences between them.

Mediterranean sea
The data points for seagrass abundance are located over the Mediterranean Sea, with margins from À5.60 to 36.29 E in longitude and 30.18 to 45.97 N in latitude The initial geo-referenced dataset on seagrass species distribution was provided by the UNEP-WCMC (United Nations Environment Programme -World Conservation Monitoring Centre) [6]. The Mediterranean Sea was selected because of i) the high number of observations compared to other regions, ii) the availability of high spatial resolution variables from external databases, and iii) the fact that there are five major seagrass families present, with one of them (Hydrocharitaceae), invading the Mediterranean Sea since the opening of the Suez Channel. However, the same procedure could be followed for any other region on the globe, having sufficient external data to aggregate. In this database a limited number of problematic points was identified, based mostly on the depth zone distribution and the distance to coast. For example, there were some points geo-located at extreme depths, impossible for any seagrass species to live. As these points were very few, and this is a real-world dataset where the measuring instruments are not perfect and imprecise, we did not exclude them and they can be used for outlier analysis. So, all the observations are near the coasts, with few exceptions.

Dataset creation
Several steps were taken to reach the final form of the dataset. The first step was to obtain the seagrass presence reviewed data. The UNEP-WCMC global biodiversity standardized database [6] was  used as a source in this scope. This dataset illustrates the global distribution of seagrass species and is given through a geo-referenced shapefile. Data were filtered and only instances located in the Mediterranean Sea were kept (Fig. 1).
The initial dataset had, for each data point, the seagrass species. There were 17 observations with the tag "unspecified" where the name of the seagrass species should be. We excluded these points from our final dataset and created a new column by grouping species into families. The names of the species that were grouped into families can be seen in Table 1.
Next, for each observation point, 217 environmental variables were extracted from several publicly available hydrographic databases, shown in Table 2. Selecting the most appropriate environmental variables is considered an important task in determining the distribution of seagrass taxa. In this work we selected the most appropriate environmental parameters based on an extended literature review and taking the opinion of various seagrass experts. Although, several local/regional studies on the subject exist documenting the impact of environment to the presence and species of seagrass, the relative influence of environmental drivers, ecosystem components and human impacts was never previously reported.
For the extraction of the temporally-changing hydrographic variables, the Copernicus Marine Environmental Service (CMEMS) database [7] was used, as it is the most comprehensive platform for the Mediterranean Sea. It contains re-analyzed hydrographic (water temperature, salinity, currents, waves, etc.) and water quality data (nutrients, dissolved oxygen, chlorophyll-a, etc.) produced in gridded form by numerical models (the hydrodynamic NEMO-OPA and the biogeochemical 3DVAR-OGSTM-BFM models). Model outcomes assimilated the data collected by on-site sensors and satellites. The final products are oceanographic data in netCDF format, annually-updated, covering the whole Mediterranean Sea reporting mean-monthly values per model cell.
These data were imported into a Geographic Information System (QGIS). Monthly mean data were extracted for the year 2015 (considered as a typical year) for both surface and seabed, and the value of the raster pixel that its centre was closer to a seagrass point geo-location was kept. The year 2015 was chosen because it was the most recent year that the external data sources had in In the cases were seabed values are also considered, the variables are doubled to 38. Based on our experience with machine learning on such data, it seems that features based on seasonal mean, and yearly min, max, provide sufficient granularity and information for learning models. Raw time series, e.g. day-to-day temperature, are less important for seagrass growth than its min/max extremes. Table 3 presents the CMEMS Product used for this dataset and its spatial resolution of the netCDF files.
The static variables were extracted from several sources. The EMODnet (European Marine Observation and Data Network) [8] platform provided the bathymetry and the substrate variables. EMODnet bathymetric data consist of data from bathymetric surveys, hydrographic composite datasets and The General Bathymetric Chart of the Oceans (GEBCO) digital bathymetry data on which digital terrain models (DTMs) have been applied. It is the most reliable and comprehensive dataset presently available for the Mediterranean Sea. Similarly, EMODnet habitat dataset consists of an interoperable product assembling individual point datasets from seabed habitat surveys, maps and models from various sources, aiding to assess the environmental state of ecosystems. The level of data quality varies, with more reliable data those produced by surveys. Same as before, the value that was closer to the seagrass points was extracted for the present dataset.
For the distance-related variables, expressing the human impact on seagrass distribution, external datasets were used. The geo-locations of interest were mapped and the distance of each seagrass observation point to the closest point of human influence (port, coastal city, etc.) was computed using the haversine distance as is the most appropriate because it determines the great-circle distance between two points on a sphere given their longitudes and latitudes. For cities, two datasets were used, one for all major cities [9] and another for all coastal communities. 1 The same strategy was followed for the distance to river mouths using two additional datasets [9,10]. A dataset for ports 2 and a dataset for coasts [11] were finally used to create these fields in our dataset. All distances were reported in degrees as it is the default distance metric in QGIS.
The final step was to clear the dataset from incomplete data and outliers. So, we removed rows that had a significant amount of missing data and replaced the outliers with the average of the column.

Absence dataset
Because there are not publicly available seagrass absence datasets, we proceeded in creating our own following some rules and a hypothesis. We claim that there is a high probability, points near the seagrass presence dataset had already been checked and no seagrass was reported. If there was seagrass, it would had been in the presence dataset already. We do not check for seagrass absence in locations far from locations with seagrass abundance. Therefore, the issue of under-sampling at certain parts of the Mediterranean is not affecting the dataset and the analysis followed. According to our previous research [1], if the artificially created dataset would contain any uncertainty, it would not affect the classification measures significantly, even for a 20% error case.
Based on the above assumption, a Python code was developed and applied on Q-GIS following a set of well-defined rules to generate points exhibiting an absence in seagrass. We used 2 files: the first is the seagrass dataset and the second is the data of any temporal environmental variable obtained from CMEMS. The main purpose of the code is to do comparisons of distances between the longitudes and latitudes of the points of these files in order to discover valid absence points. The distance is calculated using the haversine distance. As CMEMS data are gridded and geo-referenced, environmental data were assigned at the centre point of each CMEMS pixel. As it has already been shown in Table 3, different CMEMS products have different resolutions, so depending on the choice, the absence points can be denser or not. We used the Water Temperature shapefile. At this point we have the longitudes and latitudes of the initial seagrass presence points, and the longitudes and latitudes of the centers of the gridded CMEMS data. For each point of the initial seagrass dataset, we searched for the closest point from the gridded data that has not already found to be in lack of seagrass by previous iterations. A python list was used to keep track of which points were already checked. There is also a restriction that forbids a point to be selected if its distance from the nearest shoreline is longer than 10 km. So, if many seagrass presence points are close to each other in an area, the probability of occurrence of absence points is low. Researchers can follow the same approach to generate their own absence points as it is very easy to implement and requires only two parameters; the choice of the gridded dataset and the distance that restricts points to be considered as absence.
Following the above procedure, 1,284 points of seagrass absence were created and the final dataset consists of a total of 3,055 points as presented in Fig. 2.

Dataset analysis
The final dataset has 3,055 entries with 1,771 records indicating seagrass presence and 1,284 records representing seagrass absence. For each observation, 217 environmental variables are reported.   The only categorical variable is the substrate and its values as given by the external dataset are presented in Table 4. The statistics for the other static variables are shown in Table 5. Bathymetry is measured in meters and all other variables in the respective length of a degree of longitude.
Having many temporal variables, in Table 6 we present only the spatial statistics for the annual average values at sea surface.
Finally, we display the Spearman's rank between the variables of the dataset as it is a standard first step for machine learning engineers to check for the correlation between variables (Fig. 3). For the temporal variables, the annual average values were used.