Modeling Macroalgal Forest Distribution at Mediterranean Scale: Present Status, Drivers of Changes and Insights for Conservation and Management

Macroalgal forests are one of the most productive and valuable marine ecosystems, but yet strongly exposed to fragmentation and loss. Detailed large-scale information on their distribution is largely lacking, hindering conservation initiatives. In this study, a systematic effort to combine spatial data on Cystoseira C. Agardh canopies (Fucales, Phaeophyta) was carried out to develop a Habitat Suitability Model (HSM) at Mediterranean scale, providing critical tools to improve site prioritization for their management, restoration and protection. A georeferenced database on the occurrence of 20 Cystoseira species was produced collecting all the available information from published and grey literature, web data portals and co-authors personal data. Data were associated to 55 predictor variable layers in the (ASCII) raster format and were used in order to develop the HSM by means of a Random Forest, a very effective Machine Learning technique. Knowledge about the distribution of Cystoseira canopies was available for about the 14% of the Mediterranean coastline. Absence data were available only for the 2% of the basin. Despite these gaps, our HSM showed high accuracy levels in reproducing Cystoseira distribution so that the first continuous maps of the habitat across the entire basin was produced. Misclassification errors mainly occurred in the eastern and southern part of the basin, where large gaps of knowledge emerged. The most relevant drivers were the geomorphological ones, followed by anthropogenic variables proxies of pollution and urbanization. Our model shows the importance of data sharing to combine a large number of spatial and environmental data, allowing to individuate areas with high probability of Cystoseira occurrence as suitable for its presence. This approach encourages the use of this modeling tool for the prediction of Cystoseira distribution and for supporting and planning conservation and management initiatives. The step forward is to refine the spatial information of presence-absence data about Cystoseira canopies and of environmental predictors in order to address species-specific assessments.


INTRODUCTION
Increasing human pressures such as coastal development, habitat destruction, pollution, maritime traffic, fisheries and illegal fishing together with climate change are strongly affecting the distribution of marine coastal species and habitats (Claudet and Fraschetti, 2010;Coll et al., 2010). New and emerging uses of marine resources (e.g., seabed mining, aquaculture) are also expected as additional sources of disturbance for marine coastal ecosystems (Wolff et al., 2018). This changing scenario calls for science-based information to understand the processes driving present trajectories of ecological change. In this framework, a solid knowledge of the distribution of species and their habitats over large spatial and temporal scales is critical to support all stages of marine spatial planning, to inform action prioritizations for scientists and decision-makers and to provide guidance for sustainable exploitation of marine resources, minimizing the negative impacts of present and future human activities (Douvere, 2010;Levin et al., 2014;Martin et al., 2014).
Recently, increasing efforts to carry out systematic collections of spatial data have been conducted from different areas of the world with significant progresses for a variety of species and habitats (Duarte, 2002;Orth et al., 2006;Waycott et al., 2009;Van Soest et al., 2012;Yesson et al., 2017). In Europe, a network of organizations is currently working together to integrate and share information to combine available marine data across EU countries (European Marine Observation and Data Network, EMODnet 1 ). In this respect, the Mediterranean Sea should be a hub of information: it has been intensely studied since the last century and several initiatives have been carried out to document and combine the available knowledge about the occurrence and status of key target species/habitats such as coralligenous outcrops and maërl beds (Martin et al., 2014), Posidonia oceanica meadows (Telesca et al., 2015), coral assemblages (e.g., Cladocora caespitosa) (Chefaoui et al., 2017), sea pens (Bastari et al., 2018), and gorgonian species (Boavida et al., 2016;Ghanem et al., 2018). Despite these efforts, Gubbay et al. (2016) document a substantial lack of quantitative information on definition, distribution and temporal trends of the status of most of Mediterranean habitats: data collected has often limited spatial and temporal scope, 1 http://www.emodnet.eu/ scattered over different institutions in small datasets for specific species groups or habitats (Portman et al., 2013), with important gaps in different levels of information, such as the evaluation of conservation status (Coll et al., 2012(Coll et al., , 2010. Real data very often do not cover large spatial extensions so that modeling approaches and extensive extrapolations are needed to fill gaps in knowledge (Martin et al., 2014). Different tools have been used to address a deeper understanding of large-scale distribution of species and to fill gaps in actual knowledge, in relation to the type and resolution of available data (presence-only data, presence-absence data and data on predictive environmental variables) (Guisan and Zimmermann, 2000). Maximum entropy (Maxent algorithm), Random Forest (RF), generalized linear and additive models (GLMs and GAMs) are some of the modeling techniques used in these studies to develop predictive occurrence maps for target species/habitats (hereafter referred to as Habitat Suitability Models, HSMs).
All these examples, and others not listed here, demonstrate the wide range of applications for which systematic conservation planning can be applied at different scales, based on data from field surveys, expert knowledge and model-based estimations of species distribution (Guisan and Zimmermann, 2000;Levin et al., 2014). Despite limitations and associated uncertainties, HSMs can be a cost-effective approach to integrate real data, as they can help documenting where sensitive marine species and habitats are expected to occur (McArthur et al., 2010;Gorman et al., 2013;Martin et al., 2014), and predicting their possible shifts in distribution under global climate change (Guisan and Wilfried, 2005;Rodríguez et al., 2007;Martínez et al., 2018).
Macroalgal forests represent a paradigmatic example of key threatened benthic habitat featured by sparse but increasing spatial information, deserving further efforts to improve the management of those pressures determining their increasing loss across the Mediterranean and elsewhere (Benedetti-Cecchi et al., 2001). Macroalgal forests are one of the most productive and valuable, yet undervalued habitats, undergoing dramatic changes (Mangialajo et al., 2008;Thibaut et al., 2005Thibaut et al., , 2015Blanfuné et al., 2016;Mancuso et al., 2018). Along temperate rocky coasts worldwide, large canopy-forming algae (Laminariales and Fucales) were dominant in both intertidal and subtidal habitats, providing shelter, food and nursery areas to a multitude of marine communities, increasing three-dimensional complexity and spatial heterogeneity of rocky substrates, enhancing biodiversity and productivity in coastal ecosystems (Ballesteros et al., 1998;Fowler-Walker and Connell, 2002;Steneck et al., 2002;Cheminée et al., 2013;Gianni et al., 2013;Gorman et al., 2013;Piazzi et al., 2018). However, the cumulative impacts of local anthropogenic pressures combined with other global stressors are driving the decline of brown algae and their associated communities in several regions of the world subject to increased human threats (Airoldi and Beck, 2007;Lamela-Silvarrey et al., 2012;Sala et al., 2012;Gianni et al., 2013;Strain et al., 2014;Mineur et al., 2015;Thibaut et al., 2015;Bianchi et al., 2018).
In the Mediterranean Sea, the genus Cystoseira C. Agardh (Fucales, Phaeophyta) is one of the most representative of the Sargassaceae family and includes a total of 45 taxa (Guiry and Guiry, 2010) with habitat-forming species dominating several assemblages from the very shallow to deep waters (−50 m) (Feldmann, 1937;Giaccone and Bruni, 1973;Ballesteros et al., 1998;ESRI, 2012;García-Fernández and Bárbara, 2016). Reductions in their cover and biomass, prompted by the integration of multiple natural and artificial effects, promote their replacement by turf-forming, filamentous or other ephemeral seaweeds (Murray and Littler, 1978;Benedetti-Cecchi et al., 2001Thibaut et al., 2005;Ballesteros et al., 2007;Pinedo et al., 2007;Airoldi et al., 2008;Connell et al., 2008;Falace et al., 2010;Perkol-Finkel and Airoldi, 2010;Fraschetti et al., 2011;Giakoumi et al., 2012). Moreover, macroalgal forests can be overgrazed to barrens by urchins or fish (Fraschetti et al., 2001;Sala et al., 2011;Guarnieri et al., 2014;Vergés et al., 2014). Due to their role in supporting biodiversity and food webs, the loss of these long-lived brown algae is leading to a decrease in critical ecosystem services such as fisheries (Thibaut et al., 2015;Buonomo et al., 2018) and also to a reduction in the potential to sequestrate carbon dioxide and to mitigate climate change.
There is an increasing attention toward the state of macroalgal forests from both a conservation (Annex II of the Barcelona Convention, COM/2009/0585/FIN) and a restoration point of view (MERCES project, 2016 2 ;AFRIMED project, 2018, 3 ), to better understand the potential of reverting present trajectories of change through active restoration. Furthermore, Cystoseira assemblages are being considered as habitats of critical importance for the EU (Directive 92/43/EEC; Annex I, included in "Rocky reefs") and as indicators to assess ecological status in the context of the Water Framework Directive (WFD; Directive 2000/60/EC). Despite the increasing interest, there are still many information gaps in the spatial distribution of Cystoseira canopies across the Mediterranean Sea and the drivers involved in the observed shifts. Understanding causal relationships and filling these gaps of knowledge is a crucial step to reverse current patterns of regression.
The aims of this study are (1) to synthesize knowledge about the distribution of Cystoseira species and (2) to develop a HSM for Cystoseira species living in the shallow rocky substrates through the Random Forest technique (RF) (Breiman, 2001), considering areas where both presence and absence data were available at Mediterranean scale. RF is a Machine Learning technique which, starting from a set of selected predictor variables and combining an ensemble of classification trees, is able to identify suitable and unsuitable areas for holding Cystoseira species. The developed HSM has been used to identify environmental predictors which are related to Cystoseira spatial distribution and to predict the occurrence of Cystoseira canopies at locations where information was not available. Our model can be regarded as a valuable tool for the assessment of species distribution on shallow rocky shores to guide their management, conservation and active restoration.

Georeferenced Data for Cystoseira Species
A systematic review was conducted consisting of three steps: (1) articles identification using two databases [ISI Web of Science (WOS) and Scopus] searched for the 1985-2018 time frame (cut-off date 31 March 2018), (2) abstract screening, and (3) review of pertinent articles. The aim of this activity was to collect all the information about the georeferenced occurrence of the genus Cystoseira at Mediterranean scale. The systematic literature screening was carried out by searching in the "Title, " "Abstract, " and "Keyword" fields using Web of Science Core Collection. The following combination of terms was included in the search: ("Cystoseira" OR "Cystoseira canopies" OR "Fucales" OR "brown algae" OR "macroalgal forest * " OR "habitat form * ") AND ("distribution" OR "occurrence" OR "shift" OR "habitat loss" OR "decline") AND "Mediterranean Sea." We also searched the citation lists of selected articles, using the same search terms. Supplementary Figures S1, S2 show respectively the number of publications per year and the number of publications per country obtained from the literature screening. Unpublished information, gray literature and maps have been also searched and cataloged (Supplementary Table S1).
The georeferenced database with the spatial distribution of Cystoseira across the Mediterranean Sea, initially produced in the framework of the FP7 EU project CoCoNet (FP7, Grant agreement no: 287844), was also used. In addition, coauthors of this manuscript personally contributed with their data (Supplementary Table S2). New data were also acquired from the monitoring program CARLIT (CARtography of LITtoral and upper-sublittoral benthic communities) . The EMODnet biology data portal (Guardia, 2018 4 ) which contains a dataset on the distribution of Cystoseira across the Mediterranean Sea, the Black Sea and the Eastern Atlantic Ocean was also used.
Most collected data were only available in.jpeg or.pdf format, or only as a description in a text. Therefore, this information was digitized as shapefile points or polylines in order to be associated with a map, using the Open Source QGIS software (QGIS Development Team, 2018. QGIS Geographic Information System. O3.Open Source Geospatial Foundation 5 ). The two resulting vector shapefiles of points and polylines showed the georeferenced distribution of Cystoseira species along Mediterranean coasts. Each data entry was accompanied by the geographic coordinates, data origins and data providers, sampling method, date and depth, publication date and which species of the genus were sampled, when this information was available.
Absence records, which were only available for a limited number of locations, were assembled in a line shapefile mainly generated from maps found in the collected articles, but also from expert opinion.
In order to develop a HSM for Cystoseira canopies across the Mediterranean Sea, we firstly extracted from the whole dataset only those records documenting the presence of species in the shallow rocky substrates, excluding those corresponding to species that are only found at deeper stands: C. algeriensis Feldmann, C. corniculata (Turner) Zanardini, C. crinitophylla Ercegovic, C. dubia Valiante, C. montagnei J. Agardh, C. sauvageauana Hamel, C. schiffneri Hamel, C. sedoides (Desfontaines) C. Agardh, C. squarrosa De Notaris, C. tamariscifolia (Hudson) Papenfuss, C. zosteroides (Turner) C. Agardh. Occurrence data of C. amentacea (C. Agardh) Bory, C. barbata (Stackhouse) C. Agardh, C. brachycarpa J. Agardh, C. compressa (Esper) Gerloff & Nizamuddin, C. crinita Duby, C. elegans Sauvageau, C. foeniculacea (Linnaeus) Greville, C. humilis Schousboe ex Kützing and C. mediterranea Sauvageau were included in the model subset, together with all records collected as Cystoseira spp. produced by sampling surveys in which the CARLIT method was applied, since it is focused on the identification of shallow rocky substrates .
These data, combined with the available absence records, were assembled in a single vector shapefile. Then the vector layer was converted to the (ASCII) raster format, using the same procedure, grid resolution, geographical extent and coordinate system as the layers of the predictor variables which were selected as input to the model (see next section). The resulting Cystoseira raster layer was featured by values covering the entire Mediterranean coastline and was composed of three types of pixels: "absence" pixels (coded as 1), "presence" pixels (coded as 2) and "no-data" pixels (coded as 3), with the last one corresponding to all the sections of Mediterranean shoreline where no information was available.

Modeling Cystoseira Occurrence: Selection of Predictor Variables
A set of 55 predictor variables was associated with the dataset with the occurrence of the Cystoseira spp. as input to the modeling procedure. Most of these predictor variables derived from Halpern et al. (2008) and Wolff et al. (2018). Some have been obtained from data geoportals, e.g., the Copernicus Marine Environment Monitoring Service 6 , and others were instead based on GIS calculations (the complete list of predictor variables, together with their sources, is shown in the Table 1).
Predictor layers were selected on the basis of the coverage of the information they provided. We only used variables with data coverage over the entire Mediterranean basin and with the highest available resolution. Following this criterion, we collected values for environmental variables (e.g., wind stress and the waves energy along the coastline) and anthropogenic variables (e.g., pollution, population density and shipping intensity). All the 55 selected layers were converted to a common raster format, having as geographical extent the Mediterranean Sea, in WGS 84 coordinate system and with a resolution of 0.004166 decimal degrees (i.e., each pixel was about 460 m along the latitudinal axis, and from about 330 m to 380 m along the longitudinal one, depending on latitude). The rasterization process was carried out using the package "raster" (Hijmans, 2017. R package version 2.6-7 7 ) in the R open source data analysis software (R Core Team, 2016 8 ). The same raster format was applied to Cystoseira spp. occurrence data to ensure the appropriate matching with the predictor layers.
Furthermore, since most layers coming from the above sources did not provide data for the shoreline, where Cystoseira species potentially occur, we developed a procedure to estimate the value of the pixels of the shoreline. We considered the value of the first non-empty pixel of the layer for qualitative predictors and the mean value of neighbor pixels for quantitative predictors, within a search radius of 10 pixels at most. In particular, the mean value was obtained on the Moore neighborhood or, otherwise, on the smallest frame with at least a non-empty pixel. Thereby, all the predictor variables provided data for each pixel over the Mediterranean shoreline where Cystoseira records were distributed (presence, absence, and no-data).

Modeling Approach
From the Cystoseira raster layer we extracted a subset including only regions where both presence and absence data were available. Hence, the distribution of Cystoseira canopies across the Mediterranean Sea was modeled focusing on a dataset composed by 8,143 pixels: 5,475 "presence" pixels and 2,668 "absence" pixels. The Machine Learning (ML) method selected to model the distribution of Cystoseira canopies was the RF (Breiman, 2001). This technique, based on a set of classification trees, needs two subsets of data: a training set to tune the model and a test set to validate model performances. We split our dataset in a training and a test set (with the 20% of the dataset assigned to the test set and the remaining 80% to the training set) following the next steps: at first, we superimposed a grid of 0.50 decimal degrees square cells to the whole Mediterranean basin. Then, we randomly assigned cells to the test set excluding those containing less than 100 records (i.e., 100 pixels of the Cystoseira raster layer). This number corresponded to the first quartile of the distribution of records among all the cells. Only cells where the presence/absence records were not too unbalanced (no more than 80% of presence records) were selected in order to better reflect the overall characteristics of the dataset. Moreover, care was taken to avoid geographical  segregation in assigning cells, so as to be representative, as much as possible, of multiple environmental conditions occurring across Mediterranean regions. This procedure minimized the influence of spatial autocorrelation in the test procedure, as only at the boundaries of the test set cells a negligible number of pixels closely resembling those in the training set cells could be found. The RF predicted the class (i.e., presence or absence) of each record in the training and the test set taking the majority voting (the 50%, at least) of the overall trees which composed the forest. We trained several RFs through a Fortran 90 program obtained from the original Fortran 77 source code by Leo Breiman and Adele Cutler 9 . The only changes to the original code were in the input of training parameters from file and in the dynamic allocation of the arrays, while the algorithm and the output file format were not modified.
RFs, by combining multiple classification trees in a single output, require the tuning of different parameters which affect the forest growth with repercussions on model accuracy: the number of trees in the forest and the number of cases in the terminal leaves of the trees, which have to be tuned in order to minimize the generalization error and to avoid overfitting; the number of predictor variables randomly selected at each split, which, remaining constant during the forest growth, shows large effects on the strength of each individual tree and on the correlation between any pair of them (Breiman, 2001). RFs were grown using 250 and 500 trees in the forest, a number of predictor variables per split at each node of the trees ranging between 4 and 14 and six different number of cases in the terminal leaves of the forests (1, 10, 25, 50, 100, and 150).
Since our dataset was unbalanced in the number of presence and absence records, the cut-off value (t = 0.50) was optimized by analyzing the Receiver Operating Characteristic (ROC) curve (Zweig and Campbell, 1993), in order to find the best compromise between the predictions of true and false positives, i.e., between sensitivity and specificity of the model. Then, we calculated the Kappa statistics (Cohen, 1960) to identify the best model among those we trained, considering also the Area Under 9 https://www.stat.berkeley.edu/~breiman/RandomForests/cc_software.htm the ROC curve (AUC) as a measure of the overall accuracy of the models. Furthermore, measuring the Kappa statistics over the test set (built to be as independent as possible from the training set) implies evaluating the model robustness just as a null model does with parametric modeling approaches. Indeed, this coefficient measures the agreement among raters (or between model output and observed data) by assessing the deviation from random agreement (McHugh, 2012). Kappa statistics and ROC curve were calculated using the "caret" package (Kuhn, 2017. R package version 6.0-78 10 ) and "pROC" package (Robin et al., 2011. BMC Bioinformatics, 12, p. 77. doi: 10.1186/1471-2105 ) in the R environment.
The importance of predictor variables was assessed by comparing the permutation importance measure (or the Mean Decrease Accuracy, MDA) (Breiman, 2001) with the Gini importance measure (or the Mean Decrease Impurity, MDI) (Breiman, 2003), both calculated at the end of the training phase. In the first instance, cases out of the bootstrap samples used in the training phase of the RF (OOB records), are randomly permuted in the values of predictor variables. The difference between the misclassification rate of original and permuted OOB values, divided by the standard error, is used as a measure of the importance of predictor variables, as the increase in this difference is proportional to variable importance. In the second case, every time a node is split according to a predictor variable, the Gini Importance for the two descendent nodes is less than the one for the parent nodes. An alternate measure of variable importance is so provided by the total decrease in node impurity for each variable averaged overall the trees in the RF. In any case, the two measures of variable importance were often consistent with each other.
We also validated the performance of the selected model by getting the predictions for regions where only presence records were available, i.e., records that were not used either in the training or in the test set for the RF. The performance of the models on these records was evaluated by looking at the total number of trees in the forest that voted for presence.
At the end, the model was used as a tool to predicting presence and absence of Cystoseira species for areas where no information was available (i.e., no-data records). In the Supplementary Figure S3, a flow diagram illustrating the main steps from data access to HSM and the analysis of the variables relative importance is presented.

Distribution and Coverage of Cystoseira Species Across the Mediterranean Sea
The currently known distribution, obtained by combining line (presence: 12,968; absence: 564) and point (presence: 19,782) records, is shown in Figure 1, depicting areas where Cystoseira canopies are known to be either present or absent. Spatial coverage is geographically biased since most data derived from a specific monitoring activity within the EU Water Framework Directive on macroalgae (CARLIT index), carried out only on the western Mediterranean Sea. Types and number of collected records are specified for each of the 22 Mediterranean countries in the Supplementary Table S3. The table shows for which countries occurrence data were available and where information was reported at species level. In addition, both the presence and the absence shapefiles are available on request to the corresponding author, allowing the examination of finer details which were hard to represent in the map shown in Figure 1, aimed at giving an overview of the collected dataset.
Presence data, from both line and point records, covered 15 out of 22 Mediterranean countries. For only four out of these 22 countries absence data were also available (Albania, France, Italy, and Spain). Cystoseira canopies were found to be present along the Mediterranean coastline for 6,342.41 km out of a total coastal length of 46,000 km. Cystoseira absence, expressed as line records only, accounted for 1,328.27 km. No presence or absence data were available along the rest of the coasts of the basin. Obviously, the linear length estimates were only based on line records.
An exhaustive list of all the Cystoseira species included in the dataset, with their respective number of occurrences, is also provided in the Supplementary Table S4. The species with the highest number of available records are C. amentacea, C. mediterranea, and C. compressa which represent about the 40% of the dataset collectively, considering both point and line records. This information is mostly available for shallow rocky infralittoral habitat, where a lot of ecological studies and systematic monitoring have been conducted, depicting in this way a biased scenario of the presence of Cystoseira at basin scale. In addition, most studies did not provide information at species level and a total of 18,742 records were mapped at genus level as Cystoseira spp.
The Cystoseira raster layer, produced in order to develop the HSM for Cystoseira canopies based on a RF and obtained from the rasterization of the infralittoral species subset, was composed of 113,021 pixels stretched over all the Mediterranean coastline.
Of these, 100,609 were coded as "no-data" pixels, 9,744 were "presence" pixels and the remaining 2,688 were "absence" pixels. The subset mainly included records of canopy-forming species (90% of all records) but also records of species which may not be "forest"-forming (e.g., C. compressa).

Habitat Suitability Model for Cystoseira Canopies
To develop the HSM for Cystoseira canopies we used only data from areas where both presence and absence information were available. These areas included the Karaburun Peninsula and Sazani Island in Albania, Apulia and Sicily regions in Italy, Corsica and the southern coast of France, the eastern coast of the Iberian Peninsula with the Balearic archipelago in the western Mediterranean Sea. Figure 2 shows the aforementioned areas and, for reason of scaling, we illustrated the amount of the available data with the presence/absence ratio using cells of different dimensions and colors. Practically, from the "Cystoseira" raster layer we extracted a subset of data composed by 8,143 pixels: 5,475 "presence" pixels and 2,668 "absence" pixels. Then, we split this dataset in two further subsets, the training set and the test set, needed, respectively, to train and validate the RF model. The training set was composed of 6,531 records, of which 4,402 were reported as presence and 2,129 as absence. The remaining 1,612 records, divided into 9 separated Mediterranean areas, were assigned to the test set. Of these records, 1,073 were reported as presence and 539 as absence.
The modeling process was trained with all the 55 predictor variable layers (Table 1), associated with presence/absence data, since a RF is able to select the most relevant predictor variables out of the whole set of those available. Moreover, its efficiency is not impaired by correlations between variables as the best split at the nodes of each tree is selected from a random subset of them. This property is one of the strengths of the RF technique, which selects only relevant variables even in presence of noninformative ones (Ishwaran, 2007;Louppe et al., 2013;Catucci and Scardi, 2020), thus becoming insensitive to the collinearity issues that might hinder other modeling methods.
A total of 132 RFs, tuned with different combination of parameters, was trained. Model elected as the best was obtained from the following combination of parameters: 500 trees in the forest, 25 cases in the terminal leaves of the forest and 9 predictor variables per split at each node of the trees. RF's outputs were analyzed by the computation of Kappa statistics, which gave an evaluation of the overall accuracy of the models. Kappa statistics of the selected model, calculated for the default cut-off value (t = 0.50), were 0.919 for the training set and 0.573 for the test set. Since our dataset was unbalanced in the number of presence and absence records, we optimized the cut-off value through the analysis of the ROC curve based on the test set, in order to overcome the bias that affected RF predictions toward presence (given that around the 70% of records in the training set were for presence). The optimized cut-off value was found to be 0.61 (Figure 3B). It did not affect the Kappa statistics for the training set, which remained quite unchanged (K = 0.917), but allowed an improvement for the test set (K = 0.637). These  values resulted the largest in comparison with those obtained from any other RF we run and are indicative of a good model accuracy. Confusion matrices for the training and the test set calculated before and after the optimization of the cut-off value are shown in the Table 2. The optimized cut-off value improved predictions over false positives for the training set and, to a larger extent, for the test set. While the decrease in false positives was inevitably accompanied by an increase in false negatives, the overall accuracy of model predictions grew by adopting the optimal cut-off value.
Through the ROC curve analysis, we also examined the AUC value as indicator of model accuracy. As shown in Figure 3, the FIGURE 3 | ROC curves of the best RF trained with the AUC for both the training (A) and the test set (B). The optimal cut-off, represented by the red dot in the Figure 3B, was calculated on the analysis of the ROC curve based on the test set.  Figure 3A) and 0.875 for the test set ( Figure 3B). These values, which were not affected by the cut-off value, confirmed our evaluation of model output since they resulted considerably large not only for the training, but also for the test set.

The Relative Importance of Predictor Variables
Variable importance was assessed on the basis of the comparison between the permutation importance measure and the Gini importance measure, both calculated according to the original RF algorithm during the training procedure. From this analysis we identified a group of predictor variables which played a major role, although ranking differently, in increasing classification accuracy according to both the importance measures used (Figure 4). The most relevant predictor variables were the topographic coastal slope ("cst") and coast materials ("coast_material"), which both can be regarded as an expression of coastline geomorphology. Looking at the permutation importance measure (Figure 4A), the following factors linked to human pressures were identified among the variables mostly affecting the RF output: distance to FIGURE 4 | Most relevant variables according to importance values obtained from both the permutation (A) and the Gini importance measures (B). Name of the variables in alphabetical order: "ArableArea" = arable area (in km2); "botsalin" = bottom salinity; "calcite" = calcite concentration; "chmean" = chlorophyll a concentration (mean); "coast_material" = coast material classes; "cst" = topographic coastal slope (in degrees, 30 arc-seconds resolution); "dacmean" = diffuse attenuation coefficient; "disriver" = distance to river mouths; "dissox" = dissolved oxygen concentration; "dist200 m" = distance to 200 m isobath; "distport" = distance to ports; "GTSR_10" = height above mean sea level in 10 years (in m); "GTSR_100" = height above mean sea level in 100 years (in m); "maxWind" = max stress of wind from 2008 to 2017 (in Pa); "nitrate" = nitrate concentration; "oceacidif" = ocean acidification; "ph" = pH; "phosphate" = phosphate concentration; "silicate" = silicate concentration; "sstmean" = sea surface temperature (mean); "sstrange" = sea surface temperature (annual range); "UrbanArea" = urban area (in km2); "VerticalMovement" = vertical land movement (in mm/yr); "zeumean" = euphotic depth. ports ("distport") and coastal development expressed as urban areas ("Urban Area"). Variables related to natural pressures, as the distance to the 200 m isobath ("dist200 m"), the max stress of the wind ("maxWind") and the distance to river mouths ("distriver"), were also detected as important according to the permutation measure. On the other hand, the Gini importance measure ( Figure 4B) highlighted that some seawater physical variables, for instance the euphotic depth ("zeu mean"), the diffuse attenuation coefficient ("dac mean"), the mean of chlorophyll a ("chmean") and the nitrate concentration ("nitrate"), contributed significantly to the splits in the RF trees.

Model Validation and Purpose
While confusion matrices and statistics based on comparison between observed and modeled occurrences were evaluated on presence and absence records in the test set, we used all the presence records excluded from the training and test set to investigate model accuracy. These records corresponded to 4,269 "presence" pixels of the Cystoseira raster layer (Figure 5). For each record, the number of trees in the forest that voted for presence was analyzed. This number was then scaled into a (0,1) range and assumed, taking into account the optimized cut-off value, as estimates for Cystoseira presence to be analyzed in order to improve the assessment of the strength of model predictions. Thus, records associated with a RF output larger than 0.61 were considered correctly predicted, in accordance with the optimal cut-off value. We analyzed RF output values distribution by grouping presence records in nine Mediterranean regions aiming at assessing model accuracy on spatially independent areas (Figure 6). Distributions obtained from seven of these regions showed quite high probability of presence for most records, meaning that these areas were predicted to have suitable conditions for Cystoseira occurrence, consistently with the observed status. Sardinia resulted having the highest frequency of correctly predicted cases with the median RF output value around the 0.9. The west coast of north Italy, the eastern coast of Adriatic Sea, the Aegean Sea, the African coasts, the western and the southern areas of Mediterranean Sea followed immediately after Sardinia, with RF output concentrated between the 0.7 and the 0.8 classes. These areas were essentially predicted as suitable by the model, but with less confidence than the first one. For the remaining regions, i.e., the eastern Mediterranean regions and the northern Adriatic Sea, the number of cases in which trees voted for presence significantly decreased, pointing out that the model identifies these areas as almost unsuitable for holding Cystoseira species. Indeed, the median RF output value for the latter cases shifted between the 0.3 and the 0.6 (i.e., less than the optimal cutoff value).
Ultimately, developing a HSM for species of the Cystoseira genus living in the shallow rocky infralittoral habitat enabled also to carry out an exploratory analysis with the purpose of figuring out how Cystoseira forests are distributed along all those sites where no data were available across the Mediterranean Sea. Considering the spatial resolution of this study, we mapped 100,609 records for which no information occurred (i.e., "no-data" pixels of the Cystoseira raster layer). Using model predictions and proceeding as for "presence" pixels considered for model validation, we investigated, for each record, the number of trees in the RF that "voted" for presence, which is related to the predicted probability of presence. Our model classified 47,783 of these records as "absence" given that the RF output was lower than the optimized cut-off value and 52,826 as "presence" with an RF output larger than or equal to the cut-off value. To understand the distribution of Cystoseira occurrences across the basin, we firstly classified pixels on the basis of the abiotic properties of the coastline (i.e., "coast_material" variable), considering them as a limiting factor able to control canopies occurrence. Hence, we grouped records falling on "rock/unerodible" pixels, where canopies could potentially occur, and separated these records from those which covered "sand/mud" pixels, considered as unsuitable for Cystoseira growth. Figure 7A shows the distribution of the RF output values for the first group of pixels, which were regarded as possibly suitable for holding Cystoseira living in the shallow rocky infralittoral habitat species: the modal class of RF output values lies on the right of the optimal cut-off value, shown as a red dashed line in the figure. The left tail of the distribution represents all records predicted as "absence" in spite of potentially suitable geomorphological conditions. On the other  side, Figure 7B shows the distribution of the RF output values for "sand/mud" pixels and reveals that these pixels, as expected, are mostly predicted as unsuitable, with the modal class of RF output on the left of the optimal cut-off value, in accordance with the unsuitable coast material. In this case, the right tail of the distribution shows records predicted as "presence" even though geomorphological conditions are unsuitable and could be regarded as model misclassifications, most probably related to the coarse resolution of our pixels (which might include small suitable areas), to the particular coast exposure or composition, or even to the lack of sufficient data in developing the HSM.
In the Figure 8 we presented an overview of our habitat suitability assessment. High resolution details of the model predictions are provided in the raster ASCII format as a zip file in the Supplementary Data Sheet S1.

DISCUSSION
We produced the first effort able to deliver a georeferenced dataset of Cystoseira forests strongly affected by human pressures and deserving conservation priorities across the whole Mediterranean basin. This first crucial step enabled us to deepen our understanding on canopies distribution and to fill gaps in our knowledge by developing a HSM for a subset of Cystoseira species distributed on the shallow rocky shores.
To develop the HSM, we considered areas where both presence and absence records were available, using all other Mediterranean regions to validate and test model performances. Our model showed a quite high accuracy level in reproducing Cystoseira distribution, endorsed by a large number of occurrence and environmental data, but also enhanced by the identification of the species of the shallow rocky shores as a rather uniform ecological target to be modeled. On the other hand, the spatial resolution of our study, imposed by the available resolution for predictor variables, precluded a species-specific distribution assessment. Fine-scale data are indeed needed in order to improve the model in this sense (Cefalì et al., 2018). Our database suffers from all the limitations already described across the literature. Research efforts (published and unpublished) differ among countries (Coll et al., 2010), and large data gaps emerged in the eastern and southern part of the basin. Furthermore, heterogeneous sampling methods or false absences in occupancy surveys can lead to underestimation if the imperfect detection of the species is not accounted for Katsanevakis et al. (2017). These issues might have affected our spatial representation of Cystoseira canopies, resulting in an incorrect estimate of species distribution.
Despite the limits, the chosen approach based on the RF technique allowed us to highlight the most relevant predictor variables affecting the HSM and therefore those variables better candidates to explain Cystoseira canopies distribution and potential for regression. Considering both the importance measures obtained as RF outputs, the topographic coastal slope and the nature of substrate along the coast were identifies as the main factors in controlling Cystoseira canopies distribution, in accordance with the specific coastal conditions required for the infralittoral species development, limited to intertidal and shallow subtidal rocky shores (ESRI, 2012;Mancuso et al., 2018). Some anthropogenic variables emerged as relatively important from this analysis, but they followed the importance of the geomorphological ones. These variables (i.e., the distance to ports and to the urban areas), proxies of pollution and urbanization, have been claimed to drive the loss of Cystoseira forests in many Mediterranean regions FIGURE 7 | Distribution RF output values for "no-data" records, grouped on the basis of the type of coastline: (A) "rock/unerodible" pixels; (B) "sand/mud" pixels. Red dashed lines represent the optimal cut-off (t = 0.61). in the last 20 years (Cormaci and Furnari, 1999;Benedetti-Cecchi et al., 2001;Thibaut et al., 2005;Mangialajo et al., 2008;Sales and Ballesteros, 2009;Perkol-Finkel and Airoldi, 2010). Finally, environmental variables, such as the distance to river mouths, the euphotic depth, the diffuse attenuation coefficient, the mean of chlorophyll a and the nitrate concentration, also displayed a significant role in the RF growth. Actually, these predictors can be regarded as indicators of nutrient enhancement, water turbidity and eutrophication levels which are included in several studies amongst the main causes for the regression of Cystoseira species (Cormaci and Furnari, 1999;Arévalo et al., 2007;Sales and Ballesteros, 2009;Fraschetti et al., 2011;Sala et al., 2012;Mancuso et al., 2018). Nevertheless, attention should be drawn to the fact that RF, given its potential to reflect the resemblance between the local spatial structure of predictor variables and species distribution, does not allow to reveal causal relationships. In our model, RF highlights that anthropogenic pressures, directly or indirectly, could have important roles in affecting the distribution of Cystoseira species, identifying several factors to be prioritized in conservation actions devoted to this genus. However, experimental studies are needed to identify the drivers for the observed canopies regression, as well as a shift from a single-threat approach toward a multiple-stressor one should be adopted in order to understand patterns of distribution and trajectories of change in Cystoseira forests (Benedetti-Cecchi et al., 2015;Mancuso et al., 2018). Notwithstanding, recently developed methods based on the hybridization of experimental and observational data provide novel opportunities to leverage the scope and causal inferential strength of large-scale studies (Benedetti-Cecchi et al., 2018). For example, these methods allow integrating empirical estimates of biological interactions into species distribution models, effectively increasing the predictive accuracy and ability to attribute causality of these models. This might be particularly critical for Cystoseira since another potentially strong but yet unexplored predictor for the presence of macroalgal forests is the emergence of highly effective grazers, invasive Indo Pacific rabbitfish (especially Siganus luridus), which in the southeast Mediterranean (Turkey, Israel) already decimate all erect, edible, macroalgae down to turf barrens (Rilov et al., 2018). The fishes are rapidly spreading to the west and north and the rate of spread of is probably strongly related to water temperature (and is thus probably facilitated by ocean warming). From the validation phase of the modeling procedure we determined the predictive accuracy in spatially independent Mediterranean areas in order to better evaluate model performances. Indeed, testing the model in a wider variety of spatial context means to better define the range of applications for which the model predictions are suited (Guisan and Zimmermann, 2000). In most cases model accuracy was high and presence records matched with pixels classified as suitable for holding Cystoseira canopies. In particular, Sardinia resulted having the highest frequency of correctly predicted cases and this result possibly derives from the low presence of anthropogenic pressures affecting the area. Model predictions showed an evident misclassification rate only for the eastern Mediterranean regions and the northern Adriatic Sea. Reasons for model biases probably lie in distinctive conditions, including the presence of human threats or unsuitable environmental conditions, characterizing these coastal areas. A finer resolution of predictor variables would enhance the RF ability to correctly reproduce Cystoseira distribution, pointing out environmental heterogeneities hidden under a coarser resolution used in this study. Data on the environmental drivers affecting intertidal and nearshore ecosystems (e.g., human impacts or the type of shoreline) are largely incomplete (Halpern et al., 2008), reducing our ability to assess the present and the future state of marine habitats. Moreover, it should be stressed that especially for the eastern Mediterranean regions, outdated occurrence data may have led to an inaccurate representation of the current distribution of the canopies.

Final Remarks
Underpinned by model outputs, we will better direct our management and restoration efforts on the basis of the predictions on the presence/absence of the species, also combined to the information about human pressures. A first model attempt was performed by Buonomo et al. (2018) analyzing intertidal Cystoseira populations in order to predict their future ranges according to different climatic scenarios in the Mediterranean Sea. According to that model an important loss of suitable areas is expected across the range of distribution of the habitat-forming seaweed species already by 2050, with cascading effects on the whole ecosystem and the services that it provides.
In this respect, the model output from this study allows to investigate areas classified as suitable with high probability of Cystoseira occurrence, to assess if the predicted status of presence matched the real one and thus to define new suitable locations for restoration plans. As a result, the HSM could be seen as a useful baseline tool for the assessment of Cystoseira distribution and for the establishment of future-oriented marine planning initiatives from both conservation and restoration point of view, at least as far as the species on the shallow rocky shores. However, there is still a long way until we can use these predictions for true management and (mainly) restoration. Actually, both have to rely on species-specific actions and a (more) spatially accurate information on the environmental factors at the places to be managed or restored. In the case of restoration, it is also pivotal that the pressures that drove the disappearance of the canopies have been mitigated. For this reason, acting on reducing and carefully planning the distribution of local pressures should be considered a priority (Buonomo et al., 2018). In any case, to assess the spatial generality of models, an exhaustive evaluation of how the quality of its output varies within different regional context is required (Gorman et al., 2013). As stressed from the model validation, predictions for Mediterranean regions where model performances are quite limited due to environmental distinctiveness and heterogeneities (e.g., the eastern Mediterranean areas) may not reproduce the actual canopies distribution. In this regard, the large proportion of "no-data" records is an important limit in the development of the HSM and therefore in the understanding of the potential distribution of Cystoseira forests across all the Mediterranean coastlines.
Improving model outputs with a finer resolution of predictor variable layers and better dataset with species occurrences would allow more reliable predictions also for these regions and would promote the identification of species-specific suitable and unsuitable areas making our model more sensitive to ecological differences among species.

DATA AVAILABILITY STATEMENT
The dataset produced for this study will be made available by the authors, without undue reservation, to any qualified researcher in the shapefile format.

AUTHOR CONTRIBUTIONS
SIMFRA and MS conceived the study. EF carried out the data analysis. MS provided the guidance on modeling efforts. SIMFRA and EF wrote the manuscript. All authors contributed to the data collection and manuscript revision. SIMFRA coordinated all steps in finalizing the manuscript.