Creating functional groups of marine fish from categorical traits

Background. Functional groups serve two important functions in ecology: they allow for simplification of ecosystem models and can aid in understanding diversity. Despite their important applications, there is no universally accepted method for defining them. A common approach is to cluster species on a set of traits and to validate the resulting groups visually, based primarily on expert opinion. The goal of this research is to determine a suitable procedure for creating and evaluating functional groups that arise from clustering nominal traits.


Methods.
To do so, we produced a species-by-trait matrix of 22 traits for 116 fish species from Tasman Bay and Golden Bay, New Zealand. Data collected from photographs and published literature were predominantly nominal, and a small number of continuous traits were discretized. Some data were missing, so the benefit of imputing data was assessed using four approaches on data with known missing values. Hierarchical clustering was used to search for underlying structure in the data that may represent functional groups. Within this clustering paradigm there are a number of distance matrices and linkage methods available, and we tested several combinations of them. The resulting clusters were evaluated using internal metrics developed specifically for nominal clustering. This revealed that the choice of number of clusters, distance matrix and linkage method greatly affected the overall within- and between-cluster variability. We visualised the clustering in two dimensions and assessed the stability of clusters through bootstrapping.

Marine ecosystems are large and complex, requiring simplification of their components in order to be studied and understood. One such simplification is the construction of functional species groups, which involves creating distinct sets of species according to a selection of their functional traits (Tilman, 2001). [...] the ecosystem can be derived (Gravel et al., 2016). There are two primary uses of functional groups: to simplify the numerous species contained in an ecosystem for modelling, and to assess the diversity of an ecosystem. Grouping is a particularly important step in ecosystem modelling as it identifies the basic structures that become the inputs of the model, thus making the outputs more interpretable (Fulton et al., 2003). If functional groups are used in assessing the diversity of an ecosystem (in addition to or instead of species richness), the problem of functional redundancy can be avoided (Stuart-Smith et al., 2013), and the variation in the productivity of a given ecosystem can be more clearly observed (Tilman et al., 1997). Functional groups for ecosystem models have typically been established using expert knowledge of the system and its inhabitants (Baretta et al., 1995; Olivier and Planque, 2017), while groups representing functional diversity have been created using trait or diet data and statistical classification methods (Petchey and Gaston, 2002). Diet data are commonly used to create functional groups of fishes in marine ecosystems, because diet can demonstrate resource partitioning between species, which is a key indicator [...] or predict prey selection (Spitz et al., 2014). With such a wide array of applications there are inevitably many variations in approaches to deriving the groups. One approach is to record traits that reflect how species use the environment and its resources, and use those to cluster groups based on their similarities (Mindel et al., 2016).
Selecting functional traits for classification is a crucial step in the grouping process as these ultimately determine how species group [...] to infer a given species' food source and its acquisition, which were used to derive functional groups. These traits are time consuming and expensive to collect, and measuring many traits for all members of species-rich ecosystems is impractical (Madin et al., 2016). The traits that will be most valuable in practice will be those available for most species (Costello et al., 2015). To [...] region is characterised by its relatively shallow water habitat; large ocean currents that enter this system from the Tasman Sea bring nutrient-rich cold water that makes the area highly productive. Large sheltered areas mean that this area is home to a diverse range of species, from small reef-bound species to large migrating pelagic species.

(i) Select the functional group to be defined
The type of functional group defined will depend on the ecosystem that is being modelled. Different ecosystems require different functions in order for their production to be exploited by their inhabitants (Fonseca and Ganade, 2001). For example, coral reef fishes need strong, sharp teeth in order to exploit polyps, while large pelagic species need to be fast moving in order to capture prey. Functional groups of species should be defined by how the species use their environment and its resources, as ecosystem models attempt to model the entire process of an ecosystem spatially and temporally (Fulton et al., 2004 [...]

The species selected to include should represent the taxonomy, time and space that the functional groups are trying to capture (Fonseca and Ganade, 2001). That is, species that rarely occupy the area of interest, or species with greatly differing biomasses should be included in the analysis. This is because including [...]

(iii) Select functions of interest
To avoid functional redundancy, more functions can be selected to increase the chances of species having unique roles within the ecosystem, while ensuring that species that display the same traits across a number of functions truly belong to the same functional group. We selected four different functions to represent how the species of interest utilise their environment: diet, morphology, habitat use and life history traits (Villéger et al., 2017b; Gravel et al., 2016; Costello et al., 2015). Diet determines a species' influence on other organisms in the environment and its position in the food web (Costello et al., 2015). Habitat preferences allow us to understand how the different species might aggregate in the environment and can provide information about the likely lifestyle of the species (Chan, 2001; Vadas Jr and Orth, 1997).
Morphology traits are important in defining the range of food sources, behaviour, adaptation and habitat use available to a certain species (Sibbing and Nagelkerke, 2000). Life history primarily reflected the [...] traits that could be recorded from fish species. As cost and time are often significant constraints on research, it was a goal of this study to record functional trait information only from published resources or from photographs, rather than collecting and measuring specimens. We identified 40 traits that could be recorded without measuring species directly (Table S1). In some cases, variables that previously required a specimen to be measured could be categorised into nominal variables. For example, caudal peduncle aspect ratio was recorded as caudal fin shape. Where information differed ontogenetically within species, the information for adult females was recorded. The final list of recorded traits is provided in Table 1. [...] The life history traits selected primarily reflect the reproductive strategies for each of the species. [...] between years, where species that spawn more often tend to have more stable populations (Longhurst, 2002). Fish that provide parental care or give birth to live young (viviparous) tend to give birth to fewer, larger offspring, often in more sheltered habitats such as estuaries.

Habitat traits are important in defining how a species uses its environment. As we focused on a small ecosystem, the habitat variables of a given species must match the available habitat of that ecosystem. We included the minimum and maximum known depth of the species as TBGB is a relatively shallow bay [...]

Diet traits allow us to understand a species' position within a food web. Diet can be recorded in a number of ways, but for our purposes we sought a simple classification of diet. Therefore we have two diet variables only: diet category (omnivore, invertivore, piscivore, herbivore and gelatinous invertebrate feeders) and trophic level (obtained from FishBase for consistency).

In this section we describe the steps taken to analyse and group the data. Our approach differs from traditional functional group analyses as we use categorical (nominal) data. In order to use nominal data we must ensure we have a complete dataset (no missing values) and our continuous variables must be discretized.

These two steps are detailed in our data preparation stage, followed by a description of the distance matrices available for nominal data. We then describe some linkage options and finally detail the data evaluation stage. Our approach utilises the R package nomclust, which is designed exclusively for clustering observations with nominal variables (Šulc and Řezanková, 2015; R Core Team, 2018).
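The discretization step mentioned above can be illustrated with a minimal Python sketch. It is not the procedure used in this study: the function name `discretize`, the depth values, the cut points and the class labels are all hypothetical and chosen only to show how a continuous trait can be turned into a nominal one.

```python
from bisect import bisect_right

def discretize(values, breakpoints, labels):
    """Map each numeric value to a nominal label using ordered breakpoints.

    A value v falls in bin i when breakpoints[i-1] <= v < breakpoints[i].
    None (a missing value) is passed through unchanged.
    """
    if len(labels) != len(breakpoints) + 1:
        raise ValueError("need exactly one more label than breakpoints")
    return [None if v is None else labels[bisect_right(breakpoints, v)]
            for v in values]

# Hypothetical maximum depths in metres; the cut points are illustrative only.
max_depth_m = [12, 85, 300, None, 40]
depth_class = discretize(max_depth_m, [50, 200], ["shallow", "mid", "deep"])
print(depth_class)
```

Missing values are deliberately left as `None` here so that imputation can be handled as a separate step.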

(v) Data preparation
Only 22 of the 40 recorded traits had less than 25% missing data and were retained for analysis. A cutoff of 25% was selected because the accuracy of imputed datasets is seriously degraded above 20-25% for small [...] (Table 1). For each method, we ran a simulation in which data were randomly [...] which, when comparing two observations of a given variable, takes into account the relative frequencies of categories (Goodall, 1966). A similarity value is assigned based on the normalised similarity between the two observations, where the similarity value is higher if a category occurs infrequently. This method takes into account that individuals' attributes occur stochastically and independently in a population. Lin's distance is an information theoretic definition of similarity based on relative frequencies (Lin, 1998). Matches are given higher weightings when they occur infrequently, and conversely mismatches are given higher weightings when they occur infrequently.
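To make the frequency-based idea concrete, the following Python sketch computes a Lin-style similarity in the form in which it is commonly stated: a match on variable c contributes 2 ln p(x), a mismatch contributes 2 ln(p(x) + p(y)), and the sum is normalised by the summed log-frequencies of both observations. This is an assumption about the measure's form, not a reproduction of the nomclust implementation, and the conversion from similarity to distance is not shown because the source does not state it. The function name `lin_similarity` and the toy data are hypothetical.

```python
import math
from collections import Counter

def lin_similarity(data, i, j):
    """Lin-style similarity between rows i and j of a nominal data set.

    data: list of rows, each a list of category labels (no missing values).
    Returns a value between 0 and 1; identical rows score 1.
    """
    n = len(data)
    numerator = 0.0
    denominator = 0.0
    for c in range(len(data[0])):
        freq = Counter(row[c] for row in data)   # category counts for variable c
        p_i = freq[data[i][c]] / n
        p_j = freq[data[j][c]] / n
        if data[i][c] == data[j][c]:
            numerator += 2.0 * math.log(p_i)
        else:
            numerator += 2.0 * math.log(p_i + p_j)
        denominator += math.log(p_i) + math.log(p_j)
    if denominator == 0.0:
        return 1.0   # every variable is constant, so the rows are identical
    return numerator / denominator

# Toy data: four species described by two nominal traits.
toy = [["a", "x"], ["a", "y"], ["b", "y"], ["b", "y"]]
print(round(lin_similarity(toy, 0, 1), 3))
```

Because the denominator grows with the rarity of the observed categories, a match on a rare category accounts for a larger share of the total similarity than a match on a common one.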

(vii) Clustering methods
As we do not know the number of functional groups in the ecosystem a priori, we used hierarchical clustering to visualise group association given our chosen distance metric. Hierarchical clustering first places all n objects in n separate single-member clusters; larger clusters are then formed by sequentially joining first individual observations and then groups of observations, until at last all observations are in a single group. The closeness of pairs of observations or groups of observations to one another is determined by a measure of distance calculated in the preceding step. In linkage, all pairwise inter-cluster dissimilarities are calculated. The pair of clusters that is least dissimilar (that is, most similar) is identified and these two clusters are fused. Once observations or clusters are joined to a group they remain a part of that cluster for the remainder of the analysis. There are a number of linkage methods that can be used for this type of data, and here we explore three methods available in the R package nomclust (Blashfield, 1976).
To describe the linkage methods we use the following notation: $D(A, B)$ is the distance between clusters $A$ and $B$, which have sizes $n_A$ and $n_B$ respectively, and $d(a, b)$ is the distance between observations $a$ and $b$. In single linkage (minimising inter-cluster dissimilarity), the dissimilarity between two clusters is the smallest of all pairwise distances between the observations in the two clusters:

$$D(A, B) = \min_{a \in A,\, b \in B} d(a, b)$$

In complete linkage (maximising inter-cluster dissimilarity), the dissimilarity between two clusters is the largest of all pairwise distances between the observations in the two clusters:

$$D(A, B) = \max_{a \in A,\, b \in B} d(a, b)$$

In average linkage, the dissimilarity between two clusters is the average of all pairwise distances between observations in the two clusters:

$$D(A, B) = \frac{1}{n_A n_B} \sum_{a \in A} \sum_{b \in B} d(a, b)$$
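The three linkage rules can be sketched in a few lines of Python. This is a naive illustration for small data sets, assuming a precomputed symmetric distance matrix; it is not the nomclust implementation, and the function name `agglomerate` and the toy points are hypothetical.

```python
def agglomerate(dist, k, linkage="average"):
    """Merge clusters bottom-up until k remain.

    dist: symmetric matrix (list of lists) of pairwise distances.
    linkage: "single", "complete" or "average".
    Returns a list of clusters, each a sorted list of row indices.
    """
    combine = {
        "single": min,                             # smallest pairwise distance
        "complete": max,                           # largest pairwise distance
        "average": lambda ds: sum(ds) / len(ds),   # mean pairwise distance
    }[linkage]
    clusters = [[i] for i in range(len(dist))]     # start with singletons
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairwise = [dist[i][j] for i in clusters[a] for j in clusters[b]]
                d = combine(pairwise)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])   # fuse the closest pair
        del clusters[b]
    return clusters

# Toy example: five objects on a line, distance = absolute difference.
points = [0, 1, 5, 6, 20]
dist = [[abs(p - q) for q in points] for p in points]
groups = agglomerate(dist, 3, "complete")
print(groups)
```

Recomputing every inter-cluster distance at each step keeps the sketch short at the cost of speed, which is acceptable for a data set of roughly a hundred species.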

Evaluating clustering outputs can occur in two ways: external evaluation, where the resulting clusters are compared against known groupings (as in supervised learning), or internal evaluation, where one of many available metrics is used to evaluate cluster separation and compactness. Since in our case the true groupings are unknown, only internal evaluation is considered. To select the best distance matrix and clustering method for our data we utilised internal evaluation measures available from nomclust (Šulc and Řezanková, 2015). The within-cluster entropy coefficient (WCE) is a measure of compactness which evaluates the variability of each cluster by calculating a measure of normalised entropy (the number of variables that have the same categories from each of the variables evaluated) (Šulc, 2016). WCE ranges from 0 to 1, where a lower value indicates greater intra-cluster homogeneity. Due to the way that these values are calculated they will generally improve as clusters are added to the solution, because the within-cluster variability decreases:

$$\mathrm{WCE}(k) = \sum_{g=1}^{k} \frac{n_g}{n \cdot m} \sum_{c=1}^{m} \left( -\frac{1}{\ln K_c} \sum_{u=1}^{K_c} \frac{n_{gcu}}{n_g} \ln \frac{n_{gcu}}{n_g} \right)$$

where $n$ is the total number of objects (species), $m$ is the number of variables (traits), $n_g$ is the number of objects in the $g$th cluster ($g = 1, \ldots, k$) and $n_{gcu}$ is the number of objects in the $g$th cluster by the $c$th variable with the $u$th category ($u = 1, \ldots, K_c$).
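As an illustration of how a within-cluster entropy coefficient of this kind can be computed, the Python sketch below sums the normalised entropy of every variable within every cluster, weighted by cluster size. It assumes that K_c is the number of categories of variable c in the whole data set and that empty categories contribute zero; the function name `wce` and the toy data are hypothetical, and the code is not taken from nomclust.

```python
import math
from collections import Counter

def wce(data, labels):
    """Within-cluster entropy coefficient for nominal data.

    data: list of rows of category labels; labels: cluster id per row.
    Each variable's entropy is normalised by ln(K_c), where K_c is the number
    of categories of variable c in the whole data set.
    """
    n, m = len(data), len(data[0])
    n_categories = [len({row[c] for row in data}) for c in range(m)]
    total = 0.0
    for g in set(labels):
        rows = [row for row, lab in zip(data, labels) if lab == g]
        n_g = len(rows)
        for c in range(m):
            if n_categories[c] < 2:
                continue  # a constant variable carries no entropy
            counts = Counter(row[c] for row in rows)
            entropy = -sum((v / n_g) * math.log(v / n_g) for v in counts.values())
            total += (n_g / (n * m)) * entropy / math.log(n_categories[c])
    return total

# Toy data: two traits that separate the four objects into two pure groups.
toy = [["a", "x"], ["a", "x"], ["b", "y"], ["b", "y"]]
print(wce(toy, [0, 0, 1, 1]))   # pure clusters
print(wce(toy, [0, 0, 0, 0]))   # everything in one cluster
```

In this toy case the pure two-cluster solution has no within-cluster variability, whereas the single-cluster solution has the maximum, which mirrors the tendency of the coefficient to fall as clusters are added.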

To select the number of groups we use the pseudo F coefficient based on the entropy (PSFE), a measure of separation (Šulc, 2016). The PSFE is a measure of entropy of the between- and within-cluster variability, adjusted for the number of clusters and the number of objects, where a higher value indicates a better grouping:

$$\mathrm{PSFE}(k) = \frac{(n - k)\,[\mathrm{nWCE}(1) - \mathrm{nWCE}(k)]}{(k - 1)\,\mathrm{nWCE}(k)}$$

where $n$ is the number of observations and $k$ is the number of clusters, $\mathrm{nWCE}(1)$ is the variability in the whole dataset, and $\mathrm{nWCE}(k)$ is the within-cluster variability in the $k$-cluster solution.
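A pseudo F coefficient of this kind can be sketched in Python as follows, assuming the usual pseudo F form: between-cluster variability per (k - 1) divided by within-cluster variability per (n - k). The variability values are passed in rather than computed, and the numbers in the example are invented purely for illustration; the function names `psfe` and `best_k` are hypothetical.

```python
def psfe(n, k, var_total, var_within):
    """Pseudo F coefficient from entropy-based variabilities.

    var_total: variability of the whole data set, nWCE(1).
    var_within: within-cluster variability of the k-cluster solution, nWCE(k).
    """
    if k < 2 or k >= n:
        raise ValueError("k must satisfy 2 <= k < n")
    if var_within == 0:
        return float("inf")   # perfectly homogeneous clusters
    return ((n - k) * (var_total - var_within)) / ((k - 1) * var_within)

def best_k(n, var_total, var_within_by_k):
    """Return the k with the highest PSFE among the candidate solutions."""
    return max(var_within_by_k,
               key=lambda k: psfe(n, k, var_total, var_within_by_k[k]))

# Invented variabilities for 116 objects; not results from this study.
within = {2: 80.0, 3: 55.0, 4: 50.0}
chosen = best_k(116, 100.0, within)
print(chosen, psfe(116, 2, 100.0, within[2]))
```

Unlike the within-cluster coefficient alone, this ratio penalises solutions whose extra clusters buy only a small reduction in within-cluster variability.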
Therefore, a more informative measure of performance is the degree of improvement with increasing number of clusters. Results from these measures are therefore presented as the difference between the kth [...]

Hierarchical clustering is usually presented as a dendrogram, but due to the large number of species in the dataset we take advantage of dimensionality reduction techniques to plot the clusters in two dimensions. [...] IOF creates more connected groups as the number of clusters increases, but again, only one group is separated and compact. This is supported by the stability analysis, which shows that IOF and Eskin have more stable clusters when a larger number is selected than Lin or Goodall (Figure 6).

Using the Rand index we compared distance matrix and linkage method combinations for three and [...] of groups increases, smaller groups tend to form, but these groups are highly unstable and are highly dependent on the distance matrix selected.
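For readers unfamiliar with the Rand index, the following Python sketch shows the basic (unadjusted) form: the proportion of object pairs that two partitions treat the same way, either placing both objects together or keeping them apart. Whether an adjusted variant was used in the comparison above is not stated here, so this is illustrative only; the function name `rand_index` and the example labels are hypothetical.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of object pairs on which two partitions agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("partitions must cover the same objects")
    agree = 0
    pairs = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]   # pair together in partition A?
        same_b = labels_b[i] == labels_b[j]   # pair together in partition B?
        agree += same_a == same_b
        pairs += 1
    return agree / pairs

# Two hypothetical clusterings of five species.
solution_1 = [0, 0, 1, 1, 2]
solution_2 = [1, 1, 0, 0, 0]
print(rand_index(solution_1, solution_2))
```

The index depends only on co-membership, so two solutions that differ merely in how the clusters are numbered score 1.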

Clustering species based on their traits theoretically allows functional groups to form. This is particularly difficult to test, as it is unknown how many functional groups exist within a given ecosystem, nor which [...] we explored the difference in WCE score across number of clusters (Figure 4). This revealed that in most cases the largest decrease in WCE (between cluster sizes) corresponded to the highest PSFE score. This approach allowed us to see more clearly which combination of distance matrix and linkage method was fitting our nominal data set best.

We used bootstrapping to assess cluster stabilities, where observations were re-sampled with replacement and clustered repeatedly, with the Jaccard coefficient extracted after each clustering (Hennig, 2007). A stable cluster is more likely to remain unchanged in composition (contain the same observations after each bootstrap) during re-sampling. There was no clear pattern in stability between number of clusters or distance matrices, but generally three clusters were the most stable and had the lowest variation (Figure 6). This was expected, as some species had more traits in common than others, making it more likely for them to always be placed in the same group (less likely to change groups during re-sampling).
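The bootstrap stability idea can be sketched in Python as below. This is a simplified illustration rather than the procedure used in this study: duplicate draws are collapsed to unique objects, resamples in which an original cluster has no members are skipped, and `cluster_fn` is a hypothetical stand-in for the full distance-plus-linkage pipeline. The toy threshold clusterer and its values are likewise invented.

```python
import random

def jaccard(a, b):
    """Jaccard coefficient between two sets of objects."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def cluster_stability(n_objects, cluster_fn, n_boot=100, seed=0):
    """Mean best-match Jaccard coefficient for each original cluster.

    cluster_fn(indices) must return a list of clusters (lists of indices)
    for the objects it is given.
    """
    rng = random.Random(seed)
    original = cluster_fn(list(range(n_objects)))
    totals = [0.0] * len(original)
    counted = [0] * len(original)
    for _ in range(n_boot):
        # Resample with replacement, then keep the distinct objects drawn.
        sample = sorted(set(rng.choices(range(n_objects), k=n_objects)))
        boot_clusters = cluster_fn(sample)
        for g, cluster in enumerate(original):
            kept = set(cluster).intersection(sample)
            if not kept:
                continue  # cluster absent from this resample; skip it
            totals[g] += max(jaccard(kept, bc) for bc in boot_clusters)
            counted[g] += 1
    return [t / c if c else float("nan") for t, c in zip(totals, counted)]

# Toy clusterer: split objects by a fixed threshold on a single value.
values = [0, 1, 2, 10, 11, 12]

def threshold_clusters(indices):
    low = [i for i in indices if values[i] < 5]
    high = [i for i in indices if values[i] >= 5]
    return [c for c in (low, high) if c]

stability = cluster_stability(len(values), threshold_clusters, n_boot=50, seed=1)
print(stability)
```

A value close to 1 for a cluster means that nearly the same set of objects is recovered in every resample, which is the sense of stability described above.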

A good indication that true structure has been found in a dataset is when methods align in agreement [...] find that, across a range of distance matrices and linkage methods, just three groups continue to emerge.

The first distinct group is the sharks. These separate out first, and remain separated as the numbers of [...] for how to analyse them. The problem of how to handle mixed data is yet to be resolved, particularly as in many distance matrices nominal variables tend to have a higher influence on the similarity matrix than continuous variables because they produce higher contrasts (Mirkin, 2012). Future analyses should investigate using mixed (continuous and nominal) data to cluster functional groups.

As yet, there is no agreement on the set of functional traits to use that will provide meaningful [...] by discretizing some traits. Moving forward, it is likely that more traits are needed, along with an assessment of their importance in predicting group associations. One solution may be to use bi-clustering, which performs dimensionality reduction by clustering traits while simultaneously clustering species (Fernández and Pledger, 2016).

Our results demonstrate that the best clustering solution for our data is three clusters using the Goodall [...] to assume that any combination of distance matrix and linkage method will be informative, nor that the combination used by a previous study is a good fit for your data. Instead, data exploration and evaluation analyses, such as those explored in this paper, must be employed. Not exploring the available options may lead to not finding a data structure when there is one, or randomly finding a structure among the noise when no clusters truly exist (Handl et al., 2005). This is because clustering algorithms are biased towards the properties on which they are built. Robust detection of genuine underlying structure requires that multiple algorithms find the same solution.

Deriving functional groups is an important process in developing our understanding of ecosystems. The goal of creating functional groups is to classify the species found in a given ecosystem into representative [...] teleost fish species can be made from known or easy-to-gather information. During this process, it quickly became apparent that there is no straightforward answer to how a functional group should be identified, and that there was no single most appropriate distance matrix or linkage method that could be applied to all situations. We therefore encourage future investigations to explore different distance matrices and linkage methods, as they are easy to implement in statistical packages such as R (Ihaka and Gentleman, 1996).