Taxonomic revision of the Malagasy members of the Nesomyrmex angulatus species group using the automated morphological species delineation protocol NC-PART-clustering

Background. Applying quantitative morphological approaches in systematics research is a promising way to discover cryptic biological diversity. Information obtained through twenty-first century science poses new challenges to taxonomy by offering the possibility of increased objectivity in independent and automated hypothesis formation. In recent years a number of promising new algorithmic approaches have been developed to recognize morphological diversity among insects based on multivariate morphometric analyses. These algorithms objectively delimit components in the data by automatically assigning objects into clusters. Method. In this paper, hypotheses on the diversity of the Malagasy Nesomyrmex angulatus group are formulated via a highly automated protocol involving a fusion of two algorithms, (1) Nest Centroid clustering (NC clustering) and (2) Partitioning Algorithm based on Recursive Thresholding (PART). Both algorithms assign samples into clusters, making the class assignment results of different algorithms readily inferable. The results were tested by confirmatory cross-validated Linear Discriminant Analysis (LOOCV-LDA). Results. Here we reveal the diversity of a unique and largely unexplored fragment of the Malagasy ant fauna using NC-PART-clustering on continuous morphological data, an approach that brings increased objectivity to taxonomy. We describe eight morphologically distinct species, including seven new species: Nesomyrmex angulatus (Mayr, 1862), N. bidentatus sp. n., N. clypeatus sp. n., N. devius sp. n., N. exiguus sp. n., N. fragilis sp. n., N. gracilis sp. n., and N. hirtellus sp. n.. An identification key for their worker castes using morphometric data is provided. Conclusions. Combining the dimensionality reduction feature of NC clustering with the assignment of samples into clusters by PART advances the automatization of morphometry-based alpha taxonomy.


INTRODUCTION
Madagascar, one of Earth's biodiversity hotspots (Myers et al., 2000), has a unique and very diverse ant fauna with a high degree of micoendemism. Multifaceted efforts to discover ant diversity in the Malagasy region have collected sufficient data to support species-level taxonomy (Fisher, 2005).
On Madagascar, the high rate of diversity and possible cryptic species (i.e., genealogical lineages that cannot be convincingly separated using conventional morphological approaches, see Seifert, 2009) pose extraordinary challenges for biodiversity research. The authors estimate that many groups, such as the Malagasy Nesomyrmex, may contain ten times more species than was previously described. This dramatic increase in suspected species is due to a profusion of microendemic species. Our approach to this taxonomic challenges is to apply a quantitative morphological approach in combination with modern algorithms to delineate species by statistical means.
The Malagasy representatives of the genus previously had been classified into four lineages: angulatus-group, hafahafa-group, madecassus-group and sikorai-group. These species groups were defined by Csősz & Fisher (2015) based on salient morphological characteristics. The taxonomy of the hafahafa group has been clarified using quantitative morphology (Csősz & Fisher, 2015), proving the power of these methods to tackle cryptic diversity in tropical biomes.
Here the diversity of a unique and largely unexplored fragment of the Malagasy ant fauna, the Nesomyrmex angulatus group, is inferred via a highly automated protocol involving the fusion of two algorithms, Nest Centroid clustering (NC clustering) (Seifert, Ritz & Csősz, 2014) and Partitioning Algorithm based on Recursive Thresholding (PART) (Nilsen & Lingjaerde, 2013) using continuous morphometric data. NC clustering has proven efficient at pattern recognition within large and complex datasets (Csősz et al., 2014;Guillem, Drijfhout & Martin, 2014;Wachter et al., 2015) and PART makes assignments to objectively-defined clusters based on statistical thresholds (Nilsen et al., 2013;Tibshirani, Walther & Hastie, 2001).
Delimitations of clusters recognized by these exploratory analyses were tested via confirmatory Linear Discriminant Analysis (LDA) and Multivariate Ratio Extractor, MRA (Baur & Leuenberger, 2011) following the earlier protocol of Csősz & Fisher (2015).
Multivariate evaluation of morphological data has revealed that the N. angulatus species-group comprises eight well-outlined clusters in the Malagasy zoogeographical region, all representing species; of these, seven taxa are new to science. The eight species outlined, Nesomyrmex angulatus (Mayr, 1862), N. bidentatus sp. n., N. clypeatus sp. n., N. devius sp. n., N. exiguus sp. n., N. fragilis sp. n., N. gracilis sp. n., and N. hirtellus sp. n. are described or redefined here based on worker caste. We provide a combined key that includes both a traditional, character-based key and a numeric identification tool that helps readers resolve the most problematic cases.
The final species hypotheses are corroborated by qualitative morphological characters. Combining NC clustering and PART has proved to be an efficient method to automate species delimitation in insect taxonomy. characters, earlier protocols (Csősz, Heinze & Mikó, 2015;Csősz & Fisher, 2015) were considered. Explanations and abbreviations for measured characters are as follows: CL: Maximum cephalic length in median line. The head must be carefully tilted to the position providing the true maximum. Excavations of hind vertex and/or clypeus reduce CL (Fig. 1A). CW: Maximum width of the head. Includes compound eyes (Fig. 1A). CWb: Maximum width of head capsule without the compound eyes. Measured just posterior of the eyes (Fig. 1A). CS: Absolute cephalic size. The arithmetic mean of CL and CWb. Cdep: Antero-median clypeal depression. Maximum depth of the median clypeal depression on its anterior contour line as it appears in fronto-dorsal view (Fig. 1B). EL: Maximum diameter of the compound eye (not shown). FRS: Frontal carina distance. Distance between the frontal carinae immediately caudal of the posterior intersection points between the frontal carinae and the torular lamellae. If these dorsal lamellae do not laterally surpass the frontal carinae, the deepest point of the scape corner pits may be taken as the reference line. These pits take up the inner corner of the scape base when the scape is directed caudally and produce a dark triangular shadow in the lateral frontal lobes immediately posterior to the dorsal lamellae of the scape joint capsule (Fig. 1B). ML (Weber length): Mesosoma length from caudalmost point of propodeal lobe to transition point between anterior pronotal slope and anterior pronotal shield. Preferentially measured in lateral view; if the transition point is not well defined, use dorsal view and take the centre of the dark-shaded borderline between pronotal slope and pronotal shield as anterior reference point. In gynes: length from caudalmost point of propodeal lobe to the most distant point of steep anterior pronotal face (Fig. 1C). MW: Mesosoma width. In workers MW is defined as the longest width of the pronotum in dorsal view excluding the pronotal spines (Fig. 1E). MPST: Maximum distance from the center of the propodeal stigma to the anteroventral corner of the ventrolateral margin of the metapleuron (Fig. 1D). NOH: Maximum height of the petiolar node. Measured in lateral view from the uppermost point of the petiolar node perpendicular to a reference line from the petiolar spiracle to the imaginary midpoint of the transition between dorso-caudal slope and dorsal profile of caudal cylinder of the petiole (Fig. 1D). NOL: Length of the petiolar node. Measured in lateral view from the center of the petiolar spiracle to dorso-caudal corner of caudal cylinder. Do not take as the reference point the dorso-caudal corner of the helcium, which is sometimes visible (Fig. 1D). PEH: Maximum petiole height. The chord of the ventral petiolar profile at node level is the reference line perpendicular to which the maximum height of petiole is measured (Fig. 1F). PEL: Diagonal petiolar length in lateral view; measured from anterior corner of subpetiolar process to dorso-caudal corner of caudal cylinder (Fig. 1F). PEW: Maximum width of petiole in dorsal view. Nodal spines are not considered (Fig. 1G). PoOC: Postocular distance. Use a cross-scaled ocular micrometer and adjust the head to the measuring position of CL. Caudal measuring point: median occipital margin; frontal measuring point: median head at the level of the posterior eye margin (Fig. 1A). PPH: Maximum height of the postpetiole in lateral view. Measured perpendicularly to a line defined by the linear section of the segment border between dorsal and ventral petiolar sclerite (Fig. 1F). PPL: Postpetiole length. The longest anatomical line that is perpendicular to the posterior margin of the postpetiole and is between the posterior postpetiolar margin and the anterior postpetiolar margin (Fig. 1D). PPW: Postpetiole width. Maximum width of postpetiole in dorsal view (Fig. 1G). PSTI: Apical distance of pronotal spines in dorsal view; if spine tips are rounded or thick take the centers of spine tips as reference points (Fig. 1E). SL: Scape length. Maximum straight line scape length excluding the articular condyle ( Fig. 1A). SPBA: Minimum spine distance. The smallest distance of the lateral margins of the spines at their base. This should be measured in dorsofrontal view, since the wider parts of the ventral propodeum do not interfere with the measurement in this position. If the lateral margins of spines diverge continuously from the tip to the base, a smallest distance at base is not defined. In this case, SPBA is measured at the level of the bottom of the interspinal meniscus (Fig. 1G). SPST: Spine length. Distance between the center of propodeal stigma and spine tip. The stigma center refers to the midpoint defined by the outer cuticular ring but not to the center of the real stigma opening, which may be positioned eccentrically (Fig. 1E). SPTI: Apical spine distance. The distance of spine tips in dorsal view; if spine tips are rounded or truncated, the centers of spine tips are taken as reference points (Fig. 1G).
Two characters (Cdep and PSTI) were used only in two species, N. angulatus and N. clypeatus sp. n., hence these characters were not involved in overall multivariate analyses.
Taxonomic nomenclature, OTU concepts, and natural language (NL) phenotypes were compiled in mx (http://purl.org/NET/mx-database). Taxonomic history and descriptions of taxonomic treatments were rendered from this software. Hymenoptera-specific terminology of morphological statements used in descriptions and identification key, and diagnoses are mapped to classes in phenotype-relevant ontologies (Hymenoptera Anatomy Ontology (HAO) (Yoder et al., 2010) via a URI table (Table S2); for more information about this approach see Seltmann et al. (2012) andMikó et al. (2014).
In verbal descriptions of taxa based on external morphological traits, recent taxonomic papers (Csősz et al., 2014;Seifert & Csősz, 2015) were considered. Definitions of surface sculpturing are linked to Harris (1979). Body size is given in µm, means of morphometric ratios as well as minimum and maximum values are given in parentheses with up to three digits. Inclination of pilosity and cuticular spines is given in degrees. Definitions of species-groups as well as descriptions of species are surveyed in alphabetic order.
The electronic version of this article in Portable Document Format (PDF) will represent a published work according to the International Commission on Zoological Nomenclature (ICZN), and hence the new names contained in the electronic version are effectively published under that code from the electronic edition alone. This published work and the nomenclatural acts it contains have been registered in ZooBank, the online registration system for the ICZN. The ZooBank LSIDs (Life Science Identifiers) can be resolved and the associated information viewed through any standard web browser by appending the LSID to the prefix http://zoobank.org/. The LSID for this publication is: lsid:zoobank.org:pub:63B1A3E5-9E62-46AD-B594-6B3E83364D90. The online version of this work is archived and available from the following digital repositories: PeerJ, PubMed Central and CLOCKSS.

Statistical framework-hypothesis formation and testing
The present statistical framework follows the procedure applied in Csősz & Fisher (2015). Advantages and limitations of the present procedure are discussed there.

Data preparation and cleaning
Nest-centroid clustering (NC-clustering), and linear discriminant analysis (LDA) do not require special data preparation (e.g., standardization), hence raw data were applied for each of the statistical analyses. Data, however, are standardized (i.e., centered and scaled) for the multivariate ratio analysis (MRA) to prevent variables with large values from dominating the analysis (Baur & Leuenberger, 2011). Variables are tested via matrix scatterplots and Pearson product-moment correlation coefficients for error variance. The lack of a positive within-class correlation between different traits may indicate measurements errors, or may represent a morphological artifact (Baur et al., 2014). All traits but one, postpetiole length (PPL), have shown strong linear correlations to other traits. The distribution of PPL is more likely spherical, caused by an unknown source of error. For this reason, PPL was removed from further analyses. Raw data in µm is given in Table S3.

Generating prior species hypotheses via the combined application of NC clustering and PART
This method searches for discontinuities in continuous morphometric data and sorts all similar cases into the same cluster in a two-step procedure. The first step reduces dimensionality in data with cumulative linear discriminant analysis (LDA) using nest samples (i.e., individuals collected from the same nest are assumed genetically closely related, often sisters) as groups (Seifert, Ritz & Csősz, 2014). The second step calculates pairwise distances between samples using LD scores as input and the distance matrix is displayed in a dendrogram. The NC-clustering was done via packages cluster (Maechler et al., 2014) and MASS (Venables & Ripley, 2002).
The ideal number of clusters was determined by Partitioning Algorithm based on Recursive Thresholding via the package clusterGenomics (Nilsen & Lingjaerde, 2013) using the function 'part', which also assigns observations (i.e., specimens, or samples) into partitions. The method estimates the number of clusters in a data based on recursive application of the Gap statistic (Tibshirani, Walther & Hastie, 2001) and is able to discover both top-level clusters as well as sub-clusters nested within the main clusters. If more  than one cluster is returned by the Gap statistic, it is re-optimized on each subset of cases corresponding to a cluster until a stopping threshold is reached or the subset under evaluation has less than 2*minSize cases (Nilsen et al., 2013). Two clustering methods are used to determine the optimal number of clusters ''hclust'' and ''kmeans'' with 1,000 bootstrap iterations. The results of PART are mapped on the dendrogram in colored bars via the function 'mark.dendrogram' found in Beleites & Sergo (2015). The script was written in R and can be found in Appendix S1.

Arriving at final species hypothesis using confirmatory Linear Discriminant Analysis (LDA) and LDA ratio extractor
To provide increased reliability of species delimitation, hypotheses for clusters and classification of cases via exploratory processes were confirmed by LDA Leave-oneout cross-validation (LOOCV). Classification hypotheses were imposed for all samples congruently classified by partitioning methods, while wild-card settings (i.e., no prior hypothesis imposed on the classification) were given to samples that were incongruently classified by the two methods or proved to be outliers. To extract the best ratios for the easiest species separation in the key and diagnoses we applied multivariate ratio analysis (MRA), a modern statistical method based on principal component analysis (PCA) and linear discriminant analysis (LDA) (Baur & Leuenberger, 2011).

RESULTS AND DISCUSSION
Eight clusters were identified by both clustering algorithms 'hclust' and 'kmeans' using function 'part'. The pattern recognized by these partitioning algorithms can be fitted on the hierarchical structure seen on the dendrogram generated by NC clustering (Fig. 2). The grouping hypotheses generated by the combination of hypothesis-free exploratory analyses was validated by Linear Discriminant Analysis with leave-one-out crossvalidation (LOOCV-LDA). The overall classification success is 100% (Table 1). The phenetically distinguishable clusters represent eight morphologically diagnosable OTUs that differ in many qualitative characters (e.g., shape of propodeal spines, petiolar node, surface sculpturing, etc.), hence the eight clusters solution is accepted as the final species hypothesis. Prior species hypothesis was generated by method PART using two clustering methods, hclust ('part-hclust') and kmeans ('part-kmeans'). Color code is the same as above, but outliers returned by 'part-hclust' appear in black.
The geographic distribution of each morphospecies corresponds to the known major areas of endemism in Madagascar (Brown et al., 2014;Vences et al., 2009) and can be characterized by one of the simplified bioclimatic zones of Madagascar (Schatz, 2000, after Cornet, 1974: eastern rainforest, central montane forest, western dry forest, and southwest desert spiny bush thicket (Fig. 3).
These species are grouped into four species complexes based on morphological similarity. The bidentatus-complex consists of two species: Nesomyrmex bidentatus sp. n. and N. fragilis sp. n.; the devius-complex includes two new species: N. devius sp. n., N. exiguus sp. n., N. gracilis sp. n. and N. hirtellus sp. n.; while two species, N . angulatus (Mayr, 1862) and N. clypeatus sp. n., form a complex of their own in the Malagasy zoogeographical region. Separation of species as well as complexes are convincingly supported by Multivariate Ratio Analyses. Morphometric data for species calculated on individuals are given in Table 2.

Key to workers of Malagasy Nesomyrmex angulatus group
Note: absolute size is given in µm, indexes are dimensionless values minimum and maximum values are given in brackets. Classification power between couplet based on a certain character is calculated and percent value is given in parentheses.
1. Median clypeal notch present (    2. Anterolateral corner of pronotum angulate in dorsal wiew (Fig. 4C). Frontal carina narrow, scape longer: FRS/SL < 0.5 (100%)... angulatus -Anterolateral corner of pronotum rounded in dorsal view (Fig. 4D) (Fig. 4A). Nesomyrmex angulatus differs from species of bidentatus-complex and devius-complex (N. devius, N. exiguus, N. fragilis, N. gracilis and N. hirtellus) by having sharp anterolateral pronotal angles (Fig. 4C) and a numeric key, FRS/CS ratio yields perfect separation between workers of N. angulatus and members of bidentatus-complex (Table 2). Distribution. In the Malagasy zoogeographical region, this species is known to occur in coastal dry forests, mangroves and the coastal scrub of the northern, dry area of Madagascar and on adjacent islands in the Mozambique channel (Fig. 3). Worldwide, N. angulatus has spread to the eastern Africa and the Middle East.  Etymology. The name (bidentatus) refers to the short propodeal denticle pair of this species. Diagnosis. Workers of N. bidentatus differ from those of N. clypeatus by having no median clypeal notch (Fig. 4B) and from N. angulatus by the lack of an anterolateral pronotal corner (Fig. 4D). This species can be well separated from N. devius, N. exiguus, N. gracilis, and N. hirtellus based on its short and blunt spines and shorter apical spine distance (SPST/CS, see Table 2). This species is the most similar to N. fragilis: these two species can be separated by PooC/SPTI ratio, which yields 92.9% classification success (Fig. 5B). Due to the fact that the MRA plot offers only 97.3% of the correct classification, a reduced discriminant function using a combination of four characters (D4 = 0.062581 Distribution. This species is endemic to the Malagasy region. It is known to occur in tropical dry forests and littoral forests of the northern, dry area of Madagascar (Fig. 3).  , 80 m, 21-25.ii.2002, collection code: BLF5777;CASENT0448820, Fisher et al. (CASENT0448820, CAS); Paratypes: fifteen workers, and 6 gynes with the same label data with the holotype under CASENT codes: CASENT0448818, BLF5777, (1w, CAS); CASENT0448819, BLF5777, (1w, CAS); CASENT0448823, BLF5777, (3w, CAS); CASENT0448824, BLF5777, Etymology. The name (dēvius = devious) refers to the relatively long path required to arrive at the current taxonomic situation of this species, caused by its superficial similarities to other taxa. Diagnosis. Workers of N. devius differ from those of N. clypeatus by having no median clypeal notch (Fig. 4B) and from those of N. angulatus by the lack of an anterolateral pronotal corner (Fig. 4D). This species can be separated from N. bidentatus and N. fragilis based on the apical spine distance ratio (SPTI/CS, see Table 2). This species occurs in the southern part of Madagascar syntopically with N. hirtellus from the N. devius complex. A simple ratio (PoOC/ SPST, see details in key) offers 92.2% success in determination between this species and N. Hirtellus, but a combination of two ratios (PoOC/SPST and MW/PPH) yields a safer determination (Fig. 5D). The other two species of this complex, N. exiguus and N. gracilis do not occur syntopically with this species, as these are distributed far to the north of the distributional area of N. devius. Distribution. This species is endemic to the Malagasy region, and its distribution is restricted in the southwestern, sub-arid area of Madagascar (Fig. 3)     syntopically with this species, as both are distributed far to the north of the range of N. hirtellus. Distribution. This species is endemic to the Malagasy region, and its distribution is restricted to the southwestern, sub-arid area of Madagascar (Fig. 3); it can be found in tropical dry forest and spiny forest.

CONCLUSION
Combined application of exploratory techniques NC-PART clustering on continuous morphological data revealed that the N. angulatus species-group comprises eight welloutlined clusters in the Malagasy zoogeographical region, all representing species; of these, seven taxa are new to science. Delimitations of clusters recognized by the currently introduced combination of morphometric procedure NC-PART clustering were tested via confirmatory Linear Discriminant Analysis (LDA) and Multivariate Ratio Extractor, MRA (Baur & Leuenberger, 2011).
The fusion of NC clustering and partitioning method PART combines the advantages of the two methods. The dimensionality reduction feature and hierarchical visual display of NC-clustering (Seifert, Ritz & Csősz, 2014) are reinforced by the partitioning method PART, which makes assignments to objectively-defined clusters based on statistical thresholds (Nilsen et al., 2013;Tibshirani, Walther & Hastie, 2001).
The most important benefit of this procedure is that cluster assignments are no longer user-defined, but rather done by the algorithm based on statistical criteria; this is in contrast to hierarchical clustering, where one must decide the boundaries of meaningful clusters. The new, combined method not only offers greater automatization and produces faster inferences, but also takes another step toward increased objectivity in the decision making process of morphometry-based alpha taxonomy.
Quantitative morphometric approaches are often misinterpreted as causing oversplitting as a result of excessive discriminatory power, a characteristic wrongly attributed to these algorithms. In exceptional cases, multivariate statistics on morphometric data may lead to oversplitting, particularly if the cluster boundaries are not properly defined. However, the combined NC-PART clustering is method prevents unjustified oversplitting i.e., treating precarious sub clusters as meaningful fragments via application of the gap statistic criterion (Csősz & Fisher, 2015;Tibshirani, Walther & Hastie, 2001) implemented in method PART.
The combined procedure presented here is a valuable asset to morphometry based alpha taxonomy as it provides greater resolving power, increased objectivity, and largely automated decision making. However, they do not believe that the validity of the patterns obtained by the new procedure would be exclusive or superior to alternative solutions when additional biological information is available, such as molecular data, discrete morphological data, distribution, natural history. To the contrary, the authors hold that diverse evidence from different approaches is necessary to achieve the highest quality measures of biodiversity. The current method contributes to the analysis of complex morphometric data in a manner allows for increased objectivity and independent hypothesis formation in the taxonomic workflow.