Estimating the phylogeny of geoemydid turtles (Cryptodira) from landmark data: an assessment of different methods

Background In the last 20 years, a general picture of the evolutionary relationships between geoemydid turtles (ca. 70 species distributed over the Northern hemisphere) has emerged from the analysis of molecular data. However, there is a paucity of good traditional morphological characters that correlate with the phylogeny, which are essential for the robust integration of fossil and molecular data. Part of this problem might be due to intrinsic limitations of traditional discrete characters. Here, we explore the use of continuous data in the form of 3D coordinates of homologous landmarks on the turtle shell for phylogenetic inference and the phylogenetic placement of single species on a scaffold molecular tree. We focus on the performance yielded by sampling the carapace and/or plastral lobes and using various phylogenetic methods. Methods We digitised the landmark coordinates of the carapace and plastron of 42 and 46 extant geoemydid species, respectively. The configurations were superimposed and we estimated the phylogenetic tree of geoemydids with landmark analysis under parsimony, traditional Farris parsimony, unweighted squared-change parsimony, maximum likelihood with a Brownian motion model, and neighbour-joining on a matrix of pairwise Procrustes distances. We assessed the performance of those analyses by comparing the trees against a reference phylogeny obtained from seven molecular markers. For comparisons between trees we used difference measures based on quartets and splits. We used the same reference tree to evaluate phylogenetic placement performance by a leave-one-out validation procedure. Results Whatever method we used, similarity to the reference phylogeny was low. The carapace alone gave slightly better results than the plastron or the complete shell. Assessment of the potential for placement of single species on the reference tree with landmark data gave much better results, with similar accuracy and higher precision compared to the performance of discrete characters with parsimony.


INTRODUCTION
Geometric morphometrics by means of landmark analysis have had considerable success in the description of subtle or complex shape variation that is otherwise difficult to characterise by nature. Being able to represent and analyse these changes as continuous allows researchers to obtain a more fine-grained resolution from the same character and avoids the problem of arbitrarily binning continuous variation into discrete states.
2. Landmarks allow capturing and representing some shell variation that may be apparent to researchers, but difficult to characterise and to code into states. Examples of this variation include the degree of 'doming' and the contour of the carapace or the degree of roundness of the plastral lobes.
Thus, the use of landmarks could help palliate the problem of the scarcity of discrete states. Geometric morphometric methods have been applied to geoemydids in studies of diversification, morphological convergence, and adaptation (Claude et al., 2003McLaughlin & Stayton, 2016) and shape asymmetry (Rivera & Claude, 2008), and also to validate character definitions (Russell & Jamniczky, 2004). To our knowledge, the present study is the first attempt to use landmark data to reconstruct the phylogeny of geoemydids, or any other clade of turtles.
We have two main objectives in our assessment of the performance of various phylogenetic reconstruction methods with our dataset. The first is to provide one more study of the performance of phylogenetic reconstruction with empirical morphometric data, which will contribute to understanding the behaviour and usefulness of the methods for future applications. The second is to assess the potential usefulness of morphometric data for determining the phylogenetic position of geoemydid species that are only known from fossil remains. With this second goal in mind, we had particular interest in the results of the analyses of plastron landmarks. Because turtle fossils are typically deformed and disarticulated by dorsoventral compression, fossil carapaces will seldom be adequate for analysis of their geometrical properties. In contrast, the original shape of plastra is mostly flat and will therefore be less affected by dorsoventral compression.

Landmark analysis under parsimony
Several methods can be applied to morphometric data for estimating phylogenetic relationships (e.g. neighbour joining (NJ), linear parsimony, squared-change parsimony (SCP), maximum likelihood (ML)). Thanks to the nature of shape data, these methods can be applied in the shape space to reduce the length of the tree. While NJ works directly on distances without explicitly reconstructing ancestral character states, the three other methods estimate ancestral shapes from landmark data in order to minimise the morphological evolution of the tree according to an optimisation logic. LAUP differs from these methods because it considers individual landmarks as a unit for the optimisation procedure and introduces several new concepts to parsimony analysis. We will provide a brief account of the method to facilitate the comprehension of this paper, but interested readers are referred to the original publications (Catalano, Goloboff & Giannini, 2010;Goloboff & Catalano, 2011Catalano & Goloboff, 2012). Catalano and Goloboff's approach operates directly on landmark coordinates and attempts to minimise individual landmark displacements between ancestors and descendants along the tree, keeping evolutionary changes localised. This is achieved by estimating ancestral configurations at the internal nodes in a manner analogous to the estimation of character states in traditional parsimony. The landmark displacements are minimised as Euclidian distances, not squared-changes as in SCP.
The optimisation of the ancestral configurations is 'spatial' as opposed to 'linear' in the sense that the coordinates of a landmark are optimised as a multivariate character and not as three (or two) independent characters. However, the method of optimisation is not exact. Goloboff & Catalano (2011) implemented algorithms to find the ancestral position of a landmark by trying out a limited set of positions determined by a grid encompassing the range of the observed positions of the landmarks in the descendants. The ancestral position approximation can be refined by increasing the number of subdivisions of the grid or by repeating the procedure using a smaller grid around the position of the first approximation ('grid nesting'). For other possible improvements, see Goloboff & Catalano (2011).
Another major aspect of LAUP is concerned with tree search and the landmark superposition (or 'alignment'). Although it is possible to look for most parsimonious trees (MPTs) using the approach just described on any kind of superposition, including Procrustes, Catalano & Goloboff (2012) argue that it is more consistent with the logic of LAUP to superimpose the configurations to minimise the parsimony score of a given tree. Thus, it is possible to dynamically optimise both tree and alignment during tree search: an initial superposition is used to make a first estimation of tree topology, and then the superposition is adjusted by rotating the configurations to minimise the score of the estimated topology. The process is repeated until no further changes in topology and superposition are produced.

Reference phylogeny
We constructed a phylogeny of geoemydids from molecular data to use as a reference against which the analyses of morphometric data could be compared. We collected DNA sequences of 64 geoemydid species and two testudinid outgroups from seven loci: 12s, C-MOS, COI-5p, CYTB, R35, RAG1, and RAG2. COI-5p sequences were obtained from DNA barcode sequences available on the BOLD workbench (Ratnasingham & Hebert, 2007). Sequences of the other loci were collected from GenBank. We only selected sequences that were used in published studies. Our least inclusive analytical unit is the species. In some cases, we used sequences from different subspecies in order to maximise locus sampling per species.
We aligned all sequences with MUSCLE v.3.8.31 (Edgar, 2004), and performed manual corrections as necessary (Dataset S1). Selection of partition scheme and transition models was performed using the greedy merging strategy (Lanfear et al., 2014) implemented in IQ-TREE version 1.5.3 (multi-thread) (Nguyen et al., 2015). For the ML phylogenetic estimation, we ran 15 searches on IQ-TREE with an edge-proportional partition model (Chernomor, Von Haeseler & Minh, 2016) and the default search heuristic settings. Branch support was assessed with ultra-fast bootstrapping (Minh, Nguyen & Von Haeseler, 2013).
Finally, the resulting ML tree was adapted to match the species sampling of each morphometric dataset (see below) by dropping superfluous species and replacing the

Specimen and landmark sampling
We examined specimens from collections in the following institutions in the USA: the Field Museum of Natural History (FMNH) in Chicago, Illinois, the Chelonian Research Institute (PCHP) in Oviedo, Florida, the Museum of Comparative Zoology (MCZ) in Cambridge, Massachusetts, and the Yale Peabody Museum (YPM) in New Haven, Connecticut. We restricted our sampling to specimens with visible bony sutures and without supernumerary bone plates or scutes. The full details of all the specimens examined is given in Dataset S2. In total, 103 carapaces and 98 plastra were used in this   (Table S1), all of adult specimens, representing 47 out of the 71 currently recognised geoemydid species. On average, each species is represented by 2.3 carapace specimens and 2.1 plastron specimens (Table S2). Our sample is not sufficient to cover all subspecies as well as possible sexual dimorphism and ontogenetic variation within geoemydid species, in part because skeletal material is rare and typically lacks sex information. These are clear limitations in our data, as uneven sampling of sexual and size variation can feasibly lead to artefactual clustering of species (e.g. species represented only by females). We note, however, that these limitations are inherent in palaeontological material, and are more likely to affect the resolution of species that are closely related than to cause major errors in the estimation of the global topology. Geoemydids and testudinids (clade Testuguria) are more closely related to each other than to emydids (Spinks et al., 2004). However, taking into account extant species and the fossil record, it is apparent that the shell morphology of testudinids is highly autapomorphic, with features such as high doming, alternating patterns of costal width, and very long anterior gular projections for most species. We consider, in contrast, that the shell of emydids is more conservative and morphologically closer to the shape of ancestral testugurians. Therefore, we selected the emydid M. terrapin to be the outgroup for our analyses.
We defined 152 landmarks for the carapace (Table 1; Dataset S3) and 48 (Table 2; Dataset S4) landmarks for the plastron (Fig. 1). Most of these landmarks (150 and 46, respectively) are laterally symmetric, i.e. they have a homologous counterpart on the opposite side of the shell. Because of this redundancy, we reduced the number of effective landmarks for the phylogenetic analyses to 77 landmarks on the carapace and 25 on the plastron (see Data filtering and Procrustes superimposition).
Our landmark definition scheme corresponds mostly to the landmarks of Claude et al. (2003), but we added landmarks for neural bones 1-5 (the number of neural bones in the Note: The landmarks marked with asterisks (49-52) were not included in the analyses for this study. We show them in this list because they are present in the raw data we provide. These landmarks actually belong to the carapace, and were acquired together with the plastron landmarks in order to make it possible to realign the plastra with their respective carapaces.
That alignment was not done in this study, because it's not applicable to species with mobile plastra and to collection specimens in which the carapace and the plastron were preserved separately.
posterior half of the carapace is often variable within a species), the marginal series, and the anterior peripheral series. We also adapted the definitions of the plastral landmarks to preserve all information related to right/left variation in our raw data, although side variation was not addressed in the present study. Landmark coordinates were acquired in three dimensions with a Microscribe G digitising arm (Revware Inc., Raleigh, NC, USA) by E.A. The raw landmark data is given in Dataset S5. Each carapace and plastron was digitised twice to account for intra-operator error.

Specimen identification
For specimen identification, we relied primarily on collection labels and catalogue information, but we verified those identifications against morphological characters and current information on geographic ranges when it was possible. Specimens for which the species attribution seemed dubious were discarded for the construction of the datasets for the analyses.

Data filtering and Procrustes superposition
We excluded right/left variation and developmental abnormalities from the data. For this, we first filtered our data to exclude specific landmarks from regions of 10 carapace specimens with obvious developmental abnormalities, as inferred from extreme right/left asymmetry and by comparison to conspecifics. No plastra had abnormalities requiring this treatment. Then, we performed generalised procrustes analyses (GPA) and generated a consensus of the two digitisation replicates of each specimen. The specimen consensus of carapaces and plastra were symmetrised estimating the missing landmarks by reflection using the R package StereoMorph v.1.5.1 (Olsen & Westneat, 2015). As side variation was removed from the data, we dropped the left side landmarks for further analyses. Therefore, the effective number of landmarks for the analyses became 77 for the carapace and 25 for the plastron. From the symmetrised specimens, we computed a generalised Procrustes consensus configuration for each species. The code used to process the landmark data is given in Dataset S6.
In box turtles (Cuora), leaf turtles (Cyclemys), and Notochelys, the plastron is not always a single structure as it is usually divided along the transverse hinge that develops in these groups forming an anterior and a posterior plastral lobe. If there is a 'natural resting angle' between the two lobes, it would seldom be observed in osteological specimens. Instead of imposing an arbitrary angle between the lobes for the generalised Procrustes superimposition with the plastra of unhinged turtles, we artificially divided the unhinged plastra into anterior and posterior lobes. This way, generalised Procrustes superimpositions of anterior and posterior plastral lobes could be performed among all turtle species. Note that the division of the landmarks into anterior and posterior lobes only follows from this necessity and does not represent a priori hypotheses of modularity. Further, although those two parts are likely to become modules in hinged turtles for architectural and functional reasons, they are not necessarily true modules in species with unhinged plastra.

Datasets
In order to assess the performance of individual superimpositions created in the previous step, and some of their combinations, we defined five datasets for the analyses (Table 3).
For the sake of clarity, we use italics to refer to datasets. When we do not use italics, we refer to the anatomical structures after which the datasets are named. The species of the shell dataset correspond to the set intersection of the species of the carapace and plastron datasets. There is no missing data in any of the datasets.
The superposition steps were performed on the anterior lobe, posterior lobe, and carapace datasets independently. The shell and plastron datasets combine those independently superposed datasets, and therefore are not representations of the intact shell and plastron shapes. This is because, as noted in the previous section, it is not possible to produce superpositions of the entire plastra that include species that have mobile lobes, and for the same reason it is not possible to reconstruct the 'correct' position of the lobes relative to the carapace. Furthermore, some collection specimens of species with immobile plastra were prepared by sawing through the bridge and their carapace and plastra no longer fit back together perfectly, hindering the reconstruction the full shell shape in those cases as well.

Phylogenetic analyses of morphometric data
We analysed our morphometric data using LAUP (spatial parsimony), SCP (Maddison, 1991;McArdle & Rodrigo, 1994;Rohlf, 2001), linear parsimony (Goloboff, Mattoni & Quinteros, 2006), neighbour-joining (Saitou & Nei, 1987) on Procrustes distances, and ML under a Brownian motion model of evolution (Felsenstein, 1973(Felsenstein, , 1981. For all methods we estimated a single optimal tree and performed a bootstrap analysis to quantify branch support. Bootstrap replications did not repeat the superposition step, only the tree searches. We took a 70% threshold as evidence of significant bootstrap support. This number is common in the discrete character literature, but arbitrary. To our knowledge, there are no studies analogous to Hillis & Bull (1993) that examine the properties of bootstrap support with continuous characters.

Selection of spatial optimisation parameters for LAUP
Landmark analysis under parsimony takes considerably longer time to run compared to traditional parsimony analysis because of the computationally-intensive heuristic spatial optimisation that must be performed to estimate ancestral landmark positions and compute the score of a tree topology. Roughly speaking, the standard optimisation strategy consists of applying a grid with a fixed number of subdivisions that discretises the space of possible positions for the ancestral landmark. After determining the optimal position of the landmark within that grid, further refinement can be done by repeating the process with a smaller grid centred around that position. Each refinement step is referred to as a 'nesting' level (Goloboff & Catalano, 2011). The number of grid divisions and grid-nesting iterations have been shown to have great impact on score estimation (Goloboff & Catalano, 2011), with a significant trade-off in computation time. In order to estimate reasonable grid subdivision and nesting parameters, we performed a series of trial analyses on the shell dataset with incremental values. We tried values of 6, 8, and 10 for the number of grid divisions, and values of 1 and 2 iterations of grid nesting with a cell window size of 1. These ranges of values were determined by our computational resources and are not atypical for the LAUP literature (Goloboff & Catalano, 2011;Catalano, Ercoli & Prevosti, 2015;Perrard, Lopez-Osorio & Carpenter, 2016). Combined, these are six parameter settings, for each of which we ran three LAUP searches of two random addition sequences followed by tree-bisection-and-reconnection (TBR) branch swapping. The outcome of the analyses was evaluated in terms of total computation time, tree score, and tree topology.

LAUP search strategies
We performed LAUP searches on all the datasets using GPA alignments and the dynamic alignment based on parsimony proposed by Catalano & Goloboff (2012). The heuristic search strategy for the GPA alignments consisted of 20 replicates of random addition sequences followed by TBR branch swapping using the default settings of TNT's MULT command.
For the dynamic alignments, we used a search strategy conceived by Santiago Catalano (E. Ascarrunz, 2017, personal communication). It first performs a random pairwise alignment of the configurations followed by a tree search with a random addition sequence and a round of TBR. The tree obtained from that initial search is then fed into a loop of dynamic alignments followed by a tree search with TBR, until the same topology is obtained in two successive iterations. That whole procedure is performed eight times, yielding eight trees. Because eight is a very low number for random addition sequences with standard parsimony, but higher numbers were also prohibiting in computation time, we also performed an additional search using the reference tree as the starting tree instead. The tree with the lowest parsimony score of those nine searches is selected as the best estimate of the MPT.
Branch support was assessed using 100 replications of symmetric bootstrap resampling landmarks, each with one random addition sequence followed by TBR.
Searches with the GPA and dynamic alignments were performed with TNT 1.5 (Goloboff & Catalano, 2016).

Linear parsimony
We ran linear parsimony analyses that treat each coordinate of landmarks in a GPA superposition as an independent continuous character (Goloboff, Mattoni & Quinteros, 2006). The search strategy consisted of 5,000 random addition sequences followed by TBR branch swapping. Bootstrap was performed with 500 replicates, each analysed with 100 random addition sequences followed by TBR branch swapping.

Maximum likelihood analysis
Maximum likelihood analyses were performed directly on GPA superposition coordinates using CONTML v.3.696 from the Phylip package (Felsenstein, 2005). We did not attempt to correct for selective and genetic covariances between landmarks (Felsenstein, 1988(Felsenstein, , 2002. The heuristic search settings for the ML tree consisted of 50 repetitions of a random addition sequence followed by tree improvement by subtree pruning and regrafting (SPR, or 'global rearrangements' in Phylip's terminology). Bootstrapping was performed with 200 matrices resampling landmarks, each analysed with five repetitions of random addition sequence followed by SPR.

Neighbour-joining analysis
We constructed NJ trees from Procrustes distance matrices of each dataset using the R package ape (Paradis, Claude & Strimmer, 2004). For branch support assessment, we also computed NJ trees from 200 bootstrap replicates resampling landmarks.

Squared-change parsimony
To our knowledge, there is no published software capable of performing heuristic tree searches using SCP as an optimality criterion (but see Klingenberg & Gidaszewski, 2010 for a branch-and-bound implementation). Therefore, we wrote an R script for this purpose. The script computes the unweighted sum of squared changes on unrooted trees using the method described by Rohlf (2001;see also McArdle & Rodrigo, 1994;Claude, 2008). This implementation yields practically identical results to the anc.recon function for estimating ML ancestral states (Goolsby, 2017) included in the Rphylopars package (Goolsby, Bruggeman & Ané, 2017), when all branch lengths are set to unity (a punctuational model corresponding to unweighted SCP). This implementation is also slightly faster for datasets of the size used in this study. The code is written in pure R (Dataset S7), except for a single instruction that calls the C++ Eigen library (Bates & Eddelbuettel, 2013;Jacob, Guennebaud & Eigen Project Contributors, 2017) to perform matrix multiplication.
The search strategy consisted of performing 100 random addition sequences, each followed by hill-climbing optimisation using first random SPRs, and secondly with nearest-neighbour interchanges (NNI). The 25 trees with the best sum of squared changes scores were then used as starting trees for further optimisation with an implementation of the stochastic NNI perturbation search strategy of Nguyen et al. (2015), which was run 10 times. All the branch swapping operations made use of functions from the Phangorn package v.2. 1.1 (Schliep, 2011). Bootstrapping was performed on 200 resampled matrices, each analysed with a single random addition sequence and the hill-climbing optimisation described above, but with less thorough settings.

Topological accuracy and tree comparisons
All trees obtained from the morphometric analyses were compared against the reference tree in order to assess the accuracy of topological estimates. This assumes that the molecular data carries significant phylogenetic signal, and that it yields tree estimates that are better or at least as accurate as our morphometric data. The assumption can be empirically challenged if we find that the trees obtained from the different analyses of the landmark data are highly congruent and different from the molecular reference tree. Under such a scenario, it would be difficult to determine whether the congruence of the trees of different parts of the shell is either indicative of strong and reliable phylogenetic signal, or simply the result of the strong integration of the carapace and the plastral lobes.
Using a modified expression from Bogdanowicz, Giaro & Wróbel (2012), we define the topological accuracy (TA) of an estimated tree T m with some dissimilarity measure as where d(T ref , T rand ) is the median dissimilarity between the reference tree and 10,000 random trees generated with the rmtree function of ape (for details on the generator algorithm, see Paradis, 2012: 313). Therefore, TA is 1 when there is no topological conflict between the observed tree and the reference tree, and is close to 0 when there is about as much topological conflict between the reference tree and the observed tree as between the reference tree and random trees on average.
The value of TA d will, of course, depend on the dissimilarity measure used to compute it. The choice of dissimilarity measure is hardly ever simple, because each measure emphasises different topological features to the point that different measures can occasionally lead to different conclusions. For this reason, we compute TA d with two dissimilarity measures. The first is the number of unrooted quartets that are resolved differently in the two trees (dQ) (the d quantity of Estabrook, McMorris & Meacham, 1985; also equivalent to an extreme case of the parametric quartet distance of Bansal, Dong & Fernández-Baca, 2011), which we computed with QDist v.2.0 (Nielsen et al., 2011). The other is the tree contradiction difference (CD) (Bapst, Schreiber & Carlson, 2018), defined as the number of incompatible splits between two unrooted topologies. We computed CD with the treeContradiction function of the paleotree package (Bapst, 2012). The TA computed with either measure is referred to as TA dQ or TA CD , respectively.
The dissimilarity measures take only into account conflict between resolved relationships, ignoring differences corresponding to the presence of polytomies. This amounts to treating all polytomies as 'soft' in the sense of Maddison (1989). That is useful, because it allows us not to consider conflict with branches that are only poorly supported (with ultra-fast bootstrap support <95%) in the reference tree which we collapsed prior to the comparisons. However, because TA only takes into account resolved relationships, a fully unresolved tree (e.g. a majority-rule bootstrap consensus) is always going to be 100% accurate while also being fully uninformative. For this reason, we also report the amount of resolution of the trees, which we compute as the number of nodes that are present in the unrooted tree divided by N − 2, where N is the number of tips of the tree.
In addition to comparing the trees obtained from landmark data against the reference tree to assess error in topology estimation, we compared the landmark trees against each other to explore possible associations between topologies estimated from different datasets and different methods. For this, we computed pairwise distance matrices of all the trees using the quartet distance (Estabrook, McMorris & Meacham, 1985) and the Robinson-Foulds distance, which are closely related to the dQ and CD dissimilarity measures, respectively, but satisfy the conditions of true metrics (dQ and CD fail to satisfy the identity of indiscernibles and the triangle inequality for trees with polytomies). We used metric multidimensional scaling (MDS) (see Hillis, Heath & John, 2005 for a similar application) on the distance matrices to produce visualisations of the tree space projected on two dimensions and applied the partitioning around medoids algorithm (PAM) (Kaufman & Rousseeuw, 1990) in order to identify clusters of topologically similar trees. We determined the optimal number of clusters to be found by PAM, the k parameter, by means of the elbow method (Thorndike, 1953). In addition, we also applied PAM with k set to 5 and 6, to see if trees would cluster following the five datasets and/or the six methods of phylogenetic analysis.
Comparisons between trees estimated from landmark data required pruning out all the species not common to all datasets. After the pruning step, all the trees used for comparisons had 42 species.

Fossil placement performance
To assess the potential performance of analyses of morphometric data for the placement of fossil species on a global reference phylogeny obtained by other means, we used a leave-one-out protocol derived from Garbin, Ascarrunz & Joyce (2018), which was in turn inspired by Berger & Stamatakis (2010). The procedure removes a species from the reference tree and reinserts it by estimating its position based on the morphometric data and some optimality criterion. This operation is performed on every species in the reference tree (except the outgroup), taken one by one. The quality of the placement of each species with the different methods is quantified by counting the number of nodes that separate the original ('correct') position in the reference tree and the point on which the species was reinserted. We scaled that quantity by the number of nodes separating the original position from the farthest possible reinsertion point.
We applied this protocol on the carapace, plastron, and shell datasets using GPA LAUP, ML, and SCP as optimality criteria. We did not use dynamic alignment with LAUP solely because of the excessive computational time required to realign the landmarks with every species on every possible placement position for the three datasets.

Utility software
Computations were performed on a desktop computer using GNU Parallel build 20161222 (Tange, 2011) to make efficient use of all the CPU cores. A few utility functions in our scripts make use of the the phytools package v.0.6 (Revell, 2012). Tree plots were produced with FigTree v.1.4.2 (Rambaut, 2014) and the APE package v.5.1 (Paradis, Claude & Strimmer, 2004).

Reference phylogeny
The ML tree obtained from our molecular sequence analysis (Fig. 2) is broadly concordant with recent global-level geoemydid phylogenies (Spinks et al., 2004;Diesmos et al., 2005;Guillon et al., 2012). All currently recognised 'genera' are recovered as monophyletic. Internal relationships of Mauremys and Cuora differ from the results of detailed analyses of these groups, but the different resolutions generally have low ultra-fast bootstrap support (<95%). Rhinoclemmys is recovered as sister to all other geoemydids (Spinks et al., 2004;Diesmos et al., 2005;contra Praschag et al., 2006;Sasaki et al., 2006) and Geoemyda and Siebenrockiella were found as sister to the Orlitia-Batagur clade (Spinks et al., 2004;contra Diesmos et al., 2005;Praschag et al., 2006). The latter point is also the only significant difference in the arrangement of major clades compared to the tree obtained from the molecular analysis of Garbin, Ascarrunz & Joyce (2018). This disagreement disappears once the bootstrap support for the clades in both trees is considered.
Unlike the traditional non-parametric bootstrap (Hillis & Bull, 1993), ultra-fast bootstrap support values have been found to be approximately unbiased estimates of the probability of a branch being correct (Minh, Nguyen & Von Haeseler, 2013). Therefore, we collapsed all branches with less than 95% ultra-fast bootstrap support for comparisons between the reference molecular phylogeny and the trees obtained from morphometric data, unless noted otherwise. The reference tree with unsupported branches collapsed is fairly conservative in relation to the results of recent molecular studies (Spinks et al., 2004;Diesmos et al., 2005;Praschag et al., 2006). Therefore, it seems unlikely that taking the collapsed reference tree as provisionally correct will be unduly severe for assessing the accuracy of the trees estimated from the morphometric data. Figure 2 Phylogeny obtained from the ML analysis of seven molecular markers. Numbers under the branches are ultra-fast bootstrap values. The reference tree used to assess the performance of the analyses of landmark data was produced from this phylogeny by pruning or collapsing the branches in grey, and adding the emydid Malaclemys terrapin as the outgroup. The internal branches that were collapsed have less than 95% ultra-fast bootstrap support. The terminal branches that were pruned correspond to species for which we did not have landmark data. The scale is in expected number of nucleotide substitutions.
Full-size  DOI: 10.7717/peerj.7476/ fig-2 Parameters for LAUP spatial optimisation We identify a single topology as the most parsimonious in replications of four parameter combinations, including the most exhaustive one (10 grid divisions, two nesting iterations). We consider that topology to be the best estimate of the MPT for the GPA alignment of the shell dataset. Scaled quartet distances between the best tree and the trees obtained in other trials were never greater than 0.072. As expected, parsimony scores improve with more thorough approximation settings (Table 4). Using two levels of grid nesting produced greater improvements in the tree score than increasing the number of grid subdivisions. The two-level nesting setting was also more effective in increasing the chance of finding the optimal topology.
More exhaustive spatial optimisation settings slow down tree searches dramatically, raising search times from between 53 min and 1 h 30 min to over 26 h on mid-2014 MacBook Pro with a 2.2 GHz Intel Core i7 processor. For the following LAUP analyses in this study, we used six grid subdivisions and two grid nesting iterations, as searches with those parameters were able to find the best tree fast enough to allow for bootstrap analyses (search times of around 1 h 40 min).

Phylogenetic analyses of morphometric data
For the optimality criterion methods (GPA and dynamic LAUP, linear parsimony, SCP, and ML) the tree scores obtained from the morphometric data analyses were always better than the scores of the reference tree (Table 5, trees available in Dataset S8). That suggests that any limited ability to recover relationships from morphometric data was not the result of suboptimal tree searches. This conclusion is also supported by the fact that we used the reference tree as the starting tree in LAUP, ensuring that the topological vicinity of the reference tree was explored. Searches initiated with the reference tree found better parsimony scores with dynamic LAUP three out of five times, whereas all the GPA LAUP searches initiated from random addition sequences outperformed the searches initiated from the reference tree. Quantitatively, the overall performance of all methods with all datasets was poor ( Table 5). The optimal trees obtained with each method had low accuracy, with TA dQ values ranging in 0.179-0.445. Accuracy estimated with the CD (TA CD ) was even lower, in the range of 0.108-0.319, in some cases about half the value of TA dQ . This large difference is likely due to the greater sensitivity of TA CD to displacements of a few outlier species, an issue that also affects other split-based measures like the Robinson-Foulds distance (Böcker, Canzar & Klau, 2013). Nonetheless, TA dQ and TA CD broadly coincide on the accuracy of the trees relative to each other; the Pearson correlation coefficient between TA dQ and TA CD is 0.841.
No analytical method or type of alignment showed a significant advantage or disadvantage over the rest. However, there were more consistent differences in performance between the different datasets (Table 5, see also . Analyses of the anterior lobe landmarks tended to perform the worst (median TA dQ = 0.198, median TA CD = 0.135), closely followed by the posterior lobe and plastron datasets with similar performance (median TA dQ = 0.229 and median TA CD = 0.149 for posterior lobe, and median TA dQ = 0.217 and median TA CD = 0.149 for plastron). The analyses of the shell and carapace datasets tended to be the most accurate (carapace median TA dQ = 0.342, median TA CD = 0.254; shell median TA dQ = 0.296, median TA CD = 0.2).
The 70%-rule consensus of bootstrap trees improved accuracy (Table 6). Predictably, this enhanced accuracy came at the expense of major losses of resolution; the most resolved consensus tree was obtained from the carapace dataset with neighbour-joining, with 52% of resolved nodes. Even that best-resolved tree is disappointing, as most of the resolved nodes only join pairs of species, failing to find major clades. The clades recovered more often in the 70%-rule bootstrap consensus included Batagur, Pangshura, and Orlitia (Fig. 3). We found no notable differences in total resolution or artefactual resolution (incorrectly resolved nodes) across the different methods.
Qualitatively, some trees present interesting features (Fig. 4). The reference tree has a large and well-supported clade of Asian turtles including Batagur, Hardella, Pangshura, Morenia, Orlitia, and Malayemys (Fig. 2). The trees from morphometric data often recover close relationships between the species in that Batagur-Malayemys clade, although some of those 'genera' are only occasionally recovered as monophyletic. The trees obtained from the carapace dataset tended to show all the Mauremys species clustering together either as a clade or as a paraphyletic assemblage, except for the linear parsimony tree. This is surprising, as Mauremys seems difficult to recover as a clade from its carapace alone (never recovered as monophyletic or paraphyletic in Garbin, Ascarrunz & Joyce, 2018), which does not seem very distinctive among geoemydids from our visual inspection of collection specimens.
The anterior lobe, posterior lobe, and plastron datasets yielded trees that consistently found Cuora as monophyletic, but with the two Cyclemys species often arranged as paraphyletic to Cuora (Figs. 3 and 4). This likely reflects the presence of a plastral hinge in these species. For instance, a close alignment of the pectoro-abdominal sulcus with the articulation between the hyoplastron and the hypoplastron is needed for a functional hinge, and is also present in species with partial plastral kinesis, such as Heosemys spinosa and Vijayachelys sylvatica (not observed in our sample, but reported in Moll, 1985, and Full-size  DOI: 10.7717/peerj.7476/ fig-3 references therein). The significant geometric signal produced by the presence of the hinge in testudinoids was demonstrated and studied in detail by Claude (2006). That study also showed that the hinge might be a source of confounding systematic characters, as it induces similar (albeit distinguishable) plastral morphologies in emydids and geoemydids. In our results, although the clustering of Cuora and Cyclemys clearly reflects an evolutionary convergence, it is encouraging that Cuora is always recovered as a clade, possibly because of features such as the rounded margins of its plastral lobes and the particular geometry of the entoplastral region, among others. The strength of that plastral signal is such, that all shell trees also recover Cuora as monophyletic.
It is noteworthy that many of the trees inferred from datasets that include plastral lobes also grouped together as a clade or paraphyletic assemblage all or almost all Rhinoclemmys species (Fig. 4), but this result is not as consistent across methods. Quantitative comparison between trees is concordant with some of the previous observations (Fig. 5). The elbow method indicates that the tree spaces produced by quartet and Robinson-Foulds distances can be optimally partitioned into three clusters with the PAM algorithm (k = 3). In both tree spaces the greatest separation occurs along the MDS axis 1, between a compact cluster of trees derived from the carapace dataset and two clusters of trees derived from the anterior lobe, posterior lobe, plastron, and shell datasets. Among those two clusters, the trees inferred from the anterior lobe and shell datasets cluster together almost completely (the single exception is the ML shell tree in the quartet The resolution is given as the number of nodes observed in the tree divided by the number of nodes in a fully resolved tree with the same number of species. Trees were treated as unrooted. distance tree space). The trees inferred from the plastron dataset are found to cluster either with the posterior lobe trees (quartet distance tree space) or included in the cluster with the shell and posterior lobe trees (Robinson-Foulds distance tree space). These results are indicative of the strong influence of the plastral configurations, the anterior lobe in particular, in the overall signal detected by all methods. This influence must be due to specific geometric features, as the carapace dataset contains more than twice as many landmarks as the plastron dataset. The GPA superposition of the anterior lobe and posterior lobe have a between-species variance of 0.012 and 0.015, respectively, and the whole plastron has a variance of 0.013. In contrast, the carapace superposition has a variance of 0.002. Trees from the shell dataset still retain some of the better properties seen in the carapace trees, such as the better grouping of Mauremys and finding Notochelys close to Cyclemys.
We also applied PAM with k values of 5 and 6, to see if the trees would cluster reflecting the five datasets or the six methods of phylogenetic analysis used. We found five clusters that for the most part reflect the five landmark datasets. This corroborates that the datasets are the most important factor shaping the tree space even within the three main clusters indicated by the elbow method. PAM with k = 6 does not yield clusters that reflect the six phylogenetic methods.

Fossil placement performance
The fossil placement analyses consist in removing a single species from the reference tree and inserting it back into the tree based on the landmark data. This is done for all species in the tree, one by one. The results show some differences in performance across datasets, with the plastron dataset yielding the greatest scaled nodal errors, and carapace the least (Fig. 6). Additionally, ML performed consistently better than GPA LAUP and SCP, as measured by the median and the integrated cumulative distribution of the median scaled error. The differences are small and appear most pronounced for the carapace dataset, where the median scaled placement error with ML (0.077) was less than half the median error with GPA LAUP and SCP (both 0.167). For the carapace dataset, scaled placement errors in the range 0.067-0.09 correspond to placements at only one node away from the original position in the reference tree. With ML, the 70th percentile of the scaled placement errors reflects up to three nodes of separation between the reinsertion point and the original ('correct') position on the reference tree.
The fossil placement analyses were performed with the fully resolved reference tree that includes many branches that are poorly supported by molecular data (Fig. 2). Therefore, at least a small amount of the placement error can be reasonably ascribed to the incorrect branching patterns in the reference tree. Considering this, the performance of fossil placement analyses is much better than the performance of full phylogenetic analyses, at least for the carapace and shell datasets.

Performance of methods of phylogenetic analysis
No method consistently yielded more accurate and precise trees. In this respect, our results are in agreement with the study of Catalano & Torres (2017), who compiled 41 landmark datasets and analysed them with five phylogenetic methods including neighbourjoining of Procrustes distances, linear parsimony, and LAUP (GPA). They found little difference in the performance of all methods. Their results and ours suggest that the current methods of phylogenetic inference and landmark datasets available do not directly address the major obstacles in making reliable phylogenetic inferences from geometric information. In our case the results suggest that the poor performance is particularly driven by low phylogenetic information and/or high noise (convergence) in the data, and the differences between trees from different methods are attributable to comparatively minor effects of the search strategies and optimality criteria. Indeed, the topological vicinities demonstrated in Fig. 5 show that there is some significant degree of agreement between the methods, and a Shimodaira-Hasegawa test (Shimodaira & Hasegawa, 1999) of all the trees obtained from the anterior lobe dataset shows that none of the trees yielded by the other methods has a significantly worse likelihood than the ML tree (all p-values > 0.26, test performed with Phylip's CONTML). In contrast, the molecular reference tree has a significantly lower likelihood with the anterior lobe data (p < 0.001).
We take this to be indicative of all the methods being misled and converging into the same optimality plateau, away from the more plausible molecular tree. Previous studies have found evidence of significant convergence of shell shapes in testudinids (Claude et al., 2003;Claude, 2006), and this is likely part of the signal that is misleading all the methods. These problems apply even to LAUP, which was specifically conceived for a proper treatment of landmark characters consistent with the general framework of maximum parsimony. Our heuristic searches under LAUP were severely limited by the computational demands of the numerical approximation algorithms involved. However, we should note that we also gave LAUP an additional 'advantage' by launching searches using the reference tree as the starting point. Furthermore, the searches performed by Catalano & Torres (2017) were more exhaustive than ours, and their overall findings are the same. Other two major studies making use of LAUP have been Perrard, Lopez-Osorio & Carpenter (2016) and Jones & Butler (2018). Perrard, Lopez-Osorio & Carpenter (2016) used dynamic LAUP of wing venation patterns in landmark-only and total-evidence analyses of Vespinae. They found that landmarks alone would not yield reliable results, and the addition of landmark characters to their molecular and discrete character data affected only poorly supported nodes and was reasonably attributable to noise. Jones & Butler (2018) conducted phylogenetic analyses of phytosaurs with different versions of a matrix with mostly discrete characters and subsets of characters represented in either discrete, continuous, or landmark form, or a combination of continuous and landmark form. Compared to the present study, their analyses ran more thorough heuristic searches and made use of implied weighting (Goloboff, 1993(Goloboff, , 2014, but did not perform the computationally expensive dynamic realignment of 3D configurations (all their landmark characters were two-dimensional and the configurations were aligned by GPA). They found that the inclusion of landmark characters had a small topological effect compared to univariate continuous characters, and was accompanied by reduction in nodal support. The authors favoured the trees obtained from discrete characters alone and discrete combined with continuous characters.
Squared-change parsimony has long been recommended for the treatment of morphometric shape variables in a phylogenetic context (Rohlf, 2001(Rohlf, , 2002; also criticised by Catalano, Goloboff & Giannini, 2010), but its use for phylogenetic inference has remained limited to small datasets that do not require heuristic searches (Klingenberg & Gidaszewski, 2010). Here, we presented the first implementation, to our knowledge, of heuristic searches with the minimisation of the sum of squared changes as the optimality criterion. In our study, SCP often performed worse than other methods with our data (except with the carapace dataset), but the performance of all the methods was poor, with TA dQ values typically lower to 0.4. Further studies will be needed to arrive at more general conclusions about the performance and utility of SCP in phylogenetic analysis using continuous characters. As expected with any kind of data analysed with methods that perform similarly, the greatest impact on topology was determined by the particular properties of the samples chosen. The datasets yielded recognisably different sets of topologies, although none consistently better than the others in terms of accuracy, and in the fossil placement analyses the carapace dataset outperformed the shell and plastron datasets. At the same time, the number of landmarks in the configurations was not the major factor driving the topology when multiple configurations were included in the same analysis. This highlights the importance of configuration selection in analyses in general. Catalano, Ercoli & Prevosti (2015) performed LAUP analyses of mustelids based on nine configurations from diverse osteological elements and found that the topology inferred from landmarks converged with their molecular reference tree as more configurations were included. As our configurations are not as numerous and feature degrees of functional and geometric integration (the shell as a character complex), our study should not be taken as strong evidence against a general rule of thumb that increasing the number of independent configurations sampled improves tree inference from landmark data, but serves as a reminder to be mindful of the influences of individual configurations, especially when the total number of samples is low.
Further studies would be needed to evaluate whether atomising the shell into modules would be helpful for identifying independently evolving landmark sets and improving phylogenetic accuracy. Even if a priori selection of modules could be tempting, it should be kept in mind that the delimitation of developmental modules can itself evolve, making this kind of approach complicated in practice. For instance, in geoemydids, it is clear that the acquisition of a hinge between plastral lobes will generate functional and architectural constraints that would likely alter module delimitations between clades that have mobile plastra and those who don't.
Given all the above, we have no grounds in performance to recommend any of the methods we studied over the rest, but other practical considerations are still of interest. The Brownian motion model that we used here under the criterion of ML is also naturally applicable to Bayesian inference (Parins-Fukuchi, 2017, 2018, and the field of Bayesian analysis of morphological characters is developing rapidly (see Wright, 2019 for an overview of the state of the art). Much work remains to be done, but the software RevBayes (Höhna et al., 2016) has emerged as a flexible and performant platform for Bayesian phylogenetic analysis, and seems suitable for future experimentation with continuous characters and their integration with more traditional discrete morphological characters and molecular data. On the parsimony side, similar tools are available on TNT and already mature, despite the recency of LAUP. However, the computational burden of LAUP might discourage potential users, especially since in the studies published to date landmark data has not decisively altered the conclusions reached from discrete data. From a more positive angle, our results suggest that simple linear parsimony performs about as well as LAUP with empirical data, and therefore parsimony users could take advantage of linear parsimony at the very least for kick starting LAUP searches.

Prospects on the utility of morphometric data
Our results show that phylogenetic analyses of morphometric shell data can recover close relationships of species in clades validated by molecular data, particularly some of the 'genera' in our ingroup. However, most of those relationships receive poor bootstrap support, and more global relationships inferred from morphometric data are much more inconsistent between changes of analytical method and bear little resemblance to our molecular reference tree. The failure of morphometric data to recover deep nodes is comparable to the performance of discrete morphological shell characters (Garbin, Ascarrunz & Joyce, 2018), and shows that the phylogenetic signal of morphology with either approach is too weak to suffice for the inference of the global phylogeny of geoemydids with an acceptable level of confidence.
The set of structures represented by our morphometric data does not completely overlap with the set of structures represented in the discrete morphological matrices. The morphometric representation fails to consider classic diagnostic characters, such as the number and position of longitudinal carapacial keels and musk duct foramina. Conversely, a recent revision of discrete shell characters (Garbin, Ascarrunz & Joyce, 2018) does not include characters for the overall shape of the carapace, and very few characters pertaining the angles formed by the bone sutures and the scute sulci. Furthermore, from the results of the present study, it is apparent that the morphometric data can capture synapomorphies not previously found using discrete character coding. These potential morphometric synapomorphies allowed the methods to frequently recover most or all the Mauremys species as a clade in the carapace trees and almost all the Rhinoclemmys clustering together as paraphyletic in the plastron trees. Those two clades are difficult to recover with discrete morphological characters.
Given the limitations of the landmark data of the present study and the discrete matrix from Garbin, Ascarrunz & Joyce (2018), it becomes clear that inferences about the phylogenetic position of fossil geoemydids will need to rely on molecular data to resolve the global relationships within the clade.
A way to achieve the integration of morphological and molecular data is by means of fossil placement analyses on a molecular scaffold. We evaluated the performance of those analyses with landmark data in this study, and Garbin, Ascarrunz & Joyce (2018) did the same with discrete characters. Globally, the accuracy of the discrete characters from the entire shell is very similar to the accuracy obtained by carapace landmarks alone (Figs. 6 and 7), with the landmark data also having the advantage of giving fully resolved relationships whereas the discrete data yielded multiple optimal parsimony placements for 56% of the species (number of most parsimonious placements: median 3, maximum 15). The performance of landmarks from the entire shell dataset was slightly worse. The fact that the mere accumulation of the configurations (i.e., increasing the number of shape variables) does not improve phylogenetic inference from landmarks in our data highlights the importance of misleading signal from the plastron data, which is particularly unfortunate, as this is the structure whose geometry is best preserved in fossils, and the necessity to boost the signal of the most reliable characters. Parins-Fukuchi (2018; based on previous work by Berger & Stamatakis, 2010) presented a method for continuous data that calibrates characters (e.g. landmarks) according to their fit to a reliable phylogeny obtained by external means (e.g. molecular analyses). We will explore this approach in an upcoming contribution.
From a practical perspective, geoemydid shells are almost never found with their intact original geometry, which hampers the utilisation the carapace landmarks. However, a strength of landmarks as a means of data acquisition and encoding is that a wealth of linear characters can be derived from them, and therefore it should be possible to identify linear measurements corresponding to localised shell features that would not be strongly affected by global deformation. Among such measurements, valuable phylogenetic characters may be discovered. Our future work will present a simple method to achieve this.

CONCLUSIONS
Our results show that geoemydid shell landmark data behave similarly to traditional discrete shell characters: they do not suffice for the reliable estimation of phylogenetic relationships, regardless of the method of analysis used, but their combination with molecular trees in phylogenetic placement analyses yields more promising results. More work will be needed to make practical the application of landmark data in palaeontological studies of geoemydids, particularly because the structures that bear the strongest phylogenetic signal are not the ones best preserved in fossils. We believe that such further effort is warranted, nevertheless, as our landmark datasets and the currently available discrete character matrices do not cover completely overlapping sets of morphological features, and as much observed variation is better represented by continuous measurements anyway.  (2018). The figure and the analyses are new, and they include the 39 species that are also present in the reference tree of this study. The median scaled nodal distance was used when multiple most parsimonious placements were found for a single species. Please also note that there is no similar figure in Garbin, Ascarrunz & Joyce (2018).