Repeatability of clades as a criterion of reliability: a case study for molecular phylogeny of Acanthomorpha (Teleostei) with larger number of taxa
Introduction
With advances in the collection of molecular data, phylogenetic results obtained from molecular sources are used to a greater extent to interpret organismal diversity (Moritz and Hillis, 1996). These phylogenetic hypotheses rely increasingly on the information obtained from different genes. The benefit of sampling several independent gene genealogies to infer phylogenetic relationships among taxa is well established (e.g., Cao et al., 1994; Cumming et al., 1995; Russo et al., 1996; Zardoya and Meyer, 1996), since ultimately a better representation of the whole genome is highly desirable. However, the issue of how to analyze multiple sources of data appears to remain unsettled (Lecointre and Deleporte, 2000; Miyamoto and Fitch, 1995). Extreme views emphasize separate analysis (Mickevich, 1978) or simultaneous analysis (e.g., Nixon and Carpenter, 1996), also called the “total evidence” approach by Kluge (1989). Even if the importance of different protocols of analyses was discussed by the flurry of recent papers (De Queiroz et al., 1995; Huelsenbeck et al., 1996; Levasseur and Lapointe, 2001; Miyamoto and Fitch, 1995; Nixon and Carpenter, 1996; Lecointre and Deleporte, 2000), the “total evidence approach” is currently the most widely employed. In this paper, we present new molecular data for the Acanthomorpha (Teleostei) to question whether, in terms of reliability, the direct application of the “total evidence” approach is the best solution for a difficult phylogenetic problem.
One of the central questions in systematics is how confidence in phylogenetic hypotheses can be assessed. As claimed by Hennig (1966), “… the reliability of hypothesis increases with number of individual characters that can be fitted into transformation series…” Following this claim, supporters of the “total evidence” approach advocate combining all available data in a single matrix in order to globally maximize congruence of the whole set of available relevant characters, the principle of character congruence (Barrett et al., 1991; Eernisse and Kluge, 1993; Kluge, 1989). The basic assumption for this approach is that there are no significant differences in nature between partitions, thus implying that any delineation of data partitions is only a product of technical and/or historical artifacts. The total evidence approach performs well (securing increasing robustness as more characters are analyzed) when this basic assumption is met and when homoplasy (non-historical signal) is randomly distributed among the data partitions. In this case, it is expected that phylogeny will be inferred correctly, if enough data are collected, because historical signal will rise above random homoplasy (Farris, 1983). That is, stochastic errors in the data may lead to an incorrect inference when sample size is small but will disappear with infinite sample size (Swofford et al., 1996). However, molecular systematists have recognized that homoplasy tends to accumulate within genes in ways that are not completely random (Naylor and Brown, 1998). Non-random aspects of molecular homoplasy may be understood by analyzing functional constraints and can be detected without phylogenetic tools, for example by identifying mutational and/or base compositional biases within some positions or regions free to vary. These molecular processes may originate and accumulate non-random homoplasy within a gene and potentially mislead phylogenetic reconstruction.
Furthermore, these properties can be very different from one gene to another and provoke different kinds of deceptive signals. For instance, a set of unrelated taxa sharing the same strong compositional bias in a gene will be erroneously clustered in a tree based on DNA sequences of this gene (Hasegawa and Hashimoto, 1993; Leipe et al., 1993; Chang and Campbell, 2000; Gautier, 2000). It is possible that the contribution of each data matrix to the final topology may be disproportionate and may have unexpected effects in simultaneous analysis. In the worst case, a topology could be completely determined by one of the matrices which contains strong hierarchic but non-historical signal when the others present weak but truly historical signal (Naylor and Adams, 2001; Chen, 2001). In such cases, the preferred strategy to obtain a reliable result would not be a simple total evidence analysis but a careful dissection of noise and signal among the different data partitions. Clearly, reliability of the inference will not necessarily increase with increasing number of characters by just combining heterogeneous sources of data. Warnings against simultaneous analysis have been issued repeatedly in the recent literature, for instance under the notion of “process partitions” by Bull et al. (1993), who emphasized that verification of congruence or homogeneity between data sets is necessary and critical before combining data and performing simultaneous analysis. Finally, if homoplasy accumulates in a non-random manner within genes while in a heterogeneous manner between genes, data partitions have some degree of naturalness, so acceptance a priori of the null hypothesis of the total evidence approach is not a reasonable practice.
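The kind of base compositional bias described above can be screened for without building any tree. The following is a minimal sketch, not the procedure used in this study: it tabulates base counts at third codon positions for each taxon and computes a chi-square statistic for homogeneity of composition across taxa. The function names and the input format (a dictionary of aligned, in-frame coding sequences) are assumptions made for illustration. Such a test treats sequences as independent samples and ignores their phylogenetic relatedness, so it is only a rough indicator.

```python
from collections import Counter

BASES = "ACGT"

def third_position_counts(seq):
    """Count A, C, G, T at third codon positions of an in-frame coding sequence."""
    thirds = seq.upper()[2::3]  # positions 3, 6, 9, ... of the reading frame
    counts = Counter(b for b in thirds if b in BASES)  # gaps and ambiguities are skipped
    return [counts[b] for b in BASES]

def composition_chi_square(seqs):
    """Chi-square statistic for homogeneity of base composition across taxa.

    seqs: dict mapping a taxon name to its aligned, in-frame coding sequence
          (hypothetical input format chosen for this sketch).
    Returns (statistic, degrees_of_freedom); the degrees of freedom assume
    that all four bases are represented in the data.
    """
    table = {name: third_position_counts(s) for name, s in seqs.items()}
    col_totals = [sum(row[i] for row in table.values()) for i in range(4)]
    grand_total = sum(col_totals)
    stat = 0.0
    for row in table.values():
        row_total = sum(row)
        for i in range(4):
            expected = row_total * col_totals[i] / grand_total
            if expected > 0:
                stat += (row[i] - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(BASES) - 1)
    return stat, dof
```

A large statistic relative to its degrees of freedom suggests that composition differs among taxa, which is the situation in which unrelated taxa sharing the same bias risk being clustered together.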
The most common way in systematic studies to assess “reliability” of phylogenetic inferences is the use of indicators of robustness, such as the Bremer (or decay) index (Bremer, 1994) and bootstrap proportions (Felsenstein, 1985). Robustness is a numerical value attached to internal branches in trees, calculated from a given (single) data set to measure the strength of support for those branches and the corresponding groups. One must keep in mind that these indicators merely assess the strength of the signal used to order the data hierarchically (Swofford et al., 1996). That signal can originate either from common ancestry or from non-phylogenetic sources such as strong convergent selective constraints. Therefore, the numerical value of a robustness indicator does not measure the reliability of a phylogenetic inference. Robustness would be considered as reliability only if (1) assumptions about independence of characters and homogeneous distribution of homoplasy were not violated (Kluge and Wolf, 1993; Sanderson, 1989); and (2) all the available knowledge at the time has been taken into account (Carnap, 1950; Lecointre and Deleporte, 2000). However, as stated above, the ideal data set may not be easy to collect and this may be particularly true for molecular data. According to simulation studies, support indices could over- or under-estimate the real expected reliability (Hillis and Bull, 1993). These indicators could be totally misleading due to classical pitfalls of phylogenetic reconstruction provoked by unequal rates of changes among lineages or base compositional bias and/or by long branch attraction (Felsenstein, 1978; Huelsenbeck, 1997; Philippe and Adoutte, 1998; Philippe and Douzery, 1994; Philippe and Laurent, 1998). One must wonder whether a high bootstrap proportion should be given higher confidence than a lower one.
It is often impossible to know from a single tree (such as a tree inferred from simultaneous analysis) whether the grouping patterns are due to artifacts of phylogenetic reconstruction or due to common ancestry, whatever the statistical robustness associated. However, separate analysis provides other opportunities for assessing reliability.
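To make concrete what a bootstrap proportion measures, here is a minimal sketch for the simplest case of four taxa under unweighted parsimony, where the most parsimonious tree is the split supported by the largest number of informative sites. Alignment columns are resampled with replacement, and the proportion of replicates in which each split wins is reported. The function names, the four-taxon restriction, and the handling of ties (a tied replicate counts for no split) are simplifying assumptions for illustration, not the settings used in this study.

```python
import random

SPLITS = ("AB|CD", "AC|BD", "AD|BC")

def split_support(columns):
    """Count parsimony-informative sites favoring each of the three four-taxon splits.

    columns: list of 4-character strings, one per site, in taxon order (A, B, C, D).
    """
    counts = dict.fromkeys(SPLITS, 0)
    for a, b, c, d in columns:
        if a == b and c == d and a != c:
            counts["AB|CD"] += 1
        elif a == c and b == d and a != b:
            counts["AC|BD"] += 1
        elif a == d and b == c and a != b:
            counts["AD|BC"] += 1
    return counts

def bootstrap_proportions(columns, replicates=1000, seed=0):
    """Proportion of bootstrap replicates in which each split is uniquely best."""
    rng = random.Random(seed)
    wins = dict.fromkeys(SPLITS, 0)
    n = len(columns)
    for _ in range(replicates):
        # Resample sites with replacement to build a pseudo-replicate alignment.
        sample = [columns[rng.randrange(n)] for _ in range(n)]
        counts = split_support(sample)
        best = max(counts.values())
        winners = [s for s in SPLITS if counts[s] == best]
        if len(winners) == 1:  # ties are counted for no split
            wins[winners[0]] += 1
    return {s: wins[s] / replicates for s in SPLITS}
```

The procedure only asks how consistently the sites of one data set favor a split. Sites that agree because of a shared compositional bias raise the proportion just as effectively as sites that agree because of common ancestry, which is why the value cannot by itself distinguish the two.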
Reliability is the quality of being trustworthy given to a statement at a given time. It is never associated with a numerical statistical value drawn from a single data set isolated from other remaining evidence (Carnap, 1950). In other words, when analyzing several data sets separately (which is what the world-wide scientific community does every day), a given bootstrap proportion obtained for a clade from a single data set cannot be a measure of reliability. In science, reliability depends on the repeatability of results through different investigations (Grande, 1994). It is not surprising that experienced molecular systematists converge on a “taxonomic congruence” approach, proposing to analyze data sets separately (Grande, 1994; Mickevich, 1978; Miyamoto and Fitch, 1995; Nelson, 1979), at least as a heuristic step. The congruence of inferences separately drawn from independent data is considered a strong indicator of reliability. If we keep in mind the fact that molecular homoplasy may have different effects on tree reconstruction from one gene to another, obtaining the same clade from separate analysis of several genes despite this fact renders the clade even more reliable. In other words, obtaining the same tree or even some common clades means that there is a common structure in these data sets that must come from common evolutionary history. Miyamoto and Fitch (1995) suggested that relationships among taxa that are supported by different independent data sets are particularly robust even if the statistical support for each individual result is weak. This is equivalent to obtaining independent verification of an experimental hypothesis from an additional experimental source. This independent type of verification may be lost in combining data sets right from the beginning. Empirically, this point of view implies that two independent genes are not likely to harbor the same positively misleading signals.
Even if it is always possible to imagine that two or three genes can exhibit the same positively misleading signals (for instance the same long branches due to a common taxonomic sampling issue), the risk here is by far lower than blindly trusting the bootstrap proportions from the direct simultaneous analysis. The same reasoning can be used to reply to the objection made to separate analyses, that different genes may contribute to resolve different parts of a phylogeny. Finding the same clade repeated despite the possibility that different genes may resolve different parts of the phylogeny is using repeatability in a conservative manner, securing reliability.
Thus, the main advantage of separate analysis (without consensus trees) is that it provides a measure of repeatability and, beyond what a simple majority-rule consensus tree offers, an additional opportunity to detect tree reconstruction artifacts due to local positively misleading signals. We would be inclined to prefer a clade that is inferred repeatedly from several data sets with low bootstrap proportions over a highly supported clade inferred from a single data set. We will therefore not use consensus trees; instead, we will use repeatability through separate analyses to assess the reliability of the clades found in the tree from the simultaneous analysis (Fig. 1). In other words, we use the simultaneous analysis to obtain the complete tree, and separate analyses to determine which clades of that tree are reliable.
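The bookkeeping behind this use of repeatability can be sketched as follows. Each tree is assumed to be represented as a collection of clades, and each clade as a frozenset of taxon names. For every clade of the simultaneous-analysis tree, the sketch lists the genes whose separate tree also contains that clade. The function names, the data representation, and the `min_genes` threshold are hypothetical choices made for illustration, not a description of the software used in this study.

```python
def clade_repeatability(combined_clades, gene_clades):
    """For each clade of the combined tree, list the genes whose tree also contains it.

    combined_clades: iterable of frozensets of taxon names (clades of the
                     simultaneous-analysis tree).
    gene_clades:     dict mapping a gene name to the set of clades (frozensets)
                     found in the tree from that gene analyzed separately.
    """
    return {
        clade: sorted(gene for gene, clades in gene_clades.items() if clade in clades)
        for clade in combined_clades
    }

def repeated_clades(combined_clades, gene_clades, min_genes=2):
    """Clades of the combined tree recovered by at least `min_genes` separate analyses."""
    repeat = clade_repeatability(combined_clades, gene_clades)
    return [clade for clade, genes in repeat.items() if len(genes) >= min_genes]

# Toy usage with placeholder taxon and gene names (not data from this study).
combined = [frozenset({"t1", "t2"}), frozenset({"t3", "t4"})]
per_gene = {
    "gene1": {frozenset({"t1", "t2"})},
    "gene2": {frozenset({"t1", "t2"}), frozenset({"t3", "t4"})},
}
print(repeated_clades(combined, per_gene))
```

Bootstrap values play no role in this count: a clade is retained because independent genes recover it, whatever the support in each individual analysis.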
The spiny teleost fishes grouped within Acanthomorpha (Rosen, 1973) comprise more than 14,736 species (Helfman et al., 1997; Nelson, 1994) and represent one third of the extant vertebrate species of the world. This clade is divided into three large assemblages: the Paracanthopterygii (cods, goosefishes), the Atherinomorpha (silversides), and the most species-rich group, the Percomorpha (perch-like fishes). The earliest acanthomorph fossils known are aipichthyids and polymixiids from the Cenomanian, Upper Cretaceous (Gaudant, 1978; Gayet, 1980a; Otero and Gayet, 1996; Patterson, 1964). Shortly after this period, a vast diversity of acanthomorphs (representing 80 families) suddenly appears in the fossil record, starting in the Early Eocene between 45–55 million years ago (Benton, 1993; Patterson, 1993). This pattern suggests a putative rapid radiation, which resulted in the most diverse vertebrate group of the modern fauna.
Since the pioneering work on systematics of fishes by Greenwood et al. (1966), many studies were published proposing hypotheses of relationships for lower teleosts, but relatively few for the higher teleosts, especially for the Acanthomorpha (Lecointre, 1994; Rosen, 1982, 1985). Consequently, Nelson (1989) concluded his survey of teleostean phylogeny with the following statement: “recent work has resolved the bush at the bottom, but the bush at the top persists,” a bush already clearly illustrated by Rosen (1982). Our knowledge of high-level acanthomorph phylogeny is very poor considering their sizeable species diversity, especially within the major clade Percomorpha. The vast majority of studies of higher teleosts have focused on relationships at the specific and generic levels or between closely related families. So far, the only three cladograms based on morphological characters depicting interrelationships among acanthomorphs are those of Johnson and Patterson (1993), Lauder and Liem (1983), and Stiassny and Moore (1992). In spite of showing resolution for the basal parts of the tree, and in spite of proposing new hypotheses (e.g., Smegmamorpha in Johnson and Patterson, 1993), substantial disagreement persists, especially on the phylogenetic positions of the Zeiformes (dories), Beryciformes (squirrel fishes), and Synbranchiformes (swamp or spiny eels). Clearly, the phylogenetic relationships reflecting the main acanthomorph radiation are still unclear. Molecular data are only slowly starting to produce results, such as the two recent studies published during the preparation of this paper (Miya et al., 2001; Wiley et al., 2000). These phylogenetic trees are based, respectively, on a combined matrix (1722 characters) of 12S mitochondrial DNA, 28S nuclear DNA, and morphological data from Johnson and Patterson (1993) (Wiley et al., 2000) and on selected nucleotide sequences (7002 characters) from whole mitochondrial genomes (Miya et al., 2001).
If merely increasing the number of characters for analysis and if performing a “total evidence” approach could give reliable results, these two studies should provide a better insight of acanthomorph phylogeny. However, interrelationships between acanthomorph orders or suborders representing major lineages remain poorly resolved in terms of statistical support, with few exceptions. Moreover, as discussed above, robustness does not necessarily mean reliability. Without comparing trees from independent data sets, it is not possible to assess reliability of newly proposed acanthomorph clades. Following this view, the acanthomorph problem still needs to be examined, especially by way of separate analyses.
Adequate taxonomic sampling to fairly represent the highly complex patterns of diversification of Acanthomorpha is a compulsory issue requiring careful consideration. In general, major lineages within Acanthomorpha are poorly defined, especially for Percomorpha. In such situations, taxonomic sampling must be extended to neighboring lineages, until the sample is sufficiently inclusive to contain the clade of interest. This is the case for the Percomorpha (Johnson, 1993; Johnson and Patterson, 1993; Rosen, 1973, 1985; Stiassny, 1986) and explains why one must sample the whole acanthomorph diversity when just trying to investigate percomorph phylogeny. A related problem is that some traditionally recognized percomorph subdivisions have been shown to be polyphyletic (e.g., Perciformes, Trachinoidei, Percoidei, Scorpaeniformes; Gill, 1996; Johnson, 1993; Patterson and Rosen, 1989; Stiassny, 1990; Stiassny and Moore, 1992; Travers, 1981) and may not even belong to this group. Since monophyly of such groups is questionable, using reduced sampling from predefined groups is risky. When sampling taxa from paraphyletic or polyphyletic groups, phylogenetic conclusions will depend on the choice of representatives. To address the phylogenetic hypothesis correctly, sampling a large variety of terminals within each of the putatively polyphyletic subdivisions is required. This drastically increases the necessary taxonomic sampling. However, all previous studies sampled very few acanthomorph terminals. One of the best sampling efforts includes merely 32 acanthomorph taxa (Miya et al., 2001), with only a single representative from the large group Perciformes, which is clearly a polyphyletic group (Johnson and Patterson, 1993)!
For this study we sampled acanthomorph diversity thoroughly, including representation of 48 suborders and more than 60 families. We present and analyze new data from four genes with different properties in their cellular location, function, and sequence variation. These include two nuclear genes: portions of the 28S ribosomal DNA (domains C1–C2, D3, D6, C12, and D12) and the gene encoding rhodopsin; and two mitochondrial ribosomal genes: 12S and 16S (Table 1). Using both separate analysis and simultaneous analysis, this study aims to discover reliable clades among the main lineages within the acanthomorph radiation, with particular attention to the phylogenetic relationships of the order Zeiformes and the interrelationships of members of the Smegmamorpha (new clade defined by Johnson and Patterson, 1993) and of “Perciformes.” We present a detailed analysis that shows the use of repeatability as the main criterion to postulate the validity of some previously unrecognized clades.
Taxon sampling and DNA extraction
Taxa were selected to represent a large proportion of acanthomorph diversity, including representatives of 40 suborders and more than 60 families, plus outgroup taxa from seven different orders (Table 1). The sampling backbone followed the cladogram proposed by Johnson and Patterson (1993), one of the morphological hypotheses we intended to test. All terminal clades are represented except Stephanoberyciformes and Elassomatidae. For the questionable “Perciformes” clade, 41 species were chosen to
Characterization of nucleotide substitution patterns
Sequences were successfully obtained using the primers listed in Table 2 for all species except Pogonoperca punctata and Fistularia petimba (for the 28S domains D3, D6, and D12). The rhodopsin sequence of Lampris immaculatus could only be amplified successfully using primers rh545 and rh1073. The 5′end portion (321 bp) of the rhodopsin gene for this taxon was replaced by question marks. If failure of amplification of rhodopsin in L. immaculatus is not related to the presence of an intron, our
Phylogenetic trees based on rhodopsin and base compositional bias
Although the rhodopsin tree contains the highest number of well-supported clades, base compositional bias across taxa at third codon positions may be affecting the accuracy of phylogenetic inference. When base composition varies significantly among taxa, all classical methods (MP, ML, and ME) tend to group sequences of similar nucleotide composition together, regardless of evolutionary history (Lockhart et al., 1994). The LogDet transformation, designed to correct this problem (Lockhart et al.,
Conclusion
In this study, separate analysis of multiple data sets has taken precedence over the total evidence approach for the assessment of phylogenetic reliability. Several main messages emerge: (1) This approach is especially useful when phylogenetic signal in the data is relatively low due to putative radiation and when one of the data partitions may be influenced by strong misleading signal. (2) Blindly trusting the results from simultaneous analysis, even associated with high bootstrap supports, is
Acknowledgements
During this 10-year-long project, numerous people have provided fish samples. We thank Nicolas Bailly, Philippe Bouchet, François Catzeflis, Romain Causse, Pascal Deynat, Catherine Chombard, Guido Dingerkus, Marie-Henriette Dubuit, Guy Duhamel, Yves Fermon, Jin-Chywan Gwo, Michel Hignette, Jean-Claude Hureau, Sébastien Lavoué, Yves Le Gal, Chen-Hsiang Liu, François Meunier, Pierre Noël, Catherine Ozouf-Costaz, Eva Pisano, Stuart Poss, Jean-Claude Quéro, François Renaud, Peter Ritchie, Thibaud
References (182)
Branch support and tree stability
Cladistics
(1994)- et al.
Opsin phylogeny and evolution: a model for blue shifts in wavelength regulation
Mol. Phylogenet. Evol.
(1995) - et al.
Phylogenetic investigation of the Stephanoberyciformes and Beryciformes, particularly whalefishes (Euteleostei: Cetomimidae), based on partial 12S rDNA and 16S rDNA sequences
Mol. Phylogenet. Evol.
(2000) - et al.
Phylogenetic relationships among atheriniform fishes (Teleostei: Atherinomorpha)
Zool. J. Linnean Soc. London B
(1996) - et al.
The rhodopsin-encoding gene of bony fish lacks introns
Gene
(1995) Compositional bias in DNA
Curr. Opin. Genet. Dev.
(2000)- et al.
The “evolutionary signal” of homoplasy in protein-coding gene sequences and its consequences for a priori weighting in phylogeny
C. R. Acad. Sci., Sér. III
(1998) - et al.
Molecular evolution of the cottoid fish endemic to Lake Baikal deduced from nuclear DNA evidence
Mol. Phylogenet. Evol.
(1997) - et al.
Different models, different trees: the geographic origin of PTLV-I
Mol. Phylogenet. Evol.
(1999) Use of rRNA secondary structure in phylogenetic studies to identify homologous positions: an example of alignment and data presentation from the frogs
Mol. Phylogenet. Evol.
(1995)
Cladistics: what’s in a word?
Cladistics
Phylogenetic frameworks: towards a firmer foundation for the comparative approach
Biol. J. Linn. Soc.
Phylogenetic relationships of mormyrid electric fishes (Mormyridae; Teleostei) inferred from cytochrome b sequences
Mol. Phylogenet. Evol.
A 28S rRNA based phylogeny of the Gnathostomes: first steps in the analysis of conflict and congruence with morphologically based cladograms
Mol. Phylogenet. Evol.
Small subunit ribosomal RNA of Hexamita inflata and the quest for the first branch in the eukaryotic tree
Mol. Biochem. Parasitol.
A second type of rod opsin cDNA from the common carp (Cyprinus carpio)
Biochem. Biophys. Acta
The phylogenetic utility of the mitochondrial cytochrome b gene for inferring relationships among actinopterygian fishes
Molecular phylogeny and evolution of deep-sea fish genus Sternoptyx
Mol. Phylogenet. Evol.
Specific synthesis of DNA in vitro via a polymerase catalyzed chain reaction
Methods Enzymol.
Phylogenetic analysis of the South American electric fishes (order Gymnotiformes) and the evolution of their electrogenic system: a synthesis based on morphology, electrophysiology, and mitochondrial sequence data
Mol. Biol. Evol.
The origin and evolution of the Antarctic ichthyofauna
The molecular basis for the blue–green sensitivity in the rod visual pigments of the European eel
Proc. R. Soc. London B
Rhodopsin cDNA sequence from the sand goby (Pomatoschistus minutus) compared with those of other vertebrates
Proc. R. Soc. London B
Molecular evolution at subzero temperatures: mitochondrial and nuclear phylogenies of fishes from Antarctica (suborder Notothenioidei), and the evolution of antifreeze glycopeptide
Mol. Biol. Evol.
Against consensus
Syst. Zool.
The fossil record II
Secondary structure and conserved motifs of the frequently sequenced domains IV and V of the insect mitochondrial large subunit rRNA gene
Insect Mol. Biol.
Partitioning and combining data in phylogenetic analysis
Syst. Biol.
Phylogenetic relationships among eutherian orders estimated from inferred sequences of mitochondrial proteins: instability of a tree based on a single gene
J. Mol. Evol.
Logical Foundations of Probability
Bias in phylogenetic reconstruction of vertebrate rhodopsin sequences
Mol. Biol. Evol.
Pleuronectiform relationships: a cladistic reassessment
Bull. Mar. Sci.
Sampling properties of DNA sequence data in phylogenetic analysis
Mol. Biol. Evol.
Best-fit maximum-likelihood models for phylogenetic inference: empirical tests with known phylogenies
Evolution
Separate versus combined analysis of phylogenetic evidence
Ann. Rev. Ecol. Syst.
Osteology and relationships of the fishes of the antarctic family Harpagiferidae (Pisces, Notothenioidei)
Antarctic Fish Biology
Taxonomic congruence versus total evidence, and amniote phylogeny inferred from fossils, molecules, and morphology
Mol. Biol. Evol.
The logical basis of phylogenetic analysis
Cases in which parsimony or compatibility methods will be positively misleading
Syst. Zool.
Evolutionary trees from DNA sequences: a maximum likelihood approach
J. Mol. Evol.
Confidence limits on phylogenies: an approach using the bootstrap
Evolution
Contribution à l’étude anatomique et systématique de l’Ichtyo-faune cénomanienne du Portugal. Première partie: les Acanthopterygii
Com. Serv. Geol. Portugal
Recherches sur de l’Ichtyo-faune cénomanienne des Monts de Judée: Les acanthoptérygiens
Ann. Paléontol. Vertébrés
Sur la découverte dans le Crétacé de Hadjula (Liban) du plus ancien Caproidae connu
C. R. Hebdo. Séances Acad. Sci., Paris
1 Present address: 315 Manter Hall, School of Biological Sciences, University of Nebraska-Lincoln, NE 68511-0118, USA.