Repeatability of clades as a criterion of reliability: a case study for molecular phylogeny of Acanthomorpha (Teleostei) with larger number of taxa

https://doi.org/10.1016/S1055-7903(02)00371-8

Abstract

Although much progress has been made recently in teleostean phylogeny, relationships among the main lineages of the higher teleosts (Acanthomorpha), containing more than 60% of all fish species, remain poorly defined. This study represents the most extensive taxonomic sampling effort to date to collect new molecular characters for phylogenetic analysis of acanthomorph fishes. We compiled and analyzed three independent data sets, including: (i) mitochondrial ribosomal fragments from 12S and 16S (814 bp for 97 taxa); (ii) nuclear ribosomal 28S sequences (847 bp for 74 taxa); and (iii) a nuclear protein-coding gene, rhodopsin (759 bp for 86 taxa). Detailed analyses were conducted on each data set separately, and the principle of taxonomic congruence without consensus trees was used to assess confidence in the results as follows. Repeatability of clades from separate analyses was considered the primary criterion to establish reliability, rather than bootstrap proportions from a single combined (total evidence) data matrix. The new and reliable clades emerging from this study of the acanthomorph radiation were: Gadiformes (cods) with Zeioids (dories); Beloniformes (needlefishes) with Atheriniformes (silversides); blenioids (blennies) with Gobiesocoidei (clingfishes); Channoidei (snakeheads) with Anabantoidei (climbing gouramies); Mastacembeloidei (spiny eels) with Synbranchioidei (swamp-eels), the last two pairs of taxa grouping together; Syngnathoidei (aulostomids, macroramphosids) with Dactylopteridae (flying gurnards); Scombroidei (mackerels) plus Stromatoidei plus Chiasmodontidae; Ammodytidae (sand lances) with Cheimarrhichthyidae (torrentfish); Zoarcoidei (eelpouts) with Cottoidei; Percidae (perches) with Notothenioidei (Antarctic fishes); and a clade grouping Carangidae (jacks), Echeneidae (remoras), Sphyraenidae (barracudas), Menidae (moonfish), Polynemidae (threadfins), Centropomidae (snooks), and Pleuronectiformes (flatfishes).

Introduction

With advances in the collection of molecular data, phylogenetic results obtained from molecular sources are used to a greater extent to interpret organismal diversity (Moritz and Hillis, 1996). These phylogenetic hypotheses rely increasingly on the information obtained from different genes. The benefit of sampling several independent gene genealogies to infer phylogenetic relationships among taxa is well established (e.g., Cao et al., 1994; Cumming et al., 1995; Russo et al., 1996; Zardoya and Meyer, 1996), since ultimately a better representation of the whole genome is highly desirable. However, the issue of how to analyze multiple sources of data remains unsettled (Lecointre and Deleporte, 2000; Miyamoto and Fitch, 1995). Extreme views emphasize separate analysis (Mickevich, 1978) or simultaneous analysis (e.g., Nixon and Carpenter, 1996), also called the “total evidence” approach by Kluge (1989). Although the relative merits of different analytical protocols have been discussed in a flurry of recent papers (De Queiroz et al., 1995; Huelsenbeck et al., 1996; Levasseur and Lapointe, 2001; Miyamoto and Fitch, 1995; Nixon and Carpenter, 1996; Lecointre and Deleporte, 2000), the “total evidence” approach is currently the most widely employed. In this paper, we present new molecular data for the Acanthomorpha (Teleostei) to question whether, in terms of reliability, the direct application of the “total evidence” approach is the best solution for a difficult phylogenetic problem.

One of the central questions in systematics is how phylogenetic hypotheses can be assessed for confidence. As claimed by Hennig (1966), “… the reliability of hypothesis increases with number of individual characters that can be fitted into transformation series…” Following this claim, supporters of the “total evidence” approach advocate combining all available data in a single matrix in order to globally maximize congruence of the whole set of available relevant characters, the principle of character congruence (Barrett et al., 1991; Eernisse and Kluge, 1993; Kluge, 1989). The basic assumption for this approach is that there are no significant differences in nature between partitions, thus implying that any delineation of data partitions is only a product of technical and/or historical artifacts. The total evidence approach performs well (securing increasing robustness as more characters are analyzed) when this basic assumption is met and when homoplasy (non-historical signal) is randomly distributed among the data partitions. In this case, it is expected that phylogeny will be inferred correctly, if enough data are collected, because historical signal will rise above random homoplasy (Farris, 1983). That is, stochastic errors in the data may lead to an incorrect inference when sample size is small but will disappear with infinite sample size (Swofford et al., 1996). However, molecular systematists have recognized that homoplasy tends to accumulate within genes in ways that are not completely random (Naylor and Brown, 1998). Non-random aspects of molecular homoplasy may be understood by analyzing functional constraints and can be detected without phylogenetic tools, for example by identifying mutational and/or base compositional biases within some positions or regions free to vary. These molecular processes may originate and accumulate non-random homoplasy within a gene and potentially mislead phylogenetic reconstruction.
Furthermore, these properties can be very different from one gene to another and provoke different kinds of deceptive signals. For instance, a set of unrelated taxa sharing the same strong compositional bias in a gene will be erroneously clustered in a tree based on DNA sequences of this gene (Hasegawa and Hashimoto, 1993; Leipe et al., 1993; Chang and Campbell, 2000; Gautier, 2000). It is possible that the contribution of each data matrix to the final topology may be disproportionate and have unexpected effects in simultaneous analysis. In the worst case, a topology could be completely determined by one of the matrices that contains strong hierarchic but non-historical signal when the others present weak but truly historical signal (Naylor and Adams, 2001; Chen, 2001). In such cases, the preferred strategy to obtain a reliable result would not be a simple total evidence analysis but a careful dissection of noise and signal among the different data partitions. Clearly, reliability of the inference will not necessarily increase with increasing number of characters by just combining heterogeneous sources of data. Warnings against simultaneous analysis have been addressed repeatedly in the recent literature, for instance under the notion of “process partitions” by Bull et al. (1993), who emphasized that verification of congruence or homogeneity between data sets is necessary and critical before combining data and performing simultaneous analysis. Finally, if homoplasy accumulates in a non-random manner within genes while in a heterogeneous manner between genes, data partitions have some degree of naturalness, so acceptance a priori of the null hypothesis of the total evidence approach is not a reasonable practice.
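The remark that compositional bias can be detected without phylogenetic tools can be illustrated with a short sketch. The Python below is only an illustration under stated assumptions, not the procedure used in this study: the function names are hypothetical, sequences are assumed to be aligned and in frame, and the statistic is a plain chi-square test of homogeneity that treats sites as independent and ignores shared ancestry.

```python
from collections import Counter

BASES = "ACGT"


def third_position_composition(seq):
    """Count A, C, G, T at third codon positions of an in-frame sequence.

    Gaps and ambiguity codes are ignored because only A, C, G, T are tallied.
    """
    counts = Counter(seq[2::3].upper())
    return [counts[base] for base in BASES]


def composition_chi_square(alignment):
    """Chi-square statistic for homogeneity of base composition across taxa.

    alignment: dict mapping taxon name -> in-frame coding sequence
    (hypothetical input format chosen for this sketch).
    Returns (statistic, degrees_of_freedom).
    """
    table = {taxon: third_position_composition(seq)
             for taxon, seq in alignment.items()}
    col_totals = [sum(row[i] for row in table.values()) for i in range(4)]
    grand_total = sum(col_totals)
    stat = 0.0
    for row in table.values():
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / grand_total
            if expected > 0:  # skip bases absent from every taxon
                stat += (observed - expected) ** 2 / expected
    observed_bases = sum(1 for total in col_totals if total > 0)
    dof = (len(table) - 1) * (observed_bases - 1)
    return stat, dof
```

A large statistic relative to its degrees of freedom would flag heterogeneous composition among taxa, which is the situation in which unrelated taxa sharing a bias risk being clustered together.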

The most common way in systematic studies to assess “reliability” of phylogenetic inferences is the use of indicators of robustness, such as the Bremer (or decay) index (Bremer, 1994) and bootstrap proportions (Felsenstein, 1985). Robustness is a numerical value attached to internal branches in trees, calculated from a given (single) data set to measure the strength of support for those branches and corresponding groups. One must keep in mind that these indicators merely assess the strength of the signal used to order the data hierarchically (Swofford et al., 1996). That signal can originate either from common ancestry or from non-phylogenetic sources such as strong convergent selective constraints. Therefore, the numerical value of a robustness indicator does not measure the reliability of a phylogenetic inference. Robustness would be considered as reliability only if (1) assumptions about independence of characters and homogeneous distribution of homoplasy were not violated (Kluge and Wolf, 1993; Sanderson, 1989); and (2) all the available knowledge at the time has been taken into account (Carnap, 1950; Lecointre and Deleporte, 2000). However, as stated above, the ideal data set may not be easy to collect and this may be particularly true for molecular data. According to simulation studies, support indices could over- or under-estimate the real expected reliability (Hillis and Bull, 1993). These indicators could be totally misleading due to classical pitfalls of phylogenetic reconstruction provoked by unequal rates of change among lineages or base compositional bias and/or by long branch attraction (Felsenstein, 1978; Huelsenbeck, 1997; Philippe and Adoutte, 1998; Philippe and Douzery, 1994; Philippe and Laurent, 1998). One must wonder whether a high bootstrap proportion should be given higher confidence than a lower one.
It is often impossible to know from a single tree (such as a tree inferred from simultaneous analysis) whether the grouping patterns are due to artifacts of phylogenetic reconstruction or due to common ancestry, whatever the statistical robustness associated. However, separate analysis provides other opportunities for assessing reliability.
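To make concrete what a bootstrap proportion measures, the following Python sketch resamples alignment columns with replacement and records how often one grouping is recovered. It is a toy illustration under strong assumptions, not the analysis performed here: it handles only four taxa (A, B, C, D) without gaps, where unweighted parsimony reduces to counting informative sites for each of the three possible unrooted topologies, and all function names are hypothetical.

```python
import random


def split_support(columns):
    """Count parsimony-informative sites supporting each of the three
    possible unrooted topologies for four taxa A, B, C, D."""
    support = {"AB|CD": 0, "AC|BD": 0, "AD|BC": 0}
    for a, b, c, d in columns:
        if a == b and c == d and a != c:
            support["AB|CD"] += 1
        elif a == c and b == d and a != b:
            support["AC|BD"] += 1
        elif a == d and b == c and a != b:
            support["AD|BC"] += 1
    return support


def best_topology(columns):
    """Return the most parsimonious topology, or None when the top two tie."""
    ranked = sorted(split_support(columns).items(),
                    key=lambda item: item[1], reverse=True)
    if ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def bootstrap_proportion(seqs, split, replicates=1000, seed=0):
    """Fraction of bootstrap replicates in which `split` is the best topology.

    seqs: four aligned sequences of equal length, in the order A, B, C, D.
    """
    rng = random.Random(seed)
    columns = list(zip(*seqs))
    hits = 0
    for _ in range(replicates):
        # Resample sites with replacement, keeping the alignment length.
        resampled = [rng.choice(columns) for _ in columns]
        if best_topology(resampled) == split:
            hits += 1
    return hits / replicates
```

The proportion depends only on how many sites favor each grouping. Sites that agree because of convergence or shared compositional bias raise it exactly as sites reflecting common ancestry do, which is why a high value from a single data set cannot by itself establish reliability.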

Reliability is the quality of being trustworthy given to a statement at a given time. It is never associated with a numerical statistical value drawn from a single data set isolated from other remaining evidence (Carnap, 1950). In other words, when analyzing several data sets separately (which is what the world-wide scientific community does every day), a given bootstrap proportion obtained for a clade from a single data set cannot be a measure of reliability. In science, reliability depends on the repeatability of results through different investigations (Grande, 1994). It is not surprising that experienced molecular systematists converge on a “taxonomic congruence” approach, proposing to analyze data sets separately (Grande, 1994; Mickevich, 1978; Miyamoto and Fitch, 1995; Nelson, 1979), at least as a heuristic step. The congruence of inferences separately drawn from independent data is considered a strong indicator of reliability. If we keep in mind the fact that molecular homoplasy may have different effects on tree reconstruction from one gene to another, obtaining the same clade from separate analyses of several genes despite this fact renders the clade even more reliable. In other words, obtaining the same tree or even some common clades means that there is a common structure in these data sets that must come from common evolutionary history. Miyamoto and Fitch (1995) suggested that relationships among taxa that are supported by different independent data sets are particularly robust even if the statistical support for each individual result is weak. This is equivalent to obtaining independent verification of an experimental hypothesis from an additional experimental source. This independent type of verification may be lost in combining data sets right from the beginning. Empirically, this point of view implies that two independent genes are not likely to harbor the same positively misleading signals.
Even if it is always possible to imagine that two or three genes can exhibit the same positively misleading signals (for instance the same long branches due to a common taxonomic sampling issue), the risk here is by far lower than blindly trusting the bootstrap proportions from the direct simultaneous analysis. The same reasoning can be used to reply to the objection made to separate analyses, that different genes may contribute to resolve different parts of a phylogeny. Finding the same clade repeated despite the possibility that different genes may resolve different parts of the phylogeny is using repeatability in a conservative manner, securing reliability.

Thus, the main advantage of separate analysis (without consensus trees) is that it provides a measure of repeatability and, beyond what a simple majority-rule consensus tree offers, an additional opportunity to detect tree reconstruction artifacts due to local positively misleading signals. We would be inclined to prefer a clade that is inferred repeatedly from several data sets with low bootstrap proportions over a highly supported clade inferred from a single data set. We will therefore not use consensus trees; instead, we will use repeatability through separate analyses to assess reliability of the clades found in the tree from the simultaneous analysis (Fig. 1). In other words, we use the simultaneous analysis to obtain the complete tree, and separate analyses to determine which clades of that tree are reliable.
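The logic of scoring clades of the simultaneous-analysis tree by their recurrence in separate analyses can be sketched as follows. This Python is a hedged illustration, not the authors' implementation: representing each tree as a set of clades (each clade a set of taxon names), the function names, and the rule for handling data sets with different taxon sampling (restricting the clade to the taxa present in that data set) are all assumptions made for the example.

```python
def restrict(clade, taxa):
    """Restrict a clade to the taxa sampled in one data set."""
    return frozenset(clade) & frozenset(taxa)


def repeatability(combined_clades, separate_trees):
    """For each clade of the combined tree, list the data sets whose
    separate tree also recovers it.

    combined_clades: iterable of clades, each an iterable of taxon names.
    separate_trees: dict mapping data-set name -> (taxa, clades), where
        taxa are the taxa sampled for that data set and clades are the
        clades found in its separate analysis (hypothetical format).
    """
    result = {}
    for clade in combined_clades:
        clade = frozenset(clade)
        found_in = []
        for name, (taxa, clades) in separate_trees.items():
            pruned = restrict(clade, taxa)
            # A clade reduced to fewer than two taxa, or to every sampled
            # taxon, cannot be tested by this data set.
            if len(pruned) < 2 or len(pruned) == len(set(taxa)):
                continue
            if pruned in {frozenset(c) for c in clades}:
                found_in.append(name)
        result[clade] = found_in
    return result
```

Under this sketch, a clade recovered by two or three independent data sets would be retained as reliable even if its bootstrap proportion in each is low, whereas a clade found in only one data set would be treated with caution regardless of its support there.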

The spiny teleost fishes grouped within Acanthomorpha (Rosen, 1973) comprise more than 14,736 species (Helfman et al., 1997; Nelson, 1994) and represent one third of the extant vertebrate species of the world. This clade is divided into three large assemblages: the Paracanthopterygii (cods, goosefishes), the Atherinomorpha (silversides), and the most species-rich group, the Percomorpha (perch-like fishes). The earliest acanthomorph fossils known are aipichthyids and polymixiids from the Cenomanian, Upper Cretaceous (Gaudant, 1978; Gayet, 1980a; Otero and Gayet, 1996; Patterson, 1964). Shortly after this period, a vast diversity of acanthomorphs (representing 80 families) suddenly appears in the fossil record, starting in the Early Eocene between 45–55 million years ago (Benton, 1993; Patterson, 1993). This pattern suggests a putative rapid radiation, which resulted in the most diverse vertebrate group of the modern fauna.

Since the pioneering work on systematics of fishes by Greenwood et al. (1966), many studies have been published proposing hypotheses of relationships for lower teleosts, but relatively few for the higher teleosts, especially for the Acanthomorpha (Lecointre, 1994; Rosen, 1982, 1985). Consequently, Nelson (1989) concluded his survey of teleostean phylogeny with the following statement: “recent work has resolved the bush at the bottom, but the bush at the top persists,” a bush already clearly illustrated by Rosen (1982). Our knowledge of high-level acanthomorph phylogeny is very poor considering their sizeable species diversity, especially within the major clade Percomorpha. The vast majority of studies of higher teleosts have focused on relationships at the specific and generic levels or between closely related families. So far, the only three cladograms based on morphological characters depicting interrelationships among acanthomorphs are those of Johnson and Patterson (1993), Lauder and Liem (1983), and Stiassny and Moore (1992). In spite of showing resolution for the basal parts of the tree, and in spite of proposing new hypotheses (e.g., Smegmamorpha in Johnson and Patterson, 1993), substantial disagreement persists, especially on the phylogenetic positions of the Zeiformes (dories), Beryciformes (squirrel fishes), and Synbranchiformes (swamp or spiny eels). Clearly, the phylogenetic relationships reflecting the main acanthomorph radiation are still unclear. Molecular data are only slowly starting to produce results, such as the two recent studies published during the preparation of this paper (Miya et al., 2001; Wiley et al., 2000). These phylogenetic trees are based on a combined matrix (1722 characters) of 12S mitochondrial DNA, 28S nuclear DNA, and morphological data (Johnson and Patterson, 1993) and on selected nucleotide sequences (7002 characters) from whole mitochondrial genomes (Miya et al., 2001).
If merely increasing the number of characters for analysis and performing a “total evidence” approach could give reliable results, these two studies should provide better insight into acanthomorph phylogeny. However, interrelationships between acanthomorph orders or suborders representing major lineages remain poorly resolved in terms of statistical support, with few exceptions. Moreover, as discussed above, robustness does not necessarily mean reliability. Without comparing trees from independent data sets, it is not possible to assess reliability of newly proposed acanthomorph clades. Following this view, the acanthomorph problem still needs to be examined, especially by way of separate analyses.

Adequate taxonomic sampling to fairly represent the highly complex patterns of diversification of Acanthomorpha is a crucial issue requiring careful consideration. In general, major lineages within Acanthomorpha are poorly defined, especially for Percomorpha. In such situations, taxonomic sampling must be extended to neighboring lineages, until the sample is sufficiently inclusive to contain the clade of interest. This is the case for the Percomorpha (Johnson, 1993; Johnson and Patterson, 1993; Rosen, 1973, 1985; Stiassny, 1986) and explains why one must sample the whole acanthomorph diversity when just trying to investigate percomorph phylogeny. A related problem is that some traditionally recognized percomorph subdivisions have been shown to be polyphyletic (e.g., Perciformes, Trachinoidei, Percoidei, Scorpaeniformes; Gill, 1996; Johnson, 1993; Patterson and Rosen, 1989; Stiassny, 1990; Stiassny and Moore, 1992; Travers, 1981) and may not even belong to this group. Since monophyly of such groups is questionable, using reduced sampling from predefined groups is risky. When sampling taxa from paraphyletic or polyphyletic groups, phylogenetic conclusions will depend on the choice of representatives. To address the phylogenetic hypothesis correctly, sampling a large variety of terminals within each of the putatively polyphyletic subdivisions is required. This drastically increases the necessary taxonomic sampling. However, all previous studies sampled very few acanthomorph terminals. One of the best sampling efforts includes merely 32 acanthomorph taxa (Miya et al., 2001), with only a single representative from the large group Perciformes, which is clearly a polyphyletic group (Johnson and Patterson, 1993)!

For this study we sampled acanthomorph diversity thoroughly, including representation of 48 suborders and more than 60 families. We present and analyze new data from four genes that differ in cellular location, function, and sequence variation. These include two nuclear genes: portions of the 28S ribosomal DNA (domains C1–C2, D3, D6, C12, and D12) and the gene encoding rhodopsin; and two mitochondrial ribosomal genes: 12S and 16S (Table 1). Using both separate analysis and simultaneous analysis, this study aims to discover reliable clades among the main lineages within the acanthomorph radiation, with particular attention to the phylogenetic relationships of the order Zeiformes and the interrelationships of members of the Smegmamorpha (a new clade defined by Johnson and Patterson, 1993) and of “Perciformes.” We present a detailed analysis that uses repeatability as the main criterion to postulate the validity of some previously unrecognized clades.

Taxon sampling and DNA extraction

Taxa were selected to represent a large proportion of acanthomorph diversity, including representatives of 40 suborders and more than 60 families, plus outgroup taxa from seven different orders (Table 1). The sampling backbone followed the cladogram proposed by Johnson and Patterson (1993), one of the morphological hypotheses we intended to test. All terminal clades are represented except Stephanoberyciformes and Elassomatidae. For the questionable “Perciformes” clade, 41 species were chosen to

Characterization of nucleotide substitution patterns

Sequences were successfully obtained using the primers listed in Table 2 for all species except Pogonoperca punctata and Fistularia petimba (for the 28S domains D3, D6, and D12). The rhodopsin sequence of Lampris immaculatus could only be amplified successfully using primers rh545 and rh1073. The 5′-end portion (321 bp) of the rhodopsin gene for this taxon was replaced by question marks. If failure of amplification of rhodopsin in L. immaculatus is not related to the presence of an intron, our

Phylogenetic trees based on rhodopsin and base compositional bias

Although the rhodopsin tree contains the highest number of well-supported clades, base compositional bias across taxa at third codon positions may be affecting the accuracy of phylogenetic inference. When base composition varies significantly among taxa, all classical methods (MP, ML, and ME) tend to group sequences of similar nucleotide composition together, regardless of evolutionary history (Lockhart et al., 1994). The LogDet transformation, designed to correct this problem (Lockhart et al.,

Conclusion

In this study, separate analysis of multiple data sets has taken precedence over the total evidence approach for the assessment of phylogenetic reliability. Several main messages emerge: (1) This approach is especially useful when phylogenetic signal in the data is relatively low due to putative radiation and when one of the data partitions may be influenced by strong misleading signal. (2) Blindly trusting the results from simultaneous analysis, even associated with high bootstrap supports, is

Acknowledgements

During this 10-year-long project, numerous people have provided fish samples. We thank Nicolas Bailly, Philippe Bouchet, François Catzeflis, Romain Causse, Pascal Deynat, Catherine Chombard, Guido Dingerkus, Marie-Henriette Dubuit, Guy Duhamel, Yves Fermon, Jin-Chywan Gwo, Michel Hignette, Jean-Claude Hureau, Sébastien Lavoué, Yves Le Gal, Chen-Hsiang Liu, François Meunier, Pierre Noël, Catherine Ozouf-Costaz, Eva Pisano, Stuart Poss, Jean-Claude Quéro, François Renaud, Peter Ritchie, Thibaud

References (182)

  • A.G. Kluge et al.

    Cladistics: what’s in a word?

    Cladistics

    (1993)
  • S.M. Lanyon

    Phylogenetic frameworks: towards a firmer foundation for the comparative approach

    Biol. J. Linn. Soc.

    (1993)
  • S. Lavoué et al.

    Phylogenetic relationships of mormyrid electric fishes (Mormyridae; Teleostei) inferred from cytochrome b sequences

    Mol. Phylogenet. Evol.

    (2000)
  • H.L.V. Lê et al.

    A 28S rRNA based phylogeny of the Gnathostomes: first steps in the analysis of conflict and congruence with morphologically based cladograms

    Mol. Phylogenet. Evol.

    (1993)
  • D.D. Leipe et al.

    Small subunit ribosomal RNA of Hexamita inflata and the quest for the first branch in the eukaryotic tree

    Mol. Biochem. Parasitol.

    (1993)
  • J. Lim et al.

    A second type of rod opsin cDNA from the common carp (Cyprinus carpio)

    Biochim. Biophys. Acta

    (1997)
  • C. Lydeard et al.

    The phylogenetic utility of the mitochondrial cytochrome b gene for inferring relationships among actinopterygian fishes

  • M. Miya et al.

    Molecular phylogeny and evolution of the deep-sea fish genus Sternoptyx

    Mol. Phylogenet. Evol.

    (1998)
  • K.B. Mullis et al.

    Specific synthesis of DNA in vitro via a polymerase catalyzed chain reaction

    Methods Enzymol.

    (1987)
  • J.A. Alves-Gomes et al.

    Phylogenetic analysis of the South American electric fishes (order Gymnotiformes) and the evolution of their electrogenic system: a synthesis based on morphology, electrophysiology, and mitochondrial sequence data

    Mol. Biol. Evol.

    (1995)
  • Anderson, M.E., 1984. On the anatomy and phylogeny of the Zoarcidae (Teleostei: Perciformes). Ph.D. Dissertation,...
  • M.E. Anderson

    The origin and evolution of the Antarctic ichthyofauna

  • Archer, S.N., Hirano, J., unpublished. Comparative analysis of opsins in Mediterranean coastal...
  • S.H. Archer et al.

    The molecular basis for the blue–green sensitivity in the rod visual pigments of the European eel

    Proc. R. Soc. London B

    (1995)
  • S.N. Archer et al.

    Rhodopsin cDNA sequence from the sand goby (Pomatoschistus minutus) compared with those of other vertebrates

    Proc. R. Soc. London B

    (1992)
  • L. Bargelloni et al.

    Molecular evolution at subzero temperatures: mitochondrial and nuclear phylogenies of fishes from Antarctica (suborder Notothenioidei), and the evolution of antifreeze glycopeptide

    Mol. Biol. Evol.

    (1994)
  • M. Barrett et al.

    Against consensus

    Syst. Zool.

    (1991)
  • M.J. Benton

    The fossil record II

    (1993)
  • T.R. Buckley et al.

    Secondary structure and conserved motifs of the frequently sequenced domains IV and V of the insect mitochondrial large subunit rRNA gene

    Insect Mol. Biol.

    (2000)
  • J.J. Bull et al.

    Partitioning and combining data in phylogenetic analysis

    Syst. Biol.

    (1993)
  • Y. Cao et al.

    Phylogenetic relationships among eutherian orders estimated from inferred sequences of mitochondrial proteins: instability of a tree based on a single gene

    J. Mol. Evol.

    (1994)
  • R. Carnap

    Logical Foundations of Probability

    (1950)
  • B.S.W. Chang et al.

    Bias in phylogenetic reconstruction of vertebrate rhodopsin sequences

    Mol. Biol. Evol.

    (2000)
  • F. Chapleau

    Pleuronectiform relationships: a cladistic reassessment

    Bull. Mar. Sci.

    (1993)
  • Chen, W.-J., 2001. La répétitivité des clades comme critère de fiabilité: application à la phylogénie de Acanthomorpha...
  • M.P. Cumming et al.

    Sampling properties of DNA sequence data in phylogenetic analysis

    Mol. Biol. Evol.

    (1995)
  • C.W. Cunningham et al.

    Best-fit maximum-likelihood models for phylogenetic inference: empirical tests with known phylogenies

    Evolution

    (1998)
  • Cuvier, G., Valenciennes, 1828–49. Histoire naturelle des poissons,...
  • A. De Queiroz et al.

    Separate versus combined analysis of phylogenetic evidence

    Ann. Rev. Ecol. Syst.

    (1995)
  • R.R. Eakin

    Osteology and relationships of the fishes of the antarctic family Harpagiferidae (Pisces, Notothenioidei)

  • J.T. Eastman

    Antarctic Fish Biology

    (1993)
  • D.J. Eernisse et al.

    Taxonomic congruence versus total evidence, and amniote phylogeny inferred from fossils, molecules, and morphology

    Mol. Biol. Evol.

    (1993)
  • J.S. Farris

    The logical basis of phylogenetic analysis

  • J. Felsenstein

    Cases in which parsimony or compatibility methods will be positively misleading

    Syst. Zool.

    (1978)
  • J. Felsenstein

    Evolutionary trees from DNA sequences: a maximum likelihood approach

    J. Mol. Evol.

    (1981)
  • J. Felsenstein

    Confidence limits on phylogenies: an approach using the bootstrap

    Evolution

    (1985)
  • M. Gaudant

    Contribution à l’étude anatomique et systématique de l’Ichtyo-faune cénomanienne du Portugal. Première partie: les Acanthopterygii

    Com. Serv. Geol. Portugal

    (1978)
  • M. Gayet

    Recherches sur de l’Ichtyo-faune cénomanienne des Monts de Judée: Les acanthoptérygiens

    Ann. Paléontol. Vertébrés

    (1980)
  • M. Gayet

    Sur la découverte dans le Crétacé de Hadjula (Liban) du plus ancien Caproidae connu

    C. R. Hebdo. Séances Acad. Sci., Paris

    (1980)
  • Gayet, M., 1980c. Découverte dans le Crétacé de Hadjula (Liban) du plus ancien Caproidae connu. étude anatomique et...
    1

    Present address: 315 Manter Hall, School of Biological Sciences, University of Nebraska-Lincoln, NE 68511-0118, USA.