Measuring Fit of Sequence Data to Phylogenetic Model: Gain of Power Using Marginal Tests

Journal of Molecular Evolution

Abstract

Testing fit of data to model is fundamentally important to any science, but publications in the field of phylogenetics rarely do this. Such analyses discard fundamental aspects of science as prescribed by Karl Popper. Indeed, not without cause, Popper (Unended quest: an intellectual autobiography. Fontana, London, 1976) once argued that evolutionary biology was unscientific as its hypotheses were untestable. Here we trace developments in assessing fit from Penny et al. (Nature 297:197–200, 1982) to the present. We compare the general log-likelihood ratio statistic (the G or G² statistic) between the evolutionary tree model and the multinomial model with that of marginalized tests applied to an alignment (using placental mammal coding sequence data). It is seen that the most general test does not reject the fit of data to model (P ~ 0.5), but the marginalized tests do. Tests on pairwise frequency (F) matrices strongly (P < 0.001) reject the most general phylogenetic (GTR) models commonly in use. It is also clear (P < 0.01) that the sequences are not stationary in their nucleotide composition. Deviations from stationarity and homogeneity seem to be unevenly distributed amongst taxa, and not necessarily in the taxa expected from examining other regions of the genome. By marginalizing the 4^t patterns of the i.i.d. model to observed and expected parsimony counts, that is, from constant sites, to singletons, to parsimony informative characters of a minimum possible length, the likelihood ratio test regains power, and it too rejects the evolutionary model with P ≪ 0.001. Given such behavior over relatively recent evolutionary time, readers in general should maintain a healthy skepticism of results, as the scale of the systematic errors in published trees may really be far larger than the analytical methods (e.g., bootstrap) report.
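
As a rough illustration of the statistic named in the abstract, the sketch below computes G = 2 Σ O ln(O/E) from observed and expected site-pattern counts, and pools pattern counts into coarser classes before testing, which is the general idea behind a marginalized test. It is a minimal sketch and not the authors' ANSI C programs: the function names `g_statistic` and `marginalize`, the class labels, and all numbers are hypothetical, and no degrees of freedom or P value are computed.

```python
import math
from collections import Counter


def g_statistic(observed, expected):
    """Return G = 2 * sum(O * ln(O / E)), skipping categories with O == 0.

    `observed` are counts from the alignment (the saturated multinomial fit);
    `expected` are counts predicted by the fitted tree model.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    total = 0.0
    for o, e in zip(observed, expected):
        if o > 0:
            if e <= 0:
                raise ValueError("expected must be positive where observed > 0")
            total += o * math.log(o / e)
    return 2.0 * total


def marginalize(counts, labels):
    """Pool counts that share a label (assumed here: a parsimony class).

    Returns (sorted labels, pooled counts in the same order).
    """
    pooled = Counter()
    for count, label in zip(counts, labels):
        pooled[label] += count
    keys = sorted(pooled)
    return keys, [pooled[k] for k in keys]


if __name__ == "__main__":
    # Toy numbers only: four site patterns assigned to three classes.
    labels = ["constant", "singleton", "constant", "informative"]
    obs = [50, 12, 30, 8]
    exp = [48.0, 15.0, 31.0, 6.0]
    _, pooled_obs = marginalize(obs, labels)
    _, pooled_exp = marginalize(exp, labels)
    print("G on full patterns:  ", g_statistic(obs, exp))
    print("G on pooled classes: ", g_statistic(pooled_obs, pooled_exp))
```

Pooling reduces the number of categories, so sparse pattern counts are replaced by a few well-populated classes; whether a given pooling gains power depends on the data and on how the null distribution is obtained.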

References

  • Ababneh F, Jermiin LS, Ma C, Robinson J (2006) Matched-pairs tests of homogeneity with applications to homologous nucleotide sequences. Bioinformatics 22:1225–1231

  • Adachi J, Hasegawa M (1996) MOLPHY Version 2.3: programs for molecular phylogenetics based on maximum likelihood. Computer Science Monographs, vol 28. Institute of Statistical Mathematics, Tokyo, pp 1–150

  • Anderson TW, Darling DA (1952) Asymptotic theory of certain “goodness-of-fit” criteria based on stochastic processes. Ann Math Stat 23:193–212

  • Bulmer M (1991) Use of the method of generalised least squares in reconstructing phylogenies from sequence data. Mol Biol Evol 8:868–883

  • Felsenstein J (1978) Cases in which parsimony or compatibility methods will be positively misleading. Syst Zool 27:401–410

  • Felsenstein J (1982) Numerical methods for inferring evolutionary trees. Quart Rev Biol 57:379–404

  • Foster PG (2004) Modeling compositional heterogeneity. Syst Biol 53:485–495

  • Goldman N (1993a) Statistical tests of models of DNA substitution. J Mol Evol 36:182–198

  • Goldman N (1993b) Simple diagnostic tests of models of DNA substitution. J Mol Evol 37:650–661

  • Goodman M, Tagle DA, Fitch DH, Bailey W, Czelusniak J, Koop DF, Benson P, Slightom L (1990) Primate evolution at the DNA level and a classification of hominoids. J Mol Evol 30:260–266

  • Hendy MD, Penny D (1993) Spectral analysis of phylogenetic data. J Classif 10:5–24

  • Jermiin LS, Jayaswal V, Ababneh F, Robinson J (2008) Phylogenetic model evaluation. In: Keith J (ed) Bioinformatics—volume I: data, sequences analysis, evolution. Humana Press, Totowa, NJ, pp 331–363

  • Kriegs JO, Churakov G, Kiefmann M, Jordan U, Brosius J, Schmitz J (2006) Retroposed elements as archives for the evolutionary history of placental mammals. PLoS Biol 4:e91

  • Lin Y, Waddell PJ, Penny D (2002) Pika and vole mitochondrial genomes increase support for both rodent monophyly and Glires. Gene 294:119–129

  • McCullagh P, Nelder JA (1989) Generalised linear models, 2nd edn. Chapman and Hall, London

  • Murphy WJ, Eizirik ED, O’Brien SJ, Madsen O, Scally M, Douady CJ, Teeling E, Ryder OA, Stanhope MJ, de Jong WW, Springer MS (2001) Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science 294:2348–2351

  • Nishihara H, Hasegawa M, Okada N (2006) Pegasoferae, an unexpected mammalian clade revealed by tracking ancient retroposon insertions. Proc Natl Acad Sci USA 103:9929–9934

  • Ota R, Waddell PJ, Hasegawa M, Shimodaira H, Kishino H (2000) Appropriate likelihood ratio tests and marginal distributions for evolutionary tree models with constraints on parameters. Mol Biol Evol 17:798–803

  • Penny D, Foulds LR, Hendy MD (1982) Testing the theory of evolution by comparing phylogenetic trees constructed from five different protein sequences. Nature 297:197–200

  • Popper KR (1976) Unended quest: an intellectual autobiography. Fontana, London

  • Rambaut A, Grassly NC (1997) Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput Appl Biosci 13:235–238

  • Reeves JH (1992) Heterogeneity in the substitution process of amino acid sites of proteins coded for by mitochondrial DNA. J Mol Evol 35:17–31

  • Robinson TJ, Fu B, Ferguson-Smith MA, Yang F (2004) Cross-species chromosome painting in the golden mole and elephant-shrew: support for the mammalian clades Afrotheria and Afroinsectiphillia but not Afroinsectivora. Proc R Soc Lond B Biol Sci 271:1477–1484

  • Rzhetsky A, Nei M (1995) Tests of applicability of several models for DNA sequence data. Mol Biol Evol 12:131–151

  • Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4:406–425

  • Sokal RR, Rohlf FJ (1994) Biometry: the principles and practice of statistics in biological research, 3rd edn. W.H. Freeman and Co., New York

  • Steel MA, Székely L, Erdős PL, Waddell PJ (1993) A complete family of phylogenetic invariants for any number of taxa under Kimura’s 3ST model. NZ J Bot (Conference Issue) 31:289–296

  • Steel MA, Székely LA, Hendy MD (1994) Reconstructing trees when sequence sites evolve at variable rates. J Comp Biol 1:153–163

  • Swofford DL (2000) PAUP*: phylogenetic analysis using parsimony (*and other methods), Version 4.0b10. Sinauer Associates, Sunderland, MA

  • Tavaré S (1986) Some probabilistic and statistical problems in the analysis of DNA sequences. Lect Math Life Sci 17:57–86

  • Teeling EC, Scally M, Kao DJ, Romagnoli ML, Springer MS, Stanhope MJ (2000) Molecular evidence regarding the origin of echolocation and flight in bats. Nature 403:188–192

  • Waddell PJ (1995) Statistical methods of phylogenetic analysis, including Hadamard conjugations, LogDet transforms, and maximum likelihood. PhD Thesis, Massey University, New Zealand

  • Waddell PJ (1998) The consistency of ML plus other “predictive” methods of phylogenetic analysis and the role of BIC in evaluating trees. Research Memorandum 715, The Institute of Statistical Mathematics, Hiroo, Tokyo, Japan

  • Waddell PJ (2005) Measuring the fit of sequence data to phylogenetic model: allowing for missing data. Mol Biol Evol 22:395–401 (epub October 2004)

  • Waddell PJ, Kishino H (2000) Cluster inference methods and graphical models evaluated on NCI60 microarray gene expression data. Genome Inform 11:129–141

  • Waddell PJ, Penny D (1996) Evolutionary trees of apes and humans from DNA sequences. In: Lock AJ, Peters CR (eds) Handbook of human symbolic evolution. Clarendon Press, Oxford, pp 53–73

  • Waddell PJ, Shelly S (2003) Evaluating placental inter-ordinal phylogenies with novel sequences including RAG1, gamma-fibrinogen, ND6, and mt-tRNA, plus MCMC driven nucleotide, amino acid, and codon models. Mol Phylogenet Evol 28:197–224

  • Waddell PJ, Steel MA (1996) General time reversible distances with unequal rates across sites. Mol Phylogenet Evol 8:398–414. Technical Report 143, Department of Mathematics, University of Canterbury, New Zealand, ISSN 1172-8531

  • Waddell PJ, Steel MA (1997) General time-reversible distances with unequal rates across sites: mixing gamma and inverse Gaussian distributions with invariant sites. Mol Phylogenet Evol 8:398–414

  • Waddell PJ, Penny D, Moore T (1997) Extending Hadamard conjugations to model sequence evolution with variable rates across sites. Mol Phylogenet Evol 8:33–50

  • Waddell PJ, Cao Y, Hauf J, Hasegawa M (1999a) Using novel phylogenetic methods to evaluate mammalian mtDNA, including AA invariant sites-LogDet plus site stripping, to detect internal conflicts in the data, with special reference to the position of hedgehog, armadillo, and elephant. Syst Biol 48:31–53

  • Waddell PJ, Okada N, Hasegawa M (1999b) Towards resolving the interordinal relationships of placental mammals. Syst Biol 48:1–5

  • Waddell PJ, Kishino H, Ota R (2001) A phylogenetic foundation for comparative mammalian genomics. Genome Inform 12:141–154

  • Waddell PJ, Mine H, Patel A, Hasegawa M (2004) INTEROGATE 1.0: exploration and testing of stationarity, reversibility and clock-likeness in sequence data. Research Memorandum 929. The Institute of Statistical Mathematics, Tokyo, pp 1–22

  • Waddell PJ, Mine H, Hasegawa M (2005) INTEROGATE 1.0. Exploration and testing of stationarity, reversibility and clock-likeness in sequence data. Computer Science Monograph 31. ISM, Japan

  • Waddell PJ, Umehara S, Griche K-C, Kishino H (2006) Quantitative assessments of genome-wide indels support Atlantogenata at the root of placental mammals. RM 1022. Institute of Statistical Mathematics, Tokyo

  • Waters PD, Dobigny G, Waddell PJ, Robinson TJ (2007) Evolutionary history of LINE-1 in the major clades of placental mammals. PLoS ONE 2:e158

  • Zietkiewicz E, Richer C, Labuda D (1999) Phylogenetic affinities of tarsier in the context of primate Alu repeats. Mol Phylogenet Evol 11:77–83

Acknowledgments

This work was supported by NIH Grant 5R01LM008626 to PJW. Thanks to Mike Steel for helpful discussions.

Author information

Corresponding author

Correspondence to Peter J. Waddell.

Electronic supplementary material

Below is the link to the electronic supplementary material.

The online supplementary material contains the ANSI C-code programs written by R.O. and a guide to their use.

(ZIP 1772 kb)

Appendix 1

The constraint tree used:

(((((Sus_scrofa,Camelus_dromedarius,((Ovis_canadenensis,Okapia_johnstoni),(Lagenorhynchus_obscurus,Balaenoptera_physalus))),(Equus_grevyi,(Diceros_bicornis,Tapirus_indicus)),(Panthera_uncia,(Phoca_vitulina,Ailurus_fulgens)),(Manis_sp,Manis_pentadactyla),(Rhinolophus_creaghi,(Myotis_vellifer,Nyctimene_albiventer))),(Erinaceus_europaeus,(Uropsilus_sp,(Scalopus_aquaticus,Talpa_sp)))),((Cyncephalus_volans,(Homo_sapiens,Tarsius_syrichta)),((Mus_musculus,(Dolichotis_patagonum,Hystrix_africaeaustralis)),(Sylvilagus_sp,Ochotona_sp)))),(((Macroscelides_sp,Orycteropus_afer,(Amblysomus_hottentotus,Tenrec_sp)),(Procavia_capensis,Elephas_maximus,(Dugong_dugon,Trichechus_manatus))),(Dasypus_novemcinctus,(Cyclopes_didactylus,Choloepus_hoffmani))));

The ML tree used:

((((((Camelus_dromedarius,(Sus_scrofa,((Ovis_canadenensis,Okapia_johnstoni),(Lagenorhynchus_obscurus,Balaenoptera_physalus)))),(Panthera_uncia,(Phoca_vitulina,Ailurus_fulgens))),(((Equus_grevyi,(Diceros_bicornis,Tapirus_indicus)),(Rhinolophus_creaghi,(Myotis_vellifer,Nyctimene_albiventer))),(Manis_sp,Manis_pentadactyla))),(Erinaceus_europaeus,(Uropsilus_sp,(Scalopus_aquaticus,Talpa_sp)))),((Cyncephalus_volans,(Homo_sapiens,Tarsius_syrichta)),((Mus_musculus,(Dolichotis_patagonum,Hystrix_africaeaustralis)),(Sylvilagus_sp,Ochotona_sp)))),(((Orycteropus_afer,(Macroscelides_sp,(Amblysomus_hottentotus,Tenrec_sp))),(Elephas_maximus,(Procavia_capensis,(Dugong_dugon,Trichechus_manatus)))),(Dasypus_novemcinctus,(Cyclopes_didactylus,Choloepus_hoffmani))));
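
Because the two trees above are long parenthetical (Newick) strings, a quick way to confirm that they cover the same taxa is to compare their leaf labels. The following is a minimal sketch under the assumption that the strings carry no branch lengths or internal node labels, as is the case for the trees printed here; the helper names `leaf_labels` and `leaf_set_difference` are hypothetical.

```python
import re


def leaf_labels(newick):
    """Return the set of leaf names in a Newick string.

    Assumes no branch lengths and no internal node labels, so every token
    between parentheses, commas, and the final semicolon is a leaf name.
    """
    return set(re.findall(r"[^(),;\s]+", newick))


def leaf_set_difference(newick_a, newick_b):
    """Return (names only in tree A, names only in tree B), each sorted."""
    a, b = leaf_labels(newick_a), leaf_labels(newick_b)
    return sorted(a - b), sorted(b - a)


if __name__ == "__main__":
    # Toy trees; paste the constraint tree and ML tree strings here instead.
    constraint = "((Homo_sapiens,Tarsius_syrichta),(Mus_musculus,Ochotona_sp));"
    ml = "(Homo_sapiens,(Tarsius_syrichta,(Mus_musculus,Ochotona_sp)));"
    print(leaf_set_difference(constraint, ml))
```

Two empty lists indicate that both trees are defined on the same set of taxon labels, so any difference between them lies in topology alone.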

About this article

Cite this article

Waddell, P.J., Ota, R. & Penny, D. Measuring Fit of Sequence Data to Phylogenetic Model: Gain of Power Using Marginal Tests. J Mol Evol 69, 289–299 (2009). https://doi.org/10.1007/s00239-009-9268-8

  • DOI: https://doi.org/10.1007/s00239-009-9268-8
