Abstract
Testing the fit of data to model is fundamentally important to any science, but publications in phylogenetics rarely do so. Such analyses discard fundamental aspects of science as prescribed by Karl Popper. Indeed, not without cause, Popper (Unended quest: an intellectual autobiography. Fontana, London, 1976) once argued that evolutionary biology was unscientific because its hypotheses were untestable. Here we trace developments in assessing fit from Penny et al. (Nature 297:197–200, 1982) to the present. We compare the general log-likelihood ratio statistic (the G or G² statistic) between the evolutionary tree model and the multinomial model with that of marginalized tests applied to an alignment (using placental mammal coding sequence data). The most general test does not reject the fit of data to model (P ~ 0.5), but the marginalized tests do. Tests on pairwise frequency (F) matrices strongly (P < 0.001) reject the most general phylogenetic (GTR) models commonly in use. It is also clear (P < 0.01) that the sequences are not stationary in their nucleotide composition. Deviations from stationarity and homogeneity seem to be unevenly distributed amongst taxa, and not necessarily in those taxa expected from examining other regions of the genome. When the 4^t patterns of the i.i.d. model are marginalized to observed and expected parsimony counts (that is, from constant sites, to singletons, to parsimony informative characters of a minimum possible length), the likelihood ratio test regains power, and it too rejects the evolutionary model with P ≪ 0.001. Given such behavior over relatively recent evolutionary time, readers should maintain a healthy skepticism of results, as the scale of the systematic errors in published trees may be far larger than the analytical methods (e.g., bootstrap) report.
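To make the comparison of the general and marginalized tests concrete, the following is a minimal sketch of the G statistic, G = 2 Σ O_i ln(O_i / E_i), computed first on individual site patterns and then after pooling patterns into coarser bins. The pattern counts are invented toy numbers for three taxa, not values from the paper. The binning rule (number of distinct states minus one, a lower bound on parsimony length) is a simplified stand-in for the parsimony-count marginalization described above. The names `g_statistic`, `pool`, and `min_parsimony_length` are hypothetical.

```python
import math

def g_statistic(observed, expected):
    """Return G = 2 * sum(O * ln(O / E)); cells with O == 0 contribute 0."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    total = 0.0
    for o, e in zip(observed, expected):
        if o > 0:
            if e <= 0:
                raise ValueError("expected count must be positive where observed > 0")
            total += o * math.log(o / e)
    return 2.0 * total

def min_parsimony_length(pattern):
    """Lower bound on the parsimony length of a site: distinct states minus one."""
    return len(set(pattern)) - 1

def pool(observed, expected, bin_of):
    """Marginalize pattern counts by summing all patterns that share a bin."""
    pooled_obs, pooled_exp = {}, {}
    for pattern in observed:
        b = bin_of(pattern)
        pooled_obs[b] = pooled_obs.get(b, 0) + observed[pattern]
        pooled_exp[b] = pooled_exp.get(b, 0.0) + expected[pattern]
    return pooled_obs, pooled_exp

# Toy site-pattern counts for three taxa (assumed values, for illustration only).
observed = {"AAA": 50, "CCC": 30, "AAC": 8, "ACA": 6, "CAA": 4, "ACG": 2}
expected = {"AAA": 48.0, "CCC": 33.0, "AAC": 5.0, "ACA": 6.0, "CAA": 6.0, "ACG": 2.0}

patterns = sorted(observed)
g_full = g_statistic([observed[p] for p in patterns],
                     [expected[p] for p in patterns])

pooled_obs, pooled_exp = pool(observed, expected, min_parsimony_length)
bins = sorted(pooled_obs)
g_pooled = g_statistic([pooled_obs[b] for b in bins],
                       [pooled_exp[b] for b in bins])

print("G on full patterns:", g_full, "with", len(patterns), "cells")
print("G on pooled bins:  ", g_pooled, "with", len(bins), "cells")
```

Pooling can never increase G, but it cuts the number of cells, and hence the degrees of freedom of the reference distribution, far more sharply. With 4^t possible patterns most cells are empty or nearly so, which is one reason a test on a well-chosen marginalization can have more power than the fully general test.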
References
Ababneh F, Jermiin LS, Ma C, Robinson J (2006) Matched-pairs tests of homogeneity with applications to homologous nucleotide sequences. Bioinformatics 22:1225–1231
Adachi J, Hasegawa M (1996) MOLPHY Version 2.3: programs for molecular phylogenetics based on maximum likelihood. Computer Science Monographs, vol 28. Institute of Statistical Mathematics, Tokyo, pp 1–150
Anderson TW, Darling DA (1952) Asymptotic theory of certain “goodness-of-fit” criteria based on stochastic processes. Ann Math Stat 23:193–212
Bulmer M (1991) Use of the method of generalised least squares in reconstructing phylogenies from sequence data. Mol Biol Evol 8:868–883
Felsenstein J (1978) Cases in which parsimony or compatibility methods will be positively misleading. Syst Zool 27:401–410
Felsenstein J (1982) Numerical methods for inferring evolutionary trees. Quart Rev Biol 57:379–404
Foster PG (2004) Modeling compositional heterogeneity. Syst Biol 53:485–495
Goldman N (1993a) Statistical tests of models of DNA substitution. J Mol Evol 36:182–198
Goldman N (1993b) Simple diagnostic tests of models of DNA substitution. J Mol Evol 37:650–661
Goodman M, Tagle DA, Fitch DH, Bailey W, Czelusniak J, Koop BF, Benson P, Slightom JL (1990) Primate evolution at the DNA level and a classification of hominoids. J Mol Evol 30:260–266
Hendy MD, Penny D (1993) Spectral analysis of phylogenetic data. J Classif 10:5–24
Jermiin LS, Jayaswal V, Ababneh F, Robinson J (2008) Phylogenetic model evaluation. In: Keith J (ed) Bioinformatics—volume I: data, sequences analysis, evolution. Humana Press, Totowa, NJ, pp 331–363
Kriegs JO, Churakov G, Kiefmann M, Jordan U, Brosius J, Schmitz J (2006) Retroposed elements as archives for the evolutionary history of placental mammals. PLoS Biol 4:e91
Lin Y, Waddell PJ, Penny D (2002) Pika and vole mitochondrial genomes increase support for both rodent monophyly and Glires. Gene 294:119–129
McCullagh P, Nelder JA (1989) Generalised linear models, 2nd edn. Chapman and Hall, London
Murphy WJ, Eizirik ED, O’Brien SJ, Madsen O, Scally M, Douady CJ, Teeling E, Ryder OA, Stanhope MJ, de Jong WW, Springer MS (2001) Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science 294:2348–2351
Nishihara H, Hasegawa M, Okada N (2006) Pegasoferae, an unexpected mammalian clade revealed by tracking ancient retroposon insertions. Proc Natl Acad Sci USA 103:9929–9934
Ota R, Waddell PJ, Hasegawa M, Shimodaira H, Kishino H (2000) Appropriate likelihood ratio tests and marginal distributions for evolutionary tree models with constraints on parameters. Mol Biol Evol 17:798–803
Penny D, Foulds LR, Hendy MD (1982) Testing the theory of evolution by comparing phylogenetic trees constructed from five different protein sequences. Nature 297:197–200
Popper KR (1976) Unended quest: an intellectual autobiography. Fontana, London
Rambaut A, Grassly NC (1997) Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput Appl Biosci 13:235–238
Reeves JH (1992) Heterogeneity in the substitution process of amino acid sites of proteins coded for by mitochondrial DNA. J Mol Evol 35:17–31
Robinson TJ, Fu B, Ferguson-Smith MA, Yang F (2004) Cross-species chromosome painting in the golden mole and elephant-shrew: support for the mammalian clades Afrotheria and Afroinsectiphillia but not Afroinsectivora. Proc R Soc Lond B Biol Sci 271:1477–1484
Rzhetsky A, Nei M (1995) Tests of applicability of several models for DNA sequence data. Mol Biol Evol 12:131–151
Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4:406–425
Sokal RR, Rohlf FJ (1994) Biometry: the principles and practice of statistics in biological research, 3rd edn. W.H. Freeman and Co., New York
Steel MA, Székely L, Erdös PL, Waddell PJ (1993) A complete family of phylogenetic invariants for any number of taxa under Kimura’s 3ST model. NZ J Bot (Conference Issue) 31:289–296
Steel MA, Székely LA, Hendy MD (1994) Reconstructing trees when sequence sites evolve at variable rates. J Comp Biol 1:153–163
Swofford DL (2000) PAUP*: phylogenetic analysis using parsimony (*and other methods), Version 4.0b10. Sinauer Associates, Sunderland, MA
Tavaré S (1986) Some probabilistic and statistical problems in the analysis of DNA sequences. Lect Math Life Sci 17:57–86
Teeling EC, Scally M, Kao DJ, Romagnoli ML, Springer MS, Stanhope MJ (2000) Molecular evidence regarding the origin of echolocation and flight in bats. Nature 403:188–192
Waddell PJ (1995) Statistical methods of phylogenetic analysis, including Hadamard conjugations, LogDet transforms, and maximum likelihood. PhD Thesis, Massey University, New Zealand
Waddell PJ (1998) The consistency of ML plus other “predictive” methods of phylogenetic analysis and the role of BIC in evaluating trees. Research Memorandum 715, The Institute of Statistical Mathematics, Hiroo, Tokyo, Japan
Waddell PJ (2005) Measuring the fit of sequence data to phylogenetic model: allowing for missing data. Mol Biol Evol 22:395–401 (epub October 2004)
Waddell PJ, Kishino H (2000) Cluster inference methods and graphical models evaluated on NCI60 microarray gene expression data. Genome Inform 11:129–141
Waddell PJ, Penny D (1996) Evolutionary trees of apes and humans from DNA sequences. In: Lock AJ, Peters CR (eds) Handbook of symbolic evolution. Clarendon Press, Oxford, pp 53–73
Waddell PJ, Shelly S (2003) Evaluating placental inter-ordinal phylogenies with novel sequences including RAG1, gamma-fibrinogen, ND6, and mt-tRNA, plus MCMC driven nucleotide, amino acid, and codon models. Mol Phylogen Evol 28:197–224
Waddell PJ, Steel MA (1996) General time reversible distances with unequal rates across sites. Mol Phylogenet Evol 8:398–414. Technical Report 143, Department of Mathematics, University of Canterbury, New Zealand, ISSN 1172-8531
Waddell PJ, Steel MA (1997) General time-reversible distances with unequal rates across sites: mixing gamma and inverse Gaussian distributions with invariant sites. Mol Phylogenet Evol 8:398–414
Waddell PJ, Penny D, Moore T (1997) Extending Hadamard conjugations to model sequence evolution with variable rates across sites. Mol Phylogen Evol 8:33–50
Waddell PJ, Cao Y, Hauf J, Hasegawa M (1999a) Using novel phylogenetic methods to evaluate mammalian mtDNA, including AA invariant sites-LogDet plus site stripping, to detect internal conflicts in the data, with special reference to the position of hedgehog, armadillo, and elephant. Syst Biol 48:31–53
Waddell PJ, Okada N, Hasegawa M (1999b) Towards resolving the interordinal relationships of placental mammals. Syst Biol 48:1–5
Waddell PJ, Kishino H, Ota R (2001) A phylogenetic foundation for comparative mammalian genomics. Genome Inform 12:141–154
Waddell PJ, Mine H, Patel A, Hasegawa M (2004) INTEROGATE 1.0: exploration and testing of stationarity, reversibility and clock-likeness in sequence data. Research Memorandum 929. The Institute of Statistical Mathematics, Tokyo, pp 1–22
Waddell PJ, Mine H, Hasegawa M (2005) INTEROGATE 1.0. Exploration and testing of stationarity, reversibility and clock-likeness in sequence data. Computer Science Monograph 31. ISM, Japan
Waddell PJ, Umehara S, Griche K-C, Kishino H (2006) Quantitative assessments of genome-wide indels support Atlantogenata at the root of placental mammals. RM 1022. Institute of Statistical Mathematics, Tokyo
Waters PD, Dobigny G, Waddell PJ, Robinson TJ (2007) Evolutionary history of LINE-1 in the major clades of placental mammals. PLoS ONE 2:e158
Zietkiewicz E, Richer C, Labuda D (1999) Phylogenetic affinities of tarsier in the context of primate Alu repeats. Mol Phylogenet Evol 11:77–83
Acknowledgments
This work was supported by NIH Grant 5R01LM008626 to PJW. Thanks to Mike Steel for helpful discussions.
Appendix 1
The constraint tree used:
(((((Sus_scrofa,Camelus_dromedarius,((Ovis_canadenensis,Okapia_johnstoni),(Lagenorhynchus_obscurus,Balaenoptera_physalus))),(Equus_grevyi,(Diceros_bicornis,Tapirus_indicus)),(Panthera_uncia,(Phoca_vitulina,Ailurus_fulgens)),(Manis_sp,Manis_pentadactyla),(Rhinolophus_creaghi,(Myotis_vellifer,Nyctimene_albiventer))),(Erinaceus_europaeus,(Uropsilus_sp,(Scalopus_aquaticus,Talpa_sp)))),((Cyncephalus_volans,(Homo_sapiens,Tarsius_syrichta)),((Mus_musculus,(Dolichotis_patagonum,Hystrix_africaeaustralis)),(Sylvilagus_sp,Ochotona_sp)))),(((Macroscelides_sp,Orycteropus_afer,(Amblysomus_hottentotus,Tenrec_sp)),(Procavia_capensis,Elephas_maximus,(Dugong_dugon,Trichechus_manatus))),(Dasypus_novemcinctus,(Cyclopes_didactylus,Choloepus_hoffmani))));
The ML tree used:
((((((Camelus_dromedarius,(Sus_scrofa,((Ovis_canadenensis,Okapia_johnstoni),(Lagenorhynchus_obscurus,Balaenoptera_physalus)))),(Panthera_uncia,(Phoca_vitulina,Ailurus_fulgens))),(((Equus_grevyi,(Diceros_bicornis,Tapirus_indicus)),(Rhinolophus_creaghi,(Myotis_vellifer,Nyctimene_albiventer))),(Manis_sp,Manis_pentadactyla))),(Erinaceus_europaeus,(Uropsilus_sp,(Scalopus_aquaticus,Talpa_sp)))),((Cyncephalus_volans,(Homo_sapiens,Tarsius_syrichta)),((Mus_musculus,(Dolichotis_patagonum,Hystrix_africaeaustralis)),(Sylvilagus_sp,Ochotona_sp)))),(((Orycteropus_afer,(Macroscelides_sp,(Amblysomus_hottentotus,Tenrec_sp))),(Elephas_maximus,(Procavia_capensis,(Dugong_dugon,Trichechus_manatus)))),(Dasypus_novemcinctus,(Cyclopes_didactylus,Choloepus_hoffmani))));
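The two trees above differ mainly in how far polytomies are resolved. The sketch below shows one way to check this programmatically. It assumes plain Newick strings with no branch lengths or internal labels, as in this appendix. For brevity it runs on one sub-clade copied from each tree (the hyrax, elephant, and sirenian group) rather than on the full strings, and the function name `parse_clades` is hypothetical.

```python
def parse_clades(newick):
    """Parse a Newick string without branch lengths or internal labels.

    Returns (taxa, clades): the frozenset of leaf names, and the set of
    leaf-name frozensets, one per internal node (the root included).
    """
    text = newick.strip().rstrip(";")
    clades = set()
    pos = 0

    def parse_node():
        nonlocal pos
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            leaves = set()
            while True:
                leaves |= parse_node()
                if pos >= len(text):
                    raise ValueError("unbalanced parentheses")
                if text[pos] == ",":
                    pos += 1  # another child follows
                elif text[pos] == ")":
                    pos += 1  # this internal node is complete
                    break
                else:
                    raise ValueError(f"unexpected character at position {pos}")
            clades.add(frozenset(leaves))
            return leaves
        # Otherwise read a leaf label up to the next structural character.
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        if pos == start:
            raise ValueError(f"empty leaf label at position {pos}")
        return {text[start:pos]}

    taxa = parse_node()
    if pos != len(text):
        raise ValueError("trailing characters after the tree")
    return frozenset(taxa), clades

# Sub-clades copied from the constraint tree and the ML tree, respectively.
constraint_sub = "(Procavia_capensis,Elephas_maximus,(Dugong_dugon,Trichechus_manatus));"
ml_sub = "(Elephas_maximus,(Procavia_capensis,(Dugong_dugon,Trichechus_manatus)));"

taxa_c, clades_c = parse_clades(constraint_sub)
taxa_m, clades_m = parse_clades(ml_sub)

same_taxa = taxa_c == taxa_m      # both trees must cover the same leaves
ml_refines = clades_c <= clades_m  # every constrained clade appears in the ML tree
extra = clades_m - clades_c        # groupings the ML tree adds

print("same taxa:", same_taxa)
print("ML tree is compatible with the constraint:", ml_refines)
for clade in extra:
    print("resolved in ML tree only:", sorted(clade))
```

Passing the full strings to `parse_clades` in the same way gives a quick check that both trees contain the same taxa, and lists the clades that are present in one tree but absent from the other.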
Cite this article
Waddell, P.J., Ota, R. & Penny, D. Measuring Fit of Sequence Data to Phylogenetic Model: Gain of Power Using Marginal Tests. J Mol Evol 69, 289–299 (2009). https://doi.org/10.1007/s00239-009-9268-8
DOI: https://doi.org/10.1007/s00239-009-9268-8