Tree-based maximal likelihood substitution matrices and hidden Markov models

Journal of Molecular Evolution

Abstract

There has been considerable interest in the problem of constructing maximum likelihood (ML) evolutionary trees that allow insertions and deletions. This problem is partly one of formulation: how does one define a probabilistic model for such trees that treats insertion and deletion in a biologically plausible manner? A possible answer is proposed here by extending the concept of a hidden Markov model (HMM) to evolutionary trees. The model, called a tree-HMM, allows what may loosely be regarded as learnable affine-type gap penalties for alignments. In HMMs, these penalties are expressed as probabilities of transitions between states. In the tree-HMM, this idea is given an evolutionary embodiment by defining trees of transitions. Just as the probability of a tree composed of ungapped sequences is computed by Felsenstein's method, using matrices that represent the probabilities of residue substitutions along the edges of the tree, so the probabilities in a tree-HMM are computed using substitution matrices for both residues and transitions. It is shown here how to define these matrices by an ML procedure, using an algorithm that learns from a database of protein sequences. Given these matrices, one can define a tree-HMM likelihood for a set of sequences, assuming a particular tree topology and an alignment of the sequences to the model. If one could efficiently find the alignment that maximizes (or comes close to maximizing) this likelihood, then one could search for the optimal tree topology for the sequences. An alignment algorithm is defined here which, given a particular tree topology, is guaranteed to increase the likelihood of the model. Unfortunately, it fails to find global optima for realistic sequence sets. Further research is therefore needed to turn the tree-HMM into a practical phylogenetic tool.
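The computation attributed above to Felsenstein's method can be illustrated for a single alignment column. The following is a minimal sketch under stated assumptions, not the authors' implementation: the four-letter alphabet, the Jukes-Cantor-style matrix, the tree shape, and the names `jc_matrix`, `conditional_likelihoods`, and `site_likelihood` are all hypothetical choices made for illustration. In a tree-HMM, as the abstract describes it, the same kind of recursion would also be applied with substitution matrices over transitions.

```python
# Toy sketch of Felsenstein's pruning recursion for one ungapped column.
# Alphabet, matrices, and tree shape are illustrative assumptions only.

ALPHABET = "ACGT"


def jc_matrix(p_change):
    """Substitution matrix for one edge: a residue changes with total
    probability p_change, spread evenly over the other letters."""
    n = len(ALPHABET)
    off = p_change / (n - 1)
    return [[1.0 - p_change if i == j else off for j in range(n)]
            for i in range(n)]


def conditional_likelihoods(node):
    """Return L[a] = P(leaves observed below node | node has residue a).

    A leaf is a single residue character; an internal node is a list of
    (child, edge_matrix) pairs.
    """
    if isinstance(node, str):
        return [1.0 if a == node else 0.0 for a in ALPHABET]
    result = [1.0] * len(ALPHABET)
    for child, matrix in node:
        child_l = conditional_likelihoods(child)
        for a in range(len(ALPHABET)):
            # Sum over the child's residue b, weighted by P(a -> b) on the edge.
            result[a] *= sum(matrix[a][b] * child_l[b]
                             for b in range(len(ALPHABET)))
    return result


def site_likelihood(root, root_freqs):
    """Average the root's conditional likelihoods over prior root frequencies."""
    return sum(f * l for f, l in zip(root_freqs, conditional_likelihoods(root)))


short, long_ = jc_matrix(0.1), jc_matrix(0.3)
# ((A, A), C): two closely related leaves plus one more distant leaf.
tree = [([("A", short), ("A", short)], long_), ("C", long_)]
uniform = [0.25] * 4
print(site_likelihood(tree, uniform))
```

Under an independence assumption between columns, the likelihood of a whole ungapped alignment would be the product of such per-column values.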

References

  • Allison L, Wallace CS, Yee CN (1992) Minimum message length encoding, evolutionary trees and multiple alignment. 25th Hawaii Intern Conf on System Sciences 1:663–674

  • Barry D, Hartigan JA (1987) Statistical analysis of hominoid molecular evolution. Stat Sci 2:191–210

  • Berg OG, von Hippel PH (1987) Selection of DNA binding sites by regulatory proteins: statistical-mechanical theory and application to operators and promoters. J Mol Biol 193:723–750

  • Bishop MJ, Thompson EA (1986) Maximum likelihood alignment of DNA sequences. J Mol Biol 190:159–165

  • Bishop MJ, Friday AE, Thompson EA (1987) Inference of evolutionary relationships. In: Bishop M, Rawlings CJ (eds) Nucleic acid and protein sequence analysis. IRL Press, Oxford, pp 359–385

  • Brown M, Hughey R, Krogh A, Mian IS, Sjoelander K, Haussler D (1993) Using Dirichlet mixture priors to derive hidden Markov models for protein families. Proc Intelligent Systems for Molecular Biology, Washington DC

  • Cavalli-Sforza LL, Edwards AWF (1967) Phylogenetic analysis: models and estimation procedures. Evolution 21:550–570

  • Cox DR, Miller HD (1965) The theory of stochastic processes. Chapman and Hall, London

  • Dayhoff MO, Schwartz RM, Orcutt BC (1978) A model of evolutionary change in proteins. In: Dayhoff MO (ed) Atlas of protein sequence and structure, vol 5 suppl 3. pp 345–352

  • Eddy SR, Mitchison G, Durbin R (1995) Maximum discrimination hidden Markov models of sequence consensus. J Comput Biol 2:9–23

  • Edwards AWF, Cavalli-Sforza LL (1963) The reconstruction of evolution. Ann Hum Genet 27:105

  • Edwards AWF, Cavalli-Sforza LL (1964) Reconstruction of evolutionary trees. In: Heywood VH, McNeill J (eds) Phenetic and phylogenetic classification. Systematics Assoc, London, Publ No. 6, pp 67–76

  • Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17:368–376

  • Felsenstein J (1988) Phylogenies from molecular sequences: inference and reliability. Annu Rev Genet 22:521–565

  • Gonnet GH, Cohen MA, Benner SA (1992) Exhaustive matching of the entire protein sequence database. Science 256:1443–1445

  • Gribskov M, McLachlan AD, Eisenberg D (1987) Profile analysis: detection of distantly related proteins. Proc Natl Acad Sci USA 84:4355–4358

  • Hein J (1989) A new method that simultaneously aligns and reconstructs ancestral sequences for any number of homologous sequences, when the phylogeny is given. Mol Biol Evol 6:649–668

  • Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 89:10915–10919

  • Hillis DM, Moritz C (1990) Molecular systematics. Sinauer Associates, Sunderland MA

  • Krogh A, Brown M, Mian IS, Sjoelander K, Haussler D (1994) Hidden Markov models in computational biology: application to protein modeling. J Mol Biol 235:1501–1531

  • Kullback S (1978) Information theory and statistics. Peter Smith, Gloucester, MA

  • Overington J, Donnelly D, Johnson MS, Sali A, Blundell TL (1992) Environment-specific amino acid substitution tables: tertiary templates and prediction of protein folds. Protein Sci 1:216–226

  • Rabiner LR (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 77:257–286

  • Rumelhart DE, Zipser D (1985) Feature discovery by competitive learning. Cogn Sci 9:75–112

  • Sander C, Schneider R (1993) The HSSP data base of protein structure-sequence alignments. Nucleic Acids Res 21:3105–3109

  • Sankoff D, Morel C, Cedergren RJ (1973) Evolution of 5S RNA and the non-randomness of base replacement. Nature New Biol 245:232–234

  • Thorne JL, Kishino H, Felsenstein J (1991) An evolutionary model for maximum likelihood alignment of DNA sequences. J Mol Evol 33:114–124

  • Thorne JL, Kishino H, Felsenstein J (1992) Inching toward reality: an improved likelihood model of sequence evolution. J Mol Evol 34:3–16

Cite this article

Mitchison, G., Durbin, R. Tree-based maximal likelihood substitution matrices and hidden Markov models. J Mol Evol 41, 1139–1151 (1995). https://doi.org/10.1007/BF00173195
