Abstract
There has been considerable interest in the problem of constructing maximum likelihood (ML) evolutionary trees that allow insertions and deletions. This problem is partly one of formulation: how does one define a probabilistic model for such trees that treats insertion and deletion in a biologically plausible manner? A possible answer to this question is proposed here by extending the concept of a hidden Markov model (HMM) to evolutionary trees. The model, called a tree-HMM, allows what may be loosely regarded as learnable affine-type gap penalties for alignments. In HMMs, these penalties are expressed as probabilities of transitions between states. In the tree-HMM, this idea is given an evolutionary embodiment by defining trees of transitions. Just as the probability of a tree composed of ungapped sequences is computed, by Felsenstein's method, using matrices representing the probabilities of substitutions of residues along the edges of the tree, so the probabilities in a tree-HMM are computed using substitution matrices for both residues and transitions. It is shown here how to define these matrices by an ML procedure, using an algorithm that learns from a database of protein sequences. Given these matrices, one can define a tree-HMM likelihood for a set of sequences, assuming a particular tree topology and an alignment of the sequences to the model. If one could efficiently find the alignment that maximizes (or comes close to maximizing) this likelihood, then one could search for the optimal tree topology for the sequences. An alignment algorithm is defined here which, given a particular tree topology, is guaranteed to increase the likelihood of the model. Unfortunately, it fails to find global optima for realistic sequence sets. Thus further research is needed to turn the tree-HMM into a practical phylogenetic tool.
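As an illustration of the ungapped case mentioned above, the following is a minimal sketch of Felsenstein's pruning recursion for a single alignment column, with one substitution matrix per edge. It is not taken from the paper: the two-letter alphabet, the matrix values, the uniform root prior, and the names `conditional_likelihoods` and `tree_likelihood` are all assumptions chosen for the example.

```python
# Toy sketch of Felsenstein's pruning recursion for one alignment column.
# Everything here (alphabet, matrix values, prior, function names) is a
# hypothetical illustration, not the paper's implementation.

def conditional_likelihoods(node, alphabet):
    """Return L[i] = P(observed leaves below node | node holds alphabet[i])."""
    if "residue" in node:  # leaf: indicator vector for the observed residue
        return [1.0 if a == node["residue"] else 0.0 for a in alphabet]
    result = [1.0] * len(alphabet)
    for child, matrix in node["children"]:
        child_l = conditional_likelihoods(child, alphabet)
        for i in range(len(alphabet)):
            # Sum over the child's residue j, weighted by the edge's
            # substitution probability matrix[i][j] (parent i -> child j).
            result[i] *= sum(matrix[i][j] * child_l[j]
                             for j in range(len(alphabet)))
    return result


def tree_likelihood(root, alphabet, root_prior):
    """Likelihood of the column: average the root vector over the prior."""
    cond = conditional_likelihoods(root, alphabet)
    return sum(p * l for p, l in zip(root_prior, cond))


ALPHABET = ["A", "B"]
# Assumed substitution matrix shared by both edges:
# row = parent residue, column = child residue.
M = [[0.9, 0.1],
     [0.1, 0.9]]

# Two-leaf tree whose leaves differ, and one whose leaves agree.
root_differ = {"children": [({"residue": "A"}, M), ({"residue": "B"}, M)]}
root_agree = {"children": [({"residue": "A"}, M), ({"residue": "A"}, M)]}

likelihood_differ = tree_likelihood(root_differ, ALPHABET, [0.5, 0.5])
likelihood_agree = tree_likelihood(root_agree, ALPHABET, [0.5, 0.5])
```

With these assumed values, the column whose leaves agree receives the higher likelihood. According to the abstract, the tree-HMM applies the same style of computation to transitions as well as residues, each with its own substitution matrices.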
References
Allison L, Wallace CS, Yee CN (1992) Minimum message length encoding, evolutionary trees and multiple alignment. 25th Hawaii Intern Conf on System Sciences 1:663–674
Barry D, Hartigan JA (1987) Statistical analysis of hominoid molecular evolution. Stat Sci 2:191–210
Berg OG, von Hippel PH (1987) Selection of DNA binding sites by regulatory proteins: statistical-mechanical theory and application to operators and promoters. J Mol Biol 193:723–750
Bishop MJ, Thompson EA (1986) Maximum likelihood alignment of DNA sequences. J Mol Biol 190:159–165
Bishop MJ, Friday AE, Thompson EA (1987) Inference of evolutionary relationships. In: Bishop M, Rawlings CJ (eds) Nucleic acid and protein sequence analysis. IRL Press, Oxford, pp 359–385
Brown M, Hughey R, Krogh A, Mian IS, Sjoelander K, Haussler D (1993) Using Dirichlet mixture priors to derive hidden Markov models for protein families. Proc Intelligent Systems for Molecular Biology, Washington DC
Cavalli-Sforza LL, Edwards AWF (1967) Phylogenetic analysis: models and estimation procedures. Evolution 21:550–570
Cox DR, Miller HD (1965) The theory of stochastic processes. Chapman and Hall, London
Dayhoff MO, Schwartz RM, Orcutt BC (1978) A model of evolutionary change in proteins. In: Dayhoff MO (ed) Atlas of protein sequence and structure, vol 5 suppl 3. pp 345–352
Eddy SR, Mitchison G, Durbin R (1995) Maximum discrimination hidden Markov models of sequence consensus. J Comput Biol 2:9–23
Edwards AWF, Cavalli-Sforza LL (1963) The reconstruction of evolution. Ann Hum Genet 27:105
Edwards AWF, Cavalli-Sforza LL (1964) Reconstruction of evolutionary trees. In: Heywood VH, McNeill J (eds) Phenetic and phylogenetic classification. Systematics Assoc, London, Publ No. 6, pp 67–76
Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17:368–376
Felsenstein J (1988) Phylogenies from molecular sequences: inference and reliability. Annu Rev Genet 22:521–565
Gonnet GH, Cohen MA, Benner SA (1992) Exhaustive matching of the entire protein sequence database. Science 256:1443–1445
Gribskov M, McLachlan AD, Eisenberg D (1987) Profile analysis: detection of distantly related proteins. Proc Natl Acad Sci USA 84:4355–4358
Hein J (1989) A new method that simultaneously aligns and reconstructs ancestral sequences for any number of homologous sequences, when the phylogeny is given. Mol Biol Evol 6:649–668
Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 89:10915–10919
Hillis DM, Moritz C (1990) Molecular systematics. Sinauer Associates, Sunderland MA
Krogh A, Brown M, Mian IS, Sjoelander K, Haussler D (1994) Hidden Markov models in computational biology: application to protein modeling. J Mol Biol 235:1501–1531
Kullback S (1978) Information theory and statistics. Peter Smith, Gloucester, MA
Overington J, Donnelly D, Johnson MS, Sali A, Blundell TL (1992) Environment-specific amino acid substitution tables: tertiary templates and prediction of protein folds. Protein Sci 1:216–226
Rabiner LR (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 77:257–286
Rumelhart DE, Zipser D (1985) Feature discovery by competitive learning. Cogn Sci 9:75–112
Sander C, Schneider R (1993) The HSSP data base of protein structure-sequence alignments. Nucleic Acids Res 21:3105–3109
Sankoff D, Morel C, Cedergren RJ (1973) Evolution of 5S RNA and the non-randomness of base replacement. Nature New Biol 245:232–234
Thorne JL, Kishino H, Felsenstein J (1991) An evolutionary model for maximum likelihood alignment of DNA sequences. J Mol Evol 33:114–124
Thorne JL, Kishino H, Felsenstein J (1992) Inching toward reality: an improved likelihood model of sequence evolution. J Mol Evol 34:3–16
Cite this article
Mitchison, G., Durbin, R. Tree-based maximal likelihood substitution matrices and hidden Markov models. J Mol Evol 41, 1139–1151 (1995). https://doi.org/10.1007/BF00173195