Computational analysis of evolution and conservation in a protein superfamily
Introduction
Large-scale consolidation and organization of sequences into groups closely related in evolution is a useful step in the classification of proteins [1] and prediction of protein function [2]. In most cases, this is not a trivial undertaking and requires the careful selection of methods for aligning sequences and inferring phylogenetic relationships. Considerations include both the applicability of a particular method to the data (e.g., different models of evolution, different degrees of divergence) and the practical consideration of computational feasibility.
This paper outlines a general strategy for identifying and aligning sequences from a family of proteins, and for inferring patterns of amino acid conservation and the evolutionary relationships between them. The strategy is based in part on approaches used previously to analyze the serpin gene superfamily [3], [4], [5], [6]. Information on the serpin superfamily can be found elsewhere (for a brief overview, see [7]; a detailed treatment can be found in [8]).
Description of method
Table 1 summarizes the software programs used in this paper and the Internet locations from which they can be accessed (as at April 2003). Most of the programs operate under a UNIX-type environment (e.g., Linux), which readily permits scripting and linking of software together. Many programs can be obtained as ready-to-run executables, but depending on the system used, some may need to be compiled. For information on the PERL programming language, the O’Reilly text ‘Learning PERL’ is
Concluding remarks
A large-scale phylogenetic analysis should be undertaken using both sequence and non-sequence-based data. The completeness of the sequence data should also be considered; removing sequences that show identity to one another above a given threshold (e.g., 90%) can dramatically increase the manageability of the data and the speed with which it is processed. Alignment accuracy is critical, and reference to structural information can be very useful in this regard. From this alignment, simple techniques can
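The redundancy filter mentioned above (removing sequences whose identity exceeds a threshold such as 90%) can be illustrated with a minimal sketch. This is not the procedure used in the paper: the function names `percent_identity` and `filter_redundant` are hypothetical, the sequences are made up, and two choices are assumptions made for illustration only, namely that identity is counted over alignment columns where neither sequence has a gap, and that sequences are kept greedily in input order.

```python
from typing import Dict, List

GAP = "-"  # assumed gap character in the aligned sequences


def percent_identity(a: str, b: str) -> float:
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == GAP or y == GAP:
            continue  # skip gapped columns (an illustrative assumption)
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def filter_redundant(aligned: Dict[str, str], threshold: float = 90.0) -> List[str]:
    """Greedy filter: keep a sequence only if its identity to every
    sequence already kept does not exceed `threshold` percent."""
    kept: List[str] = []
    for name, seq in aligned.items():
        if all(percent_identity(seq, aligned[k]) <= threshold for k in kept):
            kept.append(name)
    return kept


# Made-up aligned sequences, for illustration only.
example = {
    "seqA": "MKTAYIAKQR",
    "seqB": "MKTAYLAKQK",  # 80% identical to seqA, so it is kept
    "seqC": "MKTAYIAKQR",  # identical to seqA, so it is removed
    "seqD": "MRSG-LLKQE",  # divergent, so it is kept
}

if __name__ == "__main__":
    print(filter_redundant(example))
```

Because the filter is greedy, the result depends on input order; placing well-characterized or structurally solved sequences first would ensure they are the representatives retained.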
Acknowledgements
The authors wish to thank Michael Cameron for suggesting the use of edit distance in the partition cluster method. J.C.W. is an Australian National Health and Medical Research Council Senior Research Fellow.
References (81)
- et al., J. Biol. Chem. (2001)
- et al., J. Mol. Biol. (1998)
- et al., Gene (2002)
- et al., Trends Biochem. Sci. (1998)
- et al., J. Mol. Biol. (2003)
- et al., J. Mol. Biol. (2000)
- Methods Enzymol. (1996)
- et al., J. Mol. Biol. (1990)
- J. Mol. Biol. (1996)
- et al., Methods Enzymol. (1996)
- J. Struct. Biol.
- J. Mol. Biol.
- J. Mol. Biol.
- J. Mol. Biol.
- Genomics
- Genomics
- Bull. Math. Biol.
- Nucleic Acids Res.
- Genome Res.
- Genome Res.
- Mol. Biol. Evol.
- Mol. Biol. Evol.
- Mol. Biol. Evol.
- Chem. Rev.
- Introduction to Bioinformatics
- Molecular Evolution: A Phylogenetic Approach
- Nucleic Acids Res.
- Nucleic Acids Res.
- Nucleic Acids Res.
- Bioinformatics
- Nucleic Acids Res.
- Nucleic Acids Res.
- Biochemistry
- Nucleic Acids Res.
- Bioinformatics
- Proteins Suppl.
- Bioinformatics
- Bioinformatics
- EMBO J.