Elsevier

Methods

Volume 32, Issue 2, February 2004, Pages 73-92

Computational analysis of evolution and conservation in a protein superfamily

https://doi.org/10.1016/S1046-2023(03)00200-7

Abstract

Many gene superfamilies have hundreds or thousands of members and hence pose a significant challenge when performing a large-scale phylogenetic analysis. Derivation of the most accurate alignment possible and inference of evolutionary relationships (with an appropriate measure of confidence) are significant “bottlenecks” in the process. A generally applicable strategy is outlined for identifying and aligning sequences, performing simple analysis of the resulting alignment, and inferring evolutionary relationships. Reference is made to the serpin superfamily. The ‘partition cluster’ method, a relatively rapid technique for extracting underlying associations from phylogenetic bootstrap trees, is also presented.

Introduction

Large-scale consolidation and organization of sequences into groups closely related in evolution is a useful step in the classification of proteins [1] and prediction of protein function [2]. In most cases, this is not a trivial undertaking and requires the careful selection of methods for aligning sequences and inferring phylogenetic relationships. Considerations include both the applicability of a particular method to the data (e.g., different models of evolution, different degrees of divergence) and the practical consideration of computational feasibility.

This paper will outline a general strategy for the identification and alignment of sequences from a family of proteins, and the inference of patterns of amino acid conservation and evolutionary relationships between them. This is based in part on approaches used previously to analyze the serpin gene superfamily [3], [4], [5], [6]. Information on the serpin superfamily can be found elsewhere (for a brief overview, see [7]; a detailed treatment can be found in [8]).

Section snippets

Description of method

Table 1 summarizes the software programs used in this paper and the Internet locations from which they can be accessed (as of April 2003). Most of the programs run in a UNIX-type environment (e.g., Linux), which readily permits scripting and the linking of programs together. Many programs can be obtained as ready-to-run executables, but depending on the system used, some may need to be compiled. For information on the Perl programming language, the O’Reilly text ‘Learning Perl’ is

Concluding remarks

A large-scale phylogenetic analysis should be undertaken using both sequence and non-sequence-based data. The completeness of the sequence data is an important consideration; removing sequences that show identity above a given threshold (e.g., 90%) can dramatically increase the manageability of the data and the speed with which they are processed. Alignment accuracy is critical, and reference to structural information can be very useful in this regard. From this alignment, simple techniques can
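The identity-threshold filtering mentioned above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `percent_identity` and `filter_redundant`, the choice to compute identity over ungapped aligned columns, and the greedy "keep the first sequence seen" rule are all assumptions made for illustration. Only the 90% threshold comes from the text.

```python
from typing import Dict, List


def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap.

    Assumption: both sequences come from the same multiple alignment,
    so they have equal length and '-' marks a gap.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def filter_redundant(aligned: Dict[str, str], threshold: float = 90.0) -> List[str]:
    """Greedy redundancy filter (hypothetical helper).

    A sequence is kept only if its identity to every already-kept
    sequence is at or below the threshold. The result depends on the
    input order, which is a simplification.
    """
    kept: List[str] = []
    for name, seq in aligned.items():
        if all(percent_identity(seq, aligned[k]) <= threshold for k in kept):
            kept.append(name)
    return kept


if __name__ == "__main__":
    # Toy aligned sequences, invented for this example.
    toy = {
        "seqA": "MKTAYIAKQR",
        "seqB": "MKTAYIAKQR",  # identical to seqA, so it is dropped
        "seqC": "MKSAYLGKHR",  # 6 of 10 positions match seqA, so it is kept
    }
    print(filter_redundant(toy))
```

Because the filter is greedy, which member of a near-identical group survives depends on the input order; sorting the input first (for example, by sequence length or annotation quality) would be one way to make that choice deliberate.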

Acknowledgements

The authors wish to thank Michael Cameron for suggesting the use of edit distance in the partition cluster method. J.C.W. is an Australian National Health and Medical Research Council Senior Research Fellow.

References (81)

  • G.A. Silverman et al., J. Biol. Chem. (2001)
  • J. Park et al., J. Mol. Biol. (1998)
  • O. Kruger et al., Gene (2002)
  • S.F. Altschul et al., Trends Biochem. Sci. (1998)
  • R. Sadreyev et al., J. Mol. Biol. (2003)
  • C. Notredame et al., J. Mol. Biol. (2000)
  • W.R. Pearson, Methods Enzymol. (1996)
  • S.F. Altschul et al., J. Mol. Biol. (1990)
  • O. Gotoh, J. Mol. Biol. (1996)
  • D.G. Higgins et al., Methods Enzymol. (1996)
  • G. D’Alfonso et al., J. Struct. Biol. (2001)
  • M.J. Thompson et al., J. Mol. Biol. (1999)
  • S.J. Hamill et al., J. Mol. Biol. (2000)
  • L. Mirny et al., J. Mol. Biol. (2001)
  • F.L. Scott et al., Genomics (1999)
  • A.J. Bartuski et al., Genomics (1997)
  • T. Margush et al., Bull. Math. Biol. (1981)
  • N.D. Rawlings et al., Nucleic Acids Res. (1999)
  • J.A. Eisen, Genome Res. (1998)
  • J.A. Irving et al., Genome Res. (2000)
  • J.A. Irving et al., Mol. Biol. Evol. (2002)
  • H. Ragg et al., Mol. Biol. Evol. (2001)
  • W.R. Atchley et al., Mol. Biol. Evol. (2001)
  • P.G. Gettins, Chem. Rev. (2002)
  • R.L. Schwartz, T. Phoenix, Learning Perl, O’Reilly,...
  • A.M. Lesk, Introduction to Bioinformatics (2002)
  • R.D.M. Page et al., Molecular Evolution: A Phylogenetic Approach (1998)
  • A. Bateman et al., Nucleic Acids Res. (2002)
  • I. Letunic et al., Nucleic Acids Res. (2002)
  • S.F. Altschul et al., Nucleic Acids Res. (1997)
  • K. Karplus et al., Bioinformatics (1998)
  • M. Madera et al., Nucleic Acids Res. (2002)
  • Z. Zhang et al., Nucleic Acids Res. (1998)
  • P.C. Hopkins et al., Biochemistry (1993)
  • B. Boeckmann et al., Nucleic Acids Res. (2003)
  • R. Spang et al., Bioinformatics (2001)
  • J.M. Bujnicki et al., Proteins Suppl. (2001)
  • L. Holm et al., Bioinformatics (1998)
  • J. Park et al., Bioinformatics (2000)
  • C. Chothia et al., EMBO J. (1986)