Evaluation of models for the evolution of protein sequences and functions under structural constraint

doi:10.1016/j.bpc.2006.06.008

Biophysical Chemistry

Volume 124, Issue 2, 20 November 2006, Pages 134-144

https://doi.org/10.1016/j.bpc.2006.06.008

Abstract

In the field of evolutionary structural genomics, methods are needed to evaluate why genomes evolved to contain the fold distributions that are observed. To study the effects of population dynamics on evolved genomes, we need fast and accurate evolutionary models that can analyze the effects of selection, drift, and fixation of a protein sequence in a population, and that are grounded in physical parameters governing the folding and binding properties of the sequence. In this study, various knowledge-based, force field, and statistical methods for protein folding have been evaluated on four different folds (SH2 domains, SH3 domains, Globin-like, and Flavodoxin-like) to assess the speed and accuracy of the energy functions. Similarly, knowledge-based and force field methods have been used to predict ligand binding specificity in SH2 domains. To demonstrate the applicability of these methods, the evolutionary dynamics of new binding capabilities in an SH2 domain are presented.

Introduction

Molecular evolution at the structural level plays an important role in structural genomics, which has been concerned with describing the “parts list” of protein structures that are used to construct various genomes [1]. At the same time, comparative genome analysis and population genetic theory have led to the hypothesis that many of the differences between genomes can be explained by differences in effective population size during the evolution of organisms [2]. In attempting to build a bridge between these two views of genome evolution, both a theoretical framework [3], [4], [5], [6] and a lattice modeling framework [7], [8], [9], [10] have been used to link the evolution of sequence through structure to function. Three-dimensional windows or contact maps have been developed to study the physical interactions governing protein folding [11], [12] and, more recently, the evolution of individual regions of a protein [13], [14]. These can be used to extend lattice modeling studies to real proteins with different folds. To enable such simulation studies, effective coarse-grained methods are needed to evaluate the folding and binding capabilities of real protein structures, ultimately enabling an understanding of the underlying rules that dictate the “parts list” for any genome.

In our previous work we studied the mechanism of protein evolution through gene duplication followed by mutation and selection, using three-dimensional protein lattices [10]. In addition to functional selective pressures, we evolved the protein lattices under structural selective pressure on the basis of folding energy, calculated from defined interactions at the lattice points and contact potentials derived from a contact energy matrix [12], [15]. Lattice models have previously been used in this context to make important predictions about the behavior of proteins in evolutionary contexts, including their metastability [16].

Since proteins are robust to site mutations and plastic in nature, in that they accept mutations without destroying the fold [16], [17], [18], another challenge is the development of an evolutionary model that combines population genetics theory with the evolution of lattice (or real protein-encoding) genes under either statistical or physical (force field) energy constraints. The evolutionary constraints on proteins include that they should perform a function and that they must be stable enough to perform that function reliably while resisting unfolding, aggregation, and proteolysis.

In fact, a new field is emerging, as large scale gene and genome sequencing is enabling not only the comparison of closely related species, but increasingly of populations within a species. Simultaneously, the field of molecular evolution is increasingly interested in models of sequence evolution that incorporate structure. This combination is bringing together traditional population genetics with structural biology, where population-level variation can be examined not only at the sequence level, but also at the protein structural level in analyses characterizing variation in protein function.

One of the central problems in developing evolutionary models for proteins is developing an empirical energy function whose global minimum occurs when the protein is folded into the native state. Such a function also makes it possible to analyze the effect of mutations on evolving protein molecules by testing whether the global minimum has shifted. Two classes of methods are presently available for designing empirical energy functions. Knowledge-based methods depend upon contact interaction matrices derived from known protein folds in the PDB and are widely used in molecular evolutionary studies. Alternatively, force field methods are parameterized with the forces governing interactions between atoms in proteins.
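As a rough illustration of the knowledge-based class, the sketch below scores a sequence on a fixed structure by summing pairwise contact energies over residues that are close in space. It is a toy under stated assumptions, not the energy function evaluated in this study: the two-letter H/P alphabet, the energy values, the 6.5 Å cutoff, the minimum sequence separation, and the names `contact_map` and `folding_energy` are all hypothetical.

```python
import math

# Hypothetical contact energies for a two-letter (H/P) alphabet, arbitrary units.
# Real knowledge-based matrices cover all 20 amino acids and are derived
# from known structures; these numbers are made up for illustration.
CONTACT_ENERGY = {
    ("H", "H"): -1.0,
    ("H", "P"): 0.0,
    ("P", "P"): 0.25,
}


def pair_energy(a, b):
    """Look up a contact energy, treating the matrix as symmetric."""
    if (a, b) in CONTACT_ENERGY:
        return CONTACT_ENERGY[(a, b)]
    return CONTACT_ENERGY[(b, a)]


def contact_map(coords, cutoff=6.5, min_separation=3):
    """Return index pairs (i, j) whose coordinates lie within `cutoff`.

    Pairs closer than `min_separation` along the chain are skipped, since
    they are in contact in every conformation. Both defaults are assumptions.
    """
    contacts = []
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                contacts.append((i, j))
    return contacts


def folding_energy(sequence, contacts):
    """Sum the pairwise contact energies of `sequence` over a contact list."""
    return sum(pair_energy(sequence[i], sequence[j]) for i, j in contacts)


# Toy hairpin: six pseudo-atoms spaced 3.8 units apart, folded back on itself.
TOY_COORDS = [
    (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
    (7.6, 3.8, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0),
]
TOY_CONTACTS = contact_map(TOY_COORDS)

print(folding_energy("HHPPHH", TOY_CONTACTS))
print(folding_energy("PPHHPP", TOY_CONTACTS))
```

Because the contact list is computed once per structure, scoring each mutant sequence costs only one table lookup per contact, which is what makes this class of function attractive when many sequences must be evaluated.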

Some recent work has focused on deriving knowledge-based energy matrices for both long-range and short-range interactions between amino acids in protein folds, based on the RMSD between α carbons, the torsion and bond angle changes of virtual Cα–Cα bonds, and the coupling between them [19], [20]. Another method involves deriving energy parameters for simplified models of folding based on the maximization of the thermodynamic average of the overlap between protein native structures and a Boltzmann ensemble of alternative structures [21]. A third approach uses an all-atom model and a physical energy function [22].

In addition to developing an energy function, another challenge facing us is fast and accurate side-chain conformation prediction. An efficient approach uses results from graph theory to solve the combinatorial problem encountered in the side-chain prediction problem [23]. In this method, side chains are represented as vertices in an undirected graph. Any two residues that have rotamers with nonzero interaction energies are considered to have an edge in the graph. The resulting graph can be partitioned into connected subgraphs with no edges between them. These subgraphs can in turn be broken into biconnected components, which are graphs that cannot be disconnected by removal of a single vertex. The combinatorial problem is reduced to finding the minimum energy of these small biconnected components and combining the results to identify the global minimum energy conformation.
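The decomposition described above can be sketched in a few lines. The code below is only an illustration of the graph step, not the implementation of [23]: it finds connected subgraphs with a depth-first search and biconnected components with the standard Hopcroft-Tarjan edge-stack algorithm, and it does not perform the energy minimization itself. The residue labels, the edge list, and the assumption of three rotamers per residue are hypothetical.

```python
import math


def build_graph(nodes, edges):
    """Undirected adjacency sets; an edge means two residues have rotamers
    with nonzero interaction energy."""
    graph = {n: set() for n in nodes}
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return graph


def connected_components(graph):
    """Vertex sets of the connected subgraphs (iterative depth-first search)."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            comp.add(node)
            for nb in graph[node] - seen:
                seen.add(nb)
                stack.append(nb)
        components.append(comp)
    return components


def biconnected_components(graph):
    """Vertex sets of the biconnected components (Hopcroft-Tarjan).

    Isolated vertices have no edges and therefore yield no component.
    Recursive, so intended for small graphs only.
    """
    depth, low = {}, {}
    edge_stack, components = [], []

    def dfs(node, parent, d):
        depth[node] = low[node] = d
        for nb in graph[node]:
            if nb == parent:
                continue
            if nb not in depth:  # tree edge
                edge_stack.append((node, nb))
                dfs(nb, node, d + 1)
                low[node] = min(low[node], low[nb])
                if low[nb] >= depth[node]:  # node separates nb's subtree
                    comp = set()
                    while True:
                        u, v = edge_stack.pop()
                        comp.update((u, v))
                        if (u, v) == (node, nb):
                            break
                    components.append(comp)
            elif depth[nb] < depth[node]:  # back edge to an ancestor
                edge_stack.append((node, nb))
                low[node] = min(low[node], depth[nb])

    for node in graph:
        if node not in depth:
            dfs(node, None, 0)
    return components


# Hypothetical residue interaction graph: two triangles joined by a bridge,
# plus a separate interacting pair.
RESIDUES = list("ABCDEFGH")
EDGES = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"),
         ("D", "E"), ("E", "F"), ("D", "F"), ("G", "H")]
GRAPH = build_graph(RESIDUES, EDGES)
ROTAMERS = {r: 3 for r in RESIDUES}  # assumed rotamer count per residue

blocks = biconnected_components(GRAPH)
full_cost = math.prod(ROTAMERS.values())
block_cost = sum(math.prod(ROTAMERS[r] for r in b) for b in blocks)
print(len(connected_components(GRAPH)), len(blocks), full_cost, block_cost)
```

The two cost figures are only a rough count of rotamer combinations enumerated with and without the decomposition; combining the per-component minima at the shared residues requires additional bookkeeping that this sketch omits.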

In this study we compare various computationally fast methods to evaluate which methods are best able to characterize the folding and binding of real proteins for use in population genomic studies where large numbers of sequences, folds, and mutations need to be evaluated. We further evaluate how these methods can be used to model the evolution of new binding functionalities in the absence of gene duplication. Ultimately, such methods can be used towards developing a better understanding of the structural “parts list” found differentially in various genomes.

Methods

In order to evaluate various methods for folding and binding, we have analyzed four protein folds, one from each of the categories only α (containing only alpha helices), only β (containing only beta sheets), α + β (mainly antiparallel beta sheets, with segregated alpha and beta regions) and α/β (mainly parallel beta sheets, with beta–alpha–beta units): Globin-like, SH3 domain, SH2 domain and Flavodoxin-like, respectively. We have downloaded all coordinate files for these structures from the Protein Data Bank (PDB) [24]

Results

The folding energies of a set of PDB files (Table 1) were analyzed with various methods by threading both the native sequence and a set of random sequences through the established folds. It is assumed that only a very small fraction of random sequences will preferentially fold into any particular fold and that a method that differentiates sequences that are known to fold into a particular structure over random sequences therefore performs better. In every case the conformation of side chains
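The native-versus-random comparison described in this section can be summarized as a Z-score: the energy of the native sequence threaded through the fold, measured in standard deviations from the mean energy of random sequences threaded through the same fold. The following is a minimal sketch under stated assumptions rather than the protocol used here: random sequences are generated by shuffling the native sequence (which preserves composition), and the contact list, the H/P energy table, and the function names are hypothetical.

```python
import random
import statistics

# Hypothetical contact list for a six-residue toy fold and made-up
# two-letter contact energies (arbitrary units).
CONTACTS = [(0, 4), (0, 5), (1, 4), (1, 5)]
ENERGY = {
    ("H", "H"): -1.0,
    ("H", "P"): 0.0,
    ("P", "H"): 0.0,
    ("P", "P"): 0.25,
}


def threading_energy(sequence, contacts=CONTACTS):
    """Energy of `sequence` placed on the fixed fold: sum over its contacts."""
    return sum(ENERGY[(sequence[i], sequence[j])] for i, j in contacts)


def shuffled_sequences(native, n, rng):
    """Yield n random sequences with the same composition as `native`.

    Shuffling is an assumption; sampling residues from background
    frequencies would be an equally plausible choice of random model.
    """
    for _ in range(n):
        residues = list(native)
        rng.shuffle(residues)
        yield "".join(residues)


def z_score(native, n_random=1000, seed=0):
    """Z-score of the native energy against the random-sequence distribution.

    More negative values mean the function separates the native sequence
    from random sequences more clearly.
    """
    rng = random.Random(seed)
    decoys = [threading_energy(s) for s in shuffled_sequences(native, n_random, rng)]
    mean = statistics.fmean(decoys)
    spread = statistics.pstdev(decoys)
    if spread == 0:
        raise ValueError("all random sequences have the same energy")
    return (threading_energy(native) - mean) / spread


print(z_score("HHPPHH"))
```

Comparing Z-scores across energy functions on the same fold and the same random set gives one way to rank the functions, while the wall-clock time per call gives the speed side of the comparison.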

Discussion

Accurate models (for example, all-atom models that incorporate van der Waals effects, electrostatic interactions, amino acid rotamer information and other important physical principles) provide precise and realistic energies for a single protein structure. However, the computational time spent calculating each variant in a population of similar proteins would make population structural genomic studies impossible, even with the largest supercomputers. Therefore, computationally fast, more

Conclusion

In this study we have compared various knowledge-based and force field methods for protein folding and ligand binding specificity for four different folds: SH2, SH3, Globin-like, and Flavodoxin-like. One knowledge-based energy function (Model 1) showed the best results in differentiating native protein sequences from random sequences in short computational times. On the other hand, protein force field methods showed the best results in characterizing binding specificity for SH2 domain

Acknowledgments

We are grateful to Arne Elofsson, Knut Teigen, and Jessica Liberles for helpful discussions. Funding for this work was provided by FUGE, the Norwegian functional genomics research platform.

References (35)

M. Gerstein, A structural census of genomes: comparing bacterial, eukaryotic, and archaeal genomes in terms of protein structure, J. Mol. Biol. (1997)
F.N. Braun et al., Retention of enzyme gene duplicates by subfunctionalization, Int. J. Biol. Macromol. (2003)
P.D. Williams et al., Evolution of functionality in lattice proteins, J. Mol. Graph. Model. (2001)
D.M. Taverna et al., Why are proteins so robust to site mutations?, J. Mol. Biol. (2002)
I. Bahar et al., Inter-residue potentials in globular proteins and the dominance of highly specific hydrophilic interactions at close separation, J. Mol. Biol. (1997)
N. Kurt et al., Structure-based prediction of potential binding and nonbinding peptides to HIV-1 protease, Biophys. J. (2003)
M. Wiederstein et al., Protein sequence randomization: efficient estimation of protein stability using knowledge-based potentials, J. Mol. Biol. (2005)
R. Samudrala et al., An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction, J. Mol. Biol. (1998)
M. Lynch et al., On the origins of genome complexity, Science (2003)
F.N. Braun et al., Repeat modulated population genetic effects in fungal proteins, J. Mol. Evol. (2004)
U. Bastolla et al., Statistical properties of neutral evolution, J. Mol. Evol. (2003)
M. Lynch, Simple evolutionary pathways to complex proteins, Protein Sci. (2005)
S. Govindarajan et al., Evolution of model proteins on a foldability landscape, Proteins Struct. Funct. Genet. (1997)
G. Tiana et al., The evolution dynamics of model proteins, J. Chem. Phys. (2004)
S. Rastogi et al., Subfunctionalization of duplicated genes as a transition state to neofunctionalization, BMC Evol. Biol. (2005)
H.S. Chan et al., Origins of structure in globular proteins, Proc. Natl. Acad. Sci. U. S. A. (1990)
M. Vendruscolo et al., Pairwise contact potentials are unsuitable for protein folding, J. Chem. Phys. (1998)
