Evaluation of models for the evolution of protein sequences and functions under structural constraint
Introduction
Molecular evolution at the structural level plays an important role in structural genomics, which has been concerned with describing the “parts list” of protein structures that are used to construct various genomes [1]. At the same time, comparative genome analysis and population genetic theory have led to the hypothesis that many of the differences between genomes can be explained by differences in effective population size during the evolution of organisms [2]. In attempting to build a bridge between these two views of genome evolution, both a theoretical [3], [4], [5], [6] and a lattice modeling framework [7], [8], [9], [10] have been used for linking the evolution of sequence through structure to function. Three-dimensional windows or contact maps have been developed to study the physical interactions governing protein folding [11], [12] and, more recently, the evolution of individual regions of a protein [13], [14]. These can be used to extend lattice modeling studies to real proteins with different folds. To enable such simulation studies, effective coarse-grained methods are needed to evaluate the folding and binding capabilities of real protein structures, ultimately enabling an understanding of the underlying rules that dictate the “parts list” for any genome.
In our previous work, we studied the mechanism of protein evolution through gene duplication followed by mutation and selection, using three-dimensional protein lattices [10]. In addition to functional selective pressures, we evolved the protein lattices under structural selective pressure based on folding energy, calculated from defined interactions at the lattice points and contact potentials derived from a contact energy matrix [12], [15]. Lattice models have previously been used in this context to make important predictions about the behavior of proteins in evolutionary contexts, including their metastability [16].
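To make the idea of a contact-based folding energy concrete, the following is a minimal sketch, not the implementation used in [10]. It assumes a cubic lattice, treats two residues as being in contact when they sit on adjacent lattice points and are not chain neighbours, and uses a hypothetical two-letter (H/P) energy table in place of the contact energy matrix of [12], [15]. The names `CONTACT_ENERGY`, `pair_energy`, and `folding_energy` are illustrative.

```python
from itertools import combinations

# Hypothetical toy contact energies (HP-style), NOT the matrix from [12], [15].
CONTACT_ENERGY = {
    ("H", "H"): -1.0,
    ("H", "P"): 0.0,
    ("P", "P"): 0.0,
}


def pair_energy(a, b):
    """Look up a symmetric contact energy for residue types a and b."""
    return CONTACT_ENERGY.get((a, b), CONTACT_ENERGY.get((b, a), 0.0))


def folding_energy(sequence, coords):
    """Sum contact energies over non-bonded residues on adjacent lattice points."""
    if len(sequence) != len(coords):
        raise ValueError("sequence and coords must have the same length")
    energy = 0.0
    for i, j in combinations(range(len(sequence)), 2):
        if j - i < 2:
            continue  # skip covalently bonded chain neighbours
        # Manhattan distance 1 means the residues occupy adjacent lattice points.
        dist = sum(abs(p - q) for p, q in zip(coords[i], coords[j]))
        if dist == 1:
            energy += pair_energy(sequence[i], sequence[j])
    return energy


# A 4-residue chain folded into a square in the z = 0 plane.
square = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
print(folding_energy("HPPH", square))  # residues 0 and 3 form the only contact
```

Swapping the toy table for a 20 × 20 residue matrix changes only the lookup; the contact-counting loop stays the same.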
Because proteins are robust to site mutations and plastic, in that they accept mutations without destroying the fold [16], [17], [18], developing an evolutionary model that combines population genetics theory with the evolution of lattice (or real protein-encoding) genes under either statistical or physical (force field) energy constraints is another challenge. The evolutionary constraints on proteins include that they must perform a function and must be stable enough to perform that function reliably while resisting unfolding, aggregation, and proteolysis.
In fact, a new field is emerging, as large-scale gene and genome sequencing is enabling the comparison not only of closely related species but, increasingly, of populations within a species. Simultaneously, the field of molecular evolution is increasingly interested in models of sequence evolution that incorporate structure. This combination is bringing together traditional population genetics with structural biology, where population-level variation can be examined not only at the sequence level, but also at the protein structural level in analyses characterizing variation in protein function.
One of the central problems in developing evolutionary models for proteins is developing an empirical energy function whose global minimum occurs when the protein is folded into the native state. Such a function also helps in analyzing the effect of mutations on evolving protein molecules, by testing whether the global minimum has shifted. Two classes of methods are presently available for designing empirical energy functions. Knowledge-based methods depend upon contact interaction matrices derived from known protein folds in the PDB and are widely used in molecular evolutionary studies. Alternatively, force field methods are parameterized with the forces governing interactions between atoms in proteins.
Some recent work has focused on deriving knowledge-based energy matrices both for long range and short range interactions between amino acids in protein folds based on RMSD between α carbons, the torsion and bond angle changes of virtual Cα–Cα bonds, and the coupling between them [19], [20]. Another method involves deriving energy parameters for simplified models of folding based on the maximization of the thermodynamic average of the overlap between protein native structures and a Boltzmann ensemble of alternative structures [21]. Lastly, a third approach uses an all-atom model and a physical energy function [22].
In addition to developing an energy function, another challenge facing us is fast and accurate side-chain conformation prediction. An efficient approach uses results from graph theory to solve the combinatorial problem encountered in the side-chain prediction problem [23]. In this method, side chains are represented as vertices in an undirected graph. Any two residues that have rotamers with nonzero interaction energies are considered to have an edge in the graph. The resulting graph can be partitioned into connected subgraphs with no edges between them. These subgraphs can in turn be broken into biconnected components, which are graphs that cannot be disconnected by removal of a single vertex. The combinatorial problem is reduced to finding the minimum energy of these small biconnected components and combining the results to identify the global minimum energy conformation.
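The graph decomposition described above can be sketched as follows. This is an illustrative, standard-library-only sketch and not the implementation of [23]: it assumes the interacting residue pairs have already been determined, and the function names (`build_graph`, `connected_components`, `biconnected_components`) and the residue labels are hypothetical. The biconnected components are found with the standard depth-first (Hopcroft-Tarjan) approach using an edge stack.

```python
def build_graph(residues, interacting_pairs):
    """Undirected graph: one vertex per residue, one edge per interacting pair."""
    adj = {r: set() for r in residues}
    for a, b in interacting_pairs:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def connected_components(adj):
    """Return the vertex sets of the connected subgraphs."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        seen.add(start)
        while stack:
            v = stack.pop()
            comp.add(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(comp)
    return components


def biconnected_components(adj):
    """Return vertex sets of biconnected components (each has at least one edge)."""
    depth, low, edge_stack, result = {}, {}, [], []

    def dfs(v, parent, d):
        depth[v] = low[v] = d
        for w in adj[v]:
            if w == parent:
                continue
            if w not in depth:  # tree edge
                edge_stack.append((v, w))
                dfs(w, v, d + 1)
                low[v] = min(low[v], low[w])
                if low[w] >= depth[v]:  # v separates w's subtree from the rest
                    comp = set()
                    while True:
                        a, b = edge_stack.pop()
                        comp.update((a, b))
                        if (a, b) == (v, w):
                            break
                    result.append(comp)
            elif depth[w] < depth[v]:  # back edge to an ancestor
                edge_stack.append((v, w))
                low[v] = min(low[v], depth[w])

    for v in adj:
        if v not in depth:
            dfs(v, None, 0)
    return result


# Hypothetical example: a triangle A-B-C, a pendant residue D, a pair F-G,
# and a residue H with no interacting neighbours.
adj = build_graph("ABCDFGH", [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("F", "G")])
print(sorted(sorted(c) for c in connected_components(adj)))
print(sorted(sorted(c) for c in biconnected_components(adj)))
```

In this sketch a residue with no interacting neighbours (H) forms its own connected subgraph but no biconnected component, since its best rotamer can be chosen independently. Residue C appears in two biconnected components because removing it disconnects D from the triangle. The recursive search is adequate for small graphs; very large graphs would need an iterative version.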
In this study we compare various computationally fast methods to evaluate which methods are best able to characterize the folding and binding of real proteins for use in population genomic studies where large numbers of sequences, folds, and mutations need to be evaluated. We further evaluate how these methods can be used to model the evolution of new binding functionalities in the absence of gene duplication. Ultimately, such methods can be used towards developing a better understanding of the structural “parts list” found differentially in various genomes.
Methods
In order to evaluate various methods for folding and binding, we have analyzed four protein folds, one from each of the categories only α (containing only alpha helices), only β (containing only beta sheets), α + β (mainly antiparallel beta sheets, with segregated alpha and beta regions) and α/β (mainly parallel beta sheets, with beta–alpha–beta units): Globin-like, SH3 domain, SH2 domain and Flavodoxin-like, respectively. We have downloaded all coordinate files for these structures from the Protein Data Bank (PDB) [24]
Results
The folding energies of a set of PDB files (Table 1) were analyzed with various methods by threading both the native sequence and a set of random sequences through the established folds. It is assumed that only a very small fraction of random sequences will preferentially fold into any particular fold and that a method that differentiates sequences that are known to fold into a particular structure over random sequences therefore performs better. In every case the conformation of side chains
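One common way to quantify how well an energy function separates a native sequence from random sequences threaded through the same fold is a Z-score. The sketch below illustrates that idea only; it is not the scoring procedure of this study. It assumes the fold is reduced to a fixed list of residue-residue contacts, that random sequences are generated by shuffling the native sequence (which preserves composition), and that a toy H/P energy stands in for the real potentials. All names (`threading_energy`, `native_z_score`, `hp_energy`) are hypothetical.

```python
import random
import statistics


def threading_energy(sequence, contacts, pair_energy):
    """Sum pairwise energies over the fixed contact list of a fold."""
    return sum(pair_energy(sequence[i], sequence[j]) for i, j in contacts)


def native_z_score(native, contacts, pair_energy, n_random=1000, seed=0):
    """Z-score of the native energy relative to shuffled-sequence decoys."""
    rng = random.Random(seed)
    native_e = threading_energy(native, contacts, pair_energy)
    residues = list(native)
    decoy_energies = []
    for _ in range(n_random):
        rng.shuffle(residues)  # assumption: decoys keep the native composition
        decoy_energies.append(threading_energy(residues, contacts, pair_energy))
    mean = statistics.mean(decoy_energies)
    sd = statistics.pstdev(decoy_energies)
    if sd == 0:
        raise ValueError("decoy energies have zero variance")
    return (native_e - mean) / sd


def hp_energy(a, b):
    """Hypothetical toy potential: only H-H contacts are favourable."""
    return -1.0 if a == "H" and b == "H" else 0.0


# Toy fold whose four contacts all pair H residues in the native sequence.
native = "HPHPPHPH"
contacts = [(0, 5), (0, 7), (2, 5), (2, 7)]
z = native_z_score(native, contacts, hp_energy)
print(f"native energy = {threading_energy(native, contacts, hp_energy)}, Z = {z:.2f}")
```

A more negative Z-score means the native sequence is better separated from the decoys, so under this assumption methods could be ranked by the Z-scores they produce on the same folds.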
Discussion
Accurate models (for example, all-atom models that incorporate van der Waals effects, electrostatic interactions, amino acid rotamer information and other important physical principles) provide precise and realistic energies for a single protein structure. However, the computational time spent calculating each variant in a population of similar proteins would make population structural genomic studies impossible, even with the largest supercomputers. Therefore, computationally fast, more
Conclusion
In this study we have compared various knowledge-based methods and force field methods for protein folding and ligand binding specificity for four different folds: SH2, SH3, Globin-like, and Flavodoxin-like. One knowledge-based energy function (Model 1) showed the best results in differentiating native protein sequences from random sequences in short computational times. On the other hand, protein force field methods showed the best results in characterizing binding specificity for SH2 domain
Acknowledgments
We are grateful to Arne Elofsson, Knut Teigen, and Jessica Liberles for helpful discussions. Funding for this work was provided by FUGE, the Norwegian functional genomics research platform.
References (35)
A structural census of genomes: comparing bacterial, eukaryotic, and archaeal genomes in terms of protein structure. J. Mol. Biol. (1997)
Retention of enzyme gene duplicates by subfunctionalization. Int. J. Biol. Macromol. (2003)
Evolution of functionality in lattice proteins. J. Mol. Graph. Model. (2001)
Why are proteins so robust to site mutations? J. Mol. Biol. (2002)
Inter-residue potentials in globular proteins and the dominance of highly specific hydrophilic interactions at close separation. J. Mol. Biol. (1997)
Structure-based prediction of potential binding and nonbinding peptides to HIV-1 protease. Biophys. J. (2003)
Protein sequence randomization: efficient estimation of protein stability using knowledge-based potentials. J. Mol. Biol. (2005)
An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction. J. Mol. Biol. (1998)
On the origins of genome complexity. Science (2003)
Repeat modulated population genetic effects in fungal proteins. J. Mol. Evol. (2004)