Progress and challenges in protein structure prediction
Introduction
In recent years, despite many debates, structural genomics has probably been one of the most noteworthy efforts in protein structure determination. It aims to obtain 3D models of all proteins through an optimized combination of experimental structure solution and computer-based structure prediction [1, 2•]. Two factors will dictate the success of structural genomics: experimental structure determination of optimally selected proteins and efficient computer modeling algorithms. Based on the roughly 40 000 structures in the PDB library (many of which are redundant) [3], 4 million models/fold-assignments can be obtained by a simple combination of PSI-BLAST search and comparative modeling [4•]. The development of more sophisticated and automated computer modeling approaches will dramatically enlarge the scope of modelable proteins in the structural genomics project.
The crucial problems in the field of protein structure prediction are twofold: first, for sequences that have similar structures in the PDB (especially structures that are only weakly or distantly homologous to the target), how to identify the correct templates and how to refine the template structure closer to the native; second, for sequences without appropriate templates, how to build models of correct topology from scratch. Progress along these directions was assessed in the recent CASP7 experiment [5] under the categories of template-based modeling (TBM) and free modeling (FM). Here, I review recent progress and challenges in these directions.
Template-based modeling
The canonical TBM procedure consists of four steps: first, finding known structures (templates) related to the sequence to be modeled (target); second, aligning the target sequence to the template structure; third, building the structural framework by copying the aligned regions or by satisfying spatial restraints derived from the templates; fourth, constructing the unaligned loop regions and adding side-chain atoms. The first two steps are actually done in a single procedure called threading (or
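As a rough illustration of the second and third steps listed above, the following Python sketch aligns a target sequence to a template sequence and copies the template Cα coordinates for the aligned residues. It is a minimal toy under stated assumptions, not the method of any threading program: the function names `global_align` and `copy_framework`, the identity-based scoring values, and the example sequences and coordinates are all hypothetical. Real threading methods score alignments with sequence profiles and structural terms rather than simple residue identity.

```python
from typing import Dict, List, Optional, Tuple

Coord = Tuple[float, float, float]
Pair = Tuple[Optional[int], Optional[int]]


def global_align(target: str, template: str, match: int = 2,
                 mismatch: int = -1, gap: int = -2) -> List[Pair]:
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (target_index, template_index) pairs; None marks a gap.
    The scoring values are illustrative assumptions only.
    """
    n, m = len(target), len(template)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if target[i - 1] == template[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner, preferring diagonal moves.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if target[i - 1] == template[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                pairs.append((i - 1, j - 1))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((i - 1, None))   # target residue has no template partner
            i -= 1
        else:
            pairs.append((None, j - 1))   # template residue is skipped
            j -= 1
    pairs.reverse()
    return pairs


def copy_framework(pairs: List[Pair],
                   template_ca: List[Coord]) -> Tuple[Dict[int, Coord], List[int]]:
    """Copy template C-alpha coordinates onto aligned target residues.

    Returns the partial model and the target indices left unaligned,
    which would still need loop building (the fourth step).
    """
    model: Dict[int, Coord] = {}
    unaligned: List[int] = []
    for t_idx, p_idx in pairs:
        if t_idx is None:
            continue
        if p_idx is None:
            unaligned.append(t_idx)
        else:
            model[t_idx] = template_ca[p_idx]
    return model, unaligned


# Hypothetical example: the template lacks one residue present in the target.
target_seq = "MKTAYIAK"
template_seq = "MKTYIAK"
template_ca = [(3.8 * k, 0.0, 0.0) for k in range(len(template_seq))]

model, unaligned = copy_framework(global_align(target_seq, template_seq), template_ca)
print("modeled residues:", sorted(model))
print("residues needing loop modeling:", unaligned)
```

In this toy case the inserted residue of the target has no template counterpart, so it is reported as unaligned and would be left to a loop-construction step.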
Free modeling
When structural analogs do not exist in the PDB library or cannot be successfully identified by threading (which is more often the case, as shown in Figure 1), the structure has to be predicted from scratch. This type of prediction has been termed ‘ab initio’ or ‘de novo’ modeling, a term that may easily be understood as modeling ‘from first principles’. In CASP7, it is named ‘free modeling’, which I think reflects more appropriately the status of the field, since the most
Conclusions
Since a detailed physicochemical description of protein folding principles does not yet exist, the protein structure prediction problem is largely defined by the evolutionary or structural distance between the target and the solved proteins in the PDB library. For proteins with close templates, full-length models can be constructed by copying the template framework. Recent studies show that, when using the best possible template structures in the PDB, the state-of-the-art modeling algorithms could
References and recommended reading
Papers of particular interest, published within the annual period of review, have been highlighted as:
• of special interest
•• of outstanding interest
Acknowledgements
The project is supported in part by KU Start-up Fund 06194, the Alfred P. Sloan Foundation, and Grant Number R01GM083107 of the National Institute of General Medical Sciences.
References (51)
GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J Mol Biol (1999)
LiveBench-8: the large-scale, continuous assessment of automated protein structure prediction. Protein Sci (2005)
MUSTER: improving protein sequence profile–profile alignments by using multiple sources of structure information. Proteins (2008)
Molecular dynamics in the endgame of protein structure prediction. J Mol Biol (2001)
Assessment of CASP7 predictions for template-based modeling targets. Proteins (2007)
Structural genomics: beyond the human genome project. Nat Genet (1999)
The impact of structural genomics: expectations and outcomes. Science (2006)
The Protein Data Bank. Nucleic Acids Res (2000)
MODBASE: a database of annotated comparative protein structure models and associated resources. Nucleic Acids Res (2006)
Critical assessment of methods of protein structure prediction (CASP), round VII. Proteins (2007)
A method to identify protein sequences that fold into a known three-dimensional structure. Science
A new approach to protein fold recognition. Nature
TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res
The protein structure prediction problem could be solved using the current PDB library. Proc Natl Acad Sci U S A
Development and large scale benchmark testing of the PROSPECTOR 3.0 threading algorithm. Protein
FFAS03: a server for profile–profile sequence alignments. Nucleic Acids Res
Fold recognition by combining sequence profiles derived from evolution and from depth-dependent structural alignment of fragments. Proteins
ORFeus: detection of distant homology using sequence profiles and predicted secondary structure. Nucleic Acids Res
FUGUE: sequence–structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J Mol Biol
Hidden Markov models for detecting remote protein homologies. Bioinformatics
Protein homology detection by HMM–HMM comparison. Bioinformatics
A machine learning information retrieval approach to protein fold recognition. Bioinformatics
Profile analysis: detection of distantly related proteins. Proc Natl Acad Sci U S A
COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. J Mol Biol
CAFASP3: the third critical assessment of fully automated structure prediction methods. Proteins