Abstract
Obtaining diffraction-quality crystals remains a major challenge in protein structure research. We summarize and compare methods for selecting the best protein targets for crystallization, for construct optimization, and for crystallization condition design. Target selection methods are divided into algorithms that predict the chance of successful progression through all stages of structure determination (from cloning to solving the structure) and those that focus only on the crystallization step. We highlight the pros and cons of the different approaches with respect to data size, redundancy and representativeness, overfitting during model construction, and evaluation of results. In summary, although progress has been made in recent years and several sequence properties have been reported to be relevant for crystallization, successfully predicting protein crystallization behavior and selecting the corresponding crystallization conditions continue to challenge structural researchers.
© 2016 Springer Science+Business Media New York
About this protocol
Cite this protocol
Smialowski, P., Wong, P. (2016). Protein Crystallizability. In: Carugo, O., Eisenhaber, F. (eds) Data Mining Techniques for the Life Sciences. Methods in Molecular Biology, vol 1415. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-3572-7_17
DOI: https://doi.org/10.1007/978-1-4939-3572-7_17
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-3570-3
Online ISBN: 978-1-4939-3572-7
eBook Packages: Springer Protocols