Learning the local landscape of protein structures with convolutional neural networks

  • Original Paper
Journal of Biological Physics

Abstract

One fundamental problem of protein biochemistry is to predict protein structure from amino acid sequence. The inverse problem, predicting either entire sequences or individual mutations that are consistent with a given protein structure, has received much less attention even though it has important applications in both protein engineering and evolutionary biology. Here, we ask whether 3D convolutional neural networks (3D CNNs) can learn the local fitness landscape of protein structure to reliably predict either the wild-type amino acid or the consensus in a multiple sequence alignment from the local structural context surrounding a site of interest. We find that the network can predict the wild type with good accuracy, and that network confidence is a reliable measure of whether a given prediction is likely to be correct. Predictions of the consensus are less accurate and are primarily driven by whether or not the consensus matches the wild type. Our work suggests that high-confidence mis-predictions of the wild type may identify sites that are primed for mutation and are likely targets for protein engineering.
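
To make the link between network confidence and prediction accuracy concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the network outputs one probability per amino acid for each site and that "confidence" means the largest of those probabilities; the function names, the residue ordering, and the toy data are all hypothetical.

```python
from collections import defaultdict

# Assumed class ordering: the 20 canonical residues, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def predict_with_confidence(probs):
    """Return (predicted residue, confidence) for one site.

    `probs` holds one probability per residue, ordered as in AMINO_ACIDS.
    Confidence is assumed here to be the largest probability.
    """
    if len(probs) != len(AMINO_ACIDS):
        raise ValueError("expected one probability per amino acid")
    best = max(range(len(probs)), key=lambda i: probs[i])
    return AMINO_ACIDS[best], probs[best]


def accuracy_by_confidence(sites, n_bins=5):
    """Group sites into equal-width confidence bins and score each bin.

    `sites` is an iterable of (probs, wild_type) pairs. Returns a dict
    mapping bin index -> (number of sites, fraction predicted correctly).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for probs, wild_type in sites:
        residue, confidence = predict_with_confidence(probs)
        # A confidence of exactly 1.0 is placed in the top bin.
        bin_index = min(int(confidence * n_bins), n_bins - 1)
        totals[bin_index] += 1
        hits[bin_index] += residue == wild_type
    return {b: (totals[b], hits[b] / totals[b]) for b in sorted(totals)}


def one_peak(residue, peak):
    """Toy distribution: `peak` on one residue, the rest spread evenly."""
    rest = (1.0 - peak) / (len(AMINO_ACIDS) - 1)
    return [peak if aa == residue else rest for aa in AMINO_ACIDS]


# Hypothetical sites: (predicted distribution, wild-type residue).
toy_sites = [
    (one_peak("L", 0.95), "L"),  # confident and correct
    (one_peak("G", 0.90), "G"),  # confident and correct
    (one_peak("A", 0.30), "V"),  # uncertain and wrong
    (one_peak("K", 0.85), "R"),  # confident but wrong
]

result = accuracy_by_confidence(toy_sites)
for bin_index, (count, accuracy) in result.items():
    print(f"confidence bin {bin_index}: {count} site(s), accuracy {accuracy:.2f}")
```

In this toy data the last site is a high-confidence mis-prediction of the wild type, which is the kind of site the abstract suggests may be primed for mutation.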

Data availability

Analysis scripts and processed data are available on GitHub: https://github.com/akulikova64/CNN_protein_landscape. Trained neural networks and the training set protein chains and microenvironments have been deposited at the Texas Data Repository and are available at: https://doi.org/10.18738/T8/8HJEF9.

Acknowledgements

We thank Raghav Shroff for writing our initial PDB to CIF converter.

Funding

This work was supported by grants from the Welch Foundation (F-1654), the Department of Defense – Defense Threat Reduction Agency (HDTRA12010011), and the National Institutes of Health (R01 AI148419). We would like to thank AMD for the donation of critical hardware and support resources from its HPC Fund that made this work possible. C.O.W. also acknowledges funding from the Jane and Roland Blumberg Centennial Professorship in Molecular Evolution and the Dwight W. and Blanche Faye Reeder Centennial Fellowship in Systematic and Evolutionary Biology at UT Austin.

Author information

Contributions

All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by A. V. Kulikova, D. J. Diaz, and J. M. Loy. The first draft of the manuscript was written by A. V. Kulikova and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Claus O. Wilke.

Ethics declarations

Conflicts of interest/Competing interests

The authors declare that they have no conflict of interest.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: The Revolutionary Impact of Landscapes in Biology

Guest Editors: Robert Austin, Shyamsunder Erramilli, Sonya Bahar.

Supplementary information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 447 KB)

About this article

Cite this article

Kulikova, A.V., Diaz, D.J., Loy, J.M. et al. Learning the local landscape of protein structures with convolutional neural networks. J Biol Phys 47, 435–454 (2021). https://doi.org/10.1007/s10867-021-09593-6

  • DOI: https://doi.org/10.1007/s10867-021-09593-6
