AlphaFold heralds a data-driven revolution in biology and medicine

Thornton, Janet M.; Laskowski, Roman A.; Borkakoti, Neera

doi:10.1038/s41591-021-01533-0

Comment
Published: 12 October 2021

Nature Medicine volume 27, pages 1666–1669 (2021)

Protein structures predicted using artificial intelligence will aid medical research, but the greatest benefit will come if clinical data can be similarly used to better understand human disease.

The protein structure prediction problem is the question of how a protein’s sequence of amino acids results in its fully folded three-dimensional structure. This has presented a formidable computational challenge for many decades.

At the end of 2020, a significant advance was announced by DeepMind, a London-based artificial intelligence (AI) company now part of Google’s parent firm, Alphabet Inc. DeepMind’s AlphaFold 2 program had significantly outperformed other methods in the biennial Critical Assessment of protein Structure Prediction (CASP)¹, producing models of a quality approaching that of experimental determination. AlphaFold 2 has since been published² and, more recently, the source code and almost 350,000 protein models from various species, including human, have been made public³. This trove of protein structures has implications for both experimental and computational structural biology, and beyond⁴,⁵,⁶,⁷, but here we consider its possible bearing on medicine.

AlphaFold 2 uses data gathered by structural biologists and made publicly available by the worldwide Protein Data Bank (wwPDB)⁸—which currently holds over 180,000 experimentally determined structures. It is commendable that DeepMind has released the code and predictions for everyone to use.

Over 350,000 protein models have been made available on the AlphaFold Protein Structure Database at the European Molecular Biology Laboratory–European Bioinformatics Institute (EMBL-EBI), with tools to view and interrogate the structures³. These proteins come from 21 species, including the most common model organisms and some notable pathogens—Leishmania infantum, Mycobacterium tuberculosis, Plasmodium falciparum and Trypanosoma cruzi. Before the end of the year, DeepMind expect to release models covering UniRef90, a unique sample of all known protein sequences comprising 130 million proteins.

Although protein structures do not of themselves lead to new medicines, they often provide a better understanding of the molecular mechanisms of a protein and in so doing offer insights into how the protein works and how its modulation might lead to a disease or a therapy. Over the past 50 years, protein structures have been an integral part of drug design efforts, with many large pharma companies establishing their own structural biology teams. Structural data have played a critical role both in determining the druggability of a given protein target⁹ and then in enabling the design of small-molecule drugs that will bind to it⁷.

Variable quality

The AlphaFold AI program rapidly generates models of protein structures from their amino acid sequence more accurately than had previously been achieved. The accuracy of the models is variable (both within and between models) depending on the protein, but, importantly, a confidence measure is provided at each residue position by the predicted local distance difference test (pLDDT) score.
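The per-residue confidence measure lends itself to simple filtering. The following is a minimal sketch, not part of the original analysis. It assumes a single-chain model file in PDB format with the pLDDT value stored in the B-factor column, as in the files distributed by the AlphaFold Protein Structure Database. It also assumes the confidence bands shown on that database's pages (very high above 90, confident 70 to 90, low 50 to 70, very low below 50), with the lower bound treated as inclusive here. All function names, including the `toy_atom_record` helper used to build example input, are hypothetical.

```python
from collections import Counter

# Assumed confidence bands (lower bound inclusive), checked from highest to lowest.
BANDS = [(90.0, "very high"), (70.0, "confident"), (50.0, "low"), (0.0, "very low")]


def plddt_per_residue(pdb_text):
    """Return {residue_number: pLDDT} read from the C-alpha ATOM records.

    Assumes fixed-column PDB format, one chain per file, and pLDDT in the
    B-factor field (columns 61-66).
    """
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            res_num = int(line[22:26])             # residue sequence number
            scores[res_num] = float(line[60:66])   # B-factor field holds pLDDT
    return scores


def band(score):
    """Map a pLDDT value to its (assumed) confidence band."""
    for lower, label in BANDS:
        if score >= lower:
            return label
    return "very low"


def band_fractions(scores):
    """Fraction of residues falling in each confidence band."""
    if not scores:
        return {label: 0.0 for _, label in BANDS}
    counts = Counter(band(s) for s in scores.values())
    return {label: counts.get(label, 0) / len(scores) for _, label in BANDS}


def toy_atom_record(serial, atom_name, res_name, res_num, plddt):
    """Hypothetical helper: build one fixed-column ATOM record for testing."""
    return "ATOM  %5d %-4s %3s A%4d    %8.3f%8.3f%8.3f%6.2f%6.2f" % (
        serial, atom_name, res_name, res_num, 0.0, 0.0, 0.0, 1.0, plddt
    )
```

Calling `band_fractions(plddt_per_residue(text))` on the text of a model file gives a quick view of how much of a protein is modeled with confidence, which is the kind of within-model variability described above.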

The predictions for single-chain, structured proteins are remarkably good—indeed, comparable in quality to those from experimental structure determination. However, the quality of the predictions depends on the length of the protein and its flexibility.

Not all protein structure predictions are of equal value. Figure 1 highlights three example predictions, showing the good, the bad and the ugly. Figure 2a provides an overview of the coverage (experimental and predicted) and quality of structures for the human proteome. Figure 2b illustrates the distribution of quality scores for the human sequences.

**Fig. 1: The good, the bad and the ugly.**

**Fig. 2: Confidence scores for AlphaFold models.**

A new structure prediction pipeline

Despite the varying quality of the new structures, SWISS-MODEL¹⁰ has already installed the code from AlphaFold to complement its existing structure prediction pipelines, while other groups have added the models to their databases of protein information, for example UniProt¹¹ and PDBsum¹². ColabFold¹³ provides tools for modeling multi-chain homo- and hetero-complexes using both the AlphaFold and RoseTTAFold¹⁴ models. Another use of the models is in the interpretation of low-resolution electron microscopy data, especially where the protein shows flexibility between domains.

However, there are major limitations to the relevance of the AlphaFold data to the design of therapeutics. In particular, large multi-domain and flexible proteins are still not modeled very well, and the models lack any ligands (small molecules, DNA, cofactors, metals and other proteins) and therefore do not provide any interaction data, which are especially relevant for elucidating function.

Initially, the AlphaFold models will be used in exactly the same way as experimental structural data (and indeed will be used to help determine low-resolution experimental structures). We see four areas of immediate potential impact for medicine (see Fig. 3).

**Fig. 3: Using AlphaFold for drug design and disease-associated variants.**

Therapeutic design

Most small-molecule drugs are designed with the benefit of structural insights¹⁵. Future design programs (whether for small molecules, biologics, biosimilars or proteolysis targeting chimeras (PROTAC) therapeutics) will use the models from AlphaFold whenever an experimental structure is not available.

For human sequences, the novel coverage is actually rather small (Fig. 2b), especially for those proteins for which drugs have already been developed. It is, of course, invaluable to know the prospective ligand-binding site, preferably with a structure of the complex with a ligand (Fig. 3a). As the predicted models lack all ligands, however, this requires docking approaches, with their varying reliability.

Comparative analyses of the target proteins with AlphaFold models of similar proteins may be used to generate more specific drugs, such as drugs with potentially fewer toxic side effects. In addition, AlphaFold data from different species may be studied to make more informed choices as to the most suitable animal model for testing potential medicines targeted towards humans.

Better drugs and more validated targets are always needed, and although protein structural data may contribute to this, designing small molecules using protein structures at the start of a drug development program is rarely the bottleneck in the time taken to launch a new drug onto the market.

Human pathogenic variants

Structural data help to identify pathogenic variants in humans—that is, those that cause disease¹⁶. A current challenge is to identify such pathogenic variants (for example, in developmental diseases or cancer progression) among the many variants observed in an individual’s genome. Almost 50% of known variants are classified as variants of unknown significance (VUSs) in ClinVar¹⁷, a database of genomic variation and its relationship to human health.

AlphaFold has limited value for modeling the effects of individual mutations, although reliable models may be used to identify likely binding sites, enzyme active sites, interfaces or structural constraints, and so identify those variants that are more likely to be pathogenic than those that can be benignly replaced by other amino acids (Fig. 3b).
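One way to read this triage idea is as a simple filter: a variant merits closer inspection only if it falls in a confidently modeled region and lies close in space to a residue already annotated as functional. The sketch below illustrates that reading under stated assumptions and is not the authors' method. The pLDDT cutoff of 70, the 8 Å C-alpha distance cutoff, the label strings and every name in the code are hypothetical choices, and both input dictionaries are assumed to cover the same residues.

```python
import math


def triage_variants(variant_positions, plddt, ca_coords, site_residues,
                    min_plddt=70.0, max_dist=8.0):
    """Label each variant position by model confidence and proximity to a site.

    plddt:         {residue_number: pLDDT score}
    ca_coords:     {residue_number: (x, y, z)} C-alpha coordinates in angstroms
    site_residues: residue numbers annotated as functional (binding site,
                   active site, interface); the annotation source is assumed
    """
    labels = {}
    for pos in variant_positions:
        # Do not interpret positions where the model itself is unreliable.
        if plddt.get(pos, 0.0) < min_plddt:
            labels[pos] = "low confidence"
            continue
        # Distance to the nearest confidently modeled functional residue.
        nearest = min(
            (math.dist(ca_coords[pos], ca_coords[site])
             for site in site_residues
             if plddt.get(site, 0.0) >= min_plddt),
            default=math.inf,
        )
        labels[pos] = "near site" if nearest <= max_dist else "not near site"
    return labels
```

The labels are prioritization hints, not pathogenicity calls: a "near site" variant is a hypothesis to test, and a "low confidence" label says only that the model should not be used to interpret that position.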

Most functions predicted from sequences or structures rely on close or distant evolutionary relationships. Predicted structures potentially allow one to see further back in evolutionary time, to identify the most distant relatives—from which some functional inference may be drawn.

Drug targets in pathogens

Structural coverage of pathogens in the wwPDB is often much less than for model organisms. With the larger release of data promised for later in 2021, however, predicted structures for many new organisms will be made available.

Protein structures from pathogens such as viruses, bacteria and fungi can be used to assess druggability and possible cross-reactions with human proteins and to aid in the design of medications targeted toward multiple pathogens. Identifying drug targets in infectious agents may offer the most accessible low-hanging fruit in the short term, and indeed DeepMind is already collaborating with the Drugs for Neglected Diseases initiative and other partners.

Enhance vaccine and antibody design

With the COVID-19 pandemic and the development of SARS-CoV-2 vaccines, knowledge of the antigenic spike protein structure has assisted in understanding the surface topology of the virus and its antigenicity.

Amazingly, as of 3 September 2021, there were 1,491 structures of SARS-CoV-2 proteins in the wwPDB¹⁸, contributed by laboratories all around the world. The possibility of predicting viral spike proteins accurately will provide very rapid analysis compared to experimental structure determination for emerging viruses in future pandemics.

A data-driven revolution

The impact of the protein structures from AlphaFold in medicine is potentially substantial. However, AlphaFold is most likely to be just the start of a revolution based on data-driven prediction in biology and medicine. Biological processes at all levels (intracellular, intercellular, organoid and organism) involve interactions between molecules.

Although current AlphaFold predictions are limited to single protein chains and do not provide explicit information about interactions with other molecules, new AI-based tools could predict such interactions across the proteome—delving into different complexes in different cell types, which change with the environment and over time. In the longer term, AI methods will be developed and applied to many aspects of protein structures to improve predictability.

Projects such as the Earth BioGenome Project¹⁹ and Darwin Tree of Life²⁰ that ultimately seek to sequence all living organisms will generate masses of new protein sequence data. AlphaFold 2 is the first step to generating the whole structural proteomes for all of these different species. The challenge is then to interpret these genomes in terms of each organism’s body shape, development, behavior and natural history, using genotype-to-phenotype studies. Natural products have been the basis for many drugs, so elucidating the genomes of many new species may ultimately lead to novel nature-inspired therapies. No doubt AI methods will be extensively employed in this quest.

From a medical perspective, the opportunities presented by AI are to follow in the footsteps of the DeepMind approach and use clinical data to understand diseases—their diagnosis and prognosis, and determination of what combinations of therapies are best suited for particular patients in a more holistic approach.

Protein structure prediction presented the perfect challenge for AI: the data for all known structures were freely available, well curated and organized in the wwPDB. The challenge was very specific, and the success of the outcome was measurable and independently assessed in CASP.

The availability of biological research data from institutes such as the US National Center for Biotechnology Information (NCBI) and EMBL-EBI (with the many different types of data and available data resources) has transformed biological research in the last 20 years. The situation for clinical data is entirely different. Like biological data, clinical data are very heterogeneous, but they are rarely easily available, often not quantitative, difficult to share across borders and described by limited ontologies and metadata. To add more complexity, such data cannot be made publicly available while maintaining personal confidentiality.

Consequently, to take advantage of the new, powerful AI methods, the imperative with clinical data should be to build the national and international infrastructures necessary to allow clinical data to be collected and shared, collated and standardized.

By analogy with AlphaFold’s success in predicting structures, this will accelerate the process of finding therapies that are effective and available to all. In the UK, Health Data Research UK is addressing this challenge by creating Trusted Research Environments for clinical data, and worldwide, the Global Alliance for Genomics and Health is establishing standards and protocols to enable swifter progress. For this to be successful, multi-disciplinary teams will be needed, involving clinicians, domain experts and machine learning experts, to develop the tools to exploit the data.

It has taken many years to establish the biological databases that are so widely used today—and the challenge for clinical data is even larger. This calls for immediate investment in creating a new health data infrastructure so that patients will be proud to contribute their data to improve human health and the world can face new pandemics with confidence.

References

1. CASP14: 14th Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction. Retrieved from https://predictioncenter.org/casp14/ (accessed 27 September 2021).
2. Jumper, J. et al. Nature 596, 583–589 (2021).
3. Tunyasuvunakool, K. et al. Nature 596, 590–596 (2021).
4. Jones, D. T. & Thornton, J. M. Crystallogr. News 156, 6–9 (2021).
5. AlQuraishi, M. Curr. Opin. Chem. Biol. 65, 1–8 (2021).
6. Diwan, G. D. et al. J. Mol. Biol. 167180 (2021).
7. Workman, P. The Institute of Cancer Research blogs 2021. Retrieved from https://www.icr.ac.uk/blogs/the-drug-discoverer/page-details/reflecting-on-deepmind-s-alphafold-artificial-intelligence-success-what-s-the-real-significance-for-protein-folding-research-and-drug-discovery
8. wwPDB Consortium. Nucleic Acids Res. 47, D520–D528 (2019).
9. Hopkins, A. L. & Groom, C. R. Nat. Rev. Drug Discov. 1, 727–730 (2002).
10. Waterhouse, A. et al. Nucleic Acids Res. 46, W296–W303 (2018).
11. UniProt Consortium. Nucleic Acids Res. 49, D480–D489 (2021).
12. Laskowski, R. A., Jablonska, J., Pravda, L., Varekova, R. S. & Thornton, J. M. Protein Sci. 27, 129–134 (2018).
13. Mirdita, M., Ovchinnikov, S. & Steinegger, M. Preprint at https://doi.org/10.1101/2021.08.15.456425 (2021).
14. Baek, M. et al. Science 373, 871–876 (2021).
15. Batool, M., Ahmad, B. & Choi, S. Int. J. Mol. Sci. 20, 2783 (2019).
16. Stefl, S. et al. J. Mol. Biol. 425, 3919–3936 (2013).
17. Landrum, M. J. et al. Nucleic Acids Res. 48, D835–D844 (2020).
18. COVID-19 protein structures in the PDB. Retrieved from https://www.ebi.ac.uk/thornton-srv/databases/pdbsum/covid-19.html (accessed 3 September 2021).
19. Lewin, H. A. et al. Proc. Natl Acad. Sci. USA 115, 4325–4333 (2018).
20. Darwin Tree of Life. Retrieved from https://www.darwintreeoflife.org (accessed 27 September 2021).

Author information

Authors and Affiliations

European Bioinformatics Institute - European Molecular Biology Laboratory EMBL-EBI, South Building, Wellcome Genome Campus, Hinxton, UK
Janet M. Thornton, Roman A. Laskowski & Neera Borkakoti

Contributions

J.M.T. wrote the first draft of the article, and R.A.L. and N.B. edited and improved it. R.A.L. performed the analyses and created the figures.

Corresponding author

Correspondence to Janet M. Thornton.

Ethics declarations

Competing interests

J.M.T. sits on the board of Health Data Research UK.

About this article

Cite this article

Thornton, J.M., Laskowski, R.A. & Borkakoti, N. AlphaFold heralds a data-driven revolution in biology and medicine. Nat Med 27, 1666–1669 (2021). https://doi.org/10.1038/s41591-021-01533-0

Published: 12 October 2021
Issue Date: October 2021
DOI: https://doi.org/10.1038/s41591-021-01533-0
