The impact of AlphaFold Protein Structure Database on the fields of life sciences

Arguably, 2020 was the year of high‐accuracy protein structure predictions, with AlphaFold 2.0 achieving previously unseen accuracy in the Critical Assessment of Protein Structure Prediction (CASP). In 2021, DeepMind and EMBL‐EBI developed the AlphaFold Protein Structure Database to make an unprecedented number of reliable protein structure predictions easily accessible to the broad scientific community.

not be possible, within a biologically appropriate amount of time, for a protein to fold into its native conformation by a random search through the landscape of every possible conformation [13]. A few years later, in 1973, Christian B. Anfinsen postulated that proteins are not doing random searches and that the amino acid sequence is theoretically sufficient to determine the three-dimensional structure of that protein within certain limits [14]. This postulate posed the grand challenge to structural bioinformaticians of predicting, with high accuracy, the structure of proteins based on only amino acid sequence data [10].
Until recently, such ab initio models were lagging behind the accuracy of template-based approaches, that is, where an experimentally determined, homologous protein structure drives the modelling of the protein of interest [15].
In 2020 two parallel breakthroughs were achieved in ab initio modelling, both relying on the application of Artificial Intelligence techniques [10,12]. AlphaFold demonstrated that predicting protein structures with high accuracy and at an unprecedented scale is not only possible but already achieved [16]. Shortly after the publication of the AlphaFold methodology, DeepMind and EMBL-EBI developed and launched a data resource, the AlphaFold Protein Structure Database (AlphaFold DB) [17]. Releasing an unprecedented number of highly accurate models sent ripples throughout the fields of structural biology, structural bioinformatics and structure-based drug discovery.
Here, we present the latest updates of AlphaFold DB and provide examples of how this data resource affected the fields of life sciences.

SIGNIFICANCE STATEMENT
Since its launch in 2021, the AlphaFold Protein Structure Database has had successive releases that expanded the coverage of the sequence space with accurate theoretical models by an order of magnitude.
It now stores over 214 million unique protein structures, compared to around 200,000 PDB structures corresponding to approximately 60,000 unique protein sequences. This new data set covers almost the complete UniProt database. Additionally, AlphaFold models cover the entire length of protein sequences, in contrast to the fragmented, often short coverage of PDB entries. One important limitation of the predictions we archive in AlphaFold DB is that multimeric structures are currently not supported, even though tools like AlphaFold-Multimer are capable of predicting such assemblies [18].

Data infrastructure
AlphaFold DB is designed and developed through a collaborative effort between DeepMind and EMBL-EBI. The infrastructure is designed for portability and scalability, taking advantage of the Google Cloud Platform (Figure 1).
All the data, that is, the model coordinates, the predicted aligned error files and the corresponding meta-information, is generated and stored on the cloud storage areas.

Front-end application
The front-end of AlphaFold DB is implemented using the AngularJS framework. It relies on the API endpoint mentioned above to retrieve and display information for a specific protein of interest. This web application provides a search interface where users can search and filter using protein names, gene names, UniProt accessions and identifiers, and species names. The protein entry pages display relevant information for the protein of interest, such as protein name, gene name and biological function. These pages also allow users to download the coordinate files in mmCIF and PDB formats and the predicted aligned error files in JSON format. We also use the PDBe implementation of the Mol* three-dimensional molecular graphics viewer to display the coordinates and colour residues by their corresponding pLDDT score [20]. These scores reflect the confidence one may have in the coordinates of specific amino acids. The pLDDT scores and the three-dimensional representation should be evaluated together with the predicted aligned error (PAE) data, which we display using an interactive heatmap plot. The PAE data provides information about the confidence in the relative orientation of residue pairs in the protein models.
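As an illustration of the downloads described above, the following minimal Python sketch builds the file locations for a single entry. The base URL, the `AF-<accession>-F<fragment>` naming pattern and the version suffix are assumptions based on the file naming seen on the public site and are not specified in this article; they should be checked against the current AlphaFold DB documentation before use. The function only constructs strings and does not contact any server.

```python
from dataclasses import dataclass

# Assumed base URL and naming scheme; neither is defined in this article.
BASE_URL = "https://alphafold.ebi.ac.uk/files"


@dataclass(frozen=True)
class EntryFiles:
    """Locations of the three downloadable files for one model."""
    model_cif: str
    model_pdb: str
    pae_json: str


def entry_file_urls(accession: str, fragment: int = 1, version: int = 4) -> EntryFiles:
    """Build download URLs for a UniProt accession (hypothetical helper)."""
    if not accession or not accession.isalnum():
        raise ValueError("accession must be a non-empty alphanumeric string")
    if fragment < 1 or version < 1:
        raise ValueError("fragment and version must be positive integers")
    stem = f"AF-{accession.upper()}-F{fragment}"
    return EntryFiles(
        model_cif=f"{BASE_URL}/{stem}-model_v{version}.cif",
        model_pdb=f"{BASE_URL}/{stem}-model_v{version}.pdb",
        pae_json=f"{BASE_URL}/{stem}-predicted_aligned_error_v{version}.json",
    )
```

The `version` argument is kept explicit because the file version changes between database releases.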

New data since the initial release
The dataset size in AlphaFold DB grew almost 600x over the past year, from around 360,000 to over 214 million unique protein structures. This massive increase necessitated changes in the infrastructure so that loading, processing, and exposing data remained efficient and responsive.

FIGURE 1
Overview of the AlphaFold Protein Structure Database infrastructure. AlphaFold 2.0 generated metadata files and models are stored using Google Cloud Platform (GCP) storage buckets. The proteome models are made available via the public FTP area of EMBL-EBI. The metadata files are loaded into Apache-Solr instances, and an API exposes this data both to users and to the front-end application of AlphaFold DB.
The releases since the initial launch were thematic and added well-defined data sets to the available pool of structures. In particular, the second release included new predictions that cover almost the complete Swiss-Prot dataset of UniProt, over 500,000 manually curated sequences. The third release added 32 new proteomes of organisms implicated in global health and neglected diseases (Table 1). The latest, fourth release took the amount of data to 23 TiB, with over 214 million sequences covered.
While these millions of predicted structures significantly expanded the structural coverage of the sequence space, albeit with computationally predicted models, it is important to note some limitations of the current implementation. As of 2022, the database does not provide predictions for viral proteins, isoforms or mutant structures. Additionally, predictions are only available for sequences longer than 15 amino acids and shorter than 2700 amino acids for proteins in the proteome datasets and Swiss-Prot, or shorter than 1280 amino acids for other proteins. This is a limitation imposed as part of the implementation; the AlphaFold system can support longer sequences, but with considerably longer computation times. Importantly, the human proteome is an exception in AlphaFold DB: longer proteins are included for this proteome, but instead of a single continuous model they are split into overlapping fragments, each 1400 amino acids long.
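The length rules above can be sketched in a few lines of Python. The limits of 15, 2700, 1280 and 1400 amino acids come from the text; the step of 200 residues between consecutive fragment starts and the choice to anchor the final fragment at the end of the sequence are assumptions made for illustration, and the function names are hypothetical.

```python
from __future__ import annotations

FRAGMENT_LENGTH = 1400  # from the text
FRAGMENT_STEP = 200     # assumption: offset between consecutive fragment starts


def is_in_scope(length: int, curated: bool) -> bool:
    """True if a sequence length falls within the stated limits.

    `curated` means the protein is in a proteome dataset or Swiss-Prot.
    """
    upper = 2700 if curated else 1280
    return 15 < length < upper


def overlapping_fragments(
    length: int, frag_len: int = FRAGMENT_LENGTH, step: int = FRAGMENT_STEP
) -> list[tuple[int, int]]:
    """Return 1-based inclusive (start, end) windows covering a long sequence."""
    if length < 1 or not 1 <= step <= frag_len:
        raise ValueError("length must be >= 1 and 1 <= step <= frag_len")
    if length <= frag_len:
        return [(1, length)]
    windows = []
    start = 1
    while True:
        end = start + frag_len - 1
        if end >= length:
            # Assumption: the last window is shifted back so it is still frag_len long.
            windows.append((length - frag_len + 1, length))
            break
        windows.append((start, end))
        start += step
    return windows
```

For a hypothetical 1500-residue human protein, this sketch yields two windows, residues 1 to 1400 and 101 to 1500, which overlap by 1300 residues.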

Data types in the AlphaFold protein structure database
The AlphaFold database provides access to three distinct, independent outputs of the AlphaFold 2.0 algorithm, together with related meta-information that places the data in context. The three data types are the model coordinates, the confidence measures (pLDDT scores) and the predicted aligned error scores.
The model coordinates are stored both in PDB and in mmCIF formats. Users are encouraged to take advantage of the richness and extensibility of the mmCIF formatted files, as these files encode much more meta-information than the PDB format would allow. This mmCIF format follows the modelCIF dictionary specification, available at https://github.com/ihmwg/ModelCIF. Importantly, additional metadata and annotation will only be available in the mmCIF formatted AlphaFold model files, as these are not supported by the legacy PDB format.
Model confidence is represented by pLDDT scores. These scores are provided for each amino acid position, with higher values corresponding to higher accuracy and confidence, and lower values indicating less reliable positions. In particular, pLDDT scores below 50 can correspond with certain classes of intrinsically disordered regions [21,22], but such scores can also reflect genuinely poor quality predictions, most often caused by shallow multiple sequence alignments that underlie the prediction process [10].
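To make the per-residue scores concrete, the sketch below reads pLDDT values from a PDB-formatted model and groups residues scoring below 50 into contiguous segments. It assumes that the pLDDT score is written to the B-factor column of each ATOM record, which is not stated in this article, and it reads one value per residue from the C-alpha atom. The function names are hypothetical, and a low-scoring segment found this way may be either disordered or simply poorly predicted, as noted above.

```python
from __future__ import annotations


def residue_plddt(pdb_text: str) -> dict[int, float]:
    """Map residue number to pLDDT, read from the B-factor field of CA atoms."""
    scores = {}
    for line in pdb_text.splitlines():
        # Fixed-width PDB columns: atom name 13-16, residue number 23-26, B-factor 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def low_confidence_segments(
    scores: dict[int, float], cutoff: float = 50.0
) -> list[tuple[int, int]]:
    """Group consecutive residues with pLDDT below `cutoff` into (start, end) pairs."""
    segments = []
    start = prev = None
    for resnum in sorted(scores):
        if scores[resnum] < cutoff:
            if start is None:
                start = resnum
            elif resnum != prev + 1:
                # A numbering gap closes the current segment and opens a new one.
                segments.append((start, prev))
                start = resnum
            prev = resnum
        elif start is not None:
            segments.append((start, prev))
            start = None
    if start is not None:
        segments.append((start, prev))
    return segments
```

The cutoff of 50 follows the text; any other banding of the scores would be a separate convention.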
The third output of AlphaFold 2.0 is the predicted aligned error (PAE), which can give information on the reliability of pairwise relative positions of amino acids, and by extension, of sequence segments (e.g., domains). Investigating the PAE scores is essential to assessing if two or more seemingly close domains or regions of a protein structure are oriented with high confidence or if their closeness is just a random positioning within the model [17].
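One simple way to use the PAE data for the domain-orientation question above is to average the PAE values over all residue pairs that span two regions. The sketch below assumes the PAE has already been loaded as a square matrix of values indexed by residue; the layout of the downloaded JSON file is not described in this article and is therefore not parsed here. The function name and the domain boundaries are hypothetical, and no particular threshold for "confident" is implied.

```python
from statistics import mean


def mean_interdomain_pae(pae, domain_a, domain_b):
    """Average PAE over all residue pairs spanning two regions.

    pae: square matrix (list of lists) with pae[i][j] for 0-based residues i, j.
    domain_a, domain_b: 1-based inclusive (start, end) residue ranges.
    """
    n = len(pae)
    if any(len(row) != n for row in pae):
        raise ValueError("PAE matrix must be square")

    def indices(domain):
        start, end = domain
        if not 1 <= start <= end <= n:
            raise ValueError(f"domain {domain} is outside 1..{n}")
        return range(start - 1, end)

    values = []
    for i in indices(domain_a):
        for j in indices(domain_b):
            # The matrix is not necessarily symmetric, so both directions are used.
            values.append(pae[i][j])
            values.append(pae[j][i])
    return mean(values)
```

A low average suggests that the relative placement of the two regions is predicted with confidence, whereas a high average suggests that their apparent proximity in the model should not be interpreted.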

The impact of the AlphaFold database
The unprecedented access to large numbers of highly accurate models impacted several areas of life sciences (Figure 2). Data providers are taking advantage of the programmatic access to AlphaFold

Integration of AlphaFold models with other major data providers
Several major data providers have integrated AlphaFold models both within the infrastructure of EMBL-EBI and elsewhere worldwide.
There are various data resource categories that benefit from the availability of these models.
An important category of data deals with protein sequences, and these can seamlessly integrate with AlphaFold predictions. For example, the UniProt database displays AlphaFold predictions in 3D on all their protein sequence pages where such a model is available [23]. The more specialised databases InterPro and Pfam also have specific pages to display relevant AlphaFold structures [24,25].
The most obvious links are to other data resources in the field of protein structures. Linking experimentally determined structures and theoretical models is key to providing a more comprehensive structural context for proteins. Access to both experimental and predicted protein structures allows researchers to validate the predicted models, and also to potentially improve the experimental structure. Even when no experimentally solved structure exists, but there are experimental data, it is good practice to compare the theoretical model against the data and see if it explains them. It also allows the transfer of function annotations between structural representations. The Protein Data Bank in Europe - Knowledge Base (PDBe-KB), for example, displays AlphaFold models next to entries deposited in the PDB [26]. This is particularly useful when PDB entries only cover shorter segments of the full-length protein sequence. The SWISS-MODEL Repository also displays AlphaFold models so that SWISS-MODEL and AlphaFold structures can be compared easily [27].
Perhaps most importantly, data resources that traditionally could not take advantage of protein structures are now able to link or map their data to the AlphaFold models. For example, the protein-protein interaction database STRING uses AlphaFold models to provide a visual representation of the proteins involved in specific interactions [28]. MobiDB, a database of consensus disorder predictions, is also taking advantage of AlphaFold models [29]. Indeed, AlphaFold models are very relevant to the field of intrinsically disordered proteins, as they offered a striking demonstration of how widespread these flexible regions are across proteomes [30]. This is important, because while

High-throughput analyses enabled by AlphaFold models
The number of novel protein structures made available through the AlphaFold database presented a treasure trove of data to bioinformaticians worldwide. Developers of protein structure analysis tools also took advantage of this influx of new and accurate models. There have been many important breakthroughs in high-throughput analysis; the following list of examples is by no means exhaustive. Prediction method developers, such as Jakubec et al., are taking advantage of AlphaFold models while also highlighting a new scaling challenge: working with an unprecedented number of protein structures. In their recent work, they ran their tool PrankWeb 3 to predict potential ligand-binding sites using PDB structures and AlphaFold models [37]. They demonstrated that PrankWeb 3 could predict an immense number of potential binding sites with reasonable accuracy, but due to the already massive size of the AlphaFold dataset, they only employed their method on a subset of models.
This highlights the importance of the scalability of modern tools and algorithms. Previously, no structure analysis tool had to work with more than perhaps a couple of hundred thousand protein structures. This number is already five times larger, and it will keep increasing.
We expect new and innovative approaches that will be able to tackle the unexpected and rapid growth of available models. Indeed, novel and efficient tools like FoldSeek and 3D-AF-Surfer have already been developed that can help researchers search through immense numbers of protein structures to find hits structurally similar to an input conformation [38,39]. High-throughput structural similarity search enables classification efforts, for example, assigning structural CATH domains to AlphaFold models [40].

Determining protein structures using AlphaFold models
AlphaFold and all the other protein structure modelling algorithms would not have been possible without the efforts of structural biologists who employ an array of experimental techniques to determine the structure of proteins and their complexes with ligands and nucleic acids. It is fitting that now AlphaFold is being used by structural biologists to help solve the structure of challenging proteins. Careful analyses also highlight the strength and limitations of AlphaFold when compared to experimental techniques.
One of the most important achievements using AlphaFold models is the fine-grained, almost complete model of the human Nuclear Pore Complex (NPC), as demonstrated by Mosalaganti, Obarska-Kosinska et al. [41]. Using nucleoporin structure models from AlphaFold DB, they used Google Colab to model the various subcomplexes of the human NPC scaffold, and fitted the resulting models into a low-resolution envelope based on cryo-EM experimental data. This approach yielded a model that has atomic resolution for over 90% of the NPC, giving the best resolved structure to date.
bound CST monomer into their density map [44]. However, the N-terminus of the CST monomer was ill-defined in the PDB structure, and consequently, they swapped it with a model from the AlphaFold database. Their model served as a tool to better understand the CST-Polα/primase complex, which has a crucial role in telomere maintenance.
Another study looked into the reasons behind an observation from the last CASP round, specifically that AlphaFold 2.0 performed worse for targets that were determined using NMR spectroscopy [45]. Fowler and Williamson found that generally, AlphaFold models are more accurate than NMR assemblies, partially explaining the discrepancy between AlphaFold models and NMR structures in the CASP assessment. However, in the case of more dynamic proteins, NMR techniques seem to be better suited [46].
The rapid uptake of AlphaFold models by researchers in structural biology was accelerated by the contributions of scientific software developers. For example, ChimeraX and MRParse can both retrieve predicted protein structures from the AlphaFold database to assist in building cryo-EM models and for molecular replacement [47,48].
Other software used in macromolecular structure determination, such as ISOLDE and Phenix, is also taking advantage of AlphaFold models [49,50].

Structure-based drug discovery based on AlphaFold models
Since the majority of drug targets are proteins, having access to a large number of previously unknown protein structures is a clear benefit to the field of drug discovery [51]. While protein structures do not necessarily uncover immediately how the protein works, the structures still serve as important and complementary data to compound libraries by providing target structures to perform virtual screening against.
In a recent study, Tian et al. retrieved models from PDB and the AlphaFold database, identified potential ligand-binding pockets and performed docking of artemisinin, a natural phytochemical, against key targets implicated in Ulcerative Colitis [52].
In another study, Kobakhidze et al. used models from the AlphaFold database, or generated predictions directly using Google Colab, to investigate whether AAA+ ATPase p97 could be a potential novel target relevant to both tuberculosis and parasitology [53].
In addition to structure-based drug discovery, AlphaFold models are helping in vaccine development, as demonstrated by Collar et al. [54].

The impact of AlphaFold models on teaching science
The availability of protein models for a very large number of known protein sequences gave a powerful tool to teachers of structural biology. In particular, these models help students to understand the meaning, importance and abundance of structurally flexible protein regions. These regions are generally missing from protein structures in the PDB, but AlphaFold showed just how prevalent these regions of the proteins are (Figure 3). The intrinsic disorder content varies between proteomes, but it is generally a sizeable fraction. For example, in the human proteome about 30% of the residues have high disorder propensity, which is reflected in the AlphaFold models of the human proteome [10].
It is, however, important to emphasise that while these models help identify intrinsically disordered regions, they do not characterise them structurally. Due to the inherently flexible nature of these segments, they can only be accurately represented as a conformational ensemble, and AlphaFold, RoseTTAFold or similar new-generation protein structure predictors only show a single snapshot from these ensembles [10,12].

FIGURE 3
Helping students understand the prevalence of structural disorder. Intrinsically disordered regions of proteins are prevalent in nature, but they are generally missing from protein structures in the PDB, as flexible residues are often unobserved, especially in X-ray crystal structures, which still constitute the majority of all PDB entries. AlphaFold DB gave a powerful tool to teachers and researchers to explain the concept and grasp the abundance of flexible protein regions in the proteomes of many organisms. For example, the human Mediator of DNA damage checkpoint protein only has two well-defined protein structures in the PDB, which cover short sections of the complete sequence. AlphaFold made it possible to put these domains in the context of long unstructured regions.
Access to the models also helped develop training materials targeted at structural biologists, bioinformaticians and computer scientists at various career stages. For example, the training portal of the Bonvin Lab includes AlphaFold and its impact on various common structural bioinformatics workflows (https://www.bonvinlab.org/education/). Such materials are helping researchers in specific fields, but to help the broader scientific community there will be a need for more training materials for generalist users, to ensure protein structures can help solve scientific problems further away from their core user base.

DISCUSSION
Recent, significant advances in the theoretical protein structure modelling field have been driven by AI-based technologies, such as AlphaFold 2.0 and RoseTTAFold [10,12]. These tools made an unprecedented amount of highly accurate predictions easily accessible to the broad scientific community.
While the DeepMind team open-sourced the AlphaFold 2.0 algorithms and created interactive Google Colabs that allow researchers to run predictions, providing protein structure models through an easy-to-use and openly accessible database was the best way to lower the accessibility barrier to non-structural biologists and make the models FAIR [10]. Having access to these models means that researchers can avoid running the same predictions over and over for proteins that are very actively researched, saving computational power and its associated costs.
Access to the AlphaFold models has considerably impacted several life sciences fields. Within less than a year, over 250 citations refer to the database, from other data providers to high-throughput bioinformatics analyses and structure determination to structure-based drug discovery [37,55,56]. It is also striking how many (37%) of the related publications are preprints, indicating that the field is progressing at an accelerated pace.
We have just expanded the data set further, and it now includes over 214 million predicted protein structure models. This expansion will provide new opportunities for research and analyses but will also pose significant challenges. We anticipate that new tools and algorithms will have to be developed to take advantage of protein structures on such a large scale, ultimately benefiting the broader scientific community.