Introduction

Since 1971, the Protein Data Bank (PDB) has served the scientific community as the single, global repository for structural data of biomolecules (). Data archived at the PDB include atomic coordinates and related experimental data from macromolecular crystallography, nuclear magnetic resonance spectroscopy and 3D electron microscopy studies. Knowledge of these 3D structures of proteins, nucleic acids, and large molecular machines informs our understanding of fundamental biology, medicine and drug discovery, and energy.

The PDB was conceived as a resource for the crystallographic community, to archive their primary results. However, as the number of structures grew, it became apparent that this body of information would have much wider application. Communities of researchers emerged that focused on data mining, using the available structures to hypothesize and test overarching principles of biomolecular structure, folding, and function. Soon after, the archive showed growing application in the field of structure-guided drug design, and has since been instrumental in the discovery and development of dozens of blockbuster medical treatments (). In addition, structures from the archive are used widely to explain biomolecular structure and function, promoting research in many fields of biology, but also in chemistry, physics, mathematics, computer science and beyond. The growing utility of the archive naturally led to widespread policies of structure deposition, and today, most major journals require release of coordinates when publishing reports of structural studies (), ensuring that the results of structural research are available for these many derivative disciplines.

Today, the PDB is an established archive with >160,000 entries, which have been extensively curated for consistency and accuracy. In this report, we use an analysis of citations to reveal the impact of structural biology and the general availability of these atomic structures of biomolecules in diverse scientific communities.

The wwPDB archive

The PDB archive began in 1971 with seven structures, which were distributed by request on magnetic tape (). In subsequent years, the rapid growth of the archive and development of worldwide computational infrastructure required a more comprehensive approach to deposition, curation, and dissemination of the archive. Since 2003, the Worldwide PDB (wwPDB) organization has managed the PDB archive and ensured that PDB data are freely and publicly available (, ) following the FAIR principles (). Locally-funded, regional PDB Data Centers in the US (), Europe (), and Japan () safeguard and disseminate PDB structures using a common data dictionary () and a unified global system for data deposition-validation-biocuration ().

The Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB) (, ) has served as the US PDB Data Center since 1999. RCSB PDB manages deposition and curation of roughly 42% of new structures, and free dissemination of the archive through the comprehensive RCSB PDB website. The PDB archive and RCSB.org are heavily used: during 2019, >838 million structure data files were downloaded from the archive. The RCSB PDB website also provides extensive value added through resources for visualization and analysis, integration with ~40 external resources, as well as a sister website, PDB-101, targeted for educational and outreach communities. Together, these resources provide rich structural views of fundamental biology, biomedicine, and energy sciences, which are accessed by millions of users from around the world and from a wide range of scientific disciplines.

Previous work on the impact of data reuse of the PDB archive

Evaluation of the impact of data archives is important for data depositors, data users, and for resource management, planning, and funding. For scientific databases, citations are often used as a tangible expression of reuse of the data. Because the citing publications are peer-reviewed and appear in reputable journals, they lend confidence that the derivative work is contributing to the growing body of scientific knowledge, and they validate the role of the data archive as a central resource for the community.

Previous citation analyses of the PDB archive have focused on the inaugural article describing the RCSB PDB resource, “The Protein Data Bank” () that appeared in Nucleic Acids Research. This inaugural article is regularly used to cite both the PDB data archive and RCSB PDB services. This reference is useful to study due to its high volume of citations. A 2014 analysis () ranked the inaugural article 92nd among the top 100 most-cited research publications of all time and a 2017 study () placed it 5th among papers published since 2000. The 2017 analysis by Basner also found, using internal methods for normalizing across category, that articles citing the inaugural RCSB PDB publication had a citation-based impact exceeding the world-average in 16 scientific fields including Biology & Biochemistry, Computer Science, Plant & Animal Sciences, Physics, Environment/Ecology, Mathematics and Geosciences. Another study () found the research areas for articles citing the inaugural RCSB PDB publication are changing over time, with more recent growth in disciplines such as Mathematical Computational Biology, Chemistry Medicinal, and Computer Science Interdisciplinary Applications.

Other studies have looked at how individual structures are referenced in the literature to demonstrate data reuse. For example, Huang et al. found an increase in the number of citations to PDB entries by URL rather than to publication (), and Bousfield et al. cross-referenced open access literature with the PDB archive and found the average annual number of citations for a PDB structure is 6.7 (). Scientific articles that are connected to open access data have been shown to be more highly cited than articles without data being made available (). This is certainly reflected in the primary citations included in PDB entries. At the time the data for this study were collected (March 1, 2018), the PDB archive contained ~139,000 structures, and the study found that the PDB archive from 2000–2016 had been cited by more than 1 million scientific publications in the Web of Science, giving an average number of ~40 citations per PDB structure publication (). This number rises to ~80 citations per PDB structure publication for drug targets in all therapeutic areas ().

This study explores citation patterns of individual PDB structures to identify PDB entries of most interest in specific fields and to examine trends in the application of structural biology. Each structure represents the results of an experiment performed by a laboratory, which are deposited to the PDB data archive, biocurated by the wwPDB, and then made publicly available in the archive (). The majority of PDB structures (~80%) have a corresponding primary citation that is the first paper to describe the molecule, its structure, and its function. Public release of most PDB data (87% in 2018) is coordinated with the time of publication of this primary citation. As identified by previous research, and due to the nature of the data, most papers citing PDB structures are in the field of biology. To identify strong examples of structures cited in other research categories, we identified the top cited structures within related disciplines.

Methods

For each entry in the PDB archive, we analyzed the set of articles that cite the primary citation of the PDB entry. First, the primary citations for each PDB entry were exported from the RCSB PDB database. A single primary citation may describe multiple PDB structures; in these cases, entries were treated and counted separately in the analysis. Then, publication data for articles that cite these PDB primary citations as of March 2018 were exported and organized by subject categories within the Web of Science (). Related subject categories were aggregated; for example, the Chemistry category reported here includes Chemistry Physical, Chemistry Organic, Chemistry Analytical, and others. Note that each publication can be assigned to more than one category. The ranking of citation impact across disciplines and longitudinally is an area of active bibliometric research (, , , , , ). For this study, the citation data are not normalized in any way to provide a direct comparison of impact between categories. Exported data were analyzed in July 2018.
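The counting step described above can be illustrated with a minimal Python sketch. It is not the code used for the study: the mapping table, the function names, the entry label `entry_A`, and the input layout are all assumptions made for illustration, and only subject labels that appear in this article are used. The sketch shows one plausible way to aggregate fine-grained subject categories and to count each citing article at most once per aggregated category while still allowing it to count toward several categories.

```python
from collections import defaultdict

# Assumed, partial mapping from fine-grained subject labels to aggregated
# categories. The article does not publish its full aggregation table.
AGGREGATED = {
    "Chemistry Physical": "Chemistry",
    "Chemistry Organic": "Chemistry",
    "Chemistry Analytical": "Chemistry",
    "Chemistry Medicinal": "Chemistry",
    "Mathematical Computational Biology": "Mathematics",
    "Computer Science Interdisciplinary Applications": "Computer Science",
}


def aggregated_categories(subjects):
    """Return the set of aggregated categories covered by one article's subjects."""
    return {AGGREGATED[s] for s in subjects if s in AGGREGATED}


def count_citations(citing_articles_by_entry):
    """Return {entry: {aggregated_category: number_of_citing_articles}}."""
    counts = defaultdict(lambda: defaultdict(int))
    for entry, articles in citing_articles_by_entry.items():
        for subjects in articles:
            # Using a set means an article adds at most 1 to each aggregated
            # category, but it still counts toward every category it falls in.
            for category in aggregated_categories(subjects):
                counts[entry][category] += 1
    return {entry: dict(per_category) for entry, per_category in counts.items()}


# Made-up input: each citing article is represented by its list of subject labels.
example = {
    "entry_A": [
        ["Chemistry Physical", "Chemistry Organic"],
        ["Chemistry Analytical", "Mathematical Computational Biology"],
        ["Biochemistry Molecular Biology"],  # not in the assumed mapping, so ignored
    ],
}
result = count_citations(example)
```

In this toy input the first article contributes one Chemistry citation despite carrying two Chemistry subcategories, and the second contributes to both Chemistry and Mathematics, which mirrors the note that a publication can be assigned to more than one category.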

Results and Discussion

Overall citation of PDB structures

The top-cited PDB structures (Table 1) are landmark structures that signal achievements in fundamental biology and their application to biomedicine and biotechnology. A detailed description of the impact of the structures of the nucleosome (PDB 1aoi) and major histocompatibility complex 1 (1hla) has been published (). The structure of bacteriorhodopsin was the first EM structure released in the PDB, and represents a ground-breaking use of electron crystallography of two-dimensional sheets to determine the membrane-bound structure of a protein (1brd). The structure of the F1 portion of ATP synthase (1bmf) revealed the atomic details of the rotary molecular motor, providing a structural explanation for decades of biochemical studies. Similarly, the structure of the potassium channel resolved a long-standing question about the nature of specificity, revealing the central role of hydration and dehydration of ions in controlling ion passage across the cell membrane. The structures of MHC I (1hla), MDM2 (1rv1), and serum albumin (1uor) are a testament to the utility of atomic structures in the understanding of biomedically-important biomolecules and in structure-based design of pharmaceuticals. Many additional milestone structures closely follow these top 10 entries, including photosystem II (1s5l) with 2286 citations and green fluorescent protein (1ema) with 1529 citations. In keeping with the importance of these molecules, all structures in the top-cited list (Figure 1) have been highlighted in the RCSB PDB’s Molecule of the Month series at PDB101.rcsb.org ().

Figure 1 

Images of the most highly-cited PDB structures, taken from PDB-101, the educational portal of the RCSB PDB. In several cases, a highly-cited structure entry was the example used in the Molecule of the Month article.

Table 1

Top-cited PDB Structure Primary Citations as of March 1, 2018.

| Structure | PDB ID | Journal | Structure Primary Citation | Times Cited |
| --- | --- | --- | --- | --- |
| Nucleosome | 1aoi | Nature | () | 4,927 |
| Potassium channel | 1bl8 | Science | () | 4,695 |
| Bacteriorhodopsin | 1brd | Journal of Molecular Biology | () | 4,298 |
| Rhodopsin | 1f88 | Science | () | 4,237 |
| Major histocompatibility class I | 1hla | Nature | () | 3,081 |
| MDM2/imidazoline inhibitor | 1rv1 | Science | () | 2,649 |
| Thrombin | 2v3o/2v3h | Science | () | 2,596 |
| Serum albumin | 1uor | Nature | () | 2,552 |
| ATP Synthase F1 | 1bmf | Nature | () | 2,453 |

The importance of these 10 entries is also supported by the prominence of the journals where the primary citations were published. Of the nine articles describing the top 10 structures, four were published in Nature, four in Science, and one in the Journal of Molecular Biology. The oldest structure was published in 1987, and the most recent in 2007. Surprisingly, the initial 12 structures, which have been archived in the PDB since the 1970s and represent the seminal achievements of structural biology, are not included in this list. This may be due in part to inconsistent citation practices of the time, both for the citations that were included in the early structure depositions to the archive, and for how PDB structures were cited in the literature. For example, the citation included in the entry for Kendrew’s landmark structure of myoglobin (1mbn) is not the initial structure solution (), which currently shows >1100 citations, but rather a later report of the molecule () that is not included in the Web of Science. Similarly, multiple publications were presented over decades during the structure solution of hemoglobin (2dhb), including the primary citation associated with the entry () with 253 citations and a key Nature paper with 932 citations ().

Top cited structures by category

We also analyzed subject categories for the journals where citing articles were published, to assess the range of disciplines where data from the PDB archive are having impact. Not surprisingly, PDB structures are most cited by publications with the subject category Biochemistry Molecular Biology, which cited 101,921 unique structures (72% of the archive) at the time of this study. Six non-biological categories were chosen to show the utility of the archive in related disciplines: Materials Science, Physics, Computer Science, Chemistry, Engineering, and Mathematics.

We also identified the most highly-cited article that cited a PDB structure in each category. For the physics-related categories these highly-cited papers included reviews related to biotechnology and nanotechnology: Materials Science (), Physics (), and Engineering (). For Computer Science, Chemistry, and Mathematics, the papers were primary citations for the widely-used molecular visualization program VMD (), small and macromolecular structure determination program SHELX (), and database of theoretical models SWISS-MODEL (), respectively. These citations highlight how available data in the repository support cross-disciplinary use across the physical sciences.

Figure 2 reveals that individual PDB entries have impact on a wide range of disciplines. The three most highly-cited structures are included, along with the number of citations falling into the top 5 categories. Not surprisingly, all have Biochemistry & Molecular Biology as the top category. The following categories are quite different for these three entries, reflecting the different uses that are made of these structures: the nucleosome (1aoi) in basic biology and understanding of genetic mechanisms, the potassium channel (1bl8) as a central structure used for understanding and engineering specific ion channels, and rhodopsin (1f88) which was used for many years as a template for understanding and modeling the pharmacology of G-protein coupled receptors (GPCRs). The citations also fall into numerous other categories: for example, citations for the nucleosome structure fall into over 100 separate categories.

Figure 2 

Top subject categories for publications citing the top-cited PDB structures. Common categories include Biochemistry & Molecular Biology (green), Cell Biology (orange), Biophysics (red), and Chemistry (yellow).

In the sections below, we identify the top-cited structures in each of these subject categories, and describe how these structures have impacted the fields.

Materials Science

Articles in this category cited 18,495 unique structures, or roughly 12% of the archive. Two of the top cited PDB entries here, serum albumin (1uor) and potassium channel (1bl8) also appear in the overall top cited list (Table 2). Surprisingly, two additional structures of serum albumin (1ao6/1bm0) also appear in this top cited list. The reasons for citation of these structures in materials science journals are reflected in the most frequently used keywords in these citing articles: in vivo, mechanism, drug delivery, adsorption, in vitro, crystal structure, protein/s, nanoparticles, and binding.

Table 2

Top-Cited PDB Structures in Materials Science.

| Structure | PDB ID | Journal | Structure Primary Citation | Times Cited |
| --- | --- | --- | --- | --- |
| Serum albumin | 1uor* | Nature | () | 215 |
| Alpha-hemolysin | 7ahl | Science | () | 140 |
| Designed DNA | 3gbi | Nature | () | 138 |
| Biotin/streptavidin | 1stp | Science | () | 136 |
| Collagen | 3hqv/3hr2 | Proceedings of the National Academy of Sciences USA | () | 116 |
| Photosystem II | 3wu2 | Nature | () | 100 |
| Potassium channel | 1bl8* | Science | () | 95 |
| Serum albumin | 1ao6/1bm0 | Protein Engineering | () | 93 |

* Also appears in top cited overall list.

These structures provide information that is useful for a variety of current bioengineering and nanotechnology goals. Serum albumin (1uor, 1ao6, 1bm0) plays essential roles in delivery of a wide variety of small molecules in the blood, and is thus often key to assessing the ADME (Absorption, Distribution, Metabolism, Excretion) properties of engineered molecules. Many of the citing papers explore design of nanoparticles for delivery of molecules in the blood, building on knowledge of the structure. Similarly, alpha-hemolysin (7ahl) and potassium channels (1bl8) are worked examples of selective channels and have been used in bioengineering efforts. In particular, structural understanding of alpha-hemolysin has been instrumental in the engineering of nanopores for DNA sequencing. Designed DNA structures (3gbi) are some of the first successful examples of de novo design in bionanotechnology, and the strong streptavidin-biotin interaction (1stp) is often used to connect modular components in designed nanostructures. Materials Science publications reference the in situ structure of collagen microfibrils (3hqv and 3hr2), including reports exploring the properties of connective tissue and biomineralization. Photosystem II appears in Materials Science (3wu2), as well as in nearly all of the other categories below, since these structures revealed the water-splitting details of the oxygen-evolving center.

Physics

Articles in this category cited 50,819 unique structures. Two of the top cited structures, potassium channel (1bl8) and the nucleosome (1aoi) also appear in the overall top cited list (Table 3). The most frequently used keywords in these citing articles include: model, mechanism, protein/s, binding, molecular dynamics/simulations, spectroscopy, and crystal structure.

Table 3

Top-Cited PDB Structures in Physics.

| Structure | PDB ID | Journal | Structure Primary Citation | Times Cited |
| --- | --- | --- | --- | --- |
| Photosystem I | 3pcq | Nature | () | 319 |
| Potassium channel | 1bl8* | Science | () | 311 |
| Nucleosome | 1aoi* | Nature | () | 186 |
| Photosystem II | 1s5l | Science | () | 186 |
| Photosystem II | 3wu2 | Nature | () | 177 |
| Alpha-spectrin SH3 domain | 1m8m | Nature | () | 174 |
| Photosystem I | 1jb0 | Nature | () | 148 |
| Designed TRP-cage miniprotein | 1l2y | Nature Structural Biology | () | 147 |
| Light-harvesting complex | 1lgh | Structure | () | 131 |
| Alpha-hemolysin | 7ahl | Science | () | 127 |

* Also appears in top cited overall list.

This category includes an interesting mix of structures related to photosynthesis (3pcq, 1s5l, 3wu2, 1jb0, 1lgh) and structures related to development of experimental methods (also 3pcq, 1m8m, 1l2y). The structures of photosystems revealed the detailed arrangements of chromophores in the protein complexes, and thus provide concrete information on the types of geometries and distances that are relevant for excitation and electron transfer. These structures also include several seminal developments in structural science with strong ties to physics, including determination of photosystem I by femtosecond X-ray protein nanocrystallography (3pcq) and determination of the spectrin domain by solid-state magic-angle-spinning NMR spectroscopy (1m8m). In addition, two structures related to nanotechnology appear in the list: a very small de novo designed protein (1l2y) and alpha-hemolysin (7ahl), mentioned in the section above. Inclusion of the nucleosome (1aoi) in this list may seem puzzling until we recognize that much effort has been expended on understanding and modeling the physics of DNA bending as it relates to nucleosome positioning and higher-order chromatin structure.

Computer Science

Articles in this category cited 28,122 unique structures, and rhodopsin (1f88) also appears in the overall top cited list (Table 4). The most frequently used keywords in these citing articles included: docking, identification, inhibitors, prediction, molecular dynamics, forcefield, design, binding, and crystal structure.

Table 4

Top-Cited PDB Structures in Computer Science.

| Structure | PDB ID | Journal | Structure Primary Citation | Times Cited |
| --- | --- | --- | --- | --- |
| Rhodopsin | 1f88* | Science | () | 185 |
| Beta2 adrenergic receptor | 2rh1 | Science | () | 134 |
| ZipA/inhibitor | 1y2g/1y2f | Journal of Medicinal Chemistry | () | 129 |
| Adenosine receptor | 3eml | Science | () | 94 |
| HIV protease/cyclic ureas | 1hvr | Science | () | 77 |
| Beta1 adrenergic receptor | 2vt4 | Nature | () | 69 |
| Beta2 adrenergic receptor | 2r4r/2r4s | Nature | () | 67 |
| Dihydrofolate reductase | 3dfr/4dfr | Journal of Biological Chemistry | () | 64 |

* Also appears in top cited overall list.

These structures represent important targets for drug development, and thus are often used to test new structure-based drug design methodology, including a shape-based 3-D scaffold hopping method (1y2f, 1y2g) and novel use of cyclic ureas to mimic substrate binding and displace a key water molecule in HIV protease (1hvr). Half of the list are GPCRs (2rh1, 3eml, 2vt4, 2r4s), along with the landmark structure of rhodopsin (1f88), which was used for many years to model GPCRs and ligand binding thereto.

Chemistry

Articles in this category cited 87,073 unique structures, and given the strong connections between biology and chemistry, half of the top cited structures (2v3h/2v3o, 1uor, 1f88, 1bl8) appear in the overall top cited list (Table 5). The most frequently used keywords in these citing articles were: complexes, protein/s, E. coli, inhibitors, mechanism, design, derivatives, binding, and crystal structure.

Table 5

Top-Cited PDB Structures in Chemistry.

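| Structure | PDB ID | Journal | Structure Primary Citation | Times Cited |
| --- | --- | --- | --- | --- |
| Thrombin | 2v3o/2v3h* | Science | () | 2,425 |
| Serum albumin | 1uor* | Nature | () | 1,334 |
| Photosystem II | 1s5l | Science | () | 1,043 |
| Rhodopsin | 1f88* | Science | () | 1,003 |
| Photosystem II | 3wu2 | Nature | () | 936 |
| Fe-only hydrogenase | 1feh | Science | () | 925 |
| Potassium channel | 1bl8* | Science | () | 803 |
| Hydrogenase | 1hfe | Structure | () | 743 |
| DNA quadruplex | 1k8p | Nature | () | 671 |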

* Also appears in top cited overall list.

The top two entries are thrombin with an inhibitor (2v3h) and with a fluorinated version of the inhibitor (2v3o), and the primary reference is a review that is cited by studies of the effects of fluorination in inhibitor design. Serum albumin (1uor) showed up in “Materials Science” in relation to bioengineering efforts, but many of the citations in “Chemistry” are directly related to binding of molecules to the protein and characterizing its functional properties in blood. Two hydrogenase enzymes (1feh, 1hfe) and photosystem II (1s5l, 3wu2) perform interesting chemistry catalyzed by unusual metal clusters. The structure of a human telomeric quadruplex (1k8p) is cited by all manner of studies looking at its chemical properties and interactions with ions, small molecules, and proteins.

Engineering

Articles in this category cited 16,190 unique structures, and serum albumin (1uor) and potassium channel (1bl8) again appear in the top cited list (Table 6). The most frequently used keywords in these citing articles were: mechanism, in vivo, protein/s, purification, binding, in vitro, E. coli, expression.

Table 6

Top-Cited PDB Structures in Engineering.

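| Structure | PDB ID | Journal | Structure Primary Citation | Times Cited |
| --- | --- | --- | --- | --- |
| Serum albumin | 1uor* | Nature | () | 84 |
| Collagen | 3hqv/3hr2 | Proceedings of the National Academy of Sciences USA | () | 68 |
| Potassium channel | 1bl8* | Science | () | 51 |
| Fibronectin | 1fnf | Cell | () | 51 |
| Photosystem II | 1s5l | Science | () | 44 |
| Lipase/inhibitor | 5tgl | Nature | () | 43 |
| Laccase | 1gyc | Journal of Biological Chemistry | () | 43 |
| Osteocalcin | 1q8h | Nature | () | 42 |
| Collagen | 1cag | Science | () | 40 |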

* Also appears in top cited overall list.

Several of these structures are related to bioengineering projects. The citations for serum albumin (1uor) include studies about the interaction with a wide variety of dyes, nanoparticles and other engineered molecules. Structures of collagen (3hqv, 3hr2, 1cag), fibronectin (1fnf), and osteocalcin (1q8h) played key roles in the understanding of cell adhesion, connective tissues and bone. The potassium channel structure (1bl8) is cited by studies looking at nanopores and biosensors. The lipase (5tgl) and laccase (1gyc) structures were cited in studies of engineered and immobilized versions of the enzymes.

Mathematics

Articles in this category cited 7,306 unique structures. Four of the top cited structures (1bl8, 1f88, 1aoi, 1brd) appear in the overall top cited list (Table 7). The most frequently used keywords in these citing articles were: molecular dynamics, protein/s, binding, identification, recognition, sequence, prediction, database, crystal structure.

Table 7

Top-Cited PDB Structures in Mathematics.

| Structure | PDB ID | Journal | Structure Primary Citation | Times Cited |
| --- | --- | --- | --- | --- |
| Potassium channel | 1bl8* | Science | () | 29 |
| Designed protein | 1qys | Science | () | 24 |
| Rhodopsin | 1f88* | Science | () | 22 |
| Nucleosome | 1aoi* | Nature | () | 20 |
| Designed protein | 1fsv/1fsd | Science | () | 20 |
| Ribosomal subunit | 1ffk | Science | () | 13 |
| Bacteriorhodopsin | 1brd* | Journal of Molecular Biology | () | 11 |
| Photosystem II | 2axt | Nature | () | 11 |
| B-DNA dodecamer | 1bna | Proceedings of the National Academy of Sciences USA | () | 10 |

* Also appears in top cited overall list.

Many of these papers are involved in modeling and analysis of protein and nucleic acid structures, with “Mathematical Computational Biology” being the major mathematics-related category. For example, the potassium channel (1bl8) citations include many computational studies exploring the dynamics of channel gating and permeation, as well as methods for predicting structure and function of other channels based on this structure, and the rhodopsin structure (1f88) citations include several methods that model the structure of GPCRs. The nucleosome (1aoi) citations include studies about nucleosome positioning and modeling of DNA bending or higher-order chromatin structure. The designed protein structures (1qys, 1fsv, 1fsd) are cited by methods papers that explore prediction of protein folding and design, and the ribosome structure (1ffk) is cited by methods exploring RNA structure and interaction of RNA and protein. The first atomic structure of a B-DNA helix (1bna) is cited in modeling studies of DNA conformation and interaction.

Conclusions

This analysis has shown a large impact of the PDB archive within the discipline of molecular biology and in many related disciplines. Of course, this analysis uses only one metric for assessing impact: the record of citations. Additional information could be obtained through analysis of instances of PDB structure IDs in publications, or linkage of specific PDB entries in digital resources. We are also interested in assessing the impact of the PDB archive in education and public understanding, which may potentially be approached through analysis of usage and citation of entries in textbooks and popular publications. That effort may be more difficult, however, given that the citation practices in those publications are not as tightly codified as in professional scientific publications.

The PDB archive was originally established to serve the structural biology community. The extensive usage of PDB structures across a variety of disciplines demonstrates the importance of structural studies and how data archives support interdisciplinary research.

Additional Files

The additional files for this article can be found as follows:

Supplementary File 1

The “CitationData.json” file contains citation information for the 41 unique PDB IDs (or pairs of PDB IDs) mentioned in the paper. For each PDB ID (or pair of PDB IDs), the record contains the primary citation (under the “primary_citation” key) and all citations that cite the primary citation (under the “citing_citations” key). Each citation record contains the following fields: (1) citation authors (“Authors” key); (2) citation title (“Title” key); (3) journal information (“Journal” key); (4) volume number (“Volume” key); (5) issue number (“Issue” key); (6) page number (“Pages” key); (7) year of publication (“Year” key); (8) subject categories (“Subject” key). DOI: https://doi.org/10.5334/dsj-2020-025.s1
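As a reading aid, the record layout described above might be handled as in the following Python sketch. Only the key names come from the description. The top-level shape (a mapping from PDB ID to record), the value types, the placeholder ID `XXXX`, and the helper names `missing_keys` and `summarize` are assumptions, and every value in the sample is a made-up placeholder rather than data from the file.

```python
import json

# Keys listed in the description of each citation record.
CITATION_KEYS = {"Authors", "Title", "Journal", "Volume", "Issue", "Pages", "Year", "Subject"}


def missing_keys(citation):
    """Return the documented keys that are absent from one citation record."""
    return sorted(CITATION_KEYS - set(citation))


def summarize(data):
    """Return {pdb_id: number of citing citations} for already-parsed data.

    Assumes the top level maps each PDB ID to a record holding
    "primary_citation" and "citing_citations"; the description does not
    state the top-level shape explicitly.
    """
    return {pdb_id: len(record["citing_citations"]) for pdb_id, record in data.items()}


# Made-up miniature input in the assumed layout; all values are placeholders.
sample_text = """
{
  "XXXX": {
    "primary_citation": {"Authors": "A. Author", "Title": "Placeholder", "Journal": "J",
                         "Volume": "1", "Issue": "1", "Pages": "1-2", "Year": "2000",
                         "Subject": ["Placeholder Category"]},
    "citing_citations": [
      {"Authors": "B. Author", "Title": "Placeholder", "Journal": "J", "Year": "2001"}
    ]
  }
}
"""
sample = json.loads(sample_text)
summary = summarize(sample)
gaps = missing_keys(sample["XXXX"]["citing_citations"][0])
```

A check such as `missing_keys` is useful before any counting, because records with no “Subject” field cannot be assigned to a category.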

Supplementary File 2

The “AnalyseTop10CitationData.py” code reads the “CitationData.json” file and writes “Top10Result.txt”, which reports the top 10 cited PDB entries overall and for six non-biological subject categories (“Chemistry”, “Engineering”, “Mathematics”, “Materials Science”, “Physics”, “Computer Science”). It is run with the following command: python AnalyseTop10CitationData.py CitationData.json. DOI: https://doi.org/10.5334/dsj-2020-025.s2
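The ranking step that such a script performs can be sketched as follows. This is not the contents of “AnalyseTop10CitationData.py”: the function name `top_cited` and the tie-breaking rule are assumptions, and the input is simply three of the counts reported in Table 7.

```python
import heapq


def top_cited(counts, n=10):
    """Return the n (pdb_id, times_cited) pairs with the highest counts."""
    # Order by count descending, then by ID, so ties are ranked reproducibly.
    # The tie-breaking rule is an assumption; the article does not state one.
    return heapq.nsmallest(n, counts.items(), key=lambda item: (-item[1], item[0]))


# Counts for three entries as reported in Table 7 (Mathematics).
mathematics = {"1aoi": 20, "1bl8": 29, "1f88": 22}
ranking = top_cited(mathematics, n=2)
```

Applying the same function to the per-category counts for each of the six categories, and to the overall counts, would yield lists of the kind shown in Tables 1 through 7.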